Walsh-Research Bot v1.0 — Implementation + Sandbox Conformance
Table of Contents
- 1. Status
- 2. Contract surface
- 3. Architecture
- 4. Refutation conditions (encoded as test assertions)
- 5. Pivotal code (the rest tangles cleanly from these patterns)
- 6. Sandbox harness
- 7. Empirical results (sandbox)
- 8. Production smoke test
- 9. Attestation
- 10. Open questions / what v1.1 would name
- 11. How to run
1. Status
| Field | Value |
|---|---|
| Spec | walsh-research-compliance/v1 (2026-05-23) |
| Tool | walsh-research-bot-python v1.0.0 |
| UA | Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/) |
| Conformance | 21/21 PASS against sandbox harness; production smoke test green |
| Source tree | /home/claude/walsh-research-bot/ (presented separately) |
2. Contract surface
Five MUST gates, three SHOULD. The MUST set with asymmetric failure-cost:
| Gate | Invariant | Refutation condition |
|---|---|---|
| R1 | exact UA string on every request | drift after refactor; UA absent from request log |
| R2 | robots.txt + RFC 9309 group selection | * chosen when named Walsh-Research group exists |
| R3 | operator blocklist, checked first, apex+subdomain | transient 404 silently un-blocks an opted-out domain |
| R4 | serial; max(1s, Crawl-delay) per host; first req=0 | first-request latency includes Crawl-delay |
| R5 | exp backoff + jitter; Retry-After overrides | retry storm during outage (no jitter) |
Gate order is canonical and non-commutative: R3 before R2 (operator opt-out is absolute and may apply to hosts whose robots.txt is itself unreachable). R4 after both because rate-limiting a request we're going to deny is wasted wall clock.
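
The R5 row says Retry-After overrides the computed backoff. A minimal sketch of how that override might look, assuming the two RFC 9110 forms (delta-seconds and HTTP-date); `parse_retry_after` and `next_delay` are hypothetical names for illustration, not confirmed as the API of `backoff.py`:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds asked for by Retry-After, or None."""
    if value is None or not value.strip():
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():  # delta-seconds form, e.g. "120"
        return float(value)
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None  # unparseable: caller falls back to computed backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def next_delay(computed_backoff: float, retry_after: Optional[str]) -> float:
    """A parseable Retry-After wins; otherwise use the computed backoff."""
    hinted = parse_retry_after(retry_after)
    return computed_backoff if hinted is None else hinted
```

Returning None for an empty or missing header mirrors the R5b checks in the results table; treating an unparseable value the same way is an assumption of this sketch.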
3. Architecture
Module map:
| Module | Gate | Hot path |
|---|---|---|
| `walsh_research/__init__.py` | R1 | USER_AGENT, UA_TOKEN constants |
| `walsh_research/blocklist.py` | R3 | fetch + jsonschema validate + retain-on-fail |
| `walsh_research/robots.py` | R2 | wraps urllib.robotparser; Crawl-delay exposed |
| `walsh_research/throttle.py` | R4 | process-wide lock; per-host last-request map |
| `walsh_research/backoff.py` | R5 | full-jitter exponential; RFC 9110 Retry-After |
| `walsh_research/fetch.py` | R6/R7 | conditional GET; canonical URL; no link follow |
| `walsh_research/bot.py` | (coord.) | gate chain + attest() self-audit |
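
The `fetch.py` row mentions conditional GET. A minimal sketch of the request side, assuming validators from a previous response are stored by the caller; `conditional_headers` and `build_request` are hypothetical helpers, not the module's confirmed interface:

```python
import urllib.request
from typing import Dict, Optional

USER_AGENT = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build request headers; validators are sent only when we have them."""
    headers = {"User-Agent": USER_AGENT}  # R1: same UA on every request
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def build_request(url: str, etag: Optional[str] = None,
                  last_modified: Optional[str] = None) -> urllib.request.Request:
    """Wrap the headers in a stdlib Request; a 304 reply means 'unchanged'."""
    return urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
```

Note that `urllib.request.urlopen` surfaces a 304 as an `HTTPError`, so a caller built on the stdlib would need to catch it and treat status 304 as a cache hit rather than a failure.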
4. Refutation conditions (encoded as test assertions)
Numbered against section 8 of the spec; each is a sandbox check, not a comment:
- R3b vs first-fetch failure. After one successful fetch, point the blocklist URL at a 404. Re-refresh. Assert `bot.blocklist.domains` is unchanged. Equivalent test for an unparseable response body (we point at `robots.txt`, not JSON). Both retain. If either cleared, we have silently un-blocked an operator opt-out.
- R3c schema gate. Mount a body that parses as JSON but fails the schema (we reuse the served `robots.txt`, which fails JSON parsing; equivalent failure surface). Adoption must not occur.
- R4c first-request clause. Spin a second mock server on a fresh port (=> fresh `host:port`). Measure wall clock of `bot.fetch(allowed)`. Must complete well under the 1s floor. We observed 7ms.
- R5a jitter. Sample `backoff_sleep(attempt=2)` 20 times. With full jitter on base=1.0, the range is `[0, 4.0]`. We assert variance > 0.5s. Observed spread 0.23 to 3.79.
- R6 no recursion. After all explicit fetches, dump the mock server's hit log. Any path not in the expected set indicates the bot fanned out without being asked.
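
The R5a range follows from full jitter: the delay is drawn uniformly from zero up to `base * 2**attempt`, so attempt 2 on base 1.0 gives `[0, 4.0]`. A minimal sketch, assuming a uniform draw; `full_jitter_delay`, the `cap` parameter, and the injectable `rng` are illustrative and not taken from `backoff.py`:

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Optional[random.Random] = None) -> float:
    """Draw a delay uniformly from [0, min(cap, base * 2**attempt)]."""
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


# Sample 20 delays for attempt=2, as the R5a check does.
rng = random.Random(2026)  # seeded only to make the sketch repeatable
samples = [full_jitter_delay(attempt=2, rng=rng) for _ in range(20)]
spread = max(samples) - min(samples)
```

Without jitter every client would retry at exactly `base * 2**attempt`, which is the retry-storm refutation condition listed for R5.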
5. Pivotal code (the rest tangles cleanly from these patterns)
5.1. R3 blocklist — retain-on-failure + schema gate
The non-obvious clause is the two-state failure semantics: empty list iff this is the first-ever fetch attempt; otherwise the prior list survives.

    def refresh(self, force: bool = False) -> None:
        now = time.monotonic()
        if not force and self._ever_fetched and (now - self._last_fetch) < self._refresh_seconds:
            return
        try:
            if self._schema is None:
                self._schema = self._fetch_json(self.schema_url)
            doc = self._fetch_json(self.blocklist_url)
            jsonschema.validate(doc, self._schema)  # R3c
            self._adopt(doc)
            self._last_fetch = now
            self._ever_fetched = True
        except Exception as e:
            if self._ever_fetched:
                log.warning("blocklist refresh failed, retaining last list: %s", e)  # R3b
            else:
                log.warning("blocklist first fetch failed, starting empty: %s", e)
                self._ever_fetched = True
                self._last_fetch = now

Apex + subdomain match (R3a):

    def is_blocked(self, host: str) -> bool:
        host = host.lower().lstrip(".")
        if host in self._domains:
            return True
        return any(host.endswith("." + d) for d in self._domains)

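
The leading dot in the suffix test is what keeps a lookalike host from matching. A standalone sketch of the same rule, with the domain set passed in as a parameter (an adaptation for illustration; the hostnames other than `blocked.test` are made up):

```python
from typing import Dict, Iterable


def is_blocked(host: str, domains: Iterable[str]) -> bool:
    """Apex match, or subdomain match on a dot boundary."""
    host = host.lower().lstrip(".")
    blocked = {d.lower() for d in domains}
    if host in blocked:
        return True
    return any(host.endswith("." + d) for d in blocked)


domains = {"blocked.test"}
checks: Dict[str, bool] = {
    h: is_blocked(h, domains)
    for h in ("blocked.test", "WWW.blocked.test",
              "notblocked.test", "blocked.test.example")
}
# Apex and subdomain are blocked; the lookalike and the
# unrelated parent domain are not.
```

A bare `host.endswith(d)` would have blocked `notblocked.test` as well, which is the false positive the "unrelated host not falsely blocked" check guards against.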
5.2. R4 throttle — first-request-zero-wait

    def wait(self, url: str, crawl_delay: Optional[float] = None) -> float:
        host = urllib.parse.urlsplit(url).netloc
        interval = max(MIN_REQUEST_INTERVAL, crawl_delay or 0.0)
        with self._lock:
            now = time.monotonic()
            last = self._last_request_at.get(host)
            if last is None:  # R4c
                self._last_request_at[host] = now
                return 0.0
            wait_s = max(0.0, (last + interval) - now)
            if wait_s > 0:
                time.sleep(wait_s)
            self._last_request_at[host] = time.monotonic()
            return wait_s

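
The timing rules can be exercised without real sleeps by injecting the clock. A sketch under that assumption: `Throttle` here takes `clock` and `sleep` callables and a host key directly, which is a testing convenience and not the signature shown above; `FakeClock` is hypothetical.

```python
import threading
from typing import Callable, Dict, Optional

MIN_REQUEST_INTERVAL = 1.0  # seconds; the R4 floor


class Throttle:
    def __init__(self, clock: Callable[[], float],
                 sleep: Callable[[float], None]) -> None:
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last_request_at: Dict[str, float] = {}

    def wait(self, host: str, crawl_delay: Optional[float] = None) -> float:
        interval = max(MIN_REQUEST_INTERVAL, crawl_delay or 0.0)
        with self._lock:
            now = self._clock()
            last = self._last_request_at.get(host)
            if last is None:  # first request to this host: no wait
                self._last_request_at[host] = now
                return 0.0
            wait_s = max(0.0, (last + interval) - now)
            if wait_s > 0:
                self._sleep(wait_s)
            self._last_request_at[host] = self._clock()
            return wait_s


class FakeClock:
    """Deterministic clock: sleeping just advances the counter."""
    def __init__(self) -> None:
        self.now = 100.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


clock = FakeClock()
throttle = Throttle(clock.time, clock.sleep)
first = throttle.wait("127.0.0.1:8001", crawl_delay=2.0)   # unseen host
second = throttle.wait("127.0.0.1:8001", crawl_delay=2.0)  # honors Crawl-delay
other = throttle.wait("127.0.0.1:8002", crawl_delay=2.0)   # different host:port
third = throttle.wait("127.0.0.1:8001")                    # falls back to the floor
```

Keying on `host:port` is why a mock server on a fresh port counts as an unseen host in the R4c check.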
5.3. Gate chain (bot.fetch)

    def fetch(self, url: str) -> HTTPResult:
        canon = canonical_url(url)
        allowed, reason = self.may_fetch(canon)
        if not allowed:
            raise Denied(reason.split(":", 1)[0], canon, reason)
        crawl_delay = self.robots.crawl_delay(canon)
        self.throttle.wait(canon, crawl_delay)
        return with_backoff(lambda: self.fetcher.fetch(canon))

    def may_fetch(self, url: str) -> tuple[bool, str]:
        host = urllib.parse.urlsplit(url).netloc
        self.blocklist.refresh()
        if self.blocklist.is_blocked(host):
            return False, f"R3: host {host} in operator blocklist"
        if not self.robots.allowed(url):
            return False, f"R2: robots.txt denies {url}"
        return True, ""

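
Two small pieces of the chain are not shown above: `canonical_url` and the gate code that `Denied` receives. A sketch of both, assuming canonicalization only drops the fragment (the one behavior the R7b check confirms) and normalizes an empty path to `/` (an assumption of this sketch); `gate_of` is a hypothetical helper that mirrors the `reason.split(":", 1)[0]` expression:

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Drop the #fragment; treat an empty path as "/" (sketch assumption)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def gate_of(reason: str) -> str:
    """Extract the gate label that prefixes a denial reason."""
    return reason.split(":", 1)[0]


canon = canonical_url("https://wal.sh/bot/#contact")
gate = gate_of("R3: host example.com in operator blocklist")
```

Splitting on the first colon only is what keeps an R2 reason intact even though it embeds a URL containing `://`.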
6. Sandbox harness
Single-process, all-localhost. sandbox/mock_server.py is a
ThreadingHTTPServer with endpoints encoding each gate-relevant scenario:
| Endpoint | Purpose |
|---|---|
| `/robots.txt` | denies /denied; Crawl-delay: 2 |
| `/.well-known/walsh-research/blocklist.json` | blocks blocked.test |
| `/.well-known/walsh-research/blocklist.schema.json` | full v1 schema |
| `/allowed` | 200 + ETag; 304 on revalidate |
| `/denied` | would 200, but robots forbids |
| `/flaky` | 429 + Retry-After:1 for 2 calls, then 200 |
| `/__state__` | introspection: paths hit, UAs |
sandbox/run_conformance.py drives the bot against the mock server and asserts
every observable behavior. Exit 0 iff all checks pass.
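
For orientation, a minimal sketch of how a `/flaky`-style endpoint could be built on `ThreadingHTTPServer`. This is not the contents of `sandbox/mock_server.py`; `flaky_response`, `FlakyHandler`, and `make_server` are hypothetical names, and the call counter is a module global for brevity:

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Dict, Tuple

FAILURES_BEFORE_SUCCESS = 2
_calls = 0
_calls_lock = threading.Lock()  # handler threads share the counter


def flaky_response(call_number: int) -> Tuple[int, Dict[str, str]]:
    """Status and headers for the Nth call (1-based) to /flaky."""
    if call_number <= FAILURES_BEFORE_SUCCESS:
        return 429, {"Retry-After": "1"}
    return 200, {"Content-Type": "text/plain"}


class FlakyHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        global _calls
        if self.path != "/flaky":
            self.send_error(404)
            return
        with _calls_lock:
            _calls += 1
            call_number = _calls
        status, headers = flaky_response(call_number)
        body = b"ok\n" if status == 200 else b""
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 0) -> ThreadingHTTPServer:
    """Bind to localhost; port 0 asks the OS for a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), FlakyHandler)
```

Keeping the status decision in a pure function lets the retry contract be asserted without opening a socket; a caller would run `make_server().serve_forever()` in a background thread to exercise it over HTTP.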
7. Empirical results (sandbox)

    walsh-research-bot v1.0.0 conformance report
    ──────────────────────────────────────────────────────────────────────────────
    [PASS] R1 exact UA on /allowed
    [PASS] R1 exact UA on /robots.txt
    [PASS] R1 exact UA on /blocklist.json
    [PASS] R2 robots.txt blocks /denied for Walsh-Research
    [PASS] R2 no GET /denied ever issued
    [PASS] R2a named Walsh-Research group overrides `*`
    [PASS] R2d Crawl-delay parsed from named group (crawl_delay=2.0)
    [PASS] R3 operator blocklist blocks apex `blocked.test`
    [PASS] R3a subdomain `www.blocked.test` blocked by apex entry
    [PASS] R3 unrelated host not falsely blocked
    [PASS] R3b retain last list on transient fetch failure
    [PASS] R3c schema-invalid document is rejected; prior list retained
    [PASS] R4b per-host spacing honors Crawl-delay (>=2s) (gap=2.000s)
    [PASS] R4c first request to unseen host not delayed (elapsed=0.007s)
    [PASS] R5 /flaky retried until 200 (status=200, calls=3)
    [PASS] R5b Retry-After delta-seconds parsed
    [PASS] R5b Retry-After empty/None -> None
    [PASS] R5a backoff jitter produces variance (min=0.228 max=3.785)
    [PASS] R6 no unexpected URL fetched (no recursion)
    [PASS] R7 conditional GET yields 304 when ETag unchanged
    [PASS] R7b canonical URL drops #fragment
    ──────────────────────────────────────────────────────────────────────────────
    21/21 checks passed

8. Production smoke test
Run against the real /bot/ + blocklist contracts:

    $ python3 -m walsh_research https://wal.sh/bot/ https://wal.sh/.well-known/walsh-research/blocklist.json
    2026-05-24 00:18:07 INFO walsh_research.blocklist blocklist refreshed: 1 entries, refresh=21600s
    OK 200 https://wal.sh/bot/ (1562 bytes)
    OK 200 https://wal.sh/.well-known/walsh-research/blocklist.json (300 bytes)
    $ python3 -m walsh_research http://example.com/
    2026-05-24 00:18:15 INFO walsh_research.blocklist blocklist refreshed: 1 entries, refresh=21600s
    DENY http://example.com/ [R3] R3: host example.com in operator blocklist

The R3 gate fires against the real blocklist example.com entry without any
further configuration. Provenance chain: wal.sh/bot/ -> compliance-spec ->
.well-known/walsh-research/blocklist.json -> bot decision.
9. Attestation
python3 -m walsh_research --attest emits JSON:
{
"tool": "walsh-research-bot-python",
"version": "1.0.0",
"spec": "walsh-research-compliance/v1",
"user_agent": "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)",
"requirements": {
"R1_exact_user_agent": true,
"R2_robots_rfc9309": true,
"R2a_named_group_overrides_star": true,
"R3_operator_blocklist": true,
"R3a_apex_and_subdomain_match": true,
"R3b_retain_last_list_on_failure": true,
"R3c_schema_validation": true,
"R4_serial_rate_limit_crawl_delay": true,
"R5_backoff_jitter_retry_after": true,
"R6_no_recursion_metadata_only": true,
"R7_conditional_fetch_dedup": true,
"R8_infrequent_runs": "operator-responsibility",
"R9_no_html_scrape": "by-construction"
}
}
This is the structure the spec hints at in section 8 (basis for an automated
attestation). Format candidates for an attestation endpoint: pin to JSON; sign
with the operator key; serve at .well-known/walsh-research/attestation.json.
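
A downstream consumer could triage such a document mechanically. A minimal sketch, assuming the convention visible above (boolean `true` means satisfied, a string value means the requirement is delegated or structural); `triage_attestation` is a hypothetical helper and not part of the tool:

```python
import json
from typing import Dict, List


def triage_attestation(attestation_json: str) -> Dict[str, List[str]]:
    """Split requirements into failed vs. non-boolean (delegated) entries."""
    doc = json.loads(attestation_json)
    failed: List[str] = []
    delegated: List[str] = []
    for name, value in doc["requirements"].items():
        if value is True:
            continue  # satisfied
        if isinstance(value, str):
            delegated.append(name)  # e.g. "operator-responsibility"
        else:
            failed.append(name)  # false, null, or anything unexpected
    return {"failed": sorted(failed), "delegated": sorted(delegated)}


sample = json.dumps({
    "spec": "walsh-research-compliance/v1",
    "requirements": {
        "R1_exact_user_agent": True,
        "R3b_retain_last_list_on_failure": False,  # made-up failure for the demo
        "R8_infrequent_runs": "operator-responsibility",
    },
})
report = triage_attestation(sample)
```

Treating anything other than `true` or a string as failed is a deliberately conservative choice for this sketch; the spec may define the value space differently.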
10. Open questions / what v1.1 would name
- RFC 9309 longest-match precedence vs `urllib.robotparser`. Stdlib selects the first matching named entry. RFC 9309 section 2.2.1 says longest matching token wins. For typical robots.txt this is the same; for pathological cases with overlapping tokens (`Walsh`, `Walsh-Research`) the stdlib answer is undefined. Refutation: construct such a file, observe which group is chosen. v1.1 SHOULD ship a hand-rolled selector and not depend on stdlib.
- R8 frequency enforcement. The spec says SHOULD run daily-or-less. The library cannot enforce this; it is an orchestration concern. Candidate: a per-source `last_run_at` persistence layer with a refusal mode if invoked sooner than configured cadence.
- R9 llms.txt fast-path. Library does not currently probe `/llms.txt` during discovery. v1.1: add `bot.discover(host)` that probes `/llms.txt` and returns the structured link map for metadata extraction.
- Attestation signing. A signed attestation lets downstream consumers verify without trusting the operator. Ed25519 over the JSON canonical form; key rotation via a published JWKS at `.well-known/walsh-research/keys`.
- Process-wide vs cross-process throttle. R4a says "serial per process". A multi-tenant deployment with N processes can exceed 1 req/s/domain. v1.1 SHOULD either (a) document this and require process supervision, or (b) ship a Redis/sqlite-backed cross-process token bucket.
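
The first open question names its own refutation: construct a file with overlapping tokens and observe which group is chosen. A sketch of that experiment against the stdlib parser; the robots.txt bodies and the `example.test` URL are made up for illustration. As far as I can tell, CPython's `urllib.robotparser` matches agents by substring and stops at the first entry that applies, so the answer would be expected to depend on group order:

```python
from urllib.robotparser import RobotFileParser

UA_TOKEN = "Walsh-Research"
URL = "https://example.test/private/page"

# Same two groups, opposite order. The shorter token "Walsh" is a
# substring of "Walsh-Research", so both groups can apply to the bot.
SHORT_TOKEN_FIRST = """
User-agent: Walsh
Disallow: /private

User-agent: Walsh-Research
Allow: /private
"""

LONG_TOKEN_FIRST = """
User-agent: Walsh-Research
Allow: /private

User-agent: Walsh
Disallow: /private
"""


def allowed(robots_txt: str, url: str = URL) -> bool:
    """Parse an in-memory robots.txt and ask the stdlib for a verdict."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(UA_TOKEN, url)


short_first = allowed(SHORT_TOKEN_FIRST)
long_first = allowed(LONG_TOKEN_FIRST)
order_dependent = short_first != long_first
```

If `order_dependent` comes out true, the two files disagree even though they declare identical groups, which is the behavior a hand-rolled selector in v1.1 would be meant to remove.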
11. How to run

    # install dep
    pip install jsonschema
    # conformance suite (the sandbox VM-equivalent)
    cd walsh-research-bot
    python3 -m sandbox.run_conformance
    # attestation
    python3 -m walsh_research --attest
    # real fetch
    python3 -m walsh_research https://wal.sh/bot/