Walsh-Research Bot v1.0 — Implementation + Sandbox Conformance

1. Status

Field        Value
Spec         walsh-research-compliance/v1 (2026-05-23)
Tool         walsh-research-bot-python v1.0.0
UA           Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)
Conformance  21/21 PASS against sandbox harness; production smoke test green
Source tree  /home/claude/walsh-research-bot/ (presented separately)

2. Contract surface

Five MUST gates, three SHOULD. The MUST set, whose failure costs are asymmetric:

Gate  Invariant                                           Refutation condition
R1    exact UA string on every request                    drift after refactor; UA absent from request log
R2    robots.txt + RFC 9309 group selection               * chosen when named Walsh-Research group exists
R3    operator blocklist, checked first, apex+subdomain   transient 404 silently un-blocks an opted-out domain
R4    serial; max(1s, Crawl-delay) per host; first req=0  first-request latency includes Crawl-delay
R5    exp backoff + jitter; Retry-After overrides         retry storm during outage (no jitter)

Gate order is canonical and non-commutative. R3 runs before R2 because an operator opt-out is absolute and may apply to hosts whose robots.txt is itself unreachable. R4 runs after both because rate-limiting a request that is about to be denied wastes wall-clock time.

3. Architecture

[Figure: gate chain diagram (diagram-gate-chain.png)]

Module map:

Module                        Gate      Hot path
walsh_research/__init__.py    R1        USER_AGENT, UA_TOKEN constants
walsh_research/blocklist.py   R3        fetch + jsonschema validate + retain-on-fail
walsh_research/robots.py      R2        wraps urllib.robotparser; Crawl-delay exposed
walsh_research/throttle.py    R4        process-wide lock; per-host last-request map
walsh_research/backoff.py     R5        full-jitter exponential; RFC 9110 Retry-After
walsh_research/fetch.py       R6/R7     conditional GET; canonical URL; no link follow
walsh_research/bot.py         (coord.)  gate chain + attest() self-audit

4. Refutation conditions (encoded as test assertions)

Numbered against section 8 of the spec; each is a sandbox check, not a comment:

  • R3b vs first-fetch failure. After one successful fetch, point the blocklist URL at a 404 and refresh again. Assert bot.blocklist.domains is unchanged. Repeat with an unparseable response body (we point the URL at robots.txt, which is not JSON). Both cases must retain the list. If either clears it, we have silently un-blocked an operator opt-out.
  • R3c schema gate. The intended case is a body that parses as JSON but fails the schema. The current check reuses the served robots.txt, which already fails at JSON parsing; we treat that as an equivalent failure surface. Adoption must not occur.
  • R4c first-request clause. Spin a second mock server on a fresh port (=> fresh host:port). Measure wall clock of bot.fetch(allowed). Must complete well under the 1s floor. We observed 7ms.
  • R5a jitter. Sample backoff_sleep(attempt=2) 20 times. With full jitter on base=1.0, the range is [0, 4.0]. We assert variance > 0.5s. Observed spread 0.23 to 3.79.
  • R6 no recursion. After all explicit fetches, dump the mock server's hit log. Any path not in the expected set indicates the bot fanned out without being asked.
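
The R5a check can be reproduced in isolation. The following is a minimal sketch of full-jitter backoff, assuming the delay is drawn uniformly from [0, base * 2**attempt]; the names backoff_delay, cap, rng and jitter_spread are hypothetical, since the source only shows the call backoff_sleep(attempt=2).

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform draw from [0, min(cap, base * 2**attempt)].

    Hypothetical helper; `cap` and `rng` are assumptions, not source parameters.
    """
    source = rng if rng is not None else random
    ceiling = min(cap, base * (2 ** attempt))
    return source.uniform(0.0, ceiling)


def jitter_spread(samples: int = 20, attempt: int = 2,
                  seed: Optional[int] = None) -> tuple[float, float]:
    """Sample the delay repeatedly and return (min, max), as the R5a check does."""
    rng = random.Random(seed)
    delays = [backoff_delay(attempt, rng=rng) for _ in range(samples)]
    return min(delays), max(delays)


if __name__ == "__main__":
    low, high = jitter_spread()
    # With base=1.0 and attempt=2 every sample lies in [0, 4.0].
    print(f"min={low:.3f} max={high:.3f} spread={high - low:.3f}")
```

A constant (jitter-free) delay would make min and max coincide, which is the retry-storm condition the gate is meant to rule out.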

5. Pivotal code (the rest tangles cleanly from these patterns)

5.1. R3 blocklist — retain-on-failure + schema gate

The non-obvious clause is the two-state failure semantics: empty list iff this is the first-ever fetch attempt; otherwise the prior list survives.

def refresh(self, force: bool = False) -> None:
    now = time.monotonic()
    # Skip while the cached list is still fresh, unless the caller forces a refresh.
    if not force and self._ever_fetched and (now - self._last_fetch) < self._refresh_seconds:
        return
    try:
        if self._schema is None:
            self._schema = self._fetch_json(self.schema_url)  # fetched once, then cached
        doc = self._fetch_json(self.blocklist_url)
        jsonschema.validate(doc, self._schema)   # R3c: raises before _adopt() runs
        self._adopt(doc)
        self._last_fetch = now
        self._ever_fetched = True
    except Exception as e:
        # Network, JSON-parse and schema failures all land here; nothing below
        # modifies the current list.
        if self._ever_fetched:
            log.warning("blocklist refresh failed, retaining last list: %s", e)  # R3b
        else:
            log.warning("blocklist first fetch failed, starting empty: %s", e)
            # Record the attempt so the next try waits one refresh interval.
            self._ever_fetched = True
            self._last_fetch = now

Apex + subdomain match (R3a):

def is_blocked(self, host: str) -> bool:
    host = host.lower().lstrip(".")
    if host in self._domains:
        return True
    return any(host.endswith("." + d) for d in self._domains)
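
The property that matters in R3a is that matching happens on label boundaries, so an entry for blocked.test covers www.blocked.test but not notblocked.test. As an illustration, here is an equivalent formulation that walks the host's parent domains instead of scanning every entry. It is a sketch, not the source's method: the standalone signature and the trailing-dot handling are assumptions.

```python
from typing import FrozenSet


def is_blocked(host: str, domains: FrozenSet[str]) -> bool:
    """True if host, or any parent domain of host, is in the blocklist.

    Hypothetical standalone variant: one set lookup per label instead of
    one endswith() per blocklist entry.
    """
    # Assumption: a trailing dot (fully qualified form) is treated as absent.
    labels = host.lower().strip(".").split(".")
    # Candidates for "www.blocked.test": the host itself, "blocked.test", "test".
    return any(".".join(labels[i:]) in domains for i in range(len(labels)))


if __name__ == "__main__":
    blocklist = frozenset({"blocked.test"})
    for candidate in ("blocked.test", "www.blocked.test",
                      "notblocked.test", "blocked.test.other.example"):
        print(candidate, is_blocked(candidate, blocklist))
```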

5.2. R4 throttle — first-request-zero-wait

def wait(self, url: str, crawl_delay: Optional[float] = None) -> float:
    host = urllib.parse.urlsplit(url).netloc
    interval = max(MIN_REQUEST_INTERVAL, crawl_delay or 0.0)
    with self._lock:
        now = time.monotonic()
        last = self._last_request_at.get(host)
        if last is None:                          # R4c
            self._last_request_at[host] = now
            return 0.0
        wait_s = max(0.0, (last + interval) - now)
        if wait_s > 0:
            time.sleep(wait_s)
        self._last_request_at[host] = time.monotonic()
        return wait_s
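
The first-request clause is easiest to check deterministically with an injected clock rather than wall-clock timing. The sketch below restates the same logic as a standalone class; the clock and sleep constructor parameters and the FakeClock helper are assumptions for illustration and do not appear in the source tree.

```python
import threading
import time
from typing import Callable, Dict, Optional
from urllib.parse import urlsplit

MIN_REQUEST_INTERVAL = 1.0  # the 1s floor from R4


class Throttle:
    """Per-host spacing with a zero-wait first request (hypothetical restatement)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._last_request_at: Dict[str, float] = {}

    def wait(self, url: str, crawl_delay: Optional[float] = None) -> float:
        host = urlsplit(url).netloc  # host:port, so a fresh port is a fresh key
        interval = max(MIN_REQUEST_INTERVAL, crawl_delay or 0.0)
        with self._lock:  # sleeping under the lock keeps requests serial
            now = self._clock()
            last = self._last_request_at.get(host)
            if last is None:  # R4c: never seen, no delay
                self._last_request_at[host] = now
                return 0.0
            wait_s = max(0.0, (last + interval) - now)
            if wait_s > 0:
                self._sleep(wait_s)
            self._last_request_at[host] = self._clock()
            return wait_s


class FakeClock:
    """Test double: sleep() advances time instead of blocking."""

    def __init__(self, start: float = 100.0) -> None:
        self.now = start

    def monotonic(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


if __name__ == "__main__":
    clock = FakeClock()
    throttle = Throttle(clock.monotonic, clock.sleep)
    print(throttle.wait("http://127.0.0.1:8001/allowed", 2.0))  # first request
    print(throttle.wait("http://127.0.0.1:8001/allowed", 2.0))  # same host
    print(throttle.wait("http://127.0.0.1:8002/allowed", 2.0))  # fresh port
```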

5.3. Gate chain (bot.fetch)

def fetch(self, url: str) -> HTTPResult:
    canon = canonical_url(url)
    allowed, reason = self.may_fetch(canon)
    if not allowed:
        raise Denied(reason.split(":", 1)[0], canon, reason)
    crawl_delay = self.robots.crawl_delay(canon)
    self.throttle.wait(canon, crawl_delay)
    return with_backoff(lambda: self.fetcher.fetch(canon))

def may_fetch(self, url: str) -> tuple[bool, str]:
    # Use hostname, not netloc: an explicit port or userinfo must not bypass R3.
    host = urllib.parse.urlsplit(url).hostname or ""
    self.blocklist.refresh()
    if self.blocklist.is_blocked(host):
        return False, f"R3: host {host} in operator blocklist"
    if not self.robots.allowed(url):
        return False, f"R2: robots.txt denies {url}"
    return True, ""
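
The with_backoff wrapper called in fetch is not shown above. As a rough sketch of the R5 behavior it has to provide (Retry-After overrides the jittered delay), assuming a result object with status and headers attributes: the HTTPResult stand-in, the RETRYABLE set, max_attempts and the sleep parameter are all hypothetical.

```python
import email.utils
import random
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, Optional

RETRYABLE = {429, 503}  # assumption: which statuses trigger a retry


@dataclass
class HTTPResult:
    """Hypothetical stand-in for the source's result type."""
    status: int
    headers: Dict[str, str] = field(default_factory=dict)


def parse_retry_after(value: Optional[str],
                      now: Optional[datetime] = None) -> Optional[float]:
    """Parse RFC 9110 Retry-After (delta-seconds or HTTP-date) into seconds."""
    if value is None or not value.strip():
        return None
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)
    try:
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def with_backoff(call: Callable[[], HTTPResult], max_attempts: int = 5,
                 base: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> HTTPResult:
    """Retry `call` on retryable statuses; a server hint beats the jittered delay."""
    attempt = 0
    while True:
        result = call()
        attempt += 1
        if result.status not in RETRYABLE or attempt >= max_attempts:
            return result
        hinted = parse_retry_after(result.headers.get("Retry-After"))
        jittered = random.uniform(0.0, base * (2 ** attempt))  # full jitter
        sleep(hinted if hinted is not None else jittered)
```

Passing sleep as a parameter lets a test record the delays instead of waiting for them.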

6. Sandbox harness

Single-process, all-localhost. sandbox/mock_server.py is a ThreadingHTTPServer with endpoints encoding each gate-relevant scenario:

Endpoint                                           Purpose
/robots.txt                                        denies /denied; Crawl-delay: 2
/.well-known/walsh-research/blocklist.json         blocks blocked.test
/.well-known/walsh-research/blocklist.schema.json  full v1 schema
/allowed                                           200 + ETag; 304 on revalidate
/denied                                            would 200, but robots forbids
/flaky                                             429 + Retry-After:1 for 2 calls, then 200
/__state__                                         introspection: paths hit, UAs

sandbox/run_conformance.py drives the bot against the mock server and asserts every observable behavior. Exit 0 iff all checks pass.
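
To make the harness shape concrete, here is a minimal sketch of how a /flaky endpoint of this kind could be served with ThreadingHTTPServer. It is not the contents of sandbox/mock_server.py; the handler name, the per-server counter and the start_server and get_status helpers are assumptions.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class FlakyHandler(BaseHTTPRequestHandler):
    """Serves /flaky: 429 + Retry-After: 1 for the first two calls, then 200."""

    def do_GET(self):
        if self.path != "/flaky":
            self.send_error(404)
            return
        with self.server.lock:  # handlers run on separate threads
            self.server.flaky_calls += 1
            call_no = self.server.flaky_calls
        if call_no <= 2:
            self.send_response(429)
            self.send_header("Retry-After", "1")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep test output quiet


def start_server() -> ThreadingHTTPServer:
    """Bind to an ephemeral localhost port and serve in a daemon thread."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), FlakyHandler)
    server.flaky_calls = 0  # per-server state, so each server starts fresh
    server.lock = threading.Lock()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Bypass any environment proxy settings for localhost traffic.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def get_status(url: str):
    """Return (status, Retry-After header or None) for one GET."""
    try:
        with _opener.open(url, timeout=5) as resp:
            return resp.status, resp.headers.get("Retry-After")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Retry-After")
```

Binding to port 0 gives every run a fresh host:port, which is also what a first-request check like R4c relies on.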

7. Empirical results (sandbox)

walsh-research-bot v1.0.0  conformance report
──────────────────────────────────────────────────────────────────────────────
  [PASS] R1     exact UA on /allowed
  [PASS] R1     exact UA on /robots.txt
  [PASS] R1     exact UA on /blocklist.json
  [PASS] R2     robots.txt blocks /denied for Walsh-Research
  [PASS] R2     no GET /denied ever issued
  [PASS] R2a    named Walsh-Research group overrides `*`
  [PASS] R2d    Crawl-delay parsed from named group              (crawl_delay=2.0)
  [PASS] R3     operator blocklist blocks apex `blocked.test`
  [PASS] R3a    subdomain `www.blocked.test` blocked by apex entry
  [PASS] R3     unrelated host not falsely blocked
  [PASS] R3b    retain last list on transient fetch failure
  [PASS] R3c    schema-invalid document is rejected; prior list retained
  [PASS] R4b    per-host spacing honors Crawl-delay (>=2s)       (gap=2.000s)
  [PASS] R4c    first request to unseen host not delayed         (elapsed=0.007s)
  [PASS] R5     /flaky retried until 200                         (status=200, calls=3)
  [PASS] R5b    Retry-After delta-seconds parsed
  [PASS] R5b    Retry-After empty/None -> None
  [PASS] R5a    backoff jitter produces variance                 (min=0.228 max=3.785)
  [PASS] R6     no unexpected URL fetched (no recursion)
  [PASS] R7     conditional GET yields 304 when ETag unchanged
  [PASS] R7b    canonical URL drops #fragment
──────────────────────────────────────────────────────────────────────────────
  21/21 checks passed

8. Production smoke test

Run against the real /bot/ + blocklist contracts:

$ python3 -m walsh_research https://wal.sh/bot/ https://wal.sh/.well-known/walsh-research/blocklist.json
2026-05-24 00:18:07 INFO walsh_research.blocklist  blocklist refreshed: 1 entries, refresh=21600s
OK  200  https://wal.sh/bot/                                              (1562 bytes)
OK  200  https://wal.sh/.well-known/walsh-research/blocklist.json         (300 bytes)

$ python3 -m walsh_research http://example.com/
2026-05-24 00:18:15 INFO walsh_research.blocklist  blocklist refreshed: 1 entries, refresh=21600s
DENY      http://example.com/  [R3] R3: host example.com in operator blocklist

The R3 gate fires on the example.com entry in the real blocklist with no further configuration. Provenance chain: wal.sh/bot/ -> compliance-spec -> .well-known/walsh-research/blocklist.json -> bot decision.

9. Attestation

python3 -m walsh_research --attest emits JSON:

{
  "tool": "walsh-research-bot-python",
  "version": "1.0.0",
  "spec": "walsh-research-compliance/v1",
  "user_agent": "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)",
  "requirements": {
    "R1_exact_user_agent": true,
    "R2_robots_rfc9309": true,
    "R2a_named_group_overrides_star": true,
    "R3_operator_blocklist": true,
    "R3a_apex_and_subdomain_match": true,
    "R3b_retain_last_list_on_failure": true,
    "R3c_schema_validation": true,
    "R4_serial_rate_limit_crawl_delay": true,
    "R5_backoff_jitter_retry_after": true,
    "R6_no_recursion_metadata_only": true,
    "R7_conditional_fetch_dedup": true,
    "R8_infrequent_runs": "operator-responsibility",
    "R9_no_html_scrape": "by-construction"
  }
}

This is the structure the spec hints at in section 8 (basis for an automated attestation). Format candidates for an attestation endpoint: pin to JSON; sign with the operator key; serve at .well-known/walsh-research/attestation.json.
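
Signing an attestation presupposes a byte-stable serialization, since two JSON encodings of the same object can differ in key order and whitespace. Here is a minimal sketch of one deterministic encoding using only the standard library; canonical_bytes and attestation_digest are hypothetical names, and the scheme is a simplification rather than a full RFC 8785 (JCS) implementation.

```python
import hashlib
import json


def canonical_bytes(doc: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace.

    Assumption: this is enough for documents made of strings and booleans,
    like the attestation above; number formatting is not normalized.
    """
    text = json.dumps(doc, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
    return text.encode("utf-8")


def attestation_digest(doc: dict) -> str:
    """SHA-256 over the canonical bytes: the value a signature would cover."""
    return hashlib.sha256(canonical_bytes(doc)).hexdigest()


if __name__ == "__main__":
    a = {"tool": "walsh-research-bot-python", "version": "1.0.0"}
    b = {"version": "1.0.0", "tool": "walsh-research-bot-python"}
    print(canonical_bytes(a).decode("utf-8"))
    print(attestation_digest(a) == attestation_digest(b))
```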

10. Open questions / what v1.1 would name

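  1. RFC 9309 group selection vs urllib.robotparser. Stdlib selects the first entry, in file order, whose token is a case-insensitive substring of the bot's name. RFC 9309 section 2.2.1 instead requires case-insensitive matching on the product token and, when several groups match, merging their rules; the most-octets (longest match) rule in section 2.2.2 governs Allow/Disallow paths, not group selection. For typical robots.txt the outcome is the same; for pathological cases with overlapping tokens (Walsh, Walsh-Research) the stdlib answer depends on file order. Refutation: construct such a file, observe which group is chosen. v1.1 SHOULD ship a hand-rolled selector and not depend on stdlib.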
  2. R8 frequency enforcement. The spec says SHOULD run daily-or-less. The library cannot enforce this; it is an orchestration concern. Candidate: a per-source last_run_at persistence layer with a refusal mode if invoked sooner than configured cadence.
  3. R9 llms.txt fast-path. Library does not currently probe /llms.txt during discovery. v1.1: add bot.discover(host) that probes /llms.txt and returns the structured link map for metadata extraction.
  4. Attestation signing. A signed attestation lets downstream consumers verify without trusting the operator. Ed25519 over the JSON canonical form; key rotation via a published JWKS at .well-known/walsh-research/keys.
  5. Process-wide vs cross-process throttle. R4a says "serial per process". A multi-tenant deployment with N processes can exceed 1 req/s/domain. v1.1 SHOULD either (a) document this and require process supervision, or (b) ship a Redis/sqlite-backed cross-process token bucket.
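
The refutation in item 1 can be run against the stdlib directly. This is a minimal sketch, assuming CPython's urllib.robotparser keeps its file-order, substring-based entry matching; the two robots.txt bodies and the allowed helper are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Overlapping tokens, shorter token listed first.
SHORT_FIRST = """\
User-agent: Walsh
Disallow: /short-only

User-agent: Walsh-Research
Disallow: /research-only
"""

# Same two groups, longer token listed first.
LONG_FIRST = """\
User-agent: Walsh-Research
Disallow: /research-only

User-agent: Walsh
Disallow: /short-only
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Parse an in-memory robots.txt and ask the stdlib for a verdict."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for name, body in (("short-first", SHORT_FIRST), ("long-first", LONG_FIRST)):
        verdict = allowed(body, "Walsh-Research", "/research-only")
        print(f"{name}: /research-only allowed={verdict}")
```

If the two orderings print different verdicts for the same path, group selection is order-dependent, which is the behavior a hand-rolled selector would remove.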

11. How to run

# install dep
pip install jsonschema

# conformance suite (the sandbox VM-equivalent)
cd walsh-research-bot
python3 -m sandbox.run_conformance

# attestation
python3 -m walsh_research --attest

# real fetch
python3 -m walsh_research https://wal.sh/bot/