Walsh-Research Bot Compliance Specification

1. Status

  • Spec version: walsh-research-compliance/v1 (2026-05-23)
  • Applies to: every tool that fetches third-party resources under the Walsh-Research identity — crawlers, agents, one-off scripts, in any language.
  • Reference implementation: the Clojure crawler jwalsh/tech-crawler (namespaces tech-crawler.crawler / tech-crawler.core).
  • Authoritative contract surface: https://wal.sh/bot/ and the /.well-known/walsh-research/ documents (below).

This document is the implementation contract behind the public policy at https://wal.sh/bot/. If the two ever disagree, https://wal.sh/bot/ is authoritative for what we promise the public; this spec is authoritative for how our code must behave. A future inner-sourced bot/agents library should codify this spec directly; until then, hand this document to any new tool.

2. Conformance language

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as in RFC 2119. A tool is conformant when it satisfies every MUST and MUST NOT.

3. Why language-agnostic

None of these rules depend on Clojure. Each requirement is stated as behavior over four primitives every language has: an HTTP client, an XML/JSON parser, a monotonic clock + sleep, and a SHA-256 hash. The Clojure snippets are illustrative; equivalents in Python, Rust, Guile, TypeScript, and Go are mechanical (see section 9).

4. Identity and discovery

4.1. R1 — User-Agent (MUST)

Every request MUST send this exact User-Agent:

Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)

The product token used for robots.txt matching is Walsh-Research (case-insensitive). The +URL MUST resolve to a public policy page.

(def ua-token "Walsh-Research")
(def default-headers
  {"User-Agent" "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"})
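A minimal Python sketch of the same constants, assuming the standard-library urllib. The helper names build_request and matches_token are illustrative and not part of this spec.

```python
from urllib.request import Request

UA_TOKEN = "Walsh-Research"
USER_AGENT = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"


def build_request(url: str) -> Request:
    """Return a urllib Request that carries the exact R1 User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def matches_token(robots_user_agent: str) -> bool:
    """Compare a robots.txt User-agent value to our product token, ignoring case."""
    return robots_user_agent.strip().lower() == UA_TOKEN.lower()
```

Keeping the string in one constant makes it harder for a tool to drift from the exact value that R1 requires.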

4.2. Canonical resources (MUST be consulted)

Resource          URL
Public policy     https://wal.sh/bot/
Blocklist         https://wal.sh/.well-known/walsh-research/blocklist.json
Blocklist schema  https://wal.sh/.well-known/walsh-research/blocklist.schema.json

The schema is reusable. If you are building a similar bot compliance system, you may reuse the walsh-research-blocklist/v1 contract and its JSON Schema from the URLs above without restriction.

5. Pre-request gates

A tool MUST evaluate these gates, in this order, before issuing the actual request for a target URL. The first gate that denies stops the request.

Pre-request gate flow: blocklist → robots.txt → throttle → fetch or deny

  1. Operator blocklist (R3) — operator-side opt-out; absolute.
  2. robots.txt (R2) — site-side rules.
  3. Rate limit / Crawl-delay (R4) — pacing.

Then fetch, applying backoff (R5) on rate-limit responses.

Backoff state machine: fetch → check status → retry with exponential backoff or abort

5.1. R2 — robots.txt, RFC 9309 (MUST)

A tool MUST fetch /robots.txt for each host, SHOULD cache it for 24 hours, and MUST obey it for the Walsh-Research token. Group selection MUST follow RFC 9309:

  • If one or more groups name our token, the single most specific group applies (longest matching User-agent value); the * group is ignored.
  • Else the * group applies.
  • Else nothing is disallowed.

A named group therefore overrides * in both directions: it can block us where * allows, and exempt us where * blocks. A tool MUST support Crawl-delay within the selected group (see R4). On any fetch/parse error a tool SHOULD fail open (allow).

;; select the most-specific matching group, then test the path prefix
(defn robots-allowed? [url]
  (let [g (select-robots-group (fetch-robots url) ua-token)]   ; nil | {:disallows [..] :crawl-delay s}
    (not (some #(str/starts-with? (path-of url) %) (:disallows g)))))

Opt-out via robots (what we honor):

User-agent: Walsh-Research
Disallow: /
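The selection rules above can be sketched in Python as follows. This is a simplified illustration, not a complete RFC 9309 parser: it ignores Allow lines and wildcards, and if several groups name the token it takes the first. The function names are hypothetical.

```python
def parse_groups(robots_txt: str) -> list:
    """Split robots.txt into groups: {"agents": [...], "disallows": [...], "crawl_delay": float or None}."""
    groups, current, in_agents = [], None, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agents:                    # a new run of User-agent lines opens a group
                current = {"agents": [], "disallows": [], "crawl_delay": None}
                groups.append(current)
            current["agents"].append(value.lower())
            in_agents = True
            continue
        in_agents = False
        if current is None:
            continue                             # rule before any User-agent line
        if field == "disallow" and value:        # an empty Disallow blocks nothing
            current["disallows"].append(value)
        elif field == "crawl-delay":
            try:
                current["crawl_delay"] = float(value)
            except ValueError:
                pass
    return groups


def select_group(groups, token="Walsh-Research"):
    """Named group first, else the * group, else None (nothing disallowed)."""
    token = token.lower()
    named = [g for g in groups if token in g["agents"]]
    if named:
        return named[0]
    star = [g for g in groups if "*" in g["agents"]]
    return star[0] if star else None


def robots_allowed(path: str, group) -> bool:
    """Prefix test against the selected group's Disallow rules."""
    if group is None:
        return True
    return not any(path.startswith(prefix) for prefix in group["disallows"])
```

Because the named group is chosen before the * group is even considered, the override works in both directions, as the text describes.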

5.2. R3 — Operator blocklist (MUST)

Some opt-outs arrive out-of-band (email to j@wal.sh). These live in one published JSON contract that every Walsh-Research tool MUST consult, so an opt-out is honored uniformly across bots and languages. This is in addition to robots.txt and is checked first.

Contract walsh-research-blocklist/v1:

{
  "contract": "walsh-research-blocklist/v1",
  "updated":  "2026-05-23T00:00:00Z",
  "operator": "Jason Walsh",
  "contact":  "j@wal.sh",
  "policy":   "https://wal.sh/bot/",
  "refresh":  "PT6H",
  "blocked": [
    { "domain": "example.com", "added": "2026-05-23", "reason": "email opt-out" }
  ]
}

Requirements:

  • A tool MUST fetch the blocklist and skip any request whose host equals or is a subdomain of a listed domain (example.com blocks www.example.com).
  • A tool SHOULD validate the document against the published blocklist.schema.json before adopting it, and MUST NOT adopt a document that fails validation (retain the previous list instead).
  • A tool SHOULD cache the list for the contract's refresh duration (ISO-8601). Timing is dictated by the data source, not hard-coded.
  • On fetch failure or 404 a tool MUST retain the last-known list rather than clearing it: a transient outage MUST NOT silently un-block an opt-out. (If the first-ever fetch fails there is no previous list, so the tool starts with an empty list, i.e. it fails open.)

(defn blocked? [url]                          ; host apex + subdomain match
  (let [{:keys [domains]} (cached-blocklist)] ; validated against schema; TTL from :refresh
    (domain-blocked? (host-of url) domains)))
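A Python sketch of the three behaviors listed above: apex and subdomain matching, reading the refresh duration, and retaining the previous list when a refresh fails. The function names are hypothetical, the duration parser handles only time-only ISO-8601 values such as PT6H, and the contract check here is a stand-in for real validation against blocklist.schema.json.

```python
import re
from typing import Optional

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def domain_blocked(host: str, domains) -> bool:
    """True when host equals a listed domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


def refresh_seconds(duration: str) -> Optional[int]:
    """Parse a time-only ISO-8601 duration (e.g. "PT6H"); None if not understood."""
    match = _DURATION.match(duration or "")
    if not match or not any(match.groups()):
        return None
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def next_blocklist(previous: list, fetched) -> list:
    """Return the domain list to use after a refresh attempt.

    fetched is the parsed JSON document, or None on fetch failure or 404.
    Anything that does not look like the v1 contract keeps the previous list.
    """
    if not isinstance(fetched, dict) or fetched.get("contract") != "walsh-research-blocklist/v1":
        return previous
    try:
        return [entry["domain"].lower() for entry in fetched["blocked"]]
    except (KeyError, TypeError, AttributeError):
        return previous
```

Matching on "." + domain rather than a bare suffix keeps notexample.com from being caught by an entry for example.com.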

5.3. R4 — Rate limiting and Crawl-delay (MUST)

  • Requests MUST be serial per process — no concurrent connections.
  • A tool MUST NOT exceed one request per second per domain.
  • A tool MUST honor a host's robots Crawl-delay: the spacing before the next request to a host is max(1 second, Crawl-delay).
  • The first request to a previously-unseen host MUST NOT be delayed.

(def min-request-interval-ms 1000)            ; <= 1 req/s/domain

(defn throttle! [url]                         ; block until this host's slot is free
  (let [host     (host-of url)
        interval (max min-request-interval-ms (crawl-delay-ms url)) ; robots Crawl-delay
        prev     (get @host-last-request host) ; nil when unseen -> no wait
        wait     (if prev (max 0 (- (+ prev interval) (now))) 0)]
    (when (pos? wait) (sleep wait))
    (swap! host-last-request assoc host (now)))) ; record the send time, after any sleep
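The same pacing rule, sketched in Python. The Throttle class and its injectable clock and sleep parameters are assumptions made so the behavior can be exercised without real waiting; a real tool would take the Crawl-delay from the robots group selected under R2.

```python
import time
from urllib.parse import urlsplit

MIN_INTERVAL_S = 1.0  # <= 1 request per second per domain


class Throttle:
    """Per-host spacing of max(1 second, Crawl-delay); unseen hosts are not delayed."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the previous request

    def wait(self, url: str, crawl_delay_s: float = 0.0) -> float:
        """Block until the host's slot is free; return the seconds waited."""
        host = urlsplit(url).hostname or ""
        interval = max(MIN_INTERVAL_S, crawl_delay_s or 0.0)
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, last + interval - self._clock())
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()  # read the clock again, after sleeping
        return delay
```

Recording the timestamp after the sleep, rather than adding the wait to a pre-sleep reading, avoids counting the wait twice on the next call.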

5.4. R5 — Backoff and Retry-After (MUST)

On HTTP 429 or 503 a tool MUST back off and retry with exponential backoff plus jitter, and MUST respect a Retry-After header when present (use it instead of the computed delay). A bounded number of retries SHOULD be used before giving up.

(defn with-backoff [thunk]                     ; thunk -> http response
  (loop [n 0]
    (let [resp (thunk)]
      (if (and (#{429 503} (:status resp)) (< n max-retries))
        (do (sleep (or (retry-after-ms resp) (* base-ms (Math/pow 2 n) (rand))))
            (recur (inc n)))
        resp))))
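A Python sketch of the same loop, including a Retry-After parser that accepts both delta-seconds and an HTTP-date. It assumes fetch returns a (status, headers) pair with a plain dict keyed by "Retry-After"; the defaults for max_retries and base_s are illustrative, not values set by this spec.

```python
import random
import time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Parse Retry-After (delta-seconds or HTTP-date); None if absent or unparseable."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def with_backoff(fetch, max_retries=4, base_s=1.0, sleep=time.sleep, rng=random.random):
    """Call fetch(); on 429/503 wait and retry, up to max_retries times."""
    attempt = 0
    while True:
        status, headers = fetch()
        if status not in (429, 503) or attempt >= max_retries:
            return status, headers
        delay = retry_after_seconds(headers.get("Retry-After"))
        if delay is None:
            # exponential backoff with full jitter: uniform in [0, base * 2^attempt)
            delay = base_s * (2 ** attempt) * rng()
        sleep(delay)
        attempt += 1
```

When the retry budget is exhausted the last 429 or 503 response is returned as-is, so the caller can record the failure rather than loop.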

6. Scope and data-handling limits

6.1. R6 — Stay in scope (MUST / MUST NOT)

  • A tool MUST NOT follow links or crawl recursively. It fetches a declared set of targets only.
  • A tool MUST NOT download sub-resources (images, scripts, stylesheets).
  • A tool MUST NOT train models on, index for search, or republish fetched content. Extract metadata, not content.
  • A tool MUST NOT store content beyond what the analysis needs.

6.2. R7 — Be a cache-friendly citizen (SHOULD)

A tool SHOULD send conditional requests using stored validators (If-None-Match / If-Modified-Since) and treat 304 Not Modified as "no new data" without re-processing. A tool SHOULD dedup results by canonical URL (strip #fragment and trailing slash) so repeated runs accumulate first-seen items only.
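A small Python sketch of both halves of R7. The validators dict with etag and last_modified keys, and the item shape {"url": ...}, are assumptions about how a tool might store state; the spec does not prescribe a storage format.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url: str) -> str:
    """Strip the #fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, ""))


def conditional_headers(validators: dict) -> dict:
    """Build If-None-Match / If-Modified-Since from validators saved on a prior fetch."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


def merge_first_seen(seen: dict, items: list) -> list:
    """Record items whose canonical URL is new; return only the newly added ones."""
    fresh = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in seen:
            seen[key] = item
            fresh.append(item)
    return fresh
```

On a 304 response a tool would skip merge_first_seen entirely, since there is nothing new to process.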

6.3. R8 — Frequency (SHOULD)

A tool SHOULD run infrequently (daily or less) per source.

6.4. R9 — Prefer structured formats; never scrape HTML (SHOULD / MUST NOT)

A tool MUST NOT extract content by scraping arbitrary HTML with per-site selectors: doing so is an unbounded maintenance burden and is easy to do impolitely.

For a feed-less source a tool SHOULD obtain a structured representation, preferring, in order:

  1. Content negotiation — request Accept: text/markdown and use the body only when the server returns a markdown content-type (e.g. Mintlify-hosted docs, wal.sh).
  2. llms.txt convention — fetch /llms.txt at the host root: a curated markdown index of the site's key pages (https://llmstxt.org).

Extract metadata generically from the markdown (links, headings); a tool MUST NOT rely on per-site structure. This keeps feed-less coverage clean and consistent with R6 (metadata, not content). A source that offers neither a feed nor markdown is left out rather than scraped.
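As a sketch of what "generic" extraction can look like in Python: accept the body only when the response is labeled text/markdown, then pull headings and inline links with two regular expressions that know nothing about any particular site. The function names are hypothetical, and the regexes are deliberately simple (they do not handle reference-style links or links whose URL contains spaces or parentheses).

```python
import re

HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)[ \t]*$", re.MULTILINE)
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def is_markdown(content_type) -> bool:
    """True when the Content-Type media type is text/markdown (parameters ignored)."""
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == "text/markdown"


def extract_metadata(markdown: str) -> dict:
    """Return heading (level, text) pairs and link (title, url) pairs."""
    return {
        "headings": [(len(hashes), text) for hashes, text in HEADING.findall(markdown)],
        "links": LINK.findall(markdown),
    }
```

A request made this way would send Accept: text/markdown and discard the response when is_markdown returns False, leaving the source out rather than falling back to HTML.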

6.4.1. Note: the llms.txt convention (https://llmstxt.org)

llms.txt is a proposed standard (Jeremy Howard / Answer.AI, 2024) for a markdown file at a site's root that gives automated readers a curated, concise map of the site, instead of forcing them to parse navigation-heavy HTML. Two files:

  • /llms.txt — a curated index: an H1 site name, an optional blockquote summary, then sections of markdown links ([title](url): note) to the pages that matter. This is what a tool SHOULD fetch and parse for links.
  • /llms-full.txt — the same pages' full content concatenated as markdown, for tools that want the text in one request. A metadata-only crawler (this contract) uses the index, not the full dump.

It is served as a plain file (often text/plain), so a tool SHOULD accept a text/plain body at /llms.txt even though it is markdown. Adoption is growing among docs platforms (Mintlify auto-generates it; Cloudflare, Anthropic docs, Stripe, Next.js, shadcn, Perplexity ship one). It is advisory, not an access-control mechanism: robots.txt (R2) and the blocklist (R3) still govern whether a tool may fetch; llms.txt only offers a cleaner representation once allowed. A tool MAY probe /llms.txt during source discovery to decide whether a feed-less site is crawlable without scraping.
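A Python sketch of parsing the /llms.txt index shape described above (H1 title, optional blockquote summary, sections of markdown links with an optional note). The parse_llms_txt name and the returned dict layout are illustrative assumptions; real files may deviate from the convention, so a tool should treat missing parts as normal.

```python
import re

LINK_LINE = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")


def parse_llms_txt(text: str) -> dict:
    """Return {"title", "summary", "sections": {name: [{"title", "url", "note"}]}}."""
    index = {"title": None, "summary": None, "sections": {}}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("# ") and index["title"] is None:
            index["title"] = stripped[2:].strip()
        elif stripped.startswith("> ") and index["summary"] is None:
            index["summary"] = stripped[2:].strip()
        elif stripped.startswith("## "):
            section = stripped[3:].strip()
            index["sections"][section] = []
        else:
            match = LINK_LINE.match(line)
            if match and section is not None:
                title, url, note = match.groups()
                index["sections"][section].append(
                    {"title": title, "url": url, "note": note or ""}
                )
    return index
```

Only titles, URLs, and notes are kept, which is consistent with the metadata-only rule in R6.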

7. Opt-out workflow (operator side)

  1. A site can self-serve via robots.txt (R2) at any time — honored within the robots cache TTL.
  2. Or email j@wal.sh; the operator appends a {domain, added, reason} entry to blocklist.json and deploys. All bots honor it within the refresh TTL.

A tool implementer's job is only to consume both sources correctly.

8. Conformance checklist

A tool is conformant when every MUST below holds. Use this as a self-audit (and as the basis for an automated attestation).

#   Requirement                                                    Level
1   Exact User-Agent string                                        MUST
2   robots.txt fetched + obeyed, RFC 9309 group selection          MUST
2a  Named Walsh-Research group overrides *                         MUST
3   Operator blocklist consulted, checked before robots            MUST
3a  apex + subdomain match                                         MUST
3b  retain last list on fetch failure                              MUST
3c  validate against schema; reject invalid                        SHOULD
4   serial; <= 1 req/s/domain; honor Crawl-delay                   MUST
5   429/503 exponential backoff + jitter; Retry-After              MUST
6   no recursion, no sub-resources, metadata-only                  MUST
7   conditional fetch + dedup                                      SHOULD
8   infrequent                                                     SHOULD
9   feed-less: content-neg markdown / llms.txt; never scrape HTML  SHOULD / MUST NOT

9. Porting notes

The logic is identical across languages; only the four primitives change.

Primitive    Clojure                                  Python            Rust                       Guile                    TypeScript             Go
HTTP client  clj-http                                 httpx / requests  reqwest                    (web client)             fetch / undici         net/http
XML parse    clojure.data.xml                         xml.etree         quick-xml                  (sxml)                   fast-xml-parser        encoding/xml
JSON parse   cheshire                                 json              serde_json                 (guile-json)             JSON.parse             encoding/json
Clock+sleep  System/currentTimeMillis / Thread/sleep  time / sleep      std::time / thread::sleep  (current-time / usleep)  Date.now / setTimeout  time
SHA-256      java.security.MessageDigest              hashlib           sha2                       (gcrypt)                 crypto.subtle          crypto/sha256

Cross-language sketch of the pre-request gate (pseudocode, applies verbatim):

function may_fetch(url):
    if blocked_by_operator(url):   return DENY   # R3, checked first
    if not robots_allows(url):     return DENY   # R2, RFC 9309 group selection
    throttle(url)                                 # R4, max(1s, crawl_delay)
    return ALLOW

# Python: the same constants and order
UA = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"
UA_TOKEN = "Walsh-Research"
def may_fetch(url):
    if blocked(url):            return False   # R3
    if not robots_allowed(url): return False   # R2
    throttle(url)                              # R4
    return True

10. Versioning

This spec is versioned (walsh-research-compliance/vN). Breaking changes to a MUST bump the major. The blocklist data contract (walsh-research-blocklist/vN) versions independently. Tools SHOULD record which spec version they target so attestations are comparable.

11. Changelog

  • v1 (2026-05-23) — initial: R1 UA, R2 robots/RFC 9309 + Crawl-delay, R3 operator blocklist + schema, R4 rate limit, R5 backoff, R6 scope limits, R7 conditional fetch + dedup, R8 frequency.
  • v1, rev 2 (2026-05-23) — add R9: prefer structured formats (content negotiation / llms.txt) for feed-less sources; never scrape HTML.