Walsh-Research Bot Compliance Specification

1. Status
2. Conformance language
3. Why language-agnostic
4. Identity and discovery
- 4.1. R1 — User-Agent (MUST)
- 4.2. Canonical resources (MUST be consulted)
5. Pre-request gates
6. Scope and data-handling limits
7. Opt-out workflow (operator side)
8. Conformance checklist
9. Porting notes
10. Versioning
11. Changelog

1. Status

Spec version: walsh-research-compliance/v1 (2026-05-23)
Applies to: every tool that fetches third-party resources under the Walsh-Research identity — crawlers, agents, one-off scripts, in any language.
Reference implementation: the Clojure crawler jwalsh/tech-crawler (namespaces tech-crawler.crawler / tech-crawler.core).
Authoritative contract surface: https://wal.sh/bot/ and the /.well-known/walsh-research/ documents (below).

This document is the implementation contract behind the public policy at https://wal.sh/bot/. If the two ever disagree, https://wal.sh/bot/ is authoritative for what we promise the public; this spec is authoritative for how our code must behave. A future inner-sourced bot/agents library should codify this spec directly; until then, hand this document to any new tool.

2. Conformance language

The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as in RFC 2119. A tool is conformant when it satisfies every MUST and MUST NOT.

3. Why language-agnostic

None of these rules depend on Clojure. Each requirement is stated as behavior over four primitives every language has: an HTTP client, an XML/JSON parser, a monotonic clock plus sleep, and a SHA-256 hash. The Clojure snippets are illustrative; equivalent Python, Rust, Guile, TypeScript, or Go code is a mechanical translation (see section 9).

4. Identity and discovery

4.1. R1 — User-Agent (MUST)

Every request MUST send this exact User-Agent:

Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)

The product token used for robots.txt matching is Walsh-Research (case-insensitive). The +URL MUST resolve to a public policy page.

(def ua-token "Walsh-Research")
(def default-headers
  {"User-Agent" "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"})
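As a cross-language illustration of R1, here is a minimal Python sketch using only the standard library. The helper name build_request is hypothetical; the two constants are copied from the requirement above.

```python
import urllib.request

# Constants copied verbatim from R1.
USER_AGENT = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"
UA_TOKEN = "Walsh-Research"  # product token used for robots.txt matching


def build_request(url: str) -> urllib.request.Request:
    """Return a GET request that carries the exact R1 User-Agent.

    build_request is an illustrative helper, not part of any published library.
    """
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

Routing every outgoing request through one such constructor keeps the string from drifting between tools.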

4.2. Canonical resources (MUST be consulted)

Resource	URL
Public policy	https://wal.sh/bot/
Blocklist	https://wal.sh/.well-known/walsh-research/blocklist.json
Blocklist schema	https://wal.sh/.well-known/walsh-research/blocklist.schema.json

The schema is reusable. If you are building a similar bot compliance system, the walsh-research-blocklist/v1 contract and its JSON Schema are available at the URLs above without restriction.

5. Pre-request gates

A tool MUST evaluate these gates, in this order, before issuing the actual request for a target URL. The first gate that denies stops the request.

Pre-request gate flow: blocklist → robots.txt → throttle → fetch or deny

Operator blocklist (R3) — operator-side opt-out; absolute.
robots.txt (R2) — site-side rules.
Rate limit / Crawl-delay (R4) — pacing.

Then fetch, applying backoff (R5) on rate-limit responses.

Backoff state machine: fetch → check status → retry with exponential backoff or abort

5.1. R2 — robots.txt, RFC 9309 (MUST)

A tool MUST fetch /robots.txt for each host and obey it for the Walsh-Research token; it SHOULD cache the file for 24 hours. Group selection MUST follow RFC 9309:

If one or more groups name our token, the single most specific group applies (longest matching User-agent value); the * group is ignored.
Else the * group applies.
Else nothing is disallowed.

A named group therefore overrides * in both directions: it can block us where * allows, and exempt us where * blocks. A tool MUST support Crawl-delay within the selected group (see R4). On any fetch/parse error a tool SHOULD fail open (allow).

;; select the most-specific matching group, then test the path prefix
(defn robots-allowed? [url]
  (let [g (select-robots-group (fetch-robots url) ua-token)]   ; nil | {:disallows [..] :crawl-delay s}
    (not (some #(str/starts-with? (path-of url) %) (:disallows g)))))

Opt-out via robots (what we honor):

User-agent: Walsh-Research
Disallow: /
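The group-selection rule can be sketched in Python as follows. This is a simplified illustration, not a full RFC 9309 parser: like the Clojure snippet it tests Disallow prefixes only and ignores Allow lines and wildcards. It also assumes that a group "names our token" when its User-agent value is a case-insensitive prefix of Walsh-Research; that reading, and the function names, are assumptions of this sketch.

```python
from urllib.parse import urlsplit

UA_TOKEN = "Walsh-Research"


def parse_groups(robots_txt):
    """Split robots.txt into groups of {"agents", "disallows", "crawl_delay"}."""
    groups, current, in_agents = [], None, False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agents:                    # first User-agent after rules opens a new group
                current = {"agents": [], "disallows": [], "crawl_delay": None}
                groups.append(current)
            current["agents"].append(value.lower())
            in_agents = True
            continue
        in_agents = False
        if current is None:
            continue                             # rule before any User-agent line
        if field == "disallow" and value:        # an empty Disallow disallows nothing
            current["disallows"].append(value)
        elif field == "crawl-delay":
            try:
                current["crawl_delay"] = float(value)
            except ValueError:
                pass
    return groups


def select_group(groups, token=UA_TOKEN):
    """Longest named match wins; else the * group; else None."""
    token = token.lower()
    named = [(len(a), g) for g in groups for a in g["agents"]
             if a and a != "*" and token.startswith(a)]
    if named:
        return max(named, key=lambda pair: pair[0])[1]
    for g in groups:
        if "*" in g["agents"]:
            return g
    return None


def robots_allowed(url, robots_txt):
    """True unless the selected group disallows a prefix of the URL path."""
    group = select_group(parse_groups(robots_txt))
    if group is None:
        return True
    path = urlsplit(url).path or "/"
    return not any(path.startswith(rule) for rule in group["disallows"])
```

With the opt-out shown above, robots_allowed returns False for every path on the host, whatever the * group says.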

5.2. R3 — Operator blocklist (MUST)

Some opt-outs arrive out-of-band (email to j@wal.sh). These live in one published JSON contract that every Walsh-Research tool MUST consult, so an opt-out is honored uniformly across bots and languages. This is in addition to robots.txt and is checked first.

Contract walsh-research-blocklist/v1:

{
  "contract": "walsh-research-blocklist/v1",
  "updated":  "2026-05-23T00:00:00Z",
  "operator": "Jason Walsh",
  "contact":  "j@wal.sh",
  "policy":   "https://wal.sh/bot/",
  "refresh":  "PT6H",
  "blocked": [
    { "domain": "example.com", "added": "2026-05-23", "reason": "email opt-out" }
  ]
}

Requirements:

A tool MUST fetch the blocklist and skip any request whose host equals or is a subdomain of a listed domain (example.com blocks www.example.com).
A tool SHOULD validate the document against the published blocklist.schema.json before adopting it, and MUST NOT adopt a document that fails validation (retain the previous list instead).
A tool SHOULD cache the list for the contract's refresh duration (ISO-8601). Timing is dictated by the data source, not hard-coded.
On fetch failure or 404 a tool MUST retain the last-known list rather than clearing it: a transient outage MUST NOT silently un-block an opt-out. (If the very first fetch fails there is no previous list, so the tool starts with an empty list and fails open.)

(defn blocked? [url]                          ; host apex + subdomain match
  (let [{:keys [domains]} (cached-blocklist)] ; validated against schema; TTL from :refresh
    (domain-blocked? (host-of url) domains)))
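The helpers that the Clojure snippet leaves implicit might look like this in Python. It is a sketch under stated assumptions: the structural check in adopt is a stand-in for full validation against blocklist.schema.json, the refresh parser handles only time-only durations such as PT6H, and the six-hour fallback and all function names are hypothetical.

```python
import re

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def domain_blocked(host, domains):
    """True when host equals a listed domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d)
               for d in (entry.lower() for entry in domains))


def refresh_seconds(duration, default=6 * 3600):
    """Parse a time-only ISO-8601 duration (e.g. PT6H); else return default.

    The default is an assumption of this sketch, not part of the contract.
    """
    match = _DURATION.match(duration or "")
    if not match or not any(match.groups()):
        return default
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def adopt(previous, fetched):
    """Return fetched when it passes a minimal shape check, else previous.

    Passing fetched=None models a failed fetch or a 404, so the last-known
    list is retained. A real tool would validate against the JSON Schema.
    """
    ok = (isinstance(fetched, dict)
          and fetched.get("contract") == "walsh-research-blocklist/v1"
          and isinstance(fetched.get("blocked"), list)
          and all(isinstance(e, dict) and isinstance(e.get("domain"), str)
                  for e in fetched["blocked"]))
    return fetched if ok else previous
```

Matching on a leading dot is what keeps notexample.com from being caught by an entry for example.com.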

5.3. R4 — Rate limiting and Crawl-delay (MUST)

Requests MUST be serial per process — no concurrent connections.
A tool MUST NOT exceed one request per second per domain.
A tool MUST honor a host's robots Crawl-delay: the spacing before the next request to a host is max(1 second, Crawl-delay).
The first request to a previously-unseen host MUST NOT be delayed.

(def min-request-interval-ms 1000)            ; <= 1 req/s/domain

(defn throttle! [url]                         ; block until this host's slot is free
  (let [host (host-of url)
        interval (max min-request-interval-ms (crawl-delay-ms url)) ; robots Crawl-delay
        prev (get @host-last-request host)    ; nil when unseen -> no wait
        wait (if prev (max 0 (- (+ prev interval) (now))) 0)]
    (when (pos? wait) (sleep wait))
    (swap! host-last-request assoc host (now))))  ; (now) is read after the sleep, so it is the send time
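The same pacing rule in Python, with the clock and sleep injected so the behavior can be checked without real waiting. The class and parameter names are hypothetical; the one-second floor comes from R4.

```python
import time

MIN_INTERVAL_S = 1.0  # R4: at most one request per second per domain


class Throttle:
    """Per-host pacing: the spacing is max(1 second, Crawl-delay)."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> clock reading when the last request was released

    def wait(self, host, crawl_delay_s=0.0):
        """Block until the host's next slot is free; return the seconds waited."""
        interval = max(MIN_INTERVAL_S, crawl_delay_s or 0.0)
        last = self._last.get(host)
        # A previously-unseen host is never delayed.
        delay = 0.0 if last is None else max(0.0, last + interval - self._clock())
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()  # read after sleeping
        return delay
```

A caller would invoke wait(host, crawl_delay) immediately before each request; because R4 requires serial requests, no locking is shown.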

5.4. R5 — Backoff and Retry-After (MUST)

On HTTP 429 or 503 a tool MUST back off and retry with exponential backoff plus jitter. When a Retry-After header is present, the tool MUST use its value instead of the computed delay. A tool SHOULD cap the number of retries before giving up.

(defn with-backoff [thunk]                     ; thunk -> http response
  (loop [n 0]
    (let [resp (thunk)]
      (if (and (#{429 503} (:status resp)) (< n max-retries))
        (do (sleep (or (retry-after-ms resp) (* base-ms (Math/pow 2 n) (rand))))
            (recur (inc n)))
        resp))))
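A Python rendering of the same loop, offered as a sketch. Responses are modeled as plain dicts with "status" and "headers" keys, and the defaults for max_retries and base_s are placeholders; the Clojure snippet leaves max-retries and base-ms undefined. Only the delta-seconds form of Retry-After is parsed here; an HTTP-date value falls back to the computed delay, which a production tool would want to handle as well.

```python
import random
import time


def retry_after_seconds(headers):
    """Return Retry-After as float seconds when it is a plain integer, else None."""
    lowered = {str(k).lower(): v for k, v in (headers or {}).items()}
    value = str(lowered.get("retry-after", "")).strip()
    if value.isascii() and value.isdigit():
        return float(value)
    return None


def with_backoff(thunk, max_retries=4, base_s=1.0,
                 sleep=time.sleep, rng=random.random):
    """Call thunk until it returns a status other than 429/503 or retries run out."""
    for attempt in range(max_retries + 1):
        resp = thunk()
        if resp["status"] not in (429, 503) or attempt == max_retries:
            return resp
        delay = retry_after_seconds(resp.get("headers"))
        if delay is None:
            # Full jitter: uniform in [0, base * 2^attempt).
            delay = base_s * (2 ** attempt) * rng()
        sleep(delay)
```

Injecting sleep and rng keeps the function deterministic under test while the defaults behave normally in production.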

6. Scope and data-handling limits

6.1. R6 — Stay in scope (MUST / MUST NOT)

A tool MUST NOT follow links or crawl recursively. It fetches a declared set of targets only.
A tool MUST NOT download sub-resources (images, scripts, stylesheets).
A tool MUST NOT train models on, index for search, or republish fetched content. Extract metadata, not content.
A tool MUST NOT store content beyond what the analysis needs.

6.2. R7 — Be a cache-friendly citizen (SHOULD)

A tool SHOULD send conditional requests using stored validators (If-None-Match / If-Modified-Since) and treat 304 Not Modified as "no new data" without re-processing. A tool SHOULD deduplicate results by canonical URL (strip the #fragment and any trailing slash) so that repeated runs accumulate first-seen items only.
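Both halves of R7 reduce to small pure functions. The sketch below is one possible Python shape; the validator keys etag and last_modified and all function names are assumptions of this example, not part of the spec.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Strip the #fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc,
                       parts.path.rstrip("/"), parts.query, ""))


def conditional_headers(validators):
    """Build If-None-Match / If-Modified-Since from stored validators."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


def first_seen(urls, seen):
    """Return canonical URLs not already in seen, recording them as it goes."""
    fresh = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh
```

Persisting the seen set between runs is what makes repeated runs accumulate first-seen items only.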

6.3. R8 — Frequency (SHOULD)

A tool SHOULD run infrequently (daily or less) per source.

6.4. R9 — Prefer structured formats; never scrape HTML (SHOULD / MUST NOT)

A tool MUST NOT extract content by scraping arbitrary HTML with per-site selectors: doing so is an unbounded maintenance burden and is easy to do impolitely.

For a feed-less source a tool SHOULD obtain a structured representation, preferring, in order:

Content negotiation — request Accept: text/markdown and use the body only when the server returns a markdown content-type (e.g. Mintlify-hosted docs, wal.sh).
llms.txt convention — fetch /llms.txt at the host root: a curated markdown index of the site's key pages (https://llmstxt.org).

Extract metadata generically from the markdown (links, headings); a tool MUST NOT rely on per-site structure. This keeps feed-less coverage clean and consistent with R6 (metadata, not content). A source that offers neither a feed nor markdown is left out rather than scraped.
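The content-negotiation step can be sketched in Python as a request builder plus a strict check on the response type. The function names are hypothetical, and treating only text/markdown as a markdown content-type is an assumption of this sketch.

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"


def markdown_request(url):
    """Build a request that asks for markdown via the Accept header."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT, "Accept": "text/markdown"})


def is_markdown(content_type):
    """True when the Content-Type media type is text/markdown (parameters ignored)."""
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    return media_type == "text/markdown"
```

A tool would use the body only when is_markdown(response Content-Type) holds, and otherwise move on to the next option rather than parse the HTML it received.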

6.4.1. Note: the llms.txt convention (https://llmstxt.org)

llms.txt is a proposed standard (Jeremy Howard / Answer.AI, 2024) for a markdown file at a site's root that gives automated readers a curated, concise map of the site, instead of forcing them to parse navigation-heavy HTML. Two files:

/llms.txt — a curated index: an H1 site name, an optional blockquote summary, then sections of markdown links ([title](url): note) to the pages that matter. This is what a tool SHOULD fetch and parse for links.
/llms-full.txt — the same pages' full content concatenated as markdown, for tools that want the text in one request. A metadata-only crawler (this contract) uses the index, not the full dump.

It is served as a plain file (often text/plain), so a tool SHOULD accept a text/plain body at /llms.txt even though it is markdown. Adoption is growing among docs platforms (Mintlify auto-generates it; Cloudflare, Anthropic docs, Stripe, Next.js, shadcn, Perplexity ship one). It is advisory, not an access-control mechanism: robots.txt (R2) and the blocklist (R3) still govern whether a tool may fetch; llms.txt only offers a cleaner representation once allowed. A tool MAY probe /llms.txt during source discovery to decide whether a feed-less site is crawlable without scraping.
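Parsing the /llms.txt index for links needs no per-site logic. The Python sketch below extracts the H1 title and every markdown link together with the H2 section it appears under. The function name and the returned dict shape are assumptions, and the regular expression covers the simple [title](url): note form only.

```python
import re

# [title](url) optionally followed by ": note"
_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?")


def parse_llms_index(text):
    """Return {"title": str or None, "links": [{"section", "title", "url", "note"}]}."""
    title, section, links = None, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()            # H1 site name
        elif line.startswith("## "):
            section = line[3:].strip()          # current H2 section
        else:
            match = _LINK.search(line)
            if match:
                links.append({
                    "section": section,
                    "title": match.group(1),
                    "url": match.group(2),
                    "note": (match.group(3) or "").strip(),
                })
    return {"title": title, "links": links}
```

The result is metadata only (titles, URLs, short notes), which keeps the step consistent with R6.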

7. Opt-out workflow (operator side)

A site can self-serve via robots.txt (R2) at any time — honored within the robots cache TTL.
Or email j@wal.sh; the operator appends a {domain, added, reason} entry to blocklist.json and deploys. All bots honor it within the refresh TTL.

A tool implementer's job is only to consume both sources correctly.

8. Conformance checklist

A tool is conformant when every MUST below holds. Use this as a self-audit (and as the basis for an automated attestation).

#	Requirement	MUST
1	Exact User-Agent string	✓
2	robots.txt fetched + obeyed, RFC 9309 group selection	✓
2a	Named `Walsh-Research` group overrides `*`	✓
3	Operator blocklist consulted, checked before robots	✓
3a	apex + subdomain match	✓
3b	retain last list on fetch failure	✓
3c	validate against schema; reject invalid (SHOULD)	◐
4	serial; <= 1 req/s/domain; honor Crawl-delay	✓
5	429/503 exponential backoff + jitter; Retry-After	✓
6	no recursion, no sub-resources, metadata-only	✓
7	conditional fetch + dedup (SHOULD)	◐
8	infrequent (SHOULD)	◐
9	feed-less: content-neg markdown / llms.txt; never scrape HTML (SHOULD / MUST NOT)	◐

9. Porting notes

The logic is identical across languages; only the four primitives change.

Primitive	Clojure	Python	Rust	Guile	TypeScript	Go
HTTP client	clj-http	httpx / requests	reqwest	(web client)	fetch / undici	net/http
XML parse	clojure.data.xml	xml.etree	quick-xml	(sxml)	fast-xml-parser	encoding/xml
JSON parse	cheshire	json	serde_json	(guile-json)	JSON.parse	encoding/json
Clock+sleep	System/nanoTime / Thread/sleep	time.monotonic / time.sleep	std::time::Instant / thread::sleep	(current-time / usleep)	performance.now / setTimeout	time
SHA-256	java.security.MessageDigest	hashlib	sha2	(gcrypt)	crypto.subtle	crypto/sha256

Cross-language sketch of the pre-request gate (pseudocode, applies verbatim):

function may_fetch(url):
    if blocked_by_operator(url):   return DENY   # R3, checked first
    if not robots_allows(url):     return DENY   # R2, RFC 9309 group selection
    throttle(url)                                 # R4, max(1s, crawl_delay)
    return ALLOW

# Python: the same constants and order
UA = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"
UA_TOKEN = "Walsh-Research"
def may_fetch(url):
    if blocked(url):            return False   # R3
    if not robots_allowed(url): return False   # R2
    throttle(url)                              # R4
    return True

10. Versioning

This spec is versioned (walsh-research-compliance/vN). Breaking changes to a MUST bump the major. The blocklist data contract (walsh-research-blocklist/vN) versions independently. Tools SHOULD record which spec version they target so attestations are comparable.

11. Changelog

v1 (2026-05-23) — initial: R1 UA, R2 robots/RFC 9309 + Crawl-delay, R3 operator blocklist + schema, R4 rate limit, R5 backoff, R6 scope limits, R7 conditional fetch + dedup, R8 frequency.
v1, rev 2 (2026-05-23) — add R9: prefer structured formats (content negotiation / llms.txt) for feed-less sources; never scrape HTML.