Walsh-Research Bot Compliance Specification
1. Status
- Spec version: walsh-research-compliance/v1 (2026-05-23)
- Applies to: every tool that fetches third-party resources under the Walsh-Research identity (crawlers, agents, one-off scripts), in any language.
- Reference implementation: the Clojure crawler jwalsh/tech-crawler (namespaces tech-crawler.crawler / tech-crawler.core).
- Authoritative contract surface: https://wal.sh/bot/ and the /.well-known/walsh-research/ documents (below).
This document is the implementation contract behind the public policy at https://wal.sh/bot/. If the two ever disagree, https://wal.sh/bot/ is authoritative for what we promise the public; this spec is authoritative for how our code must behave. A future inner-sourced bot/agents library should codify this spec directly; until then, hand this document to any new tool.
2. Conformance language
The key words MUST, MUST NOT, SHOULD, SHOULD NOT, and MAY are used as in RFC 2119. A tool is conformant when it satisfies every MUST and MUST NOT.
3. Why language-agnostic
None of these rules depend on Clojure. Each requirement is stated as behavior over four primitives every language has: an HTTP client, an XML/JSON parser, a monotonic clock + sleep, and a SHA-256 hash. The Clojure snippets are illustrative; equivalent Python / Rust / Guile / TypeScript / Go are mechanical (see section 9, Porting notes).
4. Identity and discovery
4.1. R1 — User-Agent (MUST)
Every request MUST send this exact User-Agent:
Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)
The product token used for robots.txt matching is Walsh-Research
(case-insensitive). The +URL MUST resolve to a public policy page.
(def ua-token "Walsh-Research")
(def default-headers
  {"User-Agent" "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"})
4.2. Canonical resources (MUST be consulted)
| Resource | URL |
|---|---|
| Public policy | https://wal.sh/bot/ |
| Blocklist | https://wal.sh/.well-known/walsh-research/blocklist.json |
| Blocklist schema | https://wal.sh/.well-known/walsh-research/blocklist.schema.json |
The schema is reusable. If you are building a similar bot compliance
system, the walsh-research-blocklist/v1 contract and its JSON Schema
are available at the URLs above under no restrictions.
5. Pre-request gates
A tool MUST evaluate these gates, in this order, before issuing the actual request for a target URL. The first gate that denies stops the request.
- Operator blocklist (R3) — operator-side opt-out; absolute.
- robots.txt (R2) — site-side rules.
- Rate limit / Crawl-delay (R4) — pacing.
Then fetch, applying backoff (R5) on rate-limit responses.
5.1. R2 — robots.txt, RFC 9309 (MUST)
A tool MUST fetch /robots.txt for each host, cache it (24h SHOULD), and
obey it for the Walsh-Research token. Group selection MUST follow RFC 9309:
- If one or more groups name our token, the single most specific group applies (longest matching User-agent value); the * group is ignored.
- Else the * group applies.
- Else nothing is disallowed.
A named group therefore overrides * in both directions: it can block us
where * allows, and exempt us where * blocks. A tool MUST support
Crawl-delay within the selected group (see R4). On any fetch/parse error a
tool SHOULD fail open (allow).
;; select the most-specific matching group, then test the path prefix
(defn robots-allowed? [url]
  (let [g (select-robots-group (fetch-robots url) ua-token)] ; nil | {:disallows [..] :crawl-delay s}
    (not (some #(str/starts-with? (path-of url) %) (:disallows g)))))
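The group-selection rules above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: `parse_robots`, `select_robots_group`, and `robots_allowed` are hypothetical names, a group is assumed to match when our lowercased token starts with its User-agent value, and Allow rules, wildcards, and merging of duplicate groups are left out to mirror the Disallow-prefix test in the Clojure snippet.

```python
from urllib.parse import urlsplit

UA_TOKEN = "Walsh-Research"
RULE_FIELDS = {"allow", "disallow", "crawl-delay"}


def parse_robots(text):
    """Parse robots.txt text into a list of groups (simplified sketch)."""
    groups, current, in_rules = [], None, False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            # Consecutive User-agent lines share a group; a User-agent
            # line that follows a rule starts a new group.
            if current is None or in_rules:
                current = {"agents": [], "disallows": [], "crawl_delay": None}
                groups.append(current)
                in_rules = False
            current["agents"].append(value.lower())
        elif current is not None and field in RULE_FIELDS:
            in_rules = True
            if field == "disallow" and value:  # empty Disallow blocks nothing
                current["disallows"].append(value)
            elif field == "crawl-delay":
                try:
                    current["crawl_delay"] = float(value)
                except ValueError:
                    pass  # ignore an unparsable delay
    return groups


def select_robots_group(groups, token=UA_TOKEN):
    """Longest named match wins; '*' applies only when no group names us."""
    token = token.lower()
    best, best_len, star = None, -1, None
    for group in groups:
        for agent in group["agents"]:
            if agent == "*":
                if star is None:
                    star = group
            elif agent and token.startswith(agent) and len(agent) > best_len:
                best, best_len = group, len(agent)
    return best if best is not None else star


def robots_allowed(url, groups, token=UA_TOKEN):
    group = select_robots_group(groups, token)
    if group is None:
        return True  # no applicable group: nothing is disallowed
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in group["disallows"])
```

With a file that disallows everything for `*` but only `/private` for Walsh-Research, the named group is the one that applies, so `/feed.xml` is allowed and `/private/x` is not.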
Opt-out via robots (what we honor):
User-agent: Walsh-Research
Disallow: /
5.2. R3 — Operator blocklist (MUST)
Some opt-outs arrive out-of-band (email to j@wal.sh). These live in one published JSON contract that every Walsh-Research tool MUST consult, so an opt-out is honored uniformly across bots and languages. This is in addition to robots.txt and is checked first.
Contract walsh-research-blocklist/v1:
{
"contract": "walsh-research-blocklist/v1",
"updated": "2026-05-23T00:00:00Z",
"operator": "Jason Walsh",
"contact": "j@wal.sh",
"policy": "https://wal.sh/bot/",
"refresh": "PT6H",
"blocked": [
{ "domain": "example.com", "added": "2026-05-23", "reason": "email opt-out" }
]
}
Requirements:
- A tool MUST fetch the blocklist and skip any request whose host equals or is a subdomain of a listed domain (example.com blocks www.example.com).
- A tool SHOULD validate the document against the published blocklist.schema.json before adopting it, and MUST NOT adopt a document that fails validation (retain the previous list instead).
- A tool SHOULD cache the list for the contract's refresh duration (ISO-8601). Timing is dictated by the data source, not hard-coded.
- On fetch failure or 404 a tool MUST retain the last-known list rather than clearing it: a transient outage MUST NOT silently un-block an opt-out. (A first-ever fetch failure yields an empty list = fail open.)
(defn blocked? [url] ; host apex + subdomain match
  (let [{:keys [domains]} (cached-blocklist)] ; validated against schema; TTL from :refresh
    (domain-blocked? (host-of url) domains)))
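The blocklist requirements could be sketched in Python roughly as below. All names here (`refresh_seconds`, `domain_blocked`, `looks_valid`, `Blocklist`) are hypothetical; `looks_valid` is only a stand-in for validation against the published JSON Schema, and the six-hour fallback TTL is an assumption borrowed from the example document's "PT6H".

```python
import re
from urllib.parse import urlsplit

# Time-only ISO-8601 durations such as "PT6H" or "PT1H30M" (assumed subset).
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")
DEFAULT_TTL_S = 6 * 3600  # assumption: fall back to the example's PT6H


def refresh_seconds(duration, default=DEFAULT_TTL_S):
    match = _DURATION.match(duration or "")
    if not match or not any(match.groups()):
        return default
    hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def domain_blocked(host, domains):
    """True when host equals a listed domain or is a subdomain of one."""
    host = host.lower().rstrip(".")
    return any(host == d.lower() or host.endswith("." + d.lower()) for d in domains)


def looks_valid(document):
    """Stand-in for JSON Schema validation (not the published schema)."""
    return (
        isinstance(document, dict)
        and document.get("contract") == "walsh-research-blocklist/v1"
        and isinstance(document.get("blocked"), list)
        and all(isinstance(e, dict) and isinstance(e.get("domain"), str)
                for e in document["blocked"])
    )


class Blocklist:
    def __init__(self):
        self.domains = []  # empty until the first successful fetch: fail open
        self.ttl = DEFAULT_TTL_S

    def update(self, document):
        """Adopt a fetched document; None (fetch failure, 404) or an
        invalid document leaves the last-known list in place."""
        if not looks_valid(document):
            return False
        self.domains = [entry["domain"] for entry in document["blocked"]]
        self.ttl = refresh_seconds(document.get("refresh"))
        return True

    def blocked(self, url):
        return domain_blocked(urlsplit(url).hostname or "", self.domains)
```

The point of routing every refresh through `update` is that a failed or invalid fetch can never shrink the list, so an opt-out survives a transient outage.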
5.3. R4 — Rate limiting and Crawl-delay (MUST)
- Requests MUST be serial per process — no concurrent connections.
- A tool MUST NOT exceed one request per second per domain.
- A tool MUST honor a host's robots Crawl-delay: the spacing before the next request to a host is max(1 second, Crawl-delay).
- The first request to a previously-unseen host MUST NOT be delayed.
(def min-request-interval-ms 1000) ; <= 1 req/s/domain

(defn throttle! [url] ; block until this host's slot is free
  (let [host     (host-of url)
        interval (max min-request-interval-ms (crawl-delay-ms url)) ; robots Crawl-delay
        last     (get @host-last-request host) ; nil when unseen -> no wait
        wait     (if last (max 0 (- (+ last interval) (now))) 0)]
    (when (pos? wait) (sleep wait))
    (swap! host-last-request assoc host (now)))) ; record the time after any wait
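A Python sketch of the same pacing rule, with the clock and sleep injected so the behavior can be checked without real waiting. The `Throttle` class and its parameter names are hypothetical; the sketch assumes a single-process, serial caller as R4 requires.

```python
import time

MIN_INTERVAL_S = 1.0  # <= 1 request per second per domain


class Throttle:
    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> clock reading at the last request

    def wait(self, host, crawl_delay_s=0.0):
        """Block until the host's slot is free; return the seconds waited."""
        interval = max(MIN_INTERVAL_S, crawl_delay_s or 0.0)
        last = self._last.get(host)
        # A previously-unseen host is never delayed.
        delay = 0.0 if last is None else max(0.0, last + interval - self._clock())
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()  # record the time after any wait
        return delay
```

Passing a fake clock (for example a list holding the current time, advanced by the fake sleep) makes the two edge cases easy to verify: the first request to a host returns a zero delay, and a second immediate request waits the full interval.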
5.4. R5 — Backoff and Retry-After (MUST)
On HTTP 429 or 503 a tool MUST back off and retry with exponential
backoff plus jitter, and MUST respect a Retry-After header when present
(use it instead of the computed delay). A bounded number of retries SHOULD be
used before giving up.
(defn with-backoff [thunk] ; thunk -> http response
  (loop [n 0]
    (let [resp (thunk)]
      (if (and (#{429 503} (:status resp)) (< n max-retries))
        (do (sleep (or (retry-after-ms resp)
                       (* base-ms (Math/pow 2 n) (rand))))
            (recur (inc n)))
        resp))))
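The Retry-After header can carry either a number of seconds or an HTTP date, so a helper like `retry_after_ms` has to handle both. The Python sketch below is one way to do that; the function names, `BASE_MS`, and the assumption that `headers` is a mapping keyed by the canonical header name are illustrative, not taken from the reference implementation.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

BASE_MS = 1000  # assumed base delay; the spec does not fix a value


def retry_after_ms(headers, now=None):
    """Return the Retry-After hint in milliseconds, or None if absent/unusable."""
    value = headers.get("Retry-After")
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():  # delta-seconds form
        return int(value) * 1000
    try:  # HTTP-date form
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0, int((when - now).total_seconds() * 1000))


def backoff_delay_ms(attempt, headers, rng=random.random):
    """Use Retry-After when present, else exponential backoff with jitter."""
    hinted = retry_after_ms(headers)
    if hinted is not None:
        return hinted
    return int(BASE_MS * (2 ** attempt) * rng())
```

The jitter here scales the whole exponential delay by a random factor in [0, 1), matching the shape of the Clojure expression; other jitter strategies would satisfy the requirement equally well.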
6. Scope and data-handling limits
6.1. R6 — Stay in scope (MUST / MUST NOT)
- A tool MUST NOT follow links or crawl recursively. It fetches a declared set of targets only.
- A tool MUST NOT download sub-resources (images, scripts, stylesheets).
- A tool MUST NOT train models on, index for search, or republish fetched content. Extract metadata, not content.
- A tool MUST NOT store content beyond what the analysis needs.
6.2. R7 — Be a cache-friendly citizen (SHOULD)
A tool SHOULD send conditional requests using stored validators
(If-None-Match / If-Modified-Since) and treat 304 Not Modified as "no new
data" without re-processing. A tool SHOULD dedup results by canonical URL
(strip #fragment and trailing slash) so repeated runs accumulate first-seen
items only.
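A short Python sketch of both halves of R7: building conditional request headers from stored validators, and deduplicating by canonical URL. The function names and the shape of the `validators` and `items` dictionaries are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Strip the #fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def conditional_headers(validators):
    """Map stored validators (from a previous response) to request headers."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


def first_seen(items, seen):
    """Return only items whose canonical URL is not yet in `seen`."""
    fresh = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

Persisting the `seen` set between runs is what makes repeated runs accumulate first-seen items only; a 304 response simply means there is nothing to pass to `first_seen`.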
6.3. R8 — Frequency (SHOULD)
A tool SHOULD run infrequently (daily or less) per source.
6.4. R9 — Prefer structured formats; never scrape HTML (SHOULD / MUST NOT)
A tool MUST NOT extract content by scraping arbitrary HTML with per-site selectors – an unbounded maintenance burden, and easy to do impolitely.
For a feed-less source a tool SHOULD obtain a structured representation, preferring, in order:
- Content negotiation: request Accept: text/markdown and use the body only when the server returns a markdown content-type (e.g. Mintlify-hosted docs, wal.sh).
- llms.txt convention: fetch /llms.txt at the host root, a curated markdown index of the site's key pages (https://llmstxt.org).
Extract metadata generically from the markdown (links, headings); a tool MUST NOT rely on per-site structure. This keeps feed-less coverage clean and consistent with R6 (metadata, not content). A source that offers neither a feed nor markdown is left out rather than scraped.
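As a rough Python illustration of the content-negotiation rule and of generic metadata extraction: accept the body only when the response is labeled markdown, then pull out headings and links with site-independent patterns. The function names and the regular expressions are assumptions; they cover common inline links and ATX headings, not the full markdown grammar.

```python
import re

# Inline links: [title](url) with an optional "quoted title" after the URL.
LINK = re.compile(r'\[([^\]]+)\]\((\S+?)(?:\s+"[^"]*")?\)')
# ATX headings: one to six '#' characters followed by text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)[ \t]*$", re.MULTILINE)


def media_type(content_type):
    """Lowercased media type without parameters such as charset."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def usable_markdown(content_type, body):
    """Return the body only when the server actually sent markdown."""
    return body if media_type(content_type) == "text/markdown" else None


def extract_metadata(markdown):
    """Site-independent metadata: heading levels/titles and link targets."""
    return {
        "headings": [(len(marks), title) for marks, title in HEADING.findall(markdown)],
        "links": LINK.findall(markdown),
    }
```

A server that ignores the Accept header and returns text/html yields `None` from `usable_markdown`, which is the signal to try /llms.txt or leave the source out.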
6.4.1. Note: the llms.txt convention (https://llmstxt.org)
llms.txt is a proposed standard (Jeremy Howard / Answer.AI, 2024) for a
markdown file at a site's root that gives automated readers a curated, concise
map of the site, instead of forcing them to parse navigation-heavy HTML. Two
files:
- /llms.txt: a curated index. An H1 site name, an optional blockquote summary, then sections of markdown links ([title](url): note) to the pages that matter. This is what a tool SHOULD fetch and parse for links.
- /llms-full.txt: the same pages' full content concatenated as markdown, for tools that want the text in one request. A metadata-only crawler (this contract) uses the index, not the full dump.
It is served as a plain file (often text/plain), so a tool SHOULD accept a
text/plain body at /llms.txt even though it is markdown. Adoption is
growing among docs platforms (Mintlify auto-generates it; Cloudflare, Anthropic
docs, Stripe, Next.js, shadcn, Perplexity ship one). It is advisory, not an
access-control mechanism: robots.txt (R2) and the blocklist (R3) still govern
whether a tool may fetch; llms.txt only offers a cleaner representation
once allowed. A tool MAY probe /llms.txt during source discovery to decide
whether a feed-less site is crawlable without scraping.
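Parsing the /llms.txt index described above might look like the following Python sketch. `parse_llms_txt` and `acceptable_llms_type` are hypothetical names, and the parser assumes the layout given in this note (H1 title, optional blockquote, H2 sections of list-item links); real files may deviate from it.

```python
import re

# A list item of the form "- [title](url): optional note".
ENTRY = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$")
ACCEPTED_TYPES = {"text/markdown", "text/plain"}


def acceptable_llms_type(content_type):
    """llms.txt is markdown but is often served as text/plain."""
    return (content_type or "").split(";", 1)[0].strip().lower() in ACCEPTED_TYPES


def parse_llms_txt(text):
    index = {"title": None, "summary": None, "sections": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and index["title"] is None:
            index["title"] = line[2:].strip()
        elif line.startswith("> ") and index["summary"] is None and section is None:
            index["summary"] = line[2:].strip()
        elif line.startswith("## "):
            section = line[3:].strip()
            index["sections"][section] = []
        elif section is not None:
            match = ENTRY.match(line)
            if match:
                index["sections"][section].append(
                    {"title": match.group(1), "url": match.group(2),
                     "note": match.group(3) or ""}
                )
    return index
```

Each URL recovered this way is still subject to the blocklist (R3), robots.txt (R2), and pacing (R4) before it is fetched.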
7. Opt-out workflow (operator side)
- A site can self-serve via robots.txt (R2) at any time — honored within the robots cache TTL.
- Or email j@wal.sh; the operator appends a {domain, added, reason} entry to blocklist.json and deploys. All bots honor it within the refresh TTL.
A tool implementer's job is only to consume both sources correctly.
8. Conformance checklist
A tool is conformant when every MUST below holds. Use this as a self-audit (and as the basis for an automated attestation).
| # | Requirement | MUST |
|---|---|---|
| 1 | Exact User-Agent string | ✓ |
| 2 | robots.txt fetched + obeyed, RFC 9309 group selection | ✓ |
| 2a | Named Walsh-Research group overrides * | ✓ |
| 3 | Operator blocklist consulted, checked before robots | ✓ |
| 3a | apex + subdomain match | ✓ |
| 3b | retain last list on fetch failure | ✓ |
| 3c | validate against schema; reject invalid (SHOULD) | ◐ |
| 4 | serial; <= 1 req/s/domain; honor Crawl-delay | ✓ |
| 5 | 429/503 exponential backoff + jitter; Retry-After | ✓ |
| 6 | no recursion, no sub-resources, metadata-only | ✓ |
| 7 | conditional fetch + dedup (SHOULD) | ◐ |
| 8 | infrequent (SHOULD) | ◐ |
| 9 | feed-less: content-neg markdown / llms.txt; never scrape HTML (SHOULD / MUST NOT) | ◐ |
9. Porting notes
The logic is identical across languages; only the four primitives change.
| Primitive | Clojure | Python | Rust | Guile | TypeScript | Go |
|---|---|---|---|---|---|---|
| HTTP client | clj-http | httpx / requests | reqwest | (web client) | fetch / undici | net/http |
| XML parse | clojure.data.xml | xml.etree | quick-xml | (sxml) | fast-xml-parser | encoding/xml |
| JSON parse | cheshire | json | serde_json | (guile-json) | JSON.parse | encoding/json |
| Clock+sleep | System/currentTimeMillis / Thread/sleep | time / sleep | std::time / thread::sleep | (current-time / usleep) | Date.now / setTimeout | time |
| SHA-256 | java.security.MessageDigest | hashlib | sha2 | (gcrypt) | crypto.subtle | crypto/sha256 |
Cross-language sketch of the pre-request gate (pseudocode, applies verbatim):
function may_fetch(url):
if blocked_by_operator(url): return DENY # R3, checked first
if not robots_allows(url): return DENY # R2, RFC 9309 group selection
throttle(url) # R4, max(1s, crawl_delay)
return ALLOW
# Python: the same constants and order
UA = "Mozilla/5.0 (compatible; Walsh-Research/1.0; +https://wal.sh/bot/)"
UA_TOKEN = "Walsh-Research"

def may_fetch(url):
    if blocked(url): return False             # R3
    if not robots_allowed(url): return False  # R2
    throttle(url)                             # R4
    return True
10. Versioning
This spec is versioned (walsh-research-compliance/vN). Breaking changes to a
MUST bump the major. The blocklist data contract
(walsh-research-blocklist/vN) versions independently. Tools SHOULD record
which spec version they target so attestations are comparable.
11. Changelog
- v1 (2026-05-23) — initial: R1 UA, R2 robots/RFC 9309 + Crawl-delay, R3 operator blocklist + schema, R4 rate limit, R5 backoff, R6 scope limits, R7 conditional fetch + dedup, R8 frequency.
- v1, rev 2 (2026-05-23) — add R9: prefer structured formats (content negotiation / llms.txt) for feed-less sources; never scrape HTML.