Bot Operator Contract Surface — Methodology and Best Practices
1. Frame
This document is a contract for bot operators, derived from the
wal.sh access-log population study (2026-05-23) and cross-referenced
against current IETF drafts, Cloudflare Verified Bots policy, and the
Cloudflare Agent Readiness scanner.
The goal is operational: if I build a bot tomorrow, what is the minimum I publish, sign, and document so that a well-run origin treats my traffic as attested rather than discarded? The answer has four tiers. The bottom two are universally implemented; the top two are the standardization frontier.
2. Methodology: what the log study revealed
The histogram pass on wal.sh access logs surfaced four attestation
tiers in the wild, numbered 1 to 4 below, above a Tier 0 baseline that
attests nothing. The population splits cleanly along them.
2.1. Tier 0: bare UA assertion
Examples from the corpus:
- Mozilla/5.0 (compatible; crawler)
- Googlebot-Image/1.0 (identifier, no contract surface)
No operator linkage, no policy URL, no email, no signature. Edge policy decisions cannot distinguish these from spoofed traffic.
2.2. Tier 1: email contact in UA
The legacy convention. Operator publishes a contact mailto in the UA comment block:
- Bytespider; spider-feedback@bytedance.com
- ClaudeBot/1.0; +claudebot@anthropic.com
Attested operator identity but no machine-readable policy surface.
2.3. Tier 2: +URL convention in UA
The dominant attestation (~95% by volume in the wal.sh corpus):
- +https://developer.amazon.com/support/amazonbot (1501 hits)
- +https://openai.com/bot (177)
- +https://perplexity.ai/perplexitybot
Machine-readable enough to histogram. Still unauthenticated: the UA field is plaintext and forgeable.
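The split between Tiers 0, 1, and 2 can be reproduced from the UA field alone. The following is a minimal sketch, not the study's actual histogram pass: the regular expressions and the function names `ua_tier` and `tier_histogram` are illustrative assumptions, and the classifier only inspects the UA string, so it cannot detect Tier 3 or Tier 4.

```python
import re
from collections import Counter

# Assumed patterns: a "+URL" token, or an email address (optionally "+"-prefixed).
PLUS_URL_RE = re.compile(r"\+https?://[^\s;)]+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def ua_tier(user_agent: str) -> int:
    """Return 2 for a +URL, 1 for an email contact only, 0 for a bare assertion."""
    if PLUS_URL_RE.search(user_agent):
        return 2
    if EMAIL_RE.search(user_agent):
        return 1
    return 0


def tier_histogram(user_agents) -> Counter:
    """Count user agents per tier."""
    return Counter(ua_tier(ua) for ua in user_agents)


SAMPLE_UAS = [
    "Mozilla/5.0 (compatible; crawler)",
    "Bytespider; spider-feedback@bytedance.com",
    "ClaudeBot/1.0; +claudebot@anthropic.com",
    "MyBot/1.0 (+https://example.com/bot; bot-operator@example.com)",
]
```

Calling `tier_histogram(SAMPLE_UAS)` groups the sample strings by tier. A UA that carries both a +URL and an email is counted as Tier 2, since the URL is the stronger surface.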
2.4. Tier 3: reverse-DNS / IP allowlist verification
Googlebot's model. Operator publishes IP ranges; origin verifies that the client IP's reverse DNS resolves to a hostname under the operator's domain, then forward-resolves that hostname to confirm it maps back to the same IP. Defeats UA spoofing for operators that publish.
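The reverse-then-forward check can be sketched on the origin side with the standard library. This is an illustration under assumptions: the function name `verify_fcrdns` and the `crawl.example.com` suffix are hypothetical, and the resolver functions are injectable so the logic can be exercised without live DNS.

```python
import ipaddress
import socket


def verify_fcrdns(ip, allowed_suffixes,
                  reverse=socket.gethostbyaddr,
                  forward=socket.getaddrinfo):
    """Forward-confirmed reverse DNS: PTR lookup, suffix check, forward lookup."""
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record, or the lookup failed

    # The hostname must be the operator domain itself or a subdomain of it.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False

    try:
        infos = forward(hostname, None)
    except OSError:
        return False

    # Confirm that the hostname resolves back to the connecting address.
    target = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(info[4][0]) == target for info in infos)
```

A call such as `verify_fcrdns("203.0.113.1", ["crawl.example.com"])` performs two live lookups, so results are worth caching per IP. The suffix match uses a leading dot so that a hostname like `crawl.example.com.evil.test` does not pass.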
2.5. Tier 4: Web Bot Auth (HTTP Message Signatures)
The standardization frontier. Operator signs every request with
Ed25519; public key at /.well-known/http-message-signatures-directory.
IETF draft draft-meunier-web-bot-auth-architecture-05 (March 2026).
Cloudflare integrated July 2025. Production operators: OpenAI,
Browserbase, Manus.
3. The bot operator checklist
3.1. Tier 1+2: UA string
Format that satisfies both conventions:
MyBot/1.0 (+https://example.com/bot; bot-operator@example.com)
For wal.sh outbound requests:
wal.sh-research/1.0 (+https://wal.sh/bots/; research@wal.sh)
3.2. Tier 2: policy URL contents
The page at the +URL should publish:
- User-Agent string (exact, parseable)
- Operator name and contact
- Purpose (what it collects and why)
- Crawl behavior (robots.txt, concurrency, backoff)
- Content usage (training yes/no, inference yes/no, retention)
- Identity verification (IP ranges, Web Bot Auth directory)
- Opt-out mechanism
3.3. Tier 2.5: robots.txt compliance (RFC 9309)
- Fetch /robots.txt before crawling any other path
- Cache for no longer than 24 hours
- Match User-Agent tokens case-insensitively
- Honor Crawl-delay even though it is not part of RFC 9309 proper
- On 4xx ("unavailable"), fall back to "allow all" per RFC 9309 §2.3.1.3; on 5xx ("unreachable"), assume complete disallow per §2.3.1.4
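The parsing, matching, delay, and cache-lifetime items above can be sketched with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the `CachedRobots` wrapper and the sample file are hypothetical, fetching and HTTP status handling are left out, and the standard-library parser predates RFC 9309, so it should not be read as a complete implementation of that RFC.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TTL_SECONDS = 24 * 60 * 60  # cache ceiling from the checklist

SAMPLE_ROBOTS = """\
User-agent: MyBot
Disallow: /private/
Crawl-delay: 10
"""


class CachedRobots:
    """Parsed robots.txt plus the time it was fetched (hypothetical wrapper)."""

    def __init__(self, robots_txt: str, fetched_at: float):
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        self.fetched_at = fetched_at

    def is_fresh(self, now=None) -> bool:
        """True while the cached copy is younger than 24 hours."""
        now = time.time() if now is None else now
        return now - self.fetched_at < ROBOTS_TTL_SECONDS

    def allowed(self, user_agent: str, url: str) -> bool:
        # The parser lowercases both sides, so token matching ignores case.
        return self.parser.can_fetch(user_agent, url)

    def delay(self, user_agent: str) -> float:
        """Crawl-delay in seconds for this agent, or 0.0 when none is set."""
        value = self.parser.crawl_delay(user_agent)
        return float(value) if value is not None else 0.0
```

A crawler would refetch once `is_fresh()` returns false, and sleep for `delay(...)` seconds between requests to the same host.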
3.4. Tier 3: IP range publication
Publish a JSON file:
{
"creationTime": "2026-05-23T00:00:00Z",
"prefixes": [
{ "ipv4Prefix": "203.0.113.0/24" },
{ "ipv6Prefix": "2001:db8::/32" }
]
}
Configure reverse DNS: crawl-203-0-113-1.crawl.example.com.
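An origin consuming this file only needs a containment check against the published prefixes. The sketch below assumes the JSON shape shown above; the function names `load_prefixes` and `ip_in_ranges` are illustrative, and the addresses come from documentation ranges.

```python
import ipaddress
import json

PUBLISHED = """
{
  "creationTime": "2026-05-23T00:00:00Z",
  "prefixes": [
    { "ipv4Prefix": "203.0.113.0/24" },
    { "ipv6Prefix": "2001:db8::/32" }
  ]
}
"""


def load_prefixes(document: str):
    """Parse the published JSON into a list of ip_network objects."""
    data = json.loads(document)
    networks = []
    for entry in data["prefixes"]:
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_ranges(ip: str, networks) -> bool:
    """True if the address falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    return any(addr.version == net.version and addr in net for net in networks)
```

With `nets = load_prefixes(PUBLISHED)`, the call `ip_in_ranges("203.0.113.7", nets)` checks a client address against the operator's ranges. The file should be refetched periodically, since `creationTime` signals that the list changes.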
3.5. Tier 4: Web Bot Auth
Per IETF draft and Cloudflare integration:
- Generate Ed25519 signing key
- Host the directory at /.well-known/http-message-signatures-directory
- Sign every outbound request per RFC 9421
Reference implementation: https://github.com/cloudflareresearch/web-bot-auth
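The pieces of a signed request that need no cryptography library are the key identifier (an RFC 7638 JWK thumbprint) and the RFC 9421 signature base. The sketch below builds both with the standard library. It makes several assumptions that should be checked against the current draft and the reference implementation: the covered components (`@authority`, `signature-agent`), the `tag="web-bot-auth"` parameter, the placeholder key bytes, and the `bot.example.com` host. The Ed25519 signing step is omitted because it requires a third-party library.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used by JWK."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(public_key: bytes) -> str:
    """RFC 7638 SHA-256 thumbprint of an Ed25519 public key as an OKP JWK."""
    jwk = {"crv": "Ed25519", "kty": "OKP", "x": b64url(public_key)}
    canonical = json.dumps(jwk, separators=(",", ":"), sort_keys=True)
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def signature_params(components, created: int, expires: int, keyid: str) -> str:
    """Serialize the covered components and signature parameters."""
    covered = " ".join(f'"{name}"' for name in components)
    return (f'({covered});created={created};expires={expires};'
            f'keyid="{keyid}";alg="ed25519";tag="web-bot-auth"')


def signature_base(values: dict, params: str) -> str:
    """One line per covered component, then the @signature-params line."""
    lines = [f'"{name}": {value}' for name, value in values.items()]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines)  # LF-separated, no trailing newline


PLACEHOLDER_KEY = bytes(32)  # assumption: stands in for a real 32-byte public key
KEYID = jwk_thumbprint(PLACEHOLDER_KEY)
VALUES = {
    "@authority": "example.com",
    "signature-agent": '"https://bot.example.com"',  # hypothetical directory host
}
PARAMS = signature_params(list(VALUES), 1747958400, 1747958700, KEYID)
BASE = signature_base(VALUES, PARAMS)
```

The bytes of `BASE` are what the Ed25519 key signs, and `PARAMS` is what goes into the `Signature-Input` header. Keeping the `expires` window short limits the value of a replayed signature.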
4. wal.sh bot population (May 2026)
23 bots observed in access logs, classified by tier and purpose:
| Bot | Operator | Purpose | Tier | Crawl-delay |
|---|---|---|---|---|
| Googlebot | Google | Search index | 2+3 | (Search Console) |
| Bingbot | Microsoft | Search index | 2+3 | — |
| Applebot | Apple | Siri/Spotlight | 2 | — |
| DuckDuckBot | DuckDuckGo | Search index | 2 | — |
| GPTBot | OpenAI | AI training | 2 | — |
| OAI-SearchBot | OpenAI | ChatGPT search | 2 | — |
| ChatGPT-User | OpenAI | User browsing | 2 | — |
| PerplexityBot | Perplexity | AI search | 2 | — |
| DuckAssistBot | DuckDuckGo | AI answers | 2 | — |
| Amazonbot | Amazon | Alexa answers | 2 | 60 |
| Bytespider | ByteDance | Training/search | 1 | 30 |
| AhrefsBot | Ahrefs | SEO | 2 | 10 |
| SemrushBot | Semrush | SEO | 2 | 10 |
| Meta crawler | Meta | Social preview | 2 | — |
| PetalBot | Huawei | Search | 2 | — |
| YandexBot | Yandex | Search | 2 | — |
| SeznamBot | Seznam | Czech search | 2 | — |
| QwantBot | Qwant | French search | 2 | — |
| Sogou | Sogou | Chinese search | 2 | — |
| LinkupBot | Linkup | Web index | 2 | — |
| SleepBot | Unknown | Unknown | 0 | — |
5. Adjacent specs
- RFC 9309: Robots Exclusion Protocol
- RFC 9421: HTTP Message Signatures
- RFC 7638: JWK Thumbprint
- draft-meunier-web-bot-auth-architecture-05: Web Bot Auth
- draft-romm-aipref-contentsignals: Content Signals (what wal.sh uses)
- llms.txt (llmstxt.org): LLM discovery aid (wal.sh publishes one)
- Cloudflare Verified Bots policy
- Cloudflare Web Bot Auth docs
- Wikimedia Robot Policy — the most operationally precise public example
6. Related
- Cloudflare Agents Week 2026 — agent readiness, sandbox GA
- CLI Coding Agents Q2 — agents that make outbound tool calls
- Agent Memory Systems — agent identity and provenance