Bot Operator Contract Surface — Methodology and Best Practices

1. Frame

This document is a contract for bot operators, derived from the wal.sh access-log population study (2026-05-23) and cross-referenced against current IETF drafts, Cloudflare Verified Bots policy, and the Cloudflare Agent Readiness scanner.

The goal is operational: if I build a bot tomorrow, what is the minimum I publish, sign, and document so that a well-run origin treats my traffic as attested rather than discarded? The answer has four attestation tiers above a bare-UA baseline (Tier 0). The bottom two are universally implemented; the top two are the standardization frontier.

2. Methodology: what the log study revealed

The histogram pass on wal.sh access logs surfaced a bare-UA baseline (Tier 0) and four attestation tiers above it. The population splits cleanly along them.

2.1. Tier 0: bare UA assertion

Examples from the corpus:

  • Mozilla/5.0 (compatible; crawler)
  • Googlebot-Image/1.0 (identifier, no contract surface)

No operator linkage, no policy URL, no email, no signature. Edge policy decisions cannot distinguish these from spoofed traffic.

2.2. Tier 1: email contact in UA

The legacy convention. Operator publishes a contact mailto in the UA comment block:

  • Bytespider; spider-feedback@bytedance.com
  • ClaudeBot/1.0; +claudebot@anthropic.com

Attested operator identity but no machine-readable policy surface.

2.3. Tier 2: +URL convention in UA

The dominant attestation (~95% by volume in the wal.sh corpus):

  • +https://developer.amazon.com/support/amazonbot (1501 hits)
  • +https://openai.com/bot (177 hits)
  • +https://perplexity.ai/perplexitybot

Machine-readable enough to histogram. Still unauthenticated: the UA field is plaintext and forgeable.
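
Tiers 0 through 2 can be told apart from the UA string alone, which is what makes the histogram pass possible. The following is a minimal sketch of such a classifier, not the tooling used in the study; the regular expressions and the function names classify_ua and tier_histogram are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Tuple

# Assumed patterns: a "+URL" token, and a bare email address, inside the UA.
POLICY_URL = re.compile(r"\+(https?://[^\s;)]+)")
EMAIL = re.compile(r"\w[\w.+-]*@[\w-]+(?:\.[\w-]+)+")


def classify_ua(ua: str) -> Tuple[int, Optional[str]]:
    """Return (tier, attestation) for tiers 0-2, judged from the UA alone."""
    url = POLICY_URL.search(ua)
    if url:
        return 2, url.group(1)      # Tier 2: +URL convention
    email = EMAIL.search(ua)
    if email:
        return 1, email.group(0)    # Tier 1: email contact
    return 0, None                  # Tier 0: bare assertion


def tier_histogram(user_agents: Iterable[str]) -> Counter:
    """Count log lines per tier."""
    return Counter(classify_ua(ua)[0] for ua in user_agents)
```

Because the UA is forgeable, the result is a label for what the client claims, not a verification of who sent the request.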

2.4. Tier 3: reverse-DNS / IP allowlist verification

Googlebot's model. Operator publishes IP ranges; origin verifies reverse DNS resolves to the operator's domain, then forward-resolves to confirm. Defeats UA spoofing for operators that publish.
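
The reverse-then-forward check described above can be sketched on the origin side roughly as follows. This is an illustration under assumptions: the function name verify_fcrdns, the allowed_suffixes parameter, and the injectable reverse/forward resolvers are hypothetical, and a production check would also cache results and bound lookup time.

```python
import ipaddress
import socket


def verify_fcrdns(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS: PTR lookup, domain check, forward lookup."""
    # Default resolvers use the system resolver; tests can inject fakes.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (
        lambda host: {info[4][0] for info in socket.getaddrinfo(host, None)}
    )
    try:
        target = ipaddress.ip_address(ip)
        host = reverse(ip).rstrip(".").lower()
        # The PTR name must sit under a domain the operator controls.
        if not any(host == s or host.endswith("." + s) for s in allowed_suffixes):
            return False
        # The forward lookup must return the address we started from.
        return any(ipaddress.ip_address(a) == target for a in forward(host))
    except (OSError, ValueError):
        return False
```

The forward step matters because anyone who controls reverse DNS for their own address space can make a PTR record claim any hostname; only the operator can make that hostname resolve back to the address.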

2.5. Tier 4: Web Bot Auth (HTTP Message Signatures)

The standardization frontier. The operator signs every request with an Ed25519 key and publishes the public key at /.well-known/http-message-signatures-directory. Specified in draft-meunier-web-bot-auth-architecture-05 (March 2026); Cloudflare integrated it in July 2025. Production operators: OpenAI, Browserbase, Manus.

3. The bot operator checklist

3.1. Tier 1+2: UA string

Format that satisfies both conventions:

MyBot/1.0 (+https://example.com/bot; bot-operator@example.com)

For wal.sh outbound requests:

wal.sh-research/1.0 (+https://wal.sh/bots/; research@wal.sh)

3.2. Tier 2: policy URL contents

The page at the +URL should publish:

  • User-Agent string (exact, parseable)
  • Operator name and contact
  • Purpose (what it collects and why)
  • Crawl behavior (robots.txt, concurrency, backoff)
  • Content usage (training yes/no, inference yes/no, retention)
  • Identity verification (IP ranges, Web Bot Auth directory)
  • Opt-out mechanism

3.3. Tier 2.5: robots.txt compliance (RFC 9309)

  1. Fetch /robots.txt before crawling any other path
  2. Cache for no longer than 24 hours
  3. Match User-Agent tokens case-insensitively
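  4. Honor Crawl-delay even though it is not part of RFC 9309
  5. On 4xx, treat robots.txt as unavailable and fall back to "allow all" (RFC 9309 §2.3.1.3); on 5xx, treat it as unreachable and assume complete disallow (§2.3.1.4)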
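
Steps 2 through 4 can be sketched with Python's standard urllib.robotparser. This is a minimal illustration that assumes the robots.txt body has already been fetched; the class name CachedRobots and the constant MAX_AGE_SECONDS are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

MAX_AGE_SECONDS = 24 * 60 * 60  # step 2: do not reuse a copy older than 24 hours


class CachedRobots:
    """Parsed robots.txt plus the time it was fetched."""

    def __init__(self, robots_txt: str, fetched_at: float):
        self.fetched_at = fetched_at
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())

    def is_stale(self, now=None) -> bool:
        now = time.time() if now is None else now
        return now - self.fetched_at > MAX_AGE_SECONDS

    def allowed(self, user_agent: str, url: str) -> bool:
        # step 3: the stdlib parser lowercases tokens before matching
        return self.parser.can_fetch(user_agent, url)

    def delay(self, user_agent: str) -> int:
        # step 4: Crawl-delay is a de facto extension; 0 when absent
        return self.parser.crawl_delay(user_agent) or 0
```

One caveat: as far as I can tell, the standard-library parser evaluates rules in file order rather than by the longest-match rule of RFC 9309, so a crawler that needs strict conformance may want a dedicated parser.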

3.4. Tier 3: IP range publication

Publish a JSON file:

{
  "creationTime": "2026-05-23T00:00:00Z",
  "prefixes": [
    { "ipv4Prefix": "203.0.113.0/24" },
    { "ipv6Prefix": "2001:db8::/32" }
  ]
}

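Configure reverse DNS (PTR) for each crawl address, for example crawl-203-0-113-1.crawl.example.com for 203.0.113.1, and make sure that name forward-resolves to the same address.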
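
On the consuming side, an origin can test a client address against a file of this shape with the standard ipaddress module. A minimal sketch, assuming the field names shown in the JSON above; the function names load_prefixes and ip_in_published_ranges are hypothetical.

```python
import ipaddress
import json


def load_prefixes(doc: str):
    """Parse the published JSON into a list of network objects."""
    data = json.loads(doc)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def ip_in_published_ranges(ip: str, networks) -> bool:
    """True if the address falls inside any published prefix."""
    addr = ipaddress.ip_address(ip)
    # Membership across IP versions is simply False, so mixed lists are safe.
    return any(addr in net for net in networks)
```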

3.5. Tier 4: Web Bot Auth

Per IETF draft and Cloudflare integration:

  1. Generate Ed25519 signing key
  2. Host the directory at /.well-known/http-message-signatures-directory
  3. Sign every outbound request per RFC 9421

Reference implementation: https://github.com/cloudflareresearch/web-bot-auth
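
To make the signing step less abstract, the sketch below builds two inputs that a signer needs: a key identifier computed as an RFC 7638 JWK thumbprint, and an RFC 9421 signature base covering @authority. It is an illustration under assumptions, not the reference implementation: the parameter set (created, expires, keyid, tag="web-bot-auth") reflects my reading of the draft and should be checked against the current revision, the function names and the "sig1" label are hypothetical, and the Ed25519 signature itself is omitted because it needs a third-party library.

```python
import base64
import hashlib
import json


def jwk_thumbprint_ed25519(x_b64url: str) -> str:
    """RFC 7638 thumbprint over the required OKP members (crv, kty, x)."""
    canonical = json.dumps(
        {"crv": "Ed25519", "kty": "OKP", "x": x_b64url},
        separators=(",", ":"),
        sort_keys=True,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_signature_base(authority: str, keyid: str, created: int, expires: int):
    """Return (signature base, Signature-Input header value)."""
    # Assumed parameter set; verify against the current draft before use.
    params = (
        f'("@authority");created={created};expires={expires}'
        f';keyid="{keyid}";tag="web-bot-auth"'
    )
    # RFC 9421: one line per covered component, then @signature-params,
    # with no trailing newline.
    base = f'"@authority": {authority}\n"@signature-params": {params}'
    return base, f"sig1={params}"
```

The signer would then sign the UTF-8 bytes of the base with the Ed25519 private key and send the result in the Signature header alongside Signature-Input.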

4. wal.sh bot population (May 2026)

23 bots observed in access logs, classified by tier and purpose:

Bot            Operator    Purpose          Tier  Crawl-delay
Googlebot      Google      Search index     2+3   (Search Console)
Bingbot        Microsoft   Search index     2+3
Applebot       Apple       Siri/Spotlight   2
DuckDuckBot    DuckDuckGo  Search index     2
GPTBot         OpenAI      AI training      2
OAI-SearchBot  OpenAI      ChatGPT search   2
ChatGPT-User   OpenAI      User browsing    2
PerplexityBot  Perplexity  AI search        2
DuckAssistBot  DuckDuckGo  AI answers       2
Amazonbot      Amazon      Alexa answers    2     60
Bytespider     ByteDance   Training/search  1     30
AhrefsBot      Ahrefs      SEO              2     10
SemrushBot     Semrush     SEO              2     10
Meta crawler   Meta        Social preview   2
PetalBot       Huawei      Search           2
YandexBot      Yandex      Search           2
SeznamBot      Seznam      Czech search     2
QwantBot       Qwant       French search    2
Sogou          Sogou       Chinese search   2
LinkupBot      Linkup      Web index        2
SleepBot       Unknown     Unknown          0

5. Adjacent specs

  • RFC 9309: Robots Exclusion Protocol
  • RFC 9421: HTTP Message Signatures
  • RFC 7638: JWK Thumbprint
  • draft-meunier-web-bot-auth-architecture-05: Web Bot Auth
  • draft-romm-aipref-contentsignals: Content Signals (what wal.sh uses)
  • llms.txt (llmstxt.org): LLM discovery aid (wal.sh publishes one)
  • Cloudflare Verified Bots policy
  • Cloudflare Web Bot Auth docs
  • Wikimedia Robot Policy — the most operationally precise public example

6. Related