Research Crawler — Sources Overview

Overview

This page lists what the agentic-research crawler (jwalsh/tech-crawler) currently checks. Every source is fetched by a compliant Walsh-Research bot that honors robots.txt (RFC 9309), the operator blocklist, and a rate limit of at most 1 request per second per domain plus any Crawl-delay directive, per the compliance spec.

  • Live: 31 sources (5 structured adapters, 2 markdown, 24 feeds)
  • Planned: 14 sources (registered, not yet crawled)
  • Total registered: 45

A source is added only after we verify a real, crawlable endpoint: a working feed, API, or markdown representation, plus a robots.txt that allows our bot.
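
The robots.txt half of that verification can be sketched with Python's standard urllib.robotparser. This is a minimal illustration, not the crawler's actual code: the user-agent token "Walsh-Research" is assumed from the bot name above, and both robots.txt bodies are hypothetical samples parsed from memory rather than fetched.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent token; the real one is defined in the compliance spec.
USER_AGENT = "Walsh-Research"


def robots_allows(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt body permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies, for illustration only.
OPEN_SITE = "User-agent: *\nDisallow: /private/\n"
ALLOWLIST_ONLY = "User-agent: Googlebot\nAllow: /\n\nUser-agent: *\nDisallow: /\n"

print(robots_allows(OPEN_SITE, "https://example.org/feed.xml"))
print(robots_allows(ALLOWLIST_ONLY, "https://example.org/feed.xml"))
```

The second sample mirrors the allowlist-style policy described for Lobsters in the compliance notes: a bot that is not named explicitly falls through to the catch-all rule and is refused.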

Live — structured adapters

Source           Type        How                  Tier
arxiv-cs-ai      papers      HTML (enlive)        1
github-trending  repos       HTML (enlive)        1
hn-front         discussion  HTML (enlive)        1
harvard-seas     events      JSON (Localist API)  1
mit-calendar     events      JSON (Localist API)  2

Live — markdown (feed-less, no scraping)

These sources are crawled via a structured representation instead of HTML scraping: either content negotiation (Accept: text/markdown) or the /llms.txt convention.

Source           How                       URL
cloudflare-docs  content negotiation (md)  developers.cloudflare.com
mcp-docs         /llms.txt index           modelcontextprotocol.io/llms.txt
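
Both mechanisms are simple enough to sketch with the Python standard library. The block below is an illustration under stated assumptions, not the crawler's implementation: the function names, the "Walsh-Research" user-agent string, and the sample /llms.txt excerpt are all hypothetical, and the request is only built, never sent.

```python
import re
from urllib.request import Request


def markdown_request(url: str, user_agent: str = "Walsh-Research") -> Request:
    """Build (but do not send) a GET request that asks for a markdown representation."""
    return Request(url, headers={"Accept": "text/markdown", "User-Agent": user_agent})


# Markdown inline links of the form [title](url).
LINK = re.compile(r"\[([^\]]+)\]\((\S+?)\)")


def llms_txt_links(body: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from the body of an /llms.txt index."""
    return LINK.findall(body)


# Hypothetical /llms.txt excerpt, for illustration only.
SAMPLE = """# Example Docs

## Docs

- [Introduction](https://example.org/docs/intro.md): Getting started
- [Specification](https://example.org/docs/spec.md)
"""

req = markdown_request("https://developers.cloudflare.com/")
print(req.get_header("Accept"))
for title, url in llms_txt_links(SAMPLE):
    print(title, url)
```

A server that does not support markdown negotiation will typically answer with HTML anyway, so a real client would presumably also check the Content-Type of the response before treating the body as markdown.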

Live — feeds (RSS/Atom)

Mechanistic interpretability & ML (mech-interp)

transformer-circuits.pub · Neel Nanda · Lilian Weng · Interconnects · Eugene Yan

Formal methods, distributed systems, correctness (formal-methods)

Aphyr/Jepsen · Antithesis · Marc Brooker · Murat Demirbas · Hillel Wayne · James Bornholt

LLM tooling & agents (agents)

Simon Willison · Latent Space · Anthropic Research · Anthropic News

Corporate AI labs (labs)

Hugging Face Blog · OpenAI · Google Research · Microsoft Research · Apple ML Research

Corporate systems / distsys / PLT (corp-systems)

Cloudflare (blog) · Netflix Tech Blog · All Things Distributed (Werner Vogels) · Jane Street

Planned (registered, not yet live)

Category      Sources
critique      Pluralistic · 404 Media · EFF Deeplinks · Logic Magazine
aviation-sdr  The Air Current · RTL-SDR
systems       Klara Systems · FreeBSD Foundation · LWN · Hackaday
aggregator    Lobsters (robots.txt disallows us; kept out)
structured    huggingface-models · papers-with-code · semantic-scholar

Compliance notes

  • Lobsters is registered but not crawled: its robots.txt disallows all non-allowlisted bots, and we honor that.
  • Crawl-delay directives are honored wherever a site sets one, e.g. arXiv 15 s, Hacker News 30 s.
  • Corporate blogs with neither a real feed nor a markdown representation (Meta AI, Uber Eng, Databricks, LangChain, …) are deliberately left out rather than scraped.
  • Full rules: Walsh-Research Bot Compliance Specification.
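
As a rough sketch of how the per-domain rate limit and Crawl-delay might combine, the block below assumes the stricter of the two always wins (the compliance specification is the authority on the real rule). The names effective_delay and DomainLimiter are hypothetical, and the clock is injectable so the logic can be exercised without sleeping.

```python
import time
from typing import Optional

MIN_INTERVAL_S = 1.0  # the "at most 1 request per second per domain" ceiling


def effective_delay(crawl_delay: Optional[float]) -> float:
    """Seconds to wait between requests to one domain (assumed: stricter limit wins)."""
    if crawl_delay is None:
        return MIN_INTERVAL_S
    return max(MIN_INTERVAL_S, float(crawl_delay))


class DomainLimiter:
    """Tracks the earliest time each domain may be requested again."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._next_allowed: dict[str, float] = {}

    def wait_needed(self, domain: str) -> float:
        """Seconds the caller must still wait before requesting domain (0.0 if none)."""
        return max(0.0, self._next_allowed.get(domain, 0.0) - self._clock())

    def record(self, domain: str, crawl_delay: Optional[float] = None) -> None:
        """Note that a request to domain was just sent."""
        self._next_allowed[domain] = self._clock() + effective_delay(crawl_delay)


print(effective_delay(None))  # no Crawl-delay: fall back to the 1 s floor
print(effective_delay(15))    # the arXiv value quoted above
print(effective_delay(0.2))   # a Crawl-delay below the floor does not loosen it
```

A caller would sleep for wait_needed(domain) before each fetch and call record(domain, crawl_delay) immediately after sending it.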