Research Crawler — Sources Overview
Overview
This page lists what the agentic-research crawler (jwalsh/tech-crawler) currently checks. Every source is fetched as a compliant Walsh-Research bot: it honors robots.txt (RFC 9309) and the operator blocklist, and it sends at most 1 request per second per domain, slowing further where a site sets Crawl-delay, per the compliance spec.
- Live: 31 sources (5 structured adapters, 2 markdown, 24 feeds)
- Planned: 14 sources (registered, not yet crawled)
- Total registered: 45
Sources are added to the live set only after verifying a real, crawlable endpoint: a working feed, API, or markdown representation, and a robots.txt that allows our bot.
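The rate rule above (at most one request per second per domain, stretched by any Crawl-delay) can be sketched as a small per-domain gate. This is an illustration only, not the crawler's actual code: the class name `DomainThrottle`, its methods, and the injectable `clock` are all hypothetical, and the sketch assumes the slower of the two limits wins.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Hypothetical per-domain politeness gate (illustration only)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic):
        self.min_interval = min_interval  # floor: 1 request per second per domain
        self.clock = clock                # injectable so the gate can be tested
        self.crawl_delay = {}             # domain -> Crawl-delay seconds from robots.txt
        self.last_fetch = {}              # domain -> clock reading of the last request

    def set_crawl_delay(self, domain, seconds):
        self.crawl_delay[domain] = float(seconds)

    def interval_for(self, domain):
        # Assumption: the slower of the two limits wins.
        return max(self.min_interval, self.crawl_delay.get(domain, 0.0))

    def wait_time(self, url):
        """Seconds to sleep before `url` may be fetched (0.0 if allowed now)."""
        domain = urlsplit(url).hostname
        last = self.last_fetch.get(domain)
        if last is None:
            return 0.0
        remaining = self.interval_for(domain) - (self.clock() - last)
        return max(0.0, remaining)

    def mark_fetched(self, url):
        self.last_fetch[urlsplit(url).hostname] = self.clock()
```

A caller would sleep for `wait_time(url)`, fetch, then call `mark_fetched(url)`. Keying on the hostname keeps one slow domain from delaying requests to the others.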
Live — structured adapters
| Source | Type | How | Tier |
|---|---|---|---|
| arxiv-cs-ai | papers | HTML (enlive) | 1 |
| github-trending | repos | HTML (enlive) | 1 |
| hn-front | discussion | HTML (enlive) | 1 |
| harvard-seas | events | JSON (Localist API) | 1 |
| mit-calendar | events | JSON (Localist API) | 2 |
Live — markdown (feed-less, no scraping)
These sources are crawled via a structured representation instead of HTML scraping:
either content negotiation (Accept: text/markdown) or the /llms.txt convention.
| Source | How | URL |
|---|---|---|
| cloudflare-docs | content negotiation (md) | developers.cloudflare.com |
| mcp-docs | /llms.txt index | modelcontextprotocol.io/llms.txt |
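Both markdown paths in the table above are simple to sketch. The example below builds (without sending) a request that asks for markdown via the Accept header, and extracts the link list from an /llms.txt index. The function names and the `Walsh-Research` User-Agent string are assumptions for illustration, and the parser only handles the common `- [title](url)` list form of llms.txt.

```python
import re
from urllib.request import Request

# Matches markdown list items of the form "- [Title](https://...)".
LINK = re.compile(r"^\s*[-*]\s*\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_request(url, user_agent="Walsh-Research"):
    """Build (but do not send) a request that prefers a markdown body.

    The User-Agent value is a placeholder, not the bot's confirmed token.
    """
    return Request(url, headers={"Accept": "text/markdown",
                                 "User-Agent": user_agent})


def parse_llms_txt(text):
    """Return (title, url) pairs from the link lists in an llms.txt index."""
    links = []
    for line in text.splitlines():
        match = LINK.match(line)
        if match:
            links.append((match.group(1), match.group(2)))
    return links
```

A server that does not support content negotiation may still answer with HTML, so a real client would want to check the response's Content-Type before treating the body as markdown.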
Live — feeds (RSS/Atom)
Mechanistic interpretability & ML (mech-interp)
transformer-circuits.pub · Neel Nanda · Lilian Weng · Interconnects · Eugene Yan
Formal methods, distributed systems, correctness (formal-methods)
Aphyr/Jepsen · Antithesis · Marc Brooker · Murat Demirbas · Hillel Wayne · James Bornholt
LLM tooling & agents (agents)
Simon Willison · Latent Space · Anthropic Research · Anthropic News
Corporate AI labs (labs)
Hugging Face Blog · OpenAI · Google Research · Microsoft Research · Apple ML Research
Corporate systems / distsys / PLT (corp-systems)
Cloudflare (blog) · Netflix Tech Blog · All Things Distributed (Werner Vogels) · Jane Street
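Because the feeds above are a mix of RSS and Atom, a feed adapter has to handle both shapes. The following is a minimal sketch using only the Python standard library; `parse_feed` is a hypothetical name, and it covers RSS 2.0 and Atom only (not RSS 1.0/RDF).

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def parse_feed(xml_text):
    """Return a list of (title, link) pairs from an RSS 2.0 or Atom document."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == ATOM + "feed":
        for entry in root.iter(ATOM + "entry"):
            title = entry.findtext(ATOM + "title", default="")
            link = ""
            for el in entry.findall(ATOM + "link"):
                # A missing rel attribute means "alternate" in Atom.
                if el.get("rel", "alternate") == "alternate":
                    link = el.get("href", "")
                    break
            items.append((title.strip(), link))
    else:
        # RSS 2.0: <rss><channel><item>... with un-namespaced elements.
        for item in root.iter("item"):
            title = item.findtext("title", default="")
            link = item.findtext("link", default="")
            items.append((title.strip(), link.strip()))
    return items
```

The standard library's XML parsers are not hardened against maliciously constructed input, so a production crawler would likely put a hardened parser or size limits in front of this step.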
Planned (registered, not yet live)
| Category | Sources |
|---|---|
| critique | Pluralistic · 404 Media · EFF Deeplinks · Logic Magazine |
| aviation-sdr | The Air Current · RTL-SDR |
| systems | Klara Systems · FreeBSD Foundation · LWN · Hackaday |
| aggregator | Lobsters (robots.txt disallows us; registered but not crawled) |
| structured | huggingface-models · papers-with-code · semantic-scholar |
Compliance notes
- Lobsters is registered but not crawled: its robots.txt disallows all non-allowlisted bots, and we honor that.
- Crawl-delay is honored wherever a site's robots.txt sets one, e.g. arXiv 15s, Hacker News 30s.
- Corporate blogs with neither a real feed nor a markdown representation (Meta AI, Uber Eng, Databricks, LangChain, …) are deliberately left out rather than scraped.
- Full rules: Walsh-Research Bot Compliance Specification.
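The robots.txt decisions in these notes can be sketched with Python's standard `urllib.robotparser`. The two robots.txt bodies in the usage below are made-up examples of an allowlist-only policy and a Crawl-delay policy; they are not the actual files served by Lobsters, arXiv, or Hacker News, and `robots_verdict` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def robots_verdict(robots_txt, user_agent, url):
    """Return (allowed, crawl_delay_or_None) for one URL.

    `robots_txt` is the already-fetched file body; nothing is downloaded here.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# Hypothetical policy: only a named bot is allowed, everyone else is shut out.
ALLOWLIST_ONLY = """\
User-agent: GoodBot
Allow: /

User-agent: *
Disallow: /
"""

# Hypothetical policy: everyone may crawl, slowly, outside /private/.
POLITE = """\
User-agent: *
Crawl-delay: 15
Disallow: /private/
"""
```

With these inputs, `robots_verdict(ALLOWLIST_ONLY, "Walsh-Research", "https://example.org/")` reports the URL as disallowed, which is the case where a source stays registered but uncrawled, while `robots_verdict(POLITE, "Walsh-Research", "https://example.org/feed.xml")` reports it as allowed with a 15-second delay to feed into the rate limiter.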