Research Crawler – Sources Overview

Overview
Live – structured adapters
Live – markdown (feed-less, no scraping)
Live – feeds (RSS/Atom)
Registered, not yet live
Compliance notes

Overview

This page lists what the agentic-research crawler (jwalsh/tech-crawler) currently checks. Every source is fetched as a compliant Walsh-Research bot: it honors robots.txt (RFC 9309) and the operator blocklist, and sends at most 1 request per second per domain, or slower where the site sets a Crawl-delay, per the compliance spec.
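
The admission check described above could be sketched roughly as follows. This is a minimal illustration built on Python's standard urllib.robotparser, not the tech-crawler implementation; the user-agent token, the blocklist contents, and the function name may_fetch are assumptions made for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Assumed values for illustration; the real token and blocklist
# are defined by the compliance spec and the operator.
USER_AGENT = "Walsh-Research"
OPERATOR_BLOCKLIST = {"blocked.example"}


def may_fetch(url: str, robots_txt: str) -> bool:
    """Return True only if both the blocklist and robots.txt allow the URL."""
    host = urlsplit(url).hostname or ""
    if host in OPERATOR_BLOCKLIST:
        # The operator blocklist wins regardless of what robots.txt says.
        return False
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)
```

Passing the robots.txt body in as a string keeps the check free of network access, so it can be exercised against saved copies of a site's rules.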

Live: ~86 sources – 5 structured adapters, 2 markdown, 79 RSS/Atom feeds
Tracked threads: agents, evals, interp, formal methods, surveillance/critique, BSD/systems, Clojure/Scheme, SDR/aviation, corp-systems, corp-engineering

This page is a curated overview, grouped by thread with representative sources; the exact, machine-readable feed manifest (URLs, tiers, crawl cadence) is maintained in tech-crawler and is the source of truth. Sources are added only after verifying a real, crawlable endpoint (a working feed / API / markdown, and robots.txt that allows us).

Recent growth: the 2026-05-31 run crawled 79 feeds, up from 61 on 2026-05-29 – the jump is the new corp-engineering thread (12 engineering blogs) plus the promotion of the surveillance/critique and BSD/systems sources that were previously registered-but-not-live.

Live – structured adapters

Source	Type	How	Tier
arxiv-cs-ai	papers	HTML (enlive)	1
github-trending	repos	HTML (enlive)	1
hn-front	discussion	HTML (enlive)	1
harvard-seas	events	JSON (Localist API)	1
mit-calendar	events	JSON (Localist API)	2

Live – markdown (feed-less, no scraping)

Crawled via a structured representation instead of HTML scraping – content negotiation (Accept: text/markdown) or the /llms.txt convention.

Source	How	URL
cloudflare-docs	content negotiation (md)	developers.cloudflare.com
mcp-docs	/llms.txt index	modelcontextprotocol.io/llms.txt
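
As a rough sketch of the two approaches above, the snippet below builds a request that asks for markdown via the Accept header and pulls the linked pages out of an /llms.txt index. The helper names, the user-agent string, and the link regex are assumptions for illustration; they are not taken from tech-crawler.

```python
import re
from urllib.request import Request

# Matches markdown links of the form [title](url), which is how
# /llms.txt indexes conventionally list their pages.
LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def markdown_request(url: str) -> Request:
    """Build a GET request that asks the server for a markdown representation."""
    headers = {"Accept": "text/markdown", "User-Agent": "Walsh-Research"}
    return Request(url, headers=headers)


def llms_txt_links(text: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from the markdown links in an /llms.txt body."""
    return LINK.findall(text)
```

The request object can be handed to urllib.request.urlopen once the compliance checks have passed; a server that does not support the negotiation will typically answer with HTML, so a real crawler would still need to inspect the response Content-Type.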

Registered, not yet live

Category	Sources
structured	huggingface-models · papers-with-code · semantic-scholar
aggregator	Lobsters (robots.txt disallows us – kept out)

Compliance notes

Lobsters is registered but not crawled: its robots.txt disallows all non-allowlisted bots, and we honor that.
Crawl-delay observed and respected, e.g. arXiv 15s, Hacker News 30s.
The corp-engineering thread is real feeds only: blogs without a working RSS/Atom or markdown representation are left out rather than scraped.
Full rules: Walsh-Research Bot Compliance Specification.
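
To make the pacing rule concrete, here is a small sketch of how the 1-second floor and a site's Crawl-delay might combine, assuming the effective interval is the larger of the two. The class and function names are hypothetical, and the clock is injectable only so the logic can be tested without sleeping.

```python
import time
from typing import Callable, Dict, Optional

MIN_INTERVAL_S = 1.0  # the 1 req/s/domain ceiling stated in the overview


def effective_delay(crawl_delay: Optional[float]) -> float:
    """Seconds between requests: the floor, or Crawl-delay if it is larger."""
    return max(MIN_INTERVAL_S, crawl_delay or 0.0)


class DomainPacer:
    """Tracks, per domain, the earliest time the next request may start."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._next_ok: Dict[str, float] = {}

    def wait_time(self, domain: str, crawl_delay: Optional[float] = None) -> float:
        """Return how long to wait before the next request and reserve that slot."""
        now = self._clock()
        start = max(now, self._next_ok.get(domain, now))
        self._next_ok[domain] = start + effective_delay(crawl_delay)
        return start - now
```

With the delays quoted above, a second back-to-back request to arXiv would be told to wait 15 seconds and one to Hacker News 30 seconds, while a domain with no Crawl-delay falls back to the 1-second floor.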
