Agentic Research Crawler: Sources & State
1. Overview
tech-crawler is a Clojure CLI that crawls heterogeneous
agentic-AI research sources and normalizes them into one schema —
{:title :url :source :date :tags} — stored as a local EDN append log for
offline querying. It optimizes for structured metadata extraction, not
comprehensiveness or summarization: extract metadata, not meaning.
This note is the human-curated layer over the machine-readable crawl data —
it records the source landscape, what is live, and the politeness rules. The
crawler's data/crawl-*.json snapshots remain the raw layer that this page
cites.
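As a rough illustration of the schema and the append log described above, the sketch below projects a raw item onto the five schema keys and appends it to an EDN file, one form per line. The function signatures, the namespace name, and any file path passed in are assumptions made for this example, not the crawler's actual code.

```clojure
(ns log-sketch
  (:require [clojure.edn :as edn]
            [clojure.java.io :as io]))

;; The five keys named in the overview.
(def schema-keys [:title :url :source :date :tags])

(defn normalize
  "Projects a raw item map onto the schema. Extra keys are dropped,
   :source is set by the caller, and :tags defaults to an empty vector."
  [source raw]
  (-> (select-keys raw schema-keys)
      (assoc :source source)
      (update :tags #(vec (or % [])))))

(defn append-items!
  "Appends each item as one EDN form per line. `path` is hypothetical."
  [path items]
  (with-open [^java.io.Writer w (io/writer path :append true)]
    (doseq [item items]
      (.write w (prn-str item)))))

(defn read-log
  "Reads every EDN form in the log back into a vector, for offline querying."
  [path]
  (with-open [r (java.io.PushbackReader. (io/reader path))]
    (vec (take-while #(not= ::eof %)
                     (repeatedly #(edn/read {:eof ::eof} r))))))
```

Because the log is append-only, a reader can filter the result of `read-log` with ordinary sequence functions (for example by `:source` or `:tags`) without touching the network.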
2. Current research interests
Four threads drive source selection:
- Correct-by-construction — verified synthesis and formal methods (Nada Amin: DafnyBench, VerMCTS, miniKanren relational programming; guided proof search in Coq).
- Agentic governance — algorithmic institutions, agent standards, and AI-agent law (Berkman Klein Center).
- Capability evaluation under constraints — eval design, deterministic simulation testing, distributed-systems correctness.
- Convergence — where the above meet: verified, governable, evaluable agents.
3. Source landscape
3.1. Live structured adapters
| Source | Type | Parser | Tier |
|---|---|---|---|
| arxiv-cs-ai | papers | HTML (enlive) | 1 |
| github-trending | repos | HTML (enlive) | 1 |
| hn-front | discussion | HTML (enlive) | 1 |
| harvard-seas | events | JSON (Localist) | 1 |
| mit-calendar | events | JSON (Localist) | 2 |
3.2. Curated RSS/Atom feeds (25 sources, 26 endpoints)
One generic feed adapter handles RSS 2.0, RSS 1.0/RDF, and Atom 1.0 by navigating local element names, so namespace differences are transparent. Feeds are imported from an OPML manifest.
- mech-interp / ml-systems — transformer-circuits.pub, Neel Nanda, Lilian Weng, Interconnects, Eugene Yan
- formal methods / distsys — Jepsen/aphyr, Antithesis, Marc Brooker, Murat Demirbas, Hillel Wayne, James Bornholt
- agents / providers — Simon Willison, Latent Space, Anthropic Research, Anthropic News
- critique / rights — Pluralistic, 404 Media, EFF Deeplinks, Logic(s)
- aviation / SDR — The Air Current, RTL-SDR
- systems / FreeBSD — Klara Systems, FreeBSD Foundation, LWN, Hackaday
- aggregator — Lobsters (tag-filtered)
Live now (research-core): the 15 feeds in the mech-interp / ml-systems,
formal methods / distsys, and agents / providers categories are crawled. The
remaining categories are registered but :planned.
Lobsters stays :planned — its robots.txt disallows non-allowlisted bots,
so we do not crawl it.
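The "local element names" idea behind the generic feed adapter can be sketched as follows. This is a minimal illustration under stated assumptions, not the adapter itself: it uses `clojure.xml/parse`, which reports prefixed tags such as `dc:date` as keywords, and strips everything up to the last colon. The function names and the choice of date elements are hypothetical, and it takes the first `link` element only.

```clojure
(ns feed-sketch
  (:require [clojure.string :as str]
            [clojure.xml :as xml]))

(defn local-name
  "Tag name without a namespace prefix, e.g. \"dc:date\" becomes \"date\"."
  [tag]
  (last (str/split (name tag) #":")))

(defn children [el]
  (filter map? (:content el)))

(defn child
  "First child element whose local name equals lname, or nil."
  [el lname]
  (first (filter #(= lname (local-name (:tag %))) (children el))))

(defn text
  "Trimmed text content of an element, or nil if it is absent or empty."
  [el]
  (some->> el :content (filter string?) (apply str) str/trim))

(defn entries
  "All item (RSS 2.0, RSS 1.0/RDF) or entry (Atom) elements in the tree."
  [root]
  (->> (tree-seq map? children root)
       (filter #(#{"item" "entry"} (local-name (:tag %))))))

(defn entry->item
  [source el]
  (let [link (child el "link")]
    {:title  (text (child el "title"))
     ;; Atom carries the URL in an href attribute, RSS in the element text.
     :url    (or (get-in link [:attrs :href]) (text link))
     :source source
     :date   (text (or (child el "pubDate") (child el "date")
                       (child el "updated") (child el "published")))
     :tags   []}))

(defn feed-items
  "Parses a feed document given as a string and returns schema maps."
  [source ^String xml-str]
  (let [in (java.io.ByteArrayInputStream. (.getBytes xml-str "UTF-8"))]
    (mapv #(entry->item source %) (entries (xml/parse in)))))
```

Matching on local names means one code path covers `item`, `rdf`-wrapped `item`, and Atom `entry` elements, at the cost of ignoring which namespace a tag actually came from.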
4. Crawler hygiene
The crawl path is robots-aware and bandwidth-polite:
- Conditional fetch — per-URL ETag / Last-Modified validators are cached; repeat crawls send If-None-Match / If-Modified-Since and skip unchanged feeds on a 304. (Observed live: a feed returns 200 with 51 items on first crawl, then 304 not-modified on the next.)
- robots.txt — parsed and cached 24h per host; disallowed paths are skipped; fails open on fetch error.
- Backoff — exponential with full jitter on 429/503, honoring Retry-After.
- Dedup — items are deduplicated by canonical URL (fragment and trailing slash stripped), tracking first-seen so the log accumulates new items only.
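The conditional-fetch, backoff, and dedup rules above can be sketched as small pure functions. Everything here is illustrative: the map shapes, the 1 s base delay, and the 60 s cap are assumptions, and `Retry-After` is handled only in its delay-in-seconds form, not as an HTTP date.

```clojure
(ns hygiene-sketch
  (:require [clojure.string :as str]))

(defn conditional-headers
  "Request headers for a repeat fetch, built from cached validators.
   The {:etag .. :last-modified ..} cache shape is hypothetical."
  [{:keys [etag last-modified]}]
  (cond-> {}
    etag          (assoc "If-None-Match" etag)
    last-modified (assoc "If-Modified-Since" last-modified)))

(defn backoff-ms
  "Full jitter: a uniform random delay in [0, min(cap, base * 2^attempt)).
   A Retry-After value in seconds, when present, takes precedence."
  [{:keys [attempt retry-after-s base-ms cap-ms]
    :or   {base-ms 1000 cap-ms 60000}}]
  (if retry-after-s
    (* 1000 retry-after-s)
    (let [ceiling (min cap-ms (* base-ms (bit-shift-left 1 (min attempt 20))))]
      (long (rand ceiling)))))

(defn canonical-url
  "Strips the fragment, then any trailing slashes. Plain string handling;
   it does not normalize scheme, host case, or query parameters."
  [url]
  (-> url
      (str/replace #"#.*$" "")
      (str/replace #"/+$" "")))

(defn new-items
  "Returns items whose canonical URL is not in `seen` (a set of canonical
   URLs), also dropping duplicates within the batch. First occurrence wins."
  [seen items]
  (->> items
       (map #(update % :url canonical-url))
       (reduce (fn [[seen acc] item]
                 (if (contains? seen (:url item))
                   [seen acc]
                   [(conj seen (:url item)) (conj acc item)]))
               [seen []])
       second))
```

Full jitter spreads retries across the whole window instead of clustering them at the exponential boundary, which matters when many feeds on one host return 429 at once.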
5. Operational state (2026-05-23)
- Live targets: 20 (5 structured + 15 feeds).
- Feeds: 15 live, 11 planned (of 26 endpoints).
- Open conjectures: 4 (none refuted).
  - C-001 — one normalize fn handles >80% of source variance.
  - C-002 — daily crawl frequency suffices.
  - C-003 — title+url+date+source+tags suffices for triage.
  - C-004 — CSS selectors for P1 sources stable quarterly.
6. Crawler etiquette
The crawler identifies itself with a descriptive User-Agent carrying a bot
info URL and contact, and honors robots.txt and Retry-After. It extracts
metadata only — it does not republish content or train on it.
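The robots.txt policy (24h per-host cache, disallowed paths skipped, fail open on fetch error) might be expressed as below. This is a sketch that assumes the file has already been fetched and parsed into a hypothetical cache entry; it checks plain `Disallow` path prefixes only and does not model `Allow` lines, wildcards, or per-bot groups.

```clojure
(ns robots-sketch
  (:require [clojure.string :as str]))

;; 24h cache lifetime per host, in milliseconds.
(def ttl-ms (* 24 60 60 1000))

(defn fresh?
  "True when a cache entry exists and is younger than the TTL.
   Hypothetical entry shape:
   {:fetched-at-ms 0 :status :ok :disallow [\"/private/\"]}"
  [entry now-ms]
  (boolean (and entry (< (- now-ms (:fetched-at-ms entry)) ttl-ms))))

(defn allowed?
  "Fails open: a :fetch-error entry allows every path. Otherwise a path
   is allowed unless it starts with one of the Disallow prefixes."
  [entry path]
  (or (= :fetch-error (:status entry))
      (not-any? #(str/starts-with? path %) (:disallow entry))))
```

A caller would refetch robots.txt whenever `fresh?` returns false and consult `allowed?` before each request to that host.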