Agentic Research Crawler: Sources & State

1. Overview
2. Current research interests
3. Source landscape
- 3.1. Live structured adapters
- 3.2. Curated RSS/Atom feeds (25 sources, 26 endpoints)
4. Crawler hygiene
5. Operational state (2026-05-23)
6. Crawler etiquette

1. Overview

tech-crawler is a Clojure CLI that crawls heterogeneous agentic-AI research sources and normalizes them into one schema — {:title :url :source :date :tags} — stored as a local EDN append log for offline querying. It optimizes for structured metadata extraction, not comprehensiveness or summarization: extract metadata, not meaning.
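
The five-key schema above can be illustrated with a small normalizer. This is a minimal sketch in Python for readability, not the crawler's Clojure code; the raw field names ("link", "published") and the `normalize` function itself are assumptions made for illustration.

```python
from datetime import datetime

# The five keys of the shared schema, in the order the note lists them.
SCHEMA_KEYS = ("title", "url", "source", "date", "tags")

def normalize(raw, source):
    """Map one raw adapter record (hypothetical shape) onto the schema."""
    # Collapse runs of whitespace and newlines in scraped titles.
    title = " ".join(str(raw.get("title", "")).split())
    url = str(raw.get("link") or raw.get("url") or "").strip()
    date = raw.get("published") or raw.get("date")
    if isinstance(date, datetime):
        date = date.date().isoformat()  # keep the calendar date only
    # Lower-case, trim, de-duplicate, and sort tags for stable output.
    tags = sorted({str(t).strip().lower() for t in raw.get("tags", []) if str(t).strip()})
    return {"title": title, "url": url, "source": source, "date": date, "tags": tags}
```

Every adapter would return records of this one shape, so downstream queries never need to know which source an item came from beyond its source key.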

This note is the human-curated layer over the machine-readable crawl data — it records the source landscape, what is live, and the politeness rules. The crawler's data/crawl-*.json snapshots remain the raw layer that this page cites.

2. Current research interests

Four threads drive source selection:

Correct-by-construction — verified synthesis and formal methods (Nada Amin: DafnyBench, VerMCTS, miniKanren relational programming; guided proof search in Coq).
Agentic governance — algorithmic institutions, agent standards, and AI-agent law (Berkman Klein Center).
Capability evaluation under constraints — eval design, deterministic simulation testing, distributed-systems correctness.
Convergence — where the above meet: verified, governable, evaluable agents.

3. Source landscape

3.1. Live structured adapters

Source	Type	Parser	Tier
arxiv-cs-ai	papers	HTML (enlive)	1
github-trending	repos	HTML (enlive)	1
hn-front	discussion	HTML (enlive)	1
harvard-seas	events	JSON Localist	1
mit-calendar	events	JSON Localist	2

3.2. Curated RSS/Atom feeds (25 sources, 26 endpoints)

One generic feed adapter handles RSS 2.0, RSS 1.0/RDF, and Atom 1.0 by navigating local element names, so namespace differences are transparent. Feeds are imported from an OPML manifest.
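
Navigating by local element name can be sketched as follows. This is an illustrative Python version using the standard library's ElementTree, not the adapter's actual Clojure code; the helper names (`local`, `first_child`, `text_of`, `parse_feed`) are hypothetical. Dates are left as the raw strings each format provides, and only the first link of an Atom entry is read, which is a simplification.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Drop a '{namespace}' prefix, leaving the local element name."""
    return tag.rsplit("}", 1)[-1]

def first_child(elem, *names):
    """First direct child (in document order) whose local name is in names."""
    for child in elem:
        if local(child.tag) in names:
            return child
    return None

def text_of(elem):
    return (elem.text or "").strip() if elem is not None else ""

def parse_feed(xml_text, source):
    items = []
    for elem in ET.fromstring(xml_text).iter():
        # RSS 2.0 and RSS 1.0/RDF use <item>; Atom uses <entry>.
        if local(elem.tag) not in ("item", "entry"):
            continue
        link = first_child(elem, "link")
        url = ""
        if link is not None:
            # Atom puts the URL in href; RSS puts it in the element text.
            url = link.get("href") or text_of(link)
        date = text_of(first_child(elem, "pubDate", "date", "published", "updated")) or None
        tags = [c.get("term") or text_of(c) for c in elem if local(c.tag) == "category"]
        items.append({"title": text_of(first_child(elem, "title")), "url": url,
                      "source": source, "date": date, "tags": tags})
    return items
```

Because matching ignores the namespace part of each tag, a Dublin Core dc:date in an RDF feed and an Atom updated element are found by the same lookup.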

mech-interp / ml-systems — transformer-circuits.pub, Neel Nanda, Lilian Weng, Interconnects, Eugene Yan
formal methods / distsys — Jepsen/aphyr, Antithesis, Marc Brooker, Murat Demirbas, Hillel Wayne, James Bornholt
agents / providers — Simon Willison, Latent Space, Anthropic Research, Anthropic News
critique / rights — Pluralistic, 404 Media, EFF Deeplinks, Logic(s)
aviation / SDR — The Air Current, RTL-SDR
systems / FreeBSD — Klara Systems, FreeBSD Foundation, LWN, Hackaday
aggregator — Lobsters (tag-filtered)

Live now (research-core): the 15 feeds in the mech-interp / ml-systems, formal methods / distsys, and agents / providers categories are crawled. The remaining 11 endpoints are registered but marked :planned. Lobsters stays :planned because its robots.txt disallows non-allowlisted bots, so we do not crawl it.

4. Crawler hygiene

The crawl path is robots-aware and bandwidth-polite:

Conditional fetch — per-URL ETag / Last-Modified validators are cached; repeat crawls send If-None-Match / If-Modified-Since and skip unchanged feeds on a 304. (Observed live: a feed returns 200 with 51 items on first crawl, then 304 not-modified on the next.)
robots.txt — parsed and cached for 24h per host; disallowed paths are skipped; if robots.txt itself cannot be fetched, the crawler fails open and treats the host as allowed.
Backoff — exponential with full jitter on 429/503, honoring Retry-After.
Dedup — items are deduplicated by canonical URL (fragment and trailing slash stripped), tracking first-seen so the log accumulates new items only.
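
Three of the rules above can be sketched in a few lines each. This is an illustrative Python sketch under stated assumptions, not the crawler's Clojure code: all function names, the validator-cache shape, and the base and cap values are hypothetical; Retry-After is assumed to be already parsed into seconds; and treating it as a floor on the jittered delay is one possible reading of "honoring".

```python
import random
from urllib.parse import urlsplit, urlunsplit

def conditional_headers(validators):
    """Build revalidation headers from cached per-URL validators."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers

def backoff_delay(attempt, retry_after=None, base=1.0, cap=300.0, rng=random.random):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)) seconds.

    A server-supplied Retry-After (seconds) is used as a lower bound (assumption).
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng() * ceiling
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay

def canonical_url(url):
    """Strip the fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path.rstrip("/"), parts.query, ""))

def dedup(items, first_seen, today):
    """Return only unseen items, recording when each canonical URL first appeared."""
    fresh = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in first_seen:
            first_seen[key] = today
            fresh.append(item)
    return fresh
```

On a 304 response the crawler would skip parsing entirely; on a 200 it would store the new ETag and Last-Modified values and pass the parsed items through `dedup` before appending to the log.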

5. Operational state (2026-05-23)

Live targets: 20 (5 structured + 15 feeds).
Feeds: 15 live, 11 planned (of 26 endpoints).
Open conjectures: 4 (none refuted).
- C-001 — one normalize fn handles >80% of source variance.
- C-002 — daily crawl frequency suffices.
- C-003 — title+url+date+source+tags suffices for triage.
- C-004 — CSS selectors for P1 sources remain stable from quarter to quarter.

6. Crawler etiquette

The crawler identifies itself with a descriptive User-Agent carrying a bot info URL and contact, and honors robots.txt and Retry-After. It extracts metadata only — it does not republish content or train on it.
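
The robots.txt check with a descriptive User-Agent might look like the sketch below. It uses Python's standard urllib.robotparser for illustration rather than the crawler's own Clojure parser; the User-Agent string, the example.org URLs, and the `may_fetch` helper are all hypothetical, and the 24h per-host cache is omitted for brevity.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical User-Agent: product token, then bot info URL and contact.
USER_AGENT = "tech-crawler/0.1 (+https://example.org/bot; bot@example.org)"

def may_fetch(robots_txt, url, user_agent=USER_AGENT):
    """Decide whether url may be crawled given a host's robots.txt body.

    robots_txt is None when the robots.txt fetch itself failed; in that
    case the check fails open and allows the URL.
    """
    if robots_txt is None:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With an allowlist-style robots.txt that names specific bots and then disallows everything for "*", an unlisted crawler falls through to the wildcard rule and `may_fetch` returns False, which is the situation that keeps a source at :planned.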

See bot traffic analysis for the inbound side — how other crawlers treat this site's robots.txt.