Agentic Research Crawler: Sources & State

1. Overview

tech-crawler is a Clojure CLI that crawls heterogeneous agentic-AI research sources and normalizes them into one schema — {:title :url :source :date :tags} — stored as a local EDN append log for offline querying. It optimizes for structured metadata extraction, not comprehensiveness or summarization: extract metadata, not meaning.
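
To make the shared schema concrete, here is a minimal sketch of a normalize step. It is written in Python for illustration only and is not the crawler's Clojure code. The input keys (link, published) and the function name are assumptions; only the five output fields come from this note.

```python
SCHEMA_KEYS = ("title", "url", "source", "date", "tags")


def normalize(raw, source):
    """Map one source-specific record onto the five shared fields.

    The raw key names ("link", "published", ...) are hypothetical; real
    adapters would each supply their own mapping.
    """
    return {
        "title": (raw.get("title") or "").strip(),
        "url": raw.get("link") or raw.get("url"),
        "source": source,
        # The date is passed through as a string; no parsing is attempted here.
        "date": raw.get("published") or raw.get("date"),
        # Tags are deduplicated and sorted so equal records compare equal.
        "tags": sorted(set(raw.get("tags") or [])),
    }
```

Every adapter returning this same shape is what lets the log be queried without knowing which source produced a row.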

This note is the human-curated layer over the machine-readable crawl data — it records the source landscape, what is live, and the politeness rules. The crawler's data/crawl-*.json snapshots remain the raw layer that this page cites.

2. Current research interests

Four threads drive source selection:

  1. Correct-by-construction — verified synthesis and formal methods (Nada Amin: DafnyBench, VerMCTS, miniKanren relational programming; guided proof search in Coq).
  2. Agentic governance — algorithmic institutions, agent standards, and AI-agent law (Berkman Klein Center).
  3. Capability evaluation under constraints — eval design, deterministic simulation testing, distributed-systems correctness.
  4. Convergence — where the above meet: verified, governable, evaluable agents.

3. Source landscape

3.1. Live structured adapters

Source           Type        Parser         Tier
arxiv-cs-ai      papers      HTML (enlive)  1
github-trending  repos       HTML (enlive)  1
hn-front         discussion  HTML (enlive)  1
harvard-seas     events      JSON Localist  1
mit-calendar     events      JSON Localist  2

3.2. Curated RSS/Atom feeds (25 sources, 26 endpoints)

One generic feed adapter handles RSS 2.0, RSS 1.0/RDF, and Atom 1.0 by navigating local element names, so namespace differences are transparent. Feeds are imported from an OPML manifest.
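
The local-name approach can be sketched as follows. This is a Python illustration using the standard library's xml.etree.ElementTree, not the crawler's Clojure adapter. The function names and the returned keys are assumptions. The idea is that stripping the namespace prefix from each tag lets one loop treat an RSS item and an Atom entry alike.

```python
import xml.etree.ElementTree as ET

# Local names that carry a date in RSS 2.0, RSS 1.0/RDF (dc:date), and Atom.
DATE_NAMES = ("pubDate", "date", "updated", "published")


def local(tag):
    """Drop a '{namespace}' prefix, leaving only the local element name."""
    return tag.rsplit("}", 1)[-1]


def first_text(node, names):
    """Text of the first child (in document order) whose local name is in names."""
    for child in node:
        if local(child.tag) in names and child.text:
            return child.text.strip()
    return None


def find_link(node):
    """RSS stores the URL as element text; Atom stores it in an href attribute."""
    for child in node:
        if local(child.tag) != "link":
            continue
        href = child.get("href")
        if href is not None:
            # Atom: skip rel="self" and similar, keep the alternate link.
            if child.get("rel", "alternate") == "alternate":
                return href
        elif child.text:
            return child.text.strip()
    return None


def parse_feed(xml_text):
    """Return one dict per RSS <item> or Atom <entry>, whatever the namespace."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter():
        if local(node.tag) in ("item", "entry"):
            items.append({
                "title": first_text(node, ("title",)),
                "url": find_link(node),
                "date": first_text(node, DATE_NAMES),
            })
    return items
```

Dates are returned as raw strings here. Each format uses a different date syntax, so parsing them would be a separate step.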

  • mech-interp / ml-systems — transformer-circuits.pub, Neel Nanda, Lilian Weng, Interconnects, Eugene Yan
  • formal methods / distsys — Jepsen/aphyr, Antithesis, Marc Brooker, Murat Demirbas, Hillel Wayne, James Bornholt
  • agents / providers — Simon Willison, Latent Space, Anthropic Research, Anthropic News
  • critique / rights — Pluralistic, 404 Media, EFF Deeplinks, Logic(s)
  • aviation / SDR — The Air Current, RTL-SDR
  • systems / FreeBSD — Klara Systems, FreeBSD Foundation, LWN, Hackaday
  • aggregator — Lobsters (tag-filtered)

Live now (research-core): the 15 feeds in mech-interp, formal-methods, and agents are crawled. The remaining categories are registered but :planned. Lobsters stays :planned — its robots.txt disallows non-allowlisted bots, so we do not crawl it.

4. Crawler hygiene

The crawl path is robots-aware and bandwidth-polite:

  • Conditional fetch — per-URL ETag / Last-Modified validators are cached; repeat crawls send If-None-Match / If-Modified-Since and skip unchanged feeds on a 304. (Observed live: a feed returns 200 with 51 items on first crawl, then 304 not-modified on the next.)
  • robots.txt — parsed and cached 24h per host; disallowed paths are skipped; fails open on fetch error.
  • Backoff — exponential with full jitter on 429/503, honoring Retry-After.
  • Dedup — items are deduplicated by canonical URL (fragment and trailing slash stripped), tracking first-seen so the log accumulates new items only.
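
Three of the rules above reduce to small pure functions, sketched here in Python for illustration (not the crawler's Clojure code). All function names, the validator cache shape, and the base and cap values are assumptions. Treating Retry-After as a floor on the jittered delay is one plausible reading of "honoring", not a confirmed detail.

```python
import random
from urllib.parse import urlsplit, urlunsplit


def conditional_headers(validators):
    """Build revalidation headers from cached validators (hypothetical cache shape)."""
    headers = {}
    if validators.get("etag"):
        headers["If-None-Match"] = validators["etag"]
    if validators.get("last_modified"):
        headers["If-Modified-Since"] = validators["last_modified"]
    return headers


def backoff_delay(attempt, retry_after=None, base=1.0, cap=300.0, rng=random.random):
    """Full jitter: a uniform draw from [0, min(cap, base * 2**attempt)).

    retry_after is assumed to be already parsed into seconds; when present,
    the delay is never shorter than it.
    """
    jittered = rng() * min(cap, base * 2 ** attempt)
    if retry_after is not None:
        return max(retry_after, jittered)
    return jittered


def canonical_url(url):
    """Strip the fragment and any trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def add_new(seen, items, now):
    """Return only unseen items, recording first-seen time per canonical URL."""
    fresh = []
    for item in items:
        key = canonical_url(item["url"])
        if key not in seen:
            seen[key] = now
            fresh.append(item)
    return fresh
```

Passing rng as a parameter keeps the jitter testable: a fixed function in place of random.random makes the delay deterministic.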

5. Operational state (2026-05-23)

  • Live targets: 20 (5 structured + 15 feeds).
  • Feeds: 15 live, 11 planned (of 26 endpoints).
  • Open conjectures: 4 (none refuted).
    • C-001 — one normalize fn handles >80% of source variance.
    • C-002 — daily crawl frequency suffices.
    • C-003 — title+url+date+source+tags suffices for triage.
    • C-004 — CSS selectors for P1 sources stable quarterly.

6. Crawler etiquette

The crawler identifies itself with a descriptive User-Agent carrying a bot info URL and contact, and honors robots.txt and Retry-After. It extracts metadata only — it does not republish content or train on it.
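
As an illustration of the robots.txt check, the sketch below uses Python's standard urllib.robotparser (the crawler itself is Clojure and may do this differently). The User-Agent string, the bot info URL, the contact address, and the sample robots.txt are all hypothetical. The sample shows the allowlist pattern, where a named bot is permitted and every other agent is disallowed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical identity: product token, bot info URL, and contact address.
USER_AGENT = "tech-crawler/0.1 (+https://example.org/bot; bot@example.org)"

# Hypothetical allowlist-style robots.txt, not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: goodbot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(robots_txt, user_agent, url):
    """True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this sample, allowed(SAMPLE_ROBOTS, USER_AGENT, "https://example.org/rss") is False, so a crawler that is not on the allowlist would skip the host.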