PAgE 2026 Keynote — Can Coding Agents Write Verifiably Correct Software?
Table of Contents
1. Talk
Thesis: have LLM agents emit formal correctness proofs alongside code, discharged by Verus (SMT-based verifier for Rust). Reported result is a frontier-model-gated yes on repository-level systems code.
Scope of the claim (refutation boundary): the demonstrated capability is proof synthesis against fixed, human-authored specifications. The specification remains the trusted head of the TCB; the agent supplies the proof, not the spec. Claim is refuted as an end-to-end correctness result if the spec is also agent-authored and unvalidated.
2. Link set
- Talk / program
- https://pldi26.sigplan.org/home/page-2026
- Live stream (PAgE, Meadows B)
- https://www.youtube.com/watch?v=V0wjNL6gAG8
- Project umbrella (MSR, Practical System Verification)
- https://www.microsoft.com/en-us/research/project/practical-system-verification/
- Verus (verifier)
- https://github.com/verus-lang/verus
- LLM-for-Verus code + benchmarks
- https://github.com/microsoft/verus-proof-synthesis
- Leaderboard
- https://microsoft.github.io/verus-proof-synthesis/
- Verus publications & projects index
- https://verus-lang.github.io/verus/publications-and-projects/
3. Core content
3.1. Verus (substrate)
SMT-based verifier for Rust (Lattuada et al. 2024). Specs and proofs are written in Rust itself; linear ghost types and the borrow checker carry ownership and separation into proofs. TCB = top-level specification + Verus verifier + SMT solver (Z3) + Rust compiler. Semantic foundation for the proof-oriented type extensions: VerusBelt (Hance et al. 2026).
3.2. Two systems, two benchmarks
| System | Target | Benchmark | Tasks |
|---|---|---|---|
| AutoVerus | small algorithmic code | Verus-Bench | 150 |
| VeruSAGE | code in large real systems projects | VeruSAGE-Bench | 849 |
AutoVerus (Yang, Li, et al. 2025) pipeline: inference -> refinement -> repair (few-shot candidates, targeted refinement, verification-error debugging). VeruSAGE (Yang, Neamtu, et al. 2025): agentic observation-reasoning-action loop.
3.2.1. Verus-Bench (150, algorithm-level)
| Source | Tasks | Description |
|---|---|---|
| CloverBench | 11 | Classic CS examples |
| MBPP | 78 | Formal specification problems |
| Diffy | 38 | Array / loop programs |
| Misc | 23 | Verus tutorial examples |
3.2.2. VeruSAGE-Bench (849, repository-level)
| Project | Tasks | Domain |
|---|---|---|
| Anvil (AL) | 104 | Distributed systems |
| Anvil-Advanced | 63 | Distributed systems |
| IronKV (IR) | 118 | Key-value store |
| Memory Allocator | 89 | Systems |
| Node Replication | 29 | Distributed systems |
| NRKernel (NR) | 204 | OS kernel |
| ATMO (OS) | 157 | Microkernel |
| Storage (ST) | 63 | Storage systems |
| Vest (VE) | 22 | Serialization |
3.3. Cost / capability findings (from the lineage)
- Proof-generation capability is held by expensive frontier models (e.g. Claude Sonnet 4.5), not cheap models (e.g. o4-mini).
- VeruSyn (Di et al. 2026) responds to this with data synthesis: ~6.9M verified Rust programs (each spec + proof) to lift a small model.
4. Paper lineage (Lu group, LLM-for-Verus)
- AutoVerus — OOPSLA'25 — arXiv:2409.13082 / https://doi.org/10.1145/3763174
- Automated Proof Generation via Self-Evolution — ICLR'25 — arXiv:2410.15756
- VeriStruct (data-structure modules) — arXiv:2510.25015
- What's in a Proof? (expert proof-writing in F* and Verus) — arXiv:2508.02733
- VeruSAGE — Dec'25 — arXiv:2512.18436
- Reducing Costs of Proof Synthesis (VeruSyn) — Feb'26 — arXiv:2602.04910
- ExVerus (Yang et al. 2026) (proof repair via counterexample reasoning) — ICML'26 — arXiv:2603.25810
Adjacent track (spec/code co-synthesis, trust anchor moves):
- AlphaVerus (bootstrapping verified code generation) — ICML'25
- Vericoding benchmark — arXiv:2509.22908
5. Verified-systems projects (the VeruSAGE-Bench targets)
| Project | Repo / source | Venue |
|---|---|---|
| Anvil | github.com/vmware-research/verifiable-controllers | OSDI'24 (best) |
| IronKV | github.com/verus-lang/verified-ironkv (ex IronFleet) | — |
| Memory Allocator | github.com/verus-lang/verified-memory-allocator | (mimalloc-based) |
| Node Replication | github.com/verus-lang/verified-node-replication | — |
| NRKernel | github.com/utaal/verified-nrkernel | — |
| Atmosphere (ATMO) | github.com/mars-research/atmosphere | SOSP'25 |
| Storage (PoWER) | github.com/microsoft/verified-storage | OSDI'25 (dist.) |
| Vest | github.com/secure-foundations/vest | USENIX Sec'25 |
| VeriSMo (related) | github.com/microsoft/verismo | OSDI'24 (best) |
All targets are published, peer-reviewed artifacts with human-authored specs (e.g. Anvil (Sun et al. 2024)).
6. Related researchers
6.1. Speaker / LLM-for-Verus
- Shan Lu (Microsoft; U Chicago) — speaker, group lead
- Chenyuan Yang (UIUC) — AutoVerus / VeruSAGE first author
- Shuvendu K. Lahiri (MSR) — intent formalization
- Shuai Lu, Yeyun Gong, Peng Cheng (MSR) — data synthesis
- Shraddha Barke (MSR) — PAgE chair; expert proof-writing study
6.2. Verus core / systems
- Bryan Parno (CMU, secure-foundations) — PI
- Andrea Lattuada (MPI-SWS) — Verus lead
- Travis Hance (CMU) — Verus, VerusBelt
- Chris Hawblitzel, Jacob R. Lorch (MSR) — IronFleet -> Verus systems lineage
- Jon Howell, Oded Padon, Tej Chajed — Verus foundations
- Derek Dreyer, Yusuke Matsushita (MPI-SWS, Iris) — VerusBelt semantic foundation
- Hayley LeBlanc — verified storage (PoWER)
- Anton Burtsev, Vikram Narayanan (mars-research) — Atmosphere
6.3. Backfill seam (overlaps with BugBash 2026 notes)
- IronFleet lineage (Hawblitzel, Lorch) -> distributed-systems correctness; cross-reference correct-by-construction and Jepsen/Antithesis as the empirical-testing complement to mechanized proof.
7. Lineage graph
Figure 1: LLM-for-Verus lineage (Lu group): substrate, pipeline, benchmarks, human-authored spec sources, and the spec+code co-synthesis adjacent track.