PAgE 2026 Keynote — Can Coding Agents Write Verifiably Correct Software?

Table of Contents

1. Talk

Thesis: have LLM agents emit formal correctness proofs alongside code, discharged by Verus (SMT-based verifier for Rust). Reported result is a frontier-model-gated yes on repository-level systems code.

Scope of the claim (refutation boundary): the demonstrated capability is proof synthesis against fixed, human-authored specifications. The specification remains the trusted head of the TCB; the agent supplies the proof, not the spec. Claim is refuted as an end-to-end correctness result if the spec is also agent-authored and unvalidated.

2. Link set

3. Core content

3.1. Verus (substrate)

SMT-based verifier for Rust (Lattuada et al. 2024). Specs and proofs are written in Rust itself; linear ghost types and the borrow checker carry ownership and separation into proofs. TCB = top-level specification + Verus verifier + SMT solver (Z3) + Rust compiler. Semantic foundation for the proof-oriented type extensions: VerusBelt (Hance et al. 2026).

3.2. Two systems, two benchmarks

System Target Benchmark Tasks
AutoVerus small algorithmic code Verus-Bench 150
VeruSAGE code in large real systems projects VeruSAGE-Bench 849

AutoVerus (Yang, Li, et al. 2025) pipeline: inference -> refinement -> repair (few-shot candidates, targeted refinement, verification-error debugging). VeruSAGE (Yang, Neamtu, et al. 2025): agentic observation-reasoning-action loop.

3.2.1. Verus-Bench (150, algorithm-level)

Source Tasks Description
CloverBench 11 Classic CS examples
MBPP 78 Formal specification problems
Diffy 38 Array / loop programs
Misc 23 Verus tutorial examples

3.2.2. VeruSAGE-Bench (849, repository-level)

Project Tasks Domain
Anvil (AL) 104 Distributed systems
Anvil-Advanced 63 Distributed systems
IronKV (IR) 118 Key-value store
Memory Allocator 89 Systems
Node Replication 29 Distributed systems
NRKernel (NR) 204 OS kernel
ATMO (OS) 157 Microkernel
Storage (ST) 63 Storage systems
Vest (VE) 22 Serialization

3.3. Cost / capability findings (from the lineage)

  • Proof-generation capability is held by expensive frontier models (e.g. Claude Sonnet 4.5), not cheap models (e.g. o4-mini).
  • VeruSyn (Di et al. 2026) responds to this with data synthesis: ~6.9M verified Rust programs (each spec + proof) to lift a small model.

4. Paper lineage (Lu group, LLM-for-Verus)

  1. AutoVerus — OOPSLA'25 — arXiv:2409.13082 / https://doi.org/10.1145/3763174
  2. Automated Proof Generation via Self-Evolution — ICLR'25 — arXiv:2410.15756
  3. VeriStruct (data-structure modules) — arXiv:2510.25015
  4. What's in a Proof? (expert proof-writing in F* and Verus) — arXiv:2508.02733
  5. VeruSAGE — Dec'25 — arXiv:2512.18436
  6. Reducing Costs of Proof Synthesis (VeruSyn) — Feb'26 — arXiv:2602.04910
  7. ExVerus (Yang et al. 2026) (proof repair via counterexample reasoning) — ICML'26 — arXiv:2603.25810

Adjacent track (spec/code co-synthesis, trust anchor moves):

  • AlphaVerus (bootstrapping verified code generation) — ICML'25
  • Vericoding benchmark — arXiv:2509.22908

5. Verified-systems projects (the VeruSAGE-Bench targets)

Project Repo / source Venue
Anvil github.com/vmware-research/verifiable-controllers OSDI'24 (best)
IronKV github.com/verus-lang/verified-ironkv (ex IronFleet)
Memory Allocator github.com/verus-lang/verified-memory-allocator (mimalloc-based)
Node Replication github.com/verus-lang/verified-node-replication
NRKernel github.com/utaal/verified-nrkernel
Atmosphere (ATMO) github.com/mars-research/atmosphere SOSP'25
Storage (PoWER) github.com/microsoft/verified-storage OSDI'25 (dist.)
Vest github.com/secure-foundations/vest USENIX Sec'25
VeriSMo (related) github.com/microsoft/verismo OSDI'24 (best)

All targets are published, peer-reviewed artifacts with human-authored specs (e.g. Anvil (Sun et al. 2024)).

6. Related researchers

6.1. Speaker / LLM-for-Verus

  • Shan Lu (Microsoft; U Chicago) — speaker, group lead
  • Chenyuan Yang (UIUC) — AutoVerus / VeruSAGE first author
  • Shuvendu K. Lahiri (MSR) — intent formalization
  • Shuai Lu, Yeyun Gong, Peng Cheng (MSR) — data synthesis
  • Shraddha Barke (MSR) — PAgE chair; expert proof-writing study

6.2. Verus core / systems

  • Bryan Parno (CMU, secure-foundations) — PI
  • Andrea Lattuada (MPI-SWS) — Verus lead
  • Travis Hance (CMU) — Verus, VerusBelt
  • Chris Hawblitzel, Jacob R. Lorch (MSR) — IronFleet -> Verus systems lineage
  • Jon Howell, Oded Padon, Tej Chajed — Verus foundations
  • Derek Dreyer, Yusuke Matsushita (MPI-SWS, Iris) — VerusBelt semantic foundation
  • Hayley LeBlanc — verified storage (PoWER)
  • Anton Burtsev, Vikram Narayanan (mars-research) — Atmosphere

6.3. Backfill seam (overlaps with BugBash 2026 notes)

  • IronFleet lineage (Hawblitzel, Lorch) -> distributed-systems correctness; cross-reference correct-by-construction and Jepsen/Antithesis as the empirical-testing complement to mechanized proof.

7. Lineage graph

LLM-for-Verus lineage graph

Figure 1: LLM-for-Verus lineage (Lu group): substrate, pipeline, benchmarks, human-authored spec sources, and the spec+code co-synthesis adjacent track.

8. References

Di, Nongyu, Tianyu Chen, Shan Lu, Shuai Lu, Yeyun Gong, Peng Cheng, Jacob R. Lorch, Yuan Yao, and Xiaoxing Ma. 2026. “Reducing the Costs of Proof Synthesis on Rust Systems by Scaling up a Seed Training Set.”
Hance, Travis, Laila Elbeheiry, Yusuke Matsushita, and Derek Dreyer. 2026. “Verusbelt: A Semantic Foundation for Verus’s Proof-Oriented Extensions to the Rust Type System.” In Proc. Acm Pldi.
Lattuada, Andrea, Travis Hance, Jay Bosamiya, Matthias Brun, Chanhee Cho, Hayley LeBlanc, Pranav Srinivasan, et al. 2024. “Verus: A Practical Foundation for Systems Verification.” In Proceedings of the Acm Sigops 30th Symposium on Operating Systems Principles (Sosp ’24). https://doi.org/10.1145/3694715.3695952.
Sun, Xudong, Wenjie Ma, Jiawei Tyler Gu, Zicheng Ma, Tej Chajed, Jon Howell, Andrea Lattuada, et al. 2024. “Anvil: Verifying Liveness of Cluster Management Controllers.” In Proc. Usenix Osdi.
Yang, Chenyuan, Xuheng Li, Md Rakib Hossain Misu, Jianan Yao, Weidong Cui, Yeyun Gong, Chris Hawblitzel, Shuvendu K. Lahiri, et al. 2025. “Autoverus: Automated Proof Generation for Rust Code.” Proc. Acm Program. Lang. 9 (OOPSLA2). https://doi.org/10.1145/3763174.
Yang, Chenyuan, Natalie Neamtu, Chris Hawblitzel, Jacob R. Lorch, and Shan Lu. 2025. “Verusage: A Study of Agent-Based Verification for Rust Systems.”
Yang, Jun, Yuechun Sun, Yi Wu, Rodrigo Caridad, Yongwei Yuan, Jianan Yao, Shan Lu, and Kexin Pei. 2026. “Exverus: Verus Proof Repair via Counterexample Reasoning.”