Terminal-Bench: Benchmarking Terminal Coding Agents
1. Overview
Terminal-Bench evaluates AI agents on tasks they must complete in a real terminal: build a project, fix a failing test suite, recover a corrupted file, stand up a service. An agent is given a task description and a shell, and is judged by whether the end state passes a hidden verification script – not by whether its prose looked plausible. It is the terminal-side complement to code-only benchmarks like SWE-bench: the unit of work is a session, not a diff.
2. How the harness works
Each task runs in an isolated sandbox (a container) with a fixed starting state and a test that decides pass/fail. The agent does not call an API directly; it drives a terminal – the harness pipes the agent's commands into a shell (a tmux-style session) and feeds the output back, turn by turn. This matters for two reasons:
- It tests the loop, not a single shot. The agent must read output, react to errors, and recover – the same generate/observe/repair loop a human runs.
- It is agent-agnostic. Anything that can read a prompt and emit shell commands can be scored: Claude Code, Codex CLI, Aider, a bespoke harness. The benchmark measures the agent + model + scaffold together, which is why the same model scores differently under different CLIs.
A task is resolved only if the verification script passes on the final state. Scoring is binary: a session that "almost" works fails.
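The turn-by-turn loop and the final-state check can be sketched in a few lines. This is a minimal illustration, not the Terminal-Bench harness itself: it assumes a POSIX shell, uses a temporary directory in place of a container, and runs each command with `subprocess` rather than through tmux. The names `run_episode`, `toy_agent`, and `verify_answer` are hypothetical, as is the toy task (make `answer.txt` contain `42`).

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface: given the (command, output) transcript so far,
# return the next shell command, or None when the agent considers itself done.
Transcript = List[Tuple[str, str]]
Agent = Callable[[Transcript], Optional[str]]


def run_episode(agent: Agent, verify: Callable[[Path], bool],
                workdir: Path, max_turns: int = 10) -> bool:
    """Feed commands to a shell turn by turn, then judge only the end state."""
    transcript: Transcript = []
    for _ in range(max_turns):
        command = agent(transcript)
        if command is None:
            break
        result = subprocess.run(command, shell=True, cwd=workdir,
                                capture_output=True, text=True, timeout=30)
        # The agent sees stdout and stderr together, as it would in a terminal.
        transcript.append((command, result.stdout + result.stderr))
    # Pass/fail depends on the final state, not on anything the agent printed.
    return verify(workdir)


def toy_agent(transcript: Transcript) -> Optional[str]:
    """Scripted stand-in for a model: observe, hit an error, repair, re-check."""
    if not transcript:
        return "cat answer.txt"            # observe: fails, the file is missing
    last_command, last_output = transcript[-1]
    if last_command == "cat answer.txt" and last_output.strip() != "42":
        return "echo 42 > answer.txt"      # repair: react to the error output
    if last_command.startswith("echo"):
        return "cat answer.txt"            # re-check after the fix
    return None                            # output looks right, stop


def verify_answer(workdir: Path) -> bool:
    """Stand-in for the hidden verification script."""
    target = workdir / "answer.txt"
    return target.is_file() and target.read_text().strip() == "42"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sandbox:
        print("resolved:", run_episode(toy_agent, verify_answer, Path(sandbox)))
```

An agent that stops before the file exists fails the same check, however reasonable its transcript looks, which is the point of judging the end state.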
3. Reading the leaderboard
Scores are the fraction of tasks resolved. Two factors drive them, and they are easy to conflate:
- Model capability – the underlying LLM.
- Scaffold quality – the CLI/harness: how it manages context, retries, tool use, and recovery.
A weaker model in a better scaffold can beat a stronger model in a worse one. When comparing entries, hold one axis fixed: same model across CLIs isolates the scaffold; same CLI across models isolates capability. Leaderboard position without that decomposition is marketing, not measurement.
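The decomposition can be made concrete with a small script. The sketch below computes the resolved fraction per (model, scaffold) pair and then filters with one axis held fixed. All data here is invented for illustration: the model and CLI names are placeholders, the outcomes are not real leaderboard results, and the helpers `make_runs`, `resolved_fraction`, and `hold_fixed` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Run = Tuple[str, str, str, bool]   # (model, scaffold, task_id, resolved)
Key = Tuple[str, str]              # (model, scaffold)


def make_runs(model: str, scaffold: str, outcomes: List[bool]) -> List[Run]:
    """Expand a list of pass/fail outcomes into per-task run records."""
    return [(model, scaffold, f"task-{i}", ok) for i, ok in enumerate(outcomes)]


# Invented outcomes on four placeholder tasks; not real benchmark results.
RUNS: List[Run] = (
    make_runs("model-strong", "cli-basic", [True, True, False, False])
    + make_runs("model-strong", "cli-better", [True, True, True, True])
    + make_runs("model-weak", "cli-basic", [True, False, False, False])
    + make_runs("model-weak", "cli-better", [True, True, True, False])
)


def resolved_fraction(runs: List[Run]) -> Dict[Key, float]:
    """Score each (model, scaffold) entry as the fraction of tasks resolved."""
    totals: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])
    for model, scaffold, _task, resolved in runs:
        totals[(model, scaffold)][0] += int(resolved)
        totals[(model, scaffold)][1] += 1
    return {key: passed / count for key, (passed, count) in totals.items()}


def hold_fixed(scores: Dict[Key, float], *, model: Optional[str] = None,
               scaffold: Optional[str] = None) -> Dict[Key, float]:
    """Keep only entries matching the fixed axis, so the other axis varies."""
    return {
        key: value for key, value in scores.items()
        if (model is None or key[0] == model)
        and (scaffold is None or key[1] == scaffold)
    }


if __name__ == "__main__":
    scores = resolved_fraction(RUNS)
    print("same model, different CLI:", hold_fixed(scores, model="model-strong"))
    print("same CLI, different model:", hold_fixed(scores, scaffold="cli-basic"))
```

In this toy data the weaker model in the better scaffold (0.75) outscores the stronger model in the basic one (0.5), which is the situation a raw ranking would hide.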
4. Why it matters here
Terminal-Bench is the closest public proxy for the work this site keeps returning to – agents driving real toolchains (publish pipelines, REPLs, provers) rather than answering questions. The harness's tmux-driven, read-react-repair loop is the same one behind the verification-first work (Lean/Dafny proofs an agent drives) and the CLI coding-agents comparison.
5. See also
- Terminal AI Agents: The 2025 Landscape – the broader survey this benchmark sits inside.
- CLI Coding Agents (2026 Q2) – the scaffold axis: same model, different CLI.
- Multi-Agent Workflow Frameworks – orchestration beyond a single agent.