Terminal-Bench: Benchmarking Terminal Coding Agents

1. Overview
2. How the harness works
3. Reading the leaderboard
4. Why it matters here
5. See also

1. Overview

Terminal-Bench evaluates AI agents on tasks they must complete in a real terminal: build a project, fix a failing test suite, recover a corrupted file, stand up a service. An agent is given a task description and a shell, and is judged by whether the end state passes a hidden verification script – not by whether its prose looked plausible. It is the terminal-side complement to code-only benchmarks like SWE-bench: the unit of work is a session, not a diff.

2. How the harness works

Each task runs in an isolated sandbox (a container) with a fixed starting state and a test that decides pass/fail. The agent does not call an API directly; it drives a terminal – the harness pipes the agent's commands into a shell (a tmux-style session) and feeds the output back, turn by turn. This matters for two reasons:

It tests the loop, not a single shot. The agent must read output, react to errors, and recover – the same generate/observe/repair loop a human runs.
It is agent-agnostic. Anything that can read a prompt and emit shell commands can be scored: Claude Code, Codex CLI, Aider, a bespoke harness. The benchmark therefore measures model and scaffold together, which is why the same model scores differently under different CLIs.

A task is resolved only if the verification script passes on the final state. Partial credit is not the point; a session that "almost" works fails.
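
The loop described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Terminal-Bench's actual harness: a temporary directory stands in for the container, one subprocess call per command stands in for the tmux-style session, and the names Agent, run_session, and scripted_agent are hypothetical. It assumes a POSIX shell.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Hypothetical agent interface (an assumption, not the benchmark's API):
# given the task text and the transcript so far, return the next shell
# command, or None when the agent considers the task finished.
Transcript = List[Tuple[str, str]]
Agent = Callable[[str, Transcript], Optional[str]]


def run_session(agent: Agent, task: str, verify_cmd: str,
                workdir: Path, max_turns: int = 20) -> bool:
    """Drive one session turn by turn, then judge only the final state."""
    transcript: Transcript = []
    for _ in range(max_turns):
        command = agent(task, transcript)
        if command is None:
            break
        # Stand-in for piping into a terminal: each command runs in a fresh
        # shell, so state such as `cd` does not persist as it would in tmux.
        result = subprocess.run(command, shell=True, cwd=workdir,
                                capture_output=True, text=True, timeout=60)
        transcript.append((command, result.stdout + result.stderr))
    # Pass/fail comes from the verification command's exit code alone.
    check = subprocess.run(verify_cmd, shell=True, cwd=workdir,
                           capture_output=True, text=True, timeout=60)
    return check.returncode == 0


def scripted_agent(task: str, transcript: Transcript) -> Optional[str]:
    """Toy agent: probes the state, reads the output, then repairs it."""
    if not transcript:
        return "test -f out/result.txt && echo present || echo missing"
    if "missing" in transcript[-1][1]:
        return "mkdir -p out && echo done > out/result.txt"
    return None  # nothing left to fix


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as sandbox:
        resolved = run_session(scripted_agent, "create out/result.txt",
                               "test -f out/result.txt", Path(sandbox))
        print("resolved:", resolved)
```

The verification command sees only the directory the session leaves behind, which is why an agent that stops one step short is scored the same as one that never started.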

3. Reading the leaderboard

Scores are the fraction of tasks resolved. Two factors move them, and they are easy to conflate:

Model capability – the underlying LLM.
Scaffold quality – the CLI/harness: how it manages context, retries, tool use, and recovery.

A weaker model in a better scaffold can beat a stronger model in a worse one. When comparing entries, hold one axis fixed: same model across CLIs isolates the scaffold; same CLI across models isolates capability. Leaderboard position without that decomposition is marketing, not measurement.
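
The decomposition above can be made concrete with a small Python sketch. The entries are placeholder values invented for illustration, not real leaderboard results, and the names model-A, cli-X, score, and hold_fixed are hypothetical.

```python
from collections import defaultdict

# Placeholder entries (NOT real results): (model, scaffold, resolved, total).
entries = [
    ("model-A", "cli-X", 30, 100),
    ("model-A", "cli-Y", 45, 100),
    ("model-B", "cli-X", 40, 100),
    ("model-B", "cli-Y", 55, 100),
]


def score(resolved: int, total: int) -> float:
    """Leaderboard score: the fraction of tasks resolved."""
    return resolved / total


def hold_fixed(rows, axis: str) -> dict:
    """Group scores by the fixed axis, keyed inside each group by the other.

    axis="model" isolates the scaffold; axis="scaffold" isolates capability.
    """
    if axis not in ("model", "scaffold"):
        raise ValueError("axis must be 'model' or 'scaffold'")
    fixed, varying = (0, 1) if axis == "model" else (1, 0)
    groups = defaultdict(dict)
    for row in rows:
        groups[row[fixed]][row[varying]] = score(row[2], row[3])
    return dict(groups)


if __name__ == "__main__":
    print(hold_fixed(entries, "model"))     # same model, different CLIs
    print(hold_fixed(entries, "scaffold"))  # same CLI, different models
```

In these placeholder numbers, model-B beats model-A under either CLI, yet model-A in cli-Y outscores model-B in cli-X. A flat ranking would hide that; the grouped view shows it.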

4. Why it matters here

Terminal-Bench is the closest public proxy for the work this site keeps returning to – agents driving real toolchains (publish pipelines, REPLs, provers) rather than answering questions. The harness's tmux-driven, read-react-repair loop is the same one behind the verification-first work (Lean and Dafny proofs driven by an agent) and the CLI coding-agents comparison.

5. See also

Terminal AI Agents: The 2025 Landscape – the broader survey this benchmark sits inside.
CLI Coding Agents (2026 Q2) – the scaffold axis: same model, different CLI.
Multi-Agent Workflow Frameworks – orchestration beyond a single agent.