How We Contain Claude: Mapping Against the Stack
1. Thesis
The post is not adjacent work; it is independent rediscovery of two load-bearing primitives already in the stack — the L4 provenance layer and blast-radius as the deployment risk function — plus one direct refutation condition aimed at the custom egress components (saproxy, netax). Its two most consequential incidents are both egress through a permitted path, which is precisely the secret-custody empty cell named in Agent Sandbox Architectures one level up, on the same publication date.
Source: Anthropic Engineering: How We Contain Claude
Refutation condition for the thesis: it is wrong if any of the post's load-bearing incidents reduces to a model-layer failure (something a classifier could have caught) rather than a provenance/egress boundary failure. Both headline incidents, the api.anthropic.com exfil and the employee phish, are invisible to the model layer by construction (there is nothing anomalous for a classifier to catch), so the thesis holds on the evidence presented.
2. Correspondence
| Post primitive | Stack artifact | Relation |
|---|---|---|
| blast radius = P(fail)·damage(fail) | reliability-lab six-gate; elenctic-spec L0-L3 | Same decomposition, operationalized as deployment risk function |
| three components: model / environment / external content | Seven Concerns L1-L7 | Their triad coarsens the stack. env = L1; external content = L4. No analogue to L5-L7 |
| probabilistic vs deterministic controls | Agent Permission Guardrails | Exact match. "telling != enforcing" is the guardrails piece verbatim |
| OS sandbox (Seatbelt/bubblewrap), egress-denied-by-default | Bastille jails (SEFACA); netax divert(4) | Convergent. netax is the homegrown egress interceptor; their primitives are the battle-tested form |
| two-isolation requirement (fs AND network) | Agent Sandbox Architectures | Already the reference axiom in the corpus |
| allowlist = capability grant, not destination filter | provenance-laundering taxonomy | The api.anthropic.com exfil is a textbook laundering case |
| in-VM MITM proxy "only the VM knows provenance" | Digital Shapeshifting L4; saproxy four-phase filter | Trust enforced where provenance is legible |
| direct prompt injection via user (phish) | role-boundary detection | Their finding (classifiers anchor on user intent) is the role-boundary failure |
| MCP authz optional; stdio excluded | Agent Permission Guardrails | "Every function reachable through an allowed domain" = no per-tool OAuth scope |
| persistent memory poisoning | JITIR Against the Field | Direct exposure. Falsification conditions already scoped; no session-startup classifier yet |
| multi-agent trust escalation | Agentic Q1 2026 | Epistemic labels prevent sub-agent output from being promoted to higher trust |
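The first row's decomposition, blast radius = P(fail)·damage(fail), can be made concrete with a small sketch. The `FailureMode` type, the failure-mode names, and every number below are hypothetical and chosen only for illustration; neither the post nor the stack artifacts publish such figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    p_fail: float  # estimated probability of the failure, in [0, 1]
    damage: float  # cost if the failure happens, in arbitrary units


def blast_radius(mode: FailureMode) -> float:
    """Expected damage of one failure mode: P(fail) * damage(fail)."""
    if not 0.0 <= mode.p_fail <= 1.0:
        raise ValueError(f"p_fail out of range for {mode.name!r}")
    return mode.p_fail * mode.damage


# Hypothetical failure modes and numbers, for illustration only.
modes = [
    FailureMode("bad edit inside the sandbox", p_fail=0.25, damage=4.0),
    FailureMode("credential leaves via an allowed host", p_fail=0.125, damage=1000.0),
]

# Rank by expected damage, largest first.
ranked = sorted(modes, key=blast_radius, reverse=True)
```

With these made-up inputs the rarer failure ranks first, because its damage term dominates the product. Used as a deployment risk function, the ranking follows the product rather than the probability alone.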
3. The Two Egress Incidents = the Secret-Custody Empty Cell
The sandbox-systems decomposition names four isolations — compute, filesystem custody, network egress, secret custody — and identifies secret custody as the axis the field is still building, the empty cell: vendors sell the compute boundary and leave egress and credential custody to a config the operator may never write.
The post's headline incident is that empty cell getting hit in production. A malicious
workspace file carried an attacker-controlled API key; Claude called the Files API; the
egress proxy saw api.anthropic.com, an approved destination, and passed it; the files
landed in the attacker's account. The sandbox worked; the secret custody axis did not
exist. Their fix — an in-VM proxy that passes only the VM's provisioned session token and
rejects an embedded key — is exactly secret custody as a first-class boundary, enforced
where provenance is legible.
This is the strongest single connection in the corpus: the post is an empirical instance of exactly the failure that the sandbox-systems piece predicted on the same day.
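The difference between the proxy that passed the exfil and the fix described above can be sketched as two filters over the same request. This is a minimal illustration, not Anthropic's implementation: `OutboundRequest`, both function names, and the choice of `x-api-key` as the inspected header are assumptions made for the example.

```python
import hmac
from dataclasses import dataclass, field
from typing import Dict

ALLOWED_HOSTS = frozenset({"api.anthropic.com"})
CREDENTIAL_HEADER = "x-api-key"  # assumption: the header the proxy inspects


@dataclass(frozen=True)
class OutboundRequest:
    host: str
    headers: Dict[str, str] = field(default_factory=dict)


def destination_filter(req: OutboundRequest) -> bool:
    """Destination-only check: any credential may ride to an allowed host."""
    return req.host in ALLOWED_HOSTS


def custody_filter(req: OutboundRequest, session_token: str) -> bool:
    """Allow only an allowed host AND the VM's own provisioned credential."""
    if req.host not in ALLOWED_HOSTS:
        return False
    # Header names are case-insensitive in HTTP, so normalize before lookup.
    headers = {k.lower(): v for k, v in req.headers.items()}
    presented = headers.get(CREDENTIAL_HEADER)
    if presented is None:
        return False
    # Constant-time comparison of the presented key against the session token.
    return hmac.compare_digest(presented.encode(), session_token.encode())


# The exfil shape: approved destination, attacker-supplied key.
exfil = OutboundRequest("api.anthropic.com", {"X-Api-Key": "key-from-workspace-file"})
passed_by_destination = destination_filter(exfil)
passed_by_custody = custody_filter(exfil, session_token="vm-session-token")
```

Under these assumptions `passed_by_destination` is true and `passed_by_custody` is false: the host check cannot tell the two requests apart, while the credential check can, because only the VM knows which token it was provisioned with.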
The phish is the same shape on a different vector: a user-delivered payload that reads
~/.aws/credentials and POSTs it out. Model layer is blind (the user typed it); only
egress + filesystem custody hold. Two incidents, one axis.
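The two boundaries that hold against the phish can be sketched as a pair of deterministic checks. The workspace path, host names, and function names here are hypothetical, and a real sandbox enforces these rules in the OS (Seatbelt, bubblewrap, jails) rather than in application code; the sketch only shows the shape of the decision.

```python
import posixpath

WORKSPACE = "/home/user/project"  # hypothetical workspace root
EGRESS_ALLOWLIST = frozenset()    # egress denied by default: nothing is listed


def may_read(path: str) -> bool:
    """Filesystem custody: only paths inside the workspace are readable."""
    # normpath collapses ".." segments; it does not resolve symlinks,
    # which is one reason real enforcement belongs in the OS sandbox.
    norm = posixpath.normpath(path)
    return norm == WORKSPACE or norm.startswith(WORKSPACE + "/")


def may_connect(host: str) -> bool:
    """Network custody: only explicitly granted hosts are reachable."""
    return host in EGRESS_ALLOWLIST


# The phish payload needs both steps to succeed; either denial stops it.
can_read_creds = may_read("/home/user/.aws/credentials")
can_post_out = may_connect("attacker.example")
payload_succeeds = can_read_creds and can_post_out
```

Neither check looks at who typed the instruction, which is the point: the outcome does not depend on the model layer noticing anything.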
4. Where the Corpus is Already Ahead
- Agent identity. The post leaves "own principal vs inherited user permissions, probably a blend" as open. The governance tuple [persona:agent:reviewer@env(project:workspace)] already resolves it.
- Adversarial review. The post's multi-agent trust-escalation warning is the security framing of structured challenge/response between agents. CPRR's refutation step is the mechanism.
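The governance tuple in the agent-identity item above has enough visible structure to parse mechanically. The sketch below assumes only what the literal string shows: colon-separated segments, an `@env(...)` suffix, and enclosing brackets. The `GovernanceTuple` type and its field names `principal` and `env` are hypothetical labels, not the corpus's own schema, and no meaning is assigned to individual segments.

```python
import re
from typing import NamedTuple, Tuple


class GovernanceTuple(NamedTuple):
    principal: Tuple[str, ...]  # segments before "@env", split on ":"
    env: Tuple[str, ...]        # segments inside "env(...)", split on ":"


# Shape assumed from the one example in the text: [a:b:c@env(x:y)]
_PATTERN = re.compile(r"^\[([^@\[\]]+)@env\(([^()]+)\)\]$")


def parse_governance_tuple(text: str) -> GovernanceTuple:
    """Split a bracketed governance tuple into its two segment lists."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a governance tuple: {text!r}")
    return GovernanceTuple(
        principal=tuple(match.group(1).split(":")),
        env=tuple(match.group(2).split(":")),
    )


parsed = parse_governance_tuple("[persona:agent:reviewer@env(project:workspace)]")
```

Read this way, the agent carries its own principal segments alongside an explicit environment scope, so a policy check can key on the tuple itself instead of on whichever user launched the session.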
5. Related Work
- Agent Sandbox Architectures — secret-custody empty cell; this post is its empirical falsification
- Agent Permission Guardrails — probabilistic vs deterministic; MCP authz gap; wire-level PDP/PEP
- Agent Memory Architectures — persistent memory poisoning exposure
- Agentic Systems Q1 2026 — identity/governance, adversarial review, MCP security
- Agent Isolation with FreeBSD Jails — our implementation of the two-isolation requirement
- anthropic-experimental/sandbox-runtime — their open-source sandbox runtime