How We Contain Claude: Mapping Against the Stack

1. Thesis

The post is not adjacent work; it is an independent rediscovery of two load-bearing primitives already in the stack (the L4 provenance layer and blast-radius as the deployment risk function), plus one direct refutation condition aimed at the custom egress components (saproxy, netax). Its two most consequential incidents are both egress through a permitted path, which is precisely the secret-custody empty cell named in Agent Sandbox Architectures one level up, published on the same date.

Source: Anthropic Engineering: How We Contain Claude

Refutation condition for the thesis: it is wrong if any of the post's load-bearing incidents reduces to a model-layer failure (something a classifier could have caught) rather than a provenance/egress boundary failure. Neither headline incident, the api.anthropic.com exfil or the employee phish, is catchable at the model layer: by construction there is nothing anomalous for a classifier to flag. The thesis therefore holds on the evidence presented.

2. Correspondence

Post primitive | Stack artifact | Relation
--- | --- | ---
blast radius = P(fail)·damage(fail) | reliability-lab six-gate; elenctic-spec L0-L3 | Same decomposition, operationalized as deployment risk function
three components: model / environment / external content | Seven Concerns L1-L7 | Their triad coarsens the stack. env = L1; external content = L4. No analogue to L5-L7
probabilistic vs deterministic controls | Agent Permission Guardrails | Exact match. "telling != enforcing" is the guardrails piece verbatim
OS sandbox (Seatbelt/bubblewrap), egress-denied-by-default | Bastille jails (SEFACA); netax divert(4) | Convergent. netax is the homegrown egress interceptor; their primitives are the battle-tested form
two-isolation requirement (fs AND network) | Agent Sandbox Architectures | Already the reference axiom in the corpus
allowlist = capability grant, not destination filter | provenance-laundering taxonomy | The api.anthropic.com exfil is a textbook laundering case
in-VM MITM proxy "only the VM knows provenance" | Digital Shapeshifting L4; saproxy four-phase filter | Trust enforced where provenance is legible
direct prompt injection via user (phish) | role-boundary detection | Their finding (classifiers anchor on user intent) is the role-boundary failure
MCP authz optional; stdio excluded | Agent Permission Guardrails | "Every function reachable through an allowed domain" = no per-tool OAuth scope
persistent memory poisoning | JITIR Against the Field | Direct exposure. Falsification conditions already scoped; no session-startup classifier yet
multi-agent trust escalation | Agentic Q1 2026 | Epistemic labels prevent sub-agent output from being promoted to higher trust
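
The risk function in the first row of the table can be made concrete with a small worked example. This is a minimal sketch only: the `Deployment` type, the probabilities, and the damage figures are illustrative assumptions, not values from the post or from the stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment profile; every value here is illustrative."""
    name: str
    p_fail: float   # probability that the control fails, in [0, 1]
    damage: float   # damage if it fails, in arbitrary units


def blast_radius(d: Deployment) -> float:
    """Expected damage, computed as P(fail) * damage(fail)."""
    if not 0.0 <= d.p_fail <= 1.0:
        raise ValueError("p_fail must be a probability in [0, 1]")
    return d.p_fail * d.damage


# Same failure probability, different reachable damage (assumed numbers).
open_egress = Deployment("open egress", p_fail=0.01, damage=1000.0)
denied_egress = Deployment("egress denied by default", p_fail=0.01, damage=10.0)

for d in (open_egress, denied_egress):
    print(f"{d.name}: {blast_radius(d):.2f}")
```

Under these assumed numbers the probabilistic term is unchanged; only the damage term moves, which is the sense in which a deterministic control shrinks the blast radius without making the model any more reliable.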

3. The Two Egress Incidents = the Secret-Custody Empty Cell

The sandbox-systems decomposition names four isolations — compute, filesystem custody, network egress, secret custody — and identifies secret custody as the axis the field is still building, the empty cell: vendors sell the compute boundary and leave egress and credential custody to a config the operator may never write.

The post's headline incident is that empty cell getting hit in production. A malicious workspace file carried an attacker-controlled API key; Claude called the Files API; the egress proxy saw api.anthropic.com, an approved destination, and passed it; the files landed in the attacker's account. The sandbox worked; the secret custody axis did not exist. Their fix — an in-VM proxy that passes only the VM's provisioned session token and rejects an embedded key — is exactly secret custody as a first-class boundary, enforced where provenance is legible.
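
The difference between the two checks can be sketched in a few lines. This is an illustration under stated assumptions, not the post's implementation: `OutboundRequest`, `destination_only`, `custody_aware`, and the token values are hypothetical names, and the credential is assumed to travel in the `x-api-key` header.

```python
import hmac
from dataclasses import dataclass, field

ALLOWED_HOSTS = frozenset({"api.anthropic.com"})  # illustrative allowlist


@dataclass
class OutboundRequest:
    """Hypothetical view of a request as the in-VM proxy might see it."""
    host: str
    headers: dict = field(default_factory=dict)


def destination_only(req: OutboundRequest) -> bool:
    """Pre-fix behaviour: approve on destination alone."""
    return req.host in ALLOWED_HOSTS


def custody_aware(req: OutboundRequest, session_token: str) -> bool:
    """Approve only if the destination is allowed AND the credential on the
    request is the one provisioned for this VM."""
    if req.host not in ALLOWED_HOSTS:
        return False
    # Simplification: real header names are case-insensitive.
    presented = req.headers.get("x-api-key")
    if presented is None:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented.encode(), session_token.encode())


# A request carrying an attacker-supplied key to an approved destination.
attacker = OutboundRequest("api.anthropic.com", {"x-api-key": "attacker-key"})
print(destination_only(attacker))                   # allowlist alone passes it
print(custody_aware(attacker, "vm-session-token"))  # custody check rejects it
```

The destination check treats the allowlist entry as a capability grant for whoever holds any valid key; the custody check binds the grant to the one credential the VM was provisioned with.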

This is the strongest single connection in the corpus: the post documents, in production, the failure that the sandbox-systems piece predicted on the same day.

The phish is the same shape on a different vector: a user-delivered payload that reads ~/.aws/credentials and POSTs it out. The model layer is blind because the user typed the instruction; only the egress and filesystem-custody boundaries hold. Two incidents, one axis.
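
The phish chain needs two steps, a read and a send, so either boundary alone breaks it. A minimal sketch follows, assuming a hypothetical `/workspace` sandbox root and a one-entry allowlist; in practice this policy would be enforced by the OS sandbox rather than by application code.

```python
from pathlib import PurePosixPath

WORKSPACE = PurePosixPath("/workspace")               # hypothetical sandbox root
EGRESS_ALLOWLIST = frozenset({"api.anthropic.com"})   # illustrative


def may_read(path: str) -> bool:
    """Filesystem custody: only absolute paths inside the workspace."""
    p = PurePosixPath(path)
    # PurePosixPath does not resolve "..", so reject it outright.
    if not p.is_absolute() or ".." in p.parts:
        return False
    return p == WORKSPACE or WORKSPACE in p.parents


def may_connect(host: str) -> bool:
    """Network egress: denied unless explicitly allowlisted."""
    return host in EGRESS_ALLOWLIST


def exfil_succeeds(secret_path: str, dest_host: str) -> bool:
    """The chain needs BOTH steps; blocking either one breaks it."""
    return may_read(secret_path) and may_connect(dest_host)


print(exfil_succeeds("/home/user/.aws/credentials", "evil.example"))
```

Neither function inspects the instruction that triggered the read or the send, which is why the outcome does not depend on whether a classifier judged the user's request benign.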

4. Where the Corpus is Already Ahead

  • Agent identity. The post leaves "own principal vs inherited user permissions, probably a blend" open. The governance tuple [persona:agent:reviewer@env(project:workspace)] already resolves it.
  • Adversarial review. The post's multi-agent trust-escalation warning is the security framing of structured challenge/response between agents. CPRR's refutation step is the mechanism.

5. Related Work