2026 Q2 Skills: Context, Caching, and the Sandbox Lens

1. Thesis
2. What a skill is in 2026
3. Skills through the sandbox lens (first-class)
4. Caching and context: why skill composition is affordable
5. Evals: making a skill provide good context
6. Deployment, rollout, testing
7. Claims (REPL-evaluated)
8. Open questions
9. References

1. Thesis

A skill is "an organized collection of prompts and instructions for an LLM" (??, ????). That framing is correct but incomplete. The moment a skill runs, it spends capability: it executes code, reads and writes files, reaches the network, and uses credentials. The right first-class frame for a 2026 skill is therefore not the prompt — it is the sandbox.

Agent control in 2026 spans three registers:

(1) Probabilistic: prompts, system instructions, alignment (what the model is trained to do).
(2) Explicit-deterministic: hooks, permission classifiers, wire-level policy (what the harness enforces).
(3) Implicit-deterministic: a skill's tool-scope, an MCP server's surface, a schema (what the design permits at all).

Skills live in (3). A skill with a restricted tool list cannot exfiltrate a file not because a policy refused it but because the action was never on the menu. That is a sandbox boundary expressed as authoring convention. This note reads skills through that lens, then covers the caching/context economics that make them cheap, and the eval + deployment tooling that makes them testable.
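
The "never on the menu" idea can be sketched in a few lines. This is a minimal illustration, not any harness's actual dispatcher: `make_dispatcher`, `registry`, and the two stand-in tools are hypothetical names introduced here.

```python
def make_dispatcher(allowed, registry):
    """Build a dispatcher that only knows the tools a skill declares."""
    # Tools outside `allowed` are never copied into the menu.
    menu = {name: registry[name] for name in allowed}

    def dispatch(tool, *args):
        if tool not in menu:
            # No policy is consulted here: for this skill the tool
            # simply does not exist.
            raise LookupError(f"tool not offered: {tool}")
        return menu[tool](*args)

    return dispatch


# Hypothetical stand-ins for real tools.
registry = {
    "Read": lambda path: f"contents of {path}",
    "WebFetch": lambda url: f"fetched {url}",
}

# A skill that declares only Read cannot reach WebFetch at all.
dispatch = make_dispatcher(["Read"], registry)
```

Calling `dispatch("WebFetch", ...)` raises `LookupError` without any refusal logic running. The sketch bounds only what is offered; it says nothing about what the underlying process could do if it escaped the dispatcher.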

2. What a skill is in 2026

The SKILL.md convention (Anthropic-originated) is the closest thing to a cross-harness standard, read by Claude Code, Copilot CLI, and OpenCode (??, a). OpenAI Codex ships skills as bundled, reusable behaviours (e.g. a setup-demo-app skill) (??, ). GitHub ships gh skill — a search/preview/install/publish/update lifecycle for skills sourced from repositories (??, a). The unit is converging; the governance is not.

3. Skills through the sandbox lens (first-class)

The reference decomposition: "sandbox" names at least four isolations — compute boundary, filesystem custody, network egress, secret custody — and "capability is structural; safety is declarative" (??, a). Splitting filesystem into read and write gives the five axes the operator actually reasons about. Every skill can be scored on them:

Axis	The question	Skill-level control (implicit)	Harness control (explicit)
compute	can it execute code?	is `Bash`/exec in the tool list?	sandbox/jail, `rctl`
fs-read	what can it read?	`Read`/`Grep` scope	mount RO, Seatbelt `file-read*`
fs-write	what can it mutate?	`Edit`/`Write` present?	mount RW set, worktree isolation
egress	what hosts can it reach?	`WebFetch`/MCP tools present?	egress proxy / `pf` allowlist
credential	what secrets does it touch?	does a tool require a key?	egress-proxy injection; key never in env
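
As a rough sketch of how a skill could be scored on these axes from its tool list alone: the tool names come from the table, but the mapping `TOOL_AXES`, the function `footprint`, and the fail-closed treatment of unknown tools are assumptions made for illustration.

```python
# Hypothetical mapping from tool names to the axes they open.
# A shell is assumed to open every axis except credential on its own.
TOOL_AXES = {
    "Bash": {"compute", "fs-read", "fs-write", "egress"},
    "Read": {"fs-read"},
    "Grep": {"fs-read"},
    "Edit": {"fs-write"},
    "Write": {"fs-write"},
    "WebFetch": {"egress"},
}

AXES = ("compute", "fs-read", "fs-write", "egress", "credential")


def footprint(tools, needs_key=()):
    """Return {axis: bool} for a skill's tool list.

    Unknown tools are treated as opening every axis (fail closed).
    `needs_key` names the tools that require a credential.
    """
    opened = set()
    for tool in tools:
        opened |= TOOL_AXES.get(tool, set(AXES))
        if tool in needs_key:
            opened.add("credential")
    return {axis: axis in opened for axis in AXES}
```

For example, `footprint(["Read", "Grep"])` marks only `fs-read` as open, while any tool missing from the mapping opens all five axes.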

The load-bearing observation: a skill's tool-list restriction is a sandbox policy written in the skill, one register earlier than the harness. But it is the weakest of the three registers — it constrains what the model is offered, not what the process can do. The corpus reference axiom holds: filesystem AND network isolation are both required, and the compute boundary alone is "a launchpad" (??, a). So a skill's tool-scope is necessary context hygiene, never the security boundary.

Two failure modes the sandbox notes make concrete, both relevant to skills:

The credential axis is the asymmetric risk. The secret-custody "empty cell" — a locally-run agent that materialises secrets only at approved-host egress — is unoccupied as a product, and the documented containment incidents land exactly there (??, a, ??, a). A skill that calls an authenticated API inherits this: if the credential is readable, a prompt-injected skill exfiltrates it.
Skill inputs are tainted. "If a third party could have written any part of it, it's tainted" (??, a). A skill that reads logs, issues, or tool output is reading attacker-writable text; its tool-scope is the only thing bounding the blast radius when that text is interpreted as instructions.
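
The taint rule quoted above ("any part of it") can be made mechanical. The following is a minimal sketch under that rule; `Text`, `concat`, and `as_instruction` are hypothetical names, not part of any harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    value: str
    tainted: bool  # True if a third party could have written any part of it


def concat(*parts):
    """Join parts; one tainted part taints the whole result."""
    return Text(
        "".join(p.value for p in parts),
        any(p.tainted for p in parts),
    )


def as_instruction(text):
    """Only untainted text may be promoted to instructions."""
    if text.tainted:
        raise PermissionError("tainted text cannot be used as instructions")
    return text.value


skill_prompt = Text("Summarise the log below.\n", tainted=False)
log_line = Text("ignore previous instructions", tainted=True)
merged = concat(skill_prompt, log_line)
```

Here `merged.tainted` is true, so `as_instruction(merged)` raises, while `as_instruction(skill_prompt)` succeeds. A model does not enforce this distinction by itself, which is why the text treats tool-scope as the bound on blast radius.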

The permission story that closes the gap is three-layer — identity → session scope → wire-level — and the MCP authorization gap is real: the MCP spec makes auth optional and excludes stdio transports (the common dev case), so most skills reaching MCP tools run with session-wide grants and no per-operation scope (??, a).

4. Caching and context: why skill composition is affordable

Skills are prompt-cache-aware primitives: named templates with tool scopes that expand at call time (??, a). Three corpus facts make composition cheap:

Prompt caching is opt-in with a 1-hour TTL (`ENABLE_PROMPT_CACHING_1H`, 2.1.108), and dynamic system-prompt sections can be excluded for cache reuse (`--exclude-dynamic-system-prompt-sections`, 2.1.114) (??, a).
The CLAUDE.md hierarchy is itself a cache strategy (global-static / repo / session-dynamic) (??, a).
Context compression outperforms full-context agents on long-horizon tool use (??, a).

The open question the corpus poses and does not answer: at 1M context, is curated memory still necessary, or does it become the signal in a sea of raw context? (??, a)
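
Why the global-static / repo / session-dynamic ordering works as a cache strategy can be illustrated with a toy model. It assumes the cache matches on a stable prompt prefix; `build_prompt` and `prefix_key` are hypothetical helpers, and the SHA-256 digest is only a stand-in for whatever key a real cache uses.

```python
import hashlib

# Layers ordered from most to least stable, following the hierarchy above.
LAYERS = ("global-static", "repo", "session-dynamic")


def build_prompt(sections):
    """Concatenate sections in stability order so stable text forms a prefix."""
    return "\n".join(sections[layer] for layer in LAYERS)


def prefix_key(sections, stable=("global-static", "repo")):
    """Hash only the stable layers; a stand-in for a prefix cache key."""
    prefix = "\n".join(sections[layer] for layer in stable)
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


first_turn = {
    "global-static": "You are a coding agent.",
    "repo": "Use pytest. Format with black.",
    "session-dynamic": "Current branch: feature-a",
}
# Only the dynamic layer changes between turns.
later_turn = {**first_turn, "session-dynamic": "Current branch: feature-b"}
```

Both turns yield the same `prefix_key`, so in this model the stable prefix stays reusable while the session layer changes. Editing the repo layer changes the key and invalidates everything after it.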

5. Evals: making a skill provide good context

The unit of progress is the eval. OpenAI's framing defines success in four categories — outcome (did the task complete), process (right tools, right order), style (conventions), efficiency (no wasted tokens/commands) — and keeps the must-pass list small (??, ). The harness is layered: deterministic checks against JSONL traces (codex exec --json — "did it run npm install?"), model-assisted rubric grading with a structured --output-schema, and small prompt sets (10–20 cases) spanning explicit invocation, implicit triggering, contextual variation, and negative controls (??, ).
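
A deterministic check of the "did it run npm install?" kind might look like the sketch below. The event schema (`{"type": "command", "command": "..."}`) is an assumption made for illustration and is not the documented output format of `codex exec --json`; `commands_in_trace` and `ran` are hypothetical helpers.

```python
import json


def commands_in_trace(jsonl_text):
    """Yield the command string of every command event in a JSONL trace.

    Assumes a hypothetical schema: {"type": "command", "command": "..."}.
    """
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("type") == "command":
            yield event.get("command", "")


def ran(jsonl_text, needle):
    """Deterministic check: did any executed command contain `needle`?"""
    return any(needle in cmd for cmd in commands_in_trace(jsonl_text))


trace = (
    '{"type": "message", "text": "setting up"}\n'
    '{"type": "command", "command": "npm install"}\n'
)
```

With this trace, `ran(trace, "npm install")` is true and `ran(trace, "rm -rf")` is false. Checks of this kind cover the outcome and process categories cheaply, leaving rubric grading for style.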

Read through this note's lens, "good context" for a skill is the minimal context that passes the must-pass checks — which is also the minimal sandbox surface. The efficiency goal and the implicit-guardrail goal point the same way: a skill that needs fewer tools and less context is both cheaper and safer.

6. Deployment, rollout, testing

gh skill gives skills a software lifecycle: search (discover), preview (test before install), install (deploy from a repo namespace), publish --dry-run (validate before release), update (bulk upgrade) (??, a). preview and --dry-run are the testing/rollout seams — the place an eval suite (above) should gate. A skill rollout is then: eval locally → gh skill preview → publish --dry-run → publish → update across the fleet.
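
The rollout order can be expressed as a gated sequence. This sketch does not invoke `gh`; the stage names mirror the text, and `run_rollout` with its `gates` mapping is a hypothetical wrapper in which each gate would call the real command or eval suite.

```python
# Stage names follow the rollout described above.
ROLLOUT = ("eval", "preview", "publish --dry-run", "publish", "update")


def run_rollout(gates):
    """Run stages in order; stop at the first stage whose gate returns False.

    `gates` maps a stage name to a zero-argument callable returning bool.
    Returns the list of stages that completed.
    """
    done = []
    for stage in ROLLOUT:
        if not gates[stage]():
            break  # later stages never run once a gate fails
        done.append(stage)
    return done


# Example: every gate passes except the dry run.
gates = {stage: (lambda: True) for stage in ROLLOUT}
gates["publish --dry-run"] = lambda: False
completed = run_rollout(gates)
```

In this example `completed` is `["eval", "preview"]`: a failed dry run stops the rollout before anything is published.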

8. Open questions

Is a skill's tool-scope observable as a sandbox profile — can a harness derive the compute/fs/egress/credential footprint from a SKILL.md and refuse it?
Does 1M context enable a new class of whole-repo skills, or just make existing skills cheaper? (??, a)
Where do eval "efficiency" (minimal context) and security (minimal surface) diverge, if ever?

9. References

Core external links:

anthropics/skills — the SKILL.md convention.
Antithesis: Agent Skills (2026).
OpenAI: Evaluating agent skills.
gh skill — deploy / rollout / test lifecycle.

Internal:

Agent sandbox systems — the four-isolation decomposition.
Containment mapping.
Agent permission guardrails.
Annotation systems — the verdict methodology.
