2026 Q2 Skills: Context, Caching, and the Sandbox Lens
Table of Contents
- 1. Thesis
- 2. What a skill is in 2026
- 3. Skills through the sandbox lens (first-class)
- 4. Caching and context: why skill composition is affordable
- 5. Evals: making a skill provide good context
- 6. Deployment, rollout, testing
- 7. Claims (REPL-evaluated)
- 7.1. Skills are named prompt templates with tool scopes
- 7.2. SKILL.md is the cross-harness skill convention
- 7.3. Prompt caching is opt-in, 1h TTL, with cache-reuse flags
- 7.4. All six surveyed CLI agents support MCP
- 7.5. MCP authorization is optional and excludes stdio
- 7.6. Capability is structural; the secret-custody empty cell is unoccupied
- 7.7. Skill inputs are tainted by construction
- 7.8. OpenAI eval-skills: four success categories + layered harness
- 7.9. gh skill provides a search/preview/install/publish/update lifecycle
- 8. Open questions
- 9. References
1. Thesis
A skill is "an organized collection of prompts and instructions for an LLM" (??, ????). That framing is correct but incomplete. The moment a skill runs, it spends capability: it executes code, reads and writes files, reaches the network, and uses credentials. The right first-class frame for a 2026 skill is therefore not the prompt — it is the sandbox.
Agent control in 2026 spans three registers:
- Probabilistic — prompts, system instructions, alignment (what the model is trained to do).
- Explicit-deterministic — hooks, permission classifiers, wire-level policy (what the harness enforces).
- Implicit-deterministic — a skill's tool-scope, an MCP server's surface, a schema (what the design permits at all).
Skills live in the third register, implicit-deterministic. A skill with a restricted tool list cannot exfiltrate a file not because a policy refused it but because the action was never on the menu. That is a sandbox boundary expressed as authoring convention. This note reads skills through that lens, then covers the caching/context economics that make them cheap, and the eval and deployment tooling that makes them testable.
2. What a skill is in 2026
The SKILL.md convention (Anthropic-originated) is the closest thing to a
cross-harness standard, read by Claude Code, Copilot CLI, and OpenCode
(??, a). OpenAI Codex ships skills as bundled, reusable behaviours
(e.g. a setup-demo-app skill) (??, ). GitHub ships gh skill
— a search/preview/install/publish/update lifecycle for skills sourced from
repositories (??, a). The unit is converging; the governance is not.
3. Skills through the sandbox lens (first-class)
The reference decomposition: "sandbox" names at least four isolations — compute boundary, filesystem custody, network egress, secret custody — and "capability is structural; safety is declarative" (??, a). Splitting filesystem into read and write gives the five axes the operator actually reasons about. Every skill can be scored on them:
| Axis | The question | Skill-level control (implicit) | Harness control (explicit) |
|---|---|---|---|
| compute | can it execute code? | is Bash/exec in the tool list? | sandbox/jail, rctl |
| fs-read | what can it read? | Read/Grep scope | mount RO, Seatbelt file-read* |
| fs-write | what can it mutate? | Edit/Write present? | mount RW set, worktree isolation |
| egress | what hosts can it reach? | WebFetch/MCP tools present? | egress proxy / pf allowlist |
| credential | what secrets does it touch? | does a tool require a key? | egress-proxy injection; key never in env |
The load-bearing observation: a skill's tool-list restriction is a sandbox policy written in the skill, one register earlier than the harness. But it is the weakest of the three registers — it constrains what the model is offered, not what the process can do. The corpus reference axiom holds: filesystem AND network isolation are both required, and the compute boundary alone is "a launchpad" (??, a). So a skill's tool-scope is necessary context hygiene, never the security boundary.
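The scoring in the table can be sketched in a few lines of Python. This is a minimal illustration, not a harness feature: the tool names are the ones in the table, while the axis mapping, the `needs_key` parameter, and the `mcp__` naming prefix for MCP tools are assumptions made for the example.

```python
# Tool names come from the table above. The axis mapping and the "mcp__"
# prefix convention are illustrative assumptions, not a published schema.
AXIS_TOOLS = {
    "compute": {"Bash"},
    "fs_read": {"Read", "Grep"},
    "fs_write": {"Edit", "Write"},
    "egress": {"WebFetch"},
}


def footprint(tools, needs_key=frozenset()):
    """Return which of the five axes a declared tool list opens.

    `needs_key` is a hypothetical set of tool names known to require a secret.
    """
    declared = set(tools)
    axes = {axis: bool(declared & names) for axis, names in AXIS_TOOLS.items()}
    # Assumption: any MCP tool may reach the network, so it counts as egress.
    if any(name.startswith("mcp__") for name in declared):
        axes["egress"] = True
    axes["credential"] = bool(declared & set(needs_key))
    return axes
```

Note what the sketch does not capture: it reports what the model is offered, not what the process can do. A skill that declares Bash opens only the compute axis here, yet without harness-level filesystem and network isolation that one axis can reach all the others.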
Two failure modes the sandbox notes make concrete, both relevant to skills:
- The credential axis is the asymmetric risk. The secret-custody "empty cell" — a locally-run agent that materialises secrets only at approved-host egress — is unoccupied as a product, and the documented containment incidents land exactly there (??, a, ??, a). A skill that calls an authenticated API inherits this: if the credential is readable, a prompt-injected skill exfiltrates it.
- Skill inputs are tainted. "If a third party could have written any part of it, it's tainted" (??, a). A skill that reads logs, issues, or tool output is reading attacker-writable text; its tool-scope is the only thing bounding the blast radius when that text is interpreted as instructions.
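The taint rule quoted above is a propagation rule, and it can be sketched as such. The `Text` type and both function names below are hypothetical, introduced only for illustration; the point is that any tainted part taints the whole context, and that the declared tool scope is then the set of actions injected text could drive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Text:
    value: str
    tainted: bool  # True if a third party could have written any part of it


def combine(*parts):
    """Concatenate parts; the result is tainted if any single part is."""
    return Text("".join(p.value for p in parts), any(p.tainted for p in parts))


def blast_radius(context, tool_scope):
    """Tools that injected instructions could reach via this context.

    Untainted context exposes nothing to an attacker; tainted context
    exposes exactly the skill's declared tool scope.
    """
    return frozenset(tool_scope) if context.tainted else frozenset()
```

For example, combining a trusted prompt with a log excerpt yields a tainted context, so a skill scoped to Read alone has a blast radius of Read, while the same skill with Bash in scope has a blast radius that includes arbitrary execution.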
The permission story that closes the gap is three-layer — identity → session scope
→ wire-level — and the MCP authorization gap is real: the MCP spec makes auth
optional and excludes stdio transports (the common dev case), so most skills
reaching MCP tools run with session-wide grants and no per-operation scope
(??, a).
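A minimal sketch of the three-layer check follows, assuming one plausible reading of each layer: identity as a known principal, session scope as the set of tools granted for the session, and wire-level as a host allowlist. The class and field names are hypothetical and do not come from the MCP spec or any harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    principal: str  # who is asking
    tool: str       # which tool the skill wants to call
    host: str       # where the call would go on the wire


@dataclass(frozen=True)
class Policy:
    known_principals: frozenset
    session_tools: frozenset
    allowed_hosts: frozenset


def authorize(req, policy):
    """Check the layers in order; report the first layer that refuses."""
    if req.principal not in policy.known_principals:
        return False, "identity"
    if req.tool not in policy.session_tools:
        return False, "session scope"
    if req.host not in policy.allowed_hosts:
        return False, "wire-level"
    return True, "ok"
```

The gap described above corresponds to running with only the middle check: a session-wide grant with no identity layer and no per-operation decision at the wire.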
4. Caching and context: why skill composition is affordable
Skills are prompt-cache-aware primitives — named templates with tool scopes that
expand at call time (??, a). Three corpus facts make composition
cheap: prompt caching is opt-in with a 1-hour TTL
(ENABLE_PROMPT_CACHING_1H, 2.1.108) and dynamic system-prompt sections can be
excluded for cache reuse (--exclude-dynamic-system-prompt-sections, 2.1.114)
(??, a); the CLAUDE.md hierarchy is itself a cache strategy
(global-static / repo / session-dynamic) (??, a); and
context compression outperforms full-context agents on long-horizon tool use
(??, a). The open question the corpus poses and does not answer:
at 1M context, is curated memory still necessary, or does it become the signal in a
sea of raw context? (??, a)
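The idea that the hierarchy is itself a cache strategy can be illustrated by ordering the layers from most to least stable. In this sketch a SHA-256 of the static prefix stands in for "the provider sees an identical prefix"; how a real provider keys its cache is not specified in the sources above, and `assemble` is a hypothetical helper.

```python
import hashlib


def assemble(global_static, repo, session_dynamic):
    """Order layers from most to least stable and return (prefix_key, prompt).

    The key covers only the global-static and repo layers, so per-session
    text appended afterwards cannot change it.
    """
    prefix = global_static + "\n" + repo
    key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return key, prefix + "\n" + session_dynamic
```

Two sessions in the same repository produce the same key and differ only in the tail; editing the repo layer changes the key and so invalidates everything after it. That is the reason to keep volatile material out of the early layers.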
5. Evals: making a skill provide good context
The unit of progress is the eval. OpenAI's framing defines success in four
categories — outcome (did the task complete), process (right tools, right
order), style (conventions), efficiency (no wasted tokens/commands) — and keeps
the must-pass list small (??, ). The harness is layered:
deterministic checks against JSONL traces (codex exec --json — "did it run
npm install?"), model-assisted rubric grading with a structured --output-schema,
and small prompt sets (10–20 cases) spanning explicit invocation, implicit
triggering, contextual variation, and negative controls (??, ).
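The deterministic layer can be sketched as a scan over a JSONL trace. The event shape used here, one JSON object per line with `type` and `command` fields, is an assumption for illustration and is not the documented output of codex exec --json; the field names would need adapting to the real trace.

```python
import json


def ran_command(trace_jsonl, needle):
    """Return True if any command event in the trace contains `needle`.

    Assumes a hypothetical schema: {"type": "command", "command": "..."}.
    """
    for line in trace_jsonl.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between events
        event = json.loads(line)
        if event.get("type") == "command" and needle in event.get("command", ""):
            return True
    return False
```

A must-pass check such as "did it run npm install?" then becomes `ran_command(trace, "npm install")`, and a negative control is the same call asserted false.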
Read through this note's lens, "good context" for a skill is the minimal context that passes the must-pass checks — which is also the minimal sandbox surface. The efficiency goal and the implicit-guardrail goal point the same way: a skill that needs fewer tools and less context is both cheaper and safer.
6. Deployment, rollout, testing
gh skill gives skills a software lifecycle: search (discover), preview (test
before install), install (deploy from a repo namespace), publish --dry-run
(validate before release), update (bulk upgrade) (??, a). preview and
--dry-run are the testing/rollout seams — the place an eval suite (above) should
gate. A skill rollout is then: eval locally → gh skill preview → publish
--dry-run → publish → update across the fleet.
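The rollout order above can be sketched as a gated sequence that stops at the first failing stage. The stage labels mirror the commands named in this section; the `run` callback is hypothetical and is where a real implementation would invoke the eval suite or the CLI. No arguments are shown because their exact shape is not given in the sources above.

```python
# Stage labels follow the rollout order described above.
STAGES = [
    "eval",
    "gh skill preview",
    "gh skill publish --dry-run",
    "gh skill publish",
    "gh skill update",
]


def rollout(run):
    """Run stages in order via `run(stage) -> bool`; stop at the first failure.

    Returns the list of stages that completed successfully.
    """
    completed = []
    for stage in STAGES:
        if not run(stage):
            break
        completed.append(stage)
    return completed
```

The design choice is that a stage never runs unless every earlier gate passed, so a failed dry-run leaves nothing published.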
7. Claims (REPL-evaluated)
Verdicts produced 2026-06-24. correct = the cited term/claim was found in the
cited corpus note by REPL (wal-sh.site.org/read-all + slurp + substring);
attributed = sourced from a fetched primary document, not independently executed.
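The "correct" verdict reduces to a substring check over the note's text. The sketch below is a Python analogue of that read, slurp, and substring step, not the REPL code that produced the verdicts; the function name is hypothetical and case-sensitive matching is an assumption.

```python
def verdict(note_text, terms):
    """Return "correct" if every term occurs in the note text, else list the misses."""
    missing = [term for term in terms if term not in note_text]
    if not missing:
        return "correct"
    return "missing: " + ", ".join(missing)
```

A substring hit shows that the term appears in the cited note, which is a weaker statement than showing the note supports the claim; the verdicts below should be read with that limit in mind.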
7.1. Skills are named prompt templates with tool scopes
REPL confirmed "prompt template" and "tool scope" in
2026-q2-claude-code-features. (??, a)
7.2. SKILL.md is the cross-harness skill convention
REPL confirmed "SKILL.md" in 2026-q2-cli-coding-agents. (??, a)
7.3. Prompt caching is opt-in, 1h TTL, with cache-reuse flags
REPL found ENABLE_PROMPT_CACHING_1H and --exclude-dynamic-system-prompt-sections
in claude-code-workshop-2026; absent from 2026-q2-claude-code-features. Citation
corrected accordingly. (??, a)
7.4. All six surveyed CLI agents support MCP
REPL confirmed "all six" and "MCP" in 2026-q2-cli-coding-agents. (??, a)
7.5. MCP authorization is optional and excludes stdio
REPL confirmed "stdio", "OAuth 2.1", "optional" in
2026-agent-permission-guardrails. (??, a)
7.6. Capability is structural; the secret-custody empty cell is unoccupied
REPL confirmed "secret custody", "egress proxy", "structural", "empty cell",
"compute boundary" in 2026-agent-sandbox-systems. (??, a)
7.7. Skill inputs are tainted by construction
REPL confirmed "tainted", "prompt injection", "domain" in
tainted-data-llm-pipelines. (??, a)
7.8. OpenAI eval-skills: four success categories + layered harness
Primary source: developers.openai.com/blog/eval-skills: outcome/process/style/efficiency; deterministic JSONL-trace checks + rubric grading + 10–20 case sets. Not independently executed. (??, )
7.9. gh skill provides a search/preview/install/publish/update lifecycle
Primary source: cli.github.com/manual/gh_skill (preview). Subcommands install,
list, preview, search, publish (--dry-run), update. Preview only; not run
locally. (??, a)
8. Open questions
- Is a skill's tool-scope observable as a sandbox profile: can a harness derive the compute/fs/egress/credential footprint from a SKILL.md and refuse it?
- Does 1M context enable a new class of whole-repo skills, or just make existing skills cheaper? (??, a)
- Where do eval "efficiency" (minimal context) and security (minimal surface) diverge, if ever?
9. References
Core external links:
- anthropics/skills — the SKILL.md convention.
- Antithesis: Agent Skills (2026).
- OpenAI: Evaluating agent skills.
- gh skill — deploy / rollout / test lifecycle.
Internal:
- Agent sandbox systems — the four-isolation decomposition.
- Containment mapping.
- Agent permission guardrails.
- Annotation systems — the verdict methodology.