2026 Q2 Skills: Context, Caching, and the Sandbox Lens

Table of Contents

1. Thesis

A skill is "an organized collection of prompts and instructions for an LLM" (??, ????). That framing is correct but incomplete. The moment a skill runs, it spends capability: it executes code, reads and writes files, reaches the network, and uses credentials. The right first-class frame for a 2026 skill is therefore not the prompt — it is the sandbox.

Agent control in 2026 spans three registers:

  1. Probabilistic — prompts, system instructions, alignment (what the model is trained to do).
  2. Explicit-deterministic — hooks, permission classifiers, wire-level policy (what the harness enforces).
  3. Implicit-deterministic — a skill's tool-scope, an MCP server's surface, a schema (what the design permits at all).

Skills live in (3). A skill with a restricted tool list cannot exfiltrate a file not because a policy refused it but because the action was never on the menu. That is a sandbox boundary expressed as authoring convention. This note reads skills through that lens, then covers the caching/context economics that make them cheap, and the eval + deployment tooling that makes them testable.

2. What a skill is in 2026

The SKILL.md convention (Anthropic-originated) is the closest thing to a cross-harness standard, read by Claude Code, Copilot CLI, and OpenCode (??, a). OpenAI Codex ships skills as bundled, reusable behaviours (e.g. a setup-demo-app skill) (??, ). GitHub ships gh skill — a search/preview/install/publish/update lifecycle for skills sourced from repositories (??, a). The unit is converging; the governance is not.

3. Skills through the sandbox lens (first-class)

The reference decomposition: "sandbox" names at least four isolations — compute boundary, filesystem custody, network egress, secret custody — and "capability is structural; safety is declarative" (??, a). Splitting filesystem into read and write gives the five axes the operator actually reasons about. Every skill can be scored on them:

Axis The question Skill-level control (implicit) Harness control (explicit)
compute can it execute code? is =Bash=/exec in the tool list? sandbox/jail, rctl
fs-read what can it read? Read=/=Grep scope mount RO, Seatbelt file-read*
fs-write what can it mutate? Edit=/=Write present? mount RW set, worktree isolation
egress what hosts can it reach? =WebFetch=/MCP tools present? egress proxy / pf allowlist
credential what secrets does it touch? does a tool require a key? egress-proxy injection; key never in env

The load-bearing observation: a skill's tool-list restriction is a sandbox policy written in the skill, one register earlier than the harness. But it is the weakest of the three registers — it constrains what the model is offered, not what the process can do. The corpus reference axiom holds: filesystem AND network isolation are both required, and the compute boundary alone is "a launchpad" (??, a). So a skill's tool-scope is necessary context hygiene, never the security boundary.

Two failure modes the sandbox notes make concrete, both relevant to skills:

  • The credential axis is the asymmetric risk. The secret-custody "empty cell" — a locally-run agent that materialises secrets only at approved-host egress — is unoccupied as a product, and the documented containment incidents land exactly there (??, a, ??, a). A skill that calls an authenticated API inherits this: if the credential is readable, a prompt-injected skill exfiltrates it.
  • Skill inputs are tainted. "If a third party could have written any part of it, it's tainted" (??, a). A skill that reads logs, issues, or tool output is reading attacker-writable text; its tool-scope is the only thing bounding the blast radius when that text is interpreted as instructions.

The permission story that closes the gap is three-layer — identity → session scope → wire-level — and the MCP authorization gap is real: the MCP spec makes auth optional and excludes stdio transports (the common dev case), so most skills reaching MCP tools run with session-wide grants and no per-operation scope (??, a).

4. Caching and context: why skill composition is affordable

Skills are prompt-cache-aware primitives — named templates with tool scopes that expand at call time (??, a). Three corpus facts make composition cheap: prompt caching is opt-in with a 1-hour TTL (ENABLE_PROMPT_CACHING_1H, 2.1.108) and dynamic system-prompt sections can be excluded for cache reuse (--exclude-dynamic-system-prompt-sections, 2.1.114) (??, a); the CLAUDE.md hierarchy is itself a cache strategy (global-static / repo / session-dynamic) (??, a); and context compression outperforms full-context agents on long-horizon tool use (??, a). The open question the corpus poses and does not answer: at 1M context, is curated memory still necessary, or does it become the signal in a sea of raw context? (??, a)

5. Evals: making a skill provide good context

The unit of progress is the eval. OpenAI's framing defines success in four categories — outcome (did the task complete), process (right tools, right order), style (conventions), efficiency (no wasted tokens/commands) — and keeps the must-pass list small (??, ). The harness is layered: deterministic checks against JSONL traces (codex exec --json — "did it run npm install?"), model-assisted rubric grading with a structured --output-schema, and small prompt sets (10–20 cases) spanning explicit invocation, implicit triggering, contextual variation, and negative controls (??, ).

Read through this note's lens, "good context" for a skill is the minimal context that passes the must-pass checks — which is also the minimal sandbox surface. The efficiency goal and the implicit-guardrail goal point the same way: a skill that needs fewer tools and less context is both cheaper and safer.

6. Deployment, rollout, testing

gh skill gives skills a software lifecycle: search (discover), preview (test before install), install (deploy from a repo namespace), publish --dry-run (validate before release), update (bulk upgrade) (??, a). preview and --dry-run are the testing/rollout seams — the place an eval suite (above) should gate. A skill rollout is then: eval locally → gh skill previewpublish --dry-runpublishupdate across the fleet.

7. Claims (REPL-evaluated)

Verdicts produced 2026-06-24. correct = the cited term/claim was found in the cited corpus note by REPL (wal-sh.site.org/read-all + slurp + substring); attributed = sourced from a fetched primary document, not independently executed.

7.1. Skills are named prompt templates with tool scopes

REPL confirmed "prompt template" and "tool scope" in 2026-q2-claude-code-features. (??, a)

7.2. SKILL.md is the cross-harness skill convention

REPL confirmed "SKILL.md" in 2026-q2-cli-coding-agents. (??, a)

7.3. Prompt caching is opt-in, 1h TTL, with cache-reuse flags

REPL found ENABLE_PROMPT_CACHING_1H and --exclude-dynamic-system-prompt-sections in claude-code-workshop-2026; absent from 2026-q2-claude-code-features. Citation corrected accordingly. (??, a)

7.4. All six surveyed CLI agents support MCP

REPL confirmed "all six" and "MCP" in 2026-q2-cli-coding-agents. (??, a)

7.5. MCP authorization is optional and excludes stdio

REPL confirmed "stdio", "OAuth 2.1", "optional" in 2026-agent-permission-guardrails. (??, a)

7.6. Capability is structural; the secret-custody empty cell is unoccupied

REPL confirmed "secret custody", "egress proxy", "structural", "empty cell", "compute boundary" in 2026-agent-sandbox-systems. (??, a)

7.7. Skill inputs are tainted by construction

REPL confirmed "tainted", "prompt injection", "domain" in tainted-data-llm-pipelines. (??, a)

7.8. OpenAI eval-skills: four success categories + layered harness

Primary source: developers.openai.com/blog/eval-skills — outcome/process/style/ efficiency; deterministic JSONL-trace checks + rubric grading + 10–20 case sets. Not independently executed. (??, )

7.9. gh skill provides a search/preview/install/publish/update lifecycle

Primary source: cli.github.com/manual/gh_skill (preview). Subcommands install, list, preview, search, publish (--dry-run), update. Preview only; not run locally. (??, a)

8. Open questions

  • Is a skill's tool-scope observable as a sandbox profile — can a harness derive the compute/fs/egress/credential footprint from a SKILL.md and refuse it?
  • Does 1M context enable a new class of whole-repo skills, or just make existing skills cheaper? (??, a)
  • Where does eval "efficiency" (minimal context) and security (minimal surface) diverge, if ever?

9. References

Core external links:

Internal: