Annotation Systems: Operational Specification v3
Table of Contents
- 1. Clojure Reader for the Annotation Layer
- 2. Verdict Vocabulary
- 3. Literate Validation Loop
- 4. Process Findings: Multi-Agent Annotation in Practice
- 4.1. One verdict per heading assumes one reviewer
- 4.2. The verdict vocabulary has already drifted
- 4.3. Validation tooling needs its own validation
- 4.4. Annotating a conjecture is not verifying a fact
- 4.5. Unstructured agent output self-contradicts
- 4.6. Point-in-time claims are unverifiable on a single host
- 5. Protocol Proposals: Filling the Five Gaps
- 6. Field Report: Applied to a Live FreeBSD/Guile Crash (2026-06-22)
- 7. Protocol Proposals: Filling Gaps from Field Use (2026-06-22)
- 8. Validation Ledger
- 9. References
See also: Annotation Systems for Evolving Documents (problem statement, anchor comparison, recommendation) | L7 Methodology Review (2026-06-22)
This specification covers operational detail: the verdict vocabulary, property drawer conventions, Clojure reader tooling, gap proposals from multi-agent use, a FreeBSD field report, and the literate validation loop. The problem framing and anchor comparison are in index.org.
1. Clojure Reader for the Annotation Layer
A convention is only as good as the tooling that queries it. The recommendation
in the problem-statement note – :CUSTOM_ID: plus property drawers – is
greppable, and that was offered as a feature. But grep cannot tell a real heading
drawer from a #+begin_example block demonstrating the convention; it cannot
count verdicts; it cannot reconcile a heading's recorded verdict against the audit
ledger. To make annotations first-class, the site's org reader (wal-sh.site.org)
now parses the drawer attached to each heading, so the annotation layer becomes
queryable data rather than text to search.
The parser walks headings and, when a :PROPERTIES: drawer sits directly under
one, captures its keys. Review keys mark a heading as annotated:
(def annotation-keys #{:review :status :verdict :verified_at :verified_by :finding}) (defn annotated-headings [doc] (for [h (:headings doc) :let [p (:properties h)] :when (some annotation-keys (keys p))] {:path (:path doc) :custom-id (:custom_id p) :heading (:text h) :verdict (:verdict p) :verified-by (:verified_by p) :status (:status p) :finding (:finding p)}))
Loading the corpus in the REPL (the site runs a persistent Clojure REPL in tmux; this transcript is from it):
(require '[wal-sh.site.org :as o]) (def docs (o/read-all "site")) ;=> 751 docs (o/verdict-summary docs) ;;=> {:correct 4, :corrected 2, :unverified 9} (count (o/annotations docs)) ;;=> 15 (frequencies (map :verified-by (filter :verdict (o/annotations docs)))) ;;=> {"skeptic-agent" 2, "vera-agent" 3, "claude-agent" 1} (map (juxt :custom-id :verdict) (filter #(= "corrected" (:verdict %)) (o/annotations docs))) ;;=> (["org-annotate-file" "corrected"] ["cljs-advantage" "corrected"])
The reader counts six recorded verdicts (four correct, two corrected) across
the corpus, attributed to three agents – and nine further headings annotated with
other keys but not yet given a verdict. One of the two corrected findings is
org-annotate-file: the vera-agent correction in the problem-statement note
(the MELPA-package mix-up). The annotation is data the document can read about itself.
The precision this buys over grep: a raw grep ':VERDICT:' returns nine hits, but
three of those are inside the #+begin_example blocks illustrating the convention.
The reader captures only drawers genuinely attached to a heading – example headings
are comma-escaped (,*) and do not match – so the six it counts are real
annotations, not documentation about annotations.
1.1. From annotations to the grind
Three functions turn the layer into work for the secondary-review agent team and the audit ledger:
;; the allocation queue: headings explicitly flagged for attention (o/needs-review docs) ; :STATUS: needs-citation or a disputed verdict ;;=> [] (no outstanding flags right now -- empty queue) ;; broader candidates: anchored headings with no verdict yet (o/unverified-anchored docs) ; have :CUSTOM_ID:, lack :VERDICT: ;; tamper-evidence: drawers whose verdict the hash-chain ledger does not record (o/reconcile-ledger docs (map :subject (read-chain ".verify/chain.jsonl")))
needs-review is the input to the per-heading fan-out: the research-editor
orchestrator allocates each flagged heading to a citation-auditor or
claim-validator, and each verdict it returns is written back as a :VERDICT:
drawer and appended to .verify/chain.jsonl (the hash chain described in the
project's verification-ledger spec). reconcile-ledger closes the loop: it lists
any heading claiming a verdict the tamper-evident trail does not record, so the
drawer and the ledger cannot silently diverge.
The annotation layer, then, is not just where findings are stored. It is the
queue the grind draws from and the surface the audit checks – both reachable from
one read-all.
2. Verdict Vocabulary
Three vocabularies were in production use after the first multi-agent session, with no enforcement mechanism. This section defines the controlled vocabulary.
2.1. Prose review verdicts (:VERDICT:)
The :VERDICT: field in property drawers accepts exactly these values. No agent
may introduce a new value without updating this list and the parse test in the
Makefile:
| Value | Meaning |
|---|---|
correct |
Claim verified against a primary source; no changes required |
corrected |
Claim was wrong; the heading was updated to reflect the finding |
disputed |
Conflicting primary sources; no clear resolution |
needs-citation |
Claim is plausible but no primary source was found |
verified |
Machine execution confirmed (lean, dafny, REPL output) |
reproduced |
Failure observed firsthand on this machine at this version |
attributed |
Claim sourced from upstream; not independently verified locally |
speculative |
Explicit conjecture; basis stated; no contradicting evidence |
Verdicts not in this table are invalid. An agent that encounters an unlisted
verdict in an existing drawer must normalize it to the nearest value and note the
normalization in :FINDING:.
2.2. Execution verdicts (:EXEC_VERDICT:)
The :EXEC_ prefix separates machine execution from human review. A heading can
carry both:
* Involution :PROPERTIES: :CUSTOM_ID: involution :VERDICT: corrected :FINDING: String.reverse → stringReverse (API drift) :EXEC_VERDICT: verified :EXEC_FINDING: lean4: exit=0 in 4823ms, 3 blocks :END:
EXEC_VERDICT vocabulary:
| Value | Meaning |
|---|---|
verified |
Code runs; exit 0 and behavior matches claim |
broken |
Code does not run or exits non-zero unexpectedly |
crashed |
Code reproducibly crashes (SIGSEGV, SIGABRT, etc.); crash is the expected evidence |
partial |
Some blocks verified; others skipped or broken |
skipped |
Execution not attempted (environment unavailable) |
2.3. Required drawer fields by verdict
Every drawer that carries :VERDICT: must also carry:
:CUSTOM_ID:on the heading (stable kebab-case; never update after creation):VERIFIED_AT:in ISO-8601 format with seconds:YYYY-MM-DDTHH:MM:SSZ:VERIFIED_BY:the agent or tool that produced the verdict
Verdicts corrected and disputed additionally require:
:FINDING:a single-line summary of what was found. For multiple observations, use:FINDING+:for each additional line, not comma-separated values in one:FINDING:.
Verdicts corrected and disputed additionally benefit from:
:REASONING:a one-line summary of the search or source consulted (if the reasoning warrants more than one line, use:REASONING_REF:pointing to a file under.verify/reasoning/)
No drawer field should be padded: :VERDICT: correct is wrong. Use
:VERDICT: correct (single space after the colon).
2.4. Timestamp format
:VERIFIED_AT: must use ISO-8601 with full seconds in UTC:
:VERIFIED_AT: 2026-06-21T14:30:00Z
Not accepted: missing seconds (2026-06-21T14:30Z), date only (2026-06-21),
prose format (June 21, 2026). The staleness test sorts by :VERIFIED_AT:;
inconsistent precision produces incorrect staleness ordering.
3. Literate Validation Loop
Research notes on this site contain executable code blocks – Lean 4 proofs,
Clojure REPL transcripts, Dafny contracts, dot diagrams. The annotation
convention extends to these: a heading with code blocks can carry both a prose
verdict (:VERDICT:) and an execution verdict (:EXEC_VERDICT:).
The validation loop runs in a clean room:
- Copy the org file to
mktemp -d(isolation, not trust) - Tangle with
emacs --batch+org-babel-tangle - Detect languages from
#+begin_srcheaders - Scaffold per-language toolchain (
lake init,deps.edn, etc.) - Build/verify (
lean,dafny verify,clj,python3,bash) - Annotate headings with
:EXEC_VERDICT:and:EXEC_FINDING: - Append to
.verify/chain.jsonl(tamper-evident ledger) - Remove the tmpdir
3.1. Backfill candidates
Pages with heavy code examples that should be validated:
- Lean 4 codec proofs – 8 blocks, validated 2026-06-12 (the first use)
- Dafny–Clojure interop – Clojure + Dafny blocks
- Reversible pipeline transforms – Clojure REPL transcripts
- Keeler theorem – Lean 4 section, ClojureScript
- pocket-es – ClojureScript + Node
The skill /validate-literate automates this pipeline. Run with
gmake validate-literate FILE=path or /validate-literate path
from Claude Code.
4. Process Findings: Multi-Agent Annotation in Practice
A four-team annotation run (validation / time / versions / ui over a speculative thesis) plus a parallel five-agent citation audit, both on 2026-06-13, exposed gaps in the annotation/validation process itself. Recorded so the schema can absorb them.
4.1. One verdict per heading assumes one reviewer
The property-drawer convention (:VERDICT: / :FINDING: / :VERIFIED_BY:) models
a single review. Four teams reviewing one heading along different axes collide on
:VERDICT:. The run used namespaced keys (:VALIDATION:, :TIME:, :VERSIONS:,
:UI:) with a rolled-up :VERDICT: on the parent – but no aggregation rule is
defined: is a heading "disputed" if any axis disputes, or only the majority? The
schema needs first-class multi-axis review, or a stated roll-up function.
4.2. The verdict vocabulary has already drifted
Three vocabularies are in use: the drawer spec says correct | corrected |
disputed | needs-citation (four); verify-chain accepts those plus verified
(five); EXEC_VERDICT uses verified | broken | crashed | partial | skipped
(see Gap 8). The teams emitted a sixth value, unverifiable, that no spec
lists. Pick one controlled vocabulary and validate against it, or the ledger's
verdicts are not comparable.
4.3. Validation tooling needs its own validation
The mechanical citation-integrity check used find -name references.bib, which
silently missed references-category-theory.bib and reported four false
"dangling" citations. Caught only by manual follow-up. This mirrors the
production-validation harness's B1 self-review catch: a checker's own bug reads as
a finding. The checker, not just the artifact, must be reviewed adversarially.
4.4. Annotating a conjecture is not verifying a fact
disputed on a speculative thesis conflates "unproven conjecture" with "factually
wrong." The reviewed thesis carried :STATUS: speculative, but the per-axis
verdicts do not separate "the claim is false" from "the claim is unproven." A
speculative claim needs a status axis orthogonal to the factual verdict.
4.5. Unstructured agent output self-contradicts
A bib auditor reported "1 unverifiable" then immediately "0 unverifiable" in the same message. Review tallies should be schema-constrained (structured output) so a review's own arithmetic cannot disagree with itself.
4.6. Point-in-time claims are unverifiable on a single host
Cross-harness flag claims (e.g. gemini -r, codex resume) cannot be checked on
a host that has only claude installed. They are dated research assertions that
drift by version and need a scheduled research-agent pass against primary docs –
not a one-time table. (Deferred.)
5. Protocol Proposals: Filling the Five Gaps
The process findings above named the gaps; this section proposes concrete mechanisms for each. Each proposal specifies org-mode syntax, a resolution algorithm, and the minimal tooling required.
5.1. Gap 1: Verdict Aggregation Rules
When multiple agents review the same heading along different axes, the current
schema collapses all verdicts into one :VERDICT: field. This produces a collision
rather than a synthesis.
The W3C Web Annotation Data Model (Sanderson, Ciccarese, and Young 2017) handles multi-reviewer scenarios by treating each annotation as a separate resource with its own target and body. Aggregation is left to consuming applications. Academic review systems (USENIX, IEEE, OOPSLA) similarly store per-reviewer scores independently and compute a meta-score during the editorial phase – the review records do not merge; only the decision does.
For org-mode property drawers, the proposed approach uses namespaced per-axis verdicts and an explicit roll-up rule stored as a property:
* Cross-Platform Build Claim :PROPERTIES: :CUSTOM_ID: cross-platform-build-claim :REVIEW_AXIS_FACTUAL: correct :REVIEW_BY_FACTUAL: claim-validator :REVIEW_AXIS_VERSIONS: disputed :REVIEW_BY_VERSIONS: versions-agent :REVIEW_AXIS_PLATFORM: needs-citation :REVIEW_BY_PLATFORM: platform-agent :VERDICT_RULE: pessimistic-any :VERDICT: disputed :VERDICT_COMPUTED_AT: 2026-06-21T14:00Z :END:
The :VERDICT_RULE: field specifies the aggregation function. Three rules cover
the common cases:
| Rule | Algorithm | When to use |
|---|---|---|
pessimistic-any |
disputed if any axis is disputed or needs-citation | Security or factual accuracy |
majority |
verdict held by the plurality of axes | Low-stakes editorial review |
axis:<name> |
verdict equals the named axis, ignoring others | When one axis is authoritative |
The roll-up is computed by the research-editor orchestrator after all
per-axis agents report. It writes :VERDICT: and :VERDICT_COMPUTED_AT:.
Per-axis verdicts remain in the drawer as the audit trail.
Clojure query to find headings where the roll-up differs from the strictest per-axis verdict:
(defn aggregation-drift [annotations] (for [{:keys [custom-id verdict] :as a} annotations :let [axes (filter #(re-find #"^:review_axis_" (name (key %))) (seq (:properties a))) strictest (-> (map val axes) (sort-by {"correct" 3 "corrected" 2 "needs-citation" 1 "disputed" 0}) first)] :when (and strictest (not= verdict strictest))] {:id custom-id :recorded verdict :strictest strictest}))
The ledger block for an aggregated verdict sets verifier to
research-editor/aggregate and notes to the rule and per-axis counts:
{
"verifier": "research-editor/aggregate",
"verdict": "disputed",
"subject": "cross-platform-build-claim",
"notes": "rule=pessimistic-any; axes={factual:correct,versions:disputed,platform:needs-citation}"
}
5.2. Gap 2: Speculative Claims Framework
The four-value verdict vocabulary (correct | corrected | disputed |
needs-citation) is a factual-accuracy scale. It cannot express epistemic
status: a speculative hypothesis is not factually wrong, but it is not verified
either. Marking it disputed misrepresents the claim; leaving it unverified
understates the reviewer's judgment.
PaperTrail (Martin-Boyle et al. 2026) uses three categories: supported, unsupported,
omitted. This still collapses epistemic status into the verification verdict.
Academic preprint norms handle this differently: speculative, preliminary,
and conjecture are status labels on the claim itself, orthogonal to whether
the claim was reviewed.
The proposal is a two-axis drawer: one axis for claim type, one for factual verdict.
* Hypothesis: Threshold Signatures Reduce Single-Point Trust :PROPERTIES: :CUSTOM_ID: threshold-sig-hypothesis :CLAIM_TYPE: speculative :CLAIM_BASIS: extrapolated from Shamir (1979) and Gennaro (2007) :VERDICT: needs-citation :VERIFIED_BY: claim-validator :VERIFIED_AT: 2026-06-21T10:00Z :FINDING: Claim is plausible but lacks an empirical study for the specific protocol cited. :END:
:CLAIM_TYPE: vocabulary:
| Value | Meaning |
|---|---|
established |
Textbook result or widely-accepted finding |
research |
Empirically supported, primary sources available |
speculative |
Plausible extrapolation; explicit basis required |
conjecture |
Formal open problem or author's unproven hypothesis |
opinion |
Normative claim; not subject to factual verification |
The factual verdict applies within the claim's epistemic scope. A conjecture
cannot receive correct; it can receive needs-citation (basis not stated),
disputed (counterexample known), or speculative (no verdict yet). The
allowed verdict set is per :CLAIM_TYPE::
(def allowed-verdicts {"established" #{"correct" "corrected" "disputed" "needs-citation"} "research" #{"correct" "corrected" "disputed" "needs-citation"} "speculative" #{"speculative" "disputed" "needs-citation"} "conjecture" #{"speculative" "disputed" "needs-citation"} "opinion" #{"noted"}})
A :CLAIM_TYPE: of speculative with :VERDICT: speculative reads: "this is
a conjecture, acknowledged as such, with no contradicting evidence found." It is
not a flag for follow-up. The research-editor validator skips these when
building the needs-review queue.
5.3. Gap 3: Scheduled Re-Verification
A heading verified as correct on FreeBSD 15.0 may become incorrect on 15.1.
Cross-platform, cross-version, and time-sensitive assertions drift without a
mechanism that triggers re-verification when conditions change.
Git tag hooks and CI triggers handle temporal re-runs in software projects. For
a document store, the analog is a :RECHECK_AFTER: property that signals when
a heading's verdict should be reconsidered:
* FreeBSD Compatibility :PROPERTIES: :CUSTOM_ID: freebsd-compat :VERDICT: correct :VERIFIED_AT: 2026-06-07T00:00Z :VERIFIED_BY: platform-agent :RECHECK_AFTER: freebsd-15.1-release :RECHECK_CONDITION: platform-version :END: Tested on FreeBSD 15.0-RELEASE. =pkg install emacs= succeeds; nREPL port 42527 is reachable.
:RECHECK_AFTER: accepts two formats: an ISO-8601 date
(:RECHECK_AFTER: 2027-01-01) or an event tag (:RECHECK_AFTER:
freebsd-15.1-release). :RECHECK_CONDITION: is a hint to the scheduler.
The scheduling loop runs as part of the daily-publish audit pipeline. It queries the corpus for headings where either:
:RECHECK_AFTER:is a date earlier than today, and the heading's:VERIFIED_AT:predates:RECHECK_AFTER:; or:RECHECK_AFTER:is an event tag that appears in a known-events map.
(defn stale-verifications [docs today known-events] (for [{:keys [custom-id verified-at recheck-after recheck-condition] :as h} (annotated-headings docs) :when recheck-after :let [trigger-date (or (parse-date recheck-after) (get known-events recheck-after))] :when (and trigger-date (.isBefore (parse-date verified-at) trigger-date) (.isBefore trigger-date today))] {:id custom-id :condition recheck-condition :trigger trigger-date}))
Stale headings enter the needs-review queue with :STATUS: recheck-triggered.
The original verdict remains in the drawer until a new agent run overwrites it.
The ledger records the re-verification with the same subject and a new block,
making the re-check history traceable.
The known-events map is maintained in scripts/recheck-events.edn:
{:freebsd-15.1-release "2026-09-01"
:freebsd-16.0-release "2027-06-01"
:clojure-1.13-release "2026-12-01"}
5.4. Gap 4: Ledger Reconciliation Automation
The hash chain at .verify/chain.jsonl detects tampering but does not flag when
a :VERDICT: drawer contradicts the ledger's most recent entry for the same
subject. A drawer says correct; the ledger's newest block for that heading
says disputed. Neither check catches the divergence: verify-chain passes
(the chain is intact), and the drawer grep finds correct.
The proposal:
- Group ledger blocks by
subject. - For each subject, take the most recent block by timestamp.
- Join against annotated headings by
:CUSTOM_ID:. - Report headings where drawer verdict differs from ledger verdict.
(defn reconcile-ledger [docs chain-path] (let [chain (read-chain chain-path) latest-by-subject (->> chain (group-by :subject) (map-vals #(last (sort-by :ts %))))] (for [{:keys [custom-id verdict] :as h} (annotated-headings docs) :let [ledger-entry (get latest-by-subject custom-id)] :when (and ledger-entry (not= verdict (:verdict ledger-entry)))] {:id custom-id :drawer verdict :ledger (:verdict ledger-entry) :ledger-ts (:ts ledger-entry)})))
This runs as gmake reconcile-ledger and exits non-zero when drift is found.
The output format matches the orphan-detection script's convention: one
DIVERGENCE: <id> drawer=<v1> ledger=<v2> line per conflict.
Three legitimate causes of divergence that should not be errors:
| Cause | Signal | Resolution |
|---|---|---|
| Drawer updated after ledger (agent corrected drawer but forgot to append) | Ledger timestamp predates last git touch of the file | Append a new ledger block |
| Heading split: one heading became two, the old CUSTOM_ID now absent | Drawer orphaned | Update CUSTOM_ID references |
| Known in-flight re-verification | :STATUS: recheck-triggered in drawer |
Suppress until re-check completes |
The reconciler suppresses the third case automatically by checking for
:STATUS: recheck-triggered. The first two are manual resolutions. A
--repair flag on the CLI will append a ledger block matching the current
drawer verdict, recording the human as verifier:
bb scripts/verify-chain repair \ --subject freebsd-compat \ --verifier jwalsh \ --notes "drawer corrected manually; ledger back-filled"
5.5. Gap 5: Agent Conversation Provenance
:VERIFIED_BY: skeptic-agent records who verified but not why the verdict was
reached. For non-obvious verdicts (corrected, disputed), the reviewer's
reasoning is the load-bearing part: what searches were run, what source was
consulted, what alternative was considered and rejected.
The proposal uses two additions. First, a :REASONING_REF: property pointing
to a git object or file path that contains the full reasoning:
* JWT HS256 Key Size Claim :PROPERTIES: :CUSTOM_ID: jwt-hs256-key-size :VERDICT: corrected :FINDING: Minimum key size is 256 bits (32 bytes), not 128. :VERIFIED_BY: claim-validator :VERIFIED_AT: 2026-06-21T09:30Z :REASONING_REF: .verify/reasoning/jwt-hs256-key-size-2026-06-21.md :END:
The reasoning file is plain Markdown and lives under .verify/reasoning/
(gitignored by default; opt-in with git add -f for key decisions):
# Reasoning: jwt-hs256-key-size — 2026-06-21 Verdict: corrected Verifier: claim-validator ## Search conducted - RFC 7518 Section 3.2: "A key of the same size as the hash output or larger MUST be used with this algorithm." - NIST SP 800-107r1 Table 1: SHA-256 output is 256 bits. - Original claim said "128 bits minimum" — no source found for this. ## Alternative considered Could the original author mean 128-bit entropy (= 16 bytes of random)? The claim was in a context discussing key storage, not generation, so this reading is unlikely. Verdict stands: corrected to 256 bits / 32 bytes. ## What was not checked - Specific library implementations (java-jwt, jose4j) were not consulted. The RFC is the normative source; library behavior is a separate claim.
Second, a lightweight inline option for short verdicts: a :REASONING: property
for single-line summaries that do not warrant a separate file:
:REASONING: RFC 7518 §3.2 is explicit; no ambiguity in the claim.
The schema priority: if :REASONING_REF: is set, the file is authoritative.
If only :REASONING: is set, it is the full record. Both absent means the
reasoning is untraced – acceptable for correct verdicts, flagged for
corrected and disputed.
(defn untraced-nontrivial [docs] (for [{:keys [custom-id verdict reasoning reasoning-ref] :as h} (annotated-headings docs) :when (and verdict (#{"corrected" "disputed"} verdict) (nil? reasoning) (nil? reasoning-ref))] {:id custom-id :verdict verdict}))
The ledger block for a reasoned verdict adds a reasoning field pointing to
the same file:
{
"verifier": "claim-validator",
"verdict": "corrected",
"subject": "jwt-hs256-key-size",
"notes": "RFC 7518 §3.2; corrected 128→256 bits",
"reasoning": ".verify/reasoning/jwt-hs256-key-size-2026-06-21.md"
}
The reasoning field in the ledger block is not part of the hash preimage (to avoid chaining the reasoning file's content into the integrity check). It is an out-of-band pointer – the chain proves the verdict was recorded; the reasoning file explains it.
6. Field Report: Applied to a Live FreeBSD/Guile Crash (2026-06-22)
On 2026-06-22, the annotation-systems methodology was applied to a downstream
crash investigation: a known Guile spawn bug (SIGSEGV, rc=139, backtrace in
libc.so.7) on FreeBSD 14.4-RELEASE-p6 (hydra, freshly upgraded from 14.3).
The research log lives at aygp-dr/scheme-llm-toolkit:docs/research/guile-spawn-bug-79494.org.
The :VERDICT: / :FINDING: pair forced an explicit "did I see this myself?"
decision per claim. The OS upgrade made that distinction load-bearing: the
crash was reproduced firsthand, while "fix lands in 3.0.11 / ports will pick
it up" was carried over from upstream and had to be downgraded to
needs-citation. :VERIFIED_AT: in UTC plus :VERIFIED_BY: gave the log
provenance suitable for re-checking after a future package bump.
Three edges strained during this single task; documented as protocol proposals in Gaps 6 through 8 below.
7. Protocol Proposals: Filling Gaps from Field Use (2026-06-22)
Three gaps surfaced applying the methodology to the FreeBSD/Guile crash investigation. Each is narrower than the multi-agent aggregation gaps above but more likely to appear in any single-investigator downstream bug report.
7.1. Gap 6: Export Annotations for External Plain-Text Trackers
Property drawers are invisible on HTML export. That invisibility is by design for published research notes: readers see prose, not audit metadata. But downstream bug work – FreeBSD Bugzilla, bug-guile mailing list, GitHub issues – requires the verdict and evidence in readable text. The same investigation produces two artifacts with opposite requirements: the org source (drawers invisible) and the bug report (drawers should surface).
The current workaround is manual transcription. That breaks provenance: the bug report's evidence is a copy, not a reference, and will drift from the org source.
Proposed: an "export annotations as appendix" mode. An org export option
annotate:t on a document (or set via #+OPTIONS: per-export) renders each
annotated heading's drawer as a visible appendix block:
#+OPTIONS: toc:2 num:t annotate:t annotate-dest:appendix
The appendix format, suitable for plain-text submission:
== Annotations == [guile-spawn-crash] VERDICT: reproduced VERIFIED_AT: 2026-06-22T04:00Z VERIFIED_BY: claude-code FINDING: SIGSEGV (rc=139) reproduced on FreeBSD 14.4-RELEASE-p6, libc.so.7 backtrace consistent with upstream bug 79494. [fix-timeline] VERDICT: needs-citation FINDING: "Fix in 3.0.11, ports will pick it up" is upstream attribution, not locally verified on this build.
The export filter skips drawers with no annotation keys (:CUSTOM_ID: alone
does not trigger inclusion). Only drawers containing at least one of
:VERDICT:, :FINDING:, :EXEC_VERDICT:, or :EXEC_FINDING: are emitted.
This requires an ox-html extension. Until that exists, the interim workaround is a Babashka script that reads annotated headings from the REPL reader and emits a plain-text appendix:
;; Interim: emit annotations as plain-text for bug-tracker submission (defn annotations->plain-text [docs custom-ids] (->> (annotated-headings docs) (filter #(or (nil? custom-ids) (custom-ids (:custom-id %)))) (map (fn [{:keys [custom-id verdict finding verified-at verified-by]}] (str "[" custom-id "]\n" "VERDICT: " verdict "\n" (when verified-at (str "VERIFIED_AT: " verified-at "\n")) (when verified-by (str "VERIFIED_BY: " verified-by "\n")) (when finding (str "FINDING: " finding "\n"))))) (str/join "\n")))
7.2. Gap 7: Verdict Vocabulary – Reproduced vs. Attributed
The four-value verdict vocabulary (correct | corrected | disputed |
needs-citation) does not distinguish two importantly different epistemic
positions:
- Reproduced: the investigator observed the phenomenon firsthand in this session, on this build, at this timestamp.
- Attributed: the claim is sourced from an upstream document, prior report, or secondary source; it has not been independently verified against the current environment.
Both can be "correct" in the sense that the attributed claim is accurate – but the confidence level and the action required on a version bump differ. A reproduced observation needs no follow-up after upgrade until it fails again. An attributed claim needs re-verification whenever the upstream context changes.
In the Guile crash investigation, :VERDICT: disputed was overloaded for a
root-cause claim (GC interaction) that was correct-by-attribution but
unverified against the current libc. This misrepresents the finding.
Proposed additions to the verdict vocabulary:
| Value | Meaning | Appropriate follow-up |
|---|---|---|
reproduced |
Observed firsthand this session, on this host/build | Re-verify only after environment change |
attributed |
Sourced from upstream; not independently checked | Re-verify after environment change or source update |
The existing needs-citation remains for claims that lack any source.
attributed covers claims that have a source but have not been checked
locally. These are not the same: needs-citation is a flag to add a reference;
attributed is a provenance label saying "the reference is there, but I have
not re-run it."
The :CLAIM_TYPE: axis proposed in Gap 2 is orthogonal: a reproduced
verdict on a research claim is the strongest evidentiary position; an
attributed verdict on a research claim is a provisional position pending
local replication.
Drawer example from the Guile investigation:
* Crash Reproduction :PROPERTIES: :CUSTOM_ID: guile-crash-reproduction :VERDICT: reproduced :VERIFIED_AT: 2026-06-22T03:45Z :VERIFIED_BY: claude-code :FINDING: SIGSEGV, rc=139, on FreeBSD 14.4-RELEASE-p6. Backtrace in libc.so.7 at GC point. :END: * Root Cause (upstream claim) :PROPERTIES: :CUSTOM_ID: guile-root-cause-upstream :VERDICT: attributed :CLAIM_BASIS: upstream bug 79494 report :FINDING: GC interaction with BSD mmap semantics. Fix expected in 3.0.11. :END:
The verify-chain ledger extension accepts both new values without modification
(the schema uses open strings for verdict). The controlled-vocabulary
validator in the REPL reader needs updating:
(def verdict-vocab #{"correct" "corrected" "disputed" "needs-citation" "reproduced" "attributed" ; added 2026-06-22 "speculative" "noted"}) ; from Gap 2 proposal
7.3. Gap 8: First-Class Reproduced-Failure Evidence
:EXEC_VERDICT: verified covers "code runs and passes." :EXEC_VERDICT:
broken covers "code does not compile or exits non-zero in an unexpected
failure mode." Neither captures the specific, positive evidentiary status of
a reproduced crash: the investigator intentionally ran the code, it
reproducibly segfaulted, and that SIGSEGV is the expected output of the bug
investigation – not an unexpected failure.
A reproduced crash is positive evidence. Marking it broken understates the
finding: broken implies the result is bad and needs fixing. crashed
asserts: "this is what we expected; the bug is confirmed."
Proposed addition to the EXEC_VERDICT vocabulary:
| Value | Meaning |
|---|---|
verified |
Code runs; exit 0 and behavior matches claim |
broken |
Code does not run or exits non-zero unexpectedly |
crashed |
Code reproducibly crashes (SIGSEGV, SIGABRT, etc.); crash is the expected evidence |
partial |
Some blocks verified; others skipped or broken |
skipped |
Execution not attempted (environment unavailable) |
The :EXEC_FINDING: field should capture the signal, exit code, and (if
available) the first line of the backtrace:
* Guile Spawn Test :PROPERTIES: :CUSTOM_ID: guile-spawn-test :EXEC_VERDICT: crashed :EXEC_FINDING: SIGSEGV, rc=139. libc.so.7 frame at GC mark. FreeBSD 14.4-RELEASE-p6, Guile 3.0.9. :EXEC_VALIDATED_AT: 2026-06-22T03:45Z :EXEC_VALIDATED_BY: claude-code :END:
The literate-validation loop (see Literate Validation Loop) adds crashed as
a recognized exit path alongside verified and broken:
(def exec-verdict-vocab #{"verified" "broken" "crashed" "partial" "skipped"}) (defn classify-execution [exit-code signal stderr] (cond (and signal (#{"SIGSEGV" "SIGABRT" "SIGBUS"} signal)) "crashed" (zero? exit-code) "verified" :else "broken"))
Practically, the distinction matters for cross-platform bug work: when filing
a FreeBSD Bugzilla report, the relevant heading should carry :EXEC_VERDICT:
crashed with the signal and rc, not broken, so that a reader can immediately
understand this is a confirmed crash, not an environment issue.
8. Validation Ledger
Any validation appends a block to the tamper-evident audit ledger
.verify/chain.jsonl (a local hash chain). Each block binds the hash of the
change and the verifier to the previous block's hash, so past validations
cannot be silently rewritten – editing any block breaks every block after it.
bb scripts/verify-chain add --change-file <path> --verifier <name> \ --verdict <correct|corrected|disputed|verified|needs-citation> --subject <id> gmake verify-chain # check chain integrity (exit 1 if broken) gmake audit-trail # print the trail
The :VERDICT: of a :VERIFIED: drawer and a ledger block's verdict are the
same value: the drawer lives on the heading; the ledger is the cross-cutting
append-only trail. Spec: docs/verification-ledger-spec.org.
Every verdict written to a property drawer must have a corresponding entry in
.verify/chain.jsonl. The two records must agree on verdict.
If bd is unavailable (schema mismatch across machines), continue with file-based
tracking in _queue/ or _drafts/ and note the bd failure in the session's first
annotated heading under :FINDING:.