Annotation Systems: Operational Specification v3

Table of Contents

See also: Annotation Systems for Evolving Documents (problem statement, anchor comparison, recommendation) | L7 Methodology Review (2026-06-22)

This specification covers operational detail: the verdict vocabulary, property drawer conventions, Clojure reader tooling, gap proposals from multi-agent use, a FreeBSD field report, and the literate validation loop. The problem framing and anchor comparison are in index.org.

1. Clojure Reader for the Annotation Layer

A convention is only as good as the tooling that queries it. The recommendation in the problem-statement note – :CUSTOM_ID: plus property drawers – is greppable, and that was offered as a feature. But grep cannot tell a real heading drawer from a #+begin_example block demonstrating the convention; it cannot count verdicts; it cannot reconcile a heading's recorded verdict against the audit ledger. To make annotations first-class, the site's org reader (wal-sh.site.org) now parses the drawer attached to each heading, so the annotation layer becomes queryable data rather than text to search.

The parser walks headings and, when a :PROPERTIES: drawer sits directly under one, captures its keys. Review keys mark a heading as annotated:

(def annotation-keys
  #{:review :status :verdict :verified_at :verified_by :finding})

(defn annotated-headings [doc]
  (for [h (:headings doc)
        :let [p (:properties h)]
        :when (some annotation-keys (keys p))]
    {:path (:path doc) :custom-id (:custom_id p) :heading (:text h)
     :verdict (:verdict p) :verified-by (:verified_by p)
     :status (:status p) :finding (:finding p)}))

Loading the corpus in the REPL (the site runs a persistent Clojure REPL in tmux; this transcript is from it):

(require '[wal-sh.site.org :as o])
(def docs (o/read-all "site"))     ;=> 751 docs

(o/verdict-summary docs)
;;=> {:correct 4, :corrected 2, :unverified 9}

(count (o/annotations docs))
;;=> 15

(frequencies (map :verified-by (filter :verdict (o/annotations docs))))
;;=> {"skeptic-agent" 2, "vera-agent" 3, "claude-agent" 1}

(map (juxt :custom-id :verdict)
     (filter #(= "corrected" (:verdict %)) (o/annotations docs)))
;;=> (["org-annotate-file" "corrected"] ["cljs-advantage" "corrected"])

The reader counts six recorded verdicts (four correct, two corrected) across the corpus, attributed to three agents – and nine further headings annotated with other keys but not yet given a verdict. One of the two corrected findings is org-annotate-file: the vera-agent correction in the problem-statement note (the MELPA-package mix-up). The annotation is data the document can read about itself.

The precision this buys over grep: a raw grep ':VERDICT:' returns nine hits, but three of those are inside the #+begin_example blocks illustrating the convention. The reader captures only drawers genuinely attached to a heading – example headings are comma-escaped (,*) and do not match – so the six it counts are real annotations, not documentation about annotations.

1.1. From annotations to the grind

Three functions turn the layer into work for the secondary-review agent team and the audit ledger:

;; the allocation queue: headings explicitly flagged for attention
(o/needs-review docs)          ; :STATUS: needs-citation or a disputed verdict
;;=> []   (no outstanding flags right now -- empty queue)

;; broader candidates: anchored headings with no verdict yet
(o/unverified-anchored docs)   ; have :CUSTOM_ID:, lack :VERDICT:

;; tamper-evidence: drawers whose verdict the hash-chain ledger does not record
(o/reconcile-ledger docs (map :subject (read-chain ".verify/chain.jsonl")))

needs-review is the input to the per-heading fan-out: the research-editor orchestrator allocates each flagged heading to a citation-auditor or claim-validator, and each verdict it returns is written back as a :VERDICT: drawer and appended to .verify/chain.jsonl (the hash chain described in the project's verification-ledger spec). reconcile-ledger closes the loop: it lists any heading claiming a verdict the tamper-evident trail does not record, so the drawer and the ledger cannot silently diverge.

The annotation layer, then, is not just where findings are stored. It is the queue the grind draws from and the surface the audit checks – both reachable from one read-all.

2. Verdict Vocabulary

Three vocabularies were in production use after the first multi-agent session, with no enforcement mechanism. This section defines the controlled vocabulary.

2.1. Prose review verdicts (:VERDICT:)

The :VERDICT: field in property drawers accepts exactly these values. No agent may introduce a new value without updating this list and the parse test in the Makefile:

Value Meaning
correct Claim verified against a primary source; no changes required
corrected Claim was wrong; the heading was updated to reflect the finding
disputed Conflicting primary sources; no clear resolution
needs-citation Claim is plausible but no primary source was found
verified Machine execution confirmed (lean, dafny, REPL output)
reproduced Failure observed firsthand on this machine at this version
attributed Claim sourced from upstream; not independently verified locally
speculative Explicit conjecture; basis stated; no contradicting evidence

Verdicts not in this table are invalid. An agent that encounters an unlisted verdict in an existing drawer must normalize it to the nearest value and note the normalization in :FINDING:.

2.2. Execution verdicts (:EXEC_VERDICT:)

The :EXEC_ prefix separates machine execution from human review. A heading can carry both:

* Involution
:PROPERTIES:
:CUSTOM_ID: involution
:VERDICT: corrected
:FINDING: String.reverse → stringReverse (API drift)
:EXEC_VERDICT: verified
:EXEC_FINDING: lean4: exit=0 in 4823ms, 3 blocks
:END:

EXEC_VERDICT vocabulary:

Value Meaning
verified Code runs; exit 0 and behavior matches claim
broken Code does not run or exits non-zero unexpectedly
crashed Code reproducibly crashes (SIGSEGV, SIGABRT, etc.); crash is the expected evidence
partial Some blocks verified; others skipped or broken
skipped Execution not attempted (environment unavailable)

2.3. Required drawer fields by verdict

Every drawer that carries :VERDICT: must also carry:

  • :CUSTOM_ID: on the heading (stable kebab-case; never update after creation)
  • :VERIFIED_AT: in ISO-8601 format with seconds: YYYY-MM-DDTHH:MM:SSZ
  • :VERIFIED_BY: the agent or tool that produced the verdict

Verdicts corrected and disputed additionally require:

  • :FINDING: a single-line summary of what was found. For multiple observations, use :FINDING+: for each additional line, not comma-separated values in one :FINDING:.

Verdicts corrected and disputed additionally benefit from:

  • :REASONING: a one-line summary of the search or source consulted (if the reasoning warrants more than one line, use :REASONING_REF: pointing to a file under .verify/reasoning/)

No drawer field should be padded: :VERDICT: correct is wrong. Use :VERDICT: correct (single space after the colon).

2.4. Timestamp format

:VERIFIED_AT: must use ISO-8601 with full seconds in UTC:

:VERIFIED_AT: 2026-06-21T14:30:00Z

Not accepted: missing seconds (2026-06-21T14:30Z), date only (2026-06-21), prose format (June 21, 2026). The staleness test sorts by :VERIFIED_AT:; inconsistent precision produces incorrect staleness ordering.

3. Literate Validation Loop

Research notes on this site contain executable code blocks – Lean 4 proofs, Clojure REPL transcripts, Dafny contracts, dot diagrams. The annotation convention extends to these: a heading with code blocks can carry both a prose verdict (:VERDICT:) and an execution verdict (:EXEC_VERDICT:).

The validation loop runs in a clean room:

literate-validation.png

  1. Copy the org file to mktemp -d (isolation, not trust)
  2. Tangle with emacs --batch + org-babel-tangle
  3. Detect languages from #+begin_src headers
  4. Scaffold per-language toolchain (lake init, deps.edn, etc.)
  5. Build/verify (lean, dafny verify, clj, python3, bash)
  6. Annotate headings with :EXEC_VERDICT: and :EXEC_FINDING:
  7. Append to .verify/chain.jsonl (tamper-evident ledger)
  8. Remove the tmpdir

3.1. Backfill candidates

Pages with heavy code examples that should be validated:

The skill /validate-literate automates this pipeline. Run with gmake validate-literate FILE=path or /validate-literate path from Claude Code.

4. Process Findings: Multi-Agent Annotation in Practice

A four-team annotation run (validation / time / versions / ui over a speculative thesis) plus a parallel five-agent citation audit, both on 2026-06-13, exposed gaps in the annotation/validation process itself. Recorded so the schema can absorb them.

4.1. One verdict per heading assumes one reviewer

The property-drawer convention (:VERDICT: / :FINDING: / :VERIFIED_BY:) models a single review. Four teams reviewing one heading along different axes collide on :VERDICT:. The run used namespaced keys (:VALIDATION:, :TIME:, :VERSIONS:, :UI:) with a rolled-up :VERDICT: on the parent – but no aggregation rule is defined: is a heading "disputed" if any axis disputes, or only the majority? The schema needs first-class multi-axis review, or a stated roll-up function.

4.2. The verdict vocabulary has already drifted

Three vocabularies are in use: the drawer spec says correct | corrected | disputed | needs-citation (four); verify-chain accepts those plus verified (five); EXEC_VERDICT uses verified | broken | crashed | partial | skipped (see Gap 8). The teams emitted a sixth value, unverifiable, that no spec lists. Pick one controlled vocabulary and validate against it, or the ledger's verdicts are not comparable.

4.3. Validation tooling needs its own validation

The mechanical citation-integrity check used find -name references.bib, which silently missed references-category-theory.bib and reported four false "dangling" citations. Caught only by manual follow-up. This mirrors the production-validation harness's B1 self-review catch: a checker's own bug reads as a finding. The checker, not just the artifact, must be reviewed adversarially.

4.4. Annotating a conjecture is not verifying a fact

disputed on a speculative thesis conflates "unproven conjecture" with "factually wrong." The reviewed thesis carried :STATUS: speculative, but the per-axis verdicts do not separate "the claim is false" from "the claim is unproven." A speculative claim needs a status axis orthogonal to the factual verdict.

4.5. Unstructured agent output self-contradicts

A bib auditor reported "1 unverifiable" then immediately "0 unverifiable" in the same message. Review tallies should be schema-constrained (structured output) so a review's own arithmetic cannot disagree with itself.

4.6. Point-in-time claims are unverifiable on a single host

Cross-harness flag claims (e.g. gemini -r, codex resume) cannot be checked on a host that has only claude installed. They are dated research assertions that drift by version and need a scheduled research-agent pass against primary docs – not a one-time table. (Deferred.)

5. Protocol Proposals: Filling the Five Gaps

The process findings above named the gaps; this section proposes concrete mechanisms for each. Each proposal specifies org-mode syntax, a resolution algorithm, and the minimal tooling required.

5.1. Gap 1: Verdict Aggregation Rules

When multiple agents review the same heading along different axes, the current schema collapses all verdicts into one :VERDICT: field. This produces a collision rather than a synthesis.

The W3C Web Annotation Data Model (Sanderson, Ciccarese, and Young 2017) handles multi-reviewer scenarios by treating each annotation as a separate resource with its own target and body. Aggregation is left to consuming applications. Academic review systems (USENIX, IEEE, OOPSLA) similarly store per-reviewer scores independently and compute a meta-score during the editorial phase – the review records do not merge; only the decision does.

For org-mode property drawers, the proposed approach uses namespaced per-axis verdicts and an explicit roll-up rule stored as a property:

* Cross-Platform Build Claim
:PROPERTIES:
:CUSTOM_ID: cross-platform-build-claim
:REVIEW_AXIS_FACTUAL: correct
:REVIEW_BY_FACTUAL: claim-validator
:REVIEW_AXIS_VERSIONS: disputed
:REVIEW_BY_VERSIONS: versions-agent
:REVIEW_AXIS_PLATFORM: needs-citation
:REVIEW_BY_PLATFORM: platform-agent
:VERDICT_RULE: pessimistic-any
:VERDICT: disputed
:VERDICT_COMPUTED_AT: 2026-06-21T14:00Z
:END:

The :VERDICT_RULE: field specifies the aggregation function. Three rules cover the common cases:

Rule Algorithm When to use
pessimistic-any disputed if any axis is disputed or needs-citation Security or factual accuracy
majority verdict held by the plurality of axes Low-stakes editorial review
axis:<name> verdict equals the named axis, ignoring others When one axis is authoritative

The roll-up is computed by the research-editor orchestrator after all per-axis agents report. It writes :VERDICT: and :VERDICT_COMPUTED_AT:. Per-axis verdicts remain in the drawer as the audit trail.

Clojure query to find headings where the roll-up differs from the strictest per-axis verdict:

(defn aggregation-drift [annotations]
  (for [{:keys [custom-id verdict] :as a} annotations
        :let [axes (filter #(re-find #"^:review_axis_" (name (key %)))
                           (seq (:properties a)))
              strictest (-> (map val axes)
                            (sort-by {"correct" 3 "corrected" 2
                                      "needs-citation" 1 "disputed" 0})
                            first)]
        :when (and strictest (not= verdict strictest))]
    {:id custom-id :recorded verdict :strictest strictest}))

The ledger block for an aggregated verdict sets verifier to research-editor/aggregate and notes to the rule and per-axis counts:

{
  "verifier": "research-editor/aggregate",
  "verdict": "disputed",
  "subject": "cross-platform-build-claim",
  "notes": "rule=pessimistic-any; axes={factual:correct,versions:disputed,platform:needs-citation}"
}

5.2. Gap 2: Speculative Claims Framework

The four-value verdict vocabulary (correct | corrected | disputed | needs-citation) is a factual-accuracy scale. It cannot express epistemic status: a speculative hypothesis is not factually wrong, but it is not verified either. Marking it disputed misrepresents the claim; leaving it unverified understates the reviewer's judgment.

PaperTrail (Martin-Boyle et al. 2026) uses three categories: supported, unsupported, omitted. This still collapses epistemic status into the verification verdict. Academic preprint norms handle this differently: speculative, preliminary, and conjecture are status labels on the claim itself, orthogonal to whether the claim was reviewed.

The proposal is a two-axis drawer: one axis for claim type, one for factual verdict.

* Hypothesis: Threshold Signatures Reduce Single-Point Trust
:PROPERTIES:
:CUSTOM_ID: threshold-sig-hypothesis
:CLAIM_TYPE: speculative
:CLAIM_BASIS: extrapolated from Shamir (1979) and Gennaro (2007)
:VERDICT: needs-citation
:VERIFIED_BY: claim-validator
:VERIFIED_AT: 2026-06-21T10:00Z
:FINDING: Claim is plausible but lacks an empirical study for the specific protocol cited.
:END:

:CLAIM_TYPE: vocabulary:

Value Meaning
established Textbook result or widely-accepted finding
research Empirically supported, primary sources available
speculative Plausible extrapolation; explicit basis required
conjecture Formal open problem or author's unproven hypothesis
opinion Normative claim; not subject to factual verification

The factual verdict applies within the claim's epistemic scope. A conjecture cannot receive correct; it can receive needs-citation (basis not stated), disputed (counterexample known), or speculative (no verdict yet). The allowed verdict set is per :CLAIM_TYPE::

(def allowed-verdicts
  {"established"  #{"correct" "corrected" "disputed" "needs-citation"}
   "research"     #{"correct" "corrected" "disputed" "needs-citation"}
   "speculative"  #{"speculative" "disputed" "needs-citation"}
   "conjecture"   #{"speculative" "disputed" "needs-citation"}
   "opinion"      #{"noted"}})

A :CLAIM_TYPE: of speculative with :VERDICT: speculative reads: "this is a conjecture, acknowledged as such, with no contradicting evidence found." It is not a flag for follow-up. The research-editor validator skips these when building the needs-review queue.

5.3. Gap 3: Scheduled Re-Verification

A heading verified as correct on FreeBSD 15.0 may become incorrect on 15.1. Cross-platform, cross-version, and time-sensitive assertions drift without a mechanism that triggers re-verification when conditions change.

Git tag hooks and CI triggers handle temporal re-runs in software projects. For a document store, the analog is a :RECHECK_AFTER: property that signals when a heading's verdict should be reconsidered:

* FreeBSD Compatibility
:PROPERTIES:
:CUSTOM_ID: freebsd-compat
:VERDICT: correct
:VERIFIED_AT: 2026-06-07T00:00Z
:VERIFIED_BY: platform-agent
:RECHECK_AFTER: freebsd-15.1-release
:RECHECK_CONDITION: platform-version
:END:

Tested on FreeBSD 15.0-RELEASE. =pkg install emacs= succeeds; nREPL port 42527
is reachable.

:RECHECK_AFTER: accepts two formats: an ISO-8601 date (:RECHECK_AFTER: 2027-01-01) or an event tag (:RECHECK_AFTER: freebsd-15.1-release). :RECHECK_CONDITION: is a hint to the scheduler.

The scheduling loop runs as part of the daily-publish audit pipeline. It queries the corpus for headings where either:

  • :RECHECK_AFTER: is a date earlier than today, and the heading's :VERIFIED_AT: predates :RECHECK_AFTER:; or
  • :RECHECK_AFTER: is an event tag that appears in a known-events map.
(defn stale-verifications [docs today known-events]
  (for [{:keys [custom-id verified-at recheck-after recheck-condition]
         :as h} (annotated-headings docs)
        :when recheck-after
        :let [trigger-date (or (parse-date recheck-after)
                               (get known-events recheck-after))]
        :when (and trigger-date
                   (.isBefore (parse-date verified-at) trigger-date)
                   (.isBefore trigger-date today))]
    {:id custom-id :condition recheck-condition :trigger trigger-date}))

Stale headings enter the needs-review queue with :STATUS: recheck-triggered. The original verdict remains in the drawer until a new agent run overwrites it. The ledger records the re-verification with the same subject and a new block, making the re-check history traceable.

The known-events map is maintained in scripts/recheck-events.edn:

{:freebsd-15.1-release "2026-09-01"
 :freebsd-16.0-release "2027-06-01"
 :clojure-1.13-release "2026-12-01"}

5.4. Gap 4: Ledger Reconciliation Automation

The hash chain at .verify/chain.jsonl detects tampering but does not flag when a :VERDICT: drawer contradicts the ledger's most recent entry for the same subject. A drawer says correct; the ledger's newest block for that heading says disputed. Neither check catches the divergence: verify-chain passes (the chain is intact), and the drawer grep finds correct.

The proposal:

  1. Group ledger blocks by subject.
  2. For each subject, take the most recent block by timestamp.
  3. Join against annotated headings by :CUSTOM_ID:.
  4. Report headings where drawer verdict differs from ledger verdict.
(defn reconcile-ledger [docs chain-path]
  (let [chain (read-chain chain-path)
        latest-by-subject
          (->> chain
               (group-by :subject)
               (map-vals #(last (sort-by :ts %))))]
    (for [{:keys [custom-id verdict] :as h} (annotated-headings docs)
          :let [ledger-entry (get latest-by-subject custom-id)]
          :when (and ledger-entry
                     (not= verdict (:verdict ledger-entry)))]
      {:id custom-id
       :drawer verdict
       :ledger (:verdict ledger-entry)
       :ledger-ts (:ts ledger-entry)})))

This runs as gmake reconcile-ledger and exits non-zero when drift is found. The output format matches the orphan-detection script's convention: one DIVERGENCE: <id> drawer=<v1> ledger=<v2> line per conflict.

Three legitimate causes of divergence that should not be errors:

Cause Signal Resolution
Drawer updated after ledger (agent corrected drawer but forgot to append) Ledger timestamp predates last git touch of the file Append a new ledger block
Heading split: one heading became two, the old CUSTOM_ID now absent Drawer orphaned Update CUSTOM_ID references
Known in-flight re-verification :STATUS: recheck-triggered in drawer Suppress until re-check completes

The reconciler suppresses the third case automatically by checking for :STATUS: recheck-triggered. The first two are manual resolutions. A --repair flag on the CLI will append a ledger block matching the current drawer verdict, recording the human as verifier:

bb scripts/verify-chain repair \
  --subject freebsd-compat \
  --verifier jwalsh \
  --notes "drawer corrected manually; ledger back-filled"

5.5. Gap 5: Agent Conversation Provenance

:VERIFIED_BY: skeptic-agent records who verified but not why the verdict was reached. For non-obvious verdicts (corrected, disputed), the reviewer's reasoning is the load-bearing part: what searches were run, what source was consulted, what alternative was considered and rejected.

The proposal uses two additions. First, a :REASONING_REF: property pointing to a git object or file path that contains the full reasoning:

* JWT HS256 Key Size Claim
:PROPERTIES:
:CUSTOM_ID: jwt-hs256-key-size
:VERDICT: corrected
:FINDING: Minimum key size is 256 bits (32 bytes), not 128.
:VERIFIED_BY: claim-validator
:VERIFIED_AT: 2026-06-21T09:30Z
:REASONING_REF: .verify/reasoning/jwt-hs256-key-size-2026-06-21.md
:END:

The reasoning file is plain Markdown and lives under .verify/reasoning/ (gitignored by default; opt-in with git add -f for key decisions):

# Reasoning: jwt-hs256-key-size — 2026-06-21

Verdict: corrected
Verifier: claim-validator

## Search conducted
- RFC 7518 Section 3.2: "A key of the same size as the hash output or larger
  MUST be used with this algorithm."
- NIST SP 800-107r1 Table 1: SHA-256 output is 256 bits.
- Original claim said "128 bits minimum" — no source found for this.

## Alternative considered
Could the original author mean 128-bit entropy (= 16 bytes of random)? The
claim was in a context discussing key storage, not generation, so this
reading is unlikely. Verdict stands: corrected to 256 bits / 32 bytes.

## What was not checked
- Specific library implementations (java-jwt, jose4j) were not consulted.
  The RFC is the normative source; library behavior is a separate claim.

Second, a lightweight inline option for short verdicts: a :REASONING: property for single-line summaries that do not warrant a separate file:

:REASONING: RFC 7518 §3.2 is explicit; no ambiguity in the claim.

The schema priority: if :REASONING_REF: is set, the file is authoritative. If only :REASONING: is set, it is the full record. Both absent means the reasoning is untraced – acceptable for correct verdicts, flagged for corrected and disputed.

(defn untraced-nontrivial [docs]
  (for [{:keys [custom-id verdict reasoning reasoning-ref] :as h}
        (annotated-headings docs)
        :when (and verdict
                   (#{"corrected" "disputed"} verdict)
                   (nil? reasoning)
                   (nil? reasoning-ref))]
    {:id custom-id :verdict verdict}))

The ledger block for a reasoned verdict adds a reasoning field pointing to the same file:

{
  "verifier": "claim-validator",
  "verdict": "corrected",
  "subject": "jwt-hs256-key-size",
  "notes": "RFC 7518 §3.2; corrected 128→256 bits",
  "reasoning": ".verify/reasoning/jwt-hs256-key-size-2026-06-21.md"
}

The reasoning field in the ledger block is not part of the hash preimage (to avoid chaining the reasoning file's content into the integrity check). It is an out-of-band pointer – the chain proves the verdict was recorded; the reasoning file explains it.

6. Field Report: Applied to a Live FreeBSD/Guile Crash (2026-06-22)

On 2026-06-22, the annotation-systems methodology was applied to a downstream crash investigation: a known Guile spawn bug (SIGSEGV, rc=139, backtrace in libc.so.7) on FreeBSD 14.4-RELEASE-p6 (hydra, freshly upgraded from 14.3). The research log lives at aygp-dr/scheme-llm-toolkit:docs/research/guile-spawn-bug-79494.org.

The :VERDICT: / :FINDING: pair forced an explicit "did I see this myself?" decision per claim. The OS upgrade made that distinction load-bearing: the crash was reproduced firsthand, while "fix lands in 3.0.11 / ports will pick it up" was carried over from upstream and had to be downgraded to needs-citation. :VERIFIED_AT: in UTC plus :VERIFIED_BY: gave the log provenance suitable for re-checking after a future package bump.

Three edges strained during this single task; documented as protocol proposals in Gaps 6 through 8 below.

7. Protocol Proposals: Filling Gaps from Field Use (2026-06-22)

Three gaps surfaced applying the methodology to the FreeBSD/Guile crash investigation. Each is narrower than the multi-agent aggregation gaps above but more likely to appear in any single-investigator downstream bug report.

7.1. Gap 6: Export Annotations for External Plain-Text Trackers

Property drawers are invisible on HTML export. That invisibility is by design for published research notes: readers see prose, not audit metadata. But downstream bug work – FreeBSD Bugzilla, bug-guile mailing list, GitHub issues – requires the verdict and evidence in readable text. The same investigation produces two artifacts with opposite requirements: the org source (drawers invisible) and the bug report (drawers should surface).

The current workaround is manual transcription. That breaks provenance: the bug report's evidence is a copy, not a reference, and will drift from the org source.

Proposed: an "export annotations as appendix" mode. An org export option annotate:t on a document (or set via #+OPTIONS: per-export) renders each annotated heading's drawer as a visible appendix block:

#+OPTIONS: toc:2 num:t annotate:t annotate-dest:appendix

The appendix format, suitable for plain-text submission:

== Annotations ==

[guile-spawn-crash]
VERDICT: reproduced
VERIFIED_AT: 2026-06-22T04:00Z
VERIFIED_BY: claude-code
FINDING: SIGSEGV (rc=139) reproduced on FreeBSD 14.4-RELEASE-p6, libc.so.7 backtrace consistent with upstream bug 79494.

[fix-timeline]
VERDICT: needs-citation
FINDING: "Fix in 3.0.11, ports will pick it up" is upstream attribution, not locally verified on this build.

The export filter skips drawers with no annotation keys (:CUSTOM_ID: alone does not trigger inclusion). Only drawers containing at least one of :VERDICT:, :FINDING:, :EXEC_VERDICT:, or :EXEC_FINDING: are emitted.

This requires an ox-html extension. Until that exists, the interim workaround is a Babashka script that reads annotated headings from the REPL reader and emits a plain-text appendix:

;; Interim: emit annotations as plain-text for bug-tracker submission
(defn annotations->plain-text [docs custom-ids]
  (->> (annotated-headings docs)
       (filter #(or (nil? custom-ids) (custom-ids (:custom-id %))))
       (map (fn [{:keys [custom-id verdict finding verified-at verified-by]}]
              (str "[" custom-id "]\n"
                   "VERDICT: " verdict "\n"
                   (when verified-at (str "VERIFIED_AT: " verified-at "\n"))
                   (when verified-by (str "VERIFIED_BY: " verified-by "\n"))
                   (when finding     (str "FINDING: "    finding     "\n")))))
       (str/join "\n")))

7.2. Gap 7: Verdict Vocabulary – Reproduced vs. Attributed

The four-value verdict vocabulary (correct | corrected | disputed | needs-citation) does not distinguish two importantly different epistemic positions:

  • Reproduced: the investigator observed the phenomenon firsthand in this session, on this build, at this timestamp.
  • Attributed: the claim is sourced from an upstream document, prior report, or secondary source; it has not been independently verified against the current environment.

Both can be "correct" in the sense that the attributed claim is accurate – but the confidence level and the action required on a version bump differ. A reproduced observation needs no follow-up after upgrade until it fails again. An attributed claim needs re-verification whenever the upstream context changes.

In the Guile crash investigation, :VERDICT: disputed was overloaded for a root-cause claim (GC interaction) that was correct-by-attribution but unverified against the current libc. This misrepresents the finding.

Proposed additions to the verdict vocabulary:

Value Meaning Appropriate follow-up
reproduced Observed firsthand this session, on this host/build Re-verify only after environment change
attributed Sourced from upstream; not independently checked Re-verify after environment change or source update

The existing needs-citation remains for claims that lack any source. attributed covers claims that have a source but have not been checked locally. These are not the same: needs-citation is a flag to add a reference; attributed is a provenance label saying "the reference is there, but I have not re-run it."

The :CLAIM_TYPE: axis proposed in Gap 2 is orthogonal: a reproduced verdict on a research claim is the strongest evidentiary position; an attributed verdict on a research claim is a provisional position pending local replication.

Drawer example from the Guile investigation:

* Crash Reproduction
:PROPERTIES:
:CUSTOM_ID: guile-crash-reproduction
:VERDICT: reproduced
:VERIFIED_AT: 2026-06-22T03:45Z
:VERIFIED_BY: claude-code
:FINDING: SIGSEGV, rc=139, on FreeBSD 14.4-RELEASE-p6. Backtrace in libc.so.7 at GC point.
:END:

* Root Cause (upstream claim)
:PROPERTIES:
:CUSTOM_ID: guile-root-cause-upstream
:VERDICT: attributed
:CLAIM_BASIS: upstream bug 79494 report
:FINDING: GC interaction with BSD mmap semantics. Fix expected in 3.0.11.
:END:

The verify-chain ledger extension accepts both new values without modification (the schema uses open strings for verdict). The controlled-vocabulary validator in the REPL reader needs updating:

(def verdict-vocab
  #{"correct" "corrected" "disputed" "needs-citation"
    "reproduced" "attributed"                ; added 2026-06-22
    "speculative" "noted"})                  ; from Gap 2 proposal

7.3. Gap 8: First-Class Reproduced-Failure Evidence

:EXEC_VERDICT: verified covers "code runs and passes." :EXEC_VERDICT: broken covers "code does not compile or exits non-zero in an unexpected failure mode." Neither captures the specific, positive evidentiary status of a reproduced crash: the investigator intentionally ran the code, it reproducibly segfaulted, and that SIGSEGV is the expected output of the bug investigation – not an unexpected failure.

A reproduced crash is positive evidence. Marking it broken understates the finding: broken implies the result is bad and needs fixing. crashed asserts: "this is what we expected; the bug is confirmed."

Proposed addition to the EXEC_VERDICT vocabulary:

Value Meaning
verified Code runs; exit 0 and behavior matches claim
broken Code does not run or exits non-zero unexpectedly
crashed Code reproducibly crashes (SIGSEGV, SIGABRT, etc.); crash is the expected evidence
partial Some blocks verified; others skipped or broken
skipped Execution not attempted (environment unavailable)

The :EXEC_FINDING: field should capture the signal, exit code, and (if available) the first line of the backtrace:

* Guile Spawn Test
:PROPERTIES:
:CUSTOM_ID: guile-spawn-test
:EXEC_VERDICT: crashed
:EXEC_FINDING: SIGSEGV, rc=139. libc.so.7 frame at GC mark. FreeBSD 14.4-RELEASE-p6, Guile 3.0.9.
:EXEC_VALIDATED_AT: 2026-06-22T03:45Z
:EXEC_VALIDATED_BY: claude-code
:END:

The literate-validation loop (see Literate Validation Loop) adds crashed as a recognized exit path alongside verified and broken:

(def exec-verdict-vocab
  #{"verified" "broken" "crashed" "partial" "skipped"})

(defn classify-execution [exit-code signal stderr]
  (cond
    (and signal (#{"SIGSEGV" "SIGABRT" "SIGBUS"} signal)) "crashed"
    (zero? exit-code)                                       "verified"
    :else                                                   "broken"))

Practically, the distinction matters for cross-platform bug work: when filing a FreeBSD Bugzilla report, the relevant heading should carry :EXEC_VERDICT: crashed with the signal and rc, not broken, so that a reader can immediately understand this is a confirmed crash, not an environment issue.

8. Validation Ledger

Any validation appends a block to the tamper-evident audit ledger .verify/chain.jsonl (a local hash chain). Each block binds the hash of the change and the verifier to the previous block's hash, so past validations cannot be silently rewritten – editing any block breaks every block after it.

bb scripts/verify-chain add --change-file <path> --verifier <name> \
    --verdict <correct|corrected|disputed|verified|needs-citation> --subject <id>
gmake verify-chain     # check chain integrity (exit 1 if broken)
gmake audit-trail      # print the trail

The :VERDICT: of a :VERIFIED: drawer and a ledger block's verdict are the same value: the drawer lives on the heading; the ledger is the cross-cutting append-only trail. Spec: docs/verification-ledger-spec.org.

Every verdict written to a property drawer must have a corresponding entry in .verify/chain.jsonl. The two records must agree on verdict.

If bd is unavailable (schema mismatch across machines), continue with file-based tracking in _queue/ or _drafts/ and note the bd failure in the session's first annotated heading under :FINDING:.

9. References

Martin-Boyle, Anna, Cara A.C. Leckey, Martha C. Brown, and Harmanpreet Kaur. 2026. “Papertrail: A Claim-Evidence Interface for Grounding Provenance in Llm-Based Scholarly Q&a.” In Chi ’26: Proceedings of the 2026 Chi Conference on Human Factors in Computing Systems. ACM. https://doi.org/10.1145/3772318.3791101.
Sanderson, Robert, Paolo Ciccarese, and Benjamin Young. 2017. “Web Annotation Data Model.” W3C Recommendation. World Wide Web Consortium. https://www.w3.org/TR/annotation-model/.