Controlled Vocabulary v1 for wal.sh

1. Problem Statement
2. Related Work
3. Design Decisions
4. V1 Controlled Vocabulary
5. Validation Rules
- 5.1. What does NOT go in #+FILETAGS
6. Migration Strategy
7. Implementation: Adding Build-Time Validation
8. Term Equivalency Table
9. When to Add a New Term
- 9.1. Process for adding a new term
10. Invariants Summary
11. What This Vocabulary Does Not Solve
12. Implementation Priority

1. Problem Statement

wal.sh has 708 org files as of 2026. 505 carry #+KEYWORDS headers; 131 of those are placeholder text or empty. The remaining 492 use free-text, author-chosen terms with no normalization rules.

The existing vocabulary is a folksonomy: terms chosen at authoring time, never reviewed against prior usage. The predictable pathologies are all present:

Symptom	Evidence
Explosive unique count	3,406 unique terms across 492 docs
Power-law singleton mass	~2,777 terms appear exactly once (81%)
Near-duplicate fragmentation	`org-mode` (22) vs `org mode` (4); `agentic system` (8) vs `agentic systems` (2) vs `agentic-systems` (2) vs `agentic ai` (2) vs `agentic-ai` (2)
Sentence-length "keywords"	"moving from angular 1 to 4", "best practices for jquery performance"
Year-as-keyword	`1973`, `2011`, `2024` – redundant with #+DATE; pollutes chip display
Too-broad single terms	`clojure` on 61 docs, `javascript` on 35 – chips with too many hits
Format pollution	`sessions`, `keynote`, `index`, `welcome` are structural, not topical

The search engine (pocket-es) surfaces keywords as clickable chips. A chip that resolves to 61 documents is noise. A chip that resolves to one is isolation. The operational goal: chips that connect 2-25 documents each.
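The 2-25 band can be checked mechanically. The following is a minimal sketch, assuming each document's keywords are already available as a list of strings; `chip_report` and its bucket names are illustrative and not part of pocket-es.

```python
from collections import Counter

def chip_report(docs_keywords, low=2, high=25):
    """Bucket each keyword by the number of documents its chip would resolve to.

    docs_keywords: one list of keyword strings per document (assumed input shape).
    """
    doc_freq = Counter()
    for keywords in docs_keywords:
        # Count a keyword once per document, even if the header repeats it.
        doc_freq.update(set(keywords))

    report = {"isolated": [], "connecting": [], "noisy": []}
    for term, count in sorted(doc_freq.items()):
        if count < low:
            report["isolated"].append(term)      # chip leads to too few documents
        elif count > high:
            report["noisy"].append(term)         # chip leads to too many documents
        else:
            report["connecting"].append(term)    # chip is in the target band
    return report
```

Running this over the corpus before and after each migration phase would show whether terms are moving out of the isolated and noisy buckets.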

This document defines a controlled vocabulary to replace the current folksonomy. It is deliberately v1 – simple, practical, retroactively applicable to 2012-era jQuery notes and 2026-era agent architecture papers.

2. Related Work

2.1. Gwern.net

Gwern uses a two-tier approach: a flat tag system (~200 maintained terms) combined with a manually maintained Tags metadata field per essay. Tags are topic-oriented, not faceted. He avoids format tags entirely – an essay about machine learning gets machine-learning, not essay or long-form. The discipline is editorial: Gwern writes 1-3 tags per essay, pruning aggressively. This produces clean chips but requires constant manual curation as the corpus grows.

Key lesson: fewer tags per document forces prioritization. Six tags per document encourages tag accumulation without semantic work.

2.2. Simon Willison's blog

Willison uses a flat tag taxonomy (~500-600 terms) with a tag-management interface (tag rename, tag merge). He does not distinguish facets. The result is browsable-by-tag but with moderate fragmentation: llm, llms, large-language-models coexist. His corpus is larger (5,000+ posts) and the tag interface provides a compensating mechanism for users.

Key lesson: flat tags work at scale only if the author has a merge interface and uses it. Without merging, fragmentation compounds annually.

2.3. ACM CCS and IEEE Keyword Schemes

The ACM Computing Classification System is hierarchical (3-4 levels) and faceted (Computing > Theory > Algorithms). IEEE uses free-text index terms plus a controlled vocabulary per venue. Both are designed for cross-author consistency in a shared corpus – the structural problem they solve does not exist for a single-author personal site.

Key lesson: hierarchical classification encodes disciplinary consensus that a personal archive does not have and does not need. The useful borrowing from library science is the facet concept, not the hierarchy depth.

2.4. Library Science: Faceted Classification

Ranganathan's PMEST facets (Personality, Matter, Energy, Space, Time) from 1933 are the theoretical ancestor of modern faceted search. The principle: a document's classification should answer several orthogonal questions (what domain, what format, what time period) independently, not encode a single hierarchical path.

Faceted classification is the right model for a personal archive because:

Documents span many domains simultaneously (a conference note about Clojure at a 2019 event touches language + event + year)
A single path (conferences > 2019 > clojure) requires choosing which dimension is primary; facets avoid that choice
Facets compose: filtering on lang:clojure AND format:conference works without encoding their intersection as a single term
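Facet composition reduces to set containment. A small sketch under the assumption that each document's tags are held as a set of facet:term strings; the file paths and the `filter_docs` helper are hypothetical.

```python
def filter_docs(index, *required):
    """Return the paths whose tag set contains every required facet:term."""
    wanted = set(required)
    return sorted(path for path, tags in index.items() if wanted <= tags)

# Hypothetical index: path -> set of facet:term tags.
index = {
    "events/clojure-conj.org": {"lang:clojure", "format:conference"},
    "research/clojure-spec.org": {"lang:clojure", "format:research"},
    "events/gophercon.org": {"lang:go", "format:conference"},
}

conference_clojure = filter_docs(index, "lang:clojure", "format:conference")
```

No term for the intersection exists anywhere in the vocabulary; the intersection is computed at query time from two independent tags.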

2.5. Folksonomies

Vander Wal's original folksonomy analysis (2004) and Golder & Huberman's empirical study (2006) both observed that unconstrained social tagging produces power-law distributions: a few tags used very frequently, a long tail of tags used once. This is the current state of wal.sh's vocabulary.

The convergence approach (see keyword-vocabulary-convergence) iterates a folksonomy toward a controlled vocabulary's properties without discarding accumulated tagging work. V1 vocabulary defines the target state that convergence is trying to reach.

3. Design Decisions

3.1. Flat tags over hierarchy

wal.sh does not need a 4-level hierarchy. The corpus is one author's research across ~14 years. A document about Clojure concurrency is not ambiguous enough to require Computer Science > Programming Languages > Functional > Clojure > Concurrency. The facet structure provides orthogonal dimensions without encoding depth.

3.2. Controlled terms only in #+FILETAGS; free text stays in #+KEYWORDS

org-mode has two separate metadata mechanisms:

#+KEYWORDS (string, comma-separated): processed by org-publish into HTML metadata, indexed by pocket-es for BM25 search, displayed as chips. Free text is acceptable here for searchability.
#+FILETAGS (colon-delimited tags): org-mode's native tagging system, inherits in outlines, visible in org-agenda, used by tag-based queries. Controlled vocabulary terms belong here.

V1 uses both with different contracts:

Field	Content	Contract
`#+FILETAGS`	Controlled vocabulary terms only	Must be in the v1 term list; validated at build time
`#+KEYWORDS`	Free-text, descriptive, searchable	No validation; normalized post-hoc via convergence

This means the search chips (pocket-es) continue working from #+KEYWORDS as today. The #+FILETAGS controlled vocabulary enables org-agenda queries, org-roam graph navigation, and future structured filtering.
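The two contracts imply two different parsers. The sketch below is illustrative only. The source does not show a concrete #+FILETAGS line, and org-mode's own tag syntax treats every colon as a delimiter, so the serialization used here is an assumption: consecutive colon-separated segments are paired into facet:term tokens (`:lang:clojure:domain:databases:`). The `parse_headers` name and the sample document are hypothetical.

```python
import re

# Matches file-level "#+KEYWORDS: ..." and "#+FILETAGS: ..." lines.
HEADER = re.compile(r"^#\+(KEYWORDS|FILETAGS):[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)

def parse_headers(org_text):
    """Return (keywords, filetags) from the headers of one org document."""
    keywords, filetags = [], []
    for name, value in HEADER.findall(org_text):
        if name.upper() == "KEYWORDS":
            # Free text: comma-separated, kept verbatim apart from trimming.
            keywords += [k.strip() for k in value.split(",") if k.strip()]
        else:
            # Assumed layout: segments alternate facet, term, facet, term, ...
            parts = [p for p in value.strip().split(":") if p]
            pairs = [f"{a}:{b}" for a, b in zip(parts[0::2], parts[1::2])]
            # An unpaired trailing segment is kept so a validator can reject it.
            filetags += pairs + parts[2 * len(pairs):]
    return keywords, filetags

sample = """#+TITLE: Datomic indexing notes
#+KEYWORDS: datomic, indexing, boston
#+FILETAGS: :lang:clojure:domain:databases:
"""
keywords, filetags = parse_headers(sample)
```

If the real files serialize controlled terms differently (for example with a separator other than a colon inside the token), only the FILETAGS branch needs to change.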

3.3. No year tags

Year is in #+DATE. The pocket-es indexer already extracts date. A year tag duplicates information already in the document and creates maintenance drift (a note dated 2019 with a 2024 tag because it was updated). The date field is the canonical year authority.

3.4. No location singletons in controlled vocabulary

Location terms (boston, cascais, portland, amsterdam) are valid search keywords in #+KEYWORDS and are useful in pocket-es BM25 results, but they do not belong in the controlled vocabulary. They are specificity identifiers, not category terms. For example, boston is a valid #+KEYWORDS term; it is not a valid #+FILETAGS controlled term.

3.5. Format facet replaces structural keywords

sessions, keynote, index, welcome, overview are currently used as keywords but encode document structure, not topic. The format facet captures these cleanly (format:conference, format:index) and removes them from the keyword noise.

4. V1 Controlled Vocabulary

Six facets. 51 terms (12 lang, 16 domain, 8 format, 4 era, 7 project, 4 status). The design principle: a term earns its place in the controlled vocabulary by appearing in at least 3 documents AND being irreducible (cannot be expressed by combining other controlled terms).

4.1. Facet 1: lang (Programming Language)

Used when the primary substance of the document is about a specific language as a language – its semantics, idioms, ecosystem, or community.

Do NOT use for documents that happen to contain code in a language. A note about database indexing that uses Python examples is not lang:python; it is domain:databases. Use lang when the language itself is the subject.

Term	Covers	Current keyword equivalents
`lang:clojure`	Clojure, ClojureScript, clojure.spec, EDN	`clojure` (61), `clojurescript` (21)
`lang:scheme`	Scheme, Guile, Racket, Chez, Gambit	`scheme` (19), `guile` (10), `racket` (11)
`lang:lisp`	Common Lisp, Emacs Lisp, historical Lisp	`lisp` (17), `elisp` (6)
`lang:python`	Python 2, Python 3, CPython	`python` (31)
`lang:javascript`	JS, ES6+, CoffeeScript, Node.js	`javascript` (35), `typescript` (14)
`lang:typescript`	TypeScript specifically when type system is the topic	`typescript` (14)
`lang:rust`	Rust	`rust` (8)
`lang:haskell`	Haskell, PureScript	`haskell` (5)
`lang:java`	Java, Kotlin, Scala, JVM languages	`jvm` (4), `scala` (2)
`lang:go`	Go	`gophercon` area
`lang:cpp`	C, C++, systems languages	scattered
`lang:sql`	SQL, relational query languages	`postgresql` (4), `datomic` (7)

Normalization note: lang:scheme absorbs racket (a Scheme descendant) and guile. The racketcon keyword maps to lang:scheme only when Racket as a language is the topic; when the conference itself is the topic, use format:conference instead.

4.2. Facet 2: domain (Subject Domain)

The conceptual territory the document inhabits. Most documents get 1-3 domain tags.

Term	Covers	Current keyword equivalents
`domain:agents`	LLM agents, agentic systems, multi-agent	`agent` (9), `agents` (8), `agentic system` (8), `ai agents` (9), `multi-agent` (10)
`domain:ml`	Machine learning, deep learning, neural networks	`machine learning` (25), `deep learning` (5), `neural network` (4)
`domain:llm`	Large language models as objects of study	`llm` (23), `openai` (6), `anthropic` (9), `claude` (9)
`domain:formal-methods`	TLA+, Lean, Alloy, Coq, model checking, proof	`formal verification` (7), `tla+` (4), `lean4` (6), `alloy` (present)
`domain:security`	Security, privacy, threat modeling, cryptography	`security` (20), `privacy` (5)
`domain:distributed`	Distributed systems, consensus, CAP, network protocols	`distributed systems` (7), `networking` (9)
`domain:databases`	Databases, query languages, storage, indexing	`database` (5), `datomic` (7), `postgresql` (4)
`domain:web`	Web platform: HTML, CSS, DOM, HTTP, REST	`html5` (8), `css` (9), `react` (10), `graphql` (7)
`domain:infrastructure`	Cloud, k8s, containers, CI/CD, monitoring	`aws` (12), `freebsd` (10), `kubernetes` (4), `devops` (6)
`domain:algorithms`	Algorithms, data structures, complexity, puzzles	`algorithms` (5), `data structures` (6), `dynamic programming` (5)
`domain:category-theory`	Category theory, type theory, abstract algebra	`lambda calculus` (7), `category theory` (4)
`domain:search`	Information retrieval, search engines, indexing	`search` (6), `bm25` (5), `pocket-es` (7)
`domain:emacs`	Emacs, org-mode, Emacs Lisp as tools	`emacs` (24), `org-mode` (22), `elisp` (6)
`domain:retrocomputing`	Historical systems, pre-1990 computing, archaeology	`pdp-11` (5), `retrocomputing` (4), `unix v4` (7), `1973` (6)
`domain:aviation`	ADS-B, flight tracking, airspace, aircraft	`ads-b` (7)
`domain:fintech`	Financial technology, payments, trading	`fintech` (5)

4.3. Facet 3: format (Document Format)

What kind of document this is. Orthogonal to domain and language. A conference note and a research essay about the same topic get different format tags.

Term	Covers
`format:conference`	Conference notes, event proceedings, talk summaries
`format:research`	Research essays, deep-dives, long-form analysis
`format:tutorial`	How-to guides, walkthroughs, setup instructions
`format:spec`	Specifications, contracts, invariant definitions
`format:weekly`	Weekly activity summaries
`format:index`	Index/overview pages (landing pages for a topic area)
`format:brief`	Morning briefs, quick notes, short-form
`format:experiment`	Exploratory notes, proof-of-concept, WIP

The distinction between format:research and format:tutorial is intent: research documents argue a position or explore a question; tutorials instruct. A document can be both (format:research + format:tutorial is valid).

4.4. Facet 4: era (Time Horizon)

Captures the era a document's content belongs to, independent of authoring date. A 2024 retrospective on 1970s Unix belongs to era:historical, not era:current.

Term	Covers
`era:historical`	Pre-2000 content, retrocomputing, historical analysis
`era:foundational`	2000-2015, web 2.0 era, early frameworks, jQuery
`era:current`	2016-present, modern tooling, current practices
`era:emerging`	Speculative, cutting-edge, forward-looking

Most documents do not need an era tag. Use it when the era is the point of the document – when "this was how we did it in 2012" or "this is what's coming" is the primary framing.

4.5. Facet 5: project (Site Projects)

wal.sh-specific research threads. These are internal organizational tags, not topical tags.

Term	Covers
`project:pocket-es`	The site's BM25 search engine
`project:webring`	The wwn webring / bot trap
`project:ads-b`	The ADS-B flight tracking project
`project:goldberry`	The Goldberry frontend system
`project:agentic-2026`	The 2026 agentic research series
`project:beads`	The beads (bd) issue tracking system
`project:unix-v4`	The Unix V4 retrocomputing research

Project tags are used sparingly – only when a document is primarily about the project rather than incidentally using it. The pocket-es spec document gets project:pocket-es; a research note that uses BM25 as one example does not.

4.6. Facet 6: status (Publication Status)

Optional. Use only for documents whose state differs from normal publication: unfinished, superseded, or deliberately kept current.

Term	Covers
`status:draft`	Work in progress, not ready for external linking
`status:stub`	Placeholder, minimal content, intended to grow
`status:deprecated`	Superseded by a newer document
`status:evergreen`	Intentionally maintained and kept current

Most documents should have no status tag – absence implies normal publication state. The status facet exists to distinguish documents that need curation from those that are complete.

5. Validation Rules

A document's #+FILETAGS field is valid when:

Every colon-delimited token matches the pattern facet:term where facet is one of {lang, domain, format, era, project, status} and term is in the term list above for that facet.
At most one era tag per document.
At most one status tag per document.
lang tags are used only when the language is the subject (not just the implementation language of an example).
Minimum: at least one domain tag per published research note. Conference notes require format:conference.
A __KEYWORDS__ placeholder in #+FILETAGS is a lint failure.
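Most of these rules are mechanical. The sketch below checks them against an already-tokenized tag list; it assumes the caller knows whether the file is a research note or a conference note, and it uses a deliberately small subset of the term list. `VOCAB` and `validate_filetags` are hypothetical names. The lang-as-subject rule needs editorial judgment and is not checked.

```python
# Subset of the v1 term list, for illustration only.
VOCAB = {
    "lang": {"clojure", "python", "rust"},
    "domain": {"agents", "databases", "formal-methods"},
    "format": {"conference", "research", "tutorial"},
    "era": {"historical", "foundational", "current", "emerging"},
    "project": {"pocket-es"},
    "status": {"draft", "stub", "deprecated", "evergreen"},
}

def validate_filetags(tokens, research_note=False, conference_note=False):
    """Return a list of rule violations; an empty list means the tags are valid."""
    errors, seen = [], []
    for token in tokens:
        if token == "__KEYWORDS__":
            errors.append("placeholder __KEYWORDS__ left in #+FILETAGS")
            continue
        facet, sep, term = token.partition(":")
        if not sep or facet not in VOCAB:
            errors.append(f"{token!r} does not match facet:term")
        elif term not in VOCAB[facet]:
            errors.append(f"{token!r} is not a v1 term")
        else:
            seen.append((facet, term))
    # At most one era tag and at most one status tag per document.
    for facet in ("era", "status"):
        if sum(1 for f, _ in seen if f == facet) > 1:
            errors.append(f"more than one {facet} tag")
    if research_note and not any(f == "domain" for f, _ in seen):
        errors.append("research note has no domain tag")
    if conference_note and ("format", "conference") not in seen:
        errors.append("conference note lacks format:conference")
    return errors
```

Returning messages rather than raising lets a build script decide which violations are failures and which are warnings.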

5.1. What does NOT go in #+FILETAGS

Location names (boston, berlin, portland)
Year numbers (2019, 2024) – use #+DATE
Proper nouns for specific events (clojure-conj-2023, racketcon-2024)
Person names
Library/framework names as topics (reagent, redux, langgraph) unless the framework's design is the subject
Free-text phrases ("best practices for X")

These all belong in #+KEYWORDS, where they contribute to BM25 search without polluting the controlled vocabulary.

6. Migration Strategy

The corpus has three tiers with different migration costs:

6.1. Tier 1: 131 documents without keywords (zero-cost baseline)

These need both #+KEYWORDS and #+FILETAGS added. Priority order:

Conference notes (format:conference is mechanical to assign)
Research notes without any metadata
Index pages (format:index is mechanical)

Suggested tooling: a Babashka script that reads #+TITLE and #+DESCRIPTION, calls the Ollama embedding API (nomic-embed-text:v1.5 at 192.168.86.22:11434), finds the 5 nearest tagged documents, and proposes their #+FILETAGS as candidates. The author approves or adjusts.
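The nearest-neighbour step can be sketched independently of the embedding call. The version below is written in Python rather than Babashka and assumes the embeddings have already been fetched as plain lists of floats; the request to the embedding API is deliberately left out. `cosine` and `propose_filetags` are hypothetical helpers.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propose_filetags(query_vec, tagged_docs, k=5):
    """Propose tags for an untagged document from its k nearest tagged neighbours.

    tagged_docs: list of (embedding, filetags) pairs for documents already tagged.
    """
    nearest = sorted(tagged_docs, key=lambda doc: cosine(query_vec, doc[0]), reverse=True)[:k]
    votes = Counter(tag for _, tags in nearest for tag in tags)
    # Most-voted tags first; ties are broken alphabetically for stable output.
    return [tag for tag, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]
```

The output is a ranked candidate list only; the author still approves or adjusts each proposal.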

6.2. Tier 2: 492 documents with free-text keywords (convergence path)

Do NOT bulk-replace existing #+KEYWORDS. They contain searchable signal. Instead, ADD #+FILETAGS alongside the existing #+KEYWORDS.

Phase 1 (mechanical, 1-2 days of scripted work):

All files under site/events/ get format:conference in #+FILETAGS
All files under site/activity-summary/ get format:weekly in #+FILETAGS
All files with #+KEYWORDS: pocket-es get project:pocket-es
All files with #+KEYWORDS containing clojure get lang:clojure

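The path-based rules are pure automation and need no editorial judgment. The keyword-based rules over-apply on purpose: a document can mention clojure or pocket-es without being about either, so tags assigned this way should be reviewed against the lang and project usage rules during Phase 2.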
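The four Phase 1 rules fit in one function. This is a sketch, assuming paths are given relative to the repository root and keywords have already been split into a list; `phase1_filetags` and the example paths in the usage note are hypothetical.

```python
from pathlib import PurePosixPath

def phase1_filetags(path, keywords):
    """Derive the mechanical Phase 1 tags from a file's path and its #+KEYWORDS."""
    parts = PurePosixPath(path).parts
    lowered = [k.strip().lower() for k in keywords]
    tags = []
    if parts[:2] == ("site", "events"):
        tags.append("format:conference")
    if parts[:2] == ("site", "activity-summary"):
        tags.append("format:weekly")
    if "pocket-es" in lowered:
        tags.append("project:pocket-es")
    # "containing clojure" also matches clojurescript, which lang:clojure covers.
    if any("clojure" in k for k in lowered):
        tags.append("lang:clojure")
    return tags
```

For example, `phase1_filetags("site/events/2019/clojure-conj.org", ["Clojure", "boston"])` yields `["format:conference", "lang:clojure"]`; the location keyword contributes nothing and stays in #+KEYWORDS.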

Phase 2 (editorial, high-value documents first):

The 119 terms appearing 5+ times in #+KEYWORDS are the consolidation targets. For each, map to the controlled vocabulary and add the corresponding #+FILETAGS to documents using that term.
Priority: the 20 highest-frequency terms cover a large fraction of docs.

Phase 3 (convergence cleanup in #+KEYWORDS):

After #+FILETAGS is populated, run the convergence algorithm (see 2026-keyword-vocabulary-convergence) to normalize #+KEYWORDS.
Merge near-duplicates: agentic system / agentic systems / agentic-systems → agentic-systems (prefer hyphenated form, consistent with FILETAGS style).
Remove year-only tokens from #+KEYWORDS where #+DATE already carries the year.
Do NOT remove location keywords – they are valid search terms.
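The cleanup rules above can be sketched as a single normalization pass. Two points are assumptions rather than statements from the source: every multi-word keyword is hyphenated (the text only states the preference for the agentic-systems case), and singular/plural merges come from an explicit alias map instead of being inferred. `ALIASES` and `normalize_keywords` are hypothetical names.

```python
import re

# Explicit merge map (hypothetical contents); keys are already lowercased and hyphenated.
ALIASES = {"agentic-system": "agentic-systems"}

def normalize_keywords(keywords, date_year=None):
    """Lowercase, hyphenate, merge aliases, and drop the year that #+DATE already carries."""
    out = []
    for keyword in keywords:
        term = re.sub(r"\s+", "-", keyword.strip().lower())
        term = ALIASES.get(term, term)
        # Only the year equal to the document's own #+DATE year is redundant.
        if date_year is not None and term == str(date_year):
            continue
        if term and term not in out:
            out.append(term)
    return out
```

With `date_year=2024`, the list `["Agentic System", "agentic systems", "2024", "1973", "boston"]` reduces to `["agentic-systems", "1973", "boston"]`: the redundant year is dropped, while the unrelated year and the location keyword survive.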

6.3. Tier 3: Normalize #+KEYWORDS casing

The 3,406 unique keyword tokens include mixed-case variants. The convergence algorithm normalizes to lowercase. As a build-time invariant: the header audit (wal-sh.site.check-headers, run via bb scripts/check-headers) – or a new sibling – should warn when #+KEYWORDS contains uppercase tokens, since the BM25 tokenizer lowercases at index time anyway. Storing uppercase is misleading.

7. Implementation: Adding Build-Time Validation

The header audit (wal-sh.site.check-headers, run via bb scripts/check-headers; the Python scripts/check_required_headers.py is its now-superseded source) validates the presence of four headers. Extend it, or add a separate scripts/check_vocabulary.py, to perform the following checks:

check_vocabulary.py:
  For each .org file under site/:
    1. If #+FILETAGS is present:
       - Parse colon-delimited tokens
       - Reject a __KEYWORDS__ placeholder
       - Reject any token not matching facet:term pattern
       - Reject any token not in the v1 term list
       - Reject more than one era tag or more than one status tag
       - Warn on missing domain tag for research notes
       - Warn on missing format:conference for files under events/
    2. Emit: OK / WARN / FAIL per file
    3. Exit non-zero if any FAIL
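A runnable sketch of the outline above follows. It implements only the term-list check, the two warnings, and the per-file status; the remaining rules would slot into `check_text` the same way. Several details are assumptions: research notes are identified by living under site/research/, facet:term tokens are serialized as paired colon-separated segments, and `TERMS` is a tiny stand-in for the real list. All function names are hypothetical.

```python
import re
from pathlib import Path

# Stand-in for the v1 term list; the real list would be imported, not inlined.
TERMS = {"format:conference", "format:research", "domain:databases", "lang:clojure"}
FILETAGS = re.compile(r"^#\+FILETAGS:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)

def tokens_from(value):
    """Pair colon-separated segments into facet:term tokens (assumed layout)."""
    parts = [p for p in value.strip().split(":") if p]
    pairs = [f"{a}:{b}" for a, b in zip(parts[0::2], parts[1::2])]
    return pairs + parts[2 * len(pairs):]  # an unpaired segment stays bare and fails

def check_text(rel_path, text):
    """Return (status, messages) for one file; rel_path is relative to site/."""
    match = FILETAGS.search(text)
    if match is None:
        return "OK", []  # the outline only validates files that have #+FILETAGS
    tokens = tokens_from(match.group(1))
    fails = [f"unknown term {t!r}" for t in tokens if t not in TERMS]
    warns = []
    top = Path(rel_path).parts[0]
    if top == "research" and not any(t.startswith("domain:") for t in tokens):
        warns.append("research note without a domain tag")
    if top == "events" and "format:conference" not in tokens:
        warns.append("event file without format:conference")
    if fails:
        return "FAIL", fails + warns
    return ("WARN", warns) if warns else ("OK", [])

def main(root="site"):
    """Print one status line per file; return 1 if any file failed, else 0."""
    failed = False
    for path in sorted(Path(root).rglob("*.org")):
        rel = path.relative_to(root).as_posix()
        status, messages = check_text(rel, path.read_text(encoding="utf-8"))
        print(f"{status} {path}" + "".join(f"\n  - {m}" for m in messages))
        failed = failed or status == "FAIL"
    return 1 if failed else 0
```

A script entry point would call `raise SystemExit(main())` so that the lint target fails on any FAIL while warnings leave the exit status at zero.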

Add as a dependency of the lint target in Makefile:

lint: check-well-known check-required-headers check-vocabulary

The vocabulary term list lives in a single source of truth: site/research/controlled-vocabulary-v1.org (this file) plus a machine-readable extraction. The simplest machine-readable forms are a YAML comment block in this file or a separate scripts/vocabulary_v1.py module that exposes the term list as a dict.
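One possible shape for the module form is sketched below. The terms are transcribed from the facet tables in Section 4; the names `VOCABULARY_V1`, `all_terms`, and `is_controlled` are assumptions, since the source does not specify the module's interface.

```python
# Facet -> terms, transcribed from the facet tables in Section 4.
VOCABULARY_V1 = {
    "lang": ["clojure", "scheme", "lisp", "python", "javascript", "typescript",
             "rust", "haskell", "java", "go", "cpp", "sql"],
    "domain": ["agents", "ml", "llm", "formal-methods", "security", "distributed",
               "databases", "web", "infrastructure", "algorithms", "category-theory",
               "search", "emacs", "retrocomputing", "aviation", "fintech"],
    "format": ["conference", "research", "tutorial", "spec", "weekly", "index",
               "brief", "experiment"],
    "era": ["historical", "foundational", "current", "emerging"],
    "project": ["pocket-es", "webring", "ads-b", "goldberry", "agentic-2026",
                "beads", "unix-v4"],
    "status": ["draft", "stub", "deprecated", "evergreen"],
}

def all_terms(vocab=VOCABULARY_V1):
    """Flatten the dict into the set of valid facet:term tokens."""
    return {f"{facet}:{term}" for facet, terms in vocab.items() for term in terms}

def is_controlled(token):
    """True if the token is a valid v1 controlled-vocabulary term."""
    return token in all_terms()
```

Keeping the dict keyed by facet means the validator can report whether the facet or the term was the unknown part.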

8. Term Equivalency Table

For migration: what current free-text #+KEYWORDS terms map to which controlled vocabulary #+FILETAGS terms.

Current #+KEYWORDS term(s)	Maps to #+FILETAGS term
`clojure`, `clojurescript`	`lang:clojure`
`scheme`, `guile`, `racket`	`lang:scheme`
`lisp`, `elisp`	`lang:lisp` (if language is the subject)
`python`	`lang:python`
`javascript`, `js`, `node.js`	`lang:javascript`
`typescript`	`lang:typescript` (if the type system is the topic; otherwise `lang:javascript`)
`rust`	`lang:rust`
`haskell`	`lang:haskell`
`java`, `jvm`, `scala`	`lang:java`
`agent`, `agents`, `agentic system`, `ai agents`, `multi-agent`	`domain:agents`
`machine learning`, `deep learning`, `neural network`, `scikit-learn`	`domain:ml`
`llm`, `claude`, `openai`, `anthropic`, `gemini`, `ollama`	`domain:llm`
`formal verification`, `tla+`, `lean4`, `alloy`, `model checking`	`domain:formal-methods`
`security`, `privacy`, `cybersecurity`	`domain:security`
`distributed systems`, `networking`, `consensus`	`domain:distributed`
`database`, `datomic`, `postgresql`, `sql`	`domain:databases`
`html5`, `css`, `react`, `graphql`, `jquery`, `angular`	`domain:web`
`aws`, `freebsd`, `kubernetes`, `devops`, `ci/cd`, `terraform`	`domain:infrastructure`
`algorithm`, `algorithms`, `data structures`, `dynamic programming`	`domain:algorithms`
`category theory`, `lambda calculus`, `type theory`	`domain:category-theory`
`search`, `bm25`, `pocket-es`, `information retrieval`	`domain:search`
`emacs`, `org-mode`, `org-babel`	`domain:emacs`
`pdp-11`, `retrocomputing`, `unix v4`, `1973`	`domain:retrocomputing`
`ads-b`, `sdr`, `aviation`, `dump1090`	`domain:aviation`
`fintech`, `payments`, `trading`	`domain:fintech`
(file under events/)	`format:conference`
(file under activity-summary/)	`format:weekly`
`spec`, `specification`, `contract`	`format:spec`
`weekly-summary`	`format:weekly`
`morning-brief`	`format:brief`
`pocket-es` (as project)	`project:pocket-es`
`webring`, `wwn`	`project:webring`
`goldberry`	`project:goldberry`
`beads`	`project:beads`
`ads-b` (as project)	`project:ads-b`
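As a lookup, the table turns a document's existing keywords into candidate tags and leaves everything else as free text. The sketch below inlines only a few rows; `EQUIVALENTS` and `map_keywords` are hypothetical names, and conditional rows (those marked "if language is the subject" or "as project") are omitted because they need editorial review.

```python
# A few unconditional rows from the equivalency table; not the full mapping.
EQUIVALENTS = {
    "clojure": "lang:clojure",
    "clojurescript": "lang:clojure",
    "scheme": "lang:scheme",
    "guile": "lang:scheme",
    "racket": "lang:scheme",
    "datomic": "domain:databases",
    "postgresql": "domain:databases",
    "bm25": "domain:search",
    "webring": "project:webring",
    "wwn": "project:webring",
}

def map_keywords(keywords):
    """Split keywords into (proposed controlled tags, keywords with no mapping)."""
    mapped, unmapped = [], []
    for keyword in keywords:
        key = keyword.strip().lower()
        tag = EQUIVALENTS.get(key)
        if tag is None:
            unmapped.append(key)   # stays in #+KEYWORDS only
        elif tag not in mapped:
            mapped.append(tag)     # several keywords can collapse to one tag
    return mapped, unmapped
```

The original #+KEYWORDS line is left untouched either way; the function only proposes what to add to #+FILETAGS.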

9. When to Add a New Term

A new term earns entry into the controlled vocabulary when:

At least 3 existing documents would use it immediately (retroactive applicability test).
It cannot be expressed by combining two existing terms.
It is orthogonal to existing facets – it adds a dimension of discrimination not already covered.
It survives a "would I filter on this?" test: if someone searching wal.sh asked to filter by this property, would the results be a coherent set of documents?

New terms that fail these tests belong in #+KEYWORDS (free text) but not in #+FILETAGS (controlled vocabulary). The controlled vocabulary grows slowly, by consensus with prior content, not by authoring-time impulse.

9.1. Process for adding a new term

Open this file and add the term in the appropriate facet table with its coverage description.
Update scripts/vocabulary_v1.py (the machine-readable list).
Tag the documents that prompted the addition.
Run gmake lint to verify the new term validates correctly.
Commit with a message of the form feat: add lang:erlang to controlled vocabulary v1.

10. Invariants Summary

These are the machine-checkable properties that define a healthy vocabulary state for wal.sh:

Invariant	Target	Current State
Every `#+FILETAGS` token matches a controlled term	100%	0% (FILETAGS barely used)
Every published research note has ≥1 domain tag	100%	~0%
Every event file has `format:conference`	100%	~5%
`#+KEYWORDS` singleton ratio	<15%	~81%
`#+KEYWORDS` near-duplicate pairs	<5	~6+ confirmed
Documents with no `#+KEYWORDS`	<5%	~21%
Controlled vocabulary term count	<80	51 (v1)
Maximum `#+FILETAGS` tokens per doc	≤6	N/A

The singleton ratio target of <15% (from the convergence document's <10% criterion, relaxed slightly for the steady-state with ongoing content addition) should be measurable by the pocket-es indexer or the convergence Babashka script.
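Both keyword invariants are straightforward to measure. The sketch below is written in Python rather than as part of the indexer or the Babashka script, and the near-duplicate key (drop case, spaces, hyphens, and one trailing "s") is a crude heuristic chosen for illustration; its groups are candidates for review, not confirmed duplicates. Both function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

def singleton_ratio(docs_keywords):
    """Fraction of unique keywords that appear in exactly one document."""
    doc_freq = Counter()
    for keywords in docs_keywords:
        doc_freq.update(set(keywords))
    if not doc_freq:
        return 0.0
    return sum(1 for count in doc_freq.values() if count == 1) / len(doc_freq)

def near_duplicate_groups(terms):
    """Group terms that collide once case, spaces, hyphens and a trailing 's' are dropped."""
    groups = defaultdict(set)
    for term in terms:
        key = re.sub(r"[\s\-]+", "", term.lower())
        if key.endswith("s") and len(key) > 3:
            key = key[:-1]
        groups[key].add(term)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

Applied to the examples from the problem statement, `near_duplicate_groups` puts `org-mode` and `org mode` in one group and the three `agentic system` spellings in another.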

11. What This Vocabulary Does Not Solve

Be explicit about scope limits:

It does not eliminate #+KEYWORDS explosion. Free-text keywords will continue to grow. The convergence algorithm manages them; the controlled vocabulary does not replace them.
It does not retroactively classify 14 years of notes automatically. Phase 2 migration requires editorial judgment for ~272 documents (the research notes). Scripted Phase 1 handles the rest mechanically.
It does not handle cross-facet compound concepts. A document about "Rust-based formal verification tools" is lang:rust + domain:formal-methods. The vocabulary does not create lang:rust:formal-methods compounds.
It is not versioned in place. When the vocabulary needs a v2 (new facets, retired terms), that is a new document and a migration pass, not an in-place update to this file.

12. Implementation Priority

Ordered by value-to-effort ratio:

Add scripts/vocabulary_v1.py with the machine-readable term list (30 min)
Add scripts/check_vocabulary.py (or extend the header audit) for FILETAGS validation (2 hours)
Add the check-vocabulary target to the Makefile as a dependency of lint (15 min)
Script Phase 1 migration: events/ → format:conference, activity-summary/ → format:weekly (1 hour)
Manually tag the 20 most-linked research notes with domain tags (2-3 hours)
Run #+KEYWORDS normalization (convergence pass 1) on the full corpus (4 hours + review)