Controlled Vocabulary v1 for wal.sh

1. Problem Statement

wal.sh has 708 org files as of 2026. 623 carry #+KEYWORDS headers; 131 of those are placeholder text or empty. The remaining 492 use free-text, author-chosen terms with no normalization rules.

The existing vocabulary is a folksonomy: terms chosen at authoring time, never reviewed against prior usage. The predictable pathologies are all present:

| Symptom | Evidence |
|---+---|
| Explosive unique count | 3,406 unique terms across 492 docs |
| Power-law singleton mass | ~2,777 terms appear exactly once (81%) |
| Near-duplicate fragmentation | org-mode (22) vs org mode (4); agentic system (8) vs agentic systems (2) vs agentic-systems (2) vs agentic ai (2) vs agentic-ai (2) |
| Sentence-length "keywords" | "moving from angular 1 to 4", "best practices for jquery performance" |
| Year-as-keyword | 1973, 2011, 2024; redundant with #+DATE, pollutes chip display |
| Too-broad single terms | clojure on 61 docs, javascript on 35; chips with too many hits |
| Format pollution | sessions, keynote, index, welcome are structural, not topical |

The search engine (pocket-es) surfaces keywords as clickable chips. A chip that resolves to 61 documents is noise. A chip that resolves to one is isolation. The operational goal: chips that connect 2-25 documents each.
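
The 2-25 target can be checked mechanically from per-keyword document counts. The following is a minimal sketch, not the pocket-es implementation: the function name, the labels, and the in-memory input shape are all hypothetical.

```python
from collections import Counter

# Thresholds taken from the operational goal: a useful chip links 2-25 documents.
MIN_DOCS, MAX_DOCS = 2, 25

def classify_chips(doc_keywords):
    """Label each keyword 'isolated', 'useful', or 'noisy' by document count.

    doc_keywords: dict of document path -> iterable of keyword strings
    (a hypothetical shape; the real index may store this differently).
    """
    counts = Counter()
    for keywords in doc_keywords.values():
        # Count a keyword once per document, even if the header repeats it.
        counts.update(set(keywords))
    labels = {}
    for term, n in counts.items():
        if n < MIN_DOCS:
            labels[term] = "isolated"
        elif n > MAX_DOCS:
            labels[term] = "noisy"
        else:
            labels[term] = "useful"
    return labels
```

Run over the current corpus, a check of this kind would presumably flag clojure (61 documents) as noisy and each singleton term as isolated.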

This document defines a controlled vocabulary to replace the current folksonomy. It is deliberately v1 — simple, practical, retroactively applicable to 2012-era jQuery notes and 2026-era agent architecture papers.

2. Related Work

2.1. Gwern.net

Gwern uses a two-tier approach: a flat tag system (~200 maintained terms) combined with a manually maintained Tags metadata field per essay. Tags are topic-oriented, not faceted. He avoids format tags entirely — an essay about machine learning gets machine-learning, not essay or long-form. The discipline is editorial: Gwern writes 1-3 tags per essay, pruning aggressively. This produces clean chips but requires constant manual curation as the corpus grows.

Key lesson: fewer tags per document forces prioritization. Six tags per document encourages tag accumulation without semantic work.

2.2. Simon Willison's blog

Willison uses a flat tag taxonomy (~500-600 terms) with a tag-management interface (tag rename, tag merge). He does not distinguish facets. The result is browsable-by-tag but with moderate fragmentation: llm, llms, large-language-models coexist. His corpus is larger (5,000+ posts) and the tag interface provides a compensating mechanism for users.

Key lesson: flat tags work at scale only if the author has a merge interface and uses it. Without merging, fragmentation compounds annually.

2.3. ACM CCS and IEEE Keyword Schemes

The ACM Computing Classification System is hierarchical (3-4 levels, for example Computing > Theory > Algorithms) and faceted. IEEE uses free-text index terms plus a controlled vocabulary per venue. Both are designed for cross-author consistency in a shared corpus; the structural problem they solve does not exist for a single-author personal site.

Key lesson: hierarchical classification encodes disciplinary consensus that a personal archive does not have and does not need. The useful borrowing from library science is the facet concept, not the hierarchy depth.

2.4. Library Science: Faceted Classification

Ranganathan's PMEST facets (Personality, Matter, Energy, Space, Time) from 1933 are the theoretical ancestor of modern faceted search. The principle: a document's classification should answer several orthogonal questions (what domain, what format, what time period) independently, not encode a single hierarchical path.

Faceted classification is the right model for a personal archive because:

  1. Documents span many domains simultaneously (a conference note about Clojure at a 2019 event touches language + event + year)
  2. A single path (conferences > 2019 > clojure) requires choosing which dimension is primary; facets avoid that choice
  3. Facets compose: filtering on lang:clojure AND format:conference works without encoding their intersection as a single term
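
Point 3 can be illustrated with plain set operations. This is a sketch under assumed data: the document paths and tag assignments below are invented for illustration, and the function names are hypothetical.

```python
def matches(doc_tags, required):
    """True when a document carries every required facet:term tag."""
    return set(required) <= set(doc_tags)

def filter_docs(docs, *required):
    """Return paths whose tag sets contain all required tags (logical AND)."""
    return sorted(path for path, tags in docs.items() if matches(tags, required))

# Hypothetical documents; paths and tags are illustrative only.
DOCS = {
    "events/clojure-conj.org": {"lang:clojure", "format:conference"},
    "research/clojure-concurrency.org": {"lang:clojure", "format:research"},
    "events/racketcon.org": {"lang:scheme", "format:conference"},
}

print(filter_docs(DOCS, "lang:clojure", "format:conference"))
# prints ['events/clojure-conj.org']
```

No term such as clojure-conference is needed; the intersection falls out of two independent tags.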

2.5. Folksonomies

Vander Wal's original folksonomy analysis (2004) and Golder & Huberman's empirical study (2006) both observed that unconstrained social tagging produces power-law distributions: a few tags used very frequently, a long tail of tags used once. This is the current state of wal.sh's vocabulary.

The convergence approach (see keyword-vocabulary-convergence) iterates a folksonomy toward a controlled vocabulary's properties without discarding accumulated tagging work. The v1 vocabulary defines the target state that convergence is trying to reach.

3. Design Decisions

3.1. Flat tags over hierarchy

wal.sh does not need a 4-level hierarchy. The corpus is one author's research across ~14 years. A document about Clojure concurrency is not ambiguous enough to require a path such as Computer Science > Programming Languages > Functional > Clojure > Concurrency. The facet structure provides orthogonal dimensions without encoding depth.

3.2. Controlled terms only in #+FILETAGS; free text stays in #+KEYWORDS

org-mode has two separate metadata mechanisms:

  • #+KEYWORDS (string, comma-separated): processed by org-publish into HTML metadata, indexed by pocket-es for BM25 search, displayed as chips. Free text is acceptable here for searchability.
  • #+FILETAGS (colon-delimited tags): org-mode's native tagging system, inherited by every headline in the file, visible in org-agenda, used by tag-based queries. Controlled vocabulary terms belong here.

V1 uses both with different contracts:

| Field | Content | Contract |
|---+---+---|
| #+FILETAGS | Controlled vocabulary terms only | Must be in the v1 term list; validated at build time |
| #+KEYWORDS | Free-text, descriptive, searchable | No validation; normalized post-hoc via convergence |

This means the search chips (pocket-es) continue working from #+KEYWORDS as today. The #+FILETAGS controlled vocabulary enables org-agenda queries, org-roam graph navigation, and future structured filtering.
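
A sketch of how a build script might read the two fields separately. One caveat drives the main assumption here: org-mode itself treats ':' as the tag separator, so the text does not settle how facet:term pairs are written on disk. This sketch assumes whitespace-separated tokens in #+FILETAGS. The function name is hypothetical.

```python
import re

HEADER_RE = re.compile(r"^#\+(KEYWORDS|FILETAGS):[ \t]*(.*)$", re.IGNORECASE)

def parse_headers(text):
    """Extract free-text keywords and controlled tags from org file text.

    Assumption: #+KEYWORDS is comma-separated, and #+FILETAGS holds
    whitespace-separated facet:term tokens.
    """
    keywords, filetags = [], []
    for line in text.splitlines():
        m = HEADER_RE.match(line.strip())
        if not m:
            continue
        name, value = m.group(1).upper(), m.group(2)
        if name == "KEYWORDS":
            keywords += [k.strip() for k in value.split(",") if k.strip()]
        else:
            filetags += value.split()
    return keywords, filetags
```

Keeping the two lists apart mirrors the two contracts: the first goes to search as-is, the second goes to validation.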

3.3. No year tags

Year is in #+DATE. The pocket-es indexer already extracts date. A year tag duplicates information already in the document and creates maintenance drift (a note dated 2019 with a 2024 tag because it was updated). The date field is the canonical year authority.

3.4. No location singletons in controlled vocabulary

Location terms (boston, cascais, portland, amsterdam) are valid search keywords in #+KEYWORDS and are useful in pocket-es BM25 results, but they do not belong in the controlled vocabulary. They are specificity identifiers, not category terms. For example, boston is a valid #+KEYWORDS term; it is not a valid #+FILETAGS controlled term.

3.5. Format facet replaces structural keywords

sessions, keynote, index, welcome, overview are currently used as keywords but encode document structure, not topic. The format facet captures these cleanly (format:conference, format:index) and removes them from the keyword noise.

4. V1 Controlled Vocabulary

Six facets. 51 terms. The design principle: a term earns its place in the controlled vocabulary by appearing in at least 3 documents AND being irreducible (cannot be expressed by combining other controlled terms).

4.1. Facet 1: lang (Programming Language)

Used when the primary substance of the document is about a specific language as a language — its semantics, idioms, ecosystem, or community.

Do NOT use for documents that happen to contain code in a language. A note about database indexing that uses Python examples is not lang:python; it is domain:databases. Use lang when the language itself is the subject.

| Term | Covers | Current keyword equivalents |
|---+---+---|
| lang:clojure | Clojure, ClojureScript, clojure.spec, EDN | clojure (61), clojurescript (21) |
| lang:scheme | Scheme, Guile, Racket, Chez, Gambit | scheme (19), guile (10), racket (11) |
| lang:lisp | Common Lisp, Emacs Lisp, historical Lisp | lisp (17), elisp (6) |
| lang:python | Python 2, Python 3, CPython | python (31) |
| lang:javascript | JS, ES6+, CoffeeScript, Node.js | javascript (35), typescript (14) |
| lang:typescript | TypeScript specifically when type system is the topic | typescript (14) |
| lang:rust | Rust | rust (8) |
| lang:haskell | Haskell, PureScript | haskell (5) |
| lang:java | Java, Kotlin, Scala, JVM languages | jvm (4), scala (2) |
| lang:go | Go | gophercon area |
| lang:cpp | C, C++, systems languages | scattered |
| lang:sql | SQL, relational query languages | postgresql (4), datomic (7) |

Normalization note: lang:scheme absorbs racket (which is a Scheme descendant), guile, racketcon. When the conference itself is the topic (not Racket as a language), use format:conference instead.

4.2. Facet 2: domain (Subject Domain)

The conceptual territory the document inhabits. Most documents get 1-3 domain tags.

| Term | Covers | Current keyword equivalents |
|---+---+---|
| domain:agents | LLM agents, agentic systems, multi-agent | agent (9), agents (8), agentic system (8), ai agents (9), multi-agent (10) |
| domain:ml | Machine learning, deep learning, neural networks | machine learning (25), deep learning (5), neural network (4) |
| domain:llm | Large language models as objects of study | llm (23), openai (6), anthropic (9), claude (9) |
| domain:formal-methods | TLA+, Lean, Alloy, Coq, model checking, proof | formal verification (7), tla+ (4), lean4 (6), alloy (present) |
| domain:security | Security, privacy, threat modeling, cryptography | security (20), privacy (5) |
| domain:distributed | Distributed systems, consensus, CAP, network protocols | distributed systems (7), networking (9) |
| domain:databases | Databases, query languages, storage, indexing | database (5), datomic (7), postgresql (4) |
| domain:web | Web platform: HTML, CSS, DOM, HTTP, REST | html5 (8), css (9), react (10), graphql (7) |
| domain:infrastructure | Cloud, k8s, containers, CI/CD, monitoring | aws (12), freebsd (10), kubernetes (4), devops (6) |
| domain:algorithms | Algorithms, data structures, complexity, puzzles | algorithms (5), data structures (6), dynamic programming (5) |
| domain:category-theory | Category theory, type theory, abstract algebra | lambda calculus (7), category theory (4) |
| domain:search | Information retrieval, search engines, indexing | search (6), bm25 (5), pocket-es (7) |
| domain:emacs | Emacs, org-mode, Emacs Lisp as tools | emacs (24), org-mode (22), elisp (6) |
| domain:retrocomputing | Historical systems, pre-1990 computing, archaeology | pdp-11 (5), retrocomputing (4), unix v4 (7), 1973 (6) |
| domain:aviation | ADS-B, flight tracking, airspace, aircraft | ads-b (7) |
| domain:fintech | Financial technology, payments, trading | fintech (5) |

4.3. Facet 3: format (Document Format)

What kind of document this is. Orthogonal to domain and language. A conference note and a research essay about the same topic get different format tags.

| Term | Covers |
|---+---|
| format:conference | Conference notes, event proceedings, talk summaries |
| format:research | Research essays, deep-dives, long-form analysis |
| format:tutorial | How-to guides, walkthroughs, setup instructions |
| format:spec | Specifications, contracts, invariant definitions |
| format:weekly | Weekly activity summaries |
| format:index | Index/overview pages (landing pages for a topic area) |
| format:brief | Morning briefs, quick notes, short-form |
| format:experiment | Exploratory notes, proof-of-concept, WIP |

The distinction between format:research and format:tutorial is intent: research documents argue a position or explore a question; tutorials instruct. A document can be both (format:research + format:tutorial is valid).

4.4. Facet 4: era (Time Horizon)

Captures the era a document's content belongs to, independent of authoring date. A 2024 retrospective on 1970s Unix belongs to era:historical, not era:current.

| Term | Covers |
|---+---|
| era:historical | Pre-2000 content, retrocomputing, historical analysis |
| era:foundational | 2000-2015, web 2.0 era, early frameworks, jQuery |
| era:current | 2016-present, modern tooling, current practices |
| era:emerging | Speculative, cutting-edge, forward-looking |

Most documents do not need an era tag. Use it when the era is the point of the document — when "this was how we did it in 2012" or "this is what's coming" is the primary framing.

4.5. Facet 5: project (Site Projects)

wal.sh-specific research threads. These are internal organizational tags, not topical tags.

| Term | Covers |
|---+---|
| project:pocket-es | The site's BM25 search engine |
| project:webring | The wwn webring / bot trap |
| project:ads-b | The ADS-B flight tracking project |
| project:goldberry | The Goldberry frontend system |
| project:agentic-2026 | The 2026 agentic research series |
| project:beads | The beads (bd) issue tracking system |
| project:unix-v4 | The Unix V4 retrocomputing research |

Project tags are used sparingly — only when a document is primarily about the project rather than incidentally using it. The pocket-es spec document gets project:pocket-es; a research note that uses BM25 as one example does not.

4.6. Facet 6: status (Publication Status)

Optional. Use only for documents in non-final states.

| Term | Covers |
|---+---|
| status:draft | Work in progress, not ready for external linking |
| status:stub | Placeholder, minimal content, intended to grow |
| status:deprecated | Superseded by a newer document |
| status:evergreen | Intentionally maintained and kept current |

Most documents should have no status tag — absence implies normal publication state. The status facet exists to distinguish documents that need curation from those that are complete.

5. Validation Rules

A document's #+FILETAGS field is valid when:

  1. Every colon-delimited token matches the pattern facet:term where facet is one of {lang, domain, format, era, project, status} and term is in the term list above for that facet.
  2. At most one era tag per document.
  3. At most one status tag per document.
  4. lang tags are used only when the language is the subject (not just the implementation language of an example).
  5. Minimum: at least one domain tag per published research note. Conference notes require format:conference.
  6. A __KEYWORDS__ placeholder in #+FILETAGS is a lint failure.
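
Rules 1, 2, 3, 5, and 6 are mechanical; rule 4 needs editorial judgment and is not attempted here. The sketch below is one possible encoding, not the project's validator. The term list is abbreviated, the function name and the is_research / is_conference flags are hypothetical, and tokens are assumed to arrive already split.

```python
# Abbreviated term list for illustration; the full list is in section 4.
VOCAB = {
    "lang": {"clojure", "scheme", "python"},
    "domain": {"agents", "databases", "search"},
    "format": {"conference", "research"},
    "era": {"historical", "foundational", "current", "emerging"},
    "project": {"pocket-es"},
    "status": {"draft", "stub", "deprecated", "evergreen"},
}
SINGLE_VALUED = ("era", "status")  # rules 2 and 3

def validate_filetags(tokens, is_research=False, is_conference=False):
    """Return a list of rule violations; an empty list means valid."""
    errors, seen = [], {}
    for token in tokens:
        if token == "__KEYWORDS__":  # rule 6
            errors.append("placeholder __KEYWORDS__ present")
            continue
        facet, sep, term = token.partition(":")
        if not sep or term not in VOCAB.get(facet, ()):  # rule 1
            errors.append(f"unknown term: {token}")
            continue
        seen.setdefault(facet, []).append(term)
    for facet in SINGLE_VALUED:
        if len(seen.get(facet, [])) > 1:
            errors.append(f"more than one {facet} tag")
    if is_research and "domain" not in seen:  # rule 5, first half
        errors.append("research note lacks a domain tag")
    if is_conference and "conference" not in seen.get("format", []):  # rule 5, second half
        errors.append("conference note lacks format:conference")
    return errors
```

How a file is recognized as a research note or a conference note is left to the caller, since the rules above do not define it.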

5.1. What does NOT go in #+FILETAGS

  • Location names (boston, berlin, portland)
  • Year numbers (2019, 2024) — use #+DATE
  • Proper nouns for specific events (clojure-conj-2023, racketcon-2024)
  • Person names
  • Library/framework names as topics (reagent, redux, langgraph) unless the framework's design is the subject
  • Free-text phrases ("best practices for X")

These all belong in #+KEYWORDS, where they contribute to BM25 search without polluting the controlled vocabulary.

6. Migration Strategy

The corpus has three tiers with different migration costs:

6.1. Tier 1: 131 documents without keywords (zero-cost baseline)

These need both #+KEYWORDS and #+FILETAGS added. Priority order:

  1. Conference notes (format:conference is mechanical to assign)
  2. Research notes without any metadata
  3. Index pages (format:index is mechanical)

Suggested tooling: a Babashka script that reads #+TITLE and #+DESCRIPTION, calls the Ollama embedding API (nomic-embed-text:v1.5 at 192.168.86.22:11434), finds the 5 nearest tagged documents, and proposes their #+FILETAGS as candidates. The author approves or adjusts.
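
The suggested tool is a Babashka script; the nearest-neighbour step is sketched here in Python only to show the logic. Embeddings are assumed to be precomputed, so no embedding API is called. Ranking candidates by vote count is an addition for illustration and is not stated in the text. All names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propose_tags(query_vec, tagged_docs, k=5):
    """Propose tags carried by the k nearest already-tagged documents.

    tagged_docs: list of (embedding, tags) pairs, embeddings precomputed.
    """
    nearest = sorted(tagged_docs, key=lambda d: cosine(query_vec, d[0]), reverse=True)[:k]
    votes = Counter(tag for _, tags in nearest for tag in tags)
    # Most frequent first; ties broken alphabetically for stable output.
    return [tag for tag, _ in sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))]
```

The output is a candidate list for the author to approve or adjust, not an assignment.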

6.2. Tier 2: 492 documents with free-text keywords (convergence path)

Do NOT bulk-replace existing #+KEYWORDS. They contain searchable signal. Instead, ADD #+FILETAGS alongside the existing #+KEYWORDS.

Phase 1 (mechanical, 1-2 days of scripted work):

  • All files under site/events/ get format:conference in #+FILETAGS
  • All files under site/activity-summary/ get format:weekly in #+FILETAGS
  • All files with #+KEYWORDS: pocket-es get project:pocket-es
  • All files with #+KEYWORDS containing clojure get lang:clojure

This phase is pure automation — no editorial judgment required.
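
The four Phase 1 rules reduce to a pure function of path and keywords. This is a sketch with two assumptions: paths are given relative to the repository root, and "containing clojure" is read as a substring match on any keyword. The function name is hypothetical.

```python
from pathlib import PurePosixPath

def phase1_tags(path, keywords):
    """Return the FILETAGS that Phase 1 assigns from path and keywords alone."""
    parts = PurePosixPath(path).parts
    lowered = {k.strip().lower() for k in keywords}
    tags = []
    if parts[:2] == ("site", "events"):
        tags.append("format:conference")
    if parts[:2] == ("site", "activity-summary"):
        tags.append("format:weekly")
    if "pocket-es" in lowered:
        tags.append("project:pocket-es")
    # Assumption: substring match, so "clojurescript" also qualifies.
    if any("clojure" in k for k in lowered):
        tags.append("lang:clojure")
    return tags
```

Because the function has no side effects, its proposals can be reviewed as a diff before any file is rewritten.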

Phase 2 (editorial, high-value documents first):

  • The 119 terms appearing 5+ times in #+KEYWORDS are the consolidation targets. For each, map to the controlled vocabulary and add the corresponding #+FILETAGS to documents using that term.
  • Priority: the 20 highest-frequency terms cover a large fraction of docs.

Phase 3 (convergence cleanup in #+KEYWORDS):

  • After #+FILETAGS is populated, run the convergence algorithm (see 2026-keyword-vocabulary-convergence) to normalize #+KEYWORDS.
  • Merge near-duplicates: agentic system / agentic systems / agentic-systems → agentic-systems (prefer hyphenated form, consistent with FILETAGS style).
  • Remove year-only tokens from #+KEYWORDS where #+DATE already carries the year.
  • Do NOT remove location keywords — they are valid search terms.
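
A minimal sketch of the Phase 3 cleanup, assuming an explicit merge map rather than the convergence algorithm itself (which is defined elsewhere). Only the agentic-systems merge comes from the text; the function name and the has_date flag are hypothetical.

```python
import re

YEAR_RE = re.compile(r"^(19|20)\d{2}$")

# Explicit merge map; only the agentic-systems merge is named in the text.
MERGES = {
    "agentic system": "agentic-systems",
    "agentic systems": "agentic-systems",
}

def normalize_keywords(keywords, has_date=True):
    """Lowercase, merge known near-duplicates, drop year-only tokens, dedupe."""
    out = []
    for raw in keywords:
        term = " ".join(raw.lower().split())  # lowercase, collapse whitespace
        term = MERGES.get(term, term)
        if has_date and YEAR_RE.match(term):
            continue  # the year is already carried by #+DATE
        if term not in out:
            out.append(term)
    return out
```

Location keywords pass through untouched, and year tokens survive when the file has no #+DATE to fall back on.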

6.3. Tier 3: Normalize #+KEYWORDS casing

The 3,406 unique keyword tokens include mixed-case variants. The convergence algorithm normalizes to lowercase. As a build-time invariant: the check_required_headers.py script (or a new sibling) should warn when #+KEYWORDS contains uppercase tokens, since the BM25 tokenizer lowercases at index time anyway. Storing uppercase is misleading.
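
The casing warning could be as small as the following sketch. The function name is hypothetical, and it assumes the raw #+KEYWORDS value is passed in as a comma-separated string.

```python
def uppercase_tokens(keywords_value):
    """Return comma-separated keyword tokens that contain any uppercase letter."""
    tokens = [t.strip() for t in keywords_value.split(",")]
    return [t for t in tokens if t and t != t.lower()]
```

A lint step would warn, rather than fail, when the returned list is non-empty.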

7. Implementation: Adding Build-Time Validation

The existing scripts/check_required_headers.py validates presence of four headers. Extend it (or create scripts/check_vocabulary.py) to:

check_vocabulary.py:
  For each .org file under site/:
    1. If #+FILETAGS is present:
       - Parse colon-delimited tokens
       - Reject any token not matching facet:term pattern
       - Reject any token not in the v1 term list
       - Warn on missing domain tag for research notes
       - Warn on missing format:conference for files under events/
    2. Emit: OK / WARN / FAIL per file
    3. Exit non-zero if any FAIL
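
One way the outline above might look in Python. This is a sketch under several assumptions: the term set is abbreviated, #+FILETAGS tokens are whitespace-separated, and "research note" is approximated as "file under a research/ directory". The names check_text and main are hypothetical.

```python
import re
from pathlib import Path

# Abbreviated; a real run would load the full v1 term list.
TERMS = {"lang:clojure", "domain:search", "format:conference", "format:research"}
TOKEN_RE = re.compile(r"^[a-z]+:[a-z0-9+-]+$")
FILETAGS_RE = re.compile(r"^#\+FILETAGS:[ \t]*(.*)$", re.IGNORECASE | re.MULTILINE)

def check_text(rel_path, text):
    """Return (status, messages) for one file; status is OK, WARN, or FAIL."""
    m = FILETAGS_RE.search(text)
    if not m:
        return "OK", []  # step 1 only applies when #+FILETAGS is present
    tokens = m.group(1).split()
    fails = [f"bad token: {t}" for t in tokens
             if not TOKEN_RE.match(t) or t not in TERMS]
    warns = []
    parts = Path(rel_path).parts
    if "research" in parts and not any(t.startswith("domain:") for t in tokens):
        warns.append("research note without domain tag")
    if "events" in parts and "format:conference" not in tokens:
        warns.append("event file without format:conference")
    if fails:
        return "FAIL", fails + warns
    return ("WARN", warns) if warns else ("OK", [])

def main(root="site"):
    """Check every .org file under root; return 1 if any file FAILs, else 0."""
    failed = False
    for path in sorted(Path(root).rglob("*.org")):
        status, messages = check_text(path.relative_to(root),
                                      path.read_text(encoding="utf-8"))
        print(f"{status} {path}" + (": " + "; ".join(messages) if messages else ""))
        failed = failed or status == "FAIL"
    return 1 if failed else 0
```

A command-line wrapper would pass the return value of main() to sys.exit so that the lint target fails on any FAIL.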

Add as a dependency of the lint target in Makefile:

lint: check-well-known check-required-headers check-vocabulary

The vocabulary term list lives in a single source of truth: site/research/controlled-vocabulary-v1.org (this file) plus a machine-readable extraction. The simplest machine-readable form: a YAML comment block in this file, or a separate scripts/vocabulary_v1.py module that imports as a dict.
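
If the module route is taken, scripts/vocabulary_v1.py might look like the sketch below. The terms are transcribed from the facet tables in section 4; the names VOCABULARY_V1, ALL_TERMS, and is_controlled are hypothetical.

```python
# Hypothetical contents for scripts/vocabulary_v1.py, transcribed from section 4.
VOCABULARY_V1 = {
    "lang": ["clojure", "scheme", "lisp", "python", "javascript", "typescript",
             "rust", "haskell", "java", "go", "cpp", "sql"],
    "domain": ["agents", "ml", "llm", "formal-methods", "security", "distributed",
               "databases", "web", "infrastructure", "algorithms", "category-theory",
               "search", "emacs", "retrocomputing", "aviation", "fintech"],
    "format": ["conference", "research", "tutorial", "spec", "weekly", "index",
               "brief", "experiment"],
    "era": ["historical", "foundational", "current", "emerging"],
    "project": ["pocket-es", "webring", "ads-b", "goldberry", "agentic-2026",
                "beads", "unix-v4"],
    "status": ["draft", "stub", "deprecated", "evergreen"],
}

# Flattened facet:term strings for constant-time membership checks.
ALL_TERMS = frozenset(f"{facet}:{term}"
                      for facet, terms in VOCABULARY_V1.items()
                      for term in terms)

def is_controlled(token):
    """True when token is a facet:term pair in the v1 vocabulary."""
    return token in ALL_TERMS
```

Keeping the dict keyed by facet preserves the table structure of this file, while the flattened set is what a validator would actually query.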

8. Term Equivalency Table

For migration: what current free-text #+KEYWORDS terms map to which controlled vocabulary #+FILETAGS terms.

| Current #+KEYWORDS term(s) | Maps to #+FILETAGS term |
|---+---|
| clojure, clojurescript | lang:clojure |
| scheme, guile, racket | lang:scheme |
| lisp, elisp | lang:lisp (if language is the subject) |
| python | lang:python |
| javascript, js, node.js | lang:javascript |
| typescript | lang:typescript |
| rust | lang:rust |
| haskell | lang:haskell |
| java, jvm, scala | lang:java |
| agent, agents, agentic system, ai agents, multi-agent | domain:agents |
| machine learning, deep learning, neural network, scikit-learn | domain:ml |
| llm, claude, openai, anthropic, gemini, ollama | domain:llm |
| formal verification, tla+, lean4, alloy, model checking | domain:formal-methods |
| security, privacy, cybersecurity | domain:security |
| distributed systems, networking, consensus | domain:distributed |
| database, datomic, postgresql, sql | domain:databases |
| html5, css, react, graphql, jquery, angular | domain:web |
| aws, freebsd, kubernetes, devops, ci/cd, terraform | domain:infrastructure |
| algorithm, data structures, dynamic programming | domain:algorithms |
| category theory, lambda calculus, type theory | domain:category-theory |
| search, bm25, pocket-es, information retrieval | domain:search |
| emacs, org-mode, org-babel | domain:emacs |
| pdp-11, retrocomputing, unix v4, 1973 | domain:retrocomputing |
| ads-b, sdr, aviation, dump1090 | domain:aviation |
| fintech, payments, trading | domain:fintech |
| (file under events/) | format:conference |
| (file under activity-summary/) | format:weekly |
| spec, specification, contract | format:spec |
| weekly-summary | format:weekly |
| morning-brief | format:brief |
| pocket-es (as project) | project:pocket-es |
| webring, wwn | project:webring |
| goldberry | project:goldberry |
| beads | project:beads |
| ads-b (as project) | project:ads-b |
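
For scripted migration, the table can be carried as a lookup. The sketch below transcribes only a few rows and the function name is hypothetical. Rows that need judgment ("if language is the subject", "as project") are deliberately left out, so the output is a suggestion list rather than a final assignment.

```python
# Partial transcription of the equivalency table; extend with remaining rows as needed.
KEYWORD_TO_TAG = {
    "clojure": "lang:clojure", "clojurescript": "lang:clojure",
    "scheme": "lang:scheme", "guile": "lang:scheme", "racket": "lang:scheme",
    "llm": "domain:llm", "claude": "domain:llm",
    "bm25": "domain:search", "search": "domain:search",
    "emacs": "domain:emacs", "org-mode": "domain:emacs",
}

def suggest_filetags(keywords):
    """Return controlled tags implied by free-text keywords, deduplicated in order."""
    tags = []
    for kw in keywords:
        tag = KEYWORD_TO_TAG.get(kw.strip().lower())
        if tag and tag not in tags:
            tags.append(tag)
    return tags
```

Keywords with no mapping, such as location names, are ignored and stay in #+KEYWORDS only.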

9. When to Add a New Term

A new term earns entry into the controlled vocabulary when:

  1. At least 3 existing documents would use it immediately (retroactive applicability test).
  2. It cannot be expressed by combining two existing terms.
  3. It is orthogonal to existing facets — it adds a dimension of discrimination not already covered.
  4. It survives a "would I filter on this?" test: if someone searching wal.sh asked to filter by this property, would the results be a coherent set of documents?

New terms that fail these tests belong in #+KEYWORDS (free text) but not in #+FILETAGS (controlled vocabulary). The controlled vocabulary grows slowly, by consensus with prior content, not by authoring-time impulse.

9.1. Process for adding a new term

  1. Open this file and add the term in the appropriate facet table with its coverage description.
  2. Update scripts/vocabulary_v1.py (the machine-readable list).
  3. Tag the documents that prompted the addition.
  4. Run gmake lint to verify the new term validates correctly.
  5. Commit with a message such as: feat: add lang:erlang to controlled vocabulary v1.

10. Invariants Summary

These are the machine-checkable properties that define a healthy vocabulary state for wal.sh:

| Invariant | Target | Current State |
|---+---+---|
| Every #+FILETAGS token matches a controlled term | 100% | 0% (FILETAGS barely used) |
| Every published research note has ≥1 domain tag | 100% | ~0% |
| Every event file has format:conference | 100% | ~5% |
| #+KEYWORDS singleton ratio | <15% | ~81% |
| #+KEYWORDS near-duplicate pairs | <5 | ~6+ confirmed |
| Documents with no #+KEYWORDS | <5% | ~21% |
| Controlled vocabulary term count | <80 | 51 (v1) |
| Maximum #+FILETAGS tokens per doc | ≤6 | N/A |

The singleton ratio target of <15% (from the convergence document's <10% criterion, relaxed slightly for the steady-state with ongoing content addition) should be measurable by the pocket-es indexer or the convergence Babashka script.
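
The singleton ratio is the share of unique terms that occur in exactly one document. A minimal sketch, assuming each document's keywords are available as a list and that terms are compared case-insensitively; the function name is hypothetical.

```python
from collections import Counter

def singleton_ratio(doc_keywords):
    """Fraction of unique keyword terms that appear in exactly one document.

    doc_keywords: iterable of per-document keyword lists.
    """
    counts = Counter()
    for keywords in doc_keywords:
        # Count each term once per document, compared in lowercase.
        counts.update({k.strip().lower() for k in keywords if k.strip()})
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)
```

With the figures from section 1, 2,777 singletons out of 3,406 unique terms gives roughly 0.815, against a target below 0.15.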

11. What This Vocabulary Does Not Solve

Be explicit about scope limits:

  • It does not eliminate #+KEYWORDS explosion. Free-text keywords will continue to grow. The convergence algorithm manages them; the controlled vocabulary does not replace them.
  • It does not retroactively classify 14 years of notes automatically. Phase 2 migration requires editorial judgment for ~272 documents (the research notes). Scripted Phase 1 handles the rest mechanically.
  • It does not handle cross-facet compound concepts. A document about "Rust-based formal verification tools" is lang:rust + domain:formal-methods. The vocabulary does not create lang:rust:formal-methods compounds.
  • It does not version over time. When the vocabulary needs a v2 (new facets, retired terms), that is a new document and a migration pass, not an in-place update to this file.

12. Implementation Priority

Ordered by value-to-effort ratio:

  1. Add scripts/vocabulary_v1.py with the machine-readable term list (30 min)
  2. Create scripts/check_vocabulary.py (or extend scripts/check_required_headers.py) for FILETAGS validation (2 hours)
  3. Add the check-vocabulary target to the Makefile and make it a dependency of lint (15 min)
  4. Script Phase 1 migration: events/ → format:conference, activity-summary/ → format:weekly (1 hour)
  5. Manually tag the 20 most-linked research notes with domain tags (2-3 hours)
  6. Run #+KEYWORDS normalization (convergence pass 1) on the full corpus (4 hours + review)