Controlled Vocabulary v1 for wal.sh
Table of Contents
- 1. Problem Statement
- 2. Related Work
- 3. Design Decisions
- 4. V1 Controlled Vocabulary
- 5. Validation Rules
- 6. Migration Strategy
- 7. Implementation: Adding Build-Time Validation
- 8. Term Equivalency Table
- 9. When to Add a New Term
- 10. Invariants Summary
- 11. What This Vocabulary Does Not Solve
- 12. Implementation Priority
1. Problem Statement
wal.sh has 708 org files as of 2026. 505 carry #+KEYWORDS headers; 131 of
those are placeholder text or empty. The remaining 492 use free-text, author-
chosen terms with no normalization rules.
The existing vocabulary is a folksonomy: terms chosen at authoring time, never reviewed against prior usage. The predictable pathologies are all present:
| Symptom | Evidence |
|---|---|
| Explosive unique count | 3,406 unique terms across 492 docs |
| Power-law singleton mass | ~2,777 terms appear exactly once (81%) |
| Near-duplicate fragmentation | org-mode (22) vs org mode (4); agentic system (8) vs agentic systems (2) vs agentic-systems (2) vs agentic ai (2) vs agentic-ai (2) |
| Sentence-length "keywords" | "moving from angular 1 to 4", "best practices for jquery performance" |
| Year-as-keyword | 1973, 2011, 2024: redundant with #+DATE, pollutes chip display |
| Too-broad single terms | clojure on 61 docs, javascript on 35 — chips with too many hits |
| Format pollution | sessions, keynote, index, welcome are structural, not topical |
The search engine (pocket-es) surfaces keywords as clickable chips. A chip that resolves to 61 documents is noise. A chip that resolves to one is isolation. The operational goal: chips that connect 2-25 documents each.
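The 2-25 target can be checked mechanically once keywords are extracted per document. The following is a minimal sketch, not the pocket-es implementation: the function name, the three labels, and the sample corpus are hypothetical, and only the thresholds come from the goal stated above.

```python
from collections import Counter

# Thresholds from the stated goal: a chip should connect 2-25 documents.
MIN_DOCS, MAX_DOCS = 2, 25

def classify_chips(doc_keywords):
    """Label each keyword 'isolated', 'useful', or 'too-broad' by document count."""
    counts = Counter()
    for keywords in doc_keywords.values():
        counts.update(set(keywords))  # count a keyword once per document
    labels = {}
    for term, n in counts.items():
        if n < MIN_DOCS:
            labels[term] = "isolated"
        elif n > MAX_DOCS:
            labels[term] = "too-broad"
        else:
            labels[term] = "useful"
    return labels

# Hypothetical three-document corpus, for illustration only.
sample = {
    "a.org": ["clojure", "datomic"],
    "b.org": ["clojure", "emacs"],
    "c.org": ["clojure"],
}
chips = classify_chips(sample)
print(chips)
```

On the real corpus, clojure (61 documents) would land in the too-broad bucket and the roughly 2,777 singletons in the isolated bucket.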
This document defines a controlled vocabulary to replace the current folksonomy. It is deliberately v1 — simple, practical, retroactively applicable to 2012-era jQuery notes and 2026-era agent architecture papers.
2. Related Work
2.1. Gwern.net
Gwern uses a two-tier approach: a flat tag system (~200 maintained terms)
combined with a manually maintained Tags metadata field per essay. Tags
are topic-oriented, not faceted. He avoids format tags entirely — an essay
about machine learning gets machine-learning, not essay or long-form.
The discipline is editorial: Gwern writes 1-3 tags per essay, pruning
aggressively. This produces clean chips but requires constant manual curation
as the corpus grows.
Key lesson: fewer tags per document forces prioritization. Six tags per document encourages tag accumulation without semantic work.
2.2. Simon Willison's blog
Willison uses a flat tag taxonomy (~500-600 terms) with a tag-management
interface (tag rename, tag merge). He does not distinguish facets. The result
is browsable-by-tag but with moderate fragmentation: llm, llms,
large-language-models coexist. His corpus is larger (5,000+ posts) and the
tag interface provides a compensating mechanism for users.
Key lesson: flat tags work at scale only if the author has a merge interface and uses it. Without merging, fragmentation compounds annually.
2.3. ACM CCS and IEEE Keyword Schemes
The ACM Computing Classification System is hierarchical (3-4 levels) and faceted (Computing > Theory > Algorithms). IEEE uses free-text index terms plus a controlled vocabulary per venue. Both are designed for cross-author consistency in a shared corpus — the structural problem they solve does not exist for a single-author personal site.
Key lesson: hierarchical classification encodes disciplinary consensus that a personal archive does not have and does not need. The useful borrowing from library science is the facet concept, not the hierarchy depth.
2.4. Library Science: Faceted Classification
Ranganathan's PMEST facets (Personality, Matter, Energy, Space, Time) from 1933 are the theoretical ancestor of modern faceted search. The principle: a document's classification should answer several orthogonal questions (what domain, what format, what time period) independently, not encode a single hierarchical path.
Faceted classification is the right model for a personal archive because:
- Documents span many domains simultaneously (a conference note about Clojure at a 2019 event touches language + event + year)
- A single path (conferences > 2019 > clojure) requires choosing which dimension is primary; facets avoid that choice
- Facets compose: filtering on lang:clojure AND format:conference works without encoding their intersection as a single term
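Facet composition reduces to a subset test on each document's tag set. A minimal sketch, assuming tags are already parsed into sets; the function name and the file paths are hypothetical:

```python
def filter_docs(doc_tags, *required):
    """Return the documents whose tag set contains every required facet:term."""
    wanted = set(required)
    return sorted(doc for doc, tags in doc_tags.items() if wanted <= set(tags))

# Hypothetical tagged documents, for illustration only.
doc_tags = {
    "events/conj-2019.org": {"lang:clojure", "format:conference"},
    "research/core-async.org": {"lang:clojure", "format:research"},
    "events/gophercon.org": {"lang:go", "format:conference"},
}
hits = filter_docs(doc_tags, "lang:clojure", "format:conference")
print(hits)
```

No term such as clojure-conference is needed: the intersection is computed at query time from two independent tags.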
2.5. Folksonomies
Vander Wal's original folksonomy analysis (2004) and Golder & Huberman's empirical study (2006) both observed that unconstrained social tagging produces power-law distributions: a few tags used very frequently, a long tail of tags used once. This is the current state of wal.sh's vocabulary.
The convergence approach (see keyword-vocabulary-convergence) iterates a folksonomy toward a controlled vocabulary's properties without discarding accumulated tagging work. V1 vocabulary defines the target state that convergence is trying to reach.
3. Design Decisions
3.1. Flat tags over hierarchy
wal.sh does not need 4-level hierarchy. The corpus is one author's research
across ~14 years. A document about Clojure concurrency is not ambiguous
enough to require Computer Science > Programming Languages > Functional >
Clojure > Concurrency. The facet structure provides orthogonal dimensions
without encoding depth.
3.2. Controlled terms only in #+FILETAGS; free text stays in #+KEYWORDS
org-mode has two separate metadata mechanisms:
- #+KEYWORDS (string, comma-separated): processed by org-publish into HTML metadata, indexed by pocket-es for BM25 search, displayed as chips. Free text is acceptable here for searchability.
- #+FILETAGS (colon-delimited tags): org-mode's native tagging system, inherits in outlines, visible in org-agenda, used by tag-based queries. Controlled vocabulary terms belong here.
V1 uses both with different contracts:
| Field | Content | Contract |
|---|---|---|
| #+FILETAGS | Controlled vocabulary terms only | Must be in the v1 term list; validated at build time |
| #+KEYWORDS | Free-text, descriptive, searchable | No validation; normalized post-hoc via convergence |
This means the search chips (pocket-es) continue working from #+KEYWORDS
as today. The #+FILETAGS controlled vocabulary enables org-agenda queries,
org-roam graph navigation, and future structured filtering.
3.3. No year tags
Year is in #+DATE. The pocket-es indexer already extracts date. A year
tag duplicates information already in the document and creates maintenance
drift (a note dated 2019 with a 2024 tag because it was updated). The
date field is the canonical year authority.
3.4. No location singletons in controlled vocabulary
Location terms (boston, cascais, portland, amsterdam) are valid
search keywords in #+KEYWORDS and are useful in pocket-es BM25 results,
but they do not belong in the controlled vocabulary. They are specificity
identifiers, not category terms. For example, boston is a valid
#+KEYWORDS term but not a valid #+FILETAGS controlled term.
3.5. Format facet replaces structural keywords
sessions, keynote, index, welcome, overview are currently used
as keywords but encode document structure, not topic. The format facet
captures these cleanly (format:conference, format:index) and removes
them from the keyword noise.
4. V1 Controlled Vocabulary
Six facets. The tables below list 51 terms (47 excluding the optional status facet). The design principle: a term earns its place in the controlled vocabulary by appearing in at least 3 documents AND being irreducible (cannot be expressed by combining other controlled terms).
4.1. Facet 1: lang (Programming Language)
Used when the primary substance of the document is about a specific language as a language — its semantics, idioms, ecosystem, or community.
Do NOT use for documents that happen to contain code in a language. A note
about database indexing that uses Python examples is not lang:python; it is
domain:databases. Use lang when the language itself is the subject.
| Term | Covers | Current keyword equivalents |
|---|---|---|
| lang:clojure | Clojure, ClojureScript, clojure.spec, EDN | clojure (61), clojurescript (21) |
| lang:scheme | Scheme, Guile, Racket, Chez, Gambit | scheme (19), guile (10), racket (11) |
| lang:lisp | Common Lisp, Emacs Lisp, historical Lisp | lisp (17), elisp (6) |
| lang:python | Python 2, Python 3, CPython | python (31) |
| lang:javascript | JS, ES6+, CoffeeScript, Node.js | javascript (35), typescript (14) |
| lang:typescript | TypeScript specifically when type system is the topic | typescript (14) |
| lang:rust | Rust | rust (8) |
| lang:haskell | Haskell, PureScript | haskell (5) |
| lang:java | Java, Kotlin, Scala, JVM languages | jvm (4), scala (2) |
| lang:go | Go | gophercon area |
| lang:cpp | C, C++, systems languages | scattered |
| lang:sql | SQL, relational query languages | postgresql (4), datomic (7) |
Normalization note: lang:scheme absorbs racket (which is a Scheme
descendant), guile, racketcon. When the conference itself is the topic
(not Racket as a language), use format:conference instead.
4.2. Facet 2: domain (Subject Domain)
The conceptual territory the document inhabits. Most documents get 1-3 domain tags.
| Term | Covers | Current keyword equivalents |
|---|---|---|
| domain:agents | LLM agents, agentic systems, multi-agent | agent (9), agents (8), agentic system (8), ai agents (9), multi-agent (10) |
| domain:ml | Machine learning, deep learning, neural networks | machine learning (25), deep learning (5), neural network (4) |
| domain:llm | Large language models as objects of study | llm (23), openai (6), anthropic (9), claude (9) |
| domain:formal-methods | TLA+, Lean, Alloy, Coq, model checking, proof | formal verification (7), tla+ (4), lean4 (6), alloy (present) |
| domain:security | Security, privacy, threat modeling, cryptography | security (20), privacy (5) |
| domain:distributed | Distributed systems, consensus, CAP, network protocols | distributed systems (7), networking (9) |
| domain:databases | Databases, query languages, storage, indexing | database (5), datomic (7), postgresql (4) |
| domain:web | Web platform: HTML, CSS, DOM, HTTP, REST | html5 (8), css (9), react (10), graphql (7) |
| domain:infrastructure | Cloud, k8s, containers, CI/CD, monitoring | aws (12), freebsd (10), kubernetes (4), devops (6) |
| domain:algorithms | Algorithms, data structures, complexity, puzzles | algorithms (5), data structures (6), dynamic programming (5) |
| domain:category-theory | Category theory, type theory, abstract algebra | lambda calculus (7), category theory (4) |
| domain:search | Information retrieval, search engines, indexing | search (6), bm25 (5), pocket-es (7) |
| domain:emacs | Emacs, org-mode, Emacs Lisp as tools | emacs (24), org-mode (22), elisp (6) |
| domain:retrocomputing | Historical systems, pre-1990 computing, archaeology | pdp-11 (5), retrocomputing (4), unix v4 (7), 1973 (6) |
| domain:aviation | ADS-B, flight tracking, airspace, aircraft | ads-b (7) |
| domain:fintech | Financial technology, payments, trading | fintech (5) |
4.3. Facet 3: format (Document Format)
What kind of document this is. Orthogonal to domain and language. A conference note and a research essay about the same topic get different format tags.
| Term | Covers |
|---|---|
| format:conference | Conference notes, event proceedings, talk summaries |
| format:research | Research essays, deep-dives, long-form analysis |
| format:tutorial | How-to guides, walkthroughs, setup instructions |
| format:spec | Specifications, contracts, invariant definitions |
| format:weekly | Weekly activity summaries |
| format:index | Index/overview pages (landing pages for a topic area) |
| format:brief | Morning briefs, quick notes, short-form |
| format:experiment | Exploratory notes, proof-of-concept, WIP |
The distinction between format:research and format:tutorial is intent:
research documents argue a position or explore a question; tutorials instruct.
A document can be both (format:research + format:tutorial is valid).
4.4. Facet 4: era (Time Horizon)
Captures the era a document's content belongs to, independent of authoring
date. A 2024 retrospective on 1970s Unix belongs to era:historical, not
era:current.
| Term | Covers |
|---|---|
| era:historical | Pre-2000 content, retrocomputing, historical analysis |
| era:foundational | 2000-2015, web 2.0 era, early frameworks, jQuery |
| era:current | 2016-present, modern tooling, current practices |
| era:emerging | Speculative, cutting-edge, forward-looking |
Most documents do not need an era tag. Use it when the era is the point of the document — when "this was how we did it in 2012" or "this is what's coming" is the primary framing.
4.5. Facet 5: project (Site Projects)
wal.sh-specific research threads. These are internal organizational tags, not topical tags.
| Term | Covers |
|---|---|
| project:pocket-es | The site's BM25 search engine |
| project:webring | The wwn webring / bot trap |
| project:ads-b | The ADS-B flight tracking project |
| project:goldberry | The Goldberry frontend system |
| project:agentic-2026 | The 2026 agentic research series |
| project:beads | The beads (bd) issue tracking system |
| project:unix-v4 | The Unix V4 retrocomputing research |
Project tags are used sparingly — only when a document is primarily about the
project rather than incidentally using it. The pocket-es spec document gets
project:pocket-es; a research note that uses BM25 as one example does not.
4.6. Facet 6: status (Publication Status)
Optional. Use only for documents in non-final states.
| Term | Covers |
|---|---|
| status:draft | Work in progress, not ready for external linking |
| status:stub | Placeholder, minimal content, intended to grow |
| status:deprecated | Superseded by a newer document |
| status:evergreen | Intentionally maintained and kept current |
Most documents should have no status tag — absence implies normal publication
state. The status facet exists to distinguish documents that need curation
from those that are complete.
5. Validation Rules
A document's #+FILETAGS field is valid when:
- Every colon-delimited token matches the pattern facet:term where facet is one of {lang, domain, format, era, project, status} and term is in the term list above for that facet.
- At most one era tag per document.
- At most one status tag per document.
- lang tags are used only when the language is the subject (not just the implementation language of an example).
- Minimum: at least one domain tag per published research note. Conference notes require format:conference.
- A __KEYWORDS__ placeholder in #+FILETAGS is a lint failure.
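The machine-checkable subset of these rules can be sketched as one function over an already-parsed token list. This is an illustration under stated assumptions rather than the site's validator: VOCAB holds only a handful of terms from the tables, the function name and error strings are hypothetical, and the "language is the subject" rule is left out because it needs editorial judgment.

```python
import re

# Small subset of the v1 term list, for illustration; the facet tables are authoritative.
VOCAB = {
    "lang": {"clojure", "python"},
    "domain": {"databases", "agents"},
    "format": {"conference", "research"},
    "era": {"historical", "current"},
    "project": {"pocket-es"},
    "status": {"draft", "stub"},
}
TOKEN = re.compile(r"^([a-z]+):([a-z0-9+-]+)$")
SINGLE_VALUED = ("era", "status")  # at most one tag of each per document

def validate_filetags(tokens):
    """Return a list of error strings; an empty list means the tokens are valid."""
    errors = []
    seen = {}
    for tok in tokens:
        if tok == "__KEYWORDS__":
            errors.append("placeholder __KEYWORDS__ in FILETAGS")
            continue
        match = TOKEN.match(tok)
        if not match:
            errors.append(f"not facet:term: {tok}")
            continue
        facet, term = match.groups()
        if term not in VOCAB.get(facet, set()):
            errors.append(f"unknown term: {tok}")
            continue
        seen[facet] = seen.get(facet, 0) + 1
    for facet in SINGLE_VALUED:
        if seen.get(facet, 0) > 1:
            errors.append(f"more than one {facet} tag")
    return errors

print(validate_filetags(["lang:clojure", "era:historical", "era:current", "boston"]))
```

The per-document minimums (a domain tag on research notes, format:conference on conference notes) depend on knowing the document type, so they are better checked where the file path is available.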
5.1. What does NOT go in #+FILETAGS
- Location names (boston, berlin, portland)
- Year numbers (2019, 2024); use #+DATE instead
- Proper nouns for specific events (clojure-conj-2023, racketcon-2024)
- Person names
- Library/framework names as topics (reagent, redux, langgraph) unless the framework's design is the subject
- Free-text phrases ("best practices for X")
These all belong in #+KEYWORDS, where they contribute to BM25 search
without polluting the controlled vocabulary.
6. Migration Strategy
The corpus has three tiers with different migration costs:
6.1. Tier 1: 131 documents without keywords (zero-cost baseline)
These need both #+KEYWORDS and #+FILETAGS added. Priority order:
- Conference notes (format:conference is mechanical to assign)
- Research notes without any metadata
- Index pages (format:index is mechanical)
Suggested tooling: a Babashka script that reads #+TITLE and #+DESCRIPTION,
calls the Ollama embedding API (nomic-embed-text:v1.5 at
192.168.86.22:11434), finds the 5 nearest tagged documents, and proposes
their #+FILETAGS as candidates. The author approves or adjusts.
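The suggested tooling is a Babashka script; the ranking step itself is small enough to sketch in Python for clarity. The sketch below assumes embeddings have already been fetched and are plain lists of floats (no Ollama call is made here), and the function names and toy vectors are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def propose_tags(query_vec, tagged_docs, k=5):
    """tagged_docs is a list of (embedding, tags) pairs.

    Returns candidate tags from the k nearest documents, most-voted first.
    """
    nearest = sorted(tagged_docs, key=lambda d: cosine(query_vec, d[0]), reverse=True)[:k]
    votes = Counter(tag for _, tags in nearest for tag in tags)
    return [tag for tag, _ in votes.most_common()]

# Toy 2-dimensional embeddings, for illustration only.
tagged = [
    ([1.0, 0.0], ["lang:clojure", "format:conference"]),
    ([0.9, 0.1], ["lang:clojure"]),
    ([0.0, 1.0], ["domain:aviation"]),
]
candidates = propose_tags([1.0, 0.05], tagged, k=2)
print(candidates)
```

The output is a proposal list only; approval stays with the author, as described above.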
6.2. Tier 2: 492 documents with free-text keywords (convergence path)
Do NOT bulk-replace existing #+KEYWORDS. They contain searchable signal.
Instead, ADD #+FILETAGS alongside the existing #+KEYWORDS.
Phase 1 (mechanical, 1-2 days of scripted work):
- All files under site/events/ get format:conference in #+FILETAGS
- All files under site/activity-summary/ get format:weekly in #+FILETAGS
- All files with #+KEYWORDS: pocket-es get project:pocket-es
- All files with #+KEYWORDS containing clojure get lang:clojure
This phase is pure automation — no editorial judgment required.
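The four Phase 1 rules translate directly into path and keyword tests. A minimal sketch, assuming keywords arrive as an already-split list and that "containing clojure" means the keyword list includes that exact token; the function name and example path are hypothetical.

```python
def phase1_tags(path, keywords):
    """Propose FILETAGS for one file using only the four mechanical Phase 1 rules."""
    tags = []
    if path.startswith("site/events/"):
        tags.append("format:conference")
    if path.startswith("site/activity-summary/"):
        tags.append("format:weekly")
    lowered = {k.strip().lower() for k in keywords}
    if "pocket-es" in lowered:
        tags.append("project:pocket-es")
    if "clojure" in lowered:
        tags.append("lang:clojure")
    return tags

# Hypothetical file, for illustration only.
proposed = phase1_tags("site/events/2019/conj.org", ["Clojure", "boston"])
print(proposed)
```

The location keyword boston produces no tag, consistent with the rule that locations stay in #+KEYWORDS.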
Phase 2 (editorial, high-value documents first):
- The 119 terms appearing 5+ times in #+KEYWORDS are the consolidation targets. For each, map to the controlled vocabulary and add the corresponding #+FILETAGS to documents using that term.
- Priority: the 20 highest-frequency terms cover a large fraction of docs.
Phase 3 (convergence cleanup in #+KEYWORDS):
- After #+FILETAGS is populated, run the convergence algorithm (see 2026-keyword-vocabulary-convergence) to normalize #+KEYWORDS.
- Merge near-duplicates: agentic system / agentic systems / agentic-systems → agentic-systems (prefer hyphenated form, consistent with FILETAGS style).
- Remove year-only tokens from #+KEYWORDS where #+DATE already carries the year.
- Do NOT remove location keywords; they are valid search terms.
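Near-duplicate candidates for the merge step can be surfaced by grouping terms under a normalized key. This is a sketch of one possible heuristic, not the convergence algorithm itself: the key function (lowercase, hyphen/space folding, naive trailing-s stripping) and both function names are assumptions, and choosing the canonical spelling is left to the author.

```python
import re
from collections import defaultdict

def variant_key(term):
    """Collapse case, hyphen/space differences, and a naive trailing plural 's'."""
    words = re.split(r"[\s-]+", term.strip().lower())
    last = words[-1]
    if len(last) > 3 and last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)

def near_duplicates(term_counts):
    """Group terms sharing a key; return only groups with more than one spelling."""
    groups = defaultdict(list)
    for term, count in term_counts.items():
        groups[variant_key(term)].append((term, count))
    return {
        key: sorted(variants, key=lambda tc: -tc[1])
        for key, variants in groups.items()
        if len(variants) > 1
    }

# Counts taken from the symptom table in the problem statement.
counts = {
    "org-mode": 22, "org mode": 4,
    "agentic system": 8, "agentic systems": 2, "agentic-systems": 2,
    "clojure": 61,
}
dups = near_duplicates(counts)
for key, variants in dups.items():
    print(key, variants)
```

The plural stripping is deliberately crude (it only shapes the grouping key, never the stored keyword), so each reported group should be reviewed before merging.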
6.3. Tier 3: Normalize #+KEYWORDS casing
The 3,406 unique keyword tokens include mixed-case variants. The convergence
algorithm normalizes to lowercase. As a build-time invariant: the
check_required_headers.py script (or a new sibling) should warn when
#+KEYWORDS contains uppercase tokens, since the BM25 tokenizer lowercases
at index time anyway. Storing uppercase is misleading.
7. Implementation: Adding Build-Time Validation
The existing scripts/check_required_headers.py validates presence of four
headers. Extend it (or create scripts/check_vocabulary.py) to:
check_vocabulary.py:
  For each .org file under site/:
    1. If #+FILETAGS is present:
       - Parse colon-delimited tokens
       - Reject any token not matching facet:term pattern
       - Reject any token not in the v1 term list
       - Warn on missing domain tag for research notes
       - Warn on missing format:conference for files under events/
    2. Emit: OK / WARN / FAIL per file
    3. Exit non-zero if any FAIL
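A runnable sketch of that outline follows. Several details are assumptions, not the site's actual script: TERMS is a four-item stand-in for the real term list, tokens are assumed to be whitespace-separated within the #+FILETAGS line (the facet:term form itself contains a colon, so the exact serialization needs to be settled separately), and research notes are assumed to be the files under a research/ directory.

```python
import re
from pathlib import Path

# Stand-in for the machine-readable v1 term list (assumption: imported in practice).
TERMS = {"format:conference", "format:research", "domain:databases", "lang:clojure"}
FILETAGS = re.compile(r"^#\+FILETAGS:\s*(.*)$", re.IGNORECASE | re.MULTILINE)

def check_text(rel_path, text):
    """Return ('OK' | 'WARN' | 'FAIL', messages) for one file's contents."""
    match = FILETAGS.search(text)
    if not match:
        return "OK", []  # the checks only apply when the header is present
    tokens = match.group(1).split()  # assumed whitespace-separated facet:term tokens
    # Membership in TERMS also rejects anything not shaped like facet:term.
    fails = [f"unknown token {t}" for t in tokens if t not in TERMS]
    warns = []
    parts = Path(rel_path).parts
    if "events" in parts and "format:conference" not in tokens:
        warns.append("missing format:conference")
    if "research" in parts and not any(t.startswith("domain:") for t in tokens):
        warns.append("missing domain tag")
    status = "FAIL" if fails else "WARN" if warns else "OK"
    return status, fails + warns

def main(root="site"):
    """Check every .org file under root; return 1 if any file FAILs, else 0."""
    failed = False
    for path in sorted(Path(root).rglob("*.org")):
        rel = path.relative_to(root).as_posix()
        status, messages = check_text(rel, path.read_text(encoding="utf-8"))
        print(f"{status} {path}" + (": " + "; ".join(messages) if messages else ""))
        failed = failed or status == "FAIL"
    return 1 if failed else 0

# Entry point when run as a script: raise SystemExit(main())
```

Keeping check_text free of file I/O makes the rules testable without a site/ tree on disk.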
Add as a dependency of the lint target in Makefile:
lint: check-well-known check-required-headers check-vocabulary
The vocabulary term list lives in a single source of truth:
site/research/controlled-vocabulary-v1.org (this file) plus a machine-
readable extraction. The simplest machine-readable form: a YAML comment block
in this file, or a separate scripts/vocabulary_v1.py module that imports as
a dict.
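One possible shape for scripts/vocabulary_v1.py is sketched below. The terms are transcribed from the facet tables in section 4; the constant name VOCABULARY_V1 and the helper all_terms are hypothetical.

```python
# Terms transcribed from the six facet tables; names of the constant and helper are assumptions.
VOCABULARY_V1 = {
    "lang": ["clojure", "scheme", "lisp", "python", "javascript", "typescript",
             "rust", "haskell", "java", "go", "cpp", "sql"],
    "domain": ["agents", "ml", "llm", "formal-methods", "security", "distributed",
               "databases", "web", "infrastructure", "algorithms", "category-theory",
               "search", "emacs", "retrocomputing", "aviation", "fintech"],
    "format": ["conference", "research", "tutorial", "spec", "weekly", "index",
               "brief", "experiment"],
    "era": ["historical", "foundational", "current", "emerging"],
    "project": ["pocket-es", "webring", "ads-b", "goldberry", "agentic-2026",
                "beads", "unix-v4"],
    "status": ["draft", "stub", "deprecated", "evergreen"],
}

def all_terms(vocabulary=VOCABULARY_V1):
    """Flatten the facet dict into the set of valid facet:term tokens."""
    return {f"{facet}:{term}" for facet, terms in vocabulary.items() for term in terms}

print(len(all_terms()))
```

Flattening gives 51 tokens when the optional status facet is included and 47 without it, which is a cheap consistency check to run whenever a term is added.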
8. Term Equivalency Table
For migration: what current free-text #+KEYWORDS terms map to which
controlled vocabulary #+FILETAGS terms.
| Current #+KEYWORDS term(s) | Maps to #+FILETAGS term |
|---|---|
| clojure, clojurescript | lang:clojure |
| scheme, guile, racket | lang:scheme |
| lisp, elisp | lang:lisp (if language is the subject) |
| python | lang:python |
| javascript, js, node.js | lang:javascript |
| typescript | lang:typescript |
| rust | lang:rust |
| haskell | lang:haskell |
| java, jvm, scala | lang:java |
| agent, agents, agentic system, ai agents, multi-agent | domain:agents |
| machine learning, deep learning, neural network, scikit-learn | domain:ml |
| llm, claude, openai, anthropic, gemini, ollama | domain:llm |
| formal verification, tla+, lean4, alloy, model checking | domain:formal-methods |
| security, privacy, cybersecurity | domain:security |
| distributed systems, networking, consensus | domain:distributed |
| database, datomic, postgresql, sql | domain:databases |
| html5, css, react, graphql, jquery, angular | domain:web |
| aws, freebsd, kubernetes, devops, ci/cd, terraform | domain:infrastructure |
| algorithm, data structures, dynamic programming | domain:algorithms |
| category theory, lambda calculus, type theory | domain:category-theory |
| search, bm25, pocket-es, information retrieval | domain:search |
| emacs, org-mode, org-babel | domain:emacs |
| pdp-11, retrocomputing, unix v4, 1973 | domain:retrocomputing |
| ads-b, sdr, aviation, dump1090 | domain:aviation |
| fintech, payments, trading | domain:fintech |
| (file under events/) | format:conference |
| (file under activity-summary/) | format:weekly |
| spec, specification, contract | format:spec |
| weekly-summary | format:weekly |
| morning-brief | format:brief |
| pocket-es (as project) | project:pocket-es |
| webring, wwn | project:webring |
| goldberry | project:goldberry |
| beads | project:beads |
| ads-b (as project) | project:ads-b |
9. When to Add a New Term
A new term earns entry into the controlled vocabulary when:
- At least 3 existing documents would use it immediately (retroactive applicability test).
- It cannot be expressed by combining two existing terms.
- It is orthogonal to existing facets — it adds a dimension of discrimination not already covered.
- It survives a "would I filter on this?" test: if someone searching wal.sh asked to filter by this property, would the results be a coherent set of documents?
New terms that fail these tests belong in #+KEYWORDS (free text) but not in
#+FILETAGS (controlled vocabulary). The controlled vocabulary grows slowly,
by consensus with prior content, not by authoring-time impulse.
9.1. Process for adding a new term
- Open this file and add the term in the appropriate facet table with its coverage description.
- Update scripts/vocabulary_v1.py (the machine-readable list).
- Tag the documents that prompted the addition.
- Run gmake lint to verify the new term validates correctly.
- Commit: feat: add lang:erlang to controlled vocabulary v1.
10. Invariants Summary
These are the machine-checkable properties that define a healthy vocabulary state for wal.sh:
| Invariant | Target | Current State |
|---|---|---|
| Every #+FILETAGS token matches a controlled term | 100% | 0% (FILETAGS barely used) |
| Every published research note has ≥1 domain tag | 100% | ~0% |
| Every event file has format:conference | 100% | ~5% |
| #+KEYWORDS singleton ratio | <15% | ~81% |
| #+KEYWORDS near-duplicate pairs | <5 | ~6+ confirmed |
| Documents with no #+KEYWORDS | <5% | ~21% |
| Controlled vocabulary term count | <80 | 47 (v1) |
| Maximum #+FILETAGS tokens per doc | ≤6 | N/A |
The singleton ratio target of <15% is the convergence document's <10% criterion, relaxed slightly to allow for ongoing content addition in steady state. It should be measurable by the pocket-es indexer or the convergence Babashka script.
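The singleton ratio is the number of terms that occur in exactly one document divided by the number of unique terms; the figures in the problem statement (about 2,777 of 3,406) work out to roughly 81.5%. A minimal sketch of the measurement, assuming keywords are already split per document; the function name and sample data are hypothetical.

```python
from collections import Counter

def singleton_ratio(doc_keywords):
    """Fraction of unique keyword terms that appear in exactly one document."""
    counts = Counter()
    for keywords in doc_keywords:
        # Lowercase and de-duplicate within a document before counting.
        counts.update({k.strip().lower() for k in keywords if k.strip()})
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)

# Hypothetical three-document corpus, for illustration only.
docs = [["clojure", "datomic"], ["clojure", "emacs"], ["Clojure"]]
ratio = singleton_ratio(docs)
print(f"{ratio:.2%}")
```

Lowercasing before counting keeps the metric aligned with the BM25 tokenizer, which lowercases at index time.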
11. What This Vocabulary Does Not Solve
Be explicit about scope limits:
- It does not eliminate #+KEYWORDS explosion. Free-text keywords will continue to grow. The convergence algorithm manages them; the controlled vocabulary does not replace them.
- It does not retroactively classify 14 years of notes automatically. Phase 2 migration requires editorial judgment for ~272 documents (the research notes). Scripted Phase 1 handles the rest mechanically.
- It does not handle cross-facet compound concepts. A document about "Rust-based formal verification tools" is lang:rust + domain:formal-methods. The vocabulary does not create lang:rust:formal-methods compounds.
- It does not version over time. When the vocabulary needs a v2 (new facets, retired terms), that is a new document and a migration pass, not an in-place update to this file.
12. Implementation Priority
Ordered by value-to-effort ratio:
- Add scripts/vocabulary_v1.py with the machine-readable term list (30 min)
- Extend scripts/check_vocabulary.py for FILETAGS validation (2 hours)
- Add gmake lint-vocab target to Makefile (15 min)
- Script Phase 1 migration: events/ → format:conference, activity-summary/ → format:weekly (1 hour)
- Manually tag the 20 most-linked research notes with domain tags (2-3 hours)
- Run #+KEYWORDS normalization (convergence pass 1) on the full corpus (4 hours + review)