Keyword Vocabulary Convergence

1. Problem
- 1.1. Per-document keyword count distribution
2. Convergence rounds
3. Method
- 3.1. Convergence criterion
- 3.2. Exceptions to singleton pruning
4. Lemmatization as post-processing
5. Semantic augmentation
6. Current state (2026-06-01)
7. Related work
8. Implementation
9. Applicability


1. Problem

A personal research archive accumulates free-text keywords over years. Without curation, the vocabulary drifts:

Symptom	Example	Count
Singletons (used once)	`$.tmpl()`, `2136 6th avenue`	2645 (80%)
Near-duplicates	`org mode` / `org-mode`	6 pairs
Too broad	`clojure` (55 docs)	1
Missing entirely	docs with no keywords	131 (21%)

The search engine (pocket-es) displays keywords as clickable chips. A keyword that appears once connects nothing. A keyword on 55 docs produces 6 pages of results – too many to scan. The goal: every keyword appears on 2–50 docs (1–5 pages at 10 results per page).
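
As a rough illustration of the 2–50 target, the sketch below counts how many documents carry each keyword and sorts keywords into three buckets. It is written in Python for brevity rather than in the Babashka of the actual script; the function names and the toy corpus are hypothetical.

```python
from collections import Counter

def document_frequency(docs):
    """Map each keyword to the number of documents that carry it."""
    df = Counter()
    for keywords in docs.values():
        df.update(set(keywords))  # count a keyword once per document
    return df

def classify(df, low=2, high=50):
    """Bucket keywords by document frequency: below, inside, or above the target range."""
    buckets = {"singleton": [], "useful": [], "too_broad": []}
    for kw, n in sorted(df.items()):
        if n < low:
            buckets["singleton"].append(kw)
        elif n > high:
            buckets["too_broad"].append(kw)
        else:
            buckets["useful"].append(kw)
    return buckets

# Hypothetical toy corpus: doc id -> keyword list.
docs = {
    "a.org": ["clojure", "org-mode"],
    "b.org": ["clojure", "seattle"],
    "c.org": ["clojure", "org-mode"],
}
df = document_frequency(docs)
# Tiny bounds so the three-document example populates every bucket.
buckets = classify(df, low=2, high=2)
print(buckets)
```

With the real bounds (`low=2`, `high=50`) the same call would flag `clojure` at 55 docs as too broad.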

1.1. Per-document keyword count distribution

The number of keywords per document should also be consistent. Too many keywords dilute signal; too few leave the doc unreachable.

Metric	Value
Mean	9.9 keywords/doc
Std dev	6.1
Consistent range (mean ± 1σ)	4–16
Over-tagged (>20)	72 docs
Under-tagged (<4)	11 docs
No keywords	98 docs
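
The consistent range can be derived mechanically from the mean and standard deviation. The sketch below is one way to do it, assuming the band is rounded to whole keywords and that zero-keyword docs are reported separately; `band` and `classify_docs` are hypothetical names, and the choice of population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def band(mu, sigma):
    """Consistent range: mean ± 1σ, rounded to whole keywords."""
    return round(mu - sigma), round(mu + sigma)

def classify_docs(kw_counts, low, high):
    """Group doc ids by keyword count relative to the band."""
    groups = {"none": [], "under": [], "ok": [], "over": []}
    for doc, n in sorted(kw_counts.items()):
        if n == 0:
            groups["none"].append(doc)
        elif n < low:
            groups["under"].append(doc)
        elif n > high:
            groups["over"].append(doc)
        else:
            groups["ok"].append(doc)
    return groups

low, high = band(9.9, 6.1)  # mean and std dev from the table -> (4, 16)

# Hypothetical per-document keyword counts.
kw_counts = {"a.org": 0, "b.org": 2, "c.org": 9, "d.org": 63}
groups = classify_docs(kw_counts, low, high)

# For a real corpus the inputs would come from the data itself
# (assumption: zero-keyword docs are excluded from the statistics).
tagged = [n for n in kw_counts.values() if n > 0]
mu, sigma = mean(tagged), pstdev(tagged)
```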

Worst offenders: libpython-clj2 (63 keywords), startup-weekend-seattle-2012 (48), angular-summit-boston-2017 (42). These are keyword dumps from the 2024 GPT-3.5 review pass – the model listed every noun in the document.

The convergence loop should also normalize per-document counts: prune over-tagged docs to the 10–12 most discriminating keywords, fill under-tagged docs from content analysis.
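
One possible reading of "most discriminating" is inverse document frequency (IDF): prefer keywords that are rare across the corpus, but only among those already inside the useful 2–50 range, so that singletons are not rewarded. The sketch below assumes that reading; `prune_keywords` and the sample frequencies are hypothetical.

```python
import math

def prune_keywords(keywords, df, n_docs, keep=12, low=2, high=50):
    """Keep up to `keep` keywords, preferring rarer ones inside the useful range."""
    candidates = [kw for kw in keywords if low <= df.get(kw, 0) <= high]
    # Higher IDF = rarer across the corpus = more discriminating.
    ranked = sorted(candidates, key=lambda kw: (-math.log(n_docs / df[kw]), kw))
    return ranked[:keep]

# Hypothetical document frequencies for one over-tagged doc.
df = {"clojure": 55, "python": 20, "jni": 3, "interop": 5, "foo": 1}
kept = prune_keywords(list(df), df, n_docs=630, keep=2)
print(kept)  # ['jni', 'interop']
```

A human pass would still be needed to keep technology names that a pure frequency rule ranks low.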

kw/doc	R0	R1	R2
0	131	98	43
1-3	15	11	2
4-6	62	95	114
7-9	120	155	292
10-12	105	120	98
13-16	55	52	43
17-20	38	30	38
21+	72	52	0
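
A histogram like this one can be regenerated after every round with a few lines. The sketch below uses the same bucket edges as the table; the input list is hypothetical.

```python
from collections import Counter

# (low, high, label) bucket edges copied from the table; anything above 20 is "21+".
BINS = [(0, 0, "0"), (1, 3, "1-3"), (4, 6, "4-6"), (7, 9, "7-9"),
        (10, 12, "10-12"), (13, 16, "13-16"), (17, 20, "17-20")]

def bucket(n):
    """Return the histogram label for a keyword count."""
    for lo, hi, label in BINS:
        if lo <= n <= hi:
            return label
    return "21+"

def histogram(kw_counts):
    """Count documents per bucket."""
    return Counter(bucket(n) for n in kw_counts)

hist = histogram([0, 2, 5, 20, 21, 63])  # hypothetical counts
```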

2. Convergence rounds

2.1. Round 0: GPT-3.5 batch review (2024-08)

The 2024 AI review pass added #+KEYWORDS to ~492 files. Keywords were generated from document content by GPT-3.5. No controlled vocabulary, no normalization, no deduplication. Result: 3285 unique keywords, 80% singletons, max 63 keywords on one doc.

2.2. Round 1: initial cleanup (2026-05-31)

Agents rewrote 96 slop descriptions. Stopwords data and code were removed from the tokenizer. In addition, 80 event dates were corrected from AI review dates to actual conference dates, and 14 hallucinated titles were replaced.

Metric	Before	After
Docs with keywords	492 (79%)	530 (84%)
Zero keywords	131	98
Mean ± std	9.9 ± 6.1	~9.5 ± 5.5
Over 20	72	52

2.3. Round 2: systematic convergence (2026-06-01 / 2026-06-02)

Three parallel agent passes:

Fill: added 6–10 keywords to 67 zero-keyword docs (weekly summaries, event indexes, SmallCon sessions, research stubs)
Prune: cut 37 over-tagged docs from 21–63 keywords down to 10–12 (removed session-level noise, kept technology names)
Bump: raised 11 under-4 docs to 4–8 keywords

Metric	Before R2	After R2	Change
Docs	628	630	+2 (new notes)
With keywords	530 (84%)	587 (93%)	+57
Zero keywords	98	43	−55
Over 20	52	0	−52
Under 4	11	2	−9
Mean ± std	~9.5 ± 5.5	8.9 ± 3.5	tighter
Max	63	20	−43

The remaining 43 zero-keyword docs are generated stubs (bot lab canaries, #+INCLUDE fragments, YYYY templates) that are excluded from org-publish and do not appear in search results.

2.4. Round 3 (planned)

Merge 6 near-duplicate keyword pairs (hyphen vs space)
Apply irregular lemmatization table (strategies → strategy)
Run embedding similarity to find docs that should share keywords
Prune singletons (target: <10% singleton ratio)
Convergence criterion: singleton ratio <10%, no keyword >50 docs, every doc 4–16 keywords, merge candidates <5
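
The hyphen-versus-space merge can be found by grouping keywords under a canonical key. This is a minimal sketch; treating underscores the same way and choosing the hyphenated form as canonical are assumptions, and the function names are hypothetical.

```python
import re
from collections import defaultdict

def canonical(keyword):
    """Lower-case and collapse runs of spaces, hyphens and underscores into one hyphen."""
    return re.sub(r"[\s_-]+", "-", keyword.strip().lower())

def merge_groups(keywords):
    """Return canonical key -> variants, for keys with more than one spelling."""
    groups = defaultdict(set)
    for kw in keywords:
        groups[canonical(kw)].add(kw)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}

groups = merge_groups(["org mode", "org-mode", "clojure"])
print(groups)  # {'org-mode': ['org mode', 'org-mode']}
```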

3. Method

The convergence process is a fixed-point iteration, loosely analogous to Newton's method for root-finding, applied to a tagged document set:

repeat:
  1. Count: keyword → number of docs
  2. Prune: remove keywords with count = 1 (except the protected classes in 3.2)
  3. Merge: normalize near-duplicates (hyphen/space, singular/plural)
  4. Split: replace too-broad keywords with specific sub-terms
  5. Add: keyword-less docs get 4–10 terms from content analysis
  6. Re-index
until distribution stabilizes (fixed point)
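
A reduced version of this loop can be sketched in Python. It covers only the merge and prune steps (split and add need content analysis and are left out), takes the normalizer as a parameter, and stops when a round changes nothing. All names here are hypothetical, and the singleton exceptions of section 3.2 are not applied.

```python
from collections import Counter

def doc_freq(docs):
    """Keyword -> number of documents carrying it."""
    return Counter(kw for kws in docs.values() for kw in set(kws))

def step(docs, normalize):
    """One round: merge via `normalize`, then prune keywords used on a single doc."""
    merged = {d: sorted({normalize(k) for k in kws}) for d, kws in docs.items()}
    df = doc_freq(merged)
    return {d: [k for k in kws if df[k] > 1] for d, kws in merged.items()}

def converge(docs, normalize, max_rounds=10):
    """Repeat `step` until the tagging stops changing (a fixed point)."""
    for rounds in range(1, max_rounds + 1):
        nxt = step(docs, normalize)
        if nxt == docs:
            return docs, rounds
        docs = nxt
    return docs, max_rounds

# Hypothetical toy corpus.
docs = {"a": ["org mode", "clojure", "x"], "b": ["org-mode", "clojure"]}
final, rounds = converge(docs, lambda k: k.lower().replace(" ", "-"))
print(final, rounds)
```

With merge and prune alone the loop settles almost immediately; the extra rounds in practice come from the add step introducing fresh terms.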

Each iteration reduces the singleton count and tightens the distribution. The process converges because:

Pruning removes noise without adding new terms
Merging reduces unique count without changing doc count
Splitting is bounded by the specificity of the content
Adding fills gaps that pruning created; any new singletons it introduces are caught by the next round's prune

3.1. Convergence criterion

The vocabulary has converged when:

Singleton ratio < 10% (currently 80%)
No keyword exceeds 50 docs
Every doc has 4–16 keywords
Merge candidates (Levenshtein distance ≤ 2) < 5
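
The criterion can be written as a single predicate. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the upper bound of 16 keywords per doc is taken from the Round 3 plan in 2.4, and merge candidates are found by brute-force pairwise edit distance, which is quadratic in vocabulary size.

```python
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def merge_candidates(keywords, max_dist=2):
    """All keyword pairs within `max_dist` edits of each other."""
    return [(a, b) for a, b in combinations(sorted(keywords), 2)
            if levenshtein(a, b) <= max_dist]

def converged(df, kw_per_doc, max_df=50, min_kw=4, max_kw=16):
    """df: keyword -> doc count; kw_per_doc: keyword counts of published docs."""
    singletons = sum(1 for n in df.values() if n == 1)
    return (singletons / len(df) < 0.10
            and max(df.values()) <= max_df
            and min_kw <= min(kw_per_doc) and max(kw_per_doc) <= max_kw
            and len(merge_candidates(df)) < 5)

# Hypothetical converged vocabulary.
df = {"clojure": 12, "org-mode": 8, "datalog": 5, "seattle": 2}
ok = converged(df, kw_per_doc=[4, 9, 12])
```

A distance threshold of 2 produces false positives on very short keywords, so the candidate list is something to review rather than merge blindly.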

3.2. Exceptions to singleton pruning

Location keywords (seattle, cascais, portland, durham) are valid singletons – "what did I attend in Seattle?" is a real query with one correct answer. Similarly, person names, specific conference editions (elm-conf-2016), and unique identifiers should survive pruning even at count 1. The pruning rule is: remove singletons that are noise, not singletons that are specificity.
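
The exception rule could be expressed as a guard in front of the prune step. This is a sketch under assumptions: the allow-lists are hand-maintained (they might equally be filled from NER output), and a trailing four-digit year is used as a stand-in for "specific conference edition". All names are hypothetical.

```python
import re

# Hypothetical hand-maintained allow-lists.
LOCATIONS = {"seattle", "cascais", "portland", "durham"}
PEOPLE = set()  # person names would be listed here
# Assumption: a conference edition ends in a year, e.g. elm-conf-2016.
EDITION = re.compile(r"^[a-z0-9.-]+-(19|20)\d{2}$")

def is_protected(keyword):
    """True for singletons that carry specificity rather than noise."""
    kw = keyword.lower()
    return kw in LOCATIONS or kw in PEOPLE or bool(EDITION.match(kw))

def prune_singletons(df):
    """Drop keywords used once, unless they are protected."""
    return {kw: n for kw, n in df.items() if n > 1 or is_protected(kw)}

df = {"seattle": 1, "elm-conf-2016": 1, "$.tmpl()": 1, "clojure": 55}
kept = prune_singletons(df)
print(kept)  # {'seattle': 1, 'elm-conf-2016': 1, 'clojure': 55}
```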

4. Lemmatization as post-processing

The keyword normalizer strips a trailing s for plurals, but that rule mishandles several English forms: -ies plurals (strategies → strategy), Latin plurals (indices → index), and inflected verbs (searching → search). A full NLP lemmatizer (spaCy, NLTK WordNet lemmatizer) is overkill for 600 documents. Instead, a corpus-specific lookup table of the forms actually seen in the corpus converges faster:

;; Irregular form → base form (corpus-specific)
{"strategies" "strategy"
 "indices" "index"
 "architectures" "architecture"
 "vulnerabilities" "vulnerability"
 "dependencies" "dependency"
 "libraries" "library"
 "categories" "category"
 "hierarchies" "hierarchy"}

This table is part of the fixed-point iteration: each round discovers new irregular forms that the simple s-stripping misses. The table stabilizes when no new forms appear between iterations.
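
How the table and the s-stripping rule might combine is sketched below in Python. The length and double-s guards are assumptions, not the normalizer's documented behaviour, and `review_candidates` is a hypothetical helper for spotting forms that still need a table entry.

```python
# Same entries as the corpus-specific table, as a Python dict.
IRREGULAR = {
    "strategies": "strategy", "indices": "index",
    "architectures": "architecture", "vulnerabilities": "vulnerability",
    "dependencies": "dependency", "libraries": "library",
    "categories": "category", "hierarchies": "hierarchy",
}

def lemma(word):
    """Table lookup first, then naive trailing-s stripping."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    # Assumed guards: leave short words and words ending in "ss" alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def review_candidates(vocab):
    """Keywords ending in -ies that are not in the table yet."""
    return sorted(w for w in vocab if w.endswith("ies") and w not in IRREGULAR)

print(lemma("strategies"))  # strategy
print(lemma("queries"))     # querie  (the naive rule gets this wrong)
print(review_candidates(["queries", "strategies", "macros"]))  # ['queries']
```

Words that end in s without being plural (for example `analysis`) would also be mangled by the naive rule and would need identity entries in the table.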

5. Semantic augmentation

Syntactic methods (substring matching, edit distance) miss semantic duplicates. Two documents about "sandboxing AI agents" and "FreeBSD jail isolation for coding assistants" share no keywords yet cover the same topic.

Embedding-based similarity fills this gap:

1. Embed each doc's title + description via nomic-embed-text:v1.5
   (768-dim, Ollama at 192.168.86.22:11434)
2. Compute pairwise cosine similarity
3. For doc pairs with similarity > 0.85 and keyword overlap < 2:
   suggest shared keywords from the union of both docs' terms

This finds connections the syntactic vocabulary misses and proposes keyword additions that increase the graph's connectivity.
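
Steps 2 and 3 can be sketched without any network access by assuming the embeddings have already been fetched. The vectors below are three-dimensional stand-ins for the 768-dimensional ones, and `suggest` is a hypothetical name; it returns the union of both docs' keywords as a pool for a reviewer to pick from.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def suggest(embeddings, keywords, min_sim=0.85, max_overlap=2):
    """Doc pairs that look alike but share fewer than `max_overlap` keywords."""
    out = []
    for a, b in combinations(sorted(embeddings), 2):
        ka, kb = set(keywords[a]), set(keywords[b])
        if len(ka & kb) < max_overlap and cosine(embeddings[a], embeddings[b]) > min_sim:
            out.append((a, b, sorted(ka | kb)))
    return out

# Hypothetical toy data: precomputed embeddings and current keywords.
embeddings = {"a": [1.0, 0.0, 0.1], "b": [0.9, 0.1, 0.1], "c": [0.0, 1.0, 0.0]}
keywords = {"a": ["sandboxing", "ai-agents"], "b": ["freebsd", "jails"], "c": ["clojure"]}
pairs = suggest(embeddings, keywords)
```

For about 600 documents the pairwise loop is roughly 180,000 comparisons, which is small enough to run without an index.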

6. Current state (2026-06-01)

The vocabulary-level figures below match the Round 0 baseline (623 docs, 3285 unique keywords); per-document counts after Rounds 1 and 2 are in section 2.

Metric	Value
Total docs	623
With keywords	492 (79%)
Without keywords	131 (21%)
Unique keywords	3285
Singletons	2645 (80%)
Useful range (2–50)	639 (19%)
Too broad (>50)	1 (`clojure` at 55)
Merge candidates	6 (hyphen vs space)
Embedding model	nomic-embed-text:v1.5 (768-dim)

7. Related work

7.1. Folksonomies and controlled vocabularies

Vander Wal coined "folksonomy" (2004) to describe user-generated tagging without a controlled vocabulary. The power law distribution (few tags used many times, many tags used once) is the expected outcome. See:

Golder, S. A., & Huberman, B. A. (2006). "Usage Patterns of Collaborative Tagging Systems." Journal of Information Science, 32(2), 198–208.
Mathes, A. (2004). "Folksonomies – Cooperative Classification and Communication Through Shared Metadata." Computer Mediated Communication – LIS590CMC.

The convergence method here is the curation step that transforms a folksonomy into a controlled vocabulary over time.

7.2. Tag gardening

The practice of periodically reviewing and normalizing tags is called "tag gardening" in knowledge management. Tools like Obsidian, Notion, and Roam Research provide tag-merge and tag-rename operations.

The fixed-point framing adds rigor: instead of ad-hoc cleanup, define a measurable convergence criterion and iterate until met.

7.3. Library science

Controlled vocabularies (LCSH, MeSH, ACM CCS) are the library science equivalent. The key differences from our approach:

Controlled vocabularies are designed top-down before tagging begins
Folksonomies emerge bottom-up from usage
This method is bottom-up vocabulary refined toward top-down properties (bounded frequency, no duplicates, full coverage)

7.4. Embedding-based tag suggestion

Using embeddings to suggest tags is well-studied:

Belém, F. M., et al. (2017). "A Survey on Tag Recommendation Methods." JASIST, 68(4), 830–844.
Wang, J., et al. (2019). "Tag2Vec: Learning Tag Representations in Tag Networks." WWW.

The nomic-embed-text model provides dense representations without training a domain-specific model. The 768-dim embedding captures semantic similarity that keyword overlap misses.

7.5. NLP tooling approaches

Several NLP tools can augment the iterative convergence:

Tool	Role	How it helps
nomic-embed-text:v1.5 (Ollama)	Semantic similarity	Find doc pairs that should share keywords
spaCy NER	Entity extraction	Auto-tag locations, people, organizations
WordNet (NLTK)	Hypernym trees	General is-a chains for split decisions; technology names such as `clojure` are not in WordNet and need a hand-maintained supplement
TF-IDF	Term importance	Identify keywords that discriminate one doc from the corpus

A hypernym hierarchy is particularly useful for the "split too-broad" step: clojure at 55 docs is too broad. WordNet itself has no entry for clojure or its ecosystem, so the relations have to come from a small hand-maintained taxonomy in the same style: clojure is-a programming language (already at 20) and has-part clojurescript (21), clojure-conj (25), clojure.spec (3). The split replaces the broad term with specific sub-terms that already exist in the vocabulary.

The embedding model (768-dim via Ollama at 192.168.86.22:11434) handles semantic similarity without requiring WordNet or spaCy installed locally. spaCy's NER would add structured entity types (GPE for locations, ORG for organizations), which help with the singleton exception rule (location keywords are valid singletons).

7.6. Similar implementations

Gwern.net uses hierarchical tags with a controlled vocabulary of ~200 terms, manually curated. Tags are applied at authoring time.
Simon Willison's blog uses flat tags, ~500 terms, with a tag-management interface. No automated convergence.
DEVONthink uses AI-based auto-tagging with a suggest-then-confirm workflow. The user approves or rejects suggestions.

8. Implementation

The convergence loop runs as a Babashka script against the search index:

$ bb -cp src scripts/keyword-converge.bb
Iteration 1: 3285 unique, 2645 singletons (80%), 6 merges
  Pruned 2645 singletons
  Merged 6 pairs (org mode → org-mode, etc.)
  Added keywords to 131 docs
Iteration 2: 847 unique, 112 singletons (13%), 0 merges
  Pruned 112 singletons
Iteration 3: 735 unique, 23 singletons (3%), 0 merges
  Converged: singleton ratio 3% < 10%

The script modifies org files in-place (updating #+KEYWORDS) and re-indexes after each iteration. The embedding step runs once after convergence to find semantic gaps.
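
The in-place update amounts to rewriting one header line. The sketch below shows the idea in Python rather than Babashka; it assumes keywords are comma-separated on a single #+KEYWORDS line and that a missing line is added at the top of the file, neither of which is confirmed by the script itself.

```python
import re

KEYWORDS_LINE = re.compile(r"^#\+KEYWORDS:.*$", re.MULTILINE | re.IGNORECASE)

def set_keywords(org_text, keywords):
    """Replace the first #+KEYWORDS line, or prepend one if none exists."""
    line = "#+KEYWORDS: " + ", ".join(keywords)
    if KEYWORDS_LINE.search(org_text):
        # A function replacement avoids backslash handling in the new text.
        return KEYWORDS_LINE.sub(lambda m: line, org_text, count=1)
    return line + "\n" + org_text

before = "#+TITLE: X\n#+KEYWORDS: a, b\n\nBody"
after = set_keywords(before, ["c", "d"])
print(after)
```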

9. Applicability

This method works for any tagged document corpus:

Personal wikis (Obsidian, org-roam, TiddlyWiki)
Blog post archives
Conference talk databases
Code repository topic tags
Bookmark collections (the original folksonomy use case)

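The requirements are: a corpus of tagged documents, a way to count tag frequency, and a way to edit tags programmatically. The convergence criterion should adapt to corpus size. One candidate rule is ceil(log2(N)) to ceil(N/10) for N documents, but for this corpus (N = 630) it gives 10–63 rather than the 2–50 used here, which comes from the pagination limit (5 pages at 10 results per page). Treat the rule as a starting point to calibrate against the search UI.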
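
The scaling rule is easy to evaluate for a given corpus size. A minimal sketch, with `useful_range` as a hypothetical name:

```python
import math

def useful_range(n_docs):
    """Frequency band from the rule ceil(log2(N)) .. ceil(N/10)."""
    return math.ceil(math.log2(n_docs)), math.ceil(n_docs / 10)

print(useful_range(630))  # (10, 63)
print(useful_range(100))  # (7, 10)
```

The result for 630 documents is worth comparing with the 2–50 band before adopting the rule for another corpus.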