Keyword Vocabulary Convergence

Table of Contents

keyword-graph.png

1. Problem

A personal research archive accumulates free-text keywords over years. Without curation, the vocabulary drifts:

Symptom Example Count
Singletons (used once) $.tmpl(), 2136 6th avenue 2645 (80%)
Near-duplicates org mode / org-mode 6 pairs
Too broad clojure (55 docs) 1
Missing entirely 131 docs with no keywords 21%

The search engine (pocket-es) displays keywords as clickable chips. A keyword that appears once connects nothing. A keyword on 55 docs produces 6 pages of results — too many to scan. The goal: every keyword appears on 2–50 docs (1–5 pages at 10 results per page).

1.1. Per-document keyword count distribution

The number of keywords per document should also be consistent. Too many keywords dilute signal; too few leave the doc unreachable.

Metric Value
Mean 9.9 keywords/doc
Std dev 6.1
Consistent range (mean ± 1σ) 4–16
Over-tagged (>16) 72 docs
Under-tagged (<4) 11 docs
No keywords 98 docs

Worst offenders: libpython-clj2 (63 keywords), startup-weekend-seattle-2012 (48), angular-summit-boston-2017 (42). These are keyword dumps from the 2024 GPT-3.5 review pass — the model listed every noun in the document.

The convergence loop should also normalize per-document counts: prune over-tagged docs to the 10–12 most discriminating keywords, fill under-tagged docs from content analysis.

convergence-gaussian.png

kw/doc R0 R1 R2
0 131 98 43
1-3 15 11 2
4-6 62 95 114
7-9 120 155 292
10-12 105 120 98
13-16 55 52 43
17-20 38 30 38
21+ 72 52 0

2. Convergence rounds

2.1. Round 0: GPT-3.5 batch review (2024-08)

The 2024 AI review pass added #+KEYWORDS to ~492 files. Keywords were generated from document content by GPT-3.5. No controlled vocabulary, no normalization, no deduplication. Result: 3285 unique keywords, 80% singletons, max 63 keywords on one doc.

2.2. Round 1: initial cleanup (2026-05-31)

Agents rewrote 96 slop descriptions. Stopwords data and code removed from the tokenizer. 80 event dates corrected from AI review dates to actual conference dates. 14 hallucinated titles replaced.

Metric Before After
Docs with keywords 492 (80%) 530 (84%)
Zero keywords 131 98
Mean ± std 9.9 ± 6.1 ~9.5 ± 5.5
Over 20 72 52

2.3. Round 2: systematic convergence (2026-06-01 / 2026-06-02)

Three parallel agent passes:

  1. Fill: added 6–10 keywords to 67 zero-keyword docs (weekly summaries, event indexes, SmallCon sessions, research stubs)
  2. Prune: cut 37 over-tagged docs from 21–63 keywords down to 10–12 (removed session-level noise, kept technology names)
  3. Bump: raised 11 under-4 docs to 4–8 keywords
Metric Before R2 After R2 Change
Docs 628 630 +2 (new notes)
With keywords 530 (84%) 587 (93%) +57
Zero keywords 98 43 −55
Over 20 52 0 −52
Under 4 11 2 −9
Mean ± std ~9.5 ± 5.5 8.9 ± 3.5 tighter
Max 63 20 −43

The remaining 43 zero-keyword docs are generated stubs (bot lab canaries, #+INCLUDE fragments, YYYY templates) that are excluded from org-publish and do not appear in search results.

2.4. Round 3 (planned)

  • Merge 6 near-duplicate keyword pairs (hyphen vs space)
  • Apply irregular lemmatization table (strategiesstrategy)
  • Run embedding similarity to find docs that should share keywords
  • Prune singletons (target: <10% singleton ratio)
  • Convergence criterion: singleton ratio <10%, no keyword >50 docs, every doc 4–16 keywords, merge candidates <5

3. Method

The convergence process is iterative, analogous to Newton's method for root-finding but applied to a tagged document set:

repeat:
  1. Count: keyword → number of docs
  2. Prune: remove keywords with count = 1
  3. Merge: normalize near-duplicates (hyphen/space, singular/plural)
  4. Split: replace too-broad keywords with specific sub-terms
  5. Add: keyword-less docs get 4–10 terms from content analysis
  6. Re-index
until distribution stabilizes (fixed point)

Each iteration reduces the singleton count and tightens the distribution. The process converges because:

  • Pruning removes noise without adding new terms
  • Merging reduces unique count without changing doc count
  • Splitting is bounded by the specificity of the content
  • Adding fills gaps that pruning created

3.1. Convergence criterion

The vocabulary has converged when:

  • Singleton ratio < 10% (currently 80%)
  • No keyword exceeds 50 docs
  • Every doc has at least 4 keywords
  • Merge candidates (Levenshtein distance ≤ 2) < 5

3.2. Exceptions to singleton pruning

Location keywords (seattle, cascais, portland, durham) are valid singletons — "what did I attend in Seattle?" is a real query with one correct answer. Similarly, person names, specific conference editions (elm-conf-2016), and unique identifiers should survive pruning even at count 1. The pruning rule is: remove singletons that are noise, not singletons that are specificity.

4. Lemmatization as post-processing

The keyword normalizer strips trailing s for plurals, but English has irregular forms: strategiesstrategy, indicesindex, searchingsearch. A full NLP lemmatizer (spaCy, NLTK WordNet lemmatizer) is overkill for 600 documents. Instead, a corpus-specific lookup table of known irregular forms converges faster:

;; Irregular form → base form (corpus-specific)
{"strategies" "strategy"
 "indices" "index"
 "architectures" "architecture"
 "vulnerabilities" "vulnerability"
 "dependencies" "dependency"
 "libraries" "library"
 "categories" "category"
 "hierarchies" "hierarchy"}

This table is part of the fixed-point iteration: each round discovers new irregular forms that the simple s-stripping misses. The table stabilizes when no new forms appear between iterations.

5. Semantic augmentation

Syntactic methods (substring matching, edit distance) miss semantic duplicates. Two documents about "sandboxing AI agents" and "FreeBSD jail isolation for coding assistants" share no keywords but are topically identical.

Embedding-based similarity fills this gap:

1. Embed each doc's title + description via nomic-embed-text:v1.5
   (768-dim, Ollama at 192.168.86.22:11434)
2. Compute pairwise cosine similarity
3. For doc pairs with similarity > 0.85 and keyword overlap < 2:
   suggest shared keywords from the union of both docs' terms

This finds connections the syntactic vocabulary misses and proposes keyword additions that increase the graph's connectivity.

6. Current state (2026-06-01)

Metric Value
Total docs 623
With keywords 492 (79%)
Without keywords 131 (21%)
Unique keywords 3285
Singletons 2645 (80%)
Useful range (2–50) 639 (19%)
Too broad (>50) 1 (clojure at 55)
Merge candidates 6 (hyphen vs space)
Embedding model nomic-embed-text:v1.5 (768-dim)

7. Related work

7.1. Folksonomies and controlled vocabularies

Vander Wal coined "folksonomy" (2004) to describe user-generated tagging without a controlled vocabulary. The power law distribution (few tags used many times, many tags used once) is the expected outcome. See:

  • Golder, S. A., & Huberman, B. A. (2006). "Usage Patterns of Collaborative Tagging Systems." Journal of Information Science, 32(2), 198–208.
  • Mathes, A. (2004). "Folksonomies — Cooperative Classification and Communication Through Shared Metadata." Computer Mediated Communication — LIS590CMC.

The convergence method here is the curation step that transforms a folksonomy into a controlled vocabulary over time.

7.2. Tag gardening

The practice of periodically reviewing and normalizing tags is called "tag gardening" in knowledge management. Tools like Obsidian, Notion, and Roam Research provide tag-merge and tag-rename operations.

The fixed-point framing adds rigor: instead of ad-hoc cleanup, define a measurable convergence criterion and iterate until met.

7.3. Library science

Controlled vocabularies (LCSH, MeSH, ACM CCS) are the library science equivalent. The key differences from our approach:

  • Controlled vocabularies are designed top-down before tagging begins
  • Folksonomies emerge bottom-up from usage
  • This method is bottom-up vocabulary refined toward top-down properties (bounded frequency, no duplicates, full coverage)

7.4. Embedding-based tag suggestion

Using embeddings to suggest tags is well-studied:

  • Belém, F. M., et al. (2017). "A Survey on Tag Recommendation Methods." JASIST, 68(4), 830–844.
  • Zhang, L., et al. (2019). "Tag2Vec: Learning Tag Representations in Tag Networks." WWW, 1219–1229.

The nomic-embed-text model provides dense representations without training a domain-specific model. The 768-dim embedding captures semantic similarity that keyword overlap misses.

7.5. NLP tooling approaches

Several NLP tools can augment the iterative convergence:

Tool Role How it helps
nomic-embed-text:v1.5 (Ollama) Semantic similarity Find doc pairs that should share keywords
spaCy NER Entity extraction Auto-tag locations, people, organizations
WordNet (NLTK) Hypernym trees Know clojure is-a language is-a functional programming for split decisions
TF-IDF Term importance Identify keywords that discriminate one doc from the corpus

WordNet is particularly useful for the "split too-broad" step: clojure at 55 docs is too broad, but WordNet's hypernym tree shows it is-a programming language (already at 20) and has-part clojurescript (21), clojure-conj (25), clojure.spec (3). The split replaces the broad term with specific sub-terms that already exist in the vocabulary.

The embedding model (768-dim via Ollama at 192.168.86.22:11434) handles semantic similarity without requiring WordNet or spaCy installed locally. SpaCy's NER would add structured entity types (GPE for locations, ORG for organizations) which help with the singleton exception rule (location keywords are valid singletons).

7.6. Similar implementations

  • Gwern.net uses hierarchical tags with a controlled vocabulary of ~200 terms, manually curated. Tags are applied at authoring time.
  • Simon Willison's blog uses flat tags, ~500 terms, with a tag-management interface. No automated convergence.
  • DEVONthink uses AI-based auto-tagging with a suggest-then-confirm workflow. The user approves or rejects suggestions.

8. Implementation

The convergence loop runs as a Babashka script against the search index:

$ bb -cp src scripts/keyword-converge.bb
Iteration 1: 3285 unique, 2645 singletons (80%), 6 merges
  Pruned 2645 singletons
  Merged 6 pairs (org mode → org-mode, etc.)
  Added keywords to 131 docs
Iteration 2: 847 unique, 112 singletons (13%), 0 merges
  Pruned 112 singletons
Iteration 3: 735 unique, 23 singletons (3%), 0 merges
  Converged: singleton ratio 3% < 10%

The script modifies org files in-place (updating #+KEYWORDS) and re-indexes after each iteration. The embedding step runs once after convergence to find semantic gaps.

9. Applicability

This method works for any tagged document corpus:

  • Personal wikis (Obsidian, org-roam, TiddlyWiki)
  • Blog post archives
  • Conference talk databases
  • Code repository topic tags
  • Bookmark collections (the original folksonomy use case)

The requirements are: a corpus of tagged documents, a way to count tag frequency, and a way to edit tags programmatically. The convergence criterion adapts to corpus size — the 2–50 range scales as ceil(log2(N)) to ceil(N/10) for N documents.