pocket-es: Specification

1. Scope

This document defines what a conforming pocket-es implementation must do, not how. An agent reading this spec should be able to build a compatible system in any language and verify it against the same test fixtures.

2. Index Schema

The index is a single JSON file. A conforming index has this shape:

{
  "_cluster": {
    "name":       string,     // index identity
    "version":    string,     // schema version
    "built_at":   string,     // ISO 8601 timestamp
    "git_sha":    string,     // source commit
    "doc_count":  integer,    // total documents
    "vocab_size": integer,    // unique terms after stopword removal
    "avg_dl":     float       // average document length in tokens
  },
  "idf": {                   // term → IDF score (precomputed)
    "<term>": float,
    ...
  },
  "docs": [                   // one entry per indexed org file
    {
      "_id":         string,  // relative path without .org or /index
      "_dir":        boolean, // true if source was index.org (dir-form)
      "title":       string,
      "date":        string,  // YYYY-MM-DD or partial
      "keywords":    [string],
      "description": string,
      "headings":    [string],  // first 15 section headings
      "terms":       {"<term>": integer},  // top 50 term frequencies
      "doc_len":     integer  // total token count
    }
  ],
  "suggest_corpus": [string]  // prefix completion candidates
}

2.1. Invariants

  1. Every _id is unique within docs
  2. _dir is true iff the source file was named index.org
  3. doc_count equals length(docs)
  4. Every term in any doc's terms map exists as a key in idf
  5. avg_dl equals mean(doc.doc_len for doc in docs)
  6. No term in terms is a stopword (see Tokenization)
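
A minimal sketch of how invariants 1, 3, 4, 5, and 6 could be checked against a parsed index. The function name check_index, the stopwords argument, and the float tolerance are assumptions for illustration. Invariant 2 is omitted because it needs the source file names, which the index does not carry.

```python
from math import isclose

def check_index(index, stopwords=frozenset(), rel_tol=1e-6):
    """Return a list of violated invariants (an empty list means the index passes)."""
    errors = []
    docs = index["docs"]
    idf = index["idf"]

    # Invariant 1: every _id is unique.
    ids = [doc["_id"] for doc in docs]
    if len(ids) != len(set(ids)):
        errors.append("1: duplicate _id")

    # Invariant 3: doc_count equals length(docs).
    if index["_cluster"]["doc_count"] != len(docs):
        errors.append("3: doc_count != length(docs)")

    # Invariants 4 and 6: every term has an IDF entry and is not a stopword.
    for doc in docs:
        for term in doc["terms"]:
            if term not in idf:
                errors.append(f"4: {doc['_id']}: term {term!r} missing from idf")
            if term in stopwords:
                errors.append(f"6: {doc['_id']}: stopword {term!r} in terms")

    # Invariant 5: avg_dl equals the mean doc_len (the tolerance is an assumption).
    if docs:
        mean_dl = sum(doc["doc_len"] for doc in docs) / len(docs)
        if not isclose(index["_cluster"]["avg_dl"], mean_dl, rel_tol=rel_tol):
            errors.append("5: avg_dl != mean(doc_len)")
    return errors
```

If the indexer rounds avg_dl before writing it, rel_tol would need to be loosened to match.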

3. URL Resolution

The _id and _dir fields determine the live URL:

_dir   URL pattern  Example
-----  -----------  --------------------------
true   /<_id>/      /research/pocket-es/
false  /<_id>.html  /research/local-first.html

3.1. Invariant

Every URL produced by this rule MUST return HTTP 200 on the live site. This is the contract boundary between org-publish (which writes files), Apache (which serves them), and the search client (which links to them).

A verification pass:

for doc in index.docs:
    url = "/" + doc._id + ("/" if doc._dir else ".html")
    assert http_get(url).status == 200
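
The same pass can be sketched as runnable Python by separating the URL rule from the HTTP fetch. The names doc_url, broken_urls, and fetch_status are hypothetical; fetch_status is supplied by the caller, so the rule can be tested without network access.

```python
def doc_url(doc):
    """Map an index entry to its site-relative URL using the _dir rule."""
    return "/" + doc["_id"] + ("/" if doc["_dir"] else ".html")

def broken_urls(docs, fetch_status):
    """Return every URL whose status is not 200.

    fetch_status(url) -> int is a caller-supplied function (an assumption),
    for example a thin wrapper around an HTTP client pointed at the live site.
    """
    return [url for url in map(doc_url, docs) if fetch_status(url) != 200]
```

A conforming deployment would yield an empty list from broken_urls.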

4. Tokenization

Tokenization is the shared contract between the indexer and every client. A conforming tokenizer satisfies these invariants:

4.1. Invariants (property-testable)

  1. No stopwords in output. For any input string s: intersection(tokenize(s), STOPWORDS) = {}
  2. All lowercase. For any input string s: every token in tokenize(s) equals lowercase(token)
  3. Idempotent on output. For any input string s: tokenize(join(" ", tokenize(s))) = tokenize(s)
  4. Bounded term frequencies. For any input string s and limit n: length(term_frequencies(s, n)) <= n

4.2. Stopwords

The canonical stopword list is defined in src/pocket_es/token.cljc. Any conforming implementation must use the same set.

4.3. Normalization

  1. Strip org-mode markup (#+ headers, :PROPERTIES: drawers, source block delimiters)
  2. Split on whitespace and punctuation
  3. Lowercase
  4. Remove tokens shorter than 2 characters
  5. Remove stopwords
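
The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the reference tokenizer. STOPWORDS is a placeholder (the canonical list lives in src/pocket_es/token.cljc). "Punctuation" is taken to mean any character that is not an ASCII letter or digit. The helper name strip_org_markup is hypothetical.

```python
import re
from collections import Counter

# Placeholder only: the canonical set is defined in src/pocket_es/token.cljc.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

def strip_org_markup(text):
    """Drop #+ lines (headers and source block delimiters) and :PROPERTIES: drawers."""
    kept, in_drawer = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if in_drawer:
            # Skip drawer contents, including the closing :END: line.
            in_drawer = stripped.upper() != ":END:"
            continue
        if stripped.upper() == ":PROPERTIES:":
            in_drawer = True
            continue
        if stripped.startswith("#+"):
            continue
        kept.append(line)
    return "\n".join(kept)

def tokenize(text):
    """Steps 1-5: strip markup, split, lowercase, drop short tokens, drop stopwords."""
    words = re.split(r"[^0-9A-Za-z]+", strip_org_markup(text))
    lowered = (word.lower() for word in words)
    return [word for word in lowered if len(word) >= 2 and word not in STOPWORDS]

def term_frequencies(text, n):
    """Return at most n (term -> count) entries, most frequent first."""
    return dict(Counter(tokenize(text)).most_common(n))
```

Because every output token is lowercase, alphanumeric, at least 2 characters long, and not a stopword, re-tokenizing the joined output returns the same list, which is the idempotence invariant in 4.1.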

5. BM25 Scoring

Parameters: k1 = 1.2, b = 0.75.

For a single term t against document d:

tf    = doc.terms[t] or 0
dl    = doc.doc_len
idf_v = idf[t] or 0
numer = tf * (k1 + 1)
denom = tf + k1 * (1 - b + b * dl / avg_dl)
score = idf_v * numer / denom   (0 if denom == 0)

Multi-term queries sum per-term scores. Field boosts multiply the sum.
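
A direct Python transcription of the formula, offered as a sketch. The function names bm25_term and bm25_query are hypothetical, a single scalar boost stands in for "field boosts", and avg_dl is assumed to be greater than 0.

```python
K1 = 1.2
B = 0.75

def bm25_term(term, doc, idf, avg_dl, k1=K1, b=B):
    """Score one term against one document entry (assumes avg_dl > 0)."""
    tf = doc["terms"].get(term, 0)
    idf_v = idf.get(term, 0.0)
    numer = tf * (k1 + 1)
    denom = tf + k1 * (1 - b + b * doc["doc_len"] / avg_dl)
    return idf_v * numer / denom if denom else 0.0

def bm25_query(terms, doc, idf, avg_dl, boost=1.0):
    """Sum the per-term scores, then multiply the sum by the boost."""
    return boost * sum(bm25_term(t, doc, idf, avg_dl) for t in terms)
```

With the default parameters the length-normalization factor is at least 1 - b = 0.25, so denom is strictly positive and the zero guard only matters for non-default k1 or b.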

5.1. Invariants

  1. Deterministic. Same index + same query = same scores
  2. Zero for absent terms. If tf = 0 then score = 0
  3. Monotonic in tf. Higher term frequency = higher score (given same idf, dl)
  4. Longer docs score lower. Given same tf and idf, score(dl=100) > score(dl=200)

6. Query DSL

A conforming implementation must support these query types:

Query type   Required fields                       Behavior
-----------  ------------------------------------  ----------------------------------------
match        field, implicit _all                  tokenize query, BM25 score per term, sum
term         field                                 exact match on keyword/field array
bool         must, should, must_not, filter        intersect/union/exclude/filter
multi_match  query, fields (with optional ^boost)  match across fields with boost weights
match_all    (none)                                return all documents, score 1.0
prefix       field                                 prefix scan on string or array fields
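
The bool row can be illustrated with set operations over document ids. This sketch covers matching only, not scoring. It assumes that should clauses widen the result only when no must or filter clause is present, a choice this section does not spell out. The name bool_ids and its parameters are hypothetical.

```python
def bool_ids(all_ids, must=(), should=(), must_not=(), filters=()):
    """Combine clause results, where each clause is a set of matching doc ids."""
    required = list(must) + list(filters)
    if required:
        # must and filter: a document has to match every clause.
        hits = set.intersection(*map(set, required))
    elif should:
        # should alone: a document has to match at least one clause (assumption).
        hits = set().union(*should)
    else:
        hits = set(all_ids)
    # must_not: drop anything an excluded clause matched.
    for excluded in must_not:
        hits -= set(excluded)
    return hits
```

In a full implementation must and should clauses would also contribute BM25 scores, while filter and must_not would only restrict the candidate set.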

6.1. Suggest

suggest({text, size}) returns up to size strings from the suggest corpus that start with text (case-insensitive prefix match).
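
A minimal sketch of the suggest contract. Passing the corpus as an explicit argument and returning matches in corpus order are assumptions; the spec fixes only the prefix rule and the size cap.

```python
def suggest(corpus, text, size):
    """Return up to `size` corpus strings that start with `text`, ignoring case."""
    prefix = text.casefold()
    return [entry for entry in corpus if entry.casefold().startswith(prefix)][:size]
```

For example, with a corpus of ["BM25", "bm25 scoring", "Babashka"], the call suggest(corpus, "bm", 5) returns ["BM25", "bm25 scoring"].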

7. State Machine (UI)

The browser UI manages five atoms and a URL with three parameters:

Atom         Type           Default  URL param
-----------  -------------  -------  --------------------
query-text   string         ""       q
page         integer        0        p (1-indexed in URL)
date-filter  string or nil  nil      d
results      map or nil     nil      (derived)
suggestions  vector         []       (derived)

7.1. Transitions

Action                          query     date-filter  page      URL method
------------------------------  --------  -----------  --------  ------------
Type in search box              new       keep         reset     replaceState
Click suggestion (different)    new       keep         reset     pushState
Click suggestion (same/active)  clear     keep         reset     pushState
Click "Try" link                new       reset        reset     pushState
Click keyword tag               new       reset        reset     pushState
Click date filter               keep      new          reset     pushState
Click prev/next                 keep      keep         inc/dec   pushState
Press Escape                    clear     keep         reset     replaceState
Browser back/forward            from URL  from URL     from URL  (popstate)
Page load (init)                from URL  from URL     from URL  replaceState
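
The user-driven rows of the transition table can be expressed as data plus one pure function. This is a sketch: the action keys, the state dictionary shape, and clamping the page at 0 are assumptions. The two "from URL" rows are left out because they rebuild state from the address bar rather than from the previous state.

```python
# One entry per user-driven row: (query, date-filter, page, URL method).
# The action keys are hypothetical names for the table rows.
TRANSITIONS = {
    "type":              ("new",   "keep",  "reset", "replaceState"),
    "suggestion":        ("new",   "keep",  "reset", "pushState"),
    "suggestion-active": ("clear", "keep",  "reset", "pushState"),
    "try-link":          ("new",   "reset", "reset", "pushState"),
    "keyword-tag":       ("new",   "reset", "reset", "pushState"),
    "date-filter":       ("keep",  "new",   "reset", "pushState"),
    "prev":              ("keep",  "keep",  "dec",   "pushState"),
    "next":              ("keep",  "keep",  "inc",   "pushState"),
    "escape":            ("clear", "keep",  "reset", "replaceState"),
}

def apply_action(state, action, value=None):
    """Return (new_state, url_method) without mutating the input state."""
    q_op, d_op, p_op, method = TRANSITIONS[action]
    query = {"new": value, "clear": "", "keep": state["query"]}[q_op]
    date = {"new": value, "reset": None, "keep": state["date"]}[d_op]
    page = {
        "reset": 0,
        "inc": state["page"] + 1,
        "dec": max(state["page"] - 1, 0),  # clamping at 0 is an assumption
    }[p_op]
    return {"query": query, "date": date, "page": page}, method
```

Keeping the table as data makes it straightforward to assert that every action except prev/next resets the page to 0.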

7.2. Invariant

After any transition, sync-url! runs. The URL in the address bar must reflect the current values of q, d, p. A page refresh at that URL must restore identical state.
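
The round-trip requirement can be sketched as a pair of inverse functions. The function names and the choice to omit parameters that hold their default values are assumptions. The conversion between the 0-indexed page atom and the 1-indexed p parameter follows the atom table above.

```python
from urllib.parse import parse_qs, urlencode

def state_to_query(state):
    """Serialize q, d, p; parameters at their default value are omitted (assumption)."""
    params = {}
    if state["query"]:
        params["q"] = state["query"]
    if state["date"] is not None:
        params["d"] = state["date"]
    if state["page"] > 0:
        params["p"] = state["page"] + 1  # 0-indexed atom, 1-indexed URL
    return urlencode(params)

def query_to_state(query_string):
    """Inverse of state_to_query; assumes p, when present, is a valid integer."""
    params = parse_qs(query_string)
    page = int(params.get("p", ["1"])[0]) - 1
    return {
        "query": params.get("q", [""])[0],
        "date": params.get("d", [None])[0],
        "page": max(page, 0),
    }
```

The invariant then reads query_to_state(state_to_query(s)) == s for every reachable state s.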

8. Consumers

A conforming index supports these access patterns:

Consumer  Runtime           Entry point                              Scores with
--------  ----------------  ---------------------------------------  ----------------------------
Browser   ClojureScript     pocket-es.js (self-injecting)            BM25 in CLJS
Emacs     Elisp             pocket-es.el                             BM25 in elisp
Node CLI  Node.js           scripts/search.js via dist/pocket-es.js  BM25 in compiled CLJS
Babashka  bb                bb -cp src -m pocket-es.cli              BM25 in token.cljc + cli.clj
Console   Browser DevTools  pocketES.search(...)                     BM25 in CLJS

All consumers must produce identical ranking for the same query against the same index. The BM25 parameters (k1, b) and the tokenizer (stopwords, normalization) are the shared contract.

9. Verification

A conforming deployment passes these checks:

  1. Index integrity. doc_count = length(docs), and every term in any doc's terms map is a key in idf
  2. URL resolution. Every _id resolves to HTTP 200 using the _dir rule
  3. Tokenization contract. Property tests pass (4 invariants, 300+ iterations)
  4. Scoring determinism. Same query returns same results across consumers
  5. State round-trip. URL with q, d, p params restores identical view
  6. No stopwords in index. No term in any doc's terms map is a stopword
  7. Date sanity. No doc has date matching the AI review date (2024-08-11)
  8. Description tone. No description contains slop markers (delve, comprehensive, explore, etc.)
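
Checks 7 and 8 can be sketched as a single scan over docs. Exact date equality and whole-word, case-insensitive marker matching are assumptions. SLOP_MARKERS holds only the three markers named above; the spec's list is open-ended and is not extended here. The function name content_violations is hypothetical.

```python
import re

AI_REVIEW_DATE = "2024-08-11"
# Only the markers the spec names; its list ends in "etc." and is left open here.
SLOP_MARKERS = ("delve", "comprehensive", "explore")

def content_violations(docs):
    """Return (doc _id, check name) pairs for checks 7 and 8."""
    found = []
    for doc in docs:
        # Check 7: the date must not equal the AI review date.
        if doc.get("date") == AI_REVIEW_DATE:
            found.append((doc["_id"], "date"))
        # Check 8: no slop marker may appear as a whole word in the description.
        words = set(re.findall(r"[a-z]+", doc.get("description", "").lower()))
        if any(marker in words for marker in SLOP_MARKERS):
            found.append((doc["_id"], "description"))
    return found
```

Whole-word matching means inflected forms such as "explorer" are not flagged; a stricter deployment might match on prefixes instead.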