pocket-es: Specification
1. Scope
This document defines what a conforming pocket-es implementation must do, not how. An agent reading this spec should be able to build a compatible system in any language and verify it against the same test fixtures.
2. Index Schema
The index is a single JSON file. A conforming index has this shape:
{
  "_cluster": {
    "name": string,        // index identity
    "version": string,     // schema version
    "built_at": string,    // ISO 8601 timestamp
    "git_sha": string,     // source commit
    "doc_count": integer,  // total documents
    "vocab_size": integer, // unique terms after stopword removal
    "avg_dl": float        // average document length in tokens
  },
  "idf": {                 // term → IDF score (precomputed)
    "<term>": float,
    ...
  },
  "docs": [                // one entry per indexed org file
    {
      "_id": string,       // relative path without .org or /index
      "_dir": boolean,     // true if source was index.org (dir-form)
      "title": string,
      "date": string,      // YYYY-MM-DD or partial
      "keywords": [string],
      "description": string,
      "headings": [string],          // first 15 section headings
      "terms": {"<term>": integer},  // top 50 term frequencies
      "doc_len": integer             // total token count
    }
  ],
  "suggest_corpus": [string]  // prefix completion candidates
}
2.1. Invariants
- Every _id is unique within docs
- _dir is true iff the source file was named index.org
- doc_count equals length(docs)
- Every term in any doc's terms map exists as a key in idf
- avg_dl equals mean(doc.doc_len for doc in docs)
- No term in terms is a stopword (see Tokenization)
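The invariants that can be checked from the index file alone might be verified with a sketch like the one below. The function name `check_index`, the float tolerance, and the `stopwords` argument are illustrative assumptions, not part of the spec; the _dir invariant is omitted because it needs the source tree, which the index does not carry.

```python
from statistics import mean

def check_index(index, stopwords=frozenset(), tolerance=1e-6):
    """Return a list of invariant violations; an empty list means the index passes.

    `tolerance` is an assumed allowance for float rounding in avg_dl.
    """
    cluster = index["_cluster"]
    docs = index["docs"]
    errors = []

    ids = [doc["_id"] for doc in docs]
    if len(ids) != len(set(ids)):
        errors.append("duplicate _id")
    if cluster["doc_count"] != len(docs):
        errors.append("doc_count != length(docs)")
    if docs and abs(cluster["avg_dl"] - mean(d["doc_len"] for d in docs)) > tolerance:
        errors.append("avg_dl != mean(doc_len)")

    for doc in docs:
        for term in doc["terms"]:
            if term not in index["idf"]:
                errors.append(f"{doc['_id']}: term {term!r} missing from idf")
            if term in stopwords:
                errors.append(f"{doc['_id']}: stopword {term!r} in terms")
    return errors
```

Passing the canonical stopword set as `stopwords` covers the last invariant; with the default empty set that check is effectively skipped.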
3. URL Resolution
The _id and _dir fields determine the live URL:
| _dir | URL pattern | Example |
|---|---|---|
| true | /<_id>/ | /research/pocket-es/ |
| false | /<_id>.html | /research/local-first.html |
3.1. Invariant
Every URL produced by this rule MUST return HTTP 200 on the live site. This is the contract boundary between org-publish (which writes files), Apache (which serves them), and the search client (which links to them).
A verification pass:
for doc in index.docs:
    url = "/" + doc._id + ("/" if doc._dir else ".html")
    assert http_get(url).status == 200
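A runnable variant of that pass could separate the URL rule from the HTTP call, so the rule can be tested offline. The names `doc_url`, `broken_urls`, and the `status_of` callback are hypothetical; how the status is fetched (and against which host) is left to the caller.

```python
def doc_url(doc):
    """Map an index entry to its site-relative URL using the _dir rule."""
    return "/" + doc["_id"] + ("/" if doc["_dir"] else ".html")

def broken_urls(docs, status_of):
    """Return every resolved URL whose status is not 200.

    `status_of` is a caller-supplied function: url -> integer HTTP status.
    """
    return [url for url in map(doc_url, docs) if status_of(url) != 200]
```

For example, `doc_url({"_id": "research/pocket-es", "_dir": True})` yields `/research/pocket-es/`, matching the table above. An empty list from `broken_urls` means the invariant holds.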
4. Tokenization
Tokenization is the shared contract between the indexer and every client. A conforming tokenizer satisfies these invariants:
4.1. Invariants (property-testable)
- No stopwords in output. For any input string s: intersection(tokenize(s), STOPWORDS) = {}
- All lowercase. For any input string s: every token in tokenize(s) equals lowercase(token)
- Idempotent on output. For any input string s: tokenize(join(" ", tokenize(s))) = tokenize(s)
- Bounded term frequencies. For any input s and limit n: length(term_frequencies(s, n)) <= n
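These four properties can be exercised with a small randomized harness. The sketch below uses only the standard library and takes the tokenizer under test as arguments; the input generator, the function names, and the assumption that `tokenize` returns a list are all illustrative choices rather than anything the spec mandates.

```python
import random

def random_text(rng, stopwords):
    """Build a messy input mixing stopwords, mixed-case words, and org markup."""
    words = sorted(stopwords) + ["Search", "INDEX", "bm25", "x", "org-mode", "#+TITLE:"]
    seps = [" ", "\n", ", ", ". ", "  "]
    count = rng.randint(0, 20)
    return "".join(rng.choice(words) + rng.choice(seps) for _ in range(count))

def check_tokenizer(tokenize, term_frequencies, stopwords, iterations=300, seed=0):
    """Assert the four tokenizer invariants on `iterations` random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(iterations):
        s = random_text(rng, stopwords)
        tokens = tokenize(s)
        assert not set(tokens) & set(stopwords), "stopword in output"
        assert all(t == t.lower() for t in tokens), "non-lowercase token"
        assert tokenize(" ".join(tokens)) == tokens, "not idempotent on output"
        n = rng.randint(0, 10)
        assert len(term_frequencies(s, n)) <= n, "term frequencies exceed limit"
    return True
```

A dedicated property-testing library would shrink failing inputs automatically; this version only trades that convenience for zero dependencies.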
4.2. Stopwords
The canonical stopword list is defined in src/pocket_es/token.cljc.
Any conforming implementation must use the same set.
4.3. Normalization
- Strip org-mode markup (#+ headers, :PROPERTIES: drawers, source block delimiters)
- Split on whitespace and punctuation
- Lowercase
- Remove tokens shorter than 2 characters
- Remove stopwords
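The steps above can be sketched as a short pipeline. Several details here are assumptions made for illustration: the stopword set is a placeholder (the canonical list lives in src/pocket_es/token.cljc and is not reproduced), whole `#+` lines are dropped rather than just the marker, tokens are restricted to ASCII letters and digits, and ties in `term_frequencies` are broken by first occurrence.

```python
import re
from collections import Counter

# Placeholder stopwords for illustration only; NOT the canonical list.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})

# :PROPERTIES: ... :END: drawers, matched across lines.
_DRAWER = re.compile(
    r"^[ \t]*:PROPERTIES:[ \t]*\n.*?^[ \t]*:END:[ \t]*$", re.M | re.S | re.I
)
# Lines starting with #+ (keywords such as #+TITLE: and source block delimiters).
_KEYWORD_LINE = re.compile(r"^[ \t]*#\+.*$", re.M)

def tokenize(text):
    """Strip org markup, lowercase, split, then drop short tokens and stopwords."""
    text = _DRAWER.sub(" ", text)
    text = _KEYWORD_LINE.sub(" ", text)
    # Lowercasing before splitting is equivalent to the listed order for ASCII input.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]

def term_frequencies(text, n=50):
    """Return at most n (term -> count) pairs, most frequent first."""
    return dict(Counter(tokenize(text)).most_common(n))
```

Because the output contains only lowercase alphanumeric tokens of length 2 or more that are not stopwords, re-tokenizing the space-joined output returns the same list, which is what the idempotence invariant requires.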
5. BM25 Scoring
Parameters: k1 = 1.2, b = 0.75.
For a single term t against document d:
tf = doc.terms[t] or 0
dl = doc.doc_len
idf_v = idf[t] or 0
numer = tf * (k1 + 1)
denom = tf + k1 * (1 - b + b * dl / avg_dl)
score = idf_v * numer / denom (0 if denom == 0)
Multi-term queries sum per-term scores. Field boosts multiply the sum.
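As a minimal sketch, the formula translates to Python as follows. The function names and the single `boost` argument are hypothetical; the spec says field boosts multiply the sum but does not fix an API for passing them.

```python
K1 = 1.2   # term-frequency saturation
B = 0.75   # length normalization strength

def bm25_term(term, doc, idf, avg_dl, k1=K1, b=B):
    """Score one term against one doc entry from the index."""
    tf = doc["terms"].get(term, 0)
    idf_v = idf.get(term, 0.0)
    numer = tf * (k1 + 1)
    # Assumes avg_dl > 0 (an index with at least one non-empty document).
    denom = tf + k1 * (1 - b + b * doc["doc_len"] / avg_dl)
    return idf_v * numer / denom if denom else 0.0

def bm25_query(terms, doc, idf, avg_dl, boost=1.0):
    """Sum per-term scores for already-tokenized query terms, then apply the boost."""
    return boost * sum(bm25_term(t, doc, idf, avg_dl) for t in terms)
```

Worked by hand for tf = 3, dl = avg_dl = 100 and idf = 2.0: numer = 3 × 2.2 = 6.6, denom = 3 + 1.2 × 1 = 4.2, so the score is 2.0 × 6.6 / 4.2 ≈ 3.14. Doubling dl to 200 raises denom to 5.1 and lowers the score to about 2.59.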
5.1. Invariants
- Deterministic. Same index + same query = same scores
- Zero for absent terms. If tf = 0 then score = 0
- Monotonic in tf. Higher term frequency = higher score (given same idf, dl)
- Longer docs score lower. Given same tf and idf, score(dl=100) > score(dl=200)
6. Query DSL
A conforming implementation must support these query types:
| Query type | Required fields | Behavior |
|---|---|---|
| match | field, implicit _all | tokenize query, BM25 score per term, sum |
| term | field | exact match on keyword/field array |
| bool | must, should, must_not, filter | intersect/union/exclude/filter |
| multi_match | query, fields (with optional ^boost) | match across fields with boost weights |
| match_all | (none) | return all documents, score 1.0 |
| prefix | field | prefix scan on string or array fields |
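For multi_match, the optional ^boost suffix on a field name has to be split off before scoring. The helper below is a sketch under two assumptions the spec does not state: a field without a suffix gets a boost of 1.0, and the suffix is written as in `title^2`. The name `parse_field` is hypothetical.

```python
def parse_field(spec, default_boost=1.0):
    """Split 'title^2' into ('title', 2.0); a bare 'title' gets the default boost."""
    name, sep, boost = spec.partition("^")
    # float() raises ValueError on a malformed suffix such as 'title^' or 'title^x'.
    return name, (float(boost) if sep else default_boost)
```

With this, `["title^2", "description"]` becomes `[("title", 2.0), ("description", 1.0)]`, and each field's score can be multiplied by its weight before the fields are combined.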
6.1. Suggest
suggest({text, size}) returns up to size strings from the suggest
corpus that start with text (case-insensitive prefix match).
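A minimal sketch of this behavior is shown below. The spec passes a single map; here the corpus is an explicit first argument for testability, results keep corpus order, and an empty `text` matches every candidate. All three are assumptions.

```python
from itertools import islice

def suggest(corpus, text, size):
    """Return up to `size` corpus entries that start with `text`, ignoring case."""
    needle = text.casefold()
    matches = (c for c in corpus if c.casefold().startswith(needle))
    # max() guards against a negative size, which islice would reject.
    return list(islice(matches, max(size, 0)))
```

For instance, `suggest(["Search", "seattle", "BM25", "sea"], "SE", 2)` returns `["Search", "seattle"]`.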
7. State Machine (UI)
The browser UI manages five atoms and a URL with three parameters:
| Atom | Type | Default | URL param |
|---|---|---|---|
| query-text | string | "" | q |
| page | integer | 0 | p (1-indexed in URL) |
| date-filter | string or nil | nil | d |
| results | map or nil | nil | (derived) |
| suggestions | vector | [] | (derived) |
7.1. Transitions
| Action | query | date-filter | page | URL method |
|---|---|---|---|---|
| Type in search box | new | keep | reset | replaceState |
| Click suggestion (different) | new | keep | reset | pushState |
| Click suggestion (same/active) | clear | keep | reset | pushState |
| Click "Try" link | new | reset | reset | pushState |
| Click keyword tag | new | reset | reset | pushState |
| Click date filter | keep | new | reset | pushState |
| Click prev/next | keep | keep | inc/dec | pushState |
| Press Escape | clear | keep | reset | replaceState |
| Browser back/forward | from URL | from URL | from URL | (popstate) |
| Page load (init) | from URL | from URL | from URL | replaceState |
7.2. Invariant
After any transition, sync-url! runs. The URL in the address bar
must reflect the current values of q, d, p. A page refresh at
that URL must restore identical state.
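The round-trip requirement can be illustrated with a pair of pure functions over the three URL-backed atoms. This is a sketch, not the behavior of sync-url! itself: the function names are hypothetical, and omitting parameters that hold their default value is an assumption. The only detail taken directly from the table is that page is 0-indexed in state and 1-indexed in the URL.

```python
from urllib.parse import parse_qs, urlencode

def state_to_query(state):
    """Serialize query-text, date-filter and page into a URL query string."""
    params = {}
    if state["query-text"]:
        params["q"] = state["query-text"]
    if state["date-filter"] is not None:
        params["d"] = state["date-filter"]
    if state["page"] > 0:
        params["p"] = state["page"] + 1  # 1-indexed in the URL
    return urlencode(params)

def query_to_state(query_string):
    """Rebuild the state map from a query string, falling back to defaults."""
    raw = parse_qs(query_string)
    try:
        page = int(raw.get("p", ["1"])[0]) - 1
    except ValueError:
        page = 0  # unparseable p falls back to the first page
    return {
        "query-text": raw.get("q", [""])[0],
        "date-filter": raw.get("d", [None])[0],
        "page": max(page, 0),
    }
```

The invariant then reads as `query_to_state(state_to_query(s)) == s` for any reachable state `s`, which is straightforward to property-test.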
8. Consumers
A conforming index supports these access patterns:
| Consumer | Runtime | Entry point | Scores with |
|---|---|---|---|
| Browser | ClojureScript | pocket-es.js (self-injecting) | BM25 in CLJS |
| Emacs | Elisp | pocket-es.el | BM25 in elisp |
| Node CLI | Node.js | scripts/search.js via dist/pocket-es.js | BM25 in compiled CLJS |
| Babashka | bb | bb -cp src -m pocket-es.cli | BM25 in token.cljc + cli.clj |
| Console | Browser DevTools | pocketES.search(...) | BM25 in CLJS |
All consumers must produce identical ranking for the same query against the same index. The BM25 parameters (k1, b) and the tokenizer (stopwords, normalization) are the shared contract.
9. Verification
A conforming deployment passes these checks:
- Index integrity. doc_count = length(docs), all IDF terms present
- URL resolution. Every _id resolves to HTTP 200 using the _dir rule
- Tokenization contract. Property tests pass (4 invariants, 300+ iterations)
- Scoring determinism. Same query returns same results across consumers
- State round-trip. URL with q, d, p params restores identical view
- No stopwords in index. No term in any doc's terms map is a stopword
- Date sanity. No doc has date matching the AI review date (2024-08-11)
- Description tone. No description contains slop markers (delve, comprehensive, explore, etc.)