pocket-es: Specification
1. Scope
This document defines what a conforming pocket-es implementation must do, not how. An agent reading this spec should be able to build a compatible system in any language and verify it against the same test fixtures.
2. Two schemas: producer and consumer
pocket-es has exactly two wire contracts, each independently versioned:
| Schema | Producer | Consumer | Version field | Current |
|---|---|---|---|---|
| Index schema | the indexer (`indexer.clj`) | every query runtime (the store) | `_cluster.version` | 2 |
| Search request schema | input surfaces (free / JSON / phase-B) | `search()` (the executor) | `$schema_version` | 1 |
The index schema describes data at rest; the search request schema describes
the query in flight. They version separately because they change for different
reasons: the index schema changes when the indexer emits new fields, the request
schema when search() accepts a new query shape. A conforming deployment
documents and validates both.
3. Index Schema (producer ↔ store)
The index is a single JSON file. A conforming index has this shape:
    {
      "_cluster": {
        "name": string,                 // index identity
        "version": integer,             // index-schema version (currently 2)
        "built_at": string,             // ISO 8601 timestamp
        "git_sha": string,              // source commit
        "doc_count": integer,           // total documents
        "vocab_size": integer,          // unique terms after stopword removal
        "avg_dl": float                 // average document length in tokens
      },
      "idf": {                          // term → IDF score (precomputed)
        "<term>": float,
        ...
      },
      "docs": [                         // one entry per indexed org file
        {
          "_id": string,                // relative path without .org or /index
          "_dir": boolean,              // true if source was index.org (dir-form)
          "title": string,
          "date": string,               // YYYY-MM-DD or partial
          "keywords": [string],
          "description": string,
          "headings": [string],         // first 15 section headings
          "terms": {"<term>": integer}, // top 50 term frequencies
          "doc_len": integer            // total token count
        }
      ],
      "suggest_corpus": [string]        // prefix completion candidates
    }
3.1. Invariants
- Every `_id` is unique within `docs`
- `_dir` is `true` iff the source file was named `index.org`
- `doc_count` equals `length(docs)`
- Every term in any doc's `terms` map exists as a key in `idf`
- `avg_dl` equals `mean(doc.doc_len for doc in docs)`
- No term in `terms` is a stopword (see Tokenization)
- `_cluster.version` is present and an integer; a consumer that does not recognize the version refuses the index rather than mis-parsing it
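The invariants above can be checked mechanically. The following is a minimal sketch, not the project's own validator: the function name `check_index`, the `SUPPORTED_INDEX_VERSIONS` set, and the float tolerance used for `avg_dl` are assumptions made for illustration. The `_dir` invariant is omitted because it cannot be checked from the index alone (it needs the source tree).

```python
from math import isclose

# Assumption: this hypothetical consumer understands index-schema version 2 only.
SUPPORTED_INDEX_VERSIONS = {2}


def check_index(index, stopwords=frozenset()):
    """Return a list of invariant violations; an empty list means none were found."""
    cluster = index["_cluster"]
    version = cluster.get("version")
    # Refuse an unrecognized version outright instead of guessing at the layout.
    if not isinstance(version, int) or version not in SUPPORTED_INDEX_VERSIONS:
        raise ValueError(f"unrecognized index-schema version: {version!r}")

    docs, idf = index["docs"], index["idf"]
    problems = []

    ids = [doc["_id"] for doc in docs]
    if len(ids) != len(set(ids)):
        problems.append("duplicate _id")
    if cluster["doc_count"] != len(docs):
        problems.append("doc_count != length(docs)")
    if docs:
        mean_dl = sum(doc["doc_len"] for doc in docs) / len(docs)
        # Assumption: avg_dl may have been rounded when serialized, so compare loosely.
        if not isclose(cluster["avg_dl"], mean_dl, rel_tol=1e-6):
            problems.append("avg_dl != mean(doc_len)")

    for doc in docs:
        for term in doc["terms"]:
            if term not in idf:
                problems.append(f"{doc['_id']}: term {term!r} missing from idf")
            if term in stopwords:
                problems.append(f"{doc['_id']}: stopword {term!r} in terms")
    return problems
```

Raising on an unknown version while merely collecting the other violations mirrors the distinction in the last invariant: a version mismatch means the rest of the file cannot be trusted, so there is nothing further worth reporting.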
3.2. Conformance (confirmed)
The live index at site/static/search-index.json conforms:
    $ jq '._cluster | {version, doc_count}' site/static/search-index.json
    { "version": 2, "doc_count": 618 }
    $ jq '._cluster.doc_count == (.docs|length)' site/static/search-index.json
    true
Schema history:
| version | change |
|---|---|
| 1 | initial: _cluster, idf, docs, suggest_corpus |
| 2 | added _dir per doc (dir-form URL resolution) |
4. URL Resolution
The _id and _dir fields determine the live URL:
| `_dir` | URL pattern | Example |
|---|---|---|
| `true` | `/<_id>/` | `/research/pocket-es/` |
| `false` | `/<_id>.html` | `/research/local-first.html` |
4.1. Invariant
Every URL produced by this rule MUST return HTTP 200 on the live site. This is the contract boundary between org-publish (which writes files), Apache (which serves them), and the search client (which links to them).
A verification pass:
    for doc in index.docs:
        url = "/" + doc._id + ("/" if doc._dir else ".html")
        assert http_get(url).status == 200
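A runnable variant of that pass might look like the sketch below. The names `resolve_url` and `broken_urls` are hypothetical, and the HTTP client is deliberately left to the caller as a `fetch_status` callable, since this spec does not prescribe one.

```python
def resolve_url(doc):
    """Map an index doc to its site-relative URL using the _dir rule."""
    return "/" + doc["_id"] + ("/" if doc["_dir"] else ".html")


def broken_urls(index, fetch_status):
    """Return every resolved URL whose status is not 200.

    fetch_status is a caller-supplied function: URL path -> HTTP status code.
    """
    urls = (resolve_url(doc) for doc in index["docs"])
    return [url for url in urls if fetch_status(url) != 200]
```

Collecting all failures, instead of asserting on the first one, makes a single run report every broken link.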
5. Tokenization
Tokenization is the shared contract between the indexer and every client. A conforming tokenizer satisfies these invariants:
5.1. Invariants (property-testable)
- No stopwords in output. For any input string `s`: `intersection(tokenize(s), STOPWORDS) == {}`
- All lowercase. For any input string `s`: every token in `tokenize(s)` equals `lowercase(token)`
- Idempotent on output. For any input string `s`: `tokenize(join(" ", tokenize(s))) == tokenize(s)`
- Bounded term frequencies. For any input and limit `n`: `length(term_frequencies(s, n)) <= n`
5.2. Stopwords
The canonical stopword list is defined in src/pocket_es/token.cljc.
Any conforming implementation must use the same set.
5.3. Normalization
- Strip org-mode markup (`#+` headers, `:PROPERTIES:` drawers, source block delimiters)
- Split on whitespace and punctuation
- Lowercase
- Remove tokens shorter than 2 characters
- Remove stopwords
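The normalization steps can be sketched as follows. This is an illustration under stated assumptions, not the canonical tokenizer: the `STOPWORDS` set here is a small placeholder (the real list lives in `src/pocket_es/token.cljc`), the regular expressions are one plausible reading of "org-mode markup" and "punctuation", and tokens are assumed to be runs of ASCII letters and digits.

```python
import re
from collections import Counter

# Placeholder only: the canonical list is defined in src/pocket_es/token.cljc.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})

# Assumed markup patterns: property drawers, and any line starting with "#+"
# (which covers #+title: style headers and #+begin_src / #+end_src delimiters).
_DRAWER = re.compile(r"^[ \t]*:PROPERTIES:.*?^[ \t]*:END:[ \t]*$", re.S | re.M | re.I)
_KEYWORD_LINE = re.compile(r"^[ \t]*#\+.*$", re.M)
_SPLIT = re.compile(r"[^a-z0-9]+")  # assumption: ASCII alphanumerics form tokens


def tokenize(text):
    """Strip markup, lowercase, split, then drop short tokens and stopwords."""
    text = _DRAWER.sub(" ", text)
    text = _KEYWORD_LINE.sub(" ", text)
    tokens = _SPLIT.split(text.lower())
    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]


def term_frequencies(text, n):
    """Top-n term counts, so the result never has more than n entries."""
    return dict(Counter(tokenize(text)).most_common(n))
```

Because every emitted token is already lowercase, alphanumeric, at least two characters long, and not a stopword, re-tokenizing the space-joined output returns the same list, which is the idempotence invariant from section 5.1.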
6. BM25 Scoring
Parameters: k1 = 1.2, b = 0.75.
For a single term t against document d:
    tf    = doc.terms[t] or 0
    dl    = doc.doc_len
    idf_v = idf[t] or 0
    numer = tf * (k1 + 1)
    denom = tf + k1 * (1 - b + b * dl / avg_dl)
    score = idf_v * numer / denom     (0 if denom == 0)
Multi-term queries sum per-term scores. Field boosts multiply the sum.
6.1. Invariants
- Deterministic. Same index + same query = same scores
- Zero for absent terms. If `tf == 0` then `score == 0`
- Monotonic in tf. Higher term frequency = higher score (given same idf, dl)
- Longer docs score lower. Given same tf and idf, `score(dl=100) > score(dl=200)`
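The per-term formula and the invariants above translate directly into code. The sketch below uses hypothetical function names (`bm25_term`, `bm25`) and treats a document as a plain dict shaped like an entry of `docs`. The guard for `avg_dl == 0` is an assumption added to avoid a division by zero; the spec itself only defines the `denom == 0` case.

```python
K1, B = 1.2, 0.75  # parameters fixed by this spec


def bm25_term(term, doc, idf, avg_dl, k1=K1, b=B):
    """BM25 score of one term against one document."""
    tf = doc["terms"].get(term, 0)
    idf_v = idf.get(term, 0.0)
    # Assumption: an empty index (avg_dl == 0) contributes no length penalty.
    length_ratio = doc["doc_len"] / avg_dl if avg_dl else 0.0
    numer = tf * (k1 + 1)
    denom = tf + k1 * (1 - b + b * length_ratio)
    return idf_v * numer / denom if denom else 0.0


def bm25(query_terms, doc, idf, avg_dl, boost=1.0):
    """Sum the per-term scores, then apply the field boost to the sum."""
    return boost * sum(bm25_term(t, doc, idf, avg_dl) for t in query_terms)
```

As a worked check: with `tf = 2`, `idf = 1.0`, and `dl = avg_dl = 100`, the numerator is `2 * 2.2 = 4.4` and the denominator is `2 + 1.2 * 1.0 = 3.2`, giving roughly 1.375. Doubling `dl` to 200 raises the denominator to `2 + 1.2 * 1.75 = 4.1`, so the score drops to about 1.073, as the last invariant requires.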
7. Search Request Schema (input ↔ executor)
The second wire contract: what search() accepts. Every input surface (free
text, JSON, and the phase-B simple/lucene producers) terminates in one IR; this
schema is that IR written down. Authoritative source is the malli schema
pocket-es.dsl/SearchRequest (see query-surface-spec); this is its
language-neutral description.
    {
      "$schema_version": 1,   // request-schema version (optional; defaults to 1)
      "query": <Query>,       // required -- exactly one query container
      "size": integer,        // optional, 1..100
      "from": integer         // optional, >= 0
    }
A <Query> is an object with exactly one of these keys:
| key | value shape | behavior |
|---|---|---|
| `match` | `{<field>: string}` | tokenize, BM25 per term, sum |
| `term` | `{<field>: string \| [string]}` | exact match on keyword/array field |
| `prefix` | `{<field>: string}` | prefix scan |
| `match_all` | `{}` | all docs, score 1.0 |
| `multi_match` | `{query: string, fields: [<field>^<boost>]}` | match across fields, boost-weighted |
| `bool` | `{must?, should?, filter?, must_not?: [<Query>]}` | ≥1 clause required; recurses |
7.1. Closed field set
<field> is one of exactly:
_all title keywords description headings terms
_all is the union field. terms is the tokenized body (the per-document term
frequency map); there is no body field. multi_match field entries may carry
an optional ^<number> boost suffix, e.g. title^3.
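Parsing a `multi_match` field entry against the closed set could look like this. The helper name `parse_field_spec` is hypothetical, and the default boost of 1.0 for an entry without a suffix is an assumption; the spec only says the suffix is optional.

```python
FIELDS = frozenset({"_all", "title", "keywords", "description", "headings", "terms"})


def parse_field_spec(spec):
    """Split 'title^3' into ('title', 3.0); a bare field gets boost 1.0 (assumed)."""
    name, sep, boost = spec.partition("^")
    if name not in FIELDS:
        raise ValueError(f"unknown field: {name!r}")
    # float("") raises ValueError, so a dangling "title^" is rejected too.
    return name, float(boost) if sep else 1.0
```

For example, `parse_field_spec("title^3")` yields `("title", 3.0)`, while `parse_field_spec("body^2")` raises because there is no `body` field.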
7.2. Invariants
- Single clause. A `<Query>` object has exactly one key (the others are a different query). Zero keys, or two or more, is invalid.
- Non-empty bool. A `bool` has at least one of `must` / `should` / `filter` / `must_not`.
- Closed fields. Every `<field>` is a member of the closed set above. A field outside the set is invalid, including in the boost-suffixed `multi_match` form.
- No phrases. There is no phrase clause; the index stores no positions, so a phrase query that returned results would be a lie. Rejection is structural: no key admits a phrase.
- Validate at the producer boundary. The JSON input surface validates against this schema before calling `search()` and renders the humanized error inline. Free/simple modes construct the IR and are valid by construction (so they are not re-validated on the hot path); the refutation hook below keeps that claim honest.
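The structural invariants can be expressed as a small recursive validator. The sketch below is not the malli schema; it is a hand-written approximation that borrows the convention of returning nothing for a valid input and an error otherwise. Two points are assumptions rather than spec text: leaf clauses (`match`, `term`, `prefix`) are taken to name exactly one field, and a `bool` group holding an empty list is accepted as long as the key is present.

```python
FIELDS = frozenset({"_all", "title", "keywords", "description", "headings", "terms"})
BOOL_KEYS = frozenset({"must", "should", "filter", "must_not"})


def _field_error(spec):
    """Check a multi_match field entry, allowing the 'field^boost' suffix."""
    name, sep, boost = spec.partition("^")
    if name not in FIELDS:
        return f"unknown field: {name!r}"
    if sep:
        try:
            float(boost)
        except ValueError:
            return f"bad boost in {spec!r}"
    return None


def explain(query):
    """Return None when query is a valid <Query>, otherwise an error string."""
    if not isinstance(query, dict) or len(query) != 1:
        return "a query is an object with exactly one key"
    (kind, body), = query.items()
    if not isinstance(body, dict):
        return f"{kind}: body must be an object"
    if kind == "match_all":
        return None if not body else "match_all takes an empty object"
    if kind in ("match", "term", "prefix"):
        if len(body) != 1:  # assumption: one field per leaf clause
            return f"{kind}: expected exactly one field"
        (field, value), = body.items()
        if field not in FIELDS:
            return f"unknown field: {field!r}"
        is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if isinstance(value, str) or (kind == "term" and is_str_list):
            return None
        return f"{kind}: value has the wrong shape"
    if kind == "multi_match":
        if set(body) != {"query", "fields"} or not isinstance(body["query"], str):
            return "multi_match: expected {query: string, fields: [...]}"
        fields = body["fields"]
        if not isinstance(fields, list) or not all(isinstance(f, str) for f in fields):
            return "multi_match: fields must be a list of strings"
        return next((err for err in map(_field_error, fields) if err), None)
    if kind == "bool":
        if not body or set(body) - BOOL_KEYS:
            return "bool: needs at least one of must/should/filter/must_not"
        for clauses in body.values():
            if not isinstance(clauses, list):
                return "bool: each clause group must be a list"
            for clause in clauses:
                err = explain(clause)
                if err:
                    return err
        return None
    return f"unknown query type: {kind!r}"
```

A phrase query is rejected by the final fall-through: since no recognized key admits one, `explain({"phrase": {"title": "a b"}})` reports an unknown query type without any phrase-specific check.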
7.3. Refutation hook
For any input string s, dsl/explain (free-mode-IR s) MUST return nil. If it
ever returns an error, the free-mode IR constructor and this schema have drifted.
This is a test (test.check, see the rollout plan), not a hope.
8. Query DSL
A conforming implementation must support these query types:
| Query type | Required fields | Behavior |
|---|---|---|
| `match` | field, implicit `_all` | tokenize query, BM25 score per term, sum |
| `term` | field | exact match on keyword/field array |
| `bool` | `must`, `should`, `must_not`, `filter` | intersect/union/exclude/filter |
| `multi_match` | `query`, `fields` (with optional `^boost`) | match across fields with boost weights |
| `match_all` | (none) | return all documents, score 1.0 |
| `prefix` | field | prefix scan on string or array fields |
8.1. Suggest
suggest({text, size}) returns up to size strings from the suggest
corpus that start with text (case-insensitive prefix match).
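A minimal sketch of that behavior follows. It takes the corpus as an explicit argument for self-containment, and it returns matches in corpus order; the ordering is an assumption, since this section does not specify one.

```python
from itertools import islice


def suggest(corpus, text, size):
    """Return up to `size` corpus entries that start with `text`, ignoring case."""
    prefix = text.lower()
    matches = (entry for entry in corpus if entry.lower().startswith(prefix))
    # islice stops scanning as soon as `size` matches have been collected.
    return list(islice(matches, size))
```

With the corpus `["BM25", "bm25 scoring", "babashka", "index"]`, the call `suggest(corpus, "bm", 5)` returns the first two entries, and `suggest(corpus, "b", 2)` stops after two matches without examining `"babashka"`.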
9. State Machine (UI)
The browser UI manages six atoms and a URL with four parameters:
| Atom | Type | Default | URL param |
|---|---|---|---|
| `query-text` | string | `""` | `q` |
| `page` | integer | `0` | `p` (1-indexed in URL) |
| `date-filter` | string or nil | `nil` | `d` |
| `json-mode` | boolean | `false` | `j` (`1` when on) |
| `results` | map or nil | `nil` | (derived) |
| `suggestions` | vector | `[]` | (derived) |
9.1. Transitions
| Action | query | date-filter | page | URL method |
|---|---|---|---|---|
| Type in search box | new | keep | reset | replaceState |
| Click suggestion (different) | new | keep | reset | pushState |
| Click suggestion (same/active) | clear | keep | reset | pushState |
| Click "Try" link | new | reset | reset | pushState |
| Click keyword tag | new | reset | reset | pushState |
| Click date filter | keep | new | reset | pushState |
| Click prev/next | keep | keep | inc/dec | pushState |
| Press Escape | clear | keep | reset | replaceState |
| Toggle JSON mode | keep | keep | reset | pushState |
| Browser back/forward | from URL | from URL | from URL | (popstate) |
| Page load (init) | from URL | from URL | from URL | replaceState |
Toggling JSON mode swaps the input surface (text input ⇄ textarea) and re-runs
from the now-active box; it never auto-translates the typed query (INV-6 in the
rollout plan). json-mode round-trips through the j URL param like the date
filter.
9.2. Invariant
After any transition, sync-url! runs. The URL in the address bar
must reflect the current values of q, d, p, and j. A page refresh at
that URL must restore identical state.
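The round-trip requirement can be illustrated with a pair of inverse functions over the four URL parameters from the atom table. This is a sketch, not the behavior of `sync-url!`: the function names and the snake_case state keys are hypothetical, and omitting parameters that hold their default value is an assumption about how the URL is kept short.

```python
from urllib.parse import parse_qs, urlencode


def state_to_query(state):
    """Serialize the URL-backed atoms; defaults are omitted (an assumption)."""
    params = {}
    if state["query_text"]:
        params["q"] = state["query_text"]
    if state["page"]:
        params["p"] = state["page"] + 1  # 0-indexed atom, 1-indexed in the URL
    if state["date_filter"] is not None:
        params["d"] = state["date_filter"]
    if state["json_mode"]:
        params["j"] = 1
    return urlencode(params)


def query_to_state(query_string):
    """Restore the atoms from a query string, falling back to the defaults."""
    raw = parse_qs(query_string, keep_blank_values=True)

    def first(key):
        return raw[key][0] if key in raw else None

    return {
        "query_text": first("q") or "",
        "page": max(int(first("p") or 1) - 1, 0),
        "date_filter": first("d"),
        "json_mode": first("j") == "1",
    }
```

The invariant then reads as `query_to_state(state_to_query(s)) == s` for every reachable state `s`, which is a natural candidate for a property test alongside the tokenizer ones.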
10. Consumers
A conforming index supports these access patterns:
| Consumer | Runtime | Entry point | Scores with |
|---|---|---|---|
| Browser | ClojureScript | `pocket-es.js` (self-injecting) | BM25 in CLJS |
| Emacs | Elisp | `pocket-es.el` | BM25 in elisp |
| Node CLI | Node.js | `scripts/search.js` via `dist/pocket-es.js` | BM25 in compiled CLJS |
| Babashka | bb | `bb -cp src -m pocket-es.cli` | BM25 in `token.cljc` + `cli.clj` |
| Console | Browser DevTools | `pocketES.search(...)` | BM25 in CLJS |
All consumers must produce identical ranking for the same query against the same index. The BM25 parameters (k1, b) and the tokenizer (stopwords, normalization) are the shared contract.
11. Verification
A conforming deployment passes these checks:
- Index integrity. `doc_count == length(docs)`, all IDF terms present
- URL resolution. Every `_id` resolves to HTTP 200 using the `_dir` rule
- Tokenization contract. Property tests pass (4 invariants, 300+ iterations)
- Scoring determinism. Same query returns same results across consumers
- State round-trip. URL with `q`, `d`, `p` params restores identical view
- No stopwords in index. No term in any doc's `terms` map is a stopword
- Date sanity. No doc has `date` matching the AI review date (2024-08-11)
- Description tone. No description contains slop markers (delve, comprehensive, explore, etc.)
- Index schema version. `_cluster.version` is present, integer, and recognized
- Request schema conformance. Every IR the JSON surface forwards to `search()` is valid per the Search Request Schema, or `search()` is never called and an inline error is shown instead
- Producer/consumer agreement (refutation hook). `dsl/explain (free-mode-IR s)` is `nil` for all `s` (the free producer never emits an IR the schema rejects)