pocket-es: Specification

1. Scope

This document defines what a conforming pocket-es implementation must do, not how. An agent reading this spec should be able to build a compatible system in any language and verify it against the same test fixtures.

2. Two schemas: producer and consumer

pocket-es has exactly two wire contracts, each independently versioned:

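| Schema                | Producer                               | Consumer                        | Version field    | Current |
|-----------------------+----------------------------------------+---------------------------------+------------------+---------|
| Index schema          | the indexer (indexer.clj)              | every query runtime (the store) | _cluster.version | 2       |
| Search request schema | input surfaces (free / JSON / phase-B) | search() (the executor)         | $schema_version  | 1       |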

The index schema describes data at rest; the search request schema describes the query in flight. They version separately because they change for different reasons: the index schema changes when the indexer emits new fields, the request schema when search() accepts a new query shape. A conforming deployment documents and validates both.

3. Index Schema (producer ↔ store)

The index is a single JSON file. A conforming index has this shape:

{
  "_cluster": {
    "name":       string,     // index identity
    "version":    integer,    // index-schema version (currently 2)
    "built_at":   string,     // ISO 8601 timestamp
    "git_sha":    string,     // source commit
    "doc_count":  integer,    // total documents
    "vocab_size": integer,    // unique terms after stopword removal
    "avg_dl":     float       // average document length in tokens
  },
  "idf": {                   // term → IDF score (precomputed)
    "<term>": float,
    ...
  },
  "docs": [                   // one entry per indexed org file
    {
      "_id":         string,  // relative path without .org or /index
      "_dir":        boolean, // true if source was index.org (dir-form)
      "title":       string,
      "date":        string,  // YYYY-MM-DD or partial
      "keywords":    [string],
      "description": string,
      "headings":    [string],  // first 15 section headings
      "terms":       {"<term>": integer},  // top 50 term frequencies
      "doc_len":     integer  // total token count
    }
  ],
  "suggest_corpus": [string]  // prefix completion candidates
}

3.1. Invariants

  1. Every _id is unique within docs
  2. _dir is true iff the source file was named index.org
  3. doc_count equals length(docs)
  4. Every term in any doc's terms map exists as a key in idf
  5. avg_dl equals mean(doc.doc_len for doc in docs)
  6. No term in terms is a stopword (see Tokenization)
  7. _cluster.version is present and an integer; a consumer that does not recognize the version refuses the index rather than mis-parsing it
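
The invariants above can be checked mechanically. The following is a minimal sketch in Python, not the reference implementation: check_index, SUPPORTED_INDEX_VERSIONS, the placeholder STOPWORDS set, the sample data, and the float tolerance on avg_dl are all assumptions made for illustration. Invariant 2 is omitted because it needs the source file names, which the index alone does not carry.

```python
from math import isclose

SUPPORTED_INDEX_VERSIONS = {2}     # assumption: this consumer only understands version 2
STOPWORDS = {"the", "and", "of"}   # placeholder; the canonical list lives in token.cljc


def check_index(index, stopwords=STOPWORDS):
    """Return a list of invariant violations; an empty list means conforming."""
    cluster, docs, idf = index["_cluster"], index["docs"], index["idf"]
    version = cluster.get("version")
    # Invariant 7: refuse an unrecognized version before reading anything else.
    if type(version) is not int or version not in SUPPORTED_INDEX_VERSIONS:
        return [f"unrecognized index version: {version!r}"]
    errors = []
    ids = [d["_id"] for d in docs]
    if len(ids) != len(set(ids)):                       # invariant 1
        errors.append("duplicate _id")
    if cluster["doc_count"] != len(docs):               # invariant 3
        errors.append("doc_count != length(docs)")
    for d in docs:
        for term in d["terms"]:
            if term not in idf:                         # invariant 4
                errors.append(f"{d['_id']}: term {term!r} missing from idf")
            if term in stopwords:                       # invariant 6
                errors.append(f"{d['_id']}: stopword {term!r} in terms")
    if docs:                                            # invariant 5
        mean_dl = sum(d["doc_len"] for d in docs) / len(docs)
        # Tolerance is an assumption, to absorb rounding in the serialized float.
        if not isclose(cluster["avg_dl"], mean_dl, rel_tol=1e-6):
            errors.append("avg_dl != mean(doc_len)")
    return errors


# Hypothetical two-document index, trimmed to the fields the checks read.
SAMPLE_INDEX = {
    "_cluster": {"name": "demo", "version": 2, "doc_count": 2, "avg_dl": 15.0},
    "idf": {"search": 0.5, "index": 1.2},
    "docs": [
        {"_id": "a", "_dir": False, "terms": {"search": 2}, "doc_len": 10},
        {"_id": "b", "_dir": True, "terms": {"index": 1, "search": 1}, "doc_len": 20},
    ],
    "suggest_corpus": [],
}

print(check_index(SAMPLE_INDEX))   # prints [] for the sample above
```

Returning early on an unknown version mirrors invariant 7: the remaining checks assume the version 2 shape, so running them against anything else would be mis-parsing.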

3.2. Conformance (confirmed)

The live index at site/static/search-index.json conforms:

$ jq '._cluster | {version, doc_count}' site/static/search-index.json
{ "version": 2, "doc_count": 618 }
$ jq '._cluster.doc_count == (.docs|length)' site/static/search-index.json
true

Schema history:

| version | change                                       |
|---------+----------------------------------------------|
| 1       | initial: _cluster, idf, docs, suggest_corpus |
| 2       | added _dir per doc (dir-form URL resolution) |

4. URL Resolution

The _id and _dir fields determine the live URL:

| _dir  | URL pattern | Example                    |
|-------+-------------+----------------------------|
| true  | /<_id>/     | /research/pocket-es/       |
| false | /<_id>.html | /research/local-first.html |

4.1. Invariant

Every URL produced by this rule MUST return HTTP 200 on the live site. This is the contract boundary between org-publish (which writes files), Apache (which serves them), and the search client (which links to them).

A verification pass:

for doc in index.docs:
    url = "/" + doc._id + ("/" if doc._dir else ".html")
    assert http_get(url).status == 200
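
A runnable variant of the pass above, as a sketch: resolve_url applies the _dir rule, and unresolved takes the HTTP fetch as a caller-supplied function so the rule can be exercised without a network. Both function names and the fetch_status parameter are assumptions, not part of pocket-es.

```python
def resolve_url(doc):
    """Map an index doc to its site-relative URL using the _dir rule."""
    return "/" + doc["_id"] + ("/" if doc["_dir"] else ".html")


def unresolved(docs, fetch_status):
    """Return the URLs that do not answer 200.

    fetch_status is any callable url -> int HTTP status. It is a hypothetical
    hook: a real check would issue a GET against the live site.
    """
    return [url for url in map(resolve_url, docs) if fetch_status(url) != 200]


docs = [
    {"_id": "research/pocket-es", "_dir": True},
    {"_id": "research/local-first", "_dir": False},
]
print([resolve_url(d) for d in docs])
# prints ['/research/pocket-es/', '/research/local-first.html']
```

Collecting failures instead of asserting on the first one gives a full list of broken links in a single pass.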

5. Tokenization

Tokenization is the shared contract between the indexer and every client. A conforming tokenizer satisfies these invariants:

5.1. Invariants (property-testable)

  1. No stopwords in output. For any input string s: intersection(tokenize(s), STOPWORDS) = {}
  2. All lowercase. For any input string s: every token in tokenize(s) equals lowercase(token)
  3. Idempotent on output. For any input string s: tokenize(join(" ", tokenize(s))) = tokenize(s)
  4. Bounded term frequencies. For any input string s and limit n: length(term_frequencies(s, n)) <= n

5.2. Stopwords

The canonical stopword list is defined in src/pocket_es/token.cljc. Any conforming implementation must use the same set.

5.3. Normalization

  1. Strip org-mode markup (#+ headers, :PROPERTIES: drawers, source block delimiters)
  2. Split on whitespace and punctuation
  3. Lowercase
  4. Remove tokens shorter than 2 characters
  5. Remove stopwords
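
The normalization steps can be sketched as below. This is an illustration under stated assumptions, not the reference tokenizer: the STOPWORDS set is a placeholder (the canonical list is in src/pocket_es/token.cljc), the regular expressions are one guess at what "strip org-mode markup" covers (whole #+ lines, which includes source block delimiters, and whole :PROPERTIES: drawers), and tokens are restricted to ASCII letters and digits.

```python
import re
from collections import Counter

# Placeholder only: the canonical list is defined in src/pocket_es/token.cljc.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "it"})

# Assumed org-mode patterns; the reference tokenizer may strip more or less.
_ORG_DRAWER = re.compile(
    r"^[ \t]*:PROPERTIES:.*?^[ \t]*:END:[ \t]*$", re.MULTILINE | re.DOTALL
)
_ORG_KEYWORD_LINE = re.compile(r"^[ \t]*#\+.*$", re.MULTILINE)  # incl. #+begin_src / #+end_src


def tokenize(text):
    """Apply the five normalization steps and return a list of tokens."""
    text = _ORG_DRAWER.sub(" ", text)                  # step 1: strip drawers
    text = _ORG_KEYWORD_LINE.sub(" ", text)            # step 1: strip #+ lines
    tokens = re.findall(r"[a-z0-9]+", text.lower())    # steps 2 and 3
    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]  # steps 4 and 5


def term_frequencies(text, n):
    """Return at most n (term -> count) entries, most frequent first."""
    return dict(Counter(tokenize(text)).most_common(n))


sample = (
    "#+TITLE: Demo\n"
    ":PROPERTIES:\n:ID: abc\n:END:\n"
    "The Index is a single JSON file; the index is small."
)
print(tokenize(sample))
# prints ['index', 'single', 'json', 'file', 'index', 'small']
```

Because the output contains only lowercase alphanumeric tokens of length two or more with no stopwords, joining it with spaces and tokenizing again returns the same list, which is what the idempotence invariant requires.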

6. BM25 Scoring

Parameters: k1 = 1.2, b = 0.75.

For a single term t against document d:

tf    = doc.terms[t] or 0
dl    = doc.doc_len
idf_v = idf[t] or 0
numer = tf * (k1 + 1)
denom = tf + k1 * (1 - b + b * dl / avg_dl)
score = idf_v * numer / denom   (0 if denom == 0)

Multi-term queries sum per-term scores. Field boosts multiply the sum.
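
The same formula as a runnable sketch. The function names bm25_term and bm25_query are hypothetical, and passing the field boost as a single multiplier on the summed score is one reading of "field boosts multiply the sum". The code assumes avg_dl > 0, which holds for any non-empty index.

```python
K1 = 1.2
B = 0.75


def bm25_term(term, doc, idf, avg_dl, k1=K1, b=B):
    """Score one term against one document, following the formula above."""
    tf = doc["terms"].get(term, 0)
    idf_v = idf.get(term, 0.0)
    numer = tf * (k1 + 1)
    denom = tf + k1 * (1 - b + b * doc["doc_len"] / avg_dl)
    return idf_v * numer / denom if denom != 0 else 0.0


def bm25_query(terms, doc, idf, avg_dl, boost=1.0):
    """Sum per-term scores, then apply the (assumed single) field boost."""
    return boost * sum(bm25_term(t, doc, idf, avg_dl) for t in terms)


idf = {"search": 1.0}
short_doc = {"terms": {"search": 2}, "doc_len": 100}
long_doc = {"terms": {"search": 2}, "doc_len": 200}

print(bm25_term("search", short_doc, idf, avg_dl=100.0))   # about 1.375
print(bm25_term("search", long_doc, idf, avg_dl=100.0))    # lower: same tf, longer doc
print(bm25_term("absent", short_doc, idf, avg_dl=100.0))   # 0.0
```

For the first call, numer = 2 * 2.2 = 4.4 and denom = 2 + 1.2 * (0.25 + 0.75) = 3.2, giving 4.4 / 3.2 = 1.375.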

6.1. Invariants

  1. Deterministic. Same index + same query = same scores
  2. Zero for absent terms. If tf = 0 then score = 0
  3. Monotonic in tf. Higher term frequency = higher score (given same idf, dl)
  4. Longer docs score lower. Given same tf and idf, score(dl=100) > score(dl=200)

7. Search Request Schema (input ↔ executor)

The second wire contract: what search() accepts. Every input surface (free text, JSON, and the phase-B simple/lucene producers) terminates in one IR; this schema is that IR written down. Authoritative source is the malli schema pocket-es.dsl/SearchRequest (see query-surface-spec); this is its language-neutral description.

{
  "$schema_version": 1,          // request-schema version (optional; defaults to 1)
  "query": <Query>,              // required -- exactly one query container
  "size":  integer,              // optional, 1..100
  "from":  integer               // optional, >= 0
}

A <Query> is an object with exactly one of these keys:

| key         | value shape                                      | behavior                            |
|-------------+--------------------------------------------------+-------------------------------------|
| match       | {<field>: string}                                | tokenize, BM25 per term, sum        |
| term        | {<field>: string or [string]}                    | exact match on keyword/array field  |
| prefix      | {<field>: string}                                | prefix scan                         |
| match_all   | {}                                               | all docs, score 1.0                 |
| multi_match | {query: string, fields: [<field>^<boost>]}       | match across fields, boost-weighted |
| bool        | {must?, should?, filter?, must_not?: [<Query>]}  | ≥1 clause required; recurses        |

7.1. Closed field set

<field> is one of exactly:

_all  title  keywords  description  headings  terms

_all is the union field. terms is the tokenized body (the per-document term frequency map); there is no body field. multi_match field entries may carry an optional ^<number> boost suffix, e.g. title^3.

7.2. Invariants

  1. Single clause. A <Query> object has exactly one key, which names its query type. Zero keys, or two or more, is invalid.
  2. Non-empty bool. A bool has at least one of must / should / filter / must_not.
  3. Closed fields. Every <field> is a member of the closed set above. A field outside the set is invalid – including in the boost-suffixed multi_match form.
  4. No phrases. There is no phrase clause; the index stores no positions, so any result returned for a phrase query would be a lie. Rejection is structural: no key admits a phrase.
  5. Validate at the producer boundary. The JSON input surface validates against this schema before calling search() and renders the humanized error inline. Free/simple modes construct the IR and are valid by construction (so they are not re-validated on the hot path); the refutation hook below keeps that claim honest.
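
A structural validator for the request shape can be sketched as follows. It is not the malli schema: request_errors, query_errors, and the error strings are hypothetical. It also makes assumptions where this description is silent: a {<field>: value} body carries exactly one field, an empty clause list inside bool is accepted, unknown top-level request keys are ignored, and only $schema_version 1 is recognized.

```python
import re

FIELDS = {"_all", "title", "keywords", "description", "headings", "terms"}
BOOL_KEYS = {"must", "should", "filter", "must_not"}
_NUMBER = re.compile(r"\d+(\.\d+)?")


def _is_int(x):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(x, int) and not isinstance(x, bool)


def _field_value_errors(kind, body, allow_list=False):
    """Check a {<field>: value} body (one field per object is an assumption)."""
    if not isinstance(body, dict) or len(body) != 1:
        return [f"{kind}: expected an object with exactly one field"]
    (field, value), = body.items()
    errors = []
    if field not in FIELDS:
        errors.append(f"{kind}: unknown field {field!r}")
    is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
    if not (isinstance(value, str) or (allow_list and is_str_list)):
        errors.append(f"{kind}: bad value for {field!r}")
    return errors


def _multi_match_errors(body):
    if not isinstance(body, dict) or not isinstance(body.get("query"), str):
        return ["multi_match: 'query' must be a string"]
    fields = body.get("fields")
    if not isinstance(fields, list):
        return ["multi_match: 'fields' must be a list"]
    errors = []
    for entry in fields:
        name, caret, boost = str(entry).partition("^")   # optional ^<number> suffix
        if name not in FIELDS:
            errors.append(f"multi_match: unknown field {name!r}")
        if caret and not _NUMBER.fullmatch(boost):
            errors.append(f"multi_match: bad boost in {entry!r}")
    return errors


def _bool_errors(body):
    if not isinstance(body, dict) or not body or set(body) - BOOL_KEYS:
        return ["bool: needs at least one of must/should/filter/must_not and no other keys"]
    errors = []
    for key, clauses in body.items():
        if not isinstance(clauses, list):
            errors.append(f"bool.{key}: expected a list of queries")
            continue
        for clause in clauses:
            errors.extend(query_errors(clause))          # bool recurses
    return errors


def query_errors(q):
    """Return a list of problems with a <Query>; an empty list means valid."""
    if not isinstance(q, dict) or len(q) != 1:
        return ["query: expected an object with exactly one key"]
    (kind, body), = q.items()
    if kind in ("match", "prefix"):
        return _field_value_errors(kind, body)
    if kind == "term":
        return _field_value_errors(kind, body, allow_list=True)
    if kind == "match_all":
        return [] if body == {} else ["match_all: expected {}"]
    if kind == "multi_match":
        return _multi_match_errors(body)
    if kind == "bool":
        return _bool_errors(body)
    return [f"query: unknown key {kind!r}"]              # e.g. any phrase clause


def request_errors(req):
    """Validate a whole search request; an empty list means valid."""
    if not isinstance(req, dict) or "query" not in req:
        return ["request: 'query' is required"]
    errors = query_errors(req["query"])
    if req.get("$schema_version", 1) != 1:
        errors.append("request: unsupported $schema_version")
    size = req.get("size")
    if size is not None and not (_is_int(size) and 1 <= size <= 100):
        errors.append("request: size must be an integer in 1..100")
    start = req.get("from")
    if start is not None and not (_is_int(start) and start >= 0):
        errors.append("request: from must be an integer >= 0")
    return errors
```

A JSON input surface would call request_errors on the parsed body and hand the request to search() only when the list is empty; otherwise it would render the messages inline.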

7.3. Refutation hook

For any input string s, dsl/explain (free-mode-IR s) MUST return nil. If it ever returns an error, the free-mode IR constructor and this schema have drifted. This is a test (test.check, see the rollout plan), not a hope.

8. Query DSL

A conforming implementation must support these query types:

| Query type  | Required fields                        | Behavior                                 |
|-------------+----------------------------------------+------------------------------------------|
| match       | field (implicitly _all)                | tokenize query, BM25 score per term, sum |
| term        | field                                  | exact match on keyword/array field       |
| bool        | must, should, must_not, filter         | intersect/union/exclude/filter           |
| multi_match | query, fields (with optional ^boost)   | match across fields with boost weights   |
| match_all   | (none)                                 | return all documents, score 1.0          |
| prefix      | field                                  | prefix scan on string or array fields    |

8.1. Suggest

suggest({text, size}) returns up to size strings from the suggest corpus that start with text (case-insensitive prefix match).
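
A minimal sketch of that behavior. The corpus is passed explicitly here for self-containment, whereas the spec's suggest reads it from the index. Returning candidates in corpus order, and treating an empty text as matching everything, are assumptions.

```python
from itertools import islice


def suggest(corpus, text, size):
    """Return up to `size` corpus entries that start with `text`, ignoring case."""
    needle = text.casefold()
    matches = (c for c in corpus if c.casefold().startswith(needle))
    return list(islice(matches, size))   # stops scanning once `size` hits are found


corpus = ["Local-first", "local search", "Lucene", "pocket-es"]
print(suggest(corpus, "LO", 5))   # prints ['Local-first', 'local search']
```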

9. State Machine (UI)

The browser UI manages six atoms and a URL with four parameters:

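| Atom        | Type          | Default | URL param            |
|-------------+---------------+---------+----------------------|
| query-text  | string        | ""      | q                    |
| page        | integer       | 0       | p (1-indexed in URL) |
| date-filter | string or nil | nil     | d                    |
| json-mode   | boolean       | false   | j (1 when on)        |
| results     | map or nil    | nil     | (derived)            |
| suggestions | vector        | []      | (derived)            |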

9.1. Transitions

| Action                         | query    | date-filter | page     | URL method   |
|--------------------------------+----------+-------------+----------+--------------|
| Type in search box             | new      | keep        | reset    | replaceState |
| Click suggestion (different)   | new      | keep        | reset    | pushState    |
| Click suggestion (same/active) | clear    | keep        | reset    | pushState    |
| Click "Try" link               | new      | reset       | reset    | pushState    |
| Click keyword tag              | new      | reset       | reset    | pushState    |
| Click date filter              | keep     | new         | reset    | pushState    |
| Click prev/next                | keep     | keep        | inc/dec  | pushState    |
| Press Escape                   | clear    | keep        | reset    | replaceState |
| Toggle JSON mode               | keep     | keep        | reset    | pushState    |
| Browser back/forward           | from URL | from URL    | from URL | (popstate)   |
| Page load (init)               | from URL | from URL    | from URL | replaceState |

Toggling JSON mode swaps the input surface (text input ⇄ textarea) and re-runs from the now-active box; it never auto-translates the typed query (INV-6 in the rollout plan). json-mode round-trips through the j URL param like the date filter.

9.2. Invariant

After any transition, sync-url! runs. The URL in the address bar must reflect the current values of q, d, p, and j. A page refresh at that URL must restore identical state.
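
The round-trip can be sketched as a pair of pure functions over the four URL-backed atoms (q, p, d, j from the atoms table). The names state_to_query and query_to_state are hypothetical. Omitting parameters that hold their default value, and falling back to page 0 on a malformed p, are assumptions about sync-url!, not documented behavior.

```python
from urllib.parse import parse_qs, urlencode


def state_to_query(state):
    """Serialize the URL-backed atoms into a query string."""
    params = {}
    if state["query-text"]:
        params["q"] = state["query-text"]
    if state["page"] > 0:
        params["p"] = str(state["page"] + 1)   # p is 1-indexed in the URL
    if state["date-filter"] is not None:
        params["d"] = state["date-filter"]
    if state["json-mode"]:
        params["j"] = "1"                      # j is 1 when JSON mode is on
    return urlencode(params)


def query_to_state(query_string):
    """Restore the atoms from a query string, using defaults for missing params."""
    raw = parse_qs(query_string)

    def first(key):
        values = raw.get(key)
        return values[0] if values else None

    p = first("p")
    return {
        "query-text": first("q") or "",
        "page": max(int(p) - 1, 0) if p and p.isdecimal() else 0,
        "date-filter": first("d"),
        "json-mode": first("j") == "1",
    }


state = {"query-text": "local first", "page": 2, "date-filter": "2024", "json-mode": True}
print(state_to_query(state))   # prints q=local+first&p=3&d=2024&j=1
```

A property test for the invariant would generate arbitrary states and check that query_to_state(state_to_query(s)) equals s.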

10. Consumers

A conforming index supports these access patterns:

| Consumer | Runtime          | Entry point                             | Scores with                  |
|----------+------------------+-----------------------------------------+------------------------------|
| Browser  | ClojureScript    | pocket-es.js (self-injecting)           | BM25 in CLJS                 |
| Emacs    | Elisp            | pocket-es.el                            | BM25 in elisp                |
| Node CLI | Node.js          | scripts/search.js via dist/pocket-es.js | BM25 in compiled CLJS        |
| Babashka | bb               | bb -cp src -m pocket-es.cli             | BM25 in token.cljc + cli.clj |
| Console  | Browser DevTools | pocketES.search(...)                    | BM25 in CLJS                 |

All consumers must produce identical ranking for the same query against the same index. The BM25 parameters (k1, b) and the tokenizer (stopwords, normalization) are the shared contract.

11. Verification

A conforming deployment passes these checks:

  1. Index integrity. doc_count = length(docs), and every term in any doc's terms map has an entry in idf
  2. URL resolution. Every _id resolves to HTTP 200 using the _dir rule
  3. Tokenization contract. Property tests pass (4 invariants, 300+ iterations)
  4. Scoring determinism. Same query returns same results across consumers
  5. State round-trip. URL with q, d, p, j params restores identical view
  6. No stopwords in index. No term in any doc's terms map is a stopword
  7. Date sanity. No doc has date matching the AI review date (2024-08-11)
  8. Description tone. No description contains slop markers (delve, comprehensive, explore, etc.)
  9. Index schema version. _cluster.version is present, integer, and recognized
  10. Request schema conformance. Every IR the JSON surface forwards to search() is valid per the Search Request Schema, or search() is never called and an inline error is shown instead
  11. Producer/consumer agreement (refutation hook). dsl/explain (free-mode-IR s) is nil for all s (the free producer never emits an IR the schema rejects)