pocket-es: Specification

1. Scope
2. Two schemas: producer and consumer
3. Index Schema (producer ↔ store)
- 3.1. Invariants
- 3.2. Conformance (confirmed)
4. URL Resolution
- 4.1. Invariant
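5. Tokenization
- 5.1. Invariants (property-testable)
- 5.2. Stopwords
- 5.3. Normalization
6. BM25 Scoring
- 6.1. Invariants
7. Search Request Schema (input ↔ executor)
- 7.1. Closed field set
- 7.2. Invariants
- 7.3. Refutation hook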
8. Query DSL
- 8.1. Suggest
9. State Machine (UI)
- 9.1. Transitions
- 9.2. Invariant
10. Consumers
11. Verification

1. Scope

This document defines what a conforming pocket-es implementation must do, not how. An agent reading this spec should be able to build a compatible system in any language and verify it against the same test fixtures.

2. Two schemas: producer and consumer

pocket-es has exactly two wire contracts, each independently versioned:

Schema	Producer	Consumer	Version field	Current
Index schema	the indexer (`indexer.clj`)	every query runtime (the store)	`_cluster.version`	`2`
Search request schema	input surfaces (free / JSON / phase-B)	`search()` (the executor)	`$schema_version`	`1`

The index schema describes data at rest; the search request schema describes the query in flight. They version separately because they change for different reasons: the index schema changes when the indexer emits new fields, the request schema when search() accepts a new query shape. A conforming deployment documents and validates both.

3. Index Schema (producer ↔ store)

The index is a single JSON file. A conforming index has this shape:

{
  "_cluster": {
    "name":       string,     // index identity
    "version":    integer,    // index-schema version (currently 2)
    "built_at":   string,     // ISO 8601 timestamp
    "git_sha":    string,     // source commit
    "doc_count":  integer,    // total documents
    "vocab_size": integer,    // unique terms after stopword removal
    "avg_dl":     float       // average document length in tokens
  },
  "idf": {                   // term → IDF score (precomputed)
    "<term>": float,
    ...
  },
  "docs": [                   // one entry per indexed org file
    {
      "_id":         string,  // relative path without .org or /index
      "_dir":        boolean, // true if source was index.org (dir-form)
      "title":       string,
      "date":        string,  // YYYY-MM-DD or partial
      "keywords":    [string],
      "description": string,
      "headings":    [string],  // first 15 section headings
      "terms":       {"<term>": integer},  // top 50 term frequencies
      "doc_len":     integer  // total token count
    }
  ],
  "suggest_corpus": [string]  // prefix completion candidates
}

3.1. Invariants

Every _id is unique within docs
_dir is true iff the source file was named index.org
doc_count equals length(docs)
Every term in any doc's terms map exists as a key in idf
avg_dl equals mean(doc.doc_len for doc in docs)
No term in terms is a stopword (see Tokenization)
_cluster.version is present and an integer; a consumer that does not recognize the version refuses the index rather than mis-parsing it
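
The invariants above that depend only on the index file can be checked mechanically. The following is a minimal sketch, not the project's own checker: the name check_index, the stopwords argument, and the float tolerance on avg_dl are assumptions made for illustration. The _dir invariant needs the source tree and is not covered here.

```python
import math


def check_index(index, stopwords=frozenset()):
    """Return a list of violated invariants (an empty list means the index passes).

    `stopwords` is supplied by the caller; the canonical set is not reproduced here.
    """
    errors = []
    cluster = index.get("_cluster", {})
    docs = index.get("docs", [])
    idf = index.get("idf", {})

    # bool is a subclass of int in Python, so exclude it explicitly.
    version = cluster.get("version")
    if not isinstance(version, int) or isinstance(version, bool):
        errors.append("_cluster.version missing or not an integer")

    ids = [d["_id"] for d in docs]
    if len(ids) != len(set(ids)):
        errors.append("duplicate _id")

    if cluster.get("doc_count") != len(docs):
        errors.append("doc_count != length(docs)")

    if docs:
        mean_dl = sum(d["doc_len"] for d in docs) / len(docs)
        # Assumed tolerance: a real indexer may round avg_dl when serializing.
        if not math.isclose(cluster.get("avg_dl", float("nan")), mean_dl, rel_tol=1e-6):
            errors.append("avg_dl != mean(doc_len)")

    for d in docs:
        for term in d["terms"]:
            if term not in idf:
                errors.append(f"{d['_id']}: term {term!r} missing from idf")
            if term in stopwords:
                errors.append(f"{d['_id']}: term {term!r} is a stopword")
    return errors
```

Returning a list of violations rather than raising on the first one lets a verification run report every problem in a single pass.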

3.2. Conformance (confirmed)

The live index at site/static/search-index.json conforms:

$ jq '._cluster | {version, doc_count}' site/static/search-index.json
{ "version": 2, "doc_count": 618 }
$ jq '._cluster.doc_count == (.docs|length)' site/static/search-index.json
true

Schema history:

version	change
1	initial: `_cluster`, `idf`, `docs`, `suggest_corpus`
2	added `_dir` per doc (dir-form URL resolution)

4. URL Resolution

The _id and _dir fields determine the live URL:

`_dir`	URL pattern	Example
`true`	`/<_id>/`	`/research/pocket-es/`
`false`	`/<_id>.html`	`/research/local-first.html`

4.1. Invariant

Every URL produced by this rule MUST return HTTP 200 on the live site. This is the contract boundary between org-publish (which writes files), Apache (which serves them), and the search client (which links to them).

A verification pass:

for doc in index.docs:
    url = "/" + doc._id + ("/" if doc._dir else ".html")
    assert http_get(url).status == 200
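
A runnable variant of the pass above, as a sketch. The names doc_url and broken_urls are illustrative, and fetch_status is a caller-supplied function (an assumption, not part of the spec) that returns the HTTP status code for a path, so the rule can be tested without network access.

```python
def doc_url(doc):
    """Apply the _id/_dir rule: dir-form docs end in '/', the rest in '.html'."""
    return "/" + doc["_id"] + ("/" if doc["_dir"] else ".html")


def broken_urls(docs, fetch_status):
    """Return every resolved URL whose status is not 200.

    fetch_status: callable taking a URL path and returning an integer status.
    """
    urls = (doc_url(doc) for doc in docs)
    return [url for url in urls if fetch_status(url) != 200]
```

Separating the pure URL rule from the HTTP call keeps the rule itself unit-testable; the live check then only needs a real fetch_status.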

5. Tokenization

Tokenization is the shared contract between the indexer and every client. A conforming tokenizer satisfies these invariants:

5.1. Invariants (property-testable)

No stopwords in output. For any input string s: intersection(tokenize(s), STOPWORDS) = {}
All lowercase. For any input string s: every token in tokenize(s) equals lowercase(token)
Idempotent on output. For any input string s: tokenize(join(" ", tokenize(s))) = tokenize(s)
Bounded term frequencies. For any input and limit n: length(term_frequencies(s, n)) <= n

5.2. Stopwords

The canonical stopword list is defined in src/pocket_es/token.cljc. Any conforming implementation must use the same set.

5.3. Normalization

Strip org-mode markup (#+ headers, :PROPERTIES: drawers, source block delimiters)
Split on whitespace and punctuation
Lowercase
Remove tokens shorter than 2 characters
Remove stopwords
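
The five steps above can be sketched as follows. This is an illustration under stated assumptions, not the reference tokenizer: STOPWORDS is a placeholder (the canonical list lives in src/pocket_es/token.cljc), the regular expressions are a simplified reading of "strip org-mode markup" that drops whole #+ lines and whole :PROPERTIES: drawers, and splitting treats every non-ASCII-alphanumeric character as a separator.

```python
import re
from collections import Counter

# PLACEHOLDER set for illustration only; a conforming implementation must use
# the canonical list from src/pocket_es/token.cljc.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})

# Assumed markup rules: a drawer runs from :PROPERTIES: to :END:, and any line
# starting with "#+" (headers, #+begin_src / #+end_src) is dropped entirely.
_DRAWER = re.compile(
    r"^[ \t]*:PROPERTIES:[ \t]*\n.*?^[ \t]*:END:[ \t]*$",
    re.MULTILINE | re.DOTALL | re.IGNORECASE,
)
_KEYWORD_LINE = re.compile(r"^[ \t]*#\+.*$", re.MULTILINE)
_WORD = re.compile(r"[A-Za-z0-9]+")  # simplification: ASCII alphanumerics only


def tokenize(s):
    """Strip markup, split, lowercase, drop short tokens, drop stopwords."""
    s = _DRAWER.sub(" ", s)
    s = _KEYWORD_LINE.sub(" ", s)
    tokens = (w.lower() for w in _WORD.findall(s))
    return [t for t in tokens if len(t) >= 2 and t not in STOPWORDS]


def term_frequencies(s, n):
    """Return at most n (term -> count) entries, most frequent first."""
    return dict(Counter(tokenize(s)).most_common(n))
```

Because every emitted token is already lowercase, at least two characters long, free of markup, and not a stopword, re-tokenizing the joined output returns the same list, which is the idempotence invariant from 5.1.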

6. BM25 Scoring

Parameters: k1 = 1.2, b = 0.75.

For a single term t against document d:

tf    = doc.terms[t] or 0
dl    = doc.doc_len
idf_v = idf[t] or 0
numer = tf * (k1 + 1)
denom = tf + k1 * (1 - b + b * dl / avg_dl)
score = idf_v * numer / denom   (0 if denom == 0)

Multi-term queries sum per-term scores. Field boosts multiply the sum.
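
The formula translates directly into code. The sketch below assumes the index shapes from section 3; the function names bm25_term and bm25, the boost argument, and the guard for avg_dl = 0 (an empty index) are illustrative additions, not part of the spec.

```python
K1 = 1.2
B = 0.75


def bm25_term(term, doc, idf, avg_dl, k1=K1, b=B):
    """Score one term against one document, following the formula above."""
    if not avg_dl:
        # Added guard (assumption): an empty index has avg_dl == 0.
        return 0.0
    tf = doc["terms"].get(term, 0)
    dl = doc["doc_len"]
    idf_v = idf.get(term, 0.0)
    numer = tf * (k1 + 1)
    denom = tf + k1 * (1 - b + b * dl / avg_dl)
    return idf_v * numer / denom if denom else 0.0


def bm25(terms, doc, idf, avg_dl, boost=1.0):
    """Sum per-term scores, then multiply the sum by the field boost."""
    return boost * sum(bm25_term(t, doc, idf, avg_dl) for t in terms)
```

As a worked case: with tf = 2, dl = avg_dl = 100, and idf = 1.5, the numerator is 2 * 2.2 = 4.4, the denominator is 2 + 1.2 * 1 = 3.2, and the score is 1.5 * 4.4 / 3.2, approximately 2.0625.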

6.1. Invariants

Deterministic. Same index + same query = same scores
Zero for absent terms. If tf = 0 then score = 0
Monotonic in tf. Higher term frequency = higher score (given the same dl and the same positive idf)
Longer docs score lower. Given the same tf > 0 and the same positive idf, score(dl=100) > score(dl=200)

7. Search Request Schema (input ↔ executor)

The second wire contract: what search() accepts. Every input surface (free text, JSON, and the phase-B simple/lucene producers) terminates in one IR; this schema is that IR written down. Authoritative source is the malli schema pocket-es.dsl/SearchRequest (see query-surface-spec); this is its language-neutral description.

{
  "$schema_version": 1,          // request-schema version (optional; defaults to 1)
  "query": <Query>,              // required -- exactly one query container
  "size":  integer,              // optional, 1..100
  "from":  integer               // optional, >= 0
}

A <Query> is an object with exactly one of these keys:

key	value shape	behavior
`match`	`{<field>: string}`	tokenize, BM25 per term, sum
`term`	`{<field>: string | [string]}`	exact match on keyword/array field
`prefix`	`{<field>: string}`	prefix scan
`match_all`	`{}`	all docs, score 1.0
`multi_match`	`{query: string, fields: [<field>^<boost>]}`	match across fields, boost-weighted
`bool`	`{must?, should?, filter?, must_not?: [<Query>]}`	≥1 clause required; recurses

7.1. Closed field set

<field> is one of exactly:

_all  title  keywords  description  headings  terms

_all is the union field. terms is the tokenized body (the per-document term frequency map); there is no body field. multi_match field entries may carry an optional ^<number> boost suffix, e.g. title^3.

7.2. Invariants

Single clause. A <Query> object has exactly one key; each key names a different query type. Zero keys, or two or more, is invalid.
Non-empty bool. A bool has at least one of must / should / filter / must_not.
Closed fields. Every <field> is a member of the closed set above. A field outside the set is invalid – including in the boost-suffixed multi_match form.
No phrases. There is no phrase clause; the index stores no positions, so a phrase query that returned results would be a lie. Rejection is structural: no key admits a phrase.
Validate at the producer boundary. The JSON input surface validates against this schema before calling search() and renders the humanized error inline. Free/simple modes construct the IR and are valid by construction (so they are not re-validated on the hot path); the refutation hook below keeps that claim honest.
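
To make the structural rules concrete, here is a minimal validator sketch. It is not the malli schema pocket-es.dsl/SearchRequest; the names validate_query and validate_request and the error strings are hypothetical. Two readings are assumed where this section leaves room: a match/term/prefix field map holds exactly one field, and a bool is accepted as soon as one of its four keys is present (empty clause lists are not rejected here).

```python
import re

FIELDS = frozenset({"_all", "title", "keywords", "description", "headings", "terms"})
BOOL_KEYS = frozenset({"must", "should", "filter", "must_not"})
_BOOSTED = re.compile(r"([A-Za-z_]+)(?:\^\d+(?:\.\d+)?)?")  # e.g. title^3


def validate_query(q):
    """Return None when q is a valid <Query>, else a short error string."""
    if not isinstance(q, dict) or len(q) != 1:
        return "a query must be an object with exactly one key"
    (kind, body), = q.items()

    if kind == "match_all":
        return None if body == {} else "match_all takes an empty object"

    if kind in ("match", "term", "prefix"):
        # Assumption: the field map holds exactly one field.
        if not isinstance(body, dict) or len(body) != 1:
            return f"{kind} takes exactly one field"
        (field, value), = body.items()
        if field not in FIELDS:
            return f"unknown field {field!r}"
        is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if isinstance(value, str) or (kind == "term" and is_str_list):
            return None
        return f"bad value for {kind}"

    if kind == "multi_match":
        if not isinstance(body, dict) or set(body) != {"query", "fields"}:
            return "multi_match takes query and fields"
        if not isinstance(body["query"], str) or not isinstance(body["fields"], list):
            return "multi_match: query is a string, fields is a list"
        for entry in body["fields"]:
            m = _BOOSTED.fullmatch(entry) if isinstance(entry, str) else None
            if m is None or m.group(1) not in FIELDS:
                return f"unknown field {entry!r}"
        return None

    if kind == "bool":
        if not isinstance(body, dict) or not body or not set(body) <= BOOL_KEYS:
            return "bool needs at least one of must/should/filter/must_not"
        for clauses in body.values():
            if not isinstance(clauses, list):
                return "bool clauses must be lists of queries"
            for clause in clauses:
                err = validate_query(clause)
                if err:
                    return err
        return None

    return f"unknown query type {kind!r}"


def validate_request(req):
    """Check the envelope (version, size, from), then the query itself."""
    if not isinstance(req, dict) or "query" not in req:
        return "query is required"
    if req.get("$schema_version", 1) != 1:
        return "unsupported $schema_version"
    size, start = req.get("size"), req.get("from")
    if size is not None and not (type(size) is int and 1 <= size <= 100):
        return "size must be an integer in 1..100"
    if start is not None and not (type(start) is int and start >= 0):
        return "from must be an integer >= 0"
    return validate_query(req["query"])
```

Returning None on success mirrors the refutation hook's convention that a valid IR explains to nil.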

7.3. Refutation hook

For any input string s, dsl/explain (free-mode-IR s) MUST return nil. If it ever returns an error, the free-mode IR constructor and this schema have drifted. This is a test (test.check, see the rollout plan), not a hope.

8. Query DSL

A conforming implementation must support these query types:

Query type	Required fields	Behavior
`match`	`field`, implicit `_all`	tokenize query, BM25 score per term, sum
`term`	`field`	exact match on keyword/array field
`bool`	`must`, `should`, `must_not`, `filter`	intersect/union/exclude/filter
`multi_match`	`query`, `fields` (with optional `^boost`)	match across fields with boost weights
`match_all`	(none)	return all documents, score 1.0
`prefix`	`field`	prefix scan on string or array fields

8.1. Suggest

suggest({text, size}) returns up to size strings from the suggest corpus that start with text (case-insensitive prefix match).
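
A minimal sketch of that behavior, assuming the corpus is passed in explicitly (the spec's suggest takes a single {text, size} argument) and that results keep corpus order; neither the ordering nor the handling of an empty text is fixed by this section.

```python
from itertools import islice


def suggest(corpus, text, size):
    """Return up to `size` corpus entries that start with `text`, ignoring case."""
    needle = text.casefold()
    matches = (c for c in corpus if c.casefold().startswith(needle))
    # max() keeps islice from raising on a negative size.
    return list(islice(matches, max(size, 0)))
```

Using a generator with islice stops scanning the corpus as soon as size matches are found.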

9. State Machine (UI)

The browser UI manages six atoms and a URL with four parameters:

Atom	Type	Default	URL param
`query-text`	string	`""`	`q`
`page`	integer	`0`	`p` (1-indexed in URL)
`date-filter`	string or nil	`nil`	`d`
`json-mode`	boolean	`false`	`j` (`1` when on)
`results`	map or nil	`nil`	(derived)
`suggestions`	vector	`[]`	(derived)

9.1. Transitions

Action	query	date-filter	page	URL method
Type in search box	new	keep	reset	replaceState
Click suggestion (different)	new	keep	reset	pushState
Click suggestion (same/active)	clear	keep	reset	pushState
Click "Try" link	new	reset	reset	pushState
Click keyword tag	new	reset	reset	pushState
Click date filter	keep	new	reset	pushState
Click prev/next	keep	keep	inc/dec	pushState
Press Escape	clear	keep	reset	replaceState
Toggle JSON mode	keep	keep	reset	pushState
Browser back/forward	from URL	from URL	from URL	(popstate)
Page load (init)	from URL	from URL	from URL	replaceState

Toggling JSON mode swaps the input surface (text input ⇄ textarea) and re-runs from the now-active box; it never auto-translates the typed query (INV-6 in the rollout plan). json-mode round-trips through the j URL param like the date filter.

9.2. Invariant

After any transition, sync-url! runs. The URL in the address bar must reflect the current values of q, d, p, and j. A page refresh at that URL must restore identical state.
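
The round-trip property can be sketched as a pair of inverse functions over the four URL-backed atoms from the table in section 9. The names state_to_query and query_to_state are hypothetical, and omitting a parameter when its atom holds the default value is an assumption; the spec only requires that the URL reflect the current state.

```python
from urllib.parse import parse_qs, urlencode


def state_to_query(state):
    """Serialize the URL-backed atoms; default values are omitted (assumption)."""
    params = {}
    if state["query-text"]:
        params["q"] = state["query-text"]
    if state["page"]:
        params["p"] = str(state["page"] + 1)  # atom is 0-indexed, URL is 1-indexed
    if state["date-filter"] is not None:
        params["d"] = state["date-filter"]
    if state["json-mode"]:
        params["j"] = "1"
    return urlencode(params)


def query_to_state(query_string):
    """Restore the atoms from a query string, falling back to the defaults."""
    raw = parse_qs(query_string, keep_blank_values=True)

    def first(key):
        return raw[key][0] if key in raw else None

    page = first("p")
    return {
        "query-text": first("q") or "",
        "page": max(int(page) - 1, 0) if page and page.isdecimal() else 0,
        "date-filter": first("d"),
        "json-mode": first("j") == "1",
    }
```

A property test over arbitrary states can then assert that query_to_state(state_to_query(s)) equals s, which is the refresh-restores-state invariant in executable form.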

10. Consumers

A conforming index supports these access patterns:

Consumer	Runtime	Entry point	Scores with
Browser	ClojureScript	`pocket-es.js` (self-injecting)	BM25 in CLJS
Emacs	Elisp	`pocket-es.el`	BM25 in elisp
Node CLI	Node.js	`scripts/search.js` via `dist/pocket-es.js`	BM25 in compiled CLJS
Babashka	bb	`bb -cp src -m pocket-es.cli`	BM25 in `token.cljc` + `cli.clj`
Console	Browser DevTools	`pocketES.search(...)`	BM25 in CLJS

All consumers must produce identical ranking for the same query against the same index. The BM25 parameters (k1, b) and the tokenizer (stopwords, normalization) are the shared contract.

11. Verification

A conforming deployment passes these checks:

Index integrity. doc_count = length(docs); every term in any doc's terms map is a key in idf
URL resolution. Every _id resolves to HTTP 200 using the _dir rule
Tokenization contract. Property tests pass (4 invariants, 300+ iterations)
Scoring determinism. Same query returns same results across consumers
State round-trip. URL with q, d, p, j params restores identical view
No stopwords in index. No term in any doc's terms map is a stopword
Date sanity. No doc has date matching the AI review date (2024-08-11)
Description tone. No description contains slop markers (delve, comprehensive, explore, etc.)
Index schema version. _cluster.version is present, integer, and recognized
Request schema conformance. Every IR the JSON surface forwards to search() is valid per the Search Request Schema, or search() is never called and an inline error is shown instead
Producer/consumer agreement (refutation hook). dsl/explain (free-mode-IR s) is nil for all s (the free producer never emits an IR the schema rejects)