pocket-es Component Contract Specification
1. Overview
pocket-es is a client-side BM25 search engine for wal.sh. Data flows in one direction: org files → Clojure indexer → JSON artifact → ClojureScript loader → search engine → UI. Each boundary is a typed contract. This document specifies the preconditions, postconditions, and invariants at each boundary.
Components and their canonical file locations:
| Component | Path |
|---|---|
| Indexer | site/research/pocket-es/src/pocket_es/indexer.clj |
| Shared tokenizer | site/research/pocket-es/src/pocket_es/token.cljc |
| Token tests | site/research/pocket-es/test/pocket_es/token_test.cljc |
| Index artifact | site/static/search-index.json |
| Core engine | site/research/pocket-es/src/pocket_es/core.cljs |
| UI | site/research/pocket-es/src/pocket_es/ui.cljs |
| Build entry | site/research/pocket-es/shadow-cljs.edn (:lib build) |
| JS init symbol | pocket-es.ui/init! (mapped to :modules {:pocket-es …}) |
2. Contract 1: Org File → Indexer
The indexer is pocket-es.indexer (src/pocket_es/indexer.clj). It reads org
files via clojure.java.io and tokenizes with the shared pocket-es.token
namespace (src/pocket_es/token.cljc). The same tokenizer compiles to
ClojureScript for the client search engine – there is one tokenization
contract, not two.
Run the indexer:
clj -M:index site/
2.1. Preconditions
- File encoding is UTF-8 (errors are replaced, not fatal).
- File lives under the site/ tree and matches glob **/*.org.
- File is not under any excluded subtree: _drafts/, templates/, includes/, orgparse-examples/
- File contains at least one #+TITLE: header with a non-empty value. This is the sole hard gate; all other headers are optional.
2.2. Required headers
| Header | Requirement | Notes |
|---|---|---|
| #+TITLE | REQUIRED | First occurrence wins; empty value causes doc rejection. |
2.3. Optional headers (used if present)
| Header | Notes |
|---|---|
| #+DATE | Stored verbatim as string; no format enforced. |
| #+KEYWORDS | Split on [,;]+; each token lowercased, stripped. Multi-value OK. |
| #+DESCRIPTION | Stored verbatim; no length limit enforced. |
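The #+KEYWORDS rule can be illustrated with a small sketch. The function name parse-keywords is hypothetical (it is not taken from the indexer source), and "stripped" is read here as trimming surrounding whitespace.

```clojure
(require '[clojure.string :as str])

(defn parse-keywords
  "Hypothetical helper: split a #+KEYWORDS value on runs of commas or
  semicolons, then trim and lowercase each token. Blank tokens are dropped."
  [value]
  (->> (str/split (or value "") #"[,;]+")
       (map (comp str/lower-case str/trim))
       (remove str/blank?)
       vec))

(parse-keywords "BM25; Search,  ClojureScript")
;; => ["bm25" "search" "clojurescript"]
```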
2.4. Postconditions (valid indexable document)
A document is only emitted when:
- #+TITLE is present and non-empty.
- doc_len >= 5 OR keywords is non-empty. (Stubs with fewer than 5 tokens and no keywords are silently dropped.)
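As a minimal sketch, the emission gate can be written as one predicate over a parsed document map. The name indexable? and the map shape (:title, :doc_len, :keywords) are assumptions for illustration, not the indexer's actual internals.

```clojure
(require '[clojure.string :as str])

(defn indexable?
  "Hypothetical predicate: true when the title is non-empty AND the doc
  has at least 5 tokens or at least one keyword."
  [{:keys [title doc_len keywords]}]
  (boolean
   (and (not (str/blank? title))
        (or (>= (or doc_len 0) 5)
            (seq keywords)))))

(indexable? {:title "Stub" :doc_len 3 :keywords []})       ;; => false
(indexable? {:title "Stub" :doc_len 3 :keywords ["bm25"]}) ;; => true
```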
2.5. Invariants
- Only the first occurrence of each header key is kept.
- Org source blocks (#+begin_... / #+end_...) are stripped before tokenization; they do not contribute to term frequencies.
- Org drawers (:NAME: ... :END:) are stripped before tokenization.
- Org links [[target][text]] are replaced by their display text.
- Org markup characters ~=*/+_ are replaced by spaces before tokenization.
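The stripping invariants above could be sketched as a chain of regex replacements. This is an illustration only: strip-org is a hypothetical name, the regular expressions are assumptions rather than the patterns used in token.cljc, and the drawer pattern assumes the opening :NAME: sits alone on its line.

```clojure
(require '[clojure.string :as str])

(defn strip-org
  "Hypothetical pre-tokenization pass: drop blocks and drawers, keep link
  display text, and turn markup characters into spaces."
  [text]
  (-> text
      ;; #+begin_... through the matching #+end_... (case-insensitive, multi-line)
      (str/replace #"(?is)#\+begin_\w+.*?#\+end_\w+" " ")
      ;; :NAME: drawers through the closing :END: line
      (str/replace #"(?ms)^:[A-Za-z_]+:[ \t]*\n.*?^:END:[ \t]*$" " ")
      ;; [[target][text]] -> text
      (str/replace #"\[\[[^\]]*\]\[([^\]]*)\]\]" "$1")
      ;; org markup characters -> space
      (str/replace #"[~=*/+_]" " ")))
```

Order matters in this sketch: links are rewritten before markup characters are replaced, so that the / in a URL target is discarded with the target instead of being turned into a space.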
3. Contract 2: Indexer → Index Artifact (JSON Schema)
The indexer writes to stdout; the Makefile redirects to
site/static/search-index.json.
3.1. Schema version and EDN config
The index artifact version is _cluster.version. This should be
specified in deps.edn (or a dedicated index-config.edn) rather
than hard-coded in the indexer source, so that the contract document,
the build config, and the runtime all agree on the schema version:
;; Proposed: deps.edn :index alias or standalone index-config.edn
{:index-schema-version 2
:output-path "site/static/search-index.json"
:bm25 {:k1 1.2 :b 0.75}
:top-n-terms 50
:min-doc-freq 2
:idf-threshold 3.0}
Currently the version is a literal 2 in pocket-es.indexer. The
loader (Contract 3) checks :version at parse time but has no way to
reject a schema it does not understand – a version bump without a
loader update will silently produce wrong results.
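One way to close this gap is an explicit version gate before any further parsing. The sketch below is an assumption, not existing loader code: check-version and supported-versions are hypothetical names, and the index is modelled as an already-parsed map with string keys and an integer version.

```clojure
;; Hypothetical: the set of schema versions this loader understands.
(def supported-versions #{2})

(defn check-version
  "Return the index unchanged when its schema version is supported;
  otherwise throw with the offending version attached."
  [index]
  (let [v (get-in index ["_cluster" "version"])]
    (if (contains? supported-versions v)
      index
      (throw (ex-info "Unsupported index schema version"
                      {:found v :supported supported-versions})))))
```

Failing loudly here turns a silent mis-scoring into a visible load error whenever the indexer is bumped without a matching loader update.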
3.2. Top-level structure
{
"_cluster": { ... },
"idf": { "<term>": <float>, ... },
"docs": [ { ... }, ... ],
"suggest_corpus": [ "<string>", ... ]
}
All four top-level keys are always present.
3.3. _cluster object
| Field | Type | Required | Constraints |
|---|---|---|---|
| name | string | yes | Always "wal-sh-pocket" |
| version | int | yes | Always 2 |
| built_at | string | yes | ISO 8601 timestamp from java.time.Instant/now |
| git_sha | string | yes | Short SHA from git rev-parse --short HEAD; "unknown" when git is unavailable |
| doc_count | int | yes | Equal to count of emitted docs; must be >= 1 |
| vocab_size | int | yes | Equal to count of idf entries; must be >= 100 |
| avg_dl | float | yes | Mean document length in tokens; must be >= 1.0 |
3.4. docs array elements
| Field | Type | Required | Constraints |
|---|---|---|---|
| _id | string | yes | URL path relative to site root, no extension |
| title | string | yes | Verbatim from #+TITLE; non-empty |
| date | string | yes | Verbatim from #+DATE or "" |
| keywords | array of strings | yes | Lowercased tokens; may be [] |
| description | string | yes | Verbatim from #+DESCRIPTION or "" |
| headings | array of strings | yes | First 15 org headings; may be [] |
| terms | object | yes | Top-50 term→count pairs; count is raw int |
| doc_len | int | yes | Total token count before top-50 truncation |
3.5. idf map
- Keys: vocabulary terms (strings). Values: floats rounded to 3 decimal places.
- Formula: round(log((N - df + 0.5) / (df + 0.5) + 1), 3) (BM25 IDF).
- Inclusion: doc_freq >= 2 OR idf > 3.0.
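The formula and inclusion rule translate directly into code. The sketch below assumes log is the natural logarithm; the names round3, bm25-idf, and include-term? are hypothetical and not taken from indexer/compute-idf.

```clojure
(defn round3
  "Round to 3 decimal places."
  [x]
  (/ (Math/round (double (* 1000.0 x))) 1000.0))

(defn bm25-idf
  "BM25 IDF for a term with document frequency df in a corpus of n docs."
  [n df]
  (round3 (Math/log (+ 1.0 (/ (+ (- n df) 0.5) (+ df 0.5))))))

(defn include-term?
  "Inclusion rule: keep a term if it is frequent enough OR rare enough
  to carry a high IDF."
  [df idf]
  (or (>= df 2) (> idf 3.0)))

(bm25-idf 100 50) ;; => 0.693, i.e. ln(2) rounded to 3 places
```

Under this reading, a term that appears in a single document is still kept once the corpus is large enough for its IDF to exceed 3.0.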
3.6. suggest_corpus array
- Sorted lexicographically. Sources: keyword tokens + title words after stripping non-word characters, lowercasing, filtering stopwords and pure digits.
3.7. Structural assertions
The following invariants must hold for any valid artifact:
| # | Assertion | Source |
|---|---|---|
| 1 | _cluster.version == 2 | indexer/build-index |
| 2 | _cluster.git_sha present and non-empty | indexer/git-sha |
| 3 | doc_count >= 1 | indexer/build-index |
| 4 | vocab_size >= 100 | indexer/compute-idf |
| 5 | avg_dl >= 1.0 | indexer/build-index |
| 6 | idf non-empty | indexer/compute-idf |
| 7 | docs non-empty | indexer/parse-org-file |
| 8 | suggest_corpus non-empty | indexer/build-suggest-corpus |
| 9 | Each doc has _id, title, terms, doc_len | indexer/parse-org-file |
| 10 | len(docs) == doc_count | indexer/build-index |
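These ten assertions can be checked mechanically against a parsed artifact. The following is a sketch under stated assumptions: artifact-violations is a hypothetical function, and the artifact is assumed to be already parsed into Clojure maps and vectors with string keys (the JSON parsing step is left out).

```clojure
(defn artifact-violations
  "Return the numbers (1-10) of the structural assertions that fail.
  An empty vector means the artifact passes all ten."
  [index]
  (let [cluster (get index "_cluster")
        idf     (get index "idf")
        docs    (get index "docs")
        checks  [[1  (= 2 (get cluster "version"))]
                 [2  (boolean (seq (get cluster "git_sha")))]
                 [3  (>= (get cluster "doc_count" 0) 1)]
                 [4  (>= (get cluster "vocab_size" 0) 100)]
                 [5  (>= (get cluster "avg_dl" 0) 1.0)]
                 [6  (boolean (seq idf))]
                 [7  (boolean (seq docs))]
                 [8  (boolean (seq (get index "suggest_corpus")))]
                 [9  (every? (fn [doc]
                               (every? (partial contains? doc)
                                       ["_id" "title" "terms" "doc_len"]))
                             docs)]
                 [10 (= (count docs) (get cluster "doc_count"))]]]
    (vec (for [[n ok?] checks :when (not ok?)] n))))
```

Returning the failing assertion numbers, rather than a single boolean, keeps a failed build traceable to the rows of the table above.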
4. Contract 3: Loader (parse-index)
4.1. Internal IndexMap structure (CLJS)
{:cluster {:name string, :version 2, :built-at string,
:git-sha string, ; from git rev-parse --short HEAD
:doc-count number, :vocab-size number, :avg-dl number,
:index-size-kb number} ; added at load time
:idf {string → number} ; string keys
:avg-dl number ; promoted to top-level
:docs [{:_id string, :title string, :date string,
:keywords [string], :description string,
:headings [string],
:terms {string → number}, ; string keys
:doc_len number} ...] ; snake_case preserved
:suggest-corpus [string ...]}
4.2. Key decisions
- No deep js->clj on the full blob. Each top-level key is accessed via unchecked-get and converted independently.
- :doc_len is snake_case (not kebab-case) – intentional, matches JSON.
- :avg-dl appears at both top-level and [:cluster :avg-dl]. Scoring uses top-level; cluster map is for display only.
- :terms values are raw JS numbers (shallow conversion).
4.3. Index URL resolution
- <script data-index-url="…"> attribute – explicit override.
- location.port == "8700" → /search-index.json (dev server).
- Default → /static/search-index.json (published site).
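The resolution order can be sketched as a pure function. This is an assumption about structure, not the loader's code: resolve-index-url is a hypothetical name, and the script attribute and location.port are passed in as plain values so the logic can be tested without a browser.

```clojure
(defn resolve-index-url
  "Pick the index URL: explicit override first, then the dev-server
  port check, then the published-site default."
  [{:keys [data-index-url port]}]
  (cond
    (seq data-index-url) data-index-url
    (= port "8700")      "/search-index.json"
    :else                "/static/search-index.json"))

(resolve-index-url {:port "8700"}) ;; => "/search-index.json"
```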
5. Contract 4: Search API
5.1. search
(search idx request) → JS {hits: {total: int, hits: [{_id, _score, _source}]}}
- _score is Math.round(raw_score * 100) – always a non-negative integer.
- Results sorted descending by _score. Only score > 0 included.
- Default _source strips :terms and :doc_len.
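A minimal sketch of the score post-processing follows. The function name ->hits and the input shape (a sequence of [id raw-score] pairs) are hypothetical, and the sketch assumes the "score > 0" filter applies to the raw score before rounding, which the contract does not state explicitly.

```clojure
(defn ->hits
  "Drop non-positive raw scores, convert to integer _score, and sort
  descending by _score."
  [scored]
  (->> scored
       (filter (fn [[_ raw]] (pos? raw)))
       (map (fn [[id raw]]
              {:_id id :_score (Math/round (* 100.0 (double raw)))}))
       (sort-by :_score >)
       vec))

(->hits [["a" 0.0] ["b" 1.234] ["c" 2.5]])
;; => [{:_id "c", :_score 250} {:_id "b", :_score 123}]
```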
5.2. Query DSL dispatch
| Type | Shape | Scoring |
|---|---|---|
| match_all | {} | 1.0 for every doc |
| match | {field: text} | Field-dependent (title ×3, kw ×2, headings ×2, else BM25) |
| term | {field: value} | Exact match: 1.0 or 0.0 |
| multi_match | {query, fields} | Sum across fields with ^N boosts |
| bool | {must, should, filter, must_not} | must + should; filter/must_not gate only |
| prefix | {field: prefix} | Case-insensitive prefix: 1.0 or 0.0 |
| range | {field: {gte,gt,lte,lt}} | In-range: 1.0 or 0.0 (date field only) |
| unknown | any | Returns 0 (no error) |
BM25 parameters: k1=1.2, b=0.75 (constants).
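For orientation, this is how k1 and b enter the standard Okapi BM25 per-term score. The contract does not spell out the engine's exact scoring expression, so the sketch below is the textbook form and should be read as an assumption; bm25-term-score is a hypothetical name.

```clojure
(def k1 1.2)
(def b 0.75)

(defn bm25-term-score
  "Textbook BM25 contribution of one term: idf scaled by a saturating,
  length-normalized term frequency."
  [idf tf doc-len avg-dl]
  (let [norm (+ (- 1.0 b) (* b (/ (double doc-len) avg-dl)))]
    (* idf (/ (* tf (+ k1 1.0))
              (+ tf (* k1 norm))))))
```

In this form, a term that occurs once in a document of exactly average length contributes its IDF unchanged, and longer documents are penalized in proportion to b.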
5.2.1. range bounds: dates and date-math
The range filter compares the doc's date (a YYYY-MM-DD string) against the
bounds lexicographically – which equals chronological order for ISO dates. A doc
whose date is unparseable scores 0 (it never matches a range filter).
Each bound (gte, gt, lte, lt) is resolved by pocket-es.date/resolve-bound
before comparison:
| Bound form | Resolves to | Notes |
|---|---|---|
| absent | no constraint | |
| "2026-06-08" | 2026-06-08 | literal ISO, extracted |
| "now", "now/d" | today | /d round-to-day is a no-op here |
| "now-7d" | today − 7 days | unit d days, w weeks (7d) |
| "now-1w" | today − 7 days | |
| "now+1d" | today + 1 day | offset may be negative or positive |
| anything else | returned verbatim (legacy) | lexicographic compare, may be wrong |
Grammar: now ( [+-] N (d|w) )? ( /d )?. Unsupported forms (now/w, now-1M,
||-anchored math, time-of-day) deliberately do not resolve – they fall to
the verbatim branch, leaving the filter unconstrained rather than silently
mis-dated. Resolution is pure: today is injected (resolve-bound bound today),
never read from a clock, so the contract is deterministic and property-tested
(test/pocket_es/date_test.cljc). The browser engine supplies the clock via
date/today-iso; the JVM CLI does not implement range at all.
The motivating use: lte: "now/d" excludes future-dated placeholders (an
upcoming-event note dated months ahead) from a "recent" query, so "recent"
surfaces only what has actually happened up to today.
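The grammar and resolution table can be sketched as follows. The (resolve-bound bound today) argument order comes from the text above; the body is a guess at one possible implementation, written for the JVM only (it uses java.time.LocalDate), and is not the code in pocket-es.date.

```clojure
(defn resolve-bound
  "Resolve a range bound to a YYYY-MM-DD string, given today's date as
  a YYYY-MM-DD string. Returns nil for an absent bound."
  [bound today]
  (if-let [[_ sign n unit]
           (and (string? bound)
                (re-matches #"now(?:([+-])(\d+)([dw]))?(?:/d)?" bound))]
    ;; date-math branch: now, now/d, now-7d, now+1w, ...
    (let [days (if n
                 (* (Long/parseLong n)
                    (if (= unit "w") 7 1)
                    (if (= sign "-") -1 1))
                 0)]
      (str (.plusDays (java.time.LocalDate/parse today) days)))
    ;; literal ISO date is extracted; anything else is returned verbatim
    (when (some? bound)
      (or (re-find #"\d{4}-\d{2}-\d{2}" bound) bound))))

(resolve-bound "now-7d" "2026-06-08") ;; => "2026-06-01"
```

Because today is an argument, a query such as lte: "now/d" can be tested against any fixed date without touching a clock.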
5.3. suggest
(suggest idx {text: "prefix", size: 10}) → JS ["match1", "match2", ...]
Prefix match against sorted :suggest-corpus. Case-insensitive.
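A minimal sketch of the prefix lookup, assuming the corpus entries are already lowercased (as the suggest_corpus sources are) and that size defaults to 10. The linear scan and the map-based argument shape are simplifications for illustration; the real function takes a JS request object.

```clojure
(require '[clojure.string :as str])

(defn suggest
  "Return up to `size` corpus entries that start with the lowercased
  prefix, in corpus (sorted) order."
  [corpus {:keys [text size] :or {size 10}}]
  (let [prefix (str/lower-case (or text ""))]
    (->> corpus
         (filter #(str/starts-with? % prefix))
         (take size)
         vec)))

(suggest ["bm25" "clojure" "clojurescript" "search"] {:text "Clo"})
;; => ["clojure" "clojurescript"]
```

Since the corpus is sorted, all matches for a prefix are contiguous, so a binary search for the first match would also work; the scan above favors brevity.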
6. Contract 5: UI Component
6.1. Mount point resolution (first match wins)
- #pocket-es-root – standalone HTML
- #content – org-published page
- .outline-2 – fallback org layout
- document.body – final fallback
6.2. Input handling
| Event | Behavior |
|---|---|
| input | 150ms debounce, then do-search! |
| Escape | Clear input and results immediately |
| Tab | Expand plain text to JSON query template; cursor in quotes |
| click | On suggestion pill: fill input and search immediately |
6.3. Query mode dispatch
- Input starts with { and parses as JSON → raw ES request.
- Otherwise → multi_match with ["title^3", "description", "headings^2", "terms"].
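The ^N suffix in the default field list is a per-field boost. A sketch of how such a spec could be split into a field name and a multiplier is shown below; parse-field-boost is a hypothetical name, the missing-suffix default of 1.0 is an assumption, and Double/parseDouble makes this JVM-only.

```clojure
(require '[clojure.string :as str])

(defn parse-field-boost
  "Split \"field^N\" into [field boost]; boost defaults to 1.0."
  [spec]
  (let [[field boost] (str/split spec #"\^" 2)]
    [field (if boost (Double/parseDouble boost) 1.0)]))

(mapv parse-field-boost ["title^3" "description" "headings^2" "terms"])
;; => [["title" 3.0] ["description" 1.0] ["headings" 2.0] ["terms" 1.0]]
```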
6.4. Console API (window.pocketES)
search(req), suggest(req), cluster.health(), cat.indices(), cat.count(), _idx.
7. Contract 6: Tokenizer Test Suite
The shared tokenizer is covered by test/pocket_es/token_test.cljc. The
.cljc extension means the same test file runs on both the JVM (via
clojure.test) and in the browser (via cljs.test), verifying that
CLJ and CLJS produce identical token streams.
Run on the JVM:
clj -M:test
7.1. Example-based tests
| Test name | What it verifies |
|---|---|
| tokenize-basic | nil/empty/blank → []; stopword removal; lowercasing; short-word filter with allowlist; org markup stripping; dash collapsing |
| tokenize-org-blocks | #+begin_src … #+end_src stripped before tokenization; :PROPERTIES: … :END: drawers stripped |
| parse-org-headers-test | Key→value extraction; first-occurrence-wins semantics |
| term-frequencies-test | Correct counts; top-n limit respected |
7.2. Property-based tests (test.check)
| Spec name | Trials | Property |
|---|---|---|
| tokenize-never-returns-stopwords | 100 | No token is a member of token/stopwords |
| tokenize-all-lowercase | 100 | Every token satisfies (= t (str/lower-case t)) |
| tokenize-idempotent-on-output | 50 | Tokens from second pass are a subset of first-pass tokens |
| term-frequencies-bounded | 50 | (count tf) <= n for any n |
8. Appendix: Known Asymmetries
Asymmetries A1 and A2 from v1 – diverging stopword sets and the
SHORT_ALLOWLIST gap – are resolved in v2. The shared token.cljc
namespace is the single source of truth for both the indexer and the
client search engine. No per-platform divergence remains.
| ID | Issue | Status |
|---|---|---|
| A3 | :doc_len is snake_case – sole exception to kebab-case convention. | Open |