pocket-es Component Contract Specification
1. Overview
pocket-es is a client-side BM25 search engine for wal.sh. Data flows in one direction: org files → Clojure indexer → JSON artifact → ClojureScript loader → search engine → UI. Each boundary is a typed contract. This document specifies the preconditions, postconditions, and invariants at each boundary.
Components and their canonical file locations:
| Component | Path |
|---|---|
| Indexer | site/research/pocket-es/src/pocket_es/indexer.clj |
| Shared tokenizer | site/research/pocket-es/src/pocket_es/token.cljc |
| Token tests | site/research/pocket-es/test/pocket_es/token_test.cljc |
| Index artifact | site/static/search-index.json |
| Core engine | site/research/pocket-es/src/pocket_es/core.cljs |
| UI | site/research/pocket-es/src/pocket_es/ui.cljs |
| Build entry | site/research/pocket-es/shadow-cljs.edn (:lib build) |
| JS init symbol | pocket-es.ui/init! (mapped to :modules {:pocket-es …}) |
2. Contract 1: Org File → Indexer
The indexer is pocket-es.indexer (src/pocket_es/indexer.clj). It reads org
files via clojure.java.io and tokenizes with the shared pocket-es.token
namespace (src/pocket_es/token.cljc). The same tokenizer compiles to
ClojureScript for the client search engine — there is one tokenization
contract, not two.
Run the indexer:
clj -M:index site/
2.1. Preconditions
- File encoding is UTF-8 (errors are replaced, not fatal).
- File lives under the site/ tree and matches glob **/*.org.
- File is not under any excluded subtree: _drafts/, templates/, includes/, orgparse-examples/.
- File contains at least one #+TITLE: header with a non-empty value. This is the sole hard gate; all other headers are optional.
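The preconditions above can be sketched as a single eligibility check. This is a Python illustration only, not the Clojure indexer: the function name is_indexable is hypothetical, and it assumes that exclusion applies at any directory depth and that the header match is case-sensitive.

```python
import re
from pathlib import PurePosixPath

# Excluded subtrees, as listed in the preconditions.
EXCLUDED_DIRS = {"_drafts", "templates", "includes", "orgparse-examples"}

# A #+TITLE: header whose value has at least one non-whitespace character.
TITLE_RE = re.compile(r"^#\+TITLE:[ \t]*(\S.*)$", re.MULTILINE)


def is_indexable(rel_path: str, raw: bytes) -> bool:
    """Return True when a file under site/ meets the preconditions (sketch)."""
    path = PurePosixPath(rel_path)
    if path.suffix != ".org":
        return False
    # Assumption: a file is excluded if any parent directory is in the list.
    if EXCLUDED_DIRS.intersection(path.parts[:-1]):
        return False
    # Invalid UTF-8 bytes are replaced rather than raising an error.
    text = raw.decode("utf-8", errors="replace")
    return TITLE_RE.search(text) is not None
```

A file with a blank #+TITLE: value fails the check, because the pattern requires a non-whitespace character on the same line.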
2.2. Required headers
| Header | Requirement | Notes |
|---|---|---|
| #+TITLE | REQUIRED | First occurrence wins; empty value causes doc rejection. |
2.3. Optional headers (used if present)
| Header | Notes |
|---|---|
| #+DATE | Stored verbatim as string; no format enforced. |
| #+KEYWORDS | Split on [,;]+; each token lowercased, stripped. Multi-value OK. |
| #+DESCRIPTION | Stored verbatim; no length limit enforced. |
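As an illustration of the #+KEYWORDS rule, the following Python sketch splits on runs of commas and semicolons, then strips and lowercases each piece. The name parse_keywords is hypothetical, and dropping empty pieces is an assumption the table does not state.

```python
import re


def parse_keywords(value: str) -> list[str]:
    """Split a #+KEYWORDS value on [,;]+, then strip and lowercase each piece."""
    parts = re.split(r"[,;]+", value)
    # Assumption: pieces that are empty after stripping are discarded.
    return [p.strip().lower() for p in parts if p.strip()]
```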
2.4. Postconditions (valid indexable document)
A document is only emitted when:
- #+TITLE is present and non-empty.
- doc_len >= 5 OR keywords is non-empty. (Stubs with fewer than 5 tokens and no keywords are silently dropped.)
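The two conditions combine into one predicate. A minimal Python sketch, with the hypothetical name should_emit:

```python
def should_emit(title: str, doc_len: int, keywords: list[str]) -> bool:
    """Emit a document only if it has a title and is not an empty stub."""
    has_title = bool(title.strip())
    # A short document is still emitted when it carries keywords.
    has_substance = doc_len >= 5 or len(keywords) > 0
    return has_title and has_substance
```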
2.5. Invariants
- Only the first occurrence of each header key is kept.
- Org source blocks (#+begin_... / #+end_...) are stripped before tokenization; they do not contribute to term frequencies.
- Org drawers (:NAME: ... :END:) are stripped before tokenization.
- Org links [[target][text]] are replaced by their display text.
- Org markup characters ~=*/+_ are replaced by spaces before tokenization.
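The stripping invariants can be approximated with four regular expressions. This Python sketch is an assumption-laden illustration, not the patterns used in token.cljc: the name clean_org, the exact regexes, and the order of the passes are all hypothetical, and links without display text ([[target]]) are not handled.

```python
import re

# Blocks: from a #+begin_xxx line through the matching #+end_xxx line.
BLOCK_RE = re.compile(
    r"^[ \t]*#\+begin_\w+.*?^[ \t]*#\+end_\w+[^\n]*$",
    re.IGNORECASE | re.MULTILINE | re.DOTALL,
)
# Drawers: from a :NAME: line through the closing :END: line.
DRAWER_RE = re.compile(
    r"^[ \t]*:[A-Za-z_-]+:[ \t]*\n.*?^[ \t]*:END:[ \t]*$",
    re.MULTILINE | re.DOTALL,
)
# Links with display text: [[target][text]] -> text.
LINK_RE = re.compile(r"\[\[[^\]]+\]\[([^\]]+)\]\]")
# Markup characters that are replaced by spaces.
MARKUP_RE = re.compile(r"[~=*/+_]")


def clean_org(text: str) -> str:
    """Apply the four stripping rules before tokenization (sketch)."""
    text = BLOCK_RE.sub(" ", text)
    text = DRAWER_RE.sub(" ", text)
    text = LINK_RE.sub(r"\1", text)
    return MARKUP_RE.sub(" ", text)
```

Blocks and drawers are removed first so that the markup pass cannot damage their delimiters before they are matched.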
3. Contract 2: Indexer → Index Artifact (JSON Schema)
The indexer writes to stdout; the Makefile redirects to
site/static/search-index.json.
3.1. Top-level structure
{
"_cluster": { ... },
"idf": { "<term>": <float>, ... },
"docs": [ { ... }, ... ],
"suggest_corpus": [ "<string>", ... ]
}
All four top-level keys are always present.
3.2. _cluster object
| Field | Type | Required | Constraints |
|---|---|---|---|
| name | string | yes | Always "wal-sh-pocket" |
| version | int | yes | Always 2 |
| built_at | string | yes | ISO 8601 timestamp from java.time.Instant/now |
| git_sha | string | yes | Short SHA from git rev-parse --short HEAD; "unknown" when git is unavailable |
| doc_count | int | yes | Equal to count of emitted docs; must be >= 1 |
| vocab_size | int | yes | Equal to count of idf entries; must be >= 100 |
| avg_dl | float | yes | Mean document length in tokens; must be >= 1.0 |
3.3. docs array elements
| Field | Type | Required | Constraints |
|---|---|---|---|
| _id | string | yes | URL path relative to site root, no extension |
| title | string | yes | Verbatim from #+TITLE; non-empty |
| date | string | yes | Verbatim from #+DATE or "" |
| keywords | array of strings | yes | Lowercased tokens; may be [] |
| description | string | yes | Verbatim from #+DESCRIPTION or "" |
| headings | array of strings | yes | First 15 org headings; may be [] |
| terms | object | yes | Top-50 term→count pairs; count is raw int |
| doc_len | int | yes | Total token count before top-50 truncation |
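The relationship between terms and doc_len is easy to get wrong: doc_len counts every token, while terms keeps only the most frequent entries. A Python sketch, assuming a hypothetical helper term_stats and leaving tie-breaking among equally frequent terms unspecified:

```python
from collections import Counter


def term_stats(tokens: list[str], top_n: int = 50) -> tuple[dict[str, int], int]:
    """Return (top-n term counts, total token count before truncation)."""
    counts = Counter(tokens)
    terms = dict(counts.most_common(top_n))
    # doc_len is measured on the full token stream, not on the truncated map.
    return terms, len(tokens)
```

Because of the truncation, the sum of the counts in terms can be smaller than doc_len for long documents.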
3.4. idf map
- Keys: vocabulary terms (strings). Values: floats rounded to 3 decimal places.
- Formula: round(log((N - df + 0.5) / (df + 0.5) + 1), 3) (BM25 IDF), where N is the number of emitted docs and df is the number of docs containing the term.
- Inclusion: doc_freq >= 2 OR idf > 3.0.
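The formula and the inclusion rule translate directly into Python. This sketch assumes log is the natural logarithm; the function names bm25_idf and keep_term are hypothetical.

```python
import math


def bm25_idf(n_docs: int, doc_freq: int) -> float:
    """BM25 IDF, rounded to 3 decimal places as in the artifact."""
    # Assumption: log is the natural logarithm.
    return round(math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1), 3)


def keep_term(n_docs: int, doc_freq: int) -> bool:
    """A term enters the idf map if it is shared or sufficiently rare."""
    return doc_freq >= 2 or bm25_idf(n_docs, doc_freq) > 3.0
```

The + 1 inside the logarithm keeps the value positive even for a term that occurs in every document. Under this reading, a term seen in only one document survives only in a corpus large enough for its IDF to exceed 3.0.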
3.5. suggest_corpus array
- Sorted lexicographically. Sources: keyword tokens + title words after stripping non-word characters, lowercasing, filtering stopwords and pure digits.
3.6. Structural assertions
The following invariants must hold for any valid artifact:
| # | Assertion | Source |
|---|---|---|
| 1 | _cluster.version == 2 | indexer/build-index |
| 2 | _cluster.git_sha present and non-empty | indexer/git-sha |
| 3 | doc_count >= 1 | indexer/build-index |
| 4 | vocab_size >= 100 | indexer/compute-idf |
| 5 | avg_dl >= 1.0 | indexer/build-index |
| 6 | idf non-empty | indexer/compute-idf |
| 7 | docs non-empty | indexer/parse-org-file |
| 8 | suggest_corpus non-empty | indexer/build-suggest-corpus |
| 9 | Each doc has _id, title, terms, doc_len | indexer/parse-org-file |
| 10 | len(docs) == doc_count | indexer/build-index |
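The ten assertions can be checked mechanically against a parsed artifact. The following Python sketch is one possible validator, not part of the pocket-es build; check_artifact is a hypothetical name, and it returns the numbers of the assertions that fail.

```python
def check_artifact(index: dict) -> list[int]:
    """Return the numbers of the structural assertions that do not hold."""
    cluster = index.get("_cluster", {})
    docs = index.get("docs", [])
    required = ("_id", "title", "terms", "doc_len")
    checks = {
        1: cluster.get("version") == 2,
        2: bool(cluster.get("git_sha")),
        3: cluster.get("doc_count", 0) >= 1,
        4: cluster.get("vocab_size", 0) >= 100,
        5: cluster.get("avg_dl", 0.0) >= 1.0,
        6: len(index.get("idf", {})) > 0,
        7: len(docs) > 0,
        8: len(index.get("suggest_corpus", [])) > 0,
        9: all(all(key in doc for key in required) for doc in docs),
        10: len(docs) == cluster.get("doc_count"),
    }
    return [number for number, ok in checks.items() if not ok]
```

An empty result means the artifact satisfies every assertion in the table.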
4. Contract 3: Loader (parse-index)
4.1. Internal IndexMap structure (CLJS)
{:cluster {:name string, :version 2, :built-at string,
:git-sha string, ; from git rev-parse --short HEAD
:doc-count number, :vocab-size number, :avg-dl number,
:index-size-kb number} ; added at load time
:idf {string → number} ; string keys
:avg-dl number ; promoted to top-level
:docs [{:_id string, :title string, :date string,
:keywords [string], :description string,
:headings [string],
:terms {string → number}, ; string keys
:doc_len number} ...] ; snake_case preserved
:suggest-corpus [string ...]}
4.2. Key decisions
- No deep js->clj on the full blob. Each top-level key is accessed via unchecked-get and converted independently.
- :doc_len is snake_case (not kebab-case); this is intentional and matches the JSON.
- :avg-dl appears at both top-level and [:cluster :avg-dl]. Scoring uses top-level; cluster map is for display only.
- :terms values are raw JS numbers (shallow conversion).
4.3. Index URL resolution
- <script data-index-url="…"> attribute: explicit override.
- location.port == "8700" → /search-index.json (dev server).
- Default → /static/search-index.json (published site).
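Read as a priority order, the resolution rules reduce to a three-way branch. This Python sketch mirrors the logic outside the browser; resolve_index_url and its parameters are hypothetical stand-ins for the script attribute and location.port, and it assumes the rules are evaluated top to bottom.

```python
from typing import Optional


def resolve_index_url(data_index_url: Optional[str], port: str) -> str:
    """Pick the index URL: explicit override, then dev server, then default."""
    if data_index_url:
        return data_index_url
    if port == "8700":
        return "/search-index.json"
    return "/static/search-index.json"
```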
5. Contract 4: Search API
5.1. search
(search idx request) → JS {hits: {total: int, hits: [{_id, _score, _source}]}}
- _score is Math.round(raw_score * 100), always a non-negative integer.
- Results sorted descending by _score. Only score > 0 included.
- Default _source strips :terms and :doc_len.
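The response shape and the three rules above can be sketched in Python as follows. The helper build_hits is hypothetical. It assumes the score > 0 filter applies to the raw score before rounding and that total equals the number of returned hits. Math.round is emulated with floor(x + 0.5), because Python's built-in round uses banker's rounding.

```python
import math


def build_hits(raw_scores: dict[str, float], sources: dict[str, dict]) -> dict:
    """Assemble an ES-style response from raw per-document scores (sketch)."""
    hits = []
    for doc_id, raw in raw_scores.items():
        if raw <= 0:
            continue  # assumption: the filter is applied to the raw score
        source = {k: v for k, v in sources[doc_id].items()
                  if k not in ("terms", "doc_len")}
        hits.append({
            "_id": doc_id,
            "_score": math.floor(raw * 100 + 0.5),  # emulates Math.round
            "_source": source,
        })
    hits.sort(key=lambda hit: hit["_score"], reverse=True)
    return {"hits": {"total": len(hits), "hits": hits}}
```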
5.2. Query DSL dispatch
| Type | Shape | Scoring |
|---|---|---|
| match_all | {} | 1.0 for every doc |
| match | {field: text} | Field-dependent (title ×3, kw ×2, headings ×2, else BM25) |
| term | {field: value} | Exact match: 1.0 or 0.0 |
| multi_match | {query, fields} | Sum across fields with ^N boosts |
| bool | {must, should, filter, must_not} | must + should; filter/must_not gate only |
| prefix | {field: prefix} | Case-insensitive prefix: 1.0 or 0.0 |
| unknown | any | Returns 0 (no error) |
BM25 parameters: k1=1.2, b=0.75 (constants).
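For orientation, here is how those constants enter the textbook BM25 scoring function. This Python sketch assumes the standard formula and the artifact fields terms, doc_len, idf and avg_dl; the spec does not confirm that core.cljs uses exactly this form, and field boosts are not modeled. The name bm25_score is hypothetical.

```python
K1 = 1.2
B = 0.75


def bm25_score(query_terms: list[str], doc: dict,
               idf: dict[str, float], avg_dl: float) -> float:
    """Textbook BM25 over a doc's stored term counts (sketch)."""
    # Length normalization: longer-than-average documents are penalized.
    norm = K1 * (1 - B + B * doc["doc_len"] / avg_dl)
    score = 0.0
    for term in query_terms:
        tf = doc["terms"].get(term, 0)
        if tf == 0:
            continue
        score += idf.get(term, 0.0) * tf * (K1 + 1) / (tf + norm)
    return score
```

Because terms is truncated to the top 50 entries, a query term that occurs in a document but falls outside that cut contributes nothing under this sketch.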
5.3. suggest
(suggest idx {text: "prefix", size: 10}) → JS ["match1", "match2", ...]
Prefix match against sorted :suggest-corpus. Case-insensitive.
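Because the corpus is sorted, a prefix lookup can start from a binary search and stop at the first non-matching entry. The Python sketch below illustrates that idea; whether the engine uses a binary search or a linear scan is not stated in this contract, and the function signature is hypothetical. It assumes corpus entries are already lowercase.

```python
from bisect import bisect_left


def suggest(corpus: list[str], text: str, size: int = 10) -> list[str]:
    """Return up to `size` corpus entries that start with `text` (sketch)."""
    prefix = text.lower()
    start = bisect_left(corpus, prefix)  # first entry >= prefix
    matches = []
    for word in corpus[start:]:
        if len(matches) >= size or not word.startswith(prefix):
            break
        matches.append(word)
    return matches
```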
6. Contract 5: UI Component
6.1. Mount point resolution (first match wins)
- #pocket-es-root: standalone HTML
- #content: org-published page
- .outline-2: fallback org layout
- document.body: final fallback
6.2. Input handling
| Event | Behavior |
|---|---|
| input | 150ms debounce, then do-search! |
| Escape | Clear input and results immediately |
| Tab | Expand plain text to JSON query template; cursor in quotes |
| click | On suggestion pill: fill input and search immediately |
6.3. Query mode dispatch
- Input starts with { and parses as JSON → raw ES request.
- Otherwise → multi_match with ["title^3", "description", "headings^2", "terms"].
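The dispatch amounts to "try JSON, fall back to multi_match". A Python sketch follows, with the hypothetical name build_request. The {"query": {"multi_match": ...}} envelope follows the Elasticsearch convention and is an assumption about the request shape ui.cljs builds.

```python
import json

DEFAULT_FIELDS = ["title^3", "description", "headings^2", "terms"]


def build_request(text: str) -> dict:
    """Turn search-box input into a request map (sketch)."""
    if text.startswith("{"):
        try:
            return json.loads(text)  # raw ES-style request
        except json.JSONDecodeError:
            pass  # not valid JSON: treat it as plain text
    # Assumption: ES-style envelope around the multi_match clause.
    return {"query": {"multi_match": {"query": text, "fields": DEFAULT_FIELDS}}}
```

Input that begins with { but fails to parse falls through to the plain-text branch, which matches the "and parses as JSON" condition.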
6.4. Console API (window.pocketES)
search(req), suggest(req), cluster.health(), cat.indices(), cat.count(), _idx.
7. Contract 6: Tokenizer Test Suite
The shared tokenizer is covered by test/pocket_es/token_test.cljc. The
.cljc extension means the same test file runs on both the JVM (via
clojure.test) and in the browser (via cljs.test), verifying that
CLJ and CLJS produce identical token streams.
Run on the JVM:
clj -M:test
7.1. Example-based tests
| Test name | What it verifies |
|---|---|
| tokenize-basic | nil/empty/blank → []; stopword removal; lowercasing; short-word filter with allowlist; org markup stripping; dash collapsing |
| tokenize-org-blocks | #+begin_src … #+end_src stripped before tokenization; :PROPERTIES: … :END: drawers stripped |
| parse-org-headers-test | Key→value extraction; first-occurrence-wins semantics |
| term-frequencies-test | Correct counts; top-n limit respected |
7.2. Property-based tests (test.check)
| Spec name | Trials | Property |
|---|---|---|
| tokenize-never-returns-stopwords | 100 | No token is a member of token/stopwords |
| tokenize-all-lowercase | 100 | Every token satisfies (= t (str/lower-case t)) |
| tokenize-idempotent-on-output | 50 | Tokens from second pass are a subset of first-pass tokens |
| term-frequencies-bounded | 50 | (count tf) <= n for any n |
8. Appendix: Known Asymmetries
Asymmetries A1 and A2 from v1 — diverging stopword sets and the
SHORT_ALLOWLIST gap — are resolved in v2. The shared token.cljc
namespace is the single source of truth for both the indexer and the
client search engine. No per-platform divergence remains.
| ID | Issue | Status |
|---|---|---|
| A3 | :doc_len is snake_case; sole exception to kebab-case convention. | Open |