pocket-es Component Contract Specification

1. Overview
2. Contract 1: Org File → Indexer
3. Contract 2: Indexer → Index Artifact (JSON Schema)
4. Contract 3: Loader (parse-index)
5. Contract 4: Search API
6. Contract 5: UI Component
7. Contract 6: Tokenizer Test Suite
- 7.1. Example-based tests
- 7.2. Property-based tests (test.check)
8. Appendix: Known Asymmetries

1. Overview

pocket-es is a client-side BM25 search engine for wal.sh. Data flows in one direction: org files → Clojure indexer → JSON artifact → ClojureScript loader → search engine → UI. Each boundary is a typed contract. This document specifies the preconditions, postconditions, and invariants at each boundary.

Components and their canonical file locations:

Component	Path
Indexer	site/research/pocket-es/src/pocket_es/indexer.clj
Shared tokenizer	site/research/pocket-es/src/pocket_es/token.cljc
Token tests	site/research/pocket-es/test/pocket_es/token_test.cljc
Index artifact	site/static/search-index.json
Core engine	site/research/pocket-es/src/pocket_es/core.cljs
UI	site/research/pocket-es/src/pocket_es/ui.cljs
Build entry	site/research/pocket-es/shadow-cljs.edn (:lib build)
JS init symbol	pocket-es.ui/init! (mapped to :modules {:pocket-es …})

2. Contract 1: Org File → Indexer

The indexer is pocket-es.indexer (src/pocket_es/indexer.clj). It reads org files via clojure.java.io and tokenizes with the shared pocket-es.token namespace (src/pocket_es/token.cljc). The same tokenizer compiles to ClojureScript for the client search engine — there is one tokenization contract, not two.

Run the indexer:

clj -M:index site/

2.1. Preconditions

File encoding is UTF-8 (undecodable byte sequences are replaced rather than treated as fatal).
File lives under the site/ tree and matches glob **/*.org.
File is not under any excluded subtree:
- _drafts/
- templates/
- includes/
- orgparse-examples/
File contains at least one #+TITLE: header with a non-empty value. This is the sole hard gate; all other headers are optional.
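
The path-related preconditions can be sketched as a single predicate. This is a minimal illustration, not the indexer's actual code: the name indexable-path? is hypothetical, paths are assumed to be relative to site/ with forward slashes, and excluded directories are assumed to apply at any depth.

```clojure
(require '[clojure.string :as str])

;; Excluded subtrees from the precondition list.
(def excluded-dirs #{"_drafts" "templates" "includes" "orgparse-examples"})

(defn indexable-path?
  "Hypothetical helper: true when rel-path ends in .org and no directory
  segment is an excluded subtree (assumed to apply at any depth)."
  [rel-path]
  (let [segments (str/split rel-path #"/")]
    (and (str/ends-with? rel-path ".org")
         ;; butlast drops the file name so only directories are checked.
         (not-any? excluded-dirs (butlast segments)))))

(indexable-path? "research/pocket-es/index.org") ; eligible
(indexable-path? "_drafts/idea.org")             ; excluded subtree
```

The #+TITLE gate is separate from this check because it requires reading the file contents.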

2.2. Required headers

Header	Requirement	Notes
`#+TITLE`	REQUIRED	First occurrence wins; empty value causes doc rejection.

2.3. Optional headers (used if present)

Header	Notes
`#+DATE`	Stored verbatim as string; no format enforced.
`#+KEYWORDS`	Split on `[,;]+`; each token lowercased, stripped. Multi-value OK.
`#+DESCRIPTION`	Stored verbatim; no length limit enforced.
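
A rough sketch of how the header rules could be implemented follows. The function names first-headers and split-keywords are hypothetical, header lines are assumed to start at column 0, and trimming surrounding whitespace from values is an assumption the tables do not spell out.

```clojure
(require '[clojure.string :as str])

(defn first-headers
  "Collects #+KEY: value lines into a map keyed by upper-cased KEY.
  Only the first occurrence of each key is kept."
  [text]
  (reduce (fn [acc line]
            (if-let [[_ k v] (re-matches #"#\+(\w+):\s*(.*)" line)]
              (let [k (str/upper-case k)]
                (if (contains? acc k)
                  acc                          ; later duplicates are ignored
                  (assoc acc k (str/trim v))))
              acc))
          {}
          (str/split-lines text)))

(defn split-keywords
  "Splits a #+KEYWORDS value on runs of , or ; then lowercases and trims."
  [s]
  (->> (str/split (or s "") #"[,;]+")
       (map (comp str/lower-case str/trim))
       (remove str/blank?)
       vec))

(def hdrs (first-headers "#+TITLE: First\n#+TITLE: Second\n#+KEYWORDS: BM25, Search;clojure"))
(get hdrs "TITLE")                      ; "First", the first occurrence
(split-keywords (get hdrs "KEYWORDS"))  ; ["bm25" "search" "clojure"]
```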

2.4. Postconditions (valid indexable document)

A document is only emitted when:

#+TITLE is present and non-empty.
doc_len >= 5 OR keywords is non-empty. (Stubs with fewer than 5 tokens and no keywords are silently dropped.)
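
The emission gate can be read as one predicate over a parsed document. The sketch below is illustrative only: emit-doc? and the keys :title, :doc-len, and :keywords are hypothetical names, not the indexer's internals.

```clojure
(require '[clojure.string :as str])

(defn emit-doc?
  "Hypothetical gate: a title is mandatory, and the document needs either
  at least 5 tokens or at least one keyword."
  [{:keys [title doc-len keywords]}]
  (boolean (and (not (str/blank? title))
                (or (>= doc-len 5) (seq keywords)))))

(emit-doc? {:title "Stub" :doc-len 3 :keywords []})        ; dropped
(emit-doc? {:title "Stub" :doc-len 3 :keywords ["bm25"]})  ; kept via keywords
```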

2.5. Invariants

Only the first occurrence of each header key is kept.
Org source blocks (#+begin_... / #+end_...) are stripped before tokenization; they do not contribute to term frequencies.
Org drawers (:NAME:... / :END:) are stripped before tokenization.
Org links [[target][text]] are replaced by their display text.
Org markup characters ~=*/+_ are replaced by spaces before tokenization.
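
The stripping invariants imply an ordering: blocks and drawers have to go before markup characters are replaced, otherwise the #+begin_ and #+end_ markers themselves would be broken up. A minimal sketch of such a pipeline is shown here. The function name clean-org and the exact regular expressions are assumptions, not the patterns used in token.cljc.

```clojure
(require '[clojure.string :as str])

(defn clean-org
  "Hypothetical cleaning pass applied before tokenization."
  [text]
  (-> text
      ;; Blocks: #+begin_xxx through the next #+end_xxx, across lines.
      (str/replace #"(?is)#\+begin_\w+.*?#\+end_\w+" " ")
      ;; Drawers: a line holding only :NAME: through the next :END: line.
      (str/replace #"(?ims)^[ \t]*:[\w-]+:[ \t]*$.*?^[ \t]*:END:[ \t]*$" " ")
      ;; Links: [[target][text]] becomes text.
      (str/replace #"\[\[[^\]]*\]\[([^\]]*)\]\]" "$1")
      ;; Markup characters ~ = * / + _ become spaces.
      (str/replace #"[~=*/+_]" " ")))

(def sample
  (str "Intro *bold* text\n"
       "#+begin_src clojure\n(secret)\n#+end_src\n"
       ":PROPERTIES:\n:ID: 42\n:END:\n"
       "See [[https://wal.sh][the site]] now"))

(def cleaned (clean-org sample))
```

In this sketch, links without display text ([[target]]) pass through unchanged; the invariants above do not say how they are handled.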

3. Contract 2: Indexer → Index Artifact (JSON Schema)

The indexer writes to stdout; the Makefile redirects to site/static/search-index.json.

3.1. Top-level structure

{
  "_cluster": { ... },
  "idf":      { "<term>": <float>, ... },
  "docs":     [ { ... }, ... ],
  "suggest_corpus": [ "<string>", ... ]
}

All four top-level keys are always present.

3.2. _cluster object

Field	Type	Required	Constraints
`name`	string	yes	Always `"wal-sh-pocket"`
`version`	int	yes	Always `2`
`built_at`	string	yes	ISO 8601 timestamp from `java.time.Instant/now`
`git_sha`	string	yes	Short SHA from `git rev-parse --short HEAD`; `"unknown"` when git is unavailable
`doc_count`	int	yes	Equal to count of emitted docs; must be >= 1
`vocab_size`	int	yes	Equal to count of idf entries; must be >= 100
`avg_dl`	float	yes	Mean document length in tokens; must be >= 1.0

3.3. docs array elements

Field	Type	Required	Constraints
`_id`	string	yes	URL path relative to site root, no extension
`title`	string	yes	Verbatim from `#+TITLE`; non-empty
`date`	string	yes	Verbatim from `#+DATE` or `""`
`keywords`	array of strings	yes	Lowercased tokens; may be `[]`
`description`	string	yes	Verbatim from `#+DESCRIPTION` or `""`
`headings`	array of strings	yes	First 15 org headings; may be `[]`
`terms`	object	yes	Top-50 term→count pairs; count is raw int
`doc_len`	int	yes	Total token count before top-50 truncation
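
The relationship between terms and doc_len can be illustrated with a short sketch: doc_len counts every token, while terms keeps only the most frequent ones. The function name top-term-frequencies is hypothetical, and the alphabetical tie-break is an assumption the table does not specify.

```clojure
(defn top-term-frequencies
  "Counts tokens and keeps the n most frequent terms.
  Ties are broken alphabetically (an assumption)."
  [tokens n]
  (->> (frequencies tokens)
       (sort-by (fn [[term cnt]] [(- cnt) term]))
       (take n)
       (into {})))

(def tokens ["a" "b" "a" "c" "b" "a"])

(def doc-fields
  {:terms   (top-term-frequencies tokens 2) ; truncated view
   :doc_len (count tokens)})                ; full length, before truncation
```

With n set to 2, the sketch keeps "a" and "b" and drops "c", while :doc_len still reports all 6 tokens.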

3.4. idf map

Keys: vocabulary terms (strings). Values: floats rounded to 3 decimal places.
Formula: round(log((N - df + 0.5) / (df + 0.5) + 1), 3) (BM25 IDF).
Inclusion: doc_freq >= 2 OR idf > 3.0.
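
The IDF formula translates directly into code. The sketch below assumes log means the natural logarithm, which the formula does not state explicitly; bm25-idf and round3 are hypothetical names.

```clojure
(defn round3
  "Rounds x to 3 decimal places."
  [x]
  (/ (Math/round (double (* 1000.0 x))) 1000.0))

(defn bm25-idf
  "IDF for a term that appears in df of n documents:
  log((n - df + 0.5) / (df + 0.5) + 1), rounded to 3 places."
  [n df]
  (round3 (Math/log (+ 1.0 (/ (+ (- n df) 0.5) (+ df 0.5))))))

(bm25-idf 1 1)    ; ln(4/3) rounded to 3 places: 0.288
(bm25-idf 100 1)  ; rare term, high IDF
(bm25-idf 100 90) ; common term, low but still positive IDF
```

The + 1 inside the logarithm keeps every value positive, even for a term that appears in every document.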

3.5. suggest_corpus array

Sorted lexicographically. Sources: keyword tokens + title words after stripping non-word characters, lowercasing, filtering stopwords and pure digits.

3.6. Structural assertions

The following invariants must hold for any valid artifact:

#	Assertion	Source
1	`_cluster.version == 2`	indexer/build-index
2	`_cluster.git_sha` present and non-empty	indexer/git-sha
3	`doc_count >= 1`	indexer/build-index
4	`vocab_size >= 100`	indexer/compute-idf
5	`avg_dl >= 1.0`	indexer/build-index
6	`idf` non-empty	indexer/compute-idf
7	`docs` non-empty	indexer/parse-org-file
8	`suggest_corpus` non-empty	indexer/build-suggest-corpus
9	Each doc has `_id, title, terms, doc_len`	indexer/parse-org-file
10	`len(docs) == doc_count`	indexer/build-index
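
As an illustration, the ten assertions could be checked against a parsed artifact as follows. This is a sketch under stated assumptions: artifact-violations is a hypothetical name, the artifact is assumed to be already parsed into a Clojure map with string keys, and the thresholds are the ones given in the _cluster table (doc_count at least 1, vocab_size at least 100, avg_dl at least 1.0).

```clojure
(def required-doc-keys ["_id" "title" "terms" "doc_len"])

(defn artifact-violations
  "Returns the numbers of the structural assertions that fail."
  [artifact]
  (let [{:strs [_cluster idf docs suggest_corpus]} artifact
        {:strs [version git_sha doc_count vocab_size avg_dl]} _cluster
        checks [[1  (= 2 version)]
                [2  (and (string? git_sha) (not= "" git_sha))]
                [3  (and (number? doc_count) (>= doc_count 1))]
                [4  (and (number? vocab_size) (>= vocab_size 100))]
                [5  (and (number? avg_dl) (>= avg_dl 1.0))]
                [6  (seq idf)]
                [7  (seq docs)]
                [8  (seq suggest_corpus)]
                [9  (every? (fn [d] (every? #(contains? d %) required-doc-keys))
                            docs)]
                [10 (= (count docs) doc_count)]]]
    (vec (for [[n ok?] checks :when (not ok?)] n))))

;; A tiny synthetic artifact that satisfies every assertion.
(def good
  {"_cluster" {"name" "wal-sh-pocket" "version" 2
               "built_at" "2024-01-01T00:00:00Z" "git_sha" "abc1234"
               "doc_count" 1 "vocab_size" 100 "avg_dl" 12.0}
   "idf" (into {} (map (fn [i] [(str "t" i) 1.0]) (range 100)))
   "docs" [{"_id" "/a" "title" "A" "terms" {"t1" 2} "doc_len" 12}]
   "suggest_corpus" ["a"]})

(artifact-violations good)                    ; []
(artifact-violations (assoc good "docs" []))  ; [7 10]
```

Returning the failing assertion numbers rather than throwing makes it easy to report every violation from a single build.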

4. Contract 3: Loader (parse-index)

4.1. Internal IndexMap structure (CLJS)

{:cluster        {:name string, :version 2, :built-at string,
                  :git-sha string,              ; from git rev-parse --short HEAD
                  :doc-count number, :vocab-size number, :avg-dl number,
                  :index-size-kb number}        ; added at load time
 :idf            {string → number}              ; string keys
 :avg-dl         number                          ; promoted to top-level
 :docs           [{:_id string, :title string, :date string,
                   :keywords [string], :description string,
                   :headings [string],
                   :terms {string → number},     ; string keys
                   :doc_len number} ...]         ; snake_case preserved
 :suggest-corpus [string ...]}

4.2. Key decisions

No deep js->clj on the full blob. Each top-level key is accessed via unchecked-get and converted independently.
:doc_len is snake_case (not kebab-case) — intentional, matches JSON.
:avg-dl appears at both top-level and [:cluster :avg-dl]. Scoring uses top-level; cluster map is for display only.
:terms values are raw JS numbers (shallow conversion).

4.3. Index URL resolution

<script data-index-url="…"> attribute — explicit override.
location.port == "8700" → /search-index.json (dev server).
Default → /static/search-index.json (published site).
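
The resolution order can be expressed as a pure function once the DOM lookups are factored out. In this sketch, index-url and the keys :data-index-url and :port are hypothetical; the caller is assumed to read the script attribute and location.port beforehand. Treating an empty attribute as absent is also an assumption.

```clojure
(defn index-url
  "Resolves the index URL: explicit override, then dev server, then default."
  [{:keys [data-index-url port]}]
  (cond
    (seq data-index-url) data-index-url          ; explicit override
    (= "8700" port)      "/search-index.json"    ; dev server
    :else                "/static/search-index.json"))

(index-url {:data-index-url "/custom/index.json" :port "8700"})
(index-url {:port "8700"})
(index-url {:port ""})
```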

5. Contract 4: Search API

5.1. search

(search idx request) → JS {hits: {total: int, hits: [{_id, _score, _source}]}}

_score is Math.round(raw_score * 100), which is always a non-negative integer.
Results sorted descending by _score. Only score > 0 included.
Default _source strips :terms and :doc_len.
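
A sketch of how scored documents might be shaped into the response follows. It returns a Clojure map rather than a JS object and makes several assumptions: shape-hits is a hypothetical name, the positivity filter is applied to the raw score, and total is taken to be the number of returned hits.

```clojure
(defn shape-hits
  "scored is a seq of [doc raw-score] pairs."
  [scored]
  (let [hits (->> scored
                  (filter (fn [[_ raw]] (pos? raw)))   ; drop non-matching docs
                  (map (fn [[doc raw]]
                         {:_id     (:_id doc)
                          :_score  (Math/round (* 100.0 (double raw)))
                          :_source (dissoc doc :terms :doc_len)}))
                  (sort-by :_score >)                  ; highest score first
                  vec)]
    {:hits {:total (count hits) :hits hits}}))

(def result
  (shape-hits [[{:_id "/a" :title "A" :terms {"x" 1} :doc_len 9} 0.5]
               [{:_id "/b" :title "B" :terms {"x" 3} :doc_len 7} 1.234]
               [{:_id "/c" :title "C" :terms {} :doc_len 4} 0.0]]))
```

Here "/b" comes first with a _score of 123, "/a" follows with 50, and "/c" is omitted because its raw score is zero.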

5.2. Query DSL dispatch

Type	Shape	Scoring
`match_all`	`{}`	1.0 for every doc
`match`	`{field: text}`	Field-dependent (title ×3, kw ×2, headings ×2, else BM25)
`term`	`{field: value}`	Exact match: 1.0 or 0.0
`multi_match`	`{query, fields}`	Sum across fields with `^N` boosts
`bool`	`{must, should, filter, must_not}`	must + should; filter/must_not gate only
`prefix`	`{field: prefix}`	Case-insensitive prefix: 1.0 or 0.0
unknown	any	Returns 0 (no error)

BM25 parameters: k1=1.2, b=0.75 (constants).
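
This section gives the BM25 constants but not the scoring expression itself. The sketch below uses the textbook Okapi BM25 term score as an assumption about what core.cljs computes; bm25-term and bm25-score are hypothetical names. The maps follow the IndexMap layout from Contract 3 (string-keyed :idf and :terms, top-level :avg-dl, snake_case :doc_len).

```clojure
(def k1 1.2)
(def b 0.75)

(defn bm25-term
  "Textbook BM25 contribution of one term with frequency tf."
  [idf tf doc-len avg-dl]
  (let [norm (+ (- 1.0 b) (* b (/ (double doc-len) avg-dl)))]
    (* idf (/ (* tf (+ k1 1.0))
              (+ tf (* k1 norm))))))

(defn bm25-score
  "Sums term contributions for the query terms present in doc."
  [idx doc query-terms]
  (reduce + 0.0
          (for [t query-terms
                :let [tf (get-in doc [:terms t] 0)]
                :when (pos? tf)]
            (bm25-term (get-in idx [:idf t] 0.0) tf (:doc_len doc) (:avg-dl idx)))))

(def idx {:idf {"bm25" 2.0} :avg-dl 10.0})
(def doc {:terms {"bm25" 1} :doc_len 10})

(bm25-score idx doc ["bm25"])    ; average-length doc, tf = 1: score equals the idf
(bm25-score idx doc ["missing"]) ; 0.0
```

For a document of exactly average length with a term frequency of 1, the length normalization cancels out and the score equals the term's IDF, which is a handy sanity check.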

5.3. suggest

(suggest idx {text: "prefix", size: 10}) → JS ["match1", "match2", ...]

Prefix match against sorted :suggest-corpus. Case-insensitive.
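
Because the suggest corpus is lowercased at index time, case-insensitive matching only requires lowercasing the query prefix. A minimal sketch follows; prefix-suggest is a hypothetical name that takes the corpus directly rather than the full index, and the default size of 10 and the empty result for a blank prefix are assumptions.

```clojure
(require '[clojure.string :as str])

(defn prefix-suggest
  "Returns up to size corpus entries that start with text (case-insensitive)."
  [corpus {:keys [text size] :or {size 10}}]
  (let [prefix (str/lower-case (or text ""))]
    (if (str/blank? prefix)
      []
      (->> corpus
           (filter #(str/starts-with? % prefix))
           (take size)
           vec))))

(def corpus ["bm25" "build" "clojure"])

(prefix-suggest corpus {:text "B"})          ; ["bm25" "build"]
(prefix-suggest corpus {:text "b" :size 1})  ; ["bm25"]
```

A linear scan is enough for a sketch; since the corpus is sorted, an implementation could instead binary-search for the first match.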

6. Contract 5: UI Component

6.1. Mount point resolution (first match wins)

#pocket-es-root — standalone HTML
#content — org-published page
.outline-2 — fallback org layout
document.body — final fallback

6.2. Input handling

Event	Behavior
input	150ms debounce, then `do-search!`
Escape	Clear input and results immediately
Tab	Expand plain text to JSON query template; cursor in quotes
click	On suggestion pill: fill input and search immediately

6.3. Query mode dispatch

Input starts with { and parses as JSON → raw ES request.
Otherwise → multi_match with ["title^3", "description", "headings^2", "terms"].
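
The two query modes can be sketched as a small dispatch function. Everything here is illustrative: build-request is a hypothetical name, the JSON parser is passed in by the caller (in the browser this would wrap JSON.parse), and the nested {:query {:multi_match …}} request shape is an assumption based on the Elasticsearch convention.

```clojure
(require '[clojure.string :as str])

(def default-fields ["title^3" "description" "headings^2" "terms"])

(defn build-request
  "Returns the parsed request when input looks like JSON and parses;
  otherwise wraps the text in a multi_match query.
  parse-json is a caller-supplied function that throws on invalid input."
  [input parse-json]
  (let [raw (when (str/starts-with? input "{")
              ;; JVM catch clause; ClojureScript would use (catch :default _ nil).
              (try (parse-json input) (catch Exception _ nil)))]
    (or raw
        {:query {:multi_match {:query input :fields default-fields}}})))

;; Stand-in parser for demonstration: accepts exactly one string.
(defn stub-parser [s]
  (if (= s "{\"query\":{}}")
    {:query {}}
    (throw (ex-info "invalid JSON" {:input s}))))

(build-request "{\"query\":{}}" stub-parser) ; raw request
(build-request "bm25 search" stub-parser)    ; multi_match fallback
(build-request "{oops" stub-parser)          ; parse fails, multi_match fallback
```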

6.4. Console API (window.pocketES)

search(req), suggest(req), cluster.health(), cat.indices(), cat.count(), _idx.

7. Contract 6: Tokenizer Test Suite

The shared tokenizer is covered by test/pocket_es/token_test.cljc. The .cljc extension means the same test file runs on both the JVM (via clojure.test) and in the browser (via cljs.test), verifying that CLJ and CLJS produce identical token streams.

Run on the JVM:

clj -M:test

7.1. Example-based tests

Test name	What it verifies
`tokenize-basic`	nil/empty/blank → `[]`; stopword removal; lowercasing; short-word filter with allowlist; org markup stripping; dash collapsing
`tokenize-org-blocks`	`#+begin_src` … `#+end_src` stripped before tokenization; `:PROPERTIES:` … `:END:` drawers stripped
`parse-org-headers-test`	Key→value extraction; first-occurrence-wins semantics
`term-frequencies-test`	Correct counts; top-n limit respected

7.2. Property-based tests (test.check)

Spec name	Trials	Property
`tokenize-never-returns-stopwords`	100	No token is a member of `token/stopwords`
`tokenize-all-lowercase`	100	Every token satisfies `(= t (str/lower-case t))`
`tokenize-idempotent-on-output`	50	Tokens from second pass are a subset of first-pass tokens
`term-frequencies-bounded`	50	`(count tf) <= n` for any n

8. Appendix: Known Asymmetries

Asymmetries A1 and A2 from v1 — diverging stopword sets and the SHORT_ALLOWLIST gap — are resolved in v2. The shared token.cljc namespace is the single source of truth for both the indexer and the client search engine. No per-platform divergence remains.

ID	Issue	Status
A3	`:doc_len` is snake_case — sole exception to kebab-case convention.	Open