pocket-es Component Contract Specification

1. Overview

pocket-es is a client-side BM25 search engine for wal.sh. Data flows in one direction: org files → Clojure indexer → JSON artifact → ClojureScript loader → search engine → UI. Each boundary is a typed contract. This document specifies the preconditions, postconditions, and invariants at each boundary.

Components and their canonical file locations:

Component         Path
Indexer           site/research/pocket-es/src/pocket_es/indexer.clj
Shared tokenizer  site/research/pocket-es/src/pocket_es/token.cljc
Token tests       site/research/pocket-es/test/pocket_es/token_test.cljc
Index artifact    site/static/search-index.json
Core engine       site/research/pocket-es/src/pocket_es/core.cljs
UI                site/research/pocket-es/src/pocket_es/ui.cljs
Build entry       site/research/pocket-es/shadow-cljs.edn (:lib build)
JS init symbol    pocket-es.ui/init! (mapped to :modules {:pocket-es …})

2. Contract 1: Org File → Indexer

The indexer is pocket-es.indexer (src/pocket_es/indexer.clj). It reads org files via clojure.java.io and tokenizes with the shared pocket-es.token namespace (src/pocket_es/token.cljc). The same tokenizer compiles to ClojureScript for the client search engine — there is one tokenization contract, not two.

Run the indexer:

clj -M:index site/

2.1. Preconditions

  • File encoding is UTF-8 (errors are replaced, not fatal).
  • File lives under the site/ tree and matches glob **/*.org.
  • File is not under any excluded subtree:
    • _drafts/
    • templates/
    • includes/
    • orgparse-examples/
  • File contains at least one #+TITLE: header with a non-empty value. This is the sole hard gate; all other headers are optional.
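
The path-level preconditions above can be sketched as a small predicate. This is a minimal illustration, not the indexer's actual code: indexable-path? and excluded-dirs are hypothetical names, and the sketch assumes an excluded directory disqualifies a file at any depth of the tree. The #+TITLE gate is a content check and is not covered here.

```clojure
(require '[clojure.string :as str])

;; Excluded subtrees, copied from the precondition list.
(def excluded-dirs #{"_drafts" "templates" "includes" "orgparse-examples"})

(defn indexable-path?
  "True when a site-relative path ends in .org and none of its directory
  segments is an excluded subtree. Hypothetical helper for illustration."
  [rel-path]
  (let [segments (str/split rel-path #"/")]
    (and (str/ends-with? rel-path ".org")
         ;; butlast drops the file name so only directories are checked
         (not-any? excluded-dirs (butlast segments)))))

(indexable-path? "research/pocket-es/index.org") ;; => true
(indexable-path? "_drafts/wip.org")              ;; => false
```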

2.2. Required headers

Header   Requirement  Notes
#+TITLE  REQUIRED     First occurrence wins; empty value causes doc rejection.

2.3. Optional headers (used if present)

Header         Notes
#+DATE         Stored verbatim as string; no format enforced.
#+KEYWORDS     Split on [,;]+; each token lowercased, stripped. Multi-value OK.
#+DESCRIPTION  Stored verbatim; no length limit enforced.
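
As an illustration of the #+KEYWORDS rule, the following sketch splits on [,;]+, trims, and lowercases each token. parse-keywords is a hypothetical name, and dropping blank tokens is an assumption the table does not state.

```clojure
(require '[clojure.string :as str])

(defn parse-keywords
  "Split a raw #+KEYWORDS value on runs of commas or semicolons, then trim
  and lowercase each token. Blank tokens are dropped (an assumption)."
  [raw]
  (->> (str/split (or raw "") #"[,;]+")
       (map (comp str/lower-case str/trim))
       (remove str/blank?)
       vec))

(parse-keywords "Search, BM25; ClojureScript")
;; => ["search" "bm25" "clojurescript"]
```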

2.4. Postconditions (valid indexable document)

A document is emitted only when both conditions hold:

  1. #+TITLE is present and non-empty.
  2. doc_len >= 5 OR keywords is non-empty. (Stubs with fewer than 5 tokens and no keywords are silently dropped.)
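
The emission gate can be written as a single predicate. This is a sketch under stated assumptions: it reads the length threshold as doc_len >= 5 (matching "fewer than 5 tokens ... dropped"), and emit-doc? with its :title, :doc-len, and :keywords keys are hypothetical names rather than the indexer's own.

```clojure
(require '[clojure.string :as str])

(defn emit-doc?
  "True when the parsed document satisfies both postconditions:
  a non-empty title, and either enough tokens or at least one keyword."
  [{:keys [title doc-len keywords]}]
  (boolean
   (and (not (str/blank? title))
        (or (>= doc-len 5) (seq keywords)))))

(emit-doc? {:title "Notes" :doc-len 2 :keywords ["bm25"]}) ;; => true
(emit-doc? {:title "Stub"  :doc-len 2 :keywords []})       ;; => false
```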

2.5. Invariants

  • Only the first occurrence of each header key is kept.
  • Org source blocks (#+begin_... / #+end_...) are stripped before tokenization; they do not contribute to term frequencies.
  • Org drawers (:NAME:... / :END:) are stripped before tokenization.
  • Org links [[target][text]] are replaced by their display text.
  • Org markup characters ~=*/+_ are replaced by spaces before tokenization.
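
The stripping invariants can be approximated with four regex passes. This is a rough sketch, not the code in token.cljc: clean-org-text is a hypothetical name, the patterns are assumptions about what "block" and "drawer" mean, and the pass order is deliberate (blocks and drawers first, because their delimiters contain markup characters).

```clojure
(require '[clojure.string :as str])

(defn clean-org-text
  "Apply the pre-tokenization invariants in order. Illustrative only."
  [s]
  (-> s
      ;; 1. blocks: from a #+begin_... marker through the next #+end_... marker
      (str/replace #"(?is)#\+begin_.*?#\+end_\S*" " ")
      ;; 2. drawers: a line holding only :NAME: through the next :END: line
      (str/replace #"(?ims)^[ \t]*:[A-Za-z_-]+:[ \t]*$.*?^[ \t]*:END:[ \t]*$" " ")
      ;; 3. links: [[target][text]] becomes text
      (str/replace #"\[\[[^\]]*\]\[([^\]]*)\]\]" "$1")
      ;; 4. markup characters ~=*/+_ become spaces
      (str/replace #"[~=*/+_]" " ")))

(clean-org-text "See [[https://wal.sh][the site]] for *bold* ideas")
;; => "See the site for  bold  ideas"
```

Bare links of the form [[target]] are not mentioned in the invariants, so the sketch leaves them untouched apart from the markup pass.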

3. Contract 2: Indexer → Index Artifact (JSON Schema)

The indexer writes to stdout; the Makefile redirects to site/static/search-index.json.

3.1. Top-level structure

{
  "_cluster": { ... },
  "idf":      { "<term>": <float>, ... },
  "docs":     [ { ... }, ... ],
  "suggest_corpus": [ "<string>", ... ]
}

All four top-level keys are always present.

3.2. _cluster object

Field       Type    Required  Constraints
name        string  yes       Always "wal-sh-pocket"
version     int     yes       Always 2
built_at    string  yes       ISO 8601 timestamp from java.time.Instant/now
git_sha     string  yes       Short SHA from git rev-parse --short HEAD;
                              "unknown" when git is unavailable
doc_count   int     yes       Equal to count of emitted docs; must be >= 1
vocab_size  int     yes       Equal to count of idf entries; must be >= 100
avg_dl      float   yes       Mean document length in tokens; must be >= 1.0

3.3. docs array elements

Field        Type              Required  Constraints
_id          string            yes       URL path relative to site root, no extension
title        string            yes       Verbatim from #+TITLE; non-empty
date         string            yes       Verbatim from #+DATE or ""
keywords     array of strings  yes       Lowercased tokens; may be []
description  string            yes       Verbatim from #+DESCRIPTION or ""
headings     array of strings  yes       First 15 org headings; may be []
terms        object            yes       Top-50 term→count pairs; count is raw int
doc_len      int               yes       Total token count before top-50 truncation

3.4. idf map

  • Keys: vocabulary terms (strings). Values: floats rounded to 3 decimal places.
  • Formula: round(log((N - df + 0.5) / (df + 0.5) + 1), 3) (BM25 IDF).
  • Inclusion: doc_freq >= 2 OR idf > 3.0.
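
The IDF formula and inclusion rule translate directly into two small functions. This sketch assumes log is the natural logarithm and reads the inclusion rule as doc_freq >= 2; bm25-idf and keep-term? are hypothetical names, not the functions in indexer.clj.

```clojure
(defn bm25-idf
  "round(log((N - df + 0.5) / (df + 0.5) + 1), 3) for a corpus of n docs
  and a term found in df of them. Natural log is an assumption."
  [n df]
  (let [raw (Math/log (+ 1.0 (/ (+ (- n df) 0.5) (+ df 0.5))))]
    (/ (Math/round (* raw 1000.0)) 1000.0)))

(defn keep-term?
  "Inclusion rule: keep terms seen in at least 2 docs, or rare terms
  whose IDF exceeds 3.0."
  [df idf]
  (or (>= df 2) (> idf 3.0)))

;; A term that appears in a single document out of 100 has a high IDF,
;; so it survives the filter even though its document frequency is 1.
(keep-term? 1 (bm25-idf 100 1)) ;; => true
```

The "+ 1" inside the logarithm keeps every IDF strictly positive, even for a term that appears in all N documents.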

3.5. suggest_corpus array

  • Sorted lexicographically. Sources: keyword tokens + title words after stripping non-word characters, lowercasing, filtering stopwords and pure digits.
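
One way to picture the corpus construction is the sketch below. Several details are assumptions: demo-stopwords stands in for the real stopword set, the stopword and digit filters are applied to title words only, and duplicates are removed by collecting into a sorted set. build-suggest-corpus here is an illustration, not the indexer function of the same name.

```clojure
(require '[clojure.string :as str])

;; Stand-in for the shared stopword set (hypothetical, much smaller).
(def demo-stopwords #{"the" "a" "of" "and" "for"})

(defn title-words
  "Strip non-word characters, lowercase, split on whitespace, then drop
  stopwords and tokens made only of digits."
  [title]
  (->> (str/split (str/lower-case (str/replace title #"[^\w\s]" " ")) #"\s+")
       (remove str/blank?)
       (remove demo-stopwords)
       (remove #(re-matches #"\d+" %))))

(defn build-suggest-corpus
  "Merge keyword tokens and filtered title words into one sorted vector."
  [docs]
  (->> docs
       (mapcat (fn [{:keys [title keywords]}]
                 (concat keywords (title-words title))))
       (into (sorted-set))
       vec))

(build-suggest-corpus [{:title "The BM25 Guide for 2024"
                        :keywords ["search" "bm25"]}])
;; => ["bm25" "guide" "search"]
```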

3.6. Structural assertions

The following invariants must hold for any valid artifact:

#   Assertion                                Source
1   _cluster.version == 2                    indexer/build-index
2   _cluster.git_sha present and non-empty   indexer/git-sha
3   doc_count >= 1                           indexer/build-index
4   vocab_size >= 100                        indexer/compute-idf
5   avg_dl >= 1.0                            indexer/build-index
6   idf non-empty                            indexer/compute-idf
7   docs non-empty                           indexer/parse-org-file
8   suggest_corpus non-empty                 indexer/build-suggest-corpus
9   Each doc has _id, title, terms, doc_len  indexer/parse-org-file
10  len(docs) == doc_count                   indexer/build-index
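
The ten assertions can be checked mechanically against a parsed artifact. The sketch below assumes the JSON has already been parsed into a Clojure map with string keys (the parsing library is left to the caller) and reads the thresholds as >= comparisons, in line with the _cluster constraints in 3.2. artifact-violations is a hypothetical helper that returns the numbers of the failed assertions.

```clojure
(def required-doc-keys ["_id" "title" "terms" "doc_len"])

(defn artifact-violations
  "Return a vector of assertion numbers (1-10) that the artifact fails.
  An empty vector means every structural assertion holds."
  [idx]
  (let [cluster (get idx "_cluster")
        idf     (get idx "idf")
        docs    (get idx "docs")
        corpus  (get idx "suggest_corpus")
        checks  [[1  (= 2 (get cluster "version"))]
                 [2  (boolean (seq (get cluster "git_sha")))]
                 [3  (>= (get cluster "doc_count" 0) 1)]
                 [4  (>= (get cluster "vocab_size" 0) 100)]
                 [5  (>= (get cluster "avg_dl" 0) 1.0)]
                 [6  (boolean (seq idf))]
                 [7  (boolean (seq docs))]
                 [8  (boolean (seq corpus))]
                 [9  (every? (fn [d] (every? #(contains? d %) required-doc-keys))
                             docs)]
                 [10 (= (count docs) (get cluster "doc_count"))]]]
    (vec (for [[n ok?] checks :when (not ok?)] n))))
```

Returning the failed assertion numbers, rather than throwing on the first failure, lets a build step report every problem in one run.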

4. Contract 3: Loader (parse-index)

4.1. Internal IndexMap structure (CLJS)

{:cluster        {:name string, :version 2, :built-at string,
                  :git-sha string,              ; from git rev-parse --short HEAD
                  :doc-count number, :vocab-size number, :avg-dl number,
                  :index-size-kb number}        ; added at load time
 :idf            {string → number}              ; string keys
 :avg-dl         number                          ; promoted to top-level
 :docs           [{:_id string, :title string, :date string,
                   :keywords [string], :description string,
                   :headings [string],
                   :terms {string → number},     ; string keys
                   :doc_len number} ...]         ; snake_case preserved
 :suggest-corpus [string ...]}

4.2. Key decisions

  • No deep js->clj on the full blob. Each top-level key is accessed via unchecked-get and converted independently.
  • :doc_len is snake_case (not kebab-case) — intentional, matches JSON.
  • :avg-dl appears at both top-level and [:cluster :avg-dl]. Scoring uses top-level; cluster map is for display only.
  • :terms values are raw JS numbers (shallow conversion).

4.3. Index URL resolution

  1. <script data-index-url="…"> attribute: explicit override.
  2. location.port === "8700" → /search-index.json (dev server).
  3. Default → /static/search-index.json (published site).
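
The resolution order reduces to a three-way cond once the browser lookups are factored out. In this sketch, resolve-index-url is a hypothetical pure function; in the real loader the two inputs would come from the script element's attribute and from location.port.

```clojure
(defn resolve-index-url
  "Pick the index URL: explicit override first, then the dev-server port,
  then the published default."
  [{:keys [data-index-url port]}]
  (cond
    (seq data-index-url) data-index-url
    (= port "8700")      "/search-index.json"
    :else                "/static/search-index.json"))

(resolve-index-url {:port "8700"}) ;; => "/search-index.json"
(resolve-index-url {:port ""})     ;; => "/static/search-index.json"
```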

5. Contract 4: Search API

5.1. search

(search idx request) → JS {hits: {total: int, hits: [{_id, _score, _source}]}}
  • _score is Math.round(raw_score * 100) — always non-negative integer.
  • Results sorted descending by _score. Only score > 0 included.
  • Default _source strips :terms and :doc_len.
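
The response-shaping rules can be sketched as follows. The function takes [doc raw-score] pairs and returns a plain Clojure map rather than a JS object. shape-hits is a hypothetical name, and two details are assumptions: the score > 0 filter is applied to the raw score before rounding, and total is the number of returned hits.

```clojure
(defn shape-hits
  "Filter, round, and sort scored docs into the hits envelope."
  [scored]
  (let [hits (->> scored
                  (filter (fn [[_ raw]] (pos? raw)))
                  (map (fn [[doc raw]]
                         {:_id     (:_id doc)
                          :_score  (Math/round (* 100.0 raw))
                          ;; default _source drops the bulky scoring fields
                          :_source (dissoc doc :terms :doc_len)}))
                  (sort-by :_score >)
                  vec)]
    {:hits {:total (count hits) :hits hits}}))
```

For example, raw scores of 0.5, 1.234, and 0.0 for documents a, b, and c would yield hits b (_score 123) then a (_score 50), with c omitted.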

5.2. Query DSL dispatch

Type         Shape                             Scoring
match_all    {}                                1.0 for every doc
match        {field: text}                     Field-dependent (title ×3, kw ×2, headings ×2, else BM25)
term         {field: value}                    Exact match: 1.0 or 0.0
multi_match  {query, fields}                   Sum across fields with ^N boosts
bool         {must, should, filter, must_not}  must + should; filter/must_not gate only
prefix       {field: prefix}                   Case-insensitive prefix: 1.0 or 0.0
unknown      any                               Returns 0 (no error)

BM25 parameters: k1=1.2, b=0.75 (constants).
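
For reference, the textbook BM25 per-term score with these constants looks like the sketch below. This specification fixes only k1 and b, so the formula is the standard form, not a confirmed copy of core.cljs; bm25-term-score and its argument names are hypothetical.

```clojure
(def k1 1.2)
(def b 0.75)

(defn bm25-term-score
  "Standard BM25 contribution of one query term to one document:
  idf * tf*(k1+1) / (tf + k1*(1 - b + b*doc-len/avg-dl))."
  [idf tf doc-len avg-dl]
  (let [norm (+ (- 1.0 b) (* b (/ doc-len avg-dl)))]
    (* idf (/ (* tf (+ k1 1.0))
              (+ tf (* k1 norm))))))
```

With b = 0.75, a document longer than avg_dl is penalized and a shorter one is boosted. k1 = 1.2 caps how much repeated occurrences of the same term can add.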

5.3. suggest

(suggest idx {text: "prefix", size: 10}) → JS ["match1", "match2", ...]

Prefix match against sorted :suggest-corpus. Case-insensitive.
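
A minimal version of the prefix lookup, operating on a Clojure vector instead of the loaded index, might look like this. The default size of 10 and the empty result for a blank prefix are assumptions, and this suggest is a stand-in for the real API function.

```clojure
(require '[clojure.string :as str])

(defn suggest
  "Return up to size corpus entries that start with text, ignoring case.
  Because the corpus is sorted, results come back in sorted order."
  [corpus {:keys [text size] :or {size 10}}]
  (let [prefix (str/lower-case (or text ""))]
    (if (str/blank? prefix)
      []
      (->> corpus
           (filter #(str/starts-with? (str/lower-case %) prefix))
           (take size)
           vec))))

(suggest ["bm25" "clojure" "clojurescript" "search"] {:text "Clo"})
;; => ["clojure" "clojurescript"]
```

The sketch scans linearly for clarity. Since the corpus is sorted, a binary search for the first match would also work.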

6. Contract 5: UI Component

6.1. Mount point resolution (first match wins)

  1. #pocket-es-root — standalone HTML
  2. #content — org-published page
  3. .outline-2 — fallback org layout
  4. document.body — final fallback

6.2. Input handling

Event   Behavior
input   150ms debounce, then do-search!
Escape  Clear input and results immediately
Tab     Expand plain text to JSON query template; cursor in quotes
click   On suggestion pill: fill input and search immediately

6.3. Query mode dispatch

  • Input starts with { and parses as JSON → raw ES request.
  • Otherwise → multi_match with ["title^3", "description", "headings^2", "terms"].

6.4. Console API (window.pocketES)

search(req), suggest(req), cluster.health(), cat.indices(), cat.count(), _idx.

7. Contract 6: Tokenizer Test Suite

The shared tokenizer is covered by test/pocket_es/token_test.cljc. The .cljc extension means the same test file runs on both the JVM (via clojure.test) and in the browser (via cljs.test), verifying that CLJ and CLJS produce identical token streams.

Run on the JVM:

clj -M:test

7.1. Example-based tests

Test name               What it verifies
tokenize-basic          nil/empty/blank → []; stopword removal; lowercasing;
                        short-word filter with allowlist; org markup stripping;
                        dash collapsing
tokenize-org-blocks     #+begin_src / #+end_src blocks stripped before tokenization;
                        :PROPERTIES: / :END: drawers stripped
parse-org-headers-test  Key→value extraction; first-occurrence-wins semantics
term-frequencies-test   Correct counts; top-n limit respected

7.2. Property-based tests (test.check)

Spec name                         Trials  Property
tokenize-never-returns-stopwords  100     No token is a member of token/stopwords
tokenize-all-lowercase            100     Every token satisfies (= t (str/lower-case t))
tokenize-idempotent-on-output     50      Tokens from second pass are a subset of first-pass tokens
term-frequencies-bounded          50      (count tf) <= n for any n

8. Appendix: Known Asymmetries

Asymmetries A1 and A2 from v1 — diverging stopword sets and the SHORT_ALLOWLIST gap — are resolved in v2. The shared token.cljc namespace is the single source of truth for both the indexer and the client search engine. No per-platform divergence remains.

ID  Issue                                                              Status
A3  :doc_len is snake_case, sole exception to kebab-case convention.   Open