pocket-es: Client-Side BM25 Search for org-mode Sites

1. Architecture
- 1.1. Build pipeline
- 1.2. Client and wire sizes
2. Interfaces
3. Query DSL
4. Scoring
5. Query Examples
6. State Machine
7. Stack
8. Specifications

The search box above is a BM25 search engine for this org-mode site. No server, no daemon, no JVM at runtime. One fetch(), one JSON file, scoring in ClojureScript.

The index was built at deploy time from every .org file on this site. The client loads it, parses it, and exposes a Lucene-style query DSL: match, term, bool, multi_match, match_all, prefix, plus suggest for autocompletion.

1. Architecture

1.1. Build pipeline

The indexer is JVM Clojure (indexer.clj). It shares a tokenizer (token.cljc) with the ClojureScript client – one tokenization contract, no asymmetries between build and query time. For every org file, the indexer extracts #+TITLE, #+KEYWORDS, #+DESCRIPTION, and #+DATE, tokenizes the body, and computes term frequencies (top 50 per doc). It then computes IDF across the corpus and writes a single JSON file.
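
The term-frequency and IDF steps can be sketched roughly as follows. This is a JavaScript illustration, not the indexer's code: `tokenize` is a stand-in tokenizer (the real contract lives in token.cljc), `topTerms` and `buildIndex` are hypothetical names, and the IDF variant used here, ln(1 + (N - n + 0.5) / (n + 0.5)), is an assumption because the page does not say which formula indexer.clj uses.

```javascript
// Stand-in tokenizer (assumption): lowercase, split on non-alphanumerics.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Keep only the `limit` most frequent terms of one document.
function topTerms(tokens, limit = 50) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  const sorted = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  return Object.fromEntries(sorted.slice(0, limit));
}

// Build per-document term maps, then IDF across the whole corpus.
function buildIndex(docs, limit = 50) {
  const indexed = docs.map((d) => {
    const tokens = tokenize(d.body);
    return { path: d.path, doc_len: tokens.length, terms: topTerms(tokens, limit) };
  });

  const df = new Map(); // document frequency per term
  for (const d of indexed) {
    for (const term of Object.keys(d.terms)) df.set(term, (df.get(term) || 0) + 1);
  }

  const n = indexed.length;
  const idf = {};
  for (const [term, count] of df) {
    idf[term] = Math.log(1 + (n - count + 0.5) / (count + 0.5)); // assumed variant
  }

  const totalLen = indexed.reduce((sum, d) => sum + d.doc_len, 0);
  return { docs: indexed, idf, avg_dl: n ? totalLen / n : 0 };
}
```

Under this sketch, a term that appears in every document ends up with a small IDF, while a term confined to one document gets a large one, which is what lets rare terms dominate the ranking.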

1.2. Client and wire sizes

The client is 142KB of Closure-compiled ClojureScript (32KB gzipped). It loads the JSON index, builds an in-memory search structure, and exposes both a visual UI (self-injecting DOM) and a console API.

Asset	Raw	Gzipped	Ratio
pocket-es.js	142KB	32KB	4.4x
search-index.json	824KB	255KB	3.2x
Total	966KB	287KB	3.4x

2. Interfaces

Five consumers share one JSON index and one BM25 scoring contract (k1=1.2, b=0.75). Source lives at the project root in src/pocket_es/.

2.1. Browser

The <script> tag loads pocket-es.js, which self-injects into #pocket-es-root or #content. The console API is available as pocketES.search(...). See Query Examples below.

2.2. Emacs

pocket-es.el loads the JSON index, tokenizes queries with the same contract as the ClojureScript client, and scores results with BM25. Results appear in a *pocket-es* buffer with clickable paths.

;; Interactive search
M-x pocket-es-search RET lambda calculus RET

;; Programmatic access
(pocket-es-search "bedrock rag")
(pocket-es-suggest "clo")

2.3. Node CLI

The shadow-cljs :node-library target exports search, suggest, and parseIndex as CommonJS functions.

$ npm run search -- "tla+" --size 3
3 results (614 docs indexed)

1. [1441.00] TLA+ for System Design
   2026-01-10 -- /research/tla-plus-system-design
2. [1375.00] TLA+ Traffic Lights and Communication Protocol
   2024-08-11 -- /research/tla+
3. [1133.00] E-Commerce Order State Machine in TLA+
   2026-01-11 -- /research/tla-plus-system-design/ecommerce-order-states

$ npm run search -- "agen" --suggest
agent
agent architecture
agent coordination
agent framework

2.4. Babashka REPL

pocket-es.cli uses the shared token.cljc tokenizer and the BM25 formula directly – no JSON string matching, no subprocess. search returns data.

$ bb -cp src -m pocket-es.cli "graphql federation" 3
3 results for "graphql federation" (614 docs indexed)

  [15.1] Clojure + GraphQL Integration
        2019-12 -- /research/clojure-graphql
  [15.0] GraphQL: Schema, Operations, and Federation
        2024-08-11 -- /research/graphql
  [14.4] GraphQLConf 2025
        2025-09-07 -- /events/graphqlconf-2025

For REPL or property-based testing, search returns maps:

$ bb -cp src -e '
(require (quote [pocket-es.cli :as cli]))
(let [idx (cli/load-index "site/static/search-index.json")]
  (map :title (cli/search idx "agent sandbox" :size 3)))
'
("Agent Sandbox Architectures" "Sandboxing AI Coding Agents with FreeBSD Jails"
 "CLI Coding Agents -- 2026 Q2 Comparison")

2.5. Console API

Available in the browser DevTools after the page loads:

pocketES.search({ query: { match: { _all: "crdt" } } })

pocketES.suggest({ text: "clo", size: 8 })

pocketES.cluster.health()

3. Query DSL

Query type	Behavior
`match`	Tokenize, BM25 score per term, sum
`term`	Exact match on keyword/field array
`bool`	must/should/filter/must_not, intersect/union/filter
`multi_match`	Match across fields with boost weights
`match_all`	Return everything, score 1.0
`prefix`	Prefix scan on string/array fields
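
How the query types in the table might compose can be sketched as a small recursive evaluator. This is an illustrative JavaScript sketch, not the code in core.cljs: `evaluate`, `evaluateBool`, and the document shape are hypothetical, `match` counts matching tokens as a stand-in for BM25, and the rule that a `bool` with no `must` or `filter` needs at least one `should` hit is an assumption borrowed from Elasticsearch.

```javascript
// Hypothetical document shape: { id, title, keywords: [...] }.
const tokenize = (s) => String(s).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Returns a Map of doc id -> score for the documents that match `query`.
function evaluate(query, docs) {
  const [type, body] = Object.entries(query)[0];
  const hits = new Map();
  if (type === "match_all") {
    for (const d of docs) hits.set(d.id, 1.0);
  } else if (type === "match") {
    const [field, text] = Object.entries(body)[0];
    const wanted = tokenize(text);
    for (const d of docs) {
      const have = tokenize(d[field] ?? "");
      // Stand-in for BM25: one point per query token found in the field.
      const score = wanted.filter((t) => have.includes(t)).length;
      if (score > 0) hits.set(d.id, score);
    }
  } else if (type === "term") {
    const [field, value] = Object.entries(body)[0];
    for (const d of docs) {
      const v = d[field];
      if (Array.isArray(v) ? v.includes(value) : v === value) hits.set(d.id, 1.0);
    }
  } else if (type === "prefix") {
    const [field, value] = Object.entries(body)[0];
    for (const d of docs) {
      const values = [].concat(d[field] ?? []); // string or array field
      if (values.some((v) => String(v).startsWith(value))) hits.set(d.id, 1.0);
    }
  } else if (type === "bool") {
    return evaluateBool(body, docs);
  } else {
    throw new Error(`unsupported query type: ${type}`);
  }
  return hits;
}

function evaluateBool({ must = [], should = [], filter = [], must_not = [] }, docs) {
  const run = (clauses) => clauses.map((q) => evaluate(q, docs));
  const mustHits = run(must), shouldHits = run(should);
  const filterHits = run(filter), notHits = run(must_not);
  const required = [...mustHits, ...filterHits];
  const out = new Map();
  for (const d of docs) {
    // must and filter intersect; must_not excludes.
    if (!required.every((h) => h.has(d.id))) continue;
    if (notHits.some((h) => h.has(d.id))) continue;
    // Assumption: with no must/filter, at least one should clause has to match.
    const inShould = shouldHits.some((h) => h.has(d.id));
    if (required.length === 0 && shouldHits.length > 0 && !inShould) continue;
    // Only must and should contribute to the score; filter does not.
    let score = 0;
    for (const h of [...mustHits, ...shouldHits]) score += h.get(d.id) || 0;
    out.set(d.id, score);
  }
  return out;
}
```

Representing every clause result as a Map keeps intersection, union, and exclusion down to membership checks, so nested `bool` queries need no special handling.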

4. Scoring

BM25 with k1=1.2, b=0.75. The entire scoring function:

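;; BM25 constants (values as stated above)
(def k1 1.2)
(def b 0.75)

;; Score one term against one document.
;; idf: map of term -> precomputed IDF; avg-dl: average document length.
(defn bm25-term [term doc idf avg-dl]
  (let [tf    (get-in doc [:terms term] 0)   ; term frequency in this doc
        dl    (:doc_len doc 1)               ; document length, default 1
        idf-v (get idf term 0)               ; 0 for terms not in the index
        numer (* tf (+ k1 1))
        denom (+ tf (* k1 (+ (- 1 b) (* b (/ dl avg-dl)))))]
    (if (zero? denom) 0
        (* idf-v (/ numer denom)))))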

IDF is precomputed at build time. Term frequency lives in the per-document terms map.
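
For readers more comfortable with JavaScript, the same arithmetic can be sketched as below. This is a hand-written port for illustration, not the compiled output of core.cljs; `bm25Term` and `scoreQuery` are hypothetical names, and the data shape (`terms`, `doc_len`, an `idf` map) simply mirrors the Clojure snippet.

```javascript
const K1 = 1.2;
const B = 0.75;

// Score of a single term against a single document.
function bm25Term(term, doc, idf, avgDl) {
  const tf = (doc.terms && doc.terms[term]) || 0; // term frequency, 0 if absent
  const dl = doc.doc_len ?? 1;                    // document length, default 1
  const idfV = idf[term] ?? 0;                    // 0 for terms not in the index
  const numer = tf * (K1 + 1);
  const denom = tf + K1 * (1 - B + B * (dl / avgDl));
  return denom === 0 ? 0 : idfV * (numer / denom);
}

// A match query sums the per-term scores.
function scoreQuery(terms, doc, idf, avgDl) {
  return terms.reduce((sum, t) => sum + bm25Term(t, doc, idf, avgDl), 0);
}

// Worked example: tf = 3, dl = avgDl = 100, idf = 2
//   numer = 3 * 2.2 = 6.6
//   denom = 3 + 1.2 * (0.25 + 0.75 * 1) = 4.2
//   score = 2 * 6.6 / 4.2, roughly 3.143
const doc = { terms: { clojure: 3 }, doc_len: 100 };
const score = bm25Term("clojure", doc, { clojure: 2 }, 100);
```

Because the denominator grows with dl / avgDl, a document longer than average needs more occurrences of a term to reach the same score as a short one; b = 0.75 controls how strongly that length normalization applies.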

5. Query Examples

Click any block to execute it. Results appear inline and are logged to the DevTools console.

5.1. match – tokenize + BM25 score

pocketES.search({ query: { match: { title: "clojure" } } })

5.2. term – exact match on keyword field

pocketES.search({ query: { term: { keywords: "crdt" } } })

5.3. multi_match – across fields with boosts

pocketES.search({ query: { multi_match: {
  query: "agent isolation",
  fields: ["title^3", "description", "headings^2"]
} } })
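
The `field^boost` strings can be read as a field name plus a multiplier. Below is a minimal sketch of that parsing and weighting, assuming a default boost of 1 and that per-field scores are summed (the page does not say whether pocket-es sums fields or takes the best one). `parseField`, `multiMatch`, and the `scoreField` callback are hypothetical names.

```javascript
// Split "title^3" into { field: "title", boost: 3 }; no suffix means boost 1.
function parseField(spec) {
  const [field, boost] = spec.split("^");
  const parsed = boost === undefined ? 1 : Number(boost);
  if (!Number.isFinite(parsed)) throw new Error(`bad boost in "${spec}"`);
  return { field, boost: parsed };
}

// Combine per-field scores. `scoreField(field, query)` is supplied by the
// caller and stands in for the BM25 match score of `query` against one field.
function multiMatch({ query, fields }, scoreField) {
  return fields
    .map(parseField)
    .reduce((sum, { field, boost }) => sum + boost * scoreField(field, query), 0);
}
```

With the field list from the example above and a document that scores 1 in every field, this sketch would yield 3 + 1 + 2 = 6, so a title hit outweighs a description hit three to one.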

5.4. bool – must / should / must_not

pocketES.search({ query: { bool: {
  must:     [{ match: { title: "freebsd" } }],
  should:   [{ match: { title: "security" } }],
  must_not: [{ term:  { keywords: "python" } }]
} } })

6. State Machine

Ten user actions, six atoms, URL sync. Try/keyword clicks reset everything; typing keeps the date filter; pagination keeps both the query and the date filter. The !json-mode atom (URL-synced as j=) selects the input surface: off = tokenized free text, on = a raw JSON / ES query in a textarea. Toggling swaps the surface but never auto-translates the typed query. Full diagram source in state-machine.dot.
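
The reset rules read naturally as a reducer. The sketch below is an assumption-heavy illustration, not the code in ui.cljs: it models four pieces of state (`query`, `dateFilter`, `page`, `jsonMode`) instead of the six atoms, the action names are hypothetical, "reset everything" is taken to include JSON mode, and only the `j` URL parameter name comes from the text (`q`, `d`, `p`, and the value `1` are made up).

```javascript
const initial = { query: "", dateFilter: null, page: 1, jsonMode: false };

function reduce(state, action) {
  switch (action.type) {
    case "keyword-click": // reset everything, then search for the clicked keyword
      return { ...initial, query: action.keyword };
    case "type":          // new query text: keep the date filter, restart paging
      return { ...state, query: action.text, page: 1 };
    case "paginate":      // keep both the query and the date filter
      return { ...state, page: action.page };
    case "toggle-json":   // swap the input surface, never translate the query
      return { ...state, jsonMode: !state.jsonMode };
    default:
      return state;
  }
}

// Serialize state to a query string; default values are omitted.
function toQueryString(state) {
  const params = new URLSearchParams();
  if (state.query) params.set("q", state.query);
  if (state.dateFilter) params.set("d", state.dateFilter);
  if (state.page > 1) params.set("p", String(state.page));
  if (state.jsonMode) params.set("j", "1");
  return params.toString();
}
```

Keeping each transition a pure function of the previous state makes rules like "typing keeps the date filter" directly checkable as properties, which fits the property-based testing mentioned elsewhere on this page.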

7. Stack

Shared tokenizer: src/pocket_es/token.cljc – compiles to CLJ, CLJS, and bb
Indexer: src/pocket_es/indexer.clj (JVM Clojure) – parses org files, computes BM25 IDF, emits JSON
Client: src/pocket_es/core.cljs – BM25 scoring, Lucene-style query DSL, index loading
UI: src/pocket_es/ui.cljs – self-injecting DOM, debounced input, date filters
CLI: src/pocket_es/cli.clj – bb/JVM entry point, data-returning search function
Tests: test/pocket_es/token_test.cljc – 25 assertions including 300 property-based test iterations

8. Specifications

Conformance spec – the two versioned wire schemas (index schema, search request schema), tokenization, BM25, UI state machine, verification checklist
Structured query surface – JSON-first input contract (malli), where the gate lives
Query surface rollout & test plan – phased plan, invariants, test.check properties, Bombadil LTL check
Dual-input UX spec – earlier simple/advanced redesign