pocket-es: Client-Side BM25 Search for org-mode Sites

The search box above is a BM25 search engine for this org-mode site. No server, no daemon, no JVM at runtime. One fetch(), one JSON file, scoring in ClojureScript.

The index is built at deploy time from every .org file on this site. The client loads it, parses it, and exposes a Lucene-style query DSL: match, term, bool, multi_match, match_all, and prefix, plus suggest for autocompletion.

1. Architecture

[Figure: architecture.png]

1.1. Build pipeline

The indexer is JVM Clojure (indexer.clj). It shares a tokenizer (token.cljc) with the ClojureScript client, so build time and query time follow one tokenization contract with no asymmetries. For every org file, the indexer extracts #+TITLE, #+KEYWORDS, #+DESCRIPTION, and #+DATE, tokenizes the body, and computes term frequencies, keeping the top 50 terms per document. It then computes IDF across the corpus and writes a single JSON file.
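
The build steps above can be sketched roughly as follows. This is an illustration in JavaScript, not a port of indexer.clj: the tokenizer, the IDF variant, the top-level layout (docs, idf, avg_dl), and the buildIndex helper are all assumptions made for the example. Only the per-document terms and doc_len fields are taken from the scoring function shown later on this page.

```javascript
// Hypothetical tokenizer: lowercase, split on anything that is not a letter or digit.
// The real contract lives in token.cljc and may differ.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Count term frequencies and keep only the `limit` most frequent terms.
function topTerms(tokens, limit = 50) {
  const counts = new Map();
  for (const t of tokens) counts.set(t, (counts.get(t) || 0) + 1);
  const sorted = [...counts.entries()].sort(
    (a, b) => b[1] - a[1] || a[0].localeCompare(b[0])
  );
  return Object.fromEntries(sorted.slice(0, limit));
}

// Build an index object from [{ path, title, body }].
// Assumption: IDF uses the common BM25 form ln(1 + (N - df + 0.5) / (df + 0.5)).
function buildIndex(docs) {
  const entries = docs.map((d) => {
    const tokens = tokenize(d.body);
    return { path: d.path, title: d.title, doc_len: tokens.length, terms: topTerms(tokens) };
  });
  const df = new Map();
  for (const e of entries) {
    for (const term of Object.keys(e.terms)) df.set(term, (df.get(term) || 0) + 1);
  }
  const n = entries.length;
  const idf = {};
  for (const [term, count] of df) {
    idf[term] = Math.log(1 + (n - count + 0.5) / (count + 0.5));
  }
  const avgDl = entries.reduce((sum, e) => sum + e.doc_len, 0) / Math.max(n, 1);
  return { docs: entries, idf, avg_dl: avgDl };
}

const index = buildIndex([
  { path: "/a", title: "A", body: "Clojure macros and Clojure data" },
  { path: "/b", title: "B", body: "Data pipelines" },
]);
console.log(JSON.stringify(index));
```

In this toy corpus "data" appears in both documents and so gets a lower IDF than "clojure", which appears in only one.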

1.2. Client and wire sizes

The client is 142KB of Closure-compiled ClojureScript (32KB gzipped). It loads the JSON index, builds an in-memory search structure, and exposes both a visual UI (self-injecting DOM) and a console API.

Asset              Raw    Gzipped  Ratio
pocket-es.js       142KB  32KB     4.4x
search-index.json  824KB  255KB    3.2x
Total              966KB  287KB    3.4x
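
Sizes like these could be measured with Node's built-in zlib module. The sketch below is one way to do it, not necessarily how the table was produced. The wireSizes helper and the synthetic payload are illustrative, and a web server's gzip level may give slightly different numbers.

```javascript
const zlib = require("node:zlib");

// Return raw size, gzipped size, and their ratio for a Buffer or string.
function wireSizes(data) {
  const raw = Buffer.byteLength(data);
  const gzipped = zlib.gzipSync(data).length;
  return { raw, gzipped, ratio: raw / gzipped };
}

// Synthetic stand-in for an asset; a real check would read the built files from disk.
const sample = JSON.stringify({ terms: Array(2000).fill("clojure agent bm25") });
const sizes = wireSizes(sample);
console.log(sizes.raw, sizes.gzipped, sizes.ratio.toFixed(1) + "x");

// The ratios in the table follow directly from the published sizes.
console.log((142 / 32).toFixed(1), (824 / 255).toFixed(1), (966 / 287).toFixed(1));
// prints "4.4 3.2 3.4"
```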

2. Interfaces

Five consumers share one JSON index and one BM25 scoring contract (k1=1.2, b=0.75). Source lives at the project root in src/pocket_es/.

2.1. Browser

The <script> tag loads pocket-es.js, which self-injects into #pocket-es-root or #content. The console API is available as pocketES.search(...). See Query Examples below.

2.2. Emacs

[Figure: emacs-pocket-es-gui.png]

pocket-es.el loads the JSON index, tokenizes queries with the same contract as the ClojureScript client, and scores results with BM25. Results appear in a *pocket-es* buffer with clickable paths.

;; Interactive search
M-x pocket-es-search RET lambda calculus RET

;; Programmatic access
(pocket-es-search "bedrock rag")
(pocket-es-suggest "clo")

2.3. Node CLI

The shadow-cljs :node-library target exports search, suggest, and parseIndex as CommonJS functions.

$ npm run search -- "tla+" --size 3
3 results (614 docs indexed)

1. [1441.00] TLA+ for System Design
   2026-01-10 — /research/tla-plus-system-design
2. [1375.00] TLA+ Traffic Lights and Communication Protocol
   2024-08-11 — /research/tla+
3. [1133.00] E-Commerce Order State Machine in TLA+
   2026-01-11 — /research/tla-plus-system-design/ecommerce-order-states

$ npm run search -- "agen" --suggest
agent
agent architecture
agent coordination
agent framework

2.4. Babashka REPL

pocket-es.cli uses the shared token.cljc tokenizer and the BM25 formula directly — no JSON string matching, no subprocess. search returns data.

$ bb -cp src -m pocket-es.cli "graphql federation" 3
3 results for "graphql federation" (614 docs indexed)

  [15.1] Clojure + GraphQL Integration
        2019-12 — /research/clojure-graphql
  [15.0] GraphQL: Schema, Operations, and Federation
        2024-08-11 — /research/graphql
  [14.4] GraphQLConf 2025
        2025-09-07 — /events/graphqlconf-2025

For REPL or property-based testing, search returns maps:

$ bb -cp src -e '
(require (quote [pocket-es.cli :as cli]))
(let [idx (cli/load-index "site/static/search-index.json")]
  (map :title (cli/search idx "agent sandbox" :size 3)))
'
("Agent Sandbox Architectures" "Sandboxing AI Coding Agents with FreeBSD Jails"
 "CLI Coding Agents — 2026 Q2 Comparison")

2.5. Console API

Available in the browser DevTools after the page loads:

pocketES.search({ query: { match: { _all: "crdt" } } })
pocketES.suggest({ text: "clo", size: 8 })
pocketES.cluster.health()
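
As a rough mental model for suggest, prefix completion over a list of phrases is enough to reproduce the shape of the call above. The phrase list and the standalone suggest function below are hypothetical. How pocket-es actually sources and ranks suggestions is not described here.

```javascript
// Hypothetical suggestion source: phrases taken from titles and keywords.
const phrases = [
  "agent",
  "agent architecture",
  "agent coordination",
  "clojure",
  "closure compiler",
];

// Return up to `size` phrases that start with the lowercased input text.
function suggest(candidates, { text, size = 8 }) {
  const prefix = text.trim().toLowerCase();
  if (prefix === "") return [];
  return candidates.filter((p) => p.startsWith(prefix)).slice(0, size);
}

console.log(suggest(phrases, { text: "clo", size: 8 }));
// → [ 'clojure', 'closure compiler' ]
```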

3. Query DSL

Query type   Behavior
match        Tokenize, BM25 score per term, sum
term         Exact match on keyword/field array
bool         must/should/filter/must_not, intersect/union/filter
multi_match  Match across fields with boost weights
match_all    Return everything, score 1.0
prefix       Prefix scan on string/array fields
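
The three rows that need no BM25 scoring (term, prefix, match_all) can be illustrated with a small dispatcher. This is a sketch under assumptions: the sample documents, the evaluate function, case-insensitive comparison, and the constant score of 1.0 for term and prefix hits are illustrative. Only the 1.0 score for match_all comes from the table.

```javascript
// Hypothetical documents.
const docs = [
  { path: "/research/crdt", title: "crdt notes", keywords: ["crdt", "distributed"] },
  { path: "/research/clojure", title: "clojure macros", keywords: ["clojure", "lisp"] },
];

// Normalize a field to an array so string and array fields share one code path.
function fieldValues(doc, field) {
  const v = doc[field];
  if (v === undefined || v === null) return [];
  return Array.isArray(v) ? v : [v];
}

// Evaluate term, prefix, and match_all; every hit gets a constant score of 1.0.
function evaluate(query, documents) {
  const [type] = Object.keys(query);
  const body = query[type];
  let test;
  if (type === "match_all") {
    test = () => true;
  } else if (type === "term" || type === "prefix") {
    const [field] = Object.keys(body);
    const needle = String(body[field]).toLowerCase();
    test = (doc) =>
      fieldValues(doc, field).some((v) => {
        const value = String(v).toLowerCase();
        return type === "term" ? value === needle : value.startsWith(needle);
      });
  } else {
    throw new Error(`unsupported query type: ${type}`);
  }
  return documents.filter(test).map((doc) => ({ path: doc.path, score: 1.0 }));
}

console.log(evaluate({ term: { keywords: "crdt" } }, docs));
console.log(evaluate({ prefix: { title: "clo" } }, docs));
console.log(evaluate({ match_all: {} }, docs));
```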

4. Scoring

BM25 with k1=1.2, b=0.75. The entire scoring function:

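;; BM25 parameters, repeated here so the snippet evaluates on its own.
(def k1 1.2)
(def b 0.75)

(defn bm25-term
  "Score one query term against one document."
  [term doc idf avg-dl]
  (let [tf    (get-in doc [:terms term] 0)   ; term frequency in this doc
        dl    (:doc_len doc 1)               ; document length
        idf-v (get idf term 0)               ; precomputed at build time
        numer (* tf (+ k1 1))
        ;; length normalization: long docs need more hits for the same score
        denom (+ tf (* k1 (+ (- 1 b) (* b (/ dl avg-dl)))))]
    (if (zero? denom) 0
        (* idf-v (/ numer denom)))))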

IDF is precomputed at build time. Term frequency lives in the per-document terms map.
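
To make the arithmetic concrete, here is a JavaScript transcription of bm25-term with one worked example. The input numbers are made up for illustration and the bm25Term name is not part of the pocket-es API.

```javascript
const K1 = 1.2;
const B = 0.75;

// One term scored against one document, mirroring bm25-term above.
function bm25Term(term, doc, idf, avgDl) {
  const tf = (doc.terms && doc.terms[term]) || 0;
  const dl = doc.doc_len ?? 1;
  const idfV = idf[term] ?? 0;
  const numer = tf * (K1 + 1);
  const denom = tf + K1 * (1 - B + B * (dl / avgDl));
  return denom === 0 ? 0 : idfV * (numer / denom);
}

// The term occurs 3 times in a document of exactly average length, with IDF 2.0:
// numer = 3 * 2.2 = 6.6, denom = 3 + 1.2 * 1 = 4.2, score = 2.0 * 6.6 / 4.2.
const average = bm25Term("clojure", { terms: { clojure: 3 }, doc_len: 100 }, { clojure: 2.0 }, 100);
console.log(average.toFixed(4)); // 3.1429

// Same term frequency in a document twice the average length scores lower.
const longer = bm25Term("clojure", { terms: { clojure: 3 }, doc_len: 200 }, { clojure: 2.0 }, 100);
console.log(longer < average); // true
```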

5. Query Examples

Click any block to execute it. Results appear inline and are logged to the DevTools console.

5.1. match — tokenize + BM25 score

pocketES.search({ query: { match: { title: "clojure" } } })
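
Conceptually, a match query tokenizes the text, scores each term with BM25, and sums the results per document. The sketch below shows that flow against a hand-written index. It ignores the field name and scores against a single terms map per document. The tokenizer and the index layout are assumptions for the example.

```javascript
const K1 = 1.2;
const B = 0.75;

// Hypothetical tokenizer standing in for token.cljc.
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

function bm25Term(term, doc, idf, avgDl) {
  const tf = doc.terms[term] || 0;
  const denom = tf + K1 * (1 - B + B * (doc.doc_len / avgDl));
  return denom === 0 ? 0 : (idf[term] || 0) * ((tf * (K1 + 1)) / denom);
}

// Sum per-term BM25 scores per document, drop non-matches, sort by score descending.
function match(queryText, index) {
  const terms = tokenize(queryText);
  return index.docs
    .map((doc) => ({
      path: doc.path,
      score: terms.reduce((sum, t) => sum + bm25Term(t, doc, index.idf, index.avg_dl), 0),
    }))
    .filter((hit) => hit.score > 0)
    .sort((x, y) => y.score - x.score);
}

// Hand-written toy index; the numbers are illustrative.
const index = {
  avg_dl: 4,
  idf: { graphql: 1.0, federation: 1.5 },
  docs: [
    { path: "/research/clojure-graphql", doc_len: 4, terms: { graphql: 1 } },
    { path: "/research/graphql", doc_len: 4, terms: { graphql: 2, federation: 1 } },
    { path: "/research/tla+", doc_len: 4, terms: { tla: 3 } },
  ],
};
console.log(match("GraphQL federation", index).map((h) => h.path));
// → [ '/research/graphql', '/research/clojure-graphql' ]
```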

5.2. term — exact match on keyword field

pocketES.search({ query: { term: { keywords: "crdt" } } })

5.3. multi_match — across fields with boosts

pocketES.search({ query: { multi_match: {
  query: "agent isolation",
  fields: ["title^3", "description", "headings^2"]
} } })
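
The field^boost notation can be unpacked with a few lines of parsing. In the sketch below, parseField and combine are hypothetical helpers, and the boost-weighted sum is an assumption: the text does not say whether pocket-es sums field scores or keeps the best one.

```javascript
// Parse "title^3" into { field: "title", boost: 3 }; a bare name gets boost 1.
function parseField(spec) {
  const [field, boost] = spec.split("^");
  return { field, boost: boost === undefined ? 1 : Number(boost) };
}

// Combine per-field scores into one score by boost-weighted sum (assumed strategy).
function combine(fieldSpecs, scoreForField) {
  return fieldSpecs
    .map(parseField)
    .reduce((total, { field, boost }) => total + boost * scoreForField(field), 0);
}

// Illustrative per-field match scores for one document.
const perField = { title: 1.0, description: 0.5, headings: 0.25 };
const total = combine(["title^3", "description", "headings^2"], (f) => perField[f] || 0);
console.log(total); // 4
```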

5.4. bool — must / should / must_not

pocketES.search({ query: { bool: {
  must:     [{ match: { title: "freebsd" } }],
  should:   [{ match: { title: "security" } }],
  must_not: [{ term:  { keywords: "python" } }]
} } })
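
The set semantics of bool (intersect for must, union for should, exclusion for must_not) can be sketched over precomputed clause results. Each clause below is a hand-written Map of path to score, standing in for a real sub-query. The bool function and the rule that should clauses add to the score are assumptions for illustration.

```javascript
// Combine clause results, each a Map of document path -> score.
function bool({ must = [], should = [], filter = [], must_not = [] }, allPaths) {
  const scores = new Map(allPaths.map((p) => [p, 0]));
  const keepOnly = (pred) => {
    for (const p of [...scores.keys()]) if (!pred(p)) scores.delete(p);
  };

  // must: intersect and accumulate each clause's score.
  for (const clause of must) {
    keepOnly((p) => clause.has(p));
    for (const p of scores.keys()) scores.set(p, scores.get(p) + clause.get(p));
  }
  // filter: intersect without touching the score.
  for (const clause of filter) keepOnly((p) => clause.has(p));
  // must_not: drop anything the clause matched.
  for (const clause of must_not) keepOnly((p) => !clause.has(p));
  // With no must or filter, at least one should clause has to match (union).
  if (must.length === 0 && filter.length === 0 && should.length > 0) {
    keepOnly((p) => should.some((clause) => clause.has(p)));
  }
  // should: surviving documents that match get a score bonus.
  for (const clause of should) {
    for (const [p, s] of clause) if (scores.has(p)) scores.set(p, scores.get(p) + s);
  }
  return [...scores]
    .map(([path, score]) => ({ path, score }))
    .sort((x, y) => y.score - x.score);
}

const hits = bool(
  {
    must: [new Map([["/a", 2.0], ["/b", 1.0], ["/c", 1.5]])],
    should: [new Map([["/b", 1.5]])],
    must_not: [new Map([["/c", 1.0]])],
  },
  ["/a", "/b", "/c", "/d"]
);
console.log(hits.map((h) => h.path)); // [ '/b', '/a' ]
```

Here /d fails the must clause, /c is excluded by must_not, and the should bonus lifts /b above /a.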

6. State Machine

[Figure: state-machine.png]

Nine user actions, five atoms, URL sync. Try/keyword clicks reset everything; typing keeps the date filter; pagination keeps both. Full diagram source in state-machine.dot.
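
The reset rules can be expressed as a pure transition function. The sketch below models three of the nine actions and three pieces of state. The state shape, the action names, the page reset on typing, and the URL parameter names (q, date, page) are all hypothetical and chosen only to illustrate the rules stated above.

```javascript
// Hypothetical state shape; the real UI keeps five atoms.
const initial = { query: "", dateFilter: null, page: 1 };

// Pure transition function for three of the nine actions.
function transition(state, action) {
  switch (action.type) {
    case "keyword-click": // reset everything, then search for the clicked keyword
      return { ...initial, query: action.keyword };
    case "type": // keep the date filter, restart at page 1 (assumed)
      return { ...state, query: action.text, page: 1 };
    case "paginate": // keep both the query and the date filter
      return { ...state, page: action.page };
    default:
      return state;
  }
}

// URL sync: serialize the non-default parts of the state into a query string.
function toQueryString(state) {
  const params = new URLSearchParams();
  if (state.query) params.set("q", state.query);
  if (state.dateFilter) params.set("date", state.dateFilter);
  if (state.page > 1) params.set("page", String(state.page));
  return params.toString();
}

let s = { query: "crdt", dateFilter: "2025", page: 3 };
s = transition(s, { type: "type", text: "crdt merge" });
console.log(toQueryString(s)); // q=crdt+merge&date=2025
```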

7. Stack

  • Shared tokenizer: src/pocket_es/token.cljc — compiles to CLJ, CLJS, and bb
  • Indexer: src/pocket_es/indexer.clj (JVM Clojure) — parses org files, computes BM25 IDF, emits JSON
  • Client: src/pocket_es/core.cljs — BM25 scoring, Lucene-style query DSL, index loading
  • UI: src/pocket_es/ui.cljs — self-injecting DOM, debounced input, date filters
  • CLI: src/pocket_es/cli.clj — bb/JVM entry point, data-returning search function
  • Tests: test/pocket_es/token_test.cljc — 25 assertions including 300 property-based test iterations
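
As an illustration of what a property-based tokenizer check can look like, the sketch below tests idempotence: re-tokenizing the joined tokens should change nothing. The tokenizer is a hypothetical stand-in for token.cljc, the generator is a small hand-rolled seeded PRNG in place of a testing library, and the property itself is an example, not necessarily one of those in token_test.cljc.

```javascript
// Hypothetical tokenizer standing in for token.cljc.
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Small seeded PRNG (mulberry32-style) so failures are reproducible.
function seededRandom(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Generate a random string mixing letters, digits, punctuation, and whitespace.
function randomText(rand) {
  const alphabet = "abcXYZ019 -_+.,\n\t";
  const length = Math.floor(rand() * 40);
  let out = "";
  for (let i = 0; i < length; i++) out += alphabet[Math.floor(rand() * alphabet.length)];
  return out;
}

// Property: tokenize(join(tokenize(x))) equals tokenize(x) for every generated x.
function checkIdempotent(iterations = 300, seed = 42) {
  const rand = seededRandom(seed);
  for (let i = 0; i < iterations; i++) {
    const text = randomText(rand);
    const once = tokenize(text);
    const twice = tokenize(once.join(" "));
    if (JSON.stringify(once) !== JSON.stringify(twice)) return { ok: false, text };
  }
  return { ok: true };
}

console.log(checkIdempotent());
```

A failing run returns the offending input, which together with the fixed seed makes the case easy to replay.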