Building a Search Engine in One Session
1. What happened
pocket-es went from nothing to a working search engine on a live site in one session. 611 org-mode documents indexed, BM25 scoring, a Lucene-style query DSL, click-to-execute examples, deep-linkable URLs, gzip compression, and three consumers of one JSON index (browser, Emacs, build-time validation).
The interesting part is not the search engine itself — BM25 is 15 lines of
code and the query DSL is a cond — but the development process that made
it possible to go this fast without accumulating debt.
2. The .cljc bet paid off immediately
The original prototype used Python for the build-time indexer and ClojureScript for the client. Within hours, we had two tokenizers with different stopword lists, different length filters, and an allowlist that only existed on one side. Documents indexed with one set of rules were searched with another.
The fix was a single .cljc file — token.cljc — that compiles to both
JVM Clojure (for the indexer) and JavaScript (for the browser client). One
tokenizer, one stopword set, one truth. The Python indexer was deleted the
same day it shipped.
If your pipeline has a build step and a query step, and they share any text
processing, start with a shared source file. The cost is near zero (.cljc
is just Clojure with reader conditionals). The cost of not doing it is
silent divergence that surfaces as "why doesn't this search find what I
indexed?"
3. Property-based tests as a contract language
Four property specs, 300 iterations:
- Tokenization never returns stopwords
- All tokens are lowercase
- Tokenizing already-tokenized text is idempotent
- Term frequency counts respect the top-N limit
These ran on the JVM in under a second and caught real issues: a duplicate stopword in the set literal, test expectations that used stopwords as expected output, and the realization that "before" and "after" are stopwords (which matters when your test is about stripping org blocks).
Property tests are particularly good for tokenizers because the contract is simple (input → tokens, with invariants) but the edge cases are unbounded (org markup, Unicode, nested blocks). You cannot enumerate them; you can state what must always be true.
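The four invariants above can be sketched in a few lines. This is a minimal illustration in Python, not the project's test suite (which runs against token.cljc on the JVM). The stopword set, the minimum token length, the top-N value, and the random text generator are all assumptions made for the example.

```python
import random
import re
import string
from collections import Counter

# Hypothetical stand-ins for the real tokenizer configuration.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "before", "after"})
MIN_LEN = 2   # assumed minimum token length
TOP_N = 5     # assumed cap on terms kept per document


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop short tokens and stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_LEN and w not in STOPWORDS]


def top_terms(text, n=TOP_N):
    """Term frequencies for the n most common tokens."""
    return dict(Counter(tokenize(text)).most_common(n))


# Fragments that resemble the awkward inputs: stopwords in mixed case,
# org markup, hyphenated terms, non-ASCII letters.
FRAGMENTS = ["The", "before", "AFTER", "#+BEGIN_SRC", "#+END_SRC",
             "data-science", "na\u00efve", "BM25", "*bold*", "\n"]


def random_text(rng):
    """Mix known-tricky fragments with random printable noise."""
    parts = []
    for _ in range(rng.randint(0, 20)):
        if rng.random() < 0.5:
            parts.append(rng.choice(FRAGMENTS))
        else:
            length = rng.randint(1, 8)
            parts.append("".join(rng.choice(string.printable) for _ in range(length)))
    return " ".join(parts)


def check_properties(iterations=300, seed=0):
    """Assert the four invariants on randomly generated text."""
    rng = random.Random(seed)
    for _ in range(iterations):
        text = random_text(rng)
        tokens = tokenize(text)
        assert not STOPWORDS.intersection(tokens)          # never returns stopwords
        assert all(t == t.lower() for t in tokens)         # all tokens lowercase
        assert tokenize(" ".join(tokens)) == tokens        # idempotent
        assert len(top_terms(text)) <= TOP_N               # respects top-N limit
    return iterations
```

Calling check_properties() runs all 300 iterations and raises AssertionError on the first input that violates an invariant. A real property-testing library would also shrink the failing input, which this sketch does not attempt.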
4. Three consumers, one index
The JSON index at site/static/search-index.json is consumed by:
| Consumer | Runtime | Scoring | Entry point |
|---|---|---|---|
| Browser | ClojureScript | BM25 | pocket-es.js → pocketES.* |
| Emacs | Elisp | BM25 | pocket-es.el → M-x |
| Build/test | JVM Clojure | BM25 | indexer.clj / token_test |
All three implement the same BM25 formula with the same parameters (k1=1.2, b=0.75). The Emacs client has 8 ERT tests that validate the index structure and scoring behavior against the same artifact the browser loads.
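For reference, a BM25 scorer with those parameters fits in a short function. The sketch below is in Python rather than any of the three runtimes, and the IDF variant (the non-negative form used by Lucene) is an assumption, since the formula itself is not quoted here. The function name and argument layout are hypothetical.

```python
import math

K1 = 1.2   # term-frequency saturation
B = 0.75   # document-length normalization


def bm25_score(query_terms, doc_tf, doc_len, avg_len, doc_freq, n_docs):
    """Score one document against a list of query terms.

    doc_tf:   {term: count} for this document
    doc_freq: {term: number of documents containing the term}
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freq.get(term, 0)
        # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Saturating term frequency, normalized by relative document length.
        norm = tf * (K1 + 1) / (tf + K1 * (1 - B + B * doc_len / avg_len))
        score += idf * norm
    return score
```

With everything else equal, a higher term count raises the score with diminishing returns, and a document longer than average scores lower than a shorter one. Those two behaviors are the kind of thing a cross-runtime test can check against the shared index.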
5. Closure advanced compilation breaks #js literals
The hardest bug to find: ClojureScript's #js {:hits ...} creates objects
with keyword-named properties. Closure Compiler's advanced optimization
renames these properties. The UI accesses them with (.. res -hits -hits),
which gets renamed independently. The property names diverge and the access
returns undefined.
The fix: (js-obj "hits" ...) with string keys. Closure cannot rename
string property names. Every #js {} was replaced with js-obj. Compiler
warnings dropped from 4 to 0.
This is well-known but does not surface until you switch from :none (dev)
to :advanced (release). If your ClojureScript works in dev but throws
TypeError: Cannot read properties of undefined in release, check your
#js literals.
6. The org-mode site as a corpus
An org-mode site is an unusually clean corpus for a search engine:
- Every document has structured metadata (#+TITLE, #+KEYWORDS, #+DATE) that maps directly to search fields
- Source blocks are delimited and can be stripped before tokenization
- Property drawers contain metadata that should not be indexed
- The file tree structure gives you URL paths for free
The main quality issue was keywords. A data audit found 35% of documents had none, 84% of unique keywords appeared in only one document, and conference session titles were used verbatim as keywords. Four cleanup passes (normalize plurals, remove broad terms, replace session-title strings, add missing keywords) dropped zero-keyword documents from 35% to 21%.
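The two headline numbers from that audit are cheap to compute. A minimal sketch in Python, assuming each document is a dict with an optional "keywords" list; the field name and the function are hypothetical, not the actual index schema or audit script.

```python
from collections import Counter


def keyword_audit(docs):
    """Return the share of docs without keywords and the share of
    unique keywords that appear in exactly one document."""
    total = len(docs)
    without = sum(1 for d in docs if not d.get("keywords"))
    # Count each keyword once per document, even if it is listed twice.
    counts = Counter(kw for d in docs for kw in set(d.get("keywords") or []))
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "zero_keyword_share": without / total if total else 0.0,
        "singleton_share": singletons / len(counts) if counts else 0.0,
    }
```

Rerunning the same function after each cleanup pass gives a before/after comparison without any manual counting.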
7. What was not built
- No stemming. Exact token match with a stopword list. Good enough for a technical corpus.
- No sharding. One fetch, one JSON file. At 611 documents and 280KB gzipped, a single file is the right trade-off.
- No caching. Fresh fetch on every page load. Caching adds debugging cost that is not worth it during prototyping.
- No pagination. Max 10 results. If the top 10 are not useful, the query is wrong.
- No server. The entire search engine runs in the browser. The server serves static files.
8. The keyword reachability bug
The search box issued multi_match across title, description,
headings, and terms — but never keywords. 443 of 616 docs (72%)
have keywords whose tokens do not appear in the body text. Those docs
were unreachable through the UI.
The bug was masked by three layers of misdirection:
- Stopword removal. "data" and "code" were stopwords, so "data-science" lost a token. Removing them from the stopword list was correct but insufficient: keywords still were not in the search path.
- Green tokenizer tests. The PBT suite tests tokenization invariants (no stopwords, lowercase, idempotent, bounded). All passed. But tokenizer correctness cannot catch a path-integration bug: the tokens were fine, the search path never consulted them.
- The term query worked. term: {keywords: "data-science"} found the doc. An engineer testing in the console would conclude "keywords work." But the search box uses match, not term. The two paths were never compared.
The fix was two lines: add keywords^2 to the multi_match fields,
and tokenize keywords before comparison (the scoring function compared
tokenized query terms against untokenized keyword strings like
"data-science", which never matched).
The lesson: test the path the user takes, not the path the developer takes. A reachability invariant — "for every keyword in every doc, the match path must return that doc" — would have caught this on the first deploy. Green unit tests and green property tests gave a false sense of "done" while 72% of the corpus was unreachable.
The adversarial reviewer caught it by running match and term
side-by-side across the full index and comparing the result sets.
That comparison is now the acceptance test.
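The reachability invariant can be written down directly. The sketch below is a Python illustration over a two-document toy index, not the project's acceptance test. The document fields, the simplified tokenizer, and the any-token-matches semantics of match_search are assumptions made to keep the example small.

```python
import re


def tokenize(text):
    """Simplified tokenizer: lowercase alphanumeric runs, no stopword removal."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Toy index. Doc "a" has a keyword whose tokens appear nowhere else in the doc.
DOCS = [
    {"id": "a", "title": "Notes on pandas",
     "terms": "dataframe joins merges", "keywords": ["data-science"]},
    {"id": "b", "title": "Emacs tips",
     "terms": "org mode capture templates", "keywords": ["emacs"]},
]


def match_search(docs, query, fields):
    """Return ids of docs where any query token appears in any listed field."""
    query_tokens = set(tokenize(query))
    hits = set()
    for doc in docs:
        for field in fields:
            value = doc[field]
            text = " ".join(value) if isinstance(value, list) else value
            if query_tokens & set(tokenize(text)):
                hits.add(doc["id"])
                break
    return hits


def unreachable(docs, fields):
    """List (doc id, keyword) pairs where searching the keyword misses its own doc."""
    missing = []
    for doc in docs:
        for keyword in doc["keywords"]:
            if doc["id"] not in match_search(docs, keyword, fields):
                missing.append((doc["id"], keyword))
    return missing
```

Here unreachable(DOCS, ["title", "terms"]) reports doc "a" with its "data-science" keyword, while adding "keywords" to the field list returns an empty list. Run over the full index, an empty result is the pass condition.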
9. See also
- pocket-es — the search engine itself (with live query examples)
- Testing pocket-es from the Outside — the tester's perspective