Building a Search Engine in One Session

1. What happened
2. The .cljc bet paid off immediately
3. Property-based tests as a contract language
4. Three consumers, one index
5. Closure advanced compilation breaks #js literals
6. The org-mode site as a corpus
7. What was not built
8. The keyword reachability bug
9. See also

1. What happened

pocket-es went from nothing to a working search engine on a live site in one session. By the end it had 611 org-mode documents indexed, BM25 scoring, a Lucene-style query DSL, click-to-execute examples, deep-linkable URLs, gzip compression, and three consumers of one JSON index (browser, Emacs, build-time validation).

The interesting part is not the search engine itself – BM25 is 15 lines of code and the query DSL is a cond – but the development process that made it possible to go this fast without accumulating debt.

2. The .cljc bet paid off immediately

The original prototype used Python for the build-time indexer and ClojureScript for the client. Within hours, we had two tokenizers with different stopword lists, different length filters, and an allowlist that only existed on one side. Documents indexed with one set of rules were searched with another.

The fix was a single .cljc file – token.cljc – that compiles to both JVM Clojure (for the indexer) and JavaScript (for the browser client). One tokenizer, one stopword set, one truth. The Python indexer was deleted the same day it shipped.

If your pipeline has a build step and a query step, and they share any text processing, start with a shared source file. The cost is near zero (.cljc is just Clojure with reader conditionals). The cost of not doing it is silent divergence that surfaces as "why doesn't this search find what I indexed?"
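A minimal sketch of what such a shared file could look like. The stopword set, the length filter, and the split pattern are illustrative assumptions, not the contents of the real token.cljc. Everything here uses clojure.string, which exists on both platforms, so this sketch needs no reader conditional at all.

```clojure
(ns token
  (:require [clojure.string :as str]))

;; Hypothetical values: the real stopword list and length filter are not shown in this post.
(def stopwords #{"the" "a" "an" "and" "of" "to" "before" "after"})
(def min-length 2)

;; Platform-specific code, if it were ever needed, would go in a reader
;; conditional of the form #?(:clj ... :cljs ...). Nothing below requires one.

(defn tokenize
  "Lowercases text, splits on non-alphanumeric runs (ASCII only, a simplification),
   drops stopwords and tokens shorter than min-length."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove str/blank?)
       (remove stopwords)
       (filter #(>= (count %) min-length))
       vec))

(tokenize "The quick, brown FOX")
```

Because the indexer and the client both call the same tokenize, a change to the stopword set or the length filter reaches both sides in one edit.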

3. Property-based tests as a contract language

Four property specs, 300 iterations:

Tokenization never returns stopwords
All tokens are lowercase
Tokenizing already-tokenized text is idempotent
Term frequency counts respect the top-N limit

These ran on the JVM in under a second and caught real issues: a duplicate stopword in the set literal, test expectations that used stopwords as expected output, and the realization that "before" and "after" are stopwords (which matters when your test is about stripping org blocks).

Property tests are particularly good for tokenizers because the contract is simple (input → tokens, with invariants) but the edge cases are unbounded (org markup, Unicode, nested blocks). You cannot enumerate them; you can state what must always be true.
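The four invariants can be sketched without any dependencies. This is a hand-rolled random check, not the project's suite: a property-testing library such as test.check would add proper generators and shrinking. The tokenizer, the stopword set, the top-terms helper, and the top-N limit of 5 are stand-ins chosen for illustration.

```clojure
(ns token-props
  (:require [clojure.string :as str]))

;; Stand-in tokenizer with a hypothetical stopword set; not the real token.cljc.
(def stopwords #{"the" "and" "of" "before" "after"})

(defn tokenize [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove str/blank?)
       (remove stopwords)
       vec))

(defn top-terms
  "Term frequencies, keeping only the n most frequent terms."
  [n tokens]
  (->> (frequencies tokens)
       (sort-by val >)
       (take n)
       (into {})))

(defn rand-text
  "Random string built from words, stopwords, org markup, and punctuation."
  []
  (let [pieces ["The" "AND" "of" "before" "Search" "index" "BM25"
                "#+BEGIN_SRC" "-" ", " " " "\n"]]
    (apply str (repeatedly (rand-int 20) #(rand-nth pieces)))))

(defn check
  "Runs prop against n random inputs; returns the first failing input, or nil."
  [n prop]
  (first (remove prop (repeatedly n rand-text))))

(def properties
  {:no-stopwords (fn [s] (not-any? stopwords (tokenize s)))
   :lowercase    (fn [s] (every? #(= % (str/lower-case %)) (tokenize s)))
   :idempotent   (fn [s] (let [t (tokenize s)]
                           (= t (tokenize (str/join " " t)))))
   :top-n        (fn [s] (<= (count (top-terms 5 (tokenize s))) 5))})

;; Map of property name to first counterexample; all values are nil when every property holds.
(def failures
  (into {} (for [[k p] properties] [k (check 300 p)])))
```

Each property states something that must hold for any input, which is why a failure arrives as a concrete counterexample string rather than a broken hand-written expectation.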

4. Three consumers, one index

The JSON index at site/static/search-index.json is consumed by:

Consumer	Runtime	Scoring	Entry point
Browser	ClojureScript	BM25	`pocket-es.js` → `pocketES.*`
Emacs	Elisp	BM25	`pocket-es.el` → `M-x`
Build/test	JVM Clojure	BM25	`indexer.clj` / `token_test`

All three implement the same BM25 formula with the same parameters (k1=1.2, b=0.75). The Emacs client has 8 ERT tests that validate the index structure and scoring behavior against the same artifact the browser loads.
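For reference, a sketch of BM25 with those parameters. The data shapes (:tf, :len, :df, :avg-len) and the Lucene-style IDF variant are assumptions made for this example; the post does not say which IDF form pocket-es uses or how its index is laid out.

```clojure
(ns bm25-sketch)

(def k1 1.2)
(def b 0.75)

(defn idf
  "Lucene-style inverse document frequency (an assumed variant)."
  [n-docs doc-freq]
  (Math/log (+ 1.0 (/ (+ (- n-docs doc-freq) 0.5)
                      (+ doc-freq 0.5)))))

(defn bm25
  "Scores one document against a seq of query terms.
   stats: {:n-docs N, :avg-len mean doc length, :df {term -> docs containing term}}
   doc:   {:tf {term -> count in this doc}, :len token count of this doc}"
  [{:keys [n-docs avg-len df]} {:keys [tf len]} query-terms]
  (reduce
   (fn [score term]
     (let [f (get tf term 0)]
       (if (zero? f)
         score ; term absent from this doc contributes nothing
         (+ score
            (* (idf n-docs (get df term 0))
               (/ (* f (+ k1 1.0))
                  (+ f (* k1 (+ (- 1.0 b) (* b (/ len avg-len)))))))))))
   0.0
   query-terms))

;; Hypothetical three-document corpus statistics.
(def stats {:n-docs 3 :avg-len 4 :df {"search" 2 "emacs" 1}})
(def doc-a {:tf {"search" 2 "emacs" 1} :len 4})
(def doc-b {:tf {"search" 1} :len 4})

(bm25 stats doc-a ["search" "emacs"])
(bm25 stats doc-b ["search" "emacs"])
```

With a function this small, keeping three implementations in step is mostly a matter of sharing k1, b, and the IDF form, and then checking all three against the same index file.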

5. Closure advanced compilation breaks #js literals

The hardest bug to find: ClojureScript's #js {:hits ...} creates objects with keyword-named properties. Closure Compiler's advanced optimization renames these properties. The UI accesses them with (.. res -hits -hits), which gets renamed independently. The property names diverge and the access returns undefined.

The fix: (js-obj "hits" ...) with string keys. Closure cannot rename string property names. Every #js {} was replaced with js-obj. Compiler warnings dropped from 4 to 0.

This is well-known but does not surface until you switch from :none (dev) to :advanced (release). If your ClojureScript works in dev but throws "TypeError: Cannot read properties of undefined" in release, check your #js literals.

6. The org-mode site as a corpus

An org-mode site is an unusually clean corpus for a search engine:

Every document has structured metadata (#+TITLE, #+KEYWORDS, #+DATE) that maps directly to search fields
Source blocks are delimited and can be stripped before tokenization
Property drawers contain metadata that should not be indexed
The file tree structure gives you URL paths for free
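The first three points can be sketched with a few regular expressions. The patterns below are assumptions for illustration and cover only the common cases (they ignore nested or malformed blocks); the function names are hypothetical and not taken from the indexer.

```clojure
(ns org-strip
  (:require [clojure.string :as str]))

;; (?ims): case-insensitive, ^ matches at line starts, . matches newlines.
(def src-block   #"(?ims)^[ \t]*#\+begin_src.*?^[ \t]*#\+end_src[^\n]*\n?")
(def prop-drawer #"(?ims)^[ \t]*:PROPERTIES:.*?^[ \t]*:END:[^\n]*\n?")

(defn strip-org
  "Removes source blocks and property drawers so they are never tokenized."
  [text]
  (-> text
      (str/replace src-block "")
      (str/replace prop-drawer "")))

(defn header
  "Value of a #+KEY: line such as TITLE or KEYWORDS, or nil if absent.
   key is assumed to be a plain word with no regex metacharacters."
  [text key]
  (second (re-find (re-pattern (str "(?im)^#\\+" key ":[ \\t]*(.*)$")) text)))

;; Hypothetical document used only to exercise the functions above.
(def sample
  (str "#+TITLE: Search notes\n"
       "#+KEYWORDS: search, bm25\n"
       "* Heading\n"
       ":PROPERTIES:\n"
       ":CUSTOM_ID: notes\n"
       ":END:\n"
       "Body text before the block.\n"
       "#+BEGIN_SRC clojure\n"
       "(println \"hidden\")\n"
       "#+END_SRC\n"
       "Body text after the block.\n"))

(header sample "TITLE")
(strip-org sample)
```

Stripping happens before tokenization, so code inside source blocks and drawer metadata never reach the term list, while the #+ header lines are read separately into their own fields.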

The main quality issue was keywords. A data audit found 35% of documents had none, 84% of unique keywords appeared in only one document, and conference session titles were used verbatim as keywords. Four cleanup passes (normalize plurals, remove broad terms, replace session-title strings, add missing keywords) dropped zero-keyword documents from 35% to 21%.
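An audit of this kind fits in a few lines. The sketch below assumes each document is a map with a :keywords vector; that shape and the sample data are hypothetical, and the percentages it prints are for the sample only, not the real corpus.

```clojure
(ns keyword-audit)

(defn audit
  "docs is a seq of maps, each with a :keywords vector (possibly empty or missing)."
  [docs]
  (let [n       (count docs)
        no-kw   (count (filter (comp empty? :keywords) docs))
        ;; distinct per doc, so a keyword repeated inside one doc counts once
        kw-freq (frequencies (mapcat (comp distinct :keywords) docs))
        uniq    (count kw-freq)
        singles (count (filter #(= 1 (val %)) kw-freq))]
    {:docs                   n
     :pct-without-keywords   (if (pos? n) (* 100.0 (/ no-kw n)) 0.0)
     :pct-singleton-keywords (if (pos? uniq) (* 100.0 (/ singles uniq)) 0.0)}))

;; Hypothetical four-document sample.
(def sample-docs
  [{:keywords []}
   {:keywords ["search" "bm25"]}
   {:keywords ["search"]}
   {:keywords ["emacs"]}])

(audit sample-docs)
```

Running the same function before and after each cleanup pass gives a cheap way to see whether the pass moved the numbers.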

7. What was not built

No stemming. Exact token match with a stopword list. Good enough for a technical corpus.
No sharding. One fetch, one JSON file. At 611 documents and 280 KB gzipped, a single file is the right size.
No caching. Fresh fetch on every page load. Caching adds debugging cost that is not worth it during prototyping.
No pagination. Max 10 results. If the top 10 are not useful, the query is wrong.
No server. The entire search engine runs in the browser. The server serves static files.

8. The keyword reachability bug

The search box issued multi_match across title, description, headings, and terms – but never keywords. 443 of 616 docs (72%) have keywords whose tokens do not appear in the body text. Those docs were unreachable through the UI.

The bug was masked by three layers of misdirection:

Stopword removal. "data" and "code" were stopwords, so "data-science" lost a token. Removing them from the stopword list was correct but insufficient – keywords still were not in the search path.
Green tokenizer tests. The PBT suite tests tokenization invariants (no stopwords, lowercase, idempotent, bounded). All passed. But tokenizer correctness cannot catch a path-integration bug: the tokens were fine, the search path never consulted them.
The term query worked. term: {keywords: "data-science"} found the doc. An engineer testing in the console would conclude "keywords work." But the search box uses match, not term. The two paths were never compared.

The fix was two lines: add keywords^2 to the multi_match fields, and tokenize keywords before comparison (the scoring function compared tokenized query terms against untokenized keyword strings like "data-science", which never matched).

The lesson: test the path the user takes, not the path the developer takes. A reachability invariant – "for every keyword in every doc, the match path must return that doc" – would have caught this on the first deploy. Green unit tests and green property tests gave a false sense of "done" while 72% of the corpus was unreachable.

The adversarial reviewer caught it by running match and term side-by-side across the full index and comparing the result sets. That comparison is now the acceptance test.
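The reachability invariant can be written as a small check. This sketch assumes a two-document in-memory index and a boolean matcher in place of the real multi_match and BM25 scoring; the names match-docs and unreachable are hypothetical. The field list is a parameter so the bug and the fix can be compared directly.

```clojure
(ns reachability
  (:require [clojure.string :as str]))

(defn tokenize
  "Simplified tokenizer (no stopwords) returning a set of tokens."
  [text]
  (->> (str/split (str/lower-case text) #"[^a-z0-9]+")
       (remove str/blank?)
       set))

;; Hypothetical index; doc "a" has a keyword whose tokens appear nowhere else in it.
(def index
  [{:id "a" :title "Pandas notes"  :terms "dataframes and plots" :keywords ["data-science"]}
   {:id "b" :title "Search engine" :terms "bm25 scoring search"  :keywords ["search"]}])

(defn field-tokens
  "Tokens of one field; collection-valued fields such as :keywords are joined and tokenized too."
  [doc field]
  (let [v (get doc field)]
    (tokenize (if (coll? v) (str/join " " v) (str v)))))

(defn match-docs
  "Ids of docs where any query token appears in any of the given fields."
  [fields query]
  (let [q (tokenize query)]
    (set (for [doc index
               :when (some (fn [f] (some q (field-tokens doc f))) fields)]
           (:id doc)))))

(defn unreachable
  "Every [doc-id keyword] pair whose keyword, used as a query, fails to return its own doc."
  [fields]
  (for [doc index
        kw  (:keywords doc)
        :when (not (contains? (match-docs fields kw) (:id doc)))]
    [(:id doc) kw]))

(unreachable [:title :terms])            ; search path without keywords
(unreachable [:title :terms :keywords])  ; search path with keywords
```

An empty result from unreachable over the full index is the acceptance condition; any pair it returns names a document and the exact query that fails to find it.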

9. See also

pocket-es – the search engine itself (with live query examples)
Testing pocket-es from the Outside – the tester's perspective