pocket-es: Structured Query Surface

1. Abstract

Three candidate input surfaces (free text, Lucene-style text grammar, JSON) are not three features. They are three frontends over one intermediate representation: the search object search() already consumes. JSON is that IR written down. The decision recorded here: surface JSON as the structured input now, gate it with a malli contract, and defer the text grammar to a later phase reachable behind radio buttons. Single user, 620 docs: the constraints that usually force ergonomic sugar (untrusted input, throughput) do not bind, so correctness of the contract dominates ergonomics of the syntax.

2. Context: one IR, several surfaces

The console API already accepts the IR:

pocketES.search({ query: { match: { _all: "crdt" } } })

So JSON structured search is not unbuilt. It is unsurfaced and ungated. Three gaps, only one of which is the toggle's job:

  1. No DOM surface to type the IR (toggle closes this).
  2. No error rendering when the IR is malformed (toggle closes this).
  3. No validation before the call reaches search() (malli closes this, and it is independent of the toggle: see section 6).

The free-text box constructs its IR programmatically (text -> a multi_match over the boosted field list), so it cannot emit a malformed IR. The structured box lets a human type the IR directly, which is the only path that can produce something the contract rejects. That is why "the toggle enforces validation" reads true experientially while validation is in fact universal: the toggle is the only surface where validation can fire.

3. Decision: JSON-first, text-DSL deferred

Why JSON beats a text grammar as the primary structured surface:

  • The IR is JSON. JSON mode is JSON.parse plus a validate call. No lexer, no parser, no precedence table, no round-trip printer. The text grammar is ~120 lines of parser whose correctness must itself be established; JSON mode has no parser to be wrong.
  • Semantics are known, not discovered. The required mapping back into search() is fixed and lives in one head. A text grammar exists to spare a user from learning the IR. The user already knows the IR. The sugar buys nothing it does not also have to maintain.
  • It is the migration spine. JSON is the canonical form. Later modes (simple, lucene) are alternate frontends that compile to the same IR and are checked against the same contract. Adding radio buttons does not change the contract; it adds producers of IR.

Refutation condition: if hand-typing the same bool blob recurs often enough to annoy, JSON-as-primary has failed ergonomically and the text grammar earns its slot. If the text grammar never ships, JSON-first was right and the deferral cost nothing.

4. Architecture

[Figure: query-surface-arch.png]

Every producer terminates in the same IR. The contract sits at the IR/executor boundary, once. The blast radius of a contract change is one schema, consumed by N surfaces and the JSON-Schema emit.

5. The contract

Closed over six query types and a fixed field set, which is exactly why a schema is tractable here and was not for public Elasticsearch: the public DSL is open over an arbitrary mapping; this one is not.

(ns pocket-es.dsl
  (:require [malli.core :as m]
            [malli.error :as me]
            [malli.json-schema :as mjs]))

;; Closed field set. Provenance: indexer extracts #+TITLE/#+KEYWORDS/
;; #+DESCRIPTION plus headings and the tokenized body (the per-doc `terms`
;; frequency map); _all is the union field. There is no `body` field -- the
;; default free-text fan-out boosts ["title^3" "keywords^2" "description"
;; "headings^2" "terms"], so the closed set must match exactly.
(def +fields+ [:_all :title :keywords :description :headings :terms])

;; A query CONTAINER is exactly one clause. The single-key invariant is the
;; thing that made the public ES schema ugly with oneOf; here it is one :fn.
(def Query
  [:schema
   {:registry
    {::field (into [:enum] +fields+)   ; one closed set, defined once above
     ;; field with optional ^boost, e.g. "title^3"
     ::boost [:re #"^(_all|title|keywords|description|headings|terms)(\^\d+(?:\.\d+)?)?$"]
     ::query
     [:and
      [:map {:closed true}
       [:match       {:optional true} [:map-of ::field :string]]
       [:term        {:optional true} [:map-of ::field [:or :string [:vector :string]]]]
       [:prefix      {:optional true} [:map-of ::field :string]]
       [:match_all   {:optional true} [:map {:closed true}]]
       [:multi_match {:optional true}
        [:map {:closed true}
         [:query  :string]
         [:fields [:vector {:min 1} ::boost]]]]
       [:bool        {:optional true}
        [:and
         [:map {:closed true}
          [:must     {:optional true} [:vector [:ref ::query]]]
          [:should   {:optional true} [:vector [:ref ::query]]]
          [:filter   {:optional true} [:vector [:ref ::query]]]
          [:must_not {:optional true} [:vector [:ref ::query]]]]
         [:fn {:error/message "bool needs >= 1 of must/should/filter/must_not"}
          (fn [b] (pos? (count b)))]]]]
      [:fn {:error/message
            "a query is exactly one of: match term prefix match_all multi_match bool"}
       (fn [q] (= 1 (count q)))]]}}
   ::query])

(def SearchRequest
  [:map {:closed true}
   [:query Query]
   [:size {:optional true} [:int {:min 1 :max 100}]]
   [:from {:optional true} [:int {:min 0}]]])

(defn validate    [obj] (m/validate SearchRequest obj))
(defn explain     [obj] (some-> (m/explain SearchRequest obj) me/humanize))
(defn json-schema []    (mjs/transform SearchRequest))

Note what the IR cannot express: phrases. There is no phrase clause. Phrase search needs token positions, and the index stores only the top-50 terms with no positions, so a phrase query that returned hits would be a lie. Rejection here is structural, not a guard: you cannot type a phrase clause because no key admits one. The lucene parser (phase B) is the only place a human can write a quote, and that is where the parser must error rather than silently AND the terms.
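A reduced sketch of that structural rejection, assuming malli is on the classpath. It keeps only match and bool from the schema above; the namespace name and the two-field enum are illustrative, not the shipped contract.

```clojure
(ns contract-sketch
  (:require [malli.core :as m]
            [malli.error :as me]))

;; Reduced contract: two clause types, closed maps, one-clause invariant.
;; Illustrative only; the real schema is pocket-es.dsl/Query.
(def Query
  [:schema
   {:registry
    {::query
     [:and
      [:map {:closed true}
       [:match {:optional true} [:map-of [:enum :_all :title] :string]]
       [:bool  {:optional true}
        [:map {:closed true}
         [:should   {:optional true} [:vector [:ref ::query]]]
         [:must_not {:optional true} [:vector [:ref ::query]]]]]]
      [:fn {:error/message "a query is exactly one clause"}
       (fn [q] (= 1 (count q)))]]}}
   ::query])

;; One clause, known field: accepted.
(def ok?        (m/validate Query {:match {:_all "crdt"}}))

;; Two clause keys in one container: the :fn guard rejects it.
(def two-keys?  (m/validate Query {:match {:_all "crdt"} :bool {:should []}}))

;; A phrase clause: no key admits it, so the closed map rejects it.
(def phrase?    (m/validate Query {:match_phrase {:_all "conflict free"}}))

;; Humanized explanation for rendering in the box; nil when the value is valid.
(def phrase-errors
  (some-> (m/explain Query {:match_phrase {:_all "conflict free"}}) me/humanize))
```

Here ok? is true while two-keys? and phrase? are false. The phrase case fails without any phrase-specific code, which is the point: the closed key set does the rejecting.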

6. Where the gate lives

Three placements, with a real tradeoff:

placement          who is guarded            cost on free-text keystroke      note
-----------------  ------------------------  -------------------------------  -------------------------
UI boundary only   only what UI sends        none                             not a runtime invariant
JSON branch only   JSON + console-via-UI     none on free hot path            chosen
inside search()    everything incl. console  one validate + js->clj per call  revalidates known-good IR

Decision: validate in the JSON branch, not in search(). The reason is INV-2. Free mode constructs its IR (text -> a multi_match over the boosted field list), so it is valid by construction and cannot fail; a search()-level gate would re-validate that known-good IR on every debounced keystroke and force a js->clj round-trip to do it. The JSON branch is the only surface a human types raw IR into, so it is the only place the contract can actually fire.

Tradeoff named: the console path (pocketES.search(...)) stays unguarded. For a single user that is fine – it is a one-line move into es/search if console coverage ever earns it.

Implementation subtlety that bites if missed: es/search takes the JS object. Pass the parsed JSON straight through to search, and js->clj only for the malli check. Do not double-convert, and do not feed the keywordized CLJS map to search.

;; in the UI, branch on explicit @!mode, never on parseability
(if (str/blank? text)
  (clear-and-idle!)
  (case @!mode
    :json (run-structured! text)
    (run-free! text)))          ; :free default; :simple/:lucene compile here in phase B

(defn- run-free! [text]
  ;; total: constructs a valid IR, cannot trip the contract (INV-2)
  (let [req (js-obj "query" (js-obj "multi_match"
                                    (js-obj "query"  text
                                            "fields" #js ["title^3" "keywords^2"
                                                          "description" "headings^2" "terms"]))
                    "size" 200)]
    (render-hits! (es/search @!idx req))
    (render-suggestions! (es/suggest @!idx (js-obj "text" text "size" 6)))))

(defn- run-structured! [text]
  (let [parsed (try-parse-json text)]                 ; nil on malformed JSON
    (if (nil? parsed)
      (render-query-error! "not valid JSON")
      (let [errs (dsl/explain (js->clj parsed :keywordize-keys true))]  ; nil when valid
        (if errs
          (render-query-error! errs)                  ; humanized malli explain, in box
          (do (clear-query-error!)
              (render-hits! (es/search @!idx parsed))))))))  ; JS form, no re-convert

Note: structured mode deliberately does not inject size: 200. If you typed the IR, you said what you want; search() applies its own default when size is absent. The asymmetry with free mode (which always caps at 200) is correct.

The UI's job then narrows to: provide the structured surface, render explain output inline (render-query-error! / clear-query-error!), and URL-sync the mode atom (:free default, like the date filter; the case falls through to run-free! so phase-B :simple/:lucene slot in as new branches). Validation is the contract's job; visibility is the UI's.

7. JSON-Schema emit (byproduct)

The json-schema function above emits the artifact the earlier question wanted, one that does not exist for public ES (because public ES is open; this is closed). Author once in malli, emit on build, publish:

;; build-time, JVM/bb
(require '[clojure.data.json :as json] '[pocket-es.dsl :as dsl])
(spit "site/static/pocket-es/query.schema.json"
      (json/write-str (dsl/json-schema)))
;; -> wal.sh/research/pocket-es/query.schema.json

Single source of truth: the malli schema. JSON-Schema and (phase B) the parser postcondition both derive from it. No second hand-maintained representation to drift.

8. State machine delta

Current: nine actions, five atoms, URL sync. Adding structured input adds one atom and one sync target.

  • New atom mode: #{:free :json} now, #{:simple :lucene :json} at phase B.
  • URL-synced like the date filter, so a shared link carries its mode.
  • Mode switch does not auto-translate the query text. Auto-translation is the same silent-reinterpretation failure this whole thread exists to delete, moved up a level.
  • One lossless exception, offered not forced: :free -> :json can seed the box with the IR that free mode would construct for the current text, since free -> IR is total. The reverse (:json -> :free) is lossy and is not offered: arbitrary IR has no free-text preimage.
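The lossless direction can be sketched as a total function from text to IR. This is a minimal sketch: free->ir and free-fields are hypothetical names, the field list is copied from run-free! in section 6, and size is left out on the assumption that the seeded box follows structured mode's rule of not injecting one.

```clojure
(ns seed-sketch)

;; Boosted fan-out used by free mode (copied from run-free!).
(def free-fields ["title^3" "keywords^2" "description" "headings^2" "terms"])

(defn free->ir
  "Total: every string maps to exactly one IR, so seeding cannot fail."
  [text]
  {:query {:multi_match {:query text :fields free-fields}}})

(def seeded (free->ir "crdt merge"))
```

There is no ir->free counterpart in the sketch because none can be total: a bool with must_not has no free-text string that produces it.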

9. Phase B: radio buttons + lucene grammar (deferred)

When/if the infix itch is felt concretely. Three radio options, each a producer of the same IR:

  • simple: tokenize -> single match on _all. Total, cannot fail.
  • lucene: lex -> parse -> IR. The grammar's operator-free restriction is exactly simple at the ranking level (match already sums per-token BM25, so N should/match clauses rank identically to one match over N tokens). One parser, two configs.
  • json: the surface this spec ships now.
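The ranking equivalence claimed for simple and operator-free lucene is plain additivity. A toy sketch, with made-up per-token scores standing in for BM25 and the assumption that a bool sums its should clauses with no coordination or normalization factor:

```clojure
(ns rank-sketch)

;; Hypothetical per-token scores for one document; stand-ins for BM25.
(def token-score {"crdt" 1.5 "merge" 0.75 "draft" 0.25})

(defn match-score
  "One match clause: the sum of the per-token scores."
  [tokens]
  (reduce + (map #(get token-score % 0.0) tokens)))

(defn should-score
  "A bool of should clauses, assumed to sum its clause scores."
  [clauses]
  (reduce + (map match-score clauses)))

;; One match over two tokens vs. two should clauses of one token each.
(def same-rank?
  (= (match-score ["crdt" "merge"])
     (should-score [["crdt"] ["merge"]])))
```

If the executor ever normalizes or weights should clauses, the equality breaks and the two modes stop being one parser with two configs.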

Minimal honest grammar. NOT is the only operator that changes which documents return; AND/OR only reorder, and BM25 already orders. Ship the one that does work:

query  = term { term }          ; whitespace = OR, default accumulate (should)
term   = [ "NOT" ] atom
atom   = "(" query ")"          ; nested bool
       | field ":" word         ; -> match on field
       | word                   ; -> match on _all
       | '"' ... '"'            ; REJECT: no positional index
field  = "title" | "description" | "keywords" | "headings" | "terms" | "_all"
word   = /[^\s():"]+/

Parser postcondition: dsl/validate (parse s) holds for every s the lexer accepts. Any accepted string that produces invalid IR is a parser bug, surfaced by the contract rather than by noticing wrong results weeks later. Harness: a generative round-trip. malli.generator produces valid IRs, a printer renders each to its lucene string, re-parse, assert parse . print = id. This folds into the existing property-test habit (token_test, 300 iterations).

Implementation note: flat recursive descent, not instaparse. With only NOT prefix and parens there are no precedence levels to manage, ~80 lines, and instaparse would blow the 32KB-gzip budget for a grammar this small.
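A minimal sketch of that flat recursive descent, not the shipped parser. All names (lex, parse, parse-atom, parse-query, build) are hypothetical; the field set is the closed set from section 5; output is the keywordized IR. A lone positive term returns the bare clause rather than a one-element bool, which is a choice of this sketch.

```clojure
(ns lucene-sketch)

;; Closed field set, copied from section 5 (there is no body field).
(def fields #{"_all" "title" "keywords" "description" "headings" "terms"})

(defn lex
  "Split s into tokens: ( ) : \" or a run of word characters."
  [s]
  (re-seq #"[():\"]|[^\s():\"]+" s))

(defn- fail [msg tokens]
  (throw (ex-info msg {:remaining (vec tokens)})))

(defn- build
  "Fold positive and negated clauses into one IR node."
  [pos neg]
  (cond
    (and (empty? pos) (empty? neg))      (fail "empty query" [])
    (and (= 1 (count pos)) (empty? neg)) (first pos)   ; no bool wrapper needed
    :else {:bool (cond-> {}
                   (seq pos) (assoc :should pos)
                   (seq neg) (assoc :must_not neg))}))

(declare parse-query)

(defn- parse-atom
  "atom = ( query ) | field : word | word. Returns [clause remaining-tokens]."
  [[t & more :as tokens]]
  (cond
    (nil? t)       (fail "expected a term" tokens)
    (= t "\"")     (fail "phrase rejected: no positional index" tokens)
    (#{")" ":"} t) (fail (str "unexpected " t) tokens)
    (= t "(")      (let [[q [close & after]] (parse-query more)]
                     (if (= close ")") [q after] (fail "expected )" tokens)))
    (= (first more) ":")
    (let [[_ w & after] more]
      (cond
        (not (fields t))                 (fail (str "unknown field " t) tokens)
        (= w "\"")                       (fail "phrase rejected: no positional index" tokens)
        (or (nil? w) (#{"(" ")" ":"} w)) (fail "expected a word after :" tokens)
        :else                            [{:match {(keyword t) w}} after]))
    :else [{:match {:_all t}} more]))

(defn- parse-query
  "query = term { term }, term = [NOT] atom. Stops at ) or end of input."
  [tokens]
  (loop [tokens tokens, pos [], neg []]
    (if (or (empty? tokens) (= ")" (first tokens)))
      [(build pos neg) tokens]
      (let [negated?       (= "NOT" (first tokens))
            [clause after] (parse-atom (if negated? (rest tokens) tokens))]
        (if negated?
          (recur after pos (conj neg clause))
          (recur after (conj pos clause) neg))))))

(defn parse
  "Parse a lucene-style string into IR, or throw ex-info."
  [s]
  (let [[q remaining] (parse-query (lex s))]
    (if (seq remaining) (fail "unbalanced )" remaining) q)))
```

(parse "title:crdt merge NOT draft") yields a bool with two should clauses and one must_not; (parse "\"exact phrase\"") throws instead of silently matching both words. Every failure goes through ex-info, so the UI can render the message inline the same way it renders explain output.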

10. Invariants and refutation conditions

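  • INV-1 Every UI path into search() carries a malli-valid SearchRequest: free mode by construction (INV-2), JSON mode by the gate in run-structured!. The console path stays unguarded by decision (see section 6); moving the gate into search() would turn this into a runtime invariant.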
  • INV-2 Free/simple modes construct IR rather than parse it, so they cannot produce invalid IR and never trip the validator.
  • INV-3 A query container has exactly one clause key (the :fn count guard).
  • INV-4 (phase B) lucene parser output is a subset of malli-valid IR; the print/parse round-trip is identity on generated IRs.
  • Refutation, JSON-first bet: a recurring hand-typed bool blob means the text grammar should ship. Absence of that, plus phase B never landing, means the deferral was correct.
  • Cost to name: malli core against the bundle. Measure the gzip delta once. It is the only structured-validation dependency and there is no second user to optimize against, so the bar for accepting it is low; measure anyway rather than assume.

11. Sequencing

Full phased plan with test plans, invariants, test.check properties, and the Bombadil LTL check lives in query-surface-rollout. Summary:

  1. (Phase 1, shipped) Make the two options clear: a checkbox toggle, no new contract. Off = tokenized free-text; on = JSON. Default off. Toggle sits to the right of the cluster line so it costs no vertical space.
  2. Author pocket-es.dsl (malli). Validate in the JSON branch (run-structured!), render dsl/explain inline. Not a search() precondition – see section 6.
  3. Add mode atom + URL sync; swap the text input for a textarea in JSON mode; Tab-complete the closed vocabulary.
  4. Emit query.schema.json at build; publish (only once it has a consumer).
  5. (Deferred) Promote toggle to radio; add simple + lucene producers; add the print/parse round-trip generative test.

Phase 1 (step 1) is shipped. Steps 2–4 are days; step 5 waits on a felt need, not a calendar.