SIPs, N-grams, and Phrase Search for pocket-es
Table of Contents
- 1. Background
- 2. 1. Amazon Statistically Improbable Phrases
- 3. 2. Collocation Detection
- 4. 3. N-gram Indexing for BM25
- 5. 4. Phrase Search Without N-grams
- 6. 5. Application to pocket-es v2
- 7. 6. Information Theory Connections
- 8. 7. Practical Recommendations for pocket-es
- 9. 8. Summary Table
- 10. References
- 11. See also
1. Background
pocket-es v1 tokenizes to unigrams only. The indexer produces a
{term -> frequency} map per document, capped at the top 50 terms.
BM25 sums per-term scores. There is no phrase detection, no bigram
index, no position storage.
This report surveys the techniques needed to move toward v2: phrase search, auto-generated discriminative labels, and smarter keyword chips.
Corpus facts (as of 2026-06-02 build):
| Metric | Value |
|---|---|
| Documents | 628 |
| Vocabulary (unique unigrams) | 6,779 |
| Total token-doc pairs stored | 26,009 |
| Avg document length (tokens) | 355.8 |
| Avg unique terms per doc (stored) | 41.4 |
| Index raw size | 824 KB |
| Index gzipped | 255 KB |
2. 1. Amazon Statistically Improbable Phrases
2.1. What they are
Amazon introduced SIPs (Statistically Improbable Phrases) in 2004 as a feature on Amazon Books' "Search Inside" pages. For each book, Amazon displayed a list of distinctive multi-word phrases that appeared unusually often in that book relative to how often the same phrases appeared across the broader book corpus.
The feature surfaced "what makes this book unique" – not just what it is about, but what language it alone uses in that way. A phrase like "salting attack" might appear many times in a security textbook but never in Amazon's general corpus, making it highly discriminative.
2.2. The algorithm (reconstructed from Amazon's description and academic parallels)
Amazon did not publish the full algorithm, but the description on the SIPs feature page (archived) and academic work on the same problem allow a reconstruction:
- Candidate phrase extraction. Scan each book for noun phrases (NP chunks via POS tagging) or sliding n-gram windows (bigrams through 5-grams). Candidates must meet a minimum frequency threshold within the document (e.g., appears >= 3 times).
- Background corpus comparison. Compute document frequency in the target book vs. document frequency across the full Amazon book corpus (millions of books). A phrase that appears often in book B but rarely across all books is a SIP for book B.
Scoring. The Amazon formulation is approximately:
SIP score(phrase, doc) = freq_in_doc(phrase) / (freq_in_corpus(phrase) + smoothing)Equivalently: the ratio of observed-in-document to expected-from-corpus. This is the same quantity as the likelihood ratio underlying PMI.
- Ranking and display. Sort by SIP score descending, take top 5-10 per book, display as clickable links that search for other books containing the same phrase.
2.3. The patent
Amazon holds US Patent 7,565,363 ("Identifying specific phrases from a document") filed 2004, granted 2009. The patent describes a method for identifying phrases that occur more frequently in a document than in a background corpus, storing them as index entries, and surfacing them for navigation. The core claim is the ratio of within-document frequency to background corpus frequency, with candidate selection by POS-tag filtering (noun phrases preferred).
The broader mathematical idea – that a phrase's distinctiveness is measured by how much more often it appears in a document than chance predicts – is not patentable and predates Amazon; the patent covers the specific application to book search navigation.
2.4. Relationship to information theory
A SIP is a phrase whose presence in a document is surprising given what the language background would predict. In Shannon's terms, it is a high-surprisal event:
surprisal(phrase in doc) = -log2 P(phrase | background corpus)
The higher the surprisal, the more information the phrase carries about what makes this document distinctive. SIPs operationalize Shannon's insight that information content is inversely proportional to probability – applied at the phrase level rather than the bit level.
3. 2. Collocation Detection
A collocation is a pair (or tuple) of words that co-occur more frequently than their individual frequencies would predict by chance. "Machine learning" is a collocation; "machine repair" is not (the words co-occur by chance and carry no special joint meaning).
Four standard tests, ranked by suitability for a 628-document corpus:
3.1. 2.1 Pointwise Mutual Information (PMI)
PMI(w1, w2) = log2 [ P(w1, w2) / (P(w1) * P(w2)) ]
Where probabilities are estimated from document frequencies across the corpus:
P(w) = df(w) / N P(w1, w2) = df(w1 AND w2) / N
A worked example from the actual pocket-es corpus (N=628):
| Term | df | P |
|---|---|---|
| agent | 94 | 0.1497 |
| sandbox | 11 | 0.0175 |
| agent AND sandbox | 7 | 0.0111 |
PMI(agent, sandbox) = log2(0.0111 / (0.1497 * 0.0175))
= log2(4.235)
= 2.09 bits
NPMI(agent, sandbox) = PMI / -log2(P(w1,w2))
= 2.09 / 6.49
= 0.32 (range: -1 to +1)
An NPMI of 0.32 is moderate – meaningful but not tight coupling. For comparison, "functional programming" (high-frequency multi-word keyword already in the corpus, appearing 29 times) would yield much higher NPMI because both words concentrate in the same subset of docs.
PMI strengths: Simple, interpretable, fast to compute from document frequencies. No need to store positions.
PMI weaknesses: PMI is unreliable for low-frequency pairs. If df(w1, w2) = 1 and df(w1) = 1 and df(w2) = 1, PMI = log2(N) = 9.3, which is the maximum possible value and is also meaningless (just one document). This is the "PMI favors rare pairs" problem. Normalized PMI (NPMI) partially corrects this but does not eliminate it.
Threshold for this corpus: Require df(w1, w2) >= 3 before computing PMI. With 628 documents, a pair that appears in only 1 or 2 documents is likely noise.
3.2. 2.2 Dunning Log-Likelihood Ratio (G2)
The G2 test asks: does the observed co-occurrence pattern differ significantly from what independence would predict?
G2 = 2 * sum over {present, absent} x {w1, w2}:
O_ij * ln(O_ij / E_ij)
where O = observed counts, E = expected under independence
For agent + sandbox in the pocket-es corpus:
| sandbox present | sandbox absent | |
|---|---|---|
| agent present | 7 (k11) | 87 (k12) |
| agent absent | 4 (k21) | 530 (k22) |
G2(agent, sandbox) = 13.84 Chi-squared critical value at p=0.001, df=1: 10.83 Result: statistically significant collocation
G2 is more reliable than raw PMI for small corpora because it uses the full 2x2 contingency table and handles low-frequency pairs more gracefully. A G2 threshold of 10.83 (p < 0.001) is the standard cut-off in corpus linguistics.
Dunning (1993) is the canonical reference. The paper "Accurate methods for the statistics of surprise and coincidence" (Computational Linguistics 19(1)) established G2 as the preferred test for collocation detection over chi-squared (which assumes the normal approximation and breaks down for small expected counts).
3.3. 2.3 Chi-Squared Test
chi2 = N * (k11*k22 - k12*k21)^2
/ [(k11+k12)(k21+k22)(k11+k21)(k12+k22)]
Chi-squared is simpler to compute than G2 and numerically equivalent for large samples, but breaks down when any expected cell count falls below 5. In a 628-document corpus, many candidate bigrams will have expected counts below 5. G2 should be preferred.
3.4. 2.4 Which test to use for 628 documents?
Recommendation: Dunning G2 as the primary filter, NPMI as the ranking criterion.
- G2 >= 10.83 (p < 0.001) as the significance gate: this eliminates accidental co-occurrences.
- NPMI as the ranking score among significant pairs: this handles the "SIP-like" discrimination – how strongly do these two words attract each other?
- Minimum df constraint: require both words and the pair to appear in at least 3 documents (not just 1).
This three-part filter (significance, association strength, minimum frequency) is the standard recipe in corpus linguistics for small corpora. See Manning and Schutze, "Foundations of Statistical Natural Language Processing" (MIT Press, 1999), Chapter 5.
3.5. 2.5 Background corpus vs. within-corpus
The pocket-es corpus is domain-specific (technical research notes, conference summaries, weekly activity logs). Standard English PMI computed from the Google Books n-gram corpus or Wikipedia would identify "machine learning" as a collocation. But within this corpus, "machine learning" appears in 25 documents (keyword) and co-occurs with many other technical terms – its internal PMI may be lower than a more specialized phrase like "winding number" (6 documents) or "vector symbolic architecture" (1 document with high specificity).
For SIP extraction, the interesting comparison is:
- Within-corpus PMI: which phrases are distinctive to specific documents vs. the site as a whole?
- The site-level IDF already captures this for unigrams: terms like "agent" (IDF=1.896) are common; terms like "zettelkasten" (IDF=6.039) are rare. Extending this to bigrams is the direct path to SIPs.
4. 3. N-gram Indexing for BM25
4.1. 3.1 The standard approach: index bigrams alongside unigrams
The BM25 formula does not inherently care whether a "term" is a word or a phrase. A bigram can be treated as an atomic vocabulary entry:
token stream: ["agent", "sandbox", "architecture"]
bigrams: ["agent sandbox", "sandbox architecture"]
combined index: {"agent": 5, "sandbox": 2, "architecture": 3,
"agent sandbox": 2, "sandbox architecture": 1}
The IDF for bigrams is computed identically to unigrams: how many documents contain this bigram? The BM25 score is then the sum of per-term scores, where "term" may be a unigram or a bigram.
At query time, "agent sandbox" tokenizes to either:
- Two unigrams ["agent", "sandbox"] under current behavior, or
- One bigram ["agent sandbox"] if the client detects a quoted phrase, or
- Both, with the bigram getting a boost multiplier.
4.2. 3.2 Scoring bigrams higher than unigram sums
When a user searches "agent sandbox," matching the exact bigram should score higher than matching the two words in unrelated positions. Standard approaches:
- Separate bigram field with a boost. Store bigrams in a parallel
field, e.g.
bigrams: {"agent sandbox": 2}. During query processing, detect the phrase, score the bigram field with a boost (e.g. 2x), and add to the unigram score. - Proximity-biased scoring. Store token positions and add a proximity bonus when matched terms appear within a window of k tokens. This is how Elasticsearch phrase search works internally. It requires position storage.
- IDF-only bigram boost. A bigram has lower IDF than either unigram individually (fewer documents contain the exact pair). At the same tf, a lower-IDF term scores lower – which is the wrong direction. To compensate, multiply bigram IDF by a constant (e.g. 1.5-2x) so that an exact phrase match outscores two scattered unigrams.
For pocket-es, option 1 is the cleanest: add an optional bigrams
field alongside terms, cap at top 25 bigrams per document, and let
the query executor use it when it detects a multi-word query.
4.3. 3.3 Index size impact
Current state:
| Component | Count |
|---|---|
| Unigrams in IDF table | 6,779 |
| Term-doc entries | 26,009 |
| Index raw size | 824 KB |
Bigram estimates for this corpus:
With stopword-filtered token streams of approximately 120 content words per document on average (355 total tokens * ~34% retention after stopwords), adjacent bigrams per document are approximately 119. Across 628 documents, total bigram occurrences are approximately 74,700. Unique bigrams follow a Zipf distribution; empirically, bigram vocabulary tends to be 3-5x the unigram vocabulary, giving an estimate of 20,000-34,000 unique bigrams.
If the indexer stores only the top 25 bigrams per document (vs. top 50 unigrams currently), the additional entries are:
628 docs * 25 bigrams * ~20 bytes/entry = ~314 KB raw Gzipped bigrams (high redundancy): ~80-100 KB additional
Total projected index:
| Asset | Raw | Gzipped |
|---|---|---|
| Current index | 824 KB | 255 KB |
| + bigrams (top 25/doc) | +314 KB | +90 KB |
| Projected total | ~1.14 MB | ~345 KB |
The gzipped total stays well under 500 KB. The fetch budget is not breached for a single-file deployment.
4.4. 3.4 Scoring performance impact
Current BM25 scoring in the ClojureScript client is O(q * d) where q is query terms and d is documents. Adding bigrams to the query adds at most one term (for a two-word query). The marginal cost is one additional hash lookup per document per query. Immeasurable at 628 documents.
The IDF table grows from 6,779 to ~26,000 entries. JSON parse time for the larger table adds approximately 20-30ms on a cold load (estimated from JSON parse benchmarks at this scale). After the first load, the index lives in memory; subsequent queries are unaffected.
5. 4. Phrase Search Without N-grams
Three alternatives when you do not want to pay the bigram index cost:
5.1. 4.1 Position-based phrase matching
Store the position of each token within the document:
terms: {"agent": [12, 47, 203], "sandbox": [13, 204]}
A phrase query "agent sandbox" matches if there exist positions p_a and p_b where p_b = p_a + 1 (consecutive). This is how Lucene/ Elasticsearch implement phrase search internally.
Cost in pocket-es: The terms field currently stores only
frequencies. Adding positions replaces a {"term": count} map with
a {"term": [pos, pos, ...]} map. Position lists are unbounded and
verbose; for 628 docs with avg 355 tokens, a full position index would
be 5-10x larger than the current frequency map. This is incompatible
with the single-JSON-file architecture.
Verdict: Not practical for pocket-es without sharding the index.
5.2. 4.2 Shingle approach (overlapping bigrams as index entries)
A shingle is a hash of consecutive token pairs. Instead of storing the pair as a string, store it as an integer:
shingle("agent", "sandbox") = hash("agent" + "|" + "sandbox") mod 2^32
This is space-efficient (integer keys instead of string keys) and collision-resistant with a 32-bit hash over a vocabulary of 25,000 bigrams. The shingle approach is used in MinHash similarity search and some compact index formats.
For pocket-es, shingles would allow bigram phrase matching in ~500 bytes of additional index space (a 25-entry integer map per document). The tradeoff: the suggest corpus cannot be built from shingles (they are not human-readable).
Verdict: Viable for phrase matching but not for SIP extraction or keyword chip generation, which need the actual phrase strings.
5.3. 4.3 Post-filter approach (BM25 on unigrams, then proximity filter)
The cheapest implementation:
- Run BM25 on unigrams as today – fast, uses the existing index.
- For multi-word queries, post-filter the top-k results (e.g., top 50) by checking whether the original document text contains the phrase.
- The document text is not in the index – so the post-filter would require a second fetch (the .org file or a compressed body store).
Verdict: Architecturally awkward for a no-server, single-fetch design. The current architecture ships zero runtime fetches beyond the index JSON. A body store would double the wire cost.
The most practical variant: post-filter against the heading list and description, which are in the index. A query "agent sandbox" could verify that both words appear in the same heading string. This is not true phrase search but it catches multi-word topic headings without additional fetch cost.
6. 5. Application to pocket-es v2
6.1. 5.1 SIP extraction as a build-time step
The indexer already has access to the full token stream for each document and the global document frequency table (it computes IDF). Adding SIP extraction is a pipeline addition:
(defn extract-sips
"Extract statistically improbable bigrams for a document.
Returns top-n bigrams ranked by NPMI against the corpus.
Requires:
doc-tokens -- the full token sequence for this document
corpus-dfs -- {bigram -> df} across all documents
N -- total document count"
[doc-tokens corpus-dfs N n]
(let [bigrams (->> (partition 2 1 doc-tokens)
(map #(str/join " " %)))
doc-bfreqs (frequencies bigrams)
scored (for [[bg freq] doc-bfreqs
:let [df (get corpus-dfs bg 0)]
:when (>= df 3)] ; min df filter
(let [p-ab (/ df N)
[w1 w2] (str/split bg #" ")
p-a (/ (get corpus-dfs w1 1) N)
p-b (/ (get corpus-dfs w2 1) N)
pmi (/ (Math/log (/ p-ab (* p-a p-b)))
(Math/log 2))
npmi (/ pmi (- (/ (Math/log p-ab)
(Math/log 2))))]
{:bigram bg :npmi npmi :df df :doc-freq freq}))]
(->> scored
(filter #(> (:npmi %) 0.2)) ; NPMI threshold
(sort-by :npmi >)
(take n))))
This runs once at build time. The output is stored in each document's
entry as a sips field (top-10 bigrams ranked by NPMI):
{
"_id": "research/agent-sandbox",
"title": "Agent Sandbox Architectures",
"sips": ["agent sandbox", "capability token", "freebsd jail"],
...
}
6.2. 5.2 SIPs as auto-generated keyword chips
Currently, keyword chips in the pocket-es UI come from #+KEYWORDS
headers in the org files. These are manually maintained and the
lessons document notes that 21% of documents have zero keywords.
SIPs provide a fallback: when a document has fewer than N keywords, supplement with the top-K SIPs. SIPs are computed automatically, so no manual editing is required.
The distinction matters for search quality:
- Author keywords express intent: what the author thinks the document is about.
- SIPs express distinction: what makes this document different from others in the corpus.
A document on "Clojure concurrency" might carry the author keyword "clojure" (high corpus frequency, low IDF=1.979) but a SIP of "core.async channel" (low corpus frequency, high NPMI). The SIP is more useful as a chip because clicking it retrieves a much more targeted result set.
6.3. 5.3 Bigram indexing for phrase search
The minimal v2 bigram extension to the index schema:
{
"_cluster": { "version": 3, ... }, // bump schema version
"idf": { ... }, // unigrams only (unchanged)
"bigram_idf": { // new: bigram -> IDF
"agent sandbox": 4.21,
"freebsd jail": 5.18,
...
},
"docs": [
{
...existing fields...,
"bigrams": { // new: top-25 bigrams per doc
"agent sandbox": 3,
"freebsd jail": 1,
...
},
"sips": ["agent sandbox", ...] // new: top-10 SIPs
}
]
}
Query routing change: when the query string contains a space, the executor splits into unigrams (as before) and also constructs the phrase as a bigram query:
query: "agent sandbox"
-> unigram score: BM25("agent") + BM25("sandbox")
-> bigram score: BM25_bigram("agent sandbox") * BIGRAM_BOOST
-> total: unigram_score + bigram_score
BIGRAM_BOOST of 1.5-2.0 ensures that documents where "agent" and
"sandbox" appear adjacently score above documents where they appear
independently. The boost should be tuned against a small set of
known queries.
6.4. 5.4 Query surface integration
The current query DSL has an explicit invariant that phrase queries are not supported ("No phrases. There is no phrase clause; the index stores no positions"). A v2 bigram index changes this:
- Free-text mode: detect spaces in the query, add bigram matching automatically. No query DSL change visible to the user.
- JSON mode: add a new
phrasequery type alongsidematch. Aphrasequery requires the bigram index and falls back to unigram matching if the bigram is not in the index. - Suggest corpus: append SIPs to the suggest corpus. Autocomplete for "agent s" would now suggest "agent sandbox" in addition to individual words starting with "s".
6.5. 5.5 Incremental rollout path
The bigram index can be added to the existing JSON schema without breaking existing consumers:
- Add
bigram_idfand per-docbigramsandsipsto the indexer. Bump_cluster.versionto 3. - Old clients (version 2) ignore unknown fields. Existing behavior is unchanged.
- New clients detect version >= 3, enable bigram scoring and SIP chip display.
- Schema version check (already mandated by spec invariant 7) handles this gracefully.
This matches the existing versioning discipline: the spec says "a consumer that does not recognize the version refuses the index rather than mis-parsing it" – but since bigrams are additive (new fields, no changed semantics), version 3 is backward-compatible in practice.
7. 6. Information Theory Connections
7.1. 6.1 Surprisal, self-information, and SIPs
Shannon (1948) defined the self-information (surprisal) of an event:
I(x) = -log2 P(x)
A SIP is a phrase with high self-information relative to the background language model. When the background model predicts P(phrase) = 0.0001 but a document contains the phrase frequently, the surprisal of the phrase given the document is low – but the surprisal of the phrase given the background is high. The difference is the information the document provides about the phrase.
PMI is directly related to surprisal:
PMI(w1, w2) = log2 P(w1,w2) - log2 P(w1) - log2 P(w2)
= -I(w1,w2) + I(w1) + I(w2)
= reduction in surprisal from knowing the joint vs. individual
A high PMI means the two words together are less surprising than their individual frequencies would suggest – they carry mutual information about each other's presence.
7.2. 6.2 KL divergence and document-level discrimination
The distribution of terms in a document d versus the corpus as a
whole can be measured by KL divergence:
KL(P_d || P_corpus) = sum over terms t:
P_d(t) * log2(P_d(t) / P_corpus(t))
A term with high P_d(t) / P_corpus(t) contributes strongly to the
KL divergence – it is a term that distinguishes this document from
the average document. This is exactly what BM25's IDF factor captures
at the unigram level, and what SIP extraction captures at the phrase
level.
A document's top KL-divergence terms are its SIPs when extended to bigrams:
SIP score(bigram b, doc d) = P_d(b) * log2(P_d(b) / P_corpus(b))
This framing connects SIPs to the keyword vocabulary convergence work documented elsewhere in this repo: as the vocabulary stabilizes across the site, the within-document divergence from the corpus mean decreases. Documents with stable vocabularies have lower-surprisal terms; new documents with fresh terminology spike the KL divergence.
7.3. 6.3 Relationship to the existing IDF structure
The IDF values in the current pocket-es index are approximately:
IDF(t) = log(1 + (N - df(t) + 0.5) / (df(t) + 0.5))
This is the Robertson-Sparck Jones IDF, which is a smoothed version of self-information. A term with df=1 (appears in one document out of
- has:
IDF = log(1 + (628 - 1 + 0.5) / (1 + 0.5)) = log(1 + 418.3) = 6.04
Which matches the observed maximum IDF of 6.039 in the index.
Extending this to bigrams gives:
IDF_bigram("agent sandbox") = log(1 + (628 - 7 + 0.5) / (7 + 0.5)) = log(1 + 82.8) = 4.44
Higher than either individual term's IDF ("agent" IDF=1.896, "sandbox" IDF=3.2-ish), which is the correct behavior: the phrase is more discriminative than either word alone. The BM25 scoring formula applies unchanged; only the unit of analysis changes from word to phrase.
8. 7. Practical Recommendations for pocket-es
Ordered by implementation cost, ascending:
8.1. 7.1 SIP field in the indexer (low cost, high value)
Add SIP extraction to indexer.clj as a build-time computation.
Store the top-10 SIPs per document in the index. This requires:
- A two-pass over the document collection: first pass builds bigram document frequencies, second pass scores per-document SIPs.
- The indexer already does a similar two-pass for unigram IDF.
- No change to the query engine or client.
- No change to the JSON schema that breaks existing consumers.
Impact: SIPs become available as metadata for display (keyword chips for zero-keyword documents, "distinctive phrases" panel on result cards) and for future phrase querying.
Estimated effort: One session (the lexical analysis code is 20-30 lines of Clojure; the two-pass architecture mirrors the existing IDF computation).
8.2. 7.2 SIPs in the suggest corpus (low cost, query quality improvement)
Append the top-5 SIPs from each document to the suggest_corpus
list. Autocomplete for "agent s" would suggest "agent sandbox" in
addition to individual terms. This change is entirely in the indexer;
the client already handles multi-word suggest entries (the keyword
"functional programming" appears in the current suggest corpus).
Estimated effort: 5 lines of Clojure.
8.3. 7.3 Multi-word keyword phrase search (medium cost)
Currently, a keyword like "functional programming" is stored as an
atomic string in the keywords array. A term query for
{keywords: "functional programming"} matches it exactly. But a
match query for "functional programming" tokenizes the query to
["functional", "programming"] and searches the terms map (body
tokens), missing the keyword match.
The fix: when a keyword contains spaces, also index its component
words as additional body tokens, or tokenize keywords before
comparison in the match path. This is a tokenizer-level change to
the keyword handling in indexer.clj.
Estimated effort: 10-15 lines of Clojure.
8.4. 7.4 Bigram index (medium-high cost, meaningful phrase search)
Add bigram_idf and per-doc bigrams fields to the index. Extend
the query engine to detect multi-word free-text queries and apply
bigram scoring with a boost. Bump schema version to 3.
Index size impact: +~90 KB gzipped (from 255 KB to ~345 KB). Query latency impact: negligible at 628 documents. Cold load impact: +20-30ms for larger JSON parse.
Estimated effort: Two sessions (indexer change + client-side query routing change + property test updates).
8.5. 7.5 What not to build
- Position indexing: Too expensive in storage for a single-JSON-file architecture. The wire cost would exceed the fetch budget.
- External background corpus: Overkill for 628 documents. The site is its own domain; within-corpus discrimination is what matters.
- POS tagging: Adds a dependency (a POS tagger library) for marginal gain over sliding-window bigrams. The corpus is technical prose with high noun density; most interesting collocations are noun-noun or adjective-noun pairs that sliding bigrams will find anyway.
- Trigrams or higher: Diminishing returns and exponential growth in candidate space. Bigrams capture most meaningful phrases in technical text. 3-grams like "model context protocol" are better handled by storing them as normalized keywords.
9. 8. Summary Table
| Technique | Implementation cost | Value for 628 docs | Index size impact |
|---|---|---|---|
| SIP extraction (indexer only) | Low | High (auto-keywords for 21% zero-keyword docs) | Negligible |
| SIPs in suggest corpus | Trivial | Medium (better autocomplete) | None |
| Multi-word keyword phrase fix | Low | High (fixes silent search gap) | None |
| Bigram IDF + per-doc bigrams | Medium | High (phrase search) | +90KB gzipped |
| Position indexing | Very high | High | Prohibitive |
| External background corpus | High | Low (domain-specific corpus works without it) | Fetch cost |
10. References
- Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27(3), 379-423. The foundational paper on information theory; surprisal and mutual information are defined here.
- Manning, C.D. and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press. Chapters 5 (collocations) and 15 (information retrieval) are directly applicable. PMI, G2, and chi-squared for collocation detection are covered in Ch. 5.
- Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61-74. Establishes G2 as the preferred collocation test for small corpora; shows chi-squared fails at low expected counts.
- Robertson, S. and Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146. Original IDF formulation; the BM25 IDF used in pocket-es is the 1994 Robertson-Sparck Jones refinement.
- Robertson, S., Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389. Comprehensive treatment of BM25 variants.
- Church, K.W. and Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22-29. The canonical PMI paper for NLP applications.
- Turney, P.D. (2001). Mining the web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of ECML-2001. Shows PMI computed from web co-occurrence frequencies; relevant to the "background corpus" design question.
- Amazon.com (2004). "Statistically Improbable Phrases." Feature description page (archived). The original public description of the SIP algorithm.
- Amazon US Patent 7,565,363 (2009). "Identifying specific phrases from a document." Filed 2004; core claim is the document-vs-corpus frequency ratio for phrase discrimination.
- Salton, G. and McGill, M.J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill. Chapter 4 covers term weighting; the conceptual foundation for IDF-based scoring.
11. See also
- pocket-es – live search engine with query examples
- Conformance spec – index schema and invariants (note invariant 4: "No phrases")
- Building a Search Engine in One Session – lessons from v1 build
- Testing pocket-es from the Outside – verification methodology