Three PDFs open in the same session. Eighty pages each, dense. They cross-reference each other constantly — a table on page 47 of doc B is the result of an experiment described on page 12 of doc A, a footnote in doc C points at an appendix in doc A, the methods section of doc A is a one-line citation pointing at doc B.
A real person reading these has all three open and is hopping between them every thirty seconds.
I asked my own system, the one I've been wiring up for the past few months: where do these three disagree?
It returned four chunks. Two were from the same document. None were from the section that actually answers the question.
It didn't fail in the technical sense. The retrieval ran fine. Sub-second. All green lights. It failed in the sense that mattered — the user was no closer to an answer than if they'd opened the PDFs themselves and hit Cmd-F a few times.
That sent me down a rabbit hole. The bottom of it turned out not to be a new problem at all. Coding agents have been solving exactly this shape of pain for two years, quietly, while everyone else fixated on context-window benchmarks. The lessons transfer almost line-for-line.
This is also the layer above the one I wrote about last week. That post was about the parser — the Russian-doll problem of getting a 280 MB nested .docx into an addressable tree at all. This post assumes the tree exists. Three PDFs that disagree are the cross-document version of the same problem: within-doc nesting becomes between-doc cross-reference, but the shape is identical. Substrate first, then operations on the substrate. I built v0.1 of the substrate; what follows is the retrieval layer I'm now wiring on top of it.
The shape of "lost"
Forget research papers for a second. Look at your own computer.
Somewhere on that machine is a PDF you downloaded eleven months ago. You remember it had a chart in it that compared two things you now need to compare again. You don't remember the filename. You don't remember the folder. You half-remember it might've been an email attachment. You're not sure if it was a PDF or a Word doc. You've forgotten the title.
You open Spotlight or the Windows search bar. You type a word you think was in it. You get back forty-three results, most of them wrong, ranked in some order that is not the order of "useful to you right now." The actual file is item nineteen, but you give up at item six.
That is the entire problem in miniature. There is information on the device, you can describe what you're looking for in your own words, but the bridge between the description and the file does not exist. Filename matching is too brittle. Full-text grep returns too many hits and ranks them by some logic that has nothing to do with intent.
Now imagine an assistant living on that computer that you can actually ask in plain English. "Find me the chart that compared the two cloud providers' egress fees, I think it was a PDF, maybe in something I downloaded last summer."
For that to work, the assistant needs more than a search bar. It needs to know that the file at ~/Downloads/q3-vendor-comparison.pdf is related to the email it came from, which is related to the calendar event the email was about, which is related to the colleague whose name appears in the email signature. It needs to know that the chart on page 12 isn't just an image — it's the result of the table on page 11. It needs to know that "cloud provider" and "AWS, GCP" are the same idea. It needs to rank "files I'd actually want to see" above "files that contain the word 'cloud' somewhere."
That assistant exists in pieces inside Cursor, inside Claude Code, inside Aider. It exists at scale inside Glean and Sourcegraph and Perplexity. The mechanics that get them from "I have a vague description" to "here's the exact paragraph" are not magic, and they're not exotic. They are a stack of techniques layered carefully — and the stack is mostly the same regardless of whether the thing you're searching is a code repo, a stack of research PDFs, or your Downloads folder.
The rest of this post is what's in that stack, why each piece exists, and what happens when you skip any one of them.
Why bigger windows don't fix it
The tempting move, when you have a corpus that doesn't fit in memory, is to wait for memory to catch up. Gemini ships a million-token window. Claude ships two hundred thousand. Some experimental architectures claim ten million.1 Surely the answer is to wait a year and shove the whole corpus in.
The answer is not to wait a year and shove the whole corpus in.
There are three reasons this doesn't work, and all three are loud.
The first is a math problem. Transformer attention is quadratic in sequence length without architectural tricks, and even with those tricks the key-value cache for a million-token prefix is enormous.2 Independent benchmarks have shown that an optimized retrieval pipeline can be more than a thousand times cheaper per query than a long-context inference call doing the same job.3 At a thousand queries a day this is a rounding error. At a hundred thousand queries a day it is a budget-gating fact.
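To make the memory cost concrete, here's a back-of-envelope KV-cache estimate. Every architecture number below is illustrative, not any specific model's:

```python
# Back-of-envelope KV-cache size for a million-token prefix.
# Every architecture number here is illustrative, not a specific model's.
layers = 32          # transformer blocks
kv_heads = 8         # key/value heads (grouped-query attention)
head_dim = 128       # dimension per head
bytes_per = 2        # fp16/bf16
tokens = 1_000_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per   # keys + values, every layer
total_gib = per_token * tokens / 1024**3
print(f"{per_token / 1024:.0f} KiB per token, {total_gib:.0f} GiB for the prefix")
# -> 128 KiB per token, 122 GiB held in accelerator memory just to keep the prefix alive
```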
The second is a quality problem. Models given long contexts get worse, not better, at finding information buried in the middle of them. Liu et al. called it "lost in the middle" and the U-shaped attention curve they documented in 2023 has only partially closed since.4 Stuffing a million marginally relevant tokens into a prompt does not help the model find the right answer — it actively hurts by drowning the right answer in plausible-looking distractors.
The third is the one I find most underappreciated. A team building agentic coding tools recently named it the Navigation Paradox: as your context capacity grows, the bottleneck stops being capacity and becomes salience.5 The model has the bytes. It does not know which bytes matter. If file A imports file B which depends on file C, and only A is "obviously relevant" by token similarity, no amount of additional window helps the model understand that the bug is in C. Their controlled experiment on the FastAPI RealWorld application showed graph-augmented traversal hitting 99.4% on architectural-dependency tasks where naive long-context retrieval failed outright.5
Bigger window. Worse memory. More expensive. The whole frame is wrong.
The 25 in BM25 is literally an iteration counter. Stephen Robertson's group at City University London spent the 1980s trying probabilistic weighting schemes; this was the 25th. "BM" is "Best Matching." The retrieval system the algorithm ran inside was called Okapi, named after the African mammal. Forty years later, the algorithm is still the single hardest thing to beat as a baseline on keyword-heavy queries — you cannot embarrass yourself by including it.6
What replaces the wrong frame is something the field has started calling context engineering — the deliberate, multi-stage construction of the context that ends up in front of the model, instead of treating "the prompt" as a single bucket you fill.7 The rest of this post is the catalog of moves inside that frame.
Chunking is most of the game
If retrieval has a foundation, it's chunking. Whatever clever thing you do downstream — embeddings, graphs, reranking — operates on the chunks you produced. If the chunks are bad, nothing downstream rescues them.
The naive move is fixed-size chunking. Take the document, slice it every 512 tokens, store each slice. It's trivial to implement. It's also actively destructive: it slices through the middle of paragraphs, separates a subject from its predicate, severs a definition from its example. The chunk that contains the answer often does not contain the question's anchoring context — the named entity, the date, the units. Its embedding ends up in the wrong place in vector space, and the retriever can't find it.
Sliding-window chunking with overlap is the first improvement. Most pipelines use ten to twenty percent overlap, which is the empirical sweet spot — more is wasteful, less leaves seams. Recursive splitting, the default in LangChain's `RecursiveCharacterTextSplitter`, is the second: walk a hierarchy of separators (`"\n\n"`, then `"\n"`, then `". "`, then `" "`) and split at the coarsest one that fits. Cheap, surprisingly effective on prose.8
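In code, recursive splitting is nearly a one-liner. A minimal sketch with the `langchain-text-splitters` package (the path and sizes are arbitrary placeholders):

```python
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = Path("paper.txt").read_text()   # placeholder path

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # target size (characters by default)
    chunk_overlap=64,     # ~12%, inside the empirical 10-20% band
    separators=["\n\n", "\n", ". ", " ", ""],   # coarsest seam first
)
chunks = splitter.split_text(document_text)
```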
But the move that actually changes what's possible is structure-aware chunking. Use the document's own organization. For a research paper: chunk at section boundaries (Abstract, Introduction, Methods, Results, Discussion), keep tables and figures as their own units, preserve citation spans as their own indexable entities. For Markdown: chunk at heading levels. For code: chunk at function and class boundaries via tree-sitter.9 The thing that ties these together is the same insight — the document already told you where its seams are; the only sin is ignoring them.
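For Markdown the seams are literal characters, so structure-aware chunking needs no ML at all. A minimal sketch that splits at headings and tags each chunk with its heading trail:

```python
import re

def chunk_markdown(text: str):
    """Split Markdown at headings; each chunk carries its heading trail."""
    trail = {}                     # heading level -> current heading text
    chunks, buf, path = [], [], ()

    def flush():
        if buf:
            chunks.append({"path": " > ".join(path), "text": "\n".join(buf).strip()})
            buf.clear()

    for line in text.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()                                   # close the previous section
            level = len(m.group(1))
            trail[level] = m.group(2)
            for deeper in [k for k in trail if k > level]:
                del trail[deeper]                     # leaving a subtree resets it
            path = tuple(trail[k] for k in sorted(trail))
        buf.append(line)
    flush()
    return chunks
```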
There's a tension between two goals at the chunk level. Small chunks give you precision: an embedding of two sentences sits very specifically in vector space, easy to retrieve exactly. Large chunks give you context: when the LLM gets the chunk, it has enough surrounding material to actually reason. You want both. The trick is parent-child retrieval — embed small chunks for the search step, but at retrieval time return the large parent. Search with a tweezer, hand back a hammer. LlamaIndex calls this AutoMergingRetriever; Anthropic and others call it small-to-big.10
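A framework-free sketch of the same pattern; the `embed` helper is a toy stand-in for a real encoder, and the texts are invented:

```python
import numpy as np

def embed(texts):
    """Toy stand-in encoder: hashed bag-of-words, unit-normalized."""
    vecs = np.zeros((len(texts), 256))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            vecs[i, hash(w) % 256] += 1.0
    return vecs / np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9)

# Small children for precise search, one big parent per section for context.
parents = {"sec3": "Full text of Section 3: the supply chain strategy, its rollout, ..."}
children = [
    {"parent": "sec3", "text": "Profits increased by 14% due to the new supply chain strategy."},
    {"parent": "sec3", "text": "The rollout began in Q2 across the European subsidiaries."},
]
child_vecs = embed([c["text"] for c in children])

def retrieve(query: str, k: int = 3):
    q = embed([query])[0]
    order = np.argsort(child_vecs @ q)[::-1][:k]      # search the small chunks
    seen, out = set(), []
    for i in order:
        pid = children[i]["parent"]
        if pid not in seen:                           # several children share a parent
            seen.add(pid)
            out.append(parents[pid])                  # hand back the big parent
    return out

print(retrieve("why did profits go up?"))
```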
Then there's the failure mode that even good chunking doesn't solve. A chunk that says "profits increased by 14% due to the new supply chain strategy" is, in isolation, useless. Whose profits? Which quarter? Which strategy? The chunk has been semantically orphaned by being separated from its document.
There are two known fixes for this. Anthropic's contextual retrieval prepends a model-generated context summary to each chunk before embedding ("This chunk is from the Q2 2023 Acme earnings report and discusses revenue growth in Europe").11 Their reported numbers are striking — combined with a parallel BM25 index over the same contextualized chunks and a reranker, retrieval failures drop by sixty-seven percent.
Jina AI's late chunking attacks the same problem from the other side: instead of asking a model to write context summaries, embed the entire document with a long-context encoder first, then split the contextualized token embeddings and pool each chunk.12 Each chunk's vector is already conditioned on the whole document. Same goal, different mechanism, far cheaper at index time.
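A sketch of the mechanism with Hugging Face transformers. The model name is one published long-context encoder; treat this as an illustration of the idea rather than Jina's exact implementation, and note it assumes the document fits in the encoder's window:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "jinaai/jina-embeddings-v2-small-en"      # one published long-context encoder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

def late_chunk(document: str, spans: list[tuple[int, int]]) -> torch.Tensor:
    """spans: (char_start, char_end) chunk boundaries in the document."""
    enc = tok(document, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0]        # token index -> (char_start, char_end)
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]   # contextualized on the WHOLE doc

    vectors = []
    for start, end in spans:
        # pool only the tokens whose characters fall inside this chunk
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end) & (offsets[:, 1] > 0)
        vectors.append(token_embs[mask].mean(dim=0))
    return torch.stack(vectors)
```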
Anthropic's reported numbers are cumulative: each technique stacked on the one before it, evaluated on their eval set and normalized so that pure embedding retrieval = 100. Each step buys real ground. None of them are exotic.
Hybrid retrieval, and the boring math of fusion
If you only remember one thing from this post, make it this: dense vector search and lexical keyword search fail on different queries, so you run both and merge.
Dense embeddings — the things you get from a sentence transformer or text-embedding-3-large — are good at semantic intent. "How do I authenticate users" finds a chunk titled "JWT handler" because the embedding model has learned that those mean similar things. Dense embeddings are bad at proper nouns, exact identifiers, error codes, and rare terms. They will gleefully return three near-duplicates of something semantically adjacent to your query while ignoring the one chunk that contains the literal string you're looking for. And they will do it without ever returning zero results — vector similarity is a continuous distance, the index always has something to give back, even if the something is wrong.
BM25 is the opposite. It will not find a synonym to save its life. But search for "AKS error E0451" and BM25 finds the line in the changelog that mentions exactly that. It also fails loudly — if the term isn't there, the result list is empty or short, and you know.
| Retrieval modality | What it gets right | What it misses |
|---|---|---|
| Sparse lexical (BM25) | Exact terminology, proper nouns, IDs, version numbers, error strings | Synonyms, paraphrase, cross-language, conceptual queries |
| Dense vector (bi-encoder) | Intent, synonymy, conceptual similarity, paraphrase | Rare terms, IDs, exact strings; fails silently |
| Hybrid (RRF) | Both of the above | Twice the indexing complexity |
| Multi-vector late interaction (ColBERT) | Token-level matching, near cross-encoder accuracy | Storage explodes — one vector per token |
Running both is the cheap part. Merging the two ranked lists is the part that bites.
The merge is harder than it looks because BM25 scores and cosine similarities are not on the same scale. A BM25 score of 14.3 means nothing relative to a cosine of 0.78. You can't just add them. Min-max normalization is fragile and outlier-sensitive. The best-known answer is Reciprocal Rank Fusion — Cormack, Clarke, and Grossman, CIKM 2009 — which sidesteps the score problem entirely by only using rank.13
The formula is one line:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $k$ is typically 60 and the sum runs over all the retrievers $R$ in your ensemble. A document gets a high RRF score if it appears near the top of multiple ranked lists; a document only one retriever loves gets penalized. RRF works for two retrievers, four retrievers, or a vector index plus BM25 plus a graph traversal plus a manual filter — it scales to any number of input lists. It's parameter-light, hard to break, and Elastic, OpenSearch, Weaviate, Qdrant, Milvus, pgvector, Vespa, and Chroma all ship native RRF support.14
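Because the fusion consumes only ranks, the implementation is a direct transcription of the formula. A minimal sketch (the document IDs and list contents are invented):

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion over any number of ranked result lists.

    ranked_lists: lists of document IDs, best-first.
    Returns document IDs sorted by fused score, best-first.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document near the top of both lists beats one that a single retriever loves.
bm25_hits = ["d3", "d1", "d7", "d2"]
dense_hits = ["d1", "d4", "d3", "d9"]
print(rrf_fuse([bm25_hits, dense_hits]))   # d1 and d3 rise to the top
```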
If you have labeled query-document pairs, a learned convex combination beats RRF.15 Almost no team has that data in production. RRF wins by default.
Reranking — the cheapest large win
After hybrid retrieval, you have maybe a hundred or two hundred candidate chunks ranked by their fused score. The naive next move is to take the top five or ten and feed them to the LLM. Don't.
A bi-encoder, the kind of model that produced your embeddings, has a hard architectural ceiling. It encodes the query and the document independently into single vectors, then scores them with a dot product. There is no opportunity for the model to look at both at once. Any nuance that requires comparing query tokens to document tokens — "is the word 'authentication' in the query referring to the paragraph about auth tokens or the one about TLS handshakes?" — is lost forever.
A cross-encoder fixes this by feeding [query; document] jointly through a transformer and producing a single relevance score. The attention layer can do the comparison the bi-encoder couldn't. The catch is cost: cross-encoders are quadratic in the combined input length, so you can't run them over millions of documents. You can run them over the top hundred from a cheap first stage. That's the standard pattern.
The empirical lift is usually five to thirty NDCG@10 points on standard benchmarks. More on out-of-domain queries.16 It is, dollar for dollar, the highest-ROI addition to a retrieval pipeline.
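Wiring one in takes a few lines with the `sentence-transformers` library; the model name is one small public example, and the candidates stand in for the hybrid stage's top hundred:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # small public example

query = "how do we authenticate users?"
candidates = [                       # invented stand-ins for the first-stage results
    "JWT handler: issuing and verifying signed tokens ...",
    "TLS handshake configuration for the ingress proxy ...",
    "Session cookies and CSRF protection ...",
]

# The cross-encoder sees query and document together, which is the whole point.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```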
The model that's most interesting in this neighborhood is ColBERT. Khattab and Zaharia, 2020.17 Its pitch is that you can get most of the cross-encoder's quality at most of the bi-encoder's speed if you're clever about how you store the embeddings. Instead of one vector per chunk, ColBERT stores one vector per token. At query time it computes, for each query token, the maximum cosine over all document tokens, and sums those maxes — this is the "MaxSim" operation. The intuition: each token in the query gets to find its best match anywhere in the document, and the document's score is how well its best matches add up.
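MaxSim itself is small enough to write out. A numpy sketch, assuming the per-token vectors are already unit-normalized:

```python
import numpy as np

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction score.

    query_vecs: (num_query_tokens, dim), unit-normalized
    doc_vecs:   (num_doc_tokens, dim),   unit-normalized
    """
    sim = query_vecs @ doc_vecs.T        # cosine of every query token vs every doc token
    return float(sim.max(axis=1).sum())  # each query token keeps its best match, summed

# Ranking is then just:
# sorted(docs, key=lambda d: maxsim(q_vecs, d.token_vecs), reverse=True)
```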
The numbers are real. A recent biomedical retrieval paper paired ModernBERT for first-stage candidate generation with ColBERTv2 for fine-grained reranking and reported state-of-the-art accuracy on MIRAGE (0.4448 average across five tasks), beating MedCPT, the previous best on the benchmark.18
ColBERT's storage cost is its weakness — one vector per token blows up the index — but recent variants like ColBERTv2 and JinaColBERT-v2 use residual compression that gets the storage back down to a manageable size. There's also ColPali, which encodes entire PDF page images via a vision-language model and runs late interaction over image patches. For research papers with figures, tables, multi-column layouts, and other things OCR mangles, ColPali is the strongest known approach — it skips OCR entirely and embeds the page as a picture.19
OpenAlex is named after the Library of Alexandria. Built by OurResearch — the same nonprofit behind Unpaywall — within months of Microsoft Academic Graph being discontinued in 2022. As of 2024 it serves around 115 million API queries per month, free, polite-pool throttling only, no key required for most endpoints. If you're building anything that touches academic literature, OpenAlex is the canonical entrypoint. Its RDF dump (SemOpenAlex) is 26 billion triples.20
From strings to things — knowledge graphs
Hybrid retrieval plus reranking handles a lot. It does not handle everything. There's a class of queries that is fundamentally outside what chunk-based retrieval can do, and you only notice once a user asks one.
"What are the main themes across these documents?"
There is no chunk that contains "the main themes." The themes only exist as a property of the entire corpus. Vector search will pick five chunks that contain the word "theme" or its synonyms and serve them up; none of them will be an answer. The query is asking for global synthesis, and chunk retrieval is fundamentally local.
"Which authors that cite Smith 2019 have also collaborated with Tao?"
There is no chunk that answers this. The answer requires walking edges in a citation graph and an authorship graph and intersecting two sets. No amount of embedding similarity finds it.
"Where do these three documents disagree on X?" — the question I started this post with.
This is multi-hop. It needs to find the position each document takes on X (three retrievals), align them on the same dimension, and detect the disagreement. The retriever can find the relevant passages, but the synthesis step needs structure that pure text doesn't provide.
Microsoft's GraphRAG, published in mid-2024, is the canonical answer to this class.21 The pipeline is offline-heavy: chunk the corpus, run LLM entity and relationship extraction over every chunk, assemble the extractions into a knowledge graph, detect communities with Leiden, and have an LLM write a summary for each community. Everything up to the summaries runs once per ingest; only the query-side steps (community map-reduce for global questions, local neighborhood traversal for entity questions) run per query. And the offline part is genuinely expensive. On the MultiHop-RAG benchmark, building a baseline RAG index took 135 seconds. Building a full knowledge-graph GraphRAG index took 7,702 seconds, fifty-seven times more.22
But once it's built, the right query types fly.
Community-GraphRAG, the lighter variant that only goes as far as community summaries, has lower retrieval latency than the baseline because it doesn't have to scan every chunk — it scans a small set of dense community summaries. The full KG version is slower at retrieval because it does graph traversal, but it can answer queries the baseline cannot answer at all.
The variants matter. The vanilla GraphRAG paper is one design point in a family that has expanded fast since:
| Variant | Index cost | Update cost | What it adds |
|---|---|---|---|
| GraphRAG (Microsoft) | High — entity extraction + Leiden + community summaries | Full rebuild | Global queries via community map-reduce |
| LightRAG | Medium | Incremental — new docs merge into existing graph | Dual-level keys (low + high), no community summaries |
| LazyGraphRAG | Very low — defers LLM summarization to query time | Trivial | About 0.1% of full GraphRAG indexing cost, most of the quality |
| Triplex (extractor) | N/A — replaces the LLM extractor | N/A | Phi-3-3.8B fine-tuned for triple extraction; reportedly outperforms GPT-4o at ~1/60 cost |
For a use case where the corpus grows daily — research papers, internal docs, anything that isn't a one-time snapshot — LightRAG's incremental update is decisive. Full GraphRAG requires re-running community detection, which is expensive enough to make the system feel like a batch ETL pipeline rather than a live product.
There's also a real technical wrinkle in GraphRAG that didn't make the original paper but is now well-documented: the Leiden hierarchy isn't reproducible. On sparse knowledge graphs, modularity optimization admits exponentially many near-optimal partitions, so two runs with different seeds give different community structures.23 If you care about result stability — say, a customer asks "what changed?" between two ingest runs — pure Leiden is going to lie to you. Recent work has proposed k-core decomposition as a deterministic alternative.
The Leiden algorithm was invented by bibliometricians at Leiden University. Vincent Traag, Ludo Waltman, and Nees Jan van Eck wrote it specifically because Louvain — the previous community-detection standard — was producing internally disconnected communities in citation networks. So the algorithm at the heart of GraphRAG was designed by people who study how academic papers reference each other. Convenient, given that GraphRAG is now most useful for exactly that kind of corpus.24
Hierarchy isn't optional
Knowledge graphs handle entity relationships. They don't handle document hierarchy as well as a different family of techniques does.
A long document is a tree. A book has chapters which have sections which have subsections which have paragraphs. A research paper has Abstract → Introduction → Methods → Results → Discussion. A code repo has packages → modules → classes → methods. A legal contract has parts → articles → clauses → sub-clauses. Throwing the entire tree away and embedding a flat list of leaves loses an enormous amount of signal.
RAPTOR is the cleanest answer here.25 Sarthi et al., ICLR 2024. The construction is recursive:
Embed leaf chunks. Cluster them — RAPTOR uses UMAP for dimensionality reduction and Gaussian Mixture Models for soft clustering, so a single chunk can belong to multiple thematic clusters. Generate an abstractive summary for each cluster. Embed the summaries. Cluster those. Recurse until you have a root.
At query time, the move is counterintuitive but right: don't traverse top-down. Flatten every node from every level into a single index and run plain vector search across all of them. A broad thematic query naturally lands on high-level summary nodes; a specific factoid query naturally lands on leaves. The math takes care of selecting the right level of abstraction.
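For concreteness, here is one level of the construction, sketched under loud assumptions: the `summarize` helper is a stand-in for an LLM call, and the paper's UMAP dimensionality-reduction step is elided:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def summarize(texts: list[str]) -> str:
    """Stand-in: the real system makes an abstractive LLM summary here."""
    return " ".join(t[:80] for t in texts)

def build_level(chunk_texts, chunk_vecs, n_clusters: int, threshold: float = 0.3):
    """One RAPTOR level: soft-cluster the chunks, summarize each cluster.

    Soft assignment means a chunk can exceed the threshold in several
    clusters and contribute to several summaries. Embed the returned
    summaries and recurse until a single root node remains.
    """
    gmm = GaussianMixture(n_components=n_clusters, random_state=0).fit(chunk_vecs)
    probs = gmm.predict_proba(chunk_vecs)          # (num_chunks, n_clusters)

    summaries = []
    for c in range(n_clusters):
        members = [t for t, p in zip(chunk_texts, probs[:, c]) if p > threshold]
        if members:
            summaries.append(summarize(members))
    return summaries
```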
Coupled with GPT-4, RAPTOR retrieval improved state-of-the-art on the QuALITY long-document reasoning benchmark by 20 absolute points.26 That is a big number for a retrieval-only change.
There's a simpler related idea that doesn't need any LLM-generated summaries: parent-document retrieval, which I mentioned earlier. Embed leaves, retrieve leaves, return parents. RAPTOR's contribution over parent-document retrieval is the recursive abstractive summarization step — instead of returning the section that contains the matched paragraph, you can return the chapter-level summary that the paragraph is one piece of, and the LLM gets a much higher-information context for the same token budget.
What the coding agents already figured out
This is the section where I bring the post back to where it started. It's also the section last week's post pointed at and didn't unpack: the gesture at "coding tools made this jump years ago," finally cashed out in detail.
Coding agents have been the testbed for industrial-strength retrieval over deeply nested, heavily cross-referenced data for the last two years. The thing they handle — a million-line monorepo with thousands of files, each importing dozens of others, with symbol references threading through everything — is, structurally, the hardest version of the problem. And because the products are competitive and the users are technical, the solutions have evolved in public.
Five tools, five distinct approaches that converge on the same shape of stack:
| Tool | Index built from | Embeddings | Symbol/structure | Storage |
|---|---|---|---|---|
| Aider | tree-sitter tags.scm queries | None at first; later added | Personalized PageRank over reference graph | In-memory per session |
| Cursor | AST-aware chunking | Proprietary + OpenAI | Merkle tree of file hashes | Turbopuffer (object storage) |
| Sourcegraph Cody | SCIP indexes per language | Embeddings + BM25 + symbol search | Compiler-accurate via SCIP | Server-side, multi-repo |
| Continue | Configurable | Open or local | tree-sitter | LanceDB (local) |
| Claude Code | (agentic) | (none required) | grep, find, fd, tree-sitter via tools | None — agent navigates the FS live |
The interesting thing isn't any individual choice. It's that all five of them, despite very different product surfaces, end up with some combination of the same four ingredients: structural parsing of the code, embeddings for intent, lexical search for exact identifiers, and a graph layer for relationships between symbols.
Aider is the simplest version and the most instructive.27 It runs tree-sitter on every file to extract definitions and references — function signatures, class names, who calls whom. That gives it a directed graph: file A references file B if A uses a symbol defined in B. It then runs personalized PageRank over this graph, with the personalization vector biased toward whatever files are currently in the chat. PageRank finds the structurally important nodes; personalization makes "important" mean "important to what we're doing right now." The output is a token-budgeted (default 1k tokens) tree of the most relevant symbol signatures, sent with every request. No embeddings necessary, at least in the original. The hierarchy — repo → package → module → file → class → function — is fully respected because tree-sitter gives it back for free.
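The graph-ranking step is reproducible with networkx in a few lines. The file names and edges here are invented, and this is the shape of Aider's move rather than its actual code:

```python
import networkx as nx

# A -> B means "A references a symbol defined in B" (files invented).
G = nx.DiGraph()
G.add_edges_from([
    ("app.py", "auth.py"), ("app.py", "db.py"),
    ("auth.py", "tokens.py"), ("db.py", "tokens.py"),
])

chat_files = {"app.py"}   # what the user is editing right now
personalization = {n: (1.0 if n in chat_files else 0.0) for n in G}
ranks = nx.pagerank(G, personalization=personalization)

# Highest-ranked files get their symbol signatures into the repo map,
# until the token budget runs out.
for f, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f, round(score, 3))
```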
Tree-sitter was built at GitHub by Max Brunsfeld for the Atom editor. It's an incremental, error-recovering parser generator — meaning you can re-parse a 10,000-line file after a single keystroke in microseconds, and the parser doesn't choke on syntactically invalid code. That property is the entire reason it's usable for live IDE features and for the constantly-changing-repo indexing that Aider, Cursor, and friends rely on. It's also why every code retrieval system in the table above uses it as the chunker, regardless of what they do downstream.28
Cursor goes the embedding route, with a twist that solves the cross-user efficiency problem.29 Every file's content gets hashed; the hashes are arranged into a Merkle tree synced to the server. When something changes, only the affected nodes re-embed — that's the incremental indexing story. The clever bit is that they then content-address the embeddings: two users who clone the same repo (and clones average 92% similarity, per Cursor's own numbers) hit the same embedding cache. The actual storage is Turbopuffer, a serverless vector-plus-full-text engine that runs on object storage. Path obfuscation is applied client-side for privacy. The retriever is hybrid — embeddings for intent, full-text for identifiers — and the routing is split by @ mention (@codebase, @files, @docs, @web).
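The Merkle idea is worth sketching because it's so little code. This is the shape of the trick, not Cursor's implementation: a node's hash covers its whole subtree, so a diff can skip identical subtrees wholesale:

```python
import hashlib
from pathlib import Path

def merkle(path: Path) -> dict:
    """Hash a file tree; a node's hash covers its entire subtree."""
    if path.is_file():
        return {"hash": hashlib.sha256(path.read_bytes()).hexdigest(), "children": {}}
    children = {p.name: merkle(p) for p in sorted(path.iterdir())}
    combined = "".join(c["hash"] for c in children.values()).encode()
    return {"hash": hashlib.sha256(combined).hexdigest(), "children": children}

def changed_files(old: dict, new: dict, prefix: str = ""):
    """Yield paths that need re-embedding; skip identical subtrees wholesale."""
    if old["hash"] == new["hash"]:
        return                        # untouched subtree: nothing re-embeds
    if not new["children"]:
        yield prefix                  # a file whose bytes changed
    for name, child in new["children"].items():
        stale = old["children"].get(name, {"hash": "", "children": {}})
        yield from changed_files(stale, child, f"{prefix}/{name}")
```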
Sourcegraph Cody goes the deepest on structure.30 They've built and now standardized SCIP, the Source Code Intelligence Protocol — a Protobuf format emitted by language-specific indexers (scip-typescript, scip-java, scip-python, scip-rust, and so on). SCIP gives you compiler-accurate go-to-definition, find-references, and cross-repo symbol navigation. It replaced an older format called LSIF, which used opaque numeric IDs and was painful to debug; SCIP uses human-readable string IDs that survive routine refactoring. Cody combines SCIP graph queries, embeddings, and full-text search, and supports multi-repo with up to a million-token context window when needed.
SCIP is pronounced "skip" and is a deliberate nod to Structure and Interpretation of Computer Programs (SICP). Sourcegraph chose the recursive-acronym name partly because it's funnier than LSIF and partly because the new format genuinely "skips" the opaque-ID problems of the old one. The Protobuf schema centered on stable, human-readable string IDs is the actual technical win — SCIP indexers are roughly ten times easier to write than LSIF indexers were, which is why every language has one now.31
Claude Code takes the most provocative position: skip the index entirely. It ships with file-system tools — grep, find, fd, plus tree-sitter-backed symbol lookups — and lets the agent navigate the repo at query time. No precomputed embeddings, no graph index, no offline build. The bet is that with strong enough models and good enough tools, an agent can find what it needs in real time. On the SWE-bench Verified benchmark this approach reaches a 74.4% resolution rate.32 Whether this scales to enterprise monorepos at acceptable latency is an open question — but the lesson it teaches is that the agent loop can substitute for a precomputed index, given strong enough primitives.
The pattern across all five: structural parsing is non-negotiable. Embeddings alone don't work. Lexical search alone doesn't work. The graph of relationships matters as much as the text. Reranking, when used, pays for itself in points-per-dollar more than almost anything else in the stack. The hierarchy is preserved end to end.
That's the playbook.
The stack, and why it generalizes
Here's the thing the coding-agent crowd figured out that I think is underappreciated outside of code: this stack is not specific to code.
Research papers have citation graphs. That's the same shape as a call graph. A paper "calls" another paper by citing it. Authors collaborate, which is the same shape as a module-import relationship. Concepts and methods recur across papers, which is the same shape as a symbol referenced from multiple files. The graph schema for an academic corpus is structurally identical to the graph schema for a codebase, with different node and edge labels.
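Spelled out as data (every label below is illustrative, not a fixed ontology):

```python
# The same property-graph schema, instantiated twice.
# Every label here is illustrative; real ontologies grow more node and edge types.
CODEBASE = {
    "nodes": ["Repo", "Package", "Module", "File", "Class", "Function"],
    "edges": [
        ("File", "imports", "File"),
        ("Function", "calls", "Function"),
        ("File", "references_symbol_in", "File"),
    ],
}
ACADEMIC_CORPUS = {
    "nodes": ["Corpus", "Paper", "Section", "Table", "Claim", "Author", "Concept"],
    "edges": [
        ("Paper", "cites", "Paper"),                # same shape as imports
        ("Author", "collaborates_with", "Author"),  # same shape as module coupling
        ("Section", "mentions", "Concept"),         # same shape as a symbol reference
    ],
}
```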
Legal documents are even more rigid — every clause has a stable ID, every reference is explicit, and the hierarchy (Part → Article → Section → Clause) is fixed. Technical PDFs have figures, tables, captions, and references that map cleanly to a set of node types. SEC filings, medical records, regulatory submissions — all of them have the same underlying topology: nested hierarchical units with explicit cross-references and entities that recur.
Which means the same stack works, end to end: section-aware chunking on the way in, late chunking or contextual embeddings to preserve document context in the vectors, hybrid first-stage retrieval with BM25 plus dense, RRF fusion, cross-encoder reranking, graph expansion for multi-hop, and provenance metadata threaded through the whole pipeline so every span in the answer can be traced back to its source. That's the shape of an integrated stack.
The local-versus-enterprise question maps onto this directly. A small corpus (under a couple hundred thousand tokens, low update rate, latency-critical) can skip most of the index work and use cache-augmented generation — preload the corpus into the prompt once, persist the KV cache, only the question varies per query.33 No retrieval errors, low latency, simple. The moment your corpus exceeds context, or updates frequently, or has security requirements that require per-user filtering, you fall through to the full retrieval stack. Most production systems end up with a hybrid: prompt caching for the stable system prompt and document prefix, retrieval for the dynamic tail.
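The mechanics, shown with Anthropic's prompt-caching API as one concrete instance. The file path and model string are placeholders; the pattern (a stable cached prefix, a short varying tail) is the point:

```python
from pathlib import Path
import anthropic

client = anthropic.Anthropic()                # reads ANTHROPIC_API_KEY from the env
corpus = Path("corpus/all_three_papers.txt").read_text()   # placeholder path

# The corpus block is marked cacheable. Across queries only the user turn
# changes, so the provider can reuse the KV cache for the whole prefix.
answer = client.messages.create(
    model="claude-3-5-sonnet-20241022",       # placeholder; pick a current model
    max_tokens=1024,
    system=[
        {"type": "text", "text": "Answer strictly from the corpus below."},
        {"type": "text", "text": corpus, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Where do these three papers disagree?"}],
)
print(answer.content[0].text)
```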
Enterprise concerns layer cleanly on top. Knowledge graphs have a quiet advantage that vector indices don't — because the graph is symbolic, you can attach access control to nodes and edges, and the retrieval traversal can be filtered by user role at query time without re-embedding anything.34 Vector indices need parallel metadata filters, which most production systems implement, but the security model is bolted on rather than native. Same applies to GDPR-style deletion — graphs let you remove or anonymize specific nodes; vector indices need explicit support for deleting and re-inserting individual vectors, which most modern ones now have but which used to be a real pain point.
And on the operational side, semantic caching is the quietly-massive win. Store the embedding of every previous query alongside the response. When a new query comes in, embed it, and if it's within ~0.2 cosine distance of a cached query, serve the cached response. In production document Q&A systems, this drops average latency from ~6.5 seconds to ~100 milliseconds — a 65× speedup on cache hits — and at a 30-50% hit rate, the inference-cost savings are substantial.35
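The whole mechanism fits in a screenful. A sketch with stand-in encoder and pipeline functions, using the ~0.2 cosine-distance threshold from above:

```python
import numpy as np

cache = []  # (query_vector, response) pairs from earlier turns

def embed_query(q: str) -> np.ndarray:
    """Toy stand-in encoder; use a real embedding model in practice."""
    v = np.zeros(64)
    for w in q.lower().split():
        v[hash(w) % 64] += 1.0
    return v / max(np.linalg.norm(v), 1e-9)

def run_full_pipeline(query: str) -> str:
    """Stand-in for retrieval + rerank + generation (the ~6.5 s path)."""
    return f"fresh answer to: {query}"

def answer(query: str, max_distance: float = 0.2):
    q = embed_query(query)
    for vec, response in cache:
        if 1.0 - float(q @ vec) <= max_distance:      # cosine distance on unit vectors
            return response                           # the ~100 ms path
    response = run_full_pipeline(query)
    cache.append((q, response))
    return response
```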
MemGPT got renamed to Letta partly because of naming collisions with Memgraph (the graph database) and Microsoft's never-shipped "MemGPT." The original paper introduced an OS-inspired virtual-context architecture for agents — core memory always in context, recall memory paged from conversation history, archival memory for everything else, with the agent invoking function calls to read and write across tiers. The pattern has since been picked up almost universally for agents that need long-running memory, regardless of whether they use the original library. The name change happened because the team wanted something Latinate-sounding, and "letta" means "small letter" or "message."36
What I'm doing about it
I started this post with three PDFs in a session and a system that returned the wrong four chunks. The system is the same one I wrote about last week — the inspector that turns Russian-doll packages into addressable trees with hierarchy intact. The substrate works. What sits on top of it is what's been broken, and that's what the next few weeks of work are about.
Most of the infra I needed was already built — chunking pipeline, embedding store, hybrid retrieval, basic reranking. The thing missing was the layer above all of it: structure-aware document modeling that respects the seams the substrate already exposes, cross-document graph relationships, hierarchy preserved end to end so the parser's work isn't thrown away the second retrieval starts.
The plan, in order of what I'm wiring next:
- Section-aware chunking that respects document structure rather than slicing every 512 tokens. For PDFs that means actual layout parsing — figures, tables, and citation spans become first-class indexed units.
- Late chunking on the dense side, so each chunk's vector is conditioned on the whole document and "we" and "this method" stop being orphaned references.
- A property graph layer for entities and cross-document references — citations between papers, shared concepts, contradicting claims. Starting with LightRAG-style incremental updates because the corpus grows daily and a full Leiden recompute on every ingest is not a thing I want in production.
- A query classifier in front of the retrieval stack that routes factoid queries to hybrid, global queries to community search, multi-hop queries to an agent loop (a toy version is sketched after this list). Most of the time the cheap path is the right path; the expensive paths only run when they earn it.
- Semantic caching on the response side, because if the same shape of query comes back ten times in a session, paying for retrieval ten times is silly.
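To make the router concrete, a deliberately crude sketch. The keyword heuristics are placeholders for a real classifier, whether a small fine-tuned model or an LLM prompt:

```python
def route(query: str) -> str:
    """Crude router: cheap path by default, expensive paths only when earned."""
    q = query.lower()
    if any(w in q for w in ("theme", "overall", "summar", "across these", "across all")):
        return "community_search"   # global synthesis -> community summaries
    if any(w in q for w in ("disagree", "contradict", "compare", "versus")):
        return "agent_loop"         # multi-hop -> agentic traversal with graph tools
    return "hybrid"                 # factoid default: BM25 + dense + RRF + rerank

assert route("what are the main themes across these papers?") == "community_search"
assert route("where do these three documents disagree on X?") == "agent_loop"
assert route("what accuracy did they report on MIRAGE?") == "hybrid"
```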
I am going to ship this incrementally and watch what survives contact with real users. The piece I'm most uncertain about is the graph layer — it's expensive to build, the variants have different tradeoffs, and getting the entity ontology right is half the battle. The piece I'm most confident about is reranking, because the empirical evidence is unambiguous and the cost is bounded.
If three of these moves land, the system stops returning four chunks where two are duplicates. If five land, it starts answering "where do these three documents disagree" correctly.
If none of them land, that is also a useful answer.
Footnotes
1. Llama 4's announced 10M-token context. Gemini 1.5 Pro's 2M-token context. The trajectory is real, but the costs and quality issues this post discusses are real too. See OpenAI, "Introducing GPT-4.1 in the API," April 2025; Google, Gemini 1.5 Pro release notes; Meta AI, Llama 4 announcement, 2025.
2. NVIDIA Developer Blog, "Scaling to Millions of Tokens with Efficient Long-Context LLM Training," 2024. Discusses the quadratic memory cost and the engineering needed to mitigate it — Flash Attention, KV-cache compression, ring attention.
3. Glukhov, "RAG vs Long-Context LLMs," 2024 — benchmark figures comparing per-query cost across architectures. The 1250× number is for a specific RAG-vs-long-context configuration; the order of magnitude holds across most realistic comparisons.
4. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024, https://arxiv.org/abs/2307.03172. The U-shaped attention finding has been partially mitigated in newer models (Gemini 2.5, Claude 3.7) but the effect is still measurable for multi-document synthesis.
5. "The Navigation Paradox in Large-Context Agentic Coding: Graph-Structured Dependency Navigation Outperforms Retrieval in Architecture-Heavy Tasks," arXiv 2602.20048. The 99.4% number is from controlled experiments on the FastAPI RealWorld application using their CodeCompass MCP tool.
6. Robertson and Spärck Jones, original BM25 work at City University London, 1980s–90s. The Okapi system trivia is well-attested across IR history surveys.
7. Andrej Karpathy popularized the "context engineering" framing in mid-2024 as a counterpoint to "prompt engineering" — the work of building the pipeline that produces the prompt, rather than tuning the prompt itself.
8. LangChain `RecursiveCharacterTextSplitter` documentation. The 10–20% overlap heuristic is empirical and shows up consistently across Pinecone, Databricks, and Anthropic's published guidance.
9. "Five Levels of Chunking Strategies in RAG," Greg Kamradt, 2023. Structure-aware and semantic chunking discussed in detail.
10. LlamaIndex `AutoMergingRetriever` and `ParentDocumentRetriever` documentation. The small-to-big pattern is also called "hierarchical retrieval" in some surveys.
11. Anthropic, "Introducing Contextual Retrieval," September 2024, https://www.anthropic.com/news/contextual-retrieval. The 67% retrieval-failure reduction (contextual embedding + contextual BM25 + reranker) is from their reported eval numbers; the prompt-caching note about ~90% cost reduction on Claude is also theirs.
12. Günther et al., "Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models," arXiv 2409.04701, 2024. The mechanism is genuinely different from contextual retrieval — it bakes context into vectors via the encoder rather than via prepended LLM-generated text.
13. Cormack, Clarke, and Grossman, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods," CIKM 2009. The default k = 60 is from the original paper and has held up empirically across most subsequent retrieval research.
14. Native RRF support is documented in Elastic 8.x, OpenSearch 2.19+, Weaviate, Qdrant, Milvus, pgvector, Vespa, and Chroma documentation as of early 2026.
15. Bruch et al., "An Analysis of Fusion Functions for Hybrid Retrieval," 2022, https://arxiv.org/abs/2210.11934. They show learned convex combinations beat RRF when training data is available. Few production systems have that data.
16. BEIR benchmark numbers across multiple cross-encoder rerankers. The 5–30 NDCG@10 range is from the BGE, Cohere, and mxbai reranker reports.
17. Khattab and Zaharia, "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT," SIGIR 2020. ColBERTv2 (2021) is the production-relevant variant with residual compression.
18. "ModernBERT + ColBERT: Enhancing biomedical RAG through an advanced re-ranking retriever," arXiv 2510.04757. State-of-the-art on MIRAGE at the time of publication.
19. Faysse et al., ColPali — late interaction over PaliGemma image embeddings of PDF pages. The fact that it skips OCR is the critical engineering insight; OCR is one of the noisiest preprocessing steps in scientific PDF pipelines.
20. OurResearch, OpenAlex public statistics, 2024.
21. Edge et al., "From Local to Global: A GraphRAG Approach to Query-Focused Summarization," Microsoft Research, arXiv 2404.16130, April 2024.
22. "RAG vs. GraphRAG: A Systematic Evaluation and Key Insights," arXiv 2502.11371. Numbers from the MultiHop-RAG benchmark.
23. Wang & Chen, "Core-based Hierarchies for Efficient GraphRAG," arXiv 2603.05207, 2025. Proves the modularity-near-optimal-partition issue and proposes k-core decomposition.
24. Traag, Waltman, van Eck, "From Louvain to Leiden: guaranteeing well-connected communities," Scientific Reports, 2019. The bibliometrics origin is in the authors' earlier publications and the algorithm's original motivating paper.
25. Sarthi et al., "RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval," ICLR 2024, https://arxiv.org/abs/2401.18059.
26. RAPTOR paper, QuALITY benchmark results — 20-point absolute accuracy improvement over the previous state of the art when paired with GPT-4.
27. Aider blog, "Repository map" post and the tree-sitter follow-up. https://aider.chat/2023/10/22/repomap.html
28. tree-sitter project documentation. https://tree-sitter.github.io/
29. Cursor blog, indexing internals post. The 92% similarity figure across cloned codebases is from their own reported numbers.
30. Sourcegraph blog, "SCIP — a better code indexing format than LSIF," and "Cross-repository code navigation." https://sourcegraph.com/blog/announcing-scip
31. Same Sourcegraph post — the SICP nod is explicit in the announcement.
32. SWE-bench Verified leaderboard, 2025–2026 results. Multiple agents have hit 70%+ resolution rates, with Claude Code among them.
33. Chan et al., "Don't Do RAG: When Cache-Augmented Generation Is All You Need for Knowledge Tasks," arXiv 2412.15605, 2024.
34. Fluree, "Leveraging Knowledge Graph Databases for EU Data Compliance," 2023; Neo4j Aura RBAC documentation.
35. brain.co, "Semantic Caching: Accelerating beyond basic RAG with up to 65x latency reduction," 2024. The 0.2 cosine threshold is from their production tuning notes.
36. Packer et al., "MemGPT: Towards LLMs as Operating Systems," arXiv 2310.08560, 2023. The Letta rename happened in 2024.