A guy on my team spent three days last week trying to read a single Word file.
The file is 280 MB. It's an "issue closure package," which is what my industry calls the artifact you produce after something has gone wrong and you need a paper trail proving you understood why. Inside that one .docx: forty-something embedded PDFs sitting in specific table cells, three Excel sheets that themselves have embedded photos, a couple of PowerPoints, and an Outlook .msg thread that contains more .msg threads inside it. Plus the photographs, dozens of them. There are always photographs.
He has access to every AI tool the company pays for. Microsoft 365 Copilot. A private LLM gateway with every frontier model behind it. A small zoo of point tools sales reps have demoed at us over the past year. None of them got him meaningfully closer to done. The PDFs nested inside the table cells of the parent document showed up to the AI as opaque blobs labelled oleObject1.bin. The .msg with .msg inside it was, charitably, not even attempted.
I went looking for the company that had already solved this. I assumed it existed. Document AI is one of the loudest enterprise categories on the planet right now. Reducto closed a $75M Series B in October, total funding now $108M.[1] Hebbia is sitting on $160M+, runs at 92% accuracy on a legal/finance benchmark where vanilla RAG gets 68%, and is used by BlackRock and KKR.[2] Glean is at a $7.2B valuation. Harvey is at $5B. Surely somebody had thought about this.
Nobody has. Or, more honestly: lots of people have thought about pieces of it, several have built clever pieces of the substrate, and as of today not one company is shipping the obvious primitive. That's the gap I want to write about, because once you see it you can't un-see it, and because I think it's the most under-served whitespace in document AI right now.
The shape of the thing
When I say "nested," I don't mean a parsing detail. I mean the nesting is the meaning.
When a regulatory affairs lead hands you a CTD submission, or an M&A associate hands you a data room, or a litigator hands you a motion plus its exhibits, what you're holding is not a document. It's a graph. The parent memo points at sub-reports. Sub-reports embed test data. Test data embeds the photos and spreadsheets the technician was holding when she ran the test. The email thread on page 47 references a different memo entirely, attached three layers deep on a different branch. The structure carries information. "This figure lives here, in this section, attached to this row of this table" is itself a fact, and often the most important one.
Here's what one of these things actually looks like, sketched conservatively (file names illustrative, structure real):
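```
closure-package.docx  (280 MB)
├── §2, Table 4, cell B7 → TR-1142-test-report.pdf
│   └── attachment → raw-data.xlsx
│       └── embedded photo → weld-macro-003.jpg
├── §3 → ~40 more embedded PDFs, one per table row
├── Appendix A → metrics.xlsx (three sheets, embedded photos)
├── Appendix B → root-cause-briefing.pptx
└── Appendix C → supplier-thread.msg
    ├── attachment → escalation.msg
    │   └── attachment → corrective-action-memo.docx
    └── attachments → dozens of JPEGs
```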
Now go ask Microsoft Copilot to summarize this for you. Or upload it to ChatGPT. Or hand it to one of the more sophisticated managed RAG platforms. What happens, every time I have tested it, is one of three flavors of the same failure.
Sometimes the tool flattens everything into a stream of text and the hierarchy disappears. Sometimes the embedded objects get silently dropped because the parser couldn't handle them. Sometimes the text gets extracted but mixed with the parent in a way the AI can't disentangle, so the conclusion of the parent memo and the appendix of an embedded report and the body of an unrelated email forwarded inside an attachment all end up in the same bowl of soup. Different failure modes, same underlying problem. The hierarchy was the message. The hierarchy is gone.
Karpathy's framing has a corollary nobody has acted on at the document layer.[3] If context is a finite, carefully engineered resource, then losing the structure of a 280 MB package isn't an inconvenience. It's a violation of the central design constraint of the entire stack. Good context engineering on top of a parser that can't see past the first layer is impossible by construction.
"Just stuff it in the context window"
This is the first objection every engineer raises when I describe the problem, and it's wrong in a specific, well-documented way.
Last July, Chroma published a technical report called Context Rot. They tested 18 frontier models — GPT-4.1, Claude 4 Opus and Sonnet, Gemini 2.5 Pro, Qwen3 — on tasks deliberately designed to be simple, with input length as the only variable. Every single model degraded as input length grew. Not at the edge of the window. At every length they tested. Some models held steady at 95% accuracy and then nosedived to 60% once a particular threshold got crossed, and that threshold was different for every model on every task in ways that made it impossible to predict in production.[4]
The Stanford "lost in the middle" paper from 2024 has held up across model generations: 30+ point accuracy drops on facts placed in the middle of long contexts versus at the edges. Anthropic's own engineering blog put it plainly: "every new token introduced depletes" the attention budget.[5]
So stuffing doesn't work. The state of the art instead is to retrieve and route — feed the model only the chunks that matter. Hebbia's Matrix gets to 92% accuracy on a brutal legal-and-finance benchmark by running multiple agents in parallel, each on a curated slice. Anthropic's Contextual Retrieval research showed that prepending each chunk with a 50–100 word summary of where it sits in the larger document, plus hybrid retrieval and reranking, drops retrieval failure from 5.7% to 1.9%. A 67% reduction.[6]
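To make the move concrete, here is roughly what contextual retrieval looks like in code. A minimal sketch, not Anthropic's implementation: the prompt shape follows their write-up, and `llm` is a stand-in for whatever model call you use.

```python
# Contextual retrieval, minimally: before embedding, prepend each chunk
# with a short model-written note situating it in the parent document.

CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is a chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short (50-100 word) context situating this chunk within the
overall document, to improve search retrieval of the chunk.
Answer only with the context."""

def contextualize(doc: str, chunks: list[str], llm) -> list[str]:
    """Return chunks prefixed with their situating context."""
    out = []
    for chunk in chunks:
        ctx = llm(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
        out.append(f"{ctx}\n\n{chunk}")  # this combined text gets embedded
    return out
```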
All real progress. None of it solves the Russian doll problem, because retrieval-and-route assumes the corpus has been parsed into a coherent index in the first place. If your ingestion pipeline silently drops every embedded PDF before retrieval even starts, no amount of clever downstream routing recovers what was lost upstream. You can't retrieve what was never indexed.
The structure of an enterprise package is not metadata. It is the message. Lose the structure and you have not summarized the document. You have summarized a different document.
I went looking for who was solving this
Two weeks of reading every vendor's docs end to end. I'll save you the fortnight. The honest map looks like this:
| Tool | What it does | What it doesn't | Russian-doll fit |
|---|---|---|---|
| Microsoft 365 Copilot | Reads the active document, surfaces references in SharePoint and Graph, drafts inside Word. | Embedded OLE objects are largely invisible. The company that owns Word can't read its own format's embeddings. | Misses |
| Reducto / Docling | Best parsing in the business. Emits a real, addressable document tree with stable IDs and bounding boxes.[1] | Stops at the parse layer. No recursive resolution, no user-in-the-loop choices. | Substrate |
| Hebbia Matrix | Multi-agent grid: documents are rows, questions are columns, agents fill the cells with citations.[2] | Operates row by row. Each document is an atomic unit. The embedded subtree gets flattened into the row. | Adjacent |
| Glean / Bedrock / Vertex AI Agents | Connector-driven RAG over corporate knowledge. Permissions, audit, scale. | Whole-corpus or whole-document scoping. No notion of "this agent only sees Section 3.2 of this file." | Misses |
| LlamaIndex Document Agents | The closest architectural match in open source. A pattern: per-document agents plus a meta-agent. | It's a pattern. There is no managed product, no SaaS, no out-of-the-box thing a buyer can deploy. | DIY |
| SuperDoc | Open-source DOCX editor with an MCP surface. Targeting legal-tech with agentic redlining. | Optimizes for edit fidelity (preserving OOXML round-trip), not comprehension fidelity over nested embedded objects. Different lane. | Different problem |
The pattern is consistent. Every layer of the stack is well-funded and improving. Parsers can produce real document trees. Vector databases support hybrid filtering. Agent frameworks orchestrate parallel sub-agents. Retrieval research has gotten genuinely good. And yet at the layer where someone with a 280 MB closure package actually lives — heterogeneous, recursively embedded, needs to be read end-to-end with hierarchy intact — there is no product. There is no API. There is, as far as I can tell, nobody currently shipping it.
The kicker I found while diagnosing my colleague's package: a Microsoft Q&A thread, updated as recently as February 2026, where users were asking Microsoft's own support how to extract embedded PDFs from a Word doc with a few hundred attachments inside. The accepted answer? Rename the .docx to .zip and manually traverse the word/embeddings/ folder.[7]
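For the record, the workaround really does work, because a .docx is a ZIP archive and the embedded OLE payloads sit under word/embeddings/. A minimal scripted version of the accepted answer:

```python
# The accepted workaround, scripted: a .docx is a ZIP, and OLE payloads
# live under word/embeddings/. This recovers the raw bytes and nothing
# else: not which table cell, section, or parent each one came from.
import zipfile

def extract_embeddings(docx_path: str, out_dir: str = "embedded") -> None:
    with zipfile.ZipFile(docx_path) as z:
        for name in z.namelist():
            if name.startswith("word/embeddings/"):
                print(name)  # e.g. word/embeddings/oleObject1.bin
                z.extract(name, out_dir)
```

Even then you're not done. Each oleObjectN.bin is itself a Compound File Binary container you still have to crack open before you reach the actual PDF inside. The doll, again.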
In 2026. From the company that owns Word, Outlook, PowerPoint, and the entire OLE ecosystem.
A small tangent that probably tells you everything. The OOXML spec — the format for .docx and friends — has a feature called altChunk that lets one document inline-import another at render time. Almost exactly what an agentic document tool would want as a primitive. Pandoc ignores it. Mammoth ignores it. LibreOffice handles it inconsistently. A tool that handled altChunk correctly, with the user in the loop, would be the kind of small detail that makes a regulatory-affairs lead immediately recognize you've actually thought about her problem. None of them have.
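Detecting altChunk is the easy part, for what it's worth. Here's a sketch, assuming standard OOXML part names; resolving what it finds, recursively and with a human in the loop, is the actual product:

```python
# Find each altChunk reference in document.xml and resolve which part
# it imports at render time. Assumes standard OOXML part names.
import xml.etree.ElementTree as ET
import zipfile

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
R = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}"

def find_alt_chunks(docx_path: str) -> list[str]:
    """Return the relationship targets of every altChunk import."""
    with zipfile.ZipFile(docx_path) as z:
        doc = ET.fromstring(z.read("word/document.xml"))
        rels = ET.fromstring(z.read("word/_rels/document.xml.rels"))
    targets = {rel.get("Id"): rel.get("Target") for rel in rels}
    return [targets[el.get(f"{R}id")] for el in doc.iter(f"{W}altChunk")]
```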
How a human actually does this
If you start from how a person reviews one of these things, the right primitive is almost embarrassingly obvious. So obvious I keep wondering what I'm missing.
A human opens the parent memo. Scrolls. Hits an embedded file. Makes a snap call: do I need to open this. If yes, opens it, skims it, decides if its contents are material to the parent's conclusion, forms a memory or copies a quote into her notes. If no, keeps scrolling. Recurses for every embedded thing she cares about. By the end, she has a curated, hierarchical mental model of the package. Parent. Children that mattered. Children-of-children that mattered. A fuzzy "appendix material" bucket for the irrelevant 80%.
Now compare that to what current tools do: no tree, no placeholders, no choices. Just the flatten, drop, and soup failures from earlier, with every one of those judgment calls made silently on the user's behalf.
The AI version of the human flow is straightforward to describe. Parse the package into a real tree. Surface every embedded object as a placeholder node, marked with a small visible warning. Think of the yellow squiggle in DevTools when something's unresolved. For each placeholder, the user picks one of four resolutions. The same four a human reviewer picks implicitly, every time.
The four-way choice is the whole insight. Delete covers "this attachment is irrelevant noise." Inline covers "include the full text, I want the model to see it." Summarize covers "include a compressed version, I'm budget-constrained." Override covers "the parser is failing on this format, let me type what it actually says." That covers every realistic case I have run into. I genuinely cannot think of a fifth.
The output is a marked-up document, structurally faithful to the original, where every embedded object has been resolved into a tagged placeholder the model can reason about. The system prompt teaches the model what the tags mean. The hierarchy survives the round trip. The model now reads the package the way a human reads it: aware of what's deep, what's shallow, what's a quote, what's a summary, what is the reviewer's own annotation.
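In code, the primitive is almost anticlimactic. The tag vocabulary and node shape below are mine, a sketch of the idea rather than a spec:

```python
# The four-way resolution, as a data model. Every placeholder in the
# tree gets exactly one of these, chosen by the human reviewer.
from dataclasses import dataclass, field
from enum import Enum

class Resolution(Enum):
    DELETE = "deleted"        # irrelevant noise: drop from the payload
    INLINE = "inline"         # full extracted text; the model sees it all
    SUMMARIZE = "summary"     # compressed version; saves context budget
    OVERRIDE = "override"     # parser failed; the human types what it says

@dataclass
class EmbeddedNode:
    node_id: str              # stable address, e.g. "sec3/table4/b7/pdf1"
    filename: str
    children: list["EmbeddedNode"] = field(default_factory=list)

def render(node: EmbeddedNode, choice: Resolution, payload: str = "") -> str:
    """Emit the tagged placeholder the model will see."""
    attrs = f'id="{node.node_id}" file="{node.filename}" resolution="{choice.value}"'
    if choice is Resolution.DELETE:
        return f"<embedded {attrs}/>"
    return f"<embedded {attrs}>\n{payload}\n</embedded>"
```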
Build retrieval, RAG, agentic pipelines, comparison views, audit trails on top of that substrate, and they fall out as easy work. The hard part is the substrate. The substrate is missing.
"It's just a parser plus a UI"
The reaction I get from engineers, every time. Three things make it not.
The recursion is real. An embedded PDF can itself contain attachments. PDF supports it natively. An embedded .msg can contain other .msg files with their own attachments. An embedded Excel can have an embedded image with EXIF metadata pointing at another file in the package. The data structure is a graph, not a list. Walking it correctly, deduplicating across branches, handling cycles, presenting the tree in a UX that doesn't collapse under its own weight — that's months of engineering, not weeks.
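The skeleton of a correct walk fits on a napkin. Here's a sketch, with `extract_children` standing in for the per-format parsers that are the real six months; every line of it hides a format-specific swamp:

```python
# Depth-first walk over embedded objects: hash every payload so the
# same file reached via two branches is recorded once, which also
# stops infinite loops on genuinely cyclic packages.
import hashlib

def walk(blob: bytes, path: str, seen: set[str], extract_children) -> dict:
    digest = hashlib.sha256(blob).hexdigest()
    if digest in seen:
        return {"path": path, "dedup_of": digest}  # same bytes, other branch
    seen.add(digest)
    return {
        "path": path,
        "sha256": digest,
        "children": [
            walk(child, f"{path}/{name}", seen, extract_children)
            for name, child in extract_children(blob)
        ],
    }
```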
The format matrix is brutal. DOCX with OLE objects. PDF with PDF attachments. PDF with embedded fonts that confuse OCR. Outlook .msg files (a binary format from the mid-1990s that went formally undocumented for over a decade, and that we still use because the world runs on Outlook). Embedded Excel that itself contains embedded images. Visio diagrams. Whatever PowerPoint slides count as. Each is a parser with edge cases you only learn empirically. Six months of honest work gets you a moat a single competitor can't catch in under a year, because the bugs aren't specifiable in advance. You discover them by running the tool against real customer documents, in production, when the stakes matter.
The judgment is human. The four-way resolution is not something to automate away. The whole point is that "is this attachment relevant?" is a domain call, and a litigator's answer is different from an auditor's, which is different from a regulatory affairs lead's, and they are all correct in their own contexts. The product has to put the controls in front of the human and let her choose. Tools that try to AI-decide this end up wrong in the cases that matter most. People who have lived through "the AI summarized this for me and missed the only sentence that mattered" are not going to trust the next round of automated promises. They want the controls back.
Who else lives here
Once you see the pattern you can't stop seeing it. The same nested-package bottleneck shows up almost identically in:
- Legal discovery and litigation, where a motion plus its exhibits is exactly the structure I've been describing — and reviewing every cross-reference manually is the bulk of associate hours at a litigation firm.
- Regulatory submissions in pharma and medical devices: CTD dossiers, 510(k) submissions, literally nested folders of nested documents. FDA reviewers do this manually.
- Insurance claim packages, where adjusters work through police reports, medical records, photos, and invoices that arrive zipped together.
- M&A data rooms, which every dealmaker I've talked to complains about specifically — thousands of files with cross-references nobody can hold in their head.
- Audit workpapers at the Big Four. Engineering change orders in aerospace and automotive. FOIA responses. Government intelligence packages.
I'm not saying I want to build for all of these. I'm saying the structure is identical across all of them, which means the moat is portable. Solve it once in one vertical — ideally the one you have first-hand pain in — and the engine ports.
A possibly-relevant fun fact. The format we now know as .msg, the file Outlook produces when you save an email with all attachments embedded, is a binary derivative of Microsoft Compound File Binary. CFB was a 1990s spec that tried to let you store a whole filesystem inside a single file. Every .msg you've ever opened is a tiny operating system pretending to be a message. Opening it correctly, with all attachments and their attachments, has been a parser writer's nightmare for thirty years. There is something pleasing about an AI revolution that has scaled past trillion-parameter models but still chokes on a 1995 file format.
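You can watch the little filesystem in action with the third-party olefile package. Each attachment is a storage, and a nested message shows up as sub-storages inside its parent's attachment storage:

```python
# List the attachment storages inside a .msg. Storage names follow the
# MS-OXMSG convention: one __attach_version1.0_#NNNNNNNN per attachment.
import olefile

def list_attachments(msg_path: str) -> None:
    with olefile.OleFileIO(msg_path) as ole:
        for entry in ole.listdir(storages=True):
            if entry[-1].startswith("__attach_version1.0_"):
                print("/".join(entry))
```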
The deeper bet
What I've described, on its surface, is a tool for resolving embedded files. That's the wedge. The deeper bet is that the hierarchical, addressable, user-curated document tree is the right substrate for a whole class of AI workflows the industry hasn't named yet.
Once you have a stable tree with stable IDs — the equivalent of a DOM for documents — you can do the things Cursor and Claude Code did for codebases. Scope an agent to a sub-tree. Run different agents on different sections in parallel. Audit which agent saw which node. Prove containment to a compliance reviewer: this AI literally could not see outside the section it was attached to. Diff versions of the same package. Build a learning loop that remembers, for this team, which kinds of attachments usually get summarized vs inlined.
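The scoping piece in particular falls out almost for free once the tree exists. Containment is just subtree selection, and an auditor can verify it by checking that every node in the agent's context sits under the scoped ID. Shapes illustrative:

```python
# Scope an agent to a subtree: if its context is built only from the
# nodes returned here, it provably never saw anything outside them.
def scope(node: dict, target_id: str) -> dict | None:
    """Return the subtree rooted at target_id, or None if absent."""
    if node["id"] == target_id:
        return node
    for child in node.get("children", []):
        found = scope(child, target_id)
        if found is not None:
            return found
    return None
```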
None of that is possible without the substrate. All of it is possible with it. And the substrate is, currently, nobody's product.
Coding tools made this jump years ago. Cursor and Claude Code don't treat your repo as one file. They treat it as a graph with symbol-level addressing, function-level scoping, file-level isolation. The reason that works is that abstract syntax trees give you the data structure for free. Document AI hasn't had its AST moment. Partly because nobody has insisted on building one. Partly because the format zoo is genuinely uglier than any programming language ever invented.
If you're building anything that pushes long context, Kelly Hong from Chroma's talk walking through the Context Rot research is worth 25 minutes of your time.
What I'm doing about it
Building it. Quietly, on the side. v0.1 already exists. It's a local web app that ingests a .docx, walks the OOXML tree, surfaces every embedded object as a yellow-flagged placeholder in a DevTools-style inspector, lets you pick one of the four resolutions, and produces a marked-up payload an LLM can reason about with hierarchy intact. It's rough. It's slow. It works.
On the code itself: closed source today, and I hate that. The whole point of a substrate is that it gets used, and putting a sales call between a curious engineer and a clone command is the first sign you're building the wrong shape of thing. The plan is to go AGPL the second the surface stops moving every other day — months, not years. AGPL specifically because the right answer is: anyone who wants to grab the repo and run it against their own documents on their own laptop on their own quiet Sunday gets to do that for free, forever. A company that wants to deploy it inside their network and have an internal team build on top of it runs into the network-copyleft clause, and at that point we can have a conversation about what a fair commercial licence looks like. It's the pattern Grafana, MongoDB, and a few other developer-tool companies I respect have used to fund the actual work while keeping the code honest. I'd rather take the extra weeks to do the relicensing properly once than rush it now and ship a flag day in nine months.
The next four months are about getting it out of my laptop and in front of two or three teams who have this exact pain on real packages with real deadlines. If that goes well, we'll talk about the rest. If it doesn't, that's also a useful answer.
If you're a regulatory affairs lead, an audit-workpaper reviewer, an M&A associate, in-house counsel, a quality engineer in a regulated industry, or anyone else whose Tuesday afternoons get eaten by 280 MB Word documents — and you'd give me twenty minutes — I want to hear from you. The single most valuable thing I can do this quarter is talk to people who live in this problem. Email's in the footer.
I don't know if this becomes a company. I do know somebody is going to build it, because the gap is real and the demand is obvious to anyone who has spent three days inside a closure package. I'd rather it be someone who started from the pain than someone who started from the funding round.
Footnotes
1. Reducto, "Reducto Raises $75M Series B to Define the Future of AI Document Intelligence," PR Newswire, 14 October 2025; total funding figure of $108M and a16z lead per Reducto's own blog post of the same date. reducto.ai/blog.
2. OpenAI customer story on Hebbia, "Automating 90% of finance and legal work with agents," 2025. The 92% vs 68% figure is from their published benchmark comparing Hebbia's multi-agent Matrix to vanilla RAG on legal and financial documents. openai.com/index/hebbia.
3. Karpathy, X post, 25 June 2025. The phrase "context engineering" was popularized independently by Tobi Lütke and Karpathy in mid-2025 and has been adopted by Anthropic, LangChain, and most serious LLM-app builders since.
4. Hong, Kelly, Anton Troynikov, and Jeff Huber. "Context Rot: How Increasing Input Tokens Impacts LLM Performance." Chroma technical report, July 2025. research.trychroma.com/context-rot.
5. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024; Anthropic Engineering, "Effective Context Engineering for AI Agents," September 2025.
6. Anthropic, "Introducing Contextual Retrieval," September 2024. Combined contextual embeddings plus contextual BM25 plus reranking reduces top-20 retrieval failure from 5.7% to 1.9%. anthropic.com/news/contextual-retrieval.
7. Microsoft Q&A, "How do I extract embedded files from a Word document efficiently?" Multiple threads, latest update February 2026. The accepted workaround is to rename the .docx to .zip and dig through the archive.