Why Your $50M Deal Almost Failed Because AI Can't Read

You paid $200,000 for the premier AI subscription. Your team uploaded 847 pages of merger documents. The AI confidently told you the indemnification cap was $5M. It was actually $50M—buried in an appendix cross-reference on page 623.

Your lawyer caught it. This time.

MIT researchers just published a paper that explains why this keeps happening—and more importantly, how they solved it. The implications for anyone managing serious money are profound.

The Expensive Problem You're Already Facing

Here's what's happening in high-stakes operations right now:

Your legal team uploads acquisition documents to Claude or GPT-5. The AI processes everything, seems confident, misses a material adverse change clause that completely alters deal economics. Cost of missing it: $12M in unexpected liabilities.

Your investment analysts feed AI tools hundreds of quarterly reports to find cross-portfolio trends. The AI summarizes beautifully but misses that three portfolio companies have the same problematic vendor—information that was actually present in the documents. Cost: preventable supply chain crisis.

Your family office processes trust documents spanning decades. The AI loses track of amendments, misinterprets outdated cross-references, provides guidance based on superseded language. Cost: potential breach of fiduciary duty.

These aren't edge cases. This is Tuesday.

The numbers tell the story: MIT researchers tested GPT-5 on increasingly complex document tasks. Performance dropped from 80% accuracy to under 20% as complexity increased—even on documents well within the model's stated context window.

You're paying premium prices for degraded results. Worse, you don't know which results are degraded until something breaks.

Why "Bigger Context Windows" Won't Save You

The AI industry's answer to this problem has been simple: bigger context windows. GPT-4 handled 8K tokens, then 32K, then 128K. Claude pushed to 200K. The new models promise millions.

Problem solved, right?

Not even close.

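First, the physics don't work. The cost of processing context isn't linear: in standard transformer attention it grows roughly with the square of the input length. A model that handles 100K tokens reasonably well doesn't handle 1M tokens by just using 10x more compute. The attention work grows closer to 100x. The computational requirements balloon, and your API bill climbs with them.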
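
A rough sketch of that scaling, assuming plain full self-attention in which every token is compared with every other token. Production systems use optimizations this ignores, and the function name is illustrative only:

```python
def attention_pairs(num_tokens: int) -> int:
    """Token-to-token comparisons in one full self-attention pass (assumed n * n)."""
    return num_tokens * num_tokens

base = attention_pairs(100_000)
scaled = attention_pairs(1_000_000)

# Under this assumption, 10x more tokens means 100x more pairwise comparisons.
growth = scaled // base
print(f"10x tokens -> {growth}x attention work")
```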

Second, and more fundamental: neural networks aren't databases. They're pattern-matching systems that hold everything in a kind of working memory. You wouldn't ask a human analyst to memorize 500 pages and answer questions purely from memory. You'd give them the documents and let them reference them.

But that's exactly what we're asking AI to do.

The current workarounds don't fix this:

Summarization throws away information. It works fine for "give me the gist" but fails catastrophically for "find every instance where these three conditions co-occur." You can't summarize your way through due diligence.

RAG (Retrieval Augmented Generation) helps but hits walls fast. It retrieves chunks based on similarity to your question. If you don't know to ask about the obscure cross-reference on page 623, RAG won't find it. It's a search engine, not a thinking system.
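
To make the failure mode concrete, here is a toy retriever. It scores chunks by word overlap with the question, a crude stand-in for embedding similarity and not how any particular RAG product works. The page numbers and contract text are invented for illustration:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; numbers and punctuation are ignored."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query: str, chunk: str) -> float:
    """Jaccard overlap between query words and chunk words."""
    q, c = tokenize(query), tokenize(chunk)
    return len(q & c) / len(q | c)

# Hypothetical chunks: the main clause and the appendix that overrides it.
chunks = {
    "p. 112": "The indemnification cap shall be $5,000,000, subject to Appendix F.",
    "p. 623": "Appendix F: for claims under Section 9.2 the limit is increased to $50,000,000.",
}

query = "What is the indemnification cap?"
ranked = sorted(chunks, key=lambda page: overlap_score(query, chunks[page]), reverse=True)
top_hit = ranked[0]  # the chunk a top-1 retriever would hand to the model
```

The appendix never uses the words "indemnification cap," so it ranks below the main clause. A retriever that returns only the best match hands the model the $5M figure and nothing else.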

Chunking strategies help you fit content into context windows but break logical connections across chunks. The clause on page 50 references exhibits on page 400. Chunking severs that link.
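
A small sketch of how that severing happens, assuming naive fixed-size chunking by page count. The page text and helper names are made up for the example:

```python
import re

def chunk_pages(pages: list[str], pages_per_chunk: int) -> list[list[str]]:
    """Naive fixed-size chunking: group consecutive pages."""
    return [pages[i:i + pages_per_chunk] for i in range(0, len(pages), pages_per_chunk)]

def dangling_references(chunk: list[str]) -> set[str]:
    """Exhibits a chunk mentions but does not itself contain."""
    text = "\n".join(chunk)
    mentioned = set(re.findall(r"Exhibit [A-Z]", text))
    # An exhibit is "contained" if a page in this chunk starts with its heading.
    contained = set(re.findall(r"^Exhibit [A-Z](?=:)", text, flags=re.MULTILINE))
    return mentioned - contained

pages = [
    "Section 4.1: Liability is limited as set out in Exhibit C.",
    "Section 4.2: Notices must be in writing.",
    "Section 9: Miscellaneous.",
    "Exhibit C: The limit is $50,000,000.",
]

chunks = chunk_pages(pages, pages_per_chunk=2)
for index, chunk in enumerate(chunks):
    print(index, dangling_references(chunk))
```

The first chunk cites Exhibit C but cannot see it. A model reading that chunk alone has no way to know what the limit is.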

The dirty secret: models with 200K token context windows start degrading at 50K tokens for complex reasoning tasks. You're paying for capacity you can't actually use.

Enter Recursive Language Models: The MIT Breakthrough

Here's what MIT researchers Alex Zhang, Tim Kraska, and Omar Khattab figured out:

Stop treating documents as something the AI needs to memorize. Treat them as external data the AI can programmatically explore.

The technical name is Recursive Language Models (RLMs). The concept is elegant:

Load your documents into a programming environment (Python)
Give the AI tools to write code that explores those documents
Let the AI recursively call itself on relevant sections
Combine results programmatically to form final answers
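
The recursion at the heart of those steps can be sketched in a few lines. This is a simplification, not the researchers' implementation: in the approach described here the model writes its own exploration code, while this sketch hard-codes a "split in half" strategy. The `llm` parameter is a hypothetical callable that takes a prompt and returns text, stubbed out below:

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

def recursive_answer(query: str, text: str, llm: LLM, max_chars: int = 2000) -> str:
    """Answer `query` over `text`, recursing when the text is too long for one call."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    if len(text) <= max_chars:
        return llm(f"Question: {query}\n\nText:\n{text}")
    mid = len(text) // 2
    partials = [
        recursive_answer(query, text[:mid], llm, max_chars),
        recursive_answer(query, text[mid:], llm, max_chars),
    ]
    # Combine step: one more call that sees only the partial answers.
    return llm(f"Question: {query}\n\nCombine these partial answers:\n" + "\n".join(partials))

# Stub model so the sketch runs without any API access.
calls: list[str] = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"answer#{len(calls)}"

result = recursive_answer("What is the cap?", "x" * 5000, fake_llm, max_chars=2000)
```

No single call ever sees more than `max_chars` characters of the source, which is the whole point.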

Instead of asking the AI to "read all 10,000 pages and remember everything," you're giving it the ability to do this:

"Search all pages for 'indemnification cap'
→ Found 47 references
→ Filter for ones in sections titled 'Limitations of Liability'
→ Found 8 references  
→ For each: extract surrounding context and cross-referenced exhibits
→ Analyze extracted sections for actual cap amounts
→ Cross-check with amendment dates
→ Report highest applicable cap with source citations"

The AI never holds all 10,000 pages in memory. It strategically filters, examines relevant chunks, and builds up its answer through code and recursive analysis.
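
The search-filter-extract chain above might look like this in code. It is a minimal sketch under stated assumptions: the `Page` structure, the sample pages, and the dollar-amount regex are all invented for illustration, and a regex stands in for the "analyze" step that a real recursive system would hand to a model sub-call:

```python
import re
from dataclasses import dataclass

@dataclass
class Page:
    number: int
    section: str
    text: str

def find_caps(pages: list[Page], term: str, section: str) -> list[tuple[int, int]]:
    """Return (amount_in_dollars, page_number) for each amount on a matching page."""
    hits = []
    for page in pages:
        if term.lower() not in page.text.lower():
            continue  # step 1: keyword search
        if page.section != section:
            continue  # step 2: filter by section title
        for amount in re.findall(r"\$(\d[\d,]*)", page.text):  # step 3: extract amounts
            hits.append((int(amount.replace(",", "")), page.number))
    return hits

# Hypothetical pages from a long agreement.
pages = [
    Page(12, "Definitions", "The indemnification cap is defined in Section 9."),
    Page(212, "Limitations of Liability", "The indemnification cap is $5,000,000."),
    Page(623, "Limitations of Liability", "For Section 9.2 claims the indemnification cap is $50,000,000."),
]

hits = find_caps(pages, "indemnification cap", "Limitations of Liability")
highest_amount, source_page = max(hits)
print(f"Highest cap: ${highest_amount:,} (page {source_page})")
```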

The researchers tested this on four different types of tasks:

Simple needle-in-haystack (find specific facts)
Multi-document research questions (synthesize across sources)
Dense reasoning tasks (process nearly every line)
Pairwise analysis (compare all combinations of items)

They scaled inputs from 8,000 tokens to over 1 million tokens—more than 100x beyond typical context windows.

The Numbers That Matter to Your Bottom Line

Let's cut to what you care about: does it work, and what does it cost?

Performance Gains

On tasks requiring dense information processing (the kind you face in due diligence), RLMs outperformed standard GPT-5 by 28-58%.

On one particularly brutal test—analyzing pairs of items across a dataset—GPT-5 essentially failed with 0.04% accuracy. The RLM version achieved 58% accuracy.

Most striking: standard models' performance collapsed as documents got longer. RLM performance stayed relatively stable even at 10M+ tokens.

On a multi-document research task with 1,000 documents (6-11M tokens total):

RLM(GPT-5): 91% accuracy
Summarization approach: 70% accuracy
Retrieval (RAG) approach: 51% accuracy
Base GPT-5: couldn't process (exceeded context limits)

Cost Reality

This is where it gets interesting for operators watching budgets.

Average API cost for RLM on those 6-11M token research tasks: $0.99 per query.

If a model could somehow ingest 6-11M tokens directly (GPT-5 can't), the input alone would cost $1.50-$2.75 even at GPT-5-mini rates, before any processing. At full GPT-5 rates it would cost more. Plus the performance would be worse.
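
The ingestion figure is simple arithmetic: tokens times a per-million-token input price. The sketch below assumes a price of $0.25 per million input tokens, chosen because it reproduces the $1.50-$2.75 range. Treat it as a placeholder and check current provider pricing before budgeting:

```python
def ingestion_cost(tokens: int, usd_per_million: float) -> float:
    """Input-only cost of sending `tokens` tokens to a model."""
    return tokens / 1_000_000 * usd_per_million

ASSUMED_PRICE = 0.25  # assumption: USD per 1M input tokens, not a quoted rate

low = ingestion_cost(6_000_000, ASSUMED_PRICE)
high = ingestion_cost(11_000_000, ASSUMED_PRICE)
print(f"${low:.2f} to ${high:.2f} per query, input only")
```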

The summarization baseline that achieved 70% accuracy cost $0.57 on average but required GPT-5-nano for compression (to avoid exploding costs) and still got worse results.

RLMs are cost-competitive or cheaper while delivering significantly better accuracy. You're not paying more for the capability—you're getting more value per dollar spent.

The catch: costs have high variance. Simple queries might cost $0.20. Complex ones requiring many recursive calls might hit $2-3. But you're getting actual answers, not hallucinated confidence.

Real-World Applications for Serious Operations

Let's get specific about where this matters.

Due Diligence That Actually Works

You're acquiring a manufacturing business. The data room has 2,400 documents. You need to:

Identify all environmental liabilities across facilities
Cross-reference insurance policies with actual coverage gaps
Find contradictions between employment agreements and benefit plan documents
Track IP ownership chains through multiple assignments

Current approach: army of associates billing $800/hour, plus AI tools that keep missing things, resulting in post-close surprises.

RLM approach: AI systematically searches for environmental mentions across all documents, recursively examines each finding in context, cross-references it with related documents, and builds a comprehensive liability picture with source citations. Your associates review and validate instead of doing the initial extraction.

Time saved: weeks. Accuracy gained: measurable. Hidden liabilities found: priceless.

Portfolio Intelligence

You have positions in 200+ companies. You want to know:

Which portfolio companies share problematic vendors
How inflation is affecting margins across sectors
Which CEOs are discussing succession in earnings calls
Geographic concentration risks across the portfolio

Current approach: manual analyst work, keyword searches that miss context, spreadsheet hell.

RLM approach: Process all quarterly reports, transcripts, and filings. Code-based analysis finds semantic patterns (not just keywords). Recursive examination of each finding in full context. Programmatic aggregation of insights.

You get signal instead of noise.
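
The "programmatic aggregation" step is the easy part once vendor names have been extracted from each filing. A minimal sketch, assuming the extraction already happened and using invented company and vendor names:

```python
from collections import defaultdict

def shared_vendors(
    vendors_by_company: dict[str, set[str]], min_companies: int = 2
) -> dict[str, list[str]]:
    """Map each vendor to the companies using it, keeping only shared vendors."""
    companies_by_vendor: dict[str, list[str]] = defaultdict(list)
    for company, vendors in vendors_by_company.items():
        for vendor in vendors:
            companies_by_vendor[vendor].append(company)
    return {
        vendor: sorted(companies)
        for vendor, companies in companies_by_vendor.items()
        if len(companies) >= min_companies
    }

# Hypothetical output of per-company extraction.
extracted = {
    "AcmeCo": {"VendorX", "VendorY"},
    "Beta": {"VendorX"},
    "Gamma": {"VendorX", "VendorZ"},
}

overlap = shared_vendors(extracted)
print(overlap)
```

The hard part is the extraction feeding this function. That is where the recursive examination of each filing earns its keep.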

Legal Operations at Scale

Your family office has 40 years of trust documents, amendments, and related agreements. You need to:

Determine current distribution rules across all trusts
Find conflicts between trust terms and tax elections
Track how amendments modified original intent
Ensure consistency across generations of documents

Current approach: senior partner bills 60 hours at $1,500/hour to manually review everything. Still misses obscure cross-references.

RLM approach: AI reads chronologically, tracks amendments programmatically, recursively examines each modification in the context of the original terms, and builds a complete current picture with change history.

Senior partner reviews AI's work in 8 hours instead of creating it in 60.
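
Tracking amendments programmatically can be as plain as replaying them in date order. This is a sketch under simplifying assumptions: each amendment replaces one named clause wholesale, and the `Amendment` record and sample trust language are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Amendment:
    effective: date
    clause: str
    new_text: str

def current_terms(
    original: dict[str, str], amendments: list[Amendment]
) -> tuple[dict[str, str], dict[str, list[tuple[date, str]]]]:
    """Apply amendments in date order; return current terms and per-clause history."""
    terms = dict(original)
    history: dict[str, list[tuple[date, str]]] = {}
    for amendment in sorted(amendments, key=lambda a: a.effective):
        history.setdefault(amendment.clause, []).append(
            (amendment.effective, amendment.new_text)
        )
        terms[amendment.clause] = amendment.new_text  # later amendments win
    return terms, history

original = {"distribution": "Income paid annually."}
amendments = [  # deliberately out of order, as documents often arrive
    Amendment(date(2015, 6, 1), "distribution", "Income paid quarterly."),
    Amendment(date(1998, 3, 1), "distribution", "Income paid semiannually."),
]

terms, history = current_terms(original, amendments)
print(terms["distribution"])
```

Real amendments are messier (partial edits, conditional language, restatements), which is where model sub-calls would interpret each change before it is applied.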

Tax Strategy Across Entities

You have 35 entities across 8 jurisdictions. Tax planning requires understanding:

Inter-company transaction history
Transfer pricing documentation consistency
Nexus-creating activities by state
Treaty eligibility chains

Current approach: multiple tax advisors each understanding their piece, coordination meetings, and a prayer that nothing falls through the cracks.

RLM approach: Centralized analysis across all entity documents, programmatic tracking of transactions across entities, recursive examination of treaty chains, automated consistency checking.

Your tax team works from a complete picture instead of assembling puzzle pieces.
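
One flavor of automated consistency checking: confirm that every inter-company transfer shows up in both counterparties' records. The sketch assumes transactions have already been extracted into a simple (payer, payee, amount) form. Entity names and amounts are invented:

```python
Entry = tuple[str, str, int]  # (payer, payee, amount in cents)

def one_sided_entries(ledgers: dict[str, list[Entry]]) -> list[Entry]:
    """Entries recorded by one counterparty but missing from the other's ledger."""
    missing = []
    for entity, entries in ledgers.items():
        for entry in entries:
            payer, payee, _amount = entry
            counterparty = payee if entity == payer else payer
            if entry not in ledgers.get(counterparty, []):
                missing.append(entry)
    return missing

# Hypothetical extracted ledgers for three entities.
ledgers = {
    "HoldCo": [("HoldCo", "OpCo", 100_000)],
    "OpCo": [("HoldCo", "OpCo", 100_000), ("OpCo", "IPCo", 25_000)],
    "IPCo": [],
}

gaps = one_sided_entries(ledgers)
print(gaps)
```

Here OpCo records a payment to IPCo that IPCo's records never mention, the sort of gap that surfaces in a transfer pricing review.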

What This Means for Your AI Strategy

If you're building AI infrastructure for high-stakes operations, here's what changes:

Near-Term (Next 6-12 Months)

Watch for RLM-style capabilities showing up in enterprise AI products. Anthropic, OpenAI, and others are absolutely aware of this research. The commercial implementations are coming.

Current "long context" solutions will start looking obsolete. A model with a 1M token context window using standard approaches will lose to a model with a 200K window using recursive approaches.

Early adopters gain intelligence advantages. The firm that implements this for due diligence first gets better deal outcomes. The family office that implements this for document analysis first makes fewer costly mistakes.

Investment Implications

AI companies solving the context problem through inference-time scaling (like RLMs) become more valuable than those just building bigger context windows.

The context window arms race matters less. "We support 10M tokens" stops being a meaningful differentiator if those tokens don't actually work for complex tasks.

Focus shifts to inference-time scaling capabilities. The question becomes: how well can your model strategically process information, not how much can it hold in memory?

If you're allocating capital to AI infrastructure or AI-focused funds, these are the capabilities to evaluate.

Operational Changes

Rethink how you structure AI workflows. Stop designing around context window limits. Start designing around programmatic information access.

Stop manually chunking documents to fit AI limitations. Let the AI figure out chunking strategy based on the actual task.

Build systems assuming functionally infinite context. Design for "the AI can access all relevant information" rather than "we need to carefully select what fits."

Invest in data infrastructure that makes documents programmatically accessible to AI. Clean metadata, clear document relationships, API access to document stores.
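
What "programmatically accessible" can mean in practice: each document carries structured metadata, including which other documents it cites, so code can follow relationships instead of guessing. A minimal in-memory sketch with hypothetical field names. A production system would sit on a real document store:

```python
from dataclasses import dataclass, field

@dataclass
class DocumentRecord:
    doc_id: str
    title: str
    doc_type: str
    references: list[str] = field(default_factory=list)  # doc_ids this document cites

def referenced_by(store: dict[str, DocumentRecord], target_id: str) -> list[str]:
    """Reverse lookup: which documents cite `target_id`?"""
    return sorted(doc.doc_id for doc in store.values() if target_id in doc.references)

store = {
    "msa": DocumentRecord("msa", "Master Services Agreement", "contract", ["exhibit-c"]),
    "exhibit-c": DocumentRecord("exhibit-c", "Exhibit C: Liability Limits", "exhibit"),
    "amend-1": DocumentRecord("amend-1", "First Amendment", "amendment", ["msa", "exhibit-c"]),
}

citing = referenced_by(store, "exhibit-c")
print(citing)
```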

The Catch (What the Research Paper Won't Emphasize)

Before you go implement this Monday morning, understand the current limitations:

Implementation Reality

This isn't plug-and-play yet. The MIT research used custom code and required models with strong programming abilities. Your procurement team can't just "turn on RLM mode" in ChatGPT Enterprise.

Response times vary wildly. Simple queries might take seconds. Complex ones requiring many recursive calls might take minutes. If you need sub-second response times, current implementations aren't there.

You need frontier models. The research worked with GPT-5 and Qwen3-Coder-480B (a massive open-source model). These are expensive, and in GPT-5's case, not universally accessible yet. Smaller models without strong coding abilities struggle with this approach.

The models make inefficient choices. Because they weren't trained specifically for recursive reasoning, they sometimes make thousands of unnecessary sub-calls or verify answers redundantly. This drives up costs and time.

What to Watch

Training models for recursive approaches: The next generation of models trained specifically to reason recursively will be dramatically more efficient. This is a training methodology innovation, not just an inference trick.

Speed improvements through parallelization: Current implementations run sub-calls sequentially. Running independent sub-calls in parallel could cut response times substantially.
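
A sketch of what parallel sub-calls could look like using Python's standard thread pool. The `sub_call` function is a stand-in that sleeps to mimic network latency; a real version would make a model API request. This only helps when the sub-calls are independent of one another:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def sub_call(section: str) -> str:
    """Stand-in for one recursive model call; sleeps to mimic network latency."""
    time.sleep(0.1)
    return section.upper()

sections = [f"section {i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(sub_call, sections))  # map preserves input order
elapsed = time.perf_counter() - start

print(f"{len(results)} sub-calls finished in {elapsed:.2f}s")
```

Run one after another, eight calls at 0.1 seconds each would take about 0.8 seconds. With eight workers the wall-clock time is close to that of a single call.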

Enterprise-ready implementations: The gap between research code and production-ready enterprise software is real. The companies that bridge this gap first will capture enormous value.

Hybrid approaches: Combining RLMs with traditional RAG, summarization, and other techniques for different parts of workflows. Not everything needs full recursive analysis.

How to Move on This Now

You don't need to wait for perfect commercial implementations to start preparing.

Immediate Actions (This Week)

Audit your current long-document AI workflows. Where are you hitting context limits? Where is accuracy dropping? What's the cost of those accuracy drops?

Identify your three highest-value use cases where current AI fails on long documents. Due diligence? Portfolio analysis? Legal review? Prioritize by potential impact.

Calculate current costs: associate time, AI API costs, cost of errors. Establish baseline metrics for comparison when better tools arrive.

Next 90 Days

Run experiments with available tools. While true RLM implementations aren't commercialized yet, you can test the principles:

Use AI to write code that explores documents
Try recursive prompting strategies manually
Test how much filtering and strategic access helps vs. dumping everything into context

Build relationships with vendors implementing this. Talk to Anthropic, OpenAI, and enterprise AI platforms about their roadmaps for long-context handling. Make clear this is a priority for you.

Train technical teams on recursive patterns. Even without perfect tools, understanding how to break down problems recursively improves AI utilization now.

Strategic Positioning (Next 6 Months)

Factor RLM-style capabilities into AI vendor selection. When evaluating enterprise AI platforms, ask specifically about their approaches to handling documents beyond context windows.

Redesign document processing architecture. Move toward systems where AI can programmatically query document stores rather than having documents fed into prompts. This is valuable even before RLMs.

Prepare data infrastructure. Clean metadata, establish document relationships, create APIs for document access. When RLM tools arrive, you want your data ready.

Consider competitive advantages. If you're in M&A, litigation, investment management, or other information-intensive fields, early adoption of superior document analysis creates a measurable edge.

The Bottom Line

The difference between $10M and $100M decisions often hides in document details. The reference on page 623. The amendment that modified the clause. The cross-portfolio pattern buried across 40 separate filings.

Current AI reads documents like a distracted analyst who skims everything and confidently misses details. It's better than nothing. It's not good enough for decisions that matter.

Recursive Language Models read like a forensic accountant with infinite patience. They systematically explore, recursively examine, programmatically verify. They don't get distracted. They don't get tired. They don't forget what was on page 50 by the time they reach page 500.

This is the difference between AI as a helpful assistant and AI as a genuine intellectual leverage tool.

Your competitors are already testing this. The law firms, investment banks, and family offices with deep technical teams are experimenting with recursive approaches now. The vendors are building commercial implementations.

The question isn't whether recursive approaches to long-context AI will become standard. The research demonstrates they work too well for that to be in doubt.

The question is: how fast do you move?

The firms that implement superior document analysis first will make better decisions faster at lower cost. They'll catch the details others miss. They'll process in weeks what others process in months.

In markets where information advantage equals economic advantage, this matters.

The technology exists. The research is published. The race is on.

Download the original research paper: https://arxiv.org/pdf/2512.24601
