Stop Growing the Context Window. Start Delegating It.

Recursive Language Models flip the long-context race: instead of bigger windows or lossy summaries, models delegate context to sub-LLMs and code.

For two years the long-context race has run in one direction: make the window bigger. 128k became 1M, and the assumption underneath never changed. The model should hold everything in its head at once.

Recursive Language Models (RLM), a direction Prime Intellect has called "the paradigm of 2026," invert that assumption. The bet is that models should stop trying to remember and start managing. Instead of passively holding a giant context or compressing it with summaries, an RLM actively delegates: it hands slices of the problem to sub-LLMs and Python scripts, then works with their outputs.

One caveat up front. This is a bet by one lab, not a benchmarked result. No public evaluation yet shows RLM beating strong long-context baselines head to head. What follows is the shape of the argument and why it maps onto failures agent builders already see, not a verdict.

Summarization is the bug, not the fix

The standard answer to "my agent ran out of context" has been summarization. Compress the history, keep going. It feels like memory management. It is controlled information loss.

Every summary is a one-way door. The details you discard are the details you cannot recover when the task turns out to depend on them. When an agent acts on a stale compressed summary of its own earlier work, it is no longer reasoning over the task. It is reasoning over a lossy paraphrase of the task, and it has no reliable way to notice the difference.

RLM's answer: do not compress the context. Delegate it. Keep the raw material intact somewhere addressable, and spawn a bounded worker, a sub-LLM call or a script, to go read the relevant part and come back with an answer. The parent model never needed the whole corpus in its window. It needed the ability to query it.
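A minimal sketch of that query-instead-of-hold pattern, in Python. Everything in it is an assumption made for illustration: the function names, the fixed-size character chunking, the NONE sentinel, and above all `sub_llm`, which stands in for whatever model call you actually use. It is not taken from any published RLM implementation.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_chars: int) -> List[str]:
    """Cut the raw corpus into fixed-size slices; nothing is summarized or dropped."""
    return [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]


def delegate(
    corpus: str,
    question: str,
    sub_llm: Callable[[str], str],  # hypothetical: any function mapping prompt -> reply
    chunk_chars: int = 2000,
) -> List[str]:
    """Send one narrow prompt per chunk and keep only the replies that found something."""
    findings = []
    for idx, chunk in enumerate(split_into_chunks(corpus, chunk_chars)):
        prompt = (
            f"Question: {question}\n\n"
            f"Excerpt {idx}:\n{chunk}\n\n"
            "Answer from the excerpt only, or reply NONE."
        )
        answer = sub_llm(prompt).strip()
        if answer and answer != "NONE":
            # Tag each finding with its chunk so the parent can go back to the raw text.
            findings.append(f"[chunk {idx}] {answer}")
    return findings


def fake_sub_llm(prompt: str) -> str:
    """Stand-in for a real model call: returns the first excerpt line mentioning 'timeout'."""
    excerpt = prompt.split("Excerpt", 1)[1]
    for line in excerpt.splitlines():
        if "timeout" in line:
            return line.strip()
    return "NONE"
```

The parent only ever sees the short list returned by `delegate`, while the corpus itself stays intact and addressable by chunk index. Swapping `fake_sub_llm` for a real model call is the only change needed to try this against an actual document.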

How the agent loop changes

Concretely, the shift alters the shape of an agent loop. These examples illustrate the mechanism rather than any documented RLM implementation:

  • Old pattern: load a 400-page document into context, or summarize it down to fit, then answer questions from whatever survived compression.
  • RLM pattern: the model writes a script to grep the document, or dispatches a sub-LLM with a narrow prompt over one section, and integrates the returned evidence.
  • Old pattern: a long-horizon coding agent carries its full session history until it degrades, then compacts and hopes.
  • RLM pattern: the agent treats its own history as an external store it can search and selectively expand, paying context cost only for what the current step needs.
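The "history as an external store" pattern can be sketched in a few lines of Python. The `HistoryStore` class and its method names are hypothetical, and plain substring matching stands in for whatever search a real agent would use; the point is only that a search returns cheap previews and the full entry enters the window on request.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class HistoryStore:
    """Hypothetical external store: full entries live here, not in the context window."""
    entries: List[str] = field(default_factory=list)

    def append(self, text: str) -> int:
        """Record a step verbatim and return its id."""
        self.entries.append(text)
        return len(self.entries) - 1

    def search(self, keyword: str, preview_chars: int = 60) -> List[Tuple[int, str]]:
        """Return (id, short preview) for matching entries; cheap to place in context."""
        needle = keyword.lower()
        return [
            (i, entry[:preview_chars])
            for i, entry in enumerate(self.entries)
            if needle in entry.lower()
        ]

    def expand(self, entry_id: int) -> str:
        """Return one entry in full; the agent pays this cost only when a step needs it."""
        return self.entries[entry_id]
```

Because nothing is ever compacted, a detail that looked irrelevant at step 5 is still retrievable at step 50.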

Code is a useful delegation target because scripts are cheap and mostly deterministic, but determinism cuts both ways: a buggy regex will extract the wrong value with perfect consistency, so delegated extraction still needs verification. Sub-LLMs cover the fuzzy retrieval that code cannot, at the cost of reintroducing model error at a smaller scope.
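One way to hedge against a consistently wrong extraction is to make the script refuse to answer when its result looks suspicious. The Python sketch below is an illustrative assumption, not a prescribed method: the `port = N` config format, the pattern, and the choice to return `None` (so the parent can escalate to a sub-LLM or a human) are all invented for the example.

```python
import re
from typing import Optional

# Assumed format for illustration: a line of the form "port = 8080".
PORT_PATTERN = re.compile(r"^port[ \t]*=[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def extract_port(config_text: str) -> Optional[int]:
    """Extract a port number, returning None whenever the result cannot be trusted."""
    matches = PORT_PATTERN.findall(config_text)
    if len(matches) != 1:
        # Zero matches or several: the regex cannot decide, so do not guess.
        return None
    port = int(matches[0])
    if not 1 <= port <= 65535:
        # Matched the pattern but failed a sanity check on the value itself.
        return None
    return port
```

The checks do not prove the regex is right. They only turn some silent failures into visible ones the parent model can route elsewhere.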

Why the bottleneck argument matters

The more provocative implication in the RLM framing is about where progress comes from next. If context management, rather than model size, is the binding constraint on agentic AI, then the leverage sits in architecture. In principle, a mid-sized model with disciplined delegation could sustain longer coherent task horizons than a frontier model drowning in its own unmanaged window. That is a hypothesis, not a measured result; treating it as settled would be the same overreach the caveat above warns against.

Still, the direction of the argument matches what agent builders already observe: failures on long tasks are rarely "the model was not smart enough" and frequently "the model lost track of what it knew." Long-context models keep improving too, so the honest framing is a race between two strategies, not a decided winner.

What to do with this

For agent builders, a few concrete moves follow even before the benchmarks settle:

  • Treat summarization as a last resort, not a memory strategy. Prefer keeping raw context externally addressable and retrieving on demand.
  • Give agents tools to query their own history and materials: search, targeted expansion, script execution. Delegation needs affordances.
  • Budget context per step, not per session. Ask what this step actually needs in the window, not whether the whole task fits in it.
  • Evaluate long-horizon performance separately from benchmark intelligence. If the RLM thesis is right, a model that manages context well should beat a nominally smarter one that does not on tasks that outgrow a single window. Test that on your own workloads rather than assuming it.
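Budgeting per step can be as simple as ranking candidate snippets and packing the best ones until the budget is spent. This Python sketch makes several assumptions for illustration: the relevance scores are supplied by some upstream retrieval step, `approx_tokens` counts whitespace-separated words instead of using a real tokenizer, and greedy packing is one possible policy among many.

```python
from typing import List, Tuple


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_step_context(
    candidates: List[Tuple[float, str]],  # (relevance score, snippet), scores assumed given
    budget_tokens: int,
) -> List[str]:
    """Greedily pick the highest-scoring snippets that still fit this step's budget."""
    chosen: List[str] = []
    used = 0
    for _score, text in sorted(candidates, key=lambda c: c[0], reverse=True):
        cost = approx_tokens(text)
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen
```

Snippets that do not fit are skipped for this step, not discarded; they stay in the external store for a later step that needs them.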

The context window stopped being the interesting number. The interesting question is what the model does when the task will not fit, and RLM's answer, recurse instead of compress, deserves a serious look precisely because it can be tested now, on real agent workloads, without waiting for anyone's paradigm to arrive on schedule.