Agent Memory Is a Billing Problem

Most builders treat agent memory as an organization problem. It is a cost equation: the structure of your memory directly determines your inference bill through prompt-caching mechanics.


Most builders approach agent memory like a file system: organize everything, structure it well, retrieve when needed. That framing will cost you money.

The structure of your agent's memory is a cost equation.

LLM inference has two phases. Pre-fill reads the prompt — compute-bound, expensive. Decoding generates the response — memory-bound. Prompt caching operates on the pre-fill phase. When a provider sees an identical prefix at the top of a request, it serves cached KV tensors instead of recomputing them: a 50–90% discount on those input tokens, plus a significant reduction in time-to-first-token. Anthropic, OpenAI, and Google all support this in production.
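The arithmetic is easy to run yourself. A minimal sketch, using illustrative per-token rates (not any provider's actual pricing) and a 90% cached-token discount:

```python
# Back-of-envelope cost of one request's pre-fill, with and without
# a prefix cache hit. Rates are assumptions for illustration only.
INPUT_RATE = 3.00 / 1_000_000   # $ per fresh input token (assumed)
CACHED_RATE = 0.30 / 1_000_000  # $ per cached input token (assumed 90% off)

def prefill_cost(prompt_tokens: int, cached_prefix_tokens: int) -> float:
    """Cached prefix is billed at the discounted rate; everything after
    the first divergent token is recomputed at full price."""
    fresh = prompt_tokens - cached_prefix_tokens
    return cached_prefix_tokens * CACHED_RATE + fresh * INPUT_RATE

cold = prefill_cost(10_000, 0)       # no cache hit
warm = prefill_cost(10_000, 9_500)   # 9,500-token stable prefix served warm
print(f"cold: ${cold:.4f}  warm: ${warm:.4f}")
```

With these numbers, a warm request costs roughly a seventh of a cold one — before counting the time-to-first-token win.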

The constraint is hard: a single token difference anywhere in the prefix invalidates the cache for everything that follows. One injected timestamp. One dynamic variable at the top. Cache gone.
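The invalidation rule is mechanical enough to model in a few lines. A sketch, treating tokens as strings for simplicity:

```python
def cached_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared token prefix — the only part a provider can
    serve from cache. Everything after the first mismatch is recomputed."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = ["You", "are", "a", "helpful", "agent", "."]

# A timestamp injected at the top makes every request's prefix unique:
req1 = ["[2025-01-01T00:00:00]"] + system
req2 = ["[2025-01-01T00:00:01]"] + system
print(cached_prefix_len(req1, req2))  # 0 — cache miss from the first token

# The same timestamp moved to the bottom preserves the stable prefix:
req3 = system + ["[2025-01-01T00:00:00]"]
req4 = system + ["[2025-01-01T00:00:01]"]
print(cached_prefix_len(req3, req4))  # 6 — full system prompt cached
```

Same content, same token count, opposite bill.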

This makes the structural rule non-negotiable: static content first, variable content last. Not preference. Billing.

What this means for agent memory design

Always-loaded context must come first — and stay identical across requests. System prompts, role definitions, stable capability descriptions. If multiple agents in your fleet share an identical prefix, they share a cache. One pre-fill cost amortized across the whole fleet. If they differ by a token, you pay for each agent, each request, every time.

Variable state belongs at the bottom. Recent memory, cross-agent logs, user queries — anything that changes between requests. Inject it last. Inserting variable data anywhere above your stable layers invalidates the cache for everything that follows it, which usually means invalidating the expensive parts.

Compression is a cost multiplier, not just hygiene. A capped, compressed memory — say, 50 lines of high-density context — caches more efficiently and expires less often than a sprawling 500-line context that accumulates everything. Pruning ephemeral context is not tidiness. It is a structural decision to keep your cache warm. The weekly rollup that collapses daily notes into summaries? That is a billing optimization running as a cron job.
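A rollup of that kind can be very simple. A sketch under stated assumptions: `summarize` is a hypothetical stand-in for whatever compression you actually use (an LLM call, extractive rules, manual curation), and the note format is invented for illustration:

```python
def summarize(notes: list[str]) -> list[str]:
    """Placeholder compressor: keep only lines flagged as decisions or
    open questions. Swap in your real summarization step here."""
    return [n for n in notes if n.startswith(("DECISION:", "OPEN:"))]

def weekly_rollup(daily_notes: list[str], cap: int = 50) -> list[str]:
    """Collapse a week of daily notes into a capped block so the memory
    that ships in every prompt stays small, stable, and cache-friendly."""
    return summarize(daily_notes)[:cap]  # hard cap keeps the prefix bounded

notes = [
    "DECISION: switched vector store to pgvector.",
    "Lunch was good.",
    "OPEN: cache hit rate below 60% on agent-7.",
]
print(weekly_rollup(notes))
```

The cap matters as much as the compression: a bounded memory block changes rarely, so the prefix that contains it stays cacheable.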

The document nobody reads that way

The design document that specifies your agent's memory — what loads always vs. sometimes, what gets summarized vs. deleted, what goes at the top vs. the bottom — is also a cost model. Most teams write it as an information architecture document and never run the numbers.

An agent memory designed for completeness — nothing pruned, context growing, variable data mixed with stable — will be consistently more expensive than one designed around caching: static top, variable bottom, ruthless compression. The delta is not a rounding error. At agent-fleet scale, it is a structural disadvantage that compounds every month.
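To see the scale of that delta, run the numbers under some assumed fleet parameters (all figures below are illustrative, not measured):

```python
# Rough monthly pre-fill cost for a fleet's stable prefix, assuming:
# 20 agents, 1,000 requests/agent/day, an 8,000-token stable prefix,
# and a 90% cached-token discount. Rates are illustrative.
AGENTS, REQS_PER_DAY, DAYS = 20, 1_000, 30
PREFIX_TOKENS = 8_000
RATE, CACHED_RATE = 3.00 / 1_000_000, 0.30 / 1_000_000

requests = AGENTS * REQS_PER_DAY * DAYS
uncached = requests * PREFIX_TOKENS * RATE         # prefix re-billed every request
cached = requests * PREFIX_TOKENS * CACHED_RATE    # shared prefix served warm
print(f"uncached: ${uncached:,.0f}/mo  cached: ${cached:,.0f}/mo")
```

Under these assumptions the same prefix costs $14,400 a month cold and $1,440 warm — the kind of gap that compounds as the fleet grows.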

Builders who internalize this treat memory architecture as infrastructure cost engineering from the start. Everyone else pays the difference on their next billing cycle and calls it the cost of running agents.