Stop Using Claude as a Search Engine

Claude burns expensive input tokens when you use it for full-text search. Put the corpus in NotebookLM, send Claude only retrieved answers with citations...

5 min read

The inversion

Most token-saving advice focuses on prompt caching. The bigger saving is simpler: the cheapest token is the one that never enters Claude's context.

If you paste a 50k-character log or dozens of PDFs into Claude so you can ask questions, you are paying a reasoning model to do full-text search. That is usually the wrong tool.

A reported research session using this pattern involved a 47-paper SLAM corpus. Keeping the papers inside Claude consumed enough quota that two or three similar sessions per week could exhaust a Claude Max plan. Moving the corpus out of Claude changed the cost profile.

The architecture is:

  • NotebookLM stores the source corpus and retrieves relevant passages with citations.
  • Claude receives the retrieved answer, then reasons, writes, codes, compares, or orchestrates the next step.
  • You decide when the retrieved evidence is enough and when to ask a sharper question.

NotebookLM is not free compute in any permanent sense. Pricing, limits, and product behavior can change. The useful point is architectural: retrieval tokens do not need to be Claude tokens.

Why prompt caching is not enough

The usual counterargument is: "Just enable prompt caching."

That helps when the same large prefix is reused quickly. It helps much less when research happens in bursts: ask, read, switch tabs, think, come back later. In the reported workflow, Anthropic's default prompt-cache TTL was treated as one hour. Once the cache expired, the next call paid to write the corpus again.

Caching is good for rapid follow-ups. It is a weak fit for reading a large corpus across an afternoon, a week, or a semester.

The numbers

In the reported 5-round Opus 4.7 run, the corpus was roughly 500k tokens, or about 384k words by NotebookLM's count.

The cost comparison was stark:

  • Papers in NotebookLM, Claude reasons over retrieved answers: about $0.55.
  • Papers in Claude prompt, single-session cache reuse: about $9.59.
  • Papers in Claude prompt, cache misses across sessions: much worse, cited as about 86x.

The important mechanism was cache_creation. In the NotebookLM-fronted run, Claude wrote only 17,379 tokens to cache because it saw the retrieved teacher answers, roughly 3k to 6k tokens each, plus small system deltas. The 47 papers never entered Claude's cache.

Treat the ratios as session logs, not benchmarks. The scaling argument is still strong: putting the corpus in the prompt grows Claude input roughly with corpus size. Putting the corpus behind retrieval keeps Claude's visible context closer to the size of the answer.

The division of labor

Use NotebookLM as a read-only consultation layer. Load the papers, prospectus, notes, or exported knowledge base once. Ask it questions that should be answered only from those sources. Keep its answers citation-bound.

Use Claude as the working agent. It can turn those retrieved answers into code, outlines, comparisons, scripts, decision memos, or follow-up questions. When it needs more domain evidence, it asks NotebookLM again instead of ingesting the full corpus.

A practical rule: do not write notes, code, or experiment results back into the notebook. Keep the corpus static. NotebookLM is for retrieval, not project memory.

When this is not worth it

Do not add this layer by default. It is probably unnecessary when:

  • The corpus is under roughly 5k tokens and you will query it once or twice.
  • You only need Q&A and no surrounding workflow. NotebookLM's web UI may be enough.
  • Latency matters more than token cost. The reported NotebookLM calls took about 16 to 48 seconds, often around 45 seconds, compared with about 20 to 35 seconds for direct Claude queries in that user's setup.
  • The task is code navigation, jump-to-definition, or dependency tracing. Text RAG is a poor substitute for AST-aware tools.

Workloads that fit

The pattern works best when the same private corpus is queried repeatedly.

  1. Research and coursework. A semester's worth of papers can live in one notebook. Claude can ask cross-paper questions, then synthesize the cited answers into study notes, research plans, or implementation tasks.

  2. Prospectus and filing review. Long financial documents fit the same model: one notebook per company or filing, a reusable question set, then Claude composes a comparison table or memo from retrieved answers. The reported workflow claimed a large time reduction, but the safer conclusion is narrower: repeated structured questions over long documents are a good fit.

  3. Personal knowledge bases. An Obsidian or Notion export plus highlights can support questions like "how has my view on this topic changed over time?" Keyword search is bad at that. Retrieval plus reasoning is better.

Setup notes

Google does not provide an official NotebookLM API. The reported setup used a third-party client, notebooklm-client by @icebear0828, and a Claude Code skill called /notecraft.

That bridge matters. Claude is not magically connected to NotebookLM. Either you manually copy NotebookLM answers into Claude, or you use an unofficial client that logs into NotebookLM, sends questions, receives cited answers, and passes those answers back into the Claude workflow.

Two caveats are important:

  • storage_state.json contains a live Google session. Treat it like a credential.
  • The client is reverse-engineered. Google can change the backend and break it without warning.

If those risks are unacceptable, use the manual version of the pattern or wait for an official API.

Bottom line

Do not pay a reasoning model to read a large static corpus every time you ask a question. Put retrieval behind a cheaper or more specialized layer, then spend Claude tokens on the part Claude is good at: reasoning over the retrieved evidence and doing the work around it.

For research-shaped workloads, the reported trade was slower responses in exchange for much lower expensive-tier token spend. Whether that is worth it depends on your constraint: time, money, latency, or reliability.