← All insights
TechnicalMay 26, 20265 min read

Token Caching Strategies That Actually Work in Production

Share
Article cover placeholder

TL;DR

Prompt caching patterns that deliver 30-80% input token cost savings: the structural rules that get cache hits, the common failures that kill hit rates, provider-specific patterns, and the metrics that prove caching is working.

Token Caching Strategies That Actually Work in Production

Prompt caching at the model provider level (OpenAI, Anthropic, Google) is one of the highest-impact cost optimisations in modern AI agent systems. For workloads with significant stable prefix content — system prompts, retrieved context, conversation history — caching reduces input token cost 60-80% on the cached portion and reduces latency by 200-500ms.

The catch is that caching only works on workloads that are structured for it. Most agents are not structured for it out of the box. This is the pattern that actually delivers the savings in production.

What prompt caching does

When the model provider sees a prompt whose prefix matches a recent prompt from the same caller, it can reuse the cached representation of the prefix rather than processing the tokens from scratch. The savings:

  • Cost: input tokens in the cached prefix are billed at a lower rate (typically 10-25% of normal input token cost, depending on provider)
  • Latency: the prefix does not need to be processed, saving roughly the prefix tokens' worth of input-processing time
  • Cache hit window: typically 5-60 minutes depending on provider and prompt size; longer cache windows are available on some tiers

The savings apply to the cached portion only. The variable suffix is processed normally.

Where caching matters

Caching matters for workloads where:

  • The system prompt is substantial (hundreds to thousands of tokens) and stable
  • The agent processes a stream of similar requests with shared context (customer support tickets, document analyses, repeated queries against the same knowledge base)
  • The same conversation history flows through multiple turns
  • RAG context is reused across queries on similar topics

Caching does not matter for:

  • One-off prompts with no significant stable prefix
  • Workloads where prompts vary entirely across calls
  • Very short prompts where the prefix is too small to matter

For typical enterprise AI workloads, 40-70% of calls benefit from caching meaningfully. Workloads dominated by long stable prompts (knowledge-base-grounded customer support, document Q&A) see 60-80%+ cost reduction overall.

The structural pattern that works

To get cache hits, structure prompts so that:

  1. Stable content is at the front: system prompt, instructions, retrieved context, prior conversation history, fixed examples — all in the first portion of the prompt
  2. Variable content is at the end: the current user query, current request, current data — at the end of the prompt
  3. The stable portion is identical byte-for-byte across calls that should share the cache
  4. Cache markers are placed correctly at the boundary between stable and variable content

The third point is critical and most teams miss it. A timestamp in the system prompt, a per-call session ID anywhere in the prefix, or any drift in formatting breaks the cache.

Common failures that kill cache hit rates

Timestamps in the prompt: "Current time: 2026-05-26T14:23:45Z" in the system prompt makes every prompt unique. Move timestamps to the variable portion or remove them if not strictly needed.

Per-call session IDs in the prefix: session metadata in the system prompt makes each session's prefix unique. Either move session context to the variable portion or use separate caches per session.

Whitespace and formatting drift: trailing whitespace, different newline conventions, JSON serialisation that re-orders keys differently. The cached representation is byte-sensitive.

Retrieval context order changes: top-k retrieved chunks where the order changes per query (e.g., due to slight scoring differences) breaks caching even when the chunks are similar.

Per-user personalisation in the prefix: "User name: Alice" in the system prompt creates a cache per user. For high-user-count workloads this is sometimes acceptable; for shared agents it is wasteful.

The shape of an enterprise prompt that caches well

[SYSTEM PROMPT — stable, byte-identical across calls]
You are an AI assistant for ACME Corp's customer service.
Your role is to help customers resolve their issues quickly...
[detailed instructions, examples, output format requirements]

[RETRIEVED CONTEXT — varies per query but cached per retrieval-set]
[Document 1 content]
[Document 2 content]
[Document 3 content]

[CONVERSATION HISTORY — cached per session for multi-turn workflows]
User: I have a question about my order
Assistant: I'd be happy to help. What is your order number?
User: 12345
Assistant: Looking that up now...

[CURRENT QUERY — variable, not cached]
User: When will it arrive?

The first portion (system prompt) caches across all calls. The retrieved context caches when the same retrieval set is reused. The conversation history caches across turns of the same session. The current query is the only part that always varies.

This is the structure that typical platform-grade frameworks build by default if you configure them correctly. AgentWorks structures prompts this way for native cache support.

Provider-specific patterns

Anthropic (Claude): explicit cache control markers in the prompt. You mark cache breakpoints, telling the API "everything up to here can be cached." Up to 4 cache breakpoints per prompt. Cache duration: 5 minutes refreshing on hit, with extended caching for paid tiers.

OpenAI: automatic prompt caching when the prefix matches a recent prompt. No explicit cache control. Cache duration is short by default.

Google (Gemini): explicit context caching API. You upload context, get a cache reference, then make calls with the reference. Different model than the other two; works well for very long stable contexts (codebases, document collections).

The right structure varies slightly per provider. The platform should abstract the differences so the agent definition is provider-agnostic.

Measuring cache effectiveness

The metrics that matter:

  • Cache hit rate: proportion of input tokens that hit the cache, by agent and by workflow
  • Effective input cost per call: actual cost after caching, compared to nominal full-price cost
  • Latency improvement: time-to-first-token before and after caching is enabled

For each agent, baseline these metrics, enable caching, re-measure. If the cache hit rate is below 30-40% on a workload that should benefit, something in your prompt structure is breaking the cache.

When caching can hurt

The honest trade-offs:

  • Stale context risk: if cached context becomes stale (e.g., a price changes but the cached prompt still has the old price), the agent uses outdated information. Cache expiry helps; explicit invalidation on known-stale events helps more.

  • Cache poisoning risk: in shared multi-tenant scenarios, ensuring tenant isolation in cached prefixes is critical. Cache keys must include tenant context to prevent cross-tenant leakage.

  • Cost of misses on small prompts: caching overhead is small but non-zero. For very short prompts the overhead can exceed the savings. Cache selectively, not universally.

What AgentWorks does

The platform structures prompts for caching by default: system prompts in the stable position, retrieved context cached per retrieval set, conversation history cached per session, current input in the variable position. The provider-specific cache control is abstracted away so the agent developer does not need to think about it per provider.

Cache hit rates and savings are visible in the per-agent observability dashboards. For agents where caching is not delivering expected savings, the tooling shows where the cache misses are coming from.

The result for typical enterprise workloads: 30-60% input token cost savings across the agent estate, with the largest savings on the agents with the longest stable prefixes (customer support, knowledge-grounded Q&A, document processing).

About the author

· Founder, AgentWorks

Erwin Berkouwer is the founder of AgentWorks — an AI agent platform purpose-built for European teams that need EU AI Act-ready governance, multi-LLM choice across OpenAI, Anthropic, Google and Mistral, and transparent per-token € pricing.

Read more about Erwin