How much can prompt caching actually save?

On the cached portion of a prompt, savings run roughly 90% with Anthropic, 50% with OpenAI's automatic caching, and up to 90% with Gemini depending on configuration. Real-world blended savings depend heavily on what share of your prompt is genuinely static versus dynamic per request.

Why isn't my prompt caching working even though I enabled it?

The most common cause is a byte-level change near the top of the prompt on every request — a timestamp, session ID, or reordered field — which invalidates the cache match. Restructure so all static content comes first and dynamic content comes last.

Is semantic caching safe for accuracy-sensitive workflows?

Use it cautiously. Semantic caching returns a cached answer for a semantically similar but not identical query, which works well for bounded FAQ-style intents but risks confidently wrong answers on open-ended or nuanced questions. Set a conservative similarity threshold and avoid it for high-stakes reasoning tasks.

Does model routing hurt prompt caching?

Yes, if not designed for it. Routing a task to a different model provider each time prevents cache reuse, since each provider maintains its own cache. Platforms need to keep prompt structure consistent per agent even while routing the variable parts of a task across models.

Token Caching Strategies That Actually Cut LLM

Prompt caching is the single biggest lever available for cutting LLM API costs, and most teams that have "enabled" it are still paying full price for most of their traffic. The technology is not the hard part — every major provider supports it now. The hard part is prompt hygiene: structuring requests so the same prefix actually repeats byte-for-byte, request after request.

Here is the mechanism. Providers cache the attention-layer key-value state for a stable prompt prefix. When a later request arrives with an identical prefix, the model skips the expensive prefill computation for those tokens and only processes the new suffix. Anthropic prices cached tokens at roughly 0.1x of standard input cost — a 90% reduction — OpenAI caches automatically at 0.5x, and Gemini runs 0.1-0.25x depending on configuration. On a workflow that sends the same 8,000-token system prompt and document context with every message, that difference is the gap between $24 and $0.30 per million repeated tokens.

Why most implementations leak savings

The cache only hits on an exact prefix match. That sounds simple until you look at how most teams actually construct prompts: dynamic content — a timestamp, a session ID, a user's name — gets interpolated near the top of the system prompt, which invalidates the cache on every single request. The fix has nothing to do with the provider's API and everything to do with prompt structure.

Static-first ordering. Put everything that never changes — system instructions, tool definitions, few-shot examples, reference documents — at the very top of the prompt, in a fixed order. Dynamic content — the current user message, retrieved context specific to this turn, timestamps — goes at the end, after the cache breakpoint.

Byte-identical prefixes. A single extra space, a reordered JSON key, or a regenerated UUID in what should be a static block breaks the match. Serialize static content once, store it as a constant, and never regenerate it per request.

Explicit breakpoints where the provider requires them. Anthropic's automatic multi-turn caching now manages breakpoints with a single top-level cache_control field for most conversational patterns, which removes a source of manual error. Older integrations that still hand-place breakpoints are more fragile — audit them if you have not touched the caching config since early 2026.

Expert tip: log your cache hit rate per endpoint, not just per model. A single endpoint that regenerates its system prompt on every call — often because it embeds a live config value near the top — can single-handedly drag down the blended savings across an otherwise well-cached workload.

One more practical note: caching interacts with retries. A failed call that gets retried with a slightly different prompt (a regenerated error-handling instruction, for example) will not hit the cache even if the original attempt did, silently doubling cost on your failure path. Keep retry logic reusing the exact same cached prefix as the original call.

Beyond prefix caching: semantic caching

Prefix caching only helps when the same request structure repeats. Semantic caching goes further by recognizing when two different requests mean the same thing — "what's the weather?" and "tell me today's temperature" hitting the same cached response — which can cut LLM calls entirely for high-repetition query patterns, not just reduce their input cost. This works best for support and FAQ-style agents with a bounded set of common intents, and worst for open-ended reasoning tasks where near-duplicate phrasing rarely maps to a genuinely identical answer. Set your similarity threshold conservatively; an overly aggressive semantic cache returns confidently wrong answers to subtly different questions.

What this looks like in a governed agent runtime

Caching strategy interacts directly with model routing, and this is where a lot of teams underinvest. If your platform routes a task to whichever model is cheapest for that step, the cache almost never warms up, because the prefix hits a different provider's cache (or none) each time. AgentWorks structures its own agent runtime around this: prompt templates and tool definitions are treated as static blocks per agent, so repeated runs of the same pre-built agent hit cache consistently even as the AUTO router selects between Claude, GPT-5, Gemini, and Mistral for the variable parts of the task.

Per-run cost visibility matters here too — without it, caching gains are invisible to whoever owns the budget. AgentWorks bills tokens at cost plus 10% from a euro wallet with per-run cost shown, so a drop in cache-miss rate shows up directly as a lower bill, not as a line item buried in a monthly cloud invoice three teams away.

Getting started

Instrument cache hit rate per endpoint before changing anything — you cannot fix what you have not measured.
Restructure prompts so static content (instructions, tools, examples) sits first, and dynamic content (user input, live context) sits last.
Freeze the serialization of static blocks — generate them once as constants, not per request.
Evaluate semantic caching only for bounded, high-repetition query patterns like FAQ or support triage, with a conservative similarity threshold.

Caching is not a one-time setting — it degrades silently as prompts evolve and new dynamic fields get added near the top of a template. Re-audit hit rates quarterly, especially after any prompt or tool-definition change. For how AgentWorks handles model routing alongside caching, see our AI agent platform overview.

See how it works in practice. Book a 15-minute platform walkthrough at agent-works.ai/contact.

Token Caching Strategies That Actually Cut Costs

Why most implementations leak savings

Beyond prefix caching: semantic caching

What this looks like in a governed agent runtime

Getting started

About the author

How to Reduce AI Hallucinations with Cited Answers

GPT-5 vs Claude vs Gemini: Picking the Right Model

Connect Notion & Confluence to Your AI Agents

Why most implementations leak savings

Beyond prefix caching: semantic caching

What this looks like in a governed agent runtime

Getting started

About the author

Related articles

How to Reduce AI Hallucinations with Cited Answers

GPT-5 vs Claude vs Gemini: Picking the Right Model

Connect Notion & Confluence to Your AI Agents