Prompt caching explained: when repeated context becomes cheaper
If you send the same long system prompt or document prefix across multiple API calls, you are paying to re-process the same tokens on every request. Prompt caching lets the provider store that prefix on their side so subsequent calls only pay for the new tokens — plus a write cost for the cache itself.
The short answer is: prompt caching cuts input costs by 50–90% for workloads that reuse the same prompt prefix — support bots with long system instructions, document analysis with the same source text, or multi-turn conversations with stable context. But it only works if the shared prefix is long enough to justify the write overhead.
If you only remember one thing: prompt caching is useful for repeated workloads, not for one-shot queries. The cache write step costs more than a standard read. You need at least a few reuses to see savings.
Editor’s Note: Providers advertise cache discounts on their pricing pages, but the write-to-read ratio matters. If you only call the cached prefix twice, the savings may be negligible. If you call it hundreds of times, the savings add up fast.
Editor’s Note: Prompt caching is invisible to the model — it does not change output quality. It only changes how the provider processes the request on the backend. Do not change your prompt strategy to “optimise for caching” if it hurts output quality.
Quick answer {#quick-answer}
- What it saves: input token processing. The provider can skip re-encoding the cached prefix and charge a lower rate for the cached portion.
- What it costs: the first write to cache and occasional invalidation.
- When it works: repeated system prompts, repeated document prefixes, recurring few-shot examples, and multi-turn conversations where the first turn contains most of the context.
- When it does not: short one-shot queries, random user inputs, or prompts that change significantly between calls.
Typical discount structure (provider-dependent): cached input tokens are 50–90% cheaper than fresh input tokens. Output tokens are not affected.
How it works, provider by provider {#how-it-works-provider-by-provider}
OpenAI prompt caching {#openai-prompt-caching}
OpenAI automatically caches prompt prefixes for models that support it (GPT-4.1 family and GPT-5 series). The cache starts from the beginning of the prompt. If the first 1,000 tokens are identical across requests, those tokens are served from cache.
- Cache write: $1.25/M tokens (vs $2.00/M fresh input)
- Cache read: $0.3125/M tokens
- Minimum: automatic; no special API parameter needed
OpenAI’s cache is transparent — you see “cached” on the usage response. Cache invalidation happens after 5–10 minutes of inactivity or if the prompt prefix changes.
Anthropic prompt caching {#anthropic-prompt-caching}
Anthropic requires explicit cache control markers. You specify which part of the prompt should be cached using a cache_control parameter in the messages array.
- Cache write: $1.25/M tokens (vs $3.00/M fresh input for Sonnet 4.6)
- Cache read: $0.30/M tokens
- Break-even: roughly 3–5 reuses after the initial write
Anthropic’s cache is more flexible — you can cache a section in the middle of the prompt, not just the prefix. Useful for shared document context with varying user questions.
Google prompt caching {#google-prompt-caching}
Google’s Gemini API caches prompts that exceed 32,000 tokens. The cache is contextual — it stores the prompt plus the model context for reuse.
- Cache write: 50% of fresh input rate
- Cache read: 25% of fresh input rate
- Durations: 1-hour default; extendable
Mistral prompt caching {#mistral-prompt-caching}
Mistral automatically caches prompts. There is no explicit parameter or cache control — it applies to repeated prefixes within a short time window.
- Discount: approximately 40–60% on cached portions
- Transparent: appears in usage statistics
When caching does not pay off {#when-caching-does-not-pay-off}
Prompt caching has a break-even point. If you send a 5,000-token prefix to exactly 3 users and never again, the cache write cost on the first request may cancel the savings on requests 2 and 3.
The break-even is workload-dependent, but a rough rule of thumb: if you reuse the same prefix fewer than 5 times within a short window, the cache is unlikely to save you money.
Other cases where caching does not help:
- Random or user-specific prompts that change every call.
- Short prompts under 1,000 tokens (the cache overhead may exceed the processing cost).
- Infrequent usage where the cache expires between calls.
Worked example: support bot with shared system prompt {#worked-example-support-bot-with-shared-system-prompt}
Scenario: A customer support bot with a 3,000-token system prompt serving 5,000 conversations/month. Each conversation averages 3 turns. The system prompt is identical for every call.
Without caching: 3,000 tokens × 15,000 calls × $3.00/M = $135/month in input processing.
With caching: first call per session writes the cache at $1.25/M = $0.004/call × 5,000 sessions = $20/month. Subsequent calls read from cache at $0.30/M = $0.001/call × 10,000 calls = $10/month. Total: $30/month.
Savings: approximately 78%.
What this page cannot tell you {#what-this-page-cannot-tell-you}
This page cannot predict your exact cache hit rate. Cache hit ratios depend on prompt stability, traffic patterns and provider timeout behaviour. If your prompts vary heavily between users or sessions, the savings will be lower than the theoretical maximum.
Methodology and sources {#methodology-and-sources}
Check date: 2026-05-25
What was checked: Current provider documentation and pricing pages for OpenAI, Anthropic, Google and Mistral prompt-caching features.
Worked-example assumptions: Anthropic Sonnet 4.6 pricing used. Cache hit rate assumed at 100% for the second and third calls in each session. Real-world rates will be lower.
Assumptions and limits:
- Cache pricing changes over time.
- Cache duration limits vary by provider.
- Some models within each provider’s range do not support caching.
- Real-world cache hit rates depend on prompt stability and request patterns.
Source list {#source-list}
- OpenAI — prompt caching documentation — https://platform.openai.com/docs/guides/prompt-caching
- Anthropic — prompt caching guide — https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- Google Gemini — caching API — https://ai.google.dev/gemini-api/docs/caching
- Mistral — platform docs — https://docs.mistral.ai/api/
Related guides {#related-guides}
- Prompt length, output length and why AI bills surprise teams
- API model pricing: input, output, cache and batch costs
- Caching AI answers: when it is safe, risky or pointless
- Batch APIs for LLMs: cheaper, slower and often underused
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.