API model pricing: input, output, cache and batch costs
If you are comparing model APIs, the line item that looks cheapest on a pricing page is often only half the story. The real bill usually comes from four things:
- how much text you send in;
- how much text the model sends back;
- whether some of the input can be cached and reused; and
- whether the work can wait for batch pricing.
The safest short answer is this: compare the full workload, not just the input token rate. A model with a lower headline price can still cost more once output gets verbose, cache reuse is too low to pay back, or the job could have used a batch route instead.
Summary
For most teams, the right question is not “what is the cheapest model?” It is “what is the cheapest way to finish this job well enough?”
That means you should separate prompt length, output length, cache-hit rate and batchability before you compare models. It also means you should be wary of any quote that assumes one neat average prompt and a tidy response. Real workloads are messier than that.
A useful rule of thumb:
- if the prompt is reused, cache can help;
- if the task can wait, batch can help;
- if the output is long, output pricing may dominate;
- if the task is short and interactive, latency usually matters more than tiny token savings.
What API model pricing usually includes
Most API pricing pages break the bill into separate buckets. The labels differ by provider, but the shape is similar.
| Pricing bucket | What it means | Why it matters |
|---|---|---|
| Input tokens | The text you send to the model | Long prompts, retrieved documents and tool outputs all land here |
| Output tokens | The text the model generates | A cheap input rate can still become a large bill if output is verbose |
| Cached input | Reused prompt text billed at a lower rate | Helps when the same instructions or context are sent repeatedly |
| Batch pricing | Lower price for non-urgent jobs | Useful when latency does not matter |
| Context thresholds | Some models change price above a token limit | A long prompt can move you into a different pricing band |
The key move is to stop treating pricing like a single number. A model is not just an input rate; it is a bundle of billing rules.
Current pricing snapshot
The figures below were checked directly against provider pricing docs on 2026-05-22. They are a snapshot for that date, and provider prices change often.
| Provider / model | Input price | Cached input | Output price | Batch note |
|---|---|---|---|---|
| Anthropic Claude Sonnet 4.5 | $3 / MTok standard input | $3.75 / MTok 5m write; $6 / MTok 1h write; $0.30 / MTok cache hits | $15 / MTok | Anthropic notes that cache pricing can stack with the Batch API discount, but the pricing page excerpt used here does not give a single per-model batch price |
| Google Gemini 2.5 Pro | $1.25 / 1M tokens up to 200K input, then $2.50 / 1M | $0.13 / 1M cached input up to 200K, then $0.25 / 1M | $10 / 1M tokens up to 200K, then $15 / 1M | Gemini models are available in batch mode at a 50% discount, according to a note on the provider pricing page |
| Google Gemini 2.0 Flash | $0.15 / 1M input tokens | Not shown in the batch table excerpt | $0.60 / 1M output tokens | $0.075 / 1M input tokens and $0.30 / 1M output tokens with Batch API |
Two things to notice here:
- Google’s pricing page does not just show one price for everything. Some models have a token threshold at 200K tokens, and the price changes above that line.
- Anthropic’s cache pricing is not a simple “free discount”. The write step costs more than fresh input, so reuse has to be real before caching pays back.
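To make the threshold effect concrete, here is a minimal Python sketch using the Gemini 2.5 Pro figures from the table. One assumption is labelled loudly: the sketch applies the higher band to the whole request once the prompt crosses 200K tokens, rather than only to the excess tokens. Check the provider page for the exact rule before relying on it. The function name `tiered_request_cost` is illustrative only.

```python
# Illustrative figures from the snapshot table (USD per 1M tokens).
THRESHOLD_TOKENS = 200_000
INPUT_RATE_LOW, INPUT_RATE_HIGH = 1.25, 2.50
OUTPUT_RATE_LOW, OUTPUT_RATE_HIGH = 10.00, 15.00


def tiered_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD under a two-band price list.

    ASSUMPTION: once the prompt exceeds the threshold, the higher rates
    apply to the whole request, not just to the tokens above the line.
    """
    over_threshold = input_tokens > THRESHOLD_TOKENS
    input_rate = INPUT_RATE_HIGH if over_threshold else INPUT_RATE_LOW
    output_rate = OUTPUT_RATE_HIGH if over_threshold else OUTPUT_RATE_LOW
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```

Under that assumption, a 200,000-token prompt with 1,000 output tokens comes to about $0.26, while adding a single input token pushes the same request to roughly $0.515.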
Input vs output costs
A lot of people start by comparing input rates because they are easy to read. That is fine for a first pass, but it is not enough.
If a task has small prompts and short replies, input may dominate. If the task produces long summaries, structured JSON, explanations, or tool-heavy follow-ups, output can become the bigger line item.
That is why a model that looks cheap on paper can still cost more overall:
- the prompt may be longer than you expected;
- the model may answer more verbosely than a cheaper alternative;
- retries can multiply the total output;
- tool calls can add more context to the next turn.
The point is not that output is always the main cost. The point is that you cannot know until you estimate both halves of the exchange.
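A rough way to estimate both halves is to price them separately and look at the split. The sketch below is a minimal illustration, not a billing tool. The `attempts` parameter is a hypothetical knob that assumes every retry resends the full prompt and regenerates the full reply.

```python
def exchange_cost(input_tokens, output_tokens, input_rate, output_rate, attempts=1):
    """Return (input_cost, output_cost) in USD for one task.

    Rates are USD per 1M tokens. ASSUMPTION: each retry resends the whole
    prompt and regenerates the whole reply, so both halves scale together.
    """
    input_cost = attempts * input_tokens * input_rate / 1_000_000
    output_cost = attempts * output_tokens * output_rate / 1_000_000
    return input_cost, output_cost


def output_share(input_tokens, output_tokens, input_rate, output_rate):
    """Fraction of the bill that comes from output tokens."""
    input_cost, output_cost = exchange_cost(
        input_tokens, output_tokens, input_rate, output_rate
    )
    return output_cost / (input_cost + output_cost)
```

At $3 input and $15 output per million tokens, a 2,000-token prompt with a 1,000-token reply puts about 71% of the cost on output; a 20,000-token prompt with a 500-token reply puts about 11% there.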
Cached input and why it matters
Cached input is useful when the same instructions, retrieval bundle, or reference text is sent again and again.
Anthropic’s pricing page is a good reminder that cache is not magic. The page says:
- 5-minute cache writes cost 1.25x standard input;
- 1-hour cache writes cost 2x standard input;
- cache hits cost 10% of standard input;
- the 5-minute cache pays back after one cache read;
- the 1-hour cache pays back after two cache reads.
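Those last two lines are the ones to keep.
In plain English: if a cached prompt is never read back before it expires, you have paid the write premium for nothing, and with the 1-hour cache a single read still does not cover it. If the same context is reused several times, cache becomes a real lever.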
A small payback formula
If x is the standard input cost for the repeated text:
- sent fresh twice: 2x
- 5-minute cache, one write plus one read: 1.25x + 0.1x = 1.35x, against 2x fresh
- sent fresh three times: 3x
- 1-hour cache, one write plus two reads: 2x + 0.2x = 2.2x, against 3x fresh
So the question is not “does cache always save money?” It is “how many times will I reuse the same text?”
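The payback arithmetic can be sketched as a small break-even search. The multipliers are the ones quoted from the pricing page (1.25 or 2 for a write, 0.1 for a hit). The function names are illustrative, and costs are expressed in units of x rather than dollars.

```python
def cached_cost(reads: int, write_multiplier: float, read_multiplier: float = 0.1) -> float:
    """Cost of one cache write plus `reads` cache hits, in units of x."""
    return write_multiplier + reads * read_multiplier


def fresh_cost(reads: int) -> float:
    """Cost of sending the text fresh every time: the first send plus `reads` repeats."""
    return 1.0 + reads


def break_even_reads(write_multiplier: float, read_multiplier: float = 0.1) -> int:
    """Smallest number of cache reads at which caching beats sending fresh."""
    if read_multiplier >= 1.0:
        raise ValueError("a cache hit must cost less than fresh input to ever pay back")
    reads = 0
    while cached_cost(reads, write_multiplier, read_multiplier) >= fresh_cost(reads):
        reads += 1
    return reads
```

With these inputs, `break_even_reads(1.25)` returns 1 and `break_even_reads(2.0)` returns 2, matching the payback guidance quoted above.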
Batch pricing and when it helps
Batch pricing is for jobs that do not need an immediate answer. That usually means overnight enrichment, backfills, bulk classification, large-scale extraction or cleanup work.
Google’s pricing page states that Gemini models are available in batch mode at a 50% discount. The same page also shows an explicit Gemini 2.0 Flash batch table where input falls from $0.15 to $0.075 per 1M tokens and output falls from $0.60 to $0.30 per 1M tokens.
That is the practical trade-off:
- if you need low latency, batch is probably the wrong shape;
- if you need throughput and can wait, batch can cut cost without changing the task itself.
Batch is not a clever way to hide bad prompt design. It is just a cheaper lane for work that does not need to be real-time.
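As a quick illustration of the size of the effect, the sketch below prices a hypothetical backfill job at the Gemini 2.0 Flash rates quoted above, with and without batch. The job size (10M input tokens, 2M output tokens) is invented for the example, and the sketch assumes the discount applies equally to input and output.

```python
def job_cost(input_tokens, output_tokens, input_rate, output_rate,
             batch=False, batch_discount=0.5):
    """Estimate a job's cost in USD; rates are USD per 1M tokens.

    ASSUMPTION: the batch discount applies equally to input and output,
    as in the Gemini 2.0 Flash figures quoted in the text.
    """
    multiplier = (1 - batch_discount) if batch else 1.0
    subtotal = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return subtotal * multiplier


# Hypothetical backfill: 10M input tokens and 2M output tokens.
interactive = job_cost(10_000_000, 2_000_000, 0.15, 0.60)
batched = job_cost(10_000_000, 2_000_000, 0.15, 0.60, batch=True)
print(f"interactive ${interactive:.2f}, batch ${batched:.2f}")
```

For this made-up job the estimate is about $2.70 interactive against about $1.35 in batch. The task and the prompt are unchanged; only the delivery lane differs.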
Why cheap per token can still cost more
The cheapest-looking path can still lose once the workload is realistic.
Here is the common failure mode:
- the team compares only the headline input price;
- the prompt is too long or too repeated;
- output gets verbose or retries climb;
- the final bill is higher than expected.
That is why the better comparison is a full workload estimate.
Worked example: two runs with a reusable prompt
Assumptions:
- provider: Anthropic Claude Sonnet 4.5;
- 50,000 tokens are stable instructions/reference text;
- 20,000 tokens are unique to each request;
- each run returns 8,000 output tokens;
- we compare two fresh runs against one 5-minute cache write, made as a separate priming step, followed by two runs that each read the cache.
| Scenario | Calculation | Cost |
|---|---|---|
| Two fresh runs | 2 × ((50,000 + 20,000) × $3 / 1M) + 2 × (8,000 × $15 / 1M) | $0.6600 |
| 5-minute cache | 50,000 × $3.75 / 1M + 2 × (50,000 × $0.30 / 1M + 20,000 × $3 / 1M + 8,000 × $15 / 1M) | $0.5775 |
| Difference | Fresh minus cache | $0.0825 saved |
The point of the example is not that Claude Sonnet 4.5 is always the right model. The point is that cache only starts to matter when the same prompt text comes back enough times to pay for the write step.
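The table's arithmetic can be reproduced with a few lines of Python, which makes it easy to swap in your own token counts. The sketch follows the calculation column exactly: one cache write charged on its own, then two runs that each read the cache. The constant and function names are illustrative.

```python
MTOK = 1_000_000

# Claude Sonnet 4.5 figures from the snapshot table (USD per MTok).
INPUT_RATE = 3.00
CACHE_WRITE_RATE = 3.75  # 5-minute cache write
CACHE_READ_RATE = 0.30   # cache hit
OUTPUT_RATE = 15.00

# Workload assumptions from the example.
STABLE_TOKENS = 50_000
UNIQUE_TOKENS = 20_000
OUTPUT_TOKENS = 8_000


def fresh_total(runs: int = 2) -> float:
    """Every run sends the stable and unique text at the standard input rate."""
    per_run = ((STABLE_TOKENS + UNIQUE_TOKENS) * INPUT_RATE
               + OUTPUT_TOKENS * OUTPUT_RATE) / MTOK
    return runs * per_run


def cached_total(runs: int = 2) -> float:
    """One cache write for the stable text, then `runs` runs that read it."""
    write = STABLE_TOKENS * CACHE_WRITE_RATE / MTOK
    per_run = (STABLE_TOKENS * CACHE_READ_RATE
               + UNIQUE_TOKENS * INPUT_RATE
               + OUTPUT_TOKENS * OUTPUT_RATE) / MTOK
    return write + runs * per_run


if __name__ == "__main__":
    print(f"fresh:  ${fresh_total():.4f}")
    print(f"cached: ${cached_total():.4f}")
    print(f"saved:  ${fresh_total() - cached_total():.4f}")
```

Changing `runs` shows how the saving grows as the same stable text is read back more often.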
A simple way to estimate your own spend
Use this checklist before you commit to a model or a vendor.
Action checklist
- Count the input tokens in the real prompt, not the tidy version you wish you had.
- Split input into one-off text and reusable text.
- Estimate output length as a range, not a single number.
- Check whether the task can wait for batch pricing.
- Check whether any model price changes above a context threshold.
- Ask whether cache will be reused enough to pay back.
- Compare the whole workload, not just the input line item.
Formula block
A repeatable estimate is:
Total cost = (fresh input × fresh input rate + cache writes × cache write rate + cached input × cache hit rate + output × output rate) × batch multiplier, if applicable
That is only an estimate. It is not a prediction of your actual bill.
A sensible way to use it is to build a low / likely / high range:
- low: short prompt, short output, high cache reuse, batch available;
- likely: your current average workload;
- high: longer prompts, longer output, less cache reuse, no batch discount.
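One way to build that range is a small estimator run over three scenarios. The sketch below is a rough illustration under stated assumptions: it includes a cache-write term because the write step is billed separately, it treats batch as a multiplier on the subtotal, and the 0.5 batch multiplier is a placeholder rather than a quoted price. The rates are the Claude Sonnet 4.5 snapshot figures; the token counts in each scenario are invented.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    """USD per 1M tokens; batch_multiplier is an assumed placeholder."""
    fresh: float
    cache_write: float
    cache_read: float
    output: float
    batch_multiplier: float = 0.5


@dataclass
class Workload:
    """Token counts for one request in a scenario."""
    fresh_input: int
    output: int
    cache_read: int = 0
    cache_write: int = 0
    batch: bool = False


def estimate(w: Workload, r: Rates) -> float:
    """Estimated cost in USD for one request. An estimate, not a bill."""
    subtotal = (w.fresh_input * r.fresh
                + w.cache_write * r.cache_write
                + w.cache_read * r.cache_read
                + w.output * r.output) / 1_000_000
    return subtotal * (r.batch_multiplier if w.batch else 1.0)


rates = Rates(fresh=3.00, cache_write=3.75, cache_read=0.30, output=15.00)

# Hypothetical scenarios; replace the token counts with your own measurements.
scenarios = {
    "low": Workload(fresh_input=10_000, output=2_000, cache_read=50_000, batch=True),
    "likely": Workload(fresh_input=30_000, output=8_000, cache_read=40_000),
    "high": Workload(fresh_input=80_000, output=16_000),
}

for name, workload in scenarios.items():
    print(f"{name}: ${estimate(workload, rates):.4f} per request")
```

Multiplying each figure by expected request volume gives a spend range rather than a single number, which is usually the more honest way to present it.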
What this page cannot tell you
This page cannot tell you your actual bill.
It cannot tell you:
- how long your prompts really are;
- how often the same text is reused;
- whether your output will be short or chatty;
- whether your account tier has different rates;
- whether your usage pattern belongs in batch rather than interactive mode.
It can only show you the pricing shape and the questions that matter before you compare vendors.
GB / NI / global applicability
This article is global. There is no GB / NI split to apply here.
The useful caution is the same everywhere: provider pricing changes, account tiers may differ, and the published page is the thing to check before you buy into a model or a workflow.
Methodology and sources
Check date: 2026-05-22
What was checked: current provider pricing docs for input, output, cache and batch pricing behaviour.
How the figures were used: only to illustrate billing structure and workload estimation. They are not a promise about any reader’s actual bill.
Source URLs checked on 2026-05-22:
- https://platform.claude.com/docs/en/about-claude/pricing
- https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing
Data points pulled from those pages:
- Anthropic Claude Sonnet 4.5 standard input, 5-minute cache write, 1-hour cache write, cache hit and output pricing.
- Anthropic cache payback guidance.
- Google Gemini 2.5 Pro base input, cached input and output pricing, including the 200K token threshold.
- Google Gemini 2.0 Flash batch pricing table and the page-level note that Gemini models can run in batch mode at a 50% discount.
Assumptions used in worked examples:
- All currency values are USD.
- MTok means million tokens.
- The example workload repeats the same stable prompt text across runs.
- Output length is treated as a separate cost driver.
- Batch discount applies only when the workload is actually eligible for batch mode.
Change log
- 2026-05-22: first draft, with current provider pricing checks and a cache/batch workload example.
Source list
- Anthropic pricing documentation: https://platform.claude.com/docs/en/about-claude/pricing
- Google Gemini pricing documentation: https://cloud.google.com/gemini-enterprise-agent-platform/generative-ai/pricing