theLLMs

Last checked: 2026-05-22

Scope: Global provider pricing. Currency quotes below are USD and date-scoped to the provider docs checked on 2026-05-22.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

API model pricing: input, output, cache and batch costs

If you are comparing model APIs, the line item that looks cheapest on a pricing page is often only half the story. The real bill usually comes from four things:

  • how much text you send in;
  • how much text the model sends back;
  • whether some of the input can be cached and reused; and
  • whether the work can wait for batch pricing.

The safest short answer is this: compare the full workload, not just the input token rate. A model with a lower headline price can still cost more once output gets verbose, cache reuse is too low to pay back, or the job could have used a batch route instead.

Summary

For most teams, the right question is not “what is the cheapest model?” It is “what is the cheapest way to finish this job well enough?”

That means you should separate prompt length, output length, cache-hit rate and batchability before you compare models. It also means you should be wary of any quote that assumes one neat average prompt and a tidy response. Real workloads are messier than that.

A useful rule of thumb:

  • if the prompt is reused, cache can help;
  • if the task can wait, batch can help;
  • if the output is long, output pricing may dominate;
  • if the task is short and interactive, latency usually matters more than tiny token savings.

What API model pricing usually includes

Most API pricing pages break the bill into separate buckets. The labels differ by provider, but the shape is similar.

| Pricing bucket | What it means | Why it matters |
|---|---|---|
| Input tokens | The text you send to the model | Long prompts, retrieved documents and tool outputs all land here |
| Output tokens | The text the model generates | A cheap input rate can still become a large bill if output is verbose |
| Cached input | Reused prompt text billed at a lower rate | Helps when the same instructions or context are sent repeatedly |
| Batch pricing | Lower price for non-urgent jobs | Useful when latency does not matter |
| Context thresholds | Some models change price above a token limit | A long prompt can move you into a different pricing band |

The key move is to stop treating pricing like a single number. A model is not just an input rate; it is a bundle of billing rules.
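One way to make the "bundle of billing rules" idea concrete is to write the buckets down as a record before comparing anything. The sketch below is illustrative only: `PriceSheet`, its field names and `missing_buckets` are hypothetical, and the rates in `example` are placeholders rather than any provider's quote.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PriceSheet:
    """One model's billing rules. Rates are USD per million tokens."""
    input_rate: float
    output_rate: float
    cache_read_rate: Optional[float] = None        # None = not yet checked
    cache_write_rate: Optional[float] = None
    batch_multiplier: float = 1.0                  # 0.5 would mean a 50% batch discount
    long_context_threshold: Optional[int] = None   # tokens; None = not yet checked


def missing_buckets(sheet: PriceSheet) -> list[str]:
    """Name the optional buckets that still need checking on the provider page."""
    optional = ("cache_read_rate", "cache_write_rate", "long_context_threshold")
    return [name for name in optional if getattr(sheet, name) is None]


# Placeholder rates for illustration only, not a provider quote.
example = PriceSheet(input_rate=1.00, output_rate=4.00, cache_read_rate=0.10)
print(missing_buckets(example))
```

Filling in one record per model makes it harder to compare a complete price sheet against a single headline number by accident.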

Current pricing snapshot

The figures below were checked directly from provider pricing docs on 2026-05-22. They are date-scoped and change often.

| Provider / model | Input price | Cached input | Output price | Batch note |
|---|---|---|---|---|
| Anthropic Claude Sonnet 4.5 | $3 / MTok standard input | $3.75 / MTok 5m write; $6 / MTok 1h write; $0.30 / MTok cache hits | $15 / MTok | Anthropic notes that cache pricing can stack with Batch API discount, but the pricing page excerpt used here does not give one simple model-row batch price |
| Google Gemini 2.5 Pro | $1.25 / 1M tokens up to 200K input, then $2.50 / 1M | $0.13 / 1M cached input up to 200K, then $0.25 / 1M | $10 / 1M tokens up to 200K, then $15 / 1M | Gemini models are available in batch mode at 50% discount according to the provider pricing page note |
| Google Gemini 2.0 Flash | $0.15 / 1M input tokens | Not shown in the batch table excerpt | $0.60 / 1M output tokens | $0.075 / 1M input tokens and $0.30 / 1M output tokens with Batch API |

Two things to notice here:

  1. Google’s pricing page does not show a single price per model. Gemini 2.5 Pro has a threshold at 200K input tokens, and the input, cached input and output rates all rise above that line.
  2. Anthropic’s cache pricing is not a simple “free discount”. The write step costs more than fresh input, so reuse has to be real before caching pays back.
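The threshold effect in the first point can be sketched with the Gemini 2.5 Pro rates from the snapshot table. This is a rough sketch, not the provider's billing logic: the function name is hypothetical, and it assumes that once the prompt exceeds 200K tokens the higher rates apply to the whole request rather than only to the tokens above the line. Check the provider page before relying on that assumption.

```python
THRESHOLD_TOKENS = 200_000  # input-token threshold from the snapshot table


def gemini_25_pro_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one uncached request in USD.

    ASSUMPTION: above the threshold, the higher input and output rates
    apply to the entire request, not just the excess tokens.
    """
    long_prompt = input_tokens > THRESHOLD_TOKENS
    input_rate = 2.50 if long_prompt else 1.25      # USD per 1M input tokens
    output_rate = 15.00 if long_prompt else 10.00   # USD per 1M output tokens
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


below = gemini_25_pro_cost(200_000, 2_000)   # exactly at the threshold
above = gemini_25_pro_cost(200_001, 2_000)   # one token over
print(f"at threshold: ${below:.4f}, one token over: ${above:.4f}")
```

Under that assumption, a prompt that creeps just past the threshold costs close to double, which is why it is worth checking real prompt lengths against the pricing band.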

Input vs output costs

A lot of people start by comparing input rates because they are easy to read. That is fine for a first pass, but it is not enough.

If a task has small prompts and short replies, input may dominate. If the task produces long summaries, structured JSON, explanations, or tool-heavy follow-ups, output can become the bigger line item.

That is why a model that looks cheap on paper can still cost more overall:

  • the prompt may be longer than you expected;
  • the model may answer more verbosely than a cheaper alternative;
  • retries can multiply the total output;
  • tool calls can add more context to the next turn.

The point is not that output is always the main cost. The point is that you cannot know until you estimate both halves of the exchange.
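A quick way to estimate both halves is to compute them separately. The sketch below uses the Claude Sonnet 4.5 standard rates from the snapshot table ($3 input, $15 output per million tokens); the function name, the token counts and the `attempts` knob are hypothetical and only illustrate the shape of the calculation.

```python
def cost_split(input_tokens, output_tokens, input_rate, output_rate, attempts=1.0):
    """Return (input_cost, output_cost) in USD; rates are USD per 1M tokens.

    `attempts` is the average number of tries per finished task. Each retry
    is assumed to resend the prompt and regenerate the output.
    """
    input_cost = attempts * input_tokens * input_rate / 1_000_000
    output_cost = attempts * output_tokens * output_rate / 1_000_000
    return input_cost, output_cost


# Same 10,000-token prompt, two different reply lengths.
short_in, short_out = cost_split(10_000, 500, 3.00, 15.00)
long_in, long_out = cost_split(10_000, 4_000, 3.00, 15.00)

print(f"short reply: input ${short_in:.4f}, output ${short_out:.4f}")
print(f"long reply:  input ${long_in:.4f}, output ${long_out:.4f}")
```

With the short reply, input is the larger line item; with the long reply, output overtakes it even though the prompt has not changed.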

Cached input and why it matters

Cached input is useful when the same instructions, retrieval bundle, or reference text is sent again and again.

Anthropic’s pricing page is a good reminder that cache is not magic. The page says:

  • 5-minute cache writes cost 1.25x standard input;
  • 1-hour cache writes cost 2x standard input;
  • cache hits cost 10% of standard input;
  • the 5-minute cache pays back after one cache read;
  • the 1-hour cache pays back after two cache reads.

That last line is the one to keep.

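In plain English: a cache write that is never read back costs more than sending the text fresh, and with the 1-hour cache a single reuse is still slightly behind (2.1x against 2x fresh). Once the same context is reused several times, cache becomes a real lever.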

A small payback formula

If x is the standard input cost for the repeated text:

  • fresh twice = 2x
  • 5-minute cache, one write plus one read = 1.25x + 0.1x = 1.35x
  • 1-hour cache, one write plus two reads = 2x + 0.2x = 2.2x against 3x fresh for three uses

So the question is not “does cache always save money?” It is “how many times will I reuse the same text?”
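The payback arithmetic above can be checked with a few lines of code. This is a minimal sketch in units of x, using the multipliers quoted from Anthropic's page (1.25x and 2x for writes, 0.1x for hits); the function names are hypothetical, and it assumes the first use performs the write and every later use is a cache hit.

```python
from typing import Optional


def cached_cost(uses: int, write_mult: float, read_mult: float = 0.10) -> float:
    """Cost in units of x: one write on the first use, then one read per reuse."""
    if uses < 1:
        return 0.0
    return write_mult + (uses - 1) * read_mult


def reads_to_pay_back(write_mult: float, read_mult: float = 0.10,
                      max_uses: int = 50) -> Optional[int]:
    """Smallest number of cache reads at which caching beats sending fresh."""
    for uses in range(2, max_uses + 1):
        # Sending fresh costs 1x per use, so `uses` fresh sends cost `uses` x.
        if cached_cost(uses, write_mult, read_mult) < float(uses):
            return uses - 1
    return None


print(reads_to_pay_back(1.25))  # 5-minute cache
print(reads_to_pay_back(2.00))  # 1-hour cache
```

The two results, 1 and 2, match the payback points listed above. The function also makes it easy to test what happens if the cache expires before the reads arrive: every expiry means paying the write multiplier again.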

Batch pricing and when it helps

Batch pricing is for jobs that do not need an immediate answer. That usually means overnight enrichment, backfills, bulk classification, large-scale extraction or cleanup work.

Google’s pricing page states that Gemini models are available in batch mode at a 50% discount. The same page also shows an explicit Gemini 2.0 Flash batch table in which input falls from $0.15 to $0.075 per 1M tokens and output falls from $0.60 to $0.30 per 1M tokens.

That is the practical trade-off:

  • if you need a low-latency answer, batch is probably the wrong shape;
  • if you need throughput and can wait, batch can cut cost without changing the task itself.

Batch is not a clever way to hide bad prompt design. It is just a cheaper lane for work that does not need to be real-time.
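To see what the batch lane is worth on a bulk job, the Gemini 2.0 Flash figures quoted above can be applied to a made-up workload. The job size here (100,000 requests of 1,500 input and 200 output tokens) and the function name are hypothetical; only the four rates come from the pricing snapshot.

```python
def job_cost(n_requests, input_tokens, output_tokens, input_rate, output_rate):
    """Total USD for a job; token counts are per request, rates per 1M tokens."""
    total_in = n_requests * input_tokens
    total_out = n_requests * output_tokens
    return (total_in * input_rate + total_out * output_rate) / 1_000_000


# Hypothetical bulk classification job.
N, IN_TOK, OUT_TOK = 100_000, 1_500, 200

interactive = job_cost(N, IN_TOK, OUT_TOK, 0.15, 0.60)   # standard rates
batch = job_cost(N, IN_TOK, OUT_TOK, 0.075, 0.30)        # Batch API rates

print(f"interactive: ${interactive:.2f}")
print(f"batch:       ${batch:.2f}")
```

The task and the prompts are identical in both runs; the only thing that changes is whether the work can wait.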

Why cheap per token can still cost more

The cheapest-looking path can still lose once the workload is realistic.

Here is the common failure mode:

  1. the team compares only the headline input price;
  2. the prompt is longer than assumed, or is resent in full on every call;
  3. output gets verbose or retries climb;
  4. the final bill is higher than expected.

That is why the better comparison is a full workload estimate.

Worked example: two identical runs with a reusable prompt

Assumptions:

  • provider: Anthropic Claude Sonnet 4.5;
  • 50,000 tokens are stable instructions/reference text;
  • 20,000 tokens are unique to each request;
  • each run returns 8,000 output tokens;
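  • we compare two fresh runs against one 5-minute cache write followed by two cache reads. Billing the write separately from both reads is the conservative case; if the first run itself performs the write, the cached total is $0.5625.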

| Scenario | Calculation | Cost |
|---|---|---|
| Two fresh runs | 2 × ((50,000 + 20,000) × $3 / 1M) + 2 × (8,000 × $15 / 1M) | $0.6600 |
| 5-minute cache | 50,000 × $3.75 / 1M + 2 × (50,000 × $0.30 / 1M + 20,000 × $3 / 1M + 8,000 × $15 / 1M) | $0.5775 |
| Difference | Fresh minus cache | $0.0825 saved |

The point of the example is not that Claude Sonnet 4.5 is always the right model. The point is that cache only starts to matter when the same prompt text comes back enough times to pay for the write step.
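The table above can be reproduced in a few lines, which also makes it easy to swap in your own token counts. This sketch follows the table's calculation exactly (one cache write, then two runs that each read the cache); the constant names are just labels for the stated assumptions.

```python
MTOK = 1_000_000

# Claude Sonnet 4.5 rates from the snapshot table, USD per million tokens.
INPUT, CACHE_WRITE_5M, CACHE_READ, OUTPUT = 3.00, 3.75, 0.30, 15.00

# Workload assumptions from the worked example.
STABLE, UNIQUE, OUT_TOKENS, RUNS = 50_000, 20_000, 8_000, 2

# Every run sends the full prompt at the standard input rate.
fresh = RUNS * ((STABLE + UNIQUE) * INPUT + OUT_TOKENS * OUTPUT) / MTOK

# One write of the stable text, then each run reads it from cache.
cached = (
    STABLE * CACHE_WRITE_5M
    + RUNS * (STABLE * CACHE_READ + UNIQUE * INPUT + OUT_TOKENS * OUTPUT)
) / MTOK

print(f"fresh:  ${fresh:.4f}")
print(f"cached: ${cached:.4f}")
print(f"saved:  ${fresh - cached:.4f}")
```

Raising `RUNS` shows how quickly the gap widens once the write has been paid for, and lowering `STABLE` shows how little there is to save when the reusable part of the prompt is small.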

A simple way to estimate your own spend

Use this checklist before you commit to a model or a vendor.

Action checklist

  • Count the input tokens in the real prompt, not the tidy version you wish you had.
  • Split input into one-off text and reusable text.
  • Estimate output length as a range, not a single number.
  • Check whether the task can wait for batch pricing.
  • Check whether any model price changes above a context threshold.
  • Ask whether cache will be reused enough to pay back.
  • Compare the whole workload, not just the input line item.

Formula block

A repeatable estimate is:

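Total cost = (fresh input × fresh input rate + cache writes × cache write rate + cached input × cache rate + output × output rate) × batch multiplier

The batch multiplier is 1 for interactive work and the provider's discount factor (for example, 0.5 for a 50% batch discount) when the job is eligible for batch mode.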

That is only an estimate. It is not a prediction of your actual bill.

A sensible way to use it is to build a low / likely / high range:

  • low: short prompt, short output, high cache reuse, batch available;
  • likely: your current average workload;
  • high: longer prompts, longer output, less cache reuse, no batch discount.
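A low / likely / high range can be built by running the same estimate three times with different inputs. The sketch below is illustrative only: the rates are the Claude Sonnet 4.5 figures from the snapshot table, the token counts in each scenario are invented, and treating batch as a multiplier of 0.5 is an assumption for the "low" case, since the pricing excerpt used here does not give a single batch price for that model.

```python
MTOK = 1_000_000

# USD per million tokens, from the snapshot table (Claude Sonnet 4.5).
RATES = {"input": 3.00, "cache_read": 0.30, "cache_write": 3.75, "output": 15.00}


def estimate(fresh_in, cached_in, out, cache_write_in=0, batch_multiplier=1.0):
    """Estimated USD for one request. The batch multiplier is an assumption."""
    subtotal = (
        fresh_in * RATES["input"]
        + cached_in * RATES["cache_read"]
        + cache_write_in * RATES["cache_write"]
        + out * RATES["output"]
    ) / MTOK
    return subtotal * batch_multiplier


# Hypothetical scenarios; replace the token counts with measured values.
scenarios = {
    "low": estimate(5_000, 50_000, 500, batch_multiplier=0.5),
    "likely": estimate(20_000, 50_000, 2_000),
    "high": estimate(70_000, 0, 8_000),
}

for name, usd in scenarios.items():
    print(f"{name:>6}: ${usd:.4f} per request")
```

Multiplying each figure by expected monthly request volume gives a spend range rather than a single number, which is usually the more honest thing to put in front of a budget holder.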

What this page cannot tell you

This page cannot tell you your actual bill.

It cannot tell you:

  • how long your prompts really are;
  • how often the same text is reused;
  • whether your output will be short or chatty;
  • whether your account tier has different rates;
  • whether your usage pattern belongs in batch rather than interactive mode.

It can only show you the pricing shape and the questions that matter before you compare vendors.

GB / NI / global applicability

This article is global. There is no GB / NI split to apply here.

The useful caution is the same everywhere: provider pricing changes, account tiers may differ, and the published page is the thing to check before you buy into a model or a workflow.

Methodology and sources

Check date: 2026-05-22

What was checked: current provider pricing docs for input, output, cache and batch pricing behaviour.

How the figures were used: only to illustrate billing structure and workload estimation. They are not a promise about any reader’s actual bill.

Source URLs checked on 2026-05-22:

Data points pulled from those pages:

  • Anthropic Claude Sonnet 4.5 standard input, 5-minute cache write, 1-hour cache write, cache hit and output pricing.
  • Anthropic cache payback guidance.
  • Google Gemini 2.5 Pro base input, cached input and output pricing, including the 200K token threshold.
  • Google Gemini 2.0 Flash batch pricing table and the page-level note that Gemini models can run in batch mode at 50% discount.

Assumptions used in worked examples:

  • All currency values are USD.
  • MTok means million tokens.
  • The example workload repeats the same stable prompt text across runs.
  • Output length is treated as a separate cost driver.
  • Batch discount applies only when the workload is actually eligible for batch mode.

Change log

  • 2026-05-22: first draft built from the llm-editor-approved launch slice brief for API model pricing, with current provider pricing checks and a cache/batch workload example.

Source list