theLLMs

Last checked: 2026-05-25

Scope: Global. Output token pricing checked against current provider API pages on 2026-05-25. Actual cost savings depend on workload characteristics and model behaviour.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Output tokens are expensive: designing shorter AI answers without hurting usefulness

Output tokens typically cost 3–4× more per token than input tokens. If you are generating 500-token answers when 100 would do, you are paying 3–5× the necessary output cost. Over a million calls a month, that difference adds up to thousands of dollars.

The short answer is: output tokens are the most expensive part of most LLM bills, and teams routinely over-generate because they never measure average output length or set explicit limits. A shorter, tighter answer is almost always cheaper and often better.

If you only remember one thing: set a max_tokens limit on every API call, measure your average output token count weekly, and treat every unnecessary sentence in an AI answer as a cost line item.

Editor’s Note: Models default to generating as much as they think is helpful. They do not have a budget. If you do not set limits, the model will keep writing.

Editor’s Note: Shorter answers also mean lower latency, fewer truncation issues in multi-turn conversations, and a better user experience for anyone scanning for a key fact.

Quick answer {#quick-answer}

  • Output tokens cost 3–4× input tokens on most major provider pricing pages.
  • Default max_tokens is often unlimited or generous (4,096–8,192). If you do not set one, you pay for whatever the model decides to write.
  • A 200-token answer costs ~60% less than a 500-token answer at the same input length.
  • Verbose model defaults — chain-of-thought, bullet lists, repeated phrasing — are not always necessary.

Practical output budget by task:

TaskTypical useful output lengthSavings vs default
Classification / tag5–20 tokens~95%
JSON extraction50–200 tokens~80%
Q&A fact answer50–150 tokens~85%
Summarisation100–300 tokens~60%
Email draft150–300 tokens~50%
Code generation100–500 tokensVaries
Analysis / reasoning300–800 tokens~20%

These are planning ranges. The right length depends on the task, not on what the model prefers to write.

Where output waste hides {#where-output-waste-hides}

Redundant framing {#redundant-framing}

Models often preface answers with framing: “Based on the document you provided, I can confirm that…” That is fine for a human reader. For an automated pipeline, it is waste.

Fix: prompt for conciseness. “Answer directly. Do not repeat the question. Do not add disclaimers unless asked.”

Bullet lists that could be sentences {#bullet-lists-that-could-be-sentences}

A model asked to “explain” may default to a structured list with introductions, labels and spacing. A single sentence may carry the same information.

Fix: specify the format in the prompt: “Answer in one sentence. Maximum 50 tokens.”

Chain-of-thought in the visible output {#chain-of-thought-in-the-visible-output}

Models that show their reasoning as part of the answer are generating tokens you pay for but may not need to show the user. Some providers offer hidden reasoning (Anthropic’s extended thinking, OpenAI’s o-series reasoning tokens) that does not count toward output tokens.

Fix: check whether your provider offers cost-free reasoning tokens or hidden scratchpad options. Separating reasoning from output can reduce visible token count by 50–80%.

Output repair loops {#output-repair-loops}

When the model generates an answer that does not match the expected format, you retry. Each retry is another full input+output cycle for the same user need. The cost multiplier is 2–5× the ideal cost.

Fix: validate output format in the prompt and give a negative example: “Do not include markdown formatting. Do not add extra whitespace. Output only the JSON.”

Worked example: trimming 300 tokens per call {#worked-example-trimming-300-tokens-per-call}

Assume Google Gemini 2.5 Pro at $1.25/M input, $5.00/M output, 100,000 calls/month.

Before trimming: average 500 output tokens per call = 500 × 100,000 × $5/M = $250/month output cost.

After trimming: average 200 output tokens per call = 200 × 100,000 × $5/M = $100/month output cost.

Savings: $150/month (60% reduction) — from prompt changes alone, no model change, no infrastructure change.

Formula block {#formula-block}

Monthly output cost = average output tokens per call × calls per month × output price per 1M tokens / 1,000,000

To calculate savings: (current_output_tokens - target_output_tokens) × calls × output_price

What does not work {#what-does-not-work}

  • “Be concise” in the system prompt without a token limit. The model’s idea of “concise” may still be 400 tokens.
  • Setting max_tokens too low and getting truncation. If your expected answer needs 300 tokens, set max_tokens to 500 so the answer fits naturally.
  • Short prompts for short answers. Output length is not proportional to input length. A short prompt can still produce a long answer.
  • Blame the model. Output length is a product of prompting, model behaviour and task complexity. Fix the prompt before changing the model.

Decision chain {#decision-chain}

  1. Measure your average output token count last month. If you do not have this number, start logging it.
  2. Compare it to the minimum useful output length for your task.
  3. If the gap is >50%, test shorter output prompts with an eval set.
  4. Validate that shorter answers do not miss critical information.
  5. Set max_tokens to the 95th percentile of your test output length.

What this page cannot tell you {#what-this-page-cannot-tell-you}

This page cannot tell you the minimum useful output length for your specific task. It can only give you the framework for finding it. The right length depends on user expectations, task complexity and whether the output is human-facing or machine-parsed.

Methodology and sources {#methodology-and-sources}

Check date: 2026-05-25

What was checked: Provider pricing pages for current input/output token cost ratios; model default max_tokens behaviour.

Worked-example assumptions: Google Gemini 2.5 Pro pricing used. Savings may differ for providers with different input/output cost ratios.

Assumptions and limits:

  • Average output token counts are illustrative. Real numbers depend on task, prompt and model.
  • Some providers offer token-level breakdowns in API responses.
  • Output token prices change over time.

Source list {#source-list}

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.