Temperature, top-p and deterministic outputs: what the settings actually do
Lower temperature does not make a model smarter. It makes a model more predictable. That is useful for extraction, classification and support bots. Higher temperature does not make a model creative. It makes a model more random, which can help with brainstorming or when you want varied outputs, but it also increases the chance of nonsense.
The short answer is: temperature controls how “risky” the next token choice is. Top-p controls how wide the pool of candidates is. Setting both to their extremes is usually a sign you are not sure what you want.
If you only remember one thing: use temperature 0 for deterministic tasks, temperature 0.3–0.7 for general-purpose use, and treat anything above 0.7 as “I am deliberately courting variety.” Know why you want variety.
Editor’s Note: Many teams turn up temperature because “the answers sound too samey.” That is often a prompt quality problem, not a temperature problem. Fix the prompt before you turn up the randomness.
Editor’s Note: Deterministic output does not mean “the same every time forever.” Model updates, prompt changes and floating-point differences between hardware can still shift results.
Quick answer {#quick-answer}
- Extraction, classification, structured output: temperature 0, top-p 1. You want the same answer for the same input.
- Support bot, summarisation, Q&A: temperature 0.1–0.3. Slight natural variation without drift.
- General writing, drafting, ideation: temperature 0.5–0.7. Enough room for different phrasings.
- Brainstorming, variation search: temperature 0.7–1.0. Expect some duds.
- Deliberate randomness (demos, games): temperature > 1.0, where the provider’s allowed range goes that high. Rarely useful in production.
This is not a universal table. Different models respond differently to the same settings. A 0.7 temperature on GPT-4.5 may not produce the same behaviour as 0.7 on a fine-tuned Llama 3B. You need to test your own workload.
What temperature actually does {#what-temperature-actually-does}
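Temperature rescales the model’s next-token scores (logits) before they are turned into probabilities and sampled: each logit is divided by the temperature. A temperature of 0 is handled as a special case in which the model always picks the single most likely token (argmax). A temperature of 1 leaves the model’s distribution unchanged. Values between 0 and 1 sharpen the distribution towards the most likely tokens, and values above 1 flatten it, making less-likely tokens more probable. Negative values are not valid: providers document a range that starts at 0 (the upper limit varies by provider), and out-of-range values are typically rejected rather than silently corrected.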
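The scaling can be sketched in a few lines of Python. This is a minimal illustration, not any provider’s actual implementation: the function name `softmax_with_temperature` and the three logits are made up for the example, and treating temperature 0 as plain argmax is an assumption about how greedy decoding is usually handled.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities after temperature scaling."""
    if temperature < 0:
        raise ValueError("temperature must be >= 0")
    if temperature == 0:
        # Greedy decoding: put all probability mass on the top-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0, 0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these logits the top token holds roughly 66% of the probability at temperature 1.0, a larger share at 0.5 and a smaller share at 1.5. The ranking of the tokens never changes; only how much mass the runners-up receive.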
Two common misunderstandings:
- Temperature 0 is not “more accurate.” It is “more predictable.” If the most likely token is wrong, temperature 0 makes it wrong in the same way every time.
- Temperature does not add meaning. It adds randomness. If the prompt is vague, more randomness does not make the output smarter. It makes it less consistent.
What top-p (nucleus sampling) does {#what-top-p-nucleus-sampling-does}
Top-p selects the smallest set of tokens whose cumulative probability reaches the threshold p, then samples from that set after renormalising the probabilities within it. If top-p is 0.9, the model considers only the most-likely tokens that together account for at least 90% of the probability mass.
The practical difference from temperature: top-p dynamically adjusts the candidate pool based on the model’s confidence. When the model is confident (one token dominates), top-p 0.9 is close to deterministic. When the model is uncertain (many similar-probability tokens), top-p 0.9 still allows variety.
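A small sketch makes the confident-versus-uncertain contrast concrete. The helper `nucleus` and both probability lists are hypothetical, and the code assumes the cut-off is “cumulative probability reaches p”; real implementations may differ in how they treat the boundary token.

```python
def nucleus(probs, top_p):
    """Return indices of the smallest set of tokens whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

# Hypothetical next-token distributions over five candidates.
confident = [0.92, 0.04, 0.02, 0.01, 0.01]
uncertain = [0.25, 0.22, 0.20, 0.18, 0.15]

print(nucleus(confident, 0.9))  # [0]
print(nucleus(uncertain, 0.9))  # [0, 1, 2, 3, 4]
```

The same threshold keeps one candidate in the first case and all five in the second, which is why top-p behaves almost deterministically when the model is sure and still allows variety when it is not.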
In practice, most teams set top-p to 1 (disabled) and tune temperature alone. Using both at once is valid but redundant unless you have a specific use case for the combined effect.
Deterministic output: what it buys and does not buy {#deterministic-output-what-it-buys-and-does-not-buy}
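Setting temperature to 0 makes decoding greedy, so a given prompt, model version and hardware setup should produce the same output. On hosted APIs, treat this as near-deterministic rather than guaranteed, because you do not control the serving hardware. That is useful for: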
- Testing and regression checks: the same prompt should produce the same output.
- Extraction pipelines: consistent field output without surprises.
- Support bots: the same customer query should not get a different answer each time.
It does not guarantee:
- Correctness (the model can still be confidently wrong).
- Portability across model versions (an update changes the token probabilities).
- Identical output across providers or hardware (float rounding differs).
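One way to find out how repeatable your own setup is: call the model several times with the same prompt and compare fingerprints of the outputs. The sketch below assumes a `generate(prompt)` callable that wraps your provider’s client with temperature 0; `fake_generate` is a stand-in so the example runs without network access, and all names here are hypothetical.

```python
import hashlib

def output_fingerprint(text):
    """Short, stable hash of a model output for regression comparison."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def check_repeatable(generate, prompt, runs=5):
    """Call `generate` several times and report whether all outputs match."""
    fingerprints = {output_fingerprint(generate(prompt)) for _ in range(runs)}
    return len(fingerprints) == 1, fingerprints

# Stand-in for a real API call made with temperature 0; replace with your client.
def fake_generate(prompt):
    return '{"label": "invoice"}'

stable, seen = check_repeatable(fake_generate, "Classify this document.")
print(stable, len(seen))
```

Storing the fingerprint alongside the model version also gives a cheap signal when an update shifts the output for a prompt you care about.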
Worked example: extraction with temperature 0 vs 0.3 {#worked-example-extraction-with-temperature-0-vs-0-3}
Prompt: “Extract the date, amount and payee from this invoice text: ‘Invoice dated 14 March 2026 for £1,247.50 payable to Acme Widgets Ltd.’”
- Temperature 0: Returns the same structured JSON on each run, for the same model version and serving setup.
- Temperature 0.3: May occasionally swap field order, add extra whitespace, or vary “Acme Widgets Ltd” to “Acme Widgets Limited.” Not ideal for automated processing.
The 0.3 version is not better. It is just less predictable.
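If a pipeline has to tolerate some of this variation, parsing the output and comparing a canonical form removes the cosmetic differences. In the sketch below the field names, the ISO date format and the three sample outputs are all illustrative assumptions, not real model responses.

```python
import json

def canonical(raw):
    """Parse JSON text and re-serialise with sorted keys and no extra whitespace."""
    return json.dumps(json.loads(raw), sort_keys=True, separators=(",", ":"))

# Illustrative outputs from three hypothetical runs of the same prompt.
run_a = '{"date": "2026-03-14", "amount": 1247.50, "payee": "Acme Widgets Ltd"}'
run_b = '{ "payee": "Acme Widgets Ltd",  "amount": 1247.50, "date": "2026-03-14" }'
run_c = '{"date": "2026-03-14", "amount": 1247.50, "payee": "Acme Widgets Limited"}'

print(canonical(run_a) == canonical(run_b))  # True: only order and whitespace differ
print(canonical(run_a) == canonical(run_c))  # False: the payee value itself changed
```

Canonicalisation absorbs field order and whitespace, but not a changed value such as “Ltd” versus “Limited”. That kind of drift is the reason to keep temperature at 0 for extraction.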
Tuning matrix by task {#tuning-matrix-by-task}
| Task | Temperature | Top-p | Notes |
|---|---|---|---|
| JSON extraction | 0 | 1 | Argmax. Run multiple times and compare if needed. |
| Classification / tagging | 0 | 1 | Same input → same label. |
| Code generation | 0–0.2 | 1 | Lower for deterministic logic, higher for boilerplate. |
| Summarisation | 0.1–0.3 | 1 | Slight variation is fine; stay predictable. |
| Customer-facing Q&A | 0.1–0.3 | 1 | Consistency matters for trust. |
| Drafting emails | 0.3–0.5 | 0.9 | Some tone variation is useful. |
| Creative writing | 0.5–0.8 | 0.9 | Higher = more variety, more failure. |
| Brainstorming | 0.7–1.0 | 0.9 | Expect to filter output. |
Treat this as a starting point. Your workload, model and user expectations will shift the numbers.
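The matrix can be kept in code as a small preset table so that settings are chosen by task name rather than scattered through call sites. This is a sketch: the task keys and the `sampling_params` helper are hypothetical, and returning the low end of each temperature range is an assumed conservative default, not a recommendation from any provider.

```python
# (temperature range, top_p) per task, copied from the matrix above.
PRESETS = {
    "json_extraction": ((0.0, 0.0), 1.0),
    "classification": ((0.0, 0.0), 1.0),
    "code_generation": ((0.0, 0.2), 1.0),
    "summarisation": ((0.1, 0.3), 1.0),
    "customer_qa": ((0.1, 0.3), 1.0),
    "email_drafting": ((0.3, 0.5), 0.9),
    "creative_writing": ((0.5, 0.8), 0.9),
    "brainstorming": ((0.7, 1.0), 0.9),
}

def sampling_params(task):
    """Return starting sampling parameters for a task, using the low end of its range."""
    (low, _high), top_p = PRESETS[task]
    return {"temperature": low, "top_p": top_p}

print(sampling_params("summarisation"))
print(sampling_params("brainstorming"))
```

Keeping the presets in one place also makes a parameter change a reviewable diff that can be run against an eval suite before it ships.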
What this page cannot tell you {#what-this-page-cannot-tell-you}
This page cannot tell you the exact settings for your workload. It can only explain what the knobs do and where other teams typically set them.
It cannot tell you whether your prompt is well-constructed enough for temperature to matter, whether your model version handles temperature the same way as the docs suggest, or whether your evaluation framework caught a regression introduced by a parameter change.
Methodology and sources {#methodology-and-sources}
Check date: 2026-05-25
What was checked: Current provider API docs for OpenAI, Anthropic, Google and Mistral regarding sampling parameter behaviour and common ranges.
Worked-example assumptions: All currency values in GBP; extraction output is purely illustrative.
Assumptions and limits:
- Temperature and top-p interact differently depending on the model family.
- Provider docs may not fully disclose the exact sampling implementation.
- Customer-facing production systems should always test parameter changes against eval suites.
Source list {#source-list}
- OpenAI API reference — sampling parameters — https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature
- Anthropic API docs — thinking and sampling parameters — https://docs.anthropic.com/en/api/messages
- Google Gemini API — generation config — https://ai.google.dev/gemini-api/docs/text-generation
- Mistral AI — API parameters — https://docs.mistral.ai/api/
Related guides {#related-guides}
- System prompts, developer prompts and user prompts: who controls what?
- Prompt length, output length and why AI bills surprise teams
- Context windows explained: why bigger is not always better
- Structured outputs and JSON mode reliability limits
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.