Latency in LLM apps: first token, total time and user experience
A slow AI feature is not always a slow model. The delay can come from queueing, a long prompt, a long answer, tool calls, network hops, retries, or a slow downstream service that the model has to wait for.
The useful answer is to measure the whole request path, not just the model call. If the first token is slow, the user feels stuck. If the total time is slow, the feature feels heavy. Both matter, and they are not the same problem.
Teams often blame the model first because that is the visible part. In practice, the slowest bit is often the plumbing around it — bloated prompts, oversized conversation histories sent on every turn, synchronous retrieval that blocks streaming, or validation loops that retry before anything reaches the user [1]. A model that benchmarks as “fast” can still feel slow if the prompt is bloated, the output is too long, or the system waits on tools before streaming anything useful.
Quick answer
If the feature feels slow, measure three things separately: time to first token, total completion time, and time spent outside the model call.
If first token is slow, look at queueing, prompt size, routing and any pre-processing the app does before streaming starts. If total time is slow, look at output length, tool calls, retries and slow post-processing. If only some users see the issue, check region, account tier, concurrency or downstream service load.
Do not jump straight to a bigger model or a new vendor until you know where the delay actually lives.
What latency means in practice
“Latency” in an LLM app is a bundle of timings, not one number.
The main parts are:
- queueing time: the request waits before a worker starts;
- prompt processing time: the model reads the input tokens;
- time to first token: the delay before the first streamed output appears;
- generation time: the model finishes the rest of the answer;
- tool time: external API calls, database queries or function calls;
- network time: round-trip delay between user, app and provider;
- retry time: extra time after a failed call or validation loop.
A feature can have a decent total time but still feel bad if nothing appears for several seconds. In many products, time to first token is the number that most closely tracks user frustration.
What usually causes the delay
Common causes are boring, which is why they recur:
- The prompt is larger than it needs to be.
- The app sends too much history back into the model.
- The model waits on retrieval or tool calls before it can stream anything.
- The response is longer than the product really needs.
- The app retries automatically after a validation failure.
- A downstream API is slow, so the model sits idle.
- The user is on a busy region or account tier with more contention.
None of that proves the provider is slow. It only proves the request path is doing work.
A worked example
A customer support bot was handling small-business IT queries. Users reported the bot felt “laggy” — responses appeared 4–5 seconds after hitting send.
Before: The app sent the full 40-message chat history on every turn. The prompt included the complete conversation plus a 600-word system instruction. The model took ~1.2 seconds to process the prompt, the streaming connection was held open while the app ran a synchronous ticket-ID lookup, and the first token appeared at 3.8 seconds. Total response time averaged 6.2 seconds.
After: The team trimmed conversation history to the last 8 messages (the only ones relevant for context), moved the ticket-ID lookup to an async prefetch that runs in parallel with streaming, and reduced the system instruction to 200 words by removing redundant examples. First token dropped to 0.7 seconds. Total response time dropped to 1.9 seconds.
The model did not change. The provider did not change. The fix was plumbing [1][4].
What to measure
Before changing models, record a simple timing breakdown:
- request received;
- request queued;
- model call started;
- first token streamed;
- tool calls started and finished;
- final token received;
- response rendered to the user.
That lets you answer the only question that matters: where is the wait happening?
A helpful rule is to keep the timing labels aligned with the user experience. Engineers often measure only API latency, but users care about when something usable appears on screen. OpenTelemetry’s tracing specification provides a standard way to instrument each component in the request path [4], making it possible to see whether the delay is in your app, your network or the provider.
What to change first
If the first token is slow, try these in order:
- trim the prompt;
- shorten the conversation history you resend;
- avoid needless retrieval before the first streamed token;
- stream earlier if the workflow allows it (send
stream: trueand process SSE events as they arrive, as recommended by OpenAI’s streaming docs [1]); - reduce any synchronous pre-checks;
- cap output length where a shorter answer is enough.
If total time is slow, look at:
- output limits;
- tool-call count;
- validation loops;
- retries;
- slow database or API calls;
- post-processing that could happen after the user already has the answer.
A smaller model can help, but only if it removes the actual bottleneck.
Practical decision check
Use this check before you rewrite the stack:
- Does the user need the full answer immediately, or just a visible start?
- Is the app waiting on retrieval, tools or validation before it can stream?
- Are you sending more context than the task needs?
- Is the output length under your control?
- Are retries hiding the true time cost?
- Are a few users seeing a region or tenant-specific slowdown?
If you cannot answer those questions, the measurement layer is too thin.
What this page cannot tell you
This page cannot tell you which provider is fastest for your workload.
It cannot tell you:
- your exact queueing delay;
- your model’s real throughput under load;
- whether a tool call or a rerank step is the real blocker;
- whether the delay sits in your app, your network or the vendor;
- whether a user-perceived slowdown is caused by one long request or many small ones.
It can only show you how to stop guessing.
What would change the advice
The guidance above assumes the bottleneck is in the request path and can be fixed without adding systemic risk. That assumption breaks down in three cases:
-
When latency optimisation adds complexity without reducing user-perceived delay. Adding streaming proxies, cache layers or multi-region routing introduces new failure modes — dropped connections, stale cache entries, routing misconfiguration. If the user is already reading the response faster than the model produces it, optimising further does not improve the experience.
-
When the bottleneck is user reading speed, not model speed. A 300-word answer produced in 0.8 seconds feels instant. Trying to squeeze that to 0.4 seconds by switching providers or adding infrastructure is engineering theatre — no user notices the difference.
-
When a provider ships a hardware generation that changes the latency baseline. If a provider deploys new inference hardware (next-generation GPUs, custom ASICs) or changes its routing layer, the assumptions that guided your optimisation may no longer hold. Re-check latency distributions quarterly, not just when something feels slow.
If any of these apply, stop optimising and measure again.
Regional caveats
Latency optimisation advice is universal, but the specific levers vary by region:
-
UK/Europe: Provider endpoint location matters. A request routed to a US-west data centre adds 80–150 ms of round-trip time compared to a London or Frankfurt endpoint. Several providers offer European inference endpoints with data-sovereignty guarantees; check whether your account is configured to prefer them.
-
Asia-Pacific / South America: Provider coverage is thinner. Fewer regional inference endpoints mean higher baseline network latency. This makes prompt-size optimisation proportionally more valuable — every unnecessary token costs extra round-trip time.
-
Global / multi-region deployments: If your users are distributed across regions, a single provider endpoint will serve some well and others poorly. Consider multi-region routing or at minimum measure your actual latency distribution by user region before declaring a provider “slow.”
The useful caution is universal: latency should be measured on the live request path, not inferred from a model page or a marketing claim. But the baseline you are optimising against depends on where you and your users are.
Methodology and sources
Check date: 2026-05-22
What was checked: provider streaming documentation and general observability guidance.
What the sources were used for:
- streaming behaviour and the fact that first-token timing matters to user experience [1][2][3];
- how rate limits, concurrency and request shaping can affect observed delay [1];
- the need to instrument the full request path rather than only the model call [4].
Assumptions and limits:
- no hands-on latency benchmark is claimed here;
- provider load changes over time;
- timing numbers vary by region, tenant and workload;
- user-perceived speed is not the same as raw API completion time.
Change log
- 2026-05-25: revised per editorial review (LLM-0081). Integrated Editor’s Notes, added inline citations, fixed related-guide links to production routes, added worked example with concrete numbers, replaced Global applicability with regional caveats, added What would change the advice section.
- 2026-05-22: first draft built from the llm-editor-approved brief, with a request-path timing model, a user-experience framing for first token versus total time, and practical checks before model changes.
Source list
- [1] OpenAI streaming docs — https://platform.openai.com/docs/guides/streaming
- [2] Anthropic streaming docs — https://docs.anthropic.com/en/docs/build-with-claude/streaming
- [3] Google Gemini streaming docs — https://ai.google.dev/gemini-api/docs/streaming
- [4] OpenTelemetry documentation — https://opentelemetry.io/docs/