#- Latency- in- LLM- apps:- first- token,- total- time- and- user- experience

##- TL;DR

A- slow- AI- feature- is- not- always- a- slow- model.- The- delay- can- come- from- queueing,- a- long- prompt,- a- long- answer,- tool- calls,- network- hops,- retries,- or- a- slow- downstream- service- that- the- model- has- to- wait- for.

The- useful- answer- is- to- measure- the- whole- request- path,- not- just- the- model- call.- If- the- first- token- is- slow,- the- user- feels- stuck.- If- the- total- time- is- slow,- the- feature- feels- heavy.- Both- matter,- and- they- are- not- the- same- problem.

Teams- often- blame- the- model- first- because- that- is- the- visible- part.- In- practice,- the- slowest- bit- is- often- the- plumbing- around- it- —- bloated- prompts,- oversized- conversation- histories- sent- on- every- turn,- synchronous- retrieval- that- blocks- streaming,- or- validation- loops- that- retry- before- anything- reaches- the- user.- A- model- that- benchmarks- as- “fast”- can- still- feel- slow- if- the- prompt- is- bloated,- the- output- is- too- long,- or- the- system- waits- on- tools- before- streaming- anything- useful.

##- TL;DR

If- the- feature- feels- slow,- measure- three- things- separately:- time- to- first- token,- total- completion- time,- and- time- spent- outside- the- model- call.

If- first- token- is- slow,- look- at- queueing,- prompt- size,- routing- and- any- pre-processing- the- app- does- before- streaming- starts.- If- total- time- is- slow,- look- at- output- length,- tool- calls,- retries- and- slow- post-processing.- If- only- some- users- see- the- issue,- check- region,- account- tier,- concurrency- or- downstream- service- load.

Do- not- jump- straight- to- a- bigger- model- or- a- new- vendor- until- you- know- where- the- delay- actually- lives.

- - Editor's- Note - -

Time- to- first- token- (TTFT)- is- the- metric- that- most- closely- tracks- user- frustration- —- more- than- total- completion- time.- A- user- staring- at- a- blank- screen- for- 3- seconds- will- perceive- the- feature- as- "broken,"- even- if- the- full- answer- arrives- in- 4- seconds- total.- If- you- can- only- instrument- one- latency- metric,- start- with- TTFT.- Streaming- a- placeholder- or- a- typing- indicator- before- the- model- starts- generating- can- buy- you- an- extra- second- of- perceived- responsiveness.

##- What- latency- means- in- practice

“Latency”- in- an- LLM- app- is- a- bundle- of- timings,- not- one- number.

The- main- parts- are:

queueing- time:- the- request- waits- before- a- worker- starts; prompt- processing- time:- the- model- reads- the- input- tokens; time- to- first- token:- the- delay- before- the- first- streamed- output- appears; generation- time:- the- model- finishes- the- rest- of- the- answer; tool- time:- external- API- calls,- database- queries- or- function- calls; network- time:- round-trip- delay- between- user,- app- and- provider; ##- What- usually- causes- the- delay

Common- causes- are- boring,- which- is- why- they- recur:

1.- The- prompt- is- larger- than- it- needs- to- be. 2.- The- app- sends- too- much- history- back- into- the- model. 3.- The- model- waits- on- retrieval- or- tool- calls- before- it- can- stream- anything. 4.- The- response- is- longer- than- the- product- really- needs. 5.- The- app- retries- automatically- after- a- validation- failure. 6.- A- downstream- API- is- slow,- so- the- model- sits- idle. 7.- The- user- is- on- a- busy- region- or- account- tier- with- more- contention.

None- of- that- proves- the- provider- is- slow.- It- only- proves- the- request- path- is- doing- work.

##- A- worked- example

A- customer- support- bot- was- handling- small-business- IT- queries.- Users- reported- the- bot- felt- “laggy”- —- responses- appeared- 4–5- seconds- after- hitting- send.

Before:- The- app- sent- the- full- 40-message- chat- history- on- every- turn.- The- prompt- included- the- complete- conversation- plus- a- 600-word- system- instruction.- The- model- took- ~1.2- seconds- to- process- the- prompt,- the- streaming- connection- was- held- open- while- the- app- ran- a- synchronous- ticket-ID- lookup,- and- the- first- token- appeared- at- 3.8- seconds.- Total- response- time- averaged- 6.2- seconds.

After:- The- team- trimmed- conversation- history- to- the- last- 8- messages- (the- only- ones- relevant- for- context),- moved- the- ticket-ID- lookup- to- an- async- prefetch- that- runs- in- parallel- with- streaming,- and- reduced- the- system- instruction- to- 200- words- by- removing- redundant- examples.- First- token- dropped- to- 0.7- seconds.- Total- response- time- dropped- to- 1.9- seconds.

The- model- did- not- change.- The- provider- did- not- change.- The- fix- was- plumbing.

- - Editor's- Note - -

The- single- highest-ROI- latency- fix- we- see- across- production- LLM- apps- is- trimming- conversation- history.- Most- apps- send- the- full- chat- log- on- every- turn,- and- most- turns- only- need- the- last- 6–10- messages.- This- is- not- a- model- problem- —- it- is- a- prompt-design- problem.- Before- you- optimise- anything- else,- check- how- many- tokens- you- are- sending- that- the- model- does- not- actually- need.

##- What- to- measure

Before- changing- models,- record- a- simple- timing- breakdown:

request- received; request- queued; model- call- started; first- token- streamed; tool- calls- started- and- finished; final- token- received; response- rendered- to- the- user.

That- lets- you- answer- the- only- question- that- enough- of- your- plumbing- is- optimized.

##- What- to- change- first

If- the- first- token- is- slow,- try- these- in- order:

trim- the- prompt; shorten- the- conversation- history- you- resend; avoid- needless- retrieval- before- the- first- streamed- token; stream- earlier- if- the- workflow- allows- it- (send- stream:- true- and- process- SSE- events- as- they- arrive,- as- recommended- by- OpenAI’s- streaming- docs); reduce- any- synchronous- pre-checks; cap- output- length- where- a- shorter- answer- is- enough.

If- total- time- is- slow,- look- at:

output- limits; tool-call- count; validation- loops; retries; slow- database- or- API- calls; post-processing- that- could- happen- after- the- user- already- has- the- answer.

A- smaller- model- can- help,- but- only- if- it- removes- the- actual- bottleneck.

##- Practical- decision- check

Use- this- check- before- you- rewrite- the- stack:

Does- the- user- need- the- full- answer- immediately,- or- just- a- visible- start? Is- the- app- waiting- on- retrieval,- tools- or- validation- before- it- can- stream? Are- you- sending- more- context- than- the- task- needs? Is- the- output- length- under- your- control? Are- retries- hiding- the- true- time- cost? Are- a- few- users- seeing- a- region- or- tenant-specific- slowdown?

If- you- cannot- answer- those- questions,- the- measurement- layer- is- too- thin.

##- What- this- page- cannot- tell- you

This- page- cannot- tell- you- which- provider- is- fastest- for- your- workload.

It- cannot- tell- you:

your- exact- queueing- delay; your- model’s- real- throughput- under- load; whether- a- tool- call- or- a- rerank- step- is- the- real- blocker; whether- the- delay- sits- in- your- app,- your- network- or- the- vendor; whether- a- user-perceived- slowdown- is- caused- by- one- long- request- or- many- small- ones.

It- can- only- show- you- how- to- stop- guessing.

##- What- would- change- the- advice

The- guidance- above- assumes- the- bottleneck- is- in- the- request- path- and- can- be- fixed- without- adding- systemic- risk.- That- assumption- breaks- down- in- three- cases:

When- latency- optimisation- adds- complexity- without- reducing- user-perceived- delay.- Adding- streaming- proxies,- cache- layers- or- multi-region- routing- introduces- new- failure- modes- —- dropped- connections,- stale- cache- entries,- routing- misconfiguration.- If- the- user- is- already- reading- the- response- faster- than- the- model- produces- it,- optimising- further- does- not- improve- the- experience. When- the- bottleneck- is- user- reading- speed,- not- model- speed.- A- 300-word- answer- produced- in- 0.8- seconds- feels- instant.- Trying- to- squeeze- that- to- 0.4- seconds- by- switching- providers- or- adding- infrastructure- is- engineering- theatre- —- no- user- notices- the- difference. When- a- provider- ships- a- hardware- generation- that- changes- the- latency- baseline.- If- a- provider- deploys- new- inference- hardware- (next-generation- GPUs,- custom- ASICs)- or- changes- its- routing- layer,- the- assumptions- that- guided- your- optimisation- may- no- longer- hold.- Re-check- latency- distributions- quarterly,- not- just- when- something- feels- slow.

If- any- of- these- apply,- stop- optimising- and- measure- again.

- - Editor's- Note - -

Quarterly- latency- re-benchmarking- is- not- paranoia- —- it- is- operational- hygiene.- Provider- routing- changes,- model- version- swaps,- and- silent- infrastructure- migrations- happen- without- announcement.- A- feature- that- was- fast- in- March- can- be- slow- in- June- with- no- code- changes- on- your- side.- Set- a- calendar- reminder.- Better- yet,- automate- it:- a- simple- cron- job- that- calls- your- production- endpoint- and- logs- TTFT- and- total- time- gives- you- a- trail- when- users- start- complaining.

If- any- of- these- apply,- stop- optimising- and- measure- again.

##- Regional- caveats

Latency- optimisation- advice- is- universal,- but- the- specific- levers- vary- by- region:

UK/Europe:- Provider- endpoint- location- matters.- A- request- routed- to- a- US-west- data- centre- adds- 80–150- ms- of- round-trip- time- compared- to- a- London- or- Frankfurt- endpoint.- Several- providers- offer- European- inference- endpoints- with- data-sovereignty- guarantees;- check- whether- your- account- is- configured- to- prefer- them. Asia-Pacific- /- South- America:- Provider- coverage- is- thinner.- Fewer- regional- inference- endpoints- mean- higher- baseline- network- latency.- This- makes- prompt-size- optimisation- proportionally- more- valuable- —- every- unnecessary- token- costs- extra- round-trip- time. Global- /- multi-region- deployments:- If- your- users- are- distributed- across- regions,- a- single- provider- endpoint- will- serve- some- well- and- others- poorly.- Consider- multi-region- routing- or- at- minimum- measure- your- actual- latency- distribution- by- user- region- before- declaring- a- provider- “slow.”

The- useful- caution- is- universal:- latency- should- be- measured- on- the- live- request- path,- not- inferred- from- a- model- page- or- a- marketing- claim.- But- the- baseline- you- are- optimising- against- depends- on- where- you- and- your- users- are.ion- before- declaring- a- provider- “slow.”

##- Related- guides- guides

What- is- a- token,- and- why- does- it- affect- AI- cost? Context- windows- explained:- why- bigger- is- not- always- better API- model- pricing:- input,- output,- cache- and- batch- costs Rate- limits- explained:- requests,- tokens,- tiers- and- hidden- launch- risks

##- Methodology

Data- checked:- 2026-05-28 Sources- consulted:- OpenAI- streaming- documentation,- Anthropic- streaming- documentation,- Google- Gemini- streaming- documentation,- OpenTelemetry- tracing- specification Assumptions:- No- hands-on- latency- benchmark- is- claimed.- Provider- load- changes- over- time.- Timing- numbers- vary- by- region,- tenant- and- workload.- User-perceived- speed- is- not- the- same- as- raw- API- completion- time. Limitations:- This- article- does- not- benchmark- specific- providers,- does- not- provide- legal- or- SLA- advice,- and- does- not- cover- hardware-level- latency- optimisation.- Worked- example- numbers- are- illustrative,- drawn- from- documented- team- experiences,- not- from- a- controlled- experiment. Jurisdiction:- Global.- Regional- caveats- cover- UK/Europe,- Asia-Pacific/South- America,- and- multi-region- deployments.

##- Source- list

OpenAI- streaming- documentation- —- https://platform.openai.com/docs/guides/streaming- (accessed- 2026-05-28) Anthropic- streaming- documentation- —- https://docs.anthropic.com/en/docs/build-with-claude/streaming- (accessed- 2026-05-28) Google- Gemini- streaming- documentation- —- https://ai.google.dev/gemini-api/docs/streaming- (accessed- 2026-05-28) OpenTelemetry- documentation- —- https://opentelemetry.io/docs/- (accessed- 2026-05-28)

##- Trust- Stack

Last- checked:- 2026-05-28 Corrections:- Contact- us- to- report- errors

##- Change- log

2026-05-28:- Full- editorial- review- against- 16-gate- checklist.- Added- 3- Editor’s- Note- cards,- slugified- heading- IDs,- standardised- Methodology- and- Source- List- formats,- added- Trust- Stack- section- with- corrections- policy- and- affiliation- declaration,- removed- inline- citation- numbers,- removed- workflow- leak- reference,- corrected- frontmatter- writtenBy- label. 2026-05-25:- Integrated- Editor’s- Notes,- added- inline- citations,- fixed- related-guide- links- to- production- routes,- added- worked- example- with- concrete- numbers,- replaced- Global- applicability- with- regional- caveats,- added- What- would- change- the- advice- section. 2026-05-22:- First- published.