Long-context benchmarks: needle tests, document QA and real recall
A model scores 99% on the needle-in-haystack test. The press release claims it can “reason over 200K tokens.” You feed it a 150-page legal brief and ask for a summary of conflicting clauses across sections five, twelve and twenty-three. The answer is shallow, misses a contradiction, and invents a provision that is not there.
The needle test was not lying — it was measuring something narrower than what most teams actually need.
Quick answer
Long-context benchmarks measure different things. Needle-in-haystack tests whether a model can retrieve a single fact embedded anywhere in a long context. Document QA tests multi-fact retrieval and synthesis. Real recall tests whether the model can accurately use information from the middle of a long context, the region where accuracy most often drops. These are three distinct capabilities, and a model that passes one will not necessarily pass the other two.
Before picking a model for a long-context workload, test it on the specific retrieval and synthesis pattern your application needs, not on the benchmark the vendor chose to publish.
What this means
Needle-in-haystack (NIAH) is the simplest test: place a single target fact somewhere in a long prompt and ask the model to retrieve it. Many recent models score well on this even at extreme lengths (1M+ tokens) because the task is essentially a lookup with a single, clearly marked retrieval target. It tests whether the model can attend to any position in context, but it does not test reasoning over multiple pieces of information.
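As a rough illustration of how such a test is assembled, here is a minimal sketch in Python. The function names `build_niah_prompt` and `is_retrieved`, the prompt layout, and the substring-based scoring are all assumptions made for this example; they are not taken from Kamradt's implementation or any other published harness.

```python
def build_niah_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    # Convert the relative depth into a sentence index.
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def is_retrieved(model_answer, expected):
    """Loose scoring (an assumption): the expected string appears in the answer."""
    return expected.lower() in model_answer.lower()


# Hypothetical filler text; a real test would use long, natural prose.
filler = [f"Filler sentence number {i}." for i in range(10)]
prompt = build_niah_prompt(
    filler,
    needle="The secret ingredient is kale.",
    depth=0.5,
    question="What is the secret ingredient?",
)
```

Because there is one needle and one question, a pass here says only that the model can find a single marked fact at that depth and length.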
Document QA benchmarks (like HotpotQA, QMSum, or LongBench) require retrieving and combining information from multiple locations. These are harder because the model must identify which parts of the context are relevant, hold multiple facts in attention simultaneously, and synthesise them into an answer. Performance drops significantly beyond 32K tokens on most models.
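To make the difference from single-fact retrieval concrete, the sketch below scores an answer against several required facts at once. The helper names `fact_recall` and `is_complete`, the substring matching, and the example answer are hypothetical; published document QA benchmarks use their own metrics (for example F1 or ROUGE), so treat this only as an illustration of why partial retrieval is not enough.

```python
def fact_recall(answer, required_facts):
    """Fraction of required facts that appear in the answer (case-insensitive substring match)."""
    if not required_facts:
        raise ValueError("required_facts must not be empty")
    answer_lower = answer.lower()
    found = [fact for fact in required_facts if fact.lower() in answer_lower]
    return len(found) / len(required_facts)


def is_complete(answer, required_facts):
    """Strict scoring: the answer counts only if every required fact is present."""
    return fact_recall(answer, required_facts) == 1.0


# Hypothetical example: three notice periods scattered across a long contract.
facts = ["30 days", "60 days", "90 days"]
answer = "Section 5 requires 30 days of notice, while section 12 requires 60 days."

recall = fact_recall(answer, facts)      # two of the three facts were found
complete = is_complete(answer, facts)    # False: one fact is missing
```

An answer like this one would look plausible to a casual reader, yet it silently drops a third of the relevant material.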
Real recall (or “lost-in-the-middle”) tests check whether the model can retrieve information from the middle of a long context. Research from Liu et al. (2024) and others shows that many models perform worse on facts placed in the middle of their context window, even when they retrieve the same fact reliably at the start or end. This matters for chat history, long documents, and any application where the relevant information is not always at the beginning.
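One way to look for this effect in your own results is to group test outcomes by where the target fact was placed. The sketch below assumes each result is a `(depth, correct)` pair, with depth running from 0.0 (start of context) to 1.0 (end). The three-way split into thirds and the sample values are illustrative choices for this example, not measurements or a method from the cited research.

```python
from collections import defaultdict


def accuracy_by_position(results):
    """Group (depth, correct) pairs into start/middle/end buckets and return accuracy per bucket."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [hits, count]
    for depth, correct in results:
        if depth < 1 / 3:
            bucket = "start"
        elif depth < 2 / 3:
            bucket = "middle"
        else:
            bucket = "end"
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {bucket: hits / count for bucket, (hits, count) in totals.items()}


# Made-up outcomes for illustration only.
sample_results = [
    (0.0, True), (0.1, True),
    (0.4, False), (0.5, True), (0.6, False),
    (0.9, True), (1.0, True),
]
by_position = accuracy_by_position(sample_results)
```

If the middle bucket scores clearly lower than the other two across enough trials, position sensitivity is likely to affect your workload.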
Where teams misuse long-context benchmarks
- Equating needle test success with document comprehension. A model that passes NIAH at 200K tokens may still fail on a 50-page contract summary task. The needle test measures retrieval, not synthesis. They are different capabilities.
- Citing maximum context length as usable context length. The marketing number is the maximum input the model accepts. The usable context length, where retrieval and synthesis remain reliable, is typically much lower. A model that accepts 200K tokens may degrade noticeably after 32K.
- Benchmarking on short examples and drawing long-context conclusions. A model that scores well on 4K-token LongBench subsets may not generalise to 100K-token real documents. Test at your actual workload size.
- Overinterpreting synthetic benchmarks. NIAH tests use artificial facts (“The secret ingredient is kale”). Performance on these does not reliably predict performance on real documents with complex relationships and domain-specific language.
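The gap between maximum and usable context length can be estimated from your own measurements. The sketch below assumes you have already measured task accuracy at several context lengths. The function name `usable_context_length`, the 0.9 threshold, and the sample numbers are assumptions chosen for illustration; pick a threshold that matches your application's tolerance for errors.

```python
def usable_context_length(accuracy_by_length, threshold=0.9):
    """Return the longest tested length up to which accuracy never falls below `threshold`.

    Returns 0 if even the shortest tested length is below the threshold.
    """
    usable = 0
    for length in sorted(accuracy_by_length):
        if accuracy_by_length[length] < threshold:
            break  # stop at the first failure, even if a longer length recovers
        usable = length
    return usable


# Made-up accuracies for a hypothetical model that accepts 128K tokens.
measured = {4_000: 0.98, 16_000: 0.95, 32_000: 0.91, 64_000: 0.80, 128_000: 0.92}
usable = usable_context_length(measured)
```

Stopping at the first failure is deliberate: a model that recovers at 128K after failing at 64K is not reliable across the range in between.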
Practical decision check
- Does your workload need single-fact retrieval across long contexts? Needle tests are relevant.
- Does it need multi-fact synthesis? Use document QA benchmarks at your actual workload length.
- Does information arrive in the middle of the context (chat history, long document processing)? Prioritise lost-in-the-middle test results.
- Is your context shorter than 32K tokens? Most modern models handle this well. Long-context benchmarking matters mainly above 32K.
- Are you comparing models for a specific long-context task? Build your own test set from real examples rather than relying on published benchmarks.
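For the last item, a home-grown test set does not need much machinery. The sketch below is a minimal harness under stated assumptions: `LongContextCase`, `run_suite`, and the `ask_model` callable are hypothetical names, the prompt layout is arbitrary, and `fake_model` is a stand-in that you would replace with a call to your actual model. It requires Python 3.9 or later.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LongContextCase:
    document: str               # the real document text, at real workload length
    question: str               # a question your application would actually ask
    required_facts: list[str]   # strings a correct answer must contain


def run_suite(cases: list[LongContextCase], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of cases whose answer contains every required fact."""
    if not cases:
        raise ValueError("cases must not be empty")
    passed = 0
    for case in cases:
        prompt = f"{case.document}\n\nQuestion: {case.question}\nAnswer:"
        answer = ask_model(prompt).lower()
        if all(fact.lower() in answer for fact in case.required_facts):
            passed += 1
    return passed / len(cases)


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same answer."""
    return "The 30-day and 60-day notice periods conflict."


cases = [
    LongContextCase(
        document="Clause A sets a 30-day notice period. Clause B sets a 60-day notice period.",
        question="Which notice periods conflict?",
        required_facts=["30-day", "60-day"],
    ),
    LongContextCase(
        document="The warranty lasts 12 months.",
        question="How long is the warranty?",
        required_facts=["12 months"],
    ),
]
pass_rate = run_suite(cases, fake_model)
```

Running the same cases against each candidate model, with documents at your real length, gives a comparison that reflects your workload rather than a vendor's chosen benchmark.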
Methodology and sources
Check date: 2026-05-25
What was checked: Needle-in-haystack implementations (Greg Kamradt’s original test and variants), LongBench, RULER, HELM long-context evaluations, and the “lost in the middle” literature.
Assumptions and limits: Long-context performance varies by data type (code vs prose vs structured data), by prompt formatting, and by model implementation details. No single benchmark captures all long-context use cases.
Source list
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al. 2024) — https://arxiv.org/abs/2307.03172
- LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding — https://arxiv.org/abs/2308.14508
- RULER: What’s the Real Context Length of Your LLM? — https://arxiv.org/abs/2404.06654
- Kamradt’s Needle In A Haystack test — https://github.com/gkamradt/LLMTest_NeedleInAHaystack
- HELM long-context evaluation results — https://crfm.stanford.edu/helm/latest/
Related guides
- How LLM benchmarks work, and what they miss
- Contamination and leakage: why benchmark scores can be too good
- Context windows explained: why bigger is not always better
- RAG evaluation: checking retrieval before blaming the model
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.