Rerankers explained: the quiet quality layer in RAG systems
A reranker takes the candidate results from vector search or keyword retrieval and scores them again with a stricter, query-aware relevance model. Reranking is not a cure for bad retrieval design — if the candidate set is poor, the reranker can only improve it a little. But when the first-pass retriever is broad enough to find candidates yet too loose to pick the best one reliably, a reranker often turns a noisy RAG system into one users trust.
Quick answer
Add a reranker only after you have confirmed that the base retriever returns plausible candidates (top-10 or top-20) but ranks the truly relevant one outside position 1–3 more than 20% of the time. Measure whether the extra 100–500 ms per query (typical API reranker latency for a batch of 10–20 candidates) actually improves the answers your users see. If it does not, fix chunking or metadata first — reranking covers gaps in retrieval, not broken foundations. [1][2]
What this means
A RAG pipeline typically has two stages:
- First-stage retriever (bi-encoder or embedding model): embeds the query, compares it against document embeddings computed ahead of time at indexing, and returns the top-N nearest by cosine similarity. This is fast and cheap (5–50 ms) but can be fuzzy: two documents with similar vocabulary may get high scores even if only one answers the specific question. [1]
- Reranker (cross-encoder): takes each retrieved candidate paired with the original query and scores the pair directly through a deeper transformer that considers token-level interaction between the query and the candidate. This is slower (10–40 ms per pair, so roughly 100–400 ms for 10 candidates scored one after another, before any network overhead) but much more precise: it catches semantic mismatch that cosine similarity misses. [1][2]
The reranker sits between the retriever and the LLM prompt. It does not replace the first-stage retriever — it reorders its output so the LLM receives a tighter, more relevant context window.
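The two-stage shape described above can be sketched in a few lines. This is a minimal illustration, not a production retriever: the term-count cosine stands in for a real embedding model, and `toy_pair_score` is a hypothetical placeholder for a cross-encoder (a real one would run the query and candidate through a transformer together). All function names and the sample documents are assumptions made for this sketch.

```python
import math
import re
from collections import Counter
from typing import Callable

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, docs: list[str], top_n: int) -> list[str]:
    """Stage 1: each document is scored on its own vector (bi-encoder stand-in)."""
    q = Counter(tokens(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokens(d))), reverse=True)
    return ranked[:top_n]

def rerank(query: str, candidates: list[str],
           score_pair: Callable[[str, str], float], top_k: int) -> list[str]:
    """Stage 2: score each (query, candidate) pair jointly, then reorder."""
    scored = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
    return scored[:top_k]

def toy_pair_score(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: share of query terms found in the doc."""
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q) if q else 0.0

query = "delta 2 warranty uk"
docs = [
    "delta 2 review delta 2 comparison",
    "the warranty terms that apply to a delta 2 bought in the uk are listed here",
    "camping generators guide",
]
candidates = retrieve(query, docs, top_n=2)                    # broad, cheap first pass
context = rerank(query, candidates, toy_pair_score, top_k=1)  # stricter second pass
print(context)
```

The reranker only ever sees what `retrieve` returned, which is why a poor candidate set limits how much it can help.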
Concrete worked example
Query: “What are the warranty terms for the EcoFlow Delta 2 portable power station when purchased in the UK?”
Initial vector search top-5 (embedding similarity scores):
- “EcoFlow Delta 2 Review — 2025 comparison of portable power stations” (0.87) — partial match, mentions the product but not warranty
- “EcoFlow UK warranty policy PDF” (0.84) — URL fragment only, no rendered content in the chunk
- “Best portable generators for camping 2025” (0.81) — general category, mentions EcoFlow once
- “EcoFlow Delta 2 User Manual v3.2 (includes warranty section)” (0.79) — correct document but the chunk boundary stopped before the warranty paragraph
- “UK consumer rights — returning faulty electronics after 30 days” (0.77) — tangentially relevant
After reranker reordering (cross-encoder relevance scores):
- “EcoFlow Delta 2 User Manual v3.2 (includes warranty section)” (0.94) — the reranker recognised the query-warranty-manual alignment despite the chunk boundary issue
- “EcoFlow UK warranty policy PDF” (0.89) — confirmed relevance despite thin initial embedding match
- “EcoFlow Delta 2 Review — 2025 comparison of portable power stations” (0.65) — demoted: mentions the product but does not answer the question
- “UK consumer rights — returning faulty electronics after 30 days” (0.58) — contextually useful as a secondary source
- “Best portable generators for camping 2025” (0.31) — correctly pushed down as irrelevant
What changed: The reranker rescued the manual (moved from position 4 to 1) and kept the warranty policy PDF at position 2, while demoting the review (position 1 to 3) and the camping article (position 3 to 5). Without the reranker, a top-3 context window would have included two documents that do not answer the question (the review and the camping article), and the LLM would likely have produced a vague or incomplete answer. [1][2]
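The position changes in this example can be tabulated mechanically. The sketch below uses the scores listed above with shortened titles. The helper name `rank_moves` is an assumption for illustration. Note that the two score columns come from different models and are not on a shared scale, so only the positions are compared.

```python
first_pass = [
    ("Delta 2 Review", 0.87),
    ("UK warranty policy PDF", 0.84),
    ("Best portable generators for camping", 0.81),
    ("Delta 2 User Manual v3.2", 0.79),
    ("UK consumer rights", 0.77),
]
reranked = [
    ("Delta 2 User Manual v3.2", 0.94),
    ("UK warranty policy PDF", 0.89),
    ("Delta 2 Review", 0.65),
    ("UK consumer rights", 0.58),
    ("Best portable generators for camping", 0.31),
]

def rank_moves(before, after):
    """Return (title, old_position, new_position) in reranked order, 1-indexed."""
    old_pos = {title: i for i, (title, _) in enumerate(before, start=1)}
    return [(title, old_pos[title], new) for new, (title, _) in enumerate(after, start=1)]

moves = rank_moves(first_pass, reranked)
for title, old, new in moves:
    print(f"{title}: {old} -> {new}")

# Titles that a top-3 context window gains or loses after reranking
top3_before = {t for t, _ in first_pass[:3]}
top3_after = {t for t, _ in reranked[:3]}
print("entered top-3:", top3_after - top3_before)
print("left top-3:", top3_before - top3_after)
```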
Where teams get it wrong
Using reranking to compensate for weak chunking or bad metadata. Consequence: you pay 100–500 ms of latency per query for marginal precision gains, while the root cause (documents split mid-paragraph, missing metadata fields) continues degrading every retrieval. Fix: audit your chunking strategy and metadata coverage before buying rerank API credits. A reranker should polish a working retriever, not rescue a broken one. [1]
Ignoring latency cost. Consequence: the reranker adds 100–500 ms per query (more for batch sizes above 20), which compounds under load and makes the system feel sluggish for interactive use cases. Fix: measure p95 latency end-to-end with and without the reranker. If the base retriever already returns the right document in position 1–3 for more than 80% of queries, the reranker’s marginal gain is likely not worth the latency budget. A typical API-based reranker (e.g., Cohere Rerank 3.5 or Voyage rerank-2) costs roughly $0.0005–$0.001 per 1,000 pairs, plus the latency. Self-hosted cross-encoders (e.g., BAAI/bge-reranker-v2-m3) eliminate per-call cost but require GPU inference infrastructure (1–2 GB VRAM for the model, variable inference time depending on hardware). [1][2]
Assuming more ranking layers always mean better answers. Consequence: teams layer reranking on top of reranking, adding cost and latency while the extra precision is often negligible from the second layer onward. Fix: test with a single reranker first. Measure answer quality before and after. Only add a second reranker if the single layer’s precision is measurably below your threshold (e.g., less than 85% relevant documents in the top-3). No published benchmark shows a meaningful gain from multi-layer reranking for standard RAG use cases. [2]
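The p95 comparison recommended under "Ignoring latency cost" can be sketched as follows. This is a minimal harness under stated assumptions: `run_query` is a hypothetical callable that executes your pipeline for one query, and the percentile uses the simple nearest-rank definition rather than interpolation.

```python
import time
from typing import Callable, Iterable

def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("no latency samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]

def time_pipeline(run_query: Callable[[str], object], queries: Iterable[str]) -> list[float]:
    """Run each query once and record wall-clock latency in milliseconds."""
    samples = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def added_p95_ms(without_ms: list[float], with_ms: list[float]) -> float:
    """How much the reranker adds at the 95th percentile."""
    return p95(with_ms) - p95(without_ms)

# Illustrative numbers only, not measurements
print(added_p95_ms([12, 15, 11, 40, 14], [160, 210, 150, 420, 180]))
```

Run both variants over the same query set and under realistic concurrency, since a single-threaded loop will understate latency under load.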
Practical decision check
Score your RAG pipeline against these questions:
- Is the first-stage retriever finding enough plausible candidates? Run a baseline: for 50 representative queries, does the base retriever return at least one genuinely relevant document in its top-10 at least 80% of the time? If not, fix chunking and metadata before considering a reranker. [1][2]
- Does the reranker improve the user-visible outcome enough to justify the latency? A/B test: run 100 queries with and without the reranker. If the answer quality (judged by your team against a rubric) improves by less than 10%, the latency cost is too high. Typical API reranker cost is ~$0.001 per 1,000 candidate pairs, but the user-perceived latency is usually the binding constraint, not the dollar cost. [1][3]
- Can you evaluate the reranker against the real task, not just abstract relevance scores? NDCG@10 and MRR measure ranking quality but do not tell you whether the LLM produces a better answer. Run an answer-quality evaluation: compare LLM outputs with and without the reranker and judge completeness, accuracy, and hallucination rate. [3][4]
- Is your candidate pool deep enough to benefit? Rerankers work best with 10–20 candidates. Fewer than 5 leaves too little room for reordering; more than 50 adds latency with diminishing returns. Test your pool depth before committing. [1]
- Do you have the infrastructure to measure the reranker’s effect? Without logging per-query retrieval order, reranker scores, and final LLM answer, you cannot tell whether the reranker is helping or just adding cost. See “RAG evaluation: checking retrieval before blaming the model” for a logging template. [4]
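The baseline checks in this list can be turned into a small script once you have a labelled query set. The sketch below assumes you have, for each query, the ranked document IDs from the base retriever and a set of IDs judged relevant. The function names, sample IDs, and returned messages are illustrative, and the 80% thresholds are the ones used on this page. It covers ranking quality only; answer quality still needs a separate review.

```python
def hit_rate_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Share of queries with at least one relevant document in the top k."""
    hits = sum(1 for r, rel in zip(ranked, relevant) if any(d in rel for d in r[:k]))
    return hits / len(ranked)

def mrr(ranked: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant document (0 if none is found)."""
    total = 0.0
    for r, rel in zip(ranked, relevant):
        for pos, doc in enumerate(r, start=1):
            if doc in rel:
                total += 1.0 / pos
                break
    return total / len(ranked)

def reranker_gate(ranked: list[list[str]], relevant: list[set[str]]) -> str:
    """Apply this page's rule of thumb to a labelled baseline run."""
    if hit_rate_at_k(ranked, relevant, 10) < 0.80:
        return "fix chunking and metadata first"
    if hit_rate_at_k(ranked, relevant, 3) > 0.80:
        return "retriever already precise; expect little gain"
    return "candidate for a reranker A/B test"

# Hypothetical labelled run: five queries, document IDs are placeholders
ranked = [["a", "b", "c", "d", "e"], ["a", "b", "c"], ["x", "y", "z", "w", "v", "u"],
          ["m", "n"], ["p", "q", "r", "s"]]
relevant = [{"d"}, {"a"}, {"u"}, {"n"}, {"s"}]

print(hit_rate_at_k(ranked, relevant, 10), hit_rate_at_k(ranked, relevant, 3))
print(reranker_gate(ranked, relevant))
```

Running the same functions on the reranked order gives a before/after comparison of top-3 hit rate and MRR on identical queries.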
What would change this advice
- Your retriever is already precise (top-1 correct >80%). Reranking adds latency and cost for minimal gain. Skip it until your user base or query diversity increases. [1]
- Your candidate pool is too small (fewer than 5 candidates). A reranker cannot improve precision from a pool that barely exists. Enlarge the pool or fix chunking first.
- Your queries are short and factual, not nuanced. For simple lookups (“weather in Tokyo”) the base retriever is usually sufficient. Rerankers help most with ambiguous, multi-fact, or comparison-style queries. [1][2]
- You are on a strict latency budget (<200 ms per query, end-to-end). Skip the reranker or self-host a lightweight cross-encoder on a GPU (e.g., ms-marco-MiniLM-L-6-v2, roughly 10–50 ms for 10 candidates per the table below; the 100–300 ms CPU range can consume the whole budget). An API reranker call alone may add 100–500 ms, blowing your budget. [2]
- Your model provider ships a reranker integrated into the retriever (e.g., Cohere Rerank 3.5 bundled with embeddings). Try that before adding a separate layer — latency may be lower with tighter integration. [1]
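The latency-budget scenario above comes down to simple arithmetic, sketched here. The model is deliberately crude and is an assumption: it treats pairs as scored one after another at a fixed per-pair cost, whereas batched GPU inference or an API call with fixed network overhead will behave differently. The function names are illustrative.

```python
def rerank_latency_ms(n_candidates: int, per_pair_ms: float, overhead_ms: float = 0.0) -> float:
    """Estimated reranker time under a sequential per-pair cost model."""
    return overhead_ms + n_candidates * per_pair_ms

def max_candidates(budget_ms: float, other_stages_ms: float, per_pair_ms: float) -> int:
    """Largest candidate pool that still fits the end-to-end budget."""
    if per_pair_ms <= 0:
        raise ValueError("per_pair_ms must be positive")
    remaining = budget_ms - other_stages_ms
    return int(remaining // per_pair_ms) if remaining > 0 else 0

# Figures from this page: 200 ms budget, 50 ms first-stage retrieval, 10-40 ms per pair
print(rerank_latency_ms(10, 40))    # slow end of the per-pair range
print(max_candidates(200, 50, 10))  # fast end of the per-pair range
print(max_candidates(200, 50, 40))  # slow end of the per-pair range
```

Treat the result as a planning estimate and confirm it with measured p95 latency in your own stack.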
Specific models and providers (as of 2026-05-25)
| Provider / Model | Type | Typical latency (10 candidates) | Cost | Notes |
|---|---|---|---|---|
| Cohere Rerank 3.5 | API (cross-encoder) | 150–400 ms | ~$0.001 per 1K pairs | English-optimised, supports batched queries. Best integration with Cohere embeddings. |
| Voyage rerank-2 | API (cross-encoder) | 100–350 ms | ~$0.0005 per 1K pairs | Multilingual, supports code-aware reranking. |
| BAAI/bge-reranker-v2-m3 | Self-hosted (cross-encoder) | 30–150 ms (GPU) / 200–600 ms (CPU) | Free (MIT licence, ~1 GB VRAM) | Strong for Chinese + English. No per-call cost after infrastructure. |
| ms-marco-MiniLM-L-6-v2 | Self-hosted (cross-encoder) | 10–50 ms (GPU) / 100–300 ms (CPU) | Free (MIT licence, ~0.5 GB VRAM) | Lightweight, good for strict latency budgets. Lower absolute precision than bge-reranker. |
Latency ranges are estimates for a single query with 10 candidates on moderate hardware (GPU: T4; CPU: 8-core x86). Actual performance varies by batch size, model version, and hardware. Measure in your own stack. [1][2]
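To put the cost column in perspective, the per-pair prices in the table can be multiplied out for a given traffic level. The query volume below is a made-up assumption for illustration, and the prices are the approximate figures from the table, which should be checked against current provider pricing.

```python
def monthly_rerank_cost(queries_per_month: int, candidates_per_query: int,
                        price_per_1k_pairs: float) -> float:
    """Each candidate forms one (query, candidate) pair to be scored."""
    pairs = queries_per_month * candidates_per_query
    return pairs / 1000 * price_per_1k_pairs

# Hypothetical load: 1,000,000 queries per month, 10 candidates each
for name, price in [("Cohere Rerank 3.5", 0.001), ("Voyage rerank-2", 0.0005)]:
    print(f"{name}: ${monthly_rerank_cost(1_000_000, 10, price):.2f} per month")
```

At these table prices the dollar cost stays small even at high volume, which is consistent with the point that latency is usually the binding constraint.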
Methodology and sources
Check date: 2026-05-25
What was checked: retrieval quality, ranking and latency documentation for reranker models and providers
What the sources were used for:
- [1] Cohere Rerank documentation (reranking methodology, latency benchmarks, cost estimates, integration patterns) — used to build the cross-encoder vs. bi-encoder explanation, the latency/cost table for Cohere Rerank 3.5, the worked example structure (query + candidate + reranker reordering), and the “What would change this advice” criteria around retriever precision and latency budgets. Cohere’s published latency benchmarks for 10-candidate batches directly informed the 150–400 ms range quoted in the table.
- [2] Voyage AI Rerank documentation (rerank-2 specifications, multilingual support, cost per pair, latency data) — used for Voyage’s latency/cost entry in the provider comparison table, the batch depth rule (10–20 candidates as optimal range), and the self-hosted vs. API trade-off analysis. Voyage’s published cost of $0.0005 per 1K pairs informed the maximum-economy option in the decision check.
- [3] Elasticsearch script-score query documentation (first-stage retrieval fine-tuning, function scoring) — used for the first-stage retriever baseline evaluation guidance and as a reference for readers who want to improve the base retriever without adding a separate reranker layer. Elastic’s documentation on relevance tuning informed the “fix chunking before reranking” advice.
- [4] OpenAI Embeddings documentation (embedding similarity scoring, cosine distance limitations, retrieval evaluation guidance) — used for the bi-encoder limitation explanation (cosine distance missing semantic nuance), the answer-quality evaluation rubric in the decision check, and the logging/telemetry recommendations. OpenAI’s guidance on evaluating embedding quality against task-level outcomes directly informed the “evaluate against real task, not abstract scores” principle.
Assumptions and limits:
- The base retriever already returns at least 5–10 plausible candidates per query.
- The team can instrument and measure latency and answer quality in their own stack.
- This is operational guidance based on provider documentation and published benchmarks as of the check date. Reranker model versions, pricing, and latency profiles change; verify against current provider docs before making infrastructure decisions.
- Self-hosted cross-encoder latency ranges assume moderate hardware and are estimates only. Measure in your own environment.
What this page cannot tell you
This page cannot tell you which reranker model is best for your specific stack, language mix, or latency budget. It can only help you decide whether the extra quality layer is measurably paying for itself, and give you named starting points to evaluate.
Change log
- 2026-05-25: major revision — added concrete worked example (EcoFlow query with before/after reordering table), cross-encoder vs. bi-encoder mechanism explanation, provider comparison table with model names/latency/cost, “What would change this advice” section with 5 scenarios, inline citations [1][2][3][4] throughout body and methodology, expanded Where-teams-get-it-wrong with consequences and fixes, and self-assessment scoring guidance.
- 2026-05-24: first draft built from the llm-editor-approved brief.
Source list
- [1] Cohere Rerank documentation — https://docs.cohere.com/docs/reranking
- [2] Voyage AI Rerank documentation — https://docs.voyageai.com/docs/reranking
- [3] Elasticsearch script-score query — https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html
- [4] OpenAI Embeddings documentation — https://platform.openai.com/docs/guides/embeddings