Rerankers explained: the quiet quality layer in RAG systems

A reranker takes the candidate results from vector search or keyword retrieval and scores them again with a stricter, query-aware relevance model. Reranking is not a cure for bad retrieval design — if the candidate set is poor, the reranker can only improve it a little. But when the first-pass retriever is broad enough to find candidates yet too loose to pick the best one reliably, a reranker often turns a noisy RAG system into one users trust.

TL;DR

Add a reranker only after you have confirmed that the base retriever returns plausible candidates (top-10 or top-20) but ranks the truly relevant one outside positions 1–3 more than 20% of the time. Then measure whether the extra 100–500 ms per query (typical API reranker latency for a batch of 10–20 candidates) improves the answers your users see, and whether that gain justifies the latency and cost for your workload. If it does not, fix chunking or metadata first: reranking covers gaps in retrieval, not broken foundations. [1][2]
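
The check described above can be sketched in a few lines. This is a minimal illustration, not a standard metric implementation: the function names, the shape of the labelled evaluation set (a ranked list of document IDs plus one known-relevant ID per query), and the choice to divide by all queries are assumptions made for the example.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def position_of(ranked_ids: Sequence[str], relevant_id: str) -> Optional[int]:
    """Return the 1-based position of relevant_id, or None if it is absent."""
    for pos, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return pos
    return None


def retrieval_report(
    eval_set: List[Tuple[Sequence[str], str]],
    top_k: int = 3,         # "position 1-3" from the TL;DR
    pool_size: int = 10,    # candidate pool handed to a reranker
    threshold: float = 0.20,
) -> Dict[str, object]:
    """Summarise how often the relevant document is found but ranked too low."""
    in_pool = 0
    ranked_low = 0
    for ranked_ids, relevant_id in eval_set:
        pos = position_of(ranked_ids[:pool_size], relevant_id)
        if pos is None:
            continue  # retriever missed it entirely: a reranker cannot help
        in_pool += 1
        if pos > top_k:
            ranked_low += 1
    total = len(eval_set)
    ranked_low_rate = ranked_low / total if total else 0.0
    return {
        "pool_recall": in_pool / total if total else 0.0,
        "ranked_low_rate": ranked_low_rate,
        "reranker_worth_testing": ranked_low_rate > threshold,
    }


# Hypothetical labelled queries: (first-pass ranking, ID of the relevant doc).
eval_set = [
    (["a", "b", "c", "d"], "a"),
    (["a", "b", "c", "d", "e"], "e"),
    (["a", "b", "c", "d"], "d"),
    (["a", "b"], "z"),
    (["a", "b", "c"], "b"),
]
report = retrieval_report(eval_set)
print(report)
```

A low pool_recall points at chunking or metadata problems; a high ranked_low_rate with good pool_recall is the situation where a reranker is worth testing.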

What this means

A RAG pipeline typically has two stages:

First-stage retriever (bi-encoder or embedding model): converts the query and the documents into vector embeddings (documents are usually embedded ahead of time) and finds the top-N most similar by cosine similarity. This is fast and cheap (5–50 ms) but can be fuzzy: two documents with similar vocabulary may get high scores even if only one answers the specific question. [1]
Reranker (cross-encoder): takes each retrieved candidate paired with the original query and scores the pair directly through a deeper transformer that considers token-level interaction between the query and the candidate. This is slower (10–40 ms per pair, so roughly 100–400 ms for 10 candidates scored one at a time) but much more precise: it catches semantic mismatches that embedding similarity misses. [1][2]

The reranker sits between the retriever and the LLM prompt. It does not replace the first-stage retriever — it reorders its output so the LLM receives a tighter, more relevant context window.
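
The two stages can be sketched as follows. This is a toy illustration under stated assumptions: the two-dimensional vectors are made up, and overlap_score is a hypothetical stand-in for a real cross-encoder (it only counts shared tokens), so the example shows the shape of the pipeline rather than real relevance quality.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_n: int,
) -> List[Tuple[str, float]]:
    """Stage 1: rank all documents by embedding similarity, keep the top N."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def rerank(
    query: str,
    candidate_ids: Sequence[str],
    texts: Dict[str, str],
    score_pair: Callable[[str, str], float],
    keep: int,
) -> List[Tuple[str, float]]:
    """Stage 2: score each (query, candidate) pair directly and reorder."""
    scored = [(doc_id, score_pair(query, texts[doc_id])) for doc_id in candidate_ids]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def overlap_score(query: str, text: str) -> float:
    """Placeholder pair scorer (NOT a cross-encoder): share of query tokens in the text."""
    q_tokens = set(query.lower().split())
    d_tokens = set(text.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Made-up data for illustration only.
texts = {
    "review": "Delta 2 review and comparison",
    "manual": "Delta 2 manual warranty terms UK",
    "camping": "camping generators guide",
}
doc_vecs = {"review": [0.9, 0.1], "manual": [0.7, 0.3], "camping": [0.2, 0.9]}
query, query_vec = "Delta 2 warranty terms UK", [1.0, 0.0]

first_pass = retrieve(query_vec, doc_vecs, top_n=2)
reranked = rerank(query, [doc_id for doc_id, _ in first_pass], texts, overlap_score, keep=2)
print(first_pass)
print(reranked)
```

In a real system, score_pair would wrap a cross-encoder model or a reranking API call; the surrounding control flow stays the same.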

Concrete worked example

Query: “What are the warranty terms for the EcoFlow Delta 2 portable power station when purchased in the UK?”

Initial vector search top-5 (embedding similarity scores):

“EcoFlow Delta 2 Review — 2025 comparison of portable power stations” (0.87) — partial match, mentions the product but not warranty
“EcoFlow UK warranty policy PDF” (0.84) — URL fragment only, no rendered content in the chunk
“Best portable generators for camping 2025” (0.81) — general category, mentions EcoFlow once
“EcoFlow Delta 2 User Manual v3.2 (includes warranty section)” (0.79) — correct document but the chunk boundary stopped before the warranty paragraph
“UK consumer rights — returning faulty electronics after 30 days” (0.77) — tangentially relevant

After reranker reordering (cross-encoder relevance scores):

“EcoFlow Delta 2 User Manual v3.2 (includes warranty section)” (0.94) — the reranker recognised the query-warranty-manual alignment despite the chunk boundary issue
“EcoFlow UK warranty policy PDF” (0.89) — confirmed relevance despite thin initial embedding match
“EcoFlow Delta 2 Review — 2025 comparison of portable power stations” (0.65) — demoted: mentions the product but does not answer the question
“UK consumer rights — returning faulty electronics after 30 days” (0.58) — contextually useful as a secondary source
“Best portable generators for camping 2025” (0.31) — correctly pushed down as irrelevant

What changed: The reranker moved the manual from position 4 to 1, kept the warranty policy PDF at position 2, and demoted the review and the camping article. Without the reranker, the review and the camping article would have sat in the top three of the LLM's context window, likely producing a vague or incomplete answer. [1][2]
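
The reordering in this example can be reproduced mechanically. The sketch below uses the scores listed above with shortened document labels (the labels and the rank_changes helper are illustrative names, not part of any library).

```python
from typing import Dict, List, Tuple

# Scores copied from the worked example; labels shortened for readability.
first_pass: List[Tuple[str, float]] = [
    ("Delta 2 review", 0.87),
    ("UK warranty policy PDF", 0.84),
    ("Camping generators", 0.81),
    ("Delta 2 user manual", 0.79),
    ("UK consumer rights", 0.77),
]
rerank_scores: Dict[str, float] = {
    "Delta 2 user manual": 0.94,
    "UK warranty policy PDF": 0.89,
    "Delta 2 review": 0.65,
    "UK consumer rights": 0.58,
    "Camping generators": 0.31,
}


def rank_changes(
    first_pass: List[Tuple[str, float]],
    rerank_scores: Dict[str, float],
) -> List[Tuple[str, int, int]]:
    """Return (document, position before, position after) in reranked order."""
    before = {doc: pos for pos, (doc, _) in enumerate(first_pass, start=1)}
    reranked = sorted(rerank_scores, key=rerank_scores.get, reverse=True)
    return [(doc, before[doc], after) for after, doc in enumerate(reranked, start=1)]


changes = rank_changes(first_pass, rerank_scores)
for doc, before, after in changes:
    print(f"{doc}: {before} -> {after}")
```

Logging position changes like this per query makes it easier to see whether the reranker is doing real work or mostly confirming the first-pass order.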

What would change this advice

Your candidate pool is too small (fewer than 5 candidates). A reranker cannot improve precision from a pool that barely exists. Enlarge the pool or fix chunking first.
Your queries are short and factual, not nuanced. For simple lookups (“weather in Tokyo”) the base retriever is usually sufficient. Rerankers help most with ambiguous, multi-fact, or comparison-style queries. [1][2]
You are on a strict latency budget (<200 ms per query, end-to-end). Skip the reranker or self-host a lightweight cross-encoder (e.g., ms-marco-MiniLM-L-6-v2, estimated in the table below at 10–50 ms on GPU or 100–300 ms on CPU for 10 candidates). An API reranker call alone may add 100–500 ms, blowing your budget. [2]
Your model provider ships a reranker integrated into the retriever (e.g., Cohere Rerank 3.5 bundled with embeddings). Try that before adding a separate layer — latency may be lower with tighter integration. [1]
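
The four scenarios above can be written down as a simple checklist function. This is a sketch only: the Workload fields and the caveat labels are hypothetical names chosen for the example, while the thresholds (fewer than 5 candidates, under 200 ms) come from the list above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workload:
    candidate_pool_size: int
    queries_short_and_factual: bool
    latency_budget_ms: float
    provider_bundles_reranker: bool


def reranker_caveats(w: Workload) -> List[str]:
    """Return the scenarios from the list above that apply to this workload."""
    caveats: List[str] = []
    if w.candidate_pool_size < 5:
        caveats.append("enlarge_pool_or_fix_chunking")
    if w.queries_short_and_factual:
        caveats.append("base_retriever_likely_sufficient")
    if w.latency_budget_ms < 200:
        caveats.append("skip_or_self_host_lightweight")
    if w.provider_bundles_reranker:
        caveats.append("try_integrated_reranker_first")
    return caveats


print(reranker_caveats(Workload(3, False, 150, False)))
print(reranker_caveats(Workload(20, False, 1000, False)))
```

An empty list does not mean a reranker will help; it only means none of these four scenarios argues against trying one.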

Specific models and providers (as of 2026-05-25)

Provider / Model	Type	Typical latency (10 candidates)	Cost	Notes
Cohere Rerank 3.5	API (cross-encoder)	150–400 ms	~$0.001 per 1K pairs	English-optimised, supports batched queries. Best integration with Cohere embeddings.
Voyage rerank-2	API (cross-encoder)	100–350 ms	~$0.0005 per 1K pairs	Multilingual, supports code-aware reranking.
BAAI/bge-reranker-v2-m3	Self-hosted (cross-encoder)	30–150 ms (GPU) / 200–600 ms (CPU)	Free (MIT licence, ~1 GB VRAM)	Strong for Chinese + English. No per-call cost after infrastructure.
ms-marco-MiniLM-L-6-v2	Self-hosted (cross-encoder)	10–50 ms (GPU) / 100–300 ms (CPU)	Free (MIT licence, ~0.5 GB VRAM)	Lightweight, good for strict latency budgets. Lower absolute precision than bge-reranker.

Latency ranges are estimates for a single query with 10 candidates on moderate hardware (GPU: T4; CPU: 8-core x86). Actual performance varies by batch size, model version, and hardware. Measure in your own stack. [1][2]
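
One way to measure in your own stack is a small timing harness around whatever reranker call you use. The sketch below assumes a callable with the hypothetical signature rerank_fn(query, candidates); dummy_rerank is a placeholder that only sorts by length so the example runs without a model or network access.

```python
import math
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency_ms(
    rerank_fn: Callable[[str, Sequence[str]], object],
    query: str,
    candidates: Sequence[str],
    runs: int = 20,
) -> Dict[str, float]:
    """Time repeated reranker calls and report median and p95 in milliseconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        rerank_fn(query, candidates)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: the smallest sample covering 95% of runs.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "runs": float(runs),
    }


def dummy_rerank(query: str, candidates: Sequence[str]) -> List[str]:
    """Placeholder for a real reranker call (NOT a relevance model)."""
    return sorted(candidates, key=len)


stats = measure_latency_ms(dummy_rerank, "warranty terms", ["doc one", "doc two text"] * 5)
print(stats)
```

Run it with the candidate count and text lengths you expect in production; tail latency (p95) usually matters more for a user-facing budget than the median.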

Methodology and sources

Check date: 2026-05-28

What was checked: retrieval quality, ranking and latency documentation for reranker models and providers

What the sources were used for:

[1] Cohere Rerank documentation (reranking methodology, latency benchmarks, cost estimates, integration patterns) — used to build the cross-encoder vs. bi-encoder explanation, the latency/cost table for Cohere Rerank 3.5, the worked example structure (query + candidate + reranker reordering), and the “What would change this advice” criteria around retriever precision and latency budgets. Cohere’s published latency benchmarks for 10-candidate batches directly informed the 150–400 ms range quoted in the table.
[2] Voyage AI Rerank documentation (rerank-2 specifications, multilingual support, cost per pair, latency data) — used for Voyage’s latency/cost entry in the provider comparison table, the batch depth rule (10–20 candidates as optimal range), and the self-hosted vs. API trade-off analysis. Voyage’s published cost of $0.0005 per 1K pairs informed the maximum-economy option in the decision check.
[3] Elasticsearch script-score query documentation (first-stage retrieval fine-tuning, function scoring) — used for the first-stage retriever baseline evaluation guidance and as a reference for readers who want to improve the base retriever without adding a separate reranker layer. Elastic’s documentation on relevance tuning informed the “fix chunking before reranking” advice.
[4] OpenAI Embeddings documentation (embedding similarity scoring, cosine distance limitations, retrieval evaluation guidance) — used for the bi-encoder limitation explanation (cosine distance missing semantic nuance), the answer-quality evaluation rubric in the decision check, and the logging/telemetry recommendations. OpenAI’s guidance on evaluating embedding quality against task-level outcomes directly informed the “evaluate against real task, not abstract scores” principle.

Assumptions and limits:

The base retriever already returns at least 5–10 plausible candidates per query.
The team can instrument and measure latency and answer quality in their own stack.
This is operational guidance based on provider documentation and published benchmarks as of the check date. Reranker model versions, pricing, and latency profiles change; verify against current provider docs before making infrastructure decisions.
Self-hosted cross-encoder latency ranges assume moderate hardware and are estimates only. Measure in your own environment.

What this page cannot tell you

This page cannot tell you which reranker model is best for your specific stack, language mix, or latency budget. It can only help you decide whether the extra quality layer is measurably paying for itself, and give you named starting points to evaluate.

Trust Stack

AI draft model: gpt-5.4-mini
AI review model: deepseek-v4-pro
Human editorial review: No (automated editorial pipeline)
Last substantive check: 2026-05-28
Corrections policy: If you spot an error, contact us via the Contact page
Affiliation: theLLMs has no vendor affiliation, sponsorship, or commercial relationship with any AI provider mentioned

Change log

2026-05-28: editorial review. Added 3 Editor’s Note cards, Trust Stack section, slugified heading IDs, tightened frontmatter description to 155 chars, corrected writtenBy field to “llm-author”.
2026-05-25: major revision. Added concrete worked example (EcoFlow query with before/after reordering table), cross-encoder vs. bi-encoder mechanism explanation, provider comparison table with model names/latency/cost, “What would change this advice” section with 5 scenarios, inline citations [1][2][3][4] throughout body and methodology, expanded Where-teams-get-it-wrong with consequences and fixes, and self-assessment scoring guidance.
2026-05-24: Initial draft published.

Source list

[1] Cohere Rerank documentation — https://docs.cohere.com/docs/reranking (accessed 2026-05-28)
[2] Voyage AI Rerank documentation — https://docs.voyageai.com/docs/reranking (accessed 2026-05-28)
[3] Elasticsearch script-score query — https://www.elastic.co/guide/en/elasticsearch/reference/current/query-ds/script-score-query.html (accessed 2026-05-28)
[4] OpenAI Embeddings documentation — https://platform.openai.com/docs/guides/embeddings (accessed 2026-05-28)
