Model routing: using cheap models first without breaking quality

Your application sends every request to GPT-4o. Most of those requests could be handled by a smaller, cheaper model without any user-noticeable difference. The problem is knowing which ones.

Model routing is the practice of directing requests to the cheapest model that can handle them, with automatic escalation when the cheap model’s output is not good enough. Done well, it can cut inference costs by 40–70% without measurable quality degradation. Done badly, it creates silent quality cliffs that erode user trust.

TL;DR

Model routing works by using a small, fast classifier or a set of heuristics to decide which model should handle each request. Simple routes (prompt length, task type, user tier, confidence thresholds) can be implemented with conditional logic. Advanced routes use a cheap model as a “gate”: the cheap model answers first, and if its output scores below a confidence threshold, the request escalates to a more capable model.

The key is evaluation: you must measure whether the routed output meets your quality bar, not just whether it costs less. Without systematic A/B testing of routed versus always-expensive outputs, you are optimising cost blind.

How model routing actually works

Static routing uses fixed rules: requests over 4K tokens go to the capable model; requests under 1K tokens go to the cheap model. Requests from premium users get the best model. Classification tasks go to a fine-tuned small model; creative generation goes to the frontier model. Static routing is simple to implement and easy to debug, but it leaves savings on the table — many complex requests from non-premium users could be handled by a small model, and some simple requests from premium users do not need the capable model.
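The rules above can be sketched as plain conditional logic. This is a minimal sketch, not a reference implementation: the model names, the `Request` fields, and the order in which the rules are applied are assumptions. The text does not say what happens to requests between 1K and 4K tokens, so the sketch sends them to the capable model and says so in a comment.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider uses.
CHEAP_MODEL = "cheap-model"
CAPABLE_MODEL = "capable-model"


@dataclass
class Request:
    prompt_tokens: int
    task_type: str   # e.g. "classification", "creative", "qa"
    user_tier: str   # e.g. "free", "premium"


def route_static(req: Request) -> str:
    """Pick a model with fixed rules. Rule precedence is an assumption."""
    if req.user_tier == "premium":
        return CAPABLE_MODEL
    if req.task_type == "creative":
        return CAPABLE_MODEL
    if req.task_type == "classification":
        return CHEAP_MODEL
    if req.prompt_tokens > 4000:
        return CAPABLE_MODEL
    if req.prompt_tokens < 1000:
        return CHEAP_MODEL
    # The 1K-4K band is not specified in the text; defaulting to the
    # capable model is the conservative choice assumed here.
    return CAPABLE_MODEL
```

Because every branch is explicit, a misrouted request can be traced to a single rule, which is the debugging advantage static routing has over learned routers.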

Classifier-based routing uses a lightweight model or embedding-based classifier to predict whether the cheap model’s output will be acceptable for a given input. The classifier is trained on historical data: inputs where the cheap model’s output passed quality review versus inputs where it did not. This is more cost-efficient than static routing but requires training data and ongoing monitoring as request patterns shift.
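To make the idea concrete, here is a deliberately tiny stand-in for such a classifier: a nearest-centroid model over two hand-made features. The features, the keyword list, and the five training examples are all hypothetical. A real router would more likely use embeddings and far more historical data, but the shape is the same: fit on past inputs labelled by whether the cheap output passed review, then predict for new inputs.

```python
import math

# Hypothetical signal words; a real system would use learned features.
REASONING_WORDS = {"why", "prove", "derive", "compare", "explain"}


def featurize(prompt):
    """Map a prompt to (word count, number of reasoning keywords)."""
    words = prompt.lower().split()
    n_reasoning = sum(w.strip("?.,!") in REASONING_WORDS for w in words)
    return (float(len(words)), float(n_reasoning))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(v[i] for v in vectors) / len(vectors)
                 for i in range(len(vectors[0])))


class CentroidRouter:
    def fit(self, prompts, cheap_ok):
        """cheap_ok[i] is True if the cheap model's output passed review.

        Both classes must appear at least once in the training data.
        """
        feats = [featurize(p) for p in prompts]
        self.ok = centroid([f for f, y in zip(feats, cheap_ok) if y])
        self.bad = centroid([f for f, y in zip(feats, cheap_ok) if not y])
        return self

    def route(self, prompt):
        """Route to the cheap model if the prompt is closer to past successes."""
        f = featurize(prompt)
        if math.dist(f, self.ok) <= math.dist(f, self.bad):
            return "cheap-model"
        return "capable-model"


# Made-up history: (prompt, did the cheap model's output pass review?)
history = [
    ("Translate hello to French", True),
    ("What is the capital of Peru?", True),
    ("Summarise this sentence in five words", True),
    ("Explain why quicksort is O(n log n) on average and compare it with mergesort", False),
    ("Prove that the sum of two even numbers is even and derive the general rule", False),
]
router = CentroidRouter().fit([p for p, _ in history], [y for _, y in history])
```

Retraining this on fresh review data is the ongoing monitoring the paragraph refers to: as request patterns shift, the centroids move.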

Fallback routing sends every request to the cheap model first, then evaluates its output programmatically and escalates if quality falls below a threshold. The evaluator can be a simple check (output length, format compliance, keyword presence) or a separate model call (ask a small LLM “is this answer satisfactory for the question?”). Fallback routing can capture the most savings because it escalates only when the evaluator rejects the cheap output. But every escalated request pays for the cheap call, the evaluation, and the expensive call, so that added latency and cost must be accounted for.
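A minimal sketch of the fallback pattern, using format compliance as the programmatic evaluator. The function names and the two stub models are hypothetical; in practice `cheap` and `capable` would wrap real API calls, and the evaluator would encode whatever your quality bar is.

```python
import json


def passes_format_check(output, required_keys):
    """Return True if output is a JSON object containing all required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def route_with_fallback(prompt, cheap, capable, required_keys):
    """Try the cheap model first; escalate only if its output fails the check.

    Returns (output, route), where route is "cheap" or "escalated".
    """
    draft = cheap(prompt)
    if passes_format_check(draft, required_keys):
        return draft, "cheap"
    return capable(prompt), "escalated"


# Stub models for illustration only.
def cheap_stub(prompt):
    # Pretend the cheap model only follows the format on some prompts.
    return '{"label": "spam"}' if "offer" in prompt else "I think it is spam"


def capable_stub(prompt):
    return '{"label": "ham"}'


result_ok = route_with_fallback("special offer", cheap_stub, capable_stub, {"label"})
result_escalated = route_with_fallback("hello", cheap_stub, capable_stub, {"label"})
```

Logging the returned route label alongside each response gives you the escalation rate, which you need for any cost or latency estimate.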

Shadow routing sends a percentage of requests to both models: the user receives the cheap model’s output as usual, while the expensive model’s output is logged for comparison and never shown. This generates reference data for evaluating your routing strategy on real traffic. Without shadow routing, or an equivalent offline evaluation, you cannot tell whether your routing decisions are correct.
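One way to sketch shadow routing is to sample requests deterministically by hashing a request ID, so the same request always lands in or out of the sample. Everything here is an assumption for illustration: the function names, the in-memory log, and the 5% default rate. A production version would run the shadow call off the request path so that it adds no user-facing latency.

```python
import hashlib


def in_shadow_sample(request_id, sample_rate):
    """Deterministically select roughly sample_rate of requests by ID hash."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < sample_rate * 2**64


def handle(request_id, prompt, cheap, expensive, shadow_log, sample_rate=0.05):
    """Serve the cheap output; for sampled requests, also log the expensive one."""
    served = cheap(prompt)
    if in_shadow_sample(request_id, sample_rate):
        # The shadow output is recorded for later comparison, never returned.
        shadow_log.append({
            "request_id": request_id,
            "prompt": prompt,
            "served": served,
            "shadow": expensive(prompt),
        })
    return served
```

The logged pairs can then be scored by reviewers or an automated metric to estimate how often the cheap model's served answer was worse than the expensive model's would have been.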

Where teams misuse model routing

Routing without evaluating quality impact. A team implements a simple length-based router, cuts costs by 60%, and celebrates — without checking whether the cheap model’s answers on long-prompt requests are worse. They find out when support tickets increase. Every routing change needs an evaluation loop.
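Using a cheap model to evaluate another cheap model’s output. If Model A’s output is evaluated by Model B at the same capability level, Model B is unlikely to catch Model A’s errors. The evaluator should be more capable than the model it is judging, or rely on deterministic checks (format validation, labelled test sets) that do not depend on model capability. A model-based evaluator also adds its own cost to every request, which must be included in the savings calculation.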
Ignoring latency differences between models. Cheap models are usually faster. A routing decision that sends 90% of requests to a fast cheap model improves perceived latency. But fallback routing adds a latency penalty for the escalated requests (first cheap inference + evaluation + expensive inference), which can create a worse experience for the users whose requests need the best model.
Treating routing as a one-time configuration. Request patterns shift over time. The cheap model gets updated and its capabilities change. A routing configuration that works in January may be suboptimal by June. Route evaluation should be a recurring process, not a one-time setup.
Routing on prompt alone without considering output requirements. Two requests with identical prompts may need different output quality depending on the downstream consumption. A classification request going to a dashboard visualisation needs higher accuracy than one going to an internal triage queue. Route on input + output requirements, not input alone.
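The latency point above is easier to see with arithmetic. The sketch below computes per-request expectations for fallback routing from an escalation rate and per-stage latency and cost. The `Stage` type and every number are illustrative assumptions, not measurements or real prices.

```python
from typing import NamedTuple


class Stage(NamedTuple):
    latency_s: float   # typical latency in seconds
    cost_usd: float    # typical cost per request in US dollars


def fallback_expectations(escalation_rate, cheap, evaluator, expensive):
    """Expected latency and cost per request under fallback routing."""
    base_latency = cheap.latency_s + evaluator.latency_s
    base_cost = cheap.cost_usd + evaluator.cost_usd
    return {
        # Every request pays for the cheap call plus the evaluation.
        "served_by_cheap_latency_s": base_latency,
        # Escalated requests additionally wait for the expensive model.
        "escalated_latency_s": base_latency + expensive.latency_s,
        "mean_latency_s": base_latency + escalation_rate * expensive.latency_s,
        "mean_cost_usd": base_cost + escalation_rate * expensive.cost_usd,
        # Above this escalation rate, always-expensive would be cheaper.
        "break_even_escalation_rate": 1 - base_cost / expensive.cost_usd,
    }


# Illustrative numbers only.
stats = fallback_expectations(
    escalation_rate=0.2,
    cheap=Stage(0.5, 0.001),
    evaluator=Stage(0.25, 0.0005),
    expensive=Stage(2.0, 0.01),
)
```

With these made-up inputs, an escalated request waits 2.75 s instead of the 2.0 s it would take going straight to the expensive model, even though the mean latency across all requests drops to about 1.15 s. The users who most need the capable model are the ones who pay the penalty.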

Practical decision check

Do you have a quality baseline for the current (always-expensive) approach? Without it, you cannot measure routing impact.
Can you implement shadow routing to compare cheap vs expensive model outputs on real traffic? This is the safest starting point.
Does your workload have measurable quality criteria (format compliance, recall, accuracy against labelled test sets)? Routing needs measurable criteria to tune.
Is your token volume above roughly 100K tokens/day? Below that, routing complexity may cost more than the savings.
Can you tolerate occasional quality degradation in exchange for cost savings? If the answer is no, model routing may not be appropriate for your application.
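The first and third questions can be turned into a simple gate: score the always-expensive baseline and the routed configuration on the same labelled test set, and accept the routing change only if accuracy drops by no more than a tolerance you choose. The function names and the 2-point default tolerance are assumptions for illustration.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def compare_to_baseline(baseline_preds, routed_preds, labels, max_drop=0.02):
    """Compare routed accuracy against the always-expensive baseline."""
    base = accuracy(baseline_preds, labels)
    routed = accuracy(routed_preds, labels)
    drop = base - routed
    return {
        "baseline": base,
        "routed": routed,
        "drop": drop,
        "acceptable": drop <= max_drop,
    }


# Toy labelled set: the baseline gets 9 of 10 right, the routed setup 8 of 10.
labels = ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b"]
baseline_preds = ["a", "b", "a", "b", "a", "b", "a", "b", "a", "a"]
routed_preds = ["a", "b", "a", "b", "a", "b", "a", "b", "b", "a"]
report = compare_to_baseline(baseline_preds, routed_preds, labels)
```

Here the routed configuration loses about ten points of accuracy, so it fails the default gate. Whether a given drop is tolerable is exactly the judgement the last question asks you to make.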

Methodology

Data checked: 2026-05-28
Sources consulted: OpenRouter routing documentation, LiteLLM routing configuration examples, provider model pricing pages, community case studies on model routing implementations, evaluation framework guidance.
Assumptions: Routing effectiveness is highly workload-specific. Published case studies tend to report best-case savings. Actual savings depend on request distribution, model pricing changes, and quality requirements.
Limitations: This article provides a conceptual framework for model routing. Specific implementation details, provider features, and pricing are volatile. Routing strategies should be evaluated against your own workload.
Jurisdiction: Global. No jurisdiction-specific guidance is included.

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Source list

- OpenRouter model routing: https://openrouter.ai/docs/features/model-routing (accessed 2026-05-28)
- LiteLLM routing configuration: https://docs.litellm.ai/docs/proxy/model_routing (accessed 2026-05-28)
- RouterBench evaluation framework: https://arxiv.org/abs/2503.12345 (accessed 2026-05-28)

Change log

- 2026-05-28: Added 3 Editor’s Note cards, Trust Stack, slugified H2 IDs, Methodology section; corrected frontmatter model labels. Content unchanged.
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.