Fallback design: what happens when the AI call fails?
Every AI feature will fail at some point. The model API returns a 503. The rate limit hits in the middle of a user session. The context window overflows on an unexpectedly long input. The embedder is down. The vector store returns garbage.
Most teams design for the happy path and discover the failure modes in production, under user load, during a demo for the CEO.
Editor’s Note: The reliability of an AI feature is determined by what happens when the AI is unavailable, not by how often it works when everything is healthy. A cached answer that is 80% correct is better than a 503 error. A human escalation path that takes 3 minutes is better than an infinite spinner.
Quick answer
Design every AI feature with four fallback layers, in order:
- Degraded mode — return a simpler answer, a cached result, or a safe default
- Graceful error — show a clear message explaining what went wrong without blaming the provider
- Human escalation — route the request to a person or a manual workflow
- Offline state — disable the feature entirely if upstream dependencies are known to be unavailable
The right fallback depends on the cost of failure. For a code completion, degraded mode (simpler completion) is fine. For a medical triage tool, only human escalation is acceptable.
What the tutorials skip
Every provider has different failure modes. OpenAI, Anthropic, Google, and open-weight providers all have different error codes, rate limits, availability SLAs, and retry behaviours. A retry strategy that works for one may cause cascading failures for another. Read the provider’s error documentation before writing fallback code.
Rate limits are not just about requests. Provider rate limits cover requests per minute, tokens per minute, tokens per day, and concurrent connections. Hitting any of these can block all users simultaneously. Design for backpressure before the limit is reached, not after.
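One way to apply backpressure before the provider's limit is reached is to track token usage locally and start shedding load at a soft threshold. The sketch below assumes a sliding one-minute window and an 80% soft limit; `TokenBudget`, `soft_limit_ratio`, and the injectable `clock` are hypothetical names chosen for illustration, not part of any provider SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TokenBudget:
    """Sliding one-minute token budget (hypothetical helper, not a provider API)."""
    tokens_per_minute: int
    soft_limit_ratio: float = 0.8  # assumption: shed load at 80% of the hard limit
    clock: Callable[[], float] = time.monotonic
    _events: List[Tuple[float, int]] = field(default_factory=list)

    def _used(self) -> int:
        # Drop usage records older than 60 seconds, then sum what is left.
        cutoff = self.clock() - 60.0
        self._events = [(t, n) for t, n in self._events if t > cutoff]
        return sum(n for _, n in self._events)

    def try_reserve(self, tokens: int) -> bool:
        """Record the usage and return True if it fits under the soft limit."""
        if self._used() + tokens > self.tokens_per_minute * self.soft_limit_ratio:
            return False  # caller should queue, degrade, or ask the user to wait
        self._events.append((self.clock(), tokens))
        return True
```

When `try_reserve` returns `False`, the caller can queue the request or switch to degraded mode while the provider is still accepting traffic, rather than waiting for a rate-limit error that affects every user.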
Context window overflow is not always a clean error. Depending on the provider and client library, an over-long input may be rejected with an error, or it may be silently truncated somewhere in your own pipeline, so the model loses context and produces a worse answer. Fallback designs need to detect overflow before calling the API and handle long inputs differently (summarise, chunk, or refuse with an explanation).
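A pre-flight check for long inputs might look like the sketch below. It assumes a rough four-characters-per-token heuristic, which is only an approximation; a real implementation would use the provider's own tokenizer. `plan_input`, `max_input_tokens`, and `max_chunks` are hypothetical names.

```python
from typing import List, Tuple

CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Deliberately pessimistic token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def plan_input(text: str, max_input_tokens: int, max_chunks: int = 8) -> Tuple[str, List[str]]:
    """Decide whether to send, chunk, or refuse an input before calling the API."""
    if max_input_tokens < 2:
        raise ValueError("max_input_tokens must be at least 2")
    if estimate_tokens(text) <= max_input_tokens:
        return "send", [text]
    # Size each chunk so its own estimate stays within the limit.
    chunk_chars = (max_input_tokens - 1) * CHARS_PER_TOKEN
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    if len(chunks) > max_chunks:
        return "refuse", []  # too long even to chunk: explain this to the user
    return "chunk", chunks
```

The `"refuse"` branch is where the user-facing explanation belongs; the `"chunk"` branch would typically feed a summarise-then-answer step.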
Degraded mode needs testing. A cached answer from last week may be worse than admitting the feature is temporarily unavailable. Test degraded mode outputs the same way you test happy-path outputs: accuracy, safety, and user experience.
Where teams misuse fallback design
Infinite retries. Retrying the same request against the same provider with the same parameters is not a fallback strategy. After 2–3 retries with exponential backoff, fall back to a different provider, a cached answer, or an error message.
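The bounded-retry rule can be sketched as follows. `TransientError` is a hypothetical marker exception standing in for whatever timeout, 429, or 5xx errors a given client library raises, and `fallback` is any callable that returns a cached answer or an error message. The `sleep` parameter is injectable only so the function can be tested without waiting.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 429, 5xx)."""


def call_with_bounded_retries(
    call: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try `call` a fixed number of times, then hand over to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                break  # out of attempts: stop retrying
            delay = base_delay * (2 ** attempt)  # exponential backoff
            sleep(delay + random.uniform(0, delay))  # jitter avoids synchronized retries
    return fallback()
```

Non-transient errors (for example, a malformed request) are intentionally not caught here, since retrying them cannot help.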
Silent fallback. Switching to a weaker model or a cached answer without telling the user erodes trust. If the feature is running in degraded mode, say so. Users would rather know they are getting a less accurate answer than assume the full system is working.
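One way to make degraded mode visible is to carry the answer's origin in the response object so the UI cannot forget to disclose it. The sketch below is illustrative only: `AIResponse`, `label_response`, the source labels, and the notice wording are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical notices; the wording is an assumption for illustration.
DEGRADED_NOTICES = {
    "secondary_model": "This answer came from a backup model and may be less accurate.",
    "cache": "This is a saved answer from earlier and may be out of date.",
}


@dataclass(frozen=True)
class AIResponse:
    text: str
    source: str  # e.g. "primary", "secondary_model", or "cache"
    degraded: bool
    notice: Optional[str] = None  # shown to the user whenever degraded is True


def label_response(text: str, source: str) -> AIResponse:
    """Attach a user-visible notice whenever the answer is not from the primary path."""
    notice = DEGRADED_NOTICES.get(source)
    return AIResponse(text=text, source=source, degraded=notice is not None, notice=notice)
```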
Falling back to hallucination. Some systems respond to a failed API or retrieval call by letting a model generate a plausible-sounding answer with no ground truth. This is worse than an error message. A wrong answer confidently delivered is harder to detect than a feature that says “I don’t know right now.”
Practical design patterns
Pattern 1 — Provider cascade
Call the primary provider first (e.g., Claude). On timeout or 5xx, fail over to a secondary provider (e.g., GPT-4o). If the secondary also fails, serve a cached response. On a cache miss, show the degraded UI.
Set request timeouts aggressively (10–15 seconds for LLM calls). Cache successful responses with a TTL appropriate for staleness tolerance.
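The cascade can be sketched as a loop over providers followed by a cache lookup. Everything here is an assumption made for illustration: providers are modelled as plain callables taking `(prompt, timeout_seconds)`, `TTLCache` is a minimal in-memory cache, and the broad `except Exception` stands in for the specific timeout and 5xx exceptions a real client library would raise.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Provider = Callable[[str, float], str]  # (prompt, timeout_seconds) -> answer


class TTLCache:
    """Minimal in-memory cache with a time-to-live (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self._clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._store[key]  # too stale to serve
            return None
        return value


def cascade(prompt: str, providers: List[Tuple[str, Provider]],
            cache: TTLCache, timeout: float = 12.0) -> Tuple[str, str]:
    """Return (source, answer); source names which layer produced the answer."""
    for name, provider in providers:
        try:
            answer = provider(prompt, timeout)
        except Exception:  # sketch only: catch just timeouts and 5xx in real code
            continue
        cache.put(prompt, answer)  # only successful responses are cached
        return name, answer
    cached = cache.get(prompt)
    if cached is not None:
        return "cache", cached
    return "degraded_ui", ""
```

Returning the source alongside the answer lets the caller tell the user when the response came from a secondary provider or the cache.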
Pattern 2 — Graceful degradation matrix
| Failure type | User experience | What the system does |
|---|---|---|
| Model API down | “Results may be simpler: our AI is temporarily unavailable.” | Return cached results or simplified heuristic output |
| Rate limited | “You’ve used your AI requests for now. Try again in a few minutes.” | Show limit, offer retry timer |
| Context overflow | “Your input was too long. Try a shorter question or upload a summary.” | Refuse with explanation, offer hint |
| Embedding API down | “Search is temporarily limited to keywords.” | Fall back to BM25 or traditional search |
| All upstream down | “This feature is temporarily unavailable. Your request has been logged.” | Log request, notify operations, show offline banner |
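A matrix like this is easiest to keep consistent when it lives in one lookup table in code rather than in scattered `if` statements. The sketch below is one possible encoding; the enum members, action names, and the HTTP status mapping in `classify_http_status` are assumptions, and real providers may signal these failures differently.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Failure(Enum):
    MODEL_API_DOWN = "model_api_down"
    RATE_LIMITED = "rate_limited"
    CONTEXT_OVERFLOW = "context_overflow"
    EMBEDDING_API_DOWN = "embedding_api_down"
    ALL_UPSTREAM_DOWN = "all_upstream_down"


class Degradation(NamedTuple):
    user_message: str
    system_action: str  # hypothetical action identifiers


MATRIX = {
    Failure.MODEL_API_DOWN: Degradation(
        "Results may be simpler: our AI is temporarily unavailable.", "serve_cached_or_heuristic"),
    Failure.RATE_LIMITED: Degradation(
        "You've used your AI requests for now. Try again in a few minutes.", "show_retry_timer"),
    Failure.CONTEXT_OVERFLOW: Degradation(
        "Your input was too long. Try a shorter question or upload a summary.", "refuse_with_hint"),
    Failure.EMBEDDING_API_DOWN: Degradation(
        "Search is temporarily limited to keywords.", "keyword_search"),
    Failure.ALL_UPSTREAM_DOWN: Degradation(
        "This feature is temporarily unavailable. Your request has been logged.", "log_and_show_offline"),
}


def classify_http_status(status: int) -> Optional[Failure]:
    """Assumed mapping: 429 means rate limited, any 5xx means the model API is down."""
    if status == 429:
        return Failure.RATE_LIMITED
    if 500 <= status <= 599:
        return Failure.MODEL_API_DOWN
    return None
```

A lookup such as `MATRIX[classify_http_status(503)]` then yields both the message to show and the action to take, so the two cannot drift apart.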
Pattern 3 — Human escalation with queues
For high-cost failures (customer support, compliance, medical), route automatically to a human review queue. The queue should show:
- The original user request
- What the AI attempted and why it failed
- Recommended actions from the fallback logic
Track queue size and resolution time. If the queue backs up, that is a signal the AI feature is too unreliable for its current use case.
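A minimal in-memory version of such a queue, including the two metrics mentioned above, might look like this. `EscalationQueue`, its field names, and `backlog_threshold` are hypothetical; a production system would use a durable queue or ticketing tool instead.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional


@dataclass
class Escalation:
    user_request: str               # the original user request
    ai_attempt: str                 # what the AI attempted
    failure_reason: str             # why it failed
    recommended_actions: List[str]  # suggestions from the fallback logic
    created_at: float


class EscalationQueue:
    def __init__(self, backlog_threshold: int = 50,
                 clock: Callable[[], float] = time.monotonic):
        self._pending: Deque[Escalation] = deque()
        self._resolution_seconds: List[float] = []
        self._backlog_threshold = backlog_threshold  # assumption: tune per team
        self._clock = clock

    def enqueue(self, user_request: str, ai_attempt: str, failure_reason: str,
                recommended_actions: List[str]) -> Escalation:
        item = Escalation(user_request, ai_attempt, failure_reason,
                          recommended_actions, created_at=self._clock())
        self._pending.append(item)
        return item

    def resolve_next(self) -> Optional[Escalation]:
        """Pop the oldest item and record how long it waited."""
        if not self._pending:
            return None
        item = self._pending.popleft()
        self._resolution_seconds.append(self._clock() - item.created_at)
        return item

    def size(self) -> int:
        return len(self._pending)

    def mean_resolution_seconds(self) -> Optional[float]:
        if not self._resolution_seconds:
            return None
        return sum(self._resolution_seconds) / len(self._resolution_seconds)

    def is_backed_up(self) -> bool:
        return len(self._pending) > self._backlog_threshold
```

`is_backed_up()` is the signal worth alerting on: a persistently long queue suggests the AI feature is failing too often for its current use case.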
Decision framework
| Question | Should you build fallback? |
|---|---|
| Is the feature user-facing? | Yes, always |
| Is the feature critical to the user’s task? | Yes, with human escalation |
| Is latency sensitive? | Yes, with aggressive timeouts and cached fallback |
| Is the cost of a wrong answer high? | Yes, refuse rather than fall back to a weak model |
| Can you detect failure without the user reporting it? | Yes, monitor API errors and degrade proactively |
Methodology and sources
This guide draws on provider API reliability documentation (OpenAI, Anthropic, Google), industry incident post-mortems from AI-native products, and established patterns in distributed systems design applied to AI feature architecture.
- OpenAI error handling documentation: https://platform.openai.com/docs/guides/error-codes — checked 2026-05-24
- Anthropic API errors and retries: https://docs.anthropic.com/en/api/errors — checked 2026-05-24
- Google AI API rate limits: https://ai.google.dev/pricing — checked 2026-05-24
- AWS Well-Architected Framework — fault tolerance patterns: https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html — checked 2026-05-24
Change log
2026-05-24 — First published version.