Fallback design: what happens when the AI call fails?

Every AI feature will fail at some point. The model API returns a 503. The rate limit hits in the middle of a user session. The context window overflows on an unexpectedly long input. The embedder is down. The vector store returns garbage.

Most teams design for the happy path and discover the failure modes in production, under user load, during a demo for the CEO.

Editor’s Note: The reliability of an AI feature is determined by what happens when the AI is unavailable, not by how often it works when everything is healthy. A cached answer that is 80% correct is better than a 503 error. A human escalation path that takes 3 minutes is better than an infinite spinner.

Quick answer

Design every AI feature with four fallback layers, in order:

Degraded mode — return a simpler answer, a cached result, or a safe default
Graceful error — show a clear message explaining what went wrong without blaming the provider
Human escalation — route the request to a person or a manual workflow
Offline state — disable the feature entirely if upstream dependencies are known to be unavailable

The right fallback depends on the cost of failure. For code completion, degraded mode (a simpler completion) is fine. For a medical triage tool, only human escalation is acceptable.
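One way to wire the four layers together is sketched below. All names (`FallbackResult`, `answer_with_fallbacks`, the `high_stakes` flag and the callables passed in) are hypothetical. The routing rule is an assumption drawn from the examples above: high-stakes features go straight to a person, and everything else tries a degraded answer before showing an error.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FallbackResult:
    layer: str      # which layer produced the response
    message: str    # text shown to the user
    degraded: bool  # True when the answer did not come from the full AI path


def answer_with_fallbacks(
    request: str,
    call_model: Callable[[str], str],
    cached_answer: Callable[[str], Optional[str]],
    escalate: Callable[[str], str],
    high_stakes: bool = False,
    upstream_down: bool = False,
) -> FallbackResult:
    # Offline state: dependencies are known to be down, so skip the call entirely.
    if upstream_down:
        return FallbackResult("offline", "This feature is temporarily unavailable.", True)
    try:
        return FallbackResult("primary", call_model(request), False)
    except Exception:
        # Broad on purpose in a sketch; real code should catch provider-specific errors.
        pass
    # Assumption: when a wrong answer is costly, skip degraded mode and go to a person.
    if high_stakes:
        ticket = escalate(request)
        return FallbackResult(
            "human_escalation", f"A person will review your request (ref {ticket}).", True
        )
    # Degraded mode: a cached or simpler answer, if one exists.
    cached = cached_answer(request)
    if cached is not None:
        return FallbackResult("degraded", cached, True)
    # Graceful error: a clear message that does not blame the provider.
    return FallbackResult(
        "graceful_error", "We could not generate an answer right now. Please try again shortly.", True
    )
```

The `degraded` flag travels with every result so the UI can tell the user when they are not getting the full system.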

What the tutorials skip

Every provider has different failure modes. OpenAI, Anthropic, Google, and open-weight providers all have different error codes, rate limits, availability SLAs, and retry behaviours. A retry strategy that works for one may cause cascading failures for another. Read the provider’s error documentation before writing fallback code.
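As a starting point before the provider-specific work, a fallback layer can classify responses by generic HTTP semantics. The sketch below is not any provider's documented mapping: the function names and the action labels are hypothetical, and the status-code buckets should be adjusted after reading each provider's error documentation.

```python
from typing import Mapping, Optional


def classify_status(status: int) -> str:
    """Map an HTTP status code to a coarse fallback action (labels are illustrative)."""
    if status == 429:
        return "back_off"              # rate limited: slow down, do not hammer
    if status == 408 or 500 <= status <= 599:
        return "retry_then_fail_over"  # transient: bounded retries, then another path
    if status in (401, 403):
        return "alert_operators"       # credentials or permissions: retrying will not help
    if 400 <= status <= 499:
        return "fix_request"           # the request itself is invalid
    return "ok"


def retry_after_seconds(headers: Mapping[str, str]) -> Optional[float]:
    """Read a Retry-After header given in seconds; the HTTP-date form is not handled here."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                return None
    return None
```

Keeping this mapping in one place per provider makes it easier to see where two providers need different retry behaviour.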

Rate limits are not just about requests. Provider rate limits cover requests per minute, tokens per minute, tokens per day, and concurrent connections. Hitting any of these can block all users simultaneously. Design for backpressure before the limit is reached, not after.
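A minimal sketch of client-side backpressure follows. It tracks both requests and tokens over a rolling 60-second window and starts refusing work at a soft threshold below the real limit. The class name `MinuteBudget`, the 80% `soft_ratio` and the limit values are assumptions for illustration; actual limits come from the provider account.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class MinuteBudget:
    """Tracks requests and tokens used in the last 60 seconds against soft limits."""

    def __init__(
        self,
        max_requests: int,
        max_tokens: int,
        soft_ratio: float = 0.8,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.soft_ratio = soft_ratio
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def _prune(self) -> None:
        # Drop events that are 60 seconds old or older.
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def try_acquire(self, tokens: int) -> bool:
        """Return True and record the call if it fits under both soft limits."""
        self._prune()
        used_requests = len(self.events)
        used_tokens = sum(t for _, t in self.events)
        if used_requests + 1 > self.max_requests * self.soft_ratio:
            return False
        if used_tokens + tokens > self.max_tokens * self.soft_ratio:
            return False
        self.events.append((self.clock(), tokens))
        return True
```

When `try_acquire` returns False, the caller can queue the request, serve a cached answer, or show a retry timer instead of pushing every user into the provider's hard limit at once.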

Context window overflow is not always a clean error. Depending on the provider and the client stack, an over-long input may be rejected outright, or it may be truncated somewhere upstream (by a framework, a retrieval step, or a history-trimming routine) so that the model silently loses context and produces a worse answer. Fallback designs need to detect overflow before calling the API and handle long inputs deliberately: summarise, chunk, or refuse with an explanation.
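A pre-flight check along these lines can run before any API call. This is a sketch under stated assumptions: the 4-characters-per-token estimate is a rough heuristic for English text (a real tokenizer for the target model is more accurate), and `plan_for_input`, `chunk_text` and the `max_chunks` cutoff are hypothetical names and values.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token, rounded up.
    return (len(text) + 3) // 4


def plan_for_input(
    text: str, context_limit: int, reserved_for_output: int, max_chunks: int = 8
) -> str:
    """Decide whether to send the input as-is, chunk it, or refuse with an explanation."""
    budget = context_limit - reserved_for_output
    if budget <= 0:
        raise ValueError("reserved_for_output must be smaller than context_limit")
    needed = estimate_tokens(text)
    if needed <= budget:
        return "send"
    chunks_needed = -(-needed // budget)  # ceiling division
    return "chunk" if chunks_needed <= max_chunks else "refuse"


def chunk_text(text: str, budget_tokens: int) -> List[str]:
    """Split text into pieces that each fit the estimated token budget."""
    size = budget_tokens * 4
    return [text[i:i + size] for i in range(0, len(text), size)]
```

Splitting on raw character offsets can cut a sentence in half; a production version would likely split on paragraph or sentence boundaries.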

Degraded mode needs testing. A cached answer from last week may be worse than admitting the feature is temporarily unavailable. Test degraded mode outputs the same way you test happy-path outputs: accuracy, safety, and user experience.

Where teams misuse fallback design

Infinite retries. Retrying the same request against the same provider with the same parameters is not a fallback strategy. After 2–3 retries with exponential backoff, fall back to a different provider, a cached answer, or an error message.
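The bounded-retry rule can be sketched as follows. `RetryableError` is a hypothetical marker for errors worth retrying (timeouts, 429s, 5xx responses), and the `base_delay` value and the jitter scheme are assumptions rather than provider recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for transient failures that are worth retrying."""


def call_with_retries(
    call: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `call` up to max_attempts times, then hand over to `fallback`."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts - 1:
                break  # retries exhausted: stop and fall back
            # Exponential backoff with jitter so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
    return fallback()
```

Errors that are not `RetryableError` propagate immediately, which keeps invalid requests from being retried at all.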

Silent fallback. Switching to a weaker model or a cached answer without telling the user erodes trust. If the feature is running in degraded mode, say so. Users would rather know they are getting a less accurate answer than assume the full system is working.

Falling back to hallucination. Some systems respond to an API error by having the model generate a plausible-sounding answer with no ground truth. This is worse than an error message. A wrong answer confidently delivered is harder to detect than a feature that says “I don’t know right now.”

Practical design patterns

Pattern 1 — Provider cascade

Call the primary provider first (e.g., Claude). On timeout or 5xx, fail over to the secondary provider (e.g., GPT-4o). If the secondary also fails, fall back to a cached response. On cache miss, show the degraded UI.

Set request timeouts aggressively (10–15 seconds for LLM calls). Cache successful responses with a TTL that matches how stale an answer the feature can tolerate.
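The cascade can be sketched with providers modelled as plain callables. Nothing here is a real SDK call: `Provider`, `TTLCache` and `cascade` are hypothetical names, and each provider callable is assumed to enforce the timeout it is given and to raise on timeout or server error.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Assumed shape: (prompt, timeout_seconds) -> completion text, raising on failure.
Provider = Callable[[str, float], str]


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._items: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str) -> None:
        self._items[key] = (self.clock(), value)

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._items[key]  # too stale to serve
            return None
        return value


def cascade(
    prompt: str,
    providers: List[Tuple[str, Provider]],
    cache: TTLCache,
    timeout: float = 12.0,
) -> Tuple[str, Optional[str]]:
    """Return (source, answer); answer is None when the degraded UI should be shown."""
    for name, provider in providers:
        try:
            answer = provider(prompt, timeout)
        except Exception:
            # Broad for brevity; real code should fail over only on timeouts and 5xx.
            continue
        cache.put(prompt, answer)  # cache successes for later outages
        return name, answer
    cached = cache.get(prompt)
    if cached is not None:
        return "cache", cached
    return "degraded_ui", None
```

Returning the source alongside the answer lets the caller label cached or secondary-provider responses instead of falling back silently.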

Pattern 2 — Graceful degradation matrix

Failure type	User experience	What the system does
Model API down	“Results may be simpler: our AI is temporarily unavailable.”	Return cached results or simplified heuristic output
Rate limited	“You’ve used your AI requests for now. Try again in a few minutes.”	Show limit, offer retry timer
Context overflow	“Your input was too long. Try a shorter question or upload a summary.”	Refuse with explanation, offer hint
Embedding API down	“Search is temporarily limited to keywords.”	Fall back to BM25 or traditional search
All upstream down	“This feature is temporarily unavailable. Your request has been logged.”	Log request, notify operations, show offline banner
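In code, the matrix can live in a single lookup so that every failure type has a defined user message and system action. The enum members and action identifiers below are hypothetical labels; the messages follow the table.

```python
from enum import Enum
from typing import Dict, NamedTuple


class Failure(Enum):
    MODEL_API_DOWN = "model_api_down"
    RATE_LIMITED = "rate_limited"
    CONTEXT_OVERFLOW = "context_overflow"
    EMBEDDING_API_DOWN = "embedding_api_down"
    ALL_UPSTREAM_DOWN = "all_upstream_down"


class Degradation(NamedTuple):
    user_message: str
    system_action: str  # illustrative identifier for the handler to run


MATRIX: Dict[Failure, Degradation] = {
    Failure.MODEL_API_DOWN: Degradation(
        "Results may be simpler: our AI is temporarily unavailable.",
        "serve_cached_or_heuristic",
    ),
    Failure.RATE_LIMITED: Degradation(
        "You've used your AI requests for now. Try again in a few minutes.",
        "show_limit_and_retry_timer",
    ),
    Failure.CONTEXT_OVERFLOW: Degradation(
        "Your input was too long. Try a shorter question or upload a summary.",
        "refuse_with_hint",
    ),
    Failure.EMBEDDING_API_DOWN: Degradation(
        "Search is temporarily limited to keywords.",
        "keyword_search",
    ),
    Failure.ALL_UPSTREAM_DOWN: Degradation(
        "This feature is temporarily unavailable. Your request has been logged.",
        "log_notify_and_show_banner",
    ),
}


def degrade(failure: Failure) -> Degradation:
    # Unknown failures fall through to the most conservative row.
    return MATRIX.get(failure, MATRIX[Failure.ALL_UPSTREAM_DOWN])
```

A test that asserts every `Failure` member has a row is a cheap way to keep the matrix complete as new failure types are added.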

Pattern 3 — Human escalation with queues

For high-cost failures (customer support, compliance, medical), route automatically to a human review queue. The queue should show:

The original user request
What the AI attempted and why it failed
Recommended actions from the fallback logic

Track queue size and resolution time. If the queue backs up, that is a signal the AI feature is too unreliable for its current use case.
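A minimal in-memory sketch of such a queue is shown below. The ticket fields mirror the three items listed above; `EscalationTicket`, `ReviewQueue` and the `max_open` threshold are hypothetical, and a real system would persist tickets rather than hold them in a list.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationTicket:
    user_request: str               # the original user request
    ai_attempt: str                 # what the AI attempted
    failure_reason: str             # why it failed
    recommended_actions: List[str]  # suggestions from the fallback logic
    created_at: float               # seconds, from any consistent clock
    resolved_at: Optional[float] = None


class ReviewQueue:
    def __init__(self) -> None:
        self.tickets: List[EscalationTicket] = []

    def enqueue(self, ticket: EscalationTicket) -> None:
        self.tickets.append(ticket)

    def open_count(self) -> int:
        return sum(1 for t in self.tickets if t.resolved_at is None)

    def mean_resolution_seconds(self) -> Optional[float]:
        durations = [
            t.resolved_at - t.created_at for t in self.tickets if t.resolved_at is not None
        ]
        return sum(durations) / len(durations) if durations else None

    def is_backed_up(self, max_open: int) -> bool:
        # Signal that the AI feature may be too unreliable for this use case.
        return self.open_count() > max_open
```

Alerting on `is_backed_up` and on a rising mean resolution time turns the queue into a reliability signal rather than only a safety net.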

Decision framework

Question	Should you build fallback?
Is the feature user-facing?	Yes, always
Is the feature critical to the user’s task?	Yes, with human escalation
Is the feature latency-sensitive?	Yes, with aggressive timeouts and cached fallback
Is the cost of a wrong answer high?	Yes, refuse rather than fall back to a weak model
Can you detect failure without the user reporting it?	Yes, monitor API errors and degrade proactively

Methodology and sources

This guide draws on provider API reliability documentation (OpenAI, Anthropic, Google), industry incident post-mortems from AI-native products, and established patterns in distributed systems design applied to AI feature architecture.

OpenAI error handling documentation: https://platform.openai.com/docs/guides/error-codes — checked 2026-05-24
Anthropic API errors and retries: https://docs.anthropic.com/en/api/errors — checked 2026-05-24
Google AI API rate limits: https://ai.google.dev/pricing — checked 2026-05-24
AWS Well-Architected Framework — fault tolerance patterns: https://docs.aws.amazon.com/wellarchitected/latest/framework/welcome.html — checked 2026-05-24

Change log

2026-05-24 — First published version.
