theLLMs

Last checked: 2026-05-24

Scope: Global. Provider reliability patterns, API error documentation, and industry incident post-mortems checked on 2026-05-24. Specific error codes and rate-limit tiers vary by provider and plan.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Fallback design: what happens when the AI call fails?

Every AI feature will fail at some point. The model API returns a 503. The rate limit hits in the middle of a user session. The context window overflows on an unexpectedly long input. The embedder is down. The vector store returns garbage.

Most teams design for the happy path and discover the failure modes in production, under user load, during a demo for the CEO.

Editor’s Note: The reliability of an AI feature is determined by what happens when the AI is unavailable, not by how often it works when everything is healthy. A cached answer that is 80% correct is better than a 503 error. A human escalation path that takes 3 minutes is better than an infinite spinner.

Quick answer

Design every AI feature with four fallback layers, in order:

  1. Degraded mode — return a simpler answer, a cached result, or a safe default
  2. Graceful error — show a clear message explaining what went wrong without blaming the provider
  3. Human escalation — route the request to a person or a manual workflow
  4. Offline state — disable the feature entirely if upstream dependencies are known to be unavailable

The right fallback depends on the cost of failure. For a code completion, degraded mode (simpler completion) is fine. For a medical triage tool, only human escalation is acceptable.
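One way to make the layer order concrete is a single entry point that decides which layer answers a request. The sketch below is illustrative only: `respond`, `Layer`, `Outcome`, and the `high_stakes` flag are hypothetical names, and the model call, cache lookup, and escalation hook are passed in as plain callables or values rather than tied to any real provider SDK.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Layer(Enum):
    FULL = "full"
    DEGRADED = "degraded"
    GRACEFUL_ERROR = "graceful_error"
    HUMAN_ESCALATION = "human_escalation"
    OFFLINE = "offline"


@dataclass
class Outcome:
    layer: Layer
    message: str


def respond(
    prompt: str,
    call_model: Callable[[str], str],
    cached_answer: Optional[str] = None,
    escalate: Optional[Callable[[str], None]] = None,
    upstream_known_down: bool = False,
    high_stakes: bool = False,
) -> Outcome:
    # Offline state: do not even attempt the call when upstream is known to be down.
    if upstream_known_down:
        return Outcome(Layer.OFFLINE, "This feature is temporarily unavailable.")
    try:
        return Outcome(Layer.FULL, call_model(prompt))
    except Exception:
        # Assumption: any exception counts as a failed call.
        # Real code would catch narrower, provider-specific errors.
        pass
    # High cost of failure: skip degraded answers and go to a person.
    if high_stakes and escalate is not None:
        escalate(prompt)
        return Outcome(Layer.HUMAN_ESCALATION, "Your request was sent to a person.")
    # Low cost of failure: a cached answer is acceptable if we label it.
    if not high_stakes and cached_answer is not None:
        return Outcome(Layer.DEGRADED, "Showing a saved answer: " + cached_answer)
    return Outcome(Layer.GRACEFUL_ERROR, "We could not answer right now. Please retry.")
```

The `high_stakes` switch is where the cost-of-failure judgment lives: a code completion feature would leave it off, while a triage tool would set it and always supply an `escalate` hook.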

What the tutorials skip

Every provider has different failure modes. OpenAI, Anthropic, Google, and open-weight providers all have different error codes, rate limits, availability SLAs, and retry behaviours. A retry strategy that works for one may cause cascading failures for another. Read the provider’s error documentation before writing fallback code.

Rate limits are not just about requests. Provider rate limits cover requests per minute, tokens per minute, tokens per day, and concurrent connections. Hitting any of these can block all users simultaneously. Design for backpressure before the limit is reached, not after.
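As an illustration of applying backpressure before the limit is reached, the sketch below tracks token usage over a sliding 60-second window and starts refusing work at a soft threshold below the provider's limit. The class name `TokenBudget`, the 80% soft ratio, and the injectable clock are all assumptions made for the example; real limits and window semantics depend on the provider and plan.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class TokenBudget:
    """Sliding-window tokens-per-minute tracker with a soft limit."""

    def __init__(
        self,
        tokens_per_minute: int,
        soft_ratio: float = 0.8,  # assumed safety margin, tune per provider
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.soft_limit = int(tokens_per_minute * soft_ratio)
        self.clock = clock
        self.events: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)
        self.used = 0

    def _expire(self) -> None:
        # Drop usage records that are older than the 60-second window.
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            _, tokens = self.events.popleft()
            self.used -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Reserve tokens if the soft limit allows it, otherwise return False."""
        self._expire()
        if self.used + tokens > self.soft_limit:
            return False  # caller should queue, shed load, or degrade
        self.events.append((self.clock(), tokens))
        self.used += tokens
        return True
```

A caller that gets `False` can queue the request, serve a cached answer, or show a retry timer, which keeps one heavy user from pushing every other user into a hard provider-side block. The same shape could be repeated for requests per minute or concurrent connections.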

Context window overflow is not always a clean error. Some providers reject an over-long request outright, while others (or the application layer in front of them) truncate the input, drop earlier context, or return a worse answer with no error at all. Fallback designs need to detect overflow before calling the API and handle long inputs differently (summarise, chunk, or refuse with an explanation).
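A pre-flight check along these lines might look like the following sketch. It assumes a rough rule of thumb of about four characters per token for English text, which is only an approximation; a production check would use the provider's own tokenizer. The function names, the reserved output budget, and the `hard_multiple` cutoff are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    # Assumption: crude character-based estimate, not a real tokenizer.
    return int(len(text) / chars_per_token) + 1


def plan_input(
    text: str,
    context_limit: int,
    reserved_for_output: int = 1024,
    hard_multiple: int = 4,
) -> str:
    """Return "send", "chunk", or "refuse" before any API call is made."""
    budget = context_limit - reserved_for_output
    needed = estimate_tokens(text)
    if needed <= budget:
        return "send"
    # Moderately oversized inputs can be summarised or split into chunks.
    if needed <= budget * hard_multiple:
        return "chunk"
    # Very large inputs are refused with an explanation to the user.
    return "refuse"
```

Deciding before the call means the user gets a specific message about input length instead of a quietly degraded answer.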

Degraded mode needs testing. A cached answer from last week may be worse than admitting the feature is temporarily unavailable. Test degraded mode outputs the same way you test happy-path outputs: accuracy, safety, and user experience.

Where teams misuse fallback design

Infinite retries. Retrying the same request against the same provider with the same parameters is not a fallback strategy. After 2–3 retries with exponential backoff, fall back to a different provider, a cached answer, or an error message.
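A bounded retry loop with exponential backoff could be sketched as below. `TransientError` is a hypothetical exception that a provider wrapper would raise for retryable failures such as timeouts or 5xx responses; the delay values and jitter range are illustrative defaults, not provider recommendations.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, 5xx)."""


def call_with_backoff(
    call: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TransientError:
            if attempt == max_retries:
                # Retries exhausted: let the caller fall back to another
                # provider, a cached answer, or an error message.
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter between 50% and 100% of the delay avoids synchronized retries.
            sleep(delay * (0.5 + rng() / 2))
    raise AssertionError("unreachable")
```

The important property is that the loop ends. Once it raises, control passes to a genuinely different strategy rather than the same request again.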

Silent fallback. Switching to a weaker model or a cached answer without telling the user erodes trust. If the feature is running in degraded mode, say so. Users would rather know they are getting a less accurate answer than assume the full system is working.

Falling back to hallucination. Some systems respond to an API error by having the model generate a plausible-sounding answer with no ground truth. This is worse than an error message. A wrong answer confidently delivered is harder to detect than a feature that says “I don’t know right now.”

Practical design patterns

Pattern 1 — Provider cascade

Call the primary provider first (e.g., Claude). On timeout or 5xx, fail over to a secondary provider (e.g., GPT-4o). If the secondary also fails, fall back to a cached response. On a cache miss, show the degraded UI.

Set request timeouts aggressively (10–15 seconds for LLM calls). Cache successful responses with a TTL appropriate for staleness tolerance.
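The cascade can be sketched in a few lines. Everything named here is hypothetical: providers are represented as `(name, callable)` pairs whose callables accept a prompt and a timeout and raise `ProviderUnavailable` on timeout or 5xx, and `TTLCache` is a small in-memory stand-in for whatever cache the system actually uses. No real provider SDK is called.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error raised by a provider wrapper on timeout or 5xx."""


@dataclass
class Result:
    text: str
    source: str  # provider name, "cache", or "degraded_ui"
    degraded: bool


class TTLCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self.clock() - stored_at > self.ttl:
            del self._store[key]  # too stale to serve
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._store[key] = (self.clock(), value)


Provider = Tuple[str, Callable[[str, float], str]]


def cascade(prompt: str, providers: Sequence[Provider], cache: TTLCache,
            timeout_s: float = 12.0) -> Result:
    for name, call in providers:
        try:
            text = call(prompt, timeout_s)
        except ProviderUnavailable:
            continue  # try the next provider in order
        cache.put(prompt, text)  # only successful responses are cached
        return Result(text, name, degraded=False)
    cached = cache.get(prompt)
    if cached is not None:
        return Result(cached, "cache", degraded=True)
    return Result("", "degraded_ui", degraded=True)
```

Returning `source` and `degraded` with every result lets the UI tell the user when an answer came from the cache instead of a live model.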

Pattern 2 — Graceful degradation matrix

| Failure type | User experience | What the system does |
|---|---|---|
| Model API down | “Results may be simpler. Our AI is temporarily unavailable.” | Return cached results or simplified heuristic output |
| Rate limited | “You’ve used your AI requests for now. Try again in a few minutes.” | Show limit, offer retry timer |
| Context overflow | “Your input was too long. Try a shorter question or upload a summary.” | Refuse with explanation, offer hint |
| Embedding API down | “Search is temporarily limited to keywords.” | Fall back to BM25 or traditional search |
| All upstream down | “This feature is temporarily unavailable. Your request has been logged.” | Log request, notify operations, show offline banner |
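Keeping the matrix as data rather than scattered `if` statements makes it easier to review and test. The sketch below encodes the rows above in a lookup table; the enum members, the action identifiers, and the `classify` helper are hypothetical names. The only protocol fact relied on is that HTTP 429 means "Too Many Requests" and 5xx codes indicate server-side failure.

```python
from enum import Enum
from typing import Dict, NamedTuple, Optional


class Failure(Enum):
    MODEL_API_DOWN = "model_api_down"
    RATE_LIMITED = "rate_limited"
    CONTEXT_OVERFLOW = "context_overflow"
    EMBEDDING_API_DOWN = "embedding_api_down"
    ALL_UPSTREAM_DOWN = "all_upstream_down"


class Degradation(NamedTuple):
    user_message: str
    action: str  # hypothetical identifier handled elsewhere in the app


MATRIX: Dict[Failure, Degradation] = {
    Failure.MODEL_API_DOWN: Degradation(
        "Results may be simpler. Our AI is temporarily unavailable.",
        "serve_cached_or_heuristic"),
    Failure.RATE_LIMITED: Degradation(
        "You've used your AI requests for now. Try again in a few minutes.",
        "show_retry_timer"),
    Failure.CONTEXT_OVERFLOW: Degradation(
        "Your input was too long. Try a shorter question or upload a summary.",
        "refuse_with_hint"),
    Failure.EMBEDDING_API_DOWN: Degradation(
        "Search is temporarily limited to keywords.",
        "keyword_search"),
    Failure.ALL_UPSTREAM_DOWN: Degradation(
        "This feature is temporarily unavailable. Your request has been logged.",
        "log_and_show_offline_banner"),
}


def classify(component: str, status_code: int) -> Optional[Failure]:
    """Map an HTTP status from a component ("model" or "embedding") to a row."""
    if status_code == 429:
        return Failure.RATE_LIMITED
    if 500 <= status_code <= 599:
        if component == "embedding":
            return Failure.EMBEDDING_API_DOWN
        return Failure.MODEL_API_DOWN
    # Context overflow and total outages are detected elsewhere
    # (pre-flight length checks and health monitoring respectively).
    return None
```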

Pattern 3 — Human escalation with queues

For high-cost failures (customer support, compliance, medical), route automatically to a human review queue. The queue should show:

  • The original user request
  • What the AI attempted and why it failed
  • Recommended actions from the fallback logic

Track queue size and resolution time. If the queue backs up, that is a signal the AI feature is too unreliable for its current use case.
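A minimal in-memory version of such a queue, carrying the three fields listed above and tracking backlog and resolution time, might look like this. `ReviewQueue`, `Escalation`, and the `max_backlog` threshold are hypothetical; a real system would use a durable queue or ticketing tool rather than a Python `deque`.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Optional


@dataclass
class Escalation:
    request: str                    # the original user request
    failure_reason: str             # what the AI attempted and why it failed
    recommended_actions: List[str]  # suggestions from the fallback logic
    created_at: float


class ReviewQueue:
    def __init__(self, max_backlog: int = 50,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_backlog = max_backlog
        self.clock = clock
        self._items: Deque[Escalation] = deque()
        self._resolution_times: List[float] = []

    def enqueue(self, request: str, failure_reason: str,
                recommended_actions: List[str]) -> Escalation:
        item = Escalation(request, failure_reason, recommended_actions, self.clock())
        self._items.append(item)
        return item

    def resolve_next(self) -> Escalation:
        """Pop the oldest item and record how long it waited."""
        item = self._items.popleft()
        self._resolution_times.append(self.clock() - item.created_at)
        return item

    @property
    def backlog(self) -> int:
        return len(self._items)

    def is_backed_up(self) -> bool:
        # Signal that the AI feature may be too unreliable for this use case.
        return self.backlog > self.max_backlog

    def mean_resolution_seconds(self) -> Optional[float]:
        if not self._resolution_times:
            return None
        return sum(self._resolution_times) / len(self._resolution_times)
```

Exporting `backlog` and `mean_resolution_seconds()` to the same dashboard as API error rates makes the reliability signal visible alongside the failures that cause it.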

Decision framework

| Question | Should you build fallback? |
|---|---|
| Is the feature user-facing? | Yes, always |
| Is the feature critical to the user’s task? | Yes, with human escalation |
| Is the feature latency-sensitive? | Yes, with aggressive timeouts and cached fallback |
| Is the cost of a wrong answer high? | Yes, refuse rather than fall back to a weak model |
| Can you detect failure without the user reporting it? | Yes, monitor API errors and degrade proactively |

Methodology and sources

This guide draws on provider API reliability documentation (OpenAI, Anthropic, Google), industry incident post-mortems from AI-native products, and established patterns in distributed systems design applied to AI feature architecture.

Change log

2026-05-24 — First published version.

Source list