theLLMs

Last checked: 2026-05-22

Scope: Global. Provider quota and rate-limit docs were checked on 2026-05-22; specific limits can vary by account, model and region.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Rate limits explained: requests, tokens, tiers and hidden launch risks

Rate limits are the boring reason a feature works in a demo and then falls over under real use.

They can cap requests per minute, tokens per minute, concurrent jobs, daily usage or spend. They can also vary by model, tier, account age, billing status or region. If you do not plan for them, the app will plan for you.

Editor’s Note: A rate-limit problem is often a capacity problem wearing a polite error message.

Editor’s Note: The first fix is usually not “raise the limit.” It is “reduce demand, add a fallback, or spread the load.”

Summary

If you are launching an AI feature, check the request cap, token cap and concurrency rules before you ship.

The safest plan is to assume the live workload will be spikier than the test workload. Then add one or more of these: backoff, queueing, cache hits, shorter prompts, smaller outputs, batching where allowed, and a user-facing fallback when the provider says no.

Do not treat a successful demo as proof that your launch capacity is safe.

What rate limits actually cover

Different providers use different labels, but the idea is the same: the service is protecting itself and other customers from overload.

Common limit types include:

  • requests per minute;
  • tokens per minute;
  • tokens per day;
  • concurrent requests;
  • file or tool-call quotas;
  • model-specific caps;
  • spend or billing thresholds.

A feature can be under the request cap and still fail because the token cap is hit first. A few long prompts can exhaust a token budget faster than many short ones.
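A quick way to see which cap binds first is to work it out per request size. The sketch below is only arithmetic: the cap values and token counts are made-up illustrations, not any provider's real limits, and the function names are hypothetical.

```python
def token_bound(tpm_cap: int, tokens_per_request: int) -> int:
    """How many requests per minute the token cap alone would allow."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return tpm_cap // tokens_per_request


def sustainable_rpm(rpm_cap: int, tpm_cap: int, tokens_per_request: int) -> int:
    """Requests per minute allowed once both caps are applied."""
    return min(rpm_cap, token_bound(tpm_cap, tokens_per_request))


def binding_cap(rpm_cap: int, tpm_cap: int, tokens_per_request: int) -> str:
    """Name the cap that is hit first for this request size."""
    if token_bound(tpm_cap, tokens_per_request) < rpm_cap:
        return "tokens"
    return "requests"


if __name__ == "__main__":
    # Assumed caps for illustration only: 500 requests/min, 200,000 tokens/min.
    rpm_cap, tpm_cap = 500, 200_000
    for size in (300, 8_000):  # tokens per request, input plus output
        print(size, sustainable_rpm(rpm_cap, tpm_cap, size),
              binding_cap(rpm_cap, tpm_cap, size))
```

With these assumed numbers, 300-token requests are limited by the request cap, while 8,000-token requests are limited by the token cap at 25 per minute, far below the 500 the request cap suggests.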

Hidden launch risks

The obvious risk is a hard failure. The less obvious risks are slower and messier:

  1. requests queue up and the UI feels broken;
  2. retries multiply the load;
  3. silent fallbacks use a cheaper model with worse output;
  4. partial responses cause rework downstream;
  5. a bursty customer segment trips a tier cap;
  6. a batch job competes with live traffic;
  7. a new feature exposes a quota you never noticed.
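Item 2 is easy to underestimate, so here is a rough sketch of how retries multiply load. It assumes every attempt is rejected independently with the same probability, which is a simplification: in practice the rejection rate tends to rise as retries pile on, so treat the result as a floor. The function name is hypothetical.

```python
def expected_attempts(reject_prob: float, max_retries: int) -> float:
    """Average calls sent per user request when rejected attempts are retried.

    Attempt k is only made if the previous k attempts were all rejected,
    which happens with probability reject_prob ** k.
    """
    if not 0.0 <= reject_prob <= 1.0:
        raise ValueError("reject_prob must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    return sum(reject_prob ** k for k in range(max_retries + 1))


if __name__ == "__main__":
    # Assumed scenario: half of all attempts are rejected, up to 3 retries each.
    print(expected_attempts(0.5, 3))
```

Under that assumed scenario each user request turns into almost two provider calls on average, all counted against a quota that is already exhausted.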

The app can look healthy in development and still be under-provisioned for the first real launch spike.

What to check before launch

Use this checklist:

  • What is the request cap for the exact model you plan to use?
  • What is the token cap, and does it apply to input, output or both?
  • What happens when the cap is exceeded: error, queue, reject or throttle?
  • Is concurrency limited separately from requests?
  • Are some accounts, regions or models capped differently?
  • Do retries count against the same quota?
  • Is there a paid path or tier upgrade if you need more headroom?
  • Is there a fallback model or non-LLM path that still gives the user something useful?

If you cannot answer those questions, you are not ready to promise live reliability.

A simple launch pattern

A sane launch path usually looks like this:

  • start with a narrower feature scope;
  • cap output length;
  • keep prompts short;
  • add retry backoff, not retry storms;
  • cache repeated work;
  • separate live traffic from batch jobs;
  • show a useful fallback when the provider is unavailable.
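Two items from that list, retry backoff and a useful fallback, can be sketched together. This is a minimal illustration, not a production client: `RateLimited`, `call_provider` and `fallback` are hypothetical names standing in for whatever your provider SDK raises and whatever your app can serve without the model. The delays use exponential backoff with random jitter so that many clients do not retry in lockstep.

```python
import random
import time
from typing import Callable, Iterator, Optional


class RateLimited(Exception):
    """Hypothetical error: the provider rejected the call for quota reasons."""


def backoff_delays(base: float, cap: float, attempts: int,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay per retry, with a ceiling that doubles up to cap."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def call_with_fallback(call_provider: Callable[[], str],
                       fallback: Callable[[], str],
                       max_retries: int = 3,
                       base: float = 0.5,
                       cap: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> str:
    """Try the provider, back off between rejections, then fall back."""
    delays = backoff_delays(base, cap, max_retries, rng)
    for attempt in range(max_retries + 1):
        try:
            return call_provider()
        except RateLimited:
            if attempt == max_retries:
                break  # retry budget spent; do not sleep again
            sleep(next(delays))
    return fallback()
```

The retry count is bounded on purpose. An unbounded loop is how a short provider hiccup turns into a retry storm.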

That is not glamorous. It is how you avoid learning about rate limits from angry users.

Practical decision check

Before you buy more capacity, ask:

  • Is the bottleneck really demand, or is the app being wasteful?
  • Can the prompt be shorter?
  • Can the answer be shorter?
  • Can the same result be cached or batched?
  • Is live traffic being mixed with offline work?
  • Is there a deterministic fallback?
  • Have we tested the burst case, not just the happy path?

Rate-limit planning is mostly about respecting the gap between average use and peak use.
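One way to put a number on that gap is a peak-to-average ratio over per-minute request counts. The sketch below assumes you have request timestamps in seconds from a log or load test; the function names and the sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable


def per_minute_counts(timestamps: Iterable[float]) -> Counter:
    """Bucket request timestamps (in seconds) into whole minutes."""
    return Counter(int(ts // 60) for ts in timestamps)


def peak_to_average(timestamps: Iterable[float], window_minutes: int) -> float:
    """Busiest minute divided by the mean per-minute rate over the window.

    window_minutes is the full observation period, so idle minutes
    pull the average down as they should.
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    counts = per_minute_counts(timestamps)
    if not counts:
        return 0.0
    average = sum(counts.values()) / window_minutes
    return max(counts.values()) / average


if __name__ == "__main__":
    # Made-up trace: 60 requests in the first minute, then 10 per minute for 5 minutes.
    burst = [float(s) for s in range(60)]
    steady = [60.0 * m + 1.0 for m in range(1, 6) for _ in range(10)]
    print(round(peak_to_average(burst + steady, window_minutes=6), 2))
```

If the ratio for your traffic is well above 1, sizing against the average per-minute rate will understate what the busiest minute needs.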

What this page cannot tell you

This page cannot tell you your exact quota. It also cannot tell you:

  • what tier your account is on;
  • whether your limit will change tomorrow;
  • whether a hidden billing rule will apply;
  • whether a specific provider will raise your cap quickly;
  • whether your real traffic pattern will be burstier than you think.

It can only help you ask the right questions before launch day.

Global applicability

This article is global. There is no UK, GB or Northern Ireland split to apply here.

The useful caution is the same everywhere: the published limit page is part of the product. If you have not read it, you have not planned the launch.

Methodology and sources

Check date: 2026-05-22

What was checked: provider rate-limit and quota documentation.

What the sources were used for:

  • the kinds of limits providers expose;
  • the difference between request caps, token caps and concurrency;
  • the need for fallback and backoff in production workflows.

Assumptions and limits:

  • limits can vary by account and model;
  • provider docs change;
  • this is operational guidance, not a guarantee that a plan is safe;
  • launch load can exceed test load in ways that are hard to predict.

Change log

  • 2026-05-24: integrated into the theLLMs prototype after editor Ship review; related links were converted to live prototype routes only.
  • 2026-05-22: first draft built from the llm-editor-approved brief, with a launch-capacity checklist, a burst-risk framing, and fallback planning guidance.

Source list