Rate limits explained: requests, tokens, tiers and hidden launch risks

Rate limits are the boring reason a feature works in a demo and then falls over under real use.

They can cap requests per minute, tokens per minute, concurrent jobs, daily usage or spend. They can also vary by model, tier, account age, billing status or region. If you do not plan for them, the limits will decide how your app behaves under load.

TL;DR

If you are launching an AI feature, check the request cap, token cap and concurrency rules before you ship.

The safest plan is to assume the live workload will be spikier than the test workload. Then add one or more of these: backoff, queueing, caching, shorter prompts, smaller outputs, batching where allowed, and a user-facing fallback when the provider says no.

Do not treat a successful demo as proof that your launch capacity is safe.

What rate limits actually cover

Different providers use different labels, but the idea is the same: the service is protecting itself and other customers from overload.

Common limit types include:

requests per minute;
tokens per minute;
tokens per day;
concurrent requests;
file or tool-call quotas;
model-specific caps;
spend or billing thresholds.

A feature can be under the request cap and still fail because the token cap is hit first. A small number of long prompts can use up a per-minute token budget faster than many short ones.
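To make the request-versus-token point concrete, here is a minimal sketch that works out which cap binds first. The function name and the cap values (500 requests per minute, 200,000 tokens per minute) are made-up assumptions for illustration, not any provider's real limits.

```python
from typing import Tuple


def binding_limit(rpm_cap: int, tpm_cap: int, tokens_per_request: int) -> Tuple[str, int]:
    """Return which cap is reached first and the sustainable requests per minute.

    All inputs are hypothetical planning numbers; tokens_per_request should
    include whatever the provider counts (input, output or both).
    """
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    # How many requests fit into the token budget each minute.
    requests_allowed_by_tokens = tpm_cap // tokens_per_request
    if requests_allowed_by_tokens < rpm_cap:
        return ("tokens", requests_allowed_by_tokens)
    return ("requests", rpm_cap)


# Illustrative caps only: 500 requests/min and 200,000 tokens/min.
print(binding_limit(500, 200_000, 300))    # ('requests', 500)
print(binding_limit(500, 200_000, 4_000))  # ('tokens', 50)
```

With the same hypothetical caps, 4,000-token requests run out of token budget at 50 requests per minute, a tenth of the request cap.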

Hidden launch risks

The obvious risk is a hard failure. The less obvious risks are slower and messier:

requests queue up and the UI feels broken;
retries multiply the load;
silent fallbacks use a cheaper model with worse output;
partial responses cause rework downstream;
a bursty customer segment trips a tier cap;
a batch job competes with live traffic;
a new feature exposes a quota you never noticed.

The app can look healthy in development and still be under-provisioned for the first real launch spike.
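The "retries multiply the load" risk can be estimated with simple arithmetic. The sketch below assumes each attempt fails independently with the same probability, which is a simplification: under real throttling, failures tend to cluster, so the worst case (every attempt fails) is the number to plan around. The function name is illustrative.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of calls sent per user request.

    Assumes every attempt fails independently with probability failure_rate
    and that the client retries immediately, up to max_retries times.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    # Attempt k is only sent if the previous k attempts all failed.
    return sum(failure_rate ** k for k in range(max_retries + 1))


print(expected_attempts(0.0, 3))  # 1.0   -> no failures, no extra load
print(expected_attempts(0.5, 3))  # 1.875 -> nearly double the traffic
print(expected_attempts(1.0, 3))  # 4.0   -> fully throttled: 4x the traffic
```

The last line is the retry-storm case: when the provider is rejecting everything, three immediate retries turn each user request into four calls against the same quota.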

What to check before launch

Use this checklist:

What is the request cap for the exact model you plan to use?
What is the token cap, and does it apply to input, output or both?
What happens when the cap is exceeded: error, queue, reject or throttle?
Is concurrency limited separately from requests?
Are some accounts, regions or models capped differently?
Do retries count against the same quota?
Is there a paid path or tier upgrade if you need more headroom?
Is there a fallback model or non-LLM path that still gives the user something useful?

If you cannot answer those questions, you are not ready to promise live reliability.
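For the "what happens when the cap is exceeded" question, many HTTP APIs answer with status 429 and may include a Retry-After header, given either as a number of seconds or as an HTTP date. The sketch below reads that header using only the Python standard library. Whether your provider sends it, and in which form, is an assumption to verify against its documentation; the function name and the one-second default are illustrative.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def retry_delay_seconds(
    status: int,
    headers: Mapping[str, str],
    default: float = 1.0,
    now: Optional[datetime] = None,
) -> Optional[float]:
    """Return how long to wait before retrying, or None if the status is not 429."""
    if status != 429:
        return None
    # Header names are case-insensitive, so compare in lowercase.
    value = next((v for k, v in headers.items() if k.lower() == "retry-after"), None)
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        # Form 1: delay in whole seconds.
        return float(value)
    try:
        # Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(retry_delay_seconds(429, {"Retry-After": "7"}))  # 7.0
print(retry_delay_seconds(200, {}))                    # None
```

Honouring the server's hint, when one is given, is usually gentler on the quota than guessing a delay.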

A simple launch pattern

A sane launch path usually looks like this:

start with a narrower feature scope;
cap output length;
keep prompts short;
add retry backoff, not retry storms;
cache repeated work;
separate live traffic from batch jobs;
show a useful fallback when the provider is unavailable.

That is not glamorous. It is how you avoid learning about rate limits from angry users.
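Two items from the list above, retry backoff and a useful fallback, can be sketched together. This is a minimal example of exponential backoff with full jitter, not a production client. `RateLimited` is a hypothetical exception that your own provider wrapper would raise, and the retry count and delays are assumed values to tune.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error: raise it when the provider rejects a call for quota reasons."""


def call_with_backoff(
    call: Callable[[], T],
    fallback: Callable[[], T],
    max_retries: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Try call(); on RateLimited, wait with jittered exponential backoff, then fall back."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                break  # out of retries: stop adding load
            # The delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: a random wait in [0, ceiling) spreads clients apart.
            sleep(rng() * ceiling)
    return fallback()


def always_limited() -> str:
    raise RateLimited()


# With sleep stubbed out, the fallback is returned after the retries are spent.
print(call_with_backoff(always_limited, lambda: "cached summary", sleep=lambda s: None))
```

The jitter matters as much as the backoff: without it, clients that were throttled together retry together and hit the cap again at the same moment.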

Practical decision check

Before you buy more capacity, ask:

Is the bottleneck really demand, or is the app being wasteful?
Can the prompt be shorter?
Can the answer be shorter?
Can the same result be cached or batched?
Is live traffic being mixed with offline work?
Is there a deterministic fallback?
Have we tested the burst case, not just the happy path?

Rate-limit planning is mostly about respecting the gap between average use and real use.
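The gap between average use and real use is easy to see with per-minute counts. The sketch below uses made-up traffic numbers and a made-up cap of 100 requests per minute; the function name is illustrative.

```python
from typing import Dict, List


def burst_report(requests_per_minute: List[int], rpm_cap: int) -> Dict[str, float]:
    """Summarise average load, peak load and how many minutes exceed the cap."""
    if not requests_per_minute:
        raise ValueError("need at least one minute of traffic")
    return {
        "average": sum(requests_per_minute) / len(requests_per_minute),
        "peak": max(requests_per_minute),
        "minutes_over_cap": sum(1 for n in requests_per_minute if n > rpm_cap),
    }


# Invented traffic: quiet minutes plus one launch-style spike.
traffic = [20, 25, 30, 400, 35, 30]
print(burst_report(traffic, rpm_cap=100))
# {'average': 90.0, 'peak': 400, 'minutes_over_cap': 1}
```

The average (90) sits under the cap, yet the spike minute is four times over it. Sizing against the average alone would have missed the failure.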

What this page cannot tell you

This page cannot tell you your exact quota.

It cannot tell you:

what tier your account is on;
whether your limit will change tomorrow;
whether a hidden billing rule will apply;
whether a specific provider will raise your cap quickly;
whether your real traffic pattern will be burstier than you think.

It can only help you ask the right questions before launch day.

Methodology

Date checked: 2026-05-28
Sources consulted: Provider rate-limit and quota documentation (OpenAI, Anthropic, Google Gemini, Google Cloud)
Assumptions: Limits can vary by account, model, and region. Provider documentation changes. This is operational guidance, not a guarantee that any specific plan is safe.
Limitations: Launch load can exceed test load in ways that are hard to predict. This guide covers planning and fallback design, not provider-specific quota negotiation.
Jurisdiction: Global. The guidance on rate limits, backoff, and fallbacks applies regardless of jurisdiction.

Source list

OpenAI rate limits docs — https://platform.openai.com/docs/guides/rate-limits (accessed 2026-05-28)
Anthropic rate limits docs — https://docs.anthropic.com/en/api/rate-limits (accessed 2026-05-28)
Google Gemini rate limits docs — https://ai.google.dev/gemini-api/docs/rate-limits (accessed 2026-05-28)
Google Cloud quotas overview — https://cloud.google.com/docs/quota (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist: added third Editor’s Note (usage tiers), slugified all H2 IDs, added Trust Stack and proper Methodology sections, added source access dates, renamed Summary to Quick Answer, removed workflow leaks from Change Log, and corrected related guide links to relative paths.
2026-05-24: Added related guide links and prepared for publication.
2026-05-22: First published. Initial draft with launch-capacity checklist, burst-risk framing, and fallback planning guidance.