Batch APIs for LLMs: cheaper, slower and often underused
Every synchronous chat API call to an LLM provider carries a premium for low latency. If your application does not need an answer in seconds, you are paying for speed you are not using.
Batch APIs — where you submit a file of requests and collect results minutes or hours later — typically cost 50% less than real-time endpoints. They are one of the simplest cost optimisation tools available, and most teams underuse them.
Quick answer
Batch APIs let you submit large collections of LLM requests at once and retrieve results asynchronously. OpenAI’s batch API offers 50% discount on most models. Anthropic offers 50% off on batch for most models. Google’s Vertex AI batch prediction provides similar discounts. The trade-off is latency: results come back in minutes to hours depending on queue depth and workload size.
Use batch APIs for: content classification at scale, data extraction pipelines, translation jobs, offline content generation, evaluation runs against test sets, and any workload where a sync response time is not required. Do not use batch APIs for: chat applications, real-time features, interactive tools, or any flow where a user waits for the response.
What batch APIs actually do
Instead of sending one request and waiting for one response, you upload a JSONL file containing many requests. The provider processes them asynchronously and places the results in a storage location or returns a downloadable file. You poll for completion and download the results.
OpenAI’s batch implementation accepts up to 50,000 requests per batch or 100 MB of input, whichever limit comes first. Anthropic’s Message Batches accept up to 10,000 requests per batch. Both offer the same model availability and output quality as their real-time endpoints — the discount comes from scheduler flexibility, not degraded service.
The key operational difference: batch requests share a queue with other customers’ batch workloads. Completion time depends on total queue length and provider capacity. OpenAI and Anthropic both guarantee 24-hour completion windows but often complete much faster.
Where teams misuse batch APIs
-
Using sync APIs for batchable workloads. A team processing 10,000 customer support tickets through a real-time chat endpoint, paying full price for sub-second latency on a task that could wait an hour. The cost difference on a batch workload of this size is substantial — a 50% discount on large volumes changes the unit economics of the feature.
-
Treating batch as lower quality. Batch and real-time endpoints use the same models with the same parameters. Output quality is identical. The discount comes from queue flexibility, not model degradation. Some providers may use lower-priority scheduling for batch, but the inference itself is identical.
-
Ignoring batch for evaluation. Teams running weekly eval suites against test sets of 1,000+ examples often use sync APIs out of habit. Switching to batch for these periodic workloads cuts the evaluation cost in half with no practical downside — the results are not needed until the next day.
-
Failing to design for batch from the start. Building a data pipeline that sends individual requests and collects responses is easy. Restructuring it to submit a file-based batch workload later is harder. If your workload is batchable by nature, design the batch path first.
Practical decision check
- Does the user wait for this response? If yes, use sync. If no, use batch.
- Can results arrive within minutes to hours? If yes, batch is viable.
- Is your workload volume above 100 requests? Below that, the setup overhead may outweigh the discount.
- Do you need identical model output to sync pricing? Yes — batch uses the same models at the same parameter settings.
- Does your application need real-time error handling per request? Batch is harder to debug per-request failures. Sync may be worth the premium during development.
What teams get wrong about batch design
The main failure mode is building a batch pipeline that replicates per-request behaviour without thinking about error handling at scale. A batch of 5,000 requests will have some failures — rate limits, timeouts, content filter hits. Your pipeline must handle partial completion: retry failed items, track per-request metadata, and avoid reprocessing successful results.
The second failure mode is assuming batch completion time is fixed. Batch queues are shared. Your batch of 100 requests might complete in two minutes during off-peak hours or take two hours during a provider’s busy period. Monitoring and alerting on batch completion time prevents silent delays from propagating downstream.
Methodology and sources
Check date: 2026-05-25
What was checked: OpenAI Batch API documentation and pricing, Anthropic Message Batches documentation and pricing, Google Vertex AI batch prediction docs, community reports on batch completion times.
Assumptions and limits: Batch API availability and pricing change. Completion times are provider-dependent and vary by region and queue load. Some models may not be available on batch endpoints.
Source list
- OpenAI Batch API — https://platform.openai.com/docs/guides/batch
- Anthropic Message Batches — https://docs.anthropic.com/en/docs/build-with-claude/batch-processing
- Google Vertex AI batch prediction — https://cloud.google.com/vertex-ai/docs/predictions/batch-predictions
Related guides
- API model pricing: input, output, cache and batch costs
- Prompt caching explained: when repeated context becomes cheaper
- Rate limits explained: requests, tokens, tiers and hidden launch risks
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.