OpenAI, Anthropic, Google and Mistral APIs: what comparison pages should measure
Comparing hosted AI providers is an industry sport. The problem is that most comparisons compare the wrong things: leaderboard scores, feature checklists, and demo performance on cherry-picked tasks.
A useful comparison measures what changes after you sign up: cost, quality on your data, reliability, data control, and switching difficulty.
Quick answer
Ignore rankings that compare providers on benchmark averages or feature counts. Build your own comparison around four categories: cost under realistic usage, quality on your specific task set, data governance and portability, and reliability evidence. Everything else is noise.
What to measure
Cost under realistic usage
Provider pricing pages list per-token input and output costs. Those numbers are the starting point, not the finish.
A useful cost comparison accounts for:
- Prompt caching. Does the provider discount repeated prefix tokens? Is caching automatic, opt-in, or unavailable?
- Batch discounts. Is there a separate batch API with lower per-token cost? What latency trade-off does it carry?
- Output verbosity. Some models produce longer outputs for the same prompt. Cost per task can vary by 2x or more even at the same per-token rate.
- Rate limit tiers. Higher tiers may require a committed or minimum spend. The per-token cost at your actual usage tier may differ from the headline price.
Compare cost per completed task, not cost per token.
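The factors above can be folded into a single cost-per-task figure. The sketch below is one minimal way to do that in Python. The `Pricing` and `TaskProfile` names, the discount fields, and the `attempts_per_success` retry multiplier are assumptions made for illustration, and every number is a placeholder rather than any provider's real price.

```python
from dataclasses import dataclass


@dataclass
class Pricing:
    """Hypothetical per-million-token prices in USD, taken from a pricing page."""
    input_per_mtok: float
    output_per_mtok: float
    cached_input_discount: float = 0.0  # fraction off cached prefix tokens (0.0 = no caching)
    batch_discount: float = 0.0         # fraction off via a batch API (0.0 = none)


@dataclass
class TaskProfile:
    """Averages measured on your own workload for one model."""
    input_tokens: float
    output_tokens: float
    cached_fraction: float = 0.0        # share of input tokens served from cache
    attempts_per_success: float = 1.0   # assumed retries needed per acceptable result


def cost_per_task(p: Pricing, t: TaskProfile, use_batch: bool = False) -> float:
    """Estimated USD cost of one completed task."""
    cached = t.input_tokens * t.cached_fraction
    uncached = t.input_tokens - cached
    billable_input = uncached + cached * (1 - p.cached_input_discount)
    input_cost = billable_input * p.input_per_mtok / 1e6
    output_cost = t.output_tokens * p.output_per_mtok / 1e6
    total = (input_cost + output_cost) * t.attempts_per_success
    if use_batch:
        total *= 1 - p.batch_discount
    return total


# Same per-token price, different output verbosity (placeholder numbers).
price = Pricing(input_per_mtok=1.0, output_per_mtok=4.0)
terse = TaskProfile(input_tokens=1000, output_tokens=500)
verbose = TaskProfile(input_tokens=1000, output_tokens=1500)

print(f"terse:   ${cost_per_task(price, terse):.4f} per task")
print(f"verbose: ${cost_per_task(price, verbose):.4f} per task")
```

With identical headline prices, the more verbose model costs more than twice as much per task in this example, which is the gap a per-token comparison would hide.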
Quality on your workload
Benchmark generalisations do not help here. You need to test the models on your actual inputs and measure what matters for your product.
A minimal evaluation set should include:
- examples that represent the range of your user inputs;
- edge cases that have caused failures before;
- known-answer questions to check factual grounding;
- examples that test safety-relevant boundaries.
Run the same set through each provider and score outputs against the same criteria. The model that scores highest on a general benchmark such as MMLU is not necessarily the model that scores highest on a specific business task.
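A harness for this can be very small. The sketch below assumes each provider is wrapped in a plain function from prompt to output text; the wrappers themselves, the `Example` fields, and the deliberately crude `contains_expected` scorer are all hypothetical stand-ins for your own SDK calls and scoring criteria.

```python
from typing import Callable, Dict, List, NamedTuple


class Example(NamedTuple):
    prompt: str
    expected: str  # known answer or reference used by the scorer
    tag: str       # e.g. "typical", "edge_case", "known_answer", "safety"


Provider = Callable[[str], str]          # prompt -> output text
Scorer = Callable[[str, Example], float]  # (output, example) -> score in 0..1


def contains_expected(output: str, ex: Example) -> float:
    """Crude placeholder scorer: 1.0 if the expected string appears in the output."""
    return 1.0 if ex.expected.lower() in output.lower() else 0.0


def evaluate(providers: Dict[str, Provider], examples: List[Example],
             scorer: Scorer = contains_expected) -> Dict[str, Dict[str, float]]:
    """Mean score per provider per tag, using identical examples and scorer for all."""
    results: Dict[str, Dict[str, float]] = {}
    for name, call in providers.items():
        by_tag: Dict[str, List[float]] = {}
        for ex in examples:
            by_tag.setdefault(ex.tag, []).append(scorer(call(ex.prompt), ex))
        results[name] = {tag: sum(s) / len(s) for tag, s in by_tag.items()}
    return results


# Stub providers so the sketch runs without network access.
def stub_a(prompt: str) -> str:
    canned = {"What is 2 + 2?": "The answer is 4.", "Capital of France?": "Paris."}
    return canned.get(prompt, "Please clarify your request.")


def stub_b(prompt: str) -> str:
    return "I am not sure."


examples = [
    Example("What is 2 + 2?", "4", "known_answer"),
    Example("Capital of France?", "Paris", "known_answer"),
    Example("", "clarify", "edge_case"),
]

for name, scores in evaluate({"a": stub_a, "b": stub_b}, examples).items():
    print(name, scores)
```

Reporting scores per tag rather than as one average keeps edge-case and safety failures visible instead of letting typical inputs dilute them.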
Data governance and portability
This is where provider choices differ most, and where switching later is hardest.
Compare:
- Training data use. Does the provider use API inputs for training? Can you opt out?
- Data retention. How long are prompts, outputs and logs kept? Can you delete or export them?
- Region controls. Can you restrict data processing to a specific region? What legal basis applies?
- Model update policy. Does the API endpoint serve the same model version indefinitely, or can it change without notice?
- Export options. Can you export logs, evaluations, prompts and configuration?
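Recording each provider's answers in the same structure makes gaps easy to spot. The sketch below is one possible shape; the `GovernanceRecord` field names are assumptions that mirror the questions above, and `None` is used to mean "not yet confirmed from the provider's own terms".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class GovernanceRecord:
    """One provider's answers to the governance questions (hypothetical field names)."""
    provider: str
    trains_on_api_inputs: Optional[bool] = None
    training_opt_out: Optional[bool] = None
    retention_days: Optional[int] = None
    region_restriction: Optional[bool] = None
    pinned_model_versions: Optional[bool] = None
    log_export: Optional[bool] = None


def open_questions(record: GovernanceRecord) -> List[str]:
    """Names of the questions that still have no confirmed answer."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values, not a statement about any real provider.
record = GovernanceRecord(provider="example-provider", retention_days=30, log_export=True)
print(open_questions(record))
```

A comparison with open questions left in any record is comparing assumptions rather than documented terms.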
Reliability evidence
Provider SLAs, status pages and incident reports vary widely in detail.
Compare:
- Published SLA. Is there an uptime guarantee? Does it cover the API or only specific endpoints?
- Status page quality. Does the status page keep an incident history or show only current status? Are incident descriptions specific or vague?
- Latency variance. Averages hide long tails. Look for p95 or p99 latency if available.
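If a provider does not publish tail latency, you can estimate it from your own request timings. The sketch below uses the nearest-rank percentile method on synthetic samples; the `percentile` helper and the sample values are illustrative assumptions, not measurements of any provider.

```python
import math
import statistics
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [100.0] * 90 + [2000.0] * 10

print("mean  :", statistics.mean(latencies_ms))
print("median:", percentile(latencies_ms, 50))
print("p95   :", percentile(latencies_ms, 95))
print("p99   :", percentile(latencies_ms, 99))
```

In this synthetic sample the median is 100 ms while p95 is 2000 ms, so one request in ten is twenty times slower than the typical case, and the mean alone would not show that.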
What to ignore
Some comparison metrics are actively misleading:
- Benchmark aggregates. A single number cannot represent capability across tasks. A model that gains two points on MMLU but drops ten on a coding benchmark has not improved overall, whatever the aggregate says.
- Model size or parameter count. Larger is not reliably better, and mixture-of-experts (MoE) architectures, which activate only a fraction of their parameters per token, make direct size comparisons misleading.
- Feature checklists. Two providers may both claim “structured outputs” but implement them differently. Format guarantees are not the same as correctness guarantees.
- Unweighted feature comparisons. A checkmark beside “audit logs” is worthless if the implementation limits log retention to 24 hours.
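One alternative to an unweighted checklist is to score each criterion on a scale and weight it by how much it matters to your workload. The sketch below is a minimal example; the criteria names and weights in `WEIGHTS` are hypothetical and should be replaced with your own, and the resulting total is only as meaningful as the rubric behind it.

```python
from typing import Dict

# Hypothetical criteria and weights; derive yours from your own workload.
WEIGHTS: Dict[str, float] = {
    "cost_per_task": 0.3,
    "task_quality": 0.4,
    "data_governance": 0.2,
    "reliability": 0.1,
}


def weighted_score(scores: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion scores (each in 0..1) into one weighted total."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"unscored criteria: {sorted(missing)}")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[c] * scores[c] for c in weights)


# Partial credit instead of a checkmark: for example, audit logs exist but
# are kept only briefly, so data_governance gets a low score rather than a tick.
candidate = {
    "cost_per_task": 0.8,
    "task_quality": 0.6,
    "data_governance": 0.2,
    "reliability": 0.7,
}
print(round(weighted_score(candidate), 3))
```

Refusing to produce a total when a criterion is unscored forces every provider to be judged on the same rubric.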
What teams get wrong
The most common mistake is comparing providers before defining your own criteria. Without a rubric tied to your workload, you are comparing marketing materials.
Other common errors:
- treating the cheapest per-token price as the cheapest total cost;
- assuming “enterprise” means the same thing at every provider;
- ignoring training data policy because it seems unlikely to matter, until it does;
- comparing model versions that are not actually equivalent;
- accepting free-trial performance as a reliable indication of paid-tier behaviour.
Practical decision check
Before you act on a provider comparison:
- Have you defined your evaluation criteria before looking at provider materials?
- Are you measuring cost per task, not cost per token?
- Have you run your own evaluation set through each provider?
- Do you know the data retention and training use policy for each option?
- Do you know how to leave if you need to?
If any answer is no, the comparison is not ready.
Methodology and sources
Check date: 2026-05-24
What was checked: Provider pricing pages, API documentation, data processing agreements, model cards, and status pages from OpenAI, Anthropic, Google (Gemini API) and Mistral.
What the sources were used for: Identifying the structural differences in pricing, data governance, and reliability transparency between providers.
Assumptions and limits: Provider terms and pricing change. This is a comparison framework, not a price comparison, and does not recommend any specific provider.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief, with a comparison rubric for evaluating hosted LLM providers.
Source list
- OpenAI API pricing — https://openai.com/api/pricing/
- Anthropic API pricing — https://www.anthropic.com/pricing
- Google Gemini API pricing — https://ai.google.dev/pricing
- Mistral pricing — https://console.mistral.ai/pricing
- OpenAI data controls FAQ — https://help.openai.com/en/articles/7039943-data-controls-faq
- Anthropic privacy policy — https://www.anthropic.com/legal/privacy
- Google Cloud data processing — https://cloud.google.com/terms/data-processing-addendum
- Mistral data policy — https://mistral.ai/terms/