Hosted API vs self-hosted open model: the real cost comparison
The hosted-API-vs-self-hosted decision is not simply “pay per token vs pay per GPU hour.” The real calculation includes utilisation, ops labour, scaling latency, model update frequency, fallback requirements, and the cost of being wrong about demand.
The short answer: hosted APIs are cheaper for low-to-moderate usage and variable workloads. Self-hosting starts to make sense somewhere in the range of 1–5 million tokens per day, but only if you can sustain high GPU utilisation and have the ops capacity to manage inference infrastructure.
If you only remember one thing: self-hosting an open model is a fixed-cost bet. If your usage is stable and high, you win. If your usage is variable or lower than expected, the idle GPU time costs more than API tokens would have.
Editor’s Note: Many teams self-host a model and discover they spend more on ops debugging, model swapping and container restarts than they saved on API fees. The API is not just token cost — it includes reliability engineering you would otherwise build yourself.
Editor’s Note: Self-hosting also means you carry the full latency and availability risk. A GPU box that goes offline at 3 AM does not generate a support ticket; it just breaks your product.
Quick answer {#quick-answer}
- Hosted API: pay per token. Zero upfront infrastructure. Variable cost scales with usage. Includes provider reliability, model updates and maintenance.
- Self-hosted: fixed GPU/server cost. Near-zero marginal cost per token once the hardware is paid for. You pay for ops, scaling, model management and fallback infrastructure.
The cost crossover varies hugely by workload, but a rough guide:
| Usage level | Likely cheaper option |
|---|---|
| <100K tokens/day | Hosted API, without question |
| 100K–1M tokens/day | Hosted API, unless GPU utilisation would stay above 80% |
| 1M–5M tokens/day | Borderline; depends on model size, GPU cost and ops load |
| 5M+ tokens/day | Self-hosting often wins with sustained utilisation |
These are planning ranges, not guarantees. Your exact numbers depend on model size, memory requirements, batch efficiency, GPU rental cost and local electricity rates.
The full cost model {#the-full-cost-model}
Hosted API costs {#hosted-api-costs}
- Per-token input and output fees.
- No upfront hardware cost.
- No ops cost for inference infrastructure.
- Price includes model hosting, updates and availability.
- Predictable billing based on usage.
Hidden costs: retries after rate-limit errors or timeouts (which can add billed tokens), traffic spikes that hit rate limits you cannot raise without an account change, and vendor lock-in if you build around provider-specific features.
Self-hosted costs {#self-hosted-costs}
- GPU/server rental or purchase cost.
- Power, cooling and connectivity.
- Ops labour: setup, updates, scaling, monitoring, incident response.
- Model management: downloading, converting, quantising, version changes.
- Fallback redundancy: at least two GPUs or servers for production uptime.
Hidden costs: unused GPU capacity during off-peak hours, time spent on dependency updates and container rebuilds, and the opportunity cost of ops work that could go into product features.
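The two cost lists above can be folded into a pair of small data classes. The sketch below is one possible shape, not a definitive model: the class and field names (`retry_overhead`, `gpu_count`, `ops_hours_per_month`, `ops_hourly_rate`) are illustrative assumptions, and every value passed in should come from your own bills and logs.

```python
from dataclasses import dataclass


@dataclass
class HostedCosts:
    input_tokens_per_day: float
    output_tokens_per_day: float
    input_price_per_million: float   # USD per 1M input tokens
    output_price_per_million: float  # USD per 1M output tokens
    retry_overhead: float = 0.0      # assumed fraction of extra billed tokens from retries

    def monthly(self, days: int = 30) -> float:
        """Monthly API bill in USD, scaled up by the retry overhead."""
        daily = (
            self.input_tokens_per_day / 1e6 * self.input_price_per_million
            + self.output_tokens_per_day / 1e6 * self.output_price_per_million
        )
        return daily * days * (1.0 + self.retry_overhead)


@dataclass
class SelfHostedCosts:
    gpu_hourly_rate: float            # USD per GPU-hour, billed whether busy or idle
    gpu_count: int = 1                # 2 or more if you keep a fallback server
    other_monthly: float = 0.0        # storage, networking, power, cooling
    ops_hours_per_month: float = 0.0  # setup, updates, monitoring, incidents
    ops_hourly_rate: float = 0.0

    def monthly(self, days: int = 30) -> float:
        """Monthly self-hosting cost in USD for always-on GPUs plus labour."""
        gpu = self.gpu_hourly_rate * 24 * days * self.gpu_count
        ops = self.ops_hours_per_month * self.ops_hourly_rate
        return gpu + self.other_monthly + ops
```

Keeping `gpu_count` and the ops fields explicit makes it harder to leave redundancy and labour out of the comparison by accident.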
Worked example: 10M tokens/day {#worked-example-10m-tokens-day}
Assume a Llama 3.1 70B-class model at 4-bit quantisation serving 10M input and 2M output tokens per day over a 30-day month. The comparison is GPT-4.1-class API pricing against a single self-hosted A100 80GB.
Hosted API:
- Input: 10M × 30 × $2.00/M = $600/month
- Output: 2M × 30 × $8.00/M = $480/month
- Total: $1,080/month
Self-hosted (A100 80GB cloud rental ~$2.50/hr):
- GPU: $2.50 × 24 × 30 = $1,800/month
- Plus storage, networking, ops labour: ~$300/month
- Total: $2,100/month
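At 10M input tokens/day, the hosted API is cheaper: $1,080 against $2,100 per month. That holds even though this volume sits in the 5M+ band of the quick-answer table, because an on-demand GPU is billed for all 720 hours of the month whether or not it is busy. The self-hosted figure also covers a single GPU with no fallback server. To break even, the self-hosted option needs either much higher throughput or cheaper GPU access (dedicated hardware, reserved instances, or local power).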
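The arithmetic in this example can be rerun for other volumes. The sketch below uses only the figures stated above ($2.00/M input, $8.00/M output, $2.50/hr GPU, ~$300/month overhead) and assumes output tokens stay at one fifth of input tokens as volume scales. The function names are illustrative, and the calculation does not check whether one GPU could actually serve the break-even volume.

```python
DAYS = 30


def api_monthly(input_m_per_day, output_m_per_day, in_price=2.00, out_price=8.00):
    """Monthly API cost in USD; token volumes are in millions per day."""
    return (input_m_per_day * in_price + output_m_per_day * out_price) * DAYS


def self_hosted_monthly(gpu_hourly=2.50, overhead=300.0):
    """Monthly cost in USD of one always-on GPU plus fixed overhead."""
    return gpu_hourly * 24 * DAYS + overhead


def break_even_input_m_per_day(output_ratio=0.2, in_price=2.00, out_price=8.00):
    """Input volume (millions/day) at which the API bill equals the self-hosted bill.

    Assumes output tokens are always `output_ratio` times input tokens.
    """
    cost_per_input_m = (in_price + output_ratio * out_price) * DAYS
    return self_hosted_monthly() / cost_per_input_m


api = api_monthly(10, 2)
hosted = self_hosted_monthly()
print(f"API: ${api:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
print(f"Break-even: ~{break_even_input_m_per_day():.1f}M input tokens/day")
```

Under these assumptions the crossover lands near 19M input tokens/day, roughly double the example volume. Whether a single A100 can sustain that load is a separate throughput question.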
When self-hosting wins {#when-self-hosting-wins}
Self-hosting becomes attractive when:
- You have sustained high utilisation — the GPU runs >80% loaded for most of the day.
- You can use smaller or quantised models — a llama.cpp Q4_K_M 8B model on a single RTX 4090 can serve meaningful throughput at a fraction of the API cost.
- You have specific latency, privacy or compliance requirements that rule out external APIs.
- You already have the hardware — the GPU is already sitting in your rack or dev machine.
- Your workload is batchable — offline processing jobs that do not need real-time responses.
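The utilisation point can be made concrete by dividing the GPU's hourly price by the tokens it actually serves. The sketch below is a simplification: it treats throughput as one aggregate tokens-per-second figure, ignores the difference between prompt processing and generation, and uses a hypothetical function name. The $2.50/hr and ~50 tokens/second inputs are the worked-example assumptions from this page, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly: float, tokens_per_second: float,
                            utilisation: float) -> float:
    """Effective USD per 1M tokens served by one GPU.

    utilisation is the fraction of each billed hour spent doing useful
    work, in the range (0, 1].
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_hourly / tokens_per_hour * 1_000_000


# Same GPU, same price, different share of the day spent busy.
for u in (0.2, 0.5, 0.8, 1.0):
    price = cost_per_million_tokens(2.50, 50, u)
    print(f"{u:.0%} utilised: ${price:.2f} per 1M tokens")
```

Halving utilisation doubles the effective price per token, which is why idle hours dominate the comparison.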
The hybrid approach {#the-hybrid-approach}
A common practical pattern is to use both: self-host a smaller model for high-volume, lower-stakes tasks (classification, extraction, summarisation) and route complex or safety-critical work to a hosted frontier model.
This hybrid model reduces API costs for the bulk of requests while keeping the quality ceiling for difficult cases. It adds complexity — you need a routing layer that decides when each model is sufficient — but the savings can be significant.
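A routing layer can start as a plain function that maps each request to a backend. The sketch below is hypothetical throughout: the `Request` fields, the `local_healthy` flag, and the backend names `"self_hosted"` and `"hosted_frontier"` are placeholders, and a production router would likely also need quality checks and retry handling.

```python
from dataclasses import dataclass

# Task types named above as candidates for a smaller self-hosted model.
LOW_STAKES_TASKS = {"classification", "extraction", "summarisation"}


@dataclass
class Request:
    task: str
    safety_critical: bool = False


def route(request: Request, local_healthy: bool = True) -> str:
    """Return the name of the backend that should serve this request."""
    if request.safety_critical:
        return "hosted_frontier"
    if request.task in LOW_STAKES_TASKS and local_healthy:
        return "self_hosted"
    # Complex tasks, unknown tasks and local outages all go to the hosted API.
    return "hosted_frontier"


print(route(Request("extraction")))
print(route(Request("extraction", safety_critical=True)))
```

Defaulting unknown tasks to the hosted model keeps the quality ceiling in place at the price of a higher API bill; the opposite default would trade quality risk for cost.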
Decision checklist {#decision-checklist}
Before choosing, ask:
- What is your peak daily token throughput? If you cannot estimate it to within 2×, start with an API.
- Can you commit to 80%+ GPU utilisation for 12+ hours/day? If yes, self-hosting may work.
- Do you have ops capacity to manage inference infrastructure? If one person wears all the hats, the API is usually cheaper once their time is counted.
- Do you need sub-200ms p50 latency? Self-hosting can be faster than an API for certain models.
- Are you processing sensitive data that cannot leave your infrastructure? Self-hosting may be mandatory.
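The checklist can be turned into a rough first-pass screen. The sketch below encodes four of the questions as booleans (the latency question is left out because it depends on the specific model). The field names, the order of the rules and the returned strings are illustrative assumptions, and the result is a starting point for discussion, not a verdict.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    demand_known_within_2x: bool
    sustained_high_utilisation: bool  # 80%+ GPU utilisation for 12+ hours/day
    has_ops_capacity: bool
    data_must_stay_internal: bool = False


def recommend(w: Workload) -> str:
    """Apply the checklist in order and return a coarse recommendation."""
    if w.data_must_stay_internal:
        return "self-hosting may be mandatory"
    if not w.demand_known_within_2x:
        return "start with a hosted API"
    if not w.has_ops_capacity:
        return "start with a hosted API"
    if w.sustained_high_utilisation:
        return "self-hosting may work"
    return "start with a hosted API"


print(recommend(Workload(True, True, True)))
print(recommend(Workload(False, True, True)))
```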
What this page cannot tell you {#what-this-page-cannot-tell-you}
This page cannot tell you your exact break-even point. GPU prices fluctuate, provider pricing changes, and your workload’s token distribution affects the calculation. The only way to know for sure is to measure your actual usage and compare against a self-hosted test deployment.
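Measuring actual usage can begin with per-request token counts from your own logs. The sketch below assumes a hypothetical record format of `(iso_timestamp, input_tokens, output_tokens)` tuples; adapt it to whatever your gateway or provider dashboard exports.

```python
from collections import defaultdict
from datetime import datetime


def daily_token_totals(records):
    """Sum input and output tokens per calendar day.

    records: iterable of (iso_timestamp, input_tokens, output_tokens).
    Returns {date: (input_total, output_total)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for ts, n_in, n_out in records:
        day = datetime.fromisoformat(ts).date()
        totals[day][0] += n_in
        totals[day][1] += n_out
    return {day: tuple(v) for day, v in totals.items()}


def peak_to_mean(totals):
    """Ratio of the busiest day's total tokens to the average day's total."""
    per_day = [n_in + n_out for n_in, n_out in totals.values()]
    return max(per_day) / (sum(per_day) / len(per_day))


sample = [
    ("2026-05-01T09:00:00", 100, 20),
    ("2026-05-01T10:00:00", 200, 30),
    ("2026-05-02T09:00:00", 50, 0),
]
totals = daily_token_totals(sample)
print(totals)
print(peak_to_mean(totals))
```

A peak-to-mean ratio well above 1 points to variable demand, the case where idle GPU time is most likely to outweigh API fees.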
Methodology and sources {#methodology-and-sources}
Check date: 2026-05-25
What was checked: Cloud GPU pricing (AWS, Lambda Labs, RunPod), provider API pricing (OpenAI, Anthropic), open-model inference requirements from model cards.
Worked-example assumptions:
- GPU: single A100 80GB at ~$2.50/hr (on-demand cloud). Reserved/preemptible pricing can be 50–70% lower.
- Model: Llama 3.1 70B at 4-bit quantisation. Throughput estimate of ~50 tokens/second for this class.
- API: GPT-4.1 pricing. Lower-cost provider APIs would shift the comparison.
Assumptions and limits:
- Self-hosted inference throughput varies by runtime (vLLM, TGI, llama.cpp) and batch size.
- Ops labour is valued at market rates; self-hosted teams often discount their own time.
- Regional GPU availability and pricing differ significantly.
Source list {#source-list}
- OpenAI API pricing — https://openai.com/api/pricing/
- Anthropic pricing — https://www.anthropic.com/pricing
- Lambda Labs GPU cloud pricing — https://lambdalabs.com/service/gpu-cloud/pricing
- RunPod GPU pricing — https://www.runpod.io/pricing
- Ollama / vLLM documentation — available at respective project repos
Related guides {#related-guides}
- Open weights vs hosted APIs: practical trade-offs
- Local LLM runtimes: Ollama, llama.cpp, vLLM and TGI in plain English
- GPU rental for LLM inference: what an operator needs to know
- API model pricing: input, output, cache and batch costs
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.