Hosted API vs self-hosted open model: the real cost comparison

The hosted-API-vs-self-hosted decision is not simply “pay per token vs pay per GPU hour.” The real calculation includes utilisation, ops labour, scaling latency, model update frequency, fallback requirements, and the cost of being wrong about demand.

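The short answer: hosted APIs are cheaper for low-to-moderate usage and variable workloads. Self-hosting can start to make sense above roughly 1–5 million daily tokens when a small or quantised model on cheap hardware is good enough, but only if you can sustain high GPU utilisation and have the ops capacity to manage inference infrastructure. For 70B-class models on on-demand cloud GPUs the crossover sits much higher: in the worked example on this page, the API is still cheaper at 10M input tokens per day.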

If you only remember one thing: self-hosting an open model is a fixed-cost bet. If your usage is stable and high, you win. If your usage is variable or lower than expected, the idle GPU time costs more than API tokens would have.

TL;DR

Hosted API: pay per token. Zero upfront infrastructure. Variable cost scales with usage. Includes provider reliability, model updates and maintenance.
Self-hosted: fixed GPU/server cost. Near-zero per-token marginal cost once the hardware is paid for (power aside). You pay for ops, scaling, model management and fallback infrastructure.

The cost crossover varies hugely by workload. Treat any crossover figure as a planning range, not a guarantee: your exact numbers depend on model size, memory requirements, batch efficiency, GPU rental cost and local electricity rates.

The full cost model

Hosted API costs

Per-token input and output fees.
No upfront hardware cost.
No ops cost for inference infrastructure.
Price includes model hosting, updates and availability.
Billing that tracks usage directly: predictable per token, variable per month.

Hidden costs: retries after rate-limit errors or timeouts (which can bill the same tokens again when a request was already partly processed), traffic spikes that hit rate limits you cannot raise without an account change, and vendor lock-in if you build around provider-specific features.
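To size the retry overhead, here is a minimal sketch. It assumes every failed attempt is billed in full and that failures are independent, which is a pessimistic simplification: many providers do not bill requests they reject outright. The function names and the 5% failure rate are illustrative, not taken from any provider.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Expected attempts per request when each attempt fails independently
    with probability failure_rate and we stop at the first success or after
    max_attempts tries."""
    if not 0 <= failure_rate < 1:
        raise ValueError("failure_rate must be in [0, 1)")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # Attempt k+1 only happens if the first k attempts all failed.
    return sum(failure_rate ** k for k in range(max_attempts))


def bill_with_retries(base_monthly_usd: float, failure_rate: float,
                      max_attempts: int) -> float:
    """Monthly bill if every attempt (failed or not) is billed in full."""
    return base_monthly_usd * expected_attempts(failure_rate, max_attempts)


if __name__ == "__main__":
    # Hypothetical: $1,000/month base bill, 5% of attempts fail, up to 3 tries.
    print(f"${bill_with_retries(1000.0, 0.05, 3):,.2f}")
```

Under these assumptions a 5% failure rate with up to three attempts adds about 5% to the bill, so retries matter mainly when failure rates climb during traffic spikes.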

Self-hosted costs

GPU/server rental or purchase cost.
Power, cooling and connectivity.
Ops labour: setup, updates, scaling, monitoring, incident response.
Model management: downloading, converting, quantising, version changes.
Fallback redundancy: at least two GPUs or servers for production uptime.

Hidden costs: unused GPU capacity during off-peak hours, time spent on dependency updates and container rebuilds, and the opportunity cost of ops work that could go into product features.
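The idle-capacity cost can be made concrete with a small sketch. It uses the on-demand A100 rate (~$2.50/hr) and the ~50 tokens/second throughput estimate from the Methodology section, and treats that throughput as the aggregate a single GPU sustains, which is an assumption: batching in a real runtime can push it higher.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float,
                            utilisation: float) -> float:
    """Effective USD per 1M generated tokens at an average utilisation in (0, 1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Tokens actually produced per paid GPU hour.
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR * utilisation
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    for u in (1.0, 0.8, 0.5, 0.2):
        cost = cost_per_million_tokens(2.50, 50, u)
        print(f"{u:.0%} utilised: ${cost:.2f} per 1M tokens")
```

The GPU bill is the same whether or not tokens flow, so halving utilisation doubles the effective price per token.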

Worked example: 10M input tokens/day

Assume 10M input + 2M output tokens per day over a 30-day month. The hosted side uses GPT-4.1-class API pricing; the self-hosted side runs a Llama 3.1 70B-class model on a single A100 80GB.

Hosted API:

Input: 10M × 30 × $2.00/M = $600/month
Output: 2M × 30 × $8.00/M = $480/month
Total: $1,080/month

Self-hosted (A100 80GB cloud rental ~$2.50/hr):

GPU: $2.50 × 24 × 30 = $1,800/month
Plus storage, networking, ops labour: ~$300/month
Total: $2,100/month

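At 10M input tokens/day, the hosted API is cheaper: $1,080 vs $2,100 per month. The self-hosted figure also prices a single GPU with no fallback; a second A100 for redundancy would add another $1,800/month. Self-hosting needs either much higher throughput or cheaper GPU access (dedicated hardware, reserved instances, or local power) to break even.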
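The arithmetic above can be reproduced, and extended to a break-even estimate, with a short sketch. The prices and volumes are the ones from the worked example; the function names and the fixed 5:1 input-to-output ratio used for the break-even are assumptions for illustration.

```python
DAYS_PER_MONTH = 30


def hosted_monthly(input_m_per_day: float, output_m_per_day: float,
                   in_price: float = 2.00, out_price: float = 8.00) -> float:
    """Monthly API bill; volumes in millions of tokens/day, prices in USD per 1M."""
    daily = input_m_per_day * in_price + output_m_per_day * out_price
    return DAYS_PER_MONTH * daily


def self_hosted_monthly(gpu_hourly: float = 2.50, overhead: float = 300.0,
                        gpus: int = 1) -> float:
    """Monthly self-hosted cost: GPU rental around the clock plus fixed overhead."""
    return gpus * gpu_hourly * 24 * DAYS_PER_MONTH + overhead


def break_even_input_m_per_day(output_ratio: float = 0.2,
                               in_price: float = 2.00,
                               out_price: float = 8.00) -> float:
    """Daily input volume (millions of tokens) where the API bill equals the
    single-GPU self-hosted cost, holding output = output_ratio * input."""
    cost_per_input_m = in_price + output_ratio * out_price
    return self_hosted_monthly() / (DAYS_PER_MONTH * cost_per_input_m)


if __name__ == "__main__":
    print(f"Hosted:      ${hosted_monthly(10, 2):,.0f}/month")
    print(f"Self-hosted: ${self_hosted_monthly():,.0f}/month")
    print(f"Break-even:  {break_even_input_m_per_day():.1f}M input tokens/day")
```

Under these assumptions the break-even lands at about 19.4M input tokens per day, roughly double the example workload. The sketch says nothing about whether a single GPU can actually serve that volume, which has to be checked separately against measured throughput.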

When self-hosting wins

Self-hosting becomes attractive when:

You have sustained high utilisation — the GPU runs >80% loaded for most of the day.
You can use smaller or quantised models: an 8B model quantised to Q4_K_M and served with llama.cpp on a single RTX 4090 can deliver useful throughput at a fraction of the API cost.
You have specific latency, privacy or compliance requirements that rule out external APIs.
You already have the hardware — the GPU is already sitting in your rack or dev machine.
Your workload is batchable — offline processing jobs that do not need real-time responses.
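To check the utilisation condition against your own numbers, a rough sketch follows. It counts only generated (output) tokens and assumes a flat per-GPU throughput, both simplifications; the ~50 tokens/second figure is the Methodology estimate, and the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def average_utilisation(output_tokens_per_day: float,
                        gpu_tokens_per_second: float) -> float:
    """Average fraction of one GPU's generation capacity the workload uses."""
    required_tps = output_tokens_per_day / SECONDS_PER_DAY
    return required_tps / gpu_tokens_per_second


def gpus_needed(output_tokens_per_day: float, gpu_tokens_per_second: float,
                target_utilisation: float = 0.8) -> int:
    """GPUs required to keep average load at or below the target utilisation."""
    load = average_utilisation(output_tokens_per_day, gpu_tokens_per_second)
    return max(1, math.ceil(load / target_utilisation))


if __name__ == "__main__":
    util = average_utilisation(2_000_000, 50)
    print(f"Average utilisation: {util:.0%}")
    print(f"GPUs needed at 80% target: {gpus_needed(2_000_000, 50)}")
```

With the worked example's 2M output tokens per day, this gives an average utilisation of about 46%, well short of the 80% mark. Averages also hide peaks, so real traffic with daytime spikes needs more headroom than this estimate suggests.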

The hybrid approach

A common practical pattern is to use both: self-host a smaller model for high-volume, lower-stakes tasks (classification, extraction, summarisation) and route complex or safety-critical work to a hosted frontier model.

This hybrid setup reduces API costs for the bulk of requests while keeping the quality ceiling for difficult cases. It adds complexity, because you need a routing layer that decides when each model is sufficient, but the savings can be significant.
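A routing layer can start as a few explicit rules. The sketch below is one possible shape, not a recommended design: the task names come from the examples above, while the Request fields, the backend labels and the 4,000-token prompt cap are hypothetical.

```python
from dataclasses import dataclass

# High-volume, lower-stakes task types that the smaller model handles.
LOW_STAKES_TASKS = {"classification", "extraction", "summarisation"}


@dataclass(frozen=True)
class Request:
    task: str
    safety_critical: bool = False
    prompt_tokens: int = 0


def choose_backend(req: Request, max_local_prompt_tokens: int = 4000) -> str:
    """Return "self_hosted" or "hosted" for a request."""
    # Safety-critical work always goes to the hosted frontier model.
    if req.safety_critical:
        return "hosted"
    # Known low-stakes tasks that fit the local model's prompt budget stay local.
    if req.task in LOW_STAKES_TASKS and req.prompt_tokens <= max_local_prompt_tokens:
        return "self_hosted"
    # Anything unrecognised defaults to the stronger model.
    return "hosted"


if __name__ == "__main__":
    print(choose_backend(Request("classification", prompt_tokens=500)))
    print(choose_backend(Request("contract_review", prompt_tokens=500)))
```

Defaulting unknown tasks to the hosted model trades some cost for safety: a misrouted easy request costs a few cents, while a misrouted hard one costs quality.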

Decision checklist

Before choosing, ask:

What is your peak daily token throughput? If you cannot estimate within 2×, start with an API.
Can you commit to 80%+ GPU utilisation for 12+ hours/day? If yes, self-hosting may work.
Do you have ops capacity to manage inference infrastructure? If one person wears all the hats, the API is usually cheaper once their time is counted.
Do you need sub-200ms p50 latency? Self-hosted can be faster than APIs for certain models.
Are you processing sensitive data that cannot leave your infrastructure? Self-hosting may be mandatory.
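The checklist can be encoded as a first-pass triage function. This is a sketch only: the field names are hypothetical, and the order in which the questions are applied is an assumption, not something the checklist prescribes.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    demand_estimate_within_2x: bool   # can you estimate peak daily tokens within 2x?
    sustained_high_utilisation: bool  # 80%+ GPU utilisation for 12+ hours/day
    has_ops_capacity: bool            # someone can own inference infrastructure
    needs_low_latency: bool           # sub-200ms p50 requirement
    data_must_stay_in_house: bool     # sensitive data cannot leave your infrastructure


def recommend(w: Workload) -> str:
    """First-pass recommendation; a starting point for discussion, not a verdict."""
    if w.data_must_stay_in_house:
        return "self-host (may be mandatory)"
    if not w.demand_estimate_within_2x or not w.has_ops_capacity:
        return "hosted API"
    if w.sustained_high_utilisation:
        return "consider self-hosting"
    if w.needs_low_latency:
        return "benchmark both"
    return "hosted API"


if __name__ == "__main__":
    print(recommend(Workload(True, True, True, False, False)))
```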

What this page cannot tell you

This page cannot tell you your exact break-even point. GPU prices fluctuate, provider pricing changes, and your workload’s token distribution affects the calculation. The only way to know for sure is to measure your actual usage and compare against a self-hosted test deployment.
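Measuring actual usage can begin with request logs. The sketch below assumes each log record is a tuple of (ISO 8601 timestamp, input tokens, output tokens); that record shape and the function names are hypothetical, so adapt them to whatever your gateway or SDK records.

```python
from collections import defaultdict
from datetime import date, datetime
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (ISO timestamp, input tokens, output tokens)


def daily_token_totals(records: Iterable[Record]) -> Dict[date, Tuple[int, int]]:
    """Sum input and output tokens per calendar day."""
    totals = defaultdict(lambda: [0, 0])
    for timestamp, input_tokens, output_tokens in records:
        day = datetime.fromisoformat(timestamp).date()
        totals[day][0] += input_tokens
        totals[day][1] += output_tokens
    return {day: (v[0], v[1]) for day, v in sorted(totals.items())}


def peak_to_mean(totals: Dict[date, Tuple[int, int]]) -> float:
    """Ratio of the busiest day's total tokens to the average day's."""
    daily = [inp + out for inp, out in totals.values()]
    return max(daily) / (sum(daily) / len(daily))


if __name__ == "__main__":
    logs = [
        ("2026-05-01T09:00:00", 100, 20),
        ("2026-05-01T15:30:00", 300, 80),
        ("2026-05-02T11:00:00", 200, 50),
    ]
    totals = daily_token_totals(logs)
    print(totals)
    print(f"peak/mean: {peak_to_mean(totals)}")
```

A peak-to-mean ratio well above 1 signals the variable demand that makes fixed GPU capacity expensive, since hardware has to be sized for the peak but is paid for around the clock.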

For a broader comparison that covers non-cost trade-offs — including output quality, control, customisation, privacy, and vendor lock-in — see our guide on open weights vs hosted APIs: practical trade-offs.

Methodology

Data checked: 2026-05-28
Sources consulted: Cloud GPU pricing (AWS, Lambda Labs, RunPod), provider API pricing (OpenAI, Anthropic), open-model inference requirements from model cards.
Worked-example assumptions: the GPU is a single A100 80GB at ~$2.50/hr (on-demand cloud); reserved/preemptible pricing can be 50–70% lower. The model is Llama 3.1 70B at 4-bit quantisation, with a throughput estimate of ~50 tokens/second for this class. The API side uses GPT-4.1 pricing; lower-cost provider APIs would shift the comparison.
Assumptions: Self-hosted inference throughput varies by runtime (vLLM, TGI, llama.cpp) and batch size. Ops labour is valued at market rates; self-hosted teams often discount their own time. Regional GPU availability and pricing differ significantly.
Limitations: This comparison uses on-demand cloud GPU pricing. Dedicated hardware, reserved instances, or colocated servers change the economics considerably. The worked example uses specific model and pricing snapshots that may not reflect your workload.
Jurisdiction: Global. GPU pricing is in USD. Data residency and privacy requirements that mandate self-hosting vary by jurisdiction (GDPR in EU/UK, sector-specific regulations in financial services and healthcare).

Source list

OpenAI API pricing — https://openai.com/api/pricing/ (accessed 2026-05-28)
Anthropic pricing — https://www.anthropic.com/pricing (accessed 2026-05-28)
Lambda Labs GPU cloud pricing — https://lambdalabs.com/service/gpu-cloud/pricing (accessed 2026-05-28)
RunPod GPU pricing — https://www.runpod.io/pricing (accessed 2026-05-28)
vLLM documentation — https://docs.vllm.ai/en/stable/ (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28

Change log

2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Notes in <aside> format, Trust Stack, slugified IDs on remaining headings, access dates on sources, completed truncated description, fixed writtenBy frontmatter, restructured Methodology with jurisdiction.
2026-05-27: Added direct source URLs to all named providers; added Change Log section.