theLLMs

Last checked: 2026-05-25

Scope: Global. GPU pricing and availability checked 2026-05-25. Cloud GPU markets fluctuate significantly.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

GPU rental for LLM inference: what an operator needs to know

Running an open model on rented GPU hardware looks cheaper than API calls — until you factor in the GPU utilisation you actually achieve. The gap between theoretical throughput and real-world performance is where most self-hosting cost calculations fall apart.

Quick answer

GPU rental for inference makes financial sense when you have sustained throughput above a breakpoint that depends on your model size, batch strategy, and hardware choice. For a 7B parameter model, the breakpoint is roughly 10–50 million tokens per day versus a paid API. For a 70B model, it is higher because the GPU cost scales faster than the API cost for large models.

The real cost driver is not the GPU rental price — it is utilisation. A GPU running at 10% capacity (idle between requests, no batching) costs more per token than an API. A GPU running at 70%+ capacity through continuous batching can undercut API pricing significantly. The difference is not marginal; it can be an order of magnitude.
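The utilisation effect can be made concrete with a few lines of arithmetic. The sketch below is a simplified model, not a benchmark: the function name, the £2.00 hourly price (inside the A100 range quoted in this article) and the 2,000 tokens/second peak throughput are illustrative assumptions.

```python
def cost_per_million_tokens(hourly_price: float,
                            peak_tokens_per_sec: float,
                            utilisation: float) -> float:
    """Cost per one million generated tokens at a given utilisation.

    utilisation is the fraction (0-1] of peak throughput actually achieved
    over the billed hour. Idle time is still billed, so it lowers the
    number of tokens the hourly price is spread over.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilisation * 3600
    return hourly_price / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed figures for illustration only.
    price = 2.00        # price units per GPU-hour
    peak_tps = 2000.0   # hypothetical peak tokens/second for this GPU and model
    for u in (0.10, 0.50, 0.70):
        cost = cost_per_million_tokens(price, peak_tps, u)
        print(f"utilisation {u:.0%}: {cost:.2f} per million tokens")
```

Under these assumptions the cost per token at 10% utilisation is seven times the cost at 70%, because the same hourly bill is divided by one seventh of the tokens.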

What affects GPU inference cost

Hardware choice: An A100 80 GB rents for roughly £1.50–£3.50 per hour depending on provider and commitment (RunPod, Lambda, Vast, Google Cloud, AWS). An H100 is £3–£7 per hour. An RTX 4090 is £0.50–£1.00 per hour. The right GPU depends on total memory requirements, not just parameter count. A 70B model at FP16 needs about 140 GB for the weights alone, which exceeds any single 80 GB card, so it needs at least two A100 80 GB or two H100 80 GB GPUs, plus headroom for the KV cache.

Quantisation: Running a model at 4-bit quantisation cuts weight memory by roughly 75% compared with FP16. A 70B model that needs 140 GB at FP16 fits in ~35 GB at Q4, which brings it within reach of a single A100 80 GB or a pair of 24 GB RTX 4090s. The quality loss from Q4 is minimal for most inference workloads.
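The memory figures above follow from parameter count times bytes per parameter. A minimal sketch of that arithmetic is shown here; the function names and the optional `overhead_fraction` parameter (headroom for KV cache and activations) are assumptions made for illustration, and real engines will need a workload-specific amount of headroom.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight memory in decimal GB: params * (bits / 8) bytes."""
    return params_billion * bits_per_param / 8


def min_gpus(params_billion: float, bits_per_param: float,
             gpu_memory_gb: float, overhead_fraction: float = 0.0) -> int:
    """Smallest GPU count whose combined memory holds the weights.

    overhead_fraction is a hypothetical knob for KV cache and activation
    headroom; 0.0 means weights only, which is a lower bound.
    """
    needed = weight_memory_gb(params_billion, bits_per_param) * (1 + overhead_fraction)
    return math.ceil(needed / gpu_memory_gb)


if __name__ == "__main__":
    print(weight_memory_gb(70, 16))   # 140.0 (FP16)
    print(weight_memory_gb(70, 4))    # 35.0  (4-bit)
    print(min_gpus(70, 16, 80))       # 2 x 80 GB cards, weights only
    print(min_gpus(70, 4, 80))        # 1 x 80 GB card
    print(min_gpus(70, 4, 24))        # 2 x 24 GB cards
```

Treat the weights-only result as a floor: with a non-zero `overhead_fraction` the same 70B FP16 model can need a third 80 GB card.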

Batching: This is the biggest lever. A single GPU processing one request at a time achieves a fraction of its theoretical throughput. Using vLLM or similar continuous batching, the same GPU can process 8–64 simultaneous requests with minimal per-request latency increase. The effective cost per token falls roughly in proportion to batch size until the GPU saturates on compute or memory bandwidth.
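A rough way to see the batching effect is a linear-then-flat throughput model: aggregate throughput grows with the number of concurrent requests until it reaches a saturation ceiling. This is a deliberate simplification, and the 40 tokens/second single-stream rate, the 1,600 tokens/second ceiling and the £2.00 hourly price are assumed values, not measurements.

```python
def aggregate_tokens_per_sec(batch_size: int,
                             single_stream_tps: float,
                             saturation_tps: float) -> float:
    """Simplified model: throughput scales with batch size, then flattens."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return min(batch_size * single_stream_tps, saturation_tps)


def batched_cost_per_million(hourly_price: float, batch_size: int,
                             single_stream_tps: float,
                             saturation_tps: float) -> float:
    """Cost per million tokens when the GPU runs at this batch size all hour."""
    tps = aggregate_tokens_per_sec(batch_size, single_stream_tps, saturation_tps)
    return hourly_price / (tps * 3600) * 1_000_000


if __name__ == "__main__":
    # Assumed figures for illustration only.
    for batch in (1, 8, 64):
        cost = batched_cost_per_million(2.00, batch, 40.0, 1600.0)
        print(f"batch {batch:>2}: {cost:.2f} per million tokens")
```

With these inputs, going from batch size 1 to 8 cuts the cost eightfold, while going from 8 to 64 cuts it only fivefold because the assumed ceiling is reached at 40 concurrent requests.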

Cache reuse: If your workload has repeated prefixes (system prompts, few-shot examples, common instruction templates), KV cache reuse reduces per-token compute substantially. Some inference engines can serve many requests from the same shared cache, effectively multiplying throughput without additional GPU time.
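The saving from a shared prefix can be estimated by counting prefill tokens with and without reuse. The sketch below assumes an idealised cache that computes the shared prefix once and never evicts it; it counts prompt (prefill) tokens only, since generating output tokens is not reduced by prefix reuse. The function name and the example sizes are hypothetical.

```python
def prefill_tokens(num_requests: int, prefix_tokens: int,
                   unique_tokens: int, prefix_cached: bool) -> int:
    """Total prompt tokens the GPU must process across all requests.

    Idealised assumption: with caching, the shared prefix is computed once
    and reused by every later request.
    """
    if prefix_cached:
        return prefix_tokens + num_requests * unique_tokens
    return num_requests * (prefix_tokens + unique_tokens)


if __name__ == "__main__":
    # Hypothetical workload: 1,000 requests sharing a 1,500-token system
    # prompt, each with 100 tokens of request-specific input.
    without = prefill_tokens(1000, 1500, 100, prefix_cached=False)
    with_cache = prefill_tokens(1000, 1500, 100, prefix_cached=True)
    print(without, with_cache)
    print(f"prefill work saved: {1 - with_cache / without:.1%}")
```

The longer the shared prefix is relative to the request-specific input, the larger the saving; workloads with no common prefix gain nothing.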

SLA requirements: Real-time inference needs idle GPU capacity to handle traffic spikes. Background batch inference can fill the GPU continuously. The same workload costs 2–3x more when run as a real-time service versus a batch job.
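The real-time premium comes from sizing for peak rather than average load. A minimal sketch of that sizing follows; the traffic figures and the 1,000 tokens/second per-GPU throughput are assumed for illustration, and the model ignores queueing and latency targets, which would raise the real-time figure further.

```python
import math


def gpus_needed(required_tps: float, per_gpu_tps: float) -> int:
    """Whole GPUs required to sustain a given token rate."""
    return math.ceil(required_tps / per_gpu_tps)


def realtime_vs_batch(avg_tps: float, peak_tps: float,
                      per_gpu_tps: float) -> tuple[int, int, float]:
    """Return (real-time GPUs, batch GPUs, cost multiplier).

    Real-time capacity is sized for the peak rate; batch capacity is sized
    for the average rate because work can be queued and run continuously.
    """
    realtime = gpus_needed(peak_tps, per_gpu_tps)
    batch = gpus_needed(avg_tps, per_gpu_tps)
    return realtime, batch, realtime / batch


if __name__ == "__main__":
    # Hypothetical workload: average 2,000 tok/s, peaks of 5,000 tok/s.
    print(realtime_vs_batch(2000.0, 5000.0, 1000.0))  # (5, 2, 2.5)
```

In this example the real-time deployment needs five GPUs against two for the batch job, a 2.5x multiplier for the same total token volume.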

Where teams misuse GPU rental

  1. Comparing GPU rental to API pricing at peak utilisation. For most teams, 50% utilisation is the realistic scenario, not the 90% that cost calculators tend to default to. Run the numbers at your actual expected utilisation, not the theoretical maximum.

  2. Ignoring incidental costs. GPU rental is the headline number. Storage for model weights, block storage for logs (EBS volumes on AWS), networking costs for data transfer, and engineering time for setup and maintenance add 20–40% to the total.

  3. Renting an oversized GPU. A 7B model on an H100 is overkill. Comparable throughput can often be achieved on an A100 or even a 4090 at a fraction of the cost. Match GPU to model requirements, not to vendor availability.

  4. Treating availability tiers as interchangeable. Spot/preemptible GPU instances are 60–80% cheaper than on-demand but can be terminated with little warning. For batch inference this is manageable with checkpointing. For real-time inference, preemptible instances are not viable.

  5. Serving without an inference engine. Running a model with raw Transformers or llama.cpp without an optimised serving layer wastes GPU capacity. vLLM, TGI, and Triton Inference Server (with an LLM backend) provide continuous batching and paged KV cache management, known as PagedAttention in vLLM, which can multiply throughput by 3–10x over naive serving.
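Two of the adjustments in the list above, the 20–40% incidental overhead and the 60–80% spot discount, are simple enough to sketch as arithmetic. The function names, the 730 hours-per-month constant and the £2.00 hourly price are assumptions for illustration; the overhead and discount ranges are the ones stated in this article.

```python
HOURS_PER_MONTH = 730.0  # assumption: average hours in a calendar month


def monthly_total(hourly_price: float, overhead_fraction: float,
                  hours: float = HOURS_PER_MONTH) -> float:
    """Monthly rental plus incidental costs expressed as a fraction of rental."""
    if overhead_fraction < 0:
        raise ValueError("overhead_fraction must be non-negative")
    return hourly_price * hours * (1.0 + overhead_fraction)


def spot_hourly(on_demand_hourly: float, discount: float) -> float:
    """Hourly spot price for a given discount (0.6 means 60% cheaper)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * (1.0 - discount)


if __name__ == "__main__":
    headline = 2.00 * HOURS_PER_MONTH  # rental only
    low = monthly_total(2.00, 0.20)    # with 20% incidental costs
    high = monthly_total(2.00, 0.40)   # with 40% incidental costs
    print(f"headline {headline:.0f}, realistic {low:.0f} to {high:.0f}")
    print(f"spot at 70% discount: {spot_hourly(2.00, 0.70):.2f} per hour")
```

The spot figure does not account for work lost to interruptions, so the effective saving for a batch job depends on how often instances are reclaimed and how frequently the job checkpoints.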

Practical decision check

  • What is your daily token volume? Below 10M/day for a 7B model: API is cheaper. Above 50M/day: GPU rental starts to win.
  • Can you batch requests? Low-traffic applications with sporadic requests do not batch well. Batch workloads and high-traffic apps benefit most.
  • Do you need real-time latency, or is batch acceptable? Real-time needs idle capacity. Batch can fill the GPU.
  • Have you factored in engineering time? GPU setup, inference engine configuration, monitoring, and incident response add 10–30 hours upfront plus ongoing maintenance.
  • Can you tolerate spot/preemptible instances? If yes, costs drop significantly. If no, on-demand pricing applies.
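The first question in the checklist can be turned into a break-even calculation: the daily token volume at which an always-on GPU costs the same as paying an API per token. The sketch below is a starting point under stated assumptions, not a pricing tool. The API price of £0.50 per million tokens is a hypothetical placeholder, and the function names are invented for this example; substitute your own quotes.

```python
def breakeven_tokens_per_day(gpu_hourly_price: float,
                             api_price_per_million: float,
                             overhead_fraction: float = 0.0) -> float:
    """Daily token volume where a 24-hour GPU rental matches API spend."""
    daily_gpu_cost = gpu_hourly_price * 24 * (1 + overhead_fraction)
    return daily_gpu_cost / api_price_per_million * 1_000_000


def required_sustained_tps(tokens_per_day: float) -> float:
    """Average tokens/second the GPU must sustain to serve that volume."""
    return tokens_per_day / 86_400


if __name__ == "__main__":
    # Assumed figures: an RTX 4090 at 0.75 per hour (within the range quoted
    # in this article) and a hypothetical API price of 0.50 per million tokens.
    volume = breakeven_tokens_per_day(0.75, 0.50)
    print(f"break-even: {volume / 1e6:.0f}M tokens/day")
    print(f"needs about {required_sustained_tps(volume):.0f} tokens/second sustained")
```

The second function is a sanity check: a break-even volume only matters if the chosen GPU can actually sustain the implied token rate for your model and batch strategy.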

Methodology and sources

Check date: 2026-05-25

What was checked: Cloud GPU pricing from Lambda, RunPod, Vast, Google Cloud and AWS. vLLM, TGI and Ollama documentation for throughput benchmarks. Community benchmarks on tokens-per-second per GPU type.

Assumptions and limits: GPU pricing fluctuates with demand and availability. Actual throughput depends on model architecture, quantisation level, batch strategy, and hardware topology. The breakpoint calculations assume stable workload patterns.

Source list

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.