GPU rental for LLM inference: what an operator needs to know

Running an open model on rented GPU hardware looks cheaper than API calls — until you factor in the GPU utilisation you actually achieve. The gap between theoretical throughput and real-world performance is where most self-hosting cost calculations fall apart.

TL;DR

GPU rental for inference makes financial sense when your sustained throughput is above a breakpoint that depends on model size, batch strategy, and hardware choice. For a 7B parameter model, the breakpoint is roughly 10–50 million tokens per day versus a paid API. For a 70B model it is higher, because GPU cost rises faster with model size than API pricing does.

The real cost driver is not the GPU rental price — it is utilisation. A GPU running at 10% capacity (idle between requests, no batching) costs more per token than an API. A GPU running at 70%+ capacity through continuous batching can undercut API pricing significantly. The difference is not marginal; it can be an order of magnitude.
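
The effect of utilisation on cost per token can be sketched with a few lines of arithmetic. This is a minimal illustration, not a benchmark: the hourly price and the peak throughput of 2,000 tokens per second are hypothetical inputs, and "utilisation" is assumed to mean the fraction of that peak throughput actually delivered over the billing period.

```python
def cost_per_million_tokens(hourly_price, peak_tokens_per_sec, utilisation):
    """Cost per one million tokens at a given average utilisation (0-1]."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    # Tokens actually served in one billed hour.
    tokens_per_hour = peak_tokens_per_sec * utilisation * 3600
    return hourly_price / tokens_per_hour * 1_000_000


# Hypothetical inputs: 2.00 per hour, 2,000 tokens/s at full batching.
low = cost_per_million_tokens(2.00, 2000, 0.10)
high = cost_per_million_tokens(2.00, 2000, 0.70)
print(f"10% utilisation: {low:.2f} per million tokens")
print(f"70% utilisation: {high:.2f} per million tokens")
print(f"ratio: {low / high:.1f}x")
```

Because the rental bill is fixed per hour, the cost ratio between two utilisation levels is simply the inverse ratio of the utilisations, whatever the hourly price is.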

What affects GPU inference cost

Hardware choice: An A100 80 GB rents for roughly £1.50–£3.50 per hour depending on provider and commitment (RunPod, Lambda, Vast, Google Cloud, AWS). An H100 is £3–£7 per hour. An RTX 4090 is £0.50–£1.00 per hour. The right GPU depends on memory requirements, not just parameter count: a 70B model at FP16 needs about 140 GB of GPU memory for the weights alone, which means at least two 80 GB GPUs (A100 or H100), plus headroom for the KV cache.

Quantisation: Running a model at 4-bit quantisation cuts weight memory by roughly 75% compared with FP16. A 70B model that needs 140 GB at FP16 fits in ~35 GB at Q4, which fits on a single 80 GB A100 or across two 24 GB RTX 4090s. The quality loss from Q4 is usually small for most inference workloads.
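
The memory figures above follow from parameter count times bytes per parameter. The sketch below reproduces them and estimates a minimum GPU count. The bytes-per-parameter values are idealised (real 4-bit formats carry some extra metadata), and the 10% allowance for KV cache and runtime buffers is an assumption for illustration, not a measured figure.

```python
import math

# Idealised storage cost per parameter, in bytes.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "q4": 0.5}


def weight_memory_gb(params_billion, precision):
    """Approximate memory for model weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_gpus(params_billion, precision, gpu_memory_gb, overhead_fraction=0.1):
    """Smallest GPU count whose combined memory holds weights plus overhead.

    overhead_fraction is a hypothetical allowance for KV cache and buffers.
    """
    needed_gb = weight_memory_gb(params_billion, precision) * (1 + overhead_fraction)
    return math.ceil(needed_gb / gpu_memory_gb)


print(weight_memory_gb(70, "fp16"))   # weights at FP16
print(weight_memory_gb(70, "q4"))     # weights at 4-bit
print(min_gpus(70, "fp16", 80))       # 80 GB cards (A100 or H100)
print(min_gpus(70, "q4", 24))         # 24 GB cards (RTX 4090)
```

Long contexts or large batches need a bigger KV cache allowance than 10%, so treat the GPU count as a lower bound.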

Batching: This is the biggest lever. A single GPU processing one request at a time achieves a fraction of its theoretical throughput. With continuous batching in vLLM or a similar engine, the same GPU can process 8–64 simultaneous requests with only a small increase in per-request latency. The effective cost per token drops roughly in proportion to batch size until the GPU becomes compute-bound, after which larger batches add latency rather than throughput.
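
Why batching helps, and why the gain eventually flattens, can be shown with a toy model. It assumes each decode step has a fixed cost (reading the weights once, shared by every sequence in the batch) and a small per-sequence compute cost that dominates only at large batch sizes. The step times and hourly price are hypothetical, chosen only to make the shape of the curve visible.

```python
def tokens_per_second(batch_size, base_step_s=0.025, per_seq_compute_s=0.0005):
    """Toy decode model: each step emits one token per sequence in the batch.

    base_step_s: fixed time per step, shared by the whole batch (hypothetical).
    per_seq_compute_s: compute time each extra sequence adds (hypothetical).
    """
    step_s = max(base_step_s, batch_size * per_seq_compute_s)
    return batch_size / step_s


def cost_per_million_at_batch(hourly_price, batch_size):
    """Cost per million tokens if the GPU runs continuously at this batch size."""
    return hourly_price / (tokens_per_second(batch_size) * 3600) * 1_000_000


for batch in (1, 8, 32, 64, 128):
    tps = tokens_per_second(batch)
    cost = cost_per_million_at_batch(2.00, batch)
    print(f"batch={batch:>3}  tokens/s={tps:>7.0f}  cost per million={cost:.2f}")
```

In this model throughput grows linearly up to a batch of 50 and is flat beyond it. Real engines show a softer knee, but the pattern is the same.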

Cache reuse: If your workload has repeated prefixes (system prompts, few-shot examples, common instruction templates), KV cache reuse lets the engine skip recomputing the shared prefix, which cuts prompt-processing (prefill) compute substantially. Some inference engines can serve many requests from the same shared cache, effectively multiplying throughput without additional GPU time.
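
The saving from a shared prefix can be estimated by counting the prompt tokens the engine has to process. This sketch assumes the prefix stays in the cache for the whole run (no eviction) and counts only prefill work, since output tokens still have to be generated one by one. The request count and token lengths are hypothetical.

```python
def prefill_tokens(requests, shared_prefix, unique_suffix, prefix_cached):
    """Prompt tokens the engine must process across all requests."""
    if prefix_cached:
        # The shared prefix is computed once, then reused by every request.
        return shared_prefix + requests * unique_suffix
    # Without reuse, every request recomputes the full prompt.
    return requests * (shared_prefix + unique_suffix)


# Hypothetical workload: 1,000 requests, 1,500-token system prompt,
# 200 tokens of request-specific input.
without_cache = prefill_tokens(1000, 1500, 200, prefix_cached=False)
with_cache = prefill_tokens(1000, 1500, 200, prefix_cached=True)
saving = 1 - with_cache / without_cache
print(without_cache, with_cache, f"{saving:.0%} less prefill work")
```

The longer the shared prefix is relative to the unique part of each prompt, the larger the saving.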

SLA requirements: Real-time inference needs idle GPU capacity to handle traffic spikes. Background batch inference can fill the GPU continuously. The same workload costs 2–3x more when run as a real-time service versus a batch job.
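
One way to see where a 2–3x gap comes from: a real-time service is sized for its peak load, so its average utilisation is roughly the inverse of the peak-to-average traffic ratio, while a batch job can run near full capacity. The sketch below assumes capacity is provisioned exactly at peak and that batch jobs reach 90% utilisation. Both figures are illustrative assumptions.

```python
def realtime_cost_multiplier(peak_to_average_ratio, batch_utilisation=0.9):
    """How much more a peak-sized real-time service costs per token than batch.

    Assumes real-time capacity is provisioned for peak traffic, so average
    utilisation is 1 / peak_to_average_ratio (a simplifying assumption).
    """
    if peak_to_average_ratio < 1:
        raise ValueError("peak_to_average_ratio must be at least 1")
    realtime_utilisation = 1.0 / peak_to_average_ratio
    return batch_utilisation / realtime_utilisation


for ratio in (2.0, 2.5, 3.0):
    print(f"peak/average={ratio}: {realtime_cost_multiplier(ratio):.2f}x")
```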

Where teams misuse GPU rental

Comparing GPU rental to API pricing at peak utilisation. For most teams the realistic scenario is around 50% utilisation, not the 90% that a cost calculator may default to. Run the numbers at your actual expected utilisation, not the theoretical maximum.
Ignoring incidental costs. GPU rental is the headline number. Storage for model weights, EBS volumes for logs, networking costs for data transfer, and engineering time for setup and maintenance add 20–40% to the total.
Renting an oversized GPU. A 7B model on an H100 is usually overkill. An A100 or even a 4090 can often deliver the throughput the workload needs at a fraction of the cost. Match GPU to model requirements, not to vendor availability.
Treating spot and on-demand GPUs as interchangeable. Spot/preemptible GPU instances are 60–80% cheaper but can be terminated with little warning. For batch inference this is manageable with checkpointing. For real-time inference, preemptible instances are not viable.
Serving without an inference engine. Running a model with raw Transformers or llama.cpp without an optimised serving layer wastes GPU capacity. vLLM, TGI, and Triton Inference Server (with an LLM backend) provide continuous batching and paged KV cache management (PagedAttention in vLLM), which can multiply throughput by 3–10x over naive serving.
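
Two of the points above, incidental costs and spot discounts, change the hourly figure that goes into any comparison. A minimal sketch follows, assuming overheads are expressed as a fraction of the on-demand rental price and do not shrink when a spot discount is applied. The prices and percentages are hypothetical values picked from within the ranges given in the text.

```python
def effective_hourly_cost(on_demand_price, overhead_fraction=0.3, spot_discount=0.0):
    """All-in hourly cost: discounted rental plus fixed overheads.

    overhead_fraction: storage, networking and engineering time, expressed as
    a share of the on-demand price (assumption for illustration).
    spot_discount: 0.0 for on-demand, e.g. 0.7 for a 70% cheaper spot instance.
    """
    rental = on_demand_price * (1 - spot_discount)
    overhead = on_demand_price * overhead_fraction
    return rental + overhead


on_demand = effective_hourly_cost(2.00, overhead_fraction=0.3)
spot = effective_hourly_cost(2.00, overhead_fraction=0.3, spot_discount=0.7)
print(f"on-demand all-in: {on_demand:.2f}/hour")
print(f"spot all-in:      {spot:.2f}/hour")
```

Once overheads are included, the saving from spot capacity is smaller than the headline discount, because the overheads do not fall with the rental price.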

Practical decision check

What is your daily token volume? Below 10M/day for a 7B model: API is cheaper. Above 50M/day: GPU rental starts to win.
Can you batch requests? Low-traffic applications with sporadic requests do not batch well. Batch workloads and high-traffic apps benefit most.
Do you need real-time latency, or is batch acceptable? Real-time needs idle capacity. Batch can fill the GPU.
Have you factored in engineering time? GPU setup, inference engine configuration, monitoring, and incident response add 10–30 hours upfront plus ongoing maintenance.
Can you tolerate spot/preemptible instances? If yes, costs drop significantly. If no, on-demand pricing applies.
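
The first check in the list can be turned into a rough break-even calculation. This is a sketch under stated assumptions: the GPU runs (and is billed) 24 hours a day, both prices are in the same currency, and the API price of 0.50 per million tokens, the 0.75 hourly GPU cost, and the 1,000 tokens per second peak throughput are hypothetical placeholders to be replaced with your own figures.

```python
def breakeven_tokens_per_day(gpu_hourly_cost, api_price_per_million, hours_per_day=24):
    """Daily volume at which the GPU's daily cost equals the API bill."""
    daily_gpu_cost = gpu_hourly_cost * hours_per_day
    return daily_gpu_cost / api_price_per_million * 1_000_000


def daily_capacity_tokens(peak_tokens_per_sec, utilisation):
    """Tokens one GPU can serve in a day at a given average utilisation."""
    return peak_tokens_per_sec * utilisation * 86_400


def cheaper_option(daily_tokens, gpu_hourly_cost, api_price_per_million,
                   peak_tokens_per_sec, utilisation):
    """Return 'api', 'gpu', or 'beyond one GPU' for a given daily volume."""
    if daily_tokens > daily_capacity_tokens(peak_tokens_per_sec, utilisation):
        return "beyond one GPU"
    if daily_tokens > breakeven_tokens_per_day(gpu_hourly_cost, api_price_per_million):
        return "gpu"
    return "api"


# Hypothetical inputs: 0.75/hour all-in GPU cost, API at 0.50 per million tokens,
# 1,000 tokens/s peak throughput at 50% average utilisation.
print(f"break-even: {breakeven_tokens_per_day(0.75, 0.50):,.0f} tokens/day")
for volume in (5_000_000, 40_000_000, 60_000_000):
    print(f"{volume:>11,} tokens/day -> {cheaper_option(volume, 0.75, 0.50, 1000, 0.5)}")
```

Use the all-in hourly cost (rental plus overheads) rather than the list price, otherwise the break-even volume will look lower than it really is.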

Methodology

Data checked: 2026-05-28
Sources consulted: Cloud GPU pricing from Lambda, RunPod, Vast, Google Cloud and AWS. vLLM, TGI and Ollama documentation for throughput benchmarks. Community benchmarks on tokens-per-second per GPU type.
Assumptions: GPU pricing fluctuates with demand and availability. Actual throughput depends on model architecture, quantisation level, batch strategy, and hardware topology. The breakpoint calculations assume stable workload patterns.
Limitations: This article covers GPU rental for inference only, not training. It focuses on single-node setups; multi-node distributed inference has different cost dynamics. GPU availability and pricing shift frequently — check current rates before budgeting.
Jurisdiction: Global. GPU pricing is in GBP unless noted otherwise. GPU availability varies by cloud region; EU and US regions typically have the widest selection.

Source list

vLLM documentation — https://docs.vllm.ai/en/stable/ (accessed 2026-05-28)
Hugging Face TGI — https://huggingface.co/docs/text-generation-inference/en/index (accessed 2026-05-28)
RunPod GPU pricing — https://www.runpod.io/pricing (accessed 2026-05-28)
Lambda GPU Cloud — https://lambdalabs.com/service/gpu-cloud/pricing (accessed 2026-05-28)
Vast AI GPU marketplace — https://vast.ai/ (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Notes, Trust Stack, slugified heading IDs, access dates on sources, fixed writtenBy frontmatter, restructured Methodology section with jurisdiction.
2026-05-27: Added direct source URLs to all named providers; added Change Log section.