theLLMs

Last checked: 2026-05-25

Scope: Global. Hardware pricing and availability data checked on 2026-05-25; chip market conditions change rapidly.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Hardware supply and inference economics: why chips shape AI products

AI products do not run on code alone. Every request hits a physical chip in a datacentre somewhere, and the economics of those chips (supply, pricing, allocation) directly shape what products can charge, how much latency users experience, and which features are viable.

Quick answer

GPU supply constraints and inference economics are the hidden bottleneck in AI product design. A single H100 GPU costs roughly $25,000–$40,000 and serves only a handful of concurrent users for large-model inference (roughly 1–10 for a 70B-class model, depending on precision, context length, and batch size). At cloud rental rates of $2–4 per GPU-hour, the per-query inference cost for a large frontier model can be $0.01–$0.10 before any provider margin. This chip-level cost floor determines minimum viable pricing, which features are affordable (is a summarisation button worth its per-click GPU cost?), and which optimisation techniques (quantisation, caching, smaller models) are worth the engineering investment.
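The arithmetic behind that per-query range can be sketched as follows. This is a minimal illustration, not a pricing model: the function name `cost_per_query` and the `gpu_seconds` values are assumptions chosen for the example, and only the $2–4 per GPU-hour rental rate comes from the text above.

```python
def cost_per_query(gpu_hourly_rate: float, gpu_seconds: float) -> float:
    """Raw GPU cost of one query, before overhead or provider margin.

    gpu_seconds is the GPU time attributable to the query. It is an assumed
    input here; in practice it depends on batching, context length, and model size.
    """
    return gpu_hourly_rate * gpu_seconds / 3600.0


# Hypothetical workloads priced at the low and high rental rates quoted above.
low = cost_per_query(gpu_hourly_rate=2.0, gpu_seconds=18.0)
high = cost_per_query(gpu_hourly_rate=4.0, gpu_seconds=90.0)
print(f"${low:.3f} to ${high:.3f} per query")
```

Under these assumed inputs the result spans $0.01 to $0.10 per query, which suggests the quoted range corresponds to tens of GPU-seconds attributed to each request.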

Why hardware supply matters for product decisions

GPU supply has been constrained since 2023. The reasons are structural: fab capacity for advanced nodes is limited, high-bandwidth memory (HBM) is a bottleneck, and packaging capacity for multi-die chips is constrained. These constraints mean:

  • Inference capacity is not elastic. You cannot instantly scale inference the way you can scale stateless web servers. Each concurrent user needs GPU memory and compute.

  • Hardware allocation affects feature decisions. Adding a “compare these documents” feature that requires 10,000-token context windows can double per-query GPU time or worse, depending on how long your typical context was before. That is a cost decision, not just a code decision.

  • Small-model optimisation has real margin impact. A quantised 7B model costs roughly 10–20x less per query than a 70B model. The chip costs the same per hour either way; the saving comes from higher throughput per GPU.

The economics at each layer

Training. Building a frontier-model training cluster requires 10,000–100,000 GPUs. At $25K per GPU, a training cluster is a $250M–$2.5B capital investment before datacentre, power, and networking costs. This explains why frontier training is limited to a handful of organisations.
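The capital range quoted above is a straightforward multiplication, sketched here for reference. The function name `cluster_gpu_capex` is hypothetical; the GPU counts and the $25K unit price are the figures from the paragraph above.

```python
def cluster_gpu_capex(gpu_count: int, unit_price_usd: int) -> int:
    """GPU purchase cost only; excludes datacentre, power, and networking."""
    return gpu_count * unit_price_usd


# Lower and upper cluster sizes from the text, at $25K per GPU.
for gpus in (10_000, 100_000):
    capex = cluster_gpu_capex(gpus, 25_000)
    print(f"{gpus:>7,} GPUs -> ${capex / 1e9:.2f}B")
```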

Inference serving. A single GPU can serve 1–10 concurrent users for a 70B-class model, depending on precision, context length, and batch size. At cloud GPU rental of $3/hour, serving a user for one hour of conversation costs $0.30–$3.00 in GPU time alone. This is why free-tier AI products have tight usage limits.
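The $0.30–$3.00 range follows from dividing the hourly rental rate by the number of users sharing the GPU. A minimal sketch, assuming the GPU is fully occupied for the whole hour (the function name is hypothetical):

```python
def gpu_cost_per_user_hour(gpu_hourly_rate: float, concurrent_users: int) -> float:
    """GPU cost of serving one user for one hour on a shared GPU."""
    if concurrent_users < 1:
        raise ValueError("concurrent_users must be at least 1")
    return gpu_hourly_rate / concurrent_users


# $3/hour rental, with 1 and 10 concurrent users as in the text.
worst = gpu_cost_per_user_hour(3.0, 1)
best = gpu_cost_per_user_hour(3.0, 10)
print(f"${best:.2f} to ${worst:.2f} per user-hour")
```

Idle time makes the real figure worse: a GPU reserved for peak load but half empty on average roughly doubles the effective cost per active user.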

Hardware generations matter. The H100 was roughly 3x faster than the A100 for inference. The B200 (Blackwell) aims for another 2–3x improvement. Each hardware generation shifts the cost curve, making previously uneconomical features viable and reshaping the competitive landscape.
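How much a new generation shifts the cost curve depends on its price as well as its speed. The sketch below compares cost per query across generations under a simple assumption: cost scales with hourly price and inversely with throughput. The function name and the 2x price ratio are hypothetical; only the rough 3x speedup comes from the text.

```python
def relative_cost_per_query(speedup: float, hourly_price_ratio: float) -> float:
    """New-generation cost per query as a fraction of the old generation's.

    speedup: new throughput / old throughput on the same workload.
    hourly_price_ratio: new hourly price / old hourly price.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return hourly_price_ratio / speedup


# 3x speedup is the rough figure from the text; the 2x price ratio is hypothetical.
ratio = relative_cost_per_query(speedup=3.0, hourly_price_ratio=2.0)
print(f"new generation costs {ratio:.0%} of the old per query")
```

If the newer chip rents for as much more as it is faster, the per-query cost does not move at all, so the speedup alone says little about product economics.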

Specialised inference chips. Startups like Groq, Cerebras, and d-Matrix build chips specifically for inference. Their cost-per-query can be lower than GPU alternatives, but they face the same supply constraints for advanced fab capacity.

What teams get wrong

  1. Ignoring GPU availability when planning features. A feature that looks cheap in prototype with no users can become uneconomical at scale when GPU costs dominate the cost structure.

  2. Assuming inference costs will follow Moore’s Law. GPU performance improves, but demand for larger models and longer contexts grows faster. Per-query costs for state-of-the-art inference have not declined dramatically.

  3. Focusing only on model cost, not total infrastructure. GPU rental is only part of the cost. Networking, storage, cooling, datacentre space, and engineering overhead can add 50–100% to the raw compute cost.

  4. Treating GPU supply as just a datacentre team’s problem. Hardware constraints are product constraints. A product that burns 10 seconds of GPU time per query has a different feature surface than one that uses 0.1 seconds.
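Point 3 can be made concrete with a small calculation. This sketch applies the 50–100% overhead range from the list above to a $3/hour rental rate; the function name is hypothetical and the overhead is treated as a flat multiplier, which is a simplification.

```python
def fully_loaded_rate(gpu_hourly_rate: float, overhead_fraction: float) -> float:
    """Hourly cost including non-GPU overhead, modelled as a flat fraction."""
    return gpu_hourly_rate * (1.0 + overhead_fraction)


# 50% and 100% overhead on a $3/hour GPU.
low = fully_loaded_rate(3.0, 0.5)
high = fully_loaded_rate(3.0, 1.0)
print(f"${low:.2f} to ${high:.2f} per GPU-hour, fully loaded")
```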

Practical decision check

  • What GPU class does your inference workload need? A 7B model can run on consumer GPUs; a 70B model typically needs datacentre GPUs.
  • What is your per-query GPU cost at projected scale? Include batch efficiency improvements, caching gains, and hardware generation shifts.
  • Can your product work with a smaller or quantised model? The cost difference between 7B and 70B is roughly 10–20x.
  • What is your hardware allocation plan? If you need 1,000 H100-equivalent GPUs running concurrently, verify availability before committing to a launch date.
  • Are you locked into a specific hardware vendor? Portability between GPU vendors and cloud providers affects your bargaining position and supply resilience.
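The allocation question in the checklist can be turned into a rough sizing estimate. The sketch below is a starting point under stated assumptions, not a capacity plan: the function names, the 5,000 peak users, and the 5 users per GPU are hypothetical inputs, and the $3/hour rate is the illustrative rental figure used earlier in this guide.

```python
import math


def gpus_required(peak_concurrent_users: int, users_per_gpu: int) -> int:
    """GPUs needed at peak, rounding up because GPUs come in whole units."""
    if users_per_gpu < 1:
        raise ValueError("users_per_gpu must be at least 1")
    return math.ceil(peak_concurrent_users / users_per_gpu)


def hourly_gpu_spend(gpu_count: int, gpu_hourly_rate: float) -> float:
    """Rental cost per hour for the whole fleet, before overhead."""
    return gpu_count * gpu_hourly_rate


# Hypothetical launch: 5,000 peak concurrent users, 5 users per GPU.
needed = gpus_required(peak_concurrent_users=5_000, users_per_gpu=5)
spend = hourly_gpu_spend(needed, 3.0)
print(f"{needed:,} GPUs, ${spend:,.0f} per hour")
```

Sizing for peak rather than average load is deliberate here, since inference capacity cannot be added instantly when demand spikes.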

Methodology and sources

Check date: 2026-05-25

What was checked: Public cloud GPU pricing pages, hardware vendor pricing disclosures, industry analyst reports on GPU supply, training cluster cost estimates from published papers and provider disclosures.

Assumptions and limits: Hardware pricing is volatile and varies by region, volume discount, and supply conditions. Cloud GPU availability depends on region and provider. This guide focuses on mid-2026 conditions and will need updating as supply dynamics change.


Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.