Hardware supply and inference economics: why chips shape AI products

AI products do not run on code alone. Every request hits a physical chip in a datacentre somewhere, and the economics of those chips — supply, pricing, allocation — directly shape what products can charge, how much latency users experience, and which features are viable.

TL;DR

GPU supply constraints and inference economics are the hidden bottleneck in AI product design. A single H100 GPU costs roughly $25,000–$40,000 and typically serves 2–5 concurrent users for large-model inference (the full range is closer to 1–10, depending on precision, context length, and batch size). At cloud rental rates of $2–4 per GPU-hour, the per-query inference cost for a large frontier model can be $0.01–$0.10 before any provider margin. This chip-level cost floor determines minimum viable pricing, which features are worth shipping (is a summarisation button worth its GPU time?), and which optimisation techniques (quantisation, caching, smaller models) are worth the engineering investment.

Why hardware supply matters for product decisions

GPU supply has been constrained since 2023. The reasons are structural: fab capacity for advanced nodes is limited, high-bandwidth memory (HBM) is a bottleneck, and packaging capacity for multi-die chips is constrained. These constraints mean:

Inference capacity is not elastic. You cannot instantly scale inference the way you can scale stateless web servers. Each concurrent user needs GPU memory and compute.
Hardware allocation affects feature decisions. Adding a “compare these documents” feature that needs a 10,000-token context window can double per-query GPU time, depending on your baseline context length. That is a cost decision, not just a code decision.
Small-model optimisation has real margin impact. A quantised 7B model costs roughly 10–20x less per query than a 70B model. When both run on the same chip, the hourly cost is identical; the saving comes entirely from higher throughput.
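
The throughput point can be made concrete with a few lines of arithmetic. The following is a minimal sketch, not a benchmark: the function name and both throughput figures are hypothetical, chosen only to illustrate how a gap of this size arises when the hourly GPU rate is held constant.

```python
def cost_per_query(gpu_hourly_usd: float, queries_per_gpu_hour: float) -> float:
    """GPU cost of one query: hourly rental divided by hourly throughput."""
    if queries_per_gpu_hour <= 0:
        raise ValueError("queries_per_gpu_hour must be positive")
    return gpu_hourly_usd / queries_per_gpu_hour


# All three figures are illustrative assumptions, not measurements.
GPU_HOURLY_USD = 3.0        # within the $2-4 per GPU-hour range quoted above
QPH_70B = 500               # hypothetical queries per GPU-hour, 70B model
QPH_7B_QUANTISED = 10_000   # hypothetical queries per GPU-hour, quantised 7B model

large = cost_per_query(GPU_HOURLY_USD, QPH_70B)
small = cost_per_query(GPU_HOURLY_USD, QPH_7B_QUANTISED)
print(f"70B: ${large:.4f}/query  7B: ${small:.4f}/query  ratio: {large / small:.0f}x")
```

Because the hourly rate cancels out, the cost ratio is simply the throughput ratio. Measuring real queries per GPU-hour for each candidate model is therefore the number worth collecting first.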

The economics at each layer

Training. Building a frontier-model training cluster requires 10,000–100,000 GPUs. At $25K per GPU, a training cluster is a $250M–$2.5B capital investment before datacentre, power, and networking costs. This explains why frontier training is limited to a handful of organisations.
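
The capital figure above is a straight multiplication, sketched here for readers who want to plug in their own numbers. The function name is hypothetical, and the result covers GPUs only; datacentre, power, and networking costs are excluded, as in the text.

```python
def cluster_gpu_capex(gpu_count: int, unit_price_usd: int) -> int:
    """GPU-only capital cost of a training cluster, in US dollars."""
    return gpu_count * unit_price_usd


# Cluster sizes and the $25K unit price are the figures quoted in the text.
for gpu_count in (10_000, 100_000):
    capex = cluster_gpu_capex(gpu_count, 25_000)
    print(f"{gpu_count:>7,} GPUs -> ${capex / 1e9:.2f}B")
```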

Inference serving. A single GPU can serve 1–10 concurrent users for a 70B-class model, depending on precision, context length, and batch size. At cloud GPU rental of $3/hour, serving a user for one hour of conversation costs $0.30–$3.00 in GPU time alone. This is why free-tier AI products have tight usage limits.
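
The per-user range follows from dividing the hourly GPU rate by the number of users sharing the chip. A minimal sketch, assuming perfectly even sharing and a GPU that is billed whether or not it is busy (the function name is hypothetical):

```python
def gpu_cost_per_user_hour(gpu_hourly_usd: float, concurrent_users_per_gpu: int) -> float:
    """GPU cost of serving one user for one hour when users share a GPU evenly."""
    if concurrent_users_per_gpu < 1:
        raise ValueError("concurrent_users_per_gpu must be at least 1")
    return gpu_hourly_usd / concurrent_users_per_gpu


# $3/hour and the 1-10 users-per-GPU range are the figures quoted in the text.
for users in (10, 1):
    cost = gpu_cost_per_user_hour(3.0, users)
    print(f"{users:>2} users per GPU -> ${cost:.2f} per user-hour")
```

Real utilisation is rarely even, so this is a floor under idealised sharing rather than a forecast.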

Specialised inference chips. Startups like Groq, Cerebras, and d-Matrix build chips specifically for inference. Their cost-per-query can be lower than GPU alternatives, but they face the same supply constraints for advanced fab capacity.

What teams get wrong

Ignoring GPU availability when planning features. A feature that looks cheap in prototype with no users can become uneconomical at scale when GPU costs dominate the cost structure.
Assuming inference costs will follow Moore’s Law. GPU performance improves, but demand for larger models and longer contexts grows faster. Per-query costs for state-of-the-art inference have not declined dramatically.
Focusing only on model cost, not total infrastructure. GPU rental is only part of the cost. Networking, storage, cooling, datacentre space, and engineering overhead can add 50–100% to the raw compute cost.
Treating GPU supply as just a datacentre team’s problem. Hardware constraints are product constraints. A product that burns 10 seconds of GPU time per query has a different feature surface than one that uses 0.1 seconds.
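
Two of the points above, infrastructure overhead and GPU seconds per query, combine into a single loaded cost per query. The sketch below assumes a $3/hour rental rate and treats overhead as a flat multiplier on raw compute; the function and parameter names are hypothetical.

```python
def loaded_cost_per_query(
    gpu_seconds: float,
    gpu_hourly_usd: float = 3.0,      # assumed rental rate
    overhead_fraction: float = 0.5,   # 0.5 to 1.0 matches the 50-100% range above
) -> float:
    """Per-query cost: raw GPU time plus a flat infrastructure overhead."""
    raw_compute = gpu_seconds * gpu_hourly_usd / 3600  # 3600 seconds per hour
    return raw_compute * (1 + overhead_fraction)


# Compare a heavy query (10 s of GPU time) with a light one (0.1 s).
for seconds in (10.0, 0.1):
    low = loaded_cost_per_query(seconds, overhead_fraction=0.5)
    high = loaded_cost_per_query(seconds, overhead_fraction=1.0)
    print(f"{seconds:>4} GPU-seconds -> ${low:.6f} to ${high:.6f} per query")
```

The 100x gap between the two queries survives any overhead assumption, which is why GPU seconds per query is the figure to track at the feature level.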

Practical decision check

What GPU class does your inference workload need? A quantised 7B model can run on a consumer GPU; a 70B model generally needs datacentre-class GPUs.
What is your per-query GPU cost at projected scale? Include batch efficiency improvements, caching gains, and hardware generation shifts.
Can your product work with a smaller or quantised model? The cost difference between 7B and 70B is roughly 10–20x.
What is your hardware allocation plan? If you need 1,000 H100-equivalent GPUs running concurrently, verify availability before committing to a launch date.
Are you locked into a specific hardware vendor? Portability between GPU vendors and cloud providers affects your bargaining position and supply resilience.
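
The allocation question can be roughed out before talking to a provider. This is a back-of-envelope sketch under stated assumptions: the peak user count, the 20% headroom, and the 730-hour month are all hypothetical inputs, and the function names are illustrative.

```python
def gpus_required(peak_concurrent_users: int, users_per_gpu: int, headroom_pct: int = 20) -> int:
    """GPUs needed at peak, rounded up, with spare capacity for bursts."""
    if users_per_gpu < 1:
        raise ValueError("users_per_gpu must be at least 1")
    numerator = peak_concurrent_users * (100 + headroom_pct)
    denominator = 100 * users_per_gpu
    return -(-numerator // denominator)  # integer ceiling division


def monthly_rental_usd(gpu_count: int, gpu_hourly_usd: float = 3.0, hours_per_month: int = 730) -> float:
    """Rental cost of keeping gpu_count GPUs reserved for a whole month."""
    return gpu_count * gpu_hourly_usd * hours_per_month


PEAK_USERS = 5_000  # hypothetical launch target
for users_per_gpu in (10, 1):  # the 1-10 users-per-GPU range for a 70B-class model
    gpus = gpus_required(PEAK_USERS, users_per_gpu)
    print(f"{users_per_gpu:>2} users/GPU -> {gpus:,} GPUs, ${monthly_rental_usd(gpus):,.0f}/month")
```

The order-of-magnitude spread between the two cases is the point: the users-per-GPU figure for your actual model and context length determines whether the allocation request is routine or needs months of lead time.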

Methodology

Data checked: 2026-05-28
Sources consulted: Public cloud GPU pricing pages, hardware vendor pricing disclosures, industry analyst reports on GPU supply (SemiAnalysis, IEA), training cluster cost estimates from published papers and provider disclosures
Assumptions: Hardware pricing is volatile and varies by region, volume discount, and supply conditions. Cloud GPU availability depends on region and provider. GPU pricing figures (H100 at $25K–$40K) reflect mid-2026 market conditions and may not reflect enterprise volume discounts. Cloud GPU rental rates of $2–4/hour are spot-market estimates.
Limitations: This article does not benchmark specific cloud providers, does not cover on-premise datacentre economics, and does not provide financial or procurement advice. Hardware supply dynamics for non-NVIDIA accelerators (AMD, Intel, custom ASICs) are covered only briefly. Chip export controls and geopolitical supply risks are out of scope.
Jurisdiction: Global. GPU pricing and availability vary by region. US export controls on advanced semiconductors may affect availability in certain markets.

Source list

NVIDIA H100 specifications — https://www.nvidia.com/en-gb/data-center/h100/ (accessed 2026-05-28)
IEA: AI and semiconductors — https://www.iea.org/commentaries/ai-and-semiconductors (accessed 2026-05-28)
SemiAnalysis AI hardware supply analysis — https://www.semianalysis.com/ (accessed 2026-05-28)
Cloud GPU pricing (AWS, GCP, Azure, Lambda Labs, Vast.ai) — current pricing pages (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist. Added three Editor’s Note aside cards, slugified all heading IDs, added Trust Stack section with corrections policy and affiliation declaration, corrected frontmatter writtenBy label, fixed truncated description, standardised Methodology and Source List formats with access dates.
2026-05-27: Added direct source URLs and Change Log section.