Local LLM runtimes: Ollama, llama.cpp, vLLM and TGI in plain English

If you want to run an open-weight model on your own hardware, you need a runtime. The four most common options are Ollama, llama.cpp, vLLM and Hugging Face’s Text Generation Inference (TGI).

They all load a model and serve inference. But they serve different use cases, and picking the wrong one means either fighting the tool or paying for hardware you do not use.

TL;DR

Ollama is for prototyping and personal use — fastest setup, decent single-user performance. llama.cpp is for local inference on consumer hardware, especially when you care about quantisation and CPU/GPU hybrid setups. vLLM is for production serving with high throughput, batching and multiple concurrent users. TGI is for teams already in the Hugging Face ecosystem who need a production-ready API with safety features.

Your choice depends on whether you are experimenting, building for personal use, or serving paying users.

Ollama: fastest path to running a model

Ollama wraps model downloading, quantisation management, and a basic API server into one command. ollama run llama3.2 is genuinely that simple.

Use it when:

you want to experiment with open models on a single machine;
you need a simple API compatible with OpenAI’s chat format;
you want to manage multiple models and switch between them easily;
hardware utilisation and maximum throughput are not primary concerns.

Ollama is not designed for production serving. It runs one request at a time per model by default, its batching is limited, and it does not optimise for throughput under concurrent load.

llama.cpp: maximum performance on consumer hardware

llama.cpp is the runtime that made local LLMs practical. It supports quantised model formats (GGUF), runs on CPU, GPU, or a mix of both, and gives you fine-grained control over memory allocation, context size, and inference parameters.

Use it when:

you are running inference on a laptop or consumer GPU;
you want maximum performance per watt on consumer hardware;
you need support for non-standard quantisation levels or model formats;
you are building a custom application that embeds inference directly (via the C++ library or Python bindings).

The trade-off is that llama.cpp is a lower-level tool. You configure it through command-line flags. There is no built-in API server for production use, though third-party wrappers like llama.cpp-server fill that gap.

vLLM: production serving at scale

vLLM is built for throughput. It supports continuous batching, PagedAttention for efficient memory management, tensor parallelism across multiple GPUs, and a production-ready OpenAI-compatible API.

Use it when:

you are serving multiple concurrent users with variable request sizes;
you need to maximise throughput per GPU dollar;
you want to run a model at production scale with a single GPU or multiple GPUs;
you need features like prefix caching, speculative decoding, or guided JSON generation.

vLLM is more complex to set up than Ollama. You need to choose model formats, configure scheduling parameters, and understand GPU memory allocation. But for production workloads, the throughput advantage is substantial — often 2-5x over simpler runtimes.

TGI: the Hugging Face ecosystem option

TGI (Text Generation Inference) is Hugging Face’s production inference server. It supports the Hugging Face model hub natively, includes safety features like content moderation, and integrates with the broader Hugging Face toolchain.

Use it when:

you are already using the Hugging Face ecosystem for model management;
you need built-in content moderation or safety filtering;
you want seamless model downloads and versioning from the Hugging Face hub;
you need a production server but prefer Hugging Face’s configuration conventions.

TGI is comparable to vLLM in production capability but takes a different approach to configuration and focuses more on ecosystem integration.

How to choose

If you are…	Start with…
Experimenting with open models	Ollama
Running on a laptop or consumer GPU	llama.cpp
Building a production API service	vLLM
Already using Hugging Face for everything	TGI
Unsure and want to move between options	Start with Ollama, migrate to vLLM when needed

The best path is usually to start with Ollama, understand your workload, and migrate to vLLM or TGI when you outgrow it.

What teams get wrong

choosing a runtime before knowing their workload characteristics (concurrent users, request size, latency requirements);
assuming Ollama is production-ready because the API looks like OpenAI’s;
spending time optimising llama.cpp parameters when a different runtime would solve the problem;
comparing runtimes by peak throughput without considering setup complexity and maintenance cost;
ignoring the model format requirements — not all models run on all runtimes without conversion.

Practical decision check

How many concurrent users do you need to serve?
What is your latency target?
What hardware do you have available?
Are you willing to convert model formats?
Do you need an OpenAI-compatible API?

If you are serving more than one concurrent user, skip Ollama for production. If you are running on consumer hardware, start with llama.cpp. If you have GPU budget for a production service, go with vLLM.

Methodology

Data checked: 2026-05-28
Sources consulted: Official documentation, repositories and deployment guides for Ollama, llama.cpp, vLLM and TGI; community benchmarks and deployment experience reports; MLC LLM benchmark data
Assumptions: Runtime versions and capabilities change rapidly. Benchmarks between runtimes depend heavily on model, hardware and workload. The guidance above reflects mid-2026 patterns and typical use cases, not exhaustive benchmarking.
Limitations: This article does not benchmark runtimes against each other on specific hardware configurations. It does not cover enterprise deployment patterns (Kubernetes, load balancing, autoscaling) in depth. Specific runtime versions and features may have changed since publication. Hardware-specific optimisations (AMD ROCm, Apple Metal) are mentioned only in passing.
Jurisdiction: Global. Runtime availability and hardware access do not vary by jurisdiction, though export controls on certain GPU hardware may affect deployment options in some regions.

Source list

Ollama documentation — https://github.com/ollama/ollama (accessed 2026-05-28)
llama.cpp repository — https://github.com/ggerganov/llama.cpp (accessed 2026-05-28)
vLLM documentation — https://docs.vllm.ai/en/latest/ (accessed 2026-05-28)
Hugging Face TGI — https://huggingface.co/docs/text-generation-inference/en/index (accessed 2026-05-28)
Open LLM benchmark (MLC) — https://llm.mlc.ai/bench (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist. Added three Editor’s Note aside cards, slugified all heading IDs, added Trust Stack section with corrections policy and affiliation declaration, corrected frontmatter writtenBy label, fixed truncated description, standardised Methodology and Source List formats with access dates, removed internal process language from Change Log.
2026-05-24: First published.

Local LLM runtimes: Ollama, llama.cpp, vLLM and TGI in plain English

TL;DR

Ollama: fastest path to running a model

llama.cpp: maximum performance on consumer hardware

vLLM: production serving at scale

TGI: the Hugging Face ecosystem option

How to choose

What teams get wrong

Practical decision check

Methodology

Source list

Trust Stack

Change log

Related guides