Local LLM runtimes: Ollama, llama.cpp, vLLM and TGI in plain English
If you want to run an open-weight model on your own hardware, you need a runtime. The four most common options are Ollama, llama.cpp, vLLM and Hugging Face’s Text Generation Inference (TGI).
They all load a model and serve inference. But they serve different use cases, and picking the wrong one means either fighting the tool or paying for hardware you do not use.
Quick answer
Ollama is for prototyping and personal use — fastest setup, decent single-user performance. llama.cpp is for local inference on consumer hardware, especially when you care about quantisation and CPU/GPU hybrid setups. vLLM is for production serving with high throughput, batching and multiple concurrent users. TGI is for teams already in the Hugging Face ecosystem who need a production-ready API with safety features.
Your choice depends on whether you are experimenting, building for personal use, or serving paying users.
Ollama: fastest path to running a model
Ollama wraps model downloading, quantisation management, and a basic API server into one command. ollama run llama3.2 is genuinely that simple.
Use it when:
- you want to experiment with open models on a single machine;
- you need a simple API compatible with OpenAI’s chat format;
- you want to manage multiple models and switch between them easily;
- hardware utilisation and maximum throughput are not primary concerns.
Ollama is not designed for production serving. It runs one request at a time per model by default, its batching is limited, and it does not optimise for throughput under concurrent load.
llama.cpp: maximum performance on consumer hardware
llama.cpp is the runtime that made local LLMs practical. It supports quantised model formats (GGUF), runs on CPU, GPU, or a mix of both, and gives you fine-grained control over memory allocation, context size, and inference parameters.
Use it when:
- you are running inference on a laptop or consumer GPU;
- you want maximum performance per watt on consumer hardware;
- you need support for non-standard quantisation levels or model formats;
- you are building a custom application that embeds inference directly (via the C++ library or Python bindings).
The trade-off is that llama.cpp is a lower-level tool. You configure it through command-line flags. There is no built-in API server for production use, though third-party wrappers like llama.cpp-server fill that gap.
vLLM: production serving at scale
vLLM is built for throughput. It supports continuous batching, PagedAttention for efficient memory management, tensor parallelism across multiple GPUs, and a production-ready OpenAI-compatible API.
Use it when:
- you are serving multiple concurrent users with variable request sizes;
- you need to maximise throughput per GPU dollar;
- you want to run a model at production scale with a single GPU or multiple GPUs;
- you need features like prefix caching, speculative decoding, or guided JSON generation.
vLLM is more complex to set up than Ollama. You need to choose model formats, configure scheduling parameters, and understand GPU memory allocation. But for production workloads, the throughput advantage is substantial — often 2-5x over simpler runtimes.
TGI: the Hugging Face ecosystem option
TGI (Text Generation Inference) is Hugging Face’s production inference server. It supports the Hugging Face model hub natively, includes safety features like content moderation, and integrates with the broader Hugging Face toolchain.
Use it when:
- you are already using the Hugging Face ecosystem for model management;
- you need built-in content moderation or safety filtering;
- you want seamless model downloads and versioning from the Hugging Face hub;
- you need a production server but prefer Hugging Face’s configuration conventions.
TGI is comparable to vLLM in production capability but takes a different approach to configuration and focuses more on ecosystem integration.
How to choose
| If you are… | Start with… |
|---|---|
| Experimenting with open models | Ollama |
| Running on a laptop or consumer GPU | llama.cpp |
| Building a production API service | vLLM |
| Already using Hugging Face for everything | TGI |
| Unsure and want to move between options | Start with Ollama, migrate to vLLM when needed |
The best path is usually to start with Ollama, understand your workload, and migrate to vLLM or TGI when you outgrow it.
What teams get wrong
- choosing a runtime before knowing their workload characteristics (concurrent users, request size, latency requirements);
- assuming Ollama is production-ready because the API looks like OpenAI’s;
- spending time optimising llama.cpp parameters when a different runtime would solve the problem;
- comparing runtimes by peak throughput without considering setup complexity and maintenance cost;
- ignoring the model format requirements — not all models run on all runtimes without conversion.
Practical decision check
- How many concurrent users do you need to serve?
- What is your latency target?
- What hardware do you have available?
- Are you willing to convert model formats?
- Do you need an OpenAI-compatible API?
If you are serving more than one concurrent user, skip Ollama for production. If you are running on consumer hardware, start with llama.cpp. If you have GPU budget for a production service, go with vLLM.
Methodology and sources
Check date: 2026-05-24
What was checked: Official documentation, repositories and deployment guides for Ollama, llama.cpp, vLLM and TGI; community benchmarks and deployment experience reports.
What the sources were used for: Identifying capability boundaries, setup complexity, performance characteristics and production readiness of each runtime.
Assumptions and limits: Runtime versions and capabilities change rapidly. Benchmarks between runtimes depend heavily on model, hardware and workload. The guidance above reflects mid-2026 patterns.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief, with a plain-English comparison of the four main open-source LLM runtimes.
Source list
- Ollama documentation — https://github.com/ollama/ollama
- llama.cpp repository — https://github.com/ggerganov/llama.cpp
- vLLM documentation — https://docs.vllm.ai/en/latest/
- Hugging Face TGI — https://huggingface.co/docs/text-generation-inference/en/index
- Open LLM benchmark (MLC) — https://llm.mlc.ai/bench