Quantisation explained: why model files have Q4, Q5 and GGUF labels
When you download an open-weight model, you rarely get a single file. You get a list of variants: Q4_K_M, Q5_0, Q8_0, F16. These labels describe how precisely the model’s weights have been stored. Lower precision means smaller file sizes and faster inference — at the cost of some quality.
The short answer: moving from FP16 to Q8, Q5 or Q4 cuts model file size by roughly 50–75%, with minimal quality loss for most tasks. Q4 and Q5 quantisations are the practical sweet spot for consumer GPUs. Q8 and FP16 suit quality-critical work when you have the hardware to match.
If you only remember one thing: start with Q4_K_M for any model. It is the most tested, most compatible and offers the best quality-to-size ratio for the vast majority of users. Only go higher if you have the VRAM and need the marginal quality gain.
Editor’s Note: Quantisation is precision reduction, not general-purpose file compression: each weight is stored with fewer bits. The quants behind these labels are usually produced after training, with no retraining step. They work because a model's outputs are fairly robust to small rounding errors in its weights.
Editor’s Note: A Q4 model is not “4× worse” than FP16. In practice, quality degradation is often below measurable significance for chat and text generation tasks. Extraction and classification are slightly more sensitive.
Quick answer {#quick-answer}
| Quantisation | Size vs FP16 | VRAM for 7B model | VRAM for 70B model | Quality impact |
|---|---|---|---|---|
| FP16 (unquantised) | 100% | ~14 GB | ~140 GB | Reference baseline |
| Q8_0 | ~50% | ~7 GB | ~70 GB | Negligible |
| Q6_K | ~40% | ~5.5 GB | ~55 GB | Minimal |
| Q5_K_M | ~35% | ~5 GB | ~49 GB | Very small |
| Q5_0 | ~33% | ~4.7 GB | ~47 GB | Small |
| Q4_K_M | ~28% | ~4 GB | ~39 GB | Slight |
| Q4_0 | ~25% | ~3.5 GB | ~35 GB | Noticeable on some tasks |
| Q3_K_M | ~22% | ~3 GB | ~30 GB | Degradation visible |
| Q2_K | ~17% | ~2.5 GB | ~24 GB | Significant degradation |
These are approximate. Exact sizes vary by model architecture and quantisation implementation.
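The table's figures follow from simple arithmetic: parameters times bits per weight, divided by eight. The sketch below reproduces a few rows. The `EFFECTIVE_BITS` values are assumptions back-derived from the table's percentage column, not numbers taken from any runtime's documentation.

```python
def estimate_weight_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumed effective bits per weight, derived from the table's "Size vs FP16" column
# (for example 28% of 16 bits is about 4.5 bits for Q4_K_M).
EFFECTIVE_BITS = {"FP16": 16.0, "Q8_0": 8.0, "Q4_K_M": 4.5}

for label, bits in EFFECTIVE_BITS.items():
    small = estimate_weight_size_gb(7, bits)
    large = estimate_weight_size_gb(70, bits)
    print(f"{label}: 7B ~{small:.1f} GB, 70B ~{large:.1f} GB")
```

The estimate covers weights only. Context (the KV cache) and runtime buffers come on top.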
What the labels mean {#what-the-labels-mean}
Number: bits per weight {#number-bits-per-weight}
- Q2: 2 bits per weight. Maximum compression, significant quality loss.
- Q3: 3 bits per weight. Aggressive compression, visible degradation on complex tasks.
- Q4: 4 bits per weight. The practical sweet spot. Good quality, good speed.
- Q5: 5 bits per weight. Higher quality, slightly larger files. Useful for sensitive tasks.
- Q6: 6 bits per weight. Near-reference quality.
- Q8: 8 bits per weight. Essentially reference quality at half the size.
- FP16: 16-bit (half-precision) float. Not quantised; the usual reference baseline.
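To see what fewer bits per weight costs, the toy sketch below rounds one block of weights to a small integer grid with a shared scale, then restores it and measures the error. It is a simplified illustration under assumed details (symmetric round-to-nearest, one scale per block). It is not the scheme llama.cpp uses for any particular quant type.

```python
def quantise_block(weights: list[float], bits: int) -> tuple[float, list[int]]:
    """Symmetric round-to-nearest quantisation of one block with a shared scale."""
    qmax = 2 ** (bits - 1) - 1  # largest integer code, e.g. 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantise_block(scale: float, codes: list[int]) -> list[float]:
    """Map integer codes back to approximate float weights."""
    return [scale * c for c in codes]


def max_error(weights: list[float], bits: int) -> float:
    """Largest absolute difference between original and restored weights."""
    scale, codes = quantise_block(weights, bits)
    restored = dequantise_block(scale, codes)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up example weights, purely for illustration.
block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.48, -0.76, 0.02]
for bits in (2, 4, 8):
    print(f"{bits} bits: max error {max_error(block, bits):.4f}")
```

On this block the error shrinks as the bit count rises, which is the pattern the list above describes.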
Suffix: method variant {#suffix-method-variant}
- K_M: “Medium” — the recommended K-quant. Balances quality across layers.
- K_S: “Small” — smaller but lower quality than K_M.
- _0: Baseline quantisation method, less refined than K-variants.
The widely recommended default is Q4_K_M — 4-bit, medium K-quant. Most model distributors (TheBloke, Bartowski, MaziyarPanahi) list this as the first or most-downloaded quant.
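If you script downloads, the labels are regular enough to parse. The helper below is a hypothetical sketch that splits a label into bit count and variant. It also accepts `K_L`, a bare `K` (as in Q6_K) and `_1`, which are assumptions beyond the suffixes listed above.

```python
import re
from typing import Optional

# Matches labels such as Q4_K_M, Q6_K, Q8_0. Float labels like F16 are not quants.
_QUANT_LABEL = re.compile(r"^Q(\d)_(K_[SML]|K|0|1)$")


def parse_quant_label(label: str) -> Optional[dict]:
    """Return bits, family and size for a quant label, or None if it is not one."""
    match = _QUANT_LABEL.match(label.upper())
    if match is None:
        return None
    bits, suffix = int(match.group(1)), match.group(2)
    family = "k-quant" if suffix.startswith("K") else "baseline"
    size = {"K_S": "small", "K_M": "medium", "K_L": "large"}.get(suffix)
    return {"bits": bits, "family": family, "size": size}


for name in ("Q4_K_M", "Q8_0", "Q6_K", "F16"):
    print(name, "->", parse_quant_label(name))
```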
GGUF: the container format {#gguf-the-container-format}
GGUF is the file format that stores the model weights along with metadata (tokenizer config, model architecture, hyperparameters). It replaced the older GGML format. A .gguf file is usually a quantised model, although the format can also hold unquantised F16 weights. Either way it is ready to load in llama.cpp, Ollama or compatible runtimes.
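A quick way to confirm that a download really is GGUF is to read the first few bytes. The sketch below assumes the version 2 and 3 header layout: the ASCII magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. The demo parses a synthetic header so that it runs without a model file.

```python
import struct

HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size start of a GGUF file (assumed v2/v3 layout)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes")
    magic, version, tensor_count, kv_count = struct.unpack("<4sIQQ", data[:HEADER_SIZE])
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for demonstration; the counts are made up.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(sample))
```

For a real file, pass in the first 24 bytes, for example `open(path, "rb").read(24)` with your own `path`.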
How quantisation affects real tasks {#how-quantisation-affects-real-tasks}
The impact of quantisation depends heavily on the task:
- Chat and creative writing: Q4_K_M is virtually indistinguishable from FP16 in most blind tests. The model’s training and prompt quality dominate output quality far more than quantisation level.
- Extraction and structured output: Q4 is fine for most fields, but if you need exact number extraction or specific-token outputs, Q5 or Q8 may reduce errors.
- Classification and routing: Q4 works well. The marginal improvement from Q8 is typically <1% accuracy.
- Code generation: Q4_K_M is standard. Some developers report better success with Q5 or Q8 for complex or multi-file code generation.
- Reasoning and maths: Q5_K_M is a safer starting point. The precision loss at Q4 can compound across multi-step reasoning chains.
Worked example: 7B model on an 8GB GPU {#worked-example-7b-model-on-an-8gb-gpu}
With a GPU that has 8 GB of VRAM (e.g., RTX 3070 or the 8 GB RTX 4060 Ti):
- Q4_K_M: fits easily (~4 GB). Leaves room for context and batch processing. Expect roughly 30–50 tokens/second, depending on runtime and context length.
- Q8_0: may fit (~7 GB) but leaves almost no headroom. Context length is limited. Speed drops as memory fills.
- FP16: does not fit (~14 GB). It cannot be loaded fully onto the GPU.
Of these three, Q4_K_M is the only practical option and works well. Per the table, Q5_K_M (~5 GB) also fits if you want a little more quality.
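The "room for context" in this example can be estimated. The KV cache stores one key and one value vector per layer for every token in the context. The sketch below assumes a Llama-3.1-8B-class shape (32 layers, 8 KV heads, head dimension 128) and a 16-bit cache. These figures are illustrative assumptions, so check your own model's config.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_value: int = 2) -> int:
    """Size of the KV cache: keys plus values, for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


VRAM_GB = 8.0
cache_gb = kv_cache_bytes(32, 8, 128, context_len=8192) / 1e9  # about 1.07 GB

# Weight sizes taken from the worked example above.
for label, weights_gb in {"Q4_K_M": 4.0, "Q8_0": 7.0, "FP16": 14.0}.items():
    spare = VRAM_GB - weights_gb - cache_gb
    verdict = "fits" if spare > 0 else "does not fit"
    print(f"{label}: {spare:+.1f} GB spare at 8k context ({verdict})")
```

Doubling the context length doubles the cache, which is why a quant that only just fits can limit usable context.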
Worked example: 70B model on a workstation {#worked-example-70b-model-on-a-workstation}
With a dual-GPU workstation (e.g., 2× RTX 3090, 48 GB total):
- Q3_K_M: fits (~30 GB). Runs but quality is degraded.
- Q4_K_M: may fit (~39 GB) if model layers are split across GPUs. Good quality, moderate speed.
- Q5_K_M: too large (~49 GB). Does not fit.
- FP16: impossible (~140 GB).
Q4_K_M is borderline but workable. Many users prefer Q3_K_M for a comfortable fit, or Q4_K_S (roughly 37 GB, a couple of gigabytes smaller than Q4_K_M) for a little extra headroom.
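Why Q4_K_M is borderline becomes clearer once the split is written out per GPU. The sketch below is idealised: it divides the weights in proportion to each card's VRAM and reports what is left. Real runtimes split by whole layers and reserve extra buffers, so treat the numbers as optimistic.

```python
def per_gpu_headroom_gb(model_gb: float, gpu_vram_gb: list[float]) -> list[float]:
    """Split weights in proportion to VRAM; return leftover GB on each GPU."""
    total = sum(gpu_vram_gb)
    return [vram - model_gb * vram / total for vram in gpu_vram_gb]


gpus = [24.0, 24.0]  # 2x RTX 3090

# 70B weight sizes from the worked example above.
for label, size_gb in {"Q3_K_M": 30.0, "Q4_K_M": 39.0, "Q5_K_M": 49.0}.items():
    leftover = per_gpu_headroom_gb(size_gb, gpus)
    print(label, [f"{gb:+.1f} GB" for gb in leftover])
```

The leftover on each card has to hold that card's share of the KV cache and any runtime overhead.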
What quantisation does not change {#what-quantisation-does-not-change}
- The model’s training data. Quantisation does not add or remove knowledge.
- The model’s architecture. Attention mechanisms, layer counts and vocabulary stay the same.
- The model’s safety alignment. A Q4 quant of an unsafe model is just as unsafe as the FP16 weights it was made from.
Decision tree {#decision-tree}
- How much VRAM do you have? See the table above. Pick the highest-precision quant that fits with headroom for your context length.
- Is this a quality-sensitive task? (Medical, financial, legal extraction → prefer Q5 or higher. Chat, content, classification → Q4 is fine.)
- Are you using CPU or GPU? CPU inference is slower, but system RAM is usually larger than VRAM, so it can hold larger quantised models. GPU inference benefits most from Q4–Q5.
- Are you batch-processing? Lower-bit quantisation (Q4) improves batch throughput on GPU because the smaller weights leave more VRAM for larger batches.
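The first two questions can be sketched as a small function. It walks the 7B sizes from the quick-answer table, from highest precision down, and returns the first quant that fits, with a flag for whether it meets the quality floor. The 2 GB default headroom and the choice of Q5_0 as the floor for sensitive tasks are assumptions made for illustration. The CPU and batch questions are left out.

```python
# 7B weight sizes in GB from the quick-answer table, highest precision first.
SIZES_7B = {
    "FP16": 14.0, "Q8_0": 7.0, "Q6_K": 5.5, "Q5_K_M": 5.0, "Q5_0": 4.7,
    "Q4_K_M": 4.0, "Q4_0": 3.5, "Q3_K_M": 3.0, "Q2_K": 2.5,
}


def choose_quant(vram_gb, sizes_gb=SIZES_7B, headroom_gb=2.0, quality_sensitive=False):
    """Return (label, meets_quality_floor) for the highest-precision quant that fits."""
    order = list(sizes_gb)  # relies on the dict being ordered high to low precision
    floor = "Q5_0" if quality_sensitive else "Q4_K_M"
    for label in order:
        if sizes_gb[label] + headroom_gb <= vram_gb:
            return label, order.index(label) <= order.index(floor)
    return None, False


print(choose_quant(8))
print(choose_quant(6, quality_sensitive=True))
```

A `False` flag means that the hardware, not the task, is deciding the quant. That is the point at which a smaller model or more VRAM is worth considering.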
What this page cannot tell you {#what-this-page-cannot-tell-you}
This page cannot tell you the exact quality difference between Q4_K_M and Q5_K_M for your specific model and task. The only reliable way to know is to run both versions on your eval set and compare. For most teams, Q4_K_M is the starting point, and moving to Q5 or Q8 is a marginal optimisation that matters only for precision-critical workloads.
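Running both versions on an eval set can be as simple as the sketch below. Each model is represented by a hypothetical callable that takes a prompt and returns text. How you wire that to your runtime is up to you. Exact-match scoring is an assumption, so swap in whatever metric your task needs. The demo uses stand-in functions, not real models.

```python
from typing import Callable, Iterable


def accuracy(model: Callable[[str], str], eval_set: Iterable[tuple[str, str]]) -> float:
    """Fraction of prompts whose output exactly matches the expected answer."""
    items = list(eval_set)
    correct = sum(1 for prompt, expected in items if model(prompt).strip() == expected)
    return correct / len(items)


def compare_quants(models: dict[str, Callable[[str], str]],
                   eval_set: Iterable[tuple[str, str]]) -> dict[str, float]:
    """Score every model variant on the same eval set."""
    items = list(eval_set)
    return {name: accuracy(fn, items) for name, fn in models.items()}


# Stand-in "models" for demonstration only; replace with calls into your runtime.
eval_set = [("2+2=", "4"), ("3+3=", "6"), ("10-7=", "3"), ("5*5=", "25")]
answers = dict(eval_set)
fake_q5 = lambda prompt: answers[prompt]
fake_q4 = lambda prompt: "24" if prompt == "5*5=" else answers[prompt]

print(compare_quants({"Q4_K_M": fake_q4, "Q5_K_M": fake_q5}, eval_set))
```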
Methodology and sources {#methodology-and-sources}
Check date: 2026-05-25
What was checked: llama.cpp quantisation documentation, GGUF specification, community benchmarks from r/LocalLLaMA and Hugging Face model card comparisons.
Worked-example assumptions: Llama 3.1 / Mistral-class model architectures. VRAM figures in the table and worked examples cover the weights only; budget roughly 1 GB more for the KV cache at default context length.
Assumptions and limits:
- Quantisation quality varies between model architectures.
- Some models are more resilient to quantisation than others.
- CPU vs GPU inference performance varies by runtime and hardware.
- New quantisation methods (such as the IQ-series quants and other new GGUF tensor types) continue to evolve.
Source list {#source-list}
- llama.cpp quantisation documentation — https://github.com/ggerganov/llama.cpp
- GGUF specification — https://github.com/ggerganov/gguf
- Hugging Face — quantisation guide — https://huggingface.co/docs/transformers/en/quantization
- Ollama model library — https://ollama.com/library
Related guides {#related-guides}
- Model parameters and sizes: why 7B, 70B and MoE labels can mislead
- Local LLM runtimes: Ollama, llama.cpp, vLLM and TGI in plain English
- GPU rental for LLM inference: what an operator needs to know
- Hosted API vs self-hosted open model: the real cost comparison
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.