Quantisation explained: why model files have Q4, Q5 and GGUF labels

When you download an open-weight model, you rarely get a single file. You get a list of variants: Q4_K_M, Q5_0, Q8_0, F16. These labels describe how precisely the model’s weights have been stored. Lower precision means smaller file sizes and faster inference — at the cost of some quality.

TL;DR

Quantisation reduces model file size by 50–75% with minimal quality loss for most tasks. Q4 and Q5 quantisations are the practical sweet spot for consumer GPUs. Q8 and FP16 suit quality-critical work when you have the hardware to run them.

If you only remember one thing: start with Q4_K_M for any model. It is the most tested, most compatible and offers the best quality-to-size ratio for the vast majority of users. Only go higher if you have the VRAM and need the marginal quality gain.

| Quantisation | Size vs FP16 | VRAM for 7B model | VRAM for 70B model | Quality impact |
| --- | --- | --- | --- | --- |
| FP16 (full precision) | 100% | ~14 GB | ~140 GB | Reference baseline |
| Q8_0 | ~50% | ~7 GB | ~70 GB | Negligible |
| Q6_K | ~40% | ~5.5 GB | ~55 GB | Minimal |
| Q5_K_M | ~35% | ~5 GB | ~49 GB | Very small |
| Q5_0 | ~33% | ~4.7 GB | ~47 GB | Small |
| Q4_K_M | ~28% | ~4 GB | ~39 GB | Slight |
| Q4_0 | ~25% | ~3.5 GB | ~35 GB | Noticeable on some tasks |
| Q3_K_M | ~22% | ~3 GB | ~30 GB | Degradation visible |
| Q2_K | ~17% | ~2.5 GB | ~24 GB | Significant degradation |

These are approximate. Exact sizes vary by model architecture and quantisation implementation.
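
The VRAM columns follow from simple arithmetic: FP16 stores two bytes per weight, and each quantisation is roughly a fixed fraction of that. The sketch below reproduces the table's figures from its own ratios. The function name is illustrative, and the estimate covers weights only, not KV cache or runtime buffers.

```python
# Approximate size ratios relative to FP16, copied from the table above.
SIZE_VS_FP16 = {
    "FP16": 1.00, "Q8_0": 0.50, "Q6_K": 0.40, "Q5_K_M": 0.35, "Q5_0": 0.33,
    "Q4_K_M": 0.28, "Q4_0": 0.25, "Q3_K_M": 0.22, "Q2_K": 0.17,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Rough weight size in GB: FP16 is 2 bytes per parameter, scaled by the ratio."""
    fp16_gb = params_billion * 2.0  # 1e9 parameters * 2 bytes = 2 GB per billion
    return fp16_gb * SIZE_VS_FP16[quant]


for quant in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"7B {quant}: ~{estimate_weights_gb(7, quant):.1f} GB")
# 7B FP16: ~14.0 GB
# 7B Q8_0: ~7.0 GB
# 7B Q4_K_M: ~3.9 GB
```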

What the labels mean

Number: bits per weight

Q2: 2 bits per weight. Maximum compression, significant quality loss.
Q3: 3 bits per weight. Aggressive compression, visible degradation on complex tasks.
Q4: 4 bits per weight. The practical sweet spot. Good quality, good speed.
Q5: 5 bits per weight. Higher quality, slightly larger files. Useful for sensitive tasks.
Q6: 6 bits per weight. Near-reference quality.
Q8: 8 bits per weight. Essentially reference quality at half the size.
FP16: 16-bit floating point. The unquantised reference that the other levels are compared against.
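
To make "bits per weight" concrete, here is a toy quantiser: it maps a block of floating-point weights to small signed integers that share one scale factor, then measures the error after converting back. This is a simplified round-to-nearest illustration, not llama.cpp's actual code; the block size of 32 and the random weights are assumptions chosen for the demo.

```python
import random


def quantise_block(weights, bits):
    """Symmetric round-to-nearest: floats become signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return ints, scale


def dequantise_block(ints, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [i * scale for i in ints]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    ints, scale = quantise_block(weights, bits)
    restored = dequantise_block(ints, scale)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]  # one block of fake weights
for bits in (2, 3, 4, 5, 6, 8):
    print(f"{bits}-bit: mean abs error {mean_abs_error(block, bits):.6f}")
```

Each extra bit roughly halves the rounding step, which is why the error shrinks quickly between Q2 and Q5 and only slightly after that.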

Suffix: method variant

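K_M: “Medium”, the recommended K-quant. It keeps some of the more sensitive tensors at higher precision, so the file is slightly larger than the nominal bit count suggests.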
K_S: “Small” — smaller but lower quality than K_M.
_0: Baseline quantisation method, less refined than K-variants.

The widely recommended default is Q4_K_M — 4-bit, medium K-quant. Most model distributors (TheBloke, Bartowski, MaziyarPanahi) list this as the first or most-downloaded quant.

GGUF: the container format

GGUF is the file format that stores the model weights along with metadata (tokenizer config, model architecture, hyperparameters). It replaced the older GGML format. A file with the .gguf extension is a model, usually quantised, packaged to load in llama.cpp, Ollama or compatible runtimes. By convention the quantisation level appears in the file name.
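
As an illustration of the container idea, the sketch below reads the fixed-size fields at the start of a GGUF file. It assumes the version 2 and later layout as I understand it from the GGUF specification (4-byte magic, 32-bit version, then 64-bit tensor and metadata counts, little-endian); the function names are hypothetical, and the metadata key-value pairs that follow the header are not parsed.

```python
import struct

HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def parse_gguf_header(data: bytes) -> dict:
    """Decode the leading header fields from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes")
    if data[:4] != b"GGUF":
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", data[4:HEADER_SIZE])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


def read_gguf_header(path: str) -> dict:
    """Open a file and parse only its header, without loading the weights."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling `read_gguf_header("model.gguf")` on a real file (the path is a placeholder) returns the format version and how many tensors and metadata entries the file declares.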

How quantisation affects real tasks

The impact of quantisation depends heavily on the task:

Chat and creative writing: Q4_K_M is virtually indistinguishable from FP16 in most blind tests. The model’s training and prompt quality dominate output quality far more than quantisation level.
Extraction and structured output: Q4 is fine for most fields, but if you need exact number extraction or specific-token outputs, Q5 or Q8 may reduce errors.
Classification and routing: Q4 works well. The marginal improvement from Q8 is typically <1% accuracy.
Code generation: Q4_K_M is standard. Some developers report better success with Q5 or Q8 for complex or multi-file code generation.
Reasoning and maths: Q5_K_M is a safer starting point. The precision loss at Q4 can compound across multi-step reasoning chains.

Worked example: 7B model on an 8 GB GPU

With an 8 GB VRAM GPU (RTX 3070, RTX 4060 Ti):

Q4_K_M: fits easily (~4 GB). Leaves room for context and batch processing. Runs at 30–50 tokens/second.
Q8_0: may fit (~7 GB) but leaves almost no headroom. Context length is limited. Speed drops as memory fills.
FP16: does not fit (~14 GB). Cannot load at all.

Of these three, Q4_K_M is the practical choice and works well. Going by the table, Q5_K_M (~5 GB) and Q6_K (~5.5 GB) also fit, with less room left for context.
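
How much "room for context" is needed depends on the KV cache, which grows linearly with context length. A rough estimate, assuming a 7B-class model with grouped-query attention (32 layers, 8 KV heads, head dimension 128, as in Mistral 7B) and a 16-bit cache; the function name is illustrative and compute buffers are ignored:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys and values: two tensors per layer, one vector per KV head per token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3
for context_len in (4096, 8192, 32768):
    # Shape values below are assumptions for a 7B-class model, not read from a file.
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, context_len=context_len)
    print(f"context {context_len}: ~{size / GIB:.2f} GiB of KV cache")
# context 4096: ~0.50 GiB of KV cache
# context 8192: ~1.00 GiB of KV cache
# context 32768: ~4.00 GiB of KV cache
```

Under these assumptions, a 32k context on top of ~4 GB of Q4_K_M weights would use nearly all of an 8 GB card, so "fits" always means "fits at a given context length".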

Worked example: 70B model on a workstation

With a dual-GPU workstation (e.g., 2× RTX 3090, 48 GB total):

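Q3_K_M: fits (~30 GB) once the layers are split across both GPUs, since no single 24 GB card can hold it. Runs, but quality is degraded.
Q4_K_M: may fit (~39 GB) when split across both GPUs, with limited room left for context. Good quality, moderate speed.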
Q5_K_M: too large (~49 GB). Does not fit.
FP16: impossible (~140 GB).

Q4_K_M is borderline but workable. Many users prefer Q3_K_M for a comfortable fit or Q4_K_S (~33 GB) for a middle ground.
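
The fit check for a split model can be sketched as follows. It assumes layers divide evenly across GPUs and reserves a fixed amount per card for KV cache and buffers; the 2 GB reserve and the function name are illustrative assumptions, and the sizes are the approximate figures quoted in this section.

```python
def fits_when_split(model_gb, gpu_vram_gb, n_gpus, reserve_gb_per_gpu=2.0):
    """Return (fits, per_gpu_gb), assuming an even layer split across GPUs."""
    per_gpu = model_gb / n_gpus
    return per_gpu + reserve_gb_per_gpu <= gpu_vram_gb, per_gpu


# Approximate 70B sizes from this section, checked against 2 x 24 GB cards.
for quant, size_gb in [("Q3_K_M", 30), ("Q4_K_S", 33), ("Q4_K_M", 39), ("Q5_K_M", 49)]:
    ok, per_gpu = fits_when_split(size_gb, gpu_vram_gb=24, n_gpus=2)
    print(f"{quant}: {per_gpu:.1f} GB per GPU -> {'fits' if ok else 'does not fit'}")
# Q3_K_M: 15.0 GB per GPU -> fits
# Q4_K_S: 16.5 GB per GPU -> fits
# Q4_K_M: 19.5 GB per GPU -> fits
# Q5_K_M: 24.5 GB per GPU -> does not fit
```

Real runtimes rarely split perfectly evenly, so treat the Q4_K_M result as a tight fit rather than a guarantee.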

What quantisation does not change

The model’s training data. Quantisation does not add or remove knowledge.
The model’s architecture. Attention mechanisms, layer counts and vocabulary stay the same.
The model’s safety alignment. A Q4 build of an unsafe model is just as unsafe as the FP16 weights it was quantised from.

Decision tree

How much VRAM do you have? See the table above. Pick the highest-precision quantisation (most bits per weight) that fits with headroom for your context length.
Is this a quality-sensitive task? (Medical, financial, legal extraction → prefer Q5 or higher. Chat, content, classification → Q4 is fine.)
Are you using CPU or GPU? CPU inference is slower, but system RAM is usually larger than VRAM, so it can hold larger quantised models. On GPU, Q4–Q5 usually gives the best balance of fit and quality.
Are you batch-processing? A lower-bit quantisation (Q4) improves batch throughput on GPU because the smaller weights leave more VRAM for larger batches.
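
The first step of the decision tree can be sketched as a small lookup. It uses the table's approximate 7B sizes and an assumed 2 GB of headroom; both the function name and the headroom default are illustrative, so adjust them to your context length.

```python
# (name, approx GB for a 7B model) from the table above, highest precision first.
QUANTS_7B = [
    ("FP16", 14.0), ("Q8_0", 7.0), ("Q6_K", 5.5), ("Q5_K_M", 5.0), ("Q5_0", 4.7),
    ("Q4_K_M", 4.0), ("Q4_0", 3.5), ("Q3_K_M", 3.0), ("Q2_K", 2.5),
]


def pick_quant(vram_gb, headroom_gb=2.0, quants=QUANTS_7B):
    """Return the highest-precision quant whose size plus headroom fits, else None."""
    for name, size_gb in quants:
        if size_gb + headroom_gb <= vram_gb:
            return name
    return None


print(pick_quant(8))   # Q6_K
print(pick_quant(6))   # Q4_K_M
print(pick_quant(4))   # None
```

A result below Q5 is a cue to revisit the second question: for quality-sensitive extraction, it may be better to shorten the context or use a smaller model than to drop precision further.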

What this page cannot tell you

This page cannot tell you the exact quality difference between Q4_K_M and Q5_K_M for your specific model and task. The only reliable way to know is to run both versions on your eval set and compare. For most teams, Q4_K_M is the starting point, and moving to Q5 or Q8 is a marginal optimisation that matters only for precision-critical workloads.
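
A minimal sketch of that comparison, assuming each model variant is wrapped in a function that takes a prompt and returns text. The `generate` callables, the exact-match scoring and the tiny eval set are all placeholders; substitute your runtime's API and a metric that suits your task.

```python
from typing import Callable, Iterable


def accuracy(generate: Callable[[str], str], eval_set: Iterable[tuple[str, str]]) -> float:
    """Fraction of prompts whose output exactly matches the expected answer."""
    items = list(eval_set)
    hits = sum(generate(prompt).strip() == expected for prompt, expected in items)
    return hits / len(items)


def compare(models: dict[str, Callable[[str], str]], eval_set) -> dict[str, float]:
    """Score every model variant on the same eval set."""
    items = list(eval_set)
    return {name: accuracy(fn, items) for name, fn in models.items()}


# Stand-in "models" so the sketch runs without a runtime; swap in real calls.
eval_set = [("2+2=", "4"), ("3*3=", "9"), ("10-7=", "3")]
fake_q4 = lambda p: {"2+2=": "4", "3*3=": "9"}.get(p, "?")
fake_q5 = lambda p: {"2+2=": "4", "3*3=": "9", "10-7=": "3"}.get(p, "?")
print(compare({"Q4_K_M": fake_q4, "Q5_K_M": fake_q5}, eval_set))
```

With a real eval set of a few hundred examples, a gap of a point or two between quants is usually within noise; look for consistent differences before paying for the larger file.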

Methodology

Data checked: 2026-05-25
Sources consulted: llama.cpp quantisation documentation, GGUF specification, community benchmarks from r/LocalLLaMA and Hugging Face model card comparisons
Assumptions: Llama 3.1 / Mistral-class model architectures. VRAM estimates include ~1 GB overhead for KV cache at default context length. Quantisation quality varies between model architectures; some models are more resilient to quantisation than others.
Limitations: CPU vs GPU inference performance varies by runtime and hardware. New quantisation methods (IQ, GGUF type variants) continue to evolve. This guide covers standard llama.cpp quantisation types; other runtimes may use different naming conventions.
Jurisdiction: Global. Quantisation formats and methods are not jurisdiction-specific.

Source list

llama.cpp quantisation documentation — https://github.com/ggerganov/llama.cpp (accessed 2026-05-25)
GGUF specification — https://github.com/ggerganov/gguf (accessed 2026-05-25)
Hugging Face quantisation guide — https://huggingface.co/docs/transformers/en/quantization (accessed 2026-05-25)
Ollama model library — https://ollama.com/library (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28

Change log

2026-05-28: Editorial review against 16-gate checklist. Fixed frontmatter (writtenBy), replaced block-quoted notes with proper Editor’s Note aside cards (3 total), added Methodology section, added Trust Stack, added missing heading IDs, removed outdated “direct source URLs” note from change log.
2026-05-27: Added direct source URLs to all named providers and services. Content unchanged.
2026-05-25: First published. Plain-English quantisation guide with VRAM comparison table and worked examples.