theLLMs

Last checked: 2026-05-25

Scope: Global. Quantisation methods and formats were checked against current llama.cpp and GGUF documentation on 2026-05-25. Exact quality impact varies by model and quantisation method.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

Quantisation explained: why model files have Q4, Q5 and GGUF labels

When you download an open-weight model, you rarely get a single file. You get a list of variants: Q4_K_M, Q5_0, Q8_0, F16. These labels describe how precisely the model’s weights have been stored. Lower precision means smaller files and usually faster inference, at the cost of some quality.

The short answer: quantising to 4–8 bits cuts model file size by roughly 50–75% relative to FP16, with little quality loss for most tasks. Q4 and Q5 quantisations are the practical sweet spot for consumer GPUs. Q8 and FP16 suit quality-critical work when you have the hardware to run them.

If you only remember one thing: start with Q4_K_M for any model. It is the most tested, most compatible and offers the best quality-to-size ratio for the vast majority of users. Only go higher if you have the VRAM and need the marginal quality gain.

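Editor’s Note: Quantisation is not file compression in the zip sense; it is lossy precision reduction. The model’s weights are stored with fewer bits per value. GGUF quants are normally produced after training, without retraining the model, so the model does not learn to compensate for the lost precision; instead, the quantisation method is designed to keep the rounding error small. Quantisation-aware training exists, but it is a separate technique.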

Editor’s Note: A Q4 model is not “4× worse” than FP16. In practice, the quality loss is often too small to measure reliably for chat and text generation tasks. Extraction and classification are slightly more sensitive.

Quick answer {#quick-answer}

| Quantisation | Size vs FP16 | VRAM for 7B model | VRAM for 70B model | Quality impact |
| --- | --- | --- | --- | --- |
| FP16 (full precision) | 100% | ~14 GB | ~140 GB | Reference baseline |
| Q8_0 | ~50% | ~7 GB | ~70 GB | Negligible |
| Q6_K | ~40% | ~5.5 GB | ~55 GB | Minimal |
| Q5_K_M | ~35% | ~5 GB | ~49 GB | Very small |
| Q5_0 | ~33% | ~4.7 GB | ~47 GB | Small |
| Q4_K_M | ~28% | ~4 GB | ~39 GB | Slight |
| Q4_0 | ~25% | ~3.5 GB | ~35 GB | Noticeable on some tasks |
| Q3_K_M | ~22% | ~3 GB | ~30 GB | Degradation visible |
| Q2_K | ~17% | ~2.5 GB | ~24 GB | Significant degradation |

These are approximate. Exact sizes vary by model architecture and quantisation implementation.
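The figures in the table follow from simple arithmetic: an FP16 model needs about 2 bytes per parameter, and each quantisation scales that by its size ratio. The sketch below reproduces that calculation. The ratios are copied from the table, the function name is illustrative, and the result covers weights only, so real memory use will be somewhat higher.

```python
# Size ratios relative to FP16, copied from the table above (approximate).
SIZE_VS_FP16 = {
    "FP16": 1.00, "Q8_0": 0.50, "Q6_K": 0.40, "Q5_K_M": 0.35, "Q5_0": 0.33,
    "Q4_K_M": 0.28, "Q4_0": 0.25, "Q3_K_M": 0.22, "Q2_K": 0.17,
}


def estimate_weights_gb(params_billions: float, quant: str) -> float:
    """Rough size of the weights in GB (weights only, no KV cache or buffers)."""
    fp16_gb = params_billions * 2.0  # 1e9 parameters * 2 bytes = 2 GB per billion
    return fp16_gb * SIZE_VS_FP16[quant]


if __name__ == "__main__":
    for quant in ("FP16", "Q8_0", "Q4_K_M"):
        print(f"7B {quant}: ~{estimate_weights_gb(7, quant):.1f} GB")
```

Swapping in 70 for the parameter count gives the 70B column in the same way.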

What the labels mean {#what-the-labels-mean}

Number: bits per weight {#number-bits-per-weight}

  • Q2: 2 bits per weight. Maximum compression, significant quality loss.
  • Q3: 3 bits per weight. Aggressive compression, visible degradation on complex tasks.
  • Q4: 4 bits per weight. The practical sweet spot. Good quality, good speed.
  • Q5: 5 bits per weight. Higher quality, slightly larger files. Useful for sensitive tasks.
  • Q6: 6 bits per weight. Near-reference quality.
  • Q8: 8 bits per weight. Essentially reference quality at half the size.
  • FP16: 16-bit floating point (half precision). Treated as the full-quality reference in this article.
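To make "bits per weight" concrete, here is a minimal sketch of block quantisation: a group of weights shares one scale factor, and each weight is stored as a small integer. This is a simplified illustration, not the actual layout of any llama.cpp quant type. The block size of 32 and the 16-bit scale are assumptions chosen for the example, and the function names are hypothetical.

```python
import random

BLOCK_SIZE = 32  # weights sharing one scale (assumption for this sketch)


def quantise_block(weights, bits=4):
    """Store each weight as a signed integer of `bits` bits plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1   # e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quants = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return scale, quants


def dequantise_block(scale, quants):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quants]


def effective_bits_per_weight(bits, block_size=BLOCK_SIZE, scale_bits=16):
    """The shared scale adds overhead, so the true cost is above `bits`."""
    return (bits * block_size + scale_bits) / block_size


if __name__ == "__main__":
    random.seed(0)
    block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
    for bits in (8, 4, 2):
        scale, quants = quantise_block(block, bits)
        restored = dequantise_block(scale, quants)
        max_err = max(abs(a - b) for a, b in zip(block, restored))
        print(f"{bits}-bit: max abs error {max_err:.5f}, "
              f"effective bits/weight {effective_bits_per_weight(bits):.2f}")
```

Two things show up when this runs: the rounding error grows as the bit count drops, and the per-block scale means the real storage cost is a little higher than the number in the label suggests.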

Suffix: method variant {#suffix-method-variant}

  • K_M: “Medium” K-quant, the usual recommendation. Balances quality across layers.
  • K_S: “Small” K-quant. Smaller but lower quality than K_M.
  • _0: Baseline quantisation method, less refined than the K-variants.

The widely recommended default is Q4_K_M: 4-bit, medium K-quant. Popular GGUF distributors (TheBloke, Bartowski, MaziyarPanahi) commonly list it as the first or most-downloaded quant.
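The naming scheme above is regular enough to parse mechanically. The following sketch splits a label into its nominal bit count, method family and size suffix. It only covers the label shapes mentioned on this page, and the function name and returned field names are hypothetical.

```python
import re

# Matches labels such as Q4_K_M, Q6_K, Q5_0 (single-digit bit counts only).
LABEL_RE = re.compile(r"^Q(?P<bits>\d)_(?P<rest>.+)$")


def parse_quant_label(label: str) -> dict:
    """Split a quant label into nominal bits, method family and size suffix."""
    label = label.upper()
    if label in ("F16", "FP16"):
        return {"bits": 16, "family": "float", "size": None}
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised label: {label}")
    bits = int(match.group("bits"))
    rest = match.group("rest")
    if rest.startswith("K"):
        # "K", "K_S", "K_M": K-quant family with an optional size suffix.
        _, _, size = rest.partition("_")
        return {"bits": bits, "family": "k-quant", "size": size or None}
    # Anything else on this page is the baseline "_0" style.
    return {"bits": bits, "family": "baseline", "size": None}


if __name__ == "__main__":
    for name in ("Q4_K_M", "Q6_K", "Q5_0", "F16"):
        print(name, parse_quant_label(name))
```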

GGUF: the container format {#gguf-the-container-format}

GGUF is the file format that stores the model weights along with metadata (tokenizer config, model architecture, hyperparameters). It replaced the older GGML format. A .gguf file is usually a quantised model, although the format can also hold unquantised F16 weights, and it is ready to load in llama.cpp, Ollama or compatible runtimes.
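You can check that a download really is a GGUF file by reading its first few bytes. The sketch below assumes the header layout described in the GGUF specification for versions 2 and 3: the magic bytes "GGUF", a 32-bit version, then 64-bit tensor and metadata counts, all little-endian. Treat that layout as an assumption to verify against the current spec; the function name is hypothetical.

```python
import struct
import sys


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size fields at the start of a GGUF file."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        rest = f.read(20)  # uint32 version + two uint64 counts
    if len(rest) != 20:
        raise ValueError("file too short to contain a GGUF header")
    # Assumed little-endian layout: version, tensor count, metadata key/value count.
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", rest)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    print(read_gguf_header(sys.argv[1]))
```

The metadata entries that follow the header carry the architecture and tokenizer details; parsing them is best left to the runtime or an existing GGUF library.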

How quantisation affects real tasks {#how-quantisation-affects-real-tasks}

The impact of quantisation depends heavily on the task:

  • Chat and creative writing: Q4_K_M output is usually hard to tell apart from FP16 in side-by-side comparisons. The model’s training and prompt quality affect output quality far more than quantisation level.
  • Extraction and structured output: Q4 is fine for most fields, but if you need exact number extraction or specific-token outputs, Q5 or Q8 may reduce errors.
  • Classification and routing: Q4 works well. The marginal improvement from Q8 is typically under one percentage point of accuracy.
  • Code generation: Q4_K_M is standard. Some developers report better success with Q5 or Q8 for complex or multi-file code generation.
  • Reasoning and maths: Q5_K_M is a safer starting point. The precision loss at Q4 can compound across multi-step reasoning chains.

Worked example: 7B model on an 8GB GPU {#worked-example-7b-model-on-an-8gb-gpu}

With a GPU that has 8 GB of VRAM (for example an RTX 3070 or the 8 GB RTX 4060 Ti):

  • Q4_K_M: fits easily (~4 GB). Leaves room for context and batch processing. Runs at 30–50 tokens/second.
  • Q8_0: may fit (~7 GB) but leaves almost no headroom. Context length is limited. Speed drops as memory fills.
  • FP16: does not fit (~14 GB). Cannot load at all.

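Of these three options, Q4_K_M is the only practical one on this hardware, and it works well. Going by the table above, Q5_K_M (~5 GB) also fits if you want a little more quality and can spare the headroom.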
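"Room for context" can be estimated too. The KV cache grows with every token of context, and its per-token cost depends on the model's shape. The sketch below assumes a Mistral-7B-like shape (32 layers, 8 key/value heads, head dimension 128) with an FP16 cache, plus a hypothetical 0.5 GB reserve for runtime buffers. These are illustrative assumptions, and the result is an upper bound: the model's own maximum context length still applies.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_context_tokens(vram_gb, weights_gb, bytes_per_token, reserve_gb=0.5):
    """Tokens of context that fit in the VRAM left after weights and a reserve."""
    free_bytes = (vram_gb - weights_gb - reserve_gb) * 1e9
    return max(0, int(free_bytes // bytes_per_token))


# Assumed model shape for illustration (Mistral-7B-like, FP16 KV cache).
PER_TOKEN = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    # Weight sizes taken from the worked example above.
    for label, weights_gb in (("Q4_K_M", 4.0), ("Q8_0", 7.0), ("FP16", 14.0)):
        tokens = max_context_tokens(8.0, weights_gb, PER_TOKEN)
        print(f"{label}: room for about {tokens} tokens of KV cache")
```

The gap between the Q4_K_M and Q8_0 results is the practical meaning of "leaves almost no headroom".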

Worked example: 70B model on a workstation {#worked-example-70b-model-on-a-workstation}

With a dual-GPU workstation (e.g., 2× RTX 3090, 48 GB total):

  • Q3_K_M: fits (~30 GB). Runs but quality is degraded.
  • Q4_K_M: may fit (~39 GB) if model layers are split across GPUs. Good quality, moderate speed.
  • Q5_K_M: too large (~49 GB). Does not fit.
  • FP16: impossible (~140 GB).

Q4_K_M is borderline but workable. For a more comfortable fit, some users drop to Q3_K_M, or choose Q4_K_S, which is slightly smaller than Q4_K_M, as a middle ground.
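Splitting a model across GPUs usually means assigning whole layers to each card. The sketch below divides layers in proportion to each GPU's VRAM and estimates the weight load per card. It assumes 80 transformer layers (as in Llama-style 70B models) and that weights are spread evenly across layers, which ignores embeddings and output layers; the function names are hypothetical and this is not how any particular runtime implements its split.

```python
def split_layers(n_layers, vram_per_gpu_gb):
    """Assign whole layers to GPUs in proportion to their VRAM."""
    total = sum(vram_per_gpu_gb)
    shares = [n_layers * v / total for v in vram_per_gpu_gb]
    counts = [int(s) for s in shares]
    # Hand any leftover layers to the GPUs with the largest fractional share.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def per_gpu_weights_gb(model_gb, n_layers, counts):
    """Approximate weight size on each GPU, assuming equal-sized layers."""
    return [model_gb * c / n_layers for c in counts]


if __name__ == "__main__":
    gpus = [24.0, 24.0]  # 2x RTX 3090 from the example above
    counts = split_layers(80, gpus)
    loads = per_gpu_weights_gb(39.0, 80, counts)  # Q4_K_M, ~39 GB
    for vram, n, load in zip(gpus, counts, loads):
        print(f"{n} layers, ~{load:.1f} GB of weights, "
              f"~{vram - load:.1f} GB left on a {vram:.0f} GB card")
```

Whatever is left on each card has to cover the KV cache and runtime buffers, which is why Q4_K_M counts as borderline here.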

What quantisation does not change {#what-quantisation-does-not-change}

  • The model’s training data. Quantisation does not add or remove knowledge.
  • The model’s architecture. Attention mechanisms, layer counts and vocabulary stay the same.
  • The model’s safety alignment. A Q4 quant of an unsafe model is just as unsafe as the FP16 weights it was made from.

Decision tree {#decision-tree}

  1. How much VRAM do you have? See the table above. Pick the highest-precision quant (most bits per weight) that fits with headroom for your context length.
  2. Is this a quality-sensitive task? (Medical, financial, legal extraction → prefer Q5 or higher. Chat, content, classification → Q4 is fine.)
  3. Are you using CPU or GPU? CPU inference is slower, but system RAM is usually larger than VRAM, so it can hold larger quantised models. On GPU, Q4–Q5 usually gives the best balance of fit and speed.
  4. Are you batch-processing? Lower-bit quantisation (Q4) improves batch throughput on GPU because the smaller weights leave more VRAM free for larger batches.
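Steps 1 and 2 of this list can be sketched as a small lookup. The size ratios come from the Quick answer table; the 1.5 GB default headroom and the choice of Q5_K_M as the floor for quality-sensitive work are assumptions for illustration, and the function names are hypothetical.

```python
# (label, size ratio vs FP16), ordered from highest to lowest precision.
QUANTS = [
    ("FP16", 1.00), ("Q8_0", 0.50), ("Q6_K", 0.40), ("Q5_K_M", 0.35),
    ("Q4_K_M", 0.28), ("Q3_K_M", 0.22), ("Q2_K", 0.17),
]
QUALITY_FLOOR = ("FP16", "Q8_0", "Q6_K", "Q5_K_M")  # "Q5 or higher"


def pick_quant(vram_gb, fp16_size_gb, headroom_gb=1.5):
    """Step 1: highest-precision quant whose weights fit with headroom to spare."""
    budget = vram_gb - headroom_gb
    for label, ratio in QUANTS:
        if fp16_size_gb * ratio <= budget:
            return label
    return None  # nothing fits; consider a smaller model


def meets_quality_floor(label, quality_sensitive):
    """Step 2: quality-sensitive tasks should stay at Q5 or higher."""
    if label is None:
        return False
    return label in QUALITY_FLOOR if quality_sensitive else True


if __name__ == "__main__":
    for vram, fp16_gb in ((8.0, 14.0), (48.0, 140.0)):
        choice = pick_quant(vram, fp16_gb)
        print(f"{vram:.0f} GB VRAM, {fp16_gb:.0f} GB FP16 model -> {choice}, "
              f"OK for sensitive tasks: {meets_quality_floor(choice, True)}")
```

If the picked quant falls below the floor for a sensitive task, the realistic options are more VRAM or a smaller model at higher precision.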

What this page cannot tell you {#what-this-page-cannot-tell-you}

This page cannot tell you the exact quality difference between Q4_K_M and Q5_K_M for your specific model and task. The only reliable way to know is to run both versions on your eval set and compare. For most teams, Q4_K_M is the starting point, and moving to Q5 or Q8 is a marginal optimisation that matters only for precision-critical workloads.
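A comparison like that does not need much tooling. The sketch below scores the outputs of two quant variants against the same reference answers using exact match. The data shown is invented placeholder data to make the example runnable, not a measured result, and exact match is only one possible metric; the function names are hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal the reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("eval set is empty")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def compare_quants(outputs_by_quant, references):
    """Score every quant variant on the same eval set."""
    return {
        quant: exact_match_accuracy(preds, references)
        for quant, preds in outputs_by_quant.items()
    }


if __name__ == "__main__":
    # Invented placeholder data: replace with your own eval set and model outputs.
    references = ["42", "2024-01-15", "EUR"]
    outputs = {
        "Q4_K_M": ["42", "2024-01-15", "USD"],
        "Q5_K_M": ["42", "2024-01-15 ", "EUR"],
    }
    for quant, score in compare_quants(outputs, references).items():
        print(f"{quant}: {score:.0%} exact match")
```

Use enough examples that a difference of a few items is not just noise, and keep prompts and sampling settings identical across the variants.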

Methodology and sources {#methodology-and-sources}

Check date: 2026-05-25

What was checked: llama.cpp quantisation documentation, GGUF specification, community benchmarks from r/LocalLLaMA and Hugging Face model card comparisons.

Worked-example assumptions: Llama 3.1 / Mistral-class model architectures. VRAM estimates include ~1 GB overhead for KV cache at default context length.

Assumptions and limits:

  • Quantisation quality varies between model architectures.
  • Some models are more resilient to quantisation than others.
  • CPU vs GPU inference performance varies by runtime and hardware.
  • New quantisation methods (such as the IQ-series quants and other new GGUF tensor types) continue to evolve.

Source list {#source-list}

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.