Quantization reduces the precision of model weights from their original format (usually 16-bit floats) to lower bit-widths (8-bit, 4-bit, or lower). This makes models smaller and often faster, at the cost of some quality degradation.

It's the single most important technique for running large models on consumer hardware.

Why Quantization Matters

A 70B parameter model at full precision (FP16) needs ~140GB just for weights. That's more than any consumer GPU. At 4-bit quantization, the same model needs ~35GB — suddenly within reach of high-end consumer hardware.

Precision    | Bits per Weight | 70B Model Size | Typical Use
FP32         | 32              | ~280GB         | Training only
FP16 / BF16  | 16              | ~140GB         | Reference inference
INT8         | 8               | ~70GB          | Server inference
INT4 / Q4    | 4-5             | ~35-40GB       | Consumer GPUs
Q2 / Q3      | 2-3             | ~20-25GB       | Extreme compression
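
These sizes follow from simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight to get bytes. A back-of-the-envelope sketch (quantized formats also store scale factors, a few percent of overhead that is ignored here):

```python
# Rough weight-memory estimate: params * bits_per_weight / 8 bytes.
# Scale-factor overhead (a few percent for 4-bit formats) is ignored.
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP16", 16), ("INT8", 8), ("Q4", 4.5), ("Q2/Q3", 2.5)]:
    print(f"{name:>6}: ~{weight_memory_gb(70e9, bits):.0f} GB for a 70B model")
```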

How Quantization Works

Original model weights are continuous floating-point numbers. Quantization maps them to a smaller set of discrete values.

Original (FP16):    -0.8372,  0.1294,  0.5831, -0.0023, ...
        │
        ▼
Quantized (4-bit):      -8,        1,       6,       0, ...   (scaled integers)
  + Scale factor:   0.1047   (to reconstruct approximate values)

Reconstructed:      -0.8376,  0.1047,  0.6282,  0.0000, ...
                    └── Small errors, but much smaller memory footprint
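
A minimal sketch of the idea, using the numbers from the illustration above: symmetric round-to-nearest quantization, where a single scale factor maps floats to small integers and back. Real quantizers work per block and round more carefully, but the mechanics are the same:

```python
import numpy as np

weights = np.array([-0.8372, 0.1294, 0.5831, -0.0023], dtype=np.float32)

# 4-bit signed integers cover [-8, 7]; choose the scale so the largest-magnitude
# weight maps onto the edge of that range.
scale = np.abs(weights).max() / 8
quantized = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)  # [-8, 1, 6, 0]
reconstructed = quantized * scale  # close to the originals, with small rounding error

print(quantized, float(scale), reconstructed)
```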

Key Quantization Concepts

Bit-width
How many bits represent each weight. Lower = smaller but less precise. Common: 8, 4, 3, 2.
Scale and Zero-point
Parameters that map quantized integers back to approximate float values. Usually stored per block of weights.
Block Size
How many weights share the same scale factor. Smaller blocks = more accurate but more overhead.
Mixed Precision
Different parts of the model quantized to different precisions. Attention layers often kept at higher precision.
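
To make scale, zero-point, and block size concrete, here is a toy sketch of block-wise asymmetric quantization. It is a simplification (real formats such as the K-quants use more elaborate layouts), but the block-size tradeoff shows up even in this version:

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int = 4, block: int = 32) -> np.ndarray:
    """Quantize then immediately dequantize, per block, to measure the error."""
    levels = 2**bits - 1
    out = np.empty_like(w)
    for i in range(0, len(w), block):
        b = w[i:i + block]
        scale = (b.max() - b.min()) / levels or 1.0  # guard against constant blocks
        zero_point = b.min()
        q = np.round((b - zero_point) / scale)       # integers in [0, levels]
        out[i:i + block] = q * scale + zero_point    # map back to approximate floats
    return out

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
for block in (16, 64, 256):
    err = np.abs(fake_quantize(w, block=block) - w).mean()
    print(f"block={block:4d}  mean abs error={err:.5f}")  # smaller blocks, smaller error
```

Smaller blocks track the local range of the weights more tightly, at the cost of storing more scale and zero-point values, which is exactly the overhead mentioned above.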

Quantization Methods

GGUF / llama.cpp Quants

  • Q4_K_M, Q5_K_M, Q6_K, etc.
  • K-quants use varying block sizes
  • _M = medium quality/size tradeoff
  • Best for CPU and Apple Silicon
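
The quant level is baked into the GGUF file, so nothing special is needed at load time. A usage sketch with the llama-cpp-python bindings (the file path is hypothetical; adjust n_ctx and other settings to your hardware):

```python
from llama_cpp import Llama

# Hypothetical local file; the quant level (Q4_K_M) is part of the filename by convention.
llm = Llama(model_path="./llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=4096)

result = llm("Explain quantization in one sentence.", max_tokens=64)
print(result["choices"][0]["text"])
```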

GPTQ

  • Calibration-based quantization
  • Optimized for NVIDIA GPUs
  • Good quality at 4-bit
  • Requires pre-quantized models
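
Because GPTQ needs a calibration pass, the usual workflow is to download a checkpoint someone has already quantized rather than quantize it yourself. A hedged sketch using Hugging Face transformers (the repository name is hypothetical, and the optional GPTQ dependencies such as optimum and a GPTQ kernel package must be installed for loading to work):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "someuser/Llama-3-8B-Instruct-GPTQ"  # hypothetical pre-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo)
# The quantization config stored in the repo tells transformers how to load the weights.
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")

inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```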

AWQ

  • Activation-aware quantization
  • Protects important weights
  • Often better quality than GPTQ
  • NVIDIA GPU focused

GGML / Original

  • Legacy format (use GGUF instead)
  • Q4_0, Q4_1, Q5_0, Q5_1
  • Simpler but less efficient
  • Still works but outdated

Quality vs Size Tradeoffs

Quantization always loses some information. The question is whether it matters for your use case.

Quant Level | Size Reduction | Quality Impact           | Recommendation
Q8          | ~50%           | Negligible               | Use if it fits; minimal tradeoff
Q6_K        | ~62%           | Very small               | Excellent balance for most uses
Q5_K_M      | ~69%           | Small                    | Good general-purpose choice
Q4_K_M      | ~75%           | Noticeable on some tasks | Sweet spot for consumer hardware
Q3_K_M      | ~81%           | Moderate degradation     | When Q4 doesn't fit
Q2_K        | ~88%           | Significant degradation  | Last resort only

Quality Degradation is Task-Dependent

Quantization hurts some tasks more than others. Reasoning and math tend to degrade faster than creative writing or summarization. If your use case requires precise reasoning, stay at Q5 or higher. For casual chat, Q4 is usually fine.

Speed Benefits

Beyond fitting models in less memory, quantization often increases inference speed. Token-by-token generation is usually memory-bandwidth-bound: every new token requires streaming the model's weights from memory, so a model stored in a quarter of the bytes needs roughly a quarter of the memory traffic per token. A 4-bit model therefore often generates tokens faster than its FP16 equivalent, even ignoring the "fits in memory" benefit.
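
A rough upper bound follows directly: decode speed is at most memory bandwidth divided by the bytes of weights read per token. The figures below are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope decode-speed ceiling for a bandwidth-bound model.
bandwidth_gb_s = 1000   # assumed: roughly a high-end consumer GPU (~1 TB/s)
model_params = 8e9      # assumed: an 8B-parameter model

for name, bits in [("FP16", 16), ("Q8", 8), ("Q4_K_M", 4.5)]:
    gb_per_token = model_params * bits / 8 / 1e9   # weight bytes streamed per token
    print(f"{name:>7}: <= ~{bandwidth_gb_s / gb_per_token:.0f} tokens/s")
```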

Practical Guidance

The Q4_K_M Sweet Spot

For most local LLM use cases, Q4_K_M provides the best balance. It's 75% smaller than FP16, fast, and quality is acceptable for most tasks. Start here unless you have a specific reason not to.

When to Use Higher Precision

  • Tasks that depend on precise reasoning or math: stay at Q5 or higher
  • When a Q6_K or Q8 file still fits in memory, since the quality tradeoff is minimal

When to Use Lower Precision

  • When the model doesn't fit at Q4: drop to Q3_K_M, or Q2_K as a last resort
  • Casual chat, creative writing, and summarization, which tolerate quantization well

Common Mistakes

Mistake: Comparing Quantized Models Unfairly

A 70B model at Q4 isn't necessarily better than a 7B model at FP16. Quantization degrades quality, and larger doesn't always mean better after heavy compression. Compare models at similar effective quality levels.

Mistake: Ignoring KV Cache Precision

Quantizing the model weights doesn't quantize the KV cache by default. For long contexts, the cache can still dominate memory. Some inference engines support KV cache quantization separately.
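
For a sense of scale, the cache grows linearly with context length (and batch size), regardless of how the weights are quantized. A sketch with illustrative dimensions roughly matching a 7B-class model (32 layers, 32 KV heads, head dimension 128; substitute your model's actual numbers):

```python
# KV cache size: two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim].
def kv_cache_gb(seq_len: int, n_layers: int = 32, n_kv_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

for ctx in (4_096, 32_768, 128_000):
    fp16 = kv_cache_gb(ctx)
    int8 = kv_cache_gb(ctx, bytes_per_elem=1)
    print(f"{ctx:>7} tokens: ~{fp16:.1f} GB FP16 cache, ~{int8:.1f} GB with an 8-bit cache")
```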