Quantization
Trading numerical precision for smaller models and faster inference
Quantization reduces the precision of model weights from their original format (usually 16-bit floats) to lower bit-widths (8-bit, 4-bit, or lower). This makes models smaller and often faster, at the cost of some quality degradation.
It's the single most important technique for running large models on consumer hardware.
Why Quantization Matters
A 70B-parameter model at 16-bit precision (FP16) needs ~140GB just for weights. That's more than any consumer GPU offers. At 4-bit quantization, the same model needs ~35GB, suddenly within reach of high-end consumer hardware.
| Precision | Bits per Weight | 70B Model Size | Typical Use |
|---|---|---|---|
| FP32 | 32 | ~280GB | Training only |
| FP16 / BF16 | 16 | ~140GB | Reference inference |
| INT8 | 8 | ~70GB | Server inference |
| INT4 / Q4 | 4-5 | ~35-40GB | Consumer GPUs |
| Q2 / Q3 | 2-3 | ~20-25GB | Extreme compression |
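The sizes in this table follow directly from parameter count times bits per weight. As a rough check (a sketch that ignores per-block scale overhead and activation/KV-cache memory; the 4.5 bits-per-weight figure is an assumed average for Q4-style formats):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameters * bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B parameters at a few common precisions
for label, bits in [("FP16", 16), ("INT8", 8), ("Q4 (~4.5 bpw)", 4.5)]:
    print(f"{label:>14}: {weight_size_gb(70e9, bits):5.0f} GB")
# FP16: 140 GB, INT8: 70 GB, Q4: ~39 GB
```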
How Quantization Works
Model weights are stored as floating-point numbers that span an effectively continuous range. Quantization maps them onto a much smaller set of discrete values; the concepts below describe how that mapping is parameterized, and a short code sketch follows the list.
Key Quantization Concepts
- Bit-width
- How many bits represent each weight. Lower = smaller but less precise. Common: 8, 4, 3, 2.
- Scale and Zero-point
- Parameters that map quantized integers back to approximate float values. Usually stored per block of weights.
- Block Size
- How many weights share the same scale factor. Smaller blocks = more accurate but more overhead.
- Mixed Precision
- Different parts of the model quantized to different precisions. Attention layers often kept at higher precision.
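To make scale, zero-point, and block size concrete, here is a minimal NumPy sketch of blockwise asymmetric 4-bit quantization. It is illustrative only: the block size of 32 and the simple round-to-nearest mapping are assumptions, not the exact layout used by GGUF, GPTQ, or AWQ.

```python
import numpy as np

def quantize_block(w: np.ndarray, bits: int = 4):
    """Asymmetric round-to-nearest quantization of one block of weights."""
    qmax = 2**bits - 1                       # integer levels 0..15 for 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax or 1.0    # float step between adjacent levels
    zero_point = -w_min / scale              # offset so that w_min maps to 0
    q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_block(q, scale, zero_point):
    """Map stored integers back to approximate float weights."""
    return (q.astype(np.float32) - zero_point) * scale

# One 32-weight block: every weight in the block shares one scale/zero-point pair.
block = np.random.randn(32).astype(np.float32)
q, s, z = quantize_block(block)
error = np.abs(block - dequantize_block(q, s, z)).max()
print(f"max reconstruction error: {error:.4f}")
```

Shrinking the block size lets each scale cover a narrower range of values (lower error), but multiplies the number of scale/zero-point pairs that must be stored, which is the overhead mentioned above.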
Quantization Methods
GGUF / llama.cpp Quants
- Q4_K_M, Q5_K_M, Q6_K, etc.
- K-quants use varying block sizes
- _M = medium quality/size tradeoff
- Best for CPU and Apple Silicon
GPTQ
- Calibration-based quantization
- Optimized for NVIDIA GPUs
- Good quality at 4-bit
- Models must be quantized ahead of time (pre-quantized checkpoints are widely available)
AWQ
- Activation-aware quantization
- Protects important weights
- Often better quality than GPTQ
- NVIDIA GPU focused
GGML / Original
- Legacy format (use GGUF instead)
- Q4_0, Q4_1, Q5_0, Q5_1
- Simpler but less efficient
- Still works but outdated
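As a usage sketch, loading a pre-quantized GGUF file with the llama-cpp-python bindings looks roughly like this. The file path is a placeholder, and the context and offload settings are assumptions to adjust for your hardware:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical file name; any Q4_K_M GGUF from a model repo works the same way.
llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers if a GPU backend was compiled in
)

out = llm("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```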
Quality vs Size Tradeoffs
Quantization always loses some information. The question is whether it matters for your use case.
| Quant Level | Size Reduction | Quality Impact | Recommendation |
|---|---|---|---|
| Q8 | ~50% | Negligible | Use if it fits; minimal tradeoff |
| Q6_K | ~62% | Very small | Excellent balance for most uses |
| Q5_K_M | ~69% | Small | Good general-purpose choice |
| Q4_K_M | ~75% | Noticeable on some tasks | Sweet spot for consumer hardware |
| Q3_K_M | ~81% | Moderate degradation | When Q4 doesn't fit |
| Q2_K | ~88% | Significant degradation | Last resort only |
Quality Degradation is Task-Dependent
Quantization hurts some tasks more than others. Reasoning and math tend to degrade faster than creative writing or summarization. If your use case requires precise reasoning, stay at Q5 or higher. For casual chat, Q4 is usually fine.
Speed Benefits
Beyond fitting models in less memory, quantization often increases inference speed:
- Less memory bandwidth needed: Smaller weights = faster loading from VRAM. This is the main bottleneck during token generation (decode phase).
- Better cache utilization: More weights fit in GPU cache hierarchies.
- Specialized kernels: Many frameworks have optimized routines for quantized operations.
A 4-bit model often generates tokens faster than its FP16 equivalent, even ignoring the "fits in memory" benefit.
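A back-of-the-envelope estimate shows why: single-stream decode speed is bounded by memory bandwidth divided by the bytes read per generated token, which is roughly the size of the weights. This sketch ignores KV cache traffic, and the 1000 GB/s bandwidth figure is an assumed high-end consumer GPU value:

```python
def decode_tokens_per_sec(weight_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: every weight is read once per generated token."""
    return bandwidth_gb_s / weight_size_gb

bandwidth = 1000  # GB/s, assumed for illustration
print(decode_tokens_per_sec(14.0, bandwidth))  # 7B model at FP16 -> ~71 tok/s
print(decode_tokens_per_sec(4.0, bandwidth))   # same model at Q4 -> ~250 tok/s
```

Real throughput lands below these bounds, but the ratio between the FP16 and Q4 figures is why quantized models feel faster even when both fit in memory.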
Practical Guidance
The Q4_K_M Sweet Spot
For most local LLM use cases, Q4_K_M provides the best balance. It's 75% smaller than FP16, fast, and quality is acceptable for most tasks. Start here unless you have a specific reason not to.
When to Use Higher Precision
- Tasks requiring precise reasoning or math
- When you have the VRAM headroom
- Benchmarking or comparing model capabilities
- Professional/production use where quality matters more than cost
When to Use Lower Precision
- VRAM-constrained hardware
- Casual use, creative writing, brainstorming
- Experimentation and testing
- Running larger models that otherwise wouldn't fit
Common Mistakes
Mistake: Comparing Quantized Models Unfairly
A 70B model at Q4 isn't necessarily better than a 7B model at FP16. Quantization degrades quality, and larger doesn't always mean better after heavy compression. Compare models at similar effective quality levels.
Mistake: Ignoring KV Cache Precision
Quantizing the model weights doesn't quantize the KV cache by default. For long contexts, the cache can still dominate memory. Some inference engines support KV cache quantization separately.
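To gauge when the cache dominates, the FP16 KV cache needs 2 (K and V) x layers x KV heads x head dimension x context length x 2 bytes. The model shape below is illustrative only (a 7B-class model without grouped-query attention); check your model's config for the real values:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size: K and V tensors for every layer across the whole context."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
print(kv_cache_gb(32, 32, 128, 4096))    # ~2.1 GB at FP16
print(kv_cache_gb(32, 32, 128, 32768))   # ~17 GB at FP16: dwarfs a Q4 7B model
```

Quantizing the cache to 8-bit roughly halves these numbers where the inference engine supports it.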