Quantization
Trading numerical precision for smaller models and faster inference
Quantization reduces the precision of model weights from their original format (usually 16-bit floats) to lower bit-widths (8-bit, 4-bit, or lower). This makes models smaller and often faster, at the cost of some quality degradation.
It's the single most important technique for running large models on consumer hardware.
Why Quantization Matters
A 70B-parameter model at 16-bit precision (FP16) needs ~140GB just for weights. That's more than any consumer GPU offers. At 4-bit quantization, the same model needs ~35GB, suddenly within reach of high-end consumer hardware.
| Precision | Bits per Weight | 70B Model Size | Typical Use |
|---|---|---|---|
| FP32 | 32 | ~280GB | Training only |
| FP16 / BF16 | 16 | ~140GB | Reference inference |
| INT8 | 8 | ~70GB | Server inference |
| INT4 / Q4 | 4-5 | ~35-40GB | Consumer GPUs |
| Q2 / Q3 | 2-3 | ~20-25GB | Extreme compression |
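The sizes in this table follow directly from parameter count times bits per weight. As a rough check (a sketch that ignores per-block scale overhead and activation/KV-cache memory; the 4.5 bits-per-weight figure is an assumed average for Q4-style formats):

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameters * bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9

# 70B parameters at a few common precisions
for label, bits in [("FP16", 16), ("INT8", 8), ("Q4 (~4.5 bpw)", 4.5)]:
    print(f"{label:>14}: {weight_size_gb(70e9, bits):5.0f} GB")
# FP16: 140 GB, INT8: 70 GB, Q4: ~39 GB
```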
How Quantization Works
Model weights are stored as floating-point numbers that span an effectively continuous range. Quantization maps them onto a much smaller set of discrete values; the concepts below describe how that mapping is parameterized, and a short code sketch follows the list.
Key Quantization Concepts
- Bit-width
- How many bits represent each weight. Lower = smaller but less precise. Common: 8, 4, 3, 2.
- Scale and Zero-point
- Parameters that map quantized integers back to approximate float values. Usually stored per block of weights.
- Block Size
- How many weights share the same scale factor. Smaller blocks = more accurate but more overhead.
- Mixed Precision
- Different parts of the model quantized to different precisions. Attention layers often kept at higher precision.
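To make scale, zero-point, and block size concrete, here is a minimal NumPy sketch of blockwise asymmetric 4-bit quantization. It is illustrative only: the block size of 32 and the simple round-to-nearest mapping are assumptions, not the exact layout used by GGUF, GPTQ, or AWQ.

```python
import numpy as np

def quantize_block(w: np.ndarray, bits: int = 4):
    """Asymmetric round-to-nearest quantization of one block of weights."""
    qmax = 2**bits - 1                       # integer levels 0..15 for 4-bit
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax or 1.0    # float step between adjacent levels
    zero_point = -w_min / scale              # offset so that w_min maps to 0
    q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_block(q, scale, zero_point):
    """Map stored integers back to approximate float weights."""
    return (q.astype(np.float32) - zero_point) * scale

# One 32-weight block: every weight in the block shares one scale/zero-point pair.
block = np.random.randn(32).astype(np.float32)
q, s, z = quantize_block(block)
error = np.abs(block - dequantize_block(q, s, z)).max()
print(f"max reconstruction error: {error:.4f}")
```

Shrinking the block size lets each scale cover a narrower range of values (lower error), but multiplies the number of scale/zero-point pairs that must be stored, which is the overhead mentioned above.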
Quantization Methods
GGUF / llama.cpp Quants
- Q4_K_M, Q5_K_M, Q6_K, etc.
- K-quants use varying block sizes
- _M = medium quality/size tradeoff
- Best for CPU and Apple Silicon
GPTQ
- Calibration-based quantization
- Optimized for NVIDIA GPUs
- Good quality at 4-bit
- Models must be quantized ahead of time (pre-quantized checkpoints are widely available)
AWQ
- Activation-aware quantization
- Protects important weights
- Often better quality than GPTQ
- NVIDIA GPU focused
GGML / Original
- Legacy format (use GGUF instead)
- Q4_0, Q4_1, Q5_0, Q5_1
- Simpler but less efficient
- Still works but outdated
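As a usage sketch, loading a pre-quantized GGUF file with the llama-cpp-python bindings looks roughly like this. The file path is a placeholder, and the context and offload settings are assumptions to adjust for your hardware:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical file name; any Q4_K_M GGUF from a model repo works the same way.
llm = Llama(
    model_path="./models/example-7b.Q4_K_M.gguf",
    n_ctx=4096,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers if a GPU backend was compiled in
)

out = llm("Q: What does Q4_K_M mean?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```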
Quality vs Size Tradeoffs
Quantization always loses some information. The question is whether it matters for your use case.
| Quant Level | Size Reduction | Quality Impact | Recommendation |
|---|---|---|---|
| Q8 | ~50% | Negligible | Use if it fits; minimal tradeoff |
| Q6_K | ~62% | Very small | Excellent balance for most uses |
| Q5_K_M | ~69% | Small | Good general-purpose choice |
| Q4_K_M | ~75% | Noticeable on some tasks | Sweet spot for consumer hardware |
| Q3_K_M | ~81% | Moderate degradation | When Q4 doesn't fit |
| Q2_K | ~88% | Significant degradation | Last resort only |
Quality Degradation is Task-Dependent
Quantization hurts some tasks more than others. Reasoning and math tend to degrade faster than creative writing or summarization. If your use case requires precise reasoning, stay at Q5 or higher. For casual chat, Q4 is usually fine.
Speed Benefits
Beyond fitting models in less memory, quantization often increases inference speed:
- Less memory bandwidth needed: Smaller weights = faster loading from VRAM. This is the main bottleneck during token generation (decode phase).
- Better cache utilization: More weights fit in GPU cache hierarchies.
- Specialized kernels: Many frameworks have optimized routines for quantized operations.
A 4-bit model often generates tokens faster than its FP16 equivalent, even ignoring the "fits in memory" benefit.
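A back-of-the-envelope estimate shows why: single-stream decode speed is bounded by memory bandwidth divided by the bytes read per generated token, which is roughly the size of the weights. This sketch ignores KV cache traffic, and the 1000 GB/s bandwidth figure is an assumed high-end consumer GPU value:

```python
def decode_tokens_per_sec(weight_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper-bound estimate: every weight is read once per generated token."""
    return bandwidth_gb_s / weight_size_gb

bandwidth = 1000  # GB/s, assumed for illustration
print(decode_tokens_per_sec(14.0, bandwidth))  # 7B model at FP16 -> ~71 tok/s
print(decode_tokens_per_sec(4.0, bandwidth))   # same model at Q4 -> ~250 tok/s
```

Real throughput lands below these bounds, but the ratio between the FP16 and Q4 figures is why quantized models feel faster even when both fit in memory.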
Practical Guidance
The Q4_K_M Sweet Spot
For most local LLM use cases, Q4_K_M provides the best balance. It's 75% smaller than FP16, fast, and quality is acceptable for most tasks. Start here unless you have a specific reason not to.
When to Use Higher Precision
- Tasks requiring precise reasoning or math
- When you have the VRAM headroom
- Benchmarking or comparing model capabilities
- Professional/production use where quality matters more than cost
When to Use Lower Precision
- VRAM-constrained hardware
- Casual use, creative writing, brainstorming
- Experimentation and testing
- Running larger models that otherwise wouldn't fit
Common Mistakes
Mistake: Comparing Quantized Models Unfairly
A 70B model at Q4 isn't necessarily better than a 7B model at FP16. Quantization degrades quality, and larger doesn't always mean better after heavy compression. Compare models at similar effective quality levels.
Mistake: Ignoring KV Cache Precision
Quantizing the model weights doesn't quantize the KV cache by default. For long contexts, the cache can still dominate memory. Some inference engines support KV cache quantization separately.
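To gauge when the cache dominates, the FP16 KV cache needs 2 (K and V) x layers x KV heads x head dimension x context length x 2 bytes. The model shape below is illustrative only (a 7B-class model without grouped-query attention); check your model's config for the real values:

```python
def kv_cache_gb(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size: K and V tensors for every layer across the whole context."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value / 1e9

# Illustrative 7B-class shape: 32 layers, 32 KV heads, head_dim 128.
print(kv_cache_gb(32, 32, 128, 4096))    # ~2.1 GB at FP16
print(kv_cache_gb(32, 32, 128, 32768))   # ~17 GB at FP16: dwarfs a Q4 7B model
```

Quantizing the cache to 8-bit roughly halves these numbers where the inference engine supports it.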