KV Cache
The memory structure that grows during generation
The KV (Key-Value) cache stores intermediate attention computations so they don't need to be recalculated for each new token. It's essential for efficient inference — but it grows with context length and often becomes the limiting factor for how much context you can actually use.
Why KV Cache Exists
When generating token N, the model needs to attend to all previous tokens (1 through N-1). Without caching, every generation step would re-run attention over the entire sequence so far, recomputing keys and values that were already produced for earlier steps, making each step O(N²) in the sequence length instead of O(N).
The cache trades memory for compute: store previous computations to avoid redoing them.
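A minimal single-head sketch of this tradeoff (assuming plain dot-product attention, no batching, no multi-head splitting, no positional encodings): a decode step projects only the newly generated token and concatenates its key and value onto what is already stored.

```python
import torch

def decode_step(x_t, W_q, W_k, W_v, cache):
    """One generation step: project only the new token, reuse cached K/V for the rest."""
    q = x_t @ W_q                                    # (1, d) query for the new token
    k = x_t @ W_k                                    # (1, d) key for the new token
    v = x_t @ W_v                                    # (1, d) value for the new token
    cache["k"] = torch.cat([cache["k"], k], dim=0)   # cache grows by one row per token
    cache["v"] = torch.cat([cache["v"], v], dim=0)
    scores = q @ cache["k"].T / cache["k"].shape[-1] ** 0.5
    weights = torch.softmax(scores, dim=-1)          # attend over every cached position
    return weights @ cache["v"]                      # (1, d) attention output
```

Nothing about earlier tokens is recomputed; the price is that `cache["k"]` and `cache["v"]` keep growing, which is exactly the memory cost the rest of this page is about.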
What's Actually Stored
For each layer and each attention head, the cache stores:
- Keys (K): Projections used to compute attention scores
- Values (V): The information that gets aggregated based on attention
KV Cache Size Calculation
The approximate formula for KV cache memory (batch size 1, standard multi-head attention):

KV Cache = 2 × layers × hidden_dim × context_length × bytes_per_value

The leading 2 covers keys and values; for a GQA model (see below), replace hidden_dim with num_kv_heads × head_dim.
| Model | Layers | Hidden Dim | KV Cache @ 4K ctx (FP16) | KV Cache @ 32K ctx (FP16) |
|---|---|---|---|---|
| Llama 3 8B | 32 | 4096 | ~2 GB | ~16 GB |
| Llama 2 13B | 40 | 5120 | ~3.3 GB | ~26 GB |
| Llama 3 70B | 80 | 8192 | ~10.5 GB | ~84 GB |

These figures assume full multi-head attention. Llama 3 actually ships with GQA (covered below), which cuts its cache to roughly a quarter (8B) or an eighth (70B) of the values shown; the scaling with context length is the same either way.
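The table figures can be reproduced with a short helper. The function name and dimensions below are illustrative, with a num_kv_heads parameter so the same formula also covers GQA:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length, bytes_per_value=2):
    # 2x for keys and values, stored per layer, per KV head, per cached token
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value

# Llama 3 8B-style dimensions: 32 layers, 32 heads of size 128 (hidden_dim 4096)
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)  # ~2.0 GB: full MHA, matches the table
print(kv_cache_bytes(32, 8, 128, 4096) / 2**30)   # ~0.5 GB: GQA with 8 KV heads
```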
The Real Limit
A 70B model at Q4 needs ~35 GB for weights. By the plain multi-head formula above, 32K of FP16 cache would add another ~84 GB; even with GQA, a full 128K context still needs over 40 GB of cache. This is why a 48 GB GPU that holds the weights comfortably still can't run a 70B model at long context: the cache doesn't fit.
KV Cache Quantization
Just like model weights, the KV cache can be quantized to reduce memory:
| KV Precision | Memory per Token | Quality Impact |
|---|---|---|
| FP16 | 2 bytes/value | None (full precision) |
| FP8 | 1 byte/value | Minimal |
| INT8 | 1 byte/value | Small |
| INT4 | 0.5 bytes/value | Noticeable on long context |
Most inference engines support KV cache quantization. It's often more impactful than weight quantization for long-context workloads.
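As one concrete example, vLLM exposes a kv_cache_dtype option for storing the cache in FP8 (parameter name per vLLM's Python API; check your installed version, and note the model ID here is just an illustration):

```python
from vllm import LLM

# Store the KV cache in FP8 instead of FP16, roughly halving its memory footprint
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", kv_cache_dtype="fp8")
```

llama.cpp has analogous --cache-type-k / --cache-type-v options for quantizing the K and V caches separately.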
Context Length Scaling
KV cache grows linearly with context length: doubling the context you actually fill doubles the cache. This creates a direct tradeoff between how much context you use and how much VRAM is left for the weights, activations, and headroom.
Grouped Query Attention (GQA)
Modern models such as Llama 2 70B and all Llama 3 sizes use Grouped Query Attention to reduce KV cache size: instead of every query head having its own key/value head, several query heads share one, so far fewer K/V projections need to be cached. Llama 3 8B, for example, pairs 32 query heads with 8 KV heads, cutting the cache by 4x. Multi-query attention (a single shared KV head) shrinks it even further but costs more quality; GQA is a middle ground: significant cache reduction with minimal quality loss.
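A minimal sketch of the mechanism (dimensions chosen to match Llama 3 8B; real implementations fuse this differently): only the 8 KV heads go into the cache, and each one is reused by a group of 4 query heads at attention time.

```python
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 32, 8, 128, 10
group = n_q_heads // n_kv_heads                   # 4 query heads share each KV head

q = torch.randn(n_q_heads, seq_len, head_dim)
k = torch.randn(n_kv_heads, seq_len, head_dim)    # only these 8 heads are cached
v = torch.randn(n_kv_heads, seq_len, head_dim)

# Broadcast each cached KV head to its group of query heads, then attend as usual
k_exp = k.repeat_interleave(group, dim=0)         # (32, seq_len, head_dim)
v_exp = v.repeat_interleave(group, dim=0)
scores = torch.softmax(q @ k_exp.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = scores @ v_exp                              # same output shape as full MHA
```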
Practical Implications
Memory Planning
When planning VRAM requirements:
Total VRAM = Model Weights + KV Cache + Overhead
Where the KV cache term depends on your actual context usage, not just the model's maximum context.
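A rough worked example, reusing the kv_cache_bytes helper sketched earlier; the overhead figure is a loose assumption, and real engines also reserve scratch buffers:

```python
weights_gb = 35                                    # 70B at Q4, from the section above
kv_gb = kv_cache_bytes(80, 8, 128, 8192) / 2**30   # Llama 3 70B (GQA), 8K actual context
overhead_gb = 2                                    # activations, buffers: a rough guess
print(round(weights_gb + kv_gb + overhead_gb, 1))  # ~39.5 GB total
```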
The "Fits but Slow" Problem
If your model + minimal KV cache fits, but you try to use long context:
- KV cache exceeds VRAM
- Cache spills to system RAM
- Each generation step requires RAM→VRAM transfer
- Performance drops dramatically
Right-Size Your Context
Just because a model supports 128K context doesn't mean you should use it. Shorter context = smaller KV cache = more headroom = faster inference. Use the context length you need, not the maximum available.
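With llama-cpp-python, for example, the n_ctx parameter controls how much context (and therefore how much KV cache) gets allocated; the model file name here is just a placeholder:

```python
from llama_cpp import Llama

# Allocate KV cache for 8K tokens rather than the model's full advertised window
llm = Llama(model_path="llama-3-8b-instruct-q4_k_m.gguf", n_ctx=8192)
```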
Advanced: PagedAttention
vLLM introduced PagedAttention, which manages KV cache like virtual memory pages:
- Allocates cache in fixed-size blocks
- No memory wasted on unused context
- Enables efficient batching with variable lengths
- Can share cache pages across requests (for shared prefixes)
This is mainly relevant for serving multiple concurrent requests, not single-user local inference.
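As a toy illustration of the block-table idea (not vLLM's actual implementation): cache memory is carved into fixed-size blocks, each request maps its tokens onto whichever blocks are free, and blocks go straight back to the pool when a request finishes.

```python
BLOCK_SIZE = 16                                    # tokens per cache block

class PagedKVCache:
    """Toy allocator in the spirit of PagedAttention; real engines store tensors too."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                     # request_id -> [physical block ids]
        self.token_counts = {}                     # request_id -> tokens cached so far

    def append_token(self, request_id):
        tokens = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if tokens % BLOCK_SIZE == 0:               # current block is full (or first token)
            table.append(self.free_blocks.pop())   # allocate a block only when needed
        self.token_counts[request_id] = tokens + 1

    def release(self, request_id):
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)    # blocks return to the free pool
```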