The KV (Key-Value) cache stores intermediate attention computations so they don't need to be recalculated for each new token. It's essential for efficient inference — but it grows with context length and often becomes the limiting factor for how much context you can actually use.

Why KV Cache Exists

When generating token N, the model needs to attend to all previous tokens (1 through N-1). Without caching, every step would recompute the keys and values for all of those previous tokens, so the total work for generating a sequence grows quadratically with its length.

Without KV Cache (inefficient):
  Token 1: compute attention for [1]
  Token 2: compute attention for [1, 2]
  Token 3: compute attention for [1, 2, 3]
  Token 4: compute attention for [1, 2, 3, 4]
  ...
  Total work grows quadratically!

With KV Cache (efficient):
  Token 1: compute K₁, V₁ → store in cache
  Token 2: compute K₂, V₂ → store; attend using [K₁,K₂], [V₁,V₂]
  Token 3: compute K₃, V₃ → store; attend using [K₁,K₂,K₃], [V₁,V₂,V₃]
  ...
  Each new token only computes its own K,V — uses cached previous values

The cache trades memory for compute: store previous computations to avoid redoing them.
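To make the mechanism concrete, here is a minimal single-head decode loop in Python (a sketch using NumPy and random projection matrices, not any particular model's code): each step computes K and V only for the newest token and appends them to the cache.

    import numpy as np

    d = 8                                    # toy head dimension
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache, v_cache = [], []                # one K and one V per token seen so far

    def decode_step(x):
        """Attend from the newest token over every cached key/value."""
        q = x @ W_q                          # query is computed for the new token only
        k_cache.append(x @ W_k)              # K, V computed once, reused for all later tokens
        v_cache.append(x @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                   # attention output for the new token

    for step in range(4):                    # pretend to generate four tokens
        out = decode_step(rng.standard_normal(d))
        print(f"step {step + 1}: cache holds {len(k_cache)} K/V pairs")

Dropping the two append calls and recomputing K and V for the whole prefix at every step is exactly the quadratic blow-up sketched above.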

What's Actually Stored

For every token in the context, the cache stores that token's key and value vectors at each layer (covering all attention heads):

For each token in context:

┌─────────────────────────────────────────┐
│ Layer 1: K₁ (head_dim × num_heads) + V₁ │
│ Layer 2: K₂ + V₂                        │
│ ...                                     │
│ Layer N: Kₙ + Vₙ                        │
└─────────────────────────────────────────┘

Total per token = 2 × num_layers × hidden_dim × bytes_per_element
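As a rough sketch of that layout in code (shapes follow the common [batch, num_heads, seq_len, head_dim] convention used by libraries such as Hugging Face Transformers; the dimensions are Llama-3-8B-like, treated here as standard multi-head attention so they line up with the formula above):

    import numpy as np

    num_layers, num_heads, head_dim = 32, 32, 128    # Llama-3-8B-like dimensions (MHA view)
    seq_len = 8                                      # kept tiny so the example allocates little

    # One (K, V) pair of tensors per layer, each [batch, heads, seq_len, head_dim]
    kv_cache = [
        (np.zeros((1, num_heads, seq_len, head_dim), dtype=np.float16),
         np.zeros((1, num_heads, seq_len, head_dim), dtype=np.float16))
        for _ in range(num_layers)
    ]

    bytes_per_token = sum(k.nbytes + v.nbytes for k, v in kv_cache) // seq_len
    print(f"K/V shape per layer: {kv_cache[0][0].shape}")
    print(f"cache bytes per token: {bytes_per_token:,}")   # 2 x 32 x 4096 x 2 = 524,288

At roughly half a megabyte per token, a 4K-token context lands at the ~2 GB shown for Llama 3 8B in the table below.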

KV Cache Size Calculation

The formula for KV cache memory:

KV Cache = 2 × layers × hidden_dim × context_length × bytes_per_value
Model         Layers   Hidden Dim   KV Cache @ 4K ctx (FP16)   KV Cache @ 32K ctx (FP16)
Llama 3 8B    32       4096         ~2 GB                      ~16 GB
Llama 2 13B   40       5120         ~3.3 GB                    ~26 GB
Llama 3 70B   80       8192         ~10.5 GB                   ~84 GB

These figures follow the formula above, which assumes every attention head stores its own K and V; models that use grouped query attention (covered below) shrink them by the ratio of query heads to KV heads.
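The arithmetic is simple enough to script; a quick sketch (batch size 1, the same MHA-style formula, sizes reported in GiB so they land slightly below the rounded GB figures above):

    def kv_cache_gib(layers, hidden_dim, context_len, bytes_per_value=2):
        """2 (K and V) x layers x hidden_dim x context x bytes, in GiB."""
        return 2 * layers * hidden_dim * context_len * bytes_per_value / 2**30

    models = [("Llama 3 8B", 32, 4096), ("Llama 2 13B", 40, 5120), ("Llama 3 70B", 80, 8192)]
    for name, layers, hidden in models:
        print(f"{name:<12} {kv_cache_gib(layers, hidden, 4096):5.1f} GiB @ 4K   "
              f"{kv_cache_gib(layers, hidden, 32768):5.1f} GiB @ 32K")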

The Real Limit

A 70B model at Q4 needs ~35GB for weights. At 32K context with FP16 KV cache, you need another 84GB just for the cache. This is why a 48GB GPU can't actually run 70B at full context — the cache doesn't fit.

KV Cache Quantization

Just like model weights, the KV cache can be quantized to reduce memory:

KV Precision   Memory per Value   Quality Impact
FP16           2 bytes/value      None (full precision)
FP8             1 byte/value       Minimal
INT8           1 byte/value       Small
INT4           0.5 bytes/value    Noticeable on long context

Most inference engines support KV cache quantization. It's often more impactful than weight quantization for long-context workloads.
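As a sketch of how much that buys you, here is the 70B-class example from the earlier table (80 layers, hidden size 8192, MHA-style formula) at 32K context under each KV precision:

    layers, hidden_dim, context_len = 80, 8192, 32768   # 70B-class model at 32K context

    for precision, bytes_per_value in [("FP16", 2), ("FP8", 1), ("INT8", 1), ("INT4", 0.5)]:
        gib = 2 * layers * hidden_dim * context_len * bytes_per_value / 2**30
        print(f"{precision:<4} KV cache: {gib:5.1f} GiB")

Dropping from FP16 to INT8 frees about 40 GiB in this example, far more than squeezing another bit out of the weights would.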

Context Length Scaling

KV cache grows linearly with context length. This creates a tradeoff:

VRAM Budget: 24GB
  Model weights (Q4):  20GB
  Remaining for KV:     4GB

At FP16 KV cache for a 70B model with GQA (4× reduction, see below):
  ├── 4K context:   2.6GB  ✓ Fits
  ├── 8K context:   5.2GB  ✗ Over budget
  └── 32K context:  21GB   ✗ Way over

Options:
  1. Use shorter context
  2. Quantize KV cache (INT8 = half the memory)
  3. Get more VRAM
  4. Accept slower performance with offloading
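Another way to look at options 1 and 2 is to solve the formula for context length; a rough sketch (80 layers and an effective KV width of hidden_dim / 4 = 2048, matching the 4× GQA reduction assumed in the figures above):

    def max_context(vram_budget_gib, weights_gib, layers, kv_width, bytes_per_value):
        """Largest context whose KV cache still fits alongside the weights."""
        kv_budget_bytes = (vram_budget_gib - weights_gib) * 2**30
        bytes_per_token = 2 * layers * kv_width * bytes_per_value
        return int(kv_budget_bytes // bytes_per_token)

    # 24 GiB card, ~20 GiB of Q4 weights, 70B-class model with a 4x GQA reduction
    print("FP16 KV:", max_context(24, 20, 80, 2048, 2), "tokens")
    print("INT8 KV:", max_context(24, 20, 80, 2048, 1), "tokens")

Roughly 6.5K tokens fit at FP16 versus about 13K at INT8, which is why KV quantization is usually the first lever to pull.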

Grouped Query Attention (GQA)

Modern models such as Llama 2 70B and Llama 3 use Grouped Query Attention to reduce KV cache size:

Standard Multi-Head Attention (MHA):
  ├── Each attention head has its own K, V
  ├── 32 heads = 32 K's + 32 V's per layer
  └── Full KV cache size

Grouped Query Attention (GQA):
  ├── Multiple query heads share K, V
  ├── 32 query heads, 8 KV heads
  ├── 8 K's + 8 V's per layer
  └── 4× smaller KV cache!

Multi-Query Attention (MQA):
  ├── ALL query heads share one K, V
  ├── 32 query heads, 1 KV head
  └── 32× smaller, but quality tradeoff

GQA is a middle ground: significant cache reduction with minimal quality loss.
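A short sketch of the per-layer arithmetic behind those ratios, using the 32-query-head example from the diagram (the head_dim of 128 and the 32-layer, 8K-context model are illustrative assumptions):

    num_layers, head_dim, context_len, fp16_bytes = 32, 128, 8192, 2   # illustrative model

    for scheme, num_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
        # Only the K/V heads occupy cache; the 32 query heads are never stored.
        bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * fp16_bytes
        print(f"{scheme}: {num_kv_heads:2d} KV heads -> "
              f"{bytes_per_token * context_len / 2**30:.2f} GiB at 8K context")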

Practical Implications

Memory Planning

When planning VRAM requirements:

Total VRAM = Model Weights + KV Cache + Overhead

Where the KV cache term depends on your actual context usage, not just the model's maximum context.
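For instance, a sketch for an 8B-class model (32 layers, hidden size 4096, roughly 5 GB of Q4 weights; the 1.5 GB overhead allowance and the MHA-style cache formula are assumptions for illustration):

    def total_vram_gib(weights_gib, layers, hidden_dim, context_len,
                       kv_bytes_per_value=2, overhead_gib=1.5):
        """Weights + KV cache + a flat allowance for activations and buffers."""
        kv = 2 * layers * hidden_dim * context_len * kv_bytes_per_value / 2**30
        return weights_gib + kv + overhead_gib

    # What you actually use vs. the advertised 128K maximum
    for ctx in (8_192, 131_072):
        print(f"{ctx:>7} tokens -> {total_vram_gib(5, 32, 4096, ctx):5.1f} GiB total")

Budgeting for the 8K you actually use fits on a 16 GB card with room to spare; budgeting for the 128K maximum does not fit on anything short of a multi-GPU rig.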

The "Fits but Slow" Problem

If your model + minimal KV cache fits, but you try to use long context:

  1. KV cache exceeds VRAM
  2. Cache spills to system RAM
  3. Each generation step requires RAM→VRAM transfer
  4. Performance drops dramatically

Right-Size Your Context

Just because a model supports 128K context doesn't mean you should use it. Shorter context = smaller KV cache = more headroom = faster inference. Use the context length you need, not the maximum available.

Advanced: PagedAttention

vLLM introduced PagedAttention, which manages the KV cache like virtual memory pages: instead of reserving one large contiguous region per sequence, the cache is allocated in fixed-size blocks on demand, which cuts fragmentation and lets blocks be shared across requests.

This is mainly relevant for serving multiple concurrent requests, not single-user local inference.
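For the curious, here is a toy sketch of the core idea, a block table that maps each request's logical cache positions onto blocks from one shared physical pool (an illustration of the concept, not vLLM's actual code):

    BLOCK_TOKENS = 16                       # tokens stored per cache block ("page")

    free_blocks = list(range(64))           # shared pool of physical blocks
    block_tables = {}                       # request id -> ordered list of physical block ids

    def append_token(request_id, token_index):
        """Grab a new physical block only when a request crosses a block boundary."""
        table = block_tables.setdefault(request_id, [])
        if token_index % BLOCK_TOKENS == 0:             # current block full (or first token)
            table.append(free_blocks.pop())             # any free block will do; no contiguity needed
        return table[-1], token_index % BLOCK_TOKENS    # (physical block, offset within it)

    for i in range(40):                     # one request generates 40 tokens
        append_token("request-a", i)
    print("request-a occupies physical blocks:", block_tables["request-a"])

Because blocks are handed out on demand and returned to the pool when a request finishes, many concurrent requests can share one fixed cache budget without pre-reserving their maximum context.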