The KV (Key-Value) cache stores intermediate attention computations so they don't need to be recalculated for each new token. It's essential for efficient inference — but it grows with context length and often becomes the limiting factor for how much context you can actually use.

Why KV Cache Exists

When generating token N, the model needs to attend to all previous tokens (1 through N-1). Without caching, every step would recompute the keys and values for all of those previous tokens, so the total work for generating a sequence grows quadratically with its length.

Without KV Cache (inefficient):
  Token 1: compute attention for [1]
  Token 2: compute attention for [1, 2]
  Token 3: compute attention for [1, 2, 3]
  Token 4: compute attention for [1, 2, 3, 4]
  ...
  Total work grows quadratically!

With KV Cache (efficient):
  Token 1: compute K₁, V₁ → store in cache
  Token 2: compute K₂, V₂ → store; attend using [K₁,K₂], [V₁,V₂]
  Token 3: compute K₃, V₃ → store; attend using [K₁,K₂,K₃], [V₁,V₂,V₃]
  ...
  Each new token only computes its own K,V — uses cached previous values

The cache trades memory for compute: store previous computations to avoid redoing them.
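To make the mechanism concrete, here is a minimal single-head decode loop in Python (a sketch using NumPy and random projection matrices, not any particular model's code): each step computes K and V only for the newest token and appends them to the cache.

    import numpy as np

    d = 8                                    # toy head dimension
    rng = np.random.default_rng(0)
    W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

    k_cache, v_cache = [], []                # one K and one V per token seen so far

    def decode_step(x):
        """Attend from the newest token over every cached key/value."""
        q = x @ W_q                          # query is computed for the new token only
        k_cache.append(x @ W_k)              # K, V computed once, reused for all later tokens
        v_cache.append(x @ W_v)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q / np.sqrt(d)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ V                   # attention output for the new token

    for step in range(4):                    # pretend to generate four tokens
        out = decode_step(rng.standard_normal(d))
        print(f"step {step + 1}: cache holds {len(k_cache)} K/V pairs")

Dropping the two append calls and recomputing K and V for the whole prefix at every step is exactly the quadratic blow-up sketched above.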

What's Actually Stored

For every token in the context, the cache stores that token's key and value vectors at each layer (covering all attention heads):

For each token in context:

┌─────────────────────────────────────────┐
│ Layer 1: K₁ (head_dim × num_heads) + V₁ │
│ Layer 2: K₂ + V₂                        │
│ ...                                     │
│ Layer N: Kₙ + Vₙ                        │
└─────────────────────────────────────────┘

Total per token = 2 × num_layers × hidden_dim × bytes_per_element
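As a rough sketch of that layout in code (shapes follow the common [batch, num_heads, seq_len, head_dim] convention used by libraries such as Hugging Face Transformers; the dimensions are Llama-3-8B-like, treated here as standard multi-head attention so they line up with the formula above):

    import numpy as np

    num_layers, num_heads, head_dim = 32, 32, 128    # Llama-3-8B-like dimensions (MHA view)
    seq_len = 8                                      # kept tiny so the example allocates little

    # One (K, V) pair of tensors per layer, each [batch, heads, seq_len, head_dim]
    kv_cache = [
        (np.zeros((1, num_heads, seq_len, head_dim), dtype=np.float16),
         np.zeros((1, num_heads, seq_len, head_dim), dtype=np.float16))
        for _ in range(num_layers)
    ]

    bytes_per_token = sum(k.nbytes + v.nbytes for k, v in kv_cache) // seq_len
    print(f"K/V shape per layer: {kv_cache[0][0].shape}")
    print(f"cache bytes per token: {bytes_per_token:,}")   # 2 x 32 x 4096 x 2 = 524,288

At roughly half a megabyte per token, a 4K-token context lands at the ~2 GB shown for Llama 3 8B in the table below.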

KV Cache Size Calculation

The formula for KV cache memory:

KV Cache = 2 × layers × hidden_dim × context_length × bytes_per_value
Model         Layers   Hidden Dim   KV Cache @ 4K ctx (FP16)   KV Cache @ 32K ctx (FP16)
Llama 3 8B    32       4096         ~2 GB                      ~16 GB
Llama 2 13B   40       5120         ~3.3 GB                    ~26 GB
Llama 3 70B   80       8192         ~10.5 GB                   ~84 GB

These figures follow the formula above, which assumes every attention head stores its own K and V; models that use grouped query attention (covered below) shrink them by the ratio of query heads to KV heads.
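The arithmetic is simple enough to script; a quick sketch (batch size 1, the same MHA-style formula, sizes reported in GiB so they land slightly below the rounded GB figures above):

    def kv_cache_gib(layers, hidden_dim, context_len, bytes_per_value=2):
        """2 (K and V) x layers x hidden_dim x context x bytes, in GiB."""
        return 2 * layers * hidden_dim * context_len * bytes_per_value / 2**30

    models = [("Llama 3 8B", 32, 4096), ("Llama 2 13B", 40, 5120), ("Llama 3 70B", 80, 8192)]
    for name, layers, hidden in models:
        print(f"{name:<12} {kv_cache_gib(layers, hidden, 4096):5.1f} GiB @ 4K   "
              f"{kv_cache_gib(layers, hidden, 32768):5.1f} GiB @ 32K")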

The Real Limit

A 70B model at Q4 needs ~35GB for weights. At 32K context with FP16 KV cache, you need another 84GB just for the cache. This is why a 48GB GPU can't actually run 70B at full context — the cache doesn't fit.

KV Cache Quantization

Just like model weights, the KV cache can be quantized to reduce memory:

KV Precision   Memory per Value   Quality Impact
FP16           2 bytes/value      None (full precision)
FP8             1 byte/value       Minimal
INT8           1 byte/value       Small
INT4           0.5 bytes/value    Noticeable on long context

Most inference engines support KV cache quantization. It's often more impactful than weight quantization for long-context workloads.
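As a sketch of how much that buys you, here is the 70B-class example from the earlier table (80 layers, hidden size 8192, MHA-style formula) at 32K context under each KV precision:

    layers, hidden_dim, context_len = 80, 8192, 32768   # 70B-class model at 32K context

    for precision, bytes_per_value in [("FP16", 2), ("FP8", 1), ("INT8", 1), ("INT4", 0.5)]:
        gib = 2 * layers * hidden_dim * context_len * bytes_per_value / 2**30
        print(f"{precision:<4} KV cache: {gib:5.1f} GiB")

Dropping from FP16 to INT8 frees about 40 GiB in this example, far more than squeezing another bit out of the weights would.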

Context Length Scaling

KV cache grows linearly with context length. This creates a tradeoff:

VRAM Budget: 24GB
  Model weights (Q4):  20GB
  Remaining for KV:     4GB

At FP16 KV cache for a 70B model with GQA (4× reduction, see below):
  ├── 4K context:   2.6GB  ✓ Fits
  ├── 8K context:   5.2GB  ✗ Over budget
  └── 32K context:  21GB   ✗ Way over

Options:
  1. Use shorter context
  2. Quantize KV cache (INT8 = half the memory)
  3. Get more VRAM
  4. Accept slower performance with offloading
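Another way to look at options 1 and 2 is to solve the formula for context length; a rough sketch (80 layers and an effective KV width of hidden_dim / 4 = 2048, matching the 4× GQA reduction assumed in the figures above):

    def max_context(vram_budget_gib, weights_gib, layers, kv_width, bytes_per_value):
        """Largest context whose KV cache still fits alongside the weights."""
        kv_budget_bytes = (vram_budget_gib - weights_gib) * 2**30
        bytes_per_token = 2 * layers * kv_width * bytes_per_value
        return int(kv_budget_bytes // bytes_per_token)

    # 24 GiB card, ~20 GiB of Q4 weights, 70B-class model with a 4x GQA reduction
    print("FP16 KV:", max_context(24, 20, 80, 2048, 2), "tokens")
    print("INT8 KV:", max_context(24, 20, 80, 2048, 1), "tokens")

Roughly 6.5K tokens fit at FP16 versus about 13K at INT8, which is why KV quantization is usually the first lever to pull.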

Grouped Query Attention (GQA)

Modern models such as Llama 2 70B and Llama 3 use Grouped Query Attention to reduce KV cache size:

Standard Multi-Head Attention (MHA):
  ├── Each attention head has its own K, V
  ├── 32 heads = 32 K's + 32 V's per layer
  └── Full KV cache size

Grouped Query Attention (GQA):
  ├── Multiple query heads share K, V
  ├── 32 query heads, 8 KV heads
  ├── 8 K's + 8 V's per layer
  └── 4× smaller KV cache!

Multi-Query Attention (MQA):
  ├── ALL query heads share one K, V
  ├── 32 query heads, 1 KV head
  └── 32× smaller, but quality tradeoff

GQA is a middle ground: significant cache reduction with minimal quality loss.
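A short sketch of the per-layer arithmetic behind those ratios, using the 32-query-head example from the diagram (the head_dim of 128 and the 32-layer, 8K-context model are illustrative assumptions):

    num_layers, head_dim, context_len, fp16_bytes = 32, 128, 8192, 2   # illustrative model

    for scheme, num_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
        # Only the K/V heads occupy cache; the 32 query heads are never stored.
        bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * fp16_bytes
        print(f"{scheme}: {num_kv_heads:2d} KV heads -> "
              f"{bytes_per_token * context_len / 2**30:.2f} GiB at 8K context")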

Practical Implications

Memory Planning

When planning VRAM requirements:

Total VRAM = Model Weights + KV Cache + Overhead

Where the KV cache term depends on your actual context usage, not just the model's maximum context.
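For instance, a sketch for an 8B-class model (32 layers, hidden size 4096, roughly 5 GB of Q4 weights; the 1.5 GB overhead allowance and the MHA-style cache formula are assumptions for illustration):

    def total_vram_gib(weights_gib, layers, hidden_dim, context_len,
                       kv_bytes_per_value=2, overhead_gib=1.5):
        """Weights + KV cache + a flat allowance for activations and buffers."""
        kv = 2 * layers * hidden_dim * context_len * kv_bytes_per_value / 2**30
        return weights_gib + kv + overhead_gib

    # What you actually use vs. the advertised 128K maximum
    for ctx in (8_192, 131_072):
        print(f"{ctx:>7} tokens -> {total_vram_gib(5, 32, 4096, ctx):5.1f} GiB total")

Budgeting for the 8K you actually use fits on a 16 GB card with room to spare; budgeting for the 128K maximum does not fit on anything short of a multi-GPU rig.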

The "Fits but Slow" Problem

If your model + minimal KV cache fits, but you try to use long context:

  1. KV cache exceeds VRAM
  2. Cache spills to system RAM
  3. Each generation step requires RAM→VRAM transfer
  4. Performance drops dramatically

Right-Size Your Context

Just because a model supports 128K context doesn't mean you should use it. Shorter context = smaller KV cache = more headroom = faster inference. Use the context length you need, not the maximum available.

Advanced: PagedAttention

vLLM introduced PagedAttention, which manages the KV cache like virtual memory pages: instead of reserving one large contiguous region per sequence, the cache is allocated in fixed-size blocks on demand, which cuts fragmentation and lets blocks be shared across requests.

This is mainly relevant for serving multiple concurrent requests, not single-user local inference.
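For the curious, here is a toy sketch of the core idea, a block table that maps each request's logical cache positions onto blocks from one shared physical pool (an illustration of the concept, not vLLM's actual code):

    BLOCK_TOKENS = 16                       # tokens stored per cache block ("page")

    free_blocks = list(range(64))           # shared pool of physical blocks
    block_tables = {}                       # request id -> ordered list of physical block ids

    def append_token(request_id, token_index):
        """Grab a new physical block only when a request crosses a block boundary."""
        table = block_tables.setdefault(request_id, [])
        if token_index % BLOCK_TOKENS == 0:             # current block full (or first token)
            table.append(free_blocks.pop())             # any free block will do; no contiguity needed
        return table[-1], token_index % BLOCK_TOKENS    # (physical block, offset within it)

    for i in range(40):                     # one request generates 40 tokens
        append_token("request-a", i)
    print("request-a occupies physical blocks:", block_tables["request-a"])

Because blocks are handed out on demand and returned to the pool when a request finishes, many concurrent requests can share one fixed cache budget without pre-reserving their maximum context.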