VRAM
How much GPU memory you actually need
VRAM (Video RAM) is the memory on your GPU. It determines the maximum model size you can run without offloading to slower system RAM. Getting the VRAM math right is essential for planning your local LLM setup.
The VRAM Equation
Total VRAM Needed = Model Weights + KV Cache + Overhead
Where:
- Model Weights = Parameters × Bytes per Parameter
- KV Cache = grows with context length (allocated as tokens are generated)
- Overhead = ~1-2GB for framework, buffers, etc.
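A minimal sketch of that arithmetic in Python; the byte counts, KV-cache figure, and 1.5 GB overhead below are illustrative assumptions, not measurements:

```python
# Total VRAM = model weights + KV cache + overhead.
def vram_needed_gb(params_b: float, bytes_per_param: float,
                   kv_cache_gb: float, overhead_gb: float = 1.5) -> float:
    """Estimate total VRAM (GB): billions of params x bytes/param ~= weight GB."""
    weights_gb = params_b * bytes_per_param
    return weights_gb + kv_cache_gb + overhead_gb

# 7B model, FP16 weights (2 bytes/param), ~2 GB KV cache at 8K context:
print(vram_needed_gb(7, 2.0, 2.0))   # ~17.5 GB
# Same model at Q4 (~0.6 bytes/param effective):
print(vram_needed_gb(7, 0.6, 2.0))   # ~7.7 GB
```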
Model Weights by Quantization
| Model Size | FP16 | Q8 | Q6_K | Q5_K_M | Q4_K_M |
|---|---|---|---|---|---|
| 7B | 14GB | 7GB | 5.5GB | 4.8GB | 4GB |
| 13B | 26GB | 13GB | 10GB | 9GB | 7.5GB |
| 34B | 68GB | 34GB | 26GB | 23GB | 19GB |
| 70B | 140GB | 70GB | 54GB | 47GB | 40GB |
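The table follows directly from bits per weight. A small sketch that reproduces the 70B row; the bits-per-weight values are effective averages backed out of the table and will differ slightly from real GGUF files, which mix tensor types:

```python
# Assumed effective bits per weight, backed out of the 70B row above.
BITS_PER_WEIGHT = {"FP16": 16, "Q8": 8.0, "Q6_K": 6.2, "Q5_K_M": 5.4, "Q4_K_M": 4.6}

def weights_gb(params_b: float, quant: str) -> float:
    """Weight size in GB: billions of params x bits per weight / 8 bits per byte."""
    return params_b * BITS_PER_WEIGHT[quant] / 8

for quant in BITS_PER_WEIGHT:
    print(f"70B @ {quant}: {weights_gb(70, quant):.0f} GB")
# FP16 140, Q8 70, Q6_K ~54, Q5_K_M ~47, Q4_K_M ~40 GB
```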
KV Cache Requirements
The KV cache grows with context length. This is often the hidden VRAM consumer:
| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|---|---|---|---|---|
| 7-8B | ~1GB | ~2GB | ~4GB | ~8GB |
| 13B | ~1.5GB | ~3GB | ~6GB | ~12GB |
| 70B | ~5GB | ~10GB | ~21GB | ~42GB |
Values assume an FP16 KV cache and vary with the model's layer count and attention layout (grouped-query attention stores far fewer KV heads and shrinks the cache accordingly). An INT8 KV cache halves these numbers.
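The underlying formula is straightforward. Here is a sketch assuming a Llama-2-7B-like layout (32 layers, 32 KV heads, head dimension 128; these architecture numbers are assumptions for illustration):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float = 2.0) -> float:
    """KV cache in GB: keys + values for every layer, KV head, and token position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

# Llama-2-7B-style (no GQA): 32 layers, 32 KV heads, head_dim 128, FP16 cache
print(kv_cache_gb(32, 32, 128, 4096))        # ~2.1 GB at 4K context
# Llama-3-8B-style GQA (8 KV heads) at the same context:
print(kv_cache_gb(32, 8, 128, 4096))         # ~0.5 GB
# INT8 cache (1 byte per element) halves either figure:
print(kv_cache_gb(32, 32, 128, 4096, 1.0))   # ~1.1 GB
```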
Total VRAM Examples
What Fits on Common GPUs
| VRAM | Cards | Max Model (Q4, 8K ctx) | Comfortable Models |
|---|---|---|---|
| 8GB | RTX 4060, 3070 | 7B (tight) | 3-7B |
| 12GB | RTX 4070, 3080 12GB | 13B (tight) | 7-8B |
| 16GB | RTX 4080, 4070 Ti Super | 13B | 7-13B |
| 24GB | RTX 4090, 3090 | 34B (tight) | 7-13B, 34B with short ctx |
| 48GB | 2× 3090, A6000 | 70B | 34-70B |
| 80GB | A100 80GB | 70B+ (comfortable) | 70B with long context |
The "Fits" vs "Usable" Trap
A Model That "Fits" Might Not Be Usable
If model weights consume 95% of VRAM, there's no room for KV cache. The model loads but can only handle minimal context, or has to offload cache to RAM (slow).
Target at most 80-85% VRAM utilization from weights and overhead, leaving the rest for KV cache growth.
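A quick way to apply that rule in Python; the 85% ceiling and 1.5 GB overhead are the assumptions above, and the weight sizes come from the quantization table:

```python
def usable_kv_budget_gb(vram_gb: float, weights_gb: float,
                        overhead_gb: float = 1.5, max_util: float = 0.85) -> float:
    """VRAM left for the KV cache once weights and overhead are loaded,
    keeping total utilization at or below the target ceiling."""
    return max_util * vram_gb - weights_gb - overhead_gb

# 24 GB card, 34B at Q4_K_M (~19 GB of weights):
print(f"{usable_kv_budget_gb(24, 19):.1f} GB for KV cache")   # ~-0.1 GB: fits, barely usable
# Same card, 13B at Q4_K_M (~7.5 GB of weights):
print(f"{usable_kv_budget_gb(24, 7.5):.1f} GB for KV cache")  # ~11.4 GB: plenty of headroom
```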
Strategies When VRAM Is Limited
1. More Aggressive Quantization
Q4 instead of Q6 can save 25%+ VRAM with modest quality loss (see the sketch after this list for how the savings combine).
2. Shorter Context
Using 4K instead of 16K context can save gigabytes of KV cache.
3. KV Cache Quantization
INT8 KV cache halves cache memory with minimal quality impact.
4. Smaller Model
A 13B model that fits often beats a 70B that's limping along with offloading.
5. RAM Offloading (Last Resort)
Some layers can be kept in system RAM and computed on the CPU. It works, but offloaded layers run 10-20× slower.
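A rough sketch of how the first three strategies move the total for a 13B model aimed at a 12 GB card; every figure is an approximation pulled from the tables above:

```python
# Approximate figures from the tables above (13B model).
weights_q6, weights_q4 = 10.0, 7.5   # GB of weights at Q6_K vs Q4_K_M
kv_16k_fp16, kv_4k_fp16 = 6.0, 1.5   # GB of FP16 KV cache at 16K vs 4K context
overhead = 1.5                        # GB for framework and buffers

configs = {
    "Q6_K, 16K ctx":            weights_q6 + kv_16k_fp16 + overhead,       # ~17.5 GB
    "Q4_K_M, 16K ctx":          weights_q4 + kv_16k_fp16 + overhead,       # ~15.0 GB
    "Q4_K_M, 16K ctx, INT8 KV": weights_q4 + kv_16k_fp16 / 2 + overhead,   # ~12.0 GB
    "Q4_K_M, 4K ctx":           weights_q4 + kv_4k_fp16 + overhead,        # ~10.5 GB
}

budget = 12.0  # GB card
for name, total in configs.items():
    fits = "fits" if total <= budget else "over budget"
    print(f"{name}: {total:.1f} GB ({fits})")
# The INT8-KV option lands right at 12 GB -- technically fits, but with no headroom.
```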
Memory Types
| Type | Used In | Bandwidth | Notes |
|---|---|---|---|
| GDDR6X | RTX 30/40 series | 700-1000 GB/s | Fast, consumer standard |
| GDDR6 | Lower-end GPUs, AMD | 400-600 GB/s | Slightly slower |
| HBM2/HBM2e | A100, MI100 | 1500-2000 GB/s | Datacenter, very fast |
| HBM3 | H100 | 3000+ GB/s | Fastest available |
Monitoring VRAM Usage
NVIDIA
```bash
# Real-time monitoring
watch -n 1 nvidia-smi

# Detailed memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
```
During Inference
Watch VRAM usage during:
- Model loading (should stabilize after load)
- Long prompts (prefill uses extra memory)
- Generation (KV cache grows)
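If your runtime is PyTorch-based, you can also log allocation at those checkpoints from inside the process. A minimal sketch (PyTorch's counters only cover tensors it allocates, so nvidia-smi will read somewhat higher):

```python
import torch

def log_vram(tag: str) -> None:
    """Print current and peak VRAM allocated by PyTorch on GPU 0."""
    current_gb = torch.cuda.memory_allocated(0) / 1e9
    peak_gb = torch.cuda.max_memory_allocated(0) / 1e9
    print(f"[{tag}] allocated: {current_gb:.2f} GB, peak: {peak_gb:.2f} GB")

# Call at the checkpoints listed above, e.g.:
# log_vram("after model load")
# log_vram("after prefill")
# log_vram("after generation")
```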
Quick VRAM Planning
For a model at Q4 quantization with 8K context:
VRAM needed ≈ (Parameters in billions × 0.6) + 3 GB
Examples:
7B: 7 × 0.6 + 3 = ~7 GB
13B: 13 × 0.6 + 3 = ~11 GB
70B: 70 × 0.6 + 3 = ~45 GB
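The same rule of thumb as a one-liner; the 0.6 GB-per-billion slope and 3 GB constant only hold for Q4 weights with roughly 8K of FP16 context:

```python
def quick_vram_gb(params_b: float) -> float:
    """Rule-of-thumb VRAM estimate: Q4 weights + ~8K context + overhead."""
    return params_b * 0.6 + 3

for size in (7, 13, 34, 70):
    print(f"{size}B: ~{quick_vram_gb(size):.0f} GB")   # 7B ~7, 13B ~11, 34B ~23, 70B ~45
```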