VRAM (Video RAM) is the memory on your GPU. It determines the maximum model size you can run without offloading to slower system RAM. Getting the VRAM math right is essential for planning your local LLM setup.

The VRAM Equation

Total VRAM Needed = Model Weights + KV Cache + Overhead

Where:
- Model Weights = Parameters × Bytes per Parameter
- KV Cache = Depends on context length (grows during generation)
- Overhead = ~1-2GB for framework, buffers, etc.

Model Weights by Quantization

| Model Size | FP16  | Q8   | Q6_K  | Q5_K_M | Q4_K_M |
|------------|-------|------|-------|--------|--------|
| 7B         | 14GB  | 7GB  | 5.5GB | 4.8GB  | 4GB    |
| 13B        | 26GB  | 13GB | 10GB  | 9GB    | 7.5GB  |
| 34B        | 68GB  | 34GB | 26GB  | 23GB   | 19GB   |
| 70B        | 140GB | 70GB | 54GB  | 47GB   | 40GB   |
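
To reuse the table's arithmetic for other model sizes, here is a minimal Python sketch of the weights calculation. The bits-per-weight figures are rough averages for common GGUF quantization schemes and are assumptions for illustration; a real GGUF file will differ slightly in size.

```python
# Weight memory ≈ parameters × bytes per parameter.
# Bits-per-weight below are rough averages for common GGUF quantization
# schemes (block scales included); actual file sizes differ slightly.
BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q6_K":    6.6,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.9,
}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate VRAM for the weights alone, in decimal GB."""
    bytes_per_param = BITS_PER_WEIGHT[quant] / 8
    return params_billions * bytes_per_param  # billions of params × bytes/param = GB

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```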

KV Cache Requirements

The KV cache grows with context length. This is often the hidden VRAM consumer:

| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|-------|--------|--------|---------|---------|
| 7-8B  | ~1GB   | ~2GB   | ~4GB    | ~8GB    |
| 13B   | ~1.5GB | ~3GB   | ~6GB    | ~12GB   |
| 70B   | ~5GB   | ~10GB  | ~21GB   | ~42GB   |

Values assume FP16 KV cache. INT8 KV cache halves these numbers.
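
If you want an exact number rather than the rough table, the cache size follows directly from the model architecture. The sketch below assumes a standard transformer KV cache of 2 × layers × KV heads × head dim × tokens; the example configs (32 layers, head dim 128, with 8 vs 32 KV heads) are assumptions roughly matching a Llama-3-8B-class model and an older 7B design without GQA, which is why the table's round numbers sit between the two results.

```python
# KV cache bytes = 2 (K and V) × layers × KV heads × head dim × tokens × bytes/element.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB; bytes_per_elem = 2 for FP16, 1 for INT8."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Assumed Llama-3-8B-class config: 32 layers, 8 KV heads (GQA), head dim 128.
print(kv_cache_gb(32, 8, 128, 8192))    # ~1.1 GB at 8K context, FP16
# An older 7B design without GQA (32 KV heads) needs 4× as much:
print(kv_cache_gb(32, 32, 128, 8192))   # ~4.3 GB at 8K context, FP16
```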

Total VRAM Examples

```
Example: Running Llama 3 70B Q4 with 8K context
  Model weights (Q4):   40 GB
  KV cache (8K, FP16):  10 GB
  Overhead:              2 GB
  ─────────────────────────────
  Total:                52 GB
  → More than a single 48GB card at these settings; plan on
    2× 32GB GPUs, or squeeze onto 2× 24GB / one 48GB card by
    quantizing the KV cache to INT8 or shortening the context
```

```
Example: Running Llama 3 8B Q4 with 16K context
  Model weights (Q4):   4.5 GB
  KV cache (16K, FP16): 4.0 GB
  Overhead:             1.5 GB
  ─────────────────────────────
  Total:                10 GB
  → Fits on a 12GB GPU with some headroom
```
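
A quick sketch of the same arithmetic, using the figures from the two examples above:

```python
# The whole equation in one place: weights + KV cache + overhead.
def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float) -> float:
    return weights_gb + kv_cache_gb + overhead_gb

print(total_vram_gb(40.0, 10.0, 2.0))   # Llama 3 70B Q4, 8K ctx  -> 52.0 GB
print(total_vram_gb(4.5, 4.0, 1.5))     # Llama 3 8B Q4, 16K ctx  -> 10.0 GB
```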

What Fits on Common GPUs

| VRAM | Cards                   | Max Model (Q4, 8K ctx) | Comfortable Models        |
|------|-------------------------|------------------------|---------------------------|
| 8GB  | RTX 4060, 3070          | 7B (tight)             | 3-7B                      |
| 12GB | RTX 4070, 3080 12GB     | 13B (tight)            | 7-8B                      |
| 16GB | RTX 4080, 4070 Ti Super | 13B                    | 7-13B                     |
| 24GB | RTX 4090, 3090          | 34B (tight)            | 7-13B, 34B with short ctx |
| 48GB | 2× 3090, A6000          | 70B (tight)            | 34-70B                    |
| 80GB | A100 80GB               | 70B+ (comfortable)     | 70B with long context     |

The "Fits" vs "Usable" Trap

A Model That "Fits" Might Not Be Usable

If model weights consume 95% of VRAM, there's no room for KV cache. The model loads but can only handle minimal context, or has to offload cache to RAM (slow).

Target at most 80-85% VRAM utilization, leaving headroom for the KV cache to grow.
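
One way to apply that target: subtract weights and overhead from the budget, then see how much context the remainder buys. The sketch below is a rough planning helper; the 0.25 GB-per-1K-tokens figure is an assumption taken from the 7-8B row of the KV cache table, not a measured value.

```python
# Headroom check: keep weights + overhead + KV cache under ~85% of physical VRAM.
def max_context_tokens(gpu_vram_gb: float, weights_gb: float,
                       kv_gb_per_1k_tokens: float,
                       overhead_gb: float = 1.5, util_target: float = 0.85) -> int:
    """Roughly how many tokens of KV cache fit within the utilization target."""
    kv_budget_gb = gpu_vram_gb * util_target - weights_gb - overhead_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone already blow the budget
    return int(kv_budget_gb / kv_gb_per_1k_tokens * 1024)

# 12 GB card, 8B model at Q4 (~4.5 GB), ~0.25 GB of FP16 cache per 1K tokens:
print(max_context_tokens(12, 4.5, 0.25))   # ~17K tokens of context headroom
```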

Strategies When VRAM Is Limited

1. More Aggressive Quantization

Q4 instead of Q6 can save 25%+ VRAM with modest quality loss.

2. Shorter Context

Using 4K instead of 16K context can save gigabytes of KV cache.

3. KV Cache Quantization

INT8 KV cache halves cache memory with minimal quality impact.

4. Smaller Model

A 13B model that fits often beats a 70B that's limping along with offloading.

5. RAM Offloading (Last Resort)

Some layers can run from system RAM. Works but 10-20× slower for offloaded layers.
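
To see how much each strategy actually buys, here is a back-of-envelope comparison for the 70B example, reusing the rough figures from the tables above. These are planning estimates, not measurements, and the scenario labels are illustrative.

```python
# 70B at 8K context, rough figures from the tables above (in GB).
overhead = 2.0

scenarios = {
    "Q6_K weights, FP16 cache":        54 + 10.0 + overhead,  # ~66 GB
    "1. Q4_K_M weights, FP16 cache":   40 + 10.0 + overhead,  # ~52 GB
    "2. Q4_K_M, context cut to 4K":    40 + 5.0 + overhead,   # ~47 GB
    "3. Q4_K_M, INT8 KV cache at 8K":  40 + 5.0 + overhead,   # ~47 GB
    "4. 13B Q4_K_M instead, 8K FP16":  7.5 + 3.0 + overhead,  # ~12.5 GB
}

for name, gb in scenarios.items():
    print(f"{name}: ~{gb:.0f} GB")
```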

Memory Types

| Type       | Used In             | Bandwidth      | Notes                   |
|------------|---------------------|----------------|-------------------------|
| GDDR6X     | RTX 30/40 series    | 700-1000 GB/s  | Fast, consumer standard |
| GDDR6      | Lower-end GPUs, AMD | 400-600 GB/s   | Slightly slower         |
| HBM2/HBM2e | A100, MI100         | 1500-2000 GB/s | Datacenter, very fast   |
| HBM3       | H100                | 3000+ GB/s     | Fastest available       |

Monitoring VRAM Usage

NVIDIA

```bash
# Real-time monitoring
watch -n 1 nvidia-smi

# Detailed memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
```
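
If you would rather check free VRAM from code before loading a model, NVIDIA's NVML bindings expose the same numbers. This sketch assumes the nvidia-ml-py package (imported as pynvml) is installed.

```python
# Programmatic VRAM check via NVIDIA's NVML bindings (pip install nvidia-ml-py).
# NVML reports values in bytes.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used:  {mem.used  / 1e9:.1f} GB")
print(f"free:  {mem.free  / 1e9:.1f} GB")
print(f"total: {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```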

During Inference

Watch VRAM usage while the model loads and again while it generates with the longest context you expect to use: the KV cache fills as the context grows, so peak usage comes late in a long generation, not right after loading.

Quick VRAM Planning

For a model at Q4 quantization with 8K context:

VRAM needed ≈ (Parameters in billions × 0.6) + 3 GB

Examples:
7B:  7 × 0.6 + 3 = ~7 GB
13B: 13 × 0.6 + 3 = ~11 GB
70B: 70 × 0.6 + 3 = ~45 GB
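
As a sketch, the rule of thumb is a one-liner. Keep in mind it assumes Q4 weights and a moderate (~8K) context; it undershoots once the KV cache dominates, which is why the detailed 70B example above lands at ~52 GB rather than 45.

```python
# Quick planning rule: ~0.6 GB per billion parameters (Q4) plus ~3 GB
# for cache and overhead. A rough estimate, not a substitute for the full math.
def quick_vram_gb(params_billions: float) -> float:
    return params_billions * 0.6 + 3

for size in (7, 13, 70):
    print(f"{size}B: ~{quick_vram_gb(size):.0f} GB")
```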