VRAM (Video RAM) is the memory on your GPU. It determines the maximum model size you can run without offloading to slower system RAM. Getting the VRAM math right is essential for planning your local LLM setup.

The VRAM Equation

Total VRAM Needed = Model Weights + KV Cache + Overhead

Where:
- Model Weights = Parameters × Bytes per Parameter
- KV Cache = Depends on context length (grows during generation)
- Overhead = ~1-2GB for framework, buffers, etc.

Model Weights by Quantization

| Model Size | FP16  | Q8   | Q6_K  | Q5_K_M | Q4_K_M |
|------------|-------|------|-------|--------|--------|
| 7B         | 14GB  | 7GB  | 5.5GB | 4.8GB  | 4GB    |
| 13B        | 26GB  | 13GB | 10GB  | 9GB    | 7.5GB  |
| 34B        | 68GB  | 34GB | 26GB  | 23GB   | 19GB   |
| 70B        | 140GB | 70GB | 54GB  | 47GB   | 40GB   |
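
To reuse the table's arithmetic for other model sizes, here is a minimal Python sketch of the weights calculation. The bits-per-weight figures are rough averages for common GGUF quantization schemes and are assumptions for illustration; a real GGUF file will differ slightly in size.

```python
# Weight memory ≈ parameters × bytes per parameter.
# Bits-per-weight below are rough averages for common GGUF quantization
# schemes (block scales included); actual file sizes differ slightly.
BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q6_K":    6.6,
    "Q5_K_M":  5.7,
    "Q4_K_M":  4.9,
}

def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate VRAM for the weights alone, in decimal GB."""
    bytes_per_param = BITS_PER_WEIGHT[quant] / 8
    return params_billions * bytes_per_param  # billions of params × bytes/param = GB

for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_memory_gb(7, quant):.1f} GB")
```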

KV Cache Requirements

The KV cache grows with context length. This is often the hidden VRAM consumer:

| Model | 4K ctx | 8K ctx | 16K ctx | 32K ctx |
|-------|--------|--------|---------|---------|
| 7-8B  | ~1GB   | ~2GB   | ~4GB    | ~8GB    |
| 13B   | ~1.5GB | ~3GB   | ~6GB    | ~12GB   |
| 70B   | ~5GB   | ~10GB  | ~21GB   | ~42GB   |

Values assume FP16 KV cache. INT8 KV cache halves these numbers.
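
If you want an exact number rather than the rough table, the cache size follows directly from the model architecture. The sketch below assumes a standard transformer KV cache of 2 × layers × KV heads × head dim × tokens; the example configs (32 layers, head dim 128, with 8 vs 32 KV heads) are assumptions roughly matching a Llama-3-8B-class model and an older 7B design without GQA, which is why the table's round numbers sit between the two results.

```python
# KV cache bytes = 2 (K and V) × layers × KV heads × head dim × tokens × bytes/element.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB; bytes_per_elem = 2 for FP16, 1 for INT8."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Assumed Llama-3-8B-class config: 32 layers, 8 KV heads (GQA), head dim 128.
print(kv_cache_gb(32, 8, 128, 8192))    # ~1.1 GB at 8K context, FP16
# An older 7B design without GQA (32 KV heads) needs 4× as much:
print(kv_cache_gb(32, 32, 128, 8192))   # ~4.3 GB at 8K context, FP16
```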

Total VRAM Examples

```
Example: Running Llama 3 70B Q4 with 8K context
  Model weights (Q4):   40 GB
  KV cache (8K, FP16):  10 GB
  Overhead:              2 GB
  ─────────────────────────────
  Total:                52 GB
  → More than a single 48GB card at these settings; plan on
    2× 32GB GPUs, or squeeze onto 2× 24GB / one 48GB card by
    quantizing the KV cache to INT8 or shortening the context
```

```
Example: Running Llama 3 8B Q4 with 16K context
  Model weights (Q4):   4.5 GB
  KV cache (16K, FP16): 4.0 GB
  Overhead:             1.5 GB
  ─────────────────────────────
  Total:                10 GB
  → Fits on a 12GB GPU with some headroom
```
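
A quick sketch of the same arithmetic, using the figures from the two examples above:

```python
# The whole equation in one place: weights + KV cache + overhead.
def total_vram_gb(weights_gb: float, kv_cache_gb: float, overhead_gb: float) -> float:
    return weights_gb + kv_cache_gb + overhead_gb

print(total_vram_gb(40.0, 10.0, 2.0))   # Llama 3 70B Q4, 8K ctx  -> 52.0 GB
print(total_vram_gb(4.5, 4.0, 1.5))     # Llama 3 8B Q4, 16K ctx  -> 10.0 GB
```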

What Fits on Common GPUs

| VRAM | Cards                   | Max Model (Q4, 8K ctx) | Comfortable Models        |
|------|-------------------------|------------------------|---------------------------|
| 8GB  | RTX 4060, 3070          | 7B (tight)             | 3-7B                      |
| 12GB | RTX 4070, 3080 12GB     | 13B (tight)            | 7-8B                      |
| 16GB | RTX 4080, 4070 Ti Super | 13B                    | 7-13B                     |
| 24GB | RTX 4090, 3090          | 34B (tight)            | 7-13B, 34B with short ctx |
| 48GB | 2× 3090, A6000          | 70B (tight)            | 34-70B                    |
| 80GB | A100 80GB               | 70B+ (comfortable)     | 70B with long context     |

The "Fits" vs "Usable" Trap

A Model That "Fits" Might Not Be Usable

If model weights consume 95% of VRAM, there's no room for KV cache. The model loads but can only handle minimal context, or has to offload cache to RAM (slow).

Target at most 80-85% VRAM utilization, leaving headroom for the KV cache to grow.
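
One way to apply that target: subtract weights and overhead from the budget, then see how much context the remainder buys. The sketch below is a rough planning helper; the 0.25 GB-per-1K-tokens figure is an assumption taken from the 7-8B row of the KV cache table, not a measured value.

```python
# Headroom check: keep weights + overhead + KV cache under ~85% of physical VRAM.
def max_context_tokens(gpu_vram_gb: float, weights_gb: float,
                       kv_gb_per_1k_tokens: float,
                       overhead_gb: float = 1.5, util_target: float = 0.85) -> int:
    """Roughly how many tokens of KV cache fit within the utilization target."""
    kv_budget_gb = gpu_vram_gb * util_target - weights_gb - overhead_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone already blow the budget
    return int(kv_budget_gb / kv_gb_per_1k_tokens * 1024)

# 12 GB card, 8B model at Q4 (~4.5 GB), ~0.25 GB of FP16 cache per 1K tokens:
print(max_context_tokens(12, 4.5, 0.25))   # ~17K tokens of context headroom
```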

Strategies When VRAM Is Limited

1. More Aggressive Quantization

Q4 instead of Q6 can save 25%+ VRAM with modest quality loss.

2. Shorter Context

Using 4K instead of 16K context can save gigabytes of KV cache.

3. KV Cache Quantization

INT8 KV cache halves cache memory with minimal quality impact.

4. Smaller Model

A 13B model that fits often beats a 70B that's limping along with offloading.

5. RAM Offloading (Last Resort)

Some layers can run from system RAM. Works but 10-20× slower for offloaded layers.
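
To see how much each strategy actually buys, here is a back-of-envelope comparison for the 70B example, reusing the rough figures from the tables above. These are planning estimates, not measurements, and the scenario labels are illustrative.

```python
# 70B at 8K context, rough figures from the tables above (in GB).
overhead = 2.0

scenarios = {
    "Q6_K weights, FP16 cache":        54 + 10.0 + overhead,  # ~66 GB
    "1. Q4_K_M weights, FP16 cache":   40 + 10.0 + overhead,  # ~52 GB
    "2. Q4_K_M, context cut to 4K":    40 + 5.0 + overhead,   # ~47 GB
    "3. Q4_K_M, INT8 KV cache at 8K":  40 + 5.0 + overhead,   # ~47 GB
    "4. 13B Q4_K_M instead, 8K FP16":  7.5 + 3.0 + overhead,  # ~12.5 GB
}

for name, gb in scenarios.items():
    print(f"{name}: ~{gb:.0f} GB")
```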

Memory Types

| Type       | Used In             | Bandwidth      | Notes                   |
|------------|---------------------|----------------|-------------------------|
| GDDR6X     | RTX 30/40 series    | 700-1000 GB/s  | Fast, consumer standard |
| GDDR6      | Lower-end GPUs, AMD | 400-600 GB/s   | Slightly slower         |
| HBM2/HBM2e | A100, MI100         | 1500-2000 GB/s | Datacenter, very fast   |
| HBM3       | H100                | 3000+ GB/s     | Fastest available       |

Monitoring VRAM Usage

NVIDIA

```bash
# Real-time monitoring
watch -n 1 nvidia-smi

# Detailed memory breakdown
nvidia-smi --query-gpu=memory.used,memory.free,memory.total --format=csv
```
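
If you would rather check free VRAM from code before loading a model, NVIDIA's NVML bindings expose the same numbers. This sketch assumes the nvidia-ml-py package (imported as pynvml) is installed.

```python
# Programmatic VRAM check via NVIDIA's NVML bindings (pip install nvidia-ml-py).
# NVML reports values in bytes.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used:  {mem.used  / 1e9:.1f} GB")
print(f"free:  {mem.free  / 1e9:.1f} GB")
print(f"total: {mem.total / 1e9:.1f} GB")
pynvml.nvmlShutdown()
```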

During Inference

Watch VRAM usage while the model loads and again while it generates with the longest context you expect to use: the KV cache fills as the context grows, so peak usage comes late in a long generation, not right after loading.

Quick VRAM Planning

For a model at Q4 quantization with 8K context:

VRAM needed ≈ (Parameters in billions × 0.6) + 3 GB

Examples:
7B:  7 × 0.6 + 3 = ~7 GB
13B: 13 × 0.6 + 3 = ~11 GB
70B: 70 × 0.6 + 3 = ~45 GB
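
As a sketch, the rule of thumb is a one-liner. Keep in mind it assumes Q4 weights and a moderate (~8K) context; it undershoots once the KV cache dominates, which is why the detailed 70B example above lands at ~52 GB rather than 45.

```python
# Quick planning rule: ~0.6 GB per billion parameters (Q4) plus ~3 GB
# for cache and overhead. A rough estimate, not a substitute for the full math.
def quick_vram_gb(params_billions: float) -> float:
    return params_billions * 0.6 + 3

for size in (7, 13, 70):
    print(f"{size}B: ~{quick_vram_gb(size):.0f} GB")
```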