"The model fits in VRAM but it's crawling." This guide covers the common causes of slow inference and how to diagnose them.

Quick Diagnostic

Is the model fully in VRAM?
│
├─ NO → Problem: Memory offloading
│       Solution: Smaller model, more quantization, or more VRAM
│
└─ YES → Check these in order:
   ├─ GPU utilization low? → Wrong backend or settings
   ├─ GPU temp >83°C? → Thermal throttling
   ├─ Using CPU instead of GPU? → Backend misconfiguration
   ├─ Long context? → KV cache may be spilling
   └─ Expected speed for your hardware? → Might just be bandwidth-limited

Cause #1: Memory Offloading

The most common cause of slow inference. When a model doesn't fully fit in VRAM, layers get offloaded to system RAM or even disk.

How to Detect
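When weights spill out of VRAM, the loader usually says so. A few checks (the exact log wording varies by version):

# llama.cpp: look for the layer-offload line in the load log; anything less
# than all layers on GPU means some are running from system RAM
./main -m model.gguf -ngl 99 2>&1 | grep -i offload

# Ollama: the PROCESSOR column shows the CPU/GPU split (100% GPU is the goal)
ollama ps

# NVIDIA: compare the model's file size against VRAM actually in use
nvidia-smi --query-gpu=memory.used,memory.total --format=csv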

Why It's So Slow

Memory Type         Bandwidth     Relative Speed
VRAM (GDDR6X)       ~1,000 GB/s   1× (baseline)
System RAM (DDR5)   ~90 GB/s      0.09× (11× slower)
NVMe SSD            ~7 GB/s       0.007× (~140× slower)

Solutions

  1. Use more aggressive quantization: Q4 instead of Q6 can make a model fit (a quick fit check follows this list)
  2. Use a smaller model: A 13B model that fits beats a 70B that doesn't
  3. Reduce context length: Shorter context = smaller KV cache
  4. Get more VRAM: Sometimes the only real fix
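A rough fit check, assuming a GGUF model whose file size approximates its weight footprint; leave a couple of GB of headroom for the KV cache and runtime overhead:

# Compare model size on disk with free VRAM
ls -lh model.gguf
nvidia-smi --query-gpu=memory.total,memory.used --format=csv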

Cause #2: Thermal Throttling

GPUs reduce clock speeds when they get too hot to prevent damage.

How to Detect
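Watch temperature and clocks together; a hot GPU whose clocks sag under sustained load is throttling:

# Sampled every second: SM clocks dropping as temperature approaches the
# low-to-mid 80s (°C) is the signature of thermal throttling
nvidia-smi --query-gpu=temperature.gpu,clocks.sm,utilization.gpu --format=csv -l 1

# Ask the driver directly why clocks are reduced
nvidia-smi -q -d PERFORMANCE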

Solutions

  1. Improve airflow: Case fans, remove obstructions
  2. Repaste the GPU: Thermal paste degrades over time
  3. Undervolt or power-limit: Reduce voltage or board power for less heat with minimal performance loss (see the example after this list)
  4. Add cooling: Aftermarket coolers, deshroud and add case fans
  5. AC or ambient temp: Room temperature matters
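If you can't undervolt directly (common on headless Linux), capping board power gets most of the same effect. The right wattage depends on your card:

# Cap board power (requires root); throughput usually drops far less than the wattage
sudo nvidia-smi -pl 300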

Cause #3: Wrong Backend or Settings

Inference may be running on the CPU when it should be on the GPU, or running without optimizations your hardware supports.

How to Detect
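The tell is a busy CPU and an idle GPU during generation:

# GPU should be busy and VRAM populated while generating; near-0% GPU
# utilization with a pegged CPU points at a CPU-only build or config
nvidia-smi -l 1

# PyTorch: confirm the install is a CUDA build that actually sees the GPU
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"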

Common Fixes

llama.cpp / Ollama
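Two usual culprits: zero GPU layers, or a binary built without GPU support (flag names current as of recent llama.cpp builds; check --help on yours):

# Offload all layers to the GPU; some setups default to 0
./main -m model.gguf -ngl 99

# Rebuild with GPU support if the load log never mentions CUDA/Metal
cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release

# Ollama: raise the GPU layer count if the model is splitting to CPU
# (PARAMETER num_gpu 99 in a Modelfile, or "options": {"num_gpu": 99} via the API)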

PyTorch / Transformers
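A minimal sketch, assuming a CUDA build of PyTorch plus the accelerate package; the model id is a placeholder:

python - <<'PY'
# Load in half precision directly onto the GPU instead of CPU FP32
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # placeholder; substitute your model
    device_map="auto",             # let accelerate place weights on the GPU
    torch_dtype=torch.float16,     # FP16 halves memory and bandwidth vs FP32
)
print(next(model.parameters()).device)  # expect cuda:0; cpu means misconfiguration
PY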

General
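Keep drivers and CUDA current, and enable Flash Attention where the backend supports it; in llama.cpp, for example:

# Flash Attention cuts attention memory traffic (also required for
# quantized KV cache in llama.cpp)
./main -m model.gguf -ngl 99 -fa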

Cause #4: KV Cache Overflow

Even if weights fit, the KV cache grows with context length and can overflow VRAM.
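The cache grows linearly with context. A back-of-envelope sketch, using Llama 3 8B's architecture (32 layers, 8 KV heads via GQA, head dim 128); verify the numbers for your model:

# bytes ≈ 2 (K and V) × layers × KV heads × head dim × context × bytes per element
echo $(( 2 * 32 * 8 * 128 * 8192 * 2 ))   # 1073741824 ≈ 1 GiB at 8K context, FP16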

How to Detect
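Slowdown that appears only once the conversation gets long, while short prompts stay fast, is the classic sign:

# Watch VRAM usage climb as context accumulates
nvidia-smi --query-gpu=memory.used --format=csv -l 1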

KV Cache Size Examples (FP16, GQA with 8 KV heads)

Model         4K Context   16K Context   32K Context
Llama 3 8B    ~0.5 GB      ~2 GB         ~4 GB
Llama 3 70B   ~1.3 GB      ~5 GB         ~10 GB

Solutions

  1. Use shorter context: Don't use 32K if you only need 4K
  2. Enable KV cache quantization: An INT8 KV cache uses half the memory of FP16 (flags shown below)
  3. Start new conversations: Clear accumulated context
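In llama.cpp, for instance (flag names vary across versions, and the quantized V cache requires Flash Attention):

# 8-bit K and V caches halve KV memory
./main -m model.gguf -ngl 99 -fa --cache-type-k q8_0 --cache-type-v q8_0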

Cause #5: Multi-GPU Inefficiency

Adding a second GPU doesn't double performance — interconnect bandwidth limits scaling.

How to Detect
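Check the interconnect topology and whether the GPUs spend their time waiting on each other:

# Show how the GPUs are connected: NV# means NVLink; PIX/PXB/PHB/SYS mean
# PCIe hops of increasing distance
nvidia-smi topo -m

# Per-GPU utilization; GPUs alternating between busy and idle suggests
# they are stalling on the interconnect
nvidia-smi dmon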

Why It Happens

GPUs need to communicate during inference. PCIe is slow compared to NVLink:
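Approximate figures (generation and bridge dependent):

Link                       Bandwidth
PCIe 4.0 x16               ~32 GB/s per direction
NVLink bridge (RTX 3090)   ~112 GB/s total
NVLink (A100)              ~600 GB/s total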

Solutions

  1. Use NVLink if available: Much better scaling
  2. Pipeline parallelism: Can reduce communication needs
  3. Accept the limitation: Multi-GPU over PCIe helps with capacity, less with speed

Cause #6: It's Actually Normal

Sometimes the speed is correct for your hardware — you just expected more.

Expected Performance (Approximate)

Hardware    7B Q4        13B Q4      70B Q4
RTX 4090    ~150 tok/s   ~80 tok/s   <5 tok/s (offload)
RTX 3090    ~120 tok/s   ~60 tok/s   <5 tok/s (offload)
M2 Ultra    ~100 tok/s   ~50 tok/s   ~18 tok/s
M3 Max      ~80 tok/s    ~35 tok/s   ~10 tok/s

If you're in this ballpark, your setup is probably working correctly. The theoretical maximum is set by memory bandwidth: generating each token reads essentially every weight once.
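As a sanity check, divide bandwidth by the bytes read per token (roughly the model's size on disk for a quantized model):

# RTX 4090 (~1,000 GB/s) with a ~4 GB 7B Q4 model
echo $(( 1000 / 4 ))   # ≈ 250 tok/s theoretical ceiling; real throughput is well below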

Diagnostic Commands

NVIDIA

# Watch GPU stats in real-time
nvidia-smi -l 1

# Check CUDA version
nvcc --version

# Monitor detailed GPU metrics
nvidia-smi dmon

Apple Silicon

# GPU power/utilization (built into macOS; needs sudo)
sudo powermetrics --samplers gpu_power

# Memory pressure
vm_stat

General

# llama.cpp: verify GPU layers
./main -m model.gguf -ngl 99 --verbose

# Ollama: check what's loaded
ollama ps

Still Stuck?

Checklist

  1. Is the model actually on GPU? (Check GPU memory usage)
  2. Is GPU utilization high during inference?
  3. Is temperature under control? (<83°C)
  4. Is the model small enough for your VRAM + KV cache?
  5. Are you using optimized builds? (CUDA, Metal, not CPU)
  6. Is Flash Attention enabled?
  7. Are you comparing against realistic benchmarks?