Why Is My Setup Slow?
Troubleshooting inference performance
"The model fits in VRAM but it's crawling." This guide covers the common causes of slow inference and how to diagnose them.
Quick Diagnostic
Match your symptom to the most likely cause, then jump to that section:
| Symptom | Likely cause |
|---|---|
| Slow from the first token; VRAM full and system RAM climbing | #1: Memory offloading |
| Fine at first, degrades over a long session | #2: Thermal throttling or #4: KV cache overflow |
| GPU utilization near 0%, CPU pegged | #3: Wrong backend or settings |
| A second GPU barely helps | #5: Multi-GPU inefficiency |
| Speed roughly matches the tables below | #6: It's actually normal |
Cause #1: Memory Offloading
The most common cause of slow inference. When a model doesn't fully fit in VRAM, layers get offloaded to system RAM or even disk.
How to Detect
- VRAM usage at or near 100%
- System RAM usage increasing
- Speed drops dramatically with longer context
- Inference starts okay, slows down during generation
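A minimal watcher for the first two signs, assuming an NVIDIA GPU with `nvidia-smi` on the PATH:
```python
# Poll VRAM usage once per second. VRAM pinned near 100% while system
# RAM keeps climbing during generation is the classic offload signature.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    for i, line in enumerate(out.splitlines()):
        used, total = map(int, line.split(", "))
        print(f"GPU{i}: {used}/{total} MiB ({100 * used / total:.0f}%)")
    time.sleep(1)
```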
Why It's So Slow
| Memory Type | Bandwidth | Relative Speed |
|---|---|---|
| VRAM (GDDR6X) | ~1,000 GB/s | 1× |
| System RAM (DDR5) | ~90 GB/s | 0.09× (11× slower) |
| NVMe SSD | ~7 GB/s | 0.007× (140× slower) |
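These numbers explain the cliff. During generation every weight is streamed once per token, so memory bandwidth divided by model size is a hard ceiling on speed. A back-of-envelope sketch (the 4.1 GB model size is illustrative of a 7B model at Q4):
```python
# Upper bound on generation speed: each new token streams every weight
# through the chip once, so tok/s <= memory bandwidth / model size.
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 4.1  # illustrative: a 7B model at Q4
for name, bw in [("VRAM (GDDR6X)", 1000), ("DDR5 RAM", 90), ("NVMe SSD", 7)]:
    print(f"{name:>14}: <= {max_tokens_per_sec(model_gb, bw):6.1f} tok/s")
```
Even a small slice of the model landing on a slower tier drags the whole run toward that tier's bound, which is why partial offload hurts so much.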
Solutions
- Use more aggressive quantization: Q4 instead of Q6 can make a model fit
- Use a smaller model: A 13B model that fits beats a 70B that doesn't
- Reduce context length: Shorter context = smaller KV cache
- Get more VRAM: Sometimes the only real fix
Cause #2: Thermal Throttling
GPUs reduce clock speeds when they get too hot to prevent damage.
How to Detect
- GPU temperature >83°C (check with `nvidia-smi` or GPU-Z)
- Performance degrades over time during a session
- Fans running at maximum speed
- Clock speeds dropping (visible in monitoring tools)
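To check all of these at once, a small sketch using `nvidia-smi`'s query interface (NVIDIA only); the throttle-reasons field is a bitmask that is nonzero while the card is being slowed:
```python
# Snapshot temperature, current SM clock, and active throttle reasons.
import subprocess

fields = "temperature.gpu,clocks.sm,clocks_throttle_reasons.active"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # e.g. "84, 1695 MHz, 0x0000000000000020" -> SW thermal slowdown
```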
Solutions
- Improve airflow: Case fans, remove obstructions
- Repaste the GPU: Thermal paste degrades over time
- Undervolt: Reduce voltage for less heat with minimal performance loss
- Add cooling: Aftermarket coolers, deshroud and add case fans
- AC or ambient temp: Room temperature matters
Cause #3: Wrong Backend or Settings
Your inference might be running on CPU when it should be on GPU, or missing optimizations.
How to Detect
- GPU utilization near 0% while generating
- CPU at 100% instead
- Performance way below what others report for same hardware
Common Fixes
llama.cpp / Ollama
- Ensure the GPU layers flag is set: `-ngl 99` or `--n-gpu-layers 99`
- Check that a CUDA/Metal/ROCm build is being used
- Verify with `nvidia-smi` that GPU memory is being used
PyTorch / Transformers
- Ensure the model is on the GPU: `model.to("cuda")`
- Check that `torch.cuda.is_available()` returns True
- Verify the correct CUDA version is installed
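A quick sanity check for all three, sketched with a stand-in layer in place of your real model:
```python
import torch

print("CUDA available:", torch.cuda.is_available())  # must be True
print("Built for CUDA:", torch.version.cuda)         # None on a CPU-only build

model = torch.nn.Linear(8, 8)           # stand-in for your real model
model.to("cuda")
print(next(model.parameters()).device)  # expect: cuda:0
```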
General
- Flash Attention enabled? Significant speedup if supported (see the sketch after this list)
- Using optimized kernels? (cuBLAS, Metal, etc.)
- Batch size appropriate? (Usually 1 for interactive use)
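In Hugging Face Transformers, for example, Flash Attention is opt-in at load time. This sketch assumes the `flash-attn` package is installed and an Ampere-or-newer NVIDIA GPU; the model id is illustrative:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",              # illustrative model id
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",   # needs the flash-attn package
    device_map="cuda",
)
```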
Cause #4: KV Cache Overflow
Even if weights fit, the KV cache grows with context length and can overflow VRAM.
How to Detect
- Speed is fine at start of conversation, degrades over time
- Long prompts cause significant slowdown
- VRAM usage increases during generation
KV Cache Size Examples (FP16)
| Model | 4K Context | 16K Context | 32K Context |
|---|---|---|---|
| Llama 3 8B | ~2 GB | ~8 GB | ~16 GB |
| Llama 3 70B | ~10 GB | ~40 GB | ~84 GB |
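These figures follow from a simple formula: per context token, the cache stores one key and one value vector per layer per attention head. A sketch that roughly reproduces the table (like the table, it assumes one KV head per attention head; grouped-query-attention models need proportionally less):
```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
#                  x bytes per element x context tokens
def kv_cache_gb(layers, kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

print(kv_cache_gb(32, 32, 128, 4096))   # 8B-shaped model at 4K   -> ~2.1 GB
print(kv_cache_gb(80, 64, 128, 32768))  # 70B-shaped model at 32K -> ~85.9 GB
```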
Solutions
- Use shorter context: Don't use 32K if you only need 4K
- Enable KV cache quantization: INT8 KV cache uses half the memory
- Start new conversations: Clear accumulated context
Cause #5: Multi-GPU Inefficiency
Adding a second GPU doesn't double performance — interconnect bandwidth limits scaling.
How to Detect
- Multi-GPU setup but performance barely better than single GPU
- One GPU at 100%, others lower
- PCIe bandwidth showing as bottleneck in profiling
Why It Happens
GPUs need to communicate during inference. PCIe is slow compared to NVLink:
- PCIe 4.0 x16: ~32 GB/s
- NVLink: ~900 GB/s (NVLink 4 on data-center GPUs; the RTX 3090's consumer bridge is ~112 GB/s)
Solutions
- Use NVLink if available: Much better scaling
- Pipeline parallelism: Can reduce communication needs
- Accept the limitation: Multi-GPU over PCIe helps with capacity, less with speed
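To see the interconnect penalty directly, this sketch times a raw device-to-device copy (requires at least two CUDA GPUs; expect tens of GB/s over PCIe and far more over NVLink):
```python
import time
import torch

x = torch.empty(1024**3 // 4, dtype=torch.float32, device="cuda:0")  # 1 GiB
torch.ones(1, device="cuda:1")     # warm up the second device's context
torch.cuda.synchronize("cuda:0")

t0 = time.perf_counter()
y = x.to("cuda:1")                 # device-to-device transfer
torch.cuda.synchronize("cuda:1")
dt = time.perf_counter() - t0

print(f"{x.numel() * 4 / dt / 1e9:.1f} GB/s")
```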
Cause #6: It's Actually Normal
Sometimes the speed is correct for your hardware — you just expected more.
Expected Performance (Approximate)
| Hardware | 7B Q4 | 13B Q4 | 70B Q4 |
|---|---|---|---|
| RTX 4090 | ~150 tok/s | ~80 tok/s | ~25 tok/s (offload) |
| RTX 3090 | ~120 tok/s | ~60 tok/s | ~20 tok/s (offload) |
| M2 Ultra | ~100 tok/s | ~50 tok/s | ~18 tok/s |
| M3 Max | ~80 tok/s | ~35 tok/s | ~10 tok/s |
If you're in this ballpark, your setup is probably working correctly: token generation is memory-bandwidth-bound, so the ceiling is roughly memory bandwidth divided by model size (see Cause #1).
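To measure your own numbers for comparison, here is a sketch against Ollama's local REST API, where `eval_count` (tokens generated) and `eval_duration` (nanoseconds) come back in the non-streaming response; the model name is illustrative:
```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",                      # illustrative model name
        "prompt": "Write a haiku about GPUs.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.load(urllib.request.urlopen(req))

# eval_count = tokens generated; eval_duration is in nanoseconds
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")
```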
Diagnostic Commands
NVIDIA
# Watch GPU stats in real-time
nvidia-smi -l 1
# Check CUDA version
nvcc --version
# Monitor detailed GPU metrics
nvidia-smi dmon
Apple Silicon
# GPU usage (requires additional tools)
sudo powermetrics --samplers gpu_power
# Memory pressure
vm_stat
General
# llama.cpp: verify GPU layers (the binary is named llama-cli in newer builds)
./main -m model.gguf -ngl 99 --verbose
# Ollama: check what's loaded
ollama ps
Still Stuck?
Checklist
- Is the model actually on GPU? (Check GPU memory usage)
- Is GPU utilization high during inference?
- Is temperature under control? (<83°C)
- Is the model small enough for your VRAM + KV cache?
- Are you using optimized builds? (CUDA, Metal, not CPU)
- Is Flash Attention enabled?
- Are you comparing against realistic benchmarks?