Why Is My Setup Slow?
Troubleshooting inference performance
"The model fits in VRAM but it's crawling." This guide covers the common causes of slow inference and how to diagnose them.
Quick Diagnostic
Match your symptom to the most likely cause, then jump to that section:
| Symptom | Likely cause |
|---|---|
| Slow from the first token; VRAM full and system RAM climbing | #1: Memory offloading |
| Fine at first, degrades over a long session | #2: Thermal throttling or #4: KV cache overflow |
| GPU utilization near 0%, CPU pegged | #3: Wrong backend or settings |
| A second GPU barely helps | #5: Multi-GPU inefficiency |
| Speed roughly matches the tables below | #6: It's actually normal |
Cause #1: Memory Offloading
The most common cause of slow inference. When a model doesn't fully fit in VRAM, layers get offloaded to system RAM or even disk.
How to Detect
- VRAM usage at or near 100%
- System RAM usage increasing
- Speed drops dramatically with longer context
- Inference starts okay, slows down during generation
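A minimal watcher for the first two signs, assuming an NVIDIA GPU with `nvidia-smi` on the PATH:
```python
# Poll VRAM usage once per second. VRAM pinned near 100% while system
# RAM keeps climbing during generation is the classic offload signature.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    for i, line in enumerate(out.splitlines()):
        used, total = map(int, line.split(", "))
        print(f"GPU{i}: {used}/{total} MiB ({100 * used / total:.0f}%)")
    time.sleep(1)
```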
Why It's So Slow
| Memory Type | Bandwidth | Relative Speed |
|---|---|---|
| VRAM (GDDR6X) | ~1,000 GB/s | 1× |
| System RAM (DDR5) | ~90 GB/s | 0.09× (11× slower) |
| NVMe SSD | ~7 GB/s | 0.007× (140× slower) |
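These numbers explain the cliff. During generation every weight is streamed once per token, so memory bandwidth divided by model size is a hard ceiling on speed. A back-of-envelope sketch (the 4.1 GB model size is illustrative of a 7B model at Q4):
```python
# Upper bound on generation speed: each new token streams every weight
# through the chip once, so tok/s <= memory bandwidth / model size.
def max_tokens_per_sec(model_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 4.1  # illustrative: a 7B model at Q4
for name, bw in [("VRAM (GDDR6X)", 1000), ("DDR5 RAM", 90), ("NVMe SSD", 7)]:
    print(f"{name:>14}: <= {max_tokens_per_sec(model_gb, bw):6.1f} tok/s")
```
Even a small slice of the model landing on a slower tier drags the whole run toward that tier's bound, which is why partial offload hurts so much.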
Solutions
- Use more aggressive quantization: Q4 instead of Q6 can make a model fit
- Use a smaller model: A 13B model that fits beats a 70B that doesn't
- Reduce context length: Shorter context = smaller KV cache
- Get more VRAM: Sometimes the only real fix
Cause #2: Thermal Throttling
GPUs reduce clock speeds when they get too hot to prevent damage.
How to Detect
- GPU temperature >83°C (check with `nvidia-smi` or GPU-Z)
- Performance degrades over time during a session
- Fans running at maximum speed
- Clock speeds dropping (visible in monitoring tools)
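To check all of these at once, a small sketch using `nvidia-smi`'s query interface (NVIDIA only); the throttle-reasons field is a bitmask that is nonzero while the card is being slowed:
```python
# Snapshot temperature, current SM clock, and active throttle reasons.
import subprocess

fields = "temperature.gpu,clocks.sm,clocks_throttle_reasons.active"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print(out)  # e.g. "84, 1695 MHz, 0x0000000000000020" -> SW thermal slowdown
```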
Solutions
- Improve airflow: Case fans, remove obstructions
- Repaste the GPU: Thermal paste degrades over time
- Undervolt: Reduce voltage for less heat with minimal performance loss
- Add cooling: Aftermarket coolers, deshroud and add case fans
- AC or ambient temp: Room temperature matters
Cause #3: Wrong Backend or Settings
Your inference might be running on CPU when it should be on GPU, or missing optimizations.
How to Detect
- GPU utilization near 0% while generating
- CPU at 100% instead
- Performance way below what others report for same hardware
Common Fixes
llama.cpp / Ollama
- Ensure the GPU layers flag is set: `-ngl 99` or `--n-gpu-layers 99`
- Check that a CUDA/Metal/ROCm build is being used
- Verify with `nvidia-smi` that GPU memory is being used
PyTorch / Transformers
- Ensure the model is on the GPU: `model.to("cuda")`
- Check that `torch.cuda.is_available()` returns True
- Verify the correct CUDA version is installed
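A quick sanity check for all three, sketched with a stand-in layer in place of your real model:
```python
import torch

print("CUDA available:", torch.cuda.is_available())  # must be True
print("Built for CUDA:", torch.version.cuda)         # None on a CPU-only build

model = torch.nn.Linear(8, 8)           # stand-in for your real model
model.to("cuda")
print(next(model.parameters()).device)  # expect: cuda:0
```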
General
- Flash Attention enabled? Significant speedup if supported (see the sketch after this list)
- Using optimized kernels? (cuBLAS, Metal, etc.)
- Batch size appropriate? (Usually 1 for interactive use)
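In Hugging Face Transformers, for example, Flash Attention is opt-in at load time. This sketch assumes the `flash-attn` package is installed and an Ampere-or-newer NVIDIA GPU; the model id is illustrative:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",              # illustrative model id
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",   # needs the flash-attn package
    device_map="cuda",
)
```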
Cause #4: KV Cache Overflow
Even if weights fit, the KV cache grows with context length and can overflow VRAM.
How to Detect
- Speed is fine at start of conversation, degrades over time
- Long prompts cause significant slowdown
- VRAM usage increases during generation
KV Cache Size Examples (FP16)
| Model | 4K Context | 16K Context | 32K Context |
|---|---|---|---|
| Llama 3 8B | ~2 GB | ~8 GB | ~16 GB |
| Llama 3 70B | ~10 GB | ~40 GB | ~84 GB |
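These figures follow from a simple formula: per context token, the cache stores one key and one value vector per layer per attention head. A sketch that roughly reproduces the table (like the table, it assumes one KV head per attention head; grouped-query-attention models need proportionally less):
```python
# KV cache bytes = 2 (K and V) x layers x kv_heads x head_dim
#                  x bytes per element x context tokens
def kv_cache_gb(layers, kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * n_tokens / 1e9

print(kv_cache_gb(32, 32, 128, 4096))   # 8B-shaped model at 4K   -> ~2.1 GB
print(kv_cache_gb(80, 64, 128, 32768))  # 70B-shaped model at 32K -> ~85.9 GB
```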
Solutions
- Use shorter context: Don't use 32K if you only need 4K
- Enable KV cache quantization: INT8 KV cache uses half the memory
- Start new conversations: Clear accumulated context
Cause #5: Multi-GPU Inefficiency
Adding a second GPU doesn't double performance — interconnect bandwidth limits scaling.
How to Detect
- Multi-GPU setup but performance barely better than single GPU
- One GPU at 100%, others lower
- PCIe bandwidth showing as bottleneck in profiling
Why It Happens
GPUs need to communicate during inference. PCIe is slow compared to NVLink:
- PCIe 4.0 x16: ~32 GB/s
- NVLink: ~900 GB/s (NVLink 4 on data-center GPUs; the RTX 3090's consumer bridge is ~112 GB/s)
Solutions
- Use NVLink if available: Much better scaling
- Pipeline parallelism: Can reduce communication needs
- Accept the limitation: Multi-GPU over PCIe helps with capacity, less with speed
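To see the interconnect penalty directly, this sketch times a raw device-to-device copy (requires at least two CUDA GPUs; expect tens of GB/s over PCIe and far more over NVLink):
```python
import time
import torch

x = torch.empty(1024**3 // 4, dtype=torch.float32, device="cuda:0")  # 1 GiB
torch.ones(1, device="cuda:1")     # warm up the second device's context
torch.cuda.synchronize("cuda:0")

t0 = time.perf_counter()
y = x.to("cuda:1")                 # device-to-device transfer
torch.cuda.synchronize("cuda:1")
dt = time.perf_counter() - t0

print(f"{x.numel() * 4 / dt / 1e9:.1f} GB/s")
```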
Cause #6: It's Actually Normal
Sometimes the speed is correct for your hardware — you just expected more.
Expected Performance (Approximate)
| Hardware | 7B Q4 | 13B Q4 | 70B Q4 |
|---|---|---|---|
| RTX 4090 | ~150 tok/s | ~80 tok/s | ~25 tok/s (offload) |
| RTX 3090 | ~120 tok/s | ~60 tok/s | ~20 tok/s (offload) |
| M2 Ultra | ~100 tok/s | ~50 tok/s | ~18 tok/s |
| M3 Max | ~80 tok/s | ~35 tok/s | ~10 tok/s |
If you're in this ballpark, your setup is probably working correctly: token generation is memory-bandwidth-bound, so the ceiling is roughly memory bandwidth divided by model size (see Cause #1).
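To measure your own numbers for comparison, here is a sketch against Ollama's local REST API, where `eval_count` (tokens generated) and `eval_duration` (nanoseconds) come back in the non-streaming response; the model name is illustrative:
```python
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",                      # illustrative model name
        "prompt": "Write a haiku about GPUs.",
        "stream": False,
    }).encode(),
    headers={"Content-Type": "application/json"},
)
resp = json.load(urllib.request.urlopen(req))

# eval_count = tokens generated; eval_duration is in nanoseconds
print(f"{resp['eval_count'] / (resp['eval_duration'] / 1e9):.1f} tok/s")
```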
Diagnostic Commands
NVIDIA
# Watch GPU stats in real-time
nvidia-smi -l 1
# Check CUDA version
nvcc --version
# Monitor detailed GPU metrics
nvidia-smi dmon
Apple Silicon
# GPU usage (requires additional tools)
sudo powermetrics --samplers gpu_power
# Memory pressure
vm_stat
General
# llama.cpp: verify GPU layers (the binary is named llama-cli in newer builds)
./main -m model.gguf -ngl 99 --verbose
# Ollama: check what's loaded
ollama ps
Still Stuck?
Checklist
- Is the model actually on GPU? (Check GPU memory usage)
- Is GPU utilization high during inference?
- Is temperature under control? (<83°C)
- Is the model small enough for your VRAM + KV cache?
- Are you using optimized builds? (CUDA, Metal, not CPU)
- Is Flash Attention enabled?
- Are you comparing against realistic benchmarks?