Performance
Measuring, benchmarking, and optimizing
How fast is fast enough? What metrics matter? How do you compare different setups fairly? This section covers performance measurement and optimization for local LLM inference.
Key Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens/second (tok/s) | Generation speed | How fast responses appear |
| Time to First Token (TTFT) | Prefill latency | How long before output starts |
| Throughput | Aggregate tok/s across all concurrent requests | Matters when serving multiple users |
| Memory usage | VRAM consumption | Determines what fits |
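When Ollama is the backend, you do not need a stopwatch for these: the non-streamed `/api/generate` response carries nanosecond timing fields (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, `load_duration`) that map directly onto the metrics above. A minimal sketch with curl and jq:

```bash
# Run one completion and derive the key metrics from Ollama's
# timing fields (all durations are reported in nanoseconds).
RESPONSE=$(curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": false
}')

# Generation speed: tokens generated / generation time
echo "$RESPONSE" | jq -r '"gen tok/s:     \(.eval_count / (.eval_duration / 1e9))"'

# Prefill speed: prompt tokens / prompt processing time
echo "$RESPONSE" | jq -r '"prefill tok/s: \(.prompt_eval_count / (.prompt_eval_duration / 1e9))"'

# Rough TTFT proxy: model load time + prompt processing time
echo "$RESPONSE" | jq -r '"approx TTFT:   \((.load_duration + .prompt_eval_duration) / 1e9) s"'
```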
What's "Good" Performance?
For Interactive Use (Chat, Coding)
| Speed | Experience |
|---|---|
| >50 tok/s | Excellent — feels instant |
| 30-50 tok/s | Good — comfortable reading speed |
| 15-30 tok/s | Acceptable — noticeable but usable |
| 5-15 tok/s | Slow — requires patience |
| <5 tok/s | Painful — consider smaller model |
Reference Speeds
Approximate tok/s for common setups (Llama 3 8B Q4, 4K context):
| Hardware | Speed |
|---|---|
| RTX 4090 | ~100-120 tok/s |
| RTX 3090 | ~80-100 tok/s |
| RTX 4070 Ti | ~60-80 tok/s |
| M3 Max | ~50-70 tok/s |
| RTX 3060 12GB | ~40-50 tok/s |
| M2 Pro | ~30-40 tok/s |
Benchmarking
llama.cpp Benchmarking
# Built-in benchmark
./llama-bench -m model.gguf -n 128 -p 512
# Parameters:
# -n: tokens to generate
# -p: prompt tokens
# -r: number of runs
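llama-bench also accepts comma-separated value lists for its parameters, so a single invocation can sweep several prompt lengths and show how prefill cost grows with context. A sketch (the model path and sizes are illustrative):

```bash
# One result row per configuration; each is run -r times and averaged.
# pp rows report prompt processing (prefill), tg rows report generation.
./llama-bench -m model.gguf -p 512,2048,8192 -n 128 -r 5
```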
Ollama
# Check model info and performance
ollama run llama3 --verbose
# After the response, timing stats are printed: load duration,
# prompt eval rate (prefill), and eval rate (generation tok/s)
Manual Timing
# Time a generation
time curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a story about a robot.",
"stream": false
}'
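Note that `time` on a non-streamed request measures total wall time, which lumps prefill and generation together. To approximate TTFT specifically, stream the response and time the first chunk. A rough sketch (uses GNU date for sub-second timestamps):

```bash
# Approximate TTFT: time until the first streamed chunk arrives.
# Includes network overhead, so treat it as an upper bound.
START=$(date +%s.%N)   # GNU date; on macOS install coreutils and use gdate
curl -sN http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": true
}' | head -n 1 > /dev/null
END=$(date +%s.%N)
echo "TTFT: $(echo "$END - $START" | bc) s"
```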
Performance Optimization
1. Maximize GPU Utilization
# Ensure all layers are offloaded to the GPU
./llama-cli -m model.gguf -ngl 99
# Check with nvidia-smi that GPU is being used
watch -n 1 nvidia-smi
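If you only want the relevant fields, nvidia-smi can query and loop on them directly:

```bash
# Poll GPU utilization, VRAM use, and temperature once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu \
           --format=csv -l 1
```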
2. Right-Size Context
# Use only the context you need
./llama-cli -m model.gguf -c 4096   # instead of 32768
# Smaller context = faster prefill, less memory
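The memory side of this tradeoff is the KV cache, which grows linearly with context. As a worked example, Llama 3 8B has 32 layers, 8 KV heads (GQA), and a head dimension of 128, so an FP16 cache costs 2 × 32 × 8 × 128 × 2 bytes = 128 KiB per cached token:

```bash
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * ctx
# Constants below are the Llama 3 8B architecture values with an FP16 cache.
awk 'BEGIN {
  per_token = 2 * 32 * 8 * 128 * 2            # bytes cached per token
  for (ctx = 4096; ctx <= 32768; ctx *= 2)
    printf "ctx %6d -> KV cache %4.1f GiB\n", ctx, per_token * ctx / 2^30
}'
```

Dropping from 32K to 4K context frees about 3.5 GiB on this model; llama.cpp can also quantize the cache itself, which shrinks these numbers further.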
3. Optimal Quantization
| Goal | Recommendation |
|---|---|
| Best quality | Q6_K or Q8_0 |
| Balanced | Q5_K_M or Q4_K_M |
| Maximum speed | Q4_K_S or Q3_K_M |
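If you keep the same model in several quantizations, the speed side of this table is easy to measure yourself. A sketch, with hypothetical filenames standing in for whatever GGUF files you have downloaded:

```bash
# Benchmark the same model at several quantization levels
for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-bench -m "llama3-8b-$q.gguf" -p 512 -n 128 -r 5
done
```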
4. Flash Attention
Flash Attention is an optimized attention implementation:
- Enable it with the -fa flag in llama.cpp (some recent builds turn it on by default; check --help for your version)
- Significantly faster for long contexts
- Reduces memory usage
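You can measure the effect directly: llama-bench exposes the same -fa setting, and comma-separated values let one run compare both. A sketch, assuming a build where -fa takes 0/1 values:

```bash
# Compare flash attention off (0) vs on (1); the gap widens
# as the prompt gets longer.
./llama-bench -m model.gguf -fa 0,1 -p 512,4096 -n 128
```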
5. Batch Size Tuning
# Larger batch speeds up prompt processing (uses more memory)
./llama-cli -m model.gguf -b 512
# Smaller batch for low-memory setups
./llama-cli -m model.gguf -b 128
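As with the other knobs, llama-bench can sweep batch sizes in one run to find the sweet spot for your memory budget (a sketch):

```bash
# One result row per batch size; watch VRAM alongside the numbers
./llama-bench -m model.gguf -b 128,256,512 -p 512 -n 128
```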
Common Performance Issues
Model Running on CPU Instead of GPU
Symptoms: Very slow, CPU at 100%, GPU idle
Fix: Ensure -ngl 99 and a GPU-enabled build (compiled with CUDA, Metal, or another GPU backend)
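With Ollama, `ollama ps` confirms the diagnosis without watching nvidia-smi; it shows how each loaded model is split between CPU and GPU:

```bash
# The PROCESSOR column reports the split, e.g. "100% GPU" or "48%/52% CPU/GPU"
ollama ps
```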
Thermal Throttling
Symptoms: Performance drops after a few minutes
Fix: Improve cooling, check temps with nvidia-smi
Memory Swapping
Symptoms: Stuttering, inconsistent speed
Fix: Reduce context, use a smaller model, or quantize more aggressively
PCIe Bottleneck (Multi-GPU)
Symptoms: Second GPU doesn't improve speed much
Reality: Splitting layers across GPUs mainly adds memory, not speed; the layers execute sequentially and PCIe transfers add overhead. NVLink helps where available, or accept the limitation
Theoretical vs Real Performance
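Token generation is usually memory-bandwidth bound: producing each token requires streaming the entire set of weights through the GPU. That gives a simple ceiling: tok/s ≈ memory bandwidth / model size in bytes. Real systems land well under it because of kernel overhead, KV cache reads, and dequantization work. A rough estimator (the bandwidth figure is the published RTX 4090 spec; the model size is a typical Llama 3 8B Q4_K_M file):

```bash
# Upper bound on decode speed: bytes/s of bandwidth / bytes read per token
awk 'BEGIN {
  bandwidth_gb = 1008   # GB/s, RTX 4090 published spec
  model_gb     = 4.9    # GB, approx. Llama 3 8B Q4_K_M file size
  printf "theoretical ceiling: ~%.0f tok/s\n", bandwidth_gb / model_gb
}'
```

That works out to roughly 200 tok/s, so the ~100-120 tok/s in the reference table above is about half of theoretical, which is normal. If you are far below half, look for one of the issues listed above rather than expecting a tuning flag to close the gap.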
Benchmarking Pitfalls
Fair Comparisons Require
- Same model: Comparing 7B to 70B is meaningless
- Same quantization: Q4 vs Q8 changes everything
- Same context length: Longer context = slower
- Same prompt length: Affects prefill time
- Warmed up: First run is often slower
- Same generation length: More tokens = more time
When to Optimize vs Accept
Worth Optimizing
- GPU not being used (easy fix)
- Thermal throttling (cooling fix)
- Using the wrong quantization
- Context far larger than needed
Accept the Limitation
- Already at the memory bandwidth limit
- Model barely fits (CPU offloading is unavoidable)
- PCIe multi-GPU overhead
- Situations where the only real fix is bigger/faster hardware