How fast is fast enough? What metrics matter? How do you compare different setups fairly? This section covers performance measurement and optimization for local LLM inference.

Key Metrics

Metric                       What It Measures                   Why It Matters
Tokens/second (tok/s)        Generation speed                   How fast responses appear
Time to First Token (TTFT)   Prefill latency                    How long before output starts
Throughput                   Total tokens across all requests   Important when serving multiple users
Memory usage                 VRAM consumption                   Determines what fits

What's "Good" Performance?

For Interactive Use (Chat, Coding)

Speed          Rating        Experience
>50 tok/s      Excellent     Feels instant
30-50 tok/s    Good          Comfortable reading speed
15-30 tok/s    Acceptable    Noticeable but usable
5-15 tok/s     Slow          Requires patience
<5 tok/s       Painful       Consider a smaller model

Reference Speeds

Approximate tok/s for common setups (Llama 3 8B Q4, 4K context):

Hardware         Speed (tok/s)
RTX 4090         ~100-120
RTX 3090         ~80-100
RTX 4070 Ti      ~60-80
M3 Max           ~50-70
RTX 3060 12GB    ~40-50
M2 Pro           ~30-40

Benchmarking

llama.cpp Benchmarking

# Built-in benchmark
./llama-bench -m model.gguf -n 128 -p 512

# Parameters:
# -n: number of tokens to generate (measures generation speed)
# -p: prompt length in tokens (measures prefill speed)
# -r: number of repetitions to average over (default 5)
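
Recent llama.cpp builds accept comma-separated value lists for llama-bench's numeric flags (check ./llama-bench --help on your version), which makes it easy to sweep prompt depths and see how prefill cost grows with context:

# Sweep prefill sizes; each combination is benchmarked separately
./llama-bench -m model.gguf -p 512,2048,8192 -n 128 -r 5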

Ollama

# Check model info and performance
ollama run llama3 --verbose

# After the response, --verbose prints timing stats
# (load time, prompt eval rate, eval rate in tok/s)

Manual Timing

# Time a generation
time curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": false
}'
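
Note that time measures the whole request, including model load. The JSON that /api/generate returns carries nanosecond timing fields (prompt_eval_count, prompt_eval_duration, eval_count, eval_duration in current Ollama releases), so tok/s can be computed directly; a sketch with jq, plus a rough TTFT proxy using curl's own timer:

# Compute prefill and generation speed from Ollama's timing fields
# (durations are reported in nanoseconds)
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": false
}' | jq '{
  prefill_tok_s: (.prompt_eval_count / (.prompt_eval_duration / 1e9)),
  gen_tok_s: (.eval_count / (.eval_duration / 1e9))
}'

# Rough TTFT proxy: time until the first streamed byte arrives
curl -s -o /dev/null -w 'TTFT: %{time_starttransfer}s\n' \
  http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Hi", "stream": true}'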

Performance Optimization

1. Maximize GPU Utilization

# Ensure all layers are offloaded to the GPU
./llama-cli -m model.gguf -ngl 99  # binary named ./main in older llama.cpp builds

# Check with nvidia-smi that GPU is being used
watch -n 1 nvidia-smi
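
For a script-friendly view, nvidia-smi's query mode logs utilization and memory as CSV (standard nvidia-smi options):

# Log GPU utilization, VRAM, and temperature once per second
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,memory.total,temperature.gpu \
           --format=csv -l 1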

2. Right-Size Context

# Use only the context you need
./llama-cli -m model.gguf -c 4096  # instead of 32768

# Smaller context = faster prefill and a smaller KV cache

3. Optimal Quantization

Goal             Recommendation
Best quality     Q6_K or Q8_0
Balanced         Q5_K_M or Q4_K_M
Maximum speed    Q4_K_S or Q3_K_M
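
The sweet spot depends on your model and hardware, so benchmark the candidate files directly rather than guessing; a minimal sketch, assuming the quantized variants already exist on disk (file names here are hypothetical):

# Measure speed for each quantization of the same model
for m in model-Q4_K_S.gguf model-Q4_K_M.gguf model-Q5_K_M.gguf model-Q6_K.gguf; do
  ./llama-bench -m "$m" -p 512 -n 128
done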

4. Flash Attention

Flash Attention is an optimized attention implementation that produces the same output with far fewer reads and writes to GPU memory. It speeds up prompt processing and reduces memory use, with the biggest gains at long context; enable it as shown below.
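
A minimal example, assuming a llama.cpp build that exposes the -fa / --flash-attn flag (it needs a supported GPU backend such as CUDA or Metal):

# Enable Flash Attention alongside full GPU offload
./llama-cli -m model.gguf -ngl 99 -fa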

5. Batch Size Tuning

# Larger batch improves prompt-processing throughput (uses more memory)
./llama-cli -m model.gguf -b 512

# Smaller batch for low-memory setups
./llama-cli -m model.gguf -b 128
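
Batch size mainly affects prompt processing, so measure rather than guess; a sketch using llama-bench's -b flag (comma-separated lists work in recent builds):

# Compare prompt-processing speed at several batch sizes
./llama-bench -m model.gguf -b 128,256,512 -p 2048 -n 64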

Common Performance Issues

Model Running on CPU Instead of GPU

Symptoms: Very slow, CPU at 100%, GPU idle
Fix: Pass -ngl 99 and confirm the binary was built with GPU support (e.g., CUDA or Metal)

Thermal Throttling

Symptoms: Performance drops after a few minutes
Fix: Improve cooling, check temps with nvidia-smi

Memory Swapping

Symptoms: Stuttering, inconsistent speed
Fix: Reduce context, use a smaller model, or quantize more aggressively

PCIe Bottleneck (Multi-GPU)

Symptoms: Second GPU doesn't improve speed much
Reality: Splitting a model across GPUs forces activations over the PCIe bus, which is far slower than VRAM; NVLink helps where available, otherwise accept the limitation

Theoretical vs Real Performance

Theoretical max tok/s ≈ Memory Bandwidth / (2 × Parameters × Bytes per parameter)

Example: 1008 GB/s of RTX 4090-class bandwidth with a 70B model at Q4 (~0.5 bytes per parameter):
  1008 / (2 × 70 × 0.5) ≈ 14 tok/s theoretical
Real-world: ~12-15 tok/s, close to the theoretical ceiling. (A 70B Q4 file is roughly 35-40 GB, so this assumes the weights sit fully in GPU memory, e.g. split across two cards.)

This shows LLM inference is memory-bandwidth-bound: every generated token streams the entire model through memory.
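
The rule of thumb is easy to script; a small sketch that just evaluates the formula above (adjust bytes per parameter for your quantization: ~0.5 for Q4, ~1 for Q8, 2 for FP16; the script name is hypothetical):

# Usage: ./ceiling.sh <bandwidth GB/s> <params in billions> <bytes/param>
bandwidth=${1:-1008}; params_b=${2:-70}; bytes=${3:-0.5}
awk -v bw="$bandwidth" -v p="$params_b" -v b="$bytes" \
  'BEGIN { printf "theoretical max: %.1f tok/s\n", bw / (2 * p * b) }'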

Benchmarking Pitfalls

Fair Comparisons Require

  • Same model file and quantization
  • Same context size, prompt, and generation length
  • Same backend version and build flags
  • Several runs, discarding the first (model load and warm-up)
  • A thermally settled machine with no background load

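One practical pattern is to bracket the benchmark with temperature and clock readings so any throttling shows up in the numbers (standard nvidia-smi query fields):

# Check that clocks/temps are the same before and after the run
nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader
./llama-bench -m model.gguf -p 512 -n 128 -r 5
nvidia-smi --query-gpu=temperature.gpu,clocks.sm --format=csv,noheader
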
When to Optimize vs Accept

Worth Optimizing

  • GPU not being used (easy fix)
  • Thermal throttling (cooling fix)
  • Using wrong quantization
  • Context way larger than needed

Accept the Limitation

  • Already at memory bandwidth limit
  • Model barely fits (offloading needed)
  • PCIe multi-GPU overhead
  • Speed targets beyond what the hardware can deliver (the only fix is bigger/faster hardware)