Performance
Measuring, benchmarking, and optimizing
How fast is fast enough? What metrics matter? How do you compare different setups fairly? This section covers performance measurement and optimization for local LLM inference.
Key Metrics
| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens/second (tok/s) | Generation speed | How fast responses appear |
| Time to First Token (TTFT) | Prefill latency | How long before output starts |
| Throughput | Aggregate tok/s across all concurrent requests | Matters when serving multiple users |
| Memory usage | VRAM consumption | Determines what fits |
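When Ollama is the backend, you do not need a stopwatch for these: the non-streamed `/api/generate` response carries nanosecond timing fields (`eval_count`, `eval_duration`, `prompt_eval_count`, `prompt_eval_duration`, `load_duration`) that map directly onto the metrics above. A minimal sketch with curl and jq:

```bash
# Run one completion and derive the key metrics from Ollama's
# timing fields (all durations are reported in nanoseconds).
RESPONSE=$(curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": false
}')

# Generation speed: tokens generated / generation time
echo "$RESPONSE" | jq -r '"gen tok/s:     \(.eval_count / (.eval_duration / 1e9))"'

# Prefill speed: prompt tokens / prompt processing time
echo "$RESPONSE" | jq -r '"prefill tok/s: \(.prompt_eval_count / (.prompt_eval_duration / 1e9))"'

# Rough TTFT proxy: model load time + prompt processing time
echo "$RESPONSE" | jq -r '"approx TTFT:   \((.load_duration + .prompt_eval_duration) / 1e9) s"'
```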
What's "Good" Performance?
For Interactive Use (Chat, Coding)
| Speed | Experience |
|---|---|
| >50 tok/s | Excellent — feels instant |
| 30-50 tok/s | Good — comfortable reading speed |
| 15-30 tok/s | Acceptable — noticeable but usable |
| 5-15 tok/s | Slow — requires patience |
| <5 tok/s | Painful — consider smaller model |
Reference Speeds
Approximate tok/s for common setups (Llama 3 8B Q4, 4K context):
| Hardware | Speed |
|---|---|
| RTX 4090 | ~100-120 tok/s |
| RTX 3090 | ~80-100 tok/s |
| RTX 4070 Ti | ~60-80 tok/s |
| M3 Max | ~50-70 tok/s |
| RTX 3060 12GB | ~40-50 tok/s |
| M2 Pro | ~30-40 tok/s |
Benchmarking
llama.cpp Benchmarking
# Built-in benchmark
./llama-bench -m model.gguf -n 128 -p 512
# Parameters:
# -n: tokens to generate
# -p: prompt tokens
# -r: number of runs
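llama-bench also accepts comma-separated value lists for its parameters, so a single invocation can sweep several prompt lengths and show how prefill cost grows with context. A sketch (the model path and sizes are illustrative):

```bash
# One result row per configuration; each is run -r times and averaged.
# pp rows report prompt processing (prefill), tg rows report generation.
./llama-bench -m model.gguf -p 512,2048,8192 -n 128 -r 5
```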
Ollama
# Check model info and performance
ollama run llama3 --verbose
# After the response, timing stats are printed: load duration,
# prompt eval rate (prefill), and eval rate (generation tok/s)
Manual Timing
# Time a generation
time curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Write a story about a robot.",
"stream": false
}'
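Note that `time` on a non-streamed request measures total wall time, which lumps prefill and generation together. To approximate TTFT specifically, stream the response and time the first chunk. A rough sketch (uses GNU date for sub-second timestamps):

```bash
# Approximate TTFT: time until the first streamed chunk arrives.
# Includes network overhead, so treat it as an upper bound.
START=$(date +%s.%N)   # GNU date; on macOS install coreutils and use gdate
curl -sN http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Write a story about a robot.",
  "stream": true
}' | head -n 1 > /dev/null
END=$(date +%s.%N)
echo "TTFT: $(echo "$END - $START" | bc) s"
```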
Performance Optimization
1. Maximize GPU Utilization
# Ensure all layers are offloaded to the GPU
./llama-cli -m model.gguf -ngl 99
# Check with nvidia-smi that GPU is being used
watch -n 1 nvidia-smi
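If you only want the relevant fields, nvidia-smi can query and loop on them directly:

```bash
# Poll GPU utilization, VRAM use, and temperature once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu \
           --format=csv -l 1
```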
2. Right-Size Context
# Use only the context you need
./llama-cli -m model.gguf -c 4096   # instead of 32768
# Smaller context = faster prefill, less memory
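The memory side of this tradeoff is the KV cache, which grows linearly with context. As a worked example, Llama 3 8B has 32 layers, 8 KV heads (GQA), and a head dimension of 128, so an FP16 cache costs 2 × 32 × 8 × 128 × 2 bytes = 128 KiB per cached token:

```bash
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem * ctx
# Constants below are the Llama 3 8B architecture values with an FP16 cache.
awk 'BEGIN {
  per_token = 2 * 32 * 8 * 128 * 2            # bytes cached per token
  for (ctx = 4096; ctx <= 32768; ctx *= 2)
    printf "ctx %6d -> KV cache %4.1f GiB\n", ctx, per_token * ctx / 2^30
}'
```

Dropping from 32K to 4K context frees about 3.5 GiB on this model; llama.cpp can also quantize the cache itself, which shrinks these numbers further.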
3. Optimal Quantization
| Goal | Recommendation |
|---|---|
| Best quality | Q6_K or Q8_0 |
| Balanced | Q5_K_M or Q4_K_M |
| Maximum speed | Q4_K_S or Q3_K_M |
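If you keep the same model in several quantizations, the speed side of this table is easy to measure yourself. A sketch, with hypothetical filenames standing in for whatever GGUF files you have downloaded:

```bash
# Benchmark the same model at several quantization levels
for q in Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./llama-bench -m "llama3-8b-$q.gguf" -p 512 -n 128 -r 5
done
```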
4. Flash Attention
Flash Attention is an optimized attention implementation:
- Enable it with the -fa flag in llama.cpp (some recent builds turn it on by default; check --help for your version)
- Significantly faster for long contexts
- Reduces memory usage
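You can measure the effect directly: llama-bench exposes the same -fa setting, and comma-separated values let one run compare both. A sketch, assuming a build where -fa takes 0/1 values:

```bash
# Compare flash attention off (0) vs on (1); the gap widens
# as the prompt gets longer.
./llama-bench -m model.gguf -fa 0,1 -p 512,4096 -n 128
```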
5. Batch Size Tuning
# Larger batch speeds up prompt processing (uses more memory)
./llama-cli -m model.gguf -b 512
# Smaller batch for low-memory setups
./llama-cli -m model.gguf -b 128
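As with the other knobs, llama-bench can sweep batch sizes in one run to find the sweet spot for your memory budget (a sketch):

```bash
# One result row per batch size; watch VRAM alongside the numbers
./llama-bench -m model.gguf -b 128,256,512 -p 512 -n 128
```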
Common Performance Issues
Model Running on CPU Instead of GPU
Symptoms: Very slow, CPU at 100%, GPU idle
Fix: Ensure -ngl 99 and a GPU-enabled build (compiled with CUDA, Metal, or another GPU backend)
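With Ollama, `ollama ps` confirms the diagnosis without watching nvidia-smi; it shows how each loaded model is split between CPU and GPU:

```bash
# The PROCESSOR column reports the split, e.g. "100% GPU" or "48%/52% CPU/GPU"
ollama ps
```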
Thermal Throttling
Symptoms: Performance drops after a few minutes
Fix: Improve cooling, check temps with nvidia-smi
Memory Swapping
Symptoms: Stuttering, inconsistent speed
Fix: Reduce context, use a smaller model, or quantize more aggressively
PCIe Bottleneck (Multi-GPU)
Symptoms: Second GPU doesn't improve speed much
Reality: Splitting layers across GPUs mainly adds memory, not speed; the layers execute sequentially and PCIe transfers add overhead. NVLink helps where available, or accept the limitation
Theoretical vs Real Performance
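Token generation is usually memory-bandwidth bound: producing each token requires streaming the entire set of weights through the GPU. That gives a simple ceiling: tok/s ≈ memory bandwidth / model size in bytes. Real systems land well under it because of kernel overhead, KV cache reads, and dequantization work. A rough estimator (the bandwidth figure is the published RTX 4090 spec; the model size is a typical Llama 3 8B Q4_K_M file):

```bash
# Upper bound on decode speed: bytes/s of bandwidth / bytes read per token
awk 'BEGIN {
  bandwidth_gb = 1008   # GB/s, RTX 4090 published spec
  model_gb     = 4.9    # GB, approx. Llama 3 8B Q4_K_M file size
  printf "theoretical ceiling: ~%.0f tok/s\n", bandwidth_gb / model_gb
}'
```

That works out to roughly 200 tok/s, so the ~100-120 tok/s in the reference table above is about half of theoretical, which is normal. If you are far below half, look for one of the issues listed above rather than expecting a tuning flag to close the gap.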
Benchmarking Pitfalls
Fair Comparisons Require
- Same model: Comparing 7B to 70B is meaningless
- Same quantization: Q4 vs Q8 changes everything
- Same context length: Longer context = slower
- Same prompt length: Affects prefill time
- Warmed up: First run is often slower
- Same generation length: More tokens = more time
When to Optimize vs Accept
Worth Optimizing
- GPU not being used (easy fix)
- Thermal throttling (cooling fix)
- Using the wrong quantization
- Context far larger than needed
Accept the Limitation
- Already at the memory bandwidth limit
- Model barely fits (CPU offloading is unavoidable)
- PCIe multi-GPU overhead
- Situations where the only real fix is bigger/faster hardware