When Does Multi-GPU Actually Help?
Understanding multi-GPU scaling for LLM inference
"Just add another GPU" sounds simple, but multi-GPU scaling for LLMs is more nuanced. Sometimes it helps dramatically; sometimes it barely matters. Here's when each is true.
The Quick Answer
Multi-GPU Helps Most When
- Model doesn't fit on one GPU
- You have fast interconnect (NVLink)
- Serving multiple concurrent users
- Need longer context than one GPU allows
Multi-GPU Helps Least When
- Model already fits on one GPU
- Only have PCIe connection
- Single-user interactive use
- Optimizing for latency
The Interconnect Problem
GPUs need to exchange intermediate results during inference, and the speed of that link largely determines how well multi-GPU scales. For reference, PCIe 4.0 x16 tops out around 64 GB/s of combined bandwidth, while an RTX 3090 NVLink bridge is rated at roughly 112 GB/s, which is why the NVLink rows in the tables below consistently come out ahead.
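A quick way to see which link your GPUs actually share is the driver's topology report. This is just a diagnostic sketch; it assumes the NVIDIA driver and nvidia-smi are installed:

```bash
# Print how each GPU pair is connected
nvidia-smi topo -m
# "NV1", "NV2", ... between two GPUs means an NVLink connection;
# "PIX", "PHB", or "SYS" means traffic crosses PCIe (and possibly the CPU),
# the slower path discussed above.
```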
Scenario Analysis
Scenario 1: Model Fits on Single GPU
Example: Running Llama 3 8B on RTX 4090 (24GB)
| Setup | Speed | Verdict |
|---|---|---|
| 1× RTX 4090 | ~100 tok/s | ✅ Fast, simple |
| 2× RTX 4090 (PCIe) | ~90-100 tok/s | ❌ No benefit, possibly slower |
Adding GPUs Can Make It Slower
When the model fits on one GPU, adding another introduces communication overhead without adding any compute the job needs. With a layer split, the second GPU mostly sits idle, waiting for activations from the first.
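If the model fits on one card, the cheapest win is simply pinning the process to that card so the idle GPU never enters the picture. A minimal sketch with llama.cpp; the model filename is a placeholder:

```bash
# Run entirely on GPU 0; -ngl 99 offloads all layers to that GPU
CUDA_VISIBLE_DEVICES=0 ./main -m llama-3-8b-q4.gguf -ngl 99 -p "Hello"
```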
Scenario 2: Model Doesn't Fit on Single GPU
Example: Running Llama 3 70B Q4 (~40GB) on 24GB GPUs
| Setup | Speed | Verdict |
|---|---|---|
| 1× RTX 4090 + RAM offload | ~10-15 tok/s | ⚠️ Works but slow |
| 2× RTX 3090 (PCIe) | ~35-40 tok/s | ✅ Much better |
| 2× RTX 3090 (NVLink) | ~45-50 tok/s | ✅ Best consumer option |
This Is Where Multi-GPU Shines
When the model doesn't fit on one GPU, multi-GPU eliminates the RAM offloading penalty. Going from offloaded to fully in VRAM can be a 3-4× speedup.
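In practice that means telling the runtime to place half the layers on each card instead of spilling to system RAM. A sketch with llama.cpp, assuming two equal-VRAM GPUs and a placeholder model filename (the same flags are covered under Software Considerations below):

```bash
# Split a ~40GB 70B Q4 model across two 24GB cards, no RAM offload
CUDA_VISIBLE_DEVICES=0,1 ./main -m llama-3-70b-q4.gguf -ngl 99 --tensor-split 0.5,0.5 -p "Hello"
```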
Scenario 3: Serving Multiple Users
Example: Running inference server for a team
| Setup | Throughput | Verdict |
|---|---|---|
| 1× GPU | ~50 tok/s total | Requests queue up |
| 2× GPUs (model replicated) | ~100 tok/s total | ✅ 2× throughput |
| 2× GPUs (model split) | ~80 tok/s total | Depends on batch size |
For serving, you can either:
- Replicate: run the same model on each GPU and load-balance requests across them (near-linear throughput scaling; see the sketch below)
- Split: run one larger model across both GPUs (better when a bigger model needs the combined VRAM)
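A minimal replication sketch using llama.cpp's bundled HTTP server, one instance pinned to each GPU; the binary name and ports are illustrative (newer llama.cpp builds call it llama-server), and the load balancer in front of them is not shown:

```bash
# One full copy of the model per GPU, each on its own port
CUDA_VISIBLE_DEVICES=0 ./server -m model.gguf -ngl 99 --port 8080 &
CUDA_VISIBLE_DEVICES=1 ./server -m model.gguf -ngl 99 --port 8081 &
# A reverse proxy (nginx, HAProxy, etc.) would round-robin requests between the two ports.
```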
Decision Framework
- Model fits on one GPU and you're the only user? Stay on one GPU.
- Model doesn't fit? Split it across GPUs; it beats offloading to system RAM.
- Serving a team? Replicate the model per GPU if it fits, split it if it doesn't.
- Budget stretches to a single larger card? Prefer the simpler single-card setup.
Real-World Builds
2× RTX 3090 Build
The most popular enthusiast multi-GPU setup:
| Aspect | Details |
|---|---|
| Total VRAM | 48GB |
| Cost (used) | ~$1,600 |
| Power draw | ~700W under load |
| NVLink | Available (~$100 bridge) |
| Best for | 70B models at Q4 |
| 70B Q4 speed | ~35-45 tok/s |
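If you do add the bridge, it's worth confirming the driver actually sees it; otherwise traffic silently falls back to PCIe. A quick check, assuming nvidia-smi is installed:

```bash
# List active NVLink links and their speeds per GPU
nvidia-smi nvlink --status
# No links reported for a GPU means it is not NVLink-connected.
```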
Alternative: Single Larger Card
| Option | VRAM | Cost | Complexity |
|---|---|---|---|
| 2× RTX 3090 | 48GB | ~$1,600 | Higher (power, cooling, software) |
| RTX A6000 | 48GB | ~$4,000 | Lower (single card) |
| Mac Studio M2 Max 96GB | 96GB (unified) | ~$4,000 | Lowest (just works) |
Software Considerations
Automatic Multi-GPU
- Ollama: Automatically uses available GPUs
- vLLM: Specify --tensor-parallel-size N (see the example below)
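As a concrete sketch, vLLM's OpenAI-compatible server can shard a model across two GPUs with that flag; the model name here is illustrative and needs enough combined VRAM:

```bash
# Serve a 70B model split across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2
```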
Manual Configuration
```bash
# llama.cpp: split layers across GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Control which GPUs are visible to the process
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
```
The Bottom Line
Multi-GPU Decision Summary
| Situation | Recommendation |
|---|---|
| Model fits, single user | Don't bother with multi-GPU |
| Model doesn't fit | Multi-GPU is worth it |
| Serving multiple users | Replicate if the model fits on each GPU; split if it doesn't |
| Budget allows single big card | Single card is simpler |