When Does Multi-GPU Actually Help?
Understanding multi-GPU scaling for LLM inference
"Just add another GPU" sounds simple, but multi-GPU scaling for LLMs is more nuanced. Sometimes it helps dramatically; sometimes it barely matters. Here's when each is true.
The Quick Answer
Multi-GPU Helps Most When
- Model doesn't fit on one GPU
- You have fast interconnect (NVLink)
- Serving multiple concurrent users
- Need longer context than one GPU allows
Multi-GPU Helps Least When
- Model already fits on one GPU
- Only have PCIe connection
- Single-user interactive use
- Optimizing for latency
The Interconnect Problem
GPUs need to exchange intermediate results during inference, and the speed of that link largely determines how well multi-GPU scales. For reference, PCIe 4.0 x16 tops out around 64 GB/s of combined bandwidth, while an RTX 3090 NVLink bridge is rated at roughly 112 GB/s, which is why the NVLink rows in the tables below consistently come out ahead.
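A quick way to see which link your GPUs actually share is the driver's topology report. This is just a diagnostic sketch; it assumes the NVIDIA driver and nvidia-smi are installed:

```bash
# Print how each GPU pair is connected
nvidia-smi topo -m
# "NV1", "NV2", ... between two GPUs means an NVLink connection;
# "PIX", "PHB", or "SYS" means traffic crosses PCIe (and possibly the CPU),
# the slower path discussed above.
```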
Scenario Analysis
Scenario 1: Model Fits on Single GPU
Example: Running Llama 3 8B on RTX 4090 (24GB)
| Setup | Speed | Verdict |
|---|---|---|
| 1× RTX 4090 | ~100 tok/s | ✅ Fast, simple |
| 2× RTX 4090 (PCIe) | ~90-100 tok/s | ❌ No benefit, possibly slower |
Adding GPUs Can Make It Slower
When the model fits on one GPU, adding another introduces communication overhead without adding any compute the job needs. With a layer split, the second GPU mostly sits idle, waiting for activations from the first.
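If the model fits on one card, the cheapest win is simply pinning the process to that card so the idle GPU never enters the picture. A minimal sketch with llama.cpp; the model filename is a placeholder:

```bash
# Run entirely on GPU 0; -ngl 99 offloads all layers to that GPU
CUDA_VISIBLE_DEVICES=0 ./main -m llama-3-8b-q4.gguf -ngl 99 -p "Hello"
```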
Scenario 2: Model Doesn't Fit on Single GPU
Example: Running Llama 3 70B Q4 (~40GB) on 24GB GPUs
| Setup | Speed | Verdict |
|---|---|---|
| 1× RTX 4090 + RAM offload | ~10-15 tok/s | ⚠️ Works but slow |
| 2× RTX 3090 (PCIe) | ~35-40 tok/s | ✅ Much better |
| 2× RTX 3090 (NVLink) | ~45-50 tok/s | ✅ Best consumer option |
This Is Where Multi-GPU Shines
When the model doesn't fit on one GPU, multi-GPU eliminates the RAM offloading penalty. Going from offloaded to fully in VRAM can be a 3-4× speedup.
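In practice that means telling the runtime to place half the layers on each card instead of spilling to system RAM. A sketch with llama.cpp, assuming two equal-VRAM GPUs and a placeholder model filename (the same flags are covered under Software Considerations below):

```bash
# Split a ~40GB 70B Q4 model across two 24GB cards, no RAM offload
CUDA_VISIBLE_DEVICES=0,1 ./main -m llama-3-70b-q4.gguf -ngl 99 --tensor-split 0.5,0.5 -p "Hello"
```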
Scenario 3: Serving Multiple Users
Example: Running inference server for a team
| Setup | Throughput | Verdict |
|---|---|---|
| 1× GPU | ~50 tok/s total | Requests queue up |
| 2× GPUs (model replicated) | ~100 tok/s total | ✅ 2× throughput |
| 2× GPUs (model split) | ~80 tok/s total | Depends on batch size |
For serving, you can either:
- Replicate: run the same model on each GPU and load-balance requests across them (near-linear throughput scaling; see the sketch below)
- Split: run one larger model across both GPUs (better when a bigger model needs the combined VRAM)
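A minimal replication sketch using llama.cpp's bundled HTTP server, one instance pinned to each GPU; the binary name and ports are illustrative (newer llama.cpp builds call it llama-server), and the load balancer in front of them is not shown:

```bash
# One full copy of the model per GPU, each on its own port
CUDA_VISIBLE_DEVICES=0 ./server -m model.gguf -ngl 99 --port 8080 &
CUDA_VISIBLE_DEVICES=1 ./server -m model.gguf -ngl 99 --port 8081 &
# A reverse proxy (nginx, HAProxy, etc.) would round-robin requests between the two ports.
```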
Decision Framework
- Model fits on one GPU and you're the only user? Stay on one GPU.
- Model doesn't fit? Split it across GPUs; it beats offloading to system RAM.
- Serving a team? Replicate the model per GPU if it fits, split it if it doesn't.
- Budget stretches to a single larger card? Prefer the simpler single-card setup.
Real-World Builds
2× RTX 3090 Build
The most popular enthusiast multi-GPU setup:
| Aspect | Details |
|---|---|
| Total VRAM | 48GB |
| Cost (used) | ~$1,600 |
| Power draw | ~700W under load |
| NVLink | Available (~$100 bridge) |
| Best for | 70B models at Q4 |
| 70B Q4 speed | ~35-45 tok/s |
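If you do add the bridge, it's worth confirming the driver actually sees it; otherwise traffic silently falls back to PCIe. A quick check, assuming nvidia-smi is installed:

```bash
# List active NVLink links and their speeds per GPU
nvidia-smi nvlink --status
# No links reported for a GPU means it is not NVLink-connected.
```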
Alternative: Single Larger Card
| Option | VRAM | Cost | Complexity |
|---|---|---|---|
| 2× RTX 3090 | 48GB | ~$1,600 | Higher (power, cooling, software) |
| RTX A6000 | 48GB | ~$4,000 | Lower (single card) |
| Mac Studio M2 Max 96GB | 96GB (unified) | ~$4,000 | Lowest (just works) |
Software Considerations
Automatic Multi-GPU
- Ollama: Automatically uses available GPUs
- vLLM: Specify --tensor-parallel-size N (see the example below)
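As a concrete sketch, vLLM's OpenAI-compatible server can shard a model across two GPUs with that flag; the model name here is illustrative and needs enough combined VRAM:

```bash
# Serve a 70B model split across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 2
```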
Manual Configuration
```bash
# llama.cpp: split layers across GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Control which GPUs are visible to the process
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
```
The Bottom Line
Multi-GPU Decision Summary
| Situation | Recommendation |
|---|---|
| Model fits, single user | Don't bother with multi-GPU |
| Model doesn't fit | Multi-GPU is worth it |
| Serving multiple users | Replicate if the model fits on each GPU; split if it doesn't |
| Budget allows single big card | Single card is simpler |