"Just add another GPU" sounds simple, but multi-GPU scaling for LLMs is more nuanced. Sometimes it helps dramatically; sometimes it barely matters. Here's when each is true.

The Quick Answer

Multi-GPU Helps Most When

  • Model doesn't fit on one GPU
  • You have fast interconnect (NVLink)
  • Serving multiple concurrent users
  • Need longer context than one GPU allows

Multi-GPU Helps Least When

  • Model already fits on one GPU
  • Only have PCIe connection
  • Single-user interactive use
  • Optimizing for latency

The Interconnect Problem

When a model is split across GPUs, they must exchange activations during inference. The speed of that link determines how well multi-GPU scales:

Memory bandwidth within a GPU:  ~1,000 GB/s (RTX 4090)
NVLink between GPUs:            ~600-900 GB/s
PCIe 4.0 between GPUs:          ~32 GB/s

The PCIe bottleneck is why 2× GPUs ≠ 2× speed.
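
To get a feel for what those numbers mean per token, here is a back-of-envelope sketch. Every constant is an assumption for illustration, loosely modeled on a 70B-class model split two ways with tensor parallelism (two activation syncs per layer):

# Back-of-envelope: time spent synchronizing activations per generated
# token under 2-GPU tensor parallelism. All constants are assumptions.
HIDDEN_SIZE = 8192       # activation width (assumed)
N_LAYERS = 80            # transformer layers (assumed)
BYTES_PER_ELEM = 2       # fp16 activations
SYNCS_PER_LAYER = 2      # typical: one all-reduce after attention, one after MLP

def sync_us_per_token(bandwidth_gb_s, latency_us):
    payload = HIDDEN_SIZE * BYTES_PER_ELEM                 # bytes per sync
    transfer_us = payload / (bandwidth_gb_s * 1e9) * 1e6   # wire time per sync
    return N_LAYERS * SYNCS_PER_LAYER * (transfer_us + latency_us)

# Assumed per-sync latencies: PCIe paths carry higher fixed overhead.
for name, bw, lat in [("NVLink", 600, 3.0), ("PCIe 4.0", 32, 15.0)]:
    print(f"{name}: ~{sync_us_per_token(bw, lat) / 1000:.1f} ms of sync per token")

Under these assumptions PCIe costs a few milliseconds of sync per token versus well under one for NVLink; at 25 ms per token (~40 tok/s) that gap is the difference between negligible and noticeable. Note the payloads are tiny, so fixed per-sync latency matters as much as raw bandwidth.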

Scenario Analysis

Scenario 1: Model Fits on Single GPU

Example: Running Llama 3 8B on RTX 4090 (24GB)

Setup               Speed          Verdict
1× RTX 4090         ~100 tok/s     ✅ Fast, simple
2× RTX 4090 (PCIe)  ~90-100 tok/s  ❌ No benefit, possibly slower

Adding GPUs Can Make It Slower

When the model fits on one GPU, adding another introduces communication overhead without benefit. The second GPU sits mostly idle waiting for data.
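
Rather than take the table's word for it, you can measure this on your own hardware. A quick timing sketch, assuming llama-cpp-python built with CUDA support (the model path is a placeholder): run it once with CUDA_VISIBLE_DEVICES=0 and once with CUDA_VISIBLE_DEVICES=0,1 and compare.

import time
from llama_cpp import Llama

# Placeholder path; any GGUF that fits in one GPU's VRAM works here.
llm = Llama(model_path="llama-3-8b-instruct.Q4_K_M.gguf",
            n_gpu_layers=-1,   # offload every layer to GPU
            verbose=False)

start = time.perf_counter()
out = llm("Explain NVLink in one paragraph.", max_tokens=256)
elapsed = time.perf_counter() - start

tokens = out["usage"]["completion_tokens"]
print(f"{tokens / elapsed:.1f} tok/s")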

Scenario 2: Model Doesn't Fit on Single GPU

Example: Running Llama 3 70B Q4 (~40GB) on 24GB GPUs

Setup                      Speed         Verdict
1× RTX 4090 + RAM offload  ~10-15 tok/s  ⚠️ Works but slow
2× RTX 3090 (PCIe)         ~35-40 tok/s  ✅ Much better
2× RTX 3090 (NVLink)       ~45-50 tok/s  ✅ Best consumer option

This Is Where Multi-GPU Shines

When the model doesn't fit on one GPU, multi-GPU eliminates the RAM offloading penalty. Going from offloaded to fully in VRAM can be a 3-4× speedup.
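
The arithmetic behind that speedup, with rough assumed sizes:

# Rough sizes, assumed for illustration.
model_gb = 40   # 70B at Q4
vram_gb = 24    # per RTX 3090/4090

# One GPU: the overflow lives in system RAM and crosses PCIe every token.
print(f"Single GPU: {model_gb - vram_gb} GB offloaded to system RAM")

# Two GPUs: each card holds half, with headroom left for KV cache/buffers.
print(f"Dual GPU: {model_gb / 2:.0f} GB per card, "
      f"{vram_gb - model_gb / 2:.0f} GB headroom each")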

Scenario 3: Serving Multiple Users

Example: Running inference server for a team

Setup                       Throughput        Verdict
1× GPU                      ~50 tok/s total   Requests queue up
2× GPUs (model replicated)  ~100 tok/s total  ✅ 2× throughput
2× GPUs (model split)       ~80 tok/s total   Depends on batch size

For serving, you can either:

  • Replicate the model (one full copy per GPU): near-linear throughput scaling, but each GPU must fit the entire model
  • Split the model across both GPUs: fits models too large for one card, at the cost of per-request communication overhead
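
A minimal sketch of the replication approach, assuming two server processes were started separately (say, one with CUDA_VISIBLE_DEVICES=0 and one with CUDA_VISIBLE_DEVICES=1), each exposing an OpenAI-compatible completions endpoint; the ports are hypothetical:

import itertools
import requests

# One endpoint per replica; each server process is pinned to its own GPU.
REPLICAS = itertools.cycle([
    "http://localhost:8001/v1/completions",   # replica on GPU 0
    "http://localhost:8002/v1/completions",   # replica on GPU 1
])

def complete(prompt):
    url = next(REPLICAS)   # round-robin between the replicas
    resp = requests.post(url, json={"prompt": prompt, "max_tokens": 128})
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(complete("Summarize NVLink in one sentence."))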

Decision Framework

Does your model fit on one GPU with room for KV cache?
│
├── YES
│   ├── Are you serving multiple users?
│   │   ├── YES → Consider replication (model on each GPU)
│   │   └── NO  → Single GPU is fine, don't add more
│   │
│   └── Do you need longer context?
│       ├── YES → Multi-GPU can help (more KV cache room)
│       └── NO  → Stick with single GPU
│
└── NO
    └── Multi-GPU will help significantly
        ├── NVLink available? → Use it, good scaling
        └── PCIe only?       → Still helps, but temper expectations
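
The first question in the tree can be answered with arithmetic. A sketch assuming Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dim 128) and an fp16 KV cache; swap in your model's numbers:

# KV cache bytes per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2

kv_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES   # 128 KB/token here
context = 8192
kv_gb = kv_per_token * context / 1e9

model_gb = 5.5   # assumed file size of an 8B model at Q4
print(f"Model ~{model_gb} GB + KV cache ~{kv_gb:.1f} GB at {context} ctx "
      f"= ~{model_gb + kv_gb:.1f} GB needed")

If that total clears your card's VRAM with some margin, the left branch of the tree applies.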

Real-World Builds

2× RTX 3090 Build

The most popular enthusiast multi-GPU setup:

Aspect        Details
Total VRAM    48GB
Cost (used)   ~$1,600
Power draw    ~700W under load
NVLink        Available (~$100 bridge)
Best for      70B models at Q4
70B Q4 speed  ~35-45 tok/s
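
One practical note on that ~700W figure: it's roughly what two stock 3090s draw by themselves, so size the PSU with the rest of the system and transient spikes in mind. A rule-of-thumb sketch (all numbers assumed):

gpu_w = 350              # per RTX 3090 at stock power limit (assumed)
rest_of_system_w = 200   # CPU, RAM, drives, fans (assumed)
sustained = 2 * gpu_w + rest_of_system_w
print(f"~{sustained} W sustained; ~{int(sustained * 1.4)} W PSU for transient headroom")

Power-limiting the cards via nvidia-smi is a common way to tame this with little speed loss.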

Alternative: Single Larger Card

Option                  VRAM            Cost     Complexity
2× RTX 3090             48GB            ~$1,600  Higher (power, cooling, software)
RTX A6000               48GB            ~$4,000  Lower (single card)
Mac Studio M2 Max 96GB  96GB (unified)  ~$4,000  Lowest (just works)

Software Considerations

Automatic Multi-GPU

llama.cpp built with CUDA splits a model's layers across all visible GPUs by default, and Ollama (which builds on llama.cpp) does the same. In practice you only reach for manual flags to control the split ratio or restrict which devices are used.

Manual Configuration

# llama.cpp: split across GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Control which GPUs
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
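
In serving frameworks the split is typically a single parameter. A minimal vLLM sketch, assuming vLLM is installed and the chosen model fits across two cards (the model name is a placeholder):

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across two visible GPUs.
llm = LLM(model="your-model-repo-or-path", tensor_parallel_size=2)
outputs = llm.generate(["Explain NVLink briefly."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)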

The Bottom Line

Multi-GPU Decision Summary

Situation                      Recommendation
Model fits, single user        Don't bother with multi-GPU
Model doesn't fit              Multi-GPU is worth it
Serving multiple users         Consider replication or split
Budget allows single big card  Single card is simpler