Multi-GPU Setups
Scaling beyond a single GPU
When one GPU isn't enough — whether for VRAM capacity or speed — you can use multiple GPUs together. But multi-GPU doesn't mean 2× performance. Understanding interconnects and parallelism strategies is crucial.
Why Multi-GPU?
More VRAM
- Run larger models
- Longer context lengths
- 2× 24GB = 48GB total
More Bandwidth
- Potentially faster inference
- Higher throughput with batching
- Depends heavily on interconnect
The Interconnect Problem
GPUs need to communicate during inference. The interconnect bandwidth determines how well multi-GPU scales:
| Interconnect | Bandwidth | Scaling | Availability |
|---|---|---|---|
| NVLink 3.0 (datacenter) | 600 GB/s | Excellent | A100 |
| NVLink 4.0 (datacenter) | 900 GB/s | Excellent | H100 |
| NVLink Bridge (consumer/workstation) | ~112 GB/s | Good | RTX 3090, RTX A6000 (not the same as datacenter NVLink) |
| PCIe 4.0 x16 | ~32 GB/s | Limited | All modern GPUs |
| PCIe 5.0 x16 | ~64 GB/s | Moderate | Newer platforms |
Single RTX 4090: 1,008 GB/s memory bandwidth
Two RTX 4090s over PCIe:
├── GPU 0: 1,008 GB/s to its VRAM
├── GPU 1: 1,008 GB/s to its VRAM
├── Between GPUs: ~32 GB/s (PCIe 4.0)
└── Cross-GPU communication is 30× slower than VRAM access!
This is why multi-GPU over PCIe helps more with capacity than speed.
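Before assuming NVLink is in play, check how your GPUs are actually connected. A quick sketch using standard nvidia-smi commands (the exact output format varies with driver version):
# Show the link between every GPU pair (NV# = NVLink, PIX/PXB/PHB/SYS = PCIe paths)
nvidia-smi topo -m
# Show per-link NVLink status; no active links means the GPUs talk over PCIe only
nvidia-smi nvlink --status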
Parallelism Strategies
Tensor Parallelism
Split individual layers across GPUs. Each GPU computes part of every layer.
- Pros: Lower latency, good for interactive use
- Cons: Requires fast interconnect (NVLink preferred)
- Use when: You have NVLink or need low latency
Pipeline Parallelism
Assign different layers to different GPUs. Data flows through GPUs sequentially.
- Pros: Less inter-GPU communication
- Cons: Higher latency, pipeline bubbles
- Use when: PCIe-only connection
Tensor Parallelism (each layer split across both GPUs):
┌──────────────────────────────────────────┐
│ Layer 1: [GPU0: half] ←→ [GPU1: half]    │ ← communication every layer
│ Layer 2: [GPU0: half] ←→ [GPU1: half]    │
│ ...                                      │
└──────────────────────────────────────────┘
Pipeline Parallelism (model split by layers):
┌──────────────────────────────────────────┐
│ GPU0: Layers 1-40                        │
│        ↓ (one transfer per forward pass) │
│ GPU1: Layers 41-80                       │
└──────────────────────────────────────────┘
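In llama.cpp these two strategies map loosely onto the --split-mode option. This is a sketch assuming a recent build where -sm/--split-mode accepts layer and row (check --help on your version):
# Pipeline-style: whole layers assigned per GPU, minimal inter-GPU traffic (default)
./main -m model.gguf -ngl 99 --split-mode layer --tensor-split 0.5,0.5
# Tensor-style: each layer's matrices split by rows across GPUs, wants a fast interconnect
./main -m model.gguf -ngl 99 --split-mode row --tensor-split 0.5,0.5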
Real-World Performance
| Setup | 70B Q4 tok/s | Notes |
|---|---|---|
| 1× RTX 4090 | ~20-25 (offload) | Model doesn't fully fit |
| 2× RTX 3090 (PCIe) | ~35-40 | Model fits, PCIe limits scaling |
| 2× RTX 3090 (NVLink bridge) | ~45-50 | Better scaling with NVLink |
| 4× RTX 3090 | ~50-60 | Diminishing returns on PCIe |
| 2× A100 80GB (NVLink) | ~100+ | Datacenter NVLink scales well |
Consumer Multi-GPU Builds
2× RTX 3090
The most popular enthusiast multi-GPU setup:
- 48GB total VRAM
- ~$1,600 for both (used)
- Runs 70B Q4 comfortably
- NVLink bridge available (optional, ~$100)
- Needs good airflow (700W combined)
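With roughly 700W of GPU load in one case, many builders also power-limit the cards; token throughput typically drops only a few percent. A sketch using nvidia-smi (280W is an example value; the allowed range depends on the card's VBIOS):
# Cap each RTX 3090 at 280 W instead of the stock 350 W (requires root)
sudo nvidia-smi -i 0 -pl 280
sudo nvidia-smi -i 1 -pl 280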
Hardware Requirements
| Component | Requirement | Notes |
|---|---|---|
| Motherboard | 2+ PCIe x16 slots | Check actual electrical lanes |
| PSU | 1000W+ (for 2× 3090) | Higher for 4090s |
| Case | Good airflow | 3-slot GPUs need space |
| CPU | Enough PCIe lanes | Most modern CPUs fine |
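To confirm that each card actually negotiated the expected lane count (x16 rather than x8 or x4), query the live link state; note that the link can downshift at idle, so check while the GPUs are busy:
# Current PCIe generation and lane width for every GPU
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv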
Software Configuration
llama.cpp
# Split model across 2 GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5
# Uneven split (GPU 0 has more VRAM)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4
Ollama
Ollama detects multiple GPUs automatically; when a model does not fit on one card, it spreads the layers across the available GPUs without extra configuration.
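To pin Ollama (or any CUDA application) to specific cards, the standard CUDA_VISIBLE_DEVICES variable works; a minimal sketch:
# Expose only the first two GPUs to the Ollama server
CUDA_VISIBLE_DEVICES=0,1 ollama serve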
vLLM
# Specify tensor parallelism
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-3-70b \
--tensor-parallel-size 2
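On PCIe-only machines, newer vLLM releases also expose pipeline parallelism, which trades some latency for less inter-GPU traffic. A sketch, assuming a vLLM version that supports the --pipeline-parallel-size flag:
# Pipeline parallelism across 2 GPUs instead of tensor parallelism
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-3-70b \
--pipeline-parallel-size 2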
When Multi-GPU Makes Sense
Good Reasons
- Model doesn't fit on single GPU
- You need long context (KV cache is huge)
- Throughput matters more than latency (batch inference)
- You have NVLink available
Questionable Reasons
- "Twice the GPUs = twice the speed" — usually false over PCIe
- Model already fits on one GPU — single bigger GPU often better
- Minimizing latency — multi-GPU adds overhead
The Math: Is Multi-GPU Worth It?
Scenario: Running 70B model
Option A: 1× RTX 4090 ($1,600)
├── Needs offloading (model is ~40GB, card is 24GB)
├── ~20 tok/s
└── Simpler setup
Option B: 2× RTX 3090 used ($1,600)
├── Model fits (48GB total)
├── ~40 tok/s
└── More complex (power, cooling, software)
Same price, 2× performance for this specific use case.
But for 13B models, single 4090 would be faster.
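A quick way to sanity-check whether a quantized model fits in a given amount of VRAM: weights take roughly params × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime buffers. A rough sketch (the 4.5 bits/weight and 15% overhead figures are ballpark assumptions, not exact numbers):
# 70B parameters at ~4.5 bits/weight, plus ~15% for KV cache and buffers
echo "70*10^9 * 4.5 / 8 * 1.15 / 10^9" | bc -l   # ≈ 45 GB: too big for 24 GB, fits in 48 GB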