When one GPU isn't enough — whether for VRAM capacity or speed — you can use multiple GPUs together. But multi-GPU doesn't mean 2× performance. Understanding interconnects and parallelism strategies is crucial.

Why Multi-GPU?

More VRAM

  • Run larger models
  • Longer context lengths
  • 2× 24GB = 48GB total

More Bandwidth

  • Potentially faster inference
  • Higher throughput with batching
  • Depends heavily on interconnect
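
To make the capacity math concrete, here is a rough back-of-the-envelope sketch. The bits-per-weight figure and the per-GPU overhead are illustrative assumptions, not measured values.

# Rough sketch: does a quantized model fit across N GPUs?
# All constants below are illustrative assumptions.
def fits(params_b: float, bits_per_weight: float, n_gpus: int,
         vram_per_gpu_gb: float, overhead_gb: float = 2.0) -> bool:
    """Estimate whether the weights (plus per-GPU overhead for KV cache
    and activations) fit in the combined VRAM."""
    weights_gb = params_b * bits_per_weight / 8   # billions of params -> GB
    return weights_gb + overhead_gb * n_gpus <= n_gpus * vram_per_gpu_gb

print(fits(70, 4.5, 1, 24))   # 70B at ~Q4 on one 24GB card  -> False
print(fits(70, 4.5, 2, 24))   # same model on two 24GB cards -> True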

The Interconnect Problem

GPUs need to communicate during inference. The interconnect bandwidth determines how well multi-GPU scales:

Interconnect               Bandwidth    Scaling     Availability
NVLink 3.0                 600 GB/s     Excellent   A100
NVLink 4.0                 900 GB/s     Excellent   H100
NVLink Bridge (consumer)   ~112 GB/s    Good        RTX 3090, RTX A6000 (far below datacenter NVLink)
PCIe 4.0 x16               ~32 GB/s     Limited     All modern GPUs
PCIe 5.0 x16               ~64 GB/s     Moderate    Newer platforms
Single RTX 4090: 1,008 GB/s memory bandwidth

Two RTX 4090s over PCIe:
├── GPU 0: 1,008 GB/s to its VRAM
├── GPU 1: 1,008 GB/s to its VRAM
├── Between GPUs: ~32 GB/s (PCIe 4.0)
└── Cross-GPU communication is ~30× slower than VRAM access!

This is why multi-GPU over PCIe helps more with capacity than speed.
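
A rough way to see why the interconnect matters for speed: during decoding, every generated token requires streaming the active weights from VRAM, so memory bandwidth sets the ceiling. The figures below are approximations taken from the numbers above, not benchmarks.

# Back-of-the-envelope decode ceiling: max tok/s ≈ VRAM bandwidth / model size.
# Approximate figures, not benchmarks.
model_gb = 40          # ~70B at Q4
vram_bw_gb_s = 1008    # RTX 4090-class memory bandwidth

print(f"~{vram_bw_gb_s / model_gb:.0f} tok/s ceiling on one card (if the model fit)")

# With tensor parallelism across two cards, each streams roughly half the
# weights, so the ceiling roughly doubles -- but only if the per-layer
# synchronization over the interconnect stays cheap. Over ~32 GB/s PCIe it
# often doesn't; over NVLink it usually does.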

Parallelism Strategies

Tensor Parallelism

Split individual layers across GPUs. Each GPU computes part of every layer.

Pipeline Parallelism

Assign different layers to different GPUs. Data flows through GPUs sequentially.

Tensor Parallelism (split layers horizontally):

┌─────────────────────────────────────────┐
│ Layer 1: [GPU0: half] ←→ [GPU1: half]   │ ← communication every layer
│ Layer 2: [GPU0: half] ←→ [GPU1: half]   │
│ ...                                     │
└─────────────────────────────────────────┘

Pipeline Parallelism (split layers vertically):

┌─────────────────────────────────────────┐
│ GPU0: Layers 1-40                       │
│        ↓ (one transfer)                 │
│ GPU1: Layers 41-80                      │
└─────────────────────────────────────────┘
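
The two strategies can be sketched in a few lines of NumPy. This is a toy illustration of where the split and the communication happen, not how real frameworks implement it.

import numpy as np

# Toy "model": a stack of square linear layers. Shapes are arbitrary.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
weights = [rng.standard_normal((1024, 1024)) for _ in range(4)]

def tensor_parallel(x, weights):
    # Each "GPU" holds half of every weight matrix (split by output column);
    # concatenating the partial outputs is the per-layer communication step.
    for w in weights:
        half = w.shape[1] // 2
        out0 = x @ w[:, :half]                 # "GPU 0" computes its half
        out1 = x @ w[:, half:]                 # "GPU 1" computes its half
        x = np.concatenate([out0, out1])       # <- communication every layer
    return x

def pipeline_parallel(x, weights):
    # Each "GPU" holds a contiguous block of whole layers; the activation
    # crosses the interconnect only once, between the two blocks.
    for w in weights[:2]:                      # "GPU 0": first half of layers
        x = x @ w
    # <- single transfer of x to "GPU 1" happens here
    for w in weights[2:]:                      # "GPU 1": second half of layers
        x = x @ w
    return x

assert np.allclose(tensor_parallel(x, weights), pipeline_parallel(x, weights))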

Real-World Performance

Setup                         70B Q4 tok/s       Notes
1× RTX 4090                   ~20-25 (offload)   Model doesn't fully fit
2× RTX 3090 (PCIe)            ~35-40             Model fits, PCIe limits scaling
2× RTX 3090 (NVLink bridge)   ~45-50             Better scaling with NVLink
4× RTX 3090                   ~50-60             Diminishing returns on PCIe
2× A100 80GB (NVLink)         ~100+              Datacenter NVLink scales well

Consumer Multi-GPU Builds

2× RTX 3090

The most popular enthusiast multi-GPU setup:

Hardware Requirements

Component     Requirement             Notes
Motherboard   2+ PCIe x16 slots       Check actual electrical lanes
PSU           1000W+ (for 2× 3090)    Higher for 4090s
Case          Good airflow            3-slot GPUs need space
CPU           Enough PCIe lanes       Most modern CPUs are fine
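
As a rough sanity check on PSU sizing, with ballpark wattages assumed below rather than measured draws:

# Rough PSU budget for a 2× RTX 3090 build. All wattages are ballpark assumptions.
gpu_w    = 350    # per RTX 3090 (transient spikes can go higher; power limits help)
cpu_w    = 150
rest_w   = 100    # motherboard, RAM, drives, fans
headroom = 1.3    # margin for spikes and PSU efficiency sweet spot

total_w = (2 * gpu_w + cpu_w + rest_w) * headroom
print(f"Suggested PSU: ~{total_w:.0f} W")   # lands in the 1000W+ range above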

Software Configuration

llama.cpp

# Split model across 2 GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Uneven split (GPU 0 has more VRAM)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4
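
The split values are just relative proportions of each GPU's VRAM. A small sketch of how you might derive them (a hypothetical helper, not part of llama.cpp):

# Hypothetical helper: derive --tensor-split ratios from usable VRAM per GPU.
def tensor_split(vram_gb):
    total = sum(vram_gb)
    return ",".join(f"{v / total:.2f}" for v in vram_gb)

print(tensor_split([24, 24]))   # -> "0.50,0.50"
print(tensor_split([24, 16]))   # -> "0.60,0.40"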

Ollama

Ollama automatically uses multiple GPUs when available.

vLLM

# Specify tensor parallelism
python -m vllm.entrypoints.api_server \
    --model meta-llama/Llama-3-70b \
    --tensor-parallel-size 2
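
The same setting is available through vLLM's Python API; a minimal sketch (the prompt is only an illustration):

from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards every layer across both GPUs (tensor parallelism)
llm = LLM(model="meta-llama/Llama-3-70b", tensor_parallel_size=2)

outputs = llm.generate(["Explain NVLink in one sentence."],
                       SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)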

When Multi-GPU Makes Sense

Good Reasons

  • The model you want doesn't fit in one GPU's VRAM (e.g. a ~40GB 70B Q4 on a 24GB card)
  • You need longer context lengths than a single card can hold
  • You serve batched requests and need higher total throughput

Questionable Reasons

  • Expecting 2× single-request speed for a model that already fits on one GPU
  • Scaling past two cards over PCIe and hoping for near-linear gains
  • Adding a second card before checking power, cooling, and slot spacing

The Math: Is Multi-GPU Worth It?

Scenario: Running a 70B model

Option A: 1× RTX 4090 ($1,600)
├── Needs offloading (model is ~40GB, card is 24GB)
├── ~20 tok/s
└── Simpler setup

Option B: 2× RTX 3090 used ($1,600)
├── Model fits (48GB total)
├── ~40 tok/s
└── More complex (power, cooling, software)

Same price, 2× the performance for this specific use case. But for 13B models, a single 4090 would be faster.
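
The same comparison as a tiny script, using the approximate figures above:

# Price per unit of throughput for the two options (approximate figures from above).
options = {
    "1× RTX 4090 (offloading)": {"price": 1600, "tok_s": 20},
    "2× RTX 3090 (used)":       {"price": 1600, "tok_s": 40},
}
for name, o in options.items():
    print(f"{name:26s} ${o['price'] / o['tok_s']:.0f} per tok/s")
# -> $80 per tok/s vs $40 per tok/s for this 70B workload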