Multi-GPU Setups
Scaling beyond a single GPU
When one GPU isn't enough — whether for VRAM capacity or speed — you can use multiple GPUs together. But multi-GPU doesn't mean 2× performance. Understanding interconnects and parallelism strategies is crucial.
Why Multi-GPU?
More VRAM
- Run larger models
- Longer context lengths
- 2× 24GB = 48GB total
More Bandwidth
- Potentially faster inference
- Higher throughput with batching
- Depends heavily on interconnect
The Interconnect Problem
GPUs need to communicate during inference. The interconnect bandwidth determines how well multi-GPU scales:
| Interconnect | Bandwidth | Scaling | Availability |
|---|---|---|---|
| NVLink 3.0 (datacenter) | 600 GB/s | Excellent | A100 |
| NVLink 4.0 (datacenter) | 900 GB/s | Excellent | H100 |
| NVLink Bridge (consumer/workstation) | ~112 GB/s | Good | RTX 3090, RTX A6000 (not the same as datacenter NVLink) |
| PCIe 4.0 x16 | ~32 GB/s | Limited | All modern GPUs |
| PCIe 5.0 x16 | ~64 GB/s | Moderate | Newer platforms |
Single RTX 4090: 1,008 GB/s memory bandwidth
Two RTX 4090s over PCIe:
├── GPU 0: 1,008 GB/s to its VRAM
├── GPU 1: 1,008 GB/s to its VRAM
├── Between GPUs: ~32 GB/s (PCIe 4.0)
└── Cross-GPU communication is 30× slower than VRAM access!
This is why multi-GPU over PCIe helps more with capacity than speed.
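Before assuming NVLink is in play, check how your GPUs are actually connected. A quick sketch using standard nvidia-smi commands (the exact output format varies with driver version):
# Show the link between every GPU pair (NV# = NVLink, PIX/PXB/PHB/SYS = PCIe paths)
nvidia-smi topo -m
# Show per-link NVLink status; no active links means the GPUs talk over PCIe only
nvidia-smi nvlink --status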
Parallelism Strategies
Tensor Parallelism
Split individual layers across GPUs. Each GPU computes part of every layer.
- Pros: Lower latency, good for interactive use
- Cons: Requires fast interconnect (NVLink preferred)
- Use when: You have NVLink or need low latency
Pipeline Parallelism
Assign different layers to different GPUs. Data flows through GPUs sequentially.
- Pros: Less inter-GPU communication
- Cons: Higher latency, pipeline bubbles
- Use when: PCIe-only connection
Tensor Parallelism (each layer split across both GPUs):
┌──────────────────────────────────────────┐
│ Layer 1: [GPU0: half] ←→ [GPU1: half]    │ ← communication every layer
│ Layer 2: [GPU0: half] ←→ [GPU1: half]    │
│ ...                                      │
└──────────────────────────────────────────┘
Pipeline Parallelism (model split by layers):
┌──────────────────────────────────────────┐
│ GPU0: Layers 1-40                        │
│        ↓ (one transfer per forward pass) │
│ GPU1: Layers 41-80                       │
└──────────────────────────────────────────┘
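In llama.cpp these two strategies map loosely onto the --split-mode option. This is a sketch assuming a recent build where -sm/--split-mode accepts layer and row (check --help on your version):
# Pipeline-style: whole layers assigned per GPU, minimal inter-GPU traffic (default)
./main -m model.gguf -ngl 99 --split-mode layer --tensor-split 0.5,0.5
# Tensor-style: each layer's matrices split by rows across GPUs, wants a fast interconnect
./main -m model.gguf -ngl 99 --split-mode row --tensor-split 0.5,0.5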
Real-World Performance
| Setup | 70B Q4 tok/s | Notes |
|---|---|---|
| 1× RTX 4090 | ~20-25 (offload) | Model doesn't fully fit |
| 2× RTX 3090 (PCIe) | ~35-40 | Model fits, PCIe limits scaling |
| 2× RTX 3090 (NVLink bridge) | ~45-50 | Better scaling with NVLink |
| 4× RTX 3090 | ~50-60 | Diminishing returns on PCIe |
| 2× A100 80GB (NVLink) | ~100+ | Datacenter NVLink scales well |
Consumer Multi-GPU Builds
2× RTX 3090
The most popular enthusiast multi-GPU setup:
- 48GB total VRAM
- ~$1,600 for both (used)
- Runs 70B Q4 comfortably
- NVLink bridge available (optional, ~$100)
- Needs good airflow (700W combined)
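With roughly 700W of GPU load in one case, many builders also power-limit the cards; token throughput typically drops only a few percent. A sketch using nvidia-smi (280W is an example value; the allowed range depends on the card's VBIOS):
# Cap each RTX 3090 at 280 W instead of the stock 350 W (requires root)
sudo nvidia-smi -i 0 -pl 280
sudo nvidia-smi -i 1 -pl 280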
Hardware Requirements
| Component | Requirement | Notes |
|---|---|---|
| Motherboard | 2+ PCIe x16 slots | Check actual electrical lanes |
| PSU | 1000W+ (for 2× 3090) | Higher for 4090s |
| Case | Good airflow | 3-slot GPUs need space |
| CPU | Enough PCIe lanes | Most modern CPUs fine |
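To confirm that each card actually negotiated the expected lane count (x16 rather than x8 or x4), query the live link state; note that the link can downshift at idle, so check while the GPUs are busy:
# Current PCIe generation and lane width for every GPU
nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.width.current --format=csv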
Software Configuration
llama.cpp
# Split model across 2 GPUs
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5
# Uneven split (GPU 0 has more VRAM)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4
Ollama
Ollama detects multiple GPUs automatically; when a model does not fit on one card, it spreads the layers across the available GPUs without extra configuration.
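To pin Ollama (or any CUDA application) to specific cards, the standard CUDA_VISIBLE_DEVICES variable works; a minimal sketch:
# Expose only the first two GPUs to the Ollama server
CUDA_VISIBLE_DEVICES=0,1 ollama serve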
vLLM
# Specify tensor parallelism
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-3-70b \
--tensor-parallel-size 2
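On PCIe-only machines, newer vLLM releases also expose pipeline parallelism, which trades some latency for less inter-GPU traffic. A sketch, assuming a vLLM version that supports the --pipeline-parallel-size flag:
# Pipeline parallelism across 2 GPUs instead of tensor parallelism
python -m vllm.entrypoints.api_server \
--model meta-llama/Llama-3-70b \
--pipeline-parallel-size 2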
When Multi-GPU Makes Sense
Good Reasons
- Model doesn't fit on single GPU
- You need long context (KV cache is huge)
- Throughput matters more than latency (batch inference)
- You have NVLink available
Questionable Reasons
- "Twice the GPUs = twice the speed" — usually false over PCIe
- Model already fits on one GPU — single bigger GPU often better
- Minimizing latency — multi-GPU adds overhead
The Math: Is Multi-GPU Worth It?
Scenario: Running 70B model
Option A: 1× RTX 4090 ($1,600)
├── Needs offloading (model is ~40GB, card is 24GB)
├── ~20 tok/s
└── Simpler setup
Option B: 2× RTX 3090 used ($1,600)
├── Model fits (48GB total)
├── ~40 tok/s
└── More complex (power, cooling, software)
Same price, 2× performance for this specific use case.
But for 13B models, single 4090 would be faster.
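A quick way to sanity-check whether a quantized model fits in a given amount of VRAM: weights take roughly params × bits-per-weight / 8 bytes, plus headroom for the KV cache and runtime buffers. A rough sketch (the 4.5 bits/weight and 15% overhead figures are ballpark assumptions, not exact numbers):
# 70B parameters at ~4.5 bits/weight, plus ~15% for KV cache and buffers
echo "70*10^9 * 4.5 / 8 * 1.15 / 10^9" | bc -l   # ≈ 45 GB: too big for 24 GB, fits in 48 GB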