Memory Bandwidth
The spec that actually determines inference speed
Memory bandwidth is how fast data can be read from (or written to) memory. For LLM inference, it's typically the limiting factor — more so than TFLOPS, core count, or other specs that get more marketing attention.
Why Bandwidth Matters
During token generation (decode), the model processes one token at a time. For each token, essentially every weight in the model must be read from memory, while the arithmetic needed per token is comparatively small.
The GPU's compute units (which do the math) are therefore starved for data. They sit idle while waiting for weights to arrive from memory.
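To see why, compare how long it takes to move the weights versus to do the math for a single token. A rough sketch, assuming a ~35 GB (70B Q4) model, ~1,008 GB/s of memory bandwidth, and ~80 TFLOPS of usable compute; these are illustrative figures, not measurements:

```python
# Rough per-token cost comparison for single-token decode.
# All figures are illustrative assumptions, not measured values.

model_size_gb = 35        # 70B model at Q4 quantization
bandwidth_gb_s = 1008     # e.g. RTX 4090-class memory bandwidth
params = 70e9             # parameter count
compute_tflops = 80       # assumed usable throughput (assumption)

# Every weight is streamed from memory once per generated token.
memory_time_ms = model_size_gb / bandwidth_gb_s * 1000

# Decode needs roughly 2 FLOPs per parameter per token (multiply + add).
compute_time_ms = (2 * params) / (compute_tflops * 1e12) * 1000

print(f"memory:  {memory_time_ms:.1f} ms/token")   # ~34.7
print(f"compute: {compute_time_ms:.1f} ms/token")  # ~1.8
```

Under these assumptions, moving the weights takes roughly 20× longer than multiplying them, so the memory bus, not the compute units, sets the token rate.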
Bandwidth by Hardware
| Hardware | Memory Type | Bandwidth | ~Max tok/s (70B Q4, ~35 GB) |
|---|---|---|---|
| RTX 4090 | GDDR6X | 1,008 GB/s | ~29 |
| RTX 4080 | GDDR6X | 717 GB/s | ~20 |
| RTX 3090 | GDDR6X | 936 GB/s | ~27 |
| RTX 3080 (10GB) | GDDR6X | 760 GB/s | ~22 |
| Mac Studio M2 Ultra | Unified LPDDR5 | 800 GB/s | ~23 |
| M3 Max | Unified LPDDR5 | 400 GB/s | ~11 |
| A100 80GB | HBM2e | 2,039 GB/s | ~58 |
| H100 80GB | HBM3 | 3,350 GB/s | ~96 |
| DDR5 RAM (system) | DDR5-5600 | ~90 GB/s | ~2.5 |
* tok/s estimates assume model fits in memory and is bandwidth-limited. Real performance varies.
The Bandwidth Formula
Theoretical maximum decode speed:
Max tok/s = Memory Bandwidth / Model Size
Example: RTX 4090 with 70B Q4 (~35GB)
Max tok/s = 1,008 GB/s / 35 GB = 28.8 tok/s
Real Performance Is Lower
Actual performance is typically 70-90% of theoretical maximum due to: memory access patterns that aren't perfectly sequential, KV cache reads, compute overhead, and software inefficiencies.
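A minimal sketch of this estimate in Python, folding in an efficiency factor to cover the overheads above; the numbers are the illustrative ones from the table, not guarantees:

```python
def estimate_decode_speed(bandwidth_gb_s, model_size_gb, efficiency=0.8):
    """Rough decode speed in tok/s: bandwidth / model size, scaled by an
    assumed efficiency factor (70-90% of theoretical is typical)."""
    return bandwidth_gb_s / model_size_gb * efficiency

# Illustrative example: 70B Q4 (~35 GB) on a 1,008 GB/s card.
print(estimate_decode_speed(1008, 35, efficiency=1.0))  # 28.8 theoretical
print(estimate_decode_speed(1008, 35))                  # ~23 expected
```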
Memory Types
GDDR6/GDDR6X
- Consumer GPU memory
- 600-1000 GB/s typical
- Cost-effective
- Limited capacity (8-24GB)
HBM2/HBM3
- Datacenter GPU memory
- 2000-3500 GB/s
- Very expensive
- Higher capacity (40-80GB)
Unified (Apple)
- Shared CPU/GPU memory
- 200-800 GB/s
- Large capacity possible
- No discrete GPU VRAM limits
System RAM (DDR5)
- CPU memory
- 50-100 GB/s
- Cheapest per GB
- 10-20× slower than VRAM
Bandwidth vs VRAM: The Tradeoff
These two specs often trade off against each other: consumer GPUs pair high bandwidth with limited VRAM, Apple unified memory pairs large capacity with moderate bandwidth, and system RAM offers the most capacity at the lowest bandwidth. Only datacenter GPUs deliver both, at a much higher price.
Why System RAM Is So Slow
When models don't fit in VRAM, layers can be "offloaded" to system RAM. This works but is much slower:
| Memory | Bandwidth | Relative Speed |
|---|---|---|
| VRAM (GDDR6X) | ~1,000 GB/s | 1× |
| System RAM (DDR5) | ~90 GB/s | 0.09× (11× slower) |
| NVMe SSD | ~7 GB/s | 0.007× (140× slower) |
Offloading 50% of layers to RAM doesn't make inference 50% slower — it makes it roughly 5-10× slower because those layers become the bottleneck.
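A quick sketch of the arithmetic: per-token time is the sum of the fast (VRAM) and slow (RAM) portions, so the slow portion dominates. The default bandwidths below are the illustrative figures from the table above:

```python
def offload_tok_s(model_size_gb, gpu_fraction, vram_gb_s=1000, ram_gb_s=90):
    """Per-token time = GPU-resident bytes / VRAM bandwidth
                      + offloaded bytes / system RAM bandwidth.
    Default bandwidths are illustrative, not measured."""
    gpu_time = model_size_gb * gpu_fraction / vram_gb_s
    ram_time = model_size_gb * (1 - gpu_fraction) / ram_gb_s
    return 1 / (gpu_time + ram_time)

print(offload_tok_s(35, 1.0))  # ~28.6 tok/s, fully in VRAM
print(offload_tok_s(35, 0.5))  # ~4.7 tok/s, half offloaded (~6x slower)
```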
Multi-GPU Bandwidth
Adding GPUs adds aggregate bandwidth, but communication between GPUs has its own bandwidth limits.
See Multi-GPU for details on interconnect bandwidth.
Practical Guidance
Bandwidth Rules of Thumb
- 15-30 tok/s is comfortable for interactive chat
- Below 10 tok/s feels sluggish
- Below 3 tok/s is barely usable (RAM offload territory)
- Quantization helps: Q4 needs ~half the bandwidth of Q8 (see the sketch below)
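Inverting the formula gives a quick way to sanity-check hardware against a target speed. A small sketch, using example model sizes (70B at Q4 is ~35 GB, at Q8 roughly 70 GB) and an example target:

```python
def required_bandwidth(model_size_gb, target_tok_s):
    """Bandwidth (GB/s) needed for a target decode speed,
    assuming inference is purely bandwidth-limited."""
    return model_size_gb * target_tok_s

# Example target: 20 tok/s for comfortable interactive chat.
print(required_bandwidth(35, 20))  # 70B Q4: ~700 GB/s (high-end consumer GPU)
print(required_bandwidth(70, 20))  # 70B Q8: ~1,400 GB/s (HBM territory)
```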
When to Prioritize Bandwidth
- Model fits comfortably in VRAM
- Interactive use (chat, coding assistance)
- Single-user scenarios
When to Prioritize Capacity
- Model doesn't fit otherwise
- Batch processing (total throughput matters more)
- Very long context requirements