Memory bandwidth is how fast data can be read from (or written to) memory. For LLM inference, it's typically the limiting factor — more so than TFLOPS, core count, or other specs that get more marketing attention.

Why Bandwidth Matters

During token generation (decode), the model processes one token at a time. For each token:

Generate 1 token:
  1. Read all model weights (~35GB for 70B Q4)
  2. Do minimal computation
  3. Output 1 token
  4. Repeat

At 30 tokens/second:
├── Reading: 30 × 35GB = 1,050 GB/s needed
├── Computing: Tiny fraction of GPU capability used
└── Bottleneck: MEMORY BANDWIDTH

The GPU's compute units (which do the math) are starved for data. They sit idle while waiting for weights to arrive from memory.
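
A minimal sketch of that arithmetic (assuming every weight is read once per generated token, and ignoring KV cache and activation traffic):

```python
# Rough estimate of the bandwidth needed to sustain a target decode rate.
# Assumes each generated token requires reading every model weight once
# (typical for batch-size-1 decode); KV cache and activations are ignored.

model_size_gb = 35        # ~70B model at Q4 quantization
target_tokens_per_s = 30

required_bandwidth_gbs = model_size_gb * target_tokens_per_s
print(f"Required bandwidth: ~{required_bandwidth_gbs:,} GB/s")
# -> Required bandwidth: ~1,050 GB/s
```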

Bandwidth by Hardware

| Hardware            | Memory Type    | Bandwidth  | ~Max tok/s (70B Q4)* |
|---------------------|----------------|------------|----------------------|
| RTX 4090            | GDDR6X         | 1,008 GB/s | ~29                  |
| RTX 4080            | GDDR6X         | 717 GB/s   | ~20                  |
| RTX 3090            | GDDR6X         | 936 GB/s   | ~27                  |
| RTX 3080 (10GB)     | GDDR6X         | 760 GB/s   | ~22                  |
| Mac Studio M2 Ultra | Unified LPDDR5 | 800 GB/s   | ~23                  |
| M3 Max              | Unified LPDDR5 | 400 GB/s   | ~11                  |
| A100 80GB           | HBM2e          | 2,039 GB/s | ~58                  |
| H100 80GB           | HBM3           | 3,350 GB/s | ~96                  |
| DDR5 RAM (system)   | DDR5-5600      | ~90 GB/s   | ~2.5                 |

* tok/s estimates assume model fits in memory and is bandwidth-limited. Real performance varies.

The Bandwidth Formula

Theoretical maximum decode speed:

Max tok/s = Memory Bandwidth (GB/s) / Model Size (GB)

Example: RTX 4090 with 70B Q4 (~35GB)
Max tok/s = 1,008 GB/s / 35 GB = 28.8 tok/s

Real Performance Is Lower

Actual performance is typically 70-90% of theoretical maximum due to: memory access patterns that aren't perfectly sequential, KV cache reads, compute overhead, and software inefficiencies.
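
As a sketch, the theoretical ceiling and the 70-90% rule of thumb can be combined into a quick estimator (the efficiency range is a heuristic, not a measured constant):

```python
# Decode-speed estimator: theoretical ceiling (bandwidth / model size) plus a
# realistic range using the 70-90% efficiency rule of thumb. A sketch only.

def max_tokens_per_s(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Theoretical ceiling: every weight read once per token."""
    return bandwidth_gbs / model_size_gb

def realistic_range(bandwidth_gbs: float, model_size_gb: float) -> tuple[float, float]:
    """Apply the 70-90% efficiency heuristic to the ceiling."""
    ceiling = max_tokens_per_s(bandwidth_gbs, model_size_gb)
    return 0.7 * ceiling, 0.9 * ceiling

ceiling = max_tokens_per_s(1008, 35)              # RTX 4090, 70B Q4
low, high = realistic_range(1008, 35)
print(f"Theoretical: {ceiling:.1f} tok/s, realistic: {low:.0f}-{high:.0f} tok/s")
# -> Theoretical: 28.8 tok/s, realistic: 20-26 tok/s
```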

Memory Types

GDDR6/GDDR6X

  • Consumer GPU memory
  • 600-1000 GB/s typical
  • Cost-effective
  • Limited capacity (8-24GB)

HBM2/HBM3

  • Datacenter GPU memory
  • 2000-3500 GB/s
  • Very expensive
  • Higher capacity (40-80GB)

Unified (Apple)

  • Shared CPU/GPU memory
  • 200-800 GB/s
  • Large capacity possible
  • No discrete GPU VRAM limits

System RAM (DDR5)

  • CPU memory
  • 50-100 GB/s
  • Cheapest per GB
  • 10-20× slower than VRAM
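
To see what these ranges imply for decode speed, here is a quick sketch applying the same bandwidth arithmetic to a ~35GB 70B Q4 model (the ranges are the typical figures listed above; it ignores whether the model actually fits in that memory):

```python
# Bandwidth-limited decode speed for a ~35 GB model (70B Q4) across the memory
# classes above, using their typical bandwidth ranges. This is the bandwidth
# formula only; it does not check whether the model fits in that memory.

model_size_gb = 35
memory_types = {
    "GDDR6/GDDR6X":    (600, 1000),
    "HBM2/HBM3":       (2000, 3500),
    "Unified (Apple)": (200, 800),
    "System DDR5":     (50, 100),
}

for name, (low_gbs, high_gbs) in memory_types.items():
    print(f"{name:<16} ~{low_gbs / model_size_gb:.0f}-{high_gbs / model_size_gb:.0f} tok/s")
```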

Bandwidth vs VRAM: The Tradeoff

These specs often trade off against each other:

High VRAM, Lower Bandwidth:
├── 2× P40 (48GB total, 346 GB/s): Fits 70B Q4, but slow (~10 tok/s)
└── Mac Studio (192GB, 800 GB/s): Fits 405B (heavily quantized), moderate speed

High Bandwidth, Lower VRAM:
├── RTX 4090 (24GB, 1008 GB/s): Fast, but can't fit 70B Q4 alone
└── H100 (80GB, 3350 GB/s): Both high, but costs $30k+

The question: What's YOUR bottleneck?
├── If model doesn't fit → need more VRAM
└── If model fits but slow → need more bandwidth
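
The decision at the bottom of the diagram can be written out as a simple check: capacity first, then bandwidth. A sketch (sizes in GB, bandwidth in GB/s; the function name and target rate are illustrative):

```python
# "What's YOUR bottleneck?" as a simple check: capacity first, then bandwidth.
# A sketch; real setups also depend on quantization, offloading, and batch size.

def bottleneck(model_size_gb: float, vram_gb: float, bandwidth_gbs: float,
               target_tokens_per_s: float) -> str:
    if model_size_gb > vram_gb:
        return "capacity: model does not fit, need more VRAM (or a smaller quant)"
    if bandwidth_gbs / model_size_gb < target_tokens_per_s:
        return "bandwidth: model fits but decode is slower than the target"
    return "neither: this hardware should meet the target"

print(bottleneck(model_size_gb=35, vram_gb=24, bandwidth_gbs=1008, target_tokens_per_s=20))
# -> capacity: model does not fit, need more VRAM (or a smaller quant)
```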

Why System RAM Is So Slow

When models don't fit in VRAM, layers can be "offloaded" to system RAM. This works but is much slower:

| Memory            | Bandwidth   | Relative Speed       |
|-------------------|-------------|----------------------|
| VRAM (GDDR6X)     | ~1,000 GB/s | 1× (baseline)        |
| System RAM (DDR5) | ~90 GB/s    | 0.09× (11× slower)   |
| NVMe SSD          | ~7 GB/s     | 0.007× (140× slower) |

Offloading 50% of layers to RAM doesn't make inference 50% slower — it makes it roughly 5-10× slower because those layers become the bottleneck.
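
A rough model shows why: per-token time is the sum of reading the GPU-resident weights at VRAM speed and the offloaded weights at system-RAM speed, so the slow portion dominates. A sketch with round numbers:

```python
# Why 50% offload is ~5-10x slower rather than 2x slower: the RAM-resident
# weights are read at ~90 GB/s per token and dominate the total time.

model_size_gb = 35        # 70B Q4
vram_gbs = 1000           # GDDR6X-class bandwidth
ram_gbs = 90              # dual-channel DDR5

def tokens_per_s(offload_fraction: float) -> float:
    gpu_time = (1 - offload_fraction) * model_size_gb / vram_gbs
    cpu_time = offload_fraction * model_size_gb / ram_gbs
    return 1 / (gpu_time + cpu_time)

for frac in (0.0, 0.25, 0.5, 1.0):
    print(f"{frac:>4.0%} offloaded: ~{tokens_per_s(frac):.1f} tok/s")
# ->   0% offloaded: ~28.6 tok/s
# ->  25% offloaded: ~8.1 tok/s
# ->  50% offloaded: ~4.7 tok/s
# -> 100% offloaded: ~2.6 tok/s
```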

Multi-GPU Bandwidth

Adding GPUs adds bandwidth — but communication between GPUs has its own bandwidth limits:

Single RTX 4090: 1,008 GB/s to VRAM

Two RTX 4090s:
├── Each GPU: 1,008 GB/s to its own VRAM
├── PCIe 4.0 x16 between them: ~32 GB/s each direction
└── Total effective: ~1.5× single GPU (not 2×)

Two 4090s with NVLink (if it existed):
├── Each GPU: 1,008 GB/s to VRAM
├── NVLink: ~900 GB/s between GPUs
└── Total effective: closer to 2× single GPU

See Multi-GPU for details on interconnect bandwidth.

Practical Guidance

Bandwidth Rules of Thumb

When to Prioritize Bandwidth

When to Prioritize Capacity