Memory Bandwidth
The spec that actually determines inference speed
Memory bandwidth is how fast data can be read from (or written to) memory. For LLM inference, it's typically the limiting factor — more so than TFLOPS, core count, or other specs that get more marketing attention.
Why Bandwidth Matters
During token generation (decode), the model processes one token at a time. For each token, essentially every weight in the model must be read from memory, while the arithmetic needed per token is comparatively small.
The GPU's compute units (which do the math) are therefore starved for data. They sit idle while waiting for weights to arrive from memory.
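To see why, compare how long it takes to move the weights versus to do the math for a single token. A rough sketch, assuming a ~35 GB (70B Q4) model, ~1,008 GB/s of memory bandwidth, and ~80 TFLOPS of usable compute; these are illustrative figures, not measurements:

```python
# Rough per-token cost comparison for single-token decode.
# All figures are illustrative assumptions, not measured values.

model_size_gb = 35        # 70B model at Q4 quantization
bandwidth_gb_s = 1008     # e.g. RTX 4090-class memory bandwidth
params = 70e9             # parameter count
compute_tflops = 80       # assumed usable throughput (assumption)

# Every weight is streamed from memory once per generated token.
memory_time_ms = model_size_gb / bandwidth_gb_s * 1000

# Decode needs roughly 2 FLOPs per parameter per token (multiply + add).
compute_time_ms = (2 * params) / (compute_tflops * 1e12) * 1000

print(f"memory:  {memory_time_ms:.1f} ms/token")   # ~34.7
print(f"compute: {compute_time_ms:.1f} ms/token")  # ~1.8
```

Under these assumptions, moving the weights takes roughly 20× longer than multiplying them, so the memory bus, not the compute units, sets the token rate.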
Bandwidth by Hardware
| Hardware | Memory Type | Bandwidth | ~Max tok/s (70B Q4, ~35 GB) |
|---|---|---|---|
| RTX 4090 | GDDR6X | 1,008 GB/s | ~29 |
| RTX 4080 | GDDR6X | 717 GB/s | ~20 |
| RTX 3090 | GDDR6X | 936 GB/s | ~27 |
| RTX 3080 (10GB) | GDDR6X | 760 GB/s | ~22 |
| Mac Studio M2 Ultra | Unified LPDDR5 | 800 GB/s | ~23 |
| M3 Max | Unified LPDDR5 | 400 GB/s | ~11 |
| A100 80GB | HBM2e | 2,039 GB/s | ~58 |
| H100 80GB | HBM3 | 3,350 GB/s | ~96 |
| DDR5 RAM (system) | DDR5-5600 | ~90 GB/s | ~2.5 |
* tok/s estimates assume model fits in memory and is bandwidth-limited. Real performance varies.
The Bandwidth Formula
Theoretical maximum decode speed:
Max tok/s = Memory Bandwidth / Model Size
Example: RTX 4090 with 70B Q4 (~35GB)
Max tok/s = 1,008 GB/s / 35 GB = 28.8 tok/s
Real Performance Is Lower
Actual performance is typically 70-90% of theoretical maximum due to: memory access patterns that aren't perfectly sequential, KV cache reads, compute overhead, and software inefficiencies.
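A minimal sketch of this estimate in Python, folding in an efficiency factor to cover the overheads above; the numbers are the illustrative ones from the table, not guarantees:

```python
def estimate_decode_speed(bandwidth_gb_s, model_size_gb, efficiency=0.8):
    """Rough decode speed in tok/s: bandwidth / model size, scaled by an
    assumed efficiency factor (70-90% of theoretical is typical)."""
    return bandwidth_gb_s / model_size_gb * efficiency

# Illustrative example: 70B Q4 (~35 GB) on a 1,008 GB/s card.
print(estimate_decode_speed(1008, 35, efficiency=1.0))  # 28.8 theoretical
print(estimate_decode_speed(1008, 35))                  # ~23 expected
```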
Memory Types
GDDR6/GDDR6X
- Consumer GPU memory
- 600-1000 GB/s typical
- Cost-effective
- Limited capacity (8-24GB)
HBM2/HBM3
- Datacenter GPU memory
- 2000-3500 GB/s
- Very expensive
- Higher capacity (40-80GB)
Unified (Apple)
- Shared CPU/GPU memory
- 200-800 GB/s
- Large capacity possible
- No discrete GPU VRAM limits
System RAM (DDR5)
- CPU memory
- 50-100 GB/s
- Cheapest per GB
- 10-20× slower than VRAM
Bandwidth vs VRAM: The Tradeoff
These two specs often trade off against each other: consumer GPUs pair high bandwidth with limited VRAM, Apple unified memory pairs large capacity with moderate bandwidth, and system RAM offers the most capacity at the lowest bandwidth. Only datacenter GPUs deliver both, at a much higher price.
Why System RAM Is So Slow
When models don't fit in VRAM, layers can be "offloaded" to system RAM. This works but is much slower:
| Memory | Bandwidth | Relative Speed |
|---|---|---|
| VRAM (GDDR6X) | ~1,000 GB/s | 1× |
| System RAM (DDR5) | ~90 GB/s | 0.09× (11× slower) |
| NVMe SSD | ~7 GB/s | 0.007× (140× slower) |
Offloading 50% of layers to RAM doesn't make inference 50% slower — it makes it roughly 5-10× slower because those layers become the bottleneck.
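A quick sketch of the arithmetic: per-token time is the sum of the fast (VRAM) and slow (RAM) portions, so the slow portion dominates. The default bandwidths below are the illustrative figures from the table above:

```python
def offload_tok_s(model_size_gb, gpu_fraction, vram_gb_s=1000, ram_gb_s=90):
    """Per-token time = GPU-resident bytes / VRAM bandwidth
                      + offloaded bytes / system RAM bandwidth.
    Default bandwidths are illustrative, not measured."""
    gpu_time = model_size_gb * gpu_fraction / vram_gb_s
    ram_time = model_size_gb * (1 - gpu_fraction) / ram_gb_s
    return 1 / (gpu_time + ram_time)

print(offload_tok_s(35, 1.0))  # ~28.6 tok/s, fully in VRAM
print(offload_tok_s(35, 0.5))  # ~4.7 tok/s, half offloaded (~6x slower)
```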
Multi-GPU Bandwidth
Adding GPUs adds aggregate bandwidth, but communication between GPUs has its own bandwidth limits.
See Multi-GPU for details on interconnect bandwidth.
Practical Guidance
Bandwidth Rules of Thumb
- 15-30 tok/s is comfortable for interactive chat
- Below 10 tok/s feels sluggish
- Below 3 tok/s is barely usable (RAM offload territory)
- Quantization helps: Q4 needs ~half the bandwidth of Q8 (see the sketch below)
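Inverting the formula gives a quick way to sanity-check hardware against a target speed. A small sketch, using example model sizes (70B at Q4 is ~35 GB, at Q8 roughly 70 GB) and an example target:

```python
def required_bandwidth(model_size_gb, target_tok_s):
    """Bandwidth (GB/s) needed for a target decode speed,
    assuming inference is purely bandwidth-limited."""
    return model_size_gb * target_tok_s

# Example target: 20 tok/s for comfortable interactive chat.
print(required_bandwidth(35, 20))  # 70B Q4: ~700 GB/s (high-end consumer GPU)
print(required_bandwidth(70, 20))  # 70B Q8: ~1,400 GB/s (HBM territory)
```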
When to Prioritize Bandwidth
- Model fits comfortably in VRAM
- Interactive use (chat, coding assistance)
- Single-user scenarios
When to Prioritize Capacity
- Model doesn't fit otherwise
- Batch processing (total throughput matters more)
- Very long context requirements