GPUs for Local LLM Inference
NVIDIA, AMD, and what specs actually matter
GPUs are the workhorses of local LLM inference. But not all GPU specs matter equally — for inference, VRAM and memory bandwidth are usually more important than raw compute (TFLOPS).
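A quick back-of-envelope illustration of why bandwidth dominates: during decode, each generated token has to stream roughly all of the model's weights from VRAM, so memory bandwidth sets an upper bound on tokens per second. This is a rough sketch with assumed quantization size and efficiency figures, not a benchmark.

```python
# Rough decode-speed estimate: decode is memory-bandwidth-bound, so each new
# token requires streaming (roughly) all model weights from VRAM once.
# Bytes-per-parameter and efficiency values are assumptions, not measurements.

def estimate_decode_tps(params_b: float, bytes_per_param: float,
                        bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Upper-bound tokens/sec = effective bandwidth / bytes read per token."""
    weights_gb = params_b * bytes_per_param          # GB of weights streamed per token
    return (bandwidth_gb_s * efficiency) / weights_gb

# Example: 13B model at ~Q4 (~0.56 bytes/param assumed) on an RTX 3090 (936 GB/s)
print(round(estimate_decode_tps(13, 0.56, 936), 1), "tok/s (rough upper bound)")
```

The same arithmetic explains why an RTX 3090 (936 GB/s) decodes far faster than an RTX 4060 Ti 16GB (288 GB/s), even though both have enough VRAM for a quantized 13B model.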
NVIDIA Consumer GPUs
NVIDIA dominates local LLM inference due to CUDA's mature software ecosystem.
| GPU | VRAM | Bandwidth | TDP | Price (New) | Notes |
|---|---|---|---|---|---|
| RTX 4090 | 24GB | 1,008 GB/s | 450W | ~$1,600 | Best single consumer GPU |
| RTX 4080 Super | 16GB | 736 GB/s | 320W | ~$1,000 | Good mid-range |
| RTX 4070 Ti Super | 16GB | 672 GB/s | 285W | ~$800 | Budget 16GB option |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | 165W | ~$450 | VRAM ok, bandwidth weak |
| RTX 3090 | 24GB | 936 GB/s | 350W | ~$800 used | Best value for 24GB |
| RTX 3090 Ti | 24GB | 1,008 GB/s | 450W | ~$900 used | Slightly faster 3090 |
| RTX 3080 12GB | 12GB | 912 GB/s | 350W | ~$450 used | Good budget option |
Best Value Picks
- New: RTX 4090 if you need the best single-GPU performance
- Used: RTX 3090 — 24GB VRAM at ~$800 is hard to beat
- Budget: RTX 3080 12GB used (~$400-450)
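To gauge how much VRAM you actually need, a rough sizing sketch helps. The bits-per-weight and overhead figures below are approximations for common GGUF quantizations, not exact numbers.

```python
# Rough VRAM sizing: weights dominate, plus headroom for KV cache, activations,
# and runtime overhead. Bits-per-weight values are approximate averages.

QUANT_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8}
OVERHEAD_GB = 2.0  # assumed KV cache + runtime overhead at modest context lengths

def vram_needed_gb(params_b: float, quant: str) -> float:
    return params_b * QUANT_BITS[quant] / 8 + OVERHEAD_GB

for quant in QUANT_BITS:
    need = vram_needed_gb(13, quant)
    fits = "fits" if need <= 24 else "does not fit"
    print(f"13B {quant}: ~{need:.1f} GB -> {fits} in 24 GB")
```

By this estimate, 16GB cards are comfortable for quantized 7B-13B models, while 24GB cards open up 30B-class models at Q4.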
NVIDIA Workstation/Datacenter GPUs
| GPU | VRAM | Bandwidth | TDP | Price | Notes |
|---|---|---|---|---|---|
| RTX A6000 | 48GB | 768 GB/s | 300W | ~$4,500 | Workstation, NVLink capable |
| A100 40GB | 40GB | 1,555 GB/s | 400W | ~$8,000 used | Datacenter, HBM2 |
| A100 80GB | 80GB | 2,039 GB/s | 400W | ~$15,000 | Gold standard for LLMs |
| H100 80GB | 80GB | 3,350 GB/s | 700W | ~$30,000 | Current flagship |
| Tesla P40 | 24GB | 346 GB/s | 250W | ~$300 used | Cheap VRAM, slow |
Datacenter GPU Caveats
- No display outputs — can't use as primary GPU
- Cooling: many are passively cooled (P40, A100 PCIe) or use blower fans and expect server airflow; in a desktop case you'll need to add fans or ducting, which gets loud
- May need specific driver versions
- Power connectors may be non-standard
AMD GPUs
AMD offers competitive hardware but software support (ROCm) lags behind CUDA.
| GPU | VRAM | Bandwidth | TDP | Price | LLM Support |
|---|---|---|---|---|---|
| RX 7900 XTX | 24GB | 960 GB/s | 355W | ~$900 | Good (llama.cpp ROCm) |
| RX 7900 XT | 20GB | 800 GB/s | 315W | ~$700 | Good |
| MI100 | 32GB | 1,229 GB/s | 300W | ~$800 used | ROCm support |
| MI210 | 64GB | 1,638 GB/s | 300W | ~$3,000 used | Good ROCm support |
AMD Pros and Cons
Pros
- Often cheaper for same VRAM
- llama.cpp works well
- 24GB at ~$900 (7900 XTX)
Cons
- ROCm less mature than CUDA
- Some software doesn't support AMD
- More troubleshooting required
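If you go the AMD route, an early sanity check is confirming your runtime actually sees the GPU through ROCm. A minimal sketch using a ROCm build of PyTorch (on ROCm builds, the `torch.cuda` API is backed by HIP):

```python
# Check that a ROCm build of PyTorch sees the AMD GPU.
# On ROCm builds, torch.cuda.* is backed by HIP and torch.version.hip is set.
import torch

print("GPU available:", torch.cuda.is_available())
print("HIP version:  ", getattr(torch.version, "hip", None))  # set on ROCm builds, None otherwise
print("CUDA version: ", torch.version.cuda)                   # None on ROCm builds
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))            # e.g. Radeon RX 7900 XTX
```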
What Specs Matter
| Spec | Importance for LLM Inference | Why |
|---|---|---|
| VRAM | 🔴 Critical | Determines what models fit |
| Memory Bandwidth | 🔴 Critical | Determines decode speed (tok/s) |
| TFLOPS (Compute) | 🟡 Moderate | Affects prefill speed, less important for decode |
| CUDA Cores | 🟡 Moderate | More cores help with batching |
| Tensor Cores | 🟢 Minor | Help with specific precision formats |
| RT Cores | ⚪ Irrelevant | Ray tracing, not used for LLMs |
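Prefill (processing the prompt) is the one phase where compute dominates, because prompt tokens are processed in parallel rather than one at a time. A rough sketch, with assumed FLOPs-per-token, TFLOPS, and efficiency figures:

```python
# Rough prefill estimate: prompt tokens are processed in parallel, so prefill is
# compute-bound at roughly 2 FLOPs per parameter per token. TFLOPS and
# efficiency values are approximate assumptions, not benchmarks.

def estimate_prefill_tps(params_b: float, tflops: float, efficiency: float = 0.4) -> float:
    flops_per_token = 2 * params_b * 1e9             # forward pass ~ 2 * params FLOPs per token
    return (tflops * 1e12 * efficiency) / flops_per_token

# Example: 13B model on an RTX 3090 (~71 dense FP16 tensor TFLOPS, assumed)
print(round(estimate_prefill_tps(13, 71)), "prompt tok/s (rough)")
```

Compared with the bandwidth-bound decode estimate earlier, prefill runs roughly an order of magnitude faster, which is why extra TFLOPS rarely changes how fast responses feel once generation starts.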
Choosing a GPU
```
What's your budget?
│
├── <$500
│   ├── RTX 3080 12GB used (~$400): Good bandwidth, limited VRAM
│   └── Tesla P40 used (~$300): 24GB but slow
│
├── $500-1000
│   ├── RTX 3090 used (~$800): Best value for 24GB
│   └── RX 7900 XTX (~$900): AMD alternative
│
├── $1000-2000
│   └── RTX 4090 (~$1,600): Best single consumer GPU
│
└── $2000+
    ├── 2× RTX 3090 (~$1,600): 48GB total
    ├── RTX A6000 (~$4,500): 48GB single card
    └── A100 80GB (~$15,000): Serious workloads
```
Multi-GPU Considerations
When one GPU isn't enough:
- Same GPU model works best for tensor parallelism
- NVLink dramatically improves multi-GPU scaling (only some cards support it)
- PCIe lanes: Most consumer motherboards limit multi-GPU bandwidth
- See Multi-GPU for details
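As a concrete example, here is a minimal tensor-parallel sketch using vLLM across two matched GPUs. The model name is a placeholder, and it assumes vLLM is installed and the model fits in the combined VRAM.

```python
# Minimal tensor-parallelism sketch with vLLM across two identical GPUs.
# The model name is just an example; use whatever fits your combined VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # example model (placeholder)
    tensor_parallel_size=2,             # shard the model across 2 GPUs
)
outputs = llm.generate(
    ["Why does memory bandwidth matter for LLM decoding?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```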
Used GPU Buying Tips
What to Check
- Mining history: Not necessarily bad, but check temps/fans
- Warranty status: Some NVIDIA cards have transferable warranties
- Thermal paste age: May need repasting on older cards
- Fan condition: Listen for bearing noise, check spin-up
- VRAM errors: Run memory tests before purchasing if possible
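For the VRAM check, a quick pattern test with PyTorch is a reasonable first pass. This is a rough sketch and a sanity check only, not a substitute for a dedicated memory tester.

```python
# Quick VRAM pattern test: fill most of free VRAM with known values,
# read them back, and count mismatches.
import torch

device = torch.device("cuda:0")
free_bytes, _ = torch.cuda.mem_get_info(device)
n = int(free_bytes * 0.7) // 4                    # ~70% of free VRAM as float32,
buf = torch.empty(n, dtype=torch.float32, device=device)  # leaving room for the comparison buffer

for pattern in (0.0, 1.0, 123456.0):
    buf.fill_(pattern)
    torch.cuda.synchronize()
    mismatches = (buf != pattern).sum().item()
    print(f"pattern {pattern}: {mismatches} mismatched elements")
```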