GPUs are the workhorses of local LLM inference. But not all GPU specs matter equally — for inference, VRAM and memory bandwidth are usually more important than raw compute (TFLOPS).
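A useful rule of thumb behind that claim: during decode, each generated token streams roughly the entire model through the memory bus, so bandwidth divided by model size gives an upper bound on tokens per second. Here is a minimal sketch of that estimate (illustrative numbers only, not benchmarks):

```python
# Rough upper bound on decode speed: every generated token requires
# streaming (roughly) all model weights through the GPU's memory bus.
# Illustrative sketch only; real-world throughput is lower.

def decode_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Estimate the memory-bandwidth ceiling on tokens/second."""
    return bandwidth_gb_s / model_size_gb

# Example: a 7B model quantized to ~4 GB on an RTX 3090 (936 GB/s)
print(decode_tokens_per_sec(936, 4.0))   # ~234 tok/s ceiling
# The same model on an RTX 4060 Ti (288 GB/s)
print(decode_tokens_per_sec(288, 4.0))   # ~72 tok/s ceiling
```

Real throughput lands well below this ceiling because of kernel overhead and KV-cache reads, but the ranking between cards tends to hold.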

NVIDIA Consumer GPUs

NVIDIA dominates local LLM inference due to CUDA's mature software ecosystem.

| GPU | VRAM | Bandwidth | TDP | Price | Notes |
|-----|------|-----------|-----|-------|-------|
| RTX 4090 | 24GB | 1,008 GB/s | 450W | ~$1,600 | Best single consumer GPU |
| RTX 4080 Super | 16GB | 736 GB/s | 320W | ~$1,000 | Good mid-range |
| RTX 4070 Ti Super | 16GB | 672 GB/s | 285W | ~$800 | Budget 16GB option |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | 165W | ~$450 | VRAM OK, bandwidth weak |
| RTX 3090 | 24GB | 936 GB/s | 350W | ~$800 used | Best value for 24GB |
| RTX 3090 Ti | 24GB | 1,008 GB/s | 450W | ~$900 used | Slightly faster than the 3090 |
| RTX 3080 12GB | 12GB | 912 GB/s | 350W | ~$450 used | Good budget option |

Best Value Picks

NVIDIA Workstation/Datacenter GPUs

| GPU | VRAM | Bandwidth | TDP | Price | Notes |
|-----|------|-----------|-----|-------|-------|
| RTX A6000 | 48GB | 768 GB/s | 300W | ~$4,500 | Workstation, NVLink capable |
| A100 40GB | 40GB | 1,555 GB/s | 400W | ~$8,000 used | Datacenter, HBM2 |
| A100 80GB | 80GB | 2,039 GB/s | 400W | ~$15,000 | Gold standard for LLMs |
| H100 80GB | 80GB | 3,350 GB/s | 700W | ~$30,000 | Current flagship |
| Tesla P40 | 24GB | 346 GB/s | 250W | ~$300 used | Cheap VRAM, slow |

Datacenter GPU Caveats

AMD GPUs

AMD offers competitive hardware at lower prices, but its software stack (ROCm) still lags behind CUDA in maturity and framework coverage.

| GPU | VRAM | Bandwidth | TDP | Price | LLM Support |
|-----|------|-----------|-----|-------|-------------|
| RX 7900 XTX | 24GB | 960 GB/s | 355W | ~$900 | Good (llama.cpp ROCm) |
| RX 7900 XT | 20GB | 800 GB/s | 315W | ~$700 | Good |
| MI100 | 32GB | 1,229 GB/s | 300W | ~$800 used | ROCm support |
| MI210 | 64GB | 1,638 GB/s | 300W | ~$3,000 used | Good ROCm support |

AMD Pros and Cons

Pros

  • Often cheaper for same VRAM
  • llama.cpp works well
  • 24GB at ~$900 (7900 XTX)

Cons

  • ROCm less mature than CUDA
  • Some software doesn't support AMD
  • More troubleshooting required
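Much of the troubleshooting starts with confirming which backend your PyTorch build actually targets. ROCm builds reuse the `torch.cuda` namespace, so the same code runs on both vendors; the sketch below distinguishes them (assuming a recent PyTorch, where `torch.version.hip` is None on CUDA builds):

```python
import torch

# ROCm builds of PyTorch reuse the torch.cuda namespace, so
# torch.cuda.is_available() returns True on supported AMD GPUs too.
print("GPU available:", torch.cuda.is_available())

# Distinguish the backends: one of these is non-None per build.
print("CUDA version:", torch.version.cuda)  # e.g. "12.1" on NVIDIA builds
print("HIP version:", torch.version.hip)    # set on ROCm builds

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```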

What Specs Matter

| Spec | Importance for LLM Inference | Why |
|------|------------------------------|-----|
| VRAM | 🔴 Critical | Determines what models fit |
| Memory Bandwidth | 🔴 Critical | Determines decode speed (tok/s) |
| TFLOPS (Compute) | 🟡 Moderate | Affects prefill speed; less important for decode |
| CUDA Cores | 🟡 Moderate | More cores help with batching |
| Tensor Cores | 🟢 Minor | Help with specific precision formats |
| RT Cores | ⚪ Irrelevant | Ray tracing, not used for LLMs |
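To make the VRAM row concrete: a model's weight footprint is roughly parameter count × bytes per parameter, plus headroom for the KV cache and runtime buffers. A minimal sketch of that arithmetic, assuming a flat ~20% overhead (an illustrative assumption, not a measured figure):

```python
# Rough VRAM footprint: params * (bits / 8), plus overhead for the
# KV cache, activations, and runtime buffers. The 20% overhead factor
# is an illustrative assumption; actual usage varies by backend.

def model_vram_gb(params_b: float, bits_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Estimate VRAM (in GB) for `params_b` billion parameters."""
    return params_b * bits_per_weight / 8 * overhead

print(model_vram_gb(70, 4))   # 70B at 4-bit: ~42 GB -> 48GB card or 2x 24GB
print(model_vram_gb(13, 4))   # 13B at 4-bit: ~7.8 GB -> fits in 12GB
print(model_vram_gb(7, 16))   # 7B at FP16: ~16.8 GB -> needs 24GB
```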

Choosing a GPU

What's your budget?
├── <$500
│   ├── RTX 3080 12GB used (~$400): Good bandwidth, limited VRAM
│   └── Tesla P40 used (~$300): 24GB but slow
├── $500-1000
│   ├── RTX 3090 used (~$800): Best value for 24GB
│   └── RX 7900 XTX (~$900): AMD alternative
├── $1000-2000
│   └── RTX 4090 (~$1,600): Best single consumer GPU
└── $2000+
    ├── 2× RTX 3090 (~$1,600): 48GB total
    ├── RTX A6000 (~$4,500): 48GB single card
    └── A100 80GB (~$15,000): Serious workloads

Multi-GPU Considerations

When one GPU isn't enough, inference frameworks can split a model across cards, at the cost of some overhead and extra setup. See the sketch below.
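As one concrete example, here is a sketch using llama-cpp-python's `tensor_split` option to spread a quantized model across two 24GB cards. The GGUF filename is hypothetical, and the even split is a starting point rather than a tuned value:

```python
from llama_cpp import Llama

# Splitting a 70B 4-bit model (~40 GB of weights) across two 24GB
# RTX 3090s. Requires a CUDA (or ROCm) build of llama-cpp-python.
llm = Llama(
    model_path="llama-70b-q4_k_m.gguf",  # hypothetical filename
    n_gpu_layers=-1,          # offload all layers to GPU
    tensor_split=[0.5, 0.5],  # fraction of weights on GPU 0 and GPU 1
)

out = llm("Q: Why does memory bandwidth matter? A:", max_tokens=64)
print(out["choices"][0]["text"])
```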

Used GPU Buying Tips

What to Check
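Beyond visual inspection, confirm the card enumerates and reports sane telemetry before any return window closes. A minimal sketch, assuming an NVIDIA card with working drivers (the nvidia-smi query fields below are standard); this only verifies basic health, so follow up with a sustained inference load to check stability under heat:

```python
import subprocess

# Query basic health indicators on a just-purchased used card.
fields = "name,memory.total,temperature.gpu,fan.speed,power.draw"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
```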