GPUs for Local LLM Inference
NVIDIA, AMD, and what specs actually matter
GPUs are the workhorses of local LLM inference. But not all GPU specs matter equally — for inference, VRAM and memory bandwidth are usually more important than raw compute (TFLOPS).
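A quick back-of-envelope illustration of why bandwidth dominates: during decode, each generated token has to stream roughly all of the model's weights from VRAM, so memory bandwidth sets an upper bound on tokens per second. This is a rough sketch with assumed quantization size and efficiency figures, not a benchmark.

```python
# Rough decode-speed estimate: decode is memory-bandwidth-bound, so each new
# token requires streaming (roughly) all model weights from VRAM once.
# Bytes-per-parameter and efficiency values are assumptions, not measurements.

def estimate_decode_tps(params_b: float, bytes_per_param: float,
                        bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    """Upper-bound tokens/sec = effective bandwidth / bytes read per token."""
    weights_gb = params_b * bytes_per_param          # GB of weights streamed per token
    return (bandwidth_gb_s * efficiency) / weights_gb

# Example: 13B model at ~Q4 (~0.56 bytes/param assumed) on an RTX 3090 (936 GB/s)
print(round(estimate_decode_tps(13, 0.56, 936), 1), "tok/s (rough upper bound)")
```

The same arithmetic explains why an RTX 3090 (936 GB/s) decodes far faster than an RTX 4060 Ti 16GB (288 GB/s), even though both have enough VRAM for a quantized 13B model.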
NVIDIA Consumer GPUs
NVIDIA dominates local LLM inference due to CUDA's mature software ecosystem.
| GPU | VRAM | Bandwidth | TDP | Price (New) | Notes |
|---|---|---|---|---|---|
| RTX 4090 | 24GB | 1,008 GB/s | 450W | ~$1,600 | Best single consumer GPU |
| RTX 4080 Super | 16GB | 736 GB/s | 320W | ~$1,000 | Good mid-range |
| RTX 4070 Ti Super | 16GB | 672 GB/s | 285W | ~$800 | Budget 16GB option |
| RTX 4060 Ti 16GB | 16GB | 288 GB/s | 165W | ~$450 | VRAM ok, bandwidth weak |
| RTX 3090 | 24GB | 936 GB/s | 350W | ~$800 used | Best value for 24GB |
| RTX 3090 Ti | 24GB | 1,008 GB/s | 450W | ~$900 used | Slightly faster 3090 |
| RTX 3080 12GB | 12GB | 912 GB/s | 350W | ~$450 used | Good budget option |
Best Value Picks
- New: RTX 4090 if you need the best single-GPU performance
- Used: RTX 3090 — 24GB VRAM at ~$800 is hard to beat
- Budget: RTX 3080 12GB used (~$400-450)
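To gauge how much VRAM you actually need, a rough sizing sketch helps. The bits-per-weight and overhead figures below are approximations for common GGUF quantizations, not exact numbers.

```python
# Rough VRAM sizing: weights dominate, plus headroom for KV cache, activations,
# and runtime overhead. Bits-per-weight values are approximate averages.

QUANT_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8}
OVERHEAD_GB = 2.0  # assumed KV cache + runtime overhead at modest context lengths

def vram_needed_gb(params_b: float, quant: str) -> float:
    return params_b * QUANT_BITS[quant] / 8 + OVERHEAD_GB

for quant in QUANT_BITS:
    need = vram_needed_gb(13, quant)
    fits = "fits" if need <= 24 else "does not fit"
    print(f"13B {quant}: ~{need:.1f} GB -> {fits} in 24 GB")
```

By this estimate, 16GB cards are comfortable for quantized 7B-13B models, while 24GB cards open up 30B-class models at Q4.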
NVIDIA Workstation/Datacenter GPUs
| GPU | VRAM | Bandwidth | TDP | Price | Notes |
|---|---|---|---|---|---|
| RTX A6000 | 48GB | 768 GB/s | 300W | ~$4,500 | Workstation, NVLink capable |
| A100 40GB | 40GB | 1,555 GB/s | 400W | ~$8,000 used | Datacenter, HBM2 |
| A100 80GB | 80GB | 2,039 GB/s | 400W | ~$15,000 | Gold standard for LLMs |
| H100 80GB | 80GB | 3,350 GB/s | 700W | ~$30,000 | Current flagship |
| Tesla P40 | 24GB | 346 GB/s | 250W | ~$300 used | Cheap VRAM, slow |
Datacenter GPU Caveats
- No display outputs — can't use as primary GPU
- Cooling: many are passively cooled (P40, A100 PCIe) or use blower fans and expect server airflow; in a desktop case you'll need to add fans or ducting, which gets loud
- May need specific driver versions
- Power connectors may be non-standard
AMD GPUs
AMD offers competitive hardware but software support (ROCm) lags behind CUDA.
| GPU | VRAM | Bandwidth | TDP | Price | LLM Support |
|---|---|---|---|---|---|
| RX 7900 XTX | 24GB | 960 GB/s | 355W | ~$900 | Good (llama.cpp ROCm) |
| RX 7900 XT | 20GB | 800 GB/s | 315W | ~$700 | Good |
| MI100 | 32GB | 1,229 GB/s | 300W | ~$800 used | ROCm support |
| MI210 | 64GB | 1,638 GB/s | 300W | ~$3,000 used | Good ROCm support |
AMD Pros and Cons
Pros
- Often cheaper for same VRAM
- llama.cpp works well
- 24GB at ~$900 (7900 XTX)
Cons
- ROCm less mature than CUDA
- Some software doesn't support AMD
- More troubleshooting required
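If you go the AMD route, an early sanity check is confirming your runtime actually sees the GPU through ROCm. A minimal sketch using a ROCm build of PyTorch (on ROCm builds, the `torch.cuda` API is backed by HIP):

```python
# Check that a ROCm build of PyTorch sees the AMD GPU.
# On ROCm builds, torch.cuda.* is backed by HIP and torch.version.hip is set.
import torch

print("GPU available:", torch.cuda.is_available())
print("HIP version:  ", getattr(torch.version, "hip", None))  # set on ROCm builds, None otherwise
print("CUDA version: ", torch.version.cuda)                   # None on ROCm builds
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))            # e.g. Radeon RX 7900 XTX
```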
What Specs Matter
| Spec | Importance for LLM Inference | Why |
|---|---|---|
| VRAM | 🔴 Critical | Determines what models fit |
| Memory Bandwidth | 🔴 Critical | Determines decode speed (tok/s) |
| TFLOPS (Compute) | 🟡 Moderate | Affects prefill speed, less important for decode |
| CUDA Cores | 🟡 Moderate | More cores help with batching |
| Tensor Cores | 🟢 Minor | Help with specific precision formats |
| RT Cores | ⚪ Irrelevant | Ray tracing, not used for LLMs |
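Prefill (processing the prompt) is the one phase where compute dominates, because prompt tokens are processed in parallel rather than one at a time. A rough sketch, with assumed FLOPs-per-token, TFLOPS, and efficiency figures:

```python
# Rough prefill estimate: prompt tokens are processed in parallel, so prefill is
# compute-bound at roughly 2 FLOPs per parameter per token. TFLOPS and
# efficiency values are approximate assumptions, not benchmarks.

def estimate_prefill_tps(params_b: float, tflops: float, efficiency: float = 0.4) -> float:
    flops_per_token = 2 * params_b * 1e9             # forward pass ~ 2 * params FLOPs per token
    return (tflops * 1e12 * efficiency) / flops_per_token

# Example: 13B model on an RTX 3090 (~71 dense FP16 tensor TFLOPS, assumed)
print(round(estimate_prefill_tps(13, 71)), "prompt tok/s (rough)")
```

Compared with the bandwidth-bound decode estimate earlier, prefill runs roughly an order of magnitude faster, which is why extra TFLOPS rarely changes how fast responses feel once generation starts.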
Choosing a GPU
```
What's your budget?
│
├── <$500
│   ├── RTX 3080 12GB used (~$400): Good bandwidth, limited VRAM
│   └── Tesla P40 used (~$300): 24GB but slow
│
├── $500-1000
│   ├── RTX 3090 used (~$800): Best value for 24GB
│   └── RX 7900 XTX (~$900): AMD alternative
│
├── $1000-2000
│   └── RTX 4090 (~$1,600): Best single consumer GPU
│
└── $2000+
    ├── 2× RTX 3090 (~$1,600): 48GB total
    ├── RTX A6000 (~$4,500): 48GB single card
    └── A100 80GB (~$15,000): Serious workloads
```
Multi-GPU Considerations
When one GPU isn't enough:
- Same GPU model works best for tensor parallelism
- NVLink dramatically improves multi-GPU scaling (only some cards support it)
- PCIe lanes: Most consumer motherboards limit multi-GPU bandwidth
- See Multi-GPU for details
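As a concrete example, here is a minimal tensor-parallel sketch using vLLM across two matched GPUs. The model name is a placeholder, and it assumes vLLM is installed and the model fits in the combined VRAM.

```python
# Minimal tensor-parallelism sketch with vLLM across two identical GPUs.
# The model name is just an example; use whatever fits your combined VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",  # example model (placeholder)
    tensor_parallel_size=2,             # shard the model across 2 GPUs
)
outputs = llm.generate(
    ["Why does memory bandwidth matter for LLM decoding?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```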
Used GPU Buying Tips
What to Check
- Mining history: Not necessarily bad, but check temps/fans
- Warranty status: Some NVIDIA cards have transferable warranties
- Thermal paste age: May need repasting on older cards
- Fan condition: Listen for bearing noise, check spin-up
- VRAM errors: Run memory tests before purchasing if possible
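For the VRAM check, a quick pattern test with PyTorch is a reasonable first pass. This is a rough sketch and a sanity check only, not a substitute for a dedicated memory tester.

```python
# Quick VRAM pattern test: fill most of free VRAM with known values,
# read them back, and count mismatches.
import torch

device = torch.device("cuda:0")
free_bytes, _ = torch.cuda.mem_get_info(device)
n = int(free_bytes * 0.7) // 4                    # ~70% of free VRAM as float32,
buf = torch.empty(n, dtype=torch.float32, device=device)  # leaving room for the comparison buffer

for pattern in (0.0, 1.0, 123456.0):
    buf.fill_(pattern)
    torch.cuda.synchronize()
    mismatches = (buf != pattern).sum().item()
    print(f"pattern {pattern}: {mismatches} mismatched elements")
```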