Hardware
Physical components and constraints for local LLM systems
Hardware determines what's possible. The model you can run, the speed you can achieve, and the context length you can support are all fundamentally limited by your hardware. Understanding these constraints helps you make better decisions about what to buy and what to expect.
The Key Constraint: Memory
For most local LLM setups, memory is the primary constraint — not compute. The model weights need to live somewhere, and inference is typically bottlenecked by how fast you can read those weights.
The Bandwidth Rule
During token generation (decode), you read the entire set of model weights once for each token produced. If your model is 35GB and you want 30 tokens/sec, you need roughly 1,050 GB/s of memory bandwidth (35 GB × 30). This is why memory bandwidth matters more than raw compute for most inference workloads.
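As a quick sanity check, that ceiling is just bandwidth divided by model size. A minimal sketch (the function name and the simplifying assumption that every weight is read exactly once per token are mine; real throughput lands below this because of KV-cache traffic, compute time, and software overhead):

```python
def decode_tokens_per_sec_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token.

    Ignores KV-cache reads, compute time, and framework overhead, so real
    throughput is usually noticeably lower than this value.
    """
    return bandwidth_gb_s / model_size_gb

# 35 GB model (roughly a 70B at 4-bit) on hardware from the table below:
print(decode_tokens_per_sec_ceiling(35, 800))    # M2 Ultra class:  ~22.9 tok/s ceiling
print(decode_tokens_per_sec_ceiling(35, 2039))   # A100 80GB class: ~58.3 tok/s ceiling
```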
Hardware Categories
GPUs
NVIDIA, AMD, and their relative strengths for LLM inference. VRAM, bandwidth, and software support.
Apple Silicon
M1/M2/M3/M4 chips with unified memory. Different tradeoffs than discrete GPUs.
VRAM
GPU memory capacity, types (GDDR6X, HBM), and how much you actually need.
Memory Bandwidth
The often-overlooked spec that determines inference speed. Why bandwidth matters more than TFLOPS.
Multi-GPU
NVLink, PCIe, tensor parallelism. When adding GPUs helps and when it doesn't.
CPU Inference
Running models on CPU with RAM offload. Slower but accessible.
Quick Reference: Common Hardware
| Hardware | Memory | Bandwidth | Typical tok/s (70B Q4) | Price (New) |
|---|---|---|---|---|
| RTX 4090 | 24GB VRAM | 1,008 GB/s | ~2-5 (heavy CPU offload) | ~$1,600 |
| RTX 3090 | 24GB VRAM | 936 GB/s | ~2-5 (heavy CPU offload) | ~$800 used |
| 2× RTX 3090 | 48GB VRAM | ~1.4 TB/s effective | ~35-40 | ~$1,600 used |
| Mac Studio M2 Ultra | 192GB unified | 800 GB/s | ~15-20 | ~$6,000 |
| M3 Max MacBook Pro | 128GB unified | 400 GB/s | ~8-12 | ~$4,500 |
| A100 80GB | 80GB HBM | 2,039 GB/s | ~50-60 | ~$15,000 |
* Performance figures are approximate and vary by model, quantization, and software.
Key Tradeoffs
VRAM vs Bandwidth
More VRAM lets you fit larger models, but bandwidth determines how fast they run. A 48GB GPU with low bandwidth may be slower than a 24GB GPU with high bandwidth running a smaller model.
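To make that concrete, here is the same ceiling arithmetic applied to two illustrative cards; the bandwidth and size figures are assumptions for the sake of the example, not benchmarks of specific products.

```python
# Illustrative numbers only: a large, slow card vs. a small, fast one.
BIG_SLOW_BW_GB_S = 300     # hypothetical 48 GB card with modest bandwidth
SMALL_FAST_BW_GB_S = 1000  # hypothetical 24 GB card with high bandwidth

MODEL_34B_Q4_GB = 19       # fits on both cards
MODEL_70B_Q4_GB = 38       # fits only on the 48 GB card

print(BIG_SLOW_BW_GB_S / MODEL_34B_Q4_GB)    # ~16 tok/s ceiling for 34B on the big, slow card
print(SMALL_FAST_BW_GB_S / MODEL_34B_Q4_GB)  # ~53 tok/s ceiling for 34B on the small, fast card
print(BIG_SLOW_BW_GB_S / MODEL_70B_Q4_GB)    # ~8 tok/s ceiling for 70B, which only the big card holds
```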
Single Powerful vs Multiple Smaller
One 48GB GPU usually outperforms two 24GB GPUs with the same total VRAM, because multi-GPU adds communication overhead. But sometimes two smaller GPUs are all you can get.
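If you do split a model across two cards, the serving framework handles the sharding. A minimal sketch using vLLM's tensor parallelism; the model path is a placeholder for a 4-bit AWQ checkpoint you would supply, and the settings are illustrative rather than tuned:

```python
# Sketch: sharding one model across two GPUs with vLLM tensor parallelism.
# "your-org/llama-3-70b-instruct-awq" is a placeholder; point it at a real AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # hypothetical 4-bit AWQ repo or local path
    quantization="awq",                         # 4-bit weights so a ~70B model fits in 2x24GB
    tensor_parallel_size=2,                     # split weights/attention heads across 2 GPUs
)

out = llm.generate(
    ["Explain why memory bandwidth limits decode speed."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```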
New vs Used
Used datacenter GPUs (P40, A100) can be exceptional value per gigabyte of memory, but they come with quirks: cooling, power connectors, and driver support. Consumer cards are easier to live with but more expensive per GB.
Discrete GPU vs Apple Silicon
Apple offers huge unified memory pools with reasonable bandwidth. NVIDIA offers higher bandwidth but limited VRAM per card. Different sweet spots for different model sizes.
What Hardware Do You Need?
The answer depends on what models you want to run:
| Model Class | Example Models | Q4 Size | Minimum Hardware |
|---|---|---|---|
| Small (1-3B) | Phi-2, Gemma 2B | 1-2GB | Any modern GPU, or CPU |
| Medium (7-8B) | Llama 3 8B, Mistral 7B | 4-5GB | 8GB VRAM or 16GB unified |
| Large (13-14B) | Llama 2 13B | 7-8GB | 12GB+ VRAM or 24GB unified |
| XL (30-34B) | CodeLlama 34B | 18-20GB | 24GB VRAM or 48GB unified |
| XXL (65-70B) | Llama 3 70B | 35-40GB | 48GB+ VRAM or 64GB+ unified |
| Frontier (100B+) | Llama 3.1 405B | 200GB+ | Multi-GPU or very large Mac |
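The Q4 sizes in the table follow a simple rule of thumb: common 4-bit GGUF quantizations average roughly 4.5 bits per weight once their higher-precision tensors are included. A rough estimator (my own approximation; the KV cache and runtime buffers add a few extra GB on top, more at long context):

```python
def q4_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of ~4-bit quantized weights alone (no KV cache or overhead)."""
    return params_billions * bits_per_weight / 8  # billions of weights * bytes per weight

for p in (2, 8, 13, 34, 70, 405):
    print(f"{p:>4}B -> ~{q4_weights_gb(p):.0f} GB")
# 2B ~1 GB, 8B ~5 GB, 13B ~7 GB, 34B ~19 GB, 70B ~39 GB, 405B ~228 GB
```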
Starting Point Recommendation
If you're just getting started: an RTX 4090 (24GB) handles 7B-14B models comfortably, fits ~30B models at Q4, and can run 70B only with slow CPU offloading. It's the best single consumer GPU for local LLM work. For 70B-class models at usable speeds, look at multi-GPU setups or high-memory Apple Silicon.