Hardware
Physical components and constraints for local LLM systems
Hardware determines what's possible. The model you can run, the speed you can achieve, and the context length you can support are all fundamentally limited by your hardware. Understanding these constraints helps you make better decisions about what to buy and what to expect.
The Key Constraint: Memory
For most local LLM setups, memory is the primary constraint — not compute. The model weights need to live somewhere, and inference is typically bottlenecked by how fast you can read those weights.
The Bandwidth Rule
During token generation (decode), you read the entire set of model weights once for each token produced. If your model is 35GB and you want 30 tokens/sec, you need roughly 1,050 GB/s of memory bandwidth (35 GB × 30). This is why memory bandwidth matters more than raw compute for most inference workloads.
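As a quick sanity check, that ceiling is just bandwidth divided by model size. A minimal sketch (the function name and the simplifying assumption that every weight is read exactly once per token are mine; real throughput lands below this because of KV-cache traffic, compute time, and software overhead):

```python
def decode_tokens_per_sec_ceiling(model_size_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token.

    Ignores KV-cache reads, compute time, and framework overhead, so real
    throughput is usually noticeably lower than this value.
    """
    return bandwidth_gb_s / model_size_gb

# 35 GB model (roughly a 70B at 4-bit) on hardware from the table below:
print(decode_tokens_per_sec_ceiling(35, 800))    # M2 Ultra class:  ~22.9 tok/s ceiling
print(decode_tokens_per_sec_ceiling(35, 2039))   # A100 80GB class: ~58.3 tok/s ceiling
```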
Hardware Categories
GPUs
NVIDIA, AMD, and their relative strengths for LLM inference. VRAM, bandwidth, and software support.
Apple Silicon
M1/M2/M3/M4 chips with unified memory. Different tradeoffs than discrete GPUs.
VRAM
GPU memory capacity, types (GDDR6X, HBM), and how much you actually need.
Memory Bandwidth
The often-overlooked spec that determines inference speed. Why bandwidth matters more than TFLOPS.
Multi-GPU
NVLink, PCIe, tensor parallelism. When adding GPUs helps and when it doesn't.
CPU Inference
Running models on CPU with RAM offload. Slower but accessible.
Quick Reference: Common Hardware
| Hardware | Memory | Bandwidth | Typical tok/s (70B Q4) | Price (New) |
|---|---|---|---|---|
| RTX 4090 | 24GB VRAM | 1,008 GB/s | ~2-5 (heavy CPU offload) | ~$1,600 |
| RTX 3090 | 24GB VRAM | 936 GB/s | ~2-5 (heavy CPU offload) | ~$800 used |
| 2× RTX 3090 | 48GB VRAM | ~1.4 TB/s effective | ~35-40 | ~$1,600 used |
| Mac Studio M2 Ultra | 192GB unified | 800 GB/s | ~15-20 | ~$6,000 |
| M3 Max MacBook Pro | 128GB unified | 400 GB/s | ~8-12 | ~$4,500 |
| A100 80GB | 80GB HBM | 2,039 GB/s | ~50-60 | ~$15,000 |
* Performance figures are approximate and vary by model, quantization, and software.
Key Tradeoffs
VRAM vs Bandwidth
More VRAM lets you fit larger models, but bandwidth determines how fast they run. A 48GB GPU with low bandwidth may be slower than a 24GB GPU with high bandwidth running a smaller model.
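To make that concrete, here is the same ceiling arithmetic applied to two illustrative cards; the bandwidth and size figures are assumptions for the sake of the example, not benchmarks of specific products.

```python
# Illustrative numbers only: a large, slow card vs. a small, fast one.
BIG_SLOW_BW_GB_S = 300     # hypothetical 48 GB card with modest bandwidth
SMALL_FAST_BW_GB_S = 1000  # hypothetical 24 GB card with high bandwidth

MODEL_34B_Q4_GB = 19       # fits on both cards
MODEL_70B_Q4_GB = 38       # fits only on the 48 GB card

print(BIG_SLOW_BW_GB_S / MODEL_34B_Q4_GB)    # ~16 tok/s ceiling for 34B on the big, slow card
print(SMALL_FAST_BW_GB_S / MODEL_34B_Q4_GB)  # ~53 tok/s ceiling for 34B on the small, fast card
print(BIG_SLOW_BW_GB_S / MODEL_70B_Q4_GB)    # ~8 tok/s ceiling for 70B, which only the big card holds
```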
Single Powerful vs Multiple Smaller
One 48GB GPU usually outperforms two 24GB GPUs with the same total VRAM, because multi-GPU adds communication overhead. But sometimes two smaller GPUs are all you can get.
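If you do split a model across two cards, the serving framework handles the sharding. A minimal sketch using vLLM's tensor parallelism; the model path is a placeholder for a 4-bit AWQ checkpoint you would supply, and the settings are illustrative rather than tuned:

```python
# Sketch: sharding one model across two GPUs with vLLM tensor parallelism.
# "your-org/llama-3-70b-instruct-awq" is a placeholder; point it at a real AWQ checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/llama-3-70b-instruct-awq",  # hypothetical 4-bit AWQ repo or local path
    quantization="awq",                         # 4-bit weights so a ~70B model fits in 2x24GB
    tensor_parallel_size=2,                     # split weights/attention heads across 2 GPUs
)

out = llm.generate(
    ["Explain why memory bandwidth limits decode speed."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```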
New vs Used
Used datacenter GPUs (P40, A100) can be exceptional value per gigabyte of memory, but they come with quirks: cooling, power connectors, and driver support. Consumer cards are easier to live with but more expensive per GB.
Discrete GPU vs Apple Silicon
Apple offers huge unified memory pools with reasonable bandwidth. NVIDIA offers higher bandwidth but limited VRAM per card. Different sweet spots for different model sizes.
What Hardware Do You Need?
The answer depends on what models you want to run:
| Model Class | Example Models | Q4 Size | Minimum Hardware |
|---|---|---|---|
| Small (1-3B) | Phi-2, Gemma 2B | 1-2GB | Any modern GPU, or CPU |
| Medium (7-8B) | Llama 3 8B, Mistral 7B | 4-5GB | 8GB VRAM or 16GB unified |
| Large (13-14B) | Llama 2 13B | 7-8GB | 12GB+ VRAM or 24GB unified |
| XL (30-34B) | CodeLlama 34B | 18-20GB | 24GB VRAM or 48GB unified |
| XXL (65-70B) | Llama 3 70B | 35-40GB | 48GB+ VRAM or 64GB+ unified |
| Frontier (100B+) | Llama 3.1 405B | 200GB+ | Multi-GPU or very large Mac |
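The Q4 sizes in the table follow a simple rule of thumb: common 4-bit GGUF quantizations average roughly 4.5 bits per weight once their higher-precision tensors are included. A rough estimator (my own approximation; the KV cache and runtime buffers add a few extra GB on top, more at long context):

```python
def q4_weights_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of ~4-bit quantized weights alone (no KV cache or overhead)."""
    return params_billions * bits_per_weight / 8  # billions of weights * bytes per weight

for p in (2, 8, 13, 34, 70, 405):
    print(f"{p:>4}B -> ~{q4_weights_gb(p):.0f} GB")
# 2B ~1 GB, 8B ~5 GB, 13B ~7 GB, 34B ~19 GB, 70B ~39 GB, 405B ~228 GB
```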
Starting Point Recommendation
If you're just getting started: an RTX 4090 (24GB) handles 7B-14B models comfortably, fits ~30B models at Q4, and can run 70B only with slow CPU offloading. It's the best single consumer GPU for local LLM work. For 70B-class models at usable speeds, look at multi-GPU setups or high-memory Apple Silicon.