These guides are designed to help you make practical decisions. Instead of explaining concepts, they answer specific questions with concrete recommendations.

Which Model Fits on Which Hardware?

VRAM requirements by model size and quantization. Quick lookup tables and the math behind them.
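For a back-of-the-envelope version of that math: weight memory is roughly parameter count × bytes per weight, plus some runtime overhead. A minimal sketch, with illustrative bits-per-weight and overhead figures (assumptions, not measurements):

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead_frac: float = 0.10) -> float:
    """Rough VRAM estimate for the model weights alone.

    params_b         -- parameter count in billions
    bits_per_weight  -- 16 for FP16, ~4.5 for a typical 4-bit quant
    overhead_frac    -- assumed fudge factor for runtime buffers (illustrative)
    """
    weight_gb = params_b * 1e9 * (bits_per_weight / 8) / 1e9
    return weight_gb * (1 + overhead_frac)

if __name__ == "__main__":
    for name, params, bits in [("7B FP16", 7, 16), ("7B Q4", 7, 4.5),
                               ("70B Q4", 70, 4.5)]:
        print(f"{name}: ~{estimate_vram_gb(params, bits):.1f} GB")
```

Add KV cache on top of this (see the formula under "Common Mistakes" below) before deciding whether a model really fits.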

Why Is My Setup Slow?

The model fits but it's crawling. Common causes: bandwidth limits, offloading, thermal throttling, wrong settings.

When Does Multi-GPU Help?

Adding a second GPU: when it scales well, when it doesn't, and the role of interconnects.
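As one illustration of why the interconnect matters (not a recommendation of any particular stack): tensor parallelism shards every layer across cards, so activations cross the link on every token. A minimal sketch, assuming vLLM is installed, two GPUs are visible, and the model ID is just an example:

```python
# Sketch: tensor parallelism across two GPUs with vLLM.
# Assumes vLLM is installed and the model fits across both cards.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example model id
    tensor_parallel_size=2,  # shard every layer across 2 GPUs
)
out = llm.generate(["Explain NVLink in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```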

When Is RAM Offload Too Slow?

Offloading to system RAM: how slow is too slow, and when it's still worthwhile.
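A rough way to answer "how slow is too slow" before buying anything: estimate decode speed from how much of the model streams from VRAM versus system RAM. A sketch with assumed bandwidth figures, not a benchmark:

```python
def offload_tokens_per_s(model_gb: float, frac_in_vram: float,
                         vram_bw_gbs: float = 900.0,
                         ram_bw_gbs: float = 60.0) -> float:
    """Crude decode-speed estimate with partial RAM offload.

    Assumes every weight is read once per token and memory traffic
    dominates; bandwidth figures are illustrative assumptions.
    """
    gpu_time = model_gb * frac_in_vram / vram_bw_gbs
    cpu_time = model_gb * (1 - frac_in_vram) / ram_bw_gbs
    return 1.0 / (gpu_time + cpu_time)

if __name__ == "__main__":
    for frac in (1.0, 0.8, 0.5):
        print(f"{frac:.0%} in VRAM: ~{offload_tokens_per_s(40, frac):.1f} tok/s")
```

Plug in your own model size and measured bandwidths; the point is that even a modest offloaded fraction drags the whole run toward RAM speed.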

Apple Silicon vs NVIDIA

Unified memory vs discrete GPU. Different architectures, different sweet spots.

AMD vs NVIDIA

ROCm support, driver maturity, price/performance. When AMD makes sense.

Buy vs Build

Prebuilt systems, DIY rigs, used enterprise hardware. Cost vs convenience tradeoffs.

Builds by Use Case

Recommended setups for: hobbyist, serious enthusiast, developer workstation, production server.

Quick Decision Trees

What Hardware Should I Get?

What models do you want to run?
│
├── 7B-13B models (Mistral, Llama 8B)
│   └── Budget?
│       ├── <$500: Used RTX 3080/3090 (12-24GB)
│       ├── <$1000: RTX 4070 Ti Super (16GB) or used 3090
│       └── <$2000: RTX 4090 (24GB) ← best single-card option
│
├── 30B-70B models
│   └── Priority?
│       ├── Speed: 2× RTX 3090 ($1600 used) or 4090 + offload
│       ├── Simplicity: Mac Studio M2 Ultra ($6000)
│       └── Budget: Mac Mini M4 Pro + offload (slow but works)
│
└── 100B+ models
    └── Multi-GPU required
        ├── Consumer: 4× RTX 3090 (custom build, ~$4000)
        └── Pro: A100 80GB or H100 (datacenter pricing)

My Setup Is Slow — Why?

Is the model fully in VRAM?
│
├── No → Memory offloading is the bottleneck
│   ├── Solution: Smaller model, more quantization, or more VRAM
│   └── Offload to RAM adds ~10-50x latency per offloaded layer
│
└── Yes → Check bandwidth and settings
    ├── Are you thermal throttling?
    │   └── Check GPU temp (>83°C = throttling on most cards)
    ├── Is batch size = 1?
    │   └── Single-request inference is bandwidth-bound
    │       └── This is normal; add batching for throughput
    ├── Wrong precision/backend?
    │   └── Ensure you're using optimized kernels (Flash Attention, etc.)
    └── Multi-GPU with PCIe bottleneck?
        └── NVLink helps; PCIe limits scaling
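To check the thermal branch directly, poll temperature and SM clock while the model is generating. A minimal sketch, assuming an NVIDIA card with nvidia-smi on the PATH:

```python
# Sketch: sample GPU temperature, SM clock, and utilization once per second.
# Assumes an NVIDIA GPU and that nvidia-smi is available on PATH.
import subprocess
import time

def gpu_stats():
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=temperature.gpu,clocks.sm,utilization.gpu",
        "--format=csv,noheader,nounits",
    ], text=True)
    temp, clock, util = (int(x) for x in out.strip().splitlines()[0].split(", "))
    return temp, clock, util

if __name__ == "__main__":
    for _ in range(10):  # sample for ~10 seconds while a generation runs
        temp, clock, util = gpu_stats()
        note = "  <- likely throttling" if temp > 83 else ""
        print(f"{temp} C, {clock} MHz SM, {util}% util{note}")
        time.sleep(1)
```

If the SM clock sags as temperature climbs past the low 80s, cooling is your bottleneck, not the model.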

Common Mistakes

Buying for TFLOPS instead of Bandwidth

LLM inference is memory-bound, not compute-bound. A card with lower TFLOPS but higher memory bandwidth will often be faster for inference. Don't optimize for the wrong metric.
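A quick worked example of the bandwidth ceiling, assuming every weight is streamed from memory once per generated token (an idealized model that ignores compute and caches; bandwidth figures are approximate):

```python
def bandwidth_bound_tokens_per_s(model_gb: float, bandwidth_gbs: float) -> float:
    """Upper bound on single-stream decode speed if every weight is
    read from memory once per token (ignores compute and caching)."""
    return bandwidth_gbs / model_gb

if __name__ == "__main__":
    model_gb = 4.0  # e.g. a 7B model at roughly 4-bit
    for card, bw in [("RTX 4090 (~1000 GB/s)", 1000),
                     ("RTX 4060 Ti (~288 GB/s)", 288)]:
        print(f"{card}: ~{bandwidth_bound_tokens_per_s(model_gb, bw):.0f} tok/s ceiling")
```

The 4090's TFLOPS advantage is not what produces that gap; the memory bus is.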

Assuming "It Fits" Means "It's Fast"

A model that barely fits in VRAM leaves no room for KV cache and may require constant swapping. You need headroom. Target 80-90% VRAM utilization, not 100%.
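To see how much headroom the KV cache alone can eat, here is the standard per-token sizing formula with illustrative Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dimension 128, FP16 cache):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache size = 2 (K and V) * layers * kv_heads * head_dim
    * context length * bytes per element."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

if __name__ == "__main__":
    # Illustrative Llama-3-8B-like dimensions
    print(f"8K context:  ~{kv_cache_gb(32, 8, 128, 8192):.2f} GB")
    print(f"32K context: ~{kv_cache_gb(32, 8, 128, 32768):.2f} GB")
```

Long contexts and concurrent requests multiply this, which is exactly the headroom a "just barely fits" setup doesn't have.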

Ignoring Power and Cooling

High-end GPUs pull 300-450W each. Two 3090s need dedicated circuits. Four of them need serious electrical work. Factor this into "total cost of ownership."
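A quick sanity check on wall power, using illustrative wattages and the ~80% continuous-load rule for a North American 15 A / 120 V circuit (your wiring may differ):

```python
def circuit_headroom_w(gpu_watts: float, n_gpus: int,
                       rest_of_system_watts: float = 300,
                       circuit_amps: float = 15, volts: float = 120) -> float:
    """Watts of headroom left on one circuit under the ~80% continuous-load rule.
    All figures are illustrative assumptions."""
    draw = gpu_watts * n_gpus + rest_of_system_watts
    usable = circuit_amps * volts * 0.8
    return usable - draw

if __name__ == "__main__":
    for n in (2, 4):
        left = circuit_headroom_w(gpu_watts=350, n_gpus=n)
        print(f"{n}x 350W GPUs: {left:+.0f} W headroom on a 15A/120V circuit")
```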

Comparing Quantized vs Unquantized Benchmarks

A 70B Q4 model is not the same as a 70B FP16 model. When reading benchmarks or comparisons, check the quantization level. A smaller model at higher precision may outperform a larger model at aggressive quantization.