The most common question in local LLM communities: "Can I run X model on Y hardware?" This guide gives you quick answers and the math to figure it out yourself.

Quick Lookup Table

Models at Q4_K_M quantization with ~4K context:

| VRAM | Example Hardware | Comfortable Fit | Tight Fit |
|------|------------------|-----------------|-----------|
| 8GB | RTX 4060, RTX 3070, M1 8GB | 7B | |
| 12GB | RTX 4070, RTX 3080 12GB | 7-8B | 13B |
| 16GB | RTX 4080, RTX 4070 Ti Super, M1 Pro 16GB | 13B | |
| 24GB | RTX 4090, RTX 3090, M2 Pro 24GB | 13-14B | 34B |
| 32GB | M1/M2 Max 32GB | 34B | |
| 48GB | 2× RTX 3090, RTX A6000 | 34B | 70B |
| 64GB | M2 Max 64GB, M3 Max 64GB | 70B | |
| 96GB | M2 Max 96GB | 70B | |
| 128GB | M2/M3 Ultra 128GB, M3 Max 128GB | 70B + long context | 100B+ |
| 192GB | M2 Ultra 192GB | 100B+ | 405B (Q2-Q3) |
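If you want to check fit from a script, the same table can be expressed as a small lookup structure. A minimal sketch (the tier values are copied from the table above; the names are illustrative, not from any library):

```python
# Comfortable / tight model classes by VRAM tier, copied from the table above.
FIT_BY_VRAM_GB = {
    8:   ("7B", None),
    12:  ("7-8B", "13B"),
    16:  ("13B", None),
    24:  ("13-14B", "34B"),
    32:  ("34B", None),
    48:  ("34B", "70B"),
    64:  ("70B", None),
    96:  ("70B", None),
    128: ("70B + long context", "100B+"),
    192: ("100B+", "405B (Q2-Q3)"),
}

def fit_for(vram_gb: float):
    """Return (comfortable, tight) for the largest tier that vram_gb reaches."""
    tier = max(t for t in FIT_BY_VRAM_GB if t <= vram_gb)
    return FIT_BY_VRAM_GB[tier]

print(fit_for(20))  # a 20GB card falls into the 16GB tier -> ('13B', None)
```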

The Math

To calculate it yourself:

Total VRAM = Model Weights + KV Cache + Overhead

Model Weights (Q4) ≈ Parameters (B) × 0.5 GB
KV Cache ≈ 1-2 GB per 4K context for smaller models
         ≈ 5 GB per 4K context (~10 GB at 8K) for 70B
Overhead ≈ 1-2 GB
Example: Llama 3 70B Q4 at 8K context

  Weights:  70 × 0.5 = 35 GB
  KV Cache: ~10 GB (at 8K)
  Overhead: ~2 GB
  ─────────────────────────
  Total:    ~47 GB → Fits on 48GB (tight) or 64GB (comfortable)
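Here's the same arithmetic as a small Python sketch. The function name and defaults are illustrative; the KV-cache figure is just the per-4K rule of thumb above, passed in explicitly:

```python
def estimate_vram_gb(params_b: float, context_tokens: int = 4096,
                     kv_gb_per_4k: float = 1.0,    # ~1-2 GB for small models, ~5 GB for 70B
                     gb_per_b_param: float = 0.5,  # Q4 rule of thumb
                     overhead_gb: float = 2.0) -> float:
    """Back-of-envelope VRAM estimate: weights + KV cache + overhead."""
    weights_gb = params_b * gb_per_b_param
    kv_cache_gb = kv_gb_per_4k * (context_tokens / 4096)  # assume linear growth with context
    return weights_gb + kv_cache_gb + overhead_gb

# Llama 3 70B Q4 at 8K context: 35 + 10 + 2 ≈ 47 GB
print(estimate_vram_gb(70, context_tokens=8192, kv_gb_per_4k=5.0))
```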

Interactive Decision Tree

What VRAM do you have?
│
├── 8GB or less
│   └── Stick to 7B models (Llama 3 8B, Mistral 7B, Qwen2 7B)
│       Use Q4_K_M or more aggressive quantization
│
├── 12-16GB
│   └── 7B-13B models comfortably
│       Can try 34B with aggressive quantization + short context
│
├── 24GB
│   └── Sweet spot for 13B models with long context
│       Can run 34B at Q4 with moderate context
│       70B possible with heavy quantization + offloading (slow)
│
├── 48GB (multi-GPU or workstation)
│   └── 34B models comfortably with long context
│       70B at Q4 fits (tight but usable)
│
├── 64-96GB (high-end Mac or multi-GPU)
│   └── 70B models comfortably
│       Good context length headroom
│
└── 128GB+
    └── 70B+ with very long context
        Can attempt 100B+ models
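For anyone who prefers code to ASCII art, the same bands as a scriptable function (thresholds and wording simply mirror the tree above):

```python
def recommend(vram_gb: float) -> str:
    """Map a VRAM budget to the model-size bands from the decision tree."""
    if vram_gb <= 8:
        return "7B models (Llama 3 8B, Mistral 7B, Qwen2 7B), Q4_K_M or more aggressive"
    if vram_gb <= 16:
        return "7B-13B comfortably; 34B only with aggressive quantization + short context"
    if vram_gb <= 24:
        return "13B with long context; 34B at Q4; 70B only with heavy quantization/offloading"
    if vram_gb <= 48:
        return "34B comfortably with long context; 70B at Q4 is tight but usable"
    if vram_gb <= 96:
        return "70B comfortably, with context headroom"
    return "70B+ with very long context; 100B+ models become possible"

print(recommend(24))
```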

Popular Models by Size

7-8B Class (Entry Level)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 3.1 8B | ~4.5GB | General chat, coding, instruction following |
| Mistral 7B | ~4GB | Fast, good quality, coding |
| Qwen2 7B | ~4GB | Strong multilingual, coding |
| Gemma 2 9B | ~5GB | Google's efficient model |

13-14B Class (Mid-Range)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 2 13B | ~7.5GB | Legacy, well-tested |
| Qwen2 14B | ~8GB | Strong reasoning, multilingual |

30-34B Class (Enthusiast)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Code Llama 34B | ~19GB | Code generation |
| Mixtral 8x7B | ~26GB | MoE, good general capability |
| Qwen2 32B | ~18GB | Strong all-around |

70B Class (High-End)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 3.1 70B | ~40GB | Near-frontier capability |
| Qwen2 72B | ~41GB | Excellent reasoning |
| Mixtral 8x22B | ~80GB | Large MoE, strong capability |

Context Length Considerations

The tables above assume moderate context (~4-8K). Longer context needs more KV cache:

| Context | Additional VRAM (70B) | Additional VRAM (7B) |
|---------|-----------------------|----------------------|
| 4K (baseline) | ~5GB | ~1GB |
| 8K | ~10GB | ~2GB |
| 16K | ~20GB | ~4GB |
| 32K | ~40GB | ~8GB |
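Extrapolating past the table is just linear scaling plus the weights and overhead from "The Math" above. A quick worked loop for a 70B at Q4, using the table's own ~5 GB per 4K baseline:

```python
# Total VRAM for a 70B at Q4 as context grows (rule-of-thumb numbers from above).
for ctx in (4096, 8192, 16384, 32768):
    kv_gb = 5.0 * ctx / 4096          # ~5 GB of KV cache per 4K tokens for a 70B
    total = 70 * 0.5 + kv_gb + 2.0    # weights + KV cache + overhead
    print(f"{ctx // 1024}K context: ~{total:.0f} GB")

# 4K ≈ 42 GB, 8K ≈ 47 GB, 16K ≈ 57 GB, 32K ≈ 77 GB:
# 16K already overflows a 48GB setup, and 32K overflows 64GB.
```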

Don't Max Out Context Unless Needed

Just because a model supports 128K context doesn't mean you should use it. Every token of context costs memory. Use what you need.

When It Doesn't Fit

Options (Best to Worst)

  1. More aggressive quantization — Q4 instead of Q6, or Q3
  2. Shorter context — Reduce from 8K to 4K
  3. Smaller model — A faster 13B often beats a limping 70B
  4. Partial offloading — Some layers on CPU (slow but works)
  5. Full CPU inference — Last resort, very slow
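As a rough sketch of that ordering, here's a hypothetical helper that tries the first two options (more aggressive quantization, then shorter context) before giving up. The GB-per-billion-parameter figures per quant level are ballpark assumptions, not exact GGUF file sizes:

```python
# Ballpark GB per billion parameters per quant level (assumed; ordered best to worst quality).
GB_PER_B_PARAM = {"Q6": 0.8, "Q4": 0.5, "Q3": 0.4}

def find_fit(params_b: float, vram_gb: float,
             kv_gb_per_4k: float = 1.0, overhead_gb: float = 2.0):
    """Walk quant levels, then context lengths, until the estimate fits in VRAM."""
    for quant, gb_per_b in GB_PER_B_PARAM.items():
        for ctx in (8192, 4096):
            total = params_b * gb_per_b + kv_gb_per_4k * ctx / 4096 + overhead_gb
            if total <= vram_gb:
                return quant, ctx, round(total, 1)
    return None  # still doesn't fit: drop to a smaller model or offload to CPU

print(find_fit(13, vram_gb=12))  # -> ('Q4', 8192, 10.5) on a 12GB card
```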

The Right Mindset

Don't chase the biggest model. A well-tuned 13B that runs smoothly will give you a better experience than a 70B that stutters. Speed matters for usability.