Which Model Fits on My Hardware?
Quick reference for model-to-hardware matching
The most common question in local LLM communities: "Can I run model X on hardware Y?" This guide gives you quick answers and the math to work it out yourself.
Quick Lookup Table
Models at Q4_K_M quantization with ~4K context (note that on Apple Silicon, macOS reserves a share of unified memory for the system, so the GPU cannot use the full amount listed):
| VRAM | Example Hardware | Comfortable Fit | Tight Fit |
|---|---|---|---|
| 8GB | RTX 4060, RTX 3070, M1 8GB | 7B | — |
| 12GB | RTX 4070, RTX 3080 12GB | 7-8B | 13B |
| 16GB | RTX 4080, RTX 4070 Ti Super, M1 Pro 16GB | 13B | — |
| 24GB | RTX 4090, RTX 3090, M2 24GB | 13-14B | 34B |
| 32GB | M1/M2 Max 32GB | 34B | — |
| 48GB | 2× RTX 3090, RTX A6000 | 34B | 70B |
| 64GB | M2 Max 64GB, M3 Max 64GB | 70B | — |
| 96GB | M2 Max 96GB | 70B (comfortable) | — |
| 128GB | M2 Ultra 128GB, M3 Max 128GB | 70B + long context | 100B+ |
| 192GB | M2 Ultra 192GB | 100B+ | 405B (Q2-Q3) |
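If you're not sure which row you fall into, you can query available VRAM directly. A minimal sketch assuming a CUDA-enabled PyTorch install (Apple Silicon users should check total unified memory instead):

```python
import torch

# List each visible CUDA GPU with its total and currently free VRAM.
# Assumes a CUDA build of PyTorch; adjust for your own setup.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        free_b, total_b = torch.cuda.mem_get_info(i)
        print(f"GPU {i} ({props.name}): {total_b / 1e9:.1f} GB total, "
              f"{free_b / 1e9:.1f} GB free")
else:
    print("No CUDA GPU detected; check unified/system memory instead.")
```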
The Math
To calculate yourself:
Total VRAM = Model Weights + KV Cache + Overhead
Model Weights (Q4_K_M) ≈ Parameters (B) × 0.55-0.6 GB
KV Cache ≈ 1-2 GB per 4K context for smaller models
         ≈ 5-10 GB per 8K context for 70B
Overhead ≈ 1-2 GB
Example: Llama 3 70B Q4 at 8K context
Weights:  70 × ~0.57 ≈ 40 GB
KV Cache: ~10 GB (at 8K)
Overhead: ~2 GB
─────────────────────────
Total:    ~52 GB
→ Comfortable on 64GB; tight on 48GB (trim context toward ~4K to make it fit)
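The same arithmetic as a small helper. This is a rough sketch using the rules of thumb above, not an exact calculator; the 0.57 GB-per-billion default and the KV-cache scaling (calibrated so 70B at 8K costs about 10 GB) are assumptions you can adjust:

```python
def estimate_vram_gb(params_b: float, context_tokens: int,
                     gb_per_b_params: float = 0.57,
                     overhead_gb: float = 2.0) -> float:
    """Rough total-VRAM estimate for a Q4-quantized dense model.

    weights  ~ parameter count x 0.55-0.6 GB per billion (Q4_K_M)
    kv cache ~ scaled so 70B at 8K context costs about 10 GB
    overhead ~ 1-2 GB for the runtime, buffers, etc.
    """
    weights_gb = params_b * gb_per_b_params
    kv_cache_gb = 10.0 * (params_b / 70.0) * (context_tokens / 8192)
    return weights_gb + kv_cache_gb + overhead_gb


# Worked example from above: Llama 3 70B Q4 at 8K context -> ~52 GB
print(f"{estimate_vram_gb(70, 8192):.0f} GB")
```

The linear KV-cache scaling with model size is a crude proxy; the architecture-based formula under Context Length Considerations below is more precise.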
Interactive Decision Tree
What VRAM do you have?
│
├── 8GB or less
│   └── Stick to 7B models (Llama 3 8B, Mistral 7B, Qwen2 7B)
│       Use Q4_K_M or more aggressive quantization
│
├── 12-16GB
│   └── 7B-13B models comfortably
│       Can try 34B with aggressive quantization + short context
│
├── 24GB
│   └── Sweet spot for 13B models with long context
│       Can run 34B at Q4 with moderate context
│       70B possible with heavy quantization + offloading (slow)
│
├── 48GB (multi-GPU or workstation)
│   └── 34B models comfortably with long context
│       70B at Q4 fits (tight but usable)
│
├── 64-96GB (high-end Mac or multi-GPU)
│   └── 70B models comfortably
│       Good context length headroom
│
└── 128GB+
    └── 70B+ with very long context
        Can attempt 100B+ models
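The same tree as a lookup function, if you want to drop it into a script. Thresholds and wording mirror the tree above; the function name is just illustrative:

```python
def recommend_model_class(vram_gb: float) -> str:
    """Map available VRAM (GB) to the model class suggested above."""
    if vram_gb <= 8:
        return "7-8B at Q4_K_M or more aggressive quantization"
    if vram_gb <= 16:
        return "7B-13B comfortably; 34B only with aggressive quant and short context"
    if vram_gb <= 24:
        return "13B with long context, or 34B at Q4 with moderate context"
    if vram_gb <= 48:
        return "34B comfortably; 70B at Q4 is tight but usable"
    if vram_gb <= 96:
        return "70B comfortably, with context-length headroom"
    return "70B+ with very long context; 100B+ models become possible"


print(recommend_model_class(24))
```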
Popular Models by Size
7-8B Class (Entry Level)
| Model | Q4 Size | Good For |
|---|---|---|
| Llama 3.1 8B | ~4.5GB | General chat, coding, instruction following |
| Mistral 7B | ~4GB | Fast, good quality, coding |
| Qwen2 7B | ~4GB | Strong multilingual, coding |
| Gemma 2 9B | ~5GB | Google's efficient model |
13-14B Class (Mid-Range)
| Model | Q4 Size | Good For |
|---|---|---|
| Llama 2 13B | ~7.5GB | Legacy, well-tested |
| Qwen2.5 14B | ~8GB | Strong reasoning, multilingual |
30-34B Class (Enthusiast)
| Model | Q4 Size | Good For |
|---|---|---|
| Code Llama 34B | ~19GB | Code generation |
| Mixtral 8x7B | ~26GB | MoE (47B total, ~13B active), good general capability |
| Qwen2.5 32B | ~18GB | Strong all-around |
70B Class (High-End)
| Model | Q4 Size | Good For |
|---|---|---|
| Llama 3.1 70B | ~40GB | Near-frontier capability |
| Qwen2 72B | ~41GB | Excellent reasoning |
| Mixtral 8x22B | ~80GB | Large MoE (141B total, ~39B active), strong capability |
Context Length Considerations
The tables above assume moderate context (~4-8K). Longer context needs more KV cache:
| Context | Approx. KV cache (70B class) | Approx. KV cache (7B class) |
|---|---|---|
| 4K | ~5GB | ~1GB |
| 8K | ~10GB | ~2GB |
| 16K | ~20GB | ~4GB |
| 32K | ~40GB | ~8GB |
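If you want a tighter number than these conservative estimates, KV-cache size follows directly from the model's architecture: two tensors (K and V) per layer, per KV head, per token. A sketch, with Llama-3-70B-style figures (80 layers, 8 KV heads via grouped-query attention, head dimension 128) used purely as illustrative assumptions:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV-cache size in GB: K and V stored per layer, per KV head, per token.

    bytes_per_value=2 assumes an fp16 cache; quantized (8- or 4-bit)
    KV caches reduce this proportionally.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens
    return total_bytes / 1e9


# Illustrative: an 80-layer model with 8 KV heads (GQA) and head_dim 128
# at 8K context needs only ~2.7 GB, well under the conservative table above.
print(f"{kv_cache_gb(80, 8, 128, 8192):.1f} GB")
```

Models without grouped-query attention keep far more KV heads, which is what pushes them toward the higher figures in the table.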
Don't Max Out Context Unless Needed
Just because a model supports 128K context doesn't mean you should run it that way. Most runtimes pre-allocate the KV cache for the configured context window, so every token of that window costs memory whether you use it or not. Set the window to what you actually need.
When It Doesn't Fit
Options (Best to Worst)
- More aggressive quantization — Q4 instead of Q6, or Q3
- Shorter context — Reduce from 8K to 4K
- Smaller model — A faster 13B often beats a limping 70B
- Partial offloading — Some layers on CPU (slow but works; see the sketch after this list)
- Full CPU inference — Last resort, very slow
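For partial offloading, the practical question is how many transformer layers fit in VRAM; you then pass that count to your runtime's GPU-layers setting (for example llama.cpp's -ngl flag). A rough sketch; the even-split assumption and the 3 GB reserve for KV cache and overhead are simplifications:

```python
def gpu_layers_that_fit(model_size_gb: float, n_layers: int,
                        vram_gb: float, reserve_gb: float = 3.0) -> int:
    """Estimate how many layers of a model fit in VRAM.

    Assumes weights are spread evenly across layers and reserves some
    VRAM for the KV cache and runtime overhead.
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Example: a ~40 GB 70B Q4 model with 80 layers on a 24 GB card
print(gpu_layers_that_fit(40.0, 80, 24.0))   # -> 42 of 80 layers on the GPU
```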
The Right Mindset
Don't chase the biggest model. A well-tuned 13B that runs smoothly will give you a better experience than a 70B that stutters. Speed matters for usability.