The most common question in local LLM communities: "Can I run X model on Y hardware?" This guide gives you quick answers and the math to figure it out yourself.

Quick Lookup Table

Models at Q4_K_M quantization with ~4K context:

| VRAM | Example Hardware | Comfortable Fit | Tight Fit |
|------|------------------|-----------------|-----------|
| 8GB | RTX 4060, RTX 3070, M1 8GB | 7B | |
| 12GB | RTX 4070, RTX 3080 12GB | 7-8B | 13B |
| 16GB | RTX 4080, RTX 4070 Ti Super, M1 Pro 16GB | 13B | |
| 24GB | RTX 4090, RTX 3090, M2 Pro 24GB | 13-14B | 34B |
| 32GB | M1/M2 Max 32GB | 34B | |
| 48GB | 2× RTX 3090, RTX A6000 | 34B | 70B |
| 64GB | M2 Max 64GB, M3 Max 64GB | 70B | |
| 96GB | M2 Max 96GB | 70B | |
| 128GB | M2/M3 Ultra 128GB, M3 Max 128GB | 70B + long context | 100B+ |
| 192GB | M2 Ultra 192GB | 100B+ | 405B (Q2-Q3) |
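If you want to check fit from a script, the same table can be expressed as a small lookup structure. A minimal sketch (the tier values are copied from the table above; the names are illustrative, not from any library):

```python
# Comfortable / tight model classes by VRAM tier, copied from the table above.
FIT_BY_VRAM_GB = {
    8:   ("7B", None),
    12:  ("7-8B", "13B"),
    16:  ("13B", None),
    24:  ("13-14B", "34B"),
    32:  ("34B", None),
    48:  ("34B", "70B"),
    64:  ("70B", None),
    96:  ("70B", None),
    128: ("70B + long context", "100B+"),
    192: ("100B+", "405B (Q2-Q3)"),
}

def fit_for(vram_gb: float):
    """Return (comfortable, tight) for the largest tier that vram_gb reaches."""
    tier = max(t for t in FIT_BY_VRAM_GB if t <= vram_gb)
    return FIT_BY_VRAM_GB[tier]

print(fit_for(20))  # a 20GB card falls into the 16GB tier -> ('13B', None)
```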

The Math

To calculate it yourself:

Total VRAM = Model Weights + KV Cache + Overhead

Model Weights (Q4) ≈ Parameters (B) × 0.5 GB
KV Cache ≈ 1-2 GB per 4K context for smaller models
         ≈ 5 GB per 4K context (~10 GB at 8K) for 70B
Overhead ≈ 1-2 GB
Example: Llama 3 70B Q4 at 8K context

  Weights:  70 × 0.5 = 35 GB
  KV Cache: ~10 GB (at 8K)
  Overhead: ~2 GB
  ─────────────────────────
  Total:    ~47 GB → Fits on 48GB (tight) or 64GB (comfortable)
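Here's the same arithmetic as a small Python sketch. The function name and defaults are illustrative; the KV-cache figure is just the per-4K rule of thumb above, passed in explicitly:

```python
def estimate_vram_gb(params_b: float, context_tokens: int = 4096,
                     kv_gb_per_4k: float = 1.0,    # ~1-2 GB for small models, ~5 GB for 70B
                     gb_per_b_param: float = 0.5,  # Q4 rule of thumb
                     overhead_gb: float = 2.0) -> float:
    """Back-of-envelope VRAM estimate: weights + KV cache + overhead."""
    weights_gb = params_b * gb_per_b_param
    kv_cache_gb = kv_gb_per_4k * (context_tokens / 4096)  # assume linear growth with context
    return weights_gb + kv_cache_gb + overhead_gb

# Llama 3 70B Q4 at 8K context: 35 + 10 + 2 ≈ 47 GB
print(estimate_vram_gb(70, context_tokens=8192, kv_gb_per_4k=5.0))
```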

Interactive Decision Tree

What VRAM do you have?
│
├── 8GB or less
│   └── Stick to 7B models (Llama 3 8B, Mistral 7B, Qwen2 7B)
│       Use Q4_K_M or more aggressive quantization
│
├── 12-16GB
│   └── 7B-13B models comfortably
│       Can try 34B with aggressive quantization + short context
│
├── 24GB
│   └── Sweet spot for 13B models with long context
│       Can run 34B at Q4 with moderate context
│       70B possible with heavy quantization + offloading (slow)
│
├── 48GB (multi-GPU or workstation)
│   └── 34B models comfortably with long context
│       70B at Q4 fits (tight but usable)
│
├── 64-96GB (high-end Mac or multi-GPU)
│   └── 70B models comfortably
│       Good context length headroom
│
└── 128GB+
    └── 70B+ with very long context
        Can attempt 100B+ models
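For anyone who prefers code to ASCII art, the same bands as a scriptable function (thresholds and wording simply mirror the tree above):

```python
def recommend(vram_gb: float) -> str:
    """Map a VRAM budget to the model-size bands from the decision tree."""
    if vram_gb <= 8:
        return "7B models (Llama 3 8B, Mistral 7B, Qwen2 7B), Q4_K_M or more aggressive"
    if vram_gb <= 16:
        return "7B-13B comfortably; 34B only with aggressive quantization + short context"
    if vram_gb <= 24:
        return "13B with long context; 34B at Q4; 70B only with heavy quantization/offloading"
    if vram_gb <= 48:
        return "34B comfortably with long context; 70B at Q4 is tight but usable"
    if vram_gb <= 96:
        return "70B comfortably, with context headroom"
    return "70B+ with very long context; 100B+ models become possible"

print(recommend(24))
```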

Popular Models by Size

7-8B Class (Entry Level)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 3.1 8B | ~4.5GB | General chat, coding, instruction following |
| Mistral 7B | ~4GB | Fast, good quality, coding |
| Qwen2 7B | ~4GB | Strong multilingual, coding |
| Gemma 2 9B | ~5GB | Google's efficient model |

13-14B Class (Mid-Range)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 2 13B | ~7.5GB | Legacy, well-tested |
| Qwen2 14B | ~8GB | Strong reasoning, multilingual |

30-34B Class (Enthusiast)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Code Llama 34B | ~19GB | Code generation |
| Mixtral 8x7B | ~26GB | MoE, good general capability |
| Qwen2 32B | ~18GB | Strong all-around |

70B Class (High-End)

| Model | Q4 Size | Good For |
|-------|---------|----------|
| Llama 3.1 70B | ~40GB | Near-frontier capability |
| Qwen2 72B | ~41GB | Excellent reasoning |
| Mixtral 8x22B | ~80GB | Large MoE, strong capability |

Context Length Considerations

The tables above assume moderate context (~4-8K). Longer context needs more KV cache:

| Context | Additional VRAM (70B) | Additional VRAM (7B) |
|---------|-----------------------|----------------------|
| 4K (baseline) | ~5GB | ~1GB |
| 8K | ~10GB | ~2GB |
| 16K | ~20GB | ~4GB |
| 32K | ~40GB | ~8GB |
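Extrapolating past the table is just linear scaling plus the weights and overhead from "The Math" above. A quick worked loop for a 70B at Q4, using the table's own ~5 GB per 4K baseline:

```python
# Total VRAM for a 70B at Q4 as context grows (rule-of-thumb numbers from above).
for ctx in (4096, 8192, 16384, 32768):
    kv_gb = 5.0 * ctx / 4096          # ~5 GB of KV cache per 4K tokens for a 70B
    total = 70 * 0.5 + kv_gb + 2.0    # weights + KV cache + overhead
    print(f"{ctx // 1024}K context: ~{total:.0f} GB")

# 4K ≈ 42 GB, 8K ≈ 47 GB, 16K ≈ 57 GB, 32K ≈ 77 GB:
# 16K already overflows a 48GB setup, and 32K overflows 64GB.
```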

Don't Max Out Context Unless Needed

Just because a model supports 128K context doesn't mean you should use it. Every token of context costs memory. Use what you need.

When It Doesn't Fit

Options (Best to Worst)

  1. More aggressive quantization — Q4 instead of Q6, or Q3
  2. Shorter context — Reduce from 8K to 4K
  3. Smaller model — A faster 13B often beats a limping 70B
  4. Partial offloading — Some layers on CPU (slow but works)
  5. Full CPU inference — Last resort, very slow
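As a rough sketch of that ordering, here's a hypothetical helper that tries the first two options (more aggressive quantization, then shorter context) before giving up. The GB-per-billion-parameter figures per quant level are ballpark assumptions, not exact GGUF file sizes:

```python
# Ballpark GB per billion parameters per quant level (assumed; ordered best to worst quality).
GB_PER_B_PARAM = {"Q6": 0.8, "Q4": 0.5, "Q3": 0.4}

def find_fit(params_b: float, vram_gb: float,
             kv_gb_per_4k: float = 1.0, overhead_gb: float = 2.0):
    """Walk quant levels, then context lengths, until the estimate fits in VRAM."""
    for quant, gb_per_b in GB_PER_B_PARAM.items():
        for ctx in (8192, 4096):
            total = params_b * gb_per_b + kv_gb_per_4k * ctx / 4096 + overhead_gb
            if total <= vram_gb:
                return quant, ctx, round(total, 1)
    return None  # still doesn't fit: drop to a smaller model or offload to CPU

print(find_fit(13, vram_gb=12))  # -> ('Q4', 8192, 10.5) on a 12GB card
```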

The Right Mindset

Don't chase the biggest model. A well-tuned 13B that runs smoothly will give you a better experience than a 70B that stutters. Speed matters for usability.