Before diving into hardware choices or software stacks, you need to understand the fundamental concepts that determine how LLMs work and what resources they require. These concepts will come up repeatedly when evaluating hardware, understanding performance bottlenecks, or troubleshooting issues.

What is an LLM?

The basics: what these models are, how they generate text, and why they need so much compute.

Parameters & Scale

What "7B" or "70B" means, how parameter count relates to capability, and the memory math.

Dense vs MoE

Why Mixture-of-Experts (MoE) models combine huge parameter counts with much lower per-token compute.

Quantization

Trading precision for size: how to fit larger models on smaller hardware, and what you lose.

Context Window

How much text the model can "see" at once, and why longer isn't always better.

KV Cache

The memory structure that grows during generation — often the thing that actually limits your context length.

Prefill vs Decode

The two phases of inference, their different bottlenecks, and why this distinction matters for performance.

Tokens & Tokenization

Why models don't see text the way you do, and how this affects everything from cost to context limits.

Key Relationships

These concepts don't exist in isolation. Understanding how they connect helps you reason about system behavior:

Parameters (model size)
│
├──► VRAM for weights (static)
│
└──► Compute per token

Context Window
│
├──► KV Cache size (dynamic, grows with context)
│
└──► VRAM pressure during generation

Quantization
│
├──► Reduces weight memory (more model in same VRAM)
│
├──► Reduces memory bandwidth needs (faster decode)
│
└──► May reduce quality (task-dependent)
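
The bandwidth branch is worth making concrete: at batch size 1, decode is memory-bound, so a rough ceiling on generation speed is memory bandwidth divided by the bytes that must be read per token (essentially the full set of weights plus the KV cache). Here is a minimal sketch of that rule of thumb; the bandwidth and size figures are illustrative assumptions, not measurements:

```python
# Rough decode-speed ceiling: each generated token reads all weights
# (plus the KV cache) from VRAM once, so bandwidth sets the limit.
# All numbers below are illustrative assumptions.

bandwidth_gb_per_s = 1000      # ~1 TB/s class GPU (assumption)
weights_gb = 35                # 70B model at 4-bit
kv_cache_gb = 8                # at 8K context (see Memory Math below)

ceiling = bandwidth_gb_per_s / (weights_gb + kv_cache_gb)
print(f"decode ceiling: ~{ceiling:.0f} tokens/sec")   # ~23 tokens/sec
```

Halving the weight bytes roughly doubles this ceiling, which is why quantization doesn't just fit more model into VRAM; it also makes decode faster.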

The Critical Insight

A model "fitting" in VRAM doesn't mean it's usable. You need room for the KV cache, which grows with context length. A 70B model that technically fits in 48GB VRAM might only support 2K context because there's no room left for the cache.

Memory Math

Quick rules for estimating VRAM requirements:

Component         Formula                                      Example (70B, 4-bit, 8K context)
Model weights     params × bytes_per_param                     70B × 0.5 = 35GB
KV cache          2 × layers × hidden_dim × context × bytes    ~8GB at 8K context
Working memory    ~1-2GB overhead                              ~2GB
Total                                                          ~45GB

This is why a model that "needs 35GB" for weights alone actually requires more like 45GB+ for practical use.
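
These rules translate directly into a small estimator. A minimal sketch under the table's simplifications; the function name and defaults are my own. Note that the KV-cache formula as written assumes the full hidden dimension is cached per layer; models with grouped-query attention cache proportionally less, which the gqa_factor parameter accounts for:

```python
def estimate_vram_gb(params_b, bytes_per_param, layers, hidden_dim,
                     context, kv_bytes=2, gqa_factor=1, overhead_gb=2.0):
    """Rough VRAM estimate (GB) using the rules above.

    gqa_factor = attention_heads / kv_heads; 1 means full
    multi-head attention (no grouped-query savings).
    """
    weights = params_b * bytes_per_param        # params_b in billions -> GB
    kv_cache = 2 * layers * hidden_dim * context * kv_bytes / gqa_factor / 1e9
    return {"weights_gb": weights,
            "kv_cache_gb": kv_cache,
            "overhead_gb": overhead_gb,
            "total_gb": weights + kv_cache + overhead_gb}

# 70B at 4-bit with 8K context; layer/width/GQA numbers are assumptions.
print(estimate_vram_gb(params_b=70, bytes_per_param=0.5, layers=80,
                       hidden_dim=8192, context=8192, gqa_factor=8))
```

Swapping in your model's real layer count, width, and KV-head ratio from its config file gives a usable first-pass estimate before you download anything.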