Fundamentals
Core concepts for understanding local LLM systems
Before diving into hardware choices or software stacks, you need to understand the fundamental concepts that determine how LLMs work and what resources they require. These concepts will come up repeatedly when evaluating hardware, understanding performance bottlenecks, or troubleshooting issues.
What is an LLM
The basics: what these models are, how they generate text, and why they need so much compute.
Parameters & Scale
What "7B" or "70B" means, how parameter count relates to capability, and the memory math.
Dense vs MoE
Why Mixture of Experts models have huge parameter counts but smaller compute requirements.
Quantization
Trading precision for size: how to fit larger models on smaller hardware, and what you lose.
Context Window
How much text the model can "see" at once, and why longer isn't always better.
KV Cache
The memory structure that grows during generation — often the thing that actually limits your context length.
Prefill vs Decode
The two phases of inference, their different bottlenecks, and why this distinction matters for performance.
Tokens & Tokenization
Why models don't see text the way you do, and how this affects everything from cost to context limits.
Key Relationships
These concepts don't exist in isolation; understanding how they connect helps you reason about system behavior.
The Critical Insight
A model "fitting" in VRAM doesn't mean it's usable. You need room for the KV cache, which grows with context length. A 70B model that technically fits in 48GB VRAM might only support 2K context because there's no room left for the cache.
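The headroom arithmetic above can be sketched as a quick estimator. The layer count, hidden size, and fp16 multi-head-attention cache here are illustrative assumptions, not figures from this guide; grouped-query attention or a quantized cache shrinks the per-token cost considerably:

```python
def max_context_tokens(vram_gb, weights_gb, kv_bytes_per_token, overhead_gb=2.0):
    """Rough upper bound on context length once weights and overhead are resident."""
    free_bytes = (vram_gb - weights_gb - overhead_gb) * 1e9
    return max(0, int(free_bytes / kv_bytes_per_token))

# Hypothetical 70B dense model, fp16 KV cache, standard multi-head attention:
# per-token cost = 2 (K and V) x layers x hidden_dim x 2 bytes
layers, hidden_dim = 80, 8192                 # assumed architecture, for illustration
kv_per_token = 2 * layers * hidden_dim * 2    # ~2.6 MB per token

print(max_context_tokens(48, 35, kv_per_token))  # only a few thousand tokens fit
```

Swapping in a grouped-query model or an 8-bit cache changes `kv_per_token` by an order of magnitude, which is why two models of the same parameter count can support very different context lengths in the same VRAM.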
Memory Math
Quick rules for estimating VRAM requirements:
| Component | Formula | Example (70B, 4-bit, 8K context) |
|---|---|---|
| Model weights | params × bytes_per_param | 70B × 0.5 bytes = 35GB |
| KV cache | 2 × layers × hidden_dim × context × bytes_per_elem | ~8GB at 8K context |
| Working memory | ~1-2GB overhead | ~2GB |
| Total | sum of the above | ~45GB |
This is why a model that "needs 35GB" for weights alone actually requires more like 45GB+ for practical use. (The KV cache row is the worst case for standard multi-head attention; models using grouped-query attention, or a quantized cache, shrink it substantially.)
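The table's arithmetic is simple enough to fold into a minimal sketch. The function name and defaults are assumptions for illustration; the inputs mirror the table's 70B, 4-bit, 8K-context example:

```python
def vram_estimate_gb(params_b, bytes_per_param, kv_cache_gb, overhead_gb=2.0):
    """Estimate total VRAM: weights + KV cache + fixed working-memory overhead.

    params_b is the parameter count in billions, so params_b * bytes_per_param
    gives the weight footprint in GB directly.
    """
    weights_gb = params_b * bytes_per_param
    return weights_gb + kv_cache_gb + overhead_gb

# 70B model at 4-bit (0.5 bytes/param), ~8GB KV cache at 8K context:
print(vram_estimate_gb(70, 0.5, 8))  # 35 + 8 + 2 = 45.0
```

Running the same function with a larger `kv_cache_gb` (i.e., a longer context) is the quickest way to see when a model stops fitting on a given card.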