Parameters & Scale
What "7B" and "70B" actually mean
When people say a model has "7 billion parameters," they're describing its size. Parameters are the learned values (weights) that encode what the model knows. A larger parameter count generally means more capability, and also steeper hardware requirements.
What Are Parameters?
Parameters are the numbers that make up the neural network. During training, these values are adjusted to minimize prediction errors. After training, they're fixed — the model's "knowledge" is encoded in these billions of numbers.
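Concretely, "parameter count" is just the total number of entries across all of a model's weight and bias tensors. A minimal sketch of counting them with PyTorch, using a tiny stand-in network rather than a real LLM:

```python
import torch.nn as nn

# A tiny stand-in network; real LLMs are deep stacks of attention and MLP
# blocks, but the counting works the same way.
model = nn.Sequential(
    nn.Embedding(32000, 4096),   # token embeddings
    nn.Linear(4096, 11008),      # one MLP up-projection
    nn.Linear(11008, 4096),      # one MLP down-projection
)

# Sum the number of elements in every weight and bias tensor.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters (~{n_params / 1e9:.2f}B)")
```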
Parameter Count by Model
| Model | Parameters | FP16 Size | Q4 Size | Typical Use |
|---|---|---|---|---|
| Phi-3 Mini | 3.8B | ~7.6GB | ~2.2GB | Mobile, edge devices |
| Llama 3 8B | 8B | ~16GB | ~4.5GB | Consumer GPU sweet spot |
| Mistral 7B | 7B | ~14GB | ~4GB | Consumer GPU sweet spot |
| Llama 2 13B | 13B | ~26GB | ~7.5GB | 16-24GB VRAM |
| CodeLlama 34B | 34B | ~68GB | ~19GB | 24-48GB VRAM |
| Llama 3 70B | 70B | ~140GB | ~40GB | 48GB+ or multi-GPU |
| Llama 3 405B | 405B | ~810GB | ~230GB | Multi-node clusters |
The Memory Math
Converting parameters to memory is simple multiplication: parameter count times bytes per parameter. FP16/BF16 weights use 2 bytes per parameter, 8-bit quantization uses about 1 byte, and 4-bit quantization uses roughly 0.5-0.6 bytes (quantized formats carry some per-block overhead). That's where the sizes in the table above come from: an 8B model is about 16GB at FP16 and about 4.5GB at Q4.
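A minimal sketch of that arithmetic; the bytes-per-parameter figures are rules of thumb, not exact file-format sizes:

```python
# Rough weight-memory estimate: params * bytes per parameter.
# Real files vary slightly because quantized formats store
# per-block scales and metadata alongside the weights.
BYTES_PER_PARAM = {
    "fp16": 2.0,
    "q8":   1.0,
    "q4":   0.56,  # ~4.5 bits/param for typical 4-bit formats
}

def weight_memory_gb(params_billions: float, fmt: str = "q4") -> float:
    """Estimate weight memory in GB for a given parameter count and format."""
    total_bytes = params_billions * 1e9 * BYTES_PER_PARAM[fmt]
    return total_bytes / 1e9  # decimal GB, matching the table above

for name, b in [("Llama 3 8B", 8), ("Llama 3 70B", 70)]:
    print(f"{name}: fp16 ~{weight_memory_gb(b, 'fp16'):.1f} GB, "
          f"q4 ~{weight_memory_gb(b, 'q4'):.1f} GB")
```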
This Is Just Weights
The weight memory is the minimum. You also need memory for KV cache (which grows with context length), activations, and overhead. A 35GB model might need 45-50GB to actually run at useful context lengths.
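As a rough illustration of the KV-cache term, here is a sketch using dimensions similar to Llama 3 8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128); these values are assumptions for illustration, and the real figure depends on the model and the cache precision:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Estimate KV-cache memory in GB for a single sequence."""
    # 2x for keys and values, per layer, per KV head, per head dim, per token.
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

# Llama-3-8B-like dimensions (assumed): roughly 1GB of cache at 8K context.
print(kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192))
```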
Does More Parameters = Better?
Generally yes, but with diminishing returns and important caveats:
More Parameters Helps
- More knowledge capacity
- Better reasoning (usually)
- Better instruction following
- Fewer hallucinations (sometimes)
- Better at complex tasks
But It's Not Everything
- Training data quality matters more
- Architecture improvements help
- Fine-tuning can beat raw scale
- Task-specific models can be better
- Quantization affects realized quality
A well-trained 7B model often beats a poorly trained 13B model. And a heavily quantized 70B model might underperform a full-precision 13B model on some tasks.
Dense vs Mixture of Experts (MoE)
Parameter counts can be misleading for Mixture of Experts models. An MoE model keeps all of its parameters in memory but routes each token through only a subset of them (the "active" parameters). This means:
- Memory requirements based on total parameters
- Compute/speed based on active parameters
- A ~47B MoE such as Mixtral 8x7B (roughly 13B active parameters per token) can generate faster than a 47B dense model
- But it still needs memory for all ~47B parameters (see the sketch below)
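A minimal sketch of the total-vs-active arithmetic, using Mixtral-8x7B-style routing (8 experts per layer, 2 active per token); the split between shared and per-expert parameters below is illustrative, not the real breakdown:

```python
def moe_params(shared_b: float, per_expert_b: float,
               n_experts: int, active_experts: int) -> tuple[float, float]:
    """Return (total, active) parameter counts in billions for a simple MoE."""
    total = shared_b + n_experts * per_expert_b       # what you must hold in memory
    active = shared_b + active_experts * per_expert_b  # what each token actually uses
    return total, active

# Mixtral-8x7B-style routing; shared/per-expert split is an assumption.
total, active = moe_params(shared_b=2.0, per_expert_b=5.6,
                           n_experts=8, active_experts=2)
print(f"total ~{total:.0f}B (sets memory), active ~{active:.0f}B (sets speed)")
```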
Scaling Laws
Research has shown predictable relationships between scale and performance:
Chinchilla Scaling
For optimal training efficiency, models should be trained on roughly 20× as many tokens as they have parameters. A 7B model should see ~140B tokens. Many early models were "undertrained" by this metric.
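A quick sketch of that heuristic, paired with the common C ≈ 6·N·D approximation for training compute; both are rough rules of thumb rather than exact figures for any particular model:

```python
def chinchilla_tokens(params_billions: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal training tokens (billions) under the ~20 tokens/param heuristic."""
    return params_billions * tokens_per_param

def training_flops(params_billions: float, tokens_billions: float) -> float:
    """Approximate training compute using C ~= 6 * N * D floating-point operations."""
    return 6 * params_billions * 1e9 * tokens_billions * 1e9

tokens = chinchilla_tokens(7)       # ~140B tokens for a 7B model
flops = training_flops(7, tokens)   # ~5.9e21 FLOPs
print(f"~{tokens:.0f}B tokens, ~{flops:.1e} FLOPs")
```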
Rough capability scaling:
- 1-3B: Basic tasks, simple Q&A, limited reasoning
- 7-8B: Good general capability, useful for most tasks
- 13-14B: Strong capability, good reasoning
- 30-34B: Excellent capability, complex reasoning
- 65-70B: Near-frontier capability
- 100B+: Frontier models, best overall performance
Practical Implications
| Parameter Range | Minimum VRAM (Q4) | Comfortable VRAM | Consumer Hardware |
|---|---|---|---|
| 1-3B | 2GB | 4GB | Any modern GPU |
| 7-8B | 4GB | 8GB | RTX 3060+, M1+ |
| 13-14B | 8GB | 12GB | RTX 3080+, M1 Pro+ |
| 30-34B | 18GB | 24GB | RTX 3090/4090, M2 Max+ |
| 65-70B | 40GB | 48GB+ | Multi-GPU or Mac Studio |