When people say a model has "7 billion parameters," they're describing its size. Parameters are the learned values (weights) that define what the model knows. More parameters generally means more capability — and more hardware requirements.

What Are Parameters?

Parameters are the numbers that make up the neural network. During training, these values are adjusted to minimize prediction errors. After training, they're fixed — the model's "knowledge" is encoded in these billions of numbers.

A single layer might look like:

Input (4096 dimensions)
        │
        ▼
┌─────────────────────────────┐
│ Weight Matrix: 4096 × 4096  │  = 16.7 million parameters
│        (one layer!)         │
└─────────────────────────────┘
        │
        ▼
Output (4096 dimensions)

A 7B model has ~32 such layers, plus embeddings and other components.
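As a sanity check, here is a small Python sketch that estimates a parameter count from width, depth, and vocabulary size. The 4096/32/32k figures are illustrative assumptions; real architectures add norms, biases, and MLP variants (like SwiGLU) that shift the totals somewhat.

# Rough parameter estimate for a decoder-only transformer (an approximation).
def estimate_params(d_model, n_layers, vocab_size, d_ff=None):
    d_ff = d_ff or 4 * d_model            # feed-forward width, commonly ~4x d_model
    attention = 4 * d_model * d_model     # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff              # up- and down-projections
    embeddings = vocab_size * d_model     # token embedding table
    return n_layers * (attention + mlp) + embeddings

# A 7B-class model: 4096-wide, 32 layers, ~32k vocabulary
print(estimate_params(4096, 32, 32_000) / 1e9)  # ~6.6 billion, in the right ballpark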

Parameter Count by Model

Model           Parameters   FP16 Size   Q4 Size   Typical Use
Phi-3 Mini      3.8B         ~7.6GB      ~2.2GB    Mobile, edge devices
Llama 3 8B      8B           ~16GB       ~4.5GB    Consumer GPU sweet spot
Mistral 7B      7B           ~14GB       ~4GB      Consumer GPU sweet spot
Llama 2 13B     13B          ~26GB       ~7.5GB    16-24GB VRAM
CodeLlama 34B   34B          ~68GB       ~19GB     24-48GB VRAM
Llama 3 70B     70B          ~140GB      ~40GB     48GB+ or multi-GPU
Llama 3 405B    405B         ~810GB      ~230GB    Multi-node clusters

The Memory Math

Converting parameters to memory requirements:

Memory = Parameters × Bytes per Parameter

FP32 (full precision):  1 param = 4 bytes
FP16 (half precision):  1 param = 2 bytes
INT8 (8-bit quant):     1 param = 1 byte
INT4 (4-bit quant):     1 param = 0.5 bytes

Example: 70B model
├── FP16: 70B × 2   = 140 GB
├── INT8: 70B × 1   = 70 GB
└── INT4: 70B × 0.5 = 35 GB
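The same arithmetic in a few lines of Python (a minimal sketch; the bytes-per-parameter values and the 10^9-bytes-per-GB convention match the rough figures above):

# Weight memory for a given parameter count and precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion, precision):
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    print(f"70B @ {precision}: ~{weight_memory_gb(70, precision):.0f} GB")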

This Is Just Weights

The weight memory is the minimum. You also need memory for KV cache (which grows with context length), activations, and overhead. A 35GB model might need 45-50GB to actually run at useful context lengths.
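To get a feel for how the KV cache grows with context, here is a hedged back-of-the-envelope estimate. It assumes a 70B-class layout (80 layers, 8192-wide keys and values) with an FP16 cache; these are assumptions, and modern 70B models using grouped-query attention store considerably less.

# Back-of-the-envelope KV cache size under the assumptions above.
def kv_cache_gb(context_len, n_layers=80, kv_dim=8192, bytes_per_value=2):
    # one key vector and one value vector per token, per layer
    return 2 * n_layers * kv_dim * context_len * bytes_per_value / 1e9

print(f"8k context: ~{kv_cache_gb(8192):.0f} GB of KV cache")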

Does More Parameters = Better?

Generally yes, but with diminishing returns and important caveats:

More Parameters Helps

  • More knowledge capacity
  • Better reasoning (usually)
  • Better instruction following
  • Fewer hallucinations (sometimes)
  • Better at complex tasks

But It's Not Everything

  • Training data quality matters more
  • Architecture improvements help
  • Fine-tuning can beat raw scale
  • Task-specific models can be better
  • Quantization affects realized quality

A well-trained 7B model often beats a poorly trained 13B model. And a heavily quantized 70B model might underperform a full-precision 13B model on some tasks.

Dense vs Mixture of Experts (MoE)

Parameter counts can be misleading for Mixture of Experts models:

Dense Model (e.g., Llama 70B):
├── 70B parameters
├── ALL parameters used for every token
└── Memory needed: ~70B parameters worth

MoE Model (e.g., Mixtral 8x7B):
├── 47B total parameters
├── Only ~13B active per token (2 of 8 experts)
├── Memory needed: full 47B (all experts loaded)
└── Compute needed: only 13B worth

MoE models need memory for all parameters but only compute with a subset. In practice, a Mixtral-style model needs the VRAM of a ~47B model but generates tokens at roughly the speed of a ~13B dense model.
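A small sketch makes the total-vs-active distinction concrete. The dimensions below are assumptions chosen to land near Mixtral's published 47B/13B figures, not its exact configuration.

# Total vs. active parameters for one MoE feed-forward layer (illustrative).
def moe_layer_params(d_model, d_ff, n_experts, experts_per_token):
    expert = 3 * d_model * d_ff                    # gate, up, and down projections
    return n_experts * expert, experts_per_token * expert

total, active = moe_layer_params(4096, 14336, 8, 2)
print(f"per layer: {total / 1e9:.2f}B total, {active / 1e9:.2f}B active")
# Across 32 layers (plus attention and embeddings) this lands near 47B total / 13B active.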

Scaling Laws

Research has shown predictable relationships between scale and performance:

Chinchilla Scaling

For optimal training efficiency, models should be trained on roughly 20× as many tokens as they have parameters. A 7B model should see ~140B tokens. Many early models were "undertrained" by this metric.
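In code, the rule of thumb is a one-liner:

# Chinchilla rule of thumb: optimal training tokens ~ 20x parameter count.
for params_b in (1, 7, 13, 70, 405):
    print(f"{params_b}B params -> ~{20 * params_b}B training tokens")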

As a rough rule, capability improves smoothly as parameters and training tokens grow together, but with the diminishing returns noted above.

Practical Implications

Parameter Range   Minimum VRAM (Q4)   Comfortable VRAM   Consumer Hardware
1-3B              2GB                 4GB                Any modern GPU
7-8B              4GB                 8GB                RTX 3060+, M1+
13-14B            8GB                 12GB               RTX 3080+, M1 Pro+
30-34B            18GB                24GB               RTX 3090/4090, M2 Max+
65-70B            35GB                48GB+              Multi-GPU or Mac Studio
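As a final sketch, here is a hedged helper that checks whether a Q4-quantized model plausibly fits a given VRAM budget. The 25% headroom factor for KV cache and overhead is an assumption, in line with the overhead discussed earlier, not a hard rule.

# Does a Q4-quantized model plausibly fit a VRAM budget?
def fits_in_vram(params_billion, vram_gb, headroom=1.25):
    weights_gb = params_billion * 0.5   # Q4: roughly half a byte per parameter
    return weights_gb * headroom <= vram_gb

for size in (3, 8, 13, 34, 70):
    print(f"{size}B on a 24GB card: {'fits' if fits_in_vram(size, 24) else 'too big'}")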