When people say a model has "7 billion parameters," they're describing its size. Parameters are the learned values (weights) that define what the model knows. More parameters generally means more capability — and more hardware requirements.

What Are Parameters?

Parameters are the numbers that make up the neural network. During training, these values are adjusted to minimize prediction errors. After training, they're fixed — the model's "knowledge" is encoded in these billions of numbers.

A single layer might look like:

Input (4096 dimensions)
        │
        ▼
┌─────────────────────────────┐
│ Weight Matrix: 4096 × 4096  │  = 16.7 million parameters
│        (one layer!)         │
└─────────────────────────────┘
        │
        ▼
Output (4096 dimensions)

A 7B model has ~32 such layers, plus embeddings and other components.
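As a sanity check, here is a small Python sketch that estimates a parameter count from width, depth, and vocabulary size. The 4096/32/32k figures are illustrative assumptions; real architectures add norms, biases, and MLP variants (like SwiGLU) that shift the totals somewhat.

# Rough parameter estimate for a decoder-only transformer (an approximation).
def estimate_params(d_model, n_layers, vocab_size, d_ff=None):
    d_ff = d_ff or 4 * d_model            # feed-forward width, commonly ~4x d_model
    attention = 4 * d_model * d_model     # Q, K, V, and output projections
    mlp = 2 * d_model * d_ff              # up- and down-projections
    embeddings = vocab_size * d_model     # token embedding table
    return n_layers * (attention + mlp) + embeddings

# A 7B-class model: 4096-wide, 32 layers, ~32k vocabulary
print(estimate_params(4096, 32, 32_000) / 1e9)  # ~6.6 billion, in the right ballpark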

Parameter Count by Model

Model           Parameters   FP16 Size   Q4 Size   Typical Use
Phi-3 Mini      3.8B         ~7.6GB      ~2.2GB    Mobile, edge devices
Llama 3 8B      8B           ~16GB       ~4.5GB    Consumer GPU sweet spot
Mistral 7B      7B           ~14GB       ~4GB      Consumer GPU sweet spot
Llama 2 13B     13B          ~26GB       ~7.5GB    16-24GB VRAM
CodeLlama 34B   34B          ~68GB       ~19GB     24-48GB VRAM
Llama 3 70B     70B          ~140GB      ~40GB     48GB+ or multi-GPU
Llama 3 405B    405B         ~810GB      ~230GB    Multi-node clusters

The Memory Math

Converting parameters to memory requirements:

Memory = Parameters × Bytes per Parameter

FP32 (full precision):  1 param = 4 bytes
FP16 (half precision):  1 param = 2 bytes
INT8 (8-bit quant):     1 param = 1 byte
INT4 (4-bit quant):     1 param = 0.5 bytes

Example: 70B model
├── FP16: 70B × 2   = 140 GB
├── INT8: 70B × 1   = 70 GB
└── INT4: 70B × 0.5 = 35 GB
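The same arithmetic in a few lines of Python (a minimal sketch; the bytes-per-parameter values and the 10^9-bytes-per-GB convention match the rough figures above):

# Weight memory for a given parameter count and precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion, precision):
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for precision in ("fp16", "int8", "int4"):
    print(f"70B @ {precision}: ~{weight_memory_gb(70, precision):.0f} GB")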

This Is Just Weights

The weight memory is the minimum. You also need memory for KV cache (which grows with context length), activations, and overhead. A 35GB model might need 45-50GB to actually run at useful context lengths.
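To get a feel for how the KV cache grows with context, here is a hedged back-of-the-envelope estimate. It assumes a 70B-class layout (80 layers, 8192-wide keys and values) with an FP16 cache; these are assumptions, and modern 70B models using grouped-query attention store considerably less.

# Back-of-the-envelope KV cache size under the assumptions above.
def kv_cache_gb(context_len, n_layers=80, kv_dim=8192, bytes_per_value=2):
    # one key vector and one value vector per token, per layer
    return 2 * n_layers * kv_dim * context_len * bytes_per_value / 1e9

print(f"8k context: ~{kv_cache_gb(8192):.0f} GB of KV cache")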

Does More Parameters = Better?

Generally yes, but with diminishing returns and important caveats:

More Parameters Helps

  • More knowledge capacity
  • Better reasoning (usually)
  • Better instruction following
  • Fewer hallucinations (sometimes)
  • Better at complex tasks

But It's Not Everything

  • Training data quality matters more
  • Architecture improvements help
  • Fine-tuning can beat raw scale
  • Task-specific models can be better
  • Quantization affects realized quality

A well-trained 7B model often beats a poorly trained 13B model. And a heavily quantized 70B model might underperform a full-precision 13B model on some tasks.

Dense vs Mixture of Experts (MoE)

Parameter counts can be misleading for Mixture of Experts models:

Dense Model (e.g., Llama 70B):
├── 70B parameters
├── ALL parameters used for every token
└── Memory needed: ~70B parameters worth

MoE Model (e.g., Mixtral 8x7B):
├── 47B total parameters
├── Only ~13B active per token (2 of 8 experts)
├── Memory needed: full 47B (all experts loaded)
└── Compute needed: only 13B worth

MoE models need memory for all parameters but only compute with a subset. In practice, a Mixtral-style model needs the VRAM of a ~47B model but generates tokens at roughly the speed of a ~13B dense model.
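A small sketch makes the total-vs-active distinction concrete. The dimensions below are assumptions chosen to land near Mixtral's published 47B/13B figures, not its exact configuration.

# Total vs. active parameters for one MoE feed-forward layer (illustrative).
def moe_layer_params(d_model, d_ff, n_experts, experts_per_token):
    expert = 3 * d_model * d_ff                    # gate, up, and down projections
    return n_experts * expert, experts_per_token * expert

total, active = moe_layer_params(4096, 14336, 8, 2)
print(f"per layer: {total / 1e9:.2f}B total, {active / 1e9:.2f}B active")
# Across 32 layers (plus attention and embeddings) this lands near 47B total / 13B active.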

Scaling Laws

Research has shown predictable relationships between scale and performance:

Chinchilla Scaling

For optimal training efficiency, models should be trained on roughly 20× as many tokens as they have parameters. A 7B model should see ~140B tokens. Many early models were "undertrained" by this metric.
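In code, the rule of thumb is a one-liner:

# Chinchilla rule of thumb: optimal training tokens ~ 20x parameter count.
for params_b in (1, 7, 13, 70, 405):
    print(f"{params_b}B params -> ~{20 * params_b}B training tokens")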

As a rough rule, capability improves smoothly as parameters and training tokens grow together, but with the diminishing returns noted above.

Practical Implications

Parameter Range   Minimum VRAM (Q4)   Comfortable VRAM   Consumer Hardware
1-3B              2GB                 4GB                Any modern GPU
7-8B              4GB                 8GB                RTX 3060+, M1+
13-14B            8GB                 12GB               RTX 3080+, M1 Pro+
30-34B            18GB                24GB               RTX 3090/4090, M2 Max+
65-70B            35GB                48GB+              Multi-GPU or Mac Studio
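As a final sketch, here is a hedged helper that checks whether a Q4-quantized model plausibly fits a given VRAM budget. The 25% headroom factor for KV cache and overhead is an assumption, in line with the overhead discussed earlier, not a hard rule.

# Does a Q4-quantized model plausibly fit a VRAM budget?
def fits_in_vram(params_billion, vram_gb, headroom=1.25):
    weights_gb = params_billion * 0.5   # Q4: roughly half a byte per parameter
    return weights_gb * headroom <= vram_gb

for size in (3, 8, 13, 34, 70):
    print(f"{size}B on a 24GB card: {'fits' if fits_in_vram(size, 24) else 'too big'}")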