Model Formats
GGUF, safetensors, GPTQ, and when to use each
Different inference engines expect different model formats. Choosing the right format depends on your hardware, software stack, and quantization needs.
Format Overview
| Format | Used By | Quantization | Hardware |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Built-in (Q2-Q8) | CPU, Apple Silicon, NVIDIA, AMD |
| safetensors | Transformers, vLLM | Usually none (FP16/BF16) | CPU or any GPU |
| GPTQ | AutoGPTQ, ExLlama | 4-bit, calibration-based | NVIDIA |
| AWQ | vLLM, TGI | 4-bit, activation-aware | NVIDIA |
| EXL2 | ExLlamaV2 | Variable bit-width | NVIDIA |
GGUF (llama.cpp)
The standard format for llama.cpp and Ollama. Successor to GGML.
Key Features
- Self-contained: model + metadata + quantization in one file
- Multiple quantization levels (Q2 through Q8)
- Works on CPU, Apple Metal, CUDA, ROCm
- Largest selection of pre-quantized models
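A quick way to check that a GGUF file works end to end is the llama-cpp-python bindings. This is a minimal sketch, not the only loading path; the file name, context size, and GPU offload setting are placeholders to adjust for your hardware.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct-Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,         # context window
    n_gpu_layers=-1,    # offload all layers to GPU if available; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```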
Quantization Naming
llama-3-8b-instruct-Q4_K_M.gguf
                     │ │ │
                     │ │ └── M = Medium (S = Small, L = Large; quality/size tradeoff)
                     │ └──── K = K-quant scheme (block-wise quantization)
                     └────── 4 = 4-bit precision
Common Quant Levels
| Name | Bits | Size vs FP16 | Quality |
|---|---|---|---|
| Q8_0 | 8 | ~50% | Near lossless |
| Q6_K | 6 | ~38% | Excellent |
| Q5_K_M | 5 | ~31% | Very good |
| Q4_K_M | 4 | ~25% | Good (recommended) |
| Q3_K_M | 3 | ~19% | Acceptable |
| Q2_K | 2 | ~13% | Degraded |
| IQ4_XS | ~4 | ~25% | Good (importance matrix) |
Where to Find GGUF Models
- TheBloke on HuggingFace — most popular quantizer
- HuggingFace GGUF filter
- Ollama library (downloads GGUF automatically)
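GGUF repositories on the Hub typically contain many quantization variants, so you usually want one file rather than the whole repo. A sketch using huggingface_hub; the repo id and filename are illustrative:

```python
from huggingface_hub import hf_hub_download

# Fetch a single quantization variant instead of cloning the whole repository.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # illustrative repo
    filename="llama-2-7b-chat.Q4_K_M.gguf",    # illustrative file
)
print(path)  # local cache path, ready to pass to llama.cpp or llama-cpp-python
```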
safetensors (HuggingFace)
Safe, fast model format from HuggingFace. The standard for PyTorch models.
Key Features
- Prevents arbitrary code execution (unlike pickle)
- Fast loading (memory-mapped)
- Default for HuggingFace models
- Usually stored as FP16 or BF16 (unquantized)
Use When
- Using transformers library directly
- Using vLLM
- Need unquantized weights
- Converting to other formats
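A minimal sketch of the transformers path. The model id is illustrative, torch and transformers are assumed to be installed, and device_map="auto" additionally requires the accelerate package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative safetensors checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantized half precision
    device_map="auto",           # spread layers across available GPU(s)/CPU
)

inputs = tokenizer("safetensors is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```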
GPTQ
4-bit quantization optimized for NVIDIA GPUs using calibration data.
Key Features
- Calibration-based: quantizes using sample data to minimize error
- Typically 4-bit
- Optimized CUDA kernels
- Requires pre-quantized models (can't quantize on the fly)
Use When
- Using ExLlama or AutoGPTQ
- Want optimized NVIDIA performance
- Model available in GPTQ format
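If a pre-quantized GPTQ checkpoint exists, recent transformers releases can load it directly, provided the optimum and auto-gptq packages are installed; the quantization settings are read from the checkpoint's config. A hedged sketch with an illustrative model id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative pre-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The 4-bit GPTQ weights stay quantized on the GPU; no extra arguments needed here.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQ keeps weights in 4-bit,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```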
AWQ (Activation-Aware Quantization)
4-bit quantization that preserves important weights based on activation patterns.
Key Features
- Better quality than naive quantization at same bit-width
- Typically 4-bit
- Good vLLM support
- Often outperforms GPTQ
Use When
- Using vLLM for serving
- Want best quality at 4-bit on NVIDIA
- Model available in AWQ format
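A minimal offline-inference sketch with vLLM. The model id is illustrative; passing quantization="awq" makes the backend explicit, though vLLM can usually also detect it from the checkpoint's config.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Why quantize a model at all?"], params)
print(outputs[0].outputs[0].text)
```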
EXL2 (ExLlamaV2)
Variable bit-width format optimized for ExLlamaV2.
Key Features
- Mixed precision within model (important layers get more bits)
- Very fast inference
- Customizable bit-per-weight targets
- NVIDIA only
Format Decision Tree
What's your setup?
│
├── Apple Silicon
│   └── GGUF (via llama.cpp or Ollama)
│
├── NVIDIA GPU, want simplicity
│   └── GGUF (via Ollama)
│
├── NVIDIA GPU, serving multiple users
│   └── AWQ or safetensors (via vLLM)
│
├── NVIDIA GPU, maximum speed
│   └── EXL2 (via ExLlamaV2) or AWQ
│
├── AMD GPU
│   └── GGUF (via llama.cpp ROCm)
│
├── CPU only
│   └── GGUF (via llama.cpp)
│
└── HuggingFace/transformers directly
    └── safetensors
Converting Between Formats
HuggingFace → GGUF
# In the llama.cpp directory (newer releases rename these to convert_hf_to_gguf.py and llama-quantize)
python convert.py /path/to/hf-model --outfile model.gguf
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M
GGUF → HuggingFace
Not directly supported. GGUF is a one-way conversion intended for inference, so keep the original safetensors weights if you may need them later.
File Size Reference
Approximate sizes for a 70B model:
| Format | Precision | Size |
|---|---|---|
| safetensors | FP16 | ~140 GB |
| GGUF Q8 | 8-bit | ~74 GB |
| GGUF Q6_K | 6-bit | ~57 GB |
| GGUF Q4_K_M | 4-bit | ~42 GB |
| GPTQ/AWQ | 4-bit | ~40 GB |
| GGUF Q2_K | 2-bit | ~28 GB |
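These figures follow from parameter count times bits per weight, plus a few percent for block scales and metadata. A rough estimator; the ~4.8 effective bits per weight used for Q4_K_M is an approximation, not an exact value.

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # parameters x bits per weight / 8 bits per byte, ignoring metadata overhead
    return params_billion * bits_per_weight / 8

print(approx_size_gb(70, 16))   # 140.0 GB -> FP16 safetensors
print(approx_size_gb(70, 4.8))  # 42.0 GB  -> roughly Q4_K_M
```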