Different inference engines expect different model formats. Choosing the right format depends on your hardware, software stack, and quantization needs.

Format Overview

| Format | Primary Use | Quantization | Used By |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Built-in (Q2-Q8) | CPU, Apple, all GPUs |
| safetensors | HuggingFace, PyTorch | Separate (FP16 typical) | Transformers, vLLM |
| GPTQ | NVIDIA GPU inference | 4-bit calibrated | AutoGPTQ, ExLlama |
| AWQ | NVIDIA GPU inference | 4-bit activation-aware | vLLM, TGI |
| EXL2 | ExLlamaV2 | Variable bit-width | ExLlamaV2 |

GGUF (llama.cpp)

The standard format for llama.cpp and Ollama. Successor to GGML.

Key Features

- Single self-contained file: weights, tokenizer, and architecture metadata are embedded together
- Built-in quantization levels from 2-bit to 8-bit (see Common Quant Levels below)
- Memory-mapped loading for fast start-up and low overhead
- Runs on CPU alone or with partial or full GPU offload

Quantization Naming

llama-3-8b-instruct-Q4_K_M.gguf
                     │ │ │
                     │ │ └── M = Medium (S=Small, L=Large quality/size)
                     │ └──── K = K-quants (mixed block sizes)
                     └────── 4 = 4-bit precision

Common Quant Levels

| Name | Bits | Size vs FP16 | Quality |
|---|---|---|---|
| Q8_0 | 8 | ~50% | Near lossless |
| Q6_K | 6 | ~38% | Excellent |
| Q5_K_M | 5 | ~31% | Very good |
| Q4_K_M | 4 | ~25% | Good (recommended) |
| Q3_K_M | 3 | ~19% | Acceptable |
| Q2_K | 2 | ~13% | Degraded |
| IQ4_XS | ~4 | ~25% | Good (importance matrix) |
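
To make the table concrete, here is a minimal sketch of loading a Q4_K_M file with the llama-cpp-python bindings; the package is assumed to be installed, and the file path and prompt are placeholders.

# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct-Q4_K_M.gguf",  # placeholder path to a local GGUF file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if available; set 0 for CPU only
)
out = llm("Explain GGUF quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])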

Where to Find GGUF Models

Most GGUF files are community re-quantizations published on Hugging Face; searching for a model name plus "GGUF" usually turns up uploads at several quant levels. Ollama's model library also serves GGUF-backed models directly through ollama pull.
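
For scripted downloads, a minimal sketch using the huggingface_hub client; the repo ID and file name below are illustrative placeholders, and huggingface_hub is assumed to be installed.

# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Repo and file name are illustrative -- substitute any GGUF upload and quant level
path = hf_hub_download(
    repo_id="bartowski/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct-Q4_K_M.gguf",
)
print(path)  # local cache path, ready to hand to llama.cpp or Ollama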

safetensors (HuggingFace)

Safe, fast model format from HuggingFace. The standard for PyTorch models.

Key Features

- Stores raw tensors only, with no pickled Python objects, so loading cannot execute arbitrary code
- Fast, zero-copy (memory-mapped) loading
- Weights are typically unquantized FP16/BF16; quantization happens separately (bitsandbytes, GPTQ, AWQ, etc.)
- Config and tokenizer live in separate files rather than being bundled into the checkpoint

Use When

- You work directly with transformers, vLLM, or other HuggingFace-ecosystem tooling
- You plan to fine-tune, merge, or further quantize the model
- You have enough VRAM for FP16/BF16 weights, or you quantize at load time
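
A minimal sketch of loading safetensors weights through transformers; the model ID is only an example, and transformers plus accelerate are assumed to be installed (device_map="auto" needs accelerate).

# pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example ID; any safetensors repo works

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # safetensors checkpoints are typically FP16/BF16
    device_map="auto",          # spread layers across available devices
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))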

GPTQ

4-bit quantization optimized for NVIDIA GPUs using calibration data.

Key Features

- Post-training quantization that uses a small calibration dataset to minimize per-layer quantization error
- Typically 4-bit weights with a configurable group size (e.g., 128)
- Loaded through AutoGPTQ or ExLlama kernels; supported by transformers and vLLM

Use When

- You have an NVIDIA GPU and FP16 weights will not fit in VRAM
- A pre-quantized GPTQ checkpoint of the model you want is already published
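
As a sketch (assuming the optimum and auto-gptq packages are installed, and using an illustrative repo ID), transformers can load a pre-quantized GPTQ checkpoint directly:

# pip install transformers optimum auto-gptq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative; any GPTQ checkpoint works

tokenizer = AutoTokenizer.from_pretrained(model_id)
# The GPTQ quantization config stored in the repo is detected automatically,
# so the 4-bit weights load without extra arguments
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")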

AWQ (Activation-Aware Quantization)

4-bit quantization that preserves important weights based on activation patterns.

Key Features

- Activation-aware: uses activation statistics from a calibration set to find the most important weight channels and scales them so they lose less precision at 4-bit
- No retraining or backpropagation required, only a forward-pass calibration
- First-class support in serving stacks such as vLLM and TGI

Use When

- You serve models on NVIDIA GPUs with vLLM or TGI and need the memory savings of 4-bit
- You want 4-bit quality that is often on par with or better than GPTQ without per-model tuning
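
A minimal serving sketch with vLLM, which has native AWQ support; the repo ID is illustrative and vLLM is assumed to be installed.

# pip install vllm
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")  # illustrative repo ID
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Summarize AWQ in one sentence."], params)
print(outputs[0].outputs[0].text)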

EXL2 (ExLlamaV2)

Variable bit-width format optimized for ExLlamaV2.

Key Features

- Bits per weight vary within a single model (roughly 2-8 bpw), so you can target an average bit-width that just fits your VRAM
- Quantization uses a calibration/measurement pass to decide where to spend the extra bits
- Very fast single-user inference on NVIDIA GPUs through ExLlamaV2 and front ends built on it (e.g., text-generation-webui, TabbyAPI)

Format Decision Tree

What's your setup?
│
├── Apple Silicon
│   └── GGUF (via llama.cpp or Ollama)
│
├── NVIDIA GPU, want simplicity
│   └── GGUF (via Ollama)
│
├── NVIDIA GPU, serving multiple users
│   └── AWQ or safetensors (via vLLM)
│
├── NVIDIA GPU, maximum speed
│   └── EXL2 (via ExLlamaV2) or AWQ
│
├── AMD GPU
│   └── GGUF (via llama.cpp ROCm)
│
├── CPU only
│   └── GGUF (via llama.cpp)
│
└── HuggingFace/transformers directly
    └── safetensors

Converting Between Formats

HuggingFace → GGUF

# In the llama.cpp repository (script and binary names vary by version;
# older releases used convert.py and ./quantize)
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf
./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M

GGUF → HuggingFace

Not directly supported. GGUF is a one-way conversion for inference.

File Size Reference

Approximate sizes for a 70B model:

| Format | Precision | Size |
|---|---|---|
| safetensors | FP16 | ~140 GB |
| GGUF Q8_0 | 8-bit | ~74 GB |
| GGUF Q6_K | 6-bit | ~57 GB |
| GGUF Q4_K_M | 4-bit | ~42 GB |
| GPTQ/AWQ | 4-bit | ~40 GB |
| GGUF Q2_K | 2-bit | ~28 GB |
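
These sizes follow roughly from parameter count x effective bits per weight / 8; real files add a few percent for embeddings, scales, and metadata, which is why the nominal percentages in the quant-level table above slightly understate actual file sizes. A back-of-envelope sketch (the bits-per-weight values are approximate effective figures, not exact):

# Rough size estimate: params (in billions) * effective bits per weight / 8 = GB (decimal)
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * bits_per_weight / 8

# Approximate effective bpw for common GGUF quants (mixes include some higher-precision tensors)
for label, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 3.2)]:
    print(f"70B {label}: ~{approx_size_gb(70, bpw):.0f} GB")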