Model Formats
GGUF, safetensors, GPTQ, and when to use each
Different inference engines expect different model formats. Choosing the right format depends on your hardware, software stack, and quantization needs.
Format Overview
| Format | Used By | Quantization | Hardware |
|---|---|---|---|
| GGUF | llama.cpp, Ollama | Built-in (Q2-Q8) | CPU, Apple Silicon, NVIDIA, AMD |
| safetensors | Transformers, vLLM | Usually none (FP16/BF16) | CPU or any GPU |
| GPTQ | AutoGPTQ, ExLlama | 4-bit, calibration-based | NVIDIA |
| AWQ | vLLM, TGI | 4-bit, activation-aware | NVIDIA |
| EXL2 | ExLlamaV2 | Variable bit-width | NVIDIA |
GGUF (llama.cpp)
The standard format for llama.cpp and Ollama. Successor to GGML.
Key Features
- Self-contained: model + metadata + quantization in one file
- Multiple quantization levels (Q2 through Q8)
- Works on CPU, Apple Metal, CUDA, ROCm
- Largest selection of pre-quantized models
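A quick way to check that a GGUF file works end to end is the llama-cpp-python bindings. This is a minimal sketch, not the only loading path; the file name, context size, and GPU offload setting are placeholders to adjust for your hardware.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./llama-3-8b-instruct-Q4_K_M.gguf",  # any local GGUF file
    n_ctx=4096,         # context window
    n_gpu_layers=-1,    # offload all layers to GPU if available; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain GGUF in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```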
Quantization Naming
llama-3-8b-instruct-Q4_K_M.gguf
                     │ │ │
                     │ │ └── M = Medium (S = Small, L = Large; quality/size tradeoff)
                     │ └──── K = K-quant scheme (block-wise quantization)
                     └────── 4 = 4-bit precision
Common Quant Levels
| Name | Bits | Size vs FP16 | Quality |
|---|---|---|---|
| Q8_0 | 8 | ~50% | Near lossless |
| Q6_K | 6 | ~38% | Excellent |
| Q5_K_M | 5 | ~31% | Very good |
| Q4_K_M | 4 | ~25% | Good (recommended) |
| Q3_K_M | 3 | ~19% | Acceptable |
| Q2_K | 2 | ~13% | Degraded |
| IQ4_XS | ~4 | ~25% | Good (importance matrix) |
Where to Find GGUF Models
- TheBloke on HuggingFace — most popular quantizer
- HuggingFace GGUF filter
- Ollama library (downloads GGUF automatically)
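GGUF repositories on the Hub typically contain many quantization variants, so you usually want one file rather than the whole repo. A sketch using huggingface_hub; the repo id and filename are illustrative:

```python
from huggingface_hub import hf_hub_download

# Fetch a single quantization variant instead of cloning the whole repository.
path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # illustrative repo
    filename="llama-2-7b-chat.Q4_K_M.gguf",    # illustrative file
)
print(path)  # local cache path, ready to pass to llama.cpp or llama-cpp-python
```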
safetensors (HuggingFace)
Safe, fast model format from HuggingFace. The standard for PyTorch models.
Key Features
- Prevents arbitrary code execution (unlike pickle)
- Fast loading (memory-mapped)
- Default for HuggingFace models
- Usually stored as FP16 or BF16 (unquantized)
Use When
- Using transformers library directly
- Using vLLM
- Need unquantized weights
- Converting to other formats
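A minimal sketch of the transformers path. The model id is illustrative, torch and transformers are assumed to be installed, and device_map="auto" additionally requires the accelerate package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative safetensors checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantized half precision
    device_map="auto",           # spread layers across available GPU(s)/CPU
)

inputs = tokenizer("safetensors is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```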
GPTQ
4-bit quantization optimized for NVIDIA GPUs using calibration data.
Key Features
- Calibration-based: quantizes using sample data to minimize error
- Typically 4-bit
- Optimized CUDA kernels
- Requires pre-quantized models (can't quantize on the fly)
Use When
- Using ExLlama or AutoGPTQ
- Want optimized NVIDIA performance
- Model available in GPTQ format
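If a pre-quantized GPTQ checkpoint exists, recent transformers releases can load it directly, provided the optimum and auto-gptq packages are installed; the quantization settings are read from the checkpoint's config. A hedged sketch with an illustrative model id:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-Chat-GPTQ"  # illustrative pre-quantized checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The 4-bit GPTQ weights stay quantized on the GPU; no extra arguments needed here.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("GPTQ keeps weights in 4-bit,", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```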
AWQ (Activation-Aware Quantization)
4-bit quantization that preserves important weights based on activation patterns.
Key Features
- Better quality than naive quantization at same bit-width
- Typically 4-bit
- Good vLLM support
- Often outperforms GPTQ
Use When
- Using vLLM for serving
- Want best quality at 4-bit on NVIDIA
- Model available in AWQ format
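A minimal offline-inference sketch with vLLM. The model id is illustrative; passing quantization="awq" makes the backend explicit, though vLLM can usually also detect it from the checkpoint's config.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-7B-Chat-AWQ",  # illustrative AWQ checkpoint
    quantization="awq",
)
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["Why quantize a model at all?"], params)
print(outputs[0].outputs[0].text)
```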
EXL2 (ExLlamaV2)
Variable bit-width format optimized for ExLlamaV2.
Key Features
- Mixed precision within model (important layers get more bits)
- Very fast inference
- Customizable bit-per-weight targets
- NVIDIA only
Format Decision Tree
What's your setup?
│
├── Apple Silicon
│   └── GGUF (via llama.cpp or Ollama)
│
├── NVIDIA GPU, want simplicity
│   └── GGUF (via Ollama)
│
├── NVIDIA GPU, serving multiple users
│   └── AWQ or safetensors (via vLLM)
│
├── NVIDIA GPU, maximum speed
│   └── EXL2 (via ExLlamaV2) or AWQ
│
├── AMD GPU
│   └── GGUF (via llama.cpp ROCm)
│
├── CPU only
│   └── GGUF (via llama.cpp)
│
└── HuggingFace/transformers directly
    └── safetensors
Converting Between Formats
HuggingFace → GGUF
# In the llama.cpp directory (newer releases rename these to convert_hf_to_gguf.py and llama-quantize)
python convert.py /path/to/hf-model --outfile model.gguf
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M
GGUF → HuggingFace
Not directly supported. GGUF is a one-way conversion intended for inference, so keep the original safetensors weights if you may need them later.
File Size Reference
Approximate sizes for a 70B model:
| Format | Precision | Size |
|---|---|---|
| safetensors | FP16 | ~140 GB |
| GGUF Q8 | 8-bit | ~74 GB |
| GGUF Q6_K | 6-bit | ~57 GB |
| GGUF Q4_K_M | 4-bit | ~42 GB |
| GPTQ/AWQ | 4-bit | ~40 GB |
| GGUF Q2_K | 2-bit | ~28 GB |
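These figures follow from parameter count times bits per weight, plus a few percent for block scales and metadata. A rough estimator; the ~4.8 effective bits per weight used for Q4_K_M is an approximation, not an exact value.

```python
def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
    # parameters x bits per weight / 8 bits per byte, ignoring metadata overhead
    return params_billion * bits_per_weight / 8

print(approx_size_gb(70, 16))   # 140.0 GB -> FP16 safetensors
print(approx_size_gb(70, 4.8))  # 42.0 GB  -> roughly Q4_K_M
```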