llama.cpp is a C/C++ implementation of LLM inference that prioritizes efficiency and portability. It runs on everything from Raspberry Pis to datacenter GPUs, supports extensive quantization, and is the backbone of tools like Ollama.

Why llama.cpp?

Strengths

  • Runs anywhere (CPU, NVIDIA, AMD, Apple, etc.)
  • Best-in-class quantization support
  • Very active development
  • No Python dependencies
  • Memory efficient

Trade-offs

  • CLI focused (only a bare-bones web UI in server mode)
  • Model management is manual
  • More flags to learn than Ollama
  • Batching less developed than vLLM

Installation

Pre-built Binaries

Download from GitHub Releases.

Build from Source (Recommended)

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build for CPU
make

# Build for NVIDIA (CUDA)
make LLAMA_CUDA=1

# Build for Apple Metal
make LLAMA_METAL=1

# Build for AMD (ROCm)
make LLAMA_HIPBLAS=1
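
Builds can take a while; make's standard -j flag runs the compile in parallel and can be combined with any of the options above (the job count here assumes an 8-core machine):

# Parallel CUDA build using 8 jobs
make -j 8 LLAMA_CUDA=1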

Basic Usage

# Simple generation
./main -m model.gguf -p "Hello, I am" -n 100

# Interactive chat
./main -m model.gguf --interactive-first

# With GPU acceleration (put all layers on GPU)
./main -m model.gguf -ngl 99 -p "Hello"

# Server mode (OpenAI-compatible API)
./server -m model.gguf -ngl 99 --port 8080
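
Once the server is running, it helps to confirm it is reachable before wiring anything else to it. Recent server builds expose a /health endpoint; the port below assumes the --port 8080 from the command above:

# Should return a small JSON status object if the server is ready
curl http://localhost:8080/health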

Key Flags

Flag               Purpose                     Example
-m                 Model file path             -m llama-3-8b.Q4_K_M.gguf
-ngl               GPU layers (99 = all)       -ngl 99
-c                 Context size                -c 8192
-n                 Tokens to generate          -n 256
-p                 Prompt                      -p "Once upon a time"
-t                 CPU threads                 -t 8
--temp             Temperature (randomness)    --temp 0.7
--repeat-penalty   Repetition penalty          --repeat-penalty 1.1
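
These flags compose freely. As an illustration (the model filename and values are arbitrary), a typical interactive chat session on a GPU build might combine them like this:

# 8K context, all layers on GPU, 8 threads, mildly creative sampling
./main -m llama-3-8b.Q4_K_M.gguf -ngl 99 -c 8192 -t 8 \
  --temp 0.7 --repeat-penalty 1.1 --interactive-first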

Multi-GPU

# Split across 2 GPUs (50/50)
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Uneven split (60% GPU0, 40% GPU1)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4

# Specify which GPUs
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
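
In multi-GPU CUDA builds there is also -mg/--main-gpu, which selects the device that holds scratch buffers and small tensors; a sketch combining it with the uneven split above:

# Weights split 60/40, small tensors kept on GPU 0
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4 --main-gpu 0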

Server Mode

llama.cpp includes an OpenAI-compatible API server:

# Start server
./server -m model.gguf -ngl 99 -c 4096 --port 8080

# Use with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
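
The endpoint accepts the usual OpenAI-style request options; for example, setting "stream": true returns the reply incrementally as server-sent events (a sketch; exact option support varies by server version):

# Stream the reply token by token instead of waiting for the full response
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "stream": true
  }'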

Quantization

llama.cpp supports extensive quantization through the GGUF format:

Quant    Bits  Quality     Speed    Use Case
Q8_0     8     Excellent   Good     Quality priority
Q6_K     6     Very Good   Good     Good balance
Q5_K_M   5     Good        Better   Balanced
Q4_K_M   4     Good        Fast     Recommended default
Q3_K_M   3     Acceptable  Faster   VRAM limited
Q2_K     2     Poor        Fastest  Last resort
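
As a rough rule of thumb, the weight file takes about parameters × bits-per-weight / 8 bytes; K-quants land slightly above their nominal bit width (Q4_K_M is roughly 4.8 bits/weight), and the KV cache and compute buffers come on top. A back-of-the-envelope check for an 8B model:

# ~4.8 GB of weights for 8B parameters at ~4.8 bits/weight
python3 -c "print(8e9 * 4.8 / 8 / 1e9, 'GB')"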

Quantizing Your Own Models

# Convert HuggingFace model to GGUF
python convert.py /path/to/model --outfile model.gguf

# Quantize
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M
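
To sanity-check how much quality a quantization step costs, the perplexity tool built alongside main can score both files on the same text (a sketch; wiki.test.raw stands in for whatever evaluation text you have):

# Lower perplexity is better; compare the original and quantized models
./perplexity -m model.gguf -f wiki.test.raw -ngl 99
./perplexity -m model-Q4_K_M.gguf -f wiki.test.raw -ngl 99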

Performance Tuning

GPU Offloading

# All layers on GPU (best if it fits)
./main -m model.gguf -ngl 99

# Partial offload (when model doesn't fit)
./main -m model.gguf -ngl 40  # 40 layers on GPU, rest on CPU

# Check GPU memory with nvidia-smi to find optimal -ngl
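
One practical way to pick -ngl is to start the model and watch GPU memory fill up while it loads, then back off if it spills (NVIDIA only; run this in a second terminal):

# Refresh GPU memory usage every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv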

CPU Threads

# Match physical cores (not hyperthreads)
./main -m model.gguf -t 8

# For CPU-only inference, more threads can help (up to the physical core count)
./main -m model.gguf -t 16
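
On Linux, the physical core count (as opposed to hyperthreads) can be read from lscpu; nproc reports logical CPUs, which is typically double the physical count on hyperthreaded machines:

# Logical CPUs (includes hyperthreads)
nproc
# Physical cores = "Core(s) per socket" x "Socket(s)"
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)'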

Batch Size

# Larger batch for faster prompt processing (uses more memory)
./main -m model.gguf -b 512

# Smaller batch for low memory
./main -m model.gguf -b 128

Common Issues

Model Not Using GPU

Check that you're using a CUDA/Metal/ROCm build and specifying -ngl:

./main -m model.gguf -ngl 99
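
The load log is the quickest confirmation: with a GPU build and -ngl set, startup typically prints how many layers were offloaded (exact wording varies by version):

# Look for a line like "offloaded 33/33 layers to GPU" in the startup output
./main -m model.gguf -ngl 99 -p "test" -n 1 2>&1 | grep -i offloaded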

Out of Memory

Lower -ngl so fewer layers sit on the GPU, reduce the context size with -c, or switch to a smaller quantization such as Q4_K_M or Q3_K_M.

Slow Performance

Confirm the binary was built with GPU support and that -ngl is set; on CPU, match -t to the number of physical cores and consider a smaller quantization, since speed is usually limited by memory bandwidth.

llama.cpp vs Ollama

Aspect             llama.cpp                     Ollama
Ease of use        CLI flags, manual             Simple, automatic
Model management   Manual downloads              Built-in library
Control            Full access to all options    Simplified subset
Custom models      Any GGUF file                 Modelfile required
Use case           Power users, custom setups    Getting started, convenience