llama.cpp
The engine behind most local LLM inference
llama.cpp is a C/C++ implementation of LLM inference that prioritizes efficiency and portability. It runs on everything from Raspberry Pis to datacenter GPUs, supports extensive quantization, and is the backbone of tools like Ollama.
Why llama.cpp?
Strengths
- Runs almost anywhere (CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, and more)
- Best-in-class quantization support
- Very active development
- No Python dependencies
- Memory efficient
Trade-offs
- CLI focused (only a bare-bones web UI in server mode)
- Model management is manual
- More flags to learn than Ollama
- Batching less developed than vLLM
Installation
Pre-built Binaries
Download from GitHub Releases.
Build from Source (Recommended)
# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Build for CPU
make
# Build for NVIDIA (CUDA)
make LLAMA_CUDA=1
# Build for Apple Metal
make LLAMA_METAL=1
# Build for AMD (ROCm)
make LLAMA_HIPBLAS=1
Basic Usage
# Simple generation
./main -m model.gguf -p "Hello, I am" -n 100
# Interactive chat
./main -m model.gguf --interactive-first
# With GPU acceleration (put all layers on GPU)
./main -m model.gguf -ngl 99 -p "Hello"
# Server mode (OpenAI-compatible API)
./server -m model.gguf -ngl 99 --port 8080
Key Flags
| Flag | Purpose | Example |
|---|---|---|
| -m | Model file path | -m llama-3-8b.Q4_K_M.gguf |
| -ngl | GPU layers (99 = all) | -ngl 99 |
| -c | Context size | -c 8192 |
| -n | Tokens to generate | -n 256 |
| -p | Prompt | -p "Once upon a time" |
| -t | CPU threads | -t 8 |
| --temp | Temperature (randomness) | --temp 0.7 |
| --repeat-penalty | Repetition penalty | --repeat-penalty 1.1 |
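These flags compose freely. As an illustrative invocation (the model filename is just a placeholder), here is an interactive chat with full GPU offload, a larger context window, and moderate sampling:
# Interactive chat: all layers on GPU, 8K context, 8 threads, mild sampling
./main -m llama-3-8b.Q4_K_M.gguf -ngl 99 -c 8192 -t 8 --temp 0.7 --repeat-penalty 1.1 --interactive-first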
Multi-GPU
# Split across 2 GPUs (50/50)
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5
# Uneven split (60% GPU0, 40% GPU1)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4
# Specify which GPUs
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
Server Mode
llama.cpp includes an OpenAI-compatible API server:
# Start server
./server -m model.gguf -ngl 99 -c 4096 --port 8080
# Use with curl
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Hello!"}]
}'
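The endpoint also accepts the usual OpenAI-style sampling and length parameters; exactly which ones are honored can vary between server versions, so treat this as a sketch:
# Chat completion with a temperature and token limit
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "temperature": 0.7,
    "max_tokens": 128
  }'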
Quantization
llama.cpp supports extensive quantization through GGUF format:
| Quant | Bits | Quality | Speed | Use Case |
|---|---|---|---|---|
| Q8_0 | 8 | Excellent | Good | Quality priority |
| Q6_K | 6 | Very Good | Good | Good balance |
| Q5_K_M | 5 | Good | Better | Balanced |
| Q4_K_M | 4 | Good | Fast | Recommended default |
| Q3_K_M | 3 | Acceptable | Faster | VRAM limited |
| Q2_K | 2 | Poor | Fastest | Last resort |
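As a rough sizing rule, a GGUF file costs about bits ÷ 8 bytes per parameter, plus a little overhead for the mixed-precision layers in the K-quants: an 8B-parameter model at Q4_K_M (roughly 4.5–5 effective bits per weight) lands around 4.5–5 GB on disk, and the KV cache adds VRAM on top of that, growing with context size.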
Quantizing Your Own Models
# Convert HuggingFace model to GGUF
python convert.py /path/to/model --outfile model.gguf
# Quantize
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M
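If the model isn't on disk yet, a sketch of the full pipeline looks like the following (the repo id is a placeholder; note that newer llama.cpp trees name the conversion script convert_hf_to_gguf.py instead of convert.py):
# Download from Hugging Face, convert to GGUF, then quantize
huggingface-cli download <org>/<model> --local-dir ./hf-model
python convert.py ./hf-model --outfile model.gguf
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M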
Performance Tuning
GPU Offloading
# All layers on GPU (best if it fits)
./main -m model.gguf -ngl 99
# Partial offload (when model doesn't fit)
./main -m model.gguf -ngl 40 # 40 layers on GPU, rest on CPU
# Check GPU memory with nvidia-smi to find optimal -ngl
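On NVIDIA cards you can watch memory headroom while adjusting -ngl; these are standard nvidia-smi options:
# Per-GPU memory usage, refreshed every second
nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv -l 1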
CPU Threads
# Match physical cores (not hyperthreads)
./main -m model.gguf -t 8
# For CPU-only, more threads help
./main -m model.gguf -t 16
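If you're unsure of your physical core count, on Linux it is sockets × cores per socket (macOS reports it via sysctl -n hw.physicalcpu):
# Physical core count on Linux
lscpu | grep -E '^(Socket\(s\)|Core\(s\) per socket)'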
Batch Size
# Larger batch for throughput (uses more memory)
./main -m model.gguf -b 512
# Smaller batch for low memory
./main -m model.gguf -b 128
Common Issues
Model Not Using GPU
Check that you're using a CUDA/Metal/ROCm build and specifying -ngl:
./main -m model.gguf -ngl 99
Out of Memory
- Reduce -ngl to offload some layers to CPU
- Reduce context size: -c 2048
- Use more aggressive quantization (Q4 instead of Q6)
Slow Performance
- Verify GPU is being used (nvidia-smi)
- Check you built with GPU support
- Ensure model isn't being offloaded to RAM
llama.cpp vs Ollama
| Aspect | llama.cpp | Ollama |
|---|---|---|
| Ease of use | CLI flags, manual | Simple, automatic |
| Model management | Manual downloads | Built-in library |
| Control | Full access to all options | Simplified subset |
| Custom models | Any GGUF file | Modelfile required |
| Use case | Power users, custom setups | Getting started, convenience |