llama.cpp is a C/C++ implementation of LLM inference that prioritizes efficiency and portability. It runs on everything from Raspberry Pis to datacenter GPUs, supports extensive quantization, and is the backbone of tools like Ollama.

Why llama.cpp?

Strengths

  • Runs anywhere (CPU, NVIDIA, AMD, Apple, etc.)
  • Best-in-class quantization support
  • Very active development
  • No Python dependencies
  • Memory efficient

Trade-offs

  • CLI focused (only a bare-bones web UI in server mode)
  • Model management is manual
  • More flags to learn than Ollama
  • Batching less developed than vLLM

Installation

Pre-built Binaries

Download from GitHub Releases.

Build from Source (Recommended)

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build for CPU
make

# Build for NVIDIA (CUDA)
make LLAMA_CUDA=1

# Build for Apple Metal
make LLAMA_METAL=1

# Build for AMD (ROCm)
make LLAMA_HIPBLAS=1
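
Builds can take a while; make's standard -j flag runs the compile in parallel and can be combined with any of the options above (the job count here assumes an 8-core machine):

# Parallel CUDA build using 8 jobs
make -j 8 LLAMA_CUDA=1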

Basic Usage

# Simple generation
./main -m model.gguf -p "Hello, I am" -n 100

# Interactive chat
./main -m model.gguf --interactive-first

# With GPU acceleration (put all layers on GPU)
./main -m model.gguf -ngl 99 -p "Hello"

# Server mode (OpenAI-compatible API)
./server -m model.gguf -ngl 99 --port 8080
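
Once the server is running, it helps to confirm it is reachable before wiring anything else to it. Recent server builds expose a /health endpoint; the port below assumes the --port 8080 from the command above:

# Should return a small JSON status object if the server is ready
curl http://localhost:8080/health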

Key Flags

Flag               Purpose                     Example
-m                 Model file path             -m llama-3-8b.Q4_K_M.gguf
-ngl               GPU layers (99 = all)       -ngl 99
-c                 Context size                -c 8192
-n                 Tokens to generate          -n 256
-p                 Prompt                      -p "Once upon a time"
-t                 CPU threads                 -t 8
--temp             Temperature (randomness)    --temp 0.7
--repeat-penalty   Repetition penalty          --repeat-penalty 1.1
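
These flags compose freely. As an illustration (the model filename and values are arbitrary), a typical interactive chat session on a GPU build might combine them like this:

# 8K context, all layers on GPU, 8 threads, mildly creative sampling
./main -m llama-3-8b.Q4_K_M.gguf -ngl 99 -c 8192 -t 8 \
  --temp 0.7 --repeat-penalty 1.1 --interactive-first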

Multi-GPU

# Split across 2 GPUs (50/50)
./main -m model.gguf -ngl 99 --tensor-split 0.5,0.5

# Uneven split (60% GPU0, 40% GPU1)
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4

# Specify which GPUs
CUDA_VISIBLE_DEVICES=0,1 ./main -m model.gguf -ngl 99
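
In multi-GPU CUDA builds there is also -mg/--main-gpu, which selects the device that holds scratch buffers and small tensors; a sketch combining it with the uneven split above:

# Weights split 60/40, small tensors kept on GPU 0
./main -m model.gguf -ngl 99 --tensor-split 0.6,0.4 --main-gpu 0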

Server Mode

llama.cpp includes an OpenAI-compatible API server:

# Start server
./server -m model.gguf -ngl 99 -c 4096 --port 8080

# Use with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
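
The endpoint accepts the usual OpenAI-style request options; for example, setting "stream": true returns the reply incrementally as server-sent events (a sketch; exact option support varies by server version):

# Stream the reply token by token instead of waiting for the full response
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Write a haiku"}],
    "stream": true
  }'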

Quantization

llama.cpp supports extensive quantization through the GGUF format:

Quant    Bits  Quality     Speed    Use Case
Q8_0     8     Excellent   Good     Quality priority
Q6_K     6     Very Good   Good     Good balance
Q5_K_M   5     Good        Better   Balanced
Q4_K_M   4     Good        Fast     Recommended default
Q3_K_M   3     Acceptable  Faster   VRAM limited
Q2_K     2     Poor        Fastest  Last resort
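
As a rough rule of thumb, the weight file takes about parameters × bits-per-weight / 8 bytes; K-quants land slightly above their nominal bit width (Q4_K_M is roughly 4.8 bits/weight), and the KV cache and compute buffers come on top. A back-of-the-envelope check for an 8B model:

# ~4.8 GB of weights for 8B parameters at ~4.8 bits/weight
python3 -c "print(8e9 * 4.8 / 8 / 1e9, 'GB')"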

Quantizing Your Own Models

# Convert HuggingFace model to GGUF
python convert.py /path/to/model --outfile model.gguf

# Quantize
./quantize model.gguf model-Q4_K_M.gguf Q4_K_M
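
To sanity-check how much quality a quantization step costs, the perplexity tool built alongside main can score both files on the same text (a sketch; wiki.test.raw stands in for whatever evaluation text you have):

# Lower perplexity is better; compare the original and quantized models
./perplexity -m model.gguf -f wiki.test.raw -ngl 99
./perplexity -m model-Q4_K_M.gguf -f wiki.test.raw -ngl 99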

Performance Tuning

GPU Offloading

# All layers on GPU (best if it fits)
./main -m model.gguf -ngl 99

# Partial offload (when model doesn't fit)
./main -m model.gguf -ngl 40  # 40 layers on GPU, rest on CPU

# Check GPU memory with nvidia-smi to find optimal -ngl
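
One practical way to pick -ngl is to start the model and watch GPU memory fill up while it loads, then back off if it spills (NVIDIA only; run this in a second terminal):

# Refresh GPU memory usage every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv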

CPU Threads

# Match physical cores (not hyperthreads)
./main -m model.gguf -t 8

# For CPU-only inference, more threads can help (up to the physical core count)
./main -m model.gguf -t 16
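
On Linux, the physical core count (as opposed to hyperthreads) can be read from lscpu; nproc reports logical CPUs, which is typically double the physical count on hyperthreaded machines:

# Logical CPUs (includes hyperthreads)
nproc
# Physical cores = "Core(s) per socket" x "Socket(s)"
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)'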

Batch Size

# Larger batch for faster prompt processing (uses more memory)
./main -m model.gguf -b 512

# Smaller batch for low memory
./main -m model.gguf -b 128

Common Issues

Model Not Using GPU

Check that you're using a CUDA/Metal/ROCm build and specifying -ngl:

./main -m model.gguf -ngl 99
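
The load log is the quickest confirmation: with a GPU build and -ngl set, startup typically prints how many layers were offloaded (exact wording varies by version):

# Look for a line like "offloaded 33/33 layers to GPU" in the startup output
./main -m model.gguf -ngl 99 -p "test" -n 1 2>&1 | grep -i offloaded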

Out of Memory

Lower -ngl so fewer layers sit on the GPU, reduce the context size with -c, or switch to a smaller quantization such as Q4_K_M or Q3_K_M.

Slow Performance

Confirm the binary was built with GPU support and that -ngl is set; on CPU, match -t to the number of physical cores and consider a smaller quantization, since speed is usually limited by memory bandwidth.

llama.cpp vs Ollama

Aspect             llama.cpp                     Ollama
Ease of use        CLI flags, manual             Simple, automatic
Model management   Manual downloads              Built-in library
Control            Full access to all options    Simplified subset
Custom models      Any GGUF file                 Modelfile required
Use case           Power users, custom setups    Getting started, convenience