Ollama wraps llama.cpp with a user-friendly interface. It handles model downloads, provides a simple CLI and API, and works on macOS, Linux, and Windows. If you're just getting started with local LLMs, start here.

Quick Start

# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically)
ollama run llama3

# Chat!
>>> Tell me a joke about programming

That's it. Ollama downloads the model, loads it, and starts an interactive chat.

Key Commands

Command              Purpose
ollama run <model>   Run a model interactively (downloads it if needed)
ollama pull <model>  Download a model without running it
ollama list          Show downloaded models
ollama ps            Show currently loaded models
ollama rm <model>    Delete a downloaded model
ollama serve         Start the API server (usually runs automatically)
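
The same operations are available programmatically through the official ollama Python package (pip install ollama); a minimal sketch, assuming a recent version of the library:

import ollama

# Equivalent of `ollama pull`: download a model without running it
ollama.pull('llama3')

# Equivalent of `ollama list`: enumerate downloaded models
for model in ollama.list()['models']:
    print(model)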

Popular Models

Model         Size   Use Case                            Command
Llama 3 8B    4.7GB  General purpose, good balance       ollama run llama3
Llama 3 70B   40GB   Most capable in the Llama 3 family  ollama run llama3:70b
Mistral 7B    4.1GB  Fast and capable                    ollama run mistral
CodeLlama 7B  3.8GB  Code generation                     ollama run codellama
Phi-3         2.2GB  Small but capable                   ollama run phi3

Browse all models: ollama.com/library

The API

Ollama runs an HTTP API server on port 11434. It exposes its own endpoints under /api, plus an OpenAI-compatible endpoint under /v1:

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'

# Chat format (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
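
The native endpoints stream newline-delimited JSON by default (pass "stream": false for a single response). A minimal sketch of consuming the stream from Python with requests:

import json
import requests

# /api/generate emits one JSON object per line until "done": true
with requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': 'Why is the sky blue?'},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk['response'], end='', flush=True)
        if chunk.get('done'):
            break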

Using with Python

# pip install ollama
import ollama

response = ollama.chat(model='llama3', messages=[
  {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response['message']['content'])
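
The client supports streaming as well; a minimal sketch:

import ollama

# stream=True yields response chunks as they are generated
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)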

Configuration

GPU Layers

Ollama uses the GPU automatically when one is detected. The number of layers offloaded to the GPU is controlled by the num_gpu option, set per request in the API's options object or with PARAMETER num_gpu in a Modelfile:

# Offload as many layers as possible (99 is effectively "all")
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": {"num_gpu": 99}
}'

# Run entirely on the CPU with "num_gpu": 0
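
The same option works from the Python client; a minimal sketch (the num_gpu value is illustrative):

import ollama

# num_gpu=0 keeps every layer on the CPU for this request
response = ollama.generate(
    model='llama3',
    prompt='Why is the sky blue?',
    options={'num_gpu': 0},
)
print(response['response'])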

Context Length

# Set the context length inside an interactive session
ollama run llama3
>>> /set parameter num_ctx 8192
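
When calling the API, request a larger context window per call with the num_ctx option instead; a minimal sketch:

import ollama

# Ask for an 8K-token context window for this chat call
response = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Summarize our discussion so far.'}],
    options={'num_ctx': 8192},
)
print(response['message']['content'])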

Environment Variables

Variable             Purpose                         Default
OLLAMA_HOST          API listen address              127.0.0.1:11434
OLLAMA_MODELS        Model storage location          ~/.ollama/models
OLLAMA_KEEP_ALIVE    How long models stay in memory  5m
OLLAMA_NUM_PARALLEL  Concurrent requests per model   1
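
When OLLAMA_HOST points somewhere other than localhost, the Python client can target that server explicitly; a minimal sketch (the address is illustrative):

from ollama import Client

# Talk to an Ollama server running on another machine
client = Client(host='http://192.168.1.50:11434')
response = client.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response['message']['content'])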

Custom Models (Modelfiles)

Create custom model configurations:

# Modelfile
FROM llama3

# Set system prompt
SYSTEM You are a helpful coding assistant.

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Create the model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel
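
Modelfile PARAMETER values become the model's defaults, but individual requests can still override them with options; a minimal sketch:

import ollama

# Per-request options take precedence over the Modelfile defaults
response = ollama.chat(
    model='mymodel',
    messages=[{'role': 'user', 'content': 'Explain list comprehensions.'}],
    options={'temperature': 0.2},  # overrides the Modelfile's 0.7
)
print(response['message']['content'])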

Import Custom GGUF Models

# Modelfile for importing
FROM ./my-model.gguf

# Create from GGUF
ollama create mymodel -f Modelfile

Ollama vs Alternatives

Ollama Strengths

  • Easiest setup
  • Model management built-in
  • Cross-platform
  • OpenAI-compatible API
  • Active development

Ollama Limitations

  • Less control than raw llama.cpp
  • Single-user focused
  • No batching/high-throughput
  • Model library is curated (limited)

Troubleshooting

Model Won't Use GPU

# Check if the GPU is detected
ollama ps  # the "PROCESSOR" column shows the GPU/CPU split

# Force all layers onto the GPU for a request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "test",
  "options": {"num_gpu": 99}
}'
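
To check GPU placement programmatically, the /api/ps endpoint reports each loaded model's memory footprint; a minimal sketch using requests:

import requests

# size_vram > 0 means at least part of the model is on the GPU
running = requests.get('http://localhost:11434/api/ps').json()
for model in running.get('models', []):
    print(model['name'], '- VRAM bytes:', model.get('size_vram', 0))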

Out of Memory

If a model fails to load or the machine starts swapping, switch to a smaller model (e.g. llama3 instead of llama3:70b) or a more heavily quantized variant, and run ollama ps to make sure no other large model is still loaded.

Slow Performance

Slow generation usually means the model is running partly or fully on the CPU. Check ollama ps, free up VRAM by closing other GPU-heavy applications, or pick a smaller model that fits entirely on the GPU.

When to Graduate from Ollama

Ollama is great for getting started and personal use. Consider alternatives when you need: high-throughput serving (→ vLLM), fine-grained control (→ llama.cpp directly), or custom model architectures (→ transformers).