Ollama
The easiest way to run LLMs locally
Ollama wraps llama.cpp with a user-friendly interface. It handles model downloads, provides a simple CLI and API, and works on macOS, Linux, and Windows. If you're just getting started with local LLMs, start here.
Quick Start
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Run a model (downloads automatically)
ollama run llama3
# Chat!
>>> Tell me a joke about programming
That's it. Ollama downloads the model, loads it, and starts an interactive chat.
Key Commands
| Command | Purpose |
|---|---|
| ollama run <model> | Run a model interactively (downloads it first if needed) |
| ollama pull <model> | Download a model without running it |
| ollama list | Show downloaded models |
| ollama ps | Show currently loaded models |
| ollama rm <model> | Delete a downloaded model |
| ollama serve | Start the API server (usually runs automatically) |
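The same management operations are also available over the API. Here is a minimal sketch using the ollama Python package (covered below); response field names can vary slightly between client versions:
import ollama
# Download a model without running it (equivalent to `ollama pull llama3`)
ollama.pull('llama3')
# List locally downloaded models (equivalent to `ollama list`)
print(ollama.list())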
Popular Models
| Model | Size | Use Case | Command |
|---|---|---|---|
| Llama 3 8B | 4.7GB | General purpose, great balance | ollama run llama3 |
| Llama 3 70B | 40GB | Much more capable, but needs substantial RAM/VRAM | ollama run llama3:70b |
| Mistral 7B | 4.1GB | Fast and capable | ollama run mistral |
| CodeLlama 7B | 3.8GB | Code generation | ollama run codellama |
| Phi-3 | 2.2GB | Small but capable | ollama run phi3 |
Browse all models: ollama.com/library
The API
Ollama runs an API server on port 11434. It exposes its own /api endpoints plus OpenAI-compatible /v1 endpoints (note that /api/generate streams newline-delimited JSON by default; add "stream": false for a single response):
# Generate completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?"
}'
# Chat format (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "llama3",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Using with Python
# pip install ollama
import ollama
response = ollama.chat(model='llama3', messages=[
{'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response['message']['content'])
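Because the /v1 routes follow the OpenAI schema, the official openai Python client can also be pointed at the local server. A minimal sketch (the api_key is a placeholder; Ollama does not check it):
from openai import OpenAI
# Point the OpenAI client at the local Ollama server
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
response = client.chat.completions.create(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.choices[0].message.content)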
Configuration
GPU Layers
Ollama uses the GPU automatically when it detects one. GPU offload is controlled by the num_gpu parameter (the number of layers placed on the GPU), which can be set in a Modelfile or per request via the API's options field:
# In a Modelfile: offload as many layers as possible to the GPU
PARAMETER num_gpu 99
# Or run entirely on the CPU
PARAMETER num_gpu 0
Context Length
# Set context length inside an interactive session
>>> /set parameter num_ctx 8192
# Or bake it into a Modelfile
PARAMETER num_ctx 8192
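The same parameter (along with other generation options) can also be set per request through the API's options field. A sketch with the Python client; the values are illustrative:
import ollama
# Ask for a larger context window on this request only
response = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Summarize the following document...'}],
    options={'num_ctx': 8192},
)
print(response['message']['content'])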
Environment Variables
| Variable | Purpose | Default |
|---|---|---|
| OLLAMA_HOST | API listen address | 127.0.0.1:11434 |
| OLLAMA_MODELS | Model storage location | ~/.ollama/models |
| OLLAMA_KEEP_ALIVE | How long a model stays loaded after a request | 5m |
| OLLAMA_NUM_PARALLEL | Concurrent requests per loaded model | 1 |
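If you change OLLAMA_HOST (for example, to expose the server on your LAN), point clients at that address. With the Python package this is done through ollama.Client; the host below is an example:
from ollama import Client
# Connect to an Ollama server that is not on the default localhost address
client = Client(host='http://192.168.1.50:11434')
response = client.chat(model='llama3', messages=[
    {'role': 'user', 'content': 'Hello from another machine!'}
])
print(response['message']['content'])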
Custom Models (Modelfiles)
Create custom model configurations:
# Modelfile
FROM llama3
# Set system prompt
SYSTEM You are a helpful coding assistant.
# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
# Create the model
ollama create mymodel -f Modelfile
# Run it
ollama run mymodel
Import Custom GGUF Models
# Modelfile for importing
FROM ./my-model.gguf
# Create from GGUF
ollama create mymodel -f Modelfile
Ollama vs Alternatives
Ollama Strengths
- Easiest setup
- Model management built-in
- Cross-platform
- OpenAI-compatible API
- Active development
Ollama Limitations
- Less control than raw llama.cpp
- Single-user focused
- No batching/high-throughput
- Model library is curated (limited)
Troubleshooting
Model Won't Use GPU
# Check whether the model is actually on the GPU
ollama ps  # the PROCESSOR column should read something like "100% GPU"
# If it shows CPU, check the server logs for GPU detection errors
# and make sure num_gpu is not set to 0 in your Modelfile or request options
Out of Memory
- Try a smaller model or a more heavily quantized variant
- Reduce the context length (e.g. num_ctx 2048)
- Free up VRAM from other applications
Slow Performance
- Check whether the model is running on the GPU (ollama ps)
- See Why Is My Setup Slow?
When to Graduate from Ollama
Ollama is great for getting started and personal use. Consider alternatives when you need: high-throughput serving (→ vLLM), fine-grained control (→ llama.cpp directly), or custom model architectures (→ transformers).