Ollama wraps llama.cpp with a user-friendly interface. It handles model downloads, provides a simple CLI and API, and works on macOS, Linux, and Windows. If you're just getting started with local LLMs, start here.

Quick Start

# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (downloads automatically)
ollama run llama3

# Chat!
>>> Tell me a joke about programming

That's it. Ollama downloads the model, loads it, and starts an interactive chat.

Key Commands

Command              Purpose
ollama run <model>   Run a model interactively (downloads it if needed)
ollama pull <model>  Download a model without running it
ollama list          Show downloaded models
ollama ps            Show currently loaded models
ollama rm <model>    Delete a downloaded model
ollama serve         Start the API server (usually runs automatically)
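
The same operations are available programmatically through the official ollama Python package (pip install ollama); a minimal sketch, assuming a recent version of the library:

import ollama

# Equivalent of `ollama pull`: download a model without running it
ollama.pull('llama3')

# Equivalent of `ollama list`: enumerate downloaded models
for model in ollama.list()['models']:
    print(model)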

Popular Models

Model         Size   Use Case                            Command
Llama 3 8B    4.7GB  General purpose, good balance       ollama run llama3
Llama 3 70B   40GB   Most capable in the Llama 3 family  ollama run llama3:70b
Mistral 7B    4.1GB  Fast and capable                    ollama run mistral
CodeLlama 7B  3.8GB  Code generation                     ollama run codellama
Phi-3         2.2GB  Small but capable                   ollama run phi3

Browse all models: ollama.com/library

The API

Ollama runs an HTTP API server on port 11434. It exposes its own endpoints under /api, plus an OpenAI-compatible endpoint under /v1:

# Generate completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?"
}'

# Chat format (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Hello!"}]
}'
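
The native endpoints stream newline-delimited JSON by default (pass "stream": false for a single response). A minimal sketch of consuming the stream from Python with requests:

import json
import requests

# /api/generate emits one JSON object per line until "done": true
with requests.post(
    'http://localhost:11434/api/generate',
    json={'model': 'llama3', 'prompt': 'Why is the sky blue?'},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk['response'], end='', flush=True)
        if chunk.get('done'):
            break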

Using with Python

# pip install ollama
import ollama

response = ollama.chat(model='llama3', messages=[
  {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response['message']['content'])
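
The client supports streaming as well; a minimal sketch:

import ollama

# stream=True yields response chunks as they are generated
stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)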

Configuration

GPU Layers

Ollama uses the GPU automatically when one is detected. The number of layers offloaded to the GPU is controlled by the num_gpu option, set per request in the API's options object or with PARAMETER num_gpu in a Modelfile:

# Offload as many layers as possible (99 is effectively "all")
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "options": {"num_gpu": 99}
}'

# Run entirely on the CPU with "num_gpu": 0
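
The same option works from the Python client; a minimal sketch (the num_gpu value is illustrative):

import ollama

# num_gpu=0 keeps every layer on the CPU for this request
response = ollama.generate(
    model='llama3',
    prompt='Why is the sky blue?',
    options={'num_gpu': 0},
)
print(response['response'])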

Context Length

# Set the context length inside an interactive session
ollama run llama3
>>> /set parameter num_ctx 8192
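
When calling the API, request a larger context window per call with the num_ctx option instead; a minimal sketch:

import ollama

# Ask for an 8K-token context window for this chat call
response = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Summarize our discussion so far.'}],
    options={'num_ctx': 8192},
)
print(response['message']['content'])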

Environment Variables

Variable             Purpose                         Default
OLLAMA_HOST          API listen address              127.0.0.1:11434
OLLAMA_MODELS        Model storage location          ~/.ollama/models
OLLAMA_KEEP_ALIVE    How long models stay in memory  5m
OLLAMA_NUM_PARALLEL  Concurrent requests per model   1
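
When OLLAMA_HOST points somewhere other than localhost, the Python client can target that server explicitly; a minimal sketch (the address is illustrative):

from ollama import Client

# Talk to an Ollama server running on another machine
client = Client(host='http://192.168.1.50:11434')
response = client.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response['message']['content'])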

Custom Models (Modelfiles)

Create custom model configurations:

# Modelfile
FROM llama3

# Set system prompt
SYSTEM You are a helpful coding assistant.

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Create the model
ollama create mymodel -f Modelfile

# Run it
ollama run mymodel
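
Modelfile PARAMETER values become the model's defaults, but individual requests can still override them with options; a minimal sketch:

import ollama

# Per-request options take precedence over the Modelfile defaults
response = ollama.chat(
    model='mymodel',
    messages=[{'role': 'user', 'content': 'Explain list comprehensions.'}],
    options={'temperature': 0.2},  # overrides the Modelfile's 0.7
)
print(response['message']['content'])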

Import Custom GGUF Models

# Modelfile for importing
FROM ./my-model.gguf

# Create from GGUF
ollama create mymodel -f Modelfile

Ollama vs Alternatives

Ollama Strengths

  • Easiest setup
  • Model management built-in
  • Cross-platform
  • OpenAI-compatible API
  • Active development

Ollama Limitations

  • Less control than raw llama.cpp
  • Single-user focused
  • No batching/high-throughput
  • Model library is curated (limited)

Troubleshooting

Model Won't Use GPU

# Check if the GPU is detected
ollama ps  # the "PROCESSOR" column shows the GPU/CPU split

# Force all layers onto the GPU for a request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "test",
  "options": {"num_gpu": 99}
}'
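
To check GPU placement programmatically, the /api/ps endpoint reports each loaded model's memory footprint; a minimal sketch using requests:

import requests

# size_vram > 0 means at least part of the model is on the GPU
running = requests.get('http://localhost:11434/api/ps').json()
for model in running.get('models', []):
    print(model['name'], '- VRAM bytes:', model.get('size_vram', 0))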

Out of Memory

If a model fails to load or the machine starts swapping, switch to a smaller model (e.g. llama3 instead of llama3:70b) or a more heavily quantized variant, and run ollama ps to make sure no other large model is still loaded.

Slow Performance

Slow generation usually means the model is running partly or fully on the CPU. Check ollama ps, free up VRAM by closing other GPU-heavy applications, or pick a smaller model that fits entirely on the GPU.

When to Graduate from Ollama

Ollama is great for getting started and personal use. Consider alternatives when you need: high-throughput serving (→ vLLM), fine-grained control (→ llama.cpp directly), or custom model architectures (→ transformers).