Running an LLM locally requires multiple software components working together: GPU drivers, compute frameworks, inference engines, and model formats. Understanding this stack helps you troubleshoot issues and choose the right tools.

The Stack

┌─────────────────────────────────────────────────┐
│                Your Application                 │
│          (chat UI, API client, script)          │
├─────────────────────────────────────────────────┤
│             Inference Server / API              │
│      (Ollama, vLLM, text-generation-webui)      │
├─────────────────────────────────────────────────┤
│                Inference Engine                 │
│     (llama.cpp, transformers, TensorRT-LLM)     │
├─────────────────────────────────────────────────┤
│                Compute Framework                │
│             (PyTorch, GGML, Metal)              │
├─────────────────────────────────────────────────┤
│                   GPU Runtime                   │
│               (CUDA, ROCm, Metal)               │
├─────────────────────────────────────────────────┤
│                   GPU Driver                    │
├─────────────────────────────────────────────────┤
│                    Hardware                     │
│      (NVIDIA GPU, AMD GPU, Apple Silicon)       │
└─────────────────────────────────────────────────┘

Key Components

llama.cpp

The most popular inference engine for local use. Written in C/C++, it runs efficiently on both CPU and GPU and uses the GGUF model format, which supports a range of quantization levels.
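
For a feel of what driving llama.cpp from code looks like, here is a minimal sketch using the llama-cpp-python bindings; the model path is a placeholder for any local GGUF file.

  from llama_cpp import Llama

  # Load a local GGUF model; n_gpu_layers=-1 offloads every layer to the GPU
  # when a GPU backend (CUDA, ROCm, Metal) was compiled in.
  llm = Llama(
      model_path="./models/model.gguf",  # placeholder path
      n_gpu_layers=-1,
      n_ctx=4096,                        # context window size
  )

  out = llm("Q: What does quantization do? A:", max_tokens=128, stop=["Q:"])
  print(out["choices"][0]["text"])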

Ollama

A user-friendly wrapper around llama.cpp. It provides a simple CLI, manages model downloads, and exposes a local HTTP API. Great for getting started.
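
Once Ollama is running, any HTTP client can talk to its local API on port 11434. A minimal sketch, assuming a model has already been pulled (the model name is a placeholder):

  import requests

  # One-shot completion against Ollama's /api/generate endpoint.
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
  )
  print(resp.json()["response"])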

vLLM

A high-throughput inference server built around PagedAttention. It is optimized for serving many concurrent requests, which makes it the better fit for production and server deployments.
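
A minimal sketch of vLLM's offline batch interface (the model name is a placeholder; the same engine can also be launched as an OpenAI-compatible server):

  from vllm import LLM, SamplingParams

  llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
  params = SamplingParams(temperature=0.7, max_tokens=128)

  # Batched prompts are where PagedAttention and continuous batching pay off.
  outputs = llm.generate(["Explain KV caching.", "What is speculative decoding?"], params)
  for out in outputs:
      print(out.outputs[0].text)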

Model Formats

The common formats are GGUF (used by llama.cpp and Ollama), safetensors (used by transformers and vLLM), and GPTQ/AWQ (GPU-oriented quantization schemes, usually distributed as safetensors). Which format to use depends on your inference engine and hardware.
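
If you are unsure what a downloaded file actually is, the on-disk signatures give it away: GGUF files begin with the ASCII magic "GGUF", while safetensors files begin with an 8-byte little-endian header length followed by a JSON header. A rough sketch, with a placeholder path:

  import struct

  def sniff_format(path: str) -> str:
      with open(path, "rb") as f:
          head = f.read(16)
      if head[:4] == b"GGUF":
          return "gguf"            # llama.cpp / Ollama territory
      header_len = struct.unpack("<Q", head[:8])[0]
      if head[8:9] == b"{" and header_len > 0:
          return "safetensors"     # transformers / vLLM territory
      return "unknown"

  print(sniff_format("./models/model.gguf"))  # placeholder path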

GPU Runtimes

CUDA (NVIDIA)

  • Most mature, best supported
  • Required for NVIDIA GPUs
  • Nearly all LLM software optimized for CUDA
  • cuBLAS, cuDNN, TensorRT available

ROCm (AMD)

  • AMD's CUDA alternative
  • Support improving but less mature
  • llama.cpp works well
  • PyTorch support is good, with occasional edge cases

Metal (Apple)

  • Apple's GPU framework
  • Unified memory architecture
  • llama.cpp's Metal backend is efficient
  • MLX is Apple's native ML framework
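
A quick way to see which runtime your PyTorch build can reach is sketched below; note that ROCm builds of PyTorch answer through the same torch.cuda API, with torch.version.hip set instead of torch.version.cuda, so treat this as a rough check rather than an authoritative probe.

  import torch

  if torch.cuda.is_available():
      backend = "ROCm" if torch.version.hip else "CUDA"
      print(f"{backend} device: {torch.cuda.get_device_name(0)}")
  elif torch.backends.mps.is_available():
      print("Metal (MPS) backend available")
  else:
      print("CPU only")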

Choosing Software

Use Case                Recommended                 Why
Just getting started    Ollama                      Simplest setup, manages everything
Mac user                Ollama or MLX               Both work well with Metal
Maximum flexibility     llama.cpp directly          Most control over settings
Serving multiple users  vLLM or TGI                 Optimized for throughput and batching
HuggingFace models      transformers + accelerate   Native format support
Production on NVIDIA    TensorRT-LLM                Maximum performance on NVIDIA hardware

Start with Ollama

Unless you have specific requirements, start with Ollama. It handles model downloads, provides both CLI and API access, and works on all platforms. You can always move to more specialized tools later.
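
As a concrete starting point, here is a minimal multi-turn chat loop against Ollama's /api/chat endpoint; the model name is a placeholder for whatever you have pulled.

  import requests

  history = []
  while True:
      user = input("you> ")
      history.append({"role": "user", "content": user})
      resp = requests.post(
          "http://localhost:11434/api/chat",
          json={"model": "llama3", "messages": history, "stream": False},
      )
      reply = resp.json()["message"]["content"]
      history.append({"role": "assistant", "content": reply})
      print(reply)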