Software Stack
The layers between your hardware and your model
Running an LLM locally requires multiple software components working together: GPU drivers, compute frameworks, inference engines, and model formats. Understanding this stack helps you troubleshoot issues and choose the right tools.
The Stack
From the bottom up: hardware (GPU/CPU) → drivers and a compute runtime (CUDA, ROCm, or Metal) → an inference engine (llama.cpp, vLLM) → model weights in a given format (GGUF, safetensors) → your application, which talks to the engine through a CLI or API.
Key Components
llama.cpp
The most popular inference engine for local use. Written in C/C++, it is efficient on both CPU and GPU and uses the GGUF format, which ships in a range of quantization levels.
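A minimal sketch of driving llama.cpp directly through the llama-cpp-python bindings; the model path below is a placeholder for any local GGUF file:

```python
# Minimal llama.cpp inference via the llama-cpp-python bindings.
# The model path is a placeholder -- point it at any local GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload all layers to the GPU if a backend is available
    n_ctx=4096,       # context window size
)

out = llm("Q: What is GGUF? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```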
Ollama
User-friendly wrapper around llama.cpp. Simple CLI, manages model downloads, provides an API. Great for getting started.
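Ollama serves a REST API on localhost:11434 by default. A minimal sketch of querying it from Python, assuming the named model has already been pulled with `ollama pull`:

```python
# Query a locally running Ollama server over its REST API.
# Assumes a model (here "llama3") has already been pulled.
import json
import urllib.request

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({
        "model": "llama3",           # any model you have pulled
        "prompt": "Why is the sky blue?",
        "stream": False,             # return one JSON object instead of a stream
    }).encode(),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```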
vLLM
High-throughput inference server with PagedAttention. Optimized for serving many concurrent requests. Better for production/server use.
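A sketch of vLLM's offline batch interface, which shows the throughput orientation: you hand it a whole batch of prompts and it schedules them together. The model ID is a placeholder for any HuggingFace-format checkpoint:

```python
# Batched offline inference with vLLM. The model ID is a placeholder;
# vLLM loads HuggingFace-format (safetensors) checkpoints.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain PagedAttention in one sentence.",
    "What is a KV cache?",
]
# All prompts are scheduled together -- this batching is where the throughput comes from.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```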
Model Formats
GGUF (used by llama.cpp and Ollama), safetensors (used by transformers and vLLM), plus quantization schemes such as GPTQ and AWQ, which are typically distributed as safetensors. Which format you need depends on your inference engine and hardware.
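The formats are easy to tell apart on disk. As an illustrative sketch (the magic values are the published ones: GGUF files open with the 4-byte ASCII magic `GGUF`, safetensors files with an 8-byte little-endian length of a JSON header):

```python
# Guess a model file's format from its leading bytes.
import json
import struct

def sniff_format(path: str) -> str:
    with open(path, "rb") as f:
        head = f.read(8)
    if head[:4] == b"GGUF":          # GGUF's published 4-byte magic
        return "gguf"
    if len(head) == 8:
        (header_len,) = struct.unpack("<Q", head)  # safetensors: u64 JSON header length
        if header_len < 100_000_000:               # sanity cap before reading
            with open(path, "rb") as f:
                f.seek(8)
                try:
                    json.loads(f.read(header_len))  # valid JSON header => safetensors
                    return "safetensors"
                except (ValueError, UnicodeDecodeError):
                    pass
    return "unknown"
```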
GPU Runtimes
CUDA (NVIDIA)
- Most mature, best supported
- Required for NVIDIA GPUs
- Nearly all LLM software optimized for CUDA
- cuBLAS, cuDNN, TensorRT available
ROCm (AMD)
- AMD's CUDA alternative
- Support improving but less mature
- llama.cpp works well
- PyTorch support good, some edge cases
Metal (Apple)
- Apple's GPU framework
- Unified memory: CPU and GPU share one pool of RAM, so model size is bounded by total system memory
- llama.cpp Metal backend efficient
- MLX is Apple's native ML framework
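Whichever runtime you have, PyTorch exposes it behind a common device API. A small probe sketch; note that ROCm builds of PyTorch reuse the CUDA API and set `torch.version.hip`, which is how the two are told apart:

```python
# Probe which GPU runtime the local PyTorch build can see.
import torch

if torch.cuda.is_available():
    # ROCm builds of PyTorch reuse the CUDA API; torch.version.hip is
    # set only on those builds, so it distinguishes the two runtimes.
    runtime = "ROCm" if torch.version.hip else "CUDA"
    print(f"{runtime}: {torch.cuda.get_device_name(0)}")
elif torch.backends.mps.is_available():
    print("Metal (MPS) backend available")  # Apple Silicon, unified memory
else:
    print("CPU only")
```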
Choosing Software
| Use Case | Recommended | Why |
|---|---|---|
| Just getting started | Ollama | Simplest setup, manages everything |
| Mac user | Ollama or MLX | Both work well with Metal |
| Maximum flexibility | llama.cpp directly | Most control over settings |
| Serving multiple users | vLLM or TGI | Optimized for throughput and batching |
| HuggingFace models | transformers + accelerate | Native format support (see the sketch below the table) |
| Production NVIDIA | TensorRT-LLM | Maximum performance on NVIDIA hardware |
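For the transformers + accelerate row, a minimal sketch; the model ID is a placeholder, and `device_map="auto"` is the piece that pulls in accelerate to spread weights across the available devices:

```python
# Load a HuggingFace-format model with transformers; device_map="auto"
# (provided by accelerate) places layers across GPU(s) and CPU automatically.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model ID
    device_map="auto",
    torch_dtype="auto",
)
print(pipe("The key layers of a local LLM stack are", max_new_tokens=40)[0]["generated_text"])
```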
Start with Ollama
Unless you have specific requirements, start with Ollama. It handles model downloads, provides both CLI and API access, and runs on macOS, Linux, and Windows. You can always move to more specialized tools later.
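One practical note on moving up later: Ollama also exposes an OpenAI-compatible endpoint, as do vLLM and llama.cpp's server, so client code can stay the same while you swap engines underneath. A sketch using the openai client; the base URL is Ollama's default, and the api_key is a dummy value that Ollama ignores:

```python
# Talk to Ollama through its OpenAI-compatible endpoint. Swapping in
# vLLM's server later only requires changing base_url and the model name.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's default address
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3",  # any locally pulled model
    messages=[{"role": "user", "content": "Name the layers of a local LLM stack."}],
)
print(resp.choices[0].message.content)
```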