Running an LLM locally requires multiple software components working together: GPU drivers, compute frameworks, inference engines, and model formats. Understanding this stack helps you troubleshoot issues and choose the right tools.

The Stack

┌─────────────────────────────────────────────────┐
│                Your Application                 │
│          (chat UI, API client, script)          │
├─────────────────────────────────────────────────┤
│             Inference Server / API              │
│      (Ollama, vLLM, text-generation-webui)      │
├─────────────────────────────────────────────────┤
│                Inference Engine                 │
│     (llama.cpp, transformers, TensorRT-LLM)     │
├─────────────────────────────────────────────────┤
│                Compute Framework                │
│             (PyTorch, GGML, Metal)              │
├─────────────────────────────────────────────────┤
│                   GPU Runtime                   │
│               (CUDA, ROCm, Metal)               │
├─────────────────────────────────────────────────┤
│                   GPU Driver                    │
├─────────────────────────────────────────────────┤
│                    Hardware                     │
│      (NVIDIA GPU, AMD GPU, Apple Silicon)       │
└─────────────────────────────────────────────────┘

Key Components

llama.cpp

The most popular inference engine for local use. Written in C/C++, it runs efficiently on both CPU and GPU and uses the GGUF model format, which supports a range of quantization levels.
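
For a feel of what driving llama.cpp from code looks like, here is a minimal sketch using the llama-cpp-python bindings; the model path is a placeholder for any local GGUF file.

  from llama_cpp import Llama

  # Load a local GGUF model; n_gpu_layers=-1 offloads every layer to the GPU
  # when a GPU backend (CUDA, ROCm, Metal) was compiled in.
  llm = Llama(
      model_path="./models/model.gguf",  # placeholder path
      n_gpu_layers=-1,
      n_ctx=4096,                        # context window size
  )

  out = llm("Q: What does quantization do? A:", max_tokens=128, stop=["Q:"])
  print(out["choices"][0]["text"])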

Ollama

A user-friendly wrapper around llama.cpp. It provides a simple CLI, manages model downloads, and exposes a local HTTP API. Great for getting started.
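
Once Ollama is running, any HTTP client can talk to its local API on port 11434. A minimal sketch, assuming a model has already been pulled (the model name is a placeholder):

  import requests

  # One-shot completion against Ollama's /api/generate endpoint.
  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
  )
  print(resp.json()["response"])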

vLLM

A high-throughput inference server built around PagedAttention. It is optimized for serving many concurrent requests, which makes it the better fit for production and server deployments.
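
A minimal sketch of vLLM's offline batch interface (the model name is a placeholder; the same engine can also be launched as an OpenAI-compatible server):

  from vllm import LLM, SamplingParams

  llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
  params = SamplingParams(temperature=0.7, max_tokens=128)

  # Batched prompts are where PagedAttention and continuous batching pay off.
  outputs = llm.generate(["Explain KV caching.", "What is speculative decoding?"], params)
  for out in outputs:
      print(out.outputs[0].text)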

Model Formats

The common formats are GGUF (used by llama.cpp and Ollama), safetensors (used by transformers and vLLM), and GPTQ/AWQ (GPU-oriented quantization schemes, usually distributed as safetensors). Which format to use depends on your inference engine and hardware.
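
If you are unsure what a downloaded file actually is, the on-disk signatures give it away: GGUF files begin with the ASCII magic "GGUF", while safetensors files begin with an 8-byte little-endian header length followed by a JSON header. A rough sketch, with a placeholder path:

  import struct

  def sniff_format(path: str) -> str:
      with open(path, "rb") as f:
          head = f.read(16)
      if head[:4] == b"GGUF":
          return "gguf"            # llama.cpp / Ollama territory
      header_len = struct.unpack("<Q", head[:8])[0]
      if head[8:9] == b"{" and header_len > 0:
          return "safetensors"     # transformers / vLLM territory
      return "unknown"

  print(sniff_format("./models/model.gguf"))  # placeholder path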

GPU Runtimes

CUDA (NVIDIA)

  • Most mature, best supported
  • Required for NVIDIA GPUs
  • Nearly all LLM software optimized for CUDA
  • cuBLAS, cuDNN, TensorRT available

ROCm (AMD)

  • AMD's CUDA alternative
  • Support improving but less mature
  • llama.cpp works well
  • PyTorch support is good, with occasional edge cases

Metal (Apple)

  • Apple's GPU framework
  • Unified memory architecture
  • llama.cpp's Metal backend is efficient
  • MLX is Apple's native ML framework
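
A quick way to see which runtime your PyTorch build can reach is sketched below; note that ROCm builds of PyTorch answer through the same torch.cuda API, with torch.version.hip set instead of torch.version.cuda, so treat this as a rough check rather than an authoritative probe.

  import torch

  if torch.cuda.is_available():
      backend = "ROCm" if torch.version.hip else "CUDA"
      print(f"{backend} device: {torch.cuda.get_device_name(0)}")
  elif torch.backends.mps.is_available():
      print("Metal (MPS) backend available")
  else:
      print("CPU only")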

Choosing Software

Use Case                Recommended                 Why
Just getting started    Ollama                      Simplest setup, manages everything
Mac user                Ollama or MLX               Both work well with Metal
Maximum flexibility     llama.cpp directly          Most control over settings
Serving multiple users  vLLM or TGI                 Optimized for throughput and batching
HuggingFace models      transformers + accelerate   Native format support
Production on NVIDIA    TensorRT-LLM                Maximum performance on NVIDIA hardware

Start with Ollama

Unless you have specific requirements, start with Ollama. It handles model downloads, provides both CLI and API access, and works on all platforms. You can always move to more specialized tools later.
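
As a concrete starting point, here is a minimal multi-turn chat loop against Ollama's /api/chat endpoint; the model name is a placeholder for whatever you have pulled.

  import requests

  history = []
  while True:
      user = input("you> ")
      history.append({"role": "user", "content": user})
      resp = requests.post(
          "http://localhost:11434/api/chat",
          json={"model": "llama3", "messages": history, "stream": False},
      )
      reply = resp.json()["message"]["content"]
      history.append({"role": "assistant", "content": reply})
      print(reply)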