This knowledge base covers the practical realities of running LLMs locally — the hardware constraints, software stacks, performance tradeoffs, and operational concerns that determine whether your setup actually works.

The goal is to help you understand not just what the components are, but how they interact and what tradeoffs matter for your specific use case.

📐

Fundamentals

Core concepts: parameters, quantization, context windows, KV cache, and the prefill/decode distinction that shapes everything.
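The relationship between parameter count and quantization is mostly arithmetic. A minimal sketch (the bit-widths are illustrative averages; real Q4 formats land near 4.5 bits/param once scales and metadata are counted):

```python
# Rough model-weight memory estimate: parameters x bytes per parameter.
# Ignores KV cache, activations, and runtime overhead.

def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB for a given quantization width."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4.5)]:
    print(f"7B @ {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

A 7B model drops from ~14 GB at FP16 to roughly 4 GB at 4-bit, which is why quantization is usually the first lever people reach for.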

🔧

Hardware

GPUs, VRAM, memory bandwidth, multi-GPU scaling, and why your choice of hardware determines what's possible.
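A quick way to see how VRAM gates model choice is to check which 4-bit-quantized sizes fit with headroom. The GPU VRAM figures are common retail configurations and the ~4.5 bits/param estimate and 2 GB overhead are assumptions, not measurements:

```python
# Which quantized models plausibly fit in a given VRAM budget?

GPUS = {"RTX 3060": 12, "RTX 4090": 24, "2x RTX 3090": 48}
MODELS_B = [7, 13, 34, 70]  # parameter counts in billions

def fits(vram_gb: float, params_b: float, overhead_gb: float = 2.0) -> bool:
    """True if a ~4-bit quant of the model fits with some headroom."""
    q4_gb = params_b * 4.5 / 8  # ~4.5 bits/param at 4-bit quantization
    return q4_gb + overhead_gb <= vram_gb

for gpu, vram in GPUS.items():
    ok = [f"{b}B" for b in MODELS_B if fits(vram, b)]
    print(f"{gpu} ({vram} GB): {', '.join(ok) or 'none'}")
```

The overhead term matters: a model that "fits" with zero headroom leaves nothing for the KV cache.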

💻

Software Stack

CUDA, llama.cpp, Ollama, model formats, and the layers between your hardware and your model.

⚡

Inference

Loading models, fitting into memory, GPU sharding, batching, and the mechanics of actually running inference.
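Layer-level sharding is the simplest way to spread a model across GPUs: assign whole transformer layers in proportion to each card's VRAM. A sketch of that split (similar in spirit to llama.cpp's `--tensor-split` option; the layer and VRAM numbers are assumptions):

```python
# Split a model's transformer layers across GPUs proportionally to VRAM.

def split_layers(n_layers: int, vram_gb: list[float]) -> list[int]:
    """Assign whole layers to each GPU in proportion to its VRAM."""
    total = sum(vram_gb)
    shares = [int(n_layers * v / total) for v in vram_gb]
    # Hand out layers lost to integer truncation, largest GPU first.
    leftover = n_layers - sum(shares)
    for i in sorted(range(len(vram_gb)), key=lambda i: -vram_gb[i]):
        if leftover == 0:
            break
        shares[i] += 1
        leftover -= 1
    return shares

print(split_layers(80, [24.0, 24.0]))  # -> [40, 40]
print(split_layers(80, [24.0, 12.0]))  # -> [54, 26]
```

Each layer's activations must cross the interconnect at the boundary between GPUs, which is where scaling losses come from.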

🏗️

Operations

Power delivery, thermal management, driver stability, and the unglamorous realities of keeping systems running.

📊

Performance

Tokens/sec, latency, throughput, cost analysis, and how to evaluate whether a setup makes sense.
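Cost analysis for a local rig often reduces to electricity per generated token. A back-of-envelope sketch; the throughput, wattage, and price are placeholder assumptions to swap for your own numbers:

```python
# Electricity cost per million generated tokens.

def cost_per_million_tokens(tok_per_sec: float, watts: float,
                            usd_per_kwh: float) -> float:
    seconds = 1e6 / tok_per_sec          # time to generate 1M tokens
    kwh = watts * seconds / 3600 / 1000  # energy consumed in that time
    return kwh * usd_per_kwh

# e.g. 30 tok/s on a ~350 W system at $0.15/kWh
print(f"${cost_per_million_tokens(30, 350, 0.15):.2f} per 1M tokens")  # $0.49
```

Comparing that figure against hosted API pricing (and amortized hardware cost) is the honest way to decide whether a setup makes sense.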

🧭

Decision Guides

Practical answers: which hardware for which models, when multi-GPU helps, Apple vs NVIDIA, buy vs build.

Key Tradeoffs

These tensions appear throughout local LLM systems. Understanding them helps you navigate decisions across all topics:

Capacity vs Bandwidth

A model fitting in memory doesn't mean it runs fast. Memory bandwidth often determines actual performance more than capacity.
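The bandwidth ceiling is easy to estimate: single-stream decode reads (roughly) every weight once per generated token, so tokens/sec is bounded by bandwidth divided by model size. The bandwidth and model-size figures below are illustrative assumptions:

```python
# Decode is usually memory-bandwidth bound, not capacity bound.

def max_decode_tok_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    """Theoretical ceiling on tokens/sec for single-stream decode."""
    return bandwidth_gb_s / model_gb

# Same model, very different memory systems:
print(f"GPU @ 1008 GB/s, 13 GB model: ~{max_decode_tok_per_sec(1008, 13):.0f} tok/s")
print(f"CPU @   90 GB/s, 13 GB model: ~{max_decode_tok_per_sec(90, 13):.0f} tok/s")
```

Two systems with identical capacity can differ by an order of magnitude in decode speed purely on bandwidth.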

Fit vs Usable

Barely fitting a model leaves no room for KV cache. "It fits" and "it's usable" are different statements.
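KV cache growth is what turns "it fits" into "it's unusable". Per token, the cache holds 2 (K and V) × layers × kv_heads × head_dim × bytes. The model shape below (32 layers, 32 KV heads, head_dim 128, FP16 cache, no grouped-query attention) is an assumption for illustration:

```python
# KV cache size as a function of context length.

def kv_cache_gb(ctx_tokens: int, layers: int = 32, kv_heads: int = 32,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return ctx_tokens * per_token / 1e9

for ctx in (2048, 8192, 32768):
    print(f"{ctx:>6} tokens of context: ~{kv_cache_gb(ctx):.1f} GB of KV cache")
```

At these assumed dimensions, 32K tokens of context costs ~17 GB on top of the weights, which is why a model that barely fits cannot serve long contexts.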

Quality vs Speed

Quantization reduces memory use and increases speed, but can degrade output quality. The right tradeoff depends on your use case.

Prefill vs Decode

Prefill is typically compute-bound (the whole prompt is processed in parallel); decode is typically bandwidth-bound (weights are re-read for each token). Optimizing for one may hurt the other.
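The two phases hit different ceilings, and rough arithmetic shows why. All hardware numbers below are illustrative assumptions; the 2-FLOPs-per-parameter-per-token rule is a standard approximation for transformer forward passes:

```python
# Why prefill and decode bottleneck differently.

params = 7e9          # 7B model
flops_avail = 80e12   # ~80 TFLOPS of usable FP16 compute (assumption)
bw = 1000e9           # ~1 TB/s memory bandwidth (assumption)
weight_bytes = 7e9    # ~1 byte/param at 8-bit quantization

# Prefill: ~2 FLOPs per parameter per token, prompt tokens run in parallel.
prefill_tok_s = flops_avail / (2 * params)
# Decode: each generated token must stream all weights from memory once.
decode_tok_s = bw / weight_bytes

print(f"prefill ceiling: ~{prefill_tok_s:,.0f} tok/s (compute-bound)")
print(f"decode  ceiling: ~{decode_tok_s:,.0f} tok/s (bandwidth-bound)")
```

Under these assumptions prefill can chew through thousands of prompt tokens per second while decode tops out in the low hundreds, which is why long prompts feel fast to ingest but slow to answer.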

Single vs Multi-GPU

Adding GPUs increases capacity, but interconnect bandwidth limits how well performance scales. Sometimes one bigger GPU beats two smaller ones.

Convenience vs Cost

Prebuilt systems cost more but work. DIY saves money but costs time. Used enterprise gear is cheap but comes with quirks.

Common Questions