Local LLM Knowledge Base
A technical reference for running large language models on your own hardware
This knowledge base covers the practical realities of running LLMs locally — the hardware constraints, software stacks, performance tradeoffs, and operational concerns that determine whether your setup actually works.
The goal is to help you understand not just what the components are, but how they interact and what tradeoffs matter for your specific use case.
Fundamentals
Core concepts: parameters, quantization, context windows, KV cache, and the prefill/decode distinction that shapes everything.
Hardware
GPUs, VRAM, memory bandwidth, multi-GPU scaling, and why your choice of hardware determines what's possible.
Software Stack
CUDA, llama.cpp, Ollama, model formats, and the layers between your hardware and your model.
Inference
Loading models, fitting into memory, GPU sharding, batching, and the mechanics of actually running inference.
Operations
Power delivery, thermal management, driver stability, and the unglamorous realities of keeping systems running.
Performance
Tokens/sec, latency, throughput, cost analysis, and how to evaluate whether a setup makes sense.
Decision Guides
Practical answers: which hardware for which models, when multi-GPU helps, Apple vs NVIDIA, buy vs build.
Key Tradeoffs
These tensions appear throughout local LLM systems. Understanding them helps you navigate decisions across all topics:
Capacity vs Bandwidth
A model fitting in memory doesn't mean it runs fast. Memory bandwidth often determines actual performance more than capacity.
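The reason: every generated token has to stream the active weights from memory, so bandwidth sets a hard ceiling on decode speed. A minimal back-of-the-envelope sketch, with the model size and bandwidth figures chosen purely for illustration:

```python
# Rough ceiling on decode speed: each generated token reads the active
# weights, so tokens/sec <= memory bandwidth / weight bytes.
# Both figures below are illustrative assumptions, not measurements.

model_gb = 40          # weight memory, e.g. a large model at 4-bit (assumed)
bandwidth_gbps = 1000  # memory bandwidth of the device holding it (assumed)

ceiling = bandwidth_gbps / model_gb
print(f"Decode ceiling: ~{ceiling:.0f} tokens/sec")  # ~25 tok/s here

# The same 40 GB model on a 100 GB/s machine still "fits",
# but its ceiling drops to ~2.5 tokens/sec.
```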
Fit vs Usable
Barely fitting a model leaves no room for KV cache. "It fits" and "it's usable" are different statements.
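For a sense of scale, here is a rough per-sequence KV cache estimate. The layer count, head geometry, and context length are assumptions picked for illustration, not a specific model:

```python
# Approximate KV cache size for one sequence:
# 2 (K and V) x layers x kv_heads x head_dim x context x bytes per element.
# All architecture numbers below are illustrative assumptions.

layers = 32          # transformer layers (assumed)
kv_heads = 8         # KV heads, grouped-query attention (assumed)
head_dim = 128       # per-head dimension (assumed)
context_len = 32768  # tokens of context you want available
bytes_per_elem = 2   # fp16 cache

kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
print(f"KV cache: ~{kv_bytes / 2**30:.1f} GiB")  # ~4 GiB with these numbers
```

If the weights alone already consume all but a few hundred megabytes of VRAM, that cache has nowhere to go.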
Quality vs Speed
Quantization reduces memory and increases speed, but affects output quality. The right tradeoff depends on your use case.
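Weight memory scales roughly with bits per parameter, which is why quantization changes what fits at all. A quick sketch; the parameter count and the ~10% overhead factor are assumptions for illustration:

```python
# Approximate weight memory at different quantization levels.
# Parameter count and the ~10% overhead factor are assumptions.

params_b = 70  # parameters in billions (assumed)
for name, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params_b * bits / 8 * 1.1  # billions of params x bytes/param, +~10%
    print(f"{name:>5}: ~{gb:.0f} GB")

# fp16 ~154 GB, 8-bit ~77 GB, 4-bit ~39 GB with these assumptions:
# the 4-bit build fits where fp16 never could, at some quality cost.
```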
Prefill vs Decode
Prefill is typically compute-bound, decode is typically bandwidth-bound. Optimizing for one may hurt the other.
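A rough sketch of why the two phases hit different limits. The parameter count, compute throughput, and bandwidth figures are assumptions, and the ~2 FLOPs per parameter per token is only a rule of thumb:

```python
# Why prefill and decode bottleneck differently (all figures are assumptions).

params_b = 20            # parameters in billions (assumed)
model_gb = 40            # weight memory at fp16, ~2 bytes per parameter
prompt_tokens = 4000
compute_tflops = 100     # sustained GPU compute (assumed)
bandwidth_gbps = 1000    # GPU memory bandwidth (assumed)

# Prefill: the whole prompt is processed in parallel, roughly 2 FLOPs per
# parameter per token, so compute throughput dominates.
prefill_sec = prompt_tokens * 2 * params_b * 1e9 / (compute_tflops * 1e12)

# Decode: one token at a time, each pass streaming all the weights,
# so memory bandwidth dominates.
decode_tok_per_sec = bandwidth_gbps / model_gb

print(f"Prefill: ~{prefill_sec:.1f} s for {prompt_tokens} prompt tokens")
print(f"Decode:  ~{decode_tok_per_sec:.0f} tokens/sec ceiling")
```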
Single vs Multi-GPU
Adding GPUs helps capacity but interconnect bandwidth limits scaling. Sometimes one bigger GPU beats two smaller ones.
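One way to see the limit is to compare interconnect bandwidth with on-card memory bandwidth; anything that has to cross between GPUs moves far more slowly than weights read on a single card. The figures below are approximate and assumed for illustration:

```python
# Interconnect vs on-card memory bandwidth (approximate, assumed figures).

gpu_mem_bw = 1000   # GB/s within one card (assumed)
pcie_bw = 32        # GB/s, PCIe 4.0 x16, one direction (approximate)
nvlink_bw = 450     # GB/s, high-end NVLink, one direction (approximate)

print(f"PCIe gap:   ~{gpu_mem_bw / pcie_bw:.0f}x slower than on-card memory")
print(f"NVLink gap: ~{gpu_mem_bw / nvlink_bw:.1f}x slower than on-card memory")
```

This is why splitting a model across two cards usually adds capacity rather than speed.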
Convenience vs Cost
Prebuilt systems cost more but work. DIY saves money but costs time. Used enterprise gear is cheap but comes with quirks.
Common Questions
Which model fits on my hardware?
VRAM requirements by model size and quantization level →
Why is my setup slow even though the model fits?
Memory bandwidth, offloading, and other performance killers →
Apple Silicon or NVIDIA?
Unified memory vs raw compute, and when each makes sense →
When does multi-GPU actually help?
Scaling realities and interconnect constraints →