Inference & Serving
How models actually run
Inference is the process of running a trained model to generate outputs. This section covers the mechanics of loading models, managing memory, and optimizing for different use cases.
The Inference Pipeline
┌─────────────────────────────────────────────────────────────┐
│ 1. MODEL LOADING │
│ Load weights from disk → GPU memory (or unified memory) │
│ Allocate KV cache buffers │
│ Initialize framework state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. PREFILL PHASE │
│ Process entire input prompt │
│ Build initial KV cache │
│ Compute-bound (matrix multiplications) │
│ Latency: "time to first token" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. DECODE PHASE │
│ Generate tokens one at a time │
│ Update KV cache each step │
│ Memory-bandwidth-bound │
│ Speed: "tokens per second" │
└─────────────────────────────────────────────────────────────┘
│
▼
Response complete
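The prefill/decode split is visible when you drive a model step by step yourself. Below is a minimal sketch using Hugging Face transformers with greedy decoding; `gpt2` is just a small stand-in for any causal LM, and the prompt and token count are arbitrary.

```python
# Minimal prefill/decode sketch with Hugging Face transformers (greedy decoding).
# "gpt2" is a small stand-in model; any causal LM behaves the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # 1. Prefill: run the whole prompt at once, building the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # 2. Decode: one token per step, reusing and extending the cache.
    generated = [next_id]
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```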
Memory Layout
GPU/Unified Memory:
┌────────────────────────────────────────────────┐
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Model Weights (static) │ │
│ │ 35GB for 70B Q4, loaded once │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ KV Cache (dynamic) │ │
│ │ Grows with context length │ │
│ │ ~1GB per 4K tokens for 70B │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Activations & Working Memory │ │
│ │ ~1-2GB overhead │ │
│ └──────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────┘
Key Concepts
Fitting Models in Memory
A model "fits" when weights + KV cache + overhead ≤ available memory.
| Scenario | Result | Performance |
|---|---|---|
| Fully fits in VRAM | Best case | Maximum speed |
| Weights fit, KV cache spills | Usable | Slower at long context |
| Partial offload to RAM | Works | 5-10× slower |
| Doesn't fit anywhere | Fails | — |
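The arithmetic is simple enough to check before downloading anything. A rough budget calculator, using the example figures from the memory layout above (the exact numbers vary by model, quantization, and engine):

```python
# Back-of-the-envelope "does it fit?" check using the figures from this section.
# All numbers are approximations; real usage depends on model, quant, and engine.

def fits(vram_gb, weights_gb, context_tokens, kv_gb_per_4k=1.0, overhead_gb=2.0):
    kv_gb = kv_gb_per_4k * context_tokens / 4096
    needed = weights_gb + kv_gb + overhead_gb
    return needed, needed <= vram_gb

# 70B at Q4 (~35 GB weights, ~1 GB of KV cache per 4K tokens) on a 48 GB GPU:
needed, ok = fits(vram_gb=48, weights_gb=35, context_tokens=16_384)
print(f"needs ~{needed:.1f} GB -> {'fits' if ok else 'does not fit'}")
```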
RAM Offloading
When models don't fully fit, some layers can run from system RAM:
- Each offloaded layer adds latency (RAM is 10-20× slower than VRAM)
- Configure with `-ngl` in llama.cpp (the number of layers kept on the GPU); see the sketch below
- Usable for experimentation, not ideal for production
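For example, with llama-cpp-python (a common Python binding for llama.cpp), the `n_gpu_layers` argument plays the role of `-ngl`; the model path and layer count here are placeholders for your own setup:

```python
# Partial GPU offload via llama-cpp-python; n_gpu_layers corresponds to -ngl.
# The model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=40)  # remaining layers stay in system RAM
out = llm("Q: Why is partial offload slower? A:", max_tokens=64)
print(out["choices"][0]["text"])
```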
Multi-GPU Sharding
Split model across multiple GPUs for larger models:
- Tensor parallelism: Split layers horizontally (requires fast interconnect)
- Pipeline parallelism: Assign layers to different GPUs sequentially
- See Multi-GPU for details
Batching
Processing multiple requests simultaneously improves throughput:
No Batching (Interactive)
- One request at a time
- Lowest latency
- GPU often underutilized
- Best for chat/coding assistant
Batched (Throughput)
- Multiple requests together
- Higher total tokens/sec
- Higher per-request latency
- Best for batch processing, serving
Continuous Batching
Advanced technique where new requests join in-progress batches:
- Implemented by vLLM, TGI
- Better GPU utilization
- More complex memory management
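The scheduling idea can be sketched without any model code: requests are admitted and retired at token boundaries rather than batch boundaries. In the toy sketch below, `step_batch()` stands in for one decode step of a real engine:

```python
# Toy continuous-batching scheduler: requests join and leave the running batch
# at token boundaries. step_batch() stands in for one real decode step.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step_batch(batch):
    # Placeholder for one forward pass that yields one new token per request.
    for req in batch:
        req.generated.append("<tok>")

def serve(waiting: deque, max_batch_size: int = 8):
    running = []
    while waiting or running:
        # Admit new requests whenever there is a free slot in the batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step_batch(running)
        # Finished requests leave immediately, freeing their slot for the queue.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]

serve(deque([Request("hello", 4), Request("world", 2)]))
```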
Inference Optimization Techniques
Flash Attention
Optimized attention implementation that reduces memory and increases speed:
- Restructures computation to be more cache-friendly
- Standard in modern inference engines
- Significant speedup for long contexts
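In Hugging Face transformers, one common way to opt in is the `attn_implementation` argument; it requires the separate `flash-attn` package and a supported GPU, and the model id below is illustrative:

```python
# Opting in to FlashAttention when loading a model with transformers.
# Requires the flash-attn package and a supported GPU; the model id is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```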
Speculative Decoding
Use a small "draft" model to propose several tokens, then verify them with the main model in a single pass:
- Can provide 2-3× speedup
- Draft model must share the main model's tokenizer and vocabulary
- Benefits depend on the acceptance rate (how often drafted tokens survive verification)
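Transformers exposes this as "assisted generation" via the `assistant_model` argument to `generate()`. A minimal sketch, using two GPT-2 sizes that share a tokenizer as stand-ins for a real main/draft pair:

```python
# Assisted generation in transformers: a small draft model proposes tokens,
# the main model verifies them. The two GPT-2 sizes share a tokenizer and are
# stand-ins for a real main/draft pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")
main = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # same vocabulary, much smaller

inputs = tok("Speculative decoding speeds things up because", return_tensors="pt")
with torch.no_grad():
    out = main.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```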
KV Cache Quantization
Store KV cache in lower precision (INT8 instead of FP16):
- Halves KV cache memory
- Enables longer contexts
- Minor quality impact
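The saving is easy to estimate from the cache's shape. A rough sizing sketch, using illustrative layer and head counts in the shape of a 70B-class GQA model (read the real values from your model's config):

```python
# Rough KV-cache sizing, showing why INT8 halves the cache relative to FP16.
# Layer/head counts are illustrative, in the shape of a 70B-class GQA model.

def kv_cache_gb(tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return tokens * per_token / 1024**3

for label, width in [("FP16", 2), ("INT8", 1)]:
    print(f"{label}: {kv_cache_gb(32_768, bytes_per_elem=width):.1f} GB for a 32K-token context")
```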
Serving Patterns
Single User (Local)
# Ollama
ollama run llama3
# llama.cpp interactive
./main -m model.gguf -ngl 99 --interactive
API Server (Single Machine)
# llama.cpp server
./server -m model.gguf -ngl 99 --port 8080
# Ollama (always runs a server on port 11434)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'
Production Serving (Multi-User)
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b \
--tensor-parallel-size 2
# Text Generation Inference
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3-70b
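The vLLM server above exposes an OpenAI-compatible HTTP API, so any OpenAI client can talk to it. A minimal client sketch, assuming vLLM's default port 8000 and the same model id:

```python
# Querying the vLLM server started above through its OpenAI-compatible API.
# Assumes vLLM's default port 8000; the api_key is a local-server placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.completions.create(
    model="meta-llama/Llama-3-70b",
    prompt="Continuous batching improves throughput because",
    max_tokens=64,
)
print(resp.choices[0].text)
```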
Choosing an Inference Stack
| Use Case | Recommended | Why |
|---|---|---|
| Getting started | Ollama | Simplest setup |
| Personal use, all platforms | Ollama or llama.cpp | Works everywhere |
| Maximum control | llama.cpp | Most configurable |
| Serving multiple users | vLLM | Best throughput |
| Production NVIDIA | vLLM or TensorRT-LLM | Optimized for serving |