Inference is the process of running a trained model to generate outputs. This section covers the mechanics of loading models, managing memory, and optimizing for different use cases.

The Inference Pipeline

┌────────────────────────────────────────────────────────────┐
│ 1. MODEL LOADING                                            │
│    Load weights from disk → GPU memory (or unified memory)  │
│    Allocate KV cache buffers                                │
│    Initialize framework state                               │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ 2. PREFILL PHASE                                            │
│    Process entire input prompt                              │
│    Build initial KV cache                                   │
│    Compute-bound (matrix multiplications)                   │
│    Latency: "time to first token"                           │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ 3. DECODE PHASE                                             │
│    Generate tokens one at a time                            │
│    Update KV cache each step                                │
│    Memory-bandwidth-bound                                   │
│    Speed: "tokens per second"                               │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
                      Response complete

Memory Layout

GPU/Unified Memory:
┌────────────────────────────────────────────────┐
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  Model Weights (static)                  │  │
│  │  35GB for 70B Q4, loaded once            │  │
│  └──────────────────────────────────────────┘  │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  KV Cache (dynamic)                      │  │
│  │  Grows with context length               │  │
│  │  ~1GB per 4K tokens for 70B              │  │
│  └──────────────────────────────────────────┘  │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │  Activations & Working Memory            │  │
│  │  ~1-2GB overhead                         │  │
│  └──────────────────────────────────────────┘  │
│                                                │
└────────────────────────────────────────────────┘
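
The "~1GB per 4K tokens" figure can be sanity-checked with a rough estimate: per token, the cache stores one key and one value vector per layer, so it costs about 2 × n_layers × n_kv_heads × head_dim × bytes_per_value per token. Assuming a Llama-3-70B-like layout (80 layers, 8 KV heads of dimension 128) and an FP16 cache, that is roughly 320KB per token, or about 1.3GB per 4K tokens; the exact figure depends on the model's attention configuration and the cache precision.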

Key Concepts

Fitting Models in Memory

A model "fits" when weights + KV cache + overhead ≤ available memory.
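
As a quick example using the figures above: a 70B model at Q4 needs ~35GB of weights, ~2GB of KV cache for an 8K-token context, and ~1-2GB of working memory, roughly 39GB in total. That fits comfortably on a 48GB GPU but not on a single 24GB card without offloading.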

Scenario                        Result       Performance
Fully fits in VRAM              Best case    Maximum speed
Weights fit, KV cache spills    Usable       Slower at long context
Partial offload to RAM          Works        5-10× slower
Doesn't fit anywhere            Fails

RAM Offloading

When a model doesn't fully fit in VRAM, some of its layers can be kept in system RAM and executed on the CPU. This keeps larger models usable, but every offloaded layer costs speed, because CPU memory bandwidth is far lower than the GPU's.
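
A minimal sketch with llama.cpp, using the same -ngl (GPU layer count) flag that appears in the serving examples below; the right layer count depends on how much VRAM you have:

# Keep 40 layers on the GPU; the remaining layers run from system RAM on the CPU
./main -m model.gguf -ngl 40

# Offload all layers (the -ngl 99 used elsewhere in this section)
./main -m model.gguf -ngl 99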

Multi-GPU Sharding

Models too large for a single GPU can be split (sharded) across several GPUs, with each GPU holding part of the weights and the work divided between them.
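
A hedged sketch of two common approaches; the llama.cpp --tensor-split flag and its exact behavior depend on your build, and the vLLM form mirrors the production serving example later in this section:

# llama.cpp: spread the weights roughly evenly across two GPUs
./main -m model.gguf -ngl 99 --tensor-split 1,1

# vLLM: shard the model across 2 GPUs with tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70b \
    --tensor-parallel-size 2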

Batching

Processing multiple requests simultaneously improves throughput:

No Batching (Interactive)

  • One request at a time
  • Lowest latency
  • GPU often underutilized
  • Best for interactive chat and coding assistants

Batched (Throughput)

  • Multiple requests together
  • Higher total tokens/sec
  • Higher per-request latency
  • Best for batch processing, serving

Continuous Batching

An advanced scheduling technique in which new requests join a batch that is already in progress as earlier requests finish, so the GPU stays busy instead of waiting for the whole batch to drain.
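
vLLM does this by default. In llama.cpp's server it is controlled by flags along the following lines (flag names vary between versions, so treat this as a sketch):

# llama.cpp server: 4 parallel slots with continuous batching enabled
./server -m model.gguf -ngl 99 --port 8080 --parallel 4 --cont-batching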

Inference Optimization Techniques

Flash Attention

An optimized attention implementation that avoids materializing the full attention matrix, reducing memory use and increasing speed, with the biggest gains at long context lengths.
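
Many inference stacks enable it automatically when the hardware supports it; in llama.cpp it is an explicit toggle on recent builds (the flag below is what recent versions use, but check your build):

# llama.cpp: enable flash attention for prefill and decode
./main -m model.gguf -ngl 99 --flash-attn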

Speculative Decoding

A small "draft" model proposes several tokens cheaply; the main model then verifies them in a single forward pass and keeps the ones it agrees with. Output quality matches the main model alone, but generation is faster whenever the draft's guesses are usually accepted.
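
A hedged sketch using llama.cpp's speculative-decoding example program; the binary and flag names differ across versions, and draft.gguf stands in for a small model that shares the main model's tokenizer:

# Large target model plus a small draft model that proposes tokens
./speculative -m model.gguf -md draft.gguf -ngl 99 -p "Explain the KV cache"

The speedup depends on how often the draft model's proposals are accepted by the main model.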

KV Cache Quantization

Storing the KV cache in lower precision (e.g., 8-bit instead of FP16) roughly halves its memory footprint, allowing longer contexts in the same amount of memory at a small quality cost.
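
In llama.cpp this is exposed through cache-type flags roughly as follows (exact flag names, supported types, and whether flash attention must also be enabled depend on the build):

# Store the K and V caches in 8-bit instead of FP16, roughly halving KV cache memory
./server -m model.gguf -ngl 99 --cache-type-k q8_0 --cache-type-v q8_0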

Serving Patterns

Single User (Local)

# Ollama
ollama run llama3

# llama.cpp interactive
./main -m model.gguf -ngl 99 --interactive

API Server (Single Machine)

# llama.cpp server
./server -m model.gguf -ngl 99 --port 8080

# Ollama (always runs server)
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'

Production Serving (Multi-User)

# vLLM
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3-70b \
    --tensor-parallel-size 2

# Text Generation Inference
docker run --gpus all --shm-size 1g -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference \
    --model-id meta-llama/Llama-3-70b

Choosing an Inference Stack

Use Case                       Recommended             Why
Getting started                Ollama                  Simplest setup
Personal use, all platforms    Ollama or llama.cpp     Works everywhere
Maximum control                llama.cpp               Most configurable
Serving multiple users         vLLM                    Best throughput
Production on NVIDIA GPUs      vLLM or TensorRT-LLM    Optimized for serving