Inference & Serving
How models actually run
Inference is the process of running a trained model to generate outputs. This section covers the mechanics of loading models, managing memory, and optimizing for different use cases.
The Inference Pipeline
┌─────────────────────────────────────────────────────────────┐
│ 1. MODEL LOADING │
│ Load weights from disk → GPU memory (or unified memory) │
│ Allocate KV cache buffers │
│ Initialize framework state │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. PREFILL PHASE │
│ Process entire input prompt │
│ Build initial KV cache │
│ Compute-bound (matrix multiplications) │
│ Latency: "time to first token" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. DECODE PHASE │
│ Generate tokens one at a time │
│ Update KV cache each step │
│ Memory-bandwidth-bound │
│ Speed: "tokens per second" │
└─────────────────────────────────────────────────────────────┘
│
▼
Response complete
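The prefill/decode split is visible when you drive a model step by step yourself. Below is a minimal sketch using Hugging Face transformers with greedy decoding; `gpt2` is just a small stand-in for any causal LM, and the prompt and token count are arbitrary.

```python
# Minimal prefill/decode sketch with Hugging Face transformers (greedy decoding).
# "gpt2" is a small stand-in model; any causal LM behaves the same way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tok("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # 1. Prefill: run the whole prompt at once, building the KV cache.
    out = model(input_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    # 2. Decode: one token per step, reusing and extending the cache.
    generated = [next_id]
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```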
Memory Layout
GPU/Unified Memory:
┌────────────────────────────────────────────────┐
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Model Weights (static) │ │
│ │ 35GB for 70B Q4, loaded once │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ KV Cache (dynamic) │ │
│ │ Grows with context length │ │
│ │ ~1GB per 4K tokens for 70B │ │
│ └──────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ Activations & Working Memory │ │
│ │ ~1-2GB overhead │ │
│ └──────────────────────────────────────────┘ │
│ │
└────────────────────────────────────────────────┘
Key Concepts
Fitting Models in Memory
A model "fits" when weights + KV cache + overhead ≤ available memory.
| Scenario | Result | Performance |
|---|---|---|
| Fully fits in VRAM | Best case | Maximum speed |
| Weights fit, KV cache spills | Usable | Slower at long context |
| Partial offload to RAM | Works | 5-10× slower |
| Doesn't fit anywhere | Fails | — |
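The arithmetic is simple enough to check before downloading anything. A rough budget calculator, using the example figures from the memory layout above (the exact numbers vary by model, quantization, and engine):

```python
# Back-of-the-envelope "does it fit?" check using the figures from this section.
# All numbers are approximations; real usage depends on model, quant, and engine.

def fits(vram_gb, weights_gb, context_tokens, kv_gb_per_4k=1.0, overhead_gb=2.0):
    kv_gb = kv_gb_per_4k * context_tokens / 4096
    needed = weights_gb + kv_gb + overhead_gb
    return needed, needed <= vram_gb

# 70B at Q4 (~35 GB weights, ~1 GB of KV cache per 4K tokens) on a 48 GB GPU:
needed, ok = fits(vram_gb=48, weights_gb=35, context_tokens=16_384)
print(f"needs ~{needed:.1f} GB -> {'fits' if ok else 'does not fit'}")
```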
RAM Offloading
When models don't fully fit, some layers can run from system RAM:
- Each offloaded layer adds latency (RAM is 10-20× slower than VRAM)
- Configure with `-ngl` in llama.cpp (the number of layers kept on the GPU); see the sketch below
- Usable for experimentation, not ideal for production
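For example, with llama-cpp-python (a common Python binding for llama.cpp), the `n_gpu_layers` argument plays the role of `-ngl`; the model path and layer count here are placeholders for your own setup:

```python
# Partial GPU offload via llama-cpp-python; n_gpu_layers corresponds to -ngl.
# The model path and layer count are placeholders.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_gpu_layers=40)  # remaining layers stay in system RAM
out = llm("Q: Why is partial offload slower? A:", max_tokens=64)
print(out["choices"][0]["text"])
```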
Multi-GPU Sharding
Split model across multiple GPUs for larger models:
- Tensor parallelism: Split layers horizontally (requires fast interconnect)
- Pipeline parallelism: Assign layers to different GPUs sequentially
- See Multi-GPU for details
Batching
Processing multiple requests simultaneously improves throughput:
No Batching (Interactive)
- One request at a time
- Lowest latency
- GPU often underutilized
- Best for chat/coding assistant
Batched (Throughput)
- Multiple requests together
- Higher total tokens/sec
- Higher per-request latency
- Best for batch processing, serving
Continuous Batching
Advanced technique where new requests join in-progress batches:
- Implemented by vLLM, TGI
- Better GPU utilization
- More complex memory management
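The scheduling idea can be sketched without any model code: requests are admitted and retired at token boundaries rather than batch boundaries. In the toy sketch below, `step_batch()` stands in for one decode step of a real engine:

```python
# Toy continuous-batching scheduler: requests join and leave the running batch
# at token boundaries. step_batch() stands in for one real decode step.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step_batch(batch):
    # Placeholder for one forward pass that yields one new token per request.
    for req in batch:
        req.generated.append("<tok>")

def serve(waiting: deque, max_batch_size: int = 8):
    running = []
    while waiting or running:
        # Admit new requests whenever there is a free slot in the batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step_batch(running)
        # Finished requests leave immediately, freeing their slot for the queue.
        running = [r for r in running if len(r.generated) < r.max_new_tokens]

serve(deque([Request("hello", 4), Request("world", 2)]))
```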
Inference Optimization Techniques
Flash Attention
Optimized attention implementation that reduces memory and increases speed:
- Restructures computation to be more cache-friendly
- Standard in modern inference engines
- Significant speedup for long contexts
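In Hugging Face transformers, one common way to opt in is the `attn_implementation` argument; it requires the separate `flash-attn` package and a supported GPU, and the model id below is illustrative:

```python
# Opting in to FlashAttention when loading a model with transformers.
# Requires the flash-attn package and a supported GPU; the model id is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```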
Speculative Decoding
Use a small "draft" model to propose several tokens, then verify them with the main model in a single pass:
- Can provide 2-3× speedup
- Draft model must share the main model's tokenizer and vocabulary
- Benefits depend on the acceptance rate (how often drafted tokens survive verification)
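Transformers exposes this as "assisted generation" via the `assistant_model` argument to `generate()`. A minimal sketch, using two GPT-2 sizes that share a tokenizer as stand-ins for a real main/draft pair:

```python
# Assisted generation in transformers: a small draft model proposes tokens,
# the main model verifies them. The two GPT-2 sizes share a tokenizer and are
# stand-ins for a real main/draft pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-large")
main = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
draft = AutoModelForCausalLM.from_pretrained("gpt2").eval()  # same vocabulary, much smaller

inputs = tok("Speculative decoding speeds things up because", return_tensors="pt")
with torch.no_grad():
    out = main.generate(**inputs, assistant_model=draft, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```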
KV Cache Quantization
Store KV cache in lower precision (INT8 instead of FP16):
- Halves KV cache memory
- Enables longer contexts
- Minor quality impact
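The saving is easy to estimate from the cache's shape. A rough sizing sketch, using illustrative layer and head counts in the shape of a 70B-class GQA model (read the real values from your model's config):

```python
# Rough KV-cache sizing, showing why INT8 halves the cache relative to FP16.
# Layer/head counts are illustrative, in the shape of a 70B-class GQA model.

def kv_cache_gb(tokens, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return tokens * per_token / 1024**3

for label, width in [("FP16", 2), ("INT8", 1)]:
    print(f"{label}: {kv_cache_gb(32_768, bytes_per_elem=width):.1f} GB for a 32K-token context")
```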
Serving Patterns
Single User (Local)
# Ollama
ollama run llama3
# llama.cpp interactive
./main -m model.gguf -ngl 99 --interactive
API Server (Single Machine)
# llama.cpp server
./server -m model.gguf -ngl 99 --port 8080
# Ollama (always runs a server on port 11434)
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3", "prompt": "Why is the sky blue?"}'
Production Serving (Multi-User)
# vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3-70b \
--tensor-parallel-size 2
# Text Generation Inference
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Llama-3-70b
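The vLLM server above exposes an OpenAI-compatible HTTP API, so any OpenAI client can talk to it. A minimal client sketch, assuming vLLM's default port 8000 and the same model id:

```python
# Querying the vLLM server started above through its OpenAI-compatible API.
# Assumes vLLM's default port 8000; the api_key is a local-server placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.completions.create(
    model="meta-llama/Llama-3-70b",
    prompt="Continuous batching improves throughput because",
    max_tokens=64,
)
print(resp.choices[0].text)
```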
Choosing an Inference Stack
| Use Case | Recommended | Why |
|---|---|---|
| Getting started | Ollama | Simplest setup |
| Personal use, all platforms | Ollama or llama.cpp | Works everywhere |
| Maximum control | llama.cpp | Most configurable |
| Serving multiple users | vLLM | Best throughput |
| Production NVIDIA | vLLM or TensorRT-LLM | Optimized for serving |