Prefill vs Decode
Two phases, two different bottlenecks
LLM inference happens in two distinct phases with different computational characteristics. Understanding this distinction is crucial for diagnosing performance issues and choosing hardware.
The Two Phases
Prefill: Processing the Prompt
During prefill, the model processes your entire input prompt in one forward pass.
Characteristics
- Parallel processing: All input tokens processed simultaneously
- Compute-bound: Limited by GPU TFLOPS, not memory bandwidth
- Builds KV cache: Prepares attention state for generation
- Happens once: Only at the start of generation
What Affects Prefill Speed
| Factor | Impact |
|---|---|
| Prompt length | Longer prompts = more computation |
| GPU compute (TFLOPS) | More compute = faster prefill |
| Model size | Larger models = more computation per token |
| Flash Attention | Significant speedup through memory optimization |
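As a back-of-envelope illustration (assuming roughly 2 FLOPs per parameter per prompt token, ignoring attention's quadratic term, and applying a utilization factor since real kernels never hit peak TFLOPS), prefill time grows with both prompt length and model size:

```python
# Rough prefill-time estimate. Order-of-magnitude only: assumes ~2 FLOPs per
# parameter per prompt token and ignores attention's quadratic term.

def estimate_prefill_seconds(params_billion: float,
                             prompt_tokens: int,
                             gpu_tflops: float,
                             utilization: float = 0.5) -> float:
    flops_needed = 2 * params_billion * 1e9 * prompt_tokens
    effective_flops = gpu_tflops * 1e12 * utilization
    return flops_needed / effective_flops

# Example: 7B model, 8K-token prompt, 80 TFLOPS GPU at ~50% utilization
print(f"{estimate_prefill_seconds(7, 8192, 80):.1f} s")  # ~2.9 s
```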
Time to First Token (TTFT)
Prefill latency directly determines how long users wait before seeing any response. For interactive applications, TTFT under 1-2 seconds feels responsive. Long prompts (RAG contexts, document analysis) can push TTFT to several seconds.
Decode: Generating Tokens
After prefill, the model generates output tokens one at a time.
Characteristics
- Sequential: One token at a time; a single request can't be parallelized across output positions
- Memory-bandwidth-bound: Reading model weights is the bottleneck
- Uses KV cache: Only computes attention for new token
- Repeats: Runs for every generated token
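A minimal sketch of how the two phases fit together; `model.forward` and `sample` are illustrative placeholders rather than any particular library's API:

```python
# Illustrative two-phase generation loop. model.forward and sample are
# placeholders: forward(tokens, kv_cache=...) returns (logits, new_cache).

def generate(model, prompt_token_ids, max_new_tokens):
    # Prefill: one forward pass over the whole prompt, all positions in parallel.
    logits, kv_cache = model.forward(prompt_token_ids, kv_cache=None)
    next_token = sample(logits[-1])
    output = [next_token]

    # Decode: one token per step; the KV cache means only the newest token's
    # attention keys/values are computed on each iteration.
    for _ in range(max_new_tokens - 1):
        logits, kv_cache = model.forward([next_token], kv_cache=kv_cache)
        next_token = sample(logits[-1])
        output.append(next_token)
    return output
```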
What Affects Decode Speed
| Factor | Impact |
|---|---|
| Memory bandwidth | Higher bandwidth = faster decoding |
| Model size | Larger models = more data to read per token |
| Quantization | Lower precision = less data to read = faster |
| Batch size | More concurrent requests amortize bandwidth cost |
The Bandwidth Bottleneck
Generating each token requires streaming essentially all of the model's weights from memory while performing relatively little arithmetic per byte read. That ratio is what makes decode memory-bound.
Why TFLOPS Don't Matter for Decode
A GPU might have 80 TFLOPS of compute, but if it can only read 1 TB/s from memory, the compute units sit idle waiting for data. At low batch sizes, decode is almost always bandwidth-limited, not compute-limited.
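A quick way to quantify this for a single request: each decode step must read roughly all of the weights, so memory bandwidth divided by model size in bytes gives an upper bound on tokens per second (real systems land below it):

```python
# Upper bound on batch-size-1 decode speed: memory bandwidth / weight bytes.
# Ignores KV cache reads and kernel overhead, so real numbers are lower.

def max_decode_tok_per_s(params_billion: float,
                         bytes_per_param: float,
                         bandwidth_gb_per_s: float) -> float:
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_per_s * 1e9 / weight_bytes

print(f"{max_decode_tok_per_s(70, 2.0, 1000):.0f} tok/s")  # 70B in FP16 at 1 TB/s -> ~7
print(f"{max_decode_tok_per_s(70, 0.5, 1000):.0f} tok/s")  # same model, 4-bit weights -> ~29
```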
Different Bottlenecks, Different Solutions
Optimize Prefill (TTFT)
- Flash Attention
- Tensor parallelism (multi-GPU)
- Higher compute GPUs
- Prompt caching (see the sketch after this list)
- Shorter prompts
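For example, prompt caching reuses the KV cache of a shared prefix (such as a fixed system prompt) so only the new suffix needs prefill. A toy sketch, using the same hypothetical `model.forward` API as above:

```python
# Toy prompt cache keyed by the shared prefix. Assumes the hypothetical
# model.forward(tokens, kv_cache=...) returns (logits, new_cache) and does not
# mutate the cache it was given; real servers use copy-on-write or
# paged-attention machinery instead of a plain dict.

prefix_cache = {}

def prefill_with_cache(model, prefix_ids, suffix_ids):
    key = tuple(prefix_ids)
    if key not in prefix_cache:
        # Pay the prefix's prefill cost once, then reuse it across requests.
        _, prefix_cache[key] = model.forward(prefix_ids, kv_cache=None)
    # Only the suffix tokens are prefilled on this request.
    return model.forward(suffix_ids, kv_cache=prefix_cache[key])
```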
Optimize Decode (tok/s)
- Higher memory bandwidth
- Quantization (smaller model)
- Speculative decoding
- Batching (for throughput)
- KV cache quantization
Real-World Implications
Interactive Use (Chat)
Both phases matter:
- TTFT determines perceived responsiveness
- Decode speed determines reading comfort (15-30 tok/s is comfortable)
- Typical session: short prompts, moderate outputs
Batch Processing (API Serving)
Throughput is king:
- TTFT less important (users wait for complete response)
- Batch multiple requests to amortize bandwidth
- Focus on total tokens per second across all requests
Long Context (RAG, Documents)
Prefill dominates:
- 16K-128K token prompts = multi-second prefill
- TTFT becomes the main latency concern
- KV cache memory becomes critical
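A rough KV cache estimate (assuming standard multi-head attention in FP16; grouped-query attention shrinks this proportionally) shows why long contexts strain memory:

```python
# Rough KV cache size: 2 tensors (K and V) per layer, per token, per KV head.
# Assumes standard multi-head attention; GQA divides this by
# num_heads / num_kv_heads.

def kv_cache_gb(num_layers, num_kv_heads, head_dim, seq_len,
                bytes_per_value=2, batch_size=1):
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * bytes_per_value * batch_size)
    return total_bytes / 1e9

# Example: a 7B-class model (32 layers, 32 KV heads, head_dim 128)
# holding a 32K-token context in FP16
print(f"{kv_cache_gb(32, 32, 128, 32_768):.1f} GB")  # ~17 GB
```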
Measuring Performance
When benchmarking, measure both phases:
| Metric | What It Measures | Typical Good Values |
|---|---|---|
| Time to First Token (TTFT) | Prefill latency | <1s for short prompts |
| Tokens per Second (decode) | Generation speed | 15-50 depending on model/hardware |
| Total Latency | TTFT + generation time | Depends on output length |
| Throughput (batched) | Total tok/s across all requests | Higher with batching |
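A minimal timing harness; `stream_tokens` is a placeholder for whatever streaming interface your serving stack exposes:

```python
import time

# Times TTFT and decode tok/s for any generator that yields tokens as they
# arrive. stream_tokens(prompt) is a placeholder for your stack's streaming API.

def benchmark(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("no tokens were generated")

    ttft = first_token_time - start
    decode_tok_per_s = max(count - 1, 1) / max(end - first_token_time, 1e-9)
    return {"ttft_s": ttft, "decode_tok_per_s": decode_tok_per_s,
            "total_s": end - start}
```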
Speculative Decoding
A technique to speed up decode by using a smaller "draft" model. The draft model cheaply proposes several tokens ahead; the large target model then verifies all of them in a single parallel forward pass, keeping the accepted tokens and correcting the first mismatch. Because verification looks like a short prefill, speculative decoding converts some sequential decode work into parallel, compute-friendly work, improving tokens per second without degrading quality (the standard acceptance rule preserves the target model's output distribution).
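A simplified sketch of one speculative step with greedy acceptance; `draft_model` and `target_model` expose hypothetical helper methods, and production implementations use rejection sampling instead of exact-match acceptance:

```python
# One speculative-decoding step with greedy acceptance. Hypothetical helpers:
#   greedy_next(ctx)                     -> argmax next token after ctx
#   greedy_next_for_each(ctx, proposed)  -> the target's argmax token at each
#       proposed position, computed in ONE forward pass over ctx + proposed

def speculative_step(target_model, draft_model, context, k=4):
    # 1. The cheap draft model proposes k tokens sequentially.
    proposed, draft_ctx = [], list(context)
    for _ in range(k):
        tok = draft_model.greedy_next(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)

    # 2. The target model verifies all k proposals in a single parallel pass
    #    (prefill-like work replacing k sequential decode steps).
    predictions = target_model.greedy_next_for_each(context, proposed)

    # 3. Keep proposals up to the first disagreement; at that point take the
    #    target model's own token instead.
    accepted = []
    for prop, pred in zip(proposed, predictions):
        if prop == pred:
            accepted.append(prop)
        else:
            accepted.append(pred)
            break
    return accepted
```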