LLM inference happens in two distinct phases with different computational characteristics. Understanding this distinction is crucial for diagnosing performance issues and choosing hardware.

The Two Phases

User sends prompt: "Explain quantum computing in simple terms"
        │
        ▼
┌────────────────────────────────────────────┐
│  PREFILL PHASE                              │
│  Process entire prompt in parallel          │
│  Build initial KV cache                     │
│  Compute-bound (matrix multiplications)     │
│  Latency: "time to first token" (TTFT)      │
└────────────────────────────────────────────┘
        │
        ▼
First token generated
        │
        ▼
┌────────────────────────────────────────────┐
│  DECODE PHASE                               │
│  Generate one token at a time               │
│  Read entire model for each token           │
│  Memory-bandwidth-bound                     │
│  Speed: "tokens per second" (tok/s)         │
└────────────────────────────────────────────┘
        │
        ▼
Response complete

Prefill: Processing the Prompt

During prefill, the model processes your entire input prompt in one forward pass.

Characteristics

  • Processes every prompt token in a single parallel forward pass
  • Builds the initial KV cache for the prompt
  • Compute-bound: dominated by large matrix multiplications
  • Its latency is what users experience as time to first token (TTFT)

What Affects Prefill Speed

| Factor | Impact |
|---|---|
| Prompt length | Longer prompts = more computation |
| GPU compute (TFLOPS) | More compute = faster prefill |
| Model size | Larger models = more computation per token |
| Flash Attention | Significant speedup through memory optimization |

Time to First Token (TTFT)

Prefill latency directly determines how long users wait before seeing any response. For interactive applications, a TTFT under one to two seconds feels responsive; long prompts (RAG contexts, document analysis) can push it to several seconds.
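A rough sanity check for TTFT follows from the factors above. The sketch below is a back-of-envelope estimate, assuming roughly 2 × parameters FLOPs per prompt token and an assumed utilization fraction of peak compute; the specific numbers are illustrative, not measurements.

```python
def estimate_prefill_seconds(
    n_params: float,           # model parameters, e.g. 70e9 for a 70B model
    prompt_tokens: int,        # input prompt length in tokens
    gpu_tflops: float,         # peak GPU throughput, e.g. ~165 FP16 TFLOPS for an RTX 4090
    utilization: float = 0.5,  # assumed fraction of peak actually achieved
) -> float:
    """Back-of-envelope prefill (TTFT) estimate: ~2 * params FLOPs per prompt token."""
    total_flops = 2.0 * n_params * prompt_tokens
    return total_flops / (gpu_tflops * 1e12 * utilization)

# Example: 70B model, 2,000-token prompt, RTX 4090-class GPU -> roughly 3.4 s
print(f"{estimate_prefill_seconds(70e9, 2000, 165):.1f} s")
```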

Decode: Generating Tokens

After prefill, the model generates output tokens one at a time.

Characteristics

  • Generates tokens autoregressively, one forward pass per token
  • Reads effectively all model weights (plus the KV cache) for every token
  • Memory-bandwidth-bound: compute units spend most of their time waiting on memory
  • Its speed is what users experience as tokens per second (tok/s)

What Affects Decode Speed

| Factor | Impact |
|---|---|
| Memory bandwidth | Higher bandwidth = faster decoding |
| Model size | Larger models = more data to read per token |
| Quantization | Lower precision = less data to read = faster |
| Batch size | More concurrent requests amortize bandwidth cost |

The Bandwidth Bottleneck

Here's why decode is memory-bound:

For each generated token, you must read:
├── All model weights: 35GB (for 70B Q4 model)
├── KV cache: varies with context
└── Total: ~35-50GB per token

At 30 tokens/second:
├── Need to read: 30 × 35GB = 1,050 GB/s
└── This is MEMORY BANDWIDTH, not compute

GPU bandwidth comparison:
├── RTX 4090: 1,008 GB/s → ~29 tok/s theoretical max
├── RTX 3090: 936 GB/s → ~27 tok/s theoretical max
├── A100 80GB: 2,039 GB/s → ~58 tok/s theoretical max
└── Mac Studio M2 Ultra: 800 GB/s → ~23 tok/s theoretical max
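The same arithmetic is easy to script. This sketch simply divides memory bandwidth by the bytes read per token to reproduce the theoretical ceilings above; it ignores KV cache traffic and other overheads, so real-world numbers land lower.

```python
def max_decode_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical decode ceiling: each token requires one full read of the weights."""
    return bandwidth_gb_s / model_size_gb

model_gb = 35  # ~70B parameters at 4-bit quantization
for gpu, bw in [("RTX 4090", 1008), ("RTX 3090", 936),
                ("A100 80GB", 2039), ("M2 Ultra", 800)]:
    print(f"{gpu:12s} {max_decode_tok_s(bw, model_gb):5.1f} tok/s max")
```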

Why TFLOPS Don't Matter for Decode

A GPU might have 80 TFLOPS of compute, but if it can only read 1 TB/s from memory, the compute units sit idle waiting for data. Decode is almost always bandwidth-limited, not compute-limited.
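To put numbers on that claim, compare the per-token compute time against the per-token memory time for the same 70B Q4 model on an A100, assuming roughly 2 FLOPs per parameter per token and the A100's nominal 312 FP16 TFLOPS and 2,039 GB/s. The figures are rough, but the gap is what matters:

```python
params = 70e9            # 70B parameters
weights_gb = 35          # 4-bit quantized weights
tflops = 312             # A100 peak FP16 tensor throughput
bandwidth_gb_s = 2039    # A100 80GB HBM bandwidth

compute_ms = 2 * params / (tflops * 1e12) * 1e3   # ~0.45 ms of math per token
memory_ms = weights_gb / bandwidth_gb_s * 1e3     # ~17 ms just reading the weights

print(f"compute: {compute_ms:.2f} ms/token, memory: {memory_ms:.1f} ms/token "
      f"({memory_ms / compute_ms:.0f}x gap)")
```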

Different Bottlenecks, Different Solutions

Optimize Prefill (TTFT)

  • Flash Attention
  • Tensor parallelism (multi-GPU)
  • Higher compute GPUs
  • Prompt caching (see the sketch after this list)
  • Shorter prompts
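
Prompt caching deserves a quick illustration: when many requests share a prefix (a system prompt, a RAG template), the KV cache for that prefix can be computed once and reused, so only the new suffix pays a prefill pass. The sketch below is purely conceptual; `compute_kv` and the cache interface are hypothetical stand-ins, not any particular library's API.

```python
import hashlib

class PromptCache:
    """Conceptual prefix cache: map a prompt prefix to its precomputed KV cache."""

    def __init__(self):
        self._store = {}  # prefix hash -> KV cache object

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get(self, prefix: str):
        return self._store.get(self._key(prefix))

    def put(self, prefix: str, kv_cache) -> None:
        self._store[self._key(prefix)] = kv_cache

# Usage sketch: only the per-request suffix still needs fresh prefill.
# cache = PromptCache()
# kv = cache.get(system_prompt) or compute_kv(system_prompt)   # compute_kv is hypothetical
# cache.put(system_prompt, kv)
```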

Optimize Decode (tok/s)

  • Higher memory bandwidth
  • Quantization (smaller model)
  • Speculative decoding
  • Batching (for throughput)
  • KV cache quantization
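
The last bullet is easy to quantify: per token, the KV cache stores one key vector and one value vector for every layer and KV head, so dropping it from FP16 to 8-bit halves the footprint. The dimensions below (80 layers, 8 KV heads, head size 128, as in a Llama-2-70B-style model with grouped-query attention) are an illustrative assumption.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """KV cache size: 2 (key + value) vectors per layer per KV head per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

# Llama-2-70B-like geometry, 8k-token context
print(f"FP16 KV cache: {kv_cache_gb(80, 8, 128, 8192, 2.0):.1f} GB")  # ~2.7 GB
print(f"INT8 KV cache: {kv_cache_gb(80, 8, 128, 8192, 1.0):.1f} GB")  # ~1.3 GB
```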

Real-World Implications

Interactive Use (Chat)

Both phases matter:

  • TTFT determines how quickly the reply starts appearing
  • Decode speed determines whether the text streams faster than the user can read it

Batch Processing (API Serving)

Throughput is king:

  • Aggregate tokens per second across all requests matters more than any single request's latency
  • Batching many requests amortizes the per-token weight reads, trading some latency for much higher throughput

Long Context (RAG, Documents)

Prefill dominates:

  • With thousands of context tokens, most of the work happens before the first output token appears
  • Prompt caching, Flash Attention, and trimming retrieved context help far more here than raw decode speed

Measuring Performance

When benchmarking, measure both phases:

| Metric | What It Measures | Typical Good Values |
|---|---|---|
| Time to First Token (TTFT) | Prefill latency | <1 s for short prompts |
| Tokens per Second (decode) | Generation speed | 15-50, depending on model/hardware |
| Total Latency | TTFT + generation time | Depends on output length |
| Throughput (batched) | Total tok/s across all requests | Higher with batching |
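
A minimal way to capture the first two metrics from any streaming generation API is to timestamp the first token separately from the rest. The sketch below assumes a hypothetical `stream_tokens(prompt)` generator that yields tokens as they arrive; substitute whatever streaming call your client exposes.

```python
import time

def benchmark_request(stream_tokens, prompt: str):
    """Measure TTFT and decode tok/s for one request, given a token generator."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0

    for _ in stream_tokens(prompt):    # stream_tokens is a hypothetical streaming API
        if first_token_at is None:
            first_token_at = time.perf_counter()   # prefill ends here
        n_tokens += 1

    end = time.perf_counter()
    if first_token_at is None:         # no tokens produced
        return float("nan"), 0.0
    ttft = first_token_at - start
    decode_tok_s = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, decode_tok_s
```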

Speculative Decoding

A technique to speed up decode by using a smaller "draft" model:

Standard decode:
  Large model generates: T₁ → T₂ → T₃ → T₄ (one at a time)

Speculative decode:
  1. Small model quickly drafts: T₁, T₂, T₃, T₄ (fast)
  2. Large model verifies all at once (parallel, like prefill)
  3. Accept correct predictions, reject wrong ones
  4. Repeat from first rejected token

Speedup: 2-3× when draft model predictions are accurate

Speculative decoding converts some decode work into prefill-like parallel work, improving throughput.
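
The control flow is easier to see in code. The sketch below is a toy greedy version: `draft_next` and `target_next` are hypothetical stand-ins that return the next token id for a given context, and the verification loop stands in for what a real system does in a single batched forward pass of the large model.

```python
def speculative_decode(draft_next, target_next, prompt, max_new_tokens=64, k=4):
    """Toy greedy speculative decoding with stand-in next-token functions."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model cheaply proposes a block of k tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])

        # 2. Target model checks each position (one parallel pass in a real system).
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # 3. First wrong draft token: keep the target's correction, drop the rest.
                tokens += draft[:i] + [expected]
                break
        else:
            # 4. All k accepted; the target's next prediction comes along for free.
            tokens += draft + [target_next(tokens + draft)]
    return tokens[: len(prompt) + max_new_tokens]
```

With greedy decoding this produces exactly the same output as running the large model alone; the win is that accepted draft tokens cost one parallel verification pass instead of several sequential decode steps.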