LLM inference happens in two distinct phases with different computational characteristics. Understanding this distinction is crucial for diagnosing performance issues and choosing hardware.

The Two Phases

User sends prompt: "Explain quantum computing in simple terms"
        │
        ▼
┌────────────────────────────────────────────┐
│  PREFILL PHASE                              │
│  Process entire prompt in parallel          │
│  Build initial KV cache                     │
│  Compute-bound (matrix multiplications)     │
│  Latency: "time to first token" (TTFT)      │
└────────────────────────────────────────────┘
        │
        ▼
First token generated
        │
        ▼
┌────────────────────────────────────────────┐
│  DECODE PHASE                               │
│  Generate one token at a time               │
│  Read entire model for each token           │
│  Memory-bandwidth-bound                     │
│  Speed: "tokens per second" (tok/s)         │
└────────────────────────────────────────────┘
        │
        ▼
Response complete

Prefill: Processing the Prompt

During prefill, the model processes your entire input prompt in one forward pass.

Characteristics

  • Processes every prompt token in a single parallel forward pass
  • Builds the initial KV cache for the prompt
  • Compute-bound: dominated by large matrix multiplications
  • Its latency is what users experience as time to first token (TTFT)

What Affects Prefill Speed

| Factor | Impact |
|---|---|
| Prompt length | Longer prompts = more computation |
| GPU compute (TFLOPS) | More compute = faster prefill |
| Model size | Larger models = more computation per token |
| Flash Attention | Significant speedup through memory optimization |

Time to First Token (TTFT)

Prefill latency directly determines how long users wait before seeing any response. For interactive applications, a TTFT under one to two seconds feels responsive; long prompts (RAG contexts, document analysis) can push it to several seconds.
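A rough sanity check for TTFT follows from the factors above. The sketch below is a back-of-envelope estimate, assuming roughly 2 × parameters FLOPs per prompt token and an assumed utilization fraction of peak compute; the specific numbers are illustrative, not measurements.

```python
def estimate_prefill_seconds(
    n_params: float,           # model parameters, e.g. 70e9 for a 70B model
    prompt_tokens: int,        # input prompt length in tokens
    gpu_tflops: float,         # peak GPU throughput, e.g. ~165 FP16 TFLOPS for an RTX 4090
    utilization: float = 0.5,  # assumed fraction of peak actually achieved
) -> float:
    """Back-of-envelope prefill (TTFT) estimate: ~2 * params FLOPs per prompt token."""
    total_flops = 2.0 * n_params * prompt_tokens
    return total_flops / (gpu_tflops * 1e12 * utilization)

# Example: 70B model, 2,000-token prompt, RTX 4090-class GPU -> roughly 3.4 s
print(f"{estimate_prefill_seconds(70e9, 2000, 165):.1f} s")
```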

Decode: Generating Tokens

After prefill, the model generates output tokens one at a time.

Characteristics

  • Generates tokens autoregressively, one forward pass per token
  • Reads effectively all model weights (plus the KV cache) for every token
  • Memory-bandwidth-bound: compute units spend most of their time waiting on memory
  • Its speed is what users experience as tokens per second (tok/s)

What Affects Decode Speed

| Factor | Impact |
|---|---|
| Memory bandwidth | Higher bandwidth = faster decoding |
| Model size | Larger models = more data to read per token |
| Quantization | Lower precision = less data to read = faster |
| Batch size | More concurrent requests amortize bandwidth cost |

The Bandwidth Bottleneck

Here's why decode is memory-bound:

For each generated token, you must read:
├── All model weights: 35GB (for 70B Q4 model)
├── KV cache: varies with context
└── Total: ~35-50GB per token

At 30 tokens/second:
├── Need to read: 30 × 35GB = 1,050 GB/s
└── This is MEMORY BANDWIDTH, not compute

GPU bandwidth comparison:
├── RTX 4090: 1,008 GB/s → ~29 tok/s theoretical max
├── RTX 3090: 936 GB/s → ~27 tok/s theoretical max
├── A100 80GB: 2,039 GB/s → ~58 tok/s theoretical max
└── Mac Studio M2 Ultra: 800 GB/s → ~23 tok/s theoretical max
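The same arithmetic is easy to script. This sketch simply divides memory bandwidth by the bytes read per token to reproduce the theoretical ceilings above; it ignores KV cache traffic and other overheads, so real-world numbers land lower.

```python
def max_decode_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Theoretical decode ceiling: each token requires one full read of the weights."""
    return bandwidth_gb_s / model_size_gb

model_gb = 35  # ~70B parameters at 4-bit quantization
for gpu, bw in [("RTX 4090", 1008), ("RTX 3090", 936),
                ("A100 80GB", 2039), ("M2 Ultra", 800)]:
    print(f"{gpu:12s} {max_decode_tok_s(bw, model_gb):5.1f} tok/s max")
```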

Why TFLOPS Don't Matter for Decode

A GPU might have 80 TFLOPS of compute, but if it can only read 1 TB/s from memory, the compute units sit idle waiting for data. Decode is almost always bandwidth-limited, not compute-limited.
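To put numbers on that claim, compare the per-token compute time against the per-token memory time for the same 70B Q4 model on an A100, assuming roughly 2 FLOPs per parameter per token and the A100's nominal 312 FP16 TFLOPS and 2,039 GB/s. The figures are rough, but the gap is what matters:

```python
params = 70e9            # 70B parameters
weights_gb = 35          # 4-bit quantized weights
tflops = 312             # A100 peak FP16 tensor throughput
bandwidth_gb_s = 2039    # A100 80GB HBM bandwidth

compute_ms = 2 * params / (tflops * 1e12) * 1e3   # ~0.45 ms of math per token
memory_ms = weights_gb / bandwidth_gb_s * 1e3     # ~17 ms just reading the weights

print(f"compute: {compute_ms:.2f} ms/token, memory: {memory_ms:.1f} ms/token "
      f"({memory_ms / compute_ms:.0f}x gap)")
```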

Different Bottlenecks, Different Solutions

Optimize Prefill (TTFT)

  • Flash Attention
  • Tensor parallelism (multi-GPU)
  • Higher compute GPUs
  • Prompt caching (see the sketch after this list)
  • Shorter prompts
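
Prompt caching deserves a quick illustration: when many requests share a prefix (a system prompt, a RAG template), the KV cache for that prefix can be computed once and reused, so only the new suffix pays a prefill pass. The sketch below is purely conceptual; `compute_kv` and the cache interface are hypothetical stand-ins, not any particular library's API.

```python
import hashlib

class PromptCache:
    """Conceptual prefix cache: map a prompt prefix to its precomputed KV cache."""

    def __init__(self):
        self._store = {}  # prefix hash -> KV cache object

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def get(self, prefix: str):
        return self._store.get(self._key(prefix))

    def put(self, prefix: str, kv_cache) -> None:
        self._store[self._key(prefix)] = kv_cache

# Usage sketch: only the per-request suffix still needs fresh prefill.
# cache = PromptCache()
# kv = cache.get(system_prompt) or compute_kv(system_prompt)   # compute_kv is hypothetical
# cache.put(system_prompt, kv)
```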

Optimize Decode (tok/s)

  • Higher memory bandwidth
  • Quantization (smaller model)
  • Speculative decoding
  • Batching (for throughput)
  • KV cache quantization
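
The last bullet is easy to quantify: per token, the KV cache stores one key vector and one value vector for every layer and KV head, so dropping it from FP16 to 8-bit halves the footprint. The dimensions below (80 layers, 8 KV heads, head size 128, as in a Llama-2-70B-style model with grouped-query attention) are an illustrative assumption.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    """KV cache size: 2 (key + value) vectors per layer per KV head per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1e9

# Llama-2-70B-like geometry, 8k-token context
print(f"FP16 KV cache: {kv_cache_gb(80, 8, 128, 8192, 2.0):.1f} GB")  # ~2.7 GB
print(f"INT8 KV cache: {kv_cache_gb(80, 8, 128, 8192, 1.0):.1f} GB")  # ~1.3 GB
```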

Real-World Implications

Interactive Use (Chat)

Both phases matter:

  • TTFT determines how quickly the reply starts appearing
  • Decode speed determines whether the text streams faster than the user can read it

Batch Processing (API Serving)

Throughput is king:

  • Aggregate tokens per second across all requests matters more than any single request's latency
  • Batching many requests amortizes the per-token weight reads, trading some latency for much higher throughput

Long Context (RAG, Documents)

Prefill dominates:

  • With thousands of context tokens, most of the work happens before the first output token appears
  • Prompt caching, Flash Attention, and trimming retrieved context help far more here than raw decode speed

Measuring Performance

When benchmarking, measure both phases:

| Metric | What It Measures | Typical Good Values |
|---|---|---|
| Time to First Token (TTFT) | Prefill latency | <1 s for short prompts |
| Tokens per Second (decode) | Generation speed | 15-50, depending on model/hardware |
| Total Latency | TTFT + generation time | Depends on output length |
| Throughput (batched) | Total tok/s across all requests | Higher with batching |
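
A minimal way to capture the first two metrics from any streaming generation API is to timestamp the first token separately from the rest. The sketch below assumes a hypothetical `stream_tokens(prompt)` generator that yields tokens as they arrive; substitute whatever streaming call your client exposes.

```python
import time

def benchmark_request(stream_tokens, prompt: str):
    """Measure TTFT and decode tok/s for one request, given a token generator."""
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0

    for _ in stream_tokens(prompt):    # stream_tokens is a hypothetical streaming API
        if first_token_at is None:
            first_token_at = time.perf_counter()   # prefill ends here
        n_tokens += 1

    end = time.perf_counter()
    if first_token_at is None:         # no tokens produced
        return float("nan"), 0.0
    ttft = first_token_at - start
    decode_tok_s = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft, decode_tok_s
```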

Speculative Decoding

A technique to speed up decode by using a smaller "draft" model:

Standard decode:
  Large model generates: T₁ → T₂ → T₃ → T₄ (one at a time)

Speculative decode:
  1. Small model quickly drafts: T₁, T₂, T₃, T₄ (fast)
  2. Large model verifies all at once (parallel, like prefill)
  3. Accept correct predictions, reject wrong ones
  4. Repeat from first rejected token

Speedup: 2-3× when draft model predictions are accurate

Speculative decoding converts some decode work into prefill-like parallel work, improving throughput.
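
The control flow is easier to see in code. The sketch below is a toy greedy version: `draft_next` and `target_next` are hypothetical stand-ins that return the next token id for a given context, and the verification loop stands in for what a real system does in a single batched forward pass of the large model.

```python
def speculative_decode(draft_next, target_next, prompt, max_new_tokens=64, k=4):
    """Toy greedy speculative decoding with stand-in next-token functions."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft model cheaply proposes a block of k tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])

        # 2. Target model checks each position (one parallel pass in a real system).
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed != expected:
                # 3. First wrong draft token: keep the target's correction, drop the rest.
                tokens += draft[:i] + [expected]
                break
        else:
            # 4. All k accepted; the target's next prediction comes along for free.
            tokens += draft + [target_next(tokens + draft)]
    return tokens[: len(prompt) + max_new_tokens]
```

With greedy decoding this produces exactly the same output as running the large model alone; the win is that accepted draft tokens cost one parallel verification pass instead of several sequential decode steps.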