# Context Window
How much text the model can "see" at once
The context window is the maximum number of tokens a model can process in a single forward pass. It includes both your input (prompt) and the model's output (generation). Everything outside this window is invisible to the model.
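Because input and output share the same window, the practical check is simple arithmetic. A minimal sketch (the 8K window and the token counts are purely illustrative numbers):

```python
def fits_in_window(prompt_tokens: int, max_new_tokens: int, context_window: int = 8192) -> bool:
    # Prompt and generation share one budget: the model can only generate
    # into whatever room the prompt leaves free.
    return prompt_tokens + max_new_tokens <= context_window

print(fits_in_window(6000, 1024))  # True  (7,024 of 8,192 tokens used)
print(fits_in_window(7500, 1024))  # False (the prompt leaves only 692 tokens for output)
```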
## What Context Length Means
### Context Lengths by Model
| Model | Native Context | Extended (with RoPE scaling) |
|---|---|---|
| GPT-4 Turbo | 128K | — |
| Claude 3 | 200K | — |
| Llama 3 8B | 8K | Up to 128K |
| Llama 3 70B | 8K | Up to 128K |
| Mistral 7B | 8K | Up to 32K |
| Mixtral 8x7B | 32K | — |
### Tokens vs Characters vs Words
Context is measured in tokens, not characters or words:
| Context Length | ~Words | ~Pages | Use Cases |
|---|---|---|---|
| 4K tokens | ~3,000 | ~6 | Chat, simple Q&A |
| 8K tokens | ~6,000 | ~12 | Longer conversations, short docs |
| 32K tokens | ~24,000 | ~48 | Long documents, code files |
| 128K tokens | ~96,000 | ~192 | Books, large codebases |
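The words-per-token ratio above is only a rule of thumb (roughly 0.75 words per token for English prose). When it matters, count tokens with a real tokenizer rather than guessing. A minimal sketch using tiktoken's `cl100k_base` encoding as a stand-in; Llama, Mistral, and other models ship their own tokenizers, so exact counts will differ:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in; use your model's own tokenizer for exact counts
text = "The context window is the maximum number of tokens a model can process at once."
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```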
## Why Context Length Matters
### What Fits in Context
- Conversation history: Multi-turn chats accumulate tokens
- Documents for analysis: PDFs, code files, articles
- Few-shot examples: Examples to guide model behavior
- System prompts: Instructions that set up the model
- RAG context: Retrieved documents for grounded answers
### The Memory Tradeoff
Longer context requires more memory for the KV cache:
| Context Length | KV Cache (70B, FP16) | Total VRAM (Q4 weights + KV cache) |
|---|---|---|
| 4K | ~10 GB | ~45 GB |
| 16K | ~42 GB | ~77 GB |
| 32K | ~84 GB | ~119 GB |
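These figures follow from the KV-cache size formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. A rough sketch, assuming a 70B-class model with 80 layers, 64 KV heads of dimension 128, and an FP16 cache; models that use grouped-query attention keep far fewer KV heads and need proportionally less:

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_bytes(ctx) / 1e9:.1f} GB KV cache")
```

This lands close to the table above (about 10.7, 42.9, and 85.9 GB).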
"Supports 128K" vs "Can Actually Use 128K"
Just because a model supports 128K context doesn't mean you have the VRAM to use it. The KV cache at 128K context can require more memory than the model weights themselves.
## Context Length vs Quality
Models don't perform equally well at all context lengths:
"Lost in the Middle"
Research shows models pay more attention to the beginning and end of context, less to the middle. For very long contexts:
- Put important information at the start or end (see the sketch after this list)
- Don't assume the model "read" everything in the middle
- Consider chunking and summarization for very long documents
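A minimal sketch of that ordering, with hypothetical helper names: keep the instructions and the question at the edges and let the bulky material fill the middle:

```python
def build_prompt(instructions, chunks, question):
    # Important material at the edges: instructions open the prompt,
    # the question closes it, and the bulky chunks sit in the middle.
    return "\n\n".join([instructions, *chunks, f"Question: {question}"])
```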
## Extending Context
### RoPE Scaling
Rotary Position Embedding (RoPE) can be scaled to extend context beyond training length:
```bash
# llama.cpp example
./main -m model.gguf --rope-scaling linear --rope-freq-scale 0.5 -c 16384
```
This works, but quality typically degrades compared to a model trained natively at that length.
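What linear scaling actually does is compress position indices so that positions beyond the trained length map back into the range the model saw during training. A minimal sketch of the standard rotary-angle computation (the dimension and base values are common defaults, assumed here):

```python
def rope_angles(position, dim=128, base=10000.0, freq_scale=1.0):
    # Each pair of dimensions is rotated by an angle proportional to the
    # (scaled) position; freq_scale < 1 squeezes long sequences into the
    # position range the model was trained on.
    pos = position * freq_scale
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

# With freq_scale=0.5, position 16000 yields the same angles the model saw
# at position 8000 during training: a 16K context folded into an 8K-trained range.
assert rope_angles(16000, freq_scale=0.5) == rope_angles(8000)
```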
### Sliding Window Attention
Some models (Mistral) use sliding window attention that limits each token to attending only to recent tokens, enabling longer sequences with fixed memory cost — but with limitations on long-range dependencies.
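A minimal sketch of the attention mask this implies: each position attends only causally and only within the last `window` positions (a window of 3 here purely for illustration; Mistral 7B's window is 4096):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where a query token (row) may attend to a key token (column):
    # no future tokens, and nothing older than `window` positions back.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k > q - window)

print(sliding_window_mask(6, 3).astype(int))
```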
## Practical Guidance
### Right-Size Your Context
Don't use more context than you need. Shorter context means: less VRAM for KV cache, faster prefill, more room for model weights. If your use case needs 4K, don't set context to 32K "just in case."
### For Chat Applications
- 4-8K is usually plenty for conversational use
- Implement context pruning for long conversations (see the sketch after this list)
- Summarize old messages rather than keeping everything
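A minimal sketch of the pruning step, with `count_tokens` standing in for whatever tokenizer-backed counter you use (a hypothetical helper, passed in as a parameter):

```python
def prune_history(system_prompt, messages, count_tokens, budget):
    # Always keep the system prompt; then keep the newest messages that
    # still fit in `budget` tokens. Anything older is dropped (or, better,
    # replaced by a running summary).
    used = count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system_prompt] + list(reversed(kept))

# Crude whitespace counting as a stand-in for a real tokenizer:
pruned = prune_history("You are a helpful assistant.",
                       ["first question ...", "first answer ...", "latest question ..."],
                       count_tokens=lambda s: len(s.split()),
                       budget=12)
```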
### For Document Analysis
- Measure your typical document sizes in tokens
- Consider RAG over whole-document stuffing
- For very long documents, chunk and process separately
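A minimal chunking sketch, again using tiktoken as a stand-in tokenizer; `max_tokens` and `overlap` are arbitrary illustrative values:

```python
import tiktoken

def chunk_by_tokens(text, max_tokens=2000, overlap=200):
    # Split on token boundaries, with a small overlap so content cut at a
    # chunk boundary still appears intact in the next chunk.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        start += max_tokens - overlap
    return chunks
```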
### For Code
- Code files can be surprisingly token-heavy
- Consider file-level context rather than repo-level
- Use context for relevant files, not everything
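A minimal sketch of budget-aware file selection; the relevance ranking and the `count_tokens` helper are assumed to exist and are not provided here:

```python
def pick_files(ranked_files, count_tokens, budget):
    # `ranked_files` is a list of (path, source_text) pairs, most relevant
    # first (ranked by search, embeddings, or the dependency graph).
    picked, used = [], 0
    for path, source in ranked_files:
        cost = count_tokens(source)
        if used + cost > budget:
            continue  # too big for what's left; a smaller file may still fit
        picked.append(path)
        used += cost
    return picked
```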