The context window is the maximum number of tokens a model can process in a single forward pass. It includes both your input (prompt) and the model's output (generation). Everything outside this window is invisible to the model.

What Context Length Means

Context Window = 8,192 tokens

┌──────────────────────────────────────────────────────────┐
│                                                          │
│  System prompt:                     ~500 tokens          │
│  + Your question:                   ~200 tokens          │
│  + Document you're asking about:  ~5,000 tokens          │
│  + Model's response:              ~2,000 tokens          │
│  ────────────────────────────────                        │
│  Total:                            7,700 tokens  ✓ Fits! │
│                                                          │
└──────────────────────────────────────────────────────────┘

If total exceeds 8,192: oldest content gets truncated or you get an error, depending on implementation.
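To budget a prompt before sending it, you can count tokens explicitly. The sketch below uses OpenAI's tiktoken library and its cl100k_base encoding purely as an illustration; local models ship their own tokenizers, so treat the counts as approximate.

# Rough context-budget check (illustrative; assumes the tiktoken library —
# your model's own tokenizer will count somewhat differently).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(system: str, question: str, document: str,
                    max_response_tokens: int, context_window: int = 8192) -> bool:
    used = sum(len(enc.encode(text)) for text in (system, question, document))
    return used + max_response_tokens <= context_window

print(fits_in_context(
    system="You are a helpful assistant.",
    question="Summarize the key findings of the attached report.",
    document="(document text here)",
    max_response_tokens=2000,
))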

Context Lengths by Model

Model           Native Context    Extended (with RoPE scaling)
GPT-4 Turbo     128K              —
Claude 3        200K              —
Llama 3 8B      8K                Up to 128K
Llama 3 70B     8K                Up to 128K
Mistral 7B      8K                Up to 32K
Mixtral 8x7B    32K               —

Tokens vs Characters vs Words

Context is measured in tokens, not characters or words:

Rough conversion (English text):
├── 1 token ≈ 4 characters
├── 1 token ≈ 0.75 words
├── 1 word ≈ 1.3 tokens
└── 1 page (~500 words) ≈ 650 tokens

Example: "The quick brown fox jumps over the lazy dog" = 9 words = 10 tokens

Code tends to use more tokens per character than prose.
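The prose-versus-code difference is easy to check yourself. A minimal sketch, again assuming tiktoken's cl100k_base encoding as the example tokenizer:

# Compare characters-per-token for prose vs. code (illustrative tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prose = "The quick brown fox jumps over the lazy dog"
code = "for (int i = 0; i < n; ++i) { total += values[i]; }"

for label, text in (("prose", prose), ("code", code)):
    n_tokens = len(enc.encode(text))
    print(f"{label}: {len(text)} chars, {n_tokens} tokens, "
          f"{len(text) / n_tokens:.1f} chars/token")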
Context Length   ~Words    ~Pages   Use Cases
4K tokens        ~3,000    ~6       Chat, simple Q&A
8K tokens        ~6,000    ~12      Longer conversations, short docs
32K tokens       ~24,000   ~48      Long documents, code files
128K tokens      ~96,000   ~192     Books, large codebases

Why Context Length Matters

What Fits in Context

The Memory Tradeoff

Longer context requires more memory for the KV cache:

Context Length   KV Cache (70B, FP16)   Total VRAM Needed
4K               ~10 GB                 ~45 GB (Q4 weights + cache)
16K              ~42 GB                 ~77 GB
32K              ~84 GB                 ~119 GB
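The table's figures can be roughly reproduced from the standard formula: per token, the cache stores keys and values for every layer and every KV head. A back-of-the-envelope sketch, assuming a 70B-class model with full multi-head attention (80 layers, 64 heads of dimension 128) and an FP16 cache; models with grouped-query attention, such as Llama 3 70B with its 8 KV heads, need roughly 8x less.

# KV cache size ≈ 2 (K and V) x layers x KV heads x head dim x bytes x tokens.
# The default parameters below are assumptions chosen to match the table above.
def kv_cache_bytes(context_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 64,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * context_len

for ctx in (4096, 16384, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_bytes(ctx) / 1e9:.0f} GB")

Under these assumptions, the cache at 128K context runs to several hundred gigabytes, which is exactly the problem the next subsection describes.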

"Supports 128K" vs "Can Actually Use 128K"

Just because a model supports 128K context doesn't mean you have the VRAM to use it. The KV cache at 128K context can require more memory than the model weights themselves.

Context Length vs Quality

Models don't perform equally well at all context lengths:

Performance across context (typical pattern):

Quality
   │
   │ ████████████████
   │ ████████████████████
   │ ████████████████████████████
   │ ██████████████████████████████████
   └───────────────────────────────────────  Context length
     0     4K     8K     16K    32K    64K

Most models work best within their training context length. Extended context via RoPE scaling works but quality degrades.

"Lost in the Middle"

Research shows models pay more attention to the beginning and end of the context and less to the middle. For very long contexts, put the most important information near the start or the end of the prompt rather than burying it in the middle.
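One common mitigation when stuffing many retrieved chunks into a long prompt is to interleave them so the strongest chunks sit at both ends and the weakest in the middle. A hypothetical helper, not tied to any particular library:

# Reorder chunks (given best-first) so the best land at the start and end of
# the prompt and the weakest end up in the middle.
def reorder_for_long_context(chunks_best_first: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

print(reorder_for_long_context(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B'] -> best chunk first, second-best last, worst in the middle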

Extending Context

RoPE Scaling

Rotary Position Embedding (RoPE) can be scaled to extend context beyond training length:

# llama.cpp example: run an 8K-native model with a 16K window via linear RoPE
# scaling (--rope-freq-scale 0.5 halves the RoPE frequency, doubling effective
# context; newer llama.cpp builds name this binary llama-cli)
./main -m model.gguf --rope-scaling linear --rope-freq-scale 0.5 -c 16384

This works but typically degrades quality compared to native training at that length.
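The same idea is exposed in other runtimes. Below is a sketch of how it is typically configured in Hugging Face transformers; the exact config keys vary by library version (newer releases spell the type key "rope_type"), and the model name is only illustrative.

# Linear RoPE scaling via the model config (illustrative; verify the
# rope_scaling keys your transformers version expects before relying on this).
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
config.rope_scaling = {"type": "linear", "factor": 2.0}   # 8K native -> ~16K
config.max_position_embeddings = 16384

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B",
                                             config=config)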

Sliding Window Attention

Some models (Mistral) use sliding window attention that limits each token to attending only to recent tokens, enabling longer sequences with fixed memory cost — but with limitations on long-range dependencies.
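A toy illustration of the mask this produces: each query position may attend only to itself and the previous window − 1 positions rather than the full causal prefix (Mistral 7B's published window is 4,096 tokens; the tiny sizes below are just for printing).

# Sliding-window causal attention mask: True where attention is allowed.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    i = np.arange(seq_len)[:, None]     # query positions
    j = np.arange(seq_len)[None, :]     # key positions
    return (j <= i) & (j > i - window)  # causal AND within the last `window`

print(sliding_window_mask(6, 3).astype(int))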

Practical Guidance

Right-Size Your Context

Don't use more context than you need. Shorter context means: less VRAM for KV cache, faster prefill, more room for model weights. If your use case needs 4K, don't set context to 32K "just in case."
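With llama.cpp that is simply the -c flag shown earlier; with the llama-cpp-python binding (used here only as an example runtime) the equivalent is the n_ctx argument at load time.

# Load a model with a deliberately small 4K window to keep the KV cache small.
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=4096)
out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])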

For Chat Applications

For Document Analysis

For Code