# Context Window
How much text the model can "see" at once
The context window is the maximum number of tokens a model can process in a single forward pass. It includes both your input (prompt) and the model's output (generation). Everything outside this window is invisible to the model.
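Because input and output share the same window, the practical check is simple arithmetic. A minimal sketch (the 8K window and the token counts are purely illustrative numbers):

```python
def fits_in_window(prompt_tokens: int, max_new_tokens: int, context_window: int = 8192) -> bool:
    # Prompt and generation share one budget: the model can only generate
    # into whatever room the prompt leaves free.
    return prompt_tokens + max_new_tokens <= context_window

print(fits_in_window(6000, 1024))  # True  (7,024 of 8,192 tokens used)
print(fits_in_window(7500, 1024))  # False (the prompt leaves only 692 tokens for output)
```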
## What Context Length Means
### Context Lengths by Model
| Model | Native Context | Extended (with RoPE scaling) |
|---|---|---|
| GPT-4 Turbo | 128K | — |
| Claude 3 | 200K | — |
| Llama 3 8B | 8K | Up to 128K |
| Llama 3 70B | 8K | Up to 128K |
| Mistral 7B | 8K | Up to 32K |
| Mixtral 8x7B | 32K | — |
### Tokens vs Characters vs Words
Context is measured in tokens, not characters or words:
| Context Length | ~Words | ~Pages | Use Cases |
|---|---|---|---|
| 4K tokens | ~3,000 | ~6 | Chat, simple Q&A |
| 8K tokens | ~6,000 | ~12 | Longer conversations, short docs |
| 32K tokens | ~24,000 | ~48 | Long documents, code files |
| 128K tokens | ~96,000 | ~192 | Books, large codebases |
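The words-per-token ratio above is only a rule of thumb (roughly 0.75 words per token for English prose). When it matters, count tokens with a real tokenizer rather than guessing. A minimal sketch using tiktoken's `cl100k_base` encoding as a stand-in; Llama, Mistral, and other models ship their own tokenizers, so exact counts will differ:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in; use your model's own tokenizer for exact counts
text = "The context window is the maximum number of tokens a model can process at once."
tokens = enc.encode(text)
print(f"{len(text.split())} words -> {len(tokens)} tokens")
```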
## Why Context Length Matters
### What Fits in Context
- Conversation history: Multi-turn chats accumulate tokens
- Documents for analysis: PDFs, code files, articles
- Few-shot examples: Examples to guide model behavior
- System prompts: Instructions that set up the model
- RAG context: Retrieved documents for grounded answers
### The Memory Tradeoff
Longer context requires more memory for the KV cache:
| Context Length | KV Cache (70B, FP16) | Total VRAM (Q4 weights + KV cache) |
|---|---|---|
| 4K | ~10 GB | ~45 GB |
| 16K | ~42 GB | ~77 GB |
| 32K | ~84 GB | ~119 GB |
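These figures follow from the KV-cache size formula: 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. A rough sketch, assuming a 70B-class model with 80 layers, 64 KV heads of dimension 128, and an FP16 cache; models that use grouped-query attention keep far fewer KV heads and need proportionally less:

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=64, head_dim=128, bytes_per_elem=2):
    # 2x for keys and values, stored per layer, per head, per position.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len

for ctx in (4096, 16384, 32768):
    print(f"{ctx:>6} tokens -> ~{kv_cache_bytes(ctx) / 1e9:.1f} GB KV cache")
```

This lands close to the table above (about 10.7, 42.9, and 85.9 GB).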
"Supports 128K" vs "Can Actually Use 128K"
Just because a model supports 128K context doesn't mean you have the VRAM to use it. The KV cache at 128K context can require more memory than the model weights themselves.
## Context Length vs Quality
Models don't perform equally well at all context lengths:
"Lost in the Middle"
Research shows models pay more attention to the beginning and end of context, less to the middle. For very long contexts:
- Put important information at the start or end (see the sketch after this list)
- Don't assume the model "read" everything in the middle
- Consider chunking and summarization for very long documents
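A minimal sketch of that ordering, with hypothetical helper names: keep the instructions and the question at the edges and let the bulky material fill the middle:

```python
def build_prompt(instructions, chunks, question):
    # Important material at the edges: instructions open the prompt,
    # the question closes it, and the bulky chunks sit in the middle.
    return "\n\n".join([instructions, *chunks, f"Question: {question}"])
```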
## Extending Context
### RoPE Scaling
Rotary Position Embedding (RoPE) can be scaled to extend context beyond training length:
```bash
# llama.cpp example
./main -m model.gguf --rope-scaling linear --rope-freq-scale 0.5 -c 16384
```
This works, but quality typically degrades compared to a model trained natively at that length.
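What linear scaling actually does is compress position indices so that positions beyond the trained length map back into the range the model saw during training. A minimal sketch of the standard rotary-angle computation (the dimension and base values are common defaults, assumed here):

```python
def rope_angles(position, dim=128, base=10000.0, freq_scale=1.0):
    # Each pair of dimensions is rotated by an angle proportional to the
    # (scaled) position; freq_scale < 1 squeezes long sequences into the
    # position range the model was trained on.
    pos = position * freq_scale
    return [pos / base ** (2 * i / dim) for i in range(dim // 2)]

# With freq_scale=0.5, position 16000 yields the same angles the model saw
# at position 8000 during training: a 16K context folded into an 8K-trained range.
assert rope_angles(16000, freq_scale=0.5) == rope_angles(8000)
```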
### Sliding Window Attention
Some models (Mistral) use sliding window attention that limits each token to attending only to recent tokens, enabling longer sequences with fixed memory cost — but with limitations on long-range dependencies.
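A minimal sketch of the attention mask this implies: each position attends only causally and only within the last `window` positions (a window of 3 here purely for illustration; Mistral 7B's window is 4096):

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    # True where a query token (row) may attend to a key token (column):
    # no future tokens, and nothing older than `window` positions back.
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k > q - window)

print(sliding_window_mask(6, 3).astype(int))
```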
## Practical Guidance
### Right-Size Your Context
Don't use more context than you need. Shorter context means: less VRAM for KV cache, faster prefill, more room for model weights. If your use case needs 4K, don't set context to 32K "just in case."
### For Chat Applications
- 4-8K is usually plenty for conversational use
- Implement context pruning for long conversations (see the sketch after this list)
- Summarize old messages rather than keeping everything
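A minimal sketch of the pruning step, with `count_tokens` standing in for whatever tokenizer-backed counter you use (a hypothetical helper, passed in as a parameter):

```python
def prune_history(system_prompt, messages, count_tokens, budget):
    # Always keep the system prompt; then keep the newest messages that
    # still fit in `budget` tokens. Anything older is dropped (or, better,
    # replaced by a running summary).
    used = count_tokens(system_prompt)
    kept = []
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system_prompt] + list(reversed(kept))

# Crude whitespace counting as a stand-in for a real tokenizer:
pruned = prune_history("You are a helpful assistant.",
                       ["first question ...", "first answer ...", "latest question ..."],
                       count_tokens=lambda s: len(s.split()),
                       budget=12)
```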
### For Document Analysis
- Measure your typical document sizes in tokens
- Consider RAG over whole-document stuffing
- For very long documents, chunk and process separately
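A minimal chunking sketch, again using tiktoken as a stand-in tokenizer; `max_tokens` and `overlap` are arbitrary illustrative values:

```python
import tiktoken

def chunk_by_tokens(text, max_tokens=2000, overlap=200):
    # Split on token boundaries, with a small overlap so content cut at a
    # chunk boundary still appears intact in the next chunk.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + max_tokens]))
        start += max_tokens - overlap
    return chunks
```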
### For Code
- Code files can be surprisingly token-heavy
- Consider file-level context rather than repo-level
- Use context for relevant files, not everything
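A minimal sketch of budget-aware file selection; the relevance ranking and the `count_tokens` helper are assumed to exist and are not provided here:

```python
def pick_files(ranked_files, count_tokens, budget):
    # `ranked_files` is a list of (path, source_text) pairs, most relevant
    # first (ranked by search, embeddings, or the dependency graph).
    picked, used = [], 0
    for path, source in ranked_files:
        cost = count_tokens(source)
        if used + cost > budget:
            continue  # too big for what's left; a smaller file may still fit
        picked.append(path)
        used += cost
    return picked
```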