# Context Windows & Memory
*The Attention Bottleneck*
## What Is a Context Window?
The context window is the maximum number of tokens an LLM can process in a single forward pass. It includes everything: the system prompt, conversation history, user message, and the model's response.
Think of it as the model's "working memory." Anything inside the context window can be attended to; anything outside is invisible. The model has no persistent memory — each API call starts fresh with only the tokens you provide.
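Because the model is stateless, the caller is responsible for resending the entire conversation on every request. A minimal sketch of that pattern, using a stand-in `fake_model_reply` function in place of a real API call (hypothetical, for illustration only):

```python
def fake_model_reply(messages):
    """Stand-in for a real model call: reports how much context it received."""
    return f"(model saw {len(messages)} messages)"

def chat_turn(history, user_text):
    """Append the user message, call the model with the WHOLE history,
    and append the reply. Nothing persists server-side between calls."""
    history.append({"role": "user", "content": user_text})
    reply = fake_model_reply(history)
    history.append({"role": "assistant", "content": reply})
    return reply

history = [{"role": "system", "content": "You are a helpful assistant."}]
chat_turn(history, "Hello!")
chat_turn(history, "What did I just say?")
# The second call carried all four prior messages --
# that resent history is the only "memory" the model has.
```

Every token of that resent history counts against the context window, which is why long conversations eventually need trimming or summarization.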
Why is it limited? The self-attention mechanism computes a score between every pair of tokens. With n tokens, that's n² scores. Doubling the context length quadruples the memory and compute.
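A back-of-envelope sketch of that quadratic cost: with n tokens, self-attention materializes an n x n matrix of pairwise scores, so doubling n quadruples both the score count and the memory for that matrix (the 2-bytes-per-score fp16 assumption is illustrative):

```python
def attention_scores(n_tokens):
    """Number of pairwise attention scores for a sequence of n tokens."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):  # fp16 scores assumed
    """Approximate memory for ONE attention matrix (one head, one layer)."""
    return attention_scores(n_tokens) * bytes_per_score

for n in (1024, 2048, 4096):
    mb = score_matrix_bytes(n) / 2**20
    print(f"{n:>5} tokens -> {attention_scores(n):>12,} scores, ~{mb:.0f} MiB")

# Doubling the sequence length quadruples the score count.
assert attention_scores(2048) == 4 * attention_scores(1024)
```

Real inference stacks reduce this with tricks like FlashAttention, but the underlying pairwise structure is why context length cannot grow for free.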
Context window sizes have grown rapidly:
| Model | Context Window |
|---|---|
| GPT-2 (2019) | 1,024 tokens |
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8K / 32K |
| Claude 3 (2024) | 200K tokens |
| Gemini 1.5 (2024) | 1M tokens |
Practical impact: A 4K context window holds roughly 3,000 words — about 6 pages. A 200K window holds ~150,000 words — an entire novel. Larger windows enable processing entire codebases, long documents, and extended conversations without losing context.
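The conversions above follow simple rules of thumb: roughly 0.75 English words per token and about 500 words per page. A small sketch using those ratios (they are rough averages, not exact values):

```python
# Rough English-text ratios used in this section (assumptions, not exact).
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500

def tokens_to_words(tokens):
    return int(tokens * WORDS_PER_TOKEN)

def tokens_to_pages(tokens):
    return tokens_to_words(tokens) / WORDS_PER_PAGE

print(tokens_to_words(4096), "words")            # a "4K" window: ~3,000 words
print(round(tokens_to_pages(4096)), "pages")     # about 6 pages
print(tokens_to_words(200_000), "words")         # a 200K window: an entire novel
```

For anything beyond rough budgeting, use the model's actual tokenizer, since the words-per-token ratio varies with language and content.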
## Context Window Sizes Across Models
| Model | Context Length | Approx. Words | Equivalent Content |
|---|---|---|---|
| GPT-2 | 1,024 tokens | ~750 words | 1-2 pages of text |
| GPT-3 | 2,048 tokens | ~1,500 words | 3-4 pages of text |
| GPT-4 | 8,192 tokens | ~6,000 words | 12 pages, a long essay |
| GPT-4-32K | 32,768 tokens | ~24,000 words | 50 pages, a novella chapter |
| Claude 3 | 200,000 tokens | ~150,000 words | An entire novel |
| Gemini 1.5 | 1,000,000 tokens | ~750,000 words | Multiple textbooks |
## Context Window Budget
| What Consumes Context | Typical Size | Notes |
|---|---|---|
| System prompt | 100-2,000 tokens | Present in every request |
| Conversation history | 500-50,000 tokens | Grows with each turn |
| RAG context | 500-10,000 tokens | Retrieved documents |
| User message | 10-5,000 tokens | The current query |
| Model response | 50-4,000 tokens | Generated output |
| Total budget | Context window limit | Everything must fit |
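One common way to keep everything within the total budget is to drop the oldest conversation turns until the request fits. A sketch under stated assumptions: token counts here are crude whitespace word counts scaled by ~4/3 tokens per word, and the function names are hypothetical; a real application would use the model's tokenizer.

```python
def count_tokens(text):
    """Very rough token estimate: ~4/3 tokens per whitespace word (assumption)."""
    return int(len(text.split()) * 4 / 3) + 1

def fit_to_budget(system, history, user_msg, window=8192, reserve_reply=1000):
    """Trim the oldest history turns until system prompt + history +
    user message + reserved response space fits inside the window."""
    fixed = count_tokens(system) + count_tokens(user_msg) + reserve_reply
    trimmed = list(history)
    while trimmed and fixed + sum(count_tokens(m) for m in trimmed) > window:
        trimmed.pop(0)  # drop the oldest turn first
    return trimmed

history = [f"turn {i}: " + "lorem ipsum " * 200 for i in range(30)]
kept = fit_to_budget("You are helpful.", history, "Summarize our chat.")
print(f"kept {len(kept)} of {len(history)} turns")
```

Dropping whole turns from the front is the simplest policy; production systems often summarize the evicted turns instead so that older context survives in compressed form.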