The context window is the maximum number of tokens an LLM can process in a single forward pass. It includes everything: the system prompt, conversation history, user message, and the model's response.

Think of it as the model's "working memory." Anything inside the context window can be attended to; anything outside is invisible. The model has no persistent memory — each API call starts fresh with only the tokens you provide.

Why is it limited? The self-attention mechanism computes a score between every pair of tokens. With n tokens, that's n² scores. Doubling the context length quadruples the memory and compute.

Context window sizes have grown rapidly:

| Model | Context Window | |---------------|----------------| | GPT-2 (2019) | 1,024 tokens | | GPT-3 (2020) | 2,048 tokens | | GPT-4 (2023) | 8K / 32K | | Claude 3 (2024)| 200K tokens | | Gemini 1.5 (2024)| 1M tokens |

Practical impact: A 4K context window holds roughly 3,000 words — about 6 pages. A 200K window holds ~150,000 words — an entire novel. Larger windows enable processing entire codebases, long documents, and extended conversations without losing context.

Model	Context Length	Approx. Words	Equivalent Content
GPT-2	1,024 tokens	~750 words	1-2 pages of text
GPT-3	2,048 tokens	~1,500 words	3-4 pages of text
GPT-4	8,192 tokens	~6,000 words	12 pages, a long essay
GPT-4-32K	32,768 tokens	~24,000 words	50 pages, a novella chapter
Claude 3	200,000 tokens	~150,000 words	An entire novel
Gemini 1.5	1,000,000 tokens	~750,000 words	Multiple textbooks

What Consumes Context	Typical Size	Notes
System prompt	100-2,000 tokens	Present in every request
Conversation history	500-50,000 tokens	Grows with each turn
RAG context	500-10,000 tokens	Retrieved documents
User message	10-5,000 tokens	The current query
Model response	50-4,000 tokens	Generated output
Total budget	Context window limit	Everything must fit

Context Windows & Memory

What Is a Context Window?

Context Window Sizes Across Models

Context Window Budget