Context Windows & Memory

The Attention Bottleneck

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Transformer attention

What Is a Context Window?

The context window is the maximum number of tokens an LLM can process in a single forward pass. It includes everything: the system prompt, conversation history, user message, and the model's response.

Think of it as the model's "working memory." Anything inside the context window can be attended to; anything outside is invisible. The model has no persistent memory — each API call starts fresh with only the tokens you provide.

Why is it limited? The self-attention mechanism computes a score between every pair of tokens. With n tokens, that's n² scores, so doubling the context length quadruples the memory and compute spent on attention scores.

Context window sizes have grown rapidly:

| Model | Context Window |
|-------------------|----------------|
| GPT-2 (2019) | 1,024 tokens |
| GPT-3 (2020) | 2,048 tokens |
| GPT-4 (2023) | 8K / 32K |
| Claude 3 (2024) | 200K tokens |
| Gemini 1.5 (2024) | 1M tokens |

Practical impact: A 4K context window holds roughly 3,000 words — about 6 pages. A 200K window holds ~150,000 words — an entire novel. Larger windows enable processing entire codebases, long documents, and extended conversations without losing context.

Context Window Sizes Across Models

| Model | Context Length | Approx. Words | Equivalent Content |
|------------|------------------|----------------|-----------------------------|
| GPT-2 | 1,024 tokens | ~750 words | 1-2 pages of text |
| GPT-3 | 2,048 tokens | ~1,500 words | 3-4 pages of text |
| GPT-4 | 8,192 tokens | ~6,000 words | 12 pages, a long essay |
| GPT-4-32K | 32,768 tokens | ~24,000 words | 50 pages, a novella chapter |
| Claude 3 | 200,000 tokens | ~150,000 words | An entire novel |
| Gemini 1.5 | 1,000,000 tokens | ~750,000 words | Multiple textbooks |

Context Window Budget

| What Consumes Context | Typical Size | Notes |
|-----------------------|----------------------|--------------------------|
| System prompt | 100-2,000 tokens | Present in every request |
| Conversation history | 500-50,000 tokens | Grows with each turn |
| RAG context | 500-10,000 tokens | Retrieved documents |
| User message | 10-5,000 tokens | The current query |
| Model response | 50-4,000 tokens | Generated output |
| Total budget | Context window limit | Everything must fit |
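
The budget table above can be turned into a simple fit check. The following is a minimal sketch, not a production tokenizer-aware implementation: the `ContextBudget` class, the `fits` and `trim_history` helpers, and all token counts are hypothetical names and values chosen for illustration, and dropping the oldest history first is only one possible policy.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token counts for each part of a request (all values are illustrative)."""
    system_prompt: int
    history: int
    rag_context: int
    user_message: int
    max_response: int  # space reserved for the model's output

    def total(self) -> int:
        return (self.system_prompt + self.history + self.rag_context
                + self.user_message + self.max_response)


def fits(budget: ContextBudget, context_window: int) -> bool:
    """True if every component, including the reserved response, fits."""
    return budget.total() <= context_window


def trim_history(budget: ContextBudget, context_window: int) -> ContextBudget:
    """Shrink the conversation history by the overflow amount (never below 0)."""
    overflow = budget.total() - context_window
    if overflow <= 0:
        return budget
    new_history = max(0, budget.history - overflow)
    return ContextBudget(budget.system_prompt, new_history, budget.rag_context,
                         budget.user_message, budget.max_response)


budget = ContextBudget(system_prompt=1_000, history=6_000, rag_context=2_000,
                       user_message=500, max_response=1_000)
print(budget.total())                          # 10500
print(fits(budget, 8_192))                     # False
trimmed = trim_history(budget, 8_192)
print(trimmed.history, fits(trimmed, 8_192))   # 3692 True
```

Note that the response is budgeted up front: the generated tokens share the same window as the prompt, so space has to be reserved for them before the request is sent.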

Context Windows and Memory — Lesson Content

Understand context windows, the O(n^2) attention bottleneck, KV cache optimization, and how RAG provides external memory for LLMs.

Every LLM has a context window — the maximum number of tokens it can process at once. This lesson explores why context windows exist (the quadratic cost of attention), how KV caches optimize generation, how positional encoding techniques like RoPE and ALiBi enable longer contexts, and how RAG provides external memory beyond the context limit. You'll see attention matrices showing how earlier tokens fade, memory scaling curves, and comparisons of approaches from sparse attention to state-space models.

Learning Objectives

  • Explain why attention has quadratic memory cost and its practical implications
  • Describe how KV caching accelerates autoregressive generation
  • Compare positional encoding methods and their context length limitations
  • Understand when to use RAG versus long context windows
  • Evaluate future approaches to the long-context problem

Step 2: Attention and Memory Cost (O(n^2))

Self-attention's power comes at a steep cost: **O(n^2) memory and compute** in sequence length. For each token, the model computes an attention score with every token in the sequence. With 6 tokens, that's 36 scores (a 6x6 matrix). With 1,000 tokens, it's 1,000,000 scores. With 100,000 tokens, it's 10,000,000,000 scores.

**The lesson's example attention matrix** shows a causal (decoder-only) pattern for 6 tokens. Each row is a query token; each column is a key token. Zeros above the diagonal enforce the causal mask: tokens can't attend to the future.

In that example, attention to earlier tokens **fades** as the sequence progresses. Token 5 ("mat") assigns only 0.02 attention to Token 0 ("The"), so the information from the beginning is diluted. This is a practical limitation even within the context window.

**Memory breakdown for a single attention layer:**
- Attention scores: n × n × sizeof(float16) = n² × 2 bytes
- With 32 attention heads: 32 × n² × 2 bytes
- For n = 8192: 32 × 67M × 2 bytes ≈ **4 GB per layer**
- A 32-layer model: **128 GB just for attention scores**, if every layer's score matrices are held in memory at once

This quadratic scaling is why context windows were historically small and why so much research focuses on efficient attention.
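
The memory breakdown above can be reproduced with a few lines of arithmetic. This is a sketch under the same assumptions as the text (32 heads, 2 bytes per float16 score, the full n × n matrix materialized); the function name `attention_score_bytes` is made up for illustration.

```python
def attention_score_bytes(n_tokens: int, n_heads: int = 32,
                          bytes_per_score: int = 2) -> int:
    """Bytes needed to materialize the full n x n score matrix
    for every head in one attention layer."""
    return n_heads * n_tokens * n_tokens * bytes_per_score


GIB = 1024 ** 3

for n in (1_024, 8_192, 16_384):
    per_layer = attention_score_bytes(n)
    print(f"n={n}: {per_layer / GIB:.4f} GiB per layer")

# Doubling the sequence length quadruples the score memory.
ratio = attention_score_bytes(16_384) / attention_score_bytes(8_192)
print(ratio)  # 4.0
```

For n = 8,192 the function returns exactly 4 × 1024³ bytes, matching the 4 GB per-layer figure.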

Step 3: KV Cache Optimization

During autoregressive generation, the model generates tokens one at a time. At each step, it needs to attend to all previous tokens. Without optimization, this means recomputing the keys, values, and attention for the entire sequence at every new token, repeating work already done.

**The KV cache** eliminates this redundancy. Key and Value projections for all earlier tokens are computed once and cached. When generating token t+1, only the new token's Query, Key, and Value need to be computed. The new Key and Value are appended to the cache, the new Query attends to all cached Keys, and the result is a weighted sum of all cached Values.

**Without KV cache:** Generating n tokens requires recomputing full attention over sequences of length 1, 2, ..., n. Step t costs O(t²), so the total is O(n³) per layer.

**With KV cache:** Generating each new token requires O(n) compute (one query attending to n cached keys). The total is O(n²) per layer, and critically, each step is only O(n), not O(n²).

**The memory tradeoff:** KV cache trades memory for compute. For a 7B model with 32 layers, assuming a hidden size of 4,096, standard multi-head attention, and float16 storage, the cache takes about 0.5 MB per token (2 × 32 × 4,096 × 2 bytes). That is ~2 GB for 4K tokens and ~50 GB for 100K tokens. This is why long-context inference requires significant GPU memory even though the model weights are fixed.

**Optimization techniques:**
- **Multi-Query Attention (MQA):** Share a single K and V head across all query heads, shrinking the cache by a factor equal to the number of heads (for example, 32x with 32 heads)
- **Grouped-Query Attention (GQA):** Share K and V across groups of heads, a compromise between full multi-head attention and MQA (for example, 8 KV heads for 32 query heads gives a 4x reduction)
- **PagedAttention (vLLM):** Manage KV cache in fixed-size blocks, like virtual memory pages, which largely eliminates fragmentation
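
To make the caching idea concrete, here is a minimal NumPy sketch of a single attention head decoding with a KV cache, checked against full causal attention over the same sequence. It is a toy under stated assumptions, not how any particular framework implements it: one layer, one head, random weights, and the `KVCache` class and helper names are invented for this example.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(x, w_q, w_k, w_v):
    """Full causal self-attention over the whole sequence (no cache)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above diagonal
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


class KVCache:
    """Stores the Key and Value rows of every token seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, x_t, w_q, w_k, w_v):
        # Only the new token is projected; earlier K/V rows are reused.
        q = x_t @ w_q
        self.keys.append(x_t @ w_k)
        self.values.append(x_t @ w_v)
        k = np.stack(self.keys)      # shape (t, d)
        v = np.stack(self.values)    # shape (t, d)
        weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
        return weights @ v


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(6, d))                        # 6 token embeddings
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

full = causal_attention(x, w_q, w_k, w_v)
cache = KVCache()
cached = np.stack([cache.step(x_t, w_q, w_k, w_v) for x_t in x])

print(np.allclose(full, cached))  # True: same outputs, one query per step
```

The cached path produces the same outputs as the full computation, but each step multiplies one query against the stored keys instead of rebuilding the whole score matrix. The cost is that `keys` and `values` grow by one row per token.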

Step 4: Long Context Techniques (RoPE, ALiBi)

Standard transformers struggle to generalize to sequence lengths longer than they were trained on. **Positional encoding** techniques determine how well a model handles long contexts.

**The problem:** Absolute positional encodings (learned or sinusoidal) assign a fixed embedding to each position. If the model was trained on sequences up to 2,048 tokens, position 2,049 has never been seen: a learned table has no entry for it, and a sinusoidal encoding produces a pattern the model was never trained on. Quality breaks down.

**RoPE (Rotary Position Embeddings):**
- Encodes position by rotating Q and K vectors in 2D subspaces
- The dot product between a rotated Q and a rotated K depends on position only through their **relative distance**, not their absolute positions
- Can be extended beyond training length by scaling the rotation frequencies
- **YaRN, NTK-aware scaling:** Techniques to extend RoPE models from 4K to 128K+ tokens
- Used by: LLaMA, Mistral, Qwen, most modern open-source models

**ALiBi (Attention with Linear Biases):**
- Instead of positional embeddings, adds a linear penalty to attention scores based on distance
- Score(i,j) = Q_i · K_j - m · |i - j|, where m is a head-specific slope
- Naturally generalizes to longer sequences (penalty just grows linearly)
- No positional embeddings needed at all, which gives a simpler architecture
- Used by: BLOOM, MPT

**The key insight:** Relative position information (how far apart tokens are) is more useful than absolute position (what index a token has). Both RoPE and ALiBi encode relative distance, which is why they generalize better.
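
Both ideas fit in a few lines of NumPy. The sketch below is illustrative rather than a reference implementation: `rope` rotates consecutive pairs of dimensions (real implementations differ in how they pair dimensions), the base of 10000 is the commonly used default and is an assumption here, and `alibi_bias` builds only the distance penalty, leaving causal masking to be applied separately.

```python
import numpy as np


def rope(vec: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate the pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by angles
    proportional to the token position."""
    d = vec.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D subspace
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x * cos - y * sin
    out[1::2] = x * sin + y * cos
    return out


def alibi_bias(n_tokens: int, slope: float) -> np.ndarray:
    """Bias added to the raw scores: -slope * |i - j|."""
    pos = np.arange(n_tokens)
    return -slope * np.abs(pos[:, None] - pos[None, :])


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same relative offset (7) at very different absolute positions.
score_a = rope(q, 3) @ rope(k, 10)
score_b = rope(q, 100) @ rope(k, 107)
print(np.isclose(score_a, score_b))   # True

print(alibi_bias(4, slope=0.5))       # penalty grows linearly with distance
```

The two RoPE scores match because rotating Q by position i and K by position j leaves a dot product that depends only on j - i. The ALiBi matrix needs no learned parameters beyond the per-head slope m.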

Step 5: RAG as External Memory

Even with long context windows, there are limits. A 200K context can hold a novel, but not an entire knowledge base. And stuffing everything into the context is expensive: you pay per token.

**Retrieval-Augmented Generation (RAG)** gives the model external memory by retrieving relevant information at query time:

1. **Index phase:** Split documents into chunks, compute embeddings, store in a vector database
2. **Query phase:** When a user asks a question, embed the query and find the most similar document chunks
3. **Generation phase:** Insert the retrieved chunks into the prompt and generate a response

**RAG vs Long Context:**

RAG retrieves only what's relevant. It's like looking up specific pages in a library rather than reading every book. This is more efficient but depends on retrieval quality. If the retriever misses the relevant passage, the model can't find the answer.

Long context stuffs everything in. It's like giving the model the entire library. This is more comprehensive but much more expensive and slower. The model's attention may also struggle to find the relevant information in a very long context (the "needle in a haystack" problem).

**In practice, RAG and long context are complementary:**
- RAG narrows the search space: retrieve 20 relevant passages
- Long context processes them: the model reads all 20 passages in one pass
- Together, they handle knowledge bases far larger than any context window
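
The three phases above can be sketched with the standard library alone. This is a deliberately toy version: the "embedding" is a bag-of-words count vector standing in for a learned embedding model, the "vector database" is a Python list, and the function names, sample chunks, and prompt wording are all invented for illustration. The generation step is left out; the sketch stops at the prompt that would be sent to the model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a real system would use a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def build_index(chunks):
    """Index phase: store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Query phase: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query, passages):
    """Generation phase input: retrieved passages inserted into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")


chunks = [
    "The KV cache stores key and value projections for earlier tokens.",
    "RoPE encodes position by rotating query and key vectors.",
    "Paris is the capital of France.",
]
index = build_index(chunks)
question = "What does the KV cache store?"
top = retrieve(index, question, k=1)
print(top[0])
print(build_prompt(question, top))
```

Only the retrieved chunk enters the prompt, so the token cost depends on k and the chunk size rather than on the size of the whole collection. It also shows the failure mode described above: if `retrieve` ranks the wrong chunk first, the answer never reaches the model.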

Step 6: The Future: Infinite Context?

The quest for longer context windows continues. Several research directions aim to break through current limitations:

**Sparse attention** patterns reduce the O(n^2) cost by attending to only a subset of tokens. Instead of every token attending to every other token, each token attends to a local window plus a few distant "landmark" tokens. Depending on the pattern, this can reduce cost to O(n log n) or even O(n).

The lesson's sparse attention matrix shows a local + strided pattern: each token attends to its immediate neighbors (local window) plus every k-th token (strided pattern). This preserves both local coherence and long-range connections.

**Ring Attention** distributes the sequence across multiple GPUs, with each GPU processing a chunk and passing its key/value blocks to the next in a ring. The maximum sequence length scales linearly with the number of GPUs.

**State-space models (Mamba, S4)** replace attention entirely with a recurrence that processes tokens in O(n) time with a fixed-size state, so inference memory does not grow with sequence length. They can achieve competitive quality with dramatically lower memory usage for long sequences.

**Hybrid approaches** combine attention for short-range interactions with state-space models for long-range ones, aiming to get the strengths of both.

The trend is clear: context windows are growing from thousands to millions of tokens, and the compute cost is being tamed through architectural innovations.
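
The local + strided pattern described above can be written as a boolean mask. This is a sketch of one possible reading of that pattern, not the mask of any specific published model: "every k-th token" is taken to mean key positions that are multiples of `stride`, and the function name and the sizes used are assumptions for illustration.

```python
import numpy as np


def local_strided_mask(n_tokens: int, window: int, stride: int) -> np.ndarray:
    """Causal sparse mask. Query i may attend to key j when j <= i and either
    j is within the last `window` positions or j is a multiple of `stride`."""
    i = np.arange(n_tokens)[:, None]   # query positions (rows)
    j = np.arange(n_tokens)[None, :]   # key positions (columns)
    causal = j <= i
    local = (i - j) < window
    strided = (j % stride) == 0
    return causal & (local | strided)


n = 1024
mask = local_strided_mask(n_tokens=n, window=32, stride=32)

dense_causal = n * (n + 1) // 2        # entries a full causal mask would keep
kept = int(mask.sum())
print(kept, dense_causal, f"{kept / dense_causal:.1%}")
```

With these settings each row keeps on the order of `window + n / stride` entries instead of up to n, which is where the savings over dense attention come from. Choosing `window` and `stride` trades long-range coverage against cost.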

Step 7: Test Your Understanding

You've learned about context windows, attention cost, KV caches, long-context techniques, and RAG. Let's test your understanding!

Prerequisites

  • Understanding of self-attention mechanism
  • Basic knowledge of how LLMs generate text
  • Familiarity with embeddings and vector similarity

Key Concepts

  • Context Window
  • O(n^2) Attention Cost
  • KV Cache
  • RoPE and ALiBi
  • Retrieval-Augmented Generation (RAG)
  • Sparse Attention