The Decoder: GPT

Autoregressive Text Generation

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Transformer Block
Step: 1/7

The Decoder: Causal (Left-to-Right) Attention

While the encoder reads and understands text bidirectionally, the decoder is designed for generation -- producing text one token at a time, left to right.

The key constraint is causal attention (also called masked self-attention): each token can only attend to tokens at or before its position. Token 2 ("sat") can see tokens 0 ("The"), 1 ("cat"), and 2 ("sat"), but NOT token 3 ("down").

Why this restriction? Because during generation, future tokens don't exist yet. When the model generates the third word, it has only produced words 1 and 2. Allowing it to "peek" at future tokens during training would be cheating -- the model must learn to predict each token based only on what came before.

GPT (Generative Pre-trained Transformer) by OpenAI took the decoder half and showed that scaling it up with massive data produces remarkably capable language models -- from GPT-1's 117M parameters to a reported (unconfirmed) estimate of roughly 1.8 trillion for GPT-4.

Compare the causal attention pattern below with the bidirectional pattern you saw in the BERT lesson -- notice how the upper triangle is now masked out.

Causal Attention: Each Token Only Sees Previous Tokens

| Query \ Key | The | cat | sat | down |
| --- | --- | --- | --- | --- |
| The | 1.00 | 0.00 | 0.00 | 0.00 |
| cat | 0.40 | 0.60 | 0.00 | 0.00 |
| sat | 0.50 | 0.30 | 0.20 | 0.00 |
| down | 0.18 | 0.37 | 0.21 | 0.24 |

Each cell shows how much attention the query token (row) pays to the key token (column). Higher values indicate stronger attention.
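
Two properties make the pattern above a valid causal attention matrix: every entry above the diagonal is 0, and every row sums to 1 because it is a softmax output. A minimal sketch in plain Python that checks both, using the values from the figure; the helper names are illustrative and not part of any library.

```python
tokens = ["The", "cat", "sat", "down"]

# Attention weights from the figure: rows are queries, columns are keys.
weights = [
    [1.00, 0.00, 0.00, 0.00],
    [0.40, 0.60, 0.00, 0.00],
    [0.50, 0.30, 0.20, 0.00],
    [0.18, 0.37, 0.21, 0.24],
]

def is_causal(w):
    """True if no query position attends to a later key position."""
    n = len(w)
    return all(w[i][j] == 0.0 for i in range(n) for j in range(i + 1, n))

def rows_sum_to_one(w, tol=1e-9):
    """True if every row is a valid probability distribution."""
    return all(abs(sum(row) - 1.0) < tol for row in w)

def visible_tokens(i):
    """Tokens that query position i may attend to under causal masking."""
    return tokens[: i + 1]

print(is_causal(weights), rows_sum_to_one(weights))
print(visible_tokens(2))
```

A bidirectional (encoder) attention matrix would pass the row-sum check but fail `is_causal`, since its upper triangle is not zeroed.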

Encoder vs Decoder: Key Differences

| Property | Encoder (BERT) | Decoder (GPT) |
| --- | --- | --- |
| Attention direction | Bidirectional (all tokens) | Causal (left-to-right only) |
| Token i can see | Tokens 0, 1, ..., n | Tokens 0, 1, ..., i |
| Pre-training task | Masked Language Modeling | Next Token Prediction |
| Primary use | Understanding (classification, NER) | Generation (text, code, chat) |
| Key models | BERT, RoBERTa, DeBERTa | GPT-1/2/3/4, LLaMA, Claude |

Decoder & GPT — Lesson Content

Learn how GPT uses causal (left-to-right) attention to generate text -- from the causal mask to autoregressive decoding to RLHF.

The decoder is the "generation" half of the Transformer, and GPT showed that scaling it up yields remarkably capable language models. Unlike encoders that see all tokens, decoders use causal masking: each token can only attend to itself and earlier tokens, because during generation, future tokens don't exist yet. You'll learn causal masking with the lower triangular mask, autoregressive next-token prediction, temperature and sampling strategies, the GPT scaling story from 117M parameters to a reported ~1.8T, and how instruction tuning and RLHF transformed raw GPT into ChatGPT.

Learning Objectives

  • Explain the difference between bidirectional (encoder) and causal (decoder) attention
  • Describe how the causal mask is implemented as a lower triangular matrix
  • Trace the autoregressive generation process step by step
  • Explain how temperature and top-k/top-p sampling affect text generation
  • Describe the GPT scaling progression and emergent capabilities
  • Explain how instruction tuning and RLHF align models with human preferences

Step 2: Causal Masking: The Lower Triangular Mask

Causal attention is implemented with a simple **mask matrix** applied to the attention scores before softmax. The mask is a **lower triangular matrix**: 1s on and below the diagonal, 0s above.

For our 4-token sequence:

  • Position 0 ("The"): can see [The]
  • Position 1 ("cat"): can see [The, cat]
  • Position 2 ("sat"): can see [The, cat, sat]
  • Position 3 ("down"): can see [The, cat, sat, down]

**Implementation:** Before applying softmax, we set masked positions to **-infinity** (or a very large negative number). Because exp(-inf) = 0, those positions receive zero attention weight after softmax (effectively zero when a large finite negative is used), and the remaining weights in each row still sum to 1. One additive formulation is:

**scores_masked = scores + (1 - mask) * (-10000)**
**weights = softmax(scores_masked)**

The code below achieves the same effect with masked_fill and -1e9.

This is computationally efficient: the same matrix multiplication runs as in bidirectional attention, but the mask zeroes out future positions. The model trains on all positions simultaneously (teacher forcing), but each position only uses past context -- making training parallel while respecting the causal constraint.

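import math
import torch

# Placeholder inputs for illustration: 4 tokens, head dimension 8
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)

# Create causal mask (lower triangular)
mask = torch.tril(torch.ones(seq_len, seq_len))
# mask = [[1, 0, 0, 0],
#          [1, 1, 0, 0],
#          [1, 1, 1, 0],
#          [1, 1, 1, 1]]

# Apply mask to attention scores
scores = Q @ K.T / math.sqrt(d_k)
scores = scores.masked_fill(mask == 0, -1e9)  # future positions get ~zero weight
weights = torch.softmax(scores, dim=-1)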
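
To see the effect of the mask on actual numbers, here is a dependency-free sketch of the same masking step. It assumes uniform raw scores (all zeros), a hypothetical input chosen so that any difference between rows comes from the mask alone; `causal_mask`, `softmax`, and `masked_attention_weights` are illustrative helpers, not library functions.

```python
import math

def causal_mask(seq_len):
    """Lower triangular mask: 1 where key j <= query i, else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

def softmax(row):
    """Softmax over one row; exp(-inf) evaluates to 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    """Set future positions to -inf, then apply softmax row by row."""
    mask = causal_mask(len(scores))
    masked = [
        [s if mask[i][j] else float("-inf") for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
    return [softmax(row) for row in masked]

# Hypothetical raw scores: all equal, so only the mask shapes the result.
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With equal scores, position i spreads its attention evenly over the i + 1 tokens it can see, and every future position gets exactly zero.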

Step 3: Autoregressive Generation: Predict Next Token

GPT is trained with a simple objective: **predict the next token given all previous tokens**.

**Training (teacher forcing):**

  • Input: "The cat sat down"
  • Target: "cat sat down [END]"

Each position predicts the next token: position 0 predicts "cat", position 1 predicts "sat", and so on. All positions train simultaneously (in parallel), but the causal mask ensures each position only uses past context.

**Generation (autoregressive):**

1. Input: "The"
2. Model predicts: "cat" (most likely next token)
3. Input: "The cat"
4. Model predicts: "sat"
5. Input: "The cat sat"
6. Model predicts: "down"
7. Continue until [END] token or max length

This sequential generation is called **autoregressive** because each output becomes part of the next input. It needs one forward pass per generated token, so it is inherently slower than encoding (which processes all tokens at once), but it's the natural way to produce text.

The final layer of GPT maps each token's hidden state to a probability distribution over the entire vocabulary (e.g., 50,000 tokens). The token with the highest probability is selected (or sampled -- more on that next).
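
The training shift and the generation loop described above can be sketched in a few lines. This is only an illustration: `TOY_MODEL` is a hypothetical lookup table standing in for a trained network (it conditions on the last token only, whereas a real GPT conditions on the whole prefix), and its probabilities are made up.

```python
END = "[END]"

# Teacher forcing: targets are the inputs shifted left by one position.
sequence = ["The", "cat", "sat", "down", END]
inputs, targets = sequence[:-1], sequence[1:]

# Hypothetical stand-in for a trained model: last token -> next-token probabilities.
TOY_MODEL = {
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.6, "ran": 0.4},
    "sat": {"down": 0.8, "up": 0.2},
    "down": {END: 0.9, "there": 0.1},
}

def next_token_probs(prefix):
    """Return a next-token distribution; unknown contexts end the sequence."""
    return TOY_MODEL.get(prefix[-1], {END: 1.0})

def generate_greedy(prompt, max_new_tokens=10):
    """Append the most likely token repeatedly until END or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        token = max(probs, key=probs.get)
        if token == END:
            break
        tokens.append(token)  # the output becomes part of the next input
    return tokens

print(list(zip(inputs, targets)))
print(generate_greedy(["The"]))
```

Note that the loop calls the model once per generated token, which is why generation cannot be parallelized across positions the way training can.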

Step 4: Temperature and Top-k Sampling

When generating text, the model outputs a probability distribution over the vocabulary. How we **select** from this distribution strongly affects output quality.

**Greedy decoding:** Always pick the highest-probability token. Deterministic, but often produces repetitive, bland text.

**Temperature (T):** Scales the logits before softmax: **p = softmax(logits / T)**

  • T = 1.0: Standard probabilities (as trained)
  • T < 1.0: Sharper distribution (more confident, less random)
  • T > 1.0: Flatter distribution (more random, more creative)
  • T -> 0: Equivalent to greedy (picks the argmax)

**Top-k sampling:** Keep only the k most likely tokens, set the rest to zero, renormalize, then sample.

  • Top-5: Only sample from the 5 most likely tokens
  • Prevents selecting very unlikely tokens (reduces nonsense)

**Top-p (nucleus) sampling:** Keep the smallest set of most likely tokens whose cumulative probability reaches p, renormalize, then sample.

  • Top-p = 0.9: Include tokens until you cover 90% probability
  • Adapts to the distribution shape (sometimes 3 tokens suffice, sometimes 100)

In practice, deployed models such as GPT-4 and Claude expose temperature and top-p as sampling parameters, and the two are often used together.
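
The three strategies above are short enough to write out. The following is a minimal sketch in plain Python, assuming a hypothetical 4-token vocabulary with made-up logits; the function names are illustrative, and production libraries implement these filters on tensors rather than lists.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """p = softmax(logits / T). Requires T > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    """Zero out every index not in `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(probs, rng):
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits, not from a real model
probs = softmax_with_temperature(logits, 1.0)
sharp = softmax_with_temperature(logits, 0.5)  # T < 1: more peaked
flat = softmax_with_temperature(logits, 2.0)   # T > 1: flatter
top2 = top_k_filter(probs, 2)
nucleus = top_p_filter(probs, 0.9)
token_id = sample(nucleus, random.Random(0))
print([round(x, 3) for x in nucleus], token_id)
```

For these logits the top two tokens cover only about 83% of the probability mass, so the nucleus at p = 0.9 keeps three tokens while top-k with k = 2 keeps two. This is the sense in which top-p adapts to the shape of the distribution.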

Step 5: GPT Scaling: GPT-1 to GPT-4

The GPT series demonstrated that **scaling up** the decoder architecture produces increasingly capable models. Through GPT-3, each generation grew parameters by roughly 10-100x, with data and compute growing alongside.

**GPT-1 (2018):** Proved that decoder pre-training + fine-tuning works for NLP tasks. 12 layers, 117M parameters. Trained on BookCorpus (~800M words).

**GPT-2 (2019):** Showed that larger models exhibit "zero-shot" ability -- performing tasks without any fine-tuning, just from the prompt. 48 layers, 1.5B parameters. Initially withheld due to concerns about misuse.

**GPT-3 (2020):** Demonstrated "in-context learning" -- the model can learn new tasks from just a few examples in the prompt (few-shot). 96 layers, 175B parameters. Served through the original OpenAI API; the first ChatGPT was built on the later GPT-3.5 series.

**GPT-4 (2023):** Multimodal (accepts text and image input), with markedly improved reasoning. OpenAI has not disclosed its size; unofficial reports estimate roughly 1.8T parameters in a mixture-of-experts design. Powers ChatGPT Plus and the API.

The key insight: the same simple architecture (stacked decoder blocks with causal attention) yields qualitatively different capabilities at different scales. The related empirical finding known as **scaling laws** is that performance (measured as loss) improves predictably with more parameters, data, and compute.

Step 6: Instruction Tuning & RLHF

Raw GPT models are trained to predict next tokens -- they're good at completing text but bad at following instructions. Two techniques bridge this gap:

**Instruction Tuning (Supervised Fine-Tuning / SFT):** Fine-tune the model on (instruction, response) pairs written by humans.

  • Input: "Summarize this article in 3 bullets: ..."
  • Output: A proper 3-bullet summary

This teaches the model the **format** of helpful responses.

**RLHF (Reinforcement Learning from Human Feedback):**

1. Generate multiple responses to each prompt
2. Human raters rank the responses (best to worst)
3. Train a **reward model** that predicts human preferences
4. Use RL (PPO algorithm) to fine-tune the language model to maximize the reward

RLHF was a major factor in ChatGPT's effectiveness -- it aligned the model's behavior with human expectations of helpfulness, harmlessness, and honesty.

**The alignment pipeline:** Base model (GPT-3) -> Instruction tuning (SFT) -> RLHF -> ChatGPT/GPT-3.5

Modern variants like **DPO** (Direct Preference Optimization) simplify this by skipping the reward model step, directly training the language model on preference pairs.
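
The lesson names the reward model and DPO without giving their objectives. As a hedged illustration, the sketch below uses the commonly cited formulations: a pairwise (Bradley-Terry style) loss for the reward model and the DPO loss on log-probability margins against a frozen reference model. All inputs are hypothetical scalars, and `beta` is an assumed hyperparameter; real implementations operate on batches of sequence log-probabilities.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: low when the preferred response scores higher."""
    return -log_sigmoid(reward_chosen - reward_rejected)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss: compares how far the policy moved from the reference on each response."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))

# Reward model agrees with the human ranking -> smaller loss.
print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))

# Policy identical to the reference -> loss is log(2), the "no preference" value.
print(abs(dpo_loss(-2.0, -2.0, -2.0, -2.0) - math.log(2)) < 1e-12)
```

In both cases the loss depends only on a difference between the preferred and rejected response, which is why ranked pairs from human raters are sufficient training signal.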

Step 7: Test Your Understanding

You've learned how the decoder architecture powers GPT and modern language models. Let's test your understanding of causal attention, autoregressive generation, and the path from GPT-1 to ChatGPT.

Prerequisites

  • Self-attention mechanism
  • Encoder architecture (BERT lesson)
  • Transformer block (residual + FFN)

Key Concepts

  • Causal (Masked) Self-Attention
  • Lower Triangular Mask
  • Autoregressive Generation
  • Temperature and Sampling
  • Scaling Laws
  • Instruction Tuning & RLHF