RLHF: Aligning LLMs with Human Preferences — Lesson Content
Understand how Reinforcement Learning from Human Feedback (RLHF) and DPO align language models to be helpful, honest, and harmless.
Pre-training gives LLMs language ability, and fine-tuning teaches them to follow instructions. But neither ensures the model behaves the way humans want. RLHF closes this gap by training models to optimize for human preferences.
You'll trace the full three-step pipeline — SFT, reward model training, and PPO optimization — then learn how DPO simplifies the process. Along the way, you'll see real loss curves and understand the tradeoffs between reward maximization and reward hacking.
Learning Objectives
- Explain the alignment problem and why SFT alone is insufficient
- Describe how a reward model learns from human preference rankings
- Understand the PPO optimization loop and the role of the KL penalty
- Trace the full RLHF pipeline from pre-training to aligned model
- Compare RLHF with DPO and explain when each is preferred
Step 1: The Alignment Problem
After pre-training and instruction tuning, an LLM can follow instructions. But **following instructions is not the same as being helpful, harmless, and honest.**
Consider these problems:
**Helpfulness without safety:** "Write a convincing phishing email" — an instruction-tuned model will helpfully comply because it was trained to follow instructions.
**Sycophancy:** Models trained on human-written responses learn to agree with the user, even when the user is wrong. "Is 2+2=5?" → "That's an interesting perspective..."
**Verbosity:** Models learn that longer responses get higher ratings in training data, so they pad answers unnecessarily.
**Hallucination:** The model confidently states false information because it was rewarded for fluent, confident-sounding text.
The core issue: **the training objective (predict next token / follow instructions) doesn't capture what we actually want** — responses that are helpful, truthful, safe, and appropriately concise.
**RLHF (Reinforcement Learning from Human Feedback)** addresses this by training the model to optimize for **human preferences** rather than just next-token prediction. It adds a layer of human judgment on top of language modeling.
Step 2: Supervised Fine-Tuning (Pipeline Step 1)
RLHF is a three-step pipeline. The first step is the familiar **Supervised Fine-Tuning (SFT)** we covered in the fine-tuning lesson.
**Why SFT first?** You need a starting model that can at least produce coherent, instruction-following responses before you can improve them with RL. Trying to apply RLHF to a raw pre-trained model would be like teaching poetry criticism to someone who hasn't learned to read.
**The SFT dataset for RLHF** typically consists of:
- Human-written demonstration data: skilled annotators write ideal responses to diverse prompts
- Quality is paramount — these examples define the "starting point" for RLHF optimization
- Typically 10,000-100,000 high-quality examples
**OpenAI's InstructGPT** used ~13,000 demonstration examples written by a team of 40 contractors. Each example was a prompt paired with an ideal response following specific guidelines for helpfulness, truthfulness, and harmlessness.
After SFT, the model produces decent responses — but it still has the alignment issues described earlier. Steps 2 and 3 of RLHF address these by learning from human preferences.
Step 3: Reward Model Training (Pipeline Step 2)
The second step trains a **reward model** — a neural network that scores how "good" a response is from a human perspective.
**How it works:**
1. Collect prompts from a diverse dataset
2. For each prompt, generate **multiple responses** (2-7) from the SFT model
3. Human annotators **rank** the responses from best to worst
4. Train a model to predict these rankings
**Why rankings instead of absolute scores?** Humans are inconsistent at assigning absolute scores ("Is this response a 7.2 or a 7.5?") but reliable at relative comparisons ("Response A is better than Response B"). Rankings are converted to pairwise comparisons for training.
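**Sketch: ranking to pairs.** A minimal illustration of that conversion, assuming each ranking arrives as a best-to-worst list. `ranking_to_pairs` is a hypothetical helper name used only for this example.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, dispreferred) pairs."""
    # combinations() preserves input order, so the first element of every
    # pair is always the one the annotator ranked higher.
    return list(combinations(ranked_responses, 2))


pairs = ranking_to_pairs(["A", "B", "C", "D"])
print(len(pairs))  # 4 responses give 4 * 3 / 2 = 6 pairs
print(pairs[0])    # ('A', 'B')
```

A ranking of K responses yields K(K-1)/2 pairs, which is why a single ranking task produces far more training signal than a single A-versus-B question.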
**The training objective:** Given a preferred response y_w and a dispreferred response y_l for the same prompt x, minimize:
loss = -log(σ(R(x, y_w) - R(x, y_l)))
This is a **binary classification** loss: the model learns that the reward for the preferred response should be higher. The loss starts near 0.69 (ln 2, the value when both responses receive the same score, i.e. random chance for binary classification) and decreases as the model improves.
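**Sketch: the pairwise loss for one comparison.** This is a minimal illustration, assuming the reward model has already produced scalar scores for both responses. `pairwise_reward_loss` is a hypothetical name, and a real implementation would average this over a batch inside a deep learning framework.

```python
import math


def pairwise_reward_loss(reward_preferred, reward_dispreferred):
    """Compute -log(sigmoid(r_w - r_l)) for a single comparison pair."""
    margin = reward_preferred - reward_dispreferred
    # -log(sigmoid(m)) equals log(1 + exp(-m)); the two branches avoid
    # overflow in exp() when the margin is large in either direction.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


print(pairwise_reward_loss(0.0, 0.0))  # equal scores: ln 2, about 0.693
print(pairwise_reward_loss(2.0, 0.0))  # correct ordering: smaller loss
print(pairwise_reward_loss(0.0, 2.0))  # wrong ordering: larger loss
```

Only the difference between the two scores matters, so the reward model's absolute scale is arbitrary.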
**InstructGPT** collected human rankings for ~33,000 prompts, with each ranking expanded into multiple comparison pairs. The resulting reward model achieved ~72% agreement with held-out human preferences.
**Expected loss over the preference dataset:** L_RM = -E[log σ(R(x, y_w) - R(x, y_l))]
Step 4: PPO Optimization (Pipeline Step 3)
The third step uses **Proximal Policy Optimization (PPO)** to fine-tune the SFT model to maximize the reward model's scores — while staying close to the original SFT model.
**The process at each PPO step:**
1. Sample a batch of prompts
2. Generate responses using the current policy (the model being optimized)
3. Score each response with the reward model
4. Subtract a **KL penalty** from the score to keep the model from diverging too far from the SFT model
5. Compute the PPO loss on the penalized reward and update the model to produce higher-reward responses
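**Sketch: what "the PPO loss" refers to.** The list above does not spell out the loss itself. The standard PPO formulation uses a clipped surrogate objective, sketched here for a single sampled response. This is a simplification: it assumes an advantage estimate is already available and treats the response as one action, whereas real RLHF implementations work per token and train a value function alongside the policy. The function name and default `clip_eps` are illustrative choices.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one sampled response (to be maximized)."""
    # Probability ratio pi_new(y|x) / pi_old(y|x), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes the incentive to push the ratio outside
    # [1 - clip_eps, 1 + clip_eps] in the direction that raises the objective.
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio of 1.5 with a positive advantage: the gain is capped at about 1.2.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0))
# Same ratio with a negative advantage: the full penalty of about -1.5 applies.
print(ppo_clipped_objective(math.log(1.5), 0.0, advantage=-1.0))
```

The clipping is the "proximal" part of PPO: each update can only move the policy a limited distance from the one that generated the samples.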
**Why the KL penalty?** Without it, the model would learn to "hack" the reward model — producing responses that score high on the reward model but are actually bad (reward hacking). The KL divergence penalty keeps the optimized model close to the SFT model, preventing degenerate solutions.
The objective is: **maximize R(x, y) - β × KL(π || π_SFT)**
Where β controls the tradeoff between reward and staying close to the SFT model. Too low and the model reward-hacks. Too high and it barely changes from SFT.
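**Sketch: the effect of β on a toy problem.** For a finite set of candidate responses, the policy that maximizes the KL-penalized objective is known in closed form: π*(y) ∝ π_SFT(y) · exp(R(y) / β). The example below uses that result with three hypothetical responses. The probabilities and rewards are made-up numbers chosen for illustration, not measurements.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Policy maximizing E[R] - beta * KL(pi || pi_ref) over a finite response set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels out in the normalization.
    max_reward = max(rewards)
    weights = [p * math.exp((r - max_reward) / beta)
               for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


ref_probs = [0.6, 0.3, 0.1]  # hypothetical SFT policy over three responses
rewards = [0.0, 1.0, 3.0]    # hypothetical reward model scores

for beta in (0.1, 1.0, 10.0):
    policy = kl_regularized_optimum(ref_probs, rewards, beta)
    kl = kl_divergence(policy, ref_probs)
    print(f"beta={beta}: policy={[round(p, 3) for p in policy]}, KL={kl:.3f}")
```

With a small β the policy puts almost all of its mass on the highest-reward response, however unlikely the SFT model found it. With a large β the policy stays close to the SFT distribution.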
**In practice:** The reward steadily increases during PPO training, while the KL divergence also grows. When KL gets too large, training is typically stopped to prevent reward hacking.
**Written as an expectation:** J(π) = E[R(x, y) - β · KL(π(y|x) || π_SFT(y|x))]
Step 5: The RLHF Pipeline End-to-End
Let's trace the full RLHF pipeline from pre-trained model to aligned assistant:
**Stage 1 — SFT:** Start with a pre-trained model (e.g., GPT-3). Fine-tune it on ~13K human-written demonstration examples. The model learns to generate instruction-following responses. Output: **SFT model (π_SFT)**.
**Stage 2 — Reward Model:** Generate multiple responses to ~33K prompts using the SFT model. Human annotators rank the responses. Train a reward model on these preferences. Output: **Reward model R(x, y)**.
**Stage 3 — PPO:** Use the SFT model as the starting policy. For each batch of prompts, generate responses, score them with R, and update the policy to increase reward while staying close to π_SFT. Run for thousands of PPO steps. Output: **Aligned model (π*)**.
**The human labor cost is significant:**
- SFT: ~13K demonstrations (expensive — skilled writers needed)
- Reward model: rankings for ~33K prompts (cheaper, since ranking is faster than writing)
- PPO: No new human labor (automated optimization)
**The result:** InstructGPT (1.3B params, RLHF-trained) was preferred over GPT-3 (175B params, no RLHF) by human evaluators. Alignment training can make a small model more useful than a large unaligned one.
Step 6: DPO as an Alternative
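**Direct Preference Optimization (DPO)** simplifies RLHF by eliminating the separate reward model and PPO entirely. Instead of the three-step pipeline (SFT → Reward Model → PPO), DPO starts from the SFT model, which also serves as the frozen reference π_ref, and replaces the last two steps with a single training stage on preference pairs.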
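**The key insight:** The reward model in RLHF only serves to define a preference ordering. For the KL-penalized objective, the optimal policy has a closed form in terms of the reward, so the reward can be rewritten in terms of the policy and the reference model. DPO uses this to optimize for the same preferences **directly**, without an intermediate reward model.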
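**Sketch: the reward hidden inside the policy.** A small numeric check of that idea, under the assumption that the policy is exactly the optimum of the KL-penalized objective, π*(y) ∝ π_ref(y) · exp(R(y) / β). In that case β · log(π*(y) / π_ref(y)) recovers R(y) up to a constant shared by all responses. The numbers and the `implicit_rewards` helper are hypothetical.

```python
import math


def implicit_rewards(policy_probs, ref_probs, beta):
    """Implicit reward beta * log(pi(y) / pi_ref(y)) for each response."""
    return [beta * math.log(p / q) for p, q in zip(policy_probs, ref_probs)]


beta = 0.5
ref_probs = [0.6, 0.3, 0.1]  # hypothetical reference policy
rewards = [0.0, 1.0, 3.0]    # hypothetical "true" rewards

# Closed-form optimum of the KL-penalized objective for these rewards.
weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
optimal_policy = [w / sum(weights) for w in weights]

recovered = implicit_rewards(optimal_policy, ref_probs, beta)
offsets = [r - rec for r, rec in zip(rewards, recovered)]
print(offsets)  # the same constant for every response, up to rounding
```

Because the pairwise loss only depends on reward differences, that shared constant cancels, which is what lets the preference loss be written purely in terms of π and π_ref.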
**DPO loss:** Given preference pairs (y_w preferred over y_l for prompt x):
L_DPO = -log σ(β · (log(π(y_w|x) / π_ref(y_w|x)) - log(π(y_l|x) / π_ref(y_l|x))))
This is a **supervised learning** objective — no RL, no reward model, no PPO instability.
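**Sketch: the DPO loss for one preference pair.** A minimal illustration, assuming you already have the summed log-probability of each full response under both the policy and the frozen reference model. The function name, argument names, and the default β of 0.1 are illustrative choices, and the log-probabilities in the usage lines are made up.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one pair, from sequence log-probs of policy and reference."""
    # How much more (or less) likely each response is under the policy
    # than under the reference model, in log space.
    chosen_logratio = logp_w - ref_logp_w
    rejected_logratio = logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in an overflow-safe form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# At initialization the policy equals the reference, so the loss is ln 2.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the preferred response and lowering the other reduces the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

Note that this has the same shape as the reward model loss, with β · log(π / π_ref) standing in for the reward score.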
**Advantages of DPO:**
- Simpler to implement (standard supervised training loop)
- More stable training (no RL instability and no separately trained reward model to exploit, though DPO can still overfit the preference data)
- Faster to train (no reward model inference during training)
- Fewer hyperparameters to tune
**Disadvantages of DPO:**
- Less flexible — can't reuse the reward model for other purposes
- The reward is only implicit in the policy, so it is harder to inspect or debug than a standalone reward model
- May underperform RLHF on some tasks where iterative RL exploration helps
**In practice:** DPO and its variants (IPO, KTO, ORPO) have largely replaced PPO-based RLHF for open-source models due to their simplicity. Many closed-source labs still use RLHF for frontier models.
**Expected loss over the preference dataset:** L_DPO = -E[log σ(β · (log(π(y_w|x) / π_ref(y_w|x)) - log(π(y_l|x) / π_ref(y_l|x))))]
Step 7: Test Your Understanding
You've learned the full RLHF pipeline and its modern DPO alternative. Let's test your understanding!
Prerequisites
- Understanding of supervised fine-tuning
- Basic knowledge of how LLMs generate text
- Familiarity with loss functions and optimization
Key Concepts
- Alignment Problem
- Reward Model
- Proximal Policy Optimization (PPO)
- KL Divergence Penalty
- Direct Preference Optimization (DPO)
- Reward Hacking