RLHF

Reinforcement Learning from Human Feedback

Difficulty: Intermediate · Duration: 12-15 min · Prerequisites: Fine-tuning · Step 1 of 7

The Alignment Problem

After pre-training and instruction tuning, an LLM can follow instructions. But following instructions is not the same as being helpful, harmless, and honest.

Consider these problems:

Helpfulness without safety: "Write a convincing phishing email" — an instruction-tuned model will helpfully comply because it was trained to follow instructions.

Sycophancy: Models trained on human-written responses learn to agree with the user, even when the user is wrong. "Is 2+2=5?" → "That's an interesting perspective..."

Verbosity: Models learn that longer responses get higher ratings in training data, so they pad answers unnecessarily.

Hallucination: The model confidently states false information because it was rewarded for fluent, confident-sounding text.

The core issue: the training objective (predict next token / follow instructions) doesn't capture what we actually want — responses that are helpful, truthful, safe, and appropriately concise.

RLHF (Reinforcement Learning from Human Feedback) addresses this by training the model to optimize for human preferences rather than just next-token prediction. It adds a layer of human judgment on top of language modeling.
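
To make "optimize for human preferences" concrete, here is a minimal sketch of the pairwise (Bradley-Terry style) loss typically used to train a reward model from human preference judgments. The `reward_model` callable and the tensor encodings are illustrative assumptions, not a specific library's API:

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Sketch of a pairwise preference loss for reward model training.

    reward_model: hypothetical network mapping an encoded
                  (prompt + response) batch to a scalar score per example
    chosen:       encodings of the responses human labelers preferred
    rejected:     encodings of the responses they rejected
    """
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when the
    # human-preferred response consistently outscores the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Once trained, the reward model stands in for slow, expensive human labeling: it turns human judgment into a cheap scalar signal the RL stage can optimize against.
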

The Alignment Gap

| Problem | What the Model Does | What We Want | Root Cause |
| --- | --- | --- | --- |
| Harmful compliance | Follows dangerous instructions | Refuses harmful requests | Trained to follow all instructions |
| Sycophancy | Agrees with incorrect claims | Politely corrects errors | Optimized for user satisfaction |
| Hallucination | Invents plausible-sounding facts | Admits uncertainty | Rewarded for confident text |
| Verbosity | Pads responses unnecessarily | Concise, focused answers | Longer = higher training signal |
| Bias amplification | Reflects and amplifies biases | Fair, balanced responses | Trained on biased internet data |

The Three Stages of LLM Training

| Training Stage | Objective | What It Teaches |
| --- | --- | --- |
| Pre-training | Predict next token | Language understanding and knowledge |
| SFT / Instruction tuning | Match human-written responses | Follow instructions, format outputs |
| RLHF | Maximize human preference scores | Be helpful, honest, harmless |
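
To make the third row concrete, here is a hedged sketch of the quantity most RLHF implementations maximize: the reward model's score minus a KL penalty that keeps the policy close to the SFT model. The function name and the `beta` value are illustrative:

```python
import torch

def rlhf_objective(reward, logprob_policy, logprob_ref, beta=0.1):
    """Sketch of the RLHF objective for one sampled response.

    reward:          scalar score from the reward model for the response
    logprob_policy:  per-token log-probs of the response under the
                     current policy being trained
    logprob_ref:     per-token log-probs under the frozen SFT reference
    beta:            KL penalty weight (illustrative value)
    """
    # Approximate KL between policy and reference on this sample.
    kl = (logprob_policy - logprob_ref).sum()
    # Chase higher preference scores, but penalize drifting away
    # from the fluent language the SFT model already produces.
    return reward - beta * kl
```

In practice this objective is maximized with a policy-gradient method such as PPO; without the KL term, the policy tends to drift into degenerate text that games the reward model rather than genuinely improving.
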