After pre-training and instruction tuning, an LLM can follow instructions. But following instructions is not the same as being helpful, harmless, and honest.

Consider these problems:

Helpfulness without safety: Given the prompt "Write a convincing phishing email", an instruction-tuned model tends to comply, because it was trained to follow instructions.

Sycophancy: Models trained on human-written responses learn to agree with the user, even when the user is wrong. "Is 2+2=5?" → "That's an interesting perspective..."

Verbosity: When longer responses tend to receive higher ratings in the training data, models learn to pad their answers unnecessarily.

Hallucination: The model confidently states false information because its training favored fluent, confident-sounding text rather than verified facts.

The core issue is that the training objective (predict the next token, or follow instructions) doesn't capture what we actually want: responses that are helpful, truthful, safe, and appropriately concise.

RLHF (Reinforcement Learning from Human Feedback) addresses this by training the model to optimize for human preferences rather than just next-token prediction. It adds a layer of human judgment on top of language modeling.
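One common way to turn human judgments into a training signal is to collect pairwise comparisons ("response A is better than response B") and fit a reward model with a Bradley-Terry style loss. The text above does not specify this mechanism, so treat the following as a minimal sketch under that assumption. The function name `preference_loss` and the example scores are hypothetical and only illustrate the idea.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumes a reward model has already produced one scalar score per response.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: the human-preferred response is scored higher...
agrees_with_human = preference_loss(2.0, -1.0)
# ...versus the reward model ranking the pair the wrong way round.
disagrees_with_human = preference_loss(-1.0, 2.0)
print(f"agrees: {agrees_with_human:.3f}  disagrees: {disagrees_with_human:.3f}")
```

The loss is small when the reward model ranks the pair the same way the human did and grows as the ranking is reversed, which is what pushes the reward scores toward human preferences.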

The Alignment Gap

Problem	What the Model Does	What We Want	Root Cause
Harmful compliance	Follows dangerous instructions	Refuses harmful requests	Trained to follow all instructions
Sycophancy	Agrees with incorrect claims	Politely corrects errors	Optimized for user satisfaction
Hallucination	Invents plausible-sounding facts	Admits uncertainty	Training favors fluent, confident text
Verbosity	Pads responses unnecessarily	Concise, focused answers	Longer responses rated higher in training data
Bias amplification	Reflects and amplifies biases	Fair, balanced responses	Trained on biased internet data

The Three Stages of LLM Training

Training Stage	Objective	What It Teaches
Pre-training	Predict next token	Language understanding and knowledge
SFT / Instruction tuning	Match human-written responses	Follow instructions, format outputs
RLHF	Maximize human preference scores	Be helpful, honest, harmless
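The three objectives in the table can be contrasted in a small numeric sketch. It rests on two assumptions that the table does not state: that SFT computes the loss only on response tokens, and that the RLHF objective subtracts a KL-style penalty (weighted by `beta`) for drifting away from the SFT model. Both are common conventions. All function names and numbers here are hypothetical.

```python
import math
from typing import Sequence


def next_token_loss(token_probs: Sequence[float]) -> float:
    """Pre-training: mean negative log-likelihood of each observed next token."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def sft_loss(token_probs: Sequence[float], is_response: Sequence[bool]) -> float:
    """SFT: the same loss, counted only on the human-written response tokens."""
    response_probs = [p for p, keep in zip(token_probs, is_response) if keep]
    return next_token_loss(response_probs)


def rlhf_objective(reward: float, logp_policy: float, logp_reference: float,
                   beta: float = 0.1) -> float:
    """RLHF: preference score minus a penalty for drifting from the SFT model."""
    return reward - beta * (logp_policy - logp_reference)


# Probabilities the model assigned to the tokens that actually came next.
probs = [0.5, 0.25, 0.5, 0.125]
# The first two tokens belong to the prompt, the last two to the response.
mask = [False, False, True, True]

print("pre-training loss:", next_token_loss(probs))
print("SFT loss:", sft_loss(probs, mask))
print("RLHF objective:", rlhf_objective(reward=1.0, logp_policy=-2.0, logp_reference=-3.0))
```

The first two functions are minimized and only measure how well the model imitates text. The third is maximized and is the only one that depends on a preference score.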