Pre-training is the foundational phase where an LLM learns general language abilities from massive amounts of text. It's called "pre"-training because it happens before any task-specific fine-tuning.

Think of it as giving the model a broad education before specializing. A medical student first learns general biology, chemistry, and anatomy (pre-training) before specializing in cardiology (fine-tuning).

During pre-training, the model:

• Reads billions of text sequences from the training corpus
• Tries to predict the next token at each position
• Compares its predictions to the actual next token
• Updates its weights to make better predictions
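
The four steps above can be sketched with a toy model. This is a minimal illustration only, not how a production LLM is trained: a bigram table of logits stands in for a transformer, a six-word sentence stands in for the corpus, and plain gradient descent stands in for a real optimizer. The names `corpus`, `logits`, `softmax`, and `train_epoch`, and the learning rate of 0.5, are all assumptions made for this example.

```python
import math

# Toy "corpus"; whitespace splitting stands in for a real tokenizer (assumption).
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}
ids = [index[tok] for tok in corpus]
V = len(vocab)

# The model's "weights": logits[prev][next]. A real LLM uses a transformer here.
logits = [[0.0] * V for _ in range(V)]

def softmax(row):
    """Turn a row of logits into a probability distribution."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(lr=0.5):
    """One pass over the corpus; returns the mean cross-entropy loss."""
    total_loss = 0.0
    for prev, target in zip(ids[:-1], ids[1:]):      # 1. read the sequence
        probs = softmax(logits[prev])                # 2. predict the next token
        total_loss += -math.log(probs[target])       # 3. compare with the actual token
        for j in range(V):                           # 4. update the weights
            grad = probs[j] - (1.0 if j == target else 0.0)
            logits[prev][j] -= lr * grad
    return total_loss / (len(ids) - 1)

losses = [train_epoch() for _ in range(50)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

The loss falls across epochs because each update shifts probability toward the token that actually came next. Real pre-training follows the same predict, compare, update cycle, only with far larger models, batches, and corpora.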

This process is self-supervised: the training signal comes from the text itself, not from human labels. Every sentence provides training examples: given "The cat", predict "sat"; given "The cat sat", predict "on". This makes the approach highly label-efficient, because every token in the corpus serves as both an input and a prediction target, so no manual annotation is needed.
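
To make the "The cat" / "sat" example concrete, the sketch below derives every (context, target) pair from a single sentence. It assumes whitespace splitting in place of a real subword tokenizer, and `next_token_examples` is a hypothetical helper name chosen for illustration.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: each prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace split stands in for a real tokenizer (assumption).
tokens = "The cat sat on the mat".split()
examples = next_token_examples(tokens)

for context, target in examples:
    print(" ".join(context), "->", target)
```

A sentence of n tokens yields n - 1 training examples without any annotation. In practice these pairs are usually not materialized one by one: the model scores all positions of a sequence in a single pass, and the targets are the input sequence shifted by one position.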

Pre-training typically takes weeks to months on hundreds or thousands of GPUs and consumes the vast majority of the total training budget.

The Three Phases of LLM Training

Training Phase	What It Does	Typical Data	Share of Compute	Typical Duration
Pre-training	Learn general language abilities	Trillions of tokens (web text)	99% of total compute	Weeks to months
Fine-tuning (SFT)	Learn to follow instructions	~100K instruction pairs	<1% of compute	Hours to days
RLHF / DPO	Align with human preferences	~50K preference comparisons	<1% of compute	Hours to days

Pre-training vs Traditional Supervised Learning

Property	Pre-training	Supervised Learning (traditional)
Labels	Self-supervised (from text itself)	Human-annotated labels required
Data scale	Trillions of tokens	Thousands to millions of examples
Task	General: predict next token	Specific: classify, translate, etc.
Outcome	Broad language understanding	Narrow task performance
Cost	$1M - $100M+	$100 - $100K
