Pre-training

Learning from the Internet

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Tokenization
Step: 1 of 7

What Is Pre-training?

Pre-training is the foundational phase where an LLM learns general language abilities from massive amounts of text. It's called "pre"-training because it happens before any task-specific fine-tuning.

Think of it as giving the model a broad education before specializing. A medical student first learns general biology, chemistry, and anatomy (pre-training) before specializing in cardiology (fine-tuning).

During pre-training, the model:

  1. Reads billions of text sequences from the training corpus
  2. Tries to predict the next token for each position
  3. Compares its predictions to the actual next token
  4. Updates its weights to make better predictions (a code sketch of this loop follows the list)
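
Here is a minimal sketch of one such training step, assuming PyTorch and a toy stand-in model (an embedding layer plus a linear projection, not a real Transformer); the shift-by-one targets and the cross-entropy loss are the parts that carry over to real pre-training.

```python
# Minimal sketch of one pre-training step. Assumed setup: PyTorch installed;
# the "model" is a toy embedding + linear layer standing in for a Transformer.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),   # token IDs -> vectors
    nn.Linear(d_model, vocab_size),      # vectors -> next-token logits
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# A batch of token IDs, shape (batch, sequence length). In real pre-training
# these come from the tokenized corpus; here they are random placeholders.
tokens = torch.randint(0, vocab_size, (8, 128))

# Steps 1-2: read the sequence and predict the next token at every position.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = model(inputs)                   # (batch, seq_len - 1, vocab_size)

# Step 3: compare predictions to the actual next tokens (cross-entropy loss).
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)

# Step 4: update the weights to make better predictions.
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

A real run repeats this step millions of times over shards of the corpus, with a full Transformer in place of the toy model, plus learning-rate scheduling and distributed training across many GPUs.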

This process is self-supervised: the training signal comes from the text itself, not from human labels. Every sentence provides training examples: given "The cat", predict "sat"; given "The cat sat", predict "on". Because every token in the corpus serves as both an input and a label, the data requires no human annotation and the supervision scales with the size of the corpus.
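To make that concrete, here is a toy illustration of how a single sentence expands into many input/label pairs (word-level tokens are assumed purely for readability; real models operate on subword tokens):

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# Every position yields one training example: the preceding tokens are the
# input context, and the token at that position is the label.
for i in range(1, len(tokens)):
    context, label = tokens[:i], tokens[i]
    print(f"given {' '.join(context)!r:<22} predict {label!r}")

# Output (first few lines):
# given 'The'                  predict 'cat'
# given 'The cat'              predict 'sat'
# given 'The cat sat'          predict 'on'
```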

Pre-training typically takes weeks to months on hundreds or thousands of GPUs and consumes the vast majority of the total training budget.

The Three Phases of LLM Training

| Training Phase | What It Does | Data | Compute | Time |
| --- | --- | --- | --- | --- |
| Pre-training | Learn general language abilities | Trillions of tokens (web text) | 99% of total compute | Weeks to months |
| Fine-tuning (SFT) | Learn to follow instructions | ~100K instruction pairs | <1% of compute | Hours to days |
| RLHF / DPO | Align with human preferences | ~50K preference comparisons | <1% of compute | Hours to days |

Pre-training vs Traditional Supervised Learning

| Property | Pre-training | Supervised Learning (traditional) |
| --- | --- | --- |
| Labels | Self-supervised (from text itself) | Human-annotated labels required |
| Data scale | Trillions of tokens | Thousands to millions of examples |
| Task | General: predict next token | Specific: classify, translate, etc. |
| Outcome | Broad language understanding | Narrow task performance |
| Cost | $1M - $100M+ | $100 - $100K |