Pre-training
Learning from the Internet
What Is Pre-training?
Pre-training is the foundational phase where an LLM learns general language abilities from massive amounts of text. It's called "pre"-training because it happens before any task-specific fine-tuning.
Think of it as giving the model a broad education before specializing. A medical student first learns general biology, chemistry, and anatomy (pre-training) before specializing in cardiology (fine-tuning).
During pre-training, the model:
- •Reads billions of text sequences from the training corpus
- •Tries to predict the next token for each position
- •Compares its predictions to the actual next token
- •Updates its weights to make better predictions
This process is self-supervised — the training signal comes from the text itself, not from human labels. Every sentence provides training examples: given "The cat", predict "sat"; given "The cat sat", predict "on". This is incredibly data-efficient because every token in the corpus is both an input and a label.
Pre-training typically takes weeks to months on hundreds or thousands of GPUs and consumes the vast majority of the total training budget.
The Three Phases of LLM Training
| Training Phase | What It Does | Data | Compute | Time |
|---|---|---|---|---|
| Pre-training | Learn general language abilities | Trillions of tokens (web text) | 99% of total compute | Weeks to months |
| Fine-tuning (SFT) | Learn to follow instructions | ~100K instruction pairs | <1% of compute | Hours to days |
| RLHF / DPO | Align with human preferences | ~50K preference comparisons | <1% of compute | Hours to days |
Pre-training vs Traditional Supervised Learning
| Property | Pre-training | Supervised Learning (traditional) |
|---|---|---|
| Labels | Self-supervised (from text itself) | Human-annotated labels required |
| Data scale | Trillions of tokens | Thousands to millions of examples |
| Task | General: predict next token | Specific: classify, translate, etc. |
| Outcome | Broad language understanding | Narrow task performance |
| Cost | $1M - $100M+ | $100 - $100K |