Pre-training

Learning from the Internet

Difficulty: Intermediate
Duration: 12-15 min
Prerequisites: Tokenization
Step: 1/7

What Is Pre-training?

Pre-training is the foundational phase where an LLM learns general language abilities from massive amounts of text. It's called "pre"-training because it happens before any task-specific fine-tuning.

Think of it as giving the model a broad education before specializing. A medical student first learns general biology, chemistry, and anatomy (pre-training) before specializing in cardiology (fine-tuning).

During pre-training, the model:

  1. Reads billions of text sequences from the training corpus
  2. Tries to predict the next token for each position
  3. Compares its predictions to the actual next token
  4. Updates its weights to make better predictions

This process is self-supervised — the training signal comes from the text itself, not from human labels. Every sentence provides training examples: given "The cat", predict "sat"; given "The cat sat", predict "on". This is incredibly data-efficient because every token in the corpus is both an input and a label.

Pre-training typically takes weeks to months on hundreds or thousands of GPUs and consumes the vast majority of the total training budget.
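
To make the self-supervised setup concrete, the short sketch below turns one sentence into (context, target) training pairs. It assumes a whitespace split as a stand-in for a real tokenizer, and `make_training_pairs` is a hypothetical helper name used only for illustration.

```python
def make_training_pairs(tokens):
    """Return (context, next_token) pairs: every prefix predicts the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Assumption: whitespace splitting stands in for a real subword tokenizer.
tokens = "The cat sat on the mat".split()
pairs = make_training_pairs(tokens)

for context, target in pairs:
    print(f"given {' '.join(context)!r:24} predict {target!r}")
```

A six-token sentence yields five training pairs, and none of them needed a human label: the text supplies both the inputs and the targets.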

The Three Phases of LLM Training

| Training Phase | What It Does | Data | Compute | Time |
| --- | --- | --- | --- | --- |
| Pre-training | Learn general language abilities | Trillions of tokens (web text) | 99% of total compute | Weeks to months |
| Fine-tuning (SFT) | Learn to follow instructions | ~100K instruction pairs | <1% of compute | Hours to days |
| RLHF / DPO | Align with human preferences | ~50K preference comparisons | <1% of compute | Hours to days |

Pre-training vs Traditional Supervised Learning

| Property | Pre-training | Supervised Learning (traditional) |
| --- | --- | --- |
| Labels | Self-supervised (from text itself) | Human-annotated labels required |
| Data scale | Trillions of tokens | Thousands to millions of examples |
| Task | General: predict next token | Specific: classify, translate, etc. |
| Outcome | Broad language understanding | Narrow task performance |
| Cost | $1M - $100M+ | $100 - $100K |

Pre-training — Lesson Content

Learn how LLMs are trained from scratch on internet-scale data using next-token prediction — the process that gives them their broad capabilities.

Pre-training is where an LLM acquires its general knowledge and abilities. Through hundreds of thousands of training steps on trillions of tokens, the model learns grammar, facts, reasoning, code, and more — all from a single objective: predict the next token. This lesson covers the pre-training objective, the massive datasets used, the training loop and its hyperparameters, how to read loss curves, and the enormous compute requirements that make frontier model training a multi-million dollar endeavor.

Learning Objectives

  • Explain the next-token prediction objective and why it produces broad capabilities
  • Describe the composition and scale of LLM pre-training data
  • Trace the steps of a pre-training iteration (forward, loss, backward, update)
  • Interpret a pre-training loss curve and identify its phases
  • Estimate compute requirements using the 6*N*D formula

Step 2: The Training Objective: Next Token Prediction

The pre-training objective for decoder models (GPT, Claude, Llama) is deceptively simple:

> **Given all previous tokens, predict the next token.**

Formally, the model maximizes the log-likelihood of the training data, which is the same as minimizing the average negative log-likelihood over the T tokens of a sequence:

$$L = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1})$$

For a sequence "The cat sat on the mat":

- Position 1: Given "The" → predict "cat"
- Position 2: Given "The cat" → predict "sat"
- Position 3: Given "The cat sat" → predict "on"
- Position 4: Given "The cat sat on" → predict "the"
- Position 5: Given "The cat sat on the" → predict "mat"

The model outputs a probability distribution over the entire vocabulary (e.g., 50,000 tokens) at each position. The loss is the **cross-entropy** between the predicted distribution and the one-hot true next token, which reduces to the negative log of the probability the model assigned to the correct token.

**Why does this simple objective produce intelligent behavior?** To predict the next word well, the model must learn grammar, facts, reasoning, common sense, style, and more. Predicting that "Paris" follows "The capital of France is" requires world knowledge. Predicting the next line of code requires understanding programming logic.
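
As a worked illustration of the loss above, here is a minimal pure-Python sketch that computes the average negative log-probability of the true next tokens. The 4-token vocabulary, the logit values, and the helper names `softmax` and `next_token_loss` are assumptions made for the example; real training code would use a framework's built-in cross-entropy.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, target_ids):
    """Average of -log P(true next token) across positions, i.e. the formula for L."""
    nll = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        nll -= math.log(probs[target])
    return nll / len(target_ids)


# Toy setup (assumed): vocabulary of 4 tokens, 2 positions to predict.
targets = [0, 1]
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]    # model has no preference
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]]  # model favors the right token

print("uniform loss:  ", next_token_loss(uniform, targets))
print("confident loss:", next_token_loss(confident, targets))
```

A model that spreads probability evenly over V tokens has loss ln(V), here ln(4). Putting more probability on the correct token pushes the loss toward zero.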

Step 3: Training Data: Internet Scale

Pre-training data is the raw fuel of LLMs. Modern models train on **trillions of tokens** drawn from diverse internet sources. The quality and composition of this data profoundly affects the model's capabilities.

**Common data sources:**

- **Web crawls (Common Crawl, etc.):** Petabytes of raw web pages, filtered for quality. This is the largest source but also the noisiest, full of spam, duplicates, and low-quality text. Extensive filtering is needed.
- **Books:** High-quality long-form text (BookCorpus, Books3). Great for learning coherent narratives, complex arguments, and diverse writing styles.
- **Code:** GitHub repositories, Stack Overflow. Teaches the model programming, logical thinking, and structured problem-solving. Code training appears to improve reasoning even on non-code tasks.
- **Scientific papers:** ArXiv, PubMed, Semantic Scholar. Provides technical knowledge and formal reasoning patterns.
- **Wikipedia:** High-quality factual content across many topics and languages.

**Data quality matters enormously.** Research consistently shows that a smaller amount of high-quality, deduplicated data produces better models than a larger amount of noisy data. The data curation pipeline (filtering, deduplication, toxicity removal) is as important as the model architecture.
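
The filtering and deduplication steps can be sketched in a few lines. This is a deliberately minimal illustration, not a production pipeline: the `min_words` threshold, the `normalize` and `curate` helpers, and the sample documents are all assumptions, and real pipelines typically add near-duplicate detection and quality or toxicity classifiers on top of exact matching.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())


def curate(documents, min_words=5):
    """Drop very short documents and exact duplicates (after normalization)."""
    seen = set()
    kept = []
    for doc in documents:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # crude quality heuristic (assumed threshold): too short to be useful
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # an identical document was already kept
        seen.add(digest)
        kept.append(doc)
    return kept


raw = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate up to case and spacing
    "Click here!",                                    # too short
    "Pre-training data is filtered and deduplicated before use.",
]
cleaned = curate(raw)
print(f"kept {len(cleaned)} of {len(raw)} documents")
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters when the corpus is measured in petabytes.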

Step 4: The Training Loop

The pre-training loop is conceptually straightforward, but executing it at LLM scale requires sophisticated engineering:

**1. Sample a batch:** Pick a batch of text sequences (e.g., 2,048 sequences of 2,048 tokens each = ~4M tokens per batch).

**2. Forward pass:** Feed the batch through the transformer. At each position, the model produces a probability distribution over the vocabulary for the next token.

**3. Compute loss:** Compare the predicted distributions to the actual next tokens using cross-entropy loss. Average across all positions and all sequences in the batch.

**4. Backward pass:** Compute gradients of the loss with respect to all parameters using backpropagation.

**5. Update weights:** Apply the optimizer (typically AdamW) to update all parameters. The learning rate follows a schedule, usually a warmup phase followed by cosine decay.

**6. Repeat:** Process the next batch. Continue for hundreds of thousands of steps.

**Distributed training:** A single GPU can't hold a 175B-parameter model: at 2 bytes per parameter (16-bit precision), the weights alone need ~350 GB of memory, before counting gradients and optimizer state. Models are split across hundreds or thousands of GPUs using techniques like tensor parallelism, pipeline parallelism, and data parallelism. Keeping all GPUs utilized and synchronized is a major engineering challenge.
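# Simplified pre-training loop (PyTorch sketch). A tiny stand-in model and random
# tokens replace the real transformer and corpus so the loop runs end to end.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 100, 16, 8
num_training_steps, log_interval = 50, 10
model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32), torch.nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_training_steps)  # cosine decay only; warmup omitted
dataset = torch.randint(0, vocab_size, (1024, seq_len + 1))  # stand-in tokenized corpus

for step in range(num_training_steps):
    batch = dataset[torch.randint(0, len(dataset), (batch_size,))]  # Sample a batch
    input_tokens, target_tokens = batch[:, :-1], batch[:, 1:]       # Targets are the inputs shifted by one

    logits = model(input_tokens)  # [B, T, vocab_size]
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), target_tokens.reshape(-1))

    optimizer.zero_grad()     # Reset gradients from the previous step
    loss.backward()           # Compute gradients
    optimizer.step()          # Update weights (AdamW)
    lr_scheduler.step()       # Adjust learning rate

    if step % log_interval == 0:
        print(f"Step {step}, Loss: {loss.item():.4f}")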
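
The learning rate schedule mentioned in step 5 (warmup followed by cosine decay) can be written as a small standalone function. The sketch below is one common formulation, not the schedule of any particular model; the default values for `max_lr`, `min_lr`, `warmup_steps`, and `total_steps` are illustrative assumptions.

```python
import math


def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2_000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so early, noisy gradients do not destabilize training.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return min_lr + (max_lr - min_lr) * cosine


for s in (0, 1_000, 2_000, 50_000, 100_000):
    print(f"step {s:>7}: lr = {lr_at_step(s):.2e}")
```

The learning rate starts near zero, peaks at the end of warmup, and then decays smoothly, so the largest updates happen early and the final steps make only small refinements.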

Step 5: Loss Curves During Pre-training

Watching the loss curve during pre-training tells you how well the model is learning. A healthy pre-training run shows a characteristic pattern (the step ranges below are illustrative and vary with model and batch size):

**Phase 1: Rapid decrease (steps 0-10K).** Loss drops sharply as the model learns basic patterns: common words, simple grammar, frequent bigrams. The model goes from near-random predictions (loss ≈ ln(50,000) ≈ 10.8 for a 50,000-token vocabulary, i.e. perplexity ≈ 50,000) to basic competence.

**Phase 2: Steady improvement (steps 10K-200K).** Loss continues to decrease but more slowly. The model is learning increasingly subtle patterns: long-range dependencies, factual knowledge, style, and reasoning.

**Phase 3: Diminishing returns (steps 200K+).** The loss curve flattens. Each additional step provides smaller improvements. The model is now learning rare patterns and edge cases. Most of the "easy" signal has been extracted.

**Larger models achieve lower final loss.** A 175B-parameter model reaches a lower loss than a 7B model trained on the same data because it has more capacity to capture patterns. This is the empirical foundation of scaling laws.

**Loss spikes** occasionally appear during training: sudden jumps in loss caused by unusual data batches, numerical instability, or learning rate issues. Modern training pipelines monitor for these and recover automatically, for example by restarting from an earlier checkpoint.
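
Two quantities from this step are easy to compute by hand: the perplexity that corresponds to a loss value, and a simple check for loss spikes. The sketch below assumes natural-log cross-entropy. The spike rule (loss above 1.5 times the mean of the previous 5 logged values) and the sample loss values are illustrative assumptions, not a standard.

```python
import math


def perplexity(loss):
    """Perplexity is the exponential of the (natural-log) cross-entropy loss."""
    return math.exp(loss)


def random_baseline_loss(vocab_size):
    """Loss of a model that guesses uniformly over the vocabulary: ln(V)."""
    return math.log(vocab_size)


def find_loss_spikes(losses, window=5, threshold=1.5):
    """Flag indices whose loss exceeds `threshold` times the mean of the previous `window` values."""
    spikes = []
    for i in range(window, len(losses)):
        recent_mean = sum(losses[i - window:i]) / window
        if losses[i] > threshold * recent_mean:
            spikes.append(i)
    return spikes


# Made-up logged losses with one injected spike at index 7.
losses = [10.8, 7.5, 6.0, 5.0, 4.2, 3.8, 3.5, 7.9, 3.4, 3.3]

print(f"uniform-guess loss for V=50,000: {random_baseline_loss(50_000):.2f}")
print(f"perplexity at loss 3.3: {perplexity(3.3):.1f}")
print(f"spike indices: {find_loss_spikes(losses)}")
```

Perplexity can be read as the number of equally likely tokens the model is effectively choosing between, so a uniform guess over 50,000 tokens has perplexity 50,000.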

Step 6: Compute Requirements

Training LLMs requires enormous computational resources. The cost is driven by three factors: model size, dataset size, and training duration.

**Compute is measured in FLOPs** (floating-point operations). A useful rule of thumb:

**Compute ≈ 6 * N * D**

where N = number of parameters and D = number of training tokens. The factor of 6 accounts for both forward and backward passes (roughly 2 FLOPs per parameter per token for forward, 4 for backward).

For GPT-3 (175B params, 300B tokens): 6 * 175B * 300B = **3.15 x 10^23 FLOPs**, or about 3,650 petaflop-days (one petaflop-day is 10^15 FLOP/s sustained for 86,400 seconds, i.e. 8.64 x 10^19 FLOPs).

**Hardware:** Modern LLMs train on NVIDIA A100 or H100 GPUs, Google TPUs, or custom chips. A single H100 provides ~1,000 TFLOPS peak (half-precision). At that peak rate, training GPT-3 on a single H100 would take ~10 years, and 1,024 H100s would take about 3.6 days. Real runs sustain only a fraction of peak throughput; at 30-50% utilization, the 1,024-GPU run takes roughly one to two weeks.

**Energy and environment:** Training GPT-3 consumed an estimated 1,287 MWh of electricity, roughly equal to the annual consumption of 120 US homes. Larger training runs consume correspondingly more. This raises important questions about the environmental cost of AI development.

The enormous cost creates a barrier to entry: only well-funded labs can train frontier models from scratch. This is why open-weight models (Llama, Mistral) and efficient fine-tuning techniques (LoRA, QLoRA) are so important for democratizing access.
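
The 6 * N * D rule of thumb is simple enough to turn into a small calculator. The sketch below reuses the GPT-3 figures and the ~1,000 TFLOPS per-GPU peak from the text. The `utilization` parameter (the fraction of peak throughput actually sustained) and the 0.4 value used in the last line are assumptions for illustration.

```python
SECONDS_PER_DAY = 86_400


def training_flops(n_params, n_tokens):
    """Rule of thumb: about 6 FLOPs per parameter per training token."""
    return 6 * n_params * n_tokens


def petaflop_days(flops):
    """Convert FLOPs to petaflop-days (10^15 FLOP/s sustained for one day)."""
    return flops / (1e15 * SECONDS_PER_DAY)


def training_days(flops, n_gpus, peak_flops_per_gpu, utilization=1.0):
    """Wall-clock days, given GPU count, per-GPU peak FLOP/s, and sustained utilization."""
    sustained = n_gpus * peak_flops_per_gpu * utilization
    return flops / sustained / SECONDS_PER_DAY


gpt3 = training_flops(175e9, 300e9)  # 175B parameters, 300B tokens
h100_peak = 1e15                     # ~1,000 TFLOPS half-precision peak, as in the text

print(f"total compute:       {gpt3:.3g} FLOPs")
print(f"petaflop-days:       {petaflop_days(gpt3):,.0f}")
print(f"1 H100 at peak:      {training_days(gpt3, 1, h100_peak) / 365:.1f} years")
print(f"1,024 H100s at peak: {training_days(gpt3, 1024, h100_peak):.1f} days")
print(f"1,024 H100s at 40%:  {training_days(gpt3, 1024, h100_peak, 0.4):.1f} days")
```

Because the estimate is linear in both N and D, doubling either the parameter count or the token count doubles the compute, which is why frontier-scale runs grow so quickly in cost.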

Step 7: Test Your Understanding

You've learned about the pre-training process — the objective, the data, the training loop, loss curves, and compute requirements. Let's check your understanding!

Prerequisites

  • What Is an LLM
  • Tokenization
  • Basic understanding of neural network training (backpropagation, gradient descent)

Key Concepts

  • Self-Supervised Learning
  • Next-Token Prediction
  • Cross-Entropy Loss
  • Training Data Curation
  • The Training Loop
  • Loss Curves
  • Compute Requirements (FLOPs)
  • Distributed Training