Scaling Laws

Bigger is Better?

Difficulty: Intermediate
Duration: 10-12 min
Prerequisites: Pre-training

Step 1: The Scaling Hypothesis

The scaling hypothesis is one of the most important ideas in modern AI: performance improves predictably as you increase model size, data, and compute.

This wasn't obvious. Before the scaling era, the common belief was that architectural innovations (better attention mechanisms, clever training tricks) were the primary driver of progress. The scaling hypothesis flipped this: given a good-enough architecture (the transformer), simply making it bigger yields consistent, predictable improvements.

The key observations:

  • Loss follows a power law with respect to model size, data size, and compute
  • These power laws hold over many orders of magnitude (10M to 100B+ parameters)
  • The improvements in loss are smooth and predictable, with no sudden jumps or plateaus over the ranges studied
  • This means you can predict how well a larger model will perform before training it

L(N) = (N_c / N)^α — loss as a function of parameters, where N_c and α are constants

This predictability is remarkable. In most engineering fields, scaling doesn't work so cleanly: you hit abrupt bottlenecks or new failure modes. For LLMs, returns do diminish (each 10x buys a smaller absolute drop in loss), but they diminish along a smooth curve that can be fit and extrapolated.

The practical implication: labs can run small-scale experiments, fit the scaling curve, and extrapolate to determine whether a much larger (and much more expensive) training run is worth the investment.
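The fit-and-extrapolate workflow can be sketched in a few lines. This is a minimal illustration, assuming the pure power-law form L(N) = (N_c / N)^α given above. The "small-model" losses are synthetic values generated from assumed constants (the Kaplan et al. figures quoted later in this lesson), not real measurements, and `fit_power_law` and `predict_loss` are hypothetical helper names.

```python
import math

def predict_loss(n_params, n_c, alpha):
    """Evaluate the power law L(N) = (N_c / N)**alpha."""
    return (n_c / n_params) ** alpha

def fit_power_law(params, losses):
    """Fit L(N) = (N_c / N)**alpha by least squares in log-log space.

    Taking logs gives log L = alpha * log N_c - alpha * log N, a straight
    line with slope -alpha and intercept alpha * log N_c.
    """
    xs = [math.log(n) for n in params]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return n_c, alpha

# Synthetic "small-scale experiments" (assumed constants, illustration only).
ASSUMED_N_C, ASSUMED_ALPHA = 8.8e13, 0.076
small_models = [1e7, 3e7, 1e8, 3e8, 1e9]
observed = [predict_loss(n, ASSUMED_N_C, ASSUMED_ALPHA) for n in small_models]

# Fit on the small models, then extrapolate to a 100B-parameter model.
n_c, alpha = fit_power_law(small_models, observed)
extrapolated = predict_loss(1e11, n_c, alpha)
print(f"fitted alpha = {alpha:.3f}")
print(f"predicted loss at 100B params = {extrapolated:.3f}")
```

With real measurements the points would be noisy, so the fitted constants and the extrapolation would carry uncertainty that this noiseless sketch does not show.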

[Figure: Loss vs Model Size (Log Scale). x-axis: parameters (millions, log scale); y-axis: cross-entropy loss; curve: Kaplan et al. (2020) power law.]

Rules of Thumb for Scaling

| Scale Factor | What Changes | Observed Effect |
| --- | --- | --- |
| 10x parameters | Model capacity (width, depth) | Loss decreases by ~0.3-0.5 |
| 10x training data | Information available to learn | Loss decreases by ~0.2-0.4 |
| 10x compute | Total FLOPs (params × data) | Loss decreases by ~0.3-0.5 |
| 100x compute | Major scale-up | Qualitative new capabilities may emerge |

Scaling Laws — Lesson Content

Understand the predictable relationship between model size, training data, compute, and performance — the foundation of modern LLM development.

Scaling laws are the empirical discovery that LLM performance improves predictably with model size, data, and compute. This predictability lets researchers plan multimillion-dollar training runs with confidence. This lesson covers the scaling hypothesis and its power-law relationships, the Chinchilla findings that reshaped how labs allocate compute, the tradeoffs between parameters and data, the phenomenon of emergent abilities, and practical guidance on when to scale versus when to optimize.

Learning Objectives

  • Describe the power-law relationship between loss and model size
  • Explain the Chinchilla insight about optimal compute allocation
  • Calculate the compute-optimal parameter and data balance for a given budget
  • Define emergent abilities and give examples
  • Evaluate when scaling vs optimization is the better strategy

Step 2: Chinchilla Scaling Laws

In 2022, DeepMind published the **Chinchilla paper** ("Training Compute-Optimal Language Models"), which changed how labs allocate their compute budgets.

**The key finding:** For a fixed compute budget, there's an **optimal ratio** of model parameters to training tokens. Most existing models were severely **over-parameterized**: they had too many parameters trained on too little data.

**The Chinchilla rule:** Parameters and training tokens should scale roughly equally. For every doubling of model size, you should also double the training data.

**Optimal ratio: ~20 tokens per parameter.** For example, a 70B parameter model should be trained on approximately 70B × 20 = **1.4 trillion tokens**.

**Before Chinchilla:**

- GPT-3 (175B params) trained on only 300B tokens, under 2 tokens per parameter, which is severely under-trained by this rule
- Labs favored bigger models with less data

**After Chinchilla:**

- Chinchilla (70B params, 1.4T tokens) outperformed the much larger Gopher (280B params, 300B tokens) at the same compute budget
- Llama 2 (70B params) trained on 2T tokens, about 29 tokens per parameter, in line with the Chinchilla insight
- Smaller, better-trained models became the norm

This shifted the industry from "make models as big as possible" to "train models on as much data as possible given your compute budget."
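To make the ~20 tokens-per-parameter rule concrete, the sketch below computes the actual tokens-per-parameter ratio for the models named above and compares it with the rule-of-thumb target. The parameter and token counts are the ones quoted in this step; `chinchilla_tokens` is a hypothetical helper, and 20 is a rule of thumb rather than an exact constant.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted in this lesson

def chinchilla_tokens(n_params):
    """Rule-of-thumb training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params

# (parameters, training tokens) as quoted in this lesson
models = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "Llama 2 70B": (70e9, 2e12),
}

ratios = {}
for name, (n_params, n_tokens) in models.items():
    ratios[name] = n_tokens / n_params
    target = chinchilla_tokens(n_params)
    print(
        f"{name}: {ratios[name]:.1f} tokens/param, "
        f"rule-of-thumb target {target / 1e12:.1f}T tokens"
    )
```

GPT-3 and Gopher land far below 20 tokens per parameter, while Chinchilla sits at the target and Llama 2 70B goes beyond it.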

Step 3: Parameters vs Data vs Compute

The three dimensions of scaling, parameters (N), data (D), and compute (C), are tightly coupled. Understanding their relationship is crucial for planning training runs.

**The fundamental constraint:** C ≈ 6ND, where C is training compute in FLOPs and D is the number of training tokens. Given a fixed compute budget, you must choose how to split it between model size and data size.

**Parameters (N): model capacity**

- More parameters = more capacity to memorize patterns and learn complex functions
- But: more parameters need more data to train (otherwise they overfit or under-train)
- And: more parameters mean slower inference (in a dense model, every token generation involves all parameters)

**Data (D): information supply**

- More data = more patterns to learn from
- But: diminishing returns if the model is too small to absorb the information
- And: high-quality data is increasingly scarce (the "data wall")

**Compute (C): the budget**

- C ≈ 6ND determines the total cost
- Given C, there's an optimal (N*, D*) that minimizes loss
- Chinchilla: N* ∝ C^0.5 and D* ∝ C^0.5 (both scale equally with compute)

**The data wall problem:** We may be approaching the limits of available high-quality text data. The total internet is estimated at ~100 trillion tokens, but after quality filtering, only a fraction is useful. This is driving interest in synthetic data generation, multi-epoch training, and data efficiency techniques.
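The budget split can be worked out directly from the two rules of thumb in this lesson. Substituting D ≈ 20N into C ≈ 6ND gives C ≈ 120N², so N* ≈ sqrt(C / 120) and D* ≈ 20N*. The sketch below applies that; `compute_optimal_split` is a hypothetical helper, and the result is only as good as the two approximations it combines.

```python
import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N, which together
    give N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Budget equal to Chinchilla's training run under C = 6ND (70B params, 1.4T tokens).
budget = 6 * 70e9 * 1.4e12
n_opt, d_opt = compute_optimal_split(budget)
print(f"budget = {budget:.2e} FLOPs")
print(f"N* = {n_opt / 1e9:.1f}B params, D* = {d_opt / 1e12:.2f}T tokens")

# Quadrupling compute doubles both N* and D*, matching N*, D* ∝ C^0.5.
n_4x, d_4x = compute_optimal_split(4 * budget)
print(f"4x budget: N* = {n_4x / 1e9:.1f}B params, D* = {d_4x / 1e12:.2f}T tokens")
```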

Step 4: Loss vs Model Size (Log-Log Relationship)

One of the most striking findings in scaling research is that loss vs model size follows a near-perfect **power law**. On a log-log plot it appears as a straight line.

**L(N) = (N_c / N)^α_N**

Where:

- L is the cross-entropy loss
- N is the number of parameters (non-embedding parameters in Kaplan et al.)
- N_c ≈ 8.8 × 10^13 (from Kaplan et al.)
- α_N ≈ 0.076 (loss falls by about 16% for each 10x increase in params, since 10^(-0.076) ≈ 0.84, or about 5% per doubling)

**What this means practically:**

- Going from 1B to 10B parameters reduces loss by a predictable, consistent amount
- Going from 10B to 100B reduces loss by the same relative amount
- Over the ranges studied, the power law shows no sudden plateau or wall

**Similar power laws hold for data and compute:**

- L(D) = (D_c / D)^α_D: loss vs dataset size
- L(C) = (C_c / C)^α_C: loss vs compute

This predictability is what makes scaling laws so powerful for planning. Labs can train small models (cheap), fit the power law, and extrapolate to predict the performance of much larger models (expensive) before committing resources.

However, the relationship between loss and downstream task performance is less predictable: a small decrease in loss can sometimes unlock entirely new capabilities.
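A quick numeric check of the formula, using the Kaplan et al. constants quoted above. This is a sketch of the pure power law only: it ignores any irreducible loss term and the data and compute limits, and `kaplan_loss` is a hypothetical helper name.

```python
N_C = 8.8e13     # N_c from Kaplan et al., as quoted in this lesson
ALPHA_N = 0.076  # alpha_N from Kaplan et al., as quoted in this lesson

def kaplan_loss(n_params):
    """Predicted cross-entropy loss L(N) = (N_c / N)**alpha_N."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e9, 1e10, 1e11):
    print(f"N = {n:.0e}: predicted loss {kaplan_loss(n):.3f}")

# Each 10x step multiplies the loss by the same factor, 10**(-alpha_N).
ratio_1b_to_10b = kaplan_loss(1e10) / kaplan_loss(1e9)
ratio_10b_to_100b = kaplan_loss(1e11) / kaplan_loss(1e10)
print(f"loss ratio per 10x params: {ratio_1b_to_10b:.3f}")
```

The ratio per 10x step is constant at about 0.84, which is why the curve is a straight line on a log-log plot.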

Step 5: Emergent Abilities

Perhaps the most fascinating aspect of scaling is **emergence**: capabilities that appear suddenly at certain model sizes and are essentially absent below that threshold.

**What emergence looks like:** On certain benchmarks, smaller models perform at roughly random chance. Then, at some critical model size, performance jumps sharply. The model goes from "cannot do this at all" to "does this competently" over a relatively narrow range of scale.

**Examples of emergent abilities:**

- **Few-shot arithmetic:** Models below ~10B can't reliably do 3-digit addition from a few examples. Above ~100B, they can.
- **Chain-of-thought reasoning:** Small models don't benefit from "Let's think step by step." Large models (100B+) show dramatic improvements.
- **Word unscrambling:** Rearranging jumbled letters appears at ~60B+ parameters.
- **Multi-step reasoning:** Combining multiple logical steps emerges at very large scale.

**Is emergence real?** Recent research (Schaeffer et al., 2023) argues that emergence may partly be an artifact of how we measure performance. When using metrics like accuracy (binary right/wrong), performance appears to "jump." When using continuous metrics like log-probability, the improvement is more gradual. The debate continues.

**Either way, the practical effect is real:** larger models can do things smaller models simply cannot, and it's often hard to predict which capabilities will appear at which scale.
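The measurement argument can be illustrated with a toy calculation. Suppose a task requires getting every token of a multi-token answer right, and per-token accuracy improves smoothly with scale. Exact-match accuracy is then roughly the per-token accuracy raised to the answer length, which stays near zero for a long time and then climbs steeply. The sketch below assumes independent token errors and a hypothetical 10-token answer; it is an illustration of the idea, not a reproduction of any paper's experiment.

```python
def exact_match_accuracy(per_token_accuracy, answer_length):
    """Probability that all tokens are correct, assuming independent errors."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Per-token accuracy rises smoothly; exact-match accuracy appears to "jump".
results = {}
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    results[p] = exact_match_accuracy(p, ANSWER_LENGTH)
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {results[p]:.3f}")
```

The same underlying improvement looks gradual under the continuous metric and abrupt under the all-or-nothing metric.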

Step 6: When to Scale vs When to Optimize

Scaling isn't always the answer. Smart engineering, better data, and architectural improvements can be more cost-effective than simply making the model bigger.

**Scale when:**

- You've exhausted easy optimization wins and still need better performance
- Your scaling curve predicts the improvement justifies the cost
- You need broad capabilities across many tasks (general-purpose models)
- You have the budget and infrastructure to support larger models

**Optimize instead when:**

- You need performance on a **specific** task (fine-tuning a smaller model often wins)
- Your deployment budget is limited (smaller models are cheaper to run)
- Better data is available (cleaning data can beat scaling 10x)
- Inference cost matters more than training cost (smaller models serve faster)

**The efficiency frontier is moving.** Techniques that shift it include:

- **Distillation:** Train a small "student" model to mimic a large "teacher"
- **Quantization:** Reduce precision (FP16 → INT4) to shrink models 4x
- **LoRA / QLoRA:** Fine-tune only a tiny fraction of parameters
- **Mixture of Experts:** Only activate a subset of parameters per token
- **Better data curation:** Llama 3's 15T carefully curated tokens > raw 15T tokens

The lesson of scaling laws is not "always go bigger." It's "understand the tradeoffs quantitatively so you make informed decisions."
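The 4x figure for quantization follows from simple arithmetic on bits per parameter. The sketch below estimates weight storage only, assuming a hypothetical 70B-parameter dense model; it ignores activations, the KV cache, and the small overhead that real quantization schemes add for scales and zero points. `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 70e9  # hypothetical 70B-parameter dense model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"INT4 weights: {int4_gb:.0f} GB")
print(f"reduction: {fp16_gb / int4_gb:.0f}x")
```

Estimates like this are often the first step in deciding whether a quantized smaller model fits a deployment budget that a larger model would exceed.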

Step 7: Test Your Understanding

You've learned about scaling laws — the predictable relationship between model size, data, compute, and performance. Let's check your understanding!

Prerequisites

  • Pre-training (recommended)
  • Basic understanding of logarithmic scales

Key Concepts

  • Scaling Hypothesis
  • Power Laws
  • Chinchilla Scaling Laws
  • Compute-Optimal Training
  • Parameters vs Data vs Compute
  • Emergent Abilities
  • Efficiency vs Scale Tradeoffs