Scaling Laws
Bigger is Better?
The Scaling Hypothesis
The scaling hypothesis is one of the most important ideas in modern AI: performance improves predictably as you increase model size, data, and compute.
This wasn't obvious. Before the scaling era, the common belief was that architectural innovations (better attention mechanisms, clever training tricks) were the primary driver of progress. The scaling hypothesis flipped this: given a good-enough architecture (the transformer), simply making it bigger yields consistent, predictable improvements.
The key observations:
- Loss follows a power law with respect to model size, data size, and compute
- These power laws hold over many orders of magnitude (10M to 100B+ parameters)
- The improvements are smooth and predictable, with no sudden breakthroughs or plateaus
- This means you can predict how well a larger model will perform before training it
L(N) = (N_c / N)^α, the loss as a function of parameter count N, where N_c and α are empirically fitted constants
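To make the formula concrete, here is a minimal sketch of evaluating it in Python. The constants `N_C` and `ALPHA` are illustrative placeholders (the text does not give values); the point is only that each 10x increase in N lowers the predicted loss by a fixed fraction.

```python
# Minimal sketch: evaluating the power law L(N) = (N_c / N)**alpha.
# N_C and ALPHA are illustrative placeholders, not values from this text.

N_C = 8.8e13   # placeholder "critical" parameter count
ALPHA = 0.076  # placeholder power-law exponent

def predicted_loss(n_params: float, n_c: float = N_C, alpha: float = ALPHA) -> float:
    """Loss predicted by the power law for a model with n_params parameters."""
    return (n_c / n_params) ** alpha

for n in (1e7, 1e8, 1e9, 1e10, 1e11):
    print(f"N = {n:.0e}  ->  predicted loss {predicted_loss(n):.3f}")
```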
This predictability is remarkable. In most engineering fields, scaling doesn't work so cleanly — you hit diminishing returns, new failure modes, or fundamental bottlenecks. For LLMs, the loss just keeps going down on a smooth curve.
The practical implication: labs can run small-scale experiments, fit the scaling curve, and extrapolate to determine whether a much larger (and much more expensive) training run is worth the investment.
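One way to picture that workflow: because the power law is a straight line in log-log space, a least-squares fit over a handful of cheap runs is enough to read off α and N_c and project the curve outward. The sketch below does this with NumPy on made-up data; the run results and the target model size are assumptions for illustration, not measurements from the text.

```python
import numpy as np

# Sketch of the extrapolation workflow: fit L(N) = (N_c / N)**alpha to a few
# small training runs, then predict the loss of a much larger run.
# The data points below are hypothetical.

# (parameter count, final loss) from hypothetical small-scale runs
runs = np.array([
    (1e7, 4.10),
    (3e7, 3.85),
    (1e8, 3.61),
    (3e8, 3.38),
])

log_n = np.log10(runs[:, 0])
log_loss = np.log10(runs[:, 1])

# In log-log space the power law is a straight line:
# log10 L = alpha * log10 N_c - alpha * log10 N
slope, intercept = np.polyfit(log_n, log_loss, 1)
alpha = -slope
n_c = 10 ** (intercept / alpha)

target_n = 1e11  # a proposed, much larger model
predicted = (n_c / target_n) ** alpha
print(f"fitted alpha = {alpha:.3f}, N_c = {n_c:.2e}")
print(f"predicted loss at N = {target_n:.0e}: {predicted:.2f}")
```

The extrapolated number is what goes into the go/no-go decision: if the projected loss at the target size does not justify the training budget, the large run never gets launched.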
[Figure: loss vs. model size, log scale]
Rules of Thumb for Scaling
| Scale Factor | What Changes | Observed Effect |
|---|---|---|
| 10x parameters | Model capacity (width, depth) | Loss decreases by ~0.3-0.5 |
| 10x training data | Information available to learn | Loss decreases by ~0.2-0.4 |
| 10x compute | Total FLOPs (params × data) | Loss decreases by ~0.3-0.5 |
| 100x compute | Major scale-up | Qualitative new capabilities may emerge |
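Read literally, the table invites a quick back-of-the-envelope projection: subtract one decrement per decade of scale. The sketch below does exactly that with mid-range values; the baseline loss and the per-decade decrements are assumptions drawn from the rough ranges above, and the true relationship is the power law, so treat this only as a rule-of-thumb calculator.

```python
# Back-of-the-envelope projection from the rules-of-thumb table.
# Decrements are midpoints of the rough ranges above, not measured values.

DECREMENT_PER_10X = {
    "parameters": 0.4,     # midpoint of ~0.3-0.5
    "training data": 0.3,  # midpoint of ~0.2-0.4
    "compute": 0.4,        # midpoint of ~0.3-0.5
}

def project_loss(baseline_loss: float, axis: str, decades: int) -> float:
    """Projected loss after scaling one axis by a factor of 10**decades."""
    return baseline_loss - DECREMENT_PER_10X[axis] * decades

baseline = 3.5  # hypothetical starting loss
for axis in DECREMENT_PER_10X:
    print(f"100x {axis}: {baseline:.1f} -> {project_loss(baseline, axis, 2):.1f}")
```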