Residual Connections & Layer Norm

Enabling Deep Transformer Stacks

Difficulty: Intermediate | Duration: 10-12 min | Prerequisites: Feed-Forward Networks | Step 1 of 7

The Depth Problem: Why Deep Networks Are Hard to Train

Modern Transformers are deep -- BERT Base has 12 layers, GPT-3 has 96. But stacking many layers creates serious training challenges:

1. Vanishing Gradients: During backpropagation, gradients are multiplied through each layer. With 96 layers, gradients can shrink exponentially, becoming effectively zero. Early layers stop learning.

2. Exploding Gradients: Conversely, gradients can grow exponentially, causing numerical overflow and unstable training. (Both effects are illustrated in the sketch after this list.)

3. Degradation Problem: Surprisingly, adding more layers can make accuracy worse, even on training data. A 56-layer CNN performs worse than a 20-layer one -- not because of overfitting, but because deeper networks are harder to optimize.
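
To make the gradient problems concrete, here is a tiny plain-Python sketch of how a per-layer gradient factor compounds with depth. The factors 0.9 and 1.1 are illustrative assumptions, not values measured from any real network.

```python
# Sketch: backprop multiplies one Jacobian-like factor per layer,
# so a factor even slightly below or above 1 compounds exponentially.

def compound(factor, layers):
    grad = 1.0
    for _ in range(layers):
        grad *= factor
    return grad

print(compound(0.9, 96))   # ~4e-05: vanishing -- early layers barely learn
print(compound(1.1, 96))   # ~9e+03: exploding -- updates blow up
```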

These problems plagued deep learning until 2015, when ResNets introduced residual connections (also called skip connections). The Transformer adopted this idea and combined it with layer normalization, making networks with 96+ layers trainable.
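
As a rough sketch of how the two techniques fit together, the snippet below wraps an arbitrary sublayer in the "Add & Norm" pattern from the original Transformer, LayerNorm(x + Sublayer(x)). The name ResidualSublayer and the dimensions are illustrative choices, not a reference implementation; GPT-style models typically apply the norm before the sublayer instead (pre-LN).

```python
import torch
import torch.nn as nn

class ResidualSublayer(nn.Module):
    """Add & Norm: the input skips around the sublayer, then the
    sum is normalized -- LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # The identity path (x + ...) gives gradients a shortcut
        # around the sublayer, so they survive deep stacks.
        return self.norm(x + self.sublayer(x))

d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
block = ResidualSublayer(d_model, ffn)
print(block(torch.randn(2, 10, d_model)).shape)  # torch.Size([2, 10, 512])
```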

Without these two techniques, the Transformer architecture simply would not work at scale.

Training Challenges in Deep Networks

| Problem | What Happens | Consequence | Solution |
| --- | --- | --- | --- |
| Vanishing gradients | Gradients shrink through many layers | Early layers stop learning | Residual connections create shortcut paths |
| Exploding gradients | Gradients grow through many layers | Training diverges (NaN losses) | Layer normalization stabilizes values |
| Degradation | Deeper != better, even on training data | Adding layers hurts performance | Residuals let layers learn "corrections" |
| Unstable activations | Values drift across layers | Different layers operate at different scales | Layer norm ensures consistent scale |
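
To show what "consistent scale" means in practice, here is a minimal from-scratch layer norm in PyTorch (torch.nn.LayerNorm additionally learns a per-feature gain and bias, omitted here as a simplification): each token's feature vector is rescaled to zero mean and unit variance, no matter how far the activations have drifted.

```python
import torch

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (last axis) per token.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(2, 10, 512) * 50 + 7  # activations that have drifted in scale
y = layer_norm(x)
print(y.mean(dim=-1)[0, 0].item())  # ~0.0
print(y.std(dim=-1)[0, 0].item())   # ~1.0
```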

Depth of Modern Transformers

| Model | Layers | Total Sublayers (2 per layer) | Trainable Without Residuals? |
| --- | --- | --- | --- |
| Our example | 1 | 2 | Yes (trivially) |
| BERT Base | 12 | 24 | No -- gradients vanish by layer 1 |
| GPT-2 | 12 | 24 | No -- training would diverge |
| GPT-3 | 96 | 192 | Absolutely not -- impossible without residuals |
| GPT-4 (estimated) | ~120 | ~240 | Requires residuals + normalization + careful init |
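
As a closing sketch, the toy experiment below compares the gradient reaching the very first layer of a 96-layer stack with and without skip connections. The tanh/linear stack is a stand-in assumption, not a real Transformer, but it reproduces the vanishing effect the table describes.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, residual):
    torch.manual_seed(0)  # same init for both runs
    layers = nn.ModuleList([nn.Linear(256, 256) for _ in range(depth)])
    h = torch.randn(8, 256)
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if residual else out  # skip connection toggled here
    h.sum().backward()
    return layers[0].weight.grad.norm().item()

print(first_layer_grad_norm(96, residual=False))  # vanishingly small
print(first_layer_grad_norm(96, residual=True))   # healthy, usable gradient
```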