Modern Transformers are deep -- BERT Base has 12 layers, GPT-3 has 96. But stacking this many layers creates serious training challenges:

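1. Vanishing Gradients: During backpropagation, the gradient is multiplied by a factor at every layer it passes through. If those factors are typically smaller than 1, the gradient shrinks exponentially with depth; across 96 layers it can become effectively zero, and the early layers stop learning.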

2. Exploding Gradients: Conversely, gradients can grow exponentially, causing numerical overflow and unstable training.

3. Degradation Problem: Surprisingly, adding more layers can make accuracy worse, even on training data. In the original ResNet experiments, a 56-layer plain CNN had higher training error than a 20-layer one -- not because of overfitting, but because deeper networks are harder to optimize.
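The exponential behaviour described in items 1 and 2 can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that every layer scales the gradient by the same constant factor; in a real network the factors differ per layer and depend on the input. The name `gradient_scale` is hypothetical.

```python
def gradient_scale(per_layer_factor: float, num_layers: int) -> float:
    """Gradient magnitude after passing backward through num_layers layers,
    assuming (hypothetically) each layer multiplies it by the same factor."""
    return per_layer_factor ** num_layers


# Compare a shallow and a deep stack for shrinking, neutral, and growing factors.
for factor in (0.5, 1.0, 1.5):
    for depth in (12, 96):
        scale = gradient_scale(factor, depth)
        print(f"factor={factor:<4} depth={depth:<3} scale={scale:.3e}")
```

Only a factor of exactly 1 leaves the gradient unchanged; anything below it collapses toward zero at depth 96, and anything above it blows up.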

These problems plagued deep learning until 2015, when ResNets introduced residual connections (also called skip connections). The Transformer adopted this idea, combined with layer normalization, to make networks with 96+ layers trainable.

Without these two techniques, the Transformer architecture simply would not work at scale.
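As a rough sketch of how the two techniques combine, the snippet below wraps a sublayer in a residual connection followed by layer normalization, i.e. the post-norm arrangement LayerNorm(x + Sublayer(x)) of the original Transformer. It uses plain Python lists and omits the learnable scale and shift of a real layer norm. The names `layer_norm`, `residual_block`, and `toy_sublayer` are illustrative, not a library API.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize one vector to zero mean and (approximately) unit variance.
    The learnable gain and bias of a real layer norm are omitted."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def toy_sublayer(x: Vector) -> Vector:
    # Hypothetical stand-in for attention or a feed-forward network.
    return [math.tanh(2.0 * v) for v in x]


x = [0.5, -1.0, 3.0, 10.0]
for _ in range(96):  # stack 96 blocks
    x = residual_block(x, toy_sublayer)

mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)
print(f"mean={mean:.6f} var={var:.6f}")
```

Even after 96 stacked blocks the activations keep a mean near 0 and a variance near 1, which is the "consistent scale" that layer normalization is meant to provide. GPT-2-style models instead apply the normalization before the sublayer, x + Sublayer(LayerNorm(x)).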

| Problem | What Happens | Consequence | Solution |
| --- | --- | --- | --- |
| Vanishing gradients | Gradients shrink through many layers | Early layers stop learning | Residual connections create shortcut paths |
| Exploding gradients | Gradients grow through many layers | Training diverges (NaN losses) | Layer normalization stabilizes values |
| Degradation | Deeper != better, even on training data | Adding layers hurts performance | Residuals let layers learn "corrections" |
| Unstable activations | Values drift across layers | Different layers operate at different scales | Layer norm ensures consistent scale |
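To make the "shortcut paths" idea concrete, here is a toy comparison of the end-to-end gradient through a stack of scalar layers, with and without a skip connection. It is a sketch under strong assumptions: each layer is the scalar function tanh(w * x) with one shared, positive weight `w`, and `end_to_end_gradient` is a hypothetical helper that applies the chain rule by hand. The key point is that a residual layer x + f(x) has derivative 1 + f'(x), so the shortcut always contributes a factor of 1.

```python
import math


def end_to_end_gradient(x: float, w: float, num_layers: int, residual: bool) -> float:
    """d(output)/d(input) for a stack of scalar layers, via the chain rule.

    Plain layer:    x -> tanh(w * x)
    Residual layer: x -> x + tanh(w * x)
    """
    grad = 1.0
    for _ in range(num_layers):
        branch = math.tanh(w * x)
        branch_grad = w * (1.0 - branch ** 2)  # derivative of tanh(w * x) w.r.t. x
        if residual:
            grad *= 1.0 + branch_grad  # the "1" comes from the shortcut path
            x = x + branch
        else:
            grad *= branch_grad
            x = branch
    return grad


plain = end_to_end_gradient(x=0.5, w=0.5, num_layers=96, residual=False)
skip = end_to_end_gradient(x=0.5, w=0.5, num_layers=96, residual=True)
print(f"plain: {plain:.3e}  residual: {skip:.3e}")
```

In this toy setting every plain-layer factor is at most 0.5, so the plain gradient collapses, while every residual factor is at least 1, so the gradient survives all 96 layers. With a negative or much larger `w` the picture changes, which is one reason normalization and careful initialization are still needed.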

| Model | Layers | Total Depth | Trainable Without Residuals? |
| --- | --- | --- | --- |
| Our example | 1 | 2 sublayers | Yes (trivially) |
| BERT Base | 12 | 24 sublayers | No -- gradients vanish by layer 1 |
| GPT-2 | 12 | 24 sublayers | No -- training would diverge |
| GPT-3 | 96 | 192 sublayers | Absolutely not -- impossible without residuals |
| GPT-4 (estimated) | ~120 | ~240 sublayers | Requires residuals + normalization + careful init |
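The "Total Depth" figures follow from simple arithmetic, assuming each Transformer layer contains two sublayers (one attention sublayer and one feed-forward sublayer), as in encoder-only and decoder-only models. A minimal sketch, using only the layer counts stated in the text; `SUBLAYERS_PER_LAYER` and `total_depth` are illustrative names:

```python
SUBLAYERS_PER_LAYER = 2  # assumed: one attention sublayer + one feed-forward sublayer

# Layer counts as given in the text above.
layer_counts = {
    "BERT Base": 12,
    "GPT-3": 96,
}


def total_depth(num_layers: int) -> int:
    """Number of sublayers a signal (or gradient) must pass through."""
    return num_layers * SUBLAYERS_PER_LAYER


for model, layers in layer_counts.items():
    print(f"{model}: {layers} layers -> {total_depth(layers)} sublayers")
```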
