Residual Connections & Layer Norm

Enabling Deep Transformer Stacks

Difficulty: Intermediate
Duration: 10-12 min
Prerequisites: Feed-Forward Networks

The Depth Problem: Why Deep Networks Are Hard to Train

Modern Transformers are deep -- BERT Base has 12 layers, GPT-3 has 96 layers. But stacking many layers creates serious training challenges:

1. Vanishing Gradients: During backpropagation, gradients are multiplied through each layer. With 96 layers, gradients can shrink exponentially, becoming effectively zero. Early layers stop learning.

2. Exploding Gradients: Conversely, gradients can grow exponentially, causing numerical overflow and unstable training.

3. Degradation Problem: Surprisingly, adding more layers can make accuracy worse, even on training data. In the experiments reported in the ResNet paper, a plain 56-layer CNN had higher training error than a 20-layer one -- not because of overfitting, but because deeper networks are harder to optimize.

These problems plagued deep learning until 2015, when ResNets introduced residual connections (also called skip connections). The Transformer adopted this idea, combined with layer normalization, to make networks with 96+ layers trainable.

Without these two techniques, the Transformer architecture simply would not work at scale.

Training Challenges in Deep Networks

| Problem | What Happens | Consequence | Solution |
|---|---|---|---|
| Vanishing gradients | Gradients shrink through many layers | Early layers stop learning | Residual connections create shortcut paths |
| Exploding gradients | Gradients grow through many layers | Training diverges (NaN losses) | Layer normalization stabilizes values |
| Degradation | Deeper != better, even on training data | Adding layers hurts performance | Residuals let layers learn "corrections" |
| Unstable activations | Values drift across layers | Different layers operate at different scales | Layer norm ensures consistent scale |

Depth of Modern Transformers

| Model | Layers | Total Depth | Trainable Without Residuals? |
|---|---|---|---|
| Our example | 1 | 2 sublayers | Yes (trivially) |
| BERT Base | 12 | 24 sublayers | Unlikely -- gradients tend to vanish before they reach the earliest layers |
| GPT-2 (small) | 12 | 24 sublayers | Unlikely -- training would tend to diverge |
| GPT-3 | 96 | 192 sublayers | Not in practice |
| GPT-4 (estimated; architecture not publicly disclosed) | ~120 | ~240 sublayers | Requires residuals + normalization + careful init |

Residual Connections & Layer Norm — Lesson Content

Learn how residual connections and layer normalization make deep Transformers trainable -- from BERT (12 layers) to GPT-3 (96 layers).

Deep networks are notoriously hard to train: gradients vanish or explode, and adding layers can actually hurt performance. Two techniques address these problems in Transformers. Residual connections (output = input + sublayer(input)) create gradient highways that let learning signals reach early layers even in 96-layer models. Layer normalization rescales each token vector at every step, which keeps activations from drifting. Together, they form the backbone of every modern Transformer block. You'll see exactly how they work, why they matter, and how they enable the scaling from BERT's 12 layers to GPT-3's 96.

Learning Objectives

  • Explain the depth problem (vanishing gradients, degradation)
  • Describe how residual connections preserve information and enable gradient flow
  • Apply layer normalization to a token vector
  • Assemble a complete Transformer block with residuals and normalization
  • Understand how block stacking enables progressively deeper representations

Step 1: The Depth Problem: Why Deep Networks Are Hard to Train

Modern Transformers are **deep** -- BERT Base has 12 layers, GPT-3 has 96 layers. But stacking many layers creates serious training challenges:

**1. Vanishing Gradients:** During backpropagation, gradients are multiplied through each layer. With 96 layers, gradients can shrink exponentially, becoming effectively zero. Early layers stop learning.

**2. Exploding Gradients:** Conversely, gradients can grow exponentially, causing numerical overflow and unstable training.

**3. Degradation Problem:** Surprisingly, adding more layers can make accuracy **worse**, even on training data. In the experiments reported in the ResNet paper, a plain 56-layer CNN had higher training error than a 20-layer one -- not because of overfitting, but because deeper networks are harder to optimize.

These problems plagued deep learning until 2015, when ResNets introduced **residual connections** (also called skip connections). The Transformer adopted this idea, combined with **layer normalization**, to make networks with 96+ layers trainable.

Without these two techniques, the Transformer architecture simply would not work at scale.
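As a rough illustration of the exponential effect described above, the sketch below multiplies one constant gradient factor across a stack of layers. The factors 0.9 and 1.1 and the helper name `gradient_scale` are assumptions chosen for illustration; real networks have a different factor at every layer.

```python
def gradient_scale(per_layer_factor: float, num_layers: int) -> float:
    """Gradient magnitude after passing back through num_layers layers,
    assuming every layer scales it by the same factor (a simplification)."""
    return per_layer_factor ** num_layers


for depth in (12, 96):
    shrink = gradient_scale(0.9, depth)  # factor slightly below 1: vanishing
    grow = gradient_scale(1.1, depth)    # factor slightly above 1: exploding
    print(f"{depth:>2} layers: 0.9^n = {shrink:.2e}, 1.1^n = {grow:.2e}")
```

Even a factor close to 1 compounds quickly: at 96 layers the shrinking case is several orders of magnitude below 1, and the growing case is several orders of magnitude above it.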

Step 2: Residual Connections: output = input + sublayer(input)

A residual connection (skip connection) is remarkably simple:

**output = input + sublayer(input)**

Instead of replacing the input with the sublayer's output, we **add** them together. The sublayer only needs to learn the **residual** -- the difference between the desired output and the input. This seemingly small change has profound effects:

**Identity shortcut:** If the sublayer outputs all zeros, the block simply passes the input through unchanged. A deeper network can therefore always represent whatever a shallower one computes, because the extra blocks can fall back to doing nothing. In principle, added depth need not hurt.

**Easier optimization:** Learning a small correction (residual) is easier than learning the entire output from scratch. The sublayer can focus on refinements rather than reconstructing the full representation.

In a Transformer, residual connections wrap **both** sublayers:

1. **x = x + MultiHeadAttention(x)**
2. **x = x + FeedForward(x)**

Each sublayer adds its contribution to the running representation, like annotating a document rather than rewriting it.
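The identity-shortcut behaviour can be sketched in a few lines of plain Python. The names `residual`, `zero_sublayer`, and `small_correction` are hypothetical helpers introduced here for illustration; a real sublayer would be attention or a feed-forward network.

```python
from typing import Callable, List

Vector = List[float]


def residual(sublayer: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return x + sublayer(x), added element by element."""
    update = sublayer(x)
    return [xi + ui for xi, ui in zip(x, update)]


def zero_sublayer(x: Vector) -> Vector:
    """Hypothetical sublayer that has learned nothing: it outputs all zeros."""
    return [0.0 for _ in x]


def small_correction(x: Vector) -> Vector:
    """Hypothetical sublayer that nudges each value by 10% of itself."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 0.5]
print(residual(zero_sublayer, x))     # identity: the input passes through unchanged
print(residual(small_correction, x))  # the input plus a small correction
```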

Step 3: Why Residuals Help Gradient Flow

The mathematical reason residual connections enable deep networks is elegant. Consider backpropagation:

**Without residuals:** output = f(x)

- Gradient: d(output)/dx = f'(x)
- Through N layers: gradient = f'_1 * f'_2 * ... * f'_N, with each derivative evaluated at that layer's input
- If each |f'_k| < 1, the product vanishes exponentially

**With residuals:** output = x + f(x)

- Gradient: d(output)/dx = 1 + f'(x)
- Through N layers: gradient = (1 + f'_1) * (1 + f'_2) * ... * (1 + f'_N); when expanded, this product contains a term equal to 1 that comes from the **identity** path
- Even if all f'_k = 0, the gradient is still 1 through the shortcut!

The "+1" from the identity creates a **gradient highway** that bypasses all the sublayers. Gradients can flow directly from the loss back to early layers without being multiplied through dozens of transformations.

This is why GPT-3 with 96 layers can still train effectively -- gradients reach the first layer through the chain of residual connections, providing useful learning signals even to the earliest parameters.

Without residuals (N layers, x_k = f_k(x_{k-1}), x_0 = x):
  dL/dx = dL/dx_N * df_N/dx_{N-1} * ... * df_1/dx_0
  = product of N derivatives (can vanish or explode)

With residuals (N layers, x_k = x_{k-1} + f_k(x_{k-1}), x_0 = x):
  dL/dx = dL/dx_N * (1 + df_N/dx_{N-1}) * ... * (1 + df_1/dx_0)
  = dL/dx_N * (1 + terms involving the df_k), so the identity path always contributes dL/dx_N
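The two products above can be compared numerically. This is a minimal sketch for scalar layers, assuming every sublayer has the same small local derivative (0.01, an illustrative value); `grad_plain` and `grad_residual` are hypothetical helper names.

```python
import math
from typing import Sequence


def grad_plain(local_derivs: Sequence[float]) -> float:
    """d(x_N)/d(x_0) when x_k = f_k(x_{k-1}): the product of local derivatives."""
    return math.prod(local_derivs)


def grad_residual(local_derivs: Sequence[float]) -> float:
    """d(x_N)/d(x_0) when x_k = x_{k-1} + f_k(x_{k-1}): each factor gains a +1."""
    return math.prod(1.0 + d for d in local_derivs)


derivs = [0.01] * 96  # assumed: each sublayer contributes a small derivative
print(f"plain stack:    {grad_plain(derivs):.2e}")     # vanishingly small
print(f"residual stack: {grad_residual(derivs):.2f}")  # stays on the order of 1

dead = [0.0] * 96  # every sublayer derivative is exactly zero
print(grad_plain(dead), grad_residual(dead))  # 0.0 without residuals, 1.0 with
```

With larger local derivatives the residual product can itself grow, which is one reason normalization is used alongside residual connections.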

Step 4: Layer Normalization: Normalize Each Vector

Even with residual connections, values can drift as they accumulate through many layers. **Layer normalization** keeps them in a stable range.

For each token's vector independently:

1. Compute the **mean** across all dimensions
2. Compute the **variance** across all dimensions
3. **Normalize** to zero mean and unit variance
4. Apply learned **scale** (gamma) and **shift** (beta) parameters

**LayerNorm(x) = gamma * (x - mean) / sqrt(variance + epsilon) + beta**

Unlike **batch normalization** (which normalizes each feature across the examples in a batch), layer norm normalizes across the **feature dimension** of each individual token. This means:

- It works with any batch size (even batch size 1 during inference)
- Each token is normalized independently
- It's well-suited for variable-length sequences

The learned gamma and beta parameters let the model rescale and shift each dimension after normalization, so it is not forced to keep every feature at zero mean and unit variance.

LayerNorm(x) = gamma * (x - mu) / sqrt(sigma^2 + eps) + beta

Where (for token vector x of dimension d):
  mu    = (1/d) * sum(x_i)           (mean across dimensions)
  sigma^2 = (1/d) * sum((x_i - mu)^2) (variance across dimensions)
  eps   = 1e-5                         (numerical stability)
  gamma = learned scale (d params)
  beta  = learned shift (d params)
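The formula can be applied to a single token vector as follows. This is a plain-Python sketch rather than a library implementation; gamma defaults to all ones and beta to all zeros, standing in for parameters that would normally be learned.

```python
import math
from typing import List, Optional


def layer_norm(x: List[float],
               gamma: Optional[List[float]] = None,
               beta: Optional[List[float]] = None,
               eps: float = 1e-5) -> List[float]:
    """Normalize one token vector across its d dimensions, then scale and shift."""
    d = len(x)
    gamma = [1.0] * d if gamma is None else gamma  # untrained scale
    beta = [0.0] * d if beta is None else beta     # untrained shift
    mu = sum(x) / d                                # mean across dimensions
    var = sum((xi - mu) ** 2 for xi in x) / d      # variance across dimensions
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mu) * inv_std + b for xi, g, b in zip(x, gamma, beta)]


token = [2.0, 4.0, 6.0, 8.0]  # mean 5, variance 5
normed = layer_norm(token)
print([round(v, 3) for v in normed])  # [-1.342, -0.447, 0.447, 1.342]
```

Passing `gamma=[2.0] * 4` and `beta=[1.0] * 4` would double each normalized value and add 1, which is how the learned parameters move the output away from zero mean and unit variance.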

Step 5: The Complete Transformer Block

Now we can assemble the complete Transformer block. There are two common arrangements:

**Post-norm (original Transformer, 2017):**

1. x = x + MultiHeadAttention(x)
2. x = LayerNorm(x)
3. x = x + FeedForward(x)
4. x = LayerNorm(x)

**Pre-norm (most modern models):**

1. x = x + MultiHeadAttention(LayerNorm(x))
2. x = x + FeedForward(LayerNorm(x))

Pre-norm is now preferred because placing normalization **before** each sublayer creates cleaner gradient paths and makes training more stable, especially for very deep models.

Each block takes a (seq_len x d_model) matrix and outputs a matrix of the **same shape**. This is what makes Transformer blocks **composable** -- you can stack as many as you want.

The input flows through two sublayers, each wrapped with a residual connection, each accompanied by layer normalization. The attention sublayer gathers context; the FFN sublayer transforms each token using that context.

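# Pre-norm Transformer block (modern standard)
# x is an array of shape (seq_len, d_model) that supports element-wise "+".
# The four callables are supplied by the caller; each sublayer has its own norm.
def transformer_block(x, attention, feed_forward, norm_1, norm_2):
    # Sublayer 1: Multi-Head Attention (self-attention: query = key = value)
    normed = norm_1(x)
    attn_out = attention(normed, normed, normed)
    x = x + attn_out  # residual

    # Sublayer 2: Feed-Forward Network
    normed = norm_2(x)
    ffn_out = feed_forward(normed)
    x = x + ffn_out   # residual

    return x  # same shape as input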
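To compare the two arrangements side by side, here is a self-contained sketch in plain Python. `toy_attention` and `toy_ffn` are hypothetical stand-ins (token averaging and a ReLU), not real attention or a real feed-forward network, and the layer norm omits the learned gamma and beta. The point is only the order of residual addition and normalization.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]  # seq_len rows, d_model columns
Sublayer = Callable[[Matrix], Matrix]


def layer_norm(x: Matrix, eps: float = 1e-5) -> Matrix:
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mu = sum(row) / len(row)
        var = sum((v - mu) ** 2 for v in row) / len(row)
        out.append([(v - mu) / math.sqrt(var + eps) for v in row])
    return out


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices (the residual addition)."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def toy_attention(x: Matrix) -> Matrix:
    """Stand-in for attention: every token receives the average of all tokens."""
    avg = [sum(col) / len(x) for col in zip(*x)]
    return [list(avg) for _ in x]


def toy_ffn(x: Matrix) -> Matrix:
    """Stand-in for the feed-forward network: ReLU applied to each element."""
    return [[max(0.0, v) for v in row] for row in x]


def pre_norm_block(x: Matrix, attn: Sublayer = toy_attention,
                   ffn: Sublayer = toy_ffn) -> Matrix:
    x = add(x, attn(layer_norm(x)))  # normalize, apply sublayer, add residual
    x = add(x, ffn(layer_norm(x)))
    return x


def post_norm_block(x: Matrix, attn: Sublayer = toy_attention,
                    ffn: Sublayer = toy_ffn) -> Matrix:
    x = layer_norm(add(x, attn(x)))  # apply sublayer, add residual, normalize
    x = layer_norm(add(x, ffn(x)))
    return x


x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
for block in (pre_norm_block, post_norm_block):
    y = block(x)
    print(block.__name__, "output shape:", len(y), "x", len(y[0]))
```

In the pre-norm version the residual stream `x` is never normalized itself, so the identity path from input to output stays untouched. In the post-norm version every block ends with a normalization applied to the stream.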

Step 6: Stacking Blocks: BERT=12, GPT-3=96

The magic of the Transformer is that these blocks are **stackable**. Each block takes (seq_len x d_model) in and produces (seq_len x d_model) out, so you can chain them:

**Block_1 -> Block_2 -> ... -> Block_N**

As data flows through more blocks, representations become progressively more refined. In a 12-layer model, the rough pattern is:

- **Early layers** (1-4): Learn basic syntax and word patterns
- **Middle layers** (5-8): Learn semantic relationships and phrase structure
- **Late layers** (9-12+): Learn task-specific abstractions and complex reasoning

Research (known as "probing" studies) supports this layered picture, although the boundaries are gradual rather than sharp. For BERT, roughly:

- Around layer 2, representations are sufficient to identify parts of speech
- Around layer 6, they support parsing syntactic trees
- Around layer 10, they support resolving coreference ("he" refers to "the cat")

Each additional block refines the representations further. Residual connections ensure that information from early layers is never completely lost -- it flows through the skip connections all the way to the output.

The choice of depth is a key architectural decision: more layers = more capacity but more expensive to train and run.
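Because every block maps a (seq_len x d_model) matrix to another of the same shape, stacking is just a loop. The sketch below uses a hypothetical toy block (`x + scale * tanh(x)`) in place of a real Transformer block, purely to show that shape is preserved at any depth.

```python
import math
from typing import Callable, List, Tuple

Matrix = List[List[float]]  # seq_len rows, d_model columns
Block = Callable[[Matrix], Matrix]


def make_toy_block(scale: float) -> Block:
    """Hypothetical stand-in for a Transformer block: x + scale * tanh(x)."""
    def block(x: Matrix) -> Matrix:
        return [[v + scale * math.tanh(v) for v in row] for row in x]
    return block


def run_stack(x: Matrix, blocks: List[Block]) -> Matrix:
    """Feed x through Block_1 -> Block_2 -> ... -> Block_N."""
    for block in blocks:
        x = block(x)
    return x


def shape(x: Matrix) -> Tuple[int, int]:
    return (len(x), len(x[0]))


x = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4]]  # seq_len = 2, d_model = 3
for depth in (12, 96):
    blocks = [make_toy_block(0.01) for _ in range(depth)]
    y = run_stack(x, blocks)
    print(f"{depth} blocks: input shape {shape(x)}, output shape {shape(y)}")
```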

Step 7: Test Your Understanding

You've learned how residual connections and layer normalization make deep Transformers trainable. Let's test your understanding of these critical techniques.

Prerequisites

  • Transformer self-attention mechanism
  • Feed-forward networks
  • Backpropagation and gradient flow

Key Concepts

  • Residual Connections (Skip Connections)
  • Layer Normalization
  • Pre-Norm vs Post-Norm
  • Gradient Flow in Deep Networks
  • Transformer Block Assembly
  • Depth Scaling