| Step | Direction | What Happens |
| --- | --- | --- |
| 1. Forward Pass | → Left to Right | Data flows through the network to make a prediction |
| 2. Compute Error | ⊗ At Output | Compare prediction vs. target |
| 3. Backward Pass | ← Right to Left | Trace blame for the error |
| 4. Update Weights | ↻ Everywhere | Adjust weights to reduce error |

Backpropagation Step-by-Step — Lesson Content

Trace how a neural network learns from its mistakes by computing gradients and updating weights backward through the network.

Backpropagation is the algorithm that makes neural networks learn. Using a concrete example — predicting exam scores from study and sleep hours — you'll trace the complete training cycle: forward pass, error computation, backward pass, and weight updates. Every number is computed live so you can follow the math step by step. You'll see how the chain rule distributes blame across layers and why gradients shrink as they flow backward.

Learning Objectives

Understand how neural networks learn from errors
Trace the chain rule through a multi-layer network
Compute gradients for each weight by hand
See how gradient descent updates weights
Recognize why gradients vanish in deep sigmoid networks

Step 1: What is Backpropagation?

Imagine you're a teacher grading a group project. The final answer is wrong, and you need to figure out **who made the mistake** and how much each person contributed to the error.

**Backpropagation** works in much the same way. When a neural network makes a wrong prediction, it traces the error **backward** through the network to figure out **which weights are to blame** and how to fix them.

Here's the training cycle that repeats over and over:

1. **Forward Pass**: Feed data through the network and get a prediction
2. **Compute the Error**: How wrong was the prediction?
3. **Backward Pass**: Trace the error back to assign blame to each weight
4. **Update Weights**: Adjust weights to reduce the error

This "learn from your mistakes" cycle is how nearly **every** modern neural network trains, from simple classifiers to ChatGPT.
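
The four-step cycle can be sketched in a few lines of Python. This is only an illustration and not the lesson's network: it assumes a hypothetical one-weight model `prediction = w * x`, and the values of `x`, `target`, `w`, and `learning_rate` are made up for the example.

```python
# Hypothetical one-weight model: prediction = w * x (illustrative values only).
x, target = 2.0, 1.0
w = 0.1              # initial guess for the weight
learning_rate = 0.1

losses = []
for step in range(20):
    prediction = w * x                        # 1. forward pass
    loss = 0.5 * (target - prediction) ** 2   # 2. compute the error
    gradient = -(target - prediction) * x     # 3. backward pass: dLoss/dw
    w -= learning_rate * gradient             # 4. update the weight
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.8f}, w: {w:.4f}")
```

With these values the loss shrinks on every pass and `w` approaches 0.5, the weight for which `w * x` equals the target.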

Step 2: Our Example Network

Let's work through a concrete example. We have a tiny network that predicts **exam scores** from two inputs:

- **Hours Studied**: 6 hours → normalized to **0.6**
- **Hours Slept**: 8 hours → normalized to **0.8**
- **Target Output**: 0.9 (the student scored well!)

The network has **2 inputs → 2 hidden neurons → 1 output**, and every connection has a **weight** that the network will learn to adjust.

We use the **sigmoid** activation function: σ(x) = 1/(1+e⁻ˣ), which squashes any number into the 0–1 range, a good fit for predicting a score between 0 and 1.

Right now these weights are essentially random guesses. After backpropagation, they'll be a little bit smarter!
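
As a quick illustration of the squashing behaviour, here is a minimal sigmoid in Python. The function name `sigmoid` and the sample inputs are choices made for this sketch, not part of the lesson.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Sample inputs chosen for illustration: large negative, zero, moderate, large positive.
for value in (-5.0, 0.0, 0.68, 5.0):
    print(f"sigmoid({value:+.2f}) = {sigmoid(value):.4f}")
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 maps to exactly 0.5.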

Step 3: Forward Pass: Making a Prediction

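Data flows **left to right**. Each neuron computes a weighted sum, adds a bias, then applies sigmoid. The weights are numbered in the order they are used: w₁ and w₂ feed Hidden Neuron 1, w₃ and w₄ feed Hidden Neuron 2, and w₅ and w₆ connect the two hidden neurons to the output.

**Hidden Neuron 1** (w₁ = 0.3, w₂ = 0.5, bias = 0.1):
z = (0.6 × 0.3000) + (0.8 × 0.5000) + 0.1000 = **0.6800**
a = σ(0.6800) = **0.6637**

**Hidden Neuron 2** (w₃ = −0.2, w₄ = 0.4, bias = −0.1):
z = (0.6 × −0.2000) + (0.8 × 0.4000) + (−0.1000) = **0.1000**
a = σ(0.1000) = **0.5250**

**Output** (w₅ = 0.6, w₆ = −0.3, bias = 0.2):
z = (0.6637 × 0.6000) + (0.5250 × −0.3000) + 0.2000 = **0.4407**
prediction = σ(0.4407) = **0.6084**

Our network predicts **0.6084**, but the target is **0.9**. Not great! Let's figure out what went wrong.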
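
The forward pass above can be reproduced with a short script. The variable names (`w1` to `w6`, `b_h1`, `b_h2`, `b_out`) are labels assumed for this sketch; the numeric values are the ones used in the calculation.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

x1, x2 = 0.6, 0.8                  # normalized inputs
w1, w2, b_h1 = 0.3, 0.5, 0.1       # hidden neuron 1
w3, w4, b_h2 = -0.2, 0.4, -0.1     # hidden neuron 2
w5, w6, b_out = 0.6, -0.3, 0.2     # output neuron

# Each neuron: weighted sum plus bias, then sigmoid.
z_h1 = x1 * w1 + x2 * w2 + b_h1
a_h1 = sigmoid(z_h1)
z_h2 = x1 * w3 + x2 * w4 + b_h2
a_h2 = sigmoid(z_h2)
z_out = a_h1 * w5 + a_h2 * w6 + b_out
prediction = sigmoid(z_out)

print(f"hidden 1: z={z_h1:.4f}, a={a_h1:.4f}")
print(f"hidden 2: z={z_h2:.4f}, a={a_h2:.4f}")
print(f"output:   z={z_out:.4f}, prediction={prediction:.4f}")
```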

Step 4: Computing the Error

How wrong is our prediction? We use **Mean Squared Error**, which for a single training example reduces to:

**Loss = ½ × (target − prediction)²**

- **Target**: 0.9 (what we wanted)
- **Predicted**: 0.6084 (what we got)
- **Error**: 0.9 − 0.6084 = **0.2916**
- **Loss**: ½ × 0.2916² = **0.0425**

We use the ½ to make the derivative cleaner: the 2 from the square cancels out, leaving ∂Loss/∂prediction = −(target − prediction).

The network predicted about **61%** when the student actually scored **90%**. Not terrible, but there's room to improve. Now the big question: **which weights should change, and by how much?**
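
The same arithmetic as a small Python sketch. It starts from the rounded prediction 0.6084, and the variable names are chosen for illustration.

```python
target = 0.9
prediction = 0.6084   # rounded output of the forward pass

error = target - prediction
loss = 0.5 * error ** 2

# Derivative of the loss with respect to the prediction:
# the 1/2 cancels the 2 that comes from differentiating the square.
dloss_dprediction = -(target - prediction)

print(f"error = {error:.4f}")
print(f"loss  = {loss:.4f}")
print(f"dLoss/dprediction = {dloss_dprediction:.4f}")
```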

Step 5: The Chain Rule: Tracing the Blame

To fix weight w₁, we need to answer: **"How much did w₁ contribute to the error?"**

But w₁ doesn't directly touch the output. It goes through a chain of steps:

**w₁ → weighted sum → sigmoid → weighted sum → sigmoid → loss**

The **chain rule** lets us multiply the effect at each link:

∂Loss/∂w₁ = (∂Loss/∂output) × (∂output/∂z_out) × (∂z_out/∂a_h₁) × (∂a_h₁/∂z_h₁) × (∂z_h₁/∂w₁)

**Think of it like dominoes:** each one knocks over the next. To know how hard the last domino fell because of the first one, multiply all the "push strengths" together.

This is the **key insight** of backpropagation: we work **backward** from the output, reusing calculations at each layer. Instead of recomputing each chain from scratch, we share the work, which keeps it efficient even for huge networks.
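
One way to convince yourself that multiplying the links gives the right answer is to compare the product with a numerical slope. The sketch below does that for w₁, using the example network's values and the standard sigmoid derivative σ'(z) = σ(z)(1 − σ(z)). The helper `loss_for` and the step size `eps` are assumptions made for this check, not part of the lesson.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def loss_for(w1: float) -> float:
    """Loss of the example network as a function of w1 only (everything else fixed)."""
    a_h1 = sigmoid(0.6 * w1 + 0.8 * 0.5 + 0.1)
    a_h2 = sigmoid(0.6 * -0.2 + 0.8 * 0.4 - 0.1)
    output = sigmoid(a_h1 * 0.6 + a_h2 * -0.3 + 0.2)
    return 0.5 * (0.9 - output) ** 2

# Forward pass at w1 = 0.3 to get the values each link needs.
a_h1 = sigmoid(0.68)
a_h2 = sigmoid(0.10)
output = sigmoid(a_h1 * 0.6 + a_h2 * -0.3 + 0.2)

links = [
    -(0.9 - output),         # dLoss/doutput
    output * (1 - output),   # doutput/dz_out (sigmoid derivative)
    0.6,                     # dz_out/da_h1 = w5
    a_h1 * (1 - a_h1),       # da_h1/dz_h1 (sigmoid derivative)
    0.6,                     # dz_h1/dw1 = x1
]
chain_rule_gradient = math.prod(links)

# Central finite difference: nudge w1 slightly and measure the change in loss.
eps = 1e-6
numeric_gradient = (loss_for(0.3 + eps) - loss_for(0.3 - eps)) / (2 * eps)

print(f"chain rule: {chain_rule_gradient:.6f}")
print(f"numeric:    {numeric_gradient:.6f}")
```

The two numbers should agree to many decimal places, which is a common sanity check for hand-derived gradients.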

Step 6: Backward Pass: Output Layer

Starting from the output, we compute the **error signal** (δ). The sigmoid derivative is σ'(z) = σ(z) × (1 − σ(z)), so σ'(z_out) = 0.6084 × (1 − 0.6084) = 0.2382.

**δ_output** = −(target − output) × σ'(z_out) = −0.2916 × 0.2382 = **−0.0695**

The negative value means: "the output needs to go **up** to reach the target."

Now we compute gradients for the output weights:

- **∂L/∂w₅** = δ_output × a_h₁ = −0.0695 × 0.6637 = **−0.0461**
- **∂L/∂w₆** = δ_output × a_h₂ = −0.0695 × 0.5250 = **−0.0365**

Both gradients are **negative**, meaning we should **increase** these weights to push the output closer to the target.
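
A minimal sketch of the output-layer step in Python. It starts from the rounded forward-pass values, so the last digit can differ slightly from a full-precision run. The variable names are assumptions for this example.

```python
target, output = 0.9, 0.6084      # rounded values from the forward pass
a_h1, a_h2 = 0.6637, 0.5250       # hidden activations

# Sigmoid derivative expressed through the sigmoid's own output.
sigmoid_prime_out = output * (1 - output)

# Error signal at the output neuron.
delta_output = -(target - output) * sigmoid_prime_out

# Gradient for each output weight = error signal * activation feeding that weight.
grad_w5 = delta_output * a_h1
grad_w6 = delta_output * a_h2

print(f"sigma'(z_out) = {sigmoid_prime_out:.4f}")
print(f"delta_output  = {delta_output:.4f}")
print(f"dL/dw5 = {grad_w5:.4f}, dL/dw6 = {grad_w6:.4f}")
```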

Step 7: Backward Pass: Hidden Layer

The error signal flows **backward** through the weights. Each hidden neuron receives blame proportional to its connection to the output:

**Hidden 1:** δ_h₁ = (δ_out × w₅) × σ'(z_h₁) = (−0.0695 × 0.6000) × 0.2232 = **−0.0093**

**Hidden 2:** δ_h₂ = (δ_out × w₆) × σ'(z_h₂) = (−0.0695 × −0.3000) × 0.2494 = **0.0052**

Each hidden weight's gradient is its neuron's error signal times the input feeding that weight:

- **∂L/∂w₁** = δ_h₁ × x₁ = **−0.0056**
- **∂L/∂w₂** = δ_h₁ × x₂ = **−0.0074**
- **∂L/∂w₃** = δ_h₂ × x₁ = **0.0031**
- **∂L/∂w₄** = δ_h₂ × x₂ = **0.0042**

Notice the gradients are **much smaller** here! They shrank as they passed back through the sigmoid layer, because each layer multiplies the signal by σ'(z), which is never larger than 0.25. This is called the **vanishing gradient** problem. It's why very deep networks with sigmoid can be hard to train, and why modern networks often use ReLU instead.
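
The hidden-layer step can be sketched the same way. The script starts from the rounded values already computed, and the dictionary name `gradients` is a choice made for this example.

```python
delta_output = -0.0695            # error signal from the output layer (rounded)
w5, w6 = 0.6, -0.3                # weights connecting hidden neurons to the output
a_h1, a_h2 = 0.6637, 0.5250       # hidden activations (rounded)
x1, x2 = 0.6, 0.8                 # network inputs

# Blame reaching each hidden neuron: output error * connecting weight * sigmoid slope.
delta_h1 = delta_output * w5 * a_h1 * (1 - a_h1)
delta_h2 = delta_output * w6 * a_h2 * (1 - a_h2)

# Gradient for each hidden weight = hidden error signal * input feeding that weight.
gradients = {
    "w1": delta_h1 * x1,
    "w2": delta_h1 * x2,
    "w3": delta_h2 * x1,
    "w4": delta_h2 * x2,
}

print(f"delta_h1 = {delta_h1:.4f}, delta_h2 = {delta_h2:.4f}")
for name, value in gradients.items():
    print(f"dL/d{name} = {value:.4f}")
```

Both hidden error signals are several times smaller in magnitude than `delta_output`, which is the shrinking effect described above.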

Step 8: Weight Updates & The Learning Cycle

Finally! We apply the **gradient descent update rule** with learning rate α = 0.5:

**w_new = w_old − α × gradient**

For example, w₅ moves from 0.6000 to 0.6000 − 0.5 × (−0.0461) ≈ 0.623, while w₁ moves from 0.3000 to 0.3000 − 0.5 × (−0.0056) = 0.3028. The output weights change more than the hidden weights, which is the vanishing gradient in action. Over many training iterations, even these tiny updates add up.

After this single step, if we ran the forward pass again with the new weights, our prediction would be **closer to 0.9**. Real training repeats this cycle thousands of times until the network converges.

**Key takeaways:**

1. **Forward pass** computes the prediction
2. **Loss function** measures how wrong we are
3. **Chain rule** traces blame backward through each layer
4. **Gradient descent** nudges each weight to reduce the error
5. **Repeat** thousands of times until the network learns!

This is the same algorithm powering image recognition, language models, and self-driving cars, just with millions more weights.
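
The update and the follow-up forward pass can be sketched as below. One assumption to flag: the lesson lists gradients only for the six weights, so this sketch keeps the three biases fixed at their original values. The helper `forward` and the dictionary names are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def forward(w: dict, x1: float = 0.6, x2: float = 0.8) -> float:
    """Forward pass with the biases held at 0.1, -0.1 and 0.2 (assumption: not updated)."""
    a_h1 = sigmoid(x1 * w["w1"] + x2 * w["w2"] + 0.1)
    a_h2 = sigmoid(x1 * w["w3"] + x2 * w["w4"] - 0.1)
    return sigmoid(a_h1 * w["w5"] + a_h2 * w["w6"] + 0.2)

weights = {"w1": 0.3, "w2": 0.5, "w3": -0.2, "w4": 0.4, "w5": 0.6, "w6": -0.3}
gradients = {"w1": -0.0056, "w2": -0.0074, "w3": 0.0031,
             "w4": 0.0042, "w5": -0.0461, "w6": -0.0365}   # rounded values from the backward pass
alpha = 0.5   # learning rate

# Gradient descent: move each weight against its gradient.
updated = {name: weights[name] - alpha * gradients[name] for name in weights}

before = forward(weights)
after = forward(updated)

for name in weights:
    print(f"{name}: {weights[name]:+.4f} -> {updated[name]:+.4f}")
print(f"prediction before: {before:.4f}, after: {after:.4f}, target: 0.9")
```

Under these assumptions the new prediction sits between the old one and the target, so a single step already reduces the error.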

Prerequisites

Neural network forward pass
Loss functions (MSE)
Basic understanding of derivatives

Key Concepts

Backpropagation
Chain Rule
Gradients
Weight Updates
Vanishing Gradients

