Backpropagation Step-by-Step

How Neural Networks Learn From Their Mistakes

Difficulty: Intermediate
Duration: 18-22 minutes
Prerequisites: Neural networks, Loss functions

What You'll Discover

Learn how neural networks propagate errors backward to improve

Chain Rule Intuition

How gradients compose through layers so the network knows which weights to adjust.

Backward Pass

Watch error signals flow from output back to input, layer by layer.

Weight Updates

See how each weight gets adjusted based on its contribution to the error.

The Full Picture

How forward + backward passes form the complete training loop.

Key Concepts

Chain Rule

Multiply derivatives to trace blame through layers

Gradient Flow

Error signals propagate backward through the network

Error Signals

Each layer receives feedback about its contribution

Weight Updates

Adjust weights proportional to their error contribution

Layer-by-Layer

Process goes backward: output → hidden → input

Learning Loop

Forward pass → loss → backward pass → update → repeat

A Simple Neural Network

[Figure: network diagram with layers Input, Hidden 1, and Output. Connection weights shown: +0.30, +0.50, -0.20, +0.40, +0.60, -0.30. Legend: Positive (Excitatory), Negative (Inhibitory), Thickness = Strength.]

The Backpropagation Cycle

| Step | Direction | What Happens |
|---|---|---|
| 1. Forward Pass | → Left to Right | Data flows through, make a prediction |
| 2. Compute Error | ⊗ At Output | Compare prediction vs. target |
| 3. Backward Pass | ← Right to Left | Trace blame for the error |
| 4. Update Weights | ↻ Everywhere | Adjust weights to reduce error |

Backpropagation Step-by-Step — Lesson Content

Trace how a neural network learns from its mistakes by computing gradients and updating weights backward through the network.

Backpropagation is the algorithm that makes neural networks learn. Using a concrete example — predicting exam scores from study and sleep hours — you'll trace the complete training cycle: forward pass, error computation, backward pass, and weight updates. Every number is computed live so you can follow the math step by step. You'll see how the chain rule distributes blame across layers and why gradients shrink as they flow backward.

Learning Objectives

  • Understand how neural networks learn from errors
  • Trace the chain rule through a multi-layer network
  • Compute gradients for each weight by hand
  • See how gradient descent updates weights
  • Recognize why gradients vanish in deep sigmoid networks

Step 1: What is Backpropagation?

Imagine you're a teacher grading a group project. The final answer is wrong, and you need to figure out **who made the mistake** and how much each person contributed to the error.

**Backpropagation** works in much the same way. When a neural network makes a wrong prediction, it traces the error **backward** through the network to figure out **which weights are to blame** and how to fix them.

Here's the training cycle that repeats over and over:

  1. **Forward Pass**: Feed data through the network, get a prediction
  2. **Compute the Error**: How wrong was the prediction?
  3. **Backward Pass**: Trace the error back to assign blame to each weight
  4. **Update Weights**: Adjust weights to reduce the error

This "learn from your mistakes" cycle is how virtually **every** neural network trains, from simple classifiers to ChatGPT.

Step 2: Our Example Network

Let's work through a concrete example. We have a tiny network that predicts **exam scores** from two inputs:

  • **Hours Studied**: 6 hours → normalized to **0.6**
  • **Hours Slept**: 8 hours → normalized to **0.8**
  • **Target Output**: 0.9 (the student scored well!)

The network has **2 inputs → 2 hidden neurons → 1 output**, and every connection has a **weight** that the network will learn to adjust.

We use the **sigmoid** activation function, σ(x) = 1/(1+e⁻ˣ), which squashes any number into the 0–1 range. That makes it a natural fit for predicting a score between 0 and 1.

Right now these weights are essentially random guesses. After backpropagation, they'll be a little bit smarter!
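As a minimal sketch of the setup described above, the snippet below defines the sigmoid and the normalized inputs. The `normalize` helper and its division by 10 are assumptions inferred from the 6 → 0.6 and 8 → 0.8 mapping; the lesson does not state how the inputs were scaled.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def normalize(hours: float, scale: float = 10.0) -> float:
    """Hypothetical scaling step (assumed: divide by 10)."""
    return hours / scale


x1 = normalize(6)   # hours studied
x2 = normalize(8)   # hours slept
target = 0.9        # desired network output

# Large negative inputs approach 0, large positive inputs approach 1.
for value in (-5.0, 0.0, 5.0):
    print(f"sigmoid({value:+.1f}) = {sigmoid(value):.4f}")
```

Whatever the input, the result stays strictly between 0 and 1, which is why the output can be read as a score.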

Step 3: Forward Pass: Making a Prediction

Data flows **left to right**. Each neuron computes a weighted sum, adds a bias, then applies sigmoid. The weights are numbered w₁ to w₆: w₁ and w₂ feed Hidden Neuron 1, w₃ and w₄ feed Hidden Neuron 2, and w₅ and w₆ connect the hidden neurons to the output.

**Hidden Neuron 1** (w₁ = 0.3, w₂ = 0.5, bias = 0.1):
z = (0.6 × 0.3000) + (0.8 × 0.5000) + 0.1000 = **0.6800**
a = σ(0.6800) = **0.6637**

**Hidden Neuron 2** (w₃ = −0.2, w₄ = 0.4, bias = −0.1):
z = (0.6 × −0.2000) + (0.8 × 0.4000) − 0.1000 = **0.1000**
a = σ(0.1000) = **0.5250**

**Output** (w₅ = 0.6, w₆ = −0.3, bias = 0.2):
z = (0.6637 × 0.6000) + (0.5250 × −0.3000) + 0.2000 = **0.4407**
prediction = σ(0.4407) = **0.6084**

Our network predicts **0.6084**, but the target is **0.9**. Not great! Let's figure out what went wrong.
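The forward pass above can be reproduced in a few lines. This is a sketch using the weights and biases from the worked numbers; the names `w1` to `w6` follow the numbering used in the later steps, while `b_h1`, `b_h2`, and `b_out` are names chosen here for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Inputs (normalized hours studied and hours slept).
x1, x2 = 0.6, 0.8

# Weights and biases taken from the worked example.
w1, w2, b_h1 = 0.3, 0.5, 0.1      # into hidden neuron 1
w3, w4, b_h2 = -0.2, 0.4, -0.1    # into hidden neuron 2
w5, w6, b_out = 0.6, -0.3, 0.2    # into the output neuron

# Hidden layer: weighted sum plus bias, then sigmoid.
z_h1 = x1 * w1 + x2 * w2 + b_h1
a_h1 = sigmoid(z_h1)
z_h2 = x1 * w3 + x2 * w4 + b_h2
a_h2 = sigmoid(z_h2)

# Output layer: same recipe, using the hidden activations as inputs.
z_out = a_h1 * w5 + a_h2 * w6 + b_out
prediction = sigmoid(z_out)

print(f"a_h1 = {a_h1:.4f}, a_h2 = {a_h2:.4f}")
print(f"z_out = {z_out:.4f}, prediction = {prediction:.4f}")
```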

Step 4: Computing the Error

How wrong is our prediction? We use the **squared error** for this single example (the per-example form of Mean Squared Error), scaled by ½:

**Loss = ½ × (target − prediction)²**

  • **Target**: 0.9 (what we wanted)
  • **Predicted**: 0.6084 (what we got)
  • **Error**: 0.9 − 0.6084 = **0.2916**
  • **Loss**: ½ × 0.2916² = **0.0425**

We use the ½ to make the derivative cleaner: the 2 from the square cancels out, leaving ∂Loss/∂prediction = −(target − prediction).

The network predicted about **61%** when the student actually scored **90%**. Not terrible, but there's room to improve. Now the big question: **which weights should change, and by how much?**
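A short sketch of the loss calculation follows, together with a numerical check that the ½ factor really does leave a derivative of −(target − prediction). The function name `half_squared_error` and the step size `eps` are choices made here, not part of the lesson.

```python
def half_squared_error(target: float, prediction: float) -> float:
    """Loss = 1/2 * (target - prediction)^2 for a single example."""
    return 0.5 * (target - prediction) ** 2


target, prediction = 0.9, 0.6084
loss = half_squared_error(target, prediction)

# Analytic derivative of the loss with respect to the prediction.
analytic = -(target - prediction)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (
    half_squared_error(target, prediction + eps)
    - half_squared_error(target, prediction - eps)
) / (2 * eps)

print(f"loss = {loss:.4f}")
print(f"dLoss/dprediction: analytic = {analytic:.4f}, numeric = {numeric:.4f}")
```

The derivative is negative here, which says that raising the prediction would lower the loss.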

Step 5: The Chain Rule: Tracing the Blame

To fix weight w₁, we need to answer: **"How much did w₁ contribute to the error?"**

But w₁ doesn't directly touch the output. It goes through a chain of steps:

**w₁ → weighted sum → sigmoid → weighted sum → sigmoid → loss**

The **chain rule** lets us multiply the effect at each link:

∂Loss/∂w₁ = (∂Loss/∂output) × (∂output/∂z_out) × (∂z_out/∂a_h₁) × (∂a_h₁/∂z_h₁) × (∂z_h₁/∂w₁)

**Think of it like dominoes:** each one knocks over the next. To know how hard the last domino fell because of the first one, multiply all the "push strengths" together.

This is the **key insight** of backpropagation: we work **backward** from the output, reusing calculations at each layer. Instead of recomputing each chain from scratch, we share the work, which keeps it efficient even for huge networks.
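To make the five-link chain concrete, the sketch below evaluates each factor for w₁ using the example network's values and multiplies them. It then compares the product against a finite-difference estimate. The helper names `forward` and `loss` and the step size `eps` are assumptions introduced for this illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


x1, x2, target = 0.6, 0.8, 0.9
w1 = 0.3  # the weight we are assigning blame to


def forward(w1_value: float):
    """Forward pass with w1 variable and every other parameter fixed."""
    a_h1 = sigmoid(x1 * w1_value + x2 * 0.5 + 0.1)
    a_h2 = sigmoid(x1 * -0.2 + x2 * 0.4 - 0.1)
    out = sigmoid(a_h1 * 0.6 + a_h2 * -0.3 + 0.2)
    return a_h1, out


def loss(w1_value: float) -> float:
    return 0.5 * (target - forward(w1_value)[1]) ** 2


a_h1, out = forward(w1)
factors = [
    -(target - out),     # dLoss/d_output
    out * (1 - out),     # d_output/dz_out (sigmoid derivative)
    0.6,                 # dz_out/da_h1, which is w5
    a_h1 * (1 - a_h1),   # da_h1/dz_h1 (sigmoid derivative)
    x1,                  # dz_h1/dw1
]
grad_w1 = math.prod(factors)

# Independent check: nudge w1 slightly and watch the loss change.
eps = 1e-6
numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)

print(f"chain rule: {grad_w1:.6f}, finite difference: {numeric:.6f}")
```

The two estimates should agree closely, which is a common way to sanity-check a hand-derived gradient.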

Step 6: Backward Pass: Output Layer

Starting from the output, we compute the **error signal** (δ). It uses the sigmoid derivative, σ'(z) = σ(z) × (1 − σ(z)), so σ'(z_out) = 0.6084 × (1 − 0.6084) = 0.2382:

**δ_output** = −(target − output) × σ'(z_out) = −0.2916 × 0.2382 = **−0.0695**

The negative value means: "the output needs to go **up** to reach the target."

Now we compute gradients for the output weights:

  • **∂L/∂w₅** = δ_output × a_h₁ = −0.0695 × 0.6637 = **−0.0461**
  • **∂L/∂w₆** = δ_output × a_h₂ = −0.0695 × 0.5250 = **−0.0365**

Both gradients are **negative**, meaning we should **increase** these weights to push the output closer to the target.
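The output-layer step can be sketched as follows. It starts from the rounded forward-pass values given in the text, so results may differ from a full-precision run in the last digit; the variable names are illustrative.

```python
# Rounded values carried over from the forward pass.
target = 0.9
output = 0.6084
a_h1, a_h2 = 0.6637, 0.5250

# Sigmoid derivative expressed through the activation itself.
sigmoid_prime_out = output * (1 - output)

# Error signal at the output neuron.
delta_out = -(target - output) * sigmoid_prime_out

# Gradient for each output weight = error signal * activation feeding it.
grad_w5 = delta_out * a_h1
grad_w6 = delta_out * a_h2

print(f"sigma'(z_out) = {sigmoid_prime_out:.4f}")
print(f"delta_out     = {delta_out:.4f}")
print(f"dL/dw5 = {grad_w5:.4f}, dL/dw6 = {grad_w6:.4f}")
```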

Step 7: Backward Pass: Hidden Layer

The error signal flows **backward** through the weights. Each hidden neuron receives blame proportional to its connection to the output:

**Hidden 1:** δ_h₁ = (δ_out × w₅) × σ'(z_h₁) = (−0.0695 × 0.6000) × 0.2232 = **−0.0093**

**Hidden 2:** δ_h₂ = (δ_out × w₆) × σ'(z_h₂) = (−0.0695 × −0.3000) × 0.2494 = **0.0052**

Each hidden error signal is then multiplied by the input on the corresponding connection to get the weight gradients:

  • **∂L/∂w₁** = δ_h₁ × x₁ = **−0.0056**
  • **∂L/∂w₂** = δ_h₁ × x₂ = **−0.0074**
  • **∂L/∂w₃** = δ_h₂ × x₁ = **0.0031**
  • **∂L/∂w₄** = δ_h₂ × x₂ = **0.0042**

Notice the error signals, and the gradients built from them, are **much smaller** here than at the output (−0.0695): roughly an order of magnitude. They shrank because each step backward multiplies by a weight and by a sigmoid derivative, and σ'(z) is never larger than 0.25. This is called the **vanishing gradient** problem. It's why very deep networks with sigmoid can be hard to train, and why modern networks often use ReLU instead.
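Here is a sketch of the hidden-layer step, again starting from the rounded values in the text. The helper `hidden_delta` is a name introduced for illustration; it applies the same formula to both hidden neurons.

```python
# Rounded values carried over from earlier steps.
x1, x2 = 0.6, 0.8
a_h1, a_h2 = 0.6637, 0.5250
w5, w6 = 0.6, -0.3
delta_out = -0.0695


def hidden_delta(delta_next: float, weight: float, activation: float) -> float:
    """Error signal for a hidden sigmoid neuron with one outgoing connection."""
    sigmoid_prime = activation * (1 - activation)
    return delta_next * weight * sigmoid_prime


delta_h1 = hidden_delta(delta_out, w5, a_h1)
delta_h2 = hidden_delta(delta_out, w6, a_h2)

# Weight gradient = hidden error signal * input on that connection.
grads = {
    "w1": delta_h1 * x1,
    "w2": delta_h1 * x2,
    "w3": delta_h2 * x1,
    "w4": delta_h2 * x2,
}

print(f"delta_h1 = {delta_h1:.4f}, delta_h2 = {delta_h2:.4f}")
for name, value in grads.items():
    print(f"dL/d{name} = {value:.4f}")

# How much the signal shrank on its way back through one layer.
print(f"|delta_h1| / |delta_out| = {abs(delta_h1) / abs(delta_out):.3f}")
```

The final ratio gives a rough sense of the shrinkage per layer in this example; stacking more sigmoid layers would compound it.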

Step 8: Weight Updates & The Learning Cycle

Finally! We apply the **gradient descent update rule** with learning rate α = 0.5:

**w_new = w_old − α × gradient**

For example, w₅ moves from 0.6000 to 0.6000 − 0.5 × (−0.0461) ≈ 0.6231, while w₁ moves from 0.3000 to 0.3000 − 0.5 × (−0.0056) ≈ 0.3028. The output weights change more than the hidden weights: that's the vanishing gradient in action. Over many training iterations, even these tiny updates add up.

After this single step, if we ran the forward pass again with the new weights, our prediction would be **closer to 0.9**. Real training repeats this cycle many times, often thousands or more, until the network converges.

**Key takeaways:**

  1. **Forward pass** computes the prediction
  2. **Loss function** measures how wrong we are
  3. **Chain rule** traces blame backward through each layer
  4. **Gradient descent** nudges each weight to reduce the error
  5. **Repeat** thousands of times until the network learns!

This is the same algorithm powering image recognition, language models, and self-driving cars, just with millions more weights.
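Putting the four stages together, the sketch below runs one update on the example network and then repeats the cycle. Two assumptions are worth flagging: only the six weights are updated (the biases stay fixed, since the lesson shows no bias gradients), and the loop trains on the single example rather than a dataset. The function names and the choice of 1000 iterations are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, b, x):
    """Forward pass; w holds w1..w6 in order, b holds the three biases."""
    a_h1 = sigmoid(x[0] * w[0] + x[1] * w[1] + b[0])
    a_h2 = sigmoid(x[0] * w[2] + x[1] * w[3] + b[1])
    out = sigmoid(a_h1 * w[4] + a_h2 * w[5] + b[2])
    return a_h1, a_h2, out


def gradients(w, b, x, target):
    """Backward pass: gradients of the half squared error for w1..w6."""
    a_h1, a_h2, out = forward(w, b, x)
    d_out = -(target - out) * out * (1 - out)
    d_h1 = d_out * w[4] * a_h1 * (1 - a_h1)
    d_h2 = d_out * w[5] * a_h2 * (1 - a_h2)
    return [d_h1 * x[0], d_h1 * x[1], d_h2 * x[0], d_h2 * x[1],
            d_out * a_h1, d_out * a_h2]


def train_step(w, b, x, target, lr):
    """One gradient descent update: w_new = w_old - lr * gradient."""
    grads = gradients(w, b, x, target)
    return [wi - lr * gi for wi, gi in zip(w, grads)]


x, target, lr = (0.6, 0.8), 0.9, 0.5
w = [0.3, 0.5, -0.2, 0.4, 0.6, -0.3]
b = (0.1, -0.1, 0.2)  # assumption: biases are left unchanged

before = forward(w, b, x)[2]
w_after_one = train_step(w, b, x, target, lr)
after = forward(w_after_one, b, x)[2]
print(f"prediction before: {before:.4f}, after one step: {after:.4f}")

# Repeat the cycle (illustrative iteration count).
w_trained = list(w)
for _ in range(1000):
    w_trained = train_step(w_trained, b, x, target, lr)
final = forward(w_trained, b, x)[2]
print(f"prediction after 1000 steps: {final:.4f}")
```

Fitting one example this way only shows the mechanics; a real training run would cycle through many examples and track the loss on held-out data.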

Prerequisites

  • Neural network forward pass
  • Loss functions (MSE)
  • Basic understanding of derivatives

Key Concepts

  • Backpropagation
  • Chain Rule
  • Gradients
  • Weight Updates
  • Vanishing Gradients