
Backpropagation Explained Visually

Backpropagation is how neural networks learn from mistakes. Follow gradients as they flow backward through layers, updating weights to reduce errors step by step.

CS Visualizations · April 12, 2026 · 9 min


Backpropagation is the algorithm that makes neural networks learn. Without it, we'd have no image recognition, no language models, no modern AI. Yet at its core, it's just the chain rule from calculus applied systematically.

Let's demystify it.

The Big Picture

Training a neural network involves four steps, repeated thousands of times:

  1. Forward pass: Feed input through the network, get a prediction
  2. Compute loss: Measure how wrong the prediction is
  3. Backward pass (backpropagation): Calculate how each weight contributed to the error
  4. Update weights: Adjust weights to reduce the error

Steps 1 and 2 are straightforward. Step 4 is just subtraction. The hard part — the part backpropagation solves — is step 3: figuring out how much each weight is responsible for the error.

Why Is This Hard?

In a deep neural network, the output is the result of hundreds of operations chained together. Changing a weight in the first layer affects the second layer, which affects the third, which eventually affects the output and the loss.

The question is: if the loss is 0.5 and we want to reduce it, how much should we adjust each of the thousands of weights? Which weights caused the error, and by how much?

Backpropagation answers this by computing the gradient — the rate of change of the loss with respect to each weight.

A Simple Example

Let's trace backpropagation through the simplest possible network: one neuron, one input, one weight, one bias.

Input: x = 2
Weight: w = 0.5
Bias: b = 0.1
Target: y = 1

Forward Pass

z = w × x + b = 0.5 × 2 + 0.1 = 1.1
ŷ = sigmoid(z) = sigmoid(1.1) ≈ 0.75
Loss = (y - ŷ)² = (1 - 0.75)² = 0.0625

Our prediction is 0.75, but the target is 1. The loss is 0.0625.
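
The same forward pass in a few lines of plain Python (a minimal sketch; it uses math.exp for the sigmoid, and sigmoid(1.1) is 0.7503 before rounding):

import math

x, w, b = 2.0, 0.5, 0.1     # input, weight, bias
y = 1.0                     # target

z = w * x + b                     # 1.1
y_hat = 1 / (1 + math.exp(-z))    # sigmoid(1.1) ≈ 0.7503
loss = (y - y_hat) ** 2           # ≈ 0.0624

print(z, y_hat, loss)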

Backward Pass

Now we compute gradients, working backward from the loss:

Step 1: How does loss change with ŷ?

∂Loss/∂ŷ = -2(y - ŷ) = -2(1 - 0.75) = -0.5

Step 2: How does ŷ change with z? (sigmoid derivative)

∂ŷ/∂z = ŷ × (1 - ŷ) = 0.75 × 0.25 = 0.1875

Step 3: How does z change with w?

∂z/∂w = x = 2

Chain rule — how does loss change with w?

∂Loss/∂w = ∂Loss/∂ŷ × ∂ŷ/∂z × ∂z/∂w
         = -0.5 × 0.1875 × 2
         = -0.1875

The gradient is -0.1875. The negative sign means: increasing w will decrease the loss. That's exactly what we want.
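
These gradients are easy to verify in plain Python. The sketch below recomputes the chain rule and cross-checks it with a finite-difference estimate (nudging w and measuring how the loss moves); the helper names are my own, and the values differ slightly from the prose because sigmoid(1.1) was rounded to 0.75 above.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, w, b, y = 2.0, 0.5, 0.1, 1.0

# Local derivatives, multiplied together by the chain rule
y_hat = sigmoid(w * x + b)
dloss_dyhat = -2 * (y - y_hat)              # ∂Loss/∂ŷ ≈ -0.499
dyhat_dz = y_hat * (1 - y_hat)              # ∂ŷ/∂z   ≈ 0.187
dz_dw = x                                   # ∂z/∂w   = 2
dloss_dw = dloss_dyhat * dyhat_dz * dz_dw   # ≈ -0.187

# Finite-difference check: nudge w slightly and see how the loss moves
def loss_at(w_value):
    return (y - sigmoid(w_value * x + b)) ** 2

eps = 1e-6
numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)

print(dloss_dw, numeric)   # both ≈ -0.187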

Weight Update

With a learning rate of 0.1:

w_new = w - learning_rate × gradient
w_new = 0.5 - 0.1 × (-0.1875)
w_new = 0.51875

The weight moved slightly upward, which will push the prediction closer to 1. Repeat this process thousands of times and the network converges.

w_new = w_old - α × ∂L/∂w   (α = learning rate, ∂L/∂w = gradient)
Weight update: subtract the gradient (scaled by the learning rate) from the current weight.
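
Putting all four steps together for this one-neuron example gives a complete, if tiny, training loop. This is a plain-Python sketch (the 1,000-step count is an arbitrary choice):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y, lr = 2.0, 1.0, 0.1   # input, target, learning rate
w, b = 0.5, 0.1

for step in range(1000):
    # 1-2. Forward pass and loss
    y_hat = sigmoid(w * x + b)
    loss = (y - y_hat) ** 2

    # 3. Backward pass (chain rule, as computed above)
    dloss_dz = -2 * (y - y_hat) * y_hat * (1 - y_hat)
    dloss_dw = dloss_dz * x      # ∂z/∂w = x
    dloss_db = dloss_dz * 1      # ∂z/∂b = 1

    # 4. Update weights
    w -= lr * dloss_dw
    b -= lr * dloss_db

print(sigmoid(w * x + b))   # prediction has climbed from ~0.75 toward the target of 1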

The Chain Rule: The Heart of Backpropagation

The chain rule is what makes backpropagation possible. It says: if y = f(g(x)), then:

dy/dx = dy/dg × dg/dx

In a neural network, the loss is a function of the output, which is a function of the last hidden layer, which is a function of the second-to-last layer, and so on. The chain rule lets us compute the gradient through this entire chain by multiplying local gradients at each step.

This is why it's called backpropagation — we start at the output (the loss) and propagate gradients backward through each layer to reach the input.
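
As a tiny standalone illustration of the rule (toy functions of my own choosing, not from the network above): take f(u) = u² and g(x) = 3x + 1, so y = f(g(x)) and dy/dx = 2g(x) × 3.

# y = f(g(x)) with f(u) = u**2 and g(x) = 3*x + 1
x = 2.0
g = 3 * x + 1           # g(2) = 7
dy_dg = 2 * g           # f'(g) = 2g = 14
dg_dx = 3               # g'(x) = 3
dy_dx = dy_dg * dg_dx   # chain rule: 14 * 3 = 42

print(dy_dx)   # matches the direct derivative 6 * (3*x + 1) = 42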

Backpropagation in a Multi-Layer Network

Input → Hidden 1 → Hidden 2 → Loss   (← gradients flow backward ←)
Backpropagation: gradients flow backward from the loss through each layer via the chain rule.

For a network with multiple layers, the process works the same way — just with more chain rule applications:

Layer-by-Layer Gradient Flow

Loss → Output Layer → Hidden Layer 2 → Hidden Layer 1 → Input

At each layer, we compute:

  1. The gradient of the loss with respect to this layer's output (received from the layer above)
  2. The gradient of this layer's output with respect to its weights (computed locally)
  3. Multiply them (chain rule) to get the gradient of the loss with respect to this layer's weights
  4. Pass the gradient backward to the previous layer

Every weight in the network gets its own gradient, telling it exactly how to adjust to reduce the loss.
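
Here is what that layer-by-layer flow looks like for a minimal two-layer network with a single neuron per layer (a plain-Python sketch; the weights, biases, and input are made-up numbers for illustration):

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

x, y = 2.0, 1.0        # input and target
w1, b1 = 0.5, 0.1      # hidden layer
w2, b2 = -0.3, 0.2     # output layer

# Forward pass, layer by layer
h = sigmoid(w1 * x + b1)        # hidden activation
y_hat = sigmoid(w2 * h + b2)    # prediction
loss = (y - y_hat) ** 2

# Backward pass: each layer multiplies its local gradient
# by the gradient it receives from the layer above
d_yhat = -2 * (y - y_hat)               # from the loss
d_z2 = d_yhat * y_hat * (1 - y_hat)     # through the output sigmoid
d_w2 = d_z2 * h                         # gradient for w2 (local: ∂z2/∂w2 = h)
d_h = d_z2 * w2                         # passed backward to the hidden layer
d_z1 = d_h * h * (1 - h)                # through the hidden sigmoid
d_w1 = d_z1 * x                         # gradient for w1 (local: ∂z1/∂w1 = x)

print(d_w1, d_w2)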

Why Activation Functions Matter for Backpropagation

During backpropagation, the gradient passes through activation functions. The derivative of the activation function determines how well the gradient flows:

Sigmoid: Derivative is at most 0.25 (at z=0). In deep networks, multiplying many small gradients causes them to vanish — weights in early layers barely update. This is the vanishing gradient problem.

ReLU: Derivative is either 0 or 1. Gradients flow unchanged through active neurons, largely solving the vanishing gradient problem. But "dead" neurons (ones whose inputs are always negative) receive zero gradient and never recover.

Tanh: Better than sigmoid (derivative up to 1.0 at z=0) but still prone to vanishing gradients in very deep networks.

This is why ReLU became the default activation function — it lets gradients propagate efficiently through many layers.
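
You can see the vanishing-gradient arithmetic directly: even in the best case, where every sigmoid sits at z = 0 and contributes its maximum derivative of 0.25, stacking 20 layers multiplies the gradient by 0.25^20. A small sketch (the depth of 20 is an arbitrary choice):

# Best case for sigmoid: every unit sits at z = 0, where its derivative
# peaks at 0.25. Even then the gradient collapses with depth.
depth = 20
sigmoid_factor = 0.25 ** depth   # ≈ 9.1e-13: early layers barely update
relu_factor = 1.0 ** depth       # 1.0: the gradient passes through unchanged
print(sigmoid_factor, relu_factor)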

Backpropagation in Practice

Mini-Batch Gradient Descent

In practice, we don't compute gradients for one example at a time. We use mini-batches of 32-256 examples:

  1. Forward pass a batch of examples
  2. Compute the average loss
  3. Backpropagate to get averaged gradients
  4. Update weights once

This is faster (GPU parallelism) and more stable (averaged gradients are less noisy) than single-example updates.
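
In PyTorch this batching is mostly implicit: you forward a whole batch tensor, take the mean loss, and call backward() once. A hedged sketch of a single mini-batch update (the model architecture, batch size of 64, and random placeholder data are all assumptions for illustration):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()   # averages the loss over the batch by default

# One mini-batch update on random placeholder data (64 examples, 10 features)
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)

optimizer.zero_grad()                    # clear gradients from the previous step
loss = loss_fn(model(inputs), targets)   # forward pass + averaged loss
loss.backward()                          # backpropagate the averaged gradients
optimizer.step()                         # single weight update for the whole batch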

Gradient Accumulation

For very large models that don't fit in GPU memory with large batches, gradients can be accumulated across multiple smaller forward/backward passes before updating weights.
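
A common way to express this in PyTorch is to scale each micro-batch loss and delay the optimizer step, since backward() adds into the existing .grad buffers. A sketch with placeholder data (four micro-batches of 64 standing in for one batch of 256):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accumulation_steps = 4   # 4 micro-batches of 64 ≈ one effective batch of 256
optimizer.zero_grad()
for step in range(accumulation_steps):
    inputs, targets = torch.randn(64, 10), torch.randn(64, 1)    # placeholder micro-batch
    loss = loss_fn(model(inputs), targets) / accumulation_steps  # scale so gradients average
    loss.backward()      # backward() adds into .grad, so gradients accumulate
optimizer.step()         # one weight update for the whole effective batch
optimizer.zero_grad()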

Automatic Differentiation

Modern frameworks (PyTorch, TensorFlow) implement backpropagation automatically. You define the forward pass, and the framework constructs a computation graph that tracks every operation. When you call loss.backward(), it traverses this graph in reverse, computing all gradients automatically.

import torch
 
x = torch.tensor(2.0)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
 
# Forward pass
y_pred = torch.sigmoid(w * x + b)
loss = (1.0 - y_pred) ** 2
 
# Backward pass — computes all gradients automatically!
loss.backward()
 
print(w.grad)  # ∂Loss/∂w
print(b.grad)  # ∂Loss/∂b

You never implement backpropagation by hand in production — but understanding how it works helps you debug training issues, choose architectures, and interpret what's happening when training goes wrong.

Common Problems and Solutions

| Problem | Symptom | Cause | Solution |
| --- | --- | --- | --- |
| Vanishing gradients | Early layers don't learn | Sigmoid/tanh in deep networks | Use ReLU, ResNets |
| Exploding gradients | Loss becomes NaN | Large weights, bad learning rate | Gradient clipping, lower LR |
| Dead neurons | Parts of the network stop learning | ReLU neurons with all-negative inputs | Leaky ReLU, proper initialization |
| Slow convergence | Loss decreases very slowly | Learning rate too low | Increase LR, use Adam optimizer |
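
For the exploding-gradient row, gradient clipping is typically one extra line between backward() and the optimizer step. A sketch with a placeholder model and an arbitrary max norm of 1.0:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.MSELoss()(model(inputs), targets)
loss.backward()

# Rescale all gradients so their combined norm is at most 1.0, then update
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()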

See It In Action

Backpropagation is much easier to understand when you can watch gradients flow. Our interactive visualization shows:

  • The forward pass computing predictions layer by layer
  • The loss calculation at the output
  • Gradients propagating backward through each layer
  • Weights updating based on their gradients
  • The network's predictions improving over multiple iterations

Step through the process at your own pace and build intuition for how neural networks learn from their mistakes.
