Convolutional Neural Networks (CNNs) — Lesson Content
Learn how CNNs use filters, feature maps, and pooling to extract visual patterns from images — from edges to faces.
Convolutional Neural Networks revolutionized computer vision by exploiting spatial structure in images. Instead of connecting every pixel to every neuron, CNNs slide small learnable filters across the image to detect local patterns.
Using concrete examples with real convolution math, you'll see how filters detect edges, how feature maps highlight patterns, how stride and padding control output dimensions, and how pooling reduces complexity. By the end, you'll understand the complete CNN pipeline from raw pixels to classification.
Learning Objectives
- Understand why CNNs are better than dense networks for images
- Compute a convolution by hand using a filter and input matrix
- Explain how different filters detect different features
- Calculate output dimensions with stride and padding
- Describe the role of pooling in reducing spatial dimensions
- Trace data through a complete CNN pipeline from input to classification
Step 1: What Are Convolutional Neural Networks?
Fully connected (dense) networks connect **every** input to **every** neuron in the next layer. For a tiny 28×28 grayscale image, that's 784 inputs. For a 1080p color photo (1920×1080 pixels × 3 color channels), that's about **6.2 million inputs**, each connected to every neuron in the first layer. The number of parameters explodes, and the network has no built-in notion of spatial structure: it treats neighboring pixels no differently from pixels on opposite sides of the image.
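To make these numbers concrete, here is a minimal sketch that counts the inputs and the first-layer weights of a dense network. The hidden-layer width of 256 is an assumption chosen only for illustration; the text above does not fix one.

```python
# Count inputs and first-layer weights for a fully connected network.
# HIDDEN_UNITS is a hypothetical layer width, chosen only for illustration.
HIDDEN_UNITS = 256


def num_inputs(height, width, channels):
    """Number of input values in an image of the given shape."""
    return height * width * channels


def dense_weights(n_inputs, n_units):
    """Weights in one dense layer (biases excluded)."""
    return n_inputs * n_units


small_inputs = num_inputs(28, 28, 1)       # 28×28 grayscale image
photo_inputs = num_inputs(1080, 1920, 3)   # 1080p color photo

print(small_inputs, dense_weights(small_inputs, HIDDEN_UNITS))
print(photo_inputs, dense_weights(photo_inputs, HIDDEN_UNITS))
```

Even with this modest assumed layer width, the photo needs over a billion weights in its first layer.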
**Convolutional Neural Networks (CNNs)** solve this by exploiting a key insight: **nearby pixels are related**. Instead of looking at the entire image at once, CNNs scan small regions with learnable **filters** that detect local patterns like edges, corners, and textures.
Think of it like reading a book: you don't look at every letter on the page simultaneously. Your eyes scan across the text, recognizing familiar patterns (letters, words) as they go. CNNs work the same way — sliding small "windows" across the image.
CNNs revolutionized computer vision, starting with LeNet (1998) for digit recognition and continuing with AlexNet (2012), which won the ImageNet competition by a massive margin. Today they power everything from face recognition to medical imaging to self-driving cars.
Step 2: The Convolution Operation
The core of a CNN is the **convolution operation**: sliding a small matrix (called a **filter** or **kernel**) across the input image, computing a dot product at each position.
Here's how it works step by step:
1. Place the 3×3 filter on the **top-left corner** of the 5×5 input image
2. **Multiply** each filter value by the corresponding pixel value
3. **Sum** all the products to get a single output value
4. **Slide** the filter one position to the right and repeat
5. When you reach the right edge, move down one row and start from the left
For our first position (top-left), the calculation is:
(1×(-1)) + (2×0) + (0×1) + (0×(-2)) + (1×0) + (3×2) + (2×(-1)) + (3×0) + (1×1) = -1 + 6 - 2 + 1 = **4**
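The same calculation can be checked in a few lines of NumPy. The patch and filter values below are read directly from the nine products in the sum above; the variable names are just illustrative.

```python
import numpy as np

# Top-left 3×3 patch of the input, read off the worked calculation.
patch = np.array([[1, 2, 0],
                  [0, 1, 3],
                  [2, 3, 1]])

# The 3×3 filter used in the same calculation.
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

products = patch * kernel   # element-wise multiplication (step 2)
value = products.sum()      # sum of the nine products (step 3)
print(value)  # 4
```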
The filter slides across all valid positions, producing a **3×3 output** called a **feature map**. The output is smaller because the 3×3 filter can't extend past the edges (we'll fix this with padding later).
Output[i,j] = Σₘ Σₙ Input[i+m, j+n] × Filter[m, n]
For a 5×5 input with a 3×3 filter:
Output size = (5 - 3 + 1) × (5 - 3 + 1) = 3 × 3
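The formula above translates almost line for line into a double loop. The sketch below is a plain, unoptimized version for illustration. Only the top-left 3×3 corner of the 5×5 input is taken from the worked example; the remaining pixel values are made up so that the code has something to run on.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` with stride 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Output[i, j] = sum over m, n of Input[i+m, j+n] * Filter[m, n]
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Hypothetical 5×5 input: only the top-left 3×3 block matches the lesson.
image = np.array([[1, 2, 0, 1, 2],
                  [0, 1, 3, 2, 1],
                  [2, 3, 1, 0, 1],
                  [1, 0, 2, 3, 0],
                  [0, 1, 1, 2, 3]])

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)   # (3, 3)
print(feature_map[0, 0])   # 4.0, the value computed by hand
```

Deep learning libraries compute the same result with much faster vectorized routines; the loops here are only meant to mirror the formula.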
Step 3: Filters Detect Features
Different filters detect different features. The **key insight** is that the network **learns** which filters to use during training — you don't design them manually.
However, we can understand what kinds of features filters detect by looking at classic hand-crafted examples:
**Vertical edge detector** [-1,0,1; -2,0,2; -1,0,1]: Responds strongly where pixel values change from left to right (light→dark or dark→light). The negative values on the left and positive values on the right create a "difference detector."
**Horizontal edge detector** [-1,-2,-1; 0,0,0; 1,2,1]: Same idea but detects top-to-bottom changes.
**Blur filter** (a 3×3 matrix with every entry equal to 1/9): Averages each pixel with its eight neighbors, smoothing out noise and fine detail.
**Sharpen filter** [0,-1,0; -1,5,-1; 0,-1,0]: Enhances the center pixel relative to neighbors, making edges crisper.
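To see these four filters respond differently, the sketch below applies each one to a small synthetic image with a single vertical edge (dark columns on the left, light columns on the right). The test image and the helper `conv2d_valid` are illustrative assumptions, not part of the lesson's own example.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Stride-1 convolution without padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


filters = {
    "vertical_edge": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "horizontal_edge": np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float),
    "blur": np.full((3, 3), 1 / 9),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}

# Hypothetical test image: every row is [0, 0, 1, 1, 1], so the only
# structure is a vertical edge between the second and third columns.
image = np.tile(np.array([0.0, 0.0, 1.0, 1.0, 1.0]), (5, 1))

responses = {name: conv2d_valid(image, k) for name, k in filters.items()}
for name, response in responses.items():
    print(name)
    print(response)
```

On this image the vertical edge detector responds strongly near the edge, while the horizontal edge detector outputs zeros everywhere because nothing changes from top to bottom.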
In a trained CNN, early layers learn simple filters (edges, colors), middle layers combine them into textures and patterns, and deep layers detect complex features (eyes, wheels, text).
Step 4: Feature Maps: The Filter Output
When we slide the vertical edge filter across our 5×5 input, we get a **3×3 feature map** — a new "image" that highlights where vertical edges are.
Each value in the feature map tells us **how strongly** that region of the input matches the filter's pattern. Large positive values mean a strong match, large negative values mean the opposite pattern, and values near zero mean no edge was detected.
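A quick way to check this interpretation is to apply the vertical edge filter to three tiny patches: one that gets lighter to the right, its mirror image, and a flat one. The patches are invented for illustration.

```python
import numpy as np

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])

# Three hypothetical 3×3 patches (0 = dark, 1 = light).
dark_to_light = np.array([[0, 0, 1]] * 3)   # lighter on the right
light_to_dark = np.array([[1, 0, 0]] * 3)   # lighter on the left
flat = np.ones((3, 3), dtype=int)           # no edge at all


def response(patch):
    """Filter response for a single 3×3 patch."""
    return int(np.sum(patch * kernel))


print(response(dark_to_light))   # 4  (matches the filter's pattern)
print(response(light_to_dark))   # -4 (the opposite pattern)
print(response(flat))            # 0  (no edge)
```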
Looking at our results:
- The value **4** at position [0,0] means there's a moderate vertical edge in the top-left
- Higher absolute values indicate stronger edges
In a real CNN, each convolutional layer applies **multiple filters** simultaneously. If a layer has 32 filters, it produces 32 feature maps — each detecting a different pattern. These feature maps stack together as **channels**, becoming the input to the next layer.
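The stacking of feature maps into channels can be sketched with NumPy's `sliding_window_view` and `einsum`. The 32 filters below are random stand-ins for learned filters, and the 5×5 input is random as well; both are assumptions made only to show the shapes involved.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

image = rng.random((5, 5))                      # hypothetical single-channel input
filter_bank = rng.standard_normal((32, 3, 3))   # 32 random stand-in filters

# windows[i, j] is the 3×3 patch whose top-left corner is at (i, j).
windows = sliding_window_view(image, (3, 3))    # shape (3, 3, 3, 3)

# Apply every filter to every patch: one 3×3 feature map per filter.
feature_maps = np.einsum("ijmn,fmn->fij", windows, filter_bank)

print(feature_maps.shape)   # (32, 3, 3): 32 channels of 3×3 feature maps
```

The leading dimension of size 32 is the channel axis that the next layer would receive as input.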
This is how CNNs build up from simple features to complex ones: layer 1 detects edges → layer 2 combines edges into corners and curves → layer 3 combines those into eyes, wheels, or letters.
Step 5: Stride and Padding
Two important controls affect how convolution works and what size output it produces:
**Stride** controls how far the filter moves at each step:
- **Stride 1** (default): Move the filter one pixel at a time. Maximum detail, largest output.
- **Stride 2**: Move the filter two pixels at a time, skipping every other position. Each output dimension shrinks to roughly half, which acts as built-in downsampling.
**Padding** adds extra pixels around the input border:
- **No padding (valid)**: Filter stays strictly inside. Output shrinks: a 5×5 input with 3×3 filter → 3×3 output.
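- **Same padding**: Add zeros around the border so that, at stride 1, the output is the **same size** as the input. For a 3×3 filter, add 1 pixel of padding on each side.
- **Full padding**: Add enough zeros (K - 1 on each side for a K×K filter) that every filter weight passes over every input pixel. The output is larger than the input: input + K - 1 along each dimension.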
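The effect of these settings on output shape can be sketched with `np.pad` and a strided loop. The function name `conv2d`, the 5×5 test input, and the averaging kernel are illustrative choices, not taken from the lesson.

```python
import numpy as np


def conv2d(image, kernel, stride=1, padding=0):
    """Convolution with zero padding and a configurable stride."""
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride   # top-left corner of this window
            out[i, j] = np.sum(padded[r:r + kh, c:c + kw] * kernel)
    return out


image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5×5 input
kernel = np.ones((3, 3)) / 9.0                     # 3×3 averaging filter

valid = conv2d(image, kernel)                       # no padding
same = conv2d(image, kernel, padding=1)             # 1 pixel of zeros per side
full = conv2d(image, kernel, padding=2)             # K - 1 = 2 pixels per side
strided = conv2d(image, kernel, stride=2, padding=1)

print(valid.shape, same.shape, full.shape, strided.shape)
# (3, 3) (5, 5) (7, 7) (3, 3)
```

Note that the interior of the same-padded output equals the valid output; padding only adds new values around the border.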
The output dimension formula is:
**output = floor((input + 2×padding - kernel) / stride) + 1**
For example: input=32, kernel=3, padding=1, stride=1 → floor((32 + 2×1 - 3) / 1) + 1 = **32** (same size!)
Output Size = floor((W + 2P - K) / S) + 1
Where:
W = input width/height
P = padding
K = kernel size
S = stride
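The formula is easy to wrap in a small helper for checking layer shapes. The function name `conv_output_size` is hypothetical; the arithmetic is exactly the formula above, using Python's floor division for the `floor` step.

```python
def conv_output_size(w, k, p=0, s=1):
    """Output width/height for input size w, kernel k, padding p, stride s."""
    return (w + 2 * p - k) // s + 1


print(conv_output_size(32, 3, p=1, s=1))   # 32: same padding keeps the size
print(conv_output_size(5, 3))              # 3: the 5×5 input, 3×3 filter case
print(conv_output_size(32, 3, p=1, s=2))   # 16: stride 2 halves the size
print(conv_output_size(8, 3, s=2))         # 3: floor(5 / 2) + 1
```

The last case shows why the floor matters: when the stride does not divide evenly, the final partial step is simply dropped.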
Step 6: Pooling: Shrinking Feature Maps
After convolution, **pooling layers** reduce the spatial dimensions of feature maps. This serves three purposes:
1. **Reduces computation** — fewer values means fewer operations in subsequent layers
2. **Provides translation invariance** — small shifts in the input don't change the output much
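3. **Helps control overfitting**: pooling has no learnable parameters of its own, but smaller feature maps mean fewer weights in the fully connected layers that follow, and so less chance of memorizing noise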
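The translation invariance in point 2 can be illustrated with a tiny experiment using 2×2 max pooling, which keeps the largest value in each 2×2 window. The feature maps below are invented: each contains a single activation, placed at slightly different positions.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Non-overlapping 2×2 max pooling (height and width must be even)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


original = np.zeros((4, 4))
original[0, 0] = 1.0          # activation in the top-left corner

small_shift = np.zeros((4, 4))
small_shift[1, 1] = 1.0       # same activation, moved one pixel diagonally

large_shift = np.zeros((4, 4))
large_shift[2, 2] = 1.0       # moved into a different pooling window

print(np.array_equal(max_pool_2x2(original), max_pool_2x2(small_shift)))  # True
print(np.array_equal(max_pool_2x2(original), max_pool_2x2(large_shift)))  # False
```

The invariance is only approximate: a shift that stays inside one pooling window is absorbed, while a larger shift still changes the output.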
**Max Pooling** (most common): Slides a window across the feature map and keeps only the **maximum value** in each window.
Looking at our 4×4 example with 2×2 max pooling:
- Top-left 2×2 region [6,8,2,9] → **max = 9**
- Top-right 2×2 region [3,4,1,5] → **max = 5**
- Bottom-left 2×2 region [7,3,4,6] → **max = 7**
- Bottom-right 2×2 region [8,2,5,1] → **max = 8**
The 4×4 feature map is reduced to 2×2 — a **75% reduction** in data, while keeping the strongest activations.
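The example above can be reproduced with a reshape-based sketch of max pooling. The 4×4 matrix is assembled from the four regions listed, assuming each region's values are given row by row; the maxima do not depend on that ordering.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """Non-overlapping 2×2 max pooling (height and width must be even)."""
    h, w = feature_map.shape
    # Split each axis into (block index, position inside block),
    # then take the maximum inside every 2×2 block.
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


feature_map = np.array([[6, 8, 3, 4],
                        [2, 9, 1, 5],
                        [7, 3, 8, 2],
                        [4, 6, 5, 1]])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[9 5]
#  [7 8]]
```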
**Average Pooling** is similar but takes the mean instead of the max. It's less common in hidden layers but often used at the very end of a network (global average pooling).
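For comparison, here is a sketch of 2×2 average pooling and global average pooling. It uses the same 4×4 values as the max pooling example, arranged under the assumption that each region was listed row by row.

```python
import numpy as np


def avg_pool_2x2(feature_map):
    """Non-overlapping 2×2 average pooling (height and width must be even)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


feature_map = np.array([[6, 8, 3, 4],
                        [2, 9, 1, 5],
                        [7, 3, 8, 2],
                        [4, 6, 5, 1]], dtype=float)

averaged = avg_pool_2x2(feature_map)
print(averaged)
# [[6.25 3.25]
#  [5.   4.  ]]

# Global average pooling collapses the whole map to a single number.
global_average = feature_map.mean()
print(global_average)   # 4.625
```

With many channels, global average pooling produces one number per channel, which is why it is a convenient bridge to the final classification layer.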
Step 7: Putting It All Together: CNN Architecture
A complete CNN stacks these operations into a pipeline that progressively transforms raw pixels into a classification decision:
**Stage 1 — Feature Extraction** (Convolutional + Pooling layers):
Each "conv block" applies convolution → activation (ReLU) → pooling. Early blocks detect simple features (edges), deeper blocks detect complex features (objects). The spatial dimensions shrink while the number of channels grows.
**Stage 2 — Classification** (Fully Connected layers):
The final feature maps are **flattened** into a 1D vector and fed through regular dense layers, just like the neural networks you've already learned about. The output layer uses softmax for multi-class classification.
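The two stages can be traced end to end in a short NumPy forward pass. This is a sketch under stated assumptions, not the lesson's own architecture: the layer sizes (8 and 16 filters, 3×3 kernels, same padding) are hypothetical, and the weights are random rather than trained, so the output probabilities are meaningless apart from their shape.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)


def conv_layer(x, filters, bias):
    """Same-padded, stride-1 convolution.

    x: (C_in, H, W), filters: (C_out, C_in, K, K), bias: (C_out,)
    """
    k = filters.shape[-1]
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))   # zero padding
    windows = sliding_window_view(padded, (k, k), axis=(1, 2))  # (C_in, H, W, K, K)
    return np.einsum("chwmn,ocmn->ohw", windows, filters) + bias[:, None, None]


def relu(x):
    return np.maximum(x, 0)


def max_pool_2x2(x):
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()


# Hypothetical sizes and random (untrained) weights.
image = rng.random((3, 32, 32))                       # 32×32 color image
w1, b1 = rng.standard_normal((8, 3, 3, 3)) * 0.1, np.zeros(8)
w2, b2 = rng.standard_normal((16, 8, 3, 3)) * 0.1, np.zeros(16)
w_fc, b_fc = rng.standard_normal((10, 16 * 8 * 8)) * 0.01, np.zeros(10)

# Stage 1: feature extraction (conv -> ReLU -> pool, twice).
x = max_pool_2x2(relu(conv_layer(image, w1, b1)))     # (8, 16, 16)
x = max_pool_2x2(relu(conv_layer(x, w2, b2)))         # (16, 8, 8)

# Stage 2: classification (flatten -> dense -> softmax).
flat = x.reshape(-1)                                  # (1024,)
probs = softmax(w_fc @ flat + b_fc)                   # (10,)

print(x.shape, flat.shape, probs.shape)
```

The shape comments show the pattern described above: spatial size shrinks from 32 to 16 to 8 while the channel count grows from 3 to 8 to 16.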
Here's a concrete example: classifying 32×32 color images (like CIFAR-10) into 10 categories.
The example network's total parameter count is only **~15,000**. A fully connected network on the same input would need **3,072 × 256 = 786,432** weights in the first layer alone (32×32×3 = 3,072 inputs feeding 256 hidden neurons). This efficiency, which comes from reusing the same small filters at every image position, is a major reason CNNs work so well on vision tasks.
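The gap between the two layer types follows directly from how their parameters are counted. The sketch below compares the dense first layer from the text with a single convolutional layer; the choice of 32 filters of size 3×3 is an assumption for illustration and is not the lesson's example architecture.

```python
def conv_params(c_in, c_out, k):
    """Parameters in a conv layer: one k×k×c_in filter plus one bias per output channel."""
    return c_out * (c_in * k * k + 1)


def dense_params(n_in, n_out, bias=True):
    """Parameters in a dense layer: one weight per input-output pair, plus optional biases."""
    return n_in * n_out + (n_out if bias else 0)


# Dense first layer on a flattened 32×32×3 image with 256 hidden neurons
# (weights only, matching the figure quoted in the text).
dense_first = dense_params(32 * 32 * 3, 256, bias=False)

# Hypothetical conv first layer: 32 filters of size 3×3 over 3 input channels.
conv_first = conv_params(3, 32, 3)

print(dense_first)   # 786432
print(conv_first)    # 896
```

The conv layer's count does not depend on the image size at all, because the same filters are reused at every position.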
Step 8: Test Your Understanding
You've learned how CNNs transform raw pixels into meaningful features through convolution, filters, and pooling. Let's test your understanding!
Prerequisites
- Neural network basics (forward pass)
- Activation functions (ReLU)
- Backpropagation fundamentals
Key Concepts
- Convolution Operation
- Filters / Kernels
- Feature Maps
- Stride & Padding
- Max Pooling
- CNN Architecture