Hyperparameter Tuning

Finding the Best Settings for Optimal Performance

Difficulty: Intermediate
Duration: 15-18 minutes
Prerequisites: Neural network training, Overfitting

What You'll Discover

Learn how to choose the right settings for training neural networks

Learning Rate Impact

See how too-high or too-low learning rates dramatically change training outcomes.

Network Sizing

Understand the trade-off between underfitting with too few neurons and overfitting with too many.

Search Strategies

Compare grid search and random search to find optimal hyperparameter combinations.

Diagnostic Tools

Read learning curves to diagnose problems and guide your tuning decisions.

Key Concepts

Learning Rate

Controls how big of a step the network takes during weight updates

Network Architecture

The number and size of layers determine model capacity

Grid Search

Systematically try every combination from a predefined set

Random Search

Sample random combinations for better coverage with fewer trials

Learning Curves

Plot loss over epochs to diagnose underfitting and overfitting

Early Stopping

Stop training when validation loss stops improving to prevent overfitting

Step 1: What Are Hyperparameters?

When you train a neural network, it learns parameters like weights and biases automatically. But some settings must be chosen by you before training begins — these are called hyperparameters.

Think of it like baking a cake:

  • Parameters (learned): How much of each ingredient ends up in the batter — the network figures this out through training.
  • Hyperparameters (chosen by you): Oven temperature, baking time, pan size — you decide these before you start baking.

Bad hyperparameters can ruin a perfectly good model, just like the wrong oven temperature can burn a cake. The difference between a 60% accurate model and a 95% accurate model is often not the architecture — it's the hyperparameters.

Here are the most important hyperparameters you'll encounter:

  • Learning Rate — How big of a step to take during gradient descent
  • Hidden Layer Size — How many neurons in each hidden layer
  • Number of Epochs — How many times to iterate over the training data
  • Batch Size — How many samples to process before updating weights

Throughout this lesson, we'll use our familiar example: predicting whether a student passes an exam based on hours studied and hours slept.

Parameters vs Hyperparameters

| Category | Example | Who Decides? | When? |
| --- | --- | --- | --- |
| Parameter | Weight w₁ = 0.35 | The network (via backprop) | During training |
| Parameter | Bias b₁ = 0.05 | The network (via backprop) | During training |
| Hyperparameter | Learning rate = 0.5 | You (the engineer) | Before training |
| Hyperparameter | Hidden neurons = 8 | You (the engineer) | Before training |
| Hyperparameter | Epochs = 50 | You (the engineer) | Before training |
| Hyperparameter | Batch size = 32 | You (the engineer) | Before training |
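
To make the split in the table concrete, here is a minimal sketch in plain Python. It fits a one-weight model `y = w * x + b` with gradient descent. The toy data, the `train` function, and the chosen values are illustrative assumptions, not the lesson's student-exam network.

```python
# Hyperparameters: chosen by you before training starts.
LEARNING_RATE = 0.1
EPOCHS = 200

# Toy data following y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def train(learning_rate, epochs):
    # Parameters: start at arbitrary values and are adjusted by training.
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train(LEARNING_RATE, EPOCHS)
print(f"learned parameters: w={w:.3f}, b={b:.3f}")
```

Nothing in the loop ever changes `LEARNING_RATE` or `EPOCHS`; only `w` and `b` move. That is the whole distinction: training searches over parameters, and you search over hyperparameters.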

The Goldilocks Problem: Every Hyperparameter Has a Sweet Spot

| Hyperparameter | Too Low | Too High | What It Controls |
| --- | --- | --- | --- |
| Learning Rate | Learns too slowly | Diverges / explodes | Speed vs stability |
| Hidden Neurons | Can't learn patterns (underfit) | Memorizes noise (overfit) | Capacity |
| Epochs | Not enough learning | Overfitting to training data | Training duration |
| Batch Size | Noisy updates, slow | May generalize worse | Update quality |

Hyperparameter Tuning — Lesson Content

Learn how to choose the right settings for training neural networks — learning rate, network size, and search strategies that make the difference between failure and success.

Hyperparameters are the knobs you turn before training a neural network: learning rate, layer sizes, epochs, and more. Bad choices lead to models that fail to learn or memorize noise. Using a concrete example — predicting student exam results — you'll see how different hyperparameter choices dramatically change training outcomes. You'll run simulated grid search and random search experiments, learn to read learning curves for diagnosis, and walk away with practical guidelines used by ML engineers every day.

Learning Objectives

  • Distinguish between parameters and hyperparameters
  • Understand the effect of learning rate on training dynamics
  • Recognize underfitting and overfitting from network size choices
  • Compare grid search and random search strategies
  • Read learning curves to diagnose training problems
  • Apply practical guidelines for hyperparameter tuning

Step 2: Learning Rate: Too High vs Too Low vs Just Right

The **learning rate** (α) controls how big a step the network takes when updating weights during gradient descent. It's arguably the single most important hyperparameter.

Let's train the **same network** (8 hidden neurons, 50 epochs) with three different learning rates and watch what happens:

**α = 0.001 (Too Low):** The network barely moves. After 50 epochs, the loss is still **0.1330** with only **50%** accuracy. It would eventually converge, but it could take thousands of epochs. That wastes compute, and you might stop training before the network has finished learning.

**α = 5.0 (Too High):** The network overshoots wildly. Instead of descending into the valley, it bounces over it: the updates are so large that the weights oscillate or diverge completely. This is like trying to park a car at 100 mph. The final loss recorded for this run is **0.0013**, so judge this setting by how the loss behaves from epoch to epoch rather than by that one number.

**α = 0.5 (Just Right):** Smooth convergence! Loss drops to **0.1012** with **60%** accuracy. The steps are big enough to make progress but small enough not to overshoot.

The ideal learning rate depends on the problem, the architecture, and even the training data. That's why tuning it is so important.
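
The three regimes can be reproduced on a much simpler problem. The sketch below runs plain gradient descent on the one-dimensional function f(w) = w², using the same three learning rates. The function, the starting point, and the `descend` helper are assumptions chosen for illustration; the numbers it prints are not the lesson's network results.

```python
def descend(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2 and return the final loss."""
    w = start
    for _ in range(steps):
        grad = 2 * w                  # derivative of w**2
        w -= learning_rate * grad     # the weight update
    return w ** 2


for lr in (0.001, 0.5, 5.0):
    print(f"lr={lr}: final loss {descend(lr):.3e}")
```

With 0.001 the loss has barely dropped from its starting value of 1.0, with 0.5 it reaches the minimum, and with 5.0 every step multiplies the weight by -9, so the loss grows without bound. On this particular function any rate above 1.0 diverges and 0.5 happens to land on the minimum in one step; real loss surfaces are not this tidy, so those thresholds do not carry over to a neural network.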

Step 3: Network Size: How Many Neurons?

The number of neurons in hidden layers determines the network's **capacity**: how complex a pattern it can learn. But more isn't always better!

We train three networks (all with α = 0.5, 50 epochs) that differ only in the hidden layer size:

**2 Hidden Neurons (Small):** Final loss **0.1167**, accuracy **60%**. The network is too simple to capture the relationship between study hours, sleep, and exam results. This is **underfitting**: the model lacks the capacity to represent the true pattern.

**8 Hidden Neurons (Medium):** Final loss **0.1012**, accuracy **60%**. A good balance, with enough neurons to learn the pattern without going overboard.

**32 Hidden Neurons (Large):** Final loss **0.0399**, accuracy **90%**. With 32 neurons the network has plenty of capacity. On our small 10-sample dataset, this network can essentially **memorize** the training data, which means it may perform poorly on new, unseen data. This is the start of **overfitting**.

The key trade-off: **too few neurons → underfitting** (can't learn the pattern), **too many neurons → overfitting** (memorizes noise instead of learning the general pattern).
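
One rough way to see why 32 neurons can memorize 10 samples is to count the trainable parameters. The sketch below assumes a single hidden layer, two inputs (hours studied, hours slept), and one output; the lesson does not spell out the exact architecture, so treat `count_parameters` as an illustration. Parameter count is only a crude proxy for capacity.

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases for a network with one hidden layer."""
    hidden = n_inputs * n_hidden + n_hidden       # weights and biases into the hidden layer
    output = n_hidden * n_outputs + n_outputs     # weights and biases into the output layer
    return hidden + output


N_SAMPLES = 10  # dataset size quoted in the lesson

for n_hidden in (2, 8, 32):
    total = count_parameters(n_inputs=2, n_hidden=n_hidden)
    print(f"{n_hidden:>2} hidden neurons -> {total:>3} parameters "
          f"({total / N_SAMPLES:.1f} per training sample)")
```

Under these assumptions the three networks have 9, 33, and 129 parameters. The largest has roughly 13 adjustable numbers for every training example, which is plenty of room to fit the data point by point.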

Step 4: Grid Search: The Systematic Approach

Now that we know both learning rate and network size matter, how do we find the best combination? The simplest approach is **grid search**: try every possible combination from a predefined set of values.

We define a grid:

- **Learning rates:** [0.01, 0.1, 0.5, 1.0]
- **Hidden sizes:** [2, 8, 32]

That gives us **4 × 3 = 12 combinations** to try. We train each one for 20 epochs and record the results. The best combination found: **α = 0.01, 32 neurons** → **80%** accuracy.

**Pros of grid search:**

- Exhaustive: guaranteed to find the best combination in your grid
- Easy to implement and parallelize
- Reproducible results

**Cons of grid search:**

- **Exponential cost:** the number of runs multiplies with every hyperparameter you add. A third hyperparameter with 4 values quadruples the work (4 × 3 × 4 = 48 runs)
- Wastes compute on unimportant regions of the space
- Assumes evenly spaced values are a good idea (they often aren't)
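
The loop itself is short. Here is a minimal sketch using `itertools.product` over the grid defined above. The `train_and_score` function is a hypothetical stand-in that returns a made-up score peaking at a learning rate of 0.1 and 8 neurons; it exists only so the loop has something to rank and does not reproduce the lesson's results. In practice it would train the network and return validation accuracy.

```python
import itertools
import math

learning_rates = [0.01, 0.1, 0.5, 1.0]
hidden_sizes = [2, 8, 32]


def train_and_score(learning_rate, hidden_size):
    """Hypothetical stand-in for 'train the network, return validation accuracy'."""
    lr_penalty = abs(math.log10(learning_rate) - math.log10(0.1))
    size_penalty = abs(math.log2(hidden_size) - math.log2(8))
    return 1.0 / (1.0 + lr_penalty + 0.25 * size_penalty)


def grid_search(lrs, sizes):
    """Score every (learning rate, hidden size) pair in the grid."""
    results = []
    for lr, size in itertools.product(lrs, sizes):
        results.append((train_and_score(lr, size), lr, size))
    return results


results = grid_search(learning_rates, hidden_sizes)
best_score, best_lr, best_size = max(results)
print(f"{len(results)} runs; best: lr={best_lr}, hidden={best_size}")
```

The run count is the product of the list lengths, so appending a third list with 4 values to `itertools.product` would turn 12 runs into 48. Each run is independent of the others, which is why grid search parallelizes so easily.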

Step 5: Random Search: Smarter Exploration

In 2012, researchers Bergstra and Bengio showed that **random search** often beats grid search, sometimes dramatically. The reason is surprising.

With grid search, if one hyperparameter matters much more than another (which is common!), you waste many runs testing different values of the unimportant one, while the important one only ever takes the few values on the grid. With random search, every trial draws a **fresh value** of each hyperparameter, so a continuous hyperparameter such as the learning rate gets a different value in every trial.

Instead of a grid, we sample 8 random combinations:

- **Learning rate:** sampled from log-uniform [0.001, 2.0]
- **Hidden size:** sampled from {2, 4, 8, 16, 32}

The best random combination: **α = 0.0342, 16 neurons** → **90%** accuracy.

With just **8 random trials** (vs. 12 grid trials), random search explored a wider range of learning rates, trying 8 distinct values where the grid tried only 4. Even with fewer experiments, it often finds competitive or better results because it doesn't waste runs on unimportant corners of the search space.

**Why random search often wins:**

- Tests more unique values per hyperparameter
- Naturally explores more of the space
- Scales better to many hyperparameters
- Often the same or better results with fewer trials
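
Sampling "log-uniform" means drawing the logarithm of the value uniformly, so each order of magnitude is equally likely. A minimal sketch of the sampling step, using only the standard library, might look like this. The helper names and the fixed seed are assumptions, and the trials it prints are not the ones reported in the lesson.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_trials(n_trials, seed=0):
    rng = random.Random(seed)   # fixed seed so the trials are reproducible
    trials = []
    for _ in range(n_trials):
        lr = sample_log_uniform(0.001, 2.0, rng)
        hidden = rng.choice([2, 4, 8, 16, 32])
        trials.append((lr, hidden))
    return trials


trials = sample_trials(8)
for lr, hidden in trials:
    print(f"lr={lr:.4f}, hidden={hidden}")
print("distinct learning rates:", len({lr for lr, _ in trials}))
```

Each pair would then be trained and scored exactly as in grid search. Sampling the learning rate uniformly on a linear scale instead would put about half the draws above 1.0, leaving the small values almost unexplored.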

Step 6: Learning Curves: Reading the Signs

A **learning curve** plots loss (or accuracy) over training epochs. It's one of the most useful diagnostic tools for understanding what's going wrong and what to do about it. Here are three common patterns and how to read them:

**Scenario A: Underfitting (2 neurons, α=0.01).** Both training and validation loss stay high. The model is too simple to learn the pattern. **Fix:** Increase model capacity (more neurons/layers), increase learning rate, or train longer.

**Scenario B: Good Fit (8 neurons, α=0.5).** Both curves decrease together and converge. Training loss: **0.1012**. The small gap between training and validation loss is normal and healthy. This is what you're aiming for.

**Scenario C: Overfitting (32 neurons, α=1.0).** Training loss keeps dropping, but validation loss starts **increasing** after a point. The model is memorizing the training data instead of learning generalizable patterns. **Fix:** Reduce model capacity, add regularization (dropout, weight decay), get more training data, or use early stopping.

**Pro tip:** Always plot learning curves! They tell you more about your model than any single metric.
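
The three patterns can be written down as a small rule of thumb. The sketch below is one possible encoding, not a standard algorithm: the `diagnose` function, its two thresholds (`high_loss` and `rise_tolerance`), and the example curves are all made up for illustration and would need adjusting for a real loss scale.

```python
def diagnose(train_loss, val_loss, high_loss=0.1, rise_tolerance=0.01):
    """Label a pair of loss curves using two rough, hypothetical thresholds."""
    best_val = min(val_loss)
    # Validation loss ends well above its best value while training loss fell.
    if val_loss[-1] > best_val + rise_tolerance and train_loss[-1] < train_loss[0]:
        return "overfitting"
    # Both curves end high: the model never learned the pattern.
    if train_loss[-1] > high_loss and val_loss[-1] > high_loss:
        return "underfitting"
    return "good fit"


# Made-up curves, one per scenario.
curves = {
    "A": ([0.25, 0.24, 0.23, 0.23], [0.26, 0.25, 0.24, 0.24]),
    "B": ([0.25, 0.12, 0.06, 0.04], [0.26, 0.14, 0.08, 0.06]),
    "C": ([0.25, 0.10, 0.03, 0.01], [0.26, 0.12, 0.15, 0.20]),
}
for name, (train, val) in curves.items():
    print(f"Scenario {name}: {diagnose(train, val)}")
```

A rule like this is no substitute for looking at the plot, but it is handy for flagging suspicious runs when a search produces dozens of curves.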

Step 7: Practical Guidelines & Best Practices

After understanding individual hyperparameters and search strategies, here are battle-tested guidelines used by practitioners every day:

**1. Start with Known Defaults**
Don't reinvent the wheel. Use established defaults as your starting point and tune from there. The ML community has converged on good defaults through decades of experimentation.

**2. Learning Rate Schedulers**
Instead of a fixed learning rate, reduce it during training. Start aggressive (big steps) and get more precise (small steps) as you approach the optimum. Popular schedulers: cosine annealing, step decay, warm-up + decay.

**3. Early Stopping**
Monitor validation loss during training. If it hasn't improved for N epochs (called "patience"), stop training: you're starting to overfit. This is one of the simplest and most effective regularization techniques.

**4. Cross-Validation**
Don't evaluate on a single train/val split. Use k-fold cross-validation (typically k=5) to get a more reliable estimate of how well each hyperparameter setting generalizes.

**5. Tune in Order of Importance**
Learning rate almost always matters most. Tune it first, then architecture (layers/neurons), then regularization, then batch size. Don't waste time tuning batch size before you've found a good learning rate.

**6. Use Log Scale for Learning Rate**
Learning rates are best explored on a logarithmic scale: [0.0001, 0.001, 0.01, 0.1, 1.0] covers more useful ground than [0.1, 0.2, 0.3, 0.4, 0.5].
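
Of these guidelines, early stopping is the easiest to show in code. Here is a minimal sketch of the patience rule applied to a list of per-epoch validation losses. The function name, the default patience of 3, and the example losses are assumptions for illustration; a real training loop would apply the same check after each epoch and usually restore the weights from the best epoch.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once validation loss has gone `patience` epochs in a row
    without beating its best value so far. If that never happens, the index
    of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses: best value at epoch 3, then a steady rise.
val_losses = [0.30, 0.20, 0.15, 0.14, 0.16, 0.17, 0.19, 0.25]
stop = early_stopping_epoch(val_losses, patience=3)
print(f"stop at epoch {stop}, best loss {min(val_losses[:stop + 1])}")
```

A larger patience tolerates noisy validation curves at the cost of a few extra epochs; a patience of 1 stops at the first setback, which can be too early when the loss fluctuates.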

Step 8: Test Your Understanding

You've learned about hyperparameters, how they affect training, and strategies for finding the best values. Let's test your understanding!

Prerequisites

  • Neural network basics (forward pass, backpropagation)
  • Gradient descent
  • Loss functions

Key Concepts

  • Hyperparameters vs Parameters
  • Learning Rate
  • Network Capacity
  • Grid Search
  • Random Search
  • Learning Curves
  • Overfitting & Underfitting