Recurrent Neural Networks

Networks That Remember: Processing Sequences Over Time

Difficulty: Intermediate
Duration: 18-22 minutes
Prerequisites: Neural networks, Backpropagation

What You'll Discover

Understand how RNNs process sequential data with memory

Hidden State Memory

See how RNNs maintain context across time steps by passing hidden states forward.

Vanishing Gradients

Understand why vanilla RNNs struggle with long sequences and how gradients decay.

LSTM & GRU Gates

Learn how gating mechanisms control what to remember, forget, and output.

Real-World Applications

Match RNN architectures to tasks from sentiment analysis to machine translation.

Key Concepts

Sequential Data

Data where order matters: text, time series, audio, video

Hidden States

The RNN's memory that carries context between time steps

Vanishing Gradients

Gradients shrink exponentially, limiting long-range learning

LSTM

Long Short-Term Memory with forget, input, and output gates

GRU

Simplified gating with update and reset gates

Sequence-to-Sequence

Encoder-decoder architecture for translation and summarization


Why Sequential Data Needs Special Networks

Many real-world data types have a natural order that matters:

  • Text: "Dog bites man" vs. "Man bites dog" — same words, opposite meaning
  • Time series: Stock prices, weather data, sensor readings
  • Audio: Speech is a sequence of sounds over time
  • Video: A sequence of image frames

A regular feedforward network has no built-in concept of order: it expects a fixed-size input and carries nothing over from one input to the next. If you feed it the words "the cat sat" as three separate inputs, it has no record of which came first, second, or third.

Recurrent Neural Networks (RNNs) solve this by processing inputs one at a time in order, maintaining a hidden state that carries information from previous steps. Think of it like reading a sentence: your understanding of each word is shaped by the words that came before it.

The key idea: instead of processing the entire sequence at once, process it step by step, building up context as you go.

Sequential Data in the Real World

| Data Type | Example | Why Order Matters | Typical Task |
| --- | --- | --- | --- |
| Text | "I love this movie" | Word order determines meaning | Sentiment analysis |
| Time Series | Stock: 100, 105, 103, 110 | Trends depend on ordering | Price prediction |
| Audio | Speech waveform | Sounds must be in sequence | Speech recognition |
| DNA | ATCGATCG... | Gene sequences encode proteins | Protein structure prediction |
| Music | Notes over time | Melody = notes in order | Music generation |
| Video | Frame 1, Frame 2, ... | Actions unfold over time | Activity recognition |

Feedforward vs Recurrent Networks

| Aspect | Feedforward Network | Recurrent Network |
| --- | --- | --- |
| Input processing | All at once (parallel) | One step at a time (sequential) |
| Memory | None: each input is independent | Hidden state carries context forward |
| Word order | "cat sat the" = "the cat sat" | "the cat sat" ≠ "cat sat the" |
| Variable length | Fixed input size only | Handles any sequence length |
| Best for | Images, tabular data | Text, time series, audio |
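
The word-order row can be checked with a small sketch. It assumes the feedforward model receives a bag-of-words sum of one-hot vectors (an assumption made here for illustration; a feedforward network fed position-specific inputs would behave differently), and it compares that with a toy scalar recurrence. The vocabulary, the function names, and the weights are all hypothetical.

```python
import math

VOCAB = {"the": 0, "cat": 1, "sat": 2}  # hypothetical 3-word vocabulary

def one_hot(word):
    vec = [0.0] * len(VOCAB)
    vec[VOCAB[word]] = 1.0
    return vec

def bag_of_words(words):
    """Sum of one-hot vectors: the order of the words is lost."""
    total = [0.0] * len(VOCAB)
    for w in words:
        total = [t + x for t, x in zip(total, one_hot(w))]
    return total

def recurrent_summary(words, w_in=(0.5, -1.0, 2.0), w_rec=0.8):
    """Toy scalar recurrence h = tanh(w_in[word] + w_rec * h).
    The weights are illustrative, not trained."""
    h = 0.0
    for w in words:
        h = math.tanh(w_in[VOCAB[w]] + w_rec * h)
    return h

a = ["the", "cat", "sat"]
b = ["cat", "sat", "the"]
print(bag_of_words(a) == bag_of_words(b))            # True: same bag of words
print(recurrent_summary(a) == recurrent_summary(b))  # False: order changes the state
```

The recurrence gives different summaries for the two orderings because each update depends on the state left behind by the earlier words.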

Recurrent Neural Networks (RNNs) — Lesson Content

Learn how RNNs process sequential data by maintaining hidden states across time steps — from vanilla RNNs to LSTM and GRU.

Recurrent Neural Networks process data one step at a time while maintaining a hidden state — a form of memory that carries context forward through the sequence. This makes them naturally suited for text, time series, audio, and any data where order matters. Using a concrete example of processing the sentence "the cat sat," you'll watch hidden states evolve at each time step, understand why vanilla RNNs struggle with long sequences (vanishing gradients), and see how LSTM and GRU gates solve this with learnable memory control.

Learning Objectives

  • Explain why feedforward networks fail on sequential data
  • Trace hidden state evolution through an RNN forward pass
  • Understand the vanishing gradient problem and its impact
  • Describe how LSTM gates control information flow
  • Compare GRU and LSTM architectures
  • Match RNN architectures to appropriate tasks

Step 1: Why Sequential Data Needs Special Networks

Many real-world data types have a natural **order** that matters:

- **Text:** "Dog bites man" vs. "Man bites dog": same words, opposite meaning
- **Time series:** Stock prices, weather data, sensor readings
- **Audio:** Speech is a sequence of sounds over time
- **Video:** A sequence of image frames

A regular feedforward network has **no built-in concept of order**: it expects a fixed-size input and carries nothing over from one input to the next. If you feed it the words "the cat sat" as three separate inputs, it has no record of which came first, second, or third.

**Recurrent Neural Networks (RNNs)** solve this by processing inputs **one at a time in order**, maintaining a **hidden state** that carries information from previous steps. Think of it like reading a sentence: your understanding of each word is shaped by the words that came before it.

The key idea: instead of processing the entire sequence at once, process it step by step, building up context as you go.

Step 2: The Hidden State: Memory Across Time

The **hidden state** is the RNN's memory. At each time step, the network takes two inputs:

1. The **current input** (e.g., the current word)
2. The **previous hidden state** (a summary of everything seen so far)

It combines them to produce a **new hidden state**, which is then passed to the next time step.

The diagram shows the RNN "unrolled" across 3 time steps for the sequence "the cat sat." Each column is the same network at a different time step. The horizontal arrows between hidden layers represent the **recurrent connection**: the hidden state flowing forward in time.

- At **t=0** ("the"): The previous hidden state is all zeros, so the new hidden state reflects only "the."
- At **t=1** ("cat"): The hidden state now encodes "the cat." It combines the new input with the memory of "the."
- At **t=2** ("sat"): The hidden state encodes "the cat sat," the full context accumulated across all time steps.

This is why RNNs can handle variable-length sequences: they process one element at a time, building up their understanding incrementally.
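
The unrolled picture described above amounts to a loop that threads one state through every step. The sketch below is a minimal illustration in plain Python. `run_rnn` and `toy_step` are hypothetical names, the 2-unit update stands in for a trained cell, and the scalar encodings of the three words are invented.

```python
import math

def run_rnn(inputs, step, hidden_size):
    """Unroll a recurrent network: apply the same `step` at every time step."""
    h = [0.0] * hidden_size      # hidden state starts at zeros
    states = []
    for x in inputs:             # t = 0, 1, 2, ...
        h = step(x, h)           # new state depends on the input AND the old state
        states.append(h)
    return states

def toy_step(x, h_prev):
    """Hypothetical 2-unit update (illustrative weights only)."""
    return [math.tanh(x + 0.5 * h_prev[1]),
            math.tanh(0.9 * h_prev[0] - x)]

# Hypothetical scalar encodings for "the", "cat", "sat"
states = run_rnn([0.1, 0.7, -0.4], toy_step, hidden_size=2)
for t, h in enumerate(states):
    print(t, [round(v, 3) for v in h])
```

Because the loop never looks at the sequence length, the same function handles three inputs or three hundred.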

Step 3: RNN Computation Step-by-Step

At each time step, the RNN computes the new hidden state using this formula:

**h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)**

Let's break it down for our first time step (t=0, input = "the"):

1. **W_xh · x_t**: Multiply the input-to-hidden weight matrix (4×3) by the input vector (3×1). This transforms the input word into "hidden space."
2. **W_hh · h_{t-1}**: Multiply the hidden-to-hidden weight matrix (4×4) by the previous hidden state. At t=0, h_{-1} = [0,0,0,0], so this contributes nothing.
3. **Add bias b_h**: A learnable offset added to each hidden unit.
4. **Apply tanh**: Squashes each value into the range (-1, 1). This keeps the hidden state bounded across many time steps.

The **same weights** (W_xh, W_hh, b_h) are used at every time step. This is called **weight sharing**. The network learns one set of rules for processing any sequence position, which keeps the parameter count independent of sequence length and lets the network generalize to sequences of any length.
h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)

Where:
  x_t     = input at time t (3-dimensional one-hot vector)
  h_{t-1} = previous hidden state (4-dimensional vector)
  W_xh    = input-to-hidden weights (4×3 matrix)
  W_hh    = hidden-to-hidden weights (4×4 matrix)
  b_h     = hidden bias (4-dimensional vector)
  tanh    = activation function (squashes values into (-1, 1))
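
To make the formula concrete, here is a minimal sketch that applies it to "the cat sat" with the shapes listed above (3-dimensional one-hot inputs, 4 hidden units). The weight values are invented for illustration and are not the lesson's. `matvec` and `rnn_step` are hypothetical helper names.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """h_t = tanh(W_xh · x_t + W_hh · h_{t-1} + b_h)"""
    from_input = matvec(W_xh, x_t)
    from_memory = matvec(W_hh, h_prev)
    return [math.tanh(a + b + c)
            for a, b, c in zip(from_input, from_memory, b_h)]

# Illustrative weights (assumed): 4 hidden units, 3-word vocabulary
W_xh = [[0.5, -0.3, 0.8],
        [0.1, 0.9, -0.4],
        [-0.7, 0.2, 0.6],
        [0.3, -0.5, 0.1]]
W_hh = [[0.2, -0.1, 0.0, 0.4],
        [0.0, 0.3, -0.2, 0.1],
        [0.5, 0.0, 0.1, -0.3],
        [-0.1, 0.2, 0.4, 0.0]]
b_h = [0.0, 0.1, -0.1, 0.0]

sentence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # one-hot "the", "cat", "sat"
h = [0.0, 0.0, 0.0, 0.0]
for t, x_t in enumerate(sentence):
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)  # same weights at every step
    print(t, [round(v, 4) for v in h])
```

At t=0 the `W_hh` term is zero, so the first hidden state is simply tanh of the first column of `W_xh` plus the bias. From t=1 onward both terms contribute.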

Step 4: The Vanishing Gradient Problem

RNNs have a critical flaw: during backpropagation, gradients must flow **backward through time**, from the loss at the end of the sequence all the way back to the earliest time steps.

At each step, the gradient is multiplied by a factor that depends on the hidden-to-hidden weight matrix and the derivative of tanh. If that factor is consistently smaller than 1, the gradient **shrinks exponentially**: it "vanishes." After 8 time steps with a factor of 0.25, the gradient is only 0.25^8 ≈ **0.000015** of its original value. The network effectively **cannot learn** from early parts of the sequence.

This means a vanilla RNN processing the sentence "The movie that I watched with my friends last weekend at the new theater downtown was **amazing**" struggles to connect "amazing" back to "movie," because the gradient has all but vanished before reaching it.

The reverse problem also exists: if the factor is larger than 1, gradients **explode**, growing exponentially until training becomes unstable. This is usually handled by gradient clipping (capping the gradient magnitude).

This fundamental limitation motivated the development of **LSTM** and **GRU** cells, which add gating mechanisms to control gradient flow.
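
The arithmetic behind the 0.25 example, and one common form of gradient clipping, can be sketched as follows. It treats the per-step factor as a single scalar, which is a simplification of the matrix case, and `gradient_scale` and `clip_by_norm` are hypothetical helper names rather than library functions.

```python
import math

def gradient_scale(factor, steps):
    """Relative gradient size after multiplying by `factor` at each step."""
    return factor ** steps

def clip_by_norm(grad, max_norm):
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

print(gradient_scale(0.25, 8))   # about 0.000015: vanishing
print(gradient_scale(1.5, 8))    # about 25.6: exploding

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)
print(clipped)                   # same direction, L2 norm capped at 1.0
```

Clipping only limits the size of an exploding gradient. It does nothing to restore a gradient that has already vanished, which is why gated cells were needed.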

Step 5: LSTM: Learning What to Remember

**Long Short-Term Memory (LSTM)** networks address the vanishing gradient problem with a clever idea: in addition to the hidden state, maintain a **cell state** (long-term memory) controlled by three **gates**.

Think of the cell state as a conveyor belt running through time. Gates control what gets added to or removed from the belt:

**Forget Gate (f_t):** "What should I forget from the old cell state?"
- Outputs values between 0 (completely forget) and 1 (completely keep)
- Example: When processing a new sentence, forget the subject of the previous sentence

**Input Gate (i_t):** "What new information should I store?"
- Decides which parts of the new candidate values to add to the cell state
- Example: Store the new subject when encountering it

**Output Gate (o_t):** "What part of the cell state should I output as the hidden state?"
- Controls what the current hidden state contains
- Example: Output verb-related features when predicting the next word after a subject

The key insight: the cell state can flow through time **mostly unchanged** (forget gate ≈ 1), allowing gradients to propagate across many time steps without vanishing.

The lesson's interactive example runs an LSTM over the normalized stock prices [0.0, 0.5, 0.3, 1.0].
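
The gate descriptions above can be traced with a single-unit LSTM written in plain Python. This is a sketch under stated assumptions: it uses the standard LSTM update (cell state = forget gate × old cell + input gate × candidate), every quantity is a scalar, and the parameters are hand-picked for illustration rather than trained. The name `lstm_step` and the `params` layout are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a single-unit LSTM. `p` maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b
    f_t = sigmoid(pre("forget"))           # how much old cell state to keep
    i_t = sigmoid(pre("input"))            # how much of the candidate to store
    o_t = sigmoid(pre("output"))           # how much of the cell to expose
    c_tilde = math.tanh(pre("candidate"))  # proposed new content
    c_t = f_t * c_prev + i_t * c_tilde     # update the "conveyor belt"
    h_t = o_t * math.tanh(c_t)             # hidden state read from the cell
    return h_t, c_t, (f_t, i_t, o_t)

# Hypothetical parameters, chosen by hand
params = {
    "forget": (0.5, 0.1, 2.0),   # positive bias keeps the forget gate near 1
    "input": (1.0, 0.2, 0.0),
    "output": (0.8, 0.3, 0.5),
    "candidate": (1.5, 0.4, 0.0),
}

prices = [0.0, 0.5, 0.3, 1.0]    # normalized stock prices from the lesson
h, c = 0.0, 0.0
for t, x in enumerate(prices):
    h, c, (f, i, o) = lstm_step(x, h, c, params)
    print(f"t={t} x={x:.1f} f={f:.3f} i={i:.3f} o={o:.3f} c={c:.3f} h={h:.3f}")
```

With the forget gate close to 1, most of the previous cell state is carried into the next step, which is the behavior that lets gradients survive across long spans.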

Step 6: GRU: A Simpler Alternative

The **Gated Recurrent Unit (GRU)**, introduced by Cho et al. in 2014, simplifies the LSTM by combining the forget and input gates into a single **update gate**, and merging the cell state and hidden state into one.

**Update Gate (z_t):** Decides how much of the old hidden state to keep vs. how much to replace with new content. When z ≈ 1, the GRU keeps the old state (like LSTM's forget gate ≈ 1). When z ≈ 0, it fully updates to new content.

**Reset Gate (r_t):** Decides how much of the previous hidden state to use when computing the candidate new state. When r ≈ 0, the GRU acts as if it's reading the input for the first time (ignoring history).

The GRU achieves **comparable performance** to LSTM on many tasks while being **faster to train** due to fewer parameters. In practice:

- Use **LSTM** when you need maximum expressiveness or have plenty of compute
- Use **GRU** when you want faster training or have a smaller dataset
- Both have been largely superseded by **Transformers** for most NLP tasks, but remain relevant for time series and streaming applications
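
A single-unit GRU step makes the two gates easy to inspect. The sketch below follows the convention used in this lesson, where z ≈ 1 keeps the old state (some references swap the roles of z and 1 - z). All parameters are invented, and `gru_step` and `gate_param_count` are hypothetical names. The parameter count assumes one bias vector per weight block, which some libraries do differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of a single-unit GRU; `p` maps name -> (w_x, w_h, bias)."""
    wz_x, wz_h, bz = p["update"]
    wr_x, wr_h, br = p["reset"]
    wc_x, wc_h, bc = p["candidate"]
    z_t = sigmoid(wz_x * x_t + wz_h * h_prev + bz)   # update gate
    r_t = sigmoid(wr_x * x_t + wr_h * h_prev + br)   # reset gate
    h_tilde = math.tanh(wc_x * x_t + wc_h * (r_t * h_prev) + bc)
    # Lesson's convention: z near 1 keeps the old state
    return z_t * h_prev + (1.0 - z_t) * h_tilde

def gate_param_count(input_size, hidden_size, n_blocks):
    """Parameters for n_blocks weight blocks, each with W_x, W_h and one bias."""
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return n_blocks * per_block

keep = {"update": (0.0, 0.0, 10.0),      # z close to 1
        "reset": (0.0, 0.0, 0.0),        # r = 0.5
        "candidate": (1.0, 1.0, 0.0)}
overwrite = dict(keep, update=(0.0, 0.0, -10.0))  # z close to 0

print(gru_step(1.0, 0.5, keep))        # stays close to the old state 0.5
print(gru_step(1.0, 0.5, overwrite))   # close to the candidate tanh(1.25)

print(gate_param_count(3, 4, n_blocks=4))  # LSTM: 3 gates + candidate = 128
print(gate_param_count(3, 4, n_blocks=3))  # GRU: 2 gates + candidate = 96
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, which is where the training-speed advantage comes from.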

Step 7: RNN Applications: From Text to Music

RNNs come in several architectures depending on the input/output structure of the task:

**Many-to-One** (sequence → single output): The RNN processes the entire sequence and produces one output at the end. Used for sentiment analysis ("This movie was great" → positive), spam detection, and music genre classification.

**One-to-Many** (single input → sequence): Given a single input, the RNN generates a sequence. Used for image captioning (image → "A dog playing in the park"), music generation, and text generation.

**Many-to-Many** (sequence → sequence): The RNN produces an output at each time step. Two variants:

- **Same length**: Part-of-speech tagging (word → tag for each word)
- **Different length**: Machine translation via encoder-decoder ("the cat sat" → "le chat s'est assis")

The lesson's token-flow view shows how an RNN processes a sentiment analysis input, "The food was absolutely delicious": the highlighted words carry the strongest sentiment signal, and the hidden state at the end encodes the overall positive sentiment.
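
The three input/output patterns differ mainly in which hidden states are read out. The sketch below uses a toy scalar RNN to show that. The weights, the per-token scores for the example sentence, and all function names are hypothetical, and the encoder-decoder variant is not covered.

```python
import math

def rnn_states(inputs, w_x=1.0, w_h=0.5):
    """Scalar toy RNN; returns the hidden state at every time step."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

def many_to_one(inputs):
    """Sequence -> single label, read from the final hidden state."""
    return "positive" if rnn_states(inputs)[-1] > 0 else "negative"

def many_to_many(inputs):
    """Sequence -> one output per time step (same length as the input)."""
    return [round(h, 2) for h in rnn_states(inputs)]

def one_to_many(seed, length, w_x=1.0, w_h=0.5):
    """Single input -> sequence: each output is fed back as the next input."""
    outputs, h, x = [], 0.0, seed
    for _ in range(length):
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
        x = h
    return outputs

# Hypothetical per-token scores for "The food was absolutely delicious"
scores = [0.0, 0.1, 0.0, 0.6, 0.9]
print(many_to_one(scores))                 # positive
print(many_to_many(scores))                # five values, one per token
print(len(one_to_many(0.5, length=4)))     # 4
```

In a real model the raw hidden states would pass through an output layer (for example a softmax over labels or vocabulary) before being read as predictions.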

Step 8: Test Your Understanding

You've learned how RNNs process sequential data with hidden states, the vanishing gradient problem, and how LSTM and GRU gates solve it. Let's test your understanding!

Prerequisites

  • Neural network basics (forward pass, backpropagation)
  • Activation functions (tanh, sigmoid)
  • Matrix multiplication

Key Concepts

  • Sequential Data
  • Hidden States
  • RNN Computation
  • Vanishing Gradients
  • LSTM Gates
  • GRU Gates