Regular neural networks have no memory. They process each input independently, with no awareness of what came before. That's fine for classifying images, but useless for understanding language, predicting stock prices, or generating music — tasks where context matters.
Recurrent Neural Networks (RNNs) were designed to fix this by maintaining a hidden state — a form of memory. But vanilla RNNs have a fatal flaw: they can't remember things for very long. Enter Long Short-Term Memory networks.
The Problem: Vanishing Gradients
Imagine reading this sentence: "The author, who grew up in France and studied at the Sorbonne before moving to London where she worked for a decade, speaks fluent ___."
You need to remember "France" from the beginning to predict "French" at the end. That's a long-range dependency — the relevant information is far from where it's needed.
Vanilla RNNs struggle with this because during training, gradients (the learning signals) must flow backward through every time step. At each step, the gradient is multiplied by the recurrent weight matrix and by the derivative of the activation. If that per-step factor is less than 1 (say, 0.25), the gradient shrinks exponentially:
- After 5 steps: gradient × 0.25⁵ ≈ 0.001
- After 10 steps: gradient × 0.25¹⁰ ≈ 0.000001
- After 20 steps: effectively zero
The network simply cannot learn from information that's more than a few steps back. The gradient vanishes before it arrives.
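You can reproduce these numbers with a few lines of Python. The fixed 0.25 factor below is a stand-in for the per-step multiplier (recurrent weight times activation derivative), which in a real network varies from step to step:

```python
# Illustrative only: assume each backward step through time scales
# the gradient by a constant factor of 0.25.
factor = 0.25
gradient = 1.0

for step in (5, 10, 20):
    print(f"after {step:2d} steps: {gradient * factor ** step:.0e}")
# after  5 steps: 1e-03
# after 10 steps: 1e-06
# after 20 steps: 9e-13
```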
The LSTM Solution: Gated Memory
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997, solve this with an elegant mechanism: instead of one hidden state, they maintain two states:
- Hidden state (h): The short-term working memory — what the network is currently "thinking about"
- Cell state (c): The long-term memory — a highway that information can flow along with minimal interference
The cell state is the key innovation. Information can travel along it for many time steps with the gradient barely diminishing — because the operations on it are carefully controlled by gates.
The Three Gates
An LSTM has three gates, each a neural network layer with sigmoid activation (outputting values between 0 and 1):
1. Forget Gate: "What should I erase?"
The forget gate looks at the current input and previous hidden state, then decides which parts of the cell state to keep and which to erase.
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
- Output of 1.0 = keep this information completely
- Output of 0.0 = erase this information completely
Analogy: You're reading a book and a new chapter starts with a different character. The forget gate says "forget the previous character's details, we're following someone new now."
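Here is a minimal NumPy sketch of that equation. The dimensions, random parameters, and inputs are all placeholders for illustration, not trained values (the sketches for the other two gates continue from this one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # learned in practice
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)  # h_{t-1}: previous hidden state
x_t = rng.normal(size=input_size)      # x_t: current input

# f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry in (0, 1): near 1 = keep, near 0 = erase
```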
2. Input Gate: "What new information should I store?"
The input gate has two parts:
- A sigmoid layer decides which values to update
- A tanh layer creates candidate values to add
i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
The new cell state combines forgetting and remembering:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
Analogy: You encounter an important plot point. The input gate says "this is important, store it in long-term memory" and writes it into the cell state.
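Continuing the sketch above (same placeholder setup), the input gate, the candidate values, and the cell-state update look like this:

```python
# Continues the forget-gate sketch: reuses sigmoid, rng, h_prev, x_t, f_t.
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

hx = np.concatenate([h_prev, x_t])
i_t = sigmoid(W_i @ hx + b_i)       # which slots to update
c_tilde = np.tanh(W_c @ hx + b_c)   # candidate values, in (-1, 1)

c_prev = np.zeros(hidden_size)      # c_{t-1}: previous cell state
c_t = f_t * c_prev + i_t * c_tilde  # forget some old memory, write some new
```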
3. Output Gate: "What should I reveal right now?"
The output gate decides which parts of the cell state to expose as the hidden state (the network's current output).
o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
Analogy: You know many facts about the story, but right now someone asks "what just happened?" The output gate selects the relevant recent events to share, not everything you remember.
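The final piece of the sketch computes the output gate and the new hidden state:

```python
# Continues the sketch: reuses sigmoid, rng, hx, c_t.
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ hx + b_o)  # which parts of memory to reveal
h_t = o_t * np.tanh(c_t)       # h_t: a filtered view of the cell state
```

Together, f_t, i_t, c̃_t, o_t, c_t, and h_t make up one complete LSTM time step; a real implementation loops this over the sequence, feeding h_t and c_t back in as h_{t-1} and c_{t-1}.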
Why This Solves Vanishing Gradients
The cell state update is:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through almost unchanged: c_t ≈ c_{t-1}. This means the gradient can flow backward through time along the cell state with minimal decay.
Compare this to vanilla RNNs, where at every step the gradient passes through a tanh activation (derivative at most 1.0, and typically much less) and a multiplication by the recurrent weight matrix. The LSTM's cell state creates a gradient highway: a path where information and gradients can travel long distances.
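To make this precise: along the direct cell-to-cell path, differentiating the update rule gives
∂c_t/∂c_{t-1} = f_t
(elementwise, treating the gates' dependence on h_{t-1} as a separate path). Over k steps, the gradient along this path is scaled by f_t ⊙ f_{t-1} ⊙ … ⊙ f_{t-k+1}, and since the network can learn to hold its forget gates near 1, that product need not shrink.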
LSTM in Action: Predicting Text
Let's trace an LSTM processing "The cat sat on the ___":
| Time Step | Input | Forget Gate | Input Gate | Cell State | Output |
|---|---|---|---|---|---|
| t=1 | "The" | Nothing to forget (new sequence) | Store: article detected | [article context] | Low confidence |
| t=2 | "cat" | Keep article info | Store: subject is "cat" | [article, cat subject] | "cat" features |
| t=3 | "sat" | Keep subject | Store: action is sitting | [cat, sitting] | "sat" features |
| t=4 | "on" | Keep subject+action | Store: preposition follows | [cat, sitting, prep] | Expecting location |
| t=5 | "the" | Keep all context | Store: another article | [cat, sitting, prep, article] | High confidence: noun next |
At t=5, the LSTM "knows" it needs a location noun because it remembers the subject (cat), the action (sat), and the preposition (on). This long-range context is exactly what vanilla RNNs lose.
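In practice you rarely hand-roll the gate equations. Here is a minimal PyTorch sketch of the same prediction setup; the toy vocabulary and layer sizes are made up for illustration, and the model is untrained, so its logits are meaningless until fitted. It shows the data flow, not a working predictor:

```python
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]  # toy vocabulary
vocab_size, embed_dim, hidden_size = len(vocab), 16, 32

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
to_logits = nn.Linear(hidden_size, vocab_size)

# "the cat sat on the" as token ids (batch of 1)
tokens = torch.tensor([[1, 2, 3, 4, 1]])
outputs, (h_n, c_n) = lstm(embed(tokens))  # outputs: one h_t per time step

next_word_logits = to_logits(outputs[:, -1])  # predict from the last h_t
print(next_word_logits.shape)  # torch.Size([1, 6])
```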
GRU: The Simpler Alternative
The Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, simplifies the LSTM by combining the forget and input gates into a single update gate, and merging the cell state and hidden state:
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| States | 2 (hidden + cell) | 1 (hidden only) |
| Parameters | More | ~25% fewer |
| Training speed | Slower | Faster |
| Long sequences | Slightly better | Comparable |
GRUs perform comparably to LSTMs on most tasks while being faster to train. Use LSTMs when you need maximum memory capacity; use GRUs when training speed matters.
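For comparison, here is a whole GRU step in the same NumPy style (placeholder parameters, biases omitted for brevity). Note that papers and libraries differ on whether z_t near 1 keeps or replaces the old state; this sketch follows the keep convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)
W_z, W_r, W_h = (rng.normal(size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

def gru_step(h_prev, x_t):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)  # update gate: plays the role of forget + input
    r_t = sigmoid(W_r @ hx)  # reset gate: how much history feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return z_t * h_prev + (1 - z_t) * h_tilde  # one merged state, no cell state

h_t = gru_step(np.zeros(hidden_size), rng.normal(size=input_size))
```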
Real-World Applications
LSTMs power (or powered) many of the AI systems you use daily:
- Machine translation: Google Translate used LSTM-based seq2seq models before switching to Transformers
- Speech recognition: Siri, Alexa, and Google Assistant used LSTM layers for converting speech to text
- Text generation: Early language models were LSTM-based
- Music composition: LSTMs can learn musical patterns and generate new compositions
- Anomaly detection: Monitoring server logs, financial transactions, or sensor data for unusual patterns
- Time series forecasting: Stock prices, weather, energy demand
LSTMs vs Transformers
Since 2017, Transformers have largely replaced LSTMs for natural language processing. But LSTMs still have advantages:
LSTMs win when:
- Processing streaming data in real-time (one element at a time)
- Memory is limited (Transformers need O(n²) memory for attention)
- The sequence is very long and you don't need global attention
- You're working with time series or sensor data
Transformers win when:
- You need to capture relationships between distant tokens
- Parallel training is important (LSTMs are inherently sequential)
- You have enough data and compute
- The task is NLP (Transformers dominate virtually every language benchmark)
Related Articles
- What is a Neural Network? — Understand the building blocks that LSTMs extend
- Backpropagation Explained — How gradients flow backward through a network
- Activation Functions Compared — Why sigmoid and tanh are used in LSTM gates
- Word2Vec Explained — How words become vectors that LSTMs can process
See It In Action
Understanding LSTMs requires seeing how gates open and close as data flows through. Our interactive RNN visualization lets you:
- Watch hidden states evolve as the network processes a sequence
- See how the gates control information flow at each time step
- Compare vanilla RNN hidden states with LSTM cell states
- Understand why gradients vanish in regular RNNs but not in LSTMs
Step through the process and build intuition for how gated memory works.