Regular neural networks have no memory. They process each input independently, with no awareness of what came before. That's fine for classifying images, but useless for understanding language, predicting stock prices, or generating music — tasks where context matters.
Recurrent Neural Networks (RNNs) were designed to fix this by maintaining a hidden state — a form of memory. But vanilla RNNs have a fatal flaw: they can't remember things for very long. Enter Long Short-Term Memory networks.
The Problem: Vanishing Gradients
Imagine reading this sentence: "The author, who grew up in France and studied at the Sorbonne before moving to London where she worked for a decade, speaks fluent ___."
You need to remember "France" from the beginning to predict "French" at the end. That's a long-range dependency — the relevant information is far from where it's needed.
Vanilla RNNs struggle with this because during training, gradients (the learning signals) must flow backward through every time step. At each step, the gradient is multiplied by the recurrent weight matrix and by the derivative of the activation. If that per-step factor is less than 1 (say, 0.25), the gradient shrinks exponentially:
- After 5 steps: gradient × 0.25⁵ ≈ 0.001
- After 10 steps: gradient × 0.25¹⁰ ≈ 0.000001
- After 20 steps: effectively zero
The network simply cannot learn from information that's more than a few steps back. The gradient vanishes before it arrives.
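You can reproduce these numbers with a few lines of Python. The fixed 0.25 factor below is a stand-in for the per-step multiplier (recurrent weight times activation derivative), which in a real network varies from step to step:

```python
# Illustrative only: assume each backward step through time scales
# the gradient by a constant factor of 0.25.
factor = 0.25
gradient = 1.0

for step in (5, 10, 20):
    print(f"after {step:2d} steps: {gradient * factor ** step:.0e}")
# after  5 steps: 1e-03
# after 10 steps: 1e-06
# after 20 steps: 9e-13
```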
The LSTM Solution: Gated Memory
LSTM networks, introduced by Hochreiter and Schmidhuber in 1997, solve this with an elegant mechanism: instead of one hidden state, they maintain two states:
- Hidden state (h): The short-term working memory — what the network is currently "thinking about"
- Cell state (c): The long-term memory — a highway that information can flow along with minimal interference
The cell state is the key innovation. Information can travel along it for many time steps with the gradient barely diminishing — because the operations on it are carefully controlled by gates.
The Three Gates
An LSTM has three gates, each a neural network layer with sigmoid activation (outputting values between 0 and 1):
1. Forget Gate: "What should I erase?"
The forget gate looks at the current input and previous hidden state, then decides which parts of the cell state to keep and which to erase.
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
- Output of 1.0 = keep this information completely
- Output of 0.0 = erase this information completely
Analogy: You're reading a book and a new chapter starts with a different character. The forget gate says "forget the previous character's details, we're following someone new now."
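Here is a minimal NumPy sketch of that equation. The dimensions, random parameters, and inputs are all placeholders for illustration, not trained values (the sketches for the other two gates continue from this one):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)

W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # learned in practice
b_f = np.zeros(hidden_size)

h_prev = rng.normal(size=hidden_size)  # h_{t-1}: previous hidden state
x_t = rng.normal(size=input_size)      # x_t: current input

# f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry in (0, 1): near 1 = keep, near 0 = erase
```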
2. Input Gate: "What new information should I store?"
The input gate has two parts:
- A sigmoid layer decides which values to update
- A tanh layer creates candidate values to add
i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
The new cell state combines forgetting and remembering:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
Analogy: You encounter an important plot point. The input gate says "this is important, store it in long-term memory" and writes it into the cell state.
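Continuing the sketch above (same placeholder setup), the input gate, the candidate values, and the cell-state update look like this:

```python
# Continues the forget-gate sketch: reuses sigmoid, rng, h_prev, x_t, f_t.
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

hx = np.concatenate([h_prev, x_t])
i_t = sigmoid(W_i @ hx + b_i)       # which slots to update
c_tilde = np.tanh(W_c @ hx + b_c)   # candidate values, in (-1, 1)

c_prev = np.zeros(hidden_size)      # c_{t-1}: previous cell state
c_t = f_t * c_prev + i_t * c_tilde  # forget some old memory, write some new
```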
3. Output Gate: "What should I reveal right now?"
The output gate decides which parts of the cell state to expose as the hidden state (the network's current output).
o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
Analogy: You know many facts about the story, but right now someone asks "what just happened?" The output gate selects the relevant recent events to share, not everything you remember.
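The final piece of the sketch computes the output gate and the new hidden state:

```python
# Continues the sketch: reuses sigmoid, rng, hx, c_t.
W_o = rng.normal(size=(hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t = sigmoid(W_o @ hx + b_o)  # which parts of memory to reveal
h_t = o_t * np.tanh(c_t)       # h_t: a filtered view of the cell state
```

Together, f_t, i_t, c̃_t, o_t, c_t, and h_t make up one complete LSTM time step; a real implementation loops this over the sequence, feeding h_t and c_t back in as h_{t-1} and c_{t-1}.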
Why This Solves Vanishing Gradients
The cell state update is:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
When the forget gate is close to 1 and the input gate is close to 0, the cell state passes through almost unchanged: c_t ≈ c_{t-1}. This means the gradient can flow backward through time along the cell state with minimal decay.
Compare this to vanilla RNNs, where at every step the gradient passes through a tanh activation (derivative at most 1.0, and typically much less) and a multiplication by the recurrent weight matrix. The LSTM's cell state creates a gradient highway: a path where information and gradients can travel long distances.
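To make this precise: along the direct cell-to-cell path, differentiating the update rule gives
∂c_t/∂c_{t-1} = f_t
(elementwise, treating the gates' dependence on h_{t-1} as a separate path). Over k steps, the gradient along this path is scaled by f_t ⊙ f_{t-1} ⊙ … ⊙ f_{t-k+1}, and since the network can learn to hold its forget gates near 1, that product need not shrink.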
LSTM in Action: Predicting Text
Let's trace an LSTM processing "The cat sat on the ___":
| Time Step | Input | Forget Gate | Input Gate | Cell State | Output |
|---|---|---|---|---|---|
| t=1 | "The" | Nothing to forget (new sequence) | Store: article detected | [article context] | Low confidence |
| t=2 | "cat" | Keep article info | Store: subject is "cat" | [article, cat subject] | "cat" features |
| t=3 | "sat" | Keep subject | Store: action is sitting | [cat, sitting] | "sat" features |
| t=4 | "on" | Keep subject+action | Store: preposition follows | [cat, sitting, prep] | Expecting location |
| t=5 | "the" | Keep all context | Store: another article | [cat, sitting, prep, article] | High confidence: noun next |
At t=5, the LSTM "knows" it needs a location noun because it remembers the subject (cat), the action (sat), and the preposition (on). This long-range context is exactly what vanilla RNNs lose.
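In practice you rarely hand-roll the gate equations. Here is a minimal PyTorch sketch of the same prediction setup; the toy vocabulary and layer sizes are made up for illustration, and the model is untrained, so its logits are meaningless until fitted. It shows the data flow, not a working predictor:

```python
import torch
import torch.nn as nn

vocab = ["<pad>", "the", "cat", "sat", "on", "mat"]  # toy vocabulary
vocab_size, embed_dim, hidden_size = len(vocab), 16, 32

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden_size, batch_first=True)
to_logits = nn.Linear(hidden_size, vocab_size)

# "the cat sat on the" as token ids (batch of 1)
tokens = torch.tensor([[1, 2, 3, 4, 1]])
outputs, (h_n, c_n) = lstm(embed(tokens))  # outputs: one h_t per time step

next_word_logits = to_logits(outputs[:, -1])  # predict from the last h_t
print(next_word_logits.shape)  # torch.Size([1, 6])
```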
GRU: The Simpler Alternative
The Gated Recurrent Unit (GRU), introduced by Cho et al. in 2014, simplifies the LSTM by combining the forget and input gates into a single update gate, and merging the cell state and hidden state:
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (update, reset) |
| States | 2 (hidden + cell) | 1 (hidden only) |
| Parameters | More | ~25% fewer |
| Training speed | Slower | Faster |
| Long sequences | Slightly better | Comparable |
GRUs perform comparably to LSTMs on most tasks while being faster to train. Use LSTMs when you need maximum memory capacity; use GRUs when training speed matters.
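For comparison, here is a whole GRU step in the same NumPy style (placeholder parameters, biases omitted for brevity). Note that papers and libraries differ on whether z_t near 1 keeps or replaces the old state; this sketch follows the keep convention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

hidden_size, input_size = 4, 3
rng = np.random.default_rng(1)
W_z, W_r, W_h = (rng.normal(size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

def gru_step(h_prev, x_t):
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)  # update gate: plays the role of forget + input
    r_t = sigmoid(W_r @ hx)  # reset gate: how much history feeds the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))
    return z_t * h_prev + (1 - z_t) * h_tilde  # one merged state, no cell state

h_t = gru_step(np.zeros(hidden_size), rng.normal(size=input_size))
```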
Real-World Applications
LSTMs power (or powered) many of the AI systems you use daily:
- Machine translation: Google Translate used LSTM-based seq2seq models before switching to Transformers
- Speech recognition: Siri, Alexa, and Google Assistant used LSTM layers for converting speech to text
- Text generation: Early language models were LSTM-based
- Music composition: LSTMs can learn musical patterns and generate new compositions
- Anomaly detection: Monitoring server logs, financial transactions, or sensor data for unusual patterns
- Time series forecasting: Stock prices, weather, energy demand
LSTMs vs Transformers
Since 2017, Transformers have largely replaced LSTMs for natural language processing. But LSTMs still have advantages:
LSTMs win when:
- Processing streaming data in real-time (one element at a time)
- Memory is limited (Transformers need O(n²) memory for attention)
- The sequence is very long and you don't need global attention
- You're working with time series or sensor data
Transformers win when:
- You need to capture relationships between distant tokens
- Parallel training is important (LSTMs are inherently sequential)
- You have enough data and compute
- The task is NLP (Transformers dominate virtually every language benchmark)
Related Articles
- What is a Neural Network? — Understand the building blocks that LSTMs extend
- Backpropagation Explained — How gradients flow backward through a network
- Activation Functions Compared — Why sigmoid and tanh are used in LSTM gates
- Word2Vec Explained — How words become vectors that LSTMs can process
See It In Action
Understanding LSTMs requires seeing how gates open and close as data flows through. Our interactive RNN visualization lets you:
- Watch hidden states evolve as the network processes a sequence
- See how the gates control information flow at each time step
- Compare vanilla RNN hidden states with LSTM cell states
- Understand why gradients vanish in regular RNNs but not in LSTMs
Step through the process and build intuition for how gated memory works.