RNN, LSTM, GRU — an interactive companion

Part I

Vanilla RNN

§ 01The sequence problem

A feed-forward network sees one input and produces one output — no notion of before or after. But language, audio, time series, and DNA are sequences: order matters, and earlier tokens condition later ones. To handle this, the network needs memory.

The recurrent idea is simple: at each time step t, the network produces a hidden state h_t that depends on the current input x_t and the previous hidden state h_t-1. That single recurrence — feeding the state back into itself — lets a fixed-size network process arbitrarily long sequences.

why this matters Every architecture in this companion — vanilla RNN, LSTM, GRU — is a different design for that recurrence: the same shape of computation, but very different ways of deciding what to remember and what to forget.

§ 02The RNN cell

The vanilla (Elman) RNN cell is one matrix-multiply, one bias, and one nonlinearity:

Hidden state h_t = tanh( W_xh x_t + W_hh h_t-1 + b_h )
Output y_t = W_hy h_t + b_y

Three weight matrices total: W_xh (input → hidden), W_hh (hidden → hidden), W_hy (hidden → output). The hidden-to-hidden matrix is the memory matrix — it's how information from the past survives into the present.

Crucially, the same weights are reused at every time step. That weight sharing is what lets one set of parameters handle a sequence of any length, and it's also the root cause of the vanishing-gradient problem we'll meet in §06.

§ 03Unrolling through time — data flow

The clearest way to see an RNN is unrolled: copies of the same cell drawn side by side, with the hidden state flowing rightward from one to the next.

Demo 01

RNN unrolled — step through a sequence

Watch the hidden state propagate. Each click advances one time step; the cell being computed lights up, and the formula below shows the actual numbers being substituted.

W_xh 0.70

W_hh 0.60

step 0 press Step to begin

Three things to notice in the diagram:

Horizontal arrows carry the hidden state from one time step to the next — this is the memory channel.
Upward arrows from each cell produce an output at that time step (used for sequence tagging, language modeling, etc.).
Downward arrows feed each input x_t into its cell.

§ 04Activations: why tanh inside the cell

The vanilla RNN almost always uses tanh for the hidden state. Three reasons:

tanh's advantages here

Zero-centered output (range −1 to +1). Hidden states can encode "negative evidence" as well as positive. A sigmoid (0 to 1) would force every state component to be non-negative, which biases the dynamics.
Stronger gradients than sigmoid — tanh's derivative peaks at 1.0 (vs sigmoid's 0.25), so signal propagates a bit better through time before vanishing.
Bounded — unlike ReLU, tanh can't let activations explode. Since the same weights are applied at every step, an unbounded activation would compound across time and blow up.

why not ReLU?

ReLU is unbounded above. In a recurrent setting, repeated multiplication by W_hh followed by ReLU lets activations grow without limit — gradients then explode. ReLU-RNNs exist (with careful initialization, e.g. the identity-RNN trick), but they're brittle.
For the gates of LSTM/GRU, ReLU also fails because gates need to live in [0, 1] — they're switches.

The output activation y_t depends on the task: softmax for classification, linear for regression, sigmoid for binary tagging.

§ 05Backprop through time (BPTT)

Training an RNN is just standard backprop applied to the unrolled graph. The loss at time T depends on every earlier hidden state, so the gradient flows backward through every cell:

∂L_T / ∂h_k = (∂L_T / ∂h_T) · ∏_t=k+1..T (∂h_t / ∂h_t-1)

That product of Jacobians is where the trouble starts. Each factor is ∂h_t/∂h_t-1 = diag(tanh') · W_hh. Multiply that matrix by itself T − k times and the magnitude is governed by the largest eigenvalue of W_hh (scaled by tanh' ≤ 1).

If the largest eigenvalue × tanh' < 1: the product decays exponentially → vanishing gradients.
If > 1: it grows exponentially → exploding gradients.

§ 06Vanishing & exploding gradients

This is the central failure mode of vanilla RNNs. With tanh' ≤ 1 everywhere and a typical W_hh spectral radius near 1, gradients shrink by roughly a constant factor each step. After 20 or 30 steps, the gradient that should reach early time steps is effectively zero — the network can't learn long-range dependencies.

Demo 02

Gradient magnitude across time

A toy 1-D model. Each bar shows the magnitude of ∂L/∂h_k as we walk backward from step 30 toward step 0. Adjust the per-step gradient factor to see exponential decay (or explosion).

per-step factor 0.80

After 30 steps, gradient at step 0 ≈ 0.001 (relative to step 30).

classical fixes

Gradient clipping — fixes exploding gradients by capping the norm before each update. Easy and effective.
Careful initialization (orthogonal W_hh) — pushes the spectral radius to 1 so signals don't decay or grow at initialization.
Architectural fixes — LSTM and GRU. These don't just patch the vanilla RNN; they redesign the recurrence so gradients can flow undamped across many steps.

Part II

Long Short-Term Memory

§ 07LSTM — architecture overview

The LSTM (Hochreiter & Schmidhuber, 1997) introduces two innovations:

A separate cell state c_t that runs straight through the network with only minor linear interactions. This is the gradient highway.
Three learned gates — forget, input, output — that decide what to remove from, add to, and read out of the cell state. Each gate is a sigmoid producing values in [0, 1] that act as soft on/off switches.

The LSTM cell maintains two running states between time steps: the cell state c_t and the hidden state h_t. Both have the same dimension.

forget gate f_t = σ( W_f [h_t-1, x_t] + b_f )
input gate i_t = σ( W_i [h_t-1, x_t] + b_i )
candidate c̃_t = tanh( W_c [h_t-1, x_t] + b_c )
cell update c_t = f_t ⊙ c_t-1 + i_t ⊙ c̃_t
output gate o_t = σ( W_o [h_t-1, x_t] + b_o )
hidden state h_t = o_t ⊙ tanh( c_t )

Notation: [h_t-1, x_t] means concatenation, σ is sigmoid, ⊙ is element-wise product.

§ 08Forget gate σ

The forget gate decides what to throw away from the previous cell state. It takes the previous hidden state and current input, projects them, and squashes through a sigmoid:

f_t = σ( W_f [h_t-1, x_t] + b_f ) ∈ [0, 1]^d

Each component of f_t is multiplied element-wise with the corresponding component of c_t-1:

f_t,k ≈ 1 → keep the k-th memory cell intact.
f_t,k ≈ 0 → erase the k-th memory cell.

initialization tip The bias b_f is usually initialized to a positive value (often 1.0). This pushes the forget gate toward "remember" early in training, so the cell state can establish a useful signal before the network learns to forget.

§ 09Input gate & candidate values σtanh

Adding to memory takes two components. The candidate proposes new content; the input gate decides how much of that content actually gets written.

input gate i_t = σ( W_i [h_t-1, x_t] + b_i ) ∈ [0, 1]^d
candidate c̃_t = tanh( W_c [h_t-1, x_t] + b_c ) ∈ [−1, 1]^d

The candidate c̃_t is "what would be worth remembering if we wrote it down". The input gate i_t is the volume knob: a sigmoid mask that suppresses or amplifies the candidate dimension-by-dimension.

Why the split? It separates magnitude (how strong is this signal, via tanh) from relevance (do we care about it right now, via sigmoid). Putting both in one tanh would conflate them.

§ 10Cell state update — the highway

The forget gate erases, the input gate writes. The cell state update is then an additive combination — and additivity is the whole reason LSTMs solve the vanishing-gradient problem:

c_t = f_t ⊙ c_t-1 + i_t ⊙ c̃_t

Compare with the vanilla RNN's h_t = tanh(W_hh h_t-1 + …). The vanilla update is multiplicative through a nonlinearity; the LSTM cell update is linear and additive. When the forget gate is open (f_t ≈ 1), c_t ≈ c_t-1 + (something) — the past flows through almost untouched. Gradients can flow back through this path without being multiplied by tanh derivatives.

§ 11Output gate — what to expose σtanh

The cell state is the internal memory. The hidden state is what the rest of the network (and the next time step) sees. The output gate decides which parts of the cell state to expose:

output gate o_t = σ( W_o [h_t-1, x_t] + b_o )
hidden state h_t = o_t ⊙ tanh( c_t )

The tanh(c_t) squashes the cell state into [−1, 1] for stable propagation, and the output gate masks it. This keeps the hidden state bounded even when the cell state — which is allowed to grow as it accumulates — has gotten large.

Demo 03

LSTM cell — interactive gates

Set each gate and the inputs by hand, then watch the cell state and hidden state update. This is one time step; imagine running the same cell at every step of a sequence.

c_{t-1} 0.40

forget f 0.85

input i 0.50

candidate c̃ 0.70

output o 0.80

§ 12Why this particular mixture of activations

Every gate uses sigmoid; the candidate and the output use tanh. This isn't arbitrary — each choice has a specific role.

Component	Activation	Range	Role
forget gate f	σ	[0, 1]	switch — fraction of past to keep
input gate i	σ	[0, 1]	switch — fraction of candidate to write
output gate o	σ	[0, 1]	switch — fraction of cell to expose
candidate c̃	tanh	[−1, 1]	signed content to add to memory
cell readout	tanh(c)	[−1, 1]	squash unbounded cell to stable range

The mnemonic: sigmoid is for how much, tanh is for what. Gates control flow (sigmoid); content carries signed information (tanh).

§ 13Why LSTM solves the vanishing-gradient problem

Look at the cell-state recurrence in isolation: c_t = f_t ⊙ c_t-1 + (writes). The Jacobian of the cell state with respect to its predecessor is just

∂c_t / ∂c_t-1 = diag( f_t )

No tanh derivative, no weight matrix multiplication. If the forget gate is near 1, the gradient passes through unchanged. The product across many steps is ∏ f_t — and the network can learn to keep specific entries of f close to 1 for as long as the information is needed.

This is the entire engineering achievement of the LSTM: turn a multiplicative recurrence (where gradients die) into a near-identity gated recurrence (where gradients survive). Everything else — the input gate, the output gate, the tanh squashes — is in service of making that highway behave well.

§ 14LSTM in code

A minimal NumPy step (single hidden unit, for illustration):

# inputs: x_t (input), h_prev, c_prev
# params: W_f, W_i, W_c, W_o (each takes [h_prev, x_t])

z = np.concatenate([h_prev, x_t])

f = sigmoid(W_f @ z + b_f)        # forget gate
i = sigmoid(W_i @ z + b_i)        # input gate
c_tilde = np.tanh(W_c @ z + b_c)  # candidate
o = sigmoid(W_o @ z + b_o)        # output gate

c = f * c_prev + i * c_tilde      # cell state update
h = o * np.tanh(c)                # hidden state output

In PyTorch you'd write nn.LSTM(input_size, hidden_size, num_layers) and forget the gate equations — but every modern framework's LSTM module is exactly the six lines above, vectorized.

Part III

Gated Recurrent Unit

§ 15GRU — architecture overview

The GRU (Cho et al., 2014) is the LSTM's simplified cousin. It fuses the LSTM's forget and input gates into a single update gate, drops the separate cell state, and gets close to LSTM performance with ~25% fewer parameters.

reset gate r_t = σ( W_r [h_t-1, x_t] + b_r )
update gate z_t = σ( W_z [h_t-1, x_t] + b_z )
candidate h̃_t = tanh( W_h [r_t ⊙ h_t-1, x_t] + b_h )
hidden state h_t = (1 − z_t) ⊙ h_t-1 + z_t ⊙ h̃_t

Compared to LSTM:

Two gates instead of three (reset and update vs forget, input, output).
No separate cell state — the hidden state is the memory. The cell-state highway is replaced by the convex combination in the final line.
No output gate — the hidden state is exposed directly.

§ 16Reset gate σ

The reset gate decides how much of the past hidden state should influence the new candidate. Notice where r_t appears:

h̃_t = tanh( W_h [ r_t ⊙ h_t-1, x_t] + b_h )

When r_t ≈ 0, the past is effectively erased for the purpose of computing the candidate — the new state is computed from x_t alone (plus the bias). When r_t ≈ 1, the past hidden state contributes fully.

Use cases:

Start of a new sentence in a language model: reset → drop context from the previous sentence.
Scene change in a video: reset → forget the previous scene.

§ 17Update gate σ

The update gate is the GRU's most important component. It does the job of both the LSTM's forget gate and input gate, by a clever trick of coupling them:

h_t = (1 − z_t) ⊙ h_t-1 + z_t ⊙ h̃_t

Compare with the LSTM cell update: c_t = f_t ⊙ c_t-1 + i_t ⊙ c̃_t. The LSTM has two independent gates; the GRU forces i_t = 1 − f_t. That is, "how much you write" is tied to "how much you erase".

z_t ≈ 0 → h_t ≈ h_t-1. The past is kept; nothing new is written. (This is the "remember" mode, equivalent to LSTM's forget=1, input=0.)
z_t ≈ 1 → h_t ≈ h̃_t. The past is overwritten with the candidate.
z_t ≈ 0.5 → blend of the two.

why coupling works In practice, "remember" and "write" are usually opposed — if you decide to overwrite, you'd want to forget too. Tying them costs almost nothing in expressiveness and saves a whole gate's worth of parameters.

§ 18Candidate & final hidden state

Putting it together as one time step:

Compute reset gate r_t and update gate z_t from [h_t-1, x_t].
Use the reset gate to make a conditional view of the past: r_t ⊙ h_t-1.
Compute the candidate h̃_t from that conditional past plus the input.
Blend old and candidate via the update gate.

Demo 04

GRU cell — interactive gates

Set the inputs and gates, watch the candidate and final hidden state. Try the presets to see how the two gates collaborate.

h_{t-1} 0.60

x_t 0.40

reset r 0.70

update z 0.40

§ 19Why GRU can skip the cell state

The LSTM keeps two states (c, h) so that the internal memory is allowed to grow unboundedly while the exposed hidden state stays bounded via o ⊙ tanh(c). The GRU sidesteps this by keeping the candidate h̃_t already in [−1, 1] (it's a tanh) and using a convex combination — so h_t stays bounded automatically.

The downside: GRU's "memory" can't store arbitrarily large values, so it has slightly less capacity for tasks that need fine-grained accumulators (e.g. counting). For most tasks the difference is invisible.

gradient flow in GRU The update gate gives GRU the same identity-shortcut property as LSTM. When z_t ≈ 0, h_t ≈ h_t-1, and the gradient flows back through this near-identity path without being killed by repeated multiplications. Vanishing gradients are mitigated just as effectively as in the LSTM.

§ 20GRU in code

# inputs: x_t, h_prev
# params: W_r, W_z, W_h

z_input = np.concatenate([h_prev, x_t])

r = sigmoid(W_r @ z_input + b_r)         # reset gate
z = sigmoid(W_z @ z_input + b_z)         # update gate

# candidate uses h_prev *gated by r*
h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)

h = (1 - z) * h_prev + z * h_tilde       # final hidden state

Four lines instead of six. In PyTorch: nn.GRU(input_size, hidden_size, num_layers).

Part IV

Comparison & context

§ 21LSTM vs GRU — parameter count

For input size d_x and hidden size d_h, each gate/candidate involves a weight matrix of shape d_h × (d_x + d_h) plus a bias of size d_h.

Component	LSTM	GRU
Number of gates	3 (f, i, o)	2 (r, z)
Candidate	c̃ (tanh)	h̃ (tanh)
Recurrent states	2 (c, h)	1 (h)
Weight matrices	4	3
Total parameters	4 · d_h(d_x + d_h + 1)	3 · d_h(d_x + d_h + 1)
For d_x=128, d_h=256	~394k	~296k

GRU has 75% the parameters of LSTM at the same hidden size — meaningfully smaller, faster to train, and a bit faster at inference.

§ 22When to use which

prefer LSTM when…

You have a lot of data and capacity isn't a bottleneck.
The task requires fine-grained, long-term memory (e.g. counting, dependency tracking across many tokens).
You're following established baselines — most pre-transformer literature uses LSTM.

prefer GRU when…

You have limited data — fewer parameters means less overfitting.
You need faster training or inference.
The task involves shorter sequences or simpler temporal patterns.
Hardware is constrained (embedded, on-device).

Empirically, on most NLP and time-series tasks, the two perform within noise of each other. Pick the cheaper one (GRU) unless you have a specific reason to think you need the extra capacity.

the bigger context Both LSTM and GRU have largely been displaced by transformers for most sequence tasks since 2017–2018. The attention mechanism gives O(1) "distance" between any two tokens (no sequential propagation needed), and is more parallelizable. RNNs still shine for streaming inference, very long context with constant memory, and resource-constrained deployment — but for new projects on large datasets, transformers are usually the default.

§ 23Bidirectional & stacked variants

Bidirectional RNN/LSTM/GRU

A unidirectional RNN at step t only sees the past (x₁, …, x_t). For tagging tasks (POS tagging, NER) you want the future too — what follows often disambiguates the current token.

A bidirectional layer runs two independent RNNs: one forward (→) and one backward (←). The final representation at step t is the concatenation [h_t^→, h_t^←]. Doubles the parameters and requires the full sequence in memory (so it's not used for streaming or autoregressive generation).

Stacked (deep) RNNs

Just like deep feed-forward networks, you can stack RNN layers: the hidden states of layer ℓ become the inputs of layer ℓ+1. Each layer can capture features at a different temporal granularity. 2–4 layers is typical; beyond that, returns diminish and training gets fragile (residual connections help).

§ 24Activation function summary

Architecture	Component	Activation	Why
Vanilla RNN	hidden state	tanh	zero-centered, bounded, stronger gradient than sigmoid
Vanilla RNN	output (task-dep.)	softmax / linear / sigmoid	matches loss function
LSTM	forget gate f	σ	fractional switch [0, 1]
	input gate i	σ	fractional switch [0, 1]
	candidate c̃	tanh	signed content carrier
	output gate o	σ	fractional switch [0, 1]
	cell readout	tanh(c)	bound exposed state
GRU	reset gate r	σ	fractional switch [0, 1]
	update gate z	σ	fractional switch [0, 1]
	candidate h̃	tanh	signed content carrier

The whole rule of thumb: σ for gates (how much), tanh for content (what). Every architecture in this companion follows it.

§ 25Reference — data flow at a glance

Cell	Inputs in	States carried	Recurrence shape	Vanishing fix
RNN	x_t, h_t-1	h_t	h_t = tanh(Wx + Uh + b)	none — gradients vanish
LSTM	x_t, h_t-1, c_t-1	h_t, c_t	c_t = f·c_t-1 + i·c̃ h_t = o·tanh(c_t)	additive cell-state highway
GRU	x_t, h_t-1	h_t	h_t = (1−z)·h_t-1 + z·h̃	convex-combination update

Companion 05 of 05 · regression metrics · classification & regression · activations & losses · LLM & RAG evaluation · recurrent networks.