§ 01The sequence problem
A feed-forward network sees one input and produces one output — no notion of before or after. But language, audio, time series, and DNA are sequences: order matters, and earlier tokens condition later ones. To handle this, the network needs memory.
The recurrent idea is simple: at each time step t, the network produces a hidden state ht that depends on the current input xt and the previous hidden state ht-1. That single recurrence — feeding the state back into itself — lets a fixed-size network process arbitrarily long sequences.
§ 02The RNN cell
The vanilla (Elman) RNN cell is one matrix-multiply, one bias, and one nonlinearity:
Output yt = Why ht + by
Three weight matrices total: Wxh (input → hidden), Whh (hidden → hidden), Why (hidden → output). The hidden-to-hidden matrix is the memory matrix — it's how information from the past survives into the present.
Crucially, the same weights are reused at every time step. That weight sharing is what lets one set of parameters handle a sequence of any length, and it's also the root cause of the vanishing-gradient problem we'll meet in §06.
§ 03Unrolling through time — data flow
The clearest way to see an RNN is unrolled: copies of the same cell drawn side by side, with the hidden state flowing rightward from one to the next.
Watch the hidden state propagate. Each click advances one time step; the cell being computed lights up, and the formula below shows the actual numbers being substituted.
Three things to notice in the diagram:
- Horizontal arrows carry the hidden state from one time step to the next — this is the memory channel.
- Upward arrows from each cell produce an output at that time step (used for sequence tagging, language modeling, etc.).
- Downward arrows feed each input xt into its cell.
§ 04Activations: why tanh inside the cell
The vanilla RNN almost always uses tanh for the hidden state. Three reasons:
- Zero-centered output (range −1 to +1). Hidden states can encode "negative evidence" as well as positive. A sigmoid (0 to 1) would force every state component to be non-negative, which biases the dynamics.
- Stronger gradients than sigmoid — tanh's derivative peaks at 1.0 (vs sigmoid's 0.25), so signal propagates a bit better through time before vanishing.
- Bounded — unlike ReLU, tanh can't let activations explode. Since the same weights are applied at every step, an unbounded activation would compound across time and blow up.
- ReLU is unbounded above. In a recurrent setting, repeated multiplication by Whh followed by ReLU lets activations grow without limit — gradients then explode. ReLU-RNNs exist (with careful initialization, e.g. the identity-RNN trick), but they're brittle.
- For the gates of LSTM/GRU, ReLU also fails because gates need to live in [0, 1] — they're switches.
The output activation yt depends on the task: softmax for classification, linear for regression, sigmoid for binary tagging.
§ 05Backprop through time (BPTT)
Training an RNN is just standard backprop applied to the unrolled graph. The loss at time T depends on every earlier hidden state, so the gradient flows backward through every cell:
That product of Jacobians is where the trouble starts. Each factor is ∂ht/∂ht-1 = diag(tanh') · Whh. Multiply that matrix by itself T − k times and the magnitude is governed by the largest eigenvalue of Whh (scaled by tanh' ≤ 1).
- If the largest eigenvalue × tanh' < 1: the product decays exponentially → vanishing gradients.
- If > 1: it grows exponentially → exploding gradients.
§ 06Vanishing & exploding gradients
This is the central failure mode of vanilla RNNs. With tanh' ≤ 1 everywhere and a typical Whh spectral radius near 1, gradients shrink by roughly a constant factor each step. After 20 or 30 steps, the gradient that should reach early time steps is effectively zero — the network can't learn long-range dependencies.
A toy 1-D model. Each bar shows the magnitude of ∂L/∂hk as we walk backward from step 30 toward step 0. Adjust the per-step gradient factor to see exponential decay (or explosion).
After 30 steps, gradient at step 0 ≈ 0.001 (relative to step 30).
- Gradient clipping — fixes exploding gradients by capping the norm before each update. Easy and effective.
- Careful initialization (orthogonal Whh) — pushes the spectral radius to 1 so signals don't decay or grow at initialization.
- Architectural fixes — LSTM and GRU. These don't just patch the vanilla RNN; they redesign the recurrence so gradients can flow undamped across many steps.
§ 07LSTM — architecture overview
The LSTM (Hochreiter & Schmidhuber, 1997) introduces two innovations:
- A separate cell state ct that runs straight through the network with only minor linear interactions. This is the gradient highway.
- Three learned gates — forget, input, output — that decide what to remove from, add to, and read out of the cell state. Each gate is a sigmoid producing values in [0, 1] that act as soft on/off switches.
The LSTM cell maintains two running states between time steps: the cell state ct and the hidden state ht. Both have the same dimension.
input gate it = σ( Wi [ht-1, xt] + bi )
candidate c̃t = tanh( Wc [ht-1, xt] + bc )
cell update ct = ft ⊙ ct-1 + it ⊙ c̃t
output gate ot = σ( Wo [ht-1, xt] + bo )
hidden state ht = ot ⊙ tanh( ct )
Notation: [ht-1, xt] means concatenation, σ is sigmoid, ⊙ is element-wise product.
§ 08Forget gate σ
The forget gate decides what to throw away from the previous cell state. It takes the previous hidden state and current input, projects them, and squashes through a sigmoid:
Each component of ft is multiplied element-wise with the corresponding component of ct-1:
- ft,k ≈ 1 → keep the k-th memory cell intact.
- ft,k ≈ 0 → erase the k-th memory cell.
§ 09Input gate & candidate values σtanh
Adding to memory takes two components. The candidate proposes new content; the input gate decides how much of that content actually gets written.
candidate c̃t = tanh( Wc [ht-1, xt] + bc ) ∈ [−1, 1]d
The candidate c̃t is "what would be worth remembering if we wrote it down". The input gate it is the volume knob: a sigmoid mask that suppresses or amplifies the candidate dimension-by-dimension.
Why the split? It separates magnitude (how strong is this signal, via tanh) from relevance (do we care about it right now, via sigmoid). Putting both in one tanh would conflate them.
§ 10Cell state update — the highway
The forget gate erases, the input gate writes. The cell state update is then an additive combination — and additivity is the whole reason LSTMs solve the vanishing-gradient problem:
Compare with the vanilla RNN's ht = tanh(Whh ht-1 + …). The vanilla update is multiplicative through a nonlinearity; the LSTM cell update is linear and additive. When the forget gate is open (ft ≈ 1), ct ≈ ct-1 + (something) — the past flows through almost untouched. Gradients can flow back through this path without being multiplied by tanh derivatives.
§ 11Output gate — what to expose σtanh
The cell state is the internal memory. The hidden state is what the rest of the network (and the next time step) sees. The output gate decides which parts of the cell state to expose:
hidden state ht = ot ⊙ tanh( ct )
The tanh(ct) squashes the cell state into [−1, 1] for stable propagation, and the output gate masks it. This keeps the hidden state bounded even when the cell state — which is allowed to grow as it accumulates — has gotten large.
Set each gate and the inputs by hand, then watch the cell state and hidden state update. This is one time step; imagine running the same cell at every step of a sequence.
§ 12Why this particular mixture of activations
Every gate uses sigmoid; the candidate and the output use tanh. This isn't arbitrary — each choice has a specific role.
| Component | Activation | Range | Role |
|---|---|---|---|
| forget gate f | σ | [0, 1] | switch — fraction of past to keep |
| input gate i | σ | [0, 1] | switch — fraction of candidate to write |
| output gate o | σ | [0, 1] | switch — fraction of cell to expose |
| candidate c̃ | tanh | [−1, 1] | signed content to add to memory |
| cell readout | tanh(c) | [−1, 1] | squash unbounded cell to stable range |
The mnemonic: sigmoid is for how much, tanh is for what. Gates control flow (sigmoid); content carries signed information (tanh).
§ 13Why LSTM solves the vanishing-gradient problem
Look at the cell-state recurrence in isolation: ct = ft ⊙ ct-1 + (writes). The Jacobian of the cell state with respect to its predecessor is just
No tanh derivative, no weight matrix multiplication. If the forget gate is near 1, the gradient passes through unchanged. The product across many steps is ∏ ft — and the network can learn to keep specific entries of f close to 1 for as long as the information is needed.
This is the entire engineering achievement of the LSTM: turn a multiplicative recurrence (where gradients die) into a near-identity gated recurrence (where gradients survive). Everything else — the input gate, the output gate, the tanh squashes — is in service of making that highway behave well.
§ 14LSTM in code
A minimal NumPy step (single hidden unit, for illustration):
# inputs: x_t (input), h_prev, c_prev # params: W_f, W_i, W_c, W_o (each takes [h_prev, x_t]) z = np.concatenate([h_prev, x_t]) f = sigmoid(W_f @ z + b_f) # forget gate i = sigmoid(W_i @ z + b_i) # input gate c_tilde = np.tanh(W_c @ z + b_c) # candidate o = sigmoid(W_o @ z + b_o) # output gate c = f * c_prev + i * c_tilde # cell state update h = o * np.tanh(c) # hidden state output
In PyTorch you'd write nn.LSTM(input_size, hidden_size, num_layers) and forget the gate equations — but every modern framework's LSTM module is exactly the six lines above, vectorized.
§ 15GRU — architecture overview
The GRU (Cho et al., 2014) is the LSTM's simplified cousin. It fuses the LSTM's forget and input gates into a single update gate, drops the separate cell state, and gets close to LSTM performance with ~25% fewer parameters.
update gate zt = σ( Wz [ht-1, xt] + bz )
candidate h̃t = tanh( Wh [rt ⊙ ht-1, xt] + bh )
hidden state ht = (1 − zt) ⊙ ht-1 + zt ⊙ h̃t
Compared to LSTM:
- Two gates instead of three (reset and update vs forget, input, output).
- No separate cell state — the hidden state is the memory. The cell-state highway is replaced by the convex combination in the final line.
- No output gate — the hidden state is exposed directly.
§ 16Reset gate σ
The reset gate decides how much of the past hidden state should influence the new candidate. Notice where rt appears:
When rt ≈ 0, the past is effectively erased for the purpose of computing the candidate — the new state is computed from xt alone (plus the bias). When rt ≈ 1, the past hidden state contributes fully.
Use cases:
- Start of a new sentence in a language model: reset → drop context from the previous sentence.
- Scene change in a video: reset → forget the previous scene.
§ 17Update gate σ
The update gate is the GRU's most important component. It does the job of both the LSTM's forget gate and input gate, by a clever trick of coupling them:
Compare with the LSTM cell update: ct = ft ⊙ ct-1 + it ⊙ c̃t. The LSTM has two independent gates; the GRU forces it = 1 − ft. That is, "how much you write" is tied to "how much you erase".
- zt ≈ 0 → ht ≈ ht-1. The past is kept; nothing new is written. (This is the "remember" mode, equivalent to LSTM's forget=1, input=0.)
- zt ≈ 1 → ht ≈ h̃t. The past is overwritten with the candidate.
- zt ≈ 0.5 → blend of the two.
§ 18Candidate & final hidden state
Putting it together as one time step:
- Compute reset gate rt and update gate zt from [ht-1, xt].
- Use the reset gate to make a conditional view of the past: rt ⊙ ht-1.
- Compute the candidate h̃t from that conditional past plus the input.
- Blend old and candidate via the update gate.
Set the inputs and gates, watch the candidate and final hidden state. Try the presets to see how the two gates collaborate.
§ 19Why GRU can skip the cell state
The LSTM keeps two states (c, h) so that the internal memory is allowed to grow unboundedly while the exposed hidden state stays bounded via o ⊙ tanh(c). The GRU sidesteps this by keeping the candidate h̃t already in [−1, 1] (it's a tanh) and using a convex combination — so ht stays bounded automatically.
The downside: GRU's "memory" can't store arbitrarily large values, so it has slightly less capacity for tasks that need fine-grained accumulators (e.g. counting). For most tasks the difference is invisible.
§ 20GRU in code
# inputs: x_t, h_prev # params: W_r, W_z, W_h z_input = np.concatenate([h_prev, x_t]) r = sigmoid(W_r @ z_input + b_r) # reset gate z = sigmoid(W_z @ z_input + b_z) # update gate # candidate uses h_prev *gated by r* h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h) h = (1 - z) * h_prev + z * h_tilde # final hidden state
Four lines instead of six. In PyTorch: nn.GRU(input_size, hidden_size, num_layers).
§ 21LSTM vs GRU — parameter count
For input size dx and hidden size dh, each gate/candidate involves a weight matrix of shape dh × (dx + dh) plus a bias of size dh.
| Component | LSTM | GRU |
|---|---|---|
| Number of gates | 3 (f, i, o) | 2 (r, z) |
| Candidate | c̃ (tanh) | h̃ (tanh) |
| Recurrent states | 2 (c, h) | 1 (h) |
| Weight matrices | 4 | 3 |
| Total parameters | 4 · dh(dx + dh + 1) | 3 · dh(dx + dh + 1) |
| For dx=128, dh=256 | ~394k | ~296k |
GRU has 75% the parameters of LSTM at the same hidden size — meaningfully smaller, faster to train, and a bit faster at inference.
§ 22When to use which
- You have a lot of data and capacity isn't a bottleneck.
- The task requires fine-grained, long-term memory (e.g. counting, dependency tracking across many tokens).
- You're following established baselines — most pre-transformer literature uses LSTM.
- You have limited data — fewer parameters means less overfitting.
- You need faster training or inference.
- The task involves shorter sequences or simpler temporal patterns.
- Hardware is constrained (embedded, on-device).
Empirically, on most NLP and time-series tasks, the two perform within noise of each other. Pick the cheaper one (GRU) unless you have a specific reason to think you need the extra capacity.
§ 23Bidirectional & stacked variants
Bidirectional RNN/LSTM/GRU
A unidirectional RNN at step t only sees the past (x1, …, xt). For tagging tasks (POS tagging, NER) you want the future too — what follows often disambiguates the current token.
A bidirectional layer runs two independent RNNs: one forward (→) and one backward (←). The final representation at step t is the concatenation [ht→, ht←]. Doubles the parameters and requires the full sequence in memory (so it's not used for streaming or autoregressive generation).
Stacked (deep) RNNs
Just like deep feed-forward networks, you can stack RNN layers: the hidden states of layer ℓ become the inputs of layer ℓ+1. Each layer can capture features at a different temporal granularity. 2–4 layers is typical; beyond that, returns diminish and training gets fragile (residual connections help).
§ 24Activation function summary
| Architecture | Component | Activation | Why |
|---|---|---|---|
| Vanilla RNN | hidden state | tanh | zero-centered, bounded, stronger gradient than sigmoid |
| output (task-dep.) | softmax / linear / sigmoid | matches loss function | |
| LSTM | forget gate f | σ | fractional switch [0, 1] |
| input gate i | σ | fractional switch [0, 1] | |
| candidate c̃ | tanh | signed content carrier | |
| output gate o | σ | fractional switch [0, 1] | |
| cell readout | tanh(c) | bound exposed state | |
| GRU | reset gate r | σ | fractional switch [0, 1] |
| update gate z | σ | fractional switch [0, 1] | |
| candidate h̃ | tanh | signed content carrier |
The whole rule of thumb: σ for gates (how much), tanh for content (what). Every architecture in this companion follows it.
§ 25Reference — data flow at a glance
| Cell | Inputs in | States carried | Recurrence shape | Vanishing fix |
|---|---|---|---|---|
| RNN | xt, ht-1 | ht | ht = tanh(Wx + Uh + b) | none — gradients vanish |
| LSTM | xt, ht-1, ct-1 | ht, ct | ct = f·ct-1 + i·c̃ ht = o·tanh(ct) |
additive cell-state highway |
| GRU | xt, ht-1 | ht | ht = (1−z)·ht-1 + z·h̃ | convex-combination update |
Companion 05 of 05 · regression metrics · classification & regression · activations & losses · LLM & RAG evaluation · recurrent networks.