Recurrent Networks

RNN · LSTM · GRU — activation functions and data flow

Companion 05 · cs229 series · interactive diagrams

Part I
Vanilla RNN

§ 01The sequence problem

A feed-forward network sees one input and produces one output — no notion of before or after. But language, audio, time series, and DNA are sequences: order matters, and earlier tokens condition later ones. To handle this, the network needs memory.

The recurrent idea is simple: at each time step t, the network produces a hidden state ht that depends on the current input xt and the previous hidden state ht-1. That single recurrence — feeding the state back into itself — lets a fixed-size network process arbitrarily long sequences.

why this matters Every architecture in this companion — vanilla RNN, LSTM, GRU — is a different design for that recurrence: the same shape of computation, but very different ways of deciding what to remember and what to forget.

§ 02The RNN cell

The vanilla (Elman) RNN cell is one matrix-multiply, one bias, and one nonlinearity:

Hidden state ht = tanh( Wxh xt + Whh ht-1 + bh )
Output yt = Why ht + by

Three weight matrices total: Wxh (input → hidden), Whh (hidden → hidden), Why (hidden → output). The hidden-to-hidden matrix is the memory matrix — it's how information from the past survives into the present.

Crucially, the same weights are reused at every time step. That weight sharing is what lets one set of parameters handle a sequence of any length, and it's also the root cause of the vanishing-gradient problem we'll meet in §06.

§ 03Unrolling through time — data flow

The clearest way to see an RNN is unrolled: copies of the same cell drawn side by side, with the hidden state flowing rightward from one to the next.

Demo 01
RNN unrolled — step through a sequence

Watch the hidden state propagate. Each click advances one time step; the cell being computed lights up, and the formula below shows the actual numbers being substituted.

0.70
0.60
step 0 press Step to begin

Three things to notice in the diagram:

§ 04Activations: why tanh inside the cell

The vanilla RNN almost always uses tanh for the hidden state. Three reasons:

tanh's advantages here
  • Zero-centered output (range −1 to +1). Hidden states can encode "negative evidence" as well as positive. A sigmoid (0 to 1) would force every state component to be non-negative, which biases the dynamics.
  • Stronger gradients than sigmoid — tanh's derivative peaks at 1.0 (vs sigmoid's 0.25), so signal propagates a bit better through time before vanishing.
  • Bounded — unlike ReLU, tanh can't let activations explode. Since the same weights are applied at every step, an unbounded activation would compound across time and blow up.
why not ReLU?
  • ReLU is unbounded above. In a recurrent setting, repeated multiplication by Whh followed by ReLU lets activations grow without limit — gradients then explode. ReLU-RNNs exist (with careful initialization, e.g. the identity-RNN trick), but they're brittle.
  • For the gates of LSTM/GRU, ReLU also fails because gates need to live in [0, 1] — they're switches.

The output activation yt depends on the task: softmax for classification, linear for regression, sigmoid for binary tagging.

§ 05Backprop through time (BPTT)

Training an RNN is just standard backprop applied to the unrolled graph. The loss at time T depends on every earlier hidden state, so the gradient flows backward through every cell:

∂LT / ∂hk = (∂LT / ∂hT) · ∏t=k+1..T (∂ht / ∂ht-1)

That product of Jacobians is where the trouble starts. Each factor is ∂ht/∂ht-1 = diag(tanh') · Whh. Multiply that matrix by itself T − k times and the magnitude is governed by the largest eigenvalue of Whh (scaled by tanh' ≤ 1).

§ 06Vanishing & exploding gradients

This is the central failure mode of vanilla RNNs. With tanh' ≤ 1 everywhere and a typical Whh spectral radius near 1, gradients shrink by roughly a constant factor each step. After 20 or 30 steps, the gradient that should reach early time steps is effectively zero — the network can't learn long-range dependencies.

Demo 02
Gradient magnitude across time

A toy 1-D model. Each bar shows the magnitude of ∂L/∂hk as we walk backward from step 30 toward step 0. Adjust the per-step gradient factor to see exponential decay (or explosion).

0.80

After 30 steps, gradient at step 0 ≈ 0.001 (relative to step 30).

classical fixes
  • Gradient clipping — fixes exploding gradients by capping the norm before each update. Easy and effective.
  • Careful initialization (orthogonal Whh) — pushes the spectral radius to 1 so signals don't decay or grow at initialization.
  • Architectural fixes — LSTM and GRU. These don't just patch the vanilla RNN; they redesign the recurrence so gradients can flow undamped across many steps.
Part II
Long Short-Term Memory

§ 07LSTM — architecture overview

The LSTM (Hochreiter & Schmidhuber, 1997) introduces two innovations:

  1. A separate cell state ct that runs straight through the network with only minor linear interactions. This is the gradient highway.
  2. Three learned gates — forget, input, output — that decide what to remove from, add to, and read out of the cell state. Each gate is a sigmoid producing values in [0, 1] that act as soft on/off switches.

The LSTM cell maintains two running states between time steps: the cell state ct and the hidden state ht. Both have the same dimension.

forget gate ft = σ( Wf [ht-1, xt] + bf )
input gate it = σ( Wi [ht-1, xt] + bi )
candidatet = tanh( Wc [ht-1, xt] + bc )
cell update ct = ft ⊙ ct-1 + it ⊙ c̃t
output gate ot = σ( Wo [ht-1, xt] + bo )
hidden state ht = ot ⊙ tanh( ct )

Notation: [ht-1, xt] means concatenation, σ is sigmoid, is element-wise product.

§ 08Forget gate σ

The forget gate decides what to throw away from the previous cell state. It takes the previous hidden state and current input, projects them, and squashes through a sigmoid:

ft = σ( Wf [ht-1, xt] + bf ) ∈ [0, 1]d

Each component of ft is multiplied element-wise with the corresponding component of ct-1:

initialization tip The bias bf is usually initialized to a positive value (often 1.0). This pushes the forget gate toward "remember" early in training, so the cell state can establish a useful signal before the network learns to forget.

§ 09Input gate & candidate values σtanh

Adding to memory takes two components. The candidate proposes new content; the input gate decides how much of that content actually gets written.

input gate it = σ( Wi [ht-1, xt] + bi ) ∈ [0, 1]d
candidatet = tanh( Wc [ht-1, xt] + bc ) ∈ [−1, 1]d

The candidate t is "what would be worth remembering if we wrote it down". The input gate it is the volume knob: a sigmoid mask that suppresses or amplifies the candidate dimension-by-dimension.

Why the split? It separates magnitude (how strong is this signal, via tanh) from relevance (do we care about it right now, via sigmoid). Putting both in one tanh would conflate them.

§ 10Cell state update — the highway

The forget gate erases, the input gate writes. The cell state update is then an additive combination — and additivity is the whole reason LSTMs solve the vanishing-gradient problem:

ct = ft ⊙ ct-1 + it ⊙ c̃t

Compare with the vanilla RNN's ht = tanh(Whh ht-1 + …). The vanilla update is multiplicative through a nonlinearity; the LSTM cell update is linear and additive. When the forget gate is open (ft ≈ 1), ct ≈ ct-1 + (something) — the past flows through almost untouched. Gradients can flow back through this path without being multiplied by tanh derivatives.

§ 11Output gate — what to expose σtanh

The cell state is the internal memory. The hidden state is what the rest of the network (and the next time step) sees. The output gate decides which parts of the cell state to expose:

output gate ot = σ( Wo [ht-1, xt] + bo )
hidden state ht = ot ⊙ tanh( ct )

The tanh(ct) squashes the cell state into [−1, 1] for stable propagation, and the output gate masks it. This keeps the hidden state bounded even when the cell state — which is allowed to grow as it accumulates — has gotten large.

Demo 03
LSTM cell — interactive gates

Set each gate and the inputs by hand, then watch the cell state and hidden state update. This is one time step; imagine running the same cell at every step of a sequence.

0.40
0.85
0.50
0.70
0.80

§ 12Why this particular mixture of activations

Every gate uses sigmoid; the candidate and the output use tanh. This isn't arbitrary — each choice has a specific role.

Component Activation Range Role
forget gate f σ [0, 1] switch — fraction of past to keep
input gate i σ [0, 1] switch — fraction of candidate to write
output gate o σ [0, 1] switch — fraction of cell to expose
candidate c̃ tanh [−1, 1] signed content to add to memory
cell readout tanh(c) [−1, 1] squash unbounded cell to stable range

The mnemonic: sigmoid is for how much, tanh is for what. Gates control flow (sigmoid); content carries signed information (tanh).

§ 13Why LSTM solves the vanishing-gradient problem

Look at the cell-state recurrence in isolation: ct = ft ⊙ ct-1 + (writes). The Jacobian of the cell state with respect to its predecessor is just

∂ct / ∂ct-1 = diag( ft )

No tanh derivative, no weight matrix multiplication. If the forget gate is near 1, the gradient passes through unchanged. The product across many steps is ∏ ft — and the network can learn to keep specific entries of f close to 1 for as long as the information is needed.

This is the entire engineering achievement of the LSTM: turn a multiplicative recurrence (where gradients die) into a near-identity gated recurrence (where gradients survive). Everything else — the input gate, the output gate, the tanh squashes — is in service of making that highway behave well.

§ 14LSTM in code

A minimal NumPy step (single hidden unit, for illustration):

# inputs: x_t (input), h_prev, c_prev
# params: W_f, W_i, W_c, W_o (each takes [h_prev, x_t])

z = np.concatenate([h_prev, x_t])

f = sigmoid(W_f @ z + b_f)        # forget gate
i = sigmoid(W_i @ z + b_i)        # input gate
c_tilde = np.tanh(W_c @ z + b_c)  # candidate
o = sigmoid(W_o @ z + b_o)        # output gate

c = f * c_prev + i * c_tilde      # cell state update
h = o * np.tanh(c)                # hidden state output

In PyTorch you'd write nn.LSTM(input_size, hidden_size, num_layers) and forget the gate equations — but every modern framework's LSTM module is exactly the six lines above, vectorized.

Part III
Gated Recurrent Unit

§ 15GRU — architecture overview

The GRU (Cho et al., 2014) is the LSTM's simplified cousin. It fuses the LSTM's forget and input gates into a single update gate, drops the separate cell state, and gets close to LSTM performance with ~25% fewer parameters.

reset gate rt = σ( Wr [ht-1, xt] + br )
update gate zt = σ( Wz [ht-1, xt] + bz )
candidatet = tanh( Wh [rt ⊙ ht-1, xt] + bh )
hidden state ht = (1 − zt) ⊙ ht-1 + zt ⊙ h̃t

Compared to LSTM:

§ 16Reset gate σ

The reset gate decides how much of the past hidden state should influence the new candidate. Notice where rt appears:

t = tanh( Wh [ rt ⊙ ht-1, xt] + bh )

When rt ≈ 0, the past is effectively erased for the purpose of computing the candidate — the new state is computed from xt alone (plus the bias). When rt ≈ 1, the past hidden state contributes fully.

Use cases:

§ 17Update gate σ

The update gate is the GRU's most important component. It does the job of both the LSTM's forget gate and input gate, by a clever trick of coupling them:

ht = (1 − zt) ⊙ ht-1 + zt ⊙ h̃t

Compare with the LSTM cell update: ct = ft ⊙ ct-1 + it ⊙ c̃t. The LSTM has two independent gates; the GRU forces it = 1 − ft. That is, "how much you write" is tied to "how much you erase".

why coupling works In practice, "remember" and "write" are usually opposed — if you decide to overwrite, you'd want to forget too. Tying them costs almost nothing in expressiveness and saves a whole gate's worth of parameters.

§ 18Candidate & final hidden state

Putting it together as one time step:

  1. Compute reset gate rt and update gate zt from [ht-1, xt].
  2. Use the reset gate to make a conditional view of the past: rt ⊙ ht-1.
  3. Compute the candidate t from that conditional past plus the input.
  4. Blend old and candidate via the update gate.
Demo 04
GRU cell — interactive gates

Set the inputs and gates, watch the candidate and final hidden state. Try the presets to see how the two gates collaborate.

0.60
0.40
0.70
0.40

§ 19Why GRU can skip the cell state

The LSTM keeps two states (c, h) so that the internal memory is allowed to grow unboundedly while the exposed hidden state stays bounded via o ⊙ tanh(c). The GRU sidesteps this by keeping the candidate t already in [−1, 1] (it's a tanh) and using a convex combination — so ht stays bounded automatically.

The downside: GRU's "memory" can't store arbitrarily large values, so it has slightly less capacity for tasks that need fine-grained accumulators (e.g. counting). For most tasks the difference is invisible.

gradient flow in GRU The update gate gives GRU the same identity-shortcut property as LSTM. When zt ≈ 0, ht ≈ ht-1, and the gradient flows back through this near-identity path without being killed by repeated multiplications. Vanishing gradients are mitigated just as effectively as in the LSTM.

§ 20GRU in code

# inputs: x_t, h_prev
# params: W_r, W_z, W_h

z_input = np.concatenate([h_prev, x_t])

r = sigmoid(W_r @ z_input + b_r)         # reset gate
z = sigmoid(W_z @ z_input + b_z)         # update gate

# candidate uses h_prev *gated by r*
h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)

h = (1 - z) * h_prev + z * h_tilde       # final hidden state

Four lines instead of six. In PyTorch: nn.GRU(input_size, hidden_size, num_layers).

Part IV
Comparison & context

§ 21LSTM vs GRU — parameter count

For input size dx and hidden size dh, each gate/candidate involves a weight matrix of shape dh × (dx + dh) plus a bias of size dh.

ComponentLSTMGRU
Number of gates3 (f, i, o)2 (r, z)
Candidatec̃ (tanh)h̃ (tanh)
Recurrent states2 (c, h)1 (h)
Weight matrices43
Total parameters4 · dh(dx + dh + 1)3 · dh(dx + dh + 1)
For dx=128, dh=256~394k~296k

GRU has 75% the parameters of LSTM at the same hidden size — meaningfully smaller, faster to train, and a bit faster at inference.

§ 22When to use which

prefer LSTM when…
  • You have a lot of data and capacity isn't a bottleneck.
  • The task requires fine-grained, long-term memory (e.g. counting, dependency tracking across many tokens).
  • You're following established baselines — most pre-transformer literature uses LSTM.
prefer GRU when…
  • You have limited data — fewer parameters means less overfitting.
  • You need faster training or inference.
  • The task involves shorter sequences or simpler temporal patterns.
  • Hardware is constrained (embedded, on-device).

Empirically, on most NLP and time-series tasks, the two perform within noise of each other. Pick the cheaper one (GRU) unless you have a specific reason to think you need the extra capacity.

the bigger context Both LSTM and GRU have largely been displaced by transformers for most sequence tasks since 2017–2018. The attention mechanism gives O(1) "distance" between any two tokens (no sequential propagation needed), and is more parallelizable. RNNs still shine for streaming inference, very long context with constant memory, and resource-constrained deployment — but for new projects on large datasets, transformers are usually the default.

§ 23Bidirectional & stacked variants

Bidirectional RNN/LSTM/GRU

A unidirectional RNN at step t only sees the past (x1, …, xt). For tagging tasks (POS tagging, NER) you want the future too — what follows often disambiguates the current token.

A bidirectional layer runs two independent RNNs: one forward () and one backward (). The final representation at step t is the concatenation [ht, ht]. Doubles the parameters and requires the full sequence in memory (so it's not used for streaming or autoregressive generation).

Stacked (deep) RNNs

Just like deep feed-forward networks, you can stack RNN layers: the hidden states of layer become the inputs of layer ℓ+1. Each layer can capture features at a different temporal granularity. 2–4 layers is typical; beyond that, returns diminish and training gets fragile (residual connections help).

§ 24Activation function summary

Architecture Component Activation Why
Vanilla RNN hidden state tanh zero-centered, bounded, stronger gradient than sigmoid
output (task-dep.) softmax / linear / sigmoid matches loss function
LSTM forget gate f σ fractional switch [0, 1]
input gate i σ fractional switch [0, 1]
candidate c̃ tanh signed content carrier
output gate o σ fractional switch [0, 1]
cell readout tanh(c) bound exposed state
GRU reset gate r σ fractional switch [0, 1]
update gate z σ fractional switch [0, 1]
candidate h̃ tanh signed content carrier

The whole rule of thumb: σ for gates (how much), tanh for content (what). Every architecture in this companion follows it.

§ 25Reference — data flow at a glance

Cell Inputs in States carried Recurrence shape Vanishing fix
RNN xt, ht-1 ht ht = tanh(Wx + Uh + b) none — gradients vanish
LSTM xt, ht-1, ct-1 ht, ct ct = f·ct-1 + i·c̃
ht = o·tanh(ct)
additive cell-state highway
GRU xt, ht-1 ht ht = (1−z)·ht-1 + z·h̃ convex-combination update

Companion 05 of 05 · regression metrics · classification & regression · activations & losses · LLM & RAG evaluation · recurrent networks.