Notebook excerpts
A plain-text scan of every section in this note — the interactive, fully-styled version is in the reader above. Use whichever helps.
01
§ 01 The sequence problem
A feed-forward network sees one input and produces one output, with no notion of before or after. But language, audio, time series, and DNA are sequences: order matters, and earlier tokens condition later ones. To handle this, the network needs memory.
02
§ 02 The RNN cell
The vanilla (Elman) RNN cell is one matrix-multiply, one bias, and one nonlinearity:
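A minimal sketch of that cell, assuming the previous hidden state and the current input are concatenated so that a single weight matrix `W` covers both (its left block then plays the role of W_hh). The names `rnn_cell`, `W`, `b` and the sizes are illustrative assumptions.

```python
import numpy as np

def rnn_cell(x_t, h_prev, W, b):
    """One vanilla (Elman) RNN step: one matrix-multiply, one bias, one tanh."""
    return np.tanh(W @ np.concatenate([h_prev, x_t]) + b)

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                                    # illustrative sizes
W = rng.normal(scale=0.5, size=(d_h, d_h + d_x))   # [W_hh | W_xh] side by side
b = np.zeros(d_h)

h_next = rnn_cell(rng.normal(size=d_x), np.zeros(d_h), W, b)
print(h_next.shape)  # (4,)
```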
03
§ 03 Unrolling through time — data flow
The clearest way to see an RNN is unrolled: copies of the same cell drawn side by side, with the hidden state flowing rightward from one to the next.
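In code, unrolling is just a loop that reuses one set of weights at every step. The sketch below assumes a tanh cell with concatenated input; `rnn_forward` and the shapes are hypothetical names chosen for illustration.

```python
import numpy as np

def rnn_forward(xs, h0, W, b):
    """Apply the same cell at every time step and collect the hidden states."""
    h, hs = h0, []
    for x_t in xs:                       # one "copy" of the cell per step
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        hs.append(h)                     # h is handed on to the next step
    return np.stack(hs)

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 3, 4                    # illustrative sizes
xs = rng.normal(size=(T, d_x))
W = rng.normal(scale=0.5, size=(d_h, d_h + d_x))
hs = rnn_forward(xs, np.zeros(d_h), W, np.zeros(d_h))
print(hs.shape)  # (5, 4)
```

The weights `W` and `b` are created once and shared by all five steps, which is what "copies of the same cell" means.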
04
§ 04 Activations: why tanh inside the cell
The vanilla RNN almost always uses tanh for the hidden state. Three reasons:
05
§ 05 Backprop through time (BPTT)
Training an RNN is just standard backprop applied to the unrolled graph. The loss at time T depends on every earlier hidden state, so the gradient flows backward through every cell:
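As a sketch of that chain of dependencies, assume the cell has the form h_t = tanh(W_hh h_{t-1} + W_xh x_t + b) and write L_T for the loss at time T (these symbol names are assumptions). The gradient reaching an earlier step k is then a product of one Jacobian per intervening cell:

```latex
\frac{\partial L_T}{\partial h_k}
  = \frac{\partial L_T}{\partial h_T}\,
    J_T\, J_{T-1} \cdots J_{k+1},
\qquad
J_t = \frac{\partial h_t}{\partial h_{t-1}}
    = \operatorname{diag}\!\left(1 - h_t^{2}\right) W_{hh}
```

Each factor J_t contains both the tanh derivative (1 - h_t^2) and the recurrent weight matrix, so the length of the product grows with the distance T - k.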
06
§ 06 Vanishing & exploding gradients
This is the central failure mode of vanilla RNNs. With tanh' ≤ 1 everywhere and a typical W_hh spectral radius near 1, gradients shrink by roughly a constant factor each step. After 20 or 30 steps, the gradient that should reach early time steps is effectively zero, so the network can't learn long-range dependencies.
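The shrinkage can be checked numerically. The sketch below makes a deliberately simple assumption: W_hh is 0.9 times a random orthogonal matrix, so its spectral norm is exactly 0.9 and each step can shrink the gradient by at least that factor. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, T = 16, 4, 30                  # illustrative sizes

# Assumption: W_hh = 0.9 * (random orthogonal matrix), spectral norm 0.9.
Q, _ = np.linalg.qr(rng.normal(size=(d_h, d_h)))
W_hh = 0.9 * Q
W_xh = rng.normal(scale=0.5, size=(d_h, d_x))

h = np.zeros(d_h)
grad = np.eye(d_h)                       # accumulates d h_t / d h_0
norms = []
for _ in range(T):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=d_x))
    J = np.diag(1.0 - h**2) @ W_hh       # d h_t / d h_{t-1}
    grad = J @ grad
    norms.append(np.linalg.norm(grad, 2))

print(f"step 1: {norms[0]:.3f}   step {T}: {norms[-1]:.2e}")
```

Because the tanh derivative never exceeds 1, the spectral norm of the accumulated Jacobian is bounded by 0.9 to the power t here, which is already below 0.05 at t = 30.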
07
§ 07 LSTM — architecture overview
The LSTM (Hochreiter & Schmidhuber, 1997) introduces two innovations:
08
§ 08 Forget gate (σ)
The forget gate decides what to throw away from the previous cell state. It takes the previous hidden state and current input, projects them, and squashes through a sigmoid:
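Written out, assuming the concatenation convention used by the code in § 20 and with W_f and b_f as assumed names for the gate's weights and bias:

```latex
f_t = \sigma\!\left(W_f\,[h_{t-1};\, x_t] + b_f\right),
\qquad f_t \in (0, 1)^{d_h}
```

An entry of f_t near 1 keeps the corresponding component of the previous cell state; an entry near 0 erases it.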
09
§ 09 Input gate & candidate values (σ, tanh)
Adding to memory takes two components. The candidate proposes new content; the input gate decides how much of that content actually gets written.
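In equation form, with W_i, b_i, W_c, b_c as assumed names for the two sets of parameters and the same concatenated input [h_{t-1}; x_t]:

```latex
\begin{aligned}
i_t         &= \sigma\!\left(W_i\,[h_{t-1};\, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\!\left(W_c\,[h_{t-1};\, x_t] + b_c\right)
\end{aligned}
```

The candidate lies in (-1, 1) and can push a memory component up or down; the gate lies in (0, 1) and only scales how much of it is written.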
10
§ 10 Cell state update — the highway
The forget gate erases, the input gate writes. The cell state update is then an additive combination — and additivity is the whole reason LSTMs solve the vanishing-gradient problem:
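A small worked example of the update c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t, with hand-picked gate values (all numbers are made up for illustration):

```python
import numpy as np

c_prev  = np.array([2.0, -1.0])   # previous cell state
f       = np.array([1.0,  0.0])   # forget gate: keep unit 0, erase unit 1
i       = np.array([0.0,  1.0])   # input gate: block unit 0, write unit 1
c_tilde = np.array([0.5,  0.5])   # candidate values

c = f * c_prev + i * c_tilde      # additive cell-state update
print(c)  # unit 0 keeps 2.0, unit 1 is overwritten with 0.5
```

With f = 1 and i = 0, a unit is copied forward unchanged, which is the "highway" behaviour the section title refers to.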
11
§ 11 Output gate — what to expose (σ, tanh)
The cell state is the internal memory. The hidden state is what the rest of the network (and the next time step) sees. The output gate decides which parts of the cell state to expose:
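As equations, with W_o and b_o as assumed names for the output gate's parameters:

```latex
\begin{aligned}
o_t &= \sigma\!\left(W_o\,[h_{t-1};\, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The tanh squashes the cell state into (-1, 1) and the gate then scales each component, so h_t is always bounded even when c_t is not.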
12
§ 12 Why this particular mixture of activations
Every gate uses sigmoid; the candidate and the output use tanh. This isn't arbitrary: each choice has a specific role.
13
§ 13 Why LSTM solves the vanishing-gradient problem
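Look at the cell-state recurrence in isolation: c_t = f_t ⊙ c_{t-1} + (writes). Treating the gates as constants, the Jacobian of the cell state with respect to its predecessor is just diag(f_t): the forget gate itself, with no weight matrix and no tanh derivative in the way.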
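Chained over several steps, and under the simplifying assumption that only the direct cell-state path is followed (the gates' own dependence on earlier hidden states is ignored), the product of these Jacobians stays diagonal:

```latex
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\quad\Longrightarrow\quad
\frac{\partial c_T}{\partial c_k}
  = \prod_{t=k+1}^{T} \operatorname{diag}(f_t)
  = \operatorname{diag}\!\left(f_{k+1} \odot \cdots \odot f_T\right)
```

If the network learns forget-gate values close to 1 for some unit, the product stays close to 1 as well: for example, 0.99 multiplied by itself 30 times is still about 0.74.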
14
§ 14 LSTM in code
A minimal NumPy step (single hidden unit, for illustration):
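One way such a step could look, assuming the standard LSTM equations and a parameter dictionary `p` whose key names (`W_f`, `b_f`, and so on) are hypothetical:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; p maps assumed names W_*/b_* to weights and biases."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate values
    c = f * c_prev + i * c_tilde                 # additive cell-state update
    o = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate
    h = o * np.tanh(c)                           # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
d_x, d_h = 3, 1                                  # single hidden unit
p = {}
for g in "fico":                                 # forget, input, candidate, output
    p[f"W_{g}"] = rng.normal(scale=0.5, size=(d_h, d_h + d_x))
    p[f"b_{g}"] = np.zeros(d_h)

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_x)):            # five random inputs
    h, c = lstm_step(x_t, h, c, p)
print(h, c)
```

The six lines computing f, i, c_tilde, c, o, and h correspond one-to-one to §§ 08 to 11.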
15
§ 15 GRU — architecture overview
The GRU (Cho et al., 2014) is the LSTM's simplified cousin. It fuses the LSTM's forget and input gates into a single update gate, drops the separate cell state, and gets close to LSTM performance with about 25% fewer parameters (three gate/candidate weight blocks instead of four).
16
§ 16 Reset gate (σ)
The reset gate decides how much of the past hidden state should influence the new candidate. Notice where r_t appears:
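In equation form, using the same symbols as the code in § 20:

```latex
\begin{aligned}
r_t         &= \sigma\!\left(W_r\,[h_{t-1};\, x_t] + b_r\right) \\
\tilde{h}_t &= \tanh\!\left(W_h\,[\,r_t \odot h_{t-1};\, x_t\,] + b_h\right)
\end{aligned}
```

The gate multiplies h_{t-1} inside the candidate, before the weight matrix is applied. With r_t near 0 the candidate is computed almost from x_t alone.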
17
§ 17 Update gate (σ)
The update gate is the GRU's most important component. It does the job of both the LSTM's forget gate and input gate, by a clever trick of coupling them:
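Written with the convention of the code in § 20 (some references and libraries swap the roles of z_t and 1 - z_t, so treat the orientation as an assumption):

```latex
\begin{aligned}
z_t &= \sigma\!\left(W_z\,[h_{t-1};\, x_t] + b_z\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Here 1 - z_t plays the part of the forget gate and z_t the part of the input gate. Because the two weights sum to 1, writing more new content necessarily means keeping less of the old.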
18
§ 18 Candidate & final hidden state
Set the inputs and gates, watch the candidate and final hidden state. Try the presets to see how the two gates collaborate.
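The same experiment can be sketched in a few lines for a single scalar unit. The gates are set by hand rather than computed, the weights `w_h` and `w_x` default to 1, and the preset names are invented for this illustration.

```python
import numpy as np

def gru_output(h_prev, x_t, r, z, w_h=1.0, w_x=1.0):
    """Scalar GRU update with hand-set gates r and z (weights are assumed)."""
    h_tilde = np.tanh(w_h * (r * h_prev) + w_x * x_t)   # candidate
    h = (1.0 - z) * h_prev + z * h_tilde                # final hidden state
    return h_tilde, h

presets = {                                             # hypothetical presets
    "keep memory (z=0)":      dict(r=1.0, z=0.0),
    "overwrite, ignore past": dict(r=0.0, z=1.0),
    "overwrite, use past":    dict(r=1.0, z=1.0),
    "blend half and half":    dict(r=1.0, z=0.5),
}

h_prev, x_t = 0.8, -0.5
results = {}
for name, gates in presets.items():
    results[name] = gru_output(h_prev, x_t, **gates)
    h_tilde, h = results[name]
    print(f"{name:24s} h_tilde={h_tilde:+.3f}  h={h:+.3f}")
```

With z = 0 the hidden state is copied unchanged whatever the candidate is; with z = 1 and r = 0 the new state depends on the input alone.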
19
§ 19 Why GRU can skip the cell state
The LSTM keeps two states (c, h) so that the internal memory is allowed to grow unboundedly while the exposed hidden state stays bounded via o ⊙ tanh(c). The GRU sidesteps this by keeping the candidate h̃_t already in [−1, 1] (it's a tanh) and using a convex combination, so h_t stays bounded automatically.
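The boundedness argument can be spelled out per component, assuming the update h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t with z_t in (0, 1):

```latex
|h_t| \le (1 - z_t)\,|h_{t-1}| + z_t\,|\tilde{h}_t|
      \le (1 - z_t)\cdot 1 + z_t \cdot 1 = 1
\quad \text{whenever } |h_{t-1}| \le 1
```

Starting from an initial state inside [-1, 1] (for example h_0 = 0, an assumption here), induction over t keeps every later h_t inside the same interval.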
20
§ 20 GRU in code
    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    # inputs: x_t, h_prev
    # params: W_r, W_z, W_h (weights), b_r, b_z, b_h (biases)
    z_input = np.concatenate([h_prev, x_t])
    r = sigmoid(W_r @ z_input + b_r)    # reset gate
    z = sigmoid(W_z @ z_input + b_z)    # update gate
    # candidate uses h_prev *gated by r*
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    h = (1 - z) * h_prev + z * h_tilde  # final hidden state

Four lines (r, z, h_tilde, h) instead of the LSTM's six. In PyTorch: nn.GRU(input_size, hidden_size, num_layers).
21
§ 21 LSTM vs GRU — parameter count
For input size d_x and hidden size d_h, each gate/candidate involves a weight matrix of shape d_h × (d_x + d_h) plus a bias of size d_h.
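Counting with that formula is a one-liner. The helper name `gated_rnn_params` and the sizes below are illustrative assumptions; library implementations may count slightly differently (PyTorch, for instance, keeps two bias vectors per block).

```python
def gated_rnn_params(d_x, d_h, n_blocks):
    """n_blocks gate/candidate blocks, each a d_h x (d_x + d_h) matrix plus a bias."""
    return n_blocks * (d_h * (d_x + d_h) + d_h)

d_x, d_h = 128, 256                              # illustrative sizes
lstm = gated_rnn_params(d_x, d_h, n_blocks=4)    # forget, input, candidate, output
gru = gated_rnn_params(d_x, d_h, n_blocks=3)     # reset, update, candidate
print(lstm, gru, 1 - gru / lstm)                 # 394240 295680 0.25
```

The saving is exactly one block out of four, whatever the sizes, which is where the 25% figure comes from.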
22
§ 22 When to use which
Empirically, on most NLP and time-series tasks, the two perform within noise of each other. Pick the cheaper one (GRU) unless you have a specific reason to think you need the extra capacity.
23
§ 23 Bidirectional & stacked variants
A unidirectional RNN at step t only sees the past (x_1, …, x_t). For tagging tasks (POS tagging, NER) you want the future too, because what follows often disambiguates the current token.
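A bidirectional layer can be sketched as two independent passes whose states are concatenated per step. The example below uses a plain tanh cell for brevity; `run_rnn`, `bidirectional`, and the parameter tuples are hypothetical names, and a real tagger would more likely use LSTM or GRU cells.

```python
import numpy as np

def run_rnn(xs, W, b):
    """Run a vanilla tanh RNN over xs (shape T x d_x); return all hidden states."""
    h = np.zeros(b.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate a left-to-right pass with a right-to-left pass, step by step."""
    hs_fwd = run_rnn(xs, *fwd_params)
    hs_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]   # flip back to original order
    return np.concatenate([hs_fwd, hs_bwd], axis=1)

rng = np.random.default_rng(0)
T, d_x, d_h = 6, 3, 4                               # illustrative sizes

def make_params():
    return rng.normal(scale=0.5, size=(d_h, d_h + d_x)), np.zeros(d_h)

fwd_params, bwd_params = make_params(), make_params()
xs = rng.normal(size=(T, d_x))
out = bidirectional(xs, fwd_params, bwd_params)
print(out.shape)  # (6, 8)
```

At step t the first d_h entries summarise x_1 … x_t and the last d_h entries summarise x_t … x_T, so each position sees both its past and its future.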
24
§ 24 Activation function summary
The whole rule of thumb: σ for gates (how much), tanh for content (what). Every architecture in this companion follows it.
25
§ 25 Reference — data flow at a glance
Companion 05 of 05 · regression metrics · classification & regression · activations & losses · LLM & RAG evaluation · recurrent networks.