Comprehensive Master Notes: RNN

The Architecture, Mathematics, and Flow of Recurrent Neural Networks

Anchor example: "The cat sat" → sentiment class "positive"

Original paper: Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211.

1. Architectural Taxonomy

Recurrent Neural Networks are partitioned into structural archetypes based on how temporal sequences are ingested and how outputs are projected.

Architecture typeInput → Output mappingTemporal dynamicsUse cases
Many-to-OneFull sequence → single vector/labelProcess all \(T\) steps, use final \(h_T\) for classificationSentiment analysis, document classification
One-to-ManySingle input → sequence outputInitial hidden state seeds sequential generationImage captioning, music generation
Many-to-Many (aligned)\(T\) inputs → \(T\) outputs (same length)Output at each timestep, parallel sequencesPOS tagging, NER, frame labeling
Encoder-DecoderSequence → different-length sequenceEncoder compresses; decoder generatesTranslation (pre-attention), summarization

2. Data Representation: From Text to State-Compatible Tensors

Text strings are discrete symbolic tokens. The RNN requires continuous fixed-size vectors at each timestep. The ingestion pipeline converts symbols into a temporal tensor.

Step A: Tokenization & ID Mapping

raw string"The cat sat"
tokenization["The", "cat", "sat"] → token IDs \([42,\,108,\,311]\)
sequence length\(T = 3\)

Step B: Embedding Lookup

operation\(x_t = W_{\text{emb}}[\,\text{id}_t,\,:\,], \qquad W_{\text{emb}} \in \mathbb{R}^{|V| \times d}\)
result\(X = [x_1, x_2, x_3] \in \mathbb{R}^{T \times d} = \mathbb{R}^{3 \times 256}\)
why this operation
Integer token IDs carry no distance metric — token 42 is not "closer" to token 43 than to token 10000. Embedding lookup maps each ID to a learned dense vector in continuous space where semantic similarity translates to geometric proximity. This gives the recurrent cell meaningful numeric input with smooth gradients for end-to-end training.

Step C: Initial Hidden State

initialization\(h_0 = \mathbf{0} \in \mathbb{R}^{d_h}\)  (zeros) or a learned parameter vector
shape\(h_0 \in \mathbb{R}^{d_h}\),  \(d_h\) = hidden dimension (e.g., 512)
why this operation
The recurrence requires a "seed" state before any input arrives. Zero initialization is the default (no prior assumption). A learned \(h_0\) can encode task-specific priors but is rarely critical in practice.

3. Deep Dive: Forward Operations

At every timestep \(t\), the RNN cell performs a single recurrent computation combining new input with accumulated history.

The Core Recurrence Equation

pre-activation\(a_t = W_x\, x_t + W_h\, h_{t-1} + b_h\)
state update\(h_t = \tanh(a_t) \in (-1, +1)^{d_h}\)
output projection\(o_t = W_o\, h_t + b_o\)
probability\(p_t = \operatorname{softmax}(o_t)\)
h₀ tanh x₁ o₁ tanh x₂ o₂ tanh x₃ o₃ h₁ (Wₕ) h₂ (Wₕ) h₃
The same \(W_x,\,W_h,\,W_o\) are reused at every timestep — the cell is one shared function applied recurrently. Blue = inputs, green = state, amber = outputs.

Matrix Dimensions

MatrixShapeRole
\(W_x\)\((d_h \times d_{\text{in}})\)Projects input embedding into hidden space
\(W_h\)\((d_h \times d_h)\)Projects previous hidden state (memory recurrence)
\(b_h\)\((d_h)\)Bias for state transition
\(W_o\)\((|V| \times d_h)\)Projects hidden state to output/vocabulary space

Operation-by-Operation Breakdown

Operation 1: \(W_x\, x_t\) (Input Projection)

math\((d_h \times d_{\text{in}}) \cdot (d_{\text{in}} \times 1) = (d_h \times 1)\)
example\((512 \times 256) \cdot (256 \times 1) = (512 \times 1)\)
why this operation
The input embedding lives in a different vector space (semantic word space) than the hidden state (task-specific recurrent space). This learned linear transformation maps the current word into the correct coordinate system where it can be meaningfully combined with the historical state vector.

Operation 2: \(W_h\, h_{t-1}\) (Recurrent Projection)

math\((d_h \times d_h) \cdot (d_h \times 1) = (d_h \times 1)\)
example\((512 \times 512) \cdot (512 \times 1) = (512 \times 1)\)
why this operation
This is the memory mechanism. The previous hidden state \(h_{t-1}\) compresses the entire history of tokens \([x_1, \ldots, x_{t-1}]\) into a fixed-size vector. Multiplying by \(W_h\) selectively transforms which historical features are relevant for the current timestep. The same \(W_h\) is applied at every timestep — weight sharing across time.

Operation 3: Addition + tanh (Nonlinear Fusion)

addition\(a_t = W_x\, x_t + W_h\, h_{t-1} + b_h\)
squash\(h_t = \tanh(a_t) \in (-1, +1)^{d_h}\)
why this operation
Addition merges present input with past context into one activation vector. The \(\tanh\) squashing is essential: without it, repeated linear maps through time would let activations explode or collapse. \(\tanh\) bounds values to \([-1,+1]\), a soft saturation limiter that keeps signal magnitude stable across arbitrary sequence lengths.

The tanh Derivative (needed for every gradient below)

Because \(\tanh\) is the only nonlinearity in the cell, its derivative appears in every backward pass. From \(\tanh z = \frac{e^z - e^{-z}}{e^z + e^{-z}}\):

scalar\(\dfrac{d}{dz}\tanh z = 1 - \tanh^2 z\)
applied to the cell\(\dfrac{\partial h_t}{\partial a_t} = \operatorname{diag}\!\big(1 - h_t^2\big)\)
why this matters
Since \(h_t \in (-1,1)\), every diagonal entry \(1 - h_{t,i}^2 \in (0, 1]\). It equals \(1\) only when \(h_{t,i}=0\) (the unsaturated centre) and tends to \(0\) as the unit saturates toward \(\pm 1\). This factor \(\le 1\) is exactly what shrinks gradients in §5 — saturation throttles the learning signal.

Key Insight — Weight Sharing Across Time:

critical architectural property
The exact same matrices \(W_x, W_h, W_o\) are used at timestep 1, timestep 50, and timestep 1000. This means: (1) model size is independent of sequence length, (2) the network learns temporal patterns that generalize across positions, and (3) gradients from all timesteps flow into the same parameter matrices during training.

4. Backpropagation Through Time (BPTT)

Training requires \(\partial L/\partial W\) for all shared weight matrices. Because \(W_h\) participates at every timestep, its gradient accumulates contributions from the entire temporal chain.

The Unrolled Gradient Chain

Write \(\dfrac{\partial^{+} h_k}{\partial W_h}\) for the immediate Jacobian — the dependence of \(h_k\) on \(W_h\) holding \(h_{k-1}\) fixed. The total gradient then sums every path by which \(W_h\) reaches the loss:

total loss\(\displaystyle L = \sum_{t=1}^{T} L_t\)  (or just \(L_T\) for many-to-one)
gradient of \(W_h\)\(\displaystyle \frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial h_t} \sum_{k=1}^{t} \frac{\partial h_t}{\partial h_k}\, \frac{\partial^{+} h_k}{\partial W_h}\)
chain through time\(\displaystyle \frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}\)

The Single-Step Jacobian

Differentiate \(h_i = \tanh(W_x x_i + W_h h_{i-1} + b_h)\) with respect to \(h_{i-1}\). By the chain rule, \(\partial h_i/\partial a_i = \operatorname{diag}(1-h_i^2)\) and \(\partial a_i/\partial h_{i-1} = W_h\), so:

single step\(\displaystyle J_i \equiv \frac{\partial h_i}{\partial h_{i-1}} = \operatorname{diag}\!\big(1 - h_i^2\big)\, W_h\)
multi-step product\(\displaystyle \frac{\partial h_t}{\partial h_k} = \prod_{i=k+1}^{t} \operatorname{diag}\!\big(1 - h_i^2\big)\, W_h\)

This is a product of \((t-k)\) matrices. The behaviour of this product — not any single matrix — determines whether training succeeds or fails.

why this operation matters
BPTT is the chain rule applied to a computation graph "unrolled" across time. Each timestep is a layer in a very deep network, and the gradient traverses backwards through all \(T\) "layers" to update the shared weights. An RNN on a 100-token sequence has the gradient depth of a 100-layer network — but with shared weights, so all those layers push into the same \(W_h\).

5. Gradient Pathology: Why Vanilla RNNs Fail

The fate of training hinges on the norm of the Jacobian product \(\partial h_t/\partial h_k\). Take the spectral norm \(\lVert\cdot\rVert_2\) (largest singular value) and use sub-multiplicativity:

bound\(\displaystyle \Big\lVert \frac{\partial h_t}{\partial h_k} \Big\rVert_2 \le \prod_{i=k+1}^{t} \big\lVert \operatorname{diag}(1-h_i^2) \big\rVert_2 \,\big\lVert W_h \big\rVert_2 \le \big(\gamma\, \sigma_{\max}(W_h)\big)^{\,t-k}\)
tanh factor\(\gamma = \sup_i \big\lVert 1 - h_i^2 \big\rVert_\infty \le 1, \qquad \sigma_{\max}(W_h) = \lVert W_h \rVert_2\)
the correction worth internalizing
It is not "the eigenvalue of \(W_h\)" that governs gradient flow — it is the spectral norm of the per-step Jacobian \(\operatorname{diag}(1-h_i^2)\,W_h\), which folds the \(\tanh\) derivative (\(\le 1\)) into \(W_h\). The base of the exponential is the product \(\gamma\,\sigma_{\max}(W_h)\), not \(\sigma_{\max}(W_h)\) alone.

Vanishing Gradients

condition\(\gamma\, \sigma_{\max}(W_h) < 1\)
effect\(\big\lVert \partial h_t/\partial h_k \big\rVert \lesssim (\gamma\,\sigma_{\max})^{t-k} \to 0\) exponentially in \((t-k)\)
consequencethe gradient from step \(T\) reaches step 1 as essentially zero
practical failure
The network cannot learn long-range dependencies. In "I grew up in France … I speak fluent ___", the signal from the blank decays to near-zero before reaching "France" 20+ tokens earlier. Worse, \(\gamma \le 1\) means even a perfectly tuned \(\sigma_{\max}(W_h)=1\) still vanishes whenever units saturate (\(\gamma<1\)).

Exploding Gradients

condition\(\gamma\, \sigma_{\max}(W_h) > 1\)  (necessary: \(\sigma_{\max}(W_h) > 1\))
effect\(\big\lVert \partial h_t/\partial h_k \big\rVert \to \infty\) exponentially
consequenceweight updates blow up, parameters diverge to NaN
fix: gradient clipping
If \(\lVert g \rVert > \tau\):  \(g \leftarrow g \cdot \tau / \lVert g \rVert\). This bounds the gradient norm but does not fix vanishing. Clipping is a bandaid for explosion; gating (LSTM/GRU) is the architectural cure for vanishing.
‖∂hₜ/∂hₖ‖ temporal distance (t − k) ideal: γ·σₘₐₓ = 1 exploding (γ·σₘₐₓ > 1) vanishing (γ·σₘₐₓ < 1)
Because the Jacobian norm is bounded by \((\gamma\,\sigma_{\max})^{t-k}\), any base \(\neq 1\) is an exponential. Stable long-range gradients require the base to sit on a knife-edge at exactly \(1\) — which vanilla RNNs cannot maintain while still learning.

The Fundamental Tension

why LSTM/GRU were invented
The vanilla RNN faces an impossible optimization: the per-step Jacobian must have norm exactly \(1\) for stable long-range gradients — but that pins down \(W_h\) and forbids learning dynamic, context-dependent transformations. LSTM resolves this by adding a separate additive memory path (the cell state) where the gradient flows through addition rather than repeated matrix multiplication, sidestepping the constraint entirely.

6. End-to-End Operational Lifecycle

[Raw Text: "The cat sat"]
        │
        ▼
[Tokenize → IDs: [42, 108, 311]] ──────────► Shape: (T=3,)
        │
        ▼
[Embedding Lookup: W_emb] ─────────────────► X: (3, 256)
        │
        ▼
┌──────────────────────────────────────────────────────────┐
│ RECURRENT STATE MACHINE (same weights at every step) │
├──────────────────────────────────────────────────────────┤
│ t=1: h_1 = tanh(W_x·x_1 + W_h·h_0 + b) → (512,) │
│ t=2: h_2 = tanh(W_x·x_2 + W_h·h_1 + b) → (512,) │
│ t=3: h_3 = tanh(W_x·x_3 + W_h·h_2 + b) → (512,) │
└──────────────────────────────────────────────────────────┘
        │
        ▼
[Take h_T = h_3 as sequence representation] ► (512,)
        │
        ▼
[Linear: W_cls · h_T + b_cls] ────────────► logits: (num_classes,)
        │
        ▼
[Softmax → Argmax] ───────────────────────► "positive"

TRAINING: Loss → BPTT through all T steps → ∂L/∂W accumulated → Optimizer

7. Interview Depth Q&A

QuestionStrong answer pattern
Why does an RNN need weight sharing?It enables processing arbitrary-length sequences with fixed parameters, and forces the model to learn position-invariant temporal patterns.
What information does \(h_t\) encode?A lossy compressed summary of the entire input prefix \(x_1 \ldots x_t\), biased toward recent tokens due to gradient decay.
Why \(\tanh\) and not ReLU in a vanilla RNN?ReLU is unbounded; repeated application through the recurrence causes activation explosion. \(\tanh\) bounds to \([-1,1]\) for stable recurrent dynamics. (LSTM gets ReLU-like un-saturated memory via the additive cell state instead.)
It's "eigenvalue < 1", right?Sharpen it: the relevant quantity is the spectral norm of the per-step Jacobian \(\operatorname{diag}(1-h^2)\,W_h\). The \(\tanh\) factor \(\le 1\) means saturation alone can vanish gradients even when \(\sigma_{\max}(W_h)=1\).
What is truncated BPTT?Limiting gradient computation to the last \(K\) timesteps instead of all \(T\). Reduces memory/compute from \(O(T)\) to \(O(K)\) but cannot learn dependencies longer than \(K\) steps.
RNN vs CNN for sequences?RNN is inherently sequential (\(O(T)\) serial steps). CNN parallelizes across positions but needs stacking/dilation for long-range context. Transformers supersede both with \(O(1)\) sequential depth via attention.