Comprehensive Master Notes: LSTM
The Architecture, Mathematics, and Flow of Long Short-Term Memory Networks
Anchor example: "I grew up in France ... I speak fluent ___" → "French"
Original paper: Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
1. Why LSTM Exists: The Problem It Solves
In a vanilla RNN the gradient from timestep \(T\) back to timestep \(1\) passes through a product of \(T-1\) single-step Jacobians \(J_t = \operatorname{diag}(1-h_t^2)\,W_h\). When this product's spectral norm is below \(1\) — which \(\tanh\) saturation actively pushes it toward, since \(\operatorname{diag}(1-h_t^2)\preceq I\) — the product decays exponentially. The network physically cannot learn that "France" (position 3) determines "French" (position 20+).
vanilla RNN gradient\(\displaystyle \frac{\partial h_T}{\partial h_1} = \prod_{t=2}^{T} \operatorname{diag}(1-h_t^2)\,W_h \;\to\; 0\) as \(T\) grows
LSTM cell gradient\(\displaystyle \frac{\partial C_T}{\partial C_1} = \prod_{t=2}^{T} \operatorname{diag}(f_t) \;\approx\; I\) if \(f_t \approx 1\)
the architectural insight
LSTM replaces the multiplicative state transition \(\big(h_t = \tanh(W_h h_{t-1} + \cdots)\big)\) with an additive cell-state update \(\big(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\big)\). The local gradient through addition is the identity (\(\partial(a+b)/\partial a = 1\)), so the cell state acts as a gradient highway — information can persist across hundreds of timesteps without decay.
2. The Dual-State Architecture
LSTM maintains two parallel state vectors at every timestep, each serving a fundamentally different role:
| State | Symbol | Shape | Role | Analogy |
| Cell State | \(C_t\) | \((d_h,)\) | Long-term memory highway. Updated additively. Rarely read directly. | Hard drive / long-term storage |
| Hidden State | \(h_t\) | \((d_h,)\) | Short-term working memory. Filtered version of \(C_t\) exposed to the world. | RAM / active working register |
why two states
Separating storage (\(C_t\)) from interface (\(h_t\)) lets the network accumulate information over long periods (in \(C_t\)) while selectively exposing only task-relevant portions (via \(h_t\)) at each timestep. A single state vector cannot serve both roles without interference.
3. Deep Dive: Gate Operations
Three sigmoid gates control the flow of information. Each produces a vector in \((0,1)^{d_h}\) acting as a soft binary mask — determining how much of each dimension passes through.
notation: the concatenation is a block matrix
Writing \(W \cdot [h_{t-1}, x_t]\) is shorthand for splitting \(W\) into two blocks: \(W\,[h_{t-1},x_t] = W_{h}\,h_{t-1} + W_{x}\,x_t\), where \(W \in \mathbb{R}^{d_h \times (d_h + d_{\text{in}})}\), \(W_h \in \mathbb{R}^{d_h \times d_h}\), \(W_x \in \mathbb{R}^{d_h \times d_{\text{in}}}\). The two forms are identical; the concatenated form is just how it is implemented as a single matmul.
Gate 1: The Forget Gate \((f_t)\)
equation\(f_t = \sigma\big(W_f \, [h_{t-1}, x_t] + b_f\big)\)
input shape\([h_{t-1}, x_t] \in \mathbb{R}^{d_h + d_{\text{in}}}\) (concatenation)
\(W_f\) shape\((d_h \times (d_h + d_{\text{in}}))\)
output shape\(f_t \in (0, 1)^{d_h}\)
why this operation
The forget gate looks at the current input and previous output, then decides which dimensions of the old cell state to retain. A value of 1.0 means "keep this memory entirely"; 0.0 means "erase it." When processing a new subject after "France", the forget gate might erase the "current country" slot to make room for new information.
Gate 2: The Input Gate \((i_t)\)
equation\(i_t = \sigma\big(W_i \, [h_{t-1}, x_t] + b_i\big)\)
output shape\(i_t \in (0, 1)^{d_h}\)
why this operation
The input gate decides which dimensions of the cell state to write new information into — a selective write mask. Not every token is worth storing: "the" likely triggers near-zero input-gate values, while "France" triggers high values in country-related dimensions.
The Candidate Memory \((\tilde C_t)\)
equation\(\tilde C_t = \tanh\big(W_c \, [h_{t-1}, x_t] + b_c\big)\)
output shape\(\tilde C_t \in (-1, +1)^{d_h}\)
why this operation
This is the proposed new content to potentially write into cell state — a full-strength candidate with \(\tanh\) bounds. The input gate \(i_t\) controls how much actually gets written. Separating "what to write" (\(\tilde C_t\)) from "whether to write" (\(i_t\)) gives fine-grained control.
Gate 3: The Output Gate \((o_t)\)
equation\(o_t = \sigma\big(W_o \, [h_{t-1}, x_t] + b_o\big)\)
output shape\(o_t \in (0, 1)^{d_h}\)
why this operation
The output gate decides which parts of the cell state to expose as the current hidden output. The cell might store "country=France, tense=past, subject=I" but only tense matters for the next word. The output gate filters \(C_t\) before it becomes \(h_t\), keeping internal memory private until needed.
4. Cell State Update: The Memory Highway
the core equation\(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\)
operation typeelement-wise multiply (\(\odot\)) + element-wise add
shape\((d_h) \odot (d_h) + (d_h) \odot (d_h) = (d_h)\)
Decomposing the Two Terms
| Term | Computation | What it achieves |
| \(f_t \odot C_{t-1}\) | forget mask × old memory | Selectively retains useful old information. \(f=1\) passes unchanged; \(f=0\) erases. |
| \(i_t \odot \tilde C_t\) | input mask × new candidate | Selectively writes new information. \(i=1\) accepts the full candidate; \(i=0\) blocks the write. |
why addition is the critical design choice
In a vanilla RNN, \(h_t = \tanh(\mathbf{W_h\,h_{t-1}} + \cdots)\) — the old state is multiplied by a weight matrix, the path that causes gradient decay.
In LSTM, \(C_t = f_t \odot C_{t-1} \mathbf{\,+\,} i_t \odot \tilde C_t\) — the old cell state reaches the new one through addition. The direct gradient \(\partial C_t/\partial C_{t-1}\) is just \(\operatorname{diag}(f_t)\) (a mask), not a dense matrix product. As long as \(f_t \approx 1\), gradients flow backward with minimal decay — the "constant error carousel," the core innovation of LSTM (derived in §6).
Numeric Walkthrough \((d_h = 4)\)
\(C_{t-1}\)\([\,0.8,\ -0.3,\ 0.5,\ 0.9\,]\) (old memory)
\(f_t\)\([\,0.9,\ 0.1,\ 1.0,\ 0.7\,]\) (keep dim0, erase dim1, keep dim2…)
\(i_t\)\([\,0.1,\ 0.8,\ 0.0,\ 0.3\,]\) (write mostly to dim1, nothing to dim2…)
\(\tilde C_t\)\([\,0.2,\ 0.6,\ -0.4,\ 0.1\,]\) (new content proposal)
\(f_t \odot C_{t-1}\)\([\,0.72,\ -0.03,\ 0.50,\ 0.63\,]\)
\(i_t \odot \tilde C_t\)\([\,0.02,\ 0.48,\ 0.00,\ 0.03\,]\)
\(C_t\) (sum)\([\,\mathbf{0.74},\ \mathbf{0.45},\ \mathbf{0.50},\ \mathbf{0.66}\,]\)
interpretationdim1 was erased (\(f{=}0.1\)) then rewritten (\(i{=}0.8\)); dim2 was perfectly preserved (\(f{=}1.0,\,i{=}0.0\)) — pure retention.
5. Output Gate & Hidden State Generation
equation\(h_t = o_t \odot \tanh(C_t)\)
operation 1\(\tanh(C_t)\) → squash cell values to \((-1,+1)\) for a stable output range
operation 2\(o_t \odot (\cdot)\) → selectively expose only relevant dimensions
why tanh on \(C_t\) before output
Cell-state values can grow large through repeated additions. \(\tanh\) re-normalizes them to \([-1,1]\) before they are used for predictions or fed onward. Without it, \(h_t\) could take arbitrary magnitude and destabilize downstream computation.
why the output gate filters
Not all stored memory is relevant at every timestep. The cell may hold "country=France" for 20 steps, but it only needs to surface when predicting language-related words. The output gate keeps irrelevant memory hidden from the prediction layer and the next recurrence.
6. Gradient Flow Analysis
The whole point of the architecture lives here. Differentiate the cell-state update \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\) with respect to \(C_{t-1}\), treating the gates as (locally) fixed — this is the direct path that forms the gradient highway:
- \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\) — the additive update.
- \(\dfrac{\partial C_{t,j}}{\partial C_{t-1,j}} = f_{t,j}\) — each output dimension depends on its own input dimension, scaled by the forget gate.
- \(\dfrac{\partial C_t}{\partial C_{t-1}} = \operatorname{diag}(f_t)\) — a diagonal Jacobian (no cross-dimension mixing, no dense \(W\)).
- \(\dfrac{\partial C_T}{\partial C_k} = \displaystyle\prod_{t=k+1}^{T} \operatorname{diag}(f_t) = \operatorname{diag}\!\Big(\textstyle\prod_{t=k+1}^{T} f_t\Big)\).
cell-state gradient\(\dfrac{\partial C_t}{\partial C_{t-1}} = \operatorname{diag}(f_t)\) (just the forget-gate values)
multi-step\(\dfrac{\partial C_T}{\partial C_k} = \operatorname{diag}\!\big(\textstyle\prod_{t=k+1}^{T} f_t\big)\)
condition for flowif \(f_t \approx 1\) on the relevant dimensions, the product \(\approx 1\) across \(T-k\) steps
the honest caveat
The full Jacobian also has indirect terms, because \(f_t, i_t, \tilde C_t\) all depend on \(h_{t-1}\), which depends on \(C_{t-1}\). Those terms exist — but they are added to the clean \(\operatorname{diag}(f_t)\) highway rather than multiplied into it, so a single near-\(1\) forget gate is enough to keep a non-vanishing path open. Contrast the RNN, where the only path is the decaying matrix product.
comparison with vanilla RNN
| Vanilla RNN | LSTM |
| Gradient path | \(\partial h_t/\partial h_{t-1} = \operatorname{diag}(1-h^2)\,W_h\) (dense matrix) | \(\partial C_t/\partial C_{t-1} = \operatorname{diag}(f_t)\) (diagonal) |
| Decay behavior | spectral-norm-dependent exponential decay | controlled by the learned forget gate |
| Long-range signal | vanishes after ~10–20 steps | preserved for 100–1000+ steps if \(f \approx 1\) |
the key insight
LSTM doesn't "solve" vanishing gradients by accident — it is designed so the gradient path through cell state is element-wise (\(\times f_t\)), never a dense matrix multiplication. The network learns when to set \(f_t\) near \(1\) (preserve) versus near \(0\) (forget). Gradient flow becomes a learned, task-adaptive behavior rather than a fixed architectural limitation.
7. End-to-End Operational Lifecycle
[Input: "I grew up in France ... I speak fluent ___"]
│
▼
[Tokenize → Embed] ──────────────────────► X: (T, d_input)
│
▼ Initialize: h_0 = 0, C_0 = 0
┌──────────────────────────────────────────────────────────────────────┐
│ LSTM CELL (same weights at every timestep) │
├──────────────────────────────────────────────────────────────────────┤
│ FOR EACH t = 1 to T: │
│ │
│ concat = [h_{t-1}, x_t] ─────────────────► (d_h + d_input,) │
│ │
│ f_t = σ(W_f · concat + b_f) ─────────────► forget mask (d_h,) │
│ i_t = σ(W_i · concat + b_i) ─────────────► input mask (d_h,) │
│ C̃_t = tanh(W_c · concat + b_c) ──────────► candidate (d_h,) │
│ o_t = σ(W_o · concat + b_o) ─────────────► output mask (d_h,) │
│ │
│ C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t ───────► new cell state (d_h,) │
│ h_t = o_t ⊙ tanh(C_t) ───────────────────► new hidden (d_h,) │
│ │
└──────────────────────────────────────────────────────────────────────┘
│
▼
[Use h_T or all h_t for task head] ────────► logits → "French"
parameters / layer\(4 \times \big[d_h(d_h + d_{\text{in}}) + d_h\big]\) (4 gates: \(f,i,\tilde C,o\))
example\(d_h{=}512,\ d_{\text{in}}{=}256 \;\Rightarrow\; 4 \times (512{\times}768 + 512) = \mathbf{1{,}574{,}912}\)
8. Interview Depth Q&A
| Question | Strong answer pattern |
| Why is LSTM better than vanilla RNN? | The additive cell-state update creates a gradient highway: \(\partial C_t/\partial C_{t-1} = \operatorname{diag}(f_t)\) (diagonal) vs \(\operatorname{diag}(1-h^2)\,W_h\) (dense product that decays). Enables learning 100+ step dependencies. |
| Why both \(C_t\) and \(h_t\)? | \(C_t\) stores long-term memory without interference; \(h_t\) is a filtered, task-relevant view exposed for predictions and recurrence. Separation prevents output noise from corrupting storage. |
| Each gate in one sentence? | Forget: what old memory to erase. Input: what new info to write. Output: what stored info to expose now. |
| Can LSTM still vanish? | Yes — if forget gates stay near 0 on relevant dimensions. But that is now a learned behavior, not an architectural constraint; the network can choose to keep \(f \approx 1\). |
| LSTM vs GRU? | GRU merges \(C\) and \(h\) into one state and uses 2 gates (update + reset) instead of 3. Fewer parameters, often comparable; LSTM gives finer control for very long dependencies. |
| Why \(\sigma\) for gates but \(\tanh\) for the candidate? | \(\sigma \in (0,1)\) for soft binary masking; \(\tanh \in (-1,1)\) for content that can push cell values up or down. Different roles, different ranges. |