Notebook excerpts
01
1. Why LSTM Exists: The Problem It Solves
In a vanilla RNN the gradient from timestep \(T\) back to timestep \(1\) passes through a product of \(T-1\) single-step Jacobians \(J_t = \operatorname{diag}(1-h_t^2)\,W_h\). When the spectral norm of every factor stays below some \(\gamma < 1\), the norm of the product is at most \(\gamma^{T-1}\) and decays exponentially in the distance; \(\tanh\) saturation actively pushes the factors in that direction, since \(\operatorname{diag}(1-h_t^2)\preceq I\). In practice the network then cannot learn that "France" (position 3) determines "French" (position 20+), because the learning signal linking the two positions is vanishingly small.
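The decay can be seen in a minimal sketch of the scalar case, where the Jacobian \(J_t\) reduces to the number \((1-h_t^2)\,w_h\). The function name `rnn_gradient_norms`, the weight `w_h = 0.9`, and the constant input `0.5` are illustrative assumptions, not values from this note.

```python
import math

def rnn_gradient_norms(w_h, inputs, h0=0.0):
    """Return |d h_t / d h_0| for t = 1..T in a scalar tanh RNN."""
    h = h0
    grad = 1.0
    norms = []
    for x in inputs:
        h = math.tanh(w_h * h + x)
        # Scalar single-step Jacobian: (1 - h_t^2) * w_h
        grad *= (1.0 - h * h) * w_h
        norms.append(abs(grad))
    return norms

# Hypothetical setup: recurrent weight 0.9, constant input 0.5, 20 timesteps.
norms = rnn_gradient_norms(w_h=0.9, inputs=[0.5] * 20)
print("after 1 step:  ", norms[0])
print("after 20 steps:", norms[-1])
```

Every factor here is below \(1\) in magnitude, so the accumulated gradient shrinks at each step and is negligible after 20 steps.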
02
2. The Dual-State Architecture
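LSTM maintains two parallel state vectors at every timestep, the cell state \(C_t\) and the hidden state \(h_t\), each serving a fundamentally different role: \(C_t\) carries long-term memory from step to step, while \(h_t\) is the output exposed at each step.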
03
3. Deep Dive: Gate Operations
Three sigmoid gates control the flow of information. Each produces a vector in \((0,1)^{d_h}\) acting as a soft binary mask — determining how much of each dimension passes through.
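As a rough illustration of the soft-mask behaviour, the sketch below computes one gate in plain Python. It assumes the common parameterisation \(\sigma(W x_t + U h_{t-1} + b)\); the names `W_f`, `U_f`, `b_f` and all numeric values are hypothetical and chosen only to push one dimension toward "open" and the other toward "closed".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate(W, U, b, x, h_prev):
    """One gate: sigmoid(W x + U h_prev + b), computed row by row."""
    out = []
    for W_row, U_row, b_k in zip(W, U, b):
        z = sum(w * xi for w, xi in zip(W_row, x))
        z += sum(u * hi for u, hi in zip(U_row, h_prev))
        out.append(sigmoid(z + b_k))
    return out

# Hypothetical toy sizes: input dimension 2, hidden dimension 2.
x = [1.0, -2.0]
h_prev = [0.5, -0.5]
W_f = [[4.0, 0.0], [0.0, 3.0]]
U_f = [[1.0, 0.0], [0.0, 1.0]]
b_f = [0.0, 0.0]

f = gate(W_f, U_f, b_f, x, h_prev)

# Applying the gate is an elementwise product with the vector it guards.
c_prev = [2.0, 2.0]
masked = [fk * ck for fk, ck in zip(f, c_prev)]
print("gate:  ", f)
print("masked:", masked)
```

The first dimension passes almost unchanged and the second is almost erased, yet neither gate value reaches exactly \(0\) or \(1\), which is what keeps the mask differentiable.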
04
4. Cell State Update: The Memory Highway
In a vanilla RNN, \(h_t = \tanh(\mathbf{W_h\,h_{t-1}} + \cdots)\): the old state is multiplied by a weight matrix at every step, and this repeated multiplication is the path that causes gradient decay.
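For contrast, the LSTM cell-state update \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\) touches the old state only through an elementwise product. The sketch below is a minimal plain-Python version; the function name `lstm_cell_update` and the hand-picked gate values are assumptions for illustration, not gates computed from real inputs.

```python
def lstm_cell_update(c_prev, f, i, c_tilde):
    """C_t = f * C_{t-1} + i * C_tilde, all products elementwise."""
    return [fk * ck + ik * gk
            for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]

# Hypothetical extreme gates: dimension 0 keeps its memory untouched,
# dimension 1 forgets it and writes the candidate value instead.
c_prev = [2.0, -1.0]
f = [1.0, 0.0]
i = [0.0, 1.0]
c_tilde = [0.7, 0.3]

c_t = lstm_cell_update(c_prev, f, i, c_tilde)
print(c_t)
```

With the forget gate at \(1\) and the input gate at \(0\), a dimension is copied forward unchanged, with no weight matrix and no squashing in between.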
05
6. Gradient Flow Analysis
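The whole point of the architecture lives here. Differentiate the cell-state update \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\) with respect to \(C_{t-1}\), treating the gates as (locally) fixed; this is the direct path that forms the gradient highway. The second term does not depend on \(C_{t-1}\) under that assumption, so \(\partial C_t / \partial C_{t-1} = \operatorname{diag}(f_t)\): neither a weight matrix nor a \(\tanh\) derivative appears on this path.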
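A minimal numerical check of the direct-path derivative, which is \(\operatorname{diag}(f_t)\) when the gates are held fixed, can be sketched as follows. The helper names and the gate values (`0.99` and `0.5`) are hypothetical; the finite-difference step only confirms the one-step derivative, and the second function multiplies the forget gates across timesteps.

```python
def cell_update(c_prev, f, i, c_tilde):
    """C_t = f * C_{t-1} + i * C_tilde, elementwise."""
    return [fk * ck + ik * gk
            for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]

def direct_path_diag(c_prev, f, i, c_tilde, eps=1e-6):
    """Finite-difference estimate of dC_t[k]/dC_{t-1}[k], gates held fixed."""
    base = cell_update(c_prev, f, i, c_tilde)
    diag = []
    for k in range(len(c_prev)):
        bumped = list(c_prev)
        bumped[k] += eps
        diag.append((cell_update(bumped, f, i, c_tilde)[k] - base[k]) / eps)
    return diag

def highway_gradient(forget_gates):
    """Direct-path dC_T/dC_0 per dimension: elementwise product of all f_t."""
    grad = [1.0] * len(forget_gates[0])
    for f_t in forget_gates:
        grad = [g * fk for g, fk in zip(grad, f_t)]
    return grad

# One step: the estimated derivative should match the forget gate.
f = [0.99, 0.5]
diag = direct_path_diag([2.0, -1.0], f, [0.3, 0.3], [0.7, 0.3])
print("one-step derivative:", diag)

# Twenty steps with the same (hypothetical) forget gates at every step.
grad = highway_gradient([f] * 20)
print("20-step gradient:   ", grad)
```

A dimension whose forget gate stays near \(1\) retains most of its gradient over 20 steps, while one with \(f = 0.5\) still vanishes; the highway exists only where the network learns to keep the gate open.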