LSTM — Architecture & Mathematics

Notebook excerpts

1. Why LSTM Exists: The Problem It Solves
In a vanilla RNN, the gradient from timestep $T$ back to timestep 1 passes through a product of $T-1$ single-step Jacobians $J_t = \operatorname{diag}(1-h_t^2)\,W_h$. When the spectral norms of these factors stay below 1, which tanh saturation actively pushes them toward since $\operatorname{diag}(1-h_t^2) \preceq I$, the product decays exponentially in $T$. In practice the network cannot learn that "France" (position 3) determines "French" (position 20+).
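The decay can be seen numerically. The sketch below is an illustration under stated assumptions, not the note's own code: it draws a random `W_h`, rescales it so its spectral norm is `rho = 0.9`, feeds random inputs, and tracks the spectral norm of the running Jacobian product. The function name, the random inputs, and all parameter values are hypothetical choices.

```python
import numpy as np

def jacobian_product_norms(T=30, d=8, rho=0.9, seed=0):
    """Spectral norm of J_t ... J_2 for t = 2..T in a toy tanh RNN.

    Assumption: W_h is random and rescaled so that ||W_h||_2 == rho.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d, d))
    W_h = rho * W / np.linalg.norm(W, 2)      # force spectral norm to rho

    h = np.zeros(d)
    prod = np.eye(d)                          # running product of Jacobians
    norms = []
    for _ in range(T - 1):
        x = rng.standard_normal(d)            # stand-in for the input term
        h = np.tanh(W_h @ h + x)              # vanilla RNN step
        J = np.diag(1.0 - h**2) @ W_h         # single-step Jacobian dh_t/dh_{t-1}
        prod = J @ prod                       # extend the chain by one step
        norms.append(np.linalg.norm(prod, 2))
    return np.array(norms)

if __name__ == "__main__":
    norms = jacobian_product_norms()
    for step in (1, 5, 10, 20, 29):
        print(f"after {step:2d} steps: ||product||_2 = {norms[step - 1]:.3e}")
```

Because each factor has spectral norm at most `rho`, the product after `k` steps is bounded by `rho**k`; tanh saturation typically makes the observed norm smaller still.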
2. The Dual-State Architecture
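An LSTM maintains two parallel state vectors at every timestep, each serving a different role: the cell state $C_t$, which carries long-term memory, and the hidden state $h_t$, which is the gated output passed to the next timestep and to any layer above.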
3. Deep Dive: Gate Operations
Three sigmoid gates (forget, input, and output) control the flow of information. Each produces a vector in $(0,1)^{d_h}$ that acts as a soft binary mask, determining how much of each dimension passes through.
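A minimal sketch of the soft-mask idea, assuming nothing about how the gate pre-activations are computed (the helper `apply_gate` and the example values are hypothetical): a sigmoid squashes each pre-activation into (0, 1), and an elementwise product scales the corresponding dimension of the signal.

```python
import numpy as np

def sigmoid(z):
    """Logistic function; maps any real pre-activation into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def apply_gate(pre_activation, signal):
    """Return the gate vector and the elementwise-masked signal."""
    gate = sigmoid(pre_activation)   # one value in (0, 1) per dimension
    return gate, gate * signal       # soft mask: scale each dimension

# Strongly negative -> nearly closed, zero -> half open, strongly positive -> nearly open.
pre = np.array([-10.0, 0.0, 10.0])
signal = np.array([2.0, 2.0, 2.0])
gate, passed = apply_gate(pre, signal)
print("gate  :", gate)
print("passed:", passed)
```

The mask is "soft" because the gate never reaches exactly 0 or 1, so every dimension stays differentiable with respect to its pre-activation.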
4. Cell State Update: The Memory Highway
In a vanilla RNN, $h_t = \tanh(W_h\,h_{t-1} + \cdots)$: at every step the old state is multiplied by a weight matrix and squashed by tanh, and this is the path that causes gradient decay.
6. Gradient Flow Analysis
The whole point of the architecture lives here. Differentiate the cell-state update $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ with respect to $C_{t-1}$, treating the gates as (locally) fixed. This is the direct path that forms the gradient highway.
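Carrying out that differentiation, as a sketch that holds only under the fixed-gate assumption (the dependence of $f_t$, $i_t$, and $\tilde{C}_t$ on $h_{t-1}$ is ignored):

```latex
\[
\frac{\partial C_t}{\partial C_{t-1}} = \operatorname{diag}(f_t)
\qquad\Longrightarrow\qquad
\frac{\partial C_T}{\partial C_1}
  = \prod_{t=2}^{T} \operatorname{diag}(f_t)
  = \operatorname{diag}\bigl(f_T \odot f_{T-1} \odot \cdots \odot f_2\bigr)
\]
```

Unlike the vanilla RNN Jacobian $J_t = \operatorname{diag}(1-h_t^2)\,W_h$, this direct path contains no weight matrix and no tanh derivative. Each dimension is scaled only by its forget-gate values, so the gradient along that dimension is preserved for as long as the corresponding entries of $f_t$ stay close to 1.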