LSTM — Architecture & Mathematics

Notebook excerpts


    1. Why LSTM Exists: The Problem It Solves

    In a vanilla RNN the gradient from timestep \(T\) back to timestep \(1\) passes through a product of \(T-1\) single-step Jacobians \(J_t = \operatorname{diag}(1-h_t^2)\,W_h\). When each factor's spectral norm is bounded below \(1\), which \(\tanh\) saturation actively encourages because \(\operatorname{diag}(1-h_t^2)\preceq I\), the norm of the product decays exponentially in \(T\). In practice the network cannot learn that "France" (position 3) determines "French" (position 20+), because almost no gradient signal survives that many steps.
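The decay described above can be observed numerically. The following is a minimal sketch, not a reference implementation: it runs \(h_t = \tanh(W_h h_{t-1} + x_t)\) with random inputs and tracks the Frobenius norm of the accumulated product \(J_t J_{t-1} \cdots J_1\). The helper names (`matmul`, `matvec`, `frobenius_norm`, `jacobian_product_norms`), the \(2 \times 2\) matrix `W_h`, and the uniform random inputs are all assumptions chosen for illustration.

```python
import math
import random

def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(a, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]

def frobenius_norm(a):
    """Frobenius norm; it upper-bounds the spectral norm."""
    return math.sqrt(sum(x * x for row in a for x in row))

def jacobian_product_norms(W_h, steps, seed=0):
    """Return the norm of J_t ... J_1 after each step t = 1..steps.

    Inputs x_t are drawn uniformly from [-1, 1] (an illustrative assumption).
    """
    rng = random.Random(seed)
    n = len(W_h)
    h = [0.0] * n
    # Start from the identity so the first product equals J_1.
    product = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    norms = []
    for _ in range(steps):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        pre = matvec(W_h, h)
        h = [math.tanh(p + x_i) for p, x_i in zip(pre, x)]
        # J_t = diag(1 - h_t^2) W_h: row i of W_h scaled by (1 - h_i^2).
        J = [[(1.0 - h[i] ** 2) * W_h[i][j] for j in range(n)]
             for i in range(n)]
        product = matmul(J, product)
        norms.append(frobenius_norm(product))
    return norms

# Hypothetical recurrent matrix with Frobenius norm below 1.
W_h = [[0.5, 0.2], [-0.3, 0.4]]
norms = jacobian_product_norms(W_h, steps=20)
print("norm after 1 step:  ", norms[0])
print("norm after 20 steps:", norms[-1])
```

Because the Frobenius norm is submultiplicative and \(\operatorname{diag}(1-h_t^2)\) only shrinks rows, the norm after \(t\) steps is bounded by \(\lVert W_h \rVert_F^{\,t}\), which is the exponential decay the text describes.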

    2. The Dual-State Architecture

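    LSTM maintains two parallel state vectors at every timestep, each serving a fundamentally different role: the cell state \(C_t\), which carries long-term memory, and the hidden state \(h_t\), which is the per-step output passed to the next timestep.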

    3. Deep Dive: Gate Operations

    Three sigmoid gates control the flow of information. Each produces a vector in \((0,1)^{d_h}\) acting as a soft binary mask — determining how much of each dimension passes through.
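A minimal sketch of how the three gates might be computed, assuming the common parameterization in which each gate applies a sigmoid to an affine function of the concatenation \([h_{t-1}; x_t]\). The parameter names (`W_f`, `b_f`, `W_i`, `b_i`, `W_o`, `b_o`), the labels forget/input/output, and the helper functions are illustrative assumptions, not taken from the note.

```python
import math

def sigmoid(z):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate(W, b, h_prev, x):
    """One gate: sigmoid(W [h_prev; x] + b), computed row by row."""
    concat = h_prev + x  # list concatenation stands in for [h_{t-1}; x_t]
    return [sigmoid(sum(w * v for w, v in zip(row, concat)) + b_k)
            for row, b_k in zip(W, b)]

def lstm_gates(params, h_prev, x):
    """Return the (forget, input, output) gate vectors, each of length d_h."""
    f = gate(params["W_f"], params["b_f"], h_prev, x)
    i = gate(params["W_i"], params["b_i"], h_prev, x)
    o = gate(params["W_o"], params["b_o"], h_prev, x)
    return f, i, o

# Hypothetical sizes: d_h = 2 hidden units, d_x = 1 input feature.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {
    "W_f": zeros, "b_f": [0.0, 10.0],   # large bias pushes a gate toward 1
    "W_i": zeros, "b_i": [0.0, -10.0],  # large negative bias pushes toward 0
    "W_o": zeros, "b_o": [0.0, 0.0],
}
f, i, o = lstm_gates(params, h_prev=[0.1, -0.2], x=[1.0])
print(f, i, o)
```

With zero weights each gate reduces to the sigmoid of its bias, so a zero bias gives exactly 0.5 (half of that dimension passes) while large positive or negative biases approach the "open" and "closed" ends of the soft mask.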

    4. Cell State Update: The Memory Highway

    In a vanilla RNN, \(h_t = \tanh(\mathbf{W_h\,h_{t-1}} + \cdots)\) — the old state is multiplied by a weight matrix, the path that causes gradient decay.
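For contrast, the LSTM cell-state update used later in this note, \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\), scales the old state elementwise and adds new content, with no weight matrix on the \(C_{t-1}\) path. A minimal sketch follows; the function name `cell_update` and the example values are assumptions, and the gate values 0 and 1 are limiting cases that real sigmoid gates only approach.

```python
def cell_update(f, i, c_tilde, c_prev):
    """Elementwise update C_t = f * C_{t-1} + i * C_tilde.

    All arguments are equal-length lists; no matrix multiplies the old state.
    """
    return [f_k * c_k + i_k * g_k
            for f_k, i_k, g_k, c_k in zip(f, i, c_tilde, c_prev)]

# Dimension 0: forget gate open, input gate closed -> old memory is kept.
# Dimension 1: forget gate closed, input gate open -> memory is overwritten.
c_prev = [3.0, -2.0]
c_tilde = [0.5, 0.5]
c_next = cell_update(f=[1.0, 0.0], i=[0.0, 1.0], c_tilde=c_tilde, c_prev=c_prev)
print(c_next)  # [3.0, 0.5]
```

Each dimension of the memory is kept, overwritten, or blended independently, which is what lets a single value persist across many timesteps.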

    6. Gradient Flow Analysis

    The whole point of the architecture lives here. Differentiate the cell-state update \(C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t\) with respect to \(C_{t-1}\), treating the gates as (locally) fixed; this is the direct path that forms the gradient highway: \(\partial C_t / \partial C_{t-1} = \operatorname{diag}(f_t)\).
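With the gates held fixed, the update is linear in \(C_{t-1}\), so its Jacobian is the diagonal matrix \(\operatorname{diag}(f_t)\), and the gradient across several steps along this direct path is the elementwise product of the forget gates. The sketch below checks both statements numerically. The function names and gate values are assumptions for illustration, and the check deliberately ignores the indirect paths through \(h_{t-1}\) that the fixed-gate assumption sets aside.

```python
def cell_update(f, i, c_tilde, c_prev):
    """Elementwise update C_t = f * C_{t-1} + i * C_tilde (gates held fixed)."""
    return [f_k * c_k + i_k * g_k
            for f_k, i_k, g_k, c_k in zip(f, i, c_tilde, c_prev)]

def finite_difference_diag(f, i, c_tilde, c_prev, eps=1e-6):
    """Central-difference estimate of dC_t[k] / dC_{t-1}[k] for each k."""
    diag = []
    for k in range(len(c_prev)):
        up = list(c_prev)
        down = list(c_prev)
        up[k] += eps
        down[k] -= eps
        hi = cell_update(f, i, c_tilde, up)[k]
        lo = cell_update(f, i, c_tilde, down)[k]
        diag.append((hi - lo) / (2.0 * eps))
    return diag

def highway_gradient(forget_gates):
    """Diagonal of the product of diag(f_t) over a list of forget-gate vectors."""
    grad = [1.0] * len(forget_gates[0])
    for f in forget_gates:
        grad = [g * f_k for g, f_k in zip(grad, f)]
    return grad

# Hypothetical gate values: dimension 0 mostly remembers, dimension 1 forgets.
f = [0.9, 0.5]
print(finite_difference_diag(f, i=[0.3, 0.3], c_tilde=[1.0, -1.0],
                             c_prev=[2.0, 2.0]))
print(highway_gradient([f] * 20))
```

The finite-difference estimate matches `f` up to rounding. After 20 steps the dimension with a forget gate of 0.9 still passes about 0.12 of the gradient, while the one at 0.5 passes roughly \(10^{-6}\), so how much gradient survives is set by the gate values rather than by a fixed weight matrix.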