GRU — Architecture & Mathematics

Notebook excerpts

    1. Design Motivation: Why GRU Was Created

    LSTM largely mitigated the vanishing-gradient problem, but at the cost of added complexity: three gates, a separate cell state, and roughly 4× the parameters of a vanilla RNN. In 2014, Cho et al. asked whether similar long-range learning could be achieved with a simpler architecture.

    2. Single-State Architecture

    GRU maintains a single state vector \(h_t\), which serves as both the long-term memory and the output interface. There is no separate cell state as in LSTM.
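    To make the single-state idea concrete, here is a minimal sketch of one GRU step in plain Python. The gate equations are the commonly used formulation (update gate \(z_t\), reset gate \(r_t\), \(\tanh\) candidate); this excerpt does not show them, so treat them as an assumption. The parameter names (`Wz`, `Uz`, `bz`, and so on) and the helper `init_params` are hypothetical and exist only for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gate(W, U, b, x, h, act):
    # act(W x + U h + b), applied elementwise.
    return [act(a + c + d) for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

def gru_step(x, h_prev, p):
    """One GRU step. The returned h_t is both the memory and the output."""
    z = gate(p["Wz"], p["Uz"], p["bz"], x, h_prev, sigmoid)   # update gate
    r = gate(p["Wr"], p["Ur"], p["br"], x, h_prev, sigmoid)   # reset gate
    # The candidate sees the previous state only after it is scaled by r.
    rh = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = gate(p["Wh"], p["Uh"], p["bh"], x, rh, math.tanh)
    # Interpolate between the old state and the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

def init_params(n_in, n_hid, seed=0):
    # Hypothetical helper: small random weights, zero biases.
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    p = {}
    for g in "zrh":
        p["W" + g] = mat(n_hid, n_in)
        p["U" + g] = mat(n_hid, n_hid)
        p["b" + g] = [0.0] * n_hid
    return p

params = init_params(n_in=3, n_hid=4)
h = [0.0] * 4                      # the only state carried between steps
for x in ([1.0, 0.0, -1.0], [0.5, 0.5, 0.5]):
    h = gru_step(x, h, params)
print(len(h))
```

    Note that `gru_step` takes one state vector and returns one state vector. An LSTM step would have to carry a cell state alongside the hidden state.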

    4. State Interpolation: The Core Update

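    In LSTM, the forget and input gates are independent: \(f=1, i=1\) keeps the old content and writes the new, so the cell state can grow without bound. In GRU the two coefficients, \(1-z_t\) and \(z_t\), sum to exactly \(1\): whatever fraction of the old state you forget, you write that same fraction of new content. Each update is therefore a convex combination, which yields a provable boundedness property: if every entry of \(h_{t-1}\) and of the candidate \(\tilde h_t\) lies in \([-1, 1]\), then so does every entry of \(h_t\).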
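    The contrast can be checked numerically. The sketch below is an illustration under stated assumptions, not a training setup: gate values and candidates are drawn at random rather than computed from weights, the candidate is assumed to lie in \([-1, 1]\) (as a \(\tanh\) output would), and the LSTM branch uses the worst case \(f=1, i=1\) with the candidate pinned at \(+1\).

```python
import random

def gru_update(h_prev, z, h_cand):
    # Convex combination: the coefficients (1 - z) and z sum to 1.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

def lstm_cell_update(c_prev, f, i, c_cand):
    # Independent gates: f and i need not sum to 1.
    return [fi * ci + ii * gi for fi, ci, ii, gi in zip(f, c_prev, i, c_cand)]

rng = random.Random(0)
n, steps = 5, 200
h = [0.0] * n
c = [0.0] * n
max_abs_h = 0.0

for _ in range(steps):
    z = [rng.random() for _ in range(n)]              # update gate in [0, 1)
    h_cand = [rng.uniform(-1, 1) for _ in range(n)]   # stand-in for a tanh output
    h = gru_update(h, z, h_cand)
    max_abs_h = max(max_abs_h, max(abs(v) for v in h))
    # Worst case for LSTM: keep everything and write +1 at every step.
    c = lstm_cell_update(c, [1.0] * n, [1.0] * n, [1.0] * n)

print(max_abs_h <= 1.0)   # the GRU state never leaves [-1, 1]
print(c[0])               # 200.0: the cell state grew by 1 at each of 200 steps
```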

    6. Gradient Flow Analysis

    The previous state \(h_{t-1}\) reaches \(h_t\) through three routes: it appears directly, and it also appears inside both \(z_t\) and \(\tilde h_t\). Differentiating \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t\) with the product rule gives the full Jacobian, with one term per route.
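    As a worked sketch of that product-rule step, the Jacobian can be written out as below. It uses only the update equation quoted above, writes \(\operatorname{diag}(v)\) for the diagonal matrix built from a vector \(v\), and leaves the gate Jacobians \(\partial z_t/\partial h_{t-1}\) and \(\partial \tilde h_t/\partial h_{t-1}\) unexpanded, since their exact form depends on gate definitions not shown in this excerpt.

\[
\frac{\partial h_t}{\partial h_{t-1}}
= \underbrace{\operatorname{diag}(1 - z_t)}_{\text{direct route}}
+ \underbrace{\operatorname{diag}(\tilde h_t - h_{t-1})\,\frac{\partial z_t}{\partial h_{t-1}}}_{\text{through } z_t}
+ \underbrace{\operatorname{diag}(z_t)\,\frac{\partial \tilde h_t}{\partial h_{t-1}}}_{\text{through } \tilde h_t}
\]

    The middle term collects both places where \(z_t\) appears: \(-h_{t-1}\) from \((1-z_t)\odot h_{t-1}\) and \(+\tilde h_t\) from \(z_t\odot\tilde h_t\). When \(z_t \approx 0\), the first term is close to the identity and the third vanishes, so gradients pass through the direct route almost unchanged.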