Notebook excerpts
01
1. Design Motivation: Why GRU Was Created
LSTM largely mitigated the vanishing-gradient problem, but at the cost of complexity: three gates, a separate cell state, and roughly 4× the parameters of a vanilla RNN of the same size. In 2014, Cho et al. asked whether similar long-range learning could be achieved with a simpler architecture.
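To make the "roughly 4×" figure concrete, here is a minimal counting sketch. It assumes the textbook formulation in which every gate or candidate transformation owns one input matrix, one recurrent matrix, and one bias vector; the function name and the layer sizes are illustrative, and some library implementations carry extra bias vectors that shift the totals slightly.

```python
def rnn_family_param_counts(input_size: int, hidden_size: int) -> dict:
    """Count weights and biases for one layer, one bias vector per block."""
    # One "block" = input weights + recurrent weights + bias for a single
    # gate or candidate transformation.
    block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return {
        "vanilla_rnn": 1 * block,  # single tanh transformation
        "gru": 3 * block,          # update gate, reset gate, candidate
        "lstm": 4 * block,         # input, forget, output gates + candidate
    }


# Illustrative sizes only; they are not taken from the note.
counts = rnn_family_param_counts(input_size=128, hidden_size=256)
for name, n in counts.items():
    print(f"{name:12s} {n:>8d}  ({n / counts['vanilla_rnn']:.0f}x vanilla)")
```

Under these assumptions the GRU lands at three blocks, between the vanilla RNN and the LSTM.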
02
2. Single-State Architecture
GRU maintains a single state vector \(h_t\) that serves as both the long-term memory and the output; unlike LSTM, there is no separate cell state.
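A minimal pure-Python sketch of one GRU step may help show what "single state" means in practice: the function takes one vector and returns one vector, with nothing else carried between steps. It assumes the common formulation with a sigmoid update gate \(z_t\), a sigmoid reset gate \(r_t\), and a tanh candidate \(\tilde h_t\) computed from \(r_t \odot h_{t-1}\). The parameter names (`Wz`, `Uz`, `bz`, and so on), the sizes, and the random initialisation are hypothetical.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, x, U, h, b):
    """Return W @ x + U @ h + b for plain Python lists."""
    return [
        sum(w * xi for w, xi in zip(W[i], x))
        + sum(u * hi for u, hi in zip(U[i], h))
        + b[i]
        for i in range(len(b))
    ]


def gru_step(x, h_prev, params):
    """One GRU step: takes the single state h_prev, returns the new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = [sigmoid(a) for a in affine(Wz, x, Uz, h_prev, bz)]  # update gate
    r = [sigmoid(a) for a in affine(Wr, x, Ur, h_prev, br)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a) for a in affine(Wh, x, Uh, gated, bh)]  # candidate
    # Interpolate between the old state and the candidate.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def random_params(input_size, hidden_size, rng):
    """Hypothetical small random weights, zero biases."""
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = []
    for _ in range(3):  # update gate, reset gate, candidate
        params += [
            mat(hidden_size, input_size),
            mat(hidden_size, hidden_size),
            [0.0] * hidden_size,
        ]
    return params


rng = random.Random(0)
params = random_params(input_size=3, hidden_size=4, rng=rng)
h = [0.0] * 4  # the only state carried between steps
for _ in range(5):
    x = [rng.uniform(-1, 1) for _ in range(3)]
    h = gru_step(x, h, params)  # the same vector is also this step's output
print(h)
```

An LSTM step written in the same style would have to accept and return a pair (hidden state and cell state); here the loop threads a single list through time.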
03
4. State Interpolation: The Core Update
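In LSTM, the forget and input gates are independent: \(f=1, i=1\) keeps the old content and writes the new, so the cell state can grow. In GRU the two coefficients sum to exactly \(1\): whatever proportion of the old state you forget, you write that same proportion of new content. This yields a provable boundedness property: because the coefficients also lie in \([0,1]\), each entry of \(h_t\) is a convex combination of the corresponding entries of \(h_{t-1}\) and \(\tilde h_t\), so it always lies between them.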
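The contrast can be checked numerically with a scalar sketch. It assumes the candidate is a tanh output (so it lies in \([-1, 1]\)) and the update gate is a sigmoid output (so it lies in \([0, 1]\)); both are replaced here by random stand-ins rather than computed from weights, and the function names are illustrative.

```python
import random

rng = random.Random(0)


def gru_update(h, z, cand):
    """GRU-style interpolation: coefficients (1 - z) and z sum to 1."""
    return (1 - z) * h + z * cand


def lstm_update(c, f, i, cand):
    """LSTM-style cell update: f and i are independent gates."""
    return f * c + i * cand


h, c = 0.0, 0.0
max_abs_h = 0.0
for _ in range(1000):
    cand = rng.uniform(-1.0, 1.0)  # stand-in for a tanh output
    z = rng.random()               # stand-in for a sigmoid output
    h = gru_update(h, z, cand)
    max_abs_h = max(max_abs_h, abs(h))

for _ in range(1000):
    # Extreme case from the text: keep everything and write everything.
    c = lstm_update(c, f=1.0, i=1.0, cand=1.0)

print(max_abs_h <= 1.0)  # whether the GRU state stayed inside [-1, 1]
print(c)                 # the LSTM-style cell after 1000 unit writes
```

The LSTM-style loop is deliberately a worst case; in a trained LSTM the gates rarely saturate together for long, but nothing in the update rule itself prevents the growth.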
04
6. Gradient Flow Analysis
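The previous state \(h_{t-1}\) reaches \(h_t\) through three routes: it appears directly, and it also enters both \(z_t\) and \(\tilde h_t\). Differentiating \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t\) with the product rule gives the full Jacobian, one term per route:
\[ \frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}(1-z_t) + \operatorname{diag}(\tilde h_t - h_{t-1})\,\frac{\partial z_t}{\partial h_{t-1}} + \operatorname{diag}(z_t)\,\frac{\partial \tilde h_t}{\partial h_{t-1}} \]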
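As a rough sanity check on the three-route decomposition, the sketch below works the scalar case, where the Jacobian reduces to the single derivative \((1-z) + (\tilde h - h_{t-1})\,\partial z/\partial h_{t-1} + z\,\partial\tilde h/\partial h_{t-1}\), and compares it with a central finite difference. It assumes sigmoid gates and a tanh candidate that sees \(r \cdot h_{t-1}\); the weights in `P` and the evaluation point are arbitrary, hypothetical values.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical scalar weights, chosen only for illustration.
P = dict(wz=0.7, uz=-1.2, bz=0.1, wr=-0.4, ur=0.9, br=0.0, wh=1.1, uh=0.8, bh=-0.3)


def forward(h_prev, x, p=P):
    """Scalar GRU step; returns h_t and the intermediates needed below."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev + p["bz"])
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev + p["br"])
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev) + p["bh"])
    h = (1 - z) * h_prev + z * cand
    return h, z, r, cand


def analytic_grad(h_prev, x, p=P):
    """d h_t / d h_prev as the sum of the three routes."""
    _, z, r, cand = forward(h_prev, x, p)
    dz = z * (1 - z) * p["uz"]  # d z / d h_prev
    dr = r * (1 - r) * p["ur"]  # d r / d h_prev
    dcand = (1 - cand ** 2) * p["uh"] * (r + h_prev * dr)  # d cand / d h_prev
    direct = 1 - z                   # route 1: h_prev appears directly
    via_gate = (cand - h_prev) * dz  # route 2: through the update gate
    via_cand = z * dcand             # route 3: through the candidate
    return direct + via_gate + via_cand


def numeric_grad(h_prev, x, eps=1e-6):
    """Central finite difference of h_t with respect to h_prev."""
    hi = forward(h_prev + eps, x)[0]
    lo = forward(h_prev - eps, x)[0]
    return (hi - lo) / (2 * eps)


h_prev, x = 0.3, -0.5
print(analytic_grad(h_prev, x), numeric_grad(h_prev, x))
```

The two printed values should agree to several decimal places; in the vector case each scalar factor becomes the corresponding diagonal matrix or weight-dependent Jacobian.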