GRU — Architecture & Mathematics

Notebook excerpts


1. Design Motivation: Why GRU Was Created
LSTM largely mitigated the vanishing-gradient problem, but at a cost in complexity: three gates, a separate cell state, and roughly 4× the parameters of a vanilla RNN of the same size. In 2014, Cho et al. asked whether a simpler architecture could achieve similar long-range learning.
2. Single-State Architecture
GRU maintains a single state vector $h_t$ that serves as both the long-term memory and the output; there is no separate cell state.
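To make the single-state idea concrete, here is a minimal pure-Python sketch of one GRU step. The gate equations are not part of this excerpt, so the sketch assumes the common Cho et al. style formulation (sigmoid update gate z and reset gate r, tanh candidate, reset applied before the recurrent matrix). The parameter names Wz, Uz, bz and so on, and the helper init_params, are hypothetical and exist only for illustration. The point to notice is that gru_step takes one state vector in and returns one state vector out.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def preact(W, x, U, h, b):
    """Pre-activation W x + U h + b, computed per hidden unit."""
    return [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h), b)]


def gru_step(x, h_prev, p):
    """One GRU step under the assumed formulation.

    The returned vector is both the memory carried to the next step
    and the output of this step; nothing else is passed along.
    """
    z = [sigmoid(a) for a in preact(p["Wz"], x, p["Uz"], h_prev, p["bz"])]  # update gate
    r = [sigmoid(a) for a in preact(p["Wr"], x, p["Ur"], h_prev, p["br"])]  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in preact(p["Wh"], x, p["Uh"], reset_h, p["bh"])]
    # Interpolation between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]


def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    p = {}
    for g in "zrh":
        p["W" + g] = mat(hidden_size, input_size)
        p["U" + g] = mat(hidden_size, hidden_size)
        p["b" + g] = [0.0] * hidden_size
    return p
```

Calling h = gru_step(x, h, params) in a loop over the inputs is all the recurrence needs. An LSTM step written the same way would have to thread two vectors (hidden state and cell state) through the loop.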
4. State Interpolation: The Core Update
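In LSTM, the forget and input gates are independent: f=1, i=1 keeps the old content and also writes new content, so the cell state can grow. In GRU the two coefficients sum to exactly 1: whatever proportion of the old state you forget, you write that same proportion of new content. This makes the state provably bounded: each component of the new state is a weighted average of the old state and the new candidate, so its magnitude can never exceed the larger of the two.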
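The contrast can be checked numerically with a small scalar sketch. It assumes the candidate comes from a tanh (so it lies in [-1, 1]) and the gate value lies in [0, 1], which is the usual setup but is not stated in this excerpt. The function names are illustrative only.

```python
import math
import random


def lstm_cell_update(c_prev, f, i, g):
    """LSTM-style additive update: f and i are independent gates."""
    return f * c_prev + i * g


def gru_state_update(h_prev, z, h_tilde):
    """GRU-style interpolation: the coefficients (1 - z) and z sum to 1."""
    return (1.0 - z) * h_prev + z * h_tilde


def simulate(steps=50, seed=0):
    """Return the final LSTM-style cell value and the largest |h| seen for GRU."""
    rng = random.Random(seed)
    c = 0.0
    h = 0.0
    max_abs_h = 0.0
    for _ in range(steps):
        # LSTM worst case: both gates fully open, candidate pinned near +1.
        c = lstm_cell_update(c, f=1.0, i=1.0, g=math.tanh(5.0))
        # GRU: an arbitrary gate in [0, 1] and an arbitrary tanh candidate.
        z = rng.random()
        h_tilde = math.tanh(rng.uniform(-5.0, 5.0))
        h = gru_state_update(h, z, h_tilde)
        max_abs_h = max(max_abs_h, abs(h))
    return c, max_abs_h
```

With f = i = 1 the additive update gains almost 1 per step, so after 50 steps the cell value is close to 50. The interpolated state starts at 0 and never leaves [-1, 1], whatever gate values are drawn.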
6. Gradient Flow Analysis
The previous state $h_{t-1}$ reaches $h_t$ through three routes: it appears directly, and it also enters through both $z_t$ and $\tilde{h}_t$. Differentiating $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$ with the product rule therefore gives a Jacobian with one term per route.
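Written out, the product rule applied to the update equation above gives the following decomposition. It is a sketch that keeps the gate derivatives in general form, because the definitions of $z_t$ and $\tilde{h}_t$ are not part of this excerpt. The $\operatorname{diag}(\cdot)$ notation (a diagonal matrix built from a vector) and the route labels are added here for readability.

```latex
\frac{\partial h_t}{\partial h_{t-1}}
  = \underbrace{\operatorname{diag}(1 - z_t)}_{\text{direct route}}
  + \underbrace{\operatorname{diag}\!\left(\tilde{h}_t - h_{t-1}\right)
      \frac{\partial z_t}{\partial h_{t-1}}}_{\text{route through } z_t}
  + \underbrace{\operatorname{diag}(z_t)\,
      \frac{\partial \tilde{h}_t}{\partial h_{t-1}}}_{\text{route through } \tilde{h}_t}
```

The middle term collects two contributions: $-h_{t-1}$ from differentiating $(1 - z_t)$ and $+\tilde{h}_t$ from differentiating $z_t \odot \tilde{h}_t$. If $z_t$ is a sigmoid gate (an assumption, since the gate definition is not shown here), then $\partial z_t / \partial h_{t-1}$ carries a factor $z_t \odot (1 - z_t)$. As $z_t \to 0$ both the second and third terms vanish and the Jacobian approaches the identity, which is the route that lets gradients pass through many steps without shrinking.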