Comprehensive Master Notes: GRU

The Architecture, Mathematics, and Flow of Gated Recurrent Units

Anchor example: time-series prediction — next value forecast

Original paper: Cho, K. et al. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP 2014 (arXiv:1406.1078).

1. Design Motivation: Why GRU Was Created

LSTM solved the vanishing-gradient problem but introduced complexity: 3 gates, a separate cell state, and roughly 4× the parameters of a vanilla RNN. In 2014, Cho et al. asked: can we achieve similar long-range learning with a simpler architecture?

LSTM gates: 3 (forget, input, output) + candidate = 4 transformations
GRU gates: 2 (update, reset) + candidate = 3 transformations
LSTM states: 2 \((C_t \text{ and } h_t)\)
GRU states: 1 \((h_t \text{ only})\)

the design insight
GRU merges LSTM's forget and input gates into a single "update gate" \(z_t\), and merges cell state and hidden state into one vector. The coupled update means: whatever you forget, you replace with new content in exactly that proportion. This constraint slightly reduces expressiveness but eliminates an entire state vector and one gate's worth of parameters.

2. Single-State Architecture

GRU maintains only one state vector \(h_t\) that serves both as long-term memory and output interface:

Component	Symbol	Shape	Role
Update Gate	\(z_t\)	\((d_h,)\)	Controls interpolation between old state and new candidate
Reset Gate	\(r_t\)	\((d_h,)\)	Controls how much old state participates in candidate computation
Candidate	\(\tilde h_t\)	\((d_h,)\)	Proposed new state content
Hidden State	\(h_t\)	\((d_h,)\)	The single state — both memory and output

Cell diagram: one state \(h\). The update gate \(z_t\) mixes the old state \((\times(1{-}z_t))\) with a candidate \((\times z_t)\); the reset gate \(r_t\) gates the old state into the candidate via \(r_t \odot h_{t-1}\).

3. Deep Dive: Gate Operations

Gate 1: The Update Gate \((z_t)\)

equation: \(z_t = \sigma\big(W_z\, x_t + U_z\, h_{t-1} + b_z\big)\)
shapes: \(W_z \in \mathbb{R}^{d_h \times d_{\text{in}}}, \quad U_z \in \mathbb{R}^{d_h \times d_h}\)
output: \(z_t \in (0, 1)^{d_h}\)

why this operation
The update gate determines how much of the old state to keep vs. replace. \(z_t = 1\): replace old state with the candidate (full write). \(z_t = 0\): keep old state unchanged (full copy-through). It performs the combined role of LSTM's forget AND input gates — write more ⇒ forget more, automatically. This coupling is the key simplification.
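
The update-gate equation can be sketched in a few lines of dependency-free Python. The toy sizes (\(d_{\text{in}}=2\), \(d_h=3\)), every weight value, and the helper names `sigmoid`, `matvec`, and `update_gate` are illustrative assumptions, not part of the original notes. The sketch only shows how the bias shifts the gate between "keep" (\(z \to 0\)) and "write" (\(z \to 1\)).

```python
import math

def sigmoid(a):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def update_gate(W_z, U_z, b_z, x_t, h_prev):
    """z_t = sigmoid(W_z x_t + U_z h_{t-1} + b_z), one value per hidden unit."""
    wx = matvec(W_z, x_t)      # input contribution, length d_h
    uh = matvec(U_z, h_prev)   # recurrent contribution, length d_h
    return [sigmoid(a + b + c) for a, b, c in zip(wx, uh, b_z)]

# Hypothetical toy values: d_in = 2, d_h = 3. All numbers are made up.
W_z = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
U_z = [[0.2, 0.0, -0.1], [0.0, 0.3, 0.1], [0.1, -0.2, 0.4]]
x_t = [1.0, -1.0]
h_prev = [0.8, -0.5, 0.3]

z_neutral = update_gate(W_z, U_z, [0.0, 0.0, 0.0], x_t, h_prev)
z_keep = update_gate(W_z, U_z, [-6.0, -6.0, -6.0], x_t, h_prev)  # bias pushes z toward 0
z_write = update_gate(W_z, U_z, [6.0, 6.0, 6.0], x_t, h_prev)    # bias pushes z toward 1

print("neutral:", [round(v, 3) for v in z_neutral])
print("keep   :", [round(v, 3) for v in z_keep])
print("write  :", [round(v, 3) for v in z_write])
```

With a strongly negative bias every gate value sits near 0 (copy-through); with a strongly positive bias it sits near 1 (full write). The values never reach 0 or 1 exactly because the sigmoid output lies in the open interval.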

Gate 2: The Reset Gate \((r_t)\)

equation: \(r_t = \sigma\big(W_r\, x_t + U_r\, h_{t-1} + b_r\big)\)
shapes: \(W_r \in \mathbb{R}^{d_h \times d_{\text{in}}}, \quad U_r \in \mathbb{R}^{d_h \times d_h}\)
output: \(r_t \in (0, 1)^{d_h}\)

why this operation
The reset gate controls how much the previous state influences the candidate. \(r_t = 0\): the candidate is computed as if there were no history (a pure function of \(x_t\)). \(r_t = 1\): full previous state feeds in. This lets the GRU make sharp "reset" decisions — on a new context it can ignore accumulated history when proposing what to write next.

The Candidate State \((\tilde h_t)\)

equation: \(\tilde h_t = \tanh\big(W_h\, x_t + U_h\,(r_t \odot h_{t-1}) + b_h\big)\)
key difference: \(h_{t-1}\) is masked by \(r_t\) before the linear transform
output: \(\tilde h_t \in (-1, +1)^{d_h}\)

why \(r_t\) multiplies \(h_{t-1}\) inside the candidate
The reset gate is a content selector: it chooses which dimensions of the old state are relevant for computing the new candidate. This differs from the update gate, which decides what to keep; the reset gate decides which old information is useful for generating new information. Setting \(r_{t,j}=0\) removes old-state dimension \(j\) from the input to every candidate unit, and when all of \(r_t\) is zero the candidate is computed from scratch using only \(x_t\).
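
The effect of the reset gate on the candidate can be checked numerically. This is a minimal sketch under assumed toy values: the weights, the two hypothetical histories `history_a` and `history_b`, and the hand-set gate vectors `closed` and `open_gate` are all made up for illustration. In a real GRU, \(r_t\) comes from its own sigmoid and is never exactly 0 or 1.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def candidate(W_h, U_h, b_h, x_t, h_prev, r_t):
    """h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)."""
    gated = [r * h for r, h in zip(r_t, h_prev)]  # mask history BEFORE the linear map
    wx = matvec(W_h, x_t)
    uh = matvec(U_h, gated)
    return [math.tanh(a + b + c) for a, b, c in zip(wx, uh, b_h)]

# Hypothetical toy values: d_in = 2, d_h = 2.
W_h = [[0.6, -0.3], [0.2, 0.5]]
U_h = [[0.4, -0.7], [0.9, 0.1]]
b_h = [0.0, 0.1]
x_t = [1.0, 0.5]

history_a = [0.9, -0.8]   # two different pasts, same current input
history_b = [-0.3, 0.6]

closed = [0.0, 0.0]       # idealised fully closed reset gate
open_gate = [1.0, 1.0]    # idealised fully open reset gate

cand_closed_a = candidate(W_h, U_h, b_h, x_t, history_a, closed)
cand_closed_b = candidate(W_h, U_h, b_h, x_t, history_b, closed)
cand_open_a = candidate(W_h, U_h, b_h, x_t, history_a, open_gate)
cand_open_b = candidate(W_h, U_h, b_h, x_t, history_b, open_gate)

print("reset closed:", cand_closed_a, cand_closed_b)  # identical: history is ignored
print("reset open  :", cand_open_a, cand_open_b)      # differ: history feeds in
```

With the gate closed, both histories produce the same candidate, so the proposal depends on \(x_t\) alone. With the gate open, the two candidates diverge.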

4. State Interpolation: The Core Update

the core equation: \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t\)
operation type: convex interpolation (element-wise)
constraint: \((1 - z_{t,j}) + z_{t,j} = 1\) for every dimension \(j\)

Decomposing the Interpolation

Term	Computation	When \(z \to 0\)	When \(z \to 1\)
\((1 - z_t) \odot h_{t-1}\)	proportion of old state retained	full retention (copy-through)	complete erasure
\(z_t \odot \tilde h_t\)	proportion of new candidate written	no new content	full replacement

the critical constraint: coupled forget–input

In LSTM, forget and input are independent: \(f=1, i=1\) keeps old AND writes new, so the cell state can grow. In GRU the coefficients sum to exactly \(1\): whatever proportion you forget, you write that same proportion of new content. This yields a provable boundedness property:

\(h_{t,j} = (1-z_{t,j})\,h_{t-1,j} + z_{t,j}\,\tilde h_{t,j}\) is a convex combination (weights \(\in[0,1]\), sum \(=1\)).
So \(\min(h_{t-1,j},\,\tilde h_{t,j}) \le h_{t,j} \le \max(h_{t-1,j},\,\tilde h_{t,j})\).
With \(h_0 = 0\) and \(\tilde h_t \in (-1,1)\), induction gives \(h_{t} \in (-1,1)^{d_h}\) for all \(t\).

The GRU hidden state is therefore bounded in \((-1,1)\) — unlike the LSTM cell state, which is additive and can grow without bound.
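
The boundedness argument can be illustrated with a scalar simulation. The sketch below is an assumption-laden toy: gate and candidate values are drawn at random rather than computed from weights, and the LSTM gates are pinned at the saturated limit \(f=i=1\), which a sigmoid approaches but never reaches. It shows only the contrast between interpolation and additive accumulation.

```python
import random

random.seed(0)
steps = 50

# GRU-style update: convex interpolation between old state and candidate.
h = 0.0
gru_max = 0.0
for _ in range(steps):
    z = random.random()                    # stand-in for a sigmoid output
    cand = random.uniform(-0.999, 0.999)   # stand-in for a tanh output
    h = (1.0 - z) * h + z * cand
    gru_max = max(gru_max, abs(h))

# LSTM-style cell update with independent gates at the saturated limit.
c_lstm = 0.0
for _ in range(steps):
    f, i, cand = 1.0, 1.0, 0.9             # keep everything AND write everything
    c_lstm = f * c_lstm + i * cand

print("largest |h| seen in GRU-style update:", gru_max)
print("LSTM-style cell after", steps, "steps:", c_lstm)
```

The interpolated state never leaves \((-1,1)\), while the additive cell grows roughly linearly with the number of steps.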

Numeric Walkthrough \((d_h = 4)\)

\(h_{t-1}\): \([\,0.8,\ -0.5,\ 0.3,\ 0.9\,]\)
\(z_t\): \([\,0.1,\ 0.9,\ 0.0,\ 0.5\,]\) (mostly keep dim0, mostly replace dim1…)
\(\tilde h_t\): \([\,0.2,\ 0.7,\ -0.4,\ 0.1\,]\) (new candidate)
\((1-z_t) \odot h_{t-1}\): \([\,0.72,\ -0.05,\ 0.30,\ 0.45\,]\)
\(z_t \odot \tilde h_t\): \([\,0.02,\ 0.63,\ 0.00,\ 0.05\,]\)
\(h_t\) (sum): \([\,\mathbf{0.74},\ \mathbf{0.58},\ \mathbf{0.30},\ \mathbf{0.50}\,]\)
interpretation: dim0 mostly retained (\(z{=}0.1\)); dim1 almost fully replaced (\(z{=}0.9\)); dim2 perfect copy-through (\(z{=}0.0\)).
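
The walkthrough above can be reproduced directly. The variable names `kept`, `written`, and `h_new` are illustrative; the numbers are the ones from the notes.

```python
h_prev = [0.8, -0.5, 0.3, 0.9]
z = [0.1, 0.9, 0.0, 0.5]
cand = [0.2, 0.7, -0.4, 0.1]

kept = [(1.0 - zj) * hj for zj, hj in zip(z, h_prev)]   # (1 - z_t) * h_{t-1}
written = [zj * cj for zj, cj in zip(z, cand)]          # z_t * h~_t
h_new = [k + w for k, w in zip(kept, written)]          # element-wise sum

print([round(v, 2) for v in kept])     # [0.72, -0.05, 0.3, 0.45]
print([round(v, 2) for v in written])  # [0.02, 0.63, 0.0, 0.05]
print([round(v, 2) for v in h_new])    # [0.74, 0.58, 0.3, 0.5]
```

Each entry of `h_new` lies between the corresponding entries of `h_prev` and `cand`, as the convex-combination bound requires.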

5. GRU vs LSTM: Structural Comparison

Property	LSTM	GRU	Implication
State vectors	2 \((C_t, h_t)\)	1 \((h_t)\)	GRU uses ~50% less state memory
Gates	3 (forget, input, output)	2 (update, reset)	GRU has ~25% fewer parameters
Forget–input coupling	independent \((f, i)\)	coupled \((z,\,1{-}z)\)	LSTM can accumulate; GRU interpolates
Output filtering	explicit output gate \(o_t\)	none \((h_t\) exposed directly\()\)	LSTM can hide internal state; GRU cannot
Gradient highway	additive \(C_t\) path, \(\operatorname{diag}(f_t)\)	interpolation, \(\operatorname{diag}(1{-}z_t)\)	both preserve gradients, different mechanisms
Training speed	slower per step	typically faster per step (~20–30% is a rough figure that varies with implementation and hardware)	GRU preferred when compute is limited

At the same sizes (\(d_h{=}512,\ d_{\text{in}}{=}256\)), GRU is ~25% smaller (3 weight transforms vs. LSTM's 4) and has one fewer state vector.

when the difference matters
As a rule of thumb, GRU and LSTM perform nearly identically on sequences under ~200 tokens. For very long sequences (1000+ steps) where the network must simultaneously store information and hide it from the output, LSTM's separate cell state and output gate help. For speed-critical applications or smaller datasets (where overfitting is a risk), GRU's parameter efficiency wins. Neither consistently dominates.

6. Gradient Flow Analysis

The previous state \(h_{t-1}\) reaches \(h_t\) through three routes: it appears directly, inside \(z_t\), and inside \(\tilde h_t\) (both through \(U_h\) and through the reset gate \(r_t\)). Differentiating \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t\) with the product rule gives the full Jacobian:

\(\dfrac{\partial h_t}{\partial h_{t-1}} = \underbrace{\operatorname{diag}(1 - z_t)}_{\text{direct copy-through}} \; + \; \underbrace{\operatorname{diag}(z_t)\,\dfrac{\partial \tilde h_t}{\partial h_{t-1}}}_{\text{write path}} \; + \; \underbrace{\operatorname{diag}(\tilde h_t - h_{t-1})\,\dfrac{\partial z_t}{\partial h_{t-1}}}_{\text{gate-modulation path}}\)
\(\dfrac{\partial z_t}{\partial h_{t-1}} = \operatorname{diag}\!\big(z_t \odot (1 - z_t)\big)\,U_z\) (since \(\sigma' = \sigma(1-\sigma)\)).
\(\dfrac{\partial \tilde h_t}{\partial h_{t-1}} = \operatorname{diag}\!\big(1 - \tilde h_t^{2}\big)\,U_h\,\operatorname{diag}(r_t)\) (dominant term; \(\tanh' = 1-\tanh^2\), reset gate \(r_t\) treated as locally constant).

copy-through limit: if \(z_t \to 0\), the write and gate-modulation terms vanish (they scale with \(z_t\) and \(z_t \odot (1 - z_t)\) respectively), so \(\dfrac{\partial h_t}{\partial h_{t-1}} \to I\) (identity, no decay)
multi-step: \(\dfrac{\partial h_T}{\partial h_k} \approx \displaystyle\prod_{t=k+1}^{T} \operatorname{diag}(1 - z_t)\) when the write and gate-modulation terms are small
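
The three-term Jacobian can be sanity-checked against a finite difference in the scalar case (\(d_h = d_{\text{in}} = 1\)). This is a sketch under assumptions: the weight values in `P` are arbitrary, and the candidate derivative here includes the extra term \(h_{t-1}\,\partial r_t/\partial h_{t-1}\) that the notes drop when treating \(r_t\) as locally constant, because the numerical derivative sees it.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical scalar weights; the values are arbitrary.
P = dict(wz=0.4, uz=-0.6, bz=0.1, wr=0.3, ur=0.5, br=-0.2, wh=0.7, uh=0.9, bh=0.0)

def gru_step(x, h, p=P):
    """One scalar GRU step; returns the new state plus the intermediates."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])
    cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    h_new = (1.0 - z) * h + z * cand
    return h_new, z, r, cand

x, h = 0.5, 0.3
h_new, z, r, cand = gru_step(x, h)

dz_dh = z * (1.0 - z) * P["uz"]        # sigmoid' = sigmoid * (1 - sigmoid)
dr_dh = r * (1.0 - r) * P["ur"]
# r is the dominant term from the notes; h * dr_dh is the reset gate's own dependence on h.
dcand_dh = (1.0 - cand ** 2) * P["uh"] * (r + h * dr_dh)

copy_term = 1.0 - z                    # direct copy-through
write_term = z * dcand_dh              # write path
gate_term = (cand - h) * dz_dh         # gate-modulation path
analytic = copy_term + write_term + gate_term

eps = 1e-6                             # central finite difference
numeric = (gru_step(x, h + eps)[0] - gru_step(x, h - eps)[0]) / (2 * eps)

print("analytic:", analytic)
print("numeric :", numeric)
```

The two values should agree to several decimal places, which confirms that the product-rule decomposition accounts for every route from \(h_{t-1}\) to \(h_t\).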

how GRU preserves gradients
The interpolation guarantees the Jacobian always contains the additive, diagonal term \(\operatorname{diag}(1-z_t)\). When the update gate is near zero on a dimension, that dimension's gradient passes through unattenuated — directly analogous to LSTM's \(\operatorname{diag}(f_t)\) pathway. The mechanism differs (interpolation vs. additive cell state) but the gradient-preserving effect is equivalent: a learned, near-identity highway that the network opens by driving \(z_t \to 0\) (GRU) or \(f_t \to 1\) (LSTM).
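
How close to zero the update gate must sit for the highway to carry a gradient over many steps is plain arithmetic. The sketch below assumes a constant gate value on one dimension and ignores the write and gate-modulation terms; `highway_gain` is a hypothetical helper name.

```python
def highway_gain(z, steps):
    """Product of (1 - z) over `steps` timesteps for a constant update gate z."""
    return (1.0 - z) ** steps

for z in (0.001, 0.01, 0.1, 0.5):
    print(f"z = {z:<5}  copy-through gain over 100 steps = {highway_gain(z, 100):.3e}")
```

Under these assumptions a gate of 0.01 still retains roughly a third of the gradient after 100 steps, while a gate of 0.5 drives it to effectively zero. The highway is only open on dimensions where the network learns to keep \(z_t\) very small.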

7. End-to-End Operational Lifecycle

[Input sequence: x_1, x_2, ..., x_T]
        │
        ▼
[Embedding / Feature extraction] ────────► X: (T, d_input)
        │
        ▼ Initialize: h_0 = zeros(d_h)
┌────────────────────────────────────────────────────────────────────────┐
│ GRU CELL (same weights at every timestep)                              │
├────────────────────────────────────────────────────────────────────────┤
│ FOR EACH t = 1 to T:                                                   │
│                                                                        │
│   z_t = σ(W_z·x_t + U_z·h_{t-1} + b_z)           ──► update    (d_h,)  │
│   r_t = σ(W_r·x_t + U_r·h_{t-1} + b_r)           ──► reset     (d_h,)  │
│   h̃_t = tanh(W_h·x_t + U_h·(r_t⊙h_{t-1}) + b_h)  ──► candidate (d_h,)  │
│   h_t = (1-z_t)⊙h_{t-1} + z_t⊙h̃_t                ──► new state (d_h,)  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
        │
        ▼
[h_T or sequence of h_t] ──────────────► Task head → prediction
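
The lifecycle above can be sketched end to end in dependency-free Python on the anchor task (next-value forecast for a univariate series). Everything here is an illustrative assumption: the sine-wave input, the sizes \(d_{\text{in}}=1\) and \(d_h=4\), the uniform random initialisation, and the linear task head `w_out`. The weights are untrained, so the forecast is meaningless; the point is the data flow and the shapes.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_gru(d_in, d_h, seed=0):
    """One (W, U, b) triple per transform: update z, reset r, candidate h."""
    rng = random.Random(seed)
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = rand_matrix(d_h, d_in, rng)
        params["U_" + gate] = rand_matrix(d_h, d_h, rng)
        params["b_" + gate] = [0.0] * d_h
    return params

def affine(W, U, b, x, h):
    """W x + U h + b, element-wise over the d_h outputs."""
    return [a + c + d for a, c, d in zip(matvec(W, x), matvec(U, h), b)]

def gru_step(p, x_t, h_prev):
    z = [sigmoid(a) for a in affine(p["W_z"], p["U_z"], p["b_z"], x_t, h_prev)]
    r = [sigmoid(a) for a in affine(p["W_r"], p["U_r"], p["b_r"], x_t, h_prev)]
    gated = [rj * hj for rj, hj in zip(r, h_prev)]          # r_t * h_{t-1}
    cand = [math.tanh(a) for a in affine(p["W_h"], p["U_h"], p["b_h"], x_t, gated)]
    return [(1.0 - zj) * hj + zj * cj for zj, hj, cj in zip(z, h_prev, cand)]

def gru_forward(p, xs, d_h):
    """Run the same cell over every timestep, starting from h_0 = zeros."""
    h = [0.0] * d_h
    states = []
    for x_t in xs:
        h = gru_step(p, x_t, h)
        states.append(h)
    return states

d_in, d_h = 1, 4
series = [math.sin(0.3 * t) for t in range(20)]   # toy univariate series
xs = [[v] for v in series]                        # X: (T, d_in)

params = init_gru(d_in, d_h)
states = gru_forward(params, xs, d_h)

# Hypothetical linear task head on the final state h_T (untrained weights).
head_rng = random.Random(1)
w_out = [head_rng.uniform(-0.5, 0.5) for _ in range(d_h)]
forecast = sum(w * h for w, h in zip(w_out, states[-1]))

print("h_T      :", [round(v, 3) for v in states[-1]])
print("forecast :", round(forecast, 3))
```

Note that some libraries apply the reset gate after the recurrent linear map, or swap the roles of \(z_t\) and \(1-z_t\); this sketch follows the equations as written in these notes.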

parameters / layer: \(3 \times \big[d_h\,d_{\text{in}} + d_h^2 + d_h\big]\) (3 transforms: \(z, r, \tilde h\))
example: \(d_h{=}512,\ d_{\text{in}}{=}256 \Rightarrow 3 \times (512{\times}768 + 512) = \mathbf{1{,}181{,}184}\)
vs LSTM: \(1{,}574{,}912\); GRU is 25% smaller
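
The counts above follow from one formula with a different number of transforms. The helper name `gated_rnn_params` is hypothetical, and the formula assumes one bias vector per transform as in these notes; some frameworks keep two bias vectors per transform, which changes the totals slightly.

```python
def gated_rnn_params(d_in, d_h, n_transforms):
    """n_transforms x (input weights + recurrent weights + one bias vector)."""
    return n_transforms * (d_h * d_in + d_h * d_h + d_h)

gru_params = gated_rnn_params(256, 512, 3)    # z, r, candidate
lstm_params = gated_rnn_params(256, 512, 4)   # f, i, o, candidate
saving = 1 - gru_params / lstm_params

print(gru_params, lstm_params, saving)  # 1181184 1574912 0.25
```

The saving is exactly 3/4 versus 4/4 of the same per-transform cost, so it stays at 25% for any choice of \(d_{\text{in}}\) and \(d_h\).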

8. Interview Depth Q&A

Question	Strong answer pattern
Why can GRU match LSTM with fewer params?	The coupled update gate imposes a conservation constraint (forget \(=1-\)write) that acts as implicit regularization. Fewer degrees of freedom ⇒ less overfitting and easier optimization on moderate data.
What does the reset gate actually do?	It controls how much old state participates in computing the NEW candidate. \(r=0\) ⇒ "ignore history when proposing what to write" — enabling sharp context switches.
When choose GRU over LSTM?	When training speed matters, data is limited (fewer params ⇒ less overfitting), or the task doesn't need fine-grained memory control over 1000+ step dependencies.
Gradient advantage over vanilla RNN?	The \(\operatorname{diag}(1-z_t)\) term is a direct additive path. When \(z\approx 0\), gradients pass through unchanged — equivalent to LSTM's \(f\approx 1\) highway.
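Can GRU perfectly copy values like LSTM?	Yes, in the limit. When \(z_t \to 0\): \(h_t \to h_{t-1}\), matching LSTM's \(f{=}1, i{=}0\) configuration. Because sigmoid outputs lie in the open interval \((0,1)\), both architectures only approach an exact copy as the gate saturates.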
Why is the GRU state bounded but the LSTM cell isn't?	GRU updates by convex interpolation (stays in \((-1,1)\)); LSTM updates the cell additively (\(C_t=f\odot C_{t-1}+i\odot\tilde C\)), which can grow unbounded.