LSTM largely mitigated the vanishing-gradient problem, but at a cost in complexity: three gates, a separate cell state, and roughly 4× the parameters of a vanilla RNN. In 2014, Cho et al. asked whether similar long-range learning could be achieved with a simpler architecture.
The design insight: GRU merges LSTM's forget and input gates into a single "update gate" \(z_t\), and merges cell state and hidden state into one vector. The coupled update means: whatever you forget, you replace with new content in exactly that proportion. This constraint slightly reduces expressiveness but eliminates an entire state vector and one gate's worth of parameters.
2. Single-State Architecture
GRU maintains only one state vector \(h_t\) that serves both as long-term memory and output interface:
| Component | Symbol | Shape | Role |
|---|---|---|---|
| Update Gate | \(z_t\) | \((d_h,)\) | Controls interpolation between old state and new candidate |
| Reset Gate | \(r_t\) | \((d_h,)\) | Controls how much old state participates in candidate computation |
| Candidate | \(\tilde h_t\) | \((d_h,)\) | Proposed new state content |
| Hidden State | \(h_t\) | \((d_h,)\) | The single state: both memory and output |
Figure: the GRU cell with one state \(h\). The update gate \(z_t\) mixes the old state \((\times(1{-}z_t))\) with a candidate \((\times z_t)\); the reset gate \(r_t\) gates the old state into the candidate via \(r_t \odot h_{t-1}\).
Why this operation: The update gate determines how much of the old state to keep vs. replace. \(z_t = 1\): replace old state with the candidate (full write). \(z_t = 0\): keep old state unchanged (full copy-through). It performs the combined role of LSTM's forget and input gates: writing more automatically means forgetting more. This coupling is the key simplification.
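The two extremes can be checked with a minimal NumPy sketch. It assumes the gate form \(z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)\) from the lifecycle diagram in section 7; the function names, shapes, and random weights are illustrative assumptions, not part of the original.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def update_gate(x_t, h_prev, W_z, U_z, b_z):
    """z_t = sigmoid(W_z x_t + U_z h_{t-1} + b_z): one value per hidden dimension."""
    return sigmoid(W_z @ x_t + U_z @ h_prev + b_z)

def interpolate(z_t, h_prev, h_cand):
    """h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, element-wise."""
    return (1.0 - z_t) * h_prev + z_t * h_cand

# Illustrative sizes and random values (assumptions for the demo).
rng = np.random.default_rng(0)
d_in, d_h = 3, 4
W_z = rng.normal(size=(d_h, d_in))
U_z = rng.normal(size=(d_h, d_h))
b_z = np.zeros(d_h)
x_t = rng.normal(size=d_in)
h_prev = rng.uniform(-1, 1, size=d_h)
h_cand = rng.uniform(-1, 1, size=d_h)

z_t = update_gate(x_t, h_prev, W_z, U_z, b_z)          # learned gate, strictly inside (0, 1)
h_keep = interpolate(np.zeros(d_h), h_prev, h_cand)    # z = 0: copy-through
h_write = interpolate(np.ones(d_h), h_prev, h_cand)    # z = 1: full write
```

Because the sigmoid output lies strictly between 0 and 1, a trained gate only approaches these extremes; the hand-set values of 0 and 1 are used here to show the limiting behavior.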
Why this operation: The reset gate controls how much the previous state influences the candidate. \(r_t = 0\): the candidate is computed as if there were no history (a pure function of \(x_t\)). \(r_t = 1\): the full previous state feeds in. This lets the GRU make sharp "reset" decisions: when a new context begins, it can ignore accumulated history when proposing what to write next.
The Candidate State \((\tilde h_t)\)
Equation: \(\tilde h_t = \tanh\big(W_h\, x_t + U_h\,(r_t \odot h_{t-1}) + b_h\big)\)
Key difference: \(h_{t-1}\) is masked by \(r_t\) before the linear transform
Output: \(\tilde h_t \in (-1, +1)^{d_h}\)
Why \(r_t\) multiplies \(h_{t-1}\) inside the candidate: The reset gate is a content selector: it chooses which dimensions of the old state are relevant for computing the new candidate. This differs from the update gate, which decides what to keep; the reset gate decides what old information is useful for generating new information. A dimension with \(r_{t,j}=0\) removes \(h_{t-1,j}\) from the candidate computation entirely; if every dimension is reset, the candidate is computed from scratch using only \(x_t\).
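A small NumPy sketch can illustrate the reset behavior described above. It follows the candidate equation as written; the helper name `candidate`, the sizes, and the random weights are assumptions chosen for the demonstration.

```python
import numpy as np

def candidate(x_t, h_prev, r_t, W_h, U_h, b_h):
    """h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)."""
    return np.tanh(W_h @ x_t + U_h @ (r_t * h_prev) + b_h)

# Illustrative sizes and random values (assumptions for the demo).
rng = np.random.default_rng(1)
d_in, d_h = 3, 4
W_h = rng.normal(size=(d_h, d_in))
U_h = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)
x_t = rng.normal(size=d_in)
h_a = rng.uniform(-1, 1, size=d_h)   # two different histories
h_b = rng.uniform(-1, 1, size=d_h)

# Full reset: the mask zeroes the old state, so both histories give the same candidate.
r_zero = np.zeros(d_h)
cand_a = candidate(x_t, h_a, r_zero, W_h, U_h, b_h)
cand_b = candidate(x_t, h_b, r_zero, W_h, U_h, b_h)
no_history = np.tanh(W_h @ x_t + b_h)

# No reset: the full previous state feeds in, so the candidates differ.
r_one = np.ones(d_h)
cand_full_a = candidate(x_t, h_a, r_one, W_h, U_h, b_h)
cand_full_b = candidate(x_t, h_b, r_one, W_h, U_h, b_h)
```

With `r_zero`, the candidate equals `tanh(W_h @ x_t + b_h)` regardless of which history is supplied; with `r_one`, the two histories produce different candidates.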
4. State Interpolation: The Core Update
Core equation: \(h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t\)
Operation type: convex interpolation (element-wise)
Constraint: \((1 - z_{t,j}) + z_{t,j} = 1\) for every dimension \(j\)
Decomposing the Interpolation
| Term | Computation | When \(z \to 0\) | When \(z \to 1\) |
|---|---|---|---|
| \((1 - z_t) \odot h_{t-1}\) | proportion of old state retained | full retention (copy-through) | complete erasure |
| \(z_t \odot \tilde h_t\) | proportion of new candidate written | no new content | full replacement |
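As a worked example for a single dimension \(j\), with made-up values \(h_{t-1,j} = 0.8\), \(\tilde h_{t,j} = -0.4\), and \(z_{t,j} = 0.25\):

\[
h_{t,j} = (1 - 0.25)(0.8) + (0.25)(-0.4) = 0.6 - 0.1 = 0.5
\]

The result lies between the candidate and the old value, and sits closer to the old value because \(z_{t,j}\) is small.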
the critical constraint: coupled forget–input
In LSTM, forget and input are independent: \(f=1, i=1\) keeps old AND writes new, so the cell state can grow. In GRU the coefficients sum to exactly \(1\): whatever proportion you forget, you write that same proportion of new content. This yields a provable boundedness property:
\(h_{t,j} = (1-z_{t,j})\,h_{t-1,j} + z_{t,j}\,\tilde h_{t,j}\) is a convex combination (weights \(\in[0,1]\), sum \(=1\)).
So \(\min(h_{t-1,j},\,\tilde h_{t,j}) \le h_{t,j} \le \max(h_{t-1,j},\,\tilde h_{t,j})\).
With \(h_0 = 0\) and \(\tilde h_t \in (-1,1)\), induction gives \(h_{t} \in (-1,1)^{d_h}\) for all \(t\).
The GRU hidden state is therefore bounded in \((-1,1)\) — unlike the LSTM cell state, which is additive and can grow without bound.
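The boundedness argument can be spot-checked numerically. The sketch below is an illustration, not a proof: it feeds arbitrary gate values and candidates into the interpolation and, for contrast, runs an LSTM-style additive cell at the hand-picked extreme \(f = 1, i = 1\) with a constant positive candidate. All values and names are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, steps = 8, 500

h = np.zeros(d_h)   # GRU state, h_0 = 0
c = np.zeros(d_h)   # LSTM-style cell state, C_0 = 0
gru_max = 0.0

for _ in range(steps):
    # GRU: any gate values in [0, 1] and any candidate in (-1, 1).
    z = rng.uniform(0.0, 1.0, size=d_h)
    cand = np.tanh(rng.normal(scale=3.0, size=d_h))
    h = (1.0 - z) * h + z * cand
    gru_max = max(gru_max, float(np.max(np.abs(h))))

    # LSTM-style cell at the uncoupled extreme f = 1, i = 1,
    # always writing the same positive candidate tanh(3) ~ 0.995.
    f, i = 1.0, 1.0
    c = f * c + i * np.tanh(np.full(d_h, 3.0))

lstm_max = float(np.max(np.abs(c)))
```

Here `gru_max` never exceeds 1, while `lstm_max` grows roughly linearly with the number of steps. A trained LSTM will not usually sit at this extreme; the point is only that nothing in the additive update rules it out.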
With the same sizes (\(d_h{=}512,\ d_{\text{in}}{=}256\)), the GRU has about 25% fewer parameters than the LSTM (3 weight transforms vs. LSTM's 4) and one fewer state vector.
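The 25% figure can be reproduced with a quick count. The sketch assumes each transform has an input matrix \(W\), a recurrent matrix \(U\), and a single bias vector; the helper name `gated_rnn_params` is hypothetical.

```python
def gated_rnn_params(d_in, d_h, n_transforms):
    """Parameter count assuming each transform has W (d_h x d_in),
    U (d_h x d_h), and one bias vector of length d_h."""
    return n_transforms * (d_h * d_in + d_h * d_h + d_h)

d_in, d_h = 256, 512
rnn_params = gated_rnn_params(d_in, d_h, 1)    # vanilla RNN: 1 transform
gru_params = gated_rnn_params(d_in, d_h, 3)    # GRU: z, r, candidate
lstm_params = gated_rnn_params(d_in, d_h, 4)   # LSTM: f, i, o, candidate
saving = 1.0 - gru_params / lstm_params

print(gru_params, lstm_params, f"{saving:.0%}")  # prints: 1181184 1574912 25%
```

Some library implementations keep two bias vectors per transform, which changes the totals slightly but leaves the 3:4 ratio intact.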
When the difference matters: On sequences under ~200 tokens, GRU and LSTM often perform near-identically. For very long sequences (1000+) where the network must simultaneously store information and hide it from outputs, LSTM's separate cell state and output gate help. For speed-critical applications or smaller datasets (where overfitting is a risk), GRU's parameter efficiency wins. Neither consistently dominates.
6. Gradient Flow Analysis
The previous-state \(h_{t-1}\) reaches \(h_t\) through three routes (it appears directly, and inside both \(z_t\) and \(\tilde h_t\)). Differentiating \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot \tilde h_t\) with the product rule gives the full Jacobian:
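Written out from the gate definitions in the lifecycle diagram of section 7 (a reconstruction from those equations, with the sigmoid and tanh derivatives expanded), the three routes are:

\[
\frac{\partial h_t}{\partial h_{t-1}}
= \underbrace{\operatorname{diag}(1 - z_t)}_{\text{direct path}}
+ \underbrace{\operatorname{diag}(\tilde h_t - h_{t-1})\,\operatorname{diag}\big(z_t \odot (1 - z_t)\big)\,U_z}_{\text{through } z_t}
+ \underbrace{\operatorname{diag}(z_t)\,\frac{\partial \tilde h_t}{\partial h_{t-1}}}_{\text{through } \tilde h_t}
\]

where

\[
\frac{\partial \tilde h_t}{\partial h_{t-1}}
= \operatorname{diag}\big(1 - \tilde h_t^{2}\big)\,U_h\,\Big[\operatorname{diag}(r_t) + \operatorname{diag}(h_{t-1})\,\operatorname{diag}\big(r_t \odot (1 - r_t)\big)\,U_r\Big].
\]

The second and third terms both carry a factor that vanishes as \(z_t \to 0\), which is why only the direct path survives in that limit.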
Copy-through limit: if \(z_t \to 0\): \(\dfrac{\partial h_t}{\partial h_{t-1}} \approx \operatorname{diag}(1 - z_t) \to I\) (identity, no decay)
Multi-step: \(\dfrac{\partial h_T}{\partial h_k} \approx \displaystyle\prod_{t=k+1}^{T} \operatorname{diag}(1 - z_t)\) when the modulation terms are small
How GRU preserves gradients: The interpolation guarantees the Jacobian always contains the additive, diagonal term \(\operatorname{diag}(1-z_t)\). When the update gate is near zero on a dimension, that dimension's gradient passes through unattenuated, directly analogous to LSTM's \(\operatorname{diag}(f_t)\) pathway. The mechanism differs (interpolation vs. additive cell state) but the gradient-preserving effect is equivalent: a learned, near-identity highway that the network opens by driving \(z_t \to 0\) (GRU) or \(f_t \to 1\) (LSTM).
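The near-identity claim can be checked numerically with finite differences. In the sketch below, the weights are random and a large negative update-gate bias is used as a stand-in for a network that has learned to drive \(z_t \to 0\); the helper names, sizes, and bias value are assumptions for the demonstration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x_t, p):
    """One GRU step using the equations from the lifecycle diagram."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    return (1.0 - z) * h_prev + z * cand

def numerical_jacobian(h_prev, x_t, p, eps=1e-6):
    """Central finite differences: column j holds d h_t / d h_{t-1, j}."""
    d_h = h_prev.size
    J = np.zeros((d_h, d_h))
    for j in range(d_h):
        step = np.zeros(d_h)
        step[j] = eps
        J[:, j] = (gru_step(h_prev + step, x_t, p)
                   - gru_step(h_prev - step, x_t, p)) / (2 * eps)
    return J

# Illustrative sizes and random parameters (assumptions for the demo).
rng = np.random.default_rng(3)
d_in, d_h = 3, 4
p = {}
for g in ("z", "r", "h"):
    p[f"W_{g}"] = rng.normal(scale=0.5, size=(d_h, d_in))
    p[f"U_{g}"] = rng.normal(scale=0.5, size=(d_h, d_h))
    p[f"b_{g}"] = np.zeros(d_h)
x_t = rng.normal(size=d_in)
h_prev = rng.uniform(-1, 1, size=d_h)

J_default = numerical_jacobian(h_prev, x_t, p)
p_closed = dict(p, b_z=np.full(d_h, -15.0))   # large negative bias pushes z toward 0
J_closed = numerical_jacobian(h_prev, x_t, p_closed)

err_default = np.max(np.abs(J_default - np.eye(d_h)))
err_closed = np.max(np.abs(J_closed - np.eye(d_h)))
```

With the gate effectively closed, `err_closed` is close to zero, meaning the one-step Jacobian is nearly the identity; with zero bias the gate sits near 0.5 and `err_default` is much larger.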
7. End-to-End Operational Lifecycle
[Input sequence: x_1, x_2, ..., x_T]
│
▼
[Embedding / Feature extraction] ────────► X: (T, d_input)
│
▼ Initialize: h_0 = zeros(d_h)
┌────────────────────────────────────────────────────────────────┐
│ GRU CELL (same weights at every timestep) │
├────────────────────────────────────────────────────────────────┤
│ FOR EACH t = 1 to T: │
│ │
│ z_t = σ(W_z·x_t + U_z·h_{t-1} + b_z) ────► update (d_h,) │
│ r_t = σ(W_r·x_t + U_r·h_{t-1} + b_r) ────► reset (d_h,) │
│ h̃_t = tanh(W_h·x_t + U_h·(r_t⊙h_{t-1}) + b_h) ► cand. │
│ h_t = (1-z_t)⊙h_{t-1} + z_t⊙h̃_t ────────► new state (d_h,)│
│ │
└────────────────────────────────────────────────────────────────┘
│
▼
[h_T or sequence of h_t] ──────────────► Task head → prediction
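The loop in the diagram can be sketched in NumPy as follows. This is a minimal forward pass under the equations shown above, not an optimized or trainable implementation; `init_gru_params`, `gru_forward`, the initialization scale, and the sizes are hypothetical choices, and the embedding step and task head are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_params(d_in, d_h, rng):
    """Hypothetical initializer: small random weights, zero biases."""
    p = {}
    for g in ("z", "r", "h"):
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_in))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    return p

def gru_forward(X, p):
    """Run the GRU over X of shape (T, d_in); return all states, shape (T, d_h)."""
    d_h = p["b_z"].size
    h = np.zeros(d_h)                  # h_0 = zeros(d_h)
    states = []
    for x_t in X:                      # same weights at every timestep
        z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h + p["b_z"])        # update gate
        r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h + p["b_r"])        # reset gate
        cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h) + p["b_h"])  # candidate
        h = (1.0 - z) * h + z * cand                                  # new state
        states.append(h)
    return np.stack(states)

# Illustrative input: T = 10 steps of 5-dimensional features (assumed sizes).
rng = np.random.default_rng(4)
T, d_in, d_h = 10, 5, 8
X = rng.normal(size=(T, d_in))
params = init_gru_params(d_in, d_h, rng)

H = gru_forward(X, params)   # sequence of h_t, e.g. for per-step prediction
h_T = H[-1]                  # final state, e.g. for sequence classification
```

Either `h_T` or the full sequence `H` would then be passed to the task head, depending on whether the task needs one prediction per sequence or one per timestep.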
The coupled update gate imposes a conservation constraint (forget \(=1-\)write) that acts as implicit regularization. Fewer degrees of freedom ⇒ less overfitting and easier optimization on moderate data.
What does the reset gate actually do?
It controls how much old state participates in computing the NEW candidate. \(r=0\) ⇒ "ignore history when proposing what to write" — enabling sharp context switches.
When should you choose GRU over LSTM?
When training speed matters, data is limited (fewer params ⇒ less overfitting), or the task doesn't need fine-grained memory control over 1000+ step dependencies.
What is the gradient advantage over a vanilla RNN?
The \(\operatorname{diag}(1-z_t)\) term is a direct additive path. When \(z\approx 0\), gradients pass through unchanged — equivalent to LSTM's \(f\approx 1\) highway.
Can GRU perfectly copy values like LSTM?
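Yes, in the limit. When \(z_t = 0\): \(h_t = h_{t-1}\) exactly, identical to LSTM's \(f{=}1, i{=}0\) configuration. Because \(z_t\) is a sigmoid output it only approaches 0, so in practice the copy is near-exact rather than exact (the same holds for LSTM's gates).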
Why is the GRU state bounded but the LSTM cell isn't?
GRU updates by convex interpolation (stays in \((-1,1)\)); LSTM updates the cell additively (\(C_t=f\odot C_{t-1}+i\odot\tilde C\)), which can grow unbounded.