Computational Linguistics

Gated Recurrent Unit

The Gated Recurrent Unit is a simplified gating architecture that merges the LSTM's cell state and hidden state into a single vector, achieving comparable performance with fewer parameters and faster training.

hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ

The Gated Recurrent Unit (GRU), proposed by Cho et al. (2014), simplifies the LSTM architecture by combining the forget and input gates into a single update gate and merging the cell state and hidden state into one vector. The result is a gated recurrent architecture with fewer parameters than the LSTM that is faster to compute and often achieves comparable performance on language modeling and sequence-to-sequence tasks. The GRU's elegant design has made it a popular alternative to the LSTM, particularly in settings where computational efficiency is important.

GRU Equations

Update gate: zₜ = σ(W_z · [hₜ₋₁, xₜ] + b_z)
Reset gate: rₜ = σ(W_r · [hₜ₋₁, xₜ] + b_r)
Candidate: h̃ₜ = tanh(W · [rₜ ⊙ hₜ₋₁, xₜ] + b)
Hidden state: hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ

where σ is the sigmoid function, ⊙ denotes element-wise (Hadamard) multiplication, and [hₜ₋₁, xₜ] is the concatenation of the previous hidden state and the current input.

The GRU uses two gates instead of the LSTM's three. The update gate zₜ controls how much of the previous hidden state to retain versus how much to replace with the new candidate. When zₜ is close to 0, the hidden state is largely copied from the previous step (analogous to the LSTM's forget gate being close to 1); when zₜ is close to 1, the hidden state is mostly replaced by the new candidate. The reset gate rₜ controls how much of the previous hidden state is exposed when computing the candidate, allowing the model to effectively forget irrelevant past information.
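As a concrete illustration, the four equations above translate directly into code. The sketch below uses NumPy with illustrative parameter names (Wz, Wr, W and their biases are assumptions, not from a particular library), each weight matrix acting on the concatenation [hₜ₋₁, xₜ]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x, Wz, bz, Wr, br, W, b):
    """One GRU time step following the equations above.

    h_prev: (H,) previous hidden state; x: (D,) current input.
    Each weight matrix has shape (H, H + D) and acts on [h_prev, x].
    """
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx + bz)                                    # update gate
    r = sigmoid(Wr @ hx + br)                                    # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]) + b)   # candidate
    return (1 - z) * h_prev + z * h_tilde                        # new hidden state

# Example: H=3 hidden units, D=2 inputs, small random parameters.
rng = np.random.default_rng(0)
H, D = 3, 2
Wz, Wr, W = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(3))
bz = br = b = np.zeros(H)
h = np.zeros(H)
for x in rng.standard_normal((5, D)):   # run 5 time steps
    h = gru_step(h, x, Wz, bz, Wr, br, W, b)
print(h.shape)  # (3,)
```

Note how the interpolation in the last line makes the gating behavior explicit: with all weights and biases at zero, z = σ(0) = 0.5 and the candidate is tanh(0) = 0, so the hidden state simply decays by half at each step.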

Comparison with LSTM

The GRU has approximately 25% fewer parameters than a comparably sized LSTM: it uses three weight blocks (two gates plus the candidate) instead of the LSTM's four, having merged the forget and input gates and eliminated the output gate along with the separate cell state. Several empirical studies have compared GRUs and LSTMs across tasks including language modeling, machine translation, and speech recognition. Chung et al. (2014) found that GRUs outperformed LSTMs on some tasks and underperformed on others, with no consistent winner. Jozefowicz et al. (2015) conducted a large-scale architecture search and concluded that while an LSTM with its forget gate bias initialized to 1 is a strong default, some GRU variants can match its performance at lower computational cost.
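The 25% figure follows directly from counting weight blocks. The sketch below (function name and the assumption that each gate or candidate uses one (H × (H + D)) matrix plus a bias, with no peepholes or layer normalization, are simplifications for illustration) makes the arithmetic explicit:

```python
def rnn_param_count(hidden, inputs, n_blocks):
    """Parameter count for a gated RNN in which each gate or candidate
    uses one (hidden x (hidden + inputs)) weight matrix plus a bias.
    The LSTM has 4 such blocks (input, forget, output gates + candidate);
    the GRU has 3 (update, reset gates + candidate)."""
    return n_blocks * (hidden * (hidden + inputs) + hidden)

H, D = 256, 128
lstm = rnn_param_count(H, D, 4)
gru = rnn_param_count(H, D, 3)
print(gru / lstm)  # 0.75 -> the GRU has 25% fewer parameters
```

The ratio is exactly 3/4 regardless of the hidden and input sizes, since both counts share the same per-block cost.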

Minimal Gating Units

The success of the GRU in simplifying the LSTM raised the question of whether further simplification is possible. Zhou et al. (2016) proposed the Minimal Gated Unit (MGU), which uses just a single gate and achieves reasonable performance on several benchmark tasks. Greff et al. (2017) conducted a systematic ablation study of LSTM components and found that the forget gate is the most critical: removing any other single gate has relatively minor impact. These studies suggest that the essential mechanism is the ability to selectively preserve information over time, which can be achieved with minimal gating architectures.
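For reference, the MGU of Zhou et al. (2016) collapses the GRU's update and reset gates into a single forget gate fₜ that plays both roles (notation as in the GRU equations above; sketched here from the cited paper):

Forget gate: fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
Candidate: h̃ₜ = tanh(W · [fₜ ⊙ hₜ₋₁, xₜ] + b)
Hidden state: hₜ = (1 − fₜ) ⊙ hₜ₋₁ + fₜ ⊙ h̃ₜ

The same gate thus both masks the previous state inside the candidate (the GRU's reset role) and interpolates between old and new states (the update role), cutting the gate parameters by a further third relative to the GRU.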

Applications in Language Modeling

In language modeling specifically, GRUs have been used both as standalone models and as components in larger architectures. The encoder-decoder framework for machine translation originally proposed by Cho et al. (2014) used GRUs for both the encoder and decoder, demonstrating competitive translation quality with efficient training. GRU-based language models have also been successful in dialogue systems, text generation, and as components in attention-based architectures where the recurrent layer provides the sequential backbone over which attention operates.

While transformers have largely supplanted both LSTMs and GRUs for large-scale language modeling, GRUs remain relevant in resource-constrained settings such as on-device language models, real-time applications, and scenarios where training data is limited. The GRU's contribution extends beyond its direct use: it demonstrated that the gating principle, rather than any specific gate configuration, is the key to effective recurrent modeling, influencing the design of subsequent architectures including highway networks and residual connections.

References

  1. Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of EMNLP, 1724–1734. doi:10.3115/v1/D14-1179
  2. Chung, J., Gülçehre, Ç., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  3. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015). An empirical exploration of recurrent network architectures. Proceedings of ICML, 2342–2350.
  4. Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232. doi:10.1109/TNNLS.2016.2582924
  5. Zhou, G.-B., Wu, J., Zhang, C.-L., & Lin, Z.-H. (2016). Minimal gated unit for recurrent neural networks. International Journal of Automation and Computing, 13(3), 226–234.
