LSTM Language Model

Long Short-Term Memory language models address the vanishing gradient problem of simple RNNs through a gating mechanism that learns to selectively remember and forget information, enabling the capture of long-range dependencies in text.

fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f), cₜ = fₜ⊙cₜ₋₁ + iₜ⊙c̃ₜ

The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber (1997), addresses the fundamental limitation of simple RNNs by introducing a gated memory cell that can maintain information over long time intervals. When applied to language modeling, LSTMs achieve substantially lower perplexity than vanilla RNNs because they can learn to remember relevant contextual information — such as the subject of a sentence — across many intervening words. LSTM language models dominated the field from roughly 2012 to 2018 and remain an important baseline architecture.

LSTM Cell Architecture

LSTM Equations

Forget gate: fₜ = σ(W_f · [hₜ₋₁, xₜ] + b_f)
Input gate: iₜ = σ(W_i · [hₜ₋₁, xₜ] + b_i)
Candidate: c̃ₜ = tanh(W_c · [hₜ₋₁, xₜ] + b_c)
Cell state: cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ
Output gate: oₜ = σ(W_o · [hₜ₋₁, xₜ] + b_o)
Hidden state: hₜ = oₜ ⊙ tanh(cₜ)

The LSTM cell maintains two state vectors: the cell state cₜ, which serves as a long-term memory, and the hidden state hₜ, which serves as a short-term working representation. Three gates control information flow. The forget gate fₜ determines which information to discard from the previous cell state. The input gate iₜ determines which new information to store. The output gate oₜ determines which parts of the cell state to expose as the hidden state. All gates use sigmoid activations producing values in [0, 1], allowing smooth interpolation between fully open and fully closed.
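The gate equations above translate directly into code. The following is a minimal numpy sketch of a single LSTM time step; the parameter names mirror the equations (W_f, W_i, W_c, W_o and their biases), and the weights here are random, for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate equations above.

    params holds the four weight matrices W_f, W_i, W_c, W_o
    (each of shape (hidden, hidden + input)) and biases b_f, b_i, b_c, b_o.
    """
    z = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate
    c = f * c_prev + i * c_tilde                          # cell state update
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    h = o * np.tanh(c)                                    # hidden state
    return h, c

# Tiny usage example with random weights (illustrative only).
rng = np.random.default_rng(0)
H, X = 4, 3  # hidden size, input size
params = {}
for g in "fico":  # the four gates: forget, input, candidate, output
    params[f"W_{g}"] = rng.normal(scale=0.1, size=(H, H + X))
    params[f"b_{g}"] = np.zeros(H)
h, c = lstm_step(rng.normal(size=X), np.zeros(H), np.zeros(H), params)
print(h.shape, c.shape)  # (4,) (4,)
```

Note that because hₜ = oₜ ⊙ tanh(cₜ), the hidden state is always bounded in (−1, 1), while the cell state cₜ is unbounded and can accumulate over time.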

Why LSTMs Work for Language Modeling

The key insight of the LSTM is that the cell state update cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ creates an additive path for gradient flow through time. When the forget gate is close to 1 and the input gate is close to 0, the cell state is simply copied forward, and gradients flow backward without attenuation. This constant error carousel solves the vanishing gradient problem for information that the network learns to store. For language modeling, this means an LSTM can learn to track syntactic dependencies (e.g., subject-verb agreement across relative clauses) and semantic context over spans of 50 or more tokens.
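The constant error carousel can be checked numerically: when fₜ saturates at 1 and iₜ at 0, the update cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ reduces to a copy, and the local gradient ∂cₜ/∂cₜ₋₁ = fₜ stays at 1 across arbitrarily many steps. A small demonstration:

```python
import numpy as np

c = np.array([0.5, -1.2, 2.0])  # initial cell state
f = np.ones(3)                  # forget gate fully open: keep everything
i = np.zeros(3)                 # input gate fully closed: write nothing
c_tilde = np.random.default_rng(0).normal(size=3)  # candidate (ignored)

grad = 1.0
for _ in range(50):          # 50 time steps
    c = f * c + i * c_tilde  # cell state is copied forward unchanged
    grad *= f[0]             # product of local gradients along the cell path

print(c, grad)  # cell state unchanged; gradient product exactly 1.0
```

Contrast this with a vanilla RNN, where the same 50-step backward path would multiply 50 Jacobians of a squashing nonlinearity, typically driving the gradient toward zero.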

AWD-LSTM: The State-of-the-Art RNN Language Model

Merity et al. (2018) introduced the AWD-LSTM (ASGD Weight-Dropped LSTM), which achieved state-of-the-art results among RNN-based language models through a combination of regularization techniques: weight dropout (DropConnect on hidden-to-hidden weights), variational dropout (consistent dropout masks across time steps), embedding dropout, weight tying between input embeddings and output softmax, and averaged SGD optimization. The AWD-LSTM achieved a perplexity of 57.3 on Penn Treebank and 65.8 on WikiText-2, numbers that stood as benchmarks until transformer models surpassed them.
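The distinctive regularizer in the AWD-LSTM is weight dropout: DropConnect applied to the recurrent (hidden-to-hidden) weight matrix, with the same dropped matrix reused at every time step rather than resampling a mask per step. A hedged numpy sketch of the idea (the function name and inverted-scaling choice are this sketch's, not the paper's exact implementation):

```python
import numpy as np

def weight_drop(W_hh, p, rng):
    """DropConnect on a hidden-to-hidden weight matrix (AWD-LSTM style sketch).

    A Bernoulli mask zeroes individual recurrent weights; the resulting
    matrix is then used unchanged for the entire forward pass.
    Scaling by 1/(1-p) keeps the expected weight magnitude unchanged.
    """
    mask = rng.random(W_hh.shape) >= p
    return W_hh * mask / (1.0 - p)

rng = np.random.default_rng(42)
W = rng.normal(size=(4, 4))
W_dropped = weight_drop(W, p=0.5, rng=rng)
print((W_dropped == 0).sum())  # roughly half the entries zeroed, in expectation
```

Masking weights rather than activations regularizes the recurrent connection itself without disrupting the cell-state dynamics the way per-step activation dropout would.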

Multi-Layer and Bidirectional LSTMs

Language models typically use stacked (multi-layer) LSTMs, where the hidden state of one LSTM layer serves as the input to the next. Two or three layers are standard, with the lower layers learning more syntactic features and the upper layers capturing more semantic information. Increasing beyond three layers typically yields diminishing returns unless skip connections or highway connections are added. Bidirectional LSTMs, which process the sequence in both directions, are used in models like ELMo but are not suitable for autoregressive language modeling because they violate the left-to-right factorization requirement.
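Stacking is mechanically simple: the hidden-state sequence produced by layer l becomes the input sequence of layer l + 1. The sketch below illustrates this with a fused-gate LSTM (all four gate matrices stacked into one W of shape (4H, H + D)); the fused layout is a common implementation convenience, assumed here, not required by the equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    # Fused-gate formulation: one matmul computes all four gate pre-activations.
    H = h.shape[0]
    z = W @ np.concatenate([h, x]) + b
    f = sigmoid(z[:H])           # forget gate
    i = sigmoid(z[H:2*H])        # input gate
    o = sigmoid(z[2*H:3*H])      # output gate
    c_tilde = np.tanh(z[3*H:])   # candidate
    c = f * c + i * c_tilde
    h = o * np.tanh(c)
    return h, c

def stacked_lstm(xs, layers):
    # Each layer's hidden-state sequence is the next layer's input sequence.
    for W, b in layers:
        H = b.shape[0] // 4
        h, c = np.zeros(H), np.zeros(H)
        outputs = []
        for x in xs:
            h, c = lstm_step(x, h, c, W, b)
            outputs.append(h)
        xs = outputs
    return xs

# Two-layer stack: size-3 inputs feed layer 1 (H=4); layer 1 feeds layer 2.
rng = np.random.default_rng(0)
D, H = 3, 4
layers = [
    (rng.normal(scale=0.1, size=(4*H, H + D)), np.zeros(4*H)),  # layer 1
    (rng.normal(scale=0.1, size=(4*H, H + H)), np.zeros(4*H)),  # layer 2
]
seq = [rng.normal(size=D) for _ in range(5)]
top = stacked_lstm(seq, layers)
print(len(top), top[0].shape)  # 5 (4,)
```

In a language model, the top layer's hidden states would feed a softmax over the vocabulary; the sketch stops at the hidden representations.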

The LSTM language model's legacy extends beyond its direct use. The representational capabilities developed in large-scale LSTM language models informed the design of ELMo (Peters et al., 2018), which demonstrated that contextualized word representations extracted from a pre-trained bidirectional LSTM dramatically improved downstream NLP tasks. This success directly motivated the development of transformer-based pre-trained models like BERT and GPT, which further advanced the paradigm of learning general-purpose representations through language modeling.

References

  1. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. doi:10.1162/neco.1997.9.8.1735
  2. Sundermeyer, M., Schlüter, R., & Ney, H. (2012). LSTM neural networks for language modeling. Proceedings of Interspeech, 194–197.
  3. Merity, S., Keskar, N. S., & Socher, R. (2018). Regularizing and optimizing LSTM language models. Proceedings of ICLR.
  4. Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT, 2227–2237. doi:10.18653/v1/N18-1202
