Recurrent neural network language models (RNN-LMs), introduced by Mikolov et al. (2010), process word sequences step by step, maintaining a hidden state vector that serves as a compressed summary of the entire preceding context. Unlike feedforward neural language models that condition on a fixed window of n-1 words, an RNN-LM can in principle capture dependencies of arbitrary length, since the hidden state at time t is a function of all previous inputs. This theoretical advantage, combined with practical improvements in perplexity, made RNN-LMs the dominant language modeling architecture from 2010 until the advent of transformers.
Architecture and Computation
Hidden state: hₜ = σ(W_hh · hₜ₋₁ + W_xh · xₜ + b_h)
Output: yₜ = softmax(W_hy · hₜ + b_y)
P(wₜ₊₁ | w₁, ..., wₜ) = yₜ[wₜ₊₁]
Loss: L = -(1/T) Σₜ₌₁ᵀ log P(wₜ | w₁, ..., wₜ₋₁)
At each time step t, the RNN takes the embedding of the current word xₜ and the previous hidden state hₜ₋₁ as inputs, producing a new hidden state hₜ through a nonlinear transformation. The hidden state is then projected through a softmax layer to produce a probability distribution over the vocabulary for the next word. The model is trained using truncated backpropagation through time (BPTT), where gradients are computed over fixed-length segments of the sequence rather than the entire corpus to manage computational cost.
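The forward pass and loss above can be sketched in a few lines of NumPy. This is a toy, untrained model with illustrative sizes and random weights (the parameter names mirror the equations; `E`, the embedding table, is an addition not shown in the equations), not a full training loop:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, H = 20, 16, 32          # vocab size, embedding dim, hidden dim (toy values)

# Parameters, named to match the equations above; E is the embedding table
E    = rng.normal(0, 0.1, (V, d))
W_xh = rng.normal(0, 0.1, (H, d))
W_hh = rng.normal(0, 0.1, (H, H))
W_hy = rng.normal(0, 0.1, (V, H))
b_h  = np.zeros(H)
b_y  = np.zeros(V)

def softmax(z):
    z = z - z.max()               # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(words):
    """Run the RNN over a sequence of word ids; return mean next-word loss."""
    h = np.zeros(H)
    loss, T = 0.0, len(words) - 1
    for t in range(T):
        x = E[words[t]]                              # embedding of current word
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)       # hidden state h_t
        y = softmax(W_hy @ h + b_y)                  # distribution over next word
        loss += -np.log(y[words[t + 1]])             # cross-entropy for w_{t+1}
    return loss / T

seq = rng.integers(0, V, size=12)
print(forward(seq))   # untrained model: close to uniform, so roughly log(V) ≈ 3.0 nats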
The Vanishing Gradient Problem
Despite its theoretical ability to capture long-range dependencies, the simple (Elman) RNN struggles in practice because of the vanishing gradient problem. When gradients are backpropagated through many time steps, they are repeatedly multiplied by the transpose of the recurrent weight matrix W_hh, scaled by the derivative of the nonlinearity σ (at most one for tanh or the logistic function). If the spectral radius of W_hh is less than one, gradients shrink exponentially with distance, making it extremely difficult in practice to learn dependencies spanning more than roughly 10-20 time steps. Conversely, if the spectral radius exceeds one, gradients can explode, destabilizing training. Gradient clipping mitigates explosion but does nothing for vanishing.
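The exponential shrinking and growth can be seen directly by repeatedly multiplying a gradient vector by W_hhᵀ. This sketch ignores the nonlinearity's derivative, which is at most one and so only shrinks gradients further; the matrices and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32
W = rng.normal(0, 1 / np.sqrt(H), (H, H))
rho = np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius of W

def backprop_norm(W_hh, steps):
    """Norm of a gradient vector after `steps` multiplications by W_hh^T."""
    g = np.ones(H)
    for _ in range(steps):
        g = W_hh.T @ g
    return np.linalg.norm(g)

W_small = W * (0.5 / rho)   # spectral radius 0.5 -> gradients vanish
W_large = W * (1.5 / rho)   # spectral radius 1.5 -> gradients explode
print(backprop_norm(W_small, 20))   # shrinks roughly like 0.5**20
print(backprop_norm(W_large, 20))   # grows roughly like 1.5**20
```

Clipping the exploding case amounts to rescaling `g` whenever its norm exceeds a threshold; no such local fix restores a gradient that has already vanished.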
Tomas Mikolov's 2010 paper demonstrated that even a simple single-layer RNN with 300-400 hidden units could substantially outperform state-of-the-art smoothed 5-gram models in perplexity and, crucially, in downstream speech recognition word error rate. This result was significant because it showed that the continuous-space generalization of neural models could overcome the sparsity limitations of n-gram models even with a relatively small network, igniting widespread interest in neural language modeling.
Training and Practical Considerations
Training RNN-LMs requires careful attention to several practical details. Learning rate scheduling, typically starting with a rate around 1.0 and halving it when validation perplexity stops improving, is critical for convergence. Dropout regularization, applied to the non-recurrent connections, helps prevent overfitting. The vocabulary is typically limited to 10,000-100,000 words, with rare words mapped to an UNK token. Batch processing requires sequences to be of equal length, typically achieved by splitting the corpus into fixed-length segments.
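The segmenting and scheduling details above can be sketched as follows. The function names and the toy corpus are illustrative; the batching trick (splitting the corpus into parallel streams, then cutting fixed-length segments whose targets are the inputs shifted by one token) is the standard one for truncated BPTT:

```python
import numpy as np

def batchify(token_ids, batch_size, seg_len):
    """Split a corpus of token ids into batch_size parallel streams, then
    yield (input, target) segments of fixed length for truncated BPTT.
    Targets are inputs shifted by one token; a trailing partial segment
    is dropped for simplicity."""
    n = (len(token_ids) // (batch_size * seg_len)) * batch_size * seg_len
    data = np.asarray(token_ids[:n]).reshape(batch_size, -1)
    for i in range(0, data.shape[1] - seg_len, seg_len):
        yield data[:, i:i + seg_len], data[:, i + 1:i + 1 + seg_len]

def next_lr(lr, val_ppl, best_ppl):
    """Halving schedule described above: halve the learning rate when
    validation perplexity stops improving."""
    return lr if val_ppl < best_ppl else lr / 2.0

corpus = list(range(130))   # toy corpus of token ids
for x, y in batchify(corpus, batch_size=2, seg_len=8):
    pass                    # a real loop would run BPTT on each (x, y) segment
print(x.shape, y.shape)     # (2, 8) (2, 8)
```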
RNN-LMs were progressively improved through larger hidden states, deeper architectures with multiple stacked layers, and the incorporation of gating mechanisms that led to LSTMs and GRUs. Zaremba et al. (2014) demonstrated that a properly regularized two-layer LSTM could achieve significant perplexity improvements over a simple RNN, establishing regularized LSTMs as the standard RNN-LM architecture. Despite being superseded by transformers for most benchmarks, RNN-LMs remain important as a conceptual bridge between count-based and attention-based approaches to language modeling.