An n-gram language model estimates the probability of each word in a sequence by conditioning on a fixed window of n-1 preceding words, thereby invoking an (n-1)-order Markov assumption. A unigram model (n=1) treats each word as independent; a bigram model (n=2) conditions on the immediately preceding word; a trigram model (n=3) conditions on the two preceding words. Despite their simplicity, n-gram models were the dominant approach to language modeling from the 1980s through the early 2010s, achieving strong practical results in speech recognition, machine translation, and spelling correction.
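Under the bigram Markov assumption, the probability of a whole sequence factors into a product of conditional probabilities, one per word. A minimal sketch with hand-set toy probabilities (the numbers and the `<s>`/`</s>` boundary markers are illustrative, not from real data):

```python
from math import prod

# Hypothetical bigram probabilities P(wᵢ | wᵢ₋₁); <s> and </s>
# mark sentence boundaries. Values are illustrative only.
bigram_p = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.1,
    ("cat", "sat"): 0.3,
    ("sat", "</s>"): 0.4,
}

def sentence_probability(words, p):
    """P(w₁..wₙ) = ∏ᵢ P(wᵢ | wᵢ₋₁) under the bigram Markov assumption.

    Assumes every needed bigram is in p (fine for this toy example).
    """
    tokens = ["<s>"] + words + ["</s>"]
    return prod(p[(prev, cur)] for prev, cur in zip(tokens, tokens[1:]))

print(sentence_probability(["the", "cat", "sat"], bigram_p))
# 0.6 * 0.1 * 0.3 * 0.4 ≈ 0.0072
```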
Estimation and the Markov Assumption

Under the Markov assumption, conditional probabilities are estimated from corpus counts C(·):

Trigram: P(wᵢ | wᵢ₋₂, wᵢ₋₁) = C(wᵢ₋₂, wᵢ₋₁, wᵢ) / C(wᵢ₋₂, wᵢ₋₁)
General: P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁) = C(wᵢ₋ₙ₊₁, ..., wᵢ) / C(wᵢ₋ₙ₊₁, ..., wᵢ₋₁)
The maximum likelihood estimates of n-gram probabilities are simply the relative frequencies of the n-grams in the training corpus. This simplicity is both the strength and the weakness of the approach. On one hand, estimation requires only counting; on the other, the number of possible n-grams grows as |V|ⁿ, where |V| is the vocabulary size, meaning that most n-grams will never be observed even in very large corpora. This data sparsity problem is the central challenge of n-gram modeling and motivates the development of smoothing techniques.
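Because the maximum likelihood estimate is just a ratio of counts, estimation fits in a few lines. A minimal bigram sketch (the `mle_bigram` helper and the `<s>`/`</s>` markers are my own illustrative choices):

```python
from collections import Counter

def mle_bigram(corpus):
    """Estimate P(wᵢ | wᵢ₋₁) = C(wᵢ₋₁, wᵢ) / C(wᵢ₋₁) from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])             # context counts C(wᵢ₋₁)
        bigrams.update(zip(tokens, tokens[1:]))  # pair counts C(wᵢ₋₁, wᵢ)
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
p = mle_bigram(corpus)
print(p[("<s>", "the")])  # 2/2 = 1.0
print(p[("the", "cat")])  # 1/2 = 0.5
```

Any bigram absent from the training corpus receives probability zero here, which is exactly the sparsity problem the surrounding text describes.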
Sparsity and the Bias-Variance Tradeoff
Higher-order n-gram models can capture longer-range dependencies but require exponentially more data to estimate reliably. A 5-gram model over a vocabulary of 50,000 words has 50,000⁵ = 3.125 × 10²³ possible parameters. Even web-scale corpora containing trillions of tokens cannot adequately cover this space. The bias-variance tradeoff is fundamental: lower-order models (high bias, low variance) miss important contextual patterns, while higher-order models (low bias, high variance) overfit to the training corpus unless appropriately regularized through smoothing or interpolation.
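The interpolation mentioned above mixes estimates of different orders, letting reliable low-order estimates back up sparse high-order ones. A minimal sketch of linear interpolation; the λ weights and the probability tables are illustrative, not tuned (in practice the λs are fit on held-out data):

```python
def interpolated_p(w, context, p3, p2, p1, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u, v) ≈ λ₃·P₃(w|u,v) + λ₂·P₂(w|v) + λ₁·P₁(w), with Σλ = 1.

    p3, p2, p1 map n-gram tuples / words to MLE probabilities; unseen
    n-grams contribute 0.0, so mass effectively shifts to lower orders.
    """
    u, v = context
    l3, l2, l1 = lambdas
    return (l3 * p3.get((u, v, w), 0.0)
            + l2 * p2.get((v, w), 0.0)
            + l1 * p1.get(w, 0.0))

# Hand-set toy MLE tables (illustrative only):
p3 = {("the", "cat", "sat"): 0.5}
p2 = {("cat", "sat"): 0.4}
p1 = {"sat": 0.05}
print(interpolated_p("sat", ("the", "cat"), p3, p2, p1))
# 0.6*0.5 + 0.3*0.4 + 0.1*0.05 = 0.425
```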
In 2006, Google released the Web 1T 5-gram corpus, containing n-gram counts computed from approximately one trillion tokens of web text. This resource demonstrated that scale partially compensates for sparsity: n-gram model performance continues to improve as the training set grows. Brants et al. (2007) showed that a 5-gram model with a deliberately simple, unnormalized backoff scheme ("Stupid Backoff"), trained on trillions of words, outperformed more sophisticated models trained on smaller data, illustrating the "unreasonable effectiveness of data" in language modeling.
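The scheme from Brants et al. (2007), known as Stupid Backoff, returns an unnormalized score rather than a probability: the relative frequency if the n-gram was seen, otherwise a penalized score from a shorter context. A minimal sketch over toy counts (the count values are illustrative only):

```python
def stupid_backoff(words, counts, alpha=0.4):
    """Unnormalized score S(w | context): use the relative frequency if
    the n-gram was observed, else back off to a shorter context with
    penalty alpha. Assumes context counts exist for any seen n-gram."""
    if len(words) == 1:
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        return counts.get(words, 0) / total
    if counts.get(words, 0) > 0:
        return counts[words] / counts[words[:-1]]
    return alpha * stupid_backoff(words[1:], counts, alpha)

# Toy counts keyed by n-gram tuples (illustrative only):
counts = {
    ("cat",): 2, ("sat",): 2, ("the",): 3,
    ("cat", "sat"): 1,
}
print(stupid_backoff(("the", "cat", "sat"), counts))
# trigram unseen, so back off: 0.4 * C(cat, sat)/C(cat) = 0.4 * 0.5 = 0.2
```

Skipping normalization is what makes the method cheap at web scale: scores can be computed from raw counts in a distributed store without summing over the vocabulary.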
Legacy and Transition to Neural Models
N-gram models dominated language modeling for decades because they are fast to train, easy to interpret, and integrate naturally into systems based on weighted finite-state transducers. Their success in speech recognition at IBM and elsewhere demonstrated that statistical approaches could outperform rule-based systems, catalyzing the empirical revolution in NLP. However, n-gram models have fundamental limitations: they cannot generalize to unseen contexts, they do not exploit similarities between words, and they are ultimately bounded by the Markov assumption.
The transition to neural language models beginning with Bengio et al. (2003) addressed these limitations by representing words as dense vectors and learning continuous functions over these representations. Neural models can generalize across similar contexts and capture long-range dependencies without an explicit Markov assumption. Nevertheless, n-gram models remain relevant as baselines, components in hybrid systems, and as the conceptual framework through which the language modeling task itself is defined.