
N-Gram Language Models

N-gram language models approximate the probability of a word given its entire history by conditioning only on the previous n-1 words, applying a Markov assumption that makes estimation tractable from finite corpora.

P(wᵢ | w₁,...,wᵢ₋₁) ≈ P(wᵢ | wᵢ₋ₙ₊₁,...,wᵢ₋₁)

An n-gram language model estimates the probability of each word in a sequence by conditioning on a fixed window of the n-1 preceding words, thereby invoking an (n-1)th-order Markov assumption. A unigram model (n=1) treats each word as independent; a bigram model (n=2) conditions on the immediately preceding word; a trigram model (n=3) conditions on the two preceding words. Despite their simplicity, n-gram models were the dominant approach to language modeling from the 1980s through the early 2010s, achieving strong practical results in speech recognition, machine translation, and spelling correction.
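Concretely, a bigram model scores a whole sentence as a product of adjacent-word conditionals, with boundary markers ⟨s⟩ and ⟨/s⟩ padding the sequence. A minimal sketch (the probability values here are invented for illustration, not estimated from any corpus):

```python
# Hypothetical bigram probabilities P(w_i | w_{i-1}); "<s>" and "</s>"
# mark the sentence boundaries. All values are illustrative only.
bigram_p = {
    ("<s>", "the"): 0.6,
    ("the", "cat"): 0.1,
    ("cat", "sat"): 0.3,
    ("sat", "</s>"): 0.5,
}

def sentence_prob(words, p):
    """P(sentence) = product over i of P(w_i | w_{i-1}) under a bigram model."""
    tokens = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob *= p[(prev, cur)]
    return prob

print(sentence_prob(["the", "cat", "sat"], bigram_p))  # 0.6 * 0.1 * 0.3 * 0.5 ≈ 0.009
```

In practice these products are computed in log space to avoid underflow on long sentences.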

Estimation and the Markov Assumption

N-Gram Probability Estimation

Bigram: P(wᵢ | wᵢ₋₁) = C(wᵢ₋₁, wᵢ) / C(wᵢ₋₁)

Trigram: P(wᵢ | wᵢ₋₂, wᵢ₋₁) = C(wᵢ₋₂, wᵢ₋₁, wᵢ) / C(wᵢ₋₂, wᵢ₋₁)

General: P(wᵢ | wᵢ₋ₙ₊₁, ..., wᵢ₋₁) = C(wᵢ₋ₙ₊₁, ..., wᵢ) / C(wᵢ₋ₙ₊₁, ..., wᵢ₋₁)

The maximum likelihood estimates of n-gram probabilities are simply the relative frequencies of the n-grams in the training corpus. This simplicity is both the strength and the weakness of the approach. On one hand, estimation requires only counting; on the other, the number of possible n-grams grows as |V|ⁿ, where |V| is the vocabulary size, meaning that most n-grams will never be observed even in very large corpora. This data sparsity problem is the central challenge of n-gram modeling and motivates the development of smoothing techniques.
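Because the MLE is just relative frequency, estimation reduces to maintaining two count tables and dividing. A minimal sketch of bigram estimation (the three-sentence toy corpus is invented for illustration):

```python
from collections import Counter

def mle_bigram_probs(sentences):
    """Estimate P(w_i | w_{i-1}) as C(w_{i-1}, w_i) / C(w_{i-1})."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

corpus = ["the cat sat", "the cat ran", "the dog sat"]
p = mle_bigram_probs(corpus)
print(p[("the", "cat")])  # C(the, cat) / C(the) = 2/3
```

Note that any bigram absent from the training corpus receives probability zero under this estimator, which is exactly the sparsity problem described above.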

Sparsity and the Bias-Variance Tradeoff

Higher-order n-gram models can capture longer-range dependencies but require exponentially more data to estimate reliably. A 5-gram model over a vocabulary of 50,000 words has 50,000⁵ = 3.125 × 10²³ possible parameters. Even web-scale corpora containing trillions of tokens cannot adequately cover this space. The bias-variance tradeoff is fundamental: lower-order models (high bias, low variance) miss important contextual patterns, while higher-order models (low bias, high variance) overfit to the training corpus unless appropriately regularized through smoothing or interpolation.
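Add-1 (Laplace) smoothing is the simplest such regularizer: it adds one pseudo-count to every possible bigram, so P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁, wᵢ) + 1) / (C(wᵢ₋₁) + |V|), and no event receives probability zero. A minimal sketch that also computes perplexity on a held-out sentence (corpus and test sentence are invented):

```python
import math
from collections import Counter

def train_counts(sentences):
    """Collect bigram and context counts plus the vocabulary."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def laplace_prob(prev, cur, bigrams, contexts, vocab):
    """Add-1 smoothed estimate: (C(prev, cur) + 1) / (C(prev) + |V|)."""
    return (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))

def perplexity(sentence, bigrams, contexts, vocab):
    """Perplexity = exp of the negative average log-probability per predicted token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_p = sum(math.log(laplace_prob(p, c, bigrams, contexts, vocab))
                for p, c in zip(tokens, tokens[1:]))
    return math.exp(-log_p / (len(tokens) - 1))

bigrams, contexts, vocab = train_counts(["the cat sat", "the dog sat"])
print(perplexity("the cat ran", bigrams, contexts, vocab))
```

One simplification to flag: a word entirely outside the training vocabulary (like "ran" here) would normally be mapped to a dedicated ⟨unk⟩ token before scoring; that step is omitted for brevity. Add-1 smoothing is also known to be a weak baseline in practice, with interpolated Kneser-Ney generally preferred (Chen & Goodman, 1999).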

Google N-Gram Corpus

In 2006, Google released the Web 1T 5-gram corpus, containing n-gram counts derived from approximately one trillion tokens of web text. This resource demonstrated that scale partially compensates for sparsity. Brants et al. (2007) showed that a simple smoothed 5-gram model trained on trillions of words outperformed more sophisticated models trained on smaller data, illustrating the "unreasonable effectiveness of data" in language modeling.

Legacy and Transition to Neural Models

N-gram models dominated language modeling for decades because they are fast to train, easy to interpret, and integrate naturally into systems based on weighted finite-state transducers. Their success in speech recognition at IBM and elsewhere demonstrated that statistical approaches could outperform rule-based systems, catalyzing the empirical revolution in NLP. However, n-gram models have fundamental limitations: they cannot generalize to unseen contexts, they do not exploit similarities between words, and they are ultimately bounded by the Markov assumption.

The transition to neural language models beginning with Bengio et al. (2003) addressed these limitations by representing words as dense vectors and learning continuous functions over these representations. Neural models can generalize across similar contexts and capture long-range dependencies without an explicit Markov assumption. Nevertheless, n-gram models remain relevant as baselines, components in hybrid systems, and as the conceptual framework through which the language modeling task itself is defined.



References

  1. Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft).
  2. Brants, T., Popat, A. C., Xu, P., Och, F. J., & Dean, J. (2007). Large language models in machine translation. Proceedings of EMNLP-CoNLL, 858–867.
  3. Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137–1155.
  4. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–394. doi:10.1006/csla.1999.0128
