
Smoothing Techniques

Smoothing techniques redistribute probability mass from observed n-grams to unseen ones, solving the zero-frequency problem that arises when maximum likelihood estimation assigns zero probability to any word sequence not encountered during training.


In n-gram language modeling, maximum likelihood estimation assigns zero probability to any n-gram not observed in the training corpus. Since natural language is productive and any sufficiently long sentence is likely novel, these zero probabilities are catastrophic: a single unseen bigram drives the probability of an entire sentence to zero. Smoothing techniques address this problem by redistributing some probability mass from observed n-grams to unobserved ones, ensuring that no word sequence receives zero probability. The choice of smoothing method has a substantial impact on language model performance.
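The zero-frequency problem described above is easy to reproduce. The following sketch (the toy corpus and helper function are illustrative, not from the article) shows a maximum likelihood bigram estimate returning exactly zero for an unseen pair:

```python
from collections import Counter

def mle_bigram_prob(corpus, w_prev, w):
    """Maximum likelihood estimate P(w | w_prev) from a tokenized corpus."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])  # every token that starts a bigram
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

corpus = "the cat sat on the mat".split()
print(mle_bigram_prob(corpus, "the", "cat"))  # 0.5: "the" is followed by "cat" once out of two occurrences
print(mle_bigram_prob(corpus, "the", "dog"))  # 0.0: one unseen bigram zeroes out an entire sentence
```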

Add-k Smoothing and Lidstone Estimation

Laplace: P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁, wᵢ) + 1) / (C(wᵢ₋₁) + |V|)

Lidstone: P(wᵢ | wᵢ₋₁) = (C(wᵢ₋₁, wᵢ) + α) / (C(wᵢ₋₁) + α|V|)

where α ∈ (0, 1] and |V| is the vocabulary size

The simplest smoothing method, Laplace (add-one) smoothing, adds one to every n-gram count before computing relative frequencies. While conceptually elegant, add-one smoothing is too aggressive for language modeling: it reassigns far too much probability mass from frequent n-grams to unseen ones. Lidstone smoothing generalizes this approach by adding a fractional count α, which can be tuned on held-out data. Even Lidstone smoothing, however, makes the unrealistic assumption that all unseen n-grams are equally likely.
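The two formulas above differ only in the added count, so they can share one implementation. A minimal sketch (the function name, toy corpus, and the choice to derive the vocabulary from the training data are illustrative; a real system would include an OOV token in |V|):

```python
from collections import Counter

def lidstone_prob(corpus, w_prev, w, alpha=1.0, vocab=None):
    """Lidstone-smoothed P(w | w_prev); alpha=1 recovers Laplace (add-one)."""
    if vocab is None:
        vocab = set(corpus)  # assumption: vocabulary = training types
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    return (bigrams[(w_prev, w)] + alpha) / (unigrams[w_prev] + alpha * len(vocab))

corpus = "the cat sat on the mat".split()
# |V| = 5, C(the) = 2, C(the, cat) = 1
print(lidstone_prob(corpus, "the", "cat", alpha=1.0))  # (1+1)/(2+5) = 2/7 ≈ 0.286
print(lidstone_prob(corpus, "the", "dog", alpha=0.1))  # (0+0.1)/(2+0.5) = 0.04, small but nonzero
```

Note how aggressive Laplace is even here: the observed bigram's estimate drops from the MLE value of 0.5 to 2/7, which is why α is usually tuned well below 1 on held-out data.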

Good-Turing Estimation

Good-Turing estimation, developed by I. J. Good and Alan Turing during their World War II cryptanalytic work, uses the frequency of frequencies to estimate the probability of unseen events. The adjusted count for an n-gram observed c times is c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of distinct n-grams that appear exactly c times. The total probability mass assigned to unseen n-grams equals N₁ / N, where N is the total number of observed n-gram tokens; that is, the fraction of observations that are singletons. Good-Turing estimation is more principled than add-k smoothing but requires additional techniques (such as Simple Good-Turing, which fits a smooth curve to the N_c values) to handle the noisy frequency-of-frequency counts at higher values of c.
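The adjusted-count formula can be sketched directly from a table of n-gram counts (the helper name, toy counts, and the fallback for missing N_{c+1} are illustrative; production systems smooth the N_c values first, as Simple Good-Turing does):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Return adjusted counts c* = (c+1) * N_{c+1} / N_c and the
    unseen-event mass N_1 / N (the fraction of singleton observations)."""
    freq_of_freq = Counter(ngram_counts.values())   # N_c: how many n-grams occur c times
    total = sum(ngram_counts.values())              # N: total observed tokens
    adjusted = {}
    for ngram, c in ngram_counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq[c + 1]
        # The raw estimate is undefined when N_{c+1} = 0; fall back to the
        # unadjusted count here (Simple Good-Turing handles this properly).
        adjusted[ngram] = (c + 1) * n_c1 / n_c if n_c1 else c
    unseen_mass = freq_of_freq[1] / total
    return adjusted, unseen_mass

counts = Counter({("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 1, ("d", "e"): 2})
adj, p0 = good_turing_counts(counts)
print(adj[("b", "c")])  # c=1: c* = 2 * N_2/N_1 = 2 * 1/2 = 1.0
print(p0)               # N_1 / N = 2/7, the mass reserved for unseen n-grams
```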

Chen and Goodman's Comparative Study

The landmark 1999 study by Stanley Chen and Joshua Goodman systematically compared smoothing methods across multiple corpora and n-gram orders. They found that modified Kneser-Ney smoothing consistently outperformed all other methods, including Good-Turing, Witten-Bell, and various interpolation schemes. This comprehensive empirical evaluation established modified Kneser-Ney as the gold standard for n-gram smoothing and remains one of the most cited papers in language modeling.

Discounting Methods

Absolute discounting subtracts a fixed discount d (typically between 0 and 1) from each nonzero n-gram count and redistributes the freed probability mass by interpolating with a lower-order distribution: P_abs(wᵢ | wᵢ₋₁) = max(C(wᵢ₋₁, wᵢ) − d, 0) / C(wᵢ₋₁) + λ(wᵢ₋₁) · P_lower(wᵢ), where P_lower is a lower-order (e.g., unigram) estimate. The interpolation weight λ(wᵢ₋₁) is chosen so that the distribution sums to one. Absolute discounting can be seen as a simplification of Good-Turing in which the adjustment is independent of the count magnitude, a reasonable approximation for counts above one.
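The formula above can be sketched as follows (the toy corpus, the discount d = 0.75, and the choice of an MLE unigram back-off are illustrative; λ is computed from the number of distinct continuations so the conditional distribution sums to one):

```python
from collections import Counter

def absolute_discount_prob(corpus, w_prev, w, d=0.75):
    """Absolute discounting interpolated with the unigram distribution.
    The raw-unigram back-off is a simplification; Kneser-Ney replaces it
    with a continuation-count distribution."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    c_prev = sum(c for (w1, _), c in bigrams.items() if w1 == w_prev)
    # Number of distinct words following w_prev: each one frees up mass d,
    # so lambda = d * n_types / c_prev makes the distribution sum to one.
    n_types = sum(1 for (w1, _) in bigrams if w1 == w_prev)
    lam = d * n_types / c_prev
    p_lower = unigrams[w] / len(corpus)
    return max(bigrams[(w_prev, w)] - d, 0) / c_prev + lam * p_lower

corpus = "the cat sat on the mat".split()
print(absolute_discount_prob(corpus, "the", "cat"))  # (1-0.75)/2 + 0.75*(1/6) = 0.25
print(absolute_discount_prob(corpus, "the", "sat"))  # unseen bigram: 0 + 0.75*(1/6) = 0.125
```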

All smoothing methods face the same fundamental tension: they must reserve enough probability mass for unseen events to avoid zero probabilities while not degrading the estimates for observed events. The most successful methods, particularly Kneser-Ney smoothing, combine discounting with interpolation and use lower-order distributions that are specifically designed to predict words in novel contexts rather than simply backing off to cruder estimates.
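The "novel contexts" idea behind Kneser-Ney's lower-order distribution can be sketched with continuation counts: a word is scored by how many distinct words it follows, not by its raw frequency (the helper name and toy corpus are illustrative):

```python
def continuation_prob(corpus, w):
    """Kneser-Ney-style continuation probability: the fraction of distinct
    bigram types that end in w. A word can be frequent overall yet follow
    very few distinct words (the classic example is 'Francisco'), in which
    case it is a poor guess in a novel context."""
    bigram_types = set(zip(corpus, corpus[1:]))
    ends_in_w = sum(1 for (_, w2) in bigram_types if w2 == w)
    return ends_in_w / len(bigram_types)

corpus = "the cat sat on the mat".split()
# Unigram frequency of 'the' is 2/6 ≈ 0.33, but it follows only one
# distinct word ('on'), so its continuation probability is lower: 1/5.
print(continuation_prob(corpus, "the"))  # 0.2
```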



References

  1. Chen, S. F., & Goodman, J. (1999). An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4), 359–394. doi:10.1006/csla.1999.0128
  2. Good, I. J. (1953). The population frequencies of species and the estimation of population parameters. Biometrika, 40(3–4), 237–264. doi:10.1093/biomet/40.3-4.237
  3. Gale, W. A., & Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2(3), 217–237. doi:10.1080/09296179508590051
  4. Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd ed. draft).
