Computational Linguistics

Unigram Language Model

The unigram language model tokenization method treats subword segmentation as a probabilistic model selection problem, starting from a large vocabulary and iteratively pruning tokens to maximize the marginal likelihood of the training corpus.

P(x) = ∏ᵢ P(xᵢ) where x = (x₁, ..., xₙ) is a segmentation

The unigram language model approach to subword tokenization, proposed by Kudo (2018), differs fundamentally from BPE and WordPiece. Rather than building a vocabulary bottom-up through iterative merges, it starts with a large initial vocabulary of candidate subword tokens and iteratively removes the tokens whose removal least reduces the corpus likelihood. The resulting vocabulary maximizes a unigram language model's marginal likelihood over all possible segmentations of the training corpus, yielding a principled, probabilistic tokenizer that naturally supports multiple segmentations of the same input.

The Unigram Model

Given a vocabulary V = {s₁, ..., sₖ} with token probabilities P(sᵢ):

For a segmentation x = (x₁, x₂, ..., xₙ) of sentence S:
P(x) = ∏ᵢ P(xᵢ) (unigram assumption)

Marginal likelihood of S, summing over every segmentation of S into tokens from V:
P(S) = Σ_{x ∈ Seg(S, V)} ∏ᵢ P(xᵢ)

Best segmentation: x* = argmax_x ∏ᵢ P(xᵢ)
Found via Viterbi algorithm in O(|S| × max_token_length)
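The Viterbi search above can be sketched in a few lines of Python. The vocabulary and its probabilities here are invented for illustration; a real tokenizer would use EM-estimated probabilities over a much larger vocabulary.

```python
import math

# Toy vocabulary with made-up probabilities (illustrative only; real values
# come from EM training over a corpus).
VOCAB = {
    "un": 0.05, "break": 0.04, "able": 0.04, "unbreak": 0.001,
    "u": 0.02, "n": 0.02, "b": 0.02, "r": 0.02, "e": 0.02,
    "a": 0.02, "k": 0.02, "l": 0.02,
}

def viterbi_segment(sentence, vocab, max_len=10):
    """Best segmentation x* = argmax_x Σᵢ log P(xᵢ), via dynamic programming."""
    n = len(sentence)
    best = [-math.inf] * (n + 1)   # best[j]: max log-prob of a segmentation of sentence[:j]
    back = [0] * (n + 1)           # back[j]: start index of the token ending at j
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            token = sentence[i:j]
            if token in vocab:
                score = best[i] + math.log(vocab[token])
                if score > best[j]:
                    best[j], back[j] = score, i
    tokens, j = [], n              # recover the token sequence from the backpointers
    while j > 0:
        tokens.append(sentence[back[j]:j])
        j = back[j]
    return tokens[::-1], best[n]

tokens, logp = viterbi_segment("unbreakable", VOCAB)
print(tokens)  # ['un', 'break', 'able']
```

The inner loop only inspects spans up to max_len characters, giving the O(|S| × max_token_length) cost stated above.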

Under the unigram model, each subword token has an independent probability, and the probability of a segmentation is the product of its token probabilities. The optimal segmentation for a given sentence can be found efficiently using the Viterbi algorithm on a lattice of all possible segmentations. The marginal likelihood — the sum over all possible segmentations — can be computed using the forward algorithm. During training, the EM algorithm alternates between computing the expected frequency of each token across all segmentations (E-step) and updating token probabilities to maximize likelihood (M-step).
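A minimal sketch of the forward pass and the E-step expected counts, again over a hypothetical five-token vocabulary:

```python
import math

VOCAB = {"ab": 0.3, "a": 0.2, "b": 0.1, "c": 0.05, "bc": 0.15}  # hypothetical probabilities

def forward(sentence, vocab, max_len=10):
    """Forward algorithm: alpha[j] = total probability of all segmentations of sentence[:j]."""
    n = len(sentence)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            token = sentence[i:j]
            if token in vocab:
                alpha[j] += alpha[i] * vocab[token]
    return alpha

def expected_counts(sentence, vocab, max_len=10):
    """E-step: expected frequency of each token over all segmentations (forward-backward)."""
    n = len(sentence)
    alpha = forward(sentence, vocab, max_len)
    beta = [0.0] * (n + 1)     # beta[i]: total probability of all segmentations of sentence[i:]
    beta[n] = 1.0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            token = sentence[i:j]
            if token in vocab:
                beta[i] += vocab[token] * beta[j]
    counts = {}
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            token = sentence[i:j]
            if token in vocab:
                # posterior probability that this span is used as a token
                counts[token] = counts.get(token, 0.0) + alpha[i] * vocab[token] * beta[j] / alpha[n]
    return counts

# "abc" has three segmentations: a|b|c, ab|c, a|bc
# P("abc") = 0.2*0.1*0.05 + 0.3*0.05 + 0.2*0.15 = 0.046
print(forward("abc", VOCAB)[-1])   # ≈ 0.046
print(expected_counts("abc", VOCAB))
```

The M-step then simply renormalizes these expected counts (accumulated over the whole corpus) into new token probabilities.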

Vocabulary Pruning

The training procedure begins with a large seed vocabulary, typically all substrings up to a maximum length that appear at least once in the training corpus. This vocabulary is then iteratively pruned: at each step, token probabilities are re-estimated with EM, and a fraction of tokens (typically 10-20%) is removed, selecting those whose removal causes the smallest decrease in corpus log-likelihood. Single characters are kept throughout so that every string remains segmentable. Pruning continues until the desired vocabulary size is reached. This top-down approach contrasts with BPE's bottom-up construction, and it produces vocabularies that are globally optimized for corpus likelihood rather than locally optimized for merge frequency.
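The pruning criterion can be illustrated directly: score each multi-character token by how much the corpus log-likelihood drops when it is removed, and discard the lowest-loss tokens. The corpus and probabilities below are toy values; a real implementation approximates the loss rather than recomputing each marginal exactly, and re-runs EM between pruning rounds.

```python
import math

CORPUS = ["abc", "ab", "bc"]                                    # toy corpus
VOCAB = {"ab": 0.3, "a": 0.2, "b": 0.1, "c": 0.05, "bc": 0.15}  # hypothetical probabilities

def marginal(sentence, vocab, max_len=10):
    """Forward algorithm: total probability of the sentence over all segmentations."""
    n = len(sentence)
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            token = sentence[i:j]
            if token in vocab:
                alpha[j] += alpha[i] * vocab[token]
    return alpha[n]

def corpus_loglik(vocab):
    return sum(math.log(marginal(s, vocab)) for s in CORPUS)

def prune_step(vocab, n_drop=1):
    """Remove the n_drop multi-character tokens whose removal hurts likelihood least."""
    base = corpus_loglik(vocab)
    losses = []
    for tok in vocab:
        if len(tok) == 1:
            continue  # keep single characters so every string stays segmentable
        reduced = {t: p for t, p in vocab.items() if t != tok}
        losses.append((base - corpus_loglik(reduced), tok))
    losses.sort()  # smallest likelihood loss first
    dropped = {tok for _, tok in losses[:n_drop]}
    # (a real trainer would renormalize and re-run EM after each round)
    return {t: p for t, p in vocab.items() if t not in dropped}

pruned = prune_step(VOCAB)
print(sorted(pruned))  # ['a', 'b', 'bc', 'c'] — "ab" is cheapest to remove here
```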

Subword Regularization

A unique advantage of the unigram model is that it naturally supports subword regularization: during training, instead of always using the single best segmentation, the model samples from the distribution over all possible segmentations weighted by their probability. For example, "internationalization" might sometimes be segmented as ["international", "ization"] and other times as ["inter", "national", "iza", "tion"]. This stochastic segmentation acts as a data augmentation technique, exposing the neural model to diverse representations of each word and improving robustness. Kudo (2018) showed that subword regularization consistently improves translation quality by 0.5-1.0 BLEU points.
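Sampling a segmentation in proportion to its probability can be done with forward filtering followed by backward sampling, sketched here on the same hypothetical toy vocabulary:

```python
import math
import random

VOCAB = {"ab": 0.3, "a": 0.2, "b": 0.1, "c": 0.05, "bc": 0.15}  # hypothetical probabilities

def sample_segmentation(sentence, vocab, rng, max_len=10):
    """Draw a segmentation x with probability proportional to ∏ᵢ P(xᵢ)."""
    n = len(sentence)
    alpha = [0.0] * (n + 1)        # forward pass: alpha[j] sums all segmentations of sentence[:j]
    alpha[0] = 1.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            token = sentence[i:j]
            if token in vocab:
                alpha[j] += alpha[i] * vocab[token]
    tokens, j = [], n              # backward pass: sample the token ending at each position
    while j > 0:
        spans = [(i, sentence[i:j]) for i in range(max(0, j - max_len), j)
                 if sentence[i:j] in vocab]
        weights = [alpha[i] * vocab[t] for i, t in spans]
        i, t = rng.choices(spans, weights=weights)[0]
        tokens.append(t)
        j = i
    return tokens[::-1]

rng = random.Random(0)
samples = [tuple(sample_segmentation("abc", VOCAB, rng)) for _ in range(2000)]
# posterior over segmentations of "abc": a|bc ≈ 0.652, ab|c ≈ 0.326, a|b|c ≈ 0.022
```

In SentencePiece, this behavior is exposed through the encoder's sampling options (enable_sampling with a smoothing parameter alpha), so no custom lattice code is needed in practice.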

Comparison and Practical Usage

Empirical comparisons between unigram, BPE, and WordPiece tokenization show that the differences are often modest for well-resourced languages, with all three methods producing competitive downstream performance. However, the unigram model has theoretical advantages: it provides a principled probabilistic framework, supports natural segmentation ambiguity, and enables subword regularization. BPE's advantages are simplicity and determinism. In practice, the choice is often made based on framework compatibility — BPE is the default in fairseq and Hugging Face, while the unigram model is available through SentencePiece.

The unigram model's probabilistic nature also makes it better suited for certain analytical tasks. The entropy of the segmentation distribution provides a measure of tokenization ambiguity for each word: morphologically transparent words like "un+break+able" have low segmentation entropy (one segmentation dominates), while ambiguous cases have higher entropy. This information can be used to study the relationship between tokenization and morphological structure, or to identify words that may benefit from explicit morphological preprocessing.
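Segmentation entropy can be computed by enumerating the lattice (feasible for short words) and normalizing the segmentation probabilities; the vocabulary below is again hypothetical.

```python
import math

VOCAB = {"ab": 0.3, "a": 0.2, "b": 0.1, "c": 0.05, "bc": 0.15}  # hypothetical probabilities

def all_segmentations(sentence, vocab, max_len=10):
    """Enumerate (tokens, probability) for every segmentation; exponential in the worst case."""
    if not sentence:
        return [([], 1.0)]
    results = []
    for j in range(1, min(len(sentence), max_len) + 1):
        token = sentence[:j]
        if token in vocab:
            for rest, p in all_segmentations(sentence[j:], vocab, max_len):
                results.append(([token] + rest, vocab[token] * p))
    return results

def segmentation_entropy(sentence, vocab):
    """Shannon entropy (bits) of the normalized distribution over segmentations."""
    segs = all_segmentations(sentence, vocab)
    z = sum(p for _, p in segs)
    return -sum((p / z) * math.log2(p / z) for _, p in segs)

# "abc": three segmentations with posterior ≈ (0.652, 0.326, 0.022) → entropy ≈ 1.05 bits
# "c":   a single segmentation → entropy 0 (no ambiguity)
print(segmentation_entropy("abc", VOCAB))
print(segmentation_entropy("c", VOCAB))
```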

References

  1. Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates. Proceedings of the 56th Annual Meeting of the ACL, 66–75. doi:10.18653/v1/P18-1007
  2. Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. Proceedings of the 2018 Conference on EMNLP: System Demonstrations, 66–71. doi:10.18653/v1/D18-2012
  3. Bostrom, K., & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Findings of the ACL: EMNLP 2020, 4617–4624. doi:10.18653/v1/2020.findings-emnlp.414
