Byte Pair Encoding

Byte Pair Encoding is a subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs in a corpus to build a vocabulary of variable-length tokens, providing an effective compromise between character-level and word-level representations.

merge(a, b) → ab, where (a, b) is the most frequent adjacent pair

Byte Pair Encoding (BPE), originally a data compression algorithm introduced by Gage (1994), was adapted for NLP by Sennrich, Haddow, and Birch (2016) as a subword tokenization method for neural machine translation. The algorithm starts with a vocabulary of individual characters and iteratively merges the most frequent pair of adjacent tokens in the training corpus, adding the merged token to the vocabulary. After a specified number of merge operations, the resulting vocabulary contains a mix of characters, common subwords, and frequent whole words. BPE has become the standard tokenization method for modern language models, used in GPT, RoBERTa, and many other systems.

The BPE Algorithm

Byte Pair Encoding Algorithm

Input: corpus C, desired vocabulary size V
Initialize vocabulary with all characters in C

Repeat until |vocabulary| = V:
  1. Count all adjacent token pairs in C
  2. Find most frequent pair (a, b)
  3. Merge all occurrences of (a, b) → ab
  4. Add ab to vocabulary

Example merge sequence:
l o w → lo w → low
l o w e r → lo w e r → low e r → low er
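
The loop above can be sketched in Python. This is an illustrative implementation over a word-frequency dictionary (the function name `learn_bpe` and the toy corpus are assumptions for this sketch, not from any particular library); production tokenizers use optimized pair counting, but the logic is the same:

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn BPE merges from a dict {word: frequency}.
    Each word is represented as a tuple of symbols (initially characters)."""
    vocab = {tuple(word): freq for word, freq in corpus_words.items()}
    merges = []
    for _ in range(num_merges):
        # 1. Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # 2. Find the most frequent pair
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # 3. Merge all occurrences of the pair in every word
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# Toy corpus (hypothetical frequencies); the learned merges include
# ('l', 'o') and then ('lo', 'w'), matching the merge sequence above.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 10)
```

Note that each merge is recorded in order: the merge list, not just the final vocabulary, is what a BPE tokenizer later applies to new text.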

The key insight of BPE for NLP is that it provides an automatic, data-driven way to build a subword vocabulary that captures the statistical regularities of a language. Common words are represented as single tokens, while rare words are decomposed into frequent subword units. This addresses the open-vocabulary problem — the inability of word-level models to handle words not seen during training — because any word can be represented as a sequence of known subword tokens, down to individual characters in the worst case.
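
How a learned merge list segments words, including unseen ones, can be sketched as follows (the helper `apply_bpe` and the merge table are hypothetical, for illustration):

```python
def apply_bpe(word, merges):
    """Segment a word by replaying a learned merge list in order."""
    symbols = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table learned from some corpus
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

print(apply_bpe("lower", merges))   # → ['lower'] (known word, single token)
print(apply_bpe("lowest", merges))  # → ['low', 'e', 's', 't'] (unseen word
                                    #   falls back to subwords and characters)
```

The second call illustrates the open-vocabulary property: "lowest" never triggers the full-word merge, but it still decomposes into known tokens rather than an unknown-word symbol.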

BPE in Neural Machine Translation

Sennrich et al. (2016) demonstrated that BPE-based subword segmentation substantially improved neural machine translation, particularly for morphologically rich languages and rare words. By decomposing rare words into familiar subword units, BPE allows the translation model to generalize: even if "Sonnenblume" (sunflower) is rare, the model may have learned useful representations for "Sonnen" (sun) and "blume" (flower). Joint BPE, where the same merge operations are learned from the concatenation of source and target language corpora, enables cross-lingual subword sharing between related languages.

BPE vs. Morpheme Segmentation

BPE segments correlate with but do not strictly correspond to linguistic morphemes. A BPE tokenizer might split "unhappiness" into "un," "happi," and "ness" — close to the morphological parse — but it might also produce "unh," "app," "iness" depending on corpus statistics. Bostrom and Durrett (2020) systematically compared BPE tokenizations with morphological gold standards and found that BPE recovers morpheme boundaries with moderate accuracy (~60-70%), with performance varying by language. Whether the discrepancy matters for downstream task performance remains debated: BPE's statistical optimization may capture regularities that are relevant for prediction even if they do not align with linguistic morphemes.

Variants and Improvements

Several variants of BPE have been proposed. BPE-dropout (Provilkov et al., 2020) introduces stochasticity during training by randomly skipping some merge operations, producing multiple possible segmentations for each word. This acts as a regularization technique, improving model robustness and translation quality. Byte-level BPE, used in GPT-2 and GPT-3, operates on raw bytes rather than Unicode characters, ensuring that any text can be tokenized without unknown tokens. SentencePiece provides a language-independent implementation of BPE that operates directly on raw text without requiring pre-tokenization into words.
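
The BPE-dropout idea can be sketched as follows (a minimal illustration with a hypothetical `apply_bpe_dropout` helper, not the authors' implementation): each eligible merge is skipped independently with probability p at segmentation time, so p = 0 recovers deterministic BPE and p = 1 falls back to characters:

```python
import random

def apply_bpe_dropout(word, merges, p=0.1, rng=random):
    """BPE segmentation with dropout: each individual merge is skipped
    with probability p, yielding varied segmentations of the same word."""
    symbols = list(word)
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if (i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b
                    and rng.random() >= p):  # skip this merge with prob. p
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table; repeated calls with p > 0 can segment the
# same word differently, which is the regularization effect.
merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
rng = random.Random(0)
samples = {tuple(apply_bpe_dropout("lower", merges, p=0.3, rng=rng))
           for _ in range(20)}
```

Every sampled segmentation still concatenates back to the original word; only the token boundaries vary.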

The choice of vocabulary size in BPE involves a tradeoff. Smaller vocabularies produce longer token sequences (increasing computational cost) but allow better parameter sharing across related words. Larger vocabularies produce shorter sequences but may overfit to the training corpus and handle rare words poorly. Typical vocabulary sizes range from 8,000 to 50,000 tokens depending on the application, with multilingual models generally requiring larger vocabularies to cover multiple languages.

References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the ACL, 1715–1725. doi:10.18653/v1/P16-1162
  2. Gage, P. (1994). A new algorithm for data compression. The C Users Journal, 12(2), 23–38.
  3. Provilkov, I., Emelianenko, D., & Voita, E. (2020). BPE-dropout: Simple and effective subword regularization. Proceedings of the 58th Annual Meeting of the ACL, 1882–1892. doi:10.18653/v1/2020.acl-main.170
  4. Bostrom, K., & Durrett, G. (2020). Byte pair encoding is suboptimal for language model pretraining. Findings of the Association for Computational Linguistics: EMNLP 2020.