FastText

FastText extends Word2Vec by representing each word as a bag of character n-grams, enabling the model to generate embeddings for out-of-vocabulary words and capture subword morphological regularities.

s(w, c) = Σ_{g ∈ G_w} z_g^T v_c

FastText, developed by Bojanowski, Grave, Joulin, and Mikolov at Facebook AI Research in 2017, extends the Skip-gram model of Word2Vec by incorporating subword information. Instead of learning a single vector for each word, FastText represents each word as the sum of vectors for its constituent character n-grams. This design choice enables the model to compute embeddings for words not seen during training (out-of-vocabulary words) and to share statistical information between morphologically related words, making it particularly effective for morphologically rich languages.

Subword Representation

FastText Scoring Function

Each word w is represented as a set of character n-grams G_w

For "where" with n = 3: {<wh, whe, her, ere, re>, <where>}

Scoring function: s(w, c) = Σ_{g ∈ G_w} z_g^T v_c

Word vector: v_w = Σ_{g ∈ G_w} z_g

Trained with Skip-gram negative sampling objective
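The scoring and word-vector formulas above can be sketched numerically. The n-gram vector table z and the context vector v_c below are hypothetical toy values, not trained parameters:

```python
import numpy as np

# Toy n-gram vector table z_g and context vector v_c (random values,
# standing in for trained parameters).
rng = np.random.default_rng(0)
dim = 4
ngrams = ["<wh", "whe", "her", "ere", "re>", "<where>"]
z = {g: rng.normal(size=dim) for g in ngrams}
v_c = rng.normal(size=dim)

# Scoring function: s(w, c) = sum over g in G_w of z_g^T v_c
s = sum(z[g] @ v_c for g in ngrams)

# Word vector: v_w = sum over g in G_w of z_g
v_w = sum(z[g] for g in ngrams)

# The two formulations agree: summing dot products equals dotting the sum.
assert np.isclose(s, v_w @ v_c)
```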

FastText augments each word with special boundary symbols (< and >) and then extracts all character n-grams of length 3 to 6 (by default), plus the whole word itself as a special token. The word's representation is the sum of the vectors of all its n-grams. This means that morphologically related words like "teach," "teacher," and "teaching" automatically share n-gram vectors (e.g., "teac," "each"), allowing the model to learn morphological regularities without explicit morphological analysis.
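The n-gram extraction described above can be sketched in a few lines. The function name and defaults are illustrative; the length range 3 to 6 and the boundary symbols follow the description in the text:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Extract the character n-grams of a word, FastText-style:
    add < and > boundary symbols, take all n-grams of length nmin..nmax,
    and include the whole bracketed word as a special sequence."""
    w = f"<{word}>"
    grams = set()
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # the whole word as its own token
    return grams

# Morphologically related words share n-grams, e.g. "teac" and "each":
shared = char_ngrams("teach") & char_ngrams("teacher")
```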

Handling Out-of-Vocabulary Words

A critical advantage of FastText over Word2Vec is its ability to produce vectors for previously unseen words. When encountering an unknown word at test time, FastText decomposes it into its character n-grams (which were seen during training) and sums their vectors. This capability is essential for processing morphologically rich languages like Finnish, Turkish, and Arabic, where the number of distinct word forms can be extremely large. It also handles misspellings, neologisms, and domain-specific terminology gracefully.
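The OOV mechanism amounts to summing whichever of the unseen word's n-gram vectors were learned during training. A minimal sketch, assuming a hypothetical `{ngram: vector}` table in place of a trained model:

```python
import numpy as np

def oov_vector(word, ngram_vectors, nmin=3, nmax=6, dim=4):
    """Compute a vector for a possibly unseen word by summing the
    vectors of its character n-grams, skipping n-grams that were never
    seen in training. `ngram_vectors` is a hypothetical trained table."""
    w = f"<{word}>"
    v = np.zeros(dim)
    for n in range(nmin, nmax + 1):
        for i in range(len(w) - n + 1):
            g = w[i:i + n]
            if g in ngram_vectors:
                v += ngram_vectors[g]
    return v

# Toy table containing a single trained n-gram vector.
table = {"<te": np.ones(4)}
v = oov_vector("teach", table)   # picks up the shared "<te" n-gram
```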

FastText for Language Identification and Classification

Beyond word embeddings, the FastText library includes a highly efficient text classification module. The classification model uses bag-of-words and bag-of-n-gram features fed through a single hidden layer, trained with a softmax output. Despite its simplicity, FastText classification achieves accuracy competitive with deep learning methods while training orders of magnitude faster. The library provides pre-trained models for 157 languages and has been widely adopted for language identification, sentiment analysis, and intent classification in production systems.
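The classifier architecture described above (averaged bag-of-words features, one linear layer, softmax) can be sketched as follows. The vocabulary, dimensions, and weights are hypothetical toy values, not a trained model:

```python
import numpy as np

# Toy setup: a 3-word vocabulary, 2 classes, random (untrained) weights.
vocab = {"good": 0, "bad": 1, "movie": 2}
n_classes, dim = 2, 4
rng = np.random.default_rng(1)
E = rng.normal(size=(len(vocab), dim))  # word embedding table
W = rng.normal(size=(dim, n_classes))   # output (softmax) layer

def predict_proba(tokens):
    """Average the word vectors into a hidden representation, then
    apply a linear layer and softmax to get class probabilities."""
    ids = [vocab[t] for t in tokens if t in vocab]
    hidden = E[ids].mean(axis=0)
    logits = hidden @ W
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

p = predict_proba(["good", "movie"])
```

In training, E and W would be learned jointly by gradient descent on the softmax loss; the sketch shows only the forward pass.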

Impact and Comparisons

Empirical comparisons show that FastText consistently outperforms Word2Vec on word similarity tasks for morphologically rich languages (German, Czech, Arabic) and on syntactic analogy tasks that require morphological sensitivity. For English and other analytic languages, the improvement over Word2Vec is smaller but still present for rare words. FastText's subword approach also makes it more robust to noise and typos in the input text, an important practical advantage for processing user-generated content.

FastText's subword approach influenced subsequent models. The Byte Pair Encoding (BPE) tokenization used in modern Transformer models can be seen as a learned variant of character n-gram segmentation. SentencePiece and WordPiece tokenizers similarly decompose words into subword units, and the pre-training objectives of BERT and GPT can be viewed as context-sensitive extensions of the distributional learning that FastText performs at the subword level.

References

  1. Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146. doi:10.1162/tacl_a_00051
  2. Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the ACL (EACL) (pp. 427–431). doi:10.18653/v1/E17-2068
  3. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of LREC (pp. 3483–3487).