Computational Linguistics

Multilingual NMT

Multilingual neural machine translation trains a single model to translate between multiple language pairs simultaneously, enabling positive transfer across related languages and providing a scalable path to supporting hundreds of translation directions.

P(y | x, l_tgt; θ) — single θ shared across all language pairs

Multilingual NMT consolidates multiple bilingual translation models into a single model that handles many language pairs. The simplest approach, proposed by Johnson et al. (2017), prepends a target-language token to the source sentence (e.g., "<2en>" for translation into English) and trains on the concatenation of parallel data from all language pairs. Despite its simplicity, this approach achieves competitive or superior performance compared to bilingual baselines, particularly for low-resource language pairs that benefit from transfer learning from related high-resource pairs.
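The tagging scheme can be sketched in a few lines. This is a minimal illustration of the Johnson et al. (2017) convention, not their actual preprocessing code; the function name and example sentences are invented for the sketch.

```python
def tag_source(tokens, tgt_lang):
    """Prepend a target-language token in the "<2xx>" style of
    Johnson et al. (2017). The model learns to condition its output
    language on this tag; no source-language tag is needed."""
    return [f"<2{tgt_lang}>"] + tokens

# A mixed training corpus: every example carries its own target tag,
# so one model can be trained on the concatenation of all pairs.
corpus = [
    (tag_source(["Guten", "Morgen"], "en"), ["Good", "morning"]),
    (tag_source(["Good", "morning"], "fr"), ["Bonjour"]),
]
```

Because the tag is just another vocabulary item, no architectural change to the encoder-decoder is required.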

Architecture and Training

Multilingual NMT Input: [<2tgt>] x₁ x₂ ... x_S (source with target language tag)
P(y | x, l_tgt; θ) = ∏_t P(y_t | y_{<t}, x, l_tgt; θ)
Training data: D = ∪_{(l_s, l_t)} D_{l_s→l_t}
Objective: θ* = argmax_θ Σ_{(l_s,l_t)} Σ_{(x,y)∈D_{l_s→l_t}} log P(y | x, l_tgt; θ)

Temperature sampling for data balancing:
p(l) ∝ |D_l|^{1/T} (T > 1 upsamples low-resource)

Training a multilingual model requires careful data balancing. If language pairs are sampled proportionally to their data size, high-resource pairs dominate training and low-resource pairs are underfit. Temperature-based sampling (Arivazhagan et al., 2019) raises the sampling probabilities to a power of 1/T, where T > 1 flattens the distribution to give low-resource pairs more training time. This creates a tradeoff: higher temperature improves low-resource translation but may degrade high-resource performance, a phenomenon known as the "curse of multilinguality."
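The sampling rule p(l) ∝ |D_l|^{1/T} can be computed directly. A minimal sketch, with invented corpus sizes chosen to mimic a high-resource/low-resource imbalance:

```python
def sampling_probs(sizes, T=1.0):
    """Temperature-based data sampling (Arivazhagan et al., 2019):
    p(l) ∝ |D_l|^(1/T). T=1 samples proportionally to data size;
    larger T flattens the distribution toward uniform."""
    weights = {pair: n ** (1.0 / T) for pair, n in sizes.items()}
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}

# Illustrative sizes: 40M en-fr sentence pairs vs. 100k en-gu.
sizes = {"en-fr": 40_000_000, "en-gu": 100_000}
proportional = sampling_probs(sizes, T=1.0)  # en-gu barely sampled
flattened = sampling_probs(sizes, T=5.0)     # en-gu upsampled
```

With T=1 the low-resource pair gets roughly 0.25% of training steps; at T=5 its share rises above 20%, which is exactly the tradeoff the text describes.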

Zero-Shot Translation

A remarkable property of multilingual NMT is zero-shot translation: the ability to translate between language pairs not seen during training. If the model is trained on English-French and English-German parallel data, it can translate directly from French to German without ever having seen French-German parallel data. This capability emerges because the model learns language-agnostic internal representations — a shared semantic space that bridges languages. However, zero-shot translation quality is typically lower than supervised translation and can suffer from off-target translation (generating output in the wrong language).
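Which directions are zero-shot follows mechanically from the training pairs. A small sketch (the function is illustrative, not part of any cited system) for the English-centric setup described above:

```python
from itertools import permutations

def zero_shot_directions(trained_pairs):
    """List the translation directions a multilingual model can only
    attempt zero-shot: all ordered language pairs over the languages
    it has seen, minus the directions with supervised parallel data."""
    langs = {lang for pair in trained_pairs for lang in pair}
    return sorted(set(permutations(langs, 2)) - set(trained_pairs))

# English-centric training as in the text: en<->fr and en<->de.
trained = [("en", "fr"), ("fr", "en"), ("en", "de"), ("de", "en")]
zero_shot = zero_shot_directions(trained)  # [('de', 'fr'), ('fr', 'de')]
```

With N languages and English-centric data, the model sees 2(N−1) supervised directions but can attempt all N(N−1), which is why zero-shot quality matters so much at scale.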

Massively Multilingual Models

Research at Google and Facebook has scaled multilingual NMT to hundreds of languages. The M2M-100 model (Fan et al., 2021) supports direct translation between 100 languages (9,900 directions) without relying on English as a pivot. The NLLB-200 model (NLLB Team, 2022) extends coverage to 200 languages, including many low-resource and endangered languages. These models demonstrate that multilingual NMT can serve as a practical tool for bridging the digital language divide, though quality varies significantly across language pairs and domains.

Language-Specific and Shared Components

The tension between parameter sharing (for transfer) and language-specific capacity (for avoiding interference) is a central design question. Approaches include language-specific encoder/decoder layers, language-specific attention heads, adapter modules that add small language-specific parameter sets, and mixture-of-experts architectures that route different languages through different expert networks. These methods attempt to capture the benefits of multilinguality — positive transfer, parameter efficiency, zero-shot capability — while mitigating the capacity bottleneck that arises when too many languages compete for limited model parameters.
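The adapter idea can be illustrated with a toy bottleneck module. This is a sketch under assumed dimensions (d_model=512, bottleneck 64), not the implementation of any particular system; the class name and initialization are invented for illustration.

```python
import numpy as np

class LanguageAdapter:
    """Bottleneck adapter sketch: a small per-language module inserted
    into an otherwise fully shared network. Down-projection, ReLU,
    up-projection, residual connection."""

    def __init__(self, d_model=512, d_bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, h):
        # Shared representation plus a small language-specific correction.
        return h + np.maximum(h @ self.down, 0.0) @ self.up

# One tiny adapter per language; the transformer layers stay shared,
# so adding a language costs only d_model * d_bottleneck * 2 parameters.
adapters = {lang: LanguageAdapter() for lang in ["fr", "de", "gu"]}
h = np.zeros((1, 512))            # stand-in for an encoder hidden state
out = adapters["fr"](h)           # routed through the French adapter
```

The design choice is the one the paragraph describes: shared parameters carry cross-lingual transfer, while the cheap per-language adapters absorb interference.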

Multilingual NMT represents a shift in how we think about machine translation: from a collection of independent bilingual systems to a unified model of multilingual communication. This vision connects to broader goals in NLP, including universal language representations, cross-lingual transfer for arbitrary NLP tasks, and the development of AI systems that work across all human languages rather than privileging a small number of well-resourced ones.

References

  1. Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., ... & Dean, J. (2017). Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the ACL, 5, 339–351. doi:10.1162/tacl_a_00065
  2. Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., ... & Wu, Y. (2019). Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv:1907.05019. doi:10.48550/arXiv.1907.05019
  3. Fan, A., Bhosale, S., Schwenk, H., Ma, Z., El-Kishky, A., Goyal, S., ... & Joulin, A. (2021). Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107), 1–48. jmlr.org/papers/v22/20-1307.html
  4. NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., ... & Wang, J. (2022). No Language Left Behind: Scaling human-centered machine translation. arXiv:2207.04672. doi:10.48550/arXiv.2207.04672