Computational Linguistics

Back-Translation

Back-translation is a data augmentation technique that generates synthetic parallel data by translating monolingual target-language text back into the source language, enabling neural MT systems to leverage abundant monolingual corpora to improve translation quality.

D_aug = {(MT_{tgt→src}(y), y) : y ∈ D_mono} ∪ D_parallel

Back-translation (Sennrich et al., 2016) addresses a fundamental asymmetry in machine translation: while parallel corpora are scarce and expensive to produce, monolingual text is abundantly available in most languages. The technique works by training a reverse-direction translation model (target-to-source), using it to translate monolingual target-language data into the source language, and then adding the resulting synthetic parallel data to the training set for the forward model. Despite the noise in the synthetic source sentences, the authentic target sentences provide valuable training signal, particularly for improving fluency and coverage of the target language.

The Back-Translation Process

Back-Translation Pipeline

1. Train reverse model: θ_rev on D_parallel (tgt → src)
2. Generate synthetic sources: x' = MT_rev(y) for y ∈ D_mono
3. Create augmented data: D_aug = D_parallel ∪ {(x', y)}
4. Train forward model: θ_fwd on D_aug (src → tgt)

Optionally iterate: use the improved forward model to generate better reverse training data.

The effectiveness of back-translation stems from the observation that NMT systems benefit more from authentic target-side text than from authentic source-side text. The decoder learns to produce fluent, natural target language from the genuine monolingual data, while the encoder needs only to extract sufficient information from the noisy synthetic source to guide generation. This asymmetry explains why back-translation (synthetic source, real target) consistently outperforms forward translation (real source, synthetic target) for data augmentation.
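The pipeline above can be sketched in a few lines. This is a minimal illustration, not a full system: `translate_rev` stands in for a trained target-to-source NMT model, and the toy `fake_reverse_mt` below is a placeholder used only to show how the augmented dataset is assembled.

```python
def back_translate(d_parallel, d_mono, translate_rev):
    """Build D_aug = D_parallel ∪ {(MT_rev(y), y) : y ∈ D_mono}.

    translate_rev is assumed to be a trained tgt→src translation
    function; here it is supplied by the caller.
    """
    synthetic = [(translate_rev(y), y) for y in d_mono]
    return list(d_parallel) + synthetic

# Toy stand-in for the reverse model (purely illustrative).
def fake_reverse_mt(sentence):
    return "src:" + sentence

d_parallel = [("ein hund", "a dog")]          # genuine pair
d_mono = ["a cat", "a bird"]                  # target-language monolingual text
d_aug = back_translate(d_parallel, d_mono, fake_reverse_mt)
```

Note that the synthetic pairs keep the authentic target sentence as the training target; only the source side is machine-generated.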

Sampling Strategies

The method used to generate synthetic source sentences significantly impacts back-translation effectiveness. Beam search produces high-quality but low-diversity translations, which may cause the forward model to overfit to specific translation patterns. Sampling from the model distribution produces noisier but more diverse translations, providing a regularization effect. Edunov et al. (2018) showed that adding noise to beam search outputs or sampling with restricted randomness yields the best results, as the diversity prevents overfitting while the quality maintains a useful training signal.
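The noising scheme of Edunov et al. (2018) combines word dropout, filler-token replacement, and a bounded local shuffle. The sketch below approximates those three operations; the parameter values and the `<BLANK>` filler token are illustrative choices, not fixed standards.

```python
import random

def noise(tokens, p_drop=0.1, p_blank=0.1, shuffle_k=3, rng=None):
    """Apply three noise operations to a beam-search back-translation:
    drop each token with prob p_drop, replace survivors with a filler
    token with prob p_blank, then locally shuffle by sorting on
    position + U(0, shuffle_k), so no token moves more than shuffle_k
    positions."""
    rng = rng or random.Random(0)
    out = [t for t in tokens if rng.random() >= p_drop]
    out = ["<BLANK>" if rng.random() < p_blank else t for t in out]
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(out))]
    return [t for _, t in sorted(zip(keys, out))]
```

Applying `noise` to each synthetic source sentence before training injects the diversity that plain beam search lacks, while leaving the authentic target side untouched.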

Iterative Back-Translation

Back-translation can be applied iteratively: the improved forward model is used to generate better forward translations of source-language monolingual data, which train a better reverse model, which in turn generates better back-translations. This iterative process (Hoang et al., 2018) converges to progressively better models in both directions, and in the extreme case of no parallel data at all, it becomes the basis for unsupervised machine translation (Lample et al., 2018). The iterative approach effectively bootstraps translation capability from monolingual data alone.
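The alternating structure of iterative back-translation can be outlined as follows. `train` and `translate` are placeholders for a full NMT training and decoding stack; the toy stubs in the demo exist only so the control flow is runnable.

```python
def iterative_back_translation(d_parallel, mono_src, mono_tgt,
                               train, translate, rounds=2):
    """Alternate directions: each round's reverse model back-translates
    target monolingual data for the forward model, and the improved
    forward model forward-translates source monolingual data to
    retrain the reverse model."""
    rev = train([(y, x) for x, y in d_parallel])  # tgt → src
    fwd = None
    for _ in range(rounds):
        synth_fwd = [(translate(rev, y), y) for y in mono_tgt]
        fwd = train(list(d_parallel) + synth_fwd)
        synth_rev = [(translate(fwd, x), x) for x in mono_src]
        rev = train([(y, x) for x, y in d_parallel] + synth_rev)
    return fwd, rev

# Toy stubs: "training" just stores the data, "translation" marks it.
toy_train = lambda data: list(data)
toy_translate = lambda model, s: "bt(" + s + ")"
fwd, rev = iterative_back_translation(
    [("a", "b")], ["s1"], ["t1", "t2"], toy_train, toy_translate, rounds=1)
```

With no parallel data at all, the same loop seeded with unsupervised initialization is the core of unsupervised MT (Lample et al., 2018).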

Applications and Extensions

Back-translation has become a standard component of competitive NMT systems. In WMT shared tasks, top-performing systems routinely use back-translated data that exceeds the size of the genuine parallel corpus by a factor of 2–10. The technique is particularly valuable for low-resource language pairs, where parallel data is scarce but monolingual data may be more readily available. Tagged back-translation, which marks synthetic source sentences with a special token, allows the model to distinguish between genuine and synthetic data, preventing the degradation that can occur when synthetic data overwhelms genuine data.
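Tagging is the simplest of these refinements to implement: prepend a reserved token to every synthetic source sentence. The `<BT>` tag string below is a common convention, not a fixed standard; any token reserved in the vocabulary works.

```python
def tag_synthetic(pairs, tag="<BT>"):
    """Prepend a reserved token to each synthetic source sentence so
    the model can distinguish synthetic from genuine training data."""
    return [(tag + " " + src, tgt) for src, tgt in pairs]

genuine = [("ein hund", "a dog")]
synthetic = tag_synthetic([("src hypothesis", "a cat")])
training_data = genuine + synthetic
```

At inference time no tag is supplied, so the model treats real input as genuine-distribution data while still having learned from the synthetic pairs.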

Extensions of back-translation include self-training (forward translation of source monolingual data), paraphrastic back-translation (using diverse paraphrases as targets), and cross-lingual back-translation for multilingual models. The broader principle — leveraging monolingual data through synthetic data generation — has been adopted across NLP for tasks including summarization, question answering, and dialogue systems, making back-translation one of the most influential data augmentation techniques in modern NLP.


References

  1. Sennrich, R., Haddow, B., & Birch, A. (2016). Improving neural machine translation models with monolingual data. Proceedings of ACL 2016, 86–96. doi:10.18653/v1/P16-1009
  2. Edunov, S., Ott, M., Auli, M., & Grangier, D. (2018). Understanding back-translation at scale. Proceedings of EMNLP 2018, 489–500. doi:10.18653/v1/D18-1045
  3. Hoang, V. C. D., Koehn, P., Haffari, G., & Cohn, T. (2018). Iterative back-translation for neural machine translation. Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, 18–24. doi:10.18653/v1/W18-2703
  4. Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. (2018). Unsupervised machine translation using monolingual corpora only. Proceedings of ICLR 2018. doi:10.48550/arXiv.1711.00043
