The Transformer, introduced by Vaswani et al. (2017) in the paper "Attention Is All You Need," eliminated the sequential computation of recurrent neural networks by relying entirely on attention mechanisms to model dependencies between positions. This architectural innovation enabled far greater parallelization during training, allowing models to be trained on much larger datasets and at much greater scale. The Transformer rapidly became the dominant architecture for machine translation and subsequently for virtually all of natural language processing.
Scaled Dot-Product and Multi-Head Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
d_k = dimension of keys (and queries)
h = number of attention heads
Scaling by 1/√d_k prevents softmax saturation
The Transformer uses scaled dot-product attention, where queries, keys, and values are linearly projected from the input representations. The scaling factor √d_k prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients. Multi-head attention runs h parallel attention functions with different learned projections, allowing the model to jointly attend to information from different representation subspaces. The outputs of all heads are concatenated and projected to produce the final attention output.
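The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function names, shapes, and the single-sequence (unbatched) layout are choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., seq, d_k); V: (..., seq, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq, seq)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq, d_model); each projection matrix: (d_model, d_model).
    seq, d_model = X.shape
    d_k = d_model // h  # assumes d_model is divisible by h

    def project_and_split(W):
        # Project, then split the model dimension into h heads: (h, seq, d_k).
        return (X @ W).reshape(seq, h, d_k).transpose(1, 0, 2)

    Q, K, V = map(project_and_split, (W_q, W_k, W_v))
    heads = scaled_dot_product_attention(Q, K, V)            # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)  # Concat(head_1..head_h)
    return concat @ W_o
```

Note that the per-head projections are realized here as one (d_model, d_model) matrix per role, reshaped into h slices of width d_k, which is equivalent to h separate W_i matrices.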
Encoder-Decoder Structure
The Transformer maintains the encoder-decoder structure but implements both components using stacked self-attention layers. The encoder consists of N identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer, with residual connections and layer normalization around each. The decoder adds a third sublayer: multi-head cross-attention over the encoder output. The decoder's self-attention is masked to prevent positions from attending to subsequent positions, ensuring the autoregressive property is maintained during generation.
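The decoder's causal mask can be illustrated with a small sketch (names are illustrative). Positions that must not be attended to have their scores set to -inf, so softmax assigns them zero weight:

```python
import numpy as np

def causal_mask(seq_len):
    # mask[i, j] is True where position i may attend to position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores):
    # scores: (seq, seq) attention logits. Disallowed entries become -inf,
    # so the subsequent softmax gives them exactly zero probability.
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)
```

Because the mask is applied to the logits before softmax, the autoregressive property holds at training time even though all positions are computed in parallel.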
Because the Transformer contains no recurrence or convolution, it has no inherent notion of sequence order. Positional information is injected through positional encodings added to the input embeddings. Vaswani et al. (2017) used sinusoidal functions of different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d}). This scheme allows the model to learn to attend by relative position and generalizes to sequence lengths longer than those seen during training. Learned positional embeddings are an equally effective alternative used in many subsequent models.
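The sinusoidal scheme above is straightforward to compute; a short sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    # Assumes d_model is even.
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even indices
    pe[:, 1::2] = np.cos(angles)             # odd indices
    return pe
```

Each dimension pair forms a sinusoid of a different wavelength, from 2π up to 10000·2π, which is what lets attention pick up relative offsets between positions.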
Training and Inference
Transformer translation models are trained with teacher forcing using the Adam optimizer with a warmup learning rate schedule. The warmup phase gradually increases the learning rate to prevent early divergence, followed by a decay phase. Label smoothing, which distributes some probability mass from the correct target token to all other tokens, acts as a regularizer and improves BLEU scores. At inference time, beam search with length normalization is used to find high-probability translations, and ensemble decoding over multiple independently trained models further improves quality.
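The warmup-then-decay schedule from Vaswani et al. (2017) is lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); the sketch below also shows one common formulation of label smoothing (mixing the one-hot target with a uniform distribution), with the exact mixing scheme a detail that varies across implementations:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for warmup_steps, then decay proportional to step^-0.5.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def label_smoothed_targets(vocab_size, target_idx, eps=0.1):
    # One common formulation (illustrative): spread eps of the probability
    # mass uniformly over the vocabulary; the rest stays on the target.
    q = [eps / vocab_size] * vocab_size
    q[target_idx] += 1.0 - eps
    return q
```

The peak learning rate is reached exactly at warmup_steps, after which the two terms of the min cross over and the decay branch takes effect.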
The Transformer's impact on machine translation has been transformative. It enabled the training of much larger models on much more data than was feasible with RNNs, leading to consistent quality improvements. The architecture's suitability for transfer learning gave rise to pre-trained models like mBART and mT5 that can be fine-tuned for translation in any language pair. The Transformer also enabled multilingual translation models that handle dozens or hundreds of language pairs within a single model, representing a fundamental shift toward universal translation systems.