The Transformer, introduced by Vaswani et al. (2017) in the paper "Attention Is All You Need," eliminated the sequential computation of recurrent neural networks by relying entirely on attention mechanisms to model dependencies between positions. This architectural innovation enabled far greater parallelization during training, allowing models to be trained on much larger datasets and at much greater scale. The Transformer rapidly became the dominant architecture for machine translation and subsequently for virtually all of natural language processing.
Scaled Dot-Product and Multi-Head Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
MultiHead(Q, K, V) = Concat(head₁, ..., head_h) · W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
d_k = dimension of keys (and queries)
h = number of attention heads
Scaling by 1/√d_k prevents softmax saturation
The Transformer uses scaled dot-product attention, where queries, keys, and values are linearly projected from the input representations. The scaling factor √d_k prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients. Multi-head attention runs h parallel attention functions with different learned projections, allowing the model to jointly attend to information from different representation subspaces. The outputs of all heads are concatenated and projected to produce the final attention output.
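The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function names, shapes, and the single-sequence (unbatched) layout are choices made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (..., seq, d_k); V: (..., seq, d_v).
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (..., seq, seq)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    # X: (seq, d_model); each projection matrix: (d_model, d_model).
    seq, d_model = X.shape
    d_k = d_model // h  # assumes d_model is divisible by h

    def project_and_split(W):
        # Project, then split the model dimension into h heads: (h, seq, d_k).
        return (X @ W).reshape(seq, h, d_k).transpose(1, 0, 2)

    Q, K, V = map(project_and_split, (W_q, W_k, W_v))
    heads = scaled_dot_product_attention(Q, K, V)            # (h, seq, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)  # Concat(head_1..head_h)
    return concat @ W_o
```

Note that the per-head projections are realized here as one (d_model, d_model) matrix per role, reshaped into h slices of width d_k, which is equivalent to h separate W_i matrices.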
Encoder-Decoder Structure
The Transformer maintains the encoder-decoder structure but implements both components using stacked self-attention layers. The encoder consists of N identical layers, each containing a multi-head self-attention sublayer and a position-wise feed-forward sublayer, with residual connections and layer normalization around each. The decoder adds a third sublayer: multi-head cross-attention over the encoder output. The decoder's self-attention is masked to prevent positions from attending to subsequent positions, ensuring the autoregressive property is maintained during generation.
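The decoder's causal mask can be illustrated with a small sketch (names are illustrative). Positions that must not be attended to have their scores set to -inf, so softmax assigns them zero weight:

```python
import numpy as np

def causal_mask(seq_len):
    # mask[i, j] is True where position i may attend to position j (j <= i).
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores):
    # scores: (seq, seq) attention logits. Disallowed entries become -inf,
    # so the subsequent softmax gives them exactly zero probability.
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)
```

Because the mask is applied to the logits before softmax, the autoregressive property holds at training time even though all positions are computed in parallel.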
Because the Transformer contains no recurrence or convolution, it has no inherent notion of sequence order. Positional information is injected through positional encodings added to the input embeddings. Vaswani et al. (2017) used sinusoidal functions of different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}) and PE(pos, 2i+1) = cos(pos/10000^{2i/d}). This scheme allows the model to learn to attend by relative position and generalizes to sequence lengths longer than those seen during training. Learned positional embeddings are an equally effective alternative used in many subsequent models.
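The sinusoidal scheme above is straightforward to compute; a short sketch (assuming an even model dimension):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    # Assumes d_model is even.
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even indices
    pe[:, 1::2] = np.cos(angles)             # odd indices
    return pe
```

Each dimension pair forms a sinusoid of a different wavelength, from 2π up to 10000·2π, which is what lets attention pick up relative offsets between positions.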
Training and Inference
Transformer translation models are trained with teacher forcing using the Adam optimizer with a warmup learning rate schedule. The warmup phase gradually increases the learning rate to prevent early divergence, followed by a decay phase. Label smoothing, which distributes some probability mass from the correct target token to all other tokens, acts as a regularizer and improves BLEU scores. At inference time, beam search with length normalization is used to find high-probability translations, and ensemble decoding over multiple independently trained models further improves quality.
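The warmup-then-decay schedule from Vaswani et al. (2017) is lr = d_model^-0.5 · min(step^-0.5, step · warmup_steps^-1.5); the sketch below also shows one common formulation of label smoothing (mixing the one-hot target with a uniform distribution), with the exact mixing scheme a detail that varies across implementations:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for warmup_steps, then decay proportional to step^-0.5.
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

def label_smoothed_targets(vocab_size, target_idx, eps=0.1):
    # One common formulation (illustrative): spread eps of the probability
    # mass uniformly over the vocabulary; the rest stays on the target.
    q = [eps / vocab_size] * vocab_size
    q[target_idx] += 1.0 - eps
    return q
```

The peak learning rate is reached exactly at warmup_steps, after which the two terms of the min cross over and the decay branch takes effect.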
The Transformer's impact on machine translation has been transformative. It enabled the training of much larger models on much more data than was feasible with RNNs, leading to consistent quality improvements. The architecture's suitability for transfer learning gave rise to pre-trained models like mBART and mT5 that can be fine-tuned for translation in any language pair. The Transformer also enabled multilingual translation models that handle dozens or hundreds of language pairs within a single model, representing a fundamental shift toward universal translation systems.