The attention mechanism, introduced to NMT by Bahdanau et al. (2015), was one of the most consequential innovations in modern deep learning. Rather than compressing the entire source sentence into a single vector, attention allows the decoder to compute a weighted sum of all encoder hidden states at each generation step. The weights, computed dynamically based on the relevance of each source position to the current decoding step, serve as a soft alignment between source and target positions. This innovation immediately improved translation quality, particularly for long sentences, and became a standard component of all subsequent NMT architectures.
Attention Computation
α_{t,i} = exp(score(s_t, h_i)) / Σ_j exp(score(s_t, h_j))
c_t = Σ_i α_{t,i} · h_i

Bahdanau (Additive) Attention:  score(s_t, h_i) = v^T · tanh(W_1 · s_t + W_2 · h_i)
Luong (Multiplicative) Attention:  score(s_t, h_i) = s_t^T · W · h_i (general)
                                   score(s_t, h_i) = s_t^T · h_i (dot product)

where s_t = decoder state, h_i = encoder hidden state, and v, W, W_1, W_2 are learned parameters
The attention mechanism computes a context vector c_t as a weighted average of encoder hidden states h_i. The attention weights α_{t,i} are obtained by passing a compatibility score through a softmax function, ensuring they form a valid probability distribution over source positions. Bahdanau et al. (2015) used an additive scoring function with a learned weight matrix, while Luong et al. (2015) proposed simpler multiplicative (dot-product) variants that are more computationally efficient. The context vector c_t is concatenated with the decoder state and used to predict the next target token.
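The computation above can be sketched in a few lines of NumPy. This is a minimal illustration of the dot-product (Luong) variant, not a training-ready implementation; the function and variable names are ours:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def luong_dot_attention(s_t, H):
    """Dot-product attention.
    s_t: decoder state, shape (d,)
    H:   encoder hidden states as rows, shape (n, d)
    Returns the context vector c_t and the attention weights alpha."""
    scores = H @ s_t         # score(s_t, h_i) = s_t^T · h_i, shape (n,)
    alpha = softmax(scores)  # weights form a distribution over source positions
    c_t = alpha @ H          # weighted average of encoder states, shape (d,)
    return c_t, alpha

# Toy example: 4 source positions, hidden size 3.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
s_t = np.array([2.0, 0.0, 0.0])
c_t, alpha = luong_dot_attention(s_t, H)
```

Source positions whose hidden states point in the same direction as the decoder state receive the largest weights; in the toy example those are rows 0 and 3.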
Attention as Soft Alignment
The attention weights α_{t,i} can be interpreted as a soft alignment between each target position t and source positions i. When the model generates a target word, it tends to assign high attention weight to the corresponding source words, producing alignment patterns that closely resemble those learned by IBM alignment models. However, attention and alignment are not identical: attention is learned as an intermediate computation optimized for translation quality, not explicitly trained to produce alignments. In practice, attention weights may spread probability mass across multiple source positions or, when generating function words, may focus on positions with no direct translational counterpart.
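A common heuristic for turning soft attention into a hard alignment is to take, for each target position, the source position with the highest weight. A sketch (the matrix here is invented for illustration):

```python
import numpy as np

def hard_alignments(attn):
    """attn: (T_target, T_source) attention-weight matrix for one
    sentence pair. For each target position, return the source
    position with the highest weight -- a simple heuristic for
    extracting a hard alignment from soft attention."""
    return attn.argmax(axis=1)

# Toy 3-target x 4-source attention matrix: each row sums to 1.
attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.2, 0.6]])
print(hard_alignments(attn))  # -> [0 1 3]
```

As the section notes, such extracted alignments only approximate true word alignments, since nothing in training forces the peak weight onto the translationally corresponding word.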
The attention mechanism introduced for NMT inspired the self-attention mechanism at the heart of the Transformer architecture (Vaswani et al., 2017). While NMT attention computes interactions between decoder states and encoder states (cross-attention), self-attention computes interactions among positions within a single sequence. The multi-head variant allows the model to attend to different types of relationships simultaneously. This generalization of the attention concept has become the foundation of virtually all modern language models and representation learning approaches.
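The contrast with cross-attention can be made concrete: in single-head self-attention, queries, keys, and values are all projections of the same sequence, so the score matrix captures pairwise interactions among its positions. A minimal sketch of scaled dot-product self-attention (projection matrices and dimensions are illustrative assumptions):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over one sequence X of shape (n, d_model).
    Every position attends to every position of the same sequence,
    unlike cross-attention, where queries come from the decoder."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) pairwise interactions
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                         # (n, d_k) contextualized outputs

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention runs several such projections in parallel and concatenates the per-head outputs, letting each head specialize in a different relationship type.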
Variants and Refinements
Numerous attention variants have been proposed to address different aspects of the translation problem. Local attention (Luong et al., 2015) restricts the attention window to a subset of source positions, reducing computational cost for long sentences. Coverage mechanisms (Tu et al., 2016) track which source positions have been attended to, addressing the problem of under-translation (omitting source content) and over-translation (repeating content). Multi-head attention (Vaswani et al., 2017) uses multiple parallel attention heads, each capturing different types of source-target relationships.
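The coverage idea can be illustrated with a deliberately simplified sketch. Tu et al. (2016) feed the coverage vector into a learned scoring network; here, as a stand-in, the cumulative attention mass is simply subtracted from the raw scores as a penalty, which captures the core intuition of discouraging re-attention to already-covered source positions:

```python
import numpy as np

def attend_with_coverage(scores_per_step, penalty=1.0):
    """Simplified coverage sketch (not Tu et al.'s exact formulation).
    scores_per_step: (T_target, T_source) raw attention scores, one
    row per decoding step. The cumulative attention each source
    position has received is subtracted as a penalty, discouraging
    over-translation (re-attending to covered content)."""
    n_src = scores_per_step.shape[1]
    coverage = np.zeros(n_src)        # attention mass received so far
    all_weights = []
    for scores in scores_per_step:
        adj = scores - penalty * coverage
        e = np.exp(adj - adj.max())
        alpha = e / e.sum()
        coverage += alpha             # update the coverage record
        all_weights.append(alpha)
    return np.array(all_weights), coverage

# Two decoding steps with identical raw scores: without coverage,
# the second step would attend to position 0 just as strongly.
scores = np.array([[3.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0]])
W, cov = attend_with_coverage(scores)
```

After the first step, position 0 carries most of the coverage mass, so the second step's weight on it drops, nudging attention toward not-yet-covered positions.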
The impact of attention extends far beyond machine translation. The mechanism has been adopted in virtually every area of NLP — text summarization, question answering, parsing, sentiment analysis — and has spread to computer vision, speech processing, and scientific computing. The Transformer architecture, which uses attention as its sole computational mechanism (dispensing with recurrence entirely), has become the dominant architecture in deep learning, demonstrating the extraordinary generality of the attention concept first developed for NMT.