The attention mechanism, introduced to neural machine translation by Bahdanau et al. (2015), addresses a fundamental limitation of encoder-decoder architectures: the compression of an entire input sequence into a single fixed-size vector. Instead of relying on this bottleneck representation, attention allows the decoder to look back at all encoder hidden states and compute a weighted combination that is most relevant to the current decoding step. This idea proved transformative, not only improving machine translation quality but ultimately leading to the transformer architecture that dominates modern NLP.
Additive and Multiplicative Attention
Bahdanau (additive) attention:
eᵢⱼ = vᵀ · tanh(W_s · sⱼ₋₁ + W_h · hᵢ)
αᵢⱼ = softmax(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eₖⱼ)
cⱼ = Σᵢ αᵢⱼ · hᵢ
Luong (multiplicative) attention:
eᵢⱼ = sⱼᵀ · W · hᵢ (general)
eᵢⱼ = sⱼᵀ · hᵢ (dot-product)
Bahdanau's additive attention computes alignment scores through a small feedforward network, while Luong et al. (2015) proposed simpler multiplicative alternatives, including dot-product and general bilinear attention. Both approaches compute a score for each source position, normalize these scores via softmax to obtain attention weights, and take the weighted sum of source hidden states as the context vector. The context vector is then concatenated with the decoder state to predict the next output token. Empirically, dot-product attention is faster due to optimized matrix multiplication, while for large hidden dimensions additive attention tends to outperform unscaled dot-product attention, whose scores grow in magnitude with the dimensionality.
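Both scoring functions can be sketched in a few lines of NumPy. The following is a minimal illustration of the equations above for a single decoding step; the dimensions, weight matrices, and random initialization are illustrative, not from the text.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_scores(s_prev, H, W_s, W_h, v):
    # Bahdanau: e_i = v^T tanh(W_s s_{j-1} + W_h h_i), one score per source position.
    return np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # shape (T_src,)

def dot_scores(s, H):
    # Luong dot-product: e_i = s_j^T h_i.
    return H @ s                                     # shape (T_src,)

def context_vector(scores, H):
    # Normalize scores to attention weights, then take the weighted sum
    # of encoder hidden states as the context vector.
    alpha = softmax(scores)
    return alpha @ H, alpha

# Toy example with T_src = 5 source positions and hidden size d = 4.
rng = np.random.default_rng(0)
T_src, d = 5, 4
H = rng.normal(size=(T_src, d))   # encoder hidden states h_1..h_T
s = rng.normal(size=d)            # decoder state
W_s = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v = rng.normal(size=d)

c_add, a_add = context_vector(additive_scores(s, H, W_s, W_h, v), H)
c_dot, a_dot = context_vector(dot_scores(s, H), H)
```

Note how the dot-product variant needs no learned parameters at all, which is part of why it is cheap; the general bilinear form sⱼᵀ · W · hᵢ adds back a single matrix when encoder and decoder dimensions differ.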
Self-Attention and Scaled Dot-Product
The pivotal extension of attention was self-attention, where a sequence attends to itself rather than to a separate encoder. Vaswani et al. (2017) formalized this as scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) · V, where Q (queries), K (keys), and V (values) are linear projections of the input sequence. The scaling factor 1/sqrt(d_k) prevents the dot products from growing large in magnitude, which would push the softmax into regions with extremely small gradients. Self-attention computes pairwise interactions between all positions in a sequence in a single operation, enabling O(1) path length for long-range dependencies.
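The formula above can be written directly as matrix operations. Below is a minimal NumPy sketch of scaled dot-product self-attention, where Q, K, and V come from projecting the same sequence X; the sequence length, dimensions, and projection matrices are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k) pairwise scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Self-attention: queries, keys, and values are projections of one sequence.
rng = np.random.default_rng(1)
T, d_model, d_k = 6, 8, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

The weight matrix `attn` has shape (T, T): every position attends to every other position in one batched operation, which is exactly the O(1) path length the text describes.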
Rather than computing a single attention function, multi-head attention runs h parallel attention operations with different learned projections: MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · W_O, where headᵢ = Attention(QW_Qⁱ, KW_Kⁱ, VW_Vⁱ). Multiple heads allow the model to attend to information from different representational subspaces at different positions simultaneously. For example, one head might capture syntactic dependencies while another captures semantic similarities. The standard transformer uses 8 or 16 heads, each operating on a reduced dimensionality d_k = d_model / h.
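The head-splitting scheme can be sketched as follows. This NumPy illustration stores each head's projection as a slice of one d_model × d_model matrix, which is equivalent to the per-head W_Qⁱ, W_Kⁱ, W_Vⁱ in the formula; the concrete sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # Run h attention heads, each on a d_k = d_model / h slice of the
    # projections, then concatenate and apply the output projection W_O.
    d_model = X.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(2)
T, d_model, h = 6, 16, 8          # 8 heads, so d_k = 16 / 8 = 2 per head
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)   # shape (T, d_model)
```

Because each head works in a reduced d_k-dimensional subspace, the total computation is comparable to a single full-dimension attention, yet each head is free to learn a different attention pattern.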
Impact on Language Modeling
Attention mechanisms fundamentally changed the landscape of language modeling. In RNN-based models, attention allowed the incorporation of information from distant positions without relying solely on the hidden state's ability to compress it. In the transformer, self-attention became the sole mechanism for modeling inter-token dependencies, entirely replacing recurrence. This shift brought dramatic improvements: attention-based models achieve lower perplexity than comparable recurrent models and are easier to parallelize during training, because all positions can be processed simultaneously rather than sequentially.
Beyond language modeling, attention has become a ubiquitous component in deep learning, applied to computer vision (vision transformers), speech processing, protein structure prediction, and many other domains. The interpretability of attention weights, which can be visualized as soft alignments between input and output elements, has also made attention a tool for model analysis, though researchers have debated whether attention weights faithfully reflect the model's reasoning process. The attention mechanism remains one of the most influential ideas in modern machine learning.