The attention mechanism, introduced to neural machine translation by Bahdanau et al. (2015), addresses a fundamental limitation of encoder-decoder architectures: the compression of an entire input sequence into a single fixed-size vector. Instead of relying on this bottleneck representation, attention allows the decoder to look back at all encoder hidden states and compute a weighted combination that is most relevant to the current decoding step. This idea proved transformative, not only improving machine translation quality but ultimately leading to the transformer architecture that dominates modern NLP.
Additive and Multiplicative Attention
Bahdanau (additive) attention:
eᵢⱼ = vᵀ · tanh(W_s · sⱼ₋₁ + W_h · hᵢ)
αᵢⱼ = softmax(eᵢⱼ) = exp(eᵢⱼ) / Σₖ exp(eₖⱼ)
cⱼ = Σᵢ αᵢⱼ · hᵢ
Luong (multiplicative) attention:
eᵢⱼ = sⱼᵀ · W · hᵢ (general)
eᵢⱼ = sⱼᵀ · hᵢ (dot-product)
Bahdanau's additive attention computes alignment scores through a small feedforward network, while Luong et al. (2015) proposed simpler multiplicative alternatives, including dot-product and general bilinear attention. Both approaches compute a score for each source position, normalize these scores via softmax to obtain attention weights, and take the weighted sum of source hidden states as the context vector. The context vector is then concatenated with the decoder state to predict the next output token. Empirically, dot-product attention is faster due to optimized matrix multiplication, while for large hidden dimensions additive attention tends to outperform unscaled dot-product attention, whose scores grow in magnitude with the dimensionality.
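Both scoring functions can be sketched in a few lines of NumPy. The following is a minimal illustration of the equations above for a single decoding step; the dimensions, weight matrices, and random initialization are illustrative, not from the text.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_scores(s_prev, H, W_s, W_h, v):
    # Bahdanau: e_i = v^T tanh(W_s s_{j-1} + W_h h_i), one score per source position.
    return np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v   # shape (T_src,)

def dot_scores(s, H):
    # Luong dot-product: e_i = s_j^T h_i.
    return H @ s                                     # shape (T_src,)

def context_vector(scores, H):
    # Normalize scores to attention weights, then take the weighted sum
    # of encoder hidden states as the context vector.
    alpha = softmax(scores)
    return alpha @ H, alpha

# Toy example with T_src = 5 source positions and hidden size d = 4.
rng = np.random.default_rng(0)
T_src, d = 5, 4
H = rng.normal(size=(T_src, d))   # encoder hidden states h_1..h_T
s = rng.normal(size=d)            # decoder state
W_s = rng.normal(size=(d, d))
W_h = rng.normal(size=(d, d))
v = rng.normal(size=d)

c_add, a_add = context_vector(additive_scores(s, H, W_s, W_h, v), H)
c_dot, a_dot = context_vector(dot_scores(s, H), H)
```

Note how the dot-product variant needs no learned parameters at all, which is part of why it is cheap; the general bilinear form sⱼᵀ · W · hᵢ adds back a single matrix when encoder and decoder dimensions differ.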
Self-Attention and Scaled Dot-Product
The pivotal extension of attention was self-attention, where a sequence attends to itself rather than to a separate encoder. Vaswani et al. (2017) formalized this as scaled dot-product attention: Attention(Q, K, V) = softmax(QKᵀ / sqrt(d_k)) · V, where Q (queries), K (keys), and V (values) are linear projections of the input sequence. The scaling factor 1/sqrt(d_k) prevents the dot products from growing large in magnitude, which would push the softmax into regions with extremely small gradients. Self-attention computes pairwise interactions between all positions in a sequence in a single operation, enabling O(1) path length for long-range dependencies.
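The formula above can be written directly as matrix operations. Below is a minimal NumPy sketch of scaled dot-product self-attention, where Q, K, and V come from projecting the same sequence X; the sequence length, dimensions, and projection matrices are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T_q, T_k) pairwise scores
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Self-attention: queries, keys, and values are projections of one sequence.
rng = np.random.default_rng(1)
T, d_model, d_k = 6, 8, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

The weight matrix `attn` has shape (T, T): every position attends to every other position in one batched operation, which is exactly the O(1) path length the text describes.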
Rather than computing a single attention function, multi-head attention runs h parallel attention operations with different learned projections: MultiHead(Q,K,V) = Concat(head₁, ..., headₕ) · W_O, where headᵢ = Attention(QW_Qⁱ, KW_Kⁱ, VW_Vⁱ). Multiple heads allow the model to attend to information from different representational subspaces at different positions simultaneously. For example, one head might capture syntactic dependencies while another captures semantic similarities. The standard transformer uses 8 or 16 heads, each operating on a reduced dimensionality d_k = d_model / h.
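The head-splitting scheme can be sketched as follows. This NumPy illustration stores each head's projection as a slice of one d_model × d_model matrix, which is equivalent to the per-head W_Qⁱ, W_Kⁱ, W_Vⁱ in the formula; the concrete sizes are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # Run h attention heads, each on a d_k = d_model / h slice of the
    # projections, then concatenate and apply the output projection W_O.
    d_model = X.shape[-1]
    d_k = d_model // h
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)
        heads.append(attention(X @ W_Q[:, sl], X @ W_K[:, sl], X @ W_V[:, sl]))
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(2)
T, d_model, h = 6, 16, 8          # 8 heads, so d_k = 16 / 8 = 2 per head
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)   # shape (T, d_model)
```

Because each head works in a reduced d_k-dimensional subspace, the total computation is comparable to a single full-dimension attention, yet each head is free to learn a different attention pattern.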
Impact on Language Modeling
Attention mechanisms fundamentally changed the landscape of language modeling. In RNN-based models, attention allowed the incorporation of information from distant positions without relying solely on the hidden state's ability to compress it. In the transformer, self-attention became the sole mechanism for modeling inter-token dependencies, entirely replacing recurrence. This shift brought dramatic improvements: attention-based models achieve lower perplexity than comparable recurrent models and are easier to parallelize during training, because all positions can be processed simultaneously rather than sequentially.
Beyond language modeling, attention has become a ubiquitous component in deep learning, applied to computer vision (vision transformers), speech processing, protein structure prediction, and many other domains. The interpretability of attention weights, which can be visualized as soft alignments between input and output elements, has also made attention a tool for model analysis, though researchers have debated whether attention weights faithfully reflect the model's reasoning process. The attention mechanism remains one of the most influential ideas in modern machine learning.