The transformer, introduced by Vaswani et al. in the landmark 2017 paper "Attention Is All You Need," dispenses entirely with recurrence and convolution, relying instead on stacked self-attention layers to model all pairwise interactions within a sequence. Its ability to process all positions in parallel during training, combined with its effectiveness at capturing long-range dependencies, has made it the foundation of virtually all modern language models, including BERT, GPT, T5, and their successors. It is arguably the most consequential architectural innovation in the history of NLP.
Architecture
Scaled Dot-Product Attention: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
Multi-Head Attention: MultiHead(Q,K,V) = Concat(head₁,...,headₕ)W_O
    where headᵢ = Attention(QW_i^Q, KW_i^K, VW_i^V)
Transformer Layer:
    x' = LayerNorm(x + MultiHead(x,x,x))
    output = LayerNorm(x' + FFN(x'))
FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
Positional Encoding:
    PE(pos,2i) = sin(pos/10000^{2i/d})
    PE(pos,2i+1) = cos(pos/10000^{2i/d})
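The sinusoidal encoding above can be sketched in a few lines of NumPy (a minimal illustration; the function and argument names are mine, not from any library):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the (max_len, d_model) sinusoidal positional encoding matrix."""
    pos = np.arange(max_len)[:, None]       # positions 0..max_len-1, column vector
    i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
```

At position 0 the even dimensions are sin(0) = 0 and the odd dimensions cos(0) = 1, and each dimension pair oscillates at a different wavelength, which is what lets the model distinguish positions.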
The transformer consists of an encoder and a decoder, each composed of stacked identical layers. Each encoder layer contains a multi-head self-attention sublayer followed by a position-wise feedforward network, with residual connections and layer normalization around each sublayer. The decoder adds a third sublayer: cross-attention over the encoder outputs. Crucially, the decoder's self-attention is masked to prevent positions from attending to subsequent positions, ensuring the autoregressive property needed for language generation. Sinusoidal positional encodings inject position information, since the self-attention operation itself is permutation-equivariant.
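The encoder layer structure just described can be sketched in NumPy (a simplified, single-head version with random weights; all names here are illustrative and not from any library):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 32  # sequence length, model dimension, FFN hidden dimension

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Projection and FFN weight matrices, randomly initialized for the sketch
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) for _ in range(4))
W_1, W_2 = rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d))

def encoder_layer(x):
    # Self-attention sublayer: softmax(QK^T / sqrt(d)) V
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k.T / np.sqrt(d)) @ v
    x = layer_norm(x + attn @ W_o)       # residual connection + LayerNorm
    # Position-wise feedforward sublayer: max(0, xW1) W2
    ffn = np.maximum(0.0, x @ W_1) @ W_2
    return layer_norm(x + ffn)           # residual connection + LayerNorm

x = rng.standard_normal((n, d))
print(encoder_layer(x).shape)  # (6, 8): same shape in and out, so layers stack
```

Because each layer maps an (n, d) input to an (n, d) output, identical layers can be stacked arbitrarily deep, exactly as the encoder does.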
Key Design Principles
Several design choices contribute to the transformer's effectiveness. Residual connections let gradients flow easily through deep networks, enabling architectures with 12, 24, or more layers. Layer normalization stabilizes training by normalizing activations within each sublayer. The feedforward network in each layer, typically with a hidden dimension four times the model dimension, applies a position-wise nonlinear transformation that complements the attention mechanism's ability to mix information across positions. Together, these components form a powerful function approximator that can model complex linguistic patterns.
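A quick parameter count makes the 4× convention concrete (using d = 512 as in the original paper and ignoring biases): the feedforward network actually holds twice as many weights per layer as the four attention projections combined.

```python
d_model = 512       # model dimension used in the original paper
d_ff = 4 * d_model  # feedforward hidden dimension (the 4x convention)

# Four attention projections W_Q, W_K, W_V, W_O, each d_model x d_model
attn_params = 4 * d_model * d_model
# Two FFN matrices: W_1 (d_model x d_ff) and W_2 (d_ff x d_model)
ffn_params = d_model * d_ff + d_ff * d_model

print(attn_params)               # 1048576
print(ffn_params)                # 2097152
print(ffn_params / attn_params)  # 2.0
```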
Self-attention has O(n²·d) time and memory complexity, where n is the sequence length and d is the model dimension, because it computes pairwise interactions between all positions. For long sequences, this quadratic scaling becomes prohibitive. This has motivated a rich literature on efficient transformers: Linformer reduces attention to O(n) via low-rank projection; Performer uses random features to approximate softmax attention; Longformer and BigBird use sparse attention patterns; and FlashAttention optimizes the memory access patterns of exact attention. These advances have extended transformers to sequences of tens of thousands of tokens.
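The quadratic term is easy to see directly: the attention score matrix alone has n² entries, so doubling the sequence length quadruples its memory, independent of d (a small illustrative measurement):

```python
import numpy as np

d = 64  # model dimension; the score matrix size does not depend on it
for n in (512, 1024, 2048):
    q = np.ones((n, d), dtype=np.float32)
    k = np.ones((n, d), dtype=np.float32)
    scores = q @ k.T  # the (n, n) matrix of pairwise scores
    print(n, scores.shape, scores.nbytes)  # bytes grow as n^2
```

At n = 2048 the float32 score matrix already takes 2048 × 2048 × 4 = 16 MB per head per layer, which is why sparse and memory-efficient attention variants matter at long context lengths.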
Encoder-Only, Decoder-Only, and Encoder-Decoder Variants
The original transformer used an encoder-decoder structure for machine translation, but subsequent work explored three architectural variants. Encoder-only transformers (BERT, RoBERTa) process the full input bidirectionally and are used for understanding tasks like classification and token labeling. Decoder-only transformers (GPT, GPT-2, GPT-3) use masked self-attention for autoregressive language modeling and generation. Encoder-decoder transformers (T5, BART) maintain the original structure and are effective for sequence-to-sequence tasks. The decoder-only variant has emerged as the dominant architecture for large language models due to its simplicity and scalability.
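The masked self-attention that defines the decoder-only variant can be sketched as follows: score entries above the diagonal are set to -inf before the softmax, so each token attends only to itself and earlier positions (a minimal NumPy sketch; the function names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention_weights(scores):
    """Zero out attention to future positions via an upper-triangular mask."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
    return softmax(np.where(mask, -np.inf, scores))

# With uniform (zero) scores, row i attends equally over positions 0..i
w = causal_attention_weights(np.zeros((4, 4)))
print(np.round(w, 2))
```

The resulting weight matrix is lower-triangular: row 0 attends only to position 0, row 1 splits attention evenly between positions 0 and 1, and so on, which is exactly the autoregressive property the decoder needs.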
The transformer's impact extends far beyond NLP. Vision transformers (ViT) apply the same architecture to image patches, achieving state-of-the-art results in computer vision. AlphaFold 2 uses transformer-like attention for protein structure prediction. Decision Transformer applies the architecture to reinforcement learning. This remarkable generality suggests that the transformer captures something fundamental about information processing: the ability to dynamically route information between elements based on their content, unconstrained by architectural assumptions about locality or sequential order.