GPT

GPT (Generative Pre-trained Transformer) demonstrated that autoregressive language model pre-training on large text corpora produces powerful representations for downstream NLP tasks, establishing the foundation for the scaling paradigm that led to GPT-2, GPT-3, and beyond.

P(wₜ | w₁,...,wₜ₋₁) = softmax(hₜ W_e^T), L = -Σₜ log P(wₜ | w₁,...,wₜ₋₁)

The Generative Pre-trained Transformer (GPT), introduced by Radford et al. (2018) at OpenAI, was among the first models to demonstrate that unsupervised pre-training of a large transformer on diverse text, followed by supervised fine-tuning, could achieve state-of-the-art results on multiple NLP benchmarks. Unlike BERT's bidirectional encoder, GPT uses a decoder-only transformer trained with a causal (left-to-right) language modeling objective. This design choice enables GPT to generate coherent text, a capability that became the defining feature of the GPT model family as it scaled from 117 million parameters (GPT-1) to 175 billion (GPT-3).
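The next-token distribution in the formula above can be sketched directly. In this illustrative example, a random vector stands in for the transformer's hidden state hₜ, and a tied embedding matrix W_e projects it back onto the vocabulary (all names and sizes here are made up for the sketch):

```python
import numpy as np

# Minimal sketch of the autoregressive prediction step:
# P(w_t | w_1..w_{t-1}) = softmax(h_t W_e^T)
rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8

W_e = rng.normal(size=(vocab_size, d_model))  # tied token embedding matrix
h_t = rng.normal(size=(d_model,))             # transformer output at position t

logits = h_t @ W_e.T                          # project onto the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                          # softmax -> distribution over next token

next_token = int(np.argmax(probs))            # greedy decoding picks the mode
```

At generation time, the sampled (or argmax) token is appended to the context and the step repeats, which is what makes decoding autoregressive.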

Architecture and Training

Pre-training objective (causal language modeling over context windows of size k):
L₁ = -Σₜ log P(wₜ | wₜ₋ₖ, ..., wₜ₋₁; Θ)

Fine-tuning objective, summed over labeled examples (x, y):
L₂ = -Σ log P(y | x₁, ..., xₘ)

Combined objective with auxiliary language-modeling weight λ:
L = L₂ + λ · L₁
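A toy numerical sketch of the combined objective, using made-up probabilities for illustration; Radford et al. (2018) set the auxiliary weight λ to 0.5, which is adopted here:

```python
import numpy as np

# Toy sketch of GPT's fine-tuning objective L = L2 + lambda * L1.
# All probabilities below are invented for illustration.
lam = 0.5  # auxiliary LM weight (0.5 in Radford et al., 2018)

# probabilities the model assigns to the correct next tokens (for L1)
p_next = np.array([0.4, 0.6, 0.3, 0.7])
L1 = -np.log(p_next).sum()        # language-modeling loss

p_label = 0.8                     # probability assigned to the true label y
L2 = -np.log(p_label)             # supervised classification loss

L = L2 + lam * L1                 # combined fine-tuning loss
```

Keeping L₁ in the fine-tuning loss regularizes the model toward its pre-trained language-modeling behavior rather than letting the task head dominate.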

GPT-1: 12 layers, 768 hidden, 12 heads, 117M params
GPT-2: 48 layers, 1600 hidden, 25 heads, 1.5B params
GPT-3: 96 layers, 12288 hidden, 96 heads, 175B params
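The parameter counts above can be roughly reproduced from the configurations with the standard back-of-the-envelope estimate of about 12·layers·hidden² non-embedding weights per model. This sketch ignores embedding tables, which is why it undercounts most for the smallest model:

```python
# Rough non-embedding parameter estimate for a decoder-only transformer:
# each block has ~4h^2 attention weights (Q, K, V, output) and ~8h^2 MLP weights.
def approx_params(layers: int, hidden: int) -> int:
    return 12 * layers * hidden ** 2

configs = {"GPT-1": (12, 768), "GPT-2": (48, 1600), "GPT-3": (96, 12288)}
for name, (n_layers, d_hidden) in configs.items():
    print(f"{name}: ~{approx_params(n_layers, d_hidden) / 1e9:.2f}B non-embedding params")
```

For GPT-2 and GPT-3 the estimate lands within a few percent of the reported totals; GPT-1's 117M includes a token-embedding table that this approximation leaves out.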

GPT uses a stack of transformer decoder blocks with masked self-attention that prevents each position from attending to subsequent positions, enforcing the autoregressive property. During pre-training, the model maximizes the log-likelihood of the next token prediction across the entire training corpus. During fine-tuning, a linear output layer is added for the specific task, and the language modeling loss is included as an auxiliary objective to improve generalization and accelerate convergence. The input is formatted with special delimiter tokens to handle different task structures.
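The masked self-attention described above can be illustrated with a small numpy sketch, where random scores stand in for the real query-key dot products:

```python
import numpy as np

# Causal attention mask enforcing the autoregressive property:
# position t may attend only to positions <= t.
T = 5
scores = np.random.default_rng(1).normal(size=(T, T))  # stand-in for Q @ K.T
mask = np.triu(np.ones((T, T), dtype=bool), k=1)       # strictly-future positions
scores[mask] = -np.inf                                 # block attention to the future

# row-wise softmax; exp(-inf) = 0, so future positions get zero weight
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Setting masked scores to -∞ before the softmax (rather than zeroing weights afterward) keeps each row a proper probability distribution over the visible positions.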

Scaling Laws and Emergent Abilities

The GPT family revealed a remarkable empirical finding: language model performance improves predictably as a power law of model size, dataset size, and compute. Kaplan et al. (2020) formalized these scaling laws, showing that test loss follows L(N) ∝ N^(-0.076), where N is the number of parameters. GPT-3 exploited these scaling laws by training a 175-billion-parameter model on 300 billion tokens, achieving strong few-shot performance on a wide range of tasks through in-context learning — the ability to perform new tasks given only a natural language description and a few examples, without any gradient updates.
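As a worked example of this power law (the constant in front is arbitrary, so only loss ratios are meaningful), doubling the parameter count multiplies the loss by 2^(-0.076), roughly a 5% reduction:

```python
# Worked example of the Kaplan et al. (2020) power law L(N) ∝ N^(-0.076).
# The constant c is arbitrary here; only ratios between losses are meaningful.
ALPHA = 0.076

def loss(n_params: float, c: float = 1.0) -> float:
    return c * n_params ** (-ALPHA)

ratio = loss(2e9) / loss(1e9)        # effect of doubling model size
reduction_pct = (1 - ratio) * 100    # percent loss reduction per doubling
```

The small exponent is why each constant-factor improvement in loss demands an exponential increase in parameters, motivating the jump from 1.5B (GPT-2) to 175B (GPT-3).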

In-Context Learning

Perhaps GPT-3's most surprising capability was in-context learning: performing tasks by conditioning on a few input-output examples supplied in the prompt, without any parameter updates. This ability was not explicitly trained for but emerged with the scale of pre-training. In-context learning challenges the traditional pre-train/fine-tune paradigm by suggesting that sufficiently large language models can adapt to new tasks at inference time, and it has become the foundation of prompt engineering and the practical deployment of large language models.
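A minimal sketch of how such a few-shot prompt is assembled; the task and examples here are invented for illustration, and the model sees them purely as conditioning text:

```python
# Few-shot prompt construction for in-context learning: the "training"
# examples live entirely in the prompt; no gradients are ever computed.
examples = [("cheese", "fromage"), ("house", "maison"), ("cat", "chat")]
query = "dog"

prompt = "Translate English to French.\n"
prompt += "".join(f"{en} -> {fr}\n" for en, fr in examples)
prompt += f"{query} ->"  # the model is asked to continue from here
print(prompt)
```

Everything the model "learns" about the task is encoded in this string; swapping in different examples redefines the task at inference time with no change to the weights.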

Impact and Broader Significance

GPT-2 drew public attention when OpenAI initially withheld the full model, citing concerns about misuse for generating disinformation. This decision, while controversial, initiated an important conversation about the responsible release of powerful AI models. GPT-3 further demonstrated that scaling autoregressive language models produces increasingly capable systems, motivating the development of even larger models (PaLM, Chinchilla, LLaMA) and the commercial deployment of language models as general-purpose AI assistants.

The GPT series established the decoder-only transformer as the dominant architecture for large language models, demonstrating that the simplicity of next-token prediction as a training objective belies the sophistication of the representations it induces. The progression from GPT-1's need for task-specific fine-tuning to GPT-3's few-shot capabilities to instruction-tuned variants' ability to follow open-ended instructions illustrates how the combination of scale and simple objectives can produce increasingly general and capable AI systems.

References

  1. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Preprint.
  2. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
  3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
  4. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
