BERT, introduced by Devlin et al. (2019) at Google, marked a watershed moment in NLP: it demonstrated that deep bidirectional pre-training on unlabeled text produces representations that transfer effectively to a wide range of downstream tasks with minimal task-specific architecture. BERT uses a transformer encoder trained with two objectives: masked language modeling (MLM), which randomly masks tokens and trains the model to predict them from bidirectional context, and next sentence prediction (NSP), which trains the model to determine whether two sentences are consecutive. The resulting representations achieved state-of-the-art results on eleven NLP benchmarks simultaneously.
Architecture and Pre-Training
Masked Language Modeling:
L_MLM = -Σ_{i∈masked} log P(wᵢ | w_{unmasked})
Next Sentence Prediction:
L_NSP = -[y·log P(IsNext) + (1-y)·log P(NotNext)]
Total: L = L_MLM + L_NSP
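The two losses combine by simple addition. A minimal sketch of the arithmetic, using toy probabilities rather than real model outputs (the function names and values here are illustrative, not from the BERT codebase):

```python
import math

def mlm_loss(probs_at_masked):
    """Negative log-likelihood of the true token at each masked position.

    probs_at_masked: model probability assigned to the correct token at
    each masked position (toy values, not real model output).
    """
    return -sum(math.log(p) for p in probs_at_masked)

def nsp_loss(p_is_next, y):
    """Binary cross-entropy for next sentence prediction (y = 1 if IsNext)."""
    return -(y * math.log(p_is_next) + (1 - y) * math.log(1 - p_is_next))

# Toy example: three masked positions and a positive sentence pair.
l_mlm = mlm_loss([0.9, 0.6, 0.8])
l_nsp = nsp_loss(0.7, y=1)
total = l_mlm + l_nsp  # L = L_MLM + L_NSP
```

During pre-training both terms are computed on the same batch and backpropagated jointly.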
BERT-Base: 12 layers, 768 hidden, 12 heads, 110M params
BERT-Large: 24 layers, 1024 hidden, 16 heads, 340M params
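The 110M figure for BERT-Base can be checked by summing the parameter groups. A rough tally, assuming the released checkpoint's vocabulary size of 30,522, maximum position of 512, and two segment types (the grouping into terms below is my own bookkeeping, not from the paper):

```python
# Rough parameter count for BERT-Base (12 layers, hidden 768, 12 heads).
V, P, S, H, FF, L = 30_522, 512, 2, 768, 3_072, 12

embeddings = V * H + P * H + S * H + 2 * H     # token/position/segment + LayerNorm
attention  = 4 * (H * H + H) + 2 * H           # Q, K, V, output proj + LayerNorm
ffn        = H * FF + FF + FF * H + H + 2 * H  # two projections + LayerNorm
per_layer  = attention + ffn
pooler     = H * H + H                         # linear layer over [CLS]

total = embeddings + L * per_layer + pooler
print(f"{total:,}")  # 109,482,240 — commonly rounded to 110M
```

The embedding table dominates at small scale (about 24M of the total); the 12 encoder layers contribute roughly 7M parameters each.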
In masked language modeling, 15% of input tokens are selected for prediction. Of these, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged. This strategy prevents the model from simply learning to detect the [MASK] token. The model must learn rich bidirectional representations to predict the masked tokens from their full context. The input representation sums three embeddings: token embeddings, segment embeddings (indicating sentence A or B), and positional embeddings.
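The 15% selection and 80/10/10 corruption rule can be sketched in a few lines. This is a simplified illustration with a toy vocabulary, not the original implementation (which operates on WordPiece ids and fixes the number of predictions per sequence):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, rng, mask_rate=0.15):
    """Apply BERT-style 80/10/10 corruption to ~15% of tokens.

    Returns the corrupted sequence and the positions the model must predict.
    """
    corrupted, targets = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:      # select ~15% of positions
            targets.append(i)
            r = rng.random()
            if r < 0.8:                   # 80%: replace with [MASK]
                corrupted[i] = MASK
            elif r < 0.9:                 # 10%: replace with a random token
                corrupted[i] = rng.choice(VOCAB)
            # else 10%: leave the original token unchanged
    return corrupted, targets

rng = random.Random(0)
corrupted, targets = mask_tokens(["the", "cat", "sat", "on", "the", "mat"], rng)
```

Note that the loss is computed only at the selected positions, including the 10% left unchanged, which is what forces the model to maintain a meaningful representation of every input token.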
Fine-Tuning Paradigm
BERT's most significant contribution was establishing the pre-train/fine-tune paradigm for NLP. After pre-training on large unlabeled corpora (BooksCorpus and English Wikipedia, totaling 3.3 billion words), BERT is fine-tuned on task-specific labeled data by adding a simple task-specific output layer. For sentence classification, the [CLS] token representation is passed through a linear classifier. For token-level tasks like NER, each token's representation is classified independently. For question answering, start and end token positions are predicted. This approach requires minimal architectural modification across tasks.
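For sentence classification, the added output layer is just a linear map over the [CLS] vector followed by softmax. A minimal sketch with a toy 4-dimensional [CLS] vector (real BERT-Base uses 768 dimensions, and W and b are trained jointly with all encoder weights during fine-tuning; the values here are made up):

```python
import math

def classify_cls(h_cls, W, b):
    """Linear head over the [CLS] representation: softmax(W @ h_cls + b)."""
    logits = [sum(w_ij * h_j for w_ij, h_j in zip(row, h_cls)) + b_i
              for row, b_i in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # stabilized softmax
    total = sum(exps)
    return [e / total for e in exps]

# Toy [CLS] vector and a 2-class head with illustrative weights.
h_cls = [0.5, -1.0, 0.3, 0.8]
W = [[0.2, 0.1, -0.4, 0.3],
     [-0.1, 0.5, 0.2, -0.2]]
b = [0.0, 0.1]
probs = classify_cls(h_cls, W, b)
```

Token-level tasks such as NER apply the same kind of linear head to every token's output vector rather than only to [CLS].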
BERT's initial release improved the state of the art on the GLUE benchmark by 7.7 absolute points, on SQuAD 1.1 question answering by 1.5 F1 points (surpassing human performance), and on SQuAD 2.0 by 5.1 F1 points. These gains were unprecedented in scale and breadth. The model demonstrated that a single pre-trained architecture could excel across classification, entailment, question answering, and named entity recognition, validating the hypothesis that language understanding requires deep bidirectional context and that this context can be effectively learned from unlabeled text.
Limitations and Legacy
Despite its transformative impact, BERT has notable limitations. The [MASK] token used during pre-training never appears during fine-tuning, creating a pre-train/fine-tune mismatch. The independence assumption in predicting multiple masked tokens ignores correlations between them. The fixed input length of 512 tokens limits processing of longer documents. The next sentence prediction objective was later shown to be of limited value and was dropped by subsequent models like RoBERTa. Additionally, BERT's encoder-only architecture is not naturally suited to generation tasks.
BERT's legacy extends far beyond its direct performance improvements. It established the pre-train/fine-tune paradigm that became the standard approach in NLP, inspired a family of successor models (RoBERTa, ALBERT, ELECTRA, DeBERTa), and catalyzed the development of multilingual models (mBERT, XLM) that brought pre-training benefits to over 100 languages. BERT also democratized access to powerful NLP through the Hugging Face Transformers library, which made it straightforward for practitioners to fine-tune pre-trained models for specific applications.