Masked language modeling (MLM) is the central pre-training objective of BERT and its descendants. In MLM, a fraction of input tokens (typically 15%) are selected at random, and the model is trained to predict these tokens given the remaining unmasked context. Because the masked positions can attend to both left and right context through the transformer's bidirectional self-attention, MLM enables the learning of deep bidirectional representations — a key advantage over left-to-right autoregressive language models. MLM can be viewed as a form of denoising autoencoding: the input is corrupted by masking, and the model learns to reconstruct the original.
Masking Strategy
- Replace with [MASK] token: 80% of the time
- Replace with random token: 10% of the time
- Keep original token: 10% of the time
Loss: L_MLM = -(1/|M|) Σ_{i∈M} log P_Θ(xᵢ | x̃), where M is the set of masked positions and x̃ is the corrupted input sequence (with [MASK], random, or kept tokens at the positions in M)
Only masked positions contribute to the loss
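As a concrete check of the formula, here is a toy computation with a hypothetical 4-token vocabulary and invented model probabilities; only the masked positions enter the average:

```python
import math

# Hypothetical model probabilities over a toy 4-token vocabulary,
# one row per input position. Positions 1 and 3 are masked: M = {1, 3}.
probs = [
    [0.70, 0.10, 0.10, 0.10],  # position 0: unmasked, excluded from loss
    [0.10, 0.60, 0.20, 0.10],  # position 1: masked, original token id = 1
    [0.25, 0.25, 0.25, 0.25],  # position 2: unmasked, excluded from loss
    [0.05, 0.05, 0.80, 0.10],  # position 3: masked, original token id = 2
]
targets = {1: 1, 3: 2}  # masked position -> original token id

# L_MLM = -(1/|M|) * sum over i in M of log P(original token at i)
loss = -sum(math.log(probs[i][tok]) for i, tok in targets.items()) / len(targets)
print(round(loss, 4))  # -(ln 0.6 + ln 0.8) / 2 ≈ 0.367
```

The unmasked rows are present only to emphasize that the model produces a distribution at every position but is penalized at the masked ones alone.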
The 80-10-10 masking strategy addresses the pre-train/fine-tune discrepancy: since the [MASK] token never appears during fine-tuning, always using [MASK] during pre-training would teach the model to rely on a signal that is absent at inference time. By sometimes using random tokens or keeping the original, the model cannot simply learn to detect [MASK] tokens but must build robust representations of all positions. The 15% masking rate balances two considerations: too little masking provides insufficient training signal per sequence, while too much masking removes so much context that prediction becomes unreliable.
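The selection-then-corruption procedure can be sketched as follows; the vocabulary range and the [MASK] id are placeholder assumptions, not taken from any particular tokenizer:

```python
import random

MASK_ID = 103                      # placeholder id for the [MASK] token
VOCAB_IDS = list(range(5, 30000))  # placeholder vocabulary (ids 0-4 reserved)

def mask_tokens(token_ids, mask_prob=0.15, rng=None):
    """BERT-style MLM corruption sketch: select ~mask_prob of positions,
    then apply the 80/10/10 rule. Returns (corrupted_ids, labels), with
    labels set to -100 at positions that do not contribute to the loss."""
    rng = rng or random.Random()
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                       # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = MASK_ID            # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = rng.choice(VOCAB_IDS)  # 10%: random token
            # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

Positions labeled -100 are simply skipped when averaging the loss, a convention common in deep-learning frameworks.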
Variants and Extensions
Several variants of MLM have been proposed to address its limitations. Whole-word masking masks all subword tokens of a word simultaneously, so the model cannot trivially recover a masked piece from the word's remaining subwords. SpanBERT (Joshi et al., 2020) masks contiguous spans rather than individual tokens, which better captures phrase-level information and improves performance on span-selection tasks such as question answering. ERNIE masks named entities and phrases as units, encouraging the model to learn about semantic concepts. T5's span-corruption objective extends MLM further, replacing multi-token spans with sentinel tokens and training the model to generate the missing spans.
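A SpanBERT-style span sampler can be sketched as below; the geometric parameter p=0.2 and the span cap of 10 follow the paper's description, while the function name and the simple placement loop are my own illustration:

```python
import random

def sample_masked_spans(seq_len, mask_budget=0.15, p=0.2, max_span=10, rng=None):
    """Sketch of SpanBERT-style span selection: span lengths are drawn
    from a geometric distribution (truncated at max_span) and spans are
    placed at random starts until ~mask_budget of tokens are covered."""
    rng = rng or random.Random()
    target = max(1, int(seq_len * mask_budget))
    masked = set()
    while len(masked) < target:
        length = 1
        while rng.random() > p and length < max_span:
            length += 1                            # Geometric(p) span length
        start = rng.randrange(0, seq_len - length + 1)
        masked.update(range(start, start + length))
    return masked
```

Because spans may overlap, slightly more than the budget can end up masked; production implementations typically also respect word boundaries, which this sketch omits.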
A notable limitation of standard MLM is the assumption that masked tokens are conditionally independent given the unmasked context. If multiple tokens are masked in a phrase like "New [MASK] [MASK]," BERT predicts each masked position independently, so it can mix completions that are individually likely but jointly inconsistent, drawing one position from "New York City" and the other from "New Delhi India." XLNet's permutation language modeling addresses this by retaining an autoregressive factorization, so each predicted token can condition on previously predicted ones, and later work has explored other objectives that model dependencies among masked positions.
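The failure mode can be illustrated with a toy joint distribution over the two masked slots (the probabilities are invented for illustration): taking each slot's marginal argmax independently selects a pair that is not the most likely joint completion.

```python
# Hypothetical joint distribution over two masked slots in "New [MASK] [MASK]".
joint = {
    ("York",  "City"):  0.40,
    ("Delhi", "India"): 0.35,
    ("York",  "India"): 0.25,
}

# Marginal distribution at each slot.
m1, m2 = {}, {}
for (w1, w2), prob in joint.items():
    m1[w1] = m1.get(w1, 0.0) + prob
    m2[w2] = m2.get(w2, 0.0) + prob

# Independent per-position prediction, as in standard MLM decoding.
independent_pick = (max(m1, key=m1.get), max(m2, key=m2.get))
# The actual mode of the joint distribution.
joint_mode = max(joint, key=joint.get)

print(independent_pick)  # ('York', 'India'): marginals favor York (0.65) and India (0.60)
print(joint_mode)        # ('York', 'City'): the most likely pair (0.40)
```

Independent argmax produces "New York India," a pair with only 0.25 joint probability, while the true mode "New York City" has 0.40.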
Theoretical Perspectives
From an information-theoretic perspective, MLM trains the model to minimize the cross-entropy between its predictions and the true conditional distribution of tokens given context. The objective can be viewed as maximizing a lower bound on the mutual information between the masked tokens and the remaining context, encouraging representations that capture as much information as possible about each token's identity. The connection between MLM and denoising autoencoders provides theoretical grounding: Vincent (2011) showed that denoising autoencoders implicitly estimate the score function of the data distribution, suggesting that MLM-trained models implicitly learn the structure of the text distribution.
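The cross-entropy claim can be made precise with the standard decomposition (written here in generic notation consistent with the loss above, not taken from the source):

```latex
-\,\mathbb{E}_{x \sim p}\!\left[\log P_\Theta\!\left(x_i \mid \tilde{x}\right)\right]
  \;=\; H\!\left(x_i \mid \tilde{x}\right)
  \;+\; \mathbb{E}_{\tilde{x}}\!\left[\,\mathrm{KL}\!\left(p(\,\cdot \mid \tilde{x}) \,\big\|\, P_\Theta(\,\cdot \mid \tilde{x})\right)\right]
```

where x̃ is the corrupted input and p the true data distribution. The conditional entropy term is a property of the data alone, so minimizing the MLM loss is equivalent to driving the KL term toward zero, i.e., matching the model's conditional distribution to the true one.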
MLM has become the most widely adopted pre-training objective for encoder-based models. Its success across diverse languages, domains, and tasks demonstrates that predicting missing tokens from context is a powerful inductive bias for learning linguistic representations. The objective forces the model to develop rich syntactic and semantic understanding: to predict a masked verb, the model must understand subject-verb agreement; to predict a masked noun, it must grasp selectional restrictions and world knowledge. This linguistically rich training signal, combined with the transformer's capacity, explains why MLM pre-training produces such effective general-purpose representations.