The Minimum Description Length (MDL) principle, developed by Jorma Rissanen beginning in 1978, provides an information-theoretic approach to model selection and statistical inference. The core idea is elegantly simple: the best model for a body of data is the one that minimizes the total description length — the sum of the length of the model description and the length of the data encoded using that model. MDL formalizes Occam's razor by quantifying the trade-off between model complexity and data fit. In computational linguistics, MDL has been applied to grammar induction, morphological segmentation, word segmentation, and phonological rule learning.
The Two-Part Code
In its simplest formulation, MDL uses a two-part code. Given data D and a hypothesis (model) H from some class, the total description length is L(H) + L(D|H), where L(H) is the length in bits of encoding the model itself, and L(D|H) is the length of encoding the data using the model. A simple model (short L(H)) may encode the data poorly (long L(D|H)); a complex model may fit the data perfectly but require many bits to describe. MDL selects the model that minimizes the sum, automatically balancing complexity and fit.
L(H) = bits to encode the model/grammar
L(D|H) = bits to encode data given the model
Best model: H* = argmin_H [L(H) + L(D|H)]
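The two-part objective above can be sketched numerically. The toy below compares two candidate models for a character string — a uniform code over the alphabet versus a unigram model whose symbol frequencies must themselves be transmitted — and picks the one with the smaller total code length. The coding scheme (a fixed 8 bits per transmitted count) is an illustrative simplification, not a prescribed MDL code:

```python
import math
from collections import Counter

def code_length_uniform(data, alphabet_size):
    # L(H) = 0 bits: the uniform model has no parameters to transmit.
    # L(D|H): every symbol costs log2(alphabet_size) bits.
    return len(data) * math.log2(alphabet_size)

def code_length_unigram(data, alphabet_size, bits_per_count=8):
    # L(H): transmit one frequency count per alphabet symbol at a fixed
    # precision (8 bits each here -- a crude but common simplification).
    counts = Counter(data)
    model_bits = alphabet_size * bits_per_count
    # L(D|H): each symbol costs -log2 of its empirical probability.
    n = len(data)
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    return model_bits + data_bits

data = "aaaaaaabaaaaaaacaaaaaaab" * 20   # heavily skewed toward 'a'
alphabet = 26
candidates = {
    "uniform": code_length_uniform(data, alphabet),
    "unigram": code_length_unigram(data, alphabet),
}
best = min(candidates, key=candidates.get)
```

On this skewed string the unigram model wins: its extra L(H) cost (208 bits of counts) is repaid many times over by the shorter data code. On a short or near-uniform string the uniform model would win instead, which is exactly the complexity/fit trade-off MDL formalizes.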
Applications in Computational Linguistics
MDL has been particularly influential in unsupervised learning of linguistic structure. In morphological segmentation, the Morfessor system by Creutz and Lagus uses MDL to find the segmentation of words into morphemes that minimizes the combined description of the morpheme inventory and the corpus encoded using that inventory. Short, frequent morphemes lead to a compact lexicon but longer data descriptions; long, infrequent morphemes produce a large lexicon but short data codes. MDL finds the optimal balance.
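The lexicon-versus-corpus trade-off can be made concrete with a toy scorer. The function below computes a two-part code length for a corpus under a candidate segmentation; it is a sketch of the MDL objective only, with a made-up character cost of 5 bits per morph letter — Morfessor's actual coding scheme and search procedure are more elaborate:

```python
import math
from collections import Counter

def description_length(corpus, segmentation):
    """Two-part code length, in bits, of a corpus under a candidate
    segmentation (a dict mapping each word to a tuple of morphs)."""
    # Morph token stream induced by segmenting every word occurrence.
    tokens = [m for w in corpus for m in segmentation[w]]
    counts = Counter(tokens)
    n = len(tokens)
    # L(H): spell out each distinct morph at ~5 bits per character,
    # plus one end-of-morph marker (an illustrative character code).
    lexicon_bits = sum(5 * (len(m) + 1) for m in counts)
    # L(D|H): encode the token stream with morph unigram probabilities.
    data_bits = -sum(c * math.log2(c / n) for c in counts.values())
    return lexicon_bits + data_bits

stems = ["walk", "talk", "jump", "look"]
suffixes = ["ed", "ing", "s"]
words = [s + suf for s in stems for suf in suffixes]
corpus = words * 5

whole_words = {w: (w,) for w in words}          # 12-entry lexicon
morph_split = {s + suf: (s, suf)                # 7-entry lexicon
               for s in stems for suf in suffixes}
```

Here the morph segmentation pays a slightly longer data code (two tokens per word instead of one) but saves far more in the lexicon by reusing four stems and three suffixes across twelve word types, so its total description length comes out lower — the balance Morfessor searches for at corpus scale.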
In grammar induction, MDL evaluates candidate grammars by how well they compress a corpus. A grammar with many specific rules may encode the corpus in few bits per sentence but costs many bits to describe; a grammar with few general rules is cheap to describe but encodes each sentence less compactly. Grünwald (2007) showed that MDL is closely related to Bayesian model selection with universal priors, giving this practical approach a deep theoretical foundation. Stolcke's 1994 Bayesian grammar induction work and de Marcken's 1996 MDL-based unsupervised language acquisition are landmark applications in NLP.
Refined MDL and Normalized Maximum Likelihood
Rissanen's refined MDL theory goes beyond the two-part code by using the normalized maximum likelihood (NML) distribution, which achieves minimax optimal coding without requiring a prior distribution. The stochastic complexity of the data relative to a model class is its code length under the NML distribution: the negative maximized log-likelihood plus the logarithm of the NML normalizing constant. That logarithm, the parametric complexity, measures the complexity of the model class by counting the data patterns it can distinguish. This refined formulation has deeper theoretical properties than the two-part code and avoids the dependence on arbitrary coding choices for the model.
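For small model classes the NML quantities can be computed exactly. The sketch below does so for the Bernoulli class: the normalizing constant sums the maximized likelihood over all binary sequences of length n (grouped by their count of ones), and the stochastic complexity of an observed sequence adds its own negative maximized log-likelihood. This is a direct transcription of the definitions, feasible only because the Bernoulli sufficient statistic is a single count:

```python
import math

def bernoulli_parametric_complexity(n):
    # log2 of the NML normalizing constant for the Bernoulli model
    # class with n observations: sum the maximized likelihood over
    # all 2^n sequences, grouped by k, the number of ones.
    total = 0.0
    for k in range(n + 1):
        p = k / n
        lik = (p ** k) * ((1 - p) ** (n - k)) if 0 < k < n else 1.0
        total += math.comb(n, k) * lik
    return math.log2(total)

def stochastic_complexity(data):
    # NML code length of a binary sequence: negative maximized
    # log-likelihood plus the parametric complexity.
    n, k = len(data), sum(data)
    p = k / n
    if 0 < k < n:
        max_loglik = k * math.log2(p) + (n - k) * math.log2(1 - p)
    else:
        max_loglik = 0.0   # all-zeros or all-ones: likelihood is 1
    return -max_loglik + bernoulli_parametric_complexity(n)
```

The parametric complexity grows only logarithmically in n (roughly (1/2) log2 n for this one-parameter class), so for an all-ones sequence — whose maximized likelihood is 1 — the stochastic complexity is just that small constant: the class can fit the data, and is barely penalized for its flexibility.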
MDL remains an active area of research in computational linguistics. Its applications extend to unsupervised word segmentation for languages without whitespace (such as Chinese and Japanese), phonological rule discovery, semantic category induction, and evaluation of language models. The principle's appeal lies in its generality — it applies to any domain where models must be selected from data — and its principled balance between simplicity and fit, which aligns naturally with the scientific goal of discovering the simplest adequate description of linguistic structure.