Kolmogorov complexity, independently developed by Ray Solomonoff (1964), Andrey Kolmogorov (1965), and Gregory Chaitin (1966), defines the complexity of a string as the length of the shortest program that produces it on a universal Turing machine. Unlike Shannon entropy, which measures the average information in a random source, Kolmogorov complexity measures the information content of an individual object. This makes it the ultimate measure of compressibility: a string is simple if it can be generated by a short program, and random if no program significantly shorter than the string itself can produce it.
Formal Definition
Let U be a fixed universal Turing machine. The Kolmogorov complexity of a string x, denoted K(x), is the length of the shortest binary program p such that U(p) = x. By the invariance theorem, the choice of universal machine affects K(x) by at most an additive constant — so Kolmogorov complexity is essentially machine-independent. The conditional Kolmogorov complexity K(x|y) measures the length of the shortest program that produces x given y as auxiliary input.
Key properties:
- K(x) is not computable: no algorithm can compute K(x) for all x.
- Most strings of length n have K(x) ≥ n − c; that is, most strings are incompressible.
- K(x,y) ≤ K(x) + K(y|x) + O(log n) (the chain rule).
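The incompressibility property above follows from a short counting argument: there are simply not enough short programs to describe all strings. Written out,

```latex
% Programs of length < n - c are binary strings, so there are fewer than 2^{n-c} of them:
\[
\#\{p : |p| < n - c\} \;=\; \sum_{i=0}^{n-c-1} 2^{i} \;=\; 2^{\,n-c} - 1 \;<\; 2^{\,n-c}.
\]
% Each program outputs at most one string, so the fraction of strings of
% length n that are compressible by more than c bits is bounded:
\[
\frac{\#\{x \in \{0,1\}^{n} : K(x) < n - c\}}{2^{n}}
\;<\; \frac{2^{\,n-c}}{2^{n}} \;=\; 2^{-c}.
\]
```

For example, fewer than one string in a thousand can be compressed by more than 10 bits (2⁻¹⁰ < 1/1000).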
Incomputability and Approximation
A fundamental property of Kolmogorov complexity is that it is not computable: there is no algorithm that, given any string x, outputs K(x). This is proved by a diagonal argument related to Berry's paradox. However, K(x) can be approximated from above by actual compression algorithms — the length of the compressed output of gzip, bzip2, or any other lossless compressor provides an upper bound on K(x). This approximation is the practical basis for compression-based approaches to clustering, classification, and similarity measurement in NLP.
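The upper-bound approach described above can be sketched in a few lines. This is a minimal illustration, not a definitive tool; the function name `k_upper_bound` is hypothetical, and zlib stands in for any lossless compressor:

```python
import os
import zlib

def k_upper_bound(s: bytes, level: int = 9) -> int:
    """Compressed length of s under zlib: an upper bound
    (up to a constant overhead) on the Kolmogorov complexity K(s)."""
    return len(zlib.compress(s, level))

# A highly regular string: it can be generated by a very short program,
# and the compressor captures most of that regularity.
structured = b"ab" * 5000

# Random bytes: no regularities for the compressor to exploit, so the
# "compressed" output is about as long as the input (or slightly longer).
incompressible = os.urandom(10000)

print(k_upper_bound(structured))      # far below 10000
print(k_upper_bound(incompressible))  # close to 10000
```

Note that the bound is one-sided: a large compressed size never proves that K(s) is large, since the true shortest program may use regularities the compressor cannot find.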
The Normalized Compression Distance (NCD), defined as NCD(x,y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed length of s under a real-world compressor and xy is the concatenation of x and y, approximates the normalized information distance based on Kolmogorov complexity. NCD has been applied to language identification, authorship attribution, plagiarism detection, and phylogenetic reconstruction of language families. Despite its simplicity — requiring no feature engineering or domain knowledge — NCD often achieves competitive performance on text classification tasks.
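The NCD formula translates directly into code. A minimal sketch using zlib as the compressor C (any lossless compressor could be substituted):

```python
import os
import zlib

def C(data: bytes) -> int:
    """Compressed length under zlib, standing in for ideal complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: near 0 for very similar inputs,
    near 1 (or slightly above, due to compressor overhead) for unrelated ones."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

english = b"the quick brown fox jumps over the lazy dog " * 50
noise = os.urandom(2000)

print(ncd(english, english))  # small: x tells the compressor everything about y
print(ncd(english, noise))    # near 1: compressing x first does not help with y
```

The intuition: C(xy) − min(C(x), C(y)) measures how much extra description the second string needs once the first is known; dividing by max(C(x), C(y)) normalizes the distance into roughly [0, 1].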
Connections to Linguistics and NLP
Kolmogorov complexity provides the theoretical underpinning for the Minimum Description Length principle: MDL can be viewed as a computable approximation to model selection based on Kolmogorov complexity. In this view, the best grammar for a language is the one that achieves maximum compression of the corpus — that is, the grammar that captures the most regularities. Solomonoff's theory of inductive inference, based on a prior distribution weighted by program length, is the Bayesian counterpart of Kolmogorov complexity and has been described as the theoretically optimal approach to sequence prediction.
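The two-part MDL criterion sketched above — model cost plus data cost given the model — can be illustrated with a toy character-level example. All function names are hypothetical, and the flat 8-bit charge per alphabet symbol is an arbitrary simplifying assumption:

```python
import math
from collections import Counter

def dl_uniform(corpus: str) -> float:
    """Description length under a trivial model: a fixed-width code
    over the alphabet, ignoring symbol frequencies."""
    alphabet = set(corpus)
    model_cost = 8 * len(alphabet)                        # toy: 8 bits to list each symbol
    data_cost = len(corpus) * math.log2(len(alphabet))    # fixed-width code per symbol
    return model_cost + data_cost

def dl_empirical(corpus: str) -> float:
    """Two-part code: pay for the frequency parameters, then
    Shannon-code the data under the empirical distribution."""
    counts = Counter(corpus)
    n = len(corpus)
    model_cost = 8 * len(counts)                          # toy: list the symbols
    param_cost = 0.5 * (len(counts) - 1) * math.log2(n)   # (k/2) log n parameter cost
    data_cost = -sum(c * math.log2(c / n) for c in counts.values())
    return model_cost + param_cost + data_cost

skewed = "a" * 900 + "b" * 100    # strong regularity: frequencies worth modeling
balanced = "ab" * 500             # no skew: the extra parameters buy nothing
```

On the skewed corpus the richer model wins (its data cost drops far more than its parameter cost rises); on the balanced corpus the trivial model wins, because the frequency parameters cost bits without shortening the data code. Selecting whichever grammar minimizes the total is the MDL analogue of minimizing Kolmogorov complexity.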
In computational linguistics, Kolmogorov complexity also connects to the analysis of language complexity and typological diversity. Languages with more regular morphology are, in a Kolmogorov sense, simpler — their grammars can be described by shorter programs. The concept has been applied to compare the complexity of writing systems, to analyze the information content of linguistic utterances, and to study the trade-off between lexicon size and grammatical complexity across languages.