Distributional semantics is founded on the distributional hypothesis, articulated by Zellig Harris (1954) and memorably summarized by J. R. Firth (1957): "You shall know a word by the company it keeps." This approach represents word meanings as vectors in a high-dimensional space, where each dimension corresponds to a context feature (a co-occurring word, a document, or a syntactic relation). Words that appear in similar contexts receive similar vector representations, capturing semantic relatedness without recourse to hand-crafted definitions or logical formalisms.
Vector Space Models
Cosine similarity:
cos(v_i, v_j) = (v_i · v_j) / (||v_i|| · ||v_j||)
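The cosine formula above translates directly into NumPy; a minimal sketch (the helper name is ours):

```python
import numpy as np

def cosine_similarity(v_i: np.ndarray, v_j: np.ndarray) -> float:
    """Cosine of the angle between two word vectors: dot product
    divided by the product of the vector norms."""
    return float(np.dot(v_i, v_j) / (np.linalg.norm(v_i) * np.linalg.norm(v_j)))

# Vectors pointing in the same direction score 1.0; orthogonal vectors score 0.0.
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # same direction as a
c = np.array([0.0, 0.0, 3.0])   # orthogonal to a
```

Because cosine normalizes by vector length, it ignores differences in raw frequency magnitude and compares only the direction of the two vectors.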
PMI weighting:
PMI(w, c) = log₂[P(w, c) / (P(w) · P(c))]
PPMI(w, c) = max(0, PMI(w, c))
The simplest distributional model builds a word-context co-occurrence matrix from a corpus, where rows represent target words and columns represent context words (or documents). Raw co-occurrence counts are typically transformed using pointwise mutual information (PMI) or its positive variant (PPMI) to discount the effect of frequency. Dimensionality reduction techniques such as singular value decomposition (SVD), applied to these matrices, yield dense low-dimensional word vectors that capture latent semantic structure. This approach, known as Latent Semantic Analysis (LSA), was among the first successful distributional models.
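The count-transform-reduce pipeline just described can be sketched in a few lines of NumPy; this is an illustrative toy (the function names and the tiny count matrix are ours), not a production LSA implementation:

```python
import numpy as np

def ppmi_matrix(counts: np.ndarray) -> np.ndarray:
    """PPMI transform of a word-context count matrix (rows = target words)."""
    total = counts.sum()
    p_wc = counts / total                      # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)      # marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)      # marginals P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf; zero them out
    return np.maximum(pmi, 0.0)                # PPMI = max(0, PMI)

def lsa_vectors(counts: np.ndarray, k: int) -> np.ndarray:
    """Dense k-dimensional word vectors via truncated SVD of the PPMI matrix."""
    u, s, _ = np.linalg.svd(ppmi_matrix(counts), full_matrices=False)
    return u[:, :k] * s[:k]                    # scale left singular vectors by singular values

# Toy 4-word x 3-context count matrix: rows 0-1 and rows 2-3 share contexts.
counts = np.array([[4., 1., 0.],
                   [3., 2., 0.],
                   [0., 1., 5.],
                   [0., 2., 4.]])
vecs = lsa_vectors(counts, k=2)                # shape (4, 2)
```

Words that share contexts (rows 0 and 1, rows 2 and 3) end up near each other in the reduced space, which is exactly the latent structure LSA exploits.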
From Counting to Prediction
A major shift in distributional semantics came with the introduction of predictive models that learn word vectors by training neural networks to predict context words from target words (Skip-gram) or vice versa (CBOW). Mikolov et al. (2013) showed that these Word2Vec embeddings capture remarkable semantic regularities, including analogical relations like "king - man + woman ≈ queen." Levy and Goldberg (2014) demonstrated that these neural models implicitly factorize a shifted PMI matrix, establishing a deep connection between count-based and prediction-based distributional approaches.
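The analogy arithmetic Mikolov et al. describe reduces to a nearest-neighbor search around the offset vector b - a + c. A minimal sketch with a hand-built two-dimensional embedding (the dictionary and function are ours; real Word2Vec vectors have hundreds of dimensions):

```python
import numpy as np

def analogy(emb: dict, a: str, b: str, c: str) -> str:
    """Return the word whose vector is closest (by cosine) to
    emb[b] - emb[a] + emb[c], excluding the three query words."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = float(np.dot(target, v / np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Tiny hand-built space: dimension 0 encodes royalty, dimension 1 encodes gender.
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}
```

Here `analogy(emb, "man", "king", "woman")` computes king - man + woman = [1, -1] and returns "queen", mirroring the famous regularity.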
Distributional models are evaluated on word similarity benchmarks (SimLex-999, WordSim-353), analogy tasks (Google analogy dataset), and downstream NLP tasks. Intrinsic evaluations test whether vector similarity correlates with human similarity judgments. Extrinsic evaluations measure improvement on tasks like sentiment analysis, parsing, or machine translation when distributional representations are used as features. The relationship between intrinsic and extrinsic performance is not always straightforward, leading to ongoing debate about the best evaluation methodology.
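Intrinsic evaluation on benchmarks like SimLex-999 typically reports the Spearman rank correlation between model similarities and human ratings. A self-contained sketch (the rank-correlation helper assumes no tied scores, and the four score pairs are invented for illustration):

```python
import numpy as np

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman rank correlation (no tie correction): Pearson correlation
    computed on the ranks of the two score vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.dot(rx, ry) / (np.linalg.norm(rx) * np.linalg.norm(ry)))

# Hypothetical word pairs: model cosine similarities vs. human ratings (0-10).
model_sims    = np.array([0.92, 0.35, 0.70, 0.10])
human_ratings = np.array([9.1, 3.0, 7.5, 1.2])
rho = spearman(model_sims, human_ratings)   # rankings agree exactly -> 1.0
```

Rank correlation is preferred over Pearson here because only the ordering of similarities needs to match the human judgments, not their absolute scale.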
Contextualized and Compositional Extensions
A limitation of static distributional models is that each word receives a single vector regardless of context, conflating different senses. Contextualized word embeddings from models like ELMo and BERT address this by producing different representations for each token occurrence, conditioned on its sentential context. These models can be seen as the culmination of the distributional hypothesis: meaning is determined not just by typical contexts but by the specific context of use.
Composing distributional word vectors into phrase and sentence meanings remains an active research area. Additive and multiplicative models provide simple baselines, while more sophisticated approaches use syntactic structure to guide composition. The tension between distributional and formal-semantic approaches to compositionality has been productive, spawning hybrid frameworks that attempt to combine the empirical coverage of distributional models with the compositional precision of formal semantics.
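The additive and multiplicative baselines mentioned above are one-liners over word vectors; a minimal sketch with invented three-dimensional vectors:

```python
import numpy as np

def compose_additive(vectors: list) -> np.ndarray:
    """Additive baseline: phrase vector = element-wise sum of word vectors."""
    return np.sum(vectors, axis=0)

def compose_multiplicative(vectors: list) -> np.ndarray:
    """Multiplicative baseline: element-wise product of word vectors,
    which emphasizes dimensions active in every word of the phrase."""
    return np.prod(vectors, axis=0)

# Toy vectors for the phrase "red car".
red = np.array([0.9, 0.1, 0.4])
car = np.array([0.2, 0.8, 0.5])
red_car_add  = compose_additive([red, car])        # [1.1, 0.9, 0.9]
red_car_mult = compose_multiplicative([red, car])  # [0.18, 0.08, 0.2]
```

Note that both baselines are order-insensitive ("dog bites man" and "man bites dog" compose identically), which is one motivation for the syntax-guided approaches the paragraph above mentions.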