Word2Vec, introduced by Tomas Mikolov and colleagues at Google in 2013, revolutionized distributional semantics by demonstrating that simple neural network architectures trained on large corpora produce word embeddings with remarkable algebraic properties. The model comes in two variants: Continuous Bag-of-Words (CBOW), which predicts a target word from its context, and Skip-gram, which predicts context words from a target word. Both produce dense, low-dimensional vectors that capture semantic relationships through vector arithmetic.
Skip-gram and CBOW Architectures
Softmax: P(w_O | w_I) = exp(v'_{w_O}^T v_{w_I}) / Σ_{w=1}^{W} exp(v'_w^T v_{w_I})
Negative sampling approximation:
log σ(v'_{w_O}^T v_{w_I}) + Σ_{i=1}^{k} E_{w_i ~ P_n(w)} [log σ(−v'_{w_i}^T v_{w_I})]
The Skip-gram model uses a shallow neural network with one hidden layer. Given a target word, it maximizes the probability of observing nearby context words within a window of size c. Computing the full softmax over the entire vocabulary is expensive, so two efficient approximations are used: hierarchical softmax, which uses a binary tree over the vocabulary, and negative sampling, which approximates the softmax by contrasting the observed target–context pair against randomly sampled negative pairs. Mikolov et al. report that 5–20 negative samples per positive example work well for small training sets, while 2–5 suffice for large corpora.
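The negative-sampling objective above can be sketched numerically. This is a minimal illustration with randomly initialized toy vectors, not a training loop; the function names and dimensions are chosen for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_in, v_out_pos, v_out_negs):
    """Skip-gram negative-sampling objective for one (target, context) pair.

    v_in       : input vector of the target word w_I
    v_out_pos  : output vector of the observed context word w_O
    v_out_negs : output vectors of k words drawn from the noise
                 distribution P_n(w)
    Returns the negated log-likelihood, to be minimized by SGD.
    """
    pos = np.log(sigmoid(v_out_pos @ v_in))          # observed pair scored high
    neg = np.sum(np.log(sigmoid(-(v_out_negs @ v_in))))  # noise pairs scored low
    return -(pos + neg)

rng = np.random.default_rng(0)
d, k = 50, 5  # embedding dimension, number of negative samples
loss = neg_sampling_loss(rng.normal(size=d) * 0.1,
                         rng.normal(size=d) * 0.1,
                         rng.normal(size=(k, d)) * 0.1)
```

Each gradient step touches only k + 1 output vectors instead of all W of them, which is the entire source of the speedup over the full softmax.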
Algebraic Properties
The most celebrated property of Word2Vec embeddings is their capacity to capture semantic relationships through vector arithmetic. The relation "king - man + woman ≈ queen" demonstrates that the vector offset between "man" and "woman" encodes a gender relation that transfers across word pairs. Similar regularities hold for syntactic relations (e.g., "walking - walk + swim ≈ swimming") and other semantic relations (country-capital, adjective-comparative). Levy and Goldberg (2014) showed that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix, connecting the neural approach to traditional count-based distributional semantics.
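The analogy mechanics can be demonstrated with hand-built toy vectors. These are not trained embeddings: the two axes are constructed so that the royalty and gender offsets are present by design, purely to show how the arithmetic and nearest-neighbor lookup work.

```python
import numpy as np

# Hand-built toy embeddings (NOT trained Word2Vec vectors): axis 0 is a
# "royalty" direction, axis 1 a "gender" direction, so the offset
# king - man matches queen - woman by construction.
emb = {
    "king":  np.array([0.9, 0.9]),
    "queen": np.array([0.9, 0.1]),
    "man":   np.array([0.1, 0.9]),
    "woman": np.array([0.1, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["king"] - emb["man"] + emb["woman"]
# Standard analogy evaluation excludes the three query words themselves.
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, emb[w]))
# best == "queen"
```

In real evaluations the nearest neighbor is found over the full vocabulary by cosine similarity, and excluding the query words is essential, since the query vector is usually closest to "king" itself.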
Word2Vec's performance depends critically on hyperparameters. Larger context windows capture broader topical similarity, while smaller windows emphasize syntactic and functional similarity. Subsampling of frequent words (discarding occurrences of very common words with a probability that increases with their corpus frequency) improves both training speed and the quality of representations for rare words. The dimensionality of the embedding space (typically 100–300) trades off between expressiveness and generalization.
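The subsampling heuristic, as given in Mikolov et al. (2013), discards each occurrence of word w with probability P(w) = 1 − √(t / f(w)), where f(w) is the word's relative frequency and t is a threshold (around 10⁻⁵). A small sketch, assuming that formula (the released word2vec code uses a slightly different variant):

```python
import math

def discard_prob(freq, t=1e-5):
    """Probability of dropping one occurrence of a word with relative
    corpus frequency `freq`: P = 1 - sqrt(t / freq), clamped at 0 so
    words rarer than the threshold t are always kept."""
    return max(0.0, 1.0 - math.sqrt(t / freq))

p_common = discard_prob(0.05)  # a stopword-like word, ~5% of all tokens
p_rare = discard_prob(1e-6)    # a rare word below the threshold
# p_common is close to 1 (almost always dropped); p_rare is 0 (always kept)
```

Because frequent words are aggressively thinned out, the effective context windows around rare words contain more informative neighbors, which is why subsampling helps their representations as well as training speed.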
Impact and Legacy
Word2Vec's impact on NLP was transformative. Pre-trained word embeddings became a standard feature in virtually every NLP system, replacing sparse one-hot or bag-of-words representations. The model demonstrated that unsupervised learning on raw text could capture substantial linguistic knowledge, presaging the pre-training revolution that would later produce ELMo, BERT, and GPT. Word2Vec also stimulated research on bias in embeddings, as Bolukbasi et al. (2016) showed that word vectors encode societal stereotypes present in training corpora.
Numerous extensions and successors followed Word2Vec. FastText extended the model with subword (character n-gram) information, GloVe offered an alternative built on global co-occurrence statistics rather than local context prediction, and various retrofitting methods incorporated knowledge from lexical resources. While contextualized models have since surpassed static embeddings on most benchmarks, Word2Vec remains widely used for its simplicity, efficiency, and interpretability.