WordNet, developed at Princeton University under the direction of George A. Miller beginning in 1985, is a large-scale lexical database that organizes English nouns, verbs, adjectives, and adverbs into synonym sets (synsets), each representing a distinct lexical concept. Synsets are interconnected by semantic relations including hyponymy (IS-A), meronymy (PART-OF), antonymy, and entailment, forming a network that encodes much of the taxonomic and relational structure of the English lexicon. WordNet has become one of the most widely used lexical resources in natural language processing.
Structure and Organization
Size: ~207,000 word-sense pairs (WordNet 3.0)
Noun hierarchy depth: up to 16 levels
Top-level ontology: 25 unique beginners for nouns
Key relations:
Nouns: hyponymy, meronymy (part-of, member-of, substance-of)
Verbs: troponymy (manner-of), entailment, cause
Adjectives: antonymy, similar-to, pertainymy
The noun taxonomy in WordNet is organized as a hierarchy rooted in the most general concepts, such as "entity" and "abstraction," with increasingly specific concepts at lower levels. For example, the path from "dog" up to "entity" passes through, among others, "canine," "carnivore," "mammal," "animal," "organism," and "living thing." This hierarchical structure enables computation of semantic similarity: concepts separated by fewer links, or sharing a more specific common ancestor, are treated as more similar. Wu-Palmer similarity, Lin similarity, and path-based measures all exploit this structure.
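The upward walk through the hierarchy can be illustrated with a minimal sketch. The taxonomy below is a hand-written toy, not WordNet data: it keeps only the concepts named above, and the "cat" and "feline" entries are assumptions added so that two concepts share an ancestor. The function names are likewise illustrative.

```python
# Toy IS-A links (child -> parent), simplified from the chain described above.
# Real WordNet contains intermediate synsets and allows several hypernyms per synset.
HYPERNYM = {
    "dog": "canine",
    "cat": "feline",        # assumption: added for illustration
    "canine": "carnivore",
    "feline": "carnivore",  # assumption: added for illustration
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "living thing",
    "living thing": "entity",
}


def hypernym_path(concept):
    """Return the chain from `concept` up to the root, inclusive."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def lowest_common_subsumer(a, b):
    """Return the most specific concept that subsumes both a and b, or None."""
    ancestors_of_a = set(hypernym_path(a))
    for node in hypernym_path(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    return None


print(hypernym_path("dog"))
print(lowest_common_subsumer("dog", "cat"))  # carnivore
```

In this toy tree, "dog" and "cat" meet at "carnivore," which is the kind of shared ancestor that taxonomy-based similarity measures rely on.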
Semantic Similarity Measures
WordNet supports several well-known measures of semantic similarity. Path-based measures score two synsets by the length of the shortest path between them in the taxonomy, with shorter paths indicating greater similarity. Wu-Palmer similarity considers the depth of the least common subsumer (LCS) relative to the depths of the two concepts: sim(c1, c2) = 2 * depth(LCS) / (depth(c1) + depth(c2)). Lin similarity combines the taxonomic structure with corpus-based information content: sim(c1, c2) = 2 * IC(LCS) / (IC(c1) + IC(c2)), where IC(c) = -log P(c) is the information content of concept c and P(c) is estimated from corpus frequency. These measures are used in word sense disambiguation, document similarity, and textual entailment.
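The three measures can be compared side by side in a small sketch. Everything here is an assumption made for illustration: the taxonomy is a toy tree, the corpus counts are invented, the root is given depth 1, and path similarity is taken as 1 / (1 + number of edges), which is one common convention rather than the only one.

```python
import math

# Toy IS-A links (child -> parent); hypothetical, not real WordNet data.
PARENT = {
    "dog": "canine", "wolf": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "sparrow": "bird",
    "carnivore": "entity", "bird": "entity",
}
# Invented corpus counts for the leaf concepts.
LEAF_COUNTS = {"dog": 50, "wolf": 10, "cat": 40, "sparrow": 100}


def path_to_root(c):
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(c):
    """Number of nodes from the root down to c (the root has depth 1)."""
    return len(path_to_root(c))


def lcs(a, b):
    """Least common subsumer: the deepest node on both paths to the root."""
    seen = set(path_to_root(a))
    return next(n for n in path_to_root(b) if n in seen)


def path_similarity(a, b):
    """1 / (1 + edges on the shortest path, which runs through the LCS in a tree)."""
    s = lcs(a, b)
    edges = (depth(a) - depth(s)) + (depth(b) - depth(s))
    return 1.0 / (1 + edges)


def wup_similarity(a, b):
    """Wu-Palmer: 2 * depth(LCS) / (depth(a) + depth(b))."""
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


def information_content(c):
    """IC(c) = -log P(c), where P(c) counts c and every leaf below it."""
    total = sum(LEAF_COUNTS.values())
    count = sum(n for leaf, n in LEAF_COUNTS.items() if c in path_to_root(leaf))
    return -math.log(count / total)


def lin_similarity(a, b):
    """Lin: 2 * IC(LCS) / (IC(a) + IC(b))."""
    return 2.0 * information_content(lcs(a, b)) / (
        information_content(a) + information_content(b))


for pair in [("dog", "wolf"), ("dog", "cat")]:
    print(pair, path_similarity(*pair), wup_similarity(*pair), lin_similarity(*pair))
```

With these toy numbers all three measures rank "dog" closer to "wolf" than to "cat," because the pair shares a deeper and more informative common subsumer ("canine" rather than "carnivore").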
Multilingual WordNets
The Global WordNet Association coordinates the development of wordnets for over 200 languages. EuroWordNet and the Open Multilingual Wordnet link synsets across languages via inter-lingual indices, enabling cross-lingual semantic analysis. The Universal WordNet project automatically extends WordNet coverage using machine translation and distributional methods. These multilingual resources support applications in cross-lingual information retrieval, machine translation evaluation, and multilingual word sense disambiguation.
Impact on NLP
WordNet has been integral to NLP for decades. The SemCor corpus, annotated with WordNet senses, has long served as a standard training and evaluation resource for word sense disambiguation (WSD). WordNet synsets define the sense inventory for the Senseval/SemEval WSD shared tasks. In information retrieval, expanding queries with WordNet synonyms and hypernyms can improve recall, though sometimes at the cost of precision. Knowledge graph construction often uses WordNet as a source of taxonomic structure, and lexical entailment detection leverages the hyponymy relation.
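Query expansion of this kind can be sketched in a few lines. The mini-lexicon below is hypothetical and stands in for lookups against WordNet synsets; the `expand_query` function and its `use_hypernyms` parameter are illustrative names, not part of any WordNet API.

```python
# Hypothetical mini-lexicon: each term maps to synonyms and hypernyms.
LEXICON = {
    "car": {"synonyms": ["automobile", "auto"], "hypernyms": ["motor vehicle"]},
    "dog": {"synonyms": ["domestic dog"], "hypernyms": ["canine"]},
}


def expand_query(terms, use_hypernyms=True):
    """Return the original terms followed by related terms, without duplicates."""
    expanded = list(terms)
    for term in terms:
        entry = LEXICON.get(term, {})  # unknown terms are kept but not expanded
        related = list(entry.get("synonyms", []))
        if use_hypernyms:
            related += entry.get("hypernyms", [])
        for candidate in related:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["car", "rental"]))
# ['car', 'rental', 'automobile', 'auto', 'motor vehicle']
```

Adding hypernyms broadens the query more aggressively than adding synonyms alone, which is why the sketch makes that step optional.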
Despite its influence, WordNet has known limitations. Its sense inventory is very fine-grained, so that distinguishing between senses is difficult even for human annotators. Coverage of domain-specific terminology, new words, and informal language is limited. The taxonomy reflects one particular organization of concepts that may not match cognitive or task-specific needs. Modern approaches increasingly combine WordNet's structured knowledge with distributional representations, using techniques like retrofitting to inject relational knowledge into word vectors, with the aim of combining the strengths of symbolic and distributional approaches.
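The idea behind retrofitting can be sketched as an iterative averaging step: each word vector is pulled toward the vectors of its lexicon neighbors while staying anchored to its original value. The version below is a simplified illustration, not a reference implementation. The weighting (an anchor weight `alpha` of 1 and neighbor weights of 1/degree), the two-dimensional vectors, and the synonym links are all assumptions.

```python
def retrofit(vectors, neighbors, iterations=10, alpha=1.0):
    """Pull each vector toward its lexicon neighbors, anchored to the original."""
    original = {w: list(v) for w, v in vectors.items()}
    current = {w: list(v) for w, v in vectors.items()}
    for _ in range(iterations):
        for word, related in neighbors.items():
            related = [r for r in related if r in current]
            if word not in current or not related:
                continue  # nothing to pull toward; the vector is left unchanged
            beta = 1.0 / len(related)  # each neighbor weighted by 1/degree
            dims = len(current[word])
            current[word] = [
                (alpha * original[word][d]
                 + beta * sum(current[r][d] for r in related))
                / (alpha + beta * len(related))
                for d in range(dims)
            ]
    return current


# Hypothetical input: toy vectors and a toy synonym relation.
vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "car": [5.0, 5.0]}
neighbors = {"happy": ["glad"], "glad": ["happy"]}

fitted = retrofit(vectors, neighbors)
print(fitted["happy"], fitted["glad"], fitted["car"])
```

In this toy run the vectors for "happy" and "glad" move toward each other because the lexicon links them, while "car," which has no neighbors, keeps its original vector.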