WordNet

WordNet is a large-scale lexical database of English that organizes words into synonym sets (synsets) linked by semantic relations, serving as a foundational resource for word sense disambiguation, information retrieval, and natural language understanding.

synset(w) = {s | w in s, s is a synonym set}

WordNet was developed at Princeton University under the direction of George A. Miller, beginning in 1985. It organizes English nouns, verbs, adjectives, and adverbs into synonym sets (synsets), each representing a distinct lexical concept. Synsets are interconnected by semantic relations including hyponymy (IS-A), meronymy (PART-OF), antonymy, and entailment, forming a network that encodes much of the taxonomic and relational structure of the English lexicon. WordNet has become one of the most widely used lexical resources in natural language processing.

Structure and Organization

WordNet Statistics and Structure

Size: ~117,000 synsets, ~155,000 words, ~207,000 word-sense pairs
Noun hierarchy depth: up to 16 levels
Top-level ontology: 25 unique beginners for nouns

Key relations:
Nouns: hyponymy, meronymy (part-of, member-of, substance-of)
Verbs: troponymy (manner-of), entailment, cause
Adjectives: antonymy, similar-to, pertainymy
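
As a rough illustration of how synsets, lemmas, and typed relations fit together, the following minimal sketch models a handful of hand-written entries in Python. The class name Synset, the identifiers such as "bank.n.1", and the function synsets_of are assumptions made for this example; they are not WordNet's actual file format or any library's API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A toy synset: a set of lemmas plus typed links to other synsets."""
    name: str                 # hypothetical identifier, e.g. "car.n"
    pos: str                  # part of speech: "n", "v", "a", or "r"
    lemmas: frozenset         # word forms that share this sense
    relations: dict = field(default_factory=dict)  # relation -> synset names


# Hand-written toy entries, not real WordNet data.
SYNSETS = [
    Synset("car.n", "n", frozenset({"car", "auto", "automobile"}),
           {"hypernym": ["motor_vehicle.n"], "part_meronym": ["wheel.n"]}),
    Synset("bank.n.1", "n", frozenset({"bank", "depository"}),
           {"hypernym": ["financial_institution.n"]}),
    Synset("bank.n.2", "n", frozenset({"bank", "riverbank"}),
           {"hypernym": ["slope.n"]}),
    Synset("snore.v", "v", frozenset({"snore"}),
           {"entailment": ["sleep.v"]}),
]


def build_index(synsets):
    """Map each lemma to the names of all synsets that contain it."""
    index = defaultdict(list)
    for synset in synsets:
        for lemma in synset.lemmas:
            index[lemma].append(synset.name)
    return index


INDEX = build_index(SYNSETS)


def synsets_of(word):
    """Return the (sorted) synset names for a word; empty if unknown."""
    return sorted(INDEX.get(word, []))
```

Here synsets_of("bank") returns both toy senses of "bank", which mirrors the definition synset(w) = {s | w in s}: a polysemous word belongs to several synsets, while synonyms such as "auto" and "car" share one.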

The noun taxonomy in WordNet is organized as a hierarchy whose most general concept is "entity," with broad categories such as "abstraction" near the top and increasingly specific concepts at lower levels. For example, an abridged path from "dog" up to "entity" passes through "canine," "carnivore," "mammal," "animal," "organism," and "living thing." This hierarchical structure enables computation of semantic similarity: concepts that lie closer together in the taxonomy are treated as more similar. Wu-Palmer similarity, Lin similarity, and path-based measures all exploit this structure.
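
The walk from a concept up to the root can be sketched with a simple child-to-parent map. This is a toy, abridged taxonomy written by hand for illustration; the names HYPERNYM, hypernym_path, and depth are assumptions of this sketch, and real WordNet contains more intermediate nodes and allows a synset to have several hypernyms.

```python
# Toy single-inheritance taxonomy: child -> parent (hypernym).
# Abridged by hand; not real WordNet data.
HYPERNYM = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "mammal",
    "mammal": "animal",
    "animal": "organism",
    "organism": "living thing",
    "living thing": "entity",
}


def hypernym_path(concept):
    """Return the chain from concept up to the root, both inclusive."""
    path = [concept]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def depth(concept):
    """Number of IS-A edges between concept and the root."""
    return len(hypernym_path(concept)) - 1
```

In this toy map, hypernym_path("dog") climbs through the same concepts listed above and stops at "entity", and "dog" and "cat" first meet at "carnivore".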

Semantic Similarity Measures

WordNet supports several well-known measures of semantic similarity. Path-based measures score two synsets by the length of the shortest path between them in the taxonomy, with shorter paths indicating greater similarity. Wu-Palmer similarity considers the depth of the least common subsumer (LCS), the most specific ancestor shared by both concepts, relative to the depths of the two concepts: sim(c1, c2) = 2 * depth(LCS) / (depth(c1) + depth(c2)). Lin similarity combines the taxonomic structure with corpus-based information content: sim(c1, c2) = 2 * IC(LCS) / (IC(c1) + IC(c2)), where IC(c) = -log P(c) and P(c) is the probability of encountering concept c (or any of its descendants) in a corpus. These measures are used in word sense disambiguation, document similarity, and textual entailment.
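
The three measures can be sketched on a small hand-built taxonomy. Everything in this example is an assumption made for illustration: the taxonomy PARENT, the made-up corpus counts in COUNTS, and the convention that the root has depth 1 (chosen so that Wu-Palmer never divides by zero). It is a sketch of the formulas, not the implementation used by any particular toolkit.

```python
import math

# Toy taxonomy (child -> parent) and made-up corpus counts per concept.
PARENT = {
    "dog": "canine", "cat": "feline",
    "canine": "carnivore", "feline": "carnivore",
    "carnivore": "animal", "animal": "entity",
}
COUNTS = {"dog": 30, "cat": 20, "canine": 5, "feline": 5,
          "carnivore": 10, "animal": 20, "entity": 10}
TOTAL = sum(COUNTS.values())


def path_to_root(c):
    """Concept, its parent, grandparent, ... up to the root."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(c):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(c))


def lcs(c1, c2):
    """Least common subsumer: the deepest ancestor shared by both."""
    seen = set(path_to_root(c1))
    for ancestor in path_to_root(c2):
        if ancestor in seen:
            return ancestor
    raise ValueError("concepts share no ancestor")


def path_similarity(c1, c2):
    """1 / (1 + number of edges on the shortest path via the LCS)."""
    common = depth(lcs(c1, c2))
    edges = (depth(c1) - common) + (depth(c2) - common)
    return 1.0 / (1 + edges)


def wu_palmer(c1, c2):
    """2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


def information_content(c):
    """IC(c) = -log P(c); P(c) counts c together with its descendants."""
    cumulative = sum(n for d, n in COUNTS.items() if c in path_to_root(d))
    return -math.log(cumulative / TOTAL)


def lin_similarity(c1, c2):
    """2 * IC(LCS) / (IC(c1) + IC(c2))."""
    denom = information_content(c1) + information_content(c2)
    if denom == 0:
        return 1.0  # both concepts are the root, which has IC = 0
    return 2 * information_content(lcs(c1, c2)) / denom
```

With these toy numbers, "dog" and "cat" meet at "carnivore", four edges apart, so the path measure gives 1/5 and Wu-Palmer gives 6/10. The Lin score depends entirely on the invented counts and would change with a different corpus.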

WordNet Across Languages

The Global WordNet Association coordinates the development of wordnets for over 200 languages. EuroWordNet and the Open Multilingual Wordnet link synsets across languages via inter-lingual indices, enabling cross-lingual semantic analysis. The Universal WordNet project automatically extends WordNet coverage using machine translation and distributional methods. These multilingual resources support applications in cross-lingual information retrieval, machine translation evaluation, and multilingual word sense disambiguation.
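
The idea of linking wordnets through a shared index can be sketched as a lookup from (language, lemma) pairs to language-neutral concept identifiers. The identifiers "ili-0001" and "ili-0002", the table LEMMA_TO_ILI, and the function translate_via_ili are hypothetical names for this example; they do not reproduce the data or API of EuroWordNet or the Open Multilingual Wordnet.

```python
# Hypothetical inter-lingual index identifiers; not real index entries.
LEMMA_TO_ILI = {
    ("en", "dog"): ["ili-0001"],
    ("es", "perro"): ["ili-0001"],
    ("fr", "chien"): ["ili-0001"],
    ("en", "cat"): ["ili-0002"],
    ("es", "gato"): ["ili-0002"],
    ("fr", "chat"): ["ili-0002"],
}


def translate_via_ili(lemma, source, target):
    """Find target-language lemmas that share a concept id with the lemma."""
    concepts = set(LEMMA_TO_ILI.get((source, lemma), []))
    return sorted(
        other
        for (lang, other), ids in LEMMA_TO_ILI.items()
        if lang == target and concepts.intersection(ids)
    )
```

Because every language maps to the same concept identifiers, no direct English-Spanish or Spanish-French table is needed: translate_via_ili("gato", "es", "fr") goes through the shared identifier to reach "chat".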

Impact on NLP

WordNet has been integral to NLP for decades. The SemCor corpus, annotated with WordNet senses, is a standard source of training data for word sense disambiguation (WSD). WordNet synsets define the sense inventory for the Senseval/SemEval WSD shared tasks. In information retrieval, query expansion using WordNet synonyms and hypernyms can improve recall, though usually at some cost to precision. Knowledge graph construction often uses WordNet as a source of taxonomic structure, and lexical entailment detection leverages the hyponymy relation.
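
Query expansion with synonyms and hypernyms can be sketched as follows. The tiny SYNONYMS and HYPERNYMS tables and the function expand_query are assumptions made for this example rather than real WordNet lookups, and the sketch skips sense disambiguation, which a practical system would likely need in order to avoid adding terms from the wrong sense.

```python
# Hand-written toy lexicon standing in for WordNet lookups.
SYNONYMS = {
    "car": ["auto", "automobile"],
    "film": ["movie", "picture"],
}
HYPERNYMS = {
    "car": ["motor vehicle"],
    "film": ["show"],
}


def expand_query(terms, use_hypernyms=True):
    """Return the query terms plus synonyms (and optionally hypernyms).

    Order is preserved and duplicates are dropped.
    """
    expanded = []
    for term in terms:
        candidates = [term] + SYNONYMS.get(term, [])
        if use_hypernyms:
            candidates += HYPERNYMS.get(term, [])
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

A document that mentions only "automobile" would now match the query "car rental". Adding hypernyms widens the net further, which is where precision tends to suffer, so the sketch makes that step optional.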

Despite its influence, WordNet has known limitations. Its sense inventory is very fine-grained, making disambiguation difficult even for humans. Coverage of domain-specific terminology, new words, and informal language is limited. The taxonomy reflects one particular organization of concepts that may not match cognitive or task-specific needs. Modern approaches increasingly combine WordNet's structured knowledge with distributional representations, using techniques like retrofitting to inject relational knowledge into word vectors, with the aim of capturing the benefits of both symbolic and distributional approaches.
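
Retrofitting can be sketched as an iterative update that pulls each word vector toward the vectors of its lexicon neighbors while keeping it close to its original value. The sketch below follows the update rule described by Faruqui et al. (2015), with the weight on the original vector set to 1 and each neighbor weighted by one over the number of neighbors; the function name retrofit, the pure-Python list representation, and the toy vectors are assumptions of this example.

```python
def retrofit(vectors, neighbors, iterations=10, alpha=1.0):
    """Pull each vector toward its lexicon neighbors.

    vectors:   word -> list of floats (the original embeddings)
    neighbors: word -> list of related words (e.g. WordNet synonyms)
    Words without neighbors keep their original vector.
    """
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, original in vectors.items():
            linked = [n for n in neighbors.get(word, []) if n in new]
            if not linked:
                continue
            beta = 1.0 / len(linked)  # equal weight per neighbor
            updated = []
            for d in range(len(original)):
                neighbor_sum = sum(beta * new[n][d] for n in linked)
                # Weighted average of the original vector and its neighbors.
                updated.append((alpha * original[d] + neighbor_sum)
                               / (alpha + beta * len(linked)))
            new[word] = updated
    return new


# Toy two-dimensional vectors; "happy" and "glad" are linked as synonyms.
toy_vectors = {"happy": [1.0, 0.0], "glad": [0.0, 1.0], "table": [5.0, 5.0]}
toy_links = {"happy": ["glad"], "glad": ["happy"]}
retrofitted = retrofit(toy_vectors, toy_links)
```

After the updates, the vectors for "happy" and "glad" have moved toward each other, while "table", which has no links in the toy lexicon, is left unchanged.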

References

  1. Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41. doi:10.1145/219717.219748
  2. Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. MIT Press.
  3. Pedersen, T., Patwardhan, S., & Michelizzi, J. (2004). WordNet::Similarity: Measuring the relatedness of concepts. In Proceedings of NAACL-HLT Demonstration Papers (pp. 38–41). doi:10.3115/1614025.1614037
  4. Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., & Smith, N. A. (2015). Retrofitting word vectors to semantic lexicons. In Proceedings of NAACL-HLT (pp. 1606–1615). doi:10.3115/v1/N15-1184