Computational Linguistics

Compounding

Compounding combines two or more free morphemes into a single complex word whose meaning may be compositional or idiomatic, presenting unique challenges for tokenization, semantic interpretation, and multilingual NLP.

compound = stem₁ + (linker) + stem₂

Compounding is the word-formation process that combines two or more independent stems into a single lexical unit. English "blackbird," German "Handschuh" (glove, literally hand-shoe), and Finnish "tietokone" (computer, literally knowledge-machine) are all compounds. Compounding is extraordinarily productive in many languages, with German and Dutch famously allowing essentially unlimited compound length ("Donaudampfschifffahrtsgesellschaftskapitän"). For NLP, compounds create challenges at every level: tokenization must decide whether to split them, parsers must determine their internal structure, and semantic models must compose the meanings of their parts.

Types and Structure of Compounds

Compound Classification
Endocentric: blackbird IS-A bird (head = bird)
Exocentric: pickpocket IS-NOT-A pocket (no semantic head)
Copulative: singer-songwriter IS-A singer AND songwriter

Internal structure (right-branching default in English):
[[football] [player]] = player of football
[[high school] [student]] vs. [high [school student]]

Semantic relation: N₁ R N₂ where R is implicit
"coffee cup" → cup FOR coffee; "paper cup" → cup MADE-OF paper

Compounds are typically classified by their headedness. In endocentric compounds, one element (the head) determines the category and basic meaning — a "blackbird" is a type of bird. In exocentric compounds, neither element functions as the semantic head — a "pickpocket" is not a type of pocket. The implicit semantic relation between compound elements is highly variable and context-dependent: a "chocolate cake" is a cake made with chocolate, but a "birthday cake" is a cake for a birthday.
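As a minimal illustration, the three classes and the IS-A behavior of the head can be encoded in a toy data structure (the class labels follow the box above; the right-hand head rule for English, under which the rightmost element is the head, is assumed):

```python
from dataclasses import dataclass

# Hypothetical toy representation of compound headedness; the category
# labels ("endocentric", "exocentric", "copulative") come from the
# classification above.
@dataclass
class Compound:
    modifier: str
    head: str
    kind: str  # "endocentric", "exocentric", or "copulative"

    def hypernym(self):
        """In endocentric compounds the head supplies the IS-A hypernym
        (English's right-hand head rule); exocentric compounds have none."""
        return self.head if self.kind == "endocentric" else None

print(Compound("black", "bird", "endocentric").hypernym())   # bird
print(Compound("pick", "pocket", "exocentric").hypernym())   # None
```

A copulative compound like "singer-songwriter" would need both elements as hypernyms, which this sketch deliberately leaves out.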

Compound Splitting

For languages with closed compounds (written without spaces), compound splitting is an important preprocessing step. German compound splitting, for instance, requires identifying the constituent words and any linking morphemes. Koehn and Knight (2003) proposed a frequency-based approach that prefers splits into high-frequency parts, while later work used conditional random fields and neural models trained on compound-annotated data. Errors in compound splitting propagate to downstream tasks: the ambiguous string "Wachstube" can be split as "Wach+Stube" (guard room) or as "Wachs+Tube" (wax tube), and choosing the wrong split changes the meaning entirely.
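The frequency-based idea can be sketched as follows, assuming a toy frequency table in place of real corpus counts. Following Koehn and Knight (2003), candidate splits (including leaving the word unsplit) are scored by the geometric mean of their parts' corpus frequencies:

```python
import math

# Toy frequencies; a real system would use counts from a large
# monolingual corpus (the numbers here are invented for illustration).
FREQ = {"hand": 120, "schuh": 80, "handschuh": 15,
        "wachs": 10, "tube": 40, "wach": 25, "stube": 30}
LINKERS = {"", "s", "es", "n", "en", "er"}  # common German linking morphemes

def split_score(parts):
    """Geometric mean of part frequencies; zero if any part is unknown."""
    if any(FREQ.get(p, 0) == 0 for p in parts):
        return 0.0
    return math.prod(FREQ[p] for p in parts) ** (1.0 / len(parts))

def best_split(word):
    """Compare the unsplit word against all two-part splits,
    optionally dropping a linking morpheme at the boundary."""
    best = ([word], split_score([word]))
    for i in range(1, len(word)):
        for link in LINKERS:
            left, rest = word[:i], word[i:]
            if rest.startswith(link):
                parts = [left, rest[len(link):]]
                score = split_score(parts)
                if score > best[1]:
                    best = (parts, score)
    return best[0]

print(best_split("handschuh"))  # ['hand', 'schuh']
```

A full implementation would also recurse on the parts to handle compounds with more than two constituents.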

Compound Interpretation and Compositionality

The semantic interpretation of novel compounds is a remarkable human ability. We effortlessly understand "airplane food" (food served on airplanes), "snowman hat" (a hat for a snowman), and "chocolate teapot" (a teapot made of chocolate) by inferring the implicit relation between constituents. Computational approaches to compound interpretation include relation classification (selecting from a fixed set of semantic relations), paraphrasing (generating "X that is Y" or "X made of Y"), and distributional composition (combining word vectors). Dima and Hinrichs (2015) showed that neural composition functions can predict compound meanings from their parts with reasonable accuracy.
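Distributional composition can be sketched in a few lines, assuming toy random embeddings in place of pretrained vectors; the matrix W below is random and merely stands in for the parameters that a neural composition model like that of Dima and Hinrichs (2015) would learn from data:

```python
import random

random.seed(0)
DIM = 4
# Toy embeddings; a real system would load pretrained word vectors.
vec = {w: [random.gauss(0, 1) for _ in range(DIM)]
       for w in ["coffee", "cup"]}

def additive(u, v):
    """Simplest composition function: element-wise sum of the constituents."""
    return [a + b for a, b in zip(u, v)]

def full_additive(u, v, W):
    """Learned composition: a matrix W maps the concatenated constituent
    vectors onto a compound vector (W here is random, standing in for
    trained parameters)."""
    x = u + v  # concatenation, length 2 * DIM
    return [sum(W[i][j] * x[j] for j in range(2 * DIM)) for i in range(DIM)]

W = [[random.gauss(0, 1) for _ in range(2 * DIM)] for _ in range(DIM)]
coffee_cup = full_additive(vec["coffee"], vec["cup"], W)
print(len(coffee_cup))  # 4
```

Purely additive composition ignores which constituent is the head, which is one reason learned, asymmetric composition functions tend to predict compound meanings better.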

Compounding Across Languages

Languages vary enormously in how they form and write compounds. English allows open compounds ("ice cream"), hyphenated compounds ("well-known"), and closed compounds ("football"). German writes all compounds as single orthographic words, often with linking morphemes ("-s-", "-n-", "-er-"). Chinese freely combines characters into compound words with minimal morphological marking. These typological differences mean that compound processing strategies must be language-specific, and multilingual NLP systems must handle this variation.
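For English, the three spelling patterns can be told apart with a trivial surface check (a sketch only; deciding whether a string is a compound at all would require a lexicon and context):

```python
def orthographic_type(compound):
    """Classify an English compound's spelling as open, hyphenated,
    or closed, based purely on its surface form."""
    if " " in compound:
        return "open"
    if "-" in compound:
        return "hyphenated"
    return "closed"

print(orthographic_type("ice cream"))   # open
print(orthographic_type("well-known"))  # hyphenated
print(orthographic_type("football"))    # closed
```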

In the context of modern subword tokenization, compounds are often split into subword units that may or may not correspond to meaningful constituents. BPE might segment "Handschuh" into "Hand" and "schuh" (recovering the compound structure) or into "Hands" and "chuh" (destroying it), depending on the training corpus statistics. Whether linguistically motivated compound splitting provides benefits over purely statistical subword tokenization remains an active research question, with evidence suggesting that explicit compound splitting still helps for machine translation of morphologically rich languages.
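The corpus-dependence of BPE segmentation can be demonstrated with a minimal greedy segmenter. The two merge tables below are invented to produce the two segmentations of "Handschuh" mentioned above; in practice the table is learned from pair frequencies in the training corpus:

```python
def bpe_segment(word, merges):
    """Greedy BPE segmentation: repeatedly apply the highest-priority
    merge from an ordered list of symbol pairs until none applies.
    (A sketch: real BPE merges all occurrences of a pair per step.)"""
    symbols = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        pairs = [(rank.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs, default=(float("inf"), -1))
        if best_rank == float("inf"):
            break
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Hypothetical merge tables, as if learned from two different corpora:
merges_a = [("h", "a"), ("ha", "n"), ("han", "d"), ("s", "c"),
            ("sc", "h"), ("sch", "u"), ("schu", "h")]
merges_b = [("h", "a"), ("ha", "n"), ("han", "d"), ("hand", "s"),
            ("c", "h"), ("ch", "u"), ("chu", "h")]

print(bpe_segment("handschuh", merges_a))  # ['hand', 'schuh']
print(bpe_segment("handschuh", merges_b))  # ['hands', 'chuh']
```

The first table recovers the linguistic constituents; the second crosses the morpheme boundary, illustrating how corpus statistics alone decide whether compound structure survives tokenization.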

References

  1. Koehn, P., & Knight, K. (2003). Empirical methods for compound splitting. Proceedings of the 10th Conference of the European Chapter of the ACL, 187–193. doi:10.3115/1067807.1067833
  2. Dima, C., & Hinrichs, E. (2015). Automatic noun compound interpretation using deep neural networks and word embeddings. Proceedings of the 11th International Conference on Computational Semantics, 173–183.
  3. Ziering, P., & van der Plas, L. (2016). Towards unsupervised and language-independent compound splitting using inflectional morphology. Proceedings of the 2016 Conference of the NAACL: HLT, 644–653. doi:10.18653/v1/N16-1075
