
Treebanks

Treebanks are corpora of sentences annotated with syntactic parse trees, serving as both training data for statistical parsers and as empirical resources for linguistic research.

(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))

A treebank is a text corpus in which each sentence has been annotated with its syntactic structure, typically in the form of a constituency tree or dependency tree (or both). Treebanks serve a dual role: they provide the supervised training data required by statistical and neural parsers, and they constitute empirical databases for testing linguistic hypotheses about syntax. The creation of the Penn Treebank in the early 1990s was a watershed moment that enabled the modern era of data-driven parsing.
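A bracketing like the one above can be loaded into a nested data structure with a few lines of code. The following is a minimal sketch in plain Python; real tools (e.g. NLTK's Tree.fromstring) additionally handle escaped brackets and other corpus quirks:

```python
# Minimal reader for Penn Treebank-style bracketed strings.
# Trees are represented as (label, children) tuples; leaves are plain strings.
# A sketch only, not a full PTB reader.

def parse_tree(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse()

def leaves(node):
    """Return the terminal words of a tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

tree = parse_tree(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(tree[0])       # root label: S
print(leaves(tree))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

The tuple representation is deliberately simple; anything that distinguishes a node label from its children (a class, a dataclass, NLTK's Tree) works equally well.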

Major Treebanks

Penn Treebank Bracketed Format

(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
    (SBAR (IN that)
      (S (NP-SBJ (PRP it))
        (VP (MD would)
          (VP (VB comply))))))
  (. .))

The Penn Treebank (PTB), containing approximately 40,000 annotated sentences from the Wall Street Journal, is the most widely used resource for training and evaluating English constituency parsers. Other major constituency treebanks include the German NEGRA and TIGER treebanks, the Penn Chinese Treebank, and the Penn Arabic Treebank. On the dependency side, the Universal Dependencies (UD) project provides consistently annotated treebanks for over 100 languages, making it the largest cross-linguistic syntactic resource.
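UD treebanks are distributed in the tab-separated CoNLL-U format, with one token per line and ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal reader might look like the sketch below; the example sentence and its annotations are illustrative, not drawn from a released treebank:

```python
def read_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into (id, form, upos, head, deprel)
    tuples. A minimal sketch: comment lines, multiword-token ranges (e.g.
    "3-4"), and empty nodes (e.g. "5.1") are skipped."""
    rows = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        rows.append((int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return rows

# Hand-written illustrative sentence in CoNLL-U layout (tabs between columns).
sent = """\
# text = The cat sat on the mat
1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_
3\tsat\tsit\tVERB\tVBD\tTense=Past\t0\troot\t_\t_
4\ton\ton\tADP\tIN\t_\t6\tcase\t_\t_
5\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t6\tdet\t_\t_
6\tmat\tmat\tNOUN\tNN\tNumber=Sing\t3\tobl\t_\t_
"""

for tok in read_conllu_sentence(sent):
    print(tok)  # e.g. (3, 'sat', 'VERB', 0, 'root')
```

HEAD column value 0 marks the root; every other token points at the ID of its syntactic head.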

Annotation Process

Treebank construction involves defining an annotation scheme (tagset and bracketing guidelines), training annotators, and conducting multiple rounds of annotation with adjudication to resolve disagreements. Inter-annotator agreement is typically measured using bracketing F1; the Penn Treebank reports around 93% agreement. Semi-automatic methods, where a parser produces initial annotations that humans correct, have become standard for reducing annotation cost.
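Bracketing F1 compares the labeled constituent spans of two trees over the same sentence. A simplified sketch follows, using (label, children) tuples with string leaves; the standard evalb scorer adds conventions such as ignoring punctuation and, in some configurations, part-of-speech brackets:

```python
def bracket_spans(node, start=0):
    """Collect (label, start, end) spans for every internal node.
    Returns (spans, subtree_width)."""
    if isinstance(node, str):
        return [], 1
    label, children = node
    spans, pos = [], start
    for child in children:
        child_spans, width = bracket_spans(child, pos)
        spans.extend(child_spans)
        pos += width
    spans.append((label, start, pos))
    return spans, pos - start

def bracketing_f1(gold, pred):
    """Labeled bracketing F1 between a gold and a predicted tree.
    Sets suffice here; a multiset would be needed if identical
    labeled spans could repeat (e.g. unary chains with one label)."""
    g = set(bracket_spans(gold)[0])
    p = set(bracket_spans(pred)[0])
    matched = len(g & p)
    precision = matched / len(p)
    recall = matched / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
              ("VP", [("VBD", ["sat"])])])
pred = ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
              ("VBD", ["sat"])])  # second annotation omits the VP bracket
print(round(bracketing_f1(gold, pred), 3))  # 0.909
```

In the example, the second tree recovers 5 of the gold tree's 6 labeled spans with no spurious brackets, giving precision 1.0, recall 5/6, and F1 of 10/11.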

Treebank-Induced Grammars

A PCFG can be read directly off a treebank by extracting all production rules and computing their relative frequencies. This treebank grammar provides a strong baseline parser but suffers from the independence assumptions of vanilla PCFGs. State-splitting and lexicalization techniques improve upon treebank grammars substantially.
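Reading a grammar off a treebank amounts to counting productions and normalizing per left-hand side: P(A → β) = count(A → β) / count(A). A minimal sketch over toy trees represented as (label, children) tuples:

```python
from collections import Counter, defaultdict

def count_productions(node, counts):
    """Add one production per internal node of a (label, children) tree."""
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_productions(c, counts)

def treebank_pcfg(trees):
    """Maximum-likelihood PCFG: relative frequency of each production.
    A sketch only; practical treebank grammars also need smoothing
    and unknown-word handling."""
    counts = Counter()
    for t in trees:
        count_productions(t, counts)
    lhs_total = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_total[lhs] += c
    return {rule: c / lhs_total[rule[0]] for rule, c in counts.items()}

trees = [
    ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
           ("VP", [("VBD", ["sat"])])]),
    ("S", [("NP", [("DT", ["The"]), ("NN", ["dog"])]),
           ("VP", [("VBD", ["chased"]),
                   ("NP", [("DT", ["the"]), ("NN", ["cat"])])])]),
]
pcfg = treebank_pcfg(trees)
print(pcfg[("VP", ("VBD",))])     # 0.5: one of the two observed VP expansions
print(pcfg[("S", ("NP", "VP"))])  # 1.0: the only observed S expansion
```

Lexical rules fall out of the same counts: with "cat" under NN twice and "dog" once, P(NN → cat) = 2/3. The sparsity visible even in this toy example is exactly why raw treebank grammars need smoothing.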

Impact and Limitations

Treebanks have been transformative for computational linguistics, but they have important limitations. Annotation is expensive, so treebanks are small relative to unannotated corpora. Annotation schemes encode theoretical assumptions that may be controversial. Genre bias (e.g., the PTB's focus on financial news) limits generalization. Despite these limitations, treebanks remain indispensable for training, evaluating, and comparing parsing systems, and the UD project has dramatically expanded the typological coverage of syntactic annotation.

References

  1. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.5555/972470.972475
  2. Nivre, J., de Marneffe, M.-C., Ginter, F., et al. (2020). Universal Dependencies v2: An evergrowing multilingual treebank collection. Proceedings of LREC 2020, 4034–4043. https://aclanthology.org/2020.lrec-1.497
  3. Xue, N., Xia, F., Chiou, F.-D., & Palmer, M. (2005). The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2), 207–238. https://doi.org/10.1017/S135132490400364X
  4. Abeillé, A. (Ed.). (2003). Treebanks: Building and using parsed corpora. Kluwer Academic Publishers. https://doi.org/10.1007/978-94-010-0201-1
