
Treebanks

Treebanks are corpora of sentences annotated with syntactic parse trees, serving as both training data for statistical parsers and as empirical resources for linguistic research.

(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))

A treebank is a text corpus in which each sentence has been annotated with its syntactic structure, typically in the form of a constituency tree or dependency tree (or both). Treebanks serve a dual role: they provide the supervised training data required by statistical and neural parsers, and they constitute empirical databases for testing linguistic hypotheses about syntax. The creation of the Penn Treebank in the early 1990s was a watershed moment that enabled the modern era of data-driven parsing.
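A bracketing like the one above can be loaded into a nested data structure with a few lines of code. The following is a minimal sketch in plain Python; real tools (e.g. NLTK's Tree.fromstring) additionally handle escaped brackets and other corpus quirks:

```python
# Minimal reader for Penn Treebank-style bracketed strings.
# Trees are represented as (label, children) tuples; leaves are plain strings.
# A sketch only, not a full PTB reader.

def parse_tree(s):
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse()

def leaves(node):
    """Return the terminal words of a tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]

tree = parse_tree(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(tree[0])       # root label: S
print(leaves(tree))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

The tuple representation is deliberately simple; anything that distinguishes a node label from its children (a class, a dataclass, NLTK's Tree) works equally well.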

Major Treebanks

Penn Treebank Bracketed Format

(S
  (NP-SBJ (DT The) (NN company))
  (VP (VBD said)
    (SBAR (IN that)
      (S (NP-SBJ (PRP it))
        (VP (MD would)
          (VP (VB comply))))))
  (. .))

The Penn Treebank (PTB), containing approximately 40,000 annotated sentences from the Wall Street Journal, is the most widely used resource for training and evaluating English constituency parsers. Other major constituency treebanks include the German NEGRA and TIGER treebanks, the Penn Chinese Treebank, and the Penn Arabic Treebank. On the dependency side, the Universal Dependencies (UD) project provides consistently annotated treebanks for over 100 languages, making it the largest cross-linguistic syntactic resource.
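UD treebanks are distributed in the tab-separated CoNLL-U format, with one token per line and ten columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal reader might look like the sketch below; the example sentence and its annotations are illustrative, not drawn from a released treebank:

```python
def read_conllu_sentence(block):
    """Parse one CoNLL-U sentence block into (id, form, upos, head, deprel)
    tuples. A minimal sketch: comment lines, multiword-token ranges (e.g.
    "3-4"), and empty nodes (e.g. "5.1") are skipped."""
    rows = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        rows.append((int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return rows

# Hand-written illustrative sentence in CoNLL-U layout (tabs between columns).
sent = """\
# text = The cat sat on the mat
1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_
2\tcat\tcat\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_
3\tsat\tsit\tVERB\tVBD\tTense=Past\t0\troot\t_\t_
4\ton\ton\tADP\tIN\t_\t6\tcase\t_\t_
5\tthe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t6\tdet\t_\t_
6\tmat\tmat\tNOUN\tNN\tNumber=Sing\t3\tobl\t_\t_
"""

for tok in read_conllu_sentence(sent):
    print(tok)  # e.g. (3, 'sat', 'VERB', 0, 'root')
```

HEAD column value 0 marks the root; every other token points at the ID of its syntactic head.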

Annotation Process

Treebank construction involves defining an annotation scheme (tagset and bracketing guidelines), training annotators, and conducting multiple rounds of annotation with adjudication to resolve disagreements. Inter-annotator agreement is typically measured using bracketing F1; the Penn Treebank reports around 93% agreement. Semi-automatic methods, where a parser produces initial annotations that humans correct, have become standard for reducing annotation cost.
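Bracketing F1 compares the labeled constituent spans of two trees over the same sentence. A simplified sketch follows, using (label, children) tuples with string leaves; the standard evalb scorer adds conventions such as ignoring punctuation and, in some configurations, part-of-speech brackets:

```python
def bracket_spans(node, start=0):
    """Collect (label, start, end) spans for every internal node.
    Returns (spans, subtree_width)."""
    if isinstance(node, str):
        return [], 1
    label, children = node
    spans, pos = [], start
    for child in children:
        child_spans, width = bracket_spans(child, pos)
        spans.extend(child_spans)
        pos += width
    spans.append((label, start, pos))
    return spans, pos - start

def bracketing_f1(gold, pred):
    """Labeled bracketing F1 between a gold and a predicted tree.
    Sets suffice here; a multiset would be needed if identical
    labeled spans could repeat (e.g. unary chains with one label)."""
    g = set(bracket_spans(gold)[0])
    p = set(bracket_spans(pred)[0])
    matched = len(g & p)
    precision = matched / len(p)
    recall = matched / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
              ("VP", [("VBD", ["sat"])])])
pred = ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
              ("VBD", ["sat"])])  # second annotation omits the VP bracket
print(round(bracketing_f1(gold, pred), 3))  # 0.909
```

In the example, the second tree recovers 5 of the gold tree's 6 labeled spans with no spurious brackets, giving precision 1.0, recall 5/6, and F1 of 10/11.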

Treebank-Induced Grammars

A PCFG can be read directly off a treebank by extracting all production rules and computing their relative frequencies. This treebank grammar provides a strong baseline parser but suffers from the independence assumptions of vanilla PCFGs. State-splitting and lexicalization techniques improve upon treebank grammars substantially.
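Reading a grammar off a treebank amounts to counting productions and normalizing per left-hand side: P(A → β) = count(A → β) / count(A). A minimal sketch over toy trees represented as (label, children) tuples:

```python
from collections import Counter, defaultdict

def count_productions(node, counts):
    """Add one production per internal node of a (label, children) tree."""
    label, children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            count_productions(c, counts)

def treebank_pcfg(trees):
    """Maximum-likelihood PCFG: relative frequency of each production.
    A sketch only; practical treebank grammars also need smoothing
    and unknown-word handling."""
    counts = Counter()
    for t in trees:
        count_productions(t, counts)
    lhs_total = defaultdict(int)
    for (lhs, _), c in counts.items():
        lhs_total[lhs] += c
    return {rule: c / lhs_total[rule[0]] for rule, c in counts.items()}

trees = [
    ("S", [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
           ("VP", [("VBD", ["sat"])])]),
    ("S", [("NP", [("DT", ["The"]), ("NN", ["dog"])]),
           ("VP", [("VBD", ["chased"]),
                   ("NP", [("DT", ["the"]), ("NN", ["cat"])])])]),
]
pcfg = treebank_pcfg(trees)
print(pcfg[("VP", ("VBD",))])     # 0.5: one of the two observed VP expansions
print(pcfg[("S", ("NP", "VP"))])  # 1.0: the only observed S expansion
```

Lexical rules fall out of the same counts: with "cat" under NN twice and "dog" once, P(NN → cat) = 2/3. The sparsity visible even in this toy example is exactly why raw treebank grammars need smoothing.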

Impact and Limitations

Treebanks have been transformative for computational linguistics, but they have important limitations. Annotation is expensive, so treebanks are small relative to unannotated corpora. Annotation schemes encode theoretical assumptions that may be controversial. Genre bias (e.g., the PTB's focus on financial news) limits generalization. Despite these limitations, treebanks remain indispensable for training, evaluating, and comparing parsing systems, and the UD project has dramatically expanded the typological coverage of syntactic annotation.

References

  1. Marcus, M. P., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330. https://doi.org/10.5555/972470.972475
  2. Nivre, J., de Marneffe, M.-C., Ginter, F., et al. (2020). Universal Dependencies v2: An evergrowing multilingual treebank collection. Proceedings of LREC 2020, 4034–4043. https://aclanthology.org/2020.lrec-1.497
  3. Xue, N., Xia, F., Chiou, F.-D., & Palmer, M. (2005). The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2), 207–238. https://doi.org/10.1017/S135132490400364X
  4. Abeillé, A. (Ed.). (2003). Treebanks: Building and using parsed corpora. Kluwer Academic Publishers. https://doi.org/10.1007/978-94-010-0201-1
