Penn Discourse Treebank

The Penn Discourse Treebank (PDTB) adopts a lexically grounded, connective-based approach to discourse annotation, cataloging explicit and implicit relations between text spans through a hierarchical sense taxonomy.

Rel(Arg1, Conn, Arg2) → Sense ∈ Hierarchy

The Penn Discourse Treebank, developed at the University of Pennsylvania beginning in 2004, takes a fundamentally different approach to discourse annotation from Rhetorical Structure Theory (RST). Rather than building a global tree structure over an entire document, PDTB follows a bottom-up, lexically grounded strategy: it annotates individual discourse connectives (words like "because," "however," and "although") along with their two arguments and the semantic sense of the relation they express. It also annotates implicit relations between adjacent sentences that lack an explicit connective, which makes PDTB especially valuable for studying how discourse coherence is signaled, or left unsignaled, in natural text.

Annotation Framework

PDTB Relation Structure

Explicit: Conn(Arg1, Arg2) → Sense
Implicit: ∅(Arg1, Arg2) → Sense (inferred connective)
AltLex: AltLex(Arg1, Arg2) → Sense (non-connective signal)
EntRel: Entity-based coherence (no semantic relation)

Sense Hierarchy (PDTB 3.0): 4 top-level classes → second-level types → third-level subtypes (~30 fine-grained senses in total)

PDTB annotates four types of discourse relations. Explicit relations are signaled by overt connectives drawn from a closed class of approximately 100 English expressions. Implicit relations hold between adjacent sentences where no connective is present; annotators insert a connective that best expresses the inferred relation. AltLex relations are signaled by alternative lexicalizations that are not traditional connectives (e.g., "That was the result of..."). EntRel marks cases where adjacent sentences are related only through shared entity reference. PDTB 3.0 organizes senses into a three-level hierarchy with four top-level classes: Temporal, Contingency, Comparison, and Expansion.
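The relation types and the dotted sense hierarchy described above can be modeled as a small record type. The following is a minimal sketch, not the corpus's actual file format: the class name `DiscourseRelation`, its field names, and the example sentence are assumptions made for illustration, and only the four relation types and four top-level classes named in the text are checked.

```python
from dataclasses import dataclass
from typing import Optional

# The four top-level sense classes and four relation types named in the text.
TOP_LEVEL_CLASSES = ("Temporal", "Contingency", "Comparison", "Expansion")
RELATION_TYPES = ("Explicit", "Implicit", "AltLex", "EntRel")


@dataclass(frozen=True)
class DiscourseRelation:
    """Hypothetical in-memory record for one PDTB-style relation."""
    rel_type: str                      # one of RELATION_TYPES
    arg1: str                          # text span of Arg1
    arg2: str                          # text span of Arg2
    connective: Optional[str] = None   # overt (Explicit) or annotator-inserted (Implicit)
    sense: Optional[str] = None        # dotted path, e.g. "Contingency.Cause.Reason"

    def __post_init__(self):
        if self.rel_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {self.rel_type}")
        if self.rel_type == "EntRel":
            # EntRel marks entity-based coherence and carries no sense label.
            if self.sense is not None:
                raise ValueError("EntRel relations carry no sense")
        elif self.sense is None or self.sense.split(".")[0] not in TOP_LEVEL_CLASSES:
            raise ValueError(f"sense must start with one of {TOP_LEVEL_CLASSES}")

    def sense_at_level(self, level: int) -> Optional[str]:
        """Truncate the sense path to the given depth (1 = top-level class)."""
        if self.sense is None:
            return None
        return ".".join(self.sense.split(".")[:level])


# Invented example sentence: "The company cut its forecast because demand had weakened."
rel = DiscourseRelation(
    rel_type="Explicit",
    arg1="The company cut its forecast",
    arg2="demand had weakened",        # Arg2 is the clause bound to the connective
    connective="because",
    sense="Contingency.Cause.Reason",
)
print(rel.sense_at_level(1))  # Contingency
print(rel.sense_at_level(2))  # Contingency.Cause
```

Storing the sense as a dotted path keeps all three levels of the hierarchy in one field, so coarser label sets can be derived by truncation rather than stored separately.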

Corpus and Versions

PDTB 2.0, released in 2008, annotated the same Wall Street Journal corpus as the Penn Treebank, providing aligned syntactic and discourse annotations for approximately 2,300 articles. It contains over 40,000 annotated relations, with explicit and implicit relations present in roughly comparable numbers. PDTB 3.0 refined the sense hierarchy, improved consistency guidelines, and added many new relations over the same Wall Street Journal text, chiefly intra-sentential ones, bringing the total to over 53,000. The PDTB framework has also been applied to other languages: discourse treebanks following PDTB principles have been constructed for Chinese, Turkish, Hindi, and several other languages.

Implicit Relation Classification

Classifying implicit discourse relations, where no connective is present, is one of the hardest tasks in discourse processing. Without explicit lexical cues, models must infer the relation from the semantic content of the two arguments alone. Reported systems reach roughly 60–65% accuracy on four-way classification (over the top-level sense classes) in PDTB 2.0, far below the 90%+ accuracy typical for explicit relations. This gap highlights the fundamental challenge of pragmatic inference: humans effortlessly recover implicit relations that remain difficult for computational models.
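Four-way classification collapses every fine-grained sense to its top-level class before scoring. The sketch below shows one way such an evaluation could be computed; the function names and the four gold/predicted labels are invented for illustration, and it assumes a single gold sense per relation, although PDTB permits more than one.

```python
TOP_LEVEL_CLASSES = ("Temporal", "Contingency", "Comparison", "Expansion")


def top_level(sense: str) -> str:
    """Map a dotted sense path to its top-level class."""
    return sense.split(".", 1)[0]


def four_way_scores(gold, pred):
    """Return (accuracy, macro-F1) over the four top-level classes."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    gold_top = [top_level(s) for s in gold]
    pred_top = [top_level(s) for s in pred]

    accuracy = sum(g == p for g, p in zip(gold_top, pred_top)) / len(gold_top)

    f1_scores = []
    for cls in TOP_LEVEL_CLASSES:
        tp = sum(g == cls and p == cls for g, p in zip(gold_top, pred_top))
        fp = sum(g != cls and p == cls for g, p in zip(gold_top, pred_top))
        fn = sum(g == cls and p != cls for g, p in zip(gold_top, pred_top))
        denom = 2 * tp + fp + fn
        # A class with no gold or predicted instances scores 0 in this sketch.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy labels, not real system output.
gold = ["Contingency.Cause.Reason", "Expansion.Conjunction",
        "Comparison.Contrast", "Temporal.Asynchronous.Precedence"]
pred = ["Contingency.Cause.Result", "Expansion.Conjunction",
        "Expansion.Conjunction", "Temporal.Synchronous"]

accuracy, macro_f1 = four_way_scores(gold, pred)
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
```

In the toy data the first prediction has the wrong subtype but the right top-level class, so it counts as correct at four-way granularity. Macro-F1 is shown alongside accuracy because the top-level classes are unevenly distributed, which lets a majority-class guesser score well on accuracy alone.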

Impact on Computational Discourse

PDTB has profoundly influenced computational discourse research. Its connective-based annotation yields cleaner, more reliable annotations than full tree-based approaches: as reported for PDTB 2.0, annotators agreed on the arguments of explicit connectives in over 90% of cases, and sense agreement was about 94% at the top (class) level, falling to about 80% at the finest (subtype) level. The framework's simplicity has made it the preferred resource for training and evaluating shallow discourse parsers, which identify connectives, extract their arguments, and classify relation senses, either as a pipeline or with a joint model.
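The three pipeline stages of a shallow discourse parser can be illustrated with a deliberately naive sketch. Everything here is an assumption made for illustration: the three-entry lexicon, its default senses, the comma-based argument heuristic, and the function names. Real parsers use trained models at each stage, and in particular must decide whether a candidate word is functioning as a discourse connective at all.

```python
import re
from typing import Optional, Tuple

# Hypothetical mini-lexicon: connective -> assumed default sense.
# These defaults are illustrative, not statistics drawn from PDTB.
TOY_LEXICON = {
    "because": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "although": "Comparison.Concession",
}
_CONNECTIVE_RE = re.compile(r"\b(" + "|".join(TOY_LEXICON) + r")\b", re.IGNORECASE)


def find_connective(sentence: str) -> Optional[re.Match]:
    """Stage 1: locate the first lexicon connective (no disambiguation)."""
    return _CONNECTIVE_RE.search(sentence)


def extract_arguments(sentence: str, match: re.Match) -> Tuple[str, str]:
    """Stage 2: return (arg1, arg2); Arg2 is the clause attached to the connective."""
    before = sentence[:match.start()].strip(" ,.;")
    after = sentence[match.end():].strip(" ,.;")
    if before:
        # Sentence-medial connective: Arg1 precedes it, Arg2 follows it.
        return before, after
    # Sentence-initial connective ("Although X, Y"): Arg2 runs to the first comma.
    arg2, _, arg1 = after.partition(",")
    return arg1.strip(), arg2.strip()


def parse(sentence: str) -> Optional[dict]:
    """Run the three stages in sequence; return None if no connective is found."""
    match = find_connective(sentence)
    if match is None:
        return None
    arg1, arg2 = extract_arguments(sentence, match)
    connective = match.group(1).lower()
    # Stage 3: sense classification by lexicon lookup.
    return {"connective": connective, "arg1": arg1, "arg2": arg2,
            "sense": TOY_LEXICON[connective]}


print(parse("The company cut its forecast because demand had weakened."))
print(parse("Although sales rose, profits fell."))
```

A pipeline like this passes each stage's errors on to the next, which is one motivation for the joint models mentioned above.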

The theoretical implications of PDTB extend beyond annotation convenience. By treating discourse relations as predicate-argument structures anchored to lexical items, PDTB connects discourse analysis to the broader tradition of lexical semantics and argument structure. The distribution of explicit versus implicit relations across genres provides insights into how writers modulate coherence signaling for different audiences. PDTB data has also been used to study the relationship between syntactic structure and discourse organization, revealing systematic patterns in how discourse connectives interact with clause structure and information packaging.
