Discourse parsing is the task of automatically recovering the discourse structure of a text — identifying elementary discourse units, determining how they relate to one another, and assembling these relations into a coherent representation. Like syntactic parsing at the sentence level, discourse parsing must resolve ambiguities, handle long-range dependencies, and balance local and global structural constraints. However, discourse parsing operates over much larger spans of text and relies on subtler linguistic cues, making it substantially more challenging than its sentential counterpart.
Pipeline Architecture
Stage 1: Segmentation — split the text into elementary discourse units
Stage 2: Tree Building — determine attachment structure
Stage 3: Relation Labeling — classify relation types
Joint objective: T* = argmax_T Π_{(i,j,r)∈T} P(r | uᵢ, uⱼ, context)
Most discourse parsers follow a pipeline architecture. The first stage segments the input text into elementary discourse units, typically clause-level spans. This is treated as a sequence labeling or boundary detection problem and is now largely solved for English, with neural models achieving F1 scores above 95%. The second stage determines the tree structure — which spans should be connected and with what nuclearity. The third stage labels each connection with a rhetorical relation. Pipeline errors propagate, so joint models that simultaneously determine structure and labels have been explored, though they add considerable complexity.
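The joint objective above is a product of per-attachment probabilities, so in practice it is evaluated in log space. The sketch below scores one candidate tree under that objective; the `prob` callable stands in for a trained relation classifier (its name and signature are assumptions for illustration).

```python
import math
from typing import Callable, Iterable, Tuple

# An attachment in a candidate tree: (EDU index i, EDU index j, relation label).
Edge = Tuple[int, int, str]

def tree_log_score(tree: Iterable[Edge],
                   prob: Callable[[int, int, str], float]) -> float:
    """Sum of log P(r | u_i, u_j) over the tree's labeled attachments.

    Working in log space avoids numerical underflow on long documents,
    and since log is monotonic, the argmax over trees is unchanged.
    """
    return sum(math.log(prob(i, j, r)) for (i, j, r) in tree)
```

A joint parser would search over candidate trees for the one maximizing this score, rather than committing to a structure before labeling it, which is precisely how it avoids pipeline error propagation at the cost of a much larger search space.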
Approaches and Methods
Early discourse parsers relied on hand-crafted rules exploiting cue phrases, syntactic patterns, and positional features. The shift-reduce parser of Marcu (1999) introduced a transition-based approach adapted from syntactic parsing. Later statistical parsers employed CRFs, SVMs, and maximum entropy models with features derived from syntax trees, entity chains, and lexical information. The HILDA system (Hernault et al., 2010) combined SVM-based structure prediction with relation classification, establishing strong baselines on the RST-DT corpus.
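The transition system underlying shift-reduce discourse parsing can be made concrete with a small sketch. In a real parser a trained classifier picks each action; here a gold action sequence is replayed to show the mechanics. The `Node` class and action encoding are illustrative assumptions, not Marcu's original formulation.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class Node:
    """A discourse subtree: leaves hold EDU text, internal nodes a relation."""
    relation: Optional[str] = None
    children: Tuple["Node", ...] = ()
    text: str = ""

def shift_reduce_parse(edus: List[str],
                       actions: List[Tuple[str, Optional[str]]]) -> Node:
    """Apply SHIFT/REDUCE actions to a buffer of EDUs, yielding one tree."""
    stack: List[Node] = []
    buffer: List[Node] = [Node(text=e) for e in edus]
    for act, rel in actions:
        if act == "SHIFT":
            # Move the next EDU from the buffer onto the stack.
            stack.append(buffer.pop(0))
        elif act == "REDUCE":
            # Merge the top two subtrees under the relation `rel`.
            right, left = stack.pop(), stack.pop()
            stack.append(Node(relation=rel, children=(left, right)))
    assert len(stack) == 1 and not buffer, "action sequence must yield one tree"
    return stack[0]
```

With n EDUs, any complete parse uses exactly n SHIFTs and n-1 REDUCEs, which is what makes the transition-based formulation linear-time and attractive for long documents.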
Neural approaches have substantially advanced discourse parsing. The model of Ji and Eisenstein (2014) introduced representation learning for discourse, using distributed representations of text spans rather than discrete features. Subsequent work has employed hierarchical attention (Li et al., 2016), pointer networks for top-down tree construction (Lin et al., 2019), and pre-trained language models fine-tuned for discourse relation classification. These approaches have pushed RST parsing F1 from the low 50s to above 60% for fully labeled trees, though a significant gap remains compared to human performance.
Shallow Discourse Parsing
In contrast to full RST parsing, shallow discourse parsing — as formalized in the CoNLL 2015 and 2016 shared tasks — focuses on identifying individual discourse relations without building global tree structures. Following the PDTB framework, shallow parsers identify connectives, extract their Arg1 and Arg2 spans, and classify the relation sense. This task decomposition makes the problem more tractable and has attracted substantial research attention. Argument extraction, particularly for Arg1 which can appear in non-adjacent sentences, remains a key challenge.
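The first step of this decomposition, connective identification, can be sketched with a lexicon lookup. The tiny connective list and sense labels below are stand-ins for demonstration; an actual system would use the full PDTB connective inventory, disambiguate discourse vs. non-discourse uses of each candidate, and classify senses with a trained model.

```python
import re
from typing import List, Tuple

# Toy lexicon mapping connectives to coarse PDTB-style senses
# (illustrative subset, not the real PDTB inventory).
CONNECTIVES = {
    "because": "Contingency.Cause",
    "however": "Comparison.Contrast",
    "then": "Temporal.Asynchronous",
}

def find_connectives(sentence: str) -> List[Tuple[str, str]]:
    """Return (connective, coarse sense) pairs spotted in the sentence."""
    hits = []
    for token in re.findall(r"[a-z]+", sentence.lower()):
        if token in CONNECTIVES:
            hits.append((token, CONNECTIVES[token]))
    return hits
```

Downstream stages would then extract the Arg1 and Arg2 spans around each identified connective; as the text notes, locating Arg1 is the harder problem because it need not be adjacent to the connective's sentence.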
Cross-lingual discourse parsing has expanded the field beyond English. Parsers have been developed for Chinese, Spanish, German, Portuguese, and other languages, often leveraging multilingual pre-trained models to transfer discourse knowledge across languages. Cross-framework evaluation, comparing RST-based and PDTB-based parsers on common downstream tasks, has revealed complementary strengths: RST trees capture global document organization while PDTB annotations provide finer-grained local relation classification. Hybrid approaches that combine both perspectives represent a promising direction for comprehensive discourse analysis.