
Statistical Parsing

Statistical parsing uses probabilistic models trained on treebanks to resolve syntactic ambiguity, with lexicalized and latent-variable models achieving high accuracy on standard benchmarks.

t* = argmax_{t} P(t | w_1...w_n) = argmax_{t} P(t) = argmax_{t} ∏_{r ∈ t} q(r)

Statistical parsing encompasses a broad family of approaches that use probabilistic models, estimated from annotated treebanks, to assign the most likely syntactic structure to a sentence. The fundamental insight is that syntactic ambiguity, which leaves an unweighted grammar unable to choose among the exponentially many legal parses of a sentence, can be resolved by learning from data which structures are more likely. Statistical parsers select the parse tree that maximizes the probability (or score) assigned by a learned model; since every candidate tree yields the same sentence, maximizing P(t | w_1...w_n) is equivalent to maximizing P(t).
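The selection rule above can be sketched with the Viterbi CKY recurrence for a PCFG in Chomsky normal form. The toy grammar, rule probabilities, and sentence below are illustrative inventions, not from the article:

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form: binary rules A -> B C and lexical
# rules A -> w, each with probability q(r).  Illustrative values only.
binary = {          # (B, C) -> list of (A, prob)
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"):  [("VP", 1.0)],
    ("DT", "N"):  [("NP", 0.6)],
}
lexical = {         # word -> list of (A, prob)
    "the": [("DT", 1.0)],
    "dog": [("N", 0.5), ("NP", 0.2)],
    "saw": [("V", 1.0)],
    "cat": [("N", 0.5), ("NP", 0.2)],
}

def cky_viterbi(words):
    """Return max_t P(t) over trees rooted in S, via the CKY recurrence:
    best(i, j, A) = max over splits k and rules A -> B C of
    q(A -> B C) * best(i, k, B) * best(k, j, C)."""
    n = len(words)
    chart = defaultdict(float)                 # (i, j, A) -> best inside score
    for i, w in enumerate(words):
        for a, p in lexical.get(w, []):
            chart[i, i + 1, a] = p
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (b, c), rules in binary.items():
                    left, right = chart[i, k, b], chart[k, j, c]
                    if left and right:
                        for a, p in rules:
                            score = p * left * right
                            if score > chart[i, j, a]:
                                chart[i, j, a] = score
    return chart[0, n, "S"]

print(cky_viterbi("the dog saw the cat".split()))
```

A real parser would also store backpointers to recover the argmax tree, not just its probability.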

Lexicalized Parsing Models

Collins Model 1 (Head-Driven) P(tree) = ∏ P(H_i | P_i, h_i, t_i)
  × ∏ P(L_j, l_j, tl_j | P_i, H_i, h_i, t_i)
  × ∏ P(R_k, r_k, tr_k | P_i, H_i, h_i, t_i)

P_i = parent nonterminal, H_i = head child
h_i = head word, t_i = head tag
L_j, R_k = left/right modifier nonterminals; l_j, r_k their head words; tl_j, tr_k their head tags

The most influential statistical parsing models are the lexicalized models of Collins (1997, 2003) and Charniak (1997, 2000). These extend PCFGs by conditioning rule probabilities on the head word of each constituent, capturing lexical selectional preferences (e.g., that "eat" takes food-related objects). Collins' models generate modifiers conditionally on the head, using smoothed backoff estimation to handle data sparsity. Charniak's model uses a generative process with maximum-entropy-inspired features.
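The smoothed backoff estimation can be sketched as deleted interpolation over increasingly general conditioning contexts. The weighting formula below (lambda = c / (c + alpha) with alpha = 5) is a simplified stand-in for Collins's actual weights, which also account for context diversity:

```python
def interpolate(levels, alpha=5.0):
    """Backoff estimate for P(event | context).  Each level is a pair
    (count of event in context, count of context), ordered from the
    most specific context (e.g. conditioned on the head word) to the
    least specific (e.g. conditioned on the head tag only).
    lambda = c / (c + alpha) trusts a level more the more often its
    context was observed.  alpha=5 is an illustrative constant."""
    estimate = 0.0
    for event_count, context_count in reversed(levels):  # general -> specific
        if context_count == 0:
            continue                      # unseen context: keep backoff value
        lam = context_count / (context_count + alpha)
        mle = event_count / context_count
        estimate = lam * mle + (1 - lam) * estimate
    return estimate

# Specific context seen 3 times, backoff contexts seen 100 and 2000 times.
print(interpolate([(2, 3), (40, 100), (400, 2000)]))
```

The rarely seen specific context gets a small weight, so the estimate stays close to the well-attested backoff distributions.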

Latent-Variable Parsing

An alternative to lexicalization is to automatically refine nonterminal categories using latent variables. The Berkeley Parser (Petrov et al., 2006) starts with a bare treebank grammar and iteratively splits each nonterminal into subcategories using EM, then merges back splits that do not improve likelihood. This split-merge procedure learns fine-grained distinctions (e.g., splitting NP into pronouns, proper nouns, and common noun phrases) without hand-engineering. The resulting latent-variable PCFG achieves over 90% F1, rivaling lexicalized models with a simpler and faster architecture.
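The split half of the procedure can be sketched as follows. This toy version only doubles each nonterminal and jitters the divided probability mass so a later EM pass could pull the subcategories apart; the real parser also runs EM, merges low-gain splits, and smooths:

```python
import itertools
import random

def split_grammar(rules, rng=random.Random(0), noise=0.01):
    """One split step: every nonterminal X becomes X-0 and X-1.
    A rule (A, (B, C)) with probability p yields sub-rules
    A-i -> B-j C-k averaging p / 2**len(rhs), with small random
    jitter: perfectly symmetric splits would never differentiate
    under EM.  Toy sketch of the Berkeley split step only."""
    new_rules = {}
    for (lhs, rhs), p in rules.items():
        for li in (0, 1):
            for comb in itertools.product((0, 1), repeat=len(rhs)):
                sub_lhs = f"{lhs}-{li}"
                sub_rhs = tuple(f"{s}-{c}" for s, c in zip(rhs, comb))
                jitter = 1.0 + noise * (rng.random() - 0.5)
                new_rules[sub_lhs, sub_rhs] = p / 2 ** len(rhs) * jitter
    return new_rules

grammar = {("S", ("NP", "VP")): 1.0, ("NP", ("DT", "N")): 0.6}
split = split_grammar(grammar)
```

Each binary rule expands into eight sub-rules (two choices for each of the three symbols), and the mass under each split parent still sums to roughly the original rule probability.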

Coarse-to-Fine Parsing

Charniak et al. (2006) introduced coarse-to-fine parsing, which first parses with a simple grammar to prune unlikely constituents, then reparses the pruned chart with a richer model. This hierarchical pruning scheme provides a 20x speedup with no loss in accuracy and has become standard practice in statistical parsing.
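The pruning idea can be sketched as follows, assuming a coarse pass has already produced posterior probabilities for chart items; the projection map, labels, and threshold are illustrative assumptions, not values from the cited work:

```python
def prune(coarse_posteriors, threshold=1e-4):
    """Chart items (start, end, coarse_label) whose coarse-pass
    posterior clears the pruning threshold."""
    return {item for item, post in coarse_posteriors.items()
            if post >= threshold}

def fine_item_allowed(i, j, fine_label, survivors, projection):
    """The fine model scores an item only if its coarse projection
    (e.g. NP-1 collapsed back to NP) survived pruning."""
    return (i, j, projection[fine_label]) in survivors

# Illustrative coarse-pass posteriors for two spans.
posteriors = {(0, 2, "NP"): 0.9, (1, 3, "VP"): 2e-7}
survivors = prune(posteriors)
```

Because the fine pass touches only surviving cells, most of the fine grammar's (much larger) rule set is never evaluated.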

Evaluation and Impact

Statistical parsers are evaluated on the Penn Treebank WSJ section 23 using labeled precision, recall, and F1 of bracketed constituents. Key milestones include Collins (1997) at 88.1% F1, Charniak (2000) at 89.5%, the Berkeley Parser at 90.1%, and Charniak and Johnson (2005) with discriminative reranking at 91.0%. These models dominated parsing research for over a decade and established the methodology of treebank-trained probabilistic parsing that continues to underpin modern neural approaches.
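The bracketing metric can be sketched as follows; this simplified version of the PARSEVAL/evalb computation treats constituents as (label, start, end) triples and omits the standard punctuation and root-label adjustments:

```python
from collections import Counter

def bracket_scores(gold, pred):
    """Labeled precision, recall, and F1 over bracketed constituents.
    Counter intersection (&) gives the multiset of matched brackets,
    so duplicate constituents are counted at most once per occurrence."""
    g, p = Counter(gold), Counter(pred)
    matched = sum((g & p).values())
    precision = matched / sum(p.values()) if p else 0.0
    recall = matched / sum(g.values()) if g else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative trees: the parser mislabels one constituent (PP vs. NP).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(bracket_scores(gold, pred))
```

Three of four brackets match in both directions, so precision, recall, and F1 are all 0.75 for this pair.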


References

  1. Collins, M. (2003). Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 589–637. https://doi.org/10.1162/089120103322753356
  2. Charniak, E. (2000). A maximum-entropy-inspired parser. Proceedings of NAACL 2000, 132–139. https://aclanthology.org/A00-2018
  3. Petrov, S., Barrett, L., Thibaux, R., & Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. Proceedings of COLING-ACL 2006, 433–440. https://doi.org/10.3115/1220175.1220230
  4. Charniak, E., & Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd Annual Meeting of the ACL, 173–180. https://doi.org/10.3115/1219840.1219862
