
Unsupervised Parsing

Unsupervised parsing seeks to recover syntactic tree structures from raw text without any annotated training data, relying on distributional regularities and structural priors to discover latent grammatical organization.

z* = argmax_z P(z | x; θ), where z is a latent tree and x is the observed sentence

Unsupervised parsing is the task of inducing syntactic tree structures (constituency or dependency) from unannotated text. Unlike supervised parsing, which learns from treebanks of annotated sentences, unsupervised parsing must discover syntactic structure from the distributional patterns in raw text alone. This is arguably one of the hardest problems in NLP, as syntactic structure is only indirectly reflected in surface word sequences. Progress has been slow but steady, with recent neural approaches achieving significantly better results than classical methods.
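
As an illustration of the argmax over latent trees, the sketch below enumerates every binary bracketing of a short sentence and picks the highest-scoring one under a toy scoring function. The balance-rewarding scorer is invented purely for illustration, as a stand-in for log P(z | x; θ):

```python
# All binary bracketings of a token sequence, as nested (left, right) tuples.
def binary_trees(tokens):
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    for split in range(1, len(tokens)):
        for left in binary_trees(tokens[:split]):
            for right in binary_trees(tokens[split:]):
                trees.append((left, right))
    return trees

def size(tree):
    return 1 if isinstance(tree, str) else size(tree[0]) + size(tree[1])

# Toy scorer (invented): rewards trees whose subtrees are balanced.
def toy_score(tree):
    if isinstance(tree, str):
        return 0.0
    left, right = tree
    return toy_score(left) + toy_score(right) - abs(size(left) - size(right))

sentence = "the dog barked loudly".split()
best = max(binary_trees(sentence), key=toy_score)
```

Exhaustive enumeration is only feasible for very short sentences; real systems use dynamic programming (e.g., CKY-style inference) over the same search space.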

Unsupervised Constituency Parsing

Evaluation

Unsupervised parsing is evaluated against gold treebank trees using unlabeled bracketing F1 (UF1): the harmonic mean of precision and recall over predicted span brackets, ignoring constituent labels (which are not meaningful without supervision).

Right-branching baseline: ~39% UF1 on English PTB
Left-branching baseline: ~9% UF1 on English PTB
CCL (Seginer, 2007): ~71% UF1
Neural PCFG models: ~55–65% UF1
PRPN / Ordered Neurons: ~47–55% UF1
(Reported figures are not directly comparable: older systems were often evaluated on short-sentence subsets such as WSJ10, while recent neural models are typically scored on the full PTB test set.)
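
UF1 itself is straightforward to compute. The sketch below extracts span brackets from trees encoded as nested tuples and scores a right-branching prediction against a hypothetical gold tree; note that published evaluations differ on conventions such as whether the whole-sentence span counts:

```python
def spans(tree, start=0):
    """Constituent spans (i, j) of a nested-tuple tree, excluding single words."""
    if isinstance(tree, str):
        return set(), start + 1
    result, pos = set(), start
    for child in tree:
        child_spans, pos = spans(child, pos)
        result |= child_spans
    result.add((start, pos))
    return result, pos

def unlabeled_f1(pred_tree, gold_tree):
    pred, _ = spans(pred_tree)
    gold, _ = spans(gold_tree)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def right_branching(tokens):
    """The right-branching baseline tree over a token sequence."""
    return tokens[0] if len(tokens) == 1 else (tokens[0], right_branching(tokens[1:]))

gold = (("the", "dog"), ("barked", "loudly"))   # hypothetical gold tree
pred = right_branching("the dog barked loudly".split())
```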

Unsupervised constituency parsing has been approached through both grammar-based and neural methods. Grammar-based approaches include the inside-outside algorithm for PCFGs (Baker, 1979), the constituent-context model (CCM; Klein and Manning, 2002), and Bayesian models with structural priors. A key finding is that naive EM on PCFGs does not recover meaningful structure; strong inductive biases are needed. The CCM was a breakthrough, achieving ~71% F1 on the WSJ10 subset (sentences of up to ten words) by modeling the co-occurrence of constituent yields and the contexts that surround them. On the neural side, compound PCFGs (Kim et al., 2019) parameterize rule probabilities with neural networks and a continuous per-sentence latent variable, substantially improving over earlier neural models such as PRPN.
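
The quantity at the heart of the inside-outside algorithm is the inside probability. A minimal sketch of the inside pass for a toy PCFG in Chomsky normal form (all rules and probabilities are invented for illustration):

```python
from collections import defaultdict

binary_rules = {              # A -> B C : probability (illustrative only)
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
}
lexical_rules = {             # A -> w : probability (illustrative only)
    ("Det", "the"): 1.0,
    ("N", "dog"): 0.5,
    ("N", "cat"): 0.5,
    ("V", "chased"): 1.0,
}

def inside(words):
    """Inside probabilities beta[(i, j, A)] = P(A derives words[i:j])."""
    n = len(words)
    beta = defaultdict(float)
    for i, w in enumerate(words):                       # width-1 spans from the lexicon
        for (A, word), p in lexical_rules.items():
            if word == w:
                beta[(i, i + 1, A)] += p
    for width in range(2, n + 1):                       # wider spans bottom-up
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (A, (B, C)), p in binary_rules.items():
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta

sent = "the dog chased the cat".split()
prob = inside(sent)[(0, len(sent), "S")]
```

Inside-outside EM combines this pass with a symmetric outside pass to compute expected rule counts, then re-estimates rule probabilities from those counts.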

Unsupervised Dependency Parsing

Unsupervised dependency parsing has seen notable progress through the Dependency Model with Valence (DMV) of Klein and Manning (2004), which conditions a head's probability of stopping on the attachment direction and on whether the head has already generated dependents in that direction (its valence). The DMV, trained with EM, was the first model to outperform a left-chain baseline on English. Subsequent work improved results through better initialization (using POS tag clustering), structural annealing, and Bayesian priors; neural extensions parameterize the DMV's distributions with neural networks for better generalization.
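
The DMV's generative story can be sketched as a sampler over POS tags: a head repeatedly decides whether to stop or to generate another dependent in each direction, with the stop probability conditioned on the head, the direction, and whether a dependent has already been generated in that direction. All probabilities below are invented; a real DMV estimates them with EM:

```python
import random

P_STOP = {  # (head, direction, has_dependents) -> P(stop); illustrative values
    ("V", "left", False): 0.3, ("V", "left", True): 0.9,
    ("V", "right", False): 0.4, ("V", "right", True): 0.9,
    ("N", "left", False): 0.5, ("N", "left", True): 0.95,
    ("N", "right", False): 0.95, ("N", "right", True): 0.99,
    ("D", "left", False): 1.0, ("D", "left", True): 1.0,
    ("D", "right", False): 1.0, ("D", "right", True): 1.0,
}
P_CHOOSE = {  # (head, direction) -> distribution over dependent tags
    ("V", "left"): {"N": 1.0}, ("V", "right"): {"N": 1.0},
    ("N", "left"): {"D": 1.0}, ("N", "right"): {"N": 1.0},
}

def generate(head, rng, depth=0):
    """Generate (head, left_deps, right_deps); depth bound keeps the toy finite."""
    left, right = [], []
    for direction, out in (("left", left), ("right", right)):
        has_dep = False
        while depth < 3 and rng.random() > P_STOP[(head, direction, has_dep)]:
            tags, probs = zip(*P_CHOOSE[(head, direction)].items())
            dep = rng.choices(tags, probs)[0]
            out.append(generate(dep, rng, depth + 1))
            has_dep = True  # valence: stopping becomes more likely after a dependent
    return (head, left, right)

tree = generate("V", random.Random(0))
```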

Probing Neural Language Models

An emerging line of research probes pre-trained language models (BERT, GPT) for implicit syntactic knowledge. Structural probes (Hewitt & Manning, 2019) show that dependency trees can be approximately recovered from BERT's hidden representations via a learned linear transformation, suggesting that these models learn substantial syntactic structure without explicit syntactic supervision.
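
The probe's decoding step can be sketched as follows: pairwise distances are predicted as squared norms of linearly transformed vector differences, and a tree is read off as a minimum spanning tree over those distances. The vectors and the transformation below are random stand-ins; the actual probe learns B by gradient descent against gold tree distances:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, rank = 5, 16, 4
H = rng.normal(size=(n, dim))      # stand-in for hidden states h_1..h_n
B = rng.normal(size=(rank, dim))   # the probe's linear transformation (here random)

def probe_distance(i, j):
    """Predicted tree distance: squared L2 norm of B(h_i - h_j)."""
    diff = B @ (H[i] - H[j])
    return float(diff @ diff)

def mst(n, dist):
    """Minimum spanning tree over nodes 0..n-1 (Prim's algorithm)."""
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((i, j))
        in_tree.add(j)
    return edges

edges = mst(n, probe_distance)
```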

Challenges and Prospects

Unsupervised parsing remains far below supervised methods in accuracy. The fundamental challenge is that there are many possible tree structures consistent with the surface statistics of text, and the linguistically correct one is not always the most statistically salient. Evaluation itself is problematic: the unlabeled bracketing metric may not capture all aspects of the structures these models learn, and there is debate about whether unsupervised models should be evaluated against linguist-designed conventions. Despite these challenges, unsupervised parsing provides insights into the information content of raw text and the inductive biases needed for language acquisition.
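
The scale of this ambiguity is easy to quantify: the number of distinct binary bracketings of an n-word sentence is the Catalan number C(n−1), which grows exponentially with sentence length:

```python
from math import comb

def num_binary_trees(n_words):
    """Catalan number C(n_words - 1): count of binary trees over n_words leaves."""
    n = n_words - 1
    return comb(2 * n, n) // (n + 1)

print(num_binary_trees(10))   # 4862
print(num_binary_trees(20))   # 1767263190
```

A 10-word sentence already has 4,862 candidate binary trees, and a 20-word sentence has over 1.7 billion, which is why unsupervised parsers must rely on strong priors rather than anything resembling exhaustive comparison.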

References

  1. Klein, D., & Manning, C. D. (2002). A generative constituent-context model for improved grammar induction. Proceedings of ACL 2002, 128–135. https://doi.org/10.3115/1073083.1073106
  2. Klein, D., & Manning, C. D. (2004). Corpus-based induction of syntactic structure: Models of dependency and constituency. Proceedings of ACL 2004, 478–485. https://doi.org/10.3115/1218955.1219016
  3. Hewitt, J., & Manning, C. D. (2019). A structural probe for finding syntax in word representations. Proceedings of NAACL-HLT 2019, 4129–4138. https://doi.org/10.18653/v1/N19-1419
  4. Kim, Y., Dyer, C., & Rush, A. M. (2019). Compound probabilistic context-free grammars for grammar induction. Proceedings of ACL 2019, 2369–2385. https://doi.org/10.18653/v1/P19-1228
