
Chunking

Chunking (shallow parsing) segments a sentence into non-overlapping, non-recursive phrasal groups such as noun phrases, verb phrases, and prepositional phrases without building a full parse tree.

[NP The big cat] [VP sat] [PP on] [NP the mat]  (IOB2 tags: B-NP I-NP I-NP B-VP B-PP B-NP I-NP)

Chunking, also called shallow parsing, identifies the flat, non-recursive phrasal constituents in a sentence without determining their internal hierarchical structure or their attachment to other phrases. The output is a sequence of non-overlapping chunks, each labeled with a phrase type (NP, VP, PP, ADJP, ADVP, etc.). Chunking provides a middle ground between full parsing and simple POS tagging: it captures basic phrasal groupings that are useful for information extraction and other downstream tasks, at much lower computational cost than full parsing.

IOB Tagging Scheme

IOB Encoding
B-X: beginning of a chunk of type X
I-X: inside (continuation of) a chunk of type X
O: outside any chunk

Example: [NP The/B-NP big/I-NP cat/I-NP] [VP sat/B-VP] [PP on/B-PP] [NP the/B-NP mat/I-NP]

Variants: IOB1 (B used only where a chunk directly follows another chunk of the same type),
IOB2 (B at every chunk start), IOBES (adds E = chunk end, S = single-token chunk)
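Converting between these variants is mechanical. A minimal sketch of the IOB2-to-IOBES conversion (the function name is illustrative; assumes a well-formed IOB2 input, where every I-X follows a B-X or I-X of the same type):

```python
def iob2_to_iobes(tags):
    """Convert an IOB2 tag sequence to IOBES.

    A chunk's last token becomes E-X, or S-X if the chunk is a singleton.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        ends = not nxt.startswith("I-")  # next token does not continue this chunk
        if prefix == "B":
            out.append(("S-" if ends else "B-") + label)
        else:  # prefix == "I"
            out.append(("E-" if ends else "I-") + label)
    return out

print(iob2_to_iobes(["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]))
# ['B-NP', 'I-NP', 'E-NP', 'S-VP', 'S-PP', 'B-NP', 'E-NP']
```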

Chunking is typically formulated as a sequence labeling problem using the IOB (Inside-Outside-Beginning) tagging scheme. Each word is assigned a tag indicating whether it begins a chunk (B-X), continues a chunk (I-X), or is outside all chunks (O). This reduces chunking to a tagging problem that can be solved with the same models used for POS tagging: HMMs, CRFs, or neural sequence labelers. The IOBES variant adds End and Single tags for improved boundary detection.
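Recovering chunks from a tagger's output is a single linear scan over the tag sequence. A sketch of an IOB2 decoder (names are illustrative; assumes well-formed IOB2 input where every chunk starts with B-X):

```python
def iob2_chunks(tokens, tags):
    """Decode an IOB2 tag sequence into labeled chunk spans.

    Returns (label, start, end) triples over token indices, end exclusive.
    """
    chunks, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or tag == "O":
            if start is not None:          # close the currently open chunk
                chunks.append((label, start, i))
                start = None
            if tag.startswith("B-"):       # open a new chunk here
                start, label = i, tag[2:]
        # an I-X tag simply continues the open chunk; nothing to do
    if start is not None:                  # close a chunk that runs to the end
        chunks.append((label, start, len(tags)))
    return chunks

tokens = ["The", "big", "cat", "sat", "on", "the", "mat"]
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(iob2_chunks(tokens, tags))
# [('NP', 0, 3), ('VP', 3, 4), ('PP', 4, 5), ('NP', 5, 7)]
```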

Methods and Applications

The CoNLL-2000 shared task established chunking as a standard NLP benchmark. Early systems used hand-written rules and transformation-based learning. Statistical approaches based on SVMs and conditional random fields pushed F1 scores above 94%, and modern neural systems, whether BiLSTM-CRF architectures or pre-trained Transformers, exceed 96% F1 on the CoNLL-2000 test set.
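The F1 scores above are chunk-level, in the style of the CoNLL evaluation: a predicted chunk counts as correct only if its label and both boundaries match a gold chunk exactly. A minimal scoring sketch (assumes chunks are given as sets of (label, start, end) triples, e.g. from an IOB decoder):

```python
def chunk_f1(gold, pred):
    """Chunk-level precision, recall, and F1 by exact span-and-label match."""
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("NP", 0, 3), ("VP", 3, 4), ("PP", 4, 5), ("NP", 5, 7)}
pred = {("NP", 0, 3), ("VP", 3, 4), ("PP", 4, 5), ("NP", 5, 6)}  # one bad boundary
print(chunk_f1(gold, pred))
# (0.75, 0.75, 0.75)
```

Note that a chunk with one wrong boundary token scores zero, which is why chunk F1 is stricter than per-token tagging accuracy.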

NP Chunking
NP chunking (base NP detection) was the first and most studied chunking task. Base NPs are the non-recursive noun phrases that do not contain other NPs. Identifying base NPs is a critical first step for information extraction, since named entities and key concepts typically appear within NPs.
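Base-NP detection can be sketched as a greedy pattern match over POS tags, e.g. an optional determiner, any adjectives, then one or more nouns. A toy rule in this spirit (the pattern and the Penn Treebank tagset are illustrative, not a complete chunker):

```python
def base_np_chunks(tagged):
    """Greedy base-NP detection over (word, POS) pairs.

    Matches the pattern DT? JJ* NN+ (Penn Treebank tags) left to right
    and returns the matched word spans.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DT":          # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):  # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):  # one or more nouns
            k += 1
        if k > j:                                   # at least one noun: emit chunk
            chunks.append([w for w, _ in tagged[i:k]])
            i = k
        else:
            i += 1
    return chunks

sent = [("The", "DT"), ("big", "JJ"), ("cat", "NN"), ("sat", "VBD"),
        ("on", "IN"), ("the", "DT"), ("mat", "NN")]
print(base_np_chunks(sent))
# [['The', 'big', 'cat'], ['the', 'mat']]
```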

Chunking remains useful as a lightweight alternative to full parsing in applications where speed is critical or where full syntactic structure is not needed. It serves as a preprocessing step for relation extraction, question answering, and text mining. The IOB tagging formulation has also been widely adopted beyond chunking, serving as the standard encoding for named entity recognition, slot filling, and other span-identification tasks.

References

  1. Tjong Kim Sang, E. F., & Buchholz, S. (2000). Introduction to the CoNLL-2000 shared task: Chunking. Proceedings of CoNLL-2000, 127–132. https://doi.org/10.3115/1117601.1117631
  2. Ramshaw, L. A., & Marcus, M. P. (1995). Text chunking using transformation-based learning. Proceedings of the Third ACL Workshop on Very Large Corpora, 82–94. https://arxiv.org/abs/cmp-lg/9505040
  3. Kudo, T., & Matsumoto, Y. (2001). Chunking with support vector machines. Proceedings of NAACL 2001, 1–8. https://doi.org/10.3115/1073336.1073361
  4. Sha, F., & Pereira, F. (2003). Shallow parsing with conditional random fields. Proceedings of HLT-NAACL 2003, 213–220. https://doi.org/10.3115/1073445.1073473
