Chunking, also called shallow parsing, identifies the flat, non-recursive phrasal constituents in a sentence without determining their internal hierarchical structure or their attachment to other phrases. The output is a sequence of non-overlapping chunks, each labeled with a phrase type (NP, VP, PP, ADJP, ADVP, etc.). Chunking provides a middle ground between full parsing and simple POS tagging: it captures basic phrasal groupings that are useful for information extraction and other downstream tasks, at much lower computational cost than full parsing.
IOB Tagging Scheme
B-X: beginning of a chunk of type X
I-X: inside (continuation of) a chunk of type X
O: outside any chunk
Example: [NP The/B-NP big/I-NP cat/I-NP] [VP sat/B-VP] [PP on/B-PP] [NP the/B-NP mat/I-NP]
Variants: IOB1 (B only at chunk boundaries between same-type chunks),
IOB2 (B at every chunk start), IOBES (adds E=end, S=singleton)
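The scheme above can be made concrete with a short decoder. The sketch below (function name and span format are illustrative, not from any standard library) converts a sequence of IOB2 tags back into labeled chunk spans, using the example sentence from the box:

```python
def iob2_to_chunks(tags):
    """Convert a list of IOB2 tags into (type, start, end) spans, end exclusive."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):            # in IOB2, every chunk starts with B-X
            if label is not None:
                chunks.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                        # continuation of the current chunk
        else:                               # O (an ill-formed I-X is treated as O)
            if label is not None:
                chunks.append((label, start, i))
            start, label = None, None
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks

# "The big cat sat on the mat"
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(iob2_to_chunks(tags))
# [('NP', 0, 3), ('VP', 3, 4), ('PP', 4, 5), ('NP', 5, 7)]
```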
Chunking is typically formulated as a sequence labeling problem using the IOB (Inside-Outside-Beginning) tagging scheme. Each word is assigned a tag indicating whether it begins a chunk (B-X), continues a chunk (I-X), or is outside all chunks (O). This reduces chunking to a tagging problem that can be solved with the same models used for POS tagging: HMMs, CRFs, or neural sequence labelers. The IOBES variant adds End and Single tags for improved boundary detection.
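The IOBES variant is a deterministic re-encoding of IOB2, so the two schemes can be converted mechanically. A minimal sketch (function name is illustrative): a B-X whose chunk contains only one token becomes S-X, and the final I-X of a chunk becomes E-X.

```python
def iob2_to_iobes(tags):
    """Re-encode IOB2 tags in the IOBES scheme."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            # B-X stays B-X only if an I-X follows; otherwise it is a singleton
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        else:  # I-X
            # I-X stays I-X only if another I-X follows; otherwise it ends the chunk
            out.append(tag if nxt == tag else "E-" + tag[2:])
    return out

print(iob2_to_iobes(["B-NP", "I-NP", "I-NP", "B-VP", "O"]))
# ['B-NP', 'I-NP', 'E-NP', 'S-VP', 'O']
```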
Methods and Applications
The CoNLL-2000 shared task established chunking as a standard NLP benchmark. Early systems used rule-based approaches and transformation-based learning. Statistical approaches using SVMs and conditional random fields achieved F1 scores above 94%. Modern neural approaches using BiLSTM-CRF architectures or pre-trained Transformers achieve F1 scores above 96% on the standard benchmark.
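The F1 scores quoted here are chunk-level, as in the CoNLL-2000 evaluation: a predicted chunk counts as correct only if its type and both boundaries match a gold chunk exactly. A minimal sketch of that metric, assuming chunks are represented as (type, start, end) spans:

```python
def chunk_f1(gold, pred):
    """Exact-match chunk F1 over (type, start, end) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # chunks matching type and boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("NP", 0, 3), ("VP", 3, 4), ("PP", 4, 5), ("NP", 5, 7)]
pred = [("NP", 0, 3), ("VP", 3, 4), ("NP", 4, 7)]   # PP missed, last NP misbounded
print(round(chunk_f1(gold, pred), 4))
# 0.5714  (2 correct of 3 predicted, 2 recalled of 4 gold)
```

Note that boundary errors are penalized twice, once in precision and once in recall, which is why IOBES-style boundary modeling can help on this metric.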
Chunking remains useful as a lightweight alternative to full parsing in applications where speed is critical or where full syntactic structure is not needed. It serves as a preprocessing step for relation extraction, question answering, and text mining. The IOB tagging formulation has also been widely adopted beyond chunking, serving as the standard encoding for named entity recognition, slot filling, and other span-identification tasks.