Chunking, also called shallow parsing, identifies the flat, non-recursive phrasal constituents in a sentence without determining their internal hierarchical structure or their attachment to other phrases. The output is a sequence of non-overlapping chunks, each labeled with a phrase type (NP, VP, PP, ADJP, ADVP, etc.). Chunking provides a middle ground between full parsing and simple POS tagging: it captures basic phrasal groupings that are useful for information extraction and other downstream tasks, at much lower computational cost than full parsing.
IOB Tagging Scheme
B-X: beginning of a chunk of type X
I-X: inside (continuation of) a chunk of type X
O: outside any chunk
Example: [NP The/B-NP big/I-NP cat/I-NP] [VP sat/B-VP] [PP on/B-PP] [NP the/B-NP mat/I-NP]
Variants: IOB1 (B only at chunk boundaries between same-type chunks),
IOB2 (B at every chunk start), IOBES (adds E=end, S=singleton)
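The scheme above can be made concrete with a short decoder. The sketch below (function name and span format are illustrative, not from any standard library) converts a sequence of IOB2 tags back into labeled chunk spans, using the example sentence from the box:

```python
def iob2_to_chunks(tags):
    """Convert a list of IOB2 tags into (type, start, end) spans, end exclusive."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):            # in IOB2, every chunk starts with B-X
            if label is not None:
                chunks.append((label, start, i))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label == tag[2:]:
            continue                        # continuation of the current chunk
        else:                               # O (an ill-formed I-X is treated as O)
            if label is not None:
                chunks.append((label, start, i))
            start, label = None, None
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks

# "The big cat sat on the mat"
tags = ["B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP"]
print(iob2_to_chunks(tags))
# [('NP', 0, 3), ('VP', 3, 4), ('PP', 4, 5), ('NP', 5, 7)]
```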
Chunking is typically formulated as a sequence labeling problem using the IOB (Inside-Outside-Beginning) tagging scheme. Each word is assigned a tag indicating whether it begins a chunk (B-X), continues a chunk (I-X), or is outside all chunks (O). This reduces chunking to a tagging problem that can be solved with the same models used for POS tagging: HMMs, CRFs, or neural sequence labelers. The IOBES variant adds End and Single tags for improved boundary detection.
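The IOBES variant is a deterministic re-encoding of IOB2, so the two schemes can be converted mechanically. A minimal sketch (function name is illustrative): a B-X whose chunk contains only one token becomes S-X, and the final I-X of a chunk becomes E-X.

```python
def iob2_to_iobes(tags):
    """Re-encode IOB2 tags in the IOBES scheme."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "O":
            out.append(tag)
        elif tag.startswith("B-"):
            # B-X stays B-X only if an I-X follows; otherwise it is a singleton
            out.append(tag if nxt == "I-" + tag[2:] else "S-" + tag[2:])
        else:  # I-X
            # I-X stays I-X only if another I-X follows; otherwise it ends the chunk
            out.append(tag if nxt == tag else "E-" + tag[2:])
    return out

print(iob2_to_iobes(["B-NP", "I-NP", "I-NP", "B-VP", "O"]))
# ['B-NP', 'I-NP', 'E-NP', 'S-VP', 'O']
```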
Methods and Applications
The CoNLL-2000 shared task established chunking as a standard NLP benchmark. Early systems used rule-based approaches and transformation-based learning. Statistical approaches using SVMs and conditional random fields achieved F1 scores above 94%. Modern neural approaches using BiLSTM-CRF architectures or pre-trained Transformers achieve F1 scores above 96% on the standard benchmark.
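The F1 scores quoted here are chunk-level, as in the CoNLL-2000 evaluation: a predicted chunk counts as correct only if its type and both boundaries match a gold chunk exactly. A minimal sketch of that metric, assuming chunks are represented as (type, start, end) spans:

```python
def chunk_f1(gold, pred):
    """Exact-match chunk F1 over (type, start, end) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # chunks matching type and boundaries
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [("NP", 0, 3), ("VP", 3, 4), ("PP", 4, 5), ("NP", 5, 7)]
pred = [("NP", 0, 3), ("VP", 3, 4), ("NP", 4, 7)]   # PP missed, last NP misbounded
print(round(chunk_f1(gold, pred), 4))
# 0.5714  (2 correct of 3 predicted, 2 recalled of 4 gold)
```

Note that boundary errors are penalized twice, once in precision and once in recall, which is why IOBES-style boundary modeling can help on this metric.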
Chunking remains useful as a lightweight alternative to full parsing in applications where speed is critical or where full syntactic structure is not needed. It serves as a preprocessing step for relation extraction, question answering, and text mining. The IOB tagging formulation has also been widely adopted beyond chunking, serving as the standard encoding for named entity recognition, slot filling, and other span-identification tasks.