
Named Entity Recognition

Named entity recognition (NER) identifies and classifies mentions of named entities in text into predefined categories such as person, organization, location, and date.

[PER Barack Obama] was born in [LOC Honolulu] and led [ORG the United States]

Named entity recognition (NER) is the task of locating and classifying named entities in unstructured text into predefined semantic categories. The most common categories are person (PER), organization (ORG), location (LOC), and miscellaneous (MISC), though domain-specific NER systems may recognize entities like gene names, drug names, chemical compounds, or legal citations. NER is a foundational component of information extraction pipelines and serves as input to relation extraction, knowledge base population, and question answering.

Sequence Labeling Formulation

NER as Sequence Labeling

IOB2 encoding:
Barack/B-PER Obama/I-PER was/O born/O in/O Honolulu/B-LOC

IOBES encoding:
Barack/B-PER Obama/E-PER was/O born/O in/O Honolulu/S-LOC

Nested NER: [ORG [LOC New York] Times]
Requires span-based or hypergraph models

Like chunking, NER is typically formulated as a sequence labeling task using IOB encoding. Each token is assigned a tag indicating whether it begins (B), continues (I), or is outside (O) an entity of a given type. This formulation works well for flat, non-overlapping entities but cannot handle nested entities (e.g., "New York" as a location inside "New York Times" as an organization). Nested NER requires span-based, hypergraph, or sequence-to-sequence approaches.
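The IOB2 decoding step described above can be sketched in a few lines. This is an illustrative helper (the function name is my own, not from any particular library): it walks a tag sequence and emits (start, end, type) spans, closing an open entity whenever it hits O, a new B- tag, or an inconsistent I- tag.

```python
def iob2_to_spans(tags):
    """Convert a list of IOB2 tags into (start, end_exclusive, type) spans."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close the previous entity
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue                       # entity continues
        else:                              # "O" or an inconsistent I- tag
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:                  # entity runs to the end of the sequence
        spans.append((start, len(tags), etype))
    return spans

tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(iob2_to_spans(tags))  # [(0, 2, 'PER'), (5, 6, 'LOC')]
```

Note that this flat decoding yields at most one span per token, which is exactly why nested entities fall outside the IOB formulation.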

Methods

NER systems have evolved through three generations. Rule-based and gazetteer-based systems used handcrafted patterns and dictionaries. Statistical systems, particularly CRFs with hand-crafted features (orthographic patterns, word shape, gazetteer membership), dominated from the mid-2000s. The current state of the art uses neural architectures: BiLSTM-CRF models (Lample et al., 2016) with character-level embeddings, and more recently, fine-tuned pre-trained language models like BERT that achieve F1 scores above 93% on the CoNLL-2003 English benchmark.
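The "word shape" feature mentioned above is one of the classic hand-crafted CRF features. A minimal sketch (the mapping convention varies between systems; this is one common variant): uppercase letters map to X, lowercase to x, digits to d, and other characters pass through, so the feature captures capitalization and digit patterns independently of the specific word.

```python
import re

def word_shape(token):
    """Map a token to its orthographic shape: X for uppercase, x for
    lowercase, d for digits; other characters pass through unchanged."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return shape

print(word_shape("Barack"))  # Xxxxxx
print(word_shape("AT&T"))    # XX&X
print(word_shape("2003"))    # dddd
```

Features like this let a CRF generalize from "Barack" to unseen capitalized tokens, which is precisely what made statistical systems competitive before neural models learned comparable patterns from character embeddings.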

Domain Adaptation
NER models trained on news text often perform poorly on biomedical, legal, or social media text due to domain shift. Domain adaptation techniques include continued pre-training on domain text, few-shot learning, active learning, and data augmentation. Specialized biomedical NER models recognize entities like genes, proteins, diseases, and chemicals.

Evaluation and Challenges

NER evaluation uses entity-level F1: a predicted entity is correct only if both its boundaries and type match the gold standard exactly. The CoNLL-2003 shared task datasets (English and German) remain the most widely used benchmarks. Key challenges include recognizing entities in informal text (social media, conversational language), handling rare and emerging entities not seen in training, multilingual and cross-lingual NER, and resolving entity ambiguity (e.g., "Washington" as person, location, or organization).
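The exact-match scoring rule above is simple to state as code. This is an illustrative sketch (function name my own): entities are (start, end, type) triples, and a prediction counts as a true positive only if an identical triple appears in the gold set, so a correct span with the wrong type scores zero.

```python
def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1 over exact-match entities.

    gold, pred: collections of (start, end, type) triples.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact boundary-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {(0, 2, "PER"), (5, 6, "LOC"), (7, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG")}  # right boundaries, wrong type on one
p, r, f = entity_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.5 0.33 0.4
```

The all-or-nothing matching is deliberately strict: partial boundary overlaps and type confusions both count as full errors, which is why entity-level F1 is noticeably lower than token-level accuracy on the same output.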

References

  1. Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. Proceedings of CoNLL-2003, 142–147. https://doi.org/10.3115/1119176.1119195
  2. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. Proceedings of NAACL-HLT 2016, 260–270. https://doi.org/10.18653/v1/N16-1030
  3. Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  4. Li, J., Sun, A., Han, J., & Li, C. (2022). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1), 50–70. https://doi.org/10.1109/TKDE.2020.2981314
