Relation extraction (RE) is the task of identifying semantic relationships between pairs of entities mentioned in text and classifying those relationships into predefined categories. Given a sentence such as "Steve Jobs co-founded Apple in Cupertino," a relation extraction system should identify the triples (Steve Jobs, co-founded, Apple) and (Apple, headquartered_in, Cupertino). Relation extraction is essential for constructing knowledge graphs, populating relational databases, and supporting question answering systems that require structured factual knowledge.
Supervised and Distant Supervision
Distant supervision: align KB triples with text.
If the KB contains (e₁, r, e₂), label every sentence mentioning both e₁ and e₂ with relation r.
Multi-instance learning (noisy-or aggregation over a bag):
P(r | S) = 1 − ∏_{s ∈ S} (1 − P(r | s))
where S is the set of sentences mentioning both e₁ and e₂.
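The noisy-or formula above says the bag expresses relation r unless every sentence fails to express it. A minimal sketch (function name is illustrative):

```python
def noisy_or(sentence_probs):
    """Bag-level probability that relation r holds, given per-sentence
    probabilities P(r | s): P(r | S) = 1 - prod_s (1 - P(r | s))."""
    prod = 1.0
    for p in sentence_probs:
        prod *= (1.0 - p)
    return 1.0 - prod

# Two sentences each giving P(r | s) = 0.5 yield a bag probability of 0.75,
# and a single confident sentence dominates the bag.
print(noisy_or([0.5, 0.5]))  # 0.75
```

Note the built-in asymmetry: one high-confidence sentence pushes the bag probability toward 1, which matches the at-least-one assumption behind multi-instance learning.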
Supervised relation extraction requires labelled training data specifying the relation between entity pairs in context. Feature-based approaches extract lexical, syntactic, and semantic features (e.g., the shortest dependency path between the entities, entity types, intervening words) and train classifiers such as SVMs or MaxEnt models. Kernel methods define similarity functions over structured representations (parse trees, dependency paths) and achieved strong results in the pre-neural era. Neural approaches learn features automatically: CNN-based models (Zeng et al., 2014) extract local patterns, while attention-based models learn to focus on relation-indicative words in the context.
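To make the feature-based approach concrete, here is a sketch of a feature extractor for one candidate entity pair, covering a small illustrative subset of the features mentioned above (entity types and intervening words); the function name and feature-string format are assumptions, not a standard API:

```python
def extract_features(tokens, e1_span, e2_span, e1_type, e2_type):
    """Sparse binary features for a candidate entity pair.
    e1_span / e2_span are (start, end) token indices, end exclusive.
    Illustrative subset: entity types, words between the entities,
    and the length of the intervening span."""
    start, end = e1_span[1], e2_span[0]   # tokens between the two mentions
    between = tokens[start:end]
    feats = {
        f"e1_type={e1_type}": 1,
        f"e2_type={e2_type}": 1,
        f"n_between={len(between)}": 1,
    }
    for w in between:
        feats[f"between_word={w.lower()}"] = 1
    return feats

# "Steve Jobs co-founded Apple": e1 = tokens 0-2, e2 = token 3
feats = extract_features(["Steve", "Jobs", "co-founded", "Apple"],
                         (0, 2), (3, 4), "PER", "ORG")
```

Such feature dictionaries would then be vectorised and fed to a linear classifier (SVM or MaxEnt); richer systems add the shortest dependency path from a parser as a feature in the same way.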
Distant Supervision and Noise Reduction
Labelled relation extraction data is expensive to create, motivating the distant supervision paradigm introduced by Mintz et al. (2009). Distant supervision automatically generates training data by aligning a knowledge base (e.g., Freebase) with a text corpus: if the KB contains the triple (Barack Obama, born_in, Honolulu), then any sentence mentioning both Barack Obama and Honolulu is assumed to express the born_in relation. This assumption is clearly noisy — the sentence might discuss Obama's visit to Honolulu — but the approach generates large training sets at no annotation cost.
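The alignment step itself is mechanical. A minimal sketch, using naive substring matching for entity mentions (a real system would use entity linking):

```python
def distant_label(kb, sentences):
    """Generate training examples by distant supervision: any sentence
    containing both entities of a KB triple (e1, rel, e2) is labelled
    with relation rel, whether or not it actually expresses it."""
    examples = []
    for sent in sentences:
        for e1, rel, e2 in kb:
            if e1 in sent and e2 in sent:
                examples.append((e1, e2, rel, sent))
    return examples

kb = [("Barack Obama", "born_in", "Honolulu")]
sentences = [
    "Barack Obama was born in Honolulu.",
    "Barack Obama gave a speech in Honolulu in 2008.",
]
examples = distant_label(kb, sentences)
```

Both sentences receive the born_in label here, even though only the first expresses the relation; this is exactly the label noise that motivates the techniques below.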
Relation extraction is the primary mechanism for automatically constructing and extending knowledge graphs such as Google's Knowledge Graph, Wikidata, and YAGO. These knowledge graphs contain billions of entity-relation triples and power applications including search engines, question answering systems, and recommendation engines. The NELL (Never-Ending Language Learner) project at Carnegie Mellon University demonstrated continuous relation extraction from the web, learning new facts and new extraction patterns in a self-supervised loop that has accumulated millions of beliefs since its deployment in 2010.
Multi-instance learning addresses the noise in distant supervision by modelling relation extraction at the bag level rather than the sentence level. Instead of assuming every sentence containing two entities expresses their KB relation, multi-instance models treat each bag of sentences mentioning an entity pair as a single training instance, learning to select the most informative sentences within each bag. Attention-based aggregation (Lin et al., 2016) learns soft weights for sentences in a bag, and reinforcement learning approaches have been used to select high-quality training instances from noisy distant supervision data.
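The selective-attention idea can be sketched as follows: score each sentence representation in the bag against a relation query vector, normalise the scores with a softmax, and return the weighted sum as the bag representation. This is a simplified sketch in the spirit of Lin et al. (2016), not their exact model; the dot-product scoring and function names are assumptions:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def bag_representation(sentence_vecs, query):
    """Selective attention over a bag of sentence vectors:
    dot-product score against a relation query vector, softmax,
    then weighted sum. Returns (bag_vector, attention_weights)."""
    scores = [sum(s_i * q_i for s_i, q_i in zip(s, query))
              for s in sentence_vecs]
    weights = softmax(scores)
    dim = len(query)
    bag = [sum(w * s[i] for w, s in zip(weights, sentence_vecs))
           for i in range(dim)]
    return bag, weights

# A sentence aligned with the query dominates the bag representation.
bag, weights = bag_representation([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

In training, the query vector for each relation is learned jointly with the sentence encoder, so the attention weights come to downweight sentences that merely co-mention the entity pair without expressing the relation.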