Open Information Extraction

Open information extraction discovers relational triples from text without restricting relations to a predefined schema, enabling large-scale knowledge acquisition from diverse corpora by expressing relations as natural language phrases rather than fixed ontological categories.

(arg₁, relation_phrase, arg₂) — e.g., (Einstein, was born in, Ulm)

Open information extraction (Open IE) extracts relational triples from text in a schema-free manner, using natural language phrases to express relations rather than mapping them to a predefined set of relation types. Unlike traditional relation extraction, which requires a fixed ontology (e.g., born_in, located_in, CEO_of), Open IE systems can discover arbitrary relations expressed in text, producing triples such as (Einstein, was born in, Ulm) or (aspirin, reduces the risk of, heart attack). This open-ended approach enables extraction at web scale, where the diversity of possible relations far exceeds any predefined schema.

Self-Supervised Extraction

Open IE Triple Extraction

Input: "Marie Curie discovered radium in 1898."

Output triples:
(Marie Curie, discovered, radium)
(Marie Curie, discovered radium in, 1898)

Confidence scoring: each triple receives a confidence c ∈ [0, 1]
based on syntactic well-formedness and extraction pattern reliability

The first Open IE system, TextRunner (Banko et al., 2007), used a self-supervised approach: it trained a classifier on a small set of examples that were labelled automatically by heuristics applied to parsed sentences, and then applied the classifier to extract triples from a large web corpus. The key insight was that syntactic patterns (e.g., NP-VP-NP constructions) could be used to identify relational triples without any predefined relation schema. Subsequent systems (ReVerb, OLLIE, ClausIE, and OpenIE 5) improved extraction quality through increasingly sophisticated syntactic analysis, handling of complex sentences, and canonicalisation of extracted relations.
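The idea of matching syntactic patterns can be illustrated with a minimal sketch. It assumes the sentence has already been tagged with a coarse, one-letter tag per token (the tag set, the two patterns, and the reliability scores 0.9 and 0.7 are all hypothetical choices made for illustration, not taken from TextRunner or any other system).

```python
import re

# Coarse tag classes (an assumption of this sketch, one character per token):
# N = noun or number, V = verb, P = preposition, D = determiner,
# A = adjective, O = other
NP = r"D?A*N+"
PATTERNS = [
    # (regex over the tag string, hypothetical reliability score)
    (re.compile(rf"(?P<a1>{NP})(?P<rel>V+P?)(?P<a2>{NP})"), 0.9),
    (re.compile(rf"(?P<a1>{NP})(?P<rel>V+[DAN]+P)(?P<a2>{NP})"), 0.7),
]


def extract_triples(tagged):
    """Return (arg1, relation_phrase, arg2, confidence) tuples.

    `tagged` is a list of (token, coarse_tag) pairs. Each pattern is
    matched against the string of tags, and the matched spans are
    mapped back to the original tokens.
    """
    words = [w for w, _ in tagged]
    tags = "".join(t for _, t in tagged)
    results = []
    for pattern, confidence in PATTERNS:
        for m in pattern.finditer(tags):
            def text(group):
                return " ".join(words[m.start(group):m.end(group)])
            results.append((text("a1"), text("rel"), text("a2"), confidence))
    return results


sentence = [("Marie", "N"), ("Curie", "N"), ("discovered", "V"),
            ("radium", "N"), ("in", "P"), ("1898", "N"), (".", "O")]
for triple in extract_triples(sentence):
    print(triple)
# ('Marie Curie', 'discovered', 'radium', 0.9)
# ('Marie Curie', 'discovered radium in', '1898', 0.7)
```

Attaching a fixed score to each pattern is the simplest possible stand-in for confidence scoring; a real system would more plausibly learn the score from features of the sentence and the match.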

Neural Open IE

Neural Open IE systems reformulate extraction as a sequence labelling or sequence generation task. In the sequence labelling formulation, each token is tagged as belonging to an argument, to the relation phrase, or to neither. In the sequence generation formulation, an encoder-decoder architecture is used: the encoder processes the input sentence and the decoder generates triples as structured sequences. These models can handle implicit relations and complex syntactic constructions that pattern-based approaches miss, though generative models may also hallucinate triples not supported by the input text. The Stanford Open IE system, although not itself neural, addresses complex sentences in a related way: it uses natural logic and clause splitting to decompose them into shorter independent clauses before extracting triples.
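As a rough illustration of the sequence labelling view, the sketch below decodes per-token BIO labels into a triple. The label names (ARG1, REL, ARG2) and the hand-written labels are assumptions standing in for the output of a trained tagger. The second function is a crude lexical check, also hypothetical, for flagging generated triples whose words do not appear in the input; it would wrongly reject legitimately implicit relations.

```python
def decode_bio(tokens, labels):
    """Turn BIO labels into an (arg1, relation, arg2) triple, or None.

    Only the first span found for each role is kept.
    """
    spans = []      # (role, [tokens]) in order of appearance
    current = None  # the span that an I- label may still extend
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            current = (label[2:], [token])
            spans.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(token)
        else:
            current = None  # "O" or an inconsistent I- label closes the span

    first = {}
    for role, toks in spans:
        first.setdefault(role, " ".join(toks))
    if all(role in first for role in ("ARG1", "REL", "ARG2")):
        return (first["ARG1"], first["REL"], first["ARG2"])
    return None


def is_grounded(triple, tokens):
    """True if every word of the triple occurs in the input tokens."""
    vocabulary = {t.lower() for t in tokens}
    return all(w.lower() in vocabulary for part in triple for w in part.split())


tokens = "Marie Curie discovered radium in 1898 .".split()
labels = ["B-ARG1", "I-ARG1", "B-REL", "B-ARG2", "O", "O", "O"]
print(decode_bio(tokens, labels))
# ('Marie Curie', 'discovered', 'radium')
```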

Web-Scale Knowledge Acquisition

Open IE was motivated by the vision of automatically reading the entire web and extracting a comprehensive knowledge base. The TextRunner system extracted over 1 million triples from 9 million web pages; ReVerb extracted 15 million triples from 500 million web pages; and subsequent systems have scaled further. Projects such as NELL and Knowledge Vault combine Open IE with other knowledge acquisition methods to build large-scale knowledge graphs. While the precision of individual extractions remains imperfect, the same fact is often extracted many times from different sentences, and this redundancy enables statistical methods to identify reliable facts.
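One simple way to exploit repeated extractions is a noisy-OR combination of their confidences: a triple is taken to be wrong only if every extraction of it is wrong. This is a sketch under the strong assumption that extractions are independent, and it is not claimed to be the method used by any of the systems named above.

```python
from collections import defaultdict


def aggregate(extractions):
    """Combine per-extraction confidences with a noisy-OR.

    `extractions` is an iterable of (arg1, relation, arg2, confidence).
    Returns a dict mapping each distinct triple to
    1 - prod(1 - c_i) over all extractions of that triple.
    """
    prob_all_wrong = defaultdict(lambda: 1.0)
    for arg1, rel, arg2, conf in extractions:
        prob_all_wrong[(arg1, rel, arg2)] *= 1.0 - conf
    return {triple: 1.0 - p for triple, p in prob_all_wrong.items()}


extractions = [
    ("Einstein", "was born in", "Ulm", 0.5),
    ("Einstein", "was born in", "Ulm", 0.5),
    ("Einstein", "was born in", "Ulm", 0.5),
    ("Einstein", "was born in", "Munich", 0.5),
]
scores = aggregate(extractions)
print(scores[("Einstein", "was born in", "Ulm")])     # 0.875
print(scores[("Einstein", "was born in", "Munich")])  # 0.5
```

Because extractions drawn from copied or templated web pages are far from independent, scores computed this way should be read as optimistic.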

A fundamental challenge in Open IE is the canonicalisation of extracted relations: the phrases "was born in," "hails from," "is a native of," and "comes from" may all express the same underlying relation. Clustering extracted relation phrases into canonical groups enables downstream applications to aggregate information across paraphrases. Another challenge is the extraction of n-ary relations and nested relations, which cannot be naturally expressed as binary triples. Recent work has explored richer extraction formats, including nested triples and quintuples that include temporal and spatial modifiers as additional arguments.
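Clustering relation phrases can be sketched with a simple distributional heuristic: two phrases are merged if they have been observed with at least one identical argument pair. The function name, the `min_shared` threshold, and the example triples are hypothetical. Shared arguments do not guarantee synonymy ("was born in" and "died in" can share a pair), so this is a starting point rather than a complete method.

```python
from collections import defaultdict


def cluster_relations(triples, min_shared=1):
    """Group relation phrases that share at least `min_shared` argument pairs.

    Uses union-find over relation phrases; returns a sorted list of
    sorted clusters.
    """
    pairs_by_rel = defaultdict(set)
    for arg1, rel, arg2 in triples:
        pairs_by_rel[rel].add((arg1, arg2))

    parent = {rel: rel for rel in pairs_by_rel}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    rels = sorted(pairs_by_rel)
    for i, r1 in enumerate(rels):
        for r2 in rels[i + 1:]:
            if len(pairs_by_rel[r1] & pairs_by_rel[r2]) >= min_shared:
                parent[find(r1)] = find(r2)

    clusters = defaultdict(set)
    for rel in rels:
        clusters[find(rel)].add(rel)
    return sorted(sorted(c) for c in clusters.values())


triples = [
    ("Einstein", "was born in", "Ulm"),
    ("Einstein", "is a native of", "Ulm"),
    ("Curie", "was born in", "Warsaw"),
    ("Curie", "hails from", "Warsaw"),
    ("aspirin", "reduces the risk of", "heart attack"),
]
print(cluster_relations(triples))
# [['hails from', 'is a native of', 'was born in'], ['reduces the risk of']]
```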
