
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Open Information Extraction: The Second Generation

Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam
Turing Center, Department of Computer Science and Engineering
University of Washington, Box 352350, Seattle, WA 98195, USA
{etzioni,afader,janara,soderlan,mausam}@cs.washington.edu

Abstract

How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm, which eschews hand-labeled training examples and avoids domain-specific verbs and nouns in order to develop unlexicalized, domain-independent extractors that scale to the Web corpus. Open IE systems have extracted billions of assertions as the basis for both common-sense knowledge and novel question-answering systems. This paper describes the second generation of Open IE systems, which rely on a novel model of how relations and their arguments are expressed in English sentences to double precision/recall compared with previous systems such as TEXTRUNNER and WOE.

1 Introduction

Ever since its invention, text has been the fundamental repository of human knowledge and understanding. With the invention of the printing press, the computer, and the explosive growth of the Web, the amount of readily accessible text has long surpassed the ability of humans to read it. This challenge has only become worse with the explosive popularity of new text production engines such as Twitter, where hundreds of millions of short "texts" are created daily [Ritter et al., 2011]. Even finding relevant text has become increasingly challenging. Clearly, automatic text understanding has the potential to help, but the relevant technologies have to scale to the Web.

Starting in 2003, the KnowItAll project at the University of Washington has sought to extract high-quality collections of assertions from massive Web corpora. In 2006, we wrote: "The time is ripe for the AI community to set its sights on Machine Reading—the automatic, unsupervised understanding of text." [Etzioni et al., 2006]. In response to the challenge of Machine Reading, we have investigated the Open Information Extraction (Open IE) paradigm, which aims to scale IE methods to the size and diversity of the Web corpus [Banko et al., 2007].

Typically, Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples [Kim and Moldovan, 1993; Riloff, 1996; Soderland, 1999]. This approach to IE does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance. Open IE solves this problem by identifying relation phrases—phrases that denote relations in English sentences [Banko et al., 2007]. The automatic identification of relation phrases enables the extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary.

Open IE systems avoid specific nouns and verbs at all costs. The extractors are unlexicalized—formulated only in terms of syntactic tokens (e.g., part-of-speech tags) and closed-word classes (e.g., of, in, such as). Thus, Open IE extractors focus on generic ways in which relationships are expressed in English—naturally generalizing across domains.

Open IE systems have achieved a notable measure of success on massive, open-domain corpora drawn from the Web, Wikipedia, and elsewhere [Banko et al., 2007; Wu and Weld, 2010; Zhu et al., 2009]. The output of Open IE systems has been used to support tasks like learning selectional preferences [Ritter et al., 2010], acquiring common-sense knowledge [Lin et al., 2010], and recognizing entailment rules [Schoenmackers et al., 2010; Berant et al., 2011]. In addition, Open IE extractions have been mapped onto existing ontologies [Soderland et al., 2010].

This paper outlines our recent efforts to develop the second generation of systems for Open Information Extraction. An important aspect of our methodology is a thorough linguistic analysis of randomly sampled sentences. Our analysis exposed the simple canonical ways in which verbs express relationships in English. This analysis guided the design of these Open IE systems, resulting in substantially higher performance over previous work.

Specifically, we describe two novel Open IE systems: REVERB and R2A2 [Fader et al., 2011; Christensen et al., 2011a], which substantially improve both precision and recall when compared to previous extractors such as TEXTRUNNER and WOE. In particular, REVERB implements a novel relation phrase identifier based on generic syntactic and lexical constraints. R2A2 adds an argument identifier, ARGLEARNER, to better extract the arguments for these relation phrases. Both systems are based on almost five years of experience with Open IE systems, including TEXTRUNNER, WOE, and a careful analysis of their errors.

The remainder of the paper is organized as follows. We first define the Open IE task and briefly describe previous Open IE systems in Section 2. Section 3 outlines the architecture of REVERB, and Section 4 compares it to the existing Open IE extractors. We present R2A2's argument learning component in Section 5. We compare R2A2 and REVERB in Section 6. Section 7 describes some recent research related to large-scale IE. We conclude with directions for future research in Section 8.

2 Open Information Extraction

Open IE systems make a single (or constant number of) pass(es) over a corpus and extract a large number of relational tuples (Arg1, Pred, Arg2) without requiring any relation-specific training data. For instance, given the sentence, "McCain fought hard against Obama, but finally lost the election," an Open IE system should extract two tuples, (McCain, fought against, Obama) and (McCain, lost, the election). The strength of Open IE systems is in their efficient processing as well as their ability to extract an unbounded number of relations.

Several Open IE systems have been proposed, including TEXTRUNNER [Banko et al., 2007], WOE [Wu and Weld, 2010], and StatSnowBall [Zhu et al., 2009]. All these systems use the following three-step method:

1. Label: Sentences are automatically labeled with extractions using heuristics or distant supervision.
2. Learn: A relation phrase extractor is learned using a sequence-labeling graphical model (e.g., CRF).
3. Extract: The system takes a sentence as input, identifies a candidate pair of NP arguments (Arg1, Arg2) from the sentence, and then uses the learned extractor to label each word between the two arguments as part of the relation phrase or not.

The extractor is applied to the successive sentences in the corpus, and the resulting extractions are collected.

The first Open IE system was TEXTRUNNER [Banko et al., 2007], which used a Naive Bayes model with unlexicalized POS and NP-chunk features, trained using examples heuristically generated from the Penn Treebank. Subsequent work showed that utilizing a linear-chain CRF [Banko and Etzioni, 2008] or Markov Logic Network [Zhu et al., 2009] can lead to improved extractions. The WOE systems learn their extractors from training data derived from Wikipedia [Wu and Weld, 2010]. They also show that dependency parse features result in a dramatic increase in precision and recall over shallow linguistic features, but at the cost of extraction speed.

2.1 Limitations in Previous Open IE Systems

We identify two significant problems in all prior Open IE systems: incoherent extractions and uninformative extractions.

Incoherent extractions are cases where the extracted relation phrase has no meaningful interpretation (see Table 1 for examples). Incoherent extractions arise because the learned extractor makes a sequence of decisions about whether to include each word in the relation phrase, often resulting in incomprehensible relation phrases.

Sentence                                                            | Incoherent Relation
The guide contains dead links and omits sites.                      | contains omits
The Mark 14 was central to the torpedo scandal of the fleet.        | was central torpedo
They recalled that Nungesser began his career as a precinct leader. | recalled began

Table 1: Examples of incoherent extractions. Incoherent extractions make up approximately 13% of TEXTRUNNER's output, 15% of WOE^pos's output, and 30% of WOE^parse's output.

The second problem, uninformative extractions, occurs when extractions omit critical information. For example, consider the sentence "Hamas claimed responsibility for the Gaza attack". Previous Open IE systems return the uninformative (Hamas, claimed, responsibility) instead of (Hamas, claimed responsibility for, the Gaza attack). This type of error is caused by improper handling of light verb constructions (LVCs). An LVC is a multi-word predicate composed of a verb and a noun, with the noun carrying the semantic content of the predicate [Grefenstette and Teufel, 1995; Stevenson et al., 2004; Allerton, 2002]. Table 2 illustrates the wide range of relations expressed with LVCs, which are not captured by previous open extractors.

is   | is an album by, is the author of, is a city in
has  | has a population of, has a Ph.D. in, has a cameo in
made | made a deal with, made a promise to
took | took place in, took control over, took advantage of
gave | gave birth to, gave a talk at, gave new meaning to
got  | got tickets to see, got a deal on, got funding from

Table 2: Examples of uninformative relations (left) and their completions (right). Uninformative extractions account for approximately 4% of WOE^parse's output, 6% of WOE^pos's output, and 7% of TEXTRUNNER's output.

3 ReVerb Extractor for Verb-based Relations

In response to these limitations, we introduce REVERB, which implements a general model of verb-based relation phrases, expressed as two simple constraints. We first de-
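To make the tuple format and the LVC problem concrete, here is a minimal sketch in Python. It is not REVERB's implementation: the helpers (`tag_char`, `extract`) and the POS-tag pattern `V | V W* P` are illustrative simplifications of the kind of unlexicalized, POS-level constraint a verb-based relation phrase identifier can use. The point is that matching the whole pattern keeps a light verb construction such as "claimed responsibility for" together instead of truncating it to "claimed".

```python
import re

def tag_char(tag):
    """Map a Penn Treebank POS tag to one character so a regex can scan
    the tag sequence. (Hypothetical helper for this sketch.)"""
    if tag.startswith("V"):
        return "V"                 # verb
    if tag in ("IN", "RP", "TO"):
        return "P"                 # preposition / particle / inf. marker
    if tag[0] in "NJRDP":
        return "W"                 # noun, adjective, adverb, det, pronoun
    return "O"                     # anything else breaks the phrase

def extract(tagged):
    """Return one (Arg1, relation phrase, Arg2) tuple, or None."""
    chars = "".join(tag_char(t) for _, t in tagged)
    # Simplified relation-phrase constraint: a verb, optionally followed
    # by intervening words and a final preposition/particle (V | V W* P).
    m = re.search(r"V(?:W*P)?", chars)
    if m is None or m.start() == 0 or m.end() == len(chars):
        return None                # need text on both sides for arguments
    words = [w for w, _ in tagged]
    return (" ".join(words[:m.start()]),        # Arg1: everything left
            " ".join(words[m.start():m.end()]), # relation phrase
            " ".join(words[m.end():]))          # Arg2: everything right

# Toy sentence, hand-tagged; a real system would run a POS tagger.
tagged = [("Hamas", "NNP"), ("claimed", "VBD"), ("responsibility", "NN"),
          ("for", "IN"), ("the", "DT"), ("Gaza", "NNP"), ("attack", "NN")]
print(extract(tagged))
# -> ('Hamas', 'claimed responsibility for', 'the Gaza attack')
```

A full extractor would, among other things, delimit Arg1 and Arg2 as noun phrases via chunking rather than taking everything on either side of the match, and would add lexical constraints and a confidence score; this sketch only illustrates the shape of the output tuple.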