Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Open Information Extraction: The Second Generation

Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam
Turing Center, Department of Computer Science and Engineering
University of Washington, Box 352350, Seattle, WA 98195, USA
{etzioni,afader,janara,soderlan,mausam}@cs.washington.edu

Abstract

How do we scale information extraction to the massive size and unprecedented heterogeneity of the Web corpus? Beginning in 2003, our KnowItAll project has sought to extract high-quality knowledge from the Web. In 2007, we introduced the Open Information Extraction (Open IE) paradigm, which eschews hand-labeled training examples and avoids domain-specific verbs and nouns in order to develop unlexicalized, domain-independent extractors that scale to the Web corpus. Open IE systems have extracted billions of assertions as the basis for both common-sense knowledge and novel question-answering systems. This paper describes the second generation of Open IE systems, which rely on a novel model of how relations and their arguments are expressed in English sentences to double precision/recall compared with previous systems such as TEXTRUNNER and WOE.

1 Introduction

Ever since its invention, text has been the fundamental repository of human knowledge and understanding. With the invention of the printing press, the computer, and the explosive growth of the Web, the amount of readily accessible text has long surpassed the ability of humans to read it. This challenge has only become worse with the explosive popularity of new text production engines such as Twitter, where hundreds of millions of short "texts" are created daily [Ritter et al., 2011]. Even finding relevant text has become increasingly challenging. Clearly, automatic text understanding has the potential to help, but the relevant technologies have to scale to the Web.

Starting in 2003, the KnowItAll project at the University of Washington has sought to extract high-quality collections of assertions from massive Web corpora. In 2006, we wrote: "The time is ripe for the AI community to set its sights on Machine Reading—the automatic, unsupervised understanding of text" [Etzioni et al., 2006]. In response to the challenge of Machine Reading, we have investigated the Open Information Extraction (Open IE) paradigm, which aims to scale IE methods to the size and diversity of the Web corpus [Banko et al., 2007].

Typically, Information Extraction (IE) systems learn an extractor for each target relation from labeled training examples [Kim and Moldovan, 1993; Riloff, 1996; Soderland, 1999]. This approach to IE does not scale to corpora where the number of target relations is very large, or where the target relations cannot be specified in advance. Open IE solves this problem by identifying relation phrases—phrases that denote relations in English sentences [Banko et al., 2007]. The automatic identification of relation phrases enables the extraction of arbitrary relations from sentences, obviating the restriction to a pre-specified vocabulary.

Open IE systems avoid specific nouns and verbs at all costs. The extractors are unlexicalized—formulated only in terms of syntactic tokens (e.g., part-of-speech tags) and closed word classes (e.g., of, in, such as). Thus, Open IE extractors focus on generic ways in which relationships are expressed in English—naturally generalizing across domains.

Open IE systems have achieved a notable measure of success on massive, open-domain corpora drawn from the Web, Wikipedia, and elsewhere [Banko et al., 2007; Wu and Weld, 2010; Zhu et al., 2009]. The output of Open IE systems has been used to support tasks like learning selectional preferences [Ritter et al., 2010], acquiring common-sense knowledge [Lin et al., 2010], and recognizing entailment rules [Schoenmackers et al., 2010; Berant et al., 2011]. In addition, Open IE extractions have been mapped onto existing ontologies [Soderland et al., 2010].

This paper outlines our recent efforts to develop second-generation Open IE systems. An important aspect of our methodology is a thorough linguistic analysis of randomly sampled sentences. Our analysis exposed the simple canonical ways in which verbs express relationships in English. This analysis guided the design of these Open IE systems, resulting in substantially higher performance over previous work.

Specifically, we describe two novel Open IE systems: REVERB (downloadable at http://reverb.cs.washington.edu) and R2A2 [Fader et al., 2011; Christensen et al., 2011a], which substantially improve both precision and recall when compared to previous extractors such as TEXTRUNNER and WOE. In particular, REVERB implements a novel relation phrase identifier based on generic syntactic and lexical constraints. R2A2 adds an argument identifier, ARGLEARNER, to better extract the arguments for these relation phrases. Both systems are based on almost five years of experience with Open IE systems, including TEXTRUNNER, WOE, and a careful analysis of their errors.

The remainder of the paper is organized as follows. We first define the Open IE task and briefly describe previous Open IE systems in Section 2. Section 3 outlines the architecture of REVERB and Section 4 compares it to existing Open IE extractors. We present R2A2's argument learning component in Section 5 and compare R2A2 and REVERB in Section 6. Section 7 describes recent research related to large-scale IE. We conclude with directions for future research in Section 8.

2 Open Information Extraction

Open IE systems make a single (or constant number of) pass(es) over a corpus and extract a large number of relational tuples (Arg1, Pred, Arg2) without requiring any relation-specific training data. For instance, given the sentence "McCain fought hard against Obama, but finally lost the election," an Open IE system should extract two tuples: (McCain, fought against, Obama) and (McCain, lost, the election). The strength of Open IE systems is in their efficient processing as well as their ability to extract an unbounded number of relations.

Several Open IE systems have been proposed, including TEXTRUNNER [Banko et al., 2007], WOE [Wu and Weld, 2010], and StatSnowBall [Zhu et al., 2009]. All of these systems use the following three-step method:

1. Label: Sentences are automatically labeled with extractions using heuristics or distant supervision.
2. Learn: A relation phrase extractor is learned using a sequence-labeling graphical model (e.g., CRF).
3. Extract: The system takes a sentence as input, identifies a candidate pair of NP arguments (Arg1, Arg2) from the sentence, and then uses the learned extractor to label each word between the two arguments as part of the relation phrase or not.

The extractor is applied to successive sentences in the corpus, and the resulting extractions are collected.

The first Open IE system was TEXTRUNNER [Banko et al., 2007], which used a Naive Bayes model with unlexicalized POS and NP-chunk features, trained on examples heuristically generated from the Penn Treebank. Subsequent work showed that utilizing a linear-chain CRF [Banko and Etzioni, 2008] or Markov Logic Network [Zhu et al., 2009] can lead to improved extractions. The WOE systems made use of Wikipedia as a source of training data for their extractors, which leads to further improvements over TEXTRUNNER [Wu and Weld, 2010]. They also show that dependency parse features result in a dramatic increase in precision and recall over shallow linguistic features, but at the cost of extraction speed.

2.1 Limitations in Previous Open IE Systems

We identify two significant problems in all prior Open IE systems: incoherent extractions and uninformative extractions. Incoherent extractions are cases where the extracted relation phrase has no meaningful interpretation (see Table 1 for examples). Incoherent extractions arise because the learned extractor makes a sequence of decisions about whether to include each word in the relation phrase, often resulting in incomprehensible relation phrases.

Sentence                                                              Incoherent Relation
The guide contains dead links and omits sites.                        contains omits
The Mark 14 was central to the torpedo scandal of the fleet.          was central torpedo
They recalled that Nungesser began his career as a precinct leader.   recalled began

Table 1: Examples of incoherent extractions. Incoherent extractions make up approximately 13% of TEXTRUNNER's output, 15% of WOE^pos's output, and 30% of WOE^parse's output.

The second problem, uninformative extractions, occurs when extractions omit critical information. For example, consider the sentence "Hamas claimed responsibility for the Gaza attack". Previous Open IE systems return the uninformative (Hamas, claimed, responsibility) instead of (Hamas, claimed responsibility for, the Gaza attack). This type of error is caused by improper handling of light verb constructions (LVCs). An LVC is a multi-word predicate composed of a verb and a noun, with the noun carrying the semantic content of the predicate [Grefenstette and Teufel, 1995; Stevenson et al., 2004; Allerton, 2002]. Table 2 illustrates the wide range of relations expressed with LVCs, which are not captured by previous open extractors.

is     is an album by, is the author of, is a city in
has    has a population of, has a Ph.D. in, has a cameo in
made   made a deal with, made a promise to
took   took place in, took control over, took advantage of
gave   gave birth to, gave a talk at, gave new meaning to
got    got tickets to see, got a deal on, got funding from

Table 2: Examples of uninformative relations (left) and their completions (right). Uninformative extractions account for approximately 4% of WOE^parse's output, 6% of WOE^pos's output, and 7% of TEXTRUNNER's output.

3 ReVerb Extractor for Verb-based Relations

In response to these limitations, we introduce REVERB, which implements a general model of verb-based relation phrases, expressed as two simple constraints. We first describe the constraints and then the REVERB architecture.

3.1 Syntactic Constraint

The syntactic constraint serves two purposes. First, it eliminates incoherent extractions; second, it reduces uninformative extractions by capturing relation phrases expressed via light verb constructions.

The syntactic constraint requires the relation phrase to match the POS tag pattern shown in Figure 1:

    V | VP | VW*P
    V = verb particle? adv?
    W = (noun | adj | adv | pron | det)
    P = (prep | particle | inf. marker)

Figure 1: A simple part-of-speech-based regular expression reduces the number of incoherent extractions like was central torpedo and covers relations expressed via light verb constructions like made a deal with.

The pattern limits relation phrases to be either a simple verb phrase (e.g., invented), a verb phrase followed immediately by a preposition or particle (e.g., located in), or a verb phrase followed by a simple noun phrase and ending in a preposition or particle (e.g., has atomic weight of). If there are multiple possible matches in a sentence for a single verb, the longest possible match is chosen. Finally, if the pattern matches multiple adjacent sequences, we merge them into a single relation phrase (e.g., wants to extend). This refinement enables the model to readily handle relation phrases containing multiple verbs. A consequence of this pattern is that the relation phrase must be a contiguous span of words in the sentence.
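The pattern in Figure 1 can be approximated as a regular expression over coarse word classes derived from POS tags. The sketch below is illustrative only: the tag-to-class mapping and helper names are ours, not REVERB's actual implementation, and it ignores the merging of adjacent matches.

```python
import re

# Coarse word classes from Figure 1, keyed by Penn Treebank POS tags.
# This mapping is illustrative, not REVERB's exact tag set.
VERB = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
PREP_OR_PARTICLE = {"IN", "TO", "RP"}            # P: prep | particle | inf. marker
INTERMEDIATE = {"NN", "NNS", "NNP", "NNPS",      # W: noun | adj | adv | pron | det
                "JJ", "JJR", "JJS", "RB", "RBR", "RBS",
                "PRP", "PRP$", "DT"}

def coarse_class(tag):
    """Map a POS tag to one of the symbols V, P, W used by the Figure 1 pattern."""
    if tag in VERB:
        return "V"
    if tag in PREP_OR_PARTICLE:
        return "P"
    if tag in INTERMEDIATE:
        return "W"
    return "O"  # anything else ends the match

# V | VP | VW*P: one or more verbs, optionally followed by intermediate words
# that must end in a preposition or particle.  Greedy matching yields the
# longest possible relation phrase starting at the verb.
RELATION_PATTERN = re.compile(r"V+(W*P)?")

def longest_relation_span(tags, start):
    """Longest span of tags beginning at `start` that matches the pattern."""
    symbols = "".join(coarse_class(t) for t in tags[start:])
    m = RELATION_PATTERN.match(symbols)
    return (start, start + m.end()) if m else None

# "made a deal with" -> VBD DT NN IN -> "VWWP", matched in full.
print(longest_relation_span(["VBD", "DT", "NN", "IN"], 0))  # (0, 4)
```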
While this syntactic pattern identifies relation phrases with high precision, to what extent does it limit recall? To answer this, we analyzed Wu and Weld's set of 300 Web sentences, manually identifying all verb-based relationships between noun phrase pairs. This resulted in a set of 327 relation phrases. For each relation phrase, we checked whether it satisfies the REVERB syntactic constraint. We found that 85% of the relation phrases do satisfy the constraint. Of the remaining 15%, we identified some of the common cases where the constraint is violated, summarized in Table 3. Many of these cases involve long-range dependencies between words in the sentence. As we show in Section 4, attempting to cover these harder cases using a dependency parser can actually reduce recall as well as precision.

Binary Verbal Relation Phrases
85%  Satisfy Constraints
 8%  Non-Contiguous Phrase Structure
        Coordination: X is produced and maintained by Y
        Multiple Args: X was founded in 1995 by Y
        Phrasal Verbs: X turned Y off
 4%  Relation Phrase Not Between Arguments
        Intro. Phrases: Discovered by Y, X ...
        Relative Clauses: ... the Y that X discovered
 3%  Do Not Match POS Pattern
        Interrupting Modifiers: X has a lot of faith in Y
        Infinitives: X to attack Y

Table 3: Approximately 85% of the binary verbal relation phrases in a sample of Web sentences satisfy our constraints.

3.2 Lexical Constraint

While the syntactic constraint greatly reduces uninformative extractions, it can sometimes match relation phrases that are so specific that they have only a few possible instances, even in a Web-scale corpus. Consider the sentence:

    The Obama administration is offering only modest greenhouse gas reduction targets at the conference.

The POS pattern will match the phrase:

    is offering only modest greenhouse gas reduction targets at    (1)

Thus, there are phrases that satisfy the syntactic constraint but are not useful relations.

To overcome this limitation, we introduce a lexical constraint that is used to separate valid relation phrases from over-specified relation phrases like the example in (1). The constraint is based on the intuition that a valid relation phrase should take many distinct arguments in a large corpus. The phrase in (1) will not be extracted with many argument pairs, so it is unlikely to represent a bona fide relation. We describe the implementation details of the lexical constraint in Section 3.3.

3.3 The ReVerb Architecture

This section introduces REVERB, a novel open extractor based on the constraints defined in the previous sections. REVERB first identifies relation phrases that satisfy the syntactic and lexical constraints, and then finds a pair of NP arguments for each identified relation phrase. The resulting extractions are then assigned a confidence score by a logistic regression classifier trained on 1,000 random Web sentences with shallow syntactic features.

This algorithm differs in three important ways from previous methods. First, the relation phrase is identified "holistically" rather than word-by-word. Second, potential phrases are filtered based on statistics over a large corpus (the implementation of our lexical constraint). Finally, REVERB is "relation first" rather than "arguments first", which enables it to avoid a common error made by previous methods—confusing a noun in the relation phrase for an argument, e.g., the noun responsibility in claimed responsibility for.

REVERB takes as input a POS-tagged and NP-chunked sentence and returns a set of (x, r, y) extraction triples. (REVERB uses OpenNLP for POS tagging and NP chunking: http://opennlp.sourceforge.net/) Given an input sentence s, REVERB uses the following extraction algorithm:

1. Relation Extraction: For each verb v in s, find the longest sequence of words r_v such that (1) r_v starts at v, (2) r_v satisfies the syntactic constraint, and (3) r_v satisfies the lexical constraint. If any pair of matches are adjacent or overlap in s, merge them into a single match.

2. Argument Extraction: For each relation phrase r identified in Step 1, find the nearest noun phrase x to the left of r in s such that x is not a relative pronoun, WH-term, or existential "there". Find the nearest noun phrase y to the right of r in s. If such an (x, y) pair could be found, return (x, r, y) as an extraction.
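As a rough illustration of the two steps above, the following sketch applies them to one POS-tagged, NP-chunked sentence. It is not REVERB's implementation: the constraint checks are passed in as placeholder predicates, and the toy syntactic test in the usage example stands in for the full Figure 1 pattern and the dictionary lookup described in Section 3.3.

```python
def extract_triples(tokens, tags, np_chunks, satisfies_syntax, satisfies_lexicon):
    """Sketch of REVERB's two-step algorithm for a single sentence.
    `np_chunks` is a list of (start, end) NP spans; the two predicates stand
    in for the syntactic and lexical constraints."""
    # Step 1: for each verb, keep the longest span starting at it that
    # satisfies both constraints, then merge adjacent/overlapping spans.
    spans = []
    for v, tag in enumerate(tags):
        if not tag.startswith("VB"):
            continue
        best = None
        for end in range(v + 1, len(tokens) + 1):
            if satisfies_syntax(tags[v:end]) and satisfies_lexicon(tokens[v:end]):
                best = (v, end)
        if best:
            spans.append(best)
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:            # adjacent or overlapping
            merged[-1] = (merged[-1][0], max(end, merged[-1][1]))
        else:
            merged.append((start, end))

    # Step 2: nearest NP to the left (skipping relative pronouns, WH-terms,
    # and existential "there") and nearest NP to the right.
    skip = {"that", "which", "who", "whom", "whose", "there"}
    triples = []
    for start, end in merged:
        lefts = [c for c in np_chunks
                 if c[1] <= start and tokens[c[0]].lower() not in skip]
        rights = [c for c in np_chunks if c[0] >= end]
        if lefts and rights:
            x, r, y = lefts[-1], (start, end), rights[0]
            triples.append(tuple(" ".join(tokens[a:b]) for a, b in (x, r, y)))
    return triples

# Toy run on the LVC example from Section 2.1 (tagger/chunker output assumed):
tokens = ["Hamas", "claimed", "responsibility", "for", "the", "Gaza", "attack"]
tags = ["NNP", "VBD", "NN", "IN", "DT", "NNP", "NN"]
chunks = [(0, 1), (2, 3), (4, 7)]
print(extract_triples(tokens, tags, chunks,
                      lambda t: t[0].startswith("VB") and (len(t) == 1 or t[-1] == "IN"),
                      lambda t: True))
# [('Hamas', 'claimed responsibility for', 'the Gaza attack')]
```

Note how the "relation first" strategy leaves the NP "responsibility" inside the relation phrase rather than mistaking it for an argument.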

We check whether a candidate relation phrase r_v satisfies the syntactic constraint by matching it against the regular expression in Figure 1. To determine whether r_v satisfies the lexical constraint, we use a large dictionary D of relation phrases that are known to take many distinct arguments. In an off-line step, we construct D by finding all matches of the POS pattern in a corpus of 500 million Web sentences. For each matching relation phrase, we heuristically identify its arguments (as in Step 2 above). We set D to be the set of all relation phrases that take at least k distinct argument pairs in the set of extractions. In order to allow for minor variations in relation phrases, we normalize each relation phrase by removing inflection, auxiliary verbs, adjectives, and adverbs. Based on experiments on a held-out set of sentences, we find that a value of k = 20 works well for filtering out over-specified relations. This results in a set of approximately 1.7 million distinct normalized relation phrases, which are stored in memory at extraction time.
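A minimal sketch of this off-line step follows, assuming the candidate triples have already been produced by running the POS pattern and the argument heuristics over the corpus. The function name is ours, and the normalization is reduced to lowercasing for brevity.

```python
from collections import defaultdict

def build_relation_dictionary(candidate_triples, k=20):
    """Sketch of the off-line lexical-constraint step.  `candidate_triples` is
    an iterable of (arg1, relation_phrase, arg2) tuples found by the POS
    pattern and argument heuristics over a large corpus (500M sentences in
    the paper).  A phrase enters the dictionary D only if it is observed with
    at least k distinct argument pairs (k = 20 in REVERB)."""
    arg_pairs = defaultdict(set)
    for arg1, rel, arg2 in candidate_triples:
        # REVERB also normalizes the phrase (removing inflection, auxiliaries,
        # adjectives, and adverbs); here we only lowercase for brevity.
        arg_pairs[rel.lower()].add((arg1.lower(), arg2.lower()))
    return {rel for rel, pairs in arg_pairs.items() if len(pairs) >= k}

# At extraction time, the lexical constraint is then a membership test:
#   normalized_candidate in relation_dictionary
```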
4 ReVerb Experimental Results

We compare REVERB to the following systems:

• REVERB¬lex - The REVERB system described in the previous section, but without the lexical constraint. REVERB¬lex uses the same confidence function as REVERB.
• TEXTRUNNER - Banko and Etzioni's 2008 extractor, which uses a second-order linear-chain CRF trained on extractions heuristically generated from the Penn Treebank. TEXTRUNNER uses shallow linguistic features in its CRF, which come from the same POS tagger and NP-chunker that REVERB uses.
• WOE^pos - Wu and Weld's modification to TEXTRUNNER, which uses a model of relations learned from extractions heuristically generated from Wikipedia.
• WOE^parse - Wu and Weld's parser-based extractor, which uses a large dictionary of dependency path patterns learned from extractions heuristically generated from Wikipedia.

Each system is given a set of sentences as input and returns a set of binary extractions as output. We created a test set of 500 sentences sampled from the Web, using Yahoo's random link service (http://random.yahoo.com/bin/ryl). After running each extractor over the input sentences, two human judges independently evaluated each extraction as correct or incorrect. The judges reached agreement on 86% of the extractions, with an agreement score of κ = 0.68. We report results on the subset of the data where the two judges concur. The judges labeled uninformative extractions (where critical information was dropped from the extraction) as incorrect. This is a stricter standard than was used in previous Open IE evaluations.

Each system returns confidence scores for its extractions. For a given threshold, we can measure the precision and recall of the output. Precision is the fraction of returned extractions that are correct. Recall is the fraction of correct extractions in the corpus that are returned. We use the total number of extractions from all systems labeled as correct by the judges as our measure of recall for the corpus.
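For concreteness, a precision-recall curve and its area can be computed from the judged, confidence-scored extractions roughly as follows. This is our own sketch of the standard bookkeeping, not the paper's evaluation code; the pooled count of correct extractions serves as the recall denominator described above.

```python
def precision_recall_curve(scored_extractions, total_correct_in_pool):
    """`scored_extractions` is a list of (confidence, is_correct) pairs for one
    system; `total_correct_in_pool` is the number of extractions judged correct
    across all systems (the recall denominator used in the paper)."""
    points = []
    returned = correct = 0
    for conf, is_correct in sorted(scored_extractions, reverse=True):
        returned += 1
        correct += is_correct
        points.append((correct / total_correct_in_pool,   # recall
                       correct / returned))               # precision
    return points

def area_under_pr(points):
    """Trapezoidal area under the precision-recall curve, starting at recall 0."""
    area, prev_r, prev_p = 0.0, 0.0, 1.0
    for r, p in points:
        area += (r - prev_r) * (p + prev_p) / 2
        prev_r, prev_p = r, p
    return area

# Toy example with three judged extractions and a pool of four correct ones:
pts = precision_recall_curve([(0.9, True), (0.8, True), (0.4, False)], 4)
print(area_under_pr(pts))  # 0.5
```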

We begin by evaluating how well REVERB and the other Open IE systems identify correct relation phrases. As Figure 2 shows, REVERB has high precision in finding relation phrases, well above the precision of the comparison systems.

[Figure 2: Precision-recall curves for the relations-only task. REVERB identifies correct relation phrases with substantially higher precision and recall than state-of-the-art open extractors, including WOE^parse, which uses patterns learned over dependency parse paths.]

The full extraction task, identifying both a relation and its arguments, produced relatively similar precision-recall curves for each system, but with lower precision. Figure 3 shows the area under the curve (AUC) for each system. REVERB achieves an AUC that is 30% higher than WOE^parse and more than double the AUC of WOE^pos or TEXTRUNNER. The lexical constraint provides a significant boost in performance, with REVERB achieving an AUC 23% higher than REVERB¬lex.

[Figure 3: Area under the precision-recall curve for relation and arguments. REVERB has an AUC more than twice that of TEXTRUNNER or WOE^pos, and 30% higher than WOE^parse, for extracting both relation and arguments.]

4.1 ReVerb Error Analysis

To better understand the limitations of REVERB, we performed a detailed analysis of its errors in precision (incorrect extractions returned) and its errors in recall (correct extractions that it missed). We found that 65% of the incorrect extractions returned by REVERB were cases where a relation phrase was correctly identified, but the argument-finding heuristics failed. Of the remaining errors, a common problem was to mistake an n-ary relation for a binary relation, for example extracting (I, gave, him) from the sentence "I gave him 15 photographs".

As with the false positive extractions, the majority of false negatives (52%) were due to the argument-finding heuristics choosing the wrong arguments, or failing to extract all possible arguments (in the case of coordinating conjunctions). We now turn to a system that is able to identify arguments with much higher precision than REVERB's simple heuristics.

5 Learning Arguments

In addition to the relation phrases, the Open IE task also requires identifying the proper arguments for these relations. Previous research and REVERB use simple heuristics, such as extracting simple noun phrases or Wikipedia entities, to identify arguments. Unfortunately, these heuristics are unable to capture the complexity of language. A large majority of extraction errors by Open IE systems are due to incorrect or improperly-scoped arguments. Recall from the previous section that 65% of REVERB's errors had a correct relation phrase but incorrect arguments.

For example, from the sentence "The cost of the war against Iraq has risen above 500 billion dollars," REVERB's argument heuristics truncate Arg1:

    (Iraq, has risen above, 500 billion dollars)

On the other hand, in the sentence "The plan would reduce the number of teenagers who begin smoking," Arg2 gets truncated:

    (The plan, would reduce the number of, teenagers)

In this section, we describe an argument learning component, ARGLEARNER, that reduces such errors.

5.1 Linguistic-Statistical Analysis of Extractions

Our goal is to find the largest subset of language from which we can extract reliably and efficiently. To this end, we first analyze a sample of 250 random Web sentences to understand the frequent argument classes.
We hope to answer questions such as: What fraction of arguments are simple noun phrases? Are Arg1s structurally different from Arg2s? Is there typical context around an argument that can help us detect its boundaries?

Table 4 reports our observations for frequent argument categories, for both Arg1 and Arg2. By far the most common pattern for arguments is a simple noun phrase such as "Obama," "vegetable seeds," or "antibiotic use." This explains the success of previous open extractors that use simple NPs. However, we found that simple NPs account for only 65% of Arg1s and about 60% of Arg2s. This naturally dictates an upper bound on recall for systems that do not handle more complex arguments. Fortunately, there are only a handful of other prominent categories: for Arg1, prepositional phrases and lists; for Arg2, prepositional phrases, lists, independent clauses, and relative clauses. These categories cover over 90% of the extractions, suggesting that handling them well will boost precision significantly.

We also explored arguments' position in the overall sentence. We found that 85% of Arg1s are adjacent to the relation phrase. Nearly all of the remaining cases are due to either compound verbs (10%) or intervening relative clauses (5%). These three cases account for 99% of the relations in our sample. An example of a compound verb is in the sentence "Mozart was born in Salzburg, but moved to Vienna in 1781", which results in an extraction with a non-adjacent Arg1:

    (Mozart, moved to, Vienna)

An example with an intervening relative clause is the sentence "Starbucks, which was founded in Seattle, has a new logo". This also results in an extraction with a non-adjacent Arg1:

    (Starbucks, has, a new logo)

Arg2s almost always immediately follow the relation phrase. However, their end delimiters are trickier; there are several possible end delimiters of Arg2, making this a more difficult problem. In 58% of our extractions, Arg2 extends to the end of the sentence. In 17% of the cases, Arg2 is followed by a conjunction or function word such as "if", "while", or "although" and then by an independent clause or VP. Harder to detect are the 9% where Arg2 is directly followed by an independent clause or VP. Hardest of all are the 11% where Arg2 is followed by a preposition, since prepositional phrases could also be part of Arg2. This leads to the well-studied but difficult prepositional phrase attachment problem. For now, we use limited syntactic evidence (POS tagging, NP chunking) to identify arguments, though more semantic knowledge to disambiguate prepositional phrases could come in handy for our task.

5.2 Design of ARGLEARNER

Our analysis of syntactic patterns reveals that the majority of arguments fit into a small number of syntactic categories. Similarly, there are common delimiters that could aid in detecting argument boundaries. This analysis inspires us to develop ARGLEARNER, a learning-based system that uses these patterns as features to identify the arguments, given a sentence and relation phrase pair.

ARGLEARNER divides this task into two subtasks, finding Arg1 and Arg2, and then subdivides each of these subtasks into identifying the left bound and the right bound of each argument. ARGLEARNER employs three classifiers to this end (Figure 4): two classifiers identify the left and right bounds of Arg1, and the last classifier identifies the right bound of Arg2. Since Arg2 almost always follows the relation phrase, we do not need a separate Arg2 left bound classifier. A minimal sketch of this decomposition is given below.

We use Weka's REPTree [Hall et al., 2009] to identify the right boundary of Arg1 and a sequence-labeling CRF classifier implemented in Mallet [McCallum, 2002] for the other classifiers. Our standard set of features includes those that describe the noun phrase in question, the context around it, and the whole sentence, such as sentence length, POS tags, capitalization, and punctuation. In addition, for each classifier we use features suggested by our analysis above. For example, for the right bound of Arg1 we create regular expression indicators to detect whether the relation phrase is a compound verb and whether the noun phrase in question is the subject of the compound verb. For Arg2 we create regular expression indicators to detect patterns such as Arg2 followed by an independent clause or verb phrase. Note that these indicators will not match all possible sentence structures, but they act as useful features that help the classifiers identify the categories. We also design several features specific to these different classifiers.
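The decomposition into three boundary classifiers can be wired together roughly as follows. This is a sketch under our own naming: the classifiers are passed in as callables, and the toy "classifiers" in the usage example merely mimic simple heuristics rather than the trained REPTree and CRF models.

```python
def arglearner_extract(tokens, rel_span,
                       arg1_right_clf, arg1_left_clf, arg2_right_clf):
    """Sketch of ARGLEARNER's decomposition (Figure 4): given a sentence and a
    relation-phrase span, three classifiers pick the argument boundaries.
    Here the classifiers are callables returning token indices; in the paper
    they are a REPTree model (Arg1 right bound) and Mallet CRFs (the rest)."""
    rel_start, rel_end = rel_span

    # Arg1 lies to the left of the relation phrase: first fix its right
    # bound (usually just before the relation), then its left bound.
    arg1_right = arg1_right_clf(tokens, rel_span)
    arg1_left = arg1_left_clf(tokens, rel_span, arg1_right)

    # Arg2 starts immediately after the relation phrase, so only its right
    # bound needs to be classified.
    arg2_left = rel_end
    arg2_right = arg2_right_clf(tokens, rel_span)

    return (" ".join(tokens[arg1_left:arg1_right]),
            " ".join(tokens[rel_start:rel_end]),
            " ".join(tokens[arg2_left:arg2_right]))

# Toy usage with trivial "classifiers" standing in for the learned models:
tokens = "The cost of the war against Iraq has risen above 500 billion dollars".split()
print(arglearner_extract(
    tokens, (7, 10),
    arg1_right_clf=lambda toks, rel: rel[0],       # Arg1 ends at the relation
    arg1_left_clf=lambda toks, rel, right: 0,      # Arg1 starts at sentence start
    arg2_right_clf=lambda toks, rel: len(toks)))   # Arg2 runs to sentence end
# ('The cost of the war against Iraq', 'has risen above', '500 billion dollars')
```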

Category        Patterns                   Arg1   Arg2   Examples (Arg1 sentence, then Arg2 sentence)
Basic NP        NN, JJ NN, etc.            65%    60%    Chicago was founded in 1833.
                                                         Calcium prevents osteoporosis.
Prepositional   NP PP+                     19%    18%    The forest in Brazil is threatened by ranching.
Attachments                                              Lake Michigan is one of the five Great Lakes of North America.
List            NP (, NP)* ,? and/or NP    15%    15%    Google and Apple are headquartered in Silicon Valley.
                                                         A galaxy consists of stars and stellar remnants.
Independent     (that|WP|WDT)? NP VP NP    0%     8%     Google will acquire YouTube, announced ...
Clause                                                   Scientists estimate that 80% of oil remains a threat.
Relative        NP (that|WP|WDT) VP NP?    <1%    6%     Chicago, which is located in Illinois, has three million residents.
Clause                                                   Most galaxies appear to be dwarf galaxies, which are small.

Table 4: Taxonomy of arguments for binary relationships; for each category, the first example sentence illustrates Arg1 and the second illustrates Arg2, and the percentages give the category's frequency for Arg1 and Arg2. Multiple patterns can appear in a single argument, so percentages need not add to 100. In the interest of space, we omit argument structures that appear in less than 5% of extractions. Upper-case abbreviations are noun phrase chunk and part-of-speech tags.

[Figure 4: ARGLEARNER's system architecture. Components include a training data constructor fed by CoNLL training data, the Arg1 left bound, Arg1 right bound, and Arg2 right bound classifiers, a reranker, and an argument extractor that takes sentences and relation extractions as input.]

            REVERB   R2A2
Web   Arg1   0.69    0.81
      Arg2   0.53    0.72
News  Arg1   0.75    0.86
      Arg2   0.58    0.74

Table 5: R2A2 has a substantially higher F1 score than REVERB for both Argument 1 and Argument 2.

The other key challenge for a learning system is training data. Unfortunately, there is no large training set available for Open IE. We build a novel training set by adapting data available for semantic role labeling (SRL), which has been shown to be closely related to Open IE [Christensen et al., 2011b]. We found that a set of post-processing heuristics over SRL data can easily convert it into a form meaningful for Open IE training.

We used a subset of the training data adapted from the CoNLL 2005 Shared Task [Carreras and Marquez, 2005]. Our dataset consists of 20,000 sentences and generates about 29,000 Open IE tuples. The cross-validation accuracies of the classifiers on the CoNLL data are 96% for the Arg1 right bound, 92% for the Arg1 left bound, and 73% for the Arg2 right bound. The low accuracy for the Arg2 right bound is primarily due to Arg2's more complex categories, such as relative clauses and independent clauses, and to the difficulty associated with prepositional attachment in Arg2.

Additionally, we train a confidence metric on a hand-labeled development set of random Web sentences. We use Weka's implementation of logistic regression, and use the classifier's weight to order the extractions (a sketch of this ranking step is given below).

We name our final system, which combines REVERB relation phrases with ARGLEARNER's arguments, R2A2. We evaluate R2A2 against REVERB next.
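The confidence/ranking step can be sketched as follows. The paper uses Weka's logistic regression over shallow syntactic features; the sketch below swaps in scikit-learn, and the feature set and training examples are purely hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def shallow_features(extraction, sentence_tokens):
    """Hypothetical shallow features for one (arg1, rel, arg2) extraction;
    the paper's exact feature set is not reproduced here."""
    arg1, rel, arg2 = extraction
    return [
        len(sentence_tokens),                                   # sentence length
        len(rel.split()),                                       # relation phrase length
        int(rel.split()[-1] in {"in", "on", "for", "of", "to", "with"}),
        int(arg1[0].isupper()),                                 # capitalized Arg1
        int(arg2[0].isupper()),                                 # capitalized Arg2
    ]

# Train on a small hand-labeled development set (label 1 = correct extraction),
# then use the predicted probability to rank extractions, as in the paper.
X_train = np.array([[12, 3, 1, 1, 1], [25, 5, 0, 0, 0], [9, 2, 1, 1, 0]])
y_train = np.array([1, 0, 1])
confidence_model = LogisticRegression().fit(X_train, y_train)

features = shallow_features(("Hamas", "claimed responsibility for", "the Gaza attack"),
                            "Hamas claimed responsibility for the Gaza attack .".split())
print(confidence_model.predict_proba([features])[0, 1])   # confidence score
```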
6 R2A2 Experimental Results

We conducted experiments to answer the following questions: (1) Does R2A2 improve argument detection compared to the arguments returned by REVERB's simple heuristics? (2) What kinds of errors does R2A2 reduce, and which errors require more research?

We tested the systems on two datasets. The first dataset consists of 200 random sentences from the New York Times; the second is made up of 200 random Web sentences. Three judges with linguistic backgrounds evaluated the output of the systems, labeling whether the relation phrase was correct and, if so, whether arguments 1 and 2 were the correct arguments for the relation.

We used a stricter criterion for correct arguments than previous Open IE evaluations, which counted an extraction as correct if its arguments were reasonable, even if they omitted relevant information. The annotators were instructed to mark an argument correct only if the argument was as informative as possible but did not include extraneous information. For example, "the President of Brazil" is a correct Arg2 for the relation "spoke to" in the sentence "President Obama spoke to the President of Brazil on Thursday", but the less informative "the President" is considered incorrect, as is "the President of Brazil on Thursday". The inter-annotator agreement for the three judges is 95%.

In this evaluation, we are primarily concerned with the effect of ARGLEARNER, hence we do not consider possibly correct extractions missed by REVERB, since neither system has a chance of extracting them. In other words, our recall calculations use the total number of correct extractions possible using REVERB relation phrases as the denominator.

R2A2 has both precision and recall substantially above those of REVERB. Table 5 compares F1 for each argument on both datasets, at the confidence values that produce the highest F1 on a development set. R2A2 increases F1 by 11-19 points. Figure 5 shows recall and precision for the entire extraction on the combined set of Web and newswire sentences. R2A2 has over 0.20 higher precision than REVERB over nearly all of the precision-recall curve, as well as higher recall.

[Figure 5: Precision-recall curves for the full extraction task. R2A2 has substantially higher recall and precision than REVERB.]

We also analyzed how R2A2 performs on the argument types and context patterns identified in Section 5.1. R2A2 had high F1 (from 0.82 to 0.95) on the major patterns of Arg1: simple NP, prepositional attachment, and list. For the example sentence given at the beginning of Section 5, R2A2 correctly extracts "The cost of the war against Iraq" as Arg1 instead of "Iraq."

Its performance on adjacent Arg1s is high (0.93), but could be improved on compound verbs (0.69) and intervening relative clauses (0.73). R2A2 also has difficulty recognizing when the given relation has no valid Arg1, as in "If interested in this offer, please contact us," with relation phrase "contact." Future systems could make better use of negative training data to correct this issue.

For Arg2, R2A2 performs well on simple noun phrases, with an F1 of 0.87. However, F1 is between 0.62 and 0.71 for all other syntactic patterns of Arg2. This is considerably above REVERB's F1 of 0.0 to 0.19 on these patterns, but still leaves considerable room for improvement. In contrast to REVERB, R2A2 also handles the second example sentence from the previous section, extracting (The plan, would reduce the number of, teenagers who begin smoking).

7 Related Work

Web-scale information extraction has received considerable attention in the last few years. Pre-emptive Information Extraction and Open Information Extraction are the first paradigms that relax the restriction to a given vocabulary of relations and scale to all relation phrases expressed in text [Shinyama and Sekine, 2006; Banko et al., 2007; Banko and Etzioni, 2008; Wu and Weld, 2010]. Preemptive IE relies on document and entity clustering, which is too costly for Web-scale IE. Open IE favors speed over deeper processing, which aids in scaling to Web-scale corpora.

There is recent work on incorporating distant supervision from manually constructed knowledge bases such as Freebase or Wikipedia to automatically learn extractors for the large number of relations in the KB [Mintz et al., 2009; Hoffmann et al., 2010; 2011]. These approaches use heuristic methods to generate training data by mapping the entities of the KB to sentences mentioning them in text. This reduces the relation extraction problem to a multi-class supervised learning problem.

The Never Ending Language Learner (NELL) project aims to learn a macro-reading agent that gets better and better at reading as it reads the same text multiple times [Carlson et al., 2010a; 2010b]. Essentially, NELL learns new extraction patterns using previous system extractions as training data. Such pattern learning systems have in the past been prone to concept drift. NELL largely overcomes concept drift by employing coupled training, which generates negative training data for one concept based on positive examples of another and on known mutual exclusions between types. There is also a sophisticated mechanism to advance a hypothesized extraction to an accepted extraction.

NELL and the distant supervision approaches differ from our Open IE paradigm in an important way: they all learn extractors for a known set of relations. Distant supervision approaches have scaled to a few thousand relations, whereas NELL's knowledge base is much smaller, covering around a hundred relations. In contrast, our most recent run of Open IE on a Web-scale corpus returned about 1.5 million distinct relation phrases. In other words, Open IE can be applied as is to any domain and any corpus of English text, and it will extract meaningful information. The flip side of Open IE is its unnormalized output: Open IE faces key challenges due to polysemous and synonymous relation phrases. Our follow-up work attempts to learn synonymous relations [Yates and Etzioni, 2009] and proposes a first solution for normalizing open relation phrases to a domain ontology with minimal supervision [Soderland et al., 2010].
8 Conclusions

We have described the second generation of Open Information Extraction systems: REVERB and R2A2. The key differentiating characteristic of these systems is a linguistic analysis that guides the design of the constraints in REVERB and the features in R2A2. REVERB focuses on identifying a more meaningful and informative relation phrase and outperforms previous Open IE systems by significant margins. R2A2 adds an argument learning component, ARGLEARNER, and almost doubles the area under the precision-recall curve compared to REVERB. Both systems are highly scalable, since they require only shallow syntactic features.

There are three key directions in which to pursue this work further. First, while Open IE systems have primarily focused on binary extractions, not all relationships are binary: events have times and locations, and numerous verbs naturally take three arguments (e.g., "Singh gifted the French President a traditional painting."). We need to extend Open IE to handle n-ary and even nested extractions. Second, not all important relationships are expressed by verbs. For example, the sentence "Seattle Mayor Bloomberg said that..." expresses (Bloomberg, is the Mayor of, Seattle). However, such noun-based extractions are challenging to obtain at high precision, since exposing the semantics hidden in these compound nouns is tricky; for example, "Seattle Symphony Orchestra" does not imply that (Orchestra, is the Symphony of, Seattle). Finally, since our current Open IE systems learn general characteristics of the English language, we recognize that the techniques needed to handle another language, say Chinese, will likely be quite different. However, we believe that the general Open IE paradigm will extend to other languages.

Acknowledgments: This research was supported in part by NSF grant IIS-0803481, ONR grant N00014-08-1-0431, and DARPA contract FA8750-09-C-0179.

References

[Allerton, 2002] David J. Allerton. Stretched Verb Constructions in English. Routledge Studies in Germanic Linguistics. Routledge (Taylor and Francis), New York, 2002.

[Banko and Etzioni, 2008] Michele Banko and Oren Etzioni. The tradeoffs between open and traditional relation extraction. In ACL'08, 2008.

[Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, and Oren Etzioni. Open information extraction from the web. In IJCAI, 2007.

[Berant et al., 2011] Jonathan Berant, Ido Dagan, and Jacob Goldberger. Global learning of typed entailment rules. In ACL'11, 2011.

[Carlson et al., 2010a] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr., and Tom M. Mitchell. Toward an architecture for never-ending language learning. In AAAI'10, 2010.

[Carlson et al., 2010b] Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. Coupled semi-supervised learning for information extraction. In WSDM 2010, 2010.

[Carreras and Marquez, 2005] Xavier Carreras and Lluis Marquez. Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling, 2005.

[Christensen et al., 2011a] Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. Learning arguments for open information extraction. Submitted, 2011.

[Christensen et al., 2011b] Janara Christensen, Mausam, Stephen Soderland, and Oren Etzioni. The tradeoffs between syntactic features and semantic roles for open information extraction. In Knowledge Capture (KCAP), 2011.

[Etzioni et al., 2006] Oren Etzioni, Michele Banko, and Michael J. Cafarella. Machine reading. In Proceedings of the 21st National Conference on Artificial Intelligence, 2006.

[Fader et al., 2011] Anthony Fader, Stephen Soderland, and Oren Etzioni. Identifying relations for open information extraction. Submitted, 2011.

[Grefenstette and Teufel, 1995] Gregory Grefenstette and Simone Teufel. Corpus-based method for automatic identification of support verbs for nominalizations. In EACL'95, 1995.

[Hall et al., 2009] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.

[Hoffmann et al., 2010] Raphael Hoffmann, Congle Zhang, and Daniel S. Weld. Learning 5000 relational extractors. In ACL'10, 2010.

[Hoffmann et al., 2011] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. Distant supervision for information extraction of overlapping relations. In ACL'11, 2011.

[Kim and Moldovan, 1993] J. Kim and D. Moldovan. Acquisition of semantic patterns for information extraction from corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, pages 171-176, 1993.

[Lin et al., 2010] Thomas Lin, Mausam, and Oren Etzioni. Identifying functional relations in Web text. In EMNLP'10, 2010.

[McCallum, 2002] Andrew McCallum. MALLET: A machine learning for language toolkit. http://mallet.cs.umass.edu, 2002.

[Mintz et al., 2009] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In ACL-IJCNLP'09, 2009.

[Riloff, 1996] E. Riloff. Automatically constructing extraction patterns from untagged text. In AAAI'96, 1996.

[Ritter et al., 2010] Alan Ritter, Mausam, and Oren Etzioni. A latent Dirichlet allocation method for selectional preferences. In ACL, 2010.

[Ritter et al., 2011] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. Named entity recognition in tweets: An experimental study. Submitted, 2011.

[Schoenmackers et al., 2010] Stefan Schoenmackers, Oren Etzioni, Daniel S. Weld, and Jesse Davis. Learning first-order horn clauses from web text. In EMNLP'10, 2010.

[Shinyama and Sekine, 2006] Yusuke Shinyama and Satoshi Sekine. Preemptive information extraction using unrestricted relation discovery. In NAACL'06, 2006.

[Soderland et al., 2010] Stephen Soderland, Brendan Roof, Bo Qin, Shi Xu, Mausam, and Oren Etzioni. Adapting open information extraction to domain-specific relations. AI Magazine, 31(3):93-102, 2010.

[Soderland, 1999] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1-3):233-272, 1999.

[Stevenson et al., 2004] Suzanne Stevenson, Afsaneh Fazly, and Ryan North. Statistical measures of the semi-productivity of light verb constructions. In 2nd ACL Workshop on Multiword Expressions, pages 1-8, 2004.

[Wu and Weld, 2010] Fei Wu and Daniel S. Weld. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 118-127, Morristown, NJ, USA, 2010. Association for Computational Linguistics.

[Yates and Etzioni, 2009] A. Yates and O. Etzioni. Unsupervised methods for determining object and relation synonyms on the web. Journal of Artificial Intelligence Research, 34(1):255-296, 2009.

[Zhu et al., 2009] Jun Zhu, Zaiqing Nie, Xiaojiang Liu, Bo Zhang, and Ji-Rong Wen. StatSnowball: a statistical approach to extracting entity relationships. In WWW'09, 2009.
