Domain-Targeted, High Precision Knowledge Extraction
Bhavana Dalvi Mishra    Niket Tandon    Peter Clark
Allen Institute for Artificial Intelligence
2157 N Northlake Way Suite 110, Seattle, WA 98103
{bhavanad,nikett,peterc}@allenai.org

Abstract

Our goal is to construct a domain-targeted, high precision knowledge base (KB), containing general (subject,predicate,object) statements about the world, in support of a downstream question-answering (QA) application. Despite recent advances in information extraction (IE) techniques, no suitable resource for our task already exists; existing resources are either too noisy, too named-entity centric, or too incomplete, and typically have not been constructed with a clear scope or purpose. To address these limitations, we have created a domain-targeted, high precision knowledge extraction pipeline, leveraging Open IE, crowdsourcing, and a novel canonical schema learning algorithm (called CASI), that produces high precision knowledge targeted to a particular domain, in our case elementary science. To measure the KB's coverage of the target domain's knowledge (its "comprehensiveness" with respect to science), we measure recall with respect to an independent corpus of domain text, and show that our pipeline produces output with over 80% precision and 23% recall with respect to that target, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We have made the KB publicly available.¹

¹ This KB, named the "Aristo Tuple KB", is available for download at http://data.allenai.org/tuple-kb

1 Introduction

While there have been substantial advances in knowledge extraction techniques, the availability of high precision, general knowledge about the world remains elusive. Specifically, our goal is a large, high precision body of (subject,predicate,object) statements relevant to elementary science, to support a downstream QA application task. Although there are several impressive existing resources that can contribute to our endeavor, e.g., NELL (Carlson et al., 2010), ConceptNet (Speer and Havasi, 2013), WordNet (Fellbaum, 1998), WebChild (Tandon et al., 2014), Yago (Suchanek et al., 2007), FreeBase (Bollacker et al., 2008), and ReVerb-15M (Fader et al., 2011), their applicability is limited by both:

• limited coverage of general knowledge (e.g., FreeBase and NELL primarily contain knowledge about Named Entities; WordNet uses only a few (< 10) semantic relations);
• low precision (e.g., many ConceptNet assertions express idiosyncratic rather than general knowledge).

Our goal in this work is to create a domain-targeted knowledge extraction pipeline that can overcome these limitations and output a high precision KB of triples relevant to our end task. Our approach leverages existing techniques of open information extraction (Open IE) and crowdsourcing, along with a novel schema learning algorithm.

There are three main contributions of this work. First, we present a high precision extraction pipeline able to extract (subject,predicate,object) tuples relevant to a domain with precision in excess of 80%. The input to the pipeline is a corpus, a sense-disambiguated domain vocabulary, and a small set of entity types. The pipeline uses a combination of text filtering, Open IE, Turker annotation on samples, and precision prediction to generate its output.
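To make the flow of that pipeline concrete, here is a minimal sketch of the stages just listed. This is our illustration, not the authors' code: every name is hypothetical, `open_ie` stands in for a real Open IE system, and `predicted_precision` stands in for the learned precision model calibrated on the Turker-annotated samples.

```python
# Hypothetical sketch of the extraction pipeline's control flow.
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def filter_sentences(corpus: List[str], vocab: Set[str]) -> List[str]:
    """Text filtering: keep sentences mostly expressed in the domain vocabulary."""
    def coverage(sentence: str) -> float:
        words = sentence.lower().rstrip(".").split()
        return sum(w in vocab for w in words) / max(len(words), 1)
    return [s for s in corpus if coverage(s) >= 0.8]


def open_ie(sentence: str) -> List[Triple]:
    """Stand-in for a real Open IE system: naively split 'subject verb+ object'."""
    words = sentence.lower().rstrip(".").split()
    if len(words) >= 3:
        return [Triple(words[0], " ".join(words[1:-1]), words[-1])]
    return []


def predicted_precision(t: Triple, judged: Dict[str, float]) -> float:
    """Stand-in for precision prediction, calibrated on Turker-judged samples."""
    return judged.get(t.predicate, 0.5)


def run_pipeline(corpus, vocab, judged, threshold=0.8):
    candidates = [t for s in filter_sentences(corpus, vocab) for t in open_ie(s)]
    return [t for t in candidates if predicted_precision(t, judged) >= threshold]


if __name__ == "__main__":
    corpus = ["Bears have fur.", "Stocks fell sharply today."]
    vocab = {"bears", "have", "fur"}
    judged = {"have": 0.9}  # fraction of sampled 'have' tuples judged true
    print(run_pipeline(corpus, vocab, judged))
    # -> [Triple(subject='bears', predicate='have', obj='fur')]
```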
Second, we present a novel canonical schema induction method (called CASI) that identifies clusters of similar-meaning predicates and maps them to the most appropriate general predicate that captures that canonical meaning. Open IE, used in the early part of our pipeline, generates triples containing a large number of predicates (expressed as verbs or verb phrases), but equivalences and generalizations among them are not captured. Synonym dictionaries, paraphrase databases, and verb taxonomies can help identify these relationships, but only partially so, because the meaning of a verb often shifts as its subject and object vary, something these resources do not explicitly model. To address this challenge, we have developed a corpus-driven method that takes into account the subject and object of the verb, and thus can learn argument-specific mapping rules (a minimal sketch of such a rule appears below). For example, the rule "(x:Animal,found in,y:Location) → (x:Animal,live in,y:Location)" states that if an animal is found in a location, then the animal lives in that location. Note that 'found in' can have a very different meaning in the schema "(x:Substance,found in,y:Material)". The result is a KB whose general predicates are more richly populated, still with high precision.

Finally, we contribute the science KB itself as a resource publicly available² to the research community. To measure how "complete" the KB is with respect to the target domain (elementary science), we use an (independent) corpus of domain text to characterize the target science knowledge, and measure the KB's recall at high (>80%) precision over that corpus (its "comprehensiveness" with respect to science). This measure is similar to recall at the point P=80% on the PR curve, except measured against a domain-specific sample of data that reflects the distribution of the target domain knowledge. Comprehensiveness thus gives us an approximate notion of the completeness of the KB for (tuple-expressible) facts in our target domain, something that has been lacking in earlier KB construction research. We show that our KB has comprehensiveness (recall of domain facts at >80% precision) of 23% with respect to science, a substantially higher coverage of tuple-expressible science knowledge than other comparable resources. We are making the KB publicly available.

² The Aristo Tuple KB is available for download at http://allenai.org/data/aristo-tuple-kb
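Picking up the argument-typed mapping rules described above: the sketch below is a hypothetical simplification of ours, not CASI itself. The type table, the rule table, and the `canonicalize` step are all invented for illustration; the point is only that keying rules on argument types lets the same verb map differently, or not at all, in different schemas.

```python
# Hypothetical illustration of argument-typed predicate mapping rules.
# Rules key on (subject type, predicate, object type), so the same verb
# can be rewritten under one typing and left untouched under another.

RULES = {
    ("Animal", "found in", "Location"): "live in",
    # No rule for ("Substance", "found in", "Material"): that schema's
    # meaning is different, so its predicate is not rewritten.
}

TYPES = {"lion": "Animal", "forest": "Location",
         "iron": "Substance", "rock": "Material"}


def canonicalize(subject: str, predicate: str, obj: str):
    """Rewrite the predicate if an argument-typed rule applies."""
    key = (TYPES.get(subject), predicate, TYPES.get(obj))
    return (subject, RULES.get(key, predicate), obj)


print(canonicalize("lion", "found in", "forest"))  # ('lion', 'live in', 'forest')
print(canonicalize("iron", "found in", "rock"))    # ('iron', 'found in', 'rock')
```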
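The comprehensiveness measure can likewise be made concrete. The sketch below is our reading of the definition, not the authors' evaluation script: it assumes truth judgments for a random sample of KB tuples (to verify the >80% precision bar) and a set of target facts derived from the independent domain corpus.

```python
# Hypothetical sketch of "comprehensiveness": recall of target-domain facts,
# reported only when the KB meets the required precision bar.

def precision(sample_judgments):
    """sample_judgments: True/False human judgments on a random sample of KB tuples."""
    return sum(sample_judgments) / len(sample_judgments)


def comprehensiveness(kb, domain_facts, sample_judgments, min_precision=0.8):
    if precision(sample_judgments) < min_precision:
        raise ValueError("KB is below the required precision point")
    return sum(fact in kb for fact in domain_facts) / len(domain_facts)


kb = {("bear", "have", "fur"), ("animal", "live in", "forest")}
domain_facts = [("bear", "have", "fur"), ("plant", "need", "water")]
print(comprehensiveness(kb, domain_facts, [True] * 4 + [False]))  # 0.5
```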
Outline

We discuss related work in Section 2. In Section 3, we describe the domain-targeted pipeline, including how the domain is characterized to the algorithm and the sequence of filters and predictors used. In Section 4, we describe how the relationships between predicates in the domain are identified and the more general predicates further populated. Finally, in Section 5, we evaluate our approach, including its comprehensiveness (high-precision coverage of science knowledge).

2 Related Work

There has been substantial recent progress in knowledge bases that (primarily) encode knowledge about Named Entities, including Freebase (Bollacker et al., 2008), Knowledge Vault (Dong et al., 2014), DBPedia (Auer et al., 2007), and others that hierarchically organize nouns and named entities, e.g., Yago (Suchanek et al., 2007). While these KBs are rich in facts about named entities, they are sparse in general knowledge about common nouns (e.g., that bears have fur). KBs covering general knowledge have received less attention, although there are some notable exceptions constructed using manual methods, e.g., WordNet (Fellbaum, 1998); crowdsourcing, e.g., ConceptNet (Speer and Havasi, 2013); and, more recently, automated methods, e.g., WebChild (Tandon et al., 2014). While useful, these resources have been constructed to target only a small set of relations, providing only limited coverage for a domain of interest.

To overcome relation sparseness, the paradigm of Open IE (Banko et al., 2007; Soderland et al., 2013) extracts knowledge from text using an open set of relationships, and has been used to successfully build large-scale (arg1,relation,arg2) resources such as ReVerb-15M (containing 15 million general triples) (Fader et al., 2011). Despite their broad coverage, however, Open IE techniques typically produce noisy output. Our extraction pipeline can be viewed as an extension of the Open IE paradigm: we start with targeted Open IE output, and then apply a sequence of filters to substantially improve the output's precision, and learn and apply relationships between predicates.

[Figure 1: The extraction pipeline. A vocabulary-guided sequence of open information extraction, crowdsourcing, and predicate-relationship learning is used to produce high precision tuples relevant to the domain of interest.]

The task of finding and exploiting relationships between different predicates requires identifying both equivalence between relations (e.g., clustering to find paraphrases) and implication (hierarchical organization of relations). One class of approach is to use existing resources, e.g., verb taxonomies, as a source of verbal relationships (Grycner and Weikum, 2014; Grycner et al., 2015). However, the hierarchical relationship between verbs, out of context, is often unclear, and some verbs, e.g., "have", are ambiguous. To address this, we characterize semantic relationships not only by a verb but also by the types of its arguments. A second class of approach is to induce semantic equivalence from data, e.g., using algorithms such as DIRT (Lin and Pantel, 2001), RESOLVER (Yates and Etzioni, 2009), WiseNet (Moro and Navigli, 2012), and AMIE (Galárraga et al., 2013); further, mapping such induced equivalences onto canonical relations can be achieved given an existing KB like Freebase (Galárraga et al., 2014). Although no existing methods can be directly applied in our problem setting, the AMIE-based schema clustering method of (Galárraga et al., 2014) can be modified to do this also. We have implemented this modification (called AMIE*, described in Section 5.3), and we use it as a baseline to compare our schema clustering method (CASI) against.

Finally, interactive methods have been used to create common sense knowledge bases. For example, ConceptNet (Speer and Havasi, 2013; Liu and Singh, 2004) includes a substantial amount of knowledge manually contributed by people through a Web-based interface, and has been used in numerous applications (Faaborg and Lieberman, 2006; Dinakar et al., 2012). More recently there has been work on interactive methods (Dalvi et al., 2016; Wolfe et al., 2015; Soderland et al., 2013), which can be