Improving Knowledge Base Construction from Robust Infobox Extraction
Boya Peng*  Yejin Huh  Xiao Ling  Michele Banko*
Apple Inc., 1 Apple Park Way, Cupertino, CA, USA; Sentropy Technologies
{yejin.huh,xiaoling}@apple.com  {emma,mbanko}@sentropy.io
* Work performed at Apple Inc.

Proceedings of NAACL-HLT 2019, pages 138–148, Minneapolis, Minnesota, June 2 - June 7, 2019. © 2019 Association for Computational Linguistics

Abstract

A capable, automatic Question Answering (QA) system can provide more complete and accurate answers using a comprehensive knowledge base (KB). One important approach to constructing a comprehensive knowledge base is to extract information from Wikipedia infobox tables to populate an existing KB. Despite previous successes in the Infobox Extraction (IBE) problem (e.g., DBpedia), three major challenges remain: 1) deterministic extraction patterns used in DBpedia are vulnerable to template changes; 2) over-trusting Wikipedia anchor links can lead to entity disambiguation errors; 3) heuristic-based extraction of unlinkable entities yields low precision, hurting both accuracy and completeness of the final KB. This paper presents a robust approach that tackles all three challenges. We build probabilistic models to predict relations between entity mentions directly from the infobox tables in HTML. The entity mentions are linked to identifiers in an existing KB if possible; the unlinkable ones are also parsed and preserved in the final output. Training data for both the relation extraction and the entity linking models are automatically generated using distant supervision. We demonstrate the empirical effectiveness of the proposed method in both precision and recall compared to a strong IBE baseline, DBpedia, with an absolute improvement of 41.3% in average F1. We also show that our extraction makes the final KB significantly more complete, improving the completeness score of list-value relation types by 61.4%.

1 Introduction

Most existing knowledge bases (KBs) are largely incomplete. This can be seen in Wikidata (Vrandečić and Krötzsch, 2014), a widely used knowledge graph created largely by human editors: only 46% of person entities in Wikidata have birth places available (as of June 2018). An estimated 584 million facts are maintained in Wikipedia but not in Wikidata (Hellmann, 2018). A downstream application such as Question Answering (QA) will suffer from this incompleteness and fail to answer certain questions, or even provide an incorrect answer, especially for a question about a list of entities, due to a closed-world assumption. Previous work on enriching and growing existing knowledge bases includes relation extraction on natural language text (Wu and Weld, 2007; Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012; Koch et al., 2014), knowledge base reasoning from existing facts (Lao et al., 2011; Guu et al., 2015; Das et al., 2017), and many others (Dong et al., 2014).

Wikipedia (https://wikipedia.org) has been one of the key resources used for knowledge base construction. In many Wikipedia pages, a summary table of the subject, called an infobox table, is presented in the top right region of the page (see the leftmost table in Figure 1 for the infobox table of The_Beatles). Infobox tables offer a unique opportunity for extracting information and populating a knowledge base. An infobox table is structurally formatted as an HTML table, so it is often unnecessary to parse the text into a syntactic parse tree as in natural language extraction. Intra-Wikipedia anchor links are prevalent in infobox tables, often providing unambiguous references to entities. Most importantly, a significant amount of the information represented in infobox tables is not otherwise available in a more traditional structured knowledge base, such as Wikidata.

We are not the first to use infobox tables for knowledge base completion. The pioneering work of DBpedia (Auer et al., 2007; Lehmann et al., 2015) extracts canonical knowledge triples (subject, relation type, object) from infobox tables with a manually created mapping from Mediawiki templates to relation types. (Throughout the paper, we use "DBpedia" to refer to its infobox extraction component rather than the DBpedia knowledge base unless specified. We use Wikidata as the baseline knowledge base for our experiments since it is updated more frequently; the last release of the DBpedia knowledge base was in 2016. Mediawiki is a markup language that defines the page layout and presentation of Wikipedia pages.) Despite the success of the DBpedia project, three major challenges remain. First, deterministic mappings are sensitive to template changes: if Wikipedia modifies an infobox template (e.g., the attribute "birthDate" is renamed to "dateOfBirth"), the DBpedia mappings need to be manually updated. Secondly, while Wikipedia anchor links facilitate disambiguation of string values in the infobox tables, blindly trusting the anchor links can cause errors. For instance, both "Sasha" and "Malia," children of Barack Obama, are linked to a section of the Wikipedia page of Barack_Obama rather than to their own pages. Finally, little attention has been paid to the extraction of unlinkable entities. For example, Larry King has married seven women, only one of whom can be linked to a Wikipedia page. A knowledge base without the information of the other six entities will provide an incorrect answer to the question "How many women has Larry King married?"

In this paper, we present a system, RIBE, to tackle all three challenges: 1) We build probabilistic models to predict relations and object entity links; the learned models are more robust to changes in the underlying data representation than manually maintained mappings. 2) We incorporate the information from HTML anchors and build an entity linking system to link string values to entities, rather than fully relying on the anchor links. 3) We produce high-quality extractions even when the objects are unlinkable, which improves the completeness of the final knowledge base.

We demonstrate that the proposed method is effective in extracting over 50 relation types. Compared to a strong IBE baseline, DBpedia, our extractions achieve significant improvements in both precision and recall, with a 41.3% increase in average F1 score. We also show that the extracted triples add great value to an existing knowledge base, Wikidata, improving the average recall of list-value relation types by 61.4%.

To summarize, our contributions are three-fold:

• RIBE produces high-quality extractions, achieving higher precision and recall compared to DBpedia.
• Our extractions make Wikidata more complete by adding a significant number of triples for 51 relation types.
• RIBE extracts relations with unlinkable entities, which are crucial for the completeness of list-value relation types and for the question answering capability of a knowledge base.

2 Related Work

Auer and Lehmann (2007) proposed to extract information from infobox tables by pattern matching against Mediawiki templates, parsing them, and transforming them into RDF triples. However, the relation types of the resulting triples remain lexical and can be ambiguous. Lehmann et al. (2015) introduced an ontology to reduce the ambiguity in relation types. A mapping from infobox attributes to the ontology is manually created (the DBpedia Mappings Wiki at http://mappings.dbpedia.org), which is brittle to template changes. In contrast, RIBE is much more robust: it trains statistical models with distant supervision from Wikidata to automatically learn the mapping from infobox attributes to Wikidata relation types. RIBE properly parses infobox values into separate object mentions, instead of relying on existing Mediawiki boundaries as in (Auer and Lehmann, 2007; Lehmann et al., 2015). The RIBE entity linker learns to make entity link predictions rather than directly translating anchor links (Auer and Lehmann, 2007; Lehmann et al., 2015), which is vulnerable to human edit errors. While there is other relevant work, due to space constraints it is discussed throughout the paper and in the appendices as appropriate.

3 Problem Definition

We define a relation as a triple (e1, r, e2), or r(e1, e2), where r is the relation type and e1, e2 are the subject and object entities, respectively (e2 can also be a literal value such as a number or a time expression unless specified). We define an entity mention m as the surface form of an entity, and a relation mention as a pair of entity mentions (m1, m2) for a relation type r. We denote the set of infobox tables by T = {t1, t2, ..., tn}, where ti appears in the Wikipedia page of the entity ei. The Infobox Extraction problem studied in this paper aims at extracting a set of relation triples r(e1, e2) from an input of Wikipedia infobox tables T.

4 System Overview

We describe the RIBE system in this section. Wikidata (Vrandečić and Krötzsch, 2014) is employed as the base external KB.

4.1.2 Feature Generation

For each relation mention, we generate features similar to the ones from Mintz et al. (2009), with some modifications. Since the subject of a relation mention is outside the infobox, most generated features focus on the object (e.g., the word shape of the object mention) and its context (e.g., the infobox attribute value). Table 6 in Appendix A lists the complete set of features.

4.1.3 Distant Supervision
We draw distant Instead of manually collecting training examples, supervision from it by matching candidate men- we use distant supervision (Craven and Kumlien, tions against Wikidata relation triples for training 1999; Bunescu and Mooney, 2007; Mintz et al., statistical models.