Incremental Knowledge Base Construction Using Deepdive

Incremental Knowledge Base Construction Using DeepDive Jaeho Shiny Sen Wuy Feiran Wangy Christopher De Say Ce Zhangyz Christopher Réy yStanford University zUniversity of Wisconsin-Madison {jaeho, senwu, feiran, cdesa, czhang, chrismre}@cs.stanford.edu ABSTRACT complex relationships. Typically, quality is assessed us- Populating a database with unstructured information is a ing two complementary measures: precision (how often a long-standing problem in industry and research that encom- claimed tuple is correct) and recall (of the possible tuples to passes problems of extraction, cleaning, and integration. Re- extract, how many are actually extracted). These systems cent names used for this problem include dealing with dark can ingest massive numbers of documents–far outstripping data and knowledge base construction (KBC). In this work, the document counts of even well-funded human curation ef- we describe DeepDive, a system that combines database and forts. Industrially, KBC systems are constructed by skilled machine learning ideas to help develop KBC systems, and we engineers in a months-long (or longer) process–not a one- present techniques to make the KBC process more efficient. shot algorithmic task. Arguably, the most important ques- We observe that the KBC process is iterative, and we de- tion in such systems is how to best use skilled engineers’ velop techniques to incrementally produce inference results time to rapidly improve data quality. In its full generality, for KBC systems. We propose two methods for incremen- this question spans a number of areas in computer science, tal inference, based respectively on sampling and variational including programming languages, systems, and HCI. We techniques. We also study the tradeoff space of these meth- focus on a narrower question, with the axiom that the more ods and develop a simple rule-based optimizer. DeepDive rapid the programmer moves through the KBC construction includes all of these contributions, and we evaluate Deep- loop, the more quickly she obtains high-quality data. This paper presents DeepDive, our open-source engine for Dive on five KBC systems, showing that it can speed up 1 KBC inference tasks by up to two orders of magnitude with knowledge base construction. DeepDive’s language and ex- negligible impact on quality. ecution model are similar to other KBC systems: DeepDive uses a high-level declarative language [11, 28, 30]. From a database perspective, DeepDive’s language is based on SQL. 1. INTRODUCTION From a machine learning perspective, DeepDive’s language is based on Markov Logic [13, 30]: DeepDive’s language in- The process of populating a structured relational database herits Markov Logic Networks’ (MLN’s) formal semantics.2 from unstructured sources has received renewed interest in Moreover, it uses a standard execution model for such sys- the database community through high-profile start-up com- tems [11, 28, 30] in which programs go through two main panies (e.g., Tamr and Trifacta), established companies like phases: grounding, in which one evaluates a sequence of SQL IBM’s Watson [7, 16], and a variety of research efforts [11, queries to produce a data structure called a factor graph that 25,28,36,40]. At the same time, communities such as natu- describes a set of random variables and how they are cor- ral language processing and machine learning are attacking related. Essentially, every tuple in the database or result similar problems under the name knowledge base construc- of a query is a random variable (node) in this factor graph. tion (KBC) [5, 14, 23]. While different communities place The inference phase takes the factor graph from grounding differing emphasis on the extraction, cleaning, and integra- and performs statistical inference using standard techniques, tion phases, all communities seem to be converging toward e.g., Gibbs sampling [42, 44]. The output of inference is the a common set of techniques that include a mix of data pro- marginal probability of every tuple in the database. As with cessing, machine learning, and engineers-in-the-loop. Google’s Knowledge Vault [14] and others [31], DeepDive The ultimate goal of KBC is to obtain high-quality struc- also produces marginal probabilities that are calibrated: if tured data from unstructured information. These databases one examined all facts with probability 0.9, we would ex- are richly structured with tens of different entity types in pect that approximately 90% of these facts would be correct. To calibrate these probabilities, DeepDive estimates This work is licensed under the Creative Commons Attribution- (i.e., learns) parameters of the statistical model from data. NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this li- Inference is a subroutine of the learning procedure and is cense, visit http://creativecommons.org/licenses/by-nc-nd/3.0/. Obtain per- the critical loop. Inference and learning are computation- mission prior to any use beyond those covered by the license. Contact ally intense (hours on 1TB RAM/48-core machines). copyright holder by emailing [email protected]. Articles from this volume were invited to present their results at the 41st International Conference on 1 Very Large Data Bases, August 31st - September 4th 2015, Kohala Coast, http://deepdive.stanford.edu 2 Hawaii. DeepDive has some technical differences from Markov Proceedings of the VLDB Endowment, Vol. 8, No. 11 Logic that we have found useful in building applications. Copyright 2015 VLDB Endowment 2150-8097/15/07. We discuss these differences in Section 2.3. 1310 In our experience with DeepDive, we found that KBC is study of the tradeoff space highlights that neither materi- an iterative process. In the past few years, DeepDive has alization strategy dominates the other. To automatically been used to build dozens of high-quality KBC systems by choose the materialization strategy, we develop a simple a handful of technology companies, a number law enforce- rule-based optimizer. ment agencies via DARPA’s MEMEX, and scientists in fields such as paleobiology, drug repurposing, and genomics. Re- Experimental Evaluation Highlights. We used DeepDive cently, we compared a DeepDive system’s extractions to the programs developed by our group and DeepDive users to un- quality of extractions provided by human volunteers over derstand whether the improvements we describe can speed the last ten years for a paleobiology database, and we found up the iterative development process of DeepDive programs. that the DeepDive system had higher quality (both precision To understand the extent to which DeepDive’s techniques and recall) on many entities and relationships. Moreover, on improve development time, we took a sequence of six snap- all of the extracted entities and relationships, DeepDive had shots of a KBC system and ran them with our incremental no worse quality [32]. Additionally, the winning entry of techniques and completely from scratch. In these snapshots, the 2014 TAC-KBC competition was built on DeepDive [3]. our incremental techniques are 22× faster. The results for In all cases, we have seen the process of developing KBC each snapshot differ at most by 1% for high-quality facts systems is iterative: quality requirements change, new data (90%+ accuracy); fewer than 4% of facts differ by more sources arrive, and new concepts are needed in the applica- than 0.05 in probability between approaches. Thus, essen- tion. This led us to develop techniques to make the entire tially the same facts were given to the developer through- pipeline incremental in the face of changes both to the data out execution using the two techniques, but the incremental and to the DeepDive program. Our primary technical con- techniques delivered them more quickly. tributions are to make the grounding and inference phases 3 more incremental. Outline. The rest of the paper is organized as follows. Sec- tion 2 contains an in-depth analysis of the KBC development Incremental Grounding. Grounding and feature extrac- process, and the presentation of our language for modeling tion are performed by a series of SQL queries. To make KBC systems. We discuss the different techniques for in- this phase incremental, we adapt the algorithm of Gupta, cremental maintenance in Section 3. We also present the Mumick, and Subrahmanian [18]. In particular, DeepDive results of the exploration of the tradeoff space and the de- allows one to specify “delta rules” that describe how the scription of our optimizer. Our experimental evaluation is output will change as a result of changes to the input. Al- presented in Section 4. though straightforward, this optimization has not been ap- plied systematically in such systems and can yield up to Related Work 360× speedup in KBC systems. Knowledge Base Construction (KBC) KBC has been an area of intense study over the last decade, moving from Incremental Inference. Due to our choice of incremental pattern matching [19] and rule-based systems [25] to systems grounding, the input to DeepDive’s inference phase is a fac- that use machine learning for KBC [5, 8, 14, 15, 28]. Many tor graph along with a set of changed data and rules. The groups have studied how to improve the quality of specific goal is to compute the output probabilities computed by the components of KBC systems [27, 43]. We build on this line system. Our approach is to frame the incremental mainte- of work. We formalized the development process and built nance problem as one of approximate inference. Previous DeepDive to ease and accelerate the KBC process, which we work in the database community has looked at how machine hope is of interest to many of these systems as well. Deep- learning data products change in response to both to new la- Dive has many common features to Chen and Wang [11], bels [24] and to new data [9,10]. In KBC, both the program Google’s Knowledge Vault [14], and a forerunner of Deep- and data change on each iteration.

Load more