Deepdive: a Data Management System for Automatic Knowledge Base Construction

DeepDive: A Data Management System for Automatic Knowledge Base Construction by Ce Zhang A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer Sciences) at the UNIVERSITY OF WISCONSIN–MADISON 2015 Date of final oral examination: 08/13/2015 The dissertation is approved by the following members of the Final Oral Committee: Jude Shavlik, Professor, Computer Sciences (UW-Madison) Christopher Ré, Assistant Professor, Computer Science (Stanford University) Jeffrey Naughton, Professor, Computer Sciences (UW-Madison) David Page, Professor, Biostatistics and Medical Informatics (UW-Madison) Shanan Peters, Associate Professor, Geosciences (UW-Madison) c Copyright by Ce Zhang 2015 All Rights Reserved i ACKNOWLEDGMENTS I owe Christopher Ré my career as a researcher, the greatest dream of my life. Since the day I first met Chris and told him about my dream, he has done everything he could, as a scientist, an educator, and a friend, to help me. I am forever indebted to him for his completely honest criticisms and feedback, the most valuable gifts an advisor can give. His training equipped me with confidence and pride that I will carry for the rest of my career. He is the role model that I will follow. If my whole future career achieves an approximation of what he has done so far in his, I will be proud and happy. I am also indebted to Jude Shavlik and Miron Livny, who, after Chris left for Stanford, kindly helped me through all the paperwork and payments at Wisconsin. If it were not for their help, I would not have been able to continue my PhD studies. I am also profoundly grateful to Jude for being the chair of my committee. I am also likewise grateful to Jeffrey Naughton, David Page, and Shanan Peters for serving on my committee; and Thomas Reps for his feedback during defense. DeepDive would have not been possible without all its users. Shanan Peters was the first user, working with it before it even got its name. He spent three years going through a painful process with us before we understood the current abstraction of DeepDive. I am grateful to him for sticking with us for this long process. I would also like to thank all the scientists and students who have interacted with me, without whose generous sharing of knowledge we could not have refined DeepDive to its current state. A very incomplete list includes Gabor Angeli, Gill Bejerano, Noel Heim, Arun Kumar, Pradap Konda, Emily Mallory, Christopher Manning, Jonathan Payne, Eldon Ulrich, Robin Valenza, and Yuke Zhu. DeepDive is now a team effort, and I am grateful to the whole DeepDive team, especially Jaeho Shin and Michael Cafarella, who help in managing the DeepDive team. I am also extremely lucky to be surrounded by my friends. Heng Guo, Yinan Li, Linhai Song, Jia Xu, Junming Xu, Wentao Wu, and I have lunch frequently whenever we are in town, and this often amounts to the happiest hour in the entire day. Heng, a theoretician, has also borne the burden of listening to me pitch my system ideas for years, simply because he is, unfortunately for him, my roommate. Feng Niu mentored me through my early years at Wisconsin; even today, whenever I struggle to decide what the right thing to do is, I still imagine what he would have done in a similar situation. Victor Bittorf sparked my interest in modern hardware through his beautiful code and lucid tutorial. Over the years, I learned a lot from Yinan and Wentao about modern database systems, from Linhai about program analysis, and from Heng about counting. Finally, I would also like to thank my parents for all their love over the last twenty-seven years and, I hope, in the future. Anything that I could write about them in any languages, even in my native Chinese, would pale beside everything they have done to support me in my life. My graduate study has been supported by the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory prime contract No. FA8750- 09-C-0181, the DARPA DEFT Program under No. FA8750-13-2-0039, the National Science Foundation (NSF) EAGER Award under No. EAR-1242902, and the NSF EarthCube Award under No. ACI- 1343760. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, NSF, or the U.S. government. ii The results mentioned in this dissertation come from previously published work [16,84,90,148,157,171, 204–207]. Some descriptions are directly from these papers. These papers are joint efforts with different authors, including: F. Abuzaid, G. Angeli, V. Bittorf, V. Govindaraju, S. Gupta, S. Hadjis, M. Premkumar, A. Kumar, M. Livny, C. Manning, F. Niu, C. Ré, S. Peters, C. Sa, A. Sadeghian, Z. Shan, J. Shavlik, J. Shin, J. Tibshirani, F. Wang, J. Wu, and S. Wu. Although this dissertation bears my name, this collection of research would not have been possible without the contributions of all these collaborators. iii ABSTRACT Many pressing questions in science are macroscopic: they require scientists to consult information expressed in a wide range of resources, many of which are not organized in a structured relational form. Knowledge base construction (KBC) is the process of populating a knowledge base, i.e., a relational database storing factual information, from unstructured inputs. KBC holds the promise of facilitating a range of macroscopic sciences by making information accessible to scientists. One key challenge in building a high-quality KBC system is that developers must often deal with data that are both diverse in type and large in size. Further complicating the scenario is that these data need to be manipulated by both relational operations and state-of-the-art machine-learning techniques. This dissertation focuses on supporting this complex process of building KBC systems. DeepDive is a data management system that we built to study this problem; its ultimate goal is to allow scientists to build a KBC system by declaratively specifying domain knowledge without worrying about any algorithmic, performance, or scalability issues. DeepDive was built by generalizing from our experience in building more than ten high-quality KBC systems, many of which exceed human quality or are top-performing systems in KBC competi- tions, and many of which were built completely by scientists or industry users using DeepDive. From these examples, we designed a declarative language to specify a KBC system and a concrete protocol that iteratively improves the quality of KBC systems. This flexible framework introduces challenges of scalability and performance–Many KBC systems built with DeepDive contain statistical inference and learning tasks over terabytes of data, and the iterative protocol also requires executing similar inference problems multiple times. Motivated by these challenges, we designed techniques that make both the batch execution and incremental execution of a KBC program up to two orders of magnitude more efficient and/or scalable. This dissertation describes the DeepDive framework, its applications, and these techniques, to demonstrate the thesis that it is feasible to build an efficient and scalable data management system for the end-to-end workflow of building KBC systems. iv SOFTWARE, DATA, AND VIDEOS • DeepDive is available at http://deepdive.stanford.edu. It is a team effort to further develop and maintain this system. • The DeepDive program of PaleoDeepDive (Chapter 4) is available at http://deepdive.stanford. edu/doc/paleo.html and can be executed with DeepDive v0.6.0α. • The code for the prototype systems we developed to study the tradeoff of performance and scalability is also available separately. 1. Elementary (Chapter 5.1): http://i.stanford.edu/hazy/hazy/elementary 2. DimmWitted (Chapter 5.2): http://github.com/HazyResearch/dimmwitted. This was developed together with Victor Bittorf. 3. Columbus (Chapter 6.1): http://github.com/HazyResearch/dimmwitted/tree/master/ columbus. This was developed together with Arun Kumar. 4. Incremental Feature Engineering (Chapter 6.2): DeepDive v0.6.0α. This was developed together with the DeepDive team, especially Jaeho Shin, Feiran Wang, and Sen Wu. • Most data that we can legally release are available as DeepDive Open Datasets at deepdive. stanford.edu/doc/opendata. The computational resources to produce these data are made possible by millions of machine hours provided by the Center of High Throughput Computing (CHTC), led by Miron Livny at UW-Madison. • These introductory videos in our YouTube channel are related to this dissertation: 1. PaleoDeepDive: http://www.youtube.com/watch?v=Cj2-dQ2nwoY 2. GeoDeepDive: http://www.youtube.com/watch?v=X8uhs28O3eA 3. Wisci: http://www.youtube.com/watch?v=Q1IpE9_pBu4 4. Columbus: http://www.youtube.com/watch?v=wdTds3yg_G4 5. Elementary: http://www.youtube.com/watch?v=Zf7sgMnR89c and http://www.youtube. com/watch?v=B5LxXGIkYe4 v DEPENDENCIES OF READING 2.1 3 4 1 7 8 2.2 5 6 • If you are only interested in how DeepDive can be used to construct knowledge bases to help your applications and example systems built with DeepDive, you can read following the red solid line. • If you are only interested in our work on efficient and scalable statistical analytics, you can read following the blue dotted line. • Otherwise, you can read in linear order. vi CONTENTS Acknowledgments i Abstract iii Software, Data, and Videos iv Dependencies of Reading v Contents vi List of Figures viii List of Tables x 1 Introduction 1 1.1 Sample Target Workload . 2 1.2 The Goal of DeepDive . 2 1.3 Technical Contributions . 4 2 Preliminaries 6 2.1 Knowledge Base Construction (KBC) . 6 2.2 Factor Graphs .

Deepdive: a Data Management System for Automatic Knowledge Base Construction

Probabilistic Databases

Adaptive Schema Databases ∗

Finding Interesting Itemsets Using a Probabilistic Model for Binary Databases

Open-World Probabilistic Databases

A Modern History of Probability Theory

A Compositional Query Algebra for Second-Order Logic and Uncertain Databases

Answering Queries from Statistics and Probabilistic Views∗

Uva-DARE (Digital Academic Repository)

A Brief Overview of Probability Theory in Data Science by Geert

BAYESSTORE: Managing Large, Uncertain Data Repositories with Probabilistic Graphical Models

Maximum Entropy: the Universal Method for Inference

UC Berkeley UC Berkeley Electronic Theses and Dissertations