SCALABLE LEARNING AND INFERENCE IN LARGE KNOWLEDGE BASES

By YANG CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Yang Chen

To my parents and family

ACKNOWLEDGMENTS

I owe my sincere gratitude to Dr. Daisy Zhe Wang for her gracious and adept guidance toward my Ph.D. degree. Her broad knowledge, inspiring teaching, and insightful feedback have profoundly influenced my work. Her meticulous review of my research and pursuit of significance facilitated my publications in SIGMOD and VLDB. Learning to write clearly and precisely from Dr. Wang is an especially invaluable experience I am blessed to have. It is my great honor to work with Dr. Wang to expand the scope of human knowledge.

I also received immeasurable help from Dr. Alin Dobra. His passionate lectures and luminous ideas inspired me in many aspects of designing efficient and scalable data mining algorithms. Moreover, I would like to thank Dr. Milenko Petrovic and Dr. Micah H. Clark for their helpful discussions on query rewriting and optimization during my internship at the Florida Institute for Human and Machine Cognition. My research benefits from the machine learning and statistics courses taught by Dr. Anand Rangarajan and Dr. Kshitij Khare. I am thankful to them and Dr. Jih-Kwon Peir for serving on my Ph.D. committee and for their suggestions on my work.

It is my pleasure to work with many brilliant colleagues: Dr. Christan Grant, Dr. Kun Li, Dr. Clint P. George, Sean Goldberg, Yang Peng, Morteza Shahriari Nia, Miguel E. Rodríguez, Xiaofeng Zhou, and Dihong Gong. Furthermore, I owe special thanks to Soumitra Siddharth Johri for working with me day and night on Grokit. I am also delighted to have met Yu Cheng from the University of California, Merced at the SIGMOD'14 and SIGMOD'16 conferences and on the Google campus to learn about their extensions of Datapath and its applications in big data. I was lucky to work with Dr. Xiangyang Lan on the Mesa database and Dr. Sergey Melnik on the Spanner database during my internships at Google. The experience of working with great people on global-scale projects has broadened my horizons of database technology. It arouses a desire within me to combine science and technology to tackle real-world problems.

Finally, I would like to thank my parents for their love and support over the 27 years of my life. They are the endless power that encourages me forward.

My research is partially supported by the National Science Foundation under IIS Award 1526753, the Defense Advanced Research Projects Agency under Grant FA8750-12-2-0348-2 (DEFT/CUBISM), a generous gift from Google, and DSR Lab sponsors: Pivotal, UF Law School, SurveyMonkey, Amazon, Sandia National Laboratories, Harris, Patient-Centered Outcomes Research Institute, and UF Clinical and Translational Science Institute.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Knowledge Expansion
  1.2 Ontological Pathfinding
  1.3 Spreading Activation
  1.4 Contributions

2 PRELIMINARIES
  2.1 Markov Logic Networks
    2.1.1 Grounding
    2.1.2 Inference
  2.2 First-Order Mining
    2.2.1 The Scalability Challenge
    2.2.2 Scoring Metrics
  2.3 Spark Basics

3 KNOWLEDGE EXPANSION OVER PROBABILISTIC KNOWLEDGE BASES
  3.1 Probabilistic Knowledge Bases
  3.2 Probabilistic Knowledge Bases: A Relational Perspective
    3.2.1 First-Order Horn Clauses
    3.2.2 The Relational Model
      3.2.2.1 Classes, relations, and relationships
      3.2.2.2 MLN rules
      3.2.2.3 Factor graphs
    3.2.3 Grounding
    3.2.4 MPP Implementation
  3.3 Quality Control
    3.3.1 Semantic Constraints
    3.3.2 Ambiguity Detection
    3.3.3 Rule Cleaning
    3.3.4 Implementation
  3.4 Experiments
    3.4.1 Performance
      3.4.1.1 Case study: the Reverb-Sherlock KB
      3.4.1.2 Effect of batch rule application
      3.4.1.3 Effect of MPP parallelization
    3.4.2 Quality
      3.4.2.1 Overall results
      3.4.2.2 Effect of semantic constraints
      3.4.2.3 Effect of rule cleaning
  3.5 Summary

4 MINING FIRST-ORDER KNOWLEDGE BY ONTOLOGICAL PATHFINDING
  4.1 First-Order Mining Problem
    4.1.1 The Scalability Challenge
    4.1.2 Scoring Metrics
  4.2 Ontological Pathfinding
    4.2.1 Rule Construction
    4.2.2 Partitioning
    4.2.3 Rule Pruning
    4.2.4 Parallel Rule Mining
      4.2.4.1 General rules
      4.2.4.2 General confidence scores
    4.2.5 Analysis
      4.2.5.1 Parallel mining
      4.2.5.2 Partitioning
  4.3 Experiments
    4.3.1 Overall Result
    4.3.2 Effect of Parallelism
    4.3.3 Effect of Partitioning
    4.3.4 Effect of Rule Pruning
  4.4 Summary

5 SCALABLE KNOWLEDGE EXPANSION AND INFERENCE
  5.1 Parallel Inference
  5.2 Quality Analysis
  5.3 Inference Results
  5.4 Summary

6 QUERY PROCESSING WITH KNOWLEDGE ACTIVATION
  6.1 Spreading Activation
  6.2 Using SemMemDB
    6.2.1 Base-Level Activation Calculation
    6.2.2 Spreading Activation Calculation
    6.2.3 Activation Score Calculation
  6.3 Evaluation
    6.3.1 Data Set
    6.3.2 Performance Overview
    6.3.3 Effect of Semantic Network Sizes
    6.3.4 Effect of Query Sizes
  6.4 Summary

7 RELATED WORK

8 CONCLUSION AND FUTURE WORK
  8.1 Inductive Reasoning
  8.2 Online Inductive Reasoning
  8.3 Incremental Deductive Reasoning
  8.4 Abductive Reasoning
  8.5 Knowledge Verification
  8.6 Summary

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets
3-1 Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets
3-2 Sherlock-Reverb KB statistics
3-3 Tuffy-T and ProbKB systems performance: the first three rows report the running time for the relevant queries in minutes; the last row reports the size of the result table
3-4 Quality control parameters. SC and RC stand for semantic constraints and rule cleaning, respectively
4-1 Example KB schema from the YAGO knowledge base
4-2 Histogram for “wasBornIn” and “diedIn”
4-3 OP experiment setup
4-4 Overall mining result
4-5 Schema graphs and histograms
6-1 DBPedia data set statistics
6-2 Moby Thesaurus II data set statistics
6-3 Experiment 1 result
6-4 Experiment 2 semantic network sizes and avg. execution times for single-iteration queries of 1000 nodes
6-5 Experiment 3 avg. execution times and result sizes for single-iteration queries of varying sizes
8-1 PositionsHeld(Barack Obama, *) triples in …
8-2 Aggregated knowledge base of beliefs

LIST OF FIGURES

2-1 Ground factor graph
3-1 ProbKB system architecture
3-2 Knowledge expansion example
3-3 Query plans generated by Greenplum with (A) and without (B) optimization. The annotations show the durations of each operation in a sample run joining M3 and a synthetic TΠ with 10M records
3-4 Quality control for knowledge expansion
3-5 Knowledge expansion performance comparison
3-6 Overall result of quality control
4-1 Example schema closure graph. Dashed arrows indicate inherited edges
4-2 Candidate rules R1–R3 constructed by cycle detection from Example 4.3. The first and last nodes in R1–R3 denote the same start and end node in the cycle
4-3 Partitioning algorithm: KB partitioned into smaller overlapping parts running independent mining algorithm instances
4-4 Rule table M, initial partitions M1, M2, and unpartitioned rule r
4-5 Parallel rule mining: KB divided into groups by join variables, each group running Group-Join to apply inference rules
4-6 Example Freebase rules
4-7 OP overall result on YAGO2s and Freebase
4-8 Sizes and runtime of Freebase partitions
4-9 Effect of partitioning and pruning
4-10 Example rules violating functional constraints
5-1 Knowledge expansion example
5-2 Cross validation: the knowledge base is partitioned into training and testing sets. The Ontological Pathfinding and parallel inference algorithms run on the training and test sets, respectively, inferred facts to be verified in the input KB
5-3 Inference performance
5-4 Cross validation result and example inferred facts
6-1 SemMemDB usage with DBpedia knowledge base
6-2 SemMemDB query plans

LIST OF ALGORITHMS

3-1 Grounding(TΠ, M1, ..., Mk)
4-1 Ontological-Pathfinding(Γ, s, m, t)
4-2 Closure(G = (V, E), v)
4-3 Recursive-Partition(Γ, Π, M, s, m)
4-4 Binary-Partition(Γ, M)
4-5 Parallel-Rule-Mining(facts, rules)
4-6 Group-Join(obj, ps = {pred, sub}, rules)
4-7 Check(rs = {rule.ID})
4-8 General-Rule-Mining(facts, rules)
4-9 PCA-Group-Join-Last(ji, fk, zi, rules)
4-10 PCA-Check(rs = {(y, ṙ)})
5-1 Infer(Γ, M, s, m, N)
5-2 Parallel-Inference(facts, rules)

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SCALABLE LEARNING AND INFERENCE IN LARGE KNOWLEDGE BASES

By Yang Chen

December 2016
Chair: Daisy Zhe Wang
Major: Computer Engineering

Recent years have seen escalating efforts in the construction of web-scale knowledge bases (e.g., DBPedia, DeepDive, Freebase, Google Knowledge Graph, NELL, OpenIE, ProBase, YAGO). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, current knowledge bases are far from complete. To infer the missing knowledge, we propose the knowledge expansion and ontological pathfinding algorithms. The knowledge expansion algorithm applies first-order inference rules to infer facts from an incomplete knowledge base; the ontological pathfinding algorithm mines first-order inference rules from the knowledge bases. The knowledge expansion and ontological pathfinding algorithms form the core components of a probabilistic knowledge base system, ProbKB.

The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) We implement ProbKB on massive parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

We support knowledge queries with spreading activation, a popular way of simulating human memory in semantic networks. We design a relational model for semantic networks and an efficient SQL-based spreading activation algorithm. We leverage the mature query engines and optimizers that generate efficient query plans for memory activation and retrieval. Our system supports human-scale memories with the massive storage capacity provided by modern database systems. We evaluate the spreading activation queries in a comprehensive experimental study using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works.

Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge by scalable learning and inference. We validate ProbKB's effectiveness with web knowledge bases including Freebase and YAGO. For future work, we propose to extend the previous contributions to dynamic knowledge bases and data streams and to support other types of automatic reasoning, including abductive and defeasible reasoning.

CHAPTER 1
INTRODUCTION

Recent developments in information extraction and data management systems have spurred escalating efforts in constructing large knowledge bases (KBs). These knowledge bases store information in a structured format, allowing for efficient processing and querying. Examples of these knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], Knowledge Vault [5], NELL [6, 7], OpenIE [8, 9], ProBase [10], ProbKB [11–14], and YAGO [15, 16]. They store structured information about real-world people, places, organizations, etc., paving the way for the semantic web [17] and semantic search [18] movements that revolutionize keyword matching for search. Moreover, these knowledge bases have been used for data cleaning [19, 20], [21], data mining [11, 12, 22], and multi-modal search [23]. To support these applications, researchers use various methods to construct knowledge bases at scale: human crafting (DBpedia, Freebase), information extraction (DeepDive, OpenIE, Probase), reasoning and inference [13, 22, 24], knowledge fusion [5, 25, 26], or a combination of them (NELL).

Despite the escalating efforts in automatic knowledge base construction, current knowledge bases are still incomplete or uncertain due to limitations of human knowledge or the probabilistic nature of information extraction algorithms. For example, the Wikipedia pages state that Kale is rich in calcium and that calcium helps prevent osteoporosis, but we need to infer that Kale helps prevent osteoporosis. In this dissertation, we study the problem of first-order learning and inference in large knowledge bases to derive implicit facts in web-scale knowledge bases using inference rules. An inference rule is a first-order Horn clause that discovers implicit facts. As an example, the following rule expands knowledge of the health properties of vegetables:

contains(x, z), preventsDisease(z, y) → preventsDisease(x, y).

We propose the knowledge expansion algorithm to apply batches of inference rules and the ontological pathfinding algorithm to mine inference rules from knowledge bases. The algorithms scale learning and inference to Freebase, the largest public knowledge base. The mining and inference algorithms are core components of an ongoing probabilistic knowledge base system project, Archimedes [27].

1.1 Knowledge Expansion

To efficiently support knowledge expansion, we design a relational model for probabilistic knowledge bases, allowing an efficient SQL-based inference algorithm that applies inference rules in batches. Our approach is motivated by two observations: 1) Inference rules can be modeled in relational databases as a first-class citizen rather than stored in ordinary files; 2) We can use join queries to apply inference rules in batches, rather than one query per rule. Using the relational knowledge base model, the inference algorithm is expressed as SQL queries that operate on the facts and rules tables, applying all rules in one table at a time. For Freebase rules, the number of queries in each iteration is reduced from 36,625 to 6 (depending on rule structures). Our approach improves performance by more than 200 times compared to the state-of-the-art engine, Tuffy [28], when we have a large number of inference rules. We achieve another speedup of 6.3× by leveraging the shared-nothing massive parallel processing (MPP) architecture and general MPP optimizations to maximize data collocation, independence, and parallelism.

We support uncertainty using Markov logic networks (MLNs) [29], the standard model to represent uncertain facts and rules. We perform two steps in MLN inference:

• grounding: constructing a ground factor graph that encodes the probability distribution of all observed and inferred facts; and

• marginal inference: computing the marginal distribution for individual facts.

The state-of-the-art MLN inference engine, Tuffy, uses a relational database management system (DBMS) and achieves a significant speed-up for grounding compared to an earlier inference engine, Alchemy [30]. Despite the improvement, Tuffy does not have satisfactory performance on the Reverb-Sherlock knowledge base since it has a large number of rules (30,912): Tuffy uses as many as 30,912 SQL queries to apply them all in each iteration.

ProbKB improves performance by modeling the rules in 6 relational tables and using 6 SQL queries to apply the rules in batches. Furthermore, all existing MLN implementations are designed to work with small, clean MLN programs carefully crafted by humans. Thus, they are prone to the inaccuracies and errors of machine-constructed MLNs and have no mechanism to detect and recover from errors. To handle these cases, we combine semantic constraints, ambiguity detection, and rule cleaning to prevent errors from propagating along the inference chain. As a result, we increase the precision by 0.61.

1.2 Ontological Pathfinding

Mining Horn clauses has been studied extensively in inductive logic programming. However, today's knowledge bases pose several new challenges. First, knowledge bases are often prohibitively large. For example, as of this writing, Freebase has 112 million entities and 388 million facts. None of the existing rule mining algorithms efficiently supports knowledge bases of this size. Second, knowledge bases implement the open world assumption, implying that we have only positive examples for rule mining. To address these challenges, a number of new approaches have been proposed: Sherlock [24], AMIE+ [22], Markov logic structure learning [31, 32], etc. Still, new techniques need to be invented to scale state-of-the-art approaches to knowledge bases of billions of facts.

In this dissertation, we propose the Ontological Pathfinding algorithm (OP) to tackle the large-scale rule mining problem. We focus on scalability and design a series of parallelization and optimization techniques to achieve web scale. Following the relational knowledge base model [13], we store inference rules in relational tables and use join queries to apply them in batches. The relational approach outperforms state-of-the-art algorithms by orders of magnitude on medium-sized knowledge bases [13]. To scale to larger knowledge bases, we parallelize the mining algorithm by dividing the input knowledge base into smaller groups running parallel in-memory joins. The parallel mining algorithm can be implemented on

state-of-the-art cluster computing frameworks to achieve maximum utilization of available computation resources. Furthermore, even if we parallelize the mining algorithm, the parallel tasks depend on each other. In particular, the tasks need to shuffle data between stages. As knowledge bases expand in scale, shuffling becomes the bottleneck of the computation. This shuffling bottleneck motivates us to introduce another layer of partitioning on top of the parallel computation: a partitioning scheme that divides the mining task into smaller independent sub-tasks. Each partition still runs the same parallel mining algorithm as before, but on a smaller input. Since the partitions are independent of one another, their results are unioned at the end; no data exchange occurs during computation. Our experiments show that we accomplish within 34 hours a Freebase mining task that does not finish in 5 days without partitioning.

One major performance bottleneck is caused by large degrees of join variables in the inference rules. Applying these rules in the mining process generates large intermediate results, enumerating all possible pair-wise relationships of the joined instances. As a result, these rules are often of low quality. Based on this observation, we use non-functionality as an empirical indication of inefficiency and inaccuracy. In our experiments, we determine a reasonable functional constraint and show that 99% of the rules violating this constraint turn out to be false. Removing those rules reduces runtime by more than 5 hours for a single mining task.

Combining our approaches, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

1.3 Spreading Activation

We design a query execution engine for spreading activation queries. Semantic networks are broadly applicable to associative information retrieval tasks [33], though we are principally motivated by the popularity of semantic networks and spreading activation to simulate aspects of human memory in cognitive architectures, specifically ACT-R [34, 35]. Insofar as cognitive architectures aim toward codification of unified theories of cognition and full-scale simulation of artificial humans, they must ultimately support human-scale memories, which at present they do not. We are also motivated by the desire for a scalable, standalone cognitive model of human memory free from the architectural and theoretical commitments of a complete cognitive architecture.

Our position is that human-scale associative memory can best be achieved by leveraging the extensive investments and continuing advancements in structured databases and big data systems. For example, relational databases already provide effective means to manage and query massive structured data, and their commonly supported operations, such as grouping and aggregation, are sufficient and well-suited for efficient implementation of spreading activation. To defend this position, we design a relational data model for semantic networks and an efficient SQL-based, in-database implementation of network activation (i.e., SemMemDB). The main benefits of SemMemDB and our in-database approach are: (1) It exploits query optimizers and execution engines that dynamically generate efficient execution plans for activation and retrieval queries, which is far better than manually implementing a particular fixed algorithm. (2) It uses database technology for both storage and computation, thus avoiding the complexity and communication overhead incurred by employing separate modules for storage versus computation. (3) It implements spreading activation in SQL, a widely-used query language for big data which is supported by various analytics frameworks, including traditional databases (e.g., PostgreSQL), massive parallel processing (MPP) databases (e.g., Greenplum [36]), the MapReduce stack (e.g., Hive), etc.

We evaluate SemMemDB using DBPedia [1], a web-scale ontology constructed from the Wikipedia corpus. Our experiment results show several orders of magnitude of improvement in execution time in comparison to results reported in related work.

1.4 Contributions

In summary, we make the following contributions to tackle the scalable learning and inference problem in web knowledge bases.

Knowledge Expansion. We design a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. We optimize relational knowledge bases on massive parallel processing databases to achieve further scalability. We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

Ontological Pathfinding. We design the ontological pathfinding algorithm that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

Spreading Activation. We design the SemMemDB system for spreading activation query processing over semantic and knowledge networks. We use the relational model for semantic networks and present an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. SemMemDB leverages mature query engines and optimizers from databases that generate efficient query plans for memory activation and retrieval. With the massive storage capacity supported by modern database systems, SemMemDB supports human-scale memories. We evaluate SemMemDB using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that SemMemDB runs more than 500 times faster than prior works.

Dissertation outline. The remainder of this dissertation is organized as follows. Chapter 2 describes background in Markov logic, factor graphs, and first-order rule mining. Chapter 3 describes our relational approach to knowledge expansion. Chapter 4 describes the ontological pathfinding algorithm to mine first-order knowledge from web-scale knowledge bases. Chapter 5 scales up the knowledge expansion algorithm by parallelization and partitioning. Chapter 6 describes SemMemDB for spreading activation query processing. Chapter 7 describes related work. Chapter 8 concludes the dissertation and discusses future work on various types of knowledge reasoning.

CHAPTER 2
PRELIMINARIES

2.1 Markov Logic Networks

Markov logic networks are a mathematical model to represent uncertain facts and rules. We use MLNs to model probabilistic knowledge bases constructed by IE systems. Essentially, an MLN is a set of weighted first-order formulae {(F_i, W_i)}, with each weight W_i indicating how likely formula F_i is to be true. In Table 2-1, the Π and L columns form an example MLN (these notions will be formally defined in Section 3.1). In this example, the MLN clauses

0.96 born in(Ruth Gruber, New York City) (2–1)

1.40 ∀x ∈ W, ∀y ∈ P : live in(x, y) ← born in(x, y) (2–2)

state a fact that Ruth Gruber was born in New York City and a rule that if a writer x is born in an area y, then x lives in y. However, neither statement definitely holds. The weights 0.96 and 1.40 specify how strong they are; stronger rules are less likely to be violated. They are both part of an MLN, but with different purposes: (2–1) states a fact, and (2–2) supplies an inference rule. Thus, we treat them separately. This distinction will become clearer when we formally define probabilistic knowledge bases in Section 3.1. MLNs also allow hard rules that must never be violated. These rules have weight ∞. For example, the rule from Table 2-1

∞ ∀x ∈ C, ∀y ∈ C, ∀z ∈ W : (born in(z, x) ∧ born in(z, y) → x = y) (2–3)

says that a writer is not allowed to be born in two different cities. In our work, hard rules are used for quality control: facts violating hard rules are considered errors and are removed to avoid further propagation.
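To make the effect of an infinite weight concrete, the following short calculation (ours, not part of the original text) applies the MLN distribution introduced later in Equation (2–4). For two worlds x and x′ that differ only in whether a hard rule with weight W is satisfied,

\[
\frac{P(\mathbf{x})}{P(\mathbf{x}')} = \exp\big( W \, (n(\mathbf{x}) - n(\mathbf{x}')) \big) \longrightarrow \infty \quad \text{as } W \to \infty,
\]

so any world violating a hard rule receives vanishing probability mass, which is why such facts can simply be deleted.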

2.1.1 Grounding

An MLN can be viewed as a template for constructing ground factor graphs. A factor graph is a set of factors Φ = {φ_1, ..., φ_N}, where each factor φ_i is a function φ_i(X_i) over a random vector X_i, indicating the causal relationships among the random variables in X_i.

Table 2-1. Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets.

Entities E: Ruth Gruber, New York City, Brooklyn
Classes C: W (Writer) = {Ruth Gruber}, C (City) = {New York City}, P (Place) = {Brooklyn}
Relations R: born in(W, P), born in(W, C), live in(W, P), live in(W, C), locate in(P, C)

Facts Π:
0.96 born in(Ruth Gruber, New York City)
0.93 born in(Ruth Gruber, Brooklyn)

Rules L:
1.40 ∀x ∈ W ∀y ∈ P (live in(x, y) ← born in(x, y))
1.53 ∀x ∈ W ∀y ∈ C (live in(x, y) ← born in(x, y))
0.32 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← live in(z, x) ∧ live in(z, y))
0.52 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← born in(z, x) ∧ born in(z, y))
∞ ∀x ∈ C ∀y ∈ C ∀z ∈ W (born in(z, x) ∧ born in(z, y) → x = y)

Figure 2-1. Ground factor graph over the ground atoms:
1. born in(Ruth Gruber, New York City)
2. born in(Ruth Gruber, Brooklyn)
3. live in(Ruth Gruber, New York City)
4. live in(Ruth Gruber, Brooklyn)
5. located in(Brooklyn, New York City)

These factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factors. Figure 2-1 shows an example factor graph, where each factor φ_i(X_i) is represented by a square and its variables X_i by its neighboring circles. The values of φ_i are omitted from the figure.

Given an MLN and a set of typed entities, the process of constructing a factor graph is called grounding, and we refer to the resulting factor graph as a ground factor graph. For the entities in Table 2-1 (column E) and their associated types (column C), we create a random vector X containing one binary random variable for each possible grounding of the predicates appearing in Π and L (column R). The random variables created this way are also called ground atoms. Each ground atom has a value of 0 or 1 indicating its truth assignment. In this example, we have X = {X_1, X_2, X_3, X_4, X_5}, listed on the right of Figure 2-1.

For each possible grounding of formula F_i, we create a ground factor φ_i(X_i) which has a value of e^{W_i} if the ground formula is true, or 1 otherwise. For instance, from the rule “0.32 located in(Brooklyn, New York City) ← live in(Ruth Gruber, Brooklyn) ∧ live in(Ruth Gruber, New York City),” we have a factor φ(X_3, X_4, X_5) defined as follows:

\[
\phi(X_3, X_4, X_5) =
\begin{cases}
1 & \text{if } (X_3, X_4, X_5) = (1, 1, 0) \\
e^{0.32} & \text{otherwise.}
\end{cases}
\]

The other factors are defined similarly. According to this definition, the ground factors can be specified by their variables and weights (X_i, W_i). We will utilize this fact in Section 3.2.2.3. The resulting factor graph is shown in Figure 2-1.
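The following minimal sketch (our illustration, not ProbKB's actual implementation) grounds rule (2–2) against the typed entities of Table 2-1, producing the ground atoms and binary factors described above:

```python
# Grounding sketch for rule (2-2): live_in(x, y) <- born_in(x, y), W = 1.40.
# Each (writer, place) pair yields two ground atoms and one factor whose
# value is e^W when the ground clause holds and 1 when it is violated.
import math
from itertools import product

writers = ["Ruth Gruber"]
places = ["New York City", "Brooklyn"]
W = 1.40

atoms = {}  # (predicate, subject, object) -> variable index

def atom(pred, x, y):
    return atoms.setdefault((pred, x, y), len(atoms))

factors = []  # ((head_var, body_var), weight)
for x, y in product(writers, places):
    factors.append(((atom("live_in", x, y), atom("born_in", x, y)), W))

def factor_value(assignment, variables, w):
    head, body = variables
    # A Horn clause head <- body is false only when body = 1 and head = 0.
    return 1.0 if assignment[body] == 1 and assignment[head] == 0 else math.exp(w)
```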

2.1.2 Inference

In a factor graph Φ = {φ_1, ..., φ_N}, the factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factor graph:

\[
P(X = x) = \frac{1}{Z} \prod_i \phi_i(X_i) = \frac{1}{Z} \exp\Big( \sum_i W_i \, n_i(x) \Big), \tag{2–4}
\]

where n_i(x) is the number of true groundings of rule F_i in x, W_i is its weight, and Z is the partition function, i.e., the normalization constant. In ProbKB, we are interested in computing P(X = x), the marginal distribution defined by (2–4). This is called marginal inference in the probabilistic graphical models literature. The other inference type is maximum a posteriori (MAP) inference, in which we find the most likely possible world. ProbKB currently uses marginal inference so that we can store all the inferred results in the knowledge base, thereby avoiding query-time computation and improving system responsiveness.
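For intuition only, Equation (2–4) can be evaluated by brute force on a toy graph. The sketch below (ours; real systems use sampling engines instead, since enumeration is exponential in the number of atoms) reuses the factors and factor_value from the sketch in Section 2.1.1:

```python
# Brute-force marginal inference: P(X_i = 1) is the normalized mass of all
# worlds with X_i = 1, where a world's mass is the product of its factor
# values -- exactly Equation (2-4).
from itertools import product

def marginals(num_vars, factors, factor_value):
    Z = 0.0
    mass = [0.0] * num_vars
    for world in product([0, 1], repeat=num_vars):
        weight = 1.0
        for variables, w in factors:
            weight *= factor_value(world, variables, w)
        Z += weight
        for i, v in enumerate(world):
            mass[i] += weight * v
    return [m / Z for m in mass]  # marginal probability of each ground atom

# Usage with the previous sketch: print(marginals(len(atoms), factors, factor_value))
```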

2.2 First-Order Mining

We study the problem of mining first-order inference rules from web-scale knowledge bases: Given a knowledge base of (subject, predicate, object) (or (s, p, o)) triples, we mine first-order Horn clauses of the form

\[
(w,\; B \rightarrow H(x, y)), \tag{2–5}
\]

where the body B = ⋀_i B_i(·, ·) is a conjunction of predicates, H is the head predicate, and w is a scoring metric reflecting the likelihood of the rule being true. As in AMIE [37] and other ILP systems [38, 39], we use a language bias and assume the Horn clauses to be connected and closed. Two atoms are connected if they share a variable. A rule is connected if every atom is connected transitively to every other atom in the rule. A rule is closed if every variable appears at least twice in different predicates. This assumption ensures that the rules do not contain unrelated atoms or variables.
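As a concrete reading of the language bias, the following check (our sketch; the predicate and variable names are illustrative) accepts exactly the connected, closed candidate rules:

```python
# Language-bias checks: a rule is a list of atoms (predicate, var1, var2),
# head first. Connectivity is a graph search over shared variables;
# closedness requires every variable to occur at least twice.
def is_connected(atoms):
    if not atoms:
        return True
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(atoms)):
            if j not in seen and set(atoms[i][1:]) & set(atoms[j][1:]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(atoms)

def is_closed(atoms):
    counts = {}
    for _, x, y in atoms:
        counts[x] = counts.get(x, 0) + 1
        counts[y] = counts.get(y, 0) + 1
    return all(c >= 2 for c in counts.values())

# contains(x, z), preventsDisease(z, y) -> preventsDisease(x, y):
rule = [("preventsDisease", "x", "y"),
        ("contains", "x", "z"),
        ("preventsDisease", "z", "y")]
assert is_connected(rule) and is_closed(rule)
```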

2.2.1 The Scalability Challenge

The primary focus of this dissertation is scalable mining. Previous approaches [24, 37] scale to knowledge bases of 1 million facts, but no existing work mines inference rules from the 112 million entities and 388 million facts of Freebase. We investigate a series of parallelization and optimization techniques to achieve this scale, and we study how to use state-of-the-art data processing systems, e.g., Spark, to efficiently implement the parallel algorithms.

The first-order mining problem has similarities with association rule mining in transaction databases [37], but they are essentially different. In first-order rules, the atoms are parameterized predicates. Each parameterized predicate can be grounded to a set of ground atoms. Depending on the size of the knowledge base, each rule can have a large number of possible ground instances. This makes mining first-order knowledge more challenging than mining traditional association rules in transaction databases.

2.2.2 Scoring Metrics

We review the support and confidence metrics for first-order Horn clauses. They have counterparts in association rule mining in transaction databases, but are different from them as discussed above.

Support. The support of a rule is defined to be the number of distinct pairs of subject and object in the head over all instantiations that appear in the knowledge base:

\[
\mathrm{supp}(B \rightarrow H(x, y)) := \#(x, y) : \exists z_1, \ldots, z_m : B \wedge H(x, y). \tag{2–6}
\]

In Equation (2–6), B and H denote the body and head, respectively, and z_1, ..., z_m are the variables of the rule in addition to x and y.

Confidence. The confidence of a rule is defined to be the ratio of its predictions that are in the knowledge base:

\[
\mathrm{conf}(B \rightarrow H(x, y)) := \frac{\mathrm{supp}(B \rightarrow H(x, y))}{\#(x, y) : \exists z_1, \ldots, z_m : B}. \tag{2–7}
\]

Our framework supports other scoring functions introduced in [24, 37]. For example, the PCA confidence of a rule is defined to be the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts p(x, y) such that ∃y′ : p(x, y′) ∈ Γ:

\[
\mathrm{PCA\ conf}(H(x, y) \leftarrow B) := \frac{\mathrm{supp}(H(x, y) \leftarrow B)}{|\{H(x, y) \mid \exists y' : B(x, y) \wedge H(x, y') \in \Gamma\}|}. \tag{2–8}
\]

For each rule, we compute its support and confidence and set w = (supp, conf) in (2–5). The support and confidence metrics together indicate the quality of a rule.
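For a length-2 body B = q(x, z) ∧ r(z, y), Equations (2–6) and (2–7) reduce to counting joins. A small sketch of that computation (ours, with the Kale example as toy data) follows:

```python
# Support = distinct (x, y) head pairs derivable from the body that appear
# in the KB; confidence = support / number of derivable (x, y) pairs.
def support_and_confidence(facts, head, q, r):
    by_pred = {}
    for s, p, o in facts:
        by_pred.setdefault(p, set()).add((s, o))
    head_pairs = by_pred.get(head, set())
    predictions = {(x, y)
                   for x, z in by_pred.get(q, set())
                   for z2, y in by_pred.get(r, set())
                   if z == z2}
    supp = len(predictions & head_pairs)
    conf = supp / len(predictions) if predictions else 0.0
    return supp, conf

facts = [("Kale", "contains", "calcium"),
         ("calcium", "preventsDisease", "osteoporosis"),
         ("Kale", "preventsDisease", "osteoporosis")]
# contains(x, z), preventsDisease(z, y) -> preventsDisease(x, y)
print(support_and_confidence(facts, "preventsDisease",
                             "contains", "preventsDisease"))  # (1, 1.0)
```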

2.3 Spark Basics

Spark is a cluster computing framework. Based on its core idea of resilient distributed datasets (RDDs), read-only collections of objects partitioned across a set of machines, it defines a set of parallel operations. Using these operations, Spark allows users to express a rich set of computation tasks. The operations we use in this dissertation are listed below:

• map/flatMap Transforms an RDD to a new RDD by applying a function to each element in the input RDD.

• groupByKey Transforms an RDD to a new RDD by grouping by a user-specified key. In the result RDD, each key is mapped to a list of values of the key.

• reduceByKey Transforms an RDD to a new RDD by grouping by a user-specified key and applying a reduce function to the values of each key. In the result RDD, each key is mapped to the result value of the reduce function.

26 In our parallel rule mining algorithm, we represent the set of facts {(s, p, o)} and the set of rules {(h, b1, b2)} as two RDDs and express the algorithm using the above parallel operations.
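A minimal sketch of this layout, assuming a local PySpark installation (the tiny data and the role encoding are ours, for illustration): facts are keyed by the join variable z of a body b1(x, z), b2(z, y), grouped, and paired up inside each group.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rule-mining-sketch")

facts = sc.parallelize([
    ("Kale", "contains", "calcium"),
    ("calcium", "preventsDisease", "osteoporosis"),
])

# As b1(x, z) a fact binds z = its object; as b2(z, y) it binds z = its subject.
as_b1 = facts.map(lambda f: (f[2], ("b1", f[1], f[0])))
as_b2 = facts.map(lambda f: (f[0], ("b2", f[1], f[2])))

def pair_up(kv):
    z, vals = kv
    lhs = [(p, x) for role, p, x in vals if role == "b1"]
    rhs = [(p, y) for role, p, y in vals if role == "b2"]
    # one candidate body instantiation per (b1, b2) pair sharing z
    return [((p1, p2, x, y), 1) for p1, x in lhs for p2, y in rhs]

candidates = (as_b1.union(as_b2)
                   .groupByKey()
                   .flatMap(pair_up)
                   .reduceByKey(lambda a, b: a + b))
print(candidates.collect())
# [(('contains', 'preventsDisease', 'Kale', 'osteoporosis'), 1)]
```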

CHAPTER 3
KNOWLEDGE EXPANSION OVER PROBABILISTIC KNOWLEDGE BASES

With the exponential growth in machine learning, statistical inference, and big-data analytics frameworks, recent years have seen tremendous research interest in information extraction (IE) and knowledge base construction. A knowledge base stores entities and their relationships in a machine-readable format to help computers understand human information and queries. Example knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], NELL [7], OpenIE [8, 9], ProBase [10], and YAGO [15, 40, 41]. However, these knowledge bases are often incomplete or uncertain due to limitations of human knowledge or the probabilistic nature of extraction algorithms. Thus, it is often desirable to infer missing facts in a scalable, probabilistic manner [42]. For example, if the Wikipedia pages state that Kale is rich in calcium and that calcium helps prevent osteoporosis, then we can infer that Kale helps prevent osteoporosis. To facilitate such inference tasks, Sherlock [24] learns 30,912 uncertain Horn clauses from web extractions.

In this chapter, we study the problem of expanding probabilistic knowledge bases using first-order inference. We develop an efficient inference engine by modeling probabilistic knowledge bases as relational tables and, consequently, implement the inference algorithm as a limited number of joins. We use one particular class of inference rules, the semantic constraints, to detect incorrect facts and ambiguous entities. Experiments show promising results in terms of both performance and quality. This chapter addresses the problem of applying inference rules; Chapter 4 addresses the related problem of mining first-order inference rules.

The standard model for working with uncertain facts and rules is Markov logic networks (MLNs) [29]. To perform the MLN marginal inference task, we take two steps: 1) grounding, which constructs a ground factor graph that encodes the probability distribution of all observed and inferred facts; and 2) marginal inference, which computes the marginal distribution for each individual fact. To efficiently support these tasks, the state-of-the-art system Tuffy [28] uses a relational database management system (DBMS) and demonstrates a significant speed-up in the grounding phase compared to an earlier implementation, Alchemy [30]. Despite that, Tuffy does not have satisfactory performance on the Reverb and Sherlock datasets since they have a large number of rules (30,912), and Tuffy uses as many as 30,912 SQL queries to apply them all in each iteration. Furthermore, all existing MLN implementations are designed to work with small, clean MLN programs carefully crafted by humans. Thus, they are often prone to the inaccuracies and errors of machine-constructed MLNs and have no mechanism to detect and recover from errors.

To improve efficiency and accuracy, we present the knowledge expansion algorithm. Our main contribution is a formal definition and relational model for probabilistic knowledge bases, which allows an efficient SQL-based grounding algorithm that applies MLN rules in batches. Our work is motivated by two observations: 1) MLNs can be modeled in DBMSs as a first-class citizen rather than stored in ordinary files; 2) We can use join queries to apply the MLN rules in batches, rather than one query per rule. In this way, the grounding algorithm can be expressed as SQL queries that operate on the facts and MLN tables, applying all rules in one MLN table at a time. For the Sherlock rules, the number of queries in each iteration is reduced from 30,912 to 6 (depending on rule structures). Our approach greatly improves performance, especially when the number of rules is large. We achieve further efficiency by using a shared-nothing massive parallel processing (MPP) database and general MPP optimizations to maximize data collocation, independence, and parallelism.

Another important goal is to maintain a high-quality knowledge base. Extracted facts and rules are sometimes inaccurate, but existing MLN systems are designed to work with small, clean MLNs and are prone to noisy data. To handle errors, we combine several strategies, including semantic constraints, ambiguity detection, and rule cleaning. As a result, we increase the precision by 0.61.

Figure 3-1. ProbKB system architecture. (Components: the MLN, entities, and facts stored in an RDBMS; SQL UDFs/UDAs executed through the query optimizer and execution engine; the resulting factor graph passed to an inference engine, e.g., GraphLab.)

To summarize, we make the following contributions:

• We present an efficient knowledge expansion algorithm. We introduce a formal definition and a relational model for probabilistic knowledge bases and design a novel inference algorithm for knowledge expansion that applies inference rules in batches.

• We implement and evaluate ProbKB on an MPP DBMS, Greenplum. We investigate important optimizations to maximize data collocation, independence, and parallelism.

• We combine several methods, including rule cleaning and semantic constraints, to detect erroneous rules, facts, and ambiguous entities, effectively preventing them from propagating in the inference process.

• We conduct a comprehensive experimental evaluation on real and synthetic knowledge bases. We show ProbKB performs orders of magnitude faster than previous works and has much higher quality.

Figure 3-1 shows the ProbKB system architecture. The knowledge base (MLN, entities, facts) is stored in database tables. This relational representation allows an efficient SQL-based grounding algorithm, which is written in user-defined functions/aggregates (UDFs/UDAs) and stored inside the database. During grounding, the database optimizes and executes the stored procedures and generates a factor graph in relational format. Existing inference engines, e.g., Gibbs sampling [43], GraphLab [44], can be used to perform probabilistic inference over the resulting factor graph.

3.1 Probabilistic Knowledge Bases

Based on the syntax and semantics of MLNs and the schemas used by state-of-the-art IE and knowledge base systems, we formally define a probabilistic knowledge base as follows:

Definition 3.1. A probabilistic knowledge base (KB) is a 5-tuple Γ = (E, C, R, Π, L), where

• E = {e_1, ..., e_|E|} is a set of entities. Each entity e ∈ E refers to a real-world object.

• C = {C_1, ..., C_|C|} is a set of classes (or types). Each class C ∈ C is a subset of E: C ⊆ E.

• R = {R_1, ..., R_|R|} is a set of relations. Each R ∈ R defines a binary relation on C_i, C_j ∈ C: R ⊆ C_i × C_j. We call C_i, C_j the domain and range of R and use R(C_i, C_j) to denote the relation and its domain and range.

• Π = {(r_1, w_1), ..., (r_|Π|, w_|Π|)} is a set of weighted facts (or relationships). For each (r, w) ∈ Π, r is a tuple (R, x, y), where R(C_i, C_j) ∈ R, x ∈ C_i ∈ C, y ∈ C_j ∈ C, and (x, y) ∈ R; w ∈ ℝ is a weight indicating how likely r is true. We also use R(x, y) to denote the tuple (R, x, y).

• L = {(F_1, W_1), ..., (F_|L|, W_|L|)} is a set of weighted clauses (or rules). It defines a Markov logic network. For each (F, W) ∈ L, F is a first-order logic clause, and W ∈ ℝ is a weight indicating how likely formula F holds.

Remark 3.2. The arguments of relations, relationships, and rules are constrained to certain classes, i.e., they are inherently typed. The definition of C implies a class hierarchy: for any C_i, C_j ∈ C, C_i is a subclass of C_j if and only if C_i ⊆ C_j. Typing provides semantic context for extracted entities and is commonly adopted by recent IE systems, so we make it an integral part of the definition.

Remark 3.3. The weights w’s and W ’s above are allowed to take the values of ±∞, meaning that the corresponding facts or rules are definite (or impossible) to hold. We treat them as semantic constraints and discuss them in detail in Section 3.3.1. When we need to distinguish the sets of deductive inference rules and constraints, we denote them by H and Ω, respectively. We also use L = (H,Ω), or Γ = (E,C,R,Π,H,Ω), to emphasize this distinction.

Table 3-1. Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets.

Entities E: Ruth Gruber, New York City, Brooklyn
Classes C: W (Writer) = {Ruth Gruber}, C (City) = {New York City}, P (Place) = {Brooklyn}
Relations R: born in(W, P), born in(W, C), live in(W, P), live in(W, C), locate in(P, C)

Facts Π:
0.96 born in(Ruth Gruber, New York City)
0.93 born in(Ruth Gruber, Brooklyn)

Rules L:
1.40 ∀x ∈ W ∀y ∈ P (live in(x, y) ← born in(x, y))
1.53 ∀x ∈ W ∀y ∈ C (live in(x, y) ← born in(x, y))
0.32 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← live in(z, x) ∧ live in(z, y))
0.52 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← born in(z, x) ∧ born in(z, y))
∞ ∀x ∈ C ∀y ∈ C ∀z ∈ W (born in(z, x) ∧ born in(z, y) → x = y)

Example 3.4. Table 2-1 (reproduced in Table 3-1) shows an example knowledge base constructed using Reverb Wikipedia extractions and Sherlock rules. This is the primary dataset we use to construct and evaluate our knowledge base, and we will refer to this dataset as the Reverb-Sherlock KB hereafter.

Problem Description. We focus on two challenges in grounding probabilistic KBs:

• Improving grounding efficiency using a relational DBMS; specifically, we seek ways to apply inference rules in batches using SQL queries.

• Identifying and recovering from errors in the grounding process, which prevents them from propagating along the inference chain.

3.2 Probabilistic Knowledge Bases: A Relational Perspective

This central section describes our database approach to achieving efficiency. We first describe the relational model for each of the components E, C, R, Π, H. Ω is related to quality control and is fully discussed in Section 3.3. Then we present the grounding algorithm and explain how it achieves efficiency. Finally, we describe the techniques we use for tuning and optimizing the implementation over Greenplum, an MPP database.

3.2.1 First-Order Horn Clauses

Though Markov logic supports general first-order formulae, we confine H to the set of Horn clauses. A first-order Horn clause is a clause with at most one positive literal [45]:

p, q, . . . , t → u.

This may limit the expressiveness of individual rules, but due to the scope and scale of the Sherlock rule set, we are still able to infer many facts. Horn clauses give us a number of additional benefits:

• Horn clauses have simpler structures, allowing us to effectively model them in relational tables and design SQL-based inference algorithms.

• Learning Horn clauses has been studied extensively in the inductive logic programming literature [46, 47] and has recently been adapted to text extractions (Sherlock [24]). There is work on general MLN structure learning [32], but it has not been employed at a large scale.

3.2.2 The Relational Model

We first introduce our notation. For each KB element X ∈ {C, R, Π, H}, we denote the corresponding database relation (table) by T_X. Thus, T_X is by definition a set of tuples. In our implementation, we also use dictionary tables D_X, where X ∈ {E, C, R}, to map string representations of KB elements to integer IDs to avoid string comparison during joins and selections.

3.2.2.1 Classes, relations, and relationships

The database definitions for TC, TR, TΠ follow from their mathematical definitions:

Definition 3.5. TC is defined to be the set of tuples {(C, e)} for all pairs of (C, e) ∈ (C × E) such that e ∈ C.

33 Definition 3.6. TR is defined to be the set of tuples{(R,C1,C2)} for all relations R(C1,C2) ∈ R.

Definition 3.7. TΠ is defined to be the set of tuples {(I, R, x, C1, y, C2, w)}, where I ∈ N is an integer identifier, R(C1,C2) ∈ R, x ∈ C1 ∈ C, y ∈ C2 ∈ C, and (R(x, y), w) ∈ Π.

In Definition 3.7, we put all facts in a single table. Comparing with Tuffy, which uses one table for each relation, our approach scales to modern knowledge bases since they often contain thousands of relations (Reverb has 80K). In addition, it allows us to join the MLN tables to apply MLN rules in batches. The C1 and C2 columns are for optimization purposes; they replicate TC and TR to avoid the overhead of joining them when we apply MLN rules.

An example T_Π constructed from Example 3.4 is given in Figure 3-2A.

3.2.2.2 MLN rules

Unlike the other components, L does not map to a relational schema directly since rules have flexible structures. Our approach to this problem is to structurally partition the clauses so that each partition has a well-defined schema.

Definition 3.8. Two first-order clauses are defined to be structurally equivalent if they differ only in their entity, class, and relation symbols.

Example 3.9. Consider the MLN from Example 3.4. The rules ∀x ∈ W ∀y ∈ P : live in(x, y) ← born in(x, y) and ∀x ∈ W ∀y ∈ C : live in(x, y) ← born in(x, y) are structurally equivalent since they differ only in P and C; the rules ∀x ∈ P ∀y ∈ C ∀z ∈ W : locate in(x, y) ← live in(z, x) ∧ live in(z, y) and ∀x ∈ P ∀y ∈ C ∀z ∈ W : locate in(x, y) ← born in(z, x) ∧ born in(z, y) are structurally equivalent since they differ only in “born in” and “live in.”

It is straightforward to verify that structural equivalence as defined in Definition 3.8 is indeed an equivalence relation; thus it defines a partition on the space of clauses. According to the definition, each clause can be uniquely identified within a partition by specifying a tuple of entities, classes, and relations, which we refer to as its identifier tuple in that partition. A partition is, therefore, a collection of identifier tuples. We now make this precise:

Definition 3.10. TH is defined to be a set of partitions {M1,...,Mk}, where each partition is a set of identifier tuples comprised of entities, classes, and relations with their weights.

We identify 6 structural equivalence classes in the Sherlock dataset, listed below:

∀x ∈ C1, y ∈ C2 (p(x, y) ← q(x, y)) (3–1)

∀x ∈ C1, y ∈ C2 (p(x, y) ← q(y, x)) (3–2)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(z, x), r(z, y)) (3–3)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(x, z), r(z, y)) (3–4)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(z, x), r(y, z)) (3–5)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(x, z), r(y, z)) (3–6)

We use 6 partitions, M1, ..., M6, to store all the rules.
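As a self-contained sketch of how these tables can be materialized (ours, using SQLite purely for illustration; ProbKB runs on Greenplum, but the schema carries over), the fact table of Definition 3.7 and the partition table M1 look as follows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  -- T (T_Pi): one row per fact, classes replicated inline (Definition 3.7)
  CREATE TABLE T (I INTEGER PRIMARY KEY, R TEXT,
                  x TEXT, C1 TEXT, y TEXT, C2 TEXT, w REAL);
  -- M1: identifier tuples for rules of structure (3-1) (Definition 3.10)
  CREATE TABLE M1 (R1 TEXT, R2 TEXT, C1 TEXT, C2 TEXT, w REAL);
""")
db.executemany("INSERT INTO T VALUES (?,?,?,?,?,?,?)", [
    (1, "born in", "Ruth Gruber", "W", "New York City", "C", 0.96),
    (2, "born in", "Ruth Gruber", "W", "Brooklyn", "P", 0.93),
])
db.executemany("INSERT INTO M1 VALUES (?,?,?,?,?)", [
    ("live in", "born in", "W", "P", 1.40),
    ("live in", "born in", "W", "C", 1.53),
])
db.commit()
```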

Example 3.11. M1 and M3 in Figure 3-2 show two example MLN tables. The classes P, C, W are defined in Table 2-1. Each tuple in M1 and M3 represents a rule, and each table represents a particular rule syntax. Tuples (R1, R2, C1, C2) in M1 are chosen to stand for ∀x ∈ C1, ∀y ∈ C2 : R1(x, y) ← R2(x, y). Hence, the first row in M1 means “if a writer is born in a place, then he probably lives in that place.” Tuples (R1, R2, R3, C1, C2, C3) in M3 stand for ∀x ∈ C1, ∀y ∈ C2, ∀z ∈ C3 : R1(x, y) ← R2(z, x), R3(z, y). Hence, the first row in M3 means “if a writer lives in a place and a city, then the place is probably located in that city.” The other rows are interpreted in a similar manner.

3.2.2.3 Factor graphs

As we mentioned in Section 2.1.1, a ground factor can be specified by its variables and weight. Definition 3.12 utilizes this fact. For simplicity, we limit our definition to factors of sizes up to 3 (large enough for Sherlock rules), but our approach can be easily extended to larger factors.

Figure 3-2. Knowledge expansion example. (A) Example TΠ table. (B)(C) Example MLN tables. (D) Example query tree for grounding: T_i^j denotes the intermediate result for partition i in the jth iteration; TΠ^j is the merged result at the jth iteration; in TΦ, TΦi is the ground clauses from partition i, and TΦ0 is the singleton factors from TΠ (Algorithm 3-1, Line 10). The double lines indicate that the operand may participate multiple times in the join. (E)-(H) Query results; shaded rows correspond to shaded tables in (D). Abbreviations: RG = Ruth Gruber, NYC = New York City, Br = Brooklyn; W = Writer, C = City, P = Place.

(A) TΠ (I, R, x, C1, y, C2, w):
1 | born in | RG | W | NYC | C | 0.96
2 | born in | RG | W | Br | P | 0.93

(B) M1 (R1, R2, C1, C2, w):
live in | born in | W | P | 1.40
live in | born in | W | C | 1.53
grow up in | born in | W | P | 2.68
grow up in | born in | W | C | 0.74

(C) M3 (R1, R2, R3, C1, C2, C3, w):
located in | live in | live in | P | C | W | 0.32
located in | born in | born in | P | C | W | 0.52

(E) TΦ (I1, I2, I3, w):
1 | NULL | NULL | 0.96
2 | NULL | NULL | 0.93
3 | 1 | NULL | 1.53
4 | 2 | NULL | 1.40
5 | 1 | NULL | 0.74
6 | 2 | NULL | 2.68
7 | 2 | 1 | 0.52
7 | 4 | 3 | 0.32

(F) Final TΠ (I, R, x, C1, y, C2, w):
1 | born in | RG | W | NYC | C | 0.96
2 | born in | RG | W | Br | P | 0.93
3 | live in | RG | W | NYC | C |
4 | live in | RG | W | Br | P |
5 | grow up in | RG | W | NYC | C |
6 | grow up in | RG | W | Br | P |
7 | located in | Br | P | NYC | C |

Query 1-1:
SELECT M1.R1 AS R, T.x AS x, T.C1 AS C1, T.y AS y, T.C2 AS C2
FROM M1
JOIN T ON M1.R2 = T.R AND M1.C1 = T.C1 AND M1.C2 = T.C2;

Query 1-3:
SELECT M3.R1 AS R, T2.y AS x, T2.C2 AS C1, T3.y AS y, T3.C2 AS C2
FROM M3
JOIN T T2 ON M3.R2 = T2.R AND M3.C3 = T2.C1 AND M3.C1 = T2.C2
JOIN T T3 ON M3.R3 = T3.R AND M3.C3 = T3.C1 AND M3.C2 = T3.C2
WHERE T2.x = T3.x;

Query 2-3:
SELECT T1.I AS I1, T2.I AS I2, T3.I AS I3, M3.w AS w
FROM M3
JOIN T T1 ON M3.R1 = T1.R AND M3.C1 = T1.C1 AND M3.C2 = T1.C2
JOIN T T2 ON M3.R2 = T2.R AND M3.C3 = T2.C1 AND M3.C1 = T2.C2
JOIN T T3 ON M3.R3 = T3.R AND M3.C3 = T3.C1 AND M3.C2 = T3.C2
WHERE T1.x = T2.y AND T1.y = T3.y AND T2.x = T3.x;

Definition 3.12. TΦ is defined to be a set of tuples {(I1,I2,I3, w)}, where I1,I2,I3 are foreign keys to TΠ(I) and w ∈ R is the weight. Each tuple (I1,I2,I3, w) represents a weighted ground rule I1 ← I2,I3. I1 is the head; I2,I3 are the body and allowed to be NULL for factors of sizes 1 or 2.

Figure 3-2E shows an example of TΦ. As the final result of grounding, it serves as an intermediate representation that can be input to probabilistic inference engines, e.g., [43, 44]. Moreover, since it records the causal relationships among facts, it contains the entire lineage and can be queried [48]. One application of lineage is to help determine the facts' credibility, which we use extensively in our experiments.

3.2.3 Grounding

The relational representation of Γ allows an efficient grounding algorithm that applies MLN rules in batches. The idea is to join the TΠ and Mi tables by equating the corresponding relations and classes. Each of these join queries applies all the rules in partition Mi in a batch. The grounding algorithm consists of two steps: we first apply the rules to compute the ground atoms (given and inferred facts) until we reach the transitive closure. Then we apply the rules again to construct the ground factors. Algorithm 3-1 summarizes this procedure.

The groundAtoms(TΠ,Mi) and groundFactors(TΠ,Mi) functions in Algorithm 3-1 do the actual joins and are implemented in SQL. The left of Figure 3-2 shows some example queries for partitions 1 and 3. The actual query used is determined by the partition index i (the rule structure). In Lines 6-7, applyConstraints(TΠ) and redistribute(TΠ) are for quality control (Section 3.3) and MPP optimization (Section 3.2.4), respectively. This section describes the SQL queries we use for partitions 1 and 3. The other queries can be derived using the same mechanism.

Algorithm 3-1: Grounding(TΠ, M1, ..., Mk)

Input: TΠ, M1, ..., Mk
Output: TΦ
 1  TΦ ← ∅
 2  while not convergent do
 3      forall partitions Mi do
 4          Ti ← groundAtoms(TΠ, Mi)
 5      TΠ ← TΠ ∪ (∪_{j=1}^{k} Tj)
 6      applyConstraints(TΠ)
 7      redistribute(TΠ)
 8  forall partitions Mi do
 9      TΦ ← TΦ ∪_B groundFactors(TΠ, Mi)
10  TΦ ← TΦ ∪_B groundFactors(TΠ)
11  return TΦ

Queries 1-1 and 1-3 in Figure 3-2 are used to implement groundAtoms(TΠ, Mi) for partitions 1 and 3. They join the Mi and TΠ (T) tables by relating the relations, entities, and classes in the rule body. Take Query 1-3 as an example: the join conditions M3.R2 = T2.R and M3.R3 = T3.R select relations M3.R2 and M3.R3, and T2.x = T3.x matches the entities according to Rule (3–3). The remaining conditions check the classes, and finally, the SELECT clause generates new facts {(I, R, x, Ci, y, Cj, NULL)}. The weights are to be determined in the marginal inference step, so we set them to NULL during grounding. In each iteration, we apply groundAtoms(TΠ, Mi) for all i and merge the results into TΠ.

In groundFactors(TΠ, Mi), we create a factor for each ground rule by joining the relations from both the head and body of a rule. We illustrate this process in Figure 3-2, Query 2-3. The conditions M3.R1 = T1.R, M3.R2 = T2.R, M3.R3 = T3.R select the relations from the head (R1) and the body (R2, R3). The other conditions match the entities and classes according to Rule (3–3). Then the SELECT clause retrieves the matched IDs and the weight of the rule. In groundFactors(TΠ), we represent the uncertain facts in Π (w ≠ NULL) by singleton factors, i.e., factors involving one variable. Lines 9-10 merge the results using bag unions (∪B) due to the following proposition:

Proposition 3.13. Query 2-i does not produce duplicate tuples (I1, I2, I3) if Mi does not contain duplicates.

Proof. We prove the length 3 cases; length 2 cases can be verified similarly. Given any factor (I1, I2, I3, w) ∈ TΦ, there is one rule in Mi that implies the deduction I1 ← I2, I3, whose columns R1, R2, R3, C1, C2, C3 are determined by the TΠ tuples associated with I1, I2, I3. Hence, the tuple (I1, I2, I3, w) derives from the rule (R1, R2, R3, C1, C2, C3, w) and the rows (facts) in TΠ identified by I1, I2, I3. Joining these rows yields at most one tuple (I1, I2, I3, w).

There may be duplicates from different partitions. We treat them as multiple factors among the same variables since they are valid deductions using different rules.

Example 3.14. Figure 3-2 illustrates Algorithm 3-1 by applying M1 and M3 to TΠ^0. In the first iteration, we run Query 1-1, which applies all four rules in M1. The result is given in T1^1 and merged with TΠ^0. In the second iteration (n = 2), we run Query 1-3. Again, both rules in M3 are applied, although there is only a single result, which is merged with TΠ^1. Note that, according to Algorithm 3-1, all Mi's should be applied in each iteration, but in this simple example, only M1 and M3 are applicable. Having computed the ground atoms, Queries 2-1 (omitted from Figure 3-2) and 2-3 generate the ground factors. The final TΦ in Figure 3-2E includes the singleton factors shown in gray rows.
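For concreteness, the singleton-factor step groundFactors(TΠ) in Line 10 of Algorithm 3-1 admits a one-statement implementation; the following is a minimal sketch using the ASCII table names TPhi and TPi from above:

-- Sketch: one singleton factor per uncertain fact (w IS NOT NULL) in TΠ.
INSERT INTO TPhi (I1, I2, I3, w)
SELECT I, NULL, NULL, w
FROM TPi
WHERE w IS NOT NULL;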

Analysis. The correctness of Algorithm 3-1 follows from [28, 49] if groundAtoms(TΠ, Mi) and groundFactors(TΠ, Mi) correctly apply the rules. This can be verified by analyzing the SQL queries in a similar way to what we did for partitions 1 and 3 and is illustrated in Example 5.1. The efficiency follows from the fact that database systems are set-oriented and optimized to execute queries over large sets of rows; hence the ability to apply rules in batches significantly improves performance compared to applying one rule per SQL query. Though individual queries may become large, our rationale is that given 30,912 Sherlock rules, grouping them together and letting the database system manage the execution process will be more efficient than running them independently, which deprives the database of its set-oriented optimizations. We experimentally validate this claim in Section 3.4.1.

Assuming the number of iterations of the outer loop in Line 2 is a constant (which is a reasonable assumption; in our experiments, 15 iterations ground most of the facts, and both the Tuffy and ProbKB systems need the same number of iterations), Lines 3-7 take O(k) SQL queries, where k is the number of partitions. Lines 8-10 take another O(k) queries. Thus, we execute O(k) SQL queries in total. On the other hand, Tuffy uses O(n) queries, where n is the number of rules. ProbKB thus has better performance than Tuffy when k ≪ n, which is likely to hold for machine-constructed MLNs since they often have a predefined schema with limited rule patterns. In the Reverb-Sherlock knowledge base, we have k = 6 and n = 30,912 and observe the expected benefits offered by the ability to apply the rules in batches.

3.2.4 MPP Implementation

This section describes our techniques to improve performance when migrating ProbKB from PostgreSQL to MPP databases (e.g., Greenplum). The key challenge is to maximize data collocation for join queries to reduce cross-segment data shipping during query execution [50]. Specifically, we use redistributed materialized views [50] to replicate relevant tables using different distribution strategies. The join queries are rewritten to operate on these replicates according to the join attributes to ensure that the joining records are collocated in the same segment. This maximizes the degree of parallelism by avoiding unnecessary cross-segment data shipping. The distribution keys are determined by the attributes used in the join queries. It turns out that, due to the similar syntax of Rules (3–1)-(3–6), most of these views are shared among queries; the only replicates of TΠ we need to create are distributed by the following keys: (R, C1, C2), (R, C1, x, C2), (R, C1, C2, y), and (R, C1, x, C2, y). The following example demonstrates how we select distribution keys and rewrite queries.

Example 3.15. Assume T0 and Tx are materialized views of TΠ distributed by (R, C1, C2) and (R, C1, x, C2), respectively. Then, in Query 1-3, instead of

FROM M3 JOIN T T2 ON ... JOIN T T3 ON ...

we use the views:

FROM M3 JOIN T0 T2 ON ... JOIN Tx T3 ON ...

Compared to Query 1-3, the above query joins M3, T0, and Tx instead of M3 and T. The effect of using redistributed materialized views is illustrated by the query plans in Figure 3-3. When joining records from T and any other table X, Greenplum requires that the records reside in the same segment. Otherwise, it redistributes both tables according to the join key or broadcasts one of them, both of which are expensive operations. On the contrary, if T is already distributed by the join keys, Greenplum only needs to redistribute the other table. In the unoptimized plan in Figure 3-3, Greenplum tries to broadcast the intermediate hash join result, which takes 8.06 seconds, whereas the redistribution motion in the optimized plan takes only 0.85 seconds.

Figure 3-3. Query plans generated by Greenplum with (A) and without (B) optimization. The annotations show the durations of each operation in a sample run joining M3 and a synthetic TΠ with 10M records.
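One way to create such redistributed copies is Greenplum's DISTRIBUTED BY clause; the following is a minimal sketch (T0 and Tx as in Example 3.15; materializing plain tables rather than true views is our simplifying assumption):

-- Sketch: copies of T collocated under different distribution keys.
CREATE TABLE T0 AS SELECT * FROM T DISTRIBUTED BY (R, C1, C2);
CREATE TABLE Tx AS SELECT * FROM T DISTRIBUTED BY (R, C1, x, C2);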

3.3 Quality Control

MLNs have been applied in a variety of applications. Most of them use clean, hand-crafted MLNs in small domains. However, machine-constructed knowledge bases often contain noisy and inaccurate facts and rules. Over such KBs, errors tend to accumulate and propagate rapidly along the inference chain, as shown in Figure 3-4A. As a result, the inferred knowledge is full of errors after only a few iterations. Hence, it is important to detect errors early to prevent error propagation. Analyzing the inference results, we identify the following error sources:

E1) Incorrect facts resulting from the IE systems.
E2) Incorrect rules resulting from the rule learning systems.
E3) Ambiguous entities referring to multiple entities by a common name, e.g., "Jack" may refer to different people. They generate erroneous results when used as join keys.
E4) Propagated errors resulting from the inference procedure. Figure 3-4A illustrates how a single error produces a chain of errors.

In this section, we identify potential solutions to each of the problems above. Section 3.3.1 introduces the concept of semantic constraints and functional constraints that can be used to detect erroneous facts (E1 and E4). Section 3.3.2 uses functional constraints further to detect ambiguous entities (E3) to ensure the correctness of join queries. Section 3.3.3 introduces our current approach to rule cleaning (E2). Finally, Section 3.3.4 describes an efficient implementation of the quality control methods on top of our relational model. Combining these techniques, we are able to achieve much higher precision of the inferred facts.

3.3.1 Semantic Constraints

Constraints are an effective tool used in database systems to ensure data validity [51]. This section introduces a similar concept called semantic constraints that we use in knowledge bases to ensure the validity of facts. These constraints are derived from the semantics of extracted relations, e.g., a person is born in only one country; a country has only one capital city; etc. Conceptually, semantic constraints are hard rules that must be satisfied by all possible worlds. Violations, if any, indicate potential errors.

Figure 3-4. Quality control for knowledge expansion. (A) Errors (shaded) resulting from ambiguous entities and wrong rules, and how they propagate in the inference chain; (B) Sample functional relations and sources of constraint violations:

Functional relation   Violating facts                            Error sources
born in               born in(Mandel, Berlin)                    Leonard Mandel
                      born in(Mandel, New York City)             Johnny Mandel
                      born in(Mandel, Chicago)                   Tom Mandel (futurist)
grow up in            grow up in(Miller, Placentia)              Dustin Miller
                      grow up in(Miller, New York City)          Alan Gifford Miller
                      grow up in(Miller, New Orleans)            Taylor Miller
located in            located in(Regional office, Glasgow)       McCarthy & Stone regional offices
                      located in(Regional office, Panama City)   OCHA regional offices
                      located in(Regional office, South Bend)    Indiana Landmarks regional offices
capital of            capital of(Delhi, India)                   (Incorrect extraction)
                      capital of(Calcutta, India)

Definition 3.16. A semantic constraint ω is a first-order formula with an infinite weight (F, ∞) ∈ L. The set of semantic constraints is denoted by Ω.

Thus, the MLN L can be written as L = (H, Ω), where H is the set of inference rules and Ω is the set of semantic constraints. We separate them to emphasize their different purposes. By definition, semantic constraints are allowed to be arbitrary first-order formulae, but staying with a particular form is helpful for obtaining and applying the constraints at scale. One form of constraints we find particularly useful is functional constraints [19, 20, 42]. They help detect errors from propagation and incorrect rules, and can be used to detect ambiguous entities that invalidate equality checks in join queries. Functional constraints have a simple form that reflects a common property of extracted relations, namely, functionality.

Definition 3.17. A relation R(Ci, Cj) ∈ R is functional if for any x ∈ Ci, there is at most one y ∈ Cj such that R(x, y) ∈ Π, or conversely, if for any y ∈ Cj, there is at most one x ∈ Ci such that R(x, y) ∈ Π. We refer to these cases as Type-I and Type-II functionality, respectively.

Definition 3.18. For each Type-I functional relation R(Ci, Cj), Ω contains a functional constraint with weight ∞:

∀x ∈ Ci, ∀y, z ∈ Cj : R(x, y) ∧ R(x, z) → y = z.    (3–7)

A similar rule can be defined for Type-II functional relations.

Example 3.19. Figure 3-4B lists some functional relations we find in Reverb extractions. born in, grow up in, and located in are of Type I, which means a person is born in only one place, etc. capital of is of Type II, which means a country has only one capital city. If an entity participates in a functional relation with different entities from the same class, it violates the functional constraints. The violations are mostly caused by ambiguous entities and erroneous facts (E1, E3, E4), as illustrated in Figure 3-4B.

Besides these functional relations, [42] observes a number of extracted relations that are close to, but not strictly, functional. Consider the relation live in(Person, Country) for instance: it is possible that a person lives in several different countries, but that number should not exceed a certain limit δ. To support these pseudo-functional relations, we allow them to have 1-to-δ mappings, where δ is called the degree of functionality.

The remaining question is how to obtain these functional constraints. In ProbKB, we use the functional constraints learned by Leibniz [19], an algorithm for automatic functional relation learning. The repository of the functional relations is publicly available¹. We use this repository and a set of manually labeled pseudo-functional relations (Leibniz only contains functional relations) to construct the constraints as defined in this section.

3.3.2 Ambiguity Detection

As shown in Query 1-3, the SQL queries used to apply length 3 rules involve equality checks, e.g., T2.x = T3.x. However, problems arise when two entities are literally equal but do not corefer to the same object. When this happens, the inference results are likely to be wrong. For instance, in Figure 3-4A, the error "located in(Baltimore, Berlin)" is inferred by

born in(Mandel, Berlin) ∧ born in(Mandel, Baltimore)

→ located in(Baltimore, Berlin)

The ambiguous entity “Mandel” invalidates the equality check used by the join query. Unfortunately, ambiguities are common in Reverb extractions, especially in people’s names, since web pages tend to refer to people by only their first or last names, which often coincide with each other. Ambiguous entities are one of the major sources that cause functional constraint violations.

In a Type I functional relation R(x : Ci, y : Cj), x functionally determines y. An ambiguous x (a name conflating distinct entities x and x′), however, often associates with another y′ such that y ≠ y′, since x and x′ refer to different entities. Hence, x violates the functional constraint (3–7). Figure 3-4B lists sample violations caused by ambiguous entities. Thus, one effective way to detect ambiguous entities is to check for constraint violations. However, we also observe other sources that lead to violations, including extraction errors, propagated errors, etc. In this paper, we greedily remove all violating entities to improve precision, but we could do much more if we were able to automatically categorize the errors. For example, violations caused by propagated errors may indicate low credibility of the inference rules, which can be utilized to improve rule learners.

¹ http://knowitall.cs.washington.edu/leibniz.

3.3.3 Rule Cleaning

Wrong rules are another significant error source. In Figure 3-4A, for example, one of the incorrect rules we use is

born in(Freud, Baltimore) ∧ born in(Freud, Germany) → capital of(Baltimore, Germany)

Working with clean rules is highly desirable for MLN and other rule-based inference engines since the rules are applied repeatedly to many facts. We clean rules according to their statistical significance, a scoring function used by Sherlock based on conditional probabilities. A more accurate method (used by NELL) involves human supervision, but it does not apply to Sherlock due to its scale. We perform rule cleaning by ranking the rules by their statistical significance and taking the top θ rules (θ ∈ [0, 1]). The parameter θ is obtained by experiments to achieve good precision and recall.
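Relationally, the top-θ selection might look like the following sketch, assuming a hypothetical rules(id, score) table that stores each rule's statistical significance, with θ = 0.2:

-- Sketch: keep the top 20% of rules ranked by statistical significance.
SELECT id
FROM (SELECT id,
             PERCENT_RANK() OVER (ORDER BY score DESC) AS pr
      FROM rules) ranked
WHERE pr <= 0.2;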

3.3.4 Implementation

The traditional DBMS way to define constraints is by checks, assertions, or triggers: we define one check (or trigger, etc.) for each of the functional relations. This requires thousands of constraint definitions and runtime checks, making it impractical for large KBs.

In ProbKB, instead, observing that functional constraints of form (3–7) are all structurally equivalent, we store them in a single table TΩ:

Definition 3.20. TΩ is defined to be the set of tuples {(R, C1, C2, α, δ)}, where R(C1, C2) ∈ R, α ∈ {1, 2} is the functionality type, and δ is the degree of pseudo-functionality. δ is defined to be 1 for functional relations.

It is often the case that the functionality of a relation applies to all its associating classes. Take located in(C1, C2) for example: the functionality holds for all possible pairs of (C1, C2) where C1 ∈ {Places, Areas, Towns, Cities, Countries} and C2 ∈ {Places, Areas, Towns, Cities, Countries, Continents}. For these relations, we omit the C1, C2 components and assume the functionality holds for all possible pairs of classes.

As is the case with MLN rules, the strength of this relational model is the ability to join the TΠ table to apply the constraints in batches.
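Concretely, TΩ might be stored as a table like the following sketch; the name FC and the columns arg and deg match those referenced in Listing 3 below, while the concrete SQL types are our assumption:

-- Sketch: one row per functional constraint (Definition 3.20).
CREATE TABLE FC (
    R   text,      -- functional relation name
    C1  text,      -- argument classes; NULL when the constraint holds
    C2  text,      --   for all possible class pairs
    arg smallint,  -- functionality type (alpha): 1 = Type I, 2 = Type II
    deg integer    -- degree of pseudo-functionality (delta); 1 if strictly functional
);

Listing 3 then gives one implementation of the applyConstraints function in Algorithm 3-1; it removes all entities that violate Type I functional constraints.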

Listing 3. applyConstraints(TΠ, TΩ)

-- Remove each (entity, class) pair that participates in a Type I
-- functional relation with more objects than the allowed degree.
DELETE FROM T
WHERE (T.x, T.C1) IN (
    SELECT T.x, T.C1
    FROM T JOIN FC ON T.R = FC.R
    WHERE FC.arg = 1
    GROUP BY T.R, T.x, T.C1, T.C2
    HAVING COUNT(*) > MIN(FC.deg) );

Type II functional constraints are applied similarly.

3.4 Experiments

In this section, we validate the ProbKB system in terms of both performance and the quality of inferred facts. We show the efficiency and scalability of ProbKB in Section 3.4.1. In Section 3.4.2, we show the efficacy of our quality control methods. We use the following datasets for the experiments:

Table 3-2. Sherlock-Reverb KB statistics.

# relations   82,768
# rules       30,912
# entities   277,216
# facts      407,247

Reverb-Sherlock KB: A real knowledge base constructed from Reverb Wikipedia extractions and Sherlock rules. Table 3-2 shows the statistics of this KB.²

S1: Synthetic KBs with the original 407,247 facts and varying numbers of rules ranging from 10K to 1M. The rules are either taken from Sherlock or randomly generated. We ensure the validity of the rules by substituting random heads for existing rules.

S2: Synthetic KBs with the original 30,912 rules and varying numbers of facts ranging from 100K to 10M. The facts are generated by adding random edges to the Reverb KB.

3.4.1 Performance

We use Tuffy as the baseline comparison. In Tuffy, each relation is stored in its own table and each individual rule is implemented by a SQL query. The original Tuffy does not support typing, so we re-implement it and refer to our implementation as Tuffy-T. On the contrary, ProbKB applies inference rules in batches. We implement ProbKB on PostgreSQL 9.2 and on Greenplum 4.2 (ProbKB-p). We run the experiments on a 32-core cluster with 64GB of RAM running Red Hat Linux 4 unless otherwise specified.

3.4.1.1 Case study: the Reverb-Sherlock KB

We run the Tuffy-T and ProbKB systems to perform the inference task on the Reverb-Sherlock KB. When we evaluate efficiency, we run Listing 3 once before inference starts and do not perform any further quality control during inference. This results in a KB with 396K facts. We bulkload the dataset and run Query 1³ for four iterations, which results in a KB of 1.5M facts. Then we run Query 2 to compute the factor graph. We run Queries 1 and 2 for all MLN partitions. Table 3-3 summarizes the results.

² However, there is a version mismatch between the datasets. Sherlock used TextRunner, an earlier version of Reverb, to learn the inference rules. Thus, there are a number of new entities and relations in Reverb that do not exist in Sherlock. In our experiments, there are initially 13K facts to which inference rules apply. To get a better understanding of ProbKB's performance, we will always report the sizes of result KBs or the number of inferred facts.

Table 3-3. Tuffy-T and ProbKB systems performance: the first three rows report the running time of the relevant queries in minutes; the last row reports the size of the result table.

Systems       Load    Query 1                               Query 2
                      Iter 1   Iter 2   Iter 3   Iter 4
ProbKB-p       0.25    0.07     0.07     0.15     0.48       9.75
ProbKB         0.03    0.05     0.12     0.23     1.28      36.28
Tuffy-T       18.22    1.92     9.40    22.40    44.77      84.07
Result size    396K    420K     456K     580K     1.5M       592M

As we see from Table 3-3, Tuffy-T takes over 607 times longer to bulkload than ProbKB since it loads 83K predicate tables, whereas ProbKB and ProbKB-p only need to load one. For Query 1, ProbKB outperforms Tuffy-T by over 100 times in the 2nd-4th iterations. This performance boost follows from our strategy of using a single query to apply batches of rules altogether, eliminating the need to query the database 31K times: instead, only 6 queries are executed in each iteration. On the other hand, we also observe that the KB grows unmanageably large without proper constraints, resulting in 592M factors after the 4th iteration, most of which are incorrect. For this reason, we stop at iteration 4. For Query 2, ProbKB-p has a speed-up of 8.6. The running time for this query is dominated by writing out the 592M-row table. Finally, we observe a speed-up of 4 using Greenplum over PostgreSQL.

³ We use Query 1 to refer to Queries 1-i for all partitions i.

3.4.1.2 Effect of batch rule application

To get better insight into the effect of batch rule application, we run the grounding algorithm on a set of synthetic KBs, S1 and S2, of varying sizes. Since S1 and S2 are synthetic, we only run the first iteration so that Query 2 is not affected by error propagation as in the Reverb-Sherlock case. The running times are recorded in Figures 3-5A and 3-5B.

Figure 3-5. Knowledge expansion performance comparison. (A) Runtime of KBs with varying numbers of rules; (B) Runtime of KBs with varying numbers of facts; (C) Runtime of the PostgreSQL and MPP versions of ProbKB. The dashed lines indicate the number of inferred facts.

As shown in Figure 3-5A, the ProbKB systems have much better scalability in terms of the MLN size. When there are 10^6 rules, ProbKB-p, ProbKB, and Tuffy-T take 53, 210, and 16,507 seconds, respectively, which is a speed-up of 311 for ProbKB-p compared to Tuffy-T. This is because ProbKB and ProbKB-p use a constant number of queries that apply MLN rules in batches in each iteration, regardless of the number of rules, whereas the number of SQL queries Tuffy-T needs to perform is equal to the number of rules. Figure 3-5B compares Tuffy-T and the ProbKB systems when the size of Π increases. We observe a speed-up of 237 when there are 10^7 facts.

Another reason for ProbKB's performance improvement is its output mechanism: ProbKB and ProbKB-p output a single table from each partition, whereas Tuffy-T needs to do 30,912 insertions, one for each rule.

3.4.1.3 Effect of MPP parallelization

Figure 3-5C compares three variants of ProbKB: on PostgreSQL (ProbKB) and on Greenplum, with and without redistributed materialized views (ProbKB-p and ProbKB-pn, respectively). We run them on S2 and record the running times for Queries 1 and 2. The results in Figure 3-5C show that both Greenplum versions outperform PostgreSQL by at least a factor of 3.1 (ProbKB-pn) when there are 10^7 facts. Using redistributed materialized views (ProbKB-p), we achieve a maximum speed-up of 6.3. Note that even with our optimization to improve join locality, the speed-up is not perfectly linear in the number of segments used (32). This is because the join data are not completely independent of each other; we need to redistribute the intermediate and final results so that the next operation has the data in the right segments. These intermediate data shipping operations are shown as redistribution or broadcast motion nodes in Figure 3-3. Data dependencies are unavoidable, but the overall results strongly support the benefits and promise of MPP databases for big data analytics.

3.4.2 Quality

To evaluate the effectiveness of different quality control methods, we run two groups of experiments, G1 and G2, with and without semantic constraints; for each group, we perform different levels of rule cleaning. The parameter setup is shown in Table 3-4.

Table 3-4. Quality control parameters. SC and RC stand for semantic constraints and rule cleaning, respectively.

      SC      RC (θ)
G1    no-SC   1 (no-RC), 20%, 10%
G2    SC      1 (no-RC), 50%, 20%

For the first group, we use no semantic constraints. We set the parameter θ (i.e., top θ rules) to 1 (no rule cleaning), 20%, and 10%, respectively. For the second group, we use semantic constraints and set θ = 1, 50%, and 20%. These parameters are obtained by experiments so as to achieve good precision and recall. For each experiment, we run the inference algorithm until no more correct facts can be inferred in a new iteration. In each iteration, we infer 5000 new facts, the precision of which is estimated using a random sample of size 25 (though the samples may not accurately estimate the precision, they serve our purpose of comparing different methods). Each sample fact is evaluated by two independent human judges. In cases of disagreement, we carry out a detailed discussion before making a final decision. Since all rules and facts are uncertain, we clarify our criteria for assessing these facts. We divide the facts into three levels of credibility: correct, probable, and incorrect. The "probable" facts are derived from rules that are likely to be correct, but not certain. For example, in Figure 3-4A, we infer that Rothman lives in Baltimore based on the fact that Rothman was born in Baltimore. This is not certain, but likely, so we accept it. However, there are also a number of facts inferred by rules that are possible but unlikely to hold, like ∀x ∈ City, ∀y ∈ Country (located in(x, y) → capital of(x, y)). We regard such results as incorrect. The precision is estimated as the fraction of correct and probable facts over the sample size.

3.4.2.1 Overall results

Using the Reverb-Sherlock KB, we are able to discover over 20,000 new facts that are not explicitly extracted. As noted in the beginning of Section 3.4, the Reverb-Sherlock KB initially has 13,000 facts to which inference rules apply. The precision of the inferred facts is shown in Figure 3-6A.

Figure 3-6. Overall result of quality control. (A) Precision of inferred facts using different quality control methods. (B) Error sources that lead to constraint violations.

As shown in the figure, both semantic constraints and rule cleaning improve precision. The raw Reverb-Sherlock dataset infers 4800 new correct facts at a precision of 0.14. The precision drops quickly as we generate new facts since unsound rules and ambiguous entities result in many erroneous facts. On the contrary, the precision significantly improves with our quality control methods: with the top 10% of rules we infer 9962 facts at a precision of 0.72; with semantic constraints, we infer 23,164 new facts at precision 0.55. Combining these two methods, we are able to infer 22,654 new facts at precision 0.65 using the top 50% of rules, and 16,394 new facts at precision 0.75 using the top 20% of rules.

It is noteworthy that using semantic constraints also increases recall (the estimated number of correct facts). As shown in Table 3-3, the KB size grows unmanageably large without proper constraints. The propagated errors waste virtually all computation resources, preventing us from finishing grounding and inferring all correct facts.

3.4.2.2 Effect of semantic constraints

As shown in Figure 3-6A, the use of semantic constraints greatly improves precision by removing the ambiguous entities from the extracted facts. These results are expected and validate our claims in Sections 3.3.1 and 3.3.2. Examples of removed ambiguous entities are shown in Figure 3-4B. We identify a total of 1483 entities that violate functional constraints and use 100 random samples to estimate the distribution of the error sources, as shown in Figure 3-6B. Out of these samples, 34% are ambiguous; 63% are due to erroneous facts resulting from incorrect rules (33%), ambiguous join keys (24%), and extraction errors (6%). In addition, we have 3% violations due to general types (e.g., both New York and U.S. are Places) and synonyms (e.g., New York and New York City refer to the same city). As a computational benefit, removing the errors generates a smaller and cleaner KB, with which we finish grounding in 15 iterations in 2 minutes on PostgreSQL. For the raw Reverb-Sherlock KB, on the contrary, iteration 4 alone takes 10 minutes for ProbKB-p, and we cannot finish the 5th iteration due to its exponentially large size.

3.4.2.3 Effect of rule cleaning

Rule cleaning aims at removing wrong rules to get a cleaner rule set for the inference task. In our experiments, we observe increases of 0.58 and 0.20 in precision for G1 and G2, respectively, as shown in Figure 3-6A. These positive effects are achieved by removing wrong rules, as expected. One pitfall we find in score-based rule cleaning is that the learned scores do not always reflect the real quality of the rules. There are correct rules with a low score and incorrect rules with a high score. As a consequence, when we raise the threshold, we discard both incorrect and some correct rules, resulting in higher precision and lower recall. One insight provided by Section 3.4.2.2 is that incorrect rules lead to constraint violations. Thus, it is possible to use semantic constraints to improve rule learners.

3.5 Summary

This chapter addresses the problem of knowledge expansion in probabilistic knowledge bases. The key challenges are scalability and quality control. We formally define the notion of probabilistic knowledge bases and design a relational model for them, allowing an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. The experiments show that our methods achieve orders of magnitude better performance than the state of the art, especially when using MPP databases. We use typing, rule cleaning, and semantic constraints for quality control. They are able to identify many errors in the knowledge base resulting from unsound rules, incorrect facts, and ambiguous entities. As a consequence, the inferred facts have much higher precision. Some of the quality control methods are still in a preliminary stage, but we have already shown very promising results.

CHAPTER 4
MINING FIRST-ORDER KNOWLEDGE BY ONTOLOGICAL PATHFINDING

Recent years have seen tremendous research and engineering efforts in constructing large knowledge bases (KBs). Examples of these knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], NELL [7], OpenIE [8, 9], ProBase [10], and YAGO [15, 40, 41]. These knowledge bases store structured information about real-world people, places, organizations, etc. They are constructed by human crafting (DBPedia, Freebase), information extraction (DeepDive, OpenIE, ProBase), reasoning and inference [13, 42], knowledge fusion [5, 25], or combinations of them (NELL). The knowledge expansion algorithm in Chapter 3 applies a set of inference rules to derive implicit facts from knowledge bases. These rules come from the Ontological Pathfinding (OP) algorithm introduced in this chapter. The OP algorithm builds upon the knowledge base relational model and scales up first-order rule mining by a series of parallelization and optimization techniques, including a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining task into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. These techniques allow us to mine 36,625 inference rules from Freebase in 33.22 hours, achieving a new state of the art for first-order rule mining.

We focus on the problem of mining first-order inference rules to support knowledge expansion. An inference rule is a first-order Horn clause used to discover implicit facts. As an example, the following rule expands knowledge of people's birth places:

wasBornIn(x, y), isLocatedIn(y, z) → wasBornIn(x, z).

Rules are useful in a wide range of applications: knowledge reasoning and expansion [13, 42], knowledge base construction [7], question answering [21], knowledge cleaning [19, 20], knowledge base maintenance [52], Markov logic learning [32], etc. Mining Horn clauses has been studied extensively in inductive logic programming. However, today's knowledge bases pose several new challenges. First, knowledge bases are often prohibitively large. For example, as of this writing, Freebase has 112 million entities and 388 million facts. None of the existing rule mining algorithms efficiently supports KBs of this size. Second, knowledge bases implement the open world assumption, implying that we have only positive examples for rule mining. To address these challenges, a number of new approaches have been proposed: Sherlock [24], AMIE [37], Markov logic structure learning [31, 32], etc. Still, new techniques need to be invented to scale state-of-the-art approaches up to knowledge bases of billions of facts.

We propose the Ontological Pathfinding (OP) algorithm to tackle the large-scale rule mining problem. We focus on scalability and design a series of parallelization and optimization techniques to achieve web scale. Following the relational knowledge base model [13], we store inference rules in relational tables and use join queries to apply them in batches. The relational approach outperforms state-of-the-art algorithms by orders of magnitude on medium-sized knowledge bases [13]. To scale to larger knowledge bases, we parallelize the mining algorithm by dividing the input knowledge base into smaller groups running parallel in-memory joins. The parallel mining algorithm can be implemented on state-of-the-art cluster computing frameworks to achieve maximum utilization of available computation resources.

Furthermore, even if we parallelize the mining algorithm, the parallel tasks are dependent on each other. In particular, the tasks need to shuffle data between stages. As the knowledge bases expand in scale, shuffling becomes the bottleneck of the computation. This shuffling bottleneck motivates us to introduce another layer of partitioning on top of the parallel computation: a partitioning scheme that divides the mining task into smaller independent sub-tasks. Each partition still runs the same parallel mining algorithm as before, but on a smaller input. Since the partitions are independent of each other, the results are unioned in the end; no data exchange occurs during computation. Our experiments show that with partitioning we accomplish the Freebase mining task within 34 hours, whereas it does not finish in 5 days without partitioning.

One major performance bottleneck is caused by large degrees of join variables in the inference rules. Applying these rules in the mining process generates large intermediate results, enumerating all possible pair-wise relationships of the joined instances. As a result, these rules are often of low quality. Based on this observation, we use non-functionality as an empirical indication of inefficiency and inaccuracy. In our experiments, we determine a reasonable functional constraint and show that 99% of the rules violating this constraint turn out to be false. Removing those rules reduces runtime by more than 5 hours for a single mining task.

Combining our approaches, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale. In summary, we make the following contributions to achieve scalability in mining first-order rules from large knowledge bases:

• We design an efficient rule mining algorithm that evaluates inference rules in batches using join queries. The algorithm runs in parallel by dividing the knowledge base into smaller groups running parallel in-memory joins;

• We design a novel partitioning algorithm to divide the mining task into small independent sub-tasks. Each sub-task is parallelized, running the same mining algorithm, but operates on small inputs. Data shuffling is unnecessary among sub-tasks;

• We define the non-functionality score to prune dubious and inefficient rules. We experimentally determine a reasonable functional constraint and show the constraint improves efficiency and accuracy of the mining algorithm;

• We conduct a comprehensive experimental study to evaluate our approaches over public knowledge bases, including YAGO and Freebase. Our research leads to a first rule set for Freebase, the largest knowledge base with hundreds of millions of facts. We publish the code and data repository online¹.

4.1 First-Order Mining Problem

We study the problem of mining first-order inference rules from web-scale knowledge bases: given a knowledge base of (subject, predicate, object) (or (s, p, o)) triples, we learn first-order Horn clauses of the form

(w, H(x, y) ← B),    (4–1)

where H(x, y) is the head predicate, the body B = ∧i Bi(zi) is a conjunction of predicates, and w is a scoring metric reflecting the likelihood of the rule being true. As in AMIE+ [22, 37] and other ILP systems [38, 39], we use a language bias and assume the Horn clauses to be connected and closed. Two atoms are connected if they share a variable. A rule is connected if every atom is connected transitively to every other atom in the rule. A rule is closed if every variable appears at least twice in different predicates. This assumption ensures that the rules do not contain unrelated atoms or variables.

¹ http://dsr.cise.ufl.edu/projects/probkb-web-scale-probabilistic-knowledge-base.

4.1.1 The Scalability Challenge

Our primary focus is scalable mining. Previous approaches [22, 24] scale to knowledge bases of 11.02 million facts, but no existing work mines inference rules from the 112 million entities and 388 million facts of Freebase. We investigate a series of parallelization and optimization techniques to achieve this scale, and we study how to use state-of-the-art data processing systems, e.g., Spark, to efficiently implement the parallel algorithms. The first-order mining problem is similar to association rule mining in transaction databases [37], but we note that they are different in nature. In first-order rules, the atoms are parameterized predicates. Each parameterized predicate can be grounded to a set of ground atoms. Depending on the size of the knowledge base, each rule can have a large number of possible ground instances. This makes mining first-order knowledge more challenging than mining traditional association rules in transaction databases.

4.1.2 Scoring Metrics

We review the support and confidence metrics for first-order Horn clauses of form H(x, y) ← B.

Support. The support of a rule is defined to be the number of distinct pairs of subjects and objects in the head of all instantiations that appear in the knowledge base:

supp(H(x, y) ← B) := |{H(x, y) | B(x, y) ∧ H(x, y) ∈ Γ}|.    (4–2)

Confidence. The confidence of a rule is defined to be the ratio of its predictions that are in the knowledge base:

conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | B(x, y)}|.    (4–3)

Our framework supports other scoring functions introduced in [24, 37]. For example, the PCA confidence of a rule is defined to be the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts p(x, y) such that ∃y′ : p(x, y′) ∈ Γ:

PCA conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ}|.    (4–4)

For each rule, we compute its support and confidence and set w = (supp, conf) in (4–1). The support and confidence metrics together indicate the quality of a rule.
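To make Equations (4–2) and (4–3) concrete, the following is a minimal SQL sketch that computes the support and body count for a single rule of the form p(x, y) ← q(x, z), r(y, z); the facts(pred, sub, obj) triple table and the predicate names 'p', 'q', 'r' are hypothetical, and the confidence is supp divided by body_cnt:

-- Sketch: support (4-2) and body size for one rule p(x,y) <- q(x,z), r(y,z),
-- assuming a deduplicated facts(pred, sub, obj) table.
WITH body AS (
    SELECT DISTINCT b1.sub AS x, b2.sub AS y
    FROM facts b1
    JOIN facts b2 ON b1.obj = b2.obj          -- join variable z
    WHERE b1.pred = 'q' AND b2.pred = 'r'
)
SELECT COUNT(h.sub) AS supp,                  -- predictions found in the KB
       COUNT(*)     AS body_cnt               -- all distinct predictions
FROM body
LEFT JOIN facts h
       ON h.pred = 'p' AND h.sub = body.x AND h.obj = body.y;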

4.2 Ontological Pathfinding

The Ontological Pathfinding (OP) algorithm aims at scaling up first-order mining algorithms to very large knowledge bases. It performs a series of parallelization and optimization techniques, as described below.

1. Construct. Enumerate syntactically correct rules according to the schema of the knowledge base Γ. We call these rules candidate rules. We store candidate rules in relational tables M according to structural equivalence.

2. Partition. Partition (Γ, M) into smaller inputs {(Γ1, M1), ..., (Γk, Mk)} that satisfy |Γi| ≤ s and |Mi| ≤ m for all i. The partitions are independent from each other and more efficient to solve than the input knowledge base.

3. Prune. Eliminate the non-functional rules with more than t joined instances in both predicates of the join. Non-functional rules produce large combinations of the non-join variables, most of them insignificant.

4. Mine. For each partition, run the parallel rule mining algorithm to compute the scores of the candidate rules. We note that while the rule mining algorithm runs in parallel in each partition, we run the partitions in serial.

These steps are summarized in Algorithm 4-1. In the following sections, we describe each step in detail.

Algorithm 4-1: Ontological-Pathfinding(Γ, s, m, t)

1 M ← construct-rules(Γ.schema);
2 {(Γi, Mi)} ← partition(Γ, ∅, M, s, m);
3 rules ← ∅;
4 forall (Γi, Mi) do
5     Mi ← prune(Γi, Mi, t);
6     rules ← rules ∪ parallel-rule-mining(Γi, Mi);
7 return rules

4.2.1 Rule Construction

As discussed in Section 3.1, all entities and predicates are typed in a knowledge base, i.e., each entity belongs to one or more types, and each predicate has its domain and range. The typing restricts the predicates that can be combined to form possible rules: they must have corresponding types for joining arguments. In this section, we utilize the schema to construct candidate rules.

Definition 4.1. The schema graph of a knowledge base is defined to be a graph GΓ = (V,E), where V = {v1, . . . , v|V |} is the set of nodes representing types and E = {ep(vi, vj)} is the set of labeled directed edges representing typed predicates p with vi as the domain and vj as the range.

In web-scale knowledge bases like YAGO and DBPedia, we often have a type hierarchy that divides types into subtypes for finer classification. For instance, the "Person" type has subtypes such as "Actor" and "Professor." These types are related by "subClassOf" edges. In Figure 4-1, we have a "subClassOf" edge from "Actor" to "Person," indicating that "Actor" is a subtype of "Person." To construct candidate rules, the subclasses need to inherit all the edges from their ancestors. The inherited edges are defined in Definition 4.2 using the following notations: 1. A(v) denotes the ancestor nodes u of v such that there is a path from v to u with all edges labeled "subClassOf"; and 2. E(u/v) denotes the neighboring non-subClassOf edges of u with at least one u substituted by v.

Table 4-1. Example KB schema from the YAGO knowledge base.

Predicate     Domain   Range
livesIn       Person   City
isMarriedTo   Person   Person
worksAt       Person   Organization
directed      Person   Movie
influences    Person   Actor
actedIn       Actor    Movie
subClassOf    Actor    Person

Definition 4.2. We define the schema closure graph of a schema graph G = (V, E) to be G′ = (V, E ∪ E′), where

E′ = ∪_{v ∈ V, u ∈ A(v)} E(u/v).

Example 4.3. In Table 4-1, we provide an example schema from the YAGO knowledge base. Its schema graph is shown in Figure 4-1. In this graph, each type in the domain and range columns is represented as a node, and each row in Table 4-1 is represented as an edge from domain to range. Since Actor is a subclass of Person, we infer additional edges by having Actor inherit Person's non-subClassOf edges, as shown by dashed arrows in Figure 4-1.

Following Definition 4.2, Algorithm 4-2 shows how to compute the closure of a specific node v in a schema graph G by DFS. When visiting a node v, we recursively visit its ancestors (Line 3) and add their neighboring edges to v (Line 4). Visiting each unvisited node in G yields its schema closure graph.

Figure 4-1. Example schema closure graph. Dashed arrows indicate inherited edges.

Algorithm 4-2: Closure(G = (V, E), v)

1 if !v.visited then
2     forall e_subClassOf(v, u) ∈ E do
3         Closure(G, u);
4         E ← E ∪ E(u/v);
5     v.visited ← True;

OP generates candidate rules by detecting closed walks (simple and non-simple cycles) in the schema closure graph. With a length limit l, OP starts from each node in the graph, searching at most l nodes for closed walks ending at the current starting node. When it detects a closed walk, it outputs a syntactically correct rule with the starting edge label as the head predicate. The same closed walk generates multiple rules with different head predicates. For instance, Figure 4-2 (left) shows three example closed walks from Example 4.3, and Figure 4-2 (right) shows the corresponding rules constructed from the closed walks. Although we have directed edges in the schema graph, we traverse it in an undirected manner: from any vertex v, we visit its neighbors via both incoming and outgoing edges. In Figure 4-2, R2 is constructed from a closed walk with repeated nodes and edges; R3 shows an example path containing an inherited edge, Actor -worksAt-> Organization, inherited from Actor's ancestor, Person.

Figure 4-2. Candidate rules R1-R3 constructed by cycle detection from Example 4.3. The first and last nodes in R1-R3 denote the same start and end node in the cycle.

R1: isMarriedTo(P1, P2), livesIn(P2, C) → livesIn(P1, C)
R2: directed(P, M), actedIn(A, M) → influences(P, A)
R3: worksAt(P, O), worksAt(A, O) → influences(P, A)

4.2.2 Partitioning

The rule construction algorithm stores candidate rules in relational tables. For web knowledge bases with large numbers of facts and predicates, the facts and rules tables tend to be prohibitively large. Before applying the rules by joining the rules table and the facts table, we partition these tables by dividing them into independent but possibly overlapping subsets running smaller instances of the rule mining algorithm, as shown in Figure 4-3. The partitions apply disjoint sets of candidate rules. In the end, the output rules from each partition are combined to construct the final result. The partitioning algorithm improves performance by accepting a size constraint and returning partitions that satisfy the constraint. The smaller partitions are more efficiently processed than the entire KB. We begin by introducing the notions of Γ-partition and Γ-size. They allow us to determine the size of a partition; we utilize this information to assign each rule to an appropriate partition.


Figure 4-3. Partitioning algorithm: the KB is partitioned into smaller, possibly overlapping parts running independent instances of the mining algorithm.

Definition 4.4. Denote the predicates in rule r = p0 ← p1, . . . , pl by Γ(r) = {p0, p1, . . . , pl}. We define the Γ-partition with respect to rules Mi = {r1, . . . , rm} to be the set of predicates

Γ(Mi) = Γ(r1) ∪ · · · ∪ Γ(rm).

Given an input knowledge base Γ and partitioned rules M = {M1, ..., Mk}, let Γi = {p(x, y) ∈ Γ | p ∈ Γ(Mi)} be the partition of facts induced by Mi. Then Γi contains all the facts we need to evaluate rules r ∈ Mi. Thus, (Γi, Mi) defines an independent partition of the input KB. Running Algorithm 4-8 on (Γi, Mi) returns the mining results for rules Mi. The size of a partition indicates how long the algorithm runs and can be determined using the predicate histogram H^0 from Section 4.2.3.

Definition 4.5. We define the Γ-size of a rule set Mi with respect to KB Γ to be

σ(Γ, Mi) = |Γi| = Σ_{p ∈ Γ(Mi)} H^0(p),    (4–5)

where H^0 = {(p, |{p(·, ·)}|)} is the predicate histogram.

Example 4.6. Consider the knowledge base Γ and rules table M in Figure 4-4. M is partitioned into two parts, M1 and M2 (initially without the last row, r). The corresponding Γ-partitions, according to Definition 4.4, are

Γ(M1) = {exports, imports, dealsWith, isLocatedIn},
Γ(M2) = {isLocatedIn, hasCapital, wasBornIn, isCitizenOf, worksAt}.

Let H^0 denote the histogram of Γ. We have

σ(Γ, M1) = H^0(exports) + H^0(imports) + H^0(dealsWith) + H^0(isLocatedIn) = 8,
σ(Γ, M2) = H^0(isLocatedIn) + H^0(hasCapital) + H^0(wasBornIn) + H^0(isCitizenOf) + H^0(worksAt) = 8.

Now consider the addition of rule

r = (isLocatedIn(x, y) ← isLocatedIn(x, z), hasCapital(y, z)).

We have:

Γ(M1 ∪ {r}) = Γ(M1) ∪ {hasCapital},
Γ(M2 ∪ {r}) = Γ(M2),

where the second equation holds because the predicates in r, {isLocatedIn, hasCapital} ⊂ Γ(M2). Consequently, σ(Γ, M1 ∪ {r}) = 10 and σ(Γ, M2 ∪ {r}) = 8. Adding r to M2 incurs a smaller increase of size than adding it to M1.

Suppose we have added r to M2, i.e., M2 ← M2 ∪ {r}. To evaluate M, instead of performing a big join between Γ and M, we run two small joins between (Γ1, M1) and (Γ2, M2). In each case, |Γ1| = |Γ2| = 8, 42.86% smaller than Γ. Therefore, each partitioned join is more efficient than the big join.

Figure 4-4. Rule table M, initial partitions M1, M2, and unpartitioned rule r.

(A) Γ (facts p(x, y)): exports(United States, Computer); exports(Canada, Aluminum); imports(United States, Aluminum); imports(United States, Clothing); dealsWith(Canada, United States); isLocatedIn(Washington, D.C., United States); isLocatedIn(Ottawa, Canada); isLocatedIn(Stanford University, Stanford, California); hasCapital(Canada, Ottawa); hasCapital(United States, Washington, D.C.); wasBornIn(Donald Knuth, Milwaukee, Wisconsin); isCitizenOf(Donald Knuth, United States); worksAt(Donald Knuth, Stanford University); hasAcademicAdvisor(Donald Knuth, Marshall Hall, Jr.).

(B) M (rules H(x, y) ← b1(x, z), b2(y, z)):
M1: dealsWith ← isLocatedIn, isLocatedIn; dealsWith ← exports, imports
M2: isCitizenOf ← wasBornIn, hasCapital; worksAt ← wasBornIn, isLocatedIn
r:  isLocatedIn ← isLocatedIn, hasCapital

Given an upper bound s on the Γ-size and an upper bound m on the number of rules for each partition, our goal is to find a partition {M1, ..., Mk} of M that satisfies the following constraints:

(C1) σ(Γ, Mi) ≤ s, 1 ≤ i ≤ k;
(C2) |Mi| ≤ m, 1 ≤ i ≤ k;
(C3) M1 ∪ · · · ∪ Mk = M;
(C4) Mi ∩ Mj = ∅, 1 ≤ i < j ≤ k.

We seek to find a partition {M1, ..., Mk} with as small a k as possible. Without a priori knowledge of the optimal k, we use a recursive binary partitioning scheme in Algorithm 4-3. In each recursive step, the input rule set M is partitioned into two smaller parts, M1 and M2. The algorithm terminates when all partitions satisfy the size constraints (C1) and (C2). The partitions satisfy the completeness and disjointness constraints (C3) and (C4) at each recursive step, as M is partitioned into M1 and M2 by Algorithm 4-4 such that M1 ∪ M2 = M and M1 ∩ M2 = ∅.

Algorithm 4-3: Recursive-Partition(Γ, Π, M, s, m)

1  if σ(Γ, M) ≤ s and |M| ≤ m then
2      Π ← Π ∪ {M};
3      return;
4  (M1, M2) ← Binary-Partition(Γ, M);
5  if M1 = ∅ then
6      Π ← Π ∪ {M2};
7  else if M2 = ∅ then
8      Π ← Π ∪ {M1};
9  else
10     Recursive-Partition(Γ, Π, M1, s, m);
11     Recursive-Partition(Γ, Π, M2, s, m);

Specifically, we first determine whether to partition the input M by checking if it already satisfies the size constraints (C1) and (C2) (Line 1). If it does, we add it to the final set of partitions and return (Line 2); otherwise, we use a binary partitioning algorithm to partition the rules (Line 4) and recursively partition the sub-parts (Lines 10-11). Lines 5-8 handle special cases where the size constraint (C1) cannot be satisfied. This happens when s < H^0(p) for some predicate p.

Algorithm 4-4: Binary-Partition(Γ, M)

1  M1 ← ∅;
2  M2 ← ∅;
3  forall r ∈ M do
4      ∆1 ← σ(Γ, M1 ∪ {r}) − σ(Γ, M1) + p(Γ, M1);
5      ∆2 ← σ(Γ, M2 ∪ {r}) − σ(Γ, M2) + p(Γ, M2);
6      if ∆1 < ∆2 then
7          M1 ← M1 ∪ {r};
8      else
9          M2 ← M2 ∪ {r};
10 return (M1, M2)

The binary partitioning algorithm is described in Algorithm 4-4, using a greedy assignment strategy: each rule is assigned to the partition with the smaller increase in size (Lines 4-9). Note that in Lines 4-5, we add a penalty term p(Γ, M) to penalize the larger partition, preventing it from absorbing all subsequent rules, as duplicate predicates do not increase the partition size. In our experiments, we set p(Γ, M) = σ(Γ, M)/50.

Since Γ-partitions are induced from partitions of inference rules (Definition 4.4) and rules from different partitions may contain duplicate head or body predicates, the Γ-partitions may overlap, as illustrated by the addition of rule r in Example 4.6. To measure this overlap, we introduce the notion of degree of overlap.

Definition 4.7. The degree of overlap (DOV) of a set of rule parts M = {M1, M2, ..., Mk} is defined to be

DOV(M) = Σ_i σ(Γ, Mi) / σ(Γ, ∪_i Γ(Mi)).    (4–6)

In Equation (4–6), the numerator Σ_i σ(Γ, Mi) is the total size of the partitions we make to evaluate the inference rules. The denominator σ(Γ, ∪_i Γ(Mi)) is the size of the KB induced by the rules M. A DOV greater than 1 indicates overlapping partitions.

Example 4.8. Consider the partitioning scheme from Example 4.6. While the initial partitions M1 and M2 are disjoint, adding r to M1 or M2 results in overlapping partitions:

DOV({M1, M2}) = (6 + 6)/12 = 1;
DOV({M1, M2 ∪ {r}}) = (6 + 7)/12 = 1.0833;
DOV({M1 ∪ {r}, M2}) = (8 + 6)/12 = 1.1667.

4.2.3 Rule Pruning

One performance barrier we observe in the candidate rules is that some of them generate prohibitively large intermediate results due to high-degree variables in the joining predicates, as demonstrated by the following example:

hasAcademicAdvisor(x, y) ← diedIn(x, z), wasBornIn(y, z).    (4–7)

In the above rule, we have variable z as the join variable. Meanwhile, in places with large populations, e.g., New York City or the state of California, there are hundreds of thousands of people who were born or died. Computing the confidence score of the rule requires applying the rule body and counting the distinct inferred facts. As a result of joining on the high-degree variable, the computation is inefficient.

Rule (4–7) violates the empirical skewed power-law degree distribution of natural sparse graphs [53]: most entities have relatively few neighbors while a few have many neighbors. The law implies that the neighbors of a high-degree entity are not likely to be connected with one another. Rule (4–7), however, predicts a "hasAcademicAdvisor" relationship between every pair of neighbors of the join variable z. This high-degree join problem is common among candidate rules since they are constructed from the KB schema with no validation against the facts. Thus, rules can accidentally contain irrelevant predicates that coincide on a join variable with a large degree. To detect such rules, we use the following histograms to determine the functional property, i.e., the non-functionality, of inference rules:

• Predicate Histogram H^0 = {(p, |{p(·, ·)}|)};
• Predicate-Subject Histogram H^1 = {(p, x, |{p(x, ·)}|)};
• Predicate-Object Histogram H^2 = {(p, y, |{p(·, y)}|)}.

In functional notation, we write H^0(p) = |{p(·, ·)}|, H^1(p, x) = |{p(x, ·)}|, and H^2(p, y) = |{p(·, y)}|. H^0 is used to compute the size of a KB partition, as explained in Section 4.2.2. H^1 and H^2 determine the sizes of intermediate results. For instance, the size of the join for Rule (4–7) can be computed by

Σ_z H^2(diedIn, z) · H^2(wasBornIn, z).

In Definition 4.9 below, we omit the position descriptors H^1, H^2 and use H(p, z) to denote the histogram entry for predicate p and join variable z in a general rule, the position of z being determined by the join under consideration.

Definition 4.9. For a connected, closed rule r: h ← b1, . . . , bl, we define the non-functionality of r as

NF(r) = max_{bi, bj connected by z} min(H(bi, z), H(bj, z)).    (4–8)

A functional constraint t accepts rules r with NF(r) ≤ t.
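As an illustration, the non-functionality of Rule (4–7) can be read off the predicate-object histogram H^2; the following sketch again assumes the hypothetical facts(pred, sub, obj) table:

-- Sketch: H^2 and the non-functionality of Rule (4-7) per Eq. (4-8).
CREATE TABLE h2 AS
SELECT pred, obj, COUNT(*) AS cnt
FROM facts
GROUP BY pred, obj;

SELECT MAX(LEAST(d.cnt, b.cnt)) AS nf
FROM h2 d
JOIN h2 b ON d.obj = b.obj                  -- join variable z
WHERE d.pred = 'diedIn' AND b.pred = 'wasBornIn';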

According to Definition 4.9, a functional constraint requires that each join, represented by a pair of connected atoms of the rule, have a join variable z with no more than t joined instances, as determined by min(H(bi, z), H(bj, z)) with z ranging over all joined values. We use non-functionality as an empirical indicator of rule incorrectness. Viewing a knowledge graph as a sparse natural graph (e.g., Freebase and YAGO2s have sparsities of 3.11 × 10−8 and 9.82 × 10−7, respectively), we justify our approach by the empirical power-law degree distribution of natural graphs: only a few entities in a knowledge graph have a large degree [53], implying that the neighbors of high-degree entities are unlikely to be inter-connected with one another, contrary to what non-functional joins suggest.

In our experiments, we observe that violations of functional constraints are strong indications of incorrect rules: more than 99% of the non-functional rules are wrong. Removing those erroneous rules improves both performance and rule quality. By varying the constraint t, we show that a reasonable choice lies between 50 and 250; we set t = 100 in the default configuration and experimentally justify this choice in Section 4.3.4.

Example 4.10. As an example, Table 4-2 shows a histogram from the YAGO2 knowledge base containing the numbers of people who were born and who died in New York City, London, and Montreal.

Table 4-2. Histogram for “wasBornIn” and “diedIn.”

Predicate   Location   Count
wasBornIn   NYC        1287
wasBornIn   London     1584
wasBornIn   Montreal   618
diedIn      NYC        737
diedIn      London     951

Using this histogram, we determine the non-functionality of Rule (4–7) to be 951, the joining variable z ranging over NYC and London. The total number of facts inferred by Rule (4–7) is 1287 × 737 + 1584 × 951 = 2,454,903. The result is 734 times larger than the head predicate "hasAcademicAdvisor" (H0(hasAcademicAdvisor) = 3340).
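A small Python sketch reproducing the computation of Example 4.10 under our reading of Definition 4.9 (the helper name nf is ours):

h2 = {
    ("wasBornIn", "NYC"): 1287, ("wasBornIn", "London"): 1584,
    ("wasBornIn", "Montreal"): 618,
    ("diedIn", "NYC"): 737, ("diedIn", "London"): 951,
}

def nf(h, p, q):
    # Non-functionality of a two-atom join of p and q on a shared object:
    # max over join values z of min(H(p, z), H(q, z)).
    zs = ({z for (pred, z) in h if pred == p}
          & {z for (pred, z) in h if pred == q})
    return max(min(h[(p, z)], h[(q, z)]) for z in zs)

print(nf(h2, "diedIn", "wasBornIn"))  # 951: pruned for any t < 951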

4.2.4 Parallel Rule Mining

We design a parallel rule mining algorithm to join the facts and rules tables of each partition. The mining algorithm divides the facts table into groups running parallel in-memory group joins, verifies the inferred facts, and collects rule statistics. Each step runs a parallel operation described in Section 2.3. We introduce the mining algorithm using the following equivalence class of rules:

p(x, y) ← q(x, z), r(y, z). (4–9)

In Section 4.2.4.1, we generalize it to other rule classes. We present Algorithm 4-5 using Spark primitives, but we note that it is a general parallel algorithm consisting of basic parallel operations. In Algorithm 4-5, the rules of form Rule (4–9) are represented as an RDD; each predicate variable p, q, r is assigned relations as the mining algorithm applies the rules. Figure 4-5 illustrates how Algorithm 4-5 transforms the datasets using parallel operations.

Algorithm 4-5: Parallel-Rule-Mining(facts, rules)
Input: facts = {(pred, sub, obj)}, rules = {(ID, head, body1, body2)}
1 Map each fact (pred, sub, obj) ∈ facts to (obj, (pred, sub));
2 GroupByKey obj, yielding a list of {(pred, sub)} pairs for each obj;
3 FlatMap the (obj, {(pred, sub)}) pairs to Group-Join(obj, {(pred, sub)}, rules), using Algorithm 4-6, yielding a list of ((pred, sub, obj), rule.ID) pairs;
4 ReduceByKey (pred, sub, obj), deduplicating the rule.IDs for each (pred, sub, obj) triple;
5 FlatMap the ((pred, sub, obj), {rule.ID}) tuples to Check({rule.ID}), using Algorithm 4-7, yielding a list of (rule.ID, (correct, 1)) pairs;
6 ReduceByKey rule.ID, summing the correct and 1 values;
7 Map each (rule.ID, (sum, count)) to (rule.ID, sum/count) pairs;

In Steps 1 and 2, Algorithm 4-5 groups the input facts by the join variable “obj,” corresponding to the variable z in Rule (4–9). This ensures that the tuples with the same join variable “obj” are in the same group so the rules can be applied to the disjoint groups, as shown by the “Group joins” in Figure 4-5. In addition, we broadcast the rules table to each group to ensure all relevant data are collocated for the joins. Consequently, the groups run disjoint in-memory group-joins, Algorithm 4-6, and are executed in parallel.
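To make the data flow concrete, the following is a minimal sketch of Algorithm 4-5 against Spark's Python API. The name parallel_rule_mining is ours, and the group_join and check helpers are sketched after Algorithms 4-6 and 4-7 below; we deduplicate rule IDs with a set after grouping, a simplification of Step 4:

from pyspark import SparkContext

def parallel_rule_mining(sc, facts, rules):
    # facts: list of (pred, sub, obj); rules: list of (id, head, body1, body2)
    rules_b = sc.broadcast(rules)                       # ship rules to every group
    scores = (
        sc.parallelize(facts)
        .map(lambda f: (f[2], (f[0], f[1])))            # Step 1: key by join var obj
        .groupByKey()                                   # Step 2: disjoint groups
        .flatMap(lambda g: group_join(g[0], list(g[1]), # Step 3: Algorithm 4-6
                                      rules_b.value))
        .groupByKey()                                   # Step 4: rule IDs per fact
        .flatMap(lambda fr: check(set(fr[1])))          # Step 5: Algorithm 4-7
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # Step 6
        .mapValues(lambda s: s[0] / s[1])               # Step 7: confidence
    )
    return scores.collect()

# usage (illustrative): sc = SparkContext("local[*]", "op-sketch")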

The Group-Join algorithm applies the rules in each group. It builds a hash table of the input facts with “pred” as the key (Line 3). For each rule of form Rule (4–9) with

body predicates q and r, Group-Join searches for facts with predicates q and r in a nested loop in Lines 4-7. Since each group has the join variable “obj” matched by the previous “GroupByKey” operation, the matching process applies the body predicates q and r to relevant tuples. For each match, Group-Join generates an inferred fact p(x, y) with p determined by the head of the rule, x by the subject of the first match (sub1), and y by the subject of the second match (sub2), according to Rule (4–9).


Figure 4-5. Parallel rule mining: KB divided into groups by join variables, each group running Group-Join to apply inference rules.

In the Group-Join algorithm, we output both the input facts (Line 2) and the inferred facts (Line 7). Each (pred, sub, obj) triple is output as a key, with the value being the positive ID of the rule if it is inferred by that rule, or 0 if it is from the input knowledge base. These IDs are used to verify the inference results: if a fact is associated with an ID of 0, it exists in the input knowledge base; otherwise, it is inferred by the inference rules specified by the ID list.

Algorithm 4-6: Group-Join(obj, ps = {(pred, sub)}, rules)
1 forall (pred, sub) ∈ ps do
2   emit((pred, sub, obj), 0);
3 preds ← ps.groupBy(pred);
4 forall r ∈ rules do
5   forall sub1 ∈ preds.get(r.body1) do
6     forall sub2 ∈ preds.get(r.body2) do
7       emit((r.head, sub1, sub2), r.ID);

In Step 4, Algorithm 4-5 groups the output facts, each group aggregating the list of rules inferring the fact, identified by their IDs, as shown by the "Group by facts" transformation in Figure 4-5. The aggregated lists are used by Algorithm 4-7, Check, to determine whether each rule infers a correct fact by searching for 0 in the list. The Check algorithm outputs an (ID, (c, 1)) tuple for each rule, where c indicates the correctness of the inferred fact. The

Check algorithm transforms the input lists into tuples of rule statistics, as illustrated by “Check” in Figure 4-5. The component-wise sum of the (c, 1) tuples for each rule is the number of correct and total facts, respectively, inferred by the rule. Finally, steps 6 and 7 group by the rules, sum up the correct and total counts, and compute the confidence of each rule.

Algorithm 4-7: Check(rs = {rule.ID})
1 c ← rs.contains(0);
2 forall rule.ID ∈ rs do
3   emit(rule.ID, (c, 1));
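The per-group logic plugs into the driver sketched above and can be written in plain Python as follows (names ours); group_join mirrors Algorithm 4-6 and check mirrors Algorithm 4-7, except that our check skips the placeholder ID 0 when emitting statistics:

from collections import defaultdict

def group_join(obj, ps, rules):
    # Apply all rules of form p(x, y) <- q(x, z), r(y, z) within one group
    # sharing the join value obj.
    for pred, sub in ps:                       # emit input facts with ID 0
        yield ((pred, sub, obj), 0)
    preds = defaultdict(list)
    for pred, sub in ps:                       # hash table keyed by pred
        preds[pred].append(sub)
    for rule_id, head, body1, body2 in rules:  # nested-loop group join
        for sub1 in preds.get(body1, []):
            for sub2 in preds.get(body2, []):
                yield ((head, sub1, sub2), rule_id)

def check(rule_ids):
    # Emit a (correct, total) pair for every rule inferring this fact; the
    # fact is correct iff it also occurs in the input KB (ID 0 present).
    c = 1 if 0 in rule_ids else 0
    for rid in rule_ids:
        if rid != 0:
            yield (rid, (c, 1))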

The correctness of Algorithm 4-5 follows from the fact that the entire set of rules is broadcast to each group and that the groups are disjoint from each other (recall that they are grouped by the key "obj"). Therefore, Step 3 properly applies the rules to the facts. In Step 5, Algorithm 4-7 generates individual correct (0 or 1) and total (1) counts for the rules inferring each fact. Since each fact carries the "0" flag to determine its correctness, aggregating the results from all facts generates the final correct and total counts of each rule.

4.2.4.1 General rules

To generalize Algorithm 4-5 to other rule classes, we recall from Section 4.1 that the Horn clauses are assumed to be connected and closed. Thus, for a general rule with rule ID ṙ:

h(x, y) ← b1, b2, . . . , bk,

we can arrange the body atoms so that each bi is connected to bi+1, i = 1, . . . , k − 1, and bk is connected to h(x, y), by a shared variable zi. The general rule mining algorithm, Algorithm 4-8, allows zi to be vectors and to contain repeated variables.

Algorithm 4-8: General-Rule-Mining(facts, rules)
Input: facts = {(p, x, y)}, rules = {(ṙ, h, b1, . . . , bk)}
1 j1 ← facts;
2 forall pairs (bi, bi+1) with shared variable zi do
3   ji ← ji.GroupByKey(zi);
4   fi+1 ← facts.GroupByKey(zi);
5   if i + 1 < k then
6     ji+1 ← {(zi+1, (ṙ, xi+1))} = Group-Join(ji, fi+1, zi, rules);
7   else
8     jk ← {((h, x, y), ṙ)} = Group-Join-Last(ji, fk, zi, rules);
9 Process join result jk, as in Steps 4-7 of Algorithm 4-5;

Algorithm 4-8 joins two body atoms at a time. ji denotes the result of joining b1, . . . , bi, and is used as the operand for joining the next rule body, fi+1, in each iteration. The Group-Join and Group-Join-Last methods in Lines 6 and 8 implement the rule semantics. Group-Join performs the join and outputs tuples keyed by the next shared variable zi+1, along with the rule ID ṙ and any variables referred to by the head or subsequent body atoms. Group-Join-Last completes the join and infers facts from the rules, generating a list of (fact, rule ID) pairs, as in Algorithm 4-6, to be further processed to evaluate the confidence of each rule, as in Steps 4-7 of Algorithm 4-5.

As a remark, we note that Algorithm 4-8 requires the input rules to be of the same form. Thus, if there are N equivalence classes (defined in Section 3.2) of rules, we need N rule tables and N calls of Algorithm 4-8 to complete the mining task. On the other hand, each run of Algorithm 4-8 is highly efficient and optimized, as we apply the rules in batches and in parallel. Thus, our approach trades off generality for efficiency.

4.2.4.2 General confidence scores

In addition to the standard support and confidence scores, a number of improved metrics have been proposed, including the PCA confidence [22], head coverage [22], statistical relevance [24], etc. While we present our framework using the support and confidence scores, our approach generalizes to the other metrics, since all of them involve (1) applying the rules and (2) counting the results. These operations are defined by the Group-Join, Group-Join-Last, and Check algorithms; generalizing requires only redefining them to implement the semantics of the other metrics. In this section, we show how to design mining algorithms for other scores using the PCA confidence as an example. The PCA confidence [5, 22] of a rule is the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts H(x, y) such that there exists y′ with H(x, y′) ∈ Γ:

PCA conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ}|. (4–10)

In Equation (4–10), H(x, y) is the inferred fact; H(x, y′) is a fact with the same head H and subject x. The condition ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ states that the KB knows some value y′ for the head-subject pair (H, x) inferred by B(x, y). An inferred fact H(x, y) satisfying the condition has a known truth value: it is true if H(x, y) ∈ Γ and false if H(x, y) ∉ Γ. Only facts with known truth values contribute to the total count.

When checking an inferred fact p(x, y), we collect tuples of the form p(x, ·) to determine whether there exists y′ such that p(x, y′) ∈ Γ. We modify the Group-Join-Last algorithm to group facts p(x, y) by the predicate and the subject (p, x). The Check algorithm verifies the groups accordingly. These algorithms are described in Algorithms 4-9 and 4-10.

Algorithm 4-9: PCA-Group-Join-Last(ji, fk, zi, rules)
1 forall p(x, y) ∈ fact(fk, zi) do
2   emit((p, x), (y, 0));
3 g1 ← ji.groupBy(ṙ);
4 g2 ← fk.groupBy(p);
5 forall r = (ṙ, h, b1, . . . , bk) ∈ rules do
6   forall x ∈ g1.get(ṙ) do
7     forall y ∈ g2.get(bk) do
8       emit((h, x), (y, ṙ));

The major difference between Algorithms 4-9 and 4-6 lies in Lines 2 and 8: instead of emitting ((p, x, y), ṙ) pairs for each base and inferred fact, Algorithm 4-9 emits ((p, x), (y, ṙ)) pairs and groups by (p, x) in subsequent steps. A special rule ID "0" in the list of rules (Lines 1-2) informs PCA-Check of the existence of the (p, x) pair in the input knowledge base Γ.

Algorithm 4-10: PCA-Check(rs = {(y, ṙ)})
1 S ← {y | (y, 0) ∈ rs};
2 if S.empty() then
3   return;
4 forall (y, ṙ) ∈ rs do
5   c ← (y ∈ S);
6   emit(ṙ, (c, 1));

Algorithm 4-10 implements the PCA semantics. It starts by checking whether p(x, y′) ∈ Γ for any y′ by searching for ID "0" in the list of rules (Lines 1-3). If so, each inferred fact is labeled as correct (c = 1) or incorrect (c = 0) according to whether it appears in the input knowledge base (Lines 4-6). If an inferred fact is correct, then S must be non-empty and contain at least the corresponding y representing the fact; Line 5 then evaluates c to 1. Thus, the sum of c over the facts of each rule computes its support. The sum of the "1" components computes the number of facts inferred by the rule that are known to be either correct or incorrect. Hence, aggregating the counts of each rule as in Steps 6-7 of Algorithm 4-5 evaluates the PCA confidence of the rule.

The generalization is based on the observation that scoring functions apply rules and count the results. By overriding the definitions of Group-Join and Check, we generalize Algorithm 4-5 to other metrics. This pattern is manifest in other scoring functions: the head coverage [22], for example, applies rules and counts the ratio of correct facts in each predicate. Our approach thus provides a general framework for assessing inference rules based on counting.
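A compact Python sketch of the PCA-Check logic (Algorithm 4-10), with our naming: rs is the list of (y, rule ID) pairs grouped under one (predicate, subject) key, and pairs with rule ID 0 record the objects y already known for that key in the input KB; like Algorithm 4-10, it emits nothing when the truth value is unknown:

def pca_check(rs):
    known = {y for (y, rid) in rs if rid == 0}
    if not known:                 # no y' with p(x, y') in the KB:
        return                    # the inferred facts have unknown truth value
    for y, rid in rs:
        if rid != 0:
            yield (rid, (1 if y in known else 0, 1))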

4.2.5 Analysis

We start by analyzing the joins of the facts tables S, T and the rules table M under the functional constraint t. We then show how partitioning into small tables under the size constraint s affects the input tables and improves the overall complexity.

4.2.5.1 Parallel mining

To analyze the parallel mining Algorithm 4-5, we study the size of its intermediate result, which is dominated by joining the rule body. Suppose we have facts tables S, T, a rules table M, and a functional constraint t. We denote the histograms by HS and HT, the position of the join variable being implied by the context of the rule. We use M[·] and M̂[·] to denote the projection and distinct projection of table M onto the specified columns, respectively. The join size is then given by:

Σ_z Σ_{(p,q)∈M[b1,b2]} HS(p, z) · HT(q, z)
≤ t Σ_z Σ_{(p,q)∈M[b1,b2]} max{HS(p, z), HT(q, z)}
≤ t Σ_z Σ_{(p,q)∈M[b1,b2]} (HS(p, z) + HT(q, z))
= t (Σ_z Σ_{(p,q)∈M[b1,b2]} HS(p, z) + Σ_z Σ_{(p,q)∈M[b1,b2]} HT(q, z))
≤ t (|M| Σ_z Σ_{p∈M̂[b1]} HS(p, z) + |M| Σ_z Σ_{q∈M̂[b2]} HT(q, z))
≤ t (|M||S| + |M||T|)
= t|M|(|S| + |T|), (4–11)

where the first inequality follows from Definition 4.9. As a result, the time complexity of Algorithm 4-5 is dominated by O(t|M|(|S| + |T|)). In the case of a self-join, we have S = T, and the complexity reduces to O(t|M||S|). More generally, we show by induction that the time complexity is O(t^{l−1}|M||S|) for rules with l body predicates:

h(x, y) ← b1, . . . , bl.

Suppose the hypothesis holds for the first 1, 2, . . . , (l − 2)th joins, i.e., T(t, M, S, l − 2) ≤ c·t^{l−2}|M||S| for some constant c ≥ 1. We show that T(t, M, S, l − 1) ≤ c·t^{l−1}|M||S| for the (l − 1)th join. In the following analysis, we use HJ(r, z) to denote the number of intermediate joined tuples derived by rule r with shared variable z for the next join. Since HJ(r, z) is the histogram after l − 2 joins, we have HJ(r, z) ≤ t^{l−1} by the functional constraint. This histogram is not actually built during rule mining, but is used only for the purpose of complexity analysis. Thus, we have

Σ_z Σ_{(r,p)} HJ(r, z) · HS(p, z)
≤ Σ_z Σ_{(r,p)} max{t^{l−1} HS(p, z), t·HJ(r, z)}
= t^{l−1} Σ_z Σ_{(r,p)∈M1} HS(p, z) + t Σ_z Σ_{(r,p)∈M2} HJ(r, z)   (*)
≤ t^{l−1} |M1| Σ_z Σ_{p∈M̂1[p]} HS(p, z) + t Σ_z Σ_{r∈M2} HJ(r, z)   (**)
≤ t^{l−1}|M1||S| + t · c·t^{l−2}|M2||S|   (***)
≤ c·t^{l−1}(|M1| + |M2|)|S|
= c·t^{l−1}|M||S|.   (4–12)

In Equality (*), M1 contains the rules for which t^{l−1} HS(p, z) > t·HJ(r, z), and M2 contains the other rules, so M1 ∪ M2 = M and M1 ∩ M2 = ∅. In Inequality (**), we use M̂1[p] to denote the distinct predicates of M1 projected onto the current body atom p being considered; hence

Σ_z Σ_{(r,p)∈M1} HS(p, z) ≤ |M1| Σ_z Σ_{p∈M̂1[p]} HS(p, z).

We have also used the fact that

Σ_z Σ_{(r,p)∈M2} HJ(r, z) = Σ_z Σ_{r∈M2} HJ(r, z),

since the rule r uniquely determines the predicate p. Inequality (***) applies the hypothesis

Σ_z Σ_{r∈M2} HJ(r, z) ≤ c·t^{l−2}|M2||S|.

Therefore, the general time complexity for rules of body length l is O(t^{l−1}|M||S|). In practice, as we show in Section 4.3.4, a reasonable t lies between 50 and 250. Compared with a direct join of the facts and rules, O(|M||S|^l), we achieve a notable improvement with pruning.

4.2.5.2 Partitioning

The partitioning Algorithm 4-3 makes a best effort to satisfy the size requirement s. If a predicate p contains a large number of facts (H0(p) > s), Algorithm 4-3 puts the entire predicate in one partition. Assuming Algorithm 4-3 results in N partitions {M1, . . . , MN}, with the size of the largest partition being sm, the overall time complexity of evaluating the partitioned facts and rules tables, based on (4–12), is:

Σ_{i=1}^{N} O(t^{l−1} sm |Mi|) = O(t^{l−1} sm Σ_{i=1}^{N} |Mi|) = O(t^{l−1} sm |M|), (4–13)

where t is the functional constraint and l is the length of the rule body, as in Section 4.2.3. The time complexity is bounded with respect to the size of the largest partition (sm) instead of the size of the input knowledge base (|Γ|). Thus, partitioning reduces the time complexity from O(t^{l−1}|Γ||M|) to O(t^{l−1} sm |M|), allowing us to control the complexity by tuning the size constraints for very large knowledge bases.

We conclude this section by remarking on the difference between partitioning in Algorithm 4-3 and parallelization in Algorithm 4-5. Algorithm 4-3 breaks the input knowledge base into smaller independent partitions so that each partition runs its own instance of Algorithm 4-5. Algorithm 4-5 divides the input knowledge base into correlated groups running sub-procedures of a single mining instance. Algorithm 4-5 is used by Algorithm 4-3 as the parallel mining algorithm in each partition. The two-level partitioning-parallelization scheme is elucidated in Figure 4-3. These techniques combined scale the rule mining algorithm to Freebase.

4.3 Experiments

We validate our approaches by mining inference rules from YAGO and Freebase. Our work contributes the first rule set for Freebase: 36,625 first-order inference rules. In this section, we present our results, compare with the state-of-the-art KB rule mining algorithm, AMIE [53], and analyze the individual techniques from Sections 4.2 and 4.2.2. We begin by describing the datasets and the experiment setup.

YAGO. YAGO is a knowledge base derived from Wikipedia, WordNet, and GeoNames. Its newest version, YAGO2s, has more than 10M entities and 120M facts, including the schema, taxonomy, core facts, etc. We use the schema for rule construction and the core 4.48M binary facts for rule evaluation.

Freebase. Freebase is a community-curated knowledge base of well-known people, places, and things, containing 112M entities and 2.68B facts as of this writing. We preprocess

the dataset by removing the multi-language support and use the remaining 388M facts. The dataset statistics are summarized in Table 4-3A.

Table 4-3. OP experiment setup. (A) Dataset statistics. (B) Default parameters.

(A)
KB         Size
YAGO2      # Entities = 834,554; # Facts = 948,047
YAGO2s     # Entities = 2,137,468; # Facts = 4,484,907
Freebase   # Entities = 111,781,246; # Facts = 388,474,630

(B)
Max length: 3
Max Γ-size: 3M (YAGO), 10M (Freebase)
Max # of rules: 1000
Functional constraint: 100
Min support: 0
Min confidence: 0.0

Experiment setup. We conduct all experiments on a 64-core machine with AMD Opteron processors at 1.4GHz, 512GB RAM, and 3.1TB disk space. The OP and AMIE algorithms are implemented in Spark and Java/SQL, respectively, running on Spark 1.3.0, Java 1.8, and PostgreSQL 9.2.3.

Default parameters. Unless otherwise specified or the parameter in question is under evaluation, we use the default parameters in Table 4-3B. We determine the parameters by trying multiple parameter combinations and comparing the performance and the resulting rules. For the Freebase experiments in Sections 4.3.2 to 4.3.4, we report the result of one class of length-3 rules.

Rule set precision. We evaluate a rule set by assessing its most confident rules, i.e., those with a minimum confidence of 0.6 and supporting at least 2 facts. Under this constraint, we define the precision of a rule set as the percentage of rules satisfying the above threshold that we consider correct. Each rule is rated by two independent human judges. In case of disagreement, the judges conduct a detailed discussion until a final decision is made. We sample at most 300 rules from each rule set for human inspection.

Confidence   Rule
(1) 0.81   film/film/sequel(x, z), film/film/country(z, y) → film/film/country(x, y)
(2) 0.44   film/film/country(x, z), location/country/official language(z, y) → film/film/language(x, y)
(3) 1.0    book/book/first edition(x, y) → book/book/editions(x, y)
(4) 1.0    book/book/first edition(x, u), book/book edition/book(u, v), book/book/first edition(v, y) → book/book/editions(x, y)
(5) 0.41   film/film/sequel(x, u), film/film/country(u, v), location/country/official language(v, y) → film/film/language(x, y)
(6) 0.89   music/music video/music video song(x, u), music/composition/recorded as album(u, v), music/album/artist(v, y) → music/music video/artist(x, y)

Figure 4-6. Example Freebase rules.

4.3.1 Overall Result

To evaluate the performance of the OP algorithm and compare with the state-of-the-art, we run the OP and AMIE+ algorithms on Freebase and YAGO. As a result, OP mines 36,625 rules in 33.22 hours from Freebase, contributing the largest first-order rule repository created from public knowledge bases. We compare the detailed performance metrics, including the number and precision of mined rules and the runtime for each knowledge base, in Table 4-4.

Table 4-4. Overall mining result.

Dataset    Algorithm   # Rules   Precision   Runtime
YAGO2      OP          218       0.35        3.59 min
YAGO2      AMIE+       1090      0.46        4.56 min
YAGO2s     OP          312       0.35        19.40 min
YAGO2s     AMIE+       278+      N/A         4.89 h
Freebase   OP          36,625    0.60        33.22 h
Freebase   AMIE+       0+        N/A         5+ d

In terms of efficiency and scalability, OP outperforms AMIE+ in all the experiments we run. For Freebase, AMIE+ takes more than 5 days to generate a single rule (AMIE+ outputs an inference rule once it has determined its quality), whereas OP only takes 1.39 days.


Figure 4-7. OP overall result on YAGO2s and Freebase. (A)(B) YAGO2s performance. (C)(D) Freebase performance. (E) Quality of Freebase length 4 rules. (F) Effect of parallelism.

For YAGO2s, OP is more than 15 times faster than AMIE+. For YAGO2, due to its small size, partitioning and parallelization have limited advantage, so OP is only 0.97 minutes faster.

The quantity of mined rules is large: 36,625 first-order rules from Freebase, spanning a variety of topics: film, book, music, computer, etc., as shown in Figures 4-6 and 4-10(3). OP mines fewer rules from YAGO and YAGO2s because their sizes are much smaller and their schemas are incomplete. Possible domain and range values are missing from overloaded predicates, so the rule construction algorithm generates only a subset of all possible rules from the available schema. Using a more accurate schema, e.g., the Freebase schema, improves recall.

In terms of precision, Freebase rules achieve 0.60, outperforming YAGO rules by more than 0.1. The precision benefits from Freebase's cleaner data and schema. To illustrate, consider Rule (1) in Figure 4-6. In this rule, we have the predicates "film/film/sequel" and "film/film/country." These predicates impose very specific constraints on the data: "sequel" means the sequel of a film, and "country" refers to the producing country of a film. Thus, the Freebase predicates contain fine-grained and precise data instances. On the other hand, because 1) YAGO2s has fewer predicates and 2) YAGO2s predicates are less well-defined, YAGO2s generates fewer rules with lower quality. Consider the "create" predicate from YAGO2s: the domain is possibly writer, musician, filmmaker, author, etc., and the range can be book, music, film, novel, etc. Thus, "create" can be combined with any matching predicates to form candidate rules, leading to spurious results. The rule "isMarriedTo(x, y) ← created(x, z), created(y, z)" illustrates this situation. Other predicates, like "playsFor," "owns," and "influences," are similarly misused.

Figures 4-7A-D report OP's performance for one type of rule of each length from 2 to 5. We mine 1,006 and 83,163 rules for YAGO2s (lengths 4 and 5) and Freebase (length 4) in 8.62 and 82.77 hours, respectively. More than 97% of the time is spent in the parallel joins. The schema graph and histograms are small, making construction and pruning efficient, taking only 13.29 min for YAGO2s and 2.29 min for Freebase, as reported in Figures 4-7A and C.

87 The construction process for YAGO2s is slower than for Freebase since its predicates are heavily overloaded, as we discuss above, resulting in expensive joins.

Table 4-5. Schema graphs and histograms.

KB         Schema   Histogram
YAGO2s     284      4187
Freebase   67,415   134,889

To keep the histograms small, we only store entries with more than t counts, where t is the functional constraint. Their sizes are shown in Table 4-5. Building the schema and histograms takes 0.58 min for YAGO2s and 7.42 min for Freebase. They are stored in tables shared among subsequent queries. Overall, rule construction and pruning are efficient.

Analyzing the rules, we observe that 90.3% of them reduce to length-2 and length-3 rules, which we classify as trivial extensions and composite rules, as shown in Figure 4-7E. We call a rule a trivial extension of another rule if it can be reduced to the other rule by applying and removing valid rules from its body. In Figure 4-6, Rule (4) is a trivial extension of Rule (3), since the rule "book/book edition/book(u, x) ← book/book/first edition(x, u)" infers that v = x in (4); by replacing v with x and removing the applied rule, it reduces to (3). We call a rule composite if it can be rewritten by chaining shorter rules. For instance, Rule (5) in Figure 4-6 is a composite rule of (1) and (2). Rule (6) gives an example of a correct and irreducible length-4 rule.

The trivial extensions and composite rules provide little knowledge beyond the length-2 and length-3 rules. Thus, their distribution in Figure 4-7E implies limited benefits from mining longer rules. We remove those rules and evaluate the remaining rules as we do with length-2 and length-3 rules. As a result, the precision is much lower than that of the length-2 and length-3 rules: 0.04 for YAGO2s and 0.03 for Freebase, as reported in Figures 4-7A-E. These results suggest the primary and foundational importance of shorter rules in a knowledge base and motivate us to limit the maximum rule length to 3 in the default setting.

In summary, the overall results justify the benefits of the OP algorithm for mining web-scale knowledge bases. In the remainder of this section, we examine the individual techniques of parallelism, partitioning, and rule pruning in greater detail and show how they improve the performance and quality of the rule mining task.

4.3.2 Effect of Parallelism

We evaluate the effect of parallelism by comparing the parallel mining algorithm with a SQL implementation on PostgreSQL. We vary the number of cores for parallel mining from 1 to 64 and report in Figure 4-7F the relative speedup compared to running Spark on one core. As a result, the parallel mining algorithm achieves a speedup of 5.70 and 3.34 on YAGO2s and Freebase, respectively. For YAGO2s, Spark with one core is slower than SQL due to job setup and initialization, as shown by the circle, but with 64 cores Spark is 3.14 times faster than the SQL implementation. For Freebase, the SQL queries run for more than 5 days on PostgreSQL and on an in-house parallel database system, Datapath [54]. The speedup of the parallel mining algorithm results from two factors: 1) the SQL query performs one large join, while Algorithm 4-5 runs smaller joins in parallel; 2) the shuffling step in Spark is more efficient than the deduplication operation in PostgreSQL given a large output from the previous joins. These results attest to the overall advantage of parallelizing the rule mining algorithm. Nonetheless, we see that the parallelization does not make full use of the 64 available processors, because the output sizes and performance of the Group-Joins vary greatly among groups, depending on the data distribution, and the overall runtime is dominated by the slowest joins among the groups. Moreover, the efficiency of the shuffling step is restricted by data dependencies among parallel workers.

4.3.3 Effect of Partitioning

Partitioning is a key step in scaling up the mining algorithm. By setting a maximum Γ-size s and number of rules m, the partitioning algorithm breaks the input knowledge base into parts no larger than the specified size. Our experiments show the OP algorithm


completes the Freebase mining task in 1.39 days with partitioning, a task that otherwise runs for more than 5 days without success. The result of Freebase partitioning is illustrated in Figure 4-8: in Figure 4-8A, we set s = 20M and m = 2K; in Figure 4-8B, we set s = 200M and m = 10K. In the former case, we have 65 partitions, all running faster than the partitions from the latter case, with the fastest partition finishing in 14.18 seconds and the slowest in 1.17 hours. In the latter case, we have 5 large partitions, the fastest taking 4.58 hours and the slowest taking 1.27 days.

Figure 4-8. Sizes and runtime of Freebase partitions. (A) s = 20M, m = 2K. (B) s = 200M, m = 10K.

The effect of choosing different partition sizes is shown in Figures 4-9A-C for Freebase and Figure 4-9D for YAGO2s. In the Freebase experiments, the effect of partitioning is substantial: as we vary s from 200M to 5M and m from 10K to 1K, the total runtime decreases from 2.55 days to 5.06 hours. The reason for this speedup is that the partitioning algorithm splits the input knowledge base into smaller ones that are more efficiently joined, and the overhead of the overlap is less significant than the benefit of joining smaller tables. This benefit is further verified by the decline in the runtime of the largest partitions from 1.27 days to 38.14 minutes as we lower the size constraints, as shown in Figure 4-9B, indicating that the partitions are more efficiently joined because of their smaller sizes. Consequently, the overall runtime drops significantly despite partition overlaps.

In Figure 4-9C, we show that the DOV increases from 1.14 to 1.35 as we create smaller partitions. The increasing DOV means we spend more time partitioning Freebase: from 2.45

minutes to 61.43 minutes. Comparing with Figure 4-9A, we see that the reduction from joining smaller partitions has a greater impact on the total runtime. On the other hand, if s and m become too small, the overhead of overlapping partitions begins to dominate. The overlapping effect is shown in Figure 4-9D as we partition the 4.48M-fact YAGO2s into smaller parts: while we improve the runtime of the slowest partition from 5.80 minutes to 1.79 minutes, the total runtime rises to 29.60 minutes after hitting the optimum of 19.40 minutes at s = 3M and m = 500. The drop in performance is caused by the growing number of overlapping partitions. The extreme case of applying one rule at a time, taken by state-of-the-art approaches, is equivalent to having one rule in each partition. Given a large search space of candidate rules, it implies a large number of queries, hence too much overlapping overhead for it to be efficient.

4.3.4 Effect of Rule Pruning

The functional constraint t affects the mining algorithm in terms of both performance and quality. To evaluate the accuracy, we define the pruning precision of the pruned rules as the percentage of those rules that we consider erroneous. Thus, a high pruning precision indicates that erroneous rules are pruned as desired and justifies the proposed approach. As we vary the functional constraint t and prune violating rules, we report the runtime, number, and pruning precision of the pruned rules in Figures 4-9E-F. We make two observations from Figures 4-9E-F: (1) When t ≥ 200, the pruning precision reaches its maximum value of 1.0: all pruned rules are erroneous. However, the runtime grows from 9.55 hours to 14.27 hours, indicating wasted computation in evaluating wrong rules that should otherwise be eliminated by setting a smaller t. (2) On the other hand, decreasing t from 50 to 2 causes the pruning precision to drop sharply from 0.995 to 0.82 while improving runtime by only 1.54 hours, from 7.97 to 6.43 hours. From the above observations, we see that the rule pruning process improves both performance and quality provided we choose a proper t constraint. Based on Figure 4-9F, a value between 50 and 250 is reasonable. In our default setting, we set t = 100.


Figure 4-9. Effect of partitioning and pruning. (A)-(D) Runtime by varying partition sizes. (E)-(F) Runtime and accuracy by varying functional constraints.

With t = 100, we detect 101 and 2352 non-functional rules from YAGO2s and Freebase, respectively. For Freebase, more than 99% of these rules are correctly pruned. Rules (1) and (2) from Freebase in Figure 4-10 illustrate the common reason why functional constraints are violated: in Freebase and other knowledge bases, we have many-one, many-few, and many-many predicates. The "location/location/containedBy" predicate in Rule (1), for example, is a many-few predicate. The rule construction algorithm, based on the KB schema, is unaware of the functionality properties of the predicates. When the "one" or "few" variable happens to be the join variable, the rule violates the functional constraint.

Rule
(1) location/location/containedby(x, z), location/location/contains(z, y) → location/location/contains major portion of(x, y)
(2) engineering/engine category/engines(z, x), engineering/engine category/engines(z, y) → engineering/engine/variants(x, y)
(3) computer/computer processor/variants(x, z), computer/computer processor/variants(y, z) → computer/computer processor/variants(x, y)

Figure 4-10. Example rules violating functional constraints.

Rule (3) is an incorrectly pruned rule due to a low functional constraint of t = 5. The "computer/computer processor/variants" predicate defines an equivalence relation: variants of a computer processor are variants of each other. Given the sparsity of natural graphs [53], we reduce such pruning errors by raising the functional constraint t.

4.4 Summary

In this chapter, we address the scalable first-order rule mining problem. We present the Ontological Pathfinding algorithm to mine first-order inference rules from web-scale knowledge bases. We achieve the Freebase scale via a series of parallelization and optimization techniques: a relational knowledge base model that applies inference rules in batches, a rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm that divides the mining task into smaller independent sub-tasks, and a rule pruning strategy to detect incorrect and resource-consuming rules. Combining these techniques, we mine the first rule set for Freebase, the largest public knowledge base with 388 million facts and 112 million entities, in 34 hours. No existing system achieves this scale. We publish our code and data repositories online.

CHAPTER 5
SCALABLE KNOWLEDGE EXPANSION AND INFERENCE

Using the rules from the mining algorithm, we design an efficient inference algorithm for knowledge expansion. We infer 0.9 billion new facts from Freebase in 17.19 hours, scaling up the current state-of-the-art [13] inference engine to a 30 times larger knowledge base. Benefiting from cleaner input rules and from the parallelization and partitioning techniques, inference over Freebase is 48% faster than mining the rules. We evaluate the inferred facts by cross validation and compare with the evaluation from AMIE+ [22]. We show that we derive 60% new facts with an accuracy approaching 1.0, a much higher precision and recall than AMIE+. Moreover, the cross validation methodology is more feasible and general than the semi-automatic evaluation used by AMIE+. We extend our previous contributions to scale up first-order inference and propose the cross validation method to evaluate the inference result:

• Based on the optimization techniques for rule mining, we adopt the relational knowledge model from [13] and extend the inference algorithm by parallelization and partitioning. We describe the extended inference algorithm in Section 5.1.

• In our experiments with inference, we derive 927M new facts from Freebase in 17.19 hours. Using cross validation, we show that the top 60% of facts have an accuracy approaching 1.0. We achieve a better quality than AMIE+ [22] and our previous result [13] over Reverb-Sherlock [8, 24]. We describe the extended experiments in Section 5.3.

The inference algorithm is described in Algorithm 5-1. It runs for N rounds. In each round, Algorithm 5-1 partitions the knowledge base and runs the parallel inference algorithm on each partition. The partitioning algorithm is the same as in Algorithm 4-1. At the end of each inference round, the inferred facts are merged into the knowledge base. We describe the parallel inference algorithm in Section 5.1 and the partitioning algorithm in Section 4.2.2.

5.1 Parallel Inference

Assuming the input knowledge base is represented as a table of {(p, x, y)} tuples and the rules as tables of {(ṙ, h, b)} tuples, we express the inference algorithm as a sequence of parallel operations, as we do in the rule mining algorithm.

Algorithm 5-1: Infer(Γ, M, s, m, N)
1 F ← ∅;
2 for n ← 1 to N do
3   {(Γi, Mi)} ← Partition(Γ ∪ F, ∅, M, s, m);
4   forall (Γi, Mi) do
5     Pi ← Parallel-Inference(Γi, Mi);
6   F ← F ∪ (∪i Pi);
7 return F;
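A minimal Python sketch of this driver loop, assuming hypothetical helpers partition (the partitioning of Section 4.2.2) and parallel_inference (Algorithm 5-2); the names are ours:

def infer(kb, rules, s, m, n_rounds):
    new_facts = set()
    for _ in range(n_rounds):
        # re-partition the KB together with the facts inferred so far
        parts = partition(kb | new_facts, rules, s, m)   # [(facts_i, rules_i)]
        for facts_i, rules_i in parts:
            new_facts |= parallel_inference(facts_i, rules_i)
    return new_facts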

Similar to Section 4.2.4.1, we assume that the rules are connected and closed, so they can be written as

h(x, y) ← b1, b2, . . . , bk,

where each bi is connected to bi+1, i = 1, . . . , k − 1, and bk is connected to h(x, y), by a shared variable zi. Under this assumption, Algorithm 5-2 applies the inference rules and derives implicit facts from the knowledge base.

Algorithm 5-2: Parallel-Inference(facts, rules)
Input: facts = {(p, x, y)}, rules = {(ṙ, h, b1, . . . , bk)}
1 j1 ← facts;
2 forall pairs (bi, bi+1) with shared variable zi do
3   ji ← ji.GroupByKey(zi);
4   fi+1 ← facts.GroupByKey(zi);
5   if i + 1 < k then
6     ji+1 ← {(zi+1, (ṙ, xi+1))} = Group-Join(ji, fi+1, rules);
7   else
8     jk ← {((h, x, y), ṙ)} = Group-Join-Last(ji, fk, rules);
9 return jk.GroupByKey((h, x, y)).filter(0 ∉ {ṙ});

The inputs of Algorithm 5-2 are the facts and rules. The rules are a subset of the candidate rules that the mining algorithm considers correct, e.g., rules with a positive confidence or a confidence above a user-specified threshold. The correctness of the rules can be estimated using the cross validation methodology, as we describe in Section 5.3. The group joins (Lines 6 and 8) are similar to those of the mining Algorithm 4-8, inferring new facts by applying inference rules to each group. After applying the inference rules, Line 9 removes the initial facts from

the result and returns. Algorithm 5-2 is similar to the rule mining algorithm except that it returns the inferred facts instead of generating counting statistics for the rules, and that it does not prune rules, since the mining algorithm is assumed to have removed incorrect rules. As with the mining algorithm, Algorithm 5-2 applies one table of rules in batches at a time. Thus, each rule type has its overloaded group join definition according to the rule structure.

The correctness of Algorithm 5-2 follows from that of the group joins: if the groups are disjoint from each other and the joins correctly apply the inference rules to each group, the combined result will correctly contain the inference results. Assuming the non-functional rules have been eliminated in the rule mining process, the time complexity of Algorithm 5-2 is O(t^{l−1}|M||S|), where t is the functional constraint and l is the length of the rule body. The time complexity coincides with that of the mining algorithm, since the group joins perform the same operations as their mining counterparts.
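For illustration, the final step (Line 9) can be written against Spark's Python API as follows, assuming jk is an RDD of (fact, rule ID) pairs produced by Group-Join-Last; the function name is ours:

def new_facts_only(jk):
    # Group the joined output by fact and keep only facts never tagged
    # with rule ID 0, i.e., facts not already in the input KB.
    return (jk.groupByKey()
              .filter(lambda kv: 0 not in set(kv[1]))
              .keys())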

Example 5.1. Figure 5-1 illustrates an iterative application of Algorithm 5-2 to apply rule tables M1 and M3 to the input KB S^0, constructed from the Reverb-Sherlock knowledge base. In this example, each entity and variable in the facts and rules tables has an associated class, namely, Writer, Place, City. These rule structures can be accommodated by performing type checks in the group joins.

In the first iteration, we run Algorithm 5-2 to apply all rules in M1 and M3 in batches. The result is given in S^1_1 and is merged with S^0. In this iteration, all four new facts are derived by rules in M1. In the second iteration, we run Algorithm 5-2 to apply the rules in M1 and M3 again, and a new fact "located in(Brooklyn, NYC)" is inferred by the rules in M3. Both rules in M3 are applied in one query, although there is only a single result, which is merged with S^1. Note that in each iteration of inference, all rule tables should be applied, but in this illustrative example, only M1 and M3 are applicable. After 2 iterations, we infer 5 new facts, expanding the input knowledge base to a total of 7 facts.

Figure 5-1. Knowledge expansion example. (A) Example facts table S^0. The abbreviations "P," "C," "W," etc. represent entities and classes and are explained in the "Notations" box. (B)(C) Example rules tables. (D) Example query tree for inference. S^j_i denotes the intermediate result for type i rules in the jth iteration; S^j denotes the merged result at the jth iteration. (E)-(G) Inference results. Shaded rows correspond to shaded tables in (D), representing the input facts from the previous iteration.

To evaluate the effect of knowledge expansion, we measure the relative size and precision of the inferred facts. Using relative expanded sizes allows us to compare the degree of expansion over knowledge bases of different sizes. Measuring precision is more challenging

due to the open world assumption and the lack of ground truth. As an estimate, we use the cross validation described in Section 5.3.

Definition 5.2. Assume we perform knowledge expansion on S^0 and obtain S. We define the degree of expansion as

DOE(S^0, S) = |S \ S^0| / |S^0|.

Thus, the degree of expansion of an inference algorithm is the ratio of the number of inferred facts (beyond the input KB) to the size of the input KB. This relative measure allows us to compare expansion over knowledge bases of different sizes. For instance, in Example 5.1, we have a DOE of 5/2 = 2.5. For Freebase, we infer 927M new facts from 388M facts, attaining a DOE of 2.39.

The input rules of Algorithm 5-2 are generated by the OP algorithm. Due to the incompleteness of the input knowledge base and statistical properties of the scoring metrics, the rules are uncertain. Likewise, the facts may come from either human knowledge or information extraction algorithms, depending on how the knowledge base is constructed. Thus, the knowledge bases often contain noisy and inaccurate facts or rules. In such knowledge bases, the errors tend to accumulate and propagate rapidly in the inference chain, as illustrated in Figure 3-4A. As a result, the inferred knowledge is full of errors after only a few iterations. Hence, it is important to detect errors early to prevent error propagation. Analyzing the inference results, we identify the following error sources:

E1) Incorrect facts resulting from the IE systems.
E2) Incorrect rules resulting from the rule learning systems.
E3) Ambiguous entities referring to multiple entities by a common name, e.g., "Jack" may refer to different people. They generate erroneous results when used as join keys.
E4) Propagated errors resulting from the inference procedure. Figure 3-4A illustrates how a single error produces a chain of errors.

Figure 5-2. Cross validation: the knowledge base is partitioned into training and testing sets. The Ontological Pathfinding and parallel inference algorithms run on the training and testing sets, respectively, with the inferred facts verified against the input KB. (Steps: ➀ cross partition; ➁ mine rules; ➂ infer; ➃ evaluate; ➄ analyze.)

Due to the open world assumption, these errors are hard to identify without ground truth, making it a challenge to analyze the correctness of the inferred facts. In our experiments, we use cross validation to estimate the precision of the inferred facts; by splitting Freebase into training and testing sets, we estimate a precision approaching 1.0 for the facts within a 0.6 degree of expansion. Most of the errors are caused by erroneous rules, as Freebase is itself a high quality knowledge base. On the contrary, for machine-constructed knowledge bases, ambiguous entities and incorrect extractions are also major sources of inaccurate results [13].

Fact evaluation for knowledge inference has previously been performed manually or semi-automatically: to determine the correctness of a fact, earlier approaches [13, 22, 24] examine inferred facts manually by looking them up on the web, e.g., Wikipedia. [22] also uses a semi-automated approach by inferring facts from an older version of a KB (e.g., YAGO2) and verifying the inferred facts in a newer version (e.g., YAGO2s). This approach reduces human effort, but does not generalize to other KBs. For instance, Freebase does not have sufficiently different versions that can be used for such validation. Our approach is similar to [22] in that cross partitioning simulates an old version of an existing knowledge base. The main benefits of our approach are generalizability to any knowledge base and fast creation of training and testing sets. We choose K = 5 to simulate the relative sizes of YAGO and YAGO2s (1:4.73).

In our approach, cross validation involves 5 steps: cross partitioning, mining, inferring, evaluating, and analyzing, as illustrated in Figure 5-2. Let Γ = {(s, p, o)} be the input knowledge base, P = {p | (s, p, o) ∈ Γ} be the set of predicates, and Pi = {(s, p, o) ∈ Γ | p = pi ∈ P} be the set of facts with predicate pi. We describe the detailed procedure below.

1. Cross partition. We randomly partition each set Pi into K nearly equal-sized parts {Pi1, . . . , PiK}, with ||Pin| − |Pim|| ≤ 1 for all 1 ≤ n ≤ m ≤ K. Let Qk = ∪_i Pik be the union of partition k of each Pi. Then, Q = {Q1, . . . , QK} forms a partition of the knowledge base. We pick (see the sketch after this list)

Test = random(Q);  Train = ∪_{Q ∈ Q\Test} Q.

2. Mine. rules ← OP(Train, s, m, t), where s, m, t are the constraint parameters as defined in Algorithm 4-1.

3. Infer. facts ← Infer(Test, s, m, N). We assign each inferred fact a confidence score, determined by the confidence of the rule inferring it. Facts inferred by multiple rules are assigned the maximum confidence of these rules.

4. Evaluate. We sort the facts by confidence score and evaluate them against the input knowledge base as the ground truth. For a given degree of expansion DOE, we define the corresponding precision as

precision(DOE) = |top(facts, N) ∩ Γ| / N,

where N = |Test| × DOE is the number of inferred facts.

5. Analyze. We vary DOE from 0 to 1 and report the precisions.
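A minimal Python sketch of the cross-partitioning step (Step 1); the function name and the shuffled round-robin assignment are our illustration of the ||Pin| − |Pim|| ≤ 1 requirement, not the system's exact implementation:

import random
from collections import defaultdict

def cross_partition(kb, k=5, seed=0):
    # Split each predicate's facts into K near-equal folds so that every
    # fold preserves the predicate's relative frequency (no predicate bias).
    rnd = random.Random(seed)
    by_pred = defaultdict(list)
    for s, p, o in kb:
        by_pred[p].append((s, p, o))
    folds = [set() for _ in range(k)]
    for facts in by_pred.values():
        rnd.shuffle(facts)
        for i, fact in enumerate(facts):   # round-robin: sizes differ by <= 1
            folds[i % k].add(fact)
    test = folds[rnd.randrange(k)]
    train = set().union(*(f for f in folds if f is not test))
    return train, test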

The cross partitioning step keeps the relative size of each predicate consistent in the training and testing sets. Otherwise, there is a predicate bias, which we define as a predicate having a higher or lower relative frequency in the training set than in the knowledge base and in the testing set, i.e.,

|Pi ∩ Train| / |Train| > |Pi| / |Γ| > |Pi ∩ Test| / |Test|,

or vice versa. The presence of predicate biases may lead to deviations in the support and confidence scores that reflect the cross partitioning mechanism rather than the rule quality. Step 1 excludes predicate biases by preserving the relative frequencies of all predicates.

In the evaluation step, we use the input knowledge base as the ground truth to verify the inferred facts from Test. Due to the open world assumption, this may underestimate the test precisions. Despite that, the reported precisions are a reliable estimate of the true test precisions, as they guarantee a lower bound for the test set. In the analysis step, we report precision against DOE instead of against the exact number of inferred facts because the inference step uses only one partition, i.e., 1/K of the input knowledge base. DOE provides a consistent unit to compare test sets of different sizes. Steps 1-5 form a general framework for evaluating learning and inference over web-scale knowledge bases. We report the experiment results on Freebase and YAGO2s in Section 5.3.

We evaluate the effect of the inference algorithm in terms of performance and quality. We show that the parallelization and partitioning techniques apply to the inference algorithm to achieve high efficiency and scalability, and we use cross validation to evaluate the correctness of the inferred facts.

Setup. We apply all the inference rules from the mining algorithm. We use the experiment setup in Table 4-3, with the exception of the functional constraints, as we assume that non-functional rules are pruned in the mining phase. We estimate the precision of a fact set by assessing facts with a minimum confidence of 0.6 inferred by rules supporting at least 2 facts, as we do in assessing inference rules.

Performance. As an overall runtime measurement, we apply the 36,625 inference rules to Freebase and derive 927M facts in 17.19 hours. Thus, the inference algorithm is 48% faster than learning. The speedup benefits from wrong rules, those without even a single supporting fact, having been eliminated in the learning process. To illustrate, while we have 463,631 candidate rules for mining, the mining algorithm outputs only 36,625 rules for inference; the rules without support are considered wrong and do not participate in the inference algorithm. Furthermore, correct rules tend to have good functionality properties: it is unlikely that two predicates join to produce exponential numbers of intermediate results, as we discuss in Section 4.2.3. Lastly, the inference algorithm does not need to generate counting statistics for each rule. Consequently, it achieves better performance than learning.

In our previous work [13], we develop a relational inference engine that models knowledge bases as relational tables and uses join-based algorithms to apply the inference rules in batches for efficiency. We show that this approach runs more than 200 times faster than the state-of-the-art, Tuffy [28]. However, the plain join queries do not scale to Freebase due to its size; the largest KB we had scaled to contained 10M facts. Using the parallelization and partitioning techniques we propose in this chapter, we have improved the scalability to 388M facts, achieving a new state-of-the-art for inference in large knowledge bases.

In Figure 5-3A, we run the inference algorithm using one type of rules with different max-size partitioning parameters, ranging from 200M to 5M. The result shows that partitioning improves the runtime from 14.67 hours down to 2.23 hours, a near-linear speedup of more than 6, consistent with the learning results reported in Figure 4-9. We also observe a similar improvement by varying the number of cores from 1 to 64. These results support the validity of the parallelism and partitioning techniques in inference as well as in learning. On the other hand, pruning does not offer additional help with inference, as non-functional rules are already eliminated in the mining phase.

Analyzing the inferred facts, we estimate the overall precision to be 0.96. In particular, the facts in range [0.6, 0.8) have a precision of 0.98 and the facts in range [0.8, 1.0] have a precision of 0.74. All the correct facts contribute new knowledge to Freebase. Meanwhile, the result


implies that the confidence score, based on the open world assumption, may underestimate high quality rules. For example, the rule

music/release/track(x, y) ← music/release/track list(x, z), music/release track/recording(z, y)

infers all correct facts, but has a confidence of 0.66 because "music/release/track" is incomplete. On the other hand, the confidence score may overestimate low quality rules with low support, but the overall precision remains high, since most facts are inferred by high-support rules, as we explain via the inference capability of the rules.

Figure 5-3. Inference performance. (A) Effect of partitioning for inference. (B) Inference capability of individual Freebase rules.

The inference capability of a rule r : H(x, y) ← B is related to its support and confidence scores. According to Equation (4–3), the total number of facts inferred by rule r is

|{H(x, y) | B}| = supp(r) / conf(r). (5–1)

The number of new facts inferred by r is

supp(r)/conf(r) − supp(r) = supp(r) · (1 − conf(r))/conf(r). (5–2)
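For concreteness, a small Python computation of the two quantities for a hypothetical rule with support 75 and confidence 0.75 (illustrative numbers only, chosen so the arithmetic is exact):

supp, conf = 75, 0.75
total_inferred = supp / conf            # (5-1): 100.0 facts in total
new_facts = supp * (1 - conf) / conf    # (5-2): 25.0 facts beyond the KB
print(total_inferred, new_facts)        # 100.0 25.0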

Equation (5–2) identifies rules with high inference capability: the number of new facts inferred by a rule. Such rules have high support and low confidence. In practice, as confidence implies rule quality, rules with both high support and reasonably high confidence infer most

103 (A) Cross Validation Result (B) Freebase Inference Categories 2.5 1e7 1.0

2.0 0.8

0.6 1.5

0.4 1.0 # of Inferred Facts

Aggregated Precision 0.2 Freebase precision 0.5 YAGO2s precision 0.0 0.0 0.2 0.4 0.6 0.8 1.0 film cvg Degree of Expansion book base music people award others biology commonlocation freebasebusiness medicineeducation olympics astronomyvisual_art organization government

(C) Examples of Freebase inferred facts:
(1) music/album/artist(Live Era '87–'93, Guns N' Roses)
(2) book/series editor/book edition series edited(Janet Morris, Heroes in Hell by Baen Books)
(3) film/film/production companies(Butt Spanking, Bacchus)
(4) user/anjackson/default domain/bitstream encoding/format(PDF 1.4, Portable Document Format)

Figure 5-4. Cross validation result and example inferred facts. (A) Precision of inferred facts. (B) Categories of Freebase inferred facts. (C) Examples of inferred facts.

This explains the higher estimated quality of facts (0.96) compared to rules (0.60): within the same range of confidence, high-support rules infer more facts than low-support rules. The inference capability of the 2,994 top rules (support ≥ 2, confidence ≥ 0.6) is reported in Figure 5-3B. As we see in the figure, a wide range of rules infer new facts; thus, the estimated 60% correct rules (Table 4-4) all contribute veritable knowledge. To illustrate, Rule (1) in Figure 4-6 has a support of 5964 and a confidence of 0.81. By Equation (5–2), it infers 1356 new correct facts beyond Freebase. As a special case, rules with confidence 1.00 infer no new facts. These rules, however, are useful when the knowledge base incrementally expands through either human input or information extraction.

Cross validation. Manual labeling of inferred facts is laborious and error-prone; moreover, it tends to focus on the facts with high confidence scores. We therefore use cross validation to systematically verify each inferred fact. In Figure 5-4A, we show the quality of inferred facts

from the YAGO2s and Freebase knowledge bases, ordered by the inferring rules according to their confidence scores. The initial size of the testing set is 825K for YAGO2s and 50.6M for Freebase. Using the rules from the training set, we have inferred 1.67M new facts for YAGO2s and 118M for Freebase, expanding the initial knowledge base by a factor of more than 2. The precision of facts inferred from Freebase approaches 1.0 for the top 30M facts, demonstrating the high accuracy of the Freebase rules. This benefits from its clean schema and data, as we observe in Section 4.3.1. In addition, by comparing them with the YAGO2s rules, we also see that Freebase rules achieve a higher recall, as the completeness and scope of the Freebase schema allow the mining algorithm to generate rules covering a wide range of topics. The correctly inferred facts span 71 different categories ("domains" in Freebase terminology), e.g., music, book, and film, with the top 20 displayed in Figure 5-4B. Figure 5-4C shows examples of these facts; their correctness can easily be verified on Wikipedia. The statistics show that most inferred facts fall in the music category (21.46M), while the second largest category, book, has only 1.38M. This is because the original Freebase has most of its tuples (236M) in the music category, followed by the film category with only 22M tuples. More interestingly, as Freebase is designed as a user-extendible knowledge base, users are allowed to add their own content in private spaces prefixed by "user/[username]/." As illustrated by Fact (4) in Figure 5-4C, the inference algorithm helps user "anjackson" discover new knowledge in his own knowledge base on encoding formats. Analyzing the errors, we observe that most of them are caused by erroneous rules. To illustrate, we have 28.22M facts inferred by rules with a confidence score greater than 0.5, 28.18M (99.89%) of which are correct. For the 90.07M facts inferred by rules with lower confidence scores, only 2.6M (2.92%) are correct. Thus, we see that high-quality rules generate most of the correct results; ambiguous entities and incorrect extractions have little impact on the quality of the result in a clean knowledge base.

By contrast, for machine-constructed knowledge bases, the major error sources also include ambiguous entities and incorrect extractions, as we report in the same study for the Reverb-Sherlock knowledge base [13]. In addition, we see the impact of the open world assumption on the confidence scores: the scores are low not because the rules are incorrect, but because the input knowledge base is incomplete. The actual quality of the rules may be much higher than their confidence scores suggest. The validity of the cross validation methodology for evaluating inference results can be justified by comparison with AMIE+ [22], where the authors perform a semi-automatic validation using YAGO2 for the inference task and YAGO2s for verification, combined with manual inspection using external sources like Wikipedia. They observe similar precisions for YAGO2 inferred facts: using YAGO2 with 948K base facts, AMIE+ makes 100K predictions (corresponding to a degree of expansion of 0.11) at a precision of 0.7, and 400K predictions (corresponding to a degree of expansion of 0.42) at a precision of 0.6. Their precision is higher than that evaluated by cross validation in Figure 5-4A due to the partial completeness assumption and the external information that fills in the knowledge missing under the open world assumption in modern knowledge bases. Overall, the experiments validate the effectiveness of our approach. We perform first-order mining on Freebase in 34 hours and contribute the first rule set with 36,625 inference rules. In particular, the relational knowledge base model facilitates efficient parallel joins, and the partitioning algorithm scales them up by breaking large knowledge bases into smaller independent datasets. Applying the 36,625 inference rules, we derive 927M new facts beyond Freebase in 17 hours. We use cross validation to verify the results, estimating high precision for the top 60% of inferred facts. The inference algorithm thus contributes large volumes of veritable knowledge to Freebase. Our experiments focus on Freebase to demonstrate scalability, but the approach is applicable to other knowledge bases, e.g., Wikidata [55], to which Freebase is migrating.

5.4 Summary

Based on the relational knowledge base model [13], we design an inference algorithm with the parallelization and partitioning optimizations we use in rule mining. We propose a cross validation method to evaluate the inferred facts. Applying the inference rules to Freebase, we derive 927 million facts in 17.19 hours. We estimate that the top facts, up to a degree of expansion of 0.6, have a precision approaching 1.0. Our approaches outperform state-of-the-art rule mining algorithms and inference engines in terms of both performance and quality. All our open-source code repositories and data are published online. Future research includes online learning with dynamic knowledge bases. Real-world knowledge bases, e.g., DeepDive, Freebase, and NELL, accept new input or user feedback and continuously update their contents. Learning with these knowledge bases requires updating the learning result according to the new contents. Obviously, re-running the learning algorithm for each update would be infeasible. Motivated by our experiments and recent work on incremental MCMC [52], we plan to explore efficient ways to perform online learning with expanding knowledge bases.

CHAPTER 6
QUERY PROCESSING WITH KNOWLEDGE ACTIVATION

Semantic networks are a popular way of simulating human memory in ACT-R-like cognitive architectures. However, existing implementations fall short in their ability to efficiently work with the very large networks required for full-scale simulations of human memory. In this chapter, we present SemMemDB, an in-database realization of semantic networks and spreading activation. We describe a relational representation for semantic networks and an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. The key benefits of our approach are: (1) Databases have mature query engines and optimizers that generate efficient query plans for memory activation and retrieval; (2) Databases can provide massive storage capacity to potentially support human-scale memories; (3) Spreading activation is implemented in SQL, a widely used query language for big data analytics. We evaluate SemMemDB in a comprehensive experimental study using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works. SemMemDB is a module for efficient in-database computation of spreading activation over semantic networks. Semantic networks are broadly applicable to associative information retrieval tasks [33], though we are principally motivated by the popularity of semantic networks and spreading activation for simulating human memory in cognitive architectures, specifically ACT-R [34, 35]. Insofar as cognitive architectures aim toward codification of unified theories of cognition and full-scale simulation of artificial humans, they must ultimately support human-scale memories, which at present they do not. We are also motivated by the desire for a scalable, standalone cognitive model of human memory free from the architectural and theoretical commitments of a complete cognitive architecture. Our position is that human-scale associative memory is best achieved by leveraging the extensive investments and continuing advancements in structured databases and big data systems. For example, relational databases already provide effective means to manage

and query massive structured data, and their commonly supported operations, such as grouping and aggregation, are sufficient and well suited for efficient implementation of spreading activation. To defend this position, we extend the relational data model for semantic networks and describe an efficient SQL-based, in-database implementation of network activation (i.e., SemMemDB). The main benefits of SemMemDB and our in-database approach are: (1) It exploits query optimizers and execution engines that dynamically generate efficient execution plans for activation and retrieval queries, which is far better than manually implementing a particular fixed algorithm. (2) It uses database technology for both storage and computation, avoiding the complexity and communication overhead incurred by employing separate modules for storage and computation. (3) It implements spreading activation in SQL, a widely used query language for big data supported by various analytics frameworks, including traditional databases (e.g., PostgreSQL), massively parallel processing (MPP) databases (e.g., Greenplum [36]), and the MapReduce stack (e.g., Hive) [56, 57]. In summary, we make the following contributions:

• A relational model for semantic networks and an efficient, scalable SQL-based spreading activation algorithm.

• A comprehensive evaluation using DBPedia showing orders of magnitude speed-up over previous works.

In this chapter, we provide preliminaries explaining semantic networks and activation, discuss related work regarding semantic networks and ACT-R's associative memory system, describe the implementation of SemMemDB, and evaluate SemMemDB using DBPedia [1], a web-scale ontology constructed from the Wikipedia corpus. Our experiment results show several orders of magnitude of improvement in execution time in comparison to results reported in the related work.

6.1 Spreading Activation

Semantic memory refers to the subcomponent of human memory that is responsible for the acquisition, representation, and processing of conceptual information [58]. Various

representation models for semantic memory have been proposed; in this chapter, we use the semantic network model [59]. A semantic network consists of a set of nodes representing entities and a set of directed edges representing relationships between the entities. Figure 6-1A shows an example semantic network. It is constructed from a small fragment of DBPedia, an ontology extracted from Wikipedia. In this example, we show several scientists and their research topics. The edges in this network indicate how the scientists influence each other and their main interests. Processing in a semantic network takes the form of spreading activation [60]. Given a set of source (query) nodes Q with weights, the spreading activation algorithm retrieves the top-K most relevant nodes to Q. For example, to retrieve the most relevant nodes to "Francis Bacon," we set Q = {(Francis Bacon, 1.0)} as shown in Figure 6-1D. The algorithm returns {Aristotle, Plato, Cicero, John Locke} ranked by their activation scores as shown in Figure 6-1E. These activation scores measure relevance to the query node(s) and are explained shortly. Figure 6-1F shows another example query, a second iteration of the previous query formed by merging the original query with its result. Figure 6-1G shows the result of this second iteration. As shown in the above examples, the spreading activation algorithm assigns an activation score to each result node measuring its relevance to the query nodes in Q. The activation score A_i of a node i is related to the node's history and its associations with other nodes. It is defined as

    A_i = B_i + S_i.    (6–1)

The B_i and S_i terms are base-level activation and spreading activation, respectively. The base-level activation term reflects recency and frequency of use, while the spreading activation term reflects relevance to the current context or query. Formally, the base-level activation B_i is defined as

    B_i = \ln\left( \sum_{k=1}^{n} t_k^{-d} \right),    (6–2)

where t_k is the time since the k-th presentation of the node and d is a constant rate of activation decay. (In the ACT-R community, d = 0.5 is typical.) In the case of DBPedia, for example, t_k values might be derived from the retrieval times of Wikipedia pages. The resultant B_i value predicts the need to retrieve node i based on its presentation history.
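As a worked illustration (with hypothetical numbers, not taken from the experiments), suppose a node was presented at times 0, 1, and 3, the current time is T = 7, and d = 0.5. The times since presentation are then 7, 6, and 4, giving

    B_i = \ln\left( 7^{-0.5} + 6^{-0.5} + 4^{-0.5} \right) \approx \ln(0.378 + 0.408 + 0.500) \approx 0.252.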

The spreading activation S_i is defined as

    S_i = \sum_{j \in Q} W_j S_{ji}.    (6–3)

W_j is the weight of source (query) node j; if weights are not specified in a query, then a default value of 1/n is used, where n is the total number of source nodes. S_{ji} is the strength of association from node j to node i. It is set to S_{ji} = S − ln(fan_{ji}), where S is a constant parameter and

    \mathrm{fan}_{ji} = \frac{1 + \mathrm{outedges}_j}{\mathrm{edges}_{ji}}.

The values of outedges_j and edges_{ji} are the number of edges from node j and the number of edges from node j to node i, respectively.
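For instance, in a hypothetical configuration chosen purely for illustration, if node j has four outgoing edges in total and exactly one of them points to node i, then

    \mathrm{fan}_{ji} = \frac{1 + 4}{1} = 5, \qquad S_{ji} = S - \ln 5 \approx S - 1.609.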

Equations (6–2) and (6–3) are implemented using the native grouping and aggregation operations supported by relational database systems. Thus, in only a few lines of SQL, we are able to implement the relatively complex task of spreading activation. Moreover, database systems are able to generate very efficient query plans based on table statistics, query optimization, etc. We therefore believe it is better to rely on the database to dynamically select the most efficient algorithm rather than to manually develop a fixed one, as is done in previous works.

ACT-R's [61] original Lisp-based implementation of spreading activation does not scale to large associative memories. Researchers have thus investigated various ways to augment ACT-R's memory subsystem to achieve scalability. These investigations include outsourcing the storage of associative memories to database management systems [62] and concurrently computing activations using Erlang [63]. We observe that in [62], databases are used only as a storage medium; activation computations are performed serially outside of the database, which is unnecessarily inefficient and incurs the significant communication overhead of data transfer in and out of the database. In SemMemDB, we leverage the full computational power of databases by performing all activation calculations within the database itself, using a SQL-based implementation of the spreading activation algorithm. Semantic network spreading activation has also been explored using Hadoop and the MapReduce paradigm [64]. However, MapReduce-based solutions are batch-oriented and not generally appropriate for dealing with ad hoc queries. In terms of simulating an agent's memory (our principal motivating use case), queries against the semantic network are ad hoc and real-time, which are the types of queries better managed by relational database systems.

6.2 Using SemMemDB

We refer to Figure 6-1 to illustrate how users interact with the SemMemDB module. Specifically, a user defines a query table Q, a network table N, and a history table H. Having done so, activation and retrieval of the top-K nodes is initiated by a simple SQL query:

SELECT * FROM activate() ORDER BY A DESC LIMIT K;

The activate function is a database stored procedure. Its parameters are implicitly Q, N, and H, so the complete signature is activate(Q, N, H). The ORDER BY and LIMIT clauses are optional; they instruct the database to rank nodes by their activation scores and to return only the top-K results. The result is a table of at most K activated nodes with, and ranked by, their activation scores. Tables Q, N, and H are defined as follows:

• Table Q contains a tuple (i, w) for each query node i with numeric weight w.

• Table N contains a tuple (i, p, j) for each directed edge from node i to node j with predicate p.


Figure 6-1. SemMemDB usage with DBpedia knowledge base. (A) Semantic network fragment showing the relationships between scientists and their interests. Each node represents a DBPedia entity; each directed edge represents a relationship between the entities. (B) Database table N that stores the network depicted in (A). Each row (i, p, j) represents a directed edge between (i, j) with predicate p. We use abbreviations here for illustration (e.g., "FB" for "Francis Bacon"). (C) History table H recording the presentation history for each node. Zero means creation time. (D) and (F) are two example queries; (E) and (G) are the results of (D) and (F), respectively, ranked by activation scores. The precise definitions of the N, H, and Q tables are given in Section 6.2.

• Table H contains a tuple (i, t) for every node i presented at numeric time t. A node was 'presented' if it was created, queried, or returned as a query result at time t (users may choose other criteria). The specific measure of time (e.g., time-stamp, logical time) is defined by the user. H is typically updated automatically when a query returns, by inserting the query and result nodes with the current time.

Tables Q, N, and H are allowed to be views defined as queries over other tables, so long as they conform to the described schema. This allows users to specify more complex data models and queries using the same syntax. (A schematic sketch of these tables is given below.)
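To make the interface concrete, here is a minimal sketch of the three tables and a sample retrieval in PostgreSQL syntax; the column types and the example node identifier are illustrative assumptions, not part of the SemMemDB specification:

CREATE TABLE Q (i INT, w DOUBLE PRECISION);  -- query nodes and their weights
CREATE TABLE N (i INT, p TEXT, j INT);       -- directed edges (i, j) labeled with predicate p
CREATE TABLE H (i INT, t DOUBLE PRECISION);  -- node i was presented at time t

-- Issue a query like Figure 6-1D: activate from one source node with weight 1.0,
-- then retrieve the top-4 nodes ranked by activation score (the second column).
DELETE FROM Q;
INSERT INTO Q VALUES (42, 1.0);              -- 42: hypothetical integer id for "Francis Bacon"
SELECT * FROM activate() ORDER BY 2 DESC LIMIT 4;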

Example 6.1. For the semantic network shown in Figure 6-1A, the corresponding network table N is shown in Figure 6-1B. Figure 6-1C shows one possible history table H; node creation time is assumed to be 0. Figures 6-1D and 6-1F correspond to the queries for {"Francis Bacon"} and {"Francis Bacon," "Aristotle," "Plato," "Cicero," "John Locke"}, respectively, with the weights listed in the w columns. Finally, Figures 6-1E and 6-1G show the results of those queries.

6.2.1 Base-Level Activation Calculation

The base-level activation B_i defined by (6–2) corresponds to a grouping and summation operation. Assuming the current time is T, the following SQL query computes the base-level activations of all nodes:

SELECT i, log(SUM(power(T-t,-d))) AS b
FROM H
GROUP BY i;

In Figure 6-1, "Aristotle" was presented most recently at time 6, then "Plato" at time 3, while "Cicero" and "John Locke" have not been presented beyond their creation at time 0. In response to Q1, "Aristotle" is judged most relevant to "Francis Bacon" (see Figure 6-1E) despite the fact that all of these nodes have the same number of edges (viz., 1) connecting them to "Francis Bacon." This is because of the differences between their base-level activations.

6.2.2 Spreading Activation Calculation

The spreading activation S_i defined by (6–3) decomposes into two components: W_j, which is query dependent, and S_{ji}, which is network dependent but query independent. Since S_{ji} is query independent, an effective way to speed up calculation is to precompute the S_{ji} values in materialized views. These views store precomputed results in intermediate tables so that they are available during query execution. First, we compute the number of edges from each node i:

CREATE MATERIALIZED VIEW OutEdges AS
SELECT i, COUNT(*) AS l FROM N GROUP BY i;

Then, we compute the actual S_{ji} values:

CREATE MATERIALIZED VIEW Assoc AS
SELECT i, j, S-ln((1+OutEdges.l)/COUNT(*)) AS l
FROM N NATURAL JOIN OutEdges
GROUP BY (i, j, OutEdges.l);

Though a fair amount of computation happens here, we emphasize that it is done only once; thereafter, the resultant values are used by all queries against the semantic network.

Given the above definition of Assoc and a query Q, we compute the spreading activation S_i as follows:

SELECT j AS i, SUM(Q.w*Assoc.l) AS s
FROM Q NATURAL JOIN Assoc
GROUP BY j;

6.2.3 Activation Score Calculation

The activation score A_i defined by (6–1) is the sum of the base-level activation and the spreading activation. The complete SQL procedure for computing activation scores is given in Listing 3.

In the activate() procedure, we start by computing the S_i terms in the WITH Spreading AS clause. The result of this subquery (Spreading) is often small, so the history look-up can be optimized by joining H and Spreading in Line 11. In this way, only the relevant portion of the history is retrieved.

Listing 3. Activation Procedure
 1 CREATE OR REPLACE FUNCTION activate()
 2 RETURNS TABLE(node INT, s DOUBLE PRECISION) AS $$
 3 BEGIN
 4   RETURN QUERY
 5   WITH Spreading AS (
 6     SELECT Assoc.j AS i, SUM(Q.w*Assoc.l) AS s
 7     FROM Q NATURAL JOIN Assoc
 8     GROUP BY Assoc.j
 9   ), Base AS (
10     SELECT H.i AS i, log(SUM(power(T-t,-d))) AS b
11     FROM H NATURAL JOIN Spreading
12     GROUP BY H.i
13   )
14   SELECT Base.i AS i, Base.b+Spreading.s AS A
15   FROM Base NATURAL JOIN Spreading;
16 END;
17 $$ LANGUAGE plpgsql;

The final activation scores A_i are computed by joining Base and Spreading in Lines 14–15.

6.3 Evaluation

In this section, we evaluate the performance of SemMemDB using PostgreSQL 9.2, an open-source database management system. We run all experiments on a two-core machine with 8GB RAM running Ubuntu Linux 12.04.

6.3.1 Data Set

We use the English DBPedia¹ ontology as the data set for evaluation. The DBPedia ontology contains entities, object properties, and data properties. Entities and object properties correspond to nodes and edges in the semantic network. Data properties associate only with single nodes, so they do not affect spreading activation and are hereafter ignored. We generate a pseudo-history for every node in the semantic network under the assumption that 'past retrievals' follow a Poisson process, where the rate parameter of a node is determined by its number of incoming edges.²
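The generation step can be sketched in SQL. The following is an illustrative approximation only (the view names, the choice of two draws per node, and the exact rate formula are assumptions, not the procedure used in the experiments); it samples exponential inter-arrival gaps whose rate grows with in-degree and accumulates them into presentation times:

WITH InDeg AS (
  SELECT j AS i, COUNT(*) AS deg FROM N GROUP BY j
), Gaps AS (
  -- two exponential inter-arrival gaps per node; rate = in-degree
  -- (nodes with no incoming edges are skipped in this sketch)
  SELECT i, g, -ln(1 - random()) / deg AS gap
  FROM InDeg, generate_series(1, 2) AS g
)
INSERT INTO H (i, t)
SELECT i, SUM(gap) OVER (PARTITION BY i ORDER BY g) AS t
FROM Gaps;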

1 http://wiki.dbpedia.org/Downloads

The statistics of our DBPedia semantic network data set are listed in Table 6-1.

Table 6-1. DBPedia data set statistics.

  # nodes (entities)             3,933,174
  # edges (object properties)   13,842,295
  # histories (pseudo-data)      7,869,462

For comparison, Table 6-2 lists the statistics of the Moby Thesaurus II data set, which [63] used to evaluate their semantic network implementation.

Table 6-2. Moby Thesaurus II data set statistics.

  # nodes (root words)      30,260
  # edges (synonyms)     2,520,264

6.3.2 Performance Overview

In the first experiment, we run three queries against the entire DBPedia semantic network data set. The initial queries are listed in Table 6-3A; each contains three nodes. For each query, we execute three iterations, where each iteration's query is formed from the result of the previous iteration, starting with the initial query. We measure the execution time by executing each iteration ten times and taking the average. The execution times and result sizes are listed in Table 6-3B. Note that the history table is not modified during the experiments. All the queries complete within tens of milliseconds. We informally compare this result to those in [63]. [63] evaluate their semantic network implementation using the Moby Thesaurus II data set, which is only a tenth of the size of the DBPedia data set (see Table 6-2). Their average execution time is 10.9 seconds.

2 We assume that nodes with greater connectivity are retrieved more often, but this choice is arbitrary.

Table 6-3. Experiment 1 result. (A) Experiment 1 initial queries. (B) Avg. execution times and result sizes for queries by iteration.

(A)
  Node   Q1           Q2              Q3
  1      Aristotle    United States   Google
  2      Plato        Canada          Apple
  3      John Locke   Japan           Facebook

(B)
  Query                 Iter. 1   Iter. 2   Iter. 3
  Q1   time/ms          5.16      22.02     63.22
       result size      125       890       3981
  Q2   time/ms          1.77      5.69      14.60
       result size      23        121       477
  Q3   time/ms          2.58      6.15      13.61
       result size      36        132       381

Table 6-4. Experiment 2 semantic network sizes and avg. execution times for single iteration queries of 1000 nodes.

  Proportion     20%         25%         33%         50%         100%
  # nodes        2,607,952   2,846,850   3,130,706   2,012,183   3,933,174
  # edges        2,768,119   3,463,240   4,616,433   6,035,162   13,842,295
  # histories    1,988,400   2,345,531   2,903,576   3,906,121   7,869,462
  time/ms        35.05       38.47       42.52       45.02       57.63

This is more than 500 times slower than SemMemDB, using for comparison Q1, Iteration 2, at 22.02 ms, which has a larger fan than any query used in [63]. Though informal, this result illustrates the performance benefits offered by the in-database architecture of SemMemDB.

6.3.3 Effect of Semantic Network Sizes

In the second experiment, we evaluate the scalability of SemMemDB by executing queries against semantic networks of increasing size. These semantic networks are produced by using 20%, 25%, 33%, 50%, and 100% of the DBPedia data set. Query size is fixed at 1000 nodes, and queries are generated by taking random subsets of DBPedia entities.

Table 6-5. Experiment 3 avg. execution times and result sizes for single iteration queries of varying sizes.

  Query size     1      10     10^2    10^3    10^4
  time/ms        1.33   2.45   11.27   57.63   2922.51
  result size    4      30     321     2428    19,007

The execution times and network sizes are listed in Table 6-4. It is a perhaps surprising yet highly desirable result that execution time grows more slowly than network size. This scalability is due to the high selectivity of the join queries. Since the query size is much smaller than the network size, the database is able to efficiently select query-relevant tuples using indexes on the join columns (see Figure 6-2B, left plan). As a result, execution time is not much affected by the total size of the semantic network; only the retrieved sub-network and the index size matter.
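For example, such join-column indexes could be declared as follows; the index names are illustrative, and this assumes a PostgreSQL version that supports indexes on materialized views:

CREATE INDEX assoc_i_idx ON Assoc (i);   -- join column of the precomputed Assoc view
CREATE INDEX h_i_idx ON H (i);           -- speeds up the history look-up join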

6.3.4 Effect of Query Sizes

In the third experiment, we evaluate the scalability of SemMemDB with respect to query size. We execute queries ranging in size from 1 to 10^4 nodes against the entire DBPedia semantic network data set. The queries are generated by taking random subsets of DBPedia entities. The execution times, query sizes, and result sizes are listed in Table 6-5. The results indicate that execution time scales linearly with query size when the query size is small (≤ 10^3). Under these conditions, join selectivity is high and indexing accelerates query-relevant tuple retrieval. This effect is illustrated in the query plans shown in Figure 6-2B for the sample query shown in Figure 6-2A. When the query size is small, the index scan on Assoc retrieves only a small number of nodes, hence an index nested loop is chosen by the database query planner. When the query size is large (e.g., 10^4), a large portion of

Assoc needs to be accessed (the index is not of much help), so the database chooses a hash join for better performance. This dynamic selection of the 'best' algorithm exemplifies why we feel it is better to rely on the database to efficiently plan and execute activation queries rather than to manually implement a fixed algorithm.

The sample query of Figure 6-2A is:

SELECT j AS i, SUM(Q.w*Assoc.l) AS A
FROM Q NATURAL JOIN Assoc
GROUP BY j;

For a small Q (100 tuples), the plan is a HashAggregate (287) over an index nested loop join on Assoc.i = Q.i (332), fed by an index scan of Assoc (12,872,122) and a sequential scan of Q (100). For a large Q (100,000 tuples), the plan is a HashAggregate (130,981) over a hash join on Assoc.i = Q.i (327,064), fed by sequential scans of Assoc (12,872,122) and Q (100,000).

Figure 6-2. SemMemDB query plans. (A) Sample query. (B) Query plans for (A) given that Q is small (left) or large (right). Numbers in parentheses indicate table sizes.

6.4 Summary

In this chapter, we introduced SemMemDB, a module for efficient in-database spreading activation over semantic networks. We presented its relational data model and its scalable spreading activation algorithm. SemMemDB is applicable in many areas, including cognitive architectures and information retrieval; our presentation, however, was tailored to those seeking a scalable, standalone cognitive model of human memory. We evaluated SemMemDB

on the English DBPedia data set, a web-scale ontology constructed from the Wikipedia corpus. The experiment results show a more than 500-fold performance improvement over previous implementations for ad hoc spreading activation retrieval queries over semantic networks. What we have reported here is an early-stage development toward a scalable simulation of human memory. Planned future work includes the use of MPP databases and support for more complex queries and models as described in [65].

CHAPTER 7
RELATED WORK

Recent research in knowledge base construction has resulted in large web knowledge bases. These works employ a number of techniques to improve the coverage and quality of the knowledge bases. In Freebase [3] and DBpedia [1], quality is maintained by collaborative human construction. In machine-constructed knowledge bases, additional approaches are developed: NELL [6] employs a set of coupling constraints to prevent "semantic drifts" in constructing the knowledge base [66]; OpenIE [67] integrates internal components [20, 68] to mine functional constraints from the extracted facts; ProBase [10] mines a database of instance-class pairs to provide a taxonomy for web entities; YAGO [15, 41] extracts knowledge from high-quality sources enhanced with temporal and spatial information; universal schemas [69] handle uncertainty and incompleteness in KB schemas; knowledge graph identification [70] identifies useful knowledge from raw extractions; ProbKB [13] applies functional constraints to detect contradictions and ambiguous entities for knowledge expansion.

Markov logic networks. Markov logic networks [29] are the state-of-the-art framework for working with uncertain facts and rules. MLNs have been successfully applied in a variety of applications, including information extraction [71], textual inference [42], entity resolution [72], etc. MLNs can be viewed as templates to generate ground factor graphs (Markov networks) [73–75]. Hence, general probabilistic graphical model inference algorithms apply [73, 76–78], as do specialized MLN inference algorithms [49, 79–81]. There are works on MLN structure learning [82–86], but few of them achieve web scale.

Mining Horn clauses. Mining Horn clauses [45] was first studied in the Inductive Logic

Programming (ILP) literature [46, 47, 87]. Recently, Sherlock [24] and AMIE [37] have extended it to mining first-order inference rules from knowledge bases by defining new metrics that address the open world assumption made by the knowledge bases. AMIE achieves state-of-the-art efficiency using an in-memory database to support the projection queries for counting. Sherlock and AMIE have mined 30,912 and 1090 inference rules from 250K

(OpenIE [8, 9]) and 948K (YAGO2 [41]) distinct facts, respectively. Still, none of these approaches scales to the size of Freebase. To solve this scalability problem, we adopt the mining model of AMIE, described in Section 2.2, and scale it up using a series of parallelization and optimization techniques.

Parallel computing. In recent years, various data processing frameworks have been developed to facilitate large-scale data analytics, including in-database analytics [28, 88–92], MapReduce [56], FlumeJava [93], Spark [94, 95], GraphLab [44, 96], Datapath [54, 97], etc. These systems are effective for a variety of data mining and machine learning problems. Given no previous work that scales rule mining algorithms to the web, we are motivated to leverage state-of-the-art parallel computing techniques. The parallel mining algorithm we propose combines the relational model [13] and the MapReduce programming paradigm [56], consisting of a sequence of parallel operations on the KB tables. We implement and evaluate it on Spark because of its efficient pipelining of parallel operations using distributed caches.

Functional constraints. Constraints have proven helpful in a wide range of knowledge

base construction and expansion problems. NELL employs a set of coupling constraints to prevent "semantic drifts" in constructing the knowledge base [66]. A particularly useful

class of constraints is functional constraints. The N-FOIL algorithm [98] for NELL uses functional constraints to provide negative examples for ILP learners. [19, 20] mine functional constraints from automatically constructed knowledge bases. [13] applies functional constraints to detect contradictions and ambiguous entities for the knowledge expansion problem. [42] leverages functionality properties of predicates to scale textual inference to the web. These works, which speed up knowledge and textual inference tasks by leveraging functionality properties of predicates, inspire us to apply such properties to the mining problem by extending the notion of functionality to Horn clauses in order to prune erroneous rules.

Mining association rules. Since the first introduction of association rule mining in [99], researchers have developed a number of improvements [100–103]. In particular, the Direct Hashing and Pruning algorithm [101] partitions itemsets into buckets and prunes buckets

that violate the minimum support constraint. The Partition Algorithm [102] partitions the transactions database and processes one partition at a time. These partitioning approaches depend on the assumption that transactions contribute independently to the itemset counts in a transactions database. In a first-order knowledge base, the facts are interconnected by the rules and arguments; this dependency would be lost if we directly partitioned the knowledge base in these state-of-the-art ways. In our approach, we preserve data dependency by relaxing the non-overlapping requirement and designing a new algorithm that partitions the knowledge base into independent but possibly overlapping partitions.

Mining frequent subgraphs. The rule mining problem is closely related to mining frequent subgraphs [104–107], where the knowledge base is represented as a single directed graph. The major difference is the notion of a subgraph: in the first-order rule mining problem, a subgraph is a graph pattern where the edges are labeled but the vertices are variables. This leads to an important consequence we address in this dissertation: a more sophisticated counting algorithm is needed to count the frequencies of grounded subgraphs [29], namely, subgraphs with variables substituted by constants. The grounding problem of parameterized subgraphs is not addressed in the frequent subgraph mining literature.

Quality control. Quality control has been a core research challenge in various data management systems. In databases, functional dependencies (FDs) and conditional functional dependencies (CFDs) [108, 109] ensure the integrity of the data. These approaches have recently been applied to graphs with graph functional dependencies [110]. In uncertain knowledge bases, an effective methodology for improving quality is to collect redundant data and evidence. For example, multiple information extractors have been used in IE systems for cross verification [5, 111]. Involving human knowledge in KB construction also proves effective: in NELL [6], the system collects human feedback to improve its internal extractors; similarly, crowdsourcing is applied to provide missing information [112] and verify uncertain data [113]. In addition, there has been increasing research on Question Answering (QA) systems [114] and on using QA systems to improve the recall of KBs [21]. However, the recent work

mainly focuses on improving extraction techniques; the problem of improving the quality of existing datasets remains challenging. We propose to study this problem by extending and combining techniques of using constraints, selective data collection, integrating data streams, knowledge fusion, and query-driven information extraction.

Efficient join processing. Multi-way joins are known to have tight bounds on their output size [115]. Recent algorithms [116–118] construct multi-way join plans that achieve worst-case optimal runtime with regard to the size bound by joining multiple relations at a time, avoiding the computation of intermediate results required by binary join plans. The bound is further improved by using functional dependencies [119, 120]. In a parallel environment, multi-way join algorithms are optimized by a data shuffling algorithm, HyperCube [121–124], where a single communication round distributes tuples to every reducer server that needs them for the local multi-way join. An empirical study of these algorithms [121] has proven their effectiveness on large datasets. The first-order mining problem differs from general multi-way join processing in several aspects. First, evaluating inference rules requires processing both the intermediate and final results, whereas general multi-way join performance is bounded only by the final output size. Second, while [116, 118] and our algorithm all use degree information to improve query plans, we leverage the semantics of inference rules to estimate data quality and restrict the size of each intermediate result; [116, 118] use degree information to construct the query plan, and the information has no implication for data quality. Third, we use a novel partitioning algorithm to bound the input size of each join. [116] partitions input relations to bucket light and heavy join tuples, but bounding the input sizes of joins by partitioning has not been explored before.

Knowledge expansion. The problem of knowledge expansion is to discover new knowledge from existing knowledge bases [13]. Typical ways to construct and expand knowledge bases include human collaboration [1, 3], information extraction [2, 7, 8, 10, 125], knowledge

fusion and integration [5, 25, 26, 111], inference and reasoning [13, 28, 42, 86, 87, 98], or combinations of multiple approaches [6, 7]. Compared with other approaches, rule-based models are more explainable (people apprehend a rule by inspecting it), persistable (rules can be saved and shared as normal text files), and expressive (rules can express inference, constraints, and algorithms), and they support efficient incremental maintenance of knowledge bases [52]. The state-of-the-art rule-based inference engines [13, 28, 126] attain efficiency by modeling knowledge bases as relational tables and applying the inference rules using database queries. However, due to the size of the input knowledge bases and the rules, the join queries scale only to knowledge bases with at most 10M facts. With the parallelization and partitioning techniques, our inference algorithm scales to Freebase with 388M facts and 112M entities.

Cross validation. To evaluate the quality of expanded knowledge, a general and effective method is cross validation. In [98, 125], the authors use cross validation to evaluate facts inferred by the Path Ranking Algorithm. [86] uses cross validation to evaluate an online Bayesian logic program [127] learning algorithm for extracted knowledge. AMIE+ [22] learns association rules from an input knowledge base. It uses a similar approach for factual evaluation: learning inference rules from an older and smaller version of a knowledge base (e.g., YAGO2) and validating on a newer version (e.g., YAGO2s), combined with external sources (e.g., Wikipedia). However, this approach does not generalize to knowledge bases without significantly different versions. To compare Freebase rules with AMIE+, we apply the more general cross validation methodology and show that it achieves results comparable to those reported by AMIE+.

CHAPTER 8
CONCLUSION AND FUTURE WORK

In this dissertation, I present the knowledge expansion and ontological pathfinding algorithms to expand web-scale knowledge bases by inferring implicit knowledge using first-order inference. These algorithms form the core components of a probabilistic knowledge base system, ProbKB. We make the following contributions to achieve efficiency, scalability, and quality:

Knowledge Expansion. We design a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. We optimize relational knowledge bases on massively parallel processing databases to achieve further scalability. We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

Ontological Pathfinding. We design the ontological pathfinding algorithm that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

Spreading Activation. We design the SemMemDB system for spreading activation query processing over semantic and knowledge networks. We use the relational model for semantic networks and present an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. SemMemDB leverages mature query

engines and optimizers from databases that generate efficient query plans for memory activation and retrieval. With the massive storage capacity supported by modern database systems, SemMemDB can potentially support human-scale memories. We evaluate SemMemDB using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that SemMemDB runs more than 500 times faster than prior works.

While our prior works focus on scaling up learning and inference over large static knowledge bases, we propose to design efficient algorithms that perform those tasks on dynamic and streaming knowledge bases, guided by the intuition that the learned models should be maintained incrementally. For instance, in DeepDive [2, 52], developers iteratively construct knowledge bases by adding facts and rules to an existing model, and the system performs efficient MLN grounding [28] and MCMC sampling by re-using past results and appropriately focusing computation on the parts of the knowledge base that are most relevant to the updates. The key to efficiency is to avoid re-computation over the entire dataset for each update. In the context of dynamic large-scale knowledge bases, it is essential to continuously maintain the models instead of spending hours re-training them from scratch for individual additions of facts or rules. Motivated by these results, we propose online deductive and inductive reasoning methods to perform soft inference over uncertain knowledge bases. We further extend the incremental inference techniques to belief revision, retraction, and contraction operations in belief bases [128].

In the process of learning and inference, errors come from incorrect rules, incorrect facts, ambiguous entities, and propagated errors [13]. We use semantic constraints to detect these errors while new facts are inferred and gathered by IE. For example, a country has one capital city; a person has one full-time job at any given time. Violations of these constraints indicate errors. In addition, we adopt our work [113] that uses mutual information and token entropy to select the most uncertain data from a probabilistic database and post them to Amazon Mechanical Turk for verification. In the context of dynamic knowledge bases, we propose human-guided data cleaning by selecting an optimal set of nodes in the knowledge graph

for manual verification: using the theory of Value of Information in probabilistic graphical models [129], we select those facts giving the largest return with respect to the remaining state of the knowledge graph.

In summary, we propose to develop off-line and on-line learning, incremental inference, and optimized selective data collection models and algorithms over dynamically changing and uncertain knowledge graphs. The learning algorithms over knowledge graphs would generate a set of first-order rules (e.g., logical inference) and constraints (e.g., functional dependencies). The inference algorithms would reason with uncertain and correlated facts to generate new facts. The value of information algorithms would rank uncertain facts to be validated through additional data collection, such as Amazon Mechanical Turk or further information extraction, based on some optimization function. The research objective of this proposal is to (1) learn statistical first-order rules over large-scale uncertain knowledge graphs; (2) update and maintain existing models in the context of streaming and dynamic knowledge bases; (3) perform incremental reasoning and inference over dynamic knowledge graphs; and (4) use properties of submodular functions to efficiently compute and optimize the value of information over dynamic uncertain KB graphs.

8.1 Inductive Reasoning

We propose to extend the ontological pathfinding algorithm to mine semantic constraints. Constraints are an effective tool in database systems to ensure data validity [51]. In knowledge bases, we use a similar concept called semantic constraints to ensure the validity of data. These constraints are derived from the semantics of the extracted relations, e.g., a person was born in only one country; a country has only one capital city. Conceptually, semantic constraints are hard rules that must be satisfied by all possible worlds. Violations, if any, indicate potential errors. One useful form of constraints is functional constraints [20, 42, 68]. They help detect errors from propagation and incorrect rules, and can be used to detect ambiguous entities that invalidate equality checks in join queries. In [13], functional constraints improve the precision of inferred facts by 0.6. The constraints are mined from the TextRunner dataset

[20, 68] with 7.5 million extractions from web text corpora. We propose to scale up the mining algorithm with the parallelization and partitioning algorithms that facilitate inference rule mining over Freebase [11, 12]. Another special form of semantic constraints is conditional constraints, similar to conditional functional dependencies [108] in database systems. For example, while in general a person may have any number of email accounts, every UF employee has only one "@ufl.edu" email address; in this case, the functional constraint is conditioned on the value of the email address. As another example, every US citizen has an SSN. These conditional constraints correspond to the following first-order constraints:

    emailAddress(x, z1), emailAddress(x, z2), LIKE(z1, "%@ufl.edu"), LIKE(z2, "%@ufl.edu") → z1 = z2,

    isCitizenOf(x, "USA") → hasSSN(x, y).
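As a sketch of how such a conditional constraint could be checked over a relational encoding, assume a hypothetical emailAddress(person, address) table (the table and column names are illustrative, not part of our system):

-- Persons violating the "one @ufl.edu address" conditional constraint.
SELECT DISTINCT e1.person
FROM emailAddress e1
JOIN emailAddress e2
  ON e1.person = e2.person AND e1.address < e2.address
WHERE e1.address LIKE '%@ufl.edu'
  AND e2.address LIKE '%@ufl.edu';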

The major difference between conditional constraints and general constraints is that the conditions consider individual values. We propose to extend previous work on discovering conditional functional dependencies [130] to the context of knowledge bases to support conditional constraints.

8.2 Online Inductive Reasoning

We explore efficient methods for online learning over dynamic probabilistic knowledge bases that incrementally expand over time. For example, the NELL system [6] runs continuously to extract information from the web; DeepDive [2] employs an "engineering-in-the-loop" development cycle to iteratively add new rules and data to its contents. New facts and rules are incorporated into the knowledge bases, potentially updating any existing models trained from them. These dynamic knowledge bases pose a significant challenge to state-of-the-art learning methods if we re-train the models for each update. Instead, researchers are exploring new ways to incrementally maintain statistical models over changing data [52, 131, 132]. These algorithms improve efficiency by appropriately focusing computation on the updated part of the data instead of re-building the models over the entire dataset for each update.

Dynamic streams of data have also been used in data mining. [133] incrementally trains an MLN classifier on data streams by discretizing them into "data chunks"; for each chunk, it incrementally trains the MLN using the weights from the previous chunk, selectively applying the training algorithm based on the data distribution. It builds on MOA (Massive Online Analysis), an open framework for real-time analytics over data streams [134]. Extensions to SPARQL have been proposed to query such datasets: C-SPARQL [135] answers queries using several new techniques, including window definitions over data streams and continuous query answering by registering and storing the queries, and it supports combining multiple streams. The paper evaluates on datasets from social networks and oil production. C-SPARQL has been used to perform deductive and inductive reasoning for semantic social media analytics over streams, by using C-SPARQL queries to express "stream reasoning" algorithms [136].

8.3 Incremental Deductive Reasoning

As new information is gathered by dynamic knowledge base systems, we need to keep track of time-sensitive or even inconsistent information. For example, for the simple query "What positions did Barack Obama hold?", Wikidata provides the information shown in Table 8-1. As we see in this example, Barack Obama holds each position for only a period of time.

Table 8-1. PositionsHeld(Barack Obama, *) triples in Wikidata.

  Position Held            Start Time        End Time
  President of the USA     20 January 2009   NA
  United States Senator    3 January 2005    16 November 2008
  Illinois State Senator   8 January 1997    4 November 2004

The need to manage dynamic knowledge is also significant in aggregated sentiment and belief knowledge bases [137]. In Table 8-2, for example, each fact is "believed by" an entity with a confidence value estimated from the sources of extraction. As new information arrives, we need to efficiently maintain the facts, the "believed by" relationships, and the corresponding confidence values.

Table 8-2. Aggregated knowledge base of beliefs.

  Entity       Is believed to          Believed by   Confidence
  Sen. Smith   Cut defense spending    30% voters    medium
  Bin Laden    Hide in Pakistan        CIA           high

Motivated by these temporal data, we propose to extend the data model to support attributes including temporal and spatial information. We plan to use these attributes to (1) model temporal and spatial information, and (2) enforce semantic constraints that ensure the integrity of the knowledge base. The constraints are either specified by the user or mined from the web.

8.4 Abductive Reasoning

Abductive reasoning involves deciding what is the most likely inference that can be made from a set of observations. Beginning with an incomplete set of observations, abductive reasoning decides the likeliest possible explanation for the set. In medical diagnosis, for example, given a set of symptoms, we need to decide which diagnosis would best explain most of them. Abduction is useful in probabilistic databases [138], where we maintain lineage information for the database tuples. Lineage contains possible explanations for each query tuple; selecting the best explanation is a form of abductive reasoning in probabilistic query processing. We use ground factor graphs to store belief lineage that can be used for abductive reasoning. As the final result of grounding, ground factor graphs serve as an intermediate representation that can be input to probabilistic inference engines, e.g., [43, 44]. Moreover, since a ground factor graph records the causal relationships among facts, it contains the entire lineage and can be queried [138].

8.5 Knowledge Verification

The application of inference rules has the potential to introduce errors into the KB. As stated previously, semantic and conditional constraints can be learned to reduce certain errors

and improve accuracy. Other types of errors, such as extraction errors or sparse non-conflicting inference errors, can be corrected by judicious application of a small amount of human feedback. In the factor graph representation of a probabilistic knowledge base, individual facts are modeled as nodes in a graph. Given this construction, human-guided data cleaning reduces to the problem of selecting an optimal set of nodes in the graph for manual verification. This human-machine hybridization is designed to raise inference precision and control the propagation of errors in knowledge graph inference at near-optimal cost. Using the theory of Value of Information in probabilistic graphical models [129], we can select those facts whose knowledge gives the largest return with respect to the remaining state of the knowledge graph. Seminal work has already been done in deriving optimal or near-optimal algorithms and performance guarantees for both chain and general graphical models.

8.6 Summary

In summary, we propose to build a probabilistic knowledge graph system with a statistical relational learning and inference engine to manage constantly evolving knowledge. Statistical relational learning combines statistical methods, such as probabilistic graphical models that capture uncertainty, with first-order logic that captures the relational properties of a domain. We propose the following research aims as the building blocks of the system: statistical learning and inference over large-scale knowledge bases, online learning over dynamic knowledge graphs, incremental reasoning and inference over dynamic knowledge graphs, and quality control using semantic constraints, information theory, and human feedback. In our prior work, we develop the first learning and inference system over the largest public knowledge base, Freebase, using a series of parallelization and partitioning techniques, and mine 36,625 inference rules and 927 million new facts. Observing that modern knowledge bases evolve every day, capturing new information or user feedback from the web as exemplified by NELL and DeepDive, and motivated by existing works on incremental view maintenance (DRed) and incremental MCMC (DeepDive), we propose to extend our system to support dynamic knowledge bases via online learning,

incremental reasoning, and optimized data collection, so as to efficiently maintain previously trained mining models and to avoid computation over the entire knowledge graph for incremental updates to the dynamic knowledge graph.

REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A nucleus for a web of open data. Springer, 2007.

[2] C. Zhang, "Deepdive: A data management system for automatic knowledge base construction," Ph.D. dissertation, UW-Madison, 2015.

[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.

[4] Google Official Blog, "Introducing the knowledge graph: things, not strings," http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.

[5] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: A web-scale approach to probabilistic knowledge fusion," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[6] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, "Never-ending learning," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.

[7] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell, "Toward an architecture for never-ending language learning," in AAAI, 2010.

[8] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, "Open information extraction: The second generation," in IJCAI, 2011.

[9] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction for the web," in IJCAI, 2007.

[10] W. Wu, H. Li, H. Wang, and K. Q. Zhu, "Probase: A probabilistic taxonomy for text understanding," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.

[11] Y. Chen, D. Z. Wang, and S. Goldberg, "Scalekb: scalable learning and inference over large knowledge bases," The VLDB Journal, vol. 25, no. 6, pp. 893–918, 2016.

[12] Y. Chen, S. Goldberg, D. Z. Wang, and S. S. Johri, "Ontological pathfinding: Mining first-order knowledge from large knowledge bases," in Proceedings of the 2016 International Conference on Management of Data. ACM, 2016, pp. 835–846.

[13] Y. Chen and D. Z. Wang, “Knowledge expansion over probabilistic knowledge bases,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 649–660.
[14] D. Z. Wang, Y. Chen, S. Goldberg, C. Grant, and K. Li, “Automatic knowledge base construction using probabilistic extraction, deductive reasoning, and human feedback,” in Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, 2012, pp. 106–110.
[15] F. Mahdisoltani, J. Biega, and F. Suchanek, “Yago3: A knowledge base from multilingual Wikipedias,” in 7th Biennial Conference on Innovative Data Systems Research. CIDR Conference, 2014.
[16] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007.
[17] T. Berners-Lee, J. Hendler, O. Lassila et al., “The semantic web,” Scientific American, 2001.
[18] O. Etzioni, “Search needs a shake-up,” Nature, 2011.
[19] T. Lin, O. Etzioni et al., “Identifying functional relations in web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.
[20] A. Ritter, D. Downey, S. Soderland, and O. Etzioni, “It’s a contradiction—no, it’s not: a case study using functional relations,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
[21] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin, “Knowledge base completion via search-based question answering,” in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.
[22] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek, “Fast rule mining in ontological knowledge bases with AMIE+,” The VLDB Journal, 2015.
[23] Y. Peng, X. Zhou, D. Z. Wang, and C. V. Fang, “Scalable image retrieval with multimodal fusion,” in FLAIRS Conference, 2015.
[24] S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis, “Learning first-order horn clauses from web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.
[25] D. Wijaya, P. P. Talukdar, and T. Mitchell, “Pidgin: ontology alignment using web text as interlingua,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013, pp. 589–598.

[26] F. M. Suchanek, S. Abiteboul, and P. Senellart, “Paris: Probabilistic alignment of relations, instances, and schema,” Proceedings of the VLDB Endowment, 2011.
[27] X. Zhou, Y. Chen, and D. Z. Wang, “ArchimedesOne: Query processing over probabilistic knowledge bases,” Proceedings of the VLDB Endowment, vol. 9, no. 13, 2016.
[28] F. Niu, C. Ré, A. Doan, and J. Shavlik, “Tuffy: Scaling up statistical inference in markov logic networks using an rdbms,” Proceedings of the VLDB Endowment, 2011.
[29] M. Richardson and P. Domingos, “Markov logic networks,” Machine Learning, 2006.
[30] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, and P. Domingos, “The Alchemy system for statistical relational AI,” Department of Computer Science and Engineering, University of Washington, Seattle, WA, Tech. Rep., 2006.
[31] T. N. Huynh, “Discriminative learning with markov logic networks,” DTIC Document, Tech. Rep., 2009.
[32] S. Kok, “Structure learning in markov logic networks,” Ph.D. dissertation, University of Washington, 2010.
[33] F. Crestani, “Application of spreading activation techniques in information retrieval,” Artificial Intelligence Review, vol. 11, no. 6, pp. 453–482, 1997.
[34] J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y. Qin, “An integrated theory of the mind,” Psychological Review, vol. 111, no. 4, p. 1036, 2004.
[35] J. R. Anderson, How can the human mind occur in the physical universe? Oxford University Press, 2007.
[36] EMC, “Greenplum database: Critical mass innovation,” EMC, Tech. Rep., 2010.
[37] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, “AMIE: association rule mining under incomplete evidence in ontological knowledge bases,” in Proceedings of the 22nd International Conference on World Wide Web, 2013.
[38] S. Muggleton, “Inductive logic programming: derivations, successes and shortcomings,” ACM SIGART Bulletin, 1994.
[39] B. Tausend, “Representing biases for inductive logic programming,” in Machine Learning: ECML-94. Springer, 1994.
[40] J. Biega, E. Kuzey, and F. M. Suchanek, “Inside yago2s: A transparent information extraction architecture,” in Proceedings of the 22nd International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 2013.
[41] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “Yago2: A spatially and temporally enhanced knowledge base from wikipedia,” Artificial Intelligence, 2013.

[42] S. Schoenmackers, O. Etzioni, and D. S. Weld, “Scaling textual inference to the web,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
[43] C. Zhang and C. Ré, “Towards high-throughput gibbs sampling at scale: A study across storage managers,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
[44] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “Graphlab: A new framework for parallel machine learning,” in Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI 10), 2010.
[45] A. Horn, “On sentences which are true of direct unions of algebras,” The Journal of Symbolic Logic, 1951.
[46] J. R. Quinlan, “Learning logical definitions from relations,” Machine Learning, 1990.
[47] S. Muggleton, “Inverse entailment and progol,” New Generation Computing, 1995.
[48] J. Widom, “Trio: A system for integrated management of data, accuracy, and lineage,” in CIDR, 2005.
[49] P. Singla and P. Domingos, “Memory-efficient inference in relational domains,” in AAAI, 2006.
[50] S. S. Lightstone, T. J. Teorey, and T. Nadeau, Physical Database Design: the database professional’s guide to exploiting indexes, views, storage, and more. Morgan Kaufmann, 2010.
[51] J. D. Ullman, H. Garcia-Molina, and J. Widom, Database Systems: The Complete Book. Prentice Hall, Upper Saddle River, 2001.
[52] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré, “Incremental knowledge base construction using deepdive,” Proceedings of the VLDB Endowment, 2015.
[53] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012.
[54] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez, “The datapath system: a data-centric analytic processing engine for large data warehouses,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[55] D. Vrandečić and M. Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.

[56] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, 2008.
[57] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, 2009.
[58] D. Saumier and H. Chertkow, “Semantic Memory,” Current Neurology and Neuroscience Reports, vol. 2, no. 6, pp. 516–522, 2002.
[59] J. F. Sowa, “Semantic Networks,” in Encyclopedia of Cognitive Science. John Wiley & Sons, Ltd, 2006.
[60] A. M. Collins and E. F. Loftus, “A spreading-activation theory of semantic processing,” Psychological Review, vol. 82, no. 6, p. 407, 1975.
[61] J. R. Anderson, M. Matessa, and C. Lebiere, “ACT-R: A theory of higher level cognition and its relation to visual attention,” Human-Computer Interaction, vol. 12, no. 4, pp. 439–462, 1997.
[62] S. Douglass, J. Ball, and S. Rodgers, “Large declarative memories in ACT-R,” in Proceedings of the 9th International Conference on Cognitive Modeling, Manchester, United Kingdom, 2009.
[63] S. A. Douglass and C. W. Myers, “Concurrent knowledge activation calculation in large declarative memories,” in Proceedings of the 10th International Conference on Cognitive Modeling, 2010, pp. 55–60.
[64] J. G. Lorenzo, J. E. L. Gayo, and J. M. Á. Rodríguez, “Applying MapReduce to Spreading Activation Algorithm on Large RDF Graphs,” in Information Systems, E-learning, and Knowledge Management Research. Springer, 2013, pp. 601–611.
[65] D. Bothell, ACT-R 6.0 Reference Manual, 2007.
[66] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka Jr., and T. M. Mitchell, “Coupled semi-supervised learning for information extraction,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 101–110.
[67] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld, “Open information extraction from the web,” Commun. ACM, vol. 51, pp. 68–74, 2008.
[68] T. Lin, O. Etzioni et al., “Identifying functional relations in web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1266–1276.
[69] S. Riedel, L. Yao, A. McCallum, and B. M. Marlin, “Relation extraction with matrix factorization and universal schemas,” in Proceedings of NAACL-HLT, 2013.

[70] J. Pujara, H. Miao, L. Getoor, and W. Cohen, “Knowledge graph identification,” in International Semantic Web Conference. Springer, 2013, pp. 542–557.
[71] H. Poon and P. Domingos, “Joint inference in information extraction,” in AAAI, 2007.
[72] P. Singla and P. Domingos, “Entity resolution with markov logic,” in Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06). IEEE, 2006.
[73] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, 2001.
[74] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[75] Y. Chen and D. Z. Wang, “Web-scale knowledge inference using markov logic networks,” in ICML Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs, 2013.
[76] M. Wick, A. McCallum, and G. Miklau, “Scalable probabilistic databases with factor graphs and mcmc,” Proceedings of the VLDB Endowment, 2010.
[77] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin, “Parallel gibbs sampling: From colored fields to thin junction trees,” in International Conference on Artificial Intelligence and Statistics, 2011.
[78] M. L. Wick and A. McCallum, “Query-aware mcmc,” in Advances in Neural Information Processing Systems, 2011.
[79] H. Poon and P. Domingos, “Sound and efficient inference with probabilistic and deterministic dependencies,” in AAAI, 2006.
[80] P. Singla and P. M. Domingos, “Lifted first-order belief propagation,” in AAAI, 2008.
[81] V. Gogate and P. Domingos, “Probabilistic theorem proving,” in UAI. Corvallis, Oregon: AUAI Press, 2011.
[82] S. Kok and P. Domingos, “Learning markov logic network structure via hypergraph lifting,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[83] ——, “Learning markov logic networks using structural motifs,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[84] J. Van Haaren and J. Davis, “Markov network structure learning: A randomized feature generation approach,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[85] T. N. Huynh and R. J. Mooney, “Discriminative structure and parameter learning for markov logic networks,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[86] S. Raghavan and R. J. Mooney, “Online inference-rule learning from natural-language extractions,” in AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.
[87] B. L. Richards and R. J. Mooney, “Learning relations by pathfinding,” in Proc. of AAAI-92, 1992.
[88] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li et al., “The madlib analytics library: or mad skills, the sql,” Proceedings of the VLDB Endowment, 2012.
[89] D. Z. Wang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick, “Hybrid in-database inference for declarative information extraction,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011.
[90] K. Li, D. Z. Wang, A. Dobra, and C. Dudley, “Uda-gist: an in-database framework to unify data-parallel and state-parallel analytics,” Proceedings of the VLDB Endowment, 2015.
[91] D. Z. Wang, Y. Chen, C. E. Grant, and K. Li, “Efficient in-database analytics with graphical models,” IEEE Data Eng. Bull., 2014.
[92] Y. Chen, M. Petrovic, and M. Clark, “Semmemdb: In-database knowledge activation,” in FLAIRS Conference, 2014.
[93] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum, “Flumejava: easy, efficient data-parallel pipelines,” in ACM SIGPLAN Notices. ACM, 2010.
[94] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, 2010.
[95] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[96] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, 2012.
[97] Y. Cheng, C. Qin, and F. Rusu, “Glade: big data analytics made easy,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 697–700.

[98] N. Lao, T. Mitchell, and W. W. Cohen, “Random walk inference and learning in a large scale knowledge base,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
[99] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” ACM SIGMOD Record, 1993.
[100] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association rules,” VLDB Endowment, 1994.
[101] J. S. Park, M.-S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules,” SIGMOD Record, 1995.
[102] A. Savasere, E. Omiecinski, and S. B. Navathe, “An efficient algorithm for mining association rules in large databases,” VLDB Endowment, 1995.
[103] J. Han and J. Pei, “Mining frequent patterns by pattern-growth: methodology and implications,” ACM SIGKDD Explorations Newsletter, 2000.
[104] M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis, “Grami: Frequent subgraph and pattern mining in a single large graph,” Proceedings of the VLDB Endowment, 2014.
[105] L. Zou, L. Chen, and M. T. Özsu, “Distance-join: Pattern match query in a large graph database,” Proceedings of the VLDB Endowment, 2009.
[106] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” Data Mining and Knowledge Discovery, 2005.
[107] ——, “Frequent subgraph discovery,” in Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM). IEEE, 2001.
[108] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007.
[109] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, 2007, pp. 315–326.
[110] W. Fan, Y. Wu, and J. Xu, “Functional dependencies for graphs,” in Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
[111] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang, “From data fusion to knowledge fusion,” Proceedings of the VLDB Endowment, 2014.
[112] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “Crowddb: answering queries with crowdsourcing,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011, pp. 61–72.

[113] S. L. Goldberg, D. Z. Wang, and T. Kraska, “Castle: Crowd-assisted system for text labeling and extraction,” in First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[114] M. Paşca, “Open-domain question answering from large text collections,” Computational Linguistics, vol. 29, no. 4, pp. 665–667, 2003.
[115] A. Atserias, M. Grohe, and D. Marx, “Size bounds and query plans for relational joins,” in Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS’08). IEEE, 2008, pp. 739–748.
[116] M. Joglekar and C. Ré, “It’s all a matter of degree: Using degree information to optimize multiway joins,” in Proceedings of the International Conference on Database Theory (ICDT), 2016.
[117] T. L. Veldhuizen, “Leapfrog triejoin: A simple, worst-case optimal join algorithm,” in Proceedings of the International Conference on Database Theory (ICDT), 2014.
[118] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, “Worst-case optimal join algorithms (extended abstract),” in Proceedings of the 31st Symposium on Principles of Database Systems. ACM, 2012.
[119] M. A. Khamis, H. Q. Ngo, and D. Suciu, “Computing join queries with functional dependencies,” in Proceedings of the Symposium on Principles of Database Systems, 2016.
[120] G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant, “Size and treewidth bounds for conjunctive queries,” Journal of the ACM (JACM), 2012.
[121] S. Chu, M. Balazinska, and D. Suciu, “From theory to practice: Efficient join query evaluation in a parallel database system,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[122] P. Beame, P. Koutris, and D. Suciu, “Skew in parallel query processing,” in Proceedings of the 33rd Symposium on Principles of Database Systems. ACM, 2014.
[123] ——, “Communication steps for parallel query processing,” in Proceedings of the 32nd Symposium on Principles of Database Systems. ACM, 2013.
[124] F. N. Afrati and J. D. Ullman, “Optimizing joins in a map-reduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010.
[125] N. Lao, A. Subramanya, F. Pereira, and W. W. Cohen, “Reading the web with learned syntactic-semantic inference rules,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.

[126] F. Niu, C. Zhang, C. Ré, and J. Shavlik, “Scaling inference for markov logic with a task-decomposition approach,” arXiv preprint arXiv:1108.0294, 2011.
[127] K. Kersting and L. De Raedt, “Bayesian logic programming: Theory and tool,” Statistical Relational Learning, p. 291, 2007.
[128] J. Van Benthem, “Dynamic logic for belief revision,” Journal of Applied Non-Classical Logics, vol. 17, no. 2, pp. 129–155, 2007.
[129] A. Krause and C. Guestrin, “Optimal value of information in graphical models,” Journal of Artificial Intelligence Research, vol. 35, pp. 557–591, 2009.
[130] W. Fan, F. Geerts, J. Li, and M. Xiong, “Discovering conditional functional dependencies,” IEEE Transactions on Knowledge and Data Engineering, 2011.
[131] M. L. Koc and C. Ré, “Incrementally maintaining classification using an rdbms,” Proceedings of the VLDB Endowment, 2011.
[132] A. Nath and P. M. Domingos, “Efficient belief propagation for utility maximization and repeated inference,” in AAAI, 2010.
[133] S. Chandra, J. Sahs, L. Khan, B. Thuraisingham, and C. Aggarwal, “Stream mining using statistical relational learning,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 743–748.
[134] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “Moa: Massive online analysis,” Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.
[135] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus, “C-sparql: a continuous query language for rdf data streams,” International Journal of Semantic Computing, vol. 4, no. 1, pp. 3–25, 2010.
[136] D. Barbieri, D. Braga, S. Ceri, E. Della Valle, Y. Huang, V. Tresp, A. Rettinger, and H. Wermser, “Deductive and inductive stream reasoning for semantic social media analytics,” IEEE Intelligent Systems, vol. 25, no. 6, pp. 32–41, 2010.
[137] Y. Wilks, M. Clark, A. Dalton, I. Perera et al., “Cubism: Belief, anomaly and social constructs,” Interaction Studies, 2014.
[138] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom, “Uldbs: Databases with uncertainty and lineage,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 953–964.

BIOGRAPHICAL SKETCH
Yang Chen received his Ph.D. degree from the University of Florida in December 2016. During his Ph.D. study, he worked with Dr. Daisy Zhe Wang in the Data Science Research (DSR) Lab, Department of Computer and Information Science and Engineering, University of Florida. His research focused on knowledge bases, databases, data mining, and scalable algorithms. Yang designed scalable inductive and deductive reasoning algorithms to mine rules and facts from web-scale knowledge bases, e.g., Freebase. His work on knowledge expansion and ontological pathfinding achieved the state of the art in first-order rule mining and formed the key components of a probabilistic knowledge base system,

ProbKB, published in the 2014 and 2016 SIGMOD conferences and the VLDB Journal. Yang served as a program committee member of WWW 2017, a reviewer for TKDE in 2014, and an external reviewer for the VLDB Journal, VLDB, CIDR, and IJCAI. Before coming to the University of Florida, Yang received his Bachelor of Engineering degree in 2011 from the University of Science and Technology of China. In the summer of 2013, Yang worked as a research intern with Dr. Micah H. Clark at the Florida Institute for Human and Machine Cognition. In the summers of 2014 and 2015, Yang worked as a software engineer intern at Google, with Dr. Sergey Melnik on the Spanner team and with Dr. Xiangyang Lan on the Mesa team. Yang will start working at Google in January 2017.
