SCALABLE LEARNING AND INFERENCE IN LARGE KNOWLEDGE BASES

By YANG CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2016

© 2016 Yang Chen

To my parents and family

ACKNOWLEDGMENTS

I owe my sincere gratitude to Dr. Daisy Zhe Wang for her gracious and adept guidance toward my Ph.D. degree. Her broad knowledge, inspiring teaching, and insightful feedback have profoundly influenced my work. Her meticulous review of my research and pursuit of significance facilitated my publications in SIGMOD and VLDB. Learning to write clearly and precisely from Dr. Wang is an especially invaluable experience I am blessed to have. It is my great honor to work with Dr. Wang to expand the scope of human knowledge.

I also received immeasurable help from Dr. Alin Dobra. His passionate lectures and luminous ideas inspired me in many aspects of designing efficient and scalable data mining algorithms. Moreover, I would like to thank Dr. Milenko Petrovic and Dr. Micah H. Clark for their helpful discussions on query rewriting and optimization during my internship at the Florida Institute for Human and Machine Cognition. My research benefits from the machine learning and statistics courses taught by Dr. Anand Rangarajan and Dr. Kshitij Khare. I am thankful to them and Dr. Jih-Kwon Peir for serving on my Ph.D. committee and for their suggestions on my work.

It is my pleasure to work with many brilliant colleagues: Dr. Christan Grant, Dr. Kun Li, Dr. Clint P. George, Sean Goldberg, Yang Peng, Morteza Shahriari Nia, Miguel E. Rodríguez, Xiaofeng Zhou, and Dihong Gong. Furthermore, I owe special thanks to Soumitra Siddharth Johri for working with me day and night on Grokit. I am also delighted to have met Yu Cheng from the University of California, Merced at the SIGMOD'14 and SIGMOD'16 conferences and on the Google campus to learn about their extensions of Datapath and its applications in big data. I was lucky to work with Dr. Xiangyang Lan on the Mesa database and Dr. Sergey Melnik on the Spanner database during my internships at Google. The experience of working with great people on global-scale projects has broadened my horizons of database technology. It arouses a desire within me to combine science and technology to tackle real-world problems.

Finally, I would like to thank my parents for their love and support over the 27 years of my life. They are the endless power that encourages me forward.

My research is partially supported by the National Science Foundation under IIS Award 1526753, the Defense Advanced Research Projects Agency under Grant FA8750-12-2-0348-2 (DEFT/CUBISM), a generous gift from Google, and DSR Lab sponsors: Pivotal, UF Law School, SurveyMonkey, Amazon, Sandia National Laboratories, Harris, Patient-Centered Outcomes Research Institute, and UF Clinical and Translational Science Institute.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Knowledge Expansion
  1.2 Ontological Pathfinding
  1.3 Spreading Activation
  1.4 Contributions

2 PRELIMINARIES
  2.1 Markov Logic Networks
    2.1.1 Grounding
    2.1.2 Inference
  2.2 First-Order Mining
    2.2.1 The Scalability Challenge
    2.2.2 Scoring Metrics
  2.3 Spark Basics

3 KNOWLEDGE EXPANSION OVER PROBABILISTIC KNOWLEDGE BASES
  3.1 Probabilistic Knowledge Bases
  3.2 Probabilistic Knowledge Bases: A Relational Perspective
    3.2.1 First-Order Horn Clauses
    3.2.2 The Relational Model
      3.2.2.1 Classes, relations, and relationships
      3.2.2.2 MLN rules
      3.2.2.3 Factor graphs
    3.2.3 Grounding
    3.2.4 MPP Implementation
  3.3 Quality Control
    3.3.1 Semantic Constraints
    3.3.2 Ambiguity Detection
    3.3.3 Rule Cleaning
    3.3.4 Implementation
  3.4 Experiments
    3.4.1 Performance
      3.4.1.1 Case study: the Reverb-Sherlock KB
      3.4.1.2 Effect of batch rule application
      3.4.1.3 Effect of MPP parallelization
    3.4.2 Quality
      3.4.2.1 Overall results
      3.4.2.2 Effect of semantic constraints
      3.4.2.3 Effect of rule cleaning
  3.5 Summary

4 MINING FIRST-ORDER KNOWLEDGE BY ONTOLOGICAL PATHFINDING
  4.1 First-Order Mining Problem
    4.1.1 The Scalability Challenge
    4.1.2 Scoring Metrics
  4.2 Ontological Pathfinding
    4.2.1 Rule Construction
    4.2.2 Partitioning
    4.2.3 Rule Pruning
    4.2.4 Parallel Rule Mining
      4.2.4.1 General rules
      4.2.4.2 General confidence scores
    4.2.5 Analysis
      4.2.5.1 Parallel mining
      4.2.5.2 Partitioning
  4.3 Experiments
    4.3.1 Overall Result
    4.3.2 Effect of Parallelism
    4.3.3 Effect of Partitioning
    4.3.4 Effect of Rule Pruning
  4.4 Summary

5 SCALABLE KNOWLEDGE EXPANSION AND INFERENCE
  5.1 Parallel Inference
  5.2 Quality Analysis
  5.3 Inference Results
  5.4 Summary

6 QUERY PROCESSING WITH KNOWLEDGE ACTIVATION
  6.1 Spreading Activation
  6.2 Using SemMemDB
    6.2.1 Base-Level Activation Calculation
    6.2.2 Spreading Activation Calculation
    6.2.3 Activation Score Calculation
  6.3 Evaluation
    6.3.1 Data Set
    6.3.2 Performance Overview
    6.3.3 Effect of Semantic Network Sizes
    6.3.4 Effect of Query Sizes
  6.4 Summary

7 RELATED WORK

8 CONCLUSION AND FUTURE WORK
  8.1 Inductive Reasoning
  8.2 Online Inductive Reasoning
  8.3 Incremental Deductive Reasoning
  8.4 Abductive Reasoning
  8.5 Knowledge Verification
  8.6 Summary

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets
3-1 Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets
3-2 Sherlock-Reverb KB statistics
3-3 Tuffy-T and ProbKB systems performance: the first three rows report the running time for the relevant queries in minutes; the last row reports the size of the result table
3-4 Quality control parameters. SC and RC stand for semantic constraints and rule cleaning, respectively
4-1 Example KB schema from the YAGO knowledge base
4-2 Histogram for “wasBornIn” and “diedIn”
4-3 OP experiment setup
4-4 Overall mining result
4-5 Schema graphs and histograms
6-1 DBPedia data set statistics
6-2 Moby Thesaurus II data set statistics
6-3 Experiment 1 result
6-4 Experiment 2 semantic network sizes and avg. execution times for single-iteration queries of 1000 nodes
6-5 Experiment 3 avg. execution times and result sizes for single-iteration queries of varying sizes
8-1 PositionsHeld(Barack Obama, *) triples in …
8-2 Aggregated knowledge base of beliefs

LIST OF FIGURES

2-1 Ground factor graph
3-1 ProbKB system architecture
3-2 Knowledge expansion example
3-3 Query plans generated by Greenplum with (A) and without (B) optimization. The annotations show the durations of each operation in a sample run joining M3 and a synthetic TΠ with 10M records
3-4 Quality control for knowledge expansion
3-5 Knowledge expansion performance comparison
3-6 Overall result of quality control
4-1 Example schema closure graph. Dashed arrows indicate inherited edges
4-2 Candidate rules R1–R3 constructed by cycle detection from Example 4.3. The first and last nodes in R1–R3 denote the same start and end node in the cycle
4-3 Partitioning algorithm: KB partitioned into smaller overlapping parts running independent mining algorithm instances
4-4 Rule table M, initial partitions M1, M2, and unpartitioned rule r
4-5 Parallel rule mining: KB divided into groups by join variables, each group running Group-Join to apply inference rules
4-6 Example Freebase rules
4-7 OP overall result on YAGO2s and Freebase
4-8 Sizes and runtime of Freebase partitions
4-9 Effect of partitioning and pruning
4-10 Example rules violating functional constraints
5-1 Knowledge expansion example
5-2 Cross validation: the knowledge base is partitioned into training and testing sets. The Ontological Pathfinding and parallel inference algorithms run on the training and test sets, respectively, inferred facts to be verified in the input KB
5-3 Inference performance
5-4 Cross validation result and example inferred facts
6-1 SemMemDB usage with DBpedia knowledge base
6-2 SemMemDB query plans

LIST OF ALGORITHMS

3-1 Grounding(TΠ, M1, ..., Mk)
4-1 Ontological-Pathfinding(Γ, s, m, t)
4-2 Closure(G = (V, E), v)
4-3 Recursive-Partition(Γ, Π, M, s, m)
4-4 Binary-Partition(Γ, M)
4-5 Parallel-Rule-Mining(facts, rules)
4-6 Group-Join(obj, ps = {pred, sub}, rules)
4-7 Check(rs = {rule.ID})
4-8 General-Rule-Mining(facts, rules)
4-9 PCA-Group-Join-Last(ji, fk, zi, rules)
4-10 PCA-Check(rs = {(y, ṙ)})
5-1 Infer(Γ, M, s, m, N)
5-2 Parallel-Inference(facts, rules)

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SCALABLE LEARNING AND INFERENCE IN LARGE KNOWLEDGE BASES

By Yang Chen

December 2016
Chair: Daisy Zhe Wang
Major: Computer Engineering

Recent years have seen escalating efforts in the construction of web-scale knowledge bases (e.g., DBPedia, DeepDive, Freebase, Google Knowledge Graph, NELL, OpenIE, ProBase, YAGO). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to limitations of human knowledge and information extraction algorithms, current knowledge bases are far from complete. To infer the missing knowledge, we propose the knowledge expansion and ontological pathfinding algorithms. The knowledge expansion algorithm applies first-order inference rules to infer facts from an incomplete knowledge base; the ontological pathfinding algorithm mines first-order inference rules from the knowledge bases. The knowledge expansion and ontological pathfinding algorithms form the core components of a probabilistic knowledge base system, ProbKB.

The knowledge expansion algorithm efficiently applies first-order inference rules to derive implicit facts from incomplete knowledge bases. The novel contributions to achieve efficiency and quality include: 1) We present a formal definition and a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm that applies inference rules in batches; 2) We implement ProbKB on massive parallel processing databases to achieve further scalability; and 3) We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

The ontological pathfinding algorithm mines first-order inference rules from these knowledge bases. It scales up via a series of optimization techniques: a new rule mining algorithm to parallelize join queries, a pruning strategy to eliminate unsound and resource-consuming rules before applying them, and a novel partitioning algorithm to break the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing system achieves this scale.

We support knowledge queries with spreading activation, a popular way of simulating human memory in semantic networks. We design a relational model for semantic networks and an efficient SQL-based spreading activation algorithm. We leverage the mature query engines and optimizers that generate efficient query plans for memory activation and retrieval. Our system supports human-scale memories with the massive storage capacity provided by modern database systems. We evaluate the spreading activation queries in a comprehensive experimental study using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works.

Based on these contributions, we propose a probabilistic knowledge base system, ProbKB, that manages web-scale knowledge by scalable learning and inference. We validate ProbKB's effectiveness with web knowledge bases including Freebase and YAGO. For future work, we propose to extend the previous contributions to dynamic knowledge bases and data streams and to support other types of automatic reasoning, including abductive and defeasible reasoning.

CHAPTER 1
INTRODUCTION

Recent developments in information extraction and data management systems have spurred escalating efforts in constructing large knowledge bases (KBs). These knowledge bases store information in a structured format, allowing for efficient processing and querying. Examples of these knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], Knowledge Vault [5], NELL [6, 7], OpenIE [8, 9], ProBase [10], ProbKB [11–14], and YAGO [15, 16]. They store structured information about real-world people, places, organizations, etc., paving the way for the semantic web [17] and semantic search [18] movements that revolutionize keyword matching for search. Moreover, these knowledge bases have been used for data cleaning [19, 20], [21], data mining [11, 12, 22], and multi-modal search [23]. To support these applications, researchers use various methods to construct knowledge bases at scale: human crafting (DBpedia, Freebase), information extraction (DeepDive, OpenIE, Probase), reasoning and inference [13, 22, 24], knowledge fusion [5, 25, 26], or a combination of them (NELL).

Despite the escalating efforts in automatic knowledge base construction, current knowledge bases are still incomplete or uncertain due to limitations of human knowledge or the probabilistic nature of information extraction algorithms. For example, the Wikipedia pages state that Kale is rich in calcium and that calcium helps prevent osteoporosis, but we need to infer that Kale helps prevent osteoporosis. In this dissertation, we study the problem of first-order learning and inference in large knowledge bases to derive implicit facts in web-scale knowledge bases using inference rules. An inference rule is a first-order Horn clause that discovers implicit facts. As an example, the following rule expands knowledge of the health properties of vegetables:

contains(x, z), preventsDisease(z, y) → preventsDisease(x, y).

We propose the knowledge expansion algorithm to apply batches of inference rules and the ontological pathfinding algorithm to mine inference rules from knowledge bases. The algorithms scale learning and inference to Freebase, the largest public knowledge base. The mining and inference algorithms are core components of an ongoing probabilistic knowledge base system project, Archimedes [27].

1.1 Knowledge Expansion

To efficiently support knowledge expansion, we design a relational model for probabilistic knowledge bases, allowing an efficient SQL-based inference algorithm that applies inference rules in batches. Our approach is motivated by two observations: 1) Inference rules can be modeled in relational databases as a first-class citizen rather than stored in ordinary files; 2) We can use join queries to apply inference rules in batches, rather than one query per rule. Using the relational knowledge base model, the inference algorithm is expressed as SQL queries that operate on the facts and rules tables, applying all rules in one table at a time. For Freebase rules, the number of queries in each iteration is reduced from 36,625 to 6 (depending on rule structures). Our approach improves performance by more than 200 times compared to the state-of-the-art engine, Tuffy [28], when we have a large number of inference rules. We achieve another speedup of 6.3× by leveraging the shared-nothing massive parallel processing (MPP) architecture and general MPP optimizations to maximize data collocation, independence, and parallelism.

We support uncertainty using Markov logic networks (MLNs) [29], the standard model to represent uncertain facts and rules. We perform two steps in MLN inference:

• grounding: constructing a ground factor graph that encodes the probability distribution of all observed and inferred facts; and

• marginal inference: computing the marginal distribution for individual facts.

The state-of-the-art MLN inference engine, Tuffy, uses a relational database management system (DBMS) and achieves a significant speed-up for grounding compared to an earlier inference engine, Alchemy [30]. Despite the improvement, Tuffy does not have satisfactory performance on the Reverb-Sherlock knowledge base since it has a large number of rules (30,912): Tuffy uses as many as 30,912 SQL queries to apply them all in each iteration.

ProbKB improves performance by modeling the rules in 6 relational tables and using 6 SQL queries to apply the rules in batches. Furthermore, all existing MLN implementations are designed to work with small, clean MLN programs carefully crafted by humans. Thus, they are prone to the inaccuracies and errors of machine-constructed MLNs and have no mechanism to detect and recover from errors. To handle these cases, we combine semantic constraints, ambiguity detection, and rule cleaning to prevent errors from propagating along the inference chain. As a result, we increase the precision by 0.61.

1.2 Ontological Pathfinding

Mining Horn clauses has been studied extensively in inductive logic programming. However, today's knowledge bases pose several new challenges. First, knowledge bases are often prohibitively large. For example, as of this writing, Freebase has 112 million entities and 388 million facts. None of the existing rule mining algorithms efficiently supports knowledge bases of this size. Second, knowledge bases implement the open world assumption, implying that we have only positive examples for rule mining. To address these challenges, a number of new approaches have been proposed: Sherlock [24], AMIE+ [22], Markov logic structure learning [31, 32], etc. Still, new techniques need to be invented to scale state-of-the-art approaches to knowledge bases of billions of facts.

In this dissertation, we propose the Ontological Pathfinding algorithm (OP) to tackle the large-scale rule mining problem. We focus on scalability and design a series of parallelization and optimization techniques to achieve web scale. Following the relational knowledge base model [13], we store inference rules in relational tables and use join queries to apply them in batches. The relational approach outperforms state-of-the-art algorithms by orders of magnitude on medium-sized knowledge bases [13]. To scale to larger knowledge bases, we parallelize the mining algorithm by dividing the input knowledge base into smaller groups running parallel in-memory joins. The parallel mining algorithm can be implemented on

state-of-the-art cluster computing frameworks to achieve maximum utilization of available computation resources. Furthermore, even if we parallelize the mining algorithm, the parallel tasks depend on each other. In particular, the tasks need to shuffle data between stages. As knowledge bases expand in scale, shuffling becomes the bottleneck of the computation. This shuffling bottleneck motivates us to introduce another layer of partitioning on top of the parallel computation: a partitioning scheme that divides the mining task into smaller independent sub-tasks. Each partition still runs the same parallel mining algorithm as before, but on a smaller input. Since the partitions are independent of one another, their results are unioned at the end; no data exchange occurs during computation. Our experiments show that we accomplish within 34 hours a Freebase mining task that does not finish in 5 days without partitioning.

One major performance bottleneck is caused by large degrees of join variables in the inference rules. Applying these rules in the mining process generates large intermediate results, enumerating all possible pair-wise relationships of the joined instances. As a result, these rules are often of low quality. Based on this observation, we use non-functionality as an empirical indication of inefficiency and inaccuracy. In our experiments, we determine a reasonable functional constraint and show that 99% of the rules violating this constraint turn out to be false. Removing those rules reduces runtime by more than 5 hours for a single mining task.

Combining our approaches, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

1.3 Spreading Activation

We design a query execution engine for spreading activation queries. Semantic networks are broadly applicable to associative information retrieval tasks [33], though we are principally motivated by the popularity of semantic networks and spreading activation to simulate aspects of human memory in cognitive architectures, specifically ACT-R [34, 35]. Insofar as cognitive architectures aim toward codification of unified theories of cognition and full-scale simulation of artificial humans, they must ultimately support human-scale memories, which at present they do not. We are also motivated by the desire for a scalable, standalone cognitive model of human memory free from the architectural and theoretical commitments of a complete cognitive architecture.

Our position is that human-scale associative memory can best be achieved by leveraging the extensive investments and continuing advancements in structured databases and big data systems. For example, relational databases already provide effective means to manage and query massive structured data, and their commonly supported operations, such as grouping and aggregation, are sufficient and well-suited for efficient implementation of spreading activation. To defend this position, we design a relational data model for semantic networks and an efficient SQL-based, in-database implementation of network activation (i.e., SemMemDB). The main benefits of SemMemDB and our in-database approach are: (1) It exploits query optimizers and execution engines that dynamically generate efficient execution plans for activation and retrieval queries, which is far better than manually implementing a particular fixed algorithm. (2) It uses database technology for both storage and computation, thus avoiding the complexity and communication overhead incurred by employing separate modules for storage versus computation. (3) It implements spreading activation in SQL, a widely-used query language for big data which is supported by various analytics frameworks, including traditional databases (e.g., PostgreSQL), massive parallel processing (MPP) databases (e.g., Greenplum [36]), the MapReduce stack (e.g., Hive), etc.

We evaluate SemMemDB using DBPedia [1], a web-scale ontology constructed from the Wikipedia corpus. Our experiment results show several orders of magnitude of improvement in execution time in comparison to results reported in related work.

1.4 Contributions

In summary, we make the following contributions to tackle the scalable learning and inference problem in web knowledge bases.

Knowledge Expansion. We design a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. We optimize relational knowledge bases on massive parallel processing databases to achieve further scalability. We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

Ontological Pathfinding. We design the ontological pathfinding algorithm that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

Spreading Activation. We design the SemMemDB system for spreading activation query processing over semantic and knowledge networks. We use the relational model for semantic networks and present an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. SemMemDB leverages mature query engines and optimizers from databases that generate efficient query plans for memory activation and retrieval. With the massive storage capacity supported by modern database systems, SemMemDB supports human-scale memories. We evaluate SemMemDB using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that SemMemDB runs more than 500 times faster than prior works.

Dissertation outline. The remainder of this dissertation is organized as follows. Chapter 2 describes background in Markov logic, factor graphs, and first-order rule mining. Chapter 3 describes our relational approach to knowledge expansion. Chapter 4 describes the ontological pathfinding algorithm to mine first-order knowledge from web-scale knowledge bases. Chapter 5 scales up the knowledge expansion algorithm by parallelization and partitioning. Chapter 6 describes SemMemDB for spreading activation query processing. Chapter 7 describes related work. Chapter 8 concludes the dissertation and discusses future work on various types of knowledge reasoning.

CHAPTER 2
PRELIMINARIES

2.1 Markov Logic Networks

Markov logic networks are a mathematical model to represent uncertain facts and rules. We use MLNs to model probabilistic knowledge bases constructed by IE systems. Essentially, an MLN is a set of weighted first-order formulae {(F_i, W_i)}, with each weight W_i indicating how likely formula F_i is to be true. In Table 2-1, the Π and L columns form an example MLN (these notions will be formally defined in Section 3.1). In this example, the MLN clauses

0.96 born in(Ruth Gruber, New York City) (2–1)

1.40 ∀x ∈ W, ∀y ∈ P : live in(x, y) ← born in(x, y) (2–2)

state a fact that Ruth Gruber was born in New York City and a rule that if a writer x is born in an area y, then x lives in y. However, neither statement definitely holds. The weights 0.96 and 1.40 specify how strong they are; stronger rules are less likely to be violated. They are both part of an MLN, but with different purposes: (2–1) states a fact, and (2–2) supplies an inference rule. Thus, we treat them separately. This distinction will become clearer when we formally define probabilistic knowledge bases in Section 3.1. MLNs also allow hard rules that must never be violated. These rules have weight ∞. For example, the rule from Table 2-1

∞ ∀x ∈ C, ∀y ∈ C, ∀z ∈ W : (born in(z, x) ∧ born in(z, y) → x = y) (2–3)

says that a writer is not allowed to be born in two different cities. In our work, hard rules are used for quality control: facts violating hard rules are considered errors and are removed to avoid further propagation.
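To make the effect of an infinite weight concrete, the following short calculation (ours, not part of the original text) applies the MLN distribution introduced later in Equation (2–4). For two worlds x and x′ that differ only in whether a hard rule with weight W is satisfied,

\[
\frac{P(\mathbf{x})}{P(\mathbf{x}')} = \exp\big( W \, (n(\mathbf{x}) - n(\mathbf{x}')) \big) \longrightarrow \infty \quad \text{as } W \to \infty,
\]

so any world violating a hard rule receives vanishing probability mass, which is why such facts can simply be deleted.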

2.1.1 Grounding

An MLN can be viewed as a template for constructing ground factor graphs. A factor graph is a set of factors Φ = {φ_1, ..., φ_N}, where each factor φ_i is a function φ_i(X_i) over a random vector X_i, indicating the causal relationships among the random variables in X_i.

Table 2-1. Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets.

Entities E: Ruth Gruber, New York City, Brooklyn
Classes C: W (Writer) = {Ruth Gruber}, C (City) = {New York City}, P (Place) = {Brooklyn}
Relations R: born in(W, P), born in(W, C), live in(W, P), live in(W, C), locate in(P, C)

Facts Π:
0.96 born in(Ruth Gruber, New York City)
0.93 born in(Ruth Gruber, Brooklyn)

Rules L:
1.40 ∀x ∈ W ∀y ∈ P (live in(x, y) ← born in(x, y))
1.53 ∀x ∈ W ∀y ∈ C (live in(x, y) ← born in(x, y))
0.32 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← live in(z, x) ∧ live in(z, y))
0.52 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← born in(z, x) ∧ born in(z, y))
∞ ∀x ∈ C ∀y ∈ C ∀z ∈ W (born in(z, x) ∧ born in(z, y) → x = y)

Figure 2-1. Ground factor graph over the ground atoms:
1. born in(Ruth Gruber, New York City)
2. born in(Ruth Gruber, Brooklyn)
3. live in(Ruth Gruber, New York City)
4. live in(Ruth Gruber, Brooklyn)
5. located in(Brooklyn, New York City)

These factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factors. Figure 2-1 shows an example factor graph, where each factor φ_i(X_i) is represented by a square and its variables X_i by its neighboring circles. The values of φ_i are omitted from the figure.

Given an MLN and a set of typed entities, the process of constructing a factor graph is called grounding, and we refer to the resulting factor graph as a ground factor graph. For the entities in Table 2-1 (column E) and their associated types (column C), we create a random vector X containing one binary random variable for each possible grounding of the predicates appearing in Π and L (column R). The random variables created this way are also called ground atoms. Each ground atom has a value of 0 or 1 indicating its truth assignment. In this example, we have X = {X_1, X_2, X_3, X_4, X_5}, listed on the right of Figure 2-1.

For each possible grounding of formula F_i, we create a ground factor φ_i(X_i) which has a value of e^{W_i} if the ground formula is true, or 1 otherwise. For instance, from the rule “0.32 located in(Brooklyn, New York City) ← live in(Ruth Gruber, Brooklyn) ∧ live in(Ruth Gruber, New York City),” we have a factor φ(X_3, X_4, X_5) defined as follows:

\[
\phi(X_3, X_4, X_5) =
\begin{cases}
1 & \text{if } (X_3, X_4, X_5) = (1, 1, 0) \\
e^{0.32} & \text{otherwise.}
\end{cases}
\]

The other factors are defined similarly. According to this definition, the ground factors can be specified by their variables and weights (X_i, W_i). We will utilize this fact in Section 3.2.2.3. The resulting factor graph is shown in Figure 2-1.
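The following minimal sketch (our illustration, not ProbKB's actual implementation) grounds rule (2–2) against the typed entities of Table 2-1, producing the ground atoms and binary factors described above:

```python
# Grounding sketch for rule (2-2): live_in(x, y) <- born_in(x, y), W = 1.40.
# Each (writer, place) pair yields two ground atoms and one factor whose
# value is e^W when the ground clause holds and 1 when it is violated.
import math
from itertools import product

writers = ["Ruth Gruber"]
places = ["New York City", "Brooklyn"]
W = 1.40

atoms = {}  # (predicate, subject, object) -> variable index

def atom(pred, x, y):
    return atoms.setdefault((pred, x, y), len(atoms))

factors = []  # ((head_var, body_var), weight)
for x, y in product(writers, places):
    factors.append(((atom("live_in", x, y), atom("born_in", x, y)), W))

def factor_value(assignment, variables, w):
    head, body = variables
    # A Horn clause head <- body is false only when body = 1 and head = 0.
    return 1.0 if assignment[body] == 1 and assignment[head] == 0 else math.exp(w)
```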

2.1.2 Inference

In a factor graph Φ = {φ_1, ..., φ_N}, the factors together determine a joint probability distribution over the random vector X consisting of all the random variables in the factor graph:

\[
P(X = x) = \frac{1}{Z} \prod_i \phi_i(X_i) = \frac{1}{Z} \exp\Big( \sum_i W_i \, n_i(x) \Big), \tag{2–4}
\]

where n_i(x) is the number of true groundings of rule F_i in x, W_i is its weight, and Z is the partition function, i.e., the normalization constant. In ProbKB, we are interested in computing P(X = x), the marginal distribution defined by (2–4). This is called marginal inference in the probabilistic graphical models literature. The other inference type is maximum a posteriori (MAP) inference, in which we find the most likely possible world. ProbKB currently uses marginal inference so that we can store all the inferred results in the knowledge base, thereby avoiding query-time computation and improving system responsiveness.
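For intuition only, Equation (2–4) can be evaluated by brute force on a toy graph. The sketch below (ours; real systems use sampling engines instead, since enumeration is exponential in the number of atoms) reuses the factors and factor_value from the sketch in Section 2.1.1:

```python
# Brute-force marginal inference: P(X_i = 1) is the normalized mass of all
# worlds with X_i = 1, where a world's mass is the product of its factor
# values -- exactly Equation (2-4).
from itertools import product

def marginals(num_vars, factors, factor_value):
    Z = 0.0
    mass = [0.0] * num_vars
    for world in product([0, 1], repeat=num_vars):
        weight = 1.0
        for variables, w in factors:
            weight *= factor_value(world, variables, w)
        Z += weight
        for i, v in enumerate(world):
            mass[i] += weight * v
    return [m / Z for m in mass]  # marginal probability of each ground atom

# Usage with the previous sketch: print(marginals(len(atoms), factors, factor_value))
```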

2.2 First-Order Mining

We study the problem of mining first-order inference rules from web-scale knowledge bases: Given a knowledge base of (subject, predicate, object) (or (s, p, o)) triples, we mine first-order Horn clauses of the form

\[
(w,\; B \rightarrow H(x, y)), \tag{2–5}
\]

where the body B = ⋀_i B_i(·, ·) is a conjunction of predicates, H is the head predicate, and w is a scoring metric reflecting the likelihood of the rule being true. As in AMIE [37] and other ILP systems [38, 39], we use a language bias and assume the Horn clauses to be connected and closed. Two atoms are connected if they share a variable. A rule is connected if every atom is connected transitively to every other atom in the rule. A rule is closed if every variable appears at least twice in different predicates. This assumption ensures that the rules do not contain unrelated atoms or variables.
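As a concrete reading of the language bias, the following check (our sketch; the predicate and variable names are illustrative) accepts exactly the connected, closed candidate rules:

```python
# Language-bias checks: a rule is a list of atoms (predicate, var1, var2),
# head first. Connectivity is a graph search over shared variables;
# closedness requires every variable to occur at least twice.
def is_connected(atoms):
    if not atoms:
        return True
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(atoms)):
            if j not in seen and set(atoms[i][1:]) & set(atoms[j][1:]):
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(atoms)

def is_closed(atoms):
    counts = {}
    for _, x, y in atoms:
        counts[x] = counts.get(x, 0) + 1
        counts[y] = counts.get(y, 0) + 1
    return all(c >= 2 for c in counts.values())

# contains(x, z), preventsDisease(z, y) -> preventsDisease(x, y):
rule = [("preventsDisease", "x", "y"),
        ("contains", "x", "z"),
        ("preventsDisease", "z", "y")]
assert is_connected(rule) and is_closed(rule)
```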

2.2.1 The Scalability Challenge

The primary focus of this dissertation is scalable mining. Previous approaches [24, 37] scale to knowledge bases of 1 million facts, but no existing work mines inference rules from the 112 million entities and 388 million facts of Freebase. We investigate a series of parallelization and optimization techniques to achieve this scale, and we study how to use state-of-the-art data processing systems, e.g., Spark, to efficiently implement the parallel algorithms.

The first-order mining problem has similarities with association rule mining in transaction databases [37], but they are essentially different. In first-order rules, the atoms are parameterized predicates. Each parameterized predicate can be grounded to a set of ground atoms. Depending on the size of the knowledge base, each rule can have a large number of possible ground instances. This makes mining first-order knowledge more challenging than mining traditional association rules in transaction databases.

2.2.2 Scoring Metrics

We review the support and confidence metrics for first-order Horn clauses. They have counterparts in association rule mining in transaction databases, but are different from them as discussed above.

Support. The support of a rule is defined to be the number of distinct pairs of subject and object in the head over all instantiations that appear in the knowledge base:

\[
\mathrm{supp}(B \rightarrow H(x, y)) := \#(x, y) : \exists z_1, \ldots, z_m : B \wedge H(x, y). \tag{2–6}
\]

In Equation (2–6), B and H denote the body and head, respectively, and z_1, ..., z_m are the variables of the rule in addition to x and y.

Confidence. The confidence of a rule is defined to be the ratio of its predictions that are in the knowledge base:

\[
\mathrm{conf}(B \rightarrow H(x, y)) := \frac{\mathrm{supp}(B \rightarrow H(x, y))}{\#(x, y) : \exists z_1, \ldots, z_m : B}. \tag{2–7}
\]

Our framework supports other scoring functions introduced in [24, 37]. For example, the PCA confidence of a rule is defined to be the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts p(x, y) such that ∃y′ : p(x, y′) ∈ Γ:

\[
\mathrm{PCA\ conf}(H(x, y) \leftarrow B) := \frac{\mathrm{supp}(H(x, y) \leftarrow B)}{|\{H(x, y) \mid \exists y' : B(x, y) \wedge H(x, y') \in \Gamma\}|}. \tag{2–8}
\]

For each rule, we compute its support and confidence and set w = (supp, conf) in (2–5). The support and confidence metrics together indicate the quality of a rule.
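For a length-2 body B = q(x, z) ∧ r(z, y), Equations (2–6) and (2–7) reduce to counting joins. A small sketch of that computation (ours, with the Kale example as toy data) follows:

```python
# Support = distinct (x, y) head pairs derivable from the body that appear
# in the KB; confidence = support / number of derivable (x, y) pairs.
def support_and_confidence(facts, head, q, r):
    by_pred = {}
    for s, p, o in facts:
        by_pred.setdefault(p, set()).add((s, o))
    head_pairs = by_pred.get(head, set())
    predictions = {(x, y)
                   for x, z in by_pred.get(q, set())
                   for z2, y in by_pred.get(r, set())
                   if z == z2}
    supp = len(predictions & head_pairs)
    conf = supp / len(predictions) if predictions else 0.0
    return supp, conf

facts = [("Kale", "contains", "calcium"),
         ("calcium", "preventsDisease", "osteoporosis"),
         ("Kale", "preventsDisease", "osteoporosis")]
# contains(x, z), preventsDisease(z, y) -> preventsDisease(x, y)
print(support_and_confidence(facts, "preventsDisease",
                             "contains", "preventsDisease"))  # (1, 1.0)
```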

2.3 Spark Basics

Spark is a cluster computing framework. Based on its core idea of resilient distributed datasets (RDDs), read-only collections of objects partitioned across a set of machines, it defines a set of parallel operations. Using these operations, Spark allows users to express a rich set of computation tasks. The operations we use in this dissertation are listed below:

• map/flatMap Transforms an RDD to a new RDD by applying a function to each element in the input RDD.

• groupByKey Transforms an RDD to a new RDD by grouping by a user-specified key. In the result RDD, each key is mapped to a list of values of the key.

• reduceByKey Transforms an RDD to a new RDD by grouping by a user-specified key and applying a reduce function to the values of each key. In the result RDD, each key is mapped to the result value of the reduce function.

26 In our parallel rule mining algorithm, we represent the set of facts {(s, p, o)} and the set of rules {(h, b1, b2)} as two RDDs and express the algorithm using the above parallel operations.
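A minimal sketch of this layout, assuming a local PySpark installation (the tiny data and the role encoding are ours, for illustration): facts are keyed by the join variable z of a body b1(x, z), b2(z, y), grouped, and paired up inside each group.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rule-mining-sketch")

facts = sc.parallelize([
    ("Kale", "contains", "calcium"),
    ("calcium", "preventsDisease", "osteoporosis"),
])

# As b1(x, z) a fact binds z = its object; as b2(z, y) it binds z = its subject.
as_b1 = facts.map(lambda f: (f[2], ("b1", f[1], f[0])))
as_b2 = facts.map(lambda f: (f[0], ("b2", f[1], f[2])))

def pair_up(kv):
    z, vals = kv
    lhs = [(p, x) for role, p, x in vals if role == "b1"]
    rhs = [(p, y) for role, p, y in vals if role == "b2"]
    # one candidate body instantiation per (b1, b2) pair sharing z
    return [((p1, p2, x, y), 1) for p1, x in lhs for p2, y in rhs]

candidates = (as_b1.union(as_b2)
                   .groupByKey()
                   .flatMap(pair_up)
                   .reduceByKey(lambda a, b: a + b))
print(candidates.collect())
# [(('contains', 'preventsDisease', 'Kale', 'osteoporosis'), 1)]
```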

CHAPTER 3
KNOWLEDGE EXPANSION OVER PROBABILISTIC KNOWLEDGE BASES

With the exponential growth in machine learning, statistical inference, and big-data analytics frameworks, recent years have seen tremendous research interest in information extraction (IE) and knowledge base construction. A knowledge base stores entities and their relationships in a machine-readable format to help computers understand human information and queries. Example knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], NELL [7], OpenIE [8, 9], ProBase [10], and YAGO [15, 40, 41]. However, these knowledge bases are often incomplete or uncertain due to limitations of human knowledge or the probabilistic nature of extraction algorithms. Thus, it is often desirable to infer missing facts in a scalable, probabilistic manner [42]. For example, if the Wikipedia pages state that Kale is rich in calcium and that calcium helps prevent osteoporosis, then we can infer that Kale helps prevent osteoporosis. To facilitate such inference tasks, Sherlock [24] learns 30,912 uncertain Horn clauses from web extractions.

In this chapter, we study the problem of expanding probabilistic knowledge bases using first-order inference. We develop an efficient inference engine by modeling probabilistic knowledge bases as relational tables and, consequently, implement the inference algorithm as a limited number of joins. We use one particular class of inference rules, the semantic constraints, to detect incorrect facts and ambiguous entities. Experiments show promising results in terms of both performance and quality. This chapter addresses the problem of applying inference rules; Chapter 4 addresses the related problem of mining first-order inference rules.

The standard model for working with uncertain facts and rules is Markov logic networks (MLNs) [29]. To perform the MLN marginal inference task, we take two steps: 1) grounding, which constructs a ground factor graph that encodes the probability distribution of all observed and inferred facts; and 2) marginal inference, which computes the marginal distribution for each individual fact. To efficiently support these tasks, the state-of-the-art system Tuffy [28] uses a relational database management system (DBMS) and demonstrates a significant speed-up in the grounding phase compared to an earlier implementation, Alchemy [30]. Despite that, Tuffy does not have satisfactory performance on the Reverb and Sherlock datasets since they have a large number of rules (30,912), and Tuffy uses as many as 30,912 SQL queries to apply them all in each iteration. Furthermore, all existing MLN implementations are designed to work with small, clean MLN programs carefully crafted by humans. Thus, they are often prone to the inaccuracies and errors of machine-constructed MLNs and have no mechanism to detect and recover from errors.

To improve efficiency and accuracy, we present the knowledge expansion algorithm. Our main contribution is a formal definition and relational model for probabilistic knowledge bases, which allows an efficient SQL-based grounding algorithm that applies MLN rules in batches. Our work is motivated by two observations: 1) MLNs can be modeled in DBMSs as a first-class citizen rather than stored in ordinary files; 2) We can use join queries to apply the MLN rules in batches, rather than one query per rule. In this way, the grounding algorithm can be expressed as SQL queries that operate on the facts and MLN tables, applying all rules in one MLN table at a time. For the Sherlock rules, the number of queries in each iteration is reduced from 30,912 to 6 (depending on rule structures). Our approach greatly improves performance, especially when the number of rules is large. We achieve further efficiency by using a shared-nothing massive parallel processing (MPP) database and general MPP optimizations to maximize data collocation, independence, and parallelism.

Another important goal is to maintain a high-quality knowledge base. Extracted facts and rules are sometimes inaccurate, but existing MLN systems are designed to work with small, clean MLNs and are prone to noisy data. To handle errors, we combine several strategies, including semantic constraints, ambiguity detection, and rule cleaning. As a result, we increase the precision by 0.61.

Figure 3-1. ProbKB system architecture. (Components: the MLN, entities, and facts stored in an RDBMS; SQL UDFs/UDAs executed through the query optimizer and execution engine; the resulting factor graph passed to an inference engine, e.g., GraphLab.)

To summarize, we make the following contributions:

• We present an efficient knowledge expansion algorithm. We introduce a formal definition and a relational model for probabilistic knowledge bases and design a novel inference algorithm for knowledge expansion that applies inference rules in batches.

• We implement and evaluate ProbKB on an MPP DBMS, Greenplum. We investigate important optimizations to maximize data collocation, independence, and parallelism.

• We combine several methods, including rule cleaning and semantic constraints, to detect erroneous rules, facts, and ambiguous entities, effectively preventing them from propagating in the inference process.

• We conduct a comprehensive experimental evaluation on real and synthetic knowledge bases. We show ProbKB performs orders of magnitude faster than previous works and has much higher quality.

Figure 3-1 shows the ProbKB system architecture. The knowledge base (MLN, entities, facts) is stored in database tables. This relational representation allows an efficient SQL-based grounding algorithm, which is written in user-defined functions/aggregates (UDFs/UDAs) and stored inside the database. During grounding, the database optimizes and executes the stored procedures and generates a factor graph in relational format. Existing inference engines, e.g., Gibbs sampling [43], GraphLab [44], can be used to perform probabilistic inference over the resulting factor graph.

3.1 Probabilistic Knowledge Bases

Based on the syntax and semantics of MLNs and the schemas used by state-of-the-art IE and knowledge base systems, we formally define a probabilistic knowledge base as follows:

Definition 3.1. A probabilistic knowledge base (KB) is a 5-tuple Γ = (E, C, R, Π, L), where

• E = {e_1, ..., e_|E|} is a set of entities. Each entity e ∈ E refers to a real-world object.

• C = {C_1, ..., C_|C|} is a set of classes (or types). Each class C ∈ C is a subset of E: C ⊆ E.

• R = {R_1, ..., R_|R|} is a set of relations. Each R ∈ R defines a binary relation on C_i, C_j ∈ C: R ⊆ C_i × C_j. We call C_i, C_j the domain and range of R and use R(C_i, C_j) to denote the relation and its domain and range.

• Π = {(r_1, w_1), ..., (r_|Π|, w_|Π|)} is a set of weighted facts (or relationships). For each (r, w) ∈ Π, r is a tuple (R, x, y), where R(C_i, C_j) ∈ R, x ∈ C_i ∈ C, y ∈ C_j ∈ C, and (x, y) ∈ R; w ∈ ℝ is a weight indicating how likely r is true. We also use R(x, y) to denote the tuple (R, x, y).

• L = {(F_1, W_1), ..., (F_|L|, W_|L|)} is a set of weighted clauses (or rules). It defines a Markov logic network. For each (F, W) ∈ L, F is a first-order logic clause, and W ∈ ℝ is a weight indicating how likely formula F holds.

Remark 3.2. The arguments of relations, relationships, and rules are constrained to certain classes, i.e., they are inherently typed. The definition of C implies a class hierarchy: for any C_i, C_j ∈ C, C_i is a subclass of C_j if and only if C_i ⊆ C_j. Typing provides semantic context for extracted entities and is commonly adopted by recent IE systems, so we make it an integral part of the definition.

Remark 3.3. The weights w’s and W ’s above are allowed to take the values of ±∞, meaning that the corresponding facts or rules are definite (or impossible) to hold. We treat them as semantic constraints and discuss them in detail in Section 3.3.1. When we need to distinguish the sets of deductive inference rules and constraints, we denote them by H and Ω, respectively. We also use L = (H,Ω), or Γ = (E,C,R,Π,H,Ω), to emphasize this distinction.

Table 3-1. Example probabilistic knowledge base constructed from the Reverb-Sherlock datasets.

Entities E: Ruth Gruber, New York City, Brooklyn
Classes C: W (Writer) = {Ruth Gruber}, C (City) = {New York City}, P (Place) = {Brooklyn}
Relations R: born in(W, P), born in(W, C), live in(W, P), live in(W, C), locate in(P, C)

Facts Π:
0.96 born in(Ruth Gruber, New York City)
0.93 born in(Ruth Gruber, Brooklyn)

Rules L:
1.40 ∀x ∈ W ∀y ∈ P (live in(x, y) ← born in(x, y))
1.53 ∀x ∈ W ∀y ∈ C (live in(x, y) ← born in(x, y))
0.32 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← live in(z, x) ∧ live in(z, y))
0.52 ∀x ∈ P ∀y ∈ C ∀z ∈ W (locate in(x, y) ← born in(z, x) ∧ born in(z, y))
∞ ∀x ∈ C ∀y ∈ C ∀z ∈ W (born in(z, x) ∧ born in(z, y) → x = y)

Example 3.4. Table 2-1 (reproduced in Table 3-1) shows an example knowledge base constructed using Reverb Wikipedia extractions and Sherlock rules. This is the primary dataset we use to construct and evaluate our knowledge base, and we will refer to this dataset as the Reverb-Sherlock KB hereafter.

Problem Description. We focus on two challenges in grounding probabilistic KBs:

• Improving grounding efficiency using a relational DBMS; specifically, we seek ways to apply inference rules in batches using SQL queries.

• Identifying and recovering from errors in the grounding process, which prevents them from propagating along the inference chain.

3.2 Probabilistic Knowledge Bases: A Relational Perspective

This central section describes our database approach to achieving efficiency. We first describe the relational model for each of the components E, C, R, Π, H. Ω is related to quality control and is fully discussed in Section 3.3. Then we present the grounding algorithm and explain how it achieves efficiency. Finally, we describe the techniques we use for tuning and optimizing the implementation over Greenplum, an MPP database.

3.2.1 First-Order Horn Clauses

Though Markov logic supports general first-order formulae, we confine H to the set of Horn clauses. A first-order Horn clause is a clause with at most one positive literal [45]:

p, q, . . . , t → u.

This may limit the expressiveness of individual rules, but due to the scope and scale of the Sherlock rule set, we are still able to infer many facts. Horn clauses give us a number of additional benefits:

• Horn clauses have simpler structures, allowing us to effectively model them in relational tables and design SQL-based inference algorithms.

• Learning Horn clauses has been studied extensively in the inductive logic programming literature [46, 47] and has recently been adapted to text extractions (Sherlock [24]). There is work on general MLN structure learning [32], but it has not been employed at a large scale.

3.2.2 The Relational Model

We first introduce our notation. For each KB element X ∈ {C, R, Π, H}, we denote the corresponding database relation (table) by T_X. Thus, T_X is by definition a set of tuples. In our implementation, we also use dictionary tables D_X, where X ∈ {E, C, R}, to map string representations of KB elements to integer IDs to avoid string comparison during joins and selections.

3.2.2.1 Classes, relations, and relationships

The database definitions for TC, TR, TΠ follow from their mathematical definitions:

Definition 3.5. TC is defined to be the set of tuples {(C, e)} for all pairs of (C, e) ∈ (C × E) such that e ∈ C.

33 Definition 3.6. TR is defined to be the set of tuples{(R,C1,C2)} for all relations R(C1,C2) ∈ R.

Definition 3.7. TΠ is defined to be the set of tuples {(I, R, x, C1, y, C2, w)}, where I ∈ N is an integer identifier, R(C1,C2) ∈ R, x ∈ C1 ∈ C, y ∈ C2 ∈ C, and (R(x, y), w) ∈ Π.

In Definition 3.7, we put all facts in a single table. Comparing with Tuffy, which uses one table for each relation, our approach scales to modern knowledge bases since they often contain thousands of relations (Reverb has 80K). In addition, it allows us to join the MLN tables to apply MLN rules in batches. The C1 and C2 columns are for optimization purposes; they replicate TC and TR to avoid the overhead of joining them when we apply MLN rules.

An example T_Π constructed from Example 3.4 is given in Figure 3-2A.

3.2.2.2 MLN rules

Unlike the other components, L does not map to a relational schema directly since rules have flexible structures. Our approach to this problem is to structurally partition the clauses so that each partition has a well-defined schema.

Definition 3.8. Two first-order clauses are defined to be structurally equivalent if they differ only in their entity, class, and relation symbols.

Example 3.9. Consider the MLN from Example 3.4. The rules ∀x ∈ W ∀y ∈ P : live in(x, y) ← born in(x, y) and ∀x ∈ W ∀y ∈ C : live in(x, y) ← born in(x, y) are structurally equivalent since they differ only in P and C; the rules ∀x ∈ P ∀y ∈ C ∀z ∈ W : locate in(x, y) ← live in(z, x) ∧ live in(z, y) and ∀x ∈ P ∀y ∈ C ∀z ∈ W : locate in(x, y) ← born in(z, x) ∧ born in(z, y) are structurally equivalent since they differ only in “born in” and “live in.”

It is straightforward to verify that structural equivalence as defined in Definition 3.8 is indeed an equivalence relation; thus it defines a partition on the space of clauses. According to the definition, each clause can be uniquely identified within a partition by specifying a tuple of entities, classes, and relations, which we refer to as its identifier tuple in that partition. A partition is, therefore, a collection of identifier tuples. We now make this precise:

Definition 3.10. TH is defined to be a set of partitions {M1,...,Mk}, where each partition is a set of identifier tuples comprised of entities, classes, and relations with their weights.

We identify 6 structural equivalence classes in the Sherlock dataset, listed below:

∀x ∈ C1, y ∈ C2 (p(x, y) ← q(x, y)) (3–1)

∀x ∈ C1, y ∈ C2 (p(x, y) ← q(y, x)) (3–2)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(z, x), r(z, y)) (3–3)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(x, z), r(z, y)) (3–4)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(z, x), r(y, z)) (3–5)

∀x ∈ C1, y ∈ C2, z ∈ C3 (p(x, y) ← q(x, z), r(y, z)) (3–6)

We use 6 partitions, M1, ..., M6, to store all the rules.
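As a self-contained sketch of how these tables can be materialized (ours, using SQLite purely for illustration; ProbKB runs on Greenplum, but the schema carries over), the fact table of Definition 3.7 and the partition table M1 look as follows:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  -- T (T_Pi): one row per fact, classes replicated inline (Definition 3.7)
  CREATE TABLE T (I INTEGER PRIMARY KEY, R TEXT,
                  x TEXT, C1 TEXT, y TEXT, C2 TEXT, w REAL);
  -- M1: identifier tuples for rules of structure (3-1) (Definition 3.10)
  CREATE TABLE M1 (R1 TEXT, R2 TEXT, C1 TEXT, C2 TEXT, w REAL);
""")
db.executemany("INSERT INTO T VALUES (?,?,?,?,?,?,?)", [
    (1, "born in", "Ruth Gruber", "W", "New York City", "C", 0.96),
    (2, "born in", "Ruth Gruber", "W", "Brooklyn", "P", 0.93),
])
db.executemany("INSERT INTO M1 VALUES (?,?,?,?,?)", [
    ("live in", "born in", "W", "P", 1.40),
    ("live in", "born in", "W", "C", 1.53),
])
db.commit()
```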

Example 3.11. M1 and M3 in Figure 3-2 show two example MLN tables. The classes P, C, W are defined in Table 2-1. Each tuple in M1 and M3 represents a rule, and each table represents a particular rule syntax. Tuples (R1, R2, C1, C2) in M1 are chosen to stand for ∀x ∈ C1, ∀y ∈ C2 : R1(x, y) ← R2(x, y). Hence, the first row in M1 means “if a writer is born in a place, then he probably lives in that place.” Tuples (R1, R2, R3, C1, C2, C3) in M3 stand for ∀x ∈ C1, ∀y ∈ C2, ∀z ∈ C3 : R1(x, y) ← R2(z, x), R3(z, y). Hence, the first row in M3 means “if a writer lives in a place and a city, then the place is probably located in that city.” The other rows are interpreted in a similar manner.

3.2.2.3 Factor graphs

As we mentioned in Section 2.1.1, a ground factor can be specified by its variables and weight. Definition 3.12 utilizes this fact. For simplicity, we limit our definition to factors of sizes up to 3 (large enough for Sherlock rules), but our approach can be easily extended to larger factors.

Figure 3-2. Knowledge expansion example. (A) Example TΠ table. (B)(C) Example MLN tables. (D) Example query tree for grounding: T_i^j denotes the intermediate result for partition i in the jth iteration; TΠ^j is the merged result at the jth iteration; in TΦ, TΦi is the ground clauses from partition i, and TΦ0 is the singleton factors from TΠ (Algorithm 3-1, Line 10). The double lines indicate that the operand may participate multiple times in the join. (E)-(H) Query results; shaded rows correspond to shaded tables in (D). Abbreviations: RG = Ruth Gruber, NYC = New York City, Br = Brooklyn; W = Writer, C = City, P = Place.

(A) TΠ (I, R, x, C1, y, C2, w):
1 | born in | RG | W | NYC | C | 0.96
2 | born in | RG | W | Br | P | 0.93

(B) M1 (R1, R2, C1, C2, w):
live in | born in | W | P | 1.40
live in | born in | W | C | 1.53
grow up in | born in | W | P | 2.68
grow up in | born in | W | C | 0.74

(C) M3 (R1, R2, R3, C1, C2, C3, w):
located in | live in | live in | P | C | W | 0.32
located in | born in | born in | P | C | W | 0.52

(E) TΦ (I1, I2, I3, w):
1 | NULL | NULL | 0.96
2 | NULL | NULL | 0.93
3 | 1 | NULL | 1.53
4 | 2 | NULL | 1.40
5 | 1 | NULL | 0.74
6 | 2 | NULL | 2.68
7 | 2 | 1 | 0.52
7 | 4 | 3 | 0.32

(F) Final TΠ (I, R, x, C1, y, C2, w):
1 | born in | RG | W | NYC | C | 0.96
2 | born in | RG | W | Br | P | 0.93
3 | live in | RG | W | NYC | C |
4 | live in | RG | W | Br | P |
5 | grow up in | RG | W | NYC | C |
6 | grow up in | RG | W | Br | P |
7 | located in | Br | P | NYC | C |

Query 1-1:
SELECT M1.R1 AS R, T.x AS x, T.C1 AS C1, T.y AS y, T.C2 AS C2
FROM M1
JOIN T ON M1.R2 = T.R AND M1.C1 = T.C1 AND M1.C2 = T.C2;

Query 1-3:
SELECT M3.R1 AS R, T2.y AS x, T2.C2 AS C1, T3.y AS y, T3.C2 AS C2
FROM M3
JOIN T T2 ON M3.R2 = T2.R AND M3.C3 = T2.C1 AND M3.C1 = T2.C2
JOIN T T3 ON M3.R3 = T3.R AND M3.C3 = T3.C1 AND M3.C2 = T3.C2
WHERE T2.x = T3.x;

Query 2-3:
SELECT T1.I AS I1, T2.I AS I2, T3.I AS I3, M3.w AS w
FROM M3
JOIN T T1 ON M3.R1 = T1.R AND M3.C1 = T1.C1 AND M3.C2 = T1.C2
JOIN T T2 ON M3.R2 = T2.R AND M3.C3 = T2.C1 AND M3.C1 = T2.C2
JOIN T T3 ON M3.R3 = T3.R AND M3.C3 = T3.C1 AND M3.C2 = T3.C2
WHERE T1.x = T2.y AND T1.y = T3.y AND T2.x = T3.x;

Definition 3.12. TΦ is defined to be a set of tuples {(I1,I2,I3, w)}, where I1,I2,I3 are foreign keys to TΠ(I) and w ∈ R is the weight. Each tuple (I1,I2,I3, w) represents a weighted ground rule I1 ← I2,I3. I1 is the head; I2,I3 are the body and allowed to be NULL for factors of sizes 1 or 2.

Figure 3-2E shows an example of TΦ. As the final result of grounding, it serves as an intermediate representation that can be input to probabilistic inference engines, e.g., [43, 44]. Moreover, since it records the causal relationships among facts, it contains the entire lineage and can be queried [48]. One application of lineage is to help determine the facts' credibility, which we use extensively in our experiments.

3.2.3 Grounding

The relational representation of Γ allows an efficient grounding algorithm that applies MLN rules in batches. The idea is to join the TΠ and Mi tables by equating the corresponding relations and classes. Each of these join queries applies all the rules in partition Mi in a batch. The grounding algorithm consists of two steps: we first apply the rules to compute the ground atoms (given and inferred facts) until we reach the transitive closure. Then we apply the rules again to construct the ground factors. Algorithm 3-1 summarizes this procedure.

The groundAtoms(TΠ,Mi) and groundFactors(TΠ,Mi) functions in Algorithm 3-1 do the actual joins and are implemented in SQL. The left of Figure 3-2 shows some example queries for partitions 1 and 3. The actual query used is determined by the partition index i (the rule structure). In Lines 6-7, applyConstraints(TΠ) and redistribute(TΠ) are for quality control (Section 3.3) and MPP optimization (Section 3.2.4), respectively. This section describes the SQL queries we use for partitions 1 and 3. The other queries can be derived using the same mechanism.

Algorithm 3-1: Grounding(TΠ, M1, ..., Mk)

Input: TΠ, M1, ..., Mk
Output: TΦ
 1  TΦ ← ∅
 2  while not convergent do
 3      forall partitions Mi do
 4          Ti ← groundAtoms(TΠ, Mi)
 5      TΠ ← TΠ ∪ (∪_{j=1}^{k} Tj)
 6      applyConstraints(TΠ)
 7      redistribute(TΠ)
 8  forall partitions Mi do
 9      TΦ ← TΦ ∪_B groundFactors(TΠ, Mi)
10  TΦ ← TΦ ∪_B groundFactors(TΠ)
11  return TΦ

Queries 1-1 and 1-3 in Figure 3-2 are used to implement groundAtoms(TΠ, Mi) for partitions 1 and 3. They join the Mi and TΠ (T) tables by relating the relations, entities, and classes in the rule body. Take Query 1-3 as an example: the join conditions M3.R2 = T2.R and M3.R3 = T3.R select relations M3.R2 and M3.R3, and T2.x = T3.x matches the entities according to Rule (3–3). The remaining conditions check the classes, and finally, the SELECT clause generates new facts {(I, R, x, Ci, y, Cj, NULL)}. The weights are to be determined in the marginal inference step, so we set them to NULL during grounding. In each iteration, we apply groundAtoms(TΠ, Mi) for all i and merge the results into TΠ.

In groundFactors(TΠ, Mi), we create a factor for each ground rule by joining the relations from both the head and body of a rule. We illustrate this process in Figure 3-2, Query 2-3. The conditions M3.R1 = T1.R, M3.R2 = T2.R, M3.R3 = T3.R select the relations from the head (R1) and the body (R2, R3). The other conditions match the entities and classes according to Rule (3–3). Then the SELECT clause retrieves the matched IDs and the weight of the rule. In groundFactors(TΠ), we represent the uncertain facts in Π (w ≠ NULL) by singleton factors, i.e., factors involving one variable. Lines 9-10 merge the results using bag unions (∪B) due to the following proposition:

Proposition 3.13. Query 2-i does not produce duplicate tuples (I1, I2, I3) if Mi does not contain duplicates.

Proof. We prove the length 3 cases; length 2 cases can be verified similarly. Given any factor (I1, I2, I3, w) ∈ TΦ, there is one rule in Mi that implies the deduction I1 ← I2, I3, whose columns R1, R2, R3, C1, C2, C3 are determined by the TΠ tuples associated with I1, I2, I3. Hence, the tuple (I1, I2, I3, w) derives from the rule (R1, R2, R3, C1, C2, C3, w) and the rows (facts) in TΠ identified by I1, I2, I3. Joining these rows yields at most one tuple (I1, I2, I3, w).

There may be duplicates from different partitions. We treat them as multiple factors among the same variables since they are valid deductions using different rules.

Example 3.14. Figure 3-2 illustrates Algorithm 3-1 by applying M1 and M3 to TΠ^0. In the first iteration, we run Query 1-1, which applies all four rules in M1. The result is given in T1^1 and merged with TΠ^0. In the second iteration (n = 2), we run Query 1-3. Again, both rules in M3 are applied, although there is only a single result, which is merged with TΠ^1. Note that, according to Algorithm 3-1, all Mi's should be applied in each iteration, but in this simple example, only M1 and M3 are applicable. Having computed the ground atoms, Queries 2-1 (omitted from Figure 3-2) and 2-3 generate the ground factors. The final TΦ in Figure 3-2E includes the singleton factors shown in gray rows.
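For concreteness, the singleton-factor step groundFactors(TΠ) in Line 10 of Algorithm 3-1 admits a one-statement implementation; the following is a minimal sketch using the ASCII table names TPhi and TPi from above:

-- Sketch: one singleton factor per uncertain fact (w IS NOT NULL) in TΠ.
INSERT INTO TPhi (I1, I2, I3, w)
SELECT I, NULL, NULL, w
FROM TPi
WHERE w IS NOT NULL;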

Analysis. The correctness of Algorithm 3-1 follows from [28, 49] if groundAtoms(TΠ, Mi) and groundFactors(TΠ, Mi) correctly apply the rules. This can be verified by analyzing the SQL queries in a similar way to what we did for partitions 1 and 3 and is illustrated in Example 5.1. The efficiency follows from the fact that database systems are set-oriented and optimized to execute queries over large sets of rows; hence the ability to apply rules in batches significantly improves performance compared to applying one rule per SQL query. Though individual queries may become large, our rationale is that given 30,912 Sherlock rules, grouping them together and letting the database system manage the execution process will be more efficient than running them independently, which deprives the database of its set-oriented optimizations. We experimentally validate this claim in Section 3.4.1.

Assuming the number of iterations of the outer loop in Line 2 is a constant (which is a reasonable assumption; in our experiments, 15 iterations ground most of the facts, and both the Tuffy and ProbKB systems need the same number of iterations), Lines 3-7 take O(k) SQL queries, where k is the number of partitions. Lines 8-10 take another O(k) queries. Thus, we execute O(k) SQL queries in total. On the other hand, Tuffy uses O(n) queries, where n is the number of rules. ProbKB thus has better performance than Tuffy when k ≪ n, which is likely to hold for machine-constructed MLNs since they often have a predefined schema with limited rule patterns. In the Reverb-Sherlock knowledge base, we have k = 6 and n = 30,912 and observe the expected benefits offered by the ability to apply the rules in batches.

3.2.4 MPP Implementation

This section describes our techniques to improve performance when migrating ProbKB from PostgreSQL to MPP databases (e.g., Greenplum). The key challenge is to maximize data collocation for join queries to reduce cross-segment data shipping during query execution [50]. Specifically, we use redistributed materialized views [50] to replicate relevant tables using different distribution strategies. The join queries are rewritten to operate on these replicates according to the join attributes to ensure that the joining records are collocated in the same segment. This maximizes the degree of parallelism by avoiding unnecessary cross-segment data shipping. The distribution keys are determined by the attributes used in the join queries. It turns out that, due to the similar syntax of Rules (3–1)-(3–6), most of these views are shared among queries; the only replicates of TΠ we need to create are distributed by the following keys: (R, C1, C2), (R, C1, x, C2), (R, C1, C2, y), and (R, C1, x, C2, y). The following example demonstrates how we select distribution keys and rewrite queries.

Example 3.15. Assume T0 and Tx are materialized views of TΠ distributed by (R, C1, C2) and (R, C1, x, C2), respectively. Then, in Query 1-3, instead of

FROM M3 JOIN T T2 ON ... JOIN T T3 ON ...

we use the views:

FROM M3 JOIN T0 T2 ON ... JOIN Tx T3 ON ...

Compared to Query 1-3, the above query joins M3, T0, and Tx instead of M3 and T. The effect of using redistributed materialized views is illustrated by the query plans in Figure 3-3. When joining records from T and any other table X, Greenplum requires that the records reside in the same segment. Otherwise, it redistributes both tables according to the join key or broadcasts one of them, both of which are expensive operations. On the contrary, if T is already distributed by the join keys, Greenplum only needs to redistribute the other table. In the unoptimized plan in Figure 3-3, Greenplum tries to broadcast the intermediate hash join result, which takes 8.06 seconds, whereas the redistribution motion in the optimized plan takes only 0.85 seconds.

Figure 3-3. Query plans generated by Greenplum with (A) and without (B) optimization. The annotations show the durations of each operation in a sample run joining M3 and a synthetic TΠ with 10M records.
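One way to create such redistributed copies is Greenplum's DISTRIBUTED BY clause; the following is a minimal sketch (T0 and Tx as in Example 3.15; materializing plain tables rather than true views is our simplifying assumption):

-- Sketch: copies of T collocated under different distribution keys.
CREATE TABLE T0 AS SELECT * FROM T DISTRIBUTED BY (R, C1, C2);
CREATE TABLE Tx AS SELECT * FROM T DISTRIBUTED BY (R, C1, x, C2);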

3.3 Quality Control

MLNs have been applied in a variety of applications. Most of them use clean, hand-crafted MLNs in small domains. However, machine-constructed knowledge bases often contain noisy and inaccurate facts and rules. Over such KBs, errors tend to accumulate and propagate rapidly along the inference chain, as shown in Figure 3-4A. As a result, the inferred knowledge is full of errors after only a few iterations. Hence, it is important to detect errors early to prevent error propagation. Analyzing the inference results, we identify the following error sources:

E1) Incorrect facts resulting from the IE systems.
E2) Incorrect rules resulting from the rule learning systems.
E3) Ambiguous entities referring to multiple entities by a common name, e.g., "Jack" may refer to different people. They generate erroneous results when used as join keys.
E4) Propagated errors resulting from the inference procedure. Figure 3-4A illustrates how a single error produces a chain of errors.

In this section, we identify potential solutions to each of the problems above. Section 3.3.1 introduces the concept of semantic constraints and functional constraints that can be used to detect erroneous facts (E1 and E4). Section 3.3.2 uses functional constraints further to detect ambiguous entities (E3) to ensure the correctness of join queries. Section 3.3.3 introduces our current approach to rule cleaning (E2). Finally, Section 3.3.4 describes an efficient implementation of the quality control methods on top of our relational model. Combining these techniques, we are able to achieve much higher precision of the inferred facts.

3.3.1 Semantic Constraints

Constraints are an effective tool used in database systems to ensure data validity [51]. This section introduces a similar concept called semantic constraints that we use in knowledge bases to ensure the validity of facts. These constraints are derived from the semantics of extracted relations, e.g., a person is born in only one country; a country has only one capital city; etc. Conceptually, semantic constraints are hard rules that must be satisfied by all possible worlds. Violations, if any, indicate potential errors.

Figure 3-4. Quality control for knowledge expansion. (A) Errors (shaded) resulting from ambiguous entities and wrong rules, and how they propagate in the inference chain; (B) Sample functional relations and sources of constraint violations:

Functional relation   Violating facts                            Error sources
born in               born in(Mandel, Berlin)                    Leonard Mandel
                      born in(Mandel, New York City)             Johnny Mandel
                      born in(Mandel, Chicago)                   Tom Mandel (futurist)
grow up in            grow up in(Miller, Placentia)              Dustin Miller
                      grow up in(Miller, New York City)          Alan Gifford Miller
                      grow up in(Miller, New Orleans)            Taylor Miller
located in            located in(Regional office, Glasgow)       McCarthy & Stone regional offices
                      located in(Regional office, Panama City)   OCHA regional offices
                      located in(Regional office, South Bend)    Indiana Landmarks regional offices
capital of            capital of(Delhi, India)                   (Incorrect extraction)
                      capital of(Calcutta, India)

Definition 3.16. A semantic constraint ω is a first-order formula with an infinite weight (F, ∞) ∈ L. The set of semantic constraints is denoted by Ω.

Thus, the MLN L can be written as L = (H, Ω), where H is the set of inference rules and Ω is the set of semantic constraints. We separate them to emphasize their different purposes. By definition, semantic constraints are allowed to be arbitrary first-order formulae, but staying with a particular form is helpful for obtaining and applying the constraints at scale. One form of constraints we find particularly useful is functional constraints [19, 20, 42]. They help detect errors from propagation and incorrect rules, and can be used to detect ambiguous entities that invalidate equality checks in join queries. Functional constraints have a simple form that reflects a common property of extracted relations, namely, functionality.

Definition 3.17. A relation R(Ci, Cj) ∈ R is functional if for any x ∈ Ci, there is at most one y ∈ Cj such that R(x, y) ∈ Π, or conversely, if for any y ∈ Cj, there is at most one x ∈ Ci such that R(x, y) ∈ Π. We refer to these cases as Type-I and Type-II functionality, respectively.

Definition 3.18. For each Type-I functional relation R(Ci, Cj), Ω contains a functional constraint with weight ∞:

∀x ∈ Ci, ∀y, z ∈ Cj : R(x, y) ∧ R(x, z) → y = z.    (3–7)

A similar rule can be defined for Type-II functional relations.

Example 3.19. Figure 3-4B lists some functional relations we find in Reverb extractions. born in, grow up in, and located in are of Type I, which means a person is born in only one place, etc. capital of is of Type II, which means a country has only one capital city. If an entity participates in a functional relation with different entities from the same class, it violates the functional constraints. The violations are mostly caused by ambiguous entities and erroneous facts (E1, E3, E4), as illustrated in Figure 3-4B.

Besides these functional relations, [42] observes a number of extracted relations that are close to, but not strictly, functional. Consider the relation live in(Person, Country) for instance: it is possible that a person lives in several different countries, but that number should not exceed a certain limit δ. To support these pseudo-functional relations, we allow them to have 1-to-δ mappings, where δ is called the degree of functionality.

The remaining question is how to obtain these functional constraints. In ProbKB, we use the functional constraints learned by Leibniz [19], an algorithm for automatic functional relation learning. The repository of the functional relations is publicly available¹. We use this repository and a set of manually labeled pseudo-functional relations (Leibniz only contains functional relations) to construct the constraints as defined in this section.

3.3.2 Ambiguity Detection

As shown in Query 1-3, the SQL queries used to apply length 3 rules involve equality checks, e.g., T2.x = T3.x. However, problems arise when two entities are literally equal but do not corefer to the same object. When this happens, the inference results are likely to be wrong. For instance, in Figure 3-4A, the error "located in(Baltimore, Berlin)" is inferred by

born in(Mandel, Berlin) ∧ born in(Mandel, Baltimore)

→ located in(Baltimore, Berlin)

The ambiguous entity “Mandel” invalidates the equality check used by the join query. Unfortunately, ambiguities are common in Reverb extractions, especially in people’s names, since web pages tend to refer to people by only their first or last names, which often coincide with each other. Ambiguous entities are one of the major sources that cause functional constraint violations.

In a Type I functional relation R(x : Ci, y : Cj), x functionally determines y. An ambiguous x (a name conflating distinct entities x and x′), however, often associates with another y′ such that y ≠ y′, since x and x′ refer to different entities. Hence, x violates the functional constraint (3–7). Figure 3-4B lists sample violations caused by ambiguous entities. Thus, one effective way to detect ambiguous entities is to check for constraint violations. However, we also observe other sources that lead to violations, including extraction errors, propagated errors, etc. In this paper, we greedily remove all violating entities to improve precision, but we could do much more if we were able to automatically categorize the errors. For example, violations caused by propagated errors may indicate low credibility of the inference rules, which can be utilized to improve rule learners.

¹ http://knowitall.cs.washington.edu/leibniz.

3.3.3 Rule Cleaning

Wrong rules are another significant error source. In Figure 3-4A, for example, one of the incorrect rules we use is

born in(Freud, Baltimore) ∧ born in(Freud, Germany) → capital of(Baltimore, Germany)

Working with clean rules is highly desirable for MLN and other rule-based inference engines since the rules are applied repeatedly to many facts. We clean rules according to their statistical significance, a scoring function used by Sherlock based on conditional probabilities. A more accurate method (used by NELL) involves human supervision, but it does not apply to Sherlock due to its scale. We perform rule cleaning by ranking the rules by their statistical significance and taking the top θ rules (θ ∈ [0, 1]). The parameter θ is obtained by experiments to achieve good precision and recall.
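Relationally, the top-θ selection might look like the following sketch, assuming a hypothetical rules(id, score) table that stores each rule's statistical significance, with θ = 0.2:

-- Sketch: keep the top 20% of rules ranked by statistical significance.
SELECT id
FROM (SELECT id,
             PERCENT_RANK() OVER (ORDER BY score DESC) AS pr
      FROM rules) ranked
WHERE pr <= 0.2;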

3.3.4 Implementation

The traditional DBMS way to define constraints is by checks, assertions, or triggers: we define one check (or trigger, etc.) for each of the functional relations. This requires thousands of constraint definitions and runtime checks, making it impractical for large KBs.

In ProbKB, instead, observing that functional constraints of form (3–7) are all structurally equivalent, we store them in a single table TΩ:

Definition 3.20. TΩ is defined to be the set of tuples {(R, C1, C2, α, δ)}, where R(C1, C2) ∈ R, α ∈ {1, 2} is the functionality type, and δ is the degree of pseudo-functionality. δ is defined to be 1 for functional relations.

It is often the case that the functionality of a relation applies to all its associating classes. Take located in(C1, C2) for example: the functionality holds for all possible pairs of (C1, C2) where C1 ∈ {Places, Areas, Towns, Cities, Countries} and C2 ∈ {Places, Areas, Towns, Cities, Countries, Continents}. For these relations, we omit the C1, C2 components and assume the functionality holds for all possible pairs of classes.

As is the case with MLN rules, the strength of this relational model is the ability to join the TΠ table to apply the constraints in batches.
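Concretely, TΩ might be stored as a table like the following sketch; the name FC and the columns arg and deg match those referenced in Listing 3 below, while the concrete SQL types are our assumption:

-- Sketch: one row per functional constraint (Definition 3.20).
CREATE TABLE FC (
    R   text,      -- functional relation name
    C1  text,      -- argument classes; NULL when the constraint holds
    C2  text,      --   for all possible class pairs
    arg smallint,  -- functionality type (alpha): 1 = Type I, 2 = Type II
    deg integer    -- degree of pseudo-functionality (delta); 1 if strictly functional
);

Listing 3 then gives one implementation of the applyConstraints function in Algorithm 3-1; it removes all entities that violate Type I functional constraints.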

Listing 3. applyConstraints(TΠ, TΩ)

-- Remove each (entity, class) pair that participates in a Type I
-- functional relation with more objects than the allowed degree.
DELETE FROM T
WHERE (T.x, T.C1) IN (
    SELECT T.x, T.C1
    FROM T JOIN FC ON T.R = FC.R
    WHERE FC.arg = 1
    GROUP BY T.R, T.x, T.C1, T.C2
    HAVING COUNT(*) > MIN(FC.deg) );

Type II functional constraints are applied similarly.

3.4 Experiments

In this section, we validate the ProbKB system in terms of both performance and the quality of inferred facts. We show the efficiency and scalability of ProbKB in Section 3.4.1. In Section 3.4.2, we show the efficacy of our quality control methods. We use the following datasets for the experiments:

Table 3-2. Sherlock-Reverb KB statistics.

# relations   82,768
# rules       30,912
# entities   277,216
# facts      407,247

Reverb-Sherlock KB: A real knowledge base constructed from Reverb Wikipedia extractions and Sherlock rules. Table 3-2 shows the statistics of this KB.²

S1: Synthetic KBs with the original 407,247 facts and varying numbers of rules ranging from 10K to 1M. The rules are either taken from Sherlock or randomly generated. We ensure the validity of the rules by substituting random heads for existing rules.

S2: Synthetic KBs with the original 30,912 rules and varying numbers of facts ranging from 100K to 10M. The facts are generated by adding random edges to the Reverb KB.

3.4.1 Performance

We use Tuffy as the baseline comparison. In Tuffy, each relation is stored in its own table and each individual rule is implemented by a SQL query. The original Tuffy does not support typing, so we re-implement it and refer to our implementation as Tuffy-T. On the contrary, ProbKB applies inference rules in batches. We implement ProbKB on PostgreSQL 9.2 and on Greenplum 4.2 (ProbKB-p). We run the experiments on a 32-core cluster with 64GB of RAM running Red Hat Linux 4 unless otherwise specified.

3.4.1.1 Case study: the Reverb-Sherlock KB

We run the Tuffy-T and ProbKB systems to perform the inference task on the Reverb-Sherlock KB. When we evaluate efficiency, we run Listing 3 once before inference starts and do not perform any further quality control during inference. This results in a KB with 396K facts. We bulkload the dataset and run Query 1³ for four iterations, which results in a KB of 1.5M facts. Then we run Query 2 to compute the factor graph. We run Queries 1 and 2 for all MLN partitions. Table 3-3 summarizes the results.

² However, there is a version mismatch between the datasets. Sherlock used TextRunner, an earlier version of Reverb, to learn the inference rules. Thus, there are a number of new entities and relations in Reverb that do not exist in Sherlock. In our experiments, there are initially 13K facts to which inference rules apply. To get a better understanding of ProbKB's performance, we will always report the sizes of result KBs or the number of inferred facts.

Table 3-3. Tuffy-T and ProbKB systems performance: the first three rows report the running time of the relevant queries in minutes; the last row reports the size of the result table.

Systems       Load    Query 1                               Query 2
                      Iter 1   Iter 2   Iter 3   Iter 4
ProbKB-p       0.25    0.07     0.07     0.15     0.48       9.75
ProbKB         0.03    0.05     0.12     0.23     1.28      36.28
Tuffy-T       18.22    1.92     9.40    22.40    44.77      84.07
Result size    396K    420K     456K     580K     1.5M       592M

As we see from Table 3-3, Tuffy-T takes over 607 times longer to bulkload than ProbKB since it loads 83K predicate tables, whereas ProbKB and ProbKB-p only need to load one. For Query 1, ProbKB outperforms Tuffy-T by over 100 times in the 2nd-4th iterations. This performance boost follows from our strategy of using a single query to apply batches of rules altogether, eliminating the need to query the database 31K times: instead, only 6 queries are executed in each iteration. On the other hand, we also observe that the KB grows unmanageably large without proper constraints, resulting in 592M factors after the 4th iteration, most of which are incorrect. For this reason, we stop at iteration 4. For Query 2, ProbKB-p has a speed-up of 8.6. The running time for this query is dominated by writing out the 592M-row table. Finally, we observe a speed-up of 4 using Greenplum over PostgreSQL.

³ We use Query 1 to refer to Queries 1-i for all partitions i.

3.4.1.2 Effect of batch rule application

To get better insight into the effect of batch rule application, we run the grounding algorithm on a set of synthetic KBs, S1 and S2, of varying sizes. Since S1 and S2 are synthetic, we only run the first iteration so that Query 2 is not affected by error propagation as in the Reverb-Sherlock case. The running times are recorded in Figures 3-5A and 3-5B.

Figure 3-5. Knowledge expansion performance comparison. (A) Runtime of KBs with varying numbers of rules; (B) Runtime of KBs with varying numbers of facts; (C) Runtime of the PostgreSQL and MPP versions of ProbKB. The dashed lines indicate the number of inferred facts.

As shown in Figure 3-5A, the ProbKB systems have much better scalability in terms of the MLN size. When there are 10^6 rules, ProbKB-p, ProbKB, and Tuffy-T take 53, 210, and 16,507 seconds, respectively, which is a speed-up of 311 for ProbKB-p compared to Tuffy-T. This is because ProbKB and ProbKB-p use a constant number of queries that apply MLN rules in batches in each iteration, regardless of the number of rules, whereas the number of SQL queries Tuffy-T needs to perform is equal to the number of rules. Figure 3-5B compares Tuffy-T and the ProbKB systems when the size of Π increases. We observe a speed-up of 237 when there are 10^7 facts.

Another reason for ProbKB's performance improvement is its output mechanism: ProbKB and ProbKB-p output a single table from each partition, whereas Tuffy-T needs to do 30,912 insertions, one for each rule.

3.4.1.3 Effect of MPP parallelization

Figure 3-5C compares three variants of ProbKB: on PostgreSQL (ProbKB) and on Greenplum, with and without redistributed materialized views (ProbKB-p and ProbKB-pn, respectively). We run them on S2 and record the running times for Queries 1 and 2. The results in Figure 3-5C show that both Greenplum versions outperform PostgreSQL by at least a factor of 3.1 (ProbKB-pn) when there are 10^7 facts. Using redistributed materialized views (ProbKB-p), we achieve a maximum speed-up of 6.3. Note that even with our optimization to improve join locality, the speed-up is not perfectly linear in the number of segments used (32). This is because the join data are not completely independent of each other; we need to redistribute the intermediate and final results so that the next operation has the data in the right segments. These intermediate data shipping operations are shown as redistribution or broadcast motion nodes in Figure 3-3. Data dependencies are unavoidable, but the overall results strongly support the benefits and promise of MPP databases for big data analytics.

3.4.2 Quality

To evaluate the effectiveness of different quality control methods, we run two groups of experiments, G1 and G2, with and without semantic constraints; for each group, we perform different levels of rule cleaning. The parameter setup is shown in Table 3-4.

Table 3-4. Quality control parameters. SC and RC stand for semantic constraints and rule cleaning, respectively.

      SC      RC (θ)
G1    no-SC   1 (no-RC), 20%, 10%
G2    SC      1 (no-RC), 50%, 20%

For the first group, we use no semantic constraints. We set the parameter θ (i.e., top θ rules) to 1 (no rule cleaning), 20%, and 10%, respectively. For the second group, we use semantic constraints and set θ = 1, 50%, and 20%. These parameters are obtained by experiments so as to achieve good precision and recall. For each experiment, we run the inference algorithm until no more correct facts can be inferred in a new iteration. In each iteration, we infer 5000 new facts, the precision of which is estimated using a random sample of size 25 (though the samples may not accurately estimate the precision, they serve our purpose of comparing different methods). Each sample fact is evaluated by two independent human judges. In cases of disagreement, we carry out a detailed discussion before making a final decision. Since all rules and facts are uncertain, we clarify our criteria for assessing these facts. We divide the facts into three levels of credibility: correct, probable, and incorrect. The "probable" facts are derived from rules that are likely to be correct, but not certain. For example, in Figure 3-4A, we infer that Rothman lives in Baltimore based on the fact that Rothman was born in Baltimore. This is not certain, but likely, so we accept it. However, there are also a number of facts inferred by rules that are possible but unlikely to hold, like ∀x ∈ City, ∀y ∈ Country (located in(x, y) → capital of(x, y)). We regard such results as incorrect. The precision is estimated as the fraction of correct and probable facts over the sample size.

3.4.2.1 Overall results

Using the Reverb-Sherlock KB, we are able to discover over 20,000 new facts that are not explicitly extracted. As noted in the beginning of Section 3.4, the Reverb-Sherlock KB initially has 13,000 facts to which inference rules apply. The precision of the inferred facts is shown in Figure 3-6A.

Figure 3-6. Overall result of quality control. (A) Precision of inferred facts using different quality control methods. (B) Error sources that lead to constraint violations.

As shown in the figure, both semantic constraints and rule cleaning improve precision. The raw Reverb-Sherlock dataset infers 4800 new correct facts at a precision of 0.14. The precision drops quickly as we generate new facts since unsound rules and ambiguous entities result in many erroneous facts. On the contrary, the precision significantly improves with our quality control methods: with the top 10% of rules we infer 9962 facts at a precision of 0.72; with semantic constraints, we infer 23,164 new facts at precision 0.55. Combining these two methods, we are able to infer 22,654 new facts at precision 0.65 using the top 50% of rules, and 16,394 new facts at precision 0.75 using the top 20% of rules.

It is noteworthy that using semantic constraints also increases recall (the estimated number of correct facts). As shown in Table 3-3, the KB size grows unmanageably large without proper constraints. The propagated errors waste virtually all computation resources, preventing us from finishing grounding and inferring all correct facts.

3.4.2.2 Effect of semantic constraints

As shown in Figure 3-6A, the use of semantic constraints greatly improves precision by removing the ambiguous entities from the extracted facts. These results are expected and validate our claims in Sections 3.3.1 and 3.3.2. Examples of removed ambiguous entities are shown in Figure 3-4B. We identify a total of 1483 entities that violate functional constraints and use 100 random samples to estimate the distribution of the error sources, as shown in Figure 3-6B. Out of these samples, 34% are ambiguous; 63% are due to erroneous facts resulting from incorrect rules (33%), ambiguous join keys (24%), and extraction errors (6%). In addition, we have 3% violations due to general types (e.g., both New York and U.S. are Places) and synonyms (e.g., New York and New York City refer to the same city). As a computational benefit, removing the errors generates a smaller and cleaner KB, with which we finish grounding in 15 iterations in 2 minutes on PostgreSQL. For the raw Reverb-Sherlock KB, on the contrary, iteration 4 alone takes 10 minutes for ProbKB-p, and we cannot finish the 5th iteration due to its exponentially large size.

3.4.2.3 Effect of rule cleaning

Rule cleaning aims at removing wrong rules to get a cleaner rule set for the inference task. In our experiments, we observe increases of 0.58 and 0.20 in precision for G1 and G2, respectively, as shown in Figure 3-6A. These positive effects are achieved by removing wrong rules, as expected. One pitfall we find in score-based rule cleaning is that the learned scores do not always reflect the real quality of the rules. There are correct rules with a low score and incorrect rules with a high score. As a consequence, when we raise the threshold, we discard both incorrect and some correct rules, resulting in higher precision and lower recall. One insight provided by Section 3.4.2.2 is that incorrect rules lead to constraint violations. Thus, it is possible to use semantic constraints to improve rule learners.

3.5 Summary

This chapter addresses the problem of knowledge expansion in probabilistic knowledge bases. The key challenges are scalability and quality control. We formally define the notion of probabilistic knowledge bases and design a relational model for them, allowing an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. The experiments show that our methods achieve orders of magnitude better performance than the state of the art, especially when using MPP databases. We use typing, rule cleaning, and semantic constraints for quality control. They are able to identify many errors in the knowledge base resulting from unsound rules, incorrect facts, and ambiguous entities. As a consequence, the inferred facts have much higher precision. Some of the quality control methods are still in a preliminary stage, but we have already shown very promising results.

CHAPTER 4
MINING FIRST-ORDER KNOWLEDGE BY ONTOLOGICAL PATHFINDING

Recent years have seen tremendous research and engineering efforts in constructing large knowledge bases (KBs). Examples of these knowledge bases include DBPedia [1], DeepDive [2], Freebase [3], Google Knowledge Graph [4], NELL [7], OpenIE [8, 9], ProBase [10], and YAGO [15, 40, 41]. These knowledge bases store structured information about real-world people, places, organizations, etc. They are constructed by human crafting (DBPedia, Freebase), information extraction (DeepDive, OpenIE, ProBase), reasoning and inference [13, 42], knowledge fusion [5, 25], or combinations of them (NELL). The knowledge expansion algorithm in Chapter 3 applies a set of inference rules to derive implicit facts from knowledge bases. These rules come from the Ontological Pathfinding (OP) algorithm introduced in this chapter. The OP algorithm builds upon the knowledge base relational model and scales up first-order rule mining by a series of parallelization and optimization techniques, including a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining task into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. These techniques allow us to mine 36,625 inference rules from Freebase in 33.22 hours, achieving a new state of the art for first-order rule mining.

We focus on the problem of mining first-order inference rules to support knowledge expansion. An inference rule is a first-order Horn clause used to discover implicit facts. As an example, the following rule expands knowledge of people's birth places:

wasBornIn(x, y), isLocatedIn(y, z) → wasBornIn(x, z).

Rules are useful in a wide range of applications: knowledge reasoning and expansion [13, 42], knowledge base construction [7], question answering [21], knowledge cleaning [19, 20], knowledge base maintenance [52], Markov logic learning [32], etc. Mining Horn clauses has been studied extensively in inductive logic programming. However, today's knowledge bases pose several new challenges. First, knowledge bases are often prohibitively large. For example, as of this writing, Freebase has 112 million entities and 388 million facts. None of the existing rule mining algorithms efficiently supports KBs of this size. Second, knowledge bases implement the open world assumption, implying that we have only positive examples for rule mining. To address these challenges, a number of new approaches have been proposed: Sherlock [24], AMIE [37], Markov logic structure learning [31, 32], etc. Still, new techniques need to be invented to scale state-of-the-art approaches up to knowledge bases of billions of facts.

We propose the Ontological Pathfinding (OP) algorithm to tackle the large-scale rule mining problem. We focus on scalability and design a series of parallelization and optimization techniques to achieve web scale. Following the relational knowledge base model [13], we store inference rules in relational tables and use join queries to apply them in batches. The relational approach outperforms state-of-the-art algorithms by orders of magnitude on medium-sized knowledge bases [13]. To scale to larger knowledge bases, we parallelize the mining algorithm by dividing the input knowledge base into smaller groups running parallel in-memory joins. The parallel mining algorithm can be implemented on state-of-the-art cluster computing frameworks to achieve maximum utilization of available computation resources.

Furthermore, even if we parallelize the mining algorithm, the parallel tasks are dependent on each other. In particular, the tasks need to shuffle data between stages. As the knowledge bases expand in scale, shuffling becomes the bottleneck of the computation. This shuffling bottleneck motivates us to introduce another layer of partitioning on top of the parallel computation: a partitioning scheme that divides the mining task into smaller independent sub-tasks. Each partition still runs the same parallel mining algorithm as before, but on a smaller input. Since the partitions are independent of each other, the results are unioned in the end; no data exchange occurs during computation. Our experiments show that with partitioning we accomplish the Freebase mining task within 34 hours, whereas it does not finish in 5 days without partitioning.

One major performance bottleneck is caused by large degrees of join variables in the inference rules. Applying these rules in the mining process generates large intermediate results, enumerating all possible pair-wise relationships of the joined instances. As a result, these rules are often of low quality. Based on this observation, we use non-functionality as an empirical indication of inefficiency and inaccuracy. In our experiments, we determine a reasonable functional constraint and show that 99% of the rules violating this constraint turn out to be false. Removing those rules reduces runtime by more than 5 hours for a single mining task.

Combining our approaches, we develop the first rule mining system that scales to Freebase, the largest public knowledge base, with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale. In summary, we make the following contributions to achieve scalability in mining first-order rules from large knowledge bases:

• We design an efficient rule mining algorithm that evaluates inference rules in batches using join queries. The algorithm runs in parallel by dividing the knowledge base into smaller groups running parallel in-memory joins;

• We design a novel partitioning algorithm to divide the mining task into small independent sub-tasks. Each sub-task is parallelized, running the same mining algorithm, but operates on small inputs. Data shuffling is unnecessary among sub-tasks;

• We define the non-functionality score to prune dubious and inefficient rules. We experimentally determine a reasonable functional constraint and show the constraint improves efficiency and accuracy of the mining algorithm;

• We conduct a comprehensive experimental study to evaluate our approaches over public knowledge bases, including YAGO and Freebase. Our research leads to a first rule set for Freebase, the largest knowledge base with hundreds of millions of facts. We publish the code and data repository online¹.

4.1 First-Order Mining Problem

We study the problem of mining first-order inference rules from web-scale knowledge bases: given a knowledge base of (subject, predicate, object) (or (s, p, o)) triples, we learn first-order Horn clauses of the form

(w, H(x, y) ← B),    (4–1)

where H(x, y) is the head predicate, the body B = ∧i Bi(zi) is a conjunction of predicates, and w is a scoring metric reflecting the likelihood of the rule being true. As in AMIE+ [22, 37] and other ILP systems [38, 39], we use a language bias and assume the Horn clauses to be connected and closed. Two atoms are connected if they share a variable. A rule is connected if every atom is connected transitively to every other atom in the rule. A rule is closed if every variable appears at least twice in different predicates. This assumption ensures that the rules do not contain unrelated atoms or variables.

¹ http://dsr.cise.ufl.edu/projects/probkb-web-scale-probabilistic-knowledge-base.

4.1.1 The Scalability Challenge

Our primary focus is scalable mining. Previous approaches [22, 24] scale to knowledge bases of 11.02 million facts, but no existing work mines inference rules from the 112 million entities and 388 million facts of Freebase. We investigate a series of parallelization and optimization techniques to achieve this scale, and we study how to use state-of-the-art data processing systems, e.g., Spark, to efficiently implement the parallel algorithms. The first-order mining problem is similar to association rule mining in transaction databases [37], but we note that they are different in nature. In first-order rules, the atoms are parameterized predicates. Each parameterized predicate can be grounded to a set of ground atoms. Depending on the size of the knowledge base, each rule can have a large number of possible ground instances. This makes mining first-order knowledge more challenging than mining traditional association rules in transaction databases.

4.1.2 Scoring Metrics

We review the support and confidence metrics for first-order Horn clauses of form H(x, y) ← B.

Support. The support of a rule is defined to be the number of distinct pairs of subjects and objects in the head of all instantiations that appear in the knowledge base:

supp(H(x, y) ← B) := |{H(x, y) | B(x, y) ∧ H(x, y) ∈ Γ}|.    (4–2)

Confidence. The confidence of a rule is defined to be the ratio of its predictions that are in the knowledge base:

conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | B(x, y)}|.    (4–3)

Our framework supports other scoring functions introduced in [24, 37]. For example, the PCA confidence of a rule is defined to be the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts p(x, y) such that ∃y′ : p(x, y′) ∈ Γ:

PCA conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ}|.    (4–4)

For each rule, we compute its support and confidence and set w = (supp, conf) in (4–1). The support and confidence metrics together indicate the quality of a rule.
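To make Equations (4–2) and (4–3) concrete, the following is a minimal SQL sketch that computes the support and body count for a single rule of the form p(x, y) ← q(x, z), r(y, z); the facts(pred, sub, obj) triple table and the predicate names 'p', 'q', 'r' are hypothetical, and the confidence is supp divided by body_cnt:

-- Sketch: support (4-2) and body size for one rule p(x,y) <- q(x,z), r(y,z),
-- assuming a deduplicated facts(pred, sub, obj) table.
WITH body AS (
    SELECT DISTINCT b1.sub AS x, b2.sub AS y
    FROM facts b1
    JOIN facts b2 ON b1.obj = b2.obj          -- join variable z
    WHERE b1.pred = 'q' AND b2.pred = 'r'
)
SELECT COUNT(h.sub) AS supp,                  -- predictions found in the KB
       COUNT(*)     AS body_cnt               -- all distinct predictions
FROM body
LEFT JOIN facts h
       ON h.pred = 'p' AND h.sub = body.x AND h.obj = body.y;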

4.2 Ontological Pathfinding

The Ontological Pathfinding (OP) algorithm aims at scaling up first-order mining algorithms to very large knowledge bases. It performs a series of parallelization and optimization techniques, as described below.

1. Construct. Enumerate syntactically correct rules according to the schema of the knowledge base Γ. We call these rules candidate rules. We store candidate rules in relational tables M according to structural equivalence.

2. Partition. Partition (Γ, M) into smaller inputs {(Γ1, M1), ..., (Γk, Mk)} that satisfy |Γi| ≤ s and |Mi| ≤ m for all i. The partitions are independent from each other and more efficient to solve than the input knowledge base.

3. Prune. Eliminate the non-functional rules with more than t joined instances in both predicates of the join. Non-functional rules produce large combinations of the non-join variables, most of them insignificant.

4. Mine. For each partition, run the parallel rule mining algorithm to compute the scores of the candidate rules. We note that while the rule mining algorithm runs in parallel in each partition, we run the partitions in serial.

These steps are summarized in Algorithm 4-1. In the following sections, we describe each step in detail.

Algorithm 4-1: Ontological-Pathfinding(Γ, s, m, t)

1 M ← construct-rules(Γ.schema);
2 {(Γi, Mi)} ← partition(Γ, ∅, M, s, m);
3 rules ← ∅;
4 forall (Γi, Mi) do
5     Mi ← prune(Γi, Mi, t);
6     rules ← rules ∪ parallel-rule-mining(Γi, Mi);
7 return rules

4.2.1 Rule Construction

As discussed in Section 3.1, all entities and predicates are typed in a knowledge base, i.e., each entity belongs to one or more types, and each predicate has its domain and range. The typing restricts the predicates that can be combined to form possible rules: they must have corresponding types for joining arguments. In this section, we utilize the schema to construct candidate rules.

Definition 4.1. The schema graph of a knowledge base is defined to be a graph GΓ = (V,E), where V = {v1, . . . , v|V |} is the set of nodes representing types and E = {ep(vi, vj)} is the set of labeled directed edges representing typed predicates p with vi as the domain and vj as the range.

In web-scale knowledge bases like YAGO and DBPedia, we often have a type hierarchy that divides types into subtypes for finer classification. For instance, the "Person" type has subtypes such as "Actor" and "Professor." These types are related by "subClassOf" edges. In Figure 4-1, we have a "subClassOf" edge from "Actor" to "Person," indicating that "Actor" is a subtype of "Person." To construct candidate rules, the subclasses need to inherit all the edges from their ancestors. The inherited edges are defined in Definition 4.2 using the following notations: 1. A(v) denotes the ancestor nodes u of v such that there is a path from v to u with all edges labeled "subClassOf"; and 2. E(u/v) denotes the neighboring non-subClassOf edges of u with at least one u substituted by v.

Table 4-1. Example KB schema from the YAGO knowledge base.

Predicate     Domain   Range
livesIn       Person   City
isMarriedTo   Person   Person
worksAt       Person   Organization
directed      Person   Movie
influences    Person   Actor
actedIn       Actor    Movie
subClassOf    Actor    Person

Definition 4.2. We define the schema closure graph of a schema graph G = (V, E) to be G′ = (V, E ∪ E′), where

E′ = ∪_{v ∈ V, u ∈ A(v)} E(u/v).

Example 4.3. In Table 4-1, we provide an example schema from the YAGO knowledge base. Its schema graph is shown in Figure 4-1. In this graph, each type in the domain and range columns is represented as a node, and each row in Table 4-1 is represented as an edge from domain to range. Since Actor is a subclass of Person, we infer additional edges by having Actor inherit Person's non-subClassOf edges, as shown by dashed arrows in Figure 4-1.

Following Definition 4.2, Algorithm 4-2 shows how to compute the closure of a specific node v in a schema graph G by DFS. When visiting a node v, we recursively visit its ancestors (Line 3) and add their neighboring edges to v (Line 4). Visiting each unvisited node in G yields its schema closure graph.

Figure 4-1. Example schema closure graph. Dashed arrows indicate inherited edges.

Algorithm 4-2: Closure(G = (V, E), v)

1 if !v.visited then
2     forall e_subClassOf(v, u) ∈ E do
3         Closure(G, u);
4         E ← E ∪ E(u/v);
5     v.visited ← True;

OP generates candidate rules by detecting closed walks (simple and non-simple cycles) in the schema closure graph. With a length limit l, OP starts from each node in the graph, searching at most l nodes for closed walks ending at the current starting node. When it detects a closed walk, it outputs a syntactically correct rule with the starting edge label as the head predicate. The same closed walk generates multiple rules with different head predicates. For instance, Figure 4-2 (left) shows three example closed walks from Example 4.3, and Figure 4-2 (right) shows the corresponding rules constructed from the closed walks. Although we have directed edges in the schema graph, we traverse it in an undirected manner: from any vertex v, we visit its neighbors via both incoming and outgoing edges. In Figure 4-2, R2 is constructed from a closed walk with repeated nodes and edges; R3 shows an example path containing an inherited edge, Actor -worksAt-> Organization, inherited from Actor's ancestor, Person.

Figure 4-2. Candidate rules R1-R3 constructed by cycle detection from Example 4.3. The first and last nodes in R1-R3 denote the same start and end node in the cycle.

R1: isMarriedTo(P1, P2), livesIn(P2, C) → livesIn(P1, C)
R2: directed(P, M), actedIn(A, M) → influences(P, A)
R3: worksAt(P, O), worksAt(A, O) → influences(P, A)

4.2.2 Partitioning

The rule construction algorithm stores candidate rules in relational tables. For web knowledge bases with large numbers of facts and predicates, the facts and rules tables tend to be prohibitively large. Before applying the rules by joining the rules table and the facts table, we partition these tables by dividing them into independent but possibly overlapping subsets running smaller instances of the rule mining algorithm, as shown in Figure 4-3. The partitions apply disjoint sets of candidate rules. In the end, the output rules from each partition are combined to construct the final result. The partitioning algorithm improves performance by accepting a size constraint and returning partitions that satisfy the constraint. The smaller partitions are more efficiently processed than the entire KB. We begin by introducing the notions of Γ-partition and Γ-size. They allow us to determine the size of a partition; we utilize this information to assign each rule to an appropriate partition.


Figure 4-3. Partitioning algorithm: the KB is partitioned into smaller, possibly overlapping parts running independent instances of the mining algorithm.

Definition 4.4. Denote the predicates in rule r = p0 ← p1, . . . , pl by Γ(r) = {p0, p1, . . . , pl}. We define the Γ-partition with respect to rules Mi = {r1, . . . , rm} to be the set of predicates

Γ(Mi) = Γ(r1) ∪ · · · ∪ Γ(rm).

Given an input knowledge base Γ and partitioned rules M = {M1, ..., Mk}, let Γi = {p(x, y) ∈ Γ | p ∈ Γ(Mi)} be the partition of facts induced by Mi. Then Γi contains all the facts we need to evaluate rules r ∈ Mi. Thus, (Γi, Mi) defines an independent partition of the input KB. Running Algorithm 4-8 on (Γi, Mi) returns the mining results for rules Mi. The size of a partition indicates how long the algorithm runs and can be determined using the predicate histogram H^0 from Section 4.2.3.

Definition 4.5. We define the Γ-size of a rule set Mi with respect to KB Γ to be

σ(Γ, Mi) = |Γi| = Σ_{p ∈ Γ(Mi)} H^0(p),    (4–5)

where H^0 = {(p, |{p(·, ·)}|)} is the predicate histogram.

Example 4.6. Consider the knowledge base Γ and rules table M in Figure 4-4. M is partitioned into two parts, M1 and M2 (initially without the last row, r). The corresponding Γ-partitions, according to Definition 4.4, are

Γ(M1) = {exports, imports, dealsWith, isLocatedIn},
Γ(M2) = {isLocatedIn, hasCapital, wasBornIn, isCitizenOf, worksAt}.

Let H^0 denote the histogram of Γ. We have

σ(Γ, M1) = H^0(exports) + H^0(imports) + H^0(dealsWith) + H^0(isLocatedIn) = 8,
σ(Γ, M2) = H^0(isLocatedIn) + H^0(hasCapital) + H^0(wasBornIn) + H^0(isCitizenOf) + H^0(worksAt) = 8.

Now consider the addition of rule

r = (isLocatedIn(x, y) ← isLocatedIn(x, z), hasCapital(y, z)).

We have:

Γ(M1 ∪ {r}) = Γ(M1) ∪ {hasCapital},
Γ(M2 ∪ {r}) = Γ(M2),

where the second equation holds because the predicates in r, {isLocatedIn, hasCapital} ⊂ Γ(M2). Consequently, σ(Γ, M1 ∪ {r}) = 10 and σ(Γ, M2 ∪ {r}) = 8. Adding r to M2 incurs a smaller increase of size than adding it to M1.

Suppose we have added r to M2, i.e., M2 ← M2 ∪ {r}. To evaluate M, instead of performing a big join between Γ and M, we run two small joins between (Γ1, M1) and (Γ2, M2). In each case, |Γ1| = |Γ2| = 8, 42.86% smaller than Γ. Therefore, each partitioned join is more efficient than the big join.

Figure 4-4. Rule table M, initial partitions M1, M2, and unpartitioned rule r.

(A) Γ (facts p(x, y)): exports(United States, Computer); exports(Canada, Aluminum); imports(United States, Aluminum); imports(United States, Clothing); dealsWith(Canada, United States); isLocatedIn(Washington, D.C., United States); isLocatedIn(Ottawa, Canada); isLocatedIn(Stanford University, Stanford, California); hasCapital(Canada, Ottawa); hasCapital(United States, Washington, D.C.); wasBornIn(Donald Knuth, Milwaukee, Wisconsin); isCitizenOf(Donald Knuth, United States); worksAt(Donald Knuth, Stanford University); hasAcademicAdvisor(Donald Knuth, Marshall Hall, Jr.).

(B) M (rules H(x, y) ← b1(x, z), b2(y, z)):
M1: dealsWith ← isLocatedIn, isLocatedIn; dealsWith ← exports, imports
M2: isCitizenOf ← wasBornIn, hasCapital; worksAt ← wasBornIn, isLocatedIn
r:  isLocatedIn ← isLocatedIn, hasCapital

Given an upper bound s on the Γ-size and an upper bound m on the number of rules for each partition, our goal is to find a partition {M1, ..., Mk} of M that satisfies the following constraints:

(C1) σ(Γ, Mi) ≤ s, 1 ≤ i ≤ k;
(C2) |Mi| ≤ m, 1 ≤ i ≤ k;
(C3) M1 ∪ · · · ∪ Mk = M;
(C4) Mi ∩ Mj = ∅, 1 ≤ i < j ≤ k.

We seek to find a partition {M1, ..., Mk} with as small a k as possible. Without a priori knowledge of the optimal k, we use a recursive binary partitioning scheme in Algorithm 4-3. In each recursive step, the input rule set M is partitioned into two smaller parts, M1 and M2. The algorithm terminates when all partitions satisfy the size constraints (C1) and (C2). The partitions satisfy the completeness and disjointness constraints (C3) and (C4) at each recursive step, as M is partitioned into M1 and M2 by Algorithm 4-4 such that M1 ∪ M2 = M and M1 ∩ M2 = ∅.

Algorithm 4-3: Recursive-Partition(Γ, Π, M, s, m)

1  if σ(Γ, M) ≤ s and |M| ≤ m then
2      Π ← Π ∪ {M};
3      return;
4  (M1, M2) ← Binary-Partition(Γ, M);
5  if M1 = ∅ then
6      Π ← Π ∪ {M2};
7  else if M2 = ∅ then
8      Π ← Π ∪ {M1};
9  else
10     Recursive-Partition(Γ, Π, M1, s, m);
11     Recursive-Partition(Γ, Π, M2, s, m);

Specifically, we first determine whether to partition the input M by checking if it already satisfies the size constraints (C1) and (C2) (Line 1). If it does, we add it to the final set of partitions and return (Line 2); otherwise, we use a binary partitioning algorithm to partition the rules (Line 4) and recursively partition the sub-parts (Lines 10-11). Lines 5-8 handle special cases where the size constraint (C1) cannot be satisfied. This happens when s < H^0(p) for some predicate p.

Algorithm 4-4: Binary-Partition(Γ, M)

1  M1 ← ∅;
2  M2 ← ∅;
3  forall r ∈ M do
4      ∆1 ← σ(Γ, M1 ∪ {r}) − σ(Γ, M1) + p(Γ, M1);
5      ∆2 ← σ(Γ, M2 ∪ {r}) − σ(Γ, M2) + p(Γ, M2);
6      if ∆1 < ∆2 then
7          M1 ← M1 ∪ {r};
8      else
9          M2 ← M2 ∪ {r};
10 return (M1, M2)

The binary partitioning algorithm is described in Algorithm 4-4, using a greedy assignment strategy: each rule is assigned to the partition with the smaller increase in size (Lines 4-9). Note that in Lines 4-5, we add a penalty term p(Γ, M) to penalize the larger partition, preventing it from absorbing all subsequent rules, as duplicate predicates do not increase the partition size. In our experiments, we set p(Γ, M) = σ(Γ, M)/50.

Since Γ-partitions are induced from partitions of inference rules (Definition 4.4) and rules from different partitions may contain duplicate head or body predicates, the Γ-partitions may overlap, as illustrated by the addition of rule r in Example 4.6. To measure this overlap, we introduce the notion of degree of overlap.

Definition 4.7. The degree of overlap (DOV) of a set of rule parts M = {M1, M2, ..., Mk} is defined to be

DOV(M) = Σ_i σ(Γ, Mi) / σ(Γ, ∪_i Γ(Mi)).    (4–6)

In Equation (4–6), the numerator Σ_i σ(Γ, Mi) is the total size of the partitions we make to evaluate the inference rules. The denominator σ(Γ, ∪_i Γ(Mi)) is the size of the KB induced by the rules M. A DOV greater than 1 indicates overlapping partitions.

Example 4.8. Consider the partitioning scheme from Example 4.6. While the initial partitions M1 and M2 are disjoint, adding r to M1 or M2 results in overlapping partitions:

DOV({M1, M2}) = (6 + 6)/12 = 1;
DOV({M1, M2 ∪ {r}}) = (6 + 7)/12 = 1.0833;
DOV({M1 ∪ {r}, M2}) = (8 + 6)/12 = 1.1667.

4.2.3 Rule Pruning

One performance barrier we observe in the candidate rules is that some of them generate prohibitively large intermediate results due to high-degree variables in the joining predicates, as demonstrated by the following example:

hasAcademicAdvisor(x, y) ← diedIn(x, z), wasBornIn(y, z).    (4–7)

In the above rule, we have variable z as the join variable. Meanwhile, in places with large populations, e.g., New York City or the state of California, there are hundreds of thousands of people who were born or died. Computing the confidence score of the rule requires applying the rule body and counting the distinct inferred facts. As a result of joining on the high-degree variable, the computation is inefficient.

Rule (4–7) violates the empirical skewed power-law degree distribution of natural sparse graphs [53]: most entities have relatively few neighbors while a few have many neighbors. The law implies that the neighbors of a high-degree entity are not likely to be connected with one another. Rule (4–7), however, predicts a "hasAcademicAdvisor" relationship between every pair of neighbors of the join variable z. This high-degree join problem is common among candidate rules since they are constructed from the KB schema with no validation against the facts. Thus, rules can accidentally contain irrelevant predicates that coincide on a join variable with a large degree. To detect such rules, we use the following histograms to determine the functional property, i.e., the non-functionality, of inference rules:

• Predicate Histogram H^0 = {(p, |{p(·, ·)}|)};
• Predicate-Subject Histogram H^1 = {(p, x, |{p(x, ·)}|)};
• Predicate-Object Histogram H^2 = {(p, y, |{p(·, y)}|)}.

In functional notation, we write H^0(p) = |{p(·, ·)}|, H^1(p, x) = |{p(x, ·)}|, and H^2(p, y) = |{p(·, y)}|. H^0 is used to compute the size of a KB partition, as explained in Section 4.2.2. H^1 and H^2 determine the sizes of intermediate results. For instance, the size of the join for Rule (4–7) can be computed by

Σ_z H^2(diedIn, z) · H^2(wasBornIn, z).

In Definition 4.9 below, we omit the position descriptors H^1, H^2 and use H(p, z) to denote the histogram entry for predicate p and join variable z in a general rule, the position of z being determined by the join under consideration.

Definition 4.9. For a connected, closed rule r: h ← b1, . . . , bl, we define the non-functionality of r as

NF(r) = max_{bi, bj connected by z} min(H(bi, z), H(bj, z)).    (4–8)

A functional constraint t accepts rules r with NF(r) ≤ t.
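As an illustration, the non-functionality of Rule (4–7) can be read off the predicate-object histogram H^2; the following sketch again assumes the hypothetical facts(pred, sub, obj) table:

-- Sketch: H^2 and the non-functionality of Rule (4-7) per Eq. (4-8).
CREATE TABLE h2 AS
SELECT pred, obj, COUNT(*) AS cnt
FROM facts
GROUP BY pred, obj;

SELECT MAX(LEAST(d.cnt, b.cnt)) AS nf
FROM h2 d
JOIN h2 b ON d.obj = b.obj                  -- join variable z
WHERE d.pred = 'diedIn' AND b.pred = 'wasBornIn';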

According to Definition 4.9, a functional constraint requires that each join, represented by a pair of connected atoms of the rule, have a join variable z with no more than t joined instances, as determined by min(H(bi, z), H(bj, z)) with z ranging over all joined values. We use non-functionality as an empirical indicator of rule incorrectness. Viewing a knowledge graph as a sparse natural graph (e.g., Freebase and YAGO2s have sparsities of 3.11 × 10−8 and 9.82 × 10−7, respectively), we justify our approach by the empirical power-law degree distribution of natural graphs: only a few entities in a knowledge graph have a large degree [53], implying that the neighbors of high-degree entities are unlikely to be inter-connected with one another, contrary to what non-functional joins suggest.

In our experiments, we observe that violations of functional constraints are strong indications of incorrect rules: more than 99% of the non-functional rules are wrong. Removing those erroneous rules improves both performance and rule quality. By varying the constraint t, we show that a reasonable choice lies between 50 and 250; we set t = 100 in the default configuration and experimentally justify this choice in Section 4.3.4.

Example 4.10. As an example, Table 4-2 shows a histogram from the YAGO2 knowledge base containing the numbers of people who were born and who died in New York City, London, and Montreal.

Table 4-2. Histogram for “wasBornIn” and “diedIn.”

Predicate   Location   Count
wasBornIn   NYC        1287
wasBornIn   London     1584
wasBornIn   Montreal   618
diedIn      NYC        737
diedIn      London     951

Using this histogram, we determine the non-functionality of Rule (4–7) to be 951, the joining variable z ranging over NYC and London. The total number of facts inferred by Rule (4–7) is 1287 × 737 + 1584 × 951 = 2,454,903. The result is 734 times larger than the head predicate "hasAcademicAdvisor" (H0(hasAcademicAdvisor) = 3340).
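A small Python sketch reproducing the computation of Example 4.10 under our reading of Definition 4.9 (the helper name nf is ours):

h2 = {
    ("wasBornIn", "NYC"): 1287, ("wasBornIn", "London"): 1584,
    ("wasBornIn", "Montreal"): 618,
    ("diedIn", "NYC"): 737, ("diedIn", "London"): 951,
}

def nf(h, p, q):
    # Non-functionality of a two-atom join of p and q on a shared object:
    # max over join values z of min(H(p, z), H(q, z)).
    zs = ({z for (pred, z) in h if pred == p}
          & {z for (pred, z) in h if pred == q})
    return max(min(h[(p, z)], h[(q, z)]) for z in zs)

print(nf(h2, "diedIn", "wasBornIn"))  # 951: pruned for any t < 951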

4.2.4 Parallel Rule Mining

We design a parallel rule mining algorithm to join the facts and rules tables of each partition. The mining algorithm divides the facts table into groups running parallel in-memory group joins, verifies the inferred facts, and collects rule statistics. Each step runs a parallel operation described in Section 2.3. We introduce the mining algorithm using the following equivalence class of rules:

p(x, y) ← q(x, z), r(y, z). (4–9)

In Section 4.2.4.1, we generalize it to other rule classes. We present Algorithm 4-5 using Spark primitives, but we note that it is a general parallel algorithm consisting of basic parallel operations. In Algorithm 4-5, the rules of form Rule (4–9) are represented as an RDD; each predicate variable p, q, r is assigned relations as the mining algorithm applies the rules. Figure 4-5 illustrates how Algorithm 4-5 transforms the datasets using parallel operations.

Algorithm 4-5: Parallel-Rule-Mining(facts, rules)
Input: facts = {(pred, sub, obj)}, rules = {(ID, head, body1, body2)}
1 Map each fact (pred, sub, obj) ∈ facts to (obj, (pred, sub));
2 GroupByKey obj, yielding a list of {(pred, sub)} pairs for each obj;
3 FlatMap the (obj, {(pred, sub)}) pairs to Group-Join(obj, {(pred, sub)}, rules), using Algorithm 4-6, yielding a list of ((pred, sub, obj), rule.ID) pairs;
4 ReduceByKey (pred, sub, obj), deduplicating the rule.IDs for each (pred, sub, obj) triple;
5 FlatMap the ((pred, sub, obj), {rule.ID}) tuples to Check({rule.ID}), using Algorithm 4-7, yielding a list of (rule.ID, (correct, 1)) pairs;
6 ReduceByKey rule.ID, summing the correct and 1 values;
7 Map each (rule.ID, (sum, count)) to (rule.ID, sum/count) pairs;

In Steps 1 and 2, Algorithm 4-5 groups the input facts by the join variable “obj,” corresponding to the variable z in Rule (4–9). This ensures that the tuples with the same join variable “obj” are in the same group so the rules can be applied to the disjoint groups, as shown by the “Group joins” in Figure 4-5. In addition, we broadcast the rules table to each group to ensure all relevant data are collocated for the joins. Consequently, the groups run disjoint in-memory group-joins, Algorithm 4-6, and are executed in parallel.
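To make the data flow concrete, the following is a minimal sketch of Algorithm 4-5 against Spark's Python API. The name parallel_rule_mining is ours, and the group_join and check helpers are sketched after Algorithms 4-6 and 4-7 below; we deduplicate rule IDs with a set after grouping, a simplification of Step 4:

from pyspark import SparkContext

def parallel_rule_mining(sc, facts, rules):
    # facts: list of (pred, sub, obj); rules: list of (id, head, body1, body2)
    rules_b = sc.broadcast(rules)                       # ship rules to every group
    scores = (
        sc.parallelize(facts)
        .map(lambda f: (f[2], (f[0], f[1])))            # Step 1: key by join var obj
        .groupByKey()                                   # Step 2: disjoint groups
        .flatMap(lambda g: group_join(g[0], list(g[1]), # Step 3: Algorithm 4-6
                                      rules_b.value))
        .groupByKey()                                   # Step 4: rule IDs per fact
        .flatMap(lambda fr: check(set(fr[1])))          # Step 5: Algorithm 4-7
        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))  # Step 6
        .mapValues(lambda s: s[0] / s[1])               # Step 7: confidence
    )
    return scores.collect()

# usage (illustrative): sc = SparkContext("local[*]", "op-sketch")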

The Group-Join algorithm applies the rules in each group. It builds a hash table of the input facts with “pred” as the key (Line 3). For each rule of form Rule (4–9) with

body predicates q and r, Group-Join searches for facts with predicates q and r in a nested loop in Lines 4-7. Since each group has the join variable “obj” matched by the previous “GroupByKey” operation, the matching process applies the body predicates q and r to relevant tuples. For each match, Group-Join generates an inferred fact p(x, y) with p determined by the head of the rule, x by the subject of the first match (sub1), and y by the subject of the second match (sub2), according to Rule (4–9).


Figure 4-5. Parallel rule mining: KB divided into groups by join variables, each group running Group-Join to apply inference rules.

In the Group-Join algorithm, we output both the input facts (Line 2) and the inferred facts (Line 7). Each (pred, sub, obj) triple is output as a key, with the value being the positive ID of the rule if it is inferred by that rule, or 0 if it is from the input knowledge base. These IDs are used to verify the inference results: if a fact is associated with an ID of 0, it exists in the input knowledge base; otherwise, it is inferred by the inference rules specified by the ID list.

Algorithm 4-6: Group-Join(obj, ps = {(pred, sub)}, rules)
1 forall (pred, sub) ∈ ps do
2   emit((pred, sub, obj), 0);
3 preds ← ps.groupBy(pred);
4 forall r ∈ rules do
5   forall sub1 ∈ preds.get(r.body1) do
6     forall sub2 ∈ preds.get(r.body2) do
7       emit((r.head, sub1, sub2), r.ID);

In Step 4, Algorithm 4-5 groups the output facts, each group aggregating the list of rules inferring the fact, identified by their IDs, as shown by the "Group by facts" transformation in Figure 4-5. The aggregated lists are used by Algorithm 4-7, Check, to determine whether each rule infers a correct fact by searching for 0 in the list. The Check algorithm outputs an (ID, (c, 1)) tuple for each rule, where c indicates the correctness of the inferred fact. The

Check algorithm transforms the input lists into tuples of rule statistics, as illustrated by “Check” in Figure 4-5. The component-wise sum of the (c, 1) tuples for each rule is the number of correct and total facts, respectively, inferred by the rule. Finally, steps 6 and 7 group by the rules, sum up the correct and total counts, and compute the confidence of each rule.

Algorithm 4-7: Check(rs = {rule.ID})
1 c ← rs.contains(0);
2 forall rule.ID ∈ rs do
3   emit(rule.ID, (c, 1));
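The per-group logic plugs into the driver sketched above and can be written in plain Python as follows (names ours); group_join mirrors Algorithm 4-6 and check mirrors Algorithm 4-7, except that our check skips the placeholder ID 0 when emitting statistics:

from collections import defaultdict

def group_join(obj, ps, rules):
    # Apply all rules of form p(x, y) <- q(x, z), r(y, z) within one group
    # sharing the join value obj.
    for pred, sub in ps:                       # emit input facts with ID 0
        yield ((pred, sub, obj), 0)
    preds = defaultdict(list)
    for pred, sub in ps:                       # hash table keyed by pred
        preds[pred].append(sub)
    for rule_id, head, body1, body2 in rules:  # nested-loop group join
        for sub1 in preds.get(body1, []):
            for sub2 in preds.get(body2, []):
                yield ((head, sub1, sub2), rule_id)

def check(rule_ids):
    # Emit a (correct, total) pair for every rule inferring this fact; the
    # fact is correct iff it also occurs in the input KB (ID 0 present).
    c = 1 if 0 in rule_ids else 0
    for rid in rule_ids:
        if rid != 0:
            yield (rid, (c, 1))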

The correctness of Algorithm 4-5 follows from the fact that the entire set of rules is broadcast to each group and that the groups are disjoint from each other (recall that they are grouped by the key "obj"). Therefore, Step 3 properly applies the rules to the facts. In Step 5, Algorithm 4-7 generates individual correct (0 or 1) and total (1) counts for the rules inferring each fact. Since each fact carries the "0" flag to determine its correctness, aggregating the results from all facts generates the final correct and total counts of each rule.

4.2.4.1 General rules

To generalize Algorithm 4-5 to other rule classes, we recall from Section 4.1 that the Horn clauses are assumed to be connected and closed. Thus, for a general rule with rule ID ṙ:

h(x, y) ← b1, b2, . . . , bk,

we can arrange the body atoms so that each bi is connected to bi+1, i = 1, . . . , k − 1, and bk is connected to h(x, y), by a shared variable zi. The general rule mining algorithm, Algorithm 4-8, allows zi to be vectors and to contain repeated variables.

Algorithm 4-8: General-Rule-Mining(facts, rules)
Input: facts = {(p, x, y)}, rules = {(ṙ, h, b1, . . . , bk)}
1 j1 ← facts;
2 forall pairs (bi, bi+1) with shared variable zi do
3   ji ← ji.GroupByKey(zi);
4   fi+1 ← facts.GroupByKey(zi);
5   if i + 1 < k then
6     ji+1 ← {(zi+1, (ṙ, xi+1))} = Group-Join(ji, fi+1, zi, rules);
7   else
8     jk ← {((h, x, y), ṙ)} = Group-Join-Last(ji, fk, zi, rules);
9 Process join result jk, as in Steps 4-7 of Algorithm 4-5;

Algorithm 4-8 joins two body atoms at a time. ji denotes the result of joining b1, . . . , bi, and is used as the operand for joining the next rule body, fi+1, in each iteration. The Group-Join and Group-Join-Last methods in Lines 6 and 8 implement the rule semantics. Group-Join performs the join and outputs tuples keyed by the next shared variable zi+1, along with the rule ID ṙ and any variables referred to by the head or subsequent body atoms. Group-Join-Last completes the join and infers facts from the rules, generating a list of (fact, rule ID) pairs, as in Algorithm 4-6, to be further processed to evaluate the confidence of each rule, as in Steps 4-7 of Algorithm 4-5.

As a remark, we note that Algorithm 4-8 requires the input rules to be of the same form. Thus, if there are N equivalence classes (defined in Section 3.2) of rules, we need N rule tables and N calls of Algorithm 4-8 to complete the mining task. On the other hand, each run of Algorithm 4-8 is highly efficient and optimized, as we apply the rules in batches and in parallel. Thus, our approach trades off generality for efficiency.

4.2.4.2 General confidence scores

In addition to the standard support and confidence scores, a number of improved metrics have been proposed, including the PCA confidence [22], head coverage [22], statistical relevance [24], etc. While we present our framework using the support and confidence scores, our approach generalizes to the other metrics, since all of them involve (1) applying the rules and (2) counting the results. These operations are defined by the Group-Join, Group-Join-Last, and Check algorithms; generalizing requires only redefining them to implement the semantics of the other metrics. In this section, we show how to design mining algorithms for other scores using the PCA confidence as an example. The PCA confidence [5, 22] of a rule is the fraction of its true predictions over the inferred facts we know to be either true or false, i.e., facts H(x, y) such that there exists y′ with H(x, y′) ∈ Γ:

PCA conf(H(x, y) ← B) := supp(H(x, y) ← B) / |{H(x, y) | ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ}|. (4–10)

In Equation (4–10), H(x, y) is the inferred fact; H(x, y′) is a fact with the same head H and subject x. The condition ∃y′ : B(x, y) ∧ H(x, y′) ∈ Γ states that the KB knows some value y′ for the head-subject pair (H, x) inferred by B(x, y). An inferred fact H(x, y) satisfying the condition has a known truth value: it is true if H(x, y) ∈ Γ and false if H(x, y) ∉ Γ. Only facts with known truth values contribute to the total count.

When checking an inferred fact p(x, y), we collect tuples of the form p(x, ·) to determine whether there exists y′ such that p(x, y′) ∈ Γ. We modify the Group-Join-Last algorithm to group facts p(x, y) by the predicate and the subject (p, x). The Check algorithm verifies the groups accordingly. These algorithms are described in Algorithms 4-9 and 4-10.

Algorithm 4-9: PCA-Group-Join-Last(ji, fk, zi, rules)
1 forall p(x, y) ∈ fact(fk, zi) do
2   emit((p, x), (y, 0));
3 g1 ← ji.groupBy(ṙ);
4 g2 ← fk.groupBy(p);
5 forall r = (ṙ, h, b1, . . . , bk) ∈ rules do
6   forall x ∈ g1.get(ṙ) do
7     forall y ∈ g2.get(bk) do
8       emit((h, x), (y, ṙ));

The major difference between Algorithms 4-9 and 4-6 lies in Lines 2 and 8: instead of emitting ((p, x, y), ṙ) pairs for each base and inferred fact, Algorithm 4-9 emits ((p, x), (y, ṙ)) pairs and groups by (p, x) in subsequent steps. A special rule ID "0" in the list of rules (Lines 1-2) informs PCA-Check of the existence of the (p, x) pair in the input knowledge base Γ.

Algorithm 4-10: PCA-Check(rs = {(y, ṙ)})
1 S ← {y | (y, 0) ∈ rs};
2 if S.empty() then
3   return;
4 forall (y, ṙ) ∈ rs do
5   c ← (y ∈ S);
6   emit(ṙ, (c, 1));

Algorithm 4-10 implements the PCA semantics. It starts by checking whether p(x, y′) ∈ Γ for any y′ by searching for ID "0" in the list of rules (Lines 1-3). If so, each inferred fact is labeled as correct (c = 1) or incorrect (c = 0) according to whether it appears in the input knowledge base (Lines 4-6). If an inferred fact is correct, then S must be non-empty and contain at least the corresponding y representing the fact; Line 5 then evaluates c to 1. Thus, the sum of c over the facts of each rule computes its support. The sum of the "1" components computes the number of facts inferred by the rule that are known to be either correct or incorrect. Hence, aggregating the counts of each rule as in Steps 6-7 of Algorithm 4-5 evaluates the PCA confidence of the rule.

The generalization is based on the observation that scoring functions apply rules and count the results. By overriding the definitions of Group-Join and Check, we generalize Algorithm 4-5 to other metrics. This pattern is manifest in other scoring functions: the head coverage [22], for example, applies rules and counts the ratio of correct facts in each predicate. Our approach thus provides a general framework for assessing inference rules based on counting.
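A compact Python sketch of the PCA-Check logic (Algorithm 4-10), with our naming: rs is the list of (y, rule ID) pairs grouped under one (predicate, subject) key, and pairs with rule ID 0 record the objects y already known for that key in the input KB; like Algorithm 4-10, it emits nothing when the truth value is unknown:

def pca_check(rs):
    known = {y for (y, rid) in rs if rid == 0}
    if not known:                 # no y' with p(x, y') in the KB:
        return                    # the inferred facts have unknown truth value
    for y, rid in rs:
        if rid != 0:
            yield (rid, (1 if y in known else 0, 1))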

4.2.5 Analysis

We start by analyzing the joins of the facts tables S, T and the rules table M under the functional constraint t. We then show how partitioning into small tables under the size constraint s affects the input tables and improves the overall complexity.

4.2.5.1 Parallel mining

To analyze the parallel mining Algorithm 4-5, we study the size of its intermediate result, which is dominated by joining the rule body. Suppose we have facts tables S, T, a rules table M, and a functional constraint t. We denote the histograms by HS and HT, the position of the join variable being implied by the context of the rule. We use M[·] and M̂[·] to denote the projection and distinct projection of table M onto the specified columns, respectively. The join size is then given by:

Σ_z Σ_{(p,q)∈M[b1,b2]} HS(p, z) · HT(q, z)
≤ t Σ_z Σ_{(p,q)∈M[b1,b2]} max{HS(p, z), HT(q, z)}
≤ t Σ_z Σ_{(p,q)∈M[b1,b2]} (HS(p, z) + HT(q, z))
= t (Σ_z Σ_{(p,q)∈M[b1,b2]} HS(p, z) + Σ_z Σ_{(p,q)∈M[b1,b2]} HT(q, z))
≤ t (|M| Σ_z Σ_{p∈M̂[b1]} HS(p, z) + |M| Σ_z Σ_{q∈M̂[b2]} HT(q, z))
≤ t (|M||S| + |M||T|)
= t|M|(|S| + |T|), (4–11)

where the first inequality follows from Definition 4.9. As a result, the time complexity of Algorithm 4-5 is dominated by O(t|M|(|S| + |T|)). In the case of a self-join, we have S = T, and the complexity reduces to O(t|M||S|). More generally, we show by induction that the time complexity is O(t^{l−1}|M||S|) for rules with l body predicates:

h(x, y) ← b1, . . . , bl.

Suppose the hypothesis holds for the first 1, 2, . . . , (l − 2)th joins, i.e., T(t, M, S, l − 2) ≤ c·t^{l−2}|M||S| for some constant c ≥ 1. We show that T(t, M, S, l − 1) ≤ c·t^{l−1}|M||S| for the (l − 1)th join. In the following analysis, we use HJ(r, z) to denote the number of intermediate joined tuples derived by rule r with shared variable z for the next join. Since HJ(r, z) is the histogram after l − 2 joins, we have HJ(r, z) ≤ t^{l−1} by the functional constraint. This histogram is not actually built during rule mining, but is used only for the purpose of complexity analysis. Thus, we have

Σ_z Σ_{(r,p)} HJ(r, z) · HS(p, z)
≤ Σ_z Σ_{(r,p)} max{t^{l−1} HS(p, z), t·HJ(r, z)}
= t^{l−1} Σ_z Σ_{(r,p)∈M1} HS(p, z) + t Σ_z Σ_{(r,p)∈M2} HJ(r, z)   (*)
≤ t^{l−1} |M1| Σ_z Σ_{p∈M̂1[p]} HS(p, z) + t Σ_z Σ_{r∈M2} HJ(r, z)   (**)
≤ t^{l−1}|M1||S| + t · c·t^{l−2}|M2||S|   (***)
≤ c·t^{l−1}(|M1| + |M2|)|S|
= c·t^{l−1}|M||S|.   (4–12)

In Equality (*), M1 contains the rules for which t^{l−1} HS(p, z) > t·HJ(r, z), and M2 contains the other rules, so M1 ∪ M2 = M and M1 ∩ M2 = ∅. In Inequality (**), we use M̂1[p] to denote the distinct predicates of M1 projected onto the current body atom p being considered; hence

Σ_z Σ_{(r,p)∈M1} HS(p, z) ≤ |M1| Σ_z Σ_{p∈M̂1[p]} HS(p, z).

We have also used the fact that

Σ_z Σ_{(r,p)∈M2} HJ(r, z) = Σ_z Σ_{r∈M2} HJ(r, z),

since the rule r uniquely determines the predicate p. Inequality (***) applies the hypothesis

Σ_z Σ_{r∈M2} HJ(r, z) ≤ c·t^{l−2}|M2||S|.

Therefore, the general time complexity for rules of body length l is O(t^{l−1}|M||S|). In practice, as we show in Section 4.3.4, a reasonable t lies between 50 and 250. Compared with a direct join of the facts and rules, O(|M||S|^l), we achieve a notable improvement with pruning.

4.2.5.2 Partitioning

The partitioning Algorithm 4-3 makes a best effort to satisfy the size requirement s. If a predicate p contains a large number of facts (H0(p) > s), Algorithm 4-3 puts the entire predicate in one partition. Assuming Algorithm 4-3 results in N partitions {M1, . . . , MN}, with the size of the largest partition being sm, the overall time complexity of evaluating the partitioned facts and rules tables, based on (4–12), is:

Σ_{i=1}^{N} O(t^{l−1} sm |Mi|) = O(t^{l−1} sm Σ_{i=1}^{N} |Mi|) = O(t^{l−1} sm |M|), (4–13)

where t is the functional constraint and l is the length of the rule body, as in Section 4.2.3. The time complexity is bounded with respect to the size of the largest partition (sm) instead of the size of the input knowledge base (|Γ|). Thus, partitioning reduces the time complexity from O(t^{l−1}|Γ||M|) to O(t^{l−1} sm |M|), allowing us to control the complexity by tuning the size constraints for very large knowledge bases.

We conclude this section by remarking on the difference between partitioning in Algorithm 4-3 and parallelization in Algorithm 4-5. Algorithm 4-3 breaks the input knowledge base into smaller independent partitions so that each partition runs its own instance of Algorithm 4-5. Algorithm 4-5 divides the input knowledge base into correlated groups running sub-procedures of a single mining instance. Algorithm 4-5 is used by Algorithm 4-3 as the parallel mining algorithm in each partition. The two-level partitioning-parallelization scheme is elucidated in Figure 4-3. These techniques combined scale the rule mining algorithm to Freebase.

4.3 Experiments

We validate our approaches by mining inference rules from YAGO and Freebase. Our work contributes the first rule set for Freebase: 36,625 first-order inference rules. In this section, we present our results, compare with the state-of-the-art KB rule mining algorithm, AMIE [53], and analyze the individual techniques from Sections 4.2 and 4.2.2. We begin by describing the datasets and the experiment setup.

YAGO. YAGO is a knowledge base derived from Wikipedia, WordNet, and GeoNames. Its newest version, YAGO2s, has more than 10M entities and 120M facts, including the schema, taxonomy, core facts, etc. We use the schema for rule construction and the core 4.48M binary facts for rule evaluation.

Freebase. Freebase is a community-curated knowledge base of well-known people, places, and things, containing 112M entities and 2.68B facts as of this writing. We preprocess

the dataset by removing the multi-language support and use the remaining 388M facts. The dataset statistics are summarized in Table 4-3A.

Table 4-3. OP experiment setup. (A) Dataset statistics. (B) Default parameters.

(A)
KB         Size
YAGO2      # Entities = 834,554; # Facts = 948,047
YAGO2s     # Entities = 2,137,468; # Facts = 4,484,907
Freebase   # Entities = 111,781,246; # Facts = 388,474,630

(B)
Max length: 3
Max Γ-size: 3M (YAGO), 10M (Freebase)
Max # of rules: 1000
Functional constraint: 100
Min support: 0
Min confidence: 0.0

Experiment setup. We conduct all experiments on a 64-core machine with AMD Opteron processors at 1.4GHz, 512GB RAM, and 3.1TB disk space. The OP and AMIE algorithms are implemented in Spark and Java/SQL, respectively, running on Spark 1.3.0, Java 1.8, and PostgreSQL 9.2.3.

Default parameters. Unless otherwise specified or the parameter in question is under evaluation, we use the default parameters in Table 4-3B. We determine the parameters by trying multiple parameter combinations and comparing the performance and the resulting rules. For the Freebase experiments in Sections 4.3.2 to 4.3.4, we report the result of one class of length-3 rules.

Rule set precision. We evaluate a rule set by assessing its most confident rules, i.e., those with a minimum confidence of 0.6 and supporting at least 2 facts. Under this constraint, we define the precision of a rule set as the percentage of rules satisfying the above threshold that we consider correct. Each rule is rated by two independent human judges. In case of disagreement, the judges conduct a detailed discussion until a final decision is made. We sample at most 300 rules from each rule set for human inspection.

Confidence   Rule
(1) 0.81   film/film/sequel(x, z), film/film/country(z, y) → film/film/country(x, y)
(2) 0.44   film/film/country(x, z), location/country/official language(z, y) → film/film/language(x, y)
(3) 1.0    book/book/first edition(x, y) → book/book/editions(x, y)
(4) 1.0    book/book/first edition(x, u), book/book edition/book(u, v), book/book/first edition(v, y) → book/book/editions(x, y)
(5) 0.41   film/film/sequel(x, u), film/film/country(u, v), location/country/official language(v, y) → film/film/language(x, y)
(6) 0.89   music/music video/music video song(x, u), music/composition/recorded as album(u, v), music/album/artist(v, y) → music/music video/artist(x, y)

Figure 4-6. Example Freebase rules.

4.3.1 Overall Result

To evaluate the performance of the OP algorithm and compare with the state-of-the-art, we run the OP and AMIE+ algorithms on Freebase and YAGO. As a result, OP mines 36,625 rules in 33.22 hours from Freebase, contributing the largest first-order rule repository created from public knowledge bases. We compare the detailed performance metrics, including the number and precision of mined rules and the runtime for each knowledge base, in Table 4-4.

Table 4-4. Overall mining result.

Dataset    Algorithm   # Rules   Precision   Runtime
YAGO2      OP          218       0.35        3.59 min
YAGO2      AMIE+       1090      0.46        4.56 min
YAGO2s     OP          312       0.35        19.40 min
YAGO2s     AMIE+       278+      N/A         4.89 h
Freebase   OP          36,625    0.60        33.22 h
Freebase   AMIE+       0+        N/A         5+ d

In terms of efficiency and scalability, OP outperforms AMIE+ in all the experiments we run. For Freebase, AMIE+ takes more than 5 days to generate a single rule (AMIE+ outputs an inference rule once it has determined its quality), whereas OP only takes 1.39 days.


Figure 4-7. OP overall result on YAGO2s and Freebase. (A)(B) YAGO2s performance. (C)(D) Freebase performance. (E) Quality of Freebase length 4 rules. (F) Effect of parallelism.

For YAGO2s, OP is more than 15 times faster than AMIE+. For YAGO2, due to its small size, partitioning and parallelization have limited advantage, so OP is only 0.97 minutes faster.

The quantity of mined rules is large: 36,625 first-order rules from Freebase, spanning a variety of topics: film, book, music, computer, etc., as shown in Figures 4-6 and 4-10(3). OP mines fewer rules from YAGO and YAGO2s because their sizes are much smaller and their schemas are incomplete. Possible domain and range values are missing from overloaded predicates, so the rule construction algorithm generates only a subset of all possible rules from the available schema. Using a more accurate schema, e.g., the Freebase schema, improves recall.

In terms of precision, Freebase rules achieve 0.60, outperforming YAGO rules by more than 0.1. The precision benefits from Freebase's cleaner data and schema. To illustrate, consider Rule (1) in Figure 4-6. In this rule, we have the predicates "film/film/sequel" and "film/film/country." These predicates impose very specific constraints on the data: "sequel" means the sequel of a film, and "country" refers to the producing country of a film. Thus, the Freebase predicates contain fine-grained and precise data instances. On the other hand, because 1) YAGO2s has fewer predicates and 2) YAGO2s predicates are less well-defined, YAGO2s generates fewer rules with lower quality. Consider the "create" predicate from YAGO2s: the domain is possibly writer, musician, filmmaker, author, etc., and the range can be book, music, film, novel, etc. Thus, "create" can be combined with any matching predicates to form candidate rules, leading to spurious results. The rule "isMarriedTo(x, y) ← created(x, z), created(y, z)" illustrates this situation. Other predicates, like "playsFor," "owns," and "influences," are similarly misused.

Figures 4-7A-D report OP's performance for one type of rule of each length from 2 to 5. We mine 1,006 and 83,163 rules for YAGO2s (lengths 4 and 5) and Freebase (length 4) in 8.62 and 82.77 hours, respectively. More than 97% of the time is spent in the parallel joins. The schema graph and histograms are small, making construction and pruning efficient, taking only 13.29 min for YAGO2s and 2.29 min for Freebase, as reported in Figures 4-7A and C.

87 The construction process for YAGO2s is slower than for Freebase since its predicates are heavily overloaded, as we discuss above, resulting in expensive joins.

Table 4-5. Schema graphs and histograms.

KB         Schema   Histogram
YAGO2s     284      4187
Freebase   67,415   134,889

To keep the histograms small, we only store entries with more than t counts, where t is the functional constraint. Their sizes are shown in Table 4-5. Building the schema and histograms takes 0.58 min for YAGO2s and 7.42 min for Freebase. They are stored in tables shared among subsequent queries. Overall, rule construction and pruning are efficient.

Analyzing the rules, we observe that 90.3% of them reduce to length-2 and length-3 rules, which we classify as trivial extensions and composite rules, as shown in Figure 4-7E. We call a rule a trivial extension of another rule if it can be reduced to the other rule by applying and removing valid rules from its body. In Figure 4-6, Rule (4) is a trivial extension of Rule (3), since the rule "book/book edition/book(u, x) ← book/book/first edition(x, u)" infers that v = x in (4); by replacing v with x and removing the applied rule, it reduces to (3). We call a rule composite if it can be rewritten by chaining shorter rules. For instance, Rule (5) in Figure 4-6 is a composite rule of (1) and (2). Rule (6) gives an example of a correct and irreducible length-4 rule.

The trivial extensions and composite rules provide little knowledge beyond the length-2 and length-3 rules. Thus, their distribution in Figure 4-7E implies limited benefits from mining longer rules. We remove those rules and evaluate the remaining rules as we do with length-2 and length-3 rules. As a result, the precision is much lower than that of the length-2 and length-3 rules: 0.04 for YAGO2s and 0.03 for Freebase, as reported in Figures 4-7A-E. These results suggest the primary and foundational importance of shorter rules in a knowledge base and motivate us to limit the maximum rule length to 3 in the default setting.

In summary, the overall results justify the benefits of the OP algorithm for mining web-scale knowledge bases. In the remainder of this section, we examine the individual techniques of parallelism, partitioning, and rule pruning in greater detail and show how they improve the performance and quality of the rule mining task.

4.3.2 Effect of Parallelism

We evaluate the effect of parallelism by comparing the parallel mining algorithm with a SQL implementation on PostgreSQL. We vary the number of cores for parallel mining from 1 to 64 and report in Figure 4-7F the relative speedup compared to running Spark on one core. As a result, the parallel mining algorithm achieves a speedup of 5.70 and 3.34 on YAGO2s and Freebase, respectively. For YAGO2s, Spark with one core is slower than SQL due to job setup and initialization, as shown by the circle, but with 64 cores Spark is 3.14 times faster than the SQL implementation. For Freebase, the SQL queries run for more than 5 days on PostgreSQL and on an in-house parallel database system, Datapath [54]. The speedup of the parallel mining algorithm results from two factors: 1) the SQL query performs one large join, while Algorithm 4-5 runs smaller joins in parallel; 2) the shuffling step in Spark is more efficient than the deduplication operation in PostgreSQL given a large output from the previous joins. These results attest to the overall advantage of parallelizing the rule mining algorithm. Nonetheless, we see that the parallelization does not make full use of the 64 available processors, because the output sizes and performance of the Group-Joins vary greatly among groups, depending on the data distribution, and the overall runtime is dominated by the slowest joins among the groups. Moreover, the efficiency of the shuffling step is restricted by data dependencies among parallel workers.

4.3.3 Effect of Partitioning

Partitioning is a key step in scaling up the mining algorithm. By setting a maximum Γ-size s and number of rules m, the partitioning algorithm breaks the input knowledge base into parts no larger than the specified size. Our experiments show the OP algorithm


completes the Freebase mining task in 1.39 days with partitioning, a task that otherwise runs for more than 5 days without success. The result of Freebase partitioning is illustrated in Figure 4-8: in Figure 4-8A, we set s = 20M and m = 2K; in Figure 4-8B, we set s = 200M and m = 10K. In the former case, we have 65 partitions, all running faster than the partitions from the latter case, with the fastest partition finishing in 14.18 seconds and the slowest in 1.17 hours. In the latter case, we have 5 large partitions, the fastest taking 4.58 hours and the slowest taking 1.27 days.

Figure 4-8. Sizes and runtime of Freebase partitions. (A) s = 20M, m = 2K. (B) s = 200M, m = 10K.

The effect of choosing different partition sizes is shown in Figures 4-9A-C for Freebase and Figure 4-9D for YAGO2s. In the Freebase experiments, the effect of partitioning is substantial: as we vary s from 200M to 5M and m from 10K to 1K, the total runtime decreases from 2.55 days to 5.06 hours. The reason for this speedup is that the partitioning algorithm splits the input knowledge base into smaller ones that are more efficiently joined, and the overhead of the overlap is less significant than the benefit of joining smaller tables. This benefit is further verified by the decline in the runtime of the largest partitions from 1.27 days to 38.14 minutes as we lower the size constraints, as shown in Figure 4-9B, indicating that the partitions are more efficiently joined because of their smaller sizes. Consequently, the overall runtime drops significantly despite partition overlaps.

In Figure 4-9C, we show that the DOV increases from 1.14 to 1.35 as we create smaller partitions. The increasing DOV means we spend more time partitioning Freebase: from 2.45

minutes to 61.43 minutes. Comparing with Figure 4-9A, we see that the reduction from joining smaller partitions has a greater impact on the total runtime. On the other hand, if s and m become too small, the overhead of overlapping partitions begins to dominate. The overlapping effect is shown in Figure 4-9D as we partition the 4.48M-fact YAGO2s into smaller parts: while we improve the runtime of the slowest partition from 5.80 minutes to 1.79 minutes, the total runtime rises to 29.60 minutes after hitting the optimum of 19.40 minutes at s = 3M and m = 500. The drop in performance is caused by the growing number of overlapping partitions. The extreme case of applying one rule at a time, taken by state-of-the-art approaches, is equivalent to having one rule in each partition. Given a large search space of candidate rules, it implies a large number of queries, hence too much overlapping overhead for it to be efficient.

4.3.4 Effect of Rule Pruning

The functional constraint t affects the mining algorithm in terms of both performance and quality. To evaluate the accuracy, we define the pruning precision of the pruned rules as the percentage of those rules that we consider erroneous. Thus, a high pruning precision indicates that erroneous rules are pruned as desired and justifies the proposed approach. As we vary the functional constraint t and prune violating rules, we report the runtime, number, and pruning precision of the pruned rules in Figures 4-9E-F. We make two observations from Figures 4-9E-F: (1) When t ≥ 200, the pruning precision reaches its maximum value of 1.0: all pruned rules are erroneous. However, the runtime grows from 9.55 hours to 14.27 hours, indicating wasted computation in evaluating wrong rules that should otherwise be eliminated by setting a smaller t. (2) On the other hand, decreasing t from 50 to 2 causes the pruning precision to drop sharply from 0.995 to 0.82 while improving runtime by only 1.54 hours, from 7.97 to 6.43 hours. From the above observations, we see that the rule pruning process improves both performance and quality provided we choose a proper t constraint. Based on Figure 4-9F, a value between 50 and 250 is reasonable. In our default setting, we set t = 100.


Figure 4-9. Effect of partitioning and pruning. (A)-(D) Runtime by varying partition sizes. (E)-(F) Runtime and accuracy by varying functional constraints.

With t = 100, we detect 101 and 2352 non-functional rules from YAGO2s and Freebase, respectively. For Freebase, more than 99% of these rules are correctly pruned. Rules (1) and (2) from Freebase in Figure 4-10 illustrate the common reason why functional constraints are violated: in Freebase and other knowledge bases, we have many-one, many-few, and many-many predicates. The "location/location/containedBy" predicate in Rule (1), for example, is a many-few predicate. The rule construction algorithm, based on the KB schema, is unaware of the functionality properties of the predicates. When the "one" or "few" variable happens to be the join variable, the rule violates the functional constraint.

Rule
(1) location/location/containedby(x, z), location/location/contains(z, y) → location/location/contains major portion of(x, y)
(2) engineering/engine category/engines(z, x), engineering/engine category/engines(z, y) → engineering/engine/variants(x, y)
(3) computer/computer processor/variants(x, z), computer/computer processor/variants(y, z) → computer/computer processor/variants(x, y)

Figure 4-10. Example rules violating functional constraints.

Rule (3) is an incorrectly pruned rule due to a low functional constraint of t = 5. The "computer/computer processor/variants" predicate defines an equivalence relation: variants of a computer processor are variants of each other. Given the sparsity of natural graphs [53], we reduce such pruning errors by raising the functional constraint t.

4.4 Summary

In this chapter, we address the scalable first-order rule mining problem. We present the Ontological Pathfinding algorithm to mine first-order inference rules from web-scale knowledge bases. We achieve the Freebase scale via a series of parallelization and optimization techniques: a relational knowledge base model that applies inference rules in batches, a rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm that divides the mining task into smaller independent sub-tasks, and a rule pruning strategy to detect incorrect and resource-consuming rules. Combining these techniques, we mine the first rule set for Freebase, the largest public knowledge base with 388 million facts and 112 million entities, in 34 hours. No existing system achieves this scale. We publish our code and data repositories online.

CHAPTER 5
SCALABLE KNOWLEDGE EXPANSION AND INFERENCE

Using the rules from the mining algorithm, we design an efficient inference algorithm for knowledge expansion. We infer 0.9 billion new facts from Freebase in 17.19 hours, scaling up the current state-of-the-art [13] inference engine to a 30 times larger knowledge base. Benefiting from cleaner input rules and from the parallelization and partitioning techniques, inference over Freebase is 48% faster than mining the rules. We evaluate the inferred facts by cross validation and compare with the evaluation from AMIE+ [22]. We show that we derive 60% new facts with an accuracy approaching 1.0, a much higher precision and recall than AMIE+. Moreover, the cross validation methodology is more feasible and general than the semi-automatic evaluation used by AMIE+. We extend our previous contributions to scale up first-order inference and propose the cross validation method to evaluate the inference result:

• Based on the optimization techniques for rule mining, we adopt the relational knowledge model from [13] and extend the inference algorithm by parallelization and partitioning. We describe the extended inference algorithm in Section 5.1.

• In our experiments with inference, we derive 927M new facts from Freebase in 17.19 hours. Using cross validation, we show that the top 60% of facts have an accuracy approaching 1.0. We achieve a better quality than AMIE+ [22] and our previous result [13] over Reverb-Sherlock [8, 24]. We describe the extended experiments in Section 5.3.

The inference algorithm is described in Algorithm 5-1. It runs for N rounds. In each round, Algorithm 5-1 partitions the knowledge base and runs the parallel inference algorithm on each partition. The partitioning algorithm is the same as in Algorithm 4-1. At the end of each inference round, the inferred facts are merged into the knowledge base. We describe the parallel inference algorithm in Section 5.1 and the partitioning algorithm in Section 4.2.2.

5.1 Parallel Inference

Assuming the input knowledge base is represented as a table of {(p, x, y)} tuples and the rules as tables of {(ṙ, h, b)} tuples, we express the inference algorithm as a sequence of parallel operations, as we do in the rule mining algorithm.

Algorithm 5-1: Infer(Γ, M, s, m, N)
1 F ← ∅;
2 for n ← 1 to N do
3   {(Γi, Mi)} ← Partition(Γ ∪ F, ∅, M, s, m);
4   forall (Γi, Mi) do
5     Pi ← Parallel-Inference(Γi, Mi);
6   F ← F ∪ (∪i Pi);
7 return F;
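A minimal Python sketch of this driver loop, assuming hypothetical helpers partition (the partitioning of Section 4.2.2) and parallel_inference (Algorithm 5-2); the names are ours:

def infer(kb, rules, s, m, n_rounds):
    new_facts = set()
    for _ in range(n_rounds):
        # re-partition the KB together with the facts inferred so far
        parts = partition(kb | new_facts, rules, s, m)   # [(facts_i, rules_i)]
        for facts_i, rules_i in parts:
            new_facts |= parallel_inference(facts_i, rules_i)
    return new_facts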

Similar to Section 4.2.4.1, we assume that the rules are connected and closed, so they can be written as

h(x, y) ← b1, b2, . . . , bk,

where each bi is connected to bi+1, i = 1, . . . , k − 1, and bk is connected to h(x, y), by a shared variable zi. Under this assumption, Algorithm 5-2 applies the inference rules and derives implicit facts from the knowledge base.

Algorithm 5-2: Parallel-Inference(facts, rules)
Input: facts = {(p, x, y)}, rules = {(ṙ, h, b1, . . . , bk)}
1 j1 ← facts;
2 forall pairs (bi, bi+1) with shared variable zi do
3   ji ← ji.GroupByKey(zi);
4   fi+1 ← facts.GroupByKey(zi);
5   if i + 1 < k then
6     ji+1 ← {(zi+1, (ṙ, xi+1))} = Group-Join(ji, fi+1, rules);
7   else
8     jk ← {((h, x, y), ṙ)} = Group-Join-Last(ji, fk, rules);
9 return jk.GroupByKey((h, x, y)).filter(0 ∉ {ṙ});

The inputs of Algorithm 5-2 are the facts and rules. The rules are a subset of the candidate rules that the mining algorithm considers correct, e.g., rules with a positive confidence or a confidence above a user-specified threshold. The correctness of the rules can be estimated using the cross validation methodology, as we describe in Section 5.3. The group joins (Lines 6 and 8) are similar to those of the mining Algorithm 4-8, inferring new facts by applying inference rules to each group. After applying the inference rules, Line 9 removes the initial facts from

the result and returns. Algorithm 5-2 is similar to the rule mining algorithm except that it returns the inferred facts instead of generating counting statistics for the rules, and that it does not prune rules, since the mining algorithm is assumed to have removed incorrect rules. As with the mining algorithm, Algorithm 5-2 applies one table of rules in batches at a time. Thus, each rule type has its overloaded group join definition according to the rule structure.

The correctness of Algorithm 5-2 follows from that of the group joins: if the groups are disjoint from each other and the joins correctly apply the inference rules to each group, the combined result will correctly contain the inference results. Assuming the non-functional rules have been eliminated in the rule mining process, the time complexity of Algorithm 5-2 is O(t^{l−1}|M||S|), where t is the functional constraint and l is the length of the rule body. The time complexity coincides with that of the mining algorithm, since the group joins perform the same operations as their mining counterparts.
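For illustration, the final step (Line 9) can be written against Spark's Python API as follows, assuming jk is an RDD of (fact, rule ID) pairs produced by Group-Join-Last; the function name is ours:

def new_facts_only(jk):
    # Group the joined output by fact and keep only facts never tagged
    # with rule ID 0, i.e., facts not already in the input KB.
    return (jk.groupByKey()
              .filter(lambda kv: 0 not in set(kv[1]))
              .keys())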

Example 5.1. Figure 5-1 illustrates an iterative application of Algorithm 5-2 to apply rule tables M1 and M3 to the input KB S^0, constructed from the Reverb-Sherlock knowledge base. In this example, each entity and variable in the facts and rules tables has an associated class, namely, Writer, Place, City. These rule structures can be accommodated by performing type checks in the group joins.

In the first iteration, we run Algorithm 5-2 to apply all rules in M1 and M3 in batches. The result is given in S^1_1 and is merged with S^0. In this iteration, all four new facts are derived by rules in M1. In the second iteration, we run Algorithm 5-2 to apply the rules in M1 and M3 again, and a new fact "located in(Brooklyn, NYC)" is inferred by the rules in M3. Both rules in M3 are applied in one query, although there is only a single result, which is merged with S^1. Note that in each iteration of inference, all rule tables should be applied, but in this illustrative example, only M1 and M3 are applicable. After 2 iterations, we infer 5 new facts, expanding the input knowledge base to a total of 7 facts.

Figure 5-1. Knowledge expansion example. (A) Example facts table S^0. The abbreviations "P," "C," "W," etc. represent entities and classes and are explained in the "Notations" box. (B)(C) Example rules tables. (D) Example query tree for inference. S^j_i denotes the intermediate result for type i rules in the jth iteration; S^j denotes the merged result at the jth iteration. (E)-(G) Inference results. Shaded rows correspond to shaded tables in (D), representing the input facts from the previous iteration.

To evaluate the effect of knowledge expansion, we measure the relative size and precision of the inferred facts. Using relative expanded sizes allows us to compare the degree of expansion over knowledge bases of different sizes. Measuring precision is more challenging

due to the open world assumption and the lack of ground truth. As an estimate, we use the cross validation described in Section 5.3.

Definition 5.2. Assume we perform knowledge expansion on S^0 and obtain S. We define the degree of expansion as

DOE(S^0, S) = |S \ S^0| / |S^0|.

Thus, the degree of expansion of an inference algorithm is the ratio of the number of inferred facts (beyond the input KB) to the size of the input KB. This relative measure allows us to compare expansion over knowledge bases of different sizes. For instance, in Example 5.1, we have a DOE of 5/2 = 2.5. For Freebase, we infer 927M new facts from 388M facts, attaining a DOE of 2.39.

The input rules of Algorithm 5-2 are generated by the OP algorithm. Due to the incompleteness of the input knowledge base and statistical properties of the scoring metrics, the rules are uncertain. Likewise, the facts may come from either human knowledge or information extraction algorithms, depending on how the knowledge base is constructed. Thus, the knowledge bases often contain noisy and inaccurate facts or rules. In such knowledge bases, the errors tend to accumulate and propagate rapidly in the inference chain, as illustrated in Figure 3-4A. As a result, the inferred knowledge is full of errors after only a few iterations. Hence, it is important to detect errors early to prevent error propagation. Analyzing the inference results, we identify the following error sources:

E1) Incorrect facts resulting from the IE systems.
E2) Incorrect rules resulting from the rule learning systems.
E3) Ambiguous entities referring to multiple entities by a common name, e.g., "Jack" may refer to different people. They generate erroneous results when used as join keys.
E4) Propagated errors resulting from the inference procedure. Figure 3-4A illustrates how a single error produces a chain of errors.

Figure 5-2. Cross validation: the knowledge base is partitioned into training and testing sets. The Ontological Pathfinding and parallel inference algorithms run on the training and testing sets, respectively, with the inferred facts verified against the input KB. (Steps: ➀ cross partition; ➁ mine rules; ➂ infer; ➃ evaluate; ➄ analyze.)

Due to the open world assumption, these errors are hard to identify without ground truth, making it a challenge to analyze the correctness of the inferred facts. In our experiments, we use cross validation to estimate the precision of the inferred facts; by splitting Freebase into training and testing sets, we estimate a precision approaching 1.0 for the facts within a 0.6 degree of expansion. Most of the errors are caused by erroneous rules, as Freebase is itself a high quality knowledge base. On the contrary, for machine-constructed knowledge bases, ambiguous entities and incorrect extractions are also major sources of inaccurate results [13].

Fact evaluation for knowledge inference has previously been performed manually or semi-automatically: to determine the correctness of a fact, earlier approaches [13, 22, 24] examine inferred facts manually by looking them up on the web, e.g., Wikipedia. [22] also uses a semi-automated approach by inferring facts from an older version of a KB (e.g., YAGO2) and verifying the inferred facts in a newer version (e.g., YAGO2s). This approach reduces human effort, but does not generalize to other KBs. For instance, Freebase does not have sufficiently different versions that can be used for such validation. Our approach is similar to [22] in that cross partitioning simulates an old version of an existing knowledge base. The main benefits of our approach are generalizability to any knowledge base and fast creation of training and testing sets. We choose K = 5 to simulate the relative sizes of YAGO and YAGO2s (1:4.73).

In our approach, cross validation involves 5 steps: cross partitioning, mining, inferring, evaluating, and analyzing, as illustrated in Figure 5-2. Let Γ = {(s, p, o)} be the input knowledge base, P = {p | (s, p, o) ∈ Γ} be the set of predicates, and Pi = {(s, p, o) ∈ Γ | p = pi ∈ P} be the set of facts with predicate pi. We describe the detailed procedure below.

1. Cross partition. We randomly partition each set Pi into K nearly equal-sized parts {Pi1, . . . , PiK}, with ||Pin| − |Pim|| ≤ 1 for all 1 ≤ n ≤ m ≤ K. Let Qk = ∪_i Pik be the union of partition k of each Pi. Then, Q = {Q1, . . . , QK} forms a partition of the knowledge base. We pick (see the sketch after this list)

Test = random(Q);  Train = ∪_{Q ∈ Q\Test} Q.

2. Mine. rules ← OP(Train, s, m, t), where s, m, t are the constraint parameters as defined in Algorithm 4-1.

3. Infer. facts ← Infer(Test, s, m, N). We assign each inferred fact a confidence score, determined by the confidence of the rule inferring it. Facts inferred by multiple rules are assigned the maximum confidence of these rules.

4. Evaluate. We sort the facts by confidence score and evaluate them against the input knowledge base as the ground truth. For a given degree of expansion DOE, we define the corresponding precision as

precision(DOE) = |top(facts, N) ∩ Γ| / N,

where N = |Test| × DOE is the number of inferred facts.

5. Analyze. We vary DOE from 0 to 1 and report the precisions.
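A minimal Python sketch of the cross-partitioning step (Step 1); the function name and the shuffled round-robin assignment are our illustration of the ||Pin| − |Pim|| ≤ 1 requirement, not the system's exact implementation:

import random
from collections import defaultdict

def cross_partition(kb, k=5, seed=0):
    # Split each predicate's facts into K near-equal folds so that every
    # fold preserves the predicate's relative frequency (no predicate bias).
    rnd = random.Random(seed)
    by_pred = defaultdict(list)
    for s, p, o in kb:
        by_pred[p].append((s, p, o))
    folds = [set() for _ in range(k)]
    for facts in by_pred.values():
        rnd.shuffle(facts)
        for i, fact in enumerate(facts):   # round-robin: sizes differ by <= 1
            folds[i % k].add(fact)
    test = folds[rnd.randrange(k)]
    train = set().union(*(f for f in folds if f is not test))
    return train, test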

The cross partitioning step keeps the relative size of each predicate consistent in the training and testing sets. Otherwise, there is a predicate bias, which we define as a predicate having a higher or lower relative frequency in the training set than in the knowledge base and in the testing set, i.e.,

|Pi ∩ Train| / |Train| > |Pi| / |Γ| > |Pi ∩ Test| / |Test|,

or vice versa. The presence of predicate biases may lead to deviations in the support and confidence scores that reflect the cross partitioning mechanism rather than the rule quality. Step 1 excludes predicate biases by preserving the relative frequencies of all predicates.

In the evaluation step, we use the input knowledge base as the ground truth to verify the inferred facts from Test. Due to the open world assumption, this may underestimate the test precisions. Despite that, the reported precisions are a reliable estimate of the true test precisions, as they guarantee a lower bound for the test set. In the analysis step, we report precision against DOE instead of against the exact number of inferred facts because the inference step uses only one partition, i.e., 1/K of the input knowledge base. DOE provides a consistent unit to compare test sets of different sizes. Steps 1-5 form a general framework for evaluating learning and inference over web-scale knowledge bases. We report the experiment results on Freebase and YAGO2s in Section 5.3.

We evaluate the effect of the inference algorithm in terms of performance and quality. We show that the parallelization and partitioning techniques apply to the inference algorithm to achieve high efficiency and scalability, and we use cross validation to evaluate the correctness of the inferred facts.

Setup. We apply all the inference rules from the mining algorithm. We use the experiment setup in Table 4-3, with the exception of the functional constraints, as we assume that non-functional rules are pruned in the mining phase. We estimate the precision of a fact set by assessing facts with a minimum confidence of 0.6 inferred by rules supporting at least 2 facts, as we do in assessing inference rules.

Performance. As an overall runtime measurement, we apply the 36,625 inference rules to Freebase and derive 927M facts in 17.19 hours. Thus, the inference algorithm is 48% faster than learning. The speedup benefits from wrong rules, those without even a single supporting fact, having been eliminated in the learning process. To illustrate, while we have 463,631 candidate rules for mining, the mining algorithm outputs only 36,625 rules for inference; the rules without support are considered wrong and do not participate in the inference algorithm. Furthermore, correct rules tend to have good functionality properties: it is unlikely that two predicates join to produce exponential numbers of intermediate results, as we discuss in Section 4.2.3. Lastly, the inference algorithm does not need to generate counting statistics for each rule. Consequently, it achieves better performance than learning.

In our previous work [13], we develop a relational inference engine that models knowledge bases as relational tables and uses join-based algorithms to apply the inference rules in batches for efficiency. We show that this approach runs more than 200 times faster than the state-of-the-art, Tuffy [28]. However, the plain join queries do not scale to Freebase due to its size; the largest KB we had scaled to contained 10M facts. Using the parallelization and partitioning techniques we propose in this chapter, we have improved the scalability to 388M facts, achieving a new state-of-the-art for inference in large knowledge bases.

In Figure 5-3A, we run the inference algorithm using one type of rules with different max-size partitioning parameters, ranging from 200M to 5M. The result shows that partitioning improves the runtime from 14.67 hours down to 2.23 hours, a near-linear speedup of more than 6, consistent with the learning results reported in Figure 4-9. We also observe a similar improvement by varying the number of cores from 1 to 64. These results support the validity of the parallelism and partitioning techniques in inference as well as in learning. On the other hand, pruning does not offer additional help with inference, as non-functional rules are already eliminated in the mining phase.

Analyzing the inferred facts, we estimate the overall precision to be 0.96. In particular, the facts in range [0.6, 0.8) have a precision of 0.98 and the facts in range [0.8, 1.0] have a precision of 0.74. All the correct facts contribute new knowledge to Freebase. Meanwhile, the result


implies that the confidence score, based on the open world assumption, may underestimate high quality rules. For example, the rule

music/release/track(x, y) ← music/release/track list(x, z), music/release track/recording(z, y)

infers all correct facts, but has a confidence of 0.66 because "music/release/track" is incomplete. On the other hand, the confidence score may overestimate low quality rules with low support, but the overall precision remains high, since most facts are inferred by high-support rules, as we explain via the inference capability of the rules.

Figure 5-3. Inference performance. (A) Effect of partitioning for inference. (B) Inference capability of individual Freebase rules.

The inference capability of a rule r : H(x, y) ← B is related to its support and confidence scores. According to Equation (4–3), the total number of facts inferred by rule r is

|{H(x, y) | B}| = supp(r) / conf(r). (5–1)

The number of new facts inferred by r is

supp(r)/conf(r) − supp(r) = supp(r) · (1 − conf(r))/conf(r). (5–2)
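For concreteness, a small Python computation of the two quantities for a hypothetical rule with support 75 and confidence 0.75 (illustrative numbers only, chosen so the arithmetic is exact):

supp, conf = 75, 0.75
total_inferred = supp / conf            # (5-1): 100.0 facts in total
new_facts = supp * (1 - conf) / conf    # (5-2): 25.0 facts beyond the KB
print(total_inferred, new_facts)        # 100.0 25.0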

Equation (5–2) identifies rules with high inference capability: the number of new facts inferred by a rule. Such rules have high support and low confidence. In practice, as confidence implies rule quality, rules with both high support and reasonably high confidence infer most

103 (A) Cross Validation Result (B) Freebase Inference Categories 2.5 1e7 1.0

2.0 0.8

0.6 1.5

0.4 1.0 # of Inferred Facts

Aggregated Precision 0.2 Freebase precision 0.5 YAGO2s precision 0.0 0.0 0.2 0.4 0.6 0.8 1.0 film cvg Degree of Expansion book base music people award others biology commonlocation freebasebusiness medicineeducation olympics astronomyvisual_art organization government

(C) Examples of Freebase inferred facts:
(1) music/album/artist(Live Era '87–'93, Guns N' Roses)
(2) book/series editor/book edition series edited(Janet Morris, Heroes in Hell by Baen Books)
(3) film/film/production companies(Butt Spanking, Bacchus)
(4) user/anjackson/default domain/bitstream encoding/format(PDF 1.4, Portable Document Format)

Figure 5-4. Cross validation result and example inferred facts. (A) Precision of inferred facts. (B) Categories of Freebase inferred facts. (C) Examples of inferred facts.

This explains the higher estimated quality of facts (0.96) compared to rules (0.60): within the same range of confidence, high-support rules infer more facts than low-support rules. The inference capability of the 2,994 top rules (support ≥ 2, confidence ≥ 0.6) is reported in Figure 5-3B. As we see in the figure, a wide range of rules infer new facts; thus, the estimated 60% correct rules (Table 4-4) all contribute veritable knowledge. To illustrate, Rule (1) in Figure 4-6 has a support of 5964 and a confidence of 0.81. By Equation (5–2), it infers 1356 new correct facts beyond Freebase. As a special case, rules with confidence 1.00 infer no new facts. These rules, however, are useful when the knowledge base incrementally expands through either human input or information extraction.

Cross validation. Manual labeling of inferred facts is laborious and error-prone; moreover, it tends to focus on the facts with high confidence scores. We therefore use cross validation to systematically verify each inferred fact. In Figure 5-4A, we show the quality of inferred facts

from the YAGO2s and Freebase knowledge bases, ordered by the inferring rules according to their confidence scores. The initial size of the testing set is 825K for YAGO2s and 50.6M for Freebase. Using the rules from the training set, we have inferred 1.67M new facts for YAGO2s and 118M for Freebase, expanding the initial knowledge base by a factor of more than 2. The precision of facts inferred from Freebase approaches 1.0 for the top 30M facts, demonstrating the high accuracy of the Freebase rules. This benefits from its clean schema and data, as we observe in Section 4.3.1. In addition, by comparing them with the YAGO2s rules, we also see that Freebase rules achieve a higher recall, as the completeness and scope of the Freebase schema allow the mining algorithm to generate rules covering a wide range of topics. The correctly inferred facts span 71 different categories ("domains" in Freebase terminology), e.g., music, book, and film, with the top 20 displayed in Figure 5-4B. Figure 5-4C shows examples of these facts; their correctness can easily be verified on Wikipedia. The statistics show that most inferred facts fall in the music category (21.46M), while the second largest category, book, has only 1.38M. This is because the original Freebase has most of its tuples (236M) in the music category, followed by the film category with only 22M tuples. More interestingly, as Freebase is designed as a user-extendible knowledge base, users are allowed to add their own content in private spaces prefixed by "user/[username]/." As illustrated by Fact (4) in Figure 5-4C, the inference algorithm helps user "anjackson" discover new knowledge in his own knowledge base on encoding formats. Analyzing the errors, we observe that most of them are caused by erroneous rules. To illustrate, we have 28.22M facts inferred by rules with a confidence score greater than 0.5, 28.18M (99.89%) of which are correct. For the 90.07M facts inferred by rules with lower confidence scores, only 2.6M (2.92%) are correct. Thus, we see that high-quality rules generate most of the correct results; ambiguous entities and incorrect extractions have little impact on the quality of the result in a clean knowledge base.

By contrast, for machine-constructed knowledge bases, the major error sources also include ambiguous entities and incorrect extractions, as we report in the same study for the Reverb-Sherlock knowledge base [13]. In addition, we see the impact of the open world assumption on the confidence scores: the scores are low not because the rules are incorrect, but because the input knowledge base is incomplete. The actual quality of the rules may be much higher than their confidence scores suggest. The validity of the cross validation methodology for evaluating inference results can be justified by comparison with AMIE+ [22], where the authors perform a semi-automatic validation using YAGO2 for the inference task and YAGO2s for verification, combined with manual inspection using external sources like Wikipedia. They observe similar precisions for YAGO2 inferred facts: using YAGO2 with 948K base facts, AMIE+ makes 100K predictions (corresponding to a degree of expansion of 0.11) at a precision of 0.7, and 400K predictions (corresponding to a degree of expansion of 0.42) at a precision of 0.6. Their precision is higher than that evaluated by cross validation in Figure 5-4A due to the partial completeness assumption and the external information that fills in the knowledge missing under the open world assumption in modern knowledge bases. Overall, the experiments validate the effectiveness of our approach. We perform first-order mining on Freebase in 34 hours and contribute the first rule set with 36,625 inference rules. In particular, the relational knowledge base model facilitates efficient parallel joins, and the partitioning algorithm scales them up by breaking large knowledge bases into smaller independent datasets. Applying the 36,625 inference rules, we derive 927M new facts beyond Freebase in 17 hours. We use cross validation to verify the results, estimating high precision for the top 60% of inferred facts. The inference algorithm thus contributes large volumes of veritable knowledge to Freebase. Our experiments focus on Freebase to demonstrate scalability, but the approach is applicable to other knowledge bases, e.g., Wikidata [55], to which Freebase is migrating.

5.4 Summary

Based on the relational knowledge base model [13], we design an inference algorithm with the parallelization and partitioning optimizations we use in rule mining. We propose a cross validation method to evaluate the inferred facts. Applying the inference rules to Freebase, we derive 927 million facts in 17.19 hours. We estimate that the top facts, up to a degree of expansion of 0.6, have a precision approaching 1.0. Our approaches outperform state-of-the-art rule mining algorithms and inference engines in terms of both performance and quality. All our open-source code repositories and data are published online. Future research includes online learning with dynamic knowledge bases. Real-world knowledge bases, e.g., DeepDive, Freebase, and NELL, accept new input or user feedback and continuously update their contents. Learning with these knowledge bases requires updating the learning result according to the new contents. Obviously, re-running the learning algorithm for each update would be infeasible. Motivated by our experiments and recent work on incremental MCMC [52], we plan to explore efficient ways to perform online learning with expanding knowledge bases.

CHAPTER 6
QUERY PROCESSING WITH KNOWLEDGE ACTIVATION

Semantic networks are a popular way of simulating human memory in ACT-R-like cognitive architectures. However, existing implementations fall short in their ability to efficiently work with the very large networks required for full-scale simulations of human memory. In this chapter, we present SemMemDB, an in-database realization of semantic networks and spreading activation. We describe a relational representation for semantic networks and an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. The key benefits of our approach are: (1) Databases have mature query engines and optimizers that generate efficient query plans for memory activation and retrieval; (2) Databases can provide massive storage capacity to potentially support human-scale memories; (3) Spreading activation is implemented in SQL, a widely used query language for big data analytics. We evaluate SemMemDB in a comprehensive experimental study using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that our system runs over 500 times faster than previous works. SemMemDB is a module for efficient in-database computation of spreading activation over semantic networks. Semantic networks are broadly applicable to associative information retrieval tasks [33], though we are principally motivated by the popularity of semantic networks and spreading activation for simulating human memory in cognitive architectures, specifically ACT-R [34, 35]. Insofar as cognitive architectures aim toward codification of unified theories of cognition and full-scale simulation of artificial humans, they must ultimately support human-scale memories, which at present they do not. We are also motivated by the desire for a scalable, standalone cognitive model of human memory free from the architectural and theoretical commitments of a complete cognitive architecture. Our position is that human-scale associative memory is best achieved by leveraging the extensive investments and continuing advancements in structured databases and big data systems. For example, relational databases already provide effective means to manage

and query massive structured data, and their commonly supported operations, such as grouping and aggregation, are sufficient and well suited for efficient implementation of spreading activation. To defend this position, we extend the relational data model for semantic networks and describe an efficient SQL-based, in-database implementation of network activation (i.e., SemMemDB). The main benefits of SemMemDB and our in-database approach are: (1) It exploits query optimizers and execution engines that dynamically generate efficient execution plans for activation and retrieval queries, which is far better than manually implementing a particular fixed algorithm. (2) It uses database technology for both storage and computation, avoiding the complexity and communication overhead incurred by employing separate modules for storage and computation. (3) It implements spreading activation in SQL, a widely used query language for big data supported by various analytics frameworks, including traditional databases (e.g., PostgreSQL), massively parallel processing (MPP) databases (e.g., Greenplum [36]), and the MapReduce stack (e.g., Hive) [56, 57]. In summary, we make the following contributions:

• A relational model for semantic networks and an efficient, scalable SQL-based spreading activation algorithm.

• A comprehensive evaluation using DBPedia showing orders of magnitude speed-up over previous works.

In this chapter, we provide preliminaries explaining semantic networks and activation, discuss related work regarding semantic networks and ACT-R's associative memory system, describe the implementation of SemMemDB, and evaluate SemMemDB using DBPedia [1], a web-scale ontology constructed from the Wikipedia corpus. Our experiment results show several orders of magnitude of improvement in execution time in comparison to results reported in the related work.

6.1 Spreading Activation

Semantic memory refers to the subcomponent of human memory that is responsible for the acquisition, representation, and processing of conceptual information [58]. Various

representation models for semantic memory have been proposed; in this chapter, we use the semantic network model [59]. A semantic network consists of a set of nodes representing entities and a set of directed edges representing relationships between the entities. Figure 6-1A shows an example semantic network. It is constructed from a small fragment of DBPedia, an ontology extracted from Wikipedia. In this example, we show several scientists and their research topics. The edges in this network indicate how the scientists influence each other and their main interests. Processing in a semantic network takes the form of spreading activation [60]. Given a set of source (query) nodes Q with weights, the spreading activation algorithm retrieves the top-K most relevant nodes to Q. For example, to retrieve the most relevant nodes to "Francis Bacon," we set Q = {(Francis Bacon, 1.0)} as shown in Figure 6-1D. The algorithm returns {Aristotle, Plato, Cicero, John Locke} ranked by their activation scores as shown in Figure 6-1E. These activation scores measure relevance to the query node(s) and are explained shortly. Figure 6-1F shows another example query, a second iteration of the previous query formed by merging the original query with its result. Figure 6-1G shows the result of this second iteration. As shown in the above examples, the spreading activation algorithm assigns an activation score to each result node measuring its relevance to the query nodes in Q. The activation score A_i of a node i is related to the node's history and its associations with other nodes. It is defined as

    A_i = B_i + S_i.    (6–1)

The B_i and S_i terms are base-level activation and spreading activation, respectively. The base-level activation term reflects recency and frequency of use, while the spreading activation term reflects relevance to the current context or query. Formally, the base-level activation B_i is defined as

    B_i = \ln\left( \sum_{k=1}^{n} t_k^{-d} \right),    (6–2)

where t_k is the time since the k-th presentation of the node and d is a constant rate of activation decay. (In the ACT-R community, d = 0.5 is typical.) In the case of DBPedia, for example, t_k values might be derived from the retrieval times of Wikipedia pages. The resultant B_i value predicts the need to retrieve node i based on its presentation history.
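As a worked illustration (with hypothetical numbers, not taken from the experiments), suppose a node was presented at times 0, 1, and 3, the current time is T = 7, and d = 0.5. The times since presentation are then 7, 6, and 4, giving

    B_i = \ln\left( 7^{-0.5} + 6^{-0.5} + 4^{-0.5} \right) \approx \ln(0.378 + 0.408 + 0.500) \approx 0.252.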

The spreading activation S_i is defined as

    S_i = \sum_{j \in Q} W_j S_{ji}.    (6–3)

W_j is the weight of source (query) node j; if weights are not specified in a query, then a default value of 1/n is used, where n is the total number of source nodes. S_{ji} is the strength of association from node j to node i. It is set to S_{ji} = S − ln(fan_{ji}), where S is a constant parameter and

    \mathrm{fan}_{ji} = \frac{1 + \mathrm{outedges}_j}{\mathrm{edges}_{ji}}.

The values of outedges_j and edges_{ji} are the number of edges from node j and the number of edges from node j to node i, respectively.
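For instance, in a hypothetical configuration chosen purely for illustration, if node j has four outgoing edges in total and exactly one of them points to node i, then

    \mathrm{fan}_{ji} = \frac{1 + 4}{1} = 5, \qquad S_{ji} = S - \ln 5 \approx S - 1.609.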

Equations (6–2) and (6–3) are implemented using the native grouping and aggregation operations supported by relational database systems. Thus, in only a few lines of SQL, we are able to implement the relatively complex task of spreading activation. Moreover, database systems are able to generate very efficient query plans based on table statistics, query optimization, etc. We therefore believe it is better to rely on the database to dynamically select the most efficient algorithm rather than to manually develop a fixed one, as is done in previous works.

ACT-R's [61] original Lisp-based implementation of spreading activation does not scale to large associative memories. Researchers have thus investigated various ways to augment ACT-R's memory subsystem to achieve scalability. These investigations include outsourcing the storage of associative memories to database management systems [62] and concurrently computing activations using Erlang [63]. We observe that in [62], databases are used only as a storage medium; activation computations are performed serially outside of the database, which is unnecessarily inefficient and incurs the significant communication overhead of data transfer in and out of the database. In SemMemDB, we leverage the full computational power of databases by performing all activation calculations within the database itself, using a SQL-based implementation of the spreading activation algorithm. Semantic network spreading activation has also been explored using Hadoop and the MapReduce paradigm [64]. However, MapReduce-based solutions are batch-oriented and not generally appropriate for dealing with ad hoc queries. In terms of simulating an agent's memory (our principal motivating use case), queries against the semantic network are ad hoc and real-time, which are the types of queries better managed by relational database systems.

6.2 Using SemMemDB

We refer to Figure 6-1 to illustrate how users interact with the SemMemDB module. Specifically, a user defines a query table Q, a network table N, and a history table H. Having done so, activation and retrieval of the top-K nodes is initiated by a simple SQL query:

SELECT * FROM activate() ORDER BY A DESC LIMIT K;

The activate function is a database stored procedure. Its parameters are implicitly Q, N, and H, so the complete signature is activate(Q, N, H). The ORDER BY and LIMIT clauses are optional; they instruct the database to rank nodes by their activation scores and to return only the top-K results. The result is a table of at most K activated nodes with, and ranked by, their activation scores. Tables Q, N, and H are defined as follows:

• Table Q contains a tuple (i, w) for each query node i with numeric weight w.

• Table N contains a tuple (i, p, j) for each directed edge from node i to node j with predicate p.


Figure 6-1. SemMemDB usage with DBpedia knowledge base. (A) Semantic network fragment showing the relationships between scientists and their interests. Each node represents a DBPedia entity; each directed edge represents a relationship between the entities. (B) Database table N that stores the network depicted in (A). Each row (i, p, j) represents a directed edge between (i, j) with predicate p. We use abbreviations here for illustration (e.g., "FB" for "Francis Bacon"). (C) History table H recording the presentation history for each node. Zero means creation time. (D) and (F) are two example queries; (E) and (G) are the results of (D) and (F), respectively, ranked by activation scores. The precise definitions of the N, H, and Q tables are given in Section 6.2.

• Table H contains a tuple (i, t) for every node i presented at numeric time t. A node was 'presented' if it was created, queried, or returned as a query result at time t (users may choose other criteria). The specific measure of time (e.g., time-stamp, logical time) is defined by the user. H is typically updated automatically when a query returns, by inserting the query and result nodes with the current time.

Tables Q, N, and H are allowed to be views defined as queries over other tables, so long as they conform to the described schema. This allows users to specify more complex data models and queries using the same syntax. (A schematic sketch of these tables is given below.)
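To make the interface concrete, here is a minimal sketch of the three tables and a sample retrieval in PostgreSQL syntax; the column types and the example node identifier are illustrative assumptions, not part of the SemMemDB specification:

CREATE TABLE Q (i INT, w DOUBLE PRECISION);  -- query nodes and their weights
CREATE TABLE N (i INT, p TEXT, j INT);       -- directed edges (i, j) labeled with predicate p
CREATE TABLE H (i INT, t DOUBLE PRECISION);  -- node i was presented at time t

-- Issue a query like Figure 6-1D: activate from one source node with weight 1.0,
-- then retrieve the top-4 nodes ranked by activation score (the second column).
DELETE FROM Q;
INSERT INTO Q VALUES (42, 1.0);              -- 42: hypothetical integer id for "Francis Bacon"
SELECT * FROM activate() ORDER BY 2 DESC LIMIT 4;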

Example 6.1. For the semantic network shown in Figure 6-1A, the corresponding network table N is shown in Figure 6-1B. Figure 6-1C shows one possible history table H; node creation time is assumed to be 0. Figures 6-1D and 6-1F correspond to the queries for {"Francis Bacon"} and {"Francis Bacon," "Aristotle," "Plato," "Cicero," "John Locke"}, respectively, with the weights listed in the w columns. Finally, Figures 6-1E and 6-1G show the results of those queries.

6.2.1 Base-Level Activation Calculation

The base-level activation B_i defined by (6–2) corresponds to a grouping and summation operation. Assuming the current time is T, the following SQL query computes the base-level activations of all nodes:

SELECT i, log(SUM(power(T-t,-d))) AS b
FROM H
GROUP BY i;

In Figure 6-1, "Aristotle" was presented most recently at time 6, then "Plato" at time 3, while "Cicero" and "John Locke" have not been presented beyond their creation at time 0. In response to Q1, "Aristotle" is judged most relevant to "Francis Bacon" (see Figure 6-1E) despite the fact that all of these nodes have the same number of edges (viz., 1) connecting them to "Francis Bacon." This is because of the differences between their base-level activations.

6.2.2 Spreading Activation Calculation

The spreading activation S_i defined by (6–3) decomposes into two components: W_j, which is query dependent, and S_{ji}, which is network dependent but query independent. Since S_{ji} is query independent, an effective way to speed up calculation is to precompute the S_{ji} values in materialized views. These views store precomputed results in intermediate tables so that they are available during query execution. First, we compute the number of edges from each node i:

CREATE MATERIALIZED VIEW OutEdges AS
SELECT i, COUNT(*) AS l FROM N GROUP BY i;

Then, we compute the actual S_{ji} values:

CREATE MATERIALIZED VIEW Assoc AS
SELECT i, j, S-ln((1+OutEdges.l)/COUNT(*)) AS l
FROM N NATURAL JOIN OutEdges
GROUP BY (i, j, OutEdges.l);

Though a fair amount of computation happens here, we emphasize that it is done only once; thereafter, the resultant values are used by all queries against the semantic network.

Given the above definition of Assoc and a query Q, we compute the spreading activation S_i as follows:

SELECT j AS i, SUM(Q.w*Assoc.l) AS s
FROM Q NATURAL JOIN Assoc
GROUP BY j;

6.2.3 Activation Score Calculation

The activation score A_i defined by (6–1) is the sum of the base-level activation and the spreading activation. The complete SQL procedure for computing activation scores is given in Listing 3.

In the activate() procedure, we start by computing the S_i terms in the WITH Spreading AS clause. The result of this subquery (Spreading) is often small, so the history look-up can be optimized by joining H and Spreading in Line 11. In this way, only the relevant portion of the history is retrieved.

Listing 3. Activation Procedure
 1 CREATE OR REPLACE FUNCTION activate()
 2 RETURNS TABLE(node INT, s DOUBLE PRECISION) AS $$
 3 BEGIN
 4   RETURN QUERY
 5   WITH Spreading AS (
 6     SELECT Assoc.j AS i, SUM(Q.w*Assoc.l) AS s
 7     FROM Q NATURAL JOIN Assoc
 8     GROUP BY Assoc.j
 9   ), Base AS (
10     SELECT H.i AS i, log(SUM(power(T-t,-d))) AS b
11     FROM H NATURAL JOIN Spreading
12     GROUP BY H.i
13   )
14   SELECT Base.i AS i, Base.b+Spreading.s AS A
15   FROM Base NATURAL JOIN Spreading;
16 END;
17 $$ LANGUAGE plpgsql;

The final activation scores A_i are computed by joining Base and Spreading in Lines 14–15.

6.3 Evaluation

In this section, we evaluate the performance of SemMemDB using PostgreSQL 9.2, an open-source database management system. We run all experiments on a two-core machine with 8GB RAM running Ubuntu Linux 12.04.

6.3.1 Data Set

We use the English DBPedia¹ ontology as the data set for evaluation. The DBPedia ontology contains entities, object properties, and data properties. Entities and object properties correspond to nodes and edges in the semantic network. Data properties associate only with single nodes, so they do not affect spreading activation and are hereafter ignored. We generate a pseudo-history for every node in the semantic network under the assumption that 'past retrievals' follow a Poisson process, where the rate parameter of a node is determined by its number of incoming edges.²
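The generation step can be sketched in SQL. The following is an illustrative approximation only (the view names, the choice of two draws per node, and the exact rate formula are assumptions, not the procedure used in the experiments); it samples exponential inter-arrival gaps whose rate grows with in-degree and accumulates them into presentation times:

WITH InDeg AS (
  SELECT j AS i, COUNT(*) AS deg FROM N GROUP BY j
), Gaps AS (
  -- two exponential inter-arrival gaps per node; rate = in-degree
  -- (nodes with no incoming edges are skipped in this sketch)
  SELECT i, g, -ln(1 - random()) / deg AS gap
  FROM InDeg, generate_series(1, 2) AS g
)
INSERT INTO H (i, t)
SELECT i, SUM(gap) OVER (PARTITION BY i ORDER BY g) AS t
FROM Gaps;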

1 http://wiki.dbpedia.org/Downloads

The statistics of our DBPedia semantic network data set are listed in Table 6-1.

Table 6-1. DBPedia data set statistics.

  # nodes (entities)             3,933,174
  # edges (object properties)   13,842,295
  # histories (pseudo-data)      7,869,462

For comparison, Table 6-2 lists the statistics of the Moby Thesaurus II data set, which [63] used to evaluate their semantic network implementation.

Table 6-2. Moby Thesaurus II data set statistics.

  # nodes (root words)      30,260
  # edges (synonyms)     2,520,264

6.3.2 Performance Overview

In the first experiment, we run three queries against the entire DBPedia semantic network data set. The initial queries are listed in Table 6-3A; each contains three nodes. For each query, we execute three iterations, where each iteration's query is formed from the result of the previous iteration, starting with the initial query. We measure the execution time by executing each iteration ten times and taking the average. The execution times and result sizes are listed in Table 6-3B. Note that the history table is not modified during the experiments. All the queries complete within tens of milliseconds. We informally compare this result to those in [63]. [63] evaluate their semantic network implementation using the Moby Thesaurus II data set, which is only a tenth of the size of the DBPedia data set (see Table 6-2). Their average execution time is 10.9 seconds.

2 We assume that nodes with greater connectivity are retrieved more often, but this choice is arbitrary.

Table 6-3. Experiment 1 result. (A) Experiment 1 initial queries. (B) Avg. execution times and result sizes for queries by iteration.

(A)
  Node   Q1           Q2              Q3
  1      Aristotle    United States   Google
  2      Plato        Canada          Apple
  3      John Locke   Japan           Facebook

(B)
  Query                 Iter. 1   Iter. 2   Iter. 3
  Q1   time/ms          5.16      22.02     63.22
       result size      125       890       3981
  Q2   time/ms          1.77      5.69      14.60
       result size      23        121       477
  Q3   time/ms          2.58      6.15      13.61
       result size      36        132       381

Table 6-4. Experiment 2 semantic network sizes and avg. execution times for single iteration queries of 1000 nodes.

  Proportion     20%         25%         33%         50%         100%
  # nodes        2,607,952   2,846,850   3,130,706   2,012,183   3,933,174
  # edges        2,768,119   3,463,240   4,616,433   6,035,162   13,842,295
  # histories    1,988,400   2,345,531   2,903,576   3,906,121   7,869,462
  time/ms        35.05       38.47       42.52       45.02       57.63

This is more than 500 times slower than SemMemDB, using for comparison Q1, Iteration 2, at 22.02 ms, which has a larger fan than any query used in [63]. Though informal, this result illustrates the performance benefits offered by the in-database architecture of SemMemDB.

6.3.3 Effect of Semantic Network Sizes

In the second experiment, we evaluate the scalability of SemMemDB by executing queries against semantic networks of increasing size. These semantic networks are produced by using 20%, 25%, 33%, 50%, and 100% of the DBPedia data set. Query size is fixed at 1000 nodes, and queries are generated by taking random subsets of DBPedia entities.

Table 6-5. Experiment 3 avg. execution times and result sizes for single iteration queries of varying sizes.

  Query size     1      10     10^2    10^3    10^4
  time/ms        1.33   2.45   11.27   57.63   2922.51
  result size    4      30     321     2428    19,007

The execution times and network sizes are listed in Table 6-4. It is a perhaps surprising yet highly desirable result that execution time grows more slowly than network size. This scalability is due to the high selectivity of the join queries. Since the query size is much smaller than the network size, the database is able to efficiently select query-relevant tuples using indexes on the join columns (see Figure 6-2B, left plan). As a result, execution time is not much affected by the total size of the semantic network; only the retrieved sub-network and the index size matter.
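For example, such join-column indexes could be declared as follows; the index names are illustrative, and this assumes a PostgreSQL version that supports indexes on materialized views:

CREATE INDEX assoc_i_idx ON Assoc (i);   -- join column of the precomputed Assoc view
CREATE INDEX h_i_idx ON H (i);           -- speeds up the history look-up join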

6.3.4 Effect of Query Sizes

In the third experiment, we evaluate the scalability of SemMemDB with respect to query size. We execute queries ranging in size from 1 to 10^4 nodes against the entire DBPedia semantic network data set. The queries are generated by taking random subsets of DBPedia entities. The execution times, query sizes, and result sizes are listed in Table 6-5. The results indicate that execution time scales linearly with query size when the query size is small (≤ 10^3). Under these conditions, join selectivity is high and indexing accelerates query-relevant tuple retrieval. This effect is illustrated in the query plans shown in Figure 6-2B for the sample query shown in Figure 6-2A. When the query size is small, the index scan on Assoc retrieves only a small number of nodes, hence an index nested loop is chosen by the database query planner. When the query size is large (e.g., 10^4), a large portion of

Assoc needs to be accessed (the index is not of much help), so the database chooses a hash join for better performance. This dynamic selection of the 'best' algorithm exemplifies why we feel it is better to rely on the database to efficiently plan and execute activation queries rather than to manually implement a fixed algorithm.

The sample query of Figure 6-2A is:

SELECT j AS i, SUM(Q.w*Assoc.l) AS A
FROM Q NATURAL JOIN Assoc
GROUP BY j;

For a small Q (100 tuples), the plan is a HashAggregate (287) over an index nested loop join on Assoc.i = Q.i (332), fed by an index scan of Assoc (12,872,122) and a sequential scan of Q (100). For a large Q (100,000 tuples), the plan is a HashAggregate (130,981) over a hash join on Assoc.i = Q.i (327,064), fed by sequential scans of Assoc (12,872,122) and Q (100,000).

Figure 6-2. SemMemDB query plans. (A) Sample query. (B) Query plans for (A) given that Q is small (left) or large (right). Numbers in parentheses indicate table sizes.

6.4 Summary

In this chapter, we introduced SemMemDB, a module for efficient in-database spreading activation over semantic networks. We presented its relational data model and its scalable spreading activation algorithm. SemMemDB is applicable in many areas, including cognitive architectures and information retrieval; our presentation, however, was tailored to those seeking a scalable, standalone cognitive model of human memory. We evaluated SemMemDB

on the English DBPedia data set, a web-scale ontology constructed from the Wikipedia corpus. The experiment results show a more than 500-fold performance improvement over previous implementations for ad hoc spreading activation retrieval queries over semantic networks. What we have reported here is an early-stage development toward a scalable simulation of human memory. Planned future work includes the use of MPP databases and support for more complex queries and models as described in [65].

CHAPTER 7
RELATED WORK

Recent research in knowledge base construction has resulted in large web knowledge bases. These works employ a number of techniques to improve the coverage and quality of the knowledge bases. In Freebase [3] and DBpedia [1], quality is maintained by collaborative human construction. In machine-constructed knowledge bases, additional approaches are developed: NELL [6] employs a set of coupling constraints to prevent "semantic drifts" in constructing the knowledge base [66]; OpenIE [67] integrates internal components [20, 68] to mine functional constraints from the extracted facts; ProBase [10] mines a database of instance-class pairs to provide a taxonomy for web entities; YAGO [15, 41] extracts knowledge from high-quality sources enhanced with temporal and spatial information; universal schemas [69] handle uncertainty and incompleteness in KB schemas; knowledge graph identification [70] identifies useful knowledge from raw extractions; ProbKB [13] applies functional constraints to detect contradictions and ambiguous entities for knowledge expansion.

Markov logic networks. Markov logic networks [29] are the state-of-the-art framework for working with uncertain facts and rules. MLNs have been successfully applied in a variety of applications, including information extraction [71], textual inference [42], entity resolution [72], etc. MLNs can be viewed as templates to generate ground factor graphs (Markov networks) [73–75]. Hence, general probabilistic graphical model inference algorithms apply [73, 76–78], as do specialized MLN inference algorithms [49, 79–81]. There are works on MLN structure learning [82–86], but few of them achieve web scale.

Mining Horn clauses. Mining Horn clauses [45] was first studied in the Inductive Logic

Programming (ILP) literature [46, 47, 87]. Recently, Sherlock [24] and AMIE [37] have extended it to mining first-order inference rules from knowledge bases by defining new metrics that address the open world assumption made by the knowledge bases. AMIE achieves state-of-the-art efficiency using an in-memory database to support the projection queries for counting. Sherlock and AMIE have mined 30,912 and 1090 inference rules from 250K

(OpenIE [8, 9]) and 948K (YAGO2 [41]) distinct facts, respectively. Still, none of these approaches scales to the size of Freebase. To solve this scalability problem, we adopt the mining model of AMIE, described in Section 2.2, and scale it up using a series of parallelization and optimization techniques.

Parallel computing. In recent years, various data processing frameworks have been developed to facilitate large-scale data analytics, including in-database analytics [28, 88–92], MapReduce [56], FlumeJava [93], Spark [94, 95], GraphLab [44, 96], Datapath [54, 97], etc. These systems are effective for a variety of data mining and machine learning problems. Given no previous work that scales rule mining algorithms to the web, we are motivated to leverage state-of-the-art parallel computing techniques. The parallel mining algorithm we propose combines the relational model [13] and the MapReduce programming paradigm [56], consisting of a sequence of parallel operations on the KB tables. We implement and evaluate it on Spark because of its efficient pipelining of parallel operations using distributed caches.

Functional constraints. Constraints have proven helpful in a wide range of knowledge

base construction and expansion problems. NELL employs a set of coupling constraints to prevent "semantic drifts" in constructing the knowledge base [66]. A particularly useful

class of constraints is functional constraints. The N-FOIL algorithm [98] for NELL uses functional constraints to provide negative examples for ILP learners. [19, 20] mine functional constraints from automatically constructed knowledge bases. [13] applies functional constraints to detect contradictions and ambiguous entities for the knowledge expansion problem. [42] leverages functionality properties of predicates to scale textual inference to the web. These works, which speed up knowledge and textual inference tasks by leveraging functionality properties of predicates, inspire us to apply such properties to the mining problem by extending the notion of functionality to Horn clauses in order to prune erroneous rules.

Mining association rules. Since the first introduction of association rule mining in [99], researchers have developed a number of improvements [100–103]. In particular, the Direct Hashing and Pruning algorithm [101] partitions itemsets into buckets and prunes buckets

that violate the minimum support constraint. The Partition Algorithm [102] partitions the transactions database and processes one partition at a time. These partitioning approaches depend on the assumption that transactions contribute independently to the itemset counts in a transactions database. In a first-order knowledge base, the facts are interconnected by the rules and arguments; this dependency would be lost if we directly partitioned the knowledge base in these state-of-the-art ways. In our approach, we preserve data dependency by relaxing the non-overlapping requirement and designing a new algorithm that partitions the knowledge base into independent but possibly overlapping partitions.

Mining frequent subgraphs. The rule mining problem is closely related to mining frequent subgraphs [104–107], where the knowledge base is represented as a single directed graph. The major difference is the notion of a subgraph: in the first-order rule mining problem, a subgraph is a graph pattern where the edges are labeled but the vertices are variables. This leads to an important consequence we address in this dissertation: a more sophisticated counting algorithm is needed to count the frequencies of grounded subgraphs [29], namely, subgraphs with variables substituted by constants. The grounding problem of parameterized subgraphs is not addressed in the frequent subgraph mining literature.

Quality control. Quality control has been a core research challenge in various data management systems. In databases, functional dependencies (FDs) and conditional functional dependencies (CFDs) [108, 109] ensure the integrity of the data. These approaches have recently been applied to graphs with graph functional dependencies [110]. In uncertain knowledge bases, an effective methodology for improving quality is to collect redundant data and evidence. For example, multiple information extractors have been used in IE systems for cross verification [5, 111]. Involving human knowledge in KB construction also proves effective: in NELL [6], the system collects human feedback to improve its internal extractors; similarly, crowdsourcing is applied to provide missing information [112] and verify uncertain data [113]. In addition, there has been increasing research on Question Answering (QA) systems [114] and on using QA systems to improve the recall of KBs [21]. However, the recent work

mainly focuses on improving extraction techniques; the problem of improving the quality of existing datasets remains challenging. We propose to study this problem by extending and combining techniques of using constraints, selective data collection, integrating data streams, knowledge fusion, and query-driven information extraction.

Efficient join processing. Multi-way joins are known to have tight bounds on their output size [115]. Recent algorithms [116–118] construct multi-way join plans that achieve worst-case optimal runtime with regard to the size bound by joining multiple relations at a time, avoiding the computation of intermediate results required by binary join plans. The bound is further improved by using functional dependencies [119, 120]. In a parallel environment, multi-way join algorithms are optimized by a data shuffling algorithm, HyperCube [121–124], where a single communication round distributes tuples to every reducer server that needs them for the local multi-way join. An empirical study of these algorithms [121] has proven their effectiveness on large datasets. The first-order mining problem differs from general multi-way join processing in several aspects. First, evaluating inference rules requires processing both the intermediate and final results, whereas general multi-way join performance is bounded only by the final output size. Second, while [116, 118] and our algorithm all use degree information to improve query plans, we leverage the semantics of inference rules to estimate data quality and restrict the size of each intermediate result; [116, 118] use degree information to construct the query plan, and the information has no implication for data quality. Third, we use a novel partitioning algorithm to bound the input size of each join. [116] partitions input relations to bucket light and heavy join tuples, but bounding the input sizes of joins by partitioning has not been explored before.

Knowledge expansion. The problem of knowledge expansion is to discover new knowledge from existing knowledge bases [13]. Typical ways to construct and expand knowledge bases include human collaboration [1, 3], information extraction [2, 7, 8, 10, 125], knowledge

fusion and integration [5, 25, 26, 111], inference and reasoning [13, 28, 42, 86, 87, 98], or combinations of multiple approaches [6, 7]. Compared with other approaches, rule-based models are more explainable (people apprehend a rule by inspecting it), persistable (rules can be saved and shared as normal text files), and expressive (rules can express inference, constraints, and algorithms), and they support efficient incremental maintenance of knowledge bases [52]. The state-of-the-art rule-based inference engines [13, 28, 126] attain efficiency by modeling knowledge bases as relational tables and applying the inference rules using database queries. However, due to the size of the input knowledge bases and the rules, the join queries scale only to knowledge bases with at most 10M facts. With the parallelization and partitioning techniques, our inference algorithm scales to Freebase with 388M facts and 112M entities.

Cross validation. To evaluate the quality of expanded knowledge, a general and effective method is cross validation. In [98, 125], the authors use cross validation to evaluate facts inferred by the Path Ranking Algorithm. [86] uses cross validation to evaluate an online Bayesian logic program [127] learning algorithm for extracted knowledge. AMIE+ [22] learns association rules from an input knowledge base. It uses a similar approach for factual evaluation: learning inference rules from an older and smaller version of a knowledge base (e.g., YAGO2) and validating on a newer version (e.g., YAGO2s), combined with external sources (e.g., Wikipedia). However, this approach does not generalize to knowledge bases without significantly different versions. To compare Freebase rules with AMIE+, we apply the more general cross validation methodology and show that it achieves results comparable to those reported by AMIE+.

CHAPTER 8
CONCLUSION AND FUTURE WORK

In this dissertation, I present the knowledge expansion and ontological pathfinding algorithms to expand web-scale knowledge bases by inferring implicit knowledge using first-order inference. These algorithms form the core components of a probabilistic knowledge base system, ProbKB. We make the following contributions to achieve efficiency, scalability, and quality:

Knowledge Expansion. We design a novel relational model for probabilistic knowledge bases. This model allows an efficient SQL-based inference algorithm for knowledge expansion that applies inference rules in batches. We optimize relational knowledge bases on massively parallel processing databases to achieve further scalability. We combine several quality control methods that identify erroneous rules, facts, and ambiguous entities to improve the precision of inferred facts. Our experiments show that the ProbKB system outperforms the state-of-the-art inference engine in terms of both performance and quality.

Ontological Pathfinding. We design the ontological pathfinding algorithm that scales to web-scale knowledge bases via a series of parallelization and optimization techniques: a relational knowledge base model to apply inference rules in batches, a new rule mining algorithm that parallelizes the join queries, a novel partitioning algorithm to break the mining tasks into smaller independent sub-tasks, and a pruning strategy to eliminate unsound and resource-consuming rules before applying them. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 hours; no existing approach achieves this scale.

Spreading Activation. We design the SemMemDB system for spreading activation query processing over semantic and knowledge networks. We use the relational model for semantic networks and present an efficient SQL-based spreading activation algorithm. We provide a simple interface for users to invoke retrieval queries. SemMemDB leverages mature query

engines and optimizers from databases that generate efficient query plans for memory activation and retrieval. With the massive storage capacity supported by modern database systems, SemMemDB can potentially support human-scale memories. We evaluate SemMemDB using DBPedia, a web-scale ontology constructed from the Wikipedia corpus. The results show that SemMemDB runs more than 500 times faster than prior works.

While our prior works focus on scaling up learning and inference over large static knowledge bases, we propose to design efficient algorithms that perform those tasks on dynamic and streaming knowledge bases, guided by the intuition that the learned models should be maintained incrementally. For instance, in DeepDive [2, 52], developers iteratively construct knowledge bases by adding facts and rules to an existing model, and the system performs efficient MLN grounding [28] and MCMC sampling by re-using past results and appropriately focusing computation on the parts of the knowledge base that are most relevant to the updates. The key to efficiency is to avoid re-computation over the entire dataset for each update. In the context of dynamic large-scale knowledge bases, it is essential to continuously maintain the models instead of spending hours re-training them from scratch for individual additions of facts or rules. Motivated by these results, we propose online deductive and inductive reasoning methods to perform soft inference over uncertain knowledge bases. We further extend the incremental inference techniques to belief revision, retraction, and contraction operations in belief bases [128].

In the process of learning and inference, errors come from incorrect rules, incorrect facts, ambiguous entities, and propagated errors [13]. We use semantic constraints to detect these errors while new facts are inferred and gathered by IE. For example, a country has one capital city; a person has one full-time job at any given time. Violations of these constraints indicate errors. In addition, we adopt our work [113] that uses mutual information and token entropy to select the most uncertain data from a probabilistic database and post them to Amazon Mechanical Turk for verification. In the context of dynamic knowledge bases, we propose human-guided data cleaning by selecting an optimal set of nodes in the knowledge graph

for manual verification: using the theory of Value of Information in probabilistic graphical models [129], we select those facts giving the largest return with respect to the remaining state of the knowledge graph.

In summary, we propose to develop off-line and on-line learning, incremental inference, and optimized selective data collection models and algorithms over dynamically changing and uncertain knowledge graphs. The learning algorithms over knowledge graphs would generate a set of first-order rules (e.g., logical inference) and constraints (e.g., functional dependencies). The inference algorithms would reason with uncertain and correlated facts to generate new facts. The value of information algorithms would rank uncertain facts to be validated through additional data collection, such as Amazon Mechanical Turk or further information extraction, based on some optimization function. The research objective of this proposal is to (1) learn statistical first-order rules over large-scale uncertain knowledge graphs; (2) update and maintain existing models in the context of streaming and dynamic knowledge bases; (3) perform incremental reasoning and inference over dynamic knowledge graphs; and (4) use properties of submodular functions to efficiently compute and optimize the value of information over dynamic uncertain KB graphs.

8.1 Inductive Reasoning

We propose to extend the ontological pathfinding algorithm to mine semantic constraints. Constraints are an effective tool in database systems to ensure data validity [51]. In knowledge bases, we use a similar concept called semantic constraints to ensure the validity of data. These constraints are derived from the semantics of the extracted relations, e.g., a person was born in only one country; a country has only one capital city. Conceptually, semantic constraints are hard rules that must be satisfied by all possible worlds. Violations, if any, indicate potential errors. One useful form of constraints is functional constraints [20, 42, 68]. They help detect errors from propagation and incorrect rules, and can be used to detect ambiguous entities that invalidate equality checks in join queries. In [13], functional constraints improve the precision of inferred facts by 0.6. The constraints are mined from the TextRunner dataset

[20, 68] with 7.5 million extractions from web text corpora. We propose to scale up the mining algorithm with the parallelization and partitioning algorithms that facilitate inference rule mining over Freebase [11, 12]. Another special form of semantic constraints is conditional constraints, similar to conditional functional dependencies [108] in database systems. For example, while in general a person may have any number of email accounts, every UF employee has only one "@ufl.edu" email address; in this case, the functional constraint is conditioned on the value of the email address. As another example, every US citizen has an SSN. These conditional constraints correspond to the following first-order constraints:

    emailAddress(x, z1), emailAddress(x, z2), LIKE(z1, "%@ufl.edu"), LIKE(z2, "%@ufl.edu") → z1 = z2,

    isCitizenOf(x, "USA") → hasSSN(x, y).
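As a sketch of how such a conditional constraint could be checked over a relational encoding, assume a hypothetical emailAddress(person, address) table (the table and column names are illustrative, not part of our system):

-- Persons violating the "one @ufl.edu address" conditional constraint.
SELECT DISTINCT e1.person
FROM emailAddress e1
JOIN emailAddress e2
  ON e1.person = e2.person AND e1.address < e2.address
WHERE e1.address LIKE '%@ufl.edu'
  AND e2.address LIKE '%@ufl.edu';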

The major difference between conditional constraints and general constraints is that the conditions consider individual values. We propose to extend previous work on discovering conditional functional dependencies [130] to the context of knowledge bases to support conditional constraints.

8.2 Online Inductive Reasoning

We explore efficient methods for online learning over dynamic probabilistic knowledge bases that incrementally expand over time. For example, the NELL system [6] runs continuously to extract information from the web; DeepDive [2] employs an "engineering-in-the-loop" development cycle to iteratively add new rules and data to its contents. New facts and rules are incorporated into the knowledge bases, potentially updating any existing models trained from them. These dynamic knowledge bases pose a significant challenge to state-of-the-art learning methods if we re-train the models for each update. Instead, researchers are exploring new ways to incrementally maintain statistical models over changing data [52, 131, 132]. These algorithms improve efficiency by appropriately focusing computation on the updated part of the data instead of re-building the models over the entire dataset for each update.

Dynamic streams of data have also been used in data mining. [133] incrementally trains an MLN classifier on data streams by discretizing them into "data chunks"; for each chunk, it incrementally trains the MLN using the weights from the previous chunk, selectively applying the training algorithm based on the data distribution. It builds on MOA (Massive Online Analysis), an open framework for real-time analytics over data streams [134]. Extensions to SPARQL have been proposed to query such datasets: C-SPARQL [135] answers queries using several new techniques, including window definitions over data streams and continuous query answering by registering and storing the queries, and it supports combining multiple streams. The paper evaluates on datasets from social networks and oil production. C-SPARQL has been used to perform deductive and inductive reasoning for semantic social media analytics over streams, by using C-SPARQL queries to express "stream reasoning" algorithms [136].

8.3 Incremental Deductive Reasoning

As new information is gathered by dynamic knowledge base systems, we need to keep track of time-sensitive or even inconsistent information. For example, for the simple query "What positions did Barack Obama hold?", Wikidata provides the information shown in Table 8-1. As we see in this example, Barack Obama holds each position for only a period of time.

Table 8-1. PositionsHeld(Barack Obama, *) triples in Wikidata.

  Position Held            Start Time        End Time
  President of the USA     20 January 2009   NA
  United States Senator    3 January 2005    16 November 2008
  Illinois State Senator   8 January 1997    4 November 2004

The need to manage dynamic knowledge is also significant in aggregated sentiment and belief knowledge bases [137]. In Table 8-2, for example, each fact is "believed by" an entity with a confidence value estimated from the sources of extraction. As new information arrives, we need to efficiently maintain the facts, the "believed by" relationships, and the corresponding confidence values.

Table 8-2. Aggregated knowledge base of beliefs.

  Entity       Is believed to          Believed by   Confidence
  Sen. Smith   Cut defense spending    30% voters    medium
  Bin Laden    Hide in Pakistan        CIA           high

Motivated by these temporal data, we propose to extend the data model to support attributes including temporal and spatial information. We plan to use these attributes to (1) model temporal and spatial information, and (2) enforce semantic constraints that ensure the integrity of the knowledge base. The constraints are either specified by the user or mined from the web.

8.4 Abductive Reasoning

Abductive reasoning involves deciding what is the most likely inference that can be made from a set of observations. Beginning with an incomplete set of observations, abductive reasoning decides the likeliest possible explanation for the set. In medical diagnosis, for example, given a set of symptoms, we need to decide which diagnosis would best explain most of them. Abduction is useful in probabilistic databases [138], where we maintain lineage information for the database tuples. Lineage contains possible explanations for each query tuple; selecting the best explanation is a form of abductive reasoning in probabilistic query processing. We use ground factor graphs to store belief lineage that can be used for abductive reasoning. As the final result of grounding, ground factor graphs serve as an intermediate representation that can be input to probabilistic inference engines, e.g., [43, 44]. Moreover, since a ground factor graph records the causal relationships among facts, it contains the entire lineage and can be queried [138].

8.5 Knowledge Verification

The application of inference rules has the potential to introduce errors into the KB. As stated previously, semantic and conditional constraints can be learned to reduce certain errors

and improve accuracy. Other types of errors, such as extraction errors or sparse non-conflicting inference errors, can be corrected by judicious application of a small amount of human feedback. In the factor graph representation of a probabilistic knowledge base, individual facts are modeled as nodes in a graph. Given this construction, human-guided data cleaning reduces to the problem of selecting an optimal set of nodes in the graph for manual verification. This human-machine hybridization is designed to raise inference precision and control the propagation of errors in knowledge graph inference at near-optimal cost. Using the theory of Value of Information in probabilistic graphical models [129], we can select those facts whose knowledge gives the largest return with respect to the remaining state of the knowledge graph. Seminal work has already been done in deriving optimal or near-optimal algorithms and performance guarantees for both chain and general graphical models.

8.6 Summary

In summary, we propose to build a probabilistic knowledge graph system with a statistical relational learning and inference engine to manage constantly evolving knowledge. Statistical relational learning combines statistical methods, such as probabilistic graphical models that capture uncertainty, with first-order logic that captures the relational properties of a domain. We propose the following research aims as the building blocks of the system: statistical learning and inference over large-scale knowledge bases, online learning over dynamic knowledge graphs, incremental reasoning and inference over dynamic knowledge graphs, and quality control using semantic constraints, information theory, and human feedback. In our prior work, we develop the first learning and inference system over the largest public knowledge base, Freebase, using a series of parallelization and partitioning techniques, and mine 36,625 inference rules and 927 million new facts. Observing that modern knowledge bases evolve every day, capturing new information or user feedback from the web as exemplified by NELL and DeepDive, and motivated by existing works on incremental view maintenance (DRed) and incremental MCMC (DeepDive), we propose to extend our system to support dynamic knowledge bases via online learning,

incremental reasoning, and optimized data collection, so as to efficiently maintain previously trained mining models and to avoid computation over the entire knowledge graph for incremental updates to the dynamic knowledge graph.

REFERENCES

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives, DBpedia: A nucleus for a web of open data. Springer, 2007.

[2] C. Zhang, "Deepdive: A data management system for automatic knowledge base construction," Ph.D. dissertation, UW-Madison, 2015.

[3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 2008.

[4] Google Official Blog, "Introducing the knowledge graph: things, not strings," http://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html.

[5] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang, "Knowledge vault: A web-scale approach to probabilistic knowledge fusion," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014.

[6] T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, "Never-ending learning," in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI-15), 2015.

[7] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka Jr, and T. M. Mitchell, "Toward an architecture for never-ending language learning," in AAAI, 2010.

[8] O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, "Open information extraction: The second generation," in IJCAI, 2011.

[9] M. Banko, M. J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni, "Open information extraction for the web," in IJCAI, 2007.

[10] W. Wu, H. Li, H. Wang, and K. Q. Zhu, "Probase: A probabilistic taxonomy for text understanding," in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012.

[11] Y. Chen, D. Z. Wang, and S. Goldberg, "Scalekb: scalable learning and inference over large knowledge bases," The VLDB Journal, vol. 25, no. 6, pp. 893–918, 2016.

[12] Y. Chen, S. Goldberg, D. Z. Wang, and S. S. Johri, "Ontological pathfinding: Mining first-order knowledge from large knowledge bases," in Proceedings of the 2016 International Conference on Management of Data. ACM, 2016, pp. 835–846.

[13] Y. Chen and D. Z. Wang, “Knowledge expansion over probabilistic knowledge bases,” in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. ACM, 2014, pp. 649–660.
[14] D. Z. Wang, Y. Chen, S. Goldberg, C. Grant, and K. Li, “Automatic knowledge base construction using probabilistic extraction, deductive reasoning, and human feedback,” in Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, 2012, pp. 106–110.
[15] F. Mahdisoltani, J. Biega, and F. Suchanek, “Yago3: A knowledge base from multilingual Wikipedias,” in 7th Biennial Conference on Innovative Data Systems Research. CIDR Conference, 2014.
[16] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007.
[17] T. Berners-Lee, J. Hendler, O. Lassila et al., “The semantic web,” Scientific American, 2001.
[18] O. Etzioni, “Search needs a shake-up,” Nature, 2011.
[19] T. Lin, O. Etzioni et al., “Identifying functional relations in web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.
[20] A. Ritter, D. Downey, S. Soderland, and O. Etzioni, “It’s a contradiction—no, it’s not: a case study using functional relations,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
[21] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin, “Knowledge base completion via search-based question answering,” in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014.
[22] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek, “Fast rule mining in ontological knowledge bases with AMIE+,” The VLDB Journal, 2015.
[23] Y. Peng, X. Zhou, D. Z. Wang, and C. V. Fang, “Scalable image retrieval with multimodal fusion,” in FLAIRS Conference, 2015.
[24] S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis, “Learning first-order horn clauses from web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010.
[25] D. Wijaya, P. P. Talukdar, and T. Mitchell, “Pidgin: ontology alignment using web text as interlingua,” in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013, pp. 589–598.

[26] F. M. Suchanek, S. Abiteboul, and P. Senellart, “Paris: Probabilistic alignment of relations, instances, and schema,” Proceedings of the VLDB Endowment, 2011.
[27] X. Zhou, Y. Chen, and D. Z. Wang, “ArchimedesOne: Query processing over probabilistic knowledge bases,” Proceedings of the VLDB Endowment, vol. 9, no. 13, 2016.
[28] F. Niu, C. Ré, A. Doan, and J. Shavlik, “Tuffy: Scaling up statistical inference in markov logic networks using an rdbms,” Proceedings of the VLDB Endowment, 2011.
[29] M. Richardson and P. Domingos, “Markov logic networks,” Machine Learning, 2006.
[30] S. Kok, M. Sumner, M. Richardson, P. Singla, H. Poon, and P. Domingos, “The Alchemy system for statistical relational AI,” Department of Computer Science and Engineering, University of Washington, Seattle, WA, Tech. Rep., 2006.
[31] T. N. Huynh, “Discriminative learning with markov logic networks,” DTIC Document, Tech. Rep., 2009.
[32] S. Kok, “Structure learning in markov logic networks,” Ph.D. dissertation, University of Washington, 2010.
[33] F. Crestani, “Application of spreading activation techniques in information retrieval,” Artificial Intelligence Review, vol. 11, no. 6, pp. 453–482, 1997.
[34] J. R. Anderson, D. Bothell, M. D. Byrne, S. Douglass, C. Lebiere, and Y. Qin, “An integrated theory of the mind,” Psychological Review, vol. 111, no. 4, p. 1036, 2004.
[35] J. R. Anderson, How can the human mind occur in the physical universe? Oxford University Press, 2007.
[36] EMC, “Greenplum database: Critical mass innovation,” EMC, Tech. Rep., 2010.
[37] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek, “AMIE: association rule mining under incomplete evidence in ontological knowledge bases,” in Proceedings of the 22nd International Conference on World Wide Web, 2013.
[38] S. Muggleton, “Inductive logic programming: derivations, successes and shortcomings,” ACM SIGART Bulletin, 1994.
[39] B. Tausend, “Representing biases for inductive logic programming,” in Machine Learning: ECML-94. Springer, 1994.
[40] J. Biega, E. Kuzey, and F. M. Suchanek, “Inside yago2s: A transparent information extraction architecture,” in Proceedings of the 22nd International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 2013.
[41] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “Yago2: A spatially and temporally enhanced knowledge base from wikipedia,” Artificial Intelligence, 2013.

[42] S. Schoenmackers, O. Etzioni, and D. S. Weld, “Scaling textual inference to the web,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
[43] C. Zhang and C. Ré, “Towards high-throughput gibbs sampling at scale: A study across storage managers,” in Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, 2013.
[44] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, “Graphlab: A new framework for parallel machine learning,” in Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI 10), 2010.
[45] A. Horn, “On sentences which are true of direct unions of algebras,” The Journal of Symbolic Logic, 1951.
[46] J. R. Quinlan, “Learning logical definitions from relations,” Machine Learning, 1990.
[47] S. Muggleton, “Inverse entailment and progol,” New Generation Computing, 1995.
[48] J. Widom, “Trio: A system for integrated management of data, accuracy, and lineage,” in CIDR, 2005.
[49] P. Singla and P. Domingos, “Memory-efficient inference in relational domains,” in AAAI, 2006.
[50] S. S. Lightstone, T. J. Teorey, and T. Nadeau, Physical Database Design: the database professional’s guide to exploiting indexes, views, storage, and more. Morgan Kaufmann, 2010.
[51] J. D. Ullman, H. Garcia-Molina, and J. Widom, Database Systems: The Complete Book. Prentice Hall, Upper Saddle River, 2001.
[52] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré, “Incremental knowledge base construction using deepdive,” Proceedings of the VLDB Endowment, 2015.
[53] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Powergraph: Distributed graph-parallel computation on natural graphs,” in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), 2012.
[54] S. Arumugam, A. Dobra, C. M. Jermaine, N. Pansare, and L. Perez, “The datapath system: a data-centric analytic processing engine for large data warehouses,” in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data. ACM, 2010.
[55] D. Vrandečić and M. Krötzsch, “Wikidata: a free collaborative knowledgebase,” Communications of the ACM, vol. 57, no. 10, pp. 78–85, 2014.

[56] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, 2008.
[57] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy, “Hive: a warehousing solution over a map-reduce framework,” Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1626–1629, 2009.
[58] D. Saumier and H. Chertkow, “Semantic Memory,” Current Neurology and Neuroscience Reports, vol. 2, no. 6, pp. 516–522, 2002.
[59] J. F. Sowa, “Semantic Networks,” in Encyclopedia of Cognitive Science. John Wiley & Sons, Ltd, 2006.
[60] A. M. Collins and E. F. Loftus, “A spreading-activation theory of semantic processing,” Psychological Review, vol. 82, no. 6, p. 407, 1975.
[61] J. R. Anderson, M. Matessa, and C. Lebiere, “ACT-R: A theory of higher level cognition and its relation to visual attention,” Human-Computer Interaction, vol. 12, no. 4, pp. 439–462, 1997.
[62] S. Douglass, J. Ball, and S. Rodgers, “Large declarative memories in ACT-R,” in Proceedings of the 9th International Conference on Cognitive Modeling, Manchester, United Kingdom, 2009.
[63] S. A. Douglass and C. W. Myers, “Concurrent knowledge activation calculation in large declarative memories,” in Proceedings of the 10th International Conference on Cognitive Modeling, 2010, pp. 55–60.
[64] J. G. Lorenzo, J. E. L. Gayo, and J. M. Á. Rodríguez, “Applying MapReduce to Spreading Activation Algorithm on Large RDF Graphs,” in Information Systems, E-learning, and Knowledge Management Research. Springer, 2013, pp. 601–611.
[65] D. Bothell, ACT-R 6.0 Reference Manual, 2007.
[66] A. Carlson, J. Betteridge, R. C. Wang, E. R. Hruschka Jr., and T. M. Mitchell, “Coupled semi-supervised learning for information extraction,” in Proceedings of the Third ACM International Conference on Web Search and Data Mining. ACM, 2010, pp. 101–110.
[67] O. Etzioni, M. Banko, S. Soderland, and D. S. Weld, “Open information extraction from the web,” Commun. ACM, vol. 51, pp. 68–74, 2008.
[68] T. Lin, O. Etzioni et al., “Identifying functional relations in web text,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 1266–1276.
[69] S. Riedel, L. Yao, A. McCallum, and B. M. Marlin, “Relation extraction with matrix factorization and universal schemas,” in Proceedings of NAACL-HLT, 2013.

[70] J. Pujara, H. Miao, L. Getoor, and W. Cohen, “Knowledge graph identification,” in International Semantic Web Conference. Springer, 2013, pp. 542–557.
[71] H. Poon and P. Domingos, “Joint inference in information extraction,” in AAAI, 2007.
[72] P. Singla and P. Domingos, “Entity resolution with markov logic,” in Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06). IEEE, 2006.
[73] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, “Factor graphs and the sum-product algorithm,” IEEE Transactions on Information Theory, 2001.
[74] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[75] Y. Chen and D. Z. Wang, “Web-scale knowledge inference using markov logic networks,” in ICML Workshop on Structured Learning: Inferring Graphs from Structured and Unstructured Inputs, 2013.
[76] M. Wick, A. McCallum, and G. Miklau, “Scalable probabilistic databases with factor graphs and mcmc,” Proceedings of the VLDB Endowment, 2010.
[77] J. Gonzalez, Y. Low, A. Gretton, and C. Guestrin, “Parallel gibbs sampling: From colored fields to thin junction trees,” in International Conference on Artificial Intelligence and Statistics, 2011.
[78] M. L. Wick and A. McCallum, “Query-aware mcmc,” in Advances in Neural Information Processing Systems, 2011.
[79] H. Poon and P. Domingos, “Sound and efficient inference with probabilistic and deterministic dependencies,” in AAAI, 2006.
[80] P. Singla and P. M. Domingos, “Lifted first-order belief propagation,” in AAAI, 2008.
[81] V. Gogate and P. Domingos, “Probabilistic theorem proving,” in UAI. Corvallis, Oregon: AUAI Press, 2011.
[82] S. Kok and P. Domingos, “Learning markov logic network structure via hypergraph lifting,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
[83] ——, “Learning markov logic networks using structural motifs,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[84] J. Van Haaren and J. Davis, “Markov network structure learning: A randomized feature generation approach,” in Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.

[85] T. N. Huynh and R. J. Mooney, “Discriminative structure and parameter learning for markov logic networks,” in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008.
[86] S. Raghavan and R. J. Mooney, “Online inference-rule learning from natural-language extractions,” in AAAI Workshop: Statistical Relational Artificial Intelligence, 2013.
[87] B. L. Richards and R. J. Mooney, “Learning relations by pathfinding,” in Proc. of AAAI-92, 1992.
[88] J. M. Hellerstein, C. Ré, F. Schoppmann, D. Z. Wang, E. Fratkin, A. Gorajek, K. S. Ng, C. Welton, X. Feng, K. Li et al., “The madlib analytics library: or mad skills, the sql,” Proceedings of the VLDB Endowment, 2012.
[89] D. Z. Wang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick, “Hybrid in-database inference for declarative information extraction,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011.
[90] K. Li, D. Z. Wang, A. Dobra, and C. Dudley, “Uda-gist: an in-database framework to unify data-parallel and state-parallel analytics,” Proceedings of the VLDB Endowment, 2015.
[91] D. Z. Wang, Y. Chen, C. E. Grant, and K. Li, “Efficient in-database analytics with graphical models,” IEEE Data Eng. Bull., 2014.
[92] Y. Chen, M. Petrovic, and M. Clark, “Semmemdb: In-database knowledge activation,” in FLAIRS Conference, 2014.
[93] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum, “Flumejava: easy, efficient data-parallel pipelines,” in ACM SIGPLAN Notices. ACM, 2010.
[94] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, 2010.
[95] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012.
[96] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, “Distributed graphlab: a framework for machine learning and data mining in the cloud,” Proceedings of the VLDB Endowment, 2012.
[97] Y. Cheng, C. Qin, and F. Rusu, “Glade: big data analytics made easy,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, 2012, pp. 697–700.

[98] N. Lao, T. Mitchell, and W. W. Cohen, “Random walk inference and learning in a large scale knowledge base,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011.
[99] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules between sets of items in large databases,” ACM SIGMOD Record, 1993.
[100] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association rules,” VLDB Endowment, 1994.
[101] J. S. Park, M.-S. Chen, and P. S. Yu, “An effective hash-based algorithm for mining association rules,” SIGMOD Record, 1995.
[102] A. Savasere, E. Omiecinski, and S. B. Navathe, “An efficient algorithm for mining association rules in large databases,” VLDB Endowment, 1995.
[103] J. Han and J. Pei, “Mining frequent patterns by pattern-growth: methodology and implications,” ACM SIGKDD Explorations Newsletter, 2000.
[104] M. Elseidy, E. Abdelhamid, S. Skiadopoulos, and P. Kalnis, “Grami: Frequent subgraph and pattern mining in a single large graph,” Proceedings of the VLDB Endowment, 2014.
[105] L. Zou, L. Chen, and M. T. Özsu, “Distance-join: Pattern match query in a large graph database,” Proceedings of the VLDB Endowment, 2009.
[106] M. Kuramochi and G. Karypis, “Finding frequent patterns in a large sparse graph,” Data Mining and Knowledge Discovery, 2005.
[107] ——, “Frequent subgraph discovery,” in Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM). IEEE, 2001.
[108] P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis, “Conditional functional dependencies for data cleaning,” in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2007.
[109] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma, “Improving data quality: Consistency and accuracy,” in Proceedings of the 33rd International Conference on Very Large Data Bases. VLDB Endowment, 2007, pp. 315–326.
[110] W. Fan, Y. Wu, and J. Xu, “Functional dependencies for graphs,” in Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
[111] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, K. Murphy, S. Sun, and W. Zhang, “From data fusion to knowledge fusion,” Proceedings of the VLDB Endowment, 2014.
[112] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “Crowddb: answering queries with crowdsourcing,” in Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 2011, pp. 61–72.

[113] S. L. Goldberg, D. Z. Wang, and T. Kraska, “Castle: Crowd-assisted system for text labeling and extraction,” in First AAAI Conference on Human Computation and Crowdsourcing, 2013.
[114] M. Paşca, “Open-domain question answering from large text collections,” Computational Linguistics, vol. 29, no. 4, pp. 665–667, 2003.
[115] A. Atserias, M. Grohe, and D. Marx, “Size bounds and query plans for relational joins,” in Proceedings of the 49th Annual IEEE Symposium on Foundations of Computer Science (FOCS’08). IEEE, 2008, pp. 739–748.
[116] M. Joglekar and C. Ré, “It’s all a matter of degree: Using degree information to optimize multiway joins,” in Proceedings of the International Conference on Database Theory (ICDT), 2016.
[117] T. L. Veldhuizen, “Leapfrog triejoin: A simple, worst-case optimal join algorithm,” in Proceedings of the International Conference on Database Theory (ICDT), 2014.
[118] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, “Worst-case optimal join algorithms (extended abstract),” in Proceedings of the 31st Symposium on Principles of Database Systems. ACM, 2012.
[119] M. A. Khamis, H. Q. Ngo, and D. Suciu, “Computing join queries with functional dependencies,” in Proceedings of the Symposium on Principles of Database Systems, 2016.
[120] G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant, “Size and treewidth bounds for conjunctive queries,” Journal of the ACM (JACM), 2012.
[121] S. Chu, M. Balazinska, and D. Suciu, “From theory to practice: Efficient join query evaluation in a parallel database system,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[122] P. Beame, P. Koutris, and D. Suciu, “Skew in parallel query processing,” in Proceedings of the 33rd Symposium on Principles of Database Systems. ACM, 2014.
[123] ——, “Communication steps for parallel query processing,” in Proceedings of the 32nd Symposium on Principles of Database Systems. ACM, 2013.
[124] F. N. Afrati and J. D. Ullman, “Optimizing joins in a map-reduce environment,” in Proceedings of the 13th International Conference on Extending Database Technology. ACM, 2010.
[125] N. Lao, A. Subramanya, F. Pereira, and W. W. Cohen, “Reading the web with learned syntactic-semantic inference rules,” in Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.

[126] F. Niu, C. Zhang, C. Ré, and J. Shavlik, “Scaling inference for markov logic with a task-decomposition approach,” arXiv preprint arXiv:1108.0294, 2011.
[127] K. Kersting and L. De Raedt, “Bayesian logic programming: Theory and tool,” Statistical Relational Learning, p. 291, 2007.
[128] J. Van Benthem, “Dynamic logic for belief revision,” Journal of Applied Non-Classical Logics, vol. 17, no. 2, pp. 129–155, 2007.
[129] A. Krause and C. Guestrin, “Optimal value of information in graphical models,” Journal of Artificial Intelligence Research, vol. 35, pp. 557–591, 2009.
[130] W. Fan, F. Geerts, J. Li, and M. Xiong, “Discovering conditional functional dependencies,” IEEE Transactions on Knowledge and Data Engineering, 2011.
[131] M. L. Koc and C. Ré, “Incrementally maintaining classification using an rdbms,” Proceedings of the VLDB Endowment, 2011.
[132] A. Nath and P. M. Domingos, “Efficient belief propagation for utility maximization and repeated inference,” in AAAI, 2010.
[133] S. Chandra, J. Sahs, L. Khan, B. Thuraisingham, and C. Aggarwal, “Stream mining using statistical relational learning,” in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 743–748.
[134] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “Moa: Massive online analysis,” Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.
[135] D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus, “C-sparql: a continuous query language for rdf data streams,” International Journal of Semantic Computing, vol. 4, no. 1, pp. 3–25, 2010.
[136] D. Barbieri, D. Braga, S. Ceri, E. Della Valle, Y. Huang, V. Tresp, A. Rettinger, and H. Wermser, “Deductive and inductive stream reasoning for semantic social media analytics,” IEEE Intelligent Systems, vol. 25, no. 6, pp. 32–41, 2010.
[137] Y. Wilks, M. Clark, A. Dalton, I. Perera et al., “Cubism: Belief, anomaly and social constructs,” Interaction Studies, 2014.
[138] O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom, “Uldbs: Databases with uncertainty and lineage,” in Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 2006, pp. 953–964.

BIOGRAPHICAL SKETCH
Yang Chen received his Ph.D. degree from the University of Florida in December 2016. During his Ph.D. study, he worked with Dr. Daisy Zhe Wang in the Data Science Research (DSR) Lab, Department of Computer and Information Science and Engineering, University of Florida. His research focused on knowledge bases, databases, data mining, and scalable algorithms. Yang designed scalable inductive and deductive reasoning algorithms to mine rules and facts from web-scale knowledge bases, e.g., Freebase. His work on knowledge expansion and ontological pathfinding achieved the state of the art in first-order rule mining and formed the key components of a probabilistic knowledge base system,

ProbKB, published in the 2014 and 2016 SIGMOD conferences and the VLDB Journal. Yang served as a program committee member of WWW 2017, a reviewer for TKDE in 2014, and an external reviewer for the VLDB Journal, VLDB, CIDR, and IJCAI. Before coming to the University of Florida, Yang received his Bachelor of Engineering degree in 2011 from the University of Science and Technology of China. In the summer of 2013, Yang worked as a research intern with Dr. Micah H. Clark at the Florida Institute for Human and Machine Cognition. In the summers of 2014 and 2015, Yang worked as a software engineer intern at Google, with Dr. Sergey Melnik on the Spanner team and with Dr. Xiangyang Lan on the Mesa team. Yang will start working at Google in January 2017.
