Exploiting Prior Knowledge and Latent Variable Representations for the Statistical Modeling and Probabilistic Querying of Large Knowledge Graphs
Denis Krompaß

München 2015

Dissertation at the Faculty of Mathematics, Computer Science and Statistics of the Ludwig-Maximilians-Universität München

submitted by Denis Krompaß from Passau

München, 24.09.2015

First examiner: Prof. Dr. Volker Tresp
Second examiner: Prof. Dr. Steffen Staab
Date of the oral examination: 20.11.2015

Statutory Declaration
(See the doctoral degree regulations (Promotionsordnung) of 12.07.11, §8, para. 2, item 5.)

I hereby declare in lieu of an oath that I prepared this dissertation independently and without unauthorized assistance.

Denis Krompaß

Acknowledgements

During the last years, many people contributed to the successful completion of my PhD. First of all, I want to thank my supervisor at Siemens and the Ludwig Maximilian University of Munich, Prof. Dr. Volker Tresp. To me, Volker is not just a supervisor but a mentor who positively influenced my research and my personal development with his experience and guidance. He was always open to discussions on various topics and ideas, and his valuable input often helped me to steer my research on these topics in the right direction. I further thank Dr. Matthias Schubert, with whom I took my first real steps in machine learning during my master's thesis. He also recommended Volker as a supervisor for my PhD and introduced me to him. I also want to thank Michal Skubacz, Research Group Head of Knowledge Modeling and Retrieval, who funded my PhD, conference trips and the rental of cloud services, and gave me the opportunity to continue my career in his group at Siemens.
I am also very grateful to Prof. Steffen Staab for acting as second examiner of my thesis. Special thanks go to Maximilian Nickel, whose research inspired my own work to a large extent. I also had the pleasure of working with Xueyan Jiang, Cristóbal Esteban, Stephan Baier, Yinchong Yang, Sebnem Rusitschka and Sigurd Spieckermann, and I thank them for a great working atmosphere and the sometimes long and constructive discussions. In this regard, special thanks to Sigurd Spieckermann for in-depth discussions on various topics related to machine learning and for introducing me to Theano. Fortunately, we will have the opportunity to continue working together in the near future as colleagues at Siemens Corporate Technology in the research group Knowledge Modeling and Retrieval.

I especially thank my family for supporting me over all these years: my brother Daniel, who introduced me to mathematics at the age of 5; my parents, Bertin and Marietta, who have always believed in and supported my career efforts, even in very difficult times; and Oma and OnkelP, to whom I could always turn for any matter or for short and very relaxing vacations in the "family headquarter". My deep gratitude goes to Steffi, who has always supported me with her outstanding language and cooking skills. I also thank her for introducing me to the many great things I was not aware of and for definitely broadening my horizon in many respects. Finally, I want to thank "den Jungs" for what I am sure they are aware of. (If not, I will be happy to explain myself at the next annual meeting.)

Contents

1 Introduction
  1.1 Learning in the Semantic Web
  1.2 Learning in Knowledge Graphs
  1.3 Contributions of this Work
2 Knowledge Graphs
  2.1 Knowledge-Graphs are RDF-Triplestores
    2.1.1 RDF-Triple Structure
    2.1.2 Schema Concepts
    2.1.3 Knowledge Retrieval in Knowledge Graphs
  2.2 Knowledge Graph Construction
    2.2.1 Curated Approaches
    2.2.2 Collaborative Approaches
    2.2.3 Automated Approaches on Semi-Structured Textual Data
    2.2.4 Automated Approaches on Unstructured Textual Data
  2.3 Popular Knowledge Graphs
    2.3.1 DBpedia
    2.3.2 Freebase
    2.3.3 YAGO
  2.4 Deficiencies in Today's Knowledge Graph Data
3 Representation Learning in Knowledge Graphs
  3.1 Representation Learning
  3.2 Relational Learning
  3.3 Statistical Modeling of Knowledge Graphs with Latent Variable Models
    3.3.1 Notation
    3.3.2 RESCAL
    3.3.3 Translational Embeddings
    3.3.4 Google Knowledge-Vault Neural-Network
4 Applying Latent Variable Models to Large Knowledge Graphs
  4.1 Latent Variable Model Complexity in Large Knowledge Graphs
  4.2 Simulating Large Scale Conditions
    4.2.1 Data Sets
    4.2.2 Evaluation Procedure
    4.2.3 Implementation and Model Training Details
  4.3 Experimental Results
    4.3.1 Link-Prediction Quality – TransE has Leading Performance
    4.3.2 Optimization Time – RESCAL is Superior to Other Methods
  4.4 Related Work
  4.5 Conclusion
5 Exploiting Prior Knowledge On Relation-Type Semantics
  5.1 Type-Constrained Alternating Least-Squares for RESCAL
    5.1.1 Additional Notation
    5.1.2 Integrating Type-Constraints into RESCAL
    5.1.3 Relation to Other Factorizations
    5.1.4 Testing the Integration of Type-Constraints in RESCAL
    5.1.5 Conclusion
  5.2 Type-Constrained Stochastic Gradient Descent
    5.2.1 Type-Constrained Triple Corruption in SGD
  5.3 A Local Closed-World Assumption for Modeling Knowledge Graphs
    5.3.1 Entity Grouping for RESCAL under a Local Closed-World Assumption
    5.3.2 Link-Prediction in DBpedia with RESCAL
    5.3.3 Conclusion
  5.4 Experiments – Prior Knowledge on Relation-Types is Important for Latent Variable Models
    5.4.1 Type-Constraints are Essential
    5.4.2 Local Closed-World Assumption – Simple but Powerful
  5.5 Related Work
  5.6 Conclusion
6 Ensemble Solutions for Representation Learning in Knowledge Graphs
  6.1 Studying Complementary Effects between TransE, RESCAL and mwNN
  6.2 Experimental Setup
  6.3 Experiments – TransE, RESCAL and mwNN Learn Complementary Aspects in Knowledge Graphs
    6.3.1 Type-Constrained Ensembles
    6.3.2 Ensembles under a Local Closed-World Assumption
  6.4 Related Work
  6.5 Conclusion
7 Querying Statistically Modeled Knowledge Graphs
  7.1 Exploiting Uncertainty in Knowledge Graphs
    7.1.1 Notation
    7.1.2 Probabilistic Databases
    7.1.3 Querying in Probabilistic Databases
  7.2 Exploiting Latent Variable Models for Querying
    7.2.1 Learning Compound Relations with RESCAL
    7.2.2 Learning Compound Relations with TransE and mwNN
    7.2.3 Numerical Advantage of Learned Compound Relation-Types
  7.3 Evaluating the Learned Compound Relations
    7.3.1 Experimental Setup
    7.3.2 Compound Relations are of Good Quality
  7.4 Querying Factorized DBpedia-Music
    7.4.1 DBpedia-Music
    7.4.2 Experimental Setup
    7.4.3 Queries Used for Evaluation
    7.4.4 Learned Compound Relations Improve Quality of Answers
    7.4.5 Learned Compound Relations Decrease Query Evaluation Time
  7.5 Related Work
  7.6 Conclusion
8 Conclusion
  8.1 Summary
  8.2 Future Directions and Applications

List of Figures

2.1 Result of Google Query "Angela Kasner"
2.2 Comparison of smart assistants
2.3 Illustration of ontology described in Table 2.3
2.4 Graph representation of facts from Table 2.4
2.5 Illustration of automatic knowledge graph completion based on unstructured textual data
2.6 Distribution of facts per entity in DBpedia on a log scale.
5.1 Schematic of factorizing knowledge-graph data with RESCAL with and without prior knowledge on type-constraints.
5.2 Exploiting type-constraints in RESCAL: Results on the Cora and DBpedia-Music data sets
5.3 Illustration of how the local closed-world assumption
5.4 Exploiting the local closed-world assumption in RESCAL: Results on the DBpedia data sets
7.1 Illustration of a knowledge graph before and after transforming and filtering the extracted triples based on uncertainty
7.2 Evaluation of learned compound relations: AUPRC results on the Nations data set
7.3 Evaluation of learned compound relations: AUPRC results on the UMLS data set
7.4 Query evaluation times in seconds for queries Q1(x), Q2(x) and Q3(x)

List of Tables

2.1 Abbreviations for URIs
2.2 Example of defining type-constraints with RDFS
2.3 Sample triples from the DBpedia ontology
2.4 Example triples from DBpedia on Billy Gibbons and Nickelback
4.1 Parameter complexity of latent variable models used in this work
4.2 Details on Freebase-150k, DBpedia-Music and YAGOc-195k data sets
4.3 AUPRC and AUROC results of RESCAL, TransE and mwNN on the Freebase-150k data set
4.4 AUPRC and AUROC results of RESCAL, TransE and mwNN on the DBpedia-Music data set
4.5 AUPRC and AUROC results of RESCAL, TransE and mwNN on the YAGOc-195k data set
5.1 Details on Cora and DBpedia-Music data sets used in the experiments.
5.2 Comparison of AUPRC and AUROC result for RESCAL with and without exploiting prior knowledge on relations types