ENRICHING KNOWLEDGE GRAPHS USING MACHINE LEARNING TECHNIQUES

A Dissertation in Computer Science and Telecommunications and Computer Networking

Presented to the Faculty of the University of Missouri–Kansas City in partial fulfillment of the requirements for the degree

DOCTOR OF PHILOSOPHY

by
MOHAMED GHARIBI
M.S., University of Missouri–Kansas City, USA, 2017

Kansas City, Missouri
2020

© 2020
MOHAMED GHARIBI
ALL RIGHTS RESERVED

ENRICHING KNOWLEDGE GRAPHS USING MACHINE LEARNING TECHNIQUES

Mohamed Gharibi, Candidate for the Doctor of Philosophy Degree

University of Missouri–Kansas City, 2020

ABSTRACT

A knowledge graph represents millions of facts and reliable pieces of information about people, places, and things. Knowledge graphs have proven their reliability and usefulness for providing better search results, answering ambiguous questions regarding entities, and training semantic parsers to enhance the semantic relationships over the Semantic Web. However, while there exists a plethora of datasets on the Internet related to

Food, Energy, and Water (FEW), there is a real lack of reliable methods and tools that can consume these resources. This hinders the development of novel decision-making applications utilizing knowledge graphs. In this dissertation, we introduce a novel tool, called FoodKG, that enriches FEW knowledge graphs using advanced machine learning techniques. Our overarching goal is to improve decision-making and knowledge discovery, and to provide improved search results for data scientists in the FEW domains. Given an

input knowledge graph (constructed from raw FEW datasets), FoodKG enriches it with semantically related triples, relations, and images based on the original dataset terms and classes. FoodKG employs an existing graph embedding technique trained on a controlled vocabulary called AGROVOC, which is published by the Food and Agriculture Organization of the United Nations. AGROVOC includes terms and classes in the agriculture and food domains. As a result, FoodKG can enhance knowledge graphs with semantic similarity scores and relations between different classes, classify the existing entities, and allow FEW experts and researchers to use scientific terms for describing FEW concepts. The resulting model obtained after training on AGROVOC was evaluated against state-of-the-art word embedding and knowledge graph embedding models that were trained on the same dataset. We observed that this model outperformed its competitors based on the Spearman Correlation Coefficient score.

We introduced Federated Learning (FL) techniques to further extend our work and include private datasets, by training smaller versions of the model at each dataset site without accessing the data and then aggregating all the models at the server side. We propose an algorithm, called RefinedFed, that further extends the current FL work by filtering the models at each dataset site before the aggregation phase. Our algorithm improves the current FL model accuracy from 84% to 91% on the MNIST dataset.

APPROVAL PAGE

The faculty listed below, appointed by the Dean of Graduate Studies, have examined a dissertation titled "Enriching Knowledge Graphs Using Machine Learning Techniques," presented by Mohamed Gharibi, candidate for the Doctor of Philosophy degree, and hereby certify that in their opinion it is worthy of acceptance.

Supervisory Committee

Praveen Rao, Ph.D., Committee Chair
Department of Computer Science & Electrical Engineering

Sejun Song, Ph.D., Co-Discipline Advisor
Department of Computer Science & Electrical Engineering

Yugyung Lee, Ph.D.
Department of Computer Science & Electrical Engineering

Ahmed Hassan, Ph.D.
Department of Computer Science & Electrical Engineering

Zhu Li, Ph.D.
Department of Computer Science & Electrical Engineering

CONTENTS

ABSTRACT
ILLUSTRATIONS
TABLES
LISTINGS
ALGORITHMS
ACKNOWLEDGEMENTS

Chapter

1 INTRODUCTION
  1.1 Overview
2 BACKGROUND
  2.1 Text Vectorization
  2.2 Embedding Models
3 RELATED WORK
  3.1 Converting to RDF Model
  3.2 Enriching a Dataset with Extra Triples Based on the Existing Ones
  3.3 Machine Learning and Embedding Models
  3.4 Federated Learning
4 APPROACH
  4.1 Overview
  4.2 Converting a Table into RDF Triples
  4.3 Enriching RDF Data Triples
  4.4 Choosing the Target Triples
  4.5 Running Entities on ConceptNet
  4.6 Levels of Searching Tree
  4.7 Adding the Extra Triples
  4.8 Architecture
5 IMPLEMENTATION
  5.1 Implementation
  5.2 FoodKG Implementation
6 EVALUATION
  6.1 Work Load
  6.2 Results
  6.3 FoodKG
  6.4 Federated Learning
  6.5 Data Availability Statement
7 CONCLUSION AND FUTURE WORK

REFERENCE LIST

VITA

ILLUSTRATIONS

Figure

1  RDF Model
2  RDF model for the second book
3  Federated Learning Architecture
4  Google embedding example
5  Embedding distance example
6  CBOW architecture
7  The generated weights
8  Multiplication to generate the embedding vector
9  Multiplication to generate the embedding vector
10 Multiplication to generate the embedding vector
11 GloVe co-occurrence matrix for "the dog ran after the man"
12 FastText architecture for a sentence with N-gram features
13 Zachary's Karate Club visualized using DeepWalk and GEMSEC embeddings
14 Comparison between SML and R2RML
15 Any23, list of extractors
16 The required template to convert Table 3 to RDF model
17 BioDSL
18 DBpedia semantic query
19 DBpedia results for running "Lemon"
20 Dandelion semantic results for comparing "Lemon" and "Lime"
21 Dandelion results for comparing two phrases
22 Results of ParallelDots for comparing two phrases
23 Results of ParallelDots when comparing two terms
24 WordNet hierarchy for nouns and verbs
25 Results returned from WordNet
26 WordNet results when running a general term
27 WordNet result when running a phrase containing a dictionary-based term
28 ConceptNet example for a relationship
29 FEW ontology while using the Karma tool
30 First level of the searching tree
31 Third level of the searching tree
32 The searching tree for the term "Flower"
33 The searching tree for the term "Fire"
34 FoodKG system architecture
35 RefinedFed architecture. A local testing dataset will be added to each client to test the model before the collecting phase. Models that pass a certain accuracy threshold will be collected by the server; otherwise, the model will be dropped
36 FoodKG input - Food example
37 FoodKG output - Food example
38 FoodKG input - Water example
39 FoodKG output - Water example
40 FoodKG input - Energy example
41 FoodKG output - Energy example
42 Spearman correlation coefficient ranking scores compared against ConceptNet
43 Spearman correlation coefficient scores
44 AGROVEC embeddings visualization using t-SNE
45 HolE embeddings visualization using t-SNE
46 GloVe embeddings visualization using t-SNE
47 Word2vec embeddings visualization using t-SNE
48 FastText embeddings visualization using t-SNE
49 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 5 clients
50 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 10 clients
51 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 20 clients
52 CIFAR-10 dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 5 clients

TABLES

Table

1  Example of a book database
2  Table representation form
3  A single row from an employees' database
4  Example of a dealer database
5  Example of a dealer database
6  Example of a dealer database
7  Example of a dealer database
8  An example of how each model ranks the objects when the subject is "wheat". AGROVEC ranks the semantic similarity scores accurately from closest to furthest from the subject
9  Top 5 related words for the concept "Foods"
10 Top 5 related words for the concept "Energy"
11 Top 5 related words for the concept "Water"
12 A few examples of the most used concepts in the FEW domain that do not appear in global embeddings
13 Time needed for different RDF datasets
14 FEW systems input experiments
15 An input example (Ingredients)
16 Accuracy of FL throughout 10 epochs with and without RefinedFed
17 Different graph embedding techniques with their Spearman Correlation score
18 The default hyper-parameters for the retrained models

LISTINGS

Listing

1  RDF quad example
2  JSON representation form
3  Combined JSON representation
4  Data after extraction
5  R2RML query
6  D2RQ-ML syntax [97]
7  Data after extraction
8  Data after extraction
9  Data after extraction
10 Output triples
11 Blank nodes
12 Blank nodes
13 Blank nodes
14 Data after extraction

ALGORITHMS

1 Text Classification using AGROVEC and ConceptNet
2 FederatedTree: an extended algorithm for FederatedAveraging

ACKNOWLEDGEMENTS

This work would have never seen the sunlight without the help and support of many people. Thank you all!

I would like to express my gratitude to my father, who supports me all the time and who taught me to be who I am today. I am also grateful to my mother, who always motivates me, and to my brother, who taught me a lot and was always there for me. I appreciate having you in my life! My sincere thanks go to my grandparents, who believe in me!

I would like to express my sincere gratitude to my advisor, Dr. Praveen Rao, for his unlimited guidance, support, and motivation. Thank you for teaching me how to be a better student and a better person. I appreciate all the time and the efforts you invested in me. It is an honor to complete my studies under your supervision.

Prof. Ghulam Chaudhry, thank you for your great support and for all the opportu- nities you offered me.

Dr. Song, Dr. Lee, Dr. Ahmed, and Dr. Zhu, thank you for your continuous support and guidance. I appreciate your time in teaching me.

My manager at IBM, Srini Bhagavan, is a great leader and a great supporter from whom I learned a lot. Thank you!

My lab friends, thank you for your help and for all the nice times we spent together.

My sincere thanks go to the School of Computing and Engineering faculty and the School of Graduate Studies staff for all the opportunities and the research grants. Thank you all.

We would like to acknowledge the partial support of NSF Grant No. 1747751.

CHAPTER 1

INTRODUCTION

In this chapter, we briefly introduce the area of research, the research problem that we address in that area, the objectives of the work, the tools and web services that have been used, and my contributions.

1.1 Overview

Winding back the clock 20 years, anyone could hardly believe they would own a mobile phone, much less a laptop; nowadays, most cars have more powerful microprocessors than the ones used in the space vehicles that carried men to the moon [9]. This huge jump in technology creates new lifestyles and changes the way we communicate in many different aspects. It even modifies the priorities in our way of thinking, transforming the agricultural revolution into the industrial revolution and resulting in a huge information revolution. Nowadays, many things can be accomplished via technology, including online meetings, online degrees, online jobs, and social communication. Furthermore, entertainment and communicating with friends and family can be done online through social networking websites. This significant information revolution generates a huge amount of data every day, called Big Data (BD) [13]. The Big Data concept refers to complex and large volumes of both structured and unstructured data for which traditional data processing software is inadequate [57].

Big Data Science (BDS) is the science of managing, storing, analyzing, and retrieving huge amounts of data. One of the challenges for BDS is that data on the Internet does not follow a particular format. Different social media websites use different ways to store and manipulate online data [43]. For instance, YouTube stated that 400 hours' worth of videos are uploaded per minute and that one billion hours of content are watched on YouTube daily [91]. YouTube stores these hours of videos in a structured format, whereas Facebook, which has more users than China's population, stores its data in graphs [15]. These different formats create new challenges for users who want to analyze and process such data. The essential part of BDS is enabling users to analyze and process big data in different formats. Structured data, also known as Relational Databases (RDB), includes tables, spreadsheets, and databases that use the Structured Query Language (SQL) for processing. Although SQL is a common and powerful language, there are still many challenges in joining structured and unstructured data such as texts, videos, images, emails, and audio files.

Fortunately, there is a universal data model that is considered a solution to all the aforementioned challenges. The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) data model. RDF represents data in three parts, a subject, a predicate, and an object, which together are known as an RDF triple (Figure 1). A new value can be added to describe the context of the triple, which turns the triple into an RDF quad, as in Listing 1 [47, 48].

Figure 1: RDF Model

RDF triples represent semantic information and facts between entities and concepts for both humans and computers [90]. Subjects within the RDF data model are provided with a Universal Resource Identifier (URI) to present unique information and facts. This allows both humans and computers to trace back the origin of a word, related terms, and in what context it was mentioned [71]. Listing 1 illustrates a quad model after engaging the URIs.

Listing 1: RDF quad example

Furthermore, one of the most important uses of the RDF model is joining and merging data from different formats. Without the RDF model, it can be complex to merge two different databases, and the level of complexity increases with the number of databases. With the RDF model, the process begins by converting the tables to RDF and then joining the resulting triples. The advantage is that joining RDF data works for different amounts of data in various formats. Converting a database into the RDF model is a challenge that many users face, since there is no tool that can be used fully automatically without human contribution. Converting a database into the RDF model requires a special structure for mapping the data from the database into RDF. Different databases require different structures; these structures are called ontologies. For each database, a user is required to provide an ontology.

Figure 2: RDF model for the second book [8]

Few ontologies exist on the Internet, and they do not cover all users' purposes. Therefore, we developed a new ontology, based on DBpedia ontologies, called the FEW ontology, to serve users who are working with a FEW knowledge base. The FEW ontology contains tens of relationships that can be used to specify the relationship between two entities while converting to the RDF model. For instance, Table 1 contains books' titles, authors, publishers, etc.

Table 1: Example of a book database

Isbn        Title          Author          publishedID  Pages
0596002637  Practical RDF  Shelley Powers  7642         350
0596000480  JavaScript     David Flanagan  3556         936

The relationship that links the second book with "JavaScript" is "title". There are a few ontologies that define such simple relationships, but for another column the relationship might be "number of pages". In this case, a user has to search for an ontology that defines the relationship "number of pages" or create their own. After converting the previous table to the RDF model, the data will be presented as in Figure 2.
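The table-to-RDF conversion described above can be sketched in a few lines. This is a minimal illustration, not the Karma or FoodKG implementation; the predicate URIs in the `ONTOLOGY` mapping are hypothetical placeholders rather than entries from a real ontology:

```python
# Minimal sketch of mapping one table row to RDF triples, assuming a
# hand-written ontology that maps column names to (hypothetical) predicate URIs.
ONTOLOGY = {
    "Title": "http://example.org/ontology/title",
    "Author": "http://example.org/ontology/author",
    "Pages": "http://example.org/ontology/numberOfPages",
}

def row_to_triples(row, subject_base="http://example.org/book/"):
    """Turn a dict-shaped table row into (subject, predicate, object) triples."""
    subject = subject_base + row["Isbn"]  # the ISBN gives each book a unique URI
    return [(subject, ONTOLOGY[col], str(val))
            for col, val in row.items() if col in ONTOLOGY]

triples = row_to_triples(
    {"Isbn": "0596000480", "Title": "JavaScript",
     "Author": "David Flanagan", "Pages": 936})
```

Each column covered by the ontology yields one triple whose subject is the row's URI, mirroring how a mapping tool walks a table row by row.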

Another advantage of the RDF data model is that a user can simply understand all the information presented using these RDF triples. A user may add extra information, such as a link to the author's personal website, how many children he has, and what other books he has written.

The second part of our work is enhancing the mapped RDF dataset by adding extra information based on the semantic similarity between the entities in a given dataset. Our program starts by semantically comparing two entities at a time. Based on the relationships between these entities, extra triples will be added to the dataset containing the compared entities with the semantic similarity score and the relationship between them. In a dataset, multiple entities may have a relationship other than the existing ones [109]. For instance, a dataset may contain names such as "David Flanagan" and "Java in a Nutshell", which can be confusing to users. In this case, adding extra information based on the semantic similarity between the first and second name, such as "author" or "owned by", will enrich the dataset and provide valuable information for users to understand the exact relationships between names and entities. Moreover, enriching a dataset with extra information will minimize the searching time. For example, adding the relationship "author" between "David Flanagan" and "Java in a Nutshell" will save time and effort for users who want to search for the relationship between these names. For this purpose, we utilize the ConceptNet web service to provide us with all the semantically related concepts for a given word, which we then use to conduct our calculations. Before we start explaining our work, we would like to mention a few of the reasons behind choosing the area of FEW.

• Most technology nowadays is concerned with computer-related projects in areas such as social media, banking, advertisement, and education. Food, Water, and Energy systems do not receive the same technological interest as those areas. Hence, our project aims to build a system that improves FEW knowledge graphs to shed light on these areas in a way that enhances these systems and enables users to analyze databases and graphs in a better way [83].

• The lack of existing ontologies for converting databases to the RDF model obligated us to create a new ontology, based on DBpedia ontologies, to be used with FEW systems.

• Analyzing data is not a new concept, but enriching a dataset by adding extra RDF quads related to the existing ones, based on the semantic similarity between these quads, is a real challenge that will enrich a dataset and provide users with more helpful information and facts about the concepts that exist in that particular dataset.
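The pairwise enrichment step described earlier can be sketched as follows. This is a heavily simplified illustration: the `similarity` function is a hypothetical stand-in for the ConceptNet-based scoring, and the toy scores are made up:

```python
# Sketch of enrichment: compare entities pairwise and, when the semantic
# similarity exceeds a threshold, add an extra quad-like record that carries
# the relation and the similarity score. `similarity` is a stand-in for the
# ConceptNet-backed scorer used by the actual system.
def enrich(triples, similarity, threshold=0.5):
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    extra = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            score = similarity(a, b)
            if score >= threshold:
                extra.append((a, "relatedTo", b, score))  # triple + score
    return triples + extra

# Toy similarity table for illustration only.
toy_scores = {frozenset({"David Flanagan", "Java in a Nutshell"}): 0.8}
sim = lambda a, b: toy_scores.get(frozenset({a, b}), 0.0)

enriched = enrich([("David Flanagan", "wrote", "Java in a Nutshell")], sim)
```

A real deployment would replace `sim` with a call to a semantic network or an embedding model; the filtering-by-threshold structure stays the same.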

Food, energy, and water are critical resources for sustaining human life on Earth. Currently, there is a plethora of datasets on the Internet related to FEW resources. However, there is still a lack of reliable tools that can consume these resources and provide decision-making capabilities [82]. Moreover, FEW data exists on the Internet in different formats with different file extensions, such as CSV, XML, and JSON, and this makes it a challenge for users to join, query, and perform other tasks [51]. Generally, such data types are neither consumable in the world of Linked Open Data (LOD) nor ready to be processed by deep learning networks [64]. Recently, in September 2018, Google announced its "Google Dataset Search", a search engine that includes knowledge graphs and datasets. Google Dataset Search is a giant leap in the Semantic Web domain, but the challenge is the lack of published knowledge graphs, especially in the FEW systems area [35].

Knowledge graphs, including [18], DBpedia [14], and YAGO [98], have been commonly used in Semantic Web technologies, Linked Open Data, and cloud computing [29] due to their semantic properties. In recent years, many free and commercial knowledge graphs have been constructed from semi-structured repositories like Wikipedia or harvested from the Web. In both cases, the results are large global knowledge graphs that have a trade-off between completeness and correctness [42]. Recently, different refinement methods were proposed to utilize the knowledge in these graphs and make them more useful in domain-specific areas by adding missing knowledge, identifying erroneous pieces, and extracting useful information for users [74]. Furthermore, the knowledge extraction methods used in most knowledge graphs are based on binary facts [31]. These binary facts represent relations between two entities, which limits their deep reasoning ability when there are multiple entities, especially in domain-specific areas like FEW [102].

The lack of reliable knowledge graphs serving FEW resources has motivated us to build our tool, FoodKG, which uses domain-specific graph embeddings to help in decision-making, improving knowledge discovery, simplifying access, and providing better search results [36]. FoodKG enriches FEW datasets by adding additional knowledge and images based on the semantic similarities [101] between entities within the same context. To achieve these tasks, FoodKG employs a recent graph embedding technique based on self-clustering called GEMSEC [84], which was retrained on the AGROVOC [21] dataset. AGROVOC is a collection of vocabularies that covers all areas of interest to the Food and Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. The retrained model, AGROVEC, is a domain-specific graph embedding model that enables FoodKG to enhance knowledge graphs with semantic similarity scores between different terms and concepts. In addition, FoodKG allows users to query knowledge graphs using SPARQL through a friendly user interface.
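A semantic similarity score between two embedded concepts is commonly computed as the cosine similarity of their vectors. The sketch below illustrates the idea; the tiny three-dimensional vectors are made up for demonstration (learned AGROVEC vectors are higher-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up toy embeddings: related farm concepts point in similar directions.
emb = {"wheat": [0.9, 0.1, 0.2],
       "barley": [0.8, 0.2, 0.3],
       "fire": [0.0, 0.9, 0.1]}

score = cosine(emb["wheat"], emb["barley"])  # close to 1.0 for related concepts
```

With a trained domain-specific embedding, pairs like "wheat"/"barley" score much higher than unrelated pairs like "wheat"/"fire", which is exactly the signal used to rank enrichment candidates.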

Most current knowledge graphs and data are private. Therefore, we have extended our work by adding Federated Learning (FL) techniques in order to benefit from private and secured data. FL was introduced by McMahan [61] as a distributed machine learning approach whose goal is to train a centralized/global model using a large number of distributed datasets, without accessing the data and while keeping the data localized. The idea is to train a smaller version of the model at each dataset site and then aggregate all the models at the server, where the goal is to minimize the objective function, as can be seen below:

\min_{w} f(w) = \sum_{i=1}^{n} p_i F_i(w) = \mathbb{E}_i[F_i(w)] \quad (1.1)

where n is the number of clients and \sum_i p_i = 1 s.t. p_i \geq 0. FL allows training on such data without requiring data transfer outside its holder's premises. In particular, FL is one instance of the more general approach of "bringing the code to the data, instead of the data to the code," which trains a model using the localized data without being granted access to it. The general description of FL was given by McMahan and Ramage [20], and the theory in Konečný et al. (2016a) [52], McMahan et al. (2017 [20], 2018 [63]), and [19], to address the fundamental problems of privacy, ownership, and locality of data. FL was initially introduced to target mobile and edge device applications [62]; later on, FL was also used across multiple organizations such as hospitals. We will call these two settings "cross-device" and "cross-silo," respectively, as mentioned in [62].
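Equation (1.1) can be checked numerically: the global objective is just the p_i-weighted average of the clients' local objectives. The loss values and sample counts below are made up for illustration:

```python
# Numeric illustration of Eq. (1.1): f(w) = sum_i p_i * F_i(w),
# with p_i >= 0 and sum_i p_i = 1.
def global_objective(local_losses, weights):
    assert abs(sum(weights) - 1.0) < 1e-9 and all(p >= 0 for p in weights)
    return sum(p * f for p, f in zip(weights, local_losses))

# Three clients; p_i is taken proportional to each client's sample count,
# a common choice in FedAveraging.
samples = [100, 300, 600]
p = [n / sum(samples) for n in samples]          # [0.1, 0.3, 0.6]
f_w = global_objective([0.9, 0.5, 0.2], p)       # 0.1*0.9 + 0.3*0.5 + 0.6*0.2
```

Clients holding more data contribute proportionally more to the global objective, which is why the aggregation in FedAveraging weights client updates by sample count.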

Federated learning is one of the widely adopted techniques in the context of privacy-preserving machine learning, used to train a model on data that is not accessible, such as patient records at hospitals. In particular, instead of uploading the data to a centralized server where a model will be trained, FL techniques rely on sending the model to the data holder, which in return will train a model without having to share the data or permit access to it. Furthermore, FL is often deployed to train models from edge and wearable devices that continuously collect data from users, such as phones and medical equipment. For example, one of the most famous uses of FL is in the area of smartphone keyboards. Google makes extensive use of FL in the Gboard mobile keyboard [40, 46, 80, 106] (see Figure 3 for the simplified FL architecture) and Android Messages [24], while Apple uses cross-device FL in iOS 13 [99]. The model that predicts the next word in smartphone keyboards was trained with the FL technique. Instead of uploading all the private text of users to a centralized server and training a model there, a simple model is trained on each user's phone, producing a model that does not have strong accuracy. However, when collecting thousands of user models and averaging their weights on the server, a better and more generalized model is produced. The produced model is then sent to all of the users in the next round. A round in FL starts with the server sending the global model to all of the clients; each client further trains this model on its own private data and then sends the updates back to the server for aggregation. This process repeats, and a more generalized model is produced.

Figure 3: The simplified architecture of FL, where the server initially sends a global model to the clients. The clients perform local training and share updated weights with the server. The server aggregates the weights, updates the global model, and continues to perform these steps again.
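The round just described can be sketched as a toy simulation. Here "models" are reduced to plain weight vectors and "local training" to a client-specific update function, which stands in for local SGD on private data:

```python
# Toy FL round: server broadcasts the global model, clients "train" locally,
# and the server averages the returned weight vectors (FedAveraging).
def average(models):
    """Coordinate-wise mean of a list of client weight vectors."""
    return [sum(ws) / len(ws) for ws in zip(*models)]

def run_round(global_model, client_updates):
    # Each client gets a copy of the global model and returns its local update.
    local_models = [update(list(global_model)) for update in client_updates]
    return average(local_models)

# Three stand-in clients; each lambda replaces a real local training step.
clients = [lambda w: [x + 1.0 for x in w],
           lambda w: [x - 1.0 for x in w],
           lambda w: [x + 0.0 for x in w]]

model = [0.0, 0.0]
for _ in range(3):  # three communication rounds
    model = run_round(model, clients)
```

Because the three toy updates cancel each other out, the averaged model stays at the origin here; with real gradients, each round moves the global model toward a consensus of the clients' data.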

Averaging all clients' models is the standard approach currently used to generate a global, generalized model with better accuracy. This technique is similar to a random forest, where the idea is to average all the over-fitted tree models to produce a better overall model. However, this approach faces a real challenge when participating entities (i.e., data holders that participate in training small models on premises) do not hold "good" data, or when their data includes a lot of noise. For example, in the case of using FL for improving next-word predictions on smartphones, many people use the English alphabet to type words in other languages (e.g., one might type "salam" in English although it is a greeting in Arabic). Not to mention grammatical mistakes and shortcuts, such as typing "u" instead of "you". Different accents and slangs may also lower the model's accuracy, such as typing "goin" instead of "going". The models collected from such users will be harmful to the general model.

On the other side, we have computer vision models that are trained on images. Some clients may have a huge number of high-resolution images. Others may have only a few images, corrupted images, low-resolution images, black-and-white images, or images with a lot of noise that will negatively affect the overall model. Moreover, collecting models from a much larger crowd requires more computational power and bandwidth, and introduces latency. Therefore, our proposed algorithm runs a simple accuracy test for each client model after each round; based on the result, the model either is or is not included in further operations on the server.

In this dissertation, we propose a tool called FoodKG that refines and enriches FEW resources to utilize the knowledge in FEW graphs in order to make them more useful for researchers, experts, and domain users. The key contributions of our work are as follows:

• FoodKG is a novel software tool that aims to enrich and enhance FEW graphs using multiple features. Adding a context to the provided triples is one of the first features, which allows querying the graphs more easily and provides better input for deep learning models.

• FoodKG provides different Natural Language Processing (NLP) techniques, such as POS tagging, chunking, and the Stanford Parser, for extracting meaningful subjects, unifying repeated concepts, and linking related entities together [22, 50, 59].

• FoodKG employs the Specialization Tensor Model (STM) [37] to predict the newly added relations within the graph.

• We adopt WordNet [67] to return all the offsets for the provided subjects in order to parse the related images from ImageNet [85]. These images will be added to the graph in the form of Universal Resource Locators (URLs) as related and pure images.

• FoodKG utilizes the GEMSEC [84] model, retrained on AGROVOC with transfer learning and fine-tuning to produce AGROVEC, to provide semantic similarity scores between similar and linked concepts. AGROVEC was compared with word embedding and knowledge graph embedding models trained on the same dataset. By virtue of being trained on domain-specific graph data, AGROVEC achieved superior performance over its competitors in terms of the Spearman Correlation Coefficient score.

• We introduced Federated Learning (FL) techniques to further extend our work and include private datasets, by training smaller versions of the model at each dataset site without accessing the data and then aggregating all the models at the server side. We propose an algorithm, called RefinedFed, that further extends the current FL work by filtering the models at each dataset site before the aggregation phase. Our algorithm improves the current FL model accuracy from 84% to 91% on the MNIST dataset.

Our results show that AGROVEC provides more accurate and reliable results than the other embeddings in different scenarios: category classification, semantic similarity, and scientific concepts.

We aim to make FoodKG one of the best tools for data scientists and researchers in the FEW domains to develop next-generation applications using the concept of knowledge graphs and machine learning techniques. The rest of the dissertation is organized as follows: Section 2 discusses recent related work; Section 3 presents the design details of FoodKG; Section 4 discusses the implementation and performance evaluation of FoodKG; and finally, we conclude in Section 5.

CHAPTER 2

BACKGROUND

In the first part of this chapter, we present a brief introduction to several approaches and tools that are used to convert various data formats to the RDF data model, and the reasons why we chose the Karma integration tool.

In the second part of the chapter, we present the most reliable semantic networks and the reasons behind choosing ConceptNet to work with in our project.

In the third part, we present the state-of-the-art embedding models (graph, knowledge graph, and word embeddings).

2.1 Text Vectorization

2.1.1 Bag of Words (BOW)

BOW is a technique to parse the features of a document. The meanings of features are the characteristics and properties that you can use to make a decision (to buy a house you look for few features such as how many rooms and its location). The features of the text are how many unique words in the corpus and the occurrence for each word, etc.

BOW is a feature extraction technique in which the output is a vector space that represents each document in the corpus. The length of this vector (its dimensionality) corresponds to the number of unique words in the corpus (no repetition; each word occurs only once). The BOW model has different flavors, each extending or modifying the base BOW. Next, we discuss three different vectors: frequency vectors (count vectors), One Hot Encoding, and Term Frequency/Inverse Document Frequency.

2.1.2 Frequency Vectors

This is the simplest encoding technique, yet it is still effective in some use cases.

Simply, we fill the document vector with the count of how many times each word appears in the document. As an example, let us say our corpus has two documents. While the first one contains "Alice loves pasta", the second document contains "Alice loves fish. Alice and Bob are friends". To represent the counts, we can either use a table as in Table 2, or JavaScript Object Notation (JSON) as in Listing 2. We can also combine both JSON notations into a single one, Listing 3:

Table 2: Table representation form

       Alice  loves  pasta  fish  and  Bob  are  friends
doc1     1      1      1     0     0    0    0      0
doc2     2      1      0     1     1    1    1      1

doc1: {"Alice":1, "loves":1, "pasta":1}
doc2: {"Alice":2, "loves":1, "fish":1, "and":1, "Bob":1, "are":1, "friends":1}

Listing 2: JSON representation form

{"Alice":3, "loves":2, "pasta":1, "fish":1, "and":1, "Bob":1, "are":1, "friends":1}

Listing 3: Combined JSON representation

As you can see, we have 8 unique words in our corpus. Therefore, our vector will have a size of 8. To represent document 1, we simply take the first row in our table: [1, 1, 1, 0, 0, 0, 0, 0]. This vector helps in comparing documents. While this technique is helpful in some use cases, it has some limitations: it does not keep the document structure (it does not preserve the order of the words, it only counts them); it suffers from the sparsity problem (most of the values in the vector are zeros, which increases the time complexity and adds bias to the model); and the stop words (such as 'and', 'or', 'is', 'the', etc.) appear many more times than the other words. Therefore, we use techniques such as stemming and lemmatization, and we also remove the stop words and the rare words that appear only a few times in the entire corpus.
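The count-vector construction above can be sketched in a few lines of Python. This is a minimal illustration for the two example documents; the period after "fish" is dropped so that whitespace splitting suffices as a tokenizer:

```python
# Build count vectors for the two example documents.
# Vocabulary order follows Table 2; punctuation is omitted for simplicity.
docs = ["Alice loves pasta",
        "Alice loves fish Alice and Bob are friends"]

vocab = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]

def count_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [count_vector(d, vocab) for d in docs]
print(vectors[0])  # [1, 1, 1, 0, 0, 0, 0, 0]
print(vectors[1])  # [2, 1, 0, 1, 1, 1, 1, 1]
```

The two printed vectors reproduce the rows of Table 2 for doc1 and doc2.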

2.1.3 One Hot Encoding

As discussed for frequency vectors, tokens that appear frequently have more magnitude than others that appear less. Therefore, the One Hot Encoding (OHE) vector provides a boolean vector as a solution to this problem, where we fill the vector with only 1's and 0's: we place a 1 if the word appears in the document (1 instead of the count) and a 0 otherwise.

Document 2 can be represented as [1, 1, 0, 1, 1, 1, 1, 1].

One Hot Encoding can also be used to represent individual words: 1 for the word that we want to represent and 0 for the rest. The word "Alice" can be represented as [1, 0, 0, 0, 0, 0, 0, 0], or we can add the count as well, so "Alice" can be represented as [3, 0, 0, 0, 0, 0, 0, 0].
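Both uses of One Hot Encoding can be sketched directly, reusing the example vocabulary:

```python
vocab = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]
doc2 = "Alice loves fish Alice and Bob are friends".split()

# Document-level OHE: 1 if the term occurs anywhere in the document, 0 otherwise.
doc_ohe = [1 if term in doc2 else 0 for term in vocab]
print(doc_ohe)   # [1, 1, 0, 1, 1, 1, 1, 1]

# Word-level OHE: 1 only at the position of the word itself.
word_ohe = [1 if term == "Alice" else 0 for term in vocab]
print(word_ohe)  # [1, 0, 0, 0, 0, 0, 0, 0]
```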

2.1.4 Term Frequency/Inverse Document Frequency

So far we have been treating each document as a standalone entity without looking at the context of the corpus. TF-IDF is one of the common techniques to normalize the frequency of tokens in a document with respect to the corpus context. TF-IDF combines two quantities:

1. Term frequency tf(t, d): how frequently a term (t) occurs in a document (d). If we denote the raw count by f(t, d), then the simplest tf scheme is tf(t, d) = f(t, d) (other schemes are discussed below), and let us denote the total number of words in document d by len(d). For example, to rank documents that are most related to the query "the blue sky", we count the number of times each word occurs in each document. However, since documents differ in size, it is not fair to compare how many times a word occurs in a document with 10 words and in a document with 1M words. Therefore, we scale tf to prevent the bias toward long documents as follows: tf(t, d) = f(t, d) / len(d). Other tf schemes that adjust and reduce the count of the most repeated words in a document include:

• Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise

• Term frequency adjusted for document length: tf(t, d) = f(t, d) / len(d)

• Logarithmically scaled frequency: tf(t, d) = log(1 + f(t, d))

• Augmented frequency: tf(t, d) = 0.5 + 0.5 * f(t, d) / m, where m is the count of the most frequent word in d

2. Inverse Document Frequency: it measures how important a term is. IDF reduces the weight of common words that appear in many documents. Given our previous example "the blue sky", the word "the" is a common word, and therefore the term frequency alone tends to incorrectly emphasize documents that repeat low-information words such as "the". As a solution, we calculate the log() of the total number of documents (D) divided by n, the number of documents in which t appears: idf(t, D) = log(D / n). Finally, TF-IDF can be calculated as: tf-idf(t, d, D) = tf(t, d) * idf(t, D). We then simply place TF-IDF scores in the vectors instead of frequency counts or OHE values.
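The two quantities above can be combined in a short sketch. The three-document corpus here is hypothetical, chosen so that the common word "the" receives a zero weight while the rarer "blue" is weighted up:

```python
import math

# Toy corpus (hypothetical); D is the number of documents.
docs = ["the blue sky".split(),
        "the sky is clear".split(),
        "the cars are fast".split()]
D = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)        # length-adjusted term frequency

def idf(term, docs):
    n = sum(1 for d in docs if term in d)    # number of documents containing the term
    return math.log(D / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))   # 0.0: "the" occurs in every document
print(tf_idf("blue", docs[0], docs))  # positive: "blue" is rare in the corpus
```

Because "the" appears in all three documents, idf("the") = log(3/3) = 0, so its TF-IDF weight vanishes exactly as the text describes.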

2.2 Embedding Models

2.2.1 What are Embeddings?

Embeddings are a type of knowledge representation in which each textual variable is represented by a vector (think of it as a list of numbers for now). A textual variable could be a word, a node in a graph, or a relation between two nodes in a knowledge graph. These vectors go by different names, such as space vectors, latent vectors, or embedding vectors. They represent a multidimensional feature space on which machine learning methods can be applied. Therefore, we need to shift how we think about language: from a sequence of words to points that occupy a high-dimensional semantic space, where points can be close together or far apart.

Figure 4: Image Source: (Embeddings: Translating to a Lower-Dimensional Space) by Google.

2.2.2 Why Do We Need Embeddings?

The purpose of this representation is to get words with similar meanings (semantically related) to have similar representations and be closer to each other after plotting them in the space. Why is that important? Well, for many reasons, mainly:

1. Computers do not understand text and the relations between words, so you need a way to represent these words with numbers, which is what computers understand.

2. Embeddings can be used in many applications such as question answering systems, recommendation systems, sentiment analysis, and text classification, and they also simplify search and synonym retrieval. Let us take a simple example to understand how embeddings help with all of that.

2.2.3 Simple Embeddings Example

For the sake of simplicity, let us start with this example: consider the words "king", "queen", "man", and "woman", represented by the vectors [9, 8, 7], [5, 6, 4], [5, 5, 5], and [1, 3, 2], respectively. Figure 4 depicts these vector representations. Notice that the word "king" and the word "man" are semantically related in that both represent a male human. However, the word "king" has an extra feature, which is royalty. Similarly, the word "queen" is similar to "woman" but has the extra royalty feature as well.

Since the relation between "king" and "queen" (male royalty - female royalty) is similar to the relation between "man" and "woman" (male human - female human), subtracting them from each other gives us the famous equation: (king - queen = man - woman). Note that when we subtract two words from each other, we subtract their vectors.

2.2.4 The Magic Behind the Embeddings

Suppose we do not know the female counterpart of "king"; how can we get it? Well, since we know that (king - queen = man - woman), we rearrange the formula to (queen = king - man + woman), which makes sense: if you remove the male gender from "king" (royalty is the remainder) and then add the female gender to royalty, you get what we are looking for, which is "queen", Figure 5.
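With the toy vectors from the example above, the analogy can be checked by component-wise arithmetic:

```python
# Toy vectors from the example above.
vectors = {
    "king":  [9, 8, 7],
    "queen": [5, 6, 4],
    "man":   [5, 5, 5],
    "woman": [1, 3, 2],
}

# queen = king - man + woman, computed component-wise.
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(result)                      # [5, 6, 4]
print(result == vectors["queen"])  # True
```

The toy vectors were chosen so the analogy holds exactly; with real embeddings one instead searches for the *nearest* vector to king - man + woman.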

Now we know embeddings can be helpful in question answering systems. Other examples are similar: (USA - English = France - French), (Germany - Berlin = France - Paris). Moreover, embeddings are also helpful in simple recommendation tasks. For example, if someone likes "orange", then we look at the vectors most similar to the vector that represents "orange" and we get the vectors for "apple", "cherry", and "banana". As we can see, the better the representation (list of numbers) we get for each word, the better the accuracy our recommendation system achieves. So the remaining question is: how do we come up with this list of numbers (called an embedding, latent, or space vector) for each word?

Figure 5: Image by (Kawin Ethayarajh), Why does King - Man + Woman = Queen? Understanding Word Analogies.

FoodKG is a unique software system in its type and purpose; there are no other systems or tools that have the same features. Our main work falls under graph embedding techniques. Embedded vectors learn the distributional semantics of words and are used in different applications such as Named Entity Recognition (NER), question answering, document classification, information retrieval, and other machine learning applications [70]. The embedded vectors mainly rely on calculating the angle between pairs of words to check their semantic similarity and to perform other word analogy tasks, such as the common example king - queen = man - woman. The two main methods for learning word vectors are matrix factorization methods such as Latent Semantic Analysis (LSA) [28] and Local Context Window (LCW) methods such as skip-gram (Word2vec) [66]. Matrix factorization generates low-dimensional word representations that capture the statistical information about a corpus by decomposing large matrices using low-rank approximations. In LSA, each row corresponds to a word or a concept, whereas columns correspond to different documents in the corpus. However, while methods like LSA leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. The second family of methods makes predictions within a local context window, as in the Continuous Bag-of-Words (CBOW) model [65]. The CBOW architecture relies on predicting the focus word from the context words, while skip-gram predicts all the context words one by one from a single given focus word. A few techniques have been proposed, such as hierarchical softmax, to optimize such predictions by building a binary tree of all the words and then predicting the path to a specific node.

2.2.5 Word2vec

Word2vec is one of the earliest models designed mainly to embed words rather than sentences or documents. Moreover, the dimensionality of Word2vec vectors is not tied to the number of words in the training data, since the model reduces the dimensions to a fixed size (50, 100, 300, etc.). Word2vec falls under prediction-based embeddings, which predict a word in a given context. Word2vec has two flavors: the Continuous Bag Of Words (CBOW) and the Skip-Gram model. CBOW predicts the probability of a word given a context, whereas Skip-Gram uses the opposite architecture (predicting a context given a single word).

2.2.5.1 CBOW Architecture

We start by specifying a context window size, which marks the beginning and end of each context. Then we get the One Hot Encoding vector for each word. Given the corpus "I like driving fast cars", suppose the window size is 1 (one word before and one word after the target word), the vector dimension is 3, and we want to predict the middle word "driving" from its context words "like" and "fast". Notice that we have only one hidden layer, whose size equals the required vector dimension; this is why the technique is described as learning vector representations. Figure 6 shows the architecture; note that the inputs are the words in the context window and the output is the learned representation of the target word. Also note that no activation function is applied to the hidden layer; however, the output layer utilizes Softmax.

The output of the previous neural network is the weight matrix in Figure 7.

After obtaining the weight matrix, we multiply it by the One Hot Encoding vector of the target word to get its representation vector, Figure 8.

Multiplying the weight matrix by a vector filled with zeros except for a single 1 may sound useless at first; of course, the output is just the row at that position in the matrix. Consider the example in Figure 9.

Figure 6: CBOW architecture

Figure 7: The generated weights

Figure 8: Multiplication to generate the embedding vector

Figure 9: Multiplication to generate the embedding vector

Figure 10: Skip-Gram architecture

Well, the real purpose of this multiplication is just to look up the target word's vector based on its position in the One Hot Encoding vector.
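This row-lookup behavior can be verified with a tiny example. The weight values below are hypothetical:

```python
# The weight matrix (hypothetical values): one row per vocabulary word.
W = [[0.2, 0.7, 0.1],   # embedding of word 0
     [0.9, 0.4, 0.3],   # embedding of word 1
     [0.5, 0.8, 0.6]]   # embedding of word 2

one_hot = [0, 1, 0]     # one-hot vector selecting word 1

# one_hot (1x3) times W (3x3) -> the embedding (1x3)
embedding = [sum(one_hot[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))]

print(embedding)          # [0.9, 0.4, 0.3]
print(embedding == W[1])  # True: the product is simply a row lookup
```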

2.2.5.2 Skip-Gram

Skip-Gram, sometimes called the Skip-N-gram model, uses the mirrored architecture of CBOW, and the rest is the same. Figure 10 shows the architecture for Skip-Gram, where we try to predict all the words within a window size given a single focus word.

2.2.6 GloVe

Recently, Pennington et al. [77] shed light on GloVe, an unsupervised learning algorithm for generating embeddings by aggregating global word-word co-occurrence counts: a matrix tabulates the number of times word j appears in the context of word i.

GloVe is a word embedding model trained on the co-occurrence matrix counts. It uses corpus statistics by minimizing a least-squares error in order to obtain the word vector space.

2.2.6.1 Co-occurrence Matrix

Given a corpus having V unique words, our co-occurrence matrix X will be of size V×V, where each row i corresponds to a unique word in the corpus and each entry X_ij denotes the number of times word j occurs within the window around word i. Given the sentence "the dog ran after the man" and a window size of 1, we get the matrix in Figure 11.
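The matrix in Figure 11 can be reproduced by counting neighboring pairs; a minimal sketch for the example sentence:

```python
from collections import defaultdict

# Co-occurrence counts with window size 1 for the example sentence.
tokens = "the dog ran after the man".split()
window = 1

cooc = defaultdict(int)
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[(w, tokens[j])] += 1

print(cooc[("the", "dog")])    # 1
print(cooc[("after", "the")])  # 1
print(cooc[("the", "man")])    # 1
```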

2.2.7 FastText

FastText is another embedding model, created by the Facebook AI Research (FAIR) group for efficient learning of word representations and sentence classification [17]. FastText treats each word as a combination of character n-grams, where n can range from 1 to the length of the word. Therefore, fastText has some advantages over Word2vec and GloVe, such as providing vector representations for rare words that may not appear in the Word2vec and GloVe vocabularies. N-gram embeddings also tend to perform better on smaller datasets.

Figure 11: GloVe co-occurrence matrix for "the dog ran after the man"

FastText supports training using different architectures, such as CBOW or Skip-Gram, with softmax or hierarchical softmax loss functions, or negative sampling. Each word is represented as a bag of character n-grams in addition to the word itself, Figure 12.

According to the FastText authors, the neural network consists of a single layer. First, the Bag-of-Words representation is fed to a lookup layer, where the embeddings are retrieved for every word. These embeddings are then averaged to obtain a single averaged embedding for the whole text.
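The character n-gram decomposition FastText relies on is easy to sketch. FastText wraps each word in boundary markers before slicing it (the word "where" is the example commonly used):

```python
# FastText-style character n-grams: the word is wrapped in the boundary
# markers '<' and '>', then split into overlapping n-grams.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Summing the vectors of these n-grams (plus the whole word) is what lets FastText produce a vector even for words never seen during training.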

2.2.8 GEMSEC

A knowledge graph embedding is a type of embedding in which the input is a knowledge graph, leveraging the relations between the vertices (triple-based).

Figure 12: FastText architecture for a sentence with N-gram features. The features are embedded and averaged to form the hidden variable [6].

We consider Holographic Embeddings of Knowledge Graphs (HolE) to be the state-of-the-art knowledge graph embedding model [72]. However, when the input dataset is a graph instead of a text corpus, we apply different embedding algorithms such as LINE [100], Node2vec [39], M-NMF [105], and DANMF [107]. DeepWalk is one of the common models for graph embedding [78]. DeepWalk leverages language modeling and deep learning to learn latent representations of vertices in a graph by analyzing and applying random walks. A random walk in a graph is the analogue of a sentence in a corpus: the sequences of nodes that frequently appear together within a specific window size are treated as sentences. This technique also uses skip-gram to minimize the negative log-likelihood of the observed neighborhood samples. GEMSEC is another graph embedding algorithm that learns node clustering while computing the embeddings, whereas the other models do not utilize clustering, Figure 13. It relies on sequence-based embedding combined with clustering, so that the embedded nodes are clustered simultaneously. The algorithm places the nodes in an abstract feature space so as to minimize the negative log-likelihood of the preserved neighborhood nodes while clustering the nodes into a specific number of clusters.
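The random-walk generation that DeepWalk builds on can be sketched in a few lines. The graph below is a hypothetical toy example; in DeepWalk the resulting walks are fed to a skip-gram model exactly like sentences:

```python
import random

# A small toy graph (hypothetical edges) as an adjacency list.
graph = {"A": ["B", "C"],
         "B": ["A", "C"],
         "C": ["A", "B", "D"],
         "D": ["C"]}

def random_walk(graph, start, length, rng):
    """Uniform random walk of the given length starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)  # fixed seed for reproducibility
walks = [random_walk(graph, node, 5, rng) for node in graph]

# Each walk plays the role of a "sentence" of nodes; feeding all walks
# to a skip-gram model yields node embeddings, as DeepWalk does.
for w in walks:
    print(" ".join(w))
```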

Figure 13: Zachary's Karate club visualized using DeepWalk and GEMSEC embeddings. White nodes correspond to the instructor's community and blue nodes to the president's community [84].

Graph embeddings capture the semantics between concepts better than word embeddings, which is the reason we use a graph embedding model to exploit graph semantics in FoodKG.

CHAPTER 3

RELATED WORK

In the first part of this chapter, we present a brief introduction to several approaches and tools that are used to convert various data formats to the RDF data model, and the reasons why we chose the Karma integration tool.

In the second part of the chapter, we present the most reliable semantic networks and the reasons behind choosing ConceptNet to work with in our project.

In the third part, we present the state-of-the-art embedding models (graph, knowledge graph, and word embeddings).

3.1 Converting Databases to RDF Model

3.1.1 R2RML

R2RML stands for RDB 2 RDF Mapping Language. R2RML is a language that provides Direct Mapping (DM) from relational databases to the RDF model in a customized way [41]. Direct mapping enables users to express vocabularies (relationships) and structures of their own choice. One of the best features R2RML provides is allowing SPARQL Protocol and RDF Query Language (SPARQL) endpoint queries over the mapped relational data. For instance, consider the following data in Table 3 that needs to be converted to the RDF model:

Table 3: A single row from the employees' database

EMPNO                 ENAME            JOB             DEPTNO
Integer, Primary Key  Characters(100)  Characters(20)  Integer, Dept. No.
7369                  SMITH            CLERK           10

Using R2RML, a user may write a mapping as the following, Listing 4:

@prefix rr: .
@prefix ex: .

<#TriplesMap1>
    rr:logicalTable [ rr:tableName "EMP" ];
    rr:subjectMap [
        rr:template "http://data.example.com/employee/{EMPNO}";
        rr:class ex:Employee;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rr:column "ENAME" ];
    ].

Listing 4: R2RML mapping example

Using this snippet of code, R2RML will map the given table to the RDF model in the following format, Listing 5:

<http://data.example.com/employee/7369> rdf:type ex:Employee .
<http://data.example.com/employee/7369> ex:name "SMITH" .
<http://data.example.com/employee/7369> ex:department .

Listing 5: Generated RDF triples

R2RML provides a mapping language that many other languages and platforms rely on for mapping. However, users of R2RML need advanced knowledge of R2RML, as well as extra knowledge about RDF models. Moreover, it does not provide a user interface, which makes it harder to read and understand.

3.1.2 D2RQ-ML

D2RQ-ML is a declarative mapping language that maps relational databases to the RDF model. It provides SPARQL access, since D2RQ is written in RDF syntax. D2RQ provides virtual access to graphs and databases, analogous to views in SQL, to process users' queries [25, 30]. D2RQ requires users to write a template (which can be considered the mapping structure) in order to use it. Writing a template for a single database consumes a lot of time and effort, and the complexity of this procedure increases with the number of columns in a database and the number of databases.

Furthermore, users are required to have advanced knowledge of writing templates and to be familiar with D2RQ syntax, because it does not provide a Graphical User Interface (GUI). The following code snippet (Listing 6) shows the syntax of D2RQ-ML.

map:Database1 a d2rq:Database;
    d2rq:jdbcDSN "jdbc:mysql://localhost/iswc";
    d2rq:jdbcDriver "com.mysql.jdbc.Driver";
    d2rq:username "user";
    d2rq:password "password";
    .
map:Conference a d2rq:ClassMap;
    d2rq:dataStorage map:Database1;
    d2rq:class :Conference;
    d2rq:uriPattern "http://conferences.org/comp/confno@@Conferences.ConfID@@";
    .
map:eventTitle a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:Conference;
    d2rq:property :eventTitle;
    d2rq:column "Conferences.Name";
    d2rq:datatype xsd:string;
    .

Listing 6: D2RQ-ML syntax [97]

3.1.3 SML

Sparqlification Mapping Language (SML) is a mapping language that maps RDB to the RDF model. SML offers a better mapping syntax than R2RML [4]; Figure 14 illustrates how SML syntax is easier and more understandable for users.

Although SML made the syntax easier, it still does not provide a GUI. Likewise, a user is still required to have advanced knowledge of SML syntax before using it, which is not an easy task for most users who just want to convert their data to the RDF data model. In addition, it is time-consuming, since a user has to map each column manually.

34 Figure 14: Comparison between SML and R2RML [97]

3.1.4 Apache Any23

Apache Any23 is a web service, library, and command-line tool that extracts and produces RDF triples from various web documents [75]. The name Any23 was derived from "Anything to Triples". To be more specific, the currently supported formats are:

• RDF/XML, Turtle, Notation 3.

• RDFa with RDFa1.1 prefix mechanism.

• Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, XFN and Species.

• HTML5 microdata, such as schema.org

• JSON-LD: JSON for linked data.

Figure 15: Any23, list of extractors [1]

• CSV: Comma Separated Values with separator auto detection.

Apache Any23 is written in Java, and it can be accessed using the command line, Figure 15. In this case, a user is required to be familiar with all the commands that Any23 permits. It does not provide any GUI either; consequently, a user cannot see the changes made to the database during the mapping process. Figure 15 illustrates the extractors with which Any23 is able to produce RDF.

3.1.5 CSV2RDF

CSV2RDF is a simple but powerful library that allows mapping from Comma Separated Values (CSV) to RDF, Figure 16. CSV2RDF is similar to D2RQ-ML in that both require a user-written template to provide the mapping structure. We present a simple table with the required template to map it to the RDF model [30].

Figure 16: The required template to convert Table 4 to the RDF model [2]

Figure 16 shows the required template to convert Table 4 to the RDF model [2].

Table 4: Example of dealer database

Year  Make   Model                          Description    Price
1997  Ford   E350                           ac, abs, moon  3000
1997  Chevy  Venture "Extended Edition"     -              4900
1999  Chevy  Venture "Extended Edition"...  -              5000
1996  Jeep   Grand Cherokee                 MUST SELL!...  4799

A user has to write their own code to parse the data and then use this template within that code. This may take a lot of time, especially when a user is dealing with large databases that contain many columns, or with many databases.

3.1.6 BioDSL

BioDSL is a new approach to mapping CSV files to RDF using a Domain Specific Language (DSL) called BioDSL, Figure 17. BioDSL allows users to write programs that map biodiversity data to RDF format and then link these RDFs to Linked Data.

Figure 17: BioDSL, mapping syntax [97]

BioDSL uses the Groovy programming language, whose syntax is based on objects and functions [49]. A CSV table represents its entities as columns; BioDSL has an object called "csv" to represent each column name, for instance (csv.ename) denotes the csv object that represents the "ename" column in a table. BioDSL provides a function called "Map" that maps the represented column to RDF along with its relationships to other columns in that particular table. Other functions define how the URIs are generated for each RDF triple based on the table classes. BioDSL has two main parts: the first part loads ontologies and the second part maps data to the RDF model. Figure 17 illustrates these two parts in a single example.

Similar to the previous approaches, BioDSL does not provide a GUI. Although BioDSL provides better features than the previous approaches, it still consumes both time and effort from users.

38 3.1.7 Open Refine

The former name of Open Refine was Google Refine. Google stopped supporting the project in October 2012, and since that time the name has changed from Google Refine to Open Refine.

Open Refine is a tool that contains many plugins performing many different tasks. RDF Refine is the plugin that Open Refine uses to map data from a CSV database to the RDF model. Open Refine is an effective tool for working with messy data: it allows users to clean data and transform it from one format into many different formats. Furthermore, Open Refine offers a neat GUI for users and has made many contributions to the world of the Semantic Web [103]. A user may use Open Refine without advanced knowledge of writing code, or any experience with writing templates or providing URIs for RDF graphs. On top of that, Open Refine provides many other features, such as:

• Importing data to Open Refine in various formats.

• Exploring the whole dataset within seconds.

• Applying basic and advanced cell transformations.

• Providing features to handle cells that contain multiple values.

• Creating instantaneous links within the dataset.

• Filtering and joining datasets with regular expressions.

• Providing automatic named-entity extraction.

• Performing advanced operations on datasets.

• Changing values and links of the cells directly.

• Displaying results to the user after each operation instantaneously.

Additionally, Open Refine delivers powerful features for minimizing and simplifying databases. For instance, take a look at Table 5.

Table 5: Example of dealer database

Speed    Car       PayLoad      Image Link
-        Mercedes  -            Link1
200 mph  Mercedes  2200 pounds  Link2
160 mph  Toyota    1620 pounds  Link1
-        Toyota    -            Link2
-        Toyota    -            Link3

Open Refine starts by exploring the given database and then sorts the columns based on the relationships. After sorting the given table, a new table is generated that looks like Table 6.

Table 6: Example of dealer database

Car       Speed    PayLoad      Image Link
Mercedes  -        -            Link1
Mercedes  200 mph  2200 pounds  Link2
Mercedes  -        -            Link3
Toyota    160 mph  1620 pounds  Link1
Toyota    -        -            Link2
Toyota    -        -            Link3

The next step is deleting the empty cells and combining the values for each entity. The generated Table 7 is depicted below:

Table 7: Example of dealer database

Car       Speed    PayLoad      Image Link
Mercedes  200 mph  2200 pounds  Link1, Link2, Link3
Toyota    160 mph  1620 pounds  Link1, Link2, Link3
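The merge from Table 6 to Table 7 can be sketched as a simple group-and-combine step. This is our own illustrative Python, not Open Refine's internal implementation:

```python
# Rows of Table 6 (empty cells shown as "").
rows = [
    {"Car": "Mercedes", "Speed": "",        "PayLoad": "",            "Link": "Link1"},
    {"Car": "Mercedes", "Speed": "200 mph", "PayLoad": "2200 pounds", "Link": "Link2"},
    {"Car": "Mercedes", "Speed": "",        "PayLoad": "",            "Link": "Link3"},
    {"Car": "Toyota",   "Speed": "160 mph", "PayLoad": "1620 pounds", "Link": "Link1"},
    {"Car": "Toyota",   "Speed": "",        "PayLoad": "",            "Link": "Link2"},
    {"Car": "Toyota",   "Speed": "",        "PayLoad": "",            "Link": "Link3"},
]

merged = {}
for row in rows:
    rec = merged.setdefault(row["Car"], {"Speed": "", "PayLoad": "", "Links": []})
    rec["Speed"] = rec["Speed"] or row["Speed"]        # keep first non-empty value
    rec["PayLoad"] = rec["PayLoad"] or row["PayLoad"]
    rec["Links"].append(row["Link"])                   # collect all image links

print(merged["Mercedes"])
# {'Speed': '200 mph', 'PayLoad': '2200 pounds', 'Links': ['Link1', 'Link2', 'Link3']}
```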

On top of that, it is easy to install on many different operating systems: Open Refine is compatible with Mac, Windows, and Linux.

Open Refine enables users to group together all identical cells. Using a text facet, a user may change one of these identical cells and the change will be applied to the other cells. It also provides clustering to organize such groups based on special characteristics; using a cluster allows users to further control these groups, for example by sorting, replacing, or formatting. At any time, a user may undo any operation and the changes are displayed instantly. Using Open Refine, a user may import any ontology to be used. Open Refine also allows SPARQL endpoint queries to be applied to the dataset and lets users choose what type of reconciliation to use. A user is not required to write any code; instead, a user may choose the vocabularies to be applied to a table from a list. Users are provided with a graph that illustrates the structural model of a table before the mapping starts.

However, in the latest version of Open Refine, while we were mapping a CSV file to the RDF data model, it was unable to finish the reconciliation step completely. Without finishing the reconciliation step, no RDF triples are generated. We tried many different datasets on different machines and repeatedly received the same outcome. As a result, we did not use Open Refine in our project.

3.1.8 Karma Integration Tool

Karma is an information integration tool that enables users to integrate data faster and more easily from different data sources such as databases, delimited files, spreadsheets, XML, KML, JSON, and Application Programming Interfaces (APIs). Users integrate data based on their choice of ontologies. One of the best features of Karma is that it learns the mapping from data to classes and proposes a model that ties these classes together: Karma generates models based on what it has learned from users' previous mappings [79]. A user may simply adjust and normalize the suggested model. Once the model is complete, a user may publish the integrated data as RDF or store it in a dataset. Karma provides many valuable features, which we discuss one by one:

1. Ease of use: Karma offers a very simple user interface, which allows users to easily perform all the desired tasks. Karma uses programming-by-example to learn the mapping models and algorithm optimization to automate the process as much as possible. This is one of the most valuable features for users, since other tools require users to go through the columns one by one to set the relationships between them. Using Karma, a user has to map the columns the first time only; afterward, Karma proposes to users what it has learned. This learning technique saves a lot of time and effort and prevents errors that might occur on the user's side.

2. Hierarchical sources: We have discussed many tools that map databases to RDF. Karma is the first tool that supports hierarchical data sources such as XML, JSON, and KML. This feature makes Karma a unique tool.

3. Web APIs: Karma supports importing data both from static sources such as databases and files and from APIs that expose thousands of data sources.

4. Semantic models: Karma provides a few ontologies to use by default. In addition, Karma allows a user to upload any ontology; it recognizes ontologies of different types and file extensions.

5. Scalable processing: A user may work on a subset of a table to define the desired RDF model. During this process, Karma learns the structure of the user's model; after the user imports a larger dataset, Karma proposes a modeling structure.

6. Data transformation: Karma enables users to transform data expressed in various formats into a common format. Karma is a capable tool when it comes to mapping CSV tables into RDF triples.

7. Mapping visualization: This is another unique feature. Karma displays a simple visualization graph, which makes it easy to read the relationships between the columns.

8. Editing a dataset: After importing data into Karma, a user may easily delete, add, swap, move, and change the values of columns and cells.

Karma provides semantic labeling, which is the process of mapping columns in a database to the classes in an ontology [81]. Semantic labeling is a very challenging task when dealing with heterogeneous data types and a variety of data formats. Other techniques use machine learning to extract features tied to the data of one domain, which means the model has to be re-trained for each new domain, whereas Karma uses machine learning together with similarity metrics to return correct semantic labels for data more quickly. Karma assigns 2 gigabytes of space to users, and a user may increase this space up to 16 gigabytes. All these features can be applied through a simple and easy user interface. Therefore, based on all the aforementioned features and powerful services, we decided to use Karma in our project for mapping databases to RDF models.

3.2 Enriching a Dataset with Extra Triples Based on the Existing Ones

This is the heart of our project, considering that there are no tools or web services that add extra triples to a dataset based on the semantic similarity between the existing triples. For this purpose, we utilize the ConceptNet web service, which provides related terms for a word together with a weight for each edge that we can use in our calculations. In this section, we briefly introduce the most common and powerful semantic networks and the reasons why we chose ConceptNet.

3.2.1 DBpedia

DBpedia is a project available on the World Wide Web (WWW) that extracts knowledge and facts from Wikipedia. DBpedia is one of the largest stores of semantic relationships on the web. It allows its users to semantically query the information extracted from Wikipedia. DBpedia is known as one of the most famous linked data networks, as Tim Berners-Lee described it. Its articles are based on the infobox, which is a structured box of information generated from Wikipedia [14]. DBpedia contains almost 4.58 million entities. Of these, 4.22 million entities were classified under consistent ontologies. These entities include persons, places, games, films, albums, organizations, and many other aspects in more than 125 different languages [10]. Moreover, DBpedia uses SPARQL to query Wikipedia factual information. For instance, let us say a user was interested in the Japanese shōjo manga series Tokyo Mew Mew and wanted to get some information about it. With a simple query, DBpedia allows a user to combine information about Tokyo Mew Mew from Wikipedia, Figure 18.

Figure 18: DBpedia semantic query [5]

This query lists the related genres of Tokyo Mew Mew from Wikipedia. In addition, the DBpedia dataset is interlinked with various open datasets on the internet to enrich DBpedia's knowledge. There are more than 45 interlinks between DBpedia and external datasets, including OpenCyc, Freebase, the CIA World Factbook, GeoNames, Bio2RDF, and MusicBrainz. These interlinks provide a substantial amount of information and facts, and this combination of datasets makes DBpedia a powerful source of semantic relationships between different entities.

Figure 19: DBpedia results for running ”Lemon” [5]

Despite all of the powerful features DBpedia provides, most of DBpedia's information is based on Wikipedia. Wikipedia has been criticized for presenting truths, half-truths, and falsehoods. On top of that, Wikipedia's articles can be edited by any user with an internet connection, which leaves many Wikipedia topics open to spin and manipulation [75]. Therefore, any DBpedia facts that were extracted from inaccurate information are themselves questionable. Furthermore, the results of DBpedia are messy, as it returns every page in which the word was mentioned. A returned result could be a synonym, a related word, a meaning in a different language, or just a random article in which the concept was mentioned. For instance, we ran the word ”Lemon” on DBpedia and it returned the results shown in Figure 19.

These results are considered the terms and synonyms most related to the word ”Lemon”, yet a user could barely understand in what context these results were mentioned. Hence, for the reasons mentioned above, we did not use DBpedia in our project for comparing the semantic similarity between concepts.

Figure 20: Dandelion semantic results for comparing ”Lemon” and ”Lime” [3]

3.2.2 Dandelion

Dandelion is a web service that performs different text-processing operations, including text extraction, text similarity, text classification, and sentiment analysis [3]. In our project, we are interested in text similarity and related terms. Hence, we ran a few test cases to see how Dandelion behaves. The first test case compared the two related terms ”Lemon” and ”Lime” (see Figure 20). Dandelion returned a score of 0, which means that there is no semantic similarity at all between these terms. Similarly, we ran many other pairs of similar terms, such as ”Flower” and ”Rose”; in all of these experiments, Dandelion returned 0.

Dandelion cannot be used to compare single terms; instead, it is meant to compare phrases. Consequently, we ran the two related phrases ”Lemon is healthy” and ”I do not like lime”. The returned result was again 0. We ran many other semantically related sentences and got the same result each time, Figure 21.

As illustrated, Dandelion does not provide accurate results unless the exact term is repeated in both texts. Therefore, we did not use Dandelion services in our project.

Figure 21: Dandelion results for comparing two phrases [3]

3.2.3 ParallelDots

ParallelDots is another web service that uses Artificial Intelligence (AI) to analyze texts and detect the semantic similarity between two given texts. ParallelDots is very similar to Dandelion in its functionality and its relevance to our project. ParallelDots analyzes the relatedness between texts by eliminating redundancy; it can be helpful for publishers, bloggers, researchers, and engineers who want to check the relatedness between two texts. ParallelDots compares both the structure and the meaning of the texts [7], and it also extracts similar sentences and similar ideas from the corpus. ParallelDots returns a score from 0 to 5, where 0 means no similarity at all and 5 means the texts are almost the same. ParallelDots is an effective tool when running sentences and phrases, as shown in Figure 22.

However, ParallelDots does not support comparing two single terms. Figure 23 illustrates how it returns 0 when comparing two terms.

Figure 22: Results of ParallelDots for comparing two phrases [7]

Figure 23: Results of ParallelDots when comparing two terms [7]

3.2.4 WordNet

WordNet is a lexical database of the English language that presents short definitions, examples of how terms are used, and the relations among synsets, Figure 24. A synset is the group of all the synonyms that WordNet records for a term. WordNet can be considered a dictionary that provides extra features for its users. WordNet includes a lexical dictionary that contains nouns, verbs, adverbs, and adjectives but ignores the remaining function words, including prepositions [12]. The knowledge structure of WordNet organizes both nouns and verbs into hierarchies defined by relationships such as ”Is A”. For instance, the word ”dog” is represented by a hierarchy in which each level represents a different synset, and each synset has a unique index, as shown in Figure 24.

Figure 24: WordNet hierarchy for nouns and verbs [11]

Each hierarchy contains 25 levels for nouns and 15 levels for verbs; adjectives and adverbs are ordered in the subsequent levels. The main goal of WordNet is to build a lexical database that is consistent with human semantic theories. WordNet is known for its relationships between concepts; therefore, it is often used as a lexical ontology in the Computer Science field. As an ontology, WordNet provides users with nouns, verbs, and the relationships between them. This feature saves a significant amount of time and effort for users when building new ontologies.

Despite all of the powerful features that WordNet provides, there are still some limitations. For instance, WordNet does not include etymology in its data, and it does not provide much information on usage. Moreover, the goal of WordNet is to cover everyday English concepts without including much domain-specific terminology.

WordNet provides lexical comparison of English words and is limited to dictionary words. Hence, it does not cover brand names, organizations, foods, places, and other such concepts. The purpose of our project is to enhance FEW databases that might contain information on organization names, countries, food ingredients, etc. In this case, WordNet will not be sufficient for enriching such databases. The test cases in Figures 25 and 26 show how WordNet is effective when using dictionary words and how poorly it performs on other general terms.

Figure 25: Results returned from WordNet [76]

Figure 26: WordNet results when running a general term [76]

Based on Figure 25, we can see that WordNet provides reliable information when running a dictionary term like ”dog”. But when a user runs a general name, such as the car brand ”Ferrari”, WordNet does not provide any information, Figure 26.

Even when the user adds another dictionary term, ”car”, to the name ”Ferrari”, WordNet still does not provide any information; see Figure 27.

Therefore, based on the previous examples, we did not use WordNet to compare the semantic relationships between concepts in our datasets.

Figure 27: WordNet result when running a phrase that contains a dictionary-based term [76]

3.2.5 Document Similarity

Document Similarity (DS) is another web service that provides the semantic similarity between two concepts or two texts. DS is powerful when running related concepts, since its calculations are based on advanced Natural Language Processing (NLP) [26]. NLP is a field of Computer Science (CS) that draws on artificial intelligence and computational linguistics to improve the interaction between humans and computers [89]. NLP provides powerful approaches for extracting concepts from a text and for comparing two different texts. One of its best features is natural understanding of texts on the computer's side, based on semantic analysis of the texts. This allows NLP systems to uncover powerful semantic relationships between concepts. On top of that, NLP offers many other semantic capabilities, such as machine translation, Named Entity Recognition (NER), natural language generation, textual entailment recognition, relationship extraction, and semantic analysis. Based on these features, NLP is known as one of the best resources for semantic operations [16].

DS is one of the services built on NLP, and its results are accurate and reliable. The only reason we did not use DS in our project is that it does not provide users with similar concepts; that is offered as a separate service. Therefore, to check both the similarity between two concepts and the related concepts, we would have to merge these separate services. Instead, we chose to utilize ConceptNet, which provides all of the mentioned services combined.

3.2.6 ConceptNet

ConceptNet is a semantic network that provides reliable information and facts about entities and concepts [44, 56]. ConceptNet uses transitive inference between ideas and concepts, which lets even dissimilar entities share indirect relationships [79]. For instance, when running an entity like ”Lemon” on ConceptNet, it provides the weight of the entity and hundreds of synonyms, related entities, similar words in different languages, types of the entity, properties of the entity, and much other information. ConceptNet also displays the origin of the entity and where it was derived from; this feature interlinks ConceptNet with many other semantic networks that enhance and enrich its knowledge. The weights used in ConceptNet are based on a Markov model, so they are accurate and reliable weights that can be used in our calculations. Another important feature is that ConceptNet provides the conditional probabilities for changing an edge type. For instance, the relationship ”is a” is an edge for a concept X with a probability of 0.13; reversing this relationship to ”has a” changes the probability to 0.5, which shows that reversing a relationship affects the weight of a particular entity [94, 96], see Figure 28.

The latest version of ConceptNet combines data from different resources to improve its state of knowledge. These resources include:

Figure 28: ConceptNet example for a relationship [96]

1. ConceptNet 5.

2. Trusted information and facts from DBpedia.

3. Wiktionary, which provides synonyms, antonyms, translations of terms into hundreds of languages, and multiple labeled word senses; much of ConceptNet's knowledge comes from this resource.

4. Open Multilingual WordNet, which provides dictionary-style knowledge.

5. UMBEL, which connects ConceptNet to the OpenCyc ontology.

6. Knowledge of people's intuitive word associations.

As we can see, ConceptNet delivers trusted knowledge built on the most common networks, such as Wiktionary, DBpedia, and WordNet. Therefore, ConceptNet is the best service for our project due to its trustworthy information. Furthermore, ConceptNet provides information on dictionary-based terms as well as other concepts, including names, organizations, persons, etc. On top of that, ConceptNet accepts both single terms and phrases for analysis.

3.3 Machine Learning and Embedding Models

Computers do not understand text or the relations between words and sentences, so words need to be represented as numbers, which is what computers understand. Such vectors can be used in many applications, such as question answering systems, recommendation systems, sentiment analysis, and text classification, and they also make it easier to search, return synonyms, etc.
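To make the idea concrete, here is a minimal sketch of comparing word vectors with cosine similarity; the three-dimensional vectors below are invented toy values, not real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings; real models such as AGROVEC use ~300 dimensions,
# and these particular numbers are invented for illustration.
embeddings = {
    "corn":  [0.9, 0.1, 0.2],
    "maize": [0.8, 0.2, 0.1],
    "fire":  [0.0, 0.9, 0.1],
}

# Related words point in similar directions, so their cosine is close to 1.
print(cosine_similarity(embeddings["corn"], embeddings["maize"]))
print(cosine_similarity(embeddings["corn"], embeddings["fire"]))
```

Because only the angle between vectors matters, the score is independent of vector length, which makes it a convenient relatedness measure for embeddings.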

3.4 Federated Learning

There are two major types of training algorithms in FL infrastructures: synchronous and asynchronous. While the asynchronous algorithm was implemented earlier and saw much successful work [27], the authors of [38, 92] recently changed the trend towards synchronous batch training. The authors of [61] proposed the FederatedAveraging (FedAvg) algorithm, which has shown huge success in the field, yet it still has some limitations, such as dropping all the devices that fail to finish a specific number of epochs within a specific amount of time [55]. In general, FedAvg provides a way to filter which nodes (data holders) are included in the aggregation. There is a list of requirements that each client has to fulfill in order to be included in the current round, such as having enough phone charge, having a stable internet connection, and the phone not being heavily used. These requirements ensure that the process will not affect the user's normal usage. Furthermore, the server also selects phones with good data and good internet connections to avoid drop-out issues, such as a phone that cannot be reached.
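As a minimal sketch of the FederatedAveraging aggregation step described above (not the authors' implementation), the server can combine client updates as a weighted average, where each client's weight is proportional to its number of local training examples:

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client model parameters (the FedAvg aggregation step).

    client_weights: list of parameter vectors (one flat list per client)
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for params, n_k in zip(client_weights, client_sizes):
        for i, w in enumerate(params):
            # Each client contributes in proportion to its data share n_k / total.
            averaged[i] += (n_k / total) * w
    return averaged

# Two clients: the second holds 3x more data, so it dominates the average.
global_model = federated_average([[1.0, 0.0], [2.0, 4.0]], [100, 300])
print(global_model)  # [1.75, 3.0]
```

Client selection (charge level, connectivity, idle state) happens before this step; only the clients that pass the filter contribute their parameters to the average.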

The authors of [73] have proposed a decentralized framework for training models while preserving their privacy. Their proposed protocol, FedCS, helps solve the client selection problem by managing clients based on their resources. FedCS is another protocol that allows client selection before the aggregation phase; for example, clients with a poor internet connection have to be managed in another way. However, such protocols do not select clients based on model accuracy tested locally; rather, they select and group the models that should be included in the current round before the training starts.

CHAPTER 4

APPROACH

In this chapter, we present our method for enriching an RDF triple dataset with new triples related to the existing ones, based on the relationships of the existing triples.

Our project is separated into two parts. The first part converts databases (e.g., in CSV format) into RDF triples. The second part, which is the main part, enhances the converted RDF triples by adding extra triples that enrich the original contents in order to provide stronger and more reliable features.

4.1 Overview

We have developed a program and a new approach to enhance the contents of a database based on its existing contents. Our approach can be separated into two parts.

The first part converts a database table (CSV format) into RDF triples. For this step, we used the Karma tool, since it is a capable tool that delivers helpful services for integrating and converting data from various formats, such as XML, JSON, and KML, into an RDF model. One of Karma's best features is that it supports operations such as converting databases into RDF triples in a fast and easy way using a simple interface. We used Karma for the conversion, but due to the lack of existing ontologies, we developed a new ontology called the FEW ontology. The FEW ontology provides the most common terms used in FEW databases, based on our observations.

The second part of the project is to enrich the resulting RDF dataset with new, related triples. The resulting RDF triples contain exactly the same data as the original database, only in a different format. Entity extraction has to be done first; then, based on the semantic comparison between these entities, new triples are added containing the entities most related to the existing ones.

4.2 Converting a Database Table into RDF Triples

In order to enrich a database table with new information and facts, the database must first be converted to an RDF model, which is not a trivial task. After many experiments, we found the Karma integration tool to be the most efficient tool for this project because of its reliable and accurate results. Mapping a database to an RDF model requires a structure and an ontology, so that the contents of the database are mapped to the RDF model based on the given ontology. In this project, we focus on food, water, and energy data, and for these systems there are not enough existing ontologies. As a consequence, we developed a new ontology that contains the most common terms and concepts used in FEW data. When mapping FEW databases to an RDF model, the FEW ontology was used to specify the desired relationships in the generated RDF triples. Figure 29 shows a few ”Production” relationships that were generated based on the FEW ontology, which has the URI ”http://umkc.edu/ontology/FEW/”.

There was no need to create new classes in our FEW ontology, since we can import other ontologies, such as DBpedia, to reuse their pre-existing classes and relationships, Figure 29.

Figure 29: FEW ontology while using the Karma tool

Using Karma, a user needs to provide the general URI that will be used as the common URI for all the subjects in that particular RDF model. The provided URI describes the location and the primary key of a table. For instance, from the following URI we can understand that it describes the ”bread” row in the ”example” database that exists at ”http://catalog.data.gov/restaurant”, listing 7.
""

Listing 7: Data after extraction
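Such a subject URI can be assembled from its parts as sketched below; the helper name is hypothetical, and the values simply follow the ”bread”/”example” illustration in the text:

```python
def make_subject_uri(base_uri, table, primary_key):
    """Build a common subject URI from a base location, table name, and row key."""
    return f"{base_uri}/{table}/{primary_key}"

# Illustrative values based on the example in the text.
uri = make_subject_uri("http://catalog.data.gov/restaurant", "example", "bread")
print(uri)  # http://catalog.data.gov/restaurant/example/bread
```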

4.3 Enriching RDF Data Triples

Adding extra triples to a dataset based on the existing triples is a complex task, especially when using the semantic similarity between the concepts to validate the related information and facts. This phase of the project contains many levels. In this section, we discuss these levels one by one to make them easier to understand.

4.4 Choosing the Target Triples

Adding extra information and triples to a dataset cannot be done in an unorganized manner. Therefore, in the first step, a number of triples are taken as target triples so that a comparison can be made between them. Then, based on the comparison between the target triples, extra information is added, and the target triples together with the new triples are stored in a file. To achieve the most accurate results when comparing triples semantically, two triples are compared at a time. In particular, the whole target triples are not compared to each other; only the extracted subjects are. We have developed code to extract the concept from each triple by eliminating the part of the subject that is shared with the rest of the subjects. Listing 8 shows the data after extracting the concepts.
"20/8/2017" . "100" . "250" . "13/8/2017" . "20/8/2017" . "100" . "302" . "13/8/2017" .

Listing 8: Data after extraction

At this point, the data is ready for the next step of processing.
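The concept-extraction step described above can be sketched as follows; the subject URIs here are hypothetical examples, not the actual subjects of Listing 8:

```python
from os.path import commonprefix

def extract_concepts(subject_uris):
    """Strip the prefix shared by all subject URIs, leaving the distinguishing concept.

    This mirrors the idea described above: the part common to every subject
    is eliminated, and only the concept term remains.
    """
    prefix = commonprefix(subject_uris)
    return [uri[len(prefix):] for uri in subject_uris]

# Hypothetical subject URIs used only to illustrate the prefix stripping.
subjects = [
    "http://catalog.data.gov/restaurant/example/corn",
    "http://catalog.data.gov/restaurant/example/flower",
]
print(extract_concepts(subjects))  # ['corn', 'flower']
```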

4.5 Running Entities on ConceptNet

In this section, we discuss our approach for calculating an accurate semantic similarity score between any two given concepts. As mentioned before, the target triples are the first two triples in a dataset, and two entities are extracted from these triples. Even though both of the extracted entities are subjects, we will call the first entity the subject and the second entity the object in order to distinguish between them.

First, our program runs the subject (the first entity) on ConceptNet. This returns hundreds of related concepts and synonyms together with their relationship weights. The essential part of our program is that it then analyzes the returned concepts, searching for the object. Once the object occurs among the returned entities, the program stops searching and calculates the semantic similarity between the subject and the object as the average over the path in the searching tree. The maximum depth of the searching tree is determined by the user. If the search reaches the maximum level without finding the object, the search stops and the similarity from the subject to the object is 0.

The next step in our algorithm runs the object on ConceptNet and searches for the subject within the returned entities. If the subject is found before the searching tree reaches its maximum level, the program stops searching and calculates the semantic similarity score based on the number of levels in the searching tree. In contrast, if the searching tree reaches the maximum number of levels without finding the subject, then no relationship is established between the object and the subject.

Figure 30: First level of the searching tree

Hence, there is no relationship at all between the subject and the object.
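The search procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the original code: it runs over a local dictionary of ConceptNet-style related-term weights instead of live API calls, and the graph contents are invented (apart from the corn/maize edge weight of 4.95 mentioned later in this chapter):

```python
from collections import deque

def similarity(graph, subject, obj, max_level=3):
    """Breadth-first search from `subject` for `obj` over weighted related terms.

    Returns the sum of edge weights along the discovered path divided by the
    number of levels traversed, or 0 if `obj` is not found within `max_level`.
    """
    # Each queue entry: (term, accumulated weight, levels traversed so far).
    queue = deque([(subject, 0.0, 0)])
    seen = {subject}
    while queue:
        term, acc, level = queue.popleft()
        if level == max_level:
            continue  # do not expand beyond the user-chosen maximum depth
        for related, weight in graph.get(term, []):
            if related == obj:
                return (acc + weight) / (level + 1)
            if related not in seen:
                seen.add(related)
                queue.append((related, acc + weight, level + 1))
    return 0.0

# Invented weights, loosely following the examples in the text.
graph = {
    "corn": [("maize", 4.95), ("crop", 3.1), ("food", 2.8)],
    "flower": [("rose", 2.0), ("plant", 1.9)],
    "plant": [("green", 3.27)],
}
print(similarity(graph, "corn", "maize"))   # 4.95 (found at the first level)
print(similarity(graph, "flower", "fire"))  # 0.0  (no path within max_level)
```

In the full system this search is run in both directions (subject to object, then object to subject), as described above.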

4.6 Levels of Searching Tree

Whenever an entity is run on ConceptNet, many related concepts are returned. These related concepts are considered to be in the first level of the searching tree. For instance, running ”Corn” returns many concepts, such as maize, crop, and food, together with their weights; Figure 30 shows the first five terms related to ”Corn”. If one of these related terms is the object, then the similarity is taken directly from the weight of the edge between the two concepts. If the object we are looking for is ”maize,” the search stops and returns 4.95 as the semantic relationship between ”corn” and ”maize”.

Otherwise, if the program does not find any concept matching the object, it runs each of the related words as a new subject, since there might be an indirect rather than a direct relationship, Figure 31. For example, if the subject is ”Flower” and the object is ”Green”, then ”Green” is not found in the first level. The program runs each related term as a subject again, starting from ”Rose”. This technique builds a large tree in order to search all the concepts that might be related to the object.

Figure 31: Third level of the searching tree

Figure 32: The searching tree for the term ”Flower”

In our example, we see that ”Green” is found in the third level of the tree, Figure 32. The semantic similarity is therefore calculated by adding all the weights and dividing the total by the number of levels. The semantic similarity for this example is 5.17.

Lastly, if there is no relation at all between the subject and the object, then the similarity is 0 and no extra triples are added, Figure 33. The example illustrates two entities, ”Flower” and ”Fire”, that have no relationship between them no matter how deep the tree grows.

Figure 33: The searching tree for the term ”Fire”

4.7 Adding the Extra Triples

If there is a semantic similarity between the entities after running them on ConceptNet, we add a new triple after the target triples containing the similarity between the subject and the object, along with the relationship between them. For example, suppose we have the triples in listing 9:
"crop" . "plant" . "park" . "tree" . "park" . "smoke" .

Listing 9: Data after extraction

After our program runs, the output triples are as in listing 10:
"crop" . "plant" . "maize" 4.95
"park" . "tree" . "flower" 5.38 "park" . "smoke" .

Listing 10: Output triples

Two triples were added to this dataset based on the relationships between ”corn” and ”maize” and between ”flower” and ”green”, together with the semantic similarity of these new relationships. Since each quad contains only four parts, we had to add blank nodes to store each newly generated triple along with the URI of ConceptNet. For example, the triple containing ”maize” is converted using a hash function into an eight-digit number, e.g., 12345678, followed by extra information to form the final quads, listing 11.
_:12345678 "4.95" . _:12345678 "http://conceptnet.io/c/en/corn" < Food>.

Listing 11: Blank nodes

4.8 Architecture

In this section, we present our system architecture, Figure 34, which illustrates the path of the data: starting from the raw data on USDA; then to Karma, which converts these databases into an RDF model; then to ConceptNet, which provides related entities; then through our relationship generator; and finally through the hash function that adds the quads for the final output.

Figure 34: FoodKG system architecture.

CHAPTER 5

IMPLEMENTATION

5.1 Implementation

We implemented our program in the Python programming language with the help of libraries such as Requests and hashlib. The Requests library was used to send requests to ConceptNet for the terms being processed, whereas the hashlib package was used to create a hash value for the extra triple, since each quad contains only four parts including the context. The newly generated quad contains the subject, predicate, and object as a single hash value, then the relationship followed by the score, and the context. Using hashlib, we convert the first three parts into a hash number and create a blank node from it. The hash value is now the first part of the triple, the relationship is the second value, the similarity score is the third value, and the context is the last value. To add the page link of the node with the highest similarity, we add another quad that contains the same hash value as the first part, the relationship, then the URL link, and the context. The triples in listing 12 illustrate the hash value implementation.
" object1" . " object2" . " example2" < http://umkc.edu/context> . _:12345678 "4.93450204" . _:12345678 "http://conceptnet.io/c/en/a_Term" . " example2"

Listing 12: Blank nodes

_:12345678 is the hash value (blank node ID) that stands for the triple shown in listing 13: " example2"

Listing 13: Blank nodes
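The hashing step can be sketched as follows. The exact hashlib formula is not given in the text, so deriving the eight-digit ID from an MD5 digest here is an assumption made for illustration:

```python
import hashlib

def blank_node_id(subject, predicate, obj):
    """Hash the first three parts of a triple into an eight-digit blank-node ID.

    The eight-digit scheme follows the description above (e.g., _:12345678);
    reducing hashlib's MD5 digest modulo 10**8 is an illustrative assumption,
    not the dissertation's exact formula.
    """
    digest = hashlib.md5(f"{subject} {predicate} {obj}".encode("utf-8")).hexdigest()
    return "_:" + str(int(digest, 16) % 10**8).zfill(8)

# The blank node can then carry the score and ConceptNet link as extra quads.
node = blank_node_id("corn", "relatedTo", "maize")
print(node)  # "_:" followed by eight digits, stable for the same triple
```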

5.2 FoodKG Implementation

We present a domain-specific tool, FoodKG, that addresses the problem of repeated, unused, and missing concepts in knowledge graphs and enriches the existing knowledge by adding semantically related domain-specific entities, relations, images, and semantic similarity values between the entities. We utilize AGROVEC, a graph embedding model, to calculate the semantic similarity between two entities, retrieve the most similar entities, and classify entities under a set of predefined classes. AGROVEC computes the semantic similarity scores as the cosine similarity of the given vectors. The triple that holds the semantic score is encoded as a blank node, where the subject is the hash of the original triple, the relation remains the same, and the object is the actual semantic score.

FoodKG parses and processes all the subjects and objects within the provided knowledge graph. For each subject, a request is made to WordNet to fetch its offset number. WordNet is a lexical database for the English language that groups words into sets of synonyms called synsets, each with a corresponding ID (offset). FoodKG requires these offset numbers to obtain related images from ImageNet, since the images on ImageNet are organized and classified based on the WordNet offsets. ImageNet is one of the largest image repositories on the Internet, and it contains images for almost all known classes [23]. These images are added to the provided graph in the form of triples, where the subject is the original word, the predicate is ”#ImgURLs”, and the object is a Web URL that links to the images returned from ImageNet. Figure 1 depicts the FoodKG system architecture.
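The offset lookup can be illustrated as follows: a WordNet offset is zero-padded to eight digits and prefixed with the part-of-speech letter (n for nouns) to form the synset ID that ImageNet images are organized under. The query URL below is a hypothetical placeholder rather than ImageNet's actual API:

```python
def wordnet_id(offset, pos="n"):
    """Form a WordNet/ImageNet synset ID: POS letter + zero-padded 8-digit offset."""
    return f"{pos}{offset:08d}"

# 2084071 is the WordNet 3.0 noun offset for "dog", used here as an example.
wnid = wordnet_id(2084071)
print(wnid)  # n02084071

# A hypothetical image-URL triple of the kind FoodKG adds to the graph.
triple = ("dog", "#ImgURLs", f"http://image-net.org/synset?wnid={wnid}")
print(triple)
```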

Before discussing the AGROVEC vectors and the reasons behind using a graph embedding model, we first discuss what embeddings are and how they work.

5.2.1 AGROVEC

AGROVEC is a domain-specific embedding model built on GEMSEC, a graph embedding algorithm, which we retrained and fine-tuned on AGROVOC to produce a domain-specific embedding model. The embedding visualization (using t-SNE [58]) of our clustered embeddings is depicted in Figure 2. AGROVEC has the advantage of clustering compared to other models. AGROVEC was trained with 300-dimensional vectors, clustering the dataset into 10 clusters. The gamma value used is 0.01, the number of random walks is 6 with a window size of 6, and we started with the default initial learning rate of 0.001. AGROVEC was trained on the AGROVOC dataset, which contains over 6 million triples, to construct the embedding.

5.2.2 Entity Extraction

FoodKG provides several features; entity extraction is one of the most important. Users start by uploading their graphs to FoodKG. Most of the provided graphs contain the same repeated concepts and terms named differently (e.g., id, ID, id num, etc.), where all of them represent the same entity; other terms use abbreviations, numbers, or short forms (acronyms) [88]. Similar entities with different names create many repetitions and make it a challenge to merge and search different graphs or to ingest them in machine learning or linked data. To overcome this issue, we run NLP techniques such as POS tagging, chunking, and the Stanford Parser over all the provided subjects to extract the meaningful classes and terms that are used in the next stage. For example, the subjects ”CHEESE,COTTAGE,S-,W/FRU” and ”BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD” are represented in FoodKG as ”Cheese, Cottage” and ”Butter” respectively. Users have the option of providing the context of their graphs or leaving it empty.
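As a toy illustration of this cleanup (FoodKG's actual pipeline uses POS tagging, chunking, and the Stanford Parser; the fixed abbreviation list below is a stand-in invented for the example):

```python
# Stand-in set of abbreviation/measurement tokens; the real system identifies
# these with NLP rather than a hard-coded list.
ABBREV = {"S-", "W/FRU", "PDR", "PREP", "W/1/1.HYD", "1.5OZ"}

def clean_subject(raw):
    """Drop abbreviation/measurement tokens and title-case the remaining words."""
    tokens = [t for t in raw.split(",") if t and t not in ABBREV]
    return ", ".join(t.capitalize() for t in tokens)

print(clean_subject("CHEESE,COTTAGE,S-,W/FRU"))          # Cheese, Cottage
print(clean_subject("BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD"))  # Butter
```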

5.2.3 Text Classification

Nowadays, there are many different models to classify a given text to a set of tags or classes such as the Long Short Term Memory (LSTM) network [86]. Nevertheless, text classification is still a challenge when it comes to classifying a single word without a context such as ”apple” since the context is broad, and the word could refer to many

70 different things other than apple the fruit. Therefore, few solutions have been proposed, such as referring to fruits with small letters ”apple” and capital letters for brand names like

”Apple” the corporation. However, such a technique does not seem to be working well on large scale contexts. Besides, this technique does not work in all domains since domain- specific graphs may not include all different contexts for a given word. Therefore, for each domain-specific area, there should be a knowledge graph that researchers and scientists can use in their experiments. At this point, our tool, FoodKG, becomes helpful to build and enrich such knowledge graphs and classify words to a specific class. FoodKG uses

AGROVEC to help provide the context in such scenarios. We use a simple yet effective technique, with the help of the ConceptNet API, to accomplish this task [95]. The idea is to start with a set of predefined classes; for example, let us consider only two classes for now, "fruits" and "animals". After running these classes through ConceptNet, we store all the returned top related concepts with the relation "type of". Examples of the returned words for "fruit" are pineapple, mango, grapes, plums, and berry; the returned words for "animals" are lion, fish, dolphin, fox, pet, and deer. We take only the top 10 instances from each category to limit the time complexity of our algorithm. Then, using the AGROVEC embeddings, we calculate the semantic similarity score between the given word and all the words from each class and compute the average per class (Algorithm 1). Based on the highest average score, we choose the category of the given word. This technique proved to be the most reliable for classifying a category using word embeddings. The algorithm's time complexity is O(N), where N is the number of classes we started with. As an example, AGROVEC predicted the class "Food" for the concept "brown rice", the "Energy" class for the concept "radiation", and "Water" for the concept "hail".

Algorithm 1 Text Classification using AGROVEC and ConceptNet

Input: Target, Cat = {A_1 = {word_1, ..., word_10}, ..., A_N = {word_1, ..., word_10}}
Output: The predicted class for Target

1: function LOOP(Cat[], Target)
2:     prediction ← nil
3:     highestAvg ← 0
4:     N ← length(Cat)
5:     for i ← 1 to N do
6:         total, Avg ← 0
7:         K ← length(A_i)
8:         for j ← 1 to K do
9:             total ← total + cosineSimilarity(Target, A_i[j])
10:        end for
11:        Avg ← total / K
12:        if Avg > highestAvg then
13:            highestAvg ← Avg
14:            prediction ← A_i
15:        end if
16:    end for
17:    return prediction
18: end function
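Algorithm 1 can be sketched in Python as follows; the two-dimensional vectors below stand in for real AGROVEC embeddings, and the category instances are toy placeholders rather than actual ConceptNet results:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(target_vec, categories):
    """Return the category whose instance words are, on average,
    closest to the target vector (the idea of Algorithm 1)."""
    prediction, highest_avg = None, 0.0
    for name, word_vecs in categories.items():
        avg = sum(cosine_similarity(target_vec, w) for w in word_vecs) / len(word_vecs)
        if avg > highest_avg:
            highest_avg, prediction = avg, name
    return prediction

# Toy 2-d "embeddings" in place of AGROVEC vectors (illustrative only).
categories = {
    "fruits":  [[0.9, 0.1], [0.8, 0.2]],   # e.g. pineapple, mango
    "animals": [[0.1, 0.9], [0.2, 0.8]],   # e.g. lion, fish
}
print(classify([0.85, 0.15], categories))  # fruits
```

The loop visits each of the N categories once and each of its K instances once, matching the O(N) complexity stated above (K is fixed at 10).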

5.2.4 Semantic Similarity

Measuring the semantic similarity between two terms or concepts has gained much interest due to its importance in different applications such as intelligent graphs, knowledge retrieval systems, and similarity between Web documents [45]. Several semantic similarity measures have been developed and used depending on the purpose [60]. In this work, we adopt the cosine similarity measure to quantify the similarity between two vectors. FoodKG computes the semantic similarity between the different subjects in a given graph. The similarity scores are attached as blank nodes to the original triple, where the subject is the hashed blank node ID, the relation is "#semantic_similarity", and the object is the similarity score. These similarity scores can be used in recommendation systems, question answering, or future NLP models. FoodKG relies on the AGROVEC embedding model to generate the similarity scores. Table 8 shows the semantic similarity scores generated by AGROVEC and other models; this example was taken from the AGROVOC dataset to show how the AGROVEC model ranks these pairs better than the other models. Table 9 shows the top five related words for "Food", Table 10 for "Energy", and Table 11 for "Water".
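As a minimal sketch of how a similarity score could be attached through a hashed blank node, assuming a placeholder namespace (example.org) in place of FoodKG's actual URIs:

```python
import hashlib

# Hypothetical namespace for illustration; FoodKG's real URIs differ.
SIM_PREDICATE = "http://example.org/foodkg#semantic_similarity"

def similarity_triple(subject_a: str, subject_b: str, score: float) -> str:
    """One N-Triples line: a hashed blank node as the subject, the
    #semantic_similarity relation, and the score as the object."""
    digest = hashlib.md5((subject_a + "|" + subject_b).encode()).hexdigest()
    bnode = "_:" + digest[:12]  # hashed blank node ID
    return f'{bnode} <{SIM_PREDICATE}> "{score}" .'

print(similarity_triple("http://example.org/wheat",
                        "http://example.org/wheat_flour", 0.757))
```

Hashing the subject pair keeps the blank node ID stable across runs, so repeated enrichment of the same graph does not create duplicate score nodes.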

5.2.5 Scientific Terms

Researchers and data experts often use domain-specific terms and concepts that may not be commonly used. For instance, Triticum, Malus, and Fragaria are the scientific names for wheat, apples, and strawberries, respectively. However, such

Table 8: An example of how each model ranks the objects when the subject is "wheat". AGROVEC ranks the semantic similarity scores accurately from closest to furthest from the subject.

object               AGROVEC  HolE    GloVe  Word2vec  fastText
wheat flour          0.757    -0.199  0.295  0.948     0.992
barley               0.523    0.868   0.421  0.741     0.976
grapes               0.116    0.851   0.802  0.930     0.885
tuna oil             0.046    -0.769  0.376  0.524     0.940
building components  0.016    0.923   0.397  0.466     0.883

Table 9: Top 5 related words for the concept "Foods"

Model     Top 5 related words
AGROVEC   traditional foods, soups, raw foods, value added product, cooking fats
HolE      controls, sterilizing, consumer expenditure, Andean Group, structural crops
GloVe     meat, animal meals, milk, water, seaweeds
Word2vec  cocoa beans, hides and skins, eggs, oilseed protein, soyfoods
fastText  pet foods, raw foods, seafoods, soyfoods, skin producing animals

names may not exist in global word or knowledge graph embedding models. As for FEW, these terms can be found in our embedding model, since it was trained on AGROVOC terms. This allows data experts to use scientific names and other related terms in the food domain while using FoodKG. Table 12 shows examples of top related concepts for FEW that do not exist in global embeddings.

Table 10: Top 5 related words for the concept "Energy"

Model     Top 5 related words
AGROVEC   nuclear energy, energy for agriculture, energy expenditure, animal power, renewable energy
HolE      stored products pests, plant breeding, age, formulations, sewage
GloVe     Ericales, carbohydrates, Sphingidae, Orobanchaceae, fungal spores
Word2vec  stray voltage effects, irrigation canals, libraries, agencies, CMS
fastText  bioenergy, computer science, wood energy, cytogenetics, Cytogenetics

Table 11: Top 5 related words for the concept "Water"

Model     Top 5 related words
AGROVEC   hydrosorption, chlorinated water, water statistics, body water, virtual water
HolE      isObjectOfActivity, dissolved oxygen, economic competition, state, international cooperation
GloVe     seaweeds, meat, perishable products, phosphorus, drugs
Word2vec  quarters, meat byproducts, captivity, magnetic water, plant parts
fastText  heaters, bound water, low water, esters, high water

5.2.6 Relationship Prediction

Word embeddings are well known in the NLP world due to their powerful way of capturing the relatedness between different concepts. However, capturing the lexico-semantic relationship between two words (i.e., the predicate of a triple) is a critical challenge for many NLP applications and models. A few techniques developed previously proposed modifying the original word embeddings to include specific relations while training on the corpora [32, 68, 104]. These approaches post-process the

Table 12: A few examples of commonly used concepts in the FEW domains that do not appear in global embeddings

Food               Energy              Water
cocoa products     energy balance      water activity
brown rice         energy generation   water extraction
gluten free bread  energy consumption  water availability
skim milk          energy value        water quality
emmental cheese    energy resources    water statistics

trained embeddings to check which concepts move closer together or further apart with respect to a specific relation. While these algorithms were able to predict specific relations such as synonyms and antonyms, predicting and discriminating between multiple relations is still a challenge. To overcome this challenge, we used transfer learning with the state-of-the-art STM model, which outperforms previous state-of-the-art models on the CogALex and WordNet datasets, together with the AGROVOC dataset, to predict domain-specific relations between two concepts. The newly derived model aims particularly at classifying relations between different subjects in the food, agriculture, energy, and water domains.

5.2.7 RefinedFed

RefinedFed is a protocol whose goal is to collect only the federated models that pass a certain test accuracy threshold after each round. In other words, it is a protocol for deciding which models should be included in the server's model aggregation process, such as averaging the weights while training the model.

This algorithm helps build and train a more accurate model. Furthermore, when a model fails to pass a certain test accuracy, the model will not be collected

Figure 35: RefinedFed architecture. A local testing dataset is added to each client to test the model before the collecting phase. Models that pass a certain accuracy threshold will be collected by the server; otherwise, the model will be dropped.

(Algorithm 2), which reduces the computation power and bandwidth on the server and in turn makes the aggregation phase on the server faster (see Figure 35).

Algorithm 2 FederatedTree, an extended algorithm for FederatedAveraging

1: Server executes:
2:     initialize w_0
3:     for each round t = 1, 2, ... do
4:         Select K eligible clients to compute updates
5:         Wait for updates from K clients (indexed 1, ..., K)
6:         Run clients on local test set
7:         if client accuracy less than threshold then
8:             drop client
9:         end if
10:        (Δ_k, n_k) = ClientUpdate(w) from client k ∈ [K]
11:        w̄_t = Σ_k Δ_k        // Sum of weighted updates
12:        n̄_t = Σ_k n_k        // Sum of weights
13:        Δ_t = w̄_t / n̄_t     // Average update
14:        w_{t+1} ← w_t + Δ_t
15:    end for

1: function ClientUpdate(k, w)    // run on client k
2:     B ← (split P_k into batches of size B)
3:     for each local epoch i from 1 to E do
4:         for batch b ∈ B do
5:             w ← w − η ∇ℓ(w; b)
6:         end for
7:     end for
8:     return w to server
9: end function
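The filtering-then-averaging round can be sketched as follows. This is a simplified simulation: plain Python lists stand in for model weight updates, the client tuples are invented, and the 80% threshold mirrors the value used in our experiments:

```python
def refined_fed_round(client_updates, threshold=0.8):
    """One aggregation round: keep only clients whose local test accuracy
    passes the threshold, then average their weight updates (a simplified
    sketch of RefinedFed / Algorithm 2; real clients train neural nets)."""
    kept = [(w, n) for w, n, acc in client_updates if acc >= threshold]
    if not kept:
        return None  # every model was dropped this round
    total_n = sum(n for _, n in kept)
    dim = len(kept[0][0])
    # weighted average of updates, as in FederatedAveraging
    return [sum(w[i] * n for w, n in kept) / total_n for i in range(dim)]

# (update_vector, num_local_examples, local_test_accuracy) per client;
# the third client simulates a noisy, low-accuracy model that gets dropped.
updates = [
    ([0.10, 0.20], 100, 0.91),
    ([0.30, 0.40], 100, 0.88),
    ([9.00, 9.00], 100, 0.12),  # corrupted client, filtered out
]
print(refined_fed_round(updates))  # approximately [0.2, 0.3]
```

Without the filter, the corrupted client's large update would dominate the average; with it, the server never spends bandwidth or computation on that model at all.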

CHAPTER 6

EVALUATION

6.1 Workload

Our work has been tested by running thousands of RDF triples from different datasets, providing a strong and reliable approach for real-life applications. Whether running a dataset containing 5 RDF triples or 500, the program's behavior remains the same; the only obvious difference is the running time.

Table 13 shows the average time required to run datasets of different sizes.

Table 13: Time needed for different RDF datasets

Experiment number  Number of RDF triples  Required time (sec)
Experiment 1       120                    1.52
Experiment 2       1200                   8.74
Experiment 3       3000                   17.01

As we can see, our program accepts any number of RDF triples regardless of how big or small the dataset is. The major difference is the amount of time required, which increases gradually with the size of the dataset.

6.2 Results

We have developed a new approach to enrich a database table with extra information and facts related to its contents after converting it to an RDF model. This approach

helps scientists and users interested in the Semantic Web enhance existing data on the internet in a way that makes the searching process faster and provides users with more reliable information about entities.

Our project aims to build a robust and reliable knowledge graph that sheds light on

FEW systems. We conclude these results with three different examples to illustrate the actual input and output, given the datasets in Table 14.

Table 14: FEW systems input experiments

Experiment number  Dataset type  Number of triples  Required time
Experiment 1       Food          185                1.59 sec
Experiment 2       Water         23400              3.23 min
Experiment 3       Energy        900                5.13 sec

Table 15: An input example (Ingredients)

Ingredients  Quantity      Price  Starting Date  Ending Date
Bread        100 loaves    $33    13/8/2017      20/8/2017
Roll         30 packs      $150   13/8/2017      20/8/2017
Sugar        20 lb         $40    13/8/2017      2018
Salt         5 lb          $10    13/8/2017      2018
Cake         10 cakes      $200   13/8/2017      20/8/2017
Cupcake      150 cupcakes  $300   13/8/2017      20/8/2017
Almond       10 lb         $300   13/8/2017      20/8/2017

As we mentioned before, our project contains two main parts. The first part converts the table above to an RDF model. After converting a database to an RDF model, the second part of our project processes these triples to generate new triples that enrich the dataset based on the semantic relationships between the existing triples. After

Figure 36: FoodKG input - Food example

running the second part, the final output can be seen in Figures 36-41.

Our final output meets all the universal requirements for an RDF dataset that can be queried semantically with SPARQL. The following quads were generated based on the dataset in Table 15 (the food dataset example).
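The first stage (mapping a table row to triples) can be sketched as follows; the base URI and predicate names below are placeholders for illustration, not the vocabulary actually used by our converter or the FEW ontology:

```python
# Illustrative sketch of the table-to-RDF stage; example.org URIs are
# placeholders, not the ontology terms used by the real converter.
BASE = "http://example.org/foodkg/"

def row_to_triples(row: dict) -> list:
    """Turn one ingredients-table row into N-Triples lines."""
    subj = f"<{BASE}{row['Ingredients'].replace(' ', '_')}>"
    return [
        f'{subj} <{BASE}quantity> "{row["Quantity"]}" .',
        f'{subj} <{BASE}price> "{row["Price"]}" .',
        f'{subj} <{BASE}startingDate> "{row["Starting Date"]}" .',
        f'{subj} <{BASE}endingDate> "{row["Ending Date"]}" .',
    ]

row = {"Ingredients": "Bread", "Quantity": "100 loaves", "Price": "$33",
       "Starting Date": "13/8/2017", "Ending Date": "20/8/2017"}
for t in row_to_triples(row):
    print(t)
```

Each column becomes a predicate and each cell a literal object, which is the same direct-mapping idea the converter applies before the enrichment stage adds semantically related triples.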

6.3 FoodKG

In this section, we report the evaluation of AGROVEC and compare it with other word and knowledge graph embedding techniques: GloVe, fastText, Word2vec, and HolE.

6.3.1 Evaluation Technique

We employ the Spearman rank correlation coefficient (Spearman's rho [69]) to evaluate the embedding models. Spearman's rho is a non-parametric measure for assessing the similarity between two ranked variables. We apply Spearman's rho between the cosine similarity predicted using the embeddings and the ground truth, which is known as

Figure 37: FoodKG output - Food example

Figure 38: FoodKG input - Water example

Figure 39: FoodKG output - Water example

Figure 40: FoodKG input - Energy example

Figure 41: FoodKG output - Energy example

the relatedness task [87]. When the ranks are unique, the Spearman correlation coefficient can be computed using the formula:

R_s = 1 − (6 Σ_{i=1}^{n} D_i²) / (n(n² − 1))                    (6.1)

where D_i is the difference between the two ranks of each observation and n is the total number of observations.

SELECT ?subject ?object WHERE {
    ?subject ?object .
    FILTER (lang(?object) = 'en')
}

Listing 14: Data after extraction
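Equation 6.1 can be checked with a short implementation; the score lists below are illustrative values, not results from the experiments:

```python
def spearman_rho(xs, ys):
    """Spearman's rho via Equation 6.1 (assumes unique ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of D_i^2
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# model-predicted similarities vs. ground-truth relatedness (made-up numbers)
predicted = [0.757, 0.523, 0.116, 0.046, 0.016]
ground    = [0.90, 0.70, 0.30, 0.20, 0.10]
print(spearman_rho(predicted, ground))  # 1.0 (identical ranking)
```

Because the measure depends only on ranks, a model is rewarded for ordering pairs the same way as the ground truth even when its raw scores are on a different scale, which is exactly what we want when comparing embedding models.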

6.3.2 Dataset Description

AGROVOC is a collection of vocabularies that covers all areas of interest to the

Food and Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. It comprises 32,000 concepts in over 20 languages, where each concept is represented using a unique ID. For instance, the subject

"http://aims.fao.org/aos/agrovoc/c_12332" corresponds to "maize". We use the SPARQL query in Listing 14 to extract the English triples.

6.3.3 Benchmark Description

While there exist well-known word embedding benchmark datasets, such as WordSim-353 [33], for evaluating semantic similarity measures, these cannot be employed for domain-specific embeddings, as many concepts related to FEW are not covered in public benchmarks. Constructing a domain-specific benchmark is a challenge considering the need for domain experts. Therefore, we leverage ConceptNet to construct a benchmark dataset for evaluating the models. ConceptNet originated from the crowd-sourcing project

Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. ConceptNet used to be a home-grown crowd-sourced project with the purpose of improving the state of computational and human knowledge. Currently, however, the data is generated from many trusted resources such as WordNet, DBpedia, Wiktionary [110],

OpenCyc [93], and others. We split the AGROVOC dataset based on its 126 unique relations to depict how each model performs on the different relations and to study the impact

of the number of hops between the concepts in the embedding. For each subject and object, we look up the weights returned by ConceptNet and consider them to be the ground truth.

6.3.4 RefinedFed Description

We used the MNIST dataset throughout all the experiments, with the training images distributed equally over all the clients. We used the PyTorch and

PySyft platforms to simulate different numbers of clients with the centralized server. We also added Laplace noise to the training images of a few clients to simulate corrupted data and low-accuracy, noisy models (which will not be collected by the server for the aggregation phase), in order to show the improvement with and without model selection. We chose

the Laplace distribution since its peak is sharper than the Gaussian's, which means more Laplace samples fall around zero than Gaussian samples. In practice, both Laplace and Gaussian noise perform well for adding noise to images, and any other type of noise can be used as well. The number of noisy models is the same in both experiments, i.e., FL with and without RefinedFed.
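The noise-injection step can be sketched as follows; since Python's standard library has no Laplace generator, the sample is drawn by inverting the Laplace CDF. The image and scale below are toy values (the real experiments perturb MNIST tensors in PyTorch):

```python
import math
import random

def laplace_sample(rng, scale=1.0):
    """Draw one Laplace(0, scale) sample by inverting the CDF
    (the standard library offers no Laplace generator)."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    if u >= 0:
        return -scale * math.log(1 - 2 * u)
    return scale * math.log(1 + 2 * u)

rng = random.Random(42)
# add pixel-wise Laplace noise to a toy 2x2 grayscale "image"
image = [[0.5, 0.6], [0.7, 0.8]]
noisy = [[px + laplace_sample(rng, scale=0.3) for px in row] for row in image]
print(noisy)
```

The sharper peak mentioned above falls out of the density's |x| exponent: small perturbations are more likely than under a Gaussian of comparable spread, while the heavier tails still corrupt occasional pixels strongly.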

We conducted many experiments to show the impact of RefinedFed on the accuracy of the global model. We used the MNIST dataset throughout all the experiments with the same neural network architecture, and the training images were divided equally among all of the clients. In our experiments, we studied the following cases:

• Federated learning on the MNIST dataset with Laplace noise added to the images of a few clients, in order to produce low-accuracy models and show the impact of a corrupted model on the overall model. Results can be seen in Table 16.

• Federated learning on the MNIST dataset with Laplace noise added to the images of a few clients, with our algorithm, RefinedFed, filtering out all models with accuracy less than 80%. Results can be seen in Table 16.

In both experiments, the number of clients, the number of noisy models, and the amount of noise added were the same.

Table 16: Accuracy of FL throughout 10 epochs with and without RefinedFed

Method      Accuracy throughout 10 epochs
FL          11%  11%  22%  47%  51%  49%  64%  77%  82%  84%
RefinedFed  11%  30%  36%  43%  62%  79%  83%  86%  88%  91%

6.3.5 Results

We evaluated two recent graph embedding models, namely DeepWalk and GEMSEC, trained on AGROVOC data, to analyze their performance on the FEW domains. Table 17 reports the average Spearman correlation coefficient scores for DeepWalk and GEMSEC. The higher score attained by GEMSEC motivated us to use GEMSEC for constructing AGROVEC.

We evaluated AGROVEC against HolE, GloVe, Word2vec, and fastText, where all of the models were retrained on the AGROVOC dataset using their default parameters (Table 18), except for the number of dimensions.

Table 17: Different graph embedding techniques with their Spearman correlation scores

Model     Spearman Correlation
DeepWalk  0.068
GEMSEC    0.101

Table 18: The default hyper-parameters for the retrained models

          dimensions  learning rate  window size  epochs  lambda
AGROVEC   300         0.001          6            50      0.0625
HolE      300         0.5            15           1000    1.0
GloVe     300         0.05           15           15      -
Word2vec  300         0.025          5            15      -
fastText  300         0.1            25           5       -

The number of dimensions used for all models was 300, with the minimum count set to 1 to include all the concepts and relations. Figure 42 shows the average Spearman correlation coefficient scores for all the models evaluated on the 126 unique relations.

Figure 43 shows the Spearman correlation coefficient scores when limiting the minimum number of word pairs in each relation to 5, 10, and 25, in order to check the models' performance across different numbers of word pairs. The results show that AGROVEC, based on GEMSEC trained and fine-tuned on AGROVOC, outperforms all other models by a significant margin when predicting FEW-domain similarity scores. Figure 44 shows an example of the AGROVEC embedding using TSNE for the domains Food, Energy, and

Water with their top 20 related terms, showing how these domains and their top related terms are properly clustered. In contrast, Figures 45, 46, 47, and 48 visualize how HolE,

GloVe, Word2vec, and fastText cluster the Food, Energy, and Water domains with their top

Figure 42: Spearman correlation coefficient ranking scores compared against ConceptNet. AGROVEC scored the highest, which means its ranking is the closest to ConceptNet's ranking (all models were trained on the AGROVOC dataset and tested against the same benchmark).

20 related terms. Based on Figures 44-48, we observe that AGROVEC achieves better clustering, with the terms of the same domain being placed closer together. This is because

AGROVEC uses GEMSEC, which performs self-clustering.

We also compare the top 5 related terms for food, energy, and water, as detailed in Tables 9, 10, and 11, respectively. While AGROVEC, which uses GEMSEC trained and fine-tuned on AGROVOC, was able to fetch appropriate concepts related to the provided terms, the other models struggled despite being trained on the same dataset using their default parameters without fine-tuning.

Figure 43: Spearman correlation coefficient scores when evaluated on all triples and on relations with the minimum number of word pairs per relation being 5, 10, and 25

Figure 44: AGROVEC embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the AGROVEC model. The figure shows how AGROVEC clusters similar concepts together properly.

Figure 45: HolE embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the HolE model.

Figure 46: GloVe embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the GloVe model.

Figure 47: Word2vec embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the Word2vec model.

Figure 48: fastText embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the fastText model.

6.4 Federated Learning

We tested our proposed algorithm by training two federated learning models on the MNIST [54] dataset and compared the results with and without our algorithm. The first run is an FL approach that does not utilize RefinedFed, whereas the second run utilizes the RefinedFed algorithm as a filtering mechanism for the clients. Figure 49 shows the experiment using 5 clients, Figure 50 shows the same experiment using 10 clients, and in Figure 51, 20 clients were used. Figure 52 shows the same experiment using 5 clients on the CIFAR-10 dataset [53]. We would like to mention that our algorithm is an extension of [61] and [19], with some modifications to test each model on its local testing dataset and select all models that pass a certain threshold before sending the updates to the server. The second function in the algorithm, ClientUpdate, is the same function from the original paper.

Figure 49: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 5 Clients

Figure 50: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 10 Clients

6.5 Data Availability Statement

The datasets analyzed for this study can be found in the AIMS (AGROVOC) registry: http://aims.fao.org/vest-registry/vocabularies/agrovoc.

The FoodKG code can be found in this GitHub repository: https://github.com/Gharibim/FoodKG

Figure 51: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 20 Clients

Figure 52: CIFAR-10 dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 5 Clients

CHAPTER 7

CONCLUSION AND FUTURE WORK

In our project, we developed a new approach to enrich a dataset of RDF triples by adding extra triples containing similar information and facts based on the existing knowledge. Our project aims to build a reliable knowledge graph that serves FEW systems by using a raw database from the USDA, converting it to an RDF model, and finally enhancing these triples by adding semantic knowledge. In addition, we developed a FEW ontology that can be used when mapping FEW databases to an RDF model. The FEW ontology contains many vocabularies that can be used to specify the relationships between columns when converting FEW databases.

In this dissertation, we presented FoodKG, a novel software tool to enrich knowledge graphs constructed on FEW datasets by adding semantically related knowledge, semantic similarity scores, and images using advanced machine learning techniques. FoodKG relies on AGROVEC, which was constructed using GEMSEC but retrained and fine-tuned on the AGROVOC dataset. Since AGROVEC was trained on a controlled vocabulary, it provides more accurate results than global vectors in the food and agriculture domains for category classification and the semantic similarity of scientific concepts. The STM model, retrained on the AGROVOC dataset, is used to predict semantic relations between graph entities and classes. The output produced by FoodKG can be queried using a SPARQL engine through a user-friendly interface. We evaluated AGROVEC using the

Spearman correlation coefficient, and the results show that our model outperforms the other models trained on the same graph dataset.

We also used FL techniques to include the private datasets by training smaller versions of the model on each dataset and then aggregating all of these models at the server side to generate a generalized global model. We further included a local testing dataset at each data site in order to test each local model's accuracy before collection, which makes the server collect only high-accuracy models.

In the future, we plan to extend our federated work by aggregating the models in parallel in order to collect the high-accuracy models, which leads to a higher global model accuracy. Parallel aggregation will also reduce the aggregation time on the server side.

REFERENCE LIST

[1] Apache software foundation. Available from. https://any23.apache.org/index.html;

Retrieved June 2017.

[2] Clark Persia LLC. Available from. http://clarkparsia.github.io/csv2rdf/; Retrieved

May 2017.

[3] Dandelion API. Available from. https://dandelion.eu/; Retrieved August 2017.

[4] DBpedia. Available from. http://wiki.dbpedia.org; Retrieved May 2017.

[5] DBpedia on Wikipedia. Available from. https://en.wikipedia.org/wiki/DBpedia;

Retrieved May 2017.

[6] FastText: stepping through the code. Available from.

https://medium.com/@mariamestre/fasttext-stepping-through-the-code-

259996d6ebc4; Retrieved March 2020.

[7] ParallelDots AI API. Available from. https://www.paralleldots.com/text-analysis-

apis; Retrieved August 2017.

[8] Talis Information Ltd. Available from. http://research.talis.com/2005/rdf-intro/;

Retrieved June 2017.

[9] Tibi Puiu. Available from. https://www.zmescience.com/research/technology/smartphone-

power-compared-to-apollo-432/; Retrieved June 2017.

101 [10] Wikipedia on Wikipedia. Available from. https://en.wikipedia.org/wiki/Wikipedia;

Retrieved May 2017.

[11] WordNet. Available from. https://wordnet.princeton.edu/wordnet/frequently-

asked-questions/for-application-developer/; Retrieved May 2017.

[12] WordNet on Wikipedia. Available from. https://en.wikipedia.org/wiki/WordNet;

Retrieved May 2017.

[13] Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. SW-Store: a vertically

partitioned DBMS for semantic web data management. The VLDB Journal 18, 2

(2009), 385–406.

[14] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. Dbpedia:

A nucleus for a web of open data. In The Semantic Web. Springer, 2007, pp. 722–

735.

[15] Bernard, D. Facebook as more users than population of China. Avail-

able from. https://learningenglish.voanews.com/a/facebook-has-more-users-than-

china-population/2732122.html; Retrieved June 2017.

[16] Boiy, E., and Moens, M.-F. A machine learning approach to sentiment analysis in

multilingual Web texts. Information Retrieval 12, 5 (2009), 526–558.

[17] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors

with subword information. Transactions of the Association for Computational

Linguistics 5 (2017), 135–146.

102 [18] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: A collab-

oratively created graph database for structuring human knowledge. In Proceed-

ings of the 2008 ACM SIGMOD International Conference on Management of Data

(2008), ACM, pp. 1247–1250.

[19] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V.,

Kiddon, C., and et al. Towards federated learning at scale: System design. arXiv

preprint arXiv:1902.01046 (2019).

[20] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, B., Patel, S.,

Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-

preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Confer-

ence on Computer and Communications Security (2017), 1175–1191.

[21] Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques,

Y., and Keizer, J. The AGROVOC linked dataset. Semantic Web 4, 3 (2013),

341–348.

[22] Chen, D., and Manning, C. A fast and accurate dependency parser using neural

networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural

Language Processing (EMNLP) (2014), pp. 740–750.

[23] Chen, H., Trouve, A., Murakami, K., and Fukuda, A. A concise conversion model

for improving the RDF expression of conceptNet knowledge case. In Artificial

Intelligence and Robotics (2018), Springer Verlag, pp. 213–221.

103 [24] Chen, M., Mathews, R., Ouyang, T., and Beaufays, F. Federated learning of out-

of-vocabulary words. arXiv:1903.10635 (2018).

[25] Chhaya, P., Lee, K.-H., Shin, K.-s., Choi, C.-H., Cho, W.-S., and Lee, Y.-S. Using

D2RQ and Ontop to publish relational database as Linked Data. In 2016 Eighth

International Conference on Ubiquitous and Future Networks (ICUFN) (2016),

IEEE, pp. 694–698.

[26] Chim, H., and Deng, X. Efficient phrase-based document similarity for clustering.

IEEE Transactions on Knowledge and Data Engineering 20, 9 (2008), 1217–1229.

[27] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., and

et al. Large scale distributed deep networks. In Advances in Neural Information

Processing Systems (2012), 1223–1231.

[28] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman,

R. Indexing by latent semantic analysis. Journal of the American Society for

Information Science 41, 6 (1990), 391–407.

[29] Dubey, M., Banerjee, D., Chaudhuri, D., and Lehmann, J. EARL: Joint entity and

relation linking for question answering over knowledge graphs. In The Semantic

Web - ISWC 2018 (2018), Springer International Publishing, pp. 108–126.

[30] Eisenberg, V., and Kanza, Y. D2RQ/update: updating relational data via virtual

RDF. In Proceedings of the 21st International Conference on World Wide Web

(2012), pp. 497–498.

104 [31] Ernst, P., Siu, A., and Weikum, G. Highlife: Higher-arity fact harvesting. In

Proceedings of the 2018 World Wide Web Conference on World Wide Web (2018),

International World Wide Web Conferences Steering Committee, pp. 1013–1022.

[32] Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A.

Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Con-

ference of the North American Chapter of the Association for Computational Lin-

guistics: Human Language Technologies (2015), Association for Computational

Linguistics, pp. 1606–1615.

[33] Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and

Ruppin, E. Placing search in context: the concept revisited. ACM Transactions on

Information Systems 20, 1 (2002), 116–131.

[34] Gharibi, M., and Rao, P. RefinedFed: A Refining Algorithm for Federated Learn-

ing. In Proceedings of the 49th Annual IEEE Applied Imagery Pattern Recognition

Workshop (AIPR 2020) (2020), pp. 1–5.

[35] Gharibi, M., Rao, P., and Alrasheed, N. RichRDF: A tool for enriching food,

energy, and water datasets with semantically related facts and images. In In-

ternational Semantic Web Conference (P&D/Industry/BlueSky) (2018), Springer

International Publishing.

[36] Gharibi, M., Zachariah, A., and Rao, P. FoodKG: A Tool to Enrich Knowl-

edge Graphs Using Machine Learning Techniques. Front. Big Data 3: 12. doi:

10.3389/fdata, 1–12.

105 [37] Glavas,ˇ G., and Vulicc,´ I. Discriminating between lexico-semantic relations with

the specialization tensor model. In Proceedings of the 2018 Conference of the

North American Chapter of the Association for Computational Linguistics: Human

Language Technologies, Volume 2 (Short Papers) (2018), Association for Compu-

tational Linguistics, pp. 181–187.

[38] Goyal, P., DollA¡r,˜ P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A.,

Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training Imagenet

in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[39] Grover, A., and Leskovec, J. node2vec: Scalable feature learning for networks. In

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining (2016), ACM, pp. 855–864.

[40] Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018).

[41] Hazber, M. A., Li, R., Xu, G., and Alalayah, K. M. An approach for automatically

generating R2RML-based direct mapping from relational databases. In Interna-

tional Conference of Pioneering Computer Scientists, Engineers and Educators

(2016), Springer, pp. 151–169.

[42] Hixon, B., Clark, P., and Hajishirzi, H. Learning knowledge graphs for question

answering through conversational dialog. In Proceedings of the 2015 Conference

of the North American Chapter of the Association for Computational Linguistics:

Human Language Technologies (2015), Association for Computational Linguistics, pp. 851–861.

[43] Horng, S.-J. Big data: Challenges and practical application. In 2015 Interna-

tional Conference on Science in Information Technology (ICSITech) (2015), IEEE,

pp. 11–13.

[44] Hsu, M.-H., Tsai, M.-F., and Chen, H.-H. Combining WordNet and ConceptNet

for automatic query expansion: a learning approach. In Asia Information Retrieval

Symposium (2008), Springer, pp. 213–224.

[45] Iosif, E., and Potamianos, A. Unsupervised semantic similarity computation be-

tween terms using web documents. IEEE Transactions on Knowledge and Data

Engineering 22, 11 (2010), 1637–1647.

[46] Kairouz, P., McMahan, B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).

[47] Katib, A., Rao, P., and Slavov, V. A Tool for efficiently processing SPARQL queries

on RDF quads. In International Semantic Web Conference (Posters, Demos &

Industry Tracks) (2017).

[48] Katib, A., Slavov, V., and Rao, P. RIQ: Fast processing of SPARQL queries on

RDF quadruples. Journal of Web Semantics 37 (2016), 90–111.

[49] Kleberson, J. d. A., dos Santos, J. L. C., and Moreira, D. A. BioDSL: A Domain-

Specific Language for mapping and dissemination of Biodiversity Data in the LOD.

In Anais do X Brazilian e-Science Workshop (2018), SBC.

[50] Klein, D., and Manning, C. D. Accurate unlexicalized parsing. In Proceedings of

the 41st Annual Meeting on Association for Computational Linguistics-Volume 1

(2003), Association for Computational Linguistics, pp. 423–430.

[51] Knoblock, C. A., and Szekely, P. Exploiting semantics for big data integration. AI

Magazine 36, 1 (2015), 25–39.

[52] Konecny, J., McMahan, B., Ramage, D., and Richtarik, P. Federated optimiza-

tion: distributed machine learning for on-device intelligence. arXiv preprint

arXiv:1610.02527 (2016).

[53] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009).

[54] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied

to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.

[55] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated

optimization in heterogeneous networks. Proceedings of Machine Learning and

Systems 2 (2020), 429–450.

[56] Liu, H., and Singh, P. ConceptNet – a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211–226.

[57] Lynch, C. How do your data grow? Nature 455, 7209 (2008), 28–29.

[58] Maaten, L. v. d., and Hinton, G. Visualizing data using t-SNE. Journal of Machine

Learning Research 9, Nov (2008), 2579–2605.

[59] Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., and McClosky, D.

The Stanford CoreNLP natural language processing toolkit. In Proceedings of

52nd Annual Meeting of the Association for Computational Linguistics: System

Demonstrations (2014), pp. 55–60.

[60] Martinez-Gil, J. An overview of textual semantic similarity measures based on web intelligence. Artificial Intelligence Review 42, 4 (2014), 935–943.

[61] McMahan, B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (2017), pp. 1273–1282.

[62] McMahan, B., and Ramage, D. Federated learning: collaborative machine learning

without centralized training data. Google Research Blog 3 (2017).

[63] McMahan, B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963 (2017).

[64] Meester, B. D. High quality schema and data transformations for linked data generation. In Proceedings of the Doctoral Consortium, part of CAiSE (2018), pp. 1–9.

[65] Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013).

[66] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (2013), Curran Associates Inc., pp. 3111–3119.

[67] Miller, G. A. WordNet: A lexical database for English. Communications of the

ACM 38, 11 (1995), 39–41.

[68] Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., Korhonen, A., and Young, S. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics 5 (2017), 309–324.

[69] Myers, L., and Sirois, M. J. Spearman correlation coefficients, differences between.

Encyclopedia of Statistical Sciences 12 (2004).

[70] Nadeau, D., and Sekine, S. A survey of named entity recognition and classification.

Lingvisticae Investigationes 30, 1 (2007), 3–26.

[71] Neumann, T., and Weikum, G. The RDF-3X engine for scalable management of

RDF data. The VLDB Journal 19, 1 (2010), 91–113.

[72] Nickel, M., Rosasco, L., and Poggio, T. Holographic embeddings of knowledge

graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence

(2016), AAAI Press, pp. 1955–1961.

[73] Nishio, T., and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC) (2019), pp. 1–7.

[74] Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation

methods. Semantic web 8, 3 (2017), 489–508.

[75] Pauwels, P., Corry, E., and O'Donnell, J. Making SimModel information available as RDF graphs. eWork and eBusiness in Architecture, Engineering and Construction: ECPPM (2014), 439–445.

[76] Pedersen, T., Patwardhan, S., Michelizzi, J., et al. WordNet: Similarity-measuring

the relatedness of concepts. In AAAI (2004), vol. 4, pp. 25–29.

[77] Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word

representation. In Proceedings of the 2014 Conference on Empirical Methods

in Natural Language Processing (EMNLP) (2014), Association for Computational

Linguistics, pp. 1532–1543.

[78] Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social rep-

resentations. In Proceedings of the 20th ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining (2014), ACM, pp. 701–710.

[79] Pham, M., Alse, S., Knoblock, C. A., and Szekely, P. Semantic labeling: a

domain-independent approach. In International Semantic Web Conference (2016),

Springer, pp. 446–462.

[80] Pichai, S. Google's Sundar Pichai: Privacy should not be a luxury good. New York Times, May 7, 2019.

[81] Ramnandan, S. K., Mittal, A., Knoblock, C. A., and Szekely, P. Assigning semantic

labels to data sources. In European Semantic Web Conference (2015), Springer,

pp. 403–417.

[82] Rao, P., Katib, A., and Barron, D. E. L. A knowledge ecosystem for the food,

energy, and water system. CoRR abs/1609.05359 (2016).

[83] Rao, P., Katib, A., and Barron, D. E. L. A knowledge ecosystem for the food,

energy, and water system. arXiv preprint arXiv:1609.05359 (2016).

[84] Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph em-

bedding with self clustering. In Proceedings of the 2019 IEEE/ACM International

Conference on Advances in Social Networks Analysis and Mining 2019 (2019),

ACM, pp. 65–72.

[85] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,

Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet

large scale visual recognition challenge. International Journal of Computer Vision

115, 3 (2015), 211–252.

[86] Sachan, D., Zaheer, M., and Salakhutdinov, R. Revisiting LSTM networks for

semi-supervised text classification via mixed objective function. Proceedings of

the AAAI Conference on Artificial Intelligence 33 (2019), 6940–6948.

[87] Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015), Association for Computational Linguistics, pp. 298–307.

[88] Shen, W., Wang, J., and Han, J. Entity linking with a knowledge base: issues,

techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering

27, 2 (2015), 443–460.

[89] Shields, C. Text-based document similarity matching using sdtext. In 2016

49th Hawaii International Conference on System Sciences (HICSS) (2016), IEEE,

pp. 5607–5616.

[90] Slavov, V., Katib, A., Rao, P., Paturi, S., and Barenkala, D. Fast processing of

SPARQL queries on RDF quadruples. In Proceedings of the 17th International

Workshop on the Web and Databases (WebDB 2014) (2015), pp. 1–5.

[91] Smith, C. 160 YouTube statistics and facts (2020), by the numbers. Available from https://expandedramblings.com/index.php/youtube-statistics; retrieved March 2017.

[92] Smith, S., Kindermans, P.-J., Ying, C., and Le, Q. Don’t decay the learning rate,

increase the batch size. arXiv preprint arXiv:1711.00489 (2017).

[93] Smywiński-Pohl, A. Classifying the Wikipedia articles into the OpenCyc taxonomy. In WoLE@ISWC (2012), pp. 5–16.

[94] Spagnola, S., and Lagoze, C. Edge dependent pathway scoring for calculating

semantic similarity in ConceptNet. In Proceedings of the Ninth International

Conference on Computational Semantics (2011), Association for Computational

Linguistics, pp. 385–389.

[95] Speer, R., Chin, J., and Havasi, C. ConceptNet 5.5: An open multilingual graph

of general knowledge. In Proceedings of the Thirty-First AAAI Conference on

Artificial Intelligence (2017), AAAI Press, pp. 4444–4451.

[96] Speer, R., and Havasi, C. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP. Springer, 2013, pp. 161–176.

[97] Stadler, C., Unbehauen, J., Westphal, P., Sherif, M. A., and Lehmann, J. Simplified

RDB2RDF Mapping. In LDOW@ WWW (2015).

[98] Suchanek, F. M., Kasneci, G., and Weikum, G. Yago: A core of semantic knowl-

edge. In Proceedings of the 16th International Conference on World Wide Web

(2007), ACM, pp. 697–706.

[99] Google Support. Your chats stay private while Messages improves suggestions.

[100] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. LINE: Large-scale

information network embedding. In Proceedings of the 24th International Con-

ference on World Wide Web (2015), International World Wide Web Conferences

Steering Committee, pp. 1067–1077.

[101] Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E. G., and Milios, E. E. Semantic similarity methods in WordNet and their application to information retrieval on the web. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management (2005), ACM, pp. 10–16.

[102] Vashishth, S., Jain, P., and Talukdar, P. CESI: Canonicalizing open knowledge

bases using embeddings and side information. In Proceedings of the 2018 World

Wide Web Conference on World Wide Web (2018), International World Wide Web

Conferences Steering Committee, pp. 1317–1327.

[103] Verborgh, R., and De Wilde, M. Using OpenRefine. Packt Publishing Ltd, 2013.

[104] Vulic,´ I., and Mrksiˇ c,´ N. Specialising word vectors for lexical entailment. In Pro-

ceedings of the 2018 Conference of the North American Chapter of the Association

for Computational Linguistics: Human Language Technologies, Volume 1 (Long

Papers) (2018), Association for Computational Linguistics, pp. 1134–1145.

[105] Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., and Yang, S. Community preserv-

ing network embedding. In Proceedings of the Thirty-First AAAI Conference on

Artificial Intelligence (2017), AAAI Press, pp. 203–209.

[106] Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., Ramage, D., and

Beaufays, F. Applied federated learning: Improving Google keyboard query sug-

gestions. arXiv preprint arXiv:1812.02903 (2018).

[107] Ye, F., Chen, C., and Zheng, Z. Deep autoencoder-like nonnegative matrix factorization for community detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018), ACM, pp. 1393–1402.

[108] Zachariah, A., Gharibi, M., and Rao, P. A Large-Scale Image Retrieval System for

Everyday Scenes. In Proceedings of the 2020 ACM International Conference on

Multimedia in Asia (MMAsia ’20) (2020), pp. 1–3.

[109] Zachariah, A., Gharibi, M., and Rao, P. QIK: A System for Large-Scale Image

Retrieval on Everyday Scenes With Common Objects. In Proceedings of the

2020 ACM International Conference on Multimedia Retrieval (ICMR ’20) (2020),

pp. 126–135.

[110] Zesch, T., Müller, C., and Gurevych, I. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08) (2008), European Language Resources Association (ELRA), pp. 1646–1652.

VITA

Mohamed Gharibi joined UMKC in 2015 to pursue a Master's and a PhD in Computer Science. His research interests include Data Science, Machine Learning, and Federated Learning. He is also pursuing a degree in higher-education teaching, Preparing

Future Faculty, from the School of Education supported with a Scholar Award from the

School of Graduate Studies at UMKC. Mohamed is working under the supervision of Dr.

Praveen Rao and he is currently a Machine Learning Engineer at IBM.

Mohamed published the following papers:

• Gharibi, M., Rao, P. RefinedFed: A Refining Algorithm for Federated Learning.

Applied Imagery Pattern Recognition (AIPR), the 49th IEEE Annual Applied Im-

agery Pattern Recognition Workshop, 5 pages, 2020 [34]

• Zachariah, A., Gharibi, M., Rao, P., A Large-Scale Image Retrieval System for

Everyday Scenes. ACM Multimedia Asia (ACM MM Asia), 2020 [108]

• Gharibi, M., Zachariah, A., Rao, P., FoodKG: A Tool to Enrich Knowledge Graphs

Using Machine Learning Techniques. Frontiers in Big Data. doi: 10.3389/fdata.2020.00012, 2020 [36]

• Zachariah, A., Gharibi, M., Rao, P., QIK: A System for Large-Scale Image Re-

trieval on Everyday Scenes with Common Objects. International Conference on

Multimedia Retrieval (ICMR 2020) [109]

• Gharibi, M., Rao, P., and Alrasheed, N. RichRDF: A Tool for Enriching Food, Energy, and Water Datasets with Semantically Related Facts and Images. International Semantic Web Conference (ISWC 2018) [35]
