ENRICHING KNOWLEDGE GRAPHS USING MACHINE LEARNING TECHNIQUES

A Dissertation in Computer Science and Telecommunications and Computer Networking

Presented to the Faculty of the University of Missouri–Kansas City in partial fulfillment of the requirements for the degree

DOCTOR OF PHILOSOPHY

by
MOHAMED GHARIBI
M.S., University of Missouri–Kansas City, USA, 2017

Kansas City, Missouri
2020

© 2020
MOHAMED GHARIBI
ALL RIGHTS RESERVED

ENRICHING KNOWLEDGE GRAPHS USING MACHINE LEARNING TECHNIQUES

Mohamed Gharibi, Candidate for the Doctor of Philosophy Degree

University of Missouri–Kansas City, 2020

ABSTRACT

A knowledge graph represents millions of facts and reliable pieces of information about people, places, and things. Knowledge graphs have proven their reliability and usefulness for providing better search results, answering ambiguous questions regarding entities, and training semantic parsers to enhance the semantic relationships over the Semantic Web. However, while there exists a plethora of datasets on the Internet related to

Food, Energy, and Water (FEW), there is a real lack of reliable methods and tools that can consume these resources. This hinders the development of novel decision-making applications utilizing knowledge graphs. In this dissertation, we introduce a novel tool, called FoodKG, that enriches FEW knowledge graphs using advanced machine learning techniques. Our overarching goal is to improve decision-making and knowledge discovery, and to provide improved search results for data scientists in the FEW domains. Given an

input knowledge graph (constructed from raw FEW datasets), FoodKG enriches it with semantically related triples, relations, and images based on the original dataset terms and classes. FoodKG employs an existing graph embedding technique trained on a controlled vocabulary called AGROVOC, which is published by the Food and Agriculture Organization of the United Nations. AGROVOC includes terms and classes in the agriculture and food domains. As a result, FoodKG can enhance knowledge graphs with semantic similarity scores and relations between different classes, classify the existing entities, and allow FEW experts and researchers to use scientific terms for describing FEW concepts. The resulting model obtained after training on AGROVOC was evaluated against state-of-the-art word embedding and knowledge graph embedding models that were trained on the same dataset. We observed that this model outperformed its competitors based on the Spearman Correlation Coefficient score.

We introduced Federated Learning (FL) techniques to further extend our work and include private datasets, by training smaller versions of the model at each dataset site without accessing the data and then aggregating all the models at the server side. We propose an algorithm, called RefinedFed, that further extends the current FL work by filtering the models at each dataset site before the aggregation phase. Our algorithm improves the current FL model accuracy from 84% to 91% on the MNIST dataset.

APPROVAL PAGE

The faculty listed below, appointed by the Dean of Graduate Studies, have examined a dissertation titled "Enriching Knowledge Graphs Using Machine Learning Techniques," presented by Mohamed Gharibi, candidate for the Doctor of Philosophy degree, and hereby certify that in their opinion it is worthy of acceptance.

Supervisory Committee

Praveen Rao, Ph.D., Committee Chair
Department of Computer Science & Electrical Engineering

Sejun Song, Ph.D., Co-Discipline Advisor
Department of Computer Science & Electrical Engineering

Yugyung Lee, Ph.D.
Department of Computer Science & Electrical Engineering

Ahmed Hassan, Ph.D.
Department of Computer Science & Electrical Engineering

Zhu Li, Ph.D.
Department of Computer Science & Electrical Engineering

CONTENTS

ABSTRACT
ILLUSTRATIONS
TABLES
LISTINGS
ALGORITHMS
ACKNOWLEDGEMENTS

Chapter

1 INTRODUCTION
  1.1 Overview
2 BACKGROUND
  2.1 Text Vectorization
  2.2 Embedding Models
3 RELATED WORK
  3.1 Converting to RDF Model
  3.2 Enriching a Dataset with Extra Triples Based on the Existing Ones
  3.3 Machine Learning and Embedding Models
  3.4 Federated Learning
4 APPROACH
  4.1 Overview
  4.2 Converting a Table into RDF Triples
  4.3 Enriching RDF Data Triples
  4.4 Choosing the Target Triples
  4.5 Running Entities on ConceptNet
  4.6 Levels of Searching Tree
  4.7 Adding the Extra Triples
  4.8 Architecture
5 IMPLEMENTATION
  5.1 Implementation
  5.2 FoodKG Implementation
6 EVALUATION
  6.1 Work Load
  6.2 Results
  6.3 FoodKG
  6.4 Federated Learning
  6.5 Data Availability Statement
7 CONCLUSION AND FUTURE WORK

REFERENCE LIST

VITA

ILLUSTRATIONS

Figure

1  RDF Model
2  RDF model for the second book
3  Federated Learning Architecture
4  Google embedding example
5  Embedding distance example
6  CBOW architecture
7  The generated weights
8  Multiplication to generate the embedding vector
9  Multiplication to generate the embedding vector
10 Multiplication to generate the embedding vector
11 GloVe co-occurrence matrix for "the dog ran after the man"
12 FastText architecture for a sentence with N-gram features
13 Zachary's Karate Club visualized using DeepWalk and GEMSEC embeddings
14 Comparison between SML and R2RML
15 Any23, list of extractors
16 The required template to convert Table 3 to RDF model
17 BioDSL
18 DBpedia semantic query
19 DBpedia results for running "Lemon"
20 Dandelion semantic results for comparing "Lemon" and "Lime"
21 Dandelion results for comparing two phrases
22 Results of ParallelDots for comparing two phrases
23 Results of ParallelDots when comparing two terms
24 WordNet hierarchy for nouns and verbs
25 Results returned from WordNet
26 WordNet results when running a general term
27 WordNet result when running a phrase containing a dictionary-based term
28 ConceptNet example for a relationship
29 FEW ontology while using the Karma tool
30 First level of the searching tree
31 Third level of the searching tree
32 The searching tree for the term "Flower"
33 The searching tree for the term "Fire"
34 FoodKG system architecture
35 RefinedFed architecture. A local testing dataset will be added to each client to test the model before the collecting phase. Models that pass a certain accuracy threshold will be collected by the server; otherwise, the model will be dropped
36 FoodKG input - Food example
37 FoodKG output - Food example
38 FoodKG input - Water example
39 FoodKG output - Water example
40 FoodKG input - Energy example
41 FoodKG output - Energy example
42 Spearman correlation coefficient ranking scores compared against ConceptNet
43 Spearman correlation coefficient scores
44 AGROVEC embeddings visualization using t-SNE
45 HolE embeddings visualization using t-SNE
46 GloVe embeddings visualization using t-SNE
47 Word2vec embeddings visualization using t-SNE
48 FastText embeddings visualization using t-SNE
49 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 5 clients
50 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 10 clients
51 MNIST dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 20 clients
52 CIFAR-10 dataset. The accuracy throughout 10 epochs. FL vs. FederatedTree. 5 clients

TABLES

Table

1  Example of a book database
2  Table representation form
3  A single row from an employees' database
4  Example of a dealer database
5  Example of a dealer database
6  Example of a dealer database
7  Example of a dealer database
8  An example of how each model ranks the objects when the subject is "wheat". AGROVEC ranks the semantic similarity scores accurately from closest to furthest from the subject
9  Top 5 related words for the concept "Foods"
10 Top 5 related words for the concept "Energy"
11 Top 5 related words for the concept "Water"
12 A few examples of the most used concepts in the FEW domain that do not appear in global embeddings
13 Time needed for different RDF datasets
14 FEW systems input experiments
15 An input example (Ingredients)
16 Accuracy of FL throughout 10 epochs with and without RefinedFed
17 Different graph embedding techniques with their Spearman Correlation score
18 The default hyper-parameters for the retrained models

LISTINGS

Listing

1  RDF quad example
2  JSON representation form
3  Combined JSON representation
4  Data after extraction
5  R2RML query
6  D2RQ-ML syntax [97]
7  Data after extraction
8  Data after extraction
9  Data after extraction
10 Output triples
11 Blank nodes
12 Blank nodes
13 Blank nodes
14 Data after extraction

ALGORITHMS

1 Text Classification using AGROVEC and ConceptNet
2 FederatedTree: an extended algorithm for FederatedAveraging

ACKNOWLEDGEMENTS

This work would have never seen the sunlight without the help and support of many people. Thank you all!

I would like to express my gratitude to my father, who supports me all the time and who taught me to be who I am today. I am also grateful to my mother, who always motivates me, and to my brother, who taught me a lot and was always there for me. I appreciate having you in my life! My sincere thanks go to my grandparents, who believe in me!

I would like to express my sincere gratitude to my advisor, Dr. Praveen Rao, for his unlimited guidance, support, and motivation. Thank you for teaching me how to be a better student and a better person. I appreciate all the time and the efforts you invested in me. It is an honor to complete my studies under your supervision.

Prof. Ghulam Chaudhry, thank you for your great support and for all the opportu- nities you offered me.

Dr. Song, Dr. Lee, Dr. Ahmed, and Dr. Zhu, thank you for your continuous support and guidance. I appreciate your time in teaching me.

My manager at IBM, Srini Bhagavan, is a great leader and a great supporter from whom I learned a lot. Thank you!

My lab friends, thank you for your help and for all the nice times we spent together.

My sincere thanks go to the School of Computing and Engineering faculty and the School of Graduate Studies staff for all the opportunities and the research grants. Thank you all.

We would like to acknowledge the partial support of NSF Grant No. 1747751.

CHAPTER 1

INTRODUCTION

In this chapter, we briefly introduce the area of research, the research problem that we address in that area, the objectives of the work, the tools and web services that have been used, and my contributions.

1.1 Overview

Winding back the clock 20 years, anyone could hardly believe they would own a mobile phone, much less a laptop; nowadays, most cars have more powerful microprocessors than the ones used in the space vehicles that carried men to the moon [9]. This huge jump in technology creates new lifestyles and changes the way we communicate in many different aspects. It even modifies the priorities in our way of thinking, transforming the agricultural revolution into the industrial revolution and resulting in a huge information revolution. Nowadays, many things can be accomplished via technology, including online meetings, online degrees, online jobs, and social communication. Furthermore, entertainment and communicating with friends and family can be done online through social networking websites. This significant information revolution generates a huge amount of data every day, called Big Data (BD) [13]. The Big Data concept refers to complex and large volumes of both structured and unstructured data for which traditional data processing software is inadequate [57].

Big Data Science (BDS) is the science of managing, storing, analyzing, and retrieving huge amounts of data. One of the challenges for BDS is that data on the Internet does not follow a particular format. Different social media websites use different ways to store and manipulate online data [43]. For instance, YouTube stated that 400 hours' worth of videos are uploaded per minute and that one billion hours of content are watched on YouTube daily [91]. YouTube stores these hours of videos in a structured format, whereas Facebook, which has more users than China's population, stores its data in graphs [15]. These different formats create new challenges for users who want to analyze and process such data. The essential part of BDS is enabling users to analyze and process big data in different formats. Structured data, also known as Relational Databases (RDB), includes tables, spreadsheets, and databases that use the Structured Query Language (SQL) for processing. Although SQL is a common and powerful language, there are still many challenges in joining structured and unstructured data such as texts, videos, images, emails, and audio files.

Fortunately, there is a universal data model that is considered a solution to all the aforementioned challenges. The Resource Description Framework (RDF) is a World Wide Web Consortium (W3C) data model. RDF represents data in three parts, a subject, a predicate, and an object, which together are known as an RDF triple (Figure 1). A new value can be added to describe the context of the triple, which turns the triple into an RDF quad, as in Listing 1 [47, 48].

Figure 1: RDF Model

RDF triples represent semantic information and facts between entities and concepts for both humans and computers [90]. Subjects within the RDF data model are provided with a Universal Resource Identifier (URI) to present unique information and facts. This allows both humans and computers to trace back the origin of a word, related terms, and in what context it was mentioned [71]. Listing 1 illustrates a quad model after engaging the URIs.

Listing 1: RDF quad example

Furthermore, one of the most important uses of the RDF model is joining and merging data from different formats. Without the RDF model, it can be complex to merge two different databases, and the level of complexity increases with the number of databases. With the RDF model, the process begins by converting the tables to RDF and then joining the resulting triples. The advantage is that joining RDF data works for different amounts of data in various formats. Converting a database into the RDF model is a challenge that many users face, since there is no tool that can be used fully automatically without human contribution. Converting a database into the RDF model requires a special structure for mapping the data from the database into RDF. Different databases require different structures; these structures are called ontologies. For each database, a user is required to provide an ontology.

Figure 2: RDF model for the second book [8]

Few ontologies exist on the Internet, and they do not cover all users' purposes. Therefore, we developed a new ontology, based on DBpedia ontologies, called the FEW ontology, to serve users who are working with a FEW knowledge base. The FEW ontology contains tens of relationships that can be used to specify the relationship between two entities while converting to the RDF model. For instance, Table 1 contains books' titles, authors, publishers, etc.

Table 1: Example of a book database

Isbn        Title          Author          publishedID  Pages
0596002637  Practical RDF  Shelley Powers  7642         350
0596000480  JavaScript     David Flanagan  3556         936

The relationship that links the second book with "JavaScript" is "title". There are a few ontologies that define such simple relationships, but for another column the relationship might be "number of pages". In this case, a user has to search for an ontology that defines the relationship "number of pages" or create their own. After converting the previous table to the RDF model, the data will be presented as in Figure 2.
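The table-to-RDF conversion described above can be sketched in a few lines. This is a minimal illustration, not the Karma or FoodKG implementation; the predicate URIs in the `ONTOLOGY` mapping are hypothetical placeholders rather than entries from a real ontology:

```python
# Minimal sketch of mapping one table row to RDF triples, assuming a
# hand-written ontology that maps column names to (hypothetical) predicate URIs.
ONTOLOGY = {
    "Title": "http://example.org/ontology/title",
    "Author": "http://example.org/ontology/author",
    "Pages": "http://example.org/ontology/numberOfPages",
}

def row_to_triples(row, subject_base="http://example.org/book/"):
    """Turn a dict-shaped table row into (subject, predicate, object) triples."""
    subject = subject_base + row["Isbn"]  # the ISBN gives each book a unique URI
    return [(subject, ONTOLOGY[col], str(val))
            for col, val in row.items() if col in ONTOLOGY]

triples = row_to_triples(
    {"Isbn": "0596000480", "Title": "JavaScript",
     "Author": "David Flanagan", "Pages": 936})
```

Each column covered by the ontology yields one triple whose subject is the row's URI, mirroring how a mapping tool walks a table row by row.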

Another advantage of the RDF data model is that a user can simply understand all the information presented using these RDF triples. A user may add extra information, such as a link to the author's personal website, how many children he has, and what other books he has written.

The second part of our work is enhancing the mapped RDF dataset by adding extra information based on the semantic similarity between the entities in a given dataset. Our program starts by semantically comparing two entities at a time. Based on the relationships between these entities, extra triples will be added to the dataset containing the compared entities with the semantic similarity score and the relationship between them. In a dataset, multiple entities may have a relationship other than the existing ones [109]. For instance, a dataset may contain names such as "David Flanagan" and "Java in a Nutshell", which can be confusing to users. In this case, adding extra information based on the semantic similarity between the first and second name, such as "author" or "owned by", will enrich the dataset and provide valuable information for users to understand the exact relationships between names and entities. Moreover, enriching a dataset with extra information will minimize the searching time. For example, adding the relationship "author" between "David Flanagan" and "Java in a Nutshell" will save time and effort for users who want to search for the relationship between these names. For this purpose, we utilize the ConceptNet web service to provide us with all the semantically related concepts for a given word, which we then use to conduct our calculations. Before we start explaining our work, we would like to mention a few of the reasons behind choosing the area of FEW.

• Most technology nowadays is concerned with computer-related projects in areas such as social media, banking, advertisement, and education. Food, Water, and Energy systems do not receive the same technological interest as those areas. Hence, our project aims to build a system that improves FEW knowledge graphs to shed light on these areas in a way that enhances these systems and enables users to analyze databases and graphs in a better way [83].

• The lack of existing ontologies for converting databases to the RDF model obligated us to create a new ontology, based on DBpedia ontologies, to be used with FEW systems.

• Analyzing data is not a new concept, but enriching a dataset by adding extra RDF quads related to the existing ones, based on the semantic similarity between these quads, is a real challenge that will enrich a dataset and provide users with more helpful information and facts about the concepts that exist in that particular dataset.
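The pairwise enrichment step described earlier can be sketched as follows. This is a heavily simplified illustration: the `similarity` function is a hypothetical stand-in for the ConceptNet-based scoring, and the toy scores are made up:

```python
# Sketch of enrichment: compare entities pairwise and, when the semantic
# similarity exceeds a threshold, add an extra quad-like record that carries
# the relation and the similarity score. `similarity` is a stand-in for the
# ConceptNet-backed scorer used by the actual system.
def enrich(triples, similarity, threshold=0.5):
    entities = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    extra = []
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            score = similarity(a, b)
            if score >= threshold:
                extra.append((a, "relatedTo", b, score))  # triple + score
    return triples + extra

# Toy similarity table for illustration only.
toy_scores = {frozenset({"David Flanagan", "Java in a Nutshell"}): 0.8}
sim = lambda a, b: toy_scores.get(frozenset({a, b}), 0.0)

enriched = enrich([("David Flanagan", "wrote", "Java in a Nutshell")], sim)
```

A real deployment would replace `sim` with a call to a semantic network or an embedding model; the filtering-by-threshold structure stays the same.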

Food, energy, and water are critical resources for sustaining human life on Earth. Currently, there is a plethora of datasets on the Internet related to FEW resources. However, there is still a lack of reliable tools that can consume these resources and provide decision-making capabilities [82]. Moreover, FEW data exists on the Internet in different formats with different file extensions, such as CSV, XML, and JSON, and this makes it a challenge for users to join, query, and perform other tasks [51]. Generally, such data types are neither consumable in the world of Linked Open Data (LOD) nor ready to be processed by deep learning networks [64]. Recently, in September 2018, Google announced its "Google Dataset Search", a search engine that includes knowledge graphs and datasets. Google Dataset Search is a giant leap in the Semantic Web domain, but the challenge is the lack of published knowledge graphs, especially in the FEW systems area [35].

Knowledge graphs, including [18], DBpedia [14], and YAGO [98], have been commonly used in Semantic Web technologies, Linked Open Data, and cloud computing [29] due to their semantic properties. In recent years, many free and commercial knowledge graphs have been constructed from semi-structured repositories like Wikipedia or harvested from the Web. In both cases, the results are large global knowledge graphs that have a trade-off between completeness and correctness [42]. Recently, different refinement methods were proposed to utilize the knowledge in these graphs and make them more useful in domain-specific areas by adding missing knowledge, identifying erroneous pieces, and extracting useful information for users [74]. Furthermore, the knowledge extraction methods used in most knowledge graphs are based on binary facts [31]. These binary facts represent relations between two entities, which limits their deep reasoning ability when there are multiple entities, especially in domain-specific areas like FEW [102].

The lack of reliable knowledge graphs serving FEW resources has motivated us to build our tool, FoodKG, which uses domain-specific graph embeddings to help in decision-making, improving knowledge discovery, simplifying access, and providing better search results [36]. FoodKG enriches FEW datasets by adding additional knowledge and images based on the semantic similarities [101] between entities within the same context. To achieve these tasks, FoodKG employs a recent graph embedding technique based on self-clustering called GEMSEC [84], which was retrained on the AGROVOC [21] dataset. AGROVOC is a collection of vocabularies that covers all areas of interest to the Food and Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. The retrained model, AGROVEC, is a domain-specific graph embedding model that enables FoodKG to enhance knowledge graphs with semantic similarity scores between different terms and concepts. In addition, FoodKG allows users to query knowledge graphs using SPARQL through a friendly user interface.
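A semantic similarity score between two embedded concepts is commonly computed as the cosine similarity of their vectors. The sketch below illustrates the idea; the tiny three-dimensional vectors are made up for demonstration (learned AGROVEC vectors are higher-dimensional):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up toy embeddings: related farm concepts point in similar directions.
emb = {"wheat": [0.9, 0.1, 0.2],
       "barley": [0.8, 0.2, 0.3],
       "fire": [0.0, 0.9, 0.1]}

score = cosine(emb["wheat"], emb["barley"])  # close to 1.0 for related concepts
```

With a trained domain-specific embedding, pairs like "wheat"/"barley" score much higher than unrelated pairs like "wheat"/"fire", which is exactly the signal used to rank enrichment candidates.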

Most current knowledge graphs and data are private. Therefore, we have extended our work by adding Federated Learning (FL) techniques in order to benefit from private and secured data. FL was introduced by McMahan [61] as a distributed machine learning approach whose goal is to train a centralized/global model using a large number of distributed datasets, without accessing the data and while keeping the data localized. The idea is to train a smaller version of the model at each dataset site and then aggregate all the models at the server, where the goal is to minimize the objective function, as can be seen below:

\min_{w} f(w) = \sum_{i=1}^{n} p_i F_i(w) = \mathbb{E}_i[F_i(w)] \quad (1.1)

where n is the number of clients and \sum_i p_i = 1 s.t. p_i \geq 0. FL allows training on such data without requiring data transfer outside its holder's premises. In particular, FL is one instance of the more general approach of "bringing the code to the data, instead of the data to the code," which trains a model using the localized data without being granted access to it. The general description of FL was given by McMahan and Ramage [20], and the theory in Konečný et al. (2016a) [52], McMahan et al. (2017 [20], 2018 [63]), and [19], to address the fundamental problems of privacy, ownership, and locality of data. FL was initially introduced to target mobile and edge device applications [62]; later on, FL was also used across multiple organizations such as hospitals. We will call these two settings "cross-device" and "cross-silo," respectively, as mentioned in [62].
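Equation (1.1) can be checked numerically: the global objective is just the p_i-weighted average of the clients' local objectives. The loss values and sample counts below are made up for illustration:

```python
# Numeric illustration of Eq. (1.1): f(w) = sum_i p_i * F_i(w),
# with p_i >= 0 and sum_i p_i = 1.
def global_objective(local_losses, weights):
    assert abs(sum(weights) - 1.0) < 1e-9 and all(p >= 0 for p in weights)
    return sum(p * f for p, f in zip(weights, local_losses))

# Three clients; p_i is taken proportional to each client's sample count,
# a common choice in FedAveraging.
samples = [100, 300, 600]
p = [n / sum(samples) for n in samples]          # [0.1, 0.3, 0.6]
f_w = global_objective([0.9, 0.5, 0.2], p)       # 0.1*0.9 + 0.3*0.5 + 0.6*0.2
```

Clients holding more data contribute proportionally more to the global objective, which is why the aggregation in FedAveraging weights client updates by sample count.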

Federated learning is one of the widely adopted techniques in the context of privacy-preserving machine learning, used to train a model on data that is not accessible, such as patient records at hospitals. In particular, instead of uploading the data to a centralized server where a model will be trained, FL techniques rely on sending the model to the data holder, which in return will train a model without having to share the data or permit access to it. Furthermore, FL is often deployed to train models from edge and wearable devices that continuously collect data from users, such as phones and medical equipment. For example, one of the most famous uses of FL is in the area of smartphone keyboards. Google makes extensive use of FL in the Gboard mobile keyboard [40, 46, 80, 106] (see Figure 3 for the simplified FL architecture) and Android Messages [24], while Apple uses cross-device FL in iOS 13 [99]. The model that predicts the next word in smartphone keyboards was trained with the FL technique. Instead of uploading all the private text of users to a centralized server and training a model there, a simple model is trained on each user's phone, producing a model that does not have strong accuracy. However, when collecting thousands of user models and averaging their weights on the server, a better and more generalized model is produced. The produced model is then sent to all of the users in the next round. A round in FL starts with the server sending the global model to all of the clients; each client further trains this model on its own private data and then sends the updates back to the server for aggregation. This process repeats, and a more generalized model is produced.

Figure 3: The simplified architecture of FL, where the server initially sends a global model to the clients. The clients perform local training and share updated weights with the server. The server aggregates the weights, updates the global model, and continues to perform these steps again.
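The round just described can be sketched as a toy simulation. Here "models" are reduced to plain weight vectors and "local training" to a client-specific update function, which stands in for local SGD on private data:

```python
# Toy FL round: server broadcasts the global model, clients "train" locally,
# and the server averages the returned weight vectors (FedAveraging).
def average(models):
    """Coordinate-wise mean of a list of client weight vectors."""
    return [sum(ws) / len(ws) for ws in zip(*models)]

def run_round(global_model, client_updates):
    # Each client gets a copy of the global model and returns its local update.
    local_models = [update(list(global_model)) for update in client_updates]
    return average(local_models)

# Three stand-in clients; each lambda replaces a real local training step.
clients = [lambda w: [x + 1.0 for x in w],
           lambda w: [x - 1.0 for x in w],
           lambda w: [x + 0.0 for x in w]]

model = [0.0, 0.0]
for _ in range(3):  # three communication rounds
    model = run_round(model, clients)
```

Because the three toy updates cancel each other out, the averaged model stays at the origin here; with real gradients, each round moves the global model toward a consensus of the clients' data.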

Averaging all clients' models is the standard approach currently used to generate a global, generalized model with better accuracy. This technique is similar to a random forest, where the idea is to average all the over-fitted tree models to produce a better overall model. However, this approach faces a real challenge when participating entities (i.e., data holders that participate in training small models on premises) do not hold "good" data, or when their data includes a lot of noise. For example, in the case of using FL for improving next-word predictions on smartphones, many people use the English alphabet to type words in other languages (e.g., one might type "salam" in English although it is a greeting in Arabic). Not to mention grammatical mistakes and shortcuts, such as typing "u" instead of "you". Different accents and slangs may also lower the model's accuracy, such as typing "goin" instead of "going". The models collected from such users will be harmful to the general model.

On the other side, we have computer vision models that are trained on images. Some clients may have a huge number of high-resolution images. Others may have only a few images, corrupted images, low-resolution images, black-and-white images, or images with a lot of noise that will negatively affect the overall model. Moreover, collecting models from a much larger crowd requires more computational power and bandwidth, and introduces latency. Therefore, our proposed algorithm runs a simple accuracy test for each client model after each round; based on the result, the model either is or is not included in further operations on the server.

In this dissertation, we propose a tool called FoodKG that refines and enriches FEW resources to utilize the knowledge in FEW graphs in order to make them more useful for researchers, experts, and domain users. The key contributions of our work are as follows:

• FoodKG is a novel software tool that aims to enrich and enhance FEW graphs using multiple features. Adding a context to the provided triples is one of the first features, which allows querying the graphs more easily and provides better input for deep learning models.

• FoodKG provides different Natural Language Processing (NLP) techniques, such as POS tagging, chunking, and the Stanford Parser, for extracting meaningful subjects, unifying repeated concepts, and linking related entities together [22, 50, 59].

• FoodKG employs the Specialization Tensor Model (STM) [37] to predict the newly added relations within the graph.

• We adopt WordNet [67] to return all the offsets for the provided subjects in order to parse the related images from ImageNet [85]. These images will be added to the graph in the form of Universal Resource Locators (URLs) as related and pure images.

• FoodKG utilizes the GEMSEC [84] model, retrained on AGROVOC with transfer learning and fine-tuning to produce AGROVEC, to provide semantic similarity scores between similar and linked concepts. AGROVEC was compared with word embedding and knowledge graph embedding models trained on the same dataset. By virtue of being trained on domain-specific graph data, AGROVEC achieved superior performance over its competitors in terms of the Spearman Correlation Coefficient score.

• We introduced Federated Learning (FL) techniques to further extend our work and include private datasets, by training smaller versions of the model at each dataset site without accessing the data and then aggregating all the models at the server side. We propose an algorithm, called RefinedFed, that further extends the current FL work by filtering the models at each dataset site before the aggregation phase. Our algorithm improves the current FL model accuracy from 84% to 91% on the MNIST dataset.

Our results show that AGROVEC provides more accurate and reliable results than the other embeddings in different scenarios: category classification, semantic similarity, and scientific concepts.

We aim to make FoodKG one of the best tools for data scientists and researchers in the FEW domains to develop next-generation applications using the concept of knowledge graphs and machine learning techniques. The rest of the dissertation is organized as follows: Section 2 discusses recent related work; Section 3 presents the design details of FoodKG; Section 4 discusses the implementation and performance evaluation of FoodKG; and finally, we conclude in Section 5.

CHAPTER 2

BACKGROUND

In the first part of this chapter, we present a brief introduction to several approaches and tools that are used to convert various data formats to the RDF data model, and the reasons why we chose the Karma integration tool.

In the second part of the chapter, we present the most reliable semantic networks and the reasons behind choosing ConceptNet to work with in our project.

In the third part, we present the state-of-the-art embedding models (graph, knowledge graph, and word embeddings).

2.1 Text Vectorization

2.1.1 Bag of Words (BOW)

BOW is a technique to parse the features of a document. The meanings of features are the characteristics and properties that you can use to make a decision (to buy a house you look for few features such as how many rooms and its location). The features of the text are how many unique words in the corpus and the occurrence for each word, etc.

BOW is a feature extraction technique in which the output is a vector space that represents each document in the corpus. The length of this vector (its dimensionality) corresponds to the number of unique words in the corpus (no repetition; each word occurs only once). The BOW model has different flavors, each extending or modifying the base BOW. Next, we discuss three different vectors: frequency vectors (count vectors), One Hot Encoding, and Term Frequency/Inverse Document Frequency.

2.1.2 Frequency Vectors

This is the simplest encoding technique, yet it is still effective in some use cases.

Simply, we fill the document vector with the count of how many times each word appears in the document. As an example, let us say our corpus has two documents. While the first one contains "Alice loves pasta", the second document contains "Alice loves fish. Alice and Bob are friends". To represent the counts, we can either use a table as in Table 2, or JavaScript Object Notation (JSON) as in Listing 2. We can also combine both JSON notations into a single one, Listing 3:

Table 2: Table representation form

       Alice  loves  pasta  fish  and  Bob  are  friends
doc1     1      1      1     0     0    0    0      0
doc2     2      1      0     1     1    1    1      1

doc1: {"Alice":1, "loves":1, "pasta":1}
doc2: {"Alice":2, "loves":1, "fish":1, "and":1, "Bob":1, "are":1, "friends":1}

Listing 2: JSON representation form

{"Alice":3, "loves":2, "pasta":1, "fish":1, "and":1, "Bob":1, "are":1, "friends":1}

Listing 3: Combined JSON representation

As you can see, we have 8 unique words in our corpus. Therefore, our vector will have a size of 8. To represent document 1, we simply take the first row in our table: [1, 1, 1, 0, 0, 0, 0, 0]. This vector helps in comparing documents. While this technique is helpful in some use cases, it has some limitations: it does not keep the document structure (it does not preserve the order of the words, it only counts them); it suffers from the sparsity problem (most of the values in the vector are zeros, which increases the time complexity and adds bias to the model); and the stop words (such as 'and', 'or', 'is', 'the', etc.) appear many more times than the other words. Therefore, we use techniques such as stemming and lemmatization, and we also remove the stop words and the rare words that appear only a few times in the entire corpus.
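The count-vector construction above can be sketched in a few lines of Python. This is a minimal illustration for the two example documents; the period after "fish" is dropped so that whitespace splitting suffices as a tokenizer:

```python
# Build count vectors for the two example documents.
# Vocabulary order follows Table 2; punctuation is omitted for simplicity.
docs = ["Alice loves pasta",
        "Alice loves fish Alice and Bob are friends"]

vocab = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]

def count_vector(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [count_vector(d, vocab) for d in docs]
print(vectors[0])  # [1, 1, 1, 0, 0, 0, 0, 0]
print(vectors[1])  # [2, 1, 0, 1, 1, 1, 1, 1]
```

The two printed vectors reproduce the rows of Table 2 for doc1 and doc2.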

2.1.3 One Hot Encoding

As discussed for frequency vectors, tokens that appear frequently have more magnitude than others that appear less. Therefore, the One Hot Encoding (OHE) vector provides a boolean vector as a solution to this problem, where we fill the vector with only 1's and 0's: we place a 1 if the word appears in the document (1 instead of the count) and a 0 otherwise.

Document 2 can be represented as [1, 1, 0, 1, 1, 1, 1, 1].

One Hot Encoding can also be used to represent individual words: 1 for the word that we want to represent and 0 for the rest. The word "Alice" can be represented as [1, 0, 0, 0, 0, 0, 0, 0], or we can add the count as well, so "Alice" can be represented as [3, 0, 0, 0, 0, 0, 0, 0].
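Both uses of One Hot Encoding can be sketched directly, reusing the example vocabulary:

```python
vocab = ["Alice", "loves", "pasta", "fish", "and", "Bob", "are", "friends"]
doc2 = "Alice loves fish Alice and Bob are friends".split()

# Document-level OHE: 1 if the term occurs anywhere in the document, 0 otherwise.
doc_ohe = [1 if term in doc2 else 0 for term in vocab]
print(doc_ohe)   # [1, 1, 0, 1, 1, 1, 1, 1]

# Word-level OHE: 1 only at the position of the word itself.
word_ohe = [1 if term == "Alice" else 0 for term in vocab]
print(word_ohe)  # [1, 0, 0, 0, 0, 0, 0, 0]
```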

2.1.4 Term Frequency/Inverse Document Frequency

So far we have been treating each document as a standalone entity without looking at the context of the corpus. TF-IDF is one of the common techniques to normalize the frequency of tokens in a document with respect to the corpus context. TF-IDF combines two quantities:

1. Term frequency tf(t, d): how frequently a term (t) occurs in a document (d). If we denote the raw count by f(t, d), then the simplest tf scheme is tf(t, d) = f(t, d) (other schemes are discussed below), and let us denote the total number of words in document d by len(d). For example, to rank documents that are most related to the query "the blue sky", we count the number of times each word occurs in each document. However, since documents differ in size, it is not fair to compare how many times a word occurs in a document with 10 words and in a document with 1M words. Therefore, we scale tf to prevent the bias toward long documents as follows: tf(t, d) = f(t, d) / len(d). Other tf schemes that adjust and reduce the count of the most repeated words in a document include:

• Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise

• Term frequency adjusted for document length: tf(t, d) = f(t, d) / len(d)

• Logarithmically scaled frequency: tf(t, d) = log(1 + f(t, d))

• Augmented frequency: tf(t, d) = 0.5 + 0.5 * f(t, d) / m, where m is the count of the most frequent word in d

2. Inverse Document Frequency: it measures how important a term is. IDF reduces the weight of common words that appear in many documents. Given our previous example "the blue sky", the word "the" is a common word, and therefore the term frequency alone tends to incorrectly emphasize documents that repeat low-information words such as "the". As a solution, we calculate the log() of the total number of documents (D) divided by n, the number of documents in which t appears: idf(t, D) = log(D / n). Finally, TF-IDF can be calculated as: tf-idf(t, d, D) = tf(t, d) * idf(t, D). We then simply place TF-IDF scores in the vectors instead of frequency counts or OHE values.
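The two quantities above can be combined in a short sketch. The three-document corpus here is hypothetical, chosen so that the common word "the" receives a zero weight while the rarer "blue" is weighted up:

```python
import math

# Toy corpus (hypothetical); D is the number of documents.
docs = ["the blue sky".split(),
        "the sky is clear".split(),
        "the cars are fast".split()]
D = len(docs)

def tf(term, doc):
    return doc.count(term) / len(doc)        # length-adjusted term frequency

def idf(term, docs):
    n = sum(1 for d in docs if term in d)    # number of documents containing the term
    return math.log(D / n)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("the", docs[0], docs))   # 0.0: "the" occurs in every document
print(tf_idf("blue", docs[0], docs))  # positive: "blue" is rare in the corpus
```

Because "the" appears in all three documents, idf("the") = log(3/3) = 0, so its TF-IDF weight vanishes exactly as the text describes.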

2.2 Embedding Models

2.2.1 What are Embeddings?

Embeddings are a type of knowledge representation in which each textual variable is represented by a vector (think of it as a list of numbers for now). A textual variable could be a word, a node in a graph, or a relation between two nodes in a knowledge graph. These vectors go by different names, such as space vectors, latent vectors, or embedding vectors. They represent a multidimensional feature space on which machine learning methods can be applied. Therefore, we need to shift how we think about language: from a sequence of words to points that occupy a high-dimensional semantic space, where points can be close together or far apart.

Figure 4: Image Source: (Embeddings: Translating to a Lower-Dimensional Space) by Google.

2.2.2 Why Do We Need Embeddings?

The purpose of this representation is to get words with similar meanings (semantically related) to have similar representations and be closer to each other after plotting them in the space. Why is that important? Well, for many reasons, mainly:

1. Computers do not understand text and the relations between words, so you need a way to represent these words with numbers, which is what computers understand.

2. Embeddings can be used in many applications such as question answering systems, recommendation systems, sentiment analysis, and text classification, and they also simplify search and synonym retrieval. Let us take a simple example to understand how embeddings help with all of that.

2.2.3 Simple Embeddings Example

For the sake of simplicity, let us start with this example: consider the words "king", "queen", "man", and "woman", represented by the vectors [9, 8, 7], [5, 6, 4], [5, 5, 5], and [1, 3, 2], respectively. Figure 4 depicts these vector representations. Notice that the word "king" and the word "man" are semantically related in that both represent a male human. However, the word "king" has an extra feature, which is royalty. Similarly, the word "queen" is similar to "woman" but has the extra royalty feature as well.

Since the relation between "king" and "queen" (male royalty - female royalty) is similar to the relation between "man" and "woman" (male human - female human), subtracting them from each other gives us the famous equation: (king - queen = man - woman). Note that when we subtract two words from each other, we subtract their vectors.

2.2.4 The Magic Behind the Embeddings

Suppose we do not know the female counterpart of "king"; how can we get it? Well, since we know that (king - queen = man - woman), we rearrange the formula to (queen = king - man + woman), which makes sense: if you remove the male gender from "king" (royalty is the remainder) and then add the female gender to royalty, you get what we are looking for, which is "queen", Figure 5.
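With the toy vectors from the example above, the analogy can be checked by component-wise arithmetic:

```python
# Toy vectors from the example above.
vectors = {
    "king":  [9, 8, 7],
    "queen": [5, 6, 4],
    "man":   [5, 5, 5],
    "woman": [1, 3, 2],
}

# queen = king - man + woman, computed component-wise.
result = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
print(result)                      # [5, 6, 4]
print(result == vectors["queen"])  # True
```

The toy vectors were chosen so the analogy holds exactly; with real embeddings one instead searches for the *nearest* vector to king - man + woman.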

Now we know embeddings can be helpful in question answering systems. Other examples are similar: (USA - English = France - French), (Germany - Berlin = France - Paris). Moreover, embeddings are also helpful in simple recommendation tasks. For example, if someone likes "orange", then we look at the vectors most similar to the vector that represents "orange" and we get the vectors for "apple", "cherry", and "banana". As we can see, the better the representation (list of numbers) we get for each word, the better the accuracy our recommendation system achieves. So the remaining question is: how do we come up with this list of numbers (called an embedding, latent, or space vector) for each word?

Figure 5: Image by (Kawin Ethayarajh), Why does King - Man + Woman = Queen? Understanding Word Analogies.

FoodKG is a unique software system in its type and purpose; there are no other systems or tools that have the same features. Our main work falls under graph embedding techniques. Embedded vectors learn the distributional semantics of words and are used in different applications such as Named Entity Recognition (NER), question answering, document classification, information retrieval, and other machine learning applications [70]. The embedded vectors mainly rely on calculating the angle between pairs of words to check their semantic similarity and to perform other word analogy tasks, such as the common example king - queen = man - woman. The two main methods for learning word vectors are matrix factorization methods such as Latent Semantic Analysis (LSA) [28] and Local Context Window (LCW) methods such as skip-gram (Word2vec) [66]. Matrix factorization generates low-dimensional word representations that capture the statistical information about a corpus by decomposing large matrices using low-rank approximations. In LSA, each row corresponds to a word or a concept, whereas columns correspond to different documents in the corpus. However, while methods like LSA leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. The second family of methods makes predictions within a local context window, as in the Continuous Bag-of-Words (CBOW) model [65]. The CBOW architecture relies on predicting the focus word from the context words, while skip-gram predicts all the context words one by one from a single given focus word. A few techniques have been proposed, such as hierarchical softmax, to optimize such predictions by building a binary tree of all the words and then predicting the path to a specific node.

2.2.5 Word2vec

Word2vec is one of the earliest models designed mainly to embed words rather than sentences or documents. Moreover, the dimensionality of Word2vec vectors is not tied to the number of words in the training data, since the model reduces the dimensions to a fixed size (50, 100, 300, etc.). Word2vec falls under prediction-based embeddings, which predict a word in a given context. Word2vec has two flavors: the Continuous Bag Of Words (CBOW) and the Skip-Gram model. CBOW predicts the probability of a word given a context, whereas Skip-Gram uses the opposite architecture (predicting a context given a single word).

2.2.5.1 CBOW Architecture

We start by specifying a context window size, which marks the beginning and end of each context. Then we get the One Hot Encoding vector for each word. Given the corpus "I like driving fast cars", suppose the window size is 1 (one word before and one word after the target word), the vector dimension is 3, and we want to predict the middle word "driving" from its context words "like" and "fast". Notice that we have only one hidden layer, whose size equals the required vector dimension; this is why the technique is described as learning vector representations. Figure 6 shows the architecture; note that the inputs are the words in the context window and the output is the learned representation of the target word. Also note that no activation function is applied to the hidden layer; however, the output layer utilizes Softmax.

The output of the previous neural network is the weight matrix in Figure 7.

After obtaining the weight matrix, we multiply it by the One Hot Encoding vector of the target word to get its representation vector, Figure 8.

Multiplying the weight matrix by a vector filled with zeros except for a single 1 may sound useless at first; of course, the output is just the row at that position in the matrix. Consider the example in Figure 9.

Figure 6: CBOW architecture

Figure 7: The generated weights

Figure 8: Multiplication to generate the embedding vector

Figure 9: Multiplication to generate the embedding vector

Figure 10: Skip-Gram architecture

Well, the real purpose of this multiplication is just to look up the target word's vector based on its position in the One Hot Encoding vector.
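This row-lookup behavior can be verified with a tiny example. The weight values below are hypothetical:

```python
# The weight matrix (hypothetical values): one row per vocabulary word.
W = [[0.2, 0.7, 0.1],   # embedding of word 0
     [0.9, 0.4, 0.3],   # embedding of word 1
     [0.5, 0.8, 0.6]]   # embedding of word 2

one_hot = [0, 1, 0]     # one-hot vector selecting word 1

# one_hot (1x3) times W (3x3) -> the embedding (1x3)
embedding = [sum(one_hot[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))]

print(embedding)          # [0.9, 0.4, 0.3]
print(embedding == W[1])  # True: the product is simply a row lookup
```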

2.2.5.2 Skip-Gram

Skip-Gram, sometimes called the Skip-N-gram model, uses the mirrored architecture of CBOW, and the rest is the same. Figure 10 shows the architecture for Skip-Gram, where we try to predict all the words within a window size given a single focus word.

2.2.6 GloVe

Recently, Pennington et al. [77] shed light on GloVe, an unsupervised learning algorithm for generating embeddings by aggregating global word-word co-occurrence counts: a matrix tabulates the number of times word j appears in the context of word i.

GloVe is a word embedding model trained on the co-occurrence matrix counts. It uses corpus statistics by minimizing a least-squares error in order to obtain the word vector space.

2.2.6.1 Co-occurrence Matrix

Given a corpus having V unique words, our co-occurrence matrix X will be of size V×V, where each row i corresponds to a unique word in the corpus and each entry X_ij denotes the number of times word j occurs within the window around word i. Given the sentence "the dog ran after the man" and a window size of 1, we get the matrix in Figure 11.
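The matrix in Figure 11 can be reproduced by counting neighboring pairs; a minimal sketch for the example sentence:

```python
from collections import defaultdict

# Co-occurrence counts with window size 1 for the example sentence.
tokens = "the dog ran after the man".split()
window = 1

cooc = defaultdict(int)
for i, w in enumerate(tokens):
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            cooc[(w, tokens[j])] += 1

print(cooc[("the", "dog")])    # 1
print(cooc[("after", "the")])  # 1
print(cooc[("the", "man")])    # 1
```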

2.2.7 FastText

FastText is another embedding model, created by the Facebook AI Research (FAIR) group for efficient learning of word representations and sentence classification [17]. FastText treats each word as a combination of character n-grams, where n can range from 1 to the length of the word. Therefore, fastText has some advantages over Word2vec and GloVe, such as providing vector representations for rare words that may not appear in the Word2vec and GloVe vocabularies. N-gram embeddings also tend to perform better on smaller datasets.

Figure 11: GloVe co-occurrence matrix for "the dog ran after the man"

FastText supports training using different architectures, such as CBOW or Skip-Gram, with softmax or hierarchical softmax loss functions, or negative sampling. Each word is represented as a bag of character n-grams in addition to the word itself, Figure 12.

According to the FastText authors, the neural network consists of a single layer. First, the Bag-of-Words representation is fed to a lookup layer, where the embeddings are retrieved for every word. These embeddings are then averaged to obtain a single averaged embedding for the whole text.
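The character n-gram decomposition FastText relies on is easy to sketch. FastText wraps each word in boundary markers before slicing it (the word "where" is the example commonly used):

```python
# FastText-style character n-grams: the word is wrapped in the boundary
# markers '<' and '>', then split into overlapping n-grams.
def char_ngrams(word, n=3):
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Summing the vectors of these n-grams (plus the whole word) is what lets FastText produce a vector even for words never seen during training.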

2.2.8 GEMSEC

A knowledge graph embedding is a type of embedding in which the input is a knowledge graph, leveraging the relations between the vertices (triple-based).

Figure 12: FastText architecture for a sentence with N-gram features. The features are embedded and averaged to form the hidden variable [6].

We consider Holographic Embeddings of Knowledge Graphs (HolE) to be the state-of-the-art knowledge graph embedding model [72]. However, when the input dataset is a graph instead of a text corpus, we apply different embedding algorithms such as LINE [100], Node2vec [39], M-NMF [105], and DANMF [107]. DeepWalk is one of the common models for graph embedding [78]. DeepWalk leverages language modeling and deep learning to learn latent representations of vertices in a graph by analyzing and applying random walks. A random walk in a graph is the analogue of a sentence in a corpus: the sequences of nodes that frequently appear together within a specific window size are treated as sentences. This technique also uses skip-gram to minimize the negative log-likelihood of the observed neighborhood samples. GEMSEC is another graph embedding algorithm that learns node clustering while computing the embeddings, whereas the other models do not utilize clustering, Figure 13. It relies on sequence-based embedding combined with clustering, so that the embedded nodes are clustered simultaneously. The algorithm places the nodes in an abstract feature space so as to minimize the negative log-likelihood of the preserved neighborhood nodes while clustering the nodes into a specific number of clusters.
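The random-walk generation that DeepWalk builds on can be sketched in a few lines. The graph below is a hypothetical toy example; in DeepWalk the resulting walks are fed to a skip-gram model exactly like sentences:

```python
import random

# A small toy graph (hypothetical edges) as an adjacency list.
graph = {"A": ["B", "C"],
         "B": ["A", "C"],
         "C": ["A", "B", "D"],
         "D": ["C"]}

def random_walk(graph, start, length, rng):
    """Uniform random walk of the given length starting at `start`."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(42)  # fixed seed for reproducibility
walks = [random_walk(graph, node, 5, rng) for node in graph]

# Each walk plays the role of a "sentence" of nodes; feeding all walks
# to a skip-gram model yields node embeddings, as DeepWalk does.
for w in walks:
    print(" ".join(w))
```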

Figure 13: Zachary's Karate club visualized using DeepWalk and GEMSEC embeddings. White nodes correspond to the instructor's community and blue nodes to the president's community [84].

Graph embeddings capture the semantics between concepts better than word embeddings, which is the reason we use a graph embedding model to exploit graph semantics in FoodKG.

CHAPTER 3

RELATED WORK

In the first part of this chapter, we present a brief introduction to several approaches and tools that are used to convert various data formats to the RDF data model, and the reasons why we chose the Karma integration tool.

In the second part of the chapter, we present the most reliable semantic networks and the reasons behind choosing ConceptNet to work with in our project.

In the third part, we present the state-of-the-art embedding models (graph, knowledge graph, and word embeddings).

3.1 Converting Databases to RDF Model

3.1.1 R2RML

R2RML stands for RDB 2 RDF Mapping Language. R2RML is a language that provides Direct Mapping (DM) from relational databases to the RDF model in a customized way [41]. Direct mapping enables users to express vocabularies (relationships) and structures of their own choice. One of the best features R2RML provides is allowing SPARQL Protocol and RDF Query Language (SPARQL) endpoint queries over the mapped relational data. For instance, consider the following data in Table 3 that needs to be converted to the RDF model:

Table 3: A single row from the employees' database

EMPNO                 ENAME            JOB             DEPTNO
Integer, Primary Key  Characters(100)  Characters(20)  Integer, Dept. No.
7369                  SMITH            CLERK           10

Using R2RML, a user may write a mapping as the following, Listing 4:

@prefix rr: .
@prefix ex: .

<#TriplesMap1>
    rr:logicalTable [ rr:tableName "EMP" ];
    rr:subjectMap [
        rr:template "http://data.example.com/employee/{EMPNO}";
        rr:class ex:Employee;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rr:column "ENAME" ];
    ].

Listing 4: R2RML mapping example

Using this snippet of code, R2RML will map the given table to the RDF model in the following format, Listing 5:

<http://data.example.com/employee/7369> rdf:type ex:Employee .
<http://data.example.com/employee/7369> ex:name "SMITH" .
<http://data.example.com/employee/7369> ex:department .

Listing 5: Generated RDF triples

R2RML provides a mapping language that many other languages and platforms rely on for mapping. However, users of R2RML need advanced knowledge of R2RML, as well as extra knowledge about RDF models. Moreover, it does not provide a user interface, which makes it harder to read and understand.

3.1.2 D2RQ-ML

D2RQ-ML is a declarative mapping language that maps relational databases to the RDF model. It provides SPARQL access, since D2RQ is written in RDF syntax. D2RQ provides virtual access to graphs and databases, analogous to views in SQL, to process users' queries [25, 30]. D2RQ requires users to write a template (which can be considered the mapping structure) in order to use it. Writing a template for a single database consumes a lot of time and effort, and the complexity of this procedure increases with the number of columns in a database and the number of databases.

Furthermore, users are required to have advanced knowledge of writing templates and to be familiar with D2RQ syntax, because it does not provide a Graphical User Interface (GUI). The following code snippet (Listing 6) shows the syntax of D2RQ-ML.

map:Database1 a d2rq:Database;
    d2rq:jdbcDSN "jdbc:mysql://localhost/iswc";
    d2rq:jdbcDriver "com.mysql.jdbc.Driver";
    d2rq:username "user";
    d2rq:password "password";
    .
map:Conference a d2rq:ClassMap;
    d2rq:dataStorage map:Database1;
    d2rq:class :Conference;
    d2rq:uriPattern "http://conferences.org/comp/confno@@Conferences.ConfID@@";
    .
map:eventTitle a d2rq:PropertyBridge;
    d2rq:belongsToClassMap map:Conference;
    d2rq:property :eventTitle;
    d2rq:column "Conferences.Name";
    d2rq:datatype xsd:string;
    .

Listing 6: D2RQ-ML syntax [97]

3.1.3 SML

Sparqlification Mapping Language (SML) is a mapping language that maps RDB to the RDF model. SML offers a better mapping syntax than R2RML [4]; Figure 14 illustrates how SML syntax is easier and more understandable for users.

Although SML made the syntax easier, it still does not provide a GUI. Likewise, a user is still required to have advanced knowledge of SML syntax before using it, which is not an easy task for most users who just want to convert their data to the RDF data model. In addition, it is time-consuming, since a user has to map each column manually.

34 Figure 14: Comparison between SML and R2RML [97]

3.1.4 Apache Any23

Apache Any23 is a web service, library, and command-line tool that extracts and produces RDF triples from various web documents [75]. The name Any23 was derived from "Anything to Triples". To be more specific, the currently supported formats are:

• RDF/XML, Turtle, Notation 3.

• RDFa with RDFa1.1 prefix mechanism.

• Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, XFN and Species.

• HTML5 microdata, such as schema.org

• JSON-LD: JSON for linked data.

Figure 15: Any23, list of extractors [1]

• CSV: Comma Separated Values with separator auto detection.

Apache Any23 is written in Java, and it can be accessed using the command line, Figure 15. In this case, a user is required to be familiar with all the commands that Any23 permits. It does not provide any GUI either; consequently, a user cannot see the changes made to the database during the mapping process. Figure 15 illustrates the extractors with which Any23 is able to produce RDF.

3.1.5 CSV2RDF

CSV2RDF is a simple but powerful library that allows mapping from Comma Separated Values (CSV) to RDF, Figure 16. CSV2RDF is similar to D2RQ-ML in that both require a user-written template to provide the mapping structure. We present a simple table with the required template to map it to the RDF model [30].

Figure 16: The required template to convert Table 4 to the RDF model [2]

Figure 16 shows the required template to convert Table 4 to the RDF model [2].

Table 4: Example of dealer database

Year  Make   Model                          Description    Price
1997  Ford   E350                           ac, abs, moon  3000
1997  Chevy  Venture "Extended Edition"     -              4900
1999  Chevy  Venture "Extended Edition"...  -              5000
1996  Jeep   Grand Cherokee                 MUST SELL!...  4799

A user has to write their own code to parse the data and then use this template within that code. This may take a lot of time, especially when a user is dealing with large databases that contain many columns, or with many databases.

3.1.6 BioDSL

BioDSL is a new approach to mapping CSV files to RDF using a Domain Specific Language (DSL) called BioDSL, Figure 17. BioDSL allows users to write programs that map biodiversity data to RDF format and then link these RDFs to Linked Data.

Figure 17: BioDSL, mapping syntax [97]

BioDSL uses the Groovy programming language, whose syntax is based on objects and functions [49]. A CSV table represents its entities as columns; BioDSL has an object called "csv" to represent each column name, for instance (csv.ename) denotes the csv object that represents the "ename" column in a table. BioDSL provides a function called "Map" that maps the represented column to RDF along with its relationships to other columns in that particular table. Other functions define how the URIs are generated for each RDF triple based on the table classes. BioDSL has two main parts: the first part loads ontologies and the second part maps data to the RDF model. Figure 17 illustrates these two parts in a single example.

Similar to the previous approaches, BioDSL does not provide a GUI. Although BioDSL provides better features than the previous approaches, it still consumes both time and effort from users.

38 3.1.7 Open Refine

The former name of Open Refine was Google Refine. Google stopped supporting the project in October 2012, and since that time the name has changed from Google Refine to Open Refine.

Open Refine is a tool that contains many plugins performing many different tasks. RDF Refine is the plugin that Open Refine uses to map data from a CSV database to the RDF model. Open Refine is an effective tool for working with messy data: it allows users to clean data and transform it from one format into many different formats. Furthermore, Open Refine offers a neat GUI for users and has made many contributions to the world of the Semantic Web [103]. A user may use Open Refine without advanced knowledge of writing code, or any experience with writing templates or providing URIs for RDF graphs. On top of that, Open Refine provides many other features, such as:

• Importing data to Open Refine in various formats.

• Exploring the whole dataset within seconds.

• Applying basic and advanced cell transformations.

• Providing features to handle cells that contain multiple values.

• Creating instantaneous links within the dataset.

• Filtering and joining datasets with regular expressions.

• Providing automatic named-entity extraction.

• Performing advanced operations on datasets.

• Changing values and links of the cells directly.

• Displaying results to the user after each operation instantaneously.

Additionally, Open Refine delivers powerful features for minimizing and simplifying databases. For instance, take a look at Table 5.

Table 5: Example of dealer database

Speed    Car       PayLoad      Image Link
-        Mercedes  -            Link1
200 mph  Mercedes  2200 pounds  Link2
160 mph  Toyota    1620 pounds  Link1
-        Toyota    -            Link2
-        Toyota    -            Link3

Open Refine starts by exploring the given database and then sorts the columns based on the relationships. After sorting the given table, a new table is generated that looks like Table 6.

Table 6: Example of dealer database

Car       Speed    PayLoad      Image Link
Mercedes  -        -            Link1
Mercedes  200 mph  2200 pounds  Link2
Mercedes  -        -            Link3
Toyota    160 mph  1620 pounds  Link1
Toyota    -        -            Link2
Toyota    -        -            Link3

The next step is deleting the empty cells and combining the values for each entity. The generated Table 7 is depicted below:

Table 7: Example of dealer database

Car       Speed    PayLoad      Image Link
Mercedes  200 mph  2200 pounds  Link1, Link2, Link3
Toyota    160 mph  1620 pounds  Link1, Link2, Link3
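The merge from Table 6 to Table 7 can be sketched as a simple group-and-combine step. This is our own illustrative Python, not Open Refine's internal implementation:

```python
# Rows of Table 6 (empty cells shown as "").
rows = [
    {"Car": "Mercedes", "Speed": "",        "PayLoad": "",            "Link": "Link1"},
    {"Car": "Mercedes", "Speed": "200 mph", "PayLoad": "2200 pounds", "Link": "Link2"},
    {"Car": "Mercedes", "Speed": "",        "PayLoad": "",            "Link": "Link3"},
    {"Car": "Toyota",   "Speed": "160 mph", "PayLoad": "1620 pounds", "Link": "Link1"},
    {"Car": "Toyota",   "Speed": "",        "PayLoad": "",            "Link": "Link2"},
    {"Car": "Toyota",   "Speed": "",        "PayLoad": "",            "Link": "Link3"},
]

merged = {}
for row in rows:
    rec = merged.setdefault(row["Car"], {"Speed": "", "PayLoad": "", "Links": []})
    rec["Speed"] = rec["Speed"] or row["Speed"]        # keep first non-empty value
    rec["PayLoad"] = rec["PayLoad"] or row["PayLoad"]
    rec["Links"].append(row["Link"])                   # collect all image links

print(merged["Mercedes"])
# {'Speed': '200 mph', 'PayLoad': '2200 pounds', 'Links': ['Link1', 'Link2', 'Link3']}
```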

On top of that, it is easy to install on many different operating systems: Open Refine is compatible with Mac, Windows, and Linux.

Open Refine enables users to group together all identical cells. Using a text facet, a user may change one of these identical cells and the change will be applied to the other cells. It also provides clustering to organize such groups based on special characteristics; using a cluster allows users to further control these groups, for example by sorting, replacing, or formatting. At any time, a user may undo any operation and the changes are displayed instantly. Using Open Refine, a user may import any ontology to be used. Open Refine also allows SPARQL endpoint queries to be applied to the dataset and lets users choose what type of reconciliation to use. A user is not required to write any code; instead, a user may choose the vocabularies to be applied to a table from a list. Users are provided with a graph that illustrates the structural model of a table before the mapping starts.

However, in the latest version of Open Refine, while we were mapping a CSV file to the RDF data model, it was unable to finish the reconciliation step completely. Without finishing the reconciliation step, no RDF triples are generated. We tried many different datasets on different machines and repeatedly received the same outcome. As a result, we did not use Open Refine in our project.

3.1.8 Karma Integration Tool

Karma is an information integration tool that enables users to integrate data faster and more easily from different data sources such as databases, delimited files, spreadsheets, XML, KML, JSON, and Application Programming Interfaces (APIs). Users integrate data based on their choice of ontologies. One of the best features of Karma is that it learns the mapping from data to classes and proposes a model that ties these classes together: Karma generates models based on what it has learned from users' previous mappings [79]. A user may simply adjust and normalize the suggested model. Once the model is complete, a user may publish the integrated data as RDF or store it in a dataset. Karma provides many valuable features, which we discuss one by one:

1. Ease of use: Karma offers a very simple user interface, which allows users to easily perform all the desired tasks. Karma uses programming-by-example to learn the mapping models and algorithm optimization to automate the process as much as possible. This is one of the most valuable features for users, since other tools require users to go through the columns one by one to set the relationships between them. Using Karma, a user has to map the columns the first time only; afterward, Karma proposes to users what it has learned. This learning technique saves a lot of time and effort and prevents errors that might occur on the user's side.

2. Hierarchical sources: We have discussed many tools that map databases to RDF. Karma is the first tool that supports hierarchical data sources such as XML, JSON, and KML. This feature makes Karma a unique tool.

3. Web APIs: Karma supports importing data both from static sources such as databases and files and from APIs that expose thousands of data sources.

4. Semantic models: Karma provides a few ontologies to use by default. In addition, Karma allows a user to upload any ontology; it recognizes ontologies of different types and file extensions.

5. Scalable processing: A user may work on a subset of a table to define the desired RDF model. During this process, Karma learns the structure of the user's model; after the user imports a larger dataset, Karma proposes a modeling structure.

6. Data transformation: Karma enables users to transform data expressed in various formats into a common format. Karma is a capable tool when it comes to mapping CSV tables into RDF triples.

7. Mapping visualization: This is another unique feature. Karma displays a simple visualization graph, which makes it easy to read the relationships between the columns.

8. Editing a dataset: After importing data into Karma, a user may easily delete, add, swap, move, and change the values of columns and cells.

Karma provides semantic labeling, which is the process of mapping columns in a database to the classes in an ontology [81]. Semantic labeling is a very challenging task when dealing with heterogeneous data types and a variety of data formats. Other techniques use machine learning to extract features tied to the data of one domain, which means the model has to be re-trained for each new domain, whereas Karma uses machine learning together with similarity metrics to return correct semantic labels for data more quickly. Karma assigns 2 gigabytes of space to users, and a user may increase this space up to 16 gigabytes. All these features can be applied through a simple and easy user interface. Therefore, based on all the aforementioned features and powerful services, we decided to use Karma in our project for mapping databases to RDF models.

3.2 Enriching a Dataset with Extra Triples Based on the Existing Ones

This is the heart of our project, considering that there are no tools or web services that add extra triples to a dataset based on the semantic similarity between the existing triples. For this purpose, we utilize the ConceptNet web service, which provides related terms for a word together with a weight for each edge that we can use in our calculations. In this section, we briefly introduce the most common and powerful semantic networks and the reasons why we chose ConceptNet.

3.2.1 DBpedia

DBpedia is a project available on the World Wide Web (WWW) that extracts knowledge and facts from Wikipedia. DBpedia is one of the largest stores of semantic relationships on the web. It allows its users to semantically query the information extracted from Wikipedia. DBpedia is known as one of the most famous linked data networks, as Tim Berners-Lee described it. Its articles are based on the infobox, which is a structured box of information generated from Wikipedia [14]. DBpedia contains almost 4.58 million entities. Of these, 4.22 million entities were classified under consistent ontologies. These entities include persons, places, games, films, albums, organizations, and many other aspects in more than 125 different languages [10]. Moreover, DBpedia uses SPARQL to query Wikipedia factual information. For instance, let us say a user was interested in the Japanese shōjo manga series Tokyo Mew Mew and wanted to get some information about it. With a simple query, DBpedia allows a user to combine information about Tokyo Mew Mew from Wikipedia, Figure 18.

Figure 18: DBpedia semantic query [5]

This query lists the related genres of Tokyo Mew Mew from Wikipedia. In addition, the DBpedia dataset is interlinked with various open datasets on the internet to enrich DBpedia's knowledge. There are more than 45 interlinks between DBpedia and external datasets, including OpenCyc, Freebase, the CIA World Factbook, GeoNames, Bio2RDF, and MusicBrainz. These interlinks provide a substantial amount of information and facts, and this combination of datasets makes DBpedia a powerful source of semantic relationships between different entities.

Figure 19: DBpedia results for running ”Lemon” [5]

Despite all of the powerful features DBpedia provides, most of DBpedia's information is based on Wikipedia. Wikipedia has been criticized for presenting truths, half-truths, and falsehoods. On top of that, Wikipedia's articles can be edited by any user with an internet connection, which leaves many Wikipedia topics open to spin and manipulation [75]. Therefore, any DBpedia facts that were extracted from inaccurate information are themselves questionable. Furthermore, the results of DBpedia are messy, as it returns every page in which the word was mentioned. A returned result could be a synonym, a related word, a meaning in a different language, or just a random article in which the concept was mentioned. For instance, we ran the word ”Lemon” on DBpedia and it returned the results shown in Figure 19.

These results are considered the terms and synonyms most related to the word ”Lemon”, yet a user could barely understand in what context these results were mentioned. Hence, for the reasons mentioned above, we did not use DBpedia in our project for comparing the semantic similarity between concepts.

Figure 20: Dandelion semantic results for comparing ”Lemon” and ”Lime” [3]

3.2.2 Dandelion

Dandelion is a web service that performs different text-processing operations, including text extraction, text similarity, text classification, and sentiment analysis [3]. In our project, we are interested in text similarity and related terms. Hence, we ran a few test cases to see how Dandelion behaves. The first test case compared the two related terms ”Lemon” and ”Lime” (see Figure 20). Dandelion returned a score of 0, which means that there is no semantic similarity at all between these terms. Similarly, we ran many other pairs of similar terms, such as ”Flower” and ”Rose”; in all of these experiments, Dandelion returned 0.

Dandelion cannot be used to compare single terms; instead, it is meant to compare phrases. Consequently, we ran the two related phrases ”Lemon is healthy” and ”I do not like lime”. The returned result was again 0. We ran many other semantically related sentences and got the same result each time, Figure 21.

As illustrated, Dandelion does not provide accurate results unless the exact term is repeated in both texts. Therefore, we did not use Dandelion services in our project.

Figure 21: Dandelion results for comparing two phrases [3]

3.2.3 ParallelDots

ParallelDots is another web service that uses Artificial Intelligence (AI) to analyze texts and detect the semantic similarity between two given texts. ParallelDots is very similar to Dandelion in its functionality and its relevance to our project. ParallelDots analyzes the relatedness between texts by eliminating redundancy; it can be helpful for publishers, bloggers, researchers, and engineers who want to check the relatedness between two texts. ParallelDots compares both the structure and the meaning of the texts [7], and it also extracts similar sentences and similar ideas from the corpus. ParallelDots returns a score from 0 to 5, where 0 means no similarity at all and 5 means the texts are almost the same. ParallelDots is an effective tool when running sentences and phrases, as shown in Figure 22.

However, ParallelDots does not support comparing two single terms. Figure 23 illustrates how it returns 0 when comparing two terms.

Figure 22: Results of ParallelDots for comparing two phrases [7]

Figure 23: Results of ParallelDots when comparing two terms [7]

3.2.4 WordNet

WordNet is a lexical database of the English language that presents short definitions, examples of how terms are used, and the relations among synsets, Figure 24. A synset is the group of all the synonyms that WordNet records for a term. WordNet can be considered a dictionary that provides extra features for its users. WordNet includes a lexical dictionary that contains nouns, verbs, adverbs, and adjectives but ignores the remaining function words, including prepositions [12]. The knowledge structure of WordNet organizes both nouns and verbs into hierarchies defined by relationships such as ”Is A”. For instance, the word ”dog” is represented by a hierarchy in which each level represents a different synset, and each synset has a unique index, as shown in Figure 24.

Figure 24: WordNet hierarchy for nouns and verbs [11]

Each hierarchy contains 25 levels for nouns and 15 levels for verbs; adjectives and adverbs are ordered in the subsequent levels. The main goal of WordNet is to build a lexical database that is consistent with human semantic theories. WordNet is known for its relationships between concepts; therefore, it is often used as a lexical ontology in the Computer Science field. As an ontology, WordNet provides users with nouns, verbs, and the relationships between them. This feature saves a significant amount of time and effort for users when building new ontologies.

Despite all of the powerful features that WordNet provides, there are still some limitations. For instance, WordNet does not include etymology in its data, and it does not provide much information on usage. Moreover, the goal of WordNet is to cover everyday English concepts without including much domain-specific terminology.

WordNet provides lexical comparison of English words and is limited to dictionary words. Hence, it does not cover brand names, organizations, foods, places, and other such concepts. The purpose of our project is to enhance FEW databases that might contain information on organization names, countries, food ingredients, etc. In this case, WordNet will not be sufficient for enriching such databases. The test cases in Figures 25 and 26 show how WordNet is effective when using dictionary words and how poorly it performs on other general terms.

Figure 25: Results returned from WordNet [76]

Figure 26: WordNet results when running a general term [76]

Based on Figure 25, we can see that WordNet provides reliable information when running a dictionary term like ”dog”. But when a user runs a general name, such as the car brand ”Ferrari”, WordNet does not provide any information, Figure 26.

Even when the user adds another dictionary term, ”car”, to the name ”Ferrari”, WordNet still does not provide any information; see Figure 27.

Therefore, based on the previous examples, we did not use WordNet to compare the semantic relationships between concepts in our datasets.

Figure 27: WordNet result when running a phrase that contains a dictionary-based term [76]

3.2.5 Document Similarity

Document Similarity (DS) is another web service that provides the semantic similarity between two concepts or two texts. DS is powerful when running related concepts, since its calculations are based on advanced Natural Language Processing (NLP) [26]. NLP is a field of Computer Science (CS) that draws on artificial intelligence and computational linguistics to improve the interaction between humans and computers [89]. NLP provides powerful approaches for extracting concepts from a text and for comparing two different texts. One of its best features is natural understanding of texts on the computer's side, based on semantic analysis of the texts. This allows NLP systems to uncover powerful semantic relationships between concepts. On top of that, NLP offers many other semantic capabilities, such as machine translation, Named Entity Recognition (NER), natural language generation, textual entailment recognition, relationship extraction, and semantic analysis. Based on these features, NLP is known as one of the best resources for semantic operations [16].

DS is one of the services built on NLP, and its results are accurate and reliable. The only reason we did not use DS in our project is that it does not provide users with similar concepts; that is offered as a separate service. Therefore, to check both the similarity between two concepts and the related concepts, we would have to merge these separate services. Instead, we chose to utilize ConceptNet, which provides all of the mentioned services combined.

3.2.6 ConceptNet

ConceptNet is a semantic network that provides reliable information and facts about entities and concepts [44, 56]. ConceptNet uses transitive inference between ideas and concepts, which lets even dissimilar entities share indirect relationships [79]. For instance, when running an entity like ”Lemon” on ConceptNet, it provides the weight of the entity and hundreds of synonyms, related entities, similar words in different languages, types of the entity, properties of the entity, and much other information. ConceptNet also displays the origin of the entity and where it was derived from; this feature interlinks ConceptNet with many other semantic networks that enhance and enrich its knowledge. The weights used in ConceptNet are based on a Markov model, so they are accurate and reliable weights that can be used in our calculations. Another important feature is that ConceptNet provides the conditional probabilities for changing an edge type. For instance, the relationship ”is a” is an edge for a concept X with a probability of 0.13; reversing this relationship to ”has a” changes the probability to 0.5, which shows that reversing a relationship affects the weight of a particular entity [94, 96], see Figure 28.

The latest version of ConceptNet combines data from different resources to improve its state of knowledge. These resources include:

Figure 28: ConceptNet example for a relationship [96]

1. ConceptNet 5.

2. Trusted information and facts from DBpedia.

3. Wiktionary, which provides synonyms, antonyms, translations of terms into hundreds of languages, and multiple labeled word senses; much of ConceptNet's knowledge comes from this resource.

4. Open Multilingual WordNet, which provides dictionary-style knowledge.

5. UMBEL, which connects ConceptNet to the OpenCyc ontology.

6. Knowledge of people's intuitive word associations.

As we can see, ConceptNet delivers trusted knowledge built on the most common networks, such as Wiktionary, DBpedia, and WordNet. Therefore, ConceptNet is the best service for our project due to its trustworthy information. Furthermore, ConceptNet provides information on dictionary-based terms as well as other concepts, including names, organizations, persons, etc. On top of that, ConceptNet accepts both single terms and phrases for analysis.

3.3 Machine Learning and Embedding Models

Computers do not understand text or the relations between words and sentences, so words need to be represented as numbers, which is what computers understand. Such vectors can be used in many applications, such as question answering systems, recommendation systems, sentiment analysis, and text classification, and they also make it easier to search, return synonyms, etc.
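To make the idea concrete, here is a minimal sketch of comparing word vectors with cosine similarity; the three-dimensional vectors below are invented toy values, not real embeddings:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional embeddings; real models such as AGROVEC use ~300 dimensions,
# and these particular numbers are invented for illustration.
embeddings = {
    "corn":  [0.9, 0.1, 0.2],
    "maize": [0.8, 0.2, 0.1],
    "fire":  [0.0, 0.9, 0.1],
}

# Related words point in similar directions, so their cosine is close to 1.
print(cosine_similarity(embeddings["corn"], embeddings["maize"]))
print(cosine_similarity(embeddings["corn"], embeddings["fire"]))
```

Because only the angle between vectors matters, the score is independent of vector length, which makes it a convenient relatedness measure for embeddings.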

3.4 Federated Learning

There are two major types of training algorithms in FL infrastructures: synchronous and asynchronous. While the asynchronous algorithm was implemented earlier and saw much successful work [27], the authors of [38, 92] recently changed the trend towards synchronous batch training. The authors of [61] proposed the FederatedAveraging (FedAvg) algorithm, which has shown huge success in the field, yet it still has some limitations, such as dropping all the devices that fail to finish a specific number of epochs within a specific amount of time [55]. In general, FedAvg provides a way to filter which nodes (data holders) are included in the aggregation. There is a list of requirements that each client has to fulfill in order to be included in the current round, such as having enough phone charge, having a stable internet connection, and the phone not being heavily used. These requirements ensure that the process will not affect the user's normal usage. Furthermore, the server also selects phones with good data and good internet connections to avoid drop-out issues, such as a phone that cannot be reached.
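As a minimal sketch of the FederatedAveraging aggregation step described above (not the authors' implementation), the server can combine client updates as a weighted average, where each client's weight is proportional to its number of local training examples:

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client model parameters (the FedAvg aggregation step).

    client_weights: list of parameter vectors (one flat list per client)
    client_sizes:   number of local training examples per client
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for params, n_k in zip(client_weights, client_sizes):
        for i, w in enumerate(params):
            # Each client contributes in proportion to its data share n_k / total.
            averaged[i] += (n_k / total) * w
    return averaged

# Two clients: the second holds 3x more data, so it dominates the average.
global_model = federated_average([[1.0, 0.0], [2.0, 4.0]], [100, 300])
print(global_model)  # [1.75, 3.0]
```

Client selection (charge level, connectivity, idle state) happens before this step; only the clients that pass the filter contribute their parameters to the average.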

The authors of [73] have proposed a decentralized framework for training models while preserving their privacy. Their proposed protocol, FedCS, helps solve the client selection problem by managing clients based on their resources. FedCS is another protocol that allows client selection before the aggregation phase; for example, clients with a poor internet connection have to be managed in another way. However, such protocols do not select clients based on model accuracy tested locally; rather, they select and group the models that should be included in the current round before the training starts.

CHAPTER 4

APPROACH

In this chapter, we present our method for enriching an RDF triple dataset with new triples related to the existing ones, based on the relationships of the existing triples.

Our project is separated into two parts. The first part converts databases (e.g., in CSV format) into RDF triples. The second part, which is the main part, enhances the converted RDF triples by adding extra triples that enrich the original contents in order to provide stronger and more reliable features.

4.1 Overview

We have developed a program and a new approach to enhance the contents of a database based on its existing contents. Our approach can be separated into two parts.

The first part converts a database table (CSV format) into RDF triples. For this step, we used the Karma tool, since it is a capable tool that delivers helpful services for integrating and converting data from various formats, such as XML, JSON, and KML, into an RDF model. One of Karma's best features is that it supports operations such as converting databases into RDF triples in a fast and easy way using a simple interface. We used Karma for the conversion, but due to the lack of existing ontologies, we developed a new ontology called the FEW ontology. The FEW ontology provides the most common terms used in FEW databases, based on our observations.

The second part of the project is to enrich the resulting RDF dataset with new, related triples. The resulting RDF triples contain exactly the same data as the original database, only in a different format. Entity extraction has to be done first; then, based on the semantic comparison between these entities, new triples are added containing the entities most related to the existing ones.

4.2 Converting a Database Table into RDF Triples

In order to enrich a database table with new information and facts, the database must first be converted to an RDF model, which is not a trivial task. After many experiments, we found the Karma integration tool to be the most efficient tool for this project because of its reliable and accurate results. Mapping a database to an RDF model requires a structure and an ontology, so that the contents of the database are mapped to the RDF model based on the given ontology. In this project, we focus on food, water, and energy data, and for these systems there are not enough existing ontologies. As a consequence, we developed a new ontology that contains the most common terms and concepts used in FEW data. When mapping FEW databases to an RDF model, the FEW ontology was used to specify the desired relationships in the generated RDF triples. Figure 29 shows a few ”Production” relationships that were generated based on the FEW ontology, which has the URI ”http://umkc.edu/ontology/FEW/”.

There was no need to create new classes in our FEW ontology, since we can import other ontologies, such as DBpedia, to reuse their pre-existing classes and relationships, Figure 29.

Figure 29: FEW ontology while using the Karma tool

Using Karma, a user needs to provide the general URI that will be used as the common URI for all the subjects in that particular RDF model. The provided URI describes the location and the primary key of a table. For instance, from the following URI we can understand that it describes the ”bread” row in the ”example” database that exists at ”http://catalog.data.gov/restaurant”, listing 7.
""

Listing 7: Data after extraction
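Such a subject URI can be assembled from its parts as sketched below; the helper name is hypothetical, and the values simply follow the ”bread”/”example” illustration in the text:

```python
def make_subject_uri(base_uri, table, primary_key):
    """Build a common subject URI from a base location, table name, and row key."""
    return f"{base_uri}/{table}/{primary_key}"

# Illustrative values based on the example in the text.
uri = make_subject_uri("http://catalog.data.gov/restaurant", "example", "bread")
print(uri)  # http://catalog.data.gov/restaurant/example/bread
```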

4.3 Enriching RDF Data Triples

Adding extra triples to a dataset based on the existing triples is a complex task, especially when using the semantic similarity between the concepts to validate the related information and facts. This phase of the project contains many levels. In this section, we discuss these levels one by one to make them easier to understand.

4.4 Choosing the Target Triples

Adding extra information and triples to a dataset cannot be done in an unorganized manner. Therefore, in the first step, a number of triples are taken as target triples so that a comparison can be made between them. Then, based on the comparison between the target triples, extra information is added, and the target triples together with the new triples are stored in a file. To achieve the most accurate results when comparing triples semantically, two triples are compared at a time. In particular, the whole target triples are not compared to each other; only the extracted subjects are. We have developed code to extract the concept from each triple by eliminating the part of the subject that is shared with the rest of the subjects. Listing 8 shows the data after extracting the concepts.
"20/8/2017" . "100" . "250" . "13/8/2017" . "20/8/2017" . "100" . "302" . "13/8/2017" .

Listing 8: Data after extraction

At this point, the data is ready for the next step of processing.
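The concept-extraction step described above can be sketched as follows; the subject URIs here are hypothetical examples, not the actual subjects of Listing 8:

```python
from os.path import commonprefix

def extract_concepts(subject_uris):
    """Strip the prefix shared by all subject URIs, leaving the distinguishing concept.

    This mirrors the idea described above: the part common to every subject
    is eliminated, and only the concept term remains.
    """
    prefix = commonprefix(subject_uris)
    return [uri[len(prefix):] for uri in subject_uris]

# Hypothetical subject URIs used only to illustrate the prefix stripping.
subjects = [
    "http://catalog.data.gov/restaurant/example/corn",
    "http://catalog.data.gov/restaurant/example/flower",
]
print(extract_concepts(subjects))  # ['corn', 'flower']
```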

4.5 Running Entities on ConceptNet

In this section, we discuss our approach for calculating an accurate semantic similarity score between any two given concepts. As mentioned before, the target triples are the first two triples in a dataset, and two entities are extracted from these triples. Even though both of the extracted entities are subjects, we will call the first entity the subject and the second entity the object in order to distinguish between them.

First, our program runs the subject (the first entity) on ConceptNet. This returns hundreds of related concepts and synonyms together with their relationship weights. The essential part of our program is that it then analyzes the returned concepts, searching for the object. Once the object occurs among the returned entities, the program stops searching and calculates the semantic similarity between the subject and the object as the average over the path in the searching tree. The maximum depth of the searching tree is determined by the user. If the search reaches the maximum level without finding the object, the search stops and the similarity from the subject to the object is 0.

The next step in our algorithm runs the object on ConceptNet and searches for the subject within the returned entities. If the subject is found before the searching tree reaches its maximum level, the program stops searching and calculates the semantic similarity score based on the number of levels in the searching tree. In contrast, if the searching tree reaches the maximum number of levels without finding the subject, then no relationship is established between the object and the subject.

Figure 30: First level of the searching tree

Hence, there is no relationship at all between the subject and the object.
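The search procedure described above can be sketched as follows. This is an illustrative reconstruction rather than the original code: it runs over a local dictionary of ConceptNet-style related-term weights instead of live API calls, and the graph contents are invented (apart from the corn/maize edge weight of 4.95 mentioned later in this chapter):

```python
from collections import deque

def similarity(graph, subject, obj, max_level=3):
    """Breadth-first search from `subject` for `obj` over weighted related terms.

    Returns the sum of edge weights along the discovered path divided by the
    number of levels traversed, or 0 if `obj` is not found within `max_level`.
    """
    # Each queue entry: (term, accumulated weight, levels traversed so far).
    queue = deque([(subject, 0.0, 0)])
    seen = {subject}
    while queue:
        term, acc, level = queue.popleft()
        if level == max_level:
            continue  # do not expand beyond the user-chosen maximum depth
        for related, weight in graph.get(term, []):
            if related == obj:
                return (acc + weight) / (level + 1)
            if related not in seen:
                seen.add(related)
                queue.append((related, acc + weight, level + 1))
    return 0.0

# Invented weights, loosely following the examples in the text.
graph = {
    "corn": [("maize", 4.95), ("crop", 3.1), ("food", 2.8)],
    "flower": [("rose", 2.0), ("plant", 1.9)],
    "plant": [("green", 3.27)],
}
print(similarity(graph, "corn", "maize"))   # 4.95 (found at the first level)
print(similarity(graph, "flower", "fire"))  # 0.0  (no path within max_level)
```

In the full system this search is run in both directions (subject to object, then object to subject), as described above.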

4.6 Levels of Searching Tree

Whenever an entity is run on ConceptNet, many related concepts are returned. These related concepts are considered to be in the first level of the searching tree. For instance, running ”Corn” returns many concepts, such as maize, crop, and food, together with their weights; Figure 30 shows the first five terms related to ”Corn”. If one of these related terms is the object, then the similarity is taken directly from the weight of the edge between the two concepts. If the object we are looking for is ”maize,” the search stops and returns 4.95 as the semantic relationship between ”corn” and ”maize”.

Otherwise, if the program does not find any concept matching the object, it runs each of the related words as a new subject, since there might be an indirect rather than a direct relationship, Figure 31. For example, if the subject is ”Flower” and the object is ”Green”, then ”Green” is not found in the first level. The program runs each related term as a subject again, starting from ”Rose”. This technique builds a large tree in order to search all the concepts that might be related to the object.

Figure 31: Third level of the searching tree

Figure 32: The searching tree for the term ”Flower”

In our example, we see that ”Green” is found in the third level of the tree, Figure 32. The semantic similarity is therefore calculated by adding all the weights and dividing the total by the number of levels. The semantic similarity for this example is 5.17.

Lastly, if there is no relation at all between the subject and the object, then the similarity is 0 and no extra triples are added, Figure 33. The example illustrates two entities, ”Flower” and ”Fire”, that have no relationship between them no matter how deep the tree grows.

Figure 33: The searching tree for the term ”Fire”

4.7 Adding the Extra Triples

If there is a semantic similarity between the entities after running them on ConceptNet, we add a new triple after the target triples containing the similarity between the subject and the object, along with the relationship between them. For example, suppose we have the triples in listing 9:
"crop" . "plant" . "park" . "tree" . "park" . "smoke" .

Listing 9: Data after extraction

After our program runs, the output triples are as in listing 10:
"crop" . "plant" . "maize" 4.95
"park" . "tree" . "flower" 5.38 "park" . "smoke" .

Listing 10: Output triples

Two triples were added to this dataset based on the relationships between ”corn” and ”maize” and between ”flower” and ”green”, together with the semantic similarity of these new relationships. Since each quad contains only four parts, we had to add blank nodes to store each newly generated triple along with the URI of ConceptNet. For example, the triple containing ”maize” is converted using a hash function into an eight-digit number, e.g., 12345678, followed by extra information to form the final quads, listing 11.
_:12345678 "4.95" . _:12345678 "http://conceptnet.io/c/en/corn" < Food>.

Listing 11: Blank nodes

4.8 Architecture

In this section, we present our system architecture, Figure 34, which illustrates the path of the data: starting from the raw data on USDA; then to Karma, which converts these databases into an RDF model; then to ConceptNet, which provides related entities; then through our relationship generator; and finally through the hash function that adds the quads for the final output.

Figure 34: FoodKG system architecture.

CHAPTER 5

IMPLEMENTATION

5.1 Implementation

We implemented our program in the Python programming language with the help of libraries such as Requests and hashlib. The Requests library was used to send requests to ConceptNet for the terms being processed, whereas the hashlib package was used to create a hash value for the extra triple, since each quad contains only four parts including the context. The newly generated quad contains the subject, predicate, and object as a single hash value, then the relationship followed by the score, and the context. Using hashlib, we convert the first three parts into a hash number and create a blank node from it. The hash value is now the first part of the triple, the relationship is the second value, the similarity score is the third value, and the context is the last value. To add the page link of the node with the highest similarity, we add another quad that contains the same hash value as the first part, the relationship, then the URL link, and the context. The triples in listing 12 illustrate the hash value implementation.
" object1" . " object2" . " example2" < http://umkc.edu/context> . _:12345678 "4.93450204" . _:12345678 "http://conceptnet.io/c/en/a_Term" . " example2"

Listing 12: Blank nodes

_:12345678 is the hash value (blank node ID) that stands for the triple shown in listing 13: " example2"

Listing 13: Blank nodes
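The hashing step can be sketched as follows. The exact hashlib formula is not given in the text, so deriving the eight-digit ID from an MD5 digest here is an assumption made for illustration:

```python
import hashlib

def blank_node_id(subject, predicate, obj):
    """Hash the first three parts of a triple into an eight-digit blank-node ID.

    The eight-digit scheme follows the description above (e.g., _:12345678);
    reducing hashlib's MD5 digest modulo 10**8 is an illustrative assumption,
    not the dissertation's exact formula.
    """
    digest = hashlib.md5(f"{subject} {predicate} {obj}".encode("utf-8")).hexdigest()
    return "_:" + str(int(digest, 16) % 10**8).zfill(8)

# The blank node can then carry the score and ConceptNet link as extra quads.
node = blank_node_id("corn", "relatedTo", "maize")
print(node)  # "_:" followed by eight digits, stable for the same triple
```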

5.2 FoodKG Implementation

We present a domain-specific tool, FoodKG, that addresses the problem of repeated, unused, and missing concepts in knowledge graphs and enriches the existing knowledge by adding semantically related domain-specific entities, relations, images, and semantic similarity values between the entities. We utilize AGROVEC, a graph embedding model, to calculate the semantic similarity between two entities, retrieve the most similar entities, and classify entities under a set of predefined classes. AGROVEC computes the semantic similarity scores as the cosine similarity of the given vectors. The triple that holds the semantic score is encoded as a blank node, where the subject is the hash of the original triple, the relation remains the same, and the object is the actual semantic score.

FoodKG parses and processes all the subjects and objects within the provided knowledge graph. For each subject, a request is made to WordNet to fetch its offset number. WordNet is a lexical database for the English language that groups words into sets of synonyms called synsets, each with a corresponding ID (offset). FoodKG requires these offset numbers to obtain related images from ImageNet, since the images on ImageNet are organized and classified based on the WordNet offsets. ImageNet is one of the largest image repositories on the Internet, and it contains images for almost all known classes [23]. These images are added to the provided graph in the form of triples, where the subject is the original word, the predicate is ”#ImgURLs”, and the object is a Web URL that links to the images returned from ImageNet. Figure 1 depicts the FoodKG system architecture.
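The offset lookup can be illustrated as follows: a WordNet offset is zero-padded to eight digits and prefixed with the part-of-speech letter (n for nouns) to form the synset ID that ImageNet images are organized under. The query URL below is a hypothetical placeholder rather than ImageNet's actual API:

```python
def wordnet_id(offset, pos="n"):
    """Form a WordNet/ImageNet synset ID: POS letter + zero-padded 8-digit offset."""
    return f"{pos}{offset:08d}"

# 2084071 is the WordNet 3.0 noun offset for "dog", used here as an example.
wnid = wordnet_id(2084071)
print(wnid)  # n02084071

# A hypothetical image-URL triple of the kind FoodKG adds to the graph.
triple = ("dog", "#ImgURLs", f"http://image-net.org/synset?wnid={wnid}")
print(triple)
```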

Before discussing the AGROVEC vectors and the reasons behind using a graph embedding model, we first discuss what embeddings are and how they work.

5.2.1 AGROVEC

AGROVEC is a domain-specific embedding model built on GEMSEC, a graph embedding algorithm, which we retrained and fine-tuned on AGROVOC to produce a domain-specific embedding model. The embedding visualization (using t-SNE [58]) of our clustered embeddings is depicted in Figure 2. AGROVEC has the advantage of clustering compared to other models. AGROVEC was trained with 300-dimensional vectors, clustering the dataset into 10 clusters. The gamma value used is 0.01, the number of random walks is 6 with a window size of 6, and we started with the default initial learning rate of 0.001. AGROVEC was trained on the AGROVOC dataset, which contains over 6 million triples, to construct the embedding.

5.2.2 Entity Extraction

FoodKG provides several features; entity extraction is one of the most important. Users start by uploading their graphs to FoodKG. Most of the provided graphs contain the same repeated concepts and terms named differently (e.g., id, ID, id num, etc.), where all of them represent the same entity; other terms use abbreviations, numbers, or short forms (acronyms) [88]. Similar entities with different names create many repetitions and make it a challenge to merge and search different graphs or to ingest them in machine learning or linked data. To overcome this issue, we run NLP techniques such as POS tagging, chunking, and the Stanford Parser over all the provided subjects to extract the meaningful classes and terms that are used in the next stage. For example, the subjects ”CHEESE,COTTAGE,S-,W/FRU” and ”BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD” are represented in FoodKG as ”Cheese, Cottage” and ”Butter” respectively. Users have the option of providing the context of their graphs or leaving it empty.
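As a toy illustration of this cleanup (FoodKG's actual pipeline uses POS tagging, chunking, and the Stanford Parser; the fixed abbreviation list below is a stand-in invented for the example):

```python
# Stand-in set of abbreviation/measurement tokens; the real system identifies
# these with NLP rather than a hard-coded list.
ABBREV = {"S-", "W/FRU", "PDR", "PREP", "W/1/1.HYD", "1.5OZ"}

def clean_subject(raw):
    """Drop abbreviation/measurement tokens and title-case the remaining words."""
    tokens = [t for t in raw.split(",") if t and t not in ABBREV]
    return ", ".join(t.capitalize() for t in tokens)

print(clean_subject("CHEESE,COTTAGE,S-,W/FRU"))          # Cheese, Cottage
print(clean_subject("BUTTER,PDR,1.5OZ,PREP,W/1/1.HYD"))  # Butter
```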

5.2.3 Text Classification

Nowadays, there are many different models to classify a given text to a set of tags or classes such as the Long Short Term Memory (LSTM) network [86]. Nevertheless, text classification is still a challenge when it comes to classifying a single word without a context such as ”apple” since the context is broad, and the word could refer to many

70 different things other than apple the fruit. Therefore, few solutions have been proposed, such as referring to fruits with small letters ”apple” and capital letters for brand names like

”Apple” the corporation. However, such a technique does not seem to be working well on large scale contexts. Besides, this technique does not work in all domains since domain- specific graphs may not include all different contexts for a given word. Therefore, for each domain-specific area, there should be a knowledge graph that researchers and scientists can use in their experiments. At this point, our tool, FoodKG, becomes helpful to build and enrich such knowledge graphs and classify words to a specific class. FoodKG uses

AGROVEC to help provide the context in such scenarios. We use a simple yet effective technique, with the help of the ConceptNet API, to accomplish this task [95]. The idea is to start with a set of predefined classes; for example, let us consider only two classes for now, "fruits" and "animals". After running these classes through ConceptNet, we store all the returned top related concepts with the relation "type of". Examples of the returned words for "fruit" are pineapple, mango, grapes, plums, and berry; the returned words for "animals" are lion, fish, dolphin, fox, pet, and deer. We take only the top 10 instances from each category to limit the time complexity of our algorithm. Then, using the AGROVEC embeddings, we calculate the semantic similarity score between the given word and all the words from each class and compute the average per class (Algorithm 1). Based on the highest average score, we choose the category of the given word. This technique proved to be the most reliable for classifying a category using word embeddings. The algorithm's time complexity is O(N), where N is the number of classes we started with. As an example, AGROVEC predicted the class "Food" for the concept "brown rice", the "Energy" class for the concept "radiation", and "Water" for the concept "hail".

Algorithm 1 Text Classification using AGROVEC and ConceptNet

Input: Target, Cat = {A_1 = {word_1, ..., word_10}, ..., A_N = {word_1, ..., word_10}}
Output: The predicted class for Target

1: function LOOP(Cat[], Target)
2:     prediction ← nil
3:     highestAvg ← 0
4:     N ← length(Cat)
5:     for i ← 1 to N do
6:         total, Avg ← 0
7:         K ← length(A_i)
8:         for j ← 1 to K do
9:             total ← total + cosineSimilarity(Target, A_i[j])
10:        end for
11:        Avg ← total / K
12:        if Avg > highestAvg then
13:            highestAvg ← Avg
14:            prediction ← A_i
15:        end if
16:    end for
17:    return prediction
18: end function
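Algorithm 1 can be sketched in Python as follows; the two-dimensional vectors below stand in for real AGROVEC embeddings, and the category instances are toy placeholders rather than actual ConceptNet results:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(target_vec, categories):
    """Return the category whose instance words are, on average,
    closest to the target vector (the idea of Algorithm 1)."""
    prediction, highest_avg = None, 0.0
    for name, word_vecs in categories.items():
        avg = sum(cosine_similarity(target_vec, w) for w in word_vecs) / len(word_vecs)
        if avg > highest_avg:
            highest_avg, prediction = avg, name
    return prediction

# Toy 2-d "embeddings" in place of AGROVEC vectors (illustrative only).
categories = {
    "fruits":  [[0.9, 0.1], [0.8, 0.2]],   # e.g. pineapple, mango
    "animals": [[0.1, 0.9], [0.2, 0.8]],   # e.g. lion, fish
}
print(classify([0.85, 0.15], categories))  # fruits
```

The loop visits each of the N categories once and each of its K instances once, matching the O(N) complexity stated above (K is fixed at 10).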

5.2.4 Semantic Similarity

Measuring the semantic similarity between two terms or concepts has gained much interest due to its importance in different applications such as intelligent graphs, knowledge retrieval systems, and similarity between Web documents [45]. Several semantic similarity measures have been developed and used depending on the purpose [60]. In this work, we adopt the cosine similarity measure to quantify the similarity between two vectors. FoodKG computes the semantic similarity between the different subjects in a given graph. The similarity scores are attached as blank nodes to the original triple, where the subject is the hashed blank node ID, the relation is "#semantic_similarity", and the object is the similarity score. These similarity scores can be used in recommendation systems, question answering, or future NLP models. FoodKG relies on the AGROVEC embedding model to generate the similarity scores. Table 8 shows the semantic similarity scores generated by AGROVEC and other models; this example was taken from the AGROVOC dataset to show how the AGROVEC model ranks these pairs better than the other models. Table 9 shows the top five related words for "Food", Table 10 for "Energy", and Table 11 for "Water".
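As a minimal sketch of how a similarity score could be attached through a hashed blank node, assuming a placeholder namespace (example.org) in place of FoodKG's actual URIs:

```python
import hashlib

# Hypothetical namespace for illustration; FoodKG's real URIs differ.
SIM_PREDICATE = "http://example.org/foodkg#semantic_similarity"

def similarity_triple(subject_a: str, subject_b: str, score: float) -> str:
    """One N-Triples line: a hashed blank node as the subject, the
    #semantic_similarity relation, and the score as the object."""
    digest = hashlib.md5((subject_a + "|" + subject_b).encode()).hexdigest()
    bnode = "_:" + digest[:12]  # hashed blank node ID
    return f'{bnode} <{SIM_PREDICATE}> "{score}" .'

print(similarity_triple("http://example.org/wheat",
                        "http://example.org/wheat_flour", 0.757))
```

Hashing the subject pair keeps the blank node ID stable across runs, so repeated enrichment of the same graph does not create duplicate score nodes.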

5.2.5 Scientific Terms

Researchers and data experts often use domain-specific terms and concepts that may not be commonly used. For instance, Triticum, Malus, and Fragaria are the scientific names for wheat, apples, and strawberries, respectively. However, such

Table 8: An example of how each model ranks the objects when the subject is "wheat". AGROVEC ranks the semantic similarity scores accurately from closest to furthest from the subject.

object               AGROVEC  HolE    GloVe  Word2vec  fastText
wheat flour          0.757    -0.199  0.295  0.948     0.992
barley               0.523    0.868   0.421  0.741     0.976
grapes               0.116    0.851   0.802  0.930     0.885
tuna oil             0.046    -0.769  0.376  0.524     0.940
building components  0.016    0.923   0.397  0.466     0.883

Table 9: Top 5 related words for the concept "Foods"

Model     Top 5 related words
AGROVEC   traditional foods, soups, raw foods, value added product, cooking fats
HolE      controls, sterilizing, consumer expenditure, Andean Group, structural crops
GloVe     meat, animal meals, milk, water, seaweeds
Word2vec  cocoa beans, hides and skins, eggs, oilseed protein, soyfoods
fastText  pet foods, raw foods, seafoods, soyfoods, skin producing animals

names may not exist in global word or knowledge graph embedding models. As for FEW, these terms can be found in our embedding model, since it was trained on AGROVOC terms. This allows data experts to use scientific names and other related terms in the food domain while using FoodKG. Table 12 shows examples of top related concepts for FEW that do not exist in global embeddings.

Table 10: Top 5 related words for the concept "Energy"

Model     Top 5 related words
AGROVEC   nuclear energy, energy for agriculture, energy expenditure, animal power, renewable energy
HolE      stored products pests, plant breeding, age, formulations, sewage
GloVe     Ericales, carbohydrates, Sphingidae, Orobanchaceae, fungal spores
Word2vec  stray voltage effects, irrigation canals, libraries, agencies, CMS
fastText  bioenergy, computer science, wood energy, cytogenetics, Cytogenetics

Table 11: Top 5 related words for the concept "Water"

Model     Top 5 related words
AGROVEC   hydrosorption, chlorinated water, water statistics, body water, virtual water
HolE      isObjectOfActivity, dissolved oxygen, economic competition, state, international cooperation
GloVe     seaweeds, meat, perishable products, phosphorus, drugs
Word2vec  quarters, meat byproducts, captivity, magnetic water, plant parts
fastText  heaters, bound water, low water, esters, high water

5.2.6 Relationship Prediction

Word embeddings are well known in the NLP world due to their powerful way of capturing the relatedness between different concepts. However, capturing the lexico-semantic relationship between two words (i.e., the predicate of a triple) is a critical challenge for many NLP applications and models. A few techniques developed previously proposed modifying the original word embeddings to include specific relations while training on the corpora [32, 68, 104]. These approaches post-process the

Table 12: A few examples of commonly used concepts in the FEW domains that do not appear in global embeddings

Food               Energy              Water
cocoa products     energy balance      water activity
brown rice         energy generation   water extraction
gluten free bread  energy consumption  water availability
skim milk          energy value        water quality
emmental cheese    energy resources    water statistics

trained embeddings to check which concepts move closer together or further apart with respect to a specific relation. While these algorithms were able to predict specific relations such as synonyms and antonyms, predicting and discriminating between multiple relations is still a challenge. To overcome this challenge, we used transfer learning with the state-of-the-art STM model, which outperforms previous state-of-the-art models on the CogALex and WordNet datasets, together with the AGROVOC dataset, to predict domain-specific relations between two concepts. The newly derived model aims particularly at classifying relations between different subjects in the food, agriculture, energy, and water domains.

5.2.7 RefinedFed

RefinedFed is a protocol whose goal is to collect only the federated models that pass a certain test accuracy threshold after each round. In other words, it is a protocol for deciding which models should be included in the server's model aggregation process, such as averaging the weights while training the model.

This algorithm helps build and train a more accurate model. Furthermore, when a model fails to pass a certain test accuracy, the model will not be collected

Figure 35: RefinedFed architecture. A local testing dataset is added to each client to test the model before the collecting phase. Models that pass a certain accuracy threshold will be collected by the server; otherwise, the model will be dropped.

(Algorithm 2), which reduces the computation power and bandwidth on the server and in turn makes the aggregation phase on the server faster (see Figure 35).

Algorithm 2 FederatedTree, an extended algorithm for FederatedAveraging

1: Server executes:
2:     initialize w_0
3:     for each round t = 1, 2, ... do
4:         Select K eligible clients to compute updates
5:         Wait for updates from K clients (indexed 1, ..., K)
6:         Run clients on local test set
7:         if client accuracy less than threshold then
8:             drop client
9:         end if
10:        (Δ_k, n_k) = ClientUpdate(w) from client k ∈ [K]
11:        w̄_t = Σ_k Δ_k        // Sum of weighted updates
12:        n̄_t = Σ_k n_k        // Sum of weights
13:        Δ_t = w̄_t / n̄_t     // Average update
14:        w_{t+1} ← w_t + Δ_t
15:    end for

1: function ClientUpdate(k, w)    // run on client k
2:     B ← (split P_k into batches of size B)
3:     for each local epoch i from 1 to E do
4:         for batch b ∈ B do
5:             w ← w − η ∇ℓ(w; b)
6:         end for
7:     end for
8:     return w to server
9: end function
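The filtering-then-averaging round can be sketched as follows. This is a simplified simulation: plain Python lists stand in for model weight updates, the client tuples are invented, and the 80% threshold mirrors the value used in our experiments:

```python
def refined_fed_round(client_updates, threshold=0.8):
    """One aggregation round: keep only clients whose local test accuracy
    passes the threshold, then average their weight updates (a simplified
    sketch of RefinedFed / Algorithm 2; real clients train neural nets)."""
    kept = [(w, n) for w, n, acc in client_updates if acc >= threshold]
    if not kept:
        return None  # every model was dropped this round
    total_n = sum(n for _, n in kept)
    dim = len(kept[0][0])
    # weighted average of updates, as in FederatedAveraging
    return [sum(w[i] * n for w, n in kept) / total_n for i in range(dim)]

# (update_vector, num_local_examples, local_test_accuracy) per client;
# the third client simulates a noisy, low-accuracy model that gets dropped.
updates = [
    ([0.10, 0.20], 100, 0.91),
    ([0.30, 0.40], 100, 0.88),
    ([9.00, 9.00], 100, 0.12),  # corrupted client, filtered out
]
print(refined_fed_round(updates))  # approximately [0.2, 0.3]
```

Without the filter, the corrupted client's large update would dominate the average; with it, the server never spends bandwidth or computation on that model at all.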

CHAPTER 6

EVALUATION

6.1 Workload

Our work has been tested by running thousands of RDF triples from different datasets, providing a strong and reliable approach for real-life applications. Whether running a dataset containing 5 RDF triples or 500, the program's behavior remains the same; the only obvious difference is the running time.

Table 13 shows the average time required to run datasets of different sizes.

Table 13: Time needed for different RDF datasets

Experiment number  Number of RDF triples  Required time (sec)
Experiment 1       120                    1.52
Experiment 2       1200                   8.74
Experiment 3       3000                   17.01

As we can see, our program accepts any number of RDF triples regardless of how big or small the dataset is. The major difference is the amount of time required, which increases gradually with the size of the dataset.

6.2 Results

We have developed a new approach to enrich a database table with extra information and facts related to its contents after converting it to an RDF model. This approach

helps scientists and users interested in the Semantic Web enhance existing data on the internet in a way that makes the searching process faster and provides users with more reliable information about entities.

Our project aims to build a robust and reliable knowledge graph that sheds light on

FEW systems. We conclude these results with three different examples to illustrate the actual input and output, given the datasets in Table 14.

Table 14: FEW systems input experiments

Experiment number  Dataset type  Number of triples  Required time
Experiment 1       Food          185                1.59 sec
Experiment 2       Water         23400              3.23 min
Experiment 3       Energy        900                5.13 sec

Table 15: An input example (Ingredients)

Ingredients  Quantity      Price  Starting Date  Ending Date
Bread        100 loaves    $33    13/8/2017      20/8/2017
Roll         30 packs      $150   13/8/2017      20/8/2017
Sugar        20 lb         $40    13/8/2017      2018
Salt         5 lb          $10    13/8/2017      2018
Cake         10 cakes      $200   13/8/2017      20/8/2017
Cupcake      150 cupcakes  $300   13/8/2017      20/8/2017
Almond       10 lb         $300   13/8/2017      20/8/2017

As we mentioned before, our project contains two main parts. The first part converts the table above to an RDF model. After converting a database to an RDF model, the second part of our project processes these triples to generate new triples that enrich the dataset based on the semantic relationships between the existing triples. After

Figure 36: FoodKG input - Food example

running the second part, the final output can be seen in Figures 36-41.

Our final output meets all the universal requirements for an RDF dataset that can be queried semantically with SPARQL. The following quads were generated based on the dataset in Table 15 (the food dataset example).
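The first stage (mapping a table row to triples) can be sketched as follows; the base URI and predicate names below are placeholders for illustration, not the vocabulary actually used by our converter or the FEW ontology:

```python
# Illustrative sketch of the table-to-RDF stage; example.org URIs are
# placeholders, not the ontology terms used by the real converter.
BASE = "http://example.org/foodkg/"

def row_to_triples(row: dict) -> list:
    """Turn one ingredients-table row into N-Triples lines."""
    subj = f"<{BASE}{row['Ingredients'].replace(' ', '_')}>"
    return [
        f'{subj} <{BASE}quantity> "{row["Quantity"]}" .',
        f'{subj} <{BASE}price> "{row["Price"]}" .',
        f'{subj} <{BASE}startingDate> "{row["Starting Date"]}" .',
        f'{subj} <{BASE}endingDate> "{row["Ending Date"]}" .',
    ]

row = {"Ingredients": "Bread", "Quantity": "100 loaves", "Price": "$33",
       "Starting Date": "13/8/2017", "Ending Date": "20/8/2017"}
for t in row_to_triples(row):
    print(t)
```

Each column becomes a predicate and each cell a literal object, which is the same direct-mapping idea the converter applies before the enrichment stage adds semantically related triples.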

6.3 FoodKG

In this section, we report the evaluation of AGROVEC and compare it with other word and knowledge graph embedding techniques: GloVe, fastText, Word2vec, and HolE.

6.3.1 Evaluation Technique

We employ the Spearman rank correlation coefficient (Spearman's rho [69]) to evaluate the embedding models. Spearman's rho is a non-parametric measure for assessing the similarity between two ranked variables. We apply Spearman's rho between the cosine similarity predicted using the embeddings and the ground truth, which is known as

Figure 37: FoodKG output - Food example

Figure 38: FoodKG input - Water example

Figure 39: FoodKG output - Water example

Figure 40: FoodKG input - Energy example

Figure 41: FoodKG output - Energy example

the relatedness task [87]. When the ranks are unique, the Spearman correlation coefficient can be computed using the formula:

R_s = 1 − (6 Σ_{i=1}^{n} D_i²) / (n(n² − 1))                    (6.1)

where D_i is the difference between the two ranks of each observation and n is the total number of observations.

SELECT ?subject ?object WHERE {
    ?subject ?object .
    FILTER (lang(?object) = 'en')
}

Listing 14: Data after extraction
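Equation 6.1 can be checked with a short implementation; the score lists below are illustrative values, not results from the experiments:

```python
def spearman_rho(xs, ys):
    """Spearman's rho via Equation 6.1 (assumes unique ranks)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of D_i^2
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# model-predicted similarities vs. ground-truth relatedness (made-up numbers)
predicted = [0.757, 0.523, 0.116, 0.046, 0.016]
ground    = [0.90, 0.70, 0.30, 0.20, 0.10]
print(spearman_rho(predicted, ground))  # 1.0 (identical ranking)
```

Because the measure depends only on ranks, a model is rewarded for ordering pairs the same way as the ground truth even when its raw scores are on a different scale, which is exactly what we want when comparing embedding models.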

6.3.2 Dataset Description

AGROVOC is a collection of vocabularies that covers all areas of interest to the

Food and Agriculture Organization of the United Nations, including food, nutrition, agriculture, fisheries, forestry, and the environment. It comprises 32,000 concepts in over 20 languages, where each concept is represented using a unique ID. For instance, the subject

"http://aims.fao.org/aos/agrovoc/c_12332" corresponds to "maize". We use the SPARQL query in Listing 14 to extract the English triples.

6.3.3 Benchmark Description

While there exist well-known word embedding benchmark datasets, such as WordSim-353 [33], for evaluating semantic similarity measures, these cannot be employed for domain-specific embeddings, as many concepts related to FEW are not covered in public benchmarks. Constructing a domain-specific benchmark is a challenge considering the need for domain experts. Therefore, we leverage ConceptNet to construct a benchmark dataset for evaluating the models. ConceptNet originated from the crowd-sourcing project

Open Mind Common Sense, which was launched in 1999 at the MIT Media Lab. ConceptNet used to be a home-grown crowd-sourced project with the purpose of improving the state of computational and human knowledge. Currently, however, the data is generated from many trusted resources such as WordNet, DBpedia, Wiktionary [110],

OpenCyc [93], and others. We split the AGROVOC dataset based on its 126 unique relations to depict how each model performs on the different relations and to study the impact

of the number of hops between the concepts in the embedding. For each subject and object, we look up the weights returned by ConceptNet and consider them to be the ground truth.

6.3.4 RefinedFed Description

We used the MNIST dataset throughout all the experiments, with the training images distributed equally over all the clients. We used the PyTorch and

PySyft platforms to simulate different numbers of clients with the centralized server. We also added Laplace noise to the training images of a few clients to simulate corrupted data and low-accuracy, noisy models (which will not be collected by the server for the aggregation phase), in order to show the improvement with and without model selection. We chose

the Laplace distribution since its peak is sharper than the Gaussian's, which means more Laplace samples fall around zero than Gaussian samples. In practice, both Laplace and Gaussian noise perform well for adding noise to images, and any other type of noise can be used as well. The number of noisy models is the same in both experiments, i.e., FL with and without RefinedFed.
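The noise-injection step can be sketched as follows; since Python's standard library has no Laplace generator, the sample is drawn by inverting the Laplace CDF. The image and scale below are toy values (the real experiments perturb MNIST tensors in PyTorch):

```python
import math
import random

def laplace_sample(rng, scale=1.0):
    """Draw one Laplace(0, scale) sample by inverting the CDF
    (the standard library offers no Laplace generator)."""
    u = rng.random() - 0.5          # uniform on [-0.5, 0.5)
    if u >= 0:
        return -scale * math.log(1 - 2 * u)
    return scale * math.log(1 + 2 * u)

rng = random.Random(42)
# add pixel-wise Laplace noise to a toy 2x2 grayscale "image"
image = [[0.5, 0.6], [0.7, 0.8]]
noisy = [[px + laplace_sample(rng, scale=0.3) for px in row] for row in image]
print(noisy)
```

The sharper peak mentioned above falls out of the density's |x| exponent: small perturbations are more likely than under a Gaussian of comparable spread, while the heavier tails still corrupt occasional pixels strongly.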

We conducted many experiments to show the impact of RefinedFed on the accuracy of the global model. We used the MNIST dataset throughout all the experiments with the same neural network architecture, and the training images were divided equally among all of the clients. In our experiments, we studied the following cases:

• Federated learning on the MNIST dataset with Laplace noise added to the images of a few clients, in order to produce low-accuracy models and show the impact of a corrupted model on the overall model. Results can be seen in Table 16.

• Federated learning on the MNIST dataset with Laplace noise added to the images of a few clients, with our algorithm, RefinedFed, filtering out all models with accuracy less than 80%. Results can be seen in Table 16.

In both experiments, the number of clients, the number of noisy models, and the amount of noise added were the same.

Table 16: Accuracy of FL throughout 10 epochs with and without RefinedFed

Method      Accuracy throughout 10 epochs
FL          11%  11%  22%  47%  51%  49%  64%  77%  82%  84%
RefinedFed  11%  30%  36%  43%  62%  79%  83%  86%  88%  91%

6.3.5 Results

We evaluated two recent graph embedding models, namely DeepWalk and GEMSEC, trained on AGROVOC data, to analyze their performance on the FEW domains. Table 17 reports the average Spearman correlation coefficient scores for DeepWalk and GEMSEC. The higher score attained by GEMSEC motivated us to use GEMSEC for constructing AGROVEC.

We evaluated AGROVEC against HolE, GloVe, Word2vec, and fastText, where all of the models were retrained on the AGROVOC dataset using their default parameters (Table 18), except for the number of dimensions.

Table 17: Different graph embedding techniques with their Spearman correlation scores

Model     Spearman Correlation
DeepWalk  0.068
GEMSEC    0.101

Table 18: The default hyper-parameters for the retrained models

          dimensions  learning rate  window size  epochs  lambda
AGROVEC   300         0.001          6            50      0.0625
HolE      300         0.5            15           1000    1.0
GloVe     300         0.05           15           15      -
Word2vec  300         0.025          5            15      -
fastText  300         0.1            25           5       -

The number of dimensions used for all models was 300, with the minimum count set to 1 to include all the concepts and relations. Figure 42 shows the average Spearman correlation coefficient scores for all the models evaluated on the 126 unique relations.

Figure 43 shows the Spearman correlation coefficient scores when limiting the minimum number of word pairs in each relation to 5, 10, and 25, in order to check the models' performance across different numbers of word pairs. The results show that AGROVEC, based on GEMSEC trained and fine-tuned on AGROVOC, outperforms all other models by a significant margin when predicting FEW-domain similarity scores. Figure 44 shows an example of the AGROVEC embedding using TSNE for the domains Food, Energy, and

Water with their top 20 related terms, showing how these domains and their top related terms are properly clustered. In contrast, Figures 45, 46, 47, and 48 visualize how HolE,

GloVe, Word2vec, and fastText cluster the Food, Energy, and Water domains with their top

Figure 42: Spearman correlation coefficient ranking scores compared against ConceptNet. AGROVEC scored the highest, which means its ranking is the closest to ConceptNet's ranking (all models were trained on the AGROVOC dataset and tested against the same benchmark).

20 related terms. Based on Figures 44-48, we observe that AGROVEC achieves better clustering, with the terms of the same domain being placed closer together. This is because

AGROVEC uses GEMSEC, which performs self-clustering.

We also compare the top 5 related terms for food, energy, and water, as detailed in Tables 9, 10, and 11, respectively. While AGROVEC, which uses GEMSEC trained and fine-tuned on AGROVOC, was able to fetch appropriate concepts related to the provided terms, the other models struggled despite being trained on the same dataset using their default parameters without fine-tuning.

Figure 43: Spearman correlation coefficient scores when evaluated on all triples and on relations with the minimum number of word pairs per relation being 5, 10, and 25

Figure 44: AGROVEC embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the AGROVEC model. The figure shows how AGROVEC clusters similar concepts together properly.

Figure 45: HolE embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the HolE model.

Figure 46: GloVe embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the GloVe model.

Figure 47: Word2vec embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the Word2vec model.

Figure 48: fastText embeddings visualization using TSNE for the words Food, Energy, and Water with their top 20 nearest neighbors based on the fastText model.

6.4 Federated Learning

We tested our proposed algorithm by training two federated learning models on the MNIST [54] dataset and compared the results with and without our algorithm. The first run is an FL approach that does not utilize RefinedFed, whereas the second run utilizes the RefinedFed algorithm as a filtering mechanism for the clients. Figure 49 shows the experiment using 5 clients, Figure 50 shows the same experiment using 10 clients, and in Figure 51, 20 clients were used. Figure 52 shows the same experiment using 5 clients on the CIFAR-10 dataset [53]. We would like to mention that our algorithm is an extension of [61] and [19], with some modifications to test each model on its local testing dataset and select all models that pass a certain threshold before sending the updates to the server. The second function in the algorithm, ClientUpdate, is the same function from the original paper.

Figure 49: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 5 Clients

Figure 50: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 10 Clients

6.5 Data Availability Statement

The datasets analyzed for this study can be found in the AIMS (AGROVOC) registry: http://aims.fao.org/vest-registry/vocabularies/agrovoc.

The FoodKG code can be found in this GitHub repository: https://github.com/Gharibim/FoodKG

Figure 51: MNIST dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 20 Clients

Figure 52: CIFAR-10 dataset. The accuracy throughout 10 epochs. FL VS. FederatedTree. 5 Clients

CHAPTER 7

CONCLUSION AND FUTURE WORK

In our project, we developed a new approach to enrich a dataset of RDF triples by adding extra triples containing similar information and facts based on the existing knowledge. Our project aims to build a reliable knowledge graph that serves FEW systems by using a raw database from the USDA, converting it to an RDF model, and finally enhancing these triples by adding semantic knowledge. In addition, we developed a FEW ontology that can be used when mapping FEW databases to an RDF model. The FEW ontology contains many vocabularies that can be used to specify the relationships between columns when converting FEW databases.

In this dissertation, we presented FoodKG, a novel software tool to enrich knowledge graphs constructed on FEW datasets by adding semantically related knowledge, semantic similarity scores, and images using advanced machine learning techniques. FoodKG relies on AGROVEC, which was constructed using GEMSEC but retrained and fine-tuned on the AGROVOC dataset. Since AGROVEC was trained on a controlled vocabulary, it provides more accurate results than global vectors in the food and agriculture domains for category classification and the semantic similarity of scientific concepts. The STM model, retrained on the AGROVOC dataset, is used to predict semantic relations between graph entities and classes. The output produced by FoodKG can be queried using a SPARQL engine through a user-friendly interface. We evaluated AGROVEC using the

Spearman correlation coefficient, and the results show that our model outperforms the other models trained on the same graph dataset.

We also used FL techniques to include the private datasets by training smaller versions of the model on each dataset and then aggregating all of these models at the server side to generate a generalized global model. We further included a local testing dataset at each data site in order to test each local model's accuracy before collection, which makes the server collect only high-accuracy models.

In the future, we plan to extend our federated work by aggregating the models in parallel in order to collect the high-accuracy models, which leads to a higher global model accuracy. Parallel aggregation will also reduce the aggregation time on the server side.

REFERENCE LIST

[1] Apache software foundation. Available from. https://any23.apache.org/index.html;

Retrieved June 2017.

[2] Clark Persia LLC. Available from. http://clarkparsia.github.io/csv2rdf/; Retrieved

May 2017.

[3] Dandelion API. Available from. https://dandelion.eu/; Retrieved August 2017.

[4] DBpedia. Available from. http://wiki.dbpedia.org; Retrieved May 2017.

[5] DBpedia on Wikipedia. Available from. https://en.wikipedia.org/wiki/DBpedia;

Retrieved May 2017.

[6] FastText: stepping through the code. Available from.

https://medium.com/@mariamestre/fasttext-stepping-through-the-code-

259996d6ebc4; Retrieved March 2020.

[7] ParallelDots AI API. Available from. https://www.paralleldots.com/text-analysis-

apis; Retrieved August 2017.

[8] Talis Information Ltd. Available from. http://research.talis.com/2005/rdf-intro/;

Retrieved June 2017.

[9] Tibi Puiu. Available from. https://www.zmescience.com/research/technology/smartphone-

power-compared-to-apollo-432/; Retrieved June 2017.

101 [10] Wikipedia on Wikipedia. Available from. https://en.wikipedia.org/wiki/Wikipedia;

Retrieved May 2017.

[11] WordNet. Available from. https://wordnet.princeton.edu/wordnet/frequently-

asked-questions/for-application-developer/; Retrieved May 2017.

[12] WordNet on Wikipedia. Available from. https://en.wikipedia.org/wiki/WordNet;

Retrieved May 2017.

[13] Abadi, D. J., Marcus, A., Madden, S. R., and Hollenbach, K. SW-Store: a vertically

partitioned DBMS for semantic web data management. The VLDB Journal 18, 2

(2009), 385–406.

[14] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. Dbpedia:

A nucleus for a web of open data. In The Semantic Web. Springer, 2007, pp. 722–

735.

[15] Bernard, D. Facebook as more users than population of China. Avail-

able from. https://learningenglish.voanews.com/a/facebook-has-more-users-than-

china-population/2732122.html; Retrieved June 2017.

[16] Boiy, E., and Moens, M.-F. A machine learning approach to sentiment analysis in

multilingual Web texts. Information Retrieval 12, 5 (2009), 526–558.

[17] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. Enriching word vectors

with subword information. Transactions of the Association for Computational

Linguistics 5 (2017), 135–146.

102 [18] Bollacker, K., Evans, C., Paritosh, P., Sturge, T., and Taylor, J. Freebase: A collab-

oratively created graph database for structuring human knowledge. In Proceed-

ings of the 2008 ACM SIGMOD International Conference on Management of Data

(2008), ACM, pp. 1247–1250.

[19] Bonawitz, K., Eichner, H., Grieskamp, W., Huba, D., Ingerman, A., Ivanov, V.,

Kiddon, C., and et al. Towards federated learning at scale: System design. arXiv

preprint arXiv:1902.01046 (2019).

[20] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, B., Patel, S.,

Ramage, D., Segal, A., and Seth, K. Practical secure aggregation for privacy-

preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Confer-

ence on Computer and Communications Security (2017), 1175–1191.

[21] Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques,

Y., and Keizer, J. The AGROVOC linked dataset. Semantic Web 4, 3 (2013),

341–348.

[22] Chen, D., and Manning, C. A fast and accurate dependency parser using neural

networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural

Language Processing (EMNLP) (2014), pp. 740–750.

[23] Chen, H., Trouve, A., Murakami, K., and Fukuda, A. A concise conversion model

for improving the RDF expression of conceptNet knowledge case. In Artificial

Intelligence and Robotics (2018), Springer Verlag, pp. 213–221.

103 [24] Chen, M., Mathews, R., Ouyang, T., and Beaufays, F. Federated learning of out-

of-vocabulary words. arXiv:1903.10635 (2018).

[25] Chhaya, P., Lee, K.-H., Shin, K.-s., Choi, C.-H., Cho, W.-S., and Lee, Y.-S. Using

D2RQ and Ontop to publish relational database as Linked Data. In 2016 Eighth

International Conference on Ubiquitous and Future Networks (ICUFN) (2016),

IEEE, pp. 694–698.

[26] Chim, H., and Deng, X. Efficient phrase-based document similarity for clustering.

IEEE Transactions on Knowledge and Data Engineering 20, 9 (2008), 1217–1229.

[27] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., and

et al. Large scale distributed deep networks. In Advances in Neural Information

Processing Systems (2012), 1223–1231.

[28] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman,

R. Indexing by latent semantic analysis. Journal of the American Society for

Information Science 41, 6 (1990), 391–407.

[29] Dubey, M., Banerjee, D., Chaudhuri, D., and Lehmann, J. EARL: Joint entity and

relation linking for question answering over knowledge graphs. In The Semantic

Web - ISWC 2018 (2018), Springer International Publishing, pp. 108–126.

[30] Eisenberg, V., and Kanza, Y. D2RQ/update: updating relational data via virtual

RDF. In Proceedings of the 21st International Conference on World Wide Web

(2012), pp. 497–498.

104 [31] Ernst, P., Siu, A., and Weikum, G. Highlife: Higher-arity fact harvesting. In

Proceedings of the 2018 World Wide Web Conference on World Wide Web (2018),

International World Wide Web Conferences Steering Committee, pp. 1013–1022.

[32] Faruqui, M., Dodge, J., Jauhar, S. K., Dyer, C., Hovy, E., and Smith, N. A.

Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Con-

ference of the North American Chapter of the Association for Computational Lin-

guistics: Human Language Technologies (2015), Association for Computational

Linguistics, pp. 1606–1615.

[33] Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., and

Ruppin, E. Placing search in context: the concept revisited. ACM Transactions on

Information Systems 20, 1 (2002), 116–131.

[34] Gharibi, M., and Rao, P. RefinedFed: A Refining Algorithm for Federated Learn-

ing. In Proceedings of the 49th Annual IEEE Applied Imagery Pattern Recognition

Workshop (AIPR 2020) (2020), pp. 1–5.

[35] Gharibi, M., Rao, P., and Alrasheed, N. RichRDF: A tool for enriching food,

energy, and water datasets with semantically related facts and images. In In-

ternational Semantic Web Conference (P&D/Industry/BlueSky) (2018), Springer

International Publishing.

[36] Gharibi, M., Zachariah, A., and Rao, P. FoodKG: A Tool to Enrich Knowl-

edge Graphs Using Machine Learning Techniques. Front. Big Data 3: 12. doi:

10.3389/fdata, 1–12.

105 [37] Glavas,ˇ G., and Vulicc,´ I. Discriminating between lexico-semantic relations with

the specialization tensor model. In Proceedings of the 2018 Conference of the

North American Chapter of the Association for Computational Linguistics: Human

Language Technologies, Volume 2 (Short Papers) (2018), Association for Compu-

tational Linguistics, pp. 181–187.

[38] Goyal, P., DollA¡r,˜ P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A.,

Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training Imagenet

in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[39] Grover, A., and Leskovec, J. node2vec: Scalable feature learning for networks. In

Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge

Discovery and Data Mining (2016), ACM, pp. 855–864.

[40] Hard, A., Rao, K., Mathews, R., Ramaswamy, S., Beaufays, F., Augenstein, S., Eichner, H., Kiddon, C., and Ramage, D. Federated learning for mobile keyboard prediction. arXiv preprint arXiv:1811.03604 (2018).

[41] Hazber, M. A., Li, R., Xu, G., and Alalayah, K. M. An approach for automatically

generating R2RML-based direct mapping from relational databases. In Interna-

tional Conference of Pioneering Computer Scientists, Engineers and Educators

(2016), Springer, pp. 151–169.

[42] Hixon, B., Clark, P., and Hajishirzi, H. Learning knowledge graphs for question

answering through conversational dialog. In Proceedings of the 2015 Conference

of the North American Chapter of the Association for Computational Linguistics:

Human Language Technologies (2015), Association for Computational Linguistics, pp. 851–861.

[43] Horng, S.-J. Big data: Challenges and practical application. In 2015 Interna-

tional Conference on Science in Information Technology (ICSITech) (2015), IEEE,

pp. 11–13.

[44] Hsu, M.-H., Tsai, M.-F., and Chen, H.-H. Combining WordNet and ConceptNet

for automatic query expansion: a learning approach. In Asia Information Retrieval

Symposium (2008), Springer, pp. 213–224.

[45] Iosif, E., and Potamianos, A. Unsupervised semantic similarity computation be-

tween terms using web documents. IEEE Transactions on Knowledge and Data

Engineering 22, 11 (2010), 1637–1647.

[46] Kairouz, P., McMahan, B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., Bonawitz, K., et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977 (2019).

[47] Katib, A., Rao, P., and Slavov, V. A Tool for efficiently processing SPARQL queries

on RDF quads. In International Semantic Web Conference (Posters, Demos &

Industry Tracks) (2017).

[48] Katib, A., Slavov, V., and Rao, P. RIQ: Fast processing of SPARQL queries on

RDF quadruples. Journal of Web Semantics 37 (2016), 90–111.

[49] Kleberson, J. d. A., dos Santos, J. L. C., and Moreira, D. A. BioDSL: A Domain-

Specific Language for mapping and dissemination of Biodiversity Data in the LOD.

In Anais do X Brazilian e-Science Workshop (2018), SBC.

[50] Klein, D., and Manning, C. D. Accurate unlexicalized parsing. In Proceedings of

the 41st Annual Meeting on Association for Computational Linguistics-Volume 1

(2003), Association for Computational Linguistics, pp. 423–430.

[51] Knoblock, C. A., and Szekely, P. Exploiting semantics for big data integration. AI

Magazine 36, 1 (2015), 25–39.

[52] Konecny, J., McMahan, B., Ramage, D., and Richtarik, P. Federated optimiza-

tion: distributed machine learning for on-device intelligence. arXiv preprint

arXiv:1610.02527 (2016).

[53] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009).

[54] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied

to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.

[55] Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., and Smith, V. Federated

optimization in heterogeneous networks. Proceedings of Machine Learning and

Systems 2 (2020), 429–450.

[56] Liu, H., and Singh, P. ConceptNet – a practical commonsense reasoning tool-kit. BT Technology Journal 22, 4 (2004), 211–226.

[57] Lynch, C. How do your data grow? Nature 455, 7209 (2008), 28–29.

[58] Maaten, L. v. d., and Hinton, G. Visualizing data using t-SNE. Journal of Machine

Learning Research 9, Nov (2008), 2579–2605.

[59] Manning, C., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S., and McClosky, D.

The Stanford CoreNLP natural language processing toolkit. In Proceedings of

52nd Annual Meeting of the Association for Computational Linguistics: System

Demonstrations (2014), pp. 55–60.

[60] Martinez-Gil, J. An overview of textual semantic similarity measures based on web intelligence. Artificial Intelligence Review 42, 4 (2014), 935–943.

[61] McMahan, B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics (2017), pp. 1273–1282.

[62] McMahan, B., and Ramage, D. Federated learning: collaborative machine learning

without centralized training data. Google Research Blog 3 (2017).

[63] McMahan, B., Ramage, D., Talwar, K., and Zhang, L. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963 (2017).

[64] Meester, B. D. High quality schema and data transformations for linked data generation. In Proceedings of the Doctoral Consortium, part of CAiSE (2018), pp. 1–9.

[65] Mikolov, T., Chen, K., Corrado, G. S., and Dean, J. Efficient estimation of word representations in vector space. CoRR abs/1301.3781 (2013).

[66] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (2013), Curran Associates Inc., pp. 3111–3119.

[67] Miller, G. A. WordNet: A lexical database for English. Communications of the

ACM 38, 11 (1995), 39–41.

[68] Mrkšić, N., Vulić, I., Séaghdha, D. Ó., Leviant, I., Reichart, R., Gašić, M., Korhonen, A., and Young, S. Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics 5 (2017), 309–324.

[69] Myers, L., and Sirois, M. J. Spearman correlation coefficients, differences between.

Encyclopedia of Statistical Sciences 12 (2004).

[70] Nadeau, D., and Sekine, S. A survey of named entity recognition and classification.

Lingvisticae Investigationes 30, 1 (2007), 3–26.

[71] Neumann, T., and Weikum, G. The RDF-3X engine for scalable management of

RDF data. The VLDB Journal 19, 1 (2010), 91–113.

[72] Nickel, M., Rosasco, L., and Poggio, T. Holographic embeddings of knowledge

graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence

(2016), AAAI Press, pp. 1955–1961.

[73] Nishio, T., and Yonetani, R. Client selection for federated learning with heterogeneous resources in mobile edge. In ICC 2019 - 2019 IEEE International Conference on Communications (ICC) (2019), pp. 1–7.

[74] Paulheim, H. Knowledge graph refinement: A survey of approaches and evaluation

methods. Semantic web 8, 3 (2017), 489–508.

[75] Pauwels, P., Corry, E., and O'Donnell, J. Making SimModel information available as RDF graphs. eWork and eBusiness in Architecture, Engineering and Construction: ECPPM (2014), 439–445.

[76] Pedersen, T., Patwardhan, S., Michelizzi, J., et al. WordNet: Similarity-measuring

the relatedness of concepts. In AAAI (2004), vol. 4, pp. 25–29.

[77] Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word

representation. In Proceedings of the 2014 Conference on Empirical Methods

in Natural Language Processing (EMNLP) (2014), Association for Computational

Linguistics, pp. 1532–1543.

[78] Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social rep-

resentations. In Proceedings of the 20th ACM SIGKDD International Conference

on Knowledge Discovery and Data Mining (2014), ACM, pp. 701–710.

[79] Pham, M., Alse, S., Knoblock, C. A., and Szekely, P. Semantic labeling: a

domain-independent approach. In International Semantic Web Conference (2016),

Springer, pp. 446–462.

[80] Pichai, S. Google's Sundar Pichai: Privacy should not be a luxury good. New York Times, May 7, 2019.

[81] Ramnandan, S. K., Mittal, A., Knoblock, C. A., and Szekely, P. Assigning semantic

labels to data sources. In European Semantic Web Conference (2015), Springer,

pp. 403–417.

[82] Rao, P., Katib, A., and Barron, D. E. L. A knowledge ecosystem for the food,

energy, and water system. CoRR abs/1609.05359 (2016).

[83] Rao, P., Katib, A., and Barron, D. E. L. A knowledge ecosystem for the food,

energy, and water system. arXiv preprint arXiv:1609.05359 (2016).

[84] Rozemberczki, B., Davies, R., Sarkar, R., and Sutton, C. GEMSEC: Graph em-

bedding with self clustering. In Proceedings of the 2019 IEEE/ACM International

Conference on Advances in Social Networks Analysis and Mining 2019 (2019),

ACM, pp. 65–72.

[85] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z.,

Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. ImageNet

large scale visual recognition challenge. International Journal of Computer Vision

115, 3 (2015), 211–252.

[86] Sachan, D., Zaheer, M., and Salakhutdinov, R. Revisiting LSTM networks for

semi-supervised text classification via mixed objective function. Proceedings of

the AAAI Conference on Artificial Intelligence 33 (2019), 6940–6948.

[87] Schnabel, T., Labutov, I., Mimno, D., and Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (2015), Association for Computational Linguistics, pp. 298–307.

[88] Shen, W., Wang, J., and Han, J. Entity linking with a knowledge base: issues,

techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering

27, 2 (2015), 443–460.

[89] Shields, C. Text-based document similarity matching using sdtext. In 2016

49th Hawaii International Conference on System Sciences (HICSS) (2016), IEEE,

pp. 5607–5616.

[90] Slavov, V., Katib, A., Rao, P., Paturi, S., and Barenkala, D. Fast processing of

SPARQL queries on RDF quadruples. In Proceedings of the 17th International

Workshop on the Web and Databases (WebDB 2014) (2015), pp. 1–5.

[91] Smith, C. 160 YouTube statistics and facts (2020), by the numbers. Available from https://expandedramblings.com/index.php/youtube-statistics; retrieved March 2017.

[92] Smith, S., Kindermans, P.-J., Ying, C., and Le, Q. Don’t decay the learning rate,

increase the batch size. arXiv preprint arXiv:1711.00489 (2017).

[93] Smywiński-Pohl, A. Classifying the Wikipedia articles into the OpenCyc taxonomy. In WoLE@ISWC (2012), pp. 5–16.

[94] Spagnola, S., and Lagoze, C. Edge dependent pathway scoring for calculating

semantic similarity in ConceptNet. In Proceedings of the Ninth International

Conference on Computational Semantics (2011), Association for Computational

Linguistics, pp. 385–389.

[95] Speer, R., Chin, J., and Havasi, C. ConceptNet 5.5: An open multilingual graph

of general knowledge. In Proceedings of the Thirty-First AAAI Conference on

Artificial Intelligence (2017), AAAI Press, pp. 4444–4451.

[96] Speer, R., and Havasi, C. ConceptNet 5: A large semantic network for relational knowledge. In The People's Web Meets NLP. Springer, 2013, pp. 161–176.

[97] Stadler, C., Unbehauen, J., Westphal, P., Sherif, M. A., and Lehmann, J. Simplified

RDB2RDF Mapping. In LDOW@ WWW (2015).

[98] Suchanek, F. M., Kasneci, G., and Weikum, G. Yago: A core of semantic knowl-

edge. In Proceedings of the 16th International Conference on World Wide Web

(2007), ACM, pp. 697–706.

[99] Google Support. Your chats stay private while Messages improves suggestions.

[100] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. LINE: Large-scale

information network embedding. In Proceedings of the 24th International Con-

ference on World Wide Web (2015), International World Wide Web Conferences

Steering Committee, pp. 1067–1077.

[101] Varelas, G., Voutsakis, E., Raftopoulou, P., Petrakis, E. G., and Milios, E. E. Semantic similarity methods in WordNet and their application to information retrieval on the web. In Proceedings of the 7th Annual ACM International Workshop on Web Information and Data Management (2005), ACM, pp. 10–16.

[102] Vashishth, S., Jain, P., and Talukdar, P. CESI: Canonicalizing open knowledge

bases using embeddings and side information. In Proceedings of the 2018 World

Wide Web Conference on World Wide Web (2018), International World Wide Web

Conferences Steering Committee, pp. 1317–1327.

[103] Verborgh, R., and De Wilde, M. Using OpenRefine. Packt Publishing Ltd, 2013.

[104] Vulic,´ I., and Mrksiˇ c,´ N. Specialising word vectors for lexical entailment. In Pro-

ceedings of the 2018 Conference of the North American Chapter of the Association

for Computational Linguistics: Human Language Technologies, Volume 1 (Long

Papers) (2018), Association for Computational Linguistics, pp. 1134–1145.

[105] Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W., and Yang, S. Community preserv-

ing network embedding. In Proceedings of the Thirty-First AAAI Conference on

Artificial Intelligence (2017), AAAI Press, pp. 203–209.

[106] Yang, T., Andrew, G., Eichner, H., Sun, H., Li, W., Kong, N., Ramage, D., and

Beaufays, F. Applied federated learning: Improving Google keyboard query sug-

gestions. arXiv preprint arXiv:1812.02903 (2018).

[107] Ye, F., Chen, C., and Zheng, Z. Deep autoencoder-like nonnegative matrix factorization for community detection. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management (2018), ACM, pp. 1393–1402.

[108] Zachariah, A., Gharibi, M., and Rao, P. A Large-Scale Image Retrieval System for

Everyday Scenes. In Proceedings of the 2020 ACM International Conference on

Multimedia in Asia (MMAsia ’20) (2020), pp. 1–3.

[109] Zachariah, A., Gharibi, M., and Rao, P. QIK: A System for Large-Scale Image

Retrieval on Everyday Scenes With Common Objects. In Proceedings of the

2020 ACM International Conference on Multimedia Retrieval (ICMR ’20) (2020),

pp. 126–135.

[110] Zesch, T., Müller, C., and Gurevych, I. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08) (2008), European Language Resources Association (ELRA), pp. 1646–1652.

VITA

Mohamed Gharibi joined UMKC in 2015 to pursue a Master's and a PhD in Computer Science. His research interests include Data Science, Machine Learning, and Federated Learning. He is also pursuing a degree in higher-education teaching, Preparing

Future Faculty, from the School of Education supported with a Scholar Award from the

School of Graduate Studies at UMKC. Mohamed is working under the supervision of Dr.

Praveen Rao and he is currently a Machine Learning Engineer at IBM.

Mohamed published the following papers:

• Gharibi, M., Rao, P. RefinedFed: A Refining Algorithm for Federated Learning.

Applied Imagery Pattern Recognition (AIPR), the 49th IEEE Annual Applied Im-

agery Pattern Recognition Workshop, 5 pages, 2020 [34]

• Zachariah, A., Gharibi, M., Rao, P., A Large-Scale Image Retrieval System for

Everyday Scenes. ACM Multimedia Asia (ACM MM Asia), 2020 [108]

• Gharibi, M., Zachariah, A., Rao, P., FoodKG: A Tool to Enrich Knowledge Graphs

Using Machine Learning Techniques. Frontiers in Big Data. doi: 10.3389/fdata.2020.00012, 2020 [36]

• Zachariah, A., Gharibi, M., Rao, P., QIK: A System for Large-Scale Image Re-

trieval on Everyday Scenes with Common Objects. International Conference on

Multimedia Retrieval (ICMR 2020) [109]

• Gharibi, M., Rao, P., and Alrasheed, N. RichRDF: A Tool for Enriching Food, Energy, and Water Datasets with Semantically Related Facts and Images. International Semantic Web Conference (ISWC 2018) [35]
