Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction – Two Sides of the Same Coin?

Jan Portisch (a, b), Nicolas Heist (a) and Heiko Paulheim (a, *)
(a) Data and Web Science Group, University of Mannheim
E-mails: [email protected], [email protected], [email protected]
(b) SAP SE, Germany
E-mail: [email protected]
* Corresponding author. E-mail: [email protected].

Abstract. Knowledge Graph Embeddings, i.e., projections of entities and relations to lower dimensional spaces, have been proposed for two purposes: (1) providing an encoding for data mining tasks, and (2) predicting links in a knowledge graph. Both lines of research have been pursued rather in isolation from each other so far, each with their own benchmarks and evaluation methodologies. In this paper, we argue that both tasks are actually related, and we show that the first family of approaches can also be used for the second task and vice versa. In two series of experiments, we provide a comparison of both families of approaches on both tasks, which, to the best of our knowledge, has not been done so far. Furthermore, we discuss the differences in the similarity functions evoked by the different embedding approaches.

Keywords: Knowledge Graph Embedding, Link Prediction, Data Mining, RDF2vec

1. Introduction

In the recent past, the topic of knowledge graph embedding – i.e., projecting entities and relations in a knowledge graph into a numerical vector space – has gained a lot of traction. An often cited survey from 2017 [1] already lists 25 approaches, with new models being proposed almost every month, as depicted in Fig. 1.

Fig. 1. Publications with Knowledge Graph Embedding in their title or abstract, created with dimensions.ai

Even more remarkably, two mostly disjoint strands of research have emerged in that vivid area. The first family of research works focuses mostly on link prediction [2], i.e., the approaches are evaluated in a knowledge graph refinement setting [3]. The optimization goal here is to distinguish correct from incorrect triples in the knowledge graph as well as possible. The evaluations of these approaches are always conducted within the knowledge graph, using the existing knowledge graph assertions as ground truth.


A second strand of research focuses on the embedding of entities in the knowledge graph for downstream tasks outside the knowledge graph, which often come from the data mining field – hence, we coin this family of approaches embeddings for data mining. Examples include the prediction of external variables for entities in a knowledge graph [4], information retrieval backed by a knowledge graph [5], or the usage of a knowledge graph in content-based recommender systems [6]. In those cases, the optimization goal is to create an embedding space which reflects semantic similarity as well as possible (e.g., in a recommender system, items similar to those the user is interested in should be recommended). The evaluations here are always conducted outside the knowledge graph, based on external ground truth.

In this paper, we want to look at the commonalities and differences of the two approaches. We look at two of the most basic and well-known approaches of both strands, i.e., TransE [7] and RDF2vec [4], and analyze and compare their optimization goals in a simple example. Moreover, we analyze the performance of approaches from both families in the respective other evaluation setup: we explore the usage of link-prediction based embeddings for other downstream tasks based on similarity, and we propose a link prediction method based on RDF2vec. From those experiments, we derive a set of insights into the differences of the two families of methods, and a few recommendations on which kind of approach should be used in which setting.

2. Related Work

As pointed out above, the number of works on knowledge graph embedding is legion, and enumerating them all in this section would go beyond the scope of this paper. However, there have already been quite a few survey articles.

The first strand of research works – i.e., knowledge graph embeddings for link prediction – has been covered in different surveys, such as [1], and, more recently, [8], [9], and [10]. Those reviews follow a similar systematization, distinguishing different families of approaches: translational distance models [1] or geometric models [9] treat link prediction as a geometric task, i.e., they project the graph into a vector space so that a translation operation defined for a relation r on a head h yields a result close to the tail t.

The second family among the link prediction embeddings are semantic matching [1] or matrix factorization or tensor decomposition [9] models. Here, a knowledge graph is represented as a three-dimensional tensor, which is decomposed into smaller matrices or tensors. The reconstruction operation can then be used for link prediction.

The third and youngest family among the link prediction embeddings is based on deep learning and graph neural networks. Here, neural network training approaches, such as convolutional neural networks, capsule networks, or recurrent neural networks, are adapted to work with knowledge graphs [9].

While most of those approaches only consider graphs with nodes and edges, most knowledge graphs also contain literals, e.g., strings and numeric values. [11] provides a survey of approaches which take such literal information into account. It is also one of the few review articles which considers embedding methods from both research strands.

Link prediction is typically evaluated on a set of standard datasets, using a within-KG protocol where the triples in the knowledge graph are divided into a training, testing, and validation set. Prediction accuracy is then assessed on the validation set. Datasets commonly used for the evaluation are FB15k, which is a subset of Freebase, and WN18, which is derived from WordNet [7]. Since it has been remarked that those datasets contain too many simple inferences due to inverse relations, the more challenging variants FB15k-237 [12] and WN18RR [13] have been proposed. More recently, evaluation sets based on larger knowledge graphs, such as YAGO3-10 [13] and DBpedia50k/DBpedia500k [14], have been introduced.
The second strand of research works, focusing on embeddings for downstream tasks (which often come from the domain of data mining), is not as extensively reviewed, and the number of works in this area is still smaller. One of the more comprehensive evaluations is shown in [15], which is also one of the rare works which includes approaches from both strands in a common evaluation. They show that at least the three link prediction methods used – namely TransE, TransR, and TransH – perform worse on downstream tasks than approaches developed specifically to optimize for entity similarity in the embedding space.

For the evaluation of entity embeddings optimized for entity similarity, there are quite a few use cases at hand. The authors in [16] list a number of tasks, including classification and regression of entities based on external ground truth variables, entity clustering, as well as identifying semantically related entities.

Works which explicitly compare approaches from both research strands are still rare. In [17], an in-KG scenario, i.e., the detection and correction of erroneous links, is considered. The authors compare RDF2vec (with an additional classification layer) to TransE and DistMult on the link prediction task. The results are mixed: While RDF2vec outperforms TransE and DistMult in terms of Mean Reciprocal Rank and Precision@1, it is inferior in Precision@10. Since the results are only validated on one single dataset, the evidence is rather thin.

In [18], the authors analyze the vector spaces of different embedding models with respect to class separation, i.e., they fit the best linear separation between classes in different embedding spaces. According to their findings, RDF2vec achieves a better linear separation than the models tailored to link prediction.

Overall, while there are quite a few works using and evaluating approaches from both research strands, direct comparisons are rather rare. This paper aims at closing that gap.

3. Knowledge Graph Embedding Methods for Data Mining

Traditionally, most data mining methods work on propositional data, i.e., each instance is a row in a table, described by a set of (binary, numeric, or categorical) features. For using knowledge graphs in data mining, one needs to either develop methods which work on graphs instead of propositional data, or find ways to represent instances of the knowledge graph as feature vectors [19]. The latter is often referred to as propositionalization [20].

RDF2vec [4] is a prominent example from the second family. It adapts the word2vec approach [21] for deriving word embeddings (i.e., vector representations for words) from a corpus of sentences. RDF2vec creates such sentences by performing random walks on an RDF graph and collecting the sequences of entities and relations, then trains a word2vec model on those sequences. It has been shown that this strategy outperforms other strategies of propositionalization. The relation between propositionalization and embedding methods has also recently been pointed out by [22].

3.1. Data Mining is based on Similarity

Predictive data mining tasks predict classes or numerical values for instances. A typical application is a recommender system: given the set of items (e.g., movies, books) a user liked, predict whether s/he likes a particular other item (i.e., make a binary prediction: yes/no), or, given the (numerical) ratings a user gave to items in the past, predict the rating s/he will give to other instances. Such ratings can then be used to create a sorted list of recommendations.

Content-based recommender systems are one family of recommender systems. They rely on item similarity, i.e., if a user liked item A in the past, the system will recommend items which are similar to A. To that end, each item is represented as a set of features, and items with similar features are considered similar.

RDF2vec has been shown to be usable for recommender systems, since the underlying method tends to create similar vectors for similar entities, i.e., to position them closer in the vector space [6]. Figure 2 illustrates this using a 2D PCA plot of RDF2vec vectors for movies in DBpedia. It can be seen that clusters of movies, e.g., Disney movies, Star Trek movies, and Marvel related movies, are formed.

Fig. 2. RDF2vec embeddings for movies in DBpedia, from [6].

Similar to recommender systems, many techniques for predictive data mining rely on similarity in one way or the other. This is most obvious for, e.g., k-nearest neighbors, where the predicted label for an instance is the majority or average of the labels of its closest neighbors (i.e., its most similar instances), or Naive Bayes, where an instance is predicted to belong to a class if its feature values are most similar to the typical distribution of features for this class (i.e., it is similar to an average member of this class). A similar argument can be made for neural networks, where one can assume a similar output when changing the value of one input neuron (i.e., one feature value) by a small delta. Other classes of approaches (such as Support Vector Machines) use the concept of class separability, which is similar to exploiting similarity: datasets with well separable classes have similar instances (belonging to the same class) close to each other, while dissimilar instances (belonging to different classes) are further away from each other [23].
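To make this concrete, the following sketch (ours, not part of the original evaluation setup) feeds entity vectors into a k-nearest-neighbor classifier. The vectors and labels are random placeholders standing in for trained embeddings and an external ground truth; only the wiring is the point.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)

# Hypothetical lookup: entity -> 200-dimensional embedding vector.
# In practice, these vectors would come from a trained embedding model.
entities = ["Berlin", "Mannheim", "Paris", "Strasbourg", "Brussels", "WashingtonDC"]
vectors = {e: rng.normal(size=200) for e in entities}

# Hypothetical external ground truth, e.g., a binary indicator per city.
labels = {"Berlin": 1, "Mannheim": 0, "Paris": 1, "Strasbourg": 0, "Brussels": 1}

X = np.array([vectors[e] for e in labels])  # one row of features per entity
y = np.array(list(labels.values()))         # target variable from outside the KG

# k-NN predicts from the labels of the closest (i.e., most similar) entities,
# so prediction quality hinges on similar entities being nearby in the space.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([vectors["WashingtonDC"]]))

With real embeddings, the classifier inherits whatever notion of similarity the embedding encodes – which is exactly what the remainder of this section examines.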

3.2. Creating Similar Embeddings for Similar Instances

To understand how (and why) RDF2vec creates embeddings which place similar entities nearby, we use the running example depicted in Fig. 3, showing a number of European cities, countries, and the heads of those governments.

Fig. 3. Example graph used for illustration

As discussed above, the first step of RDF2vec is to create random walks on the graph. To that end, RDF2vec starts a fixed number of random walks of a fixed maximum length from each entity. Since the example above is very small, we will, for the sake of illustration, enumerate all walks of length 4 that can be created for the graph. Those walks are depicted in Fig. 5. It is notable that, since the graph has nodes without outgoing edges, some of the walks are actually shorter than 4.

In the next step, the walks are used to train a predictive model. Since RDF2vec uses word2vec, it can be trained with the two flavors of word2vec, i.e., CBOW (continuous bag of words) and SG (skip gram). The first predicts a word given its surrounding words, the second predicts the surroundings given a word. For the sake of our argument, we will only consider the second variant, depicted in Fig. 6. Simply speaking, given training examples where the input is the target word (as a one-hot encoded vector) and the output is the context words (again, one-hot encoded vectors), a neural network is trained whose hidden layer is typically of smaller dimensionality than the input. That hidden layer is later used to produce the actual embedding vectors.

Fig. 6. The Skip Gram variant of word2vec [4]

To create the training examples, a window of a given size is slid over the input sentences. Here, we use a window of size 2, which means that the two words preceding and the two words succeeding a target word are taken into consideration. Fig. 7 shows the training examples generated for six instances.

Fig. 7. Training examples for the instances Paris, Berlin, Mannheim, Angela Merkel, Donald Trump, and Belgium (upper part) and majority predictions (lower part).

Target Word     w−2        w−1                 w+1                w+2
Paris           France     capital             locatedIn          France
Paris           –          –                   locatedIn          France
Paris           –          –                   locatedIn          France
Paris           –          –                   locatedIn          France
Paris           France     capital             –                  –
Paris           France     capital             –                  –
Berlin          –          –                   locatedIn          Germany
Berlin          Germany    capital             –                  –
Berlin          –          –                   locatedIn          Germany
Berlin          –          –                   locatedIn          Germany
Berlin          Germany    capital             locatedIn          Germany
Berlin          Germany    capital             –                  –
Mannheim        –          –                   locatedIn          Germany
Mannheim        –          –                   locatedIn          Germany
Mannheim        –          –                   locatedIn          Germany
Angela Merkel   Germany    headOfGovernment    –                  –
Angela Merkel   Germany    headOfGovernment    –                  –
Angela Merkel   Germany    headOfGovernment    –                  –
Donald Trump    USA        headOfGovernment    –                  –
Donald Trump    USA        headOfGovernment    –                  –
Belgium         –          –                   partOf             EU
Belgium         –          –                   capital            Brussels
Belgium         Brussels   locatedIn           –                  –
Belgium         –          –                   partOf             EU
Belgium         –          –                   headOfGovernment   Sophie Wilmes
Belgium         Brussels   locatedIn           headOfGovernment   Sophie Wilmes
Belgium         Brussels   locatedIn           partOf             EU
Belgium         Brussels   locatedIn           capital            Brussels
Belgium         Brussels   locatedIn           –                  –

Majority predictions:
Paris           France     capital             locatedIn          France
Berlin          Germany    capital             locatedIn          Germany
Mannheim        –          –                   locatedIn          Germany
Angela Merkel   Germany    headOfGovernment    –                  –
Donald Trump    USA        headOfGovernment    –                  –
Belgium         Brussels   locatedIn           partOf             EU

A model that learns to predict the context given the target word would now learn to predict, at the output layer called output in Fig. 6, the majority of the context words for the target word at hand, as depicted in the lower part of Fig. 7. Here, we can see that Paris and Berlin share two out of four predictions, and so do Mannheim and Berlin. Paris and Mannheim share one out of four predictions.

Considering again Fig. 6, given that the activation function which computes the output from the projection values is continuous, similar activations on the output layer require similar values on the projection layer. Hence, for a well fit model, the distances on the projection layer between Paris, Berlin, and Mannheim should be comparatively lower than their distances to the other entities, since they activate similar outputs. (Note that there are still weights learned for the individual connections between the projection and the output layer, which emphasize some connections more strongly than others. Hence, we cannot simplify our argumentation to something like "with two common context words activated, the entities must be projected twice as close as those with one common context word activated".)

Fig. 8 depicts a two-dimensional RDF2vec embedding learned for the example graph (created with PyRDF2vec [24], using two dimensions, a walk length of 8, and the standard configuration otherwise). We can observe that there are clusters of persons, countries, and cities. The grouping of similar objects also goes further – we can, e.g., observe that the European cities in the dataset are embedded closer to each other than to Washington D.C. This is in line with previous observations showing that RDF2vec is particularly well suited for creating clusters also for finer-grained classes [25]. A predictive model could now exploit those similarities, e.g., for type prediction, as proposed in [26] and [25].

Fig. 8. The example graph embedded with RDF2vec

Fig. 4. Triples of the example Knowledge Graph

Berlin locatedIn Germany .
Mannheim locatedIn Germany .
Germany partOf EU .
EU governmentSeat Brussels .
France capital Paris .
Strasbourg locatedIn France .
France partOf EU .
Germany headOfGovernment Angela_Merkel .
Belgium capital Brussels .
Belgium partOf EU .
Belgium headOfGovernment Sophie_Wilmes .
USA capital Washington_DC .
Washington_DC locatedIn USA .
France headOfGovernment Emmanuel_Macron .
Paris locatedIn France .
Germany capital Berlin .
Brussels locatedIn Belgium .
USA headOfGovernment Donald_Trump .
EU governmentSeat Strasbourg .

Fig. 5. Walks extracted from the example graph

Belgium partOf EU governmentSeat Brussels
Belgium capital Brussels locatedIn Belgium
Belgium partOf EU governmentSeat Strasbourg
Belgium headOfGovernment Sophie_Wilmes
Berlin locatedIn Germany capital Berlin
Berlin locatedIn Germany headOfGovernment Angela_Merkel
Berlin locatedIn Germany partOf EU
Brussels locatedIn Belgium headOfGovernment Sophie_Wilmes
Brussels locatedIn Belgium partOf EU
Brussels locatedIn Belgium capital Brussels
EU governmentSeat Strasbourg locatedIn France
EU governmentSeat Brussels locatedIn Belgium
France headOfGovernment Emmanuel_Macron
France capital Paris locatedIn France
France partOf EU governmentSeat Brussels
France partOf EU governmentSeat Strasbourg
Germany partOf EU governmentSeat Brussels
Germany partOf EU governmentSeat Strasbourg
Germany capital Berlin locatedIn Germany
Germany headOfGovernment Angela_Merkel
Mannheim locatedIn Germany capital Berlin
Mannheim locatedIn Germany headOfGovernment Angela_Merkel
Mannheim locatedIn Germany partOf EU
Paris locatedIn France headOfGovernment Emmanuel_Macron
Paris locatedIn France partOf EU
Paris locatedIn France capital Paris
Strasbourg locatedIn France capital Paris
Strasbourg locatedIn France headOfGovernment Emmanuel_Macron
Strasbourg locatedIn France partOf EU
USA headOfGovernment Donald_Trump
USA capital Washington_DC locatedIn USA
Washington_DC locatedIn USA capital Washington_DC
Washington_DC locatedIn USA headOfGovernment Donald_Trump
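The pipeline behind Figs. 4 and 5 can be written down compactly. The following is a simplified sketch under two assumptions we make for illustration: it enumerates all outgoing walks instead of sampling a fixed number of random walks, and it uses only a subset of the example triples. RDF2vec implementations such as pyRDF2Vec [24] wrap exactly this kind of pipeline; here we call gensim's Word2Vec directly, with illustrative (not the paper's) parameters.

from gensim.models import Word2Vec

# A subset of the triples from Fig. 4.
triples = [
    ("Berlin", "locatedIn", "Germany"),
    ("Mannheim", "locatedIn", "Germany"),
    ("Germany", "capital", "Berlin"),
    ("Germany", "partOf", "EU"),
    ("Germany", "headOfGovernment", "Angela_Merkel"),
    ("EU", "governmentSeat", "Brussels"),
    ("Brussels", "locatedIn", "Belgium"),
]

out_edges = {}
for s, p, o in triples:
    out_edges.setdefault(s, []).append((p, o))

def walks_from(entity, hops):
    """Enumerate all outgoing walks with up to `hops` edges, as token lists
    (entity, relation, entity, ...). RDF2vec samples a fixed number of random
    walks per entity instead of enumerating all of them."""
    walks = []
    def expand(walk, node, remaining):
        if remaining == 0 or node not in out_edges:
            walks.append(walk)
            return
        for p, o in out_edges[node]:
            expand(walk + [p, o], o, remaining - 1)
    expand([entity], entity, hops)
    return walks

corpus = [walk for e in out_edges for walk in walks_from(e, hops=2)]

# Train skip-gram word2vec on the walks: entities and relations are "words",
# and each walk is one "sentence".
model = Word2Vec(sentences=corpus, vector_size=200, window=5, sg=1, min_count=1)
print(model.wv.most_similar("Berlin", topn=3))

On a graph this small, the resulting neighborhoods are noisy; on a large graph such as DBpedia, the same construction yields the clusters discussed above.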

3.3. Usage for Link Prediction

From Fig. 8, we can assume that link prediction should, in principle, be possible. For example, the predictions for heads of governments all point in a similar direction. This is in line with what is known about word2vec, which allows for computing analogies, like the well-known example

v(King) − v(Man) + v(Woman) ≈ v(Queen)    (1)

RDF2vec does not learn relation embeddings, only entity embeddings. (Technically, we can also make RDF2vec learn embeddings for the relations, but they would not behave the way we need them here.) Hence, we cannot directly predict relations, but we can use those analogies. If we want to make a tail prediction like

⟨h, r, ?⟩,    (2)

we can identify another pair ⟨h', r, t'⟩ and exploit the above analogy, i.e.,

t ≈ t' − h' + h    (3)

To come to a stable prediction, we would use the average, i.e.,

t ≈ ( Σ_{⟨h',r,t'⟩} (t' − h' + h) ) / |⟨h', r, t'⟩|    (4)

With the same idea, we can also compute an average relation vector r for each relation, over all head–tail pairs between which it holds, i.e.,

r ≈ ( Σ_{⟨h',r,t'⟩} (t' − h') ) / |⟨h', r, t'⟩|,    (5)

and thereby reformulate the above equation to

t ≈ h + r,    (6)

which is what we expect from an embedding model for link prediction. Those approximate relation vectors for the example at hand are depicted in Fig. 9. We can see that in some (not all) cases, the directions of the vectors are approximately correct: the partOf vector is roughly the difference between EU and Germany, France, and Belgium, and the headOfGovernment vector is approximately the vector between the countries and the politicians cluster. On the other hand, the capital vector seems to be less accurate.

Fig. 9. Average relation vectors for the example

It can also be observed that the vectors for locatedIn and capital point in reverse directions, which makes sense because they form connections between the two clusters (countries and cities) in opposite directions.
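Equations (4)–(6) translate directly into a few lines of vector arithmetic. The sketch below is an illustration of the averaging idea, not the exact implementation evaluated later in Section 5.2; the entity vectors and head–tail pairs are hypothetical placeholders.

import numpy as np

def average_relation_vector(vec, pairs):
    """Eq. (5): approximate r as the mean of t' - h' over all known pairs."""
    return np.mean([vec[t] - vec[h] for h, t in pairs], axis=0)

def predict_tail(vec, r_vec, h, k=3):
    """Eq. (6): rank all entities by their distance to h + r."""
    target = vec[h] + r_vec
    return sorted(vec, key=lambda e: np.linalg.norm(vec[e] - target))[:k]

# Hypothetical embeddings and known pairs for the relation headOfGovernment.
rng = np.random.default_rng(0)
vec = {e: rng.normal(size=200) for e in
       ["Germany", "Angela_Merkel", "France", "Emmanuel_Macron",
        "USA", "Donald_Trump"]}
pairs = [("Germany", "Angela_Merkel"), ("France", "Emmanuel_Macron")]

r = average_relation_vector(vec, pairs)
print(predict_tail(vec, r, "USA"))  # ideally ranks Donald_Trump near the top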

4. Knowledge Graph Embedding Methods for Link Prediction

A larger body of work has been devoted to knowledge graph embedding methods for link prediction. Here, the goal is to learn a model which embeds entities and relations in the same vector space.

4.1. Link Prediction is based on Vector Operations

As the main objective is link prediction, most models, more or less, try to find a vector space embedding of entities and relations so that

t ≈ h ⊕ r    (7)

holds for as many triples ⟨h, r, t⟩ as possible. ⊕ can stand for different operations in the vector space; in basic approaches, simple vector addition (+) is used. In our considerations below, we will also use vector addition.

In most approaches, negative examples are created by corrupting an existing triple, i.e., by replacing the head or tail with another entity from the graph (some approaches also foresee corrupting the relation). Then, a model is learned which tries to tell apart corrupted from non-corrupted triples. The formulation in the original TransE paper [7] defines the loss function L as follows:

L = Σ_{(h,r,t)∈S} Σ_{(h',r,t')∈S'} [γ + d(h + r, t) − d(h' + r, t')]+    (8)

where γ is some margin, and d is a distance function, i.e., the L1 or L2 norm. S is the set of statements that are in the knowledge graph, and S' are the corrupted statements derived from them. In words, the formula states that for a triple ⟨h, r, t⟩, h + r should be closer to t than to t' for some corrupted tail, and similarly for a corrupted head. However, a difference of γ is accepted.
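The following sketch evaluates the margin loss of Eq. (8) once, for a toy graph with randomly initialized embeddings and one randomly corrupted tail per observed triple. An actual TransE implementation (e.g., in PyKEEN [27]) would additionally normalize the embeddings and minimize this loss by gradient descent; the sketch only shows what is being minimized.

import numpy as np

rng = np.random.default_rng(0)
entities = ["Berlin", "Mannheim", "Germany", "EU"]
relations = ["locatedIn", "partOf"]
E = {e: rng.normal(size=50) for e in entities}    # entity embeddings
R = {r: rng.normal(size=50) for r in relations}   # relation embeddings
S = [("Berlin", "locatedIn", "Germany"), ("Germany", "partOf", "EU")]

def d(x):
    return np.linalg.norm(x)  # distance function (L2 norm here)

def margin_loss(gamma=1.0):
    """One evaluation of Eq. (8), with one corrupted tail per observed triple."""
    loss = 0.0
    for h, r, t in S:
        t_corr = rng.choice([e for e in entities if e != t])  # corrupt the tail
        pos = d(E[h] + R[r] - E[t])         # distance for the observed triple
        neg = d(E[h] + R[r] - E[t_corr])    # distance for the corrupted triple
        loss += max(0.0, gamma + pos - neg)  # [x]+ = max(0, x)
    return loss

print(margin_loss())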

Fig. 10 shows the example graph from above, as embedded by TransE (created with PyKEEN [27], using 128 epochs, a learning rate of 0.1, the softplus loss function, and default parameters otherwise, as advised by the authors of PyKEEN: https://github.com/pykeen/pykeen/issues/97).

Fig. 10. Example graph embedded by TransE

Looking at the relation vectors, it can be observed that they seem approximately accurate in some cases, e.g.,

Germany + headOfGovernment ≈ Angela_Merkel,

but not everywhere. (This does not mean that TransE does not work. The training data for the very small graph is rather scarce, and two dimensions might not be sufficient to find a good solution here.)

Like in the RDF2vec example above, we can observe that the two vectors for locatedIn and capital point in opposite directions. Although less obviously than in the RDF2vec example, we can also see that entities of similar classes form (weak) clusters: cities are mostly in the right half of the space, people in the left, countries in the upper part.

4.2. Usage for Data Mining

As discussed above, positioning similar entities close in a vector space is an essential requirement for using entity embeddings in data mining tasks. To understand why an approach tailored towards link prediction can also, to a certain extent, cluster similar instances together, we first rephrase the approximate link prediction equation (7) as

t = h + r + η_{h,r,t},    (9)

where η_{h,r,t} can be considered an error term for the triple ⟨h, r, t⟩. Moreover, we define

η_max = max_{⟨h,r,t⟩∈S} |η_{h,r,t}|    (10)

Next, we consider two triples ⟨h1, r, t⟩ and ⟨h2, r, t⟩ which share a relation to an object – e.g., in our example, France and Belgium, which both share the relation partOf to EU. (Although, at first glance in Fig. 10, this relation does not seem to hold for Belgium, we can see that EU is still closer to Belgium + partOf than most other instances, which is in line with TransE's optimization goal.) In that case,

t = h1 + r + η_{h1,r,t}    (11)

and

t = h2 + r + η_{h2,r,t}    (12)

hold. From that, we get, using the triangle inequality for the first inequation:

h1 − h2 = η_{h2,r,t} − η_{h1,r,t}
⇒ |h1 − h2| = |η_{h2,r,t} − η_{h1,r,t}|
            = |η_{h2,r,t} + (−η_{h1,r,t})|
            ≤ |η_{h2,r,t}| + |−η_{h1,r,t}|
            = |η_{h2,r,t}| + |η_{h1,r,t}|
            ≤ 2 · η_max    (13)

In other words, η_max also imposes an upper bound on the distance between two entities sharing a relation to the same object. As a consequence, the lower the error in relation prediction, the closer are entities which share a common statement.

This also carries over to entities sharing the same two-hop connection. Consider two further triples ⟨h1a, ra, h1⟩ and ⟨h2a, ra, h2⟩. In our example, this could be two cities located in the two countries, e.g., Strasbourg and Brussels. In that case, we would have

h1 = h1a + ra + η_{h1a,ra,h1}    (14)
h2 = h2a + ra + η_{h2a,ra,h2}    (15)

Substituting this in (11) and (12) yields

t = h1a + ra + η_{h1a,ra,h1} + r + η_{h1,r,t}    (16)
t = h2a + ra + η_{h2a,ra,h2} + r + η_{h2,r,t}.    (17)

Consequently, using similar transformations as above, we get

h1a − h2a = η_{h2a,ra,h2} − η_{h1a,ra,h1} + η_{h2,r,t} − η_{h1,r,t}
⇒ |h1a − h2a| ≤ 4 · η_max    (18)

Again, η_max constrains the proximity of the two entities h1a and h2a, but only half as strictly as in the case of h1 and h2.

4.3. Comparing the Two Notions of Similarity

In the examples above, we can see that embeddings for link prediction have a tendency to project similar instances close to each other in the vector space. Here, the notion of similarity is that two entities are similar if they share a relation to another entity, i.e., e1 and e2 are considered similar if there exist two statements ⟨e1, r, t⟩ and ⟨e2, r, t⟩ or ⟨h, r, e1⟩ and ⟨h, r, e2⟩, or, less strongly, if there exists a chain of such statements. More formally, we can write the notion of similarity between two entities in link prediction approaches as

e1 ≈ e2 ← ∃ t, r : r(e1, t) ∧ r(e2, t)    (19)
e1 ≈ e2 ← ∃ h, r : r(h, e1) ∧ r(h, e2)    (20)

In other words: two entities are similar if they share a common connection to a common third entity.

RDF2vec, on the other hand, covers a wider range of such similarities. Looking at Fig. 7, we can observe that two entities sharing a common relation to two different objects are also considered similar (Berlin and Mannheim both share the fact that they are located in Germany; hence, their predictions for w+1 and w+2 are similar). In RDF2vec, however, similarity can also come in other notions. For example, Angela Merkel and Donald Trump are also considered similar, because they both share the relation headOfGovernment, albeit with different subjects (i.e., their predictions for w−1 are similar). In contrast, such similarities do not lead to close projections in link prediction embeddings; in fact, in Fig. 10, it can be observed that Donald Trump and Angela Merkel are not very close. In other words, the following two notions of similarity also hold for RDF2vec:

e1 ≈ e2 ← ∃ t1, t2, r : r(e1, t1) ∧ r(e2, t2)    (21)
e1 ≈ e2 ← ∃ h1, h2, r : r(h1, e1) ∧ r(h2, e2)    (22)

By a similar argument, RDF2vec also positions entities closer which share any relation to a common entity. Although this is not visible in the two-dimensional embedding depicted in Fig. 8, RDF2vec would also create vectors with some similarity for Angela Merkel and Berlin, since they both have a (albeit different) relation to Germany (i.e., their predictions for w−2 are similar). Hence, the following notions of similarity can also be observed in RDF2vec:

e1 ≈ e2 ← ∃ t, r1, r2 : r1(e1, t) ∧ r2(e2, t)    (23)
e1 ≈ e2 ← ∃ h, r1, r2 : r1(h, e1) ∧ r2(h, e2)    (24)

The example with Angela Merkel and Berlin already hints at a slightly different interpretation of proximity in the vector space evoked by RDF2vec: not only similar, but also related entities are positioned close in the vector space. (The argument in section 4.2 would also work for shared relations to common heads.) This means that, to a certain extent, RDF2vec mixes the concepts of similarity and relatedness in its distance function. We will see examples of this in later considerations, and discuss how they interfere with downstream applications.
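The head-side variants of these notions can be checked mechanically on a set of triples. The helper functions below (the names are ours, for illustration) test which kind of shared context two entities have; link prediction embeddings chiefly reward the first notion, while RDF2vec rewards all of them.

def share_relation_and_tail(triples, e1, e2):
    """Eq. (19): e1 and e2 have the same relation r to the same tail t."""
    return any((e1, r, t) in triples and (e2, r, t) in triples
               for (_, r, t) in triples)

def share_relation(triples, e1, e2):
    """Eq. (21): e1 and e2 have the same relation r, possibly to different tails."""
    rels = lambda e: {r for (h, r, _) in triples if h == e}
    return bool(rels(e1) & rels(e2))

def share_tail(triples, e1, e2):
    """Eq. (23): e1 and e2 have some (possibly different) relation to the same t."""
    tails = lambda e: {t for (h, _, t) in triples if h == e}
    return bool(tails(e1) & tails(e2))

triples = {("Berlin", "locatedIn", "Germany"),
           ("Mannheim", "locatedIn", "Germany"),
           ("Germany", "headOfGovernment", "Angela_Merkel"),
           ("USA", "headOfGovernment", "Donald_Trump")}

print(share_relation_and_tail(triples, "Berlin", "Mannheim"))  # True: Eq. (19)
print(share_relation(triples, "Germany", "USA"))               # True: Eq. (21)
print(share_relation_and_tail(triples, "Germany", "USA"))      # False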
5. Experiments

To compare the two sets of approaches, we use standard setups for evaluating knowledge graph embedding methods for data mining as well as for link prediction.

5.1. Experiments on Data Mining Tasks

In our experiments, we follow the setup proposed in [28] and [16]. Those works propose the use of data mining tasks with an external ground truth, e.g., predicting certain indicators or classes for entities. Those entities are then linked to a knowledge graph. Different feature extraction methods – which includes the generation of embedding vectors – can then be compared using a fixed set of learning methods. The setup of [16] comprises six tasks using 20 datasets in total:

– Five classification tasks, evaluated by accuracy.
– Five regression tasks, evaluated by root mean squared error.
– Four clustering tasks (with ground truth clusters), evaluated by accuracy.
– A document similarity task (where the similarity is assessed by computing the similarity between entities identified in the documents), evaluated by the harmonic mean of the Pearson and Spearman correlation coefficients. The task is based on the LP50 dataset [29].
– An entity relatedness task (where semantic similarity is used as a proxy for semantic relatedness), evaluated by Kendall's Tau. The task is based on the KORE dataset [30].
– Four semantic analogy tasks (e.g., Athens is to Greece as Oslo is to X), which are based on the original datasets on which word2vec was evaluated [21].

We follow the evaluation protocol suggested in [16]. This protocol foresees the usage of different algorithms on each task for each embedding (e.g., Naive Bayes, Decision Tree, k-NN, and SVM for classification), and also performs parameter tuning in some cases. In the end, we report the best results per task and embedding method. Those results are depicted in Table 1.

For generating the different embedding vectors, we use the DGL-KE framework [31], and we use the RDF2vec vectors provided by the KGvec2go API [32]. We compare RDF2vec [4], TransE (with L1 and L2 norm) [7], TransR [33], RotatE [34], DistMult [35], RESCAL [36], and ComplEx [37]. All embeddings are trained on DBpedia 2016-10. (The code for the experiments as well as the resulting embeddings can be found at https://github.com/nheist/KBE-for-Data-Mining.) To create the embedding vectors with DGL-KE, we use the parameter configurations recommended by the framework, a dimension of 200, and a step maximum of 1,000,000.

From the table, we can observe a few expected and a few unexpected results. First, since RDF2vec is tailored towards classic data mining tasks like classification and regression, it is not very surprising that those tasks are solved better by using RDF2vec vectors. Still, some of the link prediction methods (in particular TransE and RESCAL) perform reasonably well on those tasks.

Referring back to the different notions of similarity that these families of approaches imply (cf. section 4.3), this behavior can be explained by the tendency of RDF2vec to position entities closer in the vector space which are more similar to each other (e.g., two similar cities). Since it is likely that some of those dimensions are also correlated with the target variable at hand (in other words: they encode some dimension of similarity that can be used to predict the target variable), classifiers and regressors can pick up on those dimensions and exploit them in their prediction models.

What is also remarkable is the performance on the entity relatedness task. While RDF2vec embeddings reflect entity relatedness to a certain extent, this is not given for any of the link prediction approaches. According to the notions of similarity discussed above, this is reflected in the RDF2vec mechanism: RDF2vec has an incentive to position two entities closer in the vector space if they share relations to a common entity, as shown in equations (21)–(24). One example is the relatedness of Apple Inc. and Steve Jobs – here, we can observe the two statements

product(AppleInc., IPhone)
knownFor(SteveJobs, IPhone)

in DBpedia, among others. Those lead to similar vectors in RDF2vec according to equation (23).

The same property of also assigning closer embedding vectors to related entities explains the comparatively bad results of RDF2vec on the first two clustering tasks. Here, the task is to separate cities and countries into two clusters, but since a city is also related to the country it is located in, RDF2vec may position city and country rather closely together. Hence, that city has a certain probability of ending up in the same cluster as the country. The latter two clustering tasks are different: the third one contains five clusters (cities, albums, movies, universities, and companies), which are less likely to be strongly related (except universities and companies to cities) and therefore are more likely to be projected into different areas of the vector space. Here, the difference between RDF2vec and the best performing approaches (i.e., TransE-L1 and TransE-L2) is not that severe.

The problem of relatedness being mixed with similarity does not occur as strongly for homogeneous sets of entities, as in the classification and regression tasks, where all entities are of the same kind (cities, companies, etc.) – here, two companies which are related (e.g., because one is a holding of the other) can also be considered similar to a certain degree (in that case, they both operate in the same branch). This also explains why the fourth clustering task (where the task is to assign sports teams to clusters by the type of sport) works well for RDF2vec – here, the entities are again homogeneous.
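Coming back to the evaluation protocol described above: stripped of the framework machinery, it amounts to cross-validating a fixed set of learners on the entity vectors and reporting the best score per task and embedding. A compressed sketch with placeholder data (the learners mirror those named above; real runs would use the GEval framework [16]):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: one embedding vector and one external label per entity.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))    # e.g., RDF2vec or TransE vectors
y = rng.integers(0, 2, size=100)   # external ground truth (e.g., a class)

# Per task, the best score over the fixed set of learners is reported.
learners = [GaussianNB(), DecisionTreeClassifier(),
            KNeighborsClassifier(n_neighbors=3), SVC()]
best = max(cross_val_score(clf, X, y, cv=10).mean() for clf in learners)
print(f"best accuracy: {best:.3f}")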

At the same time, the test case of clustering teams can also be used to explain why link prediction approaches work well for that kind of task: here, it is likely that two teams in the same sport share a relation to a common entity, i.e., they fulfill equations (19) and (20). Examples include participation in the same tournaments or common former players.

The semantic analogies task also reveals some interesting findings. First, it should be noted that the relations which form the respective analogies (capital, state, and currency) are contained in the knowledge graph used for the computation. That being said, we can see that most of the link prediction approaches (except for RotatE and RESCAL) perform reasonably well here. In particular, the first cases (capitals and countries) can be solved particularly well, as capital is a 1:1 relation, which is the case in which link prediction is a fairly simple task.

On the other hand, the currency case is solved particularly badly by most of the link prediction approaches. This relation is an n:m relation (there are countries with more than one official, unofficial, or historic currency, and many currencies, like the Euro, are used across many countries). Moreover, looking into DBpedia, this relation contains a lot of mixed usage and is not maintained with very high quality. For example, DBpedia lists 33 entities whose currency is US Dollars (see http://dbpedia.org/page/United_States_dollar) – the list contains historic entities (e.g., West Berlin), errors (e.g., Netherlands), and entities which are not countries (e.g., OPEC), but the United States are not among them. For such kinds of relations, which contain a certain amount of noise and heterogeneous information, many link prediction approaches are obviously not well suited.

RDF2vec, in contrast, can deal reasonably well with that case. Here, two effects interplay when solving such tasks: (i) as shown above, relations are encoded by proximity in RDF2vec to a certain extent, i.e., the properties in equations (3) and (4) allow for analogy reasoning in the RDF2vec space in general. Moreover, (ii) we have already seen the tendency of RDF2vec to position related entities in relative proximity. Thus, for RDF2vec, it can be assumed that the following holds:

UK ≈ PoundSterling    (25)
USA ≈ USDollar    (26)

Since we can rephrase the first equation as

PoundSterling − UK ≈ 0    (27)

we can conclude that analogy reasoning in RDF2vec would yield

PoundSterling − UK + USA ≈ USDollar    (28)

Hence, in RDF2vec, two effects – the preservation of relation vectors as well as the proximity of related entities – are helpful for analogy reasoning, and the two effects also work for rather noisy cases. However, for relations which are 1:1 in the knowledge graph and come with rather clean training data, link prediction approaches are better suited for analogy reasoning.

5.2. Experiments on Link Prediction Tasks

In a second series of experiments, we analyze whether we can use embedding methods developed for similarity computation, like RDF2vec, also for link prediction. We use the two established tasks WN18 and FB15k for a comparative study.

While link prediction methods are developed for the task at hand, RDF2vec is not. Although RDF2vec computes vectors for relations, they do not necessarily follow the same notion as relation vectors for link prediction, as discussed above. Hence, we investigate two approaches:

1. We average the difference between the head and tail vectors over all pairs connected by a relation r, and use that average as a proxy for a relation vector for prediction, as shown in equation (4). The predictions are the entities whose embedding vectors are closest to the approximate prediction. This method is denoted as avg.
2. For predicting the tail of a relation, we train a neural network to predict the embedding vector of the tail based on the head and relation embedding vectors, as shown in Fig. 11. The predictions for a triple ⟨h, r, ?⟩ are the entities whose embedding vectors are closest to the vector predicted for h and r. Two separate networks are trained for predicting heads and tails. This method is denoted as ANN.

Fig. 11. Training a neural network for link prediction with RDF2vec

We trained the RDF2vec embeddings with 2,000 walks, a depth of 4, a dimension of 200, a window of 5, and 25 epochs in SG mode. For the second prediction approach, the two neural networks use two hidden layers of size 200, and we use 15 epochs, a batch size of 1,000, and mean squared error as loss.

Table 1
Results of the different data mining tasks: classification accuracy (Cities, Movies, Albums, AAUP, Forbes), clustering accuracy (Cities and Countries (2K), Cities and Countries, Cities/Albums/Movies/AAUP/Forbes, Teams), regression RMSE (Cities, Movies, Albums, AAUP, Forbes), semantic analogy accuracy (capitals and countries, cities and states, currencies and countries), document similarity on LP50 (harmonic mean of Pearson and Spearman correlation), and entity relatedness on KORE (Kendall's Tau), for RDF2vec, TransE-L1, TransE-L2, TransR, RotatE, DistMult, RESCAL, and ComplEx.

The results of the link prediction experiments are shown in Table 2. (The code for the experiments can be found at https://github.com/janothan/kbc_rdf2vec.) We can observe that the RDF2vec based approaches perform at the lower end of the spectrum. The avg approach outperforms DistMult and RESCAL on WN18, and both approaches are about on par with RESCAL on FB15k.

Table 2
Results of the link prediction tasks on WN18 and FB15k. Results for TransE and RESCAL from [7], results for RotatE from [34], results for DistMult from [35], results for TransR from [33].

Dataset  Metric              RDF2vec(avg)  RDF2vec(ANN)  TransE  TransR  RotatE  DistMult  RESCAL  ComplEx
WN18     Mean Rank Raw       147           353           263     232     -       -         1180    -
WN18     Mean Rank Filtered  135           342           251     219     309     -         1163    -
WN18     HITS@10 Raw         64.39         49.68         75.4    78.3    -       -         37.2    -
WN18     HITS@10 Filtered    71.33         55.43         89.2    91.7    95.9    57.7      52.8    94.7
FB15k    Mean Rank Raw       399           349           243     226     -       -         828     -
FB15k    Mean Rank Filtered  347           303           125     78      40      -         683     -
FB15k    HITS@10 Raw         35.31         34.25         34.9    43.8    -       -         28.4    -
FB15k    HITS@10 Filtered    40.54         41.81         47.1    65.5    88.4    94.2      44.1    84.0

While the results are not overwhelming, they show that similarity of entities, as RDF2vec models it, is at least a useful signal for implementing a link prediction approach.

5.3. Discussion

As already discussed above, the notion of similarity conveyed by RDF2vec mixes similarity and relatedness. This can be observed, e.g., when querying for the 10 closest concepts to Angela Merkel in DBpedia in the different embedding spaces, as shown in Table 3. The comparison shows a few interesting effects:

– While most of the approaches (except for RotatE) provide a clean list of people, RDF2vec brings up a larger variety of results, containing also Germany and Berlin (and also a few results which are not instances, but relations; however, those could easily be filtered out in downstream applications if necessary). This demonstrates the property of RDF2vec of mixing similarity and relatedness.
– The approaches at hand have different foci in determining similarity. For example, TransE-L2 outputs a list of German and Austrian chancellors, TransE-L1 outputs mostly leaders from different parties and countries, TransR focuses on German politicians in various parties and functions, etc. In all of those cases, the persons share some property with the query entity Angela Merkel (profession, role, nationality, etc.), but similarity is usually affected by only one of those properties. In other words: one notion of similarity dominates the others.
– In contrast, the persons in the output list of RDF2vec are related to the query entity in different respects. In particular, they played different roles during Angela Merkel's chancellorship (Gauck was the German president, Lammert was the chairman of the parliament, and Voßkuhle was the chairman of the federal court). Here, there is no dominant property; instead, similarity (or rather: relatedness) is encoded along various properties.

Table 3
Closest concepts to Angela Merkel in the different embedding approaches used.

RDF2vec                    TransE-L1           TransE-L2              TransR
Joachim Gauck              Gerhard Schröder    Gerhard Schröder
Norbert Lammert            James Buchanan                             Frank-Walter Steinmeier
Stanislaw Tillich          Neil Kinnock                               Philipp Rösler
Andreas Voßkuhle           Nicolas Sarkozy                            Gerhard Schröder
Berlin                     Joachim Gauck       Werner Faymann         Joachim Gauck
Jacques Chirac                                 Alfred Gusenbauer      Christian Wulff
Germany                                        Jürgen Trittin         Guido Westerwelle
federalState               Sigmar Gabriel      Philipp Scheidemann    Helmut Kohl
Social Democratic Party                        Guido Westerwelle      Jürgen Trittin
deputy                                         Christian Wulff        Jens Böhrnsen

RotatE                                      DistMult              RESCAL                  ComplEx
Pontine raphe nucleus                       Gerhard Schröder      Gerhard Schröder        Gerhard Schröder
Jonathan W. Bailey                          Milan Truban          Kurt Georg Kiesinger    Diána Mészáros
Zokwang Trading                             Maud Cuney Hare       Helmut Kohl             Francis M. Bator
Steven Hill                                 Tristan Matthiae      Annemarie Huber-Hotz    William B. Bridges
Chad Kreuter                                Gerda Hasselfeldt     Wang Zhaoguo            Mette Vestergaard
Fred Hibbard                                Faustino Sainz Muñoz  Franz Vranitzky         Ivan Rosenqvist
Mallory Ervin                               Joachim Gauck         Bogdan Klich            Edward Clouston
Paulinho Kobayashi                          Carsten Linnemann     İrsen Küçük             Antonio Capuzzi
Fullmetal Alchemist and the Broken Angel    Norbert Blüm          Helmut Schmidt          Steven J. McAuliffe
Archbishop Dorotheus of Athens              Neil Hood             Mao Zedong              Jenkin Coles
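The queries behind Table 3 are plain nearest-neighbor lookups in the respective vector space. A sketch using cosine similarity over a hypothetical embedding matrix:

import numpy as np

def closest_concepts(query, names, vectors, k=10):
    """Return the k entities whose vectors are most cosine-similar to `query`."""
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = vectors[names.index(query)]
    sims = V @ (q / np.linalg.norm(q))
    order = np.argsort(-sims)
    return [names[i] for i in order if names[i] != query][:k]

# Hypothetical embedding matrix; rows align with the entity names.
rng = np.random.default_rng(7)
names = ["Angela_Merkel", "Joachim_Gauck", "Gerhard_Schroeder", "Berlin", "Germany"]
vectors = rng.normal(size=(len(names), 200))
print(closest_concepts("Angela_Merkel", names, vectors, k=3))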
45 46 – The approaches at hand have different foci in instrument(GilS cottHeron, ?) 46 47 determining similarity. For example, TransE-L2 47 48 outputs a list of German and Austrian chancellors, Here, music instruments are expected in the object 48 49 position. However, the RDF2vec based approach pre- 49 50 11The code for the experiments can be found at https://github. dicts, among plausible candidates such as electric gui- 50 51 com/janothan/kbc_rdf2vec tar and acoustic guitar, also guitarist and Jimmy Page 51 14 J. Portisch et al. / Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction

The same argument underlies an observation made by Zouaq and Martel [18]: the authors found that RDF2vec is particularly well suited for distinguishing fine-grained entity classes (as opposed to coarse-grained entity classification). For fine-grained classification (e.g., distinguishing guitar players from singers), all entities to be classified are already of the same coarse class (e.g., musician), and RDF2vec is very well suited for capturing the finer differences. However, for coarse classifications, misclassifications caused by mistaking relatedness for similarity become more salient.

From the observations made in the link prediction task, we can come up with another recommendation:

– For relations which come with rather clean data quality, link prediction approaches work well. However, for more noisy data, RDF2vec has a higher tendency of creating useful embedding vectors.

For the moment, this is a hypothesis, which should be hardened, e.g., by performing controlled experiments on artificially noised link prediction tasks.

6. Conclusion and Outlook

In this paper, we have compared two use cases and families of knowledge graph embeddings which have, up to today, not undergone any thorough direct comparison: approaches developed for data mining, such as RDF2vec, and approaches developed for link prediction, such as TransE and its descendants.

We have argued that the two kinds of approaches actually do something similar, albeit being designed with different goals in mind. To support this argument, we have run two sets of experiments which examined how well the different approaches work if applied in the respective other setup. We show that, to a certain extent, embedding approaches designed for link prediction can be applied in data mining and vice versa; however, there are differences in the outcome.

From the experiments, we have also seen that proximity in the embedding spaces works differently for the two kinds of approaches: in RDF2vec, proximity encodes both similarity and relatedness, while TransE and its descendants rather encode similarity alone. On the other hand, for entities of the same type, RDF2vec covers finer-grained similarities better. Moreover, RDF2vec seems to work more stably in cases where the knowledge graphs are rather noisy and only weakly adherent to their schema.

These findings give rise both to a recommendation and to some future work. First, in use cases where relatedness plays a role next to similarity, or in use cases where all entities are of the same type, RDF2vec may yield better results. On the other hand, for cases with mixed entity types where it is important to separate the types, link prediction embeddings might yield better results.

Moreover, the open question remains whether it is possible to develop embedding methods that combine the best of both worlds – e.g., that provide both the coarse type separation of TransE and its descendants and the fine-grained type separation of RDF2vec, or that support competitive link prediction while also representing relatedness. We expect to see some interesting developments along these lines in the future.

References

[1] Q. Wang, Z. Mao, B. Wang and L. Guo, Knowledge graph embedding: A survey of approaches and applications, IEEE Transactions on Knowledge and Data Engineering 29(12) (2017), 2724–2743.
[2] X. Han, S. Cao, X. Lv, Y. Lin, Z. Liu, M. Sun and J. Li, OpenKE: An open toolkit for knowledge embedding, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 139–144.
[3] H. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8(3) (2017), 489–508.
[4] P. Ristoski and H. Paulheim, RDF2vec: RDF graph embeddings for data mining, in: International Semantic Web Conference, Springer, 2016, pp. 498–514.
[5] B. Steenwinckel, G. Vandewiele, I. Rausch, P. Heyvaert, R. Taelman, P. Colpaert, P. Simoens, A. Dimou, F. De Turck and F. Ongenae, Facilitating the Analysis of COVID-19 Literature Through a Knowledge Graph, in: International Semantic Web Conference, Springer, 2020, pp. 344–357.
[6] P. Ristoski, J. Rosati, T. Di Noia, R. De Leone and H. Paulheim, RDF2Vec: RDF graph embeddings and their applications, Semantic Web 10(4) (2019), 721–752.
[7] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston and O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Advances in Neural Information Processing Systems, 2013, pp. 2787–2795.
[8] Y. Dai, S. Wang, N.N. Xiong and W. Guo, A Survey on Knowledge Graph Embedding: Approaches, Applications and Benchmarks, Electronics 9(5) (2020), 750.
[9] A. Rossi, D. Firmani, A. Matinata, P. Merialdo and D. Barbosa, Knowledge Graph Embedding for Link Prediction: A Comparative Analysis, arXiv preprint arXiv:2002.00819 (2020).
[10] S. Ji, S. Pan, E. Cambria, P. Marttinen and P.S. Yu, A survey on knowledge graphs: Representation, acquisition and applications, arXiv preprint arXiv:2002.00388 (2020).
[11] G.A. Gesese, R. Biswas, M. Alam and H. Sack, A Survey on Knowledge Graph Embeddings with Literals: Which model links better Literal-ly?, arXiv preprint arXiv:1910.12507 (2019).
[12] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury and M. Gamon, Representing text for joint embedding of text and knowledge bases, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1499–1509.
[13] T. Dettmers, P. Minervini, P. Stenetorp and S. Riedel, Convolutional 2D knowledge graph embeddings, arXiv preprint arXiv:1707.01476 (2017).
[14] B. Shi and T. Weninger, Open-world knowledge graph completion, arXiv preprint arXiv:1711.03438 (2017).
[15] M. Cochez, P. Ristoski, S.P. Ponzetto and H. Paulheim, Global RDF vector space embeddings, in: International Semantic Web Conference, Springer, 2017, pp. 190–207.

[16] M.A. Pellegrino, A. Altabba, M. Garofalo, P. Ristoski and M. Cochez, GEval: A Modular and Extensible Evaluation Framework for Graph Embedding Techniques, in: European Semantic Web Conference, Springer, 2020, pp. 565–582.
[17] J. Chen, X. Chen, I. Horrocks, E.B. Myklebust and E. Jimenez-Ruiz, Correcting Knowledge Base Assertions, in: Proceedings of The Web Conference 2020, 2020, pp. 1537–1547.
[18] A. Zouaq and F. Martel, What is the schema of your knowledge graph? Leveraging knowledge graph embeddings and clustering for expressive taxonomy learning, in: Proceedings of The International Workshop on Semantic Big Data, 2020, pp. 1–6.
[19] P. Ristoski and H. Paulheim, Semantic Web in data mining and knowledge discovery: A comprehensive survey, Journal of Web Semantics 36 (2016), 1–22.
[20] P. Ristoski and H. Paulheim, A comparison of propositionalization strategies for creating features from linked open data, Linked Data for Knowledge Discovery 6 (2014).
[21] T. Mikolov, K. Chen, G. Corrado and J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[22] N. Lavrač, B. Škrlj and M. Robnik-Šikonja, Propositionalization and Embeddings: Two Sides of the Same Coin, arXiv preprint arXiv:2006.04410 (2020).
[23] P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Pearson Education India, 2016.
[24] G. Vandewiele, B. Steenwinckel, M. Weyns, P. Bonte, F. Ongenae and F. De Turck, pyRDF2Vec: A Python library for RDF2Vec, 2020, https://github.com/IBCNServices/pyRDF2Vec.
[25] R. Sofronova, R. Biswas, M. Alam and H. Sack, Entity Typing based on RDF2Vec using Supervised and Unsupervised Methods.
[26] M. Kejriwal and P. Szekely, Supervised typing of big graphs using semantic embeddings, in: Proceedings of The International Workshop on Semantic Big Data, 2017, pp. 1–6.
[27] M. Ali, M. Berrendorf, C.T. Hoyt, L. Vermue, S. Sharifzadeh, V. Tresp and J. Lehmann, PyKEEN 1.0: A Python Library for Training and Evaluating Knowledge Graph Embeddings, arXiv preprint arXiv:2007.14175 (2020).
[28] P. Ristoski, G.K.D. De Vries and H. Paulheim, A collection of benchmark datasets for systematic evaluations of machine learning on the Semantic Web, in: International Semantic Web Conference, Springer, 2016, pp. 186–194.
[29] M.D. Lee, B. Pincombe and M. Welsh, An empirical evaluation of models of text document similarity, in: Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 27, 2005.
[30] J. Hoffart, S. Seufert, D.B. Nguyen, M. Theobald and G. Weikum, KORE: Keyphrase overlap relatedness for entity disambiguation, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, 2012, pp. 545–554.
[31] D. Zheng, X. Song, C. Ma, Z. Tan, Z. Ye, J. Dong, H. Xiong, Z. Zhang and G. Karypis, DGL-KE: Training Knowledge Graph Embeddings at Scale, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 739–748.
[32] J. Portisch, M. Hladik and H. Paulheim, KGvec2go – Knowledge Graph Embeddings as a Service, arXiv preprint arXiv:2003.05809 (2020).
[33] Y. Lin, Z. Liu, M. Sun, Y. Liu and X. Zhu, Learning entity and relation embeddings for knowledge graph completion, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29, 2015.
[34] Z. Sun, Z.-H. Deng, J.-Y. Nie and J. Tang, RotatE: Knowledge graph embedding by relational rotation in complex space, arXiv preprint arXiv:1902.10197 (2019).
[35] B. Yang, W.-t. Yih, X. He, J. Gao and L. Deng, Embedding entities and relations for learning and inference in knowledge bases, arXiv preprint arXiv:1412.6575 (2014).
[36] M. Nickel, V. Tresp and H.-P. Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, 2011.
[37] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier and G. Bouchard, Complex embeddings for simple link prediction, in: International Conference on Machine Learning, PMLR, 2016, pp. 2071–2080.