Semantic Web 0 (0) 1, IOS Press

Extracting Entity-specific Substructures for RDF Graph Embeddings

Muhammad Rizwan Saeed a,*, Charalampos Chelmis b and Viktor K. Prasanna a
a Ming Hsieh Department of Electrical Engineering, University of Southern California, CA, USA
E-mails: [email protected], [email protected]
b Department of Computer Science, University at Albany - SUNY, NY, USA
E-mail: [email protected]
* Corresponding author. E-mail: [email protected].

1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved

Abstract. Knowledge Graphs (KGs) have become useful sources of structured data for information retrieval and data analytics tasks. Enabling complex analytics, however, requires entities in KGs to be represented in a way that is suitable for Machine Learning tasks. Several approaches have recently been proposed for obtaining vector representations of KGs based on identifying and extracting relevant graph substructures using both uniform and biased random walks. However, such approaches lead to representations comprising mostly popular, instead of relevant, entities in the KG. In KGs in which different types of entities often exist (such as in Linked Open Data), a given target entity may have its own distinct set of most relevant nodes and edges. We propose specificity as an accurate measure for identifying the most relevant, entity-specific nodes and edges. We develop a scalable method based on bidirectional random walks to compute specificity. Our experimental evaluation results show that specificity-based biased random walks extract more meaningful (in terms of size and relevance) substructures compared to the state-of-the-art, and that the graph embeddings learned from the extracted substructures perform well against existing methods in common data mining tasks.

Keywords: Relevance Metrics, Graph Embedding, Linked Open Data, Data Mining, Recommender Systems, RDF, SPARQL, Semantic Web, DBpedia

1. Introduction

Knowledge Graphs (KGs), i.e., graph-structured knowledge bases, store information as entities and the relationships between them, often following some schema or ontology [1]. With the emergence of Linked Open Data [2], DBpedia [3], and Google Knowledge Graph¹, large-scale KGs have drawn much attention and have become important data sources for many data mining and other knowledge discovery tasks [4–10]. As such algorithms work with the propositional representation of data (i.e., feature vectors) [11], several adaptations of language modeling approaches such as word2vec [12] and GloVe [13] have been proposed for generating graph embeddings for entities in a KG. As a first step for such approaches, a representative subgraph for each target entity in the KG must be acquired.

Each entity (represented by a node) in a heterogeneous KG is surrounded by other entities (or nodes) connected by directed labeled edges. For a node representing a film, there can be a labeled edge director connecting it to a node representing the director of the film. Another labeled edge releaseYear may exist that connects the film to an integer literal, also represented by a node in the KG. If we want to extract a subgraph representing the film, ideally such relevant information (labeled edges and nodes) must be a part of the extracted subgraph. On the other hand, relationships linked to, say, the director, such as birthYear or deathYear, which are two hops from a film (e.g., Batman_1989 --director--> Tim_Burton --birthYear--> 1958), may not be useful or relevant for representing a film. Therefore, in order to extract a useful representation (as a subgraph) of a given entity, we first need to automatically determine the most relevant edges and nodes in its neighborhood. An extracted representation of any target entity² in a KG can only be considered representative if it includes only the most relevant nodes and edges w.r.t. the target entity. To accomplish this task, approaches based on biased random walks [14, 15] have been proposed. Such approaches use weighting schemes to make a particular set of edges and nodes more likely to be included in the extracted subgraphs than others. However, weighting schemes based on metrics such as frequency or PageRank [14, 16] tend to favor popular (or densely connected) nodes in the representative subgraphs of target entities at the expense of semantically more relevant nodes and edges.

In this paper, we focus on RDF³ KGs, which are encoded using the Resource Description Framework (RDF) syntax and constitute Linked Open Data [17, 18]. We assert that the representative subgraphs of different types of entities (e.g., book, film, drug, athlete) in RDF KGs may comprise distinct sets of relationships. Our objective is to automatically identify such relationships and use them to extract entity-specific representations. This is in contrast to the scenario where extracted representations are KG-specific because of the inclusion of popular nodes and edges, irrespective of their semantic relevance to the target entities.

The main contributions of this paper are as follows:
– We propose Specificity as an accurate measure for assigning weights to those semantic relationships which constitute the most intuitively relevant representations for a given set or type of entities.
– We provide a scalable method of computing specificity for semantic relationships of any depth in large-scale KGs.
– We show that specificity-based biased random walks enable more compact extraction of relevant subgraphs for target entities in a KG as compared to the state-of-the-art.

To demonstrate the usefulness of our specificity-based approach for real-world applications, we train a neural language model (Skip-Gram [19]) for generating graph embeddings from the extracted subgraphs and use the generated embeddings for the tasks of entity recommendation, regression, and classification for select entities in DBpedia. This paper considerably extends [20], in which we introduced the metric of specificity for the first time. We provide additional experiments to show that the vector embeddings can be used not only for entity recommendation but also for regression and classification tasks. We also propose a variation of specificity called SpecificityH, discussed in Section 4.2, which takes into account the hierarchy of classes in the schema ontology associated with a KG.

The rest of this paper is structured as follows. In Section 2, we provide a brief overview of related work. In Section 3, we provide the necessary background. In Section 4, we motivate and introduce the concept of specificity. In Section 5, we present a scalable method for computing specificity. In Section 6, we present results highlighting beneficial characteristics of specificity on DBpedia. In Section 7, we conclude with a summary and an outlook on future work.

¹ https://www.blog.google/products/search/introducing-knowledge-graph-things-not/
² Since entities are represented as nodes in a KG, we use entity and node interchangeably throughout the text.
³ https://www.w3.org/TR/rdf-concepts/

Fig. 1. Linked Open Data [Source: http://lod-cloud.net/]

2. Related Work

Linked Open Data (LOD) [21] is a massive KG, shown in Figure 1, in which each node is itself a Knowledge Graph. Each of these individual KGs comes from semantically annotating and integrating unstructured data and publishing it as structured data on the web using the principles of Linked Data or the Semantic Web [22–25]. The LOD cloud currently comprises 1205 KGs⁴. A few of the KGs that are part of LOD are Wikidata⁵, Freebase⁶, and DBpedia⁷ [26, 27].

Due to its open availability, heterogeneity, and cross-domain nature, LOD is increasingly becoming a valuable source of information in many data mining tasks. However, most data mining algorithms work with a propositional feature vector representation of the data [11]. Recently, graph embedding techniques have become a popular area of interest in the research community. An embedding maps an entire graph or individual nodes into a low-dimensional vector space, preserving as much information as possible related to its neighborhood [28–30].

Numerous techniques have been proposed for generating appropriate representations of KGs for knowledge discovery tasks. Graph kernel-based approaches simultaneously traverse the neighborhoods of a pair of entities in the graph to compute kernel functions based on metrics such as the ...

... embedding, preserving roles and community memberships of nodes. RDF2Vec [11], an extension of DeepWalk and Deep Graph Kernel, uses BFS-based random walks for extracting subgraphs from RDF graphs, which are converted into feature vectors using word2vec [12]. Random walk-based approaches such as RDF2Vec have been shown to outperform graph kernel-based approaches in terms of scalability and their suitability for ML tasks for large-scale KGs [11, 28]. The main limitation of approaches using uniform (or unbiased) random walks is the lack of control over the explored neighborhood, which can lead to the inclusion of less relevant nodes in the identified subgraphs of target entities. To address this challenge, approaches based on biased random walks have recently been proposed [3, 14, 20]. Such approaches use different weighting schemes for nodes and edges. The weights create the bias by making specific nodes or edges more likely to be visited during random walks. The work closest to ours is the biased RDF2Vec approach [14], which uses frequency-, degree-, and PageRank-based metrics for weighting schemes. Our proposed approach also uses biased random walks to extract entity representations. However, unlike [14], we use our proposed metric of specificity as an edge- and path-weighting scheme for biased random walks to identify the most relevant subgraphs for extracting entity representations in the KGs.