Semantic SPARQL Similarity Search Over RDF Knowledge Graphs
Weiguo Zheng1, Lei Zou1, Wei Peng1, Xifeng Yan2, Shaoxu Song3, Dongyan Zhao1
1Peking University, Beijing, China, 100080; 2University of California at Santa Barbara, California, USA, 93106; 3Tsinghua University, Beijing, China, 100084.
fzhengweiguo,zoulei,pengw,[email protected],[email protected],[email protected]

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 9, No. 11. Copyright 2016 VLDB Endowment 2150-8097/16/07.

ABSTRACT

RDF knowledge graphs have attracted increasing attention in recent years. However, due to the schema-free nature of RDF data, it is very difficult for users to have full knowledge of the underlying schema. Furthermore, the same kind of information can be represented in diverse graph fragments. Hence, it is a huge challenge to formulate complex SPARQL expressions by taking the union of all possible structures.

In this paper, we propose an effective framework to access the RDF repository even if users have no full knowledge of the underlying schema. Specifically, given a SPARQL query, the system returns as many answers as possible that match the query based on semantic similarity. We propose a systematic method to mine diverse semantically equivalent structure patterns. More importantly, by incorporating both structural and semantic similarities, we are the first to propose a novel similarity measure, semantic graph edit distance. To improve efficiency, we apply the semantic summary graph to summarize the knowledge graph, which supports both high-level pruning and drill-down pruning. We also devise an effective lower bound based on TA-style access to each of the candidate sets. Extensive experiments over real datasets confirm the effectiveness and efficiency of our approach.

1. INTRODUCTION

An RDF repository, which consists of a set of triples ⟨subject, predicate, object⟩, can be modeled as an RDF graph, where the vertices represent subjects and objects, and the labeled edges correspond to predicates. The rapidly growing RDF knowledge repositories, such as DBpedia, Yago and Freebase, increase the demand for managing graph data effectively and efficiently.

SPARQL, a structural query language proposed by W3C, is designed for querying RDF data. Since SPARQL queries can be represented as query graphs [28], a SPARQL query can be answered by performing graph pattern matching over RDF graphs [1]. Due to the "schema-free" nature of RDF data, different data contributors may adopt different schemas to describe the same real-world fact. Thus, if we want to find more answers to a question, complex SPARQL queries that contain multiple UNION operators are required. Clearly, it is very difficult for users (even professional users) to conceive complicated SPARQL queries that not only conform to the syntax but also account for the flexible underlying schemas. We illustrate the challenges with the following motivating example.

1.1 Motivating Example

Fig. 1 presents a piece of an RDF graph extracted from DBpedia. Assume that we want to find the cars that are produced in Germany. There are at least three different German cars, namely Porsche Cayenne, Mercedes Benz and BMW X6, which are stored in three different schemas in Fig. 1. To answer the query, we should issue the following SPARQL query, which is composed of three subqueries corresponding to the query graphs q1, q2, and q3 in Fig. 2(a).

    SELECT ?x WHERE {
      { ?x <type> Automobile. ?x <production> Germany. }
      UNION
      { ?x <type> Automobile. ?x <assembly> Germany. }
      UNION
      { ?x <type> Automobile. ?x <manufacturer> ?y.
        ?y <location> Germany. } }

Since different structural patterns may express the same semantic meaning, formulating a SPARQL query that considers all possible structures is not a trivial task. Although we can conceive the complete SPARQL query for "the cars produced in Germany", we cannot always use the same query pattern to answer other questions. For instance, in order to find all cars produced in Australia, we should add another SPARQL subquery "?x <type> AutomobileOfAustralia" to capture the complete answers.

Figure 1: An RDF knowledge graph.

Figure 2: Traditional method vs. our method. (a) Traditional method. (b) Our method.

To obtain more correct answers, the traditional method (as shown in Fig. 2(a)) requires users to have full knowledge of the schema of an RDF graph. In other words, users should not only know all the predicates in the knowledge base, but also be aware of the different structural expressions for identical semantic facts. This is even more difficult for open-domain knowledge graphs, such as DBpedia. Every coin has two sides. The "schema-free" nature of RDF facilitates dataset construction, but it inevitably makes querying the knowledge base inherently difficult.

The goal of this paper is to provide an effective way to access the RDF repository even if one has no full knowledge of the underlying schema. To this end, we provide an effective query model. Given an RDF graph G, a user just needs to write a SPARQL query following one possible schema to express his/her query intention. The system should return as many answers that semantically match the query as possible. Fig. 2(b) illustrates our framework of SPARQL similarity search.

For example, given the SPARQL query in Fig. 2(b) (it corresponds to q1), our system can find all cars that are produced in Germany. Graphs g1, g2, and g3 in Figs. 3(a), 3(b), and 3(c) are three of the matches based on the semantic similarity.

1.2 Limitations of Existing Approaches

Although much effort has been devoted to graph similarity search [25, 8, 2, 7, 22, 23], existing approaches suffer from various drawbacks.

Resorting to Structure Similarity. Several approaches have been proposed for the approximate subgraph query, but most of them focus on structure similarity, such as SAPPER [25], kGPM [2] and Ness [8]. SAPPER [25] investigates the problem of approximate subgraph search that allows some edges to be unmatched. It does not support vertex/edge label substitution. kGPM [2] proposes a graph pattern query, which allows a path to match an edge. However, it requires that the vertex/edge labels specified in the query graph q be matched exactly. Exploiting the neighborhood-based similarity measure, Ness [8] and NeMa [7] try to identify the top-k [...]

[...] vertex labels by a similarity function. However, it requires that q and its match share the same graph structure, including the edge label constraints. Recently, SLQ [23] presents a query engine that integrates a set of transformation functions, such as "synonym" and "distance". Although it can plug in the "distance" transformation (i.e., transforming an edge to a shortest path), more complicated structures (e.g., graphs) are hard to deal with. Furthermore, structural transformation (corresponding to "distance") and semantic transformation (corresponding to "ontology") are taken into consideration separately.

1.3 Challenges and Contributions

Challenge 1: Mining Diverse Structure Patterns with Equivalent Semantic Meanings. It is common for many subgraphs of a knowledge graph to convey the same semantic meaning even if they do not share an identical structure. For example, graphs g1, g2, and g3 in Fig. 3 are three different subgraphs extracted from the knowledge graph G in Fig. 1. Although they differ in graph structure, they share the same semantic meaning, i.e., the automobiles that are produced in Germany. Note that the task of mining diverse structural patterns with equivalent semantic meanings is different from schema mapping [13] and ontology alignment [16]. (1) Different inputs: Schema mapping and ontology alignment take two schemas/ontologies as inputs, but our input is a single knowledge graph. (2) Different outputs: Our task is to find sets of graph patterns that describe the same semantic meanings. However, schema mapping and ontology alignment aim to find the mapping between two elements from two schemas or two concepts from two ontologies.

In this paper, we propose an instance-driven approach to mine these semantically equivalent patterns. According to the mining results, we define three representative graph patterns of semantic equivalence, i.e., concept generation, edge redirection, and inductive inference, over the RDF knowledge graph.

Challenge 2: Measuring Semantic Similarity in a Uniform Manner.