gStore: Answering SPARQL Queries via Subgraph Matching ∗

Lei Zou1, Jinghui Mo1, Lei Chen2, M. Tamer Özsu3, Dongyan Zhao1,4
1 Peking University, China; 2 Hong Kong University of Science and Technology, China; 3 University of Waterloo, Canada; 4 Key Laboratory of Computational Linguistics (PKU), Ministry of Education, China
{zoulei,mojinghui,zdy}@icst.pku.edu.cn, [email protected], [email protected]

∗ Lei Zou, Jinghui Mo and Dongyan Zhao were supported by NSFC under Grant No.61003009 and RFDP under Grant No. 20100001120029. Lei Chen's work was supported in part by RGC NSFC JOINT Grant under Project No. N HKUST61 2/09, and NSFC Grant No. 60736013 and 60803105. M. Tamer Özsu's work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 37th International Conference on Very Large Data Bases, August 29th - September 3rd 2011, Seattle, Washington. Proceedings of the VLDB Endowment, Vol. 4, No. 8. Copyright 2011 VLDB Endowment 2150-8097/11/05... $ 10.00.

ABSTRACT
Due to the increasing use of RDF data, efficient processing of SPARQL queries over RDF datasets has become an important issue. However, existing solutions suffer from two limitations: 1) they cannot answer SPARQL queries with wildcards in a scalable manner; and 2) they cannot handle frequent updates in RDF repositories efficiently. Thus, most of them have to reprocess the dataset from scratch. In this paper, we propose a graph-based approach to store and query RDF data. Rather than mapping RDF triples into a relational database as most existing methods do, we store RDF data as a large graph. A SPARQL query is then converted into a corresponding subgraph matching query. In order to speed up query processing, we develop a novel index, together with some effective pruning rules and efficient search algorithms. Our method can answer exact SPARQL queries and queries with wildcards in a uniform manner. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solution.

1. INTRODUCTION
The RDF (Resource Description Framework) data model was proposed for modeling Web objects as part of developing the semantic web. It has been used in various applications. For example, Yago and DBpedia extract facts from Wikipedia automatically and store them in RDF format to support structural queries over Wikipedia [19, 3]. Biologists also build RDF data collections, such as Bio2RDF (bio2rdf.org) and Uniprot RDF (dev.isb-sib.ch/projects/uniprot-rdf), for recording experimental data.

Generally speaking, RDF data can be represented as a collection of triples denoted as SPO (subject, property, object). A running example is given in Figure 1(a). Note that an RDF dataset can also be modeled as a graph (called RDF graph), as shown in Figure 1(b). In order to query RDF repositories, the SPARQL query language [16] has been proposed by W3C. For example, we can retrieve the names of individuals who were born on February 12, 1809 and died on April 15, 1865 from the RDF dataset by the following SPARQL query:

Q1: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> "1809-02-12". ?m <DiedOnDate> "1865-04-15". }

Although RDF data management has been studied in the past decade, most existing solutions do not scale to large RDF repositories and cannot answer complex queries efficiently. Recent studies have focused on scalable techniques for large RDF repositories (e.g. [2, 12, 13, 25, 22]). Although these existing RDF query engines, such as RDF-3x [12], Hexastore [22] and SW-store [1], are designed to address the scalability of SPARQL queries, they have some common limitations: (1) they cannot support SPARQL with wildcards in a scalable manner; and (2) it is very difficult for some existing systems to handle frequent updates in RDF repositories, forcing them to reprocess the dataset from scratch when there is an update. x-RDF-3x [15], the advanced version of the RDF-3x system, can support updates, but it still fails to support wildcard queries.

1.1 SPARQL Queries With Wildcards
In real applications, having full knowledge about a query object may not be practical; thus, it may not be possible to specify exact query criteria. For example, we may know that an important politician was born on February 12 and died on April 15, but we have no idea about his exact birth and death years. In this case, we have to perform a query with wildcards, as shown below:

Q2: Select ?name Where { ?m <hasName> ?name. ?m <BornOnDate> ?bd. ?m <DiedOnDate> ?dd. FILTER regex(str(?bd), "02-12"), regex(str(?dd), "04-15") }

Although there are techniques for supporting SPARQL queries with wildcards and for managing large RDF datasets, to the best of our knowledge, no technique exists to support both, i.e., the ability to execute SPARQL queries with wildcards in a scalable manner. Existing RDF storage systems, such as Jena [23], Yars2 [11] and Sesame 2.0 [5], cannot work well in large RDF datasets (such as the Yago dataset). SW-store [1], RDF-3x [12], x-RDF-3x [15] and Hexastore [22] are designed to address scalability; however, they can only support exact SPARQL queries, since they replace all literals (in RDF triples) by ids using a mapping dictionary.

1.2 Frequent Updates Over RDF Repositories
In some applications, RDF repositories are not static. For example, Yago and DBpedia datasets are continually expanding to include the newly extracted knowledge from Wikipedia. The RDF data in social networks, such as the FOAF project (foaf-project.org), are also frequently updated to represent the individuals' changing relationships. In order to support queries over such dynamic RDF datasets, query engines should be able to handle frequent updates without much maintenance overhead.

Figure 1: RDF Graph

(a) RDF Triples
Prefix: y = http://en.wikipedia.org/wiki/

Subject                     Property     Object
y:Abraham_Lincoln           hasName      "Abraham Lincoln"
y:Abraham_Lincoln           BornOnDate   "1809-02-12"
y:Abraham_Lincoln           DiedOnDate   "1865-04-15"
y:Abraham_Lincoln           DiedIn       y:Washington_D.C
y:Washington_D.C            hasName      "Washington D.C."
y:Washington_D.C            FoundYear    "1790"
y:Washington_D.C            rdf:type     y:city
y:United_States             hasName      "United States"
y:United_States             hasCapital   y:Washington_D.C
y:United_States             rdf:type     Country
y:Reese_Witherspoon         rdf:type     y:Actor
y:Reese_Witherspoon         BornOnDate   "1976-03-22"
y:Reese_Witherspoon         BornIn       y:New_Orleans,_Louisiana
y:Reese_Witherspoon         hasName      "Reese Witherspoon"
y:New_Orleans,_Louisiana    FoundYear    "1718"
y:New_Orleans,_Louisiana    rdf:type     y:city
y:New_Orleans,_Louisiana    locatedIn    y:United_States

(b) RDF Graph G
[Graph drawing not reproduced: the subjects and objects of the triples in (a) appear as vertices numbered 001 to 017, connected by directed edges labeled with the properties.]

1.3 Our Approach
In this work, we treat RDF datasets from a graph database perspective. A SPARQL query is transformed into a subgraph matching query over a large RDF graph. Specifically, we can model an RDF dataset (a collection of triples) as a labeled, directed multi-edge graph (RDF graph), where each vertex corresponds to a subject or an object. Each triple represents a directed edge from a subject to its corresponding object. Given a subject and an object, there may exist more than one property between them, that is, multiple edges may exist between two vertices. Consequently, an RDF graph is a multi-edge graph.

[...] like properties of the same entity; thus, they tend to contain stars as subqueries [12]. A star query refers to the query graph that is a star, formed by one central vertex and its neighbors.

Considering the three properties of an RDF graph, we propose a novel indexing schema to speed up query processing. Firstly, we store an RDF graph as a disk-based adjacency list table T. Then, for each entity or class vertex (Definition 2.1) in the RDF graph, according to its adjacent edge labels and neighbor vertex labels (Definition 2.1), we assign a bitstring as its vertex signature. In this way, an RDF graph is converted into a data signature graph G* (Definition 4.3). Then, we propose a novel index (called VS*-tree) over G*. At run time, we also encode all vertices of Q into vertex signatures, and then convert Q into its corresponding query signature graph Q*. Finding all matches of Q* over G* will lead to all candidate matches (denoted as CL) without any false negative. Note that we propose a novel filtering rule (Theorem 5.1) to reduce the search space in finding CL. Finally, according to CL, we can
