gStore: A Graph-based SPARQL Query Engine
Lei Zou · M. Tamer Özsu · Lei Chen · Xuchuan Shen · Ruizhe Huang · Dongyan Zhao

Extended version of the paper "gStore: Answering SPARQL Queries via Subgraph Matching" that was presented at the 2011 VLDB Conference.

Lei Zou, Xuchuan Shen, Ruizhe Huang, Dongyan Zhao
Institute of Computer Science and Technology, Peking University, Beijing, China
Tel.: +86-10-82529643
E-mail: fzoulei,shenxuchuan,huangruizhe,[email protected]

M. Tamer Özsu
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
Tel.: +1-519-888-4043
E-mail: [email protected]

Lei Chen
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
Tel.: +852-23586980
E-mail: [email protected]

Abstract We address efficient processing of SPARQL queries over RDF datasets. The proposed techniques, incorporated into the gStore system, handle, in a uniform and scalable manner, SPARQL queries with wildcards and aggregate operators over dynamic RDF datasets. Our approach is graph-based. We store RDF data as a large graph, and also represent a SPARQL query as a query graph. Thus the query answering problem is converted into a subgraph matching problem. To achieve efficient and scalable query processing, we develop an index, together with effective pruning rules and efficient search algorithms. We propose techniques that use this infrastructure to answer aggregation queries. We also propose an effective maintenance algorithm to handle online updates over RDF repositories. Extensive experiments confirm the efficiency and effectiveness of our solutions.

1 Introduction

The RDF (Resource Description Framework) data model was proposed for modeling Web objects as part of developing the semantic web. Its use in various applications is increasing.

An RDF data set is a collection of (subject, property, object) triples denoted as ⟨s, p, o⟩. A running example is given in Figure 1a. In order to query RDF repositories, the SPARQL query language [23] has been proposed by W3C. An example query that retrieves the names of individuals who were born on February 12, 1809 and who died on April 15, 1865 can be specified by the following SPARQL query (Q1):

    SELECT ?name WHERE
    { ?m <hasName> ?name .
      ?m <bornOnDate> "1809-02-12" .
      ?m <diedOnDate> "1865-04-15" . }

Although RDF data management has been studied over the past decade, most early solutions do not scale to large RDF repositories and cannot answer complex queries efficiently. For example, early systems such as Jena [31], Yars2 [14] and Sesame 2.0 [6] do not work well over large RDF datasets. More recent works (e.g., [2,20,33]) as well as systems such as RDF-3x [19], x-RDF-3x [22], Hexastore [30] and SW-store [1] are designed to address scalability over large data sets. However, none of these address scalability along with the following real requirements of RDF applications:

– SPARQL queries with wildcards. Similar to their SQL and XPath counterparts, wildcard SPARQL queries enable users to specify more flexible query criteria in real-life applications where users may not have full knowledge about a query object.
  For example, we may know that a person was born in 1976 in a city that was founded in 1718, but we may not know the exact birth date. In this case, we have to perform a query with wildcards, as shown below (Q2):

    SELECT ?name WHERE
    { ?m <bornIn> ?city . ?m <hasName> ?name .
      ?m <bornOnDate> ?bd .
      ?city <foundingYear> "1718" .
      FILTER(regex(str(?bd), "1976")) }

– Dynamic RDF repositories. RDF repositories are not static, and are updated regularly. For example, the Yago and DBpedia datasets are continually expanding to include newly extracted knowledge from Wikipedia. The RDF data in social networks, such as the FOAF project (foaf-project.org), are also frequently updated to represent the individuals' changing relationships. In order to support queries over such dynamic RDF datasets, query engines should be able to handle frequent updates without much maintenance overhead.

– Aggregate SPARQL queries. Few existing works and SPARQL engines consider aggregate queries despite their real-life importance. A typical aggregate SPARQL query that groups all individuals by their titles, genders, and the founding year of their birth places, and reports the number of individuals in each group is shown below (Q3):

    SELECT ?t ?g ?y COUNT(?m) WHERE
    { ?m <bornIn> ?c . ?m <title> ?t . ?m <gender> ?g .
      ?c <foundingYear> ?y . }
    GROUP BY ?t ?g ?y

    Subject               Property      Object
    y:Abraham Lincoln     hasName       "Abraham Lincoln"
    y:Abraham Lincoln     bornOnDate    "1809-02-12"
    y:Abraham Lincoln     diedOnDate    "1865-04-15"
    y:Abraham Lincoln     bornIn        y:Hodgenville KY
    y:Abraham Lincoln     diedIn        y:Washington DC
    y:Abraham Lincoln     title         "President"
    y:Abraham Lincoln     gender        "Male"
    y:Washington DC       hasName       "Washington D.C."
    y:Washington DC       foundingYear  "1790"
    y:Hodgenville KY      hasName       "Hodgenville"
    y:United States       hasName       "United States"
    y:United States       hasCapital    y:Washington DC
    y:United States       foundingYear  "1776"
    y:Reese Witherspoon   bornOnDate    "1976-03-22"
    y:Reese Witherspoon   bornIn        y:New Orleans LA
    y:Reese Witherspoon   hasName       "Reese Witherspoon"
    y:Reese Witherspoon   gender        "Female"
    y:Reese Witherspoon   title         "Actress"
    y:New Orleans LA      foundingYear  "1718"
    y:New Orleans LA      locatedIn     y:United States
    y:Franklin Roosevelt  hasName       "Franklin D. Roosevelt"
    y:Franklin Roosevelt  bornIn        y:Hyde Park NY
    y:Franklin Roosevelt  title         "President"
    y:Franklin Roosevelt  gender        "Male"
    y:Hyde Park NY        foundingYear  "1810"
    y:Hyde Park NY        locatedIn     y:United States
    y:Marilyn Monroe      gender        "Female"
    y:Marilyn Monroe      hasName       "Marilyn Monroe"
    y:Marilyn Monroe      bornOnDate    "1926-07-01"
    y:Marilyn Monroe      diedOnDate    "1962-08-05"

    (a) RDF Triples (Prefix: y=http://en.wikipedia.org/wiki/)
    (b) RDF Graph G [graph drawing of the triples in (a); each vertex carries an ID from 001 to 032]

Fig. 1: RDF Graph

In this paper we describe gStore, which is a graph-based triple store that can answer the above discussed SPARQL queries over dynamic RDF data repositories. In this context, answering a query is transformed into a subgraph matching problem. Specifically, we model an RDF dataset (a collection of triples) as a labeled, directed multi-edge graph (RDF graph), where each vertex corresponds to a subject or an object. We also represent a given SPARQL query by a query graph, Q. Subgraph matching of the query graph Q over the RDF graph G provides the answer to the query.

For example, Figure 1b shows an RDF graph G corresponding to the RDF triples in Figure 1a. We formally define an RDF graph in Definition 1. Note that the numbers above the boxes in Figure 1b are not vertex labels, but vertex IDs that we introduce to simplify the description. The RDF graph does not have to be connected. A SPARQL query can also be represented as a directed labeled query graph Q (Definition 2). Figure 2 shows the query graph corresponding to the SPARQL query Q2. Usually, query graph Q is a connected graph. Otherwise, we can regard each connected component of Q as a separate query and evaluate them one by one.

    [query graph: ?m -hasName-> ?name, ?m -bornOnDate-> ?bd, ?m -bornIn-> ?city, ?city -foundingYear-> "1718"; ?bd is annotated with regex(str(?bd), "1976")]

Fig. 2: Query Graph of Q2

We develop novel indexing and graph matching techniques rather than using existing ones. This is because the characteristics of an RDF graph are considerably different from graphs typically considered in much of the graph database research. First, the size of an RDF graph (i.e., the number of vertices and edges) is larger than what is considered in typical graph databases by orders of magnitude. Second, the cardinality of vertex and edge labels in an RDF graph is much larger than that in traditional graph databases. For example, a typical dataset (i.e., the AIDS dataset) used in the existing graph database work [25,32] has 10,000 data graphs, each with an average number of 20 vertices and 25 edges. The total number of distinct vertex labels is 62. The total size of the dataset is about 5M bytes. However, the Yago RDF graph has about 500M vertices and the total size is about 3.1GB.

discuss the maintenance of indexes (VS*-tree and T-index) as RDF data get updated in Section 10. We study our methods by experiments in Section 11. Section 12 concludes this paper. Some of the additional material supporting the main findings reported in the paper is included in an Online Supplement.

2 Related Work

Three approaches have been proposed to store and query RDF data: one giant triples table, clustered property tables, and vertically partitioned tables.

One giant triples table. The systems in this category store RDF triples in a single three-column table where columns correspond to subject, property, and object (as in Figure 1a) enabling them to manipulate all RDF
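To make the running example concrete, the following minimal Python sketch stores a subset of the Figure 1a triples as a single three-column table and evaluates Q1, Q2 and Q3 by naive backtracking over their triple patterns. This is only an illustration of what the queries compute; it is not gStore's method, which relies on an index, pruning rules and subgraph matching. The names `TRIPLES`, `is_var` and `match` are hypothetical and do not come from the paper.

```python
import re
from collections import Counter

# Subset of the Figure 1a triples, kept as one (subject, property, object) table.
TRIPLES = [
    ("y:Abraham Lincoln", "hasName", "Abraham Lincoln"),
    ("y:Abraham Lincoln", "bornOnDate", "1809-02-12"),
    ("y:Abraham Lincoln", "diedOnDate", "1865-04-15"),
    ("y:Abraham Lincoln", "bornIn", "y:Hodgenville KY"),
    ("y:Abraham Lincoln", "title", "President"),
    ("y:Abraham Lincoln", "gender", "Male"),
    ("y:Reese Witherspoon", "hasName", "Reese Witherspoon"),
    ("y:Reese Witherspoon", "bornOnDate", "1976-03-22"),
    ("y:Reese Witherspoon", "bornIn", "y:New Orleans LA"),
    ("y:Reese Witherspoon", "title", "Actress"),
    ("y:Reese Witherspoon", "gender", "Female"),
    ("y:New Orleans LA", "foundingYear", "1718"),
    ("y:Franklin Roosevelt", "hasName", "Franklin D. Roosevelt"),
    ("y:Franklin Roosevelt", "bornIn", "y:Hyde Park NY"),
    ("y:Franklin Roosevelt", "title", "President"),
    ("y:Franklin Roosevelt", "gender", "Male"),
    ("y:Hyde Park NY", "foundingYear", "1810"),
    ("y:Marilyn Monroe", "hasName", "Marilyn Monroe"),
    ("y:Marilyn Monroe", "bornOnDate", "1926-07-01"),
]


def is_var(term):
    """A term starting with '?' is a query variable (assumed convention)."""
    return term.startswith("?")


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all triple patterns."""
    binding = {} if binding is None else binding
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or reject if it is already bound differently.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # All three positions agreed; continue with the remaining patterns.
            yield from match(rest, triples, candidate)


# Q1: exact birth and death dates.
Q1 = [("?m", "hasName", "?name"),
      ("?m", "bornOnDate", "1809-02-12"),
      ("?m", "diedOnDate", "1865-04-15")]
q1_names = [b["?name"] for b in match(Q1, TRIPLES)]

# Q2: wildcard on the birth date, applied as a regex filter on ?bd.
Q2 = [("?m", "bornIn", "?city"),
      ("?m", "hasName", "?name"),
      ("?m", "bornOnDate", "?bd"),
      ("?city", "foundingYear", "1718")]
q2_names = [b["?name"] for b in match(Q2, TRIPLES)
            if re.search("1976", b["?bd"])]

# Q3: group by (title, gender, founding year) and count the matches per group.
Q3 = [("?m", "bornIn", "?c"), ("?m", "title", "?t"),
      ("?m", "gender", "?g"), ("?c", "foundingYear", "?y")]
q3_counts = Counter((b["?t"], b["?g"], b["?y"]) for b in match(Q3, TRIPLES))

print(q1_names, q2_names, dict(q3_counts))
```

Each recursive call scans the whole table, so the cost grows quickly with the number of patterns and triples; that scan is exactly the work an index with pruning is meant to avoid.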