XML-Based RDF Data Management for Efficient Query Processing
Total Page:16
File Type:pdf, Size:1020Kb
XML-Based RDF Data Management for Efficient Query Processing Mo Zhou Yuqing Wu Indiana University, USA Indiana University, USA [email protected] [email protected] Abstract SPARQL[17] is a W3C recommended RDF query language. A The Semantic Web, which represents a web of knowledge, offers SPARQL query contains a collection of triples with variables called new opportunities to search for knowledge and information. To simple access patterns which form graph patterns for describing harvest such search power requires robust and scalable data repos- query requirements. For SELECT ?t example, the SPARQL itories that can store RDF data and support efficient evaluation of WHERE {?p type Person. ?r ?x ?t. SPARQL queries. Most of the existing RDF storage techniques rely ?p name ?n. ?p write ?r} query on the left re- on relation model and relational database technologies for these trieves all properties of tasks. They either keep the RDF data as triples, or decompose it reviews written by a person whose name is known. into multiple relations. The mis-match between the graph model The needs to develop applications on the Semantic Web and sup- of the RDF data and the rigid 2D tables of relational model jeop- port search in RDF graphs call for RDF repositories to be reliable, ardizes the scalability of such repositories and frequently renders a robust and efficient in answering SPARQL queries. As in the con- repository inefficient for some types of data and queries. We pro- text of RDB and XML, the selection of storage models is critical to pose to decompose RDF graph into a forest of semantically cor- a data repository as it is the dominating factor to determine how to related XML trees, store them in an XML repository and rewrite evaluate queries and how the system behaves when it scales up. SPARQL queries into XPath/XQuery queries to be evaluated in the 1.1 Related Works XML repository. In this paper, we discuss the basic idea of RDF- Most of the existing RDF data repositories [1, 8, 9, 19] rely on to-XML decomposition and the criteria of such decomposition in relational models for data storage and evaluate SPARQL queries term of correctness, redundancy and query efficiency, then propose by rewriting them into SQL queries and then executing them in two RDF-to-XML decomposition algorithms based on these crite- the RDB engine. Among them there are two major directions: ria. Our experimental evaluation results illustrate that our approach (1) keeping the simple triple data model of RDF data, e.g. triple is capable of improving both the storage efficiency and query pro- store [8, 19]; and (2) decomposing RDF triples into relations, either cessing efficiency compared to the existing RDF techniques. based on predicates, e.g. vertical partition or based on semantics, e.g. property table [9]. 1. INTRODUCTION 1HV888 :`7 The triple store does not scale well as the evaluation of a com- V6 J:IV plex SPARQL query invokes many self-joins. Various indexing RDF (Resource Description Frame- ` 1`1 V ] 7]V 7]V techniques [13, 15, 16, 21] were proposed as remedies, at the cost work) [14] is a W3C recommended V01V1 `: V `: V V`QJ 7]V 1`1 V 7]V of huge increase in storage space and decrease in the scalability language for describing linked data ` ] of the Semantic Web in the form of `: V and update efficiency. The vertical partition [1] works well for V6 J:IV triples. RDFS (RDF schema) [7] ex- QQR888 :7 SPARQL queries when all predicates in the WHERE clause are tends RDF vocabularies to define the ^:_R: : known. Otherwise, all tables have to be accessed and results unioned. 1`1 V For example, the RDF data in Fig. 1(a) are stored in five tables. All structure of underlying RDF data, as V01V1 V`QJ well as taxonomic hierarchies of con- V6 `: V J:IV need to be accessed to evaluate the SPARQL query above. The cepts and relations. Both RDF data `1J$ ^G_H.VI: `1J$ property table incurs small number of joins for some queries be- and schema can be represented as Figure 1: Example cause a selection in one property table can match multiple simple graphs, as shown in Fig. 1. The flex- RDF data/schema access patterns. However it suffers storage redundancy and large ibility, simplicity and expressiveness of the Semantic Web evoke overhead in query evaluation [1]. For instance, the RDF data in extensive exploitation of RDF/RDFS in many areas such as bio- Fig. 1 are stored in two tables, Person and Review. Both are ac- informatics and social networks. cessed to evaluate the SPARQL query above because of the predi- cate variable ?x and variable ?r whose class cannot be determined in the time of query-rewrite. An alternative approach [3] preserves the graph nature of RDF Permission to make digital or hard copies of all or part of this work for data by storing RDF graphs in an object-relational database. How- personal or classroom use is granted without fee provided that copies are ever, this separates the RDF schema and RDF primary data, which not made or distributed for profit or commercial advantage and that copies brings difficulties in evaluating queries containing both schema and bear this notice and the full citation on the first page. To copy otherwise, to data instances. republish, to post on servers or to redistribute to lists, requires prior specific The proposal of serializing RDF graph into XML trees to utilize permission and/or a fee. WebDB ’10 Indianapolis, IN USA existing XML technologies [4, 10, 20] focuses on representing all Copyright 2010 ACM 978-1-4503-0186-2/10/06 ...$10.00. RDF features such as blank node in XML, but pays less or no at- tention to the efficiency of RDF data storage and query evaluation. ² We define the RDF-to-XML decomposition and identify the It either leads to XML data [20] with large redundancy or flat XML criteria for measuring the goodness of a decomposition. data [10] that cannot take full advantage of XML query evaluation ² We propose two RDF-to-XML algorithms for decomposing techniques. For example, [10] would serialize the RDF graph in RDF data into XML documents, with the second algorithm Fig. 1 into a 3-level xml tree and the evaluation of our example taking workload into consideration to further improve the query requires eight structural joins and one value join. query performance. 1.2 Our Proposal ² We discusse the query rewrite strategies for rewriting SPARQL We summarize the desired properties of an RDF storage model queries into XPath/XQuery queries. to be as follows: (1) preserving semantics to facilitate efficient ² We report on experiments to better understand the storage evaluation of RDF queries; (2) high performance in evaluating all footprint of our approach, as well as the query performance SPARQL queries rather than only some types of SPARQL queries; for various types of SPARQL queries. (3) high scalability powered by no or small storage redundancy; and (4) small overhead in query rewrite and query optimization. 2. RDF-TO-XML DECOMPOSITION Semi-structured data model organizes data entries in a tree struc- 2.1 RDF data and queries ture and represents the semantic relationships among them via con- tainment relationships. Tree pattern matching is at the core of the We represent RDF data in the form of a node and edge labeled query languages for XML, e.g. XPath[12] and XQuery[11]. We graph: R = (VR; EdR; ¸R) where ¸R is a labeling function that observe the similarity between RDF and XML, in terms of data maps items in VR [ EdR into a finite set of labels and literals. representation (e.g. using links to represent relationships among We further distinguish RDF schema Rs from RDF data Rd, with R = (V ; Ed ; ¸ ) representing the class nodes and proper- data instances) and query (e.g. tree pattern matching in XML and s Rs Rs R ties between classes, and R = (V ; Ed ; ¸ ) representing the graph pattern matching in RDF), and propose to leverage the so- d Rd Rd R phisticated storage management and query evaluation techniques data instances and properties between instances. In particular, in of XML data repositories, such as [2, 6], to store and query RDF Rs, we call labels associated with nodes the class labels such as data. Specifically we propose to de- Person and Review in Fig. 1(b), and labels associated with edges R: :LH.VI: compose RDF data into XML doc- the predicate labels such as Write and Rate in Fig. 1(b). Please :JR %II:`7 1J`Q8 _%V`7 note that V [ V = V , but Ed [ Ed 6= Ed as there uments guided by RDF schema and Rs Rd R Rs Rd R exist edges between class nodes and instance nodes in R indicating :]]1J$ summary information of RDF data ex- H.VI:LR: : %V`7 `V1`1 V VHQI]Q1 1QJ the types of the instances. Q`@CQ:R tracted from the RDF data, store the A SPARQL [17] query is frequently represented as graph pat- RQH : .L%V`7 _%V`7 XML documents in an XML reposi- terns that are formed by connecting a collection of simple access %V`7 !0:C8 tory and rewrite SPARQL queries into `V]Q 1 Q`7 patterns in the query’s WHERE clause. The connection between `V%C XPath/XQuery queries to be evaluated against the XML data using the lat- two simple access patterns with the same variable on their sub- jects is usually referred to the SS join.