J Data Semant (2012) 1:147–185 DOI 10.1007/s13740-012-0008-7

ORIGINAL ARTICLE

Mapping between RDF and XML with XSPARQL

Stefan Bischof · Stefan Decker · Thomas Krennwallner · Nuno Lopes · Axel Polleres

Received: 2 December 2011 / Revised: 4 May 2012 / Accepted: 11 June 2012 / Published online: 4 September 2012 © Springer-Verlag 2012

Abstract One promise of applications is one format into the other. We focus on the semantics of this to seamlessly deal with heterogeneous data. The Extensi- combined language and present an implementation, includ- ble (XML) has become widely adopted ing discussion of query optimisations along with benchmark as an almost ubiquitous interchange format for data, along evaluation. with transformation languages like XSLT and XQuery to translate data from one XML format into another. However, Keywords Query processing · XML · RDF · SPARQL · the more recent Resource Description Framework (RDF) XQuery · XSPARQL has become another popular standard for data representation and exchange, supported by its own query language SPAR- QL, that enables extraction and transformation of RDF data. 1 Introduction Being able to work with XML and RDF using a common framework eliminates several unnecessary steps that are cur- XML [18] has become a well established and widely adopted rently required when handling both formats side by side. interchange format for data on the Web. Accompanying stan- In this paper we present the XSPARQL language that, by dards, such as XSL Transformations (XSLT) by Kay [42] combining XQuery and SPARQL, allows to query XML and and, more recently, XQuery by Chamberlin et al. [22], both RDF data using the same framework and transform data from based on the XML Path Language (XPath) [11], are often used to query XML data and convert between different XML representations. In the effort to convert the Web into a Semantic Web, the S. Bischof · A. Polleres Resource Description Framework (RDF) [46,39] has become Siemens AG Österreich, Siemensstrasse 90, 1210 Vienna, Austria the language of choice for modelling, interlinking, and merg- e-mail: [email protected] ing data. RDF data and applications that consume this data are A. Polleres becoming increasingly present on the Web. Opposed to the e-mail: [email protected] tree structure of XML, RDF structures data in sets of triples, representing edges of a directed, labelled graph. Querying S. Decker · N. Lopes (B) Digital Enterprise Research Institute (DERI), National University RDF graphs and converting between them can be performed of Ireland, Galway, Ireland using SPARQL [54], the W3C recommended query language e-mail: [email protected] for RDF. In many applications combining and converting between S. Decker · N. Lopes IDA Business Park, Lower Dangan, Galway, Ireland XML and RDF data is a useful but often not trivial task. The e-mail: [email protected] importance of this issue is acknowledged within the W3C, for instance in the working groups on Gleaning Resource T. Krennwallner Descriptions from Dialects of Languages (GRDDL) by Con- Institute of Information Systems, Vienna University of Technology, Favoritenstraße 9-11, 1040 Vienna, Austria nolly [24] and Semantic Annotations for WSDL (SAWS- e-mail: [email protected] DL) by Farrell and Lausen [29]. As we will show, common 123 148 S. Bischof et al. approaches for transformations between XML and RDF, it with a related system. Furthermore, we discuss the per- which rely on the standard XML serialisation of RDF by formance impact of the proposed optimisations. Beckett and McBride [9] and on XML technologies, e.g., XSLT, have several disadvantages. While both XQuery and The article is organised as follows: Sect. 2 will illustrate SPARQLlanguages operate on different data models, respec- our main motivation to come up with a new language by tively, the XQuery and XPath Data Model (XDM) [30]for discussing drawbacks of existing technologies for transfor- XML and RDF, we show that the merge of both query lan- mations between RDF and XML. In Sect. 3 we will briefly guages in the novel language XSPARQL has the potential to review the main characteristics of the XQuery and SPARQL finally bring XML and RDF closer together. XSPARQL pro- query languages and, in Sect. 4, present their combination in vides concise and intuitive solutions for mapping between the form of the XSPARQL language by defining the formal XML and RDF in either direction: operations where both semantics and showing semantic properties of the novel lan- XQuery and SPARQL struggle. In fact it is not possible to use guage. Section 5 shows the architecture and query rewriting SPARQL alone for such transformations since the SPARQL techniques for a prototype implementation. Section 6 dis- query language does not provide the possibility of handling cusses query optimisation techniques that speed up the eval- XML data. On the other side, the only way to work with RDF uation of XSPARQL queries. We compare XSPARQL with data within XQuery is by relying on the RDF/XML seriali- another prototype that combines SPARQL and XQuery in sation for RDF graphs. As we show in Sect. 2, this approach Sect. 7 and report on experimental results using the bench- is hard to implement due to the different possible serialisa- mark suite XMarkRDF. We also compare query response tions in RDF/XML for a single RDF graph. An additional use times of the presented optimisations, showing promising for XSPARQL is the conversion between RDF graphs. XSP- results. We conclude this work with a discussion of related ARQL extends SPARQL’s expressiveness for such transfor- works in Sect. 8 and wrap up in Sect. 9. mations, by allowing, for instance, nested XSPARQL queries in the graph construction step. Since its first version by Akhtar et al. [4], XSPARQL has 2 Motivation: Lifting and Lowering gained community interest and practical use cases have been presented in a W3C Member Submission [49].Basedonthese XML can be viewed as a tree-like data representation format, experiences, the present article makes the following main with intermediate nodes of this tree being XML elements or contributions: attribute names, and the leaf nodes being either empty ele- ments or textual attribute values and element content. The order of child nodes is relevant in XML. As opposed to this, – we present syntax and formal semantics of XSPARQL RDF data, i.e., an RDF graph, is an unordered set of subject– based on the XQuery Formal Semantics by Draper et al. predicate–object triples, as follows: [27]. In comparison with our initial publication, we improved the treatment of nested queries over RDF with Definition 1 (RDF Triple, RDF Graph) Given pairwise dis- respect to blank nodes and allow for assignment of RDF joint sets of URI references U, blank nodes B, and literals graphs to variables; L,1 atriple(s, p, o) ∈ UB × U × UBL (often written as a – our implementation of XSPARQL is based on rewriting an “statement” ‘spo.’) is called an RDF triple;setsofRDF XSPARQL query into a semantically equivalent XQue- triples are called RDF graphs. We call elements of UBL RDF ry query; as opposed to the preliminary version of this terms. rewriting by Akhtar et al. [4], in this paper we present a more tightly integrated, new prototype implementing sev- Besides the normative syntax to exchange RDF using eral new features; XML, RDF/XML [9], there are various serialisation formats – we prove various properties of XSPARQL and show for RDF, such as RDFa [2], a format that allows one to embed soundness and completeness of the new tighter query RDF within (X)HTML, or non-XML representations such rewriting; as the more human-readable [7] syntax. Since data – we present a set of optimisations for complex queries (con- in RDF may be considered on a higher level of abstrac- taining nested XSPARQL queries) and show their correct- tion than semi-structured XML data, the translation from ness; XML to RDF is often called lifting, while the opposite direc- – we introduce a novel benchmark suite (XMarkRDF) that tion is called lowering. The importance of converting data extends the XMark XML Benchmark suite by also con- between the XML and RDF formats has been acknowledged sidering RDF as a data format; and – based on the XMarkRDF suite, we present benchmark 1 For brevity we will denote the concatenation of sets by concatenating evaluation of the new XSPARQL prototype and compare their names, e.g., U ∪ B is represented as UB. 123 Mapping between RDF and XML with XSPARQL 149

(a)

Fig. 1 From XML to RDF and back: “lifting” and “lowering” within the W3C in several related standardisation efforts, such as GRDDL and SAWSDL. The GRDDL working group addressed the issue of extracting RDF data out of exist- ing (X)HTML Web pages (lifting). Likewise, in the Seman- (b) tic Web Services community, the SAWSDL working group aimed at defining mechanisms (and link them in Web service descriptions) to generate XML messages sent to Web services from RDF data (lowering) and vice versa extract RDF from service result messages in XML (lifting) [see 29,44]. Both GRDDL and SAWSDL use XSLT (although they acknowl- edge that other mechanisms could be used) in their examples to perform lifting and lowering. In the following, let us illus- trate some drawbacks of this approach. As a running example throughout this paper we use a map- ping between a custom XML format and RDF as shown in Fig. 1 (using Turtle syntax for illustration). The task is, in (c) both directions, to extract for all persons the names of people they know. URIs denoting predicates and terms in a partic- ular domain are typically collected under a common name- space in RDF with a designated prefix, such as RDF core terms in the namespace http://www.w3.org/1999/02/22-rdf- syntax-ns# using prefix rdf: or terms of the FOAF [20] ontology in the namespace http://xmlns.com/foaf/0.1/ using prefix :.2 Blank nodes are represented in Turtle by the prefix ‘_:’ followed by an identifier/label, or by square brackets ‘[]’. Blank nodes play a special role in RDF’s data model: they allow to model unknown nodes or incomplete data, akin to existential variables. Regarding the serialisation in Turtle that (d) means, if we would replace _:b1 in Fig. 1 by _:x, it would Fig. 2 Different representations of the same RDF graph represent an equivalent RDF graph. RDF/XML [9] is the recommended syntax for RDF, using XML as the underlying representation model. This format enables the use of XML tools such as XSLT or XQuery to translate between RDF/XML and other XML formats. How- the same FOAF data. Figure 2a uses Turtle [7], a simple and ever, such a transformation is greatly complicated by the flex- readable textual format for RDF, inaccessible to pure XML ibility the RDF/XML format offers in serialising RDF graphs. processing tools though; the remaining three versions are all Therefore, tools that handle RDF/XML as XML data (and not RDF/XML, ranging from concise (2b) to verbose (2d). These as a sets of triples) need to take different possible representa- three RDF/XML variants represent different XML trees but tions into account. Figure 2 shows four versions of a subset the same RDF graph. Note that blank node identifiers may of the RDF data from our running example that represent disappear or change through XML serialisation. For our running example, let us attempt lifting and lower- 2 In listings and figures we sometimes abbreviate well-known name- ing transformations using XSLT (we will get to XQuery as space URIs with “…”. another alternative in more detail later on). 123 150 S. Bischof et al.

(a) Fig. 4 Lowering using XSLT (lowering.xsl)

Fig. 5 RDF data using the relationship ontology (b)

Fig. 3 Lifting attempt by XSLT its interplay with ontological information, e.g., RDF Schema. RDF Schema [19] (RDFS) allows to express subclass or sub- property hierarchies, which can be exploited by RDF tools Lifting. The XSLT stylesheet in Fig. 3a for instance, could capable of ontological inference. The RDF data from Fig. 1 be used to generate RDF/XML (in the format presented in could—rather than foaf:knows—use predicates from the Fig. 2b) from the relations. file in Fig. 1. However, this relationship ontology,3 which are all stated as subproperties first attempt does not yet accomplish the intended transforma- of foaf:knows, as presented in Fig. 5. Similar consider- tion since unique identifiers are not created for each person. ations would apply if we attempted to perform the lifting This is easy to see in the Turtle version of the result of this and lowering using XQuery: since XML tools do not support transformation (presented in Fig. 3b): while in our example ontological inference, we literally would need to implement names should uniquely identify a person, in this transforma- an RDFS inference engine within XSLT or XQuery, to be tion the same person is potentially given several different able to implement a lowering mechanism that also works for blank nodes. this kind of RDF data. Given the availability of RDF tools Although a proper lifting transformation catering for all and engines that readily offer RDFS support, this seems to possible XML serialisations is doable in XSLT, the corre- be a dispensable exercise. sponding stylesheet would need to be far more involved. Benefits of an integrated language. In recognition of the Lowering. The simple XSLT stylesheet lowering.xsl in above problems, the SAWSDL specification contains a non- Fig. 4b is an attempt to perform the lowering task directly normative example which performs a lowering transforma- from RDF/XML. However, this XSLT will break if the input tion as a sequence of a SPARQL query followed by an XSLT RDF/XML serialisation is in any other variant than the ver- transformation on SPARQL’s query results XML format [23]. sion in Fig. 2b. We could create a specific stylesheet for The advantage of such a two-step approach is first that since each of the presented variants, but creating one that han- SPARQL works on the RDF data model, all the input data dles all the possible RDF/XML forms would be much more from Fig. 2 are considered to be equivalent. Second, if one complicated. deploys a SPARQLengine that supports RDFS inference also Apart from its syntactic ambiguities, processing RDF/ XML via XSLT also loses another feature of RDF, namely 3 http://vocab.org/relationship/. 123 Mapping between RDF and XML with XSPARQL 151 input data that involves ontologically related RDF vocabu- laries could be dealt with. For example, to get all persons whoworkwith(rel:worksWith)orhavemet(rel:has- Met) Charles from the FOAF data described in Fig. 5,asim- ple SPARQL query would be enough: select $person from (a) where { $person foaf:knows [ foaf:name "Charles " ].}

Although the approach proposed by the SAWSDL Work- ing Group provides a good starting point, we argue that it can still be improved on several points: first, the detour through SPARQL’s XML query results format seems to be an unnecessary burden. Second, a more tightly coupled inte- (b) gration of SPARQL and XML query languages can provide a more expressive language, beyond the capabilities of using Fig. 6 An overview of XQuery and SPARQL SPARQL and XSLT or XQuery sequentially, and directly amenable to query optimisations. XSPARQL, the language proposed in the present paper, aims to provide exactly this: use cases that otherwise would require interleaved calls to SPARQL (typically requiring an implementation using an external programming framework) can be solved in XSP- ARQL directly, cf. the lowering example in Fig. 10. More- over, as we will see, the combined language not only allows for concise lifting and lowering, but also may be viewed as an extension of SPARQL for RDF-to-RDF transformations, cf. the example in Fig. 8b below. Before we turn to these examples and XSPARQL in more detail, let us give a short overview of the languages XSPARQL builds on: XQuery and SPARQL.

3 Preliminaries

XQuery allows for a convenient and concise syntax for XML Fig. 7 Lifting using XQuery query processing and XML transformation, while SPARQL is the standard for RDF querying and transformation. One of the major differences between XQuery and SPARQL resides 3.1 XQuery on the ordering of their respective data models: while XML (and hence XQuery) is an intrinsically ordered data model, XQuery consists mainly of so-called FLWOR expressions, RDF is an unordered data model. As such, necessary mecha- denoting the body (FLWO) and the head (R) of a query. The nisms must be in place to ensure XQuery respects the order- ForClauses(F) can be used to declare variables that iterate ing of the input. Queries in each of the two languages can over XML sequences, returned, e.g., by an XPath expres- roughly be divided in two parts: (i) the retrieval part (body) sion, while let assignments (L) allow to bind values, e.g., and (ii) the result construction part (head): this is presented the entire result of an XPath expression, to variables. A filter schematically in Fig. 6. Our goal is to combine these compo- condition on the current variable bindings or processing order nents for both languages in a unified language, XSPARQL, of results within a ForClause can be specified in the where where XQuery’s and SPARQL’s heads and bodies may be part (W) and by the order by clause (O), respectively. In the used interchangeably and even nested. We next outline some head (R) arbitrary well-formed XML, nested XQuery expres- of the main aspects of XQuery and SPARQL relevant to their sions, or previously assigned variables are allowed follow- combination into XSPARQL. For a more detailed overview ing the return keyword. Together with a large catalogue of of XQuery and SPARQL we refer the reader to [22,27] and built-in functions [45], XQuery offers a flexible instrument to [51,54]. for arbitrary XML transformations. 123 152 S. Bischof et al.

The lifting task of Fig. 1 can be solved with XQuery as holds the information available during static analysis, for shown in Fig. 7.4 Please note that, due to the nature of XQue- example the varType component holds variable type infor- ry, in this query we are generating RDF/XML, opposed to the mation, while the dynEnv environment contains information more concise Turtle syntax from Fig. 1. The resulting query available during expression evaluation, like the value for vari- is quite involved, but completely addresses the lifting task, ables that is stored in the varValue component. We refer to including unique blank node generation for each person. We the static environment of C as statEnv(C) and to the dynamic first select, in variable $persons (line 3), all nodes repre- environment as dynEnv(C) and we can access the different senting person names: either person or knows nodes and components by name: statEnv(C) .varType and the specific next, for each different name, we keep a sequence with all value for element var of the a context can be accessed using the distinct person names (stored in variable $positions), statEnv(C) .varType(var). In case the expression context C which we will use as the blank node identifier for the person. is not explicitly presented, statEnv and dynEnv can be used Iterating over these distinct person names, we determine the in place of statEnv(C) and dynEnv(C). person identifier (line 10). The nested for (lines 14–23) again Normalisation rules are represented using mapping rules iterates over persons in order to create nested foaf:knows and, as an example, we present the following rule from elements: for each person name ($n) from the outer for Draper et al. [27] that illustrates the normalisation of con- expression, this nested expression selects the XML nodes secutive ForClauses into XQuery Core: that correspond to persons which $n knows (line 14) and cre-   ates the corresponding foaf:knows elements (lines 18–20). for $VarName1 OptTypeDeclaration1    OptPosVar in Expr  While this is a valid solution for lifting, we still observe the  1 1   , ··· ,  following drawbacks: (1) We still have to build RDF/XML   $VarNamen OptTypeDeclarationn in manually and cannot make use of the more readable and con- OptPosVarn Exprn ReturnClause Expr cise Turtle syntax; and (2) if we had to apply XQuery for the == (N1) lowering task, we still would need to cater for all kinds of for $VarName1 OptTypeDeclaration1  different RDF/XML representations. Thus we still face the OptPosVar1 in Expr1 Expr return ··· same problems as discussed in the XSLT solution in Sect. 2. for $VarNamen OptTypeDeclarationn However, both these drawbacks will be alleviated by adding   OptPosVarn in Exprn Expr ReturnClause Expr SPARQL to XQuery. By combining XQuery and SPARQL, XSPARQL also simplifies the lifting process by allowing to In normalisation rules, fixed-width font (like for) refers to use SPARQL ConstructClauses that generate RDF in Turtle specific keywords, and italic font refers to productions in the format and by performing automatic validation of the gener- XQuery Core grammar [27, Appendix A]. Static type rules, ated RDF graphs. and dynamic evaluation rules are represented using infer- Semantics. Next, let us give a short overview of the XQue- ence rules. For instance, the following static typing rule from ry Formal Semantics [27], on which we will base XSP- Draper et al. [27] ensures that no expression has empty type ARQL’s semantics; it is defined essentially via three types except the empty sequence and functions in the fs namespace of rules: (i) normalisation rules, (ii) static typing rules, that are applied to empty parentheses (): and (iii) dynamic evaluation rules. Normalisation rules are used to rewrite arbitrary XQuery expression to the XQuery statEnv  Expr : Type  <: Core language—a subset of XQuery that, while semanti- statEnv Type empty Expr is the empty parentheses () or fn:data() (S1) cally equivalent, aims to be easier to define, implement, and not or any fs function applied to empty parentheses () optimise [41]. Static typing rules areusedtoassignatype A static type error is raised for expression Expr to each XQuery expression, while the dynamic evaluation rules are responsible for producing the resulting XML from The judgements statEnv  Expr : Type and statEnv  Type <: each expression and guaranteeing that the expression input empty hold when, in the static environment statEnv, both is consistent with the typing information determined during Expr has type Type and Type is a subtype of empty, respec- the static analysis step. Any XQuery expression E is eval- tively. For all details of the XQuery Semantics we refer the uated with regard to an expression context C that holds the reader to [27]. static environment (statEnv) and the dynamic environment Typing Fernández et al. [30] describe the XQuery and (dynEnv) up until the evaluation of E. Environments are XPath Data Model (XDM) that is used to define the input to composed of different components and hold information nec- XQuery and the values of any XPath expression. Draper et al. essary to the evaluation of any XQuery expression: statEnv [27, Section 2.4] describe the formal notation for types that we use throughout this paper. This representation of types is 4 We assume this query is executed with the context item set externally used for specification purposes only and is not exposed to to the document node of relations.xml file presented in Fig. 1. the end user by XQuery. As [27, Section 8.3.1] describe, it 123 Mapping between RDF and XML with XSPARQL 153 is possible to match a Value against a specific Type by using the judgement Value matches Type.

3.2 SPARQL (a) In analogy to FLWOR in XQuery, we will denote its cor- respondent “DWMC expressions” in SPARQL. The body (DWM) offers the following features: a dataset (D), i.e., the set of (named) source RDF graphs, is specified in from (or from named) clauses. The where part (W) allows match- ing parts of the dataset by specifying a graph pattern. Such patterns can be simple triple patterns possibly involving (b) variables, URI references, and literals,5 or unions of graph patterns, optional patterns matching of parts of a graph, or patterns matching of named graphs, etc. Fig. 8 RDF-to-RDF mappings in SPARQL and in XSPARQL

Definition 2 (Graph Patterns, [51]) Let V be an infinite set specified in consecutive from clauses can be merged trans- of variables, graph patterns are inductively defined as fol- parently in SPARQL, however for XML tools this involves lows: renaming of blank nodes at the pure XML level. Solution sequences can be ordered or sliced using solution modifi- – a tuple (s, p, o) ∈ ULV×UV×ULV, called triple pattern, ers (M) order by, limit, and offset. is a graph pattern; In the head, SPARQL’s construct clause (C) offers con- – a set of triple patterns, called Basic Graph Pattern (BGP), venient and XML-independent means to create an output is a graph pattern; RDF graph. A construct template consists of a list of triple   –ifP and P are graph patterns, then (PP), (P optional patterns in Turtle syntax. By instantiating this template with   P ), and (P union P ) are graph patterns; the variable bindings computed in the body, a result graph –ifP is a graph pattern and i ∈ UV, then (graph iP) is a is created, which enables SPARQL to be used as a trans- graph pattern; and formation language between different RDF formats (simi- –ifP is a graph pattern and R is a filter expression, lar to XSLT and XQuery for transforming between XML then (P filter R) is a graph pattern. formats). A simple example for mapping full names from the vCard/RDF [40]formattofoaf:name is given by the SPARQL query in Fig. 8. Blank nodes in construct tem- ( ) For any pattern P, we write vars P for the set of all plates—as used in the query in Fig. 8b—play a special role, in variables occurring in P.Afilter expression R can be com- that they are replaced by a fresh blank node for each solution ULV posed from constants, elements of , comparison opera- sequence in the result graph. = < > ≤ ≥ ¬ ∧ tors (‘ ’,‘ ’, ‘ ’,‘ ’,‘ ’) and logical connectives (‘ ’,‘ ’, Other possible types of SPARQL queries include select, ∨ 6 ‘ ’), and built-in functions. ask, and describe queries: select queries simply return the bindings for variables present in the query (instead The evaluation semantics of SPARQL consists of com- of using these bindings to instantiate the template like a puting a sequence of solution mappings, i.e., sets of bindings construct clause), ask queries return a boolean answer, for the variables in these patterns, matching them against indicating whether the graph pattern produces any results, the graphs in the dataset. Sequences of solution mappings as and describe queries are used to return information about referred to simply as solution sequences. a resource. For the aims of XSPARQL we support the SPARQL is agnostic to the actual XML representation of ConstructClause that allows to produce RDF and the select the underlying source graphs, which alleviates the pain of expression that allows us to input RDF data—although with having to deal with different RDF/XML representations of a different syntax as explained in Sect. 4.1. the graphs in the dataset. Also several RDF source graphs [39] Let us remark that SPARQL does not cater for the creation of new values, which on the contrary is an inherent feature of 5 Note that we do not allow blank nodes in graph patterns, and thus do XQuery. By combining XQuery and SPARQL, we are also not consider them in our definitions. This restriction does not affect the expressivity of SPARQL, implicit in [51], since blank nodes in query enabling SPARQL to use the full range of XPath/XQuery patterns can always be replaced equivalently with variables. See dis- built-in functions [45]. cussion in Sect. 4.2 below. Due to this, the query in Fig. 8 which attempts to merge 6 For a complete list of built-in functions we refer the reader to [54]. family names and given names into a single foaf:name by 123 154 S. Bischof et al. calling the fn:concat function is beyond SPARQL’s capa- ToMultiSet(1) × ToMultiSet(2) ,μ1 and μ2 are com- bilities. As we will see, XSPARQL will not only reuse SPAR- patible}), where by × we denote the Cartesian product of QL for transformations from and to RDF, but also enable such the multisets and ToList() is, as per the SPARQL specifica- advanced RDF-to-RDF transformations. tion [cf. 54, Section 12.4], an operation that turns a multiset Semantics The semantics of SPARQL is defined by means into a sequence with the same elements and arbitrary order- of evaluation rules which are presented by [54, Section 12.5]. ing. Analogously to the ToList() operation, ToMultiSet() con- Here we only give an overview of the notion of Basic Graph verts a sequence into a multiset by preserving duplicates but Pattern (BGP) matching that we will use later to define the disregarding the sequence ordering. semantics of XSPARQL. For further details on the SPARQL query language, we The matching of BGPs is done with regard to a specific refer the reader to the W3C specification [54]. RDF graph—the active graph—which is a graph contained in Next, we define the notion of inclusion of solution the dataset specified to the query. This matching is defined in sequences. terms of replacing variables from the BGP with RDF terms Definition 6 Let 1 and 2 be solution sequences. We present in the active graph, where the function that maps say 1 is included in 2, denoted 1 2, if for all solu- query variables to RDF terms is called a solution mapping. tion mappings μ1 ∈ ToMultiset(1) there exists a solution Definition 3 (Solution Mapping)Asolution mapping [see mapping μ2 ∈ ToMultiset(2) such that μ1 ⊆ μ2. 54, Section 12.1.6] is a partial function mapping SPARQL Please note that this definition extends the notion of sub- variables to RDF terms. The domain of a solution mapping μ, set between multisets by considering also the subset relation denoted dom(μ), is the set of variables for which μ is defined. between their elements, i.e., solution mappings. This def- Furthermore, we denote the value of variable v ∈ V accord- inition will be required for the optimisations presented in ing to solution μ as μ(v). Two solution mappings μ and 1 Sect. 6. Since the presented optimisations are not order pre- μ are compatible if for any v ∈ dom(μ ) ∩ dom(μ ) it 2 1 2 serving we rely only on the notion of inclusion. holds that μ (v) = μ (v).Theunion of two compatible 1 2 In a construct query, the solutions of the pattern in the mappings μ and μ consists of the standard set-theoretical 1 2 where clause of the body are then used to instantiate the con- union μ ∪ μ . 1 2 struct template and the result graph is obtained from the The replacement of variables in a graph pattern according union of all valid RDF triples resulting from such instanti- to a solution mapping is defined next. ation. As mentioned before, for each solution, blank nodes Definition 4 Let P be a graph pattern and μ be a solution occurring in a construct template are replaced by new blank mapping. The variable substitution of P by μ, denoted μ(P), nodes with new identifiers. is the graph pattern P with all variables v ∈ vars(P) ∩ Apart from construct queries, which we mainly focus dom(μ) substituted by μ(v). on here, SPARQL also allows select queries, which return sequences of variable bindings, obtained from projecting Finally, the definition of BGP matching from [54, Sec- only solution mappings for a given list of variables. tion 12.3] specifies the solutions to a query. Definition 5 (Basic Graph Pattern Matching)Wesayμ is a solution for a BGP P with respect to the active graph G, 4 XSPARQL if there exists a solution mapping μ such that μ(P) is a subgraph of G, and μ is the restriction of μ to the variables Conceptually, XSPARQL is a merge of SPARQL construct in vars(P). queries into XQuery. This combination of languages allows us to benefit from the facilities of SPARQL for retrieving The definition of BGP matching is extended to more com- RDF data and to use Turtle-like syntax for constructing RDF plex SPARQL query patterns (including union, optional, graphs, while still having access to all the features from graph filter , , etc.) by the SPARQL algebra [54, Sec- XQuery for XML processing. In XSPARQL we allow any where tion 12.4], such that the clause of every SPARQL native XQuery query and we extend XQuery’s FLWOR query—i.e., any DWM body—returns a list of solutions. expressions to what we call FLWOR’ expressions: We denote this evaluation of a SPARQL Graph Pattern P over a dataset D as eval(D, P). As presented by [51], the (i) In the body we allow SPARQL-style F’DWM blocks evaluation of a SPARQL graph pattern can be specified alternatively to XQuery’s FLWO blocks. The new F’ by mapping the graph pattern to relational algebra oper- clause of the form forvarlist is very similar to XQue- ators. Since Perez et al. deal with set-based semantics of ry’s native ForClause, but instead of allowing a single SPARQL, here we extend their notion of the join opera- variable (which is assigned to the results of an XPath tor to solution sequences. Let 1 and 2 be two solution expression), the new clause supports a white space-sep- sequences; then 1 2 = ToList({μ1 ∪ μ2 | (μ1,μ2) ∈ arated list of variables (varlist). Each variable in varlist 123 Mapping between RDF and XML with XSPARQL 155

Fig. 11 Schematic view of XSPARQL

Fig. 9 LiftinginXSPARQL modified grammar productions with the prime symbol (). We introduce two new productions: SparqlForClause and Con- structClause, corresponding to roughly to SPARQL select queries and construct templates; we present the grammar productions for these in Fig. 12. The full XSPARQL gram- mar can be found in [15]. In these grammar productions, the WhereClause and SolutionModifier correspond, respectively, to rules [13] and [14] from the SPARQL grammar, cf. [54, Appendix A.8]. The newly introduced SparqlForClause(rule [33a]) is sim- ilar to an XQuery for clause that can be used to iterate over SPARQL results.7 This expression stands at the same level as Fig. 10 Lowering using XSPARQL XQuery’s for and let expressions, i.e., such type of clauses are allowed to start new FLWOR’ expressions, or may occur is then assigned the value resulting from evaluating a inside deeply nested XSPARQL queries. SPARQL query of the form: selectvarlist DWM. The ConstructTemplate’ expression is defined in the same (ii) In the head we allow to create RDF graphs directly way as the production ConstructTemplate in SPARQL [54], using construct templates (C) alternatively to XQue- but we additionally allow nested XSPARQL expressions ry’s native return (R). (FLWORExpr’) in subject, predicate, and object positions; (iii) Different forms of nesting are allowed, for example we achieve this by replacing SPARQL syntax rules Verb subqueries that construct RDF graphs may appear in and VarOrTerm for ConstructTemplate with the rules VarOr- let assignments which are later used in SPARQL-style Term’ and Verb’ represented in Fig. 12b.8 from clauses, or can be used for value construction The rules for SourceSelector from the SPARQL syntax within SPARQL-style construct templates. are also extended (as presented in Fig. 12c), i.e., we allow for graphs in a SPARQL dataset to be specified by a variable These modifications allow us to reformulate the lifting which must evaluate to a URI. queryof Fig. 7on page 151 into its slightly more concise XSP- In analogy to SPARQL’s select * shortcut, we allow to ARQL version of Fig. 9. The real power of XSPARQL in our write for * in place of for [list of all unbound variables example becomes apparent on the lowering part, where all of appearing in the WhereClause]forSparqlForClauses; as the other languages observed so far struggled. The lowering syntactic sugar this is also the default value for the F’ clause query for our running example is shown in Fig. 10. whenever a SPARQL-style WhereClause is found and a cor- responding F’ clause is missing. Please note that for Sparql- 4.1 Syntax of XSPARQL ForClauses we do not allow XQuery QNames as variable

In more detail, the XSPARQL syntax is an extension of the 7 This expression does not have the exact semantics of a SPARQL grammar rules in XQuery [22]. Figure 11 shows a schema select clause—returning bindings to variables—but rather adds new of our merge of XQuery and SPARQL. For the definition of variables to the query; hence a syntax inspired by the existing XQuery the XSPARQL syntax, we assume to inherit all the grammar ForClauses was chosen. productions of SPARQL [54] and XQuery [22] and mark any 8 These changes are highlighted in Fig. 12 by using bold face. 123 156 S. Bischof et al.

(a) XSPARQL core syntax elements, extending [22, Appendix A]

(b) Modified ConstructTemplate syntax elements, extending [54, Appendix A]

(c) Modified DatasetClause syntax elements, extending [54, Appendix A]

Fig. 12 Overview of XSPARQL syntax names (further details are available in [27, Section 3.1.1.1]) and assume that only unprefixed variables are shared between the XQuery and SPARQL expressions of XSPARQL. By this treatment, XSPARQL becomes a syntactic superset of native SPARQL construct queries, since we additionally allow the following:

(1) XQuery and SPARQL namespace declarations (P)may be used interchangeably; and (2) SPARQL-style construct result forms (C) may appear before the retrieval part for queries. This feature is mainly added in order to encompass SPARQL style queries, but in principle, we expect the (R/C) parts to appear in the end of a FLWOR’ expression.

Thus,thequeryofFig. 8aonpage153oranyotherSPARQL construct queries remain valid syntax for XSPARQL.

4.2 Semantics of XSPARQL

Next we define the semantics of XSPARQL. After introduc- ing some new types, used in the semantics, and an extension Fig. 13 XSPARQL Type Definitions to the normalisation rules of XQuery ForClauses, we will turn to extending the notion of Basic Graph Pattern matching (Sect. 4.2) to make SPARQL clauses aware of the bindings for (2) the PatternSolution type consists of a set of pairs variables from XQuery. Then, we present the semantics of the (variableName, RDFTerm) representing SPARQL vari- newly introduced expressions: SparqlForClause(Sect. 4.2) able bindings; and ConstructClause (Sect. 4.2), based on XQuery’s formal (3) the RDFGraph is the type of construct expressions; and semantics [27], by defining normalisation, static type and (4) the RDFDataset as the type for DatasetClauses. dynamic evaluation rules for each of the new expressions. XSPARQL Types. We extend the XQuery and XPath Data The formal definition of (1)–(4) is given in Fig. 13.The Model (XDM), described by Fernández et al. [30], with the RDFTerm type is used to represent RDF terms (composed of following new types that accommodate for SPARQL specific URIs, blank nodes or literals). The type of SPARQLvariables parts of XSPARQL: are represented by the Binding type, that consists of the var- iable name and the RDF term that is assigned to it. Finally, (1) the RDFTerm type further consists of the subtypes uri, sequences of SPARQL variable bindings are represented by bnode and literal and is used as the type of SPARQL the type PatternSolution. This representation of SPARQL variables; results is similar to the XML Schema of the SPARQL Query 123 Mapping between RDF and XML with XSPARQL 157

Results XML Format, available at http://www.w3.org/2007/ ARQL ConstructTemplates, we will use position variables9 SPARQL/result.xsd. from XQuery in ForClauses to generate these new blank The RDFGraph type corresponds to a sequence of RDFTri- node identifiers, i.e., we introduce position variables in any ples which are in turn a complex type composed of subject, XQuery for expressions without position variables and also predicate and object.TheRDFDataset type is defined to make sure that XSPARQL SparqlForClauseexpressions as an RDFGraph that is considered the default graph and a have position variables. To handle the XQuery for expres- sequence of RDFNamedGraphs represented by the name of sion, we change the normalisation rule of for expressions to the graph and the corresponding RDFGraph. XQuery Core for expressions (cf. Sect. 3.1): The following definition presents the translation between   for $VarName1 OptTypeDeclaration1 a SPARQL solution sequence and a sequence of Result type    OptPositionalVar in Expr ,  elements that we implement in XSPARQL.  1 1   ··· ,     $VarNamen OptTypeDeclaration   n  Definition 7 (Serialisation of Solution Sequences)Givena OptPositionalVarn in Exprn = (μ ,...,μ ) solution sequence 1 n a serialisation of ReturnClause Expr == into a sequence of PatternSolution is defined as follows: (N5) for $VarName1 OptTypeDeclaration1 – serialise() ⇒ serialise(μ1) ,...,serialise(μn)   OptPositionalVar1 PosVar in Expr1 Expr return – serialise(μ) ⇒ {∀x ∈ dom(μ) , serialise(μ, x)} ··· for $VarNamen OptTypeDeclarationn OptPositionalVar in Expr n PosVar n Expr ReturnClause – serialise(μ, x) ⇒ {term(μ(x))} , Expr · A new normalisation rule PosVar takes care of introduc- where term(μ(x)) is ing new positional variables where necessary. We assume that the introduced position variables are distinct from any – μ(x) if μ(x) ∈ U of the variables in scope, represented by the formal seman- – μ(x) if μ(x) ∈ B tics variable $fs:new [cf. 27, Section 4.12.6]:  == – μ(x) if μ(x) ∈ L PosVar at $fs:new. In case a positional variable is already present  == Following the definition of the serialise function, in evalua- it is reused: at $PosVar PosVar at $PosVar . tion rules, we will refer to sequences of elements of type Pat- We also assume a new static environment component ternSolution as and to elements of type Result as μ. statEnv.posVars which consists of a sequence holding all Query Prolog Normalisation As stated previously, positional variables in the given static environment, that is, XQuery and SPARQL namespace declarations can be used the variables defined in the at clause of enclosing for expres- interchangeably in the query prolog. Hence, we convert any sions. The static type rules for the for expression [cf. 27, SPARQL syntax prefix declaration to XQuery namespace Section 4.8.2] need to be extended accordingly to store these declarations by the following normalisation rules: positional variables, similar to the rules for SparqlForClauses  in Sect. 4.2 below. prefix NCName: Expr == (N2) XSPARQL BGP Matching In this section we extend the  declare namespace NCName = URILiteral ; Expr notion of Basic Graph Pattern (BGP) matching described by [54, Section 12.3], to provide SPARQL with the variable  prefix : Expr bindings from XQuery. For this we rely on the XQuery var- == (N3) Value dynamic environment component that maps variable  declare default namespace = URILiteral ; Expr names to their value and consider this environment com- ponent as defining a set of bindings in the spirit of SPAR-  base Expr QL solution mappings (as presented in Definition 3). Along == (N4) these lines, we will consider the varValue component of the declare base-uri URILiteral ; Expr dynamic environment in which a SPARQL graph pattern P is executed the basis for the XSPARQL instance mapping of XQuery for normalisation In accordance with the P. The transformation from the dynEnv.varValue into the SPARQL semantics, blank nodes in ConstructTemplates XSPARQL instance mapping is defined next: need to be distinctly instantiated for any solution map- ping matching the body, i.e., for every solution for the 9 Position variables are variables that appear in an XQuery ForClauses WhereClause a new blank node identifier needs to be cre- after the optional at keyword—cf. Fig. 6aonpage151—and bind to an ated in the resulting graph. To ensure this behaviour in XSP- integer indicating the current position in the for-expression. 123 158 S. Bischof et al.

Definition 8 (XSPARQL instance mapping)LetC be an G but also respect any bindings for variables in the dynamic expression context, and DC = dynEnv(C) .varValue be environment. We can extend the XSPARQL BGP matching the varValue and TC = statEnv(C) .varType be the var- to generic graph patterns by following the SPARQL eval- Type component of the static environment of C, respectively. uation semantics, as described by [54, Section 12.4]. Con- The XSPARQL instance mapping μC is a solution mapping sidering a graph pattern P and μC the XSPARQL instance v → ∈ ( , ,μ ) where, for each mapping i xi DC , xi is converted into mapping of P, we similarly denote by evalxs D P C the an instance of type RDFTerm or an RDF Collection according evaluation of P over dataset D following XSPARQL BGP to the following conditions: matching. Matching blank nodes in nested queries As for the –ifxi = () and TC .varType(vi ) = RDFTerm then μC(xi ) is undefined; handling of explicit DatasetClauses we briefly review the scoping graph concept from SPARQL’s semantics, presented –ifxi = () and TC .varType(vi ) = RDFTerm then μC(xi ) = () is an empty RDF Collection; in [54, Section 12]. Query solutions are taken from the scop- ing graph, a graph that is equivalent to the active graph –ifxi is a singleton sequence, then μC(xi ) = RDFTerm(xi ); but does not share any blank nodes with it or any graph –ifxi = (e1,...,en), n > 1, is a sequence then μC(xi ) = pattern within the query. Although in XSPARQL we are (RDFTerm(e1) ···RDFTerm(en)) to be read as an RDF Collection [46, Section 4.2] in Turtle notation [see 7, Sec- not considering blank nodes in graph patterns, in the pres- tion 3.5]; ence of nested SparqlForClauses XSPARQL instance map- pings may in fact contain assignments of variables to blank ( ) where RDFTerm xi is nodes, injected from the outer SparqlForClause into the inner SparqlForClause. For example, in Fig. 10 on page – xi if TC .varType(vi ) = RDFTerm, 155, blank nodes bound in the outer SparqlForClause to the – "xi " if TC .varType(vi ) = xsd:string, variable $Person will be injected into the inner SparqlFor- – "xi "ˆˆrdf:XMLLiteral if TC .varType(vi ) = Clauseexpression. In XSPARQL—as opposed to SPARQL element(), patterns—such injected bnodes will be matched like con- – "data(xi )" if TC .varType(vi ) = attribute(), and stants against the blank nodes from the data, to enable core- – "xi "ˆˆTC .varType(vi ), otherwise. ference within nested queries over the same dataset. Toensure this behaviour, we introduce the notion of active dataset; For a graph pattern P, we call the XSPARQL instance map- nested queries over the same active dataset keep the same ping of the expression context in which P is executed the the scoping graphs. Any SparqlForClause with an explicit XSPARQL instance mapping of P. DatasetClause causes the active dataset to change, i.e., new scoping graphs (with fresh blank nodes) for each graph Next we define the notion of XSPARQLBGP matching based within it are created; if no DatasetClause is present in a on the semantics of SPARQL BGP matching presented in nested SparqlForClause (implicit dataset), the active data- Sect. 3.2. set remains unchanged. Definition 9 (Extended solution mapping)LetC be an We introduce another auxiliary function in the XSP- : ( ) expression context. An extended solution mapping of a graph ARQL semantics, fs dataset DatasetClause , which returns pattern P in C is a solution mapping compatible with the an element of type RDFDataset based on the evaluation XSPARQL instance mapping of C. of its argument. This conversion is performed accord- ing to the SPARQL semantics presented in Sect. 3.2 and XSPARQL BGP matching is defined analogously to the detailed in [54]. The static type signature of this function SPARQL BGP matching with the exception that we consider is only extended solution mappings: fs:dataset($datasetClause as xs:string) Definition 10 (XSPARQL BGP matching)LetP be a basic as RDFDataset graph pattern, C be the expression context of P, and G be We allow the SourceSelector of a DatasetClause to an RDF graph. We say that μ is a solution for P with respect be specified by an element of type uri or RDFGraph.Ele- to active graph G, if there exists an extended solution map- ments of the type uri in the position of a graph will ping μ of C such that μ(P) is a subgraph of G and μ is the be mapped to graphs where the uri is used as its name. restriction of μ to the variables in vars(P). XSPARQL—just like the SPARQL specification—leaves This definition quasi injects the variable bindings inherited the exact mapping of URIs to graphs open to particular from XQuery into SPARQL patterns occurring within XSP- implementations, but for the rest of this paper, we assume ARQL; by considering extended solution mappings the bind- obtaining the RDF graph just by dereferencing the URI via ings returned for a BGP P will not only match the input graph HTTP. 123 Mapping between RDF and XML with XSPARQL 159

SparqlForClause and XQuery ForClauses The seman- type checking rules for standard XQuery [27, Section 4.8.2], tics of SparqlForClause (Rule [33a], Fig. 12a) is defined to populate the XSPARQL specific new static environment by the following normalisation rules, static type analysis component statEnv.posVars:10 rules, and dynamic evaluation rules. We will also need statEnv.posVars = (PosVar , ··· , PosVar ) to slightly adapt the static analysis rules for regular For- 1 n statEnv  Expr1 : Type1 Clauses (in order to properly deal with the extra position statEnv  VarName of var expands to Var variables introduced by rule (N5)). We start with normal- statEnv  VarNamepos of var expands to Varpos isation rules for SparqlForClauses with implicit variable statEnv + posVars(PosVar 1, ··· , PosVar n, Varpos) ⇒ selection (by means of “for *”) and with explicitly stated Var prime(Type1); (S3) + varType ⇒ : variables: Var pos xs integer  ExprSingle : Type for at for * OptDatasetClause WhereClause $VarName $VarNamepos  in return : SolutionModifier return ExprSingle statEnv Expr1 ExprSingle Expr Type · quantifier(Type ) == 1   (N6)  Dynamic Evaluation For the dynamic evaluation we have for WhereClause vars  OptDatasetClause WhereClause  to introduce a new dynamic environment component called return SolutionModifier ExprSingle Expr activeDataset that will be used to evaluate WhereClauses.   Initially, this component is empty (or set to a system default) The normalisation rule WhereClause vars determines all statically unbound variables present in the WhereClause, and is changed by a DatasetClause appearing in a SparqlFor- i.e., returns a whitespace separated list of all variables in Clause. We further introduce two auxiliary functions fs:value the WhereClause that are not present in the statEnv.varType and fs:.  environment component. The next normalisation rule intro- fs:value The fs:value $PS, $var function returns the duces a new position variable, analogously to the before- value of the specified SPARQL variable $var in a Pattern · Solution specified by $PS.If$var is not bound in $PS,the mentioned XQuery for normalisation rule, where PosVar is as described above: empty sequence is returned. This function is defined as   fs:value($ps as PatternSolution, for $Var Name 1 ...$Var Name n $variable as xs:string)   OptDatasetClause WhereClause as RDFTerm? SolutionModifier return ExprSingle Expr : : == (N7) fs sparql The fs sparql function corresponds to the ...  adapted version of the eval function, the evalxs function, that for $Var Name 1 $Var Name n PosVar OptDatasetClause WhereClause evaluates graph patterns implementing the extended notion  return ExprSingle Expr of BGP Matching (cf. Definition 10). Static type analysis The following static rule takes care The static type signature of this function is defined as of defining the types of variables present in a for expres- fs:sparql($dataset as RDFDataset, sion as RDFTerm, adds the introduced position variables to $SparqlWhere as xs:string, statEnv.posVars, and determines the static type of the Sparql- $solutionModifiers as xs:string) as PatternSolution* ForClause expression: The parameters of the eval function correspond to the statEnv.posVars = (PosVar1, ··· , PosVarn) xs statEnv  PosVarName of var expands to PosVar dataset $dataset, the SPARQL algebra expression gener- statEnv + posVars(⎛PosVar1, ··· , PosVarn, PosVar⎞) ated from the graph pattern $SparqlWhere and PosVar ⇒ xs:integer; $solutionModifiers, as described by [cf. 54, Section ⎝ ⎠ + varType Var 1 ⇒ RDFTerm; 12.2.3], and the XSPARQL instance mapping μC that is ··· ;Var n ⇒ RDFTerm (S2)  ExprSingle : Type derived from the expression context C over which the fs:sparql function is evaluated. The result of eval con- for $Var 1 ···$Var n at PosVar Name xs DatasetClause WhereClause sists of a solution sequence which, as a result of applying statEnv  SolutionModifier the serialise function (cf. Definition 7), can be translated : ∗ return ExprSingle Type directly into an XQuery sequence of XML elements of type PatternSolution. Please note that, since the variables included in a Sparql- We can now define the dynamic evaluation rules for the ForClause are not allowed to contain a namespace prefix, SparqlForClause expression. Intuitively, these rules state we omitted the rules handling the namespace expansion for the respective variables. The static type rule for a SparqlFor- 10 We show here only the adapted rule for ForClauses with position Clausewithout an explicit DatasetClause is analogous. Like- variables without type declaration, the rule handling both position vari- wise, note that we need to slightly adapt the standard static ables and type declarations is adapted analogously. 123 160 S. Bischof et al. that the return expression ExprSingle will be executed for Analogously to the SparqlForClause with an explicit data- each PatternSolution that is returned from the evalua- set, whenever the fs:sparql function evaluates to an empty tion of the fs:sparql function. The following two dynamic sequence, the result will also be an empty sequence. rules specify the evaluation of the SparqlForClause with an ConstructClause We now define the semantics of the explicit DatasetClause: ConstructClause (Rule [33b], Fig. 12a) by means of nor-  : ( ) ⇒ malisation rules. SPARQL stand-alone construct queries dynEnv fsdataset DatasetClause Dataset Dataset, WhereClause, (as described in Sect. 4.1) are normalised into construct dynEnv  fs:sparql ⇒ μ ,...,μ SolutionModifier 1 m queries with a surrounding ForClause: dynEnv + activeDataset(Dataset)   ⎛ ⎞ construct ConstructTemplate PosVar ⇒ 1;    ⎜ ⇒ : μ , ; ⎟ DatasetClause WhereClause ⎜ Var 1 fs value 1 Var 1 ⎟ +varValue⎝ ... ; ⎠ SolutionModifier Expr  == (N8) Var ⇒ fs:value μ , Var   n 1 n for ∗ DatasetClause  ExprSingle ⇒ Value 1  WhereClause SolutionModifier  . construct  . (D1) ConstructTemplate Expr dynEnv + activeDataset⎛ (Dataset) ⎞ The resulting query will be further rewritten according PosVar ⇒ n;  to aforementioned normalisation rule (N6). As introduced in ⎜ ⇒ : μ , ; ⎟ ⎜ Var 1 fs value m Var 1 ⎟ Sect. 4.1, we allow nested XSPARQL expressions in subject, +varValue⎝ ... ; ⎠  predicate, and object positions of ConstructTemplate’. These Var ⇒ fs:value μ , Var n m n {Expr}  ExprSingle ⇒ Value m nested expressions are identified by the shortcuts , <{Expr}> _:{Expr} for $Var 1 ···$Var n at $PosVar , and , that construct elements of DatasetClause WhereClause type literal, uri, and bnode, respectively. dynEnv  SolutionModifier return Similar to the normalisation rule for stand-alone Return- ⇒ ,..., ExprSingle Value 1 Value m Clauses presented in [27, Section 4.8.1], the following nor- This rule ensures that the activeDataset component of malisation rule transforms construct clauses into XQuery the dynamic environment is updated to reflect the explicit return ExprSingle s. DatasetClause of the SparqlForClause. If the evaluation of construct ConstructTemplate the fs:sparql function does not yield any solutions, i.e., eval- Expr  == (N9) uates to an empty sequence, the overall result will also be the :  return fs evalCT ConstructTemplate normCT empty sequence: . ⇒ In the following we assume that ConstructTemplate’isa dynEnv activeDataset Dataset Dataset, WhereClause, simple "." separated list of Subject, Predicate, and Object. dynEnv  fs:sparql ⇒ () · SolutionModifier (D2) The normCT rule transforms any Turtle shortcut notation for $Var 1 ···$Var n at $PosVar used in ConstructTemplate’ to these simple lists. As an exam- dynEnv  DatasetClause WhereClause ⇒ () ple of this rule, we present the rule for normalising Turtle “;” SolutionModifier return ExprSingle abbreviations [cf. 7, Section 2.3]: The rule that handles the SparqlForClause without an  ; ...; Subject Pred1 Obj1 Predn Objn normCT explicit DatasetClause is presented next: == . ⇒ ... dynEnv activeDataset Dataset Subject Pred1 Obj1 Subject Predn Objn Dataset, WhereClause, dynEnv  fs:sparql ⇒ μ ,...,μ (N10) SolutionModifier 1 m ⎛ ⎞ PosVar ⇒ 1;  ⎜ ⎟ The normalisation rules for the other Turtle shortcuts ⎜ Var 1 ⇒ fs:value μ , Var 1 ; ⎟ dynEnv + varValue ⎝ 1 ⎠ that are allowed in the SPARQL ConstructTemplate’ syn- ...;  ⇒ : μ , tax are similar to this one and are not presented here. Since Var n fs value 1 Var n  ExprSingle ⇒ Value 1 anonymous blank nodes can be written in numerous ways · . (D3) in Turtle, the normCT normalisation rule transforms each ⎛ . ⎞ anonymous blank node into a labelled blank node where the PosVar ⇒ n;  ⎜ ⎟ identifier/label is distinct from any other blank node labels ⎜ Var 1 ⇒ fs:value μ , Var 1 ; ⎟ dynEnv + varValue ⎝ m ⎠ ...;  present in the ConstructTemplate’. Take, as an example, the ⇒ : μ , Var n fs value m Var n ConstructTemplate in Fig. 8b on page 153. It is normalised as  ⇒ ExprSingle Value m { _:b foaf:name { for $Var 1 ···$Var n at $PosVar fn:concat($N, " ", $F) dynEnv  WhereClause SolutionModifier }. return ExprSingle ⇒ Value 1,...,Value m } 123 Mapping between RDF and XML with XSPARQL 161

fs:evalCT The fs:evalCT function is a new built-in func- In case any of the subject, predicate or object do not match tion that ensures the created RDF graph is valid and rewrites an allowed type, the empty sequence is returned. Effectively any blank nodes inside of ConstructTemplates to comply with this suppresses any invalid RDF triples from the output graph. the SPARQL semantics (as described in Sect. 4.2). The aux- iliary fs:validTriple function checks if each triple is valid according to the RDF semantics and is defined by rules (D5) dynEnv  fs:bnode(Subject) ⇒ ValueS  ⇒ and (D6). The static type signatures of these functions are dynEnv Predicate ValueP dynEnv  fs:bnode Object ⇒ ValueO defined as ⎛ ⎞ ValueS matches (uri | bnode) and (D6) dynEnv not⎝ValueP matches uri and ⎠ fs:evalCT($template as RDFTerm*) as RDFGraph  ValueO matches uri | bnode | literal fs:validTriple($subject as RDFTerm,  : ( , , ) ⇒ () $predicate as RDFTerm, dynEnv fs validTriple Subject Predicate Object $object as RDFTerm) as RDFTriple Blank Node Skolemisation In order to comply with the SPARQL construct semantics, all blank nodes inside a The fs:evalCT function, and hence construct expressions, ConstructTemplate’ need to be skolemised, i.e., for each solu- return elements of the previously defined type RDFGraph, tion a new distinct blank node identifier needs to be gener- thus allowing the result of construct expressions to be used ated. Since we normalise every XQuery for expression and in a DatasetClause of a subsequent SparqlForClause.Inmore SparqlForClauses by assigning them position variables (as detail, the fs:evalCT function checks the constructed RDF described in Sect. 4.1), we just need to retrieve the avail- graph for validity according to the conditions described in able position variables from the static environment compo- Definition 1, filtering out any non-valid RDF triples where nent statEnv.posVars, and create the new distinct identifier subjects are literals, predicates are literals or blank nodes, based on the values of these variables. The fs:bnode function etc. This is illustrated by the following dynamic evaluation takes care of skolemising blank nodes. If the argument of rules: this function is of type bnode a new blank node identifier is generated using rule (D7): dynEnv  ValueR matches bnode  : ( , , ) ⇒ . = ( ,..., ) dynEnv fs validTriple Subj1 Pred1 Obj1 Triple1 statEnv posVars PosVar 1 PosVar n . dynEnv.varValue(PosVar 1) = PosValue1 .  .  : , , ⇒ . dynEnv fs validTriple Subjn Predn Objn Triplen ⎛ ⎞ (D4) dynEnv.varValue(PosVar ) = PosValue Subj Pred Obj ⎛ n ⎞ n 1 1 1 , (D7) dynEnv  fs:evalCT⎝ ... ⎠ ValueR ⎜ , ⎟ Subj Pred Obj ⎜ PosValue1 ⎟ n n n dynEnv  fs:skolemConstant⎝ ⎠ ⇒ ValueRS ⇒ < > ... ..., triples Triple1 Triplen triples PosValuen dynEnv  fs:bnode(ValueR ) ⇒ { } The following dynamic evaluation rule for the fs:validTriple element bnode of type xs:string ValueRS function checks, relying on the fs:bnode function defined Otherwise, fs:bnode returns its argument unchanged as below, if a triple is valid according to the RDF semantics: represented by rule (D8):  ( | ) dynEnv Value matches uri literal (D8) dynEnv  fs:bnode(Value ) ⇒ Value dynEnv  fs:bnode(Subject) ⇒ ValS Both rules above use the fs:skolemConstant function for  ( | ) statEnv ValS matches uri bnode the generation of the new identifiers based on the specified dynEnv  Predicate ⇒ ValP  blank node label and on positional variables in the dynamic statEnv ValP matches uri dynEnv  fs:bnode Object ⇒ ValO environment. An example of XSPARQL Semantics Evalua-  ( | | ) tion is included in Appendix A. dynEnv ValO matches⎛ uri ⎞bnode literal Subject, (D5) dynEnv  fs:validTriple⎝ Predicate, ⎠ 4.3 Correspondence Between XSPARQL, XQuery, Object and SPARQL ⇒ element triple of type RDFTriple { element subject of type RDFTerm {ValS } element predicate of type RDFTerm {ValP } Since XSPARQL syntactically extends XQuery, and—by the element object of type RDFTerm {ValO } remarks in the end of Sect. 4.1—also any SPARQL con- } struct query is syntactically valid in XSPARQL, these que- ries are considered semantically equivalent to the semantics 123 162 S. Bischof et al. in their base languages. The next propositions formally estab- lish this intuitive correspondence. The proofs for these propositions and lemmas are included in Appendix C.1–C.3.

Proposition 1 XSPARQL is a conservative extension of XQuery.

A similar correspondence holds for native SPARQL con- struct queries. We show the equivalence between SPAR- QL BGP Matching [54, Section 12.3.1] and XSPARQL BGP Fig. 14 XSPARQL implementation architecture Matching (presented in Sect. 4.2) and prove the equivalence of XSPARQL semantics for native SPARQL construct queries with those of the SPARQL semantics. rewriter, which turns an XSPARQL query into an XQuery; (2) an XQuery engine for evaluating the XQuery; and (3) a Lemma 1 Given a graph pattern P, a dataset D and the SPARQL engine, for querying RDF from within the rewritten XSPARQL instance mapping μC of the expression context XQuery. = ( , ,μ ) C over which P is evaluated, let 1 evalxs D P C Our current prototype is meant first to demonstrate that and 2 = eval(D, P) be solution mappings. If vars(P) ∩ XSPARQL can be implemented directly on top of off-the- dom(μC ) =∅, then 1 = 2 { μC }. shelf components, and provides convenient means to model and execute XML2RDF/RDF2XML transformations. Sec- We define, based on [54, Section 10.2], the semantics of ond, as illustrated in our evaluation section (Sect. 6)weshow the SPARQL construct clause to show their equivalence to that a clever implementation of the XSPARQL language, the XSPARQL construct clause. again integrating an XQuery and a SPARQL engine, but with Definition 11 (SPARQL construct semantics)LetC be a several optimisations in place, can improve efficiency signif- ConstructTemplate and a solution sequence. The SPARQL icantly compared with a naive implementation. construct returns an RDF graph generated by the set-union In general it is possible to use any XQuery and SPARQL of the triples obtained from substituting variables in C with engines to evaluate XSPARQL queries. The current proto- their bindings from and satisfying the following condi- type implements the interface between the XQuery engine, 11 12 tions: Saxon 9.3, and the SPARQL engine, ARQ 2.8.7, using the Saxon Extension API which allows calling Java meth- 1. any invalid RDF triples that may be produced by the ods from within XQuery queries. The main function, called instantiation of the ConstructTemplate are ignored; and xsp:sparqlCall, evaluates a SPARQL query and returns 2. blank node labels within the ConstructTemplate are con- its result using the SPARQL XML results format [8]. By sidered scoped to the template for each solution, i.e., if the using Saxon’s extension mechanism the two query engines same label occurs twice in a template, then there will be are tighter integrated allowing a more efficient communica- one blank node created for each solution in , but there tion than our former prototype [4] which used a SPARQL will be different blank nodes for triples generated by dif- endpoint via HTTP to evaluate SPARQL queries. To imple- ferent query solutions. Blank nodes in the graph template ment the blank node handling as presented in Sect. 4.2 we changed the behaviour of ARQ accordingly using its Java be shared only within the same query solution μi ∈ . API. Instead of implementing all newly introduced types For SPARQL construct queries we can state the following: as given in Sect. 4.2 as custom types in XQuery, we reuse types as given by the XML Schema of the SPARQL Query Proposition 2 XSPARQL is a conservative extension of Results XML Format,13 where the sr:binding type corre- SPARQL construct queries. sponds directly to XSPARQL’s RDFTerm type. An RDFGraph, e.g., the result of a ConstructClause, is serialised using Turtle syntax by building the output as xs:string. The remaining 5 Implementation types RDFDataset and RDFNamedGraph are adapted accord- ingly. In this section we present a prototype implementation of the XSPARQL language. The prototype translates an XSPARQL 11 http://saxon.sourceforge.net/. query into an XQuery query with interleaved calls to a SPAR- 12 http://jena.sourceforge.net/ARQ/. QL engine. The architecture of our implementation is shown 13 See http://www.w3.org/2007/SPARQL/result.xsd, for this paper we in Fig. 14 and consists of three main components:(1) a query assume this schema is associated with the namespace prefix sr. 123 Mapping between RDF and XML with XSPARQL 163

Next we present how SparqlForClauses and Construct- fn:concat function. We parse the query string for vari- Clauses are processed using functions—called rewriting ables and, having access to the list of previously declared functions—that operate on syntactic objects of XSPARQL variables it is possible to determine whether variables should and returning an XQuery expression. In the resulting XQuery be replaced by their previously bound value or kept as a expressions we assume the namespace prefix xsp: associated variable. Within a SparqlForClause, whenever we encounter with http://xsparql.deri.org/demo/xquery/xsparql.xquery. fresh variables, i.e. that have not been declared before, we This prefix is not allowed to be used in any XSPARQL query leave the variable name as a string within the fn:concat and defines XQuery functions (presented below) that are (effectively postponing evaluation of the variable to the available to the rewriting functions and used as the name- SPARQL engine). On the other hand, if a variable has been space for any variables introduced by the rewriting, thus declared before, the XQuery variable name is inserted into avoiding clashes with variables from the XSPARQL query. the fn:concat function, meaning that it is evaluated and SparqlForClause First, our implementation defers SPAR- replaced by its current value when the fn:concat func- QL queries in a SparqlForClause to the external SPARQL tion is evaluated during the execution of the rewritten query. engine and extracts the bindings for the SPARQL variables Furthermore, the xsp:sparqlCall function implements the from the SPARQL XML results document that is returned, matching blank nodes in nested queries feature (as described by the following rewriting function. Let XS and XQ denote in Sect. 4.2): here, we rely on external Java code to call the the set of all XSPARQL core and XQuery core expres- ARQ API in such a way that blank node labels are pre- sions, respectively. Any SparqlForClauseis translated into an served over consecutive SPARQL calls that use the same XQuery query that calls the SPARQL engine with a select dataset; during query rewriting we trigger the correct match- query according to the rewriting function tr : XS → XQ. ing of blank nodes in nested queries whenever we encounter a Given an XSPARQL expression Q of form SparqlForClause without an explicit DatasetClause)asfol- lows: the respective custom Java code is used to maintain for Vars at PosVar $ a stack of the previous datasets. More specifically, we col- DatasetClause lect the blank node identifiers for each dataset created by WhereClause (Q1) an explicit DatasetClause in this stack; when a SparqlFor- SolutionModifier Clause without explicit DatasetClause is encountered, we return ExprSingle take the first element of the stack as the implicit dataset for the SparqlForClause along with its current blank node iden- ( ) then tr Q is defined as the XQuery Core expression tifiers. ConstructClause As for the construction of RDF graphs tr(Q) = (i.e., whenever the ReturnClause is a ConstructClause), our ( ) : 1 let $xsp results⎛ := ⎞ implementation within XQuery simply produces a string in select Vars DatasetClause xsp:sparqlCall⎝ WhereClause ⎠ return Turtle syntax, where we need to ensure that each produced SolutionModifier RDF triple is syntactically valid. This is implemented by (2) for $xsp:result at $PosVar in $xsp:results//sr:result means of a number of additional custom functions. First, return the auxiliary function xsp:rdfTerm $VarName , presented (3) let $v := for each $v ∈ Vars $xsp:result/sr:binding[@name = v]/∗ return in Fig. 19a (Appendix B), returns the correctly formatted (4) ExprSingle RDF term corresponding to the variable’s value in Tur- tle syntax given a variable of a SPARQL result type. This That is, we implement the fs:sparql formal semantics is done by matching the type of the variable and add- function by translating Q to a SPARQL select query, ing the necessary syntactic elements for each type. Next, which is then executed by the custom runtime function the xsp:validTriple presented in Fig. 19b (Appendix B) xsp:sparqlCall that receives a SPARQL select query implements the semantics function fs:validTriple by calling and returns the result in SPARQL’s XML result format. The the xsp:rdfTerm function to correctly format triples to text xsp:sparqlCall function also takes care of XSPARQL’s (using the Turtle syntax); xsp:validTriple further uses BGP matching, as described in Sect. 4.2, by replacing in the the auxiliary functions xsp:validSubject, xsp:valid- SPARQL query any previously bound variables with their Predicate and xsp:validObject that, respectively, deter- current value according to the rules presented in Definition 8, mine, according to the RDF semantics, if their argument is thus mimicking XSPARQL’s BGP matching behaviour while a valid subject, predicate or object. Also available to our relying on an off-the-shelf SPARQL engine. This replace- implementation are the functions xsp:isBlank, xsp:isU- ment of variables is performed by producing XQuery code RI and xsp:isLiteral which determine, respectively, if a that generates the SPARQL select query string that is given term is a blank node, URI or literal. Our implementation as a parameter to xsp:sparqlCall function using XQuery’s of the fs:skolemConstant function consists of appending all 123 164 S. Bischof et al. the position variables from the static context (that are stored Note that the strategies presented here are only applicable in the statEnv.posVars component) to the respective blank for dependent joins satisfying the following restrictions: node identifier using “_” as a separator, represented here by 1. An explicit DatasetClause of the inner query needs to be the rewriting function  statically determined i.e., it cannot be determined based , { , ··· , } = trsk $BNodeName $PosVar1 $PosVarn on variables bound from the outer expression; "_:", $BNodeName, "_", $PosVar , ··· , fn:concat 1 . 2. The return clause of the inner expression can not be a "_", $PosVarn ConstructClause; and Finally, the function xsp:evalCT (without details) imple- 3. The dependent variable in the inner query’s graph pattern ments fs:evalCT by simply concatenating all the triples must be strictly bounded as defined next. generated by the xsp:validTriple function to a string rep- resentation of the RDF graph to be constructed in Turtle syn- tax. Definition 13 (Strict Boundedness)Thesetofstrictly bound ( ) The next lemma states that the results of the evaluation of variables in a graph pattern P, denoted bVars P , is recur- a Basic Graph Pattern P under XSPARQL BGP matching sively defined as follows: if P is semantics can be determined based on the results of eval- uating μ(P) under SPARQL semantics. The proofs for the – a basic graph pattern, then bVars(P) = vars(P); following proposition and lemma are included in Appendi- – (P1 P2), then bVars(P) = bVars(P1) ∪ bVars(P2); ces C.4 and C.5. – (P1 optional P2), then bVars(P) = bVars(P1); – (P union P ), then bVars(P) = bVars(P ) ∩ bVars(P ); μ 1 2 1 2 Lemma 2 Let P be a BGP, D a dataset, and the XSPARQL – (graph iP), then bVars(P) = bVars(P )∪({i}∩V); and  = μ( ) 1 1 instance mapping of P. Considering P P , we have – (P filter R), then bVars(P) = bVars(P ). ( , ,μ) = ,  { μ} 1 1 that evalxs D P eval D P . The following result presents the equivalence of our imple- Informally, the dependent variables must occur (i) in a basic mentation function tr and the XSPARQL semantics.14 graph pattern, (ii) in every alternative of unions pattern, and (iii) also outside of the optional graph pattern in case of op- Proposition 3 Let Q be a SparqlForClauseof form (Q1) and tionals. Strict boundedness essentially ensures that the join dynEnv the dynamic environment of Q, then dynEnv  Q ⇒ variable does not occur only in a filter expression, which Val if and only if dynEnv  tr(Q) ⇒ Val . would lead to problems in case the inner expression is called unconstrained, see below. 6 Towards Optimisations of Nested for Expressions The rewritings introduced for the implementation of dependent joins can be grouped into two categories, depend- In this section we present different rewriting strategies for ing whether the join is performed in XQuery or SPARQL. XSPARQL queries containing nested expressions. We are For performing the join in XQuery, we use already known specifically interested in nested expressions with an inner join algorithms from relational , namely nested- SparqlForClause, as the number of interleaved calls to the loop joins or sort-merge joins. For performing the join in SPARQL engine can be reduced drastically using these SPARQL, if the outer expression is a SparqlForClause we rewritings. As the cost inherent with XQuery for clauses can implement the join by rewriting both the inner and the is much lower, we do not present different rewritings for outer expressions into a single SPARQL call. In case the outer nested expressions where the inner expression is an XQuery query consists of an XQuery ForClause, we can still consider for. These different rewritings proposed in this section con- this approach, but we need to convert the result of the outer stitute the initial step towards defining a set of optimisations XQuery ForClause to an RDF graph, for instance relying for the current implementation of the XSPARQL language. on a SPARQL engine that supports SPARQL Update [32]to We start by presenting the definitions and conditions under add this temporary graph to a triple store. The proofs for the which we can perform these rewritings. propositions presented in this section are included in Appen- dices C.6–C.8. Definition 12 (Dependent Join) We call two nested XSP- ARQL for expressions (ForClause or SparqlForClause), 6.1 Dependent Join implementation in XQuery where the inner expression is a SparqlForClause and at least one variable in the inner expression is bound by the outer The intuitive idea with these rewritings is, instead of using expression, a dependent join. The shared variables between the naïve rewriting that performs one SPARQL query for the for expressions are called dependent variables. each iteration of the outer expression, to execute only one 14 Please note that, for presentation purposes, we are omitting the initial unconstrained SPARQL query, before the outer query. The empty line in case the proof trees require no premises. resulting set of SPARQL solution mappings is then joined in 123 Mapping between RDF and XML with XSPARQL 165

 XQuery with the results of the outer expression, using one join {$Var , ··· , $Var }, $res = ⎛ nl 1  n ⎞ of the following strategies. xsp:isBlank $Var 1 or ⎝ ⎠ The straightforward way to implement the join over fn :empty $res/sr:binding[@name = Var 1]/∗ or / : [ = ]/∗ dependent variables directly in XQuery is by nesting two $Var 1 eq $res sr binding @name Var 1 and ··· XQuery for expressions, much like a regular nested-loop ⎛  ⎞ xsp:isBlank $Var n or join [1] in standard relational databases. In our proposed ⎜ ⎟ ⎜ fn:empty $res/sr:binding[@name = Var n]/∗ ⎟ and ⎝ ⎠ rewriting, the join consists of restricting the values of vari- or ables from the inner expression to the values taken from $Var n eq $res/sr:binding[@name = Var n]/∗ the current iteration of the outer expression, which does not require an incremental solution construction step. The – otherwise: approach for query rewriting applied to XSPARQL is similar to already known optimisations from the relational databases realm and also presented for XQuery queries by [47]. ( ) = optnl Q⎛  ⎞ out out Similarly to Sect. 5, we will describe the implementation for $Var at $PosVar in optnl ExprSingle ⎜ 1 ⎟ of this nested-loop join by means of the rewriting function ⎜ return ⎟ ⎜ in in ⎟  = ( ∪ )\( ∩ ) optnl⎜ for Vars at $PosVar ⎟ optnl.WeuseA B A B A B to denote the sym- ⎝ ⎠ metric difference of two sets A and B. DatasetClause WhereClause SolutionModifier return optnl ExprSingle2 Let Q be an XSPARQL expression of form When Q is an XSPARQL expression of form

(1) for Vars out at $PosVar out DatasetClauseout ( ) out out ( ) out out 1 for $Var at $PosVar in ExprSingle1 return 2 WhereClause SolutionModi f ier in (3) return (2) for Vars in at $PosVar (Q3) (Q2) (4) for Vars in at $PosVar in DatasetClausein DatasetClause WhereClause SolutionModifier (5) WhereClausein SolutionModi f ierin ( ) 3 return ExprSingle2 (6) return ExprSingle the application of the rewriting function opt (Q) can be split nl the application of the rewriting function optnl(Q) can be split into two cases: into two cases: Ð in case ExprSingle does not contain any occurrences sp in –ifExprSingle1 and ExprSingle2 do not contain any occur- of (Q3) then, considering Vars = vars WhereClause sp= ( ) rences of (Q2) then, assuming Vars Vars WhereClause , being the set of variables from the inner WhereClause, we have that we have that

optnl(Q) = optnl(Q) = (1) let $xsp:res_in := (1) let $xsp:results :=⎛ ⎞ ⎛ ⎞ select select {$Var out}∪Vars in ⎜ in ∪ out ∩ sp ⎟ ⎜ DatasetClause ⎟ ⎜ Vars Vars Vars ⎟ xsp:sparqlCall⎜ ⎟ ⎜ in ⎟ ⎝ WhereClause ⎠ xsp:sparqlCall⎜ DatasetClause ⎟ ⎝ in ⎠ SolutionModifier WhereClause in return SolutionModifier (2) for $Var out at $PosVar out in ExprSingle return return 1 ( ) : := (3) for $xsp:result at $PosVar in 2 let $xsp res_out ⎛ ⎞ out in xsp:results//sr:result return select Vars $ ⎜ out ⎟ { out}∩ sp, ⎜ DatasetClause ⎟ ( ) $Var Vars xsp:sparqlCall⎝ out ⎠ 4 if joinnl : then WhereClause $xsp result out (5) let $v := for each $v ∈{Varout}Vars sp SolutionModifier $xsp:result/sr:binding[@name = v]/∗ return return ( ) : out (6) ExprSingle 3 for $xsp rout at $PosVar 2 : // : (7) else () in $xsp res_out sr result return (4) let $v := for each $v ∈ Varsout $xsp:rout/sr:binding[@name = v]/∗ return ( ) : out Note that here we slightly abuse notation, using ‘∪’ 5 for $xsp rin at $PosVar in⎛$xsp:⎛res_in//sr:result⎞⎞ return to denote the concatenation of two lists of variables. Vars out ∩ Vars sp, ⎝ ⎝ ⎠⎠ The auxiliary function joinnl aggregates the actual join- (6) if joinsr $xsp:res_out, then comparison in an XPath expression, where two variables $xsp:res_in out sp are considered compatible if the outer value is a blank (7) let $v := for each $v ∈ Vars Vars xsp:res in/sr:binding[ name = v]/∗ return VarRes $ _ @ node, their values are equal, or the inner value ($ i ) (8) ExprSingle is unbound: (9) else () 123 166 S. Bischof et al.

where the joinsr function is defined as For an XSPARQL query Q of form  (1) for Vars out at $PosVar out DatasetClause joinsr {$Var 1, ··· , $Var n}, $resOut, $resIn =  (2) where GGPout { / : [ = ]/∗}, joinnl $resOut sr binding @name Var 1 $resIn (3) order by OCout ··· and  and (4) return { / : [ = ]/∗}, . (Q4) joinnl $resOut sr binding @name Var n $resIn (5) for Vars in at $PosVar in DatasetClause (6) where GGPin (7) order by OCin ( ) Ð otherwise: 8 return ExprSingle then

opt (Q) = nl ⎛ ⎞ Ð in case ExprSingle does not contain any occurrences for Vars out at $PosVar out DatasetClauseout ⎜ out out ⎟ of (Q4), we have that ⎜ WhereClause SolutionModi f ier ⎟ ⎜ return ⎟ ⎜ ⎟ optsr (Q) = optnl⎜ in in in ⎟ ⎜ for Vars at $PosVar DatasetClause ⎟ ( ) : := ⎝ in in ⎠ 1 let $xsp results ⎛ ⎞ WhereClause SolutionModi f ier select Vars out ∪ Vars in return optnl ExprSingle ⎜ DatasetClause ⎟ xsp:sparqlCall⎜ ⎟ ⎝ where {GGPout . GGPin} ⎠ order by OCout OCin This rewriting to the nested-loop join reduces the number of return needed SPARQL calls from 1 + N (where N is the number (2) for $xsp:result at $PosVar out of iterations of the outer expression) to two SPARQL calls. in $xsp:results//sr:result return ( ) := ∈ Next we show that the opt rewriting function is sound 3 let $v for each $v Vars nl $xsp:result/sr:binding[@name = $v]/∗ return and complete. (4) ExprSingle

Proposition 4 Let Q be a XSPARQL expression of form (Q2) Please note that the group graph patterns GGP1 and or (Q3) and dynEnv the dynamic environment of Q, then GGP2 include the surrounding curly braces: { and }. dynEnv  Q ⇒ Val if and only if dynEnv  optnl(Q) ⇒ Val . Ð otherwise,

( ) = 6.2 Dependent Join Implementation in SPARQL optsr Q⎛ ⎞ for Vars out at $PosVar out DatasetClause ⎜ ⎟ ⎜ where GGPout ⎟ This form of rewriting of nested expressions aims at improv- ⎜ out ⎟ ⎜ order by OC ⎟ ing the runtime of the query by delegating the execution of ⎜ ⎟ ⎜ return ⎟ optsr ⎜ in ⎟ the join to the SPARQL engine, as opposed to performing ⎜ for Vars in at $PosVar DatasetClause ⎟ ⎜ ⎟ the join within XQuery only. ⎜ where GGPin ⎟ ⎝ ⎠ order by OC in SparqlForClause within a SparqlForClause. For nested return optsr ExprSingle expressions where both expressions consist of SparqlFor- Clauses we can implement the join by rewriting the Sparql- Proposition 5 Let Q an XSPARQL expression of form (Q4) ForClauses into a single SPARQL query. The idea here is that and dynEnv the dynamic environment of Q; then dynEnv  a join encoded as nested SparqlForClauses in XSPARQL Q ⇒ Val if and only if dynEnv  optsr(Q) ⇒ Val . can just be implemented by a SPARQL query that merges the where clauses of the outer and inner SparqlForClause. SparqlForClausewithin an XQuery for In case the outer However, there are some restrictions to the applicability of expression is an XQuery for a similar strategy of deferring this rewriting: (i) both queries must be done over the same the join to a single SPARQL query is still possible. Since dataset; (ii) apart from order by, no other solution modifi- the optimisation proposed here does not preserve the order- ers can be used in the queries; and (iii) the original queries ing of results, it can only be applied if the order of the outer must not require any nesting of the XML output or use of XQuery expression is not relevant. Cases where ordering can aggregators. be disregarded in XQuery, as discussed by [35], include not As indicated before, for the next rewriting we are only only the unordered ordering mode in XQuery but also the allowing the order by solution modifier and the concatena- use of aggregate functions and other built-in functions or tion of “order by $o1” and “order by $o2” is “order by the quantifiers some and every. This optimisation relies on $o1 $o2”. For presentation purposes, GGP and OC are, first transforming the outer expressions’ XML results into respectively, a short representation for GroupGraphPattern RDF and then joining this newly created RDF graph with the and OrderCondition. inner SparqlForClause’s where pattern in a single SPARQL 123 Mapping between RDF and XML with XSPARQL 167 query. To implement this, we can, for instance, rely on a tri- Proposition 6 Let Q be an XSPARQL expression of form ple store with support for named graphs to temporarily store (Q5) and dynEnv the dynamic environment of Q, then the RDF data corresponding to the outer XQuery for expres- dynEnv  Q ⇒ Val if and only if dynEnv  optng(Q) ⇒ sion’s bindings for dependent variables. We can then execute Val . a combined query with an adapted graph pattern that joins the pattern in the where clause of the inner SparqlForClause with the bindings stored in the newly created named graph. 7 Experimental Evaluation The optng rewriting function (presented below) starts by cre- ating RDF triples representing the XML input which are then In this section we present an experimental evaluation of our collected into the variable $xsp:ds that corresponds to the prototype presented in Sect. 5 using a novel benchmark suite, RDF graph to be inserted into the triple store. This opera- called XMarkRDF, that is based on the well-known XMark tion is achieved by the XSPARQL functions xsp:createNG benchmark suite for XQuery. We compare our XSPARQL that returns a URI for the newly inserted RDF named graph, prototype with the SPARQL2XQuery engine, an implemen- which is distinct from any other URIs for named graphs used tation of the direct translation of SPARQL to XQuery pre- in the query, while finally the function xsp:deleteNG takes sented by [33] and test—where possible—the effects of the care of deleting the temporary graph. Let Q be an XSPARQL optimisations presented in Sect. 6. expression of form A detailed description of the XMarkRDF benchmark suite is included in Appendix D. We denote the 20 original XMark (1) for $VarName OptTypeDeclaration q q ( ) queries as 1– 20 and the variants of the nested queries to 2 OptPositionalVar in ExprSingle1    ( ) (Q5) which we apply our different rewritings as q8–q11 and q8 – 3 return for Vars DatasetClause WhereClause  ( ) 4 SolutionModifier return ExprSingle2 q11. Further details regarding these queries are included in Appendix D. then, We would like point the reader to available benchmark results for XQuery15 and SPARQL16 which present better ExprSingle ExprSingle – in case 1 and 2 do not contain any results than our XSPARQL implementation benchmarked occurrences of (Q5), we have that in this section. However, the comparison with such native SPARQL and XQuery engines is beyond scope of the paper optng(Q) = (1) let $xsp:ds := ⎛ ⎞ since we specifically address a combined use case, where for $VarName OptTypeDeclaration components from both XQuery and SPARQL are needed. ⎝ ⎠ xsp:createNG OptPositionalVar in ExprSingle1 The benchmark queries presented here cannot be compara- ( ) return xsp:evalCT NGP bly solved by relying on a single SPARQL or XQuery engine. return ( ) : := 2 let $xsp results⎛ ⎞ select Vars ∪ $VarName ⎜ ∪ ⎟ 7.1 Experimental Setup ⎜DatasetClause ⎟ ⎜ : ⎟ ⎜ from named $xsp ds ⎟ ⎜ ∪ ⎟ xsp:sparqlCall⎜WhereClause ⎟ Using the data generators and translators, provided by the ⎜ where graph $xsp:ds ⎟ XMark benchmark and the XSPARQL translation to RDF ⎝ } ⎠ NGP (as presented in Sect. 6), we created datasets with scaling SolutionModifier return factors of 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, and 1.0 and trans- (3) for $xsp:result at $xsp:result_pos lated them into XMarkRDF. An overview of the generated in $xsp:results//sr:result return data is presented in Appendix D, including dataset sizes and, ( ) := ∈ ∪{ } 4 let $v for each $v Vars $VarName for each of the dataset size considered, the number of persons $xsp:result /sr:binding[@name= $v]/∗ ( ) , : and item categories modelled. 5 return ExprSingle2 xsp:deleteNG $xsp ds Furthermore, we converted the XMarkRDF datasets into the RDF/XML format required by the SPARQL2XQuery [] : where NGP is the graph pattern value $VarName . system. The resulting dataset sizes and translation times for – otherwise,

15 Benchmark results for the XMark dataset can be found at http://www. ( ) = optng Q⎛ ⎞ monetdb.org/XQuery/Benchmark/XMark/ and http://www.informatik. for $VarName OptTypeDeclaration uni-freiburg.de/~mschmidt/smp/xmark., retrieved 20-04-2012. ⎜ ⎟ ⎜ OptPositionalVar in optng ExprSingle1 ⎟ 16 optng⎝ ⎠ Benchmark evaluation of RDF stores can be found at http:// return for Vars DatasetClause WhereClause www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/results/ SolutionModifier return optng ExprSingle2 V6/index.html, retrieved 20-04-2012. 123 168 S. Bischof et al.

Table 1 Query response times (in seconds) of the 2MB dataset. Query rewriting error (err)

q1 q2 q3 q4 q5 q6 q7 q8 q9 q10

XS 9.25 10.65 10.43 9.43 10.15 11.38 11.97 358.27 355.71 35.89 S2XQ 2.63 19.47 err 3.71 2.82 2.58 err 3.44 18.91 178.71

q11 q12 q13 q14 q15 q16 q17 q18 q19 q20 XS 371.46 81.96 10.83 10.04 11.61 11.66 10.14 10.93 10.73 19.93 S2XQ 17.99 128.89 3.93 2.72 3.00 3.12 7.58 10.92 3.05 16.64

the different scaling factors of the XMarkRDF dataset are The comparison of the response times of the different also presented in Appendix D. rewriting functions presented in Sect. 6 is shown graphically The benchmark system consists of a dual-core AMD in Figs. 15 and 16. The response times of these queries for Opteron 250 2.4GHz, 4GB memory running a 64-bit instal- the 2MB are presented in Table 2 as a reference, where n/a lation of Ubuntu 10.04.1 LTS. For the XQuery engine, we indicates that the combination of query and optimisation is rely on Saxon version 9.3 Enterprise Edition and Java version not applicable. 1.6.0 64 bit. For evaluating SPARQL queries we used ARQ Wenext present the interpretation of the benchmark results 2.8.7. We ran each query with a timeout of 10 min per query, when comparing to the SPARQL2XQuery system and then with the Java Heap size set to 1GB and the Saxon configura- proceed to then interpretation of the results from the different tion set as schema-unaware. The response time of each query rewriting strategies. was measured using GNU time 1.7 and the process startup Evaluation of XS and S2XQ without optimisation time was deduced to each response time. For the evaluation Table 1 shows that for most of the queries the S2XQ runs we defined the following run configurations: are faster than the interleaved calls to a SPARQL engine in the XS runs. Even considering that the response times do XS using the XSPARQL implementation over the XMark- not include the data translation times, this suggests that an RDF datasets (translated data and queries) without opti- alternative implementation of XSPARQL where the SPAR- misation; QL queries are translated into native XQuery is a viable XSZ using the XSPARQL implementation over the XMark- alternative to interleaving calls to a SPARQL engine. How- RDF datasets (translated data and queries) with nested ever, for such translations to be possible we need access expression optimisation optZ for Z ∈{nl, sr, ng}; to the full RDF dataset to perform the query translation S2XQ using the SPARQL2XQueryimplementation over the which is not possible for example in the case where we translation of the XMarkRDF datasets into the required are querying data behind a SPARQL endpoint. Another XML format (XMarkRDFS2XQ) without optimisation; issue related to the implementation of the SPARQL2XQue- and ry system is that response times deteriorate considerably S2XQZ using the SPARQL2XQuery implementation over for larger datasets. This was observed for all the queries in the translation of the XMarkRDF datasets into the the benchmark and can be seen in the graphs of Figs. 15 required XML format (XMarkRDFS2XQ) with nested and 16. ∈{ , } expression optimisation optZ for Z nl sr . Queries q8–q12 have the highest execution times of all the benchmark queries since they contain nested expressions (as can be seen in q9 presented in Fig. 20). For these nested 7.2 Results and Interpretation queries, our interleaved XSPARQL implementation can only handle small datasets: the 2MB dataset is the largest for which The response times of the XS and S2XQ runs for the bench- all queries finish within the time limit and for the 20MB data- mark queries over the 2MB dataset size are shown in Table 1. set all queries result in a timeout. For these nested queries we We present the 2MB dataset as it is the largest dataset our un- applied the different optimisations described in Sect. 6 and optimised implementation can process within the time limit next we present their benchmark evaluation. of 10 min. Both the data and query translation times are not Evaluation of XS and S2XQ with Nested Expression measured in our benchmarks since this process can be done Optimisation. As we can see from Table 2 and Figs. 15 a priori. The response times for the XMark queries evaluated and 16,theoptnl optimisation provides significant reduc- using the Saxon XQuery engine are not presented in this tion in the query evaluation times. For queries q8, q9, and table since these queries do not cater for our heterogenous q11 the difference in response times is one order of magni- data sources scenario. tude. 123 Mapping between RDF and XML with XSPARQL 169

(a) (b)

(c) (d)

(e) (f)

Fig. 15 Query response times for (variants of) q8 and q9 on all XMarkRDF datasets

The improvement in the execution time for query q10 is for queries q10 and q11 and their variants are not as suitable less drastic. This can be explained by the fact that the outer for optimisation by the XQuery engine when compared with expression of q10 iterates over “categories” which increases queries q8 and q9. For these cases our rewriting strategy is at a much smaller rate than “persons” do in the outer expres- capable of performing the optimisation task for the XQuery sions of queries q8, q9, and q11 (cf. Appendix D). engine. However, for the S2XQ runs this optimisation provides For the XS run, it is possible to see in Fig. 15c, d that optsr virtually no improvement in the query response times for (presented in Sect. 6.2) is generally more efficient in terms queries q8 and q9 and their variants. In queries q10, q11, of response times than the XQuery based. This can be justi-   q10, and q11 we can observe an improvement in response fied by the smaller amount of information that is necessary times. This can be attributed to the fact that the rewriting to transfer from SPARQL to the XQuery engine. This effec-

123 170 S. Bischof et al.

(a) (b)

(c) (d)

(e) (f)

Fig. 16 Query response times for (variants of) q10 and q11 on all XMarkRDF datasets

tively reduces the overhead of using an external SPARQL opposed to the XSsr runs which behaves consistently sim- engine for the evaluation of queries. Considering the S2XQsr ilar to XSnl. This indicates that S2XQ is not as efficient as the run, optsr produces no improvement in the query response ARQ-based native SPARQL engine runs XSsr and XSnl for   times and in some cases (q10 and q11 from Table 2)even larger datasets. deteriorates considerably the response times when compared We can draw similar conclusions for the optng when com- with S2XQ. This further supports our previous claims that paring the query evaluation times of the optsr rewriting. the XQuery engine is not capable of optimising the rewritten However, the response times for this approach are deteri- code from complex SPARQL queries. orated by the overhead of creating, inserting, and deleting  Furthermore, the S2XQsr runs could not evaluate the the RDF Named Graph. This slowdown makes queries q8 ,   higher dataset sizes for query q8, whose response times q10 and q11 of the of the optnl rewriting outperform this deteriorate considerably with the larger dataset sizes—as optimisation.

123 Mapping between RDF and XML with XSPARQL 171

Table 2 Query response times in seconds of different XS S2XQ XSnl S2XQnl XSsr S2XQsr XSng optimisations for the 2MB . . . . XMarkRDF dataset. q8 358 27 3 44 15 66 3 54 n/a n/a n/a Optimisation not applicable q9 355.71 18.91 15.20 19.35 n/a n/a n/a (n/a) q10 35.89 178.71 19.78 156.09 n/a n/a n/a

q11 371.46 17.99 22.67 6.43 n/a n/a n/a  . . . . q8 355 63 3 71 15 48 2 87 10.60 3.35 n/a  . . . . q9 357 13 19 16 15 12 19 10 11.79 15.44 n/a  . . . . q10 36 73 180 32 18 24 154 95 16.37 199.28 n/a  . . . . q11 354 20 18 21 21 00 6 40 18.63 165.99 n/a  . . . . q8 352 10 4 46 13 29 3 61 n/a n/a 13.53  . . . . q9 356 63 18 64 13 23 16 46 n/a n/a 13.17  . . . . q10 37 84 175 70 16 88 175 47 n/a n/a 17.62  . . . . q11 365 24 139 96 21 27 145 05 n/a n/a 24.79

8 Related Work annotating it with necessary information to answer XPath queries. The XPath queries are, in turn, translated into SPAR- With the establishment of XML and RDF, tools and methods QL queries, and the result of the execution of the SPARQL were introduced that rely on existing standards for retrieving query is then translated into a format equivalent to the result and querying both languages. Most of the existing proposals of the XPath query. to merge XML and RDF rely on translating the data from Deursen et al. [26] present an approach for the transforma- different formats and/or translating the queries from differ- tion between XML and RDF in an ontology-dependent man- ent languages. With this in mind, we divided the proposals ner, introducing a language that allows to convert existing into two categories: (1) Translation of data: these tools aim XML Schema documents (and XML documents conforming at integrating the heterogeneous data by translating between to the schemas) by defining mappings relating the schema to different formats, usually relying on user predefined map- specified ontologies. Other approaches [17,56] aim at trans- pings. (2) Integration of query languages: this category lating an XML Schema into an equivalent OWL ontology. of approaches (where XSPARQL is also included) considers However, in our approach, we are focusing on translation the integration and/or expansion of query languages to allow and integration of instance data, rather than aiming at pro- querying different formats. Next we give a short overview of viding a semantic interpretation for XML data. some of the tools and proposals available in each category. The approaches that propose a batch translation of data Data translation The TriX format [21] consists of an pose problems such as the replication of data and the need alternative serialisation for RDF in XML, with the aim of for constant synchronisation between the original data and being compatible with standard XML tools. It uses XSLT the transformed data, for instance in the case of a frequently as an extensibility mechanism, allowing to specify syntactic updated . We argue that this approach is not optimal extensions and defining macros. R3X17 uses an RDF pro- for most enterprise and Web scenarios and dynamic transla- cessor and XSLT to transform RDF data into a predictable tions are the best way to describe and implement such inte- form of RDF/XML also catering for RSS. Similarly, Grit18 gration of data. is designed to be a simplified normalisation for RDF, eas- Last, but not least, as we have also discussed XSPARQL ier to process with XSLT than RDF/XML. Gloze [6]aims as a means to transform between different RDF representa- at directly interpreting an XML document as RDF data by tions beyond the capabilities of SPARQL [53] in this paper, providing transformations between XML and RDF based we should mention the forthcoming SPARQL 1.1 [38] spec- on the XML Schema definition. The transformation tries to ification, that will add features to SPARQL addressing such map each XML element and attribute to an RDF property. use cases (aggregation, value generation, etc.). Whereas no The resulting transformation makes extensive use of RDF detailed studies of SPARQL 1.1’s expressivity exist as of sequences to maintain the ordering from the XML structure. yet, we emphasise that XSPARQL—being a Turing-com- Droop et al. [28] translate the XML document into RDF, plete scripting language for RDF—will be able to encompass all features within SPARQL 1.1 and more. 17 http://wasab.dk/morten/blog/archives/2004/05/30/transforming- Language integration Berrueta et al. [12] present a fra- rdfxml-with-. mework that allows to perform SPARQL queries from XSLT: 18 http://code.google.com/p/oort/wiki/Grit. XSLT+SPARQL. It relies on adding functions to XSLT that 123 172 S. Bischof et al. provide the ability to query SPARQL endpoints and uses framework21 also provides extensions of SPARQL to pro- standard XSLT to process the SPARQL XML results format. cess XPath and XSL transformations in SPARQL queries Similarly to our current implementation, they rely on a clear and defines an extension to the XSLT language to allow to separation between the SPARQL query and XSLT parts of perform SPARQL queries. the query. Other approaches for querying heterogenous data were The following proposals suggest compiling a SPARQL presented by Berger et al. [10]. This query language follows query to XSLT/XQuery: Bikakis et al. [14] translate each a different syntax than the W3C standardised SPARQL and SPARQL query into an XQuery using a previously defined XQuery and allows to write queries in the form of logical mapping from OWL to XML Schema and [33] propose to rules over an abstraction of the XML and RDF data models embed SPARQL into XSLT or XQuery, presenting exten- (represented as a graph). sions to these languages to enable SPARQL querying where The nSPARQL query language [50] proposes to extend each SPARQL query is also translated into an equivalent SPARQL with navigational capabilities using nested regular XQuery. This language is very close to the XSPARQL lan- expressions. With this addition, the language is sufficiently guage, but it does, however, require converting the RDF data expressive to capture the semantics of RDFS. In addition to to XML according to a predefined schema. Assuming the this, it introduces a number of graph navigation operators and queried dataset is available this translation carries overhead adds the ability to selectively traverse the graph. This work is into the query and in case the dataset is not available, for different than our current proposed approach for XSPARQL, example due to being stored behind a SPARQL endpoint, but one of the possibilities for extending XSPARQL is to such translation is not possible. Ding and Buxton presented enable to perform XQuery enriched SPARQL queries. an approach to translate SPARQL into XQuery at the 2011 Related to our nested queries optimisation, initial work Semantic Technology Conference.19 This rewriting generates has been presented by [5] over an extension of SPARQL that XQuery specifically tailored for the Marklogic Server XML caters for nested queries and presented preliminary equiva- database engine. On a similar approach, integrating XPath lences between types of nested queries with the aim of deter- into SPARQL [25], also promises to bridge the gap between mining if query unnesting can be successfully applicable. XML and RDF. This is approach is similar to XSPARQL, although the choice here was to extend the SPARQL query language. While also catering for SQL queries, [31] presents 9 Conclusion and Future Work a translation of SPARQL queries into XQuery and present encouraging benchmark results. Another similar approach is In this paper we presented a novel query language, called presented by [48], where the authors again translate SPARQL XSPARQL, that combines XQuery and SPARQL to provide to XQuery by relying on a normal form of RDF/XML. Also simplified transformations between the XML and RDF data according to our benchmarks, encoding SPARQL in XQue- models. We covered the semantics of XSPARQL, defined as ry seems a viable option—assuming that we have access to an extension of the XQuery semantics, and presented our cur- the RDF dataset beforehand—that would allow to compile rent implementation which consists of rewriting each XSP- XSPARQL to pure XQuery without the use of a separate ARQL query to an XQuery query. The implementation is SPARQL engine. available for download at http://xsparql.deri.org/ where we Zhou and Wu [59] propose another approach to represent also provide an online XSPARQL query evaluator at http:// RDF data as XML trees, based on translating RDFS into an xsparql.deri.org/demo/. XML Schema, and then translating SPARQL queries into We also presented different rewriting strategies for a XPath and XQuery queries. particular category of XSPARQL queries, namely those con- RDF Twig [58] suggests XSLT extension functions that taining nested expressions involving SPARQL queries and provide views on the sub-trees of an RDF graph. The main presented benchmark evaluation of these different rewritings. idea of RDF Twig is that while RDF/XML is hard to navi- For these optimisations we detailed the rewriting functions gate using XPath, a subtree of an RDF graph can be serialised describing their application in our current implementation of intoamoreusefulformofRDF/XML.RDFXSLT20 provides the XSPARQL language. We presented two types of optimi- an XSLT preprocessing stylesheet and a set of helper func- sations for nested expressions: one based on reordering the tions, similar to RDF Twig, yet implemented in pure XSLT expressions in the XQuery rewriting to minimise the number 2.0, readily available for multiple platforms. The CORESE of calls to the SPARQL endpoint and another based on per- forming a more complex SPARQL query that takes care of joining the variables. The benchmarks carried out to deter- 19 http://semtech2011.semanticweb.com/sessionPop.cfm?confid=62& mine the impact of our optimisations have shown encour- proposalid=4015. 20 http://www.wsmo.org/TR/d24/d24.2/v0.1/20070412/rdfxslt.html. 21 http://www-sop.inria.fr/edelweiss/software/corese/. 123 Mapping between RDF and XML with XSPARQL 173 aging results, hinting on a large potential for optimisations in XSPARQL. Among the rewriting strategies presented in this paper and on our test data, pushing joins into a SPARQL engine appeared the most promising strategy. Our benchmark results showed that our optimisations are not only specific (a) to XSPARQL having also improved the response times of the SPARQL2XQuery system to which we compared XSP- ARQL. Future Work In this paper we have shown that nested queries can be efficiently evaluated by applying particular re- writings. Nonetheless all the tested rewriting strategies were (b) created ad-hoc. A declarative algebra model would help to correctly and systematically study further optimisations for Fig. 17 Example input and normalised query of Fig. 8 XSPARQL. As starting points, [34] and [36] have presented translations of XQuery to SQL, whereas in our own earlier Table 3 Result of fs:sparql works we have likewise translated SPARQL essentially to $P $N $F Relational Algebra [52]. These works seem to indicate valid starting points for further research on equivalences and op- _:gen “Charles” “Brown” timisations in our language. Initial steps for defining such a declarative algebra can also be based on subsets of the XQue- ry language, for example XQuery core presented by Koch We assume the input graph vc.rdf as given in Fig. 17a. Let us [43]. A proposal towards the declarative model of XSPARQL go through the three phases of XQuery semantics evaluation, has been done by Bischof [16]. Although the initial set of i.e., the normalisation, static type checking, and dynamic optimisations proposed in this paper shows that our current evaluation steps. implementation of performing interleaved calls to a SPAR- Normalisation In the normalisation step the SPARQL- QL engine can be improved upon, a more tightly integrated style namespace declarations are rewritten to XQuery na- implementation of the XSPARQL language should yield bet- mespace declarations (see Rule (N3)). After that, the whole ter results. Such an integrated implementation is planned for construct query is rewritten to a SparqlForClauseby the near future where we can leverage optimisations pro- Rule (N8), the for * is expanded according to Rule (N6) posed for SPARQL [37] or the proposed implementations of and the resulting SparqlForClauseis then handled by · XQuery over relational databases [34,13]. Finally, the current Rule (N7). Rule PosVar then adds a new positional vari- XSPARQL language specification already allows to query able (e.g., $pos). Finally the construct is normalised by data contained in XML and RDF datastores. However, updat- Rule (N9). The whole normalisation phase results in the ing data of these datastores is still not directly possible. We query given in Fig. 17b. plan to extend the XSPARQL language to a full data manip- Static Type Analysis By Rule (S2) the variables occur- ulation language allowing to update, insert, and delete data ring in the WhereClause, namely $P,$N, and $F, are typed contained in RDF tripestores. Here, similar to our combi- as RDFTerm, and the positional variable, $pos, is typed as nation of query languages, we will aim at combining com- xs:integer. The whole SparqlForClause inherits its type mon data manipulation languages for XML and RDF, such from the contained return ExprSingle , which in turn inher- as SPARQL Update [32] and XQuery Update [55]. its its type from the function fs:evalCT which is RDFGraph. Dynamic Evaluation First the new environment compo- Acknowledgments This work was supported by SFI grant SFI/08/- nent activeDataset is changed from empty to the one given CE/I1380 (Líon-2), by an IRCSET scholarship, and by the Marie Curie in the DatasetClause, i.e., the graph contained in vc.rdf. IRSES Grant 24761 (Net2). We would like to thank Sven Groppe for his help with the SPARQL2XQuery system. According to Rule (D3)theWhereClause is evaluated using the fs:sparql function with the active dataset, as just ini- tialised, the WhereClause as given in the query, and empty SolutionModifiers. The fs:sparql function call results in a A Example of XSPARQL Semantics Evaluation sequence of PatternSolutions (in this case a singleton solution) as given in Table 3. Next, the same rule extracts the As an example we show the application of the XSPARQL variable bindings for all variables, using the fs:value function, evaluation semantics (presented in Sect. 4.2)tothesample and assigns them to the corresponding XQuery variables, query from Fig. 8b. The example query features both, the typed as RDFTerm. After that the return expression Expr- new SparqlForClause as well as the new ConstructClause. Single is evaluated, using the just initialised variables. The 123 174 S. Bischof et al.

Fig. 18 Query results (a) fs:evalCT function calls the fs:validTriple function passing it a blank node generated by the fs:bnode function as sub- ject, “foaf:name” as predicate and the result of fn:concat as object. The fs:bnode function (as given by Rule (D7)) generates a fresh blank node label for each element of the PatternSolution. For this example we assume that the function returns the new blank node label “_:gen_1”. The fs:validTriple function tests these three values for validity. (b) Since the subject is of type bnode, the predicate is a QName Fig. 19 Implementation functions (and therefore considered as being of type uri), and the object is of type literal, namely an xs:string, the func- tion returns them as a valid RDFTriple.Thefs:evalCT func- ARQL semantics we also extend the normalisation rules and tion eventually returns the result of the single fs:validTriple static analysis rules for native XQuery for clauses. More function call and thus the result of the whole query as specifically, rule (N5) extends the XQuery for normalisa- an element of type RDFGraph as shown in Fig. 18. Seria- tion by adding a new variable to each position-variable free lised to Turtle the query result, including QName expansion, for expression (i.e., that does not have an at clause). As is the expression _:gen_1 stated these new position variables are disjoint from the vari- "Charles Brown". ables in scope, and thus this rewriting does not interfere with the semantics of the original query. The only rules which use the newly created position variables are (i) the slightly mod- B Implementation Functions Example ified static type analysis rule (S3) which extends the XQuery for static analysis rule by collecting the position variables in Figure 19 presents some of the XQuery functions defined the static environment component statEnv.posVars, thus also in the XSPARQL language implementation, namely to cor- maintaining the original semantics of the original XQuery rectly format RDF terms (Fig. 19a) and to validate triples for rule and (ii) the dynamic evaluation rule (D7) which resulting from a construct expression (Fig 19b). accesses statEnv.posVars to generate Skolem-identifiers for blank nodes in construct parts. However, rule (D7) only applies to XSPARQL queries which fall outside the native C Proofs XQuery fragment, whereas the semantics of native XQue- ry queries remains untouched and independent of the extra C.1 Proof for Proposition 1 environment components in XSPARQL. 

Proposition 1 XSPARQL is a conservative extension of C.2 Proof for Lemma 1 XQuery. Lemma 1 Given a graph pattern P, a dataset D and the Proof We show that the additional rules introduced in XSPARQL instance mapping μC of the expression context Sect. 4.2 do not modify the semantics of any native XQue- C over which P is evaluated, let 1 = eval (D, P,μC ) ry. The XSPARQL semantics—expressed in terms of nor- xs and 2 = eval(D, P) be solution mappings. If vars(P) ∩ malisation rules, static typing rules, and dynamic evaluation dom(μC ) =∅, then 1 = 2 { μC }. rules—strictly extend the native semantics of XQuery. In the ( , ,μ ) semantics definition we also define new environment com- Proof The XSPARQL BGP matching, evalxs D P C , ponents, namely statEnv.posVars and dynEnv.activeDataset, extends SPARQL’s BGP matching, eval(D, P), by defin- which are not used in the XQuery semantics and thus do ing that the solutions of the BGP are the ones compatible not interfere with query evaluation. However, for the XSP- with the XSPARQL instance mapping μC . Since the evalu- 123 Mapping between RDF and XML with XSPARQL 175 ation of graph patterns (such as union, optional, graph, Proof (⇐) Let us show that if dynEnv  tr(Q) ⇒ Val then and filter) remains unchanged from the SPARQL seman- dynEnv  Q ⇒ Val . The evaluation of Q consists of the tics let us focus on the evaluation of a BGP P. If there are application of Rule (D1)as no shared values between the graph pattern and the XSP- ARQL instance mapping, vars(P) ∩ dom(μC ) =∅, then each solution μ ∈ 2 returned by the SPARQL BGP eval- dynEnv  fs:dataset(DatasetClause) ⇒ DS μ uation semantics is trivially compatible with C and the DS, WhereClause, dynEnv  fs:sparql ⇒ μxs result of the XSPARQL BGP matching is μ ∪ μC . Extend- SolutionModifier i . xs  ⇒ .. ing this result to all solution mappings in 2, we obtain that dynEnv1 ExprSingle Value i = { μ }.  1 2 C for $Var 1 ···$Var n at $PosVar DatasetClause WhereClause dynEnv  SolutionModifier return ExprSingle C.3 Proof for Proposition 2 ⇒ Value 1 ···Value m

Proposition 2 XSPARQL is a conservative extension of μxs SPARQL construct queries. where, for each i ,

dynEnvxs = dynEnv⎛ + activeDataset(DS) ⎞ con- 1 Proof For XSPARQL queries consisting of a SPARQL PosVar ⇒ i;  ⎜ xs ⎟ struct query, there cannot exist any previous bindings for ⎜ Var 1 ⇒ fs:value μ , Var 1 ; ⎟ (T1) +varValue⎝ i ⎠ . variables in XSPARQL and thus the XSPARQL instance ··· ;  xs Var n ⇒ fs:value μ , Var n mapping μC over which the construct query will be exe- i cuted is empty. Let P represent the graph pattern of the con- struct query and D the dataset, trivially there are no shared Let μ be the XSPARQL instance mapping of the expres- variables between μ and P and so, following Lemma 1 the C C sion context that includes dynEnv and tr the pattern bindings 1 for XSPARQLBGP matching are the same bind- solution resulting from evaluating the xsp:sparqlCall ings as SPARQLBGP matching, since = ∪{∅}and 2 1 2 = ( , ) function, i.e., tr eval DatasetClause P , where P is hence 1 = 2. Furthermore, the formal semantics func- the rewriting of WhereClause according to μ . Further- tion fs:evalTemplate returns an RDF graph satisfying all the C more, let μ ∈ be the solution mapping from which conditions of Definition 11: (1) Ignoring invalid RDF tri- i tr Val is generated, i.e., there exists some dynamic environ- ples—Item 1—is guaranteed by Rules D5 and D6; and (2) ment dynEnvtr based on dynEnv and extended with the var- The generation of distinct blank nodes for each solution iable bindings from μ such that dynEnvtr  ExprSingle ⇒ sequence—Item 2—is enforced by the blank node skolemi- i Val . sation rules (Rules (D7) and (D8)).  = ( , ,μ ) Consider xs evalxs DatasetClause WhereClause C as the solution sequence resulting from the evaluation of : = C.4 Proof for Lemma 2 the fs sparql function. As we know from Lemma 2, xs { μ } tr C and thus there must exist a solution map- μ μ ∈ μ = μ μ Lemma 2 Let P be a BGP, D a dataset, and the XSPARQL ping xs xs such that xs i C .From(T1)  xs instance mapping of P. Considering P = μ(P), we have we infer that there exists a dynamic environment dynEnv ( , ,μ) = ,  { μ} that evalxs D P eval D P . that results from extending dynEnv with the variable bind- μ ings from xs and thus this environment will also contain all the variable mappings from dynEnvtr. Since we know Proof Since, according to the variable substitution operation  that dynEnvtr  ExprSingle ⇒ Val ,wealsohavethat we have that vars P = vars(P) \ dom(μ),wealsohave  dynEnvxs  ExprSingle ⇒ Val and thus dynEnv  Q ⇒ that vars P ∩ dom(μ) =∅, and it follows directly from ( , ,μ) = ,  { μ}  Val . Lemma 1 that evalxs D P eval D P . (⇒) Next we will show that if dynEnv  Q ⇒ Val then C.5 Proof for Proposition 3 dynEnv  tr(Q) ⇒ Val . We present the proof tree for each of the XQuery core expressions in the tr(Q) rewriting. The Proposition 3 Let Q be a SparqlForClauseof form (Q1) and proof trees are presented for each line of the tr(Q) rewrit- dynEnv the dynamic environment of Q, then dynEnv  Q ⇒ ing and, in each proof tree, Expr corresponds to the XQuery Val if and only if dynEnv  tr(Q) ⇒ Val . expressions of the following lines:

123 176 S. Bischof et al.

– let expression of line (1): Proof We now present the proof of the optnl rewriting func- ⎛ ⎞ select Vars tion for expressions of the form (Q3). ⎜ DatasetClause ⎟ We start by showing the proof for the base case, where dynEnv  xsp:sparqlCall⎜ ⎟ ⇒ ⎝ WhereClause ⎠ tr ExprSingle of (Q3) does not contain any occurrences of (Q3). SolutionModifier Base Case. (⇒) We start by showing that if dynEnv  Q ⇒ tr  ⇒ dynEnv1 Expr Res  ( ) ⇒ out in Val then dynEnv optnl Q Val . Consider xs and xs let $xsp:results :=⎛ ⎞ the solution sequences returned, respectively, by the evalu- select Vars ation of the outer and inner SparqlForClausesofQ and the ⎜ DatasetClause ⎟  dynEnv  xsp:sparqlCall⎜ ⎟ = out ∩ in ⎝ WhereClause ⎠ set of join variables J Vars vars WhereClause . μout ∈ out μin ∈ in SolutionModifier Furthermore, consider xs xs and xs xs the solu- return Expr ⇒ Res tion mappings that agree on the value of each join vari- able j ∈ J from where Val is generated, i.e., there exists where some dynamic environment dynEnvxs based on dynEnv and  μout μin tr xsp:results ⇒ extended with the variable mappings from xs and xs such dynEnv1 = dynEnv + varValue tr (T2) . that dynEnvxs  ExprSingle ⇒ Val . – for expression of line (2) We show now the proof tree for each of the XQuery core dynEnvtr  $xsp:results//sr:result ⇒ μ expressions in the optnl(Q) rewriting where, in each proof 1 i . tr  ⇒ .. tree, Expr corresponds to the XQuery expressions of the fol- dynEnv2 Expr Resi lowing lines: for $xsp:result at $PosVar tr  : // : dynEnv1 in $xsp results sr result return ⇒ , ··· , = in ∪ where Expr Res1 Resn – let expression of line (1), considering Vars Vars xsp:result ⇒ μ ; out∩ in dynEnvtr = dynEnvtr +varValue i Vars vars WhereClause , we have that 2 1 PosVar ⇒ i ⎛ ⎞ – let expressions of lines (3)–(4) select Vars ⎜ in ⎟ Here we consider all the let expressions represented by  ⎜ DatasetClause ⎟ ⇒ in dynEnv xsp:sparqlCall⎝ in ⎠ line (3), where $v ∈ Vars: WhereClause SolutionModifierin tr  : / : [ = ]/∗⇒ dynEnv2 $xsp result sr binding @name v V nl  ⇒ dynEnv1 Expr Res dynEnvtr  ExprSingle ⇒ Res 3 let $xsp:res_in :=⎛ ⎞ let $v := select Vars dynEnvtr  $xsp:result/sr:binding[@name = v]/∗ ⎜ DatasetClausein ⎟ 2 dynEnv  xsp:sparqlCall⎜ ⎟ return ExprSingle ⇒ Res ⎝ WhereClausein ⎠ SolutionModifierin tr = tr ( ⇒ ) where dynEnv3 dynEnv2 +varValuev V return Expr ⇒ Res xs where Consider the dynamic environment dynEnvi such that xs  ⇒ dynEnvi ExprSingle Val where, as we know  xs dynEnvnl = dynEnv + varValue xsp:res_in ⇒ in . (T3) from (T1), dynEnvi extends dynEnv by changing the ac- 1 tiveDataset and varValue environment components. μ Consider C , xs, and tr as before. From Lemma 2 we The function optnl(Q) translates the SparqlForClause get that xs = tr { μC } and since μC is created based from lines (4)–(6) of Q into the xsp:sparqlCall of on dynEnv.varValue, all the variable bindings from μC are already included in dynEnv. line (1). The inner SparqlForClause of Q is evaluated xs From the proof trees of tr(Q) we can see that the for considering some dynamic environment dynEnvi (and xs expression from line (2) iterates over the all the solution map- its expression context Ci ). Since dynEnvi is an exten- sion of dynEnv we have that dom(μC ) ⊆ dom μC . pings included in tr and the let expressions from lines (3) i tr tr The rewritten xsp:sparqlCall function is evaluated over and (4) ensure there exists a dynEnv2 such that dynEnv2 .var- Value contains all the variable bindings from dynEnvxs. the dynamic environment dynEnv (included in expression i μ varValue, and we have that dynEnvtr  ExprSingle ⇒ Val . context C). Consider C the XSPARQL instance mapping 2 μ  of C and Ci the XSPARQL instance mapping of Ci . Let in = eval DatasetClausein, WhereClausein,μ C.6 Proof for Proposition 4 xs xs Ci be the solution sequence resulting from the evaluation Proposition 4 Let Q be a XSPARQL expression of form (Q2) of the inner SparqlForClause of Q and the solution or (Q3) and dynEnv the dynamic environment of Q, then sequence resulting from evaluating the xsp:sparqlCall dynEnv  Q ⇒ Val if and only if dynEnv  optnl(Q) ⇒ Val . in = in, in in function be nl eval DatasetClause P , where P 123 Mapping between RDF and XML with XSPARQL 177

is the graph pattern obtained from replacing the vari- – let expressions of line (4) in ables in WhereClause according to μC .Asdom(μC ) ⊆ Here we consider all the let expressions represented by μ μ ∈ out dom Ci , i.e., C contains less bindings for variables line (4), where $v Vars : μ than Ci , the rewritten graph pattern Pin contains more nl  : / : [ = ]/∗⇒ dynEnv3 $xsp rout sr binding @name v V unbound variables, and we get that in in. xs nl dynEnvnl  Expr ⇒ Res – let expression of line (2) ⎛ ⎞ 4 select Vars out let $v := nl ⎜ DatasetClauseout ⎟ dynEnv  $xsp:rout/sr:binding[@name = v]/∗ dynEnvnl  xsp:sparqlCall⎜ ⎟ 3 1 ⎝ WhereClauseout ⎠ return Expr ⇒ Res out SolutionModifier where ⇒ out nl  ⇒ nl = nl ( ⇒ ) . dynEnv2 Expr Res dynEnv4 dynEnv3 +varValuev V let $xsp:res_out ⎛:= ⎞ select Vars out ⎜ DatasetClauseout ⎟ dynEnvnl  xsp:sparqlCall⎜ ⎟ – for expression of line (5) 1 ⎝ WhereClauseout ⎠ SolutionModifierout return Expr ⇒ Res where nl  : // : ⇒ dynEnv4 $xsp res_in sr result Si  . nl = nl : ⇒ out . dynEnvnl  Expr ⇒ Res .. dynEnv2 dynEnv1 +varValue xsp res_out 5 i for $xsp:rin at $PosVar out Regarding the SparqlForClauseof lines (1)–(3) of Q (eval- nl  : // : dynEnv4 in $xsp res_in sr result uated considering dynEnv), the optnl(Q) translates it into return Expr ⇒ Res1, ··· , Resn the xsp:sparqlCall from line (2), which is evaluated where over dynEnvnl. 1 : ⇒ ; Consider C the expression context where dynEnvnl is nl = nl xsp rin Si . 1 1 dynEnv5 dynEnv4 +varValue in ⇒ μ PosVar i included, C1 the XSPARQL instance mapping of C1 and Pout the graph pattern obtained from replacing the out μ variables in WhereClause according to C1 .From(T3) – if expression of lines (6)–(9) μ = (μ )∪{ : } we can see that dom C1 dom C $xsp res_in , xsp:res in xsp: but $ _ belongs to the $ reserved namespace so it will not be included in the variables Varsout ∩ vars(WhereClause) , dynEnvnl  join ⇒ true of WhereClause out, and we can observe that we 5 sr $xsp:res_out, $xsp:res_in out nl  ⇒ obtain the same graph pattern P by replacing dynEnv5 ExprSingle Res1 out μ out = WhereClause according to C .Let xs Varsout ∩ vars(WhereClause) , out, out,μ nl  if joinsr : , : evalxs DatasetClause WhereClause C be the dynEnv5 $xsp res_out $xsp res_in () ⇒ solution sequence resulting from evaluating the outer then ExprSingle else Res1 SparqlForClause according to XSPARQL semantics and out = out, out – let expressions of line (7) and (8) nl eval DatasetClause P be the pattern solu- tion resulting from evaluating the rewritten outer Sparql- Here we consider all the let expressions represented by ∈ out in ForClause according to SPARQL semantics. Following line (7), where $v Vars vars WhereClause . out = out { μ } Lemma 2, we have that xs nl C and, as we have seen from the proof of Proposition 3, since μ is C dynEnvnl  $xsp:res_in/sr:binding[@name = v]/∗⇒V already included in dynEnv, we have that out = out. 5 xs nl dynEnvnl  ExprSingle ⇒ Res – for expression of line (3) 6 let $v := nl  $xsp:res_in/sr:binding[@name = v]/∗ nl  : // : ⇒ μ dynEnv5 dynEnv2 $xsp res_out sr result i ⇒ . return ExprSingle Res nl  ⇒ .. dynEnv3 Expr Resi for $xsp:rout at $PosVar out where nl  : // : dynEnv2 in $xsp res_out sr result return ⇒ , ··· , nl = nl ( ⇒ ) . Expr Res1 Resn dynEnv6 dynEnv5 +varValuev V

where out = out in in Since we know that nl xs and xs nl, we obtain μout ∈ out μin ∈ in ( ) : ⇒ μ ; that xs nl and xs nl. Since optnl Q performs a nl nl xsp rout i dynEnv = dynEnv +varValue out . out in 3 2 PosVar ⇒ i nested loop iteration over nl and nl,thejoinsr function 123 178 S. Bischof et al.

μout in = in ... in will join the two solution mappings successfully since xs where, considering Vars $Var 1 $Var n ,wehave μin for each μ and xs share the same values for the join variables, and thus j we have that dynEnv  optnl(Q) ⇒ Val .  xs in ⇐  dynEnv1 +⎛ activeDataset DS ⎞ ( ) We now proceed by showing that if dynEnv in PosVar ⇒ j; opt (Q) ⇒ Val , then dynEnv  Q ⇒ Val .Letout and ⎜ ⎟ nl nl xs = ⎜Varin ⇒ fs:value μ , Varin ;⎟ in dynEnv2 ⎜ 1 j 1 ⎟ . nl be the pattern solutions returned by the outer and inner +varValue⎜··· ; ⎟ SparqlForClauses, respectively, and let μout ∈ out and ⎝ ⎠ nl nl Varin ⇒ fs:value μ , Varin μin ∈ in n j n nl nl be the solution mappings, where Val is deduced μout μin from, i.e., nl and nl agree on their values for the join variables. We also know that there must exist a dynamic As we know from the (⇒) direction of the proof, out = nl nl environment dynEnv , based on dynEnv and extended with out and so we have that μout ∈ out. Regarding the evalua- μout μin nl  xs nl xs the variable mappings nl and nl such that dynEnv tion of the inner SparqlForClausewe have that in in.We ⇒ xs nl ExprSingle Val . consider two cases: (i) μin ∈ in or (ii) μin ∈ in.In(i), we  ⇒ nl xs nl xs Let us turn to the evaluation of dynEnv Q Val . immediately get the desired result that dynEnv  Q ⇒ Val . For (ii), consider μxs the XSPARQL instance of the inner C1 xs – SparqlForClause from lines (1)–(3), where Expr corre- SparqlForClause(created based on dynEnv1 ). As we can see sponds to the SparqlForClause from lines (4)–(6) of from (T4), dynEnvxs (and thus also μxs ) includes the bind- 1 C1 μ ∈ out Q. The evaluation of this SparqlForClause consists of the ings for variables from each solution mapping i xs . application of Rule (D1): Thus, according to the XSPARQL BGP matching (cf. Def- in inition 10), xs will contain all the solution mappings that μ ∈ out are compatible with any solution mapping i xs and μout  specifically those compatible with nl . Since we know that dynEnv  fs:dataset DatasetClauseout ⇒ DSout μin μout μin nl is compatible with nl , we have that nl must belong to DSout, WhereClause, in dynEnv  fs:sparql ⇒ μ ; thus we can deduce that dynEnv  Q ⇒ Val . SolutionModifier i xs . Inductive Step The proof follows from the recursive dynEnvxs  Expr ⇒ Value .. 1 i application of the base case, over a new dynamic envi- for Varsout at $PosVar out ronment determined by the optnl rewriting to dynEnv  DatasetClauseout WhereClauseout i dynEnv  ( ) SolutionModifierout return optnl ExprSingle . Expr ⇒ Value 1 ···Value m The proof for nested queries with an XQuery for outer expression (Q2) is analogous where, in the preceding, the out = out ··· out μ with Vars $Var 1 $Var n , we have for each i evaluation of the SparqlForClause from lines (1)–(3) of (Q3)  is replaced by the evaluation of an XQuery ForClause,aspre- out dynEnv + activeDataset⎛ DS ⎞ sented by Draper et al. [27, Section 4.8.2].  out PosVar ⇒ i;  ⎜ ⎟ dynEnvxs = ⎜Varout ⇒ fs:value μ , Var out ;⎟ . (T4) 1 +varValue⎝ 1 i 1 ⎠ ... ;  out ⇒ : μ , out C.7 Proof for Proposition 5 Varn fs value i Var n Proposition 5 Let Q an XSPARQL expression of form (Q4) – SparqlForClause of lines (4)–(6): and dynEnv the dynamic environment of Q, then dynEnv  xs  ⇒ Q ⇒ Val if and only if dynEnv  optsr(Q) ⇒ Val . The evaluation of dynEnv1 Expr Value i is given by

Proof We start by showing the proof for the base case, where  ExprSingle of (Q4) does not contain any occurrences of (Q4). dynEnvxs  fs:dataset DatasetClausein ⇒ DSin 1 in, in, Base Case. (⇒) We start by showing that if dynEnv  xs  : DS WhereClause ⇒ μ dynEnv fs sparql in out 1 SolutionModi f ier j Q ⇒ Val then dynEnv  optsr(Q) ⇒ Val . Consider . xs xs  ⇒ .. in dynEnv2 ExprSingle Value j and xs the solution sequences returned, respectively, by the for Vars in at $PosVar in evaluation of the outer and inner SparqlForClausesofQ and in in = out ∩ in xs DatasetClause WhereClause J Vars Vars GGP the set of join variables. Further- dynEnv  expr 1 SolutionModi f ierin return more, consider dynEnv the dynamic environment result- ⇒ ··· i ExprSingle Value 1 Value m ing from extending dynEnv with the variable mappings from 123 Mapping between RDF and XML with XSPARQL 179

μout ∈ out μin ∈ in the compatible solution mappings xs xs and xs xs this does not interfere with this proof and solution modifiers expr  ⇒ such that dynEnvi ExprSingle Val . can be safely ignored in what follows. We now show the proof tree for each of the XQuery core Regarding the evaluation of the SparqlForClause from expressions in each line of the optsr rewriting where, for each lines (1)–(4) of Q (evaluated considering dynEnv), the line, Expr represents the expressions of the following lines: optsr(Q) translates it into the xsp:sparqlCall from line (1), which is also evaluated over dynEnv. In this case, accord- out = out – let expression of line (1) ingtoLemma2,wehavethat sr xs and then ⎛ ⎞ μout ∈ out. select out ∪ in xs sr Vars Vars Regarding the evaluation of the SparqlForClause from ⎜ DatasetClause ⎟ xsp:sparqlCall⎜ ⎟ dynEnv  ⎝ where GGPout GGPin ⎠ lines (5)–(8) of Q (evaluated considering some dynamic expr order by OCout OCin environment dynEnv ), the optsr(Q) rewriting incorpo- ⇒ sr rates it into the xsp:sparqlCall from line (1), which is sr  ⇒ dynEnv1 Expr Res also evaluated over dynEnv. Considering that dynEnv is less expr let $xsp:results ⎛:= ⎞ restrictive than dynEnv , i.e., dynEnv contains less bind- select Vars out ∪ Vars in ings for variables than dynEnvexpr , and thus the evaluation ⎜ ⎟  ⎜ DatasetClause ⎟ dynEnv xsp:sparqlCall⎝ out in ⎠ of the inner SparqlForClause over dynEnv will contain all where GGP GGP in μin μout order by OCout OCin the solution mappings from xs and specifically xs.As xs μin  ( ) ⇒ return Expr ⇒ Res and xs are compatible we have that dynEnv optsr Q where Val .  sr = : ⇒ . dynEnv1 dynEnv+ varValue xsp results sr (⇐) Next we show that if dynEnv  optsr(Q) ⇒ Val  ⇒ out in According to the SPARQL semantics, the solution then dynEnv Q Val . Consider sr and sr as per ⇒ sequence that results from evaluating the graph pattern the ( ) direction of the proof and the set of join vari- = out ∩ in GGPout GGPin, = out in consists of all the ables J Vars vars GGP . As we have seen sr sr sr sr μ = μout μin solution mappings μout ∈ out and μin ∈ in such that contains all the solution mappings sr sr sr sr sr sr μout ∈ out μin ∈ in μout μout and μin are compatible. such that sr sr and sr sr and sr , and sr sr μin The following for expression iterates over all these com- sr are compatible. Without loss of generality consider μout μin patible solution mappings: sr and sr the solution mappings where Val is deduced – for expression of line (2) from. Let us turn to the evaluation of dynEnv  Q ⇒ Val . dynEnvsr  $xsp:results//sr:result ⇒ μ 1 i . sr  ⇒ .. dynEnv2 ExprSingle Resi – SparqlForClause from lines (1)–(4), where Expr corre- for $xsp:result at $PosVar out sponds to the SparqlForClause from lines (5)–(8) of Q. sr  : // : dynEnv1 in $xsp results sr result Again, the evaluation of this SparqlForClause consists of return ExprSingle ⇒ Res , ··· , Res 1 n the application of Rule (D1): : ⇒ μ ; sr sr xsp result i where dynEnv = dynEnv +varValue out 2 1 PosVar ⇒ i dynEnv  fs:dataset(DatasetClause) ⇒ DS – let expressions of lines (3)–(4) DS, where GGPout dynEnv  fs:sparql ⇒ μ order by OCout i Here we consider all the let expressions represented by . xs  ⇒ .. line (3), where $v ∈ Vars: dynEnv1 Expr Value i for Varsout at $PosVar out DatasetClause dynEnvsr  $xsp:result/sr:binding[@name = $v]/∗⇒V where GGPout 2 dynEnv  sr  ⇒ order by OCout dynEnv3 ExprSingle Res return Expr ⇒ Value 1 ···Value m let $v := sr  : / : [ = ]/∗ dynEnv2 $xsp result sr binding @name v return ExprSingle ⇒ Res out = out ··· out μ where where Vars $Var 1 $Var n , we have for each i

sr = sr ( ⇒ ) . ( ) dynEnv3 dynEnv2 +varValuev V dynEnv + activeDataset⎛ DS ⎞ PosVarout ⇒ i;  ⎜ ⎟ dynEnvxs = ⎜ Varout ⇒ fs:value μ , Varout ;⎟(T5) 1 +varValue⎝ 1 i 1 ⎠ Note that we are only considering order by solution ··· ;  out ⇒ : μ , out modifiers; thus the number of results of each query is not Varn fs value i Varn changed. At most the ordering of the results is changed but 123 180 S. Bischof et al.

– SparqlForClause of lines (4)–(6): C.8 Proof for Proposition 6 xs  out ⇒ The evaluation of dynEnvi ExprSingle Value i is shown next: Proposition 6 Let Q be an XSPARQLexpression of form (Q5) and dynEnv the dynamic environment of Q, then dynEnv  Q ⇒ Val if and only if dynEnv  optng(Q) ⇒ Val .

dynEnvxs  fs:dataset(DatasetClause) ⇒ DS 1 ⎛ ⎞ Proof We start by showing the proof for the base case, where DS, ExprSingle1 and ExprSingle2 of (Q5) do not contain any dynEnvxs  fs:sparql⎝where GGPin ⎠ ⇒ μ 1 j occurrences of (Q5). order by OCin . xs  ⇒ .. Base Case. (⇒) Let us start by showing that if dynEnv  dynEnv2 ExprSingle Value j Q ⇒ Val then dynEnv  opt (Q) ⇒ Val . Consider in for Vars in at $PosVar in DatasetClause ng xs in the solution sequence returned by the evaluation of the inner xs  where GGP dynEnv1 in expr order by OC SparqlForClausesofQ. Furthermore, consider dynEnvi return ExprSingle ⇒ Value ···Value expr  ⇒ 1 m such that dynEnvi ExprSingle2 Val . The dynamic expr environment dynEnvi results from extending dynEnv with in = in ··· in where Vars $Var 1 $Var n , we have for each bindings for the outer variable $Var Name and with the var- μ μin ∈ in j iable bindings from a solution mapping xs xs where μin( ) = xs VarName $VarName, i.e., the value for the join var- xs ( ) in dynEnv1 +⎛ activeDataset DS ⎞ iable in the solution mapping μ is the same as assigned to in xs PosVar ⇒ j; ⎜ ⎟ $Var Name. xs = ⎜Varin ⇒ fs:value μ , Varin ;⎟ . dynEnv2 ⎜ 1 j 1 ⎟ We now show the proof tree for each of the XQuery core +varValue⎜··· ; ⎟ ⎝ ⎠ expressions in the optng rewriting. in ⇒ : μ , in Varn fs value j Varn – let expression of line (1) As we have seen in the(⇒) direction, we have that out = sr = [] :value out μout ∈ out Considering NGP $VarName ,wehave xs and so we have that sr xs . Consider C the expression context where dynEnv is μ included and C the XSPARQL instance mapping of C. ⎛ ⎞ Further, consider Pin the graph pattern obtained from for $VarName ⎜ ⎟ in μ ⎜ OptTypeDeclaration ⎟ replacing the variables in GGP according to C . Since dynEnv  xsp:createNG⎜ OptPositionalVar ⎟ in ⊆ in ⎝ ⎠ vars GGP vars P all solutions mappings returned in ExprSingle1 return by evaluating GGPin under XSPARQL semantics are xsp:evalTemplate(NGP) included in the solution sequence of evaluating Pin under ⇒ DS in in ng  ⇒ SPARQL semantics, i.e., xs sr. We obtain two cases: dynEnv1 ExprSingle2 Res μin ∈ in μin ∈ in : := (i) sr xs or (ii) sr xs.In(i) we imme- let $xsp ds ⎛ ⎞ diately get that dynEnv  Q ⇒ Val .For(ii), con- for $VarName ⎜ ⎟ sider μxs the XSPARQL instance of the inner SparqlFor- ⎜ OptTypeDeclaration ⎟ C1 dynEnv  xsp:createNG⎜ OptPositionalVar ⎟ Clause(created based on dynEnvxs). As we can see from (T5), ⎝ ⎠ 1 in ExprSingle1 return dynEnvxs (and thus also μxs ) includes the bindings for xsp:evalTemplate(NGP) 1 C1 μ ∈ out return ExprSingle ⇒ Res variables from each solution mapping i xs . Thus, according to the XSPARQL BGP matching (cf. Defini- in tion 10), xs will contain all the solution mappings that where μ ∈ out are compatible with any solution mapping i xs and μout ng specifically those compatible with sr . Since we know dynEnv = dynEnv + varValue(xsp:ds ⇒ DS) . (T6) μin μout μin 1 that sr is compatible with sr , we have that sr must in  ⇒ belong to xs; thus we can deduce that dynEnv Q – let expression of line (2) Val . ng Consider the dataset clause DatasetClause =Dataset- Inductive Step The proof follows from the recursive Clause ∪ from named $xsp:ds and the graph pat- application of the base case, over a new dynamic envi- tern WhereClauseng = WhereClause ∪ where {graph  ronment determined by the optsr rewriting to dynEnvi $xsp:ds [] :value $VarName . optsr(ExprSingle). 

123 Mapping between RDF and XML with XSPARQL 181

⎛ ⎞ ng  : / : [ = ]/∗⇒ select dynEnv3 $xsp result sr binding @name $v V ⎜ Vars ∪ $VarName ⎟ ng ⎜ ⎟ dynEnv  ExprSingle ⇒ Res ng  xsp:sparqlCall⎜ DatasetClauseng ⎟ ⇒ 4 2 dynEnv1 ⎝ ⎠ WhereClauseng let $v := ng  : / : [ = ]/∗ SolutionModifier dynEnv3 $xsp result sr binding @name $v return ExprSingle ⇒ Res ng  ⇒ dynEnv2 ExprSingle2 Res : := let $xsp results ⎛ ⎞ select ⎜ ∪ ⎟ where ⎜ Vars $VarName ⎟ ng  xsp:sparqlCall⎜ DatasetClauseng ⎟ dynEnv1 ⎝ ⎠ ng ng = ng ( ⇒ ) . WhereClause dynEnv4 dynEnv3 +varValuev V SolutionModifier ⇒ return ExprSingle2 Res where Similarly to the proof of Proposition 5, we are only con-  sidering order by solution modifiers; these only change the ng = ng : ⇒ . dynEnv2 dynEnv1 +varValue xsp results order of the solution sequences and thus can be safely ignored for this proof. The new merged dataset, DatasetClauseng, is created Regarding the evaluation of the XQuery for clause from based on DatasetClause and the newly created named lines (1)–(2) of Q (evaluated considering dynEnv), the graph NG. Since the URI that identifies the newly optng(Q) translates it into the xsp:sparqlCall from line (2), ng created named graph NG is distinct from any URI which is evaluated considering dynEnv1 . As we can see ng of named graphs present in DatasetClause, the triples from (T6), dynEnv1 is based on dynEnv by adding the bind- included in NG will never be a solution for Where- ing for the xsp:ds variable. Since this variable belongs to : Clause, and will be matched only by the graph pattern the xsp reserved namespace, it is not allowed in the Whe- where graph $xsp:ds [] :value $VarName . reClause and so we have that the results of evaluating the ng Consider C the expression context where dynEnv is xsp:sparqlCall function over dynEnv or dynEnv1 will included, μC the XSPARQL instance mapping of C be the same. and Pout and Pin the graph patterns obtained from, Regarding the evaluation of the SparqlForClausefrom respectively, replacing the variables in WhereClauseand lines (3)–(4) of Q (evaluated considering some dynamic envi- expr where {graph $xsp:ds [] :value $VarName ronment dynEnv ), the optng(Q) also incorporates it into according to μC .  the xsp:sparqlCall from line (2), which is evaluated over out ng out ng ng Furthermore, let = eval DatasetClause , P and dynEnv1 . Considering that dynEnv1 is less restrictive than  ng ng in ng in expr = eval DatasetClause , P . According to SPAR- dynEnv , i.e. dynEnv1 contains less bindings for vari- ng expr QL semantics, the pattern solution that results from eval- ables than dynEnv , the evaluation of the inner SparqlFor- ng out in uating WhereClause, ng = consists of all Clauseover dynEnv1 will contain all the solution mappings ng ng in μ μ μ μ ∈ out μ ∈ in from xs and specifically in.As out and in are compatible the solution mappings out ng and in ng such μ μ we have that dynEnv  ng(expr) ⇒ Val . that out and in are compatible. – for expression of line (3) (⇐) Next we will show that if dynEnv  optng(Q) ⇒ Val ng  xsp:results//sr:result ⇒ μ dynEnv2 $ i  ⇒ out in . then dynEnv Q Val . Consider ng and ng the solu- ng  ⇒ .. dynEnv3 ExprSingle2 Resi tion sequences returned by, respectively, the evaluation of the : : ng for $xsp result at $xsp result_pos new WhereClause and WhereClause. As we have seen ng dynEnvng  in $xsp:results//sr:result 2 contains all the solution mappings μ = μout μin , where return ExprSingle ⇒ Res , ··· , Res ng ng 1 n μout ∈ out μin ∈ in μout μin ng ng and ng ng, such that ng and ng are com- patible. Again, consider μout and μin the pattern solutions where ng ng where Val is deduced from. ng ng xsp:result ⇒ μ ;  ⇒ dynEnv = dynEnv +varValue i . Let us turn to the evaluation of dynEnv Q Val . 3 2 xsp:result_pos ⇒ i

– XQuery for clause from lines (1)–(2): – let expressions of lines (4)–(5) Expr corresponds to the SparqlForClause from lines (3)– Here we consider all the let expressions represented by line (4), where $v ∈ Vars: (4) of Q.

123 182 S. Bischof et al.

dynEnv  ExprSingle ⇒ V 1 i . Inductive Step. Let us assume that, for some arbitrary xs  ⇒ .. dynEnvi ExprSingle1 Value i  ⇒ dynEnvi , dynEnvi  ExprSingle 1 Val i if and only for $VarName OptTypeDeclaration if dynEnv  opt ExprSingle ⇒ Val . According to  i ng 1 i dynEnv OptPositionalVar in ExprSingle1 opt return ⇒ ... the ng rewriting, there must exist a dynEnv j that is Expr Value i Value n  the extension of dynEnvi with Val i and thus dynEnv j ⇒  ExprSingle2 Val if and only if dynEnv optng ExprSingle2 we have for each Vi: j ⇒ Val . Consequently, we have that dynEnv  Q ⇒ Val if   ( ) ⇒  xs ⇒ and only if dynEnv optng Q Val . dynEnvi = dynEnv + varValue VarName Vi . (T7)

D The XMarkRDF Benchmark – SparqlForClause of lines (2)–(4): For the evaluation of our implementation we created a benchmark suite based on the XMark benchmark suite [57]. dynEnvxs  fs:dataset(DatasetClause) ⇒ DS i According to [3], the XMark suite is the most widely used DS, WhereClause, dynEnvxs  fs:sparql ⇒ μ benchmark suite for XQuery. It provides a data generator that i SolutionModifier j . produces XML data simulating an auction website (includ- dynEnvxs  ExprSingle ⇒ Value .. j 2 j ing information about persons and items they bid for) and for Vars at $PosVar DatasetClause includes 20 XQuery queries, referred to as q to q hence- xs  1 20 dynEnvi WhereClause SolutionModifier ⇒ ··· forth, over this generated data. return ExprSingle2 Value 1 Value m In order to benchmark the XSPARQL language we also require data in RDF format; hence we provide transforma- where, considering Vars = $Var 1 ...$Var n,wehavefor each μ : tions (in fact, using XSPARQL queries) from XML datasets j generated by XMark into RDF triples. In this transformation we replicate all the data in the original XMark datasets as dynEnvxs + activeDataset(DS) i ⎛ ⎞ RDF triples. We start by generating IRIs for each XML ele- PosVar ⇒ j; xs ⎜ ⇒ : μ , ;⎟ ment that represents concepts like “persons,” “items,” “bids,” dynEnv = ⎜Var1 fs value j Var1 ⎟ . j +varValue⎜ ⎟ etc. Inner XML element names are then converted into RDF ⎝··· ; ⎠ ⇒ : μ , predicates and used to link the generated IRIs to the leaf ele- Varn fs value j Varn ment values which are converted into RDF literals. Next, we converted the XMark queries into corresponding XSPARQL As we have seen in the (⇒) direction, we have that out = ng queries using SparqlForClauses to access the RDF data. We out and so we have that μout ∈ out. xs ng xs call this new benchmark suite the XMarkRDF benchmark Consider C the expression context where dynEnv is and is available for download at http://xsparql.deri.org/data/ included and μ the XSPARQL instance mapping of C. C XMarkRDF/. Further consider Pin the graph pattern obtained from replac- From the initial set of 20 queries there are 5 queries (q – ing the variables in WhereClausein according to μ . Since 8   C q ) which contain nested expressions. They are described we know that vars WhereClausein ⊆ vars Pin , all solu- 12 informally in the XMark suite as follows: tions mappings returned by evaluating WhereClausein under XSPARQL semantics are included in the pattern solution of (q8) “List the names of persons and the number of items they in in in evaluating P under SPARQL semantics, i.e., xs ng. bought;” μin ∈ in μin ∈ in (q ) “List the names of persons and the names of the items We obtain two cases: (i) ng xs; or (ii) ng xs.In 9 (i) we immediately get that dynEnv  Q ⇒ Val .For(ii), they bought in Europe;” consider μxs the XSPARQL instance of the inner SparqlFor- (q10) “List all persons according to their interest;” C1 xs (q11) “List the number of items currently on sale whose price Clause(created based on dynEnv1 ). As we can see from (T7), dynEnvxs (and thus also μxs ) includes the bindings for vari- does not exceed 0.02% of the seller’s income;” and 1 C1 μ ∈ out (q12) “For each richer-than-average person, list the number ables from each solution mapping i xs . Thus, accord- in of items currently on sale whose price does not exceed ing to the XSPARQL BGP matching (cf. Definition 10), xs will contain all the solution mappings that are compatible 0.02% of the person’s income.” μ ∈ out with any solution mapping i xs and specifically those μout μin compatible with ng . Since we know that ng is compatible Figure 20a, b presents XMark query q9 and its translated μout μin in with ng , we have that ng must belong to xs; thus we can XSPARQL version in XMarkRDF, respectively. We have deduce that dynEnv  Q ⇒ Val . made two changes to the XMark queries: (1) SPARQL que- 123 Mapping between RDF and XML with XSPARQL 183

Table 4 XMarkRDFS2XQ dataset and translation times Scaling Dataset Translation factor size (MB) times (s)

0.01 3.318.94 0.02 6.418.30 0.05 16.126.08 0.10 32.739.01 0.20 65.362.35 . . (a) 0.50 162 3 143 35 1.00 326.2 329.93

a native XQuery engine. For further comparison between XSPARQL and the SPARQL2XQuery language, and other related works, we refer the reader to Sect. 8. Query q9, as presented in Fig. 20c, is ready to be evaluated by the SPARQL2XQuery system over the XMarkRDFS2XQ dataset. Please note that this query follows the syntax pre- sented in [33], since we only had access to the imple- (b) mentation of the translation from SPARQL to XQuery, while the evaluation was done using the associated XQuery code. We focussed in our experimental evaluation on query response time rather than on data transformation time, and as SPARQL2XQuery requires an additional translation step from RDF to a custom RDF/XML format, we converted the XMarkRDF RDF data into the format required by the SPARQL2XQuery system. An overview of this transla- tion process, including the translation times, is presented in Table 4. We denote these new datasets, containing the RDF/XML format required for the SPARQL2XQuery, by XMarkRDFS2XQ. (c) Optimising XMarkRDF Nested Queries Fig. 20 Variants of benchmark query q9 The different rewritings presented in Sect. 6 can be applied ries do not guarantee any default ordering; hence all original to the four nested queries q8–q11. Query q12 also consists of a XMark queries were declared unordered—as a consequence nested expression; however, the most accurate translation of the XQuery engine is not required to follow document order this query into XSPARQL results in the dependent variable when executing the query; and (2) we added the external not being strictly bound since it occurs only in the filter variables $xml and $rdf in the XQuery and XSPARQL of the inner query. As such, we cannot apply the different query, respectively, as parameters used to specify the URL rewritings to this query. identifying the input benchmark instance. XMarkRDF query q9 is presented in Fig. 20 on the facing We included the SPARQL2XQuery system, which is sim- page. This query is close to queries q8, q10, and q11 and con- ilarinspirittoXSPARQL,by[33] in our system comparison. sists of a nested expression: the inner for expression of the While the language allows to perform similar queries to the query (lines 8–12) is executed once for each person matched XSPARQL language, the implementation follows a differ- by the outer expression (lines 5–6), which means that one ent approach to integrate the XML and RDF data. Rather SPARQL call will be made for each person separately. Thus, than performing interleaved calls to a SPARQL engine, the the number of SPARQL calls performed in the inner expres- SPARQL2XQuery system relies on translating the RDF data sion directly depends on the size of the dataset (cf. Table 5 for into a pre-defined XML format and transforming SPARQL details). Queries q8, q9, and q11 evaluate the inner expression queries into equivalent XQuery over the pre-defined XML for each person, while q10 evaluates the inner expression for format. The translated queries can be directly executed using each category. Each dataset contains usually about 25 times 123 184 S. Bischof et al.

Table 5 Benchmark dataset description References Scaling Persons Categories XMark (MB) XMarkRDF (MB) factor 1. Abiteboul S, Hull R, Vianu V (1995) Foundations of databases. Addison-Wesley, New York 0.01 255 10 1.11.2 2. Adida B, Birbeck M, McCarron S, Pemberton S (2008) RDFa in XHTML: syntax and processing. W3C recommendation, W3C. 0.02 510 20 2.32.3 http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014 0.05 1,275 50 5.85.8 3. Afanasiev L, Marx M (2008) An analysis of XQuery benchmarks. 0.10 2,550 100 11.712.4 Inf Syst 33(2):155–181 4. Akhtar W, Kopecký J, Krennwallner T, Polleres A (2008) XSP- 0.20 5,100 200 23.524.9 ARQL: traveling between the XML and RDF worlds—and avoid- 0.50 12,750 500 58.061.7 ing the XSLT pilgrimage. In: ESWC’08. Springer, Berlin, pp 1.00 25,500 1,000 116.5 124.8 432–447 5. Angles R, Gutierrez C (2010) SQL nested queries in SPARQL. In: AMW’10, CEUR-WS.org, vol 619 6. Battle S (2006) Gloze: XML to RDF and back again. In: Jena user conference’06 7. Beckett D, Berners-Lee T (2008) Turtle—terse RDF triple lan- guage. http://www.w3.org/TeamSubmission/turtle/ 8. Beckett D, Broekstra J (2008) SPARQL query results XML format. (a) W3C recommendation, W3C. http://www.w3.org/TR/2008/REC- rdf-sparql-XMLres-20080115/ 9. Beckett D, McBride B (eds) (2004) RDF/XML syntax specifica- tion (Revised). W3C recommendation, W3C. http://www.w3.org/ TR/2004/REC-rdf-syntax-grammar-20040210/ 10. Berger S, Bry F, Furche T, Linse B, Schroeder A (2006) Beyond XML and RDF: the versatile Web query language Xcerpt. In: Carr (b) L, Roure DD, Iyengar A, Goble CA, Dahlin M (eds) WWW. ACM, New York, pp 1053–1054 Fig. 21 q q Example output excerpts of queries 9 and 9 11. Berglund A, Boag S, Chamberlin D, Fernández MF, Kay M, Ro- bie J, Siméon J (2010) XML path language (XPath) 2.0, 2nd edn. W3C recommendation, consortium. http://www. w3.org/TR/2010/REC-xpath20-20101214/ more persons than categories. The rewriting strategies pre- 12. Berrueta D, Labra JE, Herman I (2008) XSLT+SPARQL: scripting sented in Sect. 6 reduce the number of SPARQL calls to two: the semantic web with SPARQL embedded into XSLT stylesheets. one to get all the people (similar to the direct rewriting ver- In: Workshop on scripting for the semantic web 13. Beyer KS, Cochrane R, Josifovski V, Kleewein J, Lapis G, Loh- sion), and one additional SPARQL call for retrieving all the man GM, Lyle R, Özcan F, Pirahesh H, Seemann N, Truong TC, information about all the auctions in the dataset. Although der Linden BV, Vickery B, Zhang C (2005) System RX: one part the query remains exponential, the practical evaluation will relational, one part XML. In: Özcan F (ed) SIGMOD conference. show that reducing the number of SPARQL calls drastically ACM, New York, pp 347–358 14. Bikakis N, Gioldasis N, Tsinaraki C, Christodoulakis S (2009) improves query execution times (Table 2). Querying XML Data with SPARQL. In: DEXA’09, vol 5690. As mentioned in Sect. 6.2, for the SPARQL-based rewrit- Springer, Berlin, pp 372–381 ings, we want the query output to be computable directly 15. Bischof S (2010) Full XSPARQL grammar. http://xsparql.deri.org/ in SPARQL without any further processing, i.e., we do not doc/grammar.html 16. Bischof S (2012) Optimising XML-RDF data integration. In: Sim- want to use XQuery for further processing of the SPAR- perl E (ed) ESWC 2012. LNCS, vol 7295. Springer, Heidelberg, QL results, and the query should be expressible in SPAR- pp 838–843 QL without features from SPARQL 1.1. Since the original 17. Bohring H, Auer S (2005) Mapping XML to OWL ontologies. In: Leipziger Informatik-Tage, GI, vol 72, pp 147–156 nested queries q8–q11 group the output results (while option- 18. Bray T, Paoli J, Sperberg-Mcqueen CM, Maler E, Yergeau F (2008) ally applying some aggregation function), we need to include Extensible markup language (XML) 1.0, 5th edn. W3C recom- modified versions of these benchmark queries for the eval- mendation, World Wide Web consortium. http://www.w3.org/TR/ uation of the SPARQL based rewritings. In these modified 2008/REC-xml-20081126/ queries, denoted q –q , we changed the return format of the 19. Brickley D, Guha R (2004) RDF vocabulary description language 8 11 1.0: RDF schema. W3C recommendation, W3C. http://www.w3. queries to consist of a flattened representation of the output org/TR/2004/REC-rdf-schema-20040210/ of the original query. An example of the output for queries q9 20. Brickley D, Miller L (2007) FOAF vocabulary specification. http://    and q9 is presented in Fig. 21. All queries qi and qi follow xmlns.com/foaf/spec/ a similar strategy for reformatting the output: the queries 21. Carroll JJ, Stickler P (2004) TriX, RDF triples in XML.   Tech. Rep. HPL-2003-268, HP Labs. http://www.hpl.hp.com/ resulting from applying optsr are named q8–q11, while the techreports/2004/HPL-2004-56.html queries that consist of an outer for expression—to which 22. Chamberlin D, Robie J, Boag S, Fernández MF, Siméon J, Flo-   optng was applied—are q8 –q11. rescu D (2010) XQuery 1.0: an XML query language, 2nd edn.

123 Mapping between RDF and XML with XSPARQL 185

W3C recommendation, W3C. http://www.w3.org/TR/2010/REC- 40. Iannella R (2010) Representing vCard Objects in RDF. http://www. xquery-20101214/ w3.org/Submission/vcard-rdf/. W3C member submission 23. Clark KG, Feigenbaum L, Torres E (2008) SPARQL protocol for 41. Katz H, Chamberlin D, Kay M, Wadler P, Draper D (2003) XQue- RDF. W3C recommendation, W3C. http://www.w3.org/TR/2008/ ry from the experts: a guide to the W3C XML query language. REC-rdf-sparqlprotocol-20080115/ Addison-Wesley 24. Connolly D (2007) Gleaning resource descriptions from dialects 42. Kay M (ed) (2007) XSL transformations (XSLT) version 2.0. W3C of languages (GRDDL). W3C recommendation, W3C. http://www. recommendation, W3C. http://http://www.w3.org/TR/2007/REC- w3.org/TR/2007/REC-grddl-20070911/ xslt20-20070123/ 25. Corby O, Kefi-Khelif L, Cherfi H, Gandon F, Khelif K (2009) Que- 43. Koch C (2006) On the complexity of nonrecursive XQuery and rying the semantic web of data using SPARQL, RDF and XML. functional query languages on complex values. ACM Trans Data- Tech. Rep. 6847, Institut National de Recherche en Informatique base Syst 31(4):1215–1256 et en Automatique 44. Kopecký J, Vitvar T, Bournez C, Farrell J (2007) SAWSDL: 26. Deursen DV, Poppe C, Martens G, Mannens E, Walle RVd (2008) semantic annotations for WSDL and XML schema. IEEE Inter- XML to RDF conversion: a generic approach. In: AXMEDIS’08. net Comput 11(6):60–67 IEEE, pp 138–144 45. Malhotra A, Melton J, Walsh N (eds) (2010) XQuery 1.0 and XPath 27. Draper D, Fankhauser P, Fernández M, Malhotra A, Rose K, Rys 2.0 functions and operators, 2nd edn. W3C recommendation, W3C. M, Siméon J, Wadler P (2010) XQuery 1.0 and XPath 2.0 formal http://www.w3.org/TR/2010/REC-xpath-functions-20101214/ semantics, 2nd edn. W3C recommendation, W3C. http://www.w3. 46. Manola F, Miller E (2004) RDF primer. W3C recommendation, org/TR/2010/REC-xquerysemantics-20101214/ W3C. http://www.w3.org/TR/2004/REC-rdf-primer-20040210/ 28. Droop M, Flarer M, Groppe J, Groppe S, Linnemann V, Pingger- 47. May N, Helmer S, Moerkotte G (2003) Three cases for query dec- a J, Santner F, Schier M, Schopf F, Staffler H, Zugal S (2008) orrelation in XQuery. In: Xsym’03, vol 2824. Springer, Berlin, pp Embedding XPath queries into SPARQL queries. In: ICEIS’08, 70–84 pp 5–14 48. May N, Stuckenschmidt H (2007) Querying embedded RDF with 29. Farrell J, Lausen H (2007) Semantic annotations for WSDL and XML technology: a feasibility study. In: XML Tage 2007. Freie XML schema. W3C recommendation, W3C. http://www.w3.org/ University, Berlin TR/2007/REC-sawsdl-20070828/ 49. Passant A, Kopecký J, Corlosquet S, Berrueta D, Palmisano 30. Fernández MF, Malhotra A, Marsh J, Nagy M, Walsh N (2010) D, Polleres A (2009) XSPARQL: use cases. http://www.w3.org/ XQuery 1.0 and XPath 2.0 data model (XDM), 2nd edn. Submission/xsparql-use-cases/. W3C member submission W3C recommendation, W3C. http://www.w3.org/TR/2010/REC- 50. Pérez J, Arenas M, Gutierrez C (2008) nSPARQL: a navigational xpath-datamodel-20101214/ language for RDF. In: ISWC’08, vol 5318. Springer, Berlin, pp 31. Fischer P, Florescu D, Kaufmann M, Kossmann D (2011) Translat- 66–81 ing SPARQL and SQL to XQuery. In: XMLPrague’11, pp 81–98 51. Pérez J, Arenas M, Gutierrez C (2009) Semantics and complexity 32. Gearon P, Passant A, Polleres A (2011) SPARQL 1.1 update. of SPARQL. ACM Trans Database Syst 34(3):1–45 W3C working draft, W3C. http://www.w3.org/TR/2011/WD- 52. Polleres A (2007) From SPARQLto rules (and back). In: WWW’07 sparql11update-20110512/ 53. Polleres A, Scharffe F, Schindlauer R (2007) SPARQL++ for map- 33. Groppe S, Groppe J, Linnemann V,Kukulenz D, Hoeller N, Reinke ping between RDF vocabularies. In: ODBASE’07. Springer, Berlin C (2008) Embedding SPARQL into XQuery/XSLT. In: SAC’08. 54. Prud’hommeaux E, Seaborne A (eds) (2008) SPARQL query lan- ACM, New York, pp 2271–2278 guage for RDF. W3C recommendation, W3C. http://www.w3.org/ 34. Grust T, Sakr S, Teubner J (2004) XQuery on SQL hosts. In: TR/2008/REC-rdf-sparql-query-20080115/ Nascimento MA, Özsu MT, Kossmann D, Miller RJ, Blakeley JA, 55. Robie J, Chamberlin D, Dyck M, Florescu D, Melton J, Siméon J Schiefer KB (eds) VLDB. Morgan Kaufmann, pp 252–263 (2011) XQuery update facility 1.0. W3C recommendation, W3C. 35. Grust T, Rittinger J, Teubner J (2007) eXrQuy: order indifference http://www.w3.org/TR/xquery-update-10/ in XQuery. In: Chirkova R, Dogac A, Özsu MT, Sellis TK (eds) 56. Rodrigues T, Rosa P, Cardoso J (2008) Moving from syntac- ICDE. IEEE, pp 226–235 tic to semantic organizations using JXML2OWL. Comput Ind 36. Grust T, Mayr M, Rittinger J (2010) Let SQL drive the XQuery 59(8):808–819 workhorse (XQuery join graph isolation). In: EDBT’10, vol 426. 57. Schmidt A, Waas F, Kersten ML, Carey MJ, Manolescu I, Busse ACM, pp 147–158 R (2002) XMark: a benchmark for XML data management. In: 37. Hartig O, Heese R (2007) The SPARQL query graph model for VLDB’02. Morgan Kaufmann, pp 974–985 query optimization. In: Franconi E, Kifer M, May W (eds) ESWC, 58. Walsh N (2003) RDF Twig: accessing RDF graphs in XSLT. In: vol 4519. Springer, Berlin, pp 564–578 Extreme markup languages’03 38. Harris S, Seaborne A (2011) SPARQL 1.1 query language. 59. Zhou M, Wu Y (2010) XML-based RDF data management for effi- W3C working draft, W3C. http://www.w3.org/TR/2011/WD- cient query processing. In: Dong XL, Naumann F (eds) Proceedings sparql11query-20110512/ of the 13th international workshop on the web and databases 2010, 39. Hayes P (2004) RDF semantics. W3C recommendation, W3C. WebDB 2010, Indianapolis, Indiana, USA, June 6, 2010 http://www.w3.org/TR/2004/REC-rdf-mt-20040210/

123