Flexible Query Processing for SPARQL

View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by Birkbeck Institutional Research Online

Undeﬁned 0 (2015) 1–0 1 IOS Press

Editor(s): Christina Unger, UniversitätBielefeld, Germany; Axel-Cyrille Ngonga Ngomo, UniversitätLeipzig, Germany; Philipp Cimiano, UniversitätBielefeld, Germany; Sören Auer, UniversitätBonn, Germany; George Paliouras, NCSR Demokritos, Greece Solicited review(s): Michael Schmidt, Technische UniversitätMünchen, Germany; Aidan Hogan, Universidad de Chile, Chile; 1 anonymous

Riccardo Frosini a Andrea Cal`ı a,b Alexandra Poulovassilis a Peter T. Wood a a London Knowledge Lab, Birkbeck, University of London, UK Email: {riccardo,andrea,ap,ptw}@dcs.bbk.ac.uk b Oxford-Man Institute of Quantitative Finance, University of Oxford, UK

Abstract. Flexible querying techniques can enhance users’ access to complex, heterogeneous datasets in settings such as Linked Data, where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators. Our ﬂexible query processing approach is based on query rewriting and returns answers incrementally according to their “distance” from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.

Keywords: Semantic Web, SPARQL 1.1, path queries, query approximation, query relaxation

1. Introduction Example 1. Suppose a user wishes to find events that took place in London on 15th September 1940 Flexible querying techniques have the poten- and poses the following query on the YAGO knowl- tial to enhance users’ access to complex, hetero- edge base1, which is derived from multiple sources geneous datasets. In particular, users querying such as Wikipedia, WordNet and GeoNames: Linked Data may lack full knowledge of the structure of the data, its irregularities, and the URIs (x, on, “15/09/1940”) AND (x, in, “London”) used within it. Moreover, the schemas and URIs (The above is not a complete SPARQL query, but used can also evolve over time. This makes it dif- is sufficient to illustrate the problem we address.) ficult for users to formulate queries that precisely This query returns no results because there are no express their information retrieval requirements. property edges named “on” or “in” in YAGO. Hence, providing users with flexible querying ca- Approximating “on” by “happenedOnDate” and pabilities is desirable. “in” by “happenedIn” (which do appear in YAGO) SPARQL is the predominant language for query- gives the following query: ing RDF data and, in the latest extension of SPARQL 1.1, it supports property path queries (x, happenedOnDate, “15/09/1940”) AND (i.e. regular path queries) over the RDF graph. (x, happenedIn, “London”) However, it does not support notions of query approximation and relaxation (apart from the OP- TIONAL operator). 1http://www.mpi-inf.mpg.de/yago-naga/yago/

This still returns no answers, since “happened- This query returns 16 answers that may be rele- In” does not connect event instances directly to vant to the user, since YAGO does store the geo- literals such as “London”. However, relaxing now graphic coordinates of some (unspecified) locations (x, happenedIn, “London”) to (x, type, Event), in Belgium, increasing recall but with possibly low using knowledge encoded in YAGO that the do- precision. main of “happenedIn” is Event, will return all Moreover, YAGO does in fact store directly the events that occurred on 15th September 1940, in- coordinates of the “Battle of Waterloo” event, so cluding those occurring in London. In this partic- if we apply an approximation step that deletes ular instance only one answer is returned which the property “happenedIn”, instead of adding “is- is the event “Battle of Britain”, but other events LocatedIn”, the resulting query could in principle have been returned. So the query (hBattle of W aterlooi, exhibits better recall than the original query, but (hasLongitude|hasLatitude), x) possibly low precision. Alternatively, instead of relaxing the second returns the desired answers, showing both high pre- triple above, another approximation step can be cision and high recall applied to it, inserting the property “label” that In this paper we describe an extension of a frag- connects URIs to their labels and yielding the fol- ment of SPARQL 1.1 with query approximation lowing query: and query relaxation operations that automati- (x, happenedOnDate, “15/09/1940”) AND cally generate rewritten queries such as those il- (x, happenedIn/label, “London”) lustrated in the above examples, calling the extended language SPARQLAR. We first presented This query now returns the only event that oc- SPARQLAR in [5], focussing on its syntax, se- curred on 15th September 1940 in London, that mantics and complexity of query answering. We is “Battle of Britain”. It exhibits both better recall showed that the introduction of the query ap- than the original query and also high precision. proximation and query relaxation operators does Example 2. Suppose the user wishes to find the not increase the theoretical complexity of the lan- geographic coordinates of the “Battle of Waterloo” guage, and we provided complexity bounds for sev- event by posing the query eral language fragments. In this paper, we review and extend these results to a larger SPARQL lan- (hBattle of W aterlooi, guage fragment. We also explore in more detail the happenedIn/(hasLongitude|hasLatitude), x). theoretical and performance aspects of our query AR in which angle brackets delimit a URI. We see processing algorithms for SPARQL , examining that this query uses the property paths exten- their correctness and termination properties, and sion of SPARQL, specifically the concatenation (/) presenting the results of a performance study over and disjunction (|) operators. In the query, the the YAGO dataset. property “happenedIn” is concatenated with either The rest of the paper is structured as fol- “hasLongitude” or “hasLatitude”, thereby finding lows. Section 2 describes related work on flex- a connection between the event and its location (in ible querying for the Semantic Web, and on our case Waterloo), and from the location to both query approximation and relaxation more gener- its coordinates. ally. Section 3 presents the theoretical foundation of our approach, summarising the syntax, se- This query does not return any answers from AR YAGO since YAGO does not store the geographic mantics and complexity of SPARQL . Section 4 presents in detail our query processing approach coordinates of Waterloo. However, by applying an AR approximation step, we can insert “isLocatedIn” for SPARQL , which is based on query rewrit- after “happenedIn” which connects the URI repre- ing. We present our query processing algorithms, and formally show the soundness and complete- senting Waterloo with the URI representing Bel- ness of our query rewriting algorithm, as well as gium. The resulting query is its termination. We include a discussion in Sec- Battle of W aterloo, happenedIn/isLocatedIn/ tion 4.4 on how users may be helped in formulating (hasLongitude|hasLatitude), x. queries and interpreting results in a system which Flexible Query Processing for SPARQL 3 includes query approximation and relaxation. Sec- an input database where every tuple is annotated tion 5 presents and discusses the results of a per- by a distinct variable, each tuple t in the query formance study over the YAGO dataset. Finally, answer is annotated by a formula over the input Section 6 gives our concluding remarks and direc- tuples that contributed to t. tions for further work. Another flexible querying technique for rela- tional databases is described in [4]. The authors present an extension to SQL (Soft-SQL) which 2. Related work permits so-called soft conditions. Such conditions tolerate degrees of under-satisfaction of a query by There have been several previous proposals for exploiting the flexibility offered by fuzzy set the- applying flexible querying to the Semantic Web, ory. mainly employing similarity measures to retrieve In [18] the authors show how a conjunctive reg- additional answers of possible relevance. For ex- ular path query language can be effectively example, in [10] matching functions are used for con- tended with approximation and relaxation tech- stants such as strings and numbers, while in [14] niques, using similar notions of approximation and an extension of SPARQL is developed called iS- relaxation as we use here. Finally, in [23] the au- PARQL which uses three different matching func- thors describe and provide technical details of the tions to compute string similarity. In [7], the struc- implementation of a flexible querying evaluator ture of the RDF data is exploited and a similarity for conjunctive regular path queries, extending the measurement technique is proposed which matches work in [18]. paths in the RDF graph with respect to the query. In contrast to all the above work, our focus Ontology-driven similarity measures are proposed is on the SPARQL 1.1 language. In [5] we ex- in [12,11,20] which use the RDFS ontology to re- tended, for the first time, a fragment of this lan- trieve extra answers and assign a score to them. guage with query approximation and query relax- In [8] methods for relaxing SPARQL-like triple ation operators, terming the extended language pattern queries automatically are presented. Query SPARQLAR. Here, we add the UNION operator relaxations are produced by means of statistical to SPARQLAR and derive additional complexity language models for structured RDF data and results. Moreover, we present in detail our query queries. The query processing algorithms merge processing algorithms for SPARQLAR. Our query the results of different relaxations into a unified processing approach is based on query rewrit- results list. ing, whereby we incrementally generate a set of Recently, a fuzzy approach has been proposed SPARQL 1.1 queries from the original SPARQLAR to extend the XPath query language with the query, evaluate these queries using existing tech- aim of providing mechanisms to assign priorities nologies, and return answers ranked according to to queries and to rank query answers [2]. These their “distance” from the original query. We ex- techniques are based on fuzzy extensions of the amine the correctness and termination properties Boolean operators. of our query rewriting algorithm and we present Flexible querying approaches for SQL have been the results of a performance study on the YAGO discussed in [21] where the authors describe a sys- dataset. tem that enables a user to issue an SQL aggregation query, see results as they are produced, and adjust the processing as the query runs. This ap- 3. Theoretical Foundation proach allows users to write flexible queries containing linguistic terms, observe the progress of In this section we give definitions of the syntax their aggregation queries, and control execution on and semantics of SPARQLAR summarising and ex- the fly. tending the syntax and semantics from [5], and An approximation technique for conjunctive also the complexity results from that paper. We queries on probabilistic databases has been investi- begin with some necessary definitions. gated in [9]. The authors use propositional formulas for approximating the queries. Formulas and Definition 1 (Sets, triples and variables). We as- queries are connected in the following way: given sume pairwise disjoint infinite sets U and L of 4 Flexible Query Processing for SPARQL

URIs and literals, respectively. An RDF triple is erty (a “property node”). The intersection of N a tuple hs, p, oi ∈ U × U × (U ∪ L), where s is and NK is contained in the set of class nodes of the subject, p the predicate and o the object of the NK . D is contained in the set of property nodes triple. We assume also an infinite set V of vari- of NK . ables that is disjoint from U and L. We abbreviate any union of the sets U, L and V by concatenating Definition 4 (Triple pattern). A triple pattern is a their names; for instance, UL = U ∪ L. tuple hx, z, yi ∈ UV × UV × UVL. Given a triple pattern hx, z, yi, var(hx, z, yi) is the set of vari- Note that in the above definition we modify the ables occurring in it. definition of triples from [16] by omitting blank nodes, since their use is discouraged for Linked Note that again we modify the definition from Data because they represent a resource without [16] to exclude blank nodes. specifying its name and are identified by an ID Definition 5 (Mapping). A mapping µ from ULV which may not be unique in the dataset [3]. to UL is a partial function µ : ULV → UL. We Definition 2 (RDF-Graph). An RDF-Graph G is assume that µ(x) = x for all x ∈ UL, i.e. µ maps a directed graph (N,D,E) where: N is a finite set URIs and literals to themselves. The set var(µ) of nodes such that N ⊂ UL; D is a finite set of is the subset of V on which µ is defined. Given a predicates such that D ⊂ U; E is a finite set of la- triple pattern hx, z, yi and a mapping µ such that belled, weighted edges of the form hhs, p, oi, ci such var(hx, z, yi) ⊆ var(µ), µ(hx, z, yi) is the triple that the edge source (subject) s ∈ N ∩ U, the edge obtained by replacing the variables in hx, z, yi by target (object) o ∈ N, the edge label p ∈ D and the their image according to µ. edge weight c is a non-negative number. 3.1. Syntax of SPARQLAR queries Note that, in the above definition, we modify the definition of an RDF-Graph from [16] to add Definition 6 (Regular expression pattern). A reg- weights to the edges, which are needed to for- ular expression pattern P ∈ RegEx(U) is defined malise our flexible querying semantics. Initially, as follows: these weights are all 0. ∗ We next define the ontology of an RDF dataset, P := | | p | (P1|P2) | (P1/P2) | P using a fragment of the RDF-Schema (RDFS) vocabulary. where P1,P2 ∈ RegEx(U) are also regular expression patterns, represents the empty pattern, Definition 3 (Ontology). An ontology K is a di- p ∈ U and is a symbol that denotes the disjunc- rected graph (NK ,EK ) where each node in NK tion of all URIs in U. represents either a class or a property, and each This definition of regular expression patterns is edge in EK is labelled with a symbol from the set {sc, sp, dom, range}. These edge labels encom- the same as that in [6]. Our query pattern syn- pass a fragment of the RDFS vocabulary, namely tax is also based on that of [6], but includes also rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain our query approximation and relaxation operators and rdfs:range, respectively. APPROX and RELAX. AR In an RDF-graph G = (N,D,E), we assume Definition 7 (Query Pattern). A SPARQL that each node in N represents an instance or a query pattern Q is defined as follows: class and each edge in E a property (even though, Q := UV ×UV ×UVL | UV ×RegEx(U)×UVL | more generally, RDF does not distinguish between Q1 AND Q2 | Q1 UNION Q2 | Q FILTER R | instances, classes and properties; in fact, in RDF RELAX(UV × RegEx(U) × UVL) | it is possible to use a property as a node of the APPROX(UV × RegEx(U) × UVL) graph). The predicate type representing the RDF vocabulary rdf:type, can be used in E to connect where R is a SPARQL built-in condition and Q1, an instance of a class to a node representing that Q2 are also query patterns. We denote by var(Q) class. In an ontology K = (NK ,EK ), each node in the set of all variables occurring in a query pattern NK represents a class (a “class node”) or a prop- Q. Flexible Query Processing for SPARQL 5

(In the W3C SPARQL syntax, a dot (.) is used component, with respect to a graph G, denoted for conjunction but, for greater clarity, we use [[t]]G, is deﬁned recursively as follows: AND instead. Note also that and cannot be speciﬁed in property paths in SPARQL 1.1.) [[hx, , yi]]G = {hµ, 0i | var(µ) = var(hx, , yi) AR A SPARQL query has the form SELECT−→w WHERE Q, with −→w ⊆ var(Q). We may omit here ∧ ∃c ∈ N . µ(x) = µ(y) = c} the keyword WHERE for simplicity. Given Q0 = [[hx, z, yi]]G = {hµ, ci | var(µ) = 0 0 −→ SELECT−→w Q, the head of Q , head(Q ), is w if −→w 6= ∅ and var(Q) otherwise. var(hx, z, yi) ∧ hµ(hx, z, yi), ci ∈ E}

[[hx, P1|P2, yi]]G = [[hx, P1, yi]]G ∪ [[hx, P2, yi]]G 3.2. Semantics of SPARQLAR queries [[hx, P1/P2, yi]]G = [[hx, P1, zi]]G on We extend the semantics of SPARQL with reg- [[hz, P2, yi]]G ular expression query patterns given in [6] in or- ∗ der to handle the weight/cost of edges in an RDF- [[hx, P , yi]]G = [[hx, , yi]]G ∪ [[hx, P, yi]]G∪ Graph and the cost of applying the approximation [ and relaxation operators. These costs are used to {hµ, ci | hµ, ci ∈ [[hx, P, z1i]]G rank the answers. In particular, when we introduce n≥1 the APPROX and RELAX operators below these on [[hz1, P, z2i]]G on ··· on [[hzn, P, yi]]G} costs determine the ranking of answers returned to the user, with exact answers (of cost 0) being where P , P1, P2 are regular expression patterns, returned first, followed by answers with increasing x, y, z are in ULV , and z, z1, . . . , zn are fresh vari- costs. ables. We extend the notion of SPARQL query evalua- A mapping satisfies a condition R, denoted µ |= tion from returning a set of mappings to returning R, as follows: a set of pairs of the form hµ, ci, where µ is a mapping and c is a non-negative integer that indicates R is x = a: µ |= R if x ∈ var(µ), a ∈ LU and the cost of the answers arising from this mapping. µ(x) = a; Two mappings µ1 and µ2 are said to be compat- R is x = y: µ |= R if x, y ∈ var(µ) and µ(x) = ible if ∀x ∈ var(µ1) ∩ var(µ2), µ1(x) = µ2(x). The µ(y); union of two mappings µ = µ1 ∪ µ2 can be com- R is isURI(x): µ |= R if x ∈ var(µ) and puted only if µ1 and µ2 are compatible. The result- µ(x) ∈ U; ing µ is a mapping such that var(µ) = var(µ1) ∪ R is isLiteral(x): µ |= R if x ∈ var(µ) and var(µ2) and: for each x in var(µ1) ∩ var(µ2), we µ(x) ∈ L; have µ(x) = µ1(x) = µ2(x); for each x in var(µ1) R is R1 ∧ R2: µ |= R if µ |= R1 and µ |= R2; but not in var(µ ), we have µ(x) = µ (x); and 2 1 R is R1 ∨ R2: µ |= R if µ |= R1 or µ |= R2; for each x in var(µ ) but not in var(µ ), we have 2 1 R is ¬R1: µ |= R if it is not the case that µ(x) = µ (x). 2 µ |= R1; We finally define the union and join of two sets of query evaluation results, M1 and M2: The overall semantics of queries (excluding AP- PROX and RELAX) is as follows, where Q, Q , M ∪ M = {hµ, ci | hµ, c i ∈ M or hµ, c i ∈ 1 1 2 1 1 2 Q are query patterns and the projection opera- M with c = c if c .hµ, c i ∈ M , c = c 2 2 1 @ 2 2 2 2 tor π−→ selects only the subsets of the mappings if c .hµ, c i ∈ M , and c = min(c , c ) w @ 1 1 1 1 2 relating to the variables in −→w : otherwise}. M1 on M2 = {hµ1 ∪ µ2, c1 + c2i | hµ1, c1i ∈ M1 [[Q1 AND Q2]]G = [[Q1]]G [[Q2]]G and hµ2, c2i ∈ M2 with µ1 and µ2 compatible on mappings}. [[Q1 UNION Q2]]G = [[Q1]]G ∪ [[Q2]]G

3.2.1. Exact Semantics [[Q FILTER R]]G = {hµ, ci ∈ [[Q]]G | µ |= R} The semantics of a triple pattern t that may include a regular expression pattern as its second [[SELECT−→w Q]]G = π−→w ([[Q]]G) 6 Flexible Query Processing for SPARQL

We will omit the SELECT keyword from a query swers can then be returned to users incrementally Q if −→w = vars(Q). in order of increasing cost. If we did not use the extended reduction of 3.2.2. Query Relaxation the ontology K, then the relaxation steps ap- Our relaxation operator is based on that in [18] plied would not necessarily be the “smallest”. and relies on a fragment of the RDFS entailment For example, consider the following ontology rules known as ρDF [15]. An RDFS graph K1 en- K = {(b, dom, c), (a, sp, b), (a, dom, c)}, where tails an RDFS graph K2, denoted K1 |=RDF S K2, K 6= extRed(K). If we relax the triple pattern if K2 can be derived by applying the rules in Fig- (x, a, y) with respect to K, then as a first step we ure 1 iteratively to K1. For the fragment of RDFS could apply rule 5 to generate (x, type, c). How- that we consider, K1 |=RDF S K2 if and only if ever, the same triple pattern can be generated with K2 ⊆ cl(K1), with cl(K1) being the closure of the 2 steps of relaxation by applying rule 1 first and RDFS Graph K1 under these rules. Notice that if then rule 5 of Figure 1. K1 is finite then also cl(K1) is finite. As a further condition, we require that the on- Applying a rule means adding a triple that is de- tology K is acyclic in order for relaxed queries to ducible by the rule to G or K. Specifically, if there have unambiguous costs (a detailed analysis can 0 are two triples t, t that match the antecedent of be found in [13]). a rule, then it is possible to insert the triple im- Example 3. Given the following cyclic ontology plied by the consequent of the rule. For example, K = (ha, sp, bi, hb, sp, ai, ha, dom, ci, hb, dom, ci) the triple pattern hx, startsExistingOnDate, yi then cl(K) = K ∪ (ha, sp, ai, hb, sp, bi). By apply- can be deduced from hx, wasBornOnDate, yi and ing steps (ii) and (iii) above we could generate hwasBornOnDate, sp, startsExistingOnDateiby two possible ontologies K0 = (ha, sp, bi, hb, sp, ai, applying rule 2. hb, dom, ci) and K00 = (ha, sp, bi, hb, sp, ai, In order to apply relaxation to queries, the ex- ha, dom, ci) that are extended reductions of K. tended reduction of an ontology K is required [13]. Consider now the query Q = RELAX(x, a, y) Given an ontology K, its extended reduction which can be relaxed to (x, b, y) with K0. Applying extRed(K) is computed as follows: (i) compute a second step of relaxation we obtain (x, type, c). cl(K); (ii) apply the rules of Figure 2 in reverse If instead we used ontology K00, a first step of re- until no more rules can be applied (after applying laxation would immediately generate (x, type, c). this step the ontology generated is unique); (iii) Therefore, having non-acyclic ontologies might apply rules 1 and 3 of Figure 1 in reverse until no generate the same triple pattern at two different 2 more rules can be applied . relaxation distances from the original triple pat- Applying a rule in reverse means removing a tern, depending on the reduced ontology. triple deducible by the rule from G or K. Specif- Following the terminology of [13], a triple pat- ically, if there are two triples t and t0 that match tern hx, p, yi directly relaxes to a triple pattern the antecedent of a rule then it is possible to re- hx0, p0, y0i with respect to an ontology K = move a triple that can be derived from t and t0 by extRed(K), denoted hx, p, yi ≺ hx0, p0, y0i, if that rule. i vars(hx, p, yi) = vars(hx0, p0, y0i) and hx0, p0, y0i Henceforth, we assume that K = extRed(K), is derived from hx, p, yi by applying rule i from which allows direct relaxations to be applied to Figure 1. queries (see below), corresponding to the ‘smallest’ A triple pattern hx, p, yi relaxes to a triple pat- relaxation steps. This is necessary for associating 0 0 0 0 0 0 tern hx , p , y i, denoted hx, p, yi ≤K hx , p , y i, if an unambiguous cost to queries, so that query an- starting from hx, p, yi there is a sequence of direct relaxations that derives hx0, p0, y0i. The relaxation 2In order to generate a unique extended reduction we al- cost of deriving hx, p, yi from hx0, p0, y0i, denoted ter step (iii) of the procedure in [13] as follows: for every rcost(hx, p, yi, hx0, p0, y0i), is the minimum cost of pair of triples (a, sp, b), (b, sp, c) (or (a, sc, b), (b, sc, c) re- applying such a sequence of direct relaxations. spectively) in K, apply rule 1 (rule 3) of Figure 1 in reverse unless there exists a URI d such that (c, sp, d) and (a, sp, d) The semantics of the RELAX operator in AR ((c, sc, d) and (a, sc, d)) are also contained in K. We thank SPARQL are as follows: one of the reviewers for pointing out that, without such an extra condition, the extended reduction may not be unique. [[RELAX(x, p, y)]]G,K = [[hx, p, yi]]G∪ Flexible Query Processing for SPARQL 7

(a, sp, b)(b, sp, c) (a, sp, b)(x, a, y) Subproperty (1) (2) (a, sp, c) (x, b, y) (a, sc, b)(b, sc, c) (a, sc, b)(x, type, a) Subclass (3) (4) (a, sc, c) (x, type, b) (a, dom, c)(x, a, y) (a, range, d)(x, a, y) Typing (5) (6) (x, type, c) (y, type, d)

Fig. 1. RDFS entailment rules

(b, dom, c)(a, sp, b) (b, range, c)(a, sp, b) (e1) (e2) (a, dom, c) (a, range, c) (a, dom, b)(b, sc, c) (a, range, b)(b, sc, c) (e3) (e4) (a, dom, c) (a, range, c)

Fig. 2. Additional rules for extended reduction of an RDFS ontology

{hµ, c + rcost(hx, p, yi, hx0, p0, y0i)i | (actedIn, domain, actor), 0 0 0 (English politicians, sc, politician)} hx, p, yi ≤K hx , p , y i∧ 0 0 0 Suppose the user is looking for the family names hµ, ci ∈ [[hx , p , y i]]G} of all the actors who played in the ﬁlm “Tea with [[RELAX(x, P1|P2, y)]]G,K = Mussolini” and poses this query:

[[RELAX(x, P1, y)]]G,K ∪ SELECT * WHERE {

[[RELAX(x, P2, y)]]G,K ?x actedIn . ?x hasFamilyName ?z } [[RELAX(x, P1/P2, y)]]G,K = The above query returns 4 answers. However, [[RELAX(x, P , z)]] 1 G,K on some actors have only a single name (for exam- [[RELAX(z, P2, y)]]G,K ple Cher), or have their full name recorded using the “label” property directly. By applying re- [[RELAX(x, P ∗, y)]] = [[hx, , yi]] ∪ G,K G laxation to the second triple pattern using rule [ [[RELAX(x, P, y)]]G,K ∪ {hµ, ci | (2), we can replace the predicate hasF amilyName n≥1 by “label”. This causes the relaxed query to return also the given names of actors in that ﬁlm hµ, ci ∈ [[RELAX(x, P, z )]] 1 G,K on recorded through the property “hasGivenName” on [[RELAX(z1, P, z2)]]G,K (hence returning Cher), as well as actors’ full names recorded through the property “label”: a to- ··· [[RELAX(z , P, y)]] } on on n G,K tal of 255 results. As another example, suppose the user poses the where P , P1, P2 are regular expression patterns, 0 0 0 following query: x, x , y, y are in ULV , p, p are in U, and z, z1, ..., zn are fresh variables. SELECT * WHERE { Example 4. Consider the following portion K = ?x type . ?x wasBornIn/isLocatedIn* } (NK ,EK ) of the YAGO ontology, where NK is which returns every English politician born in Eng- {hasF amilyName, hasGivenName, label, actedIn, land. By applying relaxation to the ﬁrst triple pat- Actor, English politicians, politician}, tern using rule (4), it is possible to replace the class English politicians by politicians. This re- and EK is laxed query will return every politician who was {(hasF amilyName, sp, label), born in England, giving possibly additional an- (hasGivenName, sp, label), swers of relevance to the user. 8 Flexible Query Processing for SPARQL

3.2.3. Query Approximation [[APPROX(z1, P, z2)]]G on ··· on For query approximation, we apply edit oper- [[APPROX (z , P, y)]] } ations which transform a regular expression pat- on n G tern P into a new expression pattern P 0. Specif- where P , P , P are regular expression patterns, ically, we apply the edit operations deletion, in- 1 2 x, y are in ULV , p, p0 are in U, and z, z , ..., z sertion and substitution, defined as follows (other 1 n are fresh variables. possible edit operations are transposition and in- version, which we leave as future work): Example 5. Suppose that the user is looking for all discoveries made between 1700 and 1800 AD, and A/p/B (A//B) deletion queries the YAGO dataset as follows: A/p/B (A/ /B) substitution SELECT ?p ?z ?y WHERE{ ?p discovered ?x . ?x discoveredOnDate ?y . A/p/B (A/ /p/B) left insertion ?x label ?z . A/p/B (A/p/ /B) right insertion FILTER(?y >= 1700/1/1 and ?y <= 1800/1/1)} Approximating the third triple pattern, it is pos- Here, A and B denote any regular expression and sible to substitute the predicate “label” by “ ”. the symbol represents every URI from U — so The query will then return more information con- the edit operations allow us to insert any URI and cerning that discovery, such as its preferred name substitute a URI by any other. The application of (hasP referredName) and the Wikipedia abstract an edit operation op has a non-negative cost c op (hasW ikipediaAbstract), improving recall and associated with it. maintaining good precision. These rules can be applied to a URI p in order to As another example, consider the following approximate it to a regular expression P . We write query, which is intended to return every German p ∗ P if a sequence of edit operations can be politician: applied to p to derive P . The edit cost of deriving P from p, denoted ecost(p, P ), is the minimum SELECT * WHERE{ cost of applying such a sequence of edit operations. ?x isPoliticianOf ?y . The semantics of the APPROX operator in ?x wasBornIn/isLocatedIn* } SPARQLAR are as follows: This query returns no answers since the predicate “isP oliticianOf” only connects persons to [[APPROX(x, p, y)]] = [[hx, p, yi]] ∪ states of the United States in YAGO. If the first G G triple pattern is approximated by substituting the [ {hµ, c + ecost(p, P )i | predicate “isP oliticianOf” with “ ”, then the query will return the expected results, matching ∗ p P ∧ hµ, ci ∈ [[hx, P, yi]]G} the correct predicate to retrieve the desired answers, which is “holdsP oliticalP osition”. It will [[APPROX(x, P1|P2, y)]]G = also retrieve all the other persons that are born in [[APPROX(x, P1, y)]]G ∪ Germany (thus showing improved recall, but lower precision). [[APPROX(x, P2, y)]]G Observation 1. By the semantics of RELAX and [[APPROX(x, P1/P2, y)]]G = APPROX, we observe that given a triple pattern [[APPROX(x, P1, z)]]G on hx, P, yi, [[hx, P, yi]]G,K ⊆ [[APPROX(x, P, y)]]G and [[hx, P, yi]]G,K ⊆ [[RELAX(x, P, y)]]G,K for [[APPROX(z, P2, y)]]G every graph G and ontology K. ∗ [[APPROX(x, P , y)]]G = [[hx, , yi]]G∪ [ 3.3. Complexity of query answering [[APPROX(x, P, y)]]G ∪ {hµ, ci | n≥1 We now study the combined, data and query AR hµ, ci ∈ [[APPROX(x, P, z1)]]G on complexity of SPARQL , extending the com- Flexible Query Processing for SPARQL 9 plexity results from [16,17,22] for simple SPARQL the combined, data and query complexity are queries, from [1] for SPARQL with regular expres- shown for specific language fragments and combi- sion patterns to include our new flexible query con- nations of operators. structs, and from [5] to include now UNION in The results for query complexity follow from SPARQLAR. Lemma 1 and Theorems 1, 2 and 3. The complexity of query evaluation is based We next show three new complexity results on the following decision problem, which we de- which extend those of [5], summarised in the last note EVALUATION: Given as input a graph G = two lines of Table 1: (N,D,E), an ontology K, a query Q and a pair Theorem 6. EVALUATION is in NP for queries hµ, costi, is it the case that hµ, costi ∈ [[Q]] ? G,K containing AND, UNION, FILTER, APPROX, Considering data complexity, the decision prob- RELAX and regular expression patterns. lem becomes the following: Given as input a graph G, ontology K and a pair hµ, costi, is it the case Theorem 7. EVALUATION is NP-complete for that hµ, costi ∈ [[Q]]G,K , with Q a fixed query? queries that may contain regular expression pat- Finally, the decision problem for query complexity terns and that are constructed using the operators is the following: Given as input an ontology K, a AND, UNION, FILTER, RELAX, APPROX and query Q and a pair hµ, costi, is it the case that SELECT. hµ, costi ∈ [[Q]]G,K , with G a fixed graph? Theorem 8. EVALUATION is PTIME in data We have the following results, the proofs of complexity for queries that may contain SELECT which are given in the Appendix. and regular expression patterns, and that are Theorem 1. EVALUATION can be solved in time constructed using the operators AND, UNION, O(|E| · |Q|) for queries not containing regular ex- FILTER, RELAX and APPROX. pression patterns, and constructed using only the The results for query complexity follow from AND and FILTER operators. Lemma 1 and Theorems 6 and 7. Theorem 2. EVALUATION can be solved in time We conclude our complexity study confirming AR O(|E| · |Q|2) for queries that may contain regular that adding the UNION operator to SPARQL expression patterns and that are constructed using does not increase the overall complexity. EVAL- only the AND and FILTER operators. UATION is NP-complete for queries that contain regular expression patterns and that are Theorem 3. EVALUATION is NP-complete for constructed using the operators AND, UNION, queries that may contain regular expression pat- FILTER, RELAX, APPROX and SELECT. terns and that are constructed using only the AND and SELECT operators. 3.4. OPTIONAL operator

Lemma 1. EVALUATION of [[APPROX(x, P, y)]]G,K The OPTIONAL operator can be added to a and [[RELAX(x, P, y)]]G,K can be accomplished in polynomial time. SPARQL query in order to retrieve information only when it is available. In other words, it allows Theorem 4. EVALUATION is NP-complete for optional matching of query patterns. If in query Q queries that may contain regular expression pat- the OPTIONAL operator is applied to query patterns and that are constructed using the oper- tern Q00, that is Q = Q0 OPTIONAL {Q00}, then 0 ators AND, FILTER, RELAX, APPROX and [[Q]]G will return all the mappings in [[Q ]]G on SELECT. [[Q00]] plus all the mappings in [[Q0]] that are not compatible with any mappings in [[Q00]]. Theorem 5. EVALUATION is PTIME in data It is possible to add the OPTIONAL opera- complexity for queries that may contain regular ex- tor to SPARQLAR, allowing APPROX and RE- pression patterns and that are constructed using LAX to be applied to triple patterns occurring the operators AND, FILTER, RELAX, APPROX within an OPTIONAL clause, with the same se- and SELECT. mantics as speciﬁed earlier. However, the complex- The complexity study of SPARQLAR in [5] is ity of SPARQL with the OPTIONAL clause is summarised in the ﬁrst six lines of Table 1, where PSPACE-complete [16]. Therefore, by our earlier 10 Flexible Query Processing for SPARQL

Table 1 Complexity of various SPARQLAR fragments. Combined Operators Data Complexity Query Complexity Complexity AND, FILTER O(|E|) O(|Q|) O(|E| · |Q|) AND, FILTER, RegEx O(|E|) O(|Q|2) O(|E| · |Q|2) RELAX, APPROX O(|E|) P-Time P-Time RELAX, APPROX, O(|E|) P-Time P-Time AND, FILTER, RegEx AND, SELECT P-Time NP-Complete NP-Complete RELAX, APPROX, AND, FILTER, P-Time NP-Complete NP-Complete RegEx, SELECT RELAX, APPROX, AND, UNION, O(|E|) NP NP FILTER, RegEx, RELAX, APPROX, AND, UNION, P-Time NP-Complete NP-Complete FILTER, RegEx, SELECT

AR results, the complexity of SPARQL would also no Qi contains the UNION operator. The function increase similarly. toCQS exploits the following equality:

4. Query Processing [[(Q1 UNION Q2) AND Q3]]G =

We evaluate SPARQLAR queries by making use ([[Q1]]G ∪ [[Q2]]G) on [[Q3]]G = of a query rewriting algorithm, following a simi- ([[Q1]]G on [[Q3]]G) ∪ ([[Q2]]G on [[Q3]]G) = lar approach to [11,12,20]. In particular, given a query Q which may contain the APPROX and/or ([[Q1 AND Q3]]G) ∪ ([[Q2 AND Q3]]G) RELAX operators, we incrementally build a set We assign to the variable oldGeneration the set of queries {Q0,Q1,... } that do not contain these S of queries returned by toCQS(Q ). For each query operators such that [[Qi]]G,K = [[Q]]G,K . 0 i 0 We present the algorithm in Section 4.1, proving Q in the set oldGeneration, each triple pattern 0 its correctness in Section 4.2 and termination in hxi,Pi, yii in Q labelled with A (R), and each URI Section 4.3. Some practical issues relating to how p that appears in Pi, we apply one step of approx- users might benefit from a flexible querying system imation (relaxation) to p, and we assign the cost such as this are discussed in Section 4.4. of applying that approximation (relaxation) to the resulting query. The applyApprox and applyRelax 4.1. Query Rewriting functions invoked by Algorithm 2 are shown as Al- gorithms 3 and 5, respectively. From each query Our query rewriting algorithm (Algorithm 2 be- constructed in this way, we next generate a new set low) starts by considering the query Q0 which re- of queries by applying a second step of approxima- turns the exact answers to the query Q, i.e. ignor- tion or relaxation. We continue to generate queries ing the APPROX and RELAX operators. To keep iteratively in this way. The cost of each query gen- track of which triple patterns need to be relaxed or erated is the summed cost of the sequence of ap- approximated, we label such triple patterns with proximations or relaxations that have generated A for approximation and R for relaxation. it. If the same query is generated more than once, The function toCQS (“to conjunctive query only the one with the lowest cost is retained. More- set”) takes as input a query Q, and returns a set over, the set of queries generated is kept sorted by S of pairs hQi, 0i such that i[[Qi]]G = [[Q]]G and increasing cost. For practical reasons, we limit the Flexible Query Processing for SPARQL 11 number of queries generated by bounding the cost Suppose a user wishes to find every event which of queries up to a maximum value c. took place in London on 15th September 1940 and In Algorithm 2, the addT o operator accepts two poses the following query Q: arguments: the first is a collection C of query/cost pairs, while the second is a single query/cost pair APPROX(x, happenedOnDate, “15/09/1940”) hQ, ci. The operator adds hQ, ci to C. If C already AND RELAX(x, happenedIn, “London”). contains a pair hQ, c0i such that c0 ≥ c, then hQ, c0i As pointed out in Example 1, without applying AP- is replaced by hQ, ci in C. PROX or RELAX this query does not return any To compute the query answers (Algorithm 1) we answers when evaluated on the YAGO endpoint apply an evaluation function, eval, to each query (because “happenedIn” connects to URIs represent- generated by the rewriting algorithm (in order of ing places and “London” is a literal, not a URI). increasing cost of the queries) and to each mapping After the first step of approximation and relax- returned by eval we assign the cost of the query. If ation, the following queries are generated: we generate a particular mapping more than once, only the one with the lowest cost is retained. In Q1 = (x, , “15/09/1940”)A AND Algorithm 1, rewrite is the query rewriting algo- (x, happenedIn, “London”)R rithm (Algorithm 2) and the set of mappings M Q2 = is maintained in order of increasing cost. (x, happenedOnDate/ , “15/09/1940”)A AND The applyApprox and applyRelax functions, (x, happenedIn, “London”)R respectively, invoke the functions approxRegex Q3 = and replaceTriplePattern, shown as Algorithms 4 (x, /happenedOnDate, “15/09/1940”)A AND and 6. In Algorithm 6, z, z1 and z2 are fresh (x, happenedIn, “London”) new variables. The relaxTriplePattern function R Q = (x, , “12/12/12”) AND might generate regular expressions containing a 4 A (x, happenedIn, “London”) URI type−, which are matched to edges in E by R Q = reversing the subject and the object and using the 5 (x, happenedOnDate, “15/09/1940”) AND property label type. The predicate type− is gener- A (x, placedIn, “London”) ated when we apply rule 6 of Figure 1 to a triple R Q = pattern. Given a triple pattern hx, a, yi where x 6 (x, happenedOnDate, “15/09/1940”) AND is a constant and is y a variable, and an ontology A (x, type, Event) statement ha, range, di, we can deduce the triple R pattern hy, type, di. If instead the predicate a ap- Each of these also returns empty results, with the pears in a triple pattern containing a regular ex- exception of query Q6 which returns every event pression such as hx, a/b, zi (which is equivalent to occurring on 15/09/1940 (YAGO contains only hx, a, yi AND hy, b, zi), then we cannot simply re- one such event, namely “Battle of Britain”). place it with hy, type, di as the regular expression would be broken apart and two triple patterns 4.2. Correctness of the Rewriting Algorithm would result. By using hd, type−, yi, we correctly construct the triple pattern hd, type−/b, zi. We now discuss the soundness, completeness In the following example, we illustrate how the rewriting algorithm works by showing the queries and termination of the rewriting algorithm. As we it generates, starting from a SPARQLAR query. stated earlier, this takes as input a cost that lim- its the number of queries generated. Therefore the Example 6. Consider the following ontology K classic definitions of soundness and completeness (satisfying K = extRed(K)), which is a fragment need to be modified. To handle this, we use an op- of the YAGO knowledge base: erator CostP roj(M, c) to select mappings with a cost less than or equal to a given value c from a K =({happenedIn, placedIn, Event}, set M of pairs of the form hµ, costi. We denote by {hhappenedIn, sp, placedIni, rew(Q)c the set of queries generated by the rewriting algorithm from an initial query Q which have hhappenedIn, dom, Eventi}) cost less than or equal to c. 12 Flexible Query Processing for SPARQL

Algorithm 1: Flexible Query Evaluation input : Query Q; approx/relax max cost c; Graph G; Ontology K. output: List M of mapping/cost pairs, sorted by cost. M := ∅; foreach hQ0, costi ∈ rewrite(Q,c,K) do foreach hµ, 0i ∈ eval(Q0,G) do M := M ∪ {hµ, costi} return M;

Algorithm 2: Rewriting algorithm

input : Query QAR; approx/relax max cost c; Ontology K. output: List of query/cost pairs, sorted by cost. Q0 := remove APPROX and RELAX operators, and label triple patterns of QAR; queries:=toCQS(Q0); oldGeneration := toCQS(Q0); while oldGeneration 6= ∅ do newGeneration := ∅; foreach hQ, costi ∈ oldGeneration do foreach labelled triple pattern hx, P, yi in Q do rew := ∅; if hx, P, yi is labelled with A then rew := applyApprox(Q,hx, P, yi); else if hx, P, yi is labelled with R then rew := applyRelax(Q,hx, P, yi,K); foreach hQ0, cost0i ∈ rew do if cost + cost0 ≤ c then newGeneration := addTo(newGeneration,hQ0, cost + cost0i); queries:=addTo(queries, hQ0, cost + cost0i ); /* The elements of queries are also kept sorted by increasing cost. */

oldGeneration := newGeneration; return queries;

Algorithm 3: applyApprox

input : Query Q; triple pattern hx, P, yiA. output: Set S of query/cost pairs. S := ∅; foreach hP 0, costi ∈ approxRegex(P ) do 0 0 Q := replace hx, P, yiA by hx, P , yiA in Q; S := S ∪ {hQ0, costi}; return S; Flexible Query Processing for SPARQL 13

Algorithm 4: approxRegex input : Regular Expression P . output: Set T of RegEx/cost pairs. T := ∅; if P = p where p is a URI then T := T ∪ {h, costdi}; T := T ∪ {h , costsi}; T := T ∪ {h /p, costii}; T := T ∪ {hp/ , costii};

else if P = P1/P2 then 0 foreach hP , costi ∈ approxRegex(P1) do 0 T := T ∪ {hP /P2, costi}; 0 foreach hP , costi ∈ approxRegex(P2) do 0 T := T ∪ {hP1/P , costi};

else if P = P1|P2 then 0 foreach hP , costi ∈ approxRegex(P1) do T := T ∪ {hP 0, costi}; 0 foreach hP , costi ∈ approxRegex(P2) do T := T ∪ {hP 0, costi};

∗ else if P = P1 then 0 foreach hP , costi ∈ approxRegex(P1) do ∗ 0 ∗ T := T ∪ {h(P1 )/P /(P1 ), costi}; return T;

Algorithm 5: applyRelax

input : Query Q; triple pattern hx, P, yiR of Q; Ontology K. output: Set S of query/cost pairs. S := ∅; 0 0 0 foreach hhx ,P , y iR, costi ∈ relaxTriplePattern(hx, P, yi,K) do 0 0 0 0 Q := replace hx, P, yiR by hx ,P , y iR in Q; S := S ∪ {hQ0, costi}; return S;

S 0 Definition 8 (Containment). Given a graph G, an CostProj([[Q]] , c) ⊆ 0 [[Q ]] for G,K Q ∈rew(Q)c G,K 0 ontology K, and queries Q and Q , [[Q]]G,K ⊆ every graph G and ontology K. 0 [[Q ]]G,K if for each pair hµ, ci ∈ [[Q]]G,K there 0 To show the soundness and completeness of the exists a pair hµ, ci ∈ [[Q ]]G,K . query rewriting algorithm, we will require the fol- Definition 9 (Soundness). The rewriting of Q, lowing lemmas and corollary. rew(Q)c, is sound if the following holds: S 0 Lemma 2. Given four sets of evaluation results 0 [[Q ]] ⊆ CostProj([[Q]] , c) for Q ∈rew(Q)c G,K G,K 0 0 0 M1, M2, M1 and M2 such that M1 ⊆ M1 and every graph G and ontology K. 0 M2 ⊆ M2, it holds that: Definition 10 (Completeness). The rewriting of 0 0 Q, rew(Q)c, is complete if the following holds: M1 ∪ M2 ⊆ M1 ∪ M2 (1) 14 Flexible Query Processing for SPARQL

Algorithm 6: relaxTriplePattern input : Triple pattern hx, P, yi; Ontology K. output: Set T of triple pattern/cost pairs. T := ∅; if P = p where p is a URI then 0 0 foreach p such that ∃(p, sp, p ) ∈ EK do 0 T := T ∪ {hhx, p , yi, cost2i};

foreach b such that ∃(a, sc, b) ∈ EK and p = type and y = a do T := T ∪ {hhx, type, bi, cost4i}; − foreach b such that ∃(a, sc, b) ∈ EK and p = type and x = a do − T := T ∪ {hhb, type , yi, cost4i};

foreach a such that ∃(p, dom, a) ∈ EK and y is a URI or a Literal do T := T ∪ {hhx, type, ai, cost5i};

foreach a such that ∃(p, range, a) ∈ EK and x is a URI do − T := T ∪ {hha, type , yi, cost6i};

else if P = P1/P2 then 0 0 foreach hhx ,P , zi, costi ∈ relaxTriplePattern(hx, P1, zi) do 0 0 T := T ∪ {hhx ,P /P2, yi, costi}; 0 0 foreach hhz, P , y i, costi ∈ relaxTriplePattern(hz, P2, yi) do 0 0 T := T ∪ {hhx, P1/P , y i, costi};

else if P = P1|P2 then 0 0 0 foreach hhx ,P , y i, costi ∈ relaxTriplePattern(hx, P1, yi) do T := T ∪ {hhx0,P 0, y0i, costi}; 0 0 0 foreach hhx ,P , y i, costi ∈ relaxTriplePattern(hx, P2, yi) do T := T ∪ {hhx0,P 0, y0i, costi};

∗ else if P = P1 then 0 foreach hhz1,P , z2i, costi ∈ relaxTriplePattern((hz1,P1, z2i) do ∗ 0 ∗ T := T ∪ {hhx, P1 /P /P1 , yi, costi}; 0 0 foreach hhx ,P , zi, costi ∈ relaxTriplePattern((hx, P1, zi) do 0 0 ∗ T := T ∪ {hhx ,P /P1 , yi, costi}; 0 0 foreach hhz, P , y i, costi ∈ relaxTriplePattern((hz, P1, yi) do ∗ 0 0 T := T ∪ {hhx, P1 /P , y i, costi}; return T;

0 0 0 0 M1 on M2 ⊆ M1 on M2 (2) M1 on M2 = M1 on M2 (4)

The following result follows from Lemma 2: Lemma 3. Given queries Q1 and Q2, graph G and ontology K the following equations hold: Corollary 1. Given four sets of evaluation results 0 0 0 M1, M2, M1 and M2 such that M1 = M1 and CostP roj([[Q1]]G,K on [[Q2]]G,K , c) = 0 M2 = M2, it holds that: CostP roj(CostP roj([[Q1]]G,K , c) on 0 0 M1 ∪ M2 = M1 ∪ M2 (3) CostP roj([[Q2]]G,K , c), c) Flexible Query Processing for SPARQL 15

CostP roj([[Q1]]G,K ∪ [[Q2]]G,K , c) = is a drawback in terms of the precision of answers retrieved. However, for each predicate in a query, CostP roj([[Q ]] , c)∪ 1 G,K it is possible to specify a set of predicates that CostP roj([[Q2]]G,K , c) are semantically similar to it in order to increase the precision of the retrieved answers. A similar- Theorem 9. The Rewriting Algorithm is sound and ity matching algorithm, based either on syntactic complete. or semantic similarity, could be used to compare the predicates of the RDF dataset. The semantic 4.3. Termination of the Rewriting Algorithm similarity could exploit dictionaries such as Word- Net3. Moreover, we could assign different costs for We are able to show that the rewriting algo- substitution by different (sets of) predicates, de- rithm terminates after a finite number of steps: pending on how similar they are to the original predicate. This would allow for a finer ranking of Theorem 10. Given a query Q, ontology K and the answers. maximum query cost c, the Rewriting Algorithm Similarly, it is possible to add a finer ranking terminates after at most dc/c0e iterations, where c0 of the answers arising from the RELAX operator. is the lowest cost of an edit or relaxation operation, When we relax a triple pattern to derive a direct assuming that c0 > 0. relaxation, we make use of a triple from the ontology K. Instead of assigning a cost to each rule 4.4. Practical Considerations of Figure 1, we could assign a cost to each triple in K, reflecting domain experts’ views of the se- In the previous sections we have concentrated mantic closeness of concepts. Therefore, the direct on theoretical aspects of SPARQLAR. In practice relaxation would have a cost depending on which SPARQLAR would be used as part of a framework triple in K is used. allowing users to search RDF data in a flexible Finally, in order to help users interpret answers way. A front-end could allow users to pose queries to their queries, the system could provide informa- using keywords or natural language. Such user tion about which rewritten query returned which queries could then be translated into SPARQL (cf. answers. Showing only queries to users might not [19]). be particularly helpful, especially if the original Suppose a user poses the following question to query was simply in the form of keywords. Instead, the system: “What event happened in London on showing the sequence of steps by which the origi- 15/09/1940?”. This question could be translated nal terms used by the user were approximated or to the query ((x, happenedOnDate, “15/09/1940”) relaxed could help them decide whether the an- AND (x, happenedIn, “London”)) from Exam- swers returned were meaningful or not. ple 6. The system could automatically apply AP- PROX to the first triple pattern and RELAX to the second triple pattern (as shown in the example) by determining that happenedIn appears 5. Experimental Results in the given ontology, whereas happenedOnDate does not. As we saw earlier, applying approxima- We have implemented the query evaluation al- tion and relaxation to this query produces the an- gorithms described above in Java, using Jena for swer that the user is looking for. SPARQL query evaluation. Figure 3 illustrates the Applying the APPROX operator without an up- system architecture, which consists of three lay- per bound on cost will eventually return every con- ers: the GUI layer, the System layer, and the Data nected node of the RDF graph. This negatively layer. The GUI layer supports user interaction affects the precision of answers but ensures 100% with the system, allowing queries to be submitted, recall. When applying the substitution operation costs of the edit and relaxation operators to be of APPROX to a triple pattern, we do not specify set, data sets and ontologies to be selected, and the predicate that needs to be replaced, but in- query answers to be incrementally displayed to stead insert the wildcard ( ) that allows the insertion/replacement of any predicate. Of course, this 3https://wordnet.princeton.edu/ 16 Flexible Query Processing for SPARQL

the user. The System layer comprises three com- System GUI Utilities ponents: the Utilities, containing classes providing Domain Classes Main Window SPARQLAR SPARQLAR Parser User queries the core logic of the system; the Domain Classes, Query

Relax providing classes relating to the construction of Approx/Relax Cost Setter AR Constructor SPARQL queries; and the Query Evaluator in Approx

Data/Ontology Data/Ontology Data/Ontology which query rewriting, optimisation and evalua- Wrapper Loader Selector tion are undertaken. The Data layer connects the Answers Window Answer Wrapper Answers system to the selected RDF dataset and ontology Query Evaluator

Cache Optimiser using the JENA API; RDF datasets are stored as Data

4 Jena API a TDB database and RDF schemas can be stored Jena Wrapper Rewriting Algorithm Evaluator SPARQL in multiple RDF formats (e.g. Turtle, N-Triple, Queries

RDF/XML). RDF Schema TDB Database User queries are submitted to the GUI, which invokes a method of the SPARQLAR Parser to parse the query string and construct an object of Fig. 3. SPARQLAR system architecture the class SPARQLAR Query. The GUI also invokes the Data/Ontology Loader which creates an object tains over 120 million triples (4.83 GB in Tur- of the class Data/Ontology Wrapper, and the Ap- tle format) which we downloaded and stored in a prox/Relax Constructor which creates objects of TDB database. The size of the TDB database is the classes Approx and Relax. Once these objects 9.70 GB, and the nodes of the YAGO graph are have been initialised, they are passed to the Query stored in a 1.1 GB ﬁle. Evaluator by invoking the Rewriting Algorithm. We ran our experiments on a Windows PC with This generates the set of SPARQL queries to be a 2.4Ghz i5 dual-core processor and 8 GB of RAM. executed over the RDF dataset. The set of queries We executed 10 queries over the database, com- is passed to the Evaluator, which interacts with the prising increasing numbers of triple patterns (1 up Optimiser and the Cache to improve query perfor- to 10), listed below. The aim of this performance mance — we discuss the Optimiser and the Cache study was to further gauge the practical feasibil- in Section 5.2. The Evaluator uses the Jena Wrap- ity of our techniques and to discover major perfor- per to invoke Jena library methods for executing mance bottlenecks requiring further investigation. SPARQL queries over the RDF dataset. The Jena A more comprehensive and detailed performance Wrapper also gathers the query answers and passes study is planned for future work. them to the Answer Wrapper. Finally, the answers are displayed by the Answers Window, in ranked Q1 = SELECT ?a WHERE order. { RELAX(?a rdf:type )} We have conducted empirical trials over the YAGO dataset and the Lehigh University Bench- Q2 = SELECT ?n WHERE mark (LUBM)5. Our empirical results using the { ?a rdfs:label ?n . LUBM are described in [5], where we ran a small RELAX(?a )} set of queries comprising 1 to 4 triple patterns on increasing sizes of datasets, with and with- Q3 = SELECT ?n ?d WHERE out the APPROX/RELAX operators. In all cases, { ?a rdfs:label ?n . the approxed/relaxed versions of the queries re- RELAX(?a ) . turned more answers than the exact query. Re- ?a ?d} sponse times were good for most of the queries. For the rest of this section, we focus on our em- Q4 = SELECT ?n ?m WHERE pirical trials over the YAGO dataset, ﬁrstly with- { ?a rdfs:label ?n . out any optimisations, and then in Section 5.2 ?a ?b . with an optimised query evaluator. YAGO con- ?a ?m . RELAX(?m ?b)} 4https://jena.apache.org/documentation/tdb/. 5http://swat.cse.lehigh.edu/projects/lubm/ Q5 = SELECT ?n1 ?n2 WHERE Flexible Query Processing for SPARQL 17

{ ?a rdfs:label ?n1 . ?a ?m1 . ?b rdfs:label ?n2 . ?m1 . RELAX(?a ?b). ?a ?m2 . APPROX(?a /* ?p). ?m2 . APPROX(?b /* ?p)} APPROX(?city ) . Q6 = SELECT ?n WHERE ?m1 rdfs:label ?n1 . { APPROX(?a / ?m2 rdfs:label ?n2} ) . ?a rdfs:label ?n . The reader will notice that we used the AP- RELAX(?a rdf:type ) . PROX operator only on triple patterns contain- ?city . ing a regular expression in which either the sub- ?a ?city . ject or object is a constant. This is due to the fact APPROX(?a / that if we apply APPROX to simple triple pat- United_States>)} terns of the form (?x, p, ?y), the rewriting algorithm will generate the following two triple pat- Q7 = SELECT ?n1 ?n2 WHERE terns: (?x, , ?y) which returns every triple in the { APPROX(?a rdf:type ) . database, and (?x, , ?y) which returns every node RELAX(?a ?b ). in the database. ?p ?b . For each query Q1 to Q10, we ran both the ex- ?p ?d . act form of the query (without any APPROX or RELAX(?a ?d) . RELAX operators) and the version of the query as ?a rdfs:label ?n1 . specified above. For the latter queries, we set the ?p rdfs:label ?n2} cost of applying each edit operation of approximation and each RDFS entailment rule of Figure 1 to Q8 = SELECT ?c ?n ?p ?l ?d WHERE one, and requested answers of maximum cost two. { ?a ?n . We ran each query 6 times, ignored the first timing ?a rdfs:label ?c . as a Jena cache warm-up, and took the mean of the ?a ?p . other 5 timings. We restart our system each time ?a ?l . we run a query; this avoids the possibility that the RELAX(?a ?d) . warm-up caching of a previous query enhances the APPROX(?a rdf:type ) . execution performance of other queries. ?a ?b1 . The numbers of answers returned by each ?a ?b2} query, for both the exact form and the AP- Filter (?b1!=?b2) PROX/RELAX (A/R) form, are shown in Ta- bles 2 and 3, along with the number of rewrit- Q9 = SELECT ?c ?n ?p ?l ?d WHERE ten queries in each case (# of queries). Tables 4 { ?a ?n . and 5 list the execution times for the exact queries, ?a rdfs:label ?c . the A/R queries and the A/R queries with a sim- ?a ?p . ple caching optimisation implemented (optimised ?a ?l . A/R). We discuss the results without this optimi- ?a ?d . sation in the next subsection, and the results with RELAX(?a rdf:type ) . this optimisation applied in Section 5.2. ?a ?b . 5.1. Initial results ?b ?d . RELAX(?l * )} Query Q1 returns every location stored in YAGO. The rewriting algorithm generates only Q10 = SELECT ?n ?n1 ?n2 WHERE the following additional query { ?a rdfs:label ?n . RELAX(?a rdf:type ). SELECT ?a WHERE APPROX(?a ?city) . {?a rdf:type } 18 Flexible Query Processing for SPARQL which returns only 3 answers. Increasing the max- a more reasonable time. Similarly to query Q5, it imum cost does not result in the rewriting algo- would also be possible to replace the symbol with rithm generating any more queries, and no other a selected disjunction of predicates, making the answers would be returned at higher cost. query more likely to return answers more quickly Query Q2 returns every event that happened in still. Berlin. When the second triple pattern is relaxed Query Q7 returns every event and person such the rewriting algorithm generates a query that re- that the person was born in the same place and on turns every event in YAGO. This explains the long the same day that the event occurred. When the execution time of the relaxed version of Q com- 2 rewriting algorithm is applied to Q , it generates pared to its exact form. 7 many queries that contain no answers or that con- Query Q returns every event that happened in 3 tain answers already computed. The first version Berlin along with its date, while query Q4 returns every actor who acted in movies located in the of caching that we have implemented (described in Section 5.2) is not sophisticated enough to help same city where the actor lived. Queries Q3 and with the A/R version of Q , for the reason ex- Q4 return additional answers in their relaxed form 7 compared to their exact form. Both queries exhibit plained in Section 5.2 and further work is required reasonable performance. here. Query Q5 returns all married couples who live in Query Q8 returns every scientist who has mar- the same city. The long execution time of the exact ried twice and has won a prize. The rewriting al- form of this query is due to the presence of Kleene- gorithm generates 17 queries from Q8. The long closure in two of the query conjuncts. Moreover, execution time of the A/R form of Q8 is due to the rewriting algorithm generates 95 queries which use of the “ ” symbol. The running time of the are time-consuming to evaluate due to the pres- query is improved significantly by the optimisation ence of not only of the Kleene-closure but also the described in Section 5.2. wild-card symbol “ ”. We were not able to com- Query Q9 returns every scientist who was born plete the execution of query Q5 with approxima- in Germany, has won a prize, and was married to tion and relaxation. It might be possible to over- someone with the same date of birth. This query come this problem by replacing “ ” with a selected returns no answers. The execution of the A/R form disjunction of predicates. Such predicates would of the query takes many hours, due to the Kleene- be chosen using knowledge of the graph structure. For example, when we approximate the triple pat- closure. However, the running time of the query is again dramatically improved by the optimisation tern h?a, livesIn/isLocatedIn∗, ?pi in query Q5, we generate h?a, livesIn/ /isLocatedIn∗, ?pi us- described in Section 5.2. ing the insertion edit operator and, during query Finally, query Q10 returns every actor who di- evaluation, the symbol is replaced with a dis- rected and acted in Australian movies and was junction of all the predicates in YAGO. We could born in the United States. The exact form of this instead replace with a disjunction of the pred- query returns no answers. The rewriting algorithm icates that are known to connect livesIn and generates 47 queries and the A/R form of the isLocatedIn, i.e. the predicates p such that there query takes a very long time to evaluate. Once is a path livesIn/p/isLocatedIn in YAGO. This again, the optimisation described in Section 5.2 type of optimisation is currently being investi- gives a significant reduction in the running time. gated. Query Q returns every Chinese actor who 6 5.2. Optimised evaluation played in American films and directed Australian films. The A/R version of the query takes many hours to evaluate (the rewriting algorithm gen- In Tables 4 and 5 we also show the query exe- erates 154 queries) due to the symbol we use cution times for the A/R forms of all the queries for the insertion and substitution approximation using an optimised query evaluator. The optimi- operations. Applying the optimisation technique sation is based on a caching technique, in which described in Section 5.2 below decreases the ex- we pre-compute some of the answers in advance. ecution time dramatically and returns results in The Optimiser module stores the cached answers Flexible Query Processing for SPARQL 19

Table 2 Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R).

Q1 Q2 Q3 Q4 Q5 Exact 6491 116 106 8546 585150 A/R 6494 60614 6867 8586 N/A # of 2 5 5 2 95 queries

Table 3 Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R).

Q6 Q7 Q8 Q9 Q10 Exact 28 5 1540 0 0 A/R 14431 N/A 22540 0 0 # of 154 36 17 29 47 queries

Table 4 Query execution time (in seconds).

Q1 Q2 Q3 Q4 Q5 Exact 0.321 0.008 0.009 1.512 7670 A/R 0.340 66.32 0.81 1.571 N/A optimised 0.440 60.4 2.31 1.01 N/A A/R

Table 5 Query execution time (in seconds).

Q6 Q7 Q8 Q9 Q10 Exact 0.123 5 0.173 1.23 323.100 A/R N/A N/A 272.875 N/A N/A optimised 60.23 N/A 12.475 0.08 100.4 A/R in memory using the Java class HashSet6 which dividually; all possible pairs of triple patterns are enables answers to be retrieved efficiently. Algo- also evaluated. The answers to each are stored in rithm 7 shows the optimised evaluation. We leave cache, a data structure that contains these par- it as future work to investigate other join strate- tial evaluation results. To avoid memory overflow, gies, such as sort-merge join or sideways informa- we place an upper limit on the size of cache. We tion passing. then compute the answers of the A/R part with In Algorithm 7, we start by splitting a query into the newEval function which exploits the answers two parts: the triple patterns which do not have already computed and stored in cache. In other APPROX or RELAX applied to them (which we words, if part of the query has been already com- call the exact part) and those which have (which puted, it retrieves the answers and joins them with we call the A/R part). We first evaluate the exact the part of the query that has not been executed. part of the query and store the results. We then Finally, we join the answers of the exact part of apply the rewriting algorithm to the A/R part. the query with those of the A/R part. Each triple pattern of the latter is evaluated in- For query Q1 the optimised evaluation slightly worsens the computation time. This is due to the 6Java documentation: https://docs.oracle.com/ extra time spent by the evaluation algorithm in javase/6/docs/api/java/util/HashSet.html undertaking the caching. In fact, in general for 20 Flexible Query Processing for SPARQL

Algorithm 7: Flexible Query Evaluation – Optimised input : Query Q; approx/relax max cost c; Graph G; Ontology K. output: List M of mapping/cost pairs, sorted by cost. π−→w := head of Q; QE := sub query of Q with only the triples that are not approxed nor relaxed.; Ms := eval(QE,G); if Ms is empty then return ∅ cache := ∅ ; /* set of pairs of query/evaluation results */ QAR := sub-query of Q comprising only the approxed and relaxed triples; M := ∅; 0 foreach hQ , costi ∈ rewrite(QAR,c,K) do if cache is not full then foreach pair t, t0 ∈ Q0 do cache := cache ∪ ht, eval(t, G)i ; cache := cache ∪ ht AND t0, eval(t AND t0,G)i foreach hµ, 0i ∈ newEval(Q0, preEval, G) do M := M ∪ {hµ, costi};

return π−→w (M on Ms); single-conjunct queries the optimisation does not perform optimally for this particular query, but speed up the computation7. splitting the query into multiple parts and join- For queries Q2 and Q4 the optimised evaluation ing the results separately, as we do in our opti- decreases the execution time somewhat. For query misation technique, improves the evaluation time Q2, since the number of answers is rather large, it considerably. is hard to compute all these answers in a shorter We are still unable to execute the A/R forms of amount of time even with the optimised evalua- queries Q5 and Q7. For query Q5, the long eval- tion. uation time is due to the presence of the Kleene- The optimised evaluation of query Q3 performs closure and the wild-card symbol “ ”. On the other worse than the simple evaluation. The main reason hand query Q7 cannot be computed because of the is that the exact sub-query returns a large num- join structure of the exact part of the query, which ber of answers, namely, every event along with its is the following: label and date. These answers are stored in the cache and then retrieved for the final join with the ?p ?b . relaxed triple pattern. ?p ?d . Queries Q6 and Q8 can now be computed in a ?p rdfs:label ?n2. reasonable amount of time. Q9 can also be com- ?a rdfs:label ?n1 . puted with the optimised algorithm. The time taken is less than 0.1 seconds due to the fact that We can see that the variable ?p appears in the its exact part returns no answers, making the com- first three triple patterns, while the variables of putation of the rest of the query redundant. the last triple pattern do not appear anywhere else Query Q10 can now be computed and, in fact, in the query. Therefore, Jena has to compute a the optimised algorithm managed to run the A/R Cartesian product between the 2954875 answers form of the query faster than its exact form. It is retrieved for the last triple pattern and the 500000 possible that the Jena SPARQL evaluator does not answers retrieved for the first three triple patterns. More sophisticated optimisation techniques need 7In the final version of the system, the optimisation mod- to be investigated to improve the performance of ule would be disabled for single-conjunct queries. these queries. Flexible Query Processing for SPARQL 21

The overall results show that the evaluation of structed from an RDF-dataset G that will have the SPARQLAR queries through a query rewriting ap- following property: if we consider G and S as au- proach is promising. The difference between the tomata, then L(G) ⊆ L(S). Hence, given a query execution time of the exact form and the A/R form Q, if eval(Q, S) = ∅ then eval(Q, G) = ∅. Since of the queries is acceptable for queries with fewer the synopsis S will be considerably smaller than than 5 conjuncts. For most of the other queries, G, we can evaluate Q over S for each query Q gen- the simple optimisation technique described above erated by the rewriting algorithm; if eval(Q, S) re- also brings down the running times of the A/R turns no answer, then we do not need to execute Q forms to more reasonable levels. Clearly, for more over G. Moreover, the synopsis S can be exploited complex queries, more sophisticated optimisation in order to remove the symbol that is generated techniques need to be investigated and developed. when we apply APPROX to a triple pattern of a query. Given a triple pattern hx, P, yi from query Q, we compute A = MP ∩ S, where MP is the au- 6. Conclusions tomaton that recognises L(P ). Subsequently, we replace hx, P, yi with hx, PA, yi in Q, where PA is In this paper we have presented query process- a property path that does not contain the symbol ing algorithms for an extended fragment of the such that L(PA) = L(A). SPARQL 1.1 language, incorporating approxima- Another direction of research is the extension tion and relaxation operators. Our query process- of our approximation and relaxation operators, ing approach is based on query rewriting whereby, query evaluation and query optimisation tech- given a query Q containing the APPROX and/or niques to flexible federated query processing for RELAX operators, we incrementally generate a SPARQL 1.1. Finally, also planned is a detailed set of queries {Q0,Q1,...} that do not contain comparison of the query rewriting approach to S these operators such that i[[Qi]]G,K = [[Q]]G,K , query approximation and relaxation presented and we return results according to their “distance” here with the “native” implementation of similar from the exact form of Q. operators described in [23]. We have formally shown the soundness, com- Acknowledgements. Andrea Cal`ı acknowledges pleteness and termination of our query rewriting partial support by the EPSRC project “Logic- algorithm. Our empirical studies show promising based Integration and Querying of Unindexed query processing performance, but also that fur- Data” (EP/E010865/1). ther optimisations are required. An advantage of adopting a query rewriting approach is that existing techniques for SPARQL References query optimisation and evaluation can be reused to evaluate the queries generated by our rewrit- [1] F. Alkhateeb, J.-F. Baget, and J. Euzenat. Extending ing algorithm. Our ongoing work involves inves- SPARQL with regular expression patterns (for query- tigating optimisations to the rewriting algorithm ing RDF). Web Semant., 7(2):57–73, Apr. 2009. itself, since it can generate a large number of [2] J. M. Almendros-Jiménez, A. Luna, and G. Moreno. Fuzzy XPath queries in XQuery. In R. Meersman, queries. Specifically, we are studying the query H. Panetto, T. S. Dillon, M. Missikoff, L. Liu, O. Pas- AR containment problem for SPARQL and how tor, A. Cuzzocrea, and T. Sellis, editors, On the Move query costs impact on this. Following this inves- to Meaningful Internet Systems: OTM 2014 Confer- tigation, we plan to implement optimisations for ence Proceedings - Confederated International Confer- the rewriting algorithm. For example, for a query ences: CoopIS, and ODBASE 2014, Amantea, Italy, October 27-31, pages 457–472, 2014. Q = Q1 AND Q2 it is possible to decrease the [3] C. Bizer, R. Cyganiak, and T. Heath. How to publish number of queries generated by the rewriting al- Linked Data on the Web. Web page, 2007. Revised gorithm if we know that [[Q1]]G,K ⊆ [[Q2]]G,K , in 2008. Accessed 22/02/2010. which case [[Q]]G,K = [[Q1]]G,K . [4] G. Bordogna and G. Psaila. In J. Galindo, editor, Another area of ongoing work involves the con- Handbook of Research on Fuzzy Information Process- ing in Databases. struction of synopses (or data guides) of RDF- [5] A. Cal`ı,R. Frosini, A. Poulovassilis, and P. T. Wood. datasets in order to speed up query evaluation. Flexible querying for SPARQL. In R. Meersman, In our context, such a synopsis is a graph S con- H. Panetto, T. S. Dillon, M. Missikoff, L. Liu, O. Pas- 22 Flexible Query Processing for SPARQL

tor, A. Cuzzocrea, and T. Sellis, editors, On the Move [15] S. Muñoz,J. Pérez,and C. Gutierrez. Minimal de- to Meaningful Internet Systems: OTM 2014 Confer- ductive systems for RDF. In E. Franconi, M. Kifer, ence Proceedings - Confederated International Confer- and W. May, editors, Proceedings of the 4th European ences: CoopIS, and ODBASE 2014, Amantea, Italy, Conference on The Semantic Web: Research and Ap- October 27-31. plications, ESWC ’07, pages 53–67, Berlin, Heidelberg, [6] M. W. Chekol, J. Euzenat, P. Genevès, and 2007. Springer-Verlag. N. Laya¨ıda.PSPARQL query containment. In N. Fos- [16] J. Pérez, M. Arenas, and C. Gutierrez. Seman- ter and A. Kementsietsidis, editors, Database Pro- tics and complexity of SPARQL. In I. F. Cruz, gramming Languages - DBPL 201, 13th International S. Decker, D. Allemang, C. Preist, D. Schwabe, Symposium, Seattle, Washington, USA, August 29, P. Mika, M. Uschold, and L. Aroyo, editors, Proceed- 2011. Proceedings, 2011. ings of the 5th International Semantic Web Confer- [7] R. De Virgilio, A. Maccioni, and R. Torlone. A sim- ence, ISWC’06, pages 30–43, Berlin, Heidelberg, 2006. ilarity measure for approximate querying over RDF Springer-Verlag. data. In G. Guerrini, editor, Proceedings of the Joint [17] J. Pérez,M. Arenas, and C. Gutierrez. Semantics and EDBT/ICDT 2013 Workshops, EDBT ’13, pages 205– complexity of SPARQL. ACM Trans. Database Syst., 213, New York, NY, USA, 2013. ACM. 34(3):16:1–16:45, Sept. 2009. [8] S. Elbassuoni, M. Ramanath, and G. Weikum. Query [18] A. Poulovassilis and P. T. Wood. Combining approx- relaxation for entity-relationship search. In G. An- imation and relaxation in semantic web path queries. toniou, M. Grobelnik, E. P. B. Simperl, B. Parsia, In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, D. Plexousakis, P. D. Leenheer, and J. Z. Pan, edi- L. Zhang, J. Z. Pan, I. Horrocks, and B. Glimm, tors, Proceedings of the 8th Extended Semantic Web editors, Proceedings of the 9th International Seman- Conference on The Semanic Web: Research and Ap- tic Web Conference, ISWC’10, pages 631–646, Berlin, plications - Volume Part II, ESWC’11, pages 62–76, Heidelberg, 2010. Springer-Verlag. Berlin, Heidelberg, 2011. Springer-Verlag. [19] C. Pradel, O. Haemmerlé,and N. Hernandez. Natu- [9] R. Fink and D. Olteanu. On the optimal approxi- ral language query interpretation into SPARQL using mation of queries using tractable propositional lan- patterns. In Fourth International Workshop on Con- guages. In T. Milo, editor, Proceedings of the 14th suming Linked Data - COLD 2013, pages pp. 1–12, International Conference on Database Theory, ICDT Sydney, AU, 2013. Kent State UNiversity. Thanks to ’11, pages 174–185, New York, NY, USA, 2011. ACM. Ceur-ws editor. The definitive version is available at [10] A. Hogan, M. Mellotte, G. Powell, and D. Stampouli. http://ceur-ws.org/Vol-1034/. Towards fuzzy query-relaxation for RDF. In E. Sim- [20] B. R. K. Reddy and P. S. Kumar. Efficient ap- perl, P. Cimiano, A. Polleres, O. Corcho, and V. Pre- proximate SPARQL querying of web of linked data. sutti, editors, The Semantic Web: Research and Ap- In F. Bobillo, R. N. Carvalho, P. C. G. da Costa, plications, volume 7295 of Lecture Notes in Computer C. d’Amato, N. Fanizzi, K. B. Laskey, K. J. Laskey, Science, pages 687–702. Springer Berlin Heidelberg, T. Lukasiewicz, T. Martin, M. Nickles, and M. Pool, 2012. editors, URSW, volume 654 of CEUR Workshop Pro- [11] H. Huang and C. Liu. Query relaxation for star ceedings, pages 37–48. CEUR-WS.org, 2010. queries on RDF. In L. Chen, P. Triantafillou, and [21] M. Sassi, O. Tlili, and H. Ounelli. Approximate query T. Suel, editors, Proceedings of the 11th International processing for database flexible querying with aggre- Conference on Web Information Systems Engineer- gates. In A. Hameurlain, J. Küng, and R. Wag- ing, WISE’10, pages 376–389, Berlin, Heidelberg, 2010. ner, editors, Transactions on Large-Scale Data- and Springer-Verlag. Knowledge-Centered Systems V, pages 1–27. Springer- [12] H. Huang, C. Liu, and X. Zhou. Computing relaxed Verlag, Berlin, Heidelberg, 2012. answers on RDF databases. In J. Bailey, D. Maier, [22] M. Schmidt. Foundations of SPARQL Query Op- K. Schewe, B. Thalheim, and X. S. Wang, editors, Pro- timization. PhD thesis, Albert-Ludwigs-Universitat ceedings of the 9th International Conference on Web Freiburg, 2009. Information Systems Engineering, WISE ’08, pages [23] P. Selmer, A. Poulovassilis, and P. T. Wood. Im- 163–175, Berlin, Heidelberg, 2008. Springer-Verlag. plementing flexible operators for regular path queries. [13] C. A. Hurtado, A. Poulovassilis, and P. T. Wood. In P. M. Fischer, G. Alonso, M. Arenas, and Query relaxation in RDF. J. Data Semantics, 10:31– F. Geerts, editors, Proceedings of the Workshops of the 61, 2008. EDBT/ICDT 2015 Joint Conference (EDBT/ICDT), [14] C. Kiefer, A. Bernstein, and M. Stocker. The fun- Brussels, Belgium, March 27th, 2015., pages 149–156, damentals of iSPARQL: a virtual triple approach for 2015. similarity-based semantic web tasks. In K. Aberer, K. Choi, N. F. Noy, D. Allemang, K. Lee, L. J. B. Nixon, J. Golbeck, P. Mika, D. Maynard, R. Mi- zoguchi, G. Schreiber, and P. Cudré-Mauroux, editors, Proceedings of the 6th International Semantic Appendix Web Conference and 2nd Asian Semantic Web Con- ference, ISWC’07/ASWC’07, pages 295–309, Berlin, Heidelberg, 2007. Springer-Verlag. Proof of Theorem 1. Flexible Query Processing for SPARQL 23

Proof. We give an algorithm for the EVALUA- For NP-hardness we first define the problem of TION problem that runs in polynomial time: graph 3-colourability, which is known to be NP- First, for each i such that the triple pattern complete: given a graph Γ = (NΓ,EΓ) and three hx, z, yii is in Q, we verify that hµ(hx, z, yii), costii colours r, g, b, is it possible to assign a colour ∈ E for some costi. If this is not the case, or if to each node in NΓ such that no pair of nodes P i costi 6= cost we return False. Otherwise we connected by an edge in EΓ are of the same colour? check if µ satisfies the FILTER condition and re- We next define the following RDF graph G = turn True or False accordingly. It is evident that (N,D,E): the algorithm runs in polynomial time since veri- fying that hµ(hx, z, yii), costii ∈ E can be done in N ={r, g, b, a} D = {a, p} time |E|. E ={hhr, p, gi, 0i, Proof of Theorem 2. hhr, p, bi, 0i, hhg, p, bi, 0i, hhg, p, ri, 0i, hhb, p, ri, 0i, hhb, p, gi, 0i, hha, a, ai, 0i} Proof. To show this, we start by building an NFA M = (S, T ) that recognises L(P ), the language P Now we construct the following query Q such that denoted by the regular expression P , where S is there is a variable x corresponding to each node the set of states (including s and s representing i 0 f n of Γ and there is a triple pattern of the form the initial and final states respectively) and T is i hx , p, x i in Q if and only if there is an edge the set of transitions, each of cost 0. We then con- i j (n , n ) in Γ: struct the weighted product automaton, H, of G i j and MP as follows:

– The states of H are the Cartesian product of Q = SELECTx((xi, p, xj) AND ... AND the set of nodes N of G and the set of states 0 0 (xi, p, xj) AND (a, a, x)) S of MP . 0 – For each transition hhs, p, s i, 0i in MP and It is easy to verify that the graph Γ is colourable if each edge hha, p, bi, costi in E, there is a tran- and only if hµ, 0i ∈ [[Q]]G with µ = {x → a}. sition hhs, ai, hs0, bi, costi in H. Proof of Lemma 1 Then we check if there exists a path from hs0, µ(x)i to hsf , µ(y)i in H. In case there is more Premise. Given a pair hµ, costi we have to verify in than one path, we select one with the minimum polynomial time that hµ, costi ∈ [[APPROX(x, P, cost using Dijkstra’s algorithm. Knowing that the y)]]G or hµ, costi ∈ [[RELAX(x, P, y)]]G. We start number of nodes in H is equal to |N| · |S|, the by building an NFA MP = (S, T ) as described number of edges is at most |E| · |T |, and that earlier. |T | ≤ |S|2, the evaluation can be performed in 2 time O(|E| · |S| + |N| · |S| · log(|N| · |S|)). Approximation. An approximate automaton AP = 0 (S, T ) is constructed starting from MP and Proof of Theorem 3 adding the following additional transitions (similarly to the construction in [18]): Proof. We ﬁrst show that the evaluation problem is in NP. Given a pair hµ, costi and a query – For each state s ∈ S there is a transition −→ hhs, , si, αi, where α is the cost of insertion. SELECT w Q where Q does not include FIL- 0 TER, we have to check whether hµ, costi is in – For each transition hhs, p, s i, 0i in MP , where p ∈ D, there is a transition hhs, , s0i, βi, [[SELECT−→Q]] . We can guess a new mapping w G where β is the cost of deletion. µ0 such that π−→(hµ0, costi) = hµ, costi and conse- w – For each transition hhs, p, s0i, 0i in M , where quently check that hµ0, costi ∈ [[Q]] (which can P G p ∈ D, there is a transition hhs, , s0i, γi, where be done in polynomial time as we have seen in γ is the cost of substitution. Theorem 2). The number of guesses is bounded by the number of variables in Q and values from G We then form the weighted product automaton, to which they can be mapped. H, of G and AP as follows: 24 Flexible Query Processing for SPARQL

– The states of H will be the Cartesian product – For each transition hhs, type−, s0i, costi ∈ T 0, 0 of the set of nodes N of G and the set of states s ∈ S0 and hc, sc, c i ∈ K such that s is an- 00 S of AP . notated with c add an incoming clone s of s 0 0 – For each transition hhs, p, s i, cost1i in AP and annotated with c to S0 and add the transi- 00 − 0 0 each edge hha, p, bi, cost2i in E, there is a tran- tion hhs , type , s i, cost + βi to T , where β 0 sition hhs, ai, hs , bi, cost1 + cost2i in H. is the cost of applying rule 4. 0 0 0 0 – For each transition hhs, , s i, costi in AP and – For each hhs, p, s i, costi ∈ T , s ∈ Sf and each node a ∈ N, there is a transition hp, dom, ci such that s0 is annotated with a hhs, ai, hs0, ai, costi in H. constant, add an outgoing clone s00 of s0 an- 0 – For each transition hhs, , s i, cost1i in AP and notated with c to Sf , and add the transition 00 0 each edge hha, p, bi, cost2i in E, there is a tran- hhs, type, s i, cost + γi to T , where γ is the 0 sition hhs, ai, hs , bi, cost1 + cost2i in H. cost of applying rule 5. – For each hhs, p, s0i, costi ∈ T 0, s ∈ S and Finally we check if there exists a path from 0 hp, range, ci such that s is annotated with a hs0, µ(x)i to hsf , µ(y)i in H. Again, if there exists 00 more than one path we select one with minimum constant, add an incoming clone s of s annotated with c to S0, and add the transition cost using Dijkstra’s Algorithm. Knowing that the 00 − 0 0 number of nodes in H is |N|·|S| and that the num- hhs , type , s i, cost + δi to T , where δ is the ber of edges in H is at most (|E| + |N|) · |S|2, the cost of applying rule 6. evaluation can therefore be computed in O((|E| + (We note that because queries and graphs do not |N|) · |S|2 + |N| · |S| · log(|N| · |S|)). contain edges labelled sc or sp, rules 1 and 3 in Figure 1 are inapplicable as far as query relaxation Relaxation. Given an ontology K = extRed(K) 0 0 is concerned.) we build the relaxed automaton RP = (S ,T ,S0, We then form the weighted product automaton, Sf ) starting from MP (similarly to the construc- H, of G and RP as follows: tion in [18]). S0 and Sf represent the sets of initial and final states, and S0 contains every state in S – For each node a ∈ N of G and each state 0 plus the states in S0 and Sf . Initially S0 and Sf s ∈ S of RP , then hs, ai is a state of H if s is contain s0 and sf respectively. Each initial and fi- labelled with either ∗ or a, or is unlabelled. 0 nal state in S0 and Sf is labelled with either a con- – For each transition hhs, p, s i, cost1i in RP and stant or the symbol ∗; in particular, s0 is labelled each edge hha, p, bi, cost2i in E such that hs, ai with x if x is a constant or ∗ if it is a variable and and hs0, bi are states of H, then there is a 0 similarly sf is labelled with y if y is a constant or ∗ transition hhs, ai, hs , bi, cost1 + cost2i in H. − 0 if it is a variable. An incoming (outgoing) clone of – For each transition hhs, type , s i, cost1i in 0 0 a state s is a new state s such that s is an initial RP and each edge hha, type, bi, cost2i in E or final state if s is, s0 has the same set of incoming such that hs, bi and hs0, ai are states of H, 0 (outgoing) transitions as s, and has no outgoing then there is a transition hhs, bi, hs , ai, cost1+ 0 (incoming) transitions. Initially T contains all the cost2i in H. transitions in T . We recursively add states to S0 0 Finally we check if there exists a path from and Sf , and transitions to T as follows until we 0 hs, µ(x)i to hs , µ(y)i in H, where s ∈ S0 and reach a fixpoint: 0 s ∈ Sf . Again, if there exists more than one path – For each transition hhs, p, s0i, costi ∈ T 0 and we select one with minimum cost using Dijkstra’s hp, sp, p0i ∈ K add the transition hhs, p0, s0i, Algorithm. Knowing that the number of nodes in cost+αi to T 0, where α is the cost of applying H is at most |N| · |S0| and the number of edges in rule 2. H is at most |E|·|S0|2, the evaluation can therefore – For each transition hhs, type, s0i, costi ∈ T 0, be computed in O(|E| · |S0|2 + |N| · |S0| · log(|N| · 0 0 0 0 s ∈ Sf and hc, sc, c i ∈ K such that s is |S |)). annotated with c add an outgoing clone s00 0 0 of s annotated with c to Sf and add the Conclusion. We can conclude that both query ap- transition hhs, type, s00i, cost+βi to T 0, where proximation and query relaxation can be evalu- β is the cost of applying rule 4. ated in polynomial time. In particular, the evalu- Flexible Query Processing for SPARQL 25 ation can be done in O(|E|) time with respect to Algorithm 8: EvaluationCost the data and in polynomial time with respect to the query. input : Query Q with no variables, a mapping µ, a graph G Proof of Theorem 4. output: A cost value c, or NULL if Q = t then Proof. Follows straightforwardly from Theorem 3 if there exists a cost such that and Lemma 1. {hµ, costi} ∈ [[t]]G, where t is a simple triple pattern or an APPROX/RELAX Proof of Theorem 5. then return cost; Proof. In order to prove the theorem, we devise an algorithm that runs in polynomial time with else respect to the size of the graph G. We start by return NULL 0 building a new mapping µ such that each variable else if Q = Q1 AND Q2 then 0 x ∈ var(µ ) appears in var(Q) but not in var(µ), Guess a decomposition µ = µ1 ∪ µ2; and to each we assign a different constant from if EvaluationCost(Q1,µ1,G) 6= NULL ND. We then verify in polynomial time that hµ ∪ and EvaluationCost(Q2,µ2,G) 6= NULL 0 µ , costi is in [[Q]]G. The number of mappings we then |var(Q)| can generate is O(|ND| ). Since the query return EvaluationCost(Q1,µ1,G) + is fixed we can therefore say that the evaluation EvaluationCost(Q2,µ2,G); with respect to the data is in polynomial time. else return NULL Proof of Theorem 6 else if Q = Q1 UNION Q2 then Proof. We give an NP algorithm, Evaluation- if EvaluationCost(Q1,µ,G) = NULL then Cost shown as Algorithm 8, for the EVALUA- return EvaluationCost(Q2,µ,G); TION problem for a generic query Q containing else if EvaluationCost(Q2,µ,G) = NULL AND, UNION and regular expression patterns. then The EvaluationCost algorithm takes as input a return EvaluationCost(Q1,µ,G); mapping µ, a graph G and a query Q and returns else if EvaluationCost(Q ,µ,G) ≤ a cost c. Given an evaluation hµ, c0i, and a query 1 EvaluationCost(Q ,µ,G) then Q, then the EvaluationCost algorithm returns c 2 return EvaluationCost(Q1,µ,G); if {hµ, ci} ∈ [[Q]]G and NULL otherwise. Finally, we need to check that c is equal to c0. else It is easy to see in the EvaluationCost algorithm return EvaluationCost(Q2,µ,G); that the non-deterministic step occurs when the condition Q = Q1 AND Q2 is satisfied, in which case we need to guess a decomposition of the map- can be done in NP time as we have seen in Theo- ping µ into µ and µ . The number of guesses 1 2 rem 6). The number of guesses is bounded by the is bounded by the number of possible decomposi- number of variables in Q and values from G to tions of µ (which is finite). which they can be mapped. For NP-hardness, we reduce from the 3-SAT Proof of Theorem 7. problem, which is known to be NP-complete: Given a Boolean formula φ = C1 ∧ · · · ∧ Cn as in- Proof. We first show that the evaluation prob- put, where each clause Ci is a disjunct of exactly lem is in NP. Given a pair hµ, costi and a query three literals, is φ satisfiable? Each literal li1, li2 −→ SELECT w Q, where Q contains AND, UNION and li3 of Ci is either a variable vk or a negated and SELECT, we have to check whether hµ, costi variablev ¯k, with k ∈ {1, . . . , m}. is in [[SELECT−→w Q]]G. We can guess a new map- We start by constructing the following graph: 0 0 ping µ such that π−→w (hµ , costi) = hµ, costi and 0 consequently check that hµ , costi ∈ [[Q]]G (which G = ({a, 0, 1}, {f, t}, {hha, f, 0i, 0i, hha, t, 1i, 0i}) 26 Flexible Query Processing for SPARQL

We then construct the query Q = SELECTz(T1 and a pair hµ2, cost2i ∈ [[Q2]]G,K such that AND ··· Tn AND ha, t, zi), where each symbol hµ1, cost1i on hµ2, cost2i = hµ, costi. Moreover, Ti = {Vi1 UNION Vi2 UNION Vi3} corresponds since cost = cost1 + cost2 we know that cost1 ≤ c to a clause Ci in φ, and each Vij either corresponds and cost2 ≤ c. Therefore, we can conclude that to a triple of the form ha, t, xki if lij is a variable hµ, costi must also be contained in the RHS of the vk, or corresponds to a triple of the form ha, f, xki equation. if lij is a negated variablev ¯k. For the second equation it is easy to verify that It can be veriﬁed that the formula φ is satisﬁable every pair hµ, costi is in CostP roj([[Q1]]G,K ∪ if and only if hµ, 0i ∈ [[Q]]G with µ = {z → 1}. [[Q2]]G,K , c) if and only if it is contained either in CostP roj([[Q1]]G,K , c) or in CostP roj([[Q2]]G,K , Proof of Theorem 8 c), or in both.

Proof. In Theorem 6, the non-deterministic steps Proof of Theorem 9. (i.e. the decomposition of the mapping µ into µ1 and µ2 to verify that hµ, ci ∈ [[Q1 AND Q2]]G), Proof. For ease of reading, in this proof we will depend on the query Q which we assume is fixed. replace the operators APPROX and RELAX with To verify that an evaluation hµ, 0i is in [[t]]G, with A and R respectively and will denote with A/R(t) t a triple pattern of query Q, can be done in |E| that we are applying either APPROX or RELAX steps. Therefore, the evaluation can be computed to a triple pattern t. We divide the proof into three in O(|E| ∗ |µ||Q|) steps. parts: (1) The first part shows that for c ≥ 0 When we include the SELECT operator we and relaxed or approximated triple patterns of need to add a further non-deterministic step, that the form hx, p, yi, the functions approxRegex and is, generating a new mapping µ0 from µ such relaxTriplePattern generate sound and complete 0 triple patterns. (2) The second part of the proof that π−→w (hµ , ci) = hµ, ci. From the proof of The- orem 5 we can see that this can be done in shows that the algorithm is sound and complete for O(|ND||var(Q)|). Since the query is fixed, we con- approximated and relaxed triple patterns contain- clude that the data complexity is polynomial. ing any regular expression. (3) Finally, we show that the algorithm is sound and complete for gen- Proof of Lemma 2 eral queries Q, i.e. we show that the following holds for any query Q, graph G and ontology K: Proof. (1) From the definition of union, it follows S 0 0 0 CostProj([[Q]]G,K , c) ⊆ Q0∈rew(Q) [[Q ]]G,K ⊆ that M1 ∪ M2 contains every mapping from M1 c CostProj([[Q]] , c) and M2, and therefore the statement holds. G,K (2) From the definition of join, M1 on M2 con- (1) In this first part we show that for any triple tains a mapping µ1 ∪ µ2 for every pair of compat- pattern hx, p, yi and cost c ≥ 0 the following holds: ible mappings hµ1, cost1i ∈ M1 and hµ2, cost2i ∈ 0 0 CostProj([[A/R(x, p, y)]] , c) = M2. Since M and M also contain µ1 and µ2, G,K 1 2 S 0 0 0 0 [[t ]]G,K respectively, then M1 on M2 will also contain t ∈rew(A/R(x,p,y))c µ1 ∪ µ2. We show this by induction on the cost c. For the base case of c = 0 we need to show that: Proof of Lemma 3

CostProj([[A/R(x, p, y)]]G,K , 0) = Proof. Considering the right hand side (RHS) [ 0 (5) of the first equation, we know that each pair [[t ]]G,K 0 hµ, costi in the RHS has cost ≤ c and is equal t ∈rew(A/R(x,p,y))0 to hµ1, cost1i on hµ2, cost2i, where cost1 ≤ c, cost2 ≤ c, hµ1, cost1i ∈ [[Q1]]G,K and hµ2, cost2i ∈ On the LHS, since the costs of applying AP- [[Q2]]G,K . Therefore, the pair hµ, costi must also PROX and RELAX have cost greater than zero, be contained in the left hand side (LHS) of the CostProj operator will only return the exact the equation. Conversely, for each pair hµ, costi answers of the query, in other words it will exclude in the LHS, we know that cost ≤ c and that the answers generated by the APPROX and RE- there must exist a pair hµ1, cost1i ∈ [[Q1]]G,K LAX operators. On the RHS, the rewriting algo- Flexible Query Processing for SPARQL 27 rithm will not return queries with associated cost [[hx, p, yi]]G,K ∪ S 0 greater than 0 and therefore will just return the 0 [[t ]] ∪ t ∈rew(A(x, /p,y))c−α G,K S 0 original query unchanged. This, when evaluated, 0 [[t ]] ∪ t ∈rew(A(x,p/ ,y))c−α G,K S 0 will therefore also return the exact answers of the 0 [[t ]] ∪ t ∈rew(A(x,,y))c−β G,K query. So (5) holds. S 0 0 [[t ]] t ∈rew(A(x, ,y))c−γ G,K When c is greater than 0 we consider two dif- = ferent cases, one for APPROX and the other for [[hx, p, yi]]G,K ∪ RELAX: CostProj([[A(x, /p, y)]]G,K , c − α)∪ CostProj([[A(x, p/ , y)]] , c − α)∪ (a) Approximation. For approximation, we show G,K CostProj([[A(x, , y)]] , c − β)∪ the following by induction on the cost c: G,K CostProj([[A(x, , y)]]G,K , c − γ) Given Corollary 1, it is sufficient to show that CostProj([[A(x, p, y)]]G,K , c) = all the following equations hold: [ 0 (6) [[t ]]G,K 0 t ∈rew(A(x,p,y))c [[hx, p, yi]]G,K = [[hx, p, yi]]G,K (7) The induction hypothesis is that (6) holds for c = iα + jβ + kγ for all i, j, k ≥ 0, where α, β, [ 0 γ are the cost of the insertion, deletion and sub- [[t ]]G,K = 0 stitution edit operations, respectively. We have al- t ∈rew(A(x, /p,y))c−α (8) ready shown the base case of i = j = k = 0. We CostProj([[A(x, /p, y)]]G,K , c − α) now show that (6) is true when one of, i, j or k is incremented by 1. Considering the RHS of equation (6), when we [ 0 [[t ]]G,K = apply one step of approximation to a triple pattern 0 t ∈rew(A(x,p/ ,y))c−α (9) the algorithm generates a set of triple patterns. These triple patterns will be recursively rewrit- CostProj([[A(x, p/ , y)]]G,K , c − α) ten by the algorithm. Therefore, by applying every possible edit operation to the original triple [ 0 pattern, we have that: [[t ]]G,K = 0 t ∈rew(A(x,,y))c−β S 0 (10) 0 [[t ]] = t ∈rew(A(x,p,y))c G,K CostProj([[A(x, , y)]]G,K , c − β) [[hx, p, yi]]G,K ∪ S 0 0 [[t ]] ∪ t ∈rew(A(x, /p,y))c−α G,K S 0 0 [[t ]]G,K ∪ t ∈rew(A(x,p/ ,y))c−α [ 0 S 0 [[t ]]G,K = t0∈rew(A(x,,y)) [[t ]]G,K ∪ c−β 0 S 0 t ∈rew(A(x, ,y))c−γ (11) 0 [[t ]] t ∈rew(A(x, ,y))c−γ G,K CostProj([[A(x, , y)]]G,K , c − γ) Considering the LHS of equation (6), again by the semantics of approximation, we have that: Equation (7) is trivially true. Equations (10) and (11) hold since on the LHS, rew(A(x, , y))c CostProj([[A(x, p, y)]]G,K , c) = and rew(A(x, , y))c contain only (x, , y) and [[hx, p, yi]]G,K ∪ (x, , y) respectively, for any c ≥ 0, and on the CostProj([[A(x, /p, y)]]G,K , c − α)∪ RHS, by the semantics of approximation, we CostProj([[A(x, p/ , y)]]G,K , c − α)∪ know that [[A(x, , y)]]G,K = [[x, , y]]G,K and CostProj([[A(x, , y)]]G,K , c − β)∪ [[A(x, , y)]]G,K = [[x, , y]]G,K . CostProj([[A(x, , y)]]G,K , c − γ) For equation (8), considering the semantics of Combining the last two into a single equation, approximation with concatenation of paths, the we therefore need to show that: LHS of the equation can be rewritten in the fol- 28 Flexible Query Processing for SPARQL lowing way since we know that we will not apply γ, δ are the costs of the four relaxation operations any step of approximation to A(x, , z): arising from rules 2, 4, 5 and 6, respectively, of Figure 1. We have already shown the base case S 0 ([[hx, , zi]] ) ( 0 [[t ]] ) G,K on t ∈rew(A(z,p,y))c−α G,K of i = j = k = l = 0. We now show that (14) holds when one of, i, j, k or l is incremented by Applying Lemma 3 we can rewrite the RHS of (8) 1. Similarly to the reasoning for approximation in to: part (a), we need to show that: CostProj(CostProj([[A(x, , z)]]G,K , c − α) on [[hx, p, yi]]G,K ∪ CostProj([[A(z, p, y)]]G,K , c − α), c − α) 0 CostProj([[R(x, p , y)]]G,K , c − α)∪ It is possible to drop the outer CostProj since CostProj([[R(x, type, a)]]G,K , c − β)∪ − the query [[A(x, , z)]]G,K returns only mappings CostProj([[R(a, type , x)]]G,K , c − β)∪ with associated cost 0, obtaining: CostProj([[R(x, type, a)]]G,K , c − γ)∪ − CostProj([[R(a, type , x)]]G,K , c − δ) CostProj([[A(x, , z)]]G,K , c − α) on = CostProj([[A(z, p, y)]]G,K , c − α) [[hx, p, yi]]G,K ∪ S 0 0 0 [[t ]] ∪ t ∈rew(R(x,p ,y))c−α G,K Therefore we need to show that the following S 0 0 [[t ]]G,K ∪ holds: t ∈rew(R(x,type,a))c−β S 0 t0∈rew(R(a,type−,x)) [[t ]]G,K ∪ S 0 c−β ([[hx, , zi]] ) ( 0 [[t ]] ) = S 0 G,K on t ∈rew(A(z,p,y))c−α G,K 0 [[t ]] ∪ t ∈rew(R(x,type,a))c−γ G,K CostProj([[A(x, , z)]] , c − α) S 0 G,K on 0 − [[t ]] t ∈rew(R(a,type ,x))c−δ G,K CostProj([[A(z, p, y)]]G,K , c − α) Given Corollary 1 it is sufficient to show that Given Corollary 1 it is sufficient to show that: the following hold::

[[hx, , zi]]G,K = (12) CostProj([[A(x, , z)]]G,K , c − α) [[hx, p, yi]]G,K = [[hx, p, yi]]G,K (15)

[ 0 0 [[t ]]G,K = CostProj([[R(x, p , y)]]G,K , c − α) = 0 t ∈rew(A(z,p,y))c−α (13) [ 0 (16) [[t ]]G,K 0 0 CostProj([[A(z, p, y)]]G,K , c − α) t ∈rew(R(x,p ,y))c−α

Equation (12) holds by similar reasoning to equation (11). Equation (13) holds by the induction CostProj([[R(x, type, a)]]G,K , c − β) = hypothesis. [ 0 (17) Equation (9) can be shown to hold by similar [[t ]]G,K 0 reasoning to equation (8). We conclude that equa- t ∈rew(R(x,type,a))c−β tion (6) holds for every c ≥ 0. (b) Relaxation. For relaxation, we show the fol- CostProj([[R(a, type−, x)]] , c − β) = lowing by induction on the cost c: G,K [ 0 (18) [[t ]]G,K 0 − t ∈rew(R(a,type ,x))c−β CostProj([[R(x, p, y)]]G,K , c) =

[ 0 (14) [[t ]]G,K 0 t ∈rew(R(x,p,y))c CostProj([[R(x, type, a)]]G,K , c − γ) =

[ 0 (19) The induction hypothesis is that (14) holds for [[t ]]G,K 0 c = iα+jβ+kγ+lδ for all i, j, k, l ≥ 0, where α, β, t ∈rew(R(x,type,a))c−γ Flexible Query Processing for SPARQL 29

− CostProj([[R(a, type , x)]]G,K , c − δ) = to the approxRegex and relaxTriplePattern func-

[ 0 (20) tions which return two sets of triple patterns that [[t ]]G,K will be joined with the AND operator. Therefore 0 − t ∈rew(R(a,type ,x))c−δ the RHS of equation (23) can be written in the following way: Equation (15) is trivially true. Equations (16- S 0 20) can be rewritten as the general case of the in- CostProj( 0 [[t ]] t ∈rew(A/R(x,P1,z))c G,K on duction hypothesis for some c ≥ 0. Therefore equa- S 0 t0∈rew(A/R(z,P ,y)) [[t ]]G,K , c) tions (16-20) hold by the induction hypothesis. We 2 c conclude that equation (13) holds for every c ≥ 0. Given the semantics of approximation and relaxation with concatenation of paths, the LHS of (2) Now we need to show that approxRegex and equation (23) can be written as follows: relaxTriplePattern are sound and complete for triple patterns containing any regular expression. CostProj([[A/R(x, P1, z)]]G,K on In part (1) we have demonstrated soundness and [[A/R(z, P2, y)]]G,K , c) completeness for triple patterns containing a single predicate, p: which by Lemma 3 is equal to: CostProj(CostProj([[A/R(x, P , z)]] , c) CostProj([[A/R(x, p, y)]]G,K , c) = 1 G,K on S 0 CostProj([[A/R(z, P , y)]] , c), c) 0 [[t ]] 2 G,K t ∈rew(A/R(x,p,y))c G,K This is our base case. We now show soundness We therefore need to show that: and completeness by structural induction, consid- CostProj(CostProj([[A/R(x, P , z)]] , c) ering the three diﬀerent operators used to con- 1 G,K on CostProj([[A/R(z, P2, y)]]G,K , c), c) = struct a regular expression: concatenation, dis- S 0 CostProj( 0 [[t ]] t ∈rew(A/R(x,P1,z))c G,K on junction and Kleene-Closure. S 0 0 [[t ]] , c) t ∈rew(A/R(z,P2,y))c G,K (a) Concatenation. The induction hypothesis is that the following equations hold for any regular It is possible to drop the outer CostProj operators on both sides of the above equation. Applying expressions P1 and P2: Corollary 1 it is suﬃcient to show that:

CostProj([[A/R(x, P1, y)]]G,K , c) = CostProj([[A/R(x, P1, z)]]G,K , c) = [ S 0 0 (21) 0 [[t ]]G,K [[t ]]G,K t ∈rew(A/R(x,P1,z))c 0 t ∈rew(A/R(x,P1,y))c CostProj([[A/R(z, P2, y)]]G,K , c) = S 0 0 [[t ]] t ∈rew(A/R(z,P2,y))c G,K These equations hold by the induction hypoth- CostProj([[A/R(x, P , y)]] , c) = 2 G,K esis. Therefore equation (23) holds. [ 0 (22) [[t ]]G,K (b) Disjunction. Similarly to concatenation, our 0 t ∈rew(A/R(x,P2,y))c induction hypothesis is that equations (21) and (22) hold for any regular expressions P1 and P2. We now show that the following holds: We now show that the following equation holds:

CostProj([[A/R(x, P1/P2, y)]]G,K , c) = CostProj([[A/R(x, P |P , y)]] , c) = [ 0 (23) 1 2 G,K [[t ]]G,K 0 [ 0 (24) t ∈rew(A/R(x,P1/P2,y))c [[t ]]G,K 0 t ∈rew(A/R(x,P1|P2,y))c When the approxRegex and relaxTriplePattern functions are passed as input a triple pattern of When the approxRegex and relaxTriplePattern the form A/R(x, P1/P2, y), this is split into two functions are passed as input a triple pattern of triple patterns: A/R(x, P1, z) and A/R(z, P2, y). the form A/R(x, P1|P2, y), this is split into two Both of these triple patterns are passed recursively triple patterns: A/R(x, P1, y) and A/R(x, P2, y). 30 Flexible Query Processing for SPARQL

Both of these triple patterns are passed recursively The approxRegex function rewrites an approxi- to the approxRegex and relaxTriplePattern func- mated triple pattern on the RHS of the equation tions which will return two sets of triple patterns in the following way: A(x, P i/P/P j, y)) for arbi- that will be combined with the UNION operator. trarily chosen i, j satisfying i + j = n. It then Therefore the RHS of equation (24) can be written i splits this into three triple patterns, A(x, P , z1), as follows: j A(z1, P, z2) and A(z2,P , y). Therefore the RHS S 0 of (25) becomes: 0 [[t ]] ∪ t ∈rew(A/R(x,P1,y))c G,K S 0 0 [[t ]] t ∈rew(A/R(x,P2,y))c G,K Given the semantics of approximation and re- [ 0 CostProj( [[t ]]G,K on laxation with disjunction of paths, we can write 0 i t ∈rew(A(x,P ,z1))c the LHS of equation (24) as follows: [ 0 [[t ]]G,K on (26) CostProj([[A/R(x, P1, y)]]G,K ∪ 0 t ∈rew(A(z1,P,z2))c [[A/R(x, P2, y)]]G,K , c) [ 0 [[t ]]G,K , c) which by Lemma 3 is equal to: 0 j t ∈rew(A(z2,P ,y))c

CostProj([[A/R(x, P1, y)]]G,K , c) ∪ CostProj([[A/R(x, P2, y)]]G,K , c) We have added the CostProj operator in order to follow the behaviour of the algorithm that ex- We therefore need to show that: cludes queries with associated cost greater than c. CostProj([[A/R(x, P1, y)]]G,K , c) ∪ Knowing that L(P i/P/P j) = L(P n+1) and by CostProj([[A/R(x, P2, y)]]G,K , c) = S 0 the semantics of approximation with concatena- 0 [[t ]] ∪ t ∈rew(A/R(x,P1,y))c G,K S 0 tion of paths, we can write the LHS as: 0 [[t ]] t ∈rew(A/R(x,P2,y))c G,K

By Corollary 1 it is suﬃcient to show that: i CostProj([[A(x, P , z1)]]G,K on [[A(z , P, z )]] [[A(z ,P j, y)]] , c) CostProj([[A/R(x, P1, y)]]G,K , c) = 1 2 G,K on 2 G,K S 0 0 [[t ]] t ∈rew(A/R(x,P1,y))c G,K CostProj([[A/R(x, P2, y)]]G,K , c) = Applying Lemma 3, this can be rewritten as: S 0 0 [[t ]] t ∈rew(A/R(x,P2,y))c G,K

These equations hold by the induction hypothesis. i CostProj(CostProj([[A(x, P , z1)]]G,K , c) on Therefore equation (24) holds. CostProj([[A(z , P, z )]] , c) (c) Kleene-Closure. Our induction hypothesis in 1 2 G,K on j this case is that CostProj([[A(z2,P , y)]]G,K , c), c) n (27) CostProj([[A/R(x, P , y)]]G,K , c) = S 0 t0∈rew(A/R(x,P n,y)) [[t ]]G,K c Combining (26) and (27) and removing the for any regular expression P and any n ≥ 0, where outer CostProj operator on both hand sides we P n denotes the regular expression P/P/ . . . /P in therefore need to show that: which P appears n times. For the base case of n = 0, where P n = , the equation is trivially i true since rew(A(x, , y))c contains only the query CostProj([[A(x, P , z1)]]G,K , c) on (x, , y). We now show that the following holds: CostProj([[A(z1, P, z2)]]G,K , c) on j n+1 CostProj([[A(z2,P , y)]]G,K , c) = CostProj([[A/R(x, P , y)]]G,K , c) = S 0 0 i [[t ]] t ∈rew(A(x,P ,z1))c G,K on [ S 0 0 (25) 0 [[t ]]G,K [[t ]]G,K t ∈rew(A/R(z1,P,z2))c on S 0 0 n+1 0 j [[t ]] t ∈rew(A/R(x,P ,y))c t ∈rew(A/R(z2,P ,y))c G,K Flexible Query Processing for SPARQL 31

By Corollary 1 it is suﬃcient to show that: We now show that the following holds.

i [ 00 CostProj([[A(x, P , z1)]]G,K , c) = CostProj([[Q]]G,K , c) = [[Q ]]G,K 00 Q ∈rew(Q)c [ 0 (28) [[t ]]G,K (32) 0 i t ∈rew(A(x,P ,z1))c The LHS of equation (32) is equivalent to the following by the semantics of the AND operator: CostProj([[A(z , P, z )]] , c) = 0 1 2 G,K CostProj([[t]]G,K on [[[Q ]]G,K , c) [ 0 (29) [[t ]]G,K Applying Lemma 3 we can rewrite this as fol- 0 t ∈rew(A(z1,P,z2))c lows:

j CostProj(CostProj([[t]]G,K , c) on CostProj([[A(z2,P , y)]]G,K , c) = (33) 0 CostProj([[Q ]]G,K , c), c) [ 0 (30) [[t ]]G,K 0 j t ∈rew(A(z2,P ,y))c For the RHS of equation (32) we have to consider two different cases: either t is a simple triple Equations (28,29,30) hold by the induction hy- pattern or it contains the RELAX or APPROX pothesis since i and j are both less than n; there- operators. If we consider the former case then we fore equation (25) holds. rewrite the RHS of equation (32) to: The same reasoning applies for the relaxTriplePat- [ 00 tern function applied to a relaxed triple pattern [[t]]G,K on [[Q ]]G,K (34) 00 0 on the RHS of (25) with the difference that it Q ∈rew(Q )c rewrites the triple pattern in 3 different ways: R(x, P i/P/P j, y) (for arbitrarily chosen i, j satis- Combining (33) and (34) we need to show that: n n fying i+j = n), R(x, P/P , y) and R(x, P /P, y). CostProj(CostProj([[t]]G,K , c) on 0 It is possible to apply the same steps of the proof CostProj([[Q ]]G,K , c), c) = [[t]]G,K on n S 00 as for approxRegex, noticing that L(P/P ) = 00 0 [[Q ]] Q ∈rew(Q )c G,K L(P n+1) and L(P n/P ) = L(P n+1). We are able to drop the outer CostProj operator (3) General queries. We now show that the al- and the CostProj applied to the triple pattern t gorithm is sound and complete for any query that since [[t]]G,K returns mappings with cost 0. The may contain approximation and relaxation. As the resulting equation is as follows: base case we have the case of a query comprising 0 [[t]]G,K on CostProj([[Q ]]G,K , c) = [[t]]G,K on a single triple pattern, which has been shown in S 00 00 0 [[Q ]]G,K part (2) of the proof: Q ∈rew(Q )c Applying Corollary 1 it is sufficient to show CostProj([[A/R(x, P, y)]]G,K , c) = that: S 0 0 [[t ]] t ∈rew(A/R(x,P,y))c G,K

Consider now a query Q = t AND Q0 with t being [[t]]G,K = [[t]]G,K (35) an arbitrary triple pattern of the query Q. The 0 [ 00 CostProj([[Q ]]G,K , c) = [[Q ]]G,K induction hypothesis is that: 00 0 Q ∈rew(Q )c (36) 0 [ 00 CostProj([[Q ]]G,K , c) = [[Q ]]G,K 00 0 Q ∈rew(Q )c Equation (35) is trivially true and equation (36) (31) holds by the induction hypothesis. Therefore equa- 32 Flexible Query Processing for SPARQL tion (32) holds in the case of t being a simple triple Equation (37) holds since approxRegex and re- pattern. laxTriplePattern are sound and complete as shown If t contains the APPROX or RELAX operators in step (2) of the proof. Equation (38) holds by then the RHS of (32) is: the induction hypothesis. Therefore equation (32) holds in the case of t containing the APPROX and S 0 S 00 0 [[t ]]G,K 00 0 [[Q ]]G,K t ∈rew(t)c on Q ∈rew(Q )c RELAX operators. Therefore, combining the latter with (33), we have: Proof of Theorem 10. CostProj(CostProj([[t]]G,K , c) on 0 CostProj([[Q ]]G,K , c), c) = Proof. The algorithm terminates when the set S 0 CostProj(( 0 [[t ]] ) t ∈rew(t)c G,K on oldGeneration is empty. At the end of each S 00 ( 00 0 [[Q ]] ), c) cycle, oldGeneration is assigned the value of Q ∈rew(Q )c G,K newGeneration. During each cycle, elements are (We have added the CostProj operator on the added to newGeneration only when new queries RHS of the equation in order to follow the be- are generated and have cost less than c or already haviour of the algorithm that excludes queries generated queries are generated again at a lesser with associated cost greater than c). Removing the cost (also less than c). CostProj from both hand sides of the equation and On each cycle of the algorithm, each query gen- applying Corollary 1, it is suﬃcient to show that: erated by applyApprox or applyRelax has cost at least c0 plus the cost of the query from which it is generated. Since we start from query Q which has [ 0 0 CostProj([[t]]G,K , c) = [[t ]]G,K cost 0, every query generated during the nth cycle 0 t ∈rew(t)c will have cost greater than or equal to n·c0. When (37) n · c0 > c the algorithm will not add any queries [ CostProj([[Q0]] , c) = [[Q00]] to newGeneration. Therefore, the algorithm will G,K G,K 0 00 0 stop after at most dc/c e iterations. Q ∈rew(Q )c (38)