Type-Based Detection of XML Query-Update Independence

Type-Based Detection of XML Query-Update Independence Nicole Bidoit-Tollu Dario Colazzo Federico Ulliana Universite Paris Sud & Universite Paris Sud & Universite Paris Sud & INRIA Saclay INRIA Saclay INRIA Saclay [email protected] [email protected] [email protected] ABSTRACT Schema languages, while in other contexts quite precise schemas, This paper presents a novel static analysis technique to detect XML in the form of a DTD, can be automatically inferred, by using accu- query-update independence, in the presence of a schema. Rather rate and efficient existing techniques like the one proposed by Bex than types, our system infers chains of types. Each chain rep- et al. in [8]. resents a path that can be traversed on a valid document during query/update evaluation. The resulting independence analysis is State of the Art precise, although it raises a challenging issue: recursive schemas Schema-based detection of XML query-update independence has may lead to inference of infinitely many chains. been recently investigated. The state of the art technique has been A sound and complete approximation technique ensuring a finite presented by Benedikt and Cheney in [6]. This technique infers analysis in any case is presented, together with an efficient imple- from the schema the set of node types traversed by the query, and mentation performing the chain-based analysis in polynomial space the set of node types impacted by the update. The query and the and time. update are then deemed as independent if the two sets do not overlap. This technique is effective since the static analysis i) is able to manage a wide class of XQuery queries and updates, ii) can be 1. INTRODUCTION performed in a negligible time, and iii) as a consequence, even on A query and an update are independent when the query result is small documents, can avoid expensive query re-computation when not affected by update execution, on any possible input database. independence wrt an update is detected. However, the technique Detecting query-update independence is of crucial importance in has some weaknesses. As illustrated in [6], in some cases, inde- many contexts: i) to minimize view re-materialization; ii) to ensure pendence is not detected, due to some over-approximation made isolation, when queries and updates are executed concurrently; iii) by the type inference rules. as outlined in [6], to enforce access control policies, when the query For example, this technique cannot detect independence between is used to define the part of the database that must not be changed the query q1===a==c and the update u1=delete ==b==c, when by a user update. the schema enforces that c descendants of b nodes are never descen- In all these contexts, benefits are amplified when query-update dants of a nodes. This is because the type inference technique of [6] independence can be checked statically. In order to be useful, ev- infers the type c both for the query path and the update path, with- ery static analysis technique must be sound: if query-update inde- out considering contextual information about the inferred types. pendence is statically detected, then independence does hold. The Since the query and update types overlap, independence is wrongly inverse implication (completeness) cannot be ensured in the general excluded. Indeed, the technique is not precise enough when ances- case, since static independence detection is undecidable (see [6]). tor or descendant axes are used in queries and updates. This means that if a static analyzer is used, for instance, in a view The way XPath axes are typed is not the only source of low pre- maintenance system, sometimes views are re-materialized after up- cision of this technique. Consider documents typed by the well dates even if not needed, because the analysis has not been smart known bibliographic DTD used in [1], the query q2===title and enough to statically detect a view-update independence. Useless the update u2=for x in ==book return insert <author=> view re-materialization frequently occurs if a static analyzer with into x. The technique of [6] infers bib, book and title as types low precision is adopted. This can lead to great waste of time, since traced by q2, and book as type impacted by u2. According to this view materialization cost can be proportional to the database size. technique, the two expressions share the type book, hence indepen- High precision of static independence analysis can be ensured dence is not detected, while it holds. by taking into account schema information. In many contexts, In none of the above examples, independence can be detected schemas are defined by users, mainly by means of the DTD or XML by techniques ignoring schema information like the path-based approach proposed by Ghelli et al.1 [15] and the recent destabilizers- Permission to make digital or hard copies of all or part of this work for based approach proposed by Benedikt and Cheney [5]. Follow- personal or classroom use is granted without fee provided that copies are ing these approaches, for the example q1-u1, the paths ==a==c and not made or distributed for profit or commercial advantage and that copies ==b==c are deemed as overlapping since, for instance, documents bear this notice and the full citation on the first page. To copy otherwise, to matching the path =a=b=c match both paths and similarly, for ex- republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present ample q2-u2 and the paths ==title and ==book. their results at The 38th International Conference on Very Large Data Bases, 1 August 27th - 31st 2012, Istanbul, Turkey. This technique deals with update-commutativity detection for a Proceedings of the VLDB Endowment, Vol. 5, No. 9 language with side effects and can be directly extended to query- Copyright 2012 VLDB Endowment 2150-8097/12/05... $ 10.00. update independence detection without side effects. 872 ∗ Contributions sd=doc d(doc)=(a j b) d(a)=c d(b)=c This paper proposes a novel schema-based approach for detecting <doc> XML query-update independence. Differently from [6, 10, 11], <a><c=><=a> a DTD d our system infers sequences of labels (hereafter called chains). In- <a><c=><=a> 8 tuitively, for each node that can be selected by a query/update path <b><c=><=b> > lt doc[l1; l2; l3; l4] <a><c=><=a> <> l a[l0 ] l a[l0 ] in a schema instance, the system infers a chain recording i) all la- σ = 1 1 2 2 <=doc> 0 0 bels that are encountered from the root to the node, ii) in the order > l3 b[l3] l4 a[l4] :> 0 of traversal. This information is at the basis of a precise static inde- li c[]; i = 1::4 a document t valid wrt d pendence analysis. For instance, for q ===a==c and u ===b==c 1 1 the store (σ; l ) over the schema f doc (ajb)∗; a c; b c g, the chains doc:a:c t and doc:b:c are inferred for the query and the update, respectively. Disjointness of these two chains allows us to statically derive the Figure 1: Document and store independence for q1-u1. For the DTD of the XQuery Use Cases [1] before discussed, the chains bib:book:title and bib:book:author, respectively inferred for q2 and u2, diverge after the book sym- 2. PRELIMINARIES bol; this allows us to conclude independence for q2-u2. These two examples highlight that chain inference provides a more precise in- Data model. We represent an instance of the XML data model as dependence analysis than that of [6, 15, 5]. a store σ, which is an environment associating each node location The main contribution of this work is a precise algorithm to de- (or identifier) l with either an element node a[L] or a text node s. tect independence for a query-update pair q-u knowing that docu- In a[L], a is the element tag, while L=(l ; : : : ; l ) is the ordered ments are valid wrt a DTD d. It strongly relies on the following 1 n list of children locations in σ. A tree is a pair t=(σ; l ), where l is developments. t t its root location. dom(σ) denotes the set of locations of σ, while • Chain-based independence for q-u, a static notion, is the foun- σ@l denotes the subtree of σ rooted at l whose domain is limited dation of our algorithm: starting from the set Cd of all possi- to locations connected to l. See Figure 1 for a small document ble chains associated with the DTD d, our inference system together with its store. extracts subsets of chains Cq and Cu which soundly approx- imate the navigation through valid documents made by the DTDs. A DTD is a 3-tuple (Σ; sd; d) where: Σ is a finite alphabet evaluation of the query q and the update u, respectively. Note for element tags, denoted by a; b; c; sd2Σ is the start symbol; d is that our inference system (Section 3) is cautiously specified a function from Σ to regular expressions over Σ[fSg, where S for dealing with all XPath axes. Chain-based independence denotes the string type. For simplicity, next we often use only the is the result of the absence of overlapping pair of chains in Cq d component to specify a DTD. and Cu. Chain-based independence is proved to be sound wrt A tree t=(σ; lt) is valid wrt d, denoted t2d, iff there exists a the semantics notion of query-update independence (Section mapping ν : dom(t) 7! Σ[fSg such that: ν(lt)=sd; ν(l)=S 4).

Load more