Xpath Processing in a Nutshell∗

XPath Processing in a Nutshell∗ Georg Gottlob, Christoph Koch, and Reinhard Pichler Database and Artificial Intelligence Group Technische Universität Wien, A-1040 Vienna, Austria fgottlob, [email protected], [email protected] Abstract size and intricacy of the queries. (Database theoreti- cians refer to this as combined complexity.) We provide a concise yet complete formal We present main-memory XPath processing algo- definition of the semantics of XPath 1 and rithms that run in time guaranteed to be polynomial summarize efficient algorithms for processing in the size of the complete input, i.e. query and data. queries in this language. Our presentation is As experimentally verified in [3], where these and other intended both for the reader who is looking results where first presented, contemporary XPath en- for a short but comprehensive formal account gines do not have this property, and require time ex- of XPath as well as the software developer in ponential in the size of queries at worst. need of material that facilitates the rapid im- We present polynomial-time XPath evaluation algo- plementation of XPath engines. rithms in two flavors, bottom-up and top-down2. Both algorithms have the same worst-case polynomial-time 1 Introduction bounds. The former approach comes with a clear and XPath [10] is a practical language for selecting nodes simple intuition why our algorithms are polynomial, from XML document trees. Its importance stems from while the latter requires to compute considerably fewer its suitability as an XML query language per se and it useless intermediate results and consequently performs being at the core of several other XML-related tech- much better in practice. nologies, such as XSLT, XPointer, and XQuery. The structure of this paper is as follows. Sec- Previous semantics definitions of XPath have either tion 2 introduces XPath axes. Section 3 presents the been overly lengthy [10, 12], or were restricted to frag- data model of XPath and auxiliary functions used ments of the language (e.g. [6, 7]). In this paper, we throughout the paper. Section 4 defines the seman- provide a formal semantics definition of full XPath, in- tics of XPath in a concise way. Section 5 houses the tended for the implementor or the reader who values bottom-up semantics definition and algorithm. Sec- a concise, formal, and functional description. As an tion 6 comes up with the modifications to obtain a extension to [3], we also provide a brief but complete top-down algorithm. We conclude with Section 7. definition of the XPath data model. Given a query language, the prototypical algorith- 2 XPath Axes mic problem is query evaluation, and its hardness needs to be formally studied. The query evaluation In XPath, an XML document is viewed as an unranked problem was not addressed for full XPath in any depth (i.e., nodes may have a variable number of children), until recently [3, 2, 4]1. We summarize the main ideas ordered, and labeled tree. Before we make the data of that work in this paper, with an emphasis on use- model used by XPath precise (which distinguishes be- fulness to a wide audience. tween several types of tree nodes) in Section 3, we Since XPath and related technologies will be tested introduce the main mode of navigation in document in ever-growing deployment scenarios, implementa- trees employed by XPath { axes { in the abstract, ig- tions of XPath engines need to scale well both with noring node types. We will point out how to deal with respect to the size of the XML data and the growing different node types in Section 3. All of the artifacts of this and the next section ∗ This work was supported by the Austrian Science Fund (FWF) under project No. Z29-INF. All methods and algorithms are defined in the context of a given XML document. presented in this paper are covered by a pending patent. Fur- Given a document tree, let dom be the set of its nodes, ther resources, updates, and possible corrections are available and let us use the two functions at http://www.xmltaskforce.com. 1The novelty of our results is a bit surprising, as considerable effort has been made in the past to understand the XPath query firstchild, nextsibling : dom ! dom; containment problem (e.g. [1, 5, 8]), which { with its relevance to query optimization { is only a means to the end of efficient 2By this we refer to the mode of traversal of the expression query evaluation. tree of the query while computing the query result. ∗ child := firstchild:nextsibling function −1 ∗ −1 evalχ1[χ2 (S) := evalχ1 (S) [ evalχ2 (S). parent := (nextsibling ) :firstchild ∗ ∗ function eval(R1[:::[Rn) (S) begin descendant := firstchild:(firstchild [ nextsibling) 0 0 −1 −1 ∗ −1 S := S; /* S is represented as a list */ ancestor := (firstchild [ nextsibling ) :firstchild while there is a next element x in S0 do descendant-or-self := descendant [ self append fRi(x) j 1 ≤ i ≤ n; Ri(x) 6= null; ancestor-or-self := ancestor [ self 0 0 Ri(x) 62 S g to S ; following := ancestor-or-self:nextsibling: 0 nextsibling∗:descendant-or-self return S ; preceding := ancestor-or-self:nextsibling−1: end; −1 ∗ (nextsibling ) :descendant-or-self where S ⊆ dom is a set of nodes of an XML document, following-sibling := nextsibling:nextsibling∗ 1 1 e and e are regular expressions, R, R , : : :, Rn are preceding-sibling := (nextsibling− )∗:nextsibling− 1 2 1 primitive relations, χ1 and χ2 are axes, and χ is an axis other than \self". Table 1: Axis definitions in terms of \primitive" tree relations “firstchild", \nextsibling", and their inverses. Clearly, some axes could have been defined in a simpler way in Table 1 (e.g., ancestor equals ∗ 3 parent.parent ). However, the definitions, which use to represent its structure . “firstchild" returns the a limited form of regular expressions only, allow to first child of a node (if there are any children, i.e., compute χ(S) in a very simple way, as evidenced by the node is not a leaf), and otherwise \null". Let Algorithm 2.2. n ; : : : ; nk be the children of some node in document 1 The function eval ∗ essentially computes order. Then, nextsibling(n ) = n , i.e., \nextsib- (R1[:::[Rn) i i+1 graph reachability (not transitive closure). It can be ling" returns the neighboring node to the right, if it implemented to run in linear time in terms of the data exists, and \null" otherwise (if i = k). We define the in a straightforward manner; (non)membership in S0 is functions firstchild−1 and nextsibling−1 as the inverses checked in constant time using a direct-access version of the former two functions, where \null" is returned if of S0 maintained in parallel to its list representation no inverse exists for a given node. Where appropriate, (naively, this could be an array of bits, one for each we will use binary relations of the same name instead member of dom, telling which nodes are in S0). of the functions. (fhx; f(x)i j x 2 dom; f(x) 6= nullg is the binary relation for function f.) Lemma 2.3 Let S ⊆ dom be a set of nodes of an The axes self , child, parent, descendant, ancestor, XML document and χ be an axis. Then, (1) χ(S) = descendant-or-self , ancestor-or-self , following, preced- evalχ(S) and (2) Algorithm 2.2 runs in time O(jdomj). ing, following-sibling, and preceding-sibling are binary relations χ ⊆ dom × dom. Let self := fhx; xi j x 2 3 Data Model domg. The other axes are defined in terms of our \primitive" relations “firstchild" and \nextsibling" as Let dom be the set of nodes in the document tree as ∗ shown in Table 1 (cf. [10]). R1:R2, R1 [R2, and R1 de- introduced in the previous section. Each node is of one note the concatenation, union, and reflexive and tran- of seven types, namely root, element, text, comment, sitive closure, respectively, of binary relations R1 and attribute, namespace, and processing instruction. As R2. Let E(χ) denote the regular expression defining in DOM [9], the root node of the document is the only χ in Table 1. It is important to observe that some one of type \root", and is the parent of the document axes are defined in terms of other axes, but that these element node of the XML document. The main type of definitions are acyclic. non-terminal node is \element", the other node types are self-explaining (cf. [10]). Nodes of all types besides Definition 2.1 (Axis Function) Let χ denote an \text" and \comment" have a name associated with it. XPath axis relation. We define the function χ : A node test is an expression of the form τ() (where 2dom ! 2dom as χ(X ) = fx j 9x 2 X : x χxg τ is a node type or the wildcard \node", matching any 0 0 0 0 type) or τ(n) (where n is a node name and τ is a type (and thus overload the relation name χ), where X0 ⊆ dom is a set of nodes. whose nodes have a name). τ(∗) is equivalent to τ(). We define a function T which maps each node test Algorithm 2.2 (Axis Evaluation) to the subset of dom that satisfies it. For instance, Input: A set of nodes S and an axis χ T (node()) = dom and T (attribute(href)) returns all Output: χ(S) attribute nodes labeled \href". Method: evalχ(S) Example 3.1 Consider a document consisting of six function eval (S) := S. nodes { the root node r, which is the parent of self the document element node a (labeled \a"), and function evalχ(S) := evalE(χ)(S).

Load more