An Efficient XPath Query Processor for XML Streams
Yi Chen (Arizona State University), Susan B. Davidson (University of Pennsylvania), Yifeng Zheng (University of Pennsylvania)

Abstract

Streaming XPath evaluation algorithms must record a potentially exponential number of pattern matches when both predicates and descendant axes are present in queries, and the XML data is recursive. In this paper, we use a compact data structure to encode these pattern matches rather than storing them explicitly. We then propose a polynomial time streaming algorithm to evaluate XPath queries by probing the data structure in a lazy fashion. Extensive experiments show that our approach not only has a good theoretical complexity bound but is also efficient in practice.

1 Introduction

XML has become the de facto standard for data exchange. The problem of efficiently evaluating XML queries, e.g. XPath, in both main memory and streaming environments has therefore attracted a lot of attention from the research community [7, 16, 28].

In this paper, we focus on a streaming environment, as found with stock market data, network monitoring or sensor networks. In such an environment, data streams, which can be potentially infinite, arrive continuously and must be processed using a single sequential scan because of the limited storage space available. Query results should be distributed incrementally and as soon as they are found, potentially before we read all the data. Furthermore, the query processing algorithm should scale well in time and space. An algorithm that meets these requirements for XPath processing over XML data is called a streaming XPath evaluation algorithm.

Several streaming XPath evaluation algorithms based on finite state automata (FSA) have been proposed to process XPath queries containing the child axis ('/'), descendant axis ('//') and wildcard ('*') [3, 19]. Automaton-based methods are attractive due to their efficiency and clean design. However, they cannot evaluate XPath queries which contain predicates ('[...]') since an FSA is memory-less, as observed in [25]. Since predicates are common in XPath queries, we must be able to handle not only wildcards, child and descendant axes, but also predicates.

When evaluating predicates on XML streams, we may encounter data that can potentially be a query solution (a candidate node) before we encounter the data required to evaluate the predicates that decide its membership; therefore, we must remember candidates as well as their query pattern matches until the relevant data is encountered. For example, consider the XPath query //a[d]/b[e]//c and the sample XML document shown in figure 1(a). (Footnote 1: Throughout the paper, we use subscripts to distinguish nodes with the same tag.) When we process the XML element $c_1$ in document order (or equivalently, in a pre-order traversal of the XML tree), we cannot determine whether or not it is in the query result at the point that it is encountered. We therefore need to record information about the pattern match to subquery //a/b//c: $(a_n, b_1, c_1)$ until we can determine the predicate satisfaction of $a_n$ and $b_1$, thus deciding whether or not $c_1$ is a solution.

Based on this intuition, several algorithms [23, 25, 26, 21, 20] have been proposed to process XML queries containing predicates. These algorithms are efficient and scale well for nonrecursive XML streams, i.e. data in which tags do not repeat along a root-to-leaf path. However, when predicates are combined with descendant axis traversal and the XML data is recursive, evaluating XPath queries in a streaming fashion raises new challenges:

• Due to the combination of descendant axis traversal in a query and the recursive structure of XML data, the number of pattern matches of a single XML node to a subquery can be exponential in the query size. Consider the query $Q_1$: //a[d]//b[e]//c and the XML data in figure 1(a). Note that $Q_1$ is different from the earlier query due to the descendant (rather than child) axis traversal between tags a and b. For the XML node $c_1$ there are $n^2$ ways for $c_1$ to match subquery //a//b//c: $(a_i, b_j, c_1)$, where $1 \le i \le n$, $1 \le j \le n$. Representing $n^2$ in terms of the size of the data and query, this becomes $O((|D|/|Q|)^{|Q|})$, where $|D|$ is the XML data size and $|Q|$ is the XPath query size.

• At least one pattern match must satisfy the query predicates to make a candidate become part of the result. However, until a pattern match which satisfies the query predicates is found, we must record all the pattern matches for a candidate and test their predicate satisfaction. In the worst case, we will not know whether or not the candidate is indeed a solution until all pattern matches are considered. For example, when node $c_1$ is met, since we are not able to determine the predicate satisfaction of its $n^2$ subquery pattern matches, we must record all of them. Processing the data in document order, we then verify that the matches $(a_i, b_j, c_1)$, $2 \le i \le n$, $2 \le j \le n$, fail the predicates in $Q_1$. At the very end, we find that the match $(a_1, b_1, c_1)$ satisfies the predicates and therefore $c_1$ is a query solution.

[Figure 1. Sample XML Data and an XPath Query: (a) Data $D_1$, (b) Query $Q_1$]

These challenges do not exist in a non-streaming environment since XML nodes can be randomly accessed during query processing. For example, in the best known polynomial time main memory algorithms for evaluating XPath queries [16], the whole document is loaded into main memory before query processing. Since XML nodes can be randomly accessed, predicates can be checked first so that we do not need to remember the pattern matches. These techniques are not suitable for processing XML streams, where only a single sequential scan is allowed. Furthermore, as will be shown in section 5, the algorithms have trouble processing large XML files.

Previous XPath streaming algorithms that handle predicates either do not support descendant axis traversal [23, 21], or explicitly store all pattern matches [25, 26]. As analyzed in [26], the worst case complexity of the algorithms in [25, 26] is $O(|D| \times 2^{|Q|} \times k)$, where $k$ is the number of different query pattern matches in which an XML node participates.

As discussed, recording pattern matches by enumerating and storing them explicitly can be expensive. Motivated by [7], we therefore design a stack-based data structure to concisely encode pattern matches. We then propose a novel XPath streaming algorithm, TwigM, which searches for satisfying matches in the compact data structure by pruning the search space without enumerating all the pattern matches. In this way, TwigM achieves a complexity which is polynomial in the size of the data and query. As tested in extensive experiments, TwigM is a practically efficient and worst case polynomial algorithm.

The TwigM algorithm was implemented in an XPath query processor for XML streams and demonstrated in [11].

The contributions of this paper are:

1. We design a data structure to encode the matches in a compact form. For example, to process $Q_1$ on the sample data in figure 1(a), TwigM stores $2n$ nodes to encode $n^2$ pattern matches.

2. We propose a streaming XPath evaluation algorithm called TwigM. Recall that determining whether a candidate is indeed a solution entails finding at least one pattern match that satisfies all the query predicates. Rather than computing all the pattern matches in the search space explicitly from the compact data structure, and then testing predicate satisfaction, TwigM prunes the search space as we process the XML stream by checking predicate satisfaction on a small number of elements in the data structure. For example, we only need to check predicate satisfaction on $2n$ elements instead of checking $n^2$ pattern matches to evaluate $Q_1$.

3. We analyze the time complexity of TwigM, which is polynomial in terms of the size of the data and query.

4. We present a detailed performance evaluation of an implementation of TwigM compared with several other related systems. The results show that our approach not only has a good theoretical complexity but also performs well on various practical queries and data sets.

The remainder of the paper is organized as follows. Section 2 presents the data model and query language. Section 3 gives an overview of our XPath streaming evaluation strategy. TwigM is introduced and analyzed in section 4. Section 5 presents performance results. Section 6 discusses related work, and section 7 concludes the paper.

2 Data Model and Query Language

In this paper, XML data is modeled as a stream of modified SAX events: startElement(tag, level, id) and endElement(tag, level), where tag is the tag of the node being processed, level is the level of the node in the corresponding XML tree, and id is the unique identifier of the node. (Footnote 2: We omit the discussion of attributes for now; however, our implementation supports attributes as well as elements.) These events are the input to our algorithms.

Definition 2.1: The current node is the XML node whose tag is currently being parsed by the SAX parser. An active node is an XML node whose start tag has been processed but whose end tag has not yet been processed by the SAX parser.
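The event model and the notion of an active node can be illustrated with a minimal sketch built on Python's standard `xml.sax` module. This is not the authors' implementation. The class name `ModifiedSaxHandler`, the helper `modified_events`, the choice of level 1 for the root, and the assignment of ids in document order are all assumptions made for illustration; the paper only states that each node has a level and a unique identifier.

```python
import xml.sax


class ModifiedSaxHandler(xml.sax.ContentHandler):
    """Turns standard SAX callbacks into modified events with level and id."""

    def __init__(self):
        super().__init__()
        self.events = []   # modified events, in document order
        self.active = []   # stack of (tag, level, id) for the active nodes
        self.next_id = 1   # assumption: ids are handed out in document order

    def startElement(self, name, attrs):
        level = len(self.active) + 1       # assumption: the root is at level 1
        node = (name, level, self.next_id)
        self.next_id += 1
        self.active.append(node)           # the node is active from here on
        self.events.append(("startElement",) + node)

    def endElement(self, name):
        tag, level, _ = self.active.pop()  # the node stops being active
        self.events.append(("endElement", tag, level))


def modified_events(xml_text):
    """Parse a small XML string and return its list of modified events."""
    handler = ModifiedSaxHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in modified_events("<a><b/><b><c/></b></a>"):
        print(event)
```

At any point during parsing, the `active` stack holds exactly the nodes on the path from the root to the current node, which is why its length gives the level of the next start tag.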