Path Queries on Compressed XML∗
Peter Buneman          Martin Grohe          Christoph Koch
[email protected]    [email protected]    [email protected]

Laboratory for Foundations of Computer Science
University of Edinburgh, Edinburgh EH9 3JZ, UK

∗ This work was supported in part by a Royal Society Wolfson Merit Award. The third author was sponsored by Erwin Schrödinger grant J2169 of the Austrian Research Fund (FWF).

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003.

Abstract

Central to any XML query language is a path language such as XPath, which operates on the tree structure of the XML document. We demonstrate in this paper that the tree structure can be effectively compressed and manipulated using techniques derived from symbolic model checking. Specifically, we show first that succinct representations of document tree structures based on sharing subtrees are highly effective. Second, we show that compressed structures can be queried directly and efficiently through a process of manipulating selections of nodes and partial decompression. We study both the theoretical and experimental properties of this technique and provide algorithms for querying our compressed instances using node-selecting path query languages such as XPath.

We believe the ability to store and manipulate large portions of the structure of very large XML documents in main memory is crucial to the development of efficient, scalable native XML databases and query engines.

1 Introduction

That XML will serve as a universal medium for data exchange is not in doubt. Whether we shall store large XML documents as databases (as opposed to using conventional databases and employing XML just for data exchange) depends on our ability to find XML storage models that support efficient querying of XML. The main issue here is how we represent the document in secondary storage. One approach [9] is to index the document and to implement a cache policy for bringing subtrees of the document tree into main memory on demand. Another approach [8, 11, 19] is to store information about each node in the document tree in one or more tuples in a relational database. In both cases the structure is fragmented, and substantial I/O is required to evaluate a complex path expression; this cost increases with the size of the source document.

An alternative approach is to extract the text (the character string data) from the document and to store it in separate containers, leaving the bare structure: a tree whose nodes are labeled with element and attribute names. We shall call this structure the skeleton of the document. This separation of the skeleton from string data is used in the XMILL compressor [15] and, as an internal model of data representation in query engines, is reminiscent of earlier vertical partitioning techniques for relational data [3], which have recently been resurrected [2] for query optimization.

Even though the decomposition in XMILL is used for compression only, it is natural to ask whether it could also be used to enable efficient querying. The promise of such an approach is clear: we only need access to the skeleton to handle the navigational aspect of (path) query evaluation, which usually takes a considerable share of the query processing time; the remaining features of such queries require mostly localized access to the data. The skeleton of an XML document is usually relatively small, and the efficiency of query evaluation will depend on how much of the skeleton one can fit into main memory. However, for large XML documents with millions of nodes, the tree skeletons are still large. Thus compressing the skeleton, provided we can query it directly, is an effective optimization technique.

In this paper, we develop a compression technique for skeletons, based on sharing of common subtrees, which allows us to represent the skeletons of large XML documents in main memory. Even though this method may not be as effective in the reduction of space as Lempel-Ziv compression [22] is on string data, it has the great advantage that queries may be efficiently evaluated directly on the compressed skeleton, and that the result of a query is again a compressed skeleton. Moreover, the skeleton we use is a more general structure than that used in XMILL, where it is a tree containing just the tag and attribute information of nodes. Our skeletons may be used to express other properties of nodes in the document tree, e.g. that they contain a given string, or that they are the answer to a (sub)query.

Our work borrows from symbolic model checking, an extremely successful technique that has led formal hardware verification to industrial strength [16, 6]. Model checking is an approach to the verification of finite state systems based on checking whether the state space of the system to be verified satisfies, or is a model of, certain properties specified in some temporal logic. It has already been observed that checking whether the state space of a system satisfies a property specified in some logic and evaluating a query against a database are very similar algorithmic problems, which opens possibilities for transferring techniques between the two areas.
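The skeleton/container separation described above can be sketched as follows. This is a minimal illustration in the spirit of XMILL, not its actual container format: the grouping of text by tag path and the `separate` helper are our own simplifying assumptions.

```python
# Separate an XML document into its skeleton (element structure only)
# and text containers, XMILL-style. Illustrative sketch only: the
# container-per-tag-path grouping is an assumption, not XMILL's format.

import xml.etree.ElementTree as ET
from collections import defaultdict

def separate(element, containers, path=""):
    """Return the skeleton of `element`; route its text into containers."""
    path = path + "/" + element.tag
    if element.text and element.text.strip():
        containers[path].append(element.text.strip())
    return (element.tag, [separate(c, containers, path) for c in element])

doc = ET.fromstring(
    "<bib><book><title>Foundations of Databases</title>"
    "<author>Abiteboul</author><author>Hull</author>"
    "<author>Vianu</author></book></bib>"
)

containers = defaultdict(list)
skeleton = separate(doc, containers)
print(skeleton)
# ('bib', [('book', [('title', []), ('author', []), ('author', []), ('author', [])])])
print(containers["/bib/book/author"])
# ['Abiteboul', 'Hull', 'Vianu']
```

Only the skeleton tuple is needed for navigation; the string data lives in the containers and is touched only when a query actually inspects text.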
The main practical problem that researchers in verification are facing is the "state explosion" problem: state spaces get too large to be dealt with efficiently. The most successful way of handling this problem is symbolic model checking. Instead of representing the state space explicitly, it is compressed into an ordered binary decision diagram (OBDD) [4], and model checking is done directly on the OBDD rather than on the original uncompressed state space. The success of this method is due both to the compression rates achieved and to the fact that OBDDs have nice algorithmic properties which admit model checking.

Our compression of XML skeletons by subtree sharing can be seen as a direct generalization of the compression of Boolean functions into OBDDs. This enables us to transfer the efficient algorithms for OBDDs to our setting and thus provides the basis for new algorithms for evaluating path queries directly on compressed skeletons.

Example 1.1 Consider the following XML document of a simplified bibliographic database.

<bib>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
  </book>
  <paper>
    <title>A Relational Model for
           Large Shared Data Banks</title>
    <author>Codd</author>
  </paper>
  <paper>
    <title>The Complexity of
           Relational Query Languages</title>
    <author>Vardi</author>
  </paper>
</bib>

The skeleton is shown in Figure 1 (a). Its compressed version, which is obtained by sharing common subtrees, is shown in Figure 1 (b). It is important to note that the order of the out-edges is significant.¹ Thus the original skeleton (a) can be recovered by the appropriate depth-first traversal of the compressed skeleton (b). Further compression can be achieved by replacing any consecutive sequence of out-edges to the same vertex by a single edge marked with the appropriate cardinality. Such a graph is shown in Figure 1 (c), where the unmarked edges have cardinality 1. □

[Figure 1: A tree skeleton (a) and its compressed versions (b) and (c).]

This compressed representation is both natural and exhibits an interesting property of most practical XML documents: they possess a regular structure, in that subtree structures in a large document are likely to be repeated many times. If one considers an example of extreme regularity, that of an XML-encoded relational table with R rows and C columns, the skeleton has size O(C · R), and the compressed skeleton as in Figure 1 (b) has size O(C + R). This is further reduced by merging multiple edges as in Figure 1 (c) to O(C + log R). Our experiments show that skeletons of less regular XML files, which contain several millions of tree nodes, also compress to fit comfortably into main memory. In several cases, the size of the compressed skeleton is less than 10% of the uncompressed tree. We should also remark that an interesting property of this technique is that it does not rely on a DTD or any other form of schema or static type system. Indeed, we have evidence that common subtree compression can exploit regularity that is not present in DTDs.

Compressed skeletons are easy to compute and allow us to query the data in a natural way. Compressed and uncompressed instances are naturally related via bisimulation. Each node in the compressed skeleton represents a set of nodes in the uncompressed tree. The purpose of a path query is to select a set of nodes in the uncompressed tree. However, this set can be represented by a subset of the nodes in a partially decompressed skeleton, and, as our experiments show, the amount of decompression is quite small in practice. Moreover, there are efficient algorithms for producing

¹ Throughout this paper, we visualize order by making sure that a depth-first left-to-right pre-order traversal of the instances as depicted in our figures corresponds to the intended order of the nodes.
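The subtree sharing of Example 1.1 can be sketched as a bottom-up hash-consing pass over the skeleton: identical subtrees receive the same node identifier, turning the tree into a DAG as in Figure 1 (b). This is an illustrative reconstruction (the node table and the `share` helper are our own), not the paper's implementation.

```python
# Compress a skeleton tree into a DAG by sharing common subtrees
# (hash-consing). Illustrative sketch, not the paper's implementation.

def compress(label, children, table):
    """Hash-cons one node: children are already-shared node ids."""
    key = (label, tuple(children))
    if key not in table:
        table[key] = len(table)  # allocate a fresh shared-node id
    return table[key]

def share(tree, table):
    """Recursively compress a (label, [subtrees]) skeleton tree."""
    label, kids = tree
    return compress(label, [share(k, table) for k in kids], table)

def count_nodes(tree):
    return 1 + sum(count_nodes(k) for k in tree[1])

# Skeleton of the bibliographic document from Example 1.1
author = ("author", [])
title = ("title", [])
skeleton = ("bib", [
    ("book", [title, author, author, author]),
    ("paper", [title, author]),
    ("paper", [title, author]),
])

table = {}
root = share(skeleton, table)
print(count_nodes(skeleton))  # 12 nodes in the uncompressed skeleton
print(len(table))             # 5 shared nodes, as in Figure 1 (b)
```

Because the two paper subtrees are structurally identical, they collapse into a single shared node, and all five author leaves collapse into one; the further edge-merging step of Figure 1 (c) would additionally replace the three consecutive edges to the shared author node by one edge of cardinality 3.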