TIMBER: A Native XML Database
¡
H. V. Jagadish , Shurug Al-Khalifa , Adriane Chapman , Laks V. S. Lakshmanan , Andrew Nierman ,
¢
Stelios Paparizos , Jignesh M. Patel , Divesh Srivastava , Nuwee Wiwatwattana , Yuqing Wu , Cong Yu £
University of Michigan, Ann Arbor, MI, USA ¤ ¥
e-mail: jag, shurug, apchapma, andrewdn, spapariz, jignesh, nuwee, yuwu, congy ¦ @umich.edu § University of British Columbia, Vancouver, BC, Canada ¤¨¤
e-mail: [email protected] © AT&T Labs Research, Florham Park, NJ, USA e-mail: [email protected]
The date of receipt and acceptance will be inserted by the editor
Abstract This paper describes the overall design and ar- ever, such a mapping often results in either an unnormal- chitecture of the Timber XML database system currently ized relational representation or in a very large number of being implemented at the University of Michigan. The sys- tables, due to the flexible nature of XML, with attributes tem is based upon a bulk algebra for manipulating trees, and sub-elements frequently missing, and repetition of sub- and natively stores XML. New access methods have been elements being allowed. developed to evaluate queries in the XML context, and new Our approach in Timber is to start from scratch and de- cost estimation and query optimization techniques have al- velop an XML data management system from the ground so been developed. We present performance numbers to up. Many components of a standard database system can support some of our design decisions. be reused with no change. For instance there is no need to We believe that the key intellectual contribution of this modify transaction management facilities. However, other system is a comprehensive set-at-a-time query processing components must be modified to accommodate the new ability in a native XML store, with all the standard com- data model and query language. The overall architecture ponents of relational query processing, including algebraic of the Timber system is presented in Sec. 3. rewriting and a cost-based optimizer. Our challenge is to develop a native XML database, in which XML data is stored directly, retaining its natural tree structure. At the same time, we would like to obtain all the benefits of relational database management, such as declar- 1 Introduction ative querying and set-at-a-time processing. To be able to obtain efficient processing on large data- With the growing popularity of XML, it is clear that large bases, we require set-at-a-time processing of data. In other repositories of XML data will emerge. In this paper, we words, we require a bulk algebra that can manipulate sets describe the architecture of Timber, a native XML data of trees: each operator on this algebra would take one or management system being developed at the University of more sets of trees as input and produce a set of trees as out- Michigan [67]. put. We have devised such an algebra, called TAX, and we One popular technique for managing XML data is to present this in Section 4. The biggest challenge in devising map the data to existing (relational) database systems. How- this algebra is the heterogeneity allowed by XML, and in
¤ XQuery [10], the W3C recommended declarative language Supported in part by the United States National Science for querying XML databases today. Foundation (NSF), under grants IIS-9986030, DMI-0075447, and IIS-0208852, and by an equipment grant from IBM. Given an algebra, we need an efficient query evaluation ¤ ¤ Supported in part by the Natural Sciences and Engineering mechanism. This is the subject of Section 5. After describ- Research Council of Canada (NSERC) and a research fellowship ing the overall structure of the query pipeline, we delve into from the British Columbia Advanced Systems Institute (BCASI). a couple of access methods of significance. 2 TIMBER: A Native XML Database
department
faculty staff faculty lecturer
name RA name name secretary RA RA RA name TA TA TA
J. Smith Tom T.Brown faculty research scientist H.Grey F.Lee Jerry Tony Rich R.King Mark Andy Chris
name secretary TA RA RA TA name secretary RA RA RA RA
K.Blue M.Black Peter Pam DJ Bob L.Young P.White Tod Max Ann Lisa
Fig. 1 Tree Representation of an Example XML document,
A query optimizer is able to take a declarative query sub-elements, XML documents can have a very rich struc- specification, and choose a suitable evaluation plan using ture, as we just saw. It is hard to capture this structure the available access methods, making use of cost estimates in a rigid relational table without dividing the document for various operations and algebraic identities. We present into very small standard “units” that can be represented as the architecture of our optimizer in Section 6. We also present tuples in a table. Therefore, a simple XML schema often novel techniques for obtaining size (and cost) estimates. produces a relational schema with many tables. Structural After a brief discussion of issues regarding updates in information in the tree-based schema is modeled by joins Section 7, we finally wrap up with a discussion of the cur- between tables in the relational schema. XML queries are rent status, and some indications of performance, in Sec- converted into SQL queries over the relational tables, and tion 8. We begin by setting the context for our work in even simple XML queries often get translated into expen- Sec. 2 sive sequences of joins in the underlying relational data- base.
2 Motivation and Related Work Example 2 A typical translation [48] of the schema of fig- ure 1 would map the lecturer elements to a table, and store Example 1 Figure 1 shows a very simple XML document. TA elements as tuples in another table. To find the TAs as- The personnel of a department can be faculty, staff, lec- sisting a specified lecturer will then require a join between turer or research scientist. Each of them has a name the two tables. More complex queries will require multiple as identification. They may or may not have a secretary. joins. Each faculty may have both TAs and RAs. A lecturer can have one or more TAs, but no RA. A research scientist Driven by the arguments above, one is persuaded to can have any number of RAs, but no TA. seek a direct implementation of XML data management, Some characteristics of XML data are obvious even where XML data is not translated into rigid relations. There from this simple example. XML has a tree structure: ele- are several implementations of XML storage that are in- ments in the document can be structurally related and these dependent of relational databases [38, 45, 60, 62, 63]. Sev- structural relationships are meaningful. XML also has flex- eral of these are driven by the document (or programming ibility – the number of RAs and TAs associated with person- language) community, rather than the database community. nel is allowed to vary. While there are constraints on what The implementations are procedural, directly evaluating que- is allowed, it is still quite possible for certain classes of ries as a series of nested FOR loops. They are also tuple-at- sub-elements to be missing altogether. For instance, there a-time, whereas it has been well established through the may be a lecturer who has no teaching assistants at all. experience of the database community that set-at-a-time access is essential for good performance. As such, these Several mapping techniques have been proposed [21, implementations do very well for small data sets, but do 30, 48, 49] to express tree-based XML data to flat tables not scale very well to large data sets. For instance, Xindice in a relational schema. Due to the possible absence of at- (see dbXML) recommends [4] that its system not be used tributes and sub-elements, and the possible repetition of for documents larger than 5MB! TIMBER: A Native XML Database 3
Other solutions have also been proposed. For instance, Loading Data Flow XML Query XML databases have been implemented on top of an object- Query Result Retrieval Data Flow oriented database [17, 32, 44, 61] and a semi-structured da- tabase [34,35,42]. Such implementations suffer from a com- Query Parser Query Output API bination of the drawbacks listed above for the two extreme scenarios. Tamino is a leading commercial “native” XML XML Data Query database, yet descriptions of its architecture [46, 47] are Optimizer Query fairly sketchy. Tamino uses an evolution of the ADABAS Evaluator Data Parser nested relational engine as its data store, with the bulk of the innovation in the product coming from new index struc- tures, support for handling XML schematic information, Metadata Index and the web interface layer. Manager Manager Data Recently, Natix [26, 27] has been developed as a stor- Manager age manager suitable for XML data. The focus is on ef- ficient management of tree-structured data at the level of page allocation and physical placement. Whereas our cur- Data Storage Manager rent development is on top of the more “standard” Shore storage manager, we intend to consider switching to Natix as the latter matures. Our project is aimed centrally at building an efficient Data XML database engine. As such it differs from related ef- forts at data integration [6,52] and querying XML over the Fig. 2 TIMBER Architecture Overview web [40]. However, each of these important research ef- forts requires at least some management and querying of XML data as part of their research effort. As such, each is 3.1 Data Storage exploring issues that closely relate to ours in many cases. For instance, we will mention techniques used in the Nia- The Data Parser takes an XML document as input, and pro- gara [40] system at several places below. duces a parse tree as output. The Data Manager takes each Finally, we mention the Toronto XML project [5], aimed node of this parse tree as it is produced, transforms it incre- at managing XML data using an approach complementary mentally into an internal representation and stores it into to ours. Whereas we are developing new techniques for Shore as an atomic unit of storage.1 A set of navigation managing and querying tree-structured XML data, the To- interface and scan interface is provided for the Query Eval- ronto project maps XML into flat files, RDBMS or OODBMS, uator to retrieve data, one node at a time. These interfaces whichever is most appropriate for a given class. The core can also be used by Index Manager and Metadata Manager, of their effort is in managing the metadata for this map- to generate the data they need. ping and in developing clever new index structures for this For storage efficiency reasons, a node in the Timber heterogeneous representation. Data Manager is not exactly the same as a DOM [57] node. There is a node corresponding to each element, with child nodes for sub-elements. However, all attributes of an ele- ment node are clubbed together into a single node, which 3 System Architecture is then stored as a child node of that element node. Also, the content of an element node, if any, is pulled out into
1 We found that Shore had considerable overheads in dealing with small objects. We are engineering our system to package our The overall architecture of Timber is shown in figure 2. data in page-size containers, and handing Shore an entire con- We build our system on top of Shore [9], a popular back- tainer as an object. At present, this engineering optimization has end store that is responsible for disk memory management, been implemented in our Query Evaluator for intermediate results buffering and concurrency control. XML data, index and that may have to be read and written multiple times in quick suc- metadata are also stored in Shore through Data Manager, cession. This optimization is less critical for the actual data itself, Index Manager and Metadata Manager, respectively. and has not yet been implemented in the Data Manager. 4 TIMBER: A Native XML Database a separate child node. If the node is of mixed type, with within the same small label value range. If updates are well multiple content parts interspersed with sub-elements, each distributed, no relabeling may be required for a long time. content part is pulled out into a separate child node. Finally, See [13]. We use double values for these labels in the cur- due to our focus on data management issues, all processing rent version of Timber, as an “automatic” means of leav- instructions, comments, and such are simply ignored. In a ing gaps, at least to within machine precision. Note that as future version of our system, we could create yet another new data is appended (as opposed to being inserted in the child node of the element node with all such data. middle), new larger label values can simply be manufac- An inclusion relationship between an element and its tured for the appended nodes with no effect on the existing sub-elements is the tightest possible bond between two en- nodes. tities in a database. Entire sub-trees are frequently reque- In relational databases, a record identifier (typically cal- sted. In fact, in a document representation of the database, a led an “rid” or a “tid”) is used to identify each record. This sub-tree corresponds to a contiguous fragment of the docu- is not quite an identifier in the sense of an object-oriented ment. As such, the determination of parent–child and ancestor– database – there is no concept of object identity. It fre- descendent containment relationships is a very frequent op- quently is a function of physical placement of the record eration in XML query processing. It has been observed [2, (like a physical pointer), but it does not have to be: it is 13, 39] that it is possible to associate a numeric start and truly a logical identifier. It is also not visible to the user at end label with each data node in the database, defining a the query level. Nonetheless, it plays a central role in re- corresponding interval between these labels such that every lational query processing. For an XML database, we seek descendant node has an interval that is strictly included in a corresponding node identifier. XML permits an optional its ancestors’ interval. If each node is also labeled with its ID attribute, but this is not quite it, since this is user-visible, Level, or nested depth of the node in the document, then and is optional, and further is not even applicable for nodes parent–child relationships can also be found. The relevant that do not correspond to XML elements (such as attributes formulae are: and comments). The normal solution would be to invent
such an identifier for our system. However, we find that the
– Ancestor–descendant relationship: a node
tuple of start, end, and level labels serves this purpose ad-
¡¡¡ ¡ ! "
is the ancestor of node iff mirably. As such, we shall use this triple of labels as node ¡ .
identifier. Note that while start alone suffices to serve as a ¨ ¨# $
– Parent–child relationship: a node is the
node identifier, using the triple as a node identifier enables
¡%¨&¡¨'¡ ( ¡' )* +",&¡'
parent of node iff efficient index-based query processing, as we’ll see later. #-,¡#.0/ .
The physical storage order of XML elements can sig-
¡ ¡
( and are start labels, and are end labels, and nificantly impact the cost of data access. Since we expect
¡ and are level labels in these formulae.) sub-elements to be requested frequently with an element, We will discuss, in Sec. 5, how we use these formulae ideally we would like to cluster these together. It is gener- in Timber. For the present, we focus on how these start, ally believed that storing XML data in document order (or end and level labels are managed. Conceptually, these la- pre-order tree traversal order) is the most desirable. This bels are additional attributes created automatically by the is what we do. An equivalent way of expressing this is that system and associated with each node. Where document we would like to store our nodes sorted by the value of their boundaries are important, one could create separate labels start labels. Again, updates are an issue. See Sec. 7. for each document, so that an additional doc label would be required to match in addition to the interval subsump- tion described above. It is easy to map between such a 3.2 Index Storage multi-document model, and a model in which the ranges of label values for each document are assigned to be non- There is a rich history of work on index structures suited to overlapping, doing away with the need for a separate doc specific purposes. In particular, we draw inspiration from attribute. the work done in the context of object-oriented systems, Updates are an issue in any such labeling scheme, see [14]. such as [7,29]. More recently, novel path indices have been It is conceivable that a complete re-labeling could be re- proposed for XML and semi-structured data [15, 28, 37]. quired for each update, leading to very poor update per- Schema summarization structures have also been proposed formance. We address this issue by leaving gaps between [23, 24]. We are intensively studying this problem, but at successive label values. With this mechanism, relabeling the current time have only single-node indices implemented is required only if a large number of insertions take place in Timber. TIMBER: A Native XML Database 5
We construct value indices on attribute values, whether metadata information, and performs the required mapping these are numeric or character string. We also construct in- from logical to physical operators. The resulting query plan dices on element content, when this content is recognized tree is evaluated by the Query Evaluator, pipelined one op- as a number. We also construct term-based inverted indices erator at a time, by means of a set of calls to the Data Man- on element content when this is a large piece of text. In ager and Index Manager, which in turn call Shore storage. addition, we construct an index on tag name: that is, given a tag name, we can return all the elements with the speci- fied tag. All our indices are stored using the B-Tree index 4 Tree Algebra facility provided by Shore. Index structures typically return a list of Rids in rela- An XML document is a tree, with each edge in the tree rep- tional systems. Correspondingly, they return lists of start, resenting element nesting (or containment). See figure 1, end and level labels in an XML database. for example. Structural relationships in this tree are central to most XML querying. As such, an appropriate algebra for XML should manipulate sets of trees. That is, each opera- 3.3 Metadata Storage tor in the algebra should take as input one or more sets of trees and produce as output a set of trees. Timber has a metadata store that is, for the most part, not Order is important to XML documents. As such, the remarkable. There is the usual information regarding at- trees manipulated by the algebra should be ordered. (This tribute types, data set sizes and indices constructed. The is true, even if queries frequently do not care about the or- histograms maintained for cost estimation purposes are no- der. See labeled paragraph on ordering later in this section.) vel, and are described in Sec. 6. Moreover, each node in a tree represents an XML element, Schema plays a crucial role in traditional databases, and is thus labeled with the element tag and any attributes and table structure is a crucial part of the metadata main- of the element. In short, we require an algebra to manipu- tained. However, in the design of XML, much care has late sets of ordered labeled trees. been taken to make sure that a great deal can be accom- XML also permits references, which are represented as 2 plished even in the absence of schema (or DTD). In the non-tree edges, and may be used in some queries. These are same spirit, we have designed the core of Timber not to important to handle, and our algebra is be able to express have any dependence on schema whatsoever. The bulk of these. However, there is a qualitative difference between the description in this paper is with regard to the Timber these reference edges, which are handled as “joins”, and core, and hence has little mention of schema. containment edges, which are handled as part of a “selec- Knowledge of schema can play an important role in tion”. data layout, in choice of index structures, and in query op- To be able to obtain efficient processing on large data- timization. Our goal is to use this information, when avail- bases, we require set-at-a-time processing of data. In other able, to advantage; while continuing to retain reasonable words, we require a bulk algebra that can manipulate sets performance even when schema information is not avail- of trees: each operator on this algebra would take one or able. For instance, even data statistics are collected in our more sets of trees as input and produce a set of trees as position histograms (described in Sec 6 below), without output. Using relational algebra as a guide, we can attempt specific reference to the schema. to develop a suite of operators suited to manipulating trees instead of tuples. 3.4 Query Processing Heterogeneity: Each tuple in a relation has identical struc- XML queries in XQuery [10]3 are parsed into an algebraic ture – given a set of tuples from some relation in rela- operator tree by the Query Parser. (The tree algebra used tional algebra, we can reference components of each tuple for this purpose is described in Sec. 4). The Query Op- unambiguously by attribute name or position. Trees have timizer reorganizes this tree, based on a set of rules and a more complex structure than tuples. More importantly, sub-elements can often be missing or repeated in XML. As 2 In fact, there is not yet complete agreement with regard to the such, it is not possible to reference components of a tree best means of expressing XML schema information [64, 65]. 3 We have designed Timber to be as language independent as by position or even name. For example, in a bibliographic possible. We have written parsers for other languages, including XML tree, consider a particular book sub-tree, with nested Quilt [11], XML-QL [16], and XQL [43], but no longer maintain (multiple) author sub-elements. We should be able to im- these. pose a predicate of our choice on the first author, on every 6 TIMBER: A Native XML Database author, on some (at least one) author, and so on. Each of interest. One witness tree is produced for each combination these possibilities could be required in some application, of node bindings that matches the pattern. The set of wit- and these choices are not equivalent. ness trees produced through the matching of a pattern tree are all homogeneous: we can name nodes in the pattern trees, and use these names to refer to the bound nodes in $1 $1.tag = faculty &