Query Optimization for XML Jason Mchugh, Jennifer Widom

Query Optimization for XML Jason McHugh, Jennifer Widom

Stanford University g fmchughj,widom @db.stanford.edu, http://www-db.stanford.edu

Semistructured data has been de®ned as data that may be Abstract irregular or incomplete, and whose structure may change XML is an emerging standard for data representa- rapidly or unpredictably. Although data encoded in XML tion and exchange on the World-Wide Web. Due to may conform to a Document Type De®nition,orDTD the nature of information on the Web and the inher- [BPSM98], DTD's are not required by XML. Due to the ent ¯exibility of XML, we expect that much of the nature of information on the Web and the inherent ¯exi- data encoded in XML will be semistructured:the bility of XMLÐwith or without DTD'sÐwe expect that data may be irregular or incomplete, and its struc- much of the data encoded in XML will exhibit the classic ture may change rapidly or unpredictably. This pa- characteristics of semistructured data as outlined above. per describes the query processor of Lore,aDBMS The Lore system at Stanford is a complete DBMS de-

for XML-based data supporting an expressive query signed speci®cally for semistructured data [MAG+ 97]. language. We focus primarily on Lore's cost-based Lore's original data model, the Object Exchange Model query optimizer. While all of the usual problems (OEM), is a graph-based data model with a close corre- associated with cost-based query optimization apply spondence to XML. The query language of Lore, called to XML-based query languages, a number of addi- Lorel (for Lore Language), is an expressive OQL-based tional problems arise, such as new kinds of indexing, language for declarative navigation and updates of semi- more complicated notions of database statistics, and structured databases. Recently we migrated Lore to a fully vastly different query execution strategies for dif- XML-based data model, and extended the Lorel query lan- ferent databases. We de®ne appropriate logical and guage accordingly. For details see [GMW99]. The results physical query plans, database statistics, and a cost presented in this paper apply directly to the new XML ver- model, and we describe plan enumeration including sion of Lore. heuristics for reducing the large search space. Our This paper describes Lore's query processor, with a fo- optimizer is fully implemented in Lore and prelimi- cus on its cost-based query optimizer. While our general nary performance results are reported. approach to query optimization is typicalÐwe transform a query into a logical query plan, then explore the (expo- 1 Introduction nential) space of possible physical plans looking for the The World-Wide Web community is rapidly embracing one with least estimated costÐa number of factors associ- XML as a new standard for data representation and exchange ated with XML data complicate the problem. Path traver- on the Web [BPSM98]. At its most basic level, XML is sals (i.e., navigating subelement and reference links) play a document markup language permitting tagged text (ele- a central role in query processing and we have introduced ments), element nesting, and element references. However, several new types of indexes for ef®cient traversals through XML also can be viewed as a data modeling language, and a data graphs. The variety of indexes and traversal tech- signi®cant potential user populationviews XML in this way niques increases our search space beyond that of a conven- [Mar98]. Fortuitously, work from the database community tional optimizer, requiring us to develop aggressive pruning in the area of semistructured data [Abi97, Bun97]Ðwork heuristics appropriate to our query plan enumeration strat- that signi®cantly predates XMLÐuses graph-based data egy. Other challenges have been to de®ne an appropriate models that correspond closely to XML. Thus, research re- set of statistics for graph-based data and to devise methods sults in the area of semistructured data are now broadly for computing and storing statistics without the bene®t of applicable to XML. a ®xed schema. Statistics describing the ªshapeº of a data graph are crucial for determining which methods of graph This work was supported by Rome Laboratories under Air Force traversal are optimal for a given query and database. Contract F30602-96-1-0312 and by the National Science Foundation under grant IIS-9811947. Once we have added appropriate indexes and statistics Permission to copy without fee all or part of this material is granted to our graph-based data model, optimizing the navigational provided that the copies are not made or distributed for direct commercial path expressions that form the basis of our query language advantage, the VLDB copyright notice and the title of the publication and does resemble the optimization problem for path expres- its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, sions in object-oriented database systems, and even to some requires a fee and/or special permission from the Endowment. extent the join optimization problem in relational systems. Proceedings of the 25th VLDB Conference, As will be seen, many of our basic techniques are adapted Edinburgh, Scotland, 1999. from prior work in those areas. However, we decided to

315

build a new overall optimization framework for a number other things, sets and pointers to objects of any type. Ba- of reasons: sic knowledge of the schema was crucial, however, and

Previous work has considered the optimization of sin- queries were placed into categories with a ®xed set of ex- gle path expressions (e.g., [GGT96, SMY90]). Our ecution strategies for each category. Lore follows a more query language permits several, possibly interrelated, traditional and ¯exible model of query processing. Model path expressions in a single query, along with other 204 was based on self-describing record structures some- query constructs. Our optimizer generates plans for what resembling OEM, but the work concentrated primarily complete queries. on clever bit-mapped indexing structures to achieve high

The statistics maintained by relational DBMS's (for performance for its relatively simple queries. joins) and object-oriented DBMS's (for path expres- Relational databases. As mentioned earlier, at a coarse sion evaluation) are generally based on single joining level the problem of optimizing a Lorel path expression is pairs or object references, while for accuracy in our similar to the join ordering problem in relational databases. environment it is essential to maintain more detailed However, join ordering algorithms usually rely on statis- statistics about complete paths. tics about each joining pair, while for typical queries in our

The capabilities of deployed OODBMS optimizers are environment it is crucial to maintain more comprehensive fairly limited, and we know of no available prototype statistics about entire paths. The computation and stor- optimizer ¯exible enough to meet our needs. Build- age of our statistics is further complicated by the lack of ing our own framework has allowed us to experiment a schema. In addition, when quanti®cation is present in with and identify good search strategies and pruning our queries, the SQL translation results in complex sub- heuristics for our large plan space. It also has allowed queries that many relational optimization frameworks are us to integrate the optimizer easily and completely into ill-equipped to handle. the existing Lore system. Object-oriented databases. Many of the points dis- 2 Related Work cussed in the previous paragraph apply to object-oriented Lore. Details of the syntax and semantics of Lorel can be databases as well. There has been some work on optimizing found in [AQM+ 97]. The overall architecture of the Lore path expressions in an OODBMS context [GGT96]. They system, including the simple query processing strategy we propose a set of algorithms to search for objects satisfying used prior to developing our cost-based query optimizer, path expressions containing predicates, and analyze their can be found in [MAG+ 97]. relative performance. Our work differs in that we consider Other semistructured databases.TheUnQL query many interrelated path expressions within the context of an language [BDHS96, FS98] is based on a graph-structured arbitrary query with other language constructs. We also data model similar to OEM. For query optimization, a trans- provide additional access methods for path expressions, lation from UnQL to UnCAL is de®ned [BDHS96], which and our optimization techniques are implemented within provides a formal basis for deriving optimization rewrite a complete DBMS. Similar comparisons can be drawn be- rules such as pushing selections down. However, UnQL tween our work and other recent OODB optimization work, does not have a cost-based optimizer as far as we know. e.g., [GGMR97, KMP93, OMS95, SO95, MSOP86, LV91, The Strudel Web-site management system is based on semi- YM97]. Some recent papers have speci®ed cost models structured data [FFLS97, FFK+ 99], and query optimization for object-oriented DBMSs, e.g., [BF97, GGT95]. Object- is considered in [FLS98]. In Strudel, semistructured data oriented databases typically support object clustering and graphs are introduced for modeling and querying, while the physical extents, rendering many of these formulas inap- data itself may reside elsewhere in arbitrary format. A key plicable in our setting. Work on indexing in OODBs is feature of Strudel's approach to query optimization is the surveyed in [YM97]; for a discussion of indexing in Lore, use of declarative storage descriptors, which describe the please see [MWA + 98]. underlying data stores. The optimizer enumerates query Generalized path expressions. Other recent work, in- execution plans, with a cost model that derives the costs of cluding [FLS98] discussed above, has considered the prob- operators from their descriptors. In contrast, Lore data is lem of optimizing the evaluation of generalized path ex- stored under our control, and the user may dynamically cre- pressions, which describe traversals through data and may ate indexes to provide ef®cient access methods depending contain regular expression operators. In [CCM96] an al- upon the expected queries. Finally, [FLS98] includes de- gebraic framework for the optimization of generalized path tailed experimental results of how large their search space expressions in an OODBMS is proposed, including an ap- is, but no other performance data is given. In contrast, our proach that avoids exponential blow-up in the query op- experiments focus on the performance of the query plan timizer while still offering ¯exibility in the ordering of selected by the optimizer versus other possible query plans. operations. In [FS98] two optimization techniques for Some much earlier systems, such as Multos [BRG88] generalized path expressions are presented, query pruning and Model 204 [O'N87], considered problems associated and query rewriting using state extents. Lore's techniques with semistructured data but in very different settings. Mul- for handling generalized path expressions are described in tosoperated on complex data objects which allowed, among [MW99a], but the work of [FLS98, CCM96, FS98] could

316

DBGroup

Member &1 Project Project Member Member Member

&2 &3 &4 Project &5 &6 Name Name Office Office Project Name Age Title Title Age Office

&7 &8&9 &10 &11 &12 &13&14 &15 &16 "Clark" "Smith" 28 "Gates 252" "Jones" 46 "Lore" "Tsimmis"

Building Room Building Room

&17 &18 &19 &20 "CIS" "411" "Gates" 252 Figure 1: A tiny OEM database be applicable within our framework. incoming edges) must be speci®ed in XML with explicit XML query languages.TheXML-QL data model and references, i.e., via ID and IDREF(S) attributes [BPSM98]. query language [DFF+ 98] is similar in expressibility to The following XML fragment corresponds to the rightmost ours, with some extensions speci®c to the current speci®- Member in Figure 1, where Project is an attribute of cation of XML. XQL [RLS98] is a simpler query language type IDREFS. based on single path expressions and is strictly less pow- erful than XML-QL, Lorel [AQM+ 97], StruQL [FFLS97], Jones or UnQL [BDHS96]. To the best of our knowledge no full 46 cost-based query optimizer has been developed for XML- Gates QL or XQL, and the optimization principles presented in 252 this paper should be directly applicable when that task is tackled. 3 Preliminaries As mentioned earlier, we have migrated Lore to a fully XML-based data model, and extended the Lorel query 3.1 Data Model language accordingly [GMW99]. The primary changes Lore's original data model, OEM (for Object Exchange to the model were the introduction of ordered subobjects, Model) [PGMW95], was designed for semistructured data. attribute-value lists, and reference edges in addition to nor- An example OEM database containing (®ctitious) informa- mal subobject edges. Corresponding changes were made to tion about the Stanford Database Group appears in Figure the Lorel query language, althoughin most cases the queries 1. Data in OEM is schema-less and self-describing, and can one uses over an OEM database are identical to those used be thought of as a labeled directed graph. The vertices in over a corresponding XML database. the graph are objects; each object has a unique object iden- We now introduce two de®nitions that are useful in the ti®er (oid), such as &5. Atomic objects have no outgoing remainder of the paper. edges and contain a value from one of Lore's basic atomic types such as integer, real, string, gif, java, De®nition 3.1 (Simple Path Expression)Asimple path audio, etc. All other objects may have outgoing edges expression speci®es a single-step navigation in the database.

and are called complex objects. Object &3 is complex and A simple path expression for a variable or name x and label

x:l y y

its subobjects are &8, &9, &10, and &11. Object &7 is l has the form , and denotes that variable ranges over

x x

atomic and has value ªClarkº. Names are special labels all l -labeled subobjects of the object assigned to .If is x that serve as aliases for single objects and as entry points an atomic object, or if l is not an outgoing label from , into the database. In Figure 1, DBGroup is a name that then y ranges over the empty set. denotes object &1. De®nition 3.2 (Path Expression)Apath expression is an The correspondence between OEM and XML is evi- ordered list of simple path expressions. dent: OEM's objects correspond to elements in XML, and OEM's subobject relationship mirrors element nesting in Path expressions are the basic building blocks in the Lorel XML. The fundamental differences are that subelements in language and describe traversals through the data in a XML are inherently ordered since they are speci®ed tex- declarative fashion. For example, ªDBGroup.Member

tually, and XML elements may optionally include a list of x, x.Age yº says that variable y ranges over all ob- attribute-value pairs. Note that graph structure (multiple jects that can be reached by starting with the DBGroup

317

object, following an edge labeled Member, then follow- plan that lies within our search space. The physical query ing an edge labeled Age. Lorel supports a shorthand to plan is a tree composed of physical operators that are im- write this path expression as ªDBGroup.Member.Age plemented by the query execution engine and perform the

yº, and further shorthands to eliminate variables such as low-level steps required to execute the query and construct + y [AQM 97], however for clarity we avoid shorthands in the result. We use a recursive iterator approach in query the examples in this paper. Also, a simple path expres- processing, as described in, e.g., [Gra93], and we assume

sion may contain a regular expression or ªwildcardsº as the reader is familiar with the basic concepts associated + described in [AQM 97]. In general, l in De®nition 3.1 with iterators. could be a component of a generalized path expression, but 3.4 Lore Indexes we have simpli®ed the de®nition for presentation purposes As in a conventional DBMS, indexes in Lore enable fast in this paper. and ef®cient access to the data. In a relational DBMS, an 3.2 Query Language index is created on an attribute in order to locate tuples Lorel is an extension of OQL [Cat94] supporting declarative with particular attribute values quickly. In Lore, such a path expressions for traversing graph data and extensive au- value index alone is not suf®cient, since often the path to tomatic coercion for handling heterogeneous and typeless a node is as important as the node's value. Lore contains data without generating errors [AQM+ 97]. Although Lorel several indexing structures that are useful for ®nding rel- offers much syntactic sugar over OQL that is convenient evant atomic values, parents of objects, and speci®c paths in practice (including the shorthands mentioned above), and edges within the database. The value index,orVindex, in this paper we write our queries without these syntactic supports ®nding all atomic objects with a given incoming conveniences in order to be very explicit and enable under- edge label and satisfying a given predicate. The label index, standing for those familiar with OQL but unfamiliar with or Lindex, supports ®nding all parents of a given object via Lorel. As a simple example, consider the following query, an edge with a given label. The edge index,whichwe which asks for all of the young members of the Database term the Bindex, supports ®nding all parent-child object Group.1 The result of the query over the database of Figure pairs connected via a speci®ed label. In addition to these 1 is shown. indexes, Lore's DataGuide [GW97] provides the function- ality of a path index, or Pindex. Details on Lore indexes, QUERY 1: Select x From DBGroup.Member x including coercion issues addressed by the Vindex, can be Where exists y in x.Age: y<30 found in [MWA+ 98]. RESULT: 4 Motivation Smith 28 As in any declarative query language, there are many ways Gates 252 to execute a single Lorel query. Let us consider Query 1 introduced in Section 3.2 and roughly sketch several types CIS of query plans. As we will illustrate, the optimal query 411 plan depends not only on the values in the database but also on the shape of the graph containing the data. It is this additional factor that makes optimization of queries over 3.3 Lore Query Processing XML data both important and dif®cult. The general architecture of the Lore system is very much The most straightforward approach to executing Query like a traditional DBMS [MAG+ 97]. After a query is 1 is to fully explore all Member subobjects of DBGroup parsed,itispreprocessed to factor out common subexpres- and for each one look for the existence of an Age subobject sions and convert Lorel shorthands into a more traditional of the Member object whose value is less than 30. We OQL form. The logical query plan generator then creates a call this a top-down execution strategy since we begin at single logical query plan describing a very high-level execu- the named object DBGroup (the top), then process each tion strategy for the query. As we will show in Section 5.1, simple path expression in a forward manner. This approach generating logical query plans is fairly straightforward, but is similar to pointer-chasing in object-oriented systems, special care was taken to ensure that the logical query plans and to nested-loop index joins in relational systems. This are ¯exible enough to be transformed easily into vastly query execution strategy results in a depth-®rst traversal of different physical query plans. The ªmeatº of the query op- the graph following edges that appear in the query's path timizer occurs in the physical query plan enumerator.This expressions. component uses statistics and a cost model in order to trans- Another way to execute Query 1 is to ®rst identify all form the logical query plan into the estimated best physical objects that satisfy the ªy<30º predicate by using an appropriate Vindex if it exists (recall Section 3.4). Once we 1 The existential quanti®cation in the Where clause is necessary since have an object satisfying the predicate, we traverse back- a Member object could conceivably have many Age subobjects. A shorthand in Lorel allows simply ªwhere x.Age < 30º, which is prepro- wards through the data, going from child to parent, match- cessed automatically into the query as shown here [AQM + 97]. ing path expressions in reverse using the Lindex. We call

318

Query: Select x From A.B x where exists y in x.C: y = 5

A A A D ...

D D B B B B B B D B B ... B ......

C C C C C C C C C ......

5 5 5 4 4 5 4 4 5 Top-down Preferred Bottom-up Preferred Hybrid Preferred Figure 2: Different databases and good query execution strategies this query execution strategy bottom-up since we ®rst iden- has many A.B.C paths, but only a single leaf satisfying the tify atomic objects and then attempt to work back up to a predicate, so bottom-up is a good candidate. Finally, in the named object. This approach is similar to reverse pointer- third database top-down execution would visit all the leaf chasing in object-oriented systems. The advantage of this nodes, but only a single one satis®es the predicate. Bottom- approach is that we start with objects guaranteed to satisfy up would identify the single object satisfying the predicate, the Where predicate, and do not needlessly explore paths but would visit all of the nodes in the upper-right portion of through the data only to ®nd that the ®nal value does not the database. For this database, a hybrid plan where we use satisfy the predicate. Bottom-up is not always better than top-down execution to ®nd all A.B objects, then bottom-up top-down, however, since there could be very few paths execution for one level, then ®nally join the two results, satisfying the path expression but many objects satisfying would be a good strategy. the predicate. Each of these three example plans has a substantiallydif- A third strategy is to evaluate some, but not all, of a ferent shape from the others, and each is the optimal plan path expression top-down and create a temporary result of for a particular database. Our primary goal when design- satisfying objects. Then use the Vindex as described earlier ing our logical query plans was to create a structure that and traverse up, via the Lindex, to the same point as the represents, at a high level, the sequence of steps necessary top-down exploration. A join between the two temporary to execute a query, while at the same time permits simple results yields complete satisfying paths. (In fact certain join rules to transform the logical query plan into a wide variety types do not require temporary results at all.) We call this of different physical query plans. approach a hybrid plan, since it operates both top-down and bottom-up, meeting in the middle of a path expression. This 5 Query Execution Engine type of plan can be optimal when the fan-in degree of the 5.1 Logical Query Plans reverse evaluation of a path expression becomes very large at about the same time that the fan-out degree in the forward Recall that a single logical query plan is created after the evaluation of the path expression becomes very large. query is preprocessed into a canonical form. Before ex- These three approaches give a ¯avor of the very different plaining the logical query plan operators and structure of types of plans that could be used to evaluate a simple query, the plans, we introduce two additional de®nitions. one that effectively consists of a single path expression. The De®nition 5.1 (Variable Binding) During query process-

actual search space of plans for this simple query is much ing, a variable x in the query is said to be bound if an object

x o x larger, as we will illustrate in Section 5.4, and more com- o has been assigned to . We also say that is bound to . plicated queries with multiple interrelated path expressions naturally have an even larger variety of candidate plans. De®nition 5.2 (Evaluation) During query processing, an To make things even more concrete, suppose we are pro- evaluation of a query plan (or subplan) is a list of all vari- cessing the query ªSelect x From A.B x Where ables appearing in the plan along with the object (if any) exists y in x.C: y = 5º, which is isomorphic to bound to each variable. Query 1. In Figure 2 we present the general shape and The goal of query execution is to iteratively generate com- a few atomic values for three databases, illustrating cases plete evaluations for all variables in the query, producing when each type of query plan described above would be a the set of query results based on these evaluations. good strategy. The database on the left has only one A.B.C One major difference between the top-downand bottom- path and top-down execution would explore only this path. up query execution strategies introduced in Section 4 is the Bottom-up execution, however, would visit all the leaf ob- order in which the query is processed. In the top-down jects with value 5, and their parents. The second database approach we handle the From clause before the Where;

319

ChainChain Chain Discover(z,"D",v) Discover(x,"B",y) Discover(y,"C",z) Figure 3: Representation of a path expression in the logical query plan

Project(t2)

CreateTemp(x,t2)

Glue Chain Glue

Name("DBGroup",t1) Discover(t1,"Member",x) Exists(y) Select(y,<,30)

Discover(x,"Age",y) Figure 4: A complete logical query plan the order is reversed for the bottom-up strategy. Also con- ample, consider the path expression ªx.B y, y.C z,

sider the Where clause of Query 1: ªWhere exists y z.D vº(where x is de®ned elsewhere in the query) which in x.Age:y < 30º. We can break this clause into two has the logical query subplan shown in Figure 3. The left- distinct pieces: (a) ®nd all Age subobjects of x,and(b) most Discover node is responsible for choosing the best test their values. In the bottom-up plan we ®rst use the way to provide bindings for variables x and y.TheChain Vindex to satisfy (b) and then we use the Lindex for (a). In node directly above it is responsible for evaluating the path the top-down strategy ®rst we satisfy (a) by ®nding an Age expression ªx.B y, y.C zº ef®ciently. This could be

subobject of x, then we test the condition to ful®ll (b). done by using the children's most ef®cient ways of execut- In fact, all queries can be broken into independent com- ing their subplans and joining them together, or possibly by ponents where the execution order of the components is not using a path index for the entire path expression. The ®nal ®xed in advance. We term the places where independent Chain and Discover nodes are similar. components meet rotation points, since during the creation Figure4 shows thecompletelogical query plan for Query of the physical query plan we can rotate the order between 1. Each rotationpoint is represented by a Glue node that has two independent components to get vastly different plans. as its children the two independent subplans. The topmost In our approach, each logical operator can construct its Glue node connects the subplans for the From and Where optimal physical (sub)plan with respect to a set of variables clauses. The Chain node connects the two components of that are already bound elsewhere in the plan. The mecha- the path expression appearing in the From.TheExists node nism by which we store the binding information must also quanti®es y .AGlue node separates the existential in the store information about how a variable was bound in order Where from the actual predicate test, allowing either oper- for the statistics to accurately estimate the cost and number ation to occur ®rst in the physical query plan. Because the of results. For example, given the path expression ªx.B semantics of Lorel requires a set of objects to be returned,

y, y.C zº and assuming that we are following subob- the CreateTemp and Project nodes at the top of the plan z

jects from variable y to variable , then the statistics for are responsible for gathering the satisfying evaluations and

y y z will depend on how was bound. If was bound via returning the appropriate objects to the user. a Bindex operator then the number of object bindings for Space limitations preclude a full description of logical

y (number of C edges) might be quite different from the query plans or examples of more complex queries, but the x case where y was bound by following subobjects from . general ¯avor and ¯exibility of our approach should be Statistics are discussed further in Section 5.3. evident. For details please see [MW99b]. A list and description of most of our logical operators is given in [MW99b]. Here we will focus on the Discover 5.2 Physical Query Plans and Chainlogical operators used for path expressions. Each The number of physical query plan operators is large; a simple path expression in the query is represented as a Dis- list and description of the more common operators appears cover node, which indicates that in some fashion informa- in [MW99b]. Here we focus on some of the more interesting tion is discovered from the database. When multiplesimple operators including those used to traverse paths through the path expressions are grouped together into a path expres- data. Recall that our physical query plan operators are sion, we represent the group as a left-deep tree of Discover iterators: each node in the plan requests a ªtupleº at a time nodes connected via Chain nodes. It is the responsibility from its children, performs some operation on the tuple(s), of the Chain operator to optimize the entire path expres- and passes result tuples to its parent. The ªtuplesº that our sion represented in its left and right subplans. As an ex- query plans operate over are evaluations (De®nition 5.2):

320

Subplans for: A.B x, x.C y

NLJ NLJ

Chain NLJ

Lindex(x,"C",y) Name(t,A) Name(t,A) Scan(x,"C",y) Discover Discover (A,"B",x) (x,"C",y) Scan(A,"B",x) Scan(x,"C",y) Lindex(t,"B",x) Bindex(t,"B",x) Logical Query Scan Plan Plan Lindex Plan Bindex Plan

Pindex("A.B x, x.C y", y) Pindex Plan Figure 5: Different physical query plans

vectors of bindings for variables in the query. Section 4. If we already have a binding for y then we can In a physical query plan, there are six operators that use the second plan, the ªLindex Planº. In this plan we use

identify information stored in the database: two Lindex operations starting from the bound variable y ,

l y 1. Scan(x, , ): The Scan operator performs pointer- and then con®rm that we have reached the named object

chasing: it places into y all objects that are subobjects A. This corresponds to the bottom-up execution strategy l of the complex object x via an edge labeled . of Section 4. In the ªBindex Planº, we directly locate all

l y 2. Lindex(x, , ): In the reverse of the Scan operator, parent-child pairs connected via a edge using the Bindex

the Lindex operator places into x all objects that are operator. We then con®rm that the parent object is the

A C l parents of y via an edge labeled . This reverse pointer- named object ,andScan for all of the subobjects of the chasing operator is implemented in Lore by the link child object. In the ªPindex Planº, we use the Pindex op- index (Section 3.4). erator, which allows us to directly obtain the set of objects reached via the given path expression. Note that several of 3. Pindex(PathExpression, x): Lore maintains a dynamic ªstructural summaryº of the current database called a the plans use a nested-loop join (NLJ) operator without a DataGuide [GW97]. The DataGuide also can be used predicate. These are dependent joins where the left subplan as a path index, enabling quick retrieval of oid's for passes bound variables to the right subplan. all objects reachable via a given path expression. The Recall the hybrid query execution strategy introduced in

Pindex operator places into x all objects reachable via Section 4. One subplan evaluates a portion of the query and

the PathExpression. obtains bindings for a set of variables, say V , and another

x y

4. Bindex(l , , ): The Bindex operator ®nds all parent- subplan obtains bindings for another set of variables, say

W V \ W child pairs connected via an edge labeled l . This oper- . Suppose contains one variable, but the plans ator allows us to ef®ciently locate edges whose label are otherwise independent, meaning one does not provide a

appears infrequently in the database. binding that the other uses (as in the hybrid plan). Then by n

5. Name(x, ): The Name operator simply veri®es that creating evaluations for both subplans and joining the re- n the object in variable x is the named object . (Recall sults on the shared variable, we ef®ciently obtain complete from Section 3.1 that named objects are entry points evaluations. As in relational systems, deciding which join to a Lore database.) operator to use is an important decision made by the opti-

mizer. (Currently we support nested-loop and hash joins.) x

6. Vindex(Op, Va l u e, l , ): The Vindex operator accepts a x label l , an operator Op,andaVa l u e, and places into Again space limitations preclude a full description of all atomic objects that satisfy the ªOp Valueº condition physical query plans or examples of even remotely compli-

and have an incoming edge labeled l . cated queries, although wewill visit physical query plans for As an example that uses some of these operators, con- Query 1 later in Section 5.4. In addition to the basic traver- sider the path expression ªA.B x, x.C yº(whereA is a sal operators discussed above, we have operators to perform name) and four possible plans as shown in Figure 5. (The projection and selection, manage temporary results, per- optimizer can generate up to eleven different physical plans form an aggregation operation over a subplan, ensure the for this single path expression.) The logical query plan existential and universal quanti®cation of a variable, per- is shown in the top left panel. In the ®rst physical plan, form set and arithmetic operations between two subplans, the ªScan Planº, we use a sequence of Scan operators to and others. For details please see [MW99b]. The physical discover bindings for each of the variables, which corre- operators can be combined in numerous ways, producing a sponds to the top-down execution strategy introduced in vast search space for even relatively simple queries. Our

321

plan enumeration and pruning heuristics will be discussed must estimate how many objects will bindto the next simple in Section 5.4. path expression to be evaluated. Consider evaluating the path expression ªA.B x, x.C yº top-down. If we have 5.3 Statistics and Cost Model

a binding for x, then the optimizer needs to estimate the As with any cost-based query optimizer, we need to estab- number of C subobjects, on average, that objects reached lish a metric by which we estimate the execution cost for a by the path ªA.B xº have. Alternatively, if we proceed given physical query plan or subplan. Lore currently does bottom-up with a binding for x, then the optimizer must not enforce any object clustering, so we are limited to us- estimate the average number of parents via a B edge for ing the predicted number of object fetches as our measure all the C's. We call these two estimates fan-out and fan-in of I/O cost, since we cannot accurately determine whether respectively. The fan-out for a given path expression p and

two objects will be on the same page. Despite this rough

l jpj jp j=jpj d label is computed from the statistics by l .

approximation, experiments presented in Section 6 validate l

jpj jp j=jpj

Likewise, fan-in is d . that our cost model is reasonably accurate. Nevertheless, Our statistics are most accurate for path expressions of

re®ning and expanding the cost model, especially by in-

k + k length 1, since for a given we store statistics about creasing our knowledge of the locality of objects on pages paths of length up to k , and these statistics include infor- (through statistics-gathering and/or actual clustering), is an mation about incoming and outgoing edges to the pathsÐ

area where we intend to invest future effort. effectively giving us information about all paths of length

+ k + 5.3.1 Statistics k 1. Given a path expression of length 2, for maximum accuracy we combine the statistics for two overlap-

Our query optimizer must consult statistical information

p k + ping paths p1 and 2 each of length 1. about the size, shape, and range of values within a data- When estimating the number of atomic values that will base in order to estimate the cost of a physical query plan. satisfy a given predicate, standard formulas such as those

Initially we stored statistics in the DataGuide, but quickly given in [SAC+ 79, PSC84] are insuf®cient in our semi- were limited by the fact that we could only store statistics structured environment due to the extensive type coercion about paths beginning from a named object [GW97]. The that Lore performs [AQM+ 97]. Our formulas take coer- optimizer may choose to begin evaluating a path expres- cion into account by combining value distributions for all sion anywhere within the path (via the Bindex or Vindex atomic types that can be coerced into a type comparable to operator), so we needed more ¯exible statistics. Our new the value in the predicate. approach is to store statistics about all possible subpaths

(label sequences) in the database up to a length k ,where 5.3.2 Cost Model k is a tuning parameter. (Typical object-oriented and re- Each physical query plan is assigned a cost based upon the

lational database systems compute and store statistics for estimated I/O and CPU time required to execute the plan. = k 1.) We have explored several algorithms for ef®ciently The costing procedure is recursive: the cost assigned to a computing and storing such statistics, but a presentation of node in the query plan depends on the costs assigned to these algorithms is outside of the scope of this paper. Re- its subplans, along with the cost for executing the node gardless of algorithm used, a clear trade-off exists between itself. In order to compute estimated cost recursively, at the cost in time and space for a larger k and the accuracy of each node we must also estimate the number of evaluations the statistics. expected for that subplan. To decide if one plan is cheaper

The statistics we maintain for every label subpath p of than another, we ®rst check the estimated I/O cost. Only k length include: when the I/O costs are identical do we take estimated CPU cost into account. Again, our cost metric is admittedly For each atomic type, the total number of atomic ob- simplistic, but it does appear acceptable for the ®rst version jects of that type reachable via p. of our cost-based optimizer as shown by the performance

For each atomic type, the minimum and maximum results in Section 6.

values of all atomic objects of that type reachable via p.

Due to space constraints, our formulas for estimated I/O

p jpj

The total number of instances of path , denoted . and CPU cost and number of evaluationsare omitted but ap- p

The total number of distinct objects reachable via , pear in [MW99b]. As an example, consider the I/O formula

l x +

jpj

d for the Vindex(Op, Va l u e, , ) operator: BLevel

denoted . l ;type1

+ + l

Selectivity (l , Op, Va l u e) BLevel Selectivity ( ,

1 ; 2 l The total number of -labeled subobjects of all objects l type2

Op, Va l u e). Here BLevel gives the height of the relevant

p jp j

reachable via , denoted l .

B+-tree index, and the Selectivity functions are the formu- l

The total number of incoming -labeled edges to any las to estimate the number of satisfying results given Lore's

jp j instance of p, denoted . coercion system. (Because of type coercion, multiple B+-

As mentioned earlier, our I/O cost metric is based on the es- trees need to be accessed during a Vindex operation.) As a

l y timated number of objects fetched during evaluation of the second example, the I/O cost for the Lindex(x, , ) operator

query. Thus, for example, given an evaluation that corre- is 2 + Fin ,whereFin is the fan-in statistic as de- PathOf(y);l sponds to a traversal to some point in the data, the optimizer ®ned earlier. The Lindex is implemented using extendible

322

n: number of simple path expressions n=3 n=5 n=7 All possible plans/Lore's search space 1458 / 48 2,361,960 / 228 8,035,387,920 / 948 Table 1: Analysis of Search Space Size hashing, and our cost estimate assumes no over¯ow buck- how dramatically our heuristics reduce the search space. ets. Thus, it requires two page fetches (one for the directory However, even with our aggressive pruning, Lore still and one for the hash bucket) and one additional page fetch chooses very good plans as we demonstrate in Section 6. for every possible parent. For further re®nement, we intend to experiment with a ®nal optimization phase in which we can apply transformations 5.4 Plan Enumeration directly over the generated physical query plan, such as The search space of physical query plans for a single Lorel moving subplans to different locations in the overall plan.

query is very large. For example, a single path expression We now discuss how physical plans are produced. As n of length n can be viewed as an -way join, where as ªjoin mentioned earlier, each logical plan node creates an optimal methodsº Lore considers pointer-chasing, reverse pointer- physical plan given a set of bound variables. During plan chasing, and two different standard relational joins. Fur- enumeration we track for every variable in the query: (1) thermore, there may be many interrelated path expressions whether the variable is bound or not;(2) which plan operator in a single query, along with other constructs such as set op- has bound the variable; (3) all other plan operators that use erations, subqueries, aggregation, etc. [AQM+ 97]. In order the variable; (4) whether the variable is stored within a to reduce the search space as well as the complexity of our temporary result. For instance, the logical query plan node plan enumerator, we use a greedy approach to generating Discover for the simple path expression x.Age y may be physical query plans. Each logical query plan node makes asked to create its optimal plan given that x has already a locally optimal decision, creating the best physical sub- been bound by some other physical operator, in which case plan for the logical plan rooted at that node. The decision it may decide that Scan is optimal. However, if y was is based on a given set of bound variables passed from the bound instead then it may decide that Lindex is optimal. node's parent. The key to considering a variety of different After a node creates its optimal subplan, the new state of physical plans is that a node may ask its child(ren) for the the variables and the optimal subplan are passed to the optimal physical query subplan many times, using different parent. Note that a logical node may be unable to create sets of bound variables each time. While this greedy ap- any physical plan for a given state of the variables if it proach greatly reduces the search space, it still explores an always requires some variables to be bound. In this case, exponentially large number of physical query plans. Thus, ªno planº is returned and a different choice must be made at our plan enumerator currently uses the following additional a higher level in the plan. In [MW99b] we detail how each heuristics to further prune the search space. logical plan node generates its optimal physical subplan. To illustrate the transformation from a logical plan to a The optimizer does not consider joining two simple path expressions together unless they share a com- physical plan, let us consider part of the search space ex- mon variable. This restriction substantially reduces plored during the creation of the physical query plan for the number of ways to order the evaluation of sim- Query 1, whose logical query plan was given in Figure 4. ple path expressions. (See [MW99b] for a detailed The topmost Glue node (indicating a rotation point) in Fig- discussion.) ure 4 is responsible for deciding the execution order of its children: either left-then-rightor right-then-left. It requests

The Pindex operator is considered only when a path expression begins with a name, and no variable except the best physical query plan from the left child and then, us- the last is used elsewhere within the query. The latter ing the returned bindings, requests the best physical query plan from the right child. One possible outcome is the requirement is based on the fact that Pindex only binds the last variable in its path expression, so other needed physical query plan fragment shown in Figure 6(a). After variables in the path would have to be discovered by exploring left-then-right execution order, the topmost Glue node considers the right-then-left order. The right child some additional method. is another Glue node which recursively follows the same

The Select clause always executes last, since in procedure. Suppose that for this nested Glue node, the left- nearly all cases it depends on one or more variables then-right execution order results in the physical subplan bound in the From clause. Also, the physical query shown in Figure 6(b), while the right-then-left execution plan will always execute either the complete From or order results in Figure 6(c). (For details of the CreateTemp Where complete clause before moving on to the other and Once operators please see [MW99b].) Suppose plan one. (c) is chosen based on a lower estimated cost. The bind-

The optimizer does not attempt to reorder multiple ings provided by this subplan are then supplied to the left independent path expressions. child of the topmost Glue node to create the optimal query A detailed analysis of our physical query plan search space plan for the left child, which could result in the ®nal subplan is provided in [MW99b]. Table 1 gives some examples of shown in Figure6(d). Notice that in theright subplan for the

323

NLJ HashJoin (t1,t2,t3) Select PindexJoin (t1,=,"true") ("DBGroup.Member x", x) CreateTemp CreateTemp (y,{x},t1) (y,{},t2) Aggr (y,exists,t1) Once(x) Vindex Select(y,<,30) ("Age",<,30,y) Bindex Scan(x,"Age",y) (a) ("Age",x, y) (b) HashJoin (t0,t1,t2) NLJ CreateTemp CreateTemp (x,{},t0) (x,{},t1)

Pindex Vindex Once(x) NLJ ("Age",<, 30, y) ("DBGroup.Member("A.B x", x) x", x) Vindex ("Age",<, 30, y) Once(x) Lindex(x,"Age",y)

Lindex(x,"Age",y) (c) (d) Figure 6: Possible transformations for Query 1 into a physical query plan topmost Glue node, the Chain node decided that the Pindex rowed from HTML. Rather than use these small datasets operator is the best way to get all ªDBGroup.Member xº for our experiments we built our own XML database about objects within the database, despite the fact that we already movies made in 1997, combining information from many

have a binding for x. This choice makes sense when the sources including the Internet Movie Database (located at

estimated fan-in for x with label DBGroup is very high. http://www.imdb.com). The database contains facts As a ®nal step the topmost Glue node decides which query about 1,970 movies, 10,260 actors and actresses, plot sum- plan is cheaper, either (a) or (d), and passes that plan to its maries, directors, editors, writers, etc., as well as multi- parent. media data such as stills and audio clips. The database loaded into Lore is about 5 megabytes. Value, Link, and 6 Performance Results Edge indexes (recall Section 3.4) account for an additional The query optimization techniques described in this pa- 8.1 megabytes. The database is semistructured and very per are fully implemented and integrated into Lore, includ- cyclic; for example, actors have edges to each movie they ing the physical operators, statistics, cost formulas, logical appeared in (along with their role in the movie), and movies query plan generation, and physical query plan enumeration have edges to all of the actors in the movie. The database and selection. The implementation for these components graph contains 62,256 nodes and 130,402 edges. consists of approximately 31,000 lines of C++ code. We Lore allows us to turn off all pruning heuristics temporar- also have implemented mechanisms for instructing the op- ily, in order to create and execute all possible query execu- timizer to favor certain types of plans (in order to debug tion plans within our search space for a single query. Thus, and conduct experiments), and we have built a very useful we can evaluate how the chosen plan performs against other query plan visualization tool. We now present some pre- possible plans. However, it is infeasible to perform this ex- liminary performance results showing that our cost model is tensive experiment for large queries, since the number of reasonably accurate and that the optimizer is choosing good plans in the search space is very large, and some plans are plans. Extensive performance evaluations over a large suite extremely slow to execute (even if the chosen plan is very of queries and databases is beyond the scope of this paper. ef®cient). We report on a sample of experiments, again All of our tests were run on a Sun Ultra 2 with 256 emphasizing that exhaustive performance evaluations are megabytes of RAM. However, Lore was con®gured to have beyond the scope of this paper. a small buffer size of approximately 150K bytes, in order to match the relatively small databases used by our initial per- Experiment 1. We begin with an extremely simple query: formance experiments. Each query was run with an initially Select DB.Movie.Title. Using exhaustive search empty buffer. Over all of the queries in our experiments the Lore produces eleven different query plans, with estimated average optimization time was approximately 1/2 second. I/O costs and actual execution times (in seconds) as show At the time of this writing we have not located any sig- in Table 2. In this and subsequent tables the plan chosen ni®cant amounts of readily available XML data. What is by the optimizer when the pruning heuristics are used is available consists mostly of small, tree-structured docu- marked with a star (*). The ®rst and best plan simply uses ments usually with cryptic tags or presentational tags bor- Lore's path index to quickly locate all the movie titles. The

324

Plan # 1* 2 3 4 5 6 7 8 9 10 11 Exec. Time (sec.) 0.36 1.78 2.02 14.44 61.82 67.24 74.09 94.15 250.61 397.18 423.34 Estimated I/O 1975 3944 3944 9853 31918 31918 11823 37827 17742 17733 23855 Table 2: Query execution times second plan, which is only slightly slower, uses top-down mistakeisduelargely to simplisticestimatesof atomicvalue pointer-chasing. The worst plan uses Bindex operators and distributions(we have not yet implemented histograms) and hash joins. of set operation costs. Devoting some effort to improving To evaluate the relative accuracy of our cost model, con- the optimizer in these areas should lead it to select the

sider the estimated I/O cost against the actual execution optimal plan. 2 time. With some exceptions, the estimated cost accurately Experiment 4. Our fourth query selects movies with a cer- re¯ects the relative execution time for each plan. Since our tain quality rating. Here too we considered only a sampling cost model is still quite simplistic, we are very encouraged of all possible plans.

by these results. 2 Rank Time Description Experiment 2. Our second query asks for all movies 1* 0.61 Bindex for rating, then Lindex up Genre with a subobject having value ªComedyº. This 2 0.89 Bottom-up turned out to be a ªpointº query, since many movies don't 3 4.04 Top-down have a Genre subobject and most aren't comedies. Es- timated I/O costs again re¯ected relative execution times Since it turns out that quality ratings are fairly uncommon fairly accurately, so hereafter we focus only on execution in the database, the optimizer (correctly) chooses to ®nd all times. Twenty-four plans were considered using exhaustive ratings via the Bindex, then to work bottom-up. 2 search. The following table describes some of them, where We have performed many experiments in addition to ªTimeº is the execution time and ªRankº indicates the plan those reported here, although the ones described are a rep- rank by execution time among all plans considered. resentative sample. In general, our experiments so far allow us to conclude: (i) our cost estimates are accurate enough Rank Time Description to select the best plan in most cases, although some re®ne- 1* 0.3307 Bottom-up ments are needed; (ii) execution times of the best and worst 2 0.3768 Bindex, Select then Lindex plans for a given query and database can differ by many 7 3.3384 Top-down orders of magnitude; and (iii) the best execution strategy 24 458.58 Full Bindex is highly dependent on the query and database, indicating Since the Where clause is very selective, the optimal plan that a cost-based query optimizer for XML data is crucial uses a bottom-up strategy: a Vindex operator locates all to achieving good performance. objects having the value ªComedyº and an incoming edge 7 Future Work labeled Genre.TheLindex operator matches the remainder of the path expression in reverse. The second-best plan Extensions to the work presented here are underway, in- is only slightly slower. It uses the Bindex to locate all cluding speci®c optimizationtechniques for branching path Genre edges, ®lters using the ªComedyº predicate, then expressions, a query rewrite that moves Where clause pred- proceeds bottom-up. The slowest plan uses a poor combi- icates into the From clause, and a transformation that intro- nation of Bindex operators and joins. Top-down evaluation duces a Group-By clause when a large number of paths

results in the seventh-best plan. 2 pass through a small number of objects. We are also con- Experiment 3. In the remaining two experiments we did sidering partially correlated subplans, which are similar to not execute all possible plans since the queries and space of correlated subqueries but rely on the bindings passed be- plans are much larger. Instead, we generated and executed tween portions of the physical query plan rather than on samplying of plans from within our search space. Again, the query itself. In the area of statistics we are considering the plan chosen by the optimizer is marked with a star(*). even more ef®cient statistics-gathering algorithms, perhaps The following results are for a query with two existentially incorporating some graph sampling. We also plan to gather quanti®ed variables in the Where clause. statistics about the location of objects on disk, with corresponding modi®cations to the cost formulas to generate Rank Time Description more accurate cost estimates. 1 0.33 Bottom-up 2* 3.68 Top-down References 3 6.95 Hybrid with Pindex [Abi97] S. Abiteboul. Querying semistructured data. In Proc. of the 4 7.01 Hybrid with pointer-chasing International Conference on Database Theory, pages 1±18, 5 23.13 Bindex and Vindex then Lindex Delphi, Greece, January 1997.

[AQM + 97] S. Abiteboul, D. Quass, J. McHugh, J. Widom, andJ. Wiener. Notice that the optimizer chose plan 2, the top-down or The Lorel query language for semistructured data. Journal of pointer-chasing execution strategy, as the best plan. The Digital Libraries, 1(1):68±88, April 1997.

325

[BDHS96] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu. A [KMP93] A. Kemper, G. Moerkotte, and K. Peithner. A blackboard query language and optimization techniques for unstructured architecture for query optimization in object bases. In Proc. of data. In Proc. of the ACM SIGMOD International Conference the Nineteenth International Conference on Very Large Data on Management of Data, pages 505±516, Montreal, Canada, Bases, pages 543±554, Dublin, Ireland, August 1993. June 1996. [LV91] R. Lanzelotte and P. Valduriez. Optimization of nonrecursive [BF97] E. Bertino and P. Foscoli. On modeling cost functions for queries in OODBs. In Proc. of the Second International Con- object-oriented databases. IEEE Transactions on Knowledge ference on Deductive and Object-Oriented Databases, pages and Data Engineering, 9(3):500±508, May 1997. 1±21, Munich, Germany, December 1991.

[BPSM98] T. Bray, J. Paoli, and C. Sperberg-McQueen. Extensible [MAG+ 97] J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and markup language (XML) 1.0, February 1998. W3C Rec- J. Widom. Lore: A database management system for semi- ommendation available at http://www.w3.org/TR/1998/REC- structured data. SIGMOD Record, 26(3):54±66, September xml-19980210. 1997. [BRG88] E. Bertino, F. Rabitti, and S. Gibbs. Query processing in a [Mar98] M. Marchiori. Proc. of QL'98 - The Query Languages Work- multimedia document system. ACM Transactions on Of®ce shop. Boston, MA, December 1998. Papers available online Information Systems, 6(1):1±41, January 1988. at http://www.w3.org/TandS/QL/QL98/. [Bun97] P. Buneman. Semistructured data. In Proc. of the Sixth [MSOP86] D. Maier, J. Stein, A. Otis, and A. Purdy. Development ACM SIGACT-SIGMOD-SIGART Symposium on Principles of an object-oriented DBMS. In Proc. of the Conference of Database Systems, Tucson, Arizona, May 1997. Tutorial. on Object-Oriented Programming Systems, Languages, and Applications, pages 472±481, Portland, Oregon, November [Cat94] R.G.G. Cattell. The Object Database Standard: ODMG-93. 1986. Morgan Kaufmann, San Francisco, California, 1994. [MW99a] J. McHugh and J. Widom. Compile-time path expansion in [CCM96] V. Christophides, S. Cluet, and G. Moerkotte. Evaluating Lore. In Workshop on Query Processing for Semistructured queries with generalized path expressions. In Proc. of the Data and Non-StandardData Formats, Jerusalem, Isreal, Jan- ACM SIGMOD International Conference on Management of uary 1999. Data, pages 413±422, Montreal, Canada, June 1996. [MW99b] J. McHugh and J. Widom. Query optimization for semi-

[DFF+ 98] A. Deutsch, M. Fernandez, D. Florescu, A. Levy, and D. Su- structured data. Technical report, Stanford University Data- ciu. XML-QL: A query language for XML, August 1998. base Group, February 1999. Document is available at

Available at http://www.w3.org/TR/NOTE-xml-ql/. ftp://db.stanford.edu/pub/papers/qo.ps. + [FFK + 99] M. Fernandez, D. Florescu, J. Kang, A. Levy, and D. Suciu. [MWA 98] J. McHugh, J. Widom, S. Abiteboul, Q. Luo, and A. Rajara- Catching the boat with Strudel: Experiences with a web- man. Indexingsemistructured data. Technical report, Stanford site management system. In Proc. of the ACM SIGMOD University Database Group, 1998. Document is available at International Conference on Management of Data,pages 414± ftp://db.stanford.edu/pub/papers/semiindexing98.ps 425, Seattle, Washington, June 1999. [OMS95] M. T. Ozsu, A. Munoz, and D. Szafron. An extensible query [FFLS97] M. Fernandez, D. Florescu, A. Levy, and D. Suciu. A optimizer for an objectbase management system. In Proc. query language for a web-site management system. SIGMOD of the Fourth International Conference on Information and Record, 26(3):4±11, September 1997. Knowledge Management, pages 188±196, Baltimore, Mary- land, November 1995. [FLS98] D. Florescu, A. Levy, and D. Suciu. Query optimization algorithm for semistructured data. Technical report, AT&T [O'N87] Patrick O'Neil. Model 204 architecture and performance. In Laboratories, June 1998. Proc. of the Second International Workshop on High Perfor- mance Transaction Systems (HPTS), pages 40±59, Asilomar, [FS98] M. Fernandez and D. Suciu. Optimizing regular path ex- CA, 1987. pressions using graph schemas. In Proc. of the Fourteenth [PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Ob- International Conference on Data Engineering, pages 14±23, Orlando, Florida, February 1998. ject exchange across heterogeneous information sources. In Proc. of the Eleventh International Conference on Data Engi- [GGMR97] J. Grant, J. Gryz, J. Minker, and L. Raschid. Semantic query neering, pages 251±260, Taipei, Taiwan, March 1995. optimization for object databases. In Proc. of the Thirteenth [PSC84] G. Piatetsky-Shapiro and C. Connell. Accurate estimation of International Conference on Data Engineering, pages 444± the number of tuples satisfying a condition. In Proc. of the 454, Birmingham, UK, April 1997. ACM SIGMOD International Conference on Management of [GGT95] G. Gardarin, J. Gruser, and Z. Tang. A cost model for clus- Data, pages 256±276, Boston, MA, June 1984. tered object-oriented databases. In Proc. of the Twenty-First [RLS98] J. Robie, J. Lapp, and D. Schach. XML query language (XQL). International Conference on Very Large Data Bases, pages In [Mar98], 1998. 323±334, Zurich, Switzerland, September 1995. [SAC+ 79] P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie, and [GGT96] G. Gardarin, J. Gruser, and Z. Tang. Cost-based selection T. Price. Access path selection in a relational database man- of path expression processing algorithms in object-oriented agement system. In Proc. of the ACM SIGMOD International databases. In Proc. of the Twenty-Second International Con- Conference on Management of Data, pages 23±34, Boston, ference on Very Large Data Bases, pages 390±401, Bombay, MA, June 1979. India, 1996. [SMY90] W. Sun, W. Meng, and C. T. Yu. Query optimization in object- [GMW99] R. Goldman, J. McHugh, and J. Widom. From semistructured oriented database systems. In Database and Expert Systems data to XML: Migrating the Lore data model and query lan- Applications, pages 215±222, Vienna, Austria, August 1990. guage. In Proc. of the 2nd International Workshopon the Web and Databases, Philadelphia, PA., June 1999. [SO95] D. D. Straube and M. T. Ozsu. Query optimization and execution plan generation in object-oriented database systems.IEEE [Gra93] G. Graefe. Query evaluation techniques for large databases. Transactions on Knowledge and Data Engineering, 7(2):210± ACM Computing Surveys, 25(2):73±170, 1993. 227, April 1995. [GW97] R. Goldman and J. Widom. DataGuides: Enabling query [YM97] C. Yu and W. Meng. Principles of Database Query Pro- formulation and optimization in semistructured databases. In cessing for Advanced Applications. Morgan Kaufmann, San Proc. of the Twenty-Third International Conference on Very Francisco, California, 1997. Large Data Bases, pages 436±445, Athens, Greece, August 1997.

326