Effective Schema-Based XML Query Optimization Techniques

Effective Schema-Based XML Query Optimization Techniques Guoren Wang and Mengchi Liu Jeffrey Xu Yu School of Computer Science Department of SEEM Carleton University, Canada CUHK, Hong Kong, China fwanggr, [email protected] [email protected] Bing Sun, Ge Yu, and Jianhua Lv Hongjun Lu Department of Computer Science Department of Computer Science Northeastern University, China HKUST, Hong Kong, China fsunbing,yuge,[email protected] [email protected] Abstract One of the ways to evaluate RPE in relational XML man- agement systems is rewriting an RPE to some SPEs (Simple Use of path expressions is a common feature in most Path Expression) according to the schema information and XML query languages, and many evaluation methods for statistic information on XML documents, and then com- path expression queries have been proposed recently. How- puting the SPEs respectively by translating the SPEs into ever, there are few researches on the issue of optimizing SQL statements and executing them. Take VXMLR[12] for regular path expression queries. In this paper, two kinds example, RPE queries containing ’//’ and ’*’ are rewritten of path expression optimization principles are proposed, to SPEs, which will be translated into SQL statements for named path shortening and path complementing, respec- a later processing in accordance with relational database tively. The path shortening principle reduces the querying schema. In the Lore system[1], there are three strategies cost by shortening the path expressions with the knowledge to compute SPE queries: top-down, bottom-up and hybrid. of XML schema. While the path complementing principle In the top-down way, an XML document tree (DOM tree) substitutes the user queries with the equivalent lower-cost is navigated from its root to get proper results; and in the path expressions. The experimental results show that these bottom-up way, an XML document tree is traversed from two techniques can largely improve the performance of path its leaves to the root with the help of value indices; finally, expression query processing. in the hybrid way, a long path is divided into several short paths, each of which can be evaluated in either the top-down way or the bottom-up way, and then the results of these short 1 Introduction paths are joined together to get proper results. As an improvement of the hybrid way, the extent join Although XML is usually used as an information ex- algorithm [9] has been paid a lot of attention recently. change standard, storing, indexing and querying XML data This paper introduces two optimization strategies based on are still important issues and recently have become research the extent join algorithm: path shortening Strategy and hotspots in the database community. To retrieve XML data path complementing Strategy. These strategies can also be from databases, many query languages have been proposed, used for other path evaluation algorithms besides the ex- such as XQuery [4], XPath [2], XML-QL [3], XML-RL [7], tent join algorithm. The path shortening Strategy shortens and Lorel [1]. Because the use of regular path expres- the path to reduce the number of join operations so that the sions (RPE) is a common feature of these languages, query query performance can be improved. The path complement- rewriting and optimization for RPE is becoming a research ing Strategy computes the original path expression query hotspot and a few research results have been published re- through some other low-cost path expressions. cently [9, 12]. Most implemented XML systems are based on relational 2 Background DBs. In the relational way [5, 12], XML data is mapped into tables of a relational/object-relational database system and An XML document can be represented as a rooted tree, queries on XML data are translated into SQL statements. Td = (Vd;Ed;Ld), called an XML data tree, where Vd is a set of nodes, each of which represents either an ele- bel structure of an XML data tree shall be a subgraph of ment or an attribute; Ld is a set of node labels representing the corresponding XML schema graph. In other words, the tags of the nodes, each of which is associated with a la- label paths imposed by the XML schema graph put restric- bel; Ed is a set of edges, each of which represents either a tions on the possible label paths in the corresponding XML parent-child relationship between two element nodes or an data trees. Figure 2 shows the XML schema graph for the element-attribute relationship between an element and an XML data tree in Figure 1. In Figure 2, the root node is the attribute. In the following, we use rd to represent the root node with a label site. The node labels with underline, of an XML data tree. such as income are for attributes. Figure 1 shows an XML data tree for an XML document Given an XML schema graph and an XML database in the XML benchmark project, XMark. Here, an ellipse consisting of XML data trees that conform to the XML represents an element node, whereas a triangle represents schema graph, let p be a label path of an XML schema an attribute node. The numbers with prefix ”&” marked on graph p = =l1=l2= ¢ ¢ ¢ =lk, where l1 is the label of the root the nodes are the node identifiers. Note, &1 is the root node node of the XML schema graph, and lk the ending label of the XML data tree whose label is site. of the label path. An extent of the label path p, denoted ext(p), is a set of nodes (or node identifiers) of any data path =&n1=&n2= ¢ ¢ ¢ =&nk in an XML data tree such that their corresponding label path matches p. In addition, an extent of a label, lj, denoted ext(lj), is a set of nodes (or node identifiers) that have label l. Consider a label path of the XML schema graph in Figure 2 p = /site/regions/namerica/item/ name. The extent of p, ext(p), includes &6 in the XML data tree in Figure 1, because there is a data path &1:&2:&3:&4:&6 such that the corresponding label path of Figure 1. An XML Data Tree the data path matches p. On the other hand, ext(name) = f&6; &9; &13g. An XML data tree is a set of paths. Given a node, a data path is a sequence of node identifiers from the root P athExpression ::= CONNECT OR P athSteps to the node, and a label path is a sequence of labels from j P athSteps CONNECT OR P athSteps the root to the node. For example, &1:&2:&3:&4:&6 P athSteps ::= Label and &1:&21:&22:&24:&25 are two data paths from j Label 0j0 P athSteps the root to the nodes &6 and &25. The corre- j (P athSteps) sponding label paths are /site/regions/namerica/ j 0¤0 item/name and /site/close_auctions/close_ CONNECT OR ::= 0=0 j 0==0 auction/annotation/description, respectively. Figure 3. BNF Definition of Path Expression site Figure 3 shows the BNF definition of path expressions, regio ns people clo sed_auctio ns open_auctio ns as a simplified version of XPath. In Figure 3, Label rep- nameric a {asia ,...} resents node labels. A symbol ’*’ is introduced as a wild- person clo sed_auctio n open_auctio n card for an arbitrary node label in the XML schema graph. item id The CONNECTOR represents the connector of paths. We pric e name buyer person mainly discuss two widely used connectors: ’/’ and ’//’, bid der where the former represents a parent-child relationship, and annotatio n profile in come the latter represents an ancestor-descendant. Given an XML descrip tio n schema graph, Gt, and an XML database consisting of XML data trees that conform to the XML schema. A path Figure 2. An XML Schema Graph expression query, Q, is a path expression (Figure 3). A path expression query, Q, is valid if it at least match a la- An XML schema graph is a rooted directed graph, Gt = bel path, p = l1=l2= ¢ ¢ ¢ =lk, in Gt. A directed query graph (Vt;Et), where Vt is a set of labelled nodes and Et is a set Gq = (Vq;Eq) is defined as a rooted subgraph of Gt such of edges. We use rt to denote the root of the XML schema that Gq only includes the label paths of Gt that the query Q graph. It is important to note that an XML schema graph matches. The root of Gq is the root of Gt and the no-child specifies the possible structures of XML data trees. The la- nodes are the nodes with label lk. The result of the query, 2 Q, denoted ext(Q), is [i ext(pi) where pi is a label path Technique 2. (Effective Path Shortening) Let Gt be an from the root to the no-child node with label lk in Gq. XML schema graph and Q = ¢ ¢ ¢ =lk be an arbitrary valid path expression query. A corresponding query graph Gq 3 Simple but Effective Optimization Tech- can be constructed (Gq ⊆ Gt) as follows. Let Qu be a niques: Path Shortening and Path Comple- set of labels, lu, that such that, (i) lu only appears in Gq and does not appear in Gt ¡ Gq, (ii) all data paths, that menting satsify the query Q, must traverse at least one node with label lu, and (iii) the label path l1= ¢ ¢ ¢ =lu is unique. We call The path shortening technique is based on the following those labels unique labels. The effective path shortening observation.

Load more