Paper 928 Complex Group-By Queries for XML
Total Page:16
File Type:pdf, Size:1020Kb
Paper 928 Complex Group-By Queries for XML C. Gokhale+, N. Gupta+, P. Kumar+, L.V.S. Lakshmanan∗, R. Ng∗, and B.A. Prakash+ + Indian Institute of Technology, Bombay ∗ University of British Columbia, Canada 1 Introduction also omit the selection part of the query, and just focus on the aggregation part. The popularity of XML as a data exchange standard has group //Book led to the emergenceof powerfulXML query languages like by //Name return ( //Name, avg(/Price), count(*) XQuery [21] and studies on XML query optimization. Of then by /Year return ( late, there is considerable interest in analytical processing /Year, median(/#Sold) ) ) of XML data (e.g.,[2, 3]). As pointed out by Borkar and The intent of the query is to group Book nodes first Carey in [3], even for data integration, there is a compelling by (publisher) Name. For each group, we get the average need for performing various group-by style aggregate oper- Price, andthe numberof Book nodesin the group. More- ations. A core operator needed for analytics is the group- over, for each publisher group, the book nodes are further by operator, which is widely used in relational as well as sub-grouped by Year. For each of these nested groups, the OLAP database applications. XQuery requires group-by median #Sold per year is returned. operations to be simulated using nesting [2]. Studies addressing the need for XML grouping fall into Q1−Answer two broad categories: (1) Provide support for grouping at Name−group Name−group the logical or physical level [6] and recognize grouping op- erations from nested queries and rewrite them with group- Name avg count() Year−group Year−group Kaufman (Price) 258 Name avg count() Year−group ing operations [4, 5, 9, 12]. (2) Extend XQuery FLWOR $70 Wesley (Price) 175 Year median Year median $120 expressions with explicit constructs similar to the group-by, 1999 (#Sold) 2000 (#Sold) Year median 5600 26000 1999 (#Sold) order-by and having clauses in SQL [3, 2]. However, direct 2300 algorithmic support for a group-by operator is not explored. In this paper, we focuson efficient processing of a group- Figure 2. Partial Result of Q1 by operator for XML – with the additional goal of sup- porting a full spectrum of aggregation operations, including Figure 2 shows the answer to query Q1 for the (par- holistic ones such as median() [8] and complex nested tial) data in Figure 1. The first group shown, for instance, Name = Kaufman aggregations, together with having clause, as well as mov- is for . Among the 258 books in this ing window aggregation. group, the average price is $70. These books are further sub-grouped by Year. For each year that appears in the in- Consider the simple catalogue example in Figure 1. This put data, the median number of copies sold is also returned can be part of an input XML database, or intermediate re- (e.g., 5600 for 1999). We can enhance nested group-by sult of a query. The catalogue is heterogeneous: it contains query Q with two features, as illustrated by query Q : information about books, music CDs, etc. Books are orga- 1 2 group //Book nized by Subject, e.g., physics, chemistry. For by //Name each book, there is information on its Title, Author, having count(*) ≥ 100 return ( //Name, avg(/Price), count(*) Year, #Sold, Price, (publisher) Name, etc. Books may then by /Year return ( have multiple authors. The data value at a leaf node is /Year(10,5), median(/#Sold) ) ) shown in italics. The node id of a node is also shown for In Q2, the having-clause for the outer block removes future discussion. publishers with the total number of book nodes less than Consider the following nested group-by query Q1. 100. Besides, we form moving windows over years – with While we could follow the syntax proposed by [2], syntax each window having a width of 10 years and a step size not being our main focus, we use a more concise form. We of 5 years (e.g., [1990,2000], [1995,2005], etc.). While in 1 Catalogue Subject Subject (22) Music CD (1) (23) (24) (33) (13) Name Name (2) Book (3) Book Book Book Physics Chemistry (5) (6) (7) (8) (9) (10) (14) (15) (16) (17) (18) (19) (25) (26) (27) (28) (29) (30) (4)PubInfo TitleAuthor Author Year #Sold Price PubInfo Title Author Year #Sold Price PubInfo Title Author Year #Sold Price Sound Mcmanus 2000 800 $96 Carbon Johnson 1999 8000 $25 MechnicsNewton Smith 1999 5600 $60 Compounds (11) Waves (20) (31) (12) (21) Name City Name City Name City (32) Kaufman NY Wesley LA Kaufman NY Figure 1. The Catalogue Example the next section we will present a more comprehensive set performance than simulating grouping via nesting. None of moving window options, it should be easy to appreci- of these papers discuss algorithms for directly computing ate the value of supporting nested group-bys with having group-bys (with possible nesting, having, and moving win- clauses and moving windows for XML querying. In princi- dows). ple, all value aggregationsrequired of XML can be obtained A second line of studies investigates how to support by shredding it to relations and using SQL (the “SQL ap- group-by at a logical or physical level [6], and detect proach”). We examine this issue empirically in Section 7 group-bys from nested queries and rewrite them with ex- with an emphasis on queries involving grouping together plicit grouping operations [4, 5, 9, 12]. However, detect- with nesting. Indeed, owing to XML’s inherent hierarchical ing grouping inherent in nested queries is challenging and nature, nested group-by (e.g., query Q1) is a fundamental such queries are hard to express and understand. In partic- type of group-by that merits study. In our experiments, we ular, the focus of [12] is on structural aggregation by node observed an order of magnitude difference between the per- types as opposed to value aggregation. Studies by Fiebig formance of the SQL approach (using Oracle) and ours. We and Moerkotte [6], Pedersen et al. [13], and Deutsch et make the following contributions. al. [4] all consider using query optimization-style rewrite • We propose a framework for expressing complex ag- rules for various kinds of grouping. The transformed query gregationqueries on XML data featuring nested group- plan would be based on nested loops. by, having cluase, and moving windows (Section 3). There is an extensive body of work on efficient com- • We develop a disk-based algorithm for efficient eval- putation of group-by and cube queries for relational data uation of queries involving any subset of the above (e.g., [8, 10]). These algorithms are not directly applica- fearures (Section 5). ble to hierarchical data especially when group-by elements • We discuss the results of a comprehensiveset of exper- (β’s) may involve combination of forward and backward iments comparing our approach with that of shredding axes and aggregations on values may be nested and may XML into relations and using SQL, and with those of occur at multiple levels (e.g., Q2). Of course, by shred- Galax [7] and Qizx [17], validating the efficiency of ding XML to relations, all such queries can be expressed in our algorithm and the effectiveness of the optimiza- SQL. The performance impact of this approach compared tions developed (Section 7). with our direct approach is discussed in Section 7. Related work appears in the next section. Section 8 sum- Finally, [15, 19] study XPath selectivity estimation to marizes the paper and discusses future work. obtain statistical summaries and approximate answers for XPath expressions. They do not directly support exact com- 2 Related Work putation of group-bys. While for relational data, SQL provides explicit support 3 Class of Nested Group-bys for group-by, XQuery requires us to simulate it using nest- ing. It has been noted that this leads to expressions that are 3.1 GeneralFormof1-levelNestingandExamples hard to read, write, and process efficiently [2, 3]. Beyer et The examples discussed so far are instances of the gen- al. [2] and Borkar and Carey [3] propose syntactic exten- sions to XQuery FLOWR expressions to provide explicit eral form of a one-level nested group-by query below. group α support for group-by. They also demonstrate how related where Cons out out out out analytics such as moving window aggregations and cube by β1 (mw1 ) ...βk (mwk ) having AggConsout return ( can also be expressed in the extended syntax. Beyer et out out out out out out β1 ,...,βk , agg1 (γ1 ),...,aggm (γm ) in in in in al. report preliminary experimental results indicating better then by β1 (mw1 ),...,βp (mwp ) 2 Catalogue any book received in a group. As a last example of nested Publisher Publisher aggregation, spread(Rating)=def max(Rating) − Name Location Location Penguin min(Rating) combines min() and max(). out Name Year Year Aggregation Conditions: Cons, AggCons and New York in Value Book Book AggCons are sets of conditions. Cons in the 1999 where clause are the usual node-level selection conditions. Subject Title #Sold Review Review Price Science Fiction 200,000 $15 out in Foundation AggCons and AggCons are sets of aggregation Reviewer Rating John Doe 9 conditions of the form aggi(γi) θi ci, where θi ∈ {=, 6= ,>,<, ≥, leq}, and ci is a constant. With the use of the Figure 3. An Example Illustrating Node Type Inversion having clause, iceberg queries can be easily expressed in the proposed framework. Moving windows: Moving window queries have been having AggConsin return ( in in in in in in studied extensively for relational databases [18].