Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao (Microsoft)
Frank Pellow, Hamid Pirahesh (IBM Research, 500 Harry Road, San Jose, CA 95120)

1 May 1997
Technical Report MSR-TR-97-32
Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052

This paper appeared in Data Mining and Knowledge Discovery 1(1): 29-53 (1997). An extended abstract of this paper appeared in [Gray et.al.]. Earlier versions: Microsoft Technical Report MSR-TR-95-22, 5 February 1995; revised November 1995; expanded June 1996.

Abstract: Data analysis applications typically aggregate data across many dimensions looking for anomalies or unusual patterns. The SQL aggregate functions and the GROUP BY operator produce zero-dimensional or one-dimensional aggregates. Applications need the N-dimensional generalization of these operators. This paper defines that operator, called the data cube or simply cube. The cube operator generalizes the histogram, cross-tabulation, roll-up, drill-down, and sub-total constructs found in most report writers. The novelty is that cubes are relations. Consequently, the cube operator can be embedded in more complex non-procedural data analysis programs. The cube operator treats each of the N aggregation attributes as a dimension of N-space. The aggregate of a particular set of attribute values is a point in this space. The set of points forms an N-dimensional cube. Super-aggregates are computed by aggregating the N-cube to lower dimensional spaces. This paper (1) explains the cube and roll-up operators, (2) shows how they fit in SQL, (3) explains how users can define new aggregate functions for cubes, and (4) discusses efficient techniques to compute the cube. Many of these features are being added to the SQL Standard.

1. Introduction

Data analysis applications look for unusual patterns in data. They categorize data values and trends, extract statistical information, and then contrast one category with another. There are four steps to such data analysis:
• formulating a query that extracts relevant data from a large database,
• extracting the aggregated data from the database into a file or table,
• visualizing the results in a graphical way, and
• analyzing the results and formulating a new query.

Visualization tools display data trends, clusters, and differences. Some of the most exciting work in visualization focuses on presenting new graphical metaphors that allow people to discover data trends and anomalies. Many of these visualization and data analysis tools represent the dataset as an N-dimensional space. Visualization tools render two and three-dimensional sub-slabs of this space as 2D or 3D objects.

Color and time (motion) add two more dimensions to the display, giving the potential for a 5D display. A spreadsheet application such as Excel is an example of a data visualization/analysis tool that is used widely. Data analysis tools often try to identify a subspace of the N-dimensional space which is "interesting" (e.g., discriminating attributes of the data set).

[Figure 1 graphic: an Extract, Visualize, Analyze & Formulate loop around a table and a spreadsheet, with two example plots, "Size vs Speed" (Size (B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds)).]
Figure 1: Data analysis tools facilitate the Extract-Visualize-Analyze loop. The cube and roll-up operators along with system and user-defined aggregates are part of the extraction process.

Thus, visualization as well as data analysis tools do "dimensionality reduction", often by summarizing data along the dimensions that are left out. For example, in trying to analyze car sales, we might focus on the role of model, year, and color of the cars sold. Thus, we ignore the differences between two sales along the dimensions of date of sale or dealership but analyze the total sales of cars by model, by year, and by color only. Along with summarization and dimensionality reduction, data analysis applications use constructs such as histogram, cross-tabulation, subtotals, roll-up and drill-down extensively.

This paper examines how a relational engine can support efficient extraction of information from a SQL database that matches the above requirements of visualization and data analysis. We begin by discussing the relevant features in Standard SQL and some of the vendor-specific SQL extensions. Section 2 discusses why GROUP BY fails to adequately address the requirements. The CUBE and ROLLUP operators are introduced in Section 3, and we also discuss how these operators overcome some of the shortcomings of GROUP BY. Sections 4 and 5 discuss how we can address and compute the cube.

1.1. Relational and SQL Data Extraction

How do traditional relational databases fit into this multi-dimensional data analysis picture? How can 2D flat files (SQL tables) model an N-dimensional problem? Furthermore, how do relational systems support operations over N-dimensional representations that are central to visualization and data analysis programs? We address each of these two issues in this section. The answer to the first question is that relational systems model N-dimensional data as a relation with N-attribute domains. For example, 4-dimensional (4D) earth temperature data is typically represented by a Weather table (Table 1). The first four columns represent the four dimensions: latitude, longitude, altitude, and time. Additional columns represent measurements at the 4D points such as temperature, pressure, humidity, and wind velocity. Each individual weather measurement is recorded as a new row of this table. Often these measured values are aggregates over time (the hour) or space (a measurement area centered on the point).

Table 1: Weather
Time (UCT)    Latitude   Longitude   Altitude (m)  Temp (c)  Pres (mb)
96/6/1:1500   37:58:33N  122:45:28W  102           21        1009
        many more rows like the ones above and below
96/6/7:1500   34:16:18N  27:05:55W   10            23        1024

As mentioned in the introduction, visualization and data analysis tools extensively use dimensionality reduction (aggregation) for better comprehensibility. Often data along the other dimensions that are not included in a "2-D" representation are summarized via aggregation in the form of histogram, cross-tabulation, subtotals, etc. In standard SQL, we depend on aggregate functions and the GROUP BY operator to support aggregation.

The SQL standard [SQL], [Melton, Simon] provides five functions to aggregate the values in a table: COUNT(), SUM(), MIN(), MAX(), and AVG(). For example, the average of all measured temperatures is expressed as:

    SELECT AVG(Temp)
    FROM Weather;

In addition, SQL allows aggregation over distinct values. The following query counts the distinct number of reporting times in the Weather table:

    SELECT COUNT(DISTINCT Time)
    FROM Weather;

Aggregate functions return a single value. Using the GROUP BY construct, SQL can also create a table of many aggregate values indexed by a set of attributes. For example, the following query reports the average temperature for each reporting time and altitude:

    SELECT Time, Altitude, AVG(Temp)
    FROM Weather
    GROUP BY Time, Altitude;

[Figure 2 graphic; labels: Table, Grouping Values, Partitioned Table, Sum(), Aggregate Values.]
Figure 2: The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some column of groups, returning a value for each group.

GROUP BY is an unusual relational operator: it partitions the relation into disjoint tuple sets and then aggregates over each set, as illustrated in Figure 2.

SQL's aggregation functions are widely used in database applications. This popularity is reflected in the large number of aggregate queries in the decision-support benchmark TPC-D [TPC]. The TPC-D query set has one 6D GROUP BY and three 3D GROUP BYs. One and two dimensional GROUP BYs are the most common. Surprisingly, aggregates appear in the TPC online-transaction processing benchmarks as well (TPC-A, B and C). Table 2 shows how frequently the database and transaction processing benchmarks use aggregation and GROUP BY, giving the number of queries, aggregates, and GROUP BYs in each benchmark. A detailed description of these benchmarks is beyond the scope of the paper (see [Gray] and [TPC]).

• Ratio_To_Total(expression): Sums all the expressions. Then for each instance, divides the expression instance by the total sum.
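As a rough illustration of the Ratio_To_Total definition above, the following Python sketch emulates it in SQLite with a scalar subquery. SQLite has no Ratio_To_Total function, so this is an emulation under stated assumptions and not any vendor's implementation: the helper name ratio_to_total, the reduced Weather schema, and the third sample row are hypothetical and exist only for this example.

```python
import sqlite3


def ratio_to_total(conn, table, column):
    """Emulate Ratio_To_Total(column): each value divided by the sum of all values.

    Hypothetical helper for illustration only. The table and column names are
    interpolated into the SQL text, so pass trusted identifiers.
    """
    sql = (
        f"SELECT {column}, "
        f"CAST({column} AS REAL) / (SELECT SUM({column}) FROM {table}) "
        f"FROM {table} ORDER BY rowid"
    )
    return conn.execute(sql).fetchall()


# Assumed, cut-down Weather table: two rows echo Table 1, the third is invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Weather (Time TEXT, Altitude INTEGER, Temp INTEGER)")
conn.executemany(
    "INSERT INTO Weather VALUES (?, ?, ?)",
    [
        ("96/6/1:1500", 102, 21),
        ("96/6/7:1500", 10, 23),
        ("96/6/7:1600", 10, 6),  # invented sample row
    ],
)

ratios = ratio_to_total(conn, "Weather", "Temp")
for temp, ratio in ratios:
    # Each line shows a temperature and its share of the total of all temperatures.
    print(temp, ratio)
```

Because every value is divided by the same total, the ratios always sum to 1 (up to floating-point rounding), which is a quick sanity check for any emulation of this function.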
Table 2: SQL Aggregates in Standard Benchmarks.
Benchmark   Queries  Aggregates  GROUP BYs
TPC-A, B    1        0           0
TPC-C       18       4           0
TPC-D       16       27          15
Wisconsin   18       3           2
AS3AP       23       20          2
SetQuery    7        5           1

To give an example, the following SQL statement

    SELECT Percentile, MIN(Temp), MAX(Temp)
    FROM Weather
    GROUP BY N_tile(Temp,10) as Percentile
    HAVING Percentile = 5;

returns one row giving the minimum and maximum tem-
1.2.