Data Cube: a Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals
Total Page:16
File Type:pdf, Size:1020Kb
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray Microsoft [email protected] Adam Bosworth Microsoft [email protected] Andrew Layman Microsoft AndrewL@MicrosotLcom Hamid Pirahesh IBM [email protected] Abstract: Data analysis applications typ&ally aggregate points such as temperature, pressure, humidity, and wind data across many dimensions looking for unusual patterns. velocity. Often these measured values are aggregates over The SQL aggregate functions and the GROUP BY operator time (the hour) or space (a measurement area). produce zero-dimensional or one-dimensional answers. Applications need the N-dimensional generalization of Table l:Weather these operators. This paper defines that operator, called Time (UCT) Latitude Longitude %ititude Tem; Pres the data cube or simply cube. The cube operator general- (m) (c) (mb) 27/ii/94:150( 37:58:33N 122:45:28W 102 21 1009 izes the histogram, cross-tabulation, roll-up, drill-down, 27/ii/94:150( 34:16:18N 27:05:55W" I0 23 11024 and sub-total constructs found in most report writers. The cube treats each of the N aggregation attributes as a di- The SQL standard provides five functions to aggregate the mension of N-space. The aggregate of a particular set of values in a table: COUNT ( ), SUM ( ), MIN ( ), ~ ( ), attribute values is a point in this space. The set of points and AVG(). For example, the average of all measured forms an N-dimensional cube. Super-aggregates are com- temperatures is expressed as: puted by aggregating the N-cube to lower dimensional SELECT AVG (Temp) spaces. Aggregation points are represented by an "infinite FROM Weather; value", ALL, so the point (ALL,ALL,...,ALL, sum(*)) repre- sents the global sum of all items. Each ALL value actually In addition, SQL allows aggregation over distinct values. represents the set of values contributing to that aggrega- The following query counts the distinct number of report- tion. ing times in the Weather table: SELECT COUNT (DISTINCT Time) I. Introduction FROM Weather; Many SQL systems add statistical functions (median, stan- Data analysis applications look for unusual patterns in data. dard deviation, variance, etc.), physical functions (center They summarize data values, extract statistical information, of mass, angular momentum, etc.), financial analysis and then contrast one category with another. There are two (volatility, Alpha, Beta, etc.), and other domain-specific steps to such data analysis: functions, extracting the aggregated data from the database into a file or table, and Some systems allow users to add new aggregation func- visualizing the results in a graphical way. tions. The Illustra system, for example, allows users to Visualization tools display trends, clusters, and differences. add aggregate functions by adding a program with the The most exciting work in data analysis focuses on pre- following three callbacks to the database system [lilustra]: senting new graphical metaphors that allow people to dis- Init (&handle) : Allocates the handle and initializes cover data trends and anomalies. Many tools represent the the aggregate computation. dataset as an N-dimensional space. Two and three- Iter (&handle, value) : Aggregates the next value dimensional sub-slabs of this space are rendered as 2D or into the current aggregate. 3D objects. Color and time (motion) add two more dimen- value, = Final (&handle) : Computes and returns the sions to the display giving the potential for a 5D display. resulting aggregate by using data saved in the handle. This invocation deallocates the handle. How do traditional relational databases fit into this picture? How can flat files (SQL tables) possibly model an N- Consider implementing the Average() function. The dimensional problem? Relational systems model N- handle stores the count and the sum initialized to zero. dimensional data as a relation with N-attribute domains. When passed a new non-null value, Iter ( ) increments the For example, 4-dimensional earth-temperature data is typi- cally represented by a Weather table shown below. The count and adds the sum to the value. The Final () call first four columns represent the four dimensions: x, y, z, t. dealiocates the handle and returns sum divided by count. Additional columns represent measurements at the 4D 152 1063-6382/96 $5.00 © 1996 IEEE Aggregate functions return a single value. Using the Red Brick also offers three cumulative aggregates that GROUP lay construct, SQL can also create a table of many operate on ordered tables. aggregate values indexed by a set of attributes. For exam- Oamulative (expression) : Sums all values so far in ple, The following query reports the average temperature an ordered list. for each reporting time and altitude: Running_Sum (expression, n) : Sums the most recent n SELECT Time, Altitude, gVG(Temp) values in an ordered list. The initial n-1 values are FROM Weather NULL. GROUP BY Time, Altitude; Runnlng_Average (expression, n) : Averages the GROUP BY is an unusual relational operator: It partitions most recent n values in an ordered list. The initial n-1 the relation into disjoint tuple sets and then aggregates over values are mv.~L. each set as illustrated in Figure 1. These aggregate functions are optionally reset each time a grouping value changes in an ordered selection. Aggregate Values I--] 2. Problems With GROUP BY" I--1 SQL's aggregation functions are widely used. In the spirit of aggregating data, Table 2 shows how frequently the database and transaction processing benchmarks use ag- gregation and GROUP Be. Surprisingly, aggregates also appear in the online-transaction processing TPC-C query Figure !: The GROUP BY relational operator partitions a set. Paradoxically, the TPC-A and TPC-B benchmark table into groups. Each group is then aggregated by a transactions spend most of their energies maintaining ag- function. The aggregation function summarizes some col- gregates dynamically: they maintain the summary bank umn of groups returning a value for each group. account balance, teller cash-drawer balance, and branch balance. All these can be computed as aggregates from the history table [TPC]. Red Brick systems added some interesting aggregate func- tions that enhance the GROUP BY mechanism [Red Brick]: Table 2: SQL Aggregates in Standard Benchmarks Rank (expression): returns the expression's rank in the Benchmark Queries Aggregates GROUPBYs set of all values of this domain of the table. If there are TPC-A, B 1 0 0 TPC-C 18 N values in the column, and this is the highest value, 4 0 TPC-D 16 27 15 the rank is N, if it is the lowest value the rank is 1. Wisconsin ! 8 3 2 N_tile(expression, n): The range of the expression AS3Ap 23 20 2 (over all the input values of the table) is computed and divided into n value ranges of approximately equal SetQuery 7 5 I population. The function returns the number of the range holding the value of the expression. If your The TPC-D query set has one 6D GROUP BY and three 3D bank account was among the largest 10% then your GROUP BYS. ID and 2D GROUP BYS are most common. rank (account.balance, 10) would return 10. Red Brick providesjust N_tile (expression, 3). Certain forms of data analysis are difficult if not impossi- Ratio To Total (expression): Sums all the expres- ble with the SQL constructs. As explained next, three sions and then divides the expression by the total sum. common problems are: (1) Histograms, (2) Roll-up Totals and Sub-Totals for drill-downs, (3) Cross Tabulations. To give an example: SELECT Percentile, MIN (Temp) ,MAX (Temp) The standard SQL GROUP BY operator does not allow a FROM Weather direct construction of histograms (aggregation over com- GROUP BY N_tile (Temp, 10) as Percentile puted categories.) For example, for queries based on the HAVING Percentile = 5; Weather table, it would be nice to be able to group times returns one row giving the minimum and maximum tem- into days, weeks, or months, and to group locations into peratures of the middle 10% of all temperatures. As men- areas (e.g., US, Canada, Europe,...). This would be easy if tioned later, allowing function values in the GROUP BY is function values were allowed in the GROUP BY list. If that not yet allowed by the SQL standard. 153 we~ allowed, the following query would give the daily Table 4: Sales Summary maximum reportedtemperamre. Model Year Color Units SELECT day, nation, MAX(Temp) Chevy 1994 black 50 FROM Weather GROUP BY Day(Time) AS day, Chevy 1994 white 40 Country(Latitude,Longitude) Chevy 1994 ALL 90 AS nation; Chevy 1995 black 85 Chevy 1995 white I 15 Some SQL systems support histograms but the standard Chevy 1995 ALL 200 does not. Rather, one must construct a table-valued expres- Chevy ALL ALL 290 sion and then aggregate over the resulting table. The fol- lowing statement demonstrates this SQL92 construct. The SQL stmement to build ~is SalesSummary table SELECT day, nation, MAX(Temp) from the raw Sales data is: FROM ( SELECT Day(Time) AS day, Country (Latitude, Longitude) SELECT Model, ALL, ALL, SUM(Sales) AS nation, FROM Sales Temp WHERE Model = 'Chevy' FROM Weather ) AS foo GROUP BY Model GROUP BY day, nation; UNION SELECT Model, Year, ALL, SUM(Sales) A second problem relates to roll-ups using totals and sub- FROM Sales totals for drill-down reports. Reports commonly aggregate WHERE Model = 'Chevy' GROUP BY Model, Year data at a coarse level, and then at successively finer levels. UNION The car sales report in Table 3 shows the idea. Data is ag- SELECT Model, Year, Color, SUM(Sales) gregated by Model, then by Year, then by Color. The re- FROM Sales port shows data aggregated at three levels. Going up the WHERE Model = 'Chevy' levels is called rolling-up the data.