Materialized View Selection in a Multidimensional Database
Elena Baralis, Politecnico di Torino, baralis@polito.it
Stefano Paraboschi, Politecnico di Milano, parabosc@elet.polimi.it
Ernest Teniente, Universitat Politecnica de Catalunya, teniente@lsi.upc.es

Abstract

A multidimensional database is a data repository that supports the efficient execution of complex business decision queries. Query response can be significantly improved by storing an appropriate set of materialized views. These views are selected from the multidimensional lattice whose elements represent the solution space of the problem.

Several techniques have been proposed in the past to perform the selection of materialized views for databases with a reduced number of dimensions. When the number and complexity of dimensions increase, the proposed techniques do not scale well.

The technique we are proposing reduces the solution space by considering only the relevant elements of the multidimensional lattice. An additional statistical analysis allows a further reduction of the solution space.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 23rd VLDB Conference, Athens, Greece, 1997

1 Introduction

A multidimensional database (MDDB) is a data repository that provides an integrated environment for decision support queries that require complex aggregations on huge amounts of historical data. An MDDB is a relational data warehouse, in which the information is organized following the so-called star-model [Kim96]. Its basic structure may be represented with the simple entity-relationship diagram depicted in Figure 1, in which all the $D_i$ entities represent the dimensions of the MDDB, while the connecting relationship $F$ is the fact table.

Figure 1: Entity-Relationship representation of an MDDB

Each dimension table $D_i$ contains all the information that is specific only to the dimension itself, while the fact table $F$ correlates all dimensions and contains information on the attributes of interest for the intersection of all the dimensions. A new operator, the data-cube operator [GBLP96], has been proposed to perform the computation, on a single relation (the fact table), of one or more aggregate functions for all possible combinations of grouping attributes (which are the elements of the data-cube).

Since the computation of any of the elements of the cube is rather time-consuming, it may be precomputed to guarantee a satisfactory query response time to the user. On the other hand, the materialization of the complete cube may be unfeasible, both because of its size and because of the time required to update it when the fact table is updated. Hence, several techniques [Gup97, GHRU97, HRU96] have been proposed to select an appropriate subset of elements (which are indeed views on the fact table) to materialize.

The proposed algorithms work very well for medium-size databases, but do not seem to scale well for the increased complexity of actual operational MDDBs. Indeed, as shown in the practical example of Section 1.1, in addition to the fact table, operational MDDBs may have several dimensions, each of which is characterized by a considerable number of attributes, most of which may be relevant for grouping computation as well. Thus, the presence of dimensions exponentially increases the number of elements in the cube.

If a set of user-specified relevant queries is available, exploiting this information may yield a significant reduction of the solution space. We observe that the number of representative queries is extremely small with respect to the total number of elements of the complete data-cube. Then, the indication of the relevant queries is exploited to drive the selection of the candidate views, i.e., the views that, if materialized, may yield a reduction of the total cost. The number of candidate views may be further reduced by means of a heuristic based on the estimation of the size of the candidate views. Candidate views are discarded when their aggregation granularity is too big, because their materialization would not yield a substantial improvement in query response time with respect to using a higher level view.

The following section presents a practical example of an MDDB, while Section 1.2 discusses related work. In Section 2 a formal model of a multidimensional database is given, and its relation with the data cube model is discussed. Section 3 formally introduces the problem. Section 4 describes the technique to select views and an algorithm to perform the selection. Furthermore, the statistical technique to improve the selection's efficiency is presented, together with experimental results. Section 5 draws conclusions.

1.1 A Practical Example

Consider as a practical example, taken from [Kim96], the MDDB for a large grocery store chain, characterized by a large number of stores, each of which is a supermarket selling a wide variety of different products (e.g., grocery, frozen foods, bakery). The MDDB stores information on each sale in each store by day, also considering the promotions under which each product is sold. We can identify the following dimensions:

• Product, which can be characterized by more than 50 different attributes.
• Store, which characterizes each point of sale. The store dimension may contain more than 20 attributes.
• Time, which provides the appropriate detail to allow accurate analysis of the MDDB data. The time dimension may have more than 15 attributes.
• Promotion, which describes the characteristics of product promotions. Overall the promotion dimension is characterized by at least 10 attributes.

The fact table provides the sales information on which the actual financial analysis is performed. It includes the identifiers of all the dimensions and several attributes describing sales revenues (e.g., in terms of number of units sold). In this paper we consider a subset of the attributes of each dimension as relevant attributes for grouping computations: we assume 15 attributes for dimensions Product and Store, 9 attributes for Time, and 11 attributes for Promotion.

1.2 Related Work

Multidimensional data processing for relational data warehouses has raised considerable interest both in the scientific community [GBLP96, Gup97, GHRU97, HRU96, Wid95, ZGHW95] and in the industrial community, where several products have appeared.

The algorithms presented in this paper are closest to the work in [Gup97, GHRU97, HRU96]. In particular, [HRU96] considers an MDDB including only the fact table and proposes a greedy algorithm for the selection of an appropriate subset of the views of the complete data-cube to materialize. Work in [GHRU97] extends the previous results to the selection of both materialized views and indexes. Neither work considers the cost of maintaining the materialized views in the model. A more general query and update model is proposed in [Gup97], where a theoretical framework for the view-selection problem is presented. In this context, a general algorithm and several heuristics are proposed. A detailed comparison of our work with [Gup97, GHRU97, HRU96] is performed in the relevant sections of the paper.

[RSS96] first gave a formal description of the multiple view maintenance problem. They present a framework for improving query performance by storing an additional set of materialized views and consider several heuristics for optimization. The cost model they propose, which includes both query and maintenance costs, uses an estimate of the number of disk accesses, by making hypotheses on the physical design of the database, while we selected a more abstract metric.

2 Multidimensional Database Model

Definition 2.1 A Multidimensional Database is a collection of relations $D_1, \ldots, D_n, F$, where

• Each $D_i$ is a dimension table, i.e., a relation characterized by an identifier $d_i$ that uniquely identifies each tuple ($d_i$ is the primary key of $D_i$).
• $F$ is a fact table, i.e., a relation connecting all tables $D_1, \ldots, D_n$; the identifier of $F$ is given by the foreign keys $d_1, \ldots, d_n$ of all the dimension tables it connects; the schema of $F$ contains a set of additional attributes $V$ (representing the values on which the aggregate functions are applied).

The dimension tables may contain hierarchies.

Definition 2.2 Let $D$ be a dimension table with identifier $d$. An attribute hierarchy on $D$ is a set of functional dependencies $FD_D = \{fd_0, fd_1, \ldots, fd_n\}$, where each $fd_i$ is characterized by two sets of attributes $A_i \subseteq Attr(D)$ and $A'_i \subseteq Attr(D)$ (respectively called left side and right side of the dependency); the dependency is represented as $fd_i : A_i \rightarrow A'_i$.

Each functional dependency $fd_i$ is a constraint on the content of the dimension table $D$: for each tuple

Example 2.6 Consider again the multidimensional database of our example, MDDB = {Product, Store, Time, Promotion, F}, where each dimension table has its corresponding attributes and with fact table $F = \{p, s, d, r, f\}$, having $p$, $s$, $d$ (the time dimension is represented with the granularity of day) and $r$ as foreign keys of the dimension tables and $f$ representing the amount of sales. The following queries can be requested on the MDDB:

• $q_1$ = total sales per product
• $q_2$ = total sales per product and store
• $q_3$ = total sales per product and day
• $q_4$ = total sales per product, store and day

The queries we consider are select-join-groupby queries, with some restrictions on the allowed selection and join predicates.
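
As an illustration only, the four group-by queries of the grocery example can be sketched in Python on a toy fact table. The rows, the helper names `total_sales` and `roll_up`, and the choice of SUM as the aggregate are assumptions made for this sketch; they are not taken from the paper. The sketch also shows why a coarser query may be answered from a finer materialized view instead of the fact table, which is the intuition behind selecting views to materialize.

```python
from collections import defaultdict

# Toy fact table F = {p, s, d, r, f}. All rows are made-up illustrative data.
FACT = [
    {"p": "milk", "s": "store1", "d": "1997-01-01", "r": "none", "f": 10},
    {"p": "milk", "s": "store1", "d": "1997-01-02", "r": "promo1", "f": 5},
    {"p": "milk", "s": "store2", "d": "1997-01-01", "r": "none", "f": 7},
    {"p": "bread", "s": "store1", "d": "1997-01-01", "r": "none", "f": 3},
]


def total_sales(fact, group_by):
    """Sum the measure f over the fact rows, grouped by the given attributes."""
    totals = defaultdict(int)
    for row in fact:
        key = tuple(row[attr] for attr in group_by)
        totals[key] += row["f"]
    return dict(totals)


def roll_up(view, keep):
    """Re-aggregate an already grouped result, keeping only the key positions in `keep`.

    This works here because SUM is distributive (assumption of this sketch).
    """
    totals = defaultdict(int)
    for key, value in view.items():
        totals[tuple(key[i] for i in keep)] += value
    return dict(totals)


q1 = total_sales(FACT, ["p"])            # total sales per product
q2 = total_sales(FACT, ["p", "s"])       # per product and store
q3 = total_sales(FACT, ["p", "d"])       # per product and day
q4 = total_sales(FACT, ["p", "s", "d"])  # per product, store and day

print(q1)  # {('milk',): 22, ('bread',): 3}
# q1 can also be derived from the finer-grained result q4 without rereading FACT.
print(roll_up(q4, [0]) == q1)  # True
```

In this toy setting, materializing only `q4` would be enough to answer `q1`, `q2` and `q3` by re-aggregation; whether that is the cheapest choice depends on view sizes and update costs, which the sketch does not model.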