Materialized Views and Data Warehouses

Nick Roussopoulos
Department of Computer Science and Institute of Advanced Computer Studies
University of Maryland
[email protected]

Abstract

A data warehouse is a redundant collection of data replicated from several possibly distributed and loosely coupled source databases, organized to answer OLAP queries. Relational views are used both as a specification technique and as an execution plan for the derivation of the warehouse data. In this position paper, we summarize the versatility of relational views and their potential.

The copyright of this paper belongs to the paper's authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.
Proceedings of the 4th KRDB Workshop, Athens, Greece, 30-August-1997 (F. Baader, M.A. Jeusfeld, W. Nutt, eds.)
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-8/

1 Views

The importance of the "algebraic closedness" of the relational model has not been recognized enough in its 27 years of existence. Although a lot of energy has been consumed on dogmatizing on the "relational purity", on its interface simplicity, on its mathematical foundation, etc., there has not been a single paper with a central focus on the importance of relational views, their versatility, and their yet-to-be exploited potential.

What is a relational view? Is it a program? Is it data? Is it an index? Is it an OLAP aggregate? It is all these. And a lot more. Below I summarize the most important uses, techniques, and benefits pertaining to views. Note that the cited work here is not meant to be exhaustive but representative and easily accessible from my short-term memory.

2 The Multifaceted Form of Views

Relational views have several forms:

pure program: an unmaterialized view is a program specification, "the intension", that generates data. Query modification [Sto75] and compiled queries [ABC+76] were the first techniques exploiting views; their basic difference is that the first is used as a macro that does not optimize until run-time while the second stores optimized execution plans. Such a view form is a pure program with no extensional attachments. Each time the view program is invoked, it generates (materializes) the data at a cost that is roughly the same for each invocation.

derived data: a materialized view is "the extension" of the pure program form and has the characteristics of data like any other relational data. Thus, it can be further queried to build views-on-views or collectively grouped [Pap94] to build super-views. The derivation operations are attached to materialized views. These procedural attachments along with some "delta" relational algebra are used to perform incremental updates on the extension.

pure data: when materialized views are converted to snapshots, the derivation procedure is detached and the views become pure data that is not maintainable; pure data is at the opposite end of the spectrum from pure program.

pure index: view indexes [Rou82b] and ViewCaches [Rou91] illustrate this flavor of views. Their extension has only pointers to the underlying data which are dereferenced when the values are needed. Like all indexing schemes, the importance of indexes lies in their organization, which facilitates easy manipulation of pointers and efficient single-pass dereferencing, and thus avoids thrashing.

hybrid data & index: a partially materialized view [BR96] stores some attributes as data while the rest are referenced through pointers. This form combines data and indexes. B-trees, Join indexes [Val87], star-indexes [Sys96] and most of the other indexing schemes belong to this category, with appropriate schema mapping for translating pointers to record field values. Note that in this form, the data values are drawn directly from the underlying relations and no transformation to these values is required [footnote 1].

OLAP aggregate/indexing: a data cube [GBLP96] is a set of materialized or indexed views [GHRU96, RKR97]. They correspond to projections of the multi-dimensional space data to lesser dimensionality subspaces and store aggregate values in it. In this form, the data values are aggregated from a collection of underlying relation values. Summary tables and Star Schemas [Sys96] belong in this form; the latter belongs here as much as in the previous category.

Each of these forms is used by some component of a relational system. Having a unified view of all forms of relational views is important in recognizing commonalities, re-using implementation techniques, and discovering potential uses not yet exploited.

Footnote 1: This is how the indexed form is almost exclusively used, although there is no intrinsic reason for not applying a transformation function, other than the identity one, to the underlying values before indexing them, e.g., calibrating the values before they are entered in a B-tree.

3 Discovery and Re-use of Views

RDBMSs do nothing else but generate or access materialized views 24 hours a day, whether these are predefined views, results of compiled queries, ad hoc queries, or even materialized view fragments [RCK+95], i.e., temporary results generated during the execution of a larger query. Unfortunately, commercial RDBMSs discard these views immediately after they are delivered to the user or to a subsequent execution phase. The cost for generating the views is for one-time-use only instead of being amortized over multiple and/or shared accesses [Rou91].

Caching query intermediate results for speeding up intra- and inter-query processing has been studied widely [Fin82, LY85, Rou91, Sel87, Jhi88, DR92, AL80, Rou82b, Rou91, Sel88, Jhi88, RK86b, BALT86, Fin82, LY85, DR92, HS93, RK86b, Han87a, Han87b, JMRS93]. The goals of these studies range from improving query optimization and processing to supporting rules in active databases, to query processing in client-server and distributed/replicated database architectures, to handling time queries, to obtaining efficient update dissemination, to avoiding expensive computations of external predicates, etc. All these techniques have one common underlying theme: the re-use of views to save cost.

Amortization and re-use of views can only be possible if they can be discovered [footnote 2] by the query optimizer, which decides to plug in those views which reduce the cost of the query. The benefits are multiplied in a multi-user environment with a lot of shared access to views. Despite this, only the ADMS prototype has extended the query optimizer and its cost model [CR94b] to include in its plan selection materialized views, ViewCaches, and incremental access methods, as well as a tailored buffer manager designed to support these access methods [CR93]. However, both IBM and Microsoft plan to incorporate similar constructs in future releases of DB2 and SQL Server.

The most common technique for discovering views in any of their forms is subsumption [Rou82a, Fin82, LY85, Rou91, BJNS94]. Subsumption in its most general form is an undecidable problem, but for the most common queries it can be reduced to an NP-complete problem. For simple conjunctive query views, it further reduces to polynomial-time and very efficient algorithms [Rou82a, CR94b].

In a data warehouse where query execution and I/O are magnified, the mandate for re-use cannot be ignored. Furthermore, in an OLAP environment, unlike OLTP, updates come in bulk rather than a few at a time, making incremental update techniques more effectively amortized [RKR97]. Therefore, query optimizers based on materialized view fragments are a necessity. At this point, data warehouses rely solely on users' memory for re-using precomputed summary tables. This severely limits their performance potential.

Footnote 2: The user cannot be aware of views generated by the system and other users.

4 Processing of Views

Now let's examine view processing for all the view forms except for the pure data snapshots, which are not maintainable. View processing involves view scanning, incremental update, or both applied simultaneously. Scanning and incremental update of views imply special locks, locking protocols [RES93], authorization [RB85], and consistency protocols for asynchronous updates from multiple sources [ZGMHW95]. I will concentrate here on performance issues.

View scanning in the pure program view form is typically the same as re-execution of the query that created the view. There is no performance benefit for unmaterialized views other than predicting re-execution cost more accurately after the first time. The performance is bad but predictable. Scanning a materialized view has a cost that depends on the ratio of the useful tuples in it to answer a given query, called density of

[...]dating is not a factor any more. Therefore, minimal dereferencing is a good target optimization. Partially materialized views [BR96] which materialize only the subset of the attributes useful for the incremental update, outer-joins instead of joins, or other appropriate attribute caching techniques [Sta89] are best suited. On the other hand, fully materialized views are cumbersome and generate a lot of unnecessary I/O and data movement for just updating views that are to be used in the future.
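To make the incremental update of a materialized ("derived data") view more concrete, the following is a minimal sketch in Python, not the ADMS implementation or any technique taken from the cited papers. It assumes set semantics, insert-only changes, and a view defined as the join of two hypothetical relations r(a, b) and s(b, c). The names join, MaterializedJoinView, and apply_inserts are invented for illustration. The update rule used is the standard delta rule for joins under insertion: the new view tuples are (delta_r join s), (r join delta_s), and (delta_r join delta_s).

```python
def join(r, s):
    """Join r(a, b) with s(b, c) on b; returns a set of (a, b, c) tuples."""
    return {(a, b, c) for (a, b) in r for (b2, c) in s if b == b2}


class MaterializedJoinView:
    """Hypothetical materialized view V = r JOIN s that keeps its
    derivation attached so the extension can be updated incrementally."""

    def __init__(self, r, s):
        self.r = set(r)
        self.s = set(s)
        # The extension is computed in full only once.
        self.extension = join(self.r, self.s)

    def apply_inserts(self, delta_r=(), delta_s=()):
        """Apply a batch of insertions and return the delta of the view.

        Assumption: insertions only. Deletions would need extra
        bookkeeping (for example, derivation counts) not shown here.
        """
        # Keep only tuples that are genuinely new to the base relations.
        delta_r = set(delta_r) - self.r
        delta_s = set(delta_s) - self.s
        # Delta rule for a join under insertion.
        delta_v = (
            join(delta_r, self.s)
            | join(self.r, delta_s)
            | join(delta_r, delta_s)
        )
        self.r |= delta_r
        self.s |= delta_s
        self.extension |= delta_v
        return delta_v


view = MaterializedJoinView({(1, "x")}, {("x", 10)})
delta = view.apply_inserts(delta_r={(2, "x")}, delta_s={("x", 20)})
# The incrementally maintained extension matches a full recomputation.
assert view.extension == join(view.r, view.s)
```

Only the delta tuples are joined against the base relations, so a bulk load, as is typical in the OLAP setting discussed above, is paid for once per batch rather than by recomputing the whole view.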
