Real Life Data Mart - Models Comparison

Mario Milicevic, Vedran Batos Polytechnic of Dubrovnik Cira Carica 4, 20000 Dubrovnik CROATIA

Vedran Mornar Faculty of Electrical Engineering and Computing, University of Zagreb Unska 3, 10000 Zagreb CROATIA

Abstract: During the implementation of a data warehouse, data is derived from a normalized OLTP database and then translated into a denormalized dimensional model. The basic parameters for evaluating this process are ETL process speed, disk space consumption, query performance and user friendliness. Any of these parameters can be favored by choosing an appropriate model.

Keywords: Data warehouse, Data mart, Dimensional modeling, Star schema

1 Introduction

The analyzed data is based on three years of data from an OLTP database owned by a tourist agency, more precisely from the part of the database dedicated to the preparation and handling of excursions. To differentiate the models better and to predict future behaviour better, the quantity of data was enlarged by a significant factor. The original, characteristic distribution of the data was preserved during this process.
The total quantity of data represents more than 7 000 000 transactions, i.e. more than 36 000 000 passengers, over the three-year interval.
On the basis of the transactional data a Data Mart (DM) was built, representing the prototype of the future data warehousing (DWH) solution.

2 Initial transactional database

In the analyzed case the OLTP database is the only source of data for the DM, which has become quite a usual situation during the last few years. Text files or spreadsheets are taken into consideration only for a minimal amount of input data.
As a consequence the ETL procedures are simplified, but their efficiency still requires attention.
The transactional database follows the relational model consistently: all relations are normalized (3NF), referential and domain integrity are enforced where necessary, etc. More than 100 relations (database tables) are used for storing data for the preparation, booking and sale of excursions.

3 Dimensional model

The analyzed DM is built around the most interesting event in the system, the sale transaction. However, once the real-life aspects of this case are taken into consideration, it is clear that the data gathered during the sale transaction is not always complete.
On the basis of users' requirements a transaction is identified by a ticket. A specific detail is that one ticket may be issued for several passengers. The consequence is a faster sale process, which is very important because tickets are often sold just before the excursion, even at the moment of embarkation into the vehicle.
All passengers on the same ticket share the common attributes (attributes A):

o ticket S/N
o excursion (DIM)
o excursion date (DIM)
o exc. time (embarkation) (DIM)
o embarkation location (DIM)
o language (guide) (DIM)
o reseller (DIM)
o booking date (DIM)
o mode of payment (DIM)
o number of adults
o number of children (3-12)
o number of infants (0-2)
o amount (EUR)

Attributes marked (DIM) are foreign keys pointing to the primary key of the corresponding dimension table.
As one can notice, attributes like name, gender, age etc. are not registered. This is acceptable because excursions usually last one day and the corresponding amounts of money are low.
However, there is one important exception: according to regulations set by customs authorities, additional data must be registered for excursions abroad:

o first name
o family name
o date of birth
o gender
o passport number
o country of citizenship

so additional attributes can be created (attributes B):

o age
o age group (ADL/CHD/INF)
o gender
o country of citizenship (DIM)
o native language (DIM)
o excursion price

Although additional attributes exist for only 10% of the total number of passengers, this data is invaluable for different demographic analyses.

3.1 DM models

Three different models are analysed:

A) extended fact table
B) fact table with incomplete data
C) two independent fact tables

All models include the same data, but the same output reports are produced with different SQL queries. Significant differences can also be expected in space consumption and query duration.

A) extended fact table

This model has been present in the literature for a while, but it is used relatively rarely. A possible reason is that the level of normalization is somewhat higher (two related fact tables with different grain), so some queries must use an additional join operation.
In this example the two fact tables share the same trans_ID, a surrogate key that represents the unique sale transaction number (ticket).
With one transaction (ticket) it is possible to register several passengers (up to 100), so this fact table is not built at the lowest possible level of granularity.

    Dimension tables: EXCURSION_DIM (PK exc_dep_key), TIME_DIM (PK time_key), LOCATION_DIM (PK location_key), RESELLER_DIM (PK reseller_key), PAYMENT_DIM (PK paym_key), DATE_DIM (PK date_key), LANGUAGE_DIM (PK lang_key), COUNTRY_DIM (PK country_key)
    TICKET_FACT: PK trans_ID; exc_dep_key, exc_date_key, time_key, location_key, book_date_key, paym_key, reseller_key, exc_lang_key, attributes A
    PAX_FACT: PK trans_ID, pax_num; nat_lang_key, country_key, attributes B

Fig. 1. Data Mart - Model A

The extended fact table (PAX = passenger) contains additional data only for transactions where that data is available.

B) fact table with incomplete data

    Dimension tables: the same eight dimension tables as in Fig. 1
    PAX_FACT: PK trans_ID, pax_num; exc_dep_key, exc_date_key, time_key, location_key, book_date_key, paym_key, reseller_key, exc_lang_key, nat_lang_key, country_key, attributes A, attributes B

Fig. 2. Data Mart - Model B

Only one fact table is used, at the lowest possible level of granularity (data for a single passenger). However, as noted before, additional data is available for only 10% of passengers, so in the remaining 90% of the rows attributes like age or gender have the value 'UNKNOWN' (NULL values are avoided).

C) two independent fact tables

The data is located in two independent fact tables with different grain and number of rows. However, during the initial load and subsequent updates the same surrogate key trans_ID is maintained in both tables, but only for the needs of this analysis and to allow comparison with the other models.

    Dimension tables: the same eight dimension tables as in Fig. 1
    TICKET_FACT: PK trans_ID; exc_dep_key, exc_date_key, time_key, location_key, book_date_key, paym_key, reseller_key, exc_lang_key, attributes A
    PAX_FACT: PK trans_ID, pax_num; exc_dep_key, exc_date_key, time_key, location_key, book_date_key, paym_key, reseller_key, exc_lang_key, nat_lang_key, country_key, attributes A, attributes B

Fig. 3. Data Mart - Model C

It is necessary to stress that the chosen model will be completed with a few aggregation tables. These tables are omitted in this work because their influence on the measured parameters is not essential. The assumption, of course, is that the analyzed queries would not use these aggregation tables anyway, because their grain or method of aggregation is not adequate in the observed case.

3.2 Space consumption analysis

All implementations of DWH technologies involve hard space consumption considerations. Often this is the crucial factor in selecting the optimal design.
In this case study the DM size is determined mostly (99%) by the size of the fact table(s) and the accompanying indexes (the biggest dimension table has only 9000 rows).

    Model  Object              Rows (10^6)  Size (MB)
    A      TICKET fact table   7.2          664
    A      PAX fact table      3.2          112
    A      INDEXES                          60
    A      Total:                           836
    B      PAX fact table      36.5         3226
    B      INDEXES                          147
    B      Total:                           3373
    C      TICKET fact table   7.2          664
    C      PAX fact table      3.2          320
    C      INDEXES                          80
    C      Total:                           1064

Table 1. Data Mart Size
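As a quick check of Table 1, the following Python sketch recomputes the total size of each model and expresses it relative to Model A. Only the figures come from Table 1; the dictionary layout and the variable names are illustrative assumptions.

```python
# Sizes in MB copied from Table 1; keys and structure are illustrative only.
SIZES_MB = {
    "A": {"TICKET fact table": 664, "PAX fact table": 112, "indexes": 60},
    "B": {"PAX fact table": 3226, "indexes": 147},
    "C": {"TICKET fact table": 664, "PAX fact table": 320, "indexes": 80},
}

# Total DM size per model (fact tables plus bitmap indexes).
totals = {model: sum(parts.values()) for model, parts in SIZES_MB.items()}

# Size of each model expressed as a multiple of Model A.
relative_to_a = {model: round(total / totals["A"], 2) for model, total in totals.items()}

for model in sorted(totals):
    print(f"Model {model}: {totals[model]} MB ({relative_to_a[model]} x Model A)")
```

The totals match the "Total" rows of Table 1, and the ratios show that Model B needs roughly four times the space of Model A, while Model C stays within about 30% of it.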

Chart 1. Data Mart Size (bar chart: DM size in MB for Models A, B and C)

It is evident that model B makes the least efficient use of space, as a consequence of forcing the lowest level of granularity even when not all attributes are known.
For the DM size estimation, bitmap indexes on the foreign keys in the fact tables are taken into consideration, for example 8 indexes on the TICKET fact table in model A.
Besides other advantages over B-tree indexes (discussed later in the paper), bitmap indexes need significantly less space:

Chart 2. Index Size Comparison (bar chart: index size in MB for Models A, B and C, bitmap vs. B-tree)

It is obvious that bitmap indexes occupy less space, e.g. for model B 147 MB (bitmap) vs. 4560 MB (B-tree).
However, some additional facts must be mentioned in the context of bitmap indexes:

o as expected, the space needed for a bitmap index is directly correlated with the cardinality of the indexed attribute. Even an attribute with 9000 different values can be a good index candidate when the corresponding table has a few million rows;
o less expected (but quite logical), the amount of space used by a bitmap index depends significantly on the distribution of the indexed attribute: partially presorted data can reduce the needed space significantly;
o less expected (but quite logical), bitmap indexes respond poorly to inserts or updates of the corresponding table: the needed space does not increase proportionally. For example:

(a) the PAX fact table (model B) has 36 533 622 rows (3 years of data). The bitmap index on column time_key occupies 14 MB of space. A complex query using 4 bitmap indexes lasts 5 s;
(b) during the daily update of the DM 130 657 rows are inserted into the fact table (an increase of 0.36%). After the insert the bitmap index on column time_key occupies 40 MB of space (an increase of 185%). The query execution time increases from 5 s to 10 s (100%).

These facts are the reason for recreating bitmap indexes after massive inserts or updates.
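The dependence of bitmap index size on the ordering of the indexed column can be illustrated with a minimal Python sketch. It assumes a plain run-length encoding as a stand-in for whatever compression the DBMS applies to its bitmaps (the paper does not specify it), and the function names are hypothetical.

```python
from itertools import groupby


def bitmap_index(column):
    """Build one bit list per distinct value: bit i is 1 when row i holds that value."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def rle_runs(bits):
    """Count runs of equal bits; fewer runs means a shorter run-length encoding."""
    return sum(1 for _ in groupby(bits))


def index_cost(column):
    """Total number of runs over all bitmaps of the index (a rough size proxy)."""
    return sum(rle_runs(bits) for bits in bitmap_index(column).values())


# Same values and same cardinality, different physical ordering of the rows.
sorted_col = [1] * 4 + [2] * 4 + [3] * 4
shuffled_col = [1, 2, 3] * 4

print(index_cost(sorted_col), index_cost(shuffled_col))
```

With presorted rows each bitmap consists of a few long runs, whereas interleaved values fragment every bitmap into many short runs. Under this simplified model, rows appended in a different order than the existing data fragment the bitmaps in the same way, which is consistent with the advice to recreate the indexes after massive inserts.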

3.3 Query performance

A significant number of queries were analyzed, with multiple joins of fact and dimension tables. Special attention is dedicated to comparing the characteristics of bitmap and B-tree indexes. Both space consumption and query performance were analyzed before the B-tree indexes were dropped.
Each query was executed several times in different environments to ensure that the effect of cache memory (database buffers) is controlled. The presented values are averages, because it is realistic to assume that each query will be executed more than once (maybe with minor changes in the WHERE clause).

    Execution Time   Q1 (s)  Q2 (s)  Q3 (s)  Q4 (s)  Q5 (s)  Q6 (s)
    Model A          0.5     10.0    1.0     50.0    1.0     25.0
    Model B          1.5     20.0    1.0     80.0    50.0    5.0
    Model C          0.5     10.0    0.1     50.0    0.1     2.5

Table 2. Query Execution Time

Chart 3. Query Execution Time (bar chart: execution time in s for queries Q1 to Q6, Models A, B and C)

Queries Q1 and Q2 use data from the TICKET fact table (sale transaction level) and the associated dimension tables. Queries Q3 to Q6 also use data from the PAX fact table (passenger level).
The queries have different selectivity. For example, Q4 retrieves the number of passengers in the year 2002, grouped by month, language and gender. In this case the Cost Based Optimizer (CBO) estimates that there is no need to use the available indexes.
In contrast to Q4, query Q5 retrieves the number of passengers grouped by date and gender, but filtered with precise conditions: year '2002', month '8', reseller 'DWH tours', language 'English', departure time '<8', max. capacity '<50', mode of payment 'Cash' (this example is for Model A). This is a join between two fact tables and six dimension tables. The CBO decides that only four bitmap indexes should be used. The remaining two indexes would not contribute further to the execution time:

    SELECT dd.date_dmy ExcDate,
           pf.gender Gender,
           COUNT(pf.gender) PaxCount
    FROM ticket_fact tf, pax_fact pf,
         excursion_dim ed, date_dim dd,
         time_dim td, reseller_dim rd,
         language_dim ld, payment_dim pd
    WHERE tf.exc_trans_key=pf.exc_trans_key
      AND tf.exc_dep_key=ed.exc_dep_key
      AND tf.exc_date_key=dd.date_key
      AND tf.exc_time_key=td.time_key
      AND tf.reseller_key=rd.reseller_key
      AND tf.exc_lang_key=ld.lang_key
      AND tf.paym_key=pd.paym_key
      AND dd.year = 2002
      AND dd.month_of_year=8
      AND rd.reseller_desc='DWH TOURS'
      AND ld.lang_name='English'
      AND ed.max_capacity>=50
      AND td.hour<8
      AND pd.paym_desc='Cash'
    GROUP BY ROLLUP (dd.date_dmy, pf.gender)
    ORDER BY 1,2;

Execution plan:

    SELECT STATEMENT             |
    SORT ORDER BY                |
    SORT GROUP BY ROLLUP         |
    TABLE ACCESS BY INDEX ROWID  | PAX_FACT
    NESTED LOOPS                 |
    HASH JOIN                    |
    HASH JOIN                    |
    HASH JOIN                    |
    TABLE ACCESS BY INDEX ROWID  | TICKET_FACT
    BITMAP CONVERSION TO ROWIDS  |
    BITMAP AND                   |
    BITMAP MERGE                 |
    BITMAP KEY ITERATION         |
    TABLE ACCESS FULL            | RESELLER_DIM
    BITMAP INDEX RANGE SCAN      | I_TICKET_RES
    BITMAP MERGE                 |
    BITMAP KEY ITERATION         |
    TABLE ACCESS FULL            | LANGUAGE_DIM
    BITMAP INDEX RANGE SCAN      | I_TICKET_LANG
    BITMAP MERGE                 |
    BITMAP KEY ITERATION         |
    TABLE ACCESS FULL            | PAYMENT_DIM
    BITMAP INDEX RANGE SCAN      | I_TICKET_PAYM
    BITMAP MERGE                 |
    BITMAP KEY ITERATION         |
    TABLE ACCESS FULL            | DATE_DIM
    BITMAP INDEX RANGE SCAN      | I_TICKET_DATE
    TABLE ACCESS FULL            | DATE_DIM
    TABLE ACCESS FULL            | TIME_DIM
    TABLE ACCESS FULL            | EXCURSION_DIM
    INDEX RANGE SCAN             | PK_PAX_FACT

A special note applies to model B: some attributes have a value in only 10% of the rows, so an additional condition must be added, for example "... and pf.gender is not null ...", to ensure that the results are the same as in models A and C.
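The extra condition required for model B can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the DBMS used in the study, with heavily simplified, hypothetical tables (ticket_fact, pax_fact, pax_fact_b) and three invented tickets. Missing values are stored as NULL, following the example condition quoted above; with an 'UNKNOWN' placeholder the filter would compare against that value instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Model A: ticket-level facts plus a passenger-level extension (same trans_id).
    CREATE TABLE ticket_fact (trans_id INTEGER PRIMARY KEY, exc_date TEXT);
    CREATE TABLE pax_fact (trans_id INTEGER, pax_num INTEGER, gender TEXT,
                           PRIMARY KEY (trans_id, pax_num));
    -- Model B: a single passenger-level table; gender is missing for most rows.
    CREATE TABLE pax_fact_b (trans_id INTEGER, pax_num INTEGER, exc_date TEXT,
                             gender TEXT, PRIMARY KEY (trans_id, pax_num));
""")

# Tickets 1 and 3 carry passenger details; ticket 2 (three passengers) does not.
conn.executemany("INSERT INTO ticket_fact VALUES (?, ?)",
                 [(1, "2002-08-01"), (2, "2002-08-01"), (3, "2002-08-02")])
conn.executemany("INSERT INTO pax_fact VALUES (?, ?, ?)",
                 [(1, 1, "F"), (1, 2, "M"), (3, 1, "F")])
conn.executemany("INSERT INTO pax_fact_b VALUES (?, ?, ?, ?)",
                 [(1, 1, "2002-08-01", "F"), (1, 2, "2002-08-01", "M"),
                  (2, 1, "2002-08-01", None), (2, 2, "2002-08-01", None),
                  (2, 3, "2002-08-01", None), (3, 1, "2002-08-02", "F")])

# Model A needs a join between the two fact tables.
QUERY_A = """SELECT tf.exc_date, pf.gender, COUNT(*) FROM ticket_fact tf
             JOIN pax_fact pf ON tf.trans_id = pf.trans_id
             GROUP BY tf.exc_date, pf.gender ORDER BY 1, 2"""
# Model B needs no join, but must exclude rows without passenger details.
QUERY_B = """SELECT exc_date, gender, COUNT(*) FROM pax_fact_b {where}
             GROUP BY exc_date, gender ORDER BY 1, 2"""

rows_a = conn.execute(QUERY_A).fetchall()
rows_b = conn.execute(QUERY_B.format(where="WHERE gender IS NOT NULL")).fetchall()
rows_b_unfiltered = conn.execute(QUERY_B.format(where="")).fetchall()

print(rows_a == rows_b, rows_b_unfiltered != rows_a)
```

With the filter, model B returns the same groups as the two-table join of model A; without it, an extra group for the passengers with unknown gender appears in the report.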

3.4 DM update analysis

The DM update, i.e. the ETL procedure, is considerably simplified thanks to the fact that the dominant source of data is the transactional database. Furthermore, the use of mechanisms like Change Data Capture (CDC) ensures that only changed data from the OLTP database is transferred into the DM.
All models use the same dimension tables, so the update of the dimension tables takes the same time for all models.
Practically 99% of the time required for the daily update of the DM is spent on the fact table(s). As already mentioned, it is advisable to rebuild bitmap indexes after inserts and updates.

    Execution Time                  Mod. A (s)  Mod. B (s)  Mod. C (s)
    insert into TICKET_FACT         5.0         450.0       5.0
    recreate bitmap indexes on TF   170.0       690.0       170.0
    insert into PAX_FACT            0.5         -           1.0
    recreate bitmap indexes on PF   16.0        -           65.0
    TOTAL                           191.5       1140.0      241.0

Table 3. Data Mart Daily Update

The analyzed case of the DM daily update assumes that 27 759 rows are inserted into the TICKET fact table (Models A and C), i.e. 130 657 rows for Model B. 11 400 rows are inserted into the PAX fact table (Models A and C).
The execution time is worst for Model B, for a few reasons: its fact table is the largest (which has a negative influence on index rebuild time), and attributes with different grain must be combined into the same table.

4 Conclusion

Although debates about dimensional models reach levels of dogmatic adherence to one of the methodologies and uncompromising deference to DWH authorities, it is a fact that the latest versions of DBMSs ensure very effective construction and deployment of a DM (i.e. DWH), even in cases when the ideal model cannot be used.
The practical example shows the effects of different approaches to the construction of a DM. Model B (fact table with incomplete data) gives on average the worst results. However, the remaining two models should be considered equally.
The crucial influence on the final decision comes from the ranking of the parameters: whether it is more important (in a particular case) to have a more compact (smaller) database, faster daily updates or better query performance.
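The dependence of the final choice on how the parameters are ranked can be made concrete with a minimal Python sketch. The size and update figures are the totals from Tables 1 and 3; using the sum of Q1 to Q6 from Table 2 as the query figure, the weighting scheme and the function name are assumptions made only for illustration, not a method proposed in the paper.

```python
# Totals copied from Tables 1-3; query_s is the sum of Q1..Q6 (an assumption).
METRICS = {
    "A": {"size_mb": 836, "update_s": 191.5, "query_s": 87.5},
    "B": {"size_mb": 3373, "update_s": 1140.0, "query_s": 157.5},
    "C": {"size_mb": 1064, "update_s": 241.0, "query_s": 63.2},
}


def rank(weights):
    """Return model names ordered from best to worst (lower weighted cost is better)."""
    # Normalise each parameter by its best (smallest) value across the models.
    best = {k: min(m[k] for m in METRICS.values()) for k in weights}
    cost = {
        name: sum(w * m[k] / best[k] for k, w in weights.items())
        for name, m in METRICS.items()
    }
    return sorted(cost, key=cost.get)


# Hypothetical weightings: all parameters equal vs. query time counted three times.
equal = rank({"size_mb": 1, "update_s": 1, "query_s": 1})
query_first = rank({"size_mb": 1, "update_s": 1, "query_s": 3})

print(equal, query_first)
```

Under these assumptions equal weights favour Model A, while a weighting that emphasises query time favours Model C; Model B is last in both cases.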