Automated Presentation of Slowly Changing Dimensions
Total Page:16
File Type:pdf, Size:1020Kb
IT 09 025 Examensarbete 30 hp Juni 2009 Automated presentation of slowly changing dimensions Christer Boedeker Institutionen för informationsteknologi Department of Information Technology Abstract Automated presentation of slowly changing dimensions Christer Boedeker Teknisk- naturvetenskaplig fakultet UTH-enheten On the subject of data warehousing, a lot of material is available on what needs to be done to maintain a presentation area, but very little on how to do it. This paper Besöksadress: presents a structure and process for automatically maintaining and updating a Ångströmlaboratoriet Lägerhyddsvägen 1 multidimensional presentation area from a fully normalized data staging area. The Hus 4, Plan 0 paper also discusses a theoretical algorithm for finding the best division of slowly changing dimension attributes into several groups based on how they change Postadress: together. Box 536 751 21 Uppsala Telefon: 018 – 471 30 03 Telefax: 018 – 471 30 00 Hemsida: http://www.teknat.uu.se/student Handledare: Andreas Perjus Ämnesgranskare: Tore Risch Examinator: Anders Jansson IT 09 025 Tryckt av: Reprocentralen ITC Contents 1. Abstract ........................................................................................................................................ 3 2. Introduction .................................................................................................................................. 7 3. Background ................................................................................................................................... 9 3.1. Terms ............................................................................................................................................ 9 3.2. Relational vs. Multidimensional design. ....................................................................................... 9 3.3. Slowly changing dimensions ....................................................................................................... 10 3.4. Dimension subdivision ................................................................................................................. 11 3.5. Dimension presentation automation .......................................................................................... 14 4. Dimension subdivision ................................................................................................................ 15 4.1. The number of rows when combining three or more attributes ................................................. 17 4.2. Exhaustive search ........................................................................................................................ 18 5. Dimension presentation automation ........................................................................................... 21 5.1. Metadata repository for automated presentation dimension table creation ............................. 21 5.2. Extending the metadata repository with the dimension partitioning ......................................... 22 5.3. Automated sub dimension presentation table creation ............................................................. 23 5.4. Structure of the staging area ...................................................................................................... 24 5.5. Natural key to surrogate key mapping ....................................................................................... 28 5.6. Incremental update of presentation area. .................................................................................. 30 5.7. Corrected, old dimension data. ................................................................................................... 33 5.8. Extending the dimension presentation tables ............................................................................. 36 5.9. Altering the sub dimension partitions ......................................................................................... 38 5 5.10. Surrogate key translation table .................................................................................................. 40 6. Insights and conclusions .............................................................................................................. 47 7. Bibliography ................................................................................................................................ 49 8. APPENDIX ................................................................................................................................... 50 A. Surrogate key join vs. natural key join + date range limit. .............................................................. 50 B. Surrogate key translation table test trace files ............................................................................... 64 6 2. Introduction This paper is a report on a thesis project in information technology at Uppsala University. The subject area is data warehousing which is a topic of computing science. The paper describes the structures and procedures for staging type 2 slowly changing dimensions and populating them in a presentation layer. A data warehouse is a large collection of data from a business or comparable operation. Unlike an online transaction processing database, where focus is on single record read and writes, the data in a data warehouse is modeled to support retrieval of large portions of the database with good performance. In a multidimensional data warehouse we speak about dimensions and facts. Dimensions are the categories of data we store in the warehouse, such as customer information and product information. Facts are the measurements of the operation such as the sale of a product to a specific customer at a specific time. Dimensions are naturally more static than facts since it is more common to sell a product than to add a new product to the catalog. What is even less common is the change of an attribute in an already existing product, for example a change of product group or package size. Still this happens sometimes and then we talk about slowly changing dimensions. The literature on data warehousing is rich with information about how slowly changing dimensions are constructed in the presentation layer of a multidimensional data warehouse. Detailed technical discussions about how the data structures for storing the data and the procedures for moving the data to the presentation layer are much harder to find. The aim of this project was to find and document working procedures for transforming raw attribute data into a dimensionally modeled data warehouse presentation layer. The paper begins which a short introduction on the nature of a multidimensional data model as opposed to a normalized relational model and then specifically the techniques for presenting slowly 7 changing dimension attributes. The paper then contains a search for a method to find the best way to distribute attributes between sub dimensions and details data structures and procedures for storing the attributes and automatically copying them to the multidimensional presentation layer. The final outcome are the algorithms needed for planning the presentation of the the slowly changing dimensions and automating the maintenance and population of this presentation from a previously populated staging area. The project was carried out in an oracle 10g rdbms environment where dummy dimension attribute data was generated. I chose to work in a lab environment with large amounts of dummy data since it is important to verify that the algorithms are practically useable and well performing. As part of the work I studied the available literature on slowly changing dimensions and dimensional modeling, although the algorithms had to be developed from scratch since the available literature doesn’t go into any details regarding implementation of procedures and staging structures. The goal was to find the procedures necessary to transform the raw attribute data into presentation dimensions supporting flexible and highly performing queries along with the fact data. To achieve this, the necessary steps for the complete transformation were identified and for each step sql code was developed which was then run against the dummy attribute data. The result is a collection of sql statements which can be used to compile a complete package for scd maintenance. The project is focused on the Oracle rdbms, so much of the code is Oracle dependent and some of the reflections are based on the assumption that the underlying database engine is of the classical row based type, wide dimension tables taking a long time to read for instance. The newer more data warehouse centered column based architectures don’t have those read issues, but still most of the observations are true for that type of database engine as well. 8 3. Background 3.1. Terms Dimension – A category of information such as person, product, time etc. A collection of dimension entities of the same type. Dimension entity – the entity described in a dimension and identified by the natural key. May be a person, a product, a date etc. Dimension attribute – An attribute of the dimension entity, such as phone number or product package size. Natural key – a unique identifier of a dimension entity. Natural implies that the key originates from outside the database, such as a personal number, but in the context of this paper it may also be generated in the database. Surrogate key – A generated key identifying a row in a slowly changing dimension table. SCD, slowly changing dimension – A dimension