Beyond the Data Model: Designing the Data Warehouse
Total Page:16
File Type:pdf, Size:1020Kb
Beyond the Data Model: of a Designing the three-part series Data Warehouse By Josh Jones and Eric Johnson CA ERwin TABLE OF CONTENTS INTRODUCTION . 3 DATA WAREHOUSE DESIGN . 3 MODELING A DATA WAREHOUSE . 3 Data Warehouse Elements . 4 Star Schema . 4 Snowflake Schema . 4 Building the Model . 4 EXTRACT, TRANSFORM, AND LOAD . 7 Extract . 7 Transform . 7 Load . 7 Metadata . 8 SUMMARY . 8 2 ithout a doubt one of the most important because you can add new topics without affecting the exist- aspects data storage and manipulation ing data. However, this method can be cumbersome for non- is the use of data for critical decision technical users to perform ad-hoc queries against, as they making. While companies have been must have an understanding of how the data is related. searching their stored data for decades, it’s only really in the Additionally, reporting style queries may not perform well last few years that advanced data mining and data ware- because of the number of tables involved in each query. housing techniques have become a focus for large business- In a nutshell, the dimensional model describes a data es. Data warehousing is particularly valuable for large enter- warehouse that has been built from the bottom up, gather- prises that have amassed a significant amount of historical ing transactional data into collections of “facts” and “dimen- data such as sales figures, orders, production output, etc. sions”. The facts are generally, the numeric data (think dol- Now more than ever, it is critical to be able to build scalable, lars, inventory counts, etc.), and the dimensions are the bits accurate data warehouse solutions that can help a business of information that put the numbers, or facts, into context move forward successfully. (think sales person names, regions, etc.). This method allows Database modeling goes beyond on-line transactional pro- for very efficient queries from the standpoint of non-techni- cessing (OLTP) models for traditional relational databases cal users, because the data is grouped in a logical way, and it and extends in the world of data warehousing. In addition to performs quite well since the similar types of data are stored requiring that a schema be designed, data warehouses have together. However, maintaining a dimensional data ware- several other components that are equally important. When house requires some effort, because adding new types of working with data warehouses you also have to take into data may require that the entire warehouse be updated, or account metadata, and how the data is to be placed into the reprocessed, in order to incorporate the new types of data. warehouse from one or more sources. This whitepaper will Finally, it is important to understand where the data for a cover data warehouse design, working with metadata, build- data warehouse comes from. Obviously, a large portion of it ing ETL processes, and offer some guideline for the overall (in some cases, all of it) will come from the operational design of your data warehouse solution. databases being used by the enterprise. Just remember, in large enterprises, this may mean data could be gathered DATA WAREHOUSE DESIGN from databases on mainframes as well as distributed sys- Put simply, a data warehouse is a repository of data. tems. However, there may be data stored in non-relational Typically, this is historical data about the business that has sources, such as file shares and even email servers. Because been conducted by a company. This may sound like a data- of this, the process of designing a data warehouse also base, and at the most basic definition, it is. However, the includes the task of designing the processes that bring data generally accepted difference is that the term database actu- into the data warehouse. The process of getting that data is ally refers to a relational database, and often more specifi- often referred to as Extract, Transform, and Load (ETL) cally an online transactional processing (OLTP) relational reflecting the three major operations that process raw data database. OLTP databases are designed to perform well into the data warehouse. under transactional loads, often supporting applications that are highly focused in nature, such as purchasing interfaces. MODELING A DATA WAREHOUSE Conversely, a data warehouse is for online analysis process- Just as in relational data modeling, building a data model for ing (OLAP), and built to meet the needs of users who are a data warehouse requires a significant amount of work be responsible for analyzing the data and making decisions done prior to the start of the project. This is because a data based on it. warehouse is a very structured, specific deliverable. You must It is worth noting that there is not a strictly defined, uni- account for what the initial data will be, and what data will fied standard to building data warehouses. There are at least be incorporated in the future. It is necessary to approach the two very well known methodologies behind designing a data design of a data warehouse much like an application design warehouse; the “dimensional” model and the “normalized” project; in fact, often times a data warehouse will be coupled model. Each of these methods works and each has its pros with a front end reporting system, whether it is a web based and cons. While we will be using the dimensional approach, application or a desktop client. As such, you’ll follow stan- it is worth noting the basics of each. dard design practices, such as gathering requirements, build- The normalized model is based on the normalization mod- ing documentation and diagrams, and generating a signifi- els defined by E.F. Codd. Specifically, the idea is to store the cant amount of data about the data, or metadata, that will data in third normal form (3NF), grouping the data by topic, help define the structure and scope of the data warehouse. i.e. customers, orders, products. This allows the data ware- These activities are outside the scope of this paper, because house to be updated with new types of data very easily, preparing for a data warehouse requires these tasks be com- 3 pleted prior to the beginning of the technical process necessary to break the dimensions down into subdimension described here. tables. The end result is a schema the visually looks like a snowflake, because of the layers of dimension tables. Figure DATA WAREHOUSE ELEMENTS 2 shows a slightly expanded view of the sales data in a data The basic pieces of a dimensional data warehouse model warehouse, employing a snowflake schema. are tables and relationships, laid out in a schema. Specifically, we’ll have fact tables and dimensional tables, and they will be related to one another in a manner that facilitates understanding of the relationships between those tables. This layout is called a schema, and there are two pri- mary schema types used in a dimensional data warehouse: snowflake and star. STAR SCHEMA One of simplest schemas, star schemas are named such because they usually involve a single fact table, related to a single layer of dimension tables. Each dimension table has a simple primary key, as it is responsible for a single type of data. The fact table has an aggregate primary key, as it is the compounded collection of the primary keys from the dimen- sion tables. Figure 1 shows a very simple star schema outlin- ing sales data. Figure 2. Snowflake schema. While this is not an exhaustive look at a data model, it shows the basics of a snowflake schema. There is a central fact tables, Sales, and two layers of dimension tables, effectively normalizing the data. The advantage here is that logically, we can see how the data is related very easily. We can drill down from the sales, to the region, and lookup the country code. In a large system, a snowflake can also facilitate data mining queries that require complex comparisons across different dimensions. For example, if we need to see all sales from the 3rd quarter of 2007 for a specific country, filtered by product Figure 1. Simple star schema. manufacturer, we’re literally joining together seven tables. If this data had been stored in a star, we’d be joining four tables. The simplicity of a star schema is that for each fact table, However, from a performance standpoint, this schema may you will only ever have to join one level of dimension tables slow down larger queries that seek to simply retrieve data to get at the data. Unfortunately, the drawback of a star from a few dimensions. Just remember, because of the differ- schema is that for intricate systems, it can become quite ences in performance and usability, it is vital to structure your complex to read and query when it is necessary to retrieve a schema according to your specific needs. large amount of fact data. BUILDING THE MODEL SNOWFLAKE SCHEMA Let’s now assume that we have a specific project, and have The snowflake schema is very similar to the star schema, gathered the necessary requirements. The project will be with the exception of the relationships between dimensions fairly simple; we’re building a data warehouse for a DVD and facts. A snowflake schema still has fact tables at its cen- retailer that wants to be able to figure out what products ter, with relationships to dimension tables. However, if there are being ordered and what their customer demographics are dimensions with large numbers of attributes, it might be are. This is a readily recognizable need; almost every business 4 is either selling a product or a service, and while the details vary, the You can see that it is a standard, normalized basics stay the same.