Managing Current and Historical view of Information in a Data Warehouse Environment

By

Nagarajan Subramanian

TABLE OF CONTENTS

Managing Current and Historical view of Information in a Data warehouse environment
Introduction
Current and Historical view
Traditional DW approach
    Type 2 (Slowly Changing Dimension – SCD)
    Type 3 (Slowly Changing Dimension – SCD)
    Type 2 + (SCD 2+)
    Other DW solutions
A relook at Surrogate key – Modified Surrogate key approach
Summary

Managing Current and Historical view of Information in a Data Warehouse environment

Introduction

In today’s challenging business environment, organizations need to make full use of the core strengths of data warehousing to achieve long-term customer satisfaction. Fast access to accurate information, leading to smarter decisions, is the key to success. Business executives today understand that access to such information provides a competitive advantage.

A data warehouse is a single place to perform multi-point analysis and to derive a completely different perspective of the business to enhance the customer satisfaction program. Looking at information from both the current view and the historical view of the business is a key factor in understanding trends in the customer business relationship. In this article, I present the data warehouse design aspects to consider when providing both the current and historical views of data.

Current and Historical view

Consider a business scenario where Dilbert International acquired Alice Corporation in 2007 to become Albert International. In 2008, Albert International went public to become Albert Corporation. After these transformations, a three-year historical view profitability report for Albert Corporation would require the facts to be broken down to show the profitability of Dilbert International, Albert International and Albert Corporation separately. A current view profitability report, in contrast, would ignore the acquisition and the name changes and report the figures as if Albert Corporation had been in business before 2008.

Below is a sample of how the profitability report would look in the historical and current views.

Historical view

Fiscal Year   Company                 Profit after tax (billions)
2006          Dilbert International   US $ 10
2007          Albert International    US $ 15
2008          Albert Corporation      US $ 20

Current view

Fiscal Year   Company                 Profit after tax (billions)
2006          Albert Corporation      US $ 10
2007          Albert Corporation      US $ 15
2008          Albert Corporation      US $ 20

Traditional DW approach

Traditional DW methods offer several techniques to handle this scenario. Some of them are listed below.

Type 2 (Slowly Changing Dimension – SCD)

This method uses a surrogate key that is generated in the DW to keep track of changes to the dimension. Every time a change is encountered in the source system dimension, a new record is inserted into the dimension table. The fact table then starts using the new surrogate key to identify the correct dimension record, thereby making historical reporting very simple. For example, in our business scenario, the dimension table and fact table records would look similar to the example given below.

Dimension table

Surrogate Key   Period Begin date   Period End date   Company Name
1               2006-01-01          2006-12-31        Dilbert International
2               2007-01-01          2007-12-31        Albert International
3               2008-01-01          -                 Albert Corporation

Fact table

Surrogate Key   Period Ending   Profit (in Billions)
1               2006-12-31      US $ 10
2               2007-12-31      US $ 15
3               2008-12-31      US $ 20

This approach simplifies historical reporting, but makes current view reporting complex. The current view of the data can be achieved only through complex SQL constructs or multiple joins. When the scenario includes reporting for multiple corporations and their acquisitions, the data handling becomes very complex to implement. This is because the dimension table maintains no relation between the companies.
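The Type 2 historical join can be sketched as follows. This is only an illustration using Python's built-in sqlite3 module; the table and column names (company_dim, profit_fact, profit_bn and so on) are assumptions made for the sketch, and the open-ended period end date is stored as NULL. Only the historical view is shown, because the plain Type 2 dimension holds nothing that links the three company rows to each other.

```python
import sqlite3

# In-memory database; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_dim (
    surrogate_key INTEGER PRIMARY KEY,  -- one new key per change (Type 2)
    period_begin  TEXT,
    period_end    TEXT,                 -- NULL marks the open-ended row
    company_name  TEXT
);
CREATE TABLE profit_fact (
    surrogate_key INTEGER,
    period_ending TEXT,
    profit_bn     INTEGER               -- profit in US $ billions
);
INSERT INTO company_dim VALUES
    (1, '2006-01-01', '2006-12-31', 'Dilbert International'),
    (2, '2007-01-01', '2007-12-31', 'Albert International'),
    (3, '2008-01-01', NULL,         'Albert Corporation');
INSERT INTO profit_fact VALUES
    (1, '2006-12-31', 10),
    (2, '2007-12-31', 15),
    (3, '2008-12-31', 20);
""")

# Historical view: a plain key join, since each fact row already
# points at the dimension row that was valid when the fact occurred.
historical_view = conn.execute("""
    SELECT substr(f.period_ending, 1, 4) AS fiscal_year,
           d.company_name,
           f.profit_bn
    FROM profit_fact f
    JOIN company_dim d ON d.surrogate_key = f.surrogate_key
    ORDER BY f.period_ending
""").fetchall()

for row in historical_view:
    print(row)
```

Producing the current view from these two tables would need some extra piece of information (for example a natural key shared by the three rows), which is exactly the gap described above.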

Type 3 (Slowly Changing Dimension – SCD)

The Type 3 approach serves the same purpose as Type 2 but differs in the basic handling of dimension data. Type 3 suggests keeping multiple columns in the dimension table for a dimension attribute that changes. When the attribute value changes, the old value is moved to a separate column and the current column is updated with the new value.

For example:

Dimension table

Surrogate Key   Company Name Period 1   Company Name Period 2   Company Name Period 3
1               Dilbert International   Albert International    Albert Corporation

Fact table

Surrogate Key   Period   Profit (in Billions)
1               2006     US $ 10
1               2007     US $ 15
1               2008     US $ 20

In the above example, the historical changes to the company name are stored in multiple columns, with each column holding the name of the company during a different period. The reporting system therefore has to identify the correct company name based on the period. This is a fairly simple approach, but it becomes cumbersome when there are many such changes or when the number of changes to handle is not known in advance. Another version of Type 3 keeps the number of columns fixed, based on how many versions the reports need to show, and recycles the column data whenever the number of changes exceeds the number of columns. This results in loss of history and so defeats the basic purpose of historical reporting.
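A minimal sketch of the Type 3 lookup, again using Python's built-in sqlite3 module. The column names (name_period_1 and so on) and the hard-coded mapping of fiscal years to columns inside the CASE expression are assumptions made for illustration; the article does not prescribe how the reporting system maps periods to columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_dim (
    surrogate_key INTEGER PRIMARY KEY,
    name_period_1 TEXT,   -- hypothetical column names, one per period
    name_period_2 TEXT,
    name_period_3 TEXT
);
CREATE TABLE profit_fact (
    surrogate_key INTEGER,
    period        INTEGER,
    profit_bn     INTEGER  -- profit in US $ billions
);
INSERT INTO company_dim VALUES
    (1, 'Dilbert International', 'Albert International', 'Albert Corporation');
INSERT INTO profit_fact VALUES (1, 2006, 10), (1, 2007, 15), (1, 2008, 20);
""")

# Historical view: the report itself must pick the right name column.
# The year-to-column mapping below is an assumption of this sketch.
historical_view = conn.execute("""
    SELECT f.period,
           CASE f.period
               WHEN 2006 THEN d.name_period_1
               WHEN 2007 THEN d.name_period_2
               ELSE d.name_period_3
           END AS company_name,
           f.profit_bn
    FROM profit_fact f
    JOIN company_dim d ON d.surrogate_key = f.surrogate_key
    ORDER BY f.period
""").fetchall()

# Current view: always read the latest name column.
current_view = conn.execute("""
    SELECT f.period, d.name_period_3 AS company_name, f.profit_bn
    FROM profit_fact f
    JOIN company_dim d ON d.surrogate_key = f.surrogate_key
    ORDER BY f.period
""").fetchall()
```

Every further name change would require a new column and a new CASE branch, which is the maintenance burden described above.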

Type 2 + (SCD 2+)

A slight variation of SCD 2 suggests maintaining two separate dimension tables. One table holds the historical changes and the other maintains the current dimension data, as illustrated below.

Historical Dimension table

Surrogate Key   Period Begin date   Period End date   Company Name
1               2006-01-01          2006-12-31        Dilbert International
2               2007-01-01          2007-12-31        Albert International
3               2008-01-01          -                 Albert Corporation

Current Dimension table

Surrogate Key   Period Begin date   Period End date   Company Name
1               2006-01-01          2006-12-31        Albert Corporation
2               2007-01-01          2007-12-31        Albert Corporation
3               2008-01-01          -                 Albert Corporation

Fact table

Surrogate Key   Period Ending   Profit (in Billions)
1               2006-12-31      US $ 10
2               2007-12-31      US $ 15
3               2008-12-31      US $ 20

In the above example, when the fact table is joined to the “historical dimension table”, we get the historical view of the data, and when it is joined to the “current dimension table”, we get the current view of the data. This is again a very simple approach: the dimension table is duplicated, and the join is chosen based on whether historical or current data is required. The main disadvantage of this approach is the duplication of the dimension table, where the current table actually stores data that is not true for the periods concerned. The other disadvantage is that the two tables must be kept in sync by the ETL processes, which complicates them.
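The choice between the two joins can be sketched as below with Python's built-in sqlite3 module. The table names (hist_company_dim, curr_company_dim, profit_fact) and the helper profit_report are hypothetical, and the period begin and end columns are left out of the dimension tables to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hist_company_dim (surrogate_key INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE curr_company_dim (surrogate_key INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE profit_fact (surrogate_key INTEGER, period_ending TEXT, profit_bn INTEGER);
INSERT INTO hist_company_dim VALUES
    (1, 'Dilbert International'), (2, 'Albert International'), (3, 'Albert Corporation');
INSERT INTO curr_company_dim VALUES
    (1, 'Albert Corporation'), (2, 'Albert Corporation'), (3, 'Albert Corporation');
INSERT INTO profit_fact VALUES
    (1, '2006-12-31', 10), (2, '2007-12-31', 15), (3, '2008-12-31', 20);
""")

# Fixed mapping from view name to table, so no caller input reaches the SQL text.
DIM_TABLE = {"historical": "hist_company_dim", "current": "curr_company_dim"}

def profit_report(view):
    """Join the same fact table to the dimension table matching the requested view."""
    sql = f"""
        SELECT f.period_ending, d.company_name, f.profit_bn
        FROM profit_fact f
        JOIN {DIM_TABLE[view]} d ON d.surrogate_key = f.surrogate_key
        ORDER BY f.period_ending
    """
    return conn.execute(sql).fetchall()

for view in ("historical", "current"):
    print(view, profit_report(view))
```

The query is identical for both views apart from the table name; the cost is carried by the ETL process, which has to rewrite every row of the current table whenever the company name changes.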

Other DW solutions

There are other fact-based approaches to this problem. Maintaining both current and historical surrogate keys in the fact table is one such approach. It requires complicated ETL, because both keys are maintained in the fact table as two separate columns. Depending on the requirement, the appropriate column is joined to the dimension table to get either historical or current reporting.
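One possible layout for this fact-based approach is sketched below using Python's built-in sqlite3 module. The article does not give a schema for it, so the column names historical_key and current_key, the helper profit_report and the sample rows are all assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_dim (surrogate_key INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE profit_fact (
    historical_key INTEGER,  -- dimension row valid when the fact occurred
    current_key    INTEGER,  -- latest dimension row for the same company
    period_ending  TEXT,
    profit_bn      INTEGER   -- profit in US $ billions
);
INSERT INTO company_dim VALUES
    (1, 'Dilbert International'), (2, 'Albert International'), (3, 'Albert Corporation');
INSERT INTO profit_fact VALUES
    (1, 3, '2006-12-31', 10),
    (2, 3, '2007-12-31', 15),
    (3, 3, '2008-12-31', 20);
""")

def profit_report(key_column):
    """Join the dimension through the chosen fact column (hypothetical helper)."""
    if key_column not in ("historical_key", "current_key"):
        raise ValueError(f"unknown key column: {key_column}")
    sql = f"""
        SELECT f.period_ending, d.company_name, f.profit_bn
        FROM profit_fact f
        JOIN company_dim d ON d.surrogate_key = f.{key_column}
        ORDER BY f.period_ending
    """
    return conn.execute(sql).fetchall()

historical_view = profit_report("historical_key")
current_view = profit_report("current_key")
```

In this layout, each new dimension version forces the ETL process to update current_key on all earlier fact rows of that company, which is one way to read the ETL complexity mentioned above.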

A relook at Surrogate key – Modified Surrogate key approach

While the need for a surrogate key in data warehouses is still debated, the business scenario discussed above can be handled by rethinking the way the surrogate key is managed in a dimension table. The basic characteristic of a surrogate key is that “it is a unique key to identify a unique row in a dimension table”. The solution described below changes this characteristic: it reuses the same surrogate key for related dimension values, in combination with the traditional SCD 2 approach. In addition, we add an indicator column to the dimension table to identify the current record.

For example:

Dimension table

Surrogate Key   Period Begin date   Period End date   Company Name            Current record Indicator
1               2006-01-01          2006-12-31        Dilbert International   N
1               2007-01-01          2007-12-31        Albert International    N
1               2008-01-01          -                 Albert Corporation      Y

Fact table

Surrogate Key   Period Ending   Profit (in Billions)
1               2006-12-31      US $ 10
1               2007-12-31      US $ 15
1               2008-12-31      US $ 20

The fact table is joined to the dimension table in two different ways based on the reporting requirements.

For historical reporting, the fact table is joined to the dimension table on the surrogate key, and in addition the fact table’s “Period Ending” is matched against the dimension table’s “Period Begin date” and “Period End date” using a between clause.

For current reporting, the fact table is joined to the dimension table on the surrogate key, and the condition “Current record Indicator” = “Y” is added to the join.
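The two joins described above can be sketched as follows with Python's built-in sqlite3 module. The table and column names are assumptions made for the sketch. The open-ended period end date (shown as “-” in the table) is stored as NULL and treated as the far-future date 9999-12-31 in the between clause, which is also an assumption. Because the surrogate key is reused, it can no longer be the primary key on its own, so the sketch uses the key together with the period begin date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company_dim (
    surrogate_key INTEGER,
    period_begin  TEXT,
    period_end    TEXT,     -- NULL marks the open-ended current row
    company_name  TEXT,
    current_ind   TEXT,     -- 'Y' for the current record, else 'N'
    PRIMARY KEY (surrogate_key, period_begin)
);
CREATE TABLE profit_fact (
    surrogate_key INTEGER,
    period_ending TEXT,
    profit_bn     INTEGER   -- profit in US $ billions
);
INSERT INTO company_dim VALUES
    (1, '2006-01-01', '2006-12-31', 'Dilbert International', 'N'),
    (1, '2007-01-01', '2007-12-31', 'Albert International',  'N'),
    (1, '2008-01-01', NULL,         'Albert Corporation',    'Y');
INSERT INTO profit_fact VALUES
    (1, '2006-12-31', 10),
    (1, '2007-12-31', 15),
    (1, '2008-12-31', 20);
""")

# Historical view: surrogate key plus a BETWEEN on the validity period.
# ISO-8601 date strings compare correctly as text.
historical_view = conn.execute("""
    SELECT f.period_ending, d.company_name, f.profit_bn
    FROM profit_fact f
    JOIN company_dim d
      ON d.surrogate_key = f.surrogate_key
     AND f.period_ending BETWEEN d.period_begin
                             AND COALESCE(d.period_end, '9999-12-31')
    ORDER BY f.period_ending
""").fetchall()

# Current view: surrogate key plus the current record indicator.
current_view = conn.execute("""
    SELECT f.period_ending, d.company_name, f.profit_bn
    FROM profit_fact f
    JOIN company_dim d
      ON d.surrogate_key = f.surrogate_key
     AND d.current_ind = 'Y'
    ORDER BY f.period_ending
""").fetchall()
```

Both views come from the same pair of tables and each fact row matches exactly one dimension row, provided the validity periods for a given surrogate key do not overlap and exactly one row per key carries the “Y” indicator.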

There are advantages and disadvantages to this approach, just as with the other approaches discussed above. The following advantages drove us towards this approach:
a. A single dimension table handles both current and historical reporting.
b. Join conditions are simpler: no complex SQL or multiple joins are required to get current or historical reporting.
c. The relation between the dimension records is maintained by the surrogate key itself, so dimension reporting becomes easier.

The primary disadvantage of this approach is that the joins are always based on multiple conditions. For historical reporting, the surrogate key as well as the date conditions are used; for current reporting, the surrogate key as well as the current record indicator are used. Though this may affect performance in some scenarios, the approach handles both the current and historical reporting requirements in an easy and efficient manner.

Summary

Just as the correct data warehouse approach remains a matter of debate, data modeling solutions too have multiple viewpoints. Any of the approaches described above may suit well depending on the requirements, so it is very important for a data modeler or data architect to consider the problem from different angles before settling on a solution. This article is intended to illustrate one more solution for the historical and current reporting requirements of a data warehouse.