Modernizing the Data Warehouse
A Technical Whitepaper

Norman F. Walters
Freedom Consulting Group, Inc.
August 2018

Copyright 2018 Freedom Consulting Group, Inc. All rights reserved. All other company and product names referenced in this document may be trademarks or registered trademarks of their respective owners.

Table of Contents

Introduction
The EDW’s Role in Data Integration
Query Performance and Data Security in the EDW
BI Challenges
Modernizing the EDW
Modern Tools for a Modern Data Warehouse

Introduction

With the advent of Big Data, cloud architectures, microservices, and NoSQL databases, many organizations are asking themselves whether they still need a data warehouse. This paper explores this question in detail, but the short answer is, “Yes!” However, if your Enterprise Data Warehouse still looks the same as it did 10 years ago, it’s probably time to modernize. The fundamental reasons for creating an Enterprise or Corporate Data Warehouse, and the associated principles, have not been done away with by newer technologies.
That doesn’t mean, however, that some of these newer technologies can’t be leveraged as part of a data warehouse strategy. In fact, they should be used as part of the overall data warehouse strategy and architecture.

After decades of use, there is still a great deal of misunderstanding about the nature and purpose of an Enterprise Data Warehouse (EDW). An EDW is much more than a large database for storing and querying lots of data from different source systems. In many ways, the term “data warehouse” is a misnomer. In a real warehouse, trucks bring in goods, forklifts take those goods and store them somewhere, and there they remain until someone comes to remove them from the warehouse for use. A corporate data warehouse, however, is much more than a place to store a bunch of data until someone needs it. A true corporate data warehouse allows data to be cleansed, transformed as it comes in into a form that is easier to use and understand, and enriched. This, in turn, enables effective integration of data from across multiple source systems to produce information. Unlike goods trucked to a warehouse, what comes out of a data warehouse is a fundamentally different product from the data that first entered.

The question, then, is not whether there is still a need for a data warehouse, but rather what form the EDW should take in light of tremendous advances in technology and the explosion of data in many forms. To ensure it truly delivers value, the EDW needs to undergo a transformation.

The EDW’s Role in Data Integration

With the proliferation of web services and the development of Business Intelligence applications that promote data federation/virtualization, why do we need an EDW for data integration? How do we leverage our existing data warehouse to get the most out of it, using these new technologies? First, let’s look at the reasons data warehouses were created in the first place.
Data integration is one of the key foundational reasons for the creation of a data warehouse. It sounds sporty to say we’re going to use web services, and the BI sales teams are great at making it sound like their applications can work miracles. The truth, though, is that enabling effective data integration requires more sophistication than web services or an Enterprise Service Bus can provide, and a lot more work than the BI vendors would have you believe.

One reason for leveraging a data warehouse for data integration is data quality. As author and social marketing strategist Paul Gillin says, “Data quality is corporate America’s dirty little secret.” Many source systems have dirty or non-standard data that makes data integration a real challenge. One of the benefits of ingesting this data into the EDW is that it can be cleaned and transformed as it’s loaded, improving data integration and the quality of the data, which in turn significantly improves the quality of information provided from the warehouse. Trying to use web services or data federation in these cases can be problematic at best, particularly if data sets are expansive or multiple source systems are involved.

Another reason for integrating data within an EDW is to present one version of the truth. EDW developers can work with each source system and the appropriate end users to interpret the data and determine how best to cleanse, transform, integrate and present it. Once this work is done, each EDW user is guaranteed to receive the same answer to the same question every time. Without the EDW, chaos reigns and the truth is elusive, because each web service developer or end user must decide how to handle each of these things, leading to a great deal of inconsistency.
This leads to executives and managers making decisions based on faulty information, and creates an environment in which users no longer trust the information provided to them.

And then there are the many issues surrounding historical data that each source system owner must contend with. Often, growing historical data negatively impacts the performance of a system, even though it may rarely be viewed. However, data owners are typically reluctant to archive the data, because they may still need to access it in the future. Historical data is also a problem when new systems are implemented and there is a desire to retain old data that will not be imported into the new system. This is especially prevalent when implementing new ERP systems, for example.

The EDW can provide a solution to each of these problems. Historical data can be archived off to the EDW, eliminating the performance problems for the source system, while still ensuring the data is accessible if needed. Because the EDW uses special data warehousing constructs to store data and leverages techniques like data partitioning, it is able to store the data with much less of an impact on query performance. It can also facilitate the deletion of archived historical data beyond defined retention periods.

One of the problems with historical data when implementing new systems is figuring out how to make the old data fit the new architecture. By archiving this data into the EDW, this problem can be substantially avoided. Archiving still presents an opportunity to do some transformation of the data, without worrying about constraints built into the new application, so that historical data can be merged effectively with current data in queries from the EDW.
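The archiving approach described above can be illustrated with a minimal Python sketch. It is not taken from any particular warehouse product: the `ArchivedRecord` schema, the legacy row layout (MM/DD/YYYY dates and dollar amounts stored as strings), and the `PartitionedArchive` class are all assumptions made for illustration. The sketch shows three ideas from the text: transforming legacy rows as they are archived, storing them in year-based partitions so that queries read only the relevant partitions, and purging partitions that fall outside a defined retention period.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal


@dataclass(frozen=True)
class ArchivedRecord:
    """Warehouse-side shape for an archived row (hypothetical schema)."""
    source_system: str
    record_id: str
    recorded_on: date
    amount_cents: int


def transform_legacy(row):
    """Reshape a legacy row (assumed MM/DD/YYYY dates and dollar strings)
    so it can be queried alongside current data."""
    return ArchivedRecord(
        source_system=row["system"].strip().upper(),
        record_id=str(row["id"]),
        recorded_on=datetime.strptime(row["date"], "%m/%d/%Y").date(),
        amount_cents=int(Decimal(row["amt"]) * 100),
    )


class PartitionedArchive:
    """Toy stand-in for a year-partitioned archive table in the EDW."""

    def __init__(self, retention_years):
        self.retention_years = retention_years
        self._partitions = defaultdict(list)  # year -> list of records

    def archive(self, record):
        self._partitions[record.recorded_on.year].append(record)

    def query(self, start_year, end_year):
        # Only partitions inside the requested range are read, which is
        # the effect partition pruning has in a real warehouse.
        rows = []
        for year in range(start_year, end_year + 1):
            rows.extend(self._partitions.get(year, []))
        return rows

    def purge_expired(self, today):
        # Drop whole partitions older than the retention window.
        cutoff_year = today.year - self.retention_years
        expired = sorted(y for y in self._partitions if y < cutoff_year)
        for year in expired:
            del self._partitions[year]
        return expired


# Usage: archive two legacy rows, then apply a seven-year retention policy.
archive = PartitionedArchive(retention_years=7)
for legacy_row in [
    {"system": " old_erp ", "id": 17, "date": "03/15/2009", "amt": "12.50"},
    {"system": "old_erp", "id": 42, "date": "11/02/2015", "amt": "99.99"},
]:
    archive.archive(transform_legacy(legacy_row))

removed_years = archive.purge_expired(today=date(2018, 8, 1))
recent_rows = archive.query(2012, 2018)
```

In a production EDW the partitions would be physical table partitions managed by the database, and the purge would typically be a partition drop rather than a row-by-row delete. The in-memory dictionary here only mimics that behavior.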

Query Performance and Data Security in the EDW

Two of the key drivers behind the need for an EDW have been the need for fast query performance across integrated data sets, and the need for data security and privacy. These drivers have not gone away, but the desire for a more agile approach that can also accommodate newer technologies, such as Big Data and cloud architectures, has become more urgent.

Problems with query performance and response time can occur in any system, and the EDW is no exception. There are many potential causes for these kinds of problems within the EDW and many potential solutions. Each situation must be looked at with a critical eye and investigated with a holistic approach in order to derive the correct solution. However, these issues are outside the intended scope of this paper. Rather, the intent here is to address the shortcomings of the classic EDW architecture while ensuring we don’t reintroduce the problems the EDW was created to resolve.

The use of specialized data warehousing techniques and database constructs provides for enhanced performance, particularly for integrated data from multiple source systems. If designed properly, an EDW can provide much better integration and query performance across data from multiple sources, and with much bigger data sets, than is possible when querying directly from the source systems. As the volume of data grows, performance can still be an issue in any data warehouse, and must be constantly looked at as an area for improvement.

As mentioned above, another key driver behind using an EDW is data security. Clearly, in the world we live in today, system and data security are preeminent concerns across all IT applications and systems.
An advantage of a data warehouse is that it requires only one comprehensive security solution that can be easily controlled, monitored and maintained, as opposed to trying to apply effective security controls to multiple source systems. Over the years, alternatives to the traditional EDW have been