Modernizing the Data Warehouse

A Technical Whitepaper

Norman F. Walters
Freedom Consulting Group, Inc.

August 2018

Copyright © 2018 Freedom Consulting Group, Inc. All rights reserved. All other company and product names referenced in this document may be trademarks or registered trademarks of their respective owners.


Table of Contents

Introduction
The EDW’s Role in Data Integration
Query Performance and Data Security in the EDW
BI Challenges
Modernizing the EDW
Modern Tools for a Modern Data Warehouse
Conclusion


Introduction

With the advent of Big Data, cloud architectures, microservices and NoSQL, many organizations are asking themselves whether they still need a Data Warehouse.

This paper explores this question in detail, but the short answer is, “Yes!” However, if your Enterprise Data Warehouse still looks the same as it did 10 years ago, it’s probably time to modernize. The fundamental reasons for creating an Enterprise or Corporate Data Warehouse, and the principles behind it, have not been rendered obsolete by newer technologies. That doesn’t mean, however, that some of these newer technologies can’t be leveraged as part of a data warehouse strategy. In fact, they should be used as part of the overall data warehouse strategy and architecture.

After decades of use, there is still a great deal of misunderstanding about the nature and purpose of an Enterprise Data Warehouse (EDW). An EDW is much more than a large repository for storing and querying lots of data from different source systems. In many ways, the term “data warehouse” is a misnomer. In a real warehouse, trucks bring in goods, forklifts take those goods and store them somewhere, and there they remain until someone comes to remove them from the warehouse for use. A corporate data warehouse, however, is much more than a place to store a bunch of data until someone needs it. A true corporate data warehouse allows data to be cleansed, enriched, and transformed into a more usable and understandable form as it comes in, which, in turn, enables effective integration of data from across multiple source systems to produce information. Unlike goods trucked to a warehouse, what comes out of a data warehouse is a fundamentally different product than the data that first entered.

The question then is not if there is still a need for a data warehouse, but, rather, what form the EDW should take in light of tremendous advances in technology and the explosion of data in many forms. To ensure it truly delivers value, the EDW needs to undergo a transformation.

The EDW’s Role in Data Integration

With the proliferation of web services and the development of Business Intelligence applications that promote data federation/virtualization, why do we need an EDW for data integration? How do we leverage our existing data warehouse to get the most out of it, using these new technologies? First, let’s look at the reasons data warehouses were created in the first place.

Data integration is one of the key foundational reasons for the creation of a data warehouse. It sounds sporty to say we’re going to use web services, and the BI sales teams are great at making it sound like their applications can work miracles. The truth, though, is that enabling effective data integration requires more sophistication than web services or an Enterprise Service Bus can provide, and a lot more work than the BI vendors would have you believe.

One reason for leveraging a data warehouse for data integration is data quality. As author and social marketing strategist, Paul Gillin, says, “Data quality is corporate America’s dirty little secret.” It’s common for many source systems to have dirty or non-standard data that makes data integration a real challenge. One of the benefits of ingesting this data into the EDW is that it can be cleaned and transformed as it’s loaded, improving data integration and the quality of the data, which in turn significantly improves the quality of information provided from the warehouse. Trying to use web services or data federation in these cases can be problematic at best, particularly if data sets are expansive or multiple source systems are involved.
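To make the idea concrete, the sketch below illustrates the kind of lightweight cleansing and standardization an EDW load process might apply to incoming records. The field names and rules are purely hypothetical and would, in practice, be driven by the source systems involved.

    import re

    # Hypothetical standardization rules applied as records are loaded into the EDW.
    US_STATES = {"maryland": "MD", "md.": "MD", "virginia": "VA", "va.": "VA"}

    def cleanse_customer_record(raw: dict) -> dict:
        """Return a cleansed copy of one source-system customer record."""
        record = dict(raw)
        # Trim whitespace and normalize case on name fields.
        record["customer_name"] = " ".join(raw.get("customer_name", "").split()).title()
        # Standardize state codes so records from different sources integrate cleanly.
        state = raw.get("state", "").strip().lower()
        record["state"] = US_STATES.get(state, state.upper())
        # Strip non-digits from phone numbers to a single canonical format.
        record["phone"] = re.sub(r"\D", "", raw.get("phone", ""))
        return record

    print(cleanse_customer_record(
        {"customer_name": "  acme   corp ", "state": "maryland", "phone": "(410) 555-0100"}))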

Another reason for integrating data within an EDW is to present one version of the truth. EDW developers can work with each source system and the appropriate end users to interpret the data and determine how best to cleanse, transform, integrate and present it. Once this work is done, each EDW user is guaranteed to receive the same answer to the same question every time. Without the EDW, chaos reigns and the truth is elusive, because each web service developer or end user must decide how to handle each of these things, leading to a great deal of inconsistency. This leads to executives and managers making decisions based on faulty information, and creates an environment in which users no longer trust the information provided to them.

And then there are the many issues surrounding historical data that each source system owner must contend with. Often, growing historical data negatively impacts the performance of a system, even though it may rarely be viewed. However, data owners are typically reluctant to archive the data, because they may still need to access it in the future.

Historical data is also a problem when new systems are implemented and there is a desire to retain old data that will not be imported into the new system. This is especially prevalent when implementing new ERP systems, for example.

The EDW can provide a solution to each of these problems. Historical data can be archived to the EDW, eliminating the performance problems for the source system while still ensuring the data is accessible if needed. Because the EDW uses special data warehousing constructs to store data and leverages techniques like data partitioning, it can store the data with much less impact on query performance. It also can facilitate the deletion of archived historical data beyond defined retention periods.
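As a rough illustration of how archived history and retention periods might be handled, the following sketch purges archived rows that fall outside a retention window. The table, column names and seven-year policy are hypothetical, and sqlite3 stands in for the warehouse database only so the example is self-contained.

    import sqlite3
    from datetime import date, timedelta

    RETENTION_YEARS = 7  # hypothetical retention policy

    # In a real EDW this would be a connection to the warehouse database;
    # sqlite3 is used here only so the sketch is self-contained and runnable.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE archived_orders (order_id INTEGER, load_date TEXT)")
    conn.execute("INSERT INTO archived_orders VALUES (1, '2009-01-15'), (2, '2017-06-30')")

    cutoff = date.today() - timedelta(days=365 * RETENTION_YEARS)
    # Purge archived history beyond the retention period. A production warehouse
    # would typically drop whole date partitions instead of deleting row by row.
    deleted = conn.execute(
        "DELETE FROM archived_orders WHERE load_date < ?", (cutoff.isoformat(),)
    ).rowcount
    conn.commit()
    print(f"Purged {deleted} archived rows older than {cutoff}")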

One of the problems with historical data when implementing new systems is figuring out how to make the old data fit the new architecture. By archiving this data into the EDW, this problem can be substantially avoided. Archiving also presents an opportunity to transform the data, without worrying about constraints built into the new application, so that historical data can be merged effectively with current data in queries from the EDW.

Query Performance and Data Security in the EDW

Two of the key drivers behind the need for an EDW have been the need for fast query performance across integrated data sets, and the need for data security and privacy. These drivers have not gone away, but the desire for a more agile approach that also can accommodate newer technologies, such as Big Data and cloud architectures, has become more urgent.

Problems with query performance/response time can occur in any system, and the EDW is no exception. There are many potential causes for these kinds of problems within the EDW and many potential solutions. Each situation must be looked at with a critical eye and investigated with a holistic approach in order to derive the correct solution. However, these issues are outside the intended scope of this article. Rather, the intent here is to address the shortcomings with the classic EDW architecture while ensuring we don’t reintroduce the problems the EDW was created to resolve.

The use of specialized data warehousing techniques and database constructs provides enhanced performance, particularly for integrated data from multiple source systems. If designed properly, an EDW can provide much better integration and query performance across data from multiple sources, and with much bigger data sets, than can be achieved by querying the source systems directly. As the volume of data grows, performance can still be an issue in any data warehouse, and must be continually examined as an area for improvement.

As mentioned above, another key driver behind using an EDW is data security. Clearly, in the world we live in today, system and data security are preeminent concerns across all IT applications and systems. An advantage of a data warehouse is that it requires only one comprehensive security solution that can be easily controlled, monitored and maintained, as opposed to trying to apply effective security controls to multiple source systems.

Over the years, alternatives to the traditional EDW have been proposed; however, some of the alternatives frequently touted can significantly impair performance rather than help it, and also create problems with data security and data privacy.

One example of such an approach is data federation. A federated database system is defined by Wikipedia as “a type of meta-database management system, which transparently maps multiple autonomous database systems into a single federated database.” While such an approach eliminates the need for ingesting and storing source system data into another repository, it still requires a great deal of work to architect and maintain. It also doesn’t really handle data integration. This introduces the risk described earlier with creating multiple versions of the truth.

This approach also introduces performance problems as application providers attempt to integrate data from various source systems for presentation to end users. This is primarily due to network latency and the need to perform multiple queries on multiple systems. For larger volumes of data, this also creates more network traffic, which can negatively impact more than just the applications accessing the data.
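To see why, consider the simplified, hypothetical sketch below of what a federated request often amounts to in practice: separate queries against each source system, with the join performed in the application, so every report pays for multiple round trips and moves full result sets across the network.

    def query_crm(customer_ids):
        """Stand-in for a remote query to the CRM system (one network round trip)."""
        return {cid: {"name": f"Customer {cid}"} for cid in customer_ids}

    def query_billing(customer_ids):
        """Stand-in for a remote query to the billing system (another round trip)."""
        return {cid: {"balance": 100.0 * cid} for cid in customer_ids}

    def federated_report(customer_ids):
        # Two separate remote queries plus an in-application join -- work that the
        # source systems and the network absorb on every request, for every consumer.
        crm = query_crm(customer_ids)
        billing = query_billing(customer_ids)
        return [{"id": cid, **crm[cid], **billing[cid]} for cid in customer_ids]

    print(federated_report([1, 2, 3]))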

Using a federated approach also creates risks for data security and privacy. Using an EDW not only allows for a comprehensive security structure that ensures users see only the data they should, it also allows for more effective data obfuscation and data encryption. Without the EDW, every application must consider how to implement effective security on the data it provides, which can lead to data leakage and cause stakeholders to “lose control” over their sensitive data.
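The following minimal sketch illustrates the point about centralizing protection: obfuscation rules are defined once, in the warehouse or virtualization layer, rather than re-implemented by every consuming application. The column names and masking rules are hypothetical.

    # Hypothetical column-level masking applied once, in the warehouse/virtual layer,
    # so every consuming application receives already-protected data.
    MASKING_RULES = {
        "ssn": lambda v: "***-**-" + v[-4:],            # show last four digits only
        "email": lambda v: v[0] + "***@" + v.split("@", 1)[1],
        "salary": lambda v: None,                        # suppress entirely for most roles
    }

    def mask_row(row: dict, privileged: bool = False) -> dict:
        """Apply masking rules unless the requesting user is privileged."""
        if privileged:
            return row
        return {col: MASKING_RULES.get(col, lambda v: v)(val) for col, val in row.items()}

    print(mask_row({"name": "Ada Smith", "ssn": "123-45-6789",
                    "email": "ada@example.com", "salary": 98000}))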


BI Challenges

One of the challenges facing the data warehouse and BI today is the often rigid and time-consuming development process in place to support traditional BI reporting needs. This can be frustrating to report consumers, particularly in light of the new styles of reporting that are available. This rigidity inhibits the ability to provide real self-service capabilities, as well as to introduce new BI tools that can be leveraged for investigative analysis, predictive analysis and data science.

There is still a need for BI reporting that follows a more rigid approach to ensure data quality by formally testing reports and ensuring auditability and reproducible results. This is particularly important for financial reports or reports to outside agencies. However, this rigid approach gets in the way when developing support for newer forms of reporting and analytics. Clearly, what is needed to solve these BI challenges is a more agile process for developing and delivering BI capabilities.

In 2014, Gartner introduced the “Bi-modal” concept¹. This concept refers to two different modes of IT development. Mode 1 is the more traditional style of development, in which every system must be reliable, predictable and correct. Such systems have formal requirements and designs, are formally tested, governed and managed, and often must be auditable. Mode 2 refers to more agile development methods that focus on speed and agility. These methods allow for experimentation and flexibility, and, in the BI realm, support self-service analytics². Users and stakeholders desperately want more agile reporting and self-service analytics, but this cannot be accomplished by sticking with the classic data warehouse architecture and the use of monolithic BI applications. To achieve these objectives, we must modernize the EDW.

Modernizing the EDW

Hopefully, it’s clear that the need for an EDW still exists. However, it should be equally clear that the classic data warehouse architecture is no longer sufficient to meet the needs of users who desire self-service reporting/analytics and need more agility and faster results than in the past. The question, then, is how do we leverage new technologies to increase the effectiveness, usability and agility of the data warehouse?

1 M. Mesaglio and S. Mingay, Bimodal IT: How to Be Digitally Agile Without Making a Mess, July 2014, see https://www.gartner.com/doc/2798217/bimodal-it-digitally-agile-making
2 Rick F. van der Lans, Developing a Bi-Modal Logical Data Warehouse Architecture Using Data Virtualization, September 2016


The classic EDW was conceived to solve numerous problems, and it solved them pretty well. Among the many benefits provided are:

• High data quality

• Consistent, reproducible reports

• Integrated data across various data sources

• High performance compared to querying from source systems

• One comprehensive security solution

• Improved data governance

Unfortunately, all of these benefits came with a price. In order to provide the benefits listed, the EDW required fairly rigid processes. This approach still works for reporting on well-structured data that requires a high standard of data quality, such as financial data, but the lack of agility limits the introduction of new techniques and new technology, such as the use of data lakes and self-service reporting or analytics. The nature of the EDW makes it fairly inflexible and leads to lengthy time to market for new capabilities. Support for big data can be complex or non-existent, the ability to provide self-service capabilities is limited and supporting streaming data is difficult. With the adoption of agile methodologies, the proliferation of new technologies and the explosion of data, it can be argued that the classic EDW has reached the end of its shelf life.

One solution to these problems is to develop a Logical Data Warehouse. For years the terms Federated Data Warehouse or Virtual Data Warehouse have been thrown around, but, as those solutions are defined, they actually reintroduce many of the problems that a traditional data warehouse was created to solve. In recent years, however, technology has advanced to a point where it is feasible to achieve much of what these solutions had hoped to achieve without the inherent problems posed by either of these solutions by themselves. This can be done through Data Virtualization.

Data Virtualization is not the same as a Virtual Data Warehouse, primarily because it recognizes the need for a physical data warehouse in most instances, while still enabling many of the benefits hoped for, but not realized, with Federated and Virtual Data Warehouses. Data Virtualization technology has come a long way in recent years and must be seriously considered as a way to overcome some of the shortcomings inherent in the traditional data warehouse architecture discussed throughout this whitepaper (see Figure 1).

Data Virtualization supports the creation of a logical data warehouse architecture (Figure 2). It supports the bi-modal concept mentioned earlier through an agile architecture that removes the end user from the details of how and where data is stored. BI Analyst and Consultant, Rick F. van der Lans, provides the following definition for a logical data warehouse architecture:

“The logical data warehouse architecture delivers data and metadata to data consumers in support of decision-making, reporting, and data retrieval; whereby data and metadata stores are decoupled from data consumers through a metadata-driven layer to increase flexibility, and whereby data and metadata are presented in a subject-oriented, integrated, time-variant, and reproducible style.” – Rick F. van der Lans³

Figure 1 - Traditional Data Warehouse

There is no need for data consumers to be aware of where and how the data they are accessing is stored. They should not care or need to know whether the data they are using comes from a relational database, a Hadoop database, a production (source) system or an Excel spreadsheet. Likewise, knowledge of SQL and the details of table joins should not be necessary for consumers of data³. As far back as 1970, E. F. Codd, developer of the relational model for database management, recognized this. Codd said, “Future users of large data banks must be protected from having to know how the data is organized (…) application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” He referred to this as the data independence objective.

The use of Data Virtualization can help us go a long way toward achieving that objective. Comprehensive BI applications do provide for the development of a semantic layer to remove data consumers from these sorts of details, and to provide a more user-friendly mechanism for reporting and analysis. However, each BI application requires its own semantic layer to be developed and maintained. Through the use of data virtualization, it now becomes possible for IT to develop one, universal semantic layer that can be leveraged by all BI applications, as well as other external data consumers and applications. This opens up a world of possibilities for BI applications and other methods of data consumption without all of the work required to develop and maintain multiple semantic layers.
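A minimal sketch of the idea, using purely hypothetical view and source names: the semantic layer is a shared, metadata-driven mapping from business-friendly logical views to physical sources, so every BI tool and data consumer resolves the same logical name to the same underlying query.

    # Hypothetical, metadata-driven semantic layer: business-friendly logical views
    # mapped once to physical sources, shared by every BI tool and data consumer.
    SEMANTIC_LAYER = {
        "customer_revenue": {
            "source": "edw.fact_sales JOIN edw.dim_customer USING (customer_key)",
            "columns": {"customer": "dim_customer.customer_name",
                        "revenue": "fact_sales.sale_amount"},
        },
        "web_clickstream": {
            "source": "hadoop.clickstream_events",   # non-relational source, same interface
            "columns": {"visitor": "visitor_id", "event_time": "event_ts"},
        },
    }

    def resolve(view: str) -> str:
        """Translate a logical view name into the physical query it represents."""
        meta = SEMANTIC_LAYER[view]
        select = ", ".join(f"{expr} AS {name}" for name, expr in meta["columns"].items())
        return f"SELECT {select} FROM {meta['source']}"

    # A data consumer asks only for the logical view; storage details stay hidden.
    print(resolve("customer_revenue"))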

3 Rick F. van der Lans, Developing a Bi-Modal Logical Data Warehouse Architecture Using Data Virtualization, September 2016


Data virtualization also solves the problem of how to ensure data security and privacy, while opening up data access to more data consumers and applications. As with the issue of the semantic layer, data virtualization enables the development and maintenance of one security solution, rather than multiple solutions across multiple applications and data stores.

Figure 2 - Logical Data Warehouse

With the introduction of data virtualization, the EDW can be transformed into a logical data warehouse that brings agility to data warehousing, while significantly increasing options for BI reporting, data analytics and other data services. It facilitates this without negatively impacting the original benefits of an EDW, such as data quality, data consistency, query performance and data security.

Modern Tools for a Modern Data Warehouse

Technology has advanced significantly since the concept of an EDW was originally developed. Over the years, the EDW has adopted and integrated that new technology: everything from sophisticated extract, transform and load tools, to comprehensive BI applications and the use of data warehouse appliances. However, the concept of a logical data warehouse has the potential to transform data warehousing more than any of these technologies.

As mentioned above, the key to developing a modern, logical data warehouse is data virtualization (DV). DV tools provide the unified virtual layer needed to abstract source complexity and present data as though it comes from a single source. Use of DV is a faster, easier and more agile way to approach data integration, BI reporting and data analytics. While there are many DV applications available, only a few really stand out. One of those is the Denodo Platform (www.denodo.com). Denodo arguably provides one of the most sophisticated and easy-to-use DV applications on the market. Denodo provides the following features:

• A robust push-down query optimizer that enables efficient execution of queries spanning multiple sources

• Advanced caching capabilities

• Comprehensive data security, including data masking, encryption and custom authentication provider integration

• A means to view data lineage and perform impact analysis of changes

• Version control of views that make up the semantic layer

• Google-like search capability

• A wide variety of connectors, including big data sources such as Hadoop, NoSQL databases and RESTful APIs. All available connectors are included in the price of Denodo.

• Data profiling

• Platform availability on AWS and Azure

Other noteworthy examples of DV tools are TIBCO Data Virtualization (formerly Cisco Data Virtualization) and Red Hat JBoss Data Virtualization. These products and others offer many of the features also provided by Denodo.
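Because DV platforms generally expose standard interfaces such as ODBC, JDBC and REST, consuming the virtual layer from code or a BI tool looks like querying a single ordinary database. A brief sketch, assuming a hypothetical ODBC data source named "dv_layer" and a hypothetical published virtual view called customer_360:

    import pyodbc  # standard ODBC client; the DSN below is a hypothetical one

    # Connect to the data virtualization server exactly as if it were one database.
    conn = pyodbc.connect("DSN=dv_layer;UID=report_user;PWD=secret")
    cursor = conn.cursor()

    # "customer_360" is a hypothetical virtual view that the DV layer assembles
    # behind the scenes from the EDW, Hadoop and other sources.
    cursor.execute(
        "SELECT customer_name, lifetime_value FROM customer_360 WHERE region = ?", "EMEA")
    for customer_name, lifetime_value in cursor.fetchall():
        print(customer_name, lifetime_value)

    conn.close()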

Besides the use of data virtualization, there are other newer technologies to consider as part of a logical data warehouse solution. The significant advances in cloud technology have inevitably led to questions about leveraging the cloud for data warehousing solutions. For a traditional data warehouse that ingests and integrates highly structured data, and leverages structures such as star schemas and data marts, it is still difficult to beat the performance of a data warehouse appliance such as Oracle’s Exadata or IBM’s Netezza platforms. Still, leveraging the cloud for data warehousing is attractive to many organizations that want to take advantage of the benefits of the cloud, and avoid having to deal with maintaining physical servers and with issues like power, space and cooling.

The concept of the logical data warehouse means organizations can continue to use their existing data warehouse technology and environment, if desired, but also expand the warehouse to include sources that reside in the cloud. Alternatively, for organizations intent on migrating to the cloud, options exist that allow the physical data warehouse to be migrated to the cloud with less risk of significant impacts to performance than in the past.

One of these options is Amazon Redshift (www.amazon.com/redshift). Redshift is a cloud data warehouse platform running on AWS. The columnar design of Redshift, similar to technologies such as Sybase IQ, provides significant improvements in performance over a traditional relational database management system. It also brings with it all of the inherent benefits of running on the cloud, such as reduced cost, simplicity and rapid scalability.

Another promising option is a relatively new product called Snowflake (www.snowflake.net). Whereas Amazon Redshift is built on the PostgreSQL platform, Snowflake was designed from the ground up to run on the cloud. Snowflake was developed to support both structured and semi-structured data, and leverages the capabilities of AWS to scale up and down on demand. Using a tool called Snowpipe, Snowflake leverages AWS S3 event notifications to support continuous loading of streaming data. Some benchmarks have shown Snowflake significantly outperforming Redshift and Google BigQuery on larger or more complex datasets. The efficiency of Snowflake also means it tends to be more cost-effective than these other options.
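As a brief sketch of what Snowflake’s semi-structured support can look like from Python (the account details, table and JSON structure below are hypothetical; snowflake-connector-python is Snowflake’s published connector), SQL can reach directly into JSON stored in a VARIANT column without up-front relational modeling:

    import snowflake.connector  # pip install snowflake-connector-python

    # Connection parameters are placeholders; real values come from your account.
    conn = snowflake.connector.connect(
        user="REPORT_USER", password="...", account="myorg-myaccount",
        warehouse="ANALYTICS_WH", database="EDW", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Snowflake stores semi-structured data in VARIANT columns and lets SQL
    # reach into the JSON directly -- no up-front relational modeling required.
    cur.execute("""
        SELECT event:device.type::string AS device, COUNT(*) AS events
        FROM clickstream_raw   -- hypothetical table with a VARIANT column "event"
        GROUP BY device
        ORDER BY events DESC
    """)
    for device, events in cur.fetchall():
        print(device, events)

    conn.close()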

For processing large amounts of data extremely quickly, some organizations are turning to GPU databases. Rather than using traditional CPUs for processing, GPU databases use sets of Graphics Processing Units (GPUs). Whereas high-end CPUs typically have anywhere from 16 to 32 cores per processor, high-end GPUs can have over 4,000 cores. GPU databases are becoming more and more popular for sophisticated analytics, such as BI, artificial intelligence and geospatial workloads, and GPUs themselves are used extensively in the blockchain world. One popular GPU database product is Kinetica. Kinetica provides ODBC and other connectors, as well as a REST API. It connects to popular BI tools, such as Tableau, and open source visualization tools, such as Kibana. It also comes with its own BI framework, called Reveal. Kinetica has native support for geospatial processing, and includes an OpenGL rendering pipeline to handle extremely complex visualizations of WMS layers over maps. Kinetica has also partnered with Amazon, making its GPU database technology available for lease through AWS.

Conclusion

The use case for a physical data warehouse is still alive and well. ERP applications and other similar applications remain perfect candidates for the transformation and integration that a data warehouse makes possible. At the same time, tremendous advances in technology as well as huge increases in the size and type of data to be processed go well beyond the original intent of a data warehouse, and cannot be dealt with in the same way.

ERP applications and other more traditional “business” applications typically store and process highly structured data, and data that begs to be integrated with other data sources to facilitate business decision making. These datasets are perfect for ingestion, transformation and integration into a physical data warehouse. These datasets lend themselves to the development of star schemas, data marts, cubes and other data warehouse structures that facilitate integrated, time variant reporting. Typically, these datasets remain a manageable size, and can be leveraged with traditional BI tools. If well designed and maintained, the data structures containing this data will also perform well as the size of the dataset grows.


As we’ve discussed, however, technology and data have evolved significantly since the advent of the data warehouse. Today we’re faced with huge amounts of data that can be structured, semi-structured or unstructured. Handling Big Data or unstructured data in a traditional EDW environment simply won’t work. The volume of data will be too much for the system to handle, and unstructured data can be very difficult, if not impossible, to model in a relational schema.

This is where the logical data warehouse comes in. As this paper shows, by leveraging data virtualization, a logical data warehouse can be created that includes the EDW, but also allows for Big Data, unstructured data and many other sources and applications to be included in the data warehouse. The promise of data virtualization is a logical data warehouse environment that provides tremendous new capabilities, without giving up the advantages that the classic data warehouse design brings with it. Done properly, data virtualization will provide the framework for greatly expanding the capabilities a warehouse can provide, in a way that is more agile, maintainable and cost-effective.
