Modernizing the Data Warehouse


A Technical Whitepaper

Norman F. Walters
Freedom Consulting Group, Inc.
August 2018

Copyright 2018 Freedom Consulting Group, Inc. All rights reserved. All other company and product names referenced in this document may be trademarks or registered trademarks of their respective owners.

Table of Contents

Introduction
The EDW’s Role in Data Integration
Query Performance and Data Security in the EDW
BI Challenges
Modernizing the EDW
Modern Tools for a Modern Data Warehouse

Introduction

With the advent of Big Data, cloud architectures, microservices and NoSQL databases, many organizations are asking themselves if they still need a data warehouse. This paper explores that question in detail, but the short answer is, “Yes!” However, if your Enterprise Data Warehouse still looks the same as it did 10 years ago, it’s probably time to modernize. The fundamental reasons for creating an Enterprise or Corporate Data Warehouse, and the principles behind it, have not been done away with by newer technologies. That doesn’t mean, however, that some of these newer technologies can’t be leveraged as part of a data warehouse strategy. In fact, they should be used as part of the overall data warehouse strategy and architecture.

After decades of use, there is still a great deal of misunderstanding about the nature and purpose of an Enterprise Data Warehouse (EDW). An EDW is much more than a large database for storing and querying lots of data from different source systems. In many ways, the term “data warehouse” is a misnomer. In a real warehouse, trucks bring in goods, forklifts store those goods somewhere, and there they remain until someone comes to remove them for use. A corporate data warehouse, however, is much more than a place to store a bunch of data until someone needs it. A true corporate data warehouse allows data to be cleansed, transformed into a more easily used and understandable form as it comes in, and enriched, which in turn enables effective integration of data from multiple source systems to produce information. Unlike goods trucked to a warehouse, what comes out of a data warehouse is a fundamentally different product than the data that first entered.

The question, then, is not whether there is still a need for a data warehouse, but what form the EDW should take in light of tremendous advances in technology and the explosion of data in many forms. To ensure it truly delivers value, the EDW needs to undergo a transformation.
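To ground the cleanse, transform and enrich steps described above, the following is a minimal sketch of an ingest-time cleansing routine. It is illustrative only: the column names, country-code rules and conformed customer dimension are assumptions invented for this example, not details from the whitepaper.

```python
# Sketch of a cleanse/transform/enrich step during EDW ingest.
# All table, column and rule names are hypothetical.

from datetime import datetime

# Assumed standardization rules: source systems encode the same
# value in different, non-standard ways.
COUNTRY_CODES = {"usa": "US", "u.s.": "US", "united states": "US",
                 "uk": "GB", "united kingdom": "GB"}

def cleanse(row: dict) -> dict:
    """Trim, standardize and type raw source values before loading."""
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in row.items()}
    # Map non-uniform country spellings to a single canonical code.
    raw_country = cleaned.get("country", "").lower()
    cleaned["country"] = COUNTRY_CODES.get(raw_country, raw_country.upper())
    # Parse the source-specific date string into one canonical type.
    cleaned["order_date"] = datetime.strptime(
        cleaned["order_date"], "%m/%d/%Y").date()
    return cleaned

def enrich(row: dict, customer_dim: dict) -> dict:
    """Attach a conformed dimension key so rows from different
    source systems integrate on the same customer."""
    row["customer_key"] = customer_dim.get(row["customer_id"], -1)  # -1 = unknown
    return row

# A raw source record leaves the pipeline as a consistent, integrable record.
customer_dim = {"C-1001": 42}
raw = {"customer_id": "C-1001", "country": " U.S. ", "order_date": "07/04/2018"}
print(enrich(cleanse(raw), customer_dim))
# {'customer_id': 'C-1001', 'country': 'US',
#  'order_date': datetime.date(2018, 7, 4), 'customer_key': 42}
```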
The EDW’s Role in Data Integration

With the proliferation of web services and the development of Business Intelligence applications that promote data federation and virtualization, why do we need an EDW for data integration? How do we leverage our existing data warehouse, alongside these new technologies, to get the most out of it? First, let’s look at the reasons data warehouses were created in the first place.

Data integration is one of the key foundational reasons for creating a data warehouse. It sounds sporty to say we’re going to use web services, and BI sales teams are great at making it sound like their applications can work miracles. The truth, though, is that effective data integration requires more sophistication than web services or an Enterprise Service Bus can provide, and a lot more work than the BI vendors would have you believe.

One reason for leveraging a data warehouse for data integration is data quality. As author and social marketing strategist Paul Gillin says, “Data quality is corporate America’s dirty little secret.” It’s common for source systems to contain dirty or non-standard data that makes integration a real challenge. One of the benefits of ingesting this data into the EDW is that it can be cleansed and transformed as it’s loaded, improving both integration and the quality of the data, which in turn significantly improves the quality of the information the warehouse provides. Trying to use web services or data federation in these cases is problematic at best, particularly if data sets are expansive or multiple source systems are involved.

Another reason for integrating data within an EDW is to present one version of the truth. EDW developers can work with each source system and the appropriate end users to interpret the data and determine how best to cleanse, transform, integrate and present it. Once this work is done, every EDW user receives the same answer to the same question every time. Without the EDW, chaos reigns and the truth is elusive, because each web service developer or end user must make these decisions independently, leading to a great deal of inconsistency. That leads to executives and managers making decisions based on faulty information, and creates an environment in which users no longer trust the information provided to them.

And then there are the many issues surrounding historical data that each source system owner must contend with. Growing historical data often degrades a system’s performance even though it may rarely be viewed, yet data owners are typically reluctant to archive it because they may still need to access it in the future. Historical data is also a problem when new systems are implemented and there is a desire to retain old data that will not be imported into the new system; this is especially prevalent when implementing new ERP systems, for example.

The EDW can provide a solution to each of these problems. Historical data can be archived off to the EDW, eliminating the performance problems for the source system while still ensuring the data is accessible if needed. Because the EDW uses specialized data warehousing constructs to store data and leverages techniques such as data partitioning, it can store this data with much less impact on query performance.
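As a concrete illustration of partition-based archiving, here is a minimal sketch assuming a PostgreSQL-backed warehouse reached through psycopg2; the whitepaper does not prescribe a platform, and the sales_history table, its columns and the seven-year retention window are hypothetical.

```python
# Sketch: range-partitioned archive table in a PostgreSQL-backed EDW.
# Names are illustrative; the DDL is PostgreSQL 10+ declarative
# partitioning. Other warehouse platforms expose equivalent constructs.
import psycopg2

PARENT_DDL = """
CREATE TABLE IF NOT EXISTS sales_history (
    sale_id     bigint,
    order_date  date NOT NULL,
    amount      numeric(12,2)
) PARTITION BY RANGE (order_date);
"""

def ensure_year_partition(cur, year: int) -> None:
    """Each archived year lands in its own partition, so queries that
    filter on order_date never scan irrelevant historical data."""
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS sales_history_{year} "
        f"PARTITION OF sales_history "
        f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
    )

def drop_expired_partitions(cur, current_year: int, retention_years: int) -> None:
    """Retention enforcement becomes a metadata operation: dropping a
    whole partition avoids slow, log-heavy row-by-row deletes."""
    cutoff = current_year - retention_years
    cur.execute(f"DROP TABLE IF EXISTS sales_history_{cutoff};")

with psycopg2.connect("dbname=edw") as conn, conn.cursor() as cur:
    cur.execute(PARENT_DDL)
    for year in range(2010, 2019):
        ensure_year_partition(cur, year)
    drop_expired_partitions(cur, current_year=2018, retention_years=7)
```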
The EDW can also facilitate the deletion of archived historical data beyond defined retention periods. Another problem with historical data when implementing new systems is figuring out how to make the old data fit the new architecture; by archiving this data into the EDW, that problem can be substantially avoided. Archiving still presents an opportunity to do some transformation of the data, without worrying about constraints built into the new application, so that historical data can be merged effectively with current data in queries from the EDW.

Query Performance and Data Security in the EDW

Two of the key drivers behind the need for an EDW have been the need for fast query performance across integrated data sets, and the need for data security and privacy. These drivers have not gone away, but the desire for a more agile approach that can also accommodate newer technologies, such as Big Data and cloud architectures, has become more urgent.

Problems with query performance and response time can occur in any system, and the EDW is no exception. There are many potential causes for these problems within the EDW and many potential solutions; each situation must be looked at with a critical eye and investigated holistically in order to derive the correct solution. Those issues, however, are outside the intended scope of this article. Rather, the intent here is to address the shortcomings of the classic EDW architecture while ensuring we don’t reintroduce the problems the EDW was created to resolve.

The use of specialized data warehousing techniques and database constructs provides enhanced performance, particularly for integrated data from multiple source systems. If designed properly, an EDW can provide much better integration and query performance across data from multiple sources, and with much bigger data sets, than querying the source systems directly. As the volume of data grows, performance can still be an issue in any data warehouse, and must constantly be looked at as an area for improvement.

As mentioned above, another key driver behind using an EDW is data security. Clearly, in the world we live in today, system and data security are preeminent concerns across all IT applications and systems. An advantage of a data warehouse is that it requires only one comprehensive security solution that can be easily controlled, monitored and maintained, as opposed to trying to apply effective security controls to multiple source systems.

Over the years, alternatives to the traditional EDW have been