Migrating to Virtual Data Marts Using Data Virtualization: Simplifying Business Intelligence Systems
A Technical Whitepaper

Rick F. van der Lans
Independent Business Intelligence Analyst
R20/Consultancy

January 2015

Sponsored by Cisco

Copyright © 2015 R20/Consultancy. All rights reserved. Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks, go to this URL: www.cisco.com/go/trademarks. Trademarks of companies referenced in this document are the sole property of their respective owners.

Table of Contents

1 Introduction
2 Data Marts
3 The Costs of Data Marts
4 The Alternative: Virtual Data Marts using Data Virtualization
5 Developing Virtual Data Marts
  Step 1: Recreating Physical Data Marts as Virtual Data Marts
  Step 2: Improving Query Performance on Virtual Data Marts
  Step 3: Identifying Common Specifications Among Virtual Data Marts
  Step 4: Redirecting Reports to Access Virtual Data Marts
  Step 5: Extracting Definitions from the Reporting Tools
  Step 6: Defining Security Rules
  Step 7: Adding External Data to Virtual Data Marts
6 Getting Started
About the Author Rick F. van der Lans
About Cisco Systems, Inc.

1 Introduction

Almost every BI system is made up of many data marts. These data marts are commonly developed to improve query performance, to deliver data to users with the right structure and at the right aggregation level, to minimize network delay for geographically dispersed users, to allow the use of specific database technologies, and to give users more control over their data. Unfortunately, data marts are expensive because they require a lot of work to develop, operate, and maintain. They also complicate the architectures of BI systems.
For example, changes made to the data warehouse can lead to a multitude of changes throughout all the data marts, and changes made to a report can involve complex modifications to the data marts and the corresponding ETL programs. In addition, data marts can degrade data quality and complicate data governance. But most importantly, data marts reduce the flexibility of a BI system. The main reason is that they are physical data marts, so they occupy disk storage, they must be loaded with new data periodically, they have to be managed, tuned, and optimized, and so on.

Nowadays, organizations demand flexible BI systems. This does not mean that all the data marts should be dropped, because then all their benefits would disappear as well. A better solution is to replace physical data marts with virtual data marts. With virtual data marts, data can still be delivered with the right structure, in the right form, and at the right aggregation level. But they are not physical, so they do not occupy disk storage, they do not have to be loaded with new data periodically, they do not have to be managed, tuned, and optimized, and so on. Virtual data marts are much more flexible than physical data marts. Changing them is predominantly a change of specifications.

Virtual data marts can be developed in various ways. This whitepaper describes in detail a step-by-step approach for migrating physical data marts to virtual data marts using Cisco Information Server (CIS). The approach is based on an evolutionary migration that does not impact the existing reporting workload.

2 Data Marts

Data Marts – Most BI systems make use of data marts; see Figure 1. Some even comprise countless data marts. The data they contain is derived from the data warehouse. ETL programs are used to copy data from the data warehouse to the data marts periodically. Each data mart is developed for a specific group of users, all with comparable information and reporting needs.
A data mart contains a subset of all the data from a data warehouse. Also, whereas a data warehouse contains the lowest level of data, a data mart usually contains a slightly aggregated version of that data. Most reports run on one of those data marts instead of on the data warehouse.

Figure 1 Traditionally, a BI system consists of many data marts.

Why Are Data Marts Developed? – In most BI systems, the majority of the reporting and analytics is processed on data marts. Common reasons for developing a data mart are:

Query performance: The dominant reason for organizations to develop data marts is query performance. Without data marts, all the queries must be executed on the data warehouse. This query workload might be too intense for the database server managing the data warehouse, leading to long waiting times for the users. By implementing data marts, most of the queries are offloaded from the data warehouse to these data marts, and by distributing the query workload over multiple data marts, the performance of queries improves.

Data structure: Particular reporting or analytical tools require data to have a certain structure. For example, a tool might demand that the tables are organized as a star schema. If the tables in the data warehouse are normalized, the tool won’t be able to access the data. In this case, a data mart is used to present the same data in the right structure. In such a system, ETL scripts are used to transform the normalized data to a star schema arrangement.

Geographically dispersed users: If users are geographically dispersed, querying a central data store might lead to considerable network delay, which slows down the reports. In this situation, it’s better to move data physically closer to where the users are located, taking network traffic out of the equation.
Storage technology: Data marts allow the use of different storage technologies for different reports and reporting tools. For example, the performance of certain reports can be improved by storing the data in a multi-dimensional database server, in which data is stored in a cube-like form, instead of in a SQL database server.

Local control: By having a real copy of their data in a dedicated database, users have more control over that data. For example, it allows them to add private data or data from external sources to enrich their reporting capabilities.

Technologies Used – As indicated, data marts can be developed with different data storage technologies. The most popular form is a SQL database. In this case, a separate set of tables is developed in a SQL database. Technically, this set of tables may be defined inside the database that contains the data warehouse, but quite often a separate database is developed. A second popular form is analytical cubes. Products such as Microsoft Analysis Services and KylinOLAP don’t store the data in tables, but in cubes in which the data is organized in dimensions and hierarchies. Data stored in cubes is typically accessed not through SQL, but through a dedicated language, such as MDX. The third option is to store the data of a data mart in files, such as Microsoft Excel files, Microsoft Access files, or simple comma-delimited files.

Star Schema – The tables in a data mart developed with SQL technology are usually organized as a star schema [1]. In a star schema the tables are classified as dimension tables (or dimensional tables) and fact tables. Fact tables are the central tables in a star schema. A row in a fact table usually represents a business event. Examples of events that a fact table might record are money withdrawals from a bank account, bookings for a flight, and payments at a counter.
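As a rough illustration of the classification into fact tables and dimension tables, the sketch below builds a tiny star schema in an in-memory SQLite database using Python's standard sqlite3 module. The table names (fact_booking, dim_customer, dim_flight), the columns, and the sample rows are hypothetical and chosen only for this example; they are not taken from the whitepaper or from any specific product. Flight bookings are used because they are one of the business events mentioned above.

```python
import sqlite3

# All table names, column names, and rows below are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one row per business object.
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        country     TEXT NOT NULL
    );
    CREATE TABLE dim_flight (
        flight_id   INTEGER PRIMARY KEY,
        origin      TEXT NOT NULL,
        destination TEXT NOT NULL
    );
    -- Fact table: one row per business event (a flight booking).
    -- It references the dimension tables; the dimensions do not
    -- reference each other.
    CREATE TABLE fact_booking (
        booking_id  INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        flight_id   INTEGER NOT NULL REFERENCES dim_flight(flight_id),
        amount      REAL NOT NULL
    );
    INSERT INTO dim_customer VALUES (1, 'Alice', 'NL'), (2, 'Bob', 'US');
    INSERT INTO dim_flight VALUES (10, 'AMS', 'JFK'), (20, 'AMS', 'LHR');
    INSERT INTO fact_booking VALUES
        (100, 1, 10, 500.0),
        (101, 1, 20, 120.0),
        (102, 2, 10, 450.0);
""")


def revenue_by_country(connection):
    """Total booking amount per customer country.

    The query joins the fact table to a single dimension table,
    which is the typical shape of a report query on a star schema.
    """
    return connection.execute("""
        SELECT c.country, SUM(f.amount)
        FROM fact_booking AS f
        JOIN dim_customer AS c ON c.customer_id = f.customer_id
        GROUP BY c.country
        ORDER BY c.country
    """).fetchall()


for country, total in revenue_by_country(conn):
    print(country, total)
```

In this arrangement a report query only ever joins the fact table to the dimension tables it needs, which keeps the number of joins per query small.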
Each row in a dimension table represents some business object, such as a customer, a product, or a department. Dimension tables don’t have relationships with each other, but only relationships with fact tables. The star schema owes its name to its graphical representation, where the fact table forms the center and the dimension tables are drawn as rays originating from that center, together forming a star.

The primary goal of arranging tables as a star schema is to limit the number of tables that have to be accessed and joined when a query is processed. The often-cited advantage of avoiding table joins is improved query performance. Another advantage is that it becomes much easier to write queries and to present the end user with a set of options from which a tool can generate a query. Limiting joins in this way means that dimension tables are typically denormalized and therefore contain duplicate data. The fact that duplicate data increases the amount of required storage is seen as less important. Also, the fact that duplicate data can lead to inconsistent data is not considered a major disadvantage either, which makes sense in a data warehouse environment where all the inserts and updates are executed in a very controlled fashion.

A data store can contain many fact tables and thus many star schemas. If fact tables share the same dimension tables, these dimension tables are called conformed dimension tables. This is only possible if those fact tables have been designed in such a way that they can use the same dimension tables.

3 The Costs of Data Marts

The Price Tag of Data Marts – Quite often the decision to develop data marts is made lightly. Query performance is commonly used as the all-overriding argument. Unfortunately, data marts are more expensive than most organizations think. As early as 2008, Gartner [2] indicated that derived data stores, such as data marts, are expensive components of an entire data warehouse architecture.