
GETTING REAL ABOUT DATA VIRTUALIZATION

Ralph Kimball, Informatica, May 2011

Webinar Roadmap

 Attractions of virtualization

 Range of virtualization types

 Unique characteristics of data virtualization

 Data virtualization augments the data warehouse

 Best use cases for data virtualization

 Transitioning from data virtualization to physical ETL

 Data virtualization architecture

 Organization and infrastructure changes brought by data virtualization

 Next steps


Attractions of Virtualization

 Virtualization creates a virtual, rather than real, version of something.

 Resources can be provisioned, expanded, and moved independently of any specific
   Physical location, hardware, operating system, software release, or data structure

 In 2011, businesses need to adapt quickly
   Explosive increases in demand
   Increasing integration requirements

A Range of Virtualization Types, Often Leveraged by the Cloud

 Storage virtualization
   Separates logical storage from physical storage

 Desktop virtualization
   Makes management of desktops easier

 PC operating system virtualization
   Emulates an alternate operating system on one machine

 Application virtualization
   Encapsulates an application separately from the operating system

 Server virtualization
   Hypervisor hosts multiple operating systems

 But there is something missing from this list…


Unique Characteristics of Data Virtualization

 A different kind of virtualization: making data available in the desired target format and content without regard to the actual physical storage or heterogeneous source structure (see the sketch below)

 More content aware than other virtualizations

 Target format and structures computed at run time.

 Moves processing for data access and integration closer to the end user

 Enhances the data warehouse when used appropriately

 Rounds out the enterprise virtualization strategy
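The run-time idea can be made concrete with a minimal Python sketch. All names here (the CRM rows, billing file, target columns) are invented for illustration, not any vendor's API: the integrated, target-format row set is computed on demand and never physically stored.

```python
# Minimal sketch of a "virtual table": target format computed at query time
# from two heterogeneous sources; no physical integrated copy ever exists.

# Source 1: relational-style rows from a hypothetical CRM system
crm_rows = [
    {"cust_id": 1, "name": "Acme Corp", "region_code": "W"},
    {"cust_id": 2, "name": "Globex", "region_code": "E"},
]

# Source 2: a flat-file extract keyed and typed differently
billing_records = {
    "1": {"balance_usd": "1520.00"},
    "2": {"balance_usd": "80.50"},
}

def virtual_customer_view():
    """Yield rows in the desired target format, resolved at run time."""
    region_names = {"W": "West", "E": "East"}
    for row in crm_rows:
        billing = billing_records.get(str(row["cust_id"]), {})
        yield {
            "customer_key": row["cust_id"],
            "customer_name": row["name"],
            "region": region_names.get(row["region_code"], "Unknown"),
            "balance": float(billing.get("balance_usd", "0")),
        }

for row in virtual_customer_view():
    print(row)
```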

Data Virtualization Augments, Not Replaces, the Data Warehouse

 Traditional data warehouse responsibilities are performed before, or as a side effect of, data virtualization, including
   Slowly changing dimensions (history-tracking “SCDs”; see the sketch below)
   Physical versions of tables for performance
   Data staging and archiving
   Compliance
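To make the first responsibility concrete, here is a hedged sketch of type 2 SCD history tracking, the expire-and-insert pattern; the column names are illustrative, not from the webinar or any particular tool.

```python
from datetime import date

# Type 2 SCD sketch: when a tracked attribute changes, expire the current
# dimension row and insert a new version rather than overwriting history.
dim_customer = [
    {"cust_key": 101, "cust_id": 1, "city": "Reno",
     "row_start": date(2009, 1, 1), "row_end": None, "is_current": True},
]

def apply_scd2(cust_id, new_city, change_date, next_key):
    for row in dim_customer:
        if row["cust_id"] == cust_id and row["is_current"]:
            if row["city"] == new_city:
                return next_key           # no change: nothing to do
            row["row_end"] = change_date  # expire the old version
            row["is_current"] = False
    dim_customer.append({
        "cust_key": next_key, "cust_id": cust_id, "city": new_city,
        "row_start": change_date, "row_end": None, "is_current": True,
    })
    return next_key + 1

apply_scd2(1, "Austin", date(2011, 5, 2), next_key=102)
for row in dim_customer:
    print(row)  # both versions survive, with validity dates
```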

 Data virtualization is frequently a prototype for physical ETL when scaled up or taken to production


Data Virtualization Architecture

 Data virtualization engine is middleware: a switchboard between multiple diverse data sources and multiple consuming clients (see the sketch below)

[Architecture diagram: consuming clients on one side of the engine; on the other, business process data sources (OLTP systems and data warehouses), complex non-tabular numeric and text sources, and data feeds, telemetry data, and analog data.]
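A minimal sketch of the switchboard idea, with invented adapter classes standing in for the diverse sources in the diagram; a real DV engine would also push down filters and handle metadata, security, and caching.

```python
# Toy data virtualization engine: routes one logical query to several
# source adapters and unions the results into a common row format.

class OltpAdapter:
    def fetch(self):
        yield {"source": "oltp", "order_id": 7, "amount": 250.0}

class FeedAdapter:
    def fetch(self):
        # e.g., parsing a telemetry or flat-file feed into tabular rows
        for line in ["8,99.95", "9,10.00"]:
            order_id, amount = line.split(",")
            yield {"source": "feed", "order_id": int(order_id),
                   "amount": float(amount)}

class VirtualizationEngine:
    def __init__(self, adapters):
        self.adapters = adapters

    def query(self):
        """Union all sources at run time; no physical integrated copy."""
        for adapter in self.adapters:
            yield from adapter.fetch()

engine = VirtualizationEngine([OltpAdapter(), FeedAdapter()])
for row in engine.query():
    print(row)
```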

Best Use Cases for Data Virtualization

 Rapid prototyping, end user trial integration

 “Downstream” MDM

 Publishing conformed dimensions

 Deploying dimensional models

 Transforming data structures at run time

 Embedded data profiling


Rapid Prototyping and End User Trial Integration

 New data sources can be configured in ad hoc ways
   Temporary filters
   Initial data transformations
   New data columns
   Views pre-joining other data

 Prototype experiments conducted by skilled end-user analysts or BI application developers

 Provides usability immediately, until scaling or production needs drive physical ETL (more on this later; see the sketch below)
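A hedged illustration using SQLite (the tables and columns are invented): the entire prototype, temporary filter, derived column, and pre-join alike, lives in a view definition, so nothing physical is built until the prototype proves out.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (cust_id INT, amount REAL);
CREATE TABLE customers (cust_id INT, name TEXT, status TEXT);
INSERT INTO sales VALUES (1, 120.0), (2, 30.0), (1, 45.5);
INSERT INTO customers VALUES (1, 'Acme', 'active'), (2, 'Globex', 'lapsed');

-- The prototype: a filter, a trial derived column, and a pre-joining view
CREATE VIEW v_active_sales AS
SELECT c.name,
       s.amount,
       s.amount * 0.1 AS est_margin   -- initial data transformation
FROM sales s
JOIN customers c ON c.cust_id = s.cust_id
WHERE c.status = 'active';            -- temporary filter
""")

for row in con.execute("SELECT * FROM v_active_sales"):
    print(row)
```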

“Downstream” MDM

 The “real” MDM is the single centralized source for the creation and publishing of master entities such as customer, product, location, and calendar, among others.

 Downstream MDM is the pragmatic recognition that the EDW is often the primary substitute for creating conformed versions of these master entities prior to the implementation of the “real” MDM
   Used for EDW/BI applications
   Not typically the main driver of OLTP applications
   A good use case for data virtualization


Conformed Customer Dimensions at Different Grains

Common (conformed) fields provide opportunities for drill-across
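Drill-across can be sketched generically: query each fact table separately at its own grain, then merge the answer sets on the conformed attribute. The data and names below are invented for illustration.

```python
# Two fact tables at different grains, sharing a conformed "region" field
sales_facts = [("West", 500.0), ("East", 300.0)]     # grain: region/day
forecast_facts = [("West", 550.0), ("East", 280.0)]  # grain: region/month

def rollup(rows):
    """Aggregate each fact table to the conformed attribute."""
    totals = {}
    for region, value in rows:
        totals[region] = totals.get(region, 0.0) + value
    return totals

actuals, forecasts = rollup(sales_facts), rollup(forecast_facts)

# Merge the two answer sets on the conformed attribute: the drill-across
for region in sorted(set(actuals) | set(forecasts)):
    print(region, actuals.get(region, 0.0), forecasts.get(region, 0.0))
```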

Publishing Conformed Dimensions and Conformed Facts

 Publishing conformed dimensions and facts is more than the data cleaning shown on the previous slide:
   The assignment of consistent low-cardinality labels, such as categories, to all the members of a dimension requires complex mapping, e.g., demographic cluster identifiers assigned to each customer
   The assignment of pro-rating factors to convert between sales districts and zip codes also requires complex mapping (see the sketch below)
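The pro-rating arithmetic can be made concrete with a small hypothetical example: each district's factors sum to 1.0, and the district total is allocated across its zip codes accordingly.

```python
# Invented pro-rating factors: share of each district attributable to a zip
prorate = {
    ("District-A", "89501"): 0.6,
    ("District-A", "89502"): 0.4,
}
district_sales = {"District-A": 1000.0}

# Allocate district totals down to zip-code grain
zip_sales = {}
for (district, zip_code), factor in prorate.items():
    zip_sales[zip_code] = zip_sales.get(zip_code, 0.0) + district_sales[district] * factor

print(zip_sales)  # {'89501': 600.0, '89502': 400.0}
```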

 Consider processing overhead and architectural decisions before using DV in the most complex cases.


Deploying Dimensional Models Using the Kimball Architecture

 Many schemas, especially normalized ones, can be exposed as dimensional models through data virtualization (see the sketch below)
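A hedged SQLite sketch of the idea: a snowflaked product hierarchy is presented as one flat dimension purely through a view, so the dimensional model exists only at query time. All schema names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product (prod_id INT, name TEXT, subcat_id INT);
CREATE TABLE subcategory (subcat_id INT, subcat_name TEXT, cat_id INT);
CREATE TABLE category (cat_id INT, cat_name TEXT);
INSERT INTO product VALUES (1, 'Widget', 10);
INSERT INTO subcategory VALUES (10, 'Gadgets', 100);
INSERT INTO category VALUES (100, 'Hardware');

-- The virtual dimensional model: one flat dimension view over 3NF tables
CREATE VIEW dim_product AS
SELECT p.prod_id, p.name, s.subcat_name, c.cat_name
FROM product p
JOIN subcategory s ON s.subcat_id = p.subcat_id
JOIN category c ON c.cat_id = s.cat_id;
""")

print(con.execute("SELECT * FROM dim_product").fetchall())
```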

Transforming Data Structures at Run Time

 Complex numeric payloads can be exposed as simple scalar results usable by SQL
   Wafer fabrication measurements (matrix to scalar)
   Waveform sampling, e.g., medical lab results

 “Data bags” of disorderly name-value pairs can be conformed and presented as positional designs
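Both transformations can be sketched in a few lines of Python; the measurement grid, name-value bag, and canonical column names are all invented for illustration.

```python
# 1) Matrix to scalar: a wafer-style grid of measurements reduced to
#    summary scalars that SQL tools can consume directly.
wafer = [[0.98, 1.01], [0.99, 1.02]]
cells = [v for row in wafer for v in row]
print({"mean": sum(cells) / len(cells), "max": max(cells)})

# 2) "Data bag" to positional design: conform disorderly names to canonical
#    column names and emit a fixed-width row, with None for absent fields.
bag = [("Temp", 98.6), ("bp_sys", 120), ("BP_DIA", 80)]
canonical = {"temp": "temp_f", "bp_sys": "bp_systolic", "bp_dia": "bp_diastolic"}
columns = ["temp_f", "bp_systolic", "bp_diastolic", "pulse"]

row = {col: None for col in columns}
for name, value in bag:
    row[canonical.get(name.lower(), name.lower())] = value
print(row)  # 'pulse' remains None: it was absent from this bag
```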


Embedded Data Profiling

 Analysts working close to the business can evaluate the trustworthiness of a data source during prototyping
   Data profiling alerts the analyst to data problems
   Checklists of data-improvement tasks are generated
   Send demands to the source for better data
   Transform poor data if business rules are reasonable
   Tag bad data that can’t be fixed
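A minimal sketch of the kind of check an embedded profiler runs; the column names and thresholds are invented. Nulls and out-of-range values become checklist entries the analyst can send back to the source or use to tag bad data.

```python
rows = [
    {"cust_id": 1, "age": 34, "state": "NV"},
    {"cust_id": 2, "age": None, "state": "NV"},
    {"cust_id": 3, "age": 212, "state": "XX"},
]

def profile(rows, column, lo=None, hi=None):
    """Return a checklist of data problems found in one column."""
    values = [r[column] for r in rows]
    issues = []
    nulls = sum(v is None for v in values)
    if nulls:
        issues.append(f"{column}: {nulls} null(s)")
    for v in values:
        if v is not None and lo is not None and not (lo <= v <= hi):
            issues.append(f"{column}: out-of-range value {v}")
    return issues

checklist = profile(rows, "age", lo=0, hi=120)
print(checklist)  # tasks to route to the source owner or tag as bad data
```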

 If the data source qualifies, DV can continue to be used; if scaling issues overwhelm the application, the source can undergo conventional ETL

Data Mining

 Many data mining algorithms need data to be cleaned or transformed
   Neural networks require all input data to be mapped into 0-to-1 ranges
   Much cleaner results are obtained when labels and descriptors are expanded and made consistent
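The 0-to-1 requirement is ordinary min-max scaling, which a virtualization layer can apply at run time rather than storing a scaled copy. A minimal sketch with invented data:

```python
def minmax_scale(values):
    """Map numeric values into the [0, 1] range via min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [32000, 54000, 120000, 47000]
print(minmax_scale(incomes))           # each value now lies in [0, 1]
```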


Transitioning from Data Virtualization to Physical ETL

 Data virtualization trades off physical table preparation for query-time computation

 A threshold may be reached where the virtualization app should be replaced by equivalent physical table preparation
   Growing complexity of virtualization logic
   Load on the data source
   Run time

 DV-to-physical-table transformation is like materialization
   Transfer DV logic into physical ETL logic automatically
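A hedged SQLite sketch of the materialization analogy: the SELECT that defines the virtual view is reused verbatim to load a physical table once query-time cost crosses the threshold. All names are illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('West', 500.0), ('East', 300.0), ('West', 100.0);

-- The virtual table: computed at every query
CREATE VIEW v_region_sales AS
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
""")

# Materialize: transfer the view's logic into a one-time physical load
con.execute("CREATE TABLE region_sales AS SELECT * FROM v_region_sales")

print(con.execute("SELECT * FROM region_sales ORDER BY region").fetchall())
```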

Organization/Infrastructure Changes Brought by Data Virtualization

 Rapid, cost-effective deployment for right-sized integration use cases

 Integration and data profiling pushed closer to the BI team and individual analysts

 Watch that IT is not left out of the loop

 One-time ETL creation of physical tables replaced by query-time computation of virtual tables

 Very large, frequent, or complex virtualizations may not be feasible

 Need to elevate data virtualization planning to be part of enterprise virtualization initiative


Next Steps

 Examine backlog of integration requests from BI team, end users, and analysts to see which would be good candidates for data virtualization

 Evaluate solution alternatives as part of the enterprise virtualization initiative
   Socialize this concept with non-data IT executives who are concerned with other virtualization projects

 Implement a trial virtualization project
   Measure development time, complexity, and skills needed
   Measure run-time performance and infrastructure loads

The Kimball Group Resource

 www.kimballgroup.com

 Best-selling data warehouse books
   NEW BOOK! The Kimball Group Reader

 In-depth data warehouse classes taught by the primary authors
   Dimensional modeling (Ralph/Margy)
   Data warehouse lifecycle (Margy/Warren)
   ETL architecture (Ralph/Bob)

 Dimensional design reviews and consulting by Kimball Group principals

 Informatica White Papers on Integration, Data Quality, and Big Data Analytics
