GETTING REAL ABOUT DATA VIRTUALIZATION
Ralph Kimball, Informatica, May 2011
Webinar Roadmap
Attractions of virtualization
Range of virtualization types
Unique characteristics of data virtualization
Data virtualization augments the data warehouse
Best use cases for data virtualization
Transitioning from data virtualization to physical ETL
Data virtualization architecture
Organization and infrastructure changes brought by data virtualization
Next steps
Attractions of Virtualization
Virtualization creates a virtual, rather than real, version of something.
Resources can be provisioned, expanded, and moved independently of specific physical location, hardware, operating system, software release, and data structure
In 2011, businesses need to adapt quickly
Explosive increases in demand
Increasing integration requirements
A Range of Virtualization Types, Often Leveraged by the Cloud
Storage virtualization: separates logical storage from physical storage
Desktop virtualization: makes management of desktops easier
PC operating system virtualization: emulates an alternate operating system on one machine
Application virtualization: encapsulates an application separately from the operating system
Server virtualization: a hypervisor hosts multiple operating systems
But there is something missing from this list…
Unique Characteristics of Data Virtualization
A different kind of virtualization: making data available in the desired target format and content, without regard to the actual physical storage or heterogeneous structure.
More content aware than other virtualizations
Target format and structures computed at run time.
Moves data access and integration processing closer to the end user
Enhances the data warehouse when used appropriately
Rounds out the enterprise virtualization strategy
Data Virtualization Augments, not Replaces, the Data Warehouse
Traditional data warehouse responsibilities are performed before, or as a side effect of, data virtualization, including:
Slowly changing dimensions (history-tracking “SCDs”)
Physical versions of tables for performance
Data staging and archiving
Compliance
Data virtualization frequently serves as a prototype for the physical ETL that takes over when the application is scaled up or taken to production
Data Virtualization Architecture
Data virtualization engine is middleware: a switchboard between multiple diverse data sources and multiple consuming clients
Business process data sources, OLTP and DWs
Complex non-tabular numeric and text sources
Data feeds, telemetry data, analog data
Best Use Cases for Data Virtualization
Rapid prototyping, end user trial integration
“Downstream” master data management
Publishing conformed dimensions
Deploying dimensional models
Transforming data structures at run time
Embedded data profiling
Rapid Prototyping and End User Trial Integration
New data sources can be configured in ad hoc ways (a sketch follows below):
Temporary filters
Initial data transformations
New data columns
Views pre-joining other data
Prototype experiments conducted by skilled end-user analysts or BI application developers
Provides usability immediately and continues until scaling or production drives physical ETL (more on this later)
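A minimal sketch of such a prototype, assuming a SQL-accessible virtualization layer; every table, column, and source name here is hypothetical:
    -- Hypothetical prototype view combining a temporary filter, an initial
    -- transformation, a new derived column, and a pre-join to other data,
    -- all resolved at query time rather than loaded into a physical table.
    CREATE VIEW proto_web_orders AS
    SELECT o.order_id,
           o.order_date,
           UPPER(TRIM(o.currency_code)) AS currency_code,    -- initial data transformation
           o.order_amount * fx.usd_rate AS order_amount_usd,  -- new data column
           c.customer_segment
    FROM   source_web.orders    o
    JOIN   source_crm.customers c  ON c.customer_id    = o.customer_id   -- view pre-joining other data
    JOIN   ref.fx_rates         fx ON fx.currency_code = o.currency_code
    WHERE  o.order_date >= DATE '2011-01-01';                            -- temporary filter
Because it is only a view, the prototype can be revised or discarded without touching the underlying sources.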
“Downstream” MDM
The “real” MDM is the single, centralized source of record for the creation and publishing of master entities such as customer, product, location, calendar, and others.
Downstream MDM is the pragmatic recognition that the EDW is often the primary substitute for creating conformed versions of these master entities prior to the implementation of the “real” MDM:
Used for EDW/BI applications
Not typically the main driver of OLTP applications
A good use case for data virtualization (sketched below)
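One possible shape of such a downstream conformed entity, expressed as a virtual view; the source system, cross-reference table, and column names are illustrative assumptions, not a prescribed design:
    -- Hypothetical conformed customer entity published from the EDW while the
    -- "real" upstream MDM system does not yet exist.
    CREATE VIEW conformed_customer AS
    SELECT COALESCE(x.master_customer_id, c.crm_customer_id) AS customer_key,
           TRIM(c.customer_name)                             AS customer_name,
           UPPER(c.country_code)                             AS country_code,
           'CRM'                                             AS source_system
    FROM   source_crm.customers c
    LEFT JOIN xref.customer_matches x
           ON x.crm_customer_id = c.crm_customer_id;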
Conformed Customer Dimensions at Different Grains
Common (conformed) fields provide opportunities for drill-across
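A drill-across sketch in the usual Kimball pattern: each fact table is queried separately at the grain of a conformed attribute, and the row sets are merged on that attribute; table and column names are hypothetical:
    -- Hypothetical drill-across: sales and returns summarized independently,
    -- then aligned on the conformed customer_segment attribute.
    SELECT COALESCE(s.customer_segment, r.customer_segment) AS customer_segment,
           s.total_sales,
           r.total_returns
    FROM  (SELECT d.customer_segment, SUM(f.sales_amount) AS total_sales
           FROM   fact_sales f
           JOIN   dim_customer d ON d.customer_key = f.customer_key
           GROUP  BY d.customer_segment) s
    FULL OUTER JOIN
          (SELECT d.customer_segment, SUM(f.return_amount) AS total_returns
           FROM   fact_returns f
           JOIN   dim_customer d ON d.customer_key = f.customer_key
           GROUP  BY d.customer_segment) r
      ON  r.customer_segment = s.customer_segment;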
Publishing Conformed Dimensions and Conformed Facts
Publishing conformed dimensions and facts is more than data cleaning as per the previous slide:
The assignment of consistent low-cardinality labels, such as categories, to all the members of a dimension requires complex mapping, e.g., demographic cluster identifiers assigned to each customer
The assignment of pro-rating factors to convert between sales districts and zip codes also requires complex mapping (see the sketch below)
Need to consider processing overhead and architectural decisions to use DV in the most complex cases.
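A pro-rating sketch, assuming a hypothetical bridge table whose allocation factors sum to 1 for each sales district:
    -- Hypothetical allocation of district-level sales down to zip codes using
    -- pre-computed pro-rating factors.
    SELECT b.zip_code,
           SUM(f.sales_amount * b.allocation_factor) AS allocated_sales
    FROM   fact_district_sales f
    JOIN   bridge_district_zip b ON b.district_id = f.district_id
    GROUP  BY b.zip_code;
Whether this runs in the virtual layer or is pushed down to physical ETL is exactly the processing-overhead decision noted above.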
Deploying Dimensional Models Using the Kimball Architecture
Many schemas, especially normalized, can be exposed as dimensional models through data virtualization
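For example, a normalized product hierarchy might be flattened into a denormalized dimension through a virtual view; all object names below are hypothetical:
    -- Hypothetical flattening of a normalized (snowflaked) product hierarchy
    -- into a single dimension presented to BI tools.
    CREATE VIEW dim_product AS
    SELECT p.product_id       AS product_key,
           p.product_name,
           sc.subcategory_name,
           c.category_name,
           b.brand_name
    FROM   oltp.products      p
    JOIN   oltp.subcategories sc ON sc.subcategory_id = p.subcategory_id
    JOIN   oltp.categories    c  ON c.category_id     = sc.category_id
    JOIN   oltp.brands        b  ON b.brand_id        = p.brand_id;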
Transforming Data Structures at Run Time
Complex numeric payloads can be exposed as simple scalar results usable by SQL:
Wafer fabrication measurements (matrix to scalar)
Waveform sampling, e.g., medical lab results
“Data bags” of disorderly name-value pairs can be conformed and presented as positional designs
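A sketch of the name-value case, assuming a hypothetical raw table of attribute pairs pivoted into positional columns at query time:
    -- Hypothetical pivot of a "data bag" into one column per attribute.
    CREATE VIEW sensor_readings_positional AS
    SELECT reading_id,
           MAX(CASE WHEN attr_name = 'temperature' THEN attr_value END) AS temperature,
           MAX(CASE WHEN attr_name = 'pressure'    THEN attr_value END) AS pressure,
           MAX(CASE WHEN attr_name = 'humidity'    THEN attr_value END) AS humidity
    FROM   raw_sensor_name_value_pairs
    GROUP  BY reading_id;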
Embedded Data Profiling
Analysts working close to the business can evaluate the trustworthiness of a data source during prototyping:
Data profiling alerts the analyst to data problems (sketched below)
Checklists generated of data improvement tasks
Send demands to the source for better data
Transform poor data if business rules are reasonable
Tag bad data that can’t be fixed
If the data source qualifies, DV can continue to be used; if scaling issues overwhelm the application, the source can undergo conventional ETL
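A minimal profiling sketch of the kind an analyst might run against a candidate source during prototyping; the specific checks and names are illustrative only:
    -- Hypothetical quick profile: row count, null keys, duplicate keys,
    -- and out-of-range amounts in a candidate source table.
    SELECT COUNT(*)                                          AS row_count,
           COUNT(*) - COUNT(customer_id)                     AS null_customer_ids,
           COUNT(*) - COUNT(DISTINCT order_id)               AS duplicate_order_ids,
           SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
    FROM   source_web.orders;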
Data Mining
Many data mining algorithms need data to be cleaned or transformed:
Neural networks require all input data to be mapped into 0-to-1 ranges (sketched below)
Much cleaner results obtained when labels and descriptors are expanded and made consistent
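A sketch of the 0-to-1 mapping, assuming a SQL layer with window functions; table and column names are hypothetical:
    -- Hypothetical min-max scaling of a measure into the 0-to-1 range expected
    -- by a neural network, computed in the virtual layer.
    SELECT customer_key,
           (sales_amount - MIN(sales_amount) OVER ())
             / NULLIF(MAX(sales_amount) OVER () - MIN(sales_amount) OVER (), 0)
             AS sales_amount_scaled
    FROM   fact_sales;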
Transitioning from Data Virtualization to Physical ETL
Data virtualization trades off physical table preparation for query-time computation
Threshold may be reached where the virtualization app should be replaced by equivalent physical table preparation:
Growing complexity of virtualization logic
Load on data source
Run time
DV to physical table transformation is like materialization (sketched below)
Transfer DV logic into physical ETL logic automatically
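A materialization sketch, reusing the hypothetical prototype view from the earlier rapid-prototyping example; the exact mechanism (CREATE TABLE AS, a materialized view, or generated ETL mappings) depends on the platform:
    -- Hypothetical promotion of virtual logic to a physically loaded table once
    -- complexity, source load, or run time crosses a threshold.
    CREATE TABLE dw_web_orders AS
    SELECT * FROM proto_web_orders;
    -- Thereafter the table is refreshed on a schedule by the ETL system rather
    -- than recomputed on every query.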
Organization/Infrastructure Changes Brought by Data Virtualization
Rapid, cost effective deployment for right-sized integration use cases
Integration and data profiling pushed closer to the BI team and individual analysts
Watch that IT is not left out of the loop
One-time ETL creation of physical tables replaced by query-time computation of virtual tables
Very large or frequent or complex virtualizations may not be feasible
Need to elevate data virtualization planning to be part of enterprise virtualization initiative
Next Steps
Examine backlog of integration requests from BI team, end users, and analysts to see which would be good candidates for data virtualization
Evaluate solution alternatives as part of the enterprise virtualization initiative
Socialize this concept with non-data IT executives who are concerned with other virtualization projects
Implement trial virtualization project
Measure development time, complexity, skills needed
Measure run-time performance, infrastructure loads
The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The Kimball Group Reader
In-depth data warehouse classes taught by primary authors:
Dimensional modeling (Ralph/Margy)
Data warehouse lifecycle (Margy/Warren)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
Informatica White Papers on Integration, Data Quality, and Big Data Analytics