GETTING REAL ABOUT DATA VIRTUALIZATION
Ralph Kimball, Informatica, May 2011
Webinar Roadmap
Attractions of virtualization
Range of virtualization types
Unique characteristics of data virtualization
Data virtualization augments the data warehouse
Best use cases for data virtualization
Transitioning from data virtualization to physical ETL
Data virtualization architecture
Organization and infrastructure changes brought by data virtualization
Next steps
Attractions of Virtualization
Virtualization creates a virtual, rather than real, version of something.
Resources can be provisioned, expanded, and moved independently of specific physical location, hardware, operating system, software release, and data structure
In 2011, businesses need to adapt quickly
Explosive increases in demand
Increasing integration requirements
A Range of Virtualization Types, Often Leveraged by the Cloud
Storage virtualization: separates logical storage from physical storage
Desktop virtualization: makes management of desktops easier
PC operating system virtualization: emulates an alternate operating system on one machine
Application virtualization: encapsulates an application separately from the operating system
Server virtualization: a hypervisor hosts multiple operating systems
But there is something missing from this list…
Unique Characteristics of Data Virtualization
A different kind of virtualization: making data available in the desired target format and content, without regard to the actual physical storage or heterogeneous structure.
More content aware than other virtualizations
Target format and structures computed at run time.
Moves data access and integration processing closer to the end user
Enhances the data warehouse when used appropriately
Rounds out the enterprise virtualization strategy
Data Virtualization Augments, not Replaces, the Data Warehouse
Traditional data warehouse responsibilities are performed before, or as a side effect of, data virtualization, including:
Slowly changing dimensions (history-tracking “SCDs”)
Physical versions of tables for performance
Data staging and archiving
Compliance
Data virtualization frequently serves as a prototype for the physical ETL that takes over when the application is scaled up or taken to production
Data Virtualization Architecture
Data virtualization engine is middleware: a switchboard between multiple diverse data sources and multiple consuming clients
Business process data sources, OLTP and DWs
Complex non-tabular numeric and text sources
Data feeds, telemetry data, analog data
Best Use Cases for Data Virtualization
Rapid prototyping, end user trial integration
“Downstream” master data management
Publishing conformed dimensions
Deploying dimensional models
Transforming data structures at run time
Embedded data profiling
Rapid Prototyping and End User Trial Integration
New data sources can be configured in ad hoc ways (a sketch follows below):
Temporary filters
Initial data transformations
New data columns
Views pre-joining other data
Prototype experiments conducted by skilled end-user analysts or BI application developers
Provides usability immediately and continues until scaling or production drives physical ETL (more on this later)
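A minimal sketch of such a prototype, assuming a SQL-accessible virtualization layer; every table, column, and source name here is hypothetical:
    -- Hypothetical prototype view combining a temporary filter, an initial
    -- transformation, a new derived column, and a pre-join to other data,
    -- all resolved at query time rather than loaded into a physical table.
    CREATE VIEW proto_web_orders AS
    SELECT o.order_id,
           o.order_date,
           UPPER(TRIM(o.currency_code)) AS currency_code,    -- initial data transformation
           o.order_amount * fx.usd_rate AS order_amount_usd,  -- new data column
           c.customer_segment
    FROM   source_web.orders    o
    JOIN   source_crm.customers c  ON c.customer_id    = o.customer_id   -- view pre-joining other data
    JOIN   ref.fx_rates         fx ON fx.currency_code = o.currency_code
    WHERE  o.order_date >= DATE '2011-01-01';                            -- temporary filter
Because it is only a view, the prototype can be revised or discarded without touching the underlying sources.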
“Downstream” MDM
The “real” MDM is the single, centralized source of record for the creation and publishing of master entities such as customer, product, location, calendar, and others.
Downstream MDM is the pragmatic recognition that the EDW is often the primary substitute for creating conformed versions of these master entities prior to the implementation of the “real” MDM:
Used for EDW/BI applications
Not typically the main driver of OLTP applications
A good use case for data virtualization (sketched below)
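One possible shape of such a downstream conformed entity, expressed as a virtual view; the source system, cross-reference table, and column names are illustrative assumptions, not a prescribed design:
    -- Hypothetical conformed customer entity published from the EDW while the
    -- "real" upstream MDM system does not yet exist.
    CREATE VIEW conformed_customer AS
    SELECT COALESCE(x.master_customer_id, c.crm_customer_id) AS customer_key,
           TRIM(c.customer_name)                             AS customer_name,
           UPPER(c.country_code)                             AS country_code,
           'CRM'                                             AS source_system
    FROM   source_crm.customers c
    LEFT JOIN xref.customer_matches x
           ON x.crm_customer_id = c.crm_customer_id;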
Conformed Customer Dimensions at Different Grains
Common (conformed) fields provide opportunities for drill-across
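A drill-across sketch in the usual Kimball pattern: each fact table is queried separately at the grain of a conformed attribute, and the row sets are merged on that attribute; table and column names are hypothetical:
    -- Hypothetical drill-across: sales and returns summarized independently,
    -- then aligned on the conformed customer_segment attribute.
    SELECT COALESCE(s.customer_segment, r.customer_segment) AS customer_segment,
           s.total_sales,
           r.total_returns
    FROM  (SELECT d.customer_segment, SUM(f.sales_amount) AS total_sales
           FROM   fact_sales f
           JOIN   dim_customer d ON d.customer_key = f.customer_key
           GROUP  BY d.customer_segment) s
    FULL OUTER JOIN
          (SELECT d.customer_segment, SUM(f.return_amount) AS total_returns
           FROM   fact_returns f
           JOIN   dim_customer d ON d.customer_key = f.customer_key
           GROUP  BY d.customer_segment) r
      ON  r.customer_segment = s.customer_segment;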
Publishing Conformed Dimensions and Conformed Facts
Publishing conformed dimensions and facts is more than data cleaning as per the previous slide:
The assignment of consistent low-cardinality labels, such as categories, to all the members of a dimension requires complex mapping, e.g., demographic cluster identifiers assigned to each customer
The assignment of pro-rating factors to convert between sales districts and zip codes also requires complex mapping (see the sketch below)
Need to consider processing overhead and architectural decisions to use DV in the most complex cases.
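A pro-rating sketch, assuming a hypothetical bridge table whose allocation factors sum to 1 for each sales district:
    -- Hypothetical allocation of district-level sales down to zip codes using
    -- pre-computed pro-rating factors.
    SELECT b.zip_code,
           SUM(f.sales_amount * b.allocation_factor) AS allocated_sales
    FROM   fact_district_sales f
    JOIN   bridge_district_zip b ON b.district_id = f.district_id
    GROUP  BY b.zip_code;
Whether this runs in the virtual layer or is pushed down to physical ETL is exactly the processing-overhead decision noted above.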
Deploying Dimensional Models Using the Kimball Architecture
Many schemas, especially normalized, can be exposed as dimensional models through data virtualization
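For example, a normalized product hierarchy might be flattened into a denormalized dimension through a virtual view; all object names below are hypothetical:
    -- Hypothetical flattening of a normalized (snowflaked) product hierarchy
    -- into a single dimension presented to BI tools.
    CREATE VIEW dim_product AS
    SELECT p.product_id       AS product_key,
           p.product_name,
           sc.subcategory_name,
           c.category_name,
           b.brand_name
    FROM   oltp.products      p
    JOIN   oltp.subcategories sc ON sc.subcategory_id = p.subcategory_id
    JOIN   oltp.categories    c  ON c.category_id     = sc.category_id
    JOIN   oltp.brands        b  ON b.brand_id        = p.brand_id;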
Transforming Data Structures at Run Time
Complex numeric payloads can be exposed as simple scalar results usable by SQL:
Wafer fabrication measurements (matrix to scalar)
Waveform sampling, e.g., medical lab results
“Data bags” of disorderly name-value pairs can be conformed and presented as positional designs
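A sketch of the name-value case, assuming a hypothetical raw table of attribute pairs pivoted into positional columns at query time:
    -- Hypothetical pivot of a "data bag" into one column per attribute.
    CREATE VIEW sensor_readings_positional AS
    SELECT reading_id,
           MAX(CASE WHEN attr_name = 'temperature' THEN attr_value END) AS temperature,
           MAX(CASE WHEN attr_name = 'pressure'    THEN attr_value END) AS pressure,
           MAX(CASE WHEN attr_name = 'humidity'    THEN attr_value END) AS humidity
    FROM   raw_sensor_name_value_pairs
    GROUP  BY reading_id;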
Embedded Data Profiling
Analysts working close to the business can evaluate the trustworthiness of a data source during prototyping:
Data profiling alerts the analyst to data problems (sketched below)
Checklists generated of data improvement tasks
Send demands to the source for better data
Transform poor data if business rules are reasonable
Tag bad data that can’t be fixed
If the data source qualifies, DV can continue to be used; if scaling issues overwhelm the application, the source can undergo conventional ETL
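A minimal profiling sketch of the kind an analyst might run against a candidate source during prototyping; the specific checks and names are illustrative only:
    -- Hypothetical quick profile: row count, null keys, duplicate keys,
    -- and out-of-range amounts in a candidate source table.
    SELECT COUNT(*)                                          AS row_count,
           COUNT(*) - COUNT(customer_id)                     AS null_customer_ids,
           COUNT(*) - COUNT(DISTINCT order_id)               AS duplicate_order_ids,
           SUM(CASE WHEN order_amount < 0 THEN 1 ELSE 0 END) AS negative_amounts
    FROM   source_web.orders;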
Data Mining
Many data mining algorithms need data to be cleaned or transformed:
Neural networks require all input data to be mapped into 0-to-1 ranges (sketched below)
Much cleaner results obtained when labels and descriptors are expanded and made consistent
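A sketch of the 0-to-1 mapping, assuming a SQL layer with window functions; table and column names are hypothetical:
    -- Hypothetical min-max scaling of a measure into the 0-to-1 range expected
    -- by a neural network, computed in the virtual layer.
    SELECT customer_key,
           (sales_amount - MIN(sales_amount) OVER ())
             / NULLIF(MAX(sales_amount) OVER () - MIN(sales_amount) OVER (), 0)
             AS sales_amount_scaled
    FROM   fact_sales;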
Transitioning from Data Virtualization to Physical ETL
Data virtualization trades off physical table preparation for query-time computation
Threshold may be reached where the virtualization app should be replaced by equivalent physical table preparation:
Growing complexity of virtualization logic
Load on data source
Run time
DV to physical table transformation is like materialization (sketched below)
Transfer DV logic into physical ETL logic automatically
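A materialization sketch, reusing the hypothetical prototype view from the earlier rapid-prototyping example; the exact mechanism (CREATE TABLE AS, a materialized view, or generated ETL mappings) depends on the platform:
    -- Hypothetical promotion of virtual logic to a physically loaded table once
    -- complexity, source load, or run time crosses a threshold.
    CREATE TABLE dw_web_orders AS
    SELECT * FROM proto_web_orders;
    -- Thereafter the table is refreshed on a schedule by the ETL system rather
    -- than recomputed on every query.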
Organization/Infrastructure Changes Brought by Data Virtualization
Rapid, cost effective deployment for right-sized integration use cases
Integration and data profiling pushed closer to the BI team and individual analysts
Watch that IT is not left out of the loop
One-time ETL creation of physical tables replaced by query-time computation of virtual tables
Very large or frequent or complex virtualizations may not be feasible
Need to elevate data virtualization planning to be part of enterprise virtualization initiative
Next Steps
Examine backlog of integration requests from BI team, end users, and analysts to see which would be good candidates for data virtualization
Evaluate solution alternatives as part of the enterprise virtualization initiative
Socialize this concept with non-data IT executives who are concerned with other virtualization projects
Implement trial virtualization project
Measure development time, complexity, skills needed
Measure run-time performance, infrastructure loads
The Kimball Group Resource
www.kimballgroup.com
Best-selling data warehouse books
NEW BOOK! The Kimball Group Reader
In-depth data warehouse classes taught by primary authors:
Dimensional modeling (Ralph/Margy)
Data warehouse lifecycle (Margy/Warren)
ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
Informatica White Papers on Integration, Data Quality, and Big Data Analytics