10/4/2011

MAKE BIG DATA AGILE: IMPROVE RELIABILITY AND TRUSTWORTHINESS OF BIG DATA BY ALIGNING PROCESS AND DATA

Ralph Kimball Informatica October 2011

The Data Warehouse Mission is More Urgently Needed Than Ever

 Operational data increasingly dominant

 Low latency data trending toward zero delay

 Customer behavior insights being monetized

 “Big data” introduces many new challenges
  Huge volumes, many types of sources
  Structured, semi-structured, and unstructured data
  Non-relational processing
  Raw data, often dirty, often sparse


Threats to the Data Warehouse Mission

 Waterfall approach
  Compartmentalized responsibilities
  Formal specifications, handoffs, change bureaucracy
  Excessive documentation
  Resistance to course changes
  Infrequent deliverables

 Integration overwhelm

 Untrustworthy, unreliable data

 “Starting over” because of big data?

Big Data Extends the Data Warehouse

 Use cases that drive the bottom line:
  Targeted ads in real time based on web site activity and/or geo-tracking
  “Satisfaction crash” intervention based on voice-to-text analysis of support calls
  Customer sentiment change response based on social media
  Emergency response planning based on satellite image comparison
  Fraud detection based on cross-market brokerage transaction comparisons


Adopt the Best Aspects of Agile Development

 Business driven
  Sophisticated business sponsor
  All IT members involved in KPI identification

 Small teams with end-to-end responsibility
  Eliminate non-value-added tasks

 Decompose the project

 Frequent deliverables

 Reasonable documentation

 Expect and encourage mid-course corrections

Use the Value Stream Map to Identify Best Agile Opportunities

[Figure: DOIT Corporation value stream map (AS-IS) for the change request process, Monday, December 20, 2010. The map traces a change request from submission through the semi-weekly CR review committee (13 days), requirements clarification, architecture review council approval and resource assignment (26 days), design and development, test scheduling and execution, and production deployment, with bypass paths for simple CRs and an automated status workflow for customer notification.]

 Lead time (elapsed) per stage: 8.8, 13.3, 26, 1, 12.8, 8.5, and 0.3 days; total lead time = 75.6 days

 Work time (hands-on): 30 + 180 + 15 + 180 + 90 + 15 minutes = 510 minutes (8.5 hours, or 0.35 days)

 Value ratio: work time / lead time = 0.5%

 Notes:
  Lead time includes a 5-day delay in customer notification
  Lead time could be reduced to 24 days with process changes alone, using existing tools
  Lead time could be reduced to 3 days with a capital investment in automated testing
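The headline value ratio follows from a few lines of arithmetic; this sketch simply recomputes the map's figures (the 5-day customer-notification delay is counted on top of the stage lead times, per note 1):

```python
# Stage lead times (elapsed days) and hands-on work times (minutes)
# as quoted on the value stream map.
stage_lead_days = [8.8, 13.3, 26, 1, 12.8, 8.5, 0.3]
notification_delay_days = 5
lead_time_days = sum(stage_lead_days) + notification_delay_days

stage_work_minutes = [30, 180, 15, 180, 90, 15]
work_time_minutes = sum(stage_work_minutes)
work_time_days = work_time_minutes / 60 / 24  # 510 min = 8.5 hrs = 0.35 days

value_ratio = work_time_days / lead_time_days
print(f"Lead time: {lead_time_days:.1f} days")   # 75.7 days (slide rounds to 75.6)
print(f"Work time: {work_time_minutes} minutes") # 510
print(f"Value ratio: {value_ratio:.1%}")         # 0.5%
```

The tiny ratio is the point of the map: almost all of the 75-day lead time is waiting, not work, which is where the agile process changes attack.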


Kimball Lifecycle Approach is Inherently Agile

Business Driven!

[Figure: the Kimball Lifecycle diagram]

 Program/Project Planning  Business Requirements Definition, then three parallel tracks:
  Technology track: Technical Architecture Design  Product Selection & Installation
  Data track: Dimensional Modeling  Physical Design  ETL Design & Development
  BI track: BI Application Design  BI Application Development

 The tracks converge at Deployment, followed by Maintenance and Growth

 Program/Project Management spans the entire lifecycle

Prepare for an Agile EDW with Data Profiling

 Data profiling is a necessary foundation of the EDW

 Qualify or disqualify new data sources at the earliest opportunity

 Implement data quality triage:
  Demand source system fix, or
  Correct data in DW ETL, or
  Tag data and pass through

 Embed data profiling throughout pipeline: Source  ETL back room  BI front room

 Integrate data profiling tightly with data flow development: discovery spawns an in-line module
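The profiling-then-triage loop above can be sketched in a few lines. This is a minimal illustration, not any specific profiling tool's API; the null-rate thresholds are invented for the example:

```python
from collections import Counter

def profile_column(values):
    """Minimal column profile: row count, null rate, cardinality, top values."""
    total = len(values)
    null_markers = (None, "", "NULL")
    nulls = sum(1 for v in values if v in null_markers)
    non_null = [v for v in values if v not in null_markers]
    return {
        "rows": total,
        "null_rate": nulls / total if total else 0.0,
        "cardinality": len(set(non_null)),
        "top_values": Counter(non_null).most_common(3),
    }

def triage(profile, max_null_rate=0.2):
    """Apply the three triage outcomes from the slide (thresholds illustrative)."""
    if profile["null_rate"] > 0.5:
        return "demand source system fix"
    if profile["null_rate"] > max_null_rate:
        return "correct in DW ETL"
    return "tag and pass through"

# Qualify a candidate source column at the earliest opportunity.
state = ["CA", "CA", "NY", None, "tx", "NY", "", "CA"]
p = profile_column(state)
print(p["null_rate"], triage(p))
```

The same `profile_column` can run at each pipeline stage (source, ETL back room, BI front room) to confirm that quality holds end to end.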


Agile Data Governance Fits Rhythm of EDW Development

 Recognize, protect data as a corporate asset

 Continuously manage data quality at all levels
  Operational sources
  Data integration pipelines
  BI delivery

 Continuously extend EDW integration with conformed dimensions driven by enterprise MDM

 Continuously add new sources, facts, dimensions, and dimension attributes
  Dimensional approach minimizes rework and impact

Agile Approach: Four Active Steps

 Manage/plan with the data warehouse bus matrix

 Develop with enterprise MDM

 Integrate incrementally

 Manage data quality incrementally


Agile Approach 1: Use the Data Warehouse Bus Matrix

 Decompose and manage data warehouse implementation projects

 Implement rows of matrix one at a time

[Figure: bus matrix — rows are business processes, columns are conformed dimensions]
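The "one row at a time" planning idea can be sketched as a small data structure. The process and dimension names below are illustrative, not from the slides:

```python
# A hypothetical bus matrix: rows are business processes, columns are the
# conformed dimensions each process needs.
bus_matrix = {
    "Orders":    {"Date", "Customer", "Product"},
    "Shipments": {"Date", "Customer", "Product", "Carrier"},
    "Support":   {"Date", "Customer", "Agent"},
}

implemented_dims = set()

def implement_row(process):
    """Implement one matrix row: build only the conformed dimensions
    this process needs that do not exist yet, then reuse the rest."""
    new_dims = bus_matrix[process] - implemented_dims
    implemented_dims.update(new_dims)
    return new_dims

print(implement_row("Orders"))     # first iteration builds Date, Customer, Product
print(implement_row("Shipments"))  # second iteration adds only Carrier
```

Each iteration delivers a complete business process, and later rows get cheaper because conformed dimensions are reused rather than rebuilt — the agile payoff the slide is pointing at.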

Agile Approach 2: Leverage Development with Enterprise MDM

 Move entity creation and administration to operations
  Move away from downstream MDM
  Reduce repeated ETL demands for the same entities

 Establish a single source for entity content
  Operational systems, social media tracking, and the EDW subscribe to MDM
  Roll out MDM incrementally


Agile Approach 3: Integrate Incrementally

 Data warehouse integration is drilling across conformed dimensions

 Agile advantages of conformed dimensions
  Install conformed attributes one source at a time
  Add new conformed attributes one at a time
  Deliver value with each iteration
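Drilling across can be illustrated with two separately built fact tables that share a conformed dimension. The tables and the `drill_across` helper are invented for the sketch:

```python
# Two fact tables, built in different iterations, keyed by the same
# conformed Customer attribute (contents are illustrative).
orders  = {"Alice": 1200, "Bob": 300}   # customer -> order revenue
support = {"Alice": 2,    "Carol": 5}   # customer -> support calls

def drill_across(*fact_tables):
    """Outer-join measures from each fact table on the shared dimension key.
    Missing combinations come back as None, just as in a SQL full outer join."""
    keys = set().union(*fact_tables)
    return {k: tuple(t.get(k) for t in fact_tables) for k in keys}

report = drill_across(orders, support)
print(report["Alice"])  # (1200, 2)
```

Because both facts conform on the same Customer attribute, the combined report is possible without reworking either fact table — which is why each new conformed attribute or source delivers value on its own.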

Agile Approach 4: Manage Data Quality Incrementally

 Use the data profiler to build data quality screens
  Column screens
  Structure screens
  Business rule screens

 Manage data quality through the error event schema

 Agile advantages of this DQ architecture
  Build screens one at a time
  Add or subtract screens
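The screen pattern can be sketched in a few lines: each screen is a small test run against incoming rows, and failures are recorded as error events rather than halting the load. Screen names, thresholds, and event fields are illustrative:

```python
from datetime import date

error_events = []  # stand-in for an error event fact table

def run_screen(screen_name, check, rows, severity="warn"):
    """Run one data quality screen over a batch; record each failure."""
    for i, row in enumerate(rows):
        if not check(row):
            error_events.append({
                "screen": screen_name, "row": i,
                "severity": severity, "batch_date": date.today(),
            })

rows = [
    {"qty": 5,  "state": "CA"},
    {"qty": -1, "state": "CA"},
    {"qty": 2,  "state": "??"},
]

# Column screen: quantity must be non-negative.
run_screen("qty_non_negative", lambda r: r["qty"] >= 0, rows)
# Business rule screen: state must be a known code.
run_screen("valid_state", lambda r: r["state"] in {"CA", "NY", "TX"}, rows)

print(len(error_events))  # 2 failures recorded
```

Since each screen is independent, screens can be built one at a time and added or retired without touching the rest of the pipeline — the incremental property the slide claims.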


Agile Approach: Special Considerations for Big Data

 Big data fits comfortably with the agile approach
  Agile approach tolerates uncertainty, backtracking, and midcourse corrections
  Big data often must be analyzed before its structure is understood

 Extend familiar RDBMS experience to non-relational data with Hadoop resources
  MapReduce, HBase, Cassandra
  Data profiling extended to non-relational data
  Hadoop as a preprocessor for creating structured data
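The "Hadoop as a preprocessor" idea follows the MapReduce contract: map raw records to key-value pairs, shuffle by key, then reduce each group into a structured row. This pure-Python sketch mimics that contract on an invented log format; in a real cluster the same mapper/reducer pair would run under Hadoop:

```python
from itertools import groupby
from operator import itemgetter

# Raw, semi-structured clickstream lines (format is illustrative).
raw = [
    "2011-10-04 alice view product=42",
    "2011-10-04 alice buy  product=42",
    "2011-10-04 bob   view product=7",
]

def mapper(line):
    """Map phase: emit (user, action) pairs from each raw line."""
    _date, user, action, _rest = line.split(None, 3)
    yield (user, action)

def reducer(user, actions):
    """Reduce phase: fold one user's actions into a structured row."""
    actions = list(actions)
    return {"user": user, "views": actions.count("view"), "buys": actions.count("buy")}

# Map, then simulate the shuffle/sort, then reduce each key group.
pairs = sorted(kv for line in raw for kv in mapper(line))
rows = [reducer(u, (a for _, a in g)) for u, g in groupby(pairs, key=itemgetter(0))]
print(rows)
```

The output rows now have a fixed schema and can be loaded into the dimensional DW like any other structured source, which is exactly the preprocessing role the slide assigns to Hadoop.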

Next Steps

 Integration is your fate!
  Establish a continuous agile process for identifying, vetting, and implementing new data sources

 Use data virtualization for rapid prototyping and proofs of concept

 Choose a Big Data source and a Hadoop configuration as a pilot project
  Build IT skills
  Introduce new KPIs to the business community

 Use Kimball/Informatica Big Data white paper as guide to growing EDW to support Big Data


The Kimball Group Resource

 www.kimballgroup.com

 Best-selling data warehouse books
  New book: The Kimball Group Reader

 In-depth data warehouse classes taught by the primary authors
  Dimensional modeling (Ralph/Margy)
  Data warehouse lifecycle (Margy/Warren)
  ETL architecture (Ralph/Bob)

 Dimensional design reviews and consulting by Kimball Group principals

 Informatica White Paper on Big Data Impact on EDW: http://vip.informatica.com/?elqPURLPage=8808
