10/4/2011
MAKE BIG DATA AGILE: IMPROVE RELIABILITY AND TRUSTWORTHINESS OF BIG DATA BY ALIGNING PROCESS AND DATA
Ralph Kimball Informatica October 2011
The Data Warehouse is More Urgently Needed Than Ever
Operational data increasingly dominant
Low latency data trending toward zero delay
Customer behavior insights being monetized
“Big data” introduces many new challenges Huge volumes, many types of sources Structured semi-structured unstructured Non-relational processing Raw data, often dirty, often sparse
1 10/4/2011
Threats to the Data Warehouse Mission
Waterfall approach Compartmentalized responsibilities Formal specifications, handoffs, change bureaucracy Excessive documentation Resistance to course changes Infrequent deliverables
Integration overwhelm
Untrustworthy, unreliable data
“Starting Over” because of Big Data ??
Big Data Extends the Data Warehouse
Use cases that drive the bottom line: Targeted ads in real time based on web site activity and/or geo-tracking “Satisfaction crash” intervention based on voice-to-text analysis of support calls Customer sentiment change response based on social media Emergency response planning based on satellite image comparison Fraud detection based on cross-market brokerage transaction comparisons
2 10/4/2011
Adopt the Best Aspects of Agile Development
Business driven Sophisticated business sponsor All IT members involved in KPI identification
Small teams with end to end responsibility Eliminate non value-added tasks
Decompose the project
Frequent deliverables
Reasonable documentation
Expect and encourage mid-course corrections
Use the Value Stream Map to Identify Best Agile Opportunities
DOIT Corporation: Value Stream Map (AS-IS) for Change Request Process Monday, December 20, 2010
GMNA Applications
Data Warehouse Change Request Team Team (Irina) Confirmation Request
Telephone Tag
Automated Status Request Workflow/Tracking (Cust satisfaction) Status Update Status Request
CR Review Committee Status Update Data Dictionary to clarify rqmnts Notify Customer Semi-Weekly Review (13 days) Status Request (Cust Satisfaction) (5 days) Clarify Requirements Status Update
Requirements Clarification Bypass Council Approved Changes for simple CR’s Production CR Add CR (26 days) Submission To List Assign Resource Design Approval Daily ETL Batch Run Data Warehouse Team Integration Team Manager Architecture Review Council Test Scheduling
Bypass CR Approval Committee for Test Team Manager simple changes Charge Test Results Design Approved Request Change Management Board (8 days) Distribution Document Designs Forward CR Request Automated Regression To Developer Approved CR Design Docs & Design Docs & Schedule Testing (21 days - requires CR Approval & Schedule investment)
Requirements Design & Testing Handoff Test Case Test Execution Production Production Review Development Development Deployment Execution Development Team Test Team Automatic Daily ETL Batch Development Team Development Team Test Team Infrastructure Team Run CR’s 1 P1x12 P2x35 P3x124
8.8 Days 13.3 Days 26 Days 1 Day 12.8 Days 8.5 Days 0.3 Days Lead Time = 75.6 Days
30 Minutes 180 Minutes 15 Minutes 180 Minutes 90 Minutes 15 Minutes Work Time = 510 Minutes (8.5 hrs or 0.35 Days)
Value Ratio: Work Time / Lead Time = 0.5% Notes: (1) Lead Time includes 5 delay in customer notification (2) Lead Time could be reduced to 24 days with just process changes and using existing tools (3) Lead Time could be reduced to 3 days with a capital investment for automated testing
3 10/4/2011
Kimball Lifecycle Approach is Inherently Agile
Business Driven!
Technical Product Architecture Selection & Growth Design Installation
Program/ ETL Business Dimensional Physical Project Design & Deployment Requirements Modeling Design Planning Definition Development
BI BI Application Application Maintenance Design Development
Program/Project Management
Prepare for an Agile EDW with Data Profiling
Data profiling is a necessary foundation of the EDW
Qualify or disqualify new data sources at the earliest opportunity
Implement data quality triage: Demand source system fix, or Correct data in DW ETL, or Tag data and pass through
Embed data profiling throughout pipeline: Source ETL back room BI front room
Integrate data profiling tightly with data flow development: discovery spawns in line module
4 10/4/2011
Agile Data Governance Fits Rhythm of EDW Development
Recognize, protect data as a corporate asset
Continuously manage data quality at all levels Operational sources Data integration pipelines BI delivery
Continuously extend EDW integration with conformed dimensions driven by enterprise MDM
Continuously add new sources, facts, dimensions, and dimension attributes Dimensional approach minimizes rework, impact
Agile Approach: Four Active Steps
Manage/plan with the data warehouse bus matrix
Develop with enterprise MDM
Integrate incrementally
Manage data quality incrementally
5 10/4/2011
Agile Approach: Use the 1. Data Warehouse Bus Matrix
Decompose and manage data warehouse implementation projects
Implement rows of matrix one at a time
Business Processes Dimensions
Agile Approach: Leverage 2. Development with Enterprise MDM
Move entity creation and admin to operations Move away from downstream MDM Reduce repeated ETL demands for same entities
Establish single source for entity content Operational systems, social media tracking, and EDW subscribe to MDM Roll out MDM incrementally
6 10/4/2011
Agile Approach: 3. Integrate Incrementally
Data warehouse integration is drilling across:
Agile advantages of conformed dimensions Install conformed attributes one source at a time Add new conformed attributes one at a time Deliver value with each iteration
Agile Approach: 4. Manage Data Quality Incrementally
Use data profiler to build data quality screens Column screens Structure screens Business rule screens
Manage data quality through event database
Agile advantages of this DQ architecture Build screens one at a time Add or subtract screens
7 10/4/2011
Agile Approach: Special Considerations for Big Data
Big data fits comfortably with agile approach Agile approach tolerates uncertainty, back tracking, and midcourse corrections Big data often must be analyzed before structure is understood
Extend familiar RDBMS experience to non-relational data with Hadoop resources MapReduce, Hbase, Cassandra Data profiling extended to non-relational data Hadoop as a preprocessor for creating structured data
Next Steps
Integration is your fate! Establish continuous agile process for identifying, vetting, and implementing new data sources
Use data virtualization for rapid prototyping and proofs of concept
Choose a Big Data source and a Hadoop configuration as a pilot project Build IT skills Introduce new KPIs to business community
Use Kimball/Informatica Big Data white paper as guide to growing EDW to support Big Data
8 10/4/2011
The Kimball Group Resource
www.kimballgroup.com
Best selling data warehouse books NEW BOOK! The Kimball Group Reader
In depth data warehouse classes taught by primary authors Dimensional modeling (Ralph/Margy) Data warehouse lifecycle (Margy/Warren) ETL architecture (Ralph/Bob)
Dimensional design reviews and consulting by Kimball Group principals
Informatica White Paper on Big Data Impact on EDW: http://vip.informatica.com/?elqPURLPage=8808
9