MDM for the Modern Data Architecture
September 2014

Purpose of MDM
Create correct, consistent data across the enterprise to foster trust in information and accelerate growth.
RedPoint Global Inc. 2014 Confidential

Why it matters
“Without data you’re just another person with an opinion.”
W. Edwards Deming

Vicious Cycle of Unmanaged Data

1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations

(Center of the cycle: Unmanaged Data)

A Data Architecture Under Pressure

Applications: Business Analytics, Custom Applications, Packaged Applications
Data System: RDBMS, EDW, MPP Repositories
Existing Sources (CRM, ERP, Clickstream, Logs): OLTP, ERP, CRM; transactional data; master data

New data types:
• Unstructured documents, emails
• Server logs
• Sentiment, web data
• Hierarchical data
• Sensor, machine data
• Geolocation
• Clickstream

Data growth (Source: IDC):
• 2.8 ZB in 2013
• 40 ZB by 2020
• 85% from new data types
• 15x machine data by 2020

© Hortonworks Inc. 2014
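
The two IDC volume figures cited on this slide (2.8 ZB in 2013 and 40 ZB by 2020) imply a compound annual growth rate. A minimal sketch of that arithmetic follows; the variable names are illustrative, and only the two volumes and the two years come from the slide.

```python
# Implied compound annual growth rate (CAGR) from the IDC figures on the slide.
start_zb = 2.8          # data volume in 2013, in zettabytes (from the slide)
end_zb = 40.0           # projected volume in 2020, in zettabytes (from the slide)
years = 2020 - 2013     # span between the two figures

growth_factor = end_zb / start_zb            # total growth over the period
cagr = growth_factor ** (1 / years) - 1      # equivalent constant yearly growth

print(f"Total growth: {growth_factor:.1f}x over {years} years")
print(f"Implied annual growth: {cagr:.0%}")
```

Under these figures the volume grows roughly 14x in seven years, which works out to about 46% per year.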

Broad Spectrum of Benefits Across Industries

Financial Services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximize deposit spread
• Insurance underwriting
• Accelerate loan processing

Retail
• 360° view of the customer
• Analyze brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout

Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development

Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly line quality assurance
• Proactive maintenance
• Crowdsourced quality assurance

Healthcare
• Genomic data for medical trials
• Monitor patient vitals
• Reduce re-admittance rates
• Store medical research data
• Recruit cohorts for pharmaceutical trials

Utilities, Oil & Gas
• Smart meter stream analysis
• Slow oil well decline curves
• Optimize lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing

Public Sector
• Analyze public sentiment
• Protect critical networks
• Prevent fraud and waste
• Crowdsource reporting for repairs to infrastructure
• Fulfill open records requests

Gartner’s Nexus of Forces Making Things Worse

Business Benefits of MDM

Today IT data management pros focus on:
• Eliminating duplicate/orphaned data
• Standardizing and centralizing data/metadata
• Meeting operational SLAs
• Data enrichment
• Data integration and synchronization

Business leaders really care about:
• Increasing revenue
• Decreasing costs
• Increasing operational efficiencies
• Reducing risk
• Improving customer experiences

Use business-value driven KPIs to evangelize MDM benefits:
• Reduction in direct marketing postage costs
• Reduction in average handle time in call center
• Increase in campaign response rates
• Increase in customer self-service for order management, technical support and customer service
• Reduction in customer privacy compliance risk exposure
• Delivering a consistent cross-channel customer experience

How About MDM on a Data Lake?

Benefits of a Hadoop Data Lake
• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled

Challenges to Data Lake Approach
• Severe shortage of MapReduce-skilled resources
• Inconsistent skills lead to inconsistent results from code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks

Key Functions for Master Data Management

ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs

Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis

Integration & Matching
• Grouping
• Fuzzy match

Master Key Management
• Create keys
• Track changes
• Maintain matches over time

Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats

Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Meta Data Management
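
The cleansing, fuzzy match, grouping and key creation functions listed above can be illustrated with a small sketch. This is not RedPoint's method: the function names (`normalize`, `similar`, `assign_master_keys`), the 0.85 threshold, the greedy grouping strategy and the sample records are all assumptions chosen for illustration, built only on Python's standard `difflib` and `uuid` modules.

```python
from difflib import SequenceMatcher
import uuid

def normalize(name: str) -> str:
    """Basic cleansing: lowercase, drop punctuation, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: compare cleansed names by similarity ratio (0.0 to 1.0)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def assign_master_keys(records, threshold: float = 0.85):
    """Greedy grouping: a record joins the first group whose representative
    name it matches; otherwise it starts a new group with a new key."""
    groups = []   # list of (master_key, representative_name)
    keyed = []
    for rec in records:
        for key, representative in groups:
            if similar(rec["name"], representative, threshold):
                keyed.append({**rec, "master_key": key})
                break
        else:
            # No existing group matched: create a new master key.
            key = str(uuid.uuid4())
            groups.append((key, rec["name"]))
            keyed.append({**rec, "master_key": key})
    return keyed

# Hypothetical records from three source systems.
records = [
    {"source": "CRM", "name": "Acme Corp."},
    {"source": "ERP", "name": "ACME Corp"},
    {"source": "Billing", "name": "Globex Inc"},
]
keyed = assign_master_keys(records)
for rec in keyed:
    print(rec["source"], rec["name"], rec["master_key"])
```

Here the CRM and ERP records receive the same master key because their cleansed names are identical, while the Billing record gets its own. Maintaining matches over time would additionally require storing the keys persistently, which this sketch leaves out.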

Data Lake is the Center of Your MDM Strategy

• Ingestion of all data available from any source, format, cadence, structure or non-structure
• ELT and data transformation, refinement, cleansing, completion, validation and standardization
• Geospatial processing and geocoding
• Data profiling, lineage and metadata management
• Identity resolution and persistent keying and entity profile management
• Attribute source and consumer mapping

Data Lake Architecture for MDM
Data Sources
• CRM
• ERP
• Billing
• Subscriber
• Product
• Network
• Weather
• Compete
• Manuf.
• Clickstream
• Online Chat
• Sensor Data
• Social Media
• Call Detail Records
• Fabrication Logs
• Sales Feedback
• Field Feedback

How Can That Possibly Work?
More Map Reduce! YARN!

Overview: What is Hadoop/Hadoop 2.0

Hadoop 1.0
• All operations based on MapReduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd party applications constrained by the need to generate code

Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop

[Diagram showing MapReduce and YARN. Recoverable labels: Parallel Section, Partitioning, Execution, Key / Split Analysis, AM / Tasks, Data I/O, Partition, Data server.]

Reference Hadoop Architecture

[Diagram. Recoverable component labels:]
• Monitoring and Management Tools: AMBARI
• Source data: DBs, Files, JMS Queues, REST, HTTP, STREAM
• Data sources: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer, Inventory
• Load: SQOOP, WebHDFS, NFS, Flume
• Data refinement: PIG, HIVE, MAPREDUCE
• Structure: HCATALOG (metadata services)
• Interactive: HIVE Server2
• YARN, HDFS
• RDBMS and EDW (1 to n), via SQOOP/Hive and Web HDFS
• Query/Visualization/Reporting/Analytical Tools and Apps
• RedPoint Functional Footprint

Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):
public static class MapClass extends Mapper

[Sample Pig script without the UDF]

MapReduce
• >150 lines of MR code
• 6 hours of development
• 6 minutes runtime
• Extensive optimization needed

Pig
• ~50 lines of script code
• 3 hours of development
• 15 minutes runtime
• User Defined Functions required prior to running script

RedPoint Data Management
• 0 lines of code
• 15 min. of development
• 3 minutes runtime
• No tuning or optimization required
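
The slide does not state what the benchmark job computes. Assuming a word-count style job over the Project Gutenberg texts (an assumption, not something the source confirms), the map, shuffle and reduce steps that hand-written MapReduce code has to spell out can be sketched in-process in Python. The function names and sample lines below are hypothetical, and nothing here uses the Hadoop API.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle(pairs))

# Hypothetical input lines standing in for book text.
counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

On a cluster each phase runs in parallel across partitions of the input, which is where the tuning effort in the hand-coded approaches tends to go.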

Data Lake Architecture for MDM

Data Sources
• CRM
• ERP
• Billing
• Subscriber
• Product
• Network
• Weather
• Compete
• Manuf.
• Clickstream
• Online Chat
• Sensor Data
• Social Media
• Call Detail Records
• Fabrication Logs
• Sales Feedback
• Field Feedback