MDM for the Modern Data Architecture
September 2014

Purpose of MDM

Create correct and consistent data across the enterprise to foster trust in information and accelerate growth.

RedPoint Global Inc. 2014 Confidential

Why it matters

“Without data you’re just another person with an opinion.”

W. Edwards Deming

Vicious Cycle of Unmanaged Data

Unmanaged Data:
1. Master data issues remain unaddressed or unresolved
2. Garbage in/garbage out creates process confusion
3. Lack of process trust slows business momentum
4. Data conflicts reinforce siloed operations

A Data Architecture Under Pressure

[Diagram: new data types putting pressure on applications and the existing data system]

New data types: unstructured documents, emails; server logs; sentiment, web data; hierarchical data; sensor, machine data; geolocation; clickstream
Applications: Business Applications, Custom Applications, Packaged Applications
Data System: RDBMS, EDW, MPP repositories
Existing Sources (CRM, ERP, Clickstream, Logs): OLTP, ERP, CRM; transactional data; master data

• 2.8 ZB in 2013
• 85% from new data types
• 15x machine data by 2020
• 40 ZB by 2020

Source: IDC

© Inc. 2014

Broad Spectrum of Benefits Across Industries

Financial Services
• New account risk screens
• Fraud prevention
• Trading risk
• Maximize deposit spread
• Insurance underwriting
• Accelerate loan processing

Retail
• 360° view of the customer
• Analyze brand sentiment
• Localized, personalized promotions
• Website optimization
• Optimal store layout

Telecom
• Call detail records (CDRs)
• Infrastructure investment
• Next product to buy (NPTB)
• Real-time bandwidth allocation
• New product development

Manufacturing
• Supplier consolidation
• Supply chain and logistics
• Assembly line quality assurance
• Proactive maintenance
• Crowdsourced quality assurance

Healthcare
• Genomic data for medical trials
• Monitor patient vitals
• Reduce re-admittance rates
• Store medical research data
• Recruit cohorts for pharmaceutical trials

Utilities, Oil & Gas
• Smart meter stream analysis
• Slow oil well decline curves
• Optimize lease bidding
• Compliance reporting
• Proactive equipment repair
• Seismic image processing

Public Sector
• Analyze public sentiment
• Protect critical networks
• Prevent fraud and waste
• Crowdsource reporting for repairs to infrastructure
• Fulfill open records requests

Gartner’s Nexus of Forces Making Things Worse

Business Benefits of MDM

Today IT data mgmt. pros focus on:
• Eliminating duplicate/orphaned data
• Standardizing and centralizing data/
• Meeting operational SLAs
• Data enrichment and synchronization

Business leaders really care about:
• Increasing revenue
• Decreasing costs
• Increasing operational efficiencies
• Reducing risk
• Improving customer experiences

Use business-value driven KPIs to evangelize MDM benefits

• Reduction in direct marketing postage costs
• Reduction in average handle time in call center
• Increase in customer self-service for order management, technical support and customer service
• Increase in campaign response rates
• Reduction in customer privacy compliance risk exposure
• Delivering a consistent cross-channel customer experience

How About MDM on a Data Lake?

Benefits of a Hadoop Data Lake
• Data is ingested in its raw state regardless of format, structure or lack of structure
• Raw data can be used and reused for differing purposes across the enterprise
• Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
• Master data can be fed across the enterprise, and deep analytics on clean data are immediately enabled

Challenges to Data Lake Approach
• Severe shortage of Map Reduce skilled resources
• Inconsistent skills lead to inconsistent results of code-based solutions
• Nascent technologies require multiple point solutions
• Technologies are not enterprise grade
• Some functionality may not be possible within these frameworks

Key Functions for Master Data Management

ETL & ELT
• Profiling, reads/writes, transformations
• Single project for all jobs

Data Quality
• Cleanse data
• Parsing, correction
• Geo-spatial analysis

Integration & Matching
• Grouping
• Fuzzy match

Master Key Management
• Create keys
• Track changes
• Maintain matches over time

Web Services Integration
• Consume and publish
• HTTP/HTTPS protocols
• XML/JSON/SOAP formats

Process Automation & Operations
• Job scheduling, monitoring, notifications
• Central point of control
• Meta Data Management
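To make the matching and keying functions concrete, here is a minimal Python sketch of how cleansing, fuzzy matching, grouping and master key creation could fit together. It is an illustration only, not RedPoint's method: the function names (`normalize`, `similar`, `assign_master_keys`), the 0.85 threshold and the greedy grouping strategy are all assumptions, and the similarity measure is the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher
from itertools import count


def normalize(name: str) -> str:
    """Cleanse step (assumed rules): lowercase, drop punctuation, collapse spaces."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: compare cleansed values with a similarity ratio in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold


def assign_master_keys(records, threshold: float = 0.85):
    """Greedy grouping: a record joins the first group whose representative
    it matches; otherwise it starts a new group with a newly created key."""
    next_key = count(1)
    groups = []  # (master_key, representative_record)
    keyed = []   # (master_key, record) in input order
    for rec in records:
        for key, representative in groups:
            if similar(rec, representative, threshold):
                keyed.append((key, rec))
                break
        else:
            key = next(next_key)
            groups.append((key, rec))
            keyed.append((key, rec))
    return keyed


if __name__ == "__main__":
    sample = ["Acme Corp.", "ACME Corp", "Globex, Inc.", "Globex Inc"]
    for key, rec in assign_master_keys(sample):
        print(key, rec)
```

With this sample, the two Acme spellings share one key and the two Globex spellings share another. A production system would also persist the keys so that matches are maintained over time, which this sketch leaves out.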

Data Lake is the Center of Your MDM Strategy

• Ingestion of all data available from any source, format, cadence, structure or non-structure
• ELT and data transformation, refinement, cleansing, completion, validation and standardization
• Geospatial processing and geocoding
• Data profiling, lineage and metadata management
• Identity resolution and persistent keying and entity profile management
• Attribute source and consumer mapping

Data Lake Architecture for MDM

Data Sources: Clickstream, CRM, Online Chat, ERP, Sensor Data, Billing, Social Media, Subscriber, Call Detail Records, Product + Fabrication, Network Logs, Sales Feedback, Weather, Field Feedback, Compete, Manuf.

How Can That Possibly Work?

More Map Reduce! YARN!

Overview: What is Hadoop/Hadoop 2.0

Hadoop 1.0
• All operations based on Map Reduce
• Intrinsic inconsistency of code-based solutions
• Highly skilled and expensive resources needed
• 3rd party applications constrained by the need to generate code

Hadoop 2.0
• Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic MapReduce framework for processing data in Hadoop clusters.”
• Mature applications can now operate directly on Hadoop
• Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop

[Diagram labels: Parallel Section; Execution; Partitioning; Key / Split Analysis; Data I/O; MapReduce; YARN; AM / Tasks; Partition; Data server]

Reference Hadoop Architecture

[Diagram: reference Hadoop architecture showing the RedPoint Functional Footprint]

Source data: DBs, files, JMS queues, REST, HTTP, streams
Data sources: sensor logs, clickstream, flat files, unstructured, sentiment, customer, inventory
Load: SQOOP, WebHDFS, NFS, Flume
Data refinement: PIG, HIVE, MAPREDUCE
Structure: HCATALOG (metadata services)
Interactive: HIVE Server2
Platform: YARN, HDFS
Targets: RDBMS, EDW (loaded via SQOOP/Hive)
Monitoring and management tools: AMBARI
Consumers: query, visualization, reporting and analytical tools and apps

Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    // WordOffset is the job's custom input key type, defined in the part of the code not shown.
    public static class MapClass extends Mapper<WordOffset, Text, Text, IntWritable> {
      private final static String delimiters =
          "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|«»¡¢£¤¥¦©¬®¯±¶·¿";
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input line.
      public void map(WordOffset key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

Map Reduce                      Pig                                  RedPoint
>150 Lines of MR Code           ~50 Lines of Script Code             0 Lines of Code
6 hours of development          3 hours of development               15 min. of development
6 minutes runtime               15 minutes runtime                   3 minutes runtime
Extensive optimization needed   User Defined Functions required      No tuning or optimization required
                                prior to running script
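For readers less familiar with either framework, the computation being benchmarked is a word count. The following single-machine Python sketch mirrors the steps of the Pig script (tokenize, uppercase, group, count, order by occurrences descending). It is not part of the benchmark and says nothing about distributed performance. The function name `word_counts` and the shortened ASCII-only delimiter set are assumptions made for illustration.

```python
from collections import Counter

# Assumed ASCII subset of the delimiter set used by the MapReduce sample.
DELIMITERS = "',./<>?;:\"[]{}-=_+()&*%^#$!@`~ \\|"


def word_counts(lines, delimiters=DELIMITERS):
    """Return (word, occurrences) pairs, most frequent first."""
    # Map every delimiter to a space so that split() yields the tokens.
    table = str.maketrans({ch: " " for ch in delimiters})
    counts = Counter()
    for line in lines:
        counts.update(line.translate(table).upper().split())
    return counts.most_common()


if __name__ == "__main__":
    for word, occurrences in word_counts(["The cat; the hat.", "the END"]):
        print(occurrences, word)
```

On a cluster the same logic is split into a map phase that emits tokens and a reduce phase that sums them, which is where the extra code in the MapReduce version comes from.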

Data Lake Architecture for MDM

Data Sources: CRM, Clickstream, ERP, Online Chat, Billing, Sensor Data, Subscriber, Social Media, Call Detail Records, Product + Fabrication, Network Logs, Weather, Sales Feedback, Compete, Field Feedback, Manuf.
