Hadoop2, Spark Big Data, real time, machine learning & use cases
Cédric Carbone
Twitter : @carbone

Agenda
• Map Reduce
• Hadoop v1 limits
• Hadoop v2 and YARN
• Apache Spark
• Streaming : Spark vs Storm
• Machine Learning : Recommender System
• Use Case : Next Product To Buy
• Q&A

What’s Hadoop
• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
• Java framework for storing data and running data transformations on large clusters of commodity hardware
• Licensed under the Apache v2 license
• Inspired by Google's MapReduce and Google File System (GFS) papers; the BigTable paper inspired the related HBase project

HDFS : Distributed Storage
• Distributed, scalable, portable and reliable file system for the Hadoop framework.
• Metadata / data separation:
  – Name Nodes hold the file system metadata
  – Data Nodes store the data blocks
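
To make the metadata / data separation concrete, here is a minimal sketch in plain Scala. It is a toy model, not the HDFS API: the names `HdfsSketch`, `Block`, `FileMeta` and `plan` are hypothetical, and the round-robin replica placement is an assumption made for brevity (real HDFS placement is rack-aware).

```scala
// Toy model of HDFS metadata/data separation; NOT the real HDFS API.
object HdfsSketch {
  final case class Block(id: Int, sizeBytes: Long)

  // Name Node side: metadata only (which blocks form a file, where replicas live).
  // The block contents themselves would sit on the Data Nodes.
  final case class FileMeta(blocks: Seq[Block], locations: Map[Int, Seq[String]])

  // Split a file into fixed-size blocks and assign each replica to a data node.
  // Assumption: simple round-robin placement, chosen only to keep the sketch short.
  def plan(fileSize: Long, blockSize: Long, replication: Int,
           dataNodes: Seq[String]): FileMeta = {
    require(replication <= dataNodes.size, "not enough data nodes for this replication")
    val nBlocks = ((fileSize + blockSize - 1) / blockSize).toInt
    val blocks = (0 until nBlocks).map { i =>
      Block(i, math.min(blockSize, fileSize - i * blockSize))
    }
    val locations = blocks.map { b =>
      b.id -> (0 until replication).map(r => dataNodes((b.id + r) % dataNodes.size))
    }.toMap
    FileMeta(blocks, locations)
  }

  def main(args: Array[String]): Unit = {
    val meta = plan(fileSize = 300L, blockSize = 128L, replication = 2,
                    dataNodes = Seq("dn1", "dn2", "dn3"))
    meta.blocks.foreach { b =>
      println(s"block ${b.id} (${b.sizeBytes} bytes) -> ${meta.locations(b.id).mkString(", ")}")
    }
  }
}
```

A client would ask the Name Node for this mapping and then read the blocks directly from the Data Nodes listed for each block.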

Map Reduce
• Map() : parse inputs and generate 0 to n key/value pairs

WordCount Example
• Each Mapper takes a line as input and breaks it into words
  – It emits a key/value pair of the word and 1
• Each Reducer sums the counts for each word
  – It emits a key/value pair of the word and the sum
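
The WordCount flow above can be sketched in plain Scala collections. This is only an illustration of the map, shuffle and reduce steps, not Hadoop's Mapper/Reducer API; the names `WordCountSketch`, `mapLine`, `shuffle` and `reduceWord` are hypothetical.

```scala
// Plain-Scala sketch of the WordCount data flow (no Hadoop API involved).
object WordCountSketch {
  // Map phase: one input line in, zero or more (word, 1) pairs out.
  def mapLine(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Shuffle phase: group the emitted values by key (the word).
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: one (word, counts) group in, one (word, sum) pair out.
  def reduceWord(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Chain the three phases over all input lines.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    shuffle(lines.flatMap(mapLine)).map { case (w, cs) => reduceWord(w, cs) }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do")
    wordCount(lines).toSeq.sortBy(_._1).foreach { case (w, n) => println(s"$w\t$n") }
  }
}
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle moves each key's values to the reducer responsible for that key.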

[Figure: Map Reduce, with Map and Reduce tasks running on Data Node 1 and Data Node 2]

Hadoop MapReduce v1
• Not good for low-latency jobs on small datasets
• Good for off-line batch jobs on massive data

Hadoop 1
• Batch ONLY
  – High latency jobs

[Figure: Hadoop 1 stack, top to bottom]
• HIVE (Query), Pig (Scripting), Cascading (Accelerate Dev.)
• MapReduce1 (Cluster Resource Management + Data Processing, BATCH)
• HDFS (Redundant, Reliable Storage)

Hadoop2 : Big Data Operating System
• Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
  – Simultaneously & with predictable levels of service
  – Data analysts and real-time applications

[Figure: Hadoop 2 stack, top to bottom]
• MapReduce1 (Data Processing, BATCH), Other (Data Processing), …
• YARN (Cluster Resource Management)
• HDFS (Redundant, Reliable Storage)

[Figure: applications running on YARN]
• STREAMING (Storm, Samza, Spark Streaming)
• GRAPH (Giraph, GraphX)
• Machine Learning (Spark MLlib)
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase, HOYA)
• In-Memory (Spark)
• OTHER (ElasticSearch)
All of them sit on YARN (Cluster Resource Management) and HDFS (Redundant, Reliable Storage).

Stinger.next

https://spark.apache.org
Apache Spark™ is a fast and general engine for large-scale data processing.

The most active project
[Figure: two bar charts, Patches (axis 0 to 250) and Lines Added (axis 0 to 45000), comparing MapReduce, Storm, Yarn and Spark]

Spark won the Daytona GraySort contest!
• Sort on disk 100TB of data 3x faster than Hadoop MapReduce using 10x fewer machines.
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

RDD & Operation
Resilient Distributed Datasets (RDDs)
Operations
➜ Transformations (e.g. map, filter, groupBy)
➜ Actions (e.g. count, collect, save)

Spark
scala> val textFile = sc.textFile("README.md")
➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
➜ res0: Long = 126
scala> textFile.first()
➜ res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
➜ linesWithSpark: spark.RDD[String]=spark.FilteredRDD@7dd4
scala> textFile.filter(line=>line.contains("Spark")).count()
➜ res3: Long = 15
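
The difference between transformations and actions in the session above is that transformations only describe a computation, while actions run it. A minimal sketch of that idea in plain Scala follows. `MiniRDD`, `fromLines` and `linesRead` are hypothetical names for a toy stand-in, not the Spark API; the sketch ignores partitioning, distribution and fault tolerance.

```scala
// Toy stand-in for an RDD: NOT the Spark API, only the lazy-evaluation idea.
object LazyRddSketch {
  // `compute` is only invoked when an action is called.
  final class MiniRDD[A](compute: () => Iterator[A]) {
    // Transformations: return a new MiniRDD, no data is read yet.
    def map[B](f: A => B): MiniRDD[B] = new MiniRDD(() => compute().map(f))
    def filter(p: A => Boolean): MiniRDD[A] = new MiniRDD(() => compute().filter(p))
    // Actions: run the whole chain and return a value to the caller.
    def count(): Long = compute().size.toLong
    def collect(): List[A] = compute().toList
  }

  // Counts how many source lines were actually read (for demonstration only).
  var linesRead = 0

  def fromLines(lines: Seq[String]): MiniRDD[String] =
    new MiniRDD(() => lines.iterator.map { l => linesRead += 1; l })

  def main(args: Array[String]): Unit = {
    val textFile = fromLines(Seq("# Apache Spark", "Spark is fast", "Hadoop"))
    val linesWithSpark = textFile.filter(_.contains("Spark")) // transformation
    println(s"after filter: linesRead = $linesRead")          // nothing read yet
    println(s"count = ${linesWithSpark.count()}")             // action runs the chain
    println(s"after count: linesRead = $linesRead")
  }
}
```

Because each action re-runs the chain from the source in this toy model, calling `count()` twice reads the input twice; in Spark, caching an RDD is what avoids that recomputation.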

Streaming

Storm

Storm vs Spark

                       | Spark                            | Storm
Scope                  | Batch, Streaming, Graph, ML, SQL | Streaming only

                       | Spark Streaming | Storm            | Storm Trident
Processing model       | Micro batches   | Record-at-a-time | Micro batches
Throughput             | ++++            | ++               | ++++
Latency                | Second          | Sub-second       | Second
Reliability models     | Exactly once    | At least once    | Exactly once
Embedded Hadoop distro | HDP, CDH, MapR  | HDP              | HDP
Support                | Databricks      | N/A              | N/A
Community              | ++++            | ++               | ++

Machine Learning Library (MLlib)

Collaborative Filtering
Collaborative Filtering (learning)
Collaborative Filtering : Let’s use the model
Collaborative Filtering : similar behaviors
Collaborative Filtering Prediction

Netflix Prize (2009)
• Netflix is a provider of on-demand Internet streaming media

Input Data
UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
Etc…
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620

Matrix Factorization

The result
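
The result listed below is a set of predicted ratings per user. As a rough illustration of how matrix factorization can produce such predictions from the Input Data, here is a minimal sketch in plain Scala. It is not the code used in the talk: it trains the latent factors with stochastic gradient descent, whereas Spark MLlib uses alternating least squares (ALS). The names `MatrixFactorizationSketch`, `parse` and `train`, and all hyper-parameter values, are assumptions chosen for the example.

```scala
import scala.util.Random

// Toy matrix factorization trained with SGD (MLlib itself uses ALS).
object MatrixFactorizationSketch {
  final case class Rating(user: Int, movie: Int, rating: Double)

  // Parse one "UserID::MovieID::Rating::Timestamp" line (timestamp ignored).
  def parse(line: String): Rating = {
    val f = line.split("::")
    Rating(f(0).toInt, f(1).toInt, f(2).toDouble)
  }

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.indices.map(i => a(i) * b(i)).sum

  // Learn one latent vector per user and per movie so that their dot product
  // approximates the known ratings; returns a (user, movie) => rating predictor.
  def train(data: Seq[Rating], rank: Int = 2, epochs: Int = 1000,
            lr: Double = 0.02, reg: Double = 0.02): (Int, Int) => Double = {
    val rnd = new Random(42) // fixed seed so runs are repeatable
    def init(): Array[Double] = Array.fill(rank)(rnd.nextDouble())
    val userF = data.map(_.user).distinct.map(u => u -> init()).toMap
    val movieF = data.map(_.movie).distinct.map(m => m -> init()).toMap
    for (_ <- 1 to epochs; r <- data) {
      val (p, q) = (userF(r.user), movieF(r.movie))
      val err = r.rating - dot(p, q)
      for (k <- 0 until rank) {
        val (pk, qk) = (p(k), q(k))
        p(k) += lr * (err * qk - reg * pk) // gradient step on the user factor
        q(k) += lr * (err * pk - reg * qk) // gradient step on the movie factor
      }
    }
    (u, m) => dot(userF(u), movieF(m))
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "1::1193::5::978300760", "1::661::3::978302109", "1::914::3::978301968",
      "2::1357::5::978298709", "2::3068::4::978299000", "2::1537::4::978299620")
    val predict = train(lines.map(parse))
    println(f"user 1, movie 1193: ${predict(1, 1193)}%.2f")
  }
}
```

With only the six sample lines, the two users share no movies, so this run can only show that the factors reproduce known ratings. Useful recommendations for unseen movies need the overlap between users that the full dataset provides.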
1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19

Real Time Big Data Use Case
Next Gen Data Marketing Platform
Next Product To Buy

Ready for Omni-channel?

Traditional marketing
Current approach cannot keep up…
• 200m people on Do Not Call list
• 44% of direct marketing is never opened.
• Buyers complete 60% of their research before reaching out to vendors.
• 99.9% of online banners are never clicked.
• 86% of TV viewers skip commercials.
Source: “2013 Definitive Guide to Social Marketing” - Marketo.

Statement
[Figure: timeline 2000, 2010, 2013, 2015: Multi Channel, Cross Channel, Omni Channel]

Consumer Graph

Next Product to Buy in Action (steps 1 to 5)
[Figure, repeated for each step: data sources Brand (ERP data, CRM, Loyalty), Open data, Premium data]

[Figure: Brand, Premium, Open and Social data feed Influans: OnBoard, Graph, Suggest + Fine Tune + Social Interactions, Engage, Sales]

Real Time Big Data Use Case
Next Gen Data Marketing Platform
Next Product To Buy
➜ Right Person
➜ Right Product
➜ Right Price
➜ Right Time
➜ Right Channel

Questions?
Cédric Carbone
[email protected]
@carbone
We graph consumers

[email protected]
@hugfrance
www.hugfrance.fr