Apache Spark

Hadoop2, Spark Big Data, real time, machine learning & use cases

Cédric Carbone
Twitter: @carbone

Agenda
• MapReduce
• Hadoop v1 limits
• Hadoop v2 and YARN
• Apache Spark
• Streaming: Spark vs Storm
• Machine Learning: Recommender System
• Use Case: Next Product To Buy
• Q&A

What’s Hadoop
• The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

• Java framework for storing data and running data transformations on large clusters of commodity hardware

• Licensed under the Apache v2 license

• Based on Google's MapReduce, BigTable and Google File System (GFS) papers

HDFS: Distributed Storage
• Distributed,
• Scalable,
• Portable,
• Reliable file system for the Hadoop framework.

Metadata / data separation:
• NameNodes (file system metadata)
• DataNodes (data blocks)

Map Reduce
• Map(): parses inputs and generates 0 to n key/value pairs
• Reduce(): sums all values of the same key and generates a key/value pair

WordCount Example
• Each map takes a line as input and breaks it into words
  – It emits a key/value pair of the word and 1
• Each reducer sums the counts for each word
  – It emits a key/value pair of the word and the sum
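The WordCount flow above can be sketched on a single machine with plain Scala collections. This is a minimal illustration rather than Hadoop code: the names `WordCount`, `mapLine`, `reduceWord` and `run` are hypothetical, and the `groupBy` call stands in for the shuffle step that the framework performs between the map and reduce phases.

```scala
object WordCount {
  // Map phase: one input line in, zero or more (word, 1) pairs out.
  def mapLine(line: String): Seq[(String, Int)] =
    line.split("\\s+").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // Reduce phase: all counts for one word in, a single (word, sum) pair out.
  def reduceWord(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Local stand-in for the framework: map every line, group the pairs
  // by key (the "shuffle"), then reduce each group.
  def run(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(mapLine)
      .groupBy { case (word, _) => word }
      .map { case (word, pairs) => reduceWord(word, pairs.map(_._2)) }
}
```

For example, `WordCount.run(Seq("to be or", "not to be"))` counts "to" and "be" twice and "or" and "not" once. On a real cluster the map and reduce calls would run in parallel on the nodes that hold the data.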

[Figure: MapReduce data flow across Data Node 1 and Data Node 2]

Hadoop MapReduce v1
• Not good for low-latency jobs on small datasets
• Good for off-line batch jobs on massive data

Hadoop 1
• Batch ONLY
  – High-latency jobs

Hadoop 1 stack (top to bottom):
• Hive (Query), Pig (Scripting), Cascading (Accelerate Dev.)
• MapReduce1 (Cluster Resource Management + Data Processing, BATCH)
• HDFS (Redundant, Reliable Storage)

Hadoop2: Big Data Operating System
• Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
  – Simultaneously & with predictable levels of service
  – Data analysts and real-time applications

Hadoop 2 stack (top to bottom):
• MapReduce1 (Data Processing, BATCH), Other (Data Processing), …
• YARN (Cluster Resource Management)

Hadoop 2 stack, detailed view (top to bottom):
• Processing engines:
  – BATCH (MapReduce)
  – INTERACTIVE (Tez)
  – ONLINE (HBase, HOYA)
  – STREAMING (Storm, Samza, Spark Streaming)
  – GRAPH (Giraph, GraphX)
  – Machine Learning (Spark MLlib)
  – In-Memory (Spark)
  – OTHER (ElasticSearch)
• YARN (Cluster Resource Management)
• HDFS (Redundant, Reliable Storage)

Stinger.next

https://spark.apache.org

Apache Spark™ is a fast and general engine for large-scale data processing.

The most active project
[Figure: two bar charts comparing MapReduce, Storm, YARN and Spark: Patches (axis 0 to 250) and Lines Added (axis 0 to 45000)]

Spark won the Daytona GraySort contest!
• Sorted 100 TB of data on disk 3x faster than Hadoop MapReduce, using 10x fewer machines.

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

RDD & Operations

Resilient Distributed Datasets (RDDs)

Operations
➜ Transformations (e.g. map, filter, groupBy)
➜ Actions (e.g. count, collect, save)

Spark
scala> val textFile = sc.textFile("README.md")
➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
scala> textFile.count()
➜ res0: Long = 126
scala> textFile.first()
➜ res1: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
➜ linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4
scala> textFile.filter(line => line.contains("Spark")).count()
➜ res3: Long = 15
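The split between transformations and actions matters because transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. The sketch below imitates that behaviour without a cluster, using a plain Scala collection view as a stand-in for an RDD. This is an assumption made for illustration (a view is neither distributed nor fault-tolerant), it targets Scala 2.13 or later, and the names `LazyOpsSketch`, `evaluations` and `linesWithSpark` are hypothetical.

```scala
object LazyOpsSketch {
  // Counts predicate evaluations so we can observe when work actually happens.
  var evaluations = 0

  // "Transformation": builds a lazy view; no line is inspected yet.
  def linesWithSpark(lines: Seq[String]): scala.collection.View[String] =
    lines.view.filter { line =>
      evaluations += 1
      line.contains("Spark")
    }
}
```

Calling `LazyOpsSketch.linesWithSpark(lines)` leaves `evaluations` at 0. Only a forcing call such as `.size`, which plays the role of an action like `count()`, runs the predicate over every line. Like an uncached RDD, the view recomputes the filter each time it is forced.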

Streaming

Storm

Storm vs Spark

         Spark                               Storm
Scope    Batch, Streaming, Graph, ML, SQL    Streaming only
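Spark Streaming processes a stream as a sequence of micro-batches, while Storm handles one record at a time. The difference between the two processing models can be sketched with plain Scala collections. This is a minimal illustration, not the API of either system: `MicroBatchSketch`, `Event`, `recordAtATime` and `microBatches` are hypothetical names, and a real engine would receive events continuously instead of from a finished list.

```scala
object MicroBatchSketch {
  final case class Event(timestampMs: Long, value: String)

  // Record-at-a-time model: the handler is invoked once per event.
  def recordAtATime(events: Seq[Event])(handle: Event => Unit): Unit =
    events.foreach(handle)

  // Micro-batch model: events are grouped into fixed time windows,
  // and each window is then processed as one small batch job.
  def microBatches(events: Seq[Event], batchIntervalMs: Long): Seq[Seq[Event]] =
    events
      .groupBy(_.timestampMs / batchIntervalMs)
      .toSeq
      .sortBy { case (window, _) => window }
      .map { case (_, batch) => batch }
}
```

With a 1000 ms interval, events stamped 0, 400, 999, 1000 and 2500 fall into three batches of sizes 3, 1 and 1. A result for the event at 0 ms is only available once its window closes, which is why the micro-batch model trades latency for throughput.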

                        Spark Streaming    Storm               Storm Trident
Processing model        Micro batches      Record-at-a-time    Micro batches
Throughput              ++++               ++                  ++++
Latency                 Second             Sub-second          Second
Reliability models      Exactly once       At least once       Exactly once
Embedded Hadoop distro  HDP, CDH, MapR     HDP                 HDP
Support                 Databricks         N/A                 N/A
Community               ++++               ++                  ++

Machine Learning Library (MLlib)

Collaborative Filtering
Collaborative Filtering (learning)
Collaborative Filtering: Let’s use the model
Collaborative Filtering: similar behaviors
Collaborative Filtering Prediction

Netflix Prize (2009)
• Netflix is a provider of on-demand Internet streaming media

Input Data

UserID::MovieID::Rating::Timestamp
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620
Etc…

Matrix Factorization
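Matrix factorization learns a small vector of latent factors for every user and every movie, so that the dot product of a user vector and a movie vector approximates the known rating. The sketch below parses the `UserID::MovieID::Rating::Timestamp` format and fits the factors with plain stochastic gradient descent. Note the swap: Spark MLlib trains its collaborative filtering model with alternating least squares (ALS), so this is a stand-in for the idea, not MLlib's algorithm. All names (`MatrixFactorizationSketch`, `Rating`, `Model`, `init`, `train`, `rmse`) and the hyperparameters are assumptions chosen for illustration.

```scala
import scala.util.Random

object MatrixFactorizationSketch {
  final case class Rating(user: Int, movie: Int, rating: Double)

  // Parses one "UserID::MovieID::Rating::Timestamp" line; the timestamp is ignored.
  def parse(line: String): Rating = {
    val fields = line.split("::")
    Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble)
  }

  final case class Model(users: Map[Int, Array[Double]], movies: Map[Int, Array[Double]]) {
    // Predicted rating = dot product of the user and movie factor vectors.
    def predict(user: Int, movie: Int): Double =
      users(user).zip(movies(movie)).map { case (u, m) => u * m }.sum
  }

  // Root mean squared error of the model on the given ratings.
  def rmse(model: Model, ratings: Seq[Rating]): Double =
    math.sqrt(ratings.map { r =>
      val err = r.rating - model.predict(r.user, r.movie)
      err * err
    }.sum / ratings.size)

  // Starts every user and movie with small random factors (fixed seed for repeatability).
  def init(ratings: Seq[Rating], rank: Int, seed: Long): Model = {
    val rnd = new Random(seed)
    def vec() = Array.fill(rank)(rnd.nextDouble() * 0.1)
    Model(
      ratings.map(_.user).distinct.map(u => u -> vec()).toMap,
      ratings.map(_.movie).distinct.map(m => m -> vec()).toMap)
  }

  // Stochastic gradient descent: nudges both factor vectors toward each
  // observed rating. Updates the model's arrays in place and returns it.
  def train(model: Model, ratings: Seq[Rating], epochs: Int, lr: Double, reg: Double): Model = {
    for (_ <- 1 to epochs; r <- ratings) {
      val u = model.users(r.user)
      val m = model.movies(r.movie)
      val err = r.rating - model.predict(r.user, r.movie)
      for (k <- u.indices) {
        val (uk, mk) = (u(k), m(k))
        u(k) += lr * (err * mk - reg * uk)
        m(k) += lr * (err * uk - reg * mk)
      }
    }
    model
  }
}
```

Once trained, `model.predict(user, movie)` can be evaluated for movies the user has not rated yet, and the highest predictions become the recommendations.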

The result

1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19

Real Time Big Data Use Case
Next Gen Data Marketing Platform

Next Product To Buy

Ready for Omni-channel?

Traditional marketing

Current approach cannot keep up…
• 200m people on Do Not Call list
• 44% of direct marketing is never opened
• Buyers complete 60% of their research before reaching out to vendors
• 99.9% of online banners are never clicked
• 86% of TV viewers skip commercials
Source: “2013 Definitive Guide to Social Marketing”, Marketo.

Statement
[Figure: timeline (2000, 2010, 2013, 2015) of the move from Multi Channel to Cross Channel to Omni Channel]

Consumer Graph

Next Product to Buy in Action (steps 1 to 5)
[Figure: data sources shown at each step: Brand ERP data, CRM Loyalty, Open data, Premium data]

Influans
• Data sources: Brand, Premium, Open, Social
• OnBoard
• Graph
• Suggest
• Fine Tune
• + Social Interactions
• Engage
• Sales

Real Time Big Data Use Case
Next Gen Data Marketing Platform

Next Product To Buy

➜ Right Person
➜ Right Product
➜ Right Price
➜ Right Time
➜ Right Channel

Questions?

Cédric Carbone
[email protected]
@carbone
We graph consumers

[email protected] @hugfrance www.hugfrance.fr