Hadoop – Spark Overview

J. Allemandou

Generalities Hadoop Spark Apache Hadoop & Spark – What is it ? Demo

Joseph Allemandou

JoalTech

17 / 05 / 2018

1 / 23 Hello!

Hadoop – Spark Overview Joseph Allemandou J. Allemandou

Generalities

Hadoop Spark

Demo

2 / 23 Plan

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark 1 Generalities on High Performance Computing (HPC)

Demo

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

3 / 23 More computation power: Scale up vs. scale out

Hadoop – Spark Overview J. Allemandou Scale Up Scale out

Generalities

Hadoop Spark

Demo

4 / 23 More computation power: Scale up vs. scale out

Hadoop – Spark Overview Scale out J. Allemandou Scale Up

Generalities

Hadoop Spark

Demo

4 / 23 Parallel computing

Hadoop – Spark Overview Things to consider when doing parallel computing: J. Allemandou Partitioning (tasks, data) Generalities

Hadoop Spark Communications Demo Synchronization Data dependencies Load balancing Granularity IO

Livermore Computing Center - Tutorial

5 / 23 Looking back - Since 1950

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Figure: Ngram Viewer

6 / 23 Looking back - Since 1950

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Figure:

6 / 23 Looking back - Since 1950

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Figure: Google Ngram Viewer

6 / 23 Looking back - Recent times

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Figure:

7 / 23 Same problem, different tools

Hadoop – Spark Overview

J. Allemandou

Generalities Supercomputer Hadoop Spark

Demo Dedicated hardware Commodity hardware

Message Passing Interface

8 / 23 MPI

Hadoop – Spark Overview

J. Allemandou

Generalities C / C++ / Fortran / Python Hadoop Spark

Demo Low-level API - Send / receive a lot to do manually split the data assign tasks to workers handle synchronisation handle errors

9 / 23 Hadoop + Spark

Hadoop – Spark Overview

J. Allemandou

Generalities Java / Scala / Python / R Hadoop Spark Demo High-level API - dataflows

Less performant than MPI (see this scientific paper )

Easier to code than MPI (a lot!)

High-availability oriented

10 / 23 Plan

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark 1 Generalities on High Performance Computing (HPC)

Demo

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

11 / 23 Distributed storage and computation platform (Java, open source) HDFS – Hadoop Distributed File System YARN – Yet Another Resource Negotiator

Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)

Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)

Apache Hadoop (v2) in (almost) 3 points

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

12 / 23 Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)

Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)

Apache Hadoop (v2) in (almost) 3 points

Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo

12 / 23 Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)

Apache Hadoop (v2) in (almost) 3 points

Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)

12 / 23 Apache Hadoop (v2) in (almost) 3 points

Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)

Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)

12 / 23 Apache Hadoop - A big ecosysem

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

13 / 23 HDFS are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Master - Slave architecture (NameNode & DataNodes)

YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)

Global Execution engines exploit data locality

Things to know about Apache Hadoop

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

14 / 23 YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)

Global Execution engines exploit data locality

Things to know about Apache Hadoop

Hadoop – Spark Overview J. Allemandou HDFS

Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo

14 / 23 Global Execution engines exploit data locality

Things to know about Apache Hadoop

Hadoop – Spark Overview J. Allemandou HDFS

Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)

14 / 23 Things to know about Apache Hadoop

Hadoop – Spark Overview J. Allemandou HDFS

Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)

Global Execution engines exploit data locality

14 / 23 Apache Hadoop - The Frame (HDFS + YARN)

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

15 / 23 Apache Hadoop - The Frame (HDFS + YARN)

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

15 / 23 Apache Hadoop - The Frame (HDFS + YARN)

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

15 / 23 Apache Hadoop - MapReduce

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

16 / 23 Apache Hadoop - MapReduce

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

16 / 23 Apache Spark

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

17 / 23 Apache Spark

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

17 / 23 Programming data-flows instead of single map-reduce steps A lot less code for complex flows Data lineage allows for error recovery

Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!

Why is Spark preferred to MapReduce?

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

18 / 23 Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!

Why is Spark preferred to MapReduce?

Hadoop – Spark Overview

J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery

18 / 23 Why is Spark preferred to MapReduce?

Hadoop – Spark Overview

J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery

Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!

18 / 23 Spark - A big ecosysem as well

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

19 / 23 Spark - An example of (very) complex flow

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo Mediawiki History Reconstruction Job

20 / 23 Plan

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark 1 Generalities on High Performance Computing (HPC)

Demo

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

21 / 23 Scala Spark / Toree / Jupyter notebook

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo Notebook with Spark on Docker

22 / 23 Thank you!

Hadoop – Spark Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

23 / 23