Hadoop – Spark Overview
J. Allemandou
Generalities Hadoop Spark Apache Hadoop & Spark – What is it ? Demo
Joseph Allemandou
JoalTech
17 / 05 / 2018
1 / 23 Hello!
Hadoop – Spark Overview Joseph Allemandou J. Allemandou
Generalities
Hadoop Spark
Demo
2 / 23 Plan
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark 1 Generalities on High Performance Computing (HPC)
Demo
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
3 / 23 More computation power: Scale up vs. scale out
Hadoop – Spark Overview J. Allemandou Scale Up Scale out
Generalities
Hadoop Spark
Demo
4 / 23 More computation power: Scale up vs. scale out
Hadoop – Spark Overview Scale out J. Allemandou Scale Up
Generalities
Hadoop Spark
Demo
4 / 23 Parallel computing
Hadoop – Spark Overview Things to consider when doing parallel computing: J. Allemandou Partitioning (tasks, data) Generalities
Hadoop Spark Communications Demo Synchronization Data dependencies Load balancing Granularity IO
Livermore Computing Center - Tutorial
5 / 23 Looking back - Since 1950
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Figure: Google Ngram Viewer
6 / 23 Looking back - Since 1950
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Figure: Google Ngram Viewer
6 / 23 Looking back - Since 1950
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Figure: Google Ngram Viewer
6 / 23 Looking back - Recent times
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
Figure: Google Trends
7 / 23 Same problem, different tools
Hadoop – Spark Overview
J. Allemandou
Generalities Supercomputer Big Data Hadoop Spark
Demo Dedicated hardware Commodity hardware
Message Passing Interface
8 / 23 MPI
Hadoop – Spark Overview
J. Allemandou
Generalities C / C++ / Fortran / Python Hadoop Spark
Demo Low-level API - Send / receive messages a lot to do manually split the data assign tasks to workers handle synchronisation handle errors
9 / 23 Hadoop + Spark
Hadoop – Spark Overview
J. Allemandou
Generalities Java / Scala / Python / R Hadoop Spark Demo High-level API - dataflows
Less performant than MPI (see this scientific paper )
Easier to code than MPI (a lot!)
High-availability oriented
10 / 23 Plan
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark 1 Generalities on High Performance Computing (HPC)
Demo
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
11 / 23 Distributed storage and computation platform (Java, open source) HDFS – Hadoop Distributed File System YARN – Yet Another Resource Negotiator
Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)
Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)
Apache Hadoop (v2) in (almost) 3 points
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
12 / 23 Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)
Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)
Apache Hadoop (v2) in (almost) 3 points
Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo
12 / 23 Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)
Apache Hadoop (v2) in (almost) 3 points
Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)
12 / 23 Apache Hadoop (v2) in (almost) 3 points
Hadoop – Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS – Hadoop Distributed File System Hadoop Spark YARN – Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent)
Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block)
12 / 23 Apache Hadoop - A big ecosysem
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
13 / 23 HDFS Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Master - Slave architecture (NameNode & DataNodes)
YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)
Global Execution engines exploit data locality
Things to know about Apache Hadoop
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
14 / 23 YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)
Global Execution engines exploit data locality
Things to know about Apache Hadoop
Hadoop – Spark Overview J. Allemandou HDFS
Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo
14 / 23 Global Execution engines exploit data locality
Things to know about Apache Hadoop
Hadoop – Spark Overview J. Allemandou HDFS
Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)
14 / 23 Things to know about Apache Hadoop
Hadoop – Spark Overview J. Allemandou HDFS
Generalities Files are split in blocks – Partitioning ! Blocks are replicated – Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers)
Global Execution engines exploit data locality
14 / 23 Apache Hadoop - The Frame (HDFS + YARN)
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
15 / 23 Apache Hadoop - The Frame (HDFS + YARN)
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
15 / 23 Apache Hadoop - The Frame (HDFS + YARN)
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
15 / 23 Apache Hadoop - MapReduce
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
16 / 23 Apache Hadoop - MapReduce
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
16 / 23 Apache Spark
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
17 / 23 Apache Spark
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
17 / 23 Programming data-flows instead of single map-reduce steps A lot less code for complex flows Data lineage allows for error recovery
Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!
Why is Spark preferred to MapReduce?
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
18 / 23 Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!
Why is Spark preferred to MapReduce?
Hadoop – Spark Overview
J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery
18 / 23 Why is Spark preferred to MapReduce?
Hadoop – Spark Overview
J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery
Decoupling of tasks and containers – Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching!
18 / 23 Spark - A big ecosysem as well
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
19 / 23 Spark - An example of (very) complex flow
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo Mediawiki History Reconstruction Job
20 / 23 Plan
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark 1 Generalities on High Performance Computing (HPC)
Demo
2 Apache Hadoop and Spark – A Glimpse
3 Demonstration
21 / 23 Scala Spark / Toree / Jupyter notebook
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo Notebook with Spark on Docker
22 / 23 Thank you!
Hadoop – Spark Overview
J. Allemandou
Generalities
Hadoop Spark
Demo
23 / 23