Apache Hadoop & Spark – What Is It ?
Total Page:16
File Type:pdf, Size:1020Kb
Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Apache Hadoop & Spark { What is it ? Demo Joseph Allemandou JoalTech 17 / 05 / 2018 1 / 23 Hello! Hadoop { Spark Overview Joseph Allemandou J. Allemandou Generalities Hadoop Spark Demo 2 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 3 / 23 More computation power: Scale up vs. scale out Hadoop { Spark Overview J. Allemandou Scale Up Scale out Generalities Hadoop Spark Demo 4 / 23 More computation power: Scale up vs. scale out Hadoop { Spark Overview Scale out J. Allemandou Scale Up Generalities Hadoop Spark Demo 4 / 23 Parallel computing Hadoop { Spark Overview Things to consider when doing parallel computing: J. Allemandou Partitioning (tasks, data) Generalities Hadoop Spark Communications Demo Synchronization Data dependencies Load balancing Granularity IO Livermore Computing Center - Tutorial 5 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Since 1950 Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Ngram Viewer 6 / 23 Looking back - Recent times Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Figure: Google Trends 7 / 23 Same problem, different tools Hadoop { Spark Overview J. Allemandou Generalities Supercomputer Big Data Hadoop Spark Demo Dedicated hardware Commodity hardware Message Passing Interface 8 / 23 MPI Hadoop { Spark Overview J. Allemandou Generalities C / C++ / Fortran / Python Hadoop Spark Demo Low-level API - Send / receive messages a lot to do manually split the data assign tasks to workers handle synchronisation handle errors 9 / 23 Hadoop + Spark Hadoop { Spark Overview J. Allemandou Generalities Java / Scala / Python / R Hadoop Spark Demo High-level API - dataflows Less performant than MPI (see this scientific paper ) Easier to code than MPI (a lot!) High-availability oriented 10 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 11 / 23 Distributed storage and computation platform (Java, open source) HDFS { Hadoop Distributed File System YARN { Yet Another Resource Negotiator Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 12 / 23 Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo 12 / 23 Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) 12 / 23 Apache Hadoop (v2) in (almost) 3 points Hadoop { Spark Overview J. Allemandou Distributed storage and computation platform (Java, open source) Generalities HDFS { Hadoop Distributed File System Hadoop Spark YARN { Yet Another Resource Negotiator Demo Built for scalability Up to thousands of nodes (see this tweet ) Can be easily extended (heterogeneous hardware) Is resilient to errors (to some extent) Working with multiple computation engines Hadoop MapReduce (the good old one) Apache Spark (the new kid on the block) 12 / 23 Apache Hadoop - A big ecosysem Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 13 / 23 HDFS Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Master - Slave architecture (NameNode & DataNodes) YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 14 / 23 YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo 14 / 23 Global Execution engines exploit data locality Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) 14 / 23 Things to know about Apache Hadoop Hadoop { Spark Overview J. Allemandou HDFS Generalities Files are split in blocks { Partitioning ! Blocks are replicated { Failure tolerance ! Hadoop Spark Master - Slave architecture (NameNode & DataNodes) Demo YARN Execution containers with defined resources Scheduler manages resource sharing Master - Slave architecture (ResourceManager & NodeManagers) Global Execution engines exploit data locality 14 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - The Frame (HDFS + YARN) Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 15 / 23 Apache Hadoop - MapReduce Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 16 / 23 Apache Hadoop - MapReduce Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 16 / 23 Apache Spark Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 17 / 23 Apache Spark Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 17 / 23 Programming data-flows instead of single map-reduce steps A lot less code for complex flows Data lineage allows for error recovery Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 18 / 23 Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery 18 / 23 Why is Spark preferred to MapReduce? Hadoop { Spark Overview J. Allemandou Programming data-flows instead of single map-reduce Generalities steps Hadoop Spark A lot less code for complex flows Demo Data lineage allows for error recovery Decoupling of tasks and containers { Spark executors run multiple tasks Less executor management overhead Executors can reuse RAM - Caching! 18 / 23 Spark - A big ecosysem as well Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 19 / 23 Spark - An example of (very) complex flow Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Mediawiki History Reconstruction Job 20 / 23 Plan Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark 1 Generalities on High Performance Computing (HPC) Demo 2 Apache Hadoop and Spark { A Glimpse 3 Demonstration 21 / 23 Scala Spark / Toree / Jupyter notebook Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo Notebook with Spark on Docker 22 / 23 Thank you! Hadoop { Spark Overview J. Allemandou Generalities Hadoop Spark Demo 23 / 23.