IT 18 048 Examensarbete 30 hp November 2018

Spark for HPC: a comparison with MPI on compute-intensive applications using Monte Carlo method

Huijie Shen Tingwei Huang

Department of Information Technology

Abstract

With the emergence of various big data platforms in recent years, Apache Spark, a distributed large-scale computing platform, is perceived as a potential substitute for Message Passing Interface (MPI) in High Performance Computing (HPC). Due to its limitations in fault tolerance, dynamic resource handling and ease of use, MPI, as a dominant method to achieve parallel computing in HPC, is often associated with higher development time and costs in enterprises such as Scania IT. This thesis project aims to examine Apache Spark as an alternative to MPI on HPC clusters and compare their performance in various aspects. The test results are obtained by running a compute-intensive application on both platforms to solve a Bayesian inference problem of an extended Lotka-Volterra model using particle Markov chain Monte Carlo methods. As confirmed by the tests, Spark is demonstrated to be superior in fault tolerance, dynamic resource handling and ease of use, whilst having its shortcomings in performance and resource consumption compared with MPI. Overall, Spark proves to be a promising alternative to MPI on HPC clusters. As a result, Scania IT continues to explore Spark on HPC clusters for use in different departments.

Supervisor: Erik Lönroth
Subject reviewer: Sverker Holmgren
Examiner: Mats Daniels
IT 18 048
Printed by: Reprocentralen ITC

Acknowledgement

We would like to express gratitude to the following people for their involvement and contributions in this thesis project:

Erik Lönroth and Jimmy Hedman at Scania IT and Mohammadtaghi Rezaei at KTH, for their continuous guidance as supervisors of the project

Prof. Sverker Holmgren at Uppsala University for his valuable feedback on the report

Prof. Darren Wilkinson at Newcastle University, for providing the use case for our application

Fredrik Hurtig, Lars Kihlmark, Daniel Sigwid and Mats Johansson at Scania IT, for their help in solving technical issues throughout the project

Justin Pearson and Edith Ngai at Uppsala University, for their help on practical issues in the thesis process

Acronym

HPC    High Performance Computing
LV     Lotka-Volterra
MCMC   Markov chain Monte Carlo method
MH     Metropolis-Hastings algorithm
MPI    Message Passing Interface
ODE    ordinary differential equation
PMMH   particle marginal Metropolis-Hastings algorithm
PDF    probability density function
RDD    Resilient Distributed Dataset

Contents

Acknowledgement
Acronym
1 Introduction
1.1 Project Motivation
1.2 Scope
2 Background
2.1 High Performance Computing
2.1.1 History of Supercomputer
2.1.2 Supercomputer architecture
2.1.3 Software and system management
2.2 Message Passing Interface
2.2.1 Parallel Computing Model
2.2.2 Evolution of MPI
2.2.3 Implementation of MPI
2.3 Apache Spark
2.3.1 Spark Core
2.3.2 Spark Module
2.3.3 Spark Scheduler
2.3.4 How Spark works on a cluster
3 The application
3.1 Lotka-Volterra model
3.1.1 Deterministic Lotka-Volterra model
3.1.2 Stochastic Lotka-Volterra model
3.1.3 An extended Lotka-Volterra model
3.2 Monte Carlo methods and Bayesian inference
3.2.1 Metropolis-Hastings algorithm
3.2.2 Sequential Monte Carlo method
3.2.3 Bayesian inference
3.3 Solving the inference problem of LV-12 model using PMMH algorithm
3.4 Input and Outcome
4 Implementation
4.1 System environment
4.2 Set up Spark on HPC cluster
4.3 Set up MPI on HPC cluster
4.4 Parallelise the implementation
4.4.1 Parallelise the implementation in Spark
4.4.2 Parallelise the implementation in MPI
5 Results and discussion
5.1 Performance and scalability
5.2 Fault tolerance
5.3 Dynamic resource handling
5.4 Resource consumption
5.4.1 CPU consumption
5.4.2 Memory consumption
5.4.3 Network usage
5.5 Ease of use
5.6 Tuning the Spark application
5.6.1 Deploy mode
5.6.2 Number of partitions
5.6.3 Memory tuning
6 Related work
7 Conclusions
7.1 Distribution of work
7.2 Future work
References

1. Introduction

1.1 Project Motivation

Message Passing Interface (MPI) is currently the dominant method for message passing and data exchange within High Performance Computing (HPC) applications. While efficient, MPI has long been known for its limitations in fault tolerance and dynamic resource balancing. Besides, the semantics of MPI, designed to facilitate the work of both researchers and tool builders, have, on the contrary, further complicated their work.

The development of Apache Spark as an emerging method of performing distributed computing within HPC has attracted great interest in recent years. Not only does Spark feature automatic fault tolerance, but some research has also shown comparable performance between Spark and MPI for data-intensive applications [Jha et al., 2014]. Besides, a higher-level cluster processing framework like Spark enables users to focus more on the problem itself. As a result, the solution delivered is likely to be simpler and more condensed than with MPI, with much less work involved in case of slight adjustments to the program. Examples of how MPI and Spark differ in code complexity are illustrated by a 1D diffusion equation problem [Dursi].

At Scania IT, HPC clusters are provided to many departments for development use. Due to challenges in developing robust MPI applications and the high costs in case of crashes, alternative distributed computing platforms on the HPC clusters are urgently needed.

In this thesis project, we intend to explore Spark as an alternative to MPI on the HPC clusters. A well-known compute-intensive algorithm, particle Markov chain Monte Carlo (pMCMC), is selected as the method to solve the Bayesian inference problem of an extended Lotka-Volterra model. Studying a data-intensive framework such as Spark shows the state of running classical HPC applications on this type of framework, with its advanced features including fault tolerance, reactive and dynamic resource allocation, and domain-specific languages. For this specific compute-intensive application, Spark is compared against MPI in various aspects such as performance, fault tolerance, dynamic resource handling, resource utilisation and ease of use. The test results aim to provide insights into the possibility of introducing Spark, or a combination of Spark and other traditional platforms, in HPC applications at Scania IT. We believe the thesis work will also benefit those who are interested in big data/hybrid solutions for typical HPC problems.

1.2 Scope

Particle filters and Markov chain Monte Carlo (MCMC) algorithms provide computational methods for systematic inference and learning in complex dynamical systems, such as nonlinear and non-Gaussian state-space models. This project is built upon a compute-intensive application using a combination of these two methods. The application is run on two different platforms: Apache Spark, a data-intensive computing platform, and MPI, a traditional HPC platform.

The use case selected is based on the Bayesian inference of the original Lotka-Volterra model using the particle marginal Metropolis-Hastings algorithm [Wilkinson, 2011b]. We extended the use case to solve the Bayesian inference problem of an extended Lotka-Volterra model, and implemented it in Scala for Spark and C for MPI. The sequential implementation of the use case in Scala, by the author of the original application, can be found online1. Our work has adapted the base implementation into a parallel one with Spark and MPI.

The application will be tested on both Spark and MPI on HPC clusters. Apart from the performance and scalability of both platforms, we also look at the profiling information to better understand how Spark/MPI distributes the workload. Tests are performed to check whether Spark can handle node failure as well as dynamic resource allocation as it claims, in comparison to MPI. Furthermore, different parameters are tuned to gain additional insights about Spark with respect to performance and resource utilisation.

1https://github.com/darrenjw/djwhacks/tree/master/scala/bayeskit

2. Background

2.1 High Performance Computing

High-performance computing (HPC) is the use of supercomputers and parallel processing techniques for solving complex computational problems [tec, 2016]. It is a sub-discipline of computer science that focuses on developing architectures around supercomputers, with respect to both software and hardware. The main idea is to achieve an efficient infrastructure for parallel computing.

2.1.1 History of Supercomputer

A supercomputer is a computer or an array of computers with high performance in processing computationally intensive problems compared to a general personal computer or server. Supercomputers are mainly used to perform complex scientific calculation tasks, modelling and simulation, processing of 3D graphics, etc. The speed of supercomputers is generally measured in "FLOPS" (FLoating point Operations Per Second), as compared to "MIPS" (Million Instructions Per Second) in the case of general-purpose computers [Wu, 1999].

Supercomputers were first introduced in the 1960s, with the Atlas jointly designed by the University of Manchester, Ferranti and Plessey, and the CDC 6600 by Seymour Cray with a speed of up to 3 megaFLOPS. During the next two decades, Cray kept delivering supercomputers: the Cray 1 in the 1970s and the Cray 2 in the 1980s, which reached speeds of up to 1.9 gigaFLOPS. Both of the Cray series were successful supercomputers at that time. In the 1990s, the processors, networks and parallel architectures of supercomputers, in both hardware and software, were improved greatly. Thousands of processors and high-speed networks were introduced in supercomputers, as well as MPI. For example, the Hitachi SR2201, using 2048 processors, achieved a performance of 600 gigaFLOPS. By the 2000s, with the introduction of technologies such as GPUs, TPUs and optimised chips, more and more customised supercomputers were built, such as the famous Deep Blue. Within those decades, the speed of supercomputers has increased from megaFLOPS in the 1960s to dozens of PFLOPS today. As of June 2016, the Sunway TaihuLight, with a Linpack benchmark of 93 PFLOPS, was ranked 1st in the Top500 list of supercomputers.

2.1.2 Supercomputer architecture

Local parallelism was the main approach used to achieve high performance on the earliest supercomputers. Local parallel computing techniques such as pipelining and vector processing were suitable for early supercomputers, as they only had a few closely packed processors.

As the number of processors increased on a single supercomputer, massive parallelism became the trend. There are two types of massively parallel computing. One is the massive centralized parallel system, which mainly uses supercomputers with a large number of processors located in close proximity. The other is massive distributed parallel computing. For example, in grid computing, the processing power of a large group of computers in distributed, diverse administrative domains is opportunistically used whenever a computer is available [Prodan and Fahringer, 2007]. Another example is BOINC: a volunteer-based, opportunistic grid system, whereby the grid provides computing power only on a best-effort basis [Francisco and Erick, 2010].

2.1.3 Software and system management

The basic software stack on an HPC cluster consists of operating systems and drivers, schedulers, file systems, cluster managers and HPC programming tools. In this section, we will focus on the operating system and the scheduler, which are relevant to this project. MPI, as one of the programming standards on HPC and an important part of this project, will be introduced in a separate section below.

Operating system on HPC

In the early days of supercomputers, their architecture and hardware developed so rapidly that operating systems had to be customized to adapt to their needs. As a result, the fast-changing architecture brought along various quality issues and high costs. In the 1980s, Cray spent as much on software development as on hardware. Gradually, in-house operating systems were replaced by generic software [MacKenzie, 1998]. The idea was to use a general operating system (e.g. Unix in the old days and Linux in recent years), with rich additional functions to fit supercomputers with various architectures.

As general operating systems stabilised, the size of their code also increased, which made it hard for developers to use the systems and maintain their code. To avoid this, people started to use lightweight microkernel systems which only include the minimal functions needed to support the implementation of an operating system. In general, microkernel operating systems only provide low-level address space management, thread management and inter-process communication. Based on the development of microkernel operating systems, a mixed type of operating system came into use: microkernel operating systems usually run on the compute nodes, while larger systems such as Unix or Linux are used on the server or I/O nodes.

Figure 2.1 shows the operating system used on top 500 supercomputers between 1995 and 2015. Clearly, Linux had become the mainstream operating system for supercomputers over these years.

This project is based on a test cluster at Scania IT which uses Red Hat Linux 6.4 as operating system.

Resource management/Job scheduler on HPC

A job scheduler is an application that controls job execution with regard to dependencies. It is used for the submission and execution of jobs in a distributed network of computers. Most operating systems and some programs provide in-house job scheduling capabilities. In an enterprise, it is usually necessary to control and manage jobs submitted from different business units with the help of job schedulers. At Scania IT, LSF (IBM's Platform Load Sharing Facility) is the job scheduler for the HPC clusters. LSF has the capability to schedule diverse jobs for HPC clusters. LSF features monitoring of the state of resources, managing the job queue according to user-defined policies, and assigning the resources to the jobs properly. In this project, the applications implemented in Spark and MPI are both submitted to the LSF queuing system and scheduled by it. Besides LSF, there are some other popular job schedulers, including the Simple Linux Utility for Resource Management (SLURM), OpenLava, etc.

Figure 2.1. The operating systems used on the top 500 supercomputers

2.2 Message Passing Interface

MPI is an HPC programming standard with a core library that supports a scalable and portable parallel programming model on HPC platforms. It is a programming interface for message-passing applications, together with protocol and semantic specifications for how its features must behave in any implementation (such as message buffering and message delivery progress requirements). MPI includes point-to-point message passing and collective (global) operations, all scoped to a user-specified group of processes [Gropp et al., 1996].

2.2.1 Parallel Computing Model

With the success of simulations on supercomputers, demand for supercomputing resources increased sharply in the 1980s. On one hand, a single computer was not sufficient to accomplish complex scientific simulations. On the other hand, the increased performance and capacity of both Ethernet networks and wide-area networks made it possible to exchange data efficiently. All these factors combined gave rise to parallel computing.

Examples of parallel computational models at that time were data parallelism, shared memory, message passing, remote memory operations, threads and some combined models [Gropp et al., 1999]. Using Flynn's taxonomy, these models can be categorized as follows, based on how instructions act on data streams: SIMD, where data parallelism makes the same instruction carried out simultaneously on multiple data items; MIMD, where task parallelism makes multiple instructions execute on multiple data streams; SPMD, where multiple processors execute the same program on multiple data streams; MPMD, where multiple processors execute multiple programs on multiple data streams [Flynn, 1972]. SPMD is equivalent to MIMD since each MIMD program can be made SPMD (and similarly for SIMD, but not in a practical sense). The message-passing model addressed by MPI is for MIMD/SPMD parallelism [Gropp, 1998].

2.2.2 Evolution of MPI

A traditional process consists of a program counter and its associated memory space, and different processes have separate memory spaces. To achieve parallel computing, MPI carries out the communication among processes, with synchronization and data transfer from one process's memory to another by sending and receiving messages [Gropp, 1998].

The initial workshop on MPI was established in 1992. A working group comprising 175 individuals from parallel computer vendors, software developers, academia and application scientists met in November 1992. They presented the draft MPI in 1993, and later, in 1994, the first version of MPI (MPI-1.0) was introduced. In 1998, MPI-2 was released with more details in the MPI specification. The MPI standard used in this thesis is MPI-3.0, which was released in 2012.

The primary goal of the MPI specification is to demonstrate that users do not need to compromise efficiency, portability, or functionality in order to write portable programs [Gropp et al., 1999]. It is the first universal, vendor-independent standardized specification realised as a message passing library. The earliest MPI was designed for distributed memory architectures. As the HPC architecture changed, it started to support hybrid architectures combining distributed memory and shared memory systems. Today, MPI can run on any platform, be it a distributed memory, shared memory or hybrid system.

2.2.3 Implementation of MPI

There are many MPI implementations; popular ones in recent years are MVAPICH MPI, OpenMPI and LAM MPI. Many efforts have also been made on hardware-specific MPI implementations, such as Cray MPI, HP MPI and IBM MPI. Most implementations of MPI have native support for C, C++ and Fortran. There are also libraries that support developing MPI applications in other languages, such as Bryan Carpenter's mpiJava for Java and pyMPI for Python. In this project, we develop the application in C with the OpenMPI implementation.

An MPI program written in C/Fortran is compiled by an ordinary C/Fortran compiler and linked with the MPI library. In our case, the C compiler is gcc. The details of developing the MPI application will be described in Section 4.4.2.

2.3 Apache Spark

Apache Spark is a large-scale distributed data processing engine. It provides high-level APIs for different programming languages (Scala, Java, Python and R for now) that allow users to efficiently execute data processing, streaming, SQL or machine learning workloads. It achieves fast and robust large-scale computing through data parallelism and built-in fault tolerance.

The Spark project was first initiated around 2009 by Matei Zaharia during his Ph.D. program at University of California Berkeley's AMPLab. It was then donated to, and is maintained by, the Apache Software Foundation.

Spark consists of Spark Core and other components such as APIs for streaming or machine learning jobs. It can be run on Apache Mesos, Hadoop YARN or in Spark standalone cluster mode, and can access distributed data sources such as HDFS (Hadoop Distributed File System), Cassandra, S3 and HBase, as well as non-distributed file systems. Users can develop Spark applications using its API and different libraries. Figure 2.2 shows the Spark architecture and its basic modules.

Figure 2.2. Apache Spark architecture

2.3.1 Spark Core

Spark Core is the kernel of the whole Spark project. It is based on the abstraction of a Resilient Distributed Dataset (RDD), which is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost [Zaharia et al., 2010]. Using RDDs allows users to explicitly store intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators [Zaharia et al., 2012]. This significantly helps to improve the performance of iterative machine learning tasks and interactive data mining jobs. Such jobs need to keep reloading data from disk, which is a challenge for other cluster parallel computing models such as MapReduce and Dryad. RDDs are instead loaded (and cached if required) into memory and can be reused by these interactive jobs efficiently.

RDDs can only be created through deterministic operations on either data in stable storage or other RDDs [Zaharia et al., 2012]. These operations are usually called transformations, and include map, filter and join. Another type of operation supported by RDDs are actions, such as reduce, collect and foreach. Actions usually return final results after a few RDD transformations. These general operations make it possible to perform many different operations on RDDs in parallel, and users are granted high-level control of their applications through these operations on RDDs.
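For illustration, the following Scala snippet (our own example, not taken from the thesis code, and assuming an already created SparkContext named sc) shows the difference between lazy transformations and an action that triggers the actual computation:

val numbers = sc.parallelize(1 to 1000, 8)      // distribute a local collection over 8 partitions
val squares = numbers.map(x => x * x)           // transformation: only recorded in the lineage, nothing runs yet
val evenSquares = squares.filter(_ % 2 == 0)    // another lazy transformation
evenSquares.cache()                             // ask Spark to keep this RDD in memory once it is computed
val total = evenSquares.reduce(_ + _)           // action: triggers the distributed computation and returns a result

Until reduce is called, Spark only builds up the lineage of evenSquares; this lineage is also what allows lost partitions to be recomputed, as described next.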

An RDD has enough information (its lineage) about how it was derived from other RDDs [Zaharia et al., 2012]. Spark is able to recover lost partitions of an RDD by recomputing them from the logged lineage information. This is basically how fault tolerance is achieved in Spark.

2.3.2 Spark Module

In addition to Spark Core, Spark provides a general platform with the modules Spark SQL, Spark Streaming, MLlib (machine learning library) and GraphX (graph computing library) for users to develop functional applications efficiently. These modules are packaged as libraries, and users can combine several of them in one application.

2.3.3 Spark Scheduler

The Spark scheduler relies on a cluster manager to schedule Spark jobs. Spark can currently work with three types of cluster managers: Spark Standalone, Apache Mesos and Hadoop YARN (Yet Another Resource Negotiator).

Apache Mesos is a cluster manager handling cluster resources such as CPU, memory and storage for all applications. It is a common resource manager with a fine-grained, simple and scalable scheduling mechanism. It can schedule Spark jobs as well as other tasks. Although it is quite different from the LSF queuing system, they have similar features with regard to resource management and task scheduling. Due to potential conflicts between the two platforms on the same cluster, we therefore consider the other two options.

Hadoop YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters [hor, 2011]. YARN is one of the two core components of Hadoop 2.0+. Spark can also run on Hadoop using YARN as the scheduler. The Hadoop framework is normally used to process large datasets, which is not necessary for the sake of this project. Hence, we decided to go with Spark and its native standalone cluster manager, without using a distributed file system.

Spark Standalone is self-packaged with Spark. It uses a master/slave framework. The Spark master is responsible for distributing tasks and data to the workers; the executors on each of the workers execute the received tasks. The cluster manager provides the capability to launch the Spark master and workers, run Spark applications, schedule the cluster resources, provide fault tolerance, and monitor and log jobs (with a Web UI). Users can tune the cluster through configuration files, scripts, or parameters when submitting a Spark application. Currently the standalone cluster manager uses a simple FIFO queue to schedule across multiple applications.

2.3.4 How Spark works on a cluster

So far, we have introduced some key concepts of Spark such as RDD and cluster manager. In this section, we will put these components together and briefly explain how Spark works on a cluster. Some new concepts that have not been mentioned previously will be introduced as well.

Once Spark is deployed on a cluster, and the Spark standalone cluster manager (master) is started on one node and connected with some worker nodes, the user can submit a Spark application to the master from any computer that can connect to the cluster manager and the workers.

Every Spark application must create a SparkContext so that Spark knows how to connect to the cluster. The program in which the SparkContext is created is called the driver program. The driver program negotiates with the cluster manager for executors on worker nodes. One worker node can have one or more executors, depending on the configuration.

After the executors are acquired, the application code is sent to the executors. The driver program contains the user code that performs transformations and actions on RDDs, as well as some other functions. A complete series of transformations on RDDs that ends with an action or data saving is one single job. The transformations on RDDs in this job are translated into a DAG (Directed Acyclic Graph). There can be multiple stages within a job, each containing transformations that do not need data shuffling or re-partitioning. Each stage is divided into multiple parallel tasks, each of which operates on a single partition of the RDD. These tasks are sent to the executors by the driver program through the cluster manager. Once executed, the results are returned to the driver program.
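To make the flow concrete, the following self-contained Scala sketch (our own illustration with a placeholder master URL and application name, not the thesis application) creates a SparkContext against a standalone master and runs a single job consisting of two transformations and one action:

import org.apache.spark.{SparkConf, SparkContext}

object MinimalDriver {
  def main(args: Array[String]): Unit = {
    // The SparkContext tells Spark how to reach the cluster manager.
    val conf = new SparkConf()
      .setAppName("minimal-driver-example")       // placeholder application name
      .setMaster("spark://master-host:7077")      // placeholder standalone master URL
    val sc = new SparkContext(conf)

    // One job: two transformations (map, filter) ending in one action (count).
    val data   = sc.parallelize(1 to 100000, 16)  // 16 partitions -> 16 parallel tasks per stage
    val result = data.map(_ * 2).filter(_ % 3 == 0).count()

    println(s"count = $result")
    sc.stop()
  }
}

Since map and filter need no shuffling, this job consists of a single stage whose tasks are distributed to the executors, and the count is returned to the driver.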

Figure 2.3 shows how Spark works in a cluster, with more details illustrated in Figure 2.4. Some sophisticated concepts in these figures are not covered in this section, as they are not used in the rest of the report. Readers may find relevant materials on the Internet or contact us if they are interested in learning more.

Figure 2.3. A basic view about how Spark works in a cluster. From http://spark.apache.org/docs/1.6.0/cluster-overview.html; accessed 2016-09-10

Figure 2.4. A more detailed view about how Spark works in a cluster. From http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/; accessed 2016-09-10

3. The application

In this project, we are interested in comparing Spark and MPI for a specific application. Since Spark has been widely known as a powerful platform for data-intensive applications, we instead selected a compute-intensive application that solves the Bayesian inference problem of an extended Lotka-Volterra model, using the particle marginal Metropolis-Hastings algorithm. The details will be explained in this chapter.

3.1 Lotka-Volterra model

The Lotka-Volterra (LV) model [Lotka, 1925, Volterra, 1926] describes the dynamics of two interacting species in a simplified biological system: the prey and the predator. The model can be used to predict the population of both species in a simple biological system, with broader applications in other areas such as economics.

3.1.1 Deterministic Lotka-Volterra model

The Lotka-Volterra model has two interacting species: one prey and one predator. The dynamics of the populations can be described by two ordinary differential equations (ODEs):

dx/dt = αx − βxy    (3.1)
dy/dt = δxy − γy    (3.2)

for α, β, δ, γ > 0. In Equations 3.1 and 3.2, x and y represent the population of prey and predator respectively, and t represents time. α is the reproduction rate of the prey, and γ is the death rate of the predator. β and δ are the interaction rates between prey and predator.

Equation 3.1 describes the rate of change of the prey population: the population increases through reproduction and decreases through being killed by the predators. Equation 3.2, in turn, describes the balance of the predator population, which is driven by the consumption of prey and by natural death. For simplicity, no other interaction is considered in this model.
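As a small illustration (our own sketch, with arbitrary parameter values and a naive fixed-step Euler scheme rather than a proper ODE solver), Equations 3.1 and 3.2 can be integrated forward in Scala as follows:

// Forward-Euler integration of the deterministic LV model (Equations 3.1 and 3.2).
// The parameter values and step size below are illustrative only.
def simulateDeterministicLV(x0: Double, y0: Double,
                            alpha: Double, beta: Double,
                            delta: Double, gamma: Double,
                            dt: Double, steps: Int): Seq[(Double, Double)] = {
  var (x, y) = (x0, y0)
  (1 to steps).map { _ =>
    val dx = alpha * x - beta * x * y     // prey: reproduction minus predation
    val dy = delta * x * y - gamma * y    // predator: growth from predation minus natural death
    x += dx * dt
    y += dy * dt
    (x, y)
  }
}

val trajectory = simulateDeterministicLV(x0 = 30, y0 = 4,
  alpha = 1.0, beta = 0.1, delta = 0.02, gamma = 0.4, dt = 0.001, steps = 50000)

Plotting such a trajectory reproduces the closed, repeating orbits characteristic of the deterministic model.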

The model described by Equations 3.1 and 3.2 is the deterministic form of the LV model. Figures 3.1 and 3.2 illustrate a deterministic LV model, where α = 1, β = 0.1, δ = 1.5 and γ = 0.08. Such a deterministic form is periodic, with perfectly repeating oscillations that carry on indefinitely. In contrast, the stochastic form oscillates, but in a random, unpredictable way [Wilkinson, 2011a]. Therefore, a deterministic LV model can barely reflect the realistic dynamics of the biological system, in which randomness in many small parts of the system leads to a relatively high unpredictability in the evolution of the whole system. In this sense, a stochastic perspective of the model is closer to reality, and we adopted a stochastic form of the LV model in the application.

Figure 3.1. The dynamics of prey and predator in a Lotka-Volterra model

Figure 3.2. The population of prey and predator in a deterministic LV model changes periodically and sustains a dynamic equilibrium

3.1.2 Stochastic Lotka-Volterra model

The stochastic LV model can be represented by three stochastic chemical kinetics formulas [Golightly and Gillespie, 2013]:

X_1 \xrightarrow{c_1} 2X_1    (3.3)
X_1 + X_2 \xrightarrow{c_2} 2X_2    (3.4)
X_2 \xrightarrow{c_3} \emptyset    (3.5)

where X_1 and X_2 are the populations of prey and predator respectively, and c_1, c_2 and c_3 are the rate parameters. Reaction 3.3 is the reproduction of the prey; reaction 3.4 is the interaction between prey and predator; reaction 3.5 is the death of the predator. Each reaction has its own hazard:

h_1(x, c_1) = c_1 x_1    (3.6)
h_2(x, c_2) = c_2 x_1 x_2    (3.7)
h_3(x, c_3) = c_3 x_2    (3.8)

The stoichiometry matrix of the model is

S = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{pmatrix}    (3.9)

By applying Gillespie’s algorithm, we can derive the population of the prey and predator after any given time period, if we consider the dynamics as a stochastic process. Gillespie’s algorithm can simulate the forward process of a stochastic model. It was popularized by Dan Gillespie in late 1970s [Gillespie, 1977], and was called "stochastic simulation algorithm" in that paper back then. Gillespie’s algorithm has been well summarized as algorithm1 by Golightly and Gille- spie [Golightly and Gillespie, 2013]. Since stochastic chemical kinetics is based on biochemistry, the population originally represented the number of molecules of each species. That is why xi represents the initial molecule number in the first step of algorithm1.

Algorithm 1 Gillespie’s algorithm 1: Set t = 0. Initialise the rate constants c1,...,cv and the initial molecule numbers x1,...,xu.

2: Calculate hi(x,ci), i = 1,...,v based on the current state, x. v 3: Calculate the combined hazard h0(x,c) = ∑i=1 hi(x,ci) 0 0 4: Simulate the time to the next event, t ∼ Exp(h0(x,c)) and put t := t +t . 5: Simulate the reaction index, j, as a discrete random quantity with probabilities

hi(x,ci)/h0(x,c), i = 1,...,v. 6: Update x according to reaction j. That is, put x := x + S j.

7: Output x and t. If t < Tmax, return to step 2.
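A compact Scala sketch of this procedure for the LV-11 model is shown below; it is our own simplified illustration rather than the thesis implementation, with the hazards and state updates following Equations 3.6-3.9 and scala.util.Random standing in for a proper statistics library:

import scala.util.Random

// One forward simulation of the stochastic LV-11 model with Gillespie's algorithm.
// state0 = (prey, predator); c = (c1, c2, c3); tMax is the simulated time span.
def gillespieLV11(state0: (Int, Int), c: (Double, Double, Double),
                  tMax: Double, rng: Random = new Random()): (Int, Int) = {
  var (x1, x2) = state0
  var t = 0.0
  while (t < tMax) {
    // Hazards of the three reactions (Equations 3.6-3.8).
    val h = Array(c._1 * x1, c._2 * x1 * x2, c._3 * x2)
    val h0 = h.sum
    if (h0 <= 0.0) return (x1, x2)            // no reaction can fire any more (extinction)
    t += -math.log(rng.nextDouble()) / h0     // exponential waiting time with rate h0
    if (t < tMax) {
      // Pick reaction j with probability h(j)/h0 and apply the j-th column of S (Equation 3.9).
      val u = rng.nextDouble() * h0
      if (u < h(0))             x1 += 1                  // prey reproduction
      else if (u < h(0) + h(1)) { x1 -= 1; x2 += 1 }     // predation
      else                      x2 -= 1                  // predator death
    }
  }
  (x1, x2)
}

Repeatedly calling gillespieLV11 with the same starting state gives different trajectories, which is the random, non-periodic behaviour shown below.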

The dynamics of prey and predator over a 50-unit-long time range in a stochastic Lotka-Volterra model are shown in Figures 3.3 and 3.4. In this model, c_1 = 1, c_2 = 0.005, and the initial populations of prey and predator are 80 and 50 units, respectively. Unlike the deterministic LV model, the stochastic LV model does not follow a periodic pattern; instead, the model will end up in the extinction of one or both species in finite time. Extinction cases of the stochastic LV model can be found in the book by Ullah and Wolkenhauer [Ullah and Wolkenhauer, 2011, p. 136-137] and many other sources.

Figure 3.3. The dynamics of prey and predator in a stochastic Lotka-Volterra model

Figure 3.4. The population of prey and predator in a stochastic LV model does not change periodically, and one species will go extinct in finite time

3.1.3 An extended Lotka-Volterra model

Our project intends to compare Spark and MPI for compute-intensive applications. By design, the application should take a sufficiently long computing time for each parallel part. In our first attempt, we used the application that solves the Bayesian inference problem of a stochastic Lotka-Volterra model using the particle marginal Metropolis-Hastings algorithm (see Section 3.2.1). However, as illustrated in Figure 3.5, the overhead of the computing job is much more time-consuming than the actual computing itself. Since the computing load is not significant enough, the benefit of parallel execution on an HPC platform is minimal.

Figure 3.5. The time decomposition for 12 parallel computing jobs in Spark for a stochastic LV model. The blue, red and green bars show the time spent on scheduling, task deserialisation and computing, respectively.

To solve the problem, we extended the one prey-one predator LV model (the LV-11 model) to a one prey-two predators LV model (the LV-12 model), based on the generalised LV-1n model [Ristic and Skvortsov, 2015]. The computation in the LV-12 model is largely increased compared with the LV-11 model, based on the test results.

LV-12 model adds one more predator into the LV-11 model. The two predators live on the same prey, but do not prey on each other. The deterministic LV-12 model can be described by three ODEs:

dx/dt = αx − β_1 x y_1    (3.10)
dy_1/dt = δ_1 x y_1 − γ_1 y_1    (3.11)
dy_2/dt = δ_2 x y_2 − γ_2 y_2    (3.12)

The definitions of the variables in Equations 3.10, 3.11 and 3.12 are similar to those in Equations 3.1 and 3.2. β_i and δ_i represent the interaction rates between the prey and the i-th predator, and γ_i is the death rate of the i-th predator, for i = 1, 2.

For the stochastic LV-12 model, the same assumptions about the prey and predators apply. The stochastic chemical kinetics formulas are

X_1 \xrightarrow{c_1} 2X_1    (3.13)
X_1 + X_2 \xrightarrow{c_2} 2X_2    (3.14)
X_1 + X_3 \xrightarrow{c_3} 2X_3    (3.15)
X_2 \xrightarrow{c_4} \emptyset    (3.16)
X_3 \xrightarrow{c_5} \emptyset    (3.17)

where X_1, X_2 and X_3 are the populations of the prey and the two predators, and c_i (i = 1,...,5) are the reaction rates. Compared with the stochastic LV-11 model, two reactions are added in the stochastic LV-12 model: the interaction between the added predator and the prey (reaction 3.15), and the death of the added predator (reaction 3.17).

The hazards of each reaction are

h_1(x, c_1) = c_1 x_1    (3.18)
h_2(x, c_2) = c_2 x_1 x_2    (3.19)
h_3(x, c_3) = c_3 x_1 x_3    (3.20)
h_4(x, c_4) = c_4 x_2    (3.21)
h_5(x, c_5) = c_5 x_3    (3.22)

The stoichiometry matrix of the model, with one row per species and one column per reaction, is

S = \begin{pmatrix} 1 & -1 & -1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 & -1 \end{pmatrix}    (3.23)

By applying Gillspie’s algorithm (algorithm1), the random process of the population change can be simulated for any particular time range, given the initial population of the prey and predators. With one more predator in the play, the randomness and complexity have increased tremendously. The population of all the three species at any specific moment differ a lot in different runs. Fig- ure 3.6 shows one simulation run of the LV-12 model using Gillspie’s algorithm.

prey predator1 predator2 Population 0 1000 2000 3000 4000 5000 6000

0 10 20 30 40 50

Time

Figure 3.6. The dynamics of prey and two predators in a stochastic LV-12 model. The initial population of prey, first and second predators are 800, 500, and 600. Reaction rate ci = 10.0, 0.005, 0.0025, 6.0, 3.0, for i = 1,. . . ,5. The population are derived in the time range between 0 and 50, with time interval of 1.
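Under the same caveats as the earlier Gillespie sketch (illustrative code only, not the thesis implementation), adapting it to the LV-12 model amounts to replacing the hazard vector and the per-reaction state updates according to Equations 3.18-3.23:

import scala.util.Random

// One forward simulation of the stochastic LV-12 model over a time span dt.
// state0 = (prey, predator1, predator2); c = reaction rates c1..c5.
def gillespieLV12(state0: (Int, Int, Int), c: Array[Double],
                  dt: Double, rng: Random = new Random()): (Int, Int, Int) = {
  var (x1, x2, x3) = state0
  var t = 0.0
  while (t < dt) {
    val h = Array(c(0) * x1, c(1) * x1 * x2, c(2) * x1 * x3, c(3) * x2, c(4) * x3)
    val h0 = h.sum
    if (h0 <= 0.0) return (x1, x2, x3)
    t += -math.log(rng.nextDouble()) / h0
    if (t < dt) {
      val u = rng.nextDouble() * h0
      if (u < h(0))                            x1 += 1                  // prey reproduction
      else if (u < h(0) + h(1))                { x1 -= 1; x2 += 1 }     // predation by predator 1
      else if (u < h(0) + h(1) + h(2))         { x1 -= 1; x3 += 1 }     // predation by predator 2
      else if (u < h(0) + h(1) + h(2) + h(3))  x2 -= 1                  // death of predator 1
      else                                     x3 -= 1                  // death of predator 2
    }
  }
  (x1, x2, x3)
}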

As we can see, the LV-12 model shares a lot of basic concepts and equations with the LV-11 model. By introducing one more predator into the model, there are more reactions and parameters in the stochastic chemical kinetics representation. Consequently, the model becomes more complex, and the computation required for each forward step of the simulation using Gillespie's algorithm is much heavier than for LV-11. For the same setting of the application on Spark, the total running time of the application using the LV-12 model is at least 15 times higher. The large volume of computation for the Bayesian inference problem of the LV-12 model makes it suitable as the application for comparing Spark and MPI for the purpose of high performance computing.

3.2 Monte Carlo methods and Bayesian inference

3.2.1 Metropolis-Hastings algorithm

The Metropolis-Hastings algorithm (MH) is one of the simplest and most common Markov chain Monte Carlo (MCMC) methods. MH can generate a sequence of random samples from a probability distribution that is not fully known. The probability distribution of the generated samples, or the equilibrium distribution of the Markov chain, approximates the true probability distribution.

The Metropolis-Hastings algorithm is described in Algorithm 2. The transition kernel q(θ'|θ) can be an arbitrary one; the choice of transition kernel does not affect the true probability distribution of the outcome [Wilkinson, 2011b, p. 265-266]. π(θ) is the probability density function (pdf). Although π(θ) is often the target density of using the Metropolis-Hastings algorithm, and therefore unknown beforehand, a density proportional to the target density suffices in MH, since only the ratio π(θ')/π(θ) is used in calculating the acceptance ratio.

Algorithm 2 Metropolis-Hastings algorithm
1: Suppose the value of the current state is θ.
2: Propose a new value θ' according to the transition kernel q(θ'|θ).
3: Set the value of the next state to θ' with acceptance probability α(θ, θ') = min{1, \frac{π(θ')q(θ|θ')}{π(θ)q(θ'|θ)}}. If not accepted, keep the value θ.
4: Return to step 1.

As a special case of the MH algorithm, the Metropolis algorithm uses a symmetric proposal q(θ'|θ), with q(θ'|θ) = q(θ|θ'). Therefore, the acceptance ratio α(θ, θ') simplifies to min{1, π(θ')/π(θ)}.
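As a generic illustration of this special case (our own sketch, not tied to the thesis application), a random-walk Metropolis sampler for a one-dimensional target known only up to a constant can be written in Scala as:

import scala.util.Random

// Generic Metropolis sampler: logTarget is log pi(theta) up to an additive constant,
// and the proposal is a symmetric Gaussian random walk with standard deviation `step`.
def metropolis(logTarget: Double => Double, theta0: Double,
               step: Double, iterations: Int, rng: Random = new Random()): Vector[Double] = {
  var theta = theta0
  var logP  = logTarget(theta)
  (1 to iterations).map { _ =>
    val proposal = theta + step * rng.nextGaussian()   // symmetric kernel q(theta'|theta)
    val logPprop = logTarget(proposal)
    // Accept with probability min(1, pi(theta')/pi(theta)), evaluated in log space.
    if (math.log(rng.nextDouble()) < logPprop - logP) {
      theta = proposal
      logP  = logPprop
    }
    theta
  }.toVector
}

// Example: sampling a standard normal, whose log-density is -theta^2/2 up to a constant.
val chain = metropolis(t => -0.5 * t * t, theta0 = 0.0, step = 1.0, iterations = 10000)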

3.2.2 Sequential Monte Carlo method

The sequential Monte Carlo method, also known as the particle filter, is a method to solve hidden Markov models and nonlinear, non-Gaussian filtering problems. In this application, the bootstrap particle filter is used to make a Monte Carlo estimate of the marginal likelihood π(y|θ) (see the next section for details). In principle, the mechanism of the bootstrap particle filter can be described as follows.

Each state in the sample space is updated to the next state in the Markov chain using a transition kernel (known beforehand). The updated states are re-sampled based on their weights, which are determined by how close each state is to the observed state. The complete bootstrap particle filter algorithm is well summarised by Doucet et al. [Doucet et al., 2001, Ch. 1] as Algorithm 3.

Algorithm 3 Bootstrap particle filter
1: Initialisation, t = 0.
   - For i = 1,...,N, sample x_0^{(i)} ~ π(x_0) and set t = 1.
2: Importance sampling step
   - For i = 1,...,N, sample \tilde{x}_t^{(i)} ~ π(x_t|x_{t-1}^{(i)}) based on the Markov process model and set \tilde{x}_{0:t}^{(i)} = (x_{0:t-1}^{(i)}, \tilde{x}_t^{(i)}).
   - For i = 1,...,N, evaluate the importance weights \tilde{w}_t^{(i)} = π(y_t|\tilde{x}_t^{(i)}).
   - Normalise the importance weights.
3: Selection step
   - Resample with replacement N particles (x_{0:t}^{(i)}; i = 1,...,N) from the set (\tilde{x}_{0:t}^{(i)}; i = 1,...,N) according to the importance weights.
   - Set t ← t + 1 and if t ≤ T, go to step 2 (Importance sampling step).

By applying a bootstrap particle filter, we can get the unbiased estimate of π(y_{1:T}):

\hat{π}(y_{1:T}) = \hat{π}(y_1) \prod_{t=2}^{T} \hat{π}(y_t|y_{1:t-1}),    (3.24)

where

\hat{π}(y_t|y_{1:t-1}) = \frac{1}{N} \sum_{k=1}^{N} \tilde{w}_t^{(k)}.    (3.25)

The unbiasedness of this estimate as N → ∞ has been proven in various works; see [Del Moral, 2004, Pitt et al., 2010].
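A compact Scala sketch of this estimator is given below; it is our own illustration for a generic state type S, where init, propagate and logObs are placeholders for drawing from π(x_0), advancing a particle through the transition kernel, and evaluating log π(y_t|x_t) for the model at hand:

import scala.util.Random

// Bootstrap particle filter returning the log of the marginal likelihood estimate
// log pi_hat(y_{1:T}) as in Equations 3.24-3.25.
def bootstrapFilter[S, Y](observations: Seq[Y], n: Int,
                          init: () => S,
                          propagate: S => S,
                          logObs: (Y, S) => Double,
                          rng: Random = new Random()): Double = {
  var particles = Vector.fill(n)(init())
  var logLik = 0.0
  for (y <- observations) {
    val proposed = particles.map(propagate)                   // importance sampling step
    val weights  = proposed.map(x => math.exp(logObs(y, x)))  // unnormalised importance weights
    val mean     = weights.sum / n
    logLik += math.log(mean)                                  // contribution log pi_hat(y_t | y_{1:t-1})
    // Selection step: resample with probability proportional to the weights.
    val cumulative = weights.scanLeft(0.0)(_ + _).tail
    particles = Vector.fill(n) {
      val u = rng.nextDouble() * weights.sum
      proposed(cumulative.indexWhere(_ >= u))
    }
  }
  logLik
}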

3.2.3 Bayesian inference

Bayesian inference is a fundamental method for statistical inference in Bayesian statistics, based on Bayes' theorem. Bayesian inference updates the posterior probability based on the prior probability and some observed data.

Formally, Bayesian inference can be represented as

P(H_i|X = x) = \frac{P(X = x|H_i)P(H_i)}{\sum_{j=1}^{n} P(X = x|H_j)P(H_j)},  i = 1,...,n,    (3.26)

where H_i is the hypothesis, X is the outcome, and x is an observation of the outcome [Wilkinson, 2011b, p. 249-250]. The left-hand side of the equation is the posterior probability. P(H_i) is the prior probability of hypothesis H_i. P(X = x|H_i) is the probability of X = x given H_i; it is called the likelihood function. \sum_{j=1}^{n} P(X = x|H_j)P(H_j) is the marginal likelihood, which is the probability of X = x averaged across all possible hypotheses. The marginal likelihood is the same for every hypothesis H_i.

For a continuous hypothesis space (and discrete outcome X), Bayesian inference can be represented as

π(θ|X = x) = \frac{π(θ)P(X = x|θ)}{\int_{Θ} P(X = x|θ')π(θ') dθ'},    (3.27)

where the hypothesis is denoted as θ in the space Θ, and its prior and posterior probabilities become density functions, denoted as π(θ) and π(θ|X = x) [Wilkinson, 2011b, p. 250].

Consider the joint probability distribution π(θ, x) = π(θ)π(x|θ), where θ is the parameter and x is the data. A common situation is that x cannot be observed directly (it is therefore called a latent variable), while y is the observed data that carries information about x. The joint distribution of θ, x and y is

π(θ, x, y) = π(θ)π(x|θ)π(y|x, θ).    (3.28)

Given observed data y, we can get the marginal posterior

π(θ|y) = \frac{π(θ)π(y|θ)}{\int_{Θ} π(y|θ')π(θ') dθ'}.    (3.29)

Obviously π(θ|y) ∝ π(θ)π(y|θ), since \int_{Θ} π(y|θ')π(θ') dθ' is a constant. To get the marginal posterior π(θ|y), the Metropolis-Hastings algorithm (see Section 3.2.1) can be used. The acceptance ratio α(θ, θ') in this case should be min{1, \frac{π(θ'|y)q(θ|θ')}{π(θ|y)q(θ'|θ)}}, and since π(θ|y) ∝ π(θ)π(y|θ), the acceptance ratio becomes

min{1, \frac{π(θ')π(y|θ')q(θ|θ')}{π(θ)π(y|θ)q(θ'|θ)}},    (3.30)

where π(y|θ) = \int_{X} π(y|x, θ)π(x|θ) dx. The difficulty in this method is that π(y|θ) is hardly possible to derive [Wilkinson, 2011b, p. 271-272]. Instead, we can use a Monte Carlo estimate \hat{π}(y|θ) of π(y|θ). \hat{π}(y|θ) is called an unbiased estimate of π(y|θ) if E(\hat{π}(y|θ)) = π(y|θ). By replacing π(y|θ) with its unbiased estimate \hat{π}(y|θ), the acceptance ratio becomes

α(θ, θ') = min{1, \frac{π(θ')\hat{π}(y|θ')q(θ|θ')}{π(θ)\hat{π}(y|θ)q(θ'|θ)}},    (3.31)

and the target achieved is the exact approximation of the correct marginal posterior π(θ|y) [Wilkinson, 2011b, p. 273-274].

To infer the marginal posterior π(θ|y_{1:T}) based on a series of observed data in discrete time, y_{1:T}, we can use the method above plus the bootstrap particle filter introduced in Section 3.2.2, which gives an unbiased estimate of π(y_{1:T}|θ). At first glance, it seems that the bootstrap particle filter does not give \hat{π}(y_{1:T}|θ) but \hat{π}(y_{1:T}). But they are equivalent if we look at the bootstrap particle filter carefully. The prerequisite of the bootstrap particle filter is that we already have the parameter θ, and the calculation of \hat{π}(y_{1:T}) is conditional on the value of θ. Therefore, the bootstrap particle filter gives \hat{π}(y_{1:T}|θ) as a result, and the combination of the bootstrap particle filter and the Metropolis-Hastings algorithm works perfectly in solving the Bayesian inference problem of the marginal posterior π(θ|y_{1:T}) based on a series of observed data in discrete time.

3.3 Solving the inference problem of LV-12 model using PMMH algorithm

The use case to compare Spark and MPI is the Bayesian inference problem of the LV-12 model (with the 5 reaction rates as θ, introduced in Section 3.1.3), given the observations y_{1:T} of the populations of the prey and the two predators in a time period from 1 to T. Since the observations are subject to measurement error, this can be seen as a Bayesian inference problem with a latent variable. As discussed in Section 3.2.3, this problem can be solved using the particle marginal Metropolis-Hastings (PMMH) algorithm, a Metropolis-Hastings algorithm targeting the marginal posterior with a bootstrap particle filter embedded. Algorithm 4 is the application of PMMH to get the marginal posterior π(θ|y_{1:T}) of the LV-12 model parameters θ based on the observations y_{1:T}. The solution can be found in Wilkinson's work [Wilkinson, 2011b, ch. 10]. Most of the algorithm is explained in the subsections above; however, some decisions were made for this specific use case, and they are justified below.

Algorithm 4 PMMH algorithm to infer LV-12 parameters
1: function PMMH-LV12(observations)
2:     log_likelihood ← −∞
3:     θ ← [10.0, 0.005, 0.0025, 6.0, 3.0]                        ▷ Initial guess
4:     tune ← 0.014                                               ▷ Sigma in the transition kernel
5:     sample_size ← 1000
6:     for i = 0 to 1000 do
7:         θ' ← θ map {x → x ∗ exp(Gaussian(0, tune).draw())}     ▷ Transition kernel
8:         prop_log_likeli ← particle_filter(θ', observations, sample_size)
9:         ran ← log(Random(0,1).draw())
10:        if ran < prop_log_likeli − log_likelihood then
11:            θ ← θ'
12:            log_likelihood ← prop_log_likeli
13:        end if
14:    end for
15: end function

16: function PARTICLE_FILTER(θ', observations, sample_size)
17:    first_observ ← observations.first()
18:    prey ← Poisson(first_observ.get(0)).sample(sample_size)        ▷ Initial samples of the prey
19:    predator1 ← Poisson(first_observ.get(1)).sample(sample_size)   ▷ Initial samples of predator1
20:    predator2 ← Poisson(first_observ.get(2)).sample(sample_size)   ▷ Initial samples of predator2
21:    samples ← Matrix(prey, predator1, predator2).transpose()       ▷ Initial samples of the population; a sample_size × 3 matrix
22:    log_likeli ← 0
23:    prev_observ_time ← first_observ.time
24:    for i in observations do
25:        ∆t ← i.time − prev_observ_time
26:        if ∆t = 0 then
27:            next_state ← samples
28:        else
29:            next_state ← samples map {x → Gillespie_LV12(x, ∆t, θ')}
30:        end if
31:        importance_w ← next_state map {x → likeli(x, i.populations)}
32:        mean ← mean(importance_w)
33:        importance_w ← normalize(importance_w)
34:        samples ← next_state.discrete_sample(sample_size, importance_w)   ▷ Re-sampling
35:        log_likeli ← log_likeli + log(mean)
36:        prev_observ_time ← i.time
37:    end for
38:    return log_likeli
39: end function

40: function LIKELI(sample, observation)
41:    result ← 0.0
42:    sigma ← 100
43:    for i = 0 to sample.size do
44:        result ← result + Gaussian(sample.get(i), sigma).pdf(observation.get(i))
45:    end for
46:    return result
47: end function

First, the marginal log-likelihood (lines 2, 8, 12, 22, 35 and 38 in Algorithm 4) is used instead of the marginal likelihood. This is to improve performance and prevent overflow/underflow in likelihood multiplication [Bellot, 2010].
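Concretely, writing ℓ = log \hat{π}(y_{1:T}|θ) for the current state and ℓ' for the proposal, the accept/reject test on lines 9-10 of Algorithm 4 is just the log-space form of the acceptance probability:

accept θ' if log u < ℓ' − ℓ,  with u ~ Uniform(0, 1),

which accepts with probability min{1, e^{ℓ' − ℓ}} = min{1, \frac{\hat{π}(y_{1:T}|θ')}{\hat{π}(y_{1:T}|θ)}}.

Only differences of log-likelihoods are ever formed, so the product of many small likelihood terms is never computed explicitly.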

Second, the initial guess of the parameter θ (line 3 in Algorithm 4) is the parameter used to generate the observations. That means the initial guess is the one we expect to obtain from the algorithm. Practically, it is undoubtedly hard to make the right guess beforehand. We use the expected results as the initial guess because it is easier to observe how the Markov chain (of parameter θ) fluctuates around the initial guess. Moreover, it saves the burn-in time that is a common issue in MCMC algorithms.

Third, we made specific choices of transition kernel and tuning parameter (lines 4 and 7 in Algorithm 4). As described in Section 3.2.1, the transition kernel can be an arbitrary one without affecting the true probability distribution of the outcome in the Metropolis-Hastings algorithm. By choosing a symmetric transition kernel θ' = θ ∗ exp(Gaussian(0, tune).draw()) that guarantees q(θ'|θ) = q(θ|θ'), the Metropolis-Hastings algorithm becomes the Metropolis algorithm, and the acceptance rate simplifies to min{1, \frac{π(θ')π(y|θ')}{π(θ)π(y|θ)}}. Here no prior ratio \frac{π(θ')}{π(θ)} is provided, making the final acceptance rate min{1, \frac{π(y|θ')}{π(y|θ)}}. The results are still within our expectation even though the prior ratio is omitted. The tuning parameter in the transition kernel (the sigma in the Gaussian distribution) is set to 0.014 to reach an acceptance rate (of state updates in the Markov chain) close to 0.234, a value which has been shown theoretically and practically to optimize the efficiency of the Metropolis algorithm [Roberts et al., 1997, Roberts et al., 2001, Christensen et al., 2005, Neal et al., 2006].

Fourth, the sample size (line 5 in Algorithm 4) is set to 1000 to achieve a better estimate of the marginal likelihood. It is sufficiently big, yet not so big as to cause much unnecessary computation that does not help much in generating the Markov chain of the parameter. A few papers [Pitt et al., 2012, Doucet et al., 2015, Sherlock et al., 2015] have given support to a common rule that the variance of the marginal log-likelihood should be around 1 in particle MCMC methods, whereas some others have cast doubts [Wilkinson, 2014]. In this use case, the variance of the marginal log-likelihood is relatively close to 1, usually in the range of 1.0 - 1.5.

3.4 Input and Outcome

Since it is generally hard to find a biological system which follows a perfect LV-12 model, we generated the input data by simulating a stochastic LV-12 model using Gillespie's algorithm, with the same parameters as the model in Figure 3.6. The initial populations of the prey, predator1 and predator2 are 800, 500, and 600, and the reaction rates c_i are set to 10.0, 0.005, 0.0025, 6.0, 3.0, for i = 1,...,5.

At a time interval of 2, we pass the current populations of the three species and a ∆t of 2 to Gillespie's algorithm for the LV-12 model, which gives us their populations after ∆t = 2. Since there is much randomness in a stochastic model as the system evolves, we run Gillespie's algorithm 1000 times for each even time point and take the average of the population of each species over these 1000 runs. However, the state of the system is only recorded at times 4n for n = 0,...,5, because a longer ∆t leads to longer computation time in Gillespie's algorithm, which is more suitable for parallelism. The generated observation input is listed in Table 3.1.

Table 3.1. Input data of the Bayesian inference problem of the LV-12 model

Time   Prey population   Predator1 population   Predator2 population
0      800               500                    600
4      2399              767                    728
8      1003              3085                   1447
12     403               1132                   875
16     1545              642                    656
20     1231              2892                   1390
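A sketch of this generation procedure, reusing the illustrative gillespieLV12 function from Section 3.1.3 (and therefore subject to the same caveats), could look as follows:

// Generate synthetic observations for the LV-12 inference problem:
// advance the model in steps of dt = 2, averaging 1000 Gillespie runs per step,
// and keep only the states at times 0, 4, 8, ..., 20.
val c  = Array(10.0, 0.005, 0.0025, 6.0, 3.0)
val dt = 2.0
var state = (800, 500, 600)
val observations = (0 to 10).map { step =>
  if (step > 0) {
    val runs = Seq.fill(1000)(gillespieLV12(state, c, dt))
    // integer division is good enough for an averaged population count here
    state = (runs.map(_._1).sum / 1000, runs.map(_._2).sum / 1000, runs.map(_._3).sum / 1000)
  }
  (step * dt, state)
}.filter { case (time, _) => time % 4.0 == 0.0 }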

Each time we run the PMMH algorithm using the input data as observations, we get five Markov chains of the five reaction rates c_i, i = 1,...,5, as well as the marginal log-likelihood of each state in the Markov chain. A good run of the PMMH algorithm, as mentioned in Section 3.3, should reach an acceptance rate (of state updates in the Markov chain) close to 0.234, and the variance of the marginal log-likelihood over all states should be relatively close to 1. Figure 3.7 shows the Markov chains of the five reaction rates in such a good run (acceptance rate = 0.207 and variance of the marginal log-likelihood about 1.37). The values on each Markov chain fluctuate around the expected value (the value used to generate the input data); see Figure 3.8. The means and standard deviations of the five reaction rates over their Markov chains are listed in Table 3.2; all the means are fairly close to the expected values.

Figure 3.7. The values of the five reaction rates over their Markov chains in a good run

Figure 3.8. The distributions of the five reaction rates over their Markov chains in a good run

Table 3.2. Means and standard deviations of the five reaction rates

                     c1        c2        c3        c4        c5
Mean                 9.81426   0.00507   0.00240   6.08547   2.87993
Standard deviation   0.31405   0.00023   0.00009   0.25211   0.09004
Expected value       10        0.005     0.0025    6         3

To summarise, this section has explained in necessary detail how the Bayesian inference problem of the marginal posterior in the LV-12 model is solved using the particle marginal Metropolis-Hastings algorithm. The method has been proved by our tests to solve the problem to a large extent. However, this is still a sequential solution despite being a compute-intensive application. To be used widely for research or other practical purposes, the computation time needs to be decreased. In the next chapter, we will illustrate its parallel implementation in Spark and MPI, as well as how Spark and MPI are set up on the HPC cluster.

4. Implementation

As is proven by our sequential implementation, the particle marginal Metropolis-Hastings algorithm can be compute-intensive. To accelerate the computation, computing resources on multiple computers should be utilised. To enable such parallel computing in a cluster, Spark or MPI comes into play.

4.1 System environment

The HPC cluster we deployed Spark and MPI on is a production-ready cluster consisting of 12 nodes (10 for general use). The CPU and memory specification of each node is listed in Table 4.1. Excluding the two special nodes that are reserved for other use, there are in total 216 CPUs and 755.35 GB memory available in the HPC cluster for testing.

Table 4.1. Specification of each node in the cluster used in the project

Node             # of CPUs   CPU frequency (GHz)   Memory (GB)
Node 0           32          2.60                  126.15
Node 1           16          2.60                  31.46
Node 2           16          2.60                  31.46
Node 3           16          2.60                  31.46
Node 4           16          2.60                  31.46
Node 5           40          2.80                  63.01
Node 6           16          3.30                  63.01
Node 9           16          3.19                  125.78
Node 10          16          3.19                  125.78
Node 11          32          3.19                  125.78
Special node 1   12          2.00                  63.02
Special node 2   16          3.30                  63.01

In most test cases, Spark as well as MPI uses Node 1, 2, 3, 4, and 6. We tried to use the same nodes for Spark and MPI in the HPC cluster to eliminate impact from using different hardware. However, in the unavoidable case when the node is utilised by other users, we have to switch to other available nodes during our test.

The HPC cluster uses IBM LSF workload management platform to manage the computing re- sources of all nodes as well as the job queue. LSF decides when to start a job and which job to start depending on the resource allocation policies, the state of resources and the user request. The version of LSF platform on the test cluster is 9.1.3.

4.2 Set up Spark on HPC cluster

In Section 2.3.3, we introduced the three cluster managers supported by Spark and explained our decision to use the Spark Standalone mode on the HPC cluster. To run Spark standalone on an HPC cluster, resources (nodes in this case) need to be allocated by a workload management platform (in our case, LSF) before Spark can be started.

More specifically, we need to specify the needed resources and parameters to start the job in a script, for instance the node list, the number of cores, etc. Then the job is submitted with the BSUB command to the LSF batch scheduling queue. To ensure exclusive execution of Spark on the cluster, we use the BSUB -x command. The BSUB command and its various options can be found in the official reference1.

1https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.2/lsf_command_ref/bsub.1.html, accessed 2016-09-10

Once the requested resources are allocated, the Spark master is launched first. The job process stays idle for 5 seconds to ensure that the Spark master is up and running. Then the Spark slaves are put into operation on the allocated worker nodes. The Spark URL of the master (e.g. spark://1.2.3.4:7077) is also passed along so that the slaves can connect to the master.

The key to launching Spark is the start-master.sh/start-slave.sh scripts in the Spark package, which load the configuration and environment settings, set the cluster ports and launch the cluster. In this project, the Spark package is built from source code using sbt, since all the available released packages include the Hadoop file system, which is what we want to avoid. This package is put in a shared folder accessible by every node in the cluster.

After the cluster is up and running, we can use the command below to submit a Spark application to the cluster from any computer that can connect to the Spark master:

spark-submit --class <main-class> --master <master-url> <application-jar> [application-arguments]

The progress of a running Spark application is visualized through the Spark Web UI at http://<driver-node>:4040, so the user can see the real-time resource allocation, job stages and job status from the web UI. By enabling the event log, the user can also track the whole execution process through http://<master-node>:8080.

As the Spark installation and launch processes are hidden from the user, it makes no difference to them whether the Spark application is run on an HPC cluster or an ordinary cluster.

4.3 Set up MPI on HPC cluster

As mentioned in Section 2.2.3, we use C to develop the MPI application with the Open MPI implementation of the MPI standard. An MPI application in C is compiled in a similar way to a normal C application. For example, in a Linux environment, the general command to compile a C application, gcc my_mpi_application.c -o my_mpi_application, is modified to mpicc my_mpi_application.c -o my_mpi_application for an MPI application.

Here we use the MPI "wrapper" compiler, which is mpicc in the case of C. What the wrapper compiler does is add all the relevant compiler parameters and then invoke the actual compiler, gcc. The command to run an MPI application in Open MPI is mpirun or mpiexec, for example mpirun -np 4 my_parallel_application, where the number 4 is the number of processors/slots the application requests. To make use of the shared-memory transport to share data among processes on the same node, the command below can be used:

mpirun -mca btl sm,self my_parallel_application

1https://www.ibm.com/support/knowledgecenter/SSETD4_9.1.2/lsf_command_ref/bsub.1.html, accessed 2016-09-10

To schedule MPI jobs on the HPC cluster, we again use LSF. LSF provides a flexible, scalable and extensible distributed application integration framework called blaunch, with which LSF integrates most popular MPI implementations, including Open MPI [Ding, 2013]. BSUB is the command to submit a job to the LSF platform, as mentioned earlier. Below is a script to be submitted to LSF, openmpiapp.lsf.

#!/bin/sh

#BSUB -n request_slot_number

# Add the Open MPI binary and library paths to the system paths
export PATH=<OpenMPI binary path>:$PATH
export LD_LIBRARY_PATH=<OpenMPI library path>:$LD_LIBRARY_PATH

mpirun my_openmpi

This LSF script specifies the resources needed for the MPI application and exports the Open MPI paths to the system environment variables, which is a prerequisite for running Open MPI in parallel on the cluster. We then use BSUB to submit the script: bsub < openmpiapp.lsf.

LSF prints the output of the application to standard output and standard error. Users can add relevant parameters or adjust the LSF configuration to adapt to different scenarios.

4.4 Parallelise the implementation

This section focuses on how the sequential application is parallelised in Spark and MPI, using two different mechanisms - parallel operations on RDD and message passing. The code for both Spark and MPI implementation is available in our GitHub repository 2. Readers are welcome to download and run it.

By examining Algorithm 4 in Section 3.3, we found that the iterations in the Metropolis-Hastings algorithm could not be parallelised due to their sequential dependence. A common technique to achieve parallelism in MCMC algorithms is to create multiple chains and compute them in parallel. We did not adopt this technique but instead parallelised the particle filter embedded in the Metropolis-Hastings algorithm (line 8 in Algorithm 4). This decision was based on the observation that the particle filter is the most compute-intensive part of the whole application. As shown in Figure 4.1, the stepLV function (i.e. the Gillespie_LV12 function at line 29) takes 98.88% of the whole computing time. This profiling data was obtained by running Callgrind (a profiling tool) on the sequential implementation written in C. Similar results were observed for the sequential implementation in Scala.

The Gillespie_LV12 function is performed on each of the 1000 samples in the iterations of the particle filter. The Gillespie_LV12 function on one sample is independent of the states of other samples, and hence can be parallelised. In the next two sections, we will introduce how the parallelism is implemented in Spark and MPI.

2https://github.com/scania/scania-pmcmc, accessed 2016-09-10

Figure 4.1. Profiling results on the sequential implementation

4.4.1 Parallelise the implementation in Spark

A Spark program starts with a SparkContext object, which is created with a SparkConf object as its parameter. SparkConf configures various settings of the application, such as application properties, runtime environment, shuffle behaviour, etc. A complete list of configurations can be found in the official reference 3. Some of these configurations can be tuned by users; the effect of such tuning is covered in Section 5.6.
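For concreteness, a minimal sketch of this setup is given below. The application name, master host and event-log directory are illustrative placeholders rather than the values used on the cluster; the event-log properties correspond to the logging mentioned in Section 4.2.

import org.apache.spark.{SparkConf, SparkContext}

object PmcmcApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("pmcmc-lv12")                          // hypothetical application name
      .setMaster("spark://<master-node>:7077")           // spark-URL of the standalone master
      .set("spark.eventLog.enabled", "true")             // keep job history after the application finishes
      .set("spark.eventLog.dir", "/<shared-folder>/spark-events") // assumed directory on the shared folder
    val sc = new SparkContext(conf)

    // ... build the RDD of samples and run the particle filter here ...

    sc.stop()
  }
}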

Once a SparkContext object sc is created, an RDD of the samples can be created by parallelising the samples object, using the function below: samples_rdd ← sc.parallelize(samples, number_of_partitions), where number_of_partitions specifies how many partitions the dataset should be divided into. Spark runs one task for each partition of the RDD. The official guide suggests 2-4 partitions for each CPU used, and we adopted 3 partitions per CPU in this project. Once an RDD is created, its partitions can be operated on in parallel. In our use case, a map operation is performed on samples_rdd, passing each sample through the Gillespie_LV12 function. A new RDD with the next states of all the samples is returned afterwards: next_state_rdd ← samples_rdd.map(x → Gillespie_LV12(x, ∆t, θ)).

Although some other parallel RDD operations (lines 31 to 33 in Algorithm 4) could be performed on next_state_rdd and the results returned, our tests showed that these parallel operations took longer than their sequential versions due to the limited computational load. Therefore, next_state_rdd is transformed into an ordinary collection before any further operation is performed on it: next_state ← next_state_rdd.collect(). The collect operation returns all the elements of the RDD as an array, and the array can be transformed into other types of collection by the user.

3http://spark.apache.org/docs/1.6.0/configuration.html, accessed 2016-09-10

The parallelism achieved using Spark can be summarised as replacing line 29 in Algorithm 4 with the following code piece:

samples_rdd ← sc.parallelize(samples, number_of_partitions)
next_state_rdd ← samples_rdd.map(x → Gillespie_LV12(x, ∆t, θ))
next_state ← next_state_rdd.collect()
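A Scala sketch of the same pattern, under the assumption that the SparkContext sc from the previous listing is available, is shown below. gillespieLV12, theta, dt and the initial sample states are illustrative stand-ins and not the thesis code.

// Stand-in for the real stochastic simulation step; it simply returns the unchanged state here.
def gillespieLV12(prey: Double, predator: Double,
                  dt: Double, theta: Array[Double]): (Double, Double) =
  (prey, predator)

val theta = Array(1.0, 0.005, 0.6)                  // illustrative model parameters
val dt = 2.0                                        // illustrative time interval
val samples = IndexedSeq.fill(1000)((100.0, 100.0)) // 1000 particles, illustrative initial state

val numberOfPartitions = 3 * 80                     // about 3 partitions per core (80 cores), see Section 5.6.2
val samplesRdd = sc.parallelize(samples, numberOfPartitions)
val nextState = samplesRdd
  .map { case (prey, predator) => gillespieLV12(prey, predator, dt, theta) } // independent per sample
  .collect()                                        // back to a local array for the sequential steps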

All the Spark jobs in this use case are identical and each job has only one stage; therefore the DAG shown in Figure 4.2 is the same for the single stage of every job.

Figure 4.2. DAG visualization of one stage

As can be seen from the DAG, the Spark implementation is fairly simple and straightforward, thanks to the code simplification and the concept of RDD. Our work to parallelise and optimise the use case proved successful, as is confirmed by Figure 4.3 and Figure 4.4. More specifically, we can see that the parallel part (blue bars) takes up most of the computation time, while the sequential part (the intervals between the blue bars) takes up a very small proportion.

Figure 4.3. A particular job and its intervals between its previous and next jobs

Figure 4.4. The event timeline of a test run of the Spark application

4.4.2 Parallelise the implementation in MPI

As touched upon in Section 2.2, MPI uses the message passing model to achieve parallel computing. There are two ways to communicate: point-to-point message passing and collective operations. Point-to-point communication is message exchange between two processes, while broadcast and other collective operations involve all processes in a communicator. When developing the MPI application, the former is the main method we use to pass messages between the root and the workers, whereas the latter helps to broadcast and synchronise some global variables.

An MPI application should always start by including the MPI header file, mpi.h. Then an MPI execution environment needs to be initialised by MPI_Init in the main function. During the initialisation, communicators are constructed. A communicator can be thought of as a handle to a group of processes, and is required by most MPI operations. The communicators constructed in the main() function can be passed as parameters to other functions. After that, users usually call MPI_Comm_size(MPI_COMM_WORLD, &procnum) to get the number of processes and MPI_Comm_rank(MPI_COMM_WORLD, &rank) to assign each process a unique integer as its rank/task ID. The function MPI_Finalize() is used to terminate the MPI execution environment.

There is no operation to mark the start and end points of parallel processing in an MPI application. Each process executes the program from the very beginning of the main() function. To execute sequential code, the user needs to designate one process to run that code using its rank. We assign all sequential code to the root process - the one with rank 0. The root always works as a worker as well.

Now we explain how the use case is parallelised in MPI. As explained previously, only Gillespie_LV12 in the particle filter is parallelised. We added MPI code to parallelise line 29 in Algorithm 4, replacing the traditional loop in C. The parameters needed for the Gillespie_LV12(x, ∆t, θ) function are the current populations of prey and predator for each sample, the time interval ∆t and the model parameters θ. The last two parameters are the same for all samples, so they are broadcast to every process with MPI_Bcast(). Through point-to-point message passing (MPI_Send()), the root sends a portion of the samples to each process, and the results are obtained via the receive operation MPI_Recv(). We divide the samples into (number of processes - 1) portions, which means every worker except the root gets the same number of samples. The remaining samples that have not been sent are handled by the root itself. The parallelism is achieved by having different workers calculate the new prey and predator populations for different samples simultaneously. To summarise, line 29 in Algorithm 4 is parallelised in MPI as below:

MPI_Bcast(model parameters θ, each worker);
MPI_Bcast(time interval ∆t, each worker);

IF ROOT
    MPI_Send(SAMPLENUM/(WORKERNUM-1) samples' values, each worker)
    // the root handles the remaining samples itself
    FOR(i ← (SAMPLENUM/(WORKERNUM-1))*(WORKERNUM-1) to SAMPLENUM)
        Gillespie_LV12(prey[i], predator[i], ∆t, θ)
    MPI_Recv(SAMPLENUM/(WORKERNUM-1) samples with newly calculated prey and predator values, each worker)
ELSE IF WORKER
    MPI_Recv(SAMPLENUM/(WORKERNUM-1) samples' values, ROOT)
    FOR(i ← 0 to SAMPLENUM/(WORKERNUM-1))
        Gillespie_LV12(prey[i], predator[i], ∆t, θ)
    MPI_Send(SAMPLENUM/(WORKERNUM-1) samples with updated prey and predator values, ROOT)

5. Results and discussion

With the parallel implementations in place, we tested the application on Spark and MPI on the HPC cluster. Our main focus is to compare the two platforms in terms of performance, scalability, fault tolerance, dynamic resource handling, resource consumption, etc. In particular, we intend to verify the automatic fault tolerance and dynamic resource handling features of Spark, as compared with MPI.

Additional specifications of the Spark and MPI applications used in our tests are listed in Table 5.1. These factors may also affect the test results.

Table 5.1. Information of the platforms used in the project

Spark application    Platform               Spark 1.6.0
                     Programming language   Scala 2.10.6
                     Build tool             sbt 0.13.9
                     Java                   jdk 1.8.0_60

MPI application      Platform/Compiler      OpenMPI 1.6.2
                     Programming language   C
                     Compiler               gcc 4.4.6

To improve the robustness of the test results, we run the tests multiple times and calculate the average of the results, excluding outliers. In addition, whenever possible, we run Spark and MPI on the same nodes.

5.1 Performance and scalability

When testing the Spark and MPI applications, we used the same parameters for both, that is, 1000 samples in the particle filter, and 1000 iterations in the Metropolis-Hastings algorithm.

Figure 5.1 illustrates the computing time in relation to the number of cores. The MPI application has a clear advantage over Spark when fewer than 64 cores are used. However, there is a trend break at 64 cores, where the performance of the MPI application deteriorates drastically and it is therefore overtaken by Spark.

On the other hand, the computing time of the Spark application keeps dropping, and shows a converging trend from 48 cores. Between 64 and 96 cores, the Spark application is a bit faster than the MPI application.

Overall, the performance of the Spark application is positively correlated with the number of cores, although the marginal improvement decreases significantly. Nevertheless, MPI clearly outperforms Spark: with only 48 cores it uses half the time that Spark needs with 96 cores. A possible reason for this is the performance advantage of C over the Java platform (Spark is written in Scala and runs on the JVM). Besides, there is only one map operation on the RDD of samples before the result is fetched from memory and delivered to the driver node, which means the in-memory computing of Spark is not fully exploited in this application.

As discussed before, when over 48 cores are used, adding more cores to the MPI application only slows it down; we typically call this issue "parallel slowdown". A possible explanation is that, after a certain point, the overhead of adding more computing resources outweighs the acceleration gained from the additional parallelism. Due to the absence of a built-in monitoring tool for MPI, the exact amount of time spent on the individual stages cannot be tracked, so the cause of the slowdown cannot be verified. In the case of Spark, on the other hand, its web interface allows us to collect statistics for each job and stage. We therefore picked two random stages, one from a test run with 80 cores in total and the other with 96; the breakdown of their total running time is listed in Table 5.2. As we can see, 20% more cores (16) only give a 5.2% (5.33 ms) reduction in the average task time (computing time + scheduler delay). Besides, there is an increase in scheduler delay of 30.80 − 27.03 = 3.77 ms on average. As a result, the total running time drops by only about 12.4% when the number of cores increases by 20% ((96 − 80) ÷ 80 = 20%). Although we expect there to be a trigger point (after 96 cores), the Spark application has not shown a parallel slowdown.

Figure 5.1. Scalability and computing time of the Spark and MPI applications, with 1000 samples used in the particle filter

Table 5.2. Statistics on breakdown of total running time for two random Spark stages

                                                   80 cores   96 cores
Number of tasks                                    240        288
Average computing time (ms)                        74.93      65.83
Average scheduler delay (ms)                       27.03      30.80
Average computing time + scheduler delay (ms)      101.96     96.63
Total computing time (ms)                          17982      18960
Total scheduler delay (ms)                         6487       8871

Although Figure 5.1 indicates limited scalability of both applications when 1000 samples are used in the particle filter, as the sample size increases to a sufficient level (10000 and 20000 samples) we do observe close-to-linear scalability up to 96 cores (see the right subplot of Figure 5.2). Therefore, with sufficient computation distributed over the cluster, good scalability can be achieved.

Figure 5.2. To the left: speedup of the Spark and MPI applications, with 1000 samples used. To the right: speedup of the MPI application based on different sample sizes

5.2 Fault tolerance

Fault tolerance has become increasingly important for large-scale computing platforms. Some applications can take up hundreds or even thousands of CPUs and other resources for running compute-intensive jobs. During the execution period, if some jobs cannot recover or continue in case of failure, the whole application will abort and has to be started over again. Very often, there could also be hanging resource issues which need to be fixed manually. Such incidents are costly to enterprises, both in monetary terms and beyond.

There are only a few fault tolerance methods available for Open MPI, for example algorithm-based fault tolerance, replication, message logging and checkpoint/restart, based on technologies such as CiFTS FTB (Fault Tolerance Backplane), BLCR (Berkeley Lab Checkpoint/Restart) and OPAL_CRS (Open PAL Checkpoint/Restart Service (CRS)). The first, CiFTS FTB, is a project providing an FTB-enabled software environment for different HPC platforms. It is an independent component that requires a separate installation, which is a time-consuming process on large enterprise computer servers, so we decided to skip this option. FTB is also required by BLCR, which is supported by another group, the MPI Fault Tolerance Group (FTG). Another option is to use OPAL_CRS, a user-transparent fault tolerance implementation in Open MPI. It ships two types of checkpoint/restart service (CRS): the self CRS and the BLCR CRS developed by Berkeley Lab's Future Technologies Group. The self CRS method could not be utilised in our test, and the latest releases of Open MPI have a build error when applying the checkpoint component with BLCR as the CRS component. Because of these various technical challenges, the Open MPI installation used for the thesis work does not come with fault tolerance.

On the other hand, Spark, with all the storage levels of its RDDs, claims to provide full automatic fault tolerance: if a worker node goes down or loses its connection with the Spark master, the tasks on that node that have not completed can be passed to other nodes, and lost data can be recovered by recomputing. As this feature is embedded in Spark, the running application can keep going without extra work needed from the users' end. In order to validate the claim, we set up a test to see how Spark and MPI react to node failure. To simplify the graphs and history in the Spark webUI, we ran the application on two nodes - Node 6 and Node 9, each with 16 cores. For Spark, when Job 202 was running on both nodes, we simulated a node failure by killing the slave process on Node 6 (Executor 15). This incident can be seen in the event timeline in Figure 5.3, and Figure 5.4 gives the details of the incident, which makes it easier for the users/cluster administrators to take effective actions in time. More importantly, by digging into the details of the task history of Job 202 (see Figure 5.5), we found out that the 16 unfinished tasks on Node 6 were moved over to Node 9; each task started to run as soon as a core was available in Node 9. As there was only Node 9 (16 cores) available, the time spent on the next job (Job 203) was about twice as much as that on Job 200/201 when both Node 6 and 9 were available (32 cores), as shown in Figure 5.3.

Figure 5.3. A node removal is shown on the event timeline in Spark’s web interface

Figure 5.4. Details of the event when hovering over the highlighted area in Spark web interface

Whereas for MPI, after we manually killed the running MPI process on one of the nodes, the whole MPI application crashed (see Figure 5.6).

From the tests, we confirmed that Spark's automatic fault tolerance feature works well, at least for the application in our project. More specifically, the application can keep running if some of the worker nodes go down. Lost partitions can be recovered by recomputing the logged lineage information, and unfinished tasks on the failed nodes can be re-distributed to other available worker nodes and re-executed. Fault tolerance in MPI could be achieved through an independent library or a dedicated checkpoint component. However, an independent fault tolerance library requires additional installation and configuration, while the self-supported checkpoint is in an experimental phase and involves extra user code. Furthermore, self-supported methods may become non-functional due to a lack of maintenance resources. We conclude that the fine-grained fault tolerance feature of Spark is superior to that of MPI.

5.3 Dynamic resource handling

Earlier we discussed situations where certain nodes are down or removed. Conversely, nodes being recovered or added to the cluster is another scenario that often happens in an enterprise. If a large compute-intensive application can make use of recovered/added nodes, the total running time can be shortened by better utilisation of the available resources. We call this feature dynamic resource handling, which Spark claims to support out of the box.

Figure 5.5. Unfinished tasks in a disconnected node were re-executed on the other available node

Figure 5.6. The MPI application crashed after the node failure

For MPI, we learned from the literature review about a method called DPM (dynamic process management), which provides a means to dynamically add processes to a large-scale job. It was introduced in MPI-2 and uses a spawn mechanism to create child processes from the main (parent) processes. It is initiated by calling MPI_Comm_spawn in the parent processes, which starts a number of identical copies of the MPI job; the spawned processes are called children. The children have their own MPI_COMM_WORLD, and an intercommunicator is established when MPI_Init is called in the children. Thus MPI_Comm_spawn in the parents and MPI_Init in the children form the command pair that starts the spawn and collects the result. Figure 5.7 shows the workflow of the spawn process. Due to time constraints, we did not cover this method in our tests but prioritised the investigation of Spark in this section.

Figure 5.7. Work flow of MPI spawn process

Continuing from the test session in Section 5.2, we connected Node 1 (Executor 17) to the Spark master a few seconds after Node 6 (Executor 15) was removed. When Job 210 had just started, 16 tasks were assigned to Node 9, the only connected slave at that time. Right after Node 1 was added, the remaining 16 tasks of Job 210 were assigned to it. The timeline of all 32 tasks is shown in Figure 5.9.

Figure 5.8. A node being added is shown on the event timeline in Spark’s webUI

Figure 5.9. Timeline of the 32 tasks in Job 210 on Spark worker nodes

Interestingly, the scheduler delay (blue bar), task deserialisation time (red bar) and executor computing time (green bar) of all tasks on Node 1 were much longer than those on Node 9. For the subsequent jobs, however, we found no big difference in task time between Node 1 and Node 9. Figure 5.10 shows the task timeline of one such job. We can draw a similar conclusion from Figure 5.8: after Job 210, each job finished in roughly the same execution time as the jobs when Node 6 and 9 were both active. Detailed statistics in the web interface also confirmed this.

Figure 5.10. Timeline of a job after Job 210 on Spark worker nodes

To summarise, as confirmed in our tests, Spark can detect newly added nodes and use them immediately in running applications. Although the processing time is longer than normal when the new node is first added, it runs smoothly and stably in the subsequent jobs, as if the node has "blended" into the Spark platform.

We continued to perform the test by removing and adding nodes while the Spark application was running. The complete timeline of the whole test is shown in Figure 5.11. We can see that neither the application nor the Spark platform crashed during the intensive test. And the Markov chain produced by the application also met our expectation. This gives us promising evidence that Spark can handle dynamic resources quite well.

Figure 5.11. The Spark application completed successfully while nodes were removed and added continuously during its execution

5.4 Resource consumption

For a computing platform, resource consumption is a key factor of concern for enterprises. In this project, one of our main areas of interest is to compare the resource consumption of Spark and MPI as computing platforms, using a compute-intensive application.

We use Ganglia - a cluster monitoring tool - to log the resource consumption of both applications. Each test session was exclusive to the application in order to minimise the impact of other running programs on the test results.

5.4.1 CPU consumption

We ran the Spark application multiple times on 5 nodes (80 cores in total): Node 1 to 4 and Node 6, where Node 1 served as both master and worker node. The results are given in Figure 5.12. Except for the last four sessions, we used 1000 samples in the particle filter of the application. As we can observe, in the test sessions with 1000 samples the maximum CPU usage was about 55-66% on all the nodes, and there was no big difference between Node 1 (master + worker node) and the pure worker nodes. The task timeline shown in Figure 5.13 indicates why the remaining part of the CPU was not used: a portion of the time was dedicated to scheduling and task deserialisation, and the CPUs that finished their tasks early, with no new tasks assigned, had to stay idle until the last task in the stage was done.

We then increased the sample size to 500000 and immediately observed higher CPU usage, as represented by the peaks in the last few sessions in Figure 5.12: 83-87% in two sessions and 97-100% in one. The event timeline of one stage in such a session is shown in Figure 5.14. The scheduling and task deserialisation overhead is negligible compared with the long computing time of each task. As before, there was waiting time for CPUs that finished early. In short, CPU usage can be raised by prolonging the computing time of each parallel task.

Figure 5.12. CPU usage of the Spark application when all 80 cores of 5 nodes were used, first with sample size of 1000 and then 500000 in the particle filter

Figure 5.13. Task timeline of one stage when 1000 samples were used in the Spark application; the CPU usage was between 55-66% with this setting

Figure 5.14. Task timeline of one stage when 500000 samples were used in the Spark application; the CPU usage was between 83-100% with this setting

Our observations are also validated by the test results of the MPI application in Figure 5.15. With two to three nodes used, the MPI application consumes more CPU resources as sample size increases from 1000 to 200000.

With the same setting (1000 samples and 5 nodes) as Spark, the MPI application reached up to 100% CPU usage, while the Spark application consumed only 55-66%. In this case, the MPI application makes better use of the CPU resources than Spark, which may explain its superior performance (see Section 5.1).

Figure 5.15. CPU usage of the MPI application when different number of cores and sample size were used. The usage of other worker nodes was similar to Node 3 when all their cores were used, and hence was not included in the graph.

5.4.2 Memory consumption

Following the previous section on CPU consumption, we also looked into the memory consumption of both applications. For the MPI application, both Node 4 (root + worker) and Node 3 (pure worker) used about 0.5 GB of memory in most of the sessions. The memory usage was only slightly affected by the test settings (different combinations of cores and samples), as indicated by Figure 5.16. Considering that one core was dedicated to the root process and the other 15 cores to worker processes on Node 4, the root process used about 0.5 ÷ 16 = 0.03125 GB = 32 MB of memory. We believe this part is dedicated to root-process-only computations and operations (e.g. message passing).

As for Spark, we similarly monitored the memory usage in the first eight sessions in Figure 5.12, with 1000 samples and 5 nodes (80 cores) used. We gradually increased the memory on each node from 1 GB to 4 GB, in steps of 1 GB, by manually setting the parameter spark.executor.memory. The results are shown in Figure 5.17 and Table 5.3, and we found three interesting points in them.

Figure 5.16. Memory usage of the MPI application when different numbers of cores and sample sizes were used. Not all worker nodes are included in this graph, but their memory consumption was similar to Node 3

Figure 5.17. Memory usage of the Spark application when all 80 cores of 5 nodes were used, with sample size of 1000 in the particle filter

First, the actual average memory usage on each pure worker node was not the same as the setting of spark.executor.memory. This parameter is said to control the Java heap size in the official documentation of Spark 1.6.0 1 and, more specifically, the "maximum" Java heap size in the documentation of Spark 2.0.0 2. The actually allocated Java heap can be smaller than this maximum value, which is why the actual memory usage was lower than the value of spark.executor.memory when it was set to 3 and 4 GB. Besides, the Java heap seems to be only a part of the memory used on a worker node, so the actual memory usage can also be higher than the bound set by spark.executor.memory (examples are the test sessions with spark.executor.memory set to 1 and 2 GB).

1http://spark.apache.org/docs/1.6.0/configuration.html, accessed 2016-09-10
2http://spark.apache.org/docs/2.0.0/configuration.html, accessed 2016-09-10

Table 5.3. Memory usage of the Spark application, calculated from the data provided by Ganglia

spark.executor.   Memory usage on the        Average memory usage on      Estimated memory usage
memory (GB)       master+worker node (GB)    each pure worker node (GB)   by master only (GB)
1                 2.52 - 2.72                1.75 - 2.02                  0.50 - 0.97
2                 3.32 - 3.52                2.07 - 2.78                  0.54 - 1.45
3                 3.65 - 3.71                2.28 - 2.88                  0.77 - 1.43
4                 4.24 - 4.34                2.68 - 2.87                  1.37 - 1.66

Second, although the memory usage on a worker node was not equal to spark.executor.memory, it was still positively related to it. However, the upper bound of the usage did not increase with spark.executor.memory for settings beyond 3 GB. This might indicate the maximum memory required by each worker node.

Third, the memory usage of the Spark master was estimated by subtracting the average memory usage of a pure worker node from the memory usage of Node 1 (which served as both master and worker). The memory usage of the Spark master also increased with spark.executor.memory, though it is not directly controlled by this parameter. We have not come up with a good explanation for this, but the phenomenon is worth noticing, together with the fact that the memory usage of the Spark master was about half of that of a worker node in these test sessions.

In comparison, the MPI application obviously used much less memory. When the executor memory was set to 1 GB, a Spark slave node consumed 3 to 4 times as much memory as an MPI worker node, and this gap could be bigger if the executor memory were set to a larger value. A possible reason for the big gap is that a lot of the memory in Spark executors is dedicated to special uses on the Spark platform; this topic is covered in Section 5.6.3 below. Furthermore, the Spark master consumes a lot more memory than the MPI root, since it manages all the nodes in the cluster instead of simply sending and receiving messages like the MPI root.

5.4.3 Network usage

In the same test sessions of the MPI application in previous sections, we acquired some further insights on network usage (see Figure 5.18) in Node 4 (root+worker) and Node 3 (pure worker).

First, the network usage of both nodes first reached a peak, then quickly dropped to a much lower level until the end of each test run. We suspect that some initial setup between the nodes caused the peak usage at the beginning of each session.

Second, the maximum network usage of Node 4 increased linearly as more nodes were added, both for receiving and for sending packets, whereas the maximum network usage of Node 3 did not show much correlation with the number of nodes in use. Our earlier explanation for the initial peak usage can also account for the linear relationship between maximum network usage and number of nodes for Node 4. To elaborate, for the root node the main network workload is communication between the root process and the worker processes; as more worker processes are brought in, the communication from and to the root process necessarily increases. For the worker nodes, the network load for initialisation does not change with the number of nodes.

Figure 5.18. Network usage of the MPI application when different numbers of cores and sample sizes were used. Not all worker nodes are included in this graph, but their network usage was similar to Node 3 when all their cores were used

Third, the stabilised network usage was much lower than expected. The amount of data received and sent across all the nodes should be fixed: 50 (iterations) × 20000 (samples) × 5 (observations) × 3 (species) × 8 (bytes for an address in a 64-bit operating system) ≈ 114 MB. For the case of 5 nodes, the network usage of Node 3 should be at least 114 MB ÷ 5 ÷ 794 (seconds; total running time) = 29 kB/sec (plus some other communication) at both the receiving and sending end, which is much higher than the actual network usage in Figure 5.18. The big deviation from expectation also applies to the root+worker node. We expected the stabilised network usage (both sending and receiving) to be directly proportional to the sample size and inversely proportional to the number of nodes. However, the results indicate an almost fixed usage, and the value was also far below expectation for the root+worker node. A possible explanation is that the sample data was not communicated through the network that Ganglia monitored, as Open MPI supports multiple transport networks, including various protocols over Ethernet (e.g. TCP, iWARP, UDP, raw Ethernet frames), distributed shared memory, InfiniBand and so on [Squyres, ]. In our test case, distributed shared memory technology was used in Open MPI, which might not be detectable by Ganglia.
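Both this estimate and the corresponding one for the Spark sessions below follow the same back-of-the-envelope formula:

\[
r_{\text{expected}} \approx \frac{N_{\text{iterations}} \times N_{\text{samples}} \times N_{\text{observations}} \times N_{\text{species}} \times 8\,\text{bytes}}{N_{\text{nodes}} \times T_{\text{run}}}
\]

which gives (50 × 20000 × 5 × 3 × 8) bytes ÷ (5 × 794 s) ≈ 29 kB/sec for the MPI sessions above, and (100 × 1000 × 5 × 3 × 8) bytes ÷ (5 × 286 s) ≈ 8 kB/sec for the Spark sessions discussed next.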

We monitored the network usage of the same 14 test sessions of the Spark application, with 1000 samples, 5 nodes (80 cores) and executor memory set to 1 GB. The results are shown in Figures 5.19 and 5.20. Unlike for the MPI application, the network usage of all the nodes stayed at its maximum level until it dropped to 0 at the end of each session. The only exception is the receiving end of Node 1 (master + slave), where the usage stayed at a relatively low level for some unknown reason before reaching the peak. The usage for either sending or receiving sample data on each worker node should be around 100 (iterations) × 1000 (samples) × 5 (observations) × 3 (species) × 8 (bytes for an integer in a 64-bit operating system) ÷ 5 (number of nodes) ÷ 286 (seconds; total running time) ≈ 8 kB/sec. This accounts for only a very small portion of the total network usage of a Spark slave node. The rest could come from other activities on the Spark platform, such as communication between the slave node and the driver program or the master, as well as transmission overhead.

In comparison, with the same setting (1000 samples, 5 nodes with 80 cores), MPI consumed much less network resource (on networks monitored by Ganglia), in both sending and receiving end, for either root+worker (master+worker in Spark’s term) or pure worker nodes. The details

Figure 5.19. Network usage (receiving) of the Spark application when all 80 cores of 5 nodes were used, with sample size of 1000 in the particle filter

Figure 5.20. Network usage (sending) of the Spark application when all 80 cores of 5 nodes were used, with sample size of 1000 in the particle filter

are summarised in Table 5.4. For each node, Spark used at least 293 times the amount of network resources that MPI used. The biggest gap between Spark and MPI lies on the receiving side of the root+worker (master+worker in Spark's terms) node. A possible reason might be the extra communication between the master node and the driver program, and the state updates received from the worker nodes, neither of which applies to MPI.

Table 5.4. Network usage of the MPI and Spark applications

                                                      MPI           Spark         Spark/MPI difference
Worker stable receiving usage (kB/sec)                0.05 - 0.35   250 - 340     971 - 5000
Worker stable sending usage (kB/sec)                  0.42 - 0.9    440 - 592     658 - 1048
Root + worker node stable receiving usage (kB/sec)    0.12 - 0.48   963 - 1290    2688 - 8025
Root + worker node stable sending usage (kB/sec)      1.30 - 2.05   430 - 600     293 - 331

5.5 Ease of use

Ease of use is an important factor for enterprises in selecting large-scale computing platforms. It affects not only the efficiency and ease of developing applications, but also the cost and complexity of maintenance afterwards.

In this project, we accumulated hands-on experience with both platforms by implementing the Bayesian inference of the LV-12 model using both Spark and MPI. From our experience and the official documents, we summarise the key factors that affect the ease of development with Spark/MPI in Table 5.5. Below we briefly elaborate on these key factors one by one.

Table 5.5. Comparison between Spark and OpenMPI regarding ease of use

                                   Spark                                    OpenMPI
Languages supported                Scala, Java, Python, R                   C, C++, Fortran
Parallelism mechanism              Data parallelism                         Task parallelism
Monitoring                         Spark WebUI                              Third-party software
Debugging                          hard                                     hard
Built-in libraries                 SQL and DataFrames, machine learning,    None
                                   graph computing and streaming
Size of code (of the Bayesian      about 200 lines                          about 600 lines
inference application)

As listed in the table, Spark supports multiple high-level programming languages, while OpenMPI only supports one higher-level language - C++ - and the languages it supports are known for a steeper learning curve than those supported by Spark.

The parallelism in Spark is mainly data parallelism over RDDs, which makes it nearly impossible to run multiple different tasks in parallel. OpenMPI does not have this type of limitation, since it uses task parallelism, and it can also perform the same task on different data.

As for job monitoring, we had a good experience using Spark's WebUI. It gives rich information at different levels for both the application and the cluster. We did not have time to test third-party monitoring tools for OpenMPI, such as PERUSE [Keller et al., 2006] and VampirTrace 3, but one thing is certain: installing these monitoring tools requires extra work from the user.

When we implemented the application in Spark and MPI, we first wrote it in pure Scala and C, and then adapted the particle filter part to the Spark/MPI model. Debugging pure Scala and C programs is easy with the help of IDEs. However, once the application runs in Spark or MPI, debugging becomes harder, because there are hardly any good debugging tools for either platform. The method we used was mostly reading the code or searching the Internet for the error messages. In our opinion, better debugging support should be provided by both platforms to improve ease of use.

Last but not least, the biggest difference between Spark and MPI is that the former is a platform while the latter is a library. Spark therefore has many components built into the platform, such as the Spark core, the standalone scheduler, different libraries (Spark SQL, DataFrames, machine learning, etc.) and the WebUI. Spark takes care of all the details of parallel operations on RDDs, including data transfer, parallel computing, resource scheduling, fault tolerance, monitoring, etc. On the contrary, users of MPI have to specify many low-level operations: which nodes to use, what message should be sent to which nodes, what operations must be performed on which nodes, at which point a checkpoint should be added, etc. This no doubt increases the workload on the user side (as also implied by the lines of code of our specific application) and raises the chance of errors and failures.

To summarise, Spark is more attractive than MPI regarding ease of use. It helps users focus more on how to solve the problem itself instead of on implementing the solution. Spark covers most of the details of parallelism, efficient usage of resources and other aspects, while MPI leaves these parts to the users. One weakness of Spark is that it cannot distribute different tasks to multiple nodes, as it uses data parallelism. On the other hand, MPI does not have good support for classical big data problems such as stream processing, real-time analysis and so on.

5.6 Tuning the Spark application

There are a few parameters that can be tuned in both Spark 4 and OpenMPI 5. We prioritised Spark, as tuning the OpenMPI application would have taken much longer than our project timeframe allowed.

5.6.1 Deploy mode

Deploy mode indicates where the driver program runs. In cluster mode, the driver program runs inside the cluster, while in client mode the driver program runs on the submitting node, which is usually not part of the cluster. Since the driver sends all the tasks to the executors and receives the results from them, the connection between the driver and the executors can affect the performance of Spark applications. We ran the Spark application in both cluster and client mode; the running times are shown in Table 5.6. In all cases, test sessions running in cluster mode finished faster than those

3https://www.vampir.eu/, accessed 2016-09-10
4http://spark.apache.org/docs/1.6.0/tuning.html, accessed 2016-09-10
5https://www.open-mpi.org/faq/?category=tuning, accessed 2016-09-10

in client mode, though the difference was not big. For applications with bigger tasks and more data transferred between the driver and the executors, cluster mode will help shorten the running time even more.

Table 5.6. Running time of the Spark application in client and cluster modes, with 1000 samples and 100 iterations. Other settings were kept as default

Number of cores   Average running time in client mode (sec)   Average running time in cluster mode (sec)
16                740                                         660
32                438                                         426
48                339                                         318
64                308                                         300
80                286                                         276
96                280                                         258
112               271                                         264

5.6.2 Number of partitions

As mentioned in Section 4.4.1, we used 3 × the number of cores as the number of partitions for the Spark application, based on the Spark programming guide and our tuning results. With a fixed number of cores (80), we tested the running time of the Spark application with different numbers of partitions. The test results are given in Table 5.7. They show that when 3 partitions were distributed to each core on average, the running time was shortest compared with the other partition counts. When fewer partitions were used, the CPU resources were not well utilised: at the end of each stage, some CPUs had to wait for others to finish their tasks when there were no tasks left to execute. When more partitions were used, the extra overhead caused by the additional partitions led to longer running times. Therefore, 240 partitions can be seen as an optimal number of partitions that balances overall CPU usage against the overhead brought by partitioning in this use case.

Table 5.7. Running time of the Spark application when different numbers of partitions were used, with 80 cores used and other default settings

Number of partitions   Average running time (sec)
80                     312.0
160                    324.4
240                    313.6
320                    305.4
400                    315.8
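Expressed in code, and assuming the SparkContext sc and the samples collection from Section 4.4.1, the chosen setting corresponds to the following sketch; in standalone mode sc.defaultParallelism equals the total number of cores granted to the application.

val partitionsPerCore = 3                                           // value settled on above (Table 5.7)
val numberOfPartitions = partitionsPerCore * sc.defaultParallelism  // 3 partitions per core
val samplesRdd = sc.parallelize(samples, numberOfPartitions)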

5.6.3 Memory tuning

Since the introduction of Spark 1.6.0, a new memory management model has been adopted. This model is briefly described in Spark's tuning guide, and more details can be found on the Internet 6 7.

6https://0x0fff.com/spark-memory-management/, accessed 2016-09-10
7http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/, accessed 2016-09-10

As illustrated in Figure 5.21, the job on each executor is a Java process, and the JVM heap is all the memory it can use. In the Java heap, 300 MB is reserved for purposes outside Spark. The remaining part is divided into user memory and Spark memory. The fraction of Spark memory is set through spark.memory.fraction and the default value is 0.75. User memory is used for storing user data structures and internal metadata in Spark, and for safeguarding against OOM errors in the case of sparse and unusually large records. Spark memory is further divided into execution memory and storage memory. Execution memory is used for computation in shuffles, joins, sorts and aggregations, while storage memory is used for caching and for propagating internal data across the cluster. The fraction of storage memory is set through spark.memory.storageFraction and the default value is 0.5. The official documentation recommends leaving both parameters unchanged.

Let us look at a specific example. If the heap size of an executor is 1 GB, the reserved memory is 300 MB. User memory is (1024 − 300) × (1 − 0.75) = 181 MB by default. Execution memory and storage memory are each (1024 − 300) × 0.75 × 0.5 = 271.5 MB by default. Execution and storage memory can borrow from each other when their own part is free. Execution memory can evict storage memory until the storage memory falls back to its pre-set size, but storage memory cannot evict execution memory. This design aims to make better use of the memory and to protect a fixed amount of cached data from being evicted, and it has been shown to contribute to reasonable performance improvements.
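The split described above can be summarised by the following formulas (a restatement of the cited description, with M_heap denoting the executor's JVM heap size):

\[
\begin{aligned}
M_{\text{usable}}    &= M_{\text{heap}} - 300\,\text{MB},\\
M_{\text{user}}      &= M_{\text{usable}} \times (1 - \texttt{spark.memory.fraction}),\\
M_{\text{spark}}     &= M_{\text{usable}} \times \texttt{spark.memory.fraction},\\
M_{\text{storage}}   &= M_{\text{spark}} \times \texttt{spark.memory.storageFraction},\\
M_{\text{execution}} &= M_{\text{spark}} \times (1 - \texttt{spark.memory.storageFraction}).
\end{aligned}
\]

With M_heap = 1024 MB and the default fractions, these reproduce the numbers in the example: M_user = 181 MB and M_storage = M_execution = 271.5 MB.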

Figure 5.21. Memory management model of Spark 1.6.0 and above; the figure comes from the blog by Anton Kirillov at http://datastrophic.io/core-concepts-architecture-and-internals-of-apache-spark/

Although changing spark.memory.fraction or spark.memory.storageFraction is not recommended by the official documentation, we still tried to explore these parameters out of interest. However, the tests did not succeed, because we could neither find the size of the execution memory in the webUI nor see any change in the storage memory in the webUI after changing spark.memory.storageFraction.

Additionally, we tested how spark.executor.memory affects the running time. The mismatch between the setting and the actual memory used has already been covered in Section 5.4.2, so both the value of spark.executor.memory and the actual memory used are listed in Table 5.8. The average running time decreased as more memory was used on each executor, though the exact contributor could not be identified from the WebUI or other sources.

Table 5.8. Running time of the Spark application when executor memory was set to different values, with 80 cores and 50 iterations and other default settings

spark.executor.memory (GB)   Average memory usage on each pure slave node (GB)   Average running time (sec)
1                            1.75 - 2.02                                         157
2                            2.07 - 2.78                                         153
3                            2.28 - 2.88                                         137
4                            2.68 - 2.87                                         128
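For reference, a sketch of how these memory-related properties would be set on a SparkConf is shown below; the values are simply the ones explored in this section and the application name is a placeholder, so this is not a recommendation of specific settings.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("pmcmc-lv12")                      // hypothetical application name
  .set("spark.executor.memory", "4g")            // 1g-4g were the values tested in Table 5.8
  .set("spark.memory.fraction", "0.75")          // default value; changing it is not recommended
  .set("spark.memory.storageFraction", "0.5")    // default value; changing it is not recommended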

6. Related work

There has been some previous work investigating Spark on HPC clusters. For example, Tous et al. developed a framework called Spark4MN to deploy Spark on the MareNostrum petascale supercomputer at the Barcelona Supercomputing Center [Tous et al., 2015]. Their project investigated optimised Spark configurations for data-intensive workloads in an HPC compute-centric paradigm, examining configurations for parallelism, storage and network. While their project and ours share an interest in deploying Spark on an HPC cluster with the IBM LSF workload manager, we differ in our focus areas: their work emphasises fine-tuning a data-intensive application, while we place the main attention on the comparison between Spark and MPI regarding performance, scalability and fault tolerance for a compute-intensive application.

Another related work [Chaimov et al., 2016] centres on how Spark scales up in an HPC environment. The authors found that performance and scalability are affected by file system metadata access latency, because HPC environments usually use a global distributed file system (Lustre in their case). In their experiments, efforts were made with regard to both hardware and software: an NVRAM Burst Buffer architecture was introduced to build a hierarchical memory that accelerates I/O, and on the software side ramdisk and filepool were examined. Three classical benchmarks - BigData Benchmark, PageRank and GroupBy - were used to reveal how different tools and settings affect the performance of data transfer inside a node and of data shuffling. Although the test environment is similar to ours, the main focus of their project is the evaluation of software and hardware to achieve better performance and scalability. In our thesis project, we did not notice this latency, due to the coarse-grained performance monitoring and the absence of big data files. Nevertheless, we consider scalability optimisation using software and hardware as an area for further work.

Another work of relevance [Jha et al., 2014] involves the Hadoop ecosystem. It compares two data-intensive paradigms - HPC and Apache Hadoop - and tests six hybrid implementations using K-means as the benchmark. The project touches upon a large span of infrastructure, with many popular data models and resource managers involved. Although different benchmarks are used, we reach a similar conclusion on the superior performance of MPI against Spark. No details of scalability, fault tolerance or resource utilisation are covered in their work.

7. Conclusions

This project aims to explore the possibility of deploying Apache Spark (a large-scale distributed computing platform) on an HPC cluster at Scania IT. We compare Spark with MPI (a traditional standard for cluster computing on HPC) in different aspects such as performance, scalability, fault tolerance, dynamic resource handling and ease of use.

During the five-month project period, we obtained a deep understanding of HPC, MPI and Spark, and successfully deployed Spark on an HPC cluster. We implemented a compute-intensive application for both Spark and MPI; more specifically, the application solves the Bayesian inference problem of an extended Lotka-Volterra biological model using the particle marginal Metropolis-Hastings algorithm. With a number of test cases, we collected many insightful results, which can be summarised as follows:

1. Performance: The application ran faster on MPI than on Spark, as long as the overhead did not exceed the speedup brought by parallelism.

2. Scalability: Neither application could scale beyond 48 cores in the initial tests with 1000 samples. When a sufficient amount of computation was added, the MPI application showed linear scalability, and we would expect similar properties from Spark.

3. Fault tolerance: The automatic fault tolerance feature of Spark was validated by our tests. The Spark application could keep running even after some nodes were disconnected; unfinished tasks were distributed to other active nodes for re-execution. The MPI application, in contrast, crashed when a node went down.

4. Dynamic resource handling: When nodes were recovered/added, the running application could make use of these nodes right away in Spark, but not in MPI without additional configuration.

5. Resource consumption: With the same settings, the MPI application made better use of the CPUs and consumed much less memory and network than the Spark application.

6. Ease of use: As Spark is a platform with many components and advanced mechanisms, it is easier to use than OpenMPI, which is a library mainly for message passing.

This summary is based on the test results of the specific application used in the project, but we believe they shed some light on a more general comparison between Spark and MPI.

As Spark has shown a good fit in the HPC cluster, our work serves as a good pre-study for the Big Data initiatives on HPC clusters at Scania IT.

7.1 Distribution of work

This project was conducted by Huijie Shen and Tingwei Huang, with the work distributed evenly, including the thesis writing. The specific contributions from each person are:

Huijie Shen:

• Extending the Lotka-Volterra model and producing the observation data

• Studying and applying Monte Carlo methods to the use case, and tuning the parameters in the solution
• Studying Scala and Spark, and adapting Darren Wilkinson's implementation to a Spark application for the LV-12 model
• Running test cases on the Spark application in the HPC cluster

Tingwei Huang:

• Studying MPI and HPC
• Deploying Spark on an HPC cluster at Scania IT
• Implementing the application in C with OpenMPI
• Running test cases on the MPI application in the HPC cluster

7.2 Future work

There are a few directions to pursue as the next step of the project.

Firstly, we are interested in finding out how InfiniBand can accelerate MPI and Spark. InfiniBand is a network communication standard used in HPC that offers very high throughput and very low latency. 4X FDR (56 Gb/sec) InfiniBand is installed on some HPC clusters at Scania IT. Although we successfully installed RDMA for Apache Spark 0.9.1 1 on an HPC cluster with InfiniBand (InfiniBand is an implementation of RDMA), we did not observe any speedup for applications running on this RDMA-enabled Spark. We encountered the same problem when running the MPI application with InfiniBand. So far we are not sure about the cause, but it would be interesting to look into this problem in the future.

Secondly, we would like to try some third-party software for monitoring MPI applications. Such software may provide fine-grained details of the whole execution process. If we can find such a tool to monitor the MPI application, it will enable us to understand the resource and time consumption of each part of the execution, and thereby further improve the performance.

Finally, Apache Mesos was not used in this project due to a potential conflict with LSF on the HPC cluster. Mesos is well known for its advanced scheduling mechanism and wide application on Big Data platforms. It would be interesting to see whether Mesos can run on an HPC cluster, either together with LSF or as the only resource management platform, and what the outcome would be.

1Developed by the Network-Based Computing Laboratory of The Ohio State University, http://hibd.cse.ohio-state.edu/, accessed 2016-09-10

References

[hor, 2011] (2011). Apache hadoop yarn - the architectural center of enterprise hadoop. http://hortonworks.com/apache/yarn. Accessed: 2016-09-19.
[tec, 2016] (2016). High-performance computing (hpc). https://www.techopedia.com/definition/4595/high-performance-computing-hpc. Accessed: 2016-09-10.
[Bellot, 2010] Bellot, D. (2010). log-sum-exp trick. http://ai-owl.blogspot.se/2010/06/log-sum-exp-trick.html. Accessed: 2016-09-10.
[Chaimov et al., 2016] Chaimov, N., Malony, A., Canon, S., Iancu, C., Ibrahim, K. Z., and Srinivasan, J. (2016). Scaling spark on hpc systems. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 97–110. ACM.
[Christensen et al., 2005] Christensen, O. F., Roberts, G. O., and Rosenthal, J. S. (2005). Scaling limits for the transient phase of local metropolis–hastings algorithms. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):253–268.
[Del Moral, 2004] Del Moral, P. (2004). Feynman-kac formulae. genealogical and interacting particle approximations.
[Ding, 2013] Ding, Z. (2013). Hadoop yarn. https://www.ibm.com/.../Platform_BPG_LSF_MPI_v2.pdf. Accessed: 2016-09-19.
[Doucet et al., 2001] Doucet, A., De Freitas, N., and Gordon, N. (2001). An introduction to sequential monte carlo methods. In Sequential Monte Carlo methods in practice, pages 3–14. Springer.
[Doucet et al., 2015] Doucet, A., Pitt, M., Deligiannidis, G., and Kohn, R. (2015). Efficient implementation of markov chain monte carlo when using an unbiased likelihood estimator. Biometrika, page asu075.
[Dursi, ] Dursi, J. Hpc is dying, and mpi is killing it. http://www.dursi.ca/hpc-is-dying-and-mpi-is-killing-it/. Accessed: 2016-09-10.
[Flynn, 1972] Flynn, M. J. (1972). Some computer organizations and their effectiveness. IEEE transactions on computers, 100(9):948–960.
[Francisco and Erick, 2010] Francisco, Fernández, d. V. and Erick, C.-P. (2010). Parallel and Distributed Computational Intelligence, volume 269. Springer-Verlag Berlin Heidelberg.
[Gillespie, 1976] Gillespie, D. T. (1976). A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of computational physics, 22(4):403–434.
[Gillespie, 1977] Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. The journal of physical chemistry, 81(25):2340–2361.
[Golightly and Gillespie, 2013] Golightly, A. and Gillespie, C. S. (2013). Simulation of stochastic kinetic models. In Silico Systems Biology, pages 169–187.
[Gropp et al., 1996] Gropp, W., Lusk, E., Doss, N., and Skjellum, A. (1996). A high-performance, portable implementation of the mpi message passing interface standard. Parallel computing, 22(6):789–828.
[Gropp et al., 1999] Gropp, W., Lusk, E., and Skjellum, A. (1999). Using MPI: portable parallel programming with the message-passing interface, volume 1. MIT press.
[Gropp, 1998] Gropp, W. D. (1998). An introduction to mpi. web page. Accessed: 2016-09-15.
[Jha et al., 2014] Jha, S., Qiu, J., Luckow, A., Mantha, P., and Fox, G. C. (2014). A tale of two data-intensive paradigms: Applications, abstractions, and architectures. In 2014 IEEE International Congress on Big Data, pages 645–652. IEEE.
[Keller et al., 2006] Keller, R., Bosilca, G., Fagg, G., Resch, M., and Dongarra, J. J. (2006). Implementation and Usage of the PERUSE-Interface in Open MPI. In Proceedings, 13th European PVM/MPI Users' Group Meeting, Lecture Notes in Computer Science, Bonn, Germany. Springer-Verlag.
[Lotka, 1925] Lotka, A. J. (1925). Elements of physical biology. [MacKenzie, 1998] MacKenzie, D. A. (1998). Knowing machines: Essays on technical change. MIT Press.

[Neal et al., 2006] Neal, P., Roberts, G., et al. (2006). Optimal scaling for partially updating MCMC algorithms. The Annals of Applied Probability, 16(2):475–515.
[Pitt et al., 2010] Pitt, M., Silva, R., Giordani, P., and Kohn, R. (2010). Auxiliary particle filtering within adaptive Metropolis-Hastings sampling. arXiv preprint arXiv:1006.1914.
[Pitt et al., 2012] Pitt, M. K., dos Santos Silva, R., Giordani, P., and Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134–151.
[Prodan and Fahringer, 2007] Prodan, R. and Fahringer, T. (2007). Experiment management, tool integration, and scientific workflows. Grid Computing, 4340(2007):1–4.
[Ristic and Skvortsov, 2015] Ristic, B. and Skvortsov, A. (2015). Chapter 25 - Predicting extinction of biological systems with competition. In Tran, Q. N. and Arabnia, H., editors, Emerging Trends in Computational Biology, Bioinformatics, and Systems Biology, pages 455–466. Morgan Kaufmann.
[Roberts et al., 1997] Roberts, G. O., Gelman, A., Gilks, W. R., et al. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1):110–120.
[Roberts et al., 2001] Roberts, G. O., Rosenthal, J. S., et al. (2001). Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367.
[Sherlock et al., 2015] Sherlock, C., Thiery, A. H., Roberts, G. O., Rosenthal, J. S., et al. (2015). On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275.
[Squyres, ] Squyres, J. M. Open MPI. http://aosabook.org/en/openmpi.html. Accessed: 2016-09-10.
[Tous et al., 2015] Tous, R., Gounaris, A., Tripiana, C., Torres, J., Girona, S., Ayguadé, E., Labarta, J., Becerra, Y., Carrera, D., and Valero, M. (2015). Spark deployment and performance evaluation on the MareNostrum supercomputer. In Big Data (Big Data), 2015 IEEE International Conference on, pages 299–306. IEEE.
[Ullah and Wolkenhauer, 2011] Ullah, M. and Wolkenhauer, O. (2011). Stochastic Approaches for Systems Biology. Springer Science & Business Media.
[Volterra, 1926] Volterra, V. (1926). Fluctuations in the abundance of a species considered mathematically. Nature, 118:558–560.
[Wilkinson, 2011a] Wilkinson, D. (2011a). Bayesian parameter and state estimation for partially observed nonlinear Markov process models using particle MCMC (with application to SysBio).
[Wilkinson, 2011b] Wilkinson, D. (2011b). Stochastic Modelling for Systems Biology. CRC Press.
[Wilkinson, 2014] Wilkinson, D. (2014). Tuning particle MCMC algorithms. https://darrenjw.wordpress.com/2014/06/08/tuning-particle-mcmc-algorithms/. Accessed: 2016-09-10.
[Wu, 1999] Wu, X. (1999). Performance Evaluation, Prediction and Visualization of Parallel Systems. Springer Science & Business Media.
[Zaharia et al., 2012] Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M. J., Shenker, S., and Stoica, I. (2012). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association.
[Zaharia et al., 2010] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., and Stoica, I. (2010). Spark: Cluster computing with working sets. HotCloud, 10:10–10.
