Big Optimization with Genetic Algorithms: Hadoop, Spark and MPI

Carolina Salto · Gabriela Minetti · Enrique Alba · Gabriel Luque

Carolina Salto
Facultad de Ingeniería, Universidad Nacional de La Pampa, Argentina
CONICET, Argentina
ORCID iD 0000-0002-3417-8603
E-mail: [email protected]

Gabriela Minetti
Facultad de Ingeniería, Universidad Nacional de La Pampa, Argentina
ORCID iD 0000-0003-1076-6766
E-mail: [email protected]

Enrique Alba
ITIS Software, Universidad de Málaga, Spain
ORCID iD 0000-0002-5520-8875
E-mail: [email protected]

Gabriel Luque
ITIS Software, Universidad de Málaga, Spain
ORCID iD 0000-0001-7909-1416
E-mail: [email protected]

Research Article. Posted Date: July 31st, 2021
DOI: https://doi.org/10.21203/rs.3.rs-725766/v1
License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract Solving problems of high dimensionality (and complexity) usually requires the intense use of technologies such as parallelism, advanced computers, and new types of algorithms. MapReduce (MR) is a computing paradigm that has long existed in computer science and has been proposed in recent years for dealing with big data applications, though it can also be used for many other tasks. In this article we address big optimization: the solution of large instances of combinatorial optimization problems by using MR as the paradigm to design solvers that run transparently on a varied number of computers that collaborate to find the problem solution. We first investigate the influence of the underlying MR technology, including Hadoop, Spark, and MPI as the middleware platforms used to express genetic algorithms (GAs), giving rise to the MRGA solvers, in a style different from the usual imperative transformational programming. Our objective is to confirm the expected benefits of these systems, namely file, memory, and communication management, on the resulting algorithms. We analyze our MRGA solvers from relevant points of view such as scalability, speedup, and communication versus computation time in big optimization. The results for high-dimensional datasets show that the MRGA over Hadoop outperforms the implementations in the Spark and MPI frameworks. For the smallest datasets, the execution of the MRGA on MPI is always faster than the executions of the remaining MRGAs. Finally, the MRGA over Spark presents the lowest communication times. Numerical and time insights are given in our work, so as to ease future comparisons of new algorithms over these three popular technologies.

Keywords Big optimization · Genetic Algorithms · MapReduce · Hadoop · Spark · MPI

1 Introduction

The challenges that have arisen with the beginning of the era of Big Data have been largely identified and recognized by the scientific community. These challenges include dealing with very large data sets, since such data sets may well limit the applicability of most of the usual techniques. For instance, evolutionary algorithms, as combinatorial optimization problem solvers, do not scale well to high-dimensional instances [20].
To overcome these limitations, evolutionary developers can employ Big Data processing frameworks (such as Apache Hadoop and Apache Spark, among others) to process and generate Big Data sets with a parallel and distributed algorithm on clusters and clouds [5,8,22,26,33]. In this way, the programmer may abstract from the issues of distributed and parallel programming, because the majority of these frameworks manage the load balancing, the network performance, and the fault tolerance. These features have made them popular, creating a new branch of parallel studies in which the focus is on the application and not on exploiting the underlying hardware.

A well-known computing paradigm used to process Big Data is MapReduce (MR). It splits the large data set into smaller chunks, which the map function processes in parallel, producing key/value pairs as output. The output of the map tasks is the input of the reduce functions, in such a way that all key/value pairs with the same key go to the same reduce task [5].
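To make the data flow of the paradigm concrete, the following minimal single-machine Python sketch emulates the three MR phases on a toy word-count job. It is our illustration rather than code from any of the frameworks discussed here, and the names map_task and reduce_task are hypothetical.

    from collections import defaultdict

    def map_task(chunk):
        # Map: emit one (key, value) pair per word in the chunk.
        return [(word, 1) for word in chunk.split()]

    def reduce_task(key, values):
        # Reduce: all values sharing a key arrive at the same reducer.
        return (key, sum(values))

    chunks = ["big data big optimization", "big genetic algorithms"]

    # Map phase: each chunk is processed independently (in parallel on a cluster).
    pairs = [pair for chunk in chunks for pair in map_task(chunk)]

    # Shuffle phase: the framework groups values by key between the two phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce phase: one call per distinct key.
    print(sorted(reduce_task(k, v) for k, v in groups.items()))
    # [('algorithms', 1), ('big', 3), ('data', 1), ('genetic', 1), ('optimization', 1)]

The three frameworks compared in this article differ precisely in where these phases run and in how the intermediate pairs are stored and moved, which is what the remainder of the study measures.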
Hadoop is a very popular framework relying on the MR paradigm [1,34], both in industry and academia. This framework provides a ready-to-use distributed infrastructure that is easy to program, scalable, reliable, and fault-tolerant [14]. Since Hadoop allows parallelism of data and control, we also investigate other software tools doing similar jobs. MapReduce-MPI (MR-MPI) [24] is a library built on top of MPI, which constitutes another framework with a somewhat similar goal. Here the programmer has more control over the platform, which allows improving the bandwidth performance and reducing the latency costs. Another popular Big Data framework is Apache Spark [13], which differs from Hadoop and MR-MPI in that the computational model of Spark is memory-based. The core concept in Spark is the Resilient Distributed Dataset (RDD) [14], which provides a general-purpose, efficient abstraction for distributed shared memory. Spark allows developing multi-step data pipelines using a directed acyclic graph.

Although the three mentioned technologies allow implementations following the MR paradigm, they have significant differences. Consequently, they encouraged us to carry out a performance analysis targeted at discovering how big optimization can best be implemented on the MR model and later run on any of these three platforms. This comparative analysis is of interest to any curious scientist, since it offers evidence about their relative performance (advantages and disadvantages). Moreover, the MR paradigm can contribute to building new optimization and machine learning models, in particular scalable genetic algorithms (MRGAs), as combinatorial optimization problem solvers, which are widely used in the scientific and industrial community. In the literature, many researchers have reported on GAs programmed on Hadoop [5,8,11,32,33] and Spark [15,22,26], and a few under MR-MPI [29], to the best of the authors' knowledge. Moreover, these proposals present different GA parallel models for big optimization, but each is specific to a particular MR framework. Furthermore, these research works do not usually address problems whose complexity is associated with handling Big Data. All this implies a significant lack of information on the advantages and limitations of each framework for implementing MRGA solvers for big optimization. In this sense, selecting the most appropriate one to implement this kind of algorithm becomes a very complex task. In order to mitigate the lack of information about MRGA scalability on the three best-known MR frameworks (MR-MPI, Hadoop, and Spark), we define the following research questions:

– RQ1: Can we efficiently design big optimization MRGA solvers using these frameworks?
– RQ2: Which of the frameworks allows the MRGA solver to reach its best time performance when scaling to high-dimensional instances?
– RQ3: Are MRGAs scalable when considering an increased number of map tasks?
– RQ4: Is the time spent in communication a factor to consider when choosing a solver?

With the first research question, we analyze the usability of these frameworks to design MRGAs that solve big optimization problems. RQ2 deepens this analysis, hopefully offering interesting information on the MRGA performance when the instance dimension scales. Furthermore, the scalability of all the studied approaches is also analyzed considering the number of parallel processes (map tasks), as RQ3 suggests. Finally, the last research question allows us to examine which MRGA solver spends more time in communication than in computation.

To address these RQs, we analyze how a Simple Genetic Algorithm (SGA) [12] can take advantage of these Big Data processing frameworks in the optimization of large instances of a problem. We decided to use an SGA because it is a canonical technique at the core of the Evolutionary Algorithm (EA) family, and most things done on it can be reproduced in other EAs and population-based metaheuristics. For the purposes of this analysis, the SGA design is tailored to the MR paradigm, producing the so-called MRGA [29], which comes out of parallelizing each iteration of an SGA under the Iteration-Level Parallel Model [31]. The contributions of this work are manifold. We develop the same optimizer (MRGA) using three open-source MR frameworks. We consider the implementations made in our previous research [29], MRGA-H for Hadoop and MRGA-M for MR-MPI. Moreover, in this work, the MRGA design is implemented in the Spark framework, giving rise to the MRGA-S algorithm.
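As a concrete illustration of this iteration-level design, the sketch below expresses one MRGA generation in Spark's Python API: the map phase evaluates the fitness of every individual in parallel, after which selection and variation build the next population on the driver. This is our own minimal sketch, not the authors' MRGA-S code; the OneMax fitness function, the parameter values, and all identifiers are assumptions made for the example.

    import random
    from pyspark import SparkContext

    sc = SparkContext(appName="mrga-s-sketch")
    POP_SIZE, GENES, GENERATIONS = 64, 1000, 10

    def fitness(individual):
        # OneMax: count the ones; a stand-in for a real big-optimization objective.
        return sum(individual)

    def tournament(evaluated):
        # Binary tournament selection over (individual, fitness) pairs.
        a, b = random.sample(evaluated, 2)
        return a[0] if a[1] >= b[1] else b[0]

    population = [[random.randint(0, 1) for _ in range(GENES)]
                  for _ in range(POP_SIZE)]

    for generation in range(GENERATIONS):
        # Map phase: one (individual, fitness) pair per individual, evaluated in
        # parallel across the cluster, as in the Iteration-Level Parallel Model.
        evaluated = sc.parallelize(population) \
                      .map(lambda ind: (ind, fitness(ind))) \
                      .collect()
        print(f"generation {generation}: best = {max(f for _, f in evaluated)}")

        # Selection and variation on the driver: tournament selection, one-point
        # crossover, and bit-flip mutation produce the next population.
        next_population = []
        while len(next_population) < POP_SIZE:
            parent1, parent2 = tournament(evaluated), tournament(evaluated)
            cut = random.randrange(1, GENES)
            child = parent1[:cut] + parent2[cut:]
            next_population.append(
                [g ^ (random.random() < 1.0 / GENES) for g in child])
        population = next_population

    sc.stop()

MRGA-H and MRGA-M would express the same parallel evaluation step through the Hadoop and MR-MPI APIs, respectively; what changes across the frameworks is how the population travels between generations (distributed files in Hadoop, memory-resident RDDs in Spark, MPI messages in MR-MPI), which underlies the communication-time differences analyzed later in the article.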