Big Optimization with Genetic Algorithms: Hadoop, Spark and MPI

Carolina Salto, Universidad Nacional de La Pampa
Gabriela Minetti ([email protected]), Universidad Nacional de La Pampa, https://orcid.org/0000-0003-1076-6766
Enrique Alba, Universidad de Málaga
Gabriel Luque, Universidad de Málaga

Research Article

Keywords: Big optimization, Genetic Algorithms, MapReduce, Hadoop, Spark, MPI

Posted Date: July 31st, 2021

DOI: https://doi.org/10.21203/rs.3.rs-725766/v1

License: This work is licensed under a Creative Commons Attribution 4.0 International License.

Big Optimization with Genetic Algorithms: Hadoop, Spark and MPI

Carolina Salto · Gabriela Minetti · Enrique Alba · Gabriel Luque

Carolina Salto: Facultad de Ingeniería, Universidad Nacional de La Pampa, Argentina; CONICET, Argentina. ORCID iD 0000-0002-3417-8603. E-mail: [email protected]
Gabriela Minetti: Facultad de Ingeniería, Universidad Nacional de La Pampa, Argentina. ORCID iD 0000-0003-1076-6766. E-mail: [email protected]
Enrique Alba: ITIS Software, Universidad de Málaga, Spain. ORCID iD 0000-0002-5520-8875. E-mail: [email protected]
Gabriel Luque: ITIS Software, Universidad de Málaga, Spain. ORCID iD 0000-0001-7909-1416. E-mail: [email protected]

Abstract Solving problems of high dimensionality (and complexity) usually requires the intense use of technologies such as parallelism, advanced computers, and new types of algorithms. MapReduce (MR) is a long-existing computing paradigm that has been proposed in recent years for dealing with big data applications, though it can also be used for many other tasks. In this article we address big optimization: the solution of large instances of combinatorial optimization problems by using MR as the paradigm to design solvers that allow transparent runs on a varied number of computers that collaborate to find the problem solution. We first investigate the influence of the MR technology used, including Hadoop, Spark, and MPI as the middleware platforms to express genetic algorithms (GAs), giving rise to the MRGA solvers, in a style different from the usual imperative transformational programming. Our objective is to confirm the expected benefits of these systems, namely file, memory, and communication management, on the resulting algorithms. We analyze our MRGA solvers from relevant points of view such as scalability, speedup, and communication vs. computation time in big optimization. The results for high dimensional datasets show that the MRGA over Hadoop outperforms the implementations in the Spark and MPI frameworks. For the smallest datasets, the execution of MRGA on MPI is always faster than the executions of the remaining MRGAs. Finally, the MRGA over Spark presents the lowest communication times. Numerical and time insights are given in our work, so as to ease future comparisons of new algorithms over these three popular technologies.

Keywords Big optimization · Genetic Algorithms · MapReduce · Hadoop · Spark · MPI

1 Introduction

The challenges that have arisen with the beginning of the Big Data era have been largely identified and recognized by the scientific community. These challenges include dealing with very large data sets, since they may well limit the applicability of most of the usual techniques. For instance, evolutionary algorithms, as combinatorial optimization problem solvers, do not scale well to high dimensional instances [20]. To overcome these limitations, evolutionary developers can employ Big Data processing frameworks (like Apache Hadoop and Apache Spark, among others) to process and generate Big Data sets with a parallel and distributed algorithm on clusters and clouds [5,8,22,26,33].
In this way, the programmer can abstract from the issues of distributed and parallel programming, because most of the frameworks manage the load balancing, the network performance, and the fault tolerance. These features made them popular, creating a new branch of parallel studies where the focus is on the application and not on exploiting the underlying hardware.

A well-known computing paradigm used to process Big Data is MapReduce (MR). It splits the large data set into smaller chunks, which the map function processes in parallel, producing key/value pairs as output. The output of the map tasks is the input of the reduce functions, in such a way that all key/value pairs with the same key go to the same reduce task [5].

Hadoop is a very popular framework relying on the MR paradigm [1,34], both in industry and academia. This framework provides a ready-to-use distributed infrastructure, which is easy to program, scalable, reliable, and fault-tolerant [14]. Since Hadoop allows parallelism of data and control, we also look for other software tools doing similar jobs. MapReduce-MPI (MR-MPI) [24] is a library built on top of MPI, which conforms another framework with a somewhat similar goal. Here the user has more control of the platform, allowing to improve the bandwidth performance and reduce the latency costs. Another popular Big Data framework is Apache Spark [13], which is different from Hadoop and MR-MPI, since the computational model of Spark is based on memory. The core concept in Spark is the Resilient Distributed Dataset (RDD) [14], which provides a general-purpose, efficient abstraction for distributed shared memory. Spark allows developing multi-step data pipelines using a directed acyclic graph.

Although the three mentioned technologies allow implementations following the MR paradigm, they have significant differences. Consequently, they encourage us to carry out a performance analysis targeted at discovering how big optimization can be best implemented onto the MR model that is later run by any of these three platforms. This comparative analysis is of interest for any curious scientist, in order to offer evidence about their relative performance (advantages and disadvantages). Moreover, the MR paradigm can contribute to building new optimization and machine learning models, in particular scalable genetic algorithms (MRGAs), as combinatorial optimization problem solvers, which are widely used in the scientific and industrial community. In the literature, many researchers have reported on GAs programmed on Hadoop [5,8,11,32,33] and Spark [15,22,26], and a few ones under MR-MPI [29], to the best of the authors' knowledge. Moreover, these proposals present different GA parallel models for big optimization, but they are specific to a particular MR framework. Furthermore, these research works mainly focus on the parallelization of highly time-consuming fitness computations, but not on solving problems whose complexity is associated with handling Big Data. All this implies a significant lack of information on the advantages and limitations of each framework to implement MRGA solvers for big optimization. In this sense, the selection of the most appropriate one to implement this kind of algorithm becomes a very complex task. In order to mitigate the lack of information about the MRGA scalability on the three best-known MR frameworks (MR-MPI, Hadoop, and Spark), we define the following research questions:

– RQ1: Can we efficiently design big optimization MRGA solvers using these frameworks?
– RQ2: Which of the frameworks allows the MRGA solver to reach its best time performance by scaling to high dimensional instances?
– RQ3: Are MRGAs scalable when considering an increased number of map tasks?
– RQ4: Is the time spent in communication a factor to consider when choosing a solver?

With the first research question, we analyze the usability of these frameworks to design MRGAs that solve big optimization problems. RQ2 deepens this analysis, hopefully offering interesting information on the MRGA performance when the instance dimension scales. Furthermore, the scalability of all the studied approaches is also analyzed considering the number of parallel processes (map tasks), as RQ3 suggests. Finally, the last research question allows us to examine which MRGA solver spends more time in communication than in computation.
To address these RQs, we analyze how a Simple Genetic Algorithm (SGA) [12] can take advantage of these Big Data processing frameworks in the optimization of large instances of a problem. We decided to use this SGA because it is a canonical technique at the core of the Evolutionary Algorithm (EA) family, and most things done with it can be reproduced in other EAs and population-based metaheuristics. For the purposes of this analysis, a SGA design is tailored for the MR paradigm, producing the so-called MRGA [29], which comes from a parallelization of each iteration of a SGA under the Iteration-Level Parallel Model [31]. The contributions of this work are manyfold. We develop the same optimizer (MRGA) using three open-source MR frameworks. We consider the implementations made in our previous research [29], MRGA-H for Hadoop and MRGA-M for MR-MPI. Moreover, in this work, the MRGA design is implemented on the Spark framework, giving rise to the MRGA-S algorithm. Later on, we analyze and compare these three implementations considering relevant aspects such as execution time, scalability, and speedup to solve a large problem of industrial interest such as the knapsack problem [10]. To our knowledge, this is the first work considering the same MRGA solver implemented in the three widely known platforms and pointing out their different features for the benefit of future research.

This article is organized as follows: the next section discusses the MR paradigm and Big Data frameworks, showing their similarities and differences. Section 3 presents a brief state of the art in implementing GAs with the MR paradigm, and contains our proposal. Sections 4 and 5 define meaningful experiments to reveal information on the three systems, perform them, and give some findings. Finally, Section 6 summarizes our conclusions and expected future work.
2 MR Paradigm and Frameworks

An application in the MR paradigm is arranged as a pair (or a sequence of pairs) of map and reduce functions [7]. Each map function takes as input a set of key/value pairs (records) from data files and generates a set of intermediate key/value pairs. Then, MR groups together all the intermediate values associated with the same intermediate key. A value group and its associated key is the input to the reduce function, which combines these values in order to produce a new and possibly smaller set of key/value pairs that are saved in data files. Furthermore, this function receives the intermediate values via an iterator, allowing the model to handle lists of values that are too large to fit into main memory.

The input data is automatically partitioned into a set of M splits when the map invocations are distributed across multiple machines, where each input split is processed by a map invocation. The intermediate key space is divided into R pieces, which are distributed into R reduce invocations. The number of partitions (R) and the partitioning function are user defined.
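To make this data flow concrete, the following minimal Python sketch (our own illustration, independent of the three frameworks discussed below) mimics a single MR pass: a user-supplied map function emits intermediate key/value pairs, the pairs are grouped by key and split into R partitions, and a reduce function folds each key/multi-value group. The names run_mapreduce, map_fn, and reduce_fn are hypothetical.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_partitions):
    # Map phase: every input record yields intermediate key/value pairs.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Group all values that share the same intermediate key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Split the intermediate key space into R partitions (simple hash partitioner).
    partitions = [dict() for _ in range(num_partitions)]
    for key, values in groups.items():
        partitions[hash(key) % num_partitions][key] = values

    # Reduce phase: each reducer receives key/multi-value pairs and emits results.
    output = []
    for partition in partitions:
        for key, values in partition.items():
            output.extend(reduce_fn(key, values))
    return output

# Classic illustration: word count over two text records.
records = ["big optimization with genetic algorithms", "genetic algorithms with mapreduce"]
word_map = lambda line: [(word, 1) for word in line.split()]
word_reduce = lambda word, counts: [(word, sum(counts))]
print(run_mapreduce(records, word_map, word_reduce, num_partitions=2))
```

In the real frameworks the map and reduce calls run on different machines and the grouping step involves network communication (the shuffle); this serial sketch only shows the logical contract between the two functions.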
As previously mentioned, our aim in this work is to perform a comparison of the different Big Data frameworks to develop big optimization MRGA solvers. For that purpose, this section presents three Big Data frameworks, namely Hadoop [1], MR-MPI [24], and Spark [13], with the goal of identifying the advantages and limitations of each one. In this process, the focus is on the installation, use, and productivity characteristics of each framework.

2.1 Hadoop

The Hadoop framework consists of a single master ResourceManager, one slave NodeManager per cluster node, and a MRAppMaster per application, which is implemented using the Hadoop YARN framework [1]. In order to meet those goals, the central Scheduler (in the ResourceManager) responds to a resource request by granting a container. Essentially, the container is the resource allocation, which allows an application to use a specific amount of resources (memory, CPU, etc.) on a specific host. In this context, the Hadoop client submits the job/configuration to the ResourceManager, which distributes the software/configuration to the slaves, schedules, and monitors the tasks, providing status and diagnostic information to the client, including fault tolerance management. In this sense, it is noticeable that the installation and the configuration of the Hadoop framework require a very specific and long sequence of steps, making it difficult to adapt to a particular cluster of machines. Moreover, at least one node (master) is dedicated to the system management.

To deal with parallel processing applications on large data sets, Hadoop incorporates the Hadoop Distributed File System (HDFS) and Hadoop YARN. The first one handles scalability and redundancy of data across nodes. The second one is a framework for job scheduling that executes data processing tasks on all nodes.

2.2 MR-MPI

MR-MPI is a small and portable C++ library that only uses MPI for inter-processor communication; thus, the user writes a main program that runs on each processor of a cluster, making map and reduce calls to the MR-MPI library. As a consequence, a new framework arises with no extra installation and configuration tasks (light management and easy to program with), but without fault tolerance. The use of the MR library within MPI follows the traditional mode of calling the MPI_Send and MPI_Recv primitives between pairs of processors, using large aggregated messages to improve the bandwidth performance and reduce the latency costs.
2.3 Spark

Apache Spark has a very powerful and high-level API, which is built upon the basic abstraction concept of the Resilient Distributed Dataset (RDD) [35]. A RDD is an immutable and fault-tolerant collection of elements in shared memory that can be operated on in parallel. This kind of dataset is divided into logical partitions, each one computed on different nodes of the cluster through operations that either transform a RDD (creating a new one) or perform computations on the RDD (returning a value).

Spark applications are composed of a single driver program and multiple workers or executors. The client process starts the driver program, which orchestrates and monitors the execution of a Spark application and calls its actions. With each action, the Spark scheduler builds an execution graph and launches a Spark job. Each job consists of stages, which are collections of tasks that represent each parallel computation and are performed on the executors (Java Virtual Machine, JVM, processes). Each executor has several task slots for running tasks in parallel. The physical placement of executor and driver processes depends on the cluster type and its configuration.

In order to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark's own standalone cluster manager, Mesos, or YARN), which allocate resources across applications. In this work, Hadoop YARN is adopted as cluster manager, due to its previous use with Hadoop. The installation and configuration of the Spark framework is not direct, requiring the configuration of many properties that control internal settings. Most of them have reasonable default values, but others need to be adjusted to appropriate values and are generally particular to the cluster features. These parameters range from Spark's properties to size settings of the JVM.
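The following short PySpark sketch (our own illustration using the Python API, assuming a local Spark installation; it is not code from this work) shows the two kinds of RDD operations mentioned above: transformations such as map and groupByKey, which lazily build the execution graph, and an action such as collect, which triggers the actual job.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("rdd-sketch").setMaster("local[4]")
sc = SparkContext(conf=conf)

numbers = sc.parallelize(range(10), numSlices=4)   # an RDD split into 4 partitions
pairs = numbers.map(lambda x: (x % 2, x))          # transformation: key/value pairs
grouped = pairs.groupByKey()                       # transformation: shuffle by key
result = grouped.mapValues(sum).collect()          # action: runs the whole lineage

print(result)   # e.g. [(0, 20), (1, 25)]
sc.stop()
```

Because nothing is computed until collect() is called, Spark can chain the whole pipeline and keep intermediate data in memory, which is the property discussed in the comparison below.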
2.4 Final Discussion on the Platforms

Spark allows a more flexible organization of the processes, appropriate for iterative algorithms, and favors efficiency due to the use of in-memory data structures (RDDs). But Spark requires a fairly large amount of RAM and is quite greedy in the utilization of the cluster resources, while Hadoop and MR-MPI can be used in low-resource and non-dedicated platforms. Hadoop is designed mainly for batch processing, while, with enough RAM, Spark may be used for near real-time processing. Also, many problems in industrial domains are implemented in C/C++, which is only natively supported by the Hadoop and MR-MPI implementations (and not by Spark). Therefore, the research on efficient uses of these first frameworks (as done in this article) is today an important domain [5,8].

The Hadoop framework stores both the input and the output of the job in the HDFS, whereas MR-MPI allocates pages of memory. Spark can also use HDFS to store the data, providing fault tolerance by task duplication. In this way, Hadoop and Spark also supply data redundancy. But the HDFS creation and its configuration require a careful setting of properties, resulting in a time-consuming process. Instead, MR-MPI is not able to detect a dead processor and retrieve the data, the MPI implementation being responsible for detecting and handling network faults.

Spark uses lazy evaluation to reduce the number of passes it has to take over the data by grouping operations together. In platforms like MR-MPI and Hadoop, developers often have to spend a lot of time considering how to group operations together to minimize the number of MR passes. In Spark, there is no substantial benefit to writing a single complex map instead of chaining together many simple operations. Thus, users are free to organize their program into smaller, more manageable operations.

Table 1 shows a comparison between the considered frameworks taking into account various aspects, such as language supported, volume of data sets, processing type, and ease of configuration, among others.

Table 1 Comparison between Big Data frameworks.

Features              | Hadoop                  | MR-MPI                 | Spark
open-source           | yes                     | yes                    | yes
popular for big data  | yes                     | no                     | yes
language supported    | Java, C++, Ruby, Python | C, C++, Python         | Scala, Java, Python, R
fault-tolerance       | yes                     | no                     | yes
processing approach   | read and write to disk  | read and write to disk | in-memory
volume of data sets   | huge                    | large (RAM + disk)     | quite large (memory sizes)
processing type       | batch                   | batch                  | near real-time
iterative processing  | no                      | yes                    | yes
load balancing        | automatic               | manual                 | automatic
installation          | easy                    | easy                   | easy
configuration         | relatively difficult    | easy                   | relatively difficult

3 Big Optimization with Genetic Algorithms

In this section, we start with a review of the literature about how GAs have been translated into different Big Data frameworks. After that, we describe the simple model of SGA used in this work and how it is adapted to be implemented following MapReduce (MRGA). The idea is to implement the same MRGA using the Hadoop (MRGA-H), MR-MPI (MRGA-M), and Spark (MRGA-S) frameworks to solve big optimization problems. This will allow us to compare the results and to find out their strong and weak features. The implementations of the MRGA-H and MRGA-M algorithms are obtained from [32] and from our own previous work [29], respectively. In the case of MRGA-S, its implementation was developed from scratch.

3.1 Literature Review

Some of the most representative works that model GAs using Big Data frameworks are described in this section. Verma et al. [32,33] proposed a SGA and a CGA based on the selecto-recombinative GA proposed by [12], which only uses two genetic operators: selection and recombination. The SGA was developed using the Hadoop framework. The authors match the map function with the evaluation of the population fitness, whereas the reduce function performs the selection and recombination operations. They proposed the use of a custom-made Partitioner function, which splits the intermediate key/value pairs among the reducers by using a random shuffle. In [11], the authors proposed a model of GA similar to that of Verma et al. [32] for software testing. The main difference lies in the use of only one reducer that receives the entire population. Thus, the reducer can perform the selection and apply the crossover and mutation operators to produce new offspring to be evaluated in the next MR job. In contrast, different parallel GA models were proposed by Ferrucci et al. [8] using Hadoop as distributed infrastructure. The authors propose a global model, a grid model, and finally an island model. They analyze the proposals in terms of execution time and speedup, as well as the behavior of the three parallel models in relation to the overhead produced using Hadoop. Chavez et al. [6] introduce changes in ECJ [30] to follow the MR paradigm in order to launch any EA problem on a big data infrastructure using Hadoop similarly as when a single computer is used to run the algorithm. Jatoth et al. [16] solved the problem of QoS-aware big service composition by implementing a MapReduce-based evolutionary algorithm with guided mutation on a Hadoop cluster, which uses a global model in the MapReduce phase.

An implementation of GAs using Spark can be found in [22]. Their proposal consists in partitioning the population among many worker processes, which apply the genetic operations and the evaluation by means of a map function; when the stop criterion is met, a reduce function aggregates the subpopulations to find the most promising individuals. In the same line of using Spark as parallel platform, Hu et al. [15] use a SGA as optimizer tool, where the population is divided in chunks for evaluation purposes by using a map function. After that, a collect function is used to gather all the individuals of the population together to apply the genetic operations. More recently, two versions of parallel GAs were proposed in [4] using the Spark framework. The proposals are based on the traditional master-slave model and the island model to solve large dimensional classifier problems. The first model handles the evolutionary process in the Spark driver, which sends the individuals across the executors to compute the fitness. In the second one, each island is an executor and evolves a subpopulation. The proposed models are evaluated in relation to performance and accuracy over multiple cluster sizes.

Many of the reviewed works deploy computing-intensive runs of EAs on Big Data infrastructures [8,6,16]. In particular, they implement parallel versions of EAs to optimize the running time of the algorithmic experiments, because the optimization problems have computationally costly fitness evaluation functions. As to papers dealing with large data volumes, we can mention the work of Verma et al. [32,33], which involves 10^n (n = 4) variables and a population of size n × log n. Finally, Chavez et al. [6] and Alterkawi et al. [4] address large and complex data classification tasks, but the authors do not indicate the amount of memory used.

The aforementioned proposals present different GA parallel models for big optimization, but they are specific to a single MR framework. This implies a significant lack of information on the advantages and limitations of each framework to implement GAs for big optimization. In this sense, the selection of the most appropriate one to implement this kind of algorithm becomes a very complex task. In order to mitigate this lack of information, the main objective of our research is to design and implement a scalable GA on the three best-known MR frameworks: MR-MPI, Hadoop, and Spark.
3.2 Big Optimization MRGA Solver

For our study, we will use a simple genetic algorithm, SGA, whose operations can be found in a great number of Evolutionary Algorithms in the literature. Our aim is thus to guide future research linked to these search operations when designing other algorithms for MR implementations. The pseudocode of the SGA is presented in Algorithm 1, which starts by generating an initial population. During the evolutionary cycle, the population is evaluated and then a set of parents is selected by tournament selection [21]. After that, the uniform recombination operator is applied to them. The recently created offspring form the new population for the next generation (using the generational replacement). The evolutionary process ends when either the optimum solution to the problem at hand is found or the maximum number of iterations is reached.

Algorithm 1 Sequential GA
1: t = 0; {current generation}
2: initialize(Pop(t));
3: evaluate(Pop(t));
4: while (non stop criterion is met) do
5:   Pop′(t) = select(Pop(t)); {k-wise tournament selection without replacement}
6:   offspring = recombine(Pop′(t), pc); {uniform crossover}
7:   Pop(t + 1) = replace(Pop(t), offspring);
8:   evaluate(Pop(t + 1));
9:   t = t + 1;
10: end while
11: return (best individual);

Our proposed MRGA algorithm preserves the SGA behavior, but it resorts to parallelization for some parts: the evaluation and the application of the genetic operators. Although our technique performs several operations in parallel, its behavior is equal to that of the sequential GA.

A key/value pair has been used to represent the individuals in the population as a sequence of bits. To distinguish identical individuals (with the same genetic configuration), a random identifier (ID) is assigned to each one in the map function. The ID prevents identical individuals from being assigned to the same reduce function in the shuffling phase, when the intermediate key/multivalue space is generated. The sequence of bits together with the ID corresponds to the key in the key/value pair. The value part is the individual fitness, which is computed by the map function.

For large problem sizes, the population initialization can be a time-consuming process. The situation gets worse with large individual sizes, as is the case in this work. For this reason, this initialization is parallelized in a separate MR phase, whose map functions are only used to generate random individuals. After that, the iterative evolutionary process begins, where each iteration consists of map and reduce functions. The map functions compute the fitness of the individuals. As each map is assigned a different chunk of data, they evaluate a set of different individuals in parallel. This fitness is added as the value in the key/value pair. Each map finds its best individual, which is used in the main process to determine whether the stop criterion is met. The reduce functions carry out the genetic operations. The binary tournament selection is performed locally on the intermediate key space, which is distributed in the partitioning stage after the map operation. The uniform crossover (UX) operator is applied over the selected individuals. The generational replacement is implemented to build the new population for the next MR task (a new iteration).

Regardless of the framework used, the key/value pairs generated at the end of the map phase are shuffled and split to the reducers and converted into an intermediate key/multivalue space. The shuffle of individuals consists in a random assignment of individuals to reducers instead of using a traditional hash function over the key. This modification, as suggested in [32], avoids that all values corresponding to the same key (identical individuals) are sent to the same reduce function, which would generate a biased partition, a fixed assignment of individuals to the same partition throughout the evolution, and an unbalanced load of the reduce functions at the end of the evolution. Therefore, the intermediate key/value pairs are distributed into R partitions using a uniform distribution.
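The following framework-independent Python sketch summarizes how one MRGA iteration is decomposed into the MR phases just described: fitness evaluation plus random-ID tagging in the map phase, a uniformly random split into R partitions in the shuffle phase, and binary tournament selection plus uniform crossover in the reduce phase. It is only an illustration under our own (hypothetical) naming, not the code of MRGA-H, MRGA-M, or MRGA-S.

```python
import random

def map_phase(chunk, fitness):
    # Map: evaluate each individual and tag it with a random ID so that
    # identical bit strings do not all end up in the same reducer.
    return [((random.getrandbits(32), tuple(ind)), fitness(ind)) for ind in chunk]

def shuffle_phase(pairs, R):
    # Shuffle: assign key/value pairs to the R reducers uniformly at random,
    # instead of hashing the key, to avoid biased partitions.
    partitions = [[] for _ in range(R)]
    for pair in pairs:
        partitions[random.randrange(R)].append(pair)
    return partitions

def reduce_phase(partition):
    # Reduce: binary tournament selection plus uniform crossover (UX),
    # producing as many offspring as individuals were received.
    def tournament():
        a, b = random.sample(partition, 2)
        return a[0][1] if a[1] >= b[1] else b[0][1]   # keep the fitter bit string
    offspring = []
    for _ in range(len(partition)):
        p1, p2 = tournament(), tournament()
        offspring.append([g1 if random.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)])
    return offspring

def mrga_generation(population, fitness, R=4):
    pairs = map_phase(population, fitness)       # run in parallel in the real solvers
    new_population = []
    for part in shuffle_phase(pairs, R):
        if len(part) >= 2:
            new_population.extend(reduce_phase(part))
        else:                                    # degenerate partition: copy it over
            new_population.extend(list(key[1]) for key, _ in part)
    return new_population
```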
3.3 MRGA-H Algorithm

Some modifications were introduced in the code developed in [32]. The most important ones are related to the changes imposed by passing from the old MRV1 to the new Java MapReduce API MRV2 [1], because they are not compatible with each other. These important differences involve the new package name, the context objects that allow the user code to communicate with the MR system, the job control that is performed through the Job class in the new API (instead of the old JobClient), and the reduce() method that now passes values as an Iterable object. Also, some modifications were required during the generation of the random individual ID. Finally, the individual evaluation was included in the fitness method of the GAMapper class. The rest of the code with the functionality of the GA remains without important modifications. The scheme of MRGA-H is plotted in Figure 1, where the chunks of data read from HDFS and processed by each map are represented by shaded rectangles.

Figure 1 MRGA-H scheme.

3.4 MRGA-M Algorithm

MRGA-M creates the MPI environment for the parallel execution. Then, the sequence begins with the instantiation of an MR object and the setting of its parameters. MRGA-M follows the scheme shown in Figure 2, where boxes with solid outlines are files and the chunks of data processed by each map are represented by shaded rectangles on a hard disk.

The first MR phase consists of only one map function (calling a serial Initialize()) to create the initial population. In our implementation, the main process (the process with MPI id equal to zero) generates a list of filenames. Our Initialize() function processes each file to build the initial population.

The second and following MR phases have a sequence of map and reduce functions. These map functions receive a chunk of the large file, which is passed to our fitness() function, and then split it into chunks. The fitness() function processes each key (an individual) received, evaluates it to obtain the fitness value, and emits key/value pairs.
After that, the MR-MPI aggregate() function shuffles the key/value pairs across processors by distributing the pairs randomly. Then, the MR-MPI convert() function transforms the key/value pairs into key/multi-value pairs. Finally, the Evol() function (from the reduce method) is called once for each key/multi-value pair assigned to a processor. This function selects pairs of individuals by tournament selection and performs the recombination. The new individuals generated are written into permanent storage to be read by the map methods in the following MR phase.

Figure 2 MRGA-M scheme.

3.5 MRGA-S Algorithm

As we have explained before, Spark extends and generalizes the MR idea with a different implementation. In consequence, we need to introduce changes to the MRGA design to obtain MRGA-S, which are detailed in the following.

The proposed MRGA is based on Spark RDDs to store the population. This RDD is cached in memory to accelerate the processing, instead of using files to store the population as in MRGA-H and MRGA-M. However, MRGA-S also exploits the parallelism in the evaluation and in the application of the genetic operators. Consequently, MRGA-S follows the same logical functionalities as both MRGA-H and MRGA-M with respect to the SGA behavior.

In this MRGA implementation, a different conception of key/value pairs to represent individuals is used. The key represents the partition to which an individual has to be assigned, whereas the value corresponds to the individual itself.

Figure 3 presents a scheme of MRGA-S. The sequence begins with the creation of a RDD (Stage 1), which is parallelized in the main program. The elements of the RDD are copied to form a distributed dataset that can be operated on in parallel by each worker, which transforms them and returns the results to the main program. After that, an iterative process begins, consisting of two Spark stages that are repeated until the stop criterion is met.

The first step in the main loop (Stage k) assigns a random value in the range [1, .., R] to each individual, by using a special version of the map operation (the mapToPair() operation). This conversion prepares a RDD for the next operation, which consists in grouping the individuals with the same key. Note that this step (Stage k) is equivalent to the Partitioner in MRGA-H or to the aggregate() function in MRGA-M.

The next stage (Stage k+1) begins with the redistribution of the individuals across the partitions by using the groupByKey() function. After that, a mapToPair() operation is invoked with the Evolution() function as its parameter, in order to evaluate the individuals and apply the genetic operators within a partition, generating the new individuals for the next generation. Although this is a new difference with respect to MRGA-H and MRGA-M, MRGA-S maintains the SGA's underlying idea. Finally, a new RDD containing the individuals from all the partitions (the whole population) is obtained to continue with the first step of this iterative process.

Figure 3 MRGA-S scheme.
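As an illustration of these two stages, the following PySpark sketch mirrors the loop described above using the Python API (the implementation in this work uses the Java API with mapToPair(); here map plus groupByKey play the same role, so this is only an approximation of MRGA-S). The function evolve_partition is a hypothetical stand-in for the Evolution() function, and the toy fitness is only there to make the sketch runnable.

```python
import random
from pyspark import SparkContext

def evolve_partition(individuals, fitness):
    # Hypothetical stand-in for Evolution(): evaluate the partition, then apply
    # binary tournament selection and uniform crossover inside it.
    scored = [(ind, fitness(ind)) for ind in individuals]
    if len(scored) < 2:
        return [ind for ind, _ in scored]
    def tournament():
        a, b = random.sample(scored, 2)
        return a[0] if a[1] >= b[1] else b[0]
    children = []
    for _ in range(len(scored)):
        p1, p2 = tournament(), tournament()
        children.append([g1 if random.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)])
    return children

def mrga_s_generation(population_rdd, fitness, R=8):
    keyed = population_rdd.map(lambda ind: (random.randrange(R), ind))   # Stage k
    evolved = keyed.groupByKey().flatMap(                                # Stage k+1
        lambda kv: evolve_partition(list(kv[1]), fitness))
    return evolved.cache()  # the new population stays in memory for the next iteration

if __name__ == "__main__":
    sc = SparkContext("local[4]", "mrga-s-sketch")
    fitness = lambda ind: sum(ind)   # toy fitness (number of ones), for the sketch only
    pop = sc.parallelize([[random.randint(0, 1) for _ in range(64)] for _ in range(1000)])
    for _ in range(3):               # a few generations of the two-stage loop
        pop = mrga_s_generation(pop, fitness)
    print(pop.count())
    sc.stop()
```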
4 Experimental Setting

To address the research questions about the efficiency and scalability presented in Section 1, we consider the knapsack problem [18,25,28,36] as a benchmark to evaluate the proposed algorithms. The choice of this problem was motivated by the fact that it allows us to assess the MRGA scalability on different big instance sizes. To carry out the evaluation and analysis of our proposals, we use metrics such as execution time, scalability, speedup, and communication vs. computation. The problem, the experimentation methodology, and the evaluation metrics are explained in the following subsections.

4.1 Knapsack Problem

The knapsack problem (KP) is a classic NP-complete problem [18], defined by the task of taking a set of items, each with a weight and a profit, and filling the knapsack so that the total profit is maximized without exceeding the maximum weight the knapsack can hold. The KP formulation is shown in Equation 1:

  max  Σ_{i=1}^{Ns} x_i p_i                                   (1)

  subject to  Σ_{i=1}^{Ns} x_i w_i ≤ K

where K is the maximum weight the knapsack can hold and Ns is the number of items in the set S. Each item has a weight w_i and a profit p_i. Here x_i indicates whether an item i is present or not in the knapsack. Therefore, a KP solution is represented by a bit string in MRGA, as shown in Figure 4, and its implementation is described in Section 3.2.

Figure 4 Solution representation of the knapsack problem in MRGA.

KP is a very well-known problem in computer science. It occurs in many situations, be they in industry, communication, finance, applied sciences, or in real life [9,17,19,27], being itself a very interesting combinatorial problem to be dealt with using the big optimization solver presented in this work. In general, the KP literature solves problem instances that vary between 100 and 10,000 items in the knapsack. In this article, we propose to optimize eight different high dimensional instances with a very large number of items: 25,000, 50,000, 75,000, 100,000, 125,000, 150,000, 200,000 and 300,000 items. They are named 25K, 50K, 75K, 100K, 125K, 150K, 200K and 300K, respectively. These big KP instances were obtained with the generator described in [23] and can be found in the repository https://github.com/GabJL/LargeKPInstances, choosing the uncorrelated data instance type (no correlation between the weight and the profit of an item). We selected these large datasets because they represent different degrees of computational and memory load, being also an important contribution to the state of the art of the knapsack problem.

4.2 Experimentation Methodology

Each MRGA approach evolves 50,000 randomly initialized individuals. This population size was chosen in order to increase the memory load that our big optimization solver has to manipulate. For each generation, these algorithms use binary tournament selection to select the parents, a probability of 100% to recombine the parents using the UX operator, and generational replacement to obtain the next population. Let us recall that, for each considered KP instance and number of map tasks (#map), we execute the MRGA-M, MRGA-H, and MRGA-S 30 times. The computational environment used in this work to carry out the experimentation is a cluster of five nodes with 8 GB RAM.

The sequence of bits of an individual is grouped into arrays of long ints (64 bits), whose length depends on the instance dimension. For example, the individual length for the 25K instance is 392 long ints (25,000/64), requiring 3.1 KB of memory, and therefore the population demands 156.25 MB. The decision of using long ints was made to optimize the bit operations required by the evolutionary operators. In this way, the total RAM requirement varies from 150 MB to 1.8 GB, justifying the use of Big Data frameworks.
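To make the packed representation concrete, the following small Python sketch (our own illustration) shows how a KP solution stored as 64-bit words can be decoded and evaluated; the zero-profit treatment of infeasible solutions is an assumption of the sketch, since the constraint-handling scheme is not detailed here.

```python
# Each individual is a list of 64-bit integers; bit i tells whether item i is packed.
def get_bit(words, i):
    return (words[i // 64] >> (i % 64)) & 1

def kp_fitness(words, profits, weights, capacity):
    profit = weight = 0
    for i in range(len(profits)):
        if get_bit(words, i):
            profit += profits[i]
            weight += weights[i]
    # Assumption: an infeasible solution simply scores zero.
    return profit if weight <= capacity else 0

# A 25K-item instance needs ceil(25000 / 64) = 391 such words per individual,
# in line with the roughly 3.1 KB per individual reported above.
n_items = 25000
print((n_items + 63) // 64)
```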
4.3 Evaluation Metrics

In the previous section we explained the experimentation methodology; now we describe the strategy to carry out a fair comparison between the MRGA solvers. Distinct metrics are considered to evaluate them, such as execution time, scalability, and speedup, which is a good practice to report results in the field [3]. We also evaluate the behavior of our proposals considering different numbers of maps and reducers in the case of MRGA-H and MRGA-M, and of workers for MRGA-S, in order to analyze the implications of the amount of parallelism on the performance of the MRGA approaches.

Figure 5 The method to compute times for MRGA-M and MRGA-H.

Execution Time. The execution time (or runtime) achieved by each MRGA approach is measured in milliseconds using the system clock. This includes the time between starting and finishing the whole algorithm; consequently, we include all the communication involved in the execution. As a way to analyze the implications of the amount of parallelism in the execution of each approach, a comparison among the different MRGA solvers is carried out. This is the base metric to measure the scalability and speedup explained below.

Scalability. We analyze the scalability along two different dimensions. The first one refers to the algorithm's capacity to solve increasing sizes of the problem; for that reason we include eight different instances in the study while keeping the number of map tasks constant. The second one addresses increasing the number of map tasks while keeping the problem size fixed, considering 4, 8, and 12 map tasks. This study allows us to determine how the execution times change when more resources are available to solve the same problem. Consequently, we have scalability with an increasing problem size and scalability with a constant overall load.

Speedup. The speedup (s_m) is the ratio between the mean execution time on one processor and the mean execution time on m processors. We use the definition of weak speedup given in [2], which compares the execution time of the parallel algorithm on one processor against the execution time of the same algorithm on m processors. For this particular study, the solution quality is taken as the stopping criterion. The evaluated MRGA solvers should compute solutions having a similar accuracy; thus, a relaxation of the optimal fitness value for each KP instance (e.g., 90%) is considered, in any case the same value for all of them. All this defines an orthodox speedup measure in Alba's taxonomy [2].
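In symbols, writing t̄_1 and t̄_m for the mean execution times of the same parallel algorithm on 1 and m processors (both stopped at the same target fitness; the bar notation is ours, introduced only for this formula), the weak speedup is

```latex
s_m = \frac{\bar{t}_1}{\bar{t}_m}
```

and values of s_m below m indicate sub-linear speedup.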
Communication vs. Computation. A study of the communication and computation times of each algorithm allows us to understand the reasons why our proposals achieve only slightly improved speedups. We adopt the method used in [8], which was proposed for the Hadoop framework, and we have extended it to the other two frameworks. Figure 5 illustrates how the communication and computational times are calculated per generation. This method allows us to isolate the GA execution time (computational time) from the time spent by each framework to put on-line and run each algorithm (communication time).

5 Result Analysis

In this section, we present the results that allow us to answer the different RQs formulated in Section 1. The comparison between the MRGA solvers with respect to scalability is in Section 5.1. The analysis of the speedup is in Subsection 5.2. Finally, in Subsection 5.3, we contrast the communication and computation times consumed by each MRGA.

5.1 Scalability

Table 2 and Figure 6 show the execution times achieved by each algorithm when increasing the problem dimension. In Table 2, the MRGA-S execution time for the 300K instance is not available (N/A) because this solver was not able to run such a large instance in our systems. For every instance, each bar of Figure 6 represents a different number of map tasks. As expected, the execution times increase as the dimensionality of the problem grows. This situation is very clear in the case of MRGA-M and MRGA-S. However, MRGA-H presents a behavior with no direct dependence on the problem size. The previous results support an answer to RQ2, since these algorithms can efficiently solve increasingly high dimensional instances, becoming scalable big optimization solvers. In particular, MRGA-H presents the best performance.

Table 2 Execution times (in milliseconds) achieved for each MRGA solver.

          MRGA-M                        MRGA-H                        MRGA-S
Inst    #map=4  #map=8  #map=12     #map=4  #map=8  #map=12     #map=4  #map=8  #map=12
25K      15420   14025    14130      62862   56797    55978      22951   19391    23349
50K      39039   22424    22710      42607   35255    33931      38107   30232    28032
75K      66585   36600    36570      58439   48221    47597      51448   40955    37335
100K     78894   74724    40150      52234   38541    38479      65415   50923    45181
125K    108645   99390    51810     118170   94092    89665      79798   61931    55903
150K    136815  118110   113798      75931   54267    48066      94052   73351    66071
200K    157475  160672   148570      79271   50402    47387     123655   71088    83443
300K    251028  260218   248944     101026   63380    57806        N/A  172853   208043

Figure 6 Mean execution time of MRGA algorithms to solve the KP instances.

Now, if we analyze what happens when the same KP instance is solved by a given MRGA and the number of map tasks is increased, as shown in Figure 6, we can observe a decrease in the execution time. This study allows us to infer how the MRGA execution times are affected when the same load is maintained but the number of map tasks is augmented. It suggests that, when more resources are used to solve the same problem, we obtain a gain in time, 12 being a good number of map tasks for MRGA-M, while 8 map tasks are enough to obtain accurate results in both MRGA-H and MRGA-S, the improvement achieved by adding more maps being negligible.

Figure 7 Mean execution time of MRGA algorithms for each number of map tasks to solve the KP instances.
5.2 Speedup

We analyze the results shown in Figure 7 by comparing the execution times of each MRGA. In this way, we find responses to the research questions RQ1 and RQ3 about the MRGA efficiency and scalability when more computational resources are available to solve the same problem. For the smallest datasets, i.e., instances with less than 100,000 items, we observe that the execution of MRGA-M is always faster than the executions of the remaining MRGAs, regardless of the number of map tasks used. However, for the largest datasets, MRGA-M is the slowest MRGA solver, mainly when 4 and 8 map tasks are employed, while MRGA-H is the fastest one. The weak behavior of MRGA-M is caused by the hardware resource limitations to support big instances with a small number of map tasks. Moreover, MRGA-H is the algorithm with less time variation than the other two MRGAs for a given number of map tasks. In the case of MRGA-S, we observe a slight increase in the runtimes when instances with a larger number of items are solved. Consequently, we can infer that the three MRGAs present an efficient performance and are scalable, MRGA-H being the most efficient and scalable solver for big optimization. This conclusion can also be deduced from Figure 6, although it cannot be seen with the naked eye.

Figure 8 Speedup trend per instance.

Now, we focus on the speedup values to reinforce the justification of the previous answer to RQ1 from a different point of view. Figure 8 graphically shows the speedup values for each KP instance. The MRGA-H speedup is the best of the three algorithms for the majority of the problem instances. Although the speedup is sub-linear (s_m < m), the MRGA-H results are quite good because they are approximately at 0.65 of the ideal speedup value for 4 and 8 map tasks. Both MRGA-M and MRGA-S present a poor relation with respect to the ideal value, and this situation becomes worse when a larger number of map tasks is considered (the values are very small, less than 0.3). These observations are corroborated by the average speedup values per MRGA solver shown in Figure 9.

Figure 9 Mean speedup per MRGA solver.

5.3 Communication vs. Computation

Figure 10 shows the communication and computation times for each MRGA solver, taking as reference the instance with a dimension of 100,000 items. Similar observations can be made for the other instances. This kind of analysis allows us to give more details about the execution time achieved by each approach. The stacked bars represent the communication and computational times for a generation.

Figure 10 The average communication vs. computational times spent by each algorithm per generation.

On the one hand, it is worth noting that MRGA-S presents the lowest communication time, which decreases as the number of maps is increased, while the computational time stays similar. What explains this situation is that the total number of executors is fixed regardless of the number of map tasks; consequently, the tasks assigned to an executor have to be executed in a serial way. Another reason is that the available hardware infrastructure is below the Spark hardware requirements.

On the other hand, for MRGA-M and MRGA-H the communication time surpasses the computation time, but it remains stable for every number of map tasks. The reasons for this behavior are that the same large dataset size is considered and that the number of maps is always equal to the assigned number of cores. However, the computation time becomes smaller when the number of tasks increases, because each map task is assigned to a different core of the cluster. Moreover, MR-MPI and Hadoop are better suited to low-cost commercial off-the-shelf computers. In view of these results, we cannot answer RQ4 clearly. Although the communication time is an important factor to take into account when choosing a MRGA solver, this kind of time is strongly related to the number of maps, the dataset size, and the hardware infrastructure. Therefore, the combination of these last factors may or may not lead to a reduction in the communication time.
6 Conclusions

In this article, we have proposed big optimization solvers based on MR implementations of a Simple Genetic Algorithm in different Big Data processing frameworks. These allowed us to solve large instances of combinatorial optimization problems, in particular the knapsack problem, which is important in industry and academia. In this sense, we used three open-source frameworks, Hadoop, MR-MPI, and Spark, in order to generate the MRGA-H, MRGA-M, and MRGA-S solvers, respectively. We empirically assessed the effectiveness of the three MRGA algorithms in terms of execution time, scalability, speedup, and communication vs. computation, to answer the research questions formulated at the beginning of this work. This assessment was carried out by using eight big instances with sizes varying from 25,000 to 300,000 items, which were chosen to represent different degrees of computational and memory load.

Results show that, from a computational point of view, the execution times of the MRGA solvers increased as the dimensionality of the problem grew, as expected. This behavior is exhibited by MRGA-M and MRGA-S, but not by MRGA-H, which is not affected by the instance dimensionality and shows the best speedup values. MRGA-H thus outperforms the other two in terms of execution time when the problem size scales to high dimensional instances. In this way, RQ1 and RQ2 are satisfactorily answered.

Furthermore, the answer to the question about the scalability of the MRGA solvers with an increased number of map tasks (RQ3) is that, in fact, a gain in time is observed if more map tasks are used to solve the same problem instance. It is more noticeable for MRGA-M than for the other two MRGA solvers, due to the growing memory requirements as a consequence of the increase in the instance sizes. MRGA-S presents the lowest communication times, but it cannot exploit the advantage of using in-memory persistence.

However, no conclusive evidence was found with regard to the time spent in communication as a factor to choose a particular MRGA solver, the last research question formulated (RQ4) in the present study. The communication time seems to be related, in a way that we could not fully characterize, to the number of map tasks, the dataset size, and the hardware infrastructure.

The differences observed in the behavior of our MRGA solvers are to some extent explained by the fact that both MRGA-M and MRGA-S keep and manage the population in memory, while MRGA-H uses HDFS to manage it. Furthermore, MRGA-S deserves special attention due to its low performance. Given that the population is updated in each iteration, the contents of its RDD persist in memory for only one iteration. As a consequence, the Spark performance advantage with respect to in-memory persistence cannot be exploited.
Summarizing the above observations, MRGA-H presents a better performance and scalability than MRGA-M and MRGA-S when high dimensional optimization problems are solved. Therefore, the MRGA-H solver continues to perform adequately as its workload grows, as much as the capacity of the containers allows it. Nevertheless, both MRGA-H and MRGA-S should be positively considered, since they use frameworks that allow easier programmability. They also present further advantages, such as inherent support for node failures and data replication.

In a future work, other models to parallelize the SGA, such as the island model, will be considered on these three frameworks using the MR paradigm, in order to improve the speedup of the big optimizer and take advantage of the distributed nature of the new proposals. Also, the sensitivity of the parameter settings of the proposed algorithm will be addressed. Other appropriate big optimization problems will be used to analyze the performance of these big optimization solvers and to give more insights into their behavior.

Acknowledgments

This research received financial support from the Universidad Nacional de La Pampa and the Incentive Program from MINCyT (Argentina). Moreover, this research has been partially funded by the Spanish MINECO and FEDER project TIN2017-88213-R (6city), and by PRECOG (UMA18-FEDERJA-003).

Conflict of interest

The authors declare that they have no conflict of interest.

Human and animal rights

This article does not contain any studies with animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

References

1. Welcome to Apache Hadoop! Technical report, The Apache Software Foundation, http://hadoop.apache.org/, 2014.
2. E. Alba. Parallel evolutionary algorithms can achieve super-linear performance. Information Processing Letters, 82(1):7–13, 2002.
3. E. Alba. Parallel Metaheuristics: A New Class of Algorithms. Wiley-Interscience, New York, NY, USA, 2005.
4. L. Alterkawi and M. Migliavacca. Parallelism and partitioning in large-scale GAs using Spark. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '19, pages 736–744, New York, NY, USA, 2019. Association for Computing Machinery.
5. A. Cano, C. García-Martínez, and S. Ventura. Extremely high-dimensional optimization with MapReduce: Scaling functions and algorithm. Information Sciences, 415-416(Supplement C):110–127, 2017.
6. F. Chávez, F. Fernández, C. Benavides, D. Lanza, J. Villegas, L. Trujillo, G. Olague, and G. Román. ECJ+Hadoop: An easy way to deploy massive runs of evolutionary algorithms. In G. Squillero and P. Burelli, editors, Applications of Evolutionary Computation, pages 91–106, Cham, 2016. Springer International Publishing.
7. J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI'04: Proceedings of the 6th Conference on Symposium on Operating Systems Design and Implementation. USENIX Association, 2004.
8. F. Ferrucci, P. Salza, and F. Sarro. Using Hadoop MR for parallel GAs: A comparison of the global, grid and island models. Evolutionary Computation, 0(0):1–33, 2017.
9. J. Rui Figueira, G. Tavares, and M. Wiecek. Labeling algorithms for multiple objective integer knapsack problems. Computers & Operations Research, 37(4):700–711, 2010.
10. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.
11. L. Di Geronimo, F. Ferrucci, A. Murolo, and F. Sarro. A parallel genetic algorithm based on Hadoop MapReduce for the automatic generation of JUnit test suites. In 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation, pages 785–793, April 2012.
12. D.E. Goldberg. The Design of Innovation: Lessons from and for Competent Genetic Algorithms. Kluwer Academic Publishers, 2002.
13. M. Hamstra, H. Karau, M. Zaharia, A. Konwinski, and P. Wendell. Learning Spark: Lightning-Fast Big Data Analytics. O'Reilly Media, 2015.
14. I. Hashem, N. Anuar, A. Gani, I. Yaqoob, F. Xia, and S. Khan. MapReduce: Review and open challenges. Scientometrics, 109(1):389–422, October 2016.
15. C. Hu, G. Ren, C. Liu, M. Li, and W. Jie. A Spark-based genetic algorithm for sensor placement in large scale drinking water distribution systems. Cluster Computing, 20(2):1089–1099, 2017.
16. C. Jatoth, G.R. Gangadharan, U. Fiore, and R. Buyya. QoS-aware big service composition using MapReduce based evolutionary algorithm with guided mutation. Future Generation Computer Systems, 86:1008–1018, 2018.
17. L. Jenkins. A bicriteria knapsack program for planning remediation of contaminated lightstation sites. European Journal of Operational Research, 140(2):427–433, 2002.
18. H. Kellerer, U. Pferschy, and D. Pisinger. Introduction to NP-Completeness of Knapsack Problems, pages 483–493. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004.
19. K. Klamroth and M. M. Wiecek. Time-dependent capital budgeting with multiple criteria. In Yacov Y. Haimes and Ralph E. Steuer, editors, Research and Practice in Multiple Criteria Decision Making, pages 421–432, Berlin, Heidelberg, 2000. Springer Berlin Heidelberg.
20. M. Lozano, D. Molina, and F. Herrera. Editorial: Scalability of evolutionary algorithms and other metaheuristics for large-scale continuous optimization problems. Soft Computing, 15(11):2085–2087, 2011.
21. B. Miller and D. Goldberg. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems, 9:193–212, 1995.
22. C. Paduraru, M. Melemciuc, and A. Stefanescu. A distributed implementation using Apache Spark of a genetic algorithm applied to test data generation. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '17, pages 1857–1863. ACM, 2017.
23. D. Pisinger. Core problems in knapsack algorithms. Operations Research, 47:570–575, 1999.
24. S. Plimpton and K. Devine. MapReduce in MPI for large-scale graph algorithms. Parallel Computing, 37(9):610–632, 2011.
25. T. Pradhan, A. Israni, and M. Sharma. Solving the 0-1 knapsack problem using genetic algorithm and rough set theory. In 2014 IEEE International Conference on Advanced Communications, Control and Computing Technologies, pages 1120–1125, 2014.
26. R. Qi, Z. Wang, and S. Li. A parallel genetic algorithm based on Spark for pairwise test suite generation. Journal of Computer Science and Technology, 31:417–427, 2016.
27. V. Quintuna Rodriguez and Ma. Laye. Modeling and optimization of content delivery networks with heuristics solutions for the multidimensional knapsack problem. pages 13–18, 2016.
28. A. Salama, M. Wahed, and E. Yousif. Big data flow adjustment using knapsack problem. Journal of Computer and Communications, 6:30–39, 2018.
29. C. Salto, G. Minetti, E. Alba, and G. Luque. Developing genetic algorithms using different MapReduce frameworks: MPI vs. Hadoop. In F. Herrera, S. Damas, R. Montes, S. Alonso, Ó. Cordón, A. González, and A. Troncoso, editors, Advances in Artificial Intelligence, pages 262–272, Cham, 2018. Springer International Publishing.
30. E. Scott and S. Luke. ECJ at 20: Toward a general metaheuristics toolkit. In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO '19, pages 1391–1398, New York, NY, USA, 2019. Association for Computing Machinery.
31. E. Talbi. Metaheuristics: From Design to Implementation. Wiley Publishing, 2009.
32. A. Verma, X. Llorà, D.E. Goldberg, and R. Campbell. Scaling genetic algorithms using MapReduce. In ISDA'09, pages 13–18, 2009.
33. A. Verma, X. Llorà, S. Venkataraman, D.E. Goldberg, and R. Campbell. Scaling eCGA model building via data-intensive computing. In IEEE Congress on Evolutionary Computation, pages 1–8, 2010.
34. T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.
35. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI'12, pages 2–2. USENIX Association, 2012.
36. Guo Zhou, Ruixin Zhao, and Yongquan Zhou. Solving large-scale 0-1 knapsack problem by the social-spider optimisation algorithm. IJCSM, 9(5):433–441, 2018.