MC2: Map Concurrency Characterization for MapReduce on the Cloud

Mohammad Hammoud and Majd F. Sakr
Carnegie Mellon University in Qatar, Education City, Doha, State of Qatar
Emails: {mhhammou, msakr}@qatar.cmu.edu

Abstract—MapReduce is now a pervasive analytics engine on the cloud. Hadoop is an open source implementation of MapReduce and is currently enjoying wide popularity. Hadoop offers a high-dimensional space of configuration parameters, which makes it difficult for practitioners to set for efficient and cost-effective execution. In this work we observe that MapReduce application performance is highly influenced by map concurrency. Map concurrency is defined in terms of two configurable parameters, the number of available map slots and the number of map tasks running over the slots. We show that some inherent MapReduce characteristics enable well-informed prediction of map concurrency. We propose Map Concurrency Characterization (MC2), a standalone utility program that can predict the best map concurrency for any given MapReduce application. By leveraging the generated predicted information, MC2 can judiciously guide Map phase configuration and, consequently, improve Hadoop performance. Unlike many relevant schemes, MC2 does not employ simulation, dynamic instrumentation, and/or static analysis of unmodified job code to predict map concurrency. In contrast, MC2 utilizes a simple, yet effective mathematical model, which exploits the MapReduce characteristics that impact map concurrency. We implemented MC2 and conducted comprehensive experiments on a private cloud and on Amazon EC2 using Hadoop 0.20.2. Our results show that MC2 can correctly predict the best map concurrencies for the tested benchmarks and provide up to 2.2X speedup in runtime.

[Figure 1: bar chart of execution times in seconds for Sobel, WordCount-CE, K-Means, Sort and WordCount-CD under default Hadoop and tuned Hadoop.] Figure 1. The execution times of various benchmarks under default Hadoop and Hadoop with two tuned parameters, # of map tasks and # of map slots (the two parameters that define map concurrency).

I. INTRODUCTION

MapReduce [8] is now a popular choice for big data processing and is highly recognized for its elasticity, scalability and fault-tolerance. For instance, Google utilizes MapReduce to process 20PB of data per day [8]. Amazon added a new service, called Amazon Elastic MapReduce, to enable businesses, researchers, data analysts, and developers to easily process vast amounts of data [2].
In essence, Amazon Elastic MapReduce has created a market for pay-as-you-go analytics on the cloud [19].

MapReduce provides minimal abstractions, hides architectural details, and automatically parallelizes computation by running multiple map and/or reduce tasks over distributed data across multiple machines. MapReduce incorporates two phases, a Map phase and a Reduce phase, and allows programmers to write sequential map and reduce functions that are transformed by the framework into concurrent map and reduce tasks. Hadoop [13] is an open source implementation of MapReduce. Hadoop's adoption by academic, governmental, and industrial organizations is growing at a fast pace [19]. For example, industry's premier web vendors such as Facebook, Yahoo! and Microsoft have already advocated Hadoop [22]. Academia is currently using Hadoop for seismic simulation, natural language processing, and web data mining, among others [15], [38].

Nonetheless, Hadoop users are faced with a main challenge on the cloud. In particular, they lack the ability to run MapReduce applications in the most economical way, while still achieving good performance. The approach of renting more nodes so as to enhance performance is not cost-effective in the cloud's pay-as-you-go environment [26]. As such, in addition to elasticity, scalability and fault-tolerance, an ideal analytics engine should provide a high-performing and cost-effective execution framework for applications on the cloud.

Hadoop has more than 190 configuration parameters, out of which 10-20 can have a significant impact on job performance [17]. Today, the burden falls on Hadoop users to specify effective settings for all these parameters. Hadoop's default configuration settings do not necessarily provide the best performance and, thus, might lead to some inefficiency when Hadoop is deployed on the cloud. Fig. 1 depicts the execution times of various MapReduce applications run on a private cloud under two Hadoop configurations, the default one and one with two tuned parameters, the number of map tasks and the number of map slots.^1 As shown, the tuned configuration provides Hadoop with speedups of 2.3X, 2.1X, 1.3X, 1.1X and 2X for Sobel, WordCount-CE, K-Means, Sort, and WordCount-CD, respectively. Clearly, this demonstrates that: (1) Hadoop's default configuration is not optimal, (2) the numbers of map tasks and slots (or what we refer to as map concurrency) have a strong impact on Hadoop performance, and (3) for effective execution, Hadoop might require different configuration settings for different applications.

^1 The experimentation environment and all our benchmarks are described in Section VI.

Selecting an appropriate Hadoop configuration is not a trivial task. Exhausting all possible options for a single parameter, let alone all parameters, is a complex, time-consuming, and quite expensive process. Furthermore, even if an optimal configuration is located for a specific application, it might not be applicable to other applications (see Fig. 1). Therefore, pursuing a brute-force scan over every parameter per every application is clearly an inefficient approach. Indeed, setting Hadoop parameters for efficient execution is a form of art, which typically requires extensive knowledge of Hadoop internals [24]. Most practitioners of big data analytics (e.g., computation scientists, systems researchers, and business analysts) lack the expertise to tune Hadoop and improve performance [19]. Consequently, they tend to either run Hadoop using the default configuration, thus potentially missing a promising optimization opportunity on the cloud, or learn the internal intricacies of MapReduce to select satisfactory Hadoop configuration settings, or hire expertise to accomplish the mission. We argue that practitioners need not do all that.

[Figure 2: execution flow of a map task (HDFS split, user-defined map function, partitioned local output) and a reduce task (shuffle over the network, merge, user-defined reduce function, output to HDFS).] Figure 2. Map task and reduce task executions and Map and Reduce phases. Reduce phase includes two stages, a Shuffle and Merge stage and a Reduce stage.
Specifically, we suggest that a simple, accurate and fast scheme can be devised to effectively guide Hadoop configuration on the cloud.

As map concurrency greatly influences MapReduce performance, in this work we focus on optimizing the Map phase in MapReduce. Optimizing the Reduce phase is also crucial and has been left for future exploration. We propose Map Concurrency Characterization (MC2), a highly accurate predictor that predicts the best map concurrencies for MapReduce applications. MC2 is based on a simple mathematical model that leverages two main MapReduce characteristics: (1) data shuffling (i.e., moving Map phase output to the Reduce phase) and (2) the total overhead for setting up all map tasks in a job. This is contrary to many current related schemes that incorporate simulation, dynamic instrumentation and/or static analysis of unmodified MapReduce application code to accomplish a similar objective. MC2 is a standalone utility program that only requires some information about the given application and, speedily enough (in microseconds), can predict the best map concurrency for the application without involving Hadoop.

In this paper we make the following contributions:

• We show a strong dependency between the execution times of MapReduce applications and map concurrency.
• We characterize MapReduce to figure out the core characteristics that impact map concurrency.
• We develop a general mathematical model that leverages the discovered concurrency characteristics and allows estimating runtimes of MapReduce applications.
• We propose MC2, a novel predictor that effectively utilizes the suggested mathematical model to predict the best map concurrency for any given MapReduce application.
• We present a strategy that can serve in reducing cost and improving performance in a cloud setting.
• We offer a timely contribution to data analytics on the cloud, especially as Hadoop usage continues to grow beyond companies like Google, Microsoft, Facebook and Yahoo!.

The rest of the paper is organized as follows. An overview of Hadoop is presented in Section II. We characterize MapReduce for map concurrency in Section III. Section IV presents our suggested mathematical model. We describe MC2 in Section V. Section VI discusses our evaluation methodology and results. Finally, we provide a summary of prior work in Section VII and conclude in Section VIII.

II. HADOOP OVERVIEW

A. Hadoop Architecture and MapReduce Phases

Hadoop is an open source implementation of MapReduce. Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS). HDFS mimics the Google File System (GFS) [21]. MapReduce adopts a tree-style, master-slave architecture. The master is denoted as the JobTracker and each slave node is called a TaskTracker. The JobTracker is responsible for scheduling map and reduce tasks at specific TaskTrackers in a Hadoop cluster, monitoring them and re-executing failed ones.

A MapReduce job typically includes two phases, a Map phase and a Reduce phase. Nonetheless, a job can still have only a Map phase and will, consequently, be referred to as a Reduce-Less job [7]. In the presence of a Reduce phase, map tasks in the Map phase produce and store intermediate outputs on local disks (i.e., not on HDFS) and partition them to designated reduce tasks. Each reduce task pulls its corresponding partitions in a process known as shuffling, merges them, applies on the merged outcome a user-defined reduce function, and stores final results in HDFS (see Fig. 2). Thus, the Reduce phase is usually broken up into a Shuffle and Merge stage and a Reduce stage as shown in Fig. 2. In the absence of a Reduce phase, map tasks write their outputs directly to HDFS.
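To ground the notions of sequential map and reduce functions and of shuffling, consider the following toy WordCount sketch. It is ours, not from the paper, and is written in Python for brevity (Hadoop programs express the same pair of functions in Java); the single-process driver merely stands in for what the framework parallelizes:

```python
from collections import defaultdict

def map_fn(record: str):
    """Sequential user-defined map function: emit (word, 1) per word."""
    for word in record.split():
        yield word, 1

def reduce_fn(word: str, counts):
    """Sequential user-defined reduce function: applied to the merged
    list of values shuffled to this key."""
    yield word, sum(counts)

# Toy driver standing in for the framework:
records = ["the quick brown fox", "the lazy dog"]
partitions = defaultdict(list)
for rec in records:                      # "Map phase"
    for key, value in map_fn(rec):
        partitions[key].append(value)    # shuffle: route values by key
result = dict(pair for key, values in partitions.items()
              for pair in reduce_fn(key, values))
print(result)                            # {'the': 2, 'quick': 1, ...}
```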
B. HDFS Blocks and Map Splits

The input data to a MapReduce job is divided by HDFS into fixed-size pieces denoted as chunks or blocks. The user-defined function in a map task operates on one or many HDFS blocks encapsulated in what is termed a split (see Fig. 2). For data locality reasons, a common practice in Hadoop is to have each split encompass only one block. Specifically, when a split includes more than one block, the probability of these blocks existing on the same node where a map task will run becomes low, thereby leading to a network transfer of at least one block per every map task. In contrast, with a one-to-one mapping between splits and blocks, a map task can be scheduled at a node where a block exists, which results in improved data locality and reduced network traffic.

The number of HDFS blocks in a MapReduce job can be computed by dividing the input dataset size by the specified HDFS block size. The HDFS block size is a parameter that can be statically set by a user before running a job (by default, the HDFS block size is 64MB). As each split usually contains a single block, the number of splits in a job becomes equal to the number of blocks. Moreover, as each map task is responsible for only one split, the number of splits becomes equal to the number of map tasks. This makes the number of map tasks equal to the number of HDFS blocks. Therefore, the number of map tasks can be calculated by dividing the input dataset size by the specified HDFS block size.
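In code, the chain "dataset size, blocks, splits, map tasks" collapses to a one-liner; a minimal sketch (our helper name, sizes in MB):

```python
import math

def num_map_tasks(dataset_mb: float, block_size_mb: float = 64.0) -> int:
    """# HDFS blocks == # splits == # map tasks (one block per split)."""
    return math.ceil(dataset_mb / block_size_mb)

# Example: a 28GB input under the default 64MB block size
print(num_map_tasks(28 * 1024))   # -> 448 map tasks
```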
III. MAP CONCURRENCY IN MAPREDUCE

Each map task in a MapReduce job is scheduled by the JobTracker (JT) at a TaskTracker (TT) on what is denoted as a map slot. A TT in a Hadoop cluster is configured with a set of map slots to indicate the maximum number of map tasks that can run at a time on that TT. The number of map slots per TT can be statically set by a user before running a job. Because each map task runs as a separate process on a TT, a higher number of map slots translates into higher concurrency, provided that enough map tasks are available to occupy (i.e., execute over) the slots. Higher concurrency can result in improved performance. A caveat, however, is that a large number of occupied map slots can potentially result in resource contention and, accordingly, degrade overall performance. On the other hand, a very small number of occupied map slots, whereby resource contention is totally avoided, can result in an underutilized system and degraded performance. Thus, to obtain optimum performance, the numbers of map tasks and slots per TT must be judiciously selected so that concurrency is highly exploited and resources are maximally utilized, but not contended [27]. In this work, we avoid contending TTs and focus on inferring the best achievable concurrency in terms of the number of map tasks for a given number of map slots. Predicting the optimal number of map slots together with the number of map tasks is beyond the scope of this paper and is left for future exploration.

We refer to the maximum number of map tasks that can run concurrently at a given time within a Hadoop cluster as a map wave. Clearly, the number of map waves in a MapReduce job can be computed by dividing the number of map tasks by the aggregate number of map slots in the Hadoop cluster. The process of selecting a number of map waves for a MapReduce job dictates its achievable map concurrency. As a first step towards inferring the best achievable map concurrencies for MapReduce applications, we suggest characterizing map concurrency in MapReduce. Specifically, within the confines of MapReduce, we define map concurrency characterization as the process of observing, identifying and explaining various MapReduce runtime responses to different values of map wave numbers. We distinguish between two main cases: (1) when the number of map tasks is less than or equal to the total number of map slots (i.e., the number of map waves is less than or equal to 1), and (2) when the number of map tasks is greater than the total number of map slots (i.e., the number of map waves is greater than 1). We refer to the former case as CASE-I and to the latter case as CASE-II. We next characterize map concurrency in MapReduce under CASE-I and CASE-II.

[Figure 3: two panels over map slots MS1-MS6, (a) Half Map Wave ending at time t + MST and (b) One Map Wave ending at time t/2 + MST, with a setup square (MST) before every map task.] Figure 3. The execution of a MapReduce job with map wave numbers less than 1. MSi and MTi stand for Map Slot i and Map Task i, respectively. The small square before each task denotes the overhead required to setup the task. Before the number of map waves is doubled in (b), we assume that each map task takes time, t, in (a).

A. Characterizing Map Concurrency: CASE-I

Let us first consider CASE-I. Fig. 3 demonstrates the execution of a MapReduce job with various map task numbers. CASE-I characterization is simple and does not impact the Reduce phase; hence, we only show the Map phase in the figure. The small square before each map task indicates the setup overhead required for initializing the task and launching a host Java Virtual Machine (JVM).^2 We refer to the required time for setting up a map task as Map Setup Time (MST). We further assume that map tasks start and finish at nearly close times, with each initially taking time, t, to commit. As long as CASE-I holds, when we double the number of map tasks, MST remains constant while map time t is assumingly cut in half. Map time t is supposed to be cut in half because upon doubling the number of map tasks, the input HDFS block of each task is also cut in half. Therefore, as we maintain CASE-I and scale up the number of map tasks, we expect the Map phase to finish earlier. As shown in Fig. 3, if the Map phase in Fig. 3 (a) ends at time (t + MST), after doubling the number of map tasks, we expect the Map phase in Fig. 3 (b) to end at time (t/2 + MST). An earlier commit of the Map phase translates to an overall improvement in the job execution time.

^2 Hadoop runs each task in its own JVM to isolate it from the rest of the running tasks. Hadoop allows task JVM reuse, in which more than one task can use the same JVM, yet sequentially (i.e., tasks do not run concurrently in a single JVM). Clearly, this can be useful for tasks that share state. Enabling task JVM reuse will reduce the JVM setup overhead for the tasks that reuse JVMs, but not the initialization overhead.

In summary, a main observation pertaining to CASE-I is that, as map slots are occupied with map tasks, the Map phase runtime is expected to decrease. This is mainly due to: (1) exploiting more map concurrency, and (2) attaining better system utilization.
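The CASE-I arithmetic can be written out compactly. A sketch in our own notation (not from the paper), with D the input dataset size, r the per-task processing rate, and n the number of map tasks, n at most the total number of map slots:

```latex
t(n) = \frac{D}{n\,r}, \qquad
T_{\mathrm{Map}}(n) = t(n) + \mathit{MST}
\quad\Longrightarrow\quad
T_{\mathrm{Map}}(2n) = \frac{t(n)}{2} + \mathit{MST}
```

which reproduces the move from (t + MST) in Fig. 3 (a) to (t/2 + MST) in Fig. 3 (b).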
Once the number of map tasks becomes greater than the total number of map slots, we hit CASE-II, which we characterize next.

B. Characterizing Map Concurrency: CASE-II

In CASE-II, the number of map waves in a MapReduce job does not only have an impact on the overall job initialization cost but further on what is known in Hadoop MapReduce as early shuffle [15], [16]. As a mechanism to improve performance, Hadoop applies early shuffle by scheduling reduce tasks before every corresponding partition is available so as to overlap the Map and Reduce phases and, consequently, decrease the turnaround times of jobs. More precisely, Hadoop activates the Shuffle and Merge stage in the Reduce phase after only 5% of map tasks are done, thus allowing the interleave between the Map phase and the Shuffle and Merge stage. The Reduce stage in the Reduce phase is not allowed to overlap with the Map phase. This is because the user-defined reduce function should typically be applied to the whole/merged reduce input partition, and not fragments of it, so that application correctness is maintained.

[Figure 4: three panels over map slots MS1-MS4 and reduce slots RS1-RS2: (a) One Map Wave (total setup cost MST, map time t), (b) Two Map Waves (total setup cost 2MST, map time t/2 + t/2 = t), (c) Four Map Waves (total setup cost 4MST, map time t/4 + t/4 + t/4 + t/4 = t); in each panel the Shuffle & Merge stage on RS1 and RS2 increasingly overlaps the Map phase, followed by the Reduce stage.]

Figure 4. The execution of a MapReduce job with various map wave numbers. MSi, MTi, RSi, and RTi stand for Map Slot i, Map Task i, Reduce Slot i, and Reduce Task i, respectively. The small square before each task denotes the overhead required to setup the task.

Fig. 4 demonstrates the execution of a MapReduce job with various map wave numbers. MSi, MTi, RSi, and RTi stand for Map Slot i, Map Task i, Reduce Slot i, and Reduce Task i, respectively. We start with one map wave (i.e., CASE-I; see Fig. 4 (a)) and characterize map concurrency as we scale up the number of map waves, thus shifting directly to CASE-II (see Figures 4 (a) and 4 (b)).^3 As in CASE-I, the small square before each map/reduce task indicates the setup overhead required for initializing the task and launching a host Java Virtual Machine (JVM). Again, we refer to the time needed for setting up a map task as Map Setup Time (MST) and to the initial time taken by each map task as t.

^3 In CASE-II, the number of map waves could be decimal (e.g., 1.5, 3.2 or 2.7, just as examples). Our presented characterization is general and applies to any number of map waves that exceeds 1.

With only one map wave and the assumption that all map tasks start and finish at comparable times, early shuffle cannot be triggered before the entire Map phase is committed (see Fig. 4 (a)). As a result, all data will be shuffled after the Map phase is done. This will increase the time the Reduce stage has to wait before it can execute the user-defined reduce function. In contrast, with two map waves, data shuffling can ensue while the Map phase is still running (see Fig. 4 (b)). This will decrease the duration the Reduce stage has to wait before it can proceed. In principle, the greater the number of map waves, the earlier the early shuffle process can be activated, and the more overlap between the Map and the Reduce phases can be leveraged (see Fig. 4 (c)).

We note that what gets improved in the Reduce phase as early shuffle is activated earlier is, in fact, its response time and not its execution time. This is mainly because: (1) the amount of data to shuffle and merge remains the same, and (2) the actual required time for the Reduce stage is not affected. Nonetheless, the improvement in Reduce response time translates to an overall improvement in job performance.

A critical point to notice is that the gain from the overlap between the Map and the Reduce phases diminishes geometrically as the number of map waves is monotonically scaled up (see the blue rectangles around the Shuffle and Merge stages across Figures 4 (a), 4 (b) and 4 (c)). Specifically, with only 1 map wave, there will be no opportunity to hide shuffle latency since 100% of map tasks will be done before data can be shuffled. With two map waves, however, there will be an opportunity to hide shuffle latency under 50% (i.e., 0% + (100% ÷ 2)) of map tasks. With three map waves, shuffle latency can be hidden under 75% (i.e., 50% + (50% ÷ 2)) of map tasks. Furthermore, with four map waves, shuffle latency can be hidden under 87% (i.e., 50% + 25% + (25% ÷ 2)) of map tasks. In general, upon every single scale-up in the number of map waves, shuffle latency can be hidden under the previous fraction of map tasks plus half of the remaining fraction. Clearly, this is a geometric sequence with a common ratio of 1/2. Alongside, as we scale up the number of map waves, the gain from early shuffle will be offset by a loss from growing MSTs. In particular, as the number of map waves is doubled, MST is also doubled. This shall add to the runtimes of jobs, especially when map tasks have lengthy initializations and jobs have large numbers of short-lived tasks. Conversely, the map time, t, will remain constant as the number of map waves is increased (see Fig. 4). In conclusion, the number of map waves for an application must be carefully chosen so that a decent gain from early shuffle is obtained at only a minimal loss from an increased total MST. We refer to the number of map waves that achieves such a goal as the best number of map waves for the application.
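The quoted fractions follow a closed form. In our own notation (not in the paper), with w whole map waves, the fraction of map tasks under which shuffle latency can be hidden is:

```latex
f(w) \;=\; \sum_{i=1}^{w-1}\left(\tfrac{1}{2}\right)^{i}
      \;=\; 1-\left(\tfrac{1}{2}\right)^{w-1},
\qquad f(1)=0,\quad f(2)=0.5,\quad f(3)=0.75,\quad f(4)=0.875
```

matching the 50%, 75% and roughly 87% figures above, while the total setup cost grows linearly as w × MST, which is exactly the trade-off the remainder of the paper models.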
Finally, we note that the preference of when exactly the early shuffle process must be activated varies across applications. Specifically, applications induce different amounts of shuffle data. Hence, the more data an application shuffles, the earlier the early shuffle process must be triggered. This allows hiding more shuffle latency under the Map phase and diminishing further the Reduce response time. Since the shuffle process can be activated earlier by increasing the number of map waves, it can be argued that with a larger amount of shuffle data, a larger number of map waves will be favored. We next suggest a mathematical model that accounts for shuffle data and facilitates locating the best number of map waves for any MapReduce application.

IV. A MATHEMATICAL MODEL

We first present a general mathematical model that can predict the runtimes of MapReduce jobs. In the next section we utilize this model to estimate the best number of map waves for any MapReduce application. In developing our mathematical model, we assume that: (1) map tasks start and finish at nearly close times, and (2) map time is longer than map setup time (which is typical for MapReduce applications).

[Figure 5: a job timeline over MS1-MS4 and RS1-RS2, with the Total Map Setup Time (MST), Hidden Shuffle Time (HST), Exposed Shuffle Time (EST), Reduce Time and overall Runtime components highlighted.] Figure 5. Our approach in modeling the runtime of a job in MapReduce. For each highlighted component we define a time metric.

Fig. 5 demonstrates our approach in modeling map concurrency in MapReduce. First, as previously, we define Map Setup Time (MST) as the time required for initializing a map task. We measure the incurred MST cost as the number of map waves is increased. Specifically, with a single map wave, MST will be counted only once since all MSTs of all map tasks in a wave will be performed in parallel. On the other hand, with two map waves, MST will be counted twice. In general, as we increase the number of map waves, MST will scale linearly. We state the total MST with any number of map waves in equation (1). The Number of Map Waves factor in equation (1) should always be an integer because irrespective of how many map tasks a wave includes, only a single MST will be counted for that wave.^4 Hence, we use the ceiling function to transform the real number of map waves to the next smallest integer.

Total MST = ⌈Number of Map Waves⌉ × MST    (1)

^4 The number of map waves evaluates to a decimal when the remainder of dividing the number of map tasks by the number of map slots is not zero. This means that the last map wave in the respective job will include a number of map tasks that is less than the number of map slots. Regardless of how many map tasks the last wave includes, only 1 MST will be counted for it.

Second, we define Hidden Shuffle Time (HST) as the shuffle time that is hidden under the Map phase due to utilizing early shuffle. As implied by CASE-I and CASE-II, with a number of map waves that is less than or equal to 1, early shuffle cannot be exploited (see Section III). With two map waves, however, early shuffle can be exploited, yet under only one map wave, or 50% of map tasks. More precisely, with two map waves, the data shuffling that is carried out under the second map wave is for the first map wave, which is already done (see Fig. 4 (b)). Therefore, with two map waves, shuffle latency is hidden for the first map wave, and not the second. Likewise, with three map waves, shuffle latency is hidden for the first and second map waves, and not the third. In general, with N map waves, shuffle latency is hidden for the first N − 1 map waves, and not the last. As an outcome, we define HST as follows:

HST = ⌈Number of Map Waves − 1⌉ × (Shuffle Data ÷ Number of Map Waves) × Shuffle Rate    (2)
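Before unpacking the terms, equations (1) and (2) in executable form, as a minimal sketch (our function names; shuffle_rate is expressed as time per unit of data so that the products are times):

```python
import math

def total_mst(waves: float, mst: float) -> float:
    """Equation (1): one MST is charged per (possibly partial) map wave."""
    return math.ceil(waves) * mst

def hst(waves: float, shuffle_data: float, shuffle_rate: float) -> float:
    """Equation (2): shuffle time hidden under the Map phase.

    Latency is hidden for the first N-1 waves only; for waves <= 1
    the ceiling term is 0 and nothing is hidden.
    """
    return math.ceil(waves - 1) * (shuffle_data / waves) * shuffle_rate

# e.g., 2.5 waves with MST = 10s charges 3 setups -> 30s of total MST
assert total_mst(2.5, 10) == 30
# 2 waves, 1000MB of shuffle data at 0.01 s/MB: the first wave's 500MB is hidden
assert hst(2, 1000, 0.01) == 5.0
```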
As specified in equation (2), Shuffle Data is the total intermediate output of an application. We divide Shuffle Data by the number of map waves to obtain the amount of shuffle data per map wave. Furthermore, we apply the ceiling function to (Number of Map Waves − 1) because latency is always hidden for all the map waves except the last, which might not be fully loaded. In particular, given our assumption that map tasks start and finish at comparable times, all the map waves in a job will be full of map tasks except the last, which might include a number of map tasks smaller than the number of map slots. Thus, we subtract 1 from the Number of Map Waves factor to exclude the last wave, and then use the ceiling function in case the remaining number of map waves is decimal (i.e., one wave is not full of map tasks). Shuffle Rate captures the speed at which data is shuffled over the cluster's network; note that for equations (2) and (3) to evaluate to times, Shuffle Rate must be expressed as the time needed to shuffle one unit of data, i.e., the inverse of the network transfer rate that is usually quoted in bits per second.

Third, we define Exposed Shuffle Time (EST) as the shuffle time that cannot be hidden under the Map phase. Clearly, this is the time needed to shuffle the intermediate data of the last map wave. EST is stated in equation (3). To account for the fact that the last map wave might not be fully loaded, we incorporate in equation (3) the factor α. As the amount of data to shuffle per wave will be less if the wave is not fully loaded, α can determine exactly how loaded the last map wave is. Specifically, if the last map wave is fully loaded, α can be set to 1. Otherwise, α can be set to (Number of Map Waves − ⌊Number of Map Waves⌋), which captures the load at the last map wave. Lastly, the Shuffle Rate in equation (3) is defined as in equation (2).

EST = (Shuffle Data ÷ Number of Map Waves) × α × Shuffle Rate    (3)

Finally, as demonstrated in Fig. 5, the runtime of a job can be defined in terms of MST, HST, EST and the Reduce stage time. In particular, the runtime of a job can be defined as follows:

Runtime = Single Map Wave Time + Total MST + HST + EST + Reduce Time    (4)

As its name suggests, Single Map Wave Time in equation (4) is the time taken by a single map wave in a job. Besides, Reduce Time is the time needed for the Reduce stage to apply the user-defined reduce function on a merged input partition. Clearly, by using equation (4), the runtime of any job can be estimated if the Number of Map Waves, Shuffle Data, Shuffle Rate, MST, Single Map Wave Time and Reduce Time factors are all provided. We next discuss how such a model can be effectively used to locate the best number of map waves for any given MapReduce application.
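For reference, equations (3) and (4) in the same executable form as the earlier sketch (our names, with the α rule spelled out):

```python
import math

def est(waves: float, shuffle_data: float, shuffle_rate: float) -> float:
    """Equation (3): exposed shuffle time of the (possibly partial) last wave."""
    alpha = 1.0 if waves == math.floor(waves) else waves - math.floor(waves)
    return (shuffle_data / waves) * alpha * shuffle_rate

def runtime(waves: float, single_wave_time: float, mst: float,
            shuffle_data: float, shuffle_rate: float, reduce_time: float) -> float:
    """Equation (4): job runtime as the sum of the modeled components."""
    total_mst = math.ceil(waves) * mst                                    # eq. (1)
    hst = math.ceil(waves - 1) * (shuffle_data / waves) * shuffle_rate    # eq. (2)
    return (single_wave_time + total_mst + hst
            + est(waves, shuffle_data, shuffle_rate) + reduce_time)
```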
V. MC2: MAP CONCURRENCY CHARACTERIZATION

Our developed mathematical model can be utilized to predict the best number of map waves for MapReduce applications. As stated earlier, by inspecting the Total MST, HST, EST and Runtime equations, we realize that six factors are required before such equations can be evaluated. The key factor among these six factors is the number of map waves, which is what we are actually seeking. We note that as our objective is not to estimate the runtimes of applications, but rather to predict the best number of map waves for a given application, we can: (1) fix all the model's factors except the Number of Map Waves one, and (2) measure the Runtime equation for a range of map wave numbers (e.g., 1.0 to 15.0) and select the minimum. In principle, as we attempt to optimize performance, the minimum runtime will always provide the best number of map waves. We propose Map Concurrency Characterization (MC2), a predictor that effectively realizes such an objective.

A notice, however, is that as we vary the number of map waves to measure the Runtime equation, the Single Map Wave Time factor in the equation also indirectly varies, but inversely. In particular, as the number of map waves increases, the single map wave time decreases and vice versa. As such, to use the Runtime equation the way we suggest in MC2, the Single Map Wave Time factor should be divided by the Number of Map Waves factor. In addition, the Single Map Wave Time changes not only with different numbers of map waves, but also with different cluster map slot configurations. Specifically, the map wave gets larger with a larger number of total map slots and smaller otherwise. Assuming enough available cluster resources, a larger map wave is supposed to take less time than a smaller one. Therefore, in order to apply MC2 to different cluster map slot configurations, the Single Map Wave Time should be further multiplied by a factor, β, that is capable of capturing such a fact. Simply, the factor β can be defined as the total number of map slots that we begin with before we start varying the number of map waves, divided by the variable total number of map slots as the number of map waves is varied.^5 Clearly, as the total number of map slots that we begin with is fixed, when the variable total number of map slots increases, β decreases. When β decreases, the Single Map Wave Time also decreases and vice versa. Consequently, the β factor can satisfy the goal of correctly changing the single map wave time as the number of map slots is altered. In short, with MC2 the Single Map Wave Time becomes equal to (Single Map Wave Time × β ÷ Number of Map Waves).

^5 Different numbers of map waves could entail different numbers of map slots.

Fig. 6 depicts a high-level view of MC2 with the required input parameters and the proposed output values. As illustrated in the figure, MC2 requires Shuffle Data, Shuffle Rate, MST, Single Map Wave Time, Reduce Time and Initial Map Slots Number as input parameters. The Initial Map Slots Number is the total number of map slots that we begin with before we start varying the number of map waves to locate the best one. The rest of the parameters are defined as in equations (1), (2), (3), and (4). The output curves of MC2 reflect the values of the Runtime, Total MST, EST and HST equations for a range of map wave numbers. By scaling up the number of map waves, HST (or the gain from early shuffle) keeps increasing as long as it does not get dominated by MST (or the loss from map task setup overhead). HST and EST are inversely proportional; thus, when HST increases, EST decreases. The local minimum of the Runtime curve is the spot at which HST is maximally leveraged and MST is minimally incurred. We denote this spot as the sweet spot, or the best number of map waves for the given application.

[Figure 6: MC2 takes the six inputs, computes HST, EST, Total MST and Runtime across map wave numbers 1 to 15, and the minimum of the Runtime curve marks the sweet spot.] Figure 6. MC2 Predictor: Input = Shuffle Data, Shuffle Rate, Single Map Wave Time, Map Setup Time (MST), Reduce Time, and Initial Map Slots Number. Output = Expected responses of Hidden Shuffle Time (HST), Exposed Shuffle Time (EST), Total MST and Runtime as map concurrency is varied. The sweet spot is the best number of map waves (or implicitly, the best HDFS block size) for the given application.

The input application parameters Shuffle Data, MST, Single Map Wave Time, Reduce Time and Initial Map Slots Number can be figured out by applying static profiling. First, Shuffle Data is independent of map concurrency. In particular, the same amount of data will be shuffled in an application irrespective of the configured number of map waves. Accordingly, a single run of the application under the default Hadoop configuration would suffice to acquire a representative value for Shuffle Data. Shuffle Data is the intermediate output size and can be collected from a Hadoop built-in counter. Second, as we aim to predict the best number of map waves and not the application's runtime, Single Map Wave Time and Reduce Time need not be fully accurate and can be approximated. Specifically, from a single run of the application, using the default Hadoop configuration, we can compute the number of map waves and divide the application's map time by this number of waves to obtain Single Map Wave Time. The Reduce Time can be approximated by selecting any reduce task in the Hadoop web user interface (UI) (or logs) and subtracting the task's shuffle end time from its finish time. Third, Initial Map Slots Number is simply the total number of map slots of the Hadoop cluster used in running the given application (i.e., acquiring the profile). Fourth, MST can be approximated by a small number of seconds (e.g., 5 or 10 seconds) [17]. Lastly, Shuffle Rate is derived from the data transfer rate of the underlying cluster's network.

To this end, we note that MC2 scans a range of map wave numbers without involving Hadoop. In particular, all of MC2's computations occur without executing Hadoop; hence the designation standalone utility program.
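Putting the pieces together, a minimal sketch of the MC2 sweep (our code and parameter names; units assumed to be seconds, MB, and seconds-per-MB for the shuffle rate), including the back-calculation of the HDFS block size from the chosen sweet spot (Section II-B):

```python
import math

def predict_sweet_spot(shuffle_data, shuffle_rate, single_wave_time,
                       mst, reduce_time, initial_map_slots,
                       map_slots=None, wave_range=None):
    """Scan candidate map-wave numbers; return (best waves, modeled runtime)."""
    map_slots = map_slots or initial_map_slots
    wave_range = wave_range or [i / 4 for i in range(1, 61)]   # 0.25 .. 15.0
    beta = initial_map_slots / map_slots      # adjusts wave time to the slot config
    best = None
    for w in wave_range:
        alpha = 1.0 if w == math.floor(w) else w - math.floor(w)
        r = (single_wave_time * beta / w                               # adjusted wave time
             + math.ceil(w) * mst                                     # eq. (1)
             + math.ceil(w - 1) * (shuffle_data / w) * shuffle_rate   # eq. (2)
             + (shuffle_data / w) * alpha * shuffle_rate              # eq. (3)
             + reduce_time)                                           # eq. (4)
        if best is None or r < best[1]:
            best = (w, r)
    return best

def block_size_for(waves, total_map_slots, dataset_mb):
    """HDFS block size (MB) that realizes a chosen number of waves (Section II-B)."""
    return dataset_mb / (waves * total_map_slots)
```

The sweep mirrors Fig. 6: the returned minimum of the modeled Runtime curve is the sweet spot.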
In MC2, the range of map wave numbers is configurable and can be statically set by users. Finally, once the best number of map waves for a specific application is located, the respective HDFS block size can be easily calculated and, subsequently, configured in Hadoop (see Section II-B). The HDFS block size is a job, not a cluster, configuration parameter and can be set differently for different MapReduce applications. For instance, two applications can set HDFS block sizes to 512MB and 64MB, respectively, and run on the same Hadoop cluster.

VI. QUANTITATIVE EVALUATION

A. Methodology

We evaluate MC2 on our cloud computing infrastructure and on Amazon EC2 [1]. Our infrastructure is comprised of a dedicated 14 physical host IBM BladeCenter H with identical hardware, software and network capabilities. The BladeCenter is configured with the VMware vSphere 4.1 virtualization environment and VMware ESXi 4.1 hypervisor [34]. The vSphere system is configured with a single virtual machine (VM) per BladeCenter blade. Each VM is configured with 8 v-CPUs and 8GB of RAM. The disk storage per VM is provided via two locally connected 300GB SAS disks. The major system software on each VM is 64-bit Fedora 13 [10], Apache Hadoop 0.20.2 [13] and Sun/Oracle's JDK 1.6 [9], Update 20. Table I summarizes the configuration of our private cloud.

Table I. OUR PRIVATE TESTBED.

Category                 | Configuration
------------------------ | ---------------------------------------------
Hardware                 |
Chassis                  | IBM BladeCenter H
Number of Blades         | 14
Processors/Blade         | 2 x 2.5GHz Intel Xeon Quad Core (E5420)
RAM/Blade                | 8 GB RAM
Storage/Blade            | 2 x 300 GB SAS, defined as 600 GB RAID 0
Number of Switches       | 3 (organized in a tree-style way)
Software                 |
Virtualization Platform  | vSphere 4.1/ESXi 4.1
VM Parameters            | 8 vCPU, 8 GB RAM, 1 GB NIC, 60 GB Disk (mounted at /), 450 GB Disk (mounted at /hadoop)
OS                       | 64-Bit Fedora 13
JVM                      | Sun/Oracle JDK 1.6, Update 20
Hadoop                   | Apache Hadoop 0.20.2

[Figure 7: (a) MC2 predicted output, i.e., Total MST, HST, EST and Runtime curves (seconds) versus # of map waves for the 14, 28 and 56 slot configurations with each sweet spot marked; (b) MapReduce observed results, i.e., execution times versus [HDFS block size, # of waves] with the observed minimums marked.] Figure 7. WordCount-CE results on our private cloud.

Contrary to our private cluster, Amazon EC2 is a shared heterogeneous cloud. We provisioned on Amazon EC2 a Hadoop cluster with 20 standard large (i.e., m1.large) instances, each with 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 7.5GB memory, 850GB instance storage, high I/O performance, and a 64-bit Ubuntu AMI image. We adopted an Amazon EC2 cluster with a size different than that of our private one so as to demonstrate MC2's versatility.

To account for various shuffle data sizes, we use WordCount with the combiner function being enabled (WordCount-CE) and disabled (WordCount-CD). In addition, we use the K-Means clustering workload [29], Sort and an in-house developed image processing benchmark, Sobel. Sort and WordCount are two main benchmarks utilized for evaluating Hadoop at Yahoo! [8], [38]. K-Means is a well-known clustering algorithm for knowledge discovery and data mining [20]. Sobel is a state-of-the-art edge detection algorithm used widely in various scientific domains such as bio-medicine [25] (Sobel is part of NIH's ImageJ package [23]) and astronomy [30], among others.

On our private cloud, we ran Sort over a 28GB dataset generated using the RandomWriter in Hadoop. WordCount-CE and WordCount-CD were run over 28GB and 14GB datasets, respectively, generated using the RandomTextWriter in Hadoop. For K-Means we generated a random 5.5GB dataset of 2D data points. Besides, we selected 4 random centroids and fixed them for all runs. Per each run we set 5 iterative jobs, similar to [20]. Finally, for Sobel we obtained an image dataset from celebi [6], the public repository of medical imaging.
Our image dataset consists of 30080 dynamic high-resolution 4D PNG images (the original images were in the Dicom format) from a cardiac study acquired on a 64 detector CT scanner. After bundling all the images in a sequence file (as required by Hadoop), we obtained a total image dataset size of 4.3GB.

On Amazon EC2, we ran Sort, WordCount-CE and WordCount-CD over datasets of size 20GB, generated using the RandomWriter and the RandomTextWriter in Hadoop. For K-Means and Sobel we utilized datasets similar to the ones used on our private cloud, but with sizes of 11.1GB and 8.7GB (or 60168 images), respectively. We varied dataset sizes in order to better test and verify the promise of MC2, especially with the shuffle data being one of MC2's inputs (the size of the dataset influences the amount of data to shuffle). To this end, we accounted for variances across runs by running each benchmark 3 times on our private cloud and 5 times on Amazon EC2, and reported the average.

B. MC2 on Our Private Cloud

We first evaluate MC2 by: (1) running each benchmark on our private cloud with different numbers of map waves, (2) collecting results and locating the best number of map waves of each benchmark, (3) using the MC2 standalone utility program to predict each benchmark's best number of map waves (i.e., the sweet spot), and (4) matching the best empirically located number of map waves, or what we refer to as the observed minimum, against the sweet spot of each benchmark. To execute MC2 we use a single static profile for each workload collected under the default Hadoop configuration. The required MC2 input parameter values are obtained in a way similar to what is described in Section V. MST is approximated to 10 seconds for all benchmarks.^6 As the number of map waves is defined in terms of the numbers of map tasks and slots, we ran each benchmark on our private cloud with 1, 2, and 4 map slots per TaskTracker (or 14, 28, and 56 total map slots). Furthermore, we used HDFS blocks of sizes 2048MB or 1024MB, 512MB, 256MB, 128MB and 64MB.^7 As discussed in Section II-B, by varying the sizes of HDFS blocks, we vary the number of map tasks. Figures 7, 8, 9, 10 and 11 demonstrate the results for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel, respectively.

^6 A more accurate approach is to approximate a different MST for different benchmarks using static profiling. For simplicity, in this work we assume a single MST across all our workloads, especially as they all exhibit long-lived tasks. For jobs that finish in less than 1 minute, we recommend using different approximated MSTs for different jobs.

^7 We use ranges of map slots and HDFS block sizes that are typically utilized in Hadoop settings at small and large scale levels.

Let us start with a small recap of MC2. MC2 does not predict the actual runtimes of MapReduce applications (i.e., how many seconds a certain application will take to finish), but rather the optimal map concurrencies for applications. When the best map concurrency for an application is located and, subsequently, configured on Hadoop, it translates to an overall runtime reduction. In addition, as described in Section V, MC2 uses approximated values for its input parameters to compute the Total MST, HST, EST and Runtime equations.

[Figure 8: (a) MC2 predicted output and (b) MapReduce observed results, in the same format as Figure 7.] Figure 8. K-Means results on our private cloud.

[Figure 10: (a) MC2 predicted output and (b) MapReduce observed results, in the same format as Figure 7.] Figure 10. WordCount-CD results on our private cloud.

[Figure 9: (a) MC2 predicted output and (b) MapReduce observed results, in the same format as Figure 7.] Figure 9. Sort results on our private cloud.

[Figure 11: (a) MC2 predicted output and (b) MapReduce observed results, in the same format as Figure 7.] Figure 11. Sobel results on our private cloud.

Accordingly, as shown in Figures 7, 8, 9, 10 and 11, the y-axes of the MC2 predicted outputs do not closely match the y-axes of the MapReduce observed results.

As demonstrated in Fig. 7, MC2 correctly predicts the best numbers of map waves for WordCount-CE under all various configurations. In particular, MC2 predicts 2, 1, and 1 map waves for the 14, 28 and 56 map slot configurations, respectively. Some map wave numbers in the figure are decimal because the remainder of dividing the number of HDFS blocks (or map tasks) of WordCount-CE's dataset (i.e., 28GB) by the cluster's total number of map slots is not always zero. The exhibited observed minimums in Fig. 7 (b) match exactly the MC2 predicted sweet spots. Moreover, the configuration corresponding to the best^8 sweet spot (i.e., the 56 map slot and 512MB block configuration) greatly outperforms the default Hadoop configuration (i.e., the 28 map slot and 64MB block configuration). Specifically, the best sweet spot provides a speedup of 2.1X versus default Hadoop. Clearly, this shows that the default Hadoop configuration is not necessarily the optimal one and substantiates the effectiveness of MC2 in locating the best number of map waves for WordCount-CE.

^8 The best sweet spot of an application is the minimum sweet spot among all the located sweet spots of the application.

Likewise, Figures 8, 9, 10 and 11 show that MC2 successfully predicts the best numbers of map waves for K-Means, Sort, WordCount-CD and Sobel under all various configurations, except under the 56 map slot configuration for Sort, WordCount-CD and Sobel. Although these miss-predictions cause performance degradations of 2.4%, 10.8% and 0.7% for Sort, WordCount-CD and Sobel, respectively, versus the configurations under the observed minimums, they do not incur degradations versus default Hadoop. In fact, the Hadoop configurations under the miss-predicted sweet spots outperform default Hadoop by 7.9%, 10.2%, and 43.8% for Sort, WordCount-CD and Sobel, respectively. As depicted in Figures 7, 8, 9, 10 and 11, there is indeed no single degradation under any of the best sweet spot configurations against default Hadoop. In summary, MC2 provides speedups of 2.1X, 1.34X, 1.07X, 1.1X, and 1.43X versus default Hadoop for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel, respectively. The speedups vary according to how far the default Hadoop configuration is from the best sweet spot.

MC2 might sometimes result in miss-predictions for two main reasons. First, the mathematical model that MC2 utilizes assumes that map tasks start and finish at comparable times. This is not always true, especially when some tasks render slow. Hadoop addresses slow tasks by launching corresponding speculative tasks so as to avert job delays. Incorporating the effects of speculative execution within our mathematical model is a promising and worth-exploring future direction.
Second, our mathematical model assumes that resource contention is avoided. For the infrastructure of our private cloud and the provisioned virtual machines, setting 4 or more map slots might stress resources; hence the observed miss-predictions with Sort, WordCount-CD and Sobel under the 56 map slot configuration (the 56 map slot configuration uses 4 map slots per TaskTracker). In contrast, WordCount-CE and K-Means do not incur miss-predictions under such a configuration due to being less CPU, memory and I/O bound than Sort, WordCount-CD and Sobel. Finally, we note that even if a miss-prediction occurs, it is typically the case that the missed correct sweet spot (gleaned from the observed minimum) is very close to the miss-predicted sweet spot on the Runtime curves. As such, users can always conjecture what potential degradation (if any) they might experience if a miss to the correct sweet spot happens. MC2 users are always expected to observe speedups versus default Hadoop, unless default Hadoop is the optimal configuration and a miss-prediction occurs. We next evaluate MC2 on Amazon EC2.

C. MC2 on Amazon EC2

To evaluate MC2 on a shared heterogeneous environment, we ran each of our benchmarks on a Hadoop cluster composed of 20 Amazon EC2 large instances.
Each instance was configured with 2 and 4 map slots (or totals of 40 and 80 map slots). Besides, HDFS block sizes of 1024MB, 512MB, 256MB, 128MB and 64MB were used. For this set of experiments, we approximated MST to 2 seconds across all our benchmarks, assuming less initialization time because of the chosen faster virtual machines on Amazon EC2. Due to page constraints, we summarize all our MC2 and Amazon EC2 results in Table II. As shown in the table, MC2 correctly predicts the best numbers of map waves for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel under all various configurations, except under the 80 map slot configuration for WordCount-CE, K-Means and Sobel. The miss-predicted sweet spots lead to 0.53%, 16.2% and 2.2% performance degradations as compared to the observed minimums for the three benchmarks, respectively. Nonetheless, the miss-predicted sweet spots result in 16.3%, -9.9%^9 and 4.4% performance improvements/degradations for WordCount-CE, K-Means and Sobel, respectively, versus default Hadoop. In summary, on Amazon EC2 we did not observe any degradation under any of the best sweet spot configurations (which users would select) against default Hadoop. In particular, MC2 provided speedups of 1.2X, 1.13X, 1.1X, 2.2X, and 1.04X over default Hadoop for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel, respectively. Again, the speedups vary according to how far the default Hadoop configuration is from the best sweet spot. To this end, Table III exhibits the runtime speedups provided by MC2 over default Hadoop for all our benchmarks on Amazon EC2 and on our private cloud.

^9 In this case (i.e., for K-Means) the runtime provided by default Hadoop was even better than the one provided by the observed minimum with 80 map slots, and not with 40 map slots (which is the default number of map slots). Indeed, the located sweet spot under the 40 map slot configuration offered Hadoop a performance improvement of 13.1% versus default Hadoop. This sweet spot is the best sweet spot generated by MC2 for K-Means, and the one which users will naturally select.

Table II. SUMMARIZED RESULTS ON AMAZON EC2.

Benchmark    | # of Map Slots | [HDFS Block Size, # of Map Waves] Range                         | Observed Minimum | Predicted Sweet Spot
------------ | -------------- | --------------------------------------------------------------- | ---------------- | --------------------
WordCount-CE | 40             | [1024, 0.5], [512, 1], [256, 2], [128, 4], [64, 8]              | [512, 1]         | [512, 1]
WordCount-CE | 80             | [1024, 0.25], [512, 0.5], [256, 1], [128, 2], [64, 4]           | [512, 0.5]       | [256, 1]
K-Means      | 40             | [1024, 0.27], [512, 0.52], [256, 1.05], [128, 2.1], [64, 4.17]  | [128, 2.1]       | [128, 2.1]
K-Means      | 80             | [1024, 2.08], [512, 1.05], [256, 0.52], [128, 0.26], [64, 0.27] | [64, 0.27]       | [256, 0.52]
Sort         | 40             | [1024, 0.5], [512, 1], [256, 2], [128, 4], [64, 8]              | [128, 4]         | [128, 4]
Sort         | 80             | [1024, 0.25], [512, 0.5], [256, 1], [128, 2], [64, 4]           | [64, 4]          | [64, 4]
WordCount-CD | 40             | [1024, 0.5], [512, 1], [256, 2], [128, 4], [64, 8]              | [128, 4]         | [128, 4]
WordCount-CD | 80             | [1024, 0.25], [512, 0.5], [256, 1], [128, 2], [64, 4]           | [64, 4]          | [64, 4]
Sobel        | 40             | [1024, 0.22], [512, 0.42], [256, 0.82], [128, 1.62], [64, 3.25] | [64, 3.25]       | [64, 3.25]
Sobel        | 80             | [1024, 0.11], [512, 0.21], [256, 0.41], [128, 0.81], [64, 1.62] | [128, 0.81]      | [64, 1.62]

Table III. RUNTIME SPEEDUPS PROVIDED BY MC2 VERSUS DEFAULT HADOOP.

Benchmark    | Private Cloud | Amazon EC2
------------ | ------------- | ----------
WordCount-CE | 2.1X          | 1.2X
K-Means      | 1.34X         | 1.13X
Sort         | 1.07X         | 1.1X
WordCount-CD | 1.1X          | 2.2X
Sobel        | 1.43X         | 1.04X

VII. RELATED WORK

There has recently been a large body of work focused on optimizing Hadoop configuration for improved performance. In this short article, it is not possible to do justice to every related work. Hence, we only outline proposals that are most relevant to MC2.

Babu makes a case for techniques to automate the process of configuring MapReduce parameters [4]. He discusses the applicability of different approaches (e.g., the query-optimizer-style approach) to meet this goal. Rizvandi et al. present a preliminary step towards modeling the relationship between the number of map/reduce tasks and application runtimes [33]. They suggest employing a multivariate linear regression model to predict MapReduce performance (or CPU utilization, as in [31]) for various applications. Yang et al. evaluate the correlation between application characteristics, configuration parameters and workload execution times [37]. Similar to [33] and [31], they suggest using a regression model. Ganapathi et al. propose using Kernel Canonical Correlation Analysis (KCCA) to predict runtimes of (only) Hive queries [11]. To aid users in making better use of cloud resources, Wieder et al. suggest utilizing dynamic linear programming to model each phase in MapReduce independently and, accordingly, determine optimal scheduling and resource configurations [35], [36]. As compared to [4], [11], [31], [33], [35]-[37], MC2 uses no regression, KCCA or dynamic programming models. In contrast, MC2 depends uniquely on MapReduce concurrency characteristics and accurately guides Map phase configuration within milliseconds.

To automatically find good configuration settings for arbitrary MapReduce jobs, Herodotou and Babu introduce the Cost-Based Optimizer (CBO) [17]. CBO assumes two components: (1) a profiler and (2) a what-if engine. The profiler uses dynamic instrumentation to collect runtime monitoring information from unmodified MapReduce programs. The what-if engine utilizes a mix of simulation and model-based estimation to answer questions about job executions. To optimize Hadoop configuration, CBO enumerates and searches through the parameter space of Hadoop, and makes appropriate calls to the what-if engine. Herodotou et al. present Elastisizer, a system to which users can express the problem of determining cluster resources and MapReduce configurations that best meet performance and cost needs [18]. Elastisizer makes use of the what-if engine proposed in [17].

Guided by the work on self-tuning database systems, Herodotou et al. propose Starfish, a self-tuning system for big data analytics [19]. Starfish exploits Elastisizer [18] to automate Hadoop provisioning decisions. In essence, the CBO, Elastisizer and Starfish schemes pose main challenges such as developing efficient strategies to search through the high-dimensional parameter space of Hadoop, and generating job profiles with minimal overhead. Moreover, they rely on a mix of mechanisms (e.g., simulation and model-based estimation) to find good configurations. In contrast, MC2 requires no simulation, dynamic instrumentation and/or sampling, and simply depends on some MapReduce characteristics to optimize map concurrency.

To effectively configure MapReduce parameters, Koehler et al. propose the usage of an adaptive framework, which depends on autonomic computing concepts and utility functions [28]. As per the Map phase, the framework focuses on optimizing the number of map slots at cluster nodes. Similarly, Kambatla et al. suggest a signature-based predictor to predict the optimum number of map and reduce slots at Hadoop cluster nodes [27]. As compared to [28] and [27], MC2 focuses on map concurrency in terms of the number of map tasks for a given number of map slots. Specifically, MC2 does not estimate the optimal number of map slots for a Hadoop cluster.

VIII. CONCLUDING REMARKS AND FUTURE DIRECTIONS

In this work we observed a strong dependency between map concurrency and MapReduce performance. We realized that a good configuration for map concurrency can be determined by simply depending on two main MapReduce characteristics, data shuffling and map task setup overhead. We built MC2, a simple standalone utility predictor that leverages these two characteristics and correctly predicts the best map concurrencies for MapReduce applications. We showed that MC2 works successfully on a private cloud and on Amazon EC2. MC2 makes timely contributions to cloud data analytics, and serves in reducing cost and improving MapReduce performance on the cloud.

After verifying the promise of MC2, we set forth three main future directions: (1) extending MC2 to predict the best possible number of map slots for Hadoop clusters so as to avoid resource contention and enhance system utilization, (2) incorporating the effects of speculative execution in our mathematical model in order to make it more robust to slow tasks, and (3) characterizing reduce concurrency so as to guide Reduce phase configuration for improved performance.

REFERENCES

[1] Amazon Elastic Compute Cloud, http://aws.amazon.com/ec2/.
[2] Amazon Elastic MapReduce, http://aws.amazon.com/elasticmapreduce/.
[3] Amazon EC2 Instance Types, http://aws.amazon.com/ec2/instance-types/.
[4] S. Babu, "Towards Automatic Optimization of MapReduce Programs," SOCC, 2010.
[5] BTrace: A Dynamic Instrumentation Tool for Java, http://kenai.com/projects/btrace.
[6] Celebi Software, http://www.celebisoftware.com/.
[7] S. Chen and S.W. Schlosser, "Map-Reduce Meets Wider Varieties of Applications," Intel Research Pittsburgh, Tech. Rep. IRP-TR-08-05, May 2008.
[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI, Dec. 2004.
[9] Sun/Oracle JDK 6 Documentation, http://download.oracle.com/javase/6/docs/.
[10] Fedora Project, http://fedoraproject.org/.
[11] A. Ganapathi et al., "Statistics-Driven Workload Modeling for the Cloud," ICDEW, 2010.
[12] Ganglia, http://ganglia.sourceforge.net/.
[13] Hadoop, http://hadoop.apache.org/.
[14] Hadoop Tutorial, http://developer.yahoo.com/hadoop/tutorial/.
[15] M. Hammoud and M.F. Sakr, "Locality-Aware Reduce Task Scheduling for MapReduce," CloudCom, 2011.
[16] M. Hammoud et al., "Center-of-Gravity Reduce Task Scheduling to Lower MapReduce Network Traffic," CLOUD, 2012.
[17] H. Herodotou and S. Babu, "Profiling, What-If Analysis, and Cost-Based Optimization of MapReduce Programs," VLDB, 2011.
[18] H. Herodotou et al., "No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-Intensive Analytics," SOCC, 2011.
[19] H. Herodotou et al., "Starfish: A Self-Tuning System for Big Data Analytics," CIDR, 2011.
[20] S. Huang et al., "The HiBench Benchmark Suite: Characterization of the MapReduce-Based Data Analysis," ICDEW, 2010.
[21] S. Ghemawat et al., "The Google File System," SOSP, Oct. 2003.
[22] S. Ibrahim et al., "LEEN: Locality/Fairness-Aware Key Partitioning for MapReduce in the Cloud," CloudCom, Dec. 2010.
[23] NIH ImageJ, http://rsbweb.nih.gov/ij/.
[24] Impetus, "Whitepaper: Deriving Intelligence from Large Data Using Hadoop and Applying Analytics."
[25] I.N. Bankman, "Handbook of Medical Image Processing and Analysis," Elsevier, 2nd Edition, 2009.
[26] D. Jiang et al., "The Performance of MapReduce: An In-Depth Study," VLDB, 2010.
[27] K. Kambatla et al., "Towards Optimizing Hadoop Provisioning in the Cloud," HotCloud, 2009.
[28] M. Koehler et al., "An Adaptive Framework for the Execution of Data-Intensive MapReduce Applications in the Cloud," IPDPSW, 2011.
[29] Mahout Homepage, http://mahout.apache.org/.
[30] J.L. Starck and F. Murtagh, "Handbook of Astronomical Data Analysis," Springer-Verlag, 2nd Edition, 2006.
[31] N.B. Rizvandi et al., "Preliminary Results on Modeling CPU Utilization of MapReduce Programs," Tech. Rep. 665, The University of Sydney, 2011.
[32] N.B. Rizvandi et al., "On Using Pattern Matching Algorithms in MapReduce Applications," ISPA, 2011.
[33] N.B. Rizvandi et al., "Preliminary Results: Modeling Relation between Total Execution Time of MapReduce Applications and Number of Mappers/Reducers," Tech. Rep. 679, The University of Sydney, 2011.
[34] VMware, http://www.vmware.com/.
[35] A. Wieder et al., "Brief Announcement: Modelling MapReduce for Optimal Execution in the Cloud," SIGACT-SIGOPS, 2010.
[36] A. Wieder et al., "Conductor: Orchestrating the Clouds," LADIS, 2010.
[37] H. Yang et al., "MapReduce Workload Modeling with Statistical Approach," Journal of Grid Computing, 2012.
[38] M. Zaharia et al., "Improving MapReduce Performance in Heterogeneous Environments," OSDI, 2008.
[23] NIH ImageJ, ”http://rsbweb.nih.gov/ij/” [24] Impetus, ”Whitepaper: Deriving Intelligence from Large Data Using Hadoop VIII.CONCLUDING REMARKSAND FUTURE DIRECTIONS and Applying Analytics.” In this work we observed a strong dependency between [25] I. N. Bankman, ”Handbook of Medical Image Processing and Analysis,” Elsevier; 2nd Edition, 2009. map concurrency and MapReduce performance. We real- [26] D. Jiang et al., ”The Performance of MapReduce: An In-Depth Study,” VLDB, ized that a good configuration for map concurrency can be 2010. determined by simply depending on two main MapReduce [27] K. Kambatla et al., ”Towards Optimizing Hadoop Provisioning in the Cloud,” characteristics, data shuffling and map task setup overhead. HotCloud, 2009. 2 [28] M. Koehler et al., ”An Adaptive Framework for the Execution of Data-Intensive We built MC , a simple standalone utility predictor that MapReduce Applications in the Cloud,” IPDPSW, 2011. leverages these two characteristics and correctly predict the [29] Mahout Homepage, ”http://mahout.apache.org/.” best map concurrencies for MapReduce applications. We [30] J.L. Starck and F. Murtagh, ”Handbook of Astronomical Data Analysis,” 2 Springer-Verlag; 2nd Edition 2006. showed that MC works successfully on a private cloud and [31] N.B. Rizvandi et al., ”Preliminary Results on Modeling CPU Utilization of 2 on Amazon EC2. MC makes timely contributions to cloud MapReduce Programs,” Tech. R. 665, The University of Sydney, 2011. data analytics, and serves in reducing cost and improving [32] N.B. Rizvandi et al., ”On Using Pattern Matching Algorithms in MapReduce MapReduce performance on the cloud. Applications,” ISPA, 2011. 2 [33] N.B. Rizvandi et al., ”Preliminary Results: Modeling Relation between Total After verifying the promise of MC , we set forth three Execution Time of MapReduce Applications and Number of Mappers/Reducers,” main future directions: (1) extending MC2 to predict the Tech. R. 679, The University of Sydney, 2011. best possible number of map slots for Hadoop clusters so as [34] ”http://www.vmware.com/.” [35] A. Wieder et al., ”Brief Announcement: Modelling MapReduce for Optimal to avoid resource contention and enhance system utilization, Execution in the Cloud” SIGACT-SIGOPS, 2010. (2) incorporating the effects of speculative execution in our [36] A. Wieder et al., ”Conductor: Orchestrating the Clouds” LADIS, 2010. mathematical model in order to make it more robust to slow [37] H. Yang et al., ”MapReduce Workload Modeling with Statistical Approach,” Journal of Grid Computing, 2012. tasks, and (3) characterizing reduce concurrency so as to [38] M. Zaharia et al., ”Improving Mapreduce Performance in Heterogeneous guide Reduce phase configuration for improved performance. Environments,” OSDI, 2008.