<<

Relaxed Synchronization and Eager in MapReduce

Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, Ananth Grama Department of Computer Science, Purdue University {kkambatl, nrapolu, suresh, ayg}@cs.purdue.edu

Abstract With a view to supporting large-scale distributed applications in unreliable wide-area environments, MapReduce has emerged as a commonly-used programming Dean and Ghemawat proposed a novel program- model for large-scale distributed environments. While the underlying programming model based on maps and reductions has been shown ming model based on map and reduce operations, to be effective in specific domains, significant questions relating to called MapReduce [5]. A map,inaMapReduce performance and application scope remain unresolved. This paper program takes as input a list of key-value pairs targets key questions of performance through relaxed semantics of underlying map and reduce constructs in iterative MapReduce and applies a programmer-specified function, in- applications. Specifically, it investigates the notion of partial syn- dependently, on each pair in the list. A reduce chronizations combined with eager scheduling to overcome global operation takes as input a list indexed by a key, synchronization overheads associated with reduce operations. In addition to presenting semantics for these constructs, the paper of all corresponding values, and applies a reduction describes their application on two illustrative problems — eigen- function on the values. It outputs a list of key- spectra/pagerank computation on web graphs and clustering high- value pairs, where the keys are unique and values dimensional datasets. The proposed constructs up to five-fold performance improvements on real graphs and datasets, whileadding are determined by the function applied on all of only minimal program complexity. the values associated with the key. The inherent The following specific contributions are reported in the paper: simplicity of this programming model, combined (i) partial synchronizations combined with eager scheduling are with underlying system support for scheduling, capable of significant performance improvements, (ii) diverse classes of iteratively-structured applications can be efficiently executed in fault tolerance, and application development, make MapReduce on hundreds of distributed hosts, and (iii) a general MapReduce an attractive platform for diverse data- API for partial synchronizations and eager map scheduling holds intensive applications. Indeed, MapReduce has been significant performance potential for other applications with irregular data-access and computation profiles. used effectively in a wide variety of data processing applications. The large volumes of data processed at Google, Yahoo, and Amazon, stand testimony to 1. Introduction the effectiveness and scalability of the MapReduce paradigm. Hadoop1,anopensourceimplementation Motivated by large amounts of data generated of MapReduce, is in wide deployment and use. by web-based applications, scientific experiments, Amajorityoftheapplicationscurrentlyexecuting business transactions, etc., and the need to analyze in the MapReduce framework have a data-parallel, this data in effective, efficient, and scalable ways, uniform access profile, which makes them ideally there has been significant recent activity in develop- suited to map and reduce abstractions. Recent ing suitable programming models, runtime systems, research interest, however, has focused on more un- and development tools. The distributed nature of structured applications that do not lend themselves data sources, coupled with rapid advances in net- naturally to data-parallel formulations. Common ex- working and storage technologies naturally motivate amples of these include sparse unstructured graph abstractions for supporting large-scale distributed operations (as encountered in diverse domains in- applications. The inherent heterogeneity, latencies, cluding social networks, financial transactions, and and unreliability, of underlying platforms makes the development of such systems challenging. 1. Hadoop, http://hadoop.apache.org scientific datasets), discrete optimization and state- ion. This has the important effect of smoothing space search techniques (in business opti- load imbalances associated with typical applica- mization, planning), and discrete event modeling. tions. This paper combines partial synchronizations, For these applications, there are two major unre- locality enhancement, and eager scheduling, along solved questions: (i) can the existing MapReduce with algorithmic asynchrony to deliver distributed framework effectively support such applications in a performance improvements of 100 to 500% (and be- scalable manner? and (ii) what enhancements to the yond in several cases). The enhanced programming MapReduce framework would significantly enhance semantics resulting in the significant performance its performance and scalability without compromis- improvement do not impact programmability/ pro- ing desirable attributes of programmability and fault gram complexity. tolerance? We demonstrate all of our results on the wide- This paper primarily focuses on the second ques- area Google cluster in the of Pagerank tion – namely, it seeks to extend MapReduce se- and clustering (K-Means) implementations. The mantics to support specific classes of unstructured applications are representative of a broad set of applications on large-scale distributed environments. application classes. They are selected because of Recognizing that one of the key bottlenecks in their relative simplicity as well as their ubiquitous supporting such applications is the global synchro- interaction pattern. Our extended semantics have a nization associated with the reduce operation, it much broader application scope, including common introduces notions of partial synchronization and data analyses kernels (PCA, dimensionality reduc- eager scheduling. The underlying insight is that tion, PDDP), graph applications (matching through for an important class of applications, algorithms isorank, random walks), and more traditional HPC exist that do not need global synchronization for workloads (iterative linear/non-linear solvers, parti- overall-correctness. (Please note that we use the cle methods). term overall-correctness as being distinct from se- The rest of the paper is organized as follows: in quential semantics. Overall correctness refers to an Section 2, we provide a brief overview of MapRe- accurate solution to a given problem.) For such ap- duce, Hadoop, and motivate the problem;in Section plications, while global synchronizations optimize 3, we outline our primary contributions and their serial operation counts, violating these synchro- significance; in Section 4, we present extended nizations merely increases operation counts without semantics and the corresponding API; in Section 5, impacting correctness of the algorithm. Common we discuss our implementations of Pagerank and K- examples of such algorithms include, computation Means clustering and analyze performance gains of of eigenvectors (pageranks) through (asynchronous) our approach. We outline avenues for ongoing work power methods, branch-and-bound based discrete and conclusions in Sections 8 and 9. optimization with lazy bound updates, and other heuristic state-space search algorithms. For such 2. Background and Motivation algorithms, a global synchronization can be replaced by a partial synchronization. However, this partial The primary design motivation for the functional synchronization must be combined with suitable MapReduce abstractions is to allow programmers to locality enhancing techniques so as to minimize express simple concurrent computations, while hid- the adverse effect on operation counts. These lo- ing low-level details of scheduling, fault-tolerance, cality enhancing techniques take the form of min- and data distribution in a single library [5]. The cut graph partitioning and aggregation into maps simplicity of the API makes programming rela- in graph analyses, periodic quality equalization in tively easy. Programs are expressed as sequences branch-and-bound, and other such operations that (iterations) of map and reduce operations with are well known in parallel and distributed process- intervening synchronizations. The synchronizations ing communities. Replacing global synchroniza- are implicit in the reduce operation, since a reduce tions with partial synchronizations also allows us must wait for all map outputs. Once a reduce to subsequent maps in an eager fash- operation has terminated, the next set of maps can be scheduled. Fault tolerance is achieved by This leads to correspondingly more computation for rescheduling maps that time out. A reduce op- the hubs as compared to the spokes. eration cannot be initiated until all maps have exe- We seek to answer the following key questions cuted. Maps generally correspond to fine-grained relating to the application scope and performance tasks. This allows the scheduler to balance load. of MapReduce (or the general paradigm of maps There are no explicit primitives for localizing data and reduces)inthecontextofapplicationsthat access. Such localization must be achieved by ag- tolerate algorithmic asynchrony: (i) what are suit- gregating fine-grained maps into coarser maps, able abstractions (MapReduce extensions) for dis- using traditional parallel processing techniques. It tributed asynchronous algorithms? (ii) for an appli- is important to remember, however, that increasing cation class of interest, can the performance benefits the granularity carries with it the potential downside of localization, partial synchronization, and eager of increased makespan resulting from failures. scheduling of maps overcome the sub-optimality As may be expected, for many applications, the in terms of serial operation counts, and (iii) can dominant overhead in the program is associated this framework be used to deliver scalable and high with the global reduction operations (and the in- performance over wide-area distributed systems? herent synchronizations). When executed in wide- In answering these questions, we also investigate area distributed environments, these synchroniza- whether the use of proposed MapReduce extensions tions often incur substantial latencies associated renders programs more complex, thereby increasing with underlying network and storage infrastructure. development effort. In contrast, partial synchronizations take over an order of magnitude less time on conventional wide- 3. Technical Contributions area testbeds. For this reason, fine-grained maps, which are natural for programming data parallel To alleviate the cost of global synchronization MapReduce applications do not always yield ex- and granularity, we propose and validate the follow- pected performance improvements and scalability. ing solutions within the MapReduce framework: The obvious question that follows from these • We provide support for global and partial observations is whether we can decrease the number synchronizations. Partial synchronizations are of global synchronizations, perhaps at the expense enforced only on a subset of maps.Global of increased number of partial synchronizations synchronizations are identical to reduce opera- (we define a partial synchronization to imply a tions, enforced on all the maps. synchronization/reduction only across a subset of • Following partial synchronizations, subsequent the maps). The resulting algorithm(s) may be sub- maps are scheduled in an eager fashion, i.e., as optimal in terms of serial operation counts, but soon as the partial synchronization operation may be (significantly) more efficient and scalable completes. in a MapReduce framework. A particularly relevant • We use partial synchronizations and eager class of algorithms where such tradeoffs are possi- scheduling in combination with a coarser ble are iterative techniques applied to unstructured grained, locality enhancing allocation of tasks problems (where the underlying data access patterns to maps. are unstructured). This broad class of algorithms • We validate the aforementioned techniques underlies applications ranging from pagerank com- on two representative problems – computing putations on the web to sparse solvers in scientific pagerank on sparse unstructured (power-law computing applications and data analyses operations type) real web graphs and clustering high- such as clustering, dimensionality reduction, and dimensional data using unsupervised clustering graph matching. In many of these applications, algorithm (K-Means). These applications are added complexity arises from the inherent load im- illustrative of a broader class of asynchrony- balance associated with the maps.Forexample,in tolerant algorithms. We show that while the the computation of pagerank, a hub node may have a number of serial operations (and indeed the significantly higher degree (in- or out-) than spokes. sum of partial and global reductions) is much higher, the reduction in number of global re- Indeed optimizations such as these have been ex- ductions yields performance improvements of plored in the context of traditional HPC platforms as up to 500% compared to traditional MapRe- well with some success. However, the difference in duce implementation on a 460-node cluster overhead between a partial and global synchroniza- provided by IBM-Google consortium as part tion in relation to the intervening useful computation of the CluE NSF program. is not as large for HPC platforms. Consequently, the performance improvement is significantly amplified To alleviate the overhead of global synchroniza- on distributed platforms. It also follows thereby tion, we propose a two level scheme — a local that performance improvements from MapReduce reduce operation applies the specified reduction deployments on wide-area platforms, as compared function to only those key-value pairs emanating to single processor executions are not expected to be from the preceding map(s) at the same host. A significant unless the problem is scaled significantly global reduce operation, on the other hand, incurs to amortize overheads. However, MapReduce for- significant network and file system overheads. In our mulations are motivated primarily by the distributed Pagerank example, the rank of a node is determined nature of underlying data and sources, as opposed by the rank of its neighbors. In the traditional to the need for parallel speedup. For this reason, MapReduce formulation, in every iteration, map performance comparisons must be with respect to involves each node pushing its Pagerank to all its traditional MapReduce formulations, as opposed to outlinks and reduce accumulates all neighbors’ speedup and efficiency measures more often used in contributions to compute Pagerank for the corre- the parallel programming community. This is further sponding node. These iterations continue until the reinforced by the fact that in Pageranks converge. environments, it is difficult to even estimate the Consider an alternate formulation in which the resources available to a program. graph has been partitioned (typically through a While most of our development efforts and all of crawler – different hosts crawl different parts of our validation results are in the context of pagerank the web), and a map now corresponds to the and k-means, concepts of partial reductions com- local-pagerank computation of all nodes within the bined with locality enhancing techniques and eager partition. For each of the internal nodes (nodes map scheduling have wider applicability. In general, that have no edges leaving the partition), a partial our semantic extensions apply to this wider class of reduction accurately computes the rank (assuming applications that admit asynchronous algorithms – the neighbors’ ranks were accurate to begin with). algorithms for which relaxed synchronization im- On the other hand, boundary nodes (nodes that pacts only performance, and not correctness. have edges leading to other partitions) must have aglobalreductiontoaccountforremoteneighbors. 4. Proposed Semantic Extensions It follows therefore that if the ranks of the bound- ary nodes were accurate, ranks of internal nodes In this section, we present our proposed semantics can be computed simply through local iterations. and API for iterative MapReduce to alleviate syn- Thus follows a two-level scheme wherein partitions chronization overheads, while preserving desirable (maps)iterateonlocaldatatoconvergenceandthen attributes of programmability. perform a global reduction. It is easy to see that this two-level scheme 4.1. Semantics for Iterative MapReduce increases the serial operation count. Furthermore, it increases the total number of synchronizations Figure 1 describes the formal syntax of our pro- (partial + global) compared to the traditional for- posed language. For our purposes, lists form an in- mulation. However, and perhaps most importantly, it teresting value in the language. MapReduce operates reduces the number of global reductions. Since this on lists and not on its elements. Any operation on is the major overhead, the program has significantly an element of the list must be defined in terms of better performance and scalability. an abstraction, so that our MapReduce constructs v!Value (1)

p!Processors = {P1,...,Pm} (2)

Σ !LocalStore = {Σ1,...,Σm} (3) Λ !GlobalStore = {Λ} (4) σ!Σ=L → Z (5) λ!Λ=L → Z (6) f!Fn :: = λx.e (7)

l!List :: = [v1,...,vn] (8) e!P :: = f (9)

| Apply( I ,) (10) Fig. 1. Iterative Relaxed MapReduce: Syntax

l, σ =⇒ σ(l) (LOCAL-LOOKUP) l, λ =⇒ λ(l) (GLOBAL-LOOKUP)

Apply( I ,)=⇒g I efm fr l (APPLY-ITER)

! ! while (condg) G condl fm fr lg,λ =⇒g lg ,λ ! ! (ITER-MAPRED) I condg fm fr lg,λ =⇒g lg ,λ

! ch lg,λ,σ ≡ l¯l,λ,σ ! ! ! ! ! ! !! ! while cond map ( L fm fr) l¯l,σ =⇒l l¯l ,σ agg l¯l ,σ,λ≡ lg ,σ,λ fold fr lg ,λ =⇒g lg ,λ !! ! ! G cond fm fr lg,λ,σ =⇒g lg ,λ,σ (MAPRED-GLOBAL)

!! !! !! ! ! map fm ll,σ =⇒l ll ,σ fold fr ll ,σ =⇒l ll ,σ ! ! (MAPRED-LOCAL) L fm fr ll,σ =⇒l ll ,σ

¯ ch lg ≡ ll ≡{ll | ll ⊂ lg & ∩ll,ll!l¯l ll = φ}; ∀li,σi[li '→ λ(l)] (CHUNKIFY)

¯ agg ll ≡ lg ≡∪ll!l¯l ll; ∀li,λ[l '→ li] (AGGREGATE) Fig. 2. Iterative Relaxed MapReduce: Semantics can evaluate them on the elements to compute new sume a set of hosts and a set of associated stores. We lists. We define L, G,andI as the local, global, define look up functions, σ (local) and λ (global). and iterative versions of MapReduce; one invoked We also define two different evaluation rules for from another. These operators evaluate on a tuple MapReduce — =⇒ g for the evaluation rules . that change the global state, and =⇒l to describe Programs in the language are defined as applications the evaluations that change the local state. Figure 2 of our iterative MapReduce applied to functions and describes these semantics. A typical program in the an input list. language is an application of IterMapReduce(I)to atupleconsistingofaterminationcondition,map To capture salient behavior of the system, we as- and reduce functions and the list of data. The programming complexity. computation rules[Apply-Iter] evaluate to running global IterMapReduce. 4.2. API I,theiterativeMapReducetakescond, map, re- duce and a list to operate on. The cond function The rigorous semantics described above ad- operates on the global heap until the termination vocate a source-to-source translator for iterative condition is met. Iterative MapReduce calls global MapReduce with relaxed synchronization and eager MapReduce iteratively until this condition is met. scheduling. We formulated a simple-to-implement [Mapred-Global] describes the behavior of global API to validate our claims, as discussed below — MapReduce. This operation uses another condition In the traditional MapReduce API, the user pro- to determine the termination of the local MapRe- vides map and reduce functions along with the duce and operates on data local to the MapReduce functions to split and format the input data. In addi- it gets associated with. The main feature is to have tion to these, for iterative MapReduce applications, amechanismforsynchronizationacrossthelocal the user must provide functions for termination of and global stores. This is achieved by our Chunkify global and local MapReduce iterations, and func- and Aggregate functions. Chunkify(ch) partitions the tions to convert data into the formats required by the global list into sequence of multiple lists each made local map, reduce functions. The Chunkify and available to the local MapReduce and copies these Aggregate functions discussed in the semantics chunks to the respective local stores. Aggregate are built using these conversion functions. takes care of aggregating the data from the mul- To provide more flexibility to the programmer tiple local MapReduce computations and copies the and to allow better exploitation of application- changes to global store. During the evaluation of specific optimizations, we propose four new func- global MapReduce, we first Chunkify the data. We tions — gmap, greduce, lmap and lreduce (two then have the required values in the local store. We different sets of map and reduce for the local subsequently apply the standard map,toacurried and global versions of the operations). Global map version of local MapReduce — (L fm fr)—and uses pool(s) to schedule lmap and lreduce to the local list, iteratively until it terminates. Only the exploit available parallelism. Functions Emit() and local store is modified during this stage. At the end EmitIntermediate() support data-flow in traditional of these iterations, we need to synchronize globally, MapReduce. We introduce their local equivalents requiring us to copy data back to the global store, — EmitLocal() and EmitLocalIntermediate(). Ag- this performed by the Aggregate function. gregate iterates over the data emitted by EmitLocal() [Mapred-Local] controls the behavior of local and sends it to the global reduce through the Emit- MapReduce. This is very close in structure to the Intermediate() function. This API forms the basis original MapReduce model, with the main differ- for our pagerank and K-Means implementations. ence being that it operates locally, with no global store changes, as explained in the semantics. 5. Evaluation The semantics use while, map,andfold con- structs (bold faced in the semantics), which carry We validate the performance benefits of our pro- their usual functional definitions. From a systems posed semantics on two commonly used applica- perspective, map/fold functions process the data tions — pagerank (more generally, computing the given to them in parallel, and collect the output. eigenspectra of a matrix) and k-means clustering. In addition to the inputs to the regular MapRe- We compare native MapReduce implementations of duce, our iterative version requires a termination these applications with their modified implementa- condition. Such a termination condition has to be tions that exploit relaxed synchronization and eager thought of even for iterating over traditional MapRe- scheduling. We perform our experiments on a live duce. Also, our iterative MapReduce reduces to testbed – the 460 node cluster provided by IBM- traditional MapReduce if the local condition is set Google consortium as part of the CluE (Cluster to one iteration. Our semantics do not increase Exploratory) NSF program. Table 1 describes the TABLE 1. Measurement testbed, Software Mathematically, the pagerank of a node is given by Clue Cluster Intel(R) Xeon(TM) CPU 2.80GHz the following expression: Machines - 460 4GBRAM,2x400GBharddrives VM 1VMperhost Software Hadoop 0.17.3, Java 1.6 PRd =(1− χ)+χ ∗ ! s.pagerank/s.outlinks, Heap space 1GBperslave (s,d)!E (11) where χ is the damping factor, s.pagerank and physical resources, software, and restrictions on the s.outlinks correspond to the pagerank and the out- cluster. Since this is a shared cluster, there is the degree of the source node, respectively. potential that several jobs are concurrently executing For both our implementations, the input is a graph during our experiments. Nonetheless, our results represented as an adjacency list. We start with a strongly validate our insights. pagerank of 1 for all nodes. The pageranks converge to their correct values after a few iterations of 5.1. Implementation applying the above expression for each node. We define convergence by constraining the change in individual pagerank values (10−5 in our case) across For both applications, we implement base ver- iterations. sions that conform to the conventional MapReduce formulations. We also implement modified versions; 5.2.1. General Pagerank. The general MapReduce relaxed synchronization and eager scheduling real- implementation of pagerank iterates over a map ized by rewriting the global map function (gmap) task that emits the pageranks of all source nodes to have a local MapReduce (lmap and lreduce)as to the corresponding destinations in the graph, described in the following pseudo-code: and a reduce task that accumulates the pager- gmap(xs : X list) { ank contributions from various sources to a single while(no- local-convergence-intimated) { destination. In the actual implementation, the map lmap(x); // emits (lkey, lval) function emits tuples of the type .Thereduce task operates on adestinationnode,whichgathersthepageranks for each value in lreduce-output{ from its incoming source nodes and computes a EmitIntermediate(key, value); } new pagerank for itself. Thus, after every iteration, } the nodes have renewed pageranks that propagate through the graph in subsequent iterations until The argument to global map is a they converge. One can observe that even a small list(xs). We run a local MapReduce with each ele- change in the pagerank of one node is broadcast to ment of xs as input to lmap.Weuseahashtable all the nodes in the graph in successive iterations to store the intermediate and final results of the of MapReduce, incurring a potentially significant local MapReduce. Applying this method to an ar- global synchronization cost. bitrary application requires knowledge of the local Our baseline for performance comparison is a termination condition. The local termination can be MapReduce implementation for which maps cor- convergence (as in Pagerank and K-Means ) or a respond to complete partitions, as opposed to single constraint on the number of iterations. node updates. We use this as a baseline because the performance of this formulation was noted to be 5.2. Pagerank on par or better than the adjacency-list formulation where the update of a single node is associated with The pagerank of a node in a given graph is the a map.Forthisreason,ourbaselineprovidesamore scaled sum of the pageranks of all of its neighbors. competitive implementation. 5.2.2. Eager Pagerank. We begin our description TABLE 2. Input graph properties of Eager Pagerank with an intuitive description of Input graphs Stanford webgraph Power-law how the underlying algorithm accommodates asyn- Nodes 280,000 100,000 chrony. In a graph with a power-law type distribu- Edges 3million 3million tion, one may assume that each hub is surrounded by Damping factor 0.75 0.85 alargenumberofspokes,andthatinter-hubedges are comparatively infrequent. This allows us to relax strict synchronization on inter-hub edges until the of global pageranks. However, the pagerank of subgraph in the proximity of a hub has relatively any node propagated during the global reduce is self-consistent pageranks. Disregarding the inter- representative, in a way, of the sub-graph it belongs hub edges does not lead to algorithmic inconsistency to. Thus, one may observe that the local and global since, after every few local iterations of MapReduce reduce functions are functionally identical. calculating the pageranks in the subgraph, there Note that in Eager Pagerank, the local reduce is a global synchronization (following a global waits on a local synchronization barrier, while the map)leadingtoadisseminationofthepageranks local maps can be implemented on a in this subgraph to other subgraphs via inter-hub within a single host in a cluster. The local syn- edges. This propagation reimposes consistency on chronization does not incur any inter-host commu- the global state. Consequently, we update only the nication delays. This makes the local overheads (neighboring) nodes in the smaller subgraph. We considerably lower than the global overheads. achieve this by a set of iterations of local MapRe- duce, as described in the semantics. Evidently, this 5.2.3. Input data. Table 2 describes the two graphs method leads to improved efficiency if each map that are considered for our experiments on Pagerank. operates on a hub or a group of topologically Our primary validation dataset is the Stanford web localized nodes. Such topology is inherent in the graph from the Stanford Webbase project3.Itisa way we collect data, as it is crawler-induced. One medium-sized crawl conducted in 2002 of 280K can also use one-time graph partitioning using tools nodes and about 3 million edges. To study the 2 like Metis ,asinourcase.Theoverallperformance impact of problem size and related parameters, of MapReduce implementation is largely invariant we also rely on synthetic graphs with power-law on the choice of partitioner for wide-area ( distributions, generated through preferential attach- dominated) platforms. In the Eager Pagerank im- ment [4] in igraph4.Thealgorithmusedtocreate plementation, the map task operates on a sub-graph. the synthetic graph is described below, along with Local MapReduce, within the global map,computes its justification. the pagerank of the constituent nodes in the sub- Preferential attachment based graph generation. graph. Consequently, we run the local MapReduce Graph datasets are generated by adding vertices to convergence. one at a time – connecting new vertices to numConn Instead of waiting for all the other global map vertices already in the network, chosen uniformly at tasks operating on different subgraphs, we eagerly random. For each of these numConn vertices, numIn schedule the next local map and local reduce it- and numOut of its inlinks and outlinks are chosen erations on the individual subgraph inside a sin- uniformly at random and connected to the joining gle global map task. Upon local convergence of vertex. This is done for all of the newly connected the subgraphs, we synchronize globally, so that nodes to the incoming vertex. This method of creat- all nodes can propagate their computed pageranks ing a graph is closely related to the evolution of the to other sub-graphs. These iterations over global web. This procedure increases the probability of a MapReduce run to convergence. Such an Eager highly reputed site getting linked to new sites, since Pagerank incurs more computational cost, since local reductions may proceed with imprecise values 3. Stanford Webbase project. http://diglib.stanford.edu:8091/ testbed/- doc2/WebBase/ 2. Metis. http://glaros.dtc.umn.edu/gkhome/views/metis 4. The igraph library, http://igraph.sourceforge.net/ Inlink distribution of Stanford webgraph 100000 Inlinks

10000

1000

100 Number of links

10

1 1 10 100 1000 10000 100000 Number of pages

Fig. 5. Number of Iterations to converge(on y-axis) for Fig. 3. Inlink distribution of the Stanford webgraph different number of Partitions(on x-axis) for the Stanford webgraph for a damping factor of 0.75 Inlink distribution of synthetic power-law graph 10000 Inlinks entire graph is given to one global map and its local 1000 MapReduce would compute the final pageranks of all the nodes. If the partition size is one, each 100 partition gets a single adjacency list, Eager Pagerank Number of links becomes regular Pagerank, because each map task 10 operates on a single node.

1 Figures 5 and 6 show the number of iterations 1 10 100 1000 Number of pages taken by the relaxed and general implementations of Pagerank, on the Stanford and synthetic webgraphs Fig. 4. Inlink distribution of the synthetic power-law that we use for input, as we vary the number of graph partitions. The number of iterations does not change in the general case as each iteration performs the same work irrespective of the number of partitions it has a greater probability of being in an inlink from and partition sizes. Also, in Eager Pagerank, we other randomly chosen sites. Figures 3 and 4 show found that the average number of local MapReduce the distribution of inlinks for the two input graphs. iterations per one global map task is 40,˜ which corresponds to the increased computation. Inspite of 5.2.4. Results. To show the dependence of per- such increased computation, we observe significant formance on global synchronizations, we vary the performance benefits. number of iterations of the algorithm by altering The results for Eager Pagerank are consistent with the number of partitions the graph is split into. our predictions. The number of global iterations Fewer partitions result in a smaller number of larger decrease with the number of partitions. For both subgraphs. Each map task does more work and graphs, the number of iterations to converge in- would normally result in fewer global iterations creases monotonically with the number of partitions. in the relaxed case. The fundamental observation The time to solution depends strongly on the here is that it takes fewer iterations to converge number of iterations but is not completely deter- for a graph having already converged subgraphs. mined by it. It is true that the global synchroniza- The trends are more pronounced when the graph tion costs would decrease, but when we reduce follows a power-law distribution more closely. In the number of partitions significantly, the work either case, the total number of iterations are fewer done by each map task increases. This increase than in the general case. For Eager Pagerank, if potentially results in increased cost of computation; the number of partitions is decreased to one, the more than the benefit from reduced communication/ Fig. 6. Number of Iterations to converge(on y-axis) for Time to converge(on y-axis) for various number different number of Partitions(on x-axis) for the synthetic Fig. 8. of Partitions(on x-axis) for the synthetic webgraph for a webgraph for a damping factor of 0.85 damping factor of 0.85

points of optimality, 400 partitions for the Stanford webgraph and 200 for the synthetic webgraph. We also observe that the smaller the graph, the lower this optimal number of partitions.

5.3. K-Means

K-Means is a commonly-used technique in un- supervised clustering. Implementation of the algo- Fig. 7. Time to converge(on y-axis) for various number rithm in the MapReduce framework is straightfor- of Partitions(on x-axis) for the Stanford webgraph for a ward, as demonstrated in [13, 3]. Briefly, in the damping factor of 0.75 map phase, every point chooses its closest cluster centroid, and in the reduce phase, every centroid is updated to be the mean of all the points that synchronizations. Consequently, there is an optimal chose the particular centroid. The iterations of map number of partitions for every application on a given and reduce phases continue until the centroid platform. movement is below a given threshold. Euclidean dis- Figures 7 and 8 show the runtimes for the tance metric is generally used to determine centroid relaxed and general implementations of Pagerank on movement. the stanford and synthetic webgraphs with varying In the proposed relaxed MapReduce framework, number of partitions. One may notice the significant each global map handles a unique subset of the performance gains the relaxed case achieves over input points. The local map and reduce iterations the general case for both graphs. On average, we see inside the global map,clusterthegivensubsetof 2x to 3x improvement in runtimes. The improve- the points using the common input-cluster-centroids. ment in the case of a synthetic graph is expected Once convergence is achieved in the local iterations, to be more uniform and consistent, which can be the global map emits the input-cluster-centroids attributed to its close conformance to a power-law and their associated updated-centroids. The global distribution. This in turn leads to faster convergence reduce calculates the final-centroids, which is the inside the subgraph, which is observed consistently mean of all updated-centroids corresponding to in our experiments. asingleinput-cluster-centroid.Thefinal-centroids Finally, given that both graphs exhibit power- form the input-cluster-centroids for the next iter- law-type characteristics, we notice their respective ation. These iterations continue until the input- Fig. 9. K-Means iterations for different thresholds Fig. 10. K-Means time taken for different thresholds cluster-centroids converge. The algorithm used in the relaxed approach to K-Means is similar to the one recently shown by Tom-Yov and Slonim [17] for pairwise clustering. An important observation from their results is that the input to the global map should not be the same subset of the input points in every iteration. For every few iterations the input points need to be partitioned differently across global maps so as to avoid getting stuck in a local minima. Also, the convergence condition must include detection of Fig. 11. Iterations and Time-to-Converge for different oscillations along with the Euclidean metric. Partitions We use the K-Means implementation in the nor- mal MapReduce framework from the Apache Ma- hout project5.SampledUSCensusdataof1990 periment was run 10 times and the average number from the UCI Machine Learning repository6 was of iterations and time taken were considered. We used as the dataset for comparison between the observe a consistent two-fold speedup with respect normal and relaxed approaches. The sample size is to the parallel K-Means algorithm implemented in around 200K points, each with 68 dimensions. For traditional MapReduce. both relaxed and normal K-Means , initial centroids The number of local MapReduce iterations inside were chosen randomly for the sake of generality. each global map increases as the threshold for Algorithms such as canopy clustering can be used convergence decreases. It is important to observe to identify initial centroids for faster execution and from the figure that increase in computation does better quality of final clusters. not translate to an increase in the overall time. Total Figures 9 and 10 respectively show the number time-to-converge depends primarily on the number of iterations and the time taken to converge for of global synchronizations (iterations). Furthermore, aparticulardatasetof200Kpointswithvarying global synchronization costs increase with the clus- thresholds. All the experiments were conducted on ter size, implying scalable performance gains. the 460 node cluster described above. To eliminate Figure 11 shows the number of iterations and the impact of random partitioning performed after time-to-converge with varying number of partitions every 5 iterations (chosen heuristically), each ex- (disjoint subsets of points). A global map applies lo- cal MapReduce clustering on its input partition. An 5. Apache Mahout. http://lucene.apache.org/mahout. 6. US Census Data, 1990. UCI Machine Learning Repository. increase in the number of partitions decreases the http://kdd.ics.uci.edu/databases/census1990/USCensus1990.html size of subset of points. The normal K-Means clus- tering takes 18 iterations for this dataset to converge than traditional MapReduce, we argue that the pro- for a threshold of 0.01. From the graph, one can gramming complexity is not substantial. This is also observe that (i) relaxed K-Means is significantly evidenced by the simplicity of the semantics used faster for the partitions considered, and (ii) as in in the paper to describe it. In our implementations Pagerank, there exists an optimal number of parti- of pagerank and k-means, this represented tens of tions, namely at 109 partitions. Note that the precise lines of code. cluster centroids are not identical across the general Fault-tolerance. While our approach relies on ex- and relaxed MapReduce implementations. However, isting MapReduce mechanisms for fault-tolerance, we verify the clustering quality (through inter and in the event of failure(s), our recovery times may intra-cluster distances) to be statistically identical be slightly longer as each map task is coarser and across these implementations. re-execution would take longer. However, all of our experimental results are reported on a produc- 6. Discussion tion wide-area cluster of 460 nodes, with real-life transient failures. This leads us to believe that the While we demonstrate quantitative performance overhead is not significant. improvements for selected algorithms, we must ad- Scalability. In general, it is difficult to estimate dress other key questions, namely: how general is the resources available to, and used by a program our proposed approach (are there large application execution in a Cloud computing environment. We classes than can benefit from this)? how do our draw our assertions on scalability of our formulation proposed extensions impact program complexity? by inferring resource availability from granularity of what impact do they have on underlying fault- task partitions. Please note that these inferences are tolerance mechanisms and overall scalability? We meant to be qualitative in nature, and not based on provide qualitative arguments for each of these raw data (since the data is not made available by questions: design). Typically, clusters run of the order of 10 map Generality. Our relaxed synchronization mecha- tasks per node. Since each map handles one com- nisms can be generalized to broad classes of ap- plete partition of the graph, for very large num- plications. The pagerank application, which relies bers of partitions(e.g., 6400), we potentially use on an asynchronous mat-vec, is representative of the entire 460 nodes in the Google cluster for the eigenvalue/ linear system solvers (computing eigen- map phase. Such high node utilization incurs heavy vectors using the power method of repeated mul- network delays during copying and merging before tiplications by a unitary matrix). A wide range of the reduce phase, leading to increased synchro- similar applications requiring the spectra of graphs, nization overheads. Significant speedups obtained global alignments, random walks, can be computed using our semantics for iterative MapReduce in such using this algorithmic template. For all of these alargedatacenterenvironments,demonstratethat applications, our methods and results are directly our solution is indeed scalable. applicable. Algorithms such as shortest paths over sparse graphs and graph alignment can be directly 7. Related work cast into our framework. Beyond graphs, an asyn- chronous mat-vec forms the core of iterative linear Over the past few years, the MapReduce pro- system solvers. Our asynchronous k-means imple- gramming model has gained attention primarily mentation extends this to problems in data min- because of its simple programming model and the ing and analysis. Other problems in data analysis, wide range of underlying hardware environments. such as PCA/SVD/dimensionality reduction directly There have been efforts exploring both the systems follow from our eigenvalue solution. Clearly, this aspects as well as the application base for MapRe- represents a very large application class. duce. Programming Complexity. While relaxed synchro- Chen et al [2] describe their experiences with nization requires slightly more programming effort large data-intensive tasks — analyzing the NetFlix user data and scene matching in GPS-tagged im- irrespective of the underlying hardware resources. ages along with compute-intensive tasks such as There is considerable work on distributed earthquake simulations. They observe that small data mining techniques. In particular, works by application-specific optimizations greatly improve Agarawal [9, 7] on middleware architechtures for the performance and runtime. For data-intensive parallel datamining and Parthasarthy [1, 15] on tasks, they show that intelligent partitioning of input parallel-clustering and mining have shown excellent data to maps heavily reduces network communica- performance. The context of our proposed work, tion during global synchronization. They also note namely MapReduce, presents a distinct set of over- that several applications are entirely data parallel, heads, requires semantic extension evaluation that obviating the need for a reduce operation. They can leverage algorithmic asynchrony to alleviate argue that the MapReduce runtime (e.g., Hadoop) these overheads. should be optimized for such reduction-free com- There have been recent efforts aimed at bridging putations. They also propose indexing of inputs and the gap between relational databases and MapRe- intermediate data to enable faster access, prove- duce [16]. These efforts propose a merge phase after nance tracking for linkage and dependence, and the reduce phase to compute joins on outputs sampling of input data to learn characteristics such of various reductions. Olston et al [11] propose a as skews in the map and reduce computations. SQL-style declarative language, Pig Latin, on top HBase7 and HyperTable8 are ongoing projects using of the MapReduce model. The underlying compiler the primitives of HDFS (Hadoop Distributed File deals with the automatic creation of the procedural System) providing a distributed, column-oriented map and reduce functions for data-parallel opera- store for efficient indexing of data being used in tions. Such languages further enhance programma- the MapReduce system. bility and allow the system to optimize execution. Anumberofefforts[10,14,19]targetoptimiza- Dryad [8], DryadLINQ [18], and Sawzall [12] aim tions to the MapReduce runtime and scheduling sys- to make MapReduce an implicit low-level program- tems. Proposals include dynamic resource allocation ming primitive. A program written in LINQ can to fit job requirements and system capabilities to be parsed to form a DAG, which is then used by detect and eliminate bottlenecks within a job. Such the DRYAD system to schedule the data parallel improvements combined with our efficient appli- portions on a distributed cluster. Most of these cation semantics, would significantly increase the efforts are aimed at bringing MapReduce program- scope and scalability of MapReduce applications. ming primitives into high-level languages and also The simplicity of MapReduce programming model to overcome the rigidity of MapReduce semantics. has also motivated its use in traditional shared They are also helpful in logical separation of data- memory systems [13]. parallel and task-parallel regions in a general appli- AdominantcomponentofatypicalHadoop cation. execution corresponds to the underlying commu- All of the aforementioned efforts try to improve nication and I/O. This happens even though the efficiency of MapReduce by adding software layers MapReduce runtime attempts to reduce communi- on top of existing MapReduce semantics or by im- cation by trying to instantiate a task at the node proving the underlying runtime. In contrast, we aim 9 or the rack where the data is present. Afrati et al. to extend MapReduce semantics to leverage algo- study this important problem and proposed alternate rithmic asynchrony in important application classes. computational models for sorting applications to reduce communication between hosts in different racks. Our extended semantics deal with the same 8. Future Work problem but, from the application‘s perspective, The myriad trade-offs associated with wide range 7. HBase. http://hadoop.apache.org/hbase of overheads on different platforms pose intriguing 8. Hypertable. http://www.hypertable.org 9. A New Computation Model for Rack-based Computing. challenges. We identify some of these challenges as http://infolab.stanford.edu/ ullman/pub/mapred.pdf they relate to our proposed solutions: Generality of semantic extensions. We have without burdening the programmer with locality demonstrated the use of partial synchronization and enhancing aggregations. eager scheduling in the context of specific applica- System-level enhancements. Often times, when tion classes. While we have argued in favor of their executing iterative MapReduce programs, the output broader applicability, these claims must be quan- of one iteration is needed in the next iteration. Cur- titatively established. Several task-parallel applica- rently, the output from a reduction is written to the tions with complex interactions are not naturally distributed file system (DFS) and must be accessed suited to traditional MapReduce formulations. Are from the DFS by the next set of maps,which the proposed set of semantic extensions adequate involves significant overhead. Using online data for such applications? MapReduce tries to strike a structures (for example, Bigtable) provides credible balance between programmability and performance. alternatives, however, issues of fault tolerance must It attempts to divest the programmer of burdens of be resolved. optimizing locality, scheduling, and communication. In doing so, it often compromises performance. These trade-offs must be viewed in the context of 9. Conclusion the underlying platforms and applications. In this paper, we propose extended semantics for Implications for tightly coupled systems. MapRe- MapReduce based on partial synchronizations and duce has been shown to be an effective alternative eager map scheduling. We demonstrate that when to Pthreads even on the systems combined with locality enhancing techniques and al- [13] for specific applications. It is important to note gorithmic asynchrony, these extensions are capable that shared memory MapReduce has a much larger of yielding significant performance improvements. design space because of greater control over the We demonstrate our results in the context two spawned map and reduce tasks (thread pools). selected problems — pagerank and K-Means clus- Anumberofperformanceoptimizationsarepos- tering. Our results strongly motivate the use of par- sible here – ranging from pipelining iterates of tial synchronizations for broad application classes. map and reduce operations, to reordering map Finally, these enhancements in performance do not and reduce operations for enhancing computation- adversely impact the programmability and fault- communication characteristics. Other optimizations tolerance features of the underlying MapReduce involve speculative execution of maps,relyingon framework. promises as outputs of maps as opposed to the values themselves. This rich space of optimizations bears significant research investigation. References Integration of proposed semantics with high-level [1] S. Asur and S. Parthasarathy. A viewpoint- programming languages. Our proposed semantics based approach for interaction graph analysis. and its variants can be integrated with SQL-type ACM SIGKDD,2009. declarative languages, namely Pig-Latin [11] and [2] Shimin Chen and Steven W. Schlosser. Map- DryadLINQ [18]. The resulting system would po- reduce meets wider varieties of applications. tentially improve the efficiency of MapReduce sig- Technical Report, IRP-TR-08-05, Intel Re- nificantly, while decreasing the load on application search, Pittsburg,2008. programmers to make decisions over the distribution [3] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, of data. YuanYuan Yu, Gary Bradski, Andrew Y. Ng, Optimal granularity for maps. As shown in our and Kunle Olukotun. Map-reduce for machine work, as well as the results of others, the perfor- learning on multicore. Advances in Neural In- mance of a MapReduce program is a sensitive func- formation Processing Systems 19,pages281– tion of map granularity. An automated technique, 288. based on execution traces and sampling [6] can [4] Price D. J. de S. A general theory of bibliomet- potentially deliver these performance increments ric and other cumulative advantage processes. Journal of the American Society for Informa- duce optimization using dynamic regulated tion Science, Vol 27 , 292-306,1976. prioritization. SIGMETRICS/Performance ’09: [5] Jeffrey Dean and Sanjay Ghemawat. Mapre- Proceedings of the 2009 Joint International duce: Simplified data processing on large clus- Conference on Measurement & Modeling of ters. Symposium on Design Computer Systems,2009. and Implementation (OSDI),2004. [15] V. Satuluri and S. Parthasarathy. Applications [6] Richard O. Duda, Peter E. Hart, and David G. to community discovery. ACM SIGKDD,2009. Stork. Chapter 8 pattern classification 2nd edi- [16] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and tion. AWiley-IntersciencePublication,2001. D. S. P. Jr. Map-reduce-merge: simplified re- [7] L. Glimcher, X Zhang, and G. Agarwal. Scal- lational data processing on large clusters. SIG- ing and parallelizing a scientific feature min- MOD ’07: Proceedings of the ACM SIGMOD ing application using a cluster middleware. international conference on Management of IPDPS’04 IEEE International Parallel and data,2007. Distributed Processing Symposium,2004. [17] Elad Yom-tov and Noam Slonim. Parallel [8] Michael Isard, Mihai Budiu, Yuan Yu, An- pairwise clustering. SDM’09, Proceedings of drew Birrell, and Dennis Fetterly. Dryad: dis- SIAM Data Mining conference,2009. tributed data-parallel programs from sequen- [18] Yuan Yu, Michael Isard, Dennis Fetterly, Mi- tial building blocks. Proceedings of the 2nd hai Budiu, far Erlingsson, Pradeep Kumar ACM SIGOPS/EuroSys European Conference Gunda, and Jon Currey. Dryadlinq: A system on Computer Systems,2007. for general-purpose distributed data-parallel [9] R. Jin and G. Agarwal. A middleware for computing using a high-level language. Sym- developing parallel datamining applications. posium on Operating System Design and Im- SIAM Conference on Data Mining,2001. plementation (OSDI), San Diego, CA,2008. [10] Karthik Kambatla, Abhinav Pathak, and [19] Matei Zaharia, Andrew Konwinski, Anthony Himabindu Pucha. Towards optimizing hadoop Joseph, Randy Katz, and Ion Stoica. provisioning for the cloud. 1st Workshop on Hot Topics in Cloud Computing, HotCloud, 2009. [11] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: A not-so-foreign language for data processing. SIGMOD ’08: Proceedings of the ACM SIGMOD international conference on Management of data,2008. [12] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal Special Issue on Grids and Worldwide Computing Programming Models and Infras- tructure, 13(4):227-298,2003. [13] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating for multi- core and multiprocessor system. Proceedings of the 13th Intl. Symposium on High- Performance (HPCA), Phoenix , AZ,2007. [14] Thomas Sandholm and Kevin Lai. Mapre-