
Relaxed Synchronization and Eager Scheduling in MapReduce

Karthik Kambatla, Naresh Rapolu, Suresh Jagannathan, Ananth Grama (Purdue University)

Abstract

MapReduce has rapidly emerged as a programming paradigm for large-scale distributed environments. While the underlying programming model based on maps and reduces has been shown to be effective in specific domains, significant questions relating to performance and application scope remain unresolved. This paper targets questions of performance through relaxed semantics of the underlying map and reduce constructs in iterative MapReduce applications such as eigenspectra/pagerank and clustering (K-Means). Specifically, it investigates the notion of partial synchronizations combined with eager scheduling to overcome the global synchronization overheads associated with reduce operations. It demonstrates the use of these partial synchronizations on two complex problems, illustrative of much larger classes of problems. The following specific contributions are reported in the paper: (i) partial synchronizations combined with eager scheduling are capable of yielding significant performance improvements, (ii) diverse classes of applications on sparse graphs and in data analysis (clustering) can be efficiently computed in MapReduce on hundreds of distributed hosts, and (iii) a general API for partial synchronizations and eager map scheduling holds significant performance potential for other applications with highly irregular data and computation profiles.

1. Introduction

Motivated by the large amounts of data generated by web-based applications, scientific experiments, business transactions, etc., and the need to analyze this data in effective, efficient, and scalable ways, there has been significant recent activity in developing suitable programming models, runtime systems, and development tools. The distributed nature of data sources, coupled with rapid advances in networking and storage technologies, naturally leads to abstractions for supporting large-scale distributed applications. The inherent heterogeneity, latencies, and unreliability of the underlying platforms make the development of such systems challenging.

With a view to supporting large-scale distributed applications in unreliable wide-area environments, Dean and Ghemawat proposed a novel programming model based on maps and reduces, called MapReduce [8]. Programs in MapReduce are expressed as map and reduce operations. A map takes in a list of key-value pairs and applies a programmer-specified function independently on each pair in the list. A reduce operation takes as input a list, indexed by a key, of all corresponding values, and applies a reduction function on the values. It outputs a list of key-value pairs, where the keys are unique and the values are determined by the function applied to all of the values associated with the key. The inherent simplicity of this programming model, combined with underlying system support for scheduling, fault tolerance, and application development, made MapReduce an attractive platform for diverse data-intensive applications. Indeed, MapReduce has been used effectively in a wide variety of data processing applications. The large volumes of data processed at Google, Yahoo, and Amazon stand testimony to the effectiveness and scalability of the MapReduce paradigm. Hadoop [10], the open-source implementation of MapReduce, is in wide deployment and use.
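To make these constructs concrete, here is a minimal, self-contained illustration (our own toy example, not code from the paper or from Hadoop): a word count phrased as a map that emits (word, 1) pairs and a reduce that sums the values grouped under each word.

    from collections import defaultdict

    def word_map(doc_id, text):
        # programmer-specified map: applied independently to each (key, value) pair
        return [(word, 1) for word in text.split()]

    def word_reduce(word, counts):
        # programmer-specified reduce: folds all values that share the same key
        return (word, sum(counts))

    def mapreduce(pairs, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for k, v in pairs:
            for out_k, out_v in map_fn(k, v):    # map phase
                grouped[out_k].append(out_v)     # group intermediate values by key
        return [reduce_fn(k, vs) for k, vs in grouped.items()]  # reduce phase

    print(mapreduce([("d1", "a rose is a rose")], word_map, word_reduce))
    # -> [('a', 2), ('rose', 2), ('is', 1)]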
A majority of the applications currently executing in the MapReduce framework have a data-parallel, uniform access profile, which makes them ideally suited to map and reduce abstractions. Recent research interest, however, has focused on more unstructured applications that do not lend themselves naturally to data-parallel formulations. Common examples include sparse unstructured graph operations (as encountered in diverse domains including social networks, financial transactions, and scientific datasets), discrete optimization and state-space search techniques (in business optimization and planning), and discrete event modeling. For these applications, there are two major unresolved questions: (i) can the existing MapReduce framework effectively support such applications in a scalable manner? and (ii) what enhancements to the MapReduce framework would significantly enhance its performance and scalability without compromising desirable attributes of programmability and fault tolerance?

This paper primarily focuses on the second question; namely, it seeks to extend the MapReduce semantics to support specific classes of unstructured applications on large-scale distributed environments. Recognizing that one of the key bottlenecks in supporting such applications is the global synchronization associated with the reduce operation, it introduces the notions of partial synchronization and eager scheduling. The underlying insight is that for an important class of applications, algorithms exist that do not need global synchronization for correctness. Specifically, while global synchronizations optimize serial operation counts, violating these synchronizations merely increases operation counts without impacting the correctness of the algorithm. Common examples of such algorithms include the computation of eigenvectors (pageranks) through (asynchronous) power methods, branch-and-bound based discrete optimization with lazy bound updates, and other heuristic state-space search algorithms. For such algorithms, a global synchronization can be replaced by a partial synchronization. However, this partial synchronization must be combined with suitable locality-enhancing techniques so as to minimize the adverse effect on operation counts. These locality-enhancing techniques take the form of min-cut graph partitioning and aggregation into maps in graph analyses, periodic quality equalization in branch-and-bound, and other such operations that are well known in the parallel processing community. Replacing global synchronizations with partial synchronizations also allows us to schedule subsequent maps in an eager fashion. This has the important effect of smoothing the load imbalances associated with typical applications. This paper combines partial synchronizations, locality enhancement, and eager scheduling, along with algorithmic asynchrony, to deliver distributed performance improvements of 100 to 200% (and beyond in several cases). The enhanced programming semantics that yield this significant performance improvement do not impact application programmability.

We demonstrate all of our results on the wide-area Google cluster in the context of Pagerank and clustering (K-Means) implementations. These applications are representative of a broad set of application classes. They are selected because of their relative simplicity as well as their ubiquitous interaction pattern. Our extended semantics have a much broader application scope.

The rest of the paper is organized as follows: in Section 2, we provide a more comprehensive background on MapReduce and Hadoop, and motivate the problem; in Section 3, we outline our primary contributions and their significance; in Section 4, we present our new semantics and the corresponding API; in Section 5, we discuss our implementations of Pagerank and K-Means clustering and analyze the performance gains of our approach. We outline avenues for ongoing work and conclusions in Sections 8 and 9.

2. Background and Motivation

The primary design motivation for the functional MapReduce abstractions is to allow programmers to express simple concurrent computations, while hiding the messy details of parallelization, fault tolerance, data distribution, and load balancing in a single library [8]. The simplicity of the API makes programming extremely easy. Programs are expressed as sequences (iterations) of map and reduce operations with intervening synchronizations. The synchronizations are implicit in the reduce operation, since it must wait for all map outputs. Once a reduce operation has terminated, the next set of maps can be scheduled. Fault tolerance is achieved by rescheduling maps that time out. A reduce operation cannot be initiated until all maps have completed. Maps generally correspond to fine-grained tasks, which allows the scheduler to balance load. There are no explicit primitives for localizing data access. Such localization must be achieved by aggregating fine-grained maps into coarser maps, using traditional parallel processing techniques. It is important to remember, however, that increasing the granularity carries with it the potential downside of an increased makespan resulting from failures.
As may be expected, for many applications the dominant overhead in the program is associated with the global reduction operations (and the inherent synchronizations). When executed in wide-area distributed environments, these synchronizations often incur substantial latencies associated with the underlying network and storage infrastructure. In contrast, partial synchronizations take over an order of magnitude less time. For this reason, fine-grained maps, which are natural for programming data-parallel MapReduce applications, do not always deliver the expected performance improvements and scalability.

The obvious question that follows from these observations is whether we can decrease the number of global synchronizations, perhaps at the expense of an increased number of partial synchronizations (we define a partial synchronization to imply a synchronization/reduction only across a subset of the maps). The resulting algorithm(s) may be suboptimal in serial operation counts, but may be more efficient and scalable in a MapReduce framework.

A particularly relevant class of algorithms where such tradeoffs are possible are iterative techniques applied to unstructured problems (where the underlying data access patterns are unstructured). This broad class of algorithms underlies applications ranging from pagerank computations on the web to sparse solvers in scientific computing and clustering algorithms. In many of these applications, an additional complexity arises from the inherent load imbalance associated with the maps. For example, in the computation of pagerank, a hub node may have a significantly higher degree (in- or out-) than the spokes. This leads to correspondingly more computation for the hubs than for the spokes.

We seek to answer the following key questions relating to the application scope and performance of MapReduce (or the general paradigm of maps and reduces) in the context of applications that tolerate algorithmic asynchrony: (i) what are suitable abstractions (MapReduce extensions) for distributed asynchronous algorithms? (ii) for an application class of interest, can the performance benefits of localization, partial synchronization, and eager scheduling of maps overcome the sub-optimality in terms of serial operation counts? and (iii) can this framework be used to deliver scalable and high performance over wide-area distributed systems?

To alleviate the overhead of global synchronization, we propose a two-level scheme: a local reduce operation applies the specified reduction function to only those key-value pairs emanating from the preceding map(s) at the same host. A global reduce operation, on the other hand, incurs significant network and file system overheads. We illustrate the use of these primitives in the context of the Pagerank and K-Means clustering algorithms. In Pagerank, the rank of a node is determined by the ranks of its neighbors. In the traditional MapReduce formulation, in every iteration, the map involves each node pushing its Pagerank to all its outlinks and the reduce accumulates all neighbors' contributions to compute the Pagerank of the corresponding node. These iterations continue until the Pageranks converge.

Consider an alternate formulation in which the graph has been partitioned, and a map now corresponds to the local-pagerank computation of all nodes within the partition. For each of the internal nodes (nodes that have no edges leaving the partition), a partial reduction accurately computes the rank (assuming the neighbors' ranks were accurate to begin with). On the other hand, boundary nodes (nodes that have edges leading to other partitions) must have a global reduction to account for remote neighbors. It follows, therefore, that if the ranks of the boundary nodes were accurate, the ranks of internal nodes could be computed simply through local iterations. Thus follows a two-level scheme wherein partitions (maps) iterate on local data to convergence and then perform a global reduction.
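For a purely illustrative, hypothetical example of this tradeoff: suppose a traditional formulation needs 30 global reductions to converge, while the partitioned formulation converges in 10 global rounds, each preceded by 5 local iterations in each of 100 partitions. The relaxed scheme then performs 10 x 5 x 100 = 5,000 partial reductions plus 10 global reductions, so the total reduction count is far higher, but the number of expensive global synchronizations drops from 30 to 10, and it is the global synchronizations that dominate running time on a wide-area platform.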
It is easy to see that this two-level scheme increases the serial operation count. Moreover, it increases the total number of synchronizations (partial + global) compared to the traditional formulation. However, and perhaps most importantly, it reduces the number of global reductions. Since this is the major overhead, the program has significantly better performance and scalability.

Indeed, optimizations such as these have been explored in the context of traditional HPC platforms as well, with some success. However, the difference in overhead between a partial and a global synchronization, in relation to the intervening useful computation, is not as large for HPC platforms. Consequently, the performance improvement is significantly amplified on distributed platforms. It also follows that performance improvements from MapReduce deployments on wide-area platforms, as compared to single-processor executions, are not expected to be significant unless the problem is scaled significantly to amortize overheads. However, MapReduce formulations are motivated primarily by the distributed nature of the underlying data and sources, as opposed to the need for parallel speedup. For this reason, performance comparisons must be made with respect to traditional MapReduce formulations, as opposed to the speedup and efficiency measures more often used in the parallel programming community. This is further reinforced by the fact that in such environments it is difficult to even estimate the resources available to a program.

While most of our development efforts and all of our validation results are in the context of pagerank and kmeans, the concepts of partial reductions combined with locality-enhancing techniques and eager map scheduling have wider applicability. In general, our semantic extensions apply to the wider class of applications that admit asynchronous algorithms: algorithms for which relaxed synchronization impacts only performance, and not correctness.

3. Technical Contributions

To address the cost of global synchronization and granularity, we propose and validate the following solutions within the MapReduce framework:

• We provide support for global and partial synchronizations. Partial synchronizations are enforced only on a subset of maps. Global synchronizations are identical to reduce operations, enforced on all the maps.
• Following partial synchronizations, the subsequent maps are scheduled in an eager fashion, i.e., as soon as the partial synchronization operation completes.
• We use partial synchronizations and eager scheduling in combination with a coarser-grained, locality-enhancing allocation of tasks to maps.
• We validate the aforementioned techniques to compute pagerank on sparse unstructured (power-law type) real web graphs and to cluster high-dimensional data using unsupervised clustering algorithms, namely K-Means. These applications are illustrative of a broader class of asynchrony-tolerant algorithms. We show that while the number of serial operations (and indeed the sum of partial and global reductions) is much higher, the reduction in the number of global reductions yields performance improvements of over 200% compared to a traditional MapReduce implementation on a 460-node cluster provided by the IBM-Google consortium as part of the CluE NSF program.
4. Proposed Semantic Extensions

As mentioned earlier, numerous applications use MapReduce iteratively. Hence, improving the performance of iterative MapReduce is very important, specifically because of the high overhead due to synchronization and the starting/stopping of MapReduce jobs (input and output data handling).

In this section, we present the semantics and API that we propose for iterative MapReduce to alleviate synchronization overheads while preserving programming simplicity.

4.1. Semantics for Iterative MapReduce

Figure 1 describes the formal syntax of our proposed language. For our purposes, lists form an interesting value in the language. MapReduce operates on lists and not on their elements. Any operation on an element of a list has to be defined in terms of an abstraction, so that our MapReduce constructs can evaluate them on the elements to compute new lists. We define L, G, and I as the local, global, and iterative versions of MapReduce, one invoked from another. These operators evaluate on a tuple ⟨e, fm, fr, l⟩ of a termination condition, a map function, a reduce function, and a list. Programs in the language are defined as applications of our iterative MapReduce applied to functions and an input list.

To capture the salient behavior of the system, we assume a set of processors and a set of associated stores. We define lookup functions, σ (local) and λ (global). We also define two different evaluation rules for MapReduce: =⇒g for the evaluation rules that change the global state, and =⇒l to describe the evaluations that change the local state.
This is further rules for MapReduce — =⇒g for the evaluation reinforced by the fact that in rules that change the global state, and =⇒ l to environments, it is difficult to even estimate the describe the evaluations that change the local state. λ ǫ Λ = L → Z (6) v ǫ Value (1) f ǫFn :: = λx.e (7)

p ǫ Processors = {P1, ..., Pm} (2) l ǫ List ::= [v1, ..., vn] (8)

Σ ǫ LocalStore = {Σ1, ..., Σm} (3) eǫP :: = f (9)

Λ ǫ GlobalStore = {Λ} (4) | Apply( I , < e, fm, fr,l>)(10) σ ǫ Σ = L → Z (5) Fig. 1. Iterative Relaxed MapReduce: Syntax

  ⟨l, σ⟩ =⇒ σ(l)   (LOCAL-LOOKUP)        ⟨l, λ⟩ =⇒ λ(l)   (GLOBAL-LOOKUP)

  Apply(I, ⟨e, fm, fr, l⟩) =⇒g I e fm fr l   (APPLY-ITER)

  (ITER-MAPRED)
      while (condg): ⟨G condl fm fr lg, λ⟩ =⇒g ⟨lg′, λ′⟩
      --------------------------------------------------
      ⟨I condg fm fr lg, λ⟩ =⇒g ⟨lg′, λ′⟩

  (MAPRED-GLOBAL)
      ch ⟨lg, λ, σ⟩ ≡ ⟨l̄l, λ, σ′⟩
      while (cond): map (L fm fr) ⟨l̄l, σ′⟩ =⇒l ⟨l̄l′, σ″⟩
      agg ⟨l̄l′, σ″, λ⟩ ≡ ⟨lg′, σ″, λ′⟩        fold fr ⟨lg′, λ′⟩ =⇒g ⟨lg″, λ″⟩
      --------------------------------------------------
      ⟨G cond fm fr lg, λ, σ⟩ =⇒g ⟨lg″, λ″, σ″⟩

  (MAPRED-LOCAL)
      map fm ⟨ll, σ⟩ =⇒l ⟨ll″, σ″⟩        fold fr ⟨ll″, σ″⟩ =⇒l ⟨ll′, σ′⟩
      --------------------------------------------------
      L fm fr ⟨ll, σ⟩ =⇒l ⟨ll′, σ′⟩

  (CHUNKIFY)
      ch lg ≡ l̄l ≡ {ll | ll ⊂ lg ∧ ⋂_{ll ∈ l̄l} ll = ∅};   ∀ li, σi[li ↦ λ(l)]

  (AGGREGATE)
      agg l̄l ≡ lg ≡ ⋃_{ll ∈ l̄l} ll;   ∀ li, λ[l ↦ li]

Fig. 2. Iterative Relaxed MapReduce: Semantics
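To make the intended operational reading of these rules concrete, the following Python sketch walks through ITER-MAPRED, MAPRED-GLOBAL, and MAPRED-LOCAL. All names and signatures are our own simplified shorthand for the formal constructs; they are not the paper's implementation or the Hadoop API.

    def chunkify(lg, nparts):
        # ch: partition the global list into disjoint local lists (one per local store)
        return [lg[i::nparts] for i in range(nparts)]

    def aggregate(chunks):
        # agg: merge the local lists back into a single global list
        return [x for chunk in chunks for x in chunk]

    def local_mapreduce(fm, fr, ll):
        # L (MAPRED-LOCAL): map fm over the local list, then fold fr locally;
        # only the local store is touched
        return fr([y for x in ll for y in fm(x)])

    def global_mapreduce(cond_l, fm, fr, lg, nparts):
        # G (MAPRED-GLOBAL): chunkify, iterate local MapReduce over each chunk
        # until the local condition holds, aggregate, then perform one global fold
        chunks = chunkify(lg, nparts)
        while not cond_l(chunks):
            chunks = [local_mapreduce(fm, fr, c) for c in chunks]
        return fr(aggregate(chunks))

    def iter_mapreduce(cond_g, cond_l, fm, fr, lg, nparts):
        # I (ITER-MAPRED): repeat global MapReduce until the global condition holds
        while not cond_g(lg):
            lg = global_mapreduce(cond_l, fm, fr, lg, nparts)
        return lg

In this reading, making cond_l terminate after a single local pass recovers the behavior of traditional iterative MapReduce, which is the property noted in the discussion below.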

Figure 2 describes the semantics. A typical program in the language is an application of IterMapReduce (I) to a tuple consisting of a termination condition, map and reduce functions, and the list of data. The computation rule [APPLY-ITER] evaluates to running the global IterMapReduce.

I, the iterative MapReduce, takes cond, map, reduce, and a list to operate on. The cond function operates on the global heap until the termination condition is met. Iterative MapReduce calls global MapReduce iteratively till this condition is met.

[MAPRED-GLOBAL] describes the behavior of global MapReduce. This operation uses another condition to determine the termination of the local MapReduce and operates on data local to the MapReduce it gets associated with. The main feature is to have a mechanism of synchronization across the local and global stores. This is achieved by our Chunkify and Aggregate functions. Chunkify (ch) partitions the global list into a sequence of multiple lists, each made available to the local MapReduce, and copies these chunks to the respective local stores. Aggregate takes care of aggregating the data from the multiple local MapReduce computations and copies the changes to the global store. During the evaluation of global MapReduce, we first Chunkify the data. We then have the required values in the local store. We subsequently apply the standard map to a curried version of local MapReduce, (L fm fr), and the local list, iteratively until it terminates. Only the local store is modified during this stage. At the end of these iterations, we need to synchronize globally, requiring us to copy data back to the global store; this is performed by the Aggregate function.

[MAPRED-LOCAL] controls the behavior of local MapReduce. This is very close in structure to the original MapReduce model, with the main difference being that it operates locally, with no global store changes, as explained in the semantics.

The semantics use while, map, and fold constructs (bold-faced in the semantics) which carry their usual functional definitions. From a systems perspective, the map/fold functions process the data given to them in parallel and collect the output.

In addition to the inputs to the regular MapReduce, our iterative version requires a termination condition. Such a termination condition has to be thought of even for iterating over traditional MapReduce. Also, our iterative MapReduce reduces to traditional MapReduce if the local condition is set to one iteration. Our semantics do not increase programming complexity.

TABLE 1. Measurement testbed
  CluE cluster:    Intel(R) Xeon(TM) CPU 2.80 GHz
  Machines (460):  4 GB RAM, 2x 400 GB hard drives
  VM:              1 VM per host
  Software:        Hadoop 0.17.3, Java 1.6
  Heap space:      1 GB per slave

4.2. API

The rigorous semantics advocate a source-to-source translator for iterative MapReduce with relaxed synchronization and eager scheduling. We formulated a simple-to-implement API to validate our claims, as discussed below.

In the traditional MapReduce API, the user provides map and reduce functions along with functions to split and format the input data. In addition to these, for iterative MapReduce applications, the user must provide functions for the termination of the global and local MapReduce iterations, and functions to convert data into the formats required by the local map and reduce functions. The Chunkify and Aggregate functions discussed in the semantics are built using these conversion functions.

To provide more flexibility to the programmer and to allow better exploitation of application-specific optimizations, we propose four new functions: gmap, greduce, lmap and lreduce (two different sets of map and reduce for the local and global versions of the operations). Global map uses pool(s) to schedule lmap and lreduce to exploit the available parallelism. The functions Emit() and EmitIntermediate() support data flow in traditional MapReduce. We introduce their local equivalents, EmitLocal() and EmitLocalIntermediate(). Aggregate iterates over the data emitted by EmitLocal() and sends it to the global reduce through the EmitIntermediate() function.

Our API has been validated in our implementations of the pagerank and kmeans algorithms used in the evaluation of these relaxed semantics.
Pager- source translator for iterative MapReduce with re- ank (more generally, computing the eigenspectra laxed synchronization and eager scheduling. We of a matrix) and kmeans (representing clustering formulated a simple-to-implement API to validate algorithms). our claims, as discussed below — In our attempt to model our experiments in a real- In the traditional MapReduce API, the user pro- life setting, we use the 460 node cluster provided by vides map and reduce functions along with the func- IBM-Google consortium as part of the CluE (Cluster tions to split and format the input data. In addition to Exploratory) NSF program. these, for iterative MapReduce applications, the user Table 1 lists the physical resources, software must provide functions for termination of global and restrictions on the cluster. Since this is a and local MapReduce iterations, and functions to shared cluster, there is the potential that several jobs convert data into the formats required by the lo- are concurrently executing during our experiments. cal map, reduce functions. The Chunkify and Nonetheless, our results strongly validate our in- Aggregate functions discussed in the semantics sights. are built using these conversion functions. To provide more flexibility to the programmer and 5.1. Implementation to allow better exploitation of application-specific optimizations, we propose four new functions — For both applications, we implemented base ver- gmap, greduce, lmap and lreduce (two different sets sions which conform to the popular MapReduce of map and reduce for the local and global versions implementations known for these applications. We of the operations). Global map uses pool(s) to also implement modified versions; relaxed synchro- schedule lmap and lreduce to exploit available par- nization and eager scheduling realized by rewriting allelism. Functions Emit() and EmitIntermediate() the global map function (gmap) to have a local MapReduce (lmap and lreduce) as described by the 5.2.1. General Pagerank. The general MapReduce following pseudo-code - implementation of pagerank iterates over a map task gmap (xs : X list) { 1that emits the pageranks of all the source nodes to while (no−local −convergence−intimated ) { 2the corresponding destinations in the graph, and a 3reduce task that accumulates the pagerank contribu- lmap (x); // emits (lkey, lval) 4tions from various sources to a single destination. 5 In the actual implementation, the map function lreduce ( ) ; 6 emits tuples of the type destination-node, pager- } 7 < 8ank contributed to this destination node by the for each value in lreduce−output { 9source>. The reduce task operates on a destination EmitIntermediate (key, value); 10node, which gathers the pageranks from its incom- } 11ing source nodes and computes a new pagerank for } 12itself. Thus, after every iteration, the nodes have When a regular map phase is given an input of renewed pageranks which propagate through the list, we give each map task a smaller graph in subsequent iterations till they converge. list (instead of an element of the list). One can observe that even a small change in the The argument to gmap is a list(xs). pagerank of one node is broadcast to all the nodes We run a local MapReduce with each element in the graph in successive iterations of MapReduce, of xs as input to lmap. We use a hashtable to incurring a potentially significant global synchro- store the intermediate and final results of the local nization cost. MapReduce. 
Applying this method to any application translates to knowing the local termination condition. The local termination can be convergence (as in Pagerank and K-Means) or a particular number of iterations.

5.2. Pagerank

The pagerank of a node is the scaled sum of the pageranks of all of its neighbors. Mathematically, the pagerank of a node d is given by the following expression:

    PR_d = (1 − χ) + χ · Σ_{(s,d) ∈ E} s.pagerank / s.outlinks        (11)

where χ is the damping factor, and s.pagerank and s.outlinks correspond to the pagerank and the out-degree of the source node, respectively.

For both implementations, the input is a graph represented as adjacency lists. We start with a pagerank of 1 for all nodes. The pageranks converge to their correct values after a few iterations of applying the above expression for each node. We define convergence as the pageranks of all nodes not changing by more than a particular threshold (10^-5 in our case) in subsequent iterations.

5.2.1. General Pagerank. The general MapReduce implementation of pagerank iterates over a map task that emits the pageranks of all the source nodes to the corresponding destinations in the graph, and a reduce task that accumulates the pagerank contributions from various sources to a single destination. In the actual implementation, the map function emits tuples of the type <destination-node, pagerank contributed to this destination node by the source>. The reduce task operates on a destination node, which gathers the pageranks from its incoming source nodes and computes a new pagerank for itself. Thus, after every iteration, the nodes have renewed pageranks which propagate through the graph in subsequent iterations till they converge. One can observe that even a small change in the pagerank of one node is broadcast to all the nodes in the graph in successive iterations of MapReduce, incurring a potentially significant global synchronization cost.

Our baseline for performance comparison is a MapReduce implementation for which maps correspond to complete partitions, as opposed to single-node updates. We use this as a baseline because the performance of this formulation was noted to be on par with or better than the adjacency-list formulation where the update of a single node is associated with a map. For this reason, our baseline provides a more competitive implementation.
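The following single-machine Python sketch condenses one such iteration of the general formulation, directly following equation (11). The dictionary-based graph layout, the function names, and the toy graph are our own illustration; the paper's implementation runs the same two phases as Hadoop map and reduce tasks over adjacency lists.

    from collections import defaultdict

    def general_pagerank_iteration(outlinks, ranks, chi=0.75):
        # map phase: each source pushes rank/outdegree to every destination it links to
        contrib = defaultdict(list)
        for s, dests in outlinks.items():
            for d in dests:
                contrib[d].append(ranks[s] / len(dests))
        # reduce phase: each destination accumulates its neighbours' contributions
        return {d: (1 - chi) + chi * sum(vs) for d, vs in contrib.items()}

    outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {n: 1.0 for n in outlinks}
    for _ in range(50):                       # in practice: until max change < 1e-5
        new = general_pagerank_iteration(outlinks, ranks)
        ranks = {n: new.get(n, 1 - 0.75) for n in ranks}
    print(ranks)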

5.2.2. Eager Pagerank. We begin our description of Eager Pagerank with an intuitive description of how the underlying algorithm accommodates asynchrony. In a graph with a power-law type distribution, one may assume that each hub is surrounded by a large number of spokes, and that inter-hub edges are comparatively infrequent. This allows us to relax strict synchronization on inter-hub edges until the subgraph in the proximity of a hub has relatively self-consistent pageranks. Disregarding the inter-hub edges does not lead to algorithmic inconsistency since, after every few local iterations of MapReduce calculating the pageranks in the subgraph, there is a global synchronization (following a global map) leading to a dissemination of the pageranks in this subgraph to other subgraphs via inter-hub edges. This propagation reimposes consistency on the global state.

Consequently, we update only the (neighboring) nodes in the smaller subgraph. We achieve this by a set of iterations of local MapReduce as described in the semantics. Evidently, this method leads to improved efficiency if each map operates on a hub or a group of topologically localized nodes. Such topology is inherent in the way we collect data, as it is crawler-induced. One can also use one-time graph partitioning using tools like Metis [17]. We use Metis, as our synthetic data set is not inherently partitioned.

In the Eager Pagerank implementation, the map task operates on a sub-graph. The local MapReduce, within the global map, computes the pagerank of the constituent nodes in the sub-graph. Hence, we run the local MapReduce to convergence. Instead of waiting for all the other global map tasks operating on different subgraphs, we eagerly schedule the next local map and local reduce iterations on the individual subgraph inside a single global map task. Upon local convergence of the sub-graphs, we synchronize globally, so that all nodes can propagate their computed pageranks to other sub-graphs. This iteration over global MapReduce runs to convergence. Such an Eager Pagerank incurs more computational cost, since local reductions may proceed with imprecise values of global pageranks. However, the pagerank of any node propagated during the global reduce is representative, in a way, of the sub-graph it belongs to. Thus, one may observe that the local and global reduce functions are functionally identical.

Note that in Eager Pagerank, the local reduce waits on a local synchronization barrier, while the local maps are executed on a single host in the cluster. The local synchronization does not incur any inter-host communication delays. This makes the local overheads considerably lower than the global overheads.
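A single global map task of Eager Pagerank can be sketched as follows. This is a simplified, single-machine Python illustration: the partition format, the names, and the convergence test are our assumptions, the local map and reduce are folded into one update step, and contributions from nodes outside the partition are only refreshed at the global synchronization, which is not modelled here.

    def eager_pagerank_gmap(partition, ranks, chi=0.75, eps=1e-5):
        # partition: {node: [outlinks within this partition]} for the nodes owned
        # by this map task; ranks: pageranks as of the last global synchronization
        local = {n: ranks[n] for n in partition}
        while True:                               # local MapReduce iterations (lmap/lreduce)
            new = {}
            for d in partition:
                incoming = [local[s] / len(dests)
                            for s, dests in partition.items() if d in dests]
                new[d] = (1 - chi) + chi * sum(incoming)
            converged = all(abs(new[d] - local[d]) < eps for d in partition)
            local = new
            if converged:                          # local convergence intimated
                break
        # EmitIntermediate: the converged local ranks feed the global reduce,
        # which disseminates them to other partitions via inter-partition edges
        return list(local.items())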
5.2.3. Input data. Table 2 describes the two graphs that are considered for our experiments on Pagerank, both conforming to power-law distributions. Our primary validation dataset is the Stanford web graph from the Stanford WebBase project [20]. It is a medium-sized crawl, conducted in 2002, of 280K nodes and about 3 million edges. To study the impact of problem size and related parameters, we also rely on synthetic graphs with power-law distributions, generated through preferential attachment [7] in igraph [13]. The algorithm used to create the synthetic graph is described below, along with its justification.

TABLE 2. Input graph properties
  Input graph       Stanford webgraph   Power-law (synthetic)
  Nodes             280,000             100,000
  Edges             3 million           3 million
  Damping factor    0.75                0.85

Preferential attachment based graph generation. The graph is generated by adding vertices one at a time, connecting each to numConn vertices already in the network, chosen uniformly at random. For each of these numConn vertices, numIn and numOut of its inlinks and outlinks are chosen uniformly at random and connected to the joining vertex. This is done for all of the nodes newly connected to the incoming vertex. This method of creating a graph is closely related to the evolution of the web: the procedure increases the probability of a highly reputed site getting linked to new sites, since it has a greater probability of being in an inlink from other randomly chosen sites.
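The growth rule just described can be sketched in stand-alone Python as follows. The paper generates its synthetic graph with igraph; this simplified version (the parameter names numConn, numIn, and numOut follow the text, everything else is our own) is only meant to illustrate how the procedure biases in-links towards already well-connected vertices.

    import random

    def generate_power_law_graph(n, num_conn=2, num_in=1, num_out=1, seed=0):
        random.seed(seed)
        out_edges = {0: set(), 1: {0}}          # seed the network with two vertices
        in_edges = {0: {1}, 1: set()}
        for v in range(2, n):
            out_edges[v], in_edges[v] = set(), set()
            # connect the new vertex to numConn existing vertices, chosen uniformly
            for t in random.sample(range(v), min(num_conn, v)):
                ins = random.sample(sorted(in_edges[t]), min(num_in, len(in_edges[t])))
                outs = random.sample(sorted(out_edges[t]), min(num_out, len(out_edges[t])))
                out_edges[v].add(t)
                in_edges[t].add(v)
                # also wire the newcomer to a few of t's existing in/out neighbours;
                # popular vertices therefore keep attracting new in-links
                for nb in ins:
                    out_edges[nb].add(v)
                    in_edges[v].add(nb)
                for nb in outs:
                    out_edges[v].add(nb)
                    in_edges[nb].add(v)
        return out_edges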

Figures 3 and 4 show the distribution of inlinks for the two input graphs. The best line fit gives the power-law exponent for the two graphs, showing their conformity with a hubs-and-spokes model. Very few nodes have a very high inlink value, emphasizing our point that very few nodes require frequent global synchronization. More often than not, even these nodes (hubs) mostly have spokes as their neighbours.

Fig. 3. Inlink distribution of the Stanford webgraph (number of links vs. number of pages, log-log scale).
Fig. 4. Inlink distribution of the synthetic power-law graph (number of links vs. number of pages, log-log scale).

Crawlers inherently induce locality in the graphs, as they crawl neighborhoods before crawling remote sites. As we also use synthetic data in addition to the Stanford Web Graph, we partition it using Metis. A good partitioning algorithm that minimizes edge-cuts has the desired effect of reducing global synchronizations as well. Metis, though not tailored for power-law graphs, does a decent partitioning of the graph. This partitioning is performed off-line (only once) and takes about [GET THIS NUMBER] seconds, which is negligible compared to the runtime of pagerank, and hence is not included in the reported numbers. There also exist other ways of generating the inputs to the map tasks [5].

5.2.4. Results. To show the dependence of performance on global synchronizations, we vary the number of iterations of the algorithm by altering the number of partitions the graph is split into. Fewer partitions result in a smaller number of larger subgraphs. Each map task does more work, which would normally result in fewer global iterations in the relaxed case. The fundamental observation here is that it takes fewer iterations to converge for a graph having already-converged subgraphs. The trends are more pronounced when the graph follows the power-law distribution more closely. In either case, the total number of iterations is lower than in the general case. For Eager Pagerank, if the number of partitions is decreased to one, the whole graph is given to one global map and its local MapReduce would compute the final pageranks of all the nodes. If the partition size is one, each partition gets a single adjacency list, and Eager Pagerank becomes regular Pagerank, because each map task operates on a single node.

Figures 5 and 6 show the number of iterations taken by the relaxed and general implementations of Pagerank, on the Stanford and synthetic webgraphs that we use for input, as we vary the number of partitions. The number of iterations does not change in the general case, as each iteration performs the same work irrespective of the number of partitions and partition sizes. The results for Eager Pagerank are consistent with our predictions: for both graphs, the number of global iterations to converge increases monotonically with the number of partitions.

Fig. 5. Number of iterations to converge (y-axis) for different numbers of partitions (x-axis) for the Stanford webgraph, for a damping factor of 0.75.
Fig. 6. Number of iterations to converge (y-axis) for different numbers of partitions (x-axis) for the synthetic webgraph, for a damping factor of 0.85.

The time to solution depends strongly on the number of iterations but is not completely determined by it. It is true that the global synchronization costs would decrease, but when we reduce the number of partitions significantly, the work to be done by each map task increases. This increase potentially results in an increased cost of computation, even more than the benefit of the decrease in communication. Hence, there is an optimal number of partitions for every application on a given cluster. For a well-partitioned graph we expect an inverted bell-curve.

Figures 7 and 8 show the runtimes for the relaxed and general implementations of Pagerank on the Stanford and synthetic webgraphs with varying numbers of partitions. One may notice the significant performance gains the relaxed case achieves over the general case for both graphs. On average, we see a 2x to 3x improvement in runtimes. The improvement in the case of the synthetic graph is expected to be more uniform and consistent, which can be attributed to its close conformance to a power-law distribution, leading to more accurate partitioning. This in turn leads to better load balancing on the map tasks along with faster convergence inside the subgraph. This is observed consistently in our experiments.

Fig. 7. Time to converge (y-axis) for various numbers of partitions (x-axis) for the Stanford webgraph, for a damping factor of 0.75.
Fig. 8. Time to converge (y-axis) for various numbers of partitions (x-axis) for the synthetic webgraph, for a damping factor of 0.85.

Finally, given that both graphs exhibit power-law characteristics, we notice their respective points of optimality: 400 partitions for the Stanford webgraph and 200 for the synthetic webgraph. We also observe that the smaller the graph, the lower this optimal number.

5.3. K-Means

K-Means is a popular technique used for unsupervised clustering. Implementation of the algorithm in the MapReduce framework is straightforward, as shown in [21, 6]. Briefly, in the map phase, every point chooses its closest cluster centroid, and in the reduce phase, every centroid is updated to be the mean of all the points which chose that particular centroid. The iterations of map and reduce phases continue until the centroid movement is below a given threshold. The Euclidean distance metric is usually used to calculate the centroid movement.

In the proposed relaxed MapReduce framework, each global map handles a unique subset of the input points. The local map and reduce iterations inside the global map cluster the given subset of the points using the common input-cluster-centroids. Once convergence is achieved in the local iterations, the global map emits the input-cluster-centroids and their associated updated-centroids. The global reduce calculates the final-centroids, each of which is the mean of all updated-centroids corresponding to a single input-cluster-centroid. The final-centroids form the input-cluster-centroids for the next iteration. These iterations continue until the input-cluster-centroids converge.

The algorithm used in the relaxed approach to K-Means is similar to the one recently shown by Yom-Tov and Slonim [25] for pairwise clustering. An important observation from their results is that the input to the global map should not be the same subset of the input points in every iteration. Every few iterations, the input points need to be partitioned differently across the global maps so as to avoid the algorithm moving towards local optima. Also, the convergence condition includes detection of oscillations along with the Euclidean metric.

We use the K-Means implementation in the normal MapReduce framework from the Apache Mahout project [16]. Sampled US Census data of 1990 from the UCI Machine Learning repository [23] was used as the clustering data for the comparison between the normal and relaxed approaches. The sample size is around 200K points, each with 68 dimensions. For both relaxed and normal K-Means, the initial centroids were chosen randomly for the sake of generality. Algorithms such as canopy clustering can be used to identify initial centroids for faster execution and better quality of the final clusters.
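As an illustration of this scheme, here is a minimal, single-machine Python sketch with one-dimensional points. The structure (local k-means per partition, unweighted global mean per input centroid) follows the description above, but all names and the toy data are our own simplification, not the Mahout-based implementation used in the experiments.

    def kmeans_gmap(points, centroids, eps=0.01):
        # one global map: run local k-means on this partition's points until the
        # centroid movement falls below eps (the local convergence condition)
        cur = list(centroids)
        while True:
            clusters = [[] for _ in cur]
            for p in points:                  # lmap: assign each point to its closest centroid
                clusters[min(range(len(cur)), key=lambda i: abs(p - cur[i]))].append(p)
            new = [sum(c) / len(c) if c else cur[i] for i, c in enumerate(clusters)]
            if max(abs(a - b) for a, b in zip(cur, new)) < eps:
                # emit (input-centroid id, locally updated centroid) for the global reduce
                return list(enumerate(new))
            cur = new

    def kmeans_greduce(emitted, k):
        # global reduce: mean of the updated centroids reported for each input centroid
        return [sum(c for i, c in emitted if i == j) /
                max(1, sum(1 for i, _ in emitted if i == j)) for j in range(k)]

    emitted = []
    for part in ([1.0, 1.2, 0.9], [5.0, 5.5], [0.8, 5.2]):   # three global map tasks
        emitted += kmeans_gmap(part, [1.0, 5.0])
    print(kmeans_greduce(emitted, k=2))       # roughly [0.94, 5.15]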
Figure ?? shows the number of iterations and the time taken to converge for a particular dataset of 200K points with varying thresholds. All the experiments were conducted on the 460-node cluster described above. To eliminate the impact of the random partitioning performed after every 5 iterations (chosen heuristically), each experiment was run 10 times and the average number of iterations and time taken were considered. We observe a consistent speedup of about 2 with respect to the parallel K-Means algorithm implemented in traditional MapReduce.

The number of local MapReduce iterations inside each global map increases as the threshold for convergence decreases. It is important to observe from the figure that this increase in computation does not translate to an increase in the overall time. The total time-to-converge depends on the number of global synchronizations (iterations). Also, global synchronization costs increase with the cluster size, implying scalable performance gains.

Figure 9 shows the number of iterations and time-to-converge with a varying number of partitions (disjoint subsets of points). A global map applies local MapReduce clustering on its input partition. An increase in the number of partitions decreases the size of each subset of points. The normal K-Means clustering takes 18 iterations for this dataset to converge for a threshold of 0.01. From the graph, one can observe that (1) relaxed K-Means is significantly faster for the partitions considered, and (2) as in Pagerank, there exists an optimal number of partitions, namely at 109 partitions.

Fig. 9. Iterations and time-to-converge for different numbers of partitions.

6. Discussion

Generality. Our relaxed synchronization mechanisms can be generalized to broad classes of applications. The pagerank application, which relies on an asynchronous mat-vec, is representative of eigenvalue/linear system solvers (computing eigenvectors using the power method of repeated multiplications by a unitary matrix). A wide range of similar applications requiring the spectra of graphs, global alignments, and random walks can be computed using this algorithmic template. For all of these applications, our methods and results are directly applicable. Algorithms such as shortest paths over sparse graphs and graph alignment can be directly cast into our framework. Beyond graphs, the asynchronous mat-vec forms the core of iterative linear system solvers. Asynchronous kmeans clustering immediately validates the usefulness of our approach in various clustering and data-mining applications. Finally, the goal of this paper is to examine tradeoffs between serial operation count and distributed performance. These tradeoffs manifest themselves in a much wider application class.

Programming Complexity. While relaxed synchronization requires slightly more programming effort than traditional MapReduce, we argue that the programming complexity is not substantial. This is also evidenced by the simplicity of the semantics used in the paper to describe it.
Fault-tolerance. While our approach relies on existing MapReduce mechanisms for fault tolerance, in the event of failure(s) our recovery times may be slightly longer, as each map task is longer and re-execution would take longer. However, all of our results are reported on a production wide-area cluster of 460 nodes, with real-life transient failures. This leads us to believe that the overhead is not significant.

Scalability. In general, it is difficult to estimate the resources available to, and used by, a program execution in a Cloud computing environment. We draw our assertions on the scalability of our formulation by inferring resource availability from the number of graph partitions. Please note that these inferences are meant to be qualitative in nature, and not based on raw data (since the data is not made available by design). Typically, clusters run on the order of 10 map tasks per node. Since each map handles one complete partition of the graph, for very large numbers of partitions (e.g., 6400), we potentially use the entire 460 nodes in the Google cluster for the map phase. Such high node utilization incurs heavy network delays during copying and merging before the reduce phase, leading to increased synchronization overheads. The significant speedups obtained using our semantics for iterative MapReduce in such large data center environments demonstrate that our solution is indeed scalable.

7. Related work

Over the past few years, the MapReduce programming model has gained attention primarily because of its simple programming model and the wide range of underlying hardware environments. There have been efforts exploring both the systems aspects as well as the application base for MapReduce.

Chen et al. [4] describe their experiences with large data-intensive tasks – analyzing the NetFlix user data and scene matching in GPS-tagged images – along with compute-intensive tasks such as earthquake simulations. They observe that small application-specific optimizations greatly improve the performance and runtime. For data-intensive tasks, they show that intelligent partitioning of the input data to maps heavily reduces network communication during global synchronization. They also note that several applications are entirely data parallel, obviating the need for a reduce operation. They argue that the MapReduce runtime (e.g., Hadoop) should be optimized for such reduction-free computations. They also propose indexing of inputs and intermediate data to enable faster access, provenance tracking for linkage and dependence, and sampling of input data to learn characteristics such as skews in the map and reduce computations. HBase [11] and HyperTable [12] are ongoing projects that use the primitives of HDFS (the Hadoop Distributed File System) to provide a distributed, column-oriented store for efficient indexing of data being used in the MapReduce system.
A number of efforts [15, 22, 1] target optimizations to the MapReduce runtime and scheduling systems. Proposals include dynamic resource allocation to fit job requirements and system capabilities, and detecting and eliminating bottlenecks within a job. Such improvements, combined with our efficient application semantics, would significantly increase the scope and scalability of MapReduce applications. The simplicity of the MapReduce programming model has also motivated its use in traditional shared memory systems [21].

A dominant component of a typical Hadoop execution corresponds to the underlying communication and I/O. This happens even though the MapReduce runtime attempts to reduce communication by trying to instantiate a task at the node or the rack where the data is present. Afrati et al. [2] study this important problem and propose alternate computational models for sorting applications to reduce communication between hosts in different racks. Our extended semantics deal with the same problem, but from the application's perspective, irrespective of the underlying hardware resources.

There have been recent efforts aimed at bridging the gap between relational databases and MapReduce [24]. These efforts propose a merge phase after the reduce phase to compute joins on the outputs of various reductions. Olston et al. [18] propose an SQL-style declarative language, Pig Latin, on top of the MapReduce model. The underlying compiler deals with the automatic creation of the procedural map and reduce functions for data-parallel operations. Such languages further enhance programmability and allow the system to optimize execution. Dryad [14], DryadLINQ [26], and Sawzall [19] aim to make MapReduce an implicit low-level programming primitive. A program written in LINQ can be parsed to form a DAG, which is then used by the Dryad system to schedule the data-parallel portions on a distributed cluster. Most of these efforts are aimed at bringing MapReduce programming primitives into high-level languages and also at overcoming the rigidity of MapReduce semantics. They are also helpful in the logical separation of data-parallel and task-parallel regions in a general application.

All of the aforementioned efforts try to improve the efficiency of MapReduce by adding software layers on top of existing MapReduce semantics or by improving the underlying runtime. In contrast, we aim to extend MapReduce semantics to leverage algorithmic asynchrony in important application classes.

8. Future Work

The myriad trade-offs associated with the wide range of overheads on different platforms pose intriguing challenges. We identify some of these challenges as they relate to our proposed solutions:

Generality of semantic extensions. We have demonstrated the use of partial synchronization and eager scheduling in the context of a specific application class. While we have argued in favor of their broader applicability, these claims must be quantitatively established. Several task-parallel applications with complex interactions are not naturally suited to traditional MapReduce formulations. Are the proposed set of semantic extensions adequate for such applications? MapReduce tries to strike a balance between programmability and performance. It attempts to divest the programmer of the burdens of optimizing locality, scheduling, and communication. In doing so, it often compromises performance. These trade-offs must be viewed in the context of the underlying platforms and applications.

Implications for tightly coupled systems. MapReduce has been shown to be an effective alternative to Pthreads even in shared-memory systems [21] for specific applications. It is important to note that shared-memory MapReduce has a much larger design space because of the higher control over the spawned map and reduce tasks (thread pools). A number of performance optimizations are possible here, ranging from pipelining iterates of map and reduce operations to reordering map and reduce operations in order to aggregate maps and enhance computation-communication characteristics. Other optimizations involve speculative execution of maps, relying on promises as the outputs of maps as opposed to the values themselves. This rich space of optimizations bears significant research investigation.

Integration of proposed semantics with high-level programming languages. Our proposed semantics and their variants can be integrated with SQL-type declarative languages, namely Pig Latin [18] and DryadLINQ [26]. The resulting system would potentially improve the efficiency of MapReduce significantly, while decreasing the load on application programmers to make decisions over the distribution of data.

Optimal granularity for maps. As shown in our work, as well as in the results of others, the performance of a MapReduce program is a sensitive function of map granularity. An automated technique, based on execution traces and sampling [9], can potentially deliver these performance increments without burdening the programmer with locality-enhancing aggregations.
System-level enhancements. Often, when executing iterative MapReduce programs, the output of one iteration is needed in the next iteration. Currently, the output from a reduction is written to the distributed file system (DFS) and must be accessed from the DFS by the next set of maps, which involves significant overhead. Using online data structures (for example, Bigtable) provides credible alternatives; however, issues of fault tolerance must be resolved.

9. Conclusion

In this paper, we propose extended semantics for MapReduce based on partial synchronizations and eager map scheduling. We demonstrate that, when combined with locality-enhancing techniques and algorithmic asynchrony, these extensions are capable of yielding significant performance improvements. We demonstrate our results in the context of the problem of computing pageranks on a web graph. Our results strongly motivate the use of partial synchronizations for broad application classes. Finally, these enhancements in performance do not adversely impact the programmability and fault-tolerance features of the underlying MapReduce framework.

References

[1] Improving MapReduce Performance in Heterogeneous Environments. San Diego, CA, December 2008. USENIX Association.
[2] Foto N. Afrati and Jeffrey D. Ullman. A New Computation Model for Rack-based Computing. http://infolab.stanford.edu/~ullman/pub/mapred.pdf.
[3] R. E. Bryant. Data-intensive supercomputing: The case for DISC. Technical Report CMU-CS-07-128, Carnegie Mellon University, 2007.
[4] Shimin Chen and Steven W. Schlosser. Map-reduce meets wider varieties of applications. Technical Report IRP-TR-08-05, Intel Research Pittsburgh, 2008.
[5] Vijayanand Chokkapu and Asif ul Haque. Pagerank calculation using MapReduce. Technical report, Cornell Web Lab, 2008. http://www.infosci.cornell.edu/weblab/papers/Chokkapu2008.pdf.
[6] Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-reduce for machine learning on multicore. Advances in Neural Information Processing Systems 19, pages 281–288.
[7] D. J. de S. Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27:292–306, 1976.
[8] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. Proc. of OSDI, 2004.
[9] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification, 2nd edition, Chapter 8. A Wiley-Interscience Publication, 2001.
[10] Hadoop. http://hadoop.apache.org.
[11] HBase. http://hadoop.apache.org/hbase.
[12] Hypertable. http://www.hypertable.org.
[13] The igraph library. http://igraph.sourceforge.net/.
[14] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, 2007.
[15] Karthik Kambatla, Abhinav Pathak, and Himabindu Pucha. Towards optimizing Hadoop provisioning for the cloud. 1st Workshop on Hot Topics in Cloud Computing (HotCloud), 2009.
[16] Apache Mahout. http://lucene.apache.org/mahout.
[17] METIS. http://glaros.dtc.umn.edu/gkhome/views/metis.
[18] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: A not-so-foreign language for data processing. SIGMOD '08: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2008.
[19] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 13(4):227–298, 2003.
[20] Stanford WebBase Project. http://diglib.stanford.edu:8091/~testbed/doc2/WebBase/.
[21] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. Proceedings of the 13th Intl. Symposium on High-Performance Computer Architecture (HPCA), Phoenix, AZ, 2007.
[22] Thomas Sandholm and Kevin Lai. MapReduce optimization using dynamic regulated prioritization. SIGMETRICS/Performance '09: Proceedings of the 2009 Joint International Conference on Measurement and Modeling of Computer Systems, 2009.
[23] UCI Machine Learning Repository: US Census Data (1990). http://kdd.ics.uci.edu/databases/census1990/USCensus1990.html.
[24] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker Jr. Map-reduce-merge: Simplified relational data processing on large clusters. SIGMOD '07: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007.
[25] Elad Yom-Tov and Noam Slonim. Parallel pairwise clustering. SDM '09: Proceedings of the SIAM Data Mining Conference, 2009.
[26] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. Symposium on Operating Systems Design and Implementation (OSDI), San Diego, CA, 2008.