Scalable Big Graph Processing in MapReduce

Lu Qin†, Jeffrey Xu Yu‡, Lijun Chang§, Hong Cheng‡, Chengqi Zhang†, Xuemin Lin§♮

†Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Australia; ‡The Chinese University of Hong Kong, China; §The University of New South Wales, Australia; ♮East China Normal University, China

†{lu.qin,chengqi.zhang}@uts.edu.au ‡{yu,hcheng}@se.cuhk.edu.hk §{ljchang,lxue}@cse.unsw.edu.au

ABSTRACT

MapReduce has become one of the most popular parallel computing paradigms in cloud, due to its high scalability, reliability, and fault-tolerance achieved for a large variety of applications in big data processing. In the literature, there are the MapReduce Class MRC and the Minimal MapReduce Class MMC to define the memory consumption, communication cost, CPU cost, and number of MapReduce rounds for an algorithm to execute in MapReduce. However, neither of them is designed for big graph processing in MapReduce, since the constraints in MMC can hardly be achieved simultaneously on graphs and the conditions in MRC may induce scalability problems when processing big graph data. In this paper, we study scalable big graph processing in MapReduce. We introduce a Scalable Graph processing Class SGC by relaxing some constraints in MMC to make it suitable for scalable graph processing. We define two graph join operators in SGC, namely, NE join and EN join, using which a wide range of graph algorithms can be designed, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Remarkably, to the best of our knowledge, for the two fundamental graph problems CC and MSF computation, this is the first work that can achieve O(log(n)) MapReduce rounds with O(n + m) total communication cost in each round and constant memory consumption on each machine, where n and m are the number of nodes and edges in the graph respectively. We conducted extensive performance studies using two web-scale graphs Twitter-2010 and Friendster with different graph characteristics. The experimental results demonstrate that our algorithms can achieve high scalability in big graph processing.

Categories and Subject Descriptors

H.2.4 [Information Systems]: Database Management—Systems

Keywords

Graph; MapReduce; Big Data

1. INTRODUCTION

As one of the most popular paradigms for processing big data, MapReduce [10] has been widely used in a lot of companies such as Google, Facebook, Yahoo, and Amazon to process a large amount of data in the order of tera-bytes every day. The success of MapReduce is due to its high scalability, reliability, and fault-tolerance achieved for a large variety of applications and its easy-to-use programming model that allows developers to develop parallel data-driven algorithms in a distributed shared nothing environment. A MapReduce algorithm executes in rounds. Each round has three phases: map, shuffle, and reduce. The map phase generates a set of key-value pairs using a map function, the shuffle phase transfers the key-value pairs into different machines and ensures that key-value pairs with the same key arrive at the same machine, and the reduce phase processes all key-value pairs with the same key using a reduce function.

Motivation: In the literature, there are studies that define algorithm classes in MapReduce in terms of memory consumption, communication cost, CPU cost, and the number of rounds. Karloff et al. [23] give the first attempt, in which the MapReduce Class (MRC) is proposed. MRC defines the maximal requirements for an algorithm to execute in MapReduce, in the sense that if any condition in MRC is violated, running the algorithm in MapReduce is meaningless. Nevertheless, a better class is highly demanded to guide the development of more stable and scalable MapReduce algorithms. Thus, Tao et al. [41] introduce the Minimal MapReduce Class (MMC), in which several aspects can achieve optimality simultaneously. A lot of important database problems including sorting and sliding aggregation can be solved in MMC. However, MMC is still incapable of solving a large range of problems, especially those involved in graph processing, which is an important branch of big data processing. The reasons are twofold. First, a graph usually has some inherent characteristics which make it hard to achieve high parallelism. For example, a graph is usually unstructured and highly irregular, making the locality of the graph very poor [30]. Second, the loosely synchronised shared nothing computing structure in MapReduce makes it difficult to achieve high workload balancing and low communication cost simultaneously as defined in MMC when processing graphs (see Section 3 for more details). Motivated by this, in this paper, we relax some conditions in MMC and define a new class of MapReduce algorithms that is more suitable for scalable big graph processing.

Contributions: We make the following contributions in this paper.

(1) New class defined for scalable graph processing: We define a new class SGC for scalable graph processing in MapReduce. We aim at achieving three goals: scalability, stability, and robustness.

Scalability requires an algorithm to achieve good speed-up w.r.t. the number of machines used. Stability requires an algorithm to terminate in a bounded number of rounds. Robustness requires that an algorithm never fails regardless of how much memory each machine has. SGC relaxes two constraints defined in MMC, namely, the communication cost on each machine and the total number of rounds. For the former, we define a new cost that balances the communication in a random manner, where the randomness is related to the degree distribution of the graph. For the latter, we relax the O(1) rounds defined in MMC to O(log(n)), where n is the graph node number. In addition, we require the memory used in each machine to be loosely related to the size of the input data, in order to achieve high robustness. Such a condition is even stronger than that defined in MMC. The robustness requirement is highly demanded by a commercial database system, with which a database administrator does not need to worry that the data grows too large to reside entirely in the total memory of the machines.

(2) Two elegant graph operators defined to solve a large range of graph problems: We define two graph join operators, namely, NE join and EN join. NE join propagates information from nodes to their adjacent edges and EN join aggregates information from adjacent edges to nodes. Both NE join and EN join can be implemented in SGC. Using the two graph join operators, a large range of graph algorithms can be designed in SGC, including PageRank, breadth first search, graph keyword search, Connected Component (CC) computation, and Minimum Spanning Forest (MSF) computation. Especially, for CC and MSF computation, it is non-trivial to solve them using graph operators in SGC. To the best of our knowledge, for the two fundamental graph problems, this is the first work that can achieve O(log(n)) MapReduce rounds with O(n + m) total communication cost in each round and constant memory consumption on each machine, where n and m are the number of nodes and edges in the graph respectively. We believe our findings on MapReduce can also guide the development of scalable graph processing algorithms in other systems in cloud.

(3) Unified graph processing system: In all of our algorithms, we enforce the input and output of any graph operation to be either a node table or an edge table, with which a unified graph processing system can be designed. The benefits are twofold. First, the unified input and output make the system self-containable, on top of which more complex graph processing tasks can be designed by chaining several graph queries together. For example, one may want to find all connected components of the subgraph induced by nodes that are related to photography and hiking. This can be done by chaining three graph queries: a graph keyword search query, an induced subgraph query, and a connected component query, all of which are studied in this paper. Second, by chaining multiple queries with the unified input/output, more query optimization techniques can be developed and integrated into the system.

(4) Extensive performance studies: We conducted extensive performance studies using two real web-scale graphs Twitter-2010 and Friendster, both of which have billions of edges. Twitter-2010 has a smaller diameter with a skewed degree distribution, and Friendster has a larger diameter with a more uniform degree distribution. All of our algorithms can achieve high scalability on the two datasets.

Outline: Section 2 presents the preliminary. Section 3 introduces the scalable graph processing class SGC and two graph operators NE join and EN join in SGC. Section 4 presents three basic graph algorithms in SGC. Sections 5 and 6 show how to compute CC and MSF in SGC respectively. Section 7 evaluates the algorithms in SGC using extensive experiments. Section 8 reviews the related work and Section 9 concludes the paper.

2. PRELIMINARY

In this section, we introduce the MapReduce framework and review the two algorithm classes in MapReduce in the literature.

2.1 The MapReduce Framework

MapReduce, introduced by Google [10], is a programming model that allows developers to develop highly scalable and fault-tolerant parallel applications to process big data in a distributed shared nothing environment. A MapReduce algorithm executes in rounds. Each round involves three phases: map, shuffle, and reduce. Assuming that the input data is stored in a distributed file system as a set of key-value pairs, the three phases work as follows.

• Map: In this phase, each machine reads a part of the key-value pairs {(k_i^m, v_j^m)} from the distributed file system and generates a new set of key-value pairs {(k_i^s, v_j^s)} to be transferred to other machines in the shuffle phase.
• Shuffle: The key-value pairs {(k_i^s, v_j^s)} generated in the map phase are shuffled across all machines. At the end of the shuffle phase, all the key-value pairs {(k_i^s, v_1^s), (k_i^s, v_2^s), ...} with the same key k_i^s are guaranteed to arrive at the same machine.
• Reduce: Each machine groups the key-value pairs with the same key k_i^s together as (k_i^s, {v_1^s, v_2^s, ...}), from which a new set of key-value pairs {(k_i^r, v_j^r)} is generated and stored in the distributed file system to be processed in the next round.

Two functions need to be implemented in each round: a map function and a reduce function. A map function determines how to generate {(k_i^s, v_j^s)} from {(k_i^m, v_j^m)}, whereas a reduce function determines how to generate {(k_i^r, v_j^r)} from (k_i^s, {v_1^s, v_2^s, ...}).
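To make the round structure concrete, the following is a minimal sketch of one MapReduce round in Hadoop's Java API; it counts, for each node, the number of its outgoing edges in a tab-separated edge table. The class and field names are illustrative and the job configuration is omitted; this is not taken from the paper's implementation.

// Sketch of one MapReduce round (map + reduce); illustrative only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DegreeCount {
  // Map phase: read one edge "id1 \t id2" per line and emit (id1, 1).
  public static class DegreeMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] edge = line.toString().split("\t");
      ctx.write(new Text(edge[0]), ONE);          // key: source node id
    }
  }
  // Reduce phase: all pairs with the same node id arrive together; sum them.
  public static class DegreeReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text id, Iterable<IntWritable> ones, Context ctx)
        throws IOException, InterruptedException {
      int degree = 0;
      for (IntWritable one : ones) degree += one.get();
      ctx.write(id, new IntWritable(degree));     // key-value pair for the next round
    }
  }
}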
2.2 Algorithm Classes in MapReduce

In the literature, two algorithm classes have been introduced in MapReduce, in terms of disk usage, memory usage, communication cost, CPU cost, and number of MapReduce rounds. Let S be the set of objects in the problem and t be the number of machines in the system. The two classes are defined as follows.

MapReduce Class MRC: The MapReduce Class MRC is introduced by Karloff et al. [23]. Fix an ϵ > 0; a MapReduce algorithm in MRC should have the following properties:

• Disk: Each machine uses O(|S|^{1-ϵ}) disk space. The total disk space used is O(|S|^{2-2ϵ}).
• Memory: Each machine uses O(|S|^{1-ϵ}) memory. The total memory used is O(|S|^{2-2ϵ}).
• Communication: In each round, each machine sends/receives O(|S|^{1-ϵ}) data. The total communication cost is O(|S|^{2-2ϵ}).
• CPU: In each round, the CPU consumption on each machine is O(poly(|S|)), i.e., polynomial to |S|.
• Number of rounds: The number of rounds is O(log^i |S|) for a constant i ≥ 0.
Minimal MapReduce Class MMC: The Minimal MapReduce Class MMC is introduced by Tao et al. [41], which aims to achieve outstanding efficiency in multiple aspects simultaneously. A MapReduce algorithm in MMC should have the following properties:

• Disk: Each machine uses O(|S|/t) disk space. The total disk space used is O(|S|).
• Memory: Each machine uses O(|S|/t) memory. The total memory used is O(|S|).
• Communication: In each round, each machine sends/receives O(|S|/t) data. The total communication cost is O(|S|).
• CPU: In each round, the CPU consumption on each machine is O(T_seq/t), where T_seq is the time to solve the same problem on a single sequential machine.
• Number of rounds: The number of rounds is O(1).

3. SCALABLE GRAPH PROCESSING

MRC defines the basic requirements for an algorithm to execute in MapReduce, whereas MMC requires several aspects to achieve optimality simultaneously in a MapReduce algorithm. In the following, we analyze the problems involved in MRC and MMC in graph processing and propose a new class SGC which is suitable for scalable graph processing in MapReduce.

We first analyze MMC. Consider a graph G(V,E) with n = |V| nodes and m = |E| edges. A common graph operation is to exchange data among all adjacent nodes (nodes that share a common edge) in the graph G. The memory constraint in MMC requires that all edges/nodes are distributed evenly among all machines in the system. Let E_{i,j} be the set of edges (u,v) in G such that u is in machine i and v is in machine j. The communication constraint in MMC can be formalized as follows:

  max_{1≤i≤t} ( Σ_{1≤j≤t, j≠i} |E_{i,j}| ) ≤ O((n+m)/t)    (1)

This requires minimizing max_{1≤i≤t}(Σ_{1≤j≤t, j≠i} |E_{i,j}|), which is an NP-hard problem [13]. Furthermore, even if the optimal solution is computed, it is not guaranteed that min(max_{1≤i≤t}(Σ_{1≤j≤t, j≠i} |E_{i,j}|)) ≤ O((n+m)/t). Thus, MMC is not suitable to define a graph algorithm in MapReduce.

Next, we discuss MRC. Since MRC defines the basic conditions that a MapReduce algorithm should satisfy, a graph algorithm in MapReduce is not an exception. However, like MMC, a better class is always desirable for more stable and scalable graph processing in MapReduce. Given a graph G(V,E) with n nodes and m edges, and assuming that m ≥ n^{1+c}, the authors in [23] define a class based on MRC for graph processing in MapReduce, in which a MapReduce algorithm has the following properties:

• Disk: Each machine uses O(n^{1+c/2}) disk space. The total disk space used is O(m^{1+c/2}).
• Memory: Each machine uses O(n^{1+c/2}) memory. The total memory used is O(m^{1+c/2}).
• Communication: In each round, each machine sends/receives O(n^{1+c/2}) data. The total communication cost is O(m^{1+c/2}).
• CPU: In each round, the CPU consumption on each machine is O(poly(m)), i.e., polynomial to m.
• Number of rounds: The number of rounds is O(1).

Such a class has a good property that the algorithm runs in constant rounds. However, it requires each machine to use O(n^{1+c/2}) memory, which can be large even for a dense graph. When the memory of each machine cannot hold O(n^{1+c/2}) data, the algorithm fails no matter how many machines are used in the system. Thus, the class is not scalable to handle a graph with large n.

3.1 The Scalable Graph Processing Class SGC

We now explore a better class that is suitable for graph processing in MapReduce. We aim at defining a MapReduce class in which a graph algorithm has the following three properties.

• Scalability: The algorithm can always be sped up by adding more machines.
• Stability: The algorithm stops in a bounded number of rounds.
• Robustness: The algorithm never fails regardless of how much memory each machine has.

It is difficult to distribute the communication cost evenly among all machines for a graph algorithm in MapReduce. The main reason is the skewed degree distribution (e.g., power-law distribution) of a large range of real-life graphs, in which some nodes may have very high degrees. Hence, instead of using O((m+n)/t) as the upper bound of communication cost per machine, we define a weaker bound, denoted Õ(m/t, D(G,t)). Suppose the nodes are uniformly distributed among all machines, denote by V_i the set of nodes stored in machine i for 1 ≤ i ≤ t, and let d_j be the degree of node v_j in the input graph; Õ(m/t, D(G,t)) is defined as:

  Õ(m/t, D(G,t)) = O( max_{1≤i≤t} ( Σ_{v_j ∈ V_i} d_j ) )    (2)

  D(G,t) = ((t-1)/t^2) Σ_{v_j ∈ V} d_j^2    (3)

Lemma 3.1: Let x_i (1 ≤ i ≤ t) be the communication cost upper bound for machine i, i.e., x_i = Σ_{v_j ∈ V_i} d_j. The expected value of x_i is E(x_i) = 2m/t, and the variance of x_i is Var(x_i) = D(G,t). ✷

The proof of Lemma 3.1 is omitted due to space limitation. Note that the variance of the degree distribution of G, denoted Var(G), is (Σ_{v_j ∈ V}(d_j - 2m/n)^2)/n = (n Σ_{v_j ∈ V} d_j^2 - 4m^2)/n^2. For fixed t, n, and m values, minimizing D(G,t) is equivalent to minimizing Var(G). In other words, the variance of the communication cost for each machine is minimized if all nodes in the graph have the same degree. We define the scalable graph processing class SGC below.

Scalable Graph Processing Class SGC: A graph MapReduce algorithm in SGC should have the following properties:

• Disk: Each machine uses O((m+n)/t) disk space. The total disk space used is O(m+n). This is the minimal requirement, since we need at least O(m+n) disk space to store the data.
• Memory: Each machine uses O(1) memory. The total memory used is O(t). This is a very strong constraint, to ensure the robustness of the algorithm. Note that the memory defined here is the memory used in the map and reduce phases. There is also memory used in the shuffle phase, which is usually predefined by the system and is independent of the algorithm.
• Communication: In each round, each machine sends/receives Õ(m/t, D(G,t)) data, and the total communication cost is O(m+n), where G is the input graph in the round.
• CPU: In each round, the CPU cost on each machine is Õ(m/t, D(G,t)), where G is the input graph in the round. The CPU cost defined here is the cost spent in the map and reduce phases.
• Number of rounds: The number of rounds is O(log(n)).

Discussion: For the memory constraint, SGC only requires each machine to use constant memory; that is to say, even if the total memory of the system is smaller than the input data, the algorithm can still be processed successfully. This is an even stronger constraint than that defined in MMC. Nevertheless, we give the flexibility for the algorithm to run other query optimization tasks using the free memory, which can be studied orthogonally to our work. Given the constraints on memory, communication, and CPU, it is nearly impossible for a wide range of graph algorithms to be processed in constant rounds in MapReduce. Thus, we relax the O(1) rounds defined in MMC to O(log(n)) rounds, which is reasonable since O(log(n)) is the processing time lower bound for a large number of parallel graph algorithms in the parallel random-access machine model, and is practical for the MapReduce framework as evidenced by our experiments. The comparison of the three classes for graph processing in MapReduce is shown in Table 1.

Table 1: Graph Algorithm Classes in MapReduce

                            MRC              MMC            SGC
  Disk/machine              O(n^{1+c/2})     O((n+m)/t)     O((n+m)/t)
  Disk/total                O(m^{1+c/2})     O(n+m)         O(n+m)
  Memory/machine            O(n^{1+c/2})     O((n+m)/t)     O(1)
  Memory/total              O(m^{1+c/2})     O(n+m)         O(t)
  Communication/machine     O(n^{1+c/2})     O((n+m)/t)     Õ(m/t, D(G,t))
  Communication/total       O(m^{1+c/2})     O(n+m)         O(n+m)
  CPU/machine               O(poly(m))       O(T_seq/t)     Õ(m/t, D(G,t))
  CPU/total                 O(poly(m))       O(T_seq)       O(n+m)
  Number of rounds          O(1)             O(1)           O(log(n))
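The quantities in Lemma 3.1 can be computed from the degree sequence alone. The following small sketch, which is illustrative only and not part of the paper's implementation, evaluates the per-machine expected load 2m/t and the bound D(G,t) of Eq. (3) for a given degree array; this is one way to estimate how degree skew affects the communication bound of Eq. (2).

// Sketch: estimate E(x_i) = 2m/t and D(G,t) = (t-1)/t^2 * sum(d_j^2)
// from a degree sequence; illustrative only.
public class CommunicationBound {
  public static void main(String[] args) {
    long[] degree = {1, 2, 2, 3, 100};   // hypothetical degree sequence d_j
    int t = 4;                           // number of machines
    long sumDeg = 0, sumDegSq = 0;
    for (long d : degree) {
      sumDeg += d;                       // sum of degrees = 2m for an undirected graph
      sumDegSq += d * d;
    }
    double expectedLoad = (double) sumDeg / t;                          // E(x_i) = 2m/t
    double variance = ((double) (t - 1) / ((double) t * t)) * sumDegSq; // D(G,t)
    System.out.printf("E(x_i)=%.2f, D(G,t)=%.2f%n", expectedLoad, variance);
  }
}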

3.2 Two Graph Operators in SGC

We assume that a graph G(V,E) is stored in a distributed file system as a node table V and an edge table E. Each node in the table has a unique id and some other information such as label and keywords. Each edge in the table has id1, id2 defining the source and target node ids of the edge, and some other information such as weight and label. We use the node id to represent the node if it is obvious. G can be either directed or undirected. For an undirected graph, each edge is stored as two edges (id1, id2) and (id2, id1). In the following, we introduce two graph operators in SGC, namely, NE join and EN join, using which a large range of graph problems can be designed.

3.2.1 NE Join

An NE join aims to propagate the information on nodes into edges, i.e., for each edge (v_i, v_j) ∈ E, an NE join outputs an edge (v_i, v_j, F(v_i)) (or (v_i, v_j, F(v_j))), where F(v_i) (or F(v_j)) is a set of functions operated on v_i (or v_j) in the node table V. Given a node table V_i and an edge table E_j, an NE join of V_i and E_j can be formulated using the following algebra:

  Π_{id1, id2, f1(c1)→p1, f2(c2)→p2, ...}(σ_{cond(c)}((V_i → V) ✶^{NE}_{id,id'} (E_j → E))) | C(cond'(c')) → cnt    (4)

or equivalently in the following SQL form:

  select id1, id2, f1(c1) as p1, f2(c2) as p2, ...
  from V_i as V NE join E_j as E on V.id = E.id'
  where cond(c)
  count cond'(c') as cnt

where each of c, c', c1, c2, ... is a subset of fields in the two tables V_i and E_j, f_k is a function operated on the fields c_k, and cond and cond' are two functions that return either true or false defined on the fields in c and c' respectively. id' can be either id1 or id2. The count part counts the number of trues returned by cond'(c') and the number is assigned to a counter cnt, which is useful in determining a termination condition for an iterative algorithm.

NE join in MapReduce: The NE join operation can be implemented in MapReduce as follows. Let the set of fields used in V be c_v, and the set of fields used in E be c_e. In the map phase, for each node v ∈ V, the values in c_v with key v.id are emitted as a key-value pair (v.id, v.c_v). For each edge e ∈ E, the values in c_e with key e.id' are emitted as a key-value pair (e.id', e.c_e). In the reduce phase, for each node id, the set of key-value pairs {(id, v.c_v), (id, e1.c_e), (id, e2.c_e), ...} can be processed as a data stream without loading the whole set into memory. Assuming that (id, v.c_v) comes first before all other key-value pairs (id, e_i.c_e) in the stream (this can be implemented as a secondary sort in MapReduce), the algorithm first loads (id, v.c_v) into memory and then processes each (id, e_i.c_e) one by one. For a certain (id, e_i.c_e), the algorithm checks cond(c). If cond(c) returns false, the algorithm skips (id, e_i.c_e) and continues to process the next (id, e_{i+1}.c_e). Otherwise, the algorithm calculates all f_j(c_j) and cond'(c') from (id, v.c_v) and (id, e_i.c_e), outputs the selected fields as a single tuple into the distributed file system, and increases cnt if cond'(c') returns true. It is easy to see that NE join belongs to SGC.
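As a concrete illustration of the streaming reduce phase above, the following sketch processes the value stream for one node id, with the node record guaranteed to come first by secondary sort. It is illustrative only; the record layout and helper names are assumptions and not the paper's code.

// Sketch of the NE join reduce function for one key (node id).
// Only O(1) state (the node tuple) is kept in memory.
import java.util.Iterator;
import java.util.List;

public class NEJoinReduceSketch {
  // A "tuple" here is simply the list of projected field values.
  static void reduce(String id, Iterator<List<String>> values,
                     List<List<String>> output, long[] cnt) {
    List<String> nodeFields = values.next();        // (id, v.c_v) comes first
    while (values.hasNext()) {
      List<String> edgeFields = values.next();      // (id, e_i.c_e)
      if (!cond(nodeFields, edgeFields)) continue;  // filter: where cond(c)
      output.add(project(nodeFields, edgeFields));  // compute f_1(c_1), f_2(c_2), ...
      if (condPrime(nodeFields, edgeFields)) cnt[0]++;  // count cond'(c') as cnt
    }
  }
  // The predicates and projection are query-specific; stubs are shown for completeness.
  static boolean cond(List<String> v, List<String> e) { return true; }
  static boolean condPrime(List<String> v, List<String> e) { return false; }
  static List<String> project(List<String> v, List<String> e) { return e; }
}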
3.2.2 EN Join

An EN join aims to aggregate the information on edges into nodes, i.e., for each node v_i ∈ V, an EN join outputs a node (v_i, G(adj(v_i))), where adj(v_i) = {(v_i, v_j) ∈ E}, and G is a set of decomposable aggregate functions on the edge set adj(v_i), where a decomposable aggregate function is defined as follows:

Definition 3.1: (Decomposable Aggregate Function) An aggregate function g_k is decomposable if for any dataset s, and any two subsets of s, s1 and s2, with s1 ∩ s2 = ∅ and s1 ∪ s2 = s, g_k(s) can be computed using g_k(s1) and g_k(s2). ✷

Given a node table V_i and an edge table E_j, an EN join of V_i and E_j can be formulated using the following algebra:

  Π_{id, g1(c1)→p1, g2(c2)→p2, ...}(σ_{cond(c)}((V_i → V) ✶^{EN}_{id,id'} (E_j → E))) | C(cond'(c')) → cnt    (5)

or equivalently in the following SQL form:

  select id, g1(c1) as p1, g2(c2) as p2, ...
  from V_i as V EN join E_j as E on V.id = E.id'
  where cond(c)
  group by id
  count cond'(c') as cnt

where each of c, c', c1, c2, ... is a subset of fields in the two tables V_i and E_j, and id' can be either id1 or id2. The where part and count part are analogous to those defined in NE join. g_k is a decomposable aggregate function operated on the fields in c_k, by grouping the results using node ids as denoted in the group by part. Since the group by field is always the node id, we omit the group by part in Eq. 5 for simplicity.

EN join in MapReduce: The EN join operation can be implemented in MapReduce as follows. Let the set of fields used in V be c_v, and the set of fields used in E be c_e. The map phase is similar to that in the NE join. That is, for each node v ∈ V, the values in c_v with key v.id are emitted as a key-value pair (v.id, v.c_v), and for each edge e ∈ E, the values in c_e with key e.id' are emitted as a key-value pair (e.id', e.c_e). In the reduce phase, for each node id, the set of key-value pairs {(id, v.c_v), (id, e1.c_e), (id, e2.c_e), ...} can be processed as a data stream without loading the whole set into memory. Assuming that (id, v.c_v) comes first before all other key-value pairs (id, e_i.c_e) in the stream, the algorithm first loads (id, v.c_v) into memory and then processes each (id, e_i.c_e) one by one. For each function g_k, since g_k is decomposable, g_k({e1, e2, ..., e_i}) can be calculated using g_k({e1, e2, ..., e_{i-1}}) and g_k({e_i}). After processing all (id, e_i.c_e), all the g_k functions are computed. Finally, the algorithm checks cond(c). If cond(c) returns true, it outputs the id as well as all the g_k values as a single tuple into the distributed file system and increases cnt if cond'(c') returns true. It is easy to see that EN join belongs to SGC.
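Decomposability is what lets the reducer fold the edge stream into constant state. A minimal sketch of this incremental folding for two common aggregates (sum and min) is shown below; it is illustrative only and not taken from the paper.

// Sketch: incrementally folding decomposable aggregates over an edge stream,
// keeping only O(1) state per aggregate. Illustrative only.
public class DecomposableAggregates {
  public static void main(String[] args) {
    double[] edgeValues = {3.0, 1.5, 7.0, 0.5};   // values of field c_k on incoming edges
    double sum = 0.0;                              // g_1 = sum, identity 0
    double min = Double.POSITIVE_INFINITY;         // g_2 = min, identity +infinity
    for (double x : edgeValues) {
      // g_k({e_1,...,e_i}) is computed from g_k({e_1,...,e_{i-1}}) and g_k({e_i}).
      sum = sum + x;
      min = Math.min(min, x);
    }
    System.out.println("sum=" + sum + ", min=" + min);
  }
}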

4. BASIC GRAPH ALGORITHMS

The combination of NE join and EN join can solve a wide range of graph problems in SGC. In this section, we introduce some basic graph algorithms, including PageRank, breadth first search, and graph keyword search, in which the number of rounds is determined by a user-given parameter or a graph factor which is small and can be considered as a constant. We will introduce more complex algorithms that need logarithmic rounds in the worst case in the next sections, including connected component and minimum spanning forest computation.

PageRank. PageRank is a key graph operation which computes the rank of each node based on the links (directed edges) among them. Given a directed graph G(V,E), PageRank is computed iteratively. Let the initial rank of each node be 1/|V|; in iteration i, the rank of a node v is computed as r_i(v) = α/|V| + (1-α) Σ_{u ∈ nbr_in(v)} r_{i-1}(u)/d(u), where 0 < α < 1 is a parameter, nbr_in(v) is the set of in-neighbors of v in G, and d(u) is the number of out-neighbors of u in G.

Algorithm 1 PageRank(V(id), E(id1,id2), d)
1: V_r ← Π_{id, cnt(id2)→d, 1/|V|→r}(V ✶^{EN}_{id,id1} E);
2: for i = 1 to d do
3:   E_r ← Π_{id1, id2, r/d→p}(V_r ✶^{NE}_{id,id1} E);
4:   V_r ← Π_{id, d, α/|V|+(1-α)sum(E_r.p)→r}(V_r ✶^{EN}_{id,id2} E_r);
5: return Π_{id,r}(V_r);

The PageRank algorithm in SGC is shown in Algorithm 1. Given the node table V(id), the edge table E(id1,id2), and the number of iterations d, initially, the algorithm computes the out-degree of each node using V ✶^{EN}_{id,id1} E, assigns an initial rank 1/|V| to each node, and generates a new table V_r (line 1). Then, the algorithm updates the node ranks in d iterations. In each iteration, the ranks of nodes are updated using an NE join followed by an EN join. In the NE join, the partial rank p(v) = r(v)/d(v) for each node v is propagated to all its outgoing edges using V_r ✶^{NE}_{id,id1} E and a new edge table E_r is generated (line 3). In the EN join, for each node v, the partial ranks p(u) from all its incoming edges (u,v) are aggregated. The new rank is computed as α/|V| + (1-α) Σ_{u ∈ nbr_in(v)} p(u) using V_r ✶^{EN}_{id,id2} E_r, and V_r is updated with the new ranks (line 4).

Breadth First Search. Breadth First Search (BFS) is a fundamental graph operation. Given an undirected graph G(V,E) and a source node s, a BFS computes for every node v ∈ V the shortest distance (i.e., the minimum number of hops) from s to v in G.

Algorithm 2 BFS(V(id), E(id1,id2), s)
1: V_d ← Π_{id, (id=s?0:φ)→d}(V);
2: for i = 1 to +∞ do
3:   E_d ← Π_{id1, id2, d→d1}(V_d ✶^{NE}_{id,id1} E);
4:   V_d ← Π_{id, ((d=φ ∧ min(d1)≠φ)?i:d)→d}(V_d ✶^{EN}_{id,id2} E_d) | C(d=i) → n_new;
5:   if n_new = 0 then break;
6: return V_d;

The BFS algorithm in SGC is shown in Algorithm 2. Given a node table V(id), an edge table E(id1,id2), and a source node s, the algorithm computes for each node v the shortest distance d(v) from s to v. Initially, a node table V_d is created with d(s) = 0 and d(v) = φ for v ≠ s (line 1). Next, the algorithm iteratively computes the nodes with d(v) = i from the nodes with d(v) = i-1. Each iteration i is processed using an NE join followed by an EN join. The NE join propagates d(u) into each edge (u,v) using V_d ✶^{NE}_{id,id1} E and produces a new table E_d. The EN join updates all d(v) based on the following rule:

(Distance Update Rule): In the i-th iteration of BFS, a node v is assigned d(v) = i iff in the (i-1)-th iteration, d(v) = φ and there exists a neighbor u of v such that d(u) ≠ φ.

The rule can be easily implemented using V_d ✶^{EN}_{id,id2} E_d (line 4), in which the algorithm also computes a counter n_new, the number of nodes with d(v) = i. When n_new = 0, the algorithm terminates. It is easy to see that the number of iterations for Algorithm 2 is no larger than the diameter of the graph G. Thus the algorithm belongs to SGC if the diameter of the graph is small.

Graph Keyword Search. We now investigate a more complex algorithm, namely, keyword search in an undirected graph G(V,E). Suppose for each v ∈ V, t(v) is the text information included in v. Given a keyword query with a set of l keywords Q = {k1, k2, ..., kl}, a keyword search [16, 17] finds a set of rooted trees in the form of (r, {(p1, d(r,p1)), (p2, d(r,p2)), ..., (pl, d(r,pl))}), where r is the root node, p_i is a node that contains keyword k_i in t(p_i), and d(r, p_i) is the shortest distance from r to p_i in G for 1 ≤ i ≤ l. Each answer is uniquely determined by its root node r. r_max is the maximum distance allowed from the root node to a keyword node in an answer, i.e., d(r, p_i) ≤ r_max for 1 ≤ i ≤ l.

Algorithm 3 KWS(V(id,t), E(id1,id2), {k1, ..., kl}, r_max)
1: V_r ← Π_{id, (k1∈t?(id,0):(φ,φ))→(p1,d1), ..., (kl∈t?(id,0):(φ,φ))→(pl,dl)}(V);
2: for i = 1 to r_max do
3:   E_r ← Π_{id1, id2, (p1,d1)→(pe1,de1), ..., (pl,dl)→(pel,del)}(V_r ✶^{NE}_{id,id1} E);
4:   V_r ← Π_{id, amin(p1,d1,pe1,de1+1)→(p1,d1), ..., amin(pl,dl,pel,del+1)→(pl,dl)}(V_r ✶^{EN}_{id,id2} E_r);
5: return Π_{V_r.*}(σ_{d1≠φ ∧ ... ∧ dl≠φ}(V_r));

Graph keyword search can be solved in SGC. The algorithm is shown in Algorithm 3. Given a node table V(id,t), an edge table E(id1,id2), a keyword query {k1, k2, ..., kl}, and r_max, the algorithm first initializes a table V_r, where in each node v, for every k_i, a pair (p_i, d_i) is generated as (id(v), 0) if k_i is contained in v.t, and (φ, φ) otherwise (line 1). Then the algorithm iteratively propagates the keyword information from each node to its neighbor nodes using r_max iterations. In each iteration, the keyword information for each node is first propagated into its adjacent edges using NE join, and then the information on edges is grouped into nodes to update the keyword information on each node using EN join. Specifically, the NE join generates a new edge table E_r, in which each edge (u,v) is embedded with keyword information (p1(u), d1(u)), ..., (pl(u), dl(u)) retrieved from node u using V_r ✶^{NE}_{id,id1} E (line 3). In the EN join V_r ✶^{EN}_{id,id2} E_r (line 4), each node updates its nearest node p_i that contains keyword k_i using an amin function, which is defined as:

  amin({(p1,d1), ..., (pk,dk)}) = (p_i, d_i) | (d_i = min_{1≤j≤k} d_j)    (6)

amin is decomposable since for any two sets s1 and s2 with s1 ∩ s2 = ∅, the following equation holds:

  amin(s1 ∪ s2) = amin({amin(s1), amin(s2)})    (7)

After r_max iterations, for every node v in V_r, its nearest node p_i that contains keyword k_i (1 ≤ i ≤ l) with distance d_i = d(v, p_i) ≤ r_max is computed. The algorithm returns the nodes with d_i ≠ φ for all 1 ≤ i ≤ l as the final set of answers (line 5).
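The amin aggregate of Eq. (6)/(7) has a very small sequential analogue, shown below for one keyword slot; it is illustrative only, and the field types and null marker are assumptions rather than the paper's code.

// Sketch: the decomposable amin aggregate for one keyword slot.
// Given the current best (p, d) pair and a candidate pair, keep the smaller distance.
public class AminSketch {
  static final long PHI = Long.MAX_VALUE;   // stands for the null distance φ

  // amin over two pairs; folding it over a stream implements Eq. (7).
  static long[] amin(long p1, long d1, long p2, long d2) {
    return (d1 <= d2) ? new long[]{p1, d1} : new long[]{p2, d2};
  }

  public static void main(String[] args) {
    long[] best = {0, PHI};                          // (p_i, d_i) initialized to (φ, φ)
    long[][] incoming = {{7, 2}, {3, 1}, {9, 4}};    // (p_e, d_e + 1) from adjacent edges
    for (long[] cand : incoming) {
      best = amin(best[0], best[1], cand[0], cand[1]);
    }
    System.out.println("nearest keyword node=" + best[0] + ", distance=" + best[1]);
  }
}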

5. CONNECTED COMPONENT

[Figure 1: A Sample Graph G(V,E)]

Given an undirected graph G(V,E) with n nodes and m edges, a Connected Component (CC) is a maximal set of nodes that can reach each other through paths in G. Computing all CCs of G is a fundamental graph problem and can be solved efficiently on a sequential machine using O(n+m) time. However, it is non-trivial to solve the problem in MapReduce. Below, we briefly introduce the state-of-the-art algorithms for CC computation, followed by presenting our algorithm in SGC.

5.1 State-of-the-art

We present three algorithms for CC computation in MapReduce: HashToMin, HashGToMin, and PRAM-Simulation. HashToMin and HashGToMin are two MapReduce algorithms proposed in [36], with a similar idea to use the smallest node in each CC as the representative of the CC, assuming that there is a total order among all nodes in G. PRAM-Simulation is to simulate an algorithm in the Parallel Random Access Machine (PRAM) model in MapReduce using the simulation method proposed in [23].

Algorithm HashToMin: Each node v ∈ V maintains a set C_v initialized as C_v = {v} ∪ {u | (u,v) ∈ E}. Let v_min = min{u | u ∈ C_v}; the algorithm updates C_v in iterations until it converges. Each iteration is processed using MapReduce as follows. In the map phase, for each v ∈ V, two types of key-value pairs are emitted: (1) (v_min, C_v), and (2) (u, {v_min}) for all u ∈ C_v. In the reduce phase, for each v ∈ V, a set of key-value pairs is received in the forms of {(v, C_v^1), ..., (v, C_v^k)}. The new C_v is updated as ∪_{1≤i≤k} C_v^i. The HashToMin algorithm finishes in O(log(n)) rounds (the result is only proved on a path graph in [36]), with O(log(n)(m+n)) total communication cost in each round. The algorithm can be optimized to use O(1) memory on each machine using secondary sort in MapReduce.

Algorithm HashGToMin: Each node v ∈ V maintains a set C_v initialized as C_v = {v}. Let C_{≥v} = {u | u ∈ C_v, u > v} and v_min = min{u | u ∈ C_v}; the algorithm updates C_v in iterations until it converges. Each iteration is processed using three MapReduce rounds. In the first two rounds, each round updates C_v as C_v ∪ {u_min | (u,v) ∈ E} in MapReduce. The third round is processed as follows. In the map phase, for each v ∈ V, two types of key-value pairs are emitted: (1) (v_min, C_{≥v}), and (2) (u, {v_min}) for all u ∈ C_{≥v}. In the reduce phase, for each v ∈ V, a set of key-value pairs is received in the forms of {(v, C_v^1), ..., (v, C_v^k)}. The new C_v is updated as ∪_{1≤i≤k} C_v^i. The HashGToMin algorithm finishes in Õ(log(n)) (i.e., expected O(log(n))) rounds, with O(m+n) total communication cost in each round. However, it needs O(n) memory for a single machine to hold a whole CC in memory. Thus, as indicated in [36], HashGToMin is not suitable to handle a graph with large n.

Algorithm PRAM-Simulation: The PRAM model allows multiple processors to compute in parallel using a shared memory. There are CRCW PRAM if concurrent writes are allowed, and CREW PRAM if not. In [23], a theoretical result shows that a CREW PRAM algorithm in O(t) time can be simulated in MapReduce in O(t) rounds. For the CC computation problem, in the literature, the best result in CRCW PRAM is presented in [22], which computes CCs in O(log(n)) time. However, it needs to compute the 2-hop node pairs, which requires O(n^2) communication cost in the worst case in each round. Thus, the simulation algorithm is impractical.

5.2 Connected Component in SGC

We introduce our algorithm to compute CCs in SGC. Conceptually, the algorithm shares similar ideas with most deterministic O(log(n)) CRCW PRAM algorithms, such as [39] and [6], but it is a non-trivial adaptation since each operation should be carefully designed using graph joins in SGC. Our algorithm maintains a forest using a parent pointer p(v) for each v ∈ V. Each rooted tree in the forest represents a partial CC. A singleton is a tree with one node, and a star is a tree of height 1. A tree is an isolated tree if there are no edges in E that connect the tree to another tree. The forest is iteratively updated using two operations: hooking and pointer jumping. Hooking merges several trees into a larger tree, and pointer jumping changes the parent of each node to its grandparent in each tree. When the algorithm ends, each tree becomes an isolated star that represents a CC in the graph.

Specifically, the algorithm first initializes a forest to make sure that no singletons exist except for isolated singletons. Then, the algorithm updates the forest in iterations. In each iteration, two hooking operations, namely, a conditional star hooking and an unconditional star hooking, followed by a pointer jumping operation, are performed. The two hooking operations eliminate all non-isolated stars in the forest and the pointer jumping operation produces new stars to be eliminated in the next iteration. Our algorithm CC is shown in Algorithm 4, which includes five components: Forest Initialization (lines 1-3), Star Detection (lines 16-20), Conditional Star Hooking (lines 5-8), Unconditional Star Hooking (lines 9-12), and Pointer Jumping (lines 13-14). We explain the algorithm using a sample graph G(V,E) shown in Fig. 1.

Algorithm 4 CC(V(id), E(id1,id2))
1: V_m ← Π_{id, min(id,id2)→p}(V ✶^{EN}_{id,id1} E);
2: V_c ← Π_{V.id, V.p, cnt(V'.id)→c}((V_m → V) ✶^{EN}_{id,p} (V_m → V'));
3: V_p ← Π_{id, ((c=0 ∧ id=p)?min(id2):p)→p}(V_c ✶^{EN}_{id,id1} E);
4: while true do
5:   V_s ← star(V_p);
6:   E_h ← Π_{id1, id2, p→p1}(V_s ✶^{NE}_{id,id1} E);
7:   V_h' ← Π_{id, p, min(p,p1)→pm}(σ_{s=1}(V_s ✶^{EN}_{id,id2} E_h));
8:   V_h ← Π_{V_s.id, (cnt(pm)=0?V_s.p:min(pm))→p}(V_s ✶^{EN}_{id,p} V_h');
9:   V_s ← star(V_h);
10:  E_u ← Π_{id1, id2, p→p1}(V_s ✶^{NE}_{id,id1} E);
11:  V_u' ← Π_{id, p, min(p1 | p1≠p)→pm}(σ_{s=1}(V_s ✶^{EN}_{id,id2} E_u));
12:  V_u ← Π_{V_s.id, (cnt(pm)=0?V_s.p:min(pm))→p}(V_s ✶^{EN}_{id,p} V_u');
13:  V_p ← Π_{V'.id, V.p}((V_u → V) ✶^{NE}_{id,p} (V_u → V')) | C(V'.p ≠ V.p) → ns;
14:  if ns = 0 then break;
15: return V_p;
16: Procedure star(V_p)
17: V_g ← Π_{V'.id, V'.p, V.p→g, (V.p=V'.p?1:0)→s}((V_p → V) ✶^{NE}_{id,p} (V_p → V'));
18: V_s' ← Π_{V.id, V.p, and(V.s,V'.s)→s}((V_g → V) ✶^{EN}_{id,g} (V_g → V'));
19: V_s ← Π_{V'.id, V'.p, (V'.s=0?0:V.s)→s}((V_s' → V) ✶^{NE}_{id,p} (V_s' → V'));
20: return V_s;

Forest Initialization: The forest is initialized in three steps. (1) In the first step, a table V_m is computed, in which each node v finds the smallest node among its neighbors in G including itself as the parent p(v) of v, i.e., p(v) = min({v} ∪ {u | (u,v) ∈ E}). Such an operation guarantees that no cycles are created except for self cycles (i.e., p(v) = v). The operation can be done using V ✶^{EN}_{id,id1} E as shown in line 1. (2) In the second step, we create a table V_c by counting the number of subnodes for each node in the forest, i.e., for each node v, c(v) = |{u | p(u) = v}|. This can be done using a self EN join V_m ✶^{EN}_{id,p} V_m, where the second V_m is considered as an edge table since it has two fields representing node ids (line 2). (3) In the third step, we create V_p by eliminating all non-isolated singletons. A node v is a singleton iff c(v) = 0 and p(v) = v. A non-isolated singleton v can be eliminated by assigning p(v) = min{u | (u,v) ∈ E}, which can be done using V_c ✶^{EN}_{id,id1} E (line 3). Obviously, no cycles (except for self cycles) are created in V_p.

[Figure 2: Initialize CC: Compute V_m]
[Figure 3: Singleton Elimination: Compute V_p]

For example, for the graph G shown in Fig. 1, the forest V_m is shown in Fig. 2, where solid edges represent parent pointers and dashed edges represent graph edges. p(17) = 10 since 10 is the smallest neighbor of 17 in G. p(2) = 2 since 2 has no neighbor which is smaller than 2. There are two singletons 9 and 16, which are eliminated in V_p as shown in Fig. 3, by pointing 9 to 12 and 16 to 17, since 12 is the smallest neighbor of 9 in G and 17 is the smallest neighbor of 16 in G.
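The three initialization steps can be checked against a small in-memory reference. The following sketch is sequential, illustrative only, and not the paper's MapReduce code; it computes V_m, the child counts, and the singleton elimination of lines 1-3 on an adjacency-list representation, under the assumption that c(v) counts proper children only.

// Sequential reference for Forest Initialization (Algorithm 4, lines 1-3).
// adj.get(v) lists the neighbors of v; ids are 0..n-1. Illustrative only.
import java.util.List;

public class ForestInitSketch {
  static int[] initForest(List<List<Integer>> adj) {
    int n = adj.size();
    int[] p = new int[n];
    // Line 1: p(v) = min({v} ∪ neighbors of v)
    for (int v = 0; v < n; v++) {
      int best = v;
      for (int u : adj.get(v)) best = Math.min(best, u);
      p[v] = best;
    }
    // Line 2: c(v) = number of nodes (other than v itself) whose parent is v
    int[] c = new int[n];
    for (int v = 0; v < n; v++) {
      if (p[v] != v) c[p[v]]++;
    }
    // Line 3: a non-isolated singleton (c=0, p(v)=v, has neighbors) adopts its smallest neighbor
    for (int v = 0; v < n; v++) {
      if (c[v] == 0 && p[v] == v && !adj.get(v).isEmpty()) {
        int best = Integer.MAX_VALUE;
        for (int u : adj.get(v)) best = Math.min(best, u);
        p[v] = best;
      }
    }
    return p;
  }
}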

Star Detection: We need to detect whether each node belongs to a star before hooking. We use s(v) = 1 to denote that v belongs to a star and s(v) = 0 otherwise. The star detection is done in three steps based on the following three filtering rules:

• (Rule-1): A node v does not belong to a star if p(v) ≠ p(p(v)).
• (Rule-2): A node v does not belong to a star if ∃u such that u does not belong to a star and p(p(u)) = v.
• (Rule-3): A node v does not belong to a star if p(v) does not belong to a star.

It is guaranteed that after applying the three rules one by one in order, all non-stars are filtered. We now introduce how to apply the three rules using graph join operators. (Rule-1) For each node v, we find its grandparent g(v) = p(p(v)), and assign 1 or 0 to s(v) depending on whether p(v) = g(v). This can be done using a self NE join V_p ✶^{NE}_{id,p} V_p as shown in line 17. (Rule-2) A node v belongs to a star after applying Rule-2 if s(v) = 1 and for all u such that g(u) = v, s(u) = 1. Thus, we use an aggregate function and(s(v), s(u)), which is a boolean function that returns 1 iff s(v) = 1 and, for all u with g(u) = v, s(u) = 1. This can be done using a self EN join V_g ✶^{EN}_{id,g} V_g for the V_g created in Rule-1, as shown in line 18. (Rule-3) For each node v, we compute s(p(v)) and assign s(v) = 0 if s(p(v)) = 0. This can be done using a self NE join V_s' ✶^{NE}_{id,p} V_s' for the V_s' created in Rule-2, as shown in line 19.

[Figure 4: Star Detection Step 1: Compute V_g]
[Figure 5: Star Detection Step 2: Compute V_s']
[Figure 6: Star Detection Step 3: Compute V_s]

For example, Fig. 4 shows V_g obtained by applying Rule-1 on the V_p shown in Fig. 3. The grey nodes are those detected as non-star nodes, i.e., s(v) = 0. Node 11 does not belong to a star as (g(11) = 3) ≠ (p(11) = 4). Fig. 5 shows V_s' obtained by applying Rule-2 on V_g. Two new nodes 1 and 3 are filtered as non-star nodes. For node 3, s(3) = 0 since there exists node 11 with g(11) = 3 and s(11) = 0. Fig. 6 shows V_s obtained by applying Rule-3 on V_s'. Three nodes 10, 13 and 4 are filtered. For node 4, s(4) = 0 since its parent node 3 has s(3) = 0. In V_s, all non-star nodes are filtered, and 3 stars rooted at 2, 5 and 7 are detected.
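A sequential reference for the three filtering rules is given below; it is illustrative only and operates directly on a parent array rather than on node/edge tables.

// Sequential reference for star detection (Algorithm 4, lines 16-20).
// p[v] is the parent pointer of v; returns s[v] = true iff v belongs to a star.
public class StarDetectionSketch {
  static boolean[] detectStars(int[] p) {
    int n = p.length;
    boolean[] s = new boolean[n];
    int[] g = new int[n];
    // Rule-1: v is not in a star if p(v) != p(p(v))
    for (int v = 0; v < n; v++) {
      g[v] = p[p[v]];
      s[v] = (p[v] == g[v]);
    }
    // Rule-2: the grandparent of a non-star node is not in a star
    for (int v = 0; v < n; v++) {
      if (!s[v]) s[g[v]] = false;
    }
    // Rule-3: a node inherits non-star status from its parent
    for (int v = 0; v < n; v++) {
      if (!s[p[v]]) s[v] = false;
    }
    return s;
  }
}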

Conditional Star Hooking: In a conditional star hooking, for any node v which is the root of a star (i.e., p(v) = v and s(v) = 1), the parent of v is updated to min({p(v)} ∪ {u | ∃(x,y) ∈ E, s.t. p(x) = u and p(y) = v}). In other words, v is hooked to a new parent u if u is no larger than v and the tree that u lies in is connected to the star that v lies in through an edge (x,y) with p(x) = u and p(y) = v. The operation ensures that p(v) is no larger than v in order to make sure that no cycles (except for self cycles) are created. After the hooking, it is guaranteed that there are no edges that connect two stars. Conditional star hooking is done in three steps. (1) Create a new edge table E_h by embedding p(x) into each edge (x,y) ∈ E using V_s ✶^{NE}_{id,id1} E (line 6), where V_s is the forest with all stars detected (line 5). (2) Create a table V_h', in which for each node y such that y is in a star, pm(y) = min({p(y)} ∪ {p(x) | (x,y) ∈ E}) is computed. This can be done using σ_{s=1}(V_s ✶^{EN}_{id,id2} E_h) (line 7). (3) Create a table V_h, in which the parent of each node v is updated to min{pm(y) | p(y) = v} if such a pm exists, using V_s ✶^{EN}_{id,p} V_h' (line 8).

[Figure 7: Conditional Star Hooking: Compute V_h]

For example, for the V_s shown in Fig. 6 with all stars detected, there exists an edge (x,y) = (15,18) with u = p(x) = 5 and v = p(y) = 7. Since 7 is in a star and 5 < 7, 7 is hooked to a new parent 5 by assigning p(7) = 5 as shown in Fig. 7. Note that 5 is also in a star in V_s; however, since 7 > 5, we cannot hook 5 to 7 by assigning p(5) = 7, after which a cycle would be created.

Unconditional Star Hooking: Unconditional star hooking is similar to conditional star hooking by dropping the condition that a node v should be hooked to a parent u with u ≤ v. It is done using the similar three steps (lines 10-12), with the only difference in the second step, which calculates pm(y) as min{p(x) | (x,y) ∈ E and p(x) ≠ p(y)}, instead of min({p(y)} ∪ {p(x) | (x,y) ∈ E}). We add a condition p(x) ≠ p(y) to avoid hooking a star to itself, in order to make sure that all non-isolated stars are eliminated. Unconditional star hooking does not create cycles (except for self cycles) due to the fact that after conditional star hooking, there is no edge that connects two stars.

[Figure 8: Unconditional Star Hooking: Compute V_u]

For example, for the forest V_h shown in Fig. 7, there is only one star, rooted at node 2. There exists an edge (x,y) = (19,17) with u = p(x) = 2 and v = p(y) = 10 and u ≠ v, so 2 is hooked to a new parent 10 as shown in Fig. 8, with no stars remaining.

Pointer Jumping: Pointer jumping changes the parent of each node to its grandparent in the forest V_u generated in unconditional star hooking by assigning p(v) = p(p(v)) for each node v. This can be done using a self NE join V_u ✶^{NE}_{id,p} V_u (line 13). In pointer jumping, we also create a counter ns which counts the number of nodes with p(v) ≠ p(p(v)). When ns = 0, all stars in V_u are isolated stars and the algorithm terminates with each star representing a CC (line 14). For example, for the forest V_u computed in unconditional star hooking, after pointer jumping, the new forest V_p is shown in Fig. 9, with two stars with roots 3 and 5 generated.

The following theorem shows the efficiency of Algorithm 4. Due to lack of space, the proof is omitted.

Theorem 5.1: Algorithm 4 stops in O(log(n)) iterations. ✷

The comparison of the algorithms HashToMin, HashGToMin, and our algorithm CC is shown in Table 2 in terms of the memory consumption per machine, total communication cost per round, and the number of rounds, in which our algorithm is the best in all factors.
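Pointer jumping and its termination counter have a direct sequential analogue, shown below on a parent array; this is illustrative only and not the paper's distributed implementation.

// Sequential reference for pointer jumping with the termination counter ns
// (Algorithm 4, lines 13-14). Illustrative only.
public class PointerJumpingSketch {
  // Returns the number of nodes whose parent changed (ns); the caller stops when it is 0.
  static int pointerJump(int[] p) {
    int n = p.length;
    int[] newP = new int[n];
    int ns = 0;
    for (int v = 0; v < n; v++) {
      newP[v] = p[p[v]];                 // p(v) = p(p(v))
      if (newP[v] != p[v]) ns++;         // count nodes not yet in an isolated star
    }
    System.arraycopy(newP, 0, p, 0, n);
    return ns;
  }
}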

6. MINIMUM SPANNING FOREST

Algorithm 5 MSF(V(id), E(id1,id2,w))
1: V_p ← Π_{id, min((id1,id2,w))→em, em.id2→p}(V ✶^{EN}_{id,id1} E);
2: E_t ← Π_{em}(σ_{em≠φ}(V_p));
3: while true do
4:   V_b ← Π_{V'.id, ((V'.id=V.p ∧ V'.id ...
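Line 1 of Algorithm 5 selects, for every node, its minimum incident edge and points the node to the other endpoint of that edge. A small sequential sketch of this selection is given below; it is illustrative only, and the comparison order used for breaking ties among edges is an assumption rather than the paper's definition.

// Sequential reference for Algorithm 5, line 1: for each node, pick the
// minimum incident edge em(v) and set p(v) = em(v).id2.
// Edges are given as rows {id1, id2, w}; ids and weights are ints for brevity.
import java.util.Arrays;

public class MinIncidentEdgeSketch {
  static int[][] pickMinEdges(int n, int[][] edges) {
    int[][] em = new int[n][];                       // em[v] = chosen edge of node v, or null
    for (int[] e : edges) {
      int v = e[0];                                  // edge (id1=v, id2, w)
      if (em[v] == null || compare(e, em[v]) < 0) em[v] = e;
    }
    return em;
  }
  // Assumed ordering: smallest weight wins, node ids break ties deterministically.
  static int compare(int[] a, int[] b) {
    if (a[2] != b[2]) return Integer.compare(a[2], b[2]);
    if (a[0] != b[0]) return Integer.compare(a[0], b[0]);
    return Integer.compare(a[1], b[1]);
  }
  public static void main(String[] args) {
    int[][] edges = {{0, 1, 5}, {1, 0, 5}, {1, 2, 2}, {2, 1, 2}};
    System.out.println(Arrays.deepToString(pickMinEdges(3, edges)));
  }
}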

Nodes 2 and 13 select the same edge (2, 13, 7), thus a cycle of size 2 is formed by nodes 2 and 13. All the edges in V_p are added to the MSF E_t.

Cycle Breaking: Cycle breaking breaks the cycles to make sure that there are no cycles except for self cycles in the forest. According to Lemma 6.2, a cycle of length 2 can be easily detected if there is a node v with p(v) ≠ v and p(p(v)) = v. We can eliminate such cycles using the following rule:

(Cycle Breaking Rule): For a node v with p(v) ≠ v and p(p(v)) = v, if p(v) > v, then assign v to p(v).

By applying the rule, we create a table V_b with no cycles (except for self cycles) using a self NE join V_p ✶^{NE}_{id,p} V_p as shown in line 4. For example, for the forest V_p in Fig. 10(b), node 4 has p(4) = 6 ≠ 4 and p(p(4)) = 4. Since p(4) > 4, by applying the cycle breaking rule, p(4) is updated to 4. The new forest V_b with no cycles of length larger than 1 is shown in Fig. 11.

[Figure 11: Cycle Breaking: Compute V_b]

Pointer Jumping: Pointer jumping is analogous to that in Algorithm 4, which creates a table V_c by changing the parent of each node v to its grandparent, using p(v) = p(p(v)) by a self NE join V_b ✶^{NE}_{id,p} V_b as shown in line 5. Again, in pointer jumping, we create a counter ns to count the number of non-star nodes. When ns = 0, the algorithm terminates and outputs E_t as the MSF of G. For example, after pointer jumping, the forest V_b in Fig. 11 is changed to the forest V_c in Fig. 12. The parent of node 8 changes to its grandparent 4, and two new stars with roots 7 and 4 are created.

[Figure 12: Pointer Jumping: Compute V_c]

Edge Hooking: Edge hooking aims to eliminate all stars (except for isolated stars) in the forest V_c. Suppose we create a table V_s in line 7 with all stars detected using the same procedure star as in Algorithm 4. In edge hooking, for any node v which is the root of a star (i.e., p(v) = v and s(v) = 1), let (x, y, w) = min{(x', y', w') | p(y') = v, p(x') ≠ p(y')}; then v is assigned a new parent u = p(x) after hooking. Edge hooking is done in three steps. (1) Create a new edge table E_m by embedding p(x) into each edge (x, y, w) ∈ E using V_s ✶^{NE}_{id,id1} E (line 8). (2) Create a table V_m, in which for each node y such that y is in a star, pm(y) = p(x) with em(y) = (x, y, w) = min{(x', y', w') | p(y') = v, p(x') ≠ p(y')} is computed. This can be done with an aggregate function amin((x,y,w), p(x) | p(x) ≠ p(y)) using the EN join σ_{s=1}(V_s ✶^{EN}_{id,id2} E_m) (line 9). (3) Create a table V_p, in which the parent of each node v is updated to pm(y) with (x, y, w) = min{em(y) | p(y) = v} if such a pm exists, using V_s ✶^{EN}_{id,p} V_m (line 10). The corresponding edge em = min{em(y) | p(y) = v} is added to the MSF table E_t (line 11) by Lemma 6.1, if it exists. It is guaranteed that Lemma 6.2 still holds on the V_p created in edge hooking.

[Figure 13: Edge Hooking: Compute V_p]

For example, for V_c in Fig. 12, there exists an edge (x, y) = (15, 16) with u = p(x) = 9 and v = p(y) = 7. Since 7 is the root of a star and (15, 16, 13) = min{(x', y', w') | p(y') = 7, p(x') ≠ 7} = min{(10, 7, 15), (9, 18, 14), (15, 16, 13), (5, 2, 14)}, 7 is hooked to 9 as shown in Fig. 13. Similarly, 4 is also hooked to 9. The edges (15, 16, 13) and (12, 8, 10) are added to the MSF E_t.

The following theorem shows the efficiency of Algorithm 5. Due to lack of space, the proof is omitted.

Theorem 6.1: Algorithm 5 stops in O(log(n)) iterations. ✷

The comparison of the algorithms OneRoundMSF, MultiRoundMSF, and our algorithm MSF is shown in Table 3 in terms of memory consumption per machine, total communication cost per round, and the number of rounds. As we will show later in our experiments, the high memory requirement of OneRoundMSF and MultiRoundMSF becomes the bottleneck for these algorithms to achieve high scalability when handling graphs with large n.

Table 3: MSF Computation Algorithms in MapReduce

                          OneRoundMSF      MultiRoundMSF          MSF
  Memory/machine          O(n^{1+c/2})     O(n^{1+ϵ})             O(1)
  Communication/round     O(m^{1+c/2})     O(n+m)                 O(n+m)
  Number of rounds        O(1)             O((log_n(m)-1)/ϵ)      O(log(n))

7. PERFORMANCE STUDIES

In this section, we show our experimental results. We deploy a cluster of 17 computing nodes, including one master node and 16 slave nodes, each of which has four Intel Xeon 2.4GHz CPUs and 15GB RAM running 64-bit Ubuntu Linux. We implement all algorithms using Hadoop (version 1.2.1) with Java 1.6. We allow each node to run three mappers and three reducers concurrently, each of which uses a heap size of 2048MB in JVM. The block size in HDFS is set to be 128MB, the data replication factor of HDFS is set to be 3, and the I/O buffer size is set to be 128KB.

Datasets: We use two web-scale graphs, Twitter-2010 (http://law.di.unimi.it/webdata/twitter-2010/) and Friendster (http://snap.stanford.edu/data/com-Friendster.html), with different graph characteristics for testing.

• Twitter-2010 contains 41,652,230 nodes and 1,468,365,182 edges with an average degree of 71. The maximum degree is 3,081,112 and the diameter of Twitter-2010 is around 24.
• Friendster contains 65,608,366 nodes and 1,806,067,135 edges with an average degree of 55. The maximum degree is 5,214 and the diameter of Friendster is around 32.

835 PageRank PageRank CC CC 12 20 PageRank-Pig 30 PageRank-Pig HashToMin HashToMin 16 24 INF 8 10 12 18 8 8 12 6 4 4 4 6 2 Processing Time (Hour) Processing Time (Hour) Processing Time (Hour) Processing Time (Hour) 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% (a) Vary n (Twitter-2010) (b) Vary n (Friendster) (a) Vary n (Twitter-2010) (b) Vary n (Friendster) PageRank PageRank CC CC 12 12 PageRank-Pig PageRank-Pig HashToMin HashToMin 16 9 INF 12 8 8 6 8 6 4 4 3 4 2 Processing Time (Hour) Processing Time (Hour) Processing Time (Hour) Processing Time (Hour) 4 7 10 13 16 4 7 10 13 16 4 7 10 13 16 4 7 10 13 16 (c) Vary t (Twitter-2010) (d) Vary t (Friendster) (c) Vary t (Twitter-2010) (d) Vary t (Friendster) Figure 14: PageRank Figure 16: Connected Component

MSF MSF 24 BFS BFS MultiRoundMSF MultiRoundMSF BFS-Pig 30 BFS-Pig 20 INF INF 24 16 30 30 24 24 18 12 18 18 8 12 12 12 6 6 4 6 Processing Time (Hour) Processing Time (Hour)

Processing Time (Hour) Processing Time (Hour) 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% 20% 40% 60% 80% 100% (a) Vary n (Twitter-2010) (b) Vary n (Friendster) (a) Vary n (Twitter-2010) (b) Vary n (Friendster) MSF MSF 20 BFS 28 BFS MultiRoundMSF MultiRoundMSF BFS-Pig BFS-Pig 16 24 INF INF 20 30 30 12 24 16 24 18 12 18 8 12 12 8 6 4 6 4 Processing Time (Hour) Processing Time (Hour)

Processing Time (Hour) Processing Time (Hour) 4 7 10 13 16 4 7 10 13 16 4 7 10 13 16 4 7 10 13 16 (c) Vary t (Twitter-2010) (d) Vary t (Friendster) (c) Vary t (Twitter-2010) (d) Vary t (Friendster) Figure 15: Breadth First Search Figure 17: Minimum Spanning Forest manner as introduced in Section 3.2 where each key-value pair is slower than PageRank on average. We also vary the number of accessed for only once in the reducer. For CC computation, we iterations d from 2 to 10. The processing time for both PageRank implement HashToMin and HashGToMin as introduced in Sec- and PageRank-Pig increases linearly with the number of iterations. tion 5 by integrating all the optimization techniques introduced in The curves are not shown due to lack of space. [36], and for MSF computation, we implement OneRoundMSF Exp-2: Breadth First Search. Fig. 15 shows the testing results and MultiRoundMSF as introduced in Section 6. All the four for BFS and BFS-Pig by setting the depth of BFS to be 6.The algorithms OneRoundMSF, MultiRoundMSF, HashToMin,and curves on Twitter-2010 and Friendster when varying the graph size HashGToMin, and all our algorithms proposed in this paper are are shown in Fig. 15 (a) and Fig. 15 (b) respectively. The results are implemented by java using plain MapReduce in Hadoop without similar to those in PageRank. As shown in Fig. 15 (c) and Fig. 15 using Pig or other MapReduce based translators. (d), when increasing t from 4 to 16, the processing time for both BFS and BFS-Pig decreases, and the time decreases more sharply Parameters: For each dataset, we extract subgraphs of 20%, 40%, 60%, 80% and 100% nodes of the original graph with a default when t is smaller. This is because when t is smaller, each machine value of 60%. We also vary the number of slave nodes t in the will spend more time on shuffling and sorting data on disk, which cluster from 4 to 16 with a default value of 10. We test both pro- is costly. We also vary the depth of BFS d from 2 to 10.Thecurves cessing time and maximum communication cost on each machine. are omitted since the processing time for both BFS and BFS-Pig The curves for the communication cost are very similar to those increases linearly with d. for processing time, thus we only show the processing time due to Exp-3: Graph Keyword Search. We randomly generate some lack of space. We set the maximum running time to be 36 hours. keywords in both Twitter-2010 and Friendster, and test KWS and If a test does not stop in the time limit, or fails due to out of mem- KWS-Pig using a keyword query with 3 keywords and rmax =3 ory exception, we will denote the processing time using INF.For by default. The trends by varying n and t on both Twitter-2010 PageRank, we consider each graph as a directed graph and for other and Friendster are similar to those in BFS, since in graph keyword algorithms, we consider each graph as an undirected graph. search, the keyword information on each node is propagated to its neighbors in a BFS manner in iterations. We also vary rmax from Exp-1: PageRank. We test PageRank . We set the default number of iterations d to be 6. The testing results are shown in Fig. 14. 1 to 5. Again, the processing time for both KWS and KWS-Pig Fig. 14 (a) and Fig. 14 (b) demonstrate the curves for Twitter-2010 increases linearly with rmax. Due to space limitation, we omit the and Friendster respectively when varying the size of the graph from figures for graph keyword search in the paper. 

[Figure 15: Breadth First Search. Processing time (hours) for BFS and BFS-Pig on Twitter-2010 and Friendster, varying n in (a)-(b) and t in (c)-(d).]

Exp-2: Breadth First Search. Fig. 15 shows the testing results for BFS and BFS-Pig, setting the depth of BFS to 6. The curves on Twitter-2010 and Friendster when varying the graph size are shown in Fig. 15 (a) and Fig. 15 (b) respectively. The results are similar to those for PageRank. As shown in Fig. 15 (c) and Fig. 15 (d), when increasing t from 4 to 16, the processing time for both BFS and BFS-Pig decreases, and the time decreases more sharply when t is smaller. This is because when t is smaller, each machine spends more time shuffling and sorting data on disk, which is costly. We also vary the depth of BFS d from 2 to 10. The curves are omitted since the processing time for both BFS and BFS-Pig increases linearly with d.

Exp-3: Graph Keyword Search. We randomly generate keywords in both Twitter-2010 and Friendster, and test KWS and KWS-Pig using a keyword query with 3 keywords and rmax = 3 by default. The trends when varying n and t on both Twitter-2010 and Friendster are similar to those for BFS, since in graph keyword search, the keyword information on each node is propagated to its neighbors in a BFS manner in iterations (one such round is sketched below). We also vary rmax from 1 to 5. Again, the processing time for both KWS and KWS-Pig increases linearly with rmax. Due to space limitation, we omit the figures for graph keyword search in the paper.
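
As an illustration of this propagation, the sketch below shows the reduce side of one BFS round in plain Hadoop, again not the join-operator based implementation evaluated here. The assumed line format is node, current depth (-1 if unreached), and comma-separated neighbors; the map side mirrors the PageRank mapper above, except that a node reached at depth d offers d + 1 to each neighbor. Graph keyword search would carry one such distance per keyword, capped at rmax.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce side of one BFS round: keep the smallest depth offered to a node.
// Messages are either the node's own record "#depth|adjacency" or a plain
// number offered by an in-neighbor reached in a previous round.
public class BfsMinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text node, Iterable<Text> msgs, Context ctx)
      throws IOException, InterruptedException {
    long best = -1;            // -1 encodes "unreached"
    String adjacency = "";
    for (Text m : msgs) {      // one streaming pass over the values
      String s = m.toString();
      if (s.startsWith("#")) {
        int bar = s.indexOf('|');
        long own = Long.parseLong(s.substring(1, bar));
        adjacency = s.substring(bar + 1);
        if (own >= 0 && (best < 0 || own < best)) best = own;
      } else {
        long offered = Long.parseLong(s);
        if (best < 0 || offered < best) best = offered;
      }
    }
    ctx.write(node, new Text(best + "\t" + adjacency));
  }
}
```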

Exp-4: Connected Component. We test three algorithms: CC, HashToMin, and HashGToMin. Since HashGToMin fails in almost all test cases due to out-of-memory exceptions, we only show the testing results for HashToMin and CC.

Fig. 16 (a) shows the curves when varying the size of the graph in Twitter-2010. When the number of nodes n in the graph is 20%, HashToMin and CC have similar performance. However, when n increases, the processing time of HashToMin increases more sharply than that of CC. The reasons are twofold. First, HashToMin generates more intermediate results than CC, which increases both the communication cost and the cost of sorting data on each machine. This is consistent with the theoretical results shown in Table 2. Second, HashToMin needs to sort all the edges in the graph in order to achieve constant memory consumption on each machine, which is very costly. Fig. 16 (b) shows the testing results on Friendster when varying n. HashToMin cannot stop within the time limit once n increases to 60%, while CC remains stable. The reason is the large amount of intermediate results generated by HashToMin. Note that Friendster is a graph with a large diameter and a uniform degree distribution. In such a graph, the number of iterations used by HashToMin is large. Since the communication cost of HashToMin in each round is linear with respect to the number of iterations used, a large number of intermediate results are generated and shuffled by HashToMin in each round, making HashToMin inefficient.

The testing results when varying t on Twitter-2010 and Friendster are shown in Fig. 16 (c) and Fig. 16 (d) respectively. On Twitter-2010, the processing time for both CC and HashToMin decreases when t increases. When t increases to 13, the time for HashToMin decreases more slowly than that for CC. HashToMin is 1.5 to 2.2 times slower than CC in all test cases. When varying t on Friendster, HashToMin cannot stop within the time limit until t reaches 16, due to the large number of intermediate results generated, while the processing time of CC decreases stably when t increases.
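
For reference, the sketch below shows one HashToMin round, following a simplified reading of the algorithm in [36]; it is an illustrative plain-Hadoop version rather than the tuned implementation compared here, and the record layout (a node followed by the id list of its current cluster, initialized to its neighbors) is an assumption of the sketch. Each node sends its whole cluster to the cluster's smallest member, and the smallest member's id to every other member; the reducer unions whatever it receives.

```java
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One HashToMin round over records "node<TAB>id1,id2,..." (a simplified sketch).
public class HashToMinRound {

  public static class HtmMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] p = line.toString().split("\t");
      long node = Long.parseLong(p[0]);
      TreeSet<Long> cluster = new TreeSet<>();
      cluster.add(node);
      if (p.length > 1 && !p[1].isEmpty()) {
        for (String s : p[1].split(",")) cluster.add(Long.parseLong(s));
      }
      long min = cluster.first();
      // Send the whole cluster to its smallest member ...
      ctx.write(new LongWritable(min), new Text(join(cluster)));
      // ... and the smallest member's id to everyone else in the cluster.
      for (long u : cluster) {
        if (u != min) ctx.write(new LongWritable(u), new Text(Long.toString(min)));
      }
    }
  }

  public static class HtmReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable node, Iterable<Text> parts, Context ctx)
        throws IOException, InterruptedException {
      TreeSet<Long> cluster = new TreeSet<>();   // union of everything received
      for (Text t : parts) {
        for (String s : t.toString().split(",")) cluster.add(Long.parseLong(s));
      }
      ctx.write(node, new Text(join(cluster)));
    }
  }

  private static String join(TreeSet<Long> ids) {
    StringBuilder b = new StringBuilder();
    for (long id : ids) {
      if (b.length() > 0) b.append(',');
      b.append(id);
    }
    return b.toString();
  }
}
```

Note that this naive reducer materializes an entire cluster in memory; avoiding that, as discussed above, is what forces HashToMin to sort all edges, and this sorting is part of the overhead that makes it slower than CC.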

[Figure 17: Minimum Spanning Forest. Processing time (hours) for MSF and MultiRoundMSF on Twitter-2010 and Friendster, varying n in (a)-(b) and t in (c)-(d).]

Exp-5: Minimum Spanning Forest. In our last set of experiments, we compare three algorithms: MSF, OneRoundMSF, and MultiRoundMSF. OneRoundMSF runs out of memory in all test cases, thus we only show the testing results for MultiRoundMSF and MSF. We assign each edge a random weight drawn uniformly from (0, 1); the assignment itself is as simple as the snippet below. We also tested other methods of assigning edge weights (e.g., using other distributions) and find that the weight assignment method has little effect on the efficiency of the algorithms.

The testing results are shown in Fig. 17, in which Fig. 17 (a) and Fig. 17 (b) show the curves when varying n, and Fig. 17 (c) and Fig. 17 (d) show the curves when varying t. Not surprisingly, MultiRoundMSF runs out of memory in most cases, since it requires a single machine to have a memory size super-linear in the number of nodes of the graph. In Fig. 17 (b), when increasing n from 20% to 100%, the processing time of MSF increases nearly linearly, demonstrating the high scalability of MSF. As shown in Fig. 17 (c) and Fig. 17 (d), once MultiRoundMSF runs out of memory, the problem cannot be solved by adding more machines to the system, while the performance of MSF improves stably when the number of slaves increases from 4 to 16.
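
The uniform weight assignment used in this experiment can be reproduced with a few lines; the edge record layout ("u<TAB>v" per line) and the fixed seed below are assumptions of the sketch, used only so that repeated runs see the same weights.

```java
import java.util.Random;

// Attach a weight drawn uniformly at random to an edge record "u<TAB>v".
public class EdgeWeighter {
  private static final Random RNG = new Random(42L);  // fixed seed, assumed for repeatability

  public static String weight(String edgeLine) {
    double w = RNG.nextDouble();  // uniform in [0, 1); exact zeros and ties are negligible
    return edgeLine + "\t" + w;
  }
}
```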

[Figure 18: Degree Distribution Test. (a) Processing time scale of CC when varying n; (b) speedup of CC when varying t, on Twitter-2010 and Friendster.]

Exp-6: Degree Distribution Test. We test the impact of the degree distribution on the efficiency of the algorithms proposed in this paper. Recall that the datasets Twitter-2010 and Friendster have different degree distributions: Twitter-2010 has a skewed degree distribution with maximum degree 3,081,112, and Friendster has a more uniform degree distribution with maximum degree 5,214. We test the CC algorithm on both Twitter-2010 and Friendster. First, we take the processing time for n = 20% as 1, and test the relative processing time scale (i.e., the processing time divided by the time for n = 20%) when varying n from 20% to 100% on both Twitter-2010 and Friendster. The result is shown in Fig. 18 (a). The curves for the processing time scale on Twitter-2010 and Friendster are similar, which indicates that the processing time scale when varying n is not largely impacted by the degree distribution. Next, we take the processing time for t = 4 as 1, and test the speedup of the algorithm (the time for t = 4 divided by the time for t) when varying t from 4 to 16 on both Twitter-2010 and Friendster. The result is shown in Fig. 18 (b). When t is large, the speedup of CC on Twitter-2010 increases more slowly than on Friendster, because the nodes with very high degree in Twitter-2010 become the bottleneck for the speedup, while the speedup for Friendster still keeps increasing stably when t = 16. For the other proposed algorithms, we get similar results. Due to space limitation, we only show the results for the CC algorithm in the paper.

8. RELATED WORK

MapReduce Framework: MapReduce [10] is a big data processing framework that has become the de facto standard throughout both industry and academia. In industry, Google has developed a MapReduce system on the distributed key-value store BigTable on top of GFS (Google File System). Yahoo has developed a MapReduce system, Hadoop, and a high-level data flow language, Pig, based on the distributed key-value store HBase on top of HDFS (Hadoop Distributed File System). Facebook has developed a SQL-like data warehouse infrastructure, Hive, based on Hadoop, using the key-value store Cassandra. Microsoft has developed two languages, Scope and DryadLINQ, based on the distributed execution engine Dryad. Amazon has developed a key-value store called Dynamo.

In academia, Hadoop++ by Dittrich et al. [11] improves the performance of Hadoop using user-defined functions. Column-based storage is studied by Floratou et al. [14] and hybrid-based storage is studied by Lin et al. [28] to speed up MapReduce tasks. HAIL is proposed by Dittrich et al. [12] to improve the upload pipeline of HDFS. Query optimization in MapReduce is discussed by Jahani et al. [18] and Lim et al. [27]. Multi-query and iterative query optimization in MapReduce are studied by Wang et al. [45] and Onizuka et al. [34] respectively. Cost analysis of MapReduce is given by Afrati et al. [3]. Some other works focus on solving a specific type of query in MapReduce. For example, set similarity joins in MapReduce are studied by Vernica et al. [44] and by Metwally and Faloutsos [32]. Theta joins in MapReduce are discussed by Okcan and Riedewald [33] and Zhang et al. [48]. Multiway joins are optimized by Afrati et al. [4]. KNN joins in MapReduce are proposed by Lu et al. [29]. A survey on distributed data management and processing using MapReduce is given by Li et al. [26].

Graph Processing Systems in Cloud: Many graph processing systems have been developed to deal with big graphs. One representative system is Pregel [31], with its open-source implementations Giraph and HAMA based on Hadoop. Pregel takes a vertex-centric approach and implements the bulk synchronous parallel (BSP) computation model [43], which targets iterative computation on graphs. HipG [24] improves BSP by using asynchronous messages to avoid synchronization. Microsoft Research has developed a distributed in-memory graph processing engine called Trinity [38, 47] based on a hypergraph model. PowerGraph [15] is a distributed graph processing system that is optimized to process power-law graphs. Distance oracles are studied in [35]. Giraph++ is proposed in [42] to take graph partitioning into consideration when processing graphs. Workload balancing for graph processing in the cloud is discussed in [37].

Graph Processing in MapReduce: Many graph algorithms, including triangle/rectangle enumeration, k-clique computation, barycentric clustering, and component finding in MapReduce, are discussed in [9]. In [25], several techniques are proposed to reduce the size of the input in a distributed fashion in MapReduce, and the techniques are applied to several graph problems such as minimum spanning trees, approximate maximal matchings, approximate node/edge covers, and minimum cuts. Triangle counting in MapReduce is optimized in [40]. Diameter and radii computations in MapReduce are studied in [20]. Personalized PageRank computation in MapReduce is studied in [7]. Matrix-multiplication-based graph mining algorithms in MapReduce are discussed in [19, 21]. Densest subgraph computation in MapReduce is studied in [8]. Subgraph instance enumeration in MapReduce is proposed in [2]. Algorithms for connected components computation in MapReduce in logarithmic rounds are discussed in [36]. Maximum clique computation in MapReduce is studied in [46]. Recursive query processing, such as transitive closure computation, in MapReduce is discussed in [1].

Algorithm Classes in MapReduce: Algorithm classes in MapReduce are studied by Karloff et al. [23] and Tao et al. [41], both of which have been introduced in detail in Section 2. Note that by defining a new class for graph processing in MapReduce, we focus on the scalability issues that can be commonly achieved by a class of graph algorithms in MapReduce. The query optimization techniques introduced in the MapReduce framework are orthogonal to our work and can be studied by case-to-case analyses. Each of the algorithms studied in this paper can certainly benefit from further optimization using techniques such as multi-query optimization [45] and iterative query optimization [34], which are beyond the main scope of this paper.

9. CONCLUSIONS

In this paper, we study scalable big graph processing in MapReduce. We review previous MapReduce classes and propose a new class SGC to guide the development of scalable graph processing algorithms in MapReduce. We introduce two graph join operators using which a large range of graph algorithms can be designed in SGC. In particular, for two fundamental graph problems, CC computation and MSF computation, we improve the state-of-the-art algorithms both in theory and in practice. We conducted extensive performance studies using real web-scale graphs to demonstrate the high scalability achieved by our algorithms in SGC.

Acknowledgements. Lu Qin was supported by ARC DE140100999. Jeffrey Xu Yu was supported by grant of the RGC of Hong Kong SAR, No. CUHK 418512. Hong Cheng was supported by grants of the RGC of Hong Kong SAR, No. CUHK 411310 and 411211. Xuemin Lin was supported by NSFC61232006, NSFC61021004, ARC DP110102937, ARC DP120104168, and ARC DP140103578.

10. REFERENCES
[1] F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In Proc. of EDBT'11, 2011.
[2] F. N. Afrati, D. Fotakis, and J. D. Ullman. Enumerating subgraph instances using map-reduce. In Proc. of ICDE'13, 2013.
[3] F. N. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman. Upper and lower bounds on the cost of a map-reduce computation. PVLDB, 6(4), 2013.
[4] F. N. Afrati and J. D. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Trans. Knowl. Data Eng., 23(9):1282–1298, 2011.
[5] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, 1983.
[6] B. Awerbuch and Y. Shiloach. New connectivity and MSF algorithms for shuffle-exchange network and PRAM. IEEE Trans. Computers, 36(10), 1987.
[7] B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In Proc. of SIGMOD'11, 2011.
[8] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and mapreduce. PVLDB, 5(5):454–465, 2012.
[9] J. Cohen. Graph twiddling in a mapreduce world. Computing in Science and Engineering, 11(4):29–41, 2009.
[10] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proc. of OSDI'04, 2004.
[11] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1), 2010.
[12] J. Dittrich, J.-A. Quiané-Ruiz, S. Richter, S. Schuh, A. Jindal, and J. Schad. Only aggressive elephants are fast elephants. PVLDB, 5(11):1591–1602, 2012.
[13] U. Elsner. Graph partitioning - a survey. Technical Report SFB393/97-27, Technische Universität Chemnitz, 1997.
[14] A. Floratou, J. M. Patel, E. J. Shekita, and S. Tata. Column-oriented storage techniques for mapreduce. PVLDB, 4, 2011.
[15] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI'12, 2012.
[16] H. He, H. Wang, J. Yang, and P. S. Yu. BLINKS: ranked keyword searches on graphs. In Proc. of SIGMOD'07, 2007.
[17] V. Hristidis, H. Hwang, and Y. Papakonstantinou. Authority-based keyword search in databases. ACM Trans. Database Syst., 33(1), 2008.
[18] E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for mapreduce programs. PVLDB, 4, 2011.
[19] U. Kang, H. Tong, J. Sun, C.-Y. Lin, and C. Faloutsos. Gbase: a scalable and general graph management system. In Proc. of KDD'11, 2011.
[20] U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec. Hadi: Mining radii of large graphs. TKDD, 5(2), 2011.
[21] U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system implementation and observations. In Proc. of ICDM'09, 2009.
[22] D. R. Karger, N. Nisan, and M. Parnas. Fast connected components algorithms for the EREW PRAM. SIAM J. Comput., 28(3), 1999.
[23] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In Proc. of SODA'10, 2010.
[24] E. Krepska, T. Kielmann, W. Fokkink, and H. E. Bal. Hipg: parallel processing of large-scale graphs. Operating Systems Review, 45(2), 2011.
From "think [5] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. like a vertex" to "think like a graph". PVLDB,7(3),2013. Addison-Wesley, 1983. [43] L. G. Valiant. A bridging model for parallel computation. Commun. ACM, [6] B. Awerbuch and Y. Shiloach. New connectivity and msf algorithms for 33(8):103–111, 1990. shuffle-exchange network and pram. IEEE Trans. Computers,36(10),1987. [44] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using [7] B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In Proc. of SIGMOD’10,2010. mapreduce. In Proc. of SIGMOD’11,2011. [45] G. Wang and C.-Y. Chan. Multi-query optimization in mapreduce framework. [8] B. Bahmani, R. Kumar, and S. Vassilvitskii. Densest subgraph in streaming and PVLDB,7(3),2013. mapreduce. PVLDB,5(5):454–465,2012. [46] J. Xiang, C. Guo, and A. Aboulnaga. Scalable maximum clique computation [9] J. Cohen. Graph twiddling in a mapreduce world. Computing in Science and using mapreduce. In Proc. of ICDE’13,2013. Engineering,11(4):29–41,2009. [47] K. Zeng, J. Yang, H. Wang, B. Shao, and Z. Wang. A distributed graph engine [10] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large for web scale rdf data. PVLDB,6(4),2013. clusters. In Proc. of OSDI’04,2004. [48] X. Zhang, L. Chen, and M. Wang. Efficient multi-way theta-join processing [11] J. Dittrich, J.-A. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad. using mapreduce. PVLDB,5(11):1184–1195,2012. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB,3(1),2010.
