Fast Parallel Conversion of Edge List to Adjacency List for Large-Scale Graphs

Shaikh Arifuzzaman†‡ and Maleq Khan‡
†Department of Computer Science
‡Network Dynamics & Simulation Science Lab, Virginia Bioinformatics Institute
Virginia Tech, Blacksburg, VA 24061, USA
{sm10, maleq}@vbi.vt.edu

ABSTRACT
In the era of Bigdata, we are deluged with massive graph data emerging from numerous social and scientific applications. In most cases, graph data are generated as lists of edges (edge list), where an edge denotes a link between a pair of entities. However, most graph algorithms work efficiently when the information of the adjacent nodes (adjacency list) of each node is readily available. Although the conversion from edge list to adjacency list can be trivially done on the fly for small graphs, such conversion becomes challenging for the emerging large-scale graphs consisting of billions of nodes and edges. These graphs do not fit into the main memory of a single computing machine and thus require distributed-memory parallel or external-memory algorithms.

In this paper, we present efficient MPI-based distributed-memory parallel algorithms for converting edge lists to adjacency lists. To the best of our knowledge, this is the first work on this problem. To address the critical load balancing issue, we present a parallel load balancing scheme which improves both time and space efficiency significantly. Our fast parallel algorithm works on massive graphs, achieves very good speedups, and scales to a large number of processors. The algorithm can convert an edge list of a graph with 20 billion edges to the adjacency list in less than 2 minutes using 1024 processors. Denoting the number of nodes, edges and processors by n, m, and P, respectively, the time complexity of our algorithm is O(m/P + n + P), which provides a speedup factor of at least Ω(min{P, d_avg}), where d_avg is the average degree of the nodes. The algorithm has a space complexity of O(m/P), which is optimal.

Author Keywords
Parallel Algorithms; Massive Graphs; Bigdata; Edge List; Adjacency List; Load Balancing.

ACM Classification Keywords
D.1.3 Programming Techniques: Concurrent Programming—Parallel Programming; G.2.2 Discrete Mathematics: Graph Theory—Graph Algorithms
INTRODUCTION
Graph (network) is a powerful abstraction for representing underlying relations in large unstructured datasets. Examples include the web graph [11], various social networks, e.g., Facebook and Twitter [17], collaboration networks [19], infrastructure networks (e.g., transportation networks, telephone networks), and biological networks [16].

We denote a graph by G(V, E), where V and E are the set of vertices (nodes) and edges, respectively, with m = |E| edges and n = |V| vertices. In many cases, a graph is specified by simply listing the edges (u, v), (v, w), ... ∈ E in an arbitrary order, which is called an edge list. A graph can also be specified by a collection of adjacency lists of the nodes, where the adjacency list of a node v is the list of nodes that are adjacent to v. Many important graph algorithms, such as computing shortest paths, breadth-first search, and depth-first search, are executed by exploring the neighbors (adjacent nodes) of the nodes in the graph. As a result, these algorithms work efficiently when the input graph is given as adjacency lists. Although both edge list and adjacency list have a space requirement of O(m), scanning all neighbors of a node v in an edge list can take as much as O(m) time, compared to O(d_v) time in an adjacency list, where d_v is the degree of node v.

Adjacency matrix is another data structure used for graphs. Much of the earlier work [3, 15] uses an adjacency matrix A[., .] of order n × n for a graph with n nodes. Element A[i, j] denotes whether node j is adjacent to node i. All adjacent nodes of i can be determined by scanning the i-th row, which takes O(n) time compared to O(d_i) time with an adjacency list. Further, the adjacency matrix has a prohibitive space requirement of O(n^2) compared to O(m) for the adjacency list. In a real-world network, m can be much smaller than n^2 since the average degree of a node can be significantly smaller than n. Thus the adjacency matrix is not suitable for the analysis of emerging large-scale networks in the age of bigdata.

In most cases, graphs are generated as lists of edges since it is easier to capture pairwise interactions among entities in a system in arbitrary order than to capture all interactions of a single entity at the same time. Examples include capturing person-person connections in social networks and protein-protein links in protein interaction networks. This is true even for generating massive random graphs [2, 14], which is useful for modeling very large systems. As discussed by Leskovec et al. [18], some patterns only exist in massive datasets, and they are fundamentally different from those in smaller datasets. While generating such massive random graphs, algorithms usually output edges one by one. Edges incident on a node v are not necessarily generated consecutively. Thus a conversion of edge list to adjacency list is necessary for analyzing these graphs efficiently.

Why do we need parallel algorithms? With unprecedented advancement of computing and data technology, we are deluged with massive data from diverse areas such as business and finance [5], computational biology [12] and social science [7]. The web has over 1 trillion webpages. Most of the social networks, such as Twitter, Facebook, and MSN, have millions to billions of users [13]. These networks hardly fit in the memory of a single machine and thus require external-memory or distributed-memory parallel algorithms. Now, external-memory algorithms can be very I/O intensive, leading to a large runtime. Efficient distributed-memory parallel algorithms can solve both problems (runtime and space) by distributing computing tasks and data to multiple processors.

In a sequential setting with graphs small enough to be stored in the main memory, the problem of converting an edge list to an adjacency list is trivial, as described in the next section. However, the problem in a distributed-memory setting with massive graphs poses many non-trivial challenges. The neighbors of a particular node v might reside in multiple processors and need to be combined efficiently. Further, computation loads must be well-balanced among the processors to achieve a good performance of the parallel algorithm. Like many others, this problem demonstrates how a simple trivial problem can turn into a challenging problem when we are dealing with bigdata.

Contributions. In this paper, we study the problem of converting edge list to adjacency list for large-scale graphs. We present MPI-based distributed-memory parallel algorithms which work for both directed and undirected graphs. We devise a parallel load balancing scheme which balances the computation load very well and improves the efficiency of the algorithms significantly, both in terms of runtime and space requirement. Furthermore, we present two efficient merging schemes for combining the neighbors of a node from different processors, message-based and external-memory merging, which offer a convenient trade-off between space and runtime. Our algorithms work on massive graphs, demonstrate very good speedups on both real and artificial graphs, and scale to a large number of processors. The edge list of a graph with 20B edges can be converted to the adjacency list in two minutes using 1024 processors. We also provide rigorous theoretical analysis of the time and space complexity of our algorithms. The time and space complexity of our algorithms are O(m/P + n + P) and O(m/P), respectively, where n, m, and P are the number of nodes, edges, and processors, respectively. The speedup factor is at least Ω(min{P, d_avg}), where d_avg is the average degree of the nodes.

Organization. The rest of the paper is organized as follows. The preliminary concepts and background of the work are briefly described in the preliminaries and background section. Our parallel algorithm along with the load balancing scheme is presented in the parallel algorithm section. In the dataset and experimental setup section, we present our dataset and the specification of the computing resources used for the experiments. We discuss our experimental results in the performance analysis section, and we conclude thereafter.
PRELIMINARIES AND BACKGROUND
In this section, we describe the basic definitions used in this paper and then present a sequential algorithm for converting edge list to adjacency list.

Basic definitions
We assume the n vertices of the graph G(V, E) are labeled 0, 1, 2, ..., n−1. We use the words node and vertex interchangeably. If (u, v) ∈ E, we say u and v are neighbors of each other. The set of all adjacent nodes (neighbors) of v ∈ V is denoted by N_v, i.e., N_v = {u ∈ V | (u, v) ∈ E}. The degree of v is d_v = |N_v|.

In the edge-list representation, edges (u, v) ∈ E are listed one after another without any particular order. Edges incident to a particular node v are not necessarily listed together. On the other hand, in the adjacency-list representation, for every v, the adjacent nodes of v, N_v, are listed together. An example of these representations is shown in Figure 1.

Edge list: (0, 1), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)
Adjacency list: N_0 = {1}, N_1 = {0, 2, 3, 4}, N_2 = {1, 3}, N_3 = {1, 2, 4}, N_4 = {1, 3}

Figure 1: The edge list and adjacency list representations of an example graph with 5 nodes and 6 edges.

A Sequential Algorithm
The sequential algorithm for converting an edge list to an adjacency list works as follows. Create an empty list N_v for each node v, and then, for each edge (u, v) ∈ E, include u in N_v and v in N_u. The pseudocode of the sequential algorithm is given in Figure 2. For a directed graph, line 5 of the algorithm should be omitted since a directed edge (u, v) doesn't imply that there is also an edge (v, u). In our subsequent discussion, we assume that the graph is undirected. However, the algorithm also works for directed graphs with the mentioned modification.

1: for each v ∈ V do
2:    N_v ← ∅
3: for each (u, v) ∈ E do
4:    N_u ← N_u ∪ {v}
5:    N_v ← N_v ∪ {u}

Figure 2: Sequential algorithm for converting edge list to adjacency list.
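For concreteness, here is a minimal Python sketch of the conversion in Figure 2; the function name, the in-memory edge-list format (a list of integer pairs), and the use of a dictionary keyed by node id are our own illustrative choices rather than anything prescribed by the paper.

```python
from collections import defaultdict

def edge_list_to_adjacency_list(edges, directed=False):
    """Convert a list of (u, v) edges into a dict mapping each node to its neighbor list."""
    adj = defaultdict(list)          # N_v for every node v, created lazily
    for u, v in edges:
        adj[u].append(v)             # include v in N_u
        if not directed:
            adj[v].append(u)         # line 5 of Figure 2: include u in N_v (undirected only)
    return adj

# Example: the graph of Figure 1
edges = [(0, 1), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
print(dict(edge_list_to_adjacency_list(edges)))
# {0: [1], 1: [0, 2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}
```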
This sequential algorithm is optimal since it takes O(m) time to process O(m) edges and thus cannot be improved further. The algorithm has a space complexity of O(m).

For small graphs that can be stored wholly in the main memory, the conversion in a sequential setting is trivial. However, emerging massive graphs pose many non-trivial challenges in terms of memory and execution efficiency. Such graphs might not fit in the local memory of a single computing node. Even if some of them fit in the main memory, the runtime might be prohibitively large. Efficient parallel algorithms can solve this problem by distributing computation and data among computing nodes. We present our parallel algorithm in the next section.

THE PARALLEL ALGORITHM
First we present the computational model and an overview of our parallel algorithm. A detailed description follows thereafter.

Computation Model
We develop parallel algorithms for message passing interface (MPI) based distributed-memory parallel systems, where each processor has its own local memory. The processors do not have any shared memory, one processor cannot directly access the local memory of another processor, and the processors communicate via exchanging messages using MPI constructs.

Overview of the Algorithm
Let P be the number of processors used in the computation and E be the list of edges given as input. Our algorithm has two phases of computation. In Phase 1, the edges E are partitioned into P initial partitions E_i, and each processor is assigned one such partition. Each processor then constructs neighbor lists from the edges of its own partition. However, edges incident to a particular node might reside in multiple processors, which creates multiple partial adjacency lists for the same node. In Phase 2 of our algorithm, such adjacency lists are merged together. Now, performing Phase 2 of the algorithm in a cost-effective way is very challenging. Further, computing loads among processors in both phases need to be balanced to achieve a significant runtime efficiency. The load balancing scheme should also make sure that the space requirements among processors are balanced so that large graphs can be processed. We describe the phases of our parallel algorithm in detail as follows.

(Phase 1) Local Processing
The algorithm partitions the set of edges E into P partitions E_i such that E_i ⊆ E and ∪_k E_k = E for 0 ≤ k ≤ P−1. Each partition E_i has almost m/P edges; to be exact, ⌈m/P⌉ edges, except for the last partition, which has slightly fewer (m − (P−1)⌈m/P⌉) edges. Processor i is assigned partition E_i. Processor i then constructs adjacency lists N_v^i for all nodes v such that (., v) ∈ E_i or (v, .) ∈ E_i. Note that adjacency list N_v^i is only a partial adjacency list since other partitions E_j might have edges incident on v. We call N_v^i the local adjacency list of v in partition i. The pseudocode for the Phase 1 computation is presented in Figure 3.

1: Each processor i, in parallel, executes the following.
2: for (u, v) ∈ E_i do
3:    N_v^i ← N_v^i ∪ {u}
4:    N_u^i ← N_u^i ∪ {v}

Figure 3: Algorithm for performing Phase 1 computation.

This phase of computation has both a runtime and a space complexity of O(m/P), as shown in Lemma 1.

LEMMA 1. Phase 1 of our parallel algorithm has both a runtime and a space complexity of O(m/P).

Proof: Each initial partition i has |E_i| = O(m/P) edges. Executing Lines 3-4 in Figure 3 for O(m/P) edges requires O(m/P) time. Now the total space required for storing the local adjacency lists N_v^i in partition i is 2|E_i| = O(m/P). □
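A minimal Python sketch of the Phase 1 computation (Figure 3) is shown below. The helper names and the simple contiguous splitting of the edge list into the initial partitions E_i are assumptions for illustration; the paper only requires that each partition holds roughly m/P edges.

```python
from collections import defaultdict

def initial_partitions(edges, P):
    """Split the edge list into P partitions of ceil(m/P) edges each (the last one may be smaller)."""
    m = len(edges)
    size = -(-m // P)                      # ceil(m / P)
    return [edges[i * size:(i + 1) * size] for i in range(P)]

def phase1_local_lists(E_i):
    """Figure 3: build the local (partial) adjacency lists N_v^i from the edges of one partition."""
    local = defaultdict(list)
    for u, v in E_i:
        local[v].append(u)
        local[u].append(v)
    return local

edges = [(0, 1), (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
parts = initial_partitions(edges, P=2)
locals_per_proc = [phase1_local_lists(E_i) for E_i in parts]   # one set of partial lists per processor
```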
Thus the computing loads and space requirements in Phase 1 are well-balanced. The second phase of our algorithm constructs the final adjacency list N_v from the local adjacency lists N_v^i of all processors i. Note that balancing the load for Phase 1 doesn't make the load well-balanced for Phase 2, which requires a more involved load balancing scheme, as described later in the following sections.

(Phase 2) Merging Local Adjacency Lists
Once all processors complete constructing the local adjacency lists N_v^i, the final adjacency lists N_v are created by merging the N_v^i from all processors as follows.

N_v = ∪_{i=0}^{P−1} N_v^i    (1)

The scheme used for merging local adjacency lists has a significant impact on the performance of the algorithm. One might think of using a dedicated merger processor. For each node v, the merger collects N_v^i from all other processors and merges them into N_v. This requires O(d_v) time for node v. Thus the runtime complexity for merging the adjacency lists of all v ∈ V is O(Σ_{v∈V} d_v) = O(m), which is at most as good as the sequential algorithm. Next we present our efficient parallel merging scheme, which employs P parallel mergers.

An Efficient Parallel Merging Scheme
To parallelize Phase 2 efficiently, our algorithm distributes the corresponding computation disjointly among processors. Each processor i is responsible for merging the adjacency lists N_v for the nodes v in V_i ⊂ V, such that for any i and j, V_i ∩ V_j = ∅ and ∪_i V_i = V. Note that this partitioning of nodes is different from the initial partitioning of edges. How the nodes in V are distributed among processors crucially affects the load balancing and performance of the algorithm. Further, this partitioning and load balancing scheme should be parallel to ensure the efficiency of the algorithm. Later in this section, we discuss a parallel algorithm to partition the set of nodes V which makes both the space requirement and the runtime well-balanced. Once the partitions V_i are given, the scheme for parallel merging works as follows.

Step 1: Let S_i be the set of all local adjacency lists in partition i. Processor i divides S_i into P disjoint subsets S_i^j, 0 ≤ j ≤ P−1, as defined below.

S_i^j = { N_v^i : v ∈ V_j }    (2)

Step 2: Processor i sends S_i^j to all other processors j.
This • i step introduces non-trivial efficiency issues which we shall Partitioning and Load Balancing discuss shortly. The performance of the algorithm depends on how loads are Step 3: Once processor i gets Si from all processors j, it distributed. In Phase 1, distributing the edges of the input • j graph evenly among processors provides an even load bal- constructs Nv for all v Vi by the following equation. ∈ ancing both in terms of runtime and space leading to both [ k m Nv = Nv (3) space and runtime complexity of O( P ). However, Phase 2 is k i computationally different than Phase 1 and requires different k:Nv ∈S k partitioning and load balancing scheme. We present two methods for performing Step 2 of the above In Phase 2 of our algorithm, set of nodes V is divided into P scheme. The first method explicitly exchanges messages subsets Vi where processor i merges adjacency lists Nv for all among processors to send and receive Sj by using mes- i v Vi. The time for merging Nv of a node v (referred to as sage buffers (main memory). The other method uses disk merging∈ cost d = j henceforth) is proportional to the degree v space (external memory) to exchange Si . We call the first Nv of node v. Total cost for merging incurred on a processor method message-based merging and the second external- i| is Θ(| P d ). Distributing equal number of nodes among v∈Vi v memory merging. processors may not make the computing load well-balanced j in many cases. Some nodes may have large degrees and some (1) Message-based Merging: Each processor i sends S di- i very small. As shown in Figure5, distribution of merging cost rectly to processor j via messages. Specifically, processor i P i i ( dv) across processors is very uneven with an equal sends N (with a message < v, N >) to processor j where v∈Vi v v number of nodes assigned to each processor. Thus the set V v V| .| A processor might send multiple lists to another j should be partitioned in such a way that the cost of merging processor.∈ In such cases, messages to a particular proces- is almost equal in all processors. sor are bundled together to reduce communication overhead. j Once a processor i receives messages < v, Nv > from other P −1 j 1e+09 processors, for v Vi, it computes Nv = j=0 Nv . The Miami ∈ ∪ LiveJournal pseudocode of this algorithm is given in Figure4. Twitter 1e+08 1: for each v s.t. (., v) Ei (v, .) Ei do i ∈ ∨ ∈ 2: Send < v, Nv > to proc. j where v Vj ∈ 1e+07 3: for each v Vi do ∈ Load 4: Nv ← ∅ j 5: for each < v, Nv > received from any proc. j do j 1e+06 6: Nv Nv N ← ∪ v Figure 4: Parallel algorithm for merging local adjacency lists 100000 0 5 10 15 20 25 30 35 40 45 50 to construct final adjacency lists Nv. A message, denoted by i Rank of Processors < v, Nv >, refers to local adjacency lists of v in processor i.

(2) External-memory Merging: Each processor i writes S_i^j in intermediate disk files F_i^j, one for each processor j. Processor i reads all files F_j^i for the partial adjacency lists N_v^j for each v ∈ V_i and merges them into the final adjacency lists using Step 3 of the above scheme. However, processor i doesn't read the whole file into its main memory. It only stores the local adjacency lists N_v^j of one node v at a time, merges them into N_v, releases the memory, and then proceeds to merge the next node v + 1. This works correctly since, while writing S_i^j in F_i^j, the local adjacency lists N_v^i are listed in the sorted order of v. External-memory merging thus has a space requirement of O(max_v d_v). However, the I/O operations lead to a higher runtime with this method than with message-based merging, although the asymptotic runtime complexity remains the same. We demonstrate this space-runtime tradeoff between the two methods in our performance analysis section.
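A file-based Python sketch of this idea follows; the one-line-per-node text format ("v: neighbors", written in increasing order of v), the F_<i>_<j>.txt naming convention, and the block ownership of node ids are illustrative assumptions.

```python
import os
from collections import defaultdict

def write_intermediate_files(local, n, P, my_rank, workdir):
    """Write S_i^j to a file F_i^j for each destination j, with nodes in increasing order of v."""
    block = -(-n // P)
    handles = [open(os.path.join(workdir, f"F_{my_rank}_{j}.txt"), "w") for j in range(P)]
    for v in sorted(local):
        j = min(v // block, P - 1)
        handles[j].write(f"{v}:{' '.join(map(str, local[v]))}\n")
    for h in handles:
        h.close()

def external_merge(my_rank, P, workdir):
    """Stream-merge the files F_j^i from every j; only one node's lists are held in memory at a time.
    (In an MPI run a barrier is needed between the writing and the reading step.)"""
    readers = [open(os.path.join(workdir, f"F_{j}_{my_rank}.txt")) for j in range(P)]
    heads = [r.readline() for r in readers]
    while any(heads):
        v = min(int(h.split(":")[0]) for h in heads if h)   # smallest node id at the head of any file
        N_v = []
        for k, h in enumerate(heads):
            if h and int(h.split(":")[0]) == v:
                N_v.extend(map(int, h.split(":")[1].split()))
                heads[k] = readers[k].readline()
        yield v, N_v            # emit the final adjacency list of v, then release it
    for r in readers:
        r.close()
```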
The runtime and space complexity of parallel merging depend on the partitioning of V. Next, we discuss the partitioning and load balancing scheme, followed by the complexity analyses.

Partitioning and Load Balancing
The performance of the algorithm depends on how loads are distributed. In Phase 1, distributing the edges of the input graph evenly among processors provides an even load balancing both in terms of runtime and space, leading to both space and runtime complexity of O(m/P). However, Phase 2 is computationally different than Phase 1 and requires a different partitioning and load balancing scheme.

In Phase 2 of our algorithm, the set of nodes V is divided into P subsets V_i, where processor i merges the adjacency lists N_v for all v ∈ V_i. The time for merging N_v of a node v (referred to as the merging cost henceforth) is proportional to the degree d_v = |N_v| of node v. The total cost for merging incurred on a processor i is Θ(Σ_{v∈V_i} d_v). Distributing an equal number of nodes among processors may not make the computing load well-balanced in many cases. Some nodes may have large degrees and some very small. As shown in Figure 5, the distribution of merging cost (Σ_{v∈V_i} d_v) across processors is very uneven with an equal number of nodes assigned to each processor. Thus the set V should be partitioned in such a way that the cost of merging is almost equal in all processors.

Figure 5: Load distribution among processors for LiveJournal, Miami and Twitter before applying the load balancing scheme (load vs. rank of processors). Summary of our dataset is provided in Table 2.

Let f(v) be the cost associated with constructing N_v by receiving and merging the local adjacency lists for a node v ∈ V. We need to compute P disjoint partitions of the node set V such that for each partition V_i,

Σ_{v∈V_i} f(v) ≈ (1/P) Σ_{v∈V} f(v).

Now, note that, for each node v, the total size of the local adjacency lists received (in Line 5 in Figure 4) equals the number of adjacent nodes of v, i.e., |N_v| = d_v. Merging the local adjacency lists N_v^j via set union operations (Line 6) also requires d_v time. Thus, f(v) = d_v. Now, since the adjacent nodes of a node v can reside in multiple processors, computing f(v) = |N_v| = d_v requires communication among multiple processors. For all v, computing f(v) sequentially requires O(m + n) time, which diminishes the advantages gained by the parallel algorithm. Thus, we compute f(v) = d_v for all v in parallel, in O((n+m)/P + c) time, where c is the communication cost. We will discuss the complexity shortly. This algorithm works as follows: for determining d_v for v ∈ V in parallel, each processor i computes d_v for n/P nodes v, where v starts from in/P to (i+1)n/P − 1. Such nodes v satisfy the equation ⌊v/(n/P)⌋ = i. Now, for each local adjacency list N_v^i constructed in Phase 1, processor i sends d_v^i = |N_v^i| to processor j = ⌊v/(n/P)⌋ with a message <v, d_v^i>. Once processor i receives the messages <v, d_v^j> from other processors, it computes f(v) = d_v = Σ_{j=0}^{P−1} d_v^j for all nodes v such that ⌊v/(n/P)⌋ = i. The pseudocode of the parallel algorithm for computing f(v) = d_v is given in Figure 6.

1: for each v s.t. (., v) ∈ E_i ∨ (v, .) ∈ E_i do
2:    d_v^i ← |N_v^i|
3:    j ← ⌊v/(n/P)⌋
4:    Send <v, d_v^i> to processor j
5: for each v s.t. ⌊v/(n/P)⌋ = i do
6:    d_v ← 0
7: for each <v, d_v^j> received from any proc. j do
8:    d_v ← d_v + d_v^j

Figure 6: Parallel algorithm executed by each processor i for computing f(v) = d_v.

Once f(v) is computed for all v ∈ V, we compute the cumulative sum F(t) = Σ_{v=0}^{t} f(v) in parallel by using a parallel prefix sum algorithm [4]. Each processor i computes and stores F(t) for n/P nodes t, where t starts from in/P to (i+1)n/P − 1. This computation takes O(n/P + P) time. Then, we need to compute the V_i such that the computation loads are well-balanced among processors. The partitions V_i are disjoint subsets of consecutive nodes, i.e., V_i = {n_i, n_i + 1, ..., n_{i+1} − 1} for some node n_i. We call n_i the start node or boundary node of partition i. Now, V_i is computed in such a way that the sum Σ_{v∈V_i} f(v) becomes almost equal (≈ (1/P) Σ_{v∈V} f(v)) for all partitions i. At the end of this execution, each processor i knows n_i and n_{i+1}. The algorithm presented in [6] computes V_i for the problem of triangle counting. The algorithm can also be applied to our problem to compute V_i using the cost function f(v) = d_v. In summary, computing the load balancing for Phase 2 has the following main steps (a combined sketch is given after the list).

Step 1: Compute the cost f(v) = d_v for all v in parallel by the algorithm shown in Figure 6.

Step 2: Compute the cumulative sums F(v) by a parallel prefix sum algorithm [4].

Step 3: Compute the boundary nodes n_i for every subset V_i = {n_i, ..., n_{i+1} − 1} using the algorithms of [1, 6].
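Below is a combined Python (mpi4py) sketch of these three steps under the same block ownership of node ids used in the earlier sketches. The prefix sum uses MPI's exscan, and the boundary computation is a simplified stand-in for the algorithms of [1, 6]: it cuts after the first node whose cumulative cost reaches each k/P share of the total.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, P = comm.Get_rank(), comm.Get_size()

def phase2_boundaries(local, n):
    """local: dict v -> N_v^i from Phase 1 on this processor. Returns boundary nodes n_1, ..., n_{P-1}."""
    block = -(-n // P)
    lo, hi = rank * block, min((rank + 1) * block, n)

    # Step 1 (Figure 6): route each local degree d_v^i = |N_v^i| to the block owner of v and add them up.
    out = [dict() for _ in range(P)]
    for v, neighbors in local.items():
        out[min(v // block, P - 1)][v] = len(neighbors)
    f = np.zeros(max(hi - lo, 0), dtype=np.int64)        # f(v) = d_v for the nodes this processor owns
    for part in comm.alltoall(out):
        for v, d in part.items():
            f[v - lo] += d

    # Step 2: cumulative sums F(v): a local prefix sum shifted by an exclusive scan of the block totals.
    local_total = int(f.sum())
    offset = comm.exscan(local_total) or 0               # exscan is undefined (None) on rank 0
    F = np.cumsum(f) + offset
    total = comm.allreduce(local_total)

    # Step 3 (simplified stand-in for [1, 6]): cut after the first node whose cumulative cost
    # reaches the k/P share of the total, for k = 1, ..., P-1.
    cuts = []
    for k in range(1, P):
        t = (k * total) // P
        if offset < t <= offset + local_total:
            cuts.append(lo + int(np.searchsorted(F, t, side="left")) + 1)
    return sorted(c for sub in comm.allgather(cuts) for c in sub)
```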
LEMMA 2. The algorithm for balancing loads for Phase 2 has a runtime complexity of O((n+m)/P + P + max_i M_i) and a space requirement of O(n/P), where M_i is the number of messages received by processor i in Step 1.

Proof: For Step 1 of the above load balancing scheme, executing Lines 1-4 (Figure 6) requires O(|E_i|) = O(m/P) time. The cost for executing Lines 5-6 is O(n/P) since there are n/P nodes v such that ⌊v/(n/P)⌋ = i. Each processor i sends a total of O(m/P) messages since |E_i| = m/P. If the number of messages received by processor i is M_i, then Lines 7-8 of the algorithm have a complexity of O(M_i) (we compute bounds for M_i in Lemma 3). Computing Step 2 has a computational cost of O(n/P + P) [4]. Step 3 of the load balancing scheme requires O(n/P + P) time [1, 6]. Thus the runtime complexity of the load balancing scheme is O((n+m)/P + P + max_i M_i). Storing f(v) for n/P nodes has a space requirement of O(n/P). □

LEMMA 3. The number of messages M_i received by processor i in Step 1 of the load balancing scheme is bounded by O(min{n, Σ_{v=in/P}^{(i+1)n/P−1} d_v}).

Proof: Referring to Figure 6, each processor i computes d_v for n/P nodes v, where v starts from in/P to (i+1)n/P − 1. For each v, processor i may receive messages from at most (P−1) other processors. Thus, the number of received messages is at most (n/P) × (P−1) = O(n). Now, notice that, when all neighbors u ∈ N_v of v reside in different partitions E_j, processor i might receive as many as |N_v| = d_v messages for node v. This gives another upper bound, M_i = O(Σ_{v=in/P}^{(i+1)n/P−1} d_v). Thus we have M_i = O(min{n, Σ_{v=in/P}^{(i+1)n/P−1} d_v}). □

In most of the practical cases, each processor receives a much smaller number of messages than specified by the theoretical upper bound. In fact, for each node v, processor i receives messages from fewer than P − 1 processors. Let, for node v, processor i receive messages from O(P·l_v) processors, where l_v is a real number (0 ≤ l_v ≤ 1). Thus the total number of messages received is M_i = O(Σ_{v=in/P}^{(i+1)n/P−1} P·l_v). To get a crude estimate of M_i, let l_v = l for all v. The term l can be thought of as the average over all l_v. Then M_i = O((n/P)·P·l) = O(n·l). As shown in Table 1, the actual number of messages received, M_i, is up to 7× smaller than the theoretical bound.

Network        n       Σ_{v=in/P}^{(i+1)n/P−1} d_v    M_i     l (avg.)
Miami          2.1M    2.17M                           600K    0.27
LiveJournal    4.8M    2.4M                            560K    0.14
PA(5M, 20)     5M      2.48M                           1.4M    0.28

Table 1: Number of messages received in practice compared to the theoretical bounds. These results report max_i M_i with P = 50. A summary of our dataset is provided in Table 2.

LEMMA 4. Using the load balancing scheme discussed in this section, Phase 2 of our parallel algorithm has a runtime complexity of O(m/P). Further, the space required to construct all final adjacency lists N_v in a partition is O(m/P).

Proof: Lines 1-2 in the algorithm shown in Figure 4 require O(|E_i|) = O(m/P) time for sending at most |E_i| edges to other processors. Now, with load balancing, each processor receives and merges at most O(Σ_{v∈V} d_v / P) = O(m/P) edges (Lines 5-6). Thus the cost for merging the local lists N_v^j into the final list N_v has a runtime of O(m/P). Since the total size of the local and final adjacency lists in a partition is O(m/P), the space requirement is O(m/P). □

The runtime and space complexity of our complete parallel algorithm are formally presented in Theorem 1.

THEOREM 1. The runtime and space complexity of our parallel algorithm are O(m/P + P + n) and O(m/P), respectively.

Proof: The proof follows directly from Lemmas 1, 2, 3, and 4. □

The total space required by all processors to process m edges is O(m). Thus the space complexity O(m/P) of our parallel algorithm is optimal.

Performance gain with load balancing: The cost for merging incurred on each processor i is Θ(Σ_{v∈V_i} d_v) (Figure 4). Without load balancing, this cost can be as much as Θ(m) (it is easy to construct such skewed graphs), making the runtime complexity of the algorithm Θ(m). With the load balancing scheme, our algorithm achieves a runtime of O(m/P + P + n) = O(m/P + m/d_avg) for the usual case n > P. Thus, by simple algebraic manipulation, it is easy to see that the algorithm with the load balancing scheme achieves an Ω(min{P, d_avg})-factor gain in runtime efficiency over the algorithm without the load balancing scheme. In other words, the algorithm gains an Ω(P)-fold improvement in speedup when d_avg ≥ P and an Ω(d_avg)-fold improvement otherwise. We demonstrate this gain in speedup with experimental results in our performance analysis section.
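To spell out the algebraic step, the following sketch uses the relation n·d_avg = 2m for undirected graphs and the assumption n > P stated above.

```latex
% Sequential time is T_seq = Theta(m); parallel time with load balancing is T_par = O(m/P + n + P).
% Using n * d_avg = 2m (undirected) and n > P:
\[
  T_{\mathrm{par}} = O\!\left(\tfrac{m}{P} + n + P\right)
                   = O\!\left(\tfrac{m}{P} + \tfrac{m}{d_{\mathrm{avg}}}\right)
                   = O\!\left(\tfrac{m}{\min\{P,\, d_{\mathrm{avg}}\}}\right),
\]
\[
  \text{speedup} = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}}
                 = \Omega\!\left(\min\{P,\, d_{\mathrm{avg}}\}\right).
\]
```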
DATA AND EXPERIMENTAL SETUP
We present the datasets and the specification of the computing resources used in our experiments below.

Datasets. We used both real-world and artificially generated networks for evaluating the performance of our algorithm. A summary of all the networks is provided in Table 2. The number of nodes in our networks ranges from 37K to 1B and the number of edges from 0.36M to 20B. Twitter [17], web-BerkStan, Email-Enron and LiveJournal [20] are real-world networks. Miami is a synthetic, but realistic, social contact network [6, 9] for Miami city. The networks Gnp(n, d) and PA(n, d) are random networks. Gnp(n, d) is generated using the Erdős–Rényi model [10] with n nodes and average degree d. The network PA(n, d) is generated using the preferential attachment (PA) model [8] with n nodes and average degree d. Both the real-world and PA(n, d) networks have skewed degree distributions, which makes load balancing a challenging task and gives us a chance to measure the performance of our algorithms in some of the worst case scenarios.

Network        Nodes    Edges    Source
Email-Enron    37K      0.36M    SNAP [20]
web-BerkStan   0.69M    13M      SNAP [20]
Miami          2.1M     100M     [9]
LiveJournal    4.8M     86M      SNAP [20]
Twitter        42M      2.4B     [17]
Gnp(n, d)      n        nd/2     Erdős–Rényi
PA(n, d)       n        nd/2     Pref. Attachment

Table 2: Dataset used in our experiments. Notations K, M and B denote thousands, millions and billions, respectively.

Experimental Setup. We perform our experiments using a high performance computing cluster with 64 computing nodes, 16 processors (Sandy Bridge E5-2670, 2.6GHz) per node, 4GB of memory per processor, and the SLES 11.1 operating system.

PERFORMANCE ANALYSIS
In this section, we present the experimental results evaluating the performance of our algorithm.

Load distribution
Load distribution among processors can be very uneven without applying our load balancing scheme, as discussed in the partitioning and load balancing section. We show a comparison of load distribution on various networks with and without the load balancing scheme in Figure 7. Our scheme provides an almost equal load among the processors, even for graphs with very skewed degree distributions such as LiveJournal and Twitter. Loads are significantly uneven for such skewed networks without the load balancing scheme.

(a) Miami network  (b) LiveJournal network  (c) Twitter network
Figure 7: Load distribution among processors for LiveJournal, Miami and Twitter networks by different schemes.

Strong Scaling
Figure 8 shows the strong scaling (speedup) of our algorithm on the LiveJournal, Miami and Twitter networks with and without the load balancing scheme. Our algorithm demonstrates very good speedups, e.g., it achieves a speedup factor of ≈300 with 1024 processors for the Twitter network. Speedup factors increase almost linearly for all networks, and the algorithm scales to a large number of processors. Figure 8 also shows the speedup factors the algorithm achieves without the load balancing scheme. Speedup factors with the load balancing scheme are significantly higher than those without the load balancing scheme. For the Miami network, the differences in speedup factors are not very large since Miami has a relatively even degree distribution and loads are already fairly balanced without the load balancing scheme. However, for real-world skewed networks, our load balancing scheme always improves the speedup quite significantly; for example, with 1024 processors, the algorithm achieves a speedup factor of 297 with the load balancing scheme compared to 60 without the load balancing scheme for the LiveJournal network.

This experiment also demonstrates that our algorithm scales to a large number of processors. The speedup factors continue to grow almost linearly up to 1024 processors.

(a) Miami network  (b) LiveJournal network  (c) Twitter network

Figure 8: Strong scaling of our algorithm on LiveJournal, Miami and Twitter networks with and without load balancing scheme. Computation of speedup factors includes the cost for load balancing.

Comparison between Message-based and External-memory Merging
We compare the runtime and memory usage of our algorithm with both message-based and external-memory merging. Message-based merging is very fast and uses message buffers in main memory for communication. On the other hand, external-memory merging saves main memory by using disk space, even though it requires a larger runtime for I/O operations. Thus these two methods provide desirable alternatives to trade off between space and runtime. However, as shown in Table 3, message-based merging is significantly faster (up to 20×) than external-memory merging, albeit taking a little more space. Thus message-based merging is the preferable method in our fast parallel algorithm.

Network         Memory (MB)          Runtime (s)
                EXT       MSG        EXT        MSG
Email-Enron     1.8       2.4        3.371      0.078
web-BerkStan    7.6       10.3       10.893     1.578
Miami           26.5      43.34      33.678     6.015
LiveJournal     28.7      42.4       31.075     5.112
Twitter         685.93    1062.7     1800.984   90.894
Gnp(500K, 20)   6.1       9.8        6.946      1.001
PA(5M, 20)      68.2      100.1      35.837     7.132
PA(1B, 20)      9830.5    12896.6    14401.5    1198.30

Table 3: Comparison of external-memory (EXT) and message-based (MSG) merging (using 50 processors).

Weak Scaling
Weak scaling of a parallel algorithm shows the ability of the algorithm to maintain constant computation time when the problem size grows proportionally with the increasing number of processors. As shown in Figure 9, the total computation time of our algorithm (including the load balancing time) grows slowly with the addition of processors. This is expected since the communication overhead increases with additional processors. However, the growth of the runtime of our algorithm is rather slow and remains almost constant; the weak scaling of the algorithm is very good.

Figure 9: Weak scaling of our parallel algorithm (time required in seconds vs. number of processors). For this experiment we use networks PA(x/10 × 1M, 20) for x processors.

CONCLUSION
We present a parallel algorithm for converting the edge list of a graph to an adjacency list. The algorithm scales well to a large number of processors and works on massive graphs. We devise a load balancing scheme that improves both the space efficiency and the runtime of the algorithm, even for networks with very skewed degree distributions. To the best of our knowledge, it is the first parallel algorithm to convert edge list to adjacency list for large-scale graphs. It also allows other graph algorithms to work on massive graphs, which emerge naturally as edge lists. Furthermore, this work demonstrates how a seemingly trivial problem becomes challenging when we are dealing with Bigdata.

ACKNOWLEDGMENT
We thank our external collaborators, members of the Network Dynamics & Simulation Science Laboratory, and anonymous reviewers for their suggestions and comments. This work has been partially supported by DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0001, NSF NetSE Grant CNS-1011769, and NSF SDCI Grant OCI-1032677.

REFERENCES
1. Alam, M., and Khan, M. Efficient algorithms for generating massive random networks. Technical Report 13-064, NDSSL at Virginia Tech, May 2013.
2. Alam, M., Khan, M., and Marathe, M. Distributed-memory parallel algorithms for generating massive scale-free networks using preferential attachment model. In Proc. of the Intl. Conf. on High Performance Computing, Networking, Storage and Analysis (2013).
3. Alon, N., Yuster, R., and Zwick, U. Finding and counting given length cycles. Algorithmica 17 (1997), 209–223.
4. Aluru, S. Teaching parallel computing through parallel prefix. In the Intl. Conf. on High Performance Computing, Networking, Storage and Analysis (2012).
5. Apte, C., et al. Business applications of data mining. Communications of the ACM 45, 8 (Aug. 2002), 49–53.
6. Arifuzzaman, S., Khan, M., and Marathe, M. PATRIC: A parallel algorithm for counting triangles in massive networks. In Proc. of the 22nd ACM Intl. Conf. on Information and Knowledge Management (2013).
7. Attewell, P., and Monaghan, D. Data Mining for the Social Sciences: An Introduction. University of California Press, 2015.
8. Barabasi, A., et al. Emergence of scaling in random networks. Science 286 (1999), 509–512.
9. Barrett, C., et al. Generation and analysis of large synthetic social contact networks. In Proc. of the 2009 Winter Simulation Conference (2009), 103–114.
10. Bollobas, B. Random Graphs. Cambridge Univ. Press, 2001.
11. Broder, A., et al. Graph structure in the web. Computer Networks (2000), 309–320.
12. Chen, J., and Lonardi, S. Biological Data Mining, 1st ed. Chapman & Hall/CRC, 2009.
13. Chu, S., and Cheng, J. Triangle listing in massive networks and its applications. In Proc. of the 17th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (2011), 672–680.
14. Chung, F., et al. Complex Graphs and Networks. American Mathematical Society, Aug. 2006.
15. Coppersmith, D., and Winograd, S. Matrix multiplication via arithmetic progressions. In Proc. of the 19th Annual ACM Symposium on Theory of Computing (1987), 1–6.
16. Girvan, M., and Newman, M. Community structure in social and biological networks. Proc. Natl. Acad. of Sci. USA 99, 12 (June 2002), 7821–7826.
17. Kwak, H., et al. What is Twitter, a social network or a news media? In Proc. of the 19th Intl. Conf. on World Wide Web (2010), 591–600.
18. Leskovec, J. Dynamics of large networks. Ph.D. Thesis, Pittsburgh, PA, USA (2008).
19. Newman, M. Coauthorship networks and patterns of scientific collaboration. Proc. Natl. Acad. of Sci. USA 101, 1 (April 2004), 5200–5205.
20. SNAP. Stanford network analysis project. http://snap.stanford.edu/, 2012.