Faster Parallel Graph Connectivity Algorithms for GPU

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science and Engineering by Research

by

Mihir Wadwekar 201202026 [email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
November 2019

Copyright © Mihir Wadwekar, 2019
All Rights Reserved

International Institute of Information Technology, Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Faster Parallel Graph Connectivity Algorithms for GPU" by Mihir Wadwekar, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                  Adviser: Dr. Kishore Kothapalli

To My Family

Acknowledgments

I would like to take this opportunity to thank my advisor, Dr. Kishore Kothapalli. He has guided me at every stage of my work and was always helpful and understanding. His insights and experience made the entire journey, from trying out ideas to developing and publishing them, a lot smoother and easier. Working with him was an enriching and enjoyable experience for me. I would also like to thank my family, who always understood and supported me. Their undying faith and support gave me the confidence and freedom to navigate the unknown waters of research. Lastly, I would like to thank all my friends. Research is not a straightforward process and can often be frustrating and challenging, but being with you guys made the journey a bit easier and a lot more enjoyable.

Abstract

Finding whether a graph is k-connected, and identifying its k-connected components, is a fundamental problem in graph theory. For this reason, there have been several algorithms for this problem in both sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. It can also be noticed that the time spent by these algorithms on BFS operations is usually a significant portion of their overall runtime. While BFS can be made very efficient in a sequential setting, the same cannot be said of parallel environments. A major source of this difficulty is the inherent requirement to use a shared queue, balance work among multiple threads in every round, synchronize, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. In this thesis, we present the first GPU algorithms and implementations for 2- and 3-connectivity. We improve upon them through the use of certificates to reduce the size of the input graph and provide the fastest implementations yet. We also study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but can actually work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing steps on suitably defined small auxiliary graphs. We apply our technique to our GPU implementations for 2- and 3-connectivity and improve upon them further by factors of 2.2x and 2.1x respectively.

Contents

Abstract

1 Introduction
  1.1 Parallel Algorithms in Graph Theory
    1.1.1 Challenges
  1.2 GPUs as a Parallel Computation Platform
    1.2.1 Brief History
    1.2.2 GPU Architecture
    1.2.3 Software Frameworks
      1.2.3.1 OpenCL
      1.2.3.2 CUDA
  1.3 k-connectivity
    1.3.1 Previous Parallel Approaches
      1.3.1.1 1-connectivity
      1.3.1.2 2-connectivity
      1.3.1.3 3-connectivity
      1.3.1.4 BFS
    1.3.2 Motivation for our Approach
  1.4 Our Contributions
    1.4.1 GPU Algorithm for 2-connectivity
    1.4.2 GPU Algorithm for 3-connectivity
    1.4.3 Expediting Parallel k-connectivity Algorithms

2 GPU-BiCC: GPU Algorithm for 2-connectivity
  2.1 Overview
  2.2 Motivation
  2.3 Algorithm GPU-BiCC
    2.3.1 Algorithm
    2.3.2 Complexity Analysis
  2.4 Implementation
  2.5 Experiments And Analysis
    2.5.1 Setup
    2.5.2 Results
  2.6 Extension to Dense Graphs

3 GPU-TriCC: GPU Algorithm for 3-connectivity
  3.1 Overview
  3.2 The Algorithm of Miller and Ramachandran for Graph Triconnectivity
  3.3 Triconnectivity on GPU
    3.3.1 Dataset
    3.3.2 Results

4 Expediting Parallel Graph Connectivity Algorithms
  4.1 Motivation
  4.2 An Overview of our Approach
  4.3 Application to 2-Connectivity
    4.3.1 Our Approach
      4.3.1.1 Algorithm Sample-GPU-BiCC
    4.3.2 Implementation Details
    4.3.3 Experimental Results, Analysis, and Discussion
      4.3.3.1 Experimental Platform and Dataset
      4.3.3.2 Results
      4.3.3.3 Discussion
  4.4 Application to 3-connectivity
    4.4.1 Our Approach
      4.4.1.1 Algorithm Sample-GPU-TriCC
    4.4.2 Implementation Details
    4.4.3 Experimental Results, Analysis, and Discussion
      4.4.3.1 Dataset
      4.4.3.2 Results
      4.4.3.3 Discussion

5 Conclusions and Future Work

Related Publications

Bibliography

List of Figures

1.1 A sample social network diagram displaying friendship ties among a set of Facebook users [54].
1.2 Side-by-side comparison of the same video game character rendered in 1996 and in 2017.
1.3 Block diagram of a GPU (G80/GT200) card [11].
1.4 A graph with connectivity 4 [53]. One can remove any 3 vertices and the graph would still remain connected.

2.1 H's vertex set is the base vertices of G, with edges induced by the non-tree edges of G. H′ is generated after applying the connected components algorithm to H and contracting the trees. The unique ID of every connected component in H′ serves as the ID of the alias vertex in G′.
2.2 Figure shows the cycle created by paths P_{xy}, P_{yu_j}, P_{u_j u_i}, and P_{u_i x}. For ease of exposition, the auxiliary graph shown contains only the changes made with respect to u and not the changes induced with respect to other vertices.
2.3 Here a is an articulation vertex because aa″ is a bridge. Vertex u had an upward traversal in Step 2 and hence aa′ cannot be a bridge.
2.4 Thread t1 marks all edges it traversed with its ID. Then, in Array2, it stores the LCA vertex found. Thus every edge of the graph knows its LCA vertex, by first looking up the thread which discovered it and then the corresponding value stored at that thread ID.
2.5 The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of GPU-BiCC over the next fastest one.
2.6 The primary Y-axis represents the timings in milliseconds. On the secondary Y-axis, Speedup-1 represents the speedup of Cert-GPU-BiCC over BFS-BiCC while Speedup-2 shows the speedup of GPU-BiCC over BFS-BiCC.

3.1 Figure showing the stages in the algorithm of Miller and Ramachandran [29].
3.2 Figure showing the time taken by Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The number on each bar indicates the percentage of time spent by Algorithm Cert-GPU-TriCC in BFS operations.

4.1 Figure shows the percentage of time spent by Algorithm Cert-GPU-BiCC (cf. Section 2.6) on BFS operations.

4.2 Figure illustrating our technique in comparison to other approaches towards practical parallel graph algorithms. The top arrow represents direct computation that is usually expensive. The middle arrow indicates preprocessing via strict structural subgraphs or constructs that are sometimes expensive to create. The bottom path corresponds to the approach proposed in this paper. In the figure, red/solid arrows indicate expensive operations while green/hollow arrows indicate operations that are easy to perform in general.
4.3 An example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.
4.4 Figure showing the time taken by Algorithms Cert-GPU-BiCC and Sample-GPU-BiCC on the graphs listed in Table 4.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-BiCC over Algorithm Cert-GPU-BiCC.
4.5 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph kron as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.
4.6 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph coPaperDBLP as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.
4.7 Figure showing the time taken by Algorithms Cert-GPU-TriCC and Sample-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-TriCC over Algorithm Cert-GPU-TriCC.
4.8 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rm07r as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in millions at various values of k.
4.9 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph randTriCC1 as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F at various values of k.

List of Tables

2.1 Sparse Graphs for GPU-BiCC
2.2 Dense Graphs for GPU-BiCC

3.1 Graphs used in our experiments for triconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

4.1 Graphs used in our experiments for biconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Chapter 1

Introduction

1.1 Parallel Algorithms in Graph Theory

A graph is a collection of objects (nodes) in which pairs of nodes may share a link based on some defined property or relation. Graph theory is the study of such graphs. Graphs are a useful tool in studying the process flows and dependencies involved in real-world systems. In computer science, graphs have traditionally been employed as an aid in modelling, understanding and visualizing underlying data. For example, social networks often model their users as a graph where two users are connected if they are "friends" or share the same interests. This structure helps such services understand their own user base and how they can expand it or bring like-minded people together.

Figure 1.1 A sample social network diagram displaying friendship ties among a set of Facebook users [54].

Graph theory today finds application in networks, machine learning, linguistics, and flow computation, to name a few domains. Several real-world problems can be broken down into simpler, fundamental graph problems. Over the years, several efficient graph algorithms have been proposed for the standard graph problems. Discussing all the advances in graph algorithms would be beyond the scope of this thesis. However, a majority of these algorithms were serial in nature. With the rapid improvement in single-core processors from the 1980s till the early 2000s, it made sense to develop serial algorithms. Every new generation of processors increased clock speeds and transistor counts, leading to better performance.

With time, physical limitations were reached in single-core processors, as further increases in frequency led to unmanageable heating problems. Transistor scaling also became more difficult. Eventually, it became cheaper and faster to have two slower cores in parallel than to design increasingly faster single-core processors. As the focus shifted to parallel architectures, so did the algorithms. Parallel algorithms became more widespread with the rise of multi-core architectures, where multiple cores are embedded in a single chip. Multi-core architectures were soon followed by the development of many-core architectures, which use a larger number of simpler, independent processors. This surge in parallel architectures led to an increased emphasis on parallel algorithms, as they could now potentially be deployed across thousands of cores to deliver results which would have been impossible to achieve in a single-core setting. Graph theory also attracted a fair amount of research in devising parallel algorithms for the traditional problems.

In this thesis, we have taken up the problem of k-connectivity in graphs, a traditional graph problem, and present parallel algorithms which leverage the latest advances in computational power to deliver significantly faster results.

1.1.1 Challenges

On the surface, it looks easy to model a graph in a parallel computation model where each node can be assigned to an individual processor. However, efficient communication and load balancing are hard to achieve in a graph setting. Work on graphs is often not uniformly spread, unlike in graphics where each pixel can be computed independently. Load balancing and synchronization take up a lot of resources because, if the work is not distributed correctly, some nodes can effectively become the bottleneck. Graph traversals and finding spanning trees are examples where the work is not uniform. Each node can have a different number of neighbors and thus requires a varying amount of work. Even if nodes are aggregated into clusters and then assigned to processors, the problem still remains, as each cluster may then have a different workload.

Efficient work distribution, synchronization, and communication between nodes are tough to achieve for graphs in a parallel setting. Despite the inherent difficulty in load balancing, several parallel graph algorithms have been presented over the years which are significantly faster than the serial implementations. We discuss some of these approaches in the later chapters.

1.2 GPUs as a Parallel Computation Platform

A parallel algorithm can be implemented on several underlying parallel architectures, such as a multi-core architecture, a many-core architecture, or even a distributed network. Although the theoretical complexity remains the same, the underlying architecture can affect the run-time of a parallel algorithm by orders of magnitude. Each architecture has its own specific strengths and weaknesses in handling parallelism. In this thesis, we look at parallel implementations on multi-core and many-core systems. Our results suggest that, for our problem of k-connectivity in graphs, a many-core system such as a GPU is the best approach.

1.2.1 Brief History

Graphics Processing Units, or GPUs, as the name suggests, were primarily developed as specialized graphics chips for video games around the 1970s. They were initially used as a helper to the CPU, performing numerous mundane yet resource-intensive tasks while the CPU processed the game's instructions and player inputs. Game developers had realized the value of a specialized graphics chip, as simply having more RAM was too expensive. Since the 1970s, GPUs have been primarily used as frame buffers for video games and applications. Even today, all video games are rendered onto the screen by the GPU: the CPU provides the instructions and moves the data to the GPU, which then takes over the graphical calculations. Over the decades, GPUs started providing ever more specialized hardware for rendering game graphics. As the GPU hardware advanced, so did its integration with the software. It became easier for developers to use the GPU through several additional APIs. GPUs eventually became increasingly powerful, versatile, and accessible to the common public. Figure 1.2 shows the growth in GPU capabilities over the last 20 years. A modern consumer-end GPU such as the Nvidia RTX 2080 Ti boasts 4352 cores and almost 18.6 billion transistors. It is capable of providing a performance of 13.4 teraflops and supports 11 GB of GDDR6 memory. GPUs can today be programmed in a variety of languages through a large number of well-documented APIs. Thus, despite being primarily used for rendering graphics, a GPU can be programmed and used by anyone for any purpose. A GPU readily provides a huge number of weak but cheap cores for any programmable computation. As a result, GPUs have been picked up by researchers as an alternative parallel computation platform. GPUs can offer a significant performance boost over a parallel cluster for a particular set of tasks. GPUs are widely used in the domains of image processing, machine learning, scientific computing, ray tracing, fluid dynamics, etc. Although GPUs are not widely used in graph theory, we show that it is possible to achieve considerably faster results on a GPU through clever implementation of well-designed parallel graph algorithms.

(a) Tomb Raider 1996    (b) Tomb Raider 2017

Figure 1.2 Side-by-side comparison of the same video game character rendered in 1996 and in 2017.

1.2.2 GPU Architecture

GPUs have a drastically different architecture from CPUs. GPUs are designed with a SIMD (Single Instruction, Multiple Data) philosophy to exploit data-level parallelism: a SIMD machine performs numerous parallel computations by applying one instruction at a time to many data elements. A GPU has an array of SIMD processors called SMs, or Streaming Multiprocessors. Each SM is composed of numerous cores known as SPs, or Stream Processors. Memory, which can be global, constant, or shared, is available to each SM and across the GPU. Figure 1.3 (cf. [11]) illustrates the parallel architecture of a typical GPU.

As can be seen from Figure 1.3, a GPU has hundreds of smaller, efficient cores. Subsets of these cores are often specialized to perform a single or similar type of work such as ray tracing, floating point calculations, or shader arithmetic. For example, the GeForce 20 series of GPUs developed by Nvidia accelerates real-time ray tracing through the use of new specialized RT cores, which are designed to process all Bounding Volume Hierarchy traversal and ray/triangle intersection testing [32]. Although high in number, GPU cores are not as strong as CPU cores. They operate at a lower frequency and have very limited memory. In addition, shifting data from main memory to the global memory of the GPU is a very costly step. Thus, one needs to consider memory and synchronization issues carefully while designing algorithms for GPUs. The algorithm must be able to scale its tasks to thousands of cores but at the same time ensure that the tasks involve simple, uniform work. In later chapters, we show that pursuing this is worth the effort, as our GPU implementations outperform the parallel CPU implementations by a huge margin.

Figure 1.3 Block diagram of a GPU (G80/GT200) card [11].

1.2.3 Software Frameworks

GPUs today can be programmed through a variety of languages such as C, C++, Python, etc. Most of these languages leverage one of the following two software frameworks.

1.2.3.1 OpenCL

OpenCL, or Open Computing Language, is an open, free standard for programming parallel processors. It provides standard task-based and data-based parallelism by exposing APIs in C99 and C++. OpenCL operates on an abstraction called compute units, which can be backed by GPUs, many-core CPUs, FPGAs, etc. It therefore supports developing applications which run on heterogeneous parallel platforms. Although written in C and C++, OpenCL is available in Python and .NET through third-party platforms. Further information on OpenCL can be found in [33].

1.2.3.2 CUDA

CUDA is a low-level parallel programming framework developed by Nvidia for CUDA-enabled GPUs (currently only Nvidia GPUs are CUDA-enabled). It is designed for C, C++, and Fortran, with third-party wrappers available for a variety of languages such as Python, Matlab, R, Perl, and Ruby. Through CUDA, one can program an application which leverages multiple CUDA-enabled GPUs in sync with parallel CPUs. Since it is a low-level framework, it has compatibility issues and several high-level programming constructs are currently missing. However, the low-level nature of the framework allows more fine-grained control over the underlying resources, enabling more specialized libraries. We have programmed our parallel GPU algorithms using the CUDA framework. For more details, we refer the reader to [13].
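To make the host/device programming model concrete, the following is a minimal, self-contained CUDA sketch written for this discussion (it is not taken from our implementation); the kernel name, array size, and launch configuration are chosen purely for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments one element of the array: the SIMD-style
// "one instruction, many data elements" pattern described above.
__global__ void incrementKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard against overshoot
        data[idx] += 1;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));             // allocate on the GPU
    cudaMemset(d_data, 0, n * sizeof(int));

    // 1024 threads per block; enough blocks to cover all n elements.
    int threads = 1024;
    int blocks = (n + threads - 1) / threads;
    incrementKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();                          // wait for the kernel

    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[0] = %d\n", first);                  // prints 1
    cudaFree(d_data);
    return 0;
}

Compiled with nvcc, the host allocates device memory, launches one thread per array element, and copies a result back; the kernels we describe in later chapters follow this same general pattern of launching many small, uniform tasks.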

1.3 k-connectivity

Finding whether a graph is k-connected and identifying its k-connected components is a fundamental problem in graph theory. The problem finds application in planarity testing [20], isomorphism of planar graphs [22], network analytics [5, 51, 18], clustering [4], and data visualization [39], to name a few. A graph G is called connected if there exists a path between every pair of vertices. A graph G is called k-connected if it has more than k vertices and remains connected whenever fewer than k vertices are removed. A connected graph can thus also be called a 1-connected graph. Similarly, a 2-connected graph is called biconnected and a 3-connected graph is called triconnected.

Figure 1.4 A graph with connectivity 4 [53]. One can remove any 3 vertices and the graph would still remain connected.

Figure 1.4 shows an example 4-connected graph.

1.3.1 Previous Parallel Approaches

Due to its varied applications, testing graph k-connectivity has been a problem of immense research interest. In the following sections, some of the previous parallel approaches to 1-, 2-, and 3-connectivity are listed. Some parallel implementations of BFS are also mentioned, as almost all of the parallel approaches to k-connectivity start with a BFS spanning tree.

1.3.1.1 1-connectivity

Early PRAM algorithms for testing the connectivity of a graph were proposed by Hirschberg et al. [21]. Hirschberg et al. [21] start with the adjacency matrix of the graph and then contract the graph edges into super-vertices in parallel iterations. Their algorithm uses n² processors to find the connected components of an undirected graph in O(log² n) time. Shiloach and Vishkin [40] propose a PRAM algorithm which runs in O(log n) time using O(n + m) processors. The Shiloach-Vishkin algorithm operates on the same principle of forming partitions and contracting them as done by Hirschberg et al. [21]. In each iteration, the algorithm merges two trees based on certain conditions. Initially, each vertex of the graph is a singleton tree. This forest of trees is merged in parallel iterations till the vertices of every connected component belong to a distinct star, with each node assigned to exactly one star. Shiloach and Vishkin show that this algorithm runs in O(log n) time using O(n + m) processors. Several experimental studies on finding the connected components of a graph are based on this algorithm [43, 19, 41]. In particular, Soman et al. [43] adapt this algorithm to a GPU, showing a speedup of 9 to 12 times over the best sequential CPU implementation. We later borrow Soman et al.'s [43] algorithm for finding the connected components of a graph on a GPU. In a recent work, Sutton et al. [44] argue that the Shiloach-Vishkin algorithm [40] can be applied to an O(n)-edge spanning subgraph of the input graph. The connected components of the subgraph can then be used to find the connected components of the original graph by using the algorithm of Shiloach and Vishkin [40]. Our work in this thesis seems to provide a good reason for the speedup achieved by Sutton et al. [44] and extends their work to 2- and 3-connectivity.
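To make the hooking and pointer-jumping idea concrete, the following is a simplified, self-contained CUDA sketch in the spirit of the Shiloach-Vishkin approach; it is not the exact algorithm of [40] nor the implementation of [43], and the edge-list arrays src/dst and all other names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Hooking: for every edge whose endpoints lie in different trees, attach the
// tree with the larger root ID under the smaller root ID.
__global__ void hook(const int *src, const int *dst, int *parent,
                     int m, int *changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= m) return;
    int u = parent[src[e]], v = parent[dst[e]];
    if (u == v) return;
    int hi = u > v ? u : v, lo = u > v ? v : u;
    atomicMin(&parent[hi], lo);   // races may lose a union; the edge retries next round
    *changed = 1;
}

// Pointer jumping (shortcutting): flatten every tree so that parent[v]
// points directly at the root of v's tree.
__global__ void shortcut(int *parent, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    while (parent[v] != parent[parent[v]])
        parent[v] = parent[parent[v]];
}

int main() {
    // Tiny example: edges {0-1, 2-3, 1-2} form one component, vertex 4 another.
    int h_src[] = {0, 2, 1}, h_dst[] = {1, 3, 2};
    int n = 5, m = 3;
    int h_parent[] = {0, 1, 2, 3, 4};             // every vertex starts as its own tree

    int *src, *dst, *parent, *changed;
    cudaMalloc(&src, m * sizeof(int));    cudaMalloc(&dst, m * sizeof(int));
    cudaMalloc(&parent, n * sizeof(int)); cudaMalloc(&changed, sizeof(int));
    cudaMemcpy(src, h_src, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dst, h_dst, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(parent, h_parent, n * sizeof(int), cudaMemcpyHostToDevice);

    for (int again = 1; again; ) {                // repeat until no tree was merged
        again = 0;
        cudaMemcpy(changed, &again, sizeof(int), cudaMemcpyHostToDevice);
        hook<<<(m + 255) / 256, 256>>>(src, dst, parent, m, changed);
        shortcut<<<(n + 255) / 256, 256>>>(parent, n);
        cudaMemcpy(&again, changed, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(h_parent, parent, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v)
        printf("vertex %d -> component %d\n", v, h_parent[v]);
    return 0;
}

The host repeats the hook/shortcut rounds until no edge connects two different trees, at which point parent[] maps every vertex to the representative of its connected component.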

1.3.1.2 2-connectivity

The classical serial solution to 2-connectivity was proposed by Tarjan [45]. Tarjan [45] maintains, for each vertex v, (i) the depth of v in the DFS (Depth First Search) tree and (ii) the lowest depth among the neighbors of all descendants of v. This information is collected and maintained while performing a DFS. Based on this information, the algorithm identifies the vertices whose removal disconnects the graph into biconnected components. Since it relies on a DFS, the algorithm is linear in nature. The first PRAM algorithm for finding the 2-connected components of a graph in parallel is given by Tarjan and Vishkin [46]. It is based on the same principles as Tarjan's [45] serial algorithm but relies on an arbitrary spanning tree rather than a DFS tree. The algorithm begins by generating a spanning tree by applying Shiloach and Vishkin's [40] parallel connected components algorithm. It then records similar per-vertex information as Tarjan's [45] serial algorithm by doing an Euler tour of the generated spanning tree. This is followed by the construction of an auxiliary graph G′ in which two edges of G are connected iff they are part of the same biconnected component in G. Thus, the algorithm reduces the problem of finding the 2-connected components of a given graph to finding the connected components of an auxiliary graph. It runs in O(log m) time with O(n + m) processors. In [16], Edwards and Vishkin adapt this PRAM algorithm to the XMT architecture, showing 9x-33x speedups over serial implementations.

Bader and Cong [10] later identified that the process of constructing the auxiliary graph is, however, quite slow in practice. Bader and Cong [10] proceed to use a formulation akin to that of Cheriyan and Thurimella [8]. They prove that edges in G which are not in a BFS tree T and are also not in a spanning forest F of G\T are nonessential, i.e., they do not affect the biconnectivity of G. Removal of such edges leads to a smaller, O(n)-sized certificate which can be used to determine the biconnectivity of G and find its biconnected components. Their certificate-based approach runs in O(d + log n) time, where d is the diameter of the graph. Overall, they show a speedup of up to 2x on a variety of graphs on multi-core CPUs. More recently, Slota and Madduri [42] proposed that one can test the biconnectivity of a graph by performing multiple BFS traversals on multi-core CPUs. They presented a simple yet effective parallel algorithm which does not rely on the construction of an auxiliary graph or certificate. The algorithm is based on the theorem that a non-root vertex v in the BFS tree is an articulation vertex iff it has at least one child w that cannot reach any vertex at depth less than or equal to that of v (other than v itself) when v is removed from G. Slota and Madduri run truncated BFSs from every vertex in G; a truncated BFS terminates as soon as it reaches such a vertex. In the worst case, the algorithm runs in O(nm) time, but through a good implementation and optimizations, it is on average 4.2x faster than the Bader and Cong [10] approach. In 2015, Chaitanya and Kothapalli [6] improved upon Slota and Madduri's [42] result by 2.45x through an auxiliary-graph-based approach. The algorithm relies on the observation that it is easier to find the bridges of a graph G than its articulation points. The auxiliary graph G′ is constructed from a BFS tree T of G. An equivalence relation is proved between G′ and G stating that the bridges in G′ correspond to the articulation points in G. In the worst case, the algorithm runs in O(md) time, where d is the diameter of the graph. We further improve Chaitanya and Kothapalli's [6] approach and adapt it to GPUs. We improve the auxiliary graph construction of Chaitanya and Kothapalli by using a connected components algorithm on a subset of the BFS tree T. This helps us optimize the number of additional vertices required for the auxiliary graph construction. With an improved auxiliary graph and a clever implementation on a GPU, we are able to achieve over 4x speedup. We also apply Bader and Cong's [10] edge-pruning technique for a further 2x speedup on dense graphs. We expand more on our algorithm, along with the ones proposed by Bader and Cong [10], Slota and Madduri [42], and Chaitanya and Kothapalli [6], in Chapter 2.

1.3.1.3 3-connectivity

For finding the 3-connected components of a graph G, Ramachandran and Vishkin [38], and Miller and Ramachandran [29], present PRAM algorithms. Their algorithms make use of an ear decomposition of the graph and define an auxiliary graph for every ear of G. These auxiliary graphs are then used to check the 3-connectivity of G, followed by finding the 3-connected components of G. These algorithms [46, 29] can be recast to use the result of Cheriyan and Thurimella [8]. Vishkin and Edwards [16, 17] study parallel implementations of 2- and 3-connectivity algorithms on the XMT architecture [49] and compare how these XMT implementations scale with an increasing number of cores. Other than the above attempts, we could not find other significant implementations of PRAM algorithms for 3-connectivity. We adapt Miller and Ramachandran's [29] algorithm to the GPU in Chapter 3 and then improve upon it in Chapter 4.

1.3.1.4 BFS

In all of the approaches to k-connectivity, one needs to start with a spanning tree of some kind. Most of the serial approaches use a DFS for generating the spanning tree and any other additional data they need. However, a DFS is not easily parallelizable due to the serial nature of the tasks involved: all descendants of a child node must be explored before any other child node can be visited. Due to this inherent serial nature, several of the PRAM approaches start with a parallel BFS instead. A BFS is also hard to parallelize due to problems in work balancing and synchronization. However, it is still possible to achieve fast parallel BFS through cleverly optimized implementations. The difficulty of efficiently performing a BFS traversal of a graph has led several researchers to identify numerous algorithmic and data structure optimizations on GPUs as well as on multi-core architectures. Beamer et al. [2] present a hybrid BFS algorithm which combines the conventional top-down approach with a new bottom-up approach. In the top-down approach, frontier nodes search for unexplored children, while in the bottom-up approach the unexplored nodes search for their parents in the frontier. This direction-optimizing approach shows on average a 2x speedup over other state-of-the-art multi-core implementations. In [9], Chhugani et al. use cache and data structure optimizations for an efficient and fast BFS on a multi-core system. Merrill et al. [28] present a BFS based on a fine-grained task management approach on GPUs. The algorithm assigns GPU resources based on the number of neighbors of a vertex. With O(n + m) complexity, their implementation is several times faster than state-of-the-art CPU and GPU implementations. In our k-connectivity algorithms, we use this implementation for all BFS purposes. Despite these advances, we notice in our study that BFS traversals still consume a significant portion of the running time of parallel graph connectivity algorithms. Irrespective of the implementation, we observe that the O(n + m) work involved in a BFS is almost always the bottleneck step in a k-connectivity algorithm implementation. We attempt to rectify this bottleneck by performing the BFS only on a smaller subsection of the graph in Chapter 4. To summarize, one can find several attempts at k-connectivity in various settings such as sequential algorithms [12, 8, 45, 23, 25], parallel algorithms [40, 21, 46, 44, 10, 6, 42, 50, 26, 16, 17, 29, 25], and also distributed algorithms [36]. Almost all of the algorithms use graph traversal techniques to create one or more spanning trees and use the properties of spanning trees to test the k-connectivity of the graph and obtain its k-connected components. Expanding on all of these works in depth would be beyond the scope of this thesis. We have added brief descriptions of most of them in the above paragraphs and we discuss some of these algorithms further in Chapters 2, 3, and 4.
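To illustrate why BFS work is irregular on a GPU, the following is a minimal level-synchronous BFS sketch written for exposition only; it is not the algorithm of Beamer et al. [2] or Merrill et al. [28], and the CSR arrays, kernel name, and toy graph are illustrative.

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                         int *dist, int level, int n, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != level) return;           // only frontier vertices work
    // Each thread scans its vertex's whole adjacency list, so a high-degree
    // vertex makes its thread do far more work than its warp-mates.
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int w = colIdx[e];
        if (dist[w] == INT_MAX) {                     // unvisited
            dist[w] = level + 1;                      // benign race: any writer wins
            *changed = 1;
        }
    }
}

int main() {
    // Tiny undirected path 0-1-2-3 in CSR form, for illustration only.
    int h_rowPtr[] = {0, 1, 3, 5, 6};
    int h_colIdx[] = {1, 0, 2, 1, 3, 2};
    int n = 4;
    int h_dist[] = {0, INT_MAX, INT_MAX, INT_MAX};    // source is vertex 0

    int *rowPtr, *colIdx, *dist, *changed;
    cudaMalloc(&rowPtr, sizeof(h_rowPtr)); cudaMalloc(&colIdx, sizeof(h_colIdx));
    cudaMalloc(&dist, sizeof(h_dist));     cudaMalloc(&changed, sizeof(int));
    cudaMemcpy(rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(colIdx, h_colIdx, sizeof(h_colIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dist, h_dist, sizeof(h_dist), cudaMemcpyHostToDevice);

    for (int level = 0, again = 1; again; ++level) {  // one kernel launch per level
        again = 0;
        cudaMemcpy(changed, &again, sizeof(int), cudaMemcpyHostToDevice);
        bfsLevel<<<(n + 255) / 256, 256>>>(rowPtr, colIdx, dist, level, n, changed);
        cudaMemcpy(&again, changed, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(h_dist, dist, sizeof(h_dist), cudaMemcpyDeviceToHost);
    printf("dist(3) = %d\n", h_dist[3]);              // prints 3
    return 0;
}

Because each thread scans the full adjacency list of its frontier vertex, a single high-degree vertex can stall an entire warp; this is precisely the imbalance that the fine-grained gathering of [28] is designed to remove, and the reason BFS remains the costly step in the connectivity algorithms studied here.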

1.3.2 Motivation for our Approach

In the parallel setting in particular, PRAM algorithms that require poly-logarithmic time and O(m + n) work are known for k = 1, 2, 3, and 4 [40, 21, 46, 29, 25]. However, in practice, the constants hidden in the big-O notation are significantly high for k ≥ 2. These constants mainly arise due to the number of graph traversals, such as BFSs. These traversals always take O(m) time, which quickly becomes a problem in scaling the algorithms to non-sparse graphs. In light of this, some techniques for reducing the graph size have been proposed. In one such result, Cheriyan and Thurimella [8] present a method to obtain an O(n)-sized certificate from the original graph. They prove that the O(n)-sized certificate graph has the same connectivity properties as the original graph. Cheriyan and Thurimella's [8] result has been successfully used to deliver faster connectivity algorithms by Bader and Cong [10], by Chaitanya and Kothapalli [6], and later by us in Chapter 2. The O(n) certificate approach does make connectivity algorithms faster, but obtaining the certificate requires performing multiple BFSs. Each BFS being O(n + m), obtaining a certificate is a fairly expensive operation: one needs to perform k BFSs to obtain an O(n) certificate for k-connectivity. Thus, although this approach is faster, it is still burdened by expensive BFSs. Here we see an opportunity to develop faster algorithms if we can obtain certificates while circumventing the expensive BFSs. In addition, as mentioned in Section 1.2.2, GPUs differ vastly in architecture from CPUs. The high number of dedicated cores provides an opportunity to deliver much faster results. However, since the architectures are radically different, PRAM algorithms implemented on many-core CPUs cannot be directly applied to GPUs. Today, GPUs have been successfully used in several areas of research to deliver significantly faster implementations. However, to the best of our knowledge, no GPU implementations or algorithms are available for k-connectivity for k ≥ 4. Thus, the idea of obtaining cheaper certificates by efficiently utilizing a GPU served as the basis of our approach to developing fast parallel algorithms for k-connectivity.

1.4 Our Contributions

In this thesis, we propose faster GPU-based algorithms for 2-connectivity and 3-connectivity. We implement our algorithms and demonstrate that they are the fastest solutions yet. To the best of our knowledge, no other GPU implementations for 2- or 3-connectivity exist. Later, we generalize our approach and propose a solution which removes the dependence on a full BFS, which we show to be the bottleneck operation. The time taken by performing the BFS on the entire graph is reduced by performing the BFS only on a smaller subgraph and then repairing the partial results.

1.4.1 GPU Algorithm for 2-connectivity

For 2-connectivity, we present and implement a fast parallel GPU algorithm based on the work of [6]. To the best of our knowledge, this is the first such attempt and also the fastest implementation across architectures. The implementation is on average 4x faster than the next best implementation. The implementation works best for sparse graphs and achieves up to a 70x improvement over all other implementations. Later, we also apply an edge-pruning technique which results in a further 2x speedup for dense graphs.

1.4.2 GPU Algorithm for 3-connectivity

We adapt Miller and Ramachandran's [29] parallel algorithm and provide the first GPU implementation for 3-connectivity. Since no other significant parallel implementations of 3-connectivity existed at the time of publication, we take our implementation as the baseline and improve upon it by almost 5x by employing a certificate-based approach as described by Cheriyan and Thurimella [8].

1.4.3 Expediting Parallel k-connectivity Algorithms

In our algorithms for 2-connectivity and 3-connectivity, we observe empirically that the parallel BFS is the bottleneck by a huge margin. We propose a novel solution that avoids performing the BFS on the entire graph and instead works with a BFS of a smaller subgraph. We design certificates which are significantly cheaper to compute but are inaccurate. We then address the inaccuracies in the certificate produced, rather than spending time generating a perfect certificate. We apply our approach to 2- and 3-connectivity and provide implementations which further improve our fastest implementations by 2.2x and 2.1x respectively.

Chapter 2

GPU-BiCC: GPU Algorithm for 2-connectivity

2.1 Overview

A graph G is called biconnected if there exist at least two vertex-disjoint paths between any two vertices of the graph. The maximal biconnected subgraphs of G are often called the biconnected components of G. The problem of finding the biconnected components of a graph has long been a part of algorithmic graph theory, with applications to social networks [5], clustering [4], data visualization [39], and many other areas. An articulation point, which is a vertex whose removal disconnects the graph, represents a critical point in a network whose failure can disrupt the flow of messages across the network. In social networks, articulation points can indicate people who connect people of different interests. Biconnected graphs are fault-tolerant in the sense that the failure of a single node does not disable the network.

The classical solution to this problem is due to Tarjan [12] and uses a depth-first traversal of the underlying graph. Given that performing a (lexicographic) depth-first traversal of a graph is a P-complete problem [24], the algorithm of Tarjan does not lend itself to efficient parallelization. The first parallel algorithm for this problem was designed by Tarjan and Vishkin [46]; it converts the problem of finding the biconnected components of a graph to finding the connected components of an auxiliary graph. However, the approach of Tarjan and Vishkin suffers in practice due to the huge size of the auxiliary graph and also the operations required to construct it. Cong and Bader [10] use the connection between spanning trees and the k-connectivity of a graph [8] to improve the algorithm of Tarjan and Vishkin [46]. Given the overwhelming number of applications and the significance of the problem, there is renewed interest in this problem in the parallel setting. Algorithms for this problem have been studied on modern architectures such as multi-core CPUs and the XMT [42, 6, 7, 16]. Recently, Slota and Madduri [42] proposed two parallel algorithms for finding the biconnected components of a graph. These algorithms are best suited for multi-core architectures and have since been improved by the work of Chaitanya and Kothapalli [6]. The algorithm of [6] is particularly suited for sparse graphs.

As discussed in Section 1.2, GPUs have a massively parallel architecture consisting of thousands of smaller, more efficient cores which can provide a significantly higher speedup. Due to the difference in architecture between a CPU and a GPU, algorithms designed for multi-core CPUs are not always well-suited for GPUs. Such a situation has meant that CPU algorithms sometimes have to be reinterpreted to arrive at efficient algorithms in practice. While there are several recent works that use the biconnected components of a graph to improve the performance of graph algorithms on the GPU [35, 30], to the best of our knowledge there is no GPU-based solution for finding the biconnected components of a graph. Here we present and implement the first GPU algorithm for biconnected components. Through several optimizations and a clever implementation on a K40C GPU, we are able to achieve a speedup of up to 70x over the approach of Slota and Madduri [42] and up to 9x over the approach of Chaitanya and Kothapalli [6, 7]. Our approach is also particularly suited to sparse graphs, as is the case with the other existing algorithms [42, 6, 7]. To extend our approach to dense graphs, we borrow an edge-pruning technique mentioned by Cong and Bader [10]. This further speeds up the algorithm for dense graphs, yielding an implementation which works for all kinds of graphs.

2.2 Motivation

The approach of Slota and Madduri [42] performs multiple breadth-first searches (BFSs). After the first BFS generates the BFS tree, each thread picks a vertex and checks, via another BFS, whether a child node can reach a node above the current vertex. These BFSs terminate when a vertex at a higher level is found. Such an approach, which requires multiple BFSs, is not well-suited for a GPU. As discussed in Section 1.2, GPUs normally have a high number of weak cores. In order to effectively utilize the numerous cores, one needs to schedule a large number of small yet uniform tasks. This is tough to do in a BFS, where each vertex can have a different number of neighbors, leading to irregularity in the work assigned. It is not feasible to schedule multiple BFSs at either a thread or a warp level. A single warp can be assigned a node with hundreds of neighbors to expand, while some other warp may be assigned a vertex of low degree. This again leads to work imbalance, where some threads do extra work while others remain idle. Similar problems are encountered even if we try to schedule one BFS per SM (Streaming Multiprocessor) of a GPU. In [28], Merrill et al. argue that the most efficient way of exploring vertices is when fine-grained scan-based exploration is supplemented with coarser cooperative-thread-array-based and warp-based exploration. An entire thread block is used for exploring a single large-degree vertex. Scheduling a BFS per SM, although feasible, would still not be efficient. As a result, due to the irregularity of the work involved, BFSs are often the bottleneck in GPU graph algorithms. In our approach, we found that a single BFS would consume 20% to 50% of the total time taken to determine biconnectivity. It is thus a lot easier to work with simple tree traversals in parallel than to do multiple parallel BFSs. Hence, rather than depending on multiple BFSs, we generate the BFS tree once and do simple graph traversals in parallel on that tree.

2.3 Algorithm GPU-BiCC

We build our algorithm upon the following three lemmas used by Chaitanya and Kothapalli [6, 7] for their CPU parallel algorithm called LCA-BiCC.

Let V_{lca}(G) be the set of the lowest common ancestors (LCAs) of the endpoints of each non-tree edge of G with respect to some BFS tree T. The proofs of the three lemmas have been taken verbatim from [7].

Lemma 1: Let G be a 2-edge-connected graph and let T be a BFS tree of G. If v is not in V_{lca}(G), then v cannot be an articulation point of G.

Proof: On the contrary, assume that a vertex v is an articulation point and is not the LCA of any non-tree edge of G. If v is on only one cycle in G, then v cannot be an articulation point. So, we assume in the rest of the proof that v is on at least two cycles in G. Let C_1, C_2, ..., C_k be the fundamental cycles induced respectively by non-tree edges e_1, e_2, ..., e_k ∈ G \ T that pass through vertex v. Let C_i and C_j be any two cycles from the set {C_1, C_2, ..., C_k}, induced by non-tree edges e_i and e_j respectively. Let vertices x and y be the LCAs of the endpoints of e_i and e_j respectively. It is evident that x and y must be ancestors of v, as v lies on both cycles and v ∉ {x, y}. The relation between x and y can be categorized as follows.

• x = y: In this case the two cycles C_i and C_j share the same LCA, say x, and also the vertex v. This implies that C_i and C_j share at least an edge (as there are at least two vertices, x and v, common to both C_i and C_j). So, even after the removal of v, all edges belonging to C_i and C_j remain in a single biconnected component. Hence, v is not an articulation point.

• x ≠ y, z = LCA(x, y), and z ∉ {x, y}: As x and y are ancestors of v, there is a path from x to v and from v to y in T. As z is an ancestor of x and y, there is a path from z to x and from y to z in T. This gives a closed walk z → x → v → y → z, which leads to a cycle in T. However, T is a BFS tree and cannot have cycles. Therefore, our assumption that v is an articulation point is not valid.

• x ≠ y and LCA(x, y) ∈ {x, y}: Without loss of generality, we assume that y = LCA(x, y). Let C_i and C_j be any pair of cycles induced by non-tree edges e_i and e_j that pass through v, with LCA(e_i) = x and LCA(e_j) = y. Since y is a proper ancestor of x, there is a path from x to v (in T and also in G) that is common to C_i and C_j. This ensures that there is at least an edge common to the cycles C_i and C_j. Similar to the case where x = y, this allows us to argue that even after the removal of v, all edges of C_i and C_j remain in a single connected component. Since the above holds for any pair of cycles passing through v, v is not an articulation point.

A bridge is an edge whose removal disconnects the graph. Naturally, the end-points of bridges are also articulation points. However, not all articulation points are end-points of bridges. Chaitanya and Kothapalli [6, 7] construct an auxiliary graph so that all the articulation points in the original graph become the end-points of bridges in the auxiliary graph. Based on the same principle and lemmas, we also obtain an auxiliary graph, albeit in a more efficient and simpler way. We split G into its 2-edge-connected components. (A 2-edge-connected component of a graph G is a maximal subset of edges such that every pair of vertices in the component has at least two edge-disjoint paths between them.) Let u be the LCA of a non-tree edge pq. Let x and y be the base vertices in the cycle induced by pq. The base vertices of a cycle C induced by a non-tree edge e with LCA u are the neighbors of u in the cycle C. We then introduce an alias vertex u′, add the edge uu′, and replace the edges ux and uy with u′x and u′y respectively. Only a single alias vertex is introduced for cycles sharing a common base vertex. Finding the cycles that share a common base vertex is a non-trivial problem. From an algorithmic viewpoint, it is essential that an efficient technique be designed for this purpose. We map this problem to the problem of finding the connected components of a graph. We construct a new graph H as follows. Every vertex that is a base vertex in G is a vertex in H. Two vertices in H share an edge if these two vertices are the base vertices of some cycle in G. Figure 2.1 demonstrates an example of the construction of the auxiliary graph.

Figure 2.1 H's vertex set is the base vertices of G, with edges induced by the non-tree edges of G. H′ is generated after applying the connected components algorithm to H and contracting the trees. The unique ID of every connected component in H′ serves as the ID of the alias vertex in G′.

This leads to the second and third lemmas regarding the newly introduced alias vertices.

Lemma 2: Let G be a 2-edge-connected graph, T a rooted BFS tree of G with root r, and G′ the auxiliary graph constructed from G. For a vertex u in G′ with u ≠ r, u is an articulation point of G iff u is an end point of some bridge uv in G′ with u ∈ G and v ∉ G.

Proof: (only-if) ⇐: Consider a vertex u which is not an articulation point in graph G, with u ≠ r. We will show that any edge of the type uu′, where u′ is an alias of u, cannot be a bridge in the auxiliary graph G′.

Let C := {C_1, C_2, ..., C_k} be the cycles that pass through vertex u in G. The relation between vertex u and these cycles can be categorized as follows.

• u is not the LCA of the endpoints of any non-tree edge that induces a cycle in C: In this case, no alias vertices are introduced in G′ due to u. Therefore, no bridge with u as one of its end points exists in G′. (Note that G is already 2-edge-connected and has no bridges.)

• u is the LCA of two non-tree edges that induce cycles C_i and C_j in C: According to the construction of G′, two alias vertices u_i and u_j are introduced in the auxiliary graph G′. Further, two edges uu_i and uu_j are also added to G′. An example is illustrated in Figure 2.2.

Figure 2.2 Figure shows the cycle created by paths P_{xy}, P_{yu_j}, P_{u_j u_i}, and P_{u_i x}. For ease of exposition, the auxiliary graph shown contains only the changes made with respect to u and not the changes induced with respect to other vertices.

Let x and y be any two distinct vertices on the cycles C_i and C_j respectively. Since u is not an articulation point in G, there must be some path P_{xy} in G between x and y that does not pass through u, as shown in Figure 2.2. The path P_{xy}, along with the paths P_{yu_j}, P_{u_j u_i}, and P_{u_i x}, forms a simple cycle in G′. This indicates that the edges uu_i and uu_j on this cycle cannot be bridges in G′. So there is no bridge in G′ with u as one of its end points pertaining to the cycles C_i and C_j. The above property holds for any two cycles C_i and C_j.

• u is the LCA of some non-tree edge that induces a cycle C_i in C: Consider the case where the number of cycles through u is at least 2. By our assumption, u is not an articulation point. Let u_1 be the alias vertex of u. Hence, for some vertex x in C_i that is not equal to u, and another vertex, say the parent P(u) of u, there is a path that does not go through u. This path, along with the edges uu_1 and uP(u), means that the edge uu_1 is part of a cycle. Therefore, in G′, the edge uu_1 will not be a bridge.

(if) ⇒: Let u be an articulation point of a 2-edge-connected graph G and let G′ be the corresponding auxiliary graph. It holds that u has at least two cycles passing through it that do not share any base vertices. Let b_i and b_j be base vertices on two such fundamental cycles. Let x and y be two vertices, both distinct from u, on two such cycles C_i and C_j that get disconnected by the removal of u. Since u is an articulation point in G, every path between x and y passes through u, say x − b_i, b_i − u, u − b_j, and b_j − y. Let u_1 and u_2 be the alias vertices created for C_i and C_j. The corresponding path P_{xy}(G′) has the form x − b_i, b_i − u_1, u_1 − u, u − u_2, u_2 − b_j, b_j − y. Since u is an articulation point, there cannot be a path P_{xy} between x and y in G (and in its corresponding auxiliary graph G′) that does not pass through u and its alias vertices. As a result, uu_1 remains uncovered by any cycle of G′. Hence, uu_1 will be a bridge in G′.

Lemma 3: Let G be a 2-edge-connected graph, T a rooted BFS tree of G with root r, and G′ the auxiliary graph constructed from G. Vertex r is an articulation point in G iff r is the LCA of more than one non-tree edge of G′ according to a BFS in G′ from r, and r is also an end point of some bridge in G′.

Proof: We use P_{uv}(G) to denote a path between vertices u and v in the graph G.

(only-if) ⇐: Notice that since G is 2-edge-connected, vertex r is on at least one cycle. Further, since r is the root of the BFS tree of G, for every fundamental cycle that contains r, vertex r is the LCA of the non-tree edge that induces the cycle. We now make a case distinction as follows. If r has exactly one cycle that passes through it, then r is not an articulation point of G. Now consider the case that more than one cycle passes through r. Let C_i and C_j be any two cycles through r induced by non-tree edges e_i and e_j. In G′, we now introduce two alias vertices r_i and r_j and also the edges rr_i and rr_j, along with edges between r_i and r_j and the base vertices of C_i and C_j. If r is not an articulation point, then for any two distinct vertices x and y in C_i and C_j respectively, there is a path between x and y that does not go through r. This path between x and y, P_{xy}, along with the path P_{x r_i}, the edges r_i r and rr_j, and the path P_{r_j y}, creates a cycle that contains the edges rr_i and rr_j. Therefore, the edges rr_i and rr_j cannot be bridges in G′.

(if) ⇒: The same argument as in the proof of the (if) part of Lemma 2 holds when u is an articulation point and is the root of the spanning tree.

Our GPU algorithm for biconnected components is built upon the above three lemmas. We differ from Chaitanya and Kothapalli [6, 7] in our construction of the auxiliary graph. We show in the next section that our auxiliary graph is more efficient, as it saves us from performing additional graph traversals later.

2.3.1 Algorithm

Our GPU algorithm for biconnected components (BiCC) can be stated in four steps, as listed in Algorithm 1.

Algorithm 1: GPU Algorithm for BiCC
Input: Graph G
Output: BCC ID for each edge
  1. Generate a BFS tree T from G
  2. Obtain the V_{lca}(G) set and the bridges from T
  3. Reconstruct G into G′ around V_{lca}(G)
  4. Mark articulation points and BCCs in G′

Step 1: BFS

In this step, we obtain a BFS tree T of the input graph G using the GPU BFS algorithm of Merrill et al. [28].

Step 2: LCA and Bridges

In this step, we traverse up from the end-points of every non-tree edge in parallel till the lowest common ancestor is found. These LCAs form the V_{lca}(G) set. We mark each edge encountered while traversing. Since we have traversed through every cycle of G, the unmarked edges are the bridges of G.
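A minimal CUDA sketch of this step follows (our own simplification, not the thesis code). It assumes the BFS of Step 1 produced parent[] and level[] arrays, that edgeOf[x] gives the index of the tree edge (x, parent(x)), and that the non-tree edges are stored in nonTreeU/nonTreeV; all of these names are illustrative.

// One thread per non-tree edge (u, v): climb from the deeper endpoint until
// the two walks meet at the LCA, marking every tree edge visited on the way.
__global__ void lcaClimb(const int *nonTreeU, const int *nonTreeV,
                         const int *parent, const int *level,
                         const int *edgeOf, int *edgeMark, int *lca,
                         int numNonTree) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;
    int u = nonTreeU[t], v = nonTreeV[t];
    while (u != v) {
        // Always advance the deeper endpoint; on a tie, the endpoints alternate.
        if (level[u] >= level[v]) { edgeMark[edgeOf[u]] = 1; u = parent[u]; }
        else                      { edgeMark[edgeOf[v]] = 1; v = parent[v]; }
    }
    lca[t] = u;   // u == v is the LCA of this non-tree edge
}

Tree edges whose edgeMark entry is still 0 after this kernel are not covered by any fundamental cycle and are therefore reported as bridges.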

Step 3: Constructing the Auxiliary Graph

The auxiliary graph is constructed as described in the previous section. We first identify the pair of base vertices for every non-tree edge. All the identified base vertices form a new subgraph H. We then find the connected components of H and contract the trees to generate H′. Each root node of H′ is used as an alias vertex in constructing the auxiliary graph G′. Figure 2.3 shows the construction of the auxiliary graph G′ in a step-by-step manner. Our construction of the auxiliary graph is different from that of Chaitanya and Kothapalli [6, 7]. Chaitanya and Kothapalli [6, 7] do not use connected components to simplify the construction of alias vertices. As a result, it is possible that new cycles are formed in their construction of the auxiliary graph. If new cycles are formed, then the V_{lca} set also has to be recalculated. In our construction, since we apply the connected components algorithm on the base vertices, we can guarantee that the added edges will never form a cycle. As a result, our auxiliary graph saves us an LCA traversal of the graph. In addition, Chaitanya and Kothapalli [6, 7] use another BFS to identify the BCCs once the articulation points are identified. However, we show in the next steps that, by storing additional information in Steps 2 and 3, this BFS is not necessary. Thus our construction of the auxiliary graph improves upon Chaitanya and Kothapalli's [6, 7] approach by saving a BFS and an LCA traversal of the edges.
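The mapping onto a connected components computation can be sketched as follows (illustrative names, not the thesis code; in practice the base-vertex pairs baseU/baseV are already available from the climbs of Step 2). Each non-tree edge contributes one edge of H between its two base vertices, so cycles that share a base vertex fall into the same connected component of H and therefore receive a single alias vertex.

// Build the edge list of H: one edge per non-tree edge, joining the two base
// vertices of the cycle it induces. Running a GPU connected components
// algorithm on (hSrc, hDst) then yields one component ID per group of cycles
// that share base vertices; that ID names the alias vertex in G'.
__global__ void buildH(const int *baseU, const int *baseV,
                       int *hSrc, int *hDst, int numNonTree) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;
    hSrc[t] = baseU[t];
    hDst[t] = baseV[t];
}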

Step 4: Marking Articulation Points and BCCs

Most of the work for this step can be done simultaneously with Steps 2 and 3. While traversing in Step 2, we record whether the traversal finishes at a vertex or goes further up. If a traversal finishes at a vertex, then it would finish there even in the auxiliary graph: the alias vertices in our construction do not change the order or the number of vertices visited by a traversal.

Now, as Lemma 2 states, an LCA vertex is an articulation point if any of the edges incident on it and its alias vertex is a bridge. Let au be an edge between an LCA vertex a and its base vertex u, and let a′ be the alias vertex introduced in the auxiliary graph. If au was part of a tree traversal that did not finish at a, then the path u − a′ − a would also be part of that traversal, and hence aa′ cannot be a bridge. Thus all added edges can be checked for being bridges by comparing against the information stored in Step 2. Figure 2.3 below illustrates this with an example.

Figure 2.3 Here a is an articulation vertex because aa″ is a bridge. Vertex u had an upward traversal in Step 2 and hence aa′ cannot be a bridge.

Furthermore, in Step 2, each thread marks the LCA vertex found for every non-tree edge. This value is accessible to every edge on the path to the LCA. To generate a unique BCC ID for each edge, each edge checks its LCA vertex. If the LCA vertex is an articulation point, then we have the ID; otherwise, we check the LCA of the edge (LCA vertex, parent(LCA vertex)), and so on. Since we are traversing LCAs of LCAs, the traversal is not long. The articulation vertex ID serves as the ID of the biconnected component. Chaitanya and Kothapalli [6, 7] do another LCA traversal in Step 4 to obtain the articulation points. As shown above, this traversal is unnecessary if some additional information is stored in Step 2. They also need to perform another BFS to obtain the biconnected components. Through simple bookkeeping and better techniques used in the construction of the auxiliary graph, we avoid one BFS and one LCA traversal compared to the approach of [6, 7]. The implementation details are discussed in the next section.
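This per-edge resolution can be sketched as the following CUDA kernel (illustrative names, not the thesis code): lcaOfEdge[e] is the LCA recorded for edge e in Step 2, treeEdgeOf[v] is the index of the tree edge (v, parent(v)), isArtPoint[] marks the articulation points identified above, and the BFS root acts as a terminator.

// One thread per edge: hop along LCAs of LCAs until an articulation vertex
// (or the root) is reached; as noted above, these chains are short in practice.
__global__ void assignBccIds(const int *lcaOfEdge, const int *treeEdgeOf,
                             const int *isArtPoint, int *bccId,
                             int numEdges, int root) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int a = lcaOfEdge[e];
    while (a != root && !isArtPoint[a])
        a = lcaOfEdge[treeEdgeOf[a]];
    bccId[e] = a;    // this vertex ID names the edge's biconnected component
}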

2.3.2 Complexity Analysis

BFS, i.e., Step 1, takes O(m + n) time sequentially. Step 2 involves a tree traversal for each non-tree edge. Each tree traversal cannot exceed the diameter d of the graph since we are using a BFS tree. So for m − n non-tree edges, Step 2 takes O(d · (m − n)) time. For Step 3, we run a connected components algorithm on the graph defined by the base vertices in G. In the worst case, the time taken by Step 3 is O(n + m). Step 4 involves each thread checking each alias vertex for bridges, which takes O(n) time. Thus, sequentially, our algorithm takes O(d · (m − n)) time. However, few real-world graphs have a large diameter, and even when they do, very few LCA traversals actually consume O(d) time. This is observed and mentioned by Chaitanya and Kothapalli [6, 7].
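For reference, the per-step costs above can be summarized as follows (a sketch of the accounting; the Step 2 term is the dominant one for the graphs considered in this thesis):

```latex
T_{\text{seq}}(n, m) \;=\;
    \underbrace{O(m + n)}_{\text{Step 1: BFS}}
  + \underbrace{O\bigl(d \cdot (m - n)\bigr)}_{\text{Step 2: LCA traversals}}
  + \underbrace{O(m + n)}_{\text{Step 3: connected components}}
  + \underbrace{O(n)}_{\text{Step 4: bridge checks}}
```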

2.4 Implementation

On the GPU, we schedule 1024 threads per block. The number of blocks then becomes ⌈(m − n)/1024⌉. This configuration was found to be the best over several trials. The rest of this section discusses specific GPU implementation details for each step of our algorithm.
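As an illustration, the launch configuration for the traversal step can be set up as follows; lcaKernel and the variable names are placeholders rather than the actual identifiers in our code.

```cuda
// Sketch of the launch configuration (placeholder names).
int numNonTreeEdges = m - n;                 // one thread per non-tree edge, as in Step 2
int threadsPerBlock = 1024;
int numBlocks = (numNonTreeEdges + threadsPerBlock - 1) / threadsPerBlock;  // ceiling of (m - n)/1024
lcaKernel<<<numBlocks, threadsPerBlock>>>(/* device arrays for the tree and the non-tree edges */);
```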

Step 1: BFS For the sake of modularity and ease, we adapt the BFS program from Merrill et al. [28] for our work. The BFS of Merrill et al. [28] works efficiently by employing block-based and warp-based exploration along with fine-grained exploration. SMX-wide gathering is used for adjacency lists larger than the warp width, and scan-based gathering collects the loose ends. They implement out-of-core vertex and edge frontiers, use local prefix sums instead of local atomic operations, and use a best-effort bit-mask for filtering. As shown in their paper, their implementation is one of the fastest general implementations of BFS. Some other BFS implementations are suited to particular instances, but we found that Merrill et al. [28] works best for our general purposes. Our implementation, however, can benefit from any future improvements in GPU-based BFS.

Step 2: LCA and Bridges. Finding the LCAs of the non-tree edges consists of independent tasks and is hence easy to parallelize. Each thread picks a non-tree edge and traverses upwards till the LCA is found. Although the memory accesses are not coalesced, each thread on its own does only trivial work. Each thread maintains three values while traversing. First, each thread marks every encountered edge; later, bridges can be identified by gathering the unmarked edges. Second, each thread marks whether it is ending at a vertex or going beyond it. This differentiation helps in identifying bridges in the auxiliary graph, as explained in Step 4 of the previous section. These two values can be marked and updated in the same array. Third, each thread also stores its own ID in every edge it has discovered. In a separate array, every thread stores the LCA vertex it found at its ID location. Thus every edge can look up its LCA vertex in a two-step read. Since multiple threads may traverse a tree edge, overwrites may occur. As long as a tree edge points to some LCA, this does not matter. This LCA vertex later helps in marking BCCs. Figure 2.4 illustrates the two-step read for an example edge.
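A simplified CUDA sketch of this traversal is shown below. It uses the BFS levels to climb from the deeper endpoint first; the array names (nonTreeU, nonTreeV, parent, level, edgeOwner, lcaOfThread) are assumed for illustration, and the finished/going-beyond flag is omitted for brevity.

```cuda
// Sketch: one thread per non-tree edge walks up the BFS tree until the LCA is found.
__global__ void findLcaKernel(const int *nonTreeU, const int *nonTreeV,  // endpoints of non-tree edges
                              const int *parent, const int *level,       // BFS-tree parent and level arrays
                              int *edgeOwner,      // tree edge (v, parent(v)), indexed by v -> marking thread
                              int *lcaOfThread,    // per-thread LCA result
                              int numNonTree)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;

    int u = nonTreeU[t], v = nonTreeV[t];
    while (u != v) {
        // Always climb from the deeper of the two current vertices.
        if (level[u] < level[v]) { int tmp = u; u = v; v = tmp; }
        edgeOwner[u] = t;        // mark the tree edge (u, parent(u)) with this thread's ID
        u = parent[u];
    }
    lcaOfThread[t] = u;          // u == v is the LCA of the non-tree edge
}
```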

Step 3: Auxiliary Graph Construction As mentioned earlier, a connected components algorithm is required for identifying shared base vertices. We adapted the connected components work of Soman et al. [43] for our needs. We found that their generalized GPU implementation had better timings than any other implementation. Although the code needed to be partially rewritten, the underlying algorithm remained the same as explained in their paper.

Figure 2.4 Thread t1 marks all edges it traversed with its ID. Then, in Array2, it stores the LCA vertex found. Thus every edge of the graph knows its LCA vertex, by first looking up the thread which discovered it and then the corresponding value stored at that thread ID.

Once we know how the alias vertices are to be constructed, the actual construction is as follows. Each thread picks a non-tree edge and adds the corresponding edges for its alias vertex. Here also, the work done by each individual thread is simple.

Step 4: Articulation Points and BCCs This step is mostly implemented within Step 2, as the threads keep a record of whether a traversal stops at a vertex or not. As mentioned in Step 4 of the previous section, this information is enough to find the new bridges in the auxiliary graph. A simple kernel call with a lookup into the stored global memory suffices, and the articulation points get marked. As for the BCCs, each edge has already obtained its own unique articulation point through Step 2. Any query regarding BCCs can be answered in nearly linear time.
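One possible kernel for this step is sketched below; it only reads the flags written in Step 2. The names (aliasEdgeLca, aliasEdgeBase, passedBeyond, isArticulation) are assumptions for illustration and do not denote the actual identifiers of our implementation.

```cuda
// Sketch: an alias edge (a, a') is a bridge if no traversal in Step 2 continued
// past its base vertex u towards a; the LCA endpoint a is then an articulation point.
__global__ void markArticulationPoints(const int *aliasEdgeLca,   // alias edge -> LCA vertex a
                                       const int *aliasEdgeBase,  // alias edge -> base vertex u
                                       const char *passedBeyond,  // per-vertex flag written in Step 2
                                       int *isArticulation,
                                       int numAliasEdges)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAliasEdges) return;
    if (!passedBeyond[aliasEdgeBase[i]])
        isArticulation[aliasEdgeLca[i]] = 1;
}
```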

2.5 Experiments And Analysis

2.5.1 Setup

We run our implementation on an Nvidia Tesla K40c GPU. The K40c provides 12 GB of GDDR5 ECC RAM with a maximum memory bandwidth of 288 GB/sec. Each core runs at a clock speed of 745 MHz. The K40c GPU supports a peak double-precision floating point performance of 1.43 TFlops and a single-precision floating point performance of 4.29 TFlops. Each SMX has a 64 KB cache that is shared by the 192 cores of that SMX. An L2 cache of 1.5 MB is available across the SMXs. More details regarding the Tesla K40c GPU can be found in [31]. This GPU is attached to an Intel i7-4790K CPU with 32 GB RAM. We use CUDA 7.5 [13] for our implementation. We run CPU-based algorithms on an Intel Xeon E5-2650 CPU. This CPU is equipped with 128 GB RAM and a maximum memory bandwidth of 68 GB/s. The E5-2650 CPU features dual processors where each processor has 10 cores and each core can process two threads using hyper-threading. Each core operates at 2.34 GHz, which can be boosted to 3 GHz. The CPU offers 64 KB L1 cache, 256 KB L2 cache, and a shared 25 MB L3 cache. These implementations were programmed using OpenMP [34]. We have compared our approach with that of Slota and Madduri [42], named BFS-BiCC in the plots, and also that of Chaitanya and Kothapalli [6, 7], named LCA-BiCC in the plots. These implementations were each run on 40 CPU threads. The graphs used in our experiments were taken from the Stanford Large Network Dataset Collection [27] and the University of Florida Sparse Matrix Collection [14]. The graphs are mostly sparse in nature. Graphs were assumed to be undirected and to have a single connected component; if needed, edges were added explicitly in a preprocessing step to ensure connectivity. All experiments were repeated several times and the average of the observations was used in plotting. Table 2.1 lists all the considered graphs.

Table 2.1 Sparse Graphs for GPU-BiCC

Graph            Nodes       Edges       Diameter
webGoogle        875,713     5,105,039   21
webbase          1,000,005   3,105,536   29
amazon           262,111     1,234,877   32
webStanford      281,903     2,312,497   674
webBerkStan      685,230     7,600,595   514
roadNet-pa       1,088,092   1,541,898   786
roadNet-ca       1,965,206   2,766,607   849
netherlands-osm  2,216,688   4,882,476   2554
greatBritain     7,733,822   16,313,034  9340
asia-osm         11,950,757  25,423,206  48126

2.5.2 Results

Algorithm BFS-BiCC [42] performs a BFS from every point in a BFS tree and checks whether a child node can reach a node higher than its parent when the parent node is removed. If such a node is reachable, then the parent node cannot be an articulation point with respect to that child. However, Chaitanya and Kothapalli [6, 7] prove that all articulation points must belong to the set Vlca(G). Hence Algorithm BFS-BiCC can be modified to test only the LCA vertices. We implement this approach as Algorithm LCA-BFS-BiCC and test it against our implementation. Figure 2.5 shows the runtime of the three algorithms LCA-BiCC, BFS-BiCC, and LCA-BFS-BiCC against our GPU algorithm, named GPU-BiCC in the plots.

Figure 2.5 The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of GPU-BiCC over the next fastest algorithm.

Algorithm GPU-BiCC performs on average 4.03x faster than the next fastest implementation. In certain cases, it outperforms Algorithm BFS-BiCC by nearly 700x and Algorithm LCA-BiCC by 9x. Algorithm LCA-BFS-BiCC appears to be faster than Algorithm BFS-BiCC only in certain cases; in most other cases, Algorithm LCA-BFS-BiCC performs similarly to, if not worse than, Algorithm BFS-BiCC. This speedup in certain cases can be attributed to two factors.

First, in large sparse graphs, the number of non-tree edges is small and hence finding the LCAs takes less time. However, as graphs become dense, the LCA step becomes time-intensive. The advantage of working on a smaller subset of vertices is offset by the cost of actually finding that smaller subset of LCAs.

Second, it can be observed that even for large sparse graphs, LCA-BFS-BiCC, although better, gives higher speedups only in certain cases. This speedup is influenced by the diameter of the graph.

Algorithm BFS-BiCC slows down quite considerably as the diameter increases. This can be verified by observing the diameter of a graph from Table 2.1 and its respective runtime from Figure 2.5. As mentioned, Algorithm BFS-BiCC performs a BFS from each point to check whether a higher-up point is reachable. In large-diameter graphs, this results in a single thread traversing a long chain of singly-linked nodes, causing work imbalance.

Algorithm LCA-BFS-BiCC eliminates most of these vertices since only a small subset of nodes is considered. Experimentally, we found that the LCA subset of vertices is approximately 10% of the nodes. Algorithms GPU-BiCC and LCA-BiCC have no such computation and are unaffected by the diameter of the graph. Thus, despite the modifications to BFS-BiCC, GPU-BiCC performs better.

Algorithm GPU-BiCC was also tested from various starting points. The BFS was run from the maximum-degree vertex as well as from several random vertices. However, no substantial change in timings was observed; GPU-BiCC performs consistently irrespective of the starting point.

2.6 Extension to Dense Graphs

Algorithm GPU-BiCC performs one tree traversal from each non-tree edge of a BFS tree. In real-world graphs, which are usually sparse, the bottleneck step is the BFS. The number of non-tree edges is usually small and is in O(n). However, as the graph gets denser, the number of non-tree edges can get large. Since O(m) GPU threads are launched for performing the tree traversals, this step slows down and becomes the most time-consuming step. Since Algorithm BFS-BiCC performs BFSs from every vertex, as long as the average degree is moderate, Algorithm BFS-BiCC remains relatively unaffected by dense graphs. Dense graphs also generally have a low diameter and thus Algorithm BFS-BiCC does not suffer from the large-diameter drawback mentioned in the above section. As a result, the speedup observed by Algorithm GPU-BiCC over Algorithm BFS-BiCC is relatively smaller for dense graphs. Cong and Bader [10] mention an edge-pruning technique while presenting their algorithm for finding the biconnected components of a graph. The key idea of the technique from [10] is that most of the edges are non-essential for finding BCCs. Their pruning technique is as follows. Consider a graph G and its BFS tree T. Let G1 be G \ T and F be a spanning forest of G1. Cong and Bader [10] then prove that the non-tree edges in G which are not in F are non-essential for biconnectivity. Notice that a spanning forest on n nodes has at most n − 1 edges. Thus, if the above technique is applied, then even for dense graphs, the number of LCA traversals to be performed drops from O(m) to O(n). We apply this pruning technique, again using the modified connected components approach of Soman et al. [43] to generate a spanning forest, and then test it on dense graphs. We used random graph generators for generating dense graphs. The GTgraph suite [1] provides three random graph generators based on the Erdős–Rényi [3] and RMAT models. The graphs used are listed in Table 2.2. The results of the pruning technique can be seen in Figure 2.6. The new algorithm is named Cert-GPU-BiCC. On average, Cert-GPU-BiCC achieves a 2x speedup over GPU-BiCC for dense graphs. Cert-GPU-BiCC prunes roughly m − n edges. For sparse graphs, m − n is not a big number and hence a significant speedup is not observed; the cost of pruning the edges slows down Cert-GPU-BiCC for sparse graphs. Experimentally, we found that the overhead of pruning the edges pays off when m roughly crosses 10 times n.
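A host-side sketch of this pruning pipeline is given below. The type EdgeList and the helper functions (bfsTree, spanningForest, edgesNotIn, unionOf, runGpuBicc) are assumed names standing in for the corresponding steps of our implementation, not actual identifiers from it.

```cuda
// Host-side sketch of the Cong-Bader edge pruning [10] (assumed helper names).
#include <utility>
#include <vector>

using EdgeList = std::vector<std::pair<int, int>>;

// Assumed helpers: BFS tree, spanning forest via connected components [43],
// edge-list difference/union, and the main biconnectivity routine.
EdgeList bfsTree(const EdgeList &G, int n);
EdgeList spanningForest(const EdgeList &G, int n);
EdgeList edgesNotIn(const EdgeList &G, const EdgeList &S);
EdgeList unionOf(const EdgeList &A, const EdgeList &B);
void runGpuBicc(const EdgeList &H, int n);

void certGpuBicc(const EdgeList &G, int n)
{
    EdgeList T  = bfsTree(G, n);          // BFS spanning tree of G
    EdgeList G1 = edgesNotIn(G, T);       // the non-tree edges of G
    EdgeList F  = spanningForest(G1, n);  // spanning forest of G \ T
    // Non-tree edges outside F are non-essential for biconnectivity, so the
    // LCA traversals are run on T together with F (at most 2n - 2 edges).
    runGpuBicc(unionOf(T, F), n);
}
```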

Table 2.2 Dense Graphs for GPU-BiCC

Graph         Nodes      Edges        Diameter
liveJournal   4,847,571  68,993,773   9
com-Orkut     3,072,441  117,185,083  16
RMAT1M 50M    1,000,000  50,000,000   5
RMAT1M 70M    1,000,000  70,000,000   8
R500K 50M     500,000    50,000,000   5
R1M 25M       1,000,000  25,000,000   13
D1M 50M       1,000,000  50,000,000   5008
D2M 75M       2,000,000  75,000,000   11053

Figure 2.6 The primary Y-axis represents the timings in milliseconds. On the secondary Y-axis, Speedup-1 represents the speedup of Cert-GPU-BiCC over BFS-BiCC while Speedup-2 shows the speedup of GPU-BiCC over BFS-BiCC.

Chapter 3

GPU-TriCC: GPU Algorithm for 3-connectivity

3.1 Overview

Recall that a graph is said to be triconnected if every pair of nodes v, w ∈ V (G) has at least three vertex-disjoint paths between them. The maximal 3-connected subgraphs of G are called the triconnected components of G. A separating pair in a graph G is a pair of nodes v, w such that removing v and w from G disconnects G. Triconnectivity has applications in networks [18] and in determining isomorphism in planar graphs [22]. Hopcroft and Tarjan [23] published the first sequential algorithm for finding the triconnected components of a graph. The algorithm from [23] is based on a depth-first traversal (DFS) of the graph. Given that DFS is a P-complete problem [24], this approach would not be parallelizable in the PRAM sense. Over the years, a few PRAM-style parallel algorithms have been presented for finding the triconnected components of a graph. Ramachandran and Vishkin [38] present a PRAM algorithm for testing triconnectivity that runs in O(log n) time using O(n + m) work. Miller and Ramachandran [29] extend the work from [38] to also obtain the triconnected components of a graph using O(log² n) time and O(n + m) work on a CRCW PRAM. To date, the algorithm of Miller and Ramachandran is the fastest PRAM algorithm for identifying the triconnected components of a graph in parallel. Here, we implement the algorithm of Miller and Ramachandran [29] on a GPU and extend it by applying the certificate-based approach given by Cheriyan and Thurimella [8]. In the next chapter, we use our GPU implementation as the baseline against better algorithms. We start by briefly describing the algorithm of Miller and Ramachandran [29]. This is followed by our improvements and implementation on the GPU.

3.2 The Algorithm of Miller and Ramachandran for Graph Triconnectivity

The algorithm of Miller and Ramachandran [29] is based on an open ear decomposition of a graph. An open ear decomposition of a graph G(V, E) is a partition of E into ordered edge-disjoint paths P0, P1, P2, . . . such that P0 is a simple cycle and every other path Pi, i ≥ 1, has its endpoints on previous paths (ears) and no internal vertex of Pi lies on any Pj, j < i. Since a vertex cannot be internal to two ears, an open ear decomposition provides scope for traversing the graph in parallel. In addition, Miller and Ramachandran [29] prove that the two vertices of any separating pair in a biconnected graph are non-adjacent vertices of some ear Pi. The algorithm of Miller and Ramachandran [29] is based on this property. As a result, Miller and Ramachandran start with an open ear decomposition of a biconnected graph. The algorithm then generates the bridges for every non-trivial ear in parallel. (An ear is non-trivial if it has at least three vertices.) For a given subgraph S, the bridges with respect to S form a partition of V (G \ S) such that two vertices are in the same class if and only if there is a path connecting them that does not use any vertex of S. Each such bridge is then compressed into a single vertex. This single vertex is connected to the original ear through the same attachments as the corresponding bridge. This is done for every bridge of every non-trivial ear in parallel. The bridge graph is simplified into an ear graph by merging bridges which share the same attachments on an ear. Figure 3.1(b)–(d) show the above steps on the graph from Figure 3.1(a).

Figure 3.1 Figure showing the stages in the algorithm of Miller and Ramachandran [29].

The ear graph for every ear is further simplified by merging interlacing bridges into a non-overlapping graph, called a star graph. All separating pairs can be easily discovered through a star graph. Figure 3.1(e) shows the formation of a star graph from the corresponding ear graph and the subsequent separating pairs with respect to a single ear. This process is done across the graph for all ears in parallel. Once the separating pairs are identified, the triconnected components are generated by splitting the graph into Tutte splits [48] for every separating pair. The entire algorithm runs in O(log² n) time using O(n + m) work in the CRCW PRAM model. We refer the reader to [29] for further details.

3.3 Triconnectivity on GPU

To the best of our knowledge, there is no known GPU based algorithm for the graph triconnectivity problem. In this section, we provide a GPU based implementation for the algorithm of Miller and Ramachandran [29]. A brief summary of our implementation is given below. Henceforth, we refer to our GPU implementation for triconnectivity as Algorithm GPU-TriCC listed as Algorithm 2.

Algorithm 2: Algorithm for GPU-TriCC
Input: Biconnected graph G
Output: TriCC(G)
1 Find an open ear decomposition of G
2 for every nontrivial ear Pi do
3     Construct the bridge graph from the bridges
4     Obtain the ear graph Gi(Pi) from the bridge graph
5     Coalesce the interlaced ear graph into the star graph Gi*(Pi)
6     Identify the separating pairs from Gi*(Pi)
7 Use Tutte splits to obtain the triconnected components

We employ Ramachandran's [37] popular ear decomposition algorithm for generating the open ear decomposition. Our ear decomposition requires two graph traversals and a sorting of the edge list. Obtaining the bridges and the subsequent bridge graph requires a connected components algorithm; we use the GPU implementation of Soman et al. [43] for the same. The ear graphs are generated through a divide-and-conquer approach as mentioned in [29]. Assuming r ears, the first step in the divide-and-conquer approach generates the ear graph for the first and the last r/2 ears. Connected components of the ith stage are utilized at the (i + 1)th stage as we narrow down to generating the ear graph for every individual ear. Every ear graph is then coalesced in parallel to generate the star graph. Coalescing involves resolving all overlapping attachments in the ear graph. Separating pairs can be easily identified once the star graph is generated, as shown in Figure 3.1(e). The graph is then split into upper-split and lower-split graphs for every separating pair (a, b) on the ear Pi. The upper-split and lower-split graphs are basically a division of the vertices belonging to ears Pj, j < i and ears Pk, k > i. Miller and Ramachandran [29] prove that each of the splits is biconnected and that every other separating pair lies either in the upper-split graph or in the lower-split graph, but not in both. Hence this procedure is applied recursively till no separating pair is present in either of the split graphs generated. Thus the triconnected components of G are identified. Notice from the algorithm of [29] that the bulk of the work done can be associated with each ear subsequent to obtaining an open ear decomposition. As every graph G has m − n + 1 ears, the number of edges in G heavily impacts the performance. A method to filter the edges beforehand can give an increase in performance.

In a remarkable result, Cheriyan and Thurimella [8] showed that the k-connectivity of an undirected graph can be tested by using a kn-sized subgraph of the graph instead of the entire graph. Formally, let Ti for i ≥ 1 be the BFS spanning forest of G \ (T1 ∪ · · · ∪ Ti−1). Cheriyan and Thurimella show that the graph T1 ∪ · · · ∪ Tk is k-connected if and only if G is k-connected. One often says that the graph H := T1 ∪ · · · ∪ Tk is a certificate for the k-connectivity of G. Similar results are also shown by Khuller and Schieber [26]. The technique of Cheriyan and Thurimella [8] does improve the practical performance of parallel algorithms for testing the k-connectivity of a given undirected graph. Evidence for this can be seen from the work of Bader and Cong [10], Chaitanya and Kothapalli [6], and in our algorithm GPU-BiCC (Section 2.3.1) for finding the biconnected components of a graph on symmetric multiprocessors, multi-core CPUs, and GPUs respectively. Much of this improvement can be attributed to the smaller size of the certificate in terms of the number of edges in the input graph. Since the performance of GPU-TriCC heavily depends on the number of edges, a reduction in the size of the graph through the use of certificates provides scope for better performance. To this end, we make use of the idea of Cheriyan and Thurimella [8]. Accordingly, a certificate for the triconnectivity of G is obtained as the union of T, F1 = BFSSpanningForest(G \ T), and F2 = BFSSpanningForest(G \ (T ∪ F1)), where T is the BFS tree of G. The graph H := T ∪ F1 ∪ F2 is then provided as the input to Algorithm GPU-TriCC. This modification is named Algorithm Cert-GPU-TriCC and is listed as Algorithm 3.

Algorithm 3: Algorithm for Cert-GPU-TriCC
Input: Biconnected graph G
Output: TriCC(G)
1 T := BFS(G)
2 F1 := BFSSpanningForest(G \ T)
3 F2 := BFSSpanningForest(G \ (T ∪ F1))
4 H := T ∪ F1 ∪ F2
5 Run GPU-TriCC on H

3.3.1 Dataset

The graphs we use are taken from real-world datasets [14] and from random graphs following the Erdős–Rényi model [3] generated using the GTGraph generator [1]. All the graphs we consider are undirected and unweighted. Directed graphs are made undirected by removing the direction on the edges. Graphs that are not connected are augmented with additional edges to make them connected. Key properties of the graphs are shown in Table 3.1.

3.3.2 Results

We study the performance of Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The random graphs are generated to have a particular number of TCCs, as shown in Table 3.1.

Table 3.1 Graphs used in our experiments for triconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Graph          Description                                    Nodes  Edges
Real-World Graphs
nd24k          3D Mesh, ND set                                72K    14.3M
kron18         Kronecker, DIMACS10                            262K   10.5M
rm07r          3D viscous case                                381K   37.4M
coPaperDBLP    coauthor citation network                      540K   15.2M
bone010        3D trabecular bone                             986K   36.3M
dielFilterV3   High-order vector finite element method in EM  1.1M   45.2M
Random Graphs
rand-Tricc1    1 TCC                                          500K   30M
rand-Tricc2    10000 TCCs                                     500K   30M

The GPU used for these experiments is an Nvidia K40c GPU (cf. Section 2.5.1). From Figure 3.2, we can observe that Algorithm Cert-GPU-TriCC is on average 5x faster than Algorithm GPU-TriCC. In Figure 3.2 we also show the percentage of time spent by Algorithm Cert-GPU-TriCC in obtaining the certificate H using three BFS traversals of G. As can be observed from Figure 3.2, Algorithm Cert-GPU-TriCC, despite being 5x faster than Algorithm GPU-TriCC, spends nearly 63% of its total time in obtaining H. The high cost of procuring the certificate serves as the motivation to look for methods which can mitigate this cost.

Figure 3.2 Figure showing the time taken by Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The number on each bar indicates the percentage of time spent by Algorithm Cert-GPU-TriCC in BFS operations.

Chapter 4

Expediting Parallel Graph Connectivity Algorithms

4.1 Motivation

In Section 2.3.1, we presented an improved version of the algorithm of Chaitanya and Kothapalli [6, 7] for finding biconnected components and showed that our GPU implementation was the fastest yet. For triconnectivity, we presented the first GPU implementation of the problem based on Miller and Ramachandran's [29] algorithm for triconnectivity. In both cases, we presented a further improvement through the use of certificates (Section 2.6 and Section 3.3 respectively). Much of this improvement can be attributed to the smaller size of the certificate in terms of the number of edges in the input graph. However, the time taken to obtain the certificate on large input graphs via parallel BFS operations is a significant portion of the total run time. Consider our algorithm Cert-GPU-BiCC from Section 2.6, which is so far the fastest known implementation for finding the biconnected components of a graph in parallel. Algorithm Cert-GPU-BiCC performs two BFS traversals on the graph G to obtain a certificate of size at most 2n − 2 edges for testing the biconnectivity of G. Figure 4.1 shows the time spent by Algorithm Cert-GPU-BiCC on BFS operations on a set of eight graphs. As shown in Figure 4.1, these two BFS operations consume on average 66% of the time spent by Algorithm Cert-GPU-BiCC. This indicates that to design faster parallel algorithms for graph k-connectivity, one must revisit the expensive BFS operations. The large time spent on BFS operations can be attributed to the fact that a BFS traversal requires assigning nodes to levels such that, for i ≥ 0, the shortest hop distance from the source of the BFS to any node in level i is i. Arriving at such an assignment in parallel requires expensive algorithmic/programming constructs such as synchronization, concurrent data structures, and work balancing among threads. Our contention here is that using a BFS is expensive due to its O(n + m) nature. The hidden constants behind synchronization, filtering, and load balancing make any parallel BFS slow in practice compared to the further phases of our algorithms for 2- and 3-connectivity. Figure 4.1 and Figure 3.2 in Section 3.3.2 serve as examples, where a BFS on the entire graph is more expensive than all the other steps combined. For all our experiments so far, we have used the parallel GPU BFS implementation of Merrill et al. [28]. There are other parallel implementations available, such as the Gunrock framework [52], which has two implementations: one largely based on that of Merrill et al. [28] and one based on Beamer et al. [2]. Studies on the Gunrock implementation [52] of Beamer's BFS [2] (cf. https://gunrock.github.io/gunrock/doc/latest/md_stats_do_ab_random.html) indicate that the performance tuning of the approach of Beamer et al. [2] is quite involved and depends significantly on the input graph. Employing a better BFS would definitely reduce the total time, but an approach which reduces the BFS burden itself is more desirable. This is what we explore in this chapter, by performing BFS only on a small sampled graph and correcting the output in later phases of our algorithm. Rather than focusing on optimizing implementation(s) of parallel BFS algorithms, we focus on using BFS on smaller graphs instead. The certificate-based algorithms for graph k-connectivity which we use for 2- and 3-connectivity are efficient only after obtaining the necessary certificate using k BFS traversals. Therefore, we suggest designing parallel algorithms that do not perform BFS operations on the input graph. One way of achieving this goal is to trade off the cost of obtaining the certificate against its accuracy. In this chapter, we show that by using novel strategies we can avoid performing BFS on the input graph G. Instead, we use a sparse, spanning, and connected subgraph H′ of the input graph. The subgraph H′ thus identified is used in testing the k-connectivity of G. It must be noted that H′ may not be an accurate certificate for the k-connectivity of G. Therefore, the k-connectivity of H′ may not immediately provide an answer to the k-connectivity of G. To make up for this inaccuracy, we include additional steps on an auxiliary graph F created out of G and the k-connectivity information obtained from H′. The auxiliary graph F is constructed such that G is k-connected if and only if F is k-connected. The sizes of H′ and F are usually small compared to that of G, resulting in a low overall run time.

Figure 4.1 Figure showing the percentage of time spent by Algorithm Cert-GPU-BiCC (cf. Section 2.6) on BFS operations.

We implement our approach for testing 2- and 3-connectivity and obtaining the 2- and 3-connected components of a graph on Nvidia GPUs. Our results on a variety of graphs indicate that our approach outperforms the corresponding best known implementations by factors of 2.2x and 2.1x respectively. We believe that our technique has applications to other graph problems where one can algorithmically replace structures that are expensive to compute with structures that are simple to obtain and possibly inaccurate, followed by a post-processing step. Our work therefore opens the possibility of reinterpreting important steps in parallel graph algorithms so as to make them more efficient in practice.

4.2 An Overview of our Approach

Several recent studies on parallel graph algorithms have explored varied techniques to improve their practical efficiency on multi-core and accelerator-based architectures. Many such studies use well-known graph computations such as traversals, spanning trees, and edge/vertex decompositions as a subroutine. These algorithms can be summarized as follows. From the input graph G, one obtains a structural subgraph H such that the computation on G can be translated or reduced to computations on H followed by additional post-processing steps as required. An example of this can be seen in the work of [35, 15], where a reduced graph that shrinks all nodes of degree two is used as the graph H. The above-mentioned approach of computing on a subgraph H often helps if H is of a smaller size than G. The benefits of the above approach, however, will be limited if identifying H is expensive, possibly due to the strict structural guarantees required of H. The large portion of time spent in obtaining H, as can be noticed from Figure 4.1, indicates scope for revisiting the approach. In this direction, we propose to reconsider the algorithmic implication of replacing H with a suitable, easy-to-create structure H′ such that the computation can be done on H′ instead of H. In case the result of the computation on H′ does not provide a correct result for the required computation on G, additional steps may be required depending on the nature of the problem. However, in these additional post-processing steps, the size of the problem is often much smaller than the size of the original graph. Therefore, it is expected that the cost of post-processing is small. The approach can be seen to have three stages. In Stage I, we obtain a subgraph H′ of G. Stage II performs the computation on H′. An optional Stage III introduces a post-processing step, if required. The technique as presented above allows for multiple possibilities at all stages. In Stage I, H′ can be obtained by (i) uniformly sampling the input graph G, (ii) relaxing the structural properties required of H, (iii) using importance sampling, and the like. In Stage II, the computation on H′ is chosen based on the input problem. Depending on the choices exercised in Stages I and II, we consider the question of whether the output of Stage II can lead to the required output on the original graph. If the output of Stage II is insufficient to arrive at the final answer, we consider Stage III as the post-processing stage. In Stage III also, the computation required depends on the nature of the problem and the utility of H′. Stage III, depending on the problem, can use possibilities such as iterating, augmenting the result, and constructing an auxiliary graph for suitable computation.

Figure 4.2 Figure illustrating our technique in comparison to other approaches towards practical parallel graph algorithms. The top arrow represents direct computation that is usually expensive. The middle arrow indicates preprocessing via strict structural subgraphs or constructs that are sometimes expensive to create. The bottom path corresponds to the approach proposed in this paper. In the figure, red/solid arrows indicate expensive operations while green/hollow arrows indicate operations that are easy to perform in general.

We note that as H′ and H are expected to be of similar size, the time taken for computing on H and H′ will not differ significantly. Hence, for our technique to be useful in practice, the cost of Stages I and III should be less than the cost of obtaining H from G. Figure 4.2 illustrates the idea of our approach. In Figure 4.2, we also list some of the possibilities at each stage of the approach and the particular choices that this thesis uses in these stages. In the following sections, we apply our approach to test the k-connectivity and find the k-connected components of an undirected graph G for k = 2 and k = 3. Cheriyan and Thurimella [8] show that a subgraph H can be constructed as a certificate for G via k BFS traversals. (The k-connectivity of H offers a quick way to test the k-connectivity of G.) As obtaining H via multiple BFS traversals of the input graph can consume a significant portion of the overall runtime (cf. Figure 4.1), we show that our approach can be helpful in arriving at faster parallel algorithms for graph k-connectivity.

4.3 Application to 2-Connectivity

In Chapter 2, we presented the only known GPU algorithm for this problem. Algorithm GPU-BiCC from Section 2.3.1 argues that in a parallel setting, finding the bridges of a graph G is much easier than finding the articulation points. Based on this observation, the algorithm first identifies the bridges of G and separates G into its 2-edge-connected components. To identify the articulation points in each 2-edge-connected component Gi of G, Algorithm GPU-BiCC builds an auxiliary graph Gi′ such that the bridges of Gi′ can be used to locate the articulation points of Gi, and hence those of G. This information is then used to subsequently identify the biconnected components of G. As shown in Section 2.5.2, Algorithm GPU-BiCC is 4x faster than other parallel approaches [42, 6]. On dense graphs, Algorithm Cert-GPU-BiCC from Section 2.6 uses the certificate as defined by Cheriyan and Thurimella [8] to obtain a further 2x speedup over Algorithm GPU-BiCC.

4.3.1 Our Approach

As mentioned in the previous section, one can take H as the subgraph formed by taking the union of a BFS tree T of G and a BFS spanning forest of G \ T. This certificate H will have n nodes and at most 2(n − 1) edges. However, as Figure 4.1 shows, obtaining H is an expensive step, taking on average 66% of the total time of Algorithm Cert-GPU-BiCC. We therefore use our approach as outlined in Section 4.2 by replacing H with a suitable H′. To this end, we start with H′ as a kn-sized spanning subgraph of G, for an appropriately chosen constant k, and proceed to find the biconnected components of H′. As H′ may miss certain edges critical to answering the biconnectivity of G, H′ is not a certificate for the biconnectivity of G. Nevertheless, the biconnected components of H′ can be used to create an auxiliary graph F. Each node in F roughly corresponds to a biconnected component of H′ and the edges of F represent edges between these components. The edges of G \ H′ are used to add additional edges to F so that F acts as a valid certificate for the biconnectivity of G. As H′ is of comparable size to H and the size of F is expected to be small, our approach can help reduce the time spent in BFS operations. More formal details of our approach are presented in the following section.

4.3.1.1 Algorithm Sample-GPU-BiCC

Algorithm Sample-GPU-BiCC for finding the biconnected components (BCCs) of a connected graph G is listed as Algorithm 4. Each of the steps is explained below.

Algorithm 4: Algorithm Sample-GPU-BiCC
Input: A connected graph G
Output: The Biconnected Components (BCCs) of G
1 Obtain subgraph H′ from G
2 Find the BCCs of H′ using Algorithm GPU-BiCC
3 Extract F using the BCCs of H′ and the edges of G
4 Find the BCCs of F using Algorithm GPU-BiCC

• Step 1 – Obtain subgraph H′ from G: Recall that H′ is a kn-sized subgraph of G. We identify H′ by viewing the edges of G as an edge list and including every (m/kn)-th edge for a total of kn edges. Note that no randomness is used, as any kn edges suffice for our purpose. A sketch of this sampling step is shown after this list.

• Step 2 – Find BCCs of H′: Once H′ is obtained, we find the BCCs of H′ using Algorithm GPU-BiCC from Section 2.3.1. These BCCs are used to define the vertices of F.

• Step 3 – Extract F using the BCCs of H′ and the edges of G: In this step, we create an auxiliary graph F. The BCCs of H′ are treated as super-vertices that correspond to the vertices of F. Recall that a node can be part of several BCCs; in particular, articulation points belong to multiple BCCs. Hence we keep each such node as a separate vertex in F. Two nodes in F are joined by an edge if there exists an edge vw ∈ E(G) such that v and w are in different super-vertices of F. This results in F being a multi-graph. Since we only need to know whether there are at least two edges between nodes in F, we add at most two edges between any two nodes of F. For every pair of nodes v, w in F with two edges between them, we keep only one such edge between v and w, and add an auxiliary vertex v′ and edges vv′, v′w to F. By doing so, F becomes a simple graph. We note that as the edges of G are all used to define the edges of F, F acts as a certificate for the 2-connectivity of G. In other words, if vertices x and y have more than one vertex-disjoint path between them in G, then either x and y belong to the same super-vertex of F or the super-vertices of F that contain x and y have more than one vertex-disjoint path between them. The former happens when all the edges of at least one cycle containing x and y are in H′. The latter happens when no cycle containing x and y lies entirely in H′, in which case the edges of the cycle(s) that are not in H′ induce edges in F that ensure that the super-vertices of F containing x and y have two vertex-disjoint paths.

• Step 4 – Find BCCs of F : In this step, we find the BCCs of F using Algorithm GPU-BiCC. The biconnected components identified in this step now map directly to the biconnected components of G.
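The deterministic sampling of Step 1 can be realized as in the following host-side sketch. The function name and types are illustrative assumptions; ensuring that H′ spans all vertices, if required, is handled separately.

```cuda
// Host-side sketch of Step 1: keep every (m / (k*n))-th edge so that H' has about k*n edges.
#include <algorithm>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

std::vector<Edge> sampleSubgraph(const std::vector<Edge> &edges, int n, int k)
{
    const long long m = static_cast<long long>(edges.size());
    const long long stride = std::max(1LL, m / (static_cast<long long>(k) * n));
    std::vector<Edge> H;
    for (long long i = 0; i < m; i += stride)
        H.push_back(edges[i]);
    return H;   // roughly k*n edges; any k*n edges suffice for the approach
}
```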

Figure 4.3 demonstrates an example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.

4.3.2 Implementation Details

We implement Algorithm Sample-GPU-BiCC on a GPU. For BFS on a GPU, we use the implementation from [28] that uses a fine-grained task management strategy. According to our approach, these BFS operations are done on the subgraphs H′ and F, thereby requiring less time. This is followed by identifying the Least Common Ancestor (LCA) of the endpoints of every non-tree edge. Here, we launch one thread for every non-tree edge. Generating F requires a lookup of the edge list of G along with the information of the BCCs of H′. This is easily implemented on a GPU by assigning a thread to every edge of G. We therefore note that Algorithm Sample-GPU-BiCC is amenable to a GPU-based execution where a massive thread pool is supported. We arrange the threads into blocks with 1024 threads per block.

Figure 4.3 An example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.
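The per-edge generation of F described in the implementation paragraph above can be sketched as the following kernel, with one thread per edge of G. The array names (superVertexOf, fEdgeU, fEdgeV, fEdgeCount) are assumptions for illustration, and the reduction to at most two parallel edges per pair is assumed to be done in a subsequent pass.

```cuda
// Sketch: one thread per edge of G emits a candidate edge of F whenever its
// endpoints fall into different super-vertices (BCCs of H').
__global__ void buildAuxiliaryEdges(const int *edgeU, const int *edgeV,   // edge list of G
                                    const int *superVertexOf,             // vertex of G -> super-vertex in F
                                    int *fEdgeU, int *fEdgeV,             // output edge list of F
                                    int *fEdgeCount, int numEdges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;

    int a = superVertexOf[edgeU[e]];
    int b = superVertexOf[edgeV[e]];
    if (a == b) return;                    // both endpoints lie in the same BCC of H'

    int slot = atomicAdd(fEdgeCount, 1);   // append the candidate edge of F
    fEdgeU[slot] = a;
    fEdgeV[slot] = b;
}
```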

4.3.3 Experimental Results, Analysis, and Discussion

4.3.3.1 Experimental Platform and Dataset

All our experiments are performed using an Nvidia Tesla K40c GPU as described in detail in Section 2.5.1. The dataset also mostly has the same graphs as in Table 3.1; Table 4.1 has the same real-world graphs as Table 3.1. Apart from the real-world graphs, we use random graphs based on the Erdős–Rényi model [3] generated using the GTGraph generator [1]. All the graphs we consider are undirected and unweighted. Directed graphs are made undirected by removing the direction on the edges. Graphs that are not connected are augmented with additional edges to make them connected. Key properties of the graphs are shown in Table 4.1.

4.3.3.2 Results

In this section, we compare our implementation of Algorithm Sample-GPU-BiCC to that of Algorithm Cert-GPU-BiCC (Section 2.6). The overall improvement in performance on the graphs listed in Table 4.1 is shown in Figure 4.4. Algorithm Sample-GPU-BiCC achieves a speedup ranging from 1.47x to 3.35x compared to Algorithm Cert-GPU-BiCC. The average speedup, as shown in Figure 4.4, is 2.2x. All the above experiments were run with k = 4. The time spent by our approach on BFS operations is listed on top of each bar. As can be noted, this time is on average only 15% of the total time, indicating that our approach is successful in mitigating the practical inefficiency of BFS operations in the context of parallel graph biconnectivity algorithms. The graph nd24k has a high speedup of 3.35x as it is dense and biconnected: even a small sample of edges keeps almost all the nodes in a single BCC. For the graph coPaperDBLP, the lower-than-average speedup can be attributed to its graph structure and the sampling strategy used. In this case, H′ has numerous long chains of degree-2 nodes. This increases the BFS time and the subsequent time for finding the BCCs of H′.

Table 4.1 Graphs used in our experiments for biconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Graph        Description   Nodes  Edges
Real-World Graphs: same as in Table 3.1
Random Graphs
rand-Bicc1   1 BCC         1M     75M
rand-Bicc2   10000 BCCs    1M     75M

Figure 4.4 Figure showing the time taken by Algorithms Cert-GPU-BiCC and Sample-GPU-BiCC on the graphs listed in Table 4.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-BiCC over Algorithm Cert-GPU-BiCC.

To study the impact of the choice of k on the performance of our algorithm, we plot the time taken by our algorithm on two graphs from Table 4.1 as we vary k. The results of this experiment are shown in Figures 4.5 and 4.6 for the graphs kron18 and coPaperDBLP respectively. When k is small, the size of H′ is small. As a result, the time taken in Step 2 is small. However, if only a few edges of G are included in H′, the number of biconnected components found in Step 2 tends to be high. Therefore, the size of F grows, resulting in Step 4 consuming more time. On the contrary, if the value of k is high, the size of H′ is high. As a result, the time taken in Step 2 is high. But, since more edges have been included in H′, the size of F decreases, thereby making Step 4 relatively faster. Steps 1 and 3 are not significantly impacted by the choice of k and are hence omitted from Figures 4.5 and 4.6.

Figure 4.5 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph kron18 as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.

Figure 4.6 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph coPaperDBLP as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.

Another factor to be noted is that the size of F, shown in Figures 4.5 and 4.6, indeed decreases as we increase k. This is in tune with our original motivation that doing a BFS on such a small graph will be significantly better than doing a BFS on G. The BFSs on the smaller-sized H′ and F combined are cheaper than a BFS on G. (Recall that obtaining a certificate from G requires two BFSs on G, not one.) The above discussion suggests that k should be chosen appropriately. We see from Figures 4.5 and 4.6 that a good value of k is around 4, while values of k between 3 and 5 offer similar results in general.

4.3.3.3 Discussion

In this section, we summarize a few important points concerning our approach.

• Obtaining H′: We observe that H′ can be generated in several other ways, such as a uniformly-at-random process on the edges, a selection based on the degree of the vertices, and other such strategies. We found that using randomness to obtain H′ is not necessary to make our approach work. With a deterministic post-processing phase, we believe that one should focus more on trying to reduce the overall run time instead of getting a "good" H′. In all the approaches, the impact on the performance was almost similar. Hence we use deterministic sampling.

• Certificate-based approaches: From the work of Bader and Cong [10] and also that of Cheriyan and Thurimella [8], it is apparent that using a certificate for testing the biconnectivity of a graph is practically efficient. In our approach, as the graphs H′ and F are very sparse, such a certificate is not required and Algorithm GPU-BiCC is enough.

4.4 Application to 3-connectivity

In Section 3.3, we adapt Miller and Ramachandran's [29] algorithm for triconnectivity and provide the first GPU implementation of the same. We apply Cheriyan and Thurimella's [8] certificate reduction to present Algorithm Cert-GPU-TriCC, which is shown in Figure 3.2 to be almost 5x faster than our base GPU implementation, GPU-TriCC. Figure 3.2 also shows that the time taken in BFSs is almost 63% of the total time.

4.4.1 Our Approach

As discussed in the previous section, the three BFS traversals required to obtain a certificate H for testing the triconnectivity of a graph take up almost 63% of the total time. A suitable H′ can reduce the initial cost of the three BFSs. We begin with a kn-sized sampled subgraph H′. However, as is the case in biconnectivity, H′ can miss out on some critical edges required for triconnectivity; it is not a valid certificate yet. We then find the TCCs of H′. The TCCs are treated as super-vertices, and these super-vertices form the vertex set of an auxiliary graph F. The edges of the rest of G are then used to define the edges of F. This auxiliary graph is refined to ensure that at most three edges are present between any two super-vertices. This is done to keep the size of F as small as possible. Finally, the TCCs of F are identified, and these correspond to the TCCs of G. The algorithm is explained in depth in the following subsection.

4.4.1.1 Algorithm Sample-GPU-TriCC

Algorithm Sample-GPU-TriCC for finding the triconnected components (TCCs) of a connected graph G is listed as Algorithm 5. Each of the steps is explained below.

Algorithm 5: Algorithm Sample-GPU-TriCC
Input: Biconnected graph G
Output: TCCs of G
1 Obtain a spanning subgraph H′ from G
2 Find the TCCs of H′
3 Extract F using the TCCs of H′ and the edges of G
4 Find the TCCs of F

• Step 1 – Obtain a spanning subgraph H′ from G: As in the case of Algorithm Sample-GPU-BiCC, we identify H′ by viewing the edges of G as an edge list and including every (m/kn)-th edge for a total of kn edges. Note that no randomness is used, as any kn edges suffice for our purpose.

• Step 2 – Find the TCCs of H′: We find the TCCs of H′ using Algorithm GPU-TriCC. Since H′ is a sampled subgraph, it may not be biconnected. We modify the ear decomposition algorithm of Ramachandran [37] to find the open ear decomposition within individual biconnected components. Due to this modification, the ears, although identified correctly, are not correctly numbered. However, Algorithm GPU-TriCC only requires the ears and not their numbering.

• Step 3 – Constructing F using the TCCs of H′ and the edges of G: The TCCs of H′ are compressed into super-vertices. Since a vertex in a separating pair can belong to multiple triconnected components, vertices in separating pairs are treated as independent super-vertices. These super-vertices form the vertices of F. The edges of F are identified in three steps. First, an edge is added between two nodes of F if there exists an edge vw ∈ E(G) such that v and w are part of different TCCs of H′. In the second step, F is filtered to ensure that no more than three edges are present between any two vertices of F. This is done to keep the size of F as small as possible. In the third step, we convert F into a simple graph. To this end, for every two nodes in F with more than one edge between them, we split each such edge by introducing an auxiliary vertex. Similar to the arguments provided in Section 4.3.1.1, we note that the graph F has the property that if vertices a, b, c of G have at least three vertex-disjoint paths between them in G, then either they belong to the same super-vertex of F, or the super-vertices of F containing these vertices have at least three vertex-disjoint paths between them in F. Therefore, F can be used to identify the triconnectivity and the triconnected components of G. A sequential sketch of this refinement is shown after this list.

• Step 4 – Find the TCCs of F: We run Algorithm GPU-TriCC on F to generate the TCCs of F. These components map directly to the triconnected components of G.
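The refinement described in Step 3 can be sketched sequentially as below; the GPU version distributes the same per-edge work over threads. The types and the superVertexOf mapping are assumptions for illustration, not the identifiers of our implementation.

```cuda
// Host-side sketch of Step 3: keep at most three edges between any two super-vertices
// and split parallel edges with auxiliary vertices so that F becomes a simple graph.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

std::vector<Edge> refineAuxiliaryGraph(const std::vector<Edge> &edgesOfG,
                                       const std::vector<int> &superVertexOf,  // vertex of G -> super-vertex
                                       int &numVerticesF)   // in: number of super-vertices; out: |V(F)|
{
    std::map<Edge, int> multiplicity;      // parallel-edge count per super-vertex pair
    std::vector<Edge> F;
    for (const Edge &e : edgesOfG) {
        int a = superVertexOf[e.first], b = superVertexOf[e.second];
        if (a == b) continue;              // both endpoints lie in the same TCC of H'
        Edge key(std::min(a, b), std::max(a, b));
        int &count = multiplicity[key];
        if (count >= 3) continue;          // at most three edges between any two super-vertices
        ++count;
        if (count == 1) {
            F.push_back(key);              // the first edge between the pair is kept as-is
        } else {
            int aux = numVerticesF++;      // every further parallel edge is split by an auxiliary vertex
            F.push_back(Edge(key.first, aux));
            F.push_back(Edge(aux, key.second));
        }
    }
    return F;                              // F is simple and preserves the triconnectivity of G
}
```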

4.4.2 Implementation Details

As can be noticed from [29], for the graph triconnectivity problem, some computations such as BFS and LCA traversals are common with the biconnectivity problem. In this case too, on the GPU, we therefore use the BFS implementation from [28]. The open ear decomposition is implemented through sorting and LCA traversals. Sorting is performed using the Thrust library [47]. LCA traversals are done by assigning a thread to every non-tree edge. Generating the bridge graph for every ear involves finding the connected components of various appropriately defined subgraphs; for this purpose, we use the GPU-based algorithm of Soman et al. [43]. Generating the star graph with respect to every ear and the subsequent identification of triconnected components can also be done on a GPU by expressing the computation as a sequence of multiple kernels.
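For instance, the sort used while generating the open ear decomposition can be expressed directly with Thrust, as in the sketch below; the key/value names are assumptions for illustration.

```cuda
// Sketch: sort edge indices by ear number so that edges of the same ear become contiguous.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sortEdgesByEar(thrust::device_vector<int> &earNumber,   // key: ear number of each edge
                    thrust::device_vector<int> &edgeIndex)   // value: index into the edge list
{
    thrust::sort_by_key(earNumber.begin(), earNumber.end(), edgeIndex.begin());
}
```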

4.4.3 Experimental Results, Analysis, and Discussion

The experimental platform we use for our experiments is described in Section 2.5.1. We scheduled the above algorithm with 1024 threads per block.

4.4.3.1 Dataset

We use the same dataset as the one used for comparing GPU-TriCC and Cert-GPU-TriCC. Table 3.1 lists all the graphs used for evaluating Algorithm Sample-GPU-TriCC.

4.4.3.2 Results

We compare the performance of Algorithm Sample-GPU-TriCC to that of Algorithm Cert-GPU-TriCC. As noted earlier, Algorithm Cert-GPU-TriCC is, to the best of our knowledge, the fastest algorithm on GPUs for finding the triconnected components of a graph. Figure 4.7 shows the time taken by Algorithm Sample-GPU-TriCC on the graphs listed in Table 3.1. As can be observed, Algorithm Sample-GPU-TriCC achieves a speedup of 2.1x on average over Algorithm Cert-GPU-TriCC. The value of k is set to 4 in this experiment. We now proceed to study the performance of Algorithm Sample-GPU-TriCC as k is varied. Figures 4.8 and 4.9 show the results of this study on the graphs rm07r and rand-Tricc1 respectively. As k increases, the size of H′ increases, resulting in an increase in the time taken by Step 2. On the other hand, the decrease in the size of F with increasing k reduces the time taken in Step 4. The choice of k is to be made considering this trade-off. From our experiments, we note that k = 4 is a good choice in the case of triconnectivity. However, values of k between 4 and 6 offer similar performance.

Figure 4.7 Figure showing the time taken by Algorithms Cert-GPU-TriCC and Sample-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-TriCC over Algorithm Cert-GPU-TriCC.

Figure 4.8 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rm07r as k is varied. Y-axis represents time in milliseconds with varying k on x-axis. Tuples on the line labeled ”Total Time” show the number of nodes and edges of F in millions at various values of k.

Figure 4.9 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rand-Tricc1 as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F at various values of k.

4.4.3.3 Discussion

One can observe in Figures 4.8 and 4.9, or even in Figures 4.5 and 4.6 of the 2-connectivity section, that the time for Step 4 does not decrease significantly with increasing k. This is due to the fact that beyond some value of k, most of the biconnected/triconnected components of G are identified via H′ alone. In general, Algorithm Cert-GPU-TriCC involves more BFS operations than Algorithm Cert-GPU-BiCC. Thus, it seems that Algorithm Sample-GPU-TriCC should benefit more from our technique than Algorithm Sample-GPU-BiCC. However, as shown in Figures 4.4 and 4.7, our technique results in a near-similar speedup in both cases. This is because, for the graphs we considered in our dataset, and in general, we expect more triconnected components than biconnected components. So, the size of the auxiliary graph F generated using our technique is larger in the case of triconnectivity as compared to biconnectivity.

Chapter 5

Conclusions and Future Work

In this thesis, we have presented and implemented the first GPU algorithms for 2- and 3-connectivity. We have improved upon both of them by employing Cheriyan and Thurimella's [8] certificate approach. Our implementations are, to the best of our knowledge, the fastest implementations yet. We have then studied the impact of BFS on k-connectivity algorithms and have come up with an approach to mitigate the cost of performing BFSs. Our results indicate that a significant gain in performance can be obtained by reinterpreting algorithms to perform BFS on graphs that are much smaller in size compared to the input graph. We believe that our approach can be useful in other settings too. As our results show promise, a theoretical analysis can be done in the future to explain the speedups produced. The approach of avoiding BFS and working with a cheaper, inaccurate certificate can also be explored in a sequential setting.

Related Publications

• Mihir Wadwekar and Kishore Kothapalli. Expediting Parallel Graph Connectivity Algorithms. 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC). 2018.

• Mihir Wadwekar and Kishore Kothapalli. A Fast GPU Algorithm for Biconnected Components. Tenth International Conference on Contemporary Computing (IC3). 2017.

Bibliography

[1] D. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. URL http://www.cse.psu.edu/~kxm85/software/GTgraph/.

[2] S. Beamer, K. Asanović, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3-4):137–148, 2013.

[3] B. Bollobás. Random graphs. In Modern graph theory, pages 215–252. Springer, 1998.

[4] R. A. Botafogo and B. Shneiderman. Identifying aggregates in hypertext structures. In Proceedings of the Third Annual ACM Conference on Hypertext, HYPERTEXT ’91, pages 63–74, New York, NY, USA, 1991. ACM. ISBN 0-89791-461-9. doi: 10.1145/122974.122981. URL http:// doi.acm.org/10.1145/122974.122981.

[5] U. Brandes and D. Wagner. Analysis and Visualization of Social Networks, pages 321–340. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-642-18638-7. doi: 10.1007/ 978-3-642-18638-7 15. URL https://doi.org/10.1007/978-3-642-18638-7_15.

[6] M. Chaitanya and K. Kothapalli. A simple parallel algorithm for biconnected components in sparse graphs. 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pages 395–404, 2015.

[7] M. Chaitanya and K. Kothapalli. Efficient multicore algorithms for identifying biconnected com- ponents. International Journal of Networking and Computing, 6(1):87–106, 2016.

[8] J. Cheriyan and R. Thurimella. Algorithms for parallel k-vertex connectivity and sparse certifi- cates. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 391–401. ACM, 1991.

[9] J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey. Fast and efficient graph traversal algorithm for cpus: Maximizing single-node efficiency. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 378–389. IEEE, 2012.

[10] G. Cong and D. A. Bader. An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (smps). In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pages 9–pp. IEEE, 2005.

[11] S. Cook. CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes, 2012.

[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2009.

[13] CUDA 7.5, 2016. URL http://developer.download.nvidia.com/compute/cuda/7.5/Prod/docs/sidebar/CUDA_Toolkit_Release_Notes.pdf.

[14] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38:1:1–1:25, Dec. 2011. ISSN 0098-3500. doi: 10.1145/2049662.2049663. URL http://doi.acm.org/10.1145/2049662.2049663.

[15] D. Dutta, M. Chaitanya, K. Kothapalli, and D. Bera. Applications of ear decomposition to efficient heterogeneous algorithms for shortest path/cycle problems. International Journal of Networking and Computing, 8(1):73–92, 2018.

[16] J. A. Edwards and U. Vishkin. Better speedups using simpler parallel programming for graph con- nectivity and biconnectivity. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, pages 103–114. ACM, 2012.

[17] J. A. Edwards and U. Vishkin. Brief announcement: speedups for parallel graph triconnectivity. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages 190–192. ACM, 2012.

[18] D. K. Goldenberg, P. Bihler, M. Cao, J. Fang, B. D. O. Anderson, A. S. Morse, and Y. R. Yang. Localization in sparse networks using sweeps. In Proceedings of the 12th Annual International Conference on Mobile Computing and Networking, MobiCom ’06, pages 110–121, New York, NY, USA, 2006. ACM. ISBN 1-59593-286-0. doi: 10.1145/1161089.1161103. URL http://doi.acm.org/10.1145/1161089.1161103.

[19] J. Greiner. A comparison of parallel algorithms for connected components. In Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pages 16–25. ACM, 1994.

[20] B. Haeupler, K. R. Jampani, and A. Lubiw. Testing simultaneous planarity when the common graph is 2-connected. In Algorithms and Computation, pages 410–421, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[21] D. S. Hirschberg, A. K. Chandra, and D. V. Sarwate. Computing connected components on parallel computers. Communications of the ACM, 22(8):461–464, 1979.

[22] J. E. Hopcroft and R. E. Tarjan. Isomorphism of Planar Graphs, pages 131–152. Springer US, Boston, MA, 1972. ISBN 978-1-4684-2001-2. doi: 10.1007/978-1-4684-2001-2_13. URL https://doi.org/10.1007/978-1-4684-2001-2_13.

[23] J. E. Hopcroft and R. E. Tarjan. Dividing a graph into triconnected components. SIAM Journal on Computing, 2(3):135–158, 1973.

[24] J. JáJá. An introduction to parallel algorithms, volume 17. Addison-Wesley Reading, 1992.

[25] A. Kanevsky and V. Ramachandran. Improved algorithms for graph four-connectivity. In Foundations of Computer Science, 1987., 28th Annual Symposium on, pages 252–259. IEEE, 1987.

[26] S. Khuller and B. Schieber. Efficient parallel algorithms for testing k-connectivity and finding disjoint s-t paths in graphs. SIAM Journal on Computing, 20(2):352–375, 1991.

[27] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[28] D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In ACM SIGPLAN Notices, volume 47, pages 117–128. ACM, 2012.

[29] G. L. Miller and V. Ramachandran. A new graph triconnectivity algorithm and its parallelization. Combinatorica, 12(1):53–76, Mar 1992. ISSN 1439-6912. doi: 10.1007/BF01191205. URL https://doi.org/10.1007/BF01191205.

[30] S. Nayyaroddeen, M. Gambhir, and K. Kothapalli. A study of graph decomposition algorithms for parallel symmetry breaking. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 598–607. IEEE, 2017.

[31] Nvidia Tesla K40C. URL https://www.nvidia.in/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf.

[32] NVIDIA Turing Architecture, 2018. URL https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.

[33] OpenCL 2.2, 2018. URL https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_API.pdf.

[34] OpenMP 4.0, 2013. URL https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf.

[35] C. Pachorkar, M. Chaitanya, K. Kothapalli, and D. Bera. Efficient parallel ear decomposition of graphs with application to betweenness-centrality. In High Performance Computing (HiPC), 2016 IEEE 23rd International Conference on, pages 301–310. IEEE, 2016.

[36] G. Pandurangan, P. Robinson, and M. Scquizzato. Fast distributed algorithms for connectivity and mst in large graphs. ACM Transactions on Parallel Computing (TOPC), 5(1):4, 2018.

[37] V. Ramachandran. Parallel open ear decomposition with applications to graph biconnectivity and triconnectivity. Citeseer, 1992.

[38] V. Ramachandran and U. Vishkin. Efficient parallel triconnectivity in logarithmic time. In Aegean Workshop on Computing, pages 33–42. Springer, 1988.

[39] SAS-OPTGRAPH. http://support.sas.com/documentation/solutions/optgraph/index.html.

[40] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Technical report, Computer Science Department, Technion, 1980.

[41] J. Shun, L. Dhulipala, and G. Blelloch. A simple and practical linear-work parallel algorithm for connectivity. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, pages 143–153. ACM, 2014.

[42] G. M. Slota and K. Madduri. Simple parallel biconnectivity algorithms for multicore platforms. In High Performance Computing (HiPC), 2014 21st International Conference on, pages 1–10. IEEE, 2014.

[43] J. Soman, K. Kishore, and P. Narayanan. A fast gpu algorithm for graph connectivity. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE, 2010.

[44] M. Sutton, T. Ben-Nun, and A. Barak. Optimizing parallel graph connectivity computation via subgraph sampling. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 12–21. IEEE, 2018.

[45] R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.

[46] R. E. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm. SIAM Journal on Computing, 14(4):862–874, 1985.

[47] Thrust C++ library. URL https://developer.nvidia.com/thrust.

[48] W. T. Tutte. Connectivity in graphs, volume 15. University of Toronto Press, 1966.

[49] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman. Explicit multi-threading (xmt) bridging models for instruction parallelism. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, pages 140–151. ACM, 1998.

[50] M. Wadwekar and K. Kothapalli. A fast gpu algorithm for biconnected components. In Contemporary Computing (IC3), 2017 Tenth International Conference on, pages 1–6. IEEE, 2017.

[51] F. Wang, M. Thai, and D. Du. On the construction of 2-connected virtual backbone in wireless networks. IEEE Transactions on Wireless Communications, 8(3):1230–1237, March 2009. ISSN 1536-1276. doi: 10.1109/TWC.2009.051053.

[52] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: A high-performance graph processing library on the gpu. In ACM SIGPLAN Notices, volume 51, page 11. ACM, 2016.

[53] Wikipedia contributors. K-vertex-connected graph — Wikipedia, the free encyclopedia, 2018. URL https://en.wikipedia.org/w/index.php?title=K-vertex-connected_graph&oldid=820650910. [Online; accessed 3-January-2019].

[54] Wikipedia contributors. Social network analysis — Wikipedia, the free encyclopedia, 2019. URL https://en.wikipedia.org/w/index.php?title=Social_network_analysis&oldid=876315796. [Online; accessed 2-January-2019].
