Faster Parallel Graph Connectivity Algorithms for GPU

Thesis submitted in partial fulfillment of the requirements for the degree of

Master of Science in Computer Science and Engineering by Research

by

Mihir Wadwekar 201202026 [email protected]

International Institute of Information Technology
Hyderabad - 500 032, INDIA
November 2019

Copyright © Mihir Wadwekar, 2019
All Rights Reserved

International Institute of Information Technology, Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled "Faster Parallel Graph Connectivity Algorithms for GPU" by Mihir Wadwekar, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                  Adviser: Dr. Kishore Kothapalli

To My Family

Acknowledgments

I would like to take this opportunity to thank my advisor, Dr. Kishore Kothapalli. He has guided me at every stage of my work and was always helpful and understanding. His insights and experience made the entire journey, from trying out ideas to developing and publishing them, a lot smoother and easier. Working with him was an enriching and enjoyable experience for me. I would also like to thank my family, who always understood and supported me. Their undying faith and support gave me the confidence and freedom to navigate the unknown waters of research. Lastly, I would like to thank all my friends. Research is not a straightforward process and can often be frustrating and challenging, but being with you guys made the journey a bit easier and a lot more enjoyable.

Abstract

Finding whether a graph is k-connected, and identifying its k-connected components, is a fundamental problem in graph theory. For this reason, there have been several algorithms for this problem in both sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. It can also be noticed that the time spent by these algorithms on BFS operations is usually a significant portion of their overall runtime. While BFS can be made very efficient in a sequential setting, the same cannot be said of parallel environments. A major source of this difficulty is the inherent requirement to use a shared queue, balance work among multiple threads in every round, synchronize, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. In this thesis, we present the first GPU algorithms and implementations for 2- and 3-connectivity. We improve upon them through the use of certificates to reduce the size of the input graph and provide the fastest implementations yet. We also study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but can actually work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing steps on suitably defined small auxiliary graphs. We apply our technique to our GPU implementations for 2- and 3-connectivity and improve upon them further by factors of 2.2x and 2.1x respectively.

Contents

Abstract

1 Introduction
  1.1 Parallel Algorithms in Graph Theory
    1.1.1 Challenges
  1.2 GPUs as a Parallel Computation Platform
    1.2.1 Brief History
    1.2.2 GPU Architecture
    1.2.3 Software Frameworks
      1.2.3.1 OpenCL
      1.2.3.2 CUDA
  1.3 k-connectivity
    1.3.1 Previous Parallel Approaches
      1.3.1.1 1-connectivity
      1.3.1.2 2-connectivity
      1.3.1.3 3-connectivity
      1.3.1.4 BFS
    1.3.2 Motivation for our Approach
  1.4 Our Contributions
    1.4.1 GPU Algorithm for 2-connectivity
    1.4.2 GPU Algorithm for 3-connectivity
    1.4.3 Expediting Parallel k-connectivity Algorithms

2 GPU-BiCC: GPU Algorithm for 2-connectivity
  2.1 Overview
  2.2 Motivation
  2.3 Algorithm GPU-BiCC
    2.3.1 Algorithm
    2.3.2 Complexity Analysis
  2.4 Implementation
  2.5 Experiments And Analysis
    2.5.1 Setup
    2.5.2 Results
  2.6 Extension to Dense Graphs

3 GPU-TriCC: GPU Algorithm for 3-connectivity
  3.1 Overview
  3.2 The Algorithm of Miller and Ramachandran for Graph Triconnectivity
  3.3 Triconnectivity on GPU
    3.3.1 Dataset
    3.3.2 Results

4 Expediting Parallel Graph Connectivity Algorithms
  4.1 Motivation
  4.2 An Overview of our Approach
  4.3 Application to 2-Connectivity
    4.3.1 Our Approach
      4.3.1.1 Algorithm Sample-GPU-BiCC
    4.3.2 Implementation Details
    4.3.3 Experimental Results, Analysis, and Discussion
      4.3.3.1 Experimental Platform and Dataset
      4.3.3.2 Results
      4.3.3.3 Discussion
  4.4 Application to 3-connectivity
    4.4.1 Our Approach
      4.4.1.1 Algorithm Sample-GPU-TriCC
    4.4.2 Implementation Details
    4.4.3 Experimental Results, Analysis, and Discussion
      4.4.3.1 Dataset
      4.4.3.2 Results
      4.4.3.3 Discussion

5 Conclusions and Future Work

Related Publications

Bibliography

List of Figures

1.1 A sample social network diagram displaying friendship ties among a set of Facebook users [54].
1.2 Side-by-side comparison of the same video game character rendered in 1996 and in 2017.
1.3 Block diagram of a GPU (G80/GT200) card [11].
1.4 A graph with connectivity 4 [53]. One can remove any 3 vertices and the graph would still remain connected.

2.1 H's vertex set is the base vertices of G, with edges induced by the non-tree edges of G. H′ is generated after applying the connected components algorithm to H and contracting the trees. The unique ID of every connected component in H′ serves as the ID of the alias vertex in G′.
2.2 Figure shows the cycle created by paths P_{xy}, P_{yu_j}, P_{u_j u_i}, and P_{u_i x}. For ease of exposition, the auxiliary graph shown contains only the changes made with respect to u and not the changes induced with respect to other vertices.
2.3 Here a is an articulation vertex because aa″ is a bridge. Vertex u had an upward traversal in Step 2 and hence aa′ cannot be a bridge.
2.4 Thread t1 marks all edges it traversed with its ID. Then, in Array2, it stores the LCA vertex found. Thus every edge of the graph knows its LCA vertex, by first looking up the thread which discovered it and then the corresponding value stored at that thread ID.
2.5 The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of GPU-BiCC over the next fastest one.
2.6 The primary Y-axis represents the timings in milliseconds. On the secondary Y-axis, Speedup-1 represents the speedup of Cert-GPU-BiCC over BFS-BiCC while Speedup-2 shows the speedup of GPU-BiCC over BFS-BiCC.

3.1 Figure showing the stages in the algorithm of Miller and Ramachandran [29].
3.2 Figure showing the time taken by Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The number on each bar indicates the percentage of time spent by Algorithm Cert-GPU-TriCC in BFS operations.

4.1 Figure shows the percentage of time spent by Algorithm Cert-GPU-BiCC (cf. Section 2.6) on BFS operations.

4.2 Figure illustrating our technique in comparison to other approaches towards practical parallel graph algorithms. The top arrow represents direct computation that is usually expensive. The middle arrow indicates preprocessing via strict structural subgraphs or constructs that are sometimes expensive to create. The bottom path corresponds to the approach proposed in this paper. In the figure, red/solid arrows indicate expensive operations while green/hollow arrows indicate operations that are easy to perform in general.
4.3 An example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.
4.4 Figure showing the time taken by Algorithms Cert-GPU-BiCC and Sample-GPU-BiCC on the graphs listed in Table 4.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-BiCC over Algorithm Cert-GPU-BiCC.
4.5 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph kron as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.
4.6 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph coPaperDBLP as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.
4.7 Figure showing the time taken by Algorithms Cert-GPU-TriCC and Sample-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-TriCC over Algorithm Cert-GPU-TriCC.
4.8 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rm07r as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in millions at various values of k.
4.9 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph randTriCC1 as k is varied. The Y-axis represents time in milliseconds with varying k on the X-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F at various values of k.

List of Tables

2.1 Sparse Graphs for GPU-BiCC
2.2 Dense Graphs for GPU-BiCC

3.1 Graphs used in our experiments for triconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

4.1 Graphs used in our experiments for biconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Chapter 1

Introduction

1.1 Parallel Algorithms in Graph Theory

A graph is a collection of objects (nodes) in which pairs of nodes may share a link based on some defined property or relation. Graph theory is the study of such graphs. Graphs are a useful tool in studying the process flows and dependencies involved in real-world systems. In computer science, graphs have traditionally been employed as an aid in modelling, understanding and visualizing underlying data. For example, social networks often model their users as a graph where two users are connected if they are "friends" or share the same interests. This structure helps such services understand their own user base and how they can expand it or bring like-minded people together.

Figure 1.1 A sample social network diagram displaying friendship ties among a set of Facebook users [54].

Graph theory today finds application in networks, machine learning, linguistics, and flow computation, to name a few domains. Several real-world problems can be broken down into simpler, fundamental graph problems. Over the years, several efficient graph algorithms have been proposed for the standard graph problems. Discussing all the advances in graph algorithms would be beyond the scope of this thesis. However, a majority of these algorithms were serial in nature. With the rapid improvement in single-core processors from the 1980s till the early 2000s, it made sense to develop serial algorithms. Every new generation of processors increased clock speeds and transistor counts, leading to better performance.

With time, physical limitations were reached in single-core processors, as further increases in frequency led to unmanageable heating problems. Transistor scaling also became more difficult. Eventually, it became cheaper and faster to have two slower cores in parallel than to design increasingly faster single-core processors. As the focus shifted to parallel architectures, so did the algorithms. Parallel algorithms became more widespread with the rise of multi-core architectures, where multiple cores are embedded in a single chip. Multi-core architectures were soon followed by the development of many-core architectures, which use a larger number of simpler, independent processors. This surge in parallel architectures led to an increased emphasis on parallel algorithms, as they could now potentially be deployed across thousands of cores to deliver results which would have been impossible to achieve in a single-core setting. Graph theory also attracted a fair amount of research in devising parallel algorithms for the traditional problems.

In this thesis, we have taken up the problem of k-connectivity in graphs, a traditional graph problem, and present parallel algorithms which leverage the latest advances in computational power to deliver significantly faster results.

1.1.1 Challenges

On the surface, it looks easy to model a graph in a parallel computation model where each node can be assigned to an individual processor. However, efficient communication and load balancing are hard to achieve in a graph setting. Work on graphs is often not uniformly spread, unlike in graphics where each pixel can be computed independently. Load balancing and synchronization take up a lot of resources because, if the work is not distributed correctly, some nodes can effectively become the bottleneck. Graph traversals and finding spanning trees are examples where the work is not uniform. Each node can have a different number of neighbors and thus requires a varying amount of work. Even if nodes are aggregated into clusters and then assigned to processors, the problem still remains, as each cluster may then have a different workload.

Efficient work distribution, synchronization, and communication between nodes are tough to achieve for graphs in a parallel setting. Despite the inherent difficulty in load balancing, several parallel graph algorithms have been presented over the years which are significantly faster than the serial implementations. We discuss some of these approaches in the later chapters.

1.2 GPUs as a Parallel Computation Platform

A parallel algorithm can be implemented on several underlying parallel architectures, such as a multi-core architecture, a many-core architecture, or even a distributed network. Although the theoretical complexity remains the same, the underlying architecture can affect the run-time of a parallel algorithm by orders of magnitude. Each architecture has its own specific strengths and weaknesses in handling parallelism. In this thesis, we look at parallel implementations on multi-core and many-core systems. Our results suggest that, for our problem of k-connectivity in graphs, a many-core system such as a GPU is the best approach.

1.2.1 Brief History

Graphics Processing Units, or GPUs, as the name suggests, were primarily developed as specialized graphics chips for video games around the 1970s. They were initially used as a helper to the CPU, performing numerous mundane yet resource-intensive tasks while the CPU processed the game's instructions and player inputs. Game developers had realized the value of a specialized graphics chip, as simply having more RAM was too expensive. Since the 1970s, GPUs have been primarily used as frame buffers for video games and applications. Even today, all video games are rendered onto the screen by the GPU: the CPU provides the instructions and moves the data to the GPU, which then takes over the graphical calculations. Over the decades, GPUs started providing ever more specialized hardware for rendering game graphics. As the GPU hardware advanced, so did its integration with the software. It became easier for developers to use the GPU through several additional APIs. GPUs eventually became increasingly powerful, versatile, and accessible to the common public. Figure 1.2 shows the growth in GPU capabilities over the last 20 years. A modern consumer-end GPU such as the Nvidia RTX 2080 Ti boasts 4352 cores and almost 18.6 billion transistors. It is capable of providing a performance of 13.4 teraflops and supports 11 GB of GDDR6 memory. GPUs can today be programmed in a variety of languages through a large number of well-documented APIs. Thus, despite being primarily used for rendering graphics, a GPU can be programmed and used by anyone for any purpose. A GPU readily provides a huge number of weak but cheap cores for any programmable computation. As a result, GPUs have been picked up by researchers as an alternative parallel computation platform. GPUs can offer a significant performance boost over a parallel cluster for a particular set of tasks. GPUs are widely used in the domains of image processing, machine learning, scientific computing, ray tracing, fluid dynamics, etc. Although GPUs are not widely used in graph theory, we show that it is possible to achieve considerably faster results on a GPU through clever implementation of well-designed parallel graph algorithms.

(a) Tomb Raider 1996    (b) Tomb Raider 2017

Figure 1.2 Side-by-side comparison of the same video game character rendered in 1996 and in 2017.

1.2.2 GPU Architecture

GPUs have a drastically different architecture from CPUs. GPUs are designed with a SIMD (Single Instruction, Multiple Data) philosophy to exploit data-level parallelism: a SIMD machine performs numerous parallel computations by applying one instruction at a time to many data elements. A GPU has an array of SIMD processors called SMs, or Streaming Multiprocessors. Each SM is composed of numerous cores known as SPs, or Stream Processors. Memory, which can be global, constant, or shared, is available to each SM and across the GPU. Figure 1.3 (cf. [11]) illustrates the parallel architecture of a typical GPU.

As can be seen from Figure 1.3, a GPU has hundreds of smaller, efficient cores. Subsets of these cores are often specialized to perform a single or similar type of work such as ray tracing, floating point calculations, or shader arithmetic. For example, the GeForce 20 series of GPUs developed by Nvidia accelerates real-time ray tracing through the use of new specialized RT cores, which are designed to process all Bounding Volume Hierarchy traversal and ray/triangle intersection testing [32]. Although high in number, GPU cores are not as strong as CPU cores. They operate at a lower frequency and have very limited memory. In addition, shifting data from main memory to the global memory of the GPU is a very costly step. Thus, one needs to consider memory and synchronization issues carefully while designing algorithms for GPUs. The algorithm must be able to scale its tasks to thousands of cores but at the same time ensure that the tasks involve simple, uniform work. In later chapters, we show that pursuing this is worth the effort, as our GPU implementations outperform the parallel CPU implementations by a huge margin.

Figure 1.3 Block diagram of a GPU (G80/GT200) card [11].

1.2.3 Software Frameworks

GPUs today can be programmed through a variety of languages such as C, C++, Python, etc. Most of these languages leverage one of the following two software frameworks.

1.2.3.1 OpenCL

OpenCL, or Open Computing Language, is an open, free standard for programming parallel processors. It provides standard task-based and data-based parallelism by exposing APIs in C99 and C++. OpenCL operates on an abstraction called compute units, which can be backed by GPUs, many-core CPUs, FPGAs, etc. It therefore supports developing applications which run on heterogeneous parallel platforms. Although written in C and C++, OpenCL is available in Python and .NET through third-party platforms. Further information on OpenCL can be found in [33].

1.2.3.2 CUDA

CUDA is a low-level parallel programming framework developed by Nvidia for CUDA-enabled GPUs (currently only Nvidia GPUs are CUDA-enabled). It is designed for C, C++, and Fortran, with third-party wrappers available for a variety of languages such as Python, Matlab, R, Perl, and Ruby. Through CUDA, one can program an application which leverages multiple CUDA-enabled GPUs in sync with parallel CPUs. Since it is a low-level framework, it has compatibility issues and several high-level programming constructs are currently missing. However, the low-level nature of the framework allows more fine-grained control over the underlying resources, enabling more specialized libraries. We have programmed our parallel GPU algorithms using the CUDA framework. For more details, we refer the reader to [13].
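To make the host/device programming model concrete, the following is a minimal, self-contained CUDA sketch written for this discussion (it is not taken from our implementation); the kernel name, array size, and launch configuration are chosen purely for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// Each thread increments one element of the array: the SIMD-style
// "one instruction, many data elements" pattern described above.
__global__ void incrementKernel(int *data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < n)                                      // guard against overshoot
        data[idx] += 1;
}

int main() {
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc(&d_data, n * sizeof(int));             // allocate on the GPU
    cudaMemset(d_data, 0, n * sizeof(int));

    // 1024 threads per block; enough blocks to cover all n elements.
    int threads = 1024;
    int blocks = (n + threads - 1) / threads;
    incrementKernel<<<blocks, threads>>>(d_data, n);
    cudaDeviceSynchronize();                          // wait for the kernel

    int first;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    printf("data[0] = %d\n", first);                  // prints 1
    cudaFree(d_data);
    return 0;
}

Compiled with nvcc, the host allocates device memory, launches one thread per array element, and copies a result back; the kernels we describe in later chapters follow this same general pattern of launching many small, uniform tasks.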

1.3 k-connectivity

Finding whether a graph is k-connected and identifying its k-connected components is a fundamental problem in graph theory. The problem finds application in planarity testing [20], isomorphism of planar graphs [22], network analytics [5, 51, 18], clustering [4], and data visualization [39], to name a few. A graph G is called connected if there exists a path between every pair of vertices. A graph G is called k-connected if it has more than k vertices and remains connected whenever fewer than k vertices are removed. A connected graph can thus also be called a 1-connected graph. Similarly, a 2-connected graph is called biconnected and a 3-connected graph is called triconnected.

Figure 1.4 A graph with connectivity 4 [53]. One can remove any 3 vertices and the graph would still remain connected.

Figure 1.4 shows an example 4-connected graph.

1.3.1 Previous Parallel Approaches

Due to its varied applications, testing graph k-connectivity has been a problem of immense research interest. In the following sections, some of the previous parallel approaches to 1-, 2-, and 3-connectivity are listed. Some parallel implementations of BFS are also mentioned, as almost all of the parallel approaches to k-connectivity start with a BFS spanning tree.

1.3.1.1 1-connectivity

Early PRAM algorithms for testing the connectivity of a graph were proposed by Hirschberg et al. [21]. Hirschberg et al. [21] start with the adjacency matrix of the graph and then contract the graph edges into super-vertices in parallel iterations. Their algorithm uses n² processors to find the connected components of an undirected graph in O(log² n) time. Shiloach and Vishkin [40] propose a PRAM algorithm which runs in O(log n) time using O(n + m) processors. The Shiloach-Vishkin algorithm operates on the same principle of forming partitions and contracting them as done by Hirschberg et al. [21]. In each iteration, the algorithm merges two trees based on certain conditions. Initially, each vertex of the graph is a singleton tree. This forest of trees is merged in parallel iterations till the vertices of every connected component belong to a distinct star, with each node assigned to exactly one star. Shiloach and Vishkin show that this algorithm runs in O(log n) time using O(n + m) processors. Several experimental studies on finding the connected components of a graph are based on this algorithm [43, 19, 41]. In particular, Soman et al. [43] adapt this algorithm to a GPU, showing a speedup of 9 to 12 times over the best sequential CPU implementation. We later borrow Soman et al.'s [43] algorithm for finding the connected components of a graph on a GPU. In a recent work, Sutton et al. [44] argue that the Shiloach-Vishkin algorithm [40] can be applied to an O(n)-edge spanning subgraph of the input graph. The connected components of the subgraph can then be used to find the connected components of the original graph by using the algorithm of Shiloach and Vishkin [40]. Our work in this thesis seems to provide a good reason for the speedup achieved by Sutton et al. [44] and extends their work to 2- and 3-connectivity.
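To make the hooking and pointer-jumping idea concrete, the following is a simplified, self-contained CUDA sketch in the spirit of the Shiloach-Vishkin approach; it is not the exact algorithm of [40] nor the implementation of [43], and the edge-list arrays src/dst and all other names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Hooking: for every edge whose endpoints lie in different trees, attach the
// tree with the larger root ID under the smaller root ID.
__global__ void hook(const int *src, const int *dst, int *parent,
                     int m, int *changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= m) return;
    int u = parent[src[e]], v = parent[dst[e]];
    if (u == v) return;
    int hi = u > v ? u : v, lo = u > v ? v : u;
    atomicMin(&parent[hi], lo);   // races may lose a union; the edge retries next round
    *changed = 1;
}

// Pointer jumping (shortcutting): flatten every tree so that parent[v]
// points directly at the root of v's tree.
__global__ void shortcut(int *parent, int n) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    while (parent[v] != parent[parent[v]])
        parent[v] = parent[parent[v]];
}

int main() {
    // Tiny example: edges {0-1, 2-3, 1-2} form one component, vertex 4 another.
    int h_src[] = {0, 2, 1}, h_dst[] = {1, 3, 2};
    int n = 5, m = 3;
    int h_parent[] = {0, 1, 2, 3, 4};             // every vertex starts as its own tree

    int *src, *dst, *parent, *changed;
    cudaMalloc(&src, m * sizeof(int));    cudaMalloc(&dst, m * sizeof(int));
    cudaMalloc(&parent, n * sizeof(int)); cudaMalloc(&changed, sizeof(int));
    cudaMemcpy(src, h_src, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dst, h_dst, m * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(parent, h_parent, n * sizeof(int), cudaMemcpyHostToDevice);

    for (int again = 1; again; ) {                // repeat until no tree was merged
        again = 0;
        cudaMemcpy(changed, &again, sizeof(int), cudaMemcpyHostToDevice);
        hook<<<(m + 255) / 256, 256>>>(src, dst, parent, m, changed);
        shortcut<<<(n + 255) / 256, 256>>>(parent, n);
        cudaMemcpy(&again, changed, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(h_parent, parent, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int v = 0; v < n; ++v)
        printf("vertex %d -> component %d\n", v, h_parent[v]);
    return 0;
}

The host repeats the hook/shortcut rounds until no edge connects two different trees, at which point parent[] maps every vertex to the representative of its connected component.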

1.3.1.2 2-connectivity

The classical serial solution to 2-connectivity was proposed by Tarjan [45]. Tarjan [45] maintains, for each vertex v, (i) the depth of v in the DFS (Depth First Search) tree and (ii) the lowest depth among the neighbors of all descendants of v. This information is collected and maintained while performing a DFS. Based on this information, the algorithm identifies the vertices whose removal disconnects the graph into biconnected components. Since it relies on a DFS, the algorithm is linear in nature. The first PRAM algorithm for finding the 2-connected components of a graph in parallel is given by Tarjan and Vishkin [46]. It is based on the same principles as Tarjan's [45] serial algorithm but relies on an arbitrary spanning tree rather than a DFS tree. The algorithm begins by generating a spanning tree by applying Shiloach and Vishkin's [40] parallel connected components algorithm. It then records similar per-vertex information as Tarjan's [45] serial algorithm by doing an Euler tour of the generated spanning tree. This is followed by the construction of an auxiliary graph G′ in which two edges of G are connected iff they are part of the same biconnected component in G. Thus, the algorithm reduces the problem of finding the 2-connected components of a given graph to finding the connected components of an auxiliary graph. It runs in O(log m) time with O(n + m) processors. In [16], Edwards and Vishkin adapt this PRAM algorithm to the XMT architecture, showing 9x-33x speedups over serial implementations.

Bader and Cong [10] later identified that the process of constructing the auxiliary graph is, however, quite slow in practice. Bader and Cong [10] proceed to use a formulation akin to that of Cheriyan and Thurimella [8]. They prove that edges in G which are not in a BFS tree T and are also not in a spanning forest F of G\T are nonessential, i.e., they do not affect the biconnectivity of G. Removal of such edges leads to a smaller, O(n)-sized certificate which can be used to determine the biconnectivity of G and find its biconnected components. Their certificate-based approach runs in O(d + log n) time, where d is the diameter of the graph. Overall, they show a speedup of up to 2x on a variety of graphs on multi-core CPUs. More recently, Slota and Madduri [42] proposed that one can test the biconnectivity of a graph by performing multiple BFS traversals on multi-core CPUs. They presented a simple yet effective parallel algorithm which does not rely on the construction of an auxiliary graph or certificate. The algorithm is based on the theorem that a non-root vertex v in the BFS tree is an articulation vertex iff it has at least one child w that cannot reach any vertex at depth less than or equal to that of v (other than v itself) when v is removed from G. Slota and Madduri run truncated BFSs from every vertex in G; a truncated BFS terminates as soon as it reaches such a vertex. In the worst case, the algorithm runs in O(nm) time, but through a good implementation and optimizations, it is on average 4.2x faster than the Bader and Cong [10] approach. In 2015, Chaitanya and Kothapalli [6] improved upon Slota and Madduri's [42] result by 2.45x through an auxiliary-graph-based approach. The algorithm relies on the observation that it is easier to find the bridges of a graph G than its articulation points. The auxiliary graph G′ is constructed from a BFS tree T of G. An equivalence relation is proved between G′ and G stating that the bridges in G′ correspond to the articulation points in G. In the worst case, the algorithm runs in O(md) time, where d is the diameter of the graph. We further improve Chaitanya and Kothapalli's [6] approach and adapt it to GPUs. We improve the auxiliary graph construction of Chaitanya and Kothapalli by using a connected components algorithm on a subset of the BFS tree T. This helps us optimize the number of additional vertices required for the auxiliary graph construction. With an improved auxiliary graph and a clever implementation on a GPU, we are able to achieve over 4x speedup. We also apply Bader and Cong's [10] edge-pruning technique for a further 2x speedup on dense graphs. We expand more on our algorithm, along with the ones proposed by Bader and Cong [10], Slota and Madduri [42], and Chaitanya and Kothapalli [6], in Chapter 2.

1.3.1.3 3-connectivity

For finding the 3-connected components of a graph G, Ramachandran and Vishkin [38], and Miller and Ramachandran [29], present PRAM algorithms. Their algorithms make use of an ear decomposition of the graph and define an auxiliary graph for every ear of G. These auxiliary graphs are then used to check the 3-connectivity of G, followed by finding the 3-connected components of G. These algorithms [46, 29] can be recast to use the result of Cheriyan and Thurimella [8]. Vishkin and Edwards [16, 17] study parallel implementations of 2- and 3-connectivity algorithms on the XMT architecture [49] and compare how these XMT implementations scale with an increasing number of cores. Other than the above attempts, we could not find other significant implementations of PRAM algorithms for 3-connectivity. We adapt Miller and Ramachandran's [29] algorithm to the GPU in Chapter 3 and then improve upon it in Chapter 4.

1.3.1.4 BFS

In all of the approaches to k-connectivity, one needs to start with a spanning tree of some kind. Most of the serial approaches use a DFS for generating the spanning tree and any other additional data they need. However, a DFS is not easily parallelizable due to the serial nature of the tasks involved: all descendants of a child node must be explored before any other child node can be visited. Due to this inherent serial nature, several of the PRAM approaches start with a parallel BFS instead. A BFS is also hard to parallelize due to problems in work balancing and synchronization. However, it is still possible to achieve fast parallel BFS through cleverly optimized implementations. The difficulty of efficiently performing a BFS traversal of a graph has led several researchers to identify numerous algorithmic and data structure optimizations on GPUs as well as on multi-core architectures. Beamer et al. [2] present a hybrid BFS algorithm which combines the conventional top-down approach with a new bottom-up approach. In the top-down approach, frontier nodes search for unexplored children, while in the bottom-up approach the unexplored nodes search for their parents in the frontier. This direction-optimizing approach shows on average a 2x speedup over other state-of-the-art multi-core implementations. In [9], Chhugani et al. use cache and data structure optimizations for an efficient and fast BFS on a multi-core system. Merrill et al. [28] present a BFS based on a fine-grained task management approach on GPUs. The algorithm assigns GPU resources based on the number of neighbors of a vertex. With O(n + m) complexity, their implementation is several times faster than state-of-the-art CPU and GPU implementations. In our k-connectivity algorithms, we use this implementation for all BFS purposes. Despite these advances, we notice in our study that BFS traversals still consume a significant portion of the running time of parallel graph connectivity algorithms. Irrespective of the implementation, we observe that the O(n + m) work involved in a BFS is almost always the bottleneck step in a k-connectivity algorithm implementation. We attempt to rectify this bottleneck by performing the BFS only on a smaller subsection of the graph in Chapter 4. To summarize, one can find several attempts at k-connectivity in various settings such as sequential algorithms [12, 8, 45, 23, 25], parallel algorithms [40, 21, 46, 44, 10, 6, 42, 50, 26, 16, 17, 29, 25], and also distributed algorithms [36]. Almost all of the algorithms use graph traversal techniques to create one or more spanning trees and use the properties of spanning trees to test the k-connectivity of the graph and obtain its k-connected components. Expanding on all of these works in depth would be beyond the scope of this thesis. We have added brief descriptions of most of them in the above paragraphs and we discuss some of these algorithms further in Chapters 2, 3, and 4.
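To illustrate why BFS work is irregular on a GPU, the following is a minimal level-synchronous BFS sketch written for exposition only; it is not the algorithm of Beamer et al. [2] or Merrill et al. [28], and the CSR arrays, kernel name, and toy graph are illustrative.

#include <climits>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                         int *dist, int level, int n, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != level) return;           // only frontier vertices work
    // Each thread scans its vertex's whole adjacency list, so a high-degree
    // vertex makes its thread do far more work than its warp-mates.
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
        int w = colIdx[e];
        if (dist[w] == INT_MAX) {                     // unvisited
            dist[w] = level + 1;                      // benign race: any writer wins
            *changed = 1;
        }
    }
}

int main() {
    // Tiny undirected path 0-1-2-3 in CSR form, for illustration only.
    int h_rowPtr[] = {0, 1, 3, 5, 6};
    int h_colIdx[] = {1, 0, 2, 1, 3, 2};
    int n = 4;
    int h_dist[] = {0, INT_MAX, INT_MAX, INT_MAX};    // source is vertex 0

    int *rowPtr, *colIdx, *dist, *changed;
    cudaMalloc(&rowPtr, sizeof(h_rowPtr)); cudaMalloc(&colIdx, sizeof(h_colIdx));
    cudaMalloc(&dist, sizeof(h_dist));     cudaMalloc(&changed, sizeof(int));
    cudaMemcpy(rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(colIdx, h_colIdx, sizeof(h_colIdx), cudaMemcpyHostToDevice);
    cudaMemcpy(dist, h_dist, sizeof(h_dist), cudaMemcpyHostToDevice);

    for (int level = 0, again = 1; again; ++level) {  // one kernel launch per level
        again = 0;
        cudaMemcpy(changed, &again, sizeof(int), cudaMemcpyHostToDevice);
        bfsLevel<<<(n + 255) / 256, 256>>>(rowPtr, colIdx, dist, level, n, changed);
        cudaMemcpy(&again, changed, sizeof(int), cudaMemcpyDeviceToHost);
    }
    cudaMemcpy(h_dist, dist, sizeof(h_dist), cudaMemcpyDeviceToHost);
    printf("dist(3) = %d\n", h_dist[3]);              // prints 3
    return 0;
}

Because each thread scans the full adjacency list of its frontier vertex, a single high-degree vertex can stall an entire warp; this is precisely the imbalance that the fine-grained gathering of [28] is designed to remove, and the reason BFS remains the costly step in the connectivity algorithms studied here.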

1.3.2 Motivation for our Approach

In the parallel setting in particular, PRAM algorithms that require poly-logarithmic time and O(m + n) work are known for k = 1, 2, 3, and 4 [40, 21, 46, 29, 25]. However, in practice, the constants hidden in the big-O notation are significantly high for k ≥ 2. These constants mainly arise due to the number of graph traversals, such as BFSs. These traversals always take O(m) time, which quickly becomes a problem in scaling the algorithms to non-sparse graphs. In light of this, some techniques for reducing the graph size have been proposed. In one such result, Cheriyan and Thurimella [8] present a method to obtain an O(n)-sized certificate from the original graph. They prove that the O(n)-sized certificate graph has the same connectivity properties as the original graph. Cheriyan and Thurimella's [8] result has been successfully used to deliver faster connectivity algorithms by Bader and Cong [10], by Chaitanya and Kothapalli [6], and later by us in Chapter 2. The O(n) certificate approach does make connectivity algorithms faster, but obtaining the certificate requires performing multiple BFSs. Each BFS being O(n + m), obtaining a certificate is a fairly expensive operation: one needs to perform k BFSs to obtain an O(n) certificate for k-connectivity. Thus, although this approach is faster, it is still burdened by expensive BFSs. Here we see an opportunity to develop faster algorithms if we can obtain certificates while circumventing the expensive BFSs. In addition, as mentioned in Section 1.2.2, GPUs differ vastly in architecture from CPUs. The high number of dedicated cores provides an opportunity to deliver much faster results. However, since the architectures are radically different, PRAM algorithms implemented on many-core CPUs cannot be directly applied to GPUs. Today, GPUs have been successfully used in several areas of research to deliver significantly faster implementations. However, to the best of our knowledge, no GPU implementations or algorithms are available for k-connectivity for k ≥ 4. Thus, the idea of obtaining cheaper certificates by efficiently utilizing a GPU served as the basis of our approach to developing fast parallel algorithms for k-connectivity.

1.4 Our Contributions

In this thesis, we propose faster GPU-based algorithms for 2-connectivity and 3-connectivity. We implement our algorithms and demonstrate that they are the fastest solutions yet. To the best of our knowledge, no other GPU implementations for 2- or 3-connectivity exist. Later, we generalize our approach and propose a solution which removes the dependence on a full BFS, which we show to be the bottleneck operation. The time taken by performing the BFS on the entire graph is reduced by performing the BFS only on a smaller subgraph and then repairing the partial results.

1.4.1 GPU Algorithm for 2-connectivity

For 2-connectivity, we present and implement a fast parallel GPU algorithm based on the work of [6]. To the best of our knowledge, this is the first such attempt and also the fastest implementation across architectures. The implementation is on average 4x faster than the next best implementation. The implementation works best for sparse graphs and achieves up to a 70x improvement over all other implementations. Later, we also apply an edge-pruning technique which results in a further 2x speedup for dense graphs.

1.4.2 GPU Algorithm for 3-connectivity

We adapt Miller and Ramachandran's [29] parallel algorithm and provide the first GPU implementation for 3-connectivity. Since no other significant parallel implementations of 3-connectivity existed at the time of publication, we take our implementation as the baseline and improve upon it by almost 5x by employing a certificate-based approach as described by Cheriyan and Thurimella [8].

1.4.3 Expediting Parallel k-connectivity Algorithms

In our algorithms for 2-connectivity and 3-connectivity, we observe empirically that the parallel BFS is the bottleneck by a huge margin. We propose a novel solution that avoids performing the BFS on the entire graph and instead works with a BFS of a smaller subgraph. We design certificates which are significantly cheaper to compute but are inaccurate. We then address the inaccuracies in the certificate produced, rather than spending time generating a perfect certificate. We apply our approach to 2- and 3-connectivity and provide implementations which further improve our fastest implementations by 2.2x and 2.1x respectively.

Chapter 2

GPU-BiCC: GPU Algorithm for 2-connectivity

2.1 Overview

A graph G is called biconnected if there exist at least two vertex-disjoint paths between any two vertices of the graph. The maximal biconnected subgraphs of G are often called the biconnected components of G. The problem of finding the biconnected components of a graph has long been a part of algorithmic graph theory, with applications to social networks [5], clustering [4], data visualization [39], and many other areas. An articulation point, which is a vertex whose removal disconnects the graph, represents a critical point in a network whose failure can disrupt the flow of messages across the network. In social networks, articulation points can indicate people who connect people of different interests. Biconnected graphs are fault-tolerant in the sense that the failure of a single node does not disable the network.

The classical solution to this problem is due to Tarjan [12] and uses a depth-first traversal of the underlying graph. Given that performing a (lexicographic) depth-first traversal of a graph is a P-complete problem [24], the algorithm of Tarjan does not lend itself to efficient parallelization. The first parallel algorithm for this problem was designed by Tarjan and Vishkin [46]; it converts the problem of finding the biconnected components of a graph to finding the connected components of an auxiliary graph. However, the approach of Tarjan and Vishkin suffers in practice due to the huge size of the auxiliary graph and also the operations required to construct it. Cong and Bader [10] use the connection between spanning trees and the k-connectivity of a graph [8] to improve the algorithm of Tarjan and Vishkin [46]. Given the overwhelming number of applications and the significance of the problem, there is renewed interest in this problem in the parallel setting. Algorithms for this problem have been studied on modern architectures such as multi-core CPUs and the XMT [42, 6, 7, 16]. Recently, Slota and Madduri [42] proposed two parallel algorithms for finding the biconnected components of a graph. These algorithms are best suited for multi-core architectures and have since been improved by the work of Chaitanya and Kothapalli [6]. The algorithm of [6] is particularly suited for sparse graphs.

As discussed in Section 1.2, GPUs have a massively parallel architecture consisting of thousands of smaller, more efficient cores which can provide a significantly higher speedup. Due to the difference in architecture between a CPU and a GPU, algorithms designed for multi-core CPUs are not always well-suited for GPUs. Such a situation has meant that CPU algorithms sometimes have to be reinterpreted to arrive at efficient algorithms in practice. While there are several recent works that use the biconnected components of a graph to improve the performance of graph algorithms on the GPU [35, 30], to the best of our knowledge there is no GPU-based solution for finding the biconnected components of a graph. Here we present and implement the first GPU algorithm for biconnected components. Through several optimizations and a clever implementation on a K40C GPU, we are able to achieve a speedup of up to 70x over the approach of Slota and Madduri [42] and up to 9x over the approach of Chaitanya and Kothapalli [6, 7]. Our approach is also particularly suited to sparse graphs, as is the case with the other existing algorithms [42, 6, 7]. To extend our approach to dense graphs, we borrow an edge-pruning technique mentioned by Cong and Bader [10]. This further speeds up the algorithm for dense graphs, yielding an implementation which works for all kinds of graphs.

2.2 Motivation

The approach of Slota and Madduri [42] performs multiple breadth-first searches (BFSs). After the first BFS generates the BFS tree, each thread picks a vertex and checks, via another BFS, whether a child node can reach a node above the current vertex. These BFSs terminate when a vertex at a higher level is found. Such an approach, which requires multiple BFSs, is not well-suited for a GPU. As discussed in Section 1.2, GPUs normally have a high number of weak cores. In order to effectively utilize the numerous cores, one needs to schedule a large number of small yet uniform tasks. This is tough to do in a BFS, where each vertex can have a different number of neighbors, leading to irregularity in the work assigned. It is not feasible to schedule multiple BFSs at either a thread or a warp level. A single warp can be assigned a node with hundreds of neighbors to expand, while some other warp may be assigned a vertex of low degree. This again leads to work imbalance, where some threads do extra work while others remain idle. Similar problems are encountered even if we try to schedule one BFS per SM (Streaming Multiprocessor) of a GPU. In [28], Merrill et al. argue that the most efficient way of exploring vertices is when fine-grained scan-based exploration is supplemented with coarser cooperative-thread-array-based and warp-based exploration. An entire thread block is used for exploring a single large-degree vertex. Scheduling a BFS per SM, although feasible, would still not be efficient. As a result, due to the irregularity of the work involved, BFSs are often the bottleneck in GPU graph algorithms. In our approach, we found that a single BFS would consume 20% to 50% of the total time taken to determine biconnectivity. It is thus a lot easier to work with simple tree traversals in parallel than to do multiple parallel BFSs. Hence, rather than depending on multiple BFSs, we generate the BFS tree once and do simple graph traversals in parallel on that tree.

2.3 Algorithm GPU-BiCC

We build our algorithm upon the following three lemmas used by Chaitanya and Kothapalli [6, 7] for their CPU parallel algorithm called LCA-BiCC.

Let V_{lca}(G) be the set of the lowest common ancestors (LCAs) of the endpoints of each non-tree edge of G with respect to some BFS tree T. The proofs of the three lemmas have been taken verbatim from [7].

Lemma 1: Let G be a 2-edge-connected graph and let T be a BFS tree of G. If v is not in V_{lca}(G), then v cannot be an articulation point of G.

Proof: On the contrary, assume that a vertex v is an articulation point and is not the LCA of any non-tree edge of G. If v is on only one cycle in G, then v cannot be an articulation point. So, we assume in the rest of the proof that v is on at least two cycles in G. Let C_1, C_2, ..., C_k be the fundamental cycles induced respectively by non-tree edges e_1, e_2, ..., e_k ∈ G \ T that pass through vertex v. Let C_i and C_j be any two cycles from the set {C_1, C_2, ..., C_k}, induced by non-tree edges e_i and e_j respectively. Let vertices x and y be the LCAs of the endpoints of e_i and e_j respectively. It is evident that x and y must be ancestors of v, as v lies on both cycles and v ∉ {x, y}. The relation between x and y can be categorized as follows.

• x = y: In this case the two cycles C_i and C_j share the same LCA, say x, and also the vertex v. This implies that C_i and C_j share at least an edge (as there are at least two vertices, x and v, common to both C_i and C_j). So, even after the removal of v, all edges belonging to C_i and C_j remain in a single biconnected component. Hence, v is not an articulation point.

• x ≠ y, z = LCA(x, y), and z ∉ {x, y}: As x and y are ancestors of v, there is a path from x to v and from v to y in T. As z is an ancestor of x and y, there is a path from z to x and from y to z in T. This gives a closed walk z → x → v → y → z, which leads to a cycle in T. However, T is a BFS tree and cannot have cycles. Therefore, our assumption that v is an articulation point is not valid.

• x ≠ y and LCA(x, y) ∈ {x, y}: Without loss of generality, we assume that y = LCA(x, y). Let C_i and C_j be any pair of cycles induced by non-tree edges e_i and e_j that pass through v, with LCA(e_i) = x and LCA(e_j) = y. Since y is a proper ancestor of x, there is a path from x to v (in T and also in G) that is common to C_i and C_j. This ensures that there is at least an edge common to the cycles C_i and C_j. Similar to the case where x = y, this allows us to argue that even after the removal of v, all edges of C_i and C_j remain in a single connected component. Since the above holds for any pair of cycles passing through v, v is not an articulation point.

A bridge is an edge whose removal disconnects the graph. Naturally, the end-points of bridges are also articulation points. However, not all articulation points are end-points of bridges. Chaitanya and Kothapalli [6, 7] construct an auxiliary graph so that all the articulation points in the original graph become the end-points of bridges in the auxiliary graph. Based on the same principle and lemmas, we also obtain an auxiliary graph, albeit in a more efficient and simpler way. We split G into its 2-edge-connected components. (A 2-edge-connected component of a graph G is a maximal subset of edges such that every pair of vertices in the component has at least two edge-disjoint paths between them.) Let u be the LCA of a non-tree edge pq. Let x and y be the base vertices in the cycle induced by pq. The base vertices of a cycle C induced by a non-tree edge e with LCA u are the neighbors of u in the cycle C. We then introduce an alias vertex u′, add the edge uu′, and replace the edges ux and uy with u′x and u′y respectively. Only a single alias vertex is introduced for cycles sharing a common base vertex. Finding the cycles that share a common base vertex is a non-trivial problem. From an algorithmic viewpoint, it is essential that an efficient technique be designed for this purpose. We map this problem to the problem of finding the connected components of a graph. We construct a new graph H as follows. Every vertex that is a base vertex in G is a vertex in H. Two vertices in H share an edge if these two vertices are the base vertices of some cycle in G. Figure 2.1 demonstrates an example of the construction of the auxiliary graph.

Figure 2.1 H's vertex set is the base vertices of G, with edges induced by the non-tree edges of G. H′ is generated after applying the connected components algorithm to H and contracting the trees. The unique ID of every connected component in H′ serves as the ID of the alias vertex in G′.

This leads to the second and third lemmas regarding the newly introduced alias vertices.

Lemma 2: Let G be a 2-edge-connected graph, T a rooted BFS tree of G with root r, and G′ the auxiliary graph constructed from G. For a vertex u in G′ with u ≠ r, u is an articulation point of G iff u is an end point of some bridge uv in G′ with u ∈ G and v ∉ G.

Proof: (only-if) ⇐: Consider a vertex u which is not an articulation point in graph G, with u ≠ r. We will show that any edge of the type uu′, where u′ is an alias of u, cannot be a bridge in the auxiliary graph G′.

Let C := {C_1, C_2, ..., C_k} be the cycles that pass through vertex u in G. The relation between vertex u and these cycles can be categorized as follows.

• u is not the LCA of the endpoints of any non-tree edge that induces a cycle in C: In this case, no alias vertices are introduced in G′ due to u. Therefore, no bridge with u as one of its end points exists in G′. (Note that G is already 2-edge-connected and has no bridges.)

• u is the LCA of two non-tree edges that induce cycles C_i and C_j in C: According to the construction of G′, two alias vertices u_i and u_j are introduced in the auxiliary graph G′. Further, two edges uu_i and uu_j are also added to G′. An example is illustrated in Figure 2.2.

Figure 2.2 Figure shows the cycle created by paths P_{xy}, P_{yu_j}, P_{u_j u_i}, and P_{u_i x}. For ease of exposition, the auxiliary graph shown contains only the changes made with respect to u and not the changes induced with respect to other vertices.

Let x and y be any two distinct vertices on the cycles C_i and C_j respectively. Since u is not an articulation point in G, there must be some path P_{xy} in G between x and y that does not pass through u, as shown in Figure 2.2. The path P_{xy}, along with the paths P_{yu_j}, P_{u_j u_i}, and P_{u_i x}, forms a simple cycle in G′. This indicates that the edges uu_i and uu_j on this cycle cannot be bridges in G′. So there is no bridge in G′ with u as one of its end points pertaining to the cycles C_i and C_j. The above property holds for any two cycles C_i and C_j.

• u is the LCA of some non-tree edge that induces a cycle C_i in C: Consider the case where the number of cycles through u is at least 2. By our assumption, u is not an articulation point. Let u_1 be the alias vertex of u. Hence, for some vertex x in C_i that is not equal to u, and another vertex, say the parent P(u) of u, there is a path that does not go through u. This path, along with the edges uu_1 and uP(u), means that the edge uu_1 is part of a cycle. Therefore, in G′, the edge uu_1 will not be a bridge.

(if) ⇒: Let u be an articulation point of a 2-edge-connected graph G and let G′ be the corresponding auxiliary graph. It holds that u has at least two cycles passing through it that do not share any base vertices. Let b_i and b_j be base vertices on two such fundamental cycles. Let x and y be two vertices, both distinct from u, on two such cycles C_i and C_j that get disconnected by the removal of u. Since u is an articulation point in G, every path between x and y passes through u, say x − b_i, b_i − u, u − b_j, and b_j − y. Let u_1 and u_2 be the alias vertices created for C_i and C_j. The corresponding path P_{xy}(G′) has the form x − b_i, b_i − u_1, u_1 − u, u − u_2, u_2 − b_j, b_j − y. Since u is an articulation point, there cannot be a path P_{xy} between x and y in G (and in its corresponding auxiliary graph G′) that does not pass through u and its alias vertices. As a result, uu_1 remains uncovered by any cycle of G′. Hence, uu_1 will be a bridge in G′.

Lemma 3: Let G be a 2-edge-connected graph, T a rooted BFS tree of G with root r, and G′ the auxiliary graph constructed from G. Vertex r is an articulation point in G iff r is the LCA of more than one non-tree edge of G′ according to a BFS in G′ from r, and r is also an end point of some bridge in G′.

Proof: We use P_{uv}(G) to denote a path between vertices u and v in the graph G.

(only-if) ⇐: Notice that since G is 2-edge-connected, vertex r is on at least one cycle. Further, since r is the root of the BFS tree of G, for every fundamental cycle that contains r, vertex r is the LCA of the non-tree edge that induces the cycle. We now make a case distinction as follows. If r has exactly one cycle that passes through it, then r is not an articulation point of G. Now consider the case that more than one cycle passes through r. Let C_i and C_j be any two cycles through r induced by non-tree edges e_i and e_j. In G′, we now introduce two alias vertices r_i and r_j and also the edges rr_i and rr_j, along with edges between r_i and r_j and the base vertices of C_i and C_j. If r is not an articulation point, then for any two distinct vertices x and y in C_i and C_j respectively, there is a path between x and y that does not go through r. This path between x and y, P_{xy}, along with the path P_{x r_i}, the edges r_i r and rr_j, and the path P_{r_j y}, creates a cycle that contains the edges rr_i and rr_j. Therefore, the edges rr_i and rr_j cannot be bridges in G′.

(if) ⇒: The same argument as in the proof of the (if) part of Lemma 2 holds when u is an articulation point and is the root of the spanning tree.

Our GPU algorithm for biconnected components is built upon the above three lemmas. We differ from Chaitanya and Kothapalli [6, 7] in our construction of the auxiliary graph. We show in the next section that our auxiliary graph is more efficient, as it saves us from performing additional graph traversals later.

2.3.1 Algorithm

Our GPU algorithm for biconnected components (BiCC) can be stated in four steps, as listed in Algorithm 1.

Algorithm 1: GPU Algorithm for BiCC
Input: Graph G
Output: BCC ID for each edge
  1. Generate a BFS tree T from G
  2. Obtain the V_{lca}(G) set and the bridges from T
  3. Reconstruct G into G′ around V_{lca}(G)
  4. Mark articulation points and BCCs in G′

Step 1: BFS

In this step, we obtain a BFS tree T of the input graph G using the GPU BFS algorithm of Merrill et al. [28].

Step 2: LCA and Bridges

In this step, we traverse up from the end-points of every non-tree edge in parallel till the lowest common ancestor is found. These LCAs form the V_{lca}(G) set. We mark each edge encountered while traversing. Since we have traversed through every cycle of G, the unmarked edges are the bridges of G.
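A minimal CUDA sketch of this step follows (our own simplification, not the thesis code). It assumes the BFS of Step 1 produced parent[] and level[] arrays, that edgeOf[x] gives the index of the tree edge (x, parent(x)), and that the non-tree edges are stored in nonTreeU/nonTreeV; all of these names are illustrative.

// One thread per non-tree edge (u, v): climb from the deeper endpoint until
// the two walks meet at the LCA, marking every tree edge visited on the way.
__global__ void lcaClimb(const int *nonTreeU, const int *nonTreeV,
                         const int *parent, const int *level,
                         const int *edgeOf, int *edgeMark, int *lca,
                         int numNonTree) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;
    int u = nonTreeU[t], v = nonTreeV[t];
    while (u != v) {
        // Always advance the deeper endpoint; on a tie, the endpoints alternate.
        if (level[u] >= level[v]) { edgeMark[edgeOf[u]] = 1; u = parent[u]; }
        else                      { edgeMark[edgeOf[v]] = 1; v = parent[v]; }
    }
    lca[t] = u;   // u == v is the LCA of this non-tree edge
}

Tree edges whose edgeMark entry is still 0 after this kernel are not covered by any fundamental cycle and are therefore reported as bridges.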

Step 3: Constructing the Auxiliary Graph

The auxiliary graph is constructed as described in the previous section. We first identify the pair of base vertices for every non-tree edge. All the identified base vertices form a new subgraph H. We then find the connected components of H and contract the trees to generate H′. Each root node of H′ is used as an alias vertex in constructing the auxiliary graph G′. Figure 2.3 shows the construction of the auxiliary graph G′ in a step-by-step manner. Our construction of the auxiliary graph is different from that of Chaitanya and Kothapalli [6, 7]. Chaitanya and Kothapalli [6, 7] do not use connected components to simplify the construction of alias vertices. As a result, it is possible that new cycles are formed in their construction of the auxiliary graph. If new cycles are formed, then the V_{lca} set also has to be recalculated. In our construction, since we apply the connected components algorithm on the base vertices, we can guarantee that the added edges will never form a cycle. As a result, our auxiliary graph saves us an LCA traversal of the graph. In addition, Chaitanya and Kothapalli [6, 7] use another BFS to identify the BCCs once the articulation points are identified. However, we show in the next steps that, by storing additional information in Steps 2 and 3, this BFS is not necessary. Thus our construction of the auxiliary graph improves upon Chaitanya and Kothapalli's [6, 7] approach by saving a BFS and an LCA traversal of the edges.
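The mapping onto a connected components computation can be sketched as follows (illustrative names, not the thesis code; in practice the base-vertex pairs baseU/baseV are already available from the climbs of Step 2). Each non-tree edge contributes one edge of H between its two base vertices, so cycles that share a base vertex fall into the same connected component of H and therefore receive a single alias vertex.

// Build the edge list of H: one edge per non-tree edge, joining the two base
// vertices of the cycle it induces. Running a GPU connected components
// algorithm on (hSrc, hDst) then yields one component ID per group of cycles
// that share base vertices; that ID names the alias vertex in G'.
__global__ void buildH(const int *baseU, const int *baseV,
                       int *hSrc, int *hDst, int numNonTree) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;
    hSrc[t] = baseU[t];
    hDst[t] = baseV[t];
}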

Step 4: Marking Articulation Points and BCCs

Most of the work for this step can be done simultaneously with Steps 2 and 3. While traversing in Step 2, we record whether the traversal finishes at a vertex or goes further up. If a traversal finishes at a vertex, then it would finish there even in the auxiliary graph: the alias vertices in our construction do not change the order or the number of vertices visited by a traversal.

Now, as Lemma 2 states, an LCA vertex is an articulation point if any of the edges incident on it and its alias vertex is a bridge. Let au be an edge between an LCA vertex a and its base vertex u, and let a′ be the alias vertex introduced in the auxiliary graph. If au was part of a tree traversal that did not finish at a, then the path u − a′ − a would also be part of that traversal, and hence aa′ cannot be a bridge. Thus all added edges can be checked for being bridges by comparing against the information stored in Step 2. Figure 2.3 below illustrates this with an example.

Figure 2.3 Here a is an articulation vertex because aa″ is a bridge. Vertex u had an upward traversal in Step 2 and hence aa′ cannot be a bridge.

Furthermore, in Step 2, each thread marks the LCA vertex found for every non-tree edge. This value is accessible to every edge on the path to the LCA. To generate a unique BCC ID for each edge, each edge checks its LCA vertex. If the LCA vertex is an articulation point, then we have the ID; otherwise, we check the LCA of the edge (LCA vertex, parent(LCA vertex)), and so on. Since we are traversing LCAs of LCAs, the traversal is not long. The articulation vertex ID serves as the ID of the biconnected component. Chaitanya and Kothapalli [6, 7] do another LCA traversal in Step 4 to obtain the articulation points. As shown above, this traversal is unnecessary if some additional information is stored in Step 2. They also need to perform another BFS to obtain the biconnected components. Through simple bookkeeping and better techniques used in the construction of the auxiliary graph, we avoid one BFS and one LCA traversal compared to the approach of [6, 7]. The implementation details are discussed in the next section.
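This per-edge resolution can be sketched as the following CUDA kernel (illustrative names, not the thesis code): lcaOfEdge[e] is the LCA recorded for edge e in Step 2, treeEdgeOf[v] is the index of the tree edge (v, parent(v)), isArtPoint[] marks the articulation points identified above, and the BFS root acts as a terminator.

// One thread per edge: hop along LCAs of LCAs until an articulation vertex
// (or the root) is reached; as noted above, these chains are short in practice.
__global__ void assignBccIds(const int *lcaOfEdge, const int *treeEdgeOf,
                             const int *isArtPoint, int *bccId,
                             int numEdges, int root) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int a = lcaOfEdge[e];
    while (a != root && !isArtPoint[a])
        a = lcaOfEdge[treeEdgeOf[a]];
    bccId[e] = a;    // this vertex ID names the edge's biconnected component
}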

2.3.2 Complexity Analysis

BFS, i.e., Step 1, takes O(m + n) time sequentially. Step 2 involves a tree traversal for each non-tree edge. Each tree traversal cannot exceed the diameter d of the graph since we are using a BFS tree. So for m − n non-tree edges, Step 2 takes O(d · (m − n)) time. For Step 3, we run a connected components algorithm on the graph defined by the base vertices in G. In the worst case, the time taken by Step 3 is O(n + m). Step 4 involves each thread checking each alias vertex for bridges, which takes O(n) time. Thus, sequentially, our algorithm takes O(d · (m − n)) time. However, few real-world graphs have a large diameter, and even when they do, very few LCA traversals actually consume O(d) time. This is observed and mentioned by Chaitanya and Kothapalli [6, 7].
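For reference, the per-step costs above can be summarized as follows (a sketch of the accounting; the Step 2 term is the dominant one for the graphs considered in this thesis):

```latex
T_{\text{seq}}(n, m) \;=\;
    \underbrace{O(m + n)}_{\text{Step 1: BFS}}
  + \underbrace{O\bigl(d \cdot (m - n)\bigr)}_{\text{Step 2: LCA traversals}}
  + \underbrace{O(m + n)}_{\text{Step 3: connected components}}
  + \underbrace{O(n)}_{\text{Step 4: bridge checks}}
```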

2.4 Implementation

On the GPU, we schedule 1024 threads per block. The number of blocks then becomes ⌈(m − n)/1024⌉. This configuration was found to be the best over several trials. The rest of this section discusses specific GPU implementation details for each step of our algorithm.
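As an illustration, the launch configuration for the traversal step can be set up as follows; lcaKernel and the variable names are placeholders rather than the actual identifiers in our code.

```cuda
// Sketch of the launch configuration (placeholder names).
int numNonTreeEdges = m - n;                 // one thread per non-tree edge, as in Step 2
int threadsPerBlock = 1024;
int numBlocks = (numNonTreeEdges + threadsPerBlock - 1) / threadsPerBlock;  // ceiling of (m - n)/1024
lcaKernel<<<numBlocks, threadsPerBlock>>>(/* device arrays for the tree and the non-tree edges */);
```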

Step 1: BFS For the sake of modularity and ease, we adapt the BFS program from Merrill et al. [28] for our work. The BFS of Merrill et al. [28] works efficiently by employing block-based and warp-based exploration along with fine-grained exploration. SMX-wide gathering is used for adjacency lists larger than the warp width, and scan-based gathering collects the loose ends. They implement out-of-core vertex and edge frontiers, use local prefix sums instead of local atomic operations, and use a best-effort bit-mask for filtering. As shown in their paper, their implementation is one of the fastest general implementations of BFS. Some other BFS implementations are suited to particular instances, but we found that Merrill et al. [28] works best for our general purposes. Our implementation, however, can benefit from any future improvements in GPU-based BFS.

Step 2: LCA and Bridges. Finding the LCAs of the non-tree edges consists of independent tasks and is hence easy to parallelize. Each thread picks a non-tree edge and traverses upwards till the LCA is found. Although the memory accesses are not coalesced, each thread on its own does only trivial work. Each thread maintains three values while traversing. First, each thread marks every encountered edge; later, bridges can be identified by gathering the unmarked edges. Second, each thread marks whether it is ending at a vertex or going beyond it. This differentiation helps in identifying bridges in the auxiliary graph, as explained in Step 4 of the previous section. These two values can be marked and updated in the same array. Third, each thread also stores its own ID in every edge it has discovered. In a separate array, every thread stores the LCA vertex it found at its ID location. Thus every edge can look up its LCA vertex in a two-step read. Since multiple threads may traverse a tree edge, overwrites may occur. As long as a tree edge points to some LCA, this does not matter. This LCA vertex later helps in marking BCCs. Figure 2.4 illustrates the two-step read for an example edge.
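A simplified CUDA sketch of this traversal is shown below. It uses the BFS levels to climb from the deeper endpoint first; the array names (nonTreeU, nonTreeV, parent, level, edgeOwner, lcaOfThread) are assumed for illustration, and the finished/going-beyond flag is omitted for brevity.

```cuda
// Sketch: one thread per non-tree edge walks up the BFS tree until the LCA is found.
__global__ void findLcaKernel(const int *nonTreeU, const int *nonTreeV,  // endpoints of non-tree edges
                              const int *parent, const int *level,       // BFS-tree parent and level arrays
                              int *edgeOwner,      // tree edge (v, parent(v)), indexed by v -> marking thread
                              int *lcaOfThread,    // per-thread LCA result
                              int numNonTree)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= numNonTree) return;

    int u = nonTreeU[t], v = nonTreeV[t];
    while (u != v) {
        // Always climb from the deeper of the two current vertices.
        if (level[u] < level[v]) { int tmp = u; u = v; v = tmp; }
        edgeOwner[u] = t;        // mark the tree edge (u, parent(u)) with this thread's ID
        u = parent[u];
    }
    lcaOfThread[t] = u;          // u == v is the LCA of the non-tree edge
}
```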

Step 3: Auxiliary Graph Construction As mentioned earlier, a connected components algorithm is required for identifying shared base vertices. We adapted the connected components work of Soman et al. [43] for our needs. We found that their generalized GPU implementation had better timings than any other implementation. Although the code needed to be partially rewritten, the underlying algorithm remained the same as explained in their paper.

Figure 2.4 Thread t1 marks all edges it traversed with its ID. Then, in Array2, it stores the LCA vertex found. Thus every edge of the graph knows its LCA vertex, by first looking up the thread which discovered it and then the corresponding value stored at that thread ID.

Once we know how the alias vertices are to be constructed, the actual construction is as follows. Each thread picks a non-tree edge and adds the corresponding edges for its alias vertex. Here also, the work done by each individual thread is simple.

Step 4: Articulation Points and BCCs This step is mostly implemented within Step 2, as the threads keep a record of whether a traversal stops at a vertex or not. As mentioned in Step 4 of the previous section, this information is enough to find the new bridges in the auxiliary graph. A simple kernel call with a lookup into the stored global memory suffices, and the articulation points get marked. As for the BCCs, each edge has already obtained its own unique articulation point through Step 2. Any query regarding BCCs can be answered in nearly linear time.
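One possible kernel for this step is sketched below; it only reads the flags written in Step 2. The names (aliasEdgeLca, aliasEdgeBase, passedBeyond, isArticulation) are assumptions for illustration and do not denote the actual identifiers of our implementation.

```cuda
// Sketch: an alias edge (a, a') is a bridge if no traversal in Step 2 continued
// past its base vertex u towards a; the LCA endpoint a is then an articulation point.
__global__ void markArticulationPoints(const int *aliasEdgeLca,   // alias edge -> LCA vertex a
                                       const int *aliasEdgeBase,  // alias edge -> base vertex u
                                       const char *passedBeyond,  // per-vertex flag written in Step 2
                                       int *isArticulation,
                                       int numAliasEdges)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numAliasEdges) return;
    if (!passedBeyond[aliasEdgeBase[i]])
        isArticulation[aliasEdgeLca[i]] = 1;
}
```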

2.5 Experiments And Analysis

2.5.1 Setup

We run our implementation on an Nvidia Tesla K40c GPU. The K40c provides 12 GB of GDDR5 ECC RAM with a maximum memory bandwidth of 288 GB/sec. Each core runs at a clock speed of 745 MHz. The K40c GPU supports a peak double-precision floating point performance of 1.43 TFlops and a single-precision floating point performance of 4.29 TFlops. Each SMX has a 64 KB cache that is shared by the 192 cores of that SMX. An L2 cache of 1.5 MB is available across the SMXs. More details regarding the Tesla K40c GPU can be found in [31]. This GPU is attached to an Intel i7-4790K CPU with 32 GB RAM. We use CUDA 7.5 [13] for our implementation. We run CPU-based algorithms on an Intel Xeon E5-2650 CPU. This CPU is equipped with 128 GB RAM and a maximum memory bandwidth of 68 GB/s. The E5-2650 CPU features dual processors where each processor has 10 cores and each core can process two threads using hyper-threading. Each core operates at 2.34 GHz, which can be boosted to 3 GHz. The CPU offers 64 KB L1 cache, 256 KB L2 cache, and a shared 25 MB L3 cache. These implementations were programmed using OpenMP [34]. We have compared our approach with that of Slota and Madduri [42], named BFS-BiCC in the plots, and also that of Chaitanya and Kothapalli [6, 7], named LCA-BiCC in the plots. These implementations were each run on 40 CPU threads. The graphs used in our experiments were taken from the Stanford Large Network Dataset Collection [27] and the University of Florida Sparse Matrix Collection [14]. The graphs are mostly sparse in nature. Graphs were assumed to be undirected and to have a single connected component; if needed, edges were added explicitly in a preprocessing step to ensure connectivity. All experiments were repeated several times and the average of the observations was used in plotting. Table 2.1 lists all the considered graphs.

Table 2.1 Sparse Graphs for GPU-BiCC

Graph            Nodes       Edges       Diameter
webGoogle        875,713     5,105,039   21
webbase          1,000,005   3,105,536   29
amazon           262,111     1,234,877   32
webStanford      281,903     2,312,497   674
webBerkStan      685,230     7,600,595   514
roadNet-pa       1,088,092   1,541,898   786
roadNet-ca       1,965,206   2,766,607   849
netherlands-osm  2,216,688   4,882,476   2554
greatBritain     7,733,822   16,313,034  9340
asia-osm         11,950,757  25,423,206  48126

2.5.2 Results

Algorithm BFS-BiCC [42] performs a BFS from every point in a BFS tree and checks whether a child node can reach a node higher than its parent when the parent node is removed. If such a node is reachable, then the parent node cannot be an articulation point with respect to that child. However, Chaitanya and Kothapalli [6, 7] prove that all articulation points must belong to the set Vlca(G). Hence Algorithm BFS-BiCC can be modified to test only the LCA vertices. We implement this approach as Algorithm LCA-BFS-BiCC and test it against our implementation. Figure 2.5 shows the runtime of the three algorithms LCA-BiCC, BFS-BiCC, and LCA-BFS-BiCC against our GPU algorithm, named GPU-BiCC in the plots.

Figure 2.5 The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of GPU-BiCC over the next fastest algorithm.

Algorithm GPU-BiCC performs on average 4.03x faster than the next fastest implementation. In certain cases, it outperforms Algorithm BFS-BiCC by nearly 700x and Algorithm LCA-BiCC by 9x. Algorithm LCA-BFS-BiCC appears to be faster than Algorithm BFS-BiCC only in certain cases; in most other cases, Algorithm LCA-BFS-BiCC performs similarly to, if not worse than, Algorithm BFS-BiCC. This speedup in certain cases can be attributed to two factors.

First, in large sparse graphs, the number of non-tree edges is small and hence finding the LCAs takes less time. However, as graphs become dense, the LCA step becomes time-intensive. The advantage of working on a smaller subset of vertices is offset by the cost of actually finding that smaller subset of LCAs.

Second, it can be observed that even for large sparse graphs, LCA-BFS-BiCC, although better, gives higher speedups only in certain cases. This speedup is influenced by the diameter of the graph.

Algorithm BFS-BiCC slows down quite considerably as the diameter increases. This can be verified by observing the diameter of a graph from Table 2.1 and its respective runtime from Figure 2.5. As mentioned, Algorithm BFS-BiCC performs a BFS from each point to check whether a higher-up point is reachable. In large-diameter graphs, this results in a single thread traversing a long chain of singly-linked nodes, causing work imbalance.

Algorithm LCA-BFS-BiCC eliminates most of these vertices since only a small subset of nodes is considered. Experimentally, we found that the LCA subset of vertices is approximately 10% of the nodes. Algorithms GPU-BiCC and LCA-BiCC have no such computation and are unaffected by the diameter of the graph. Thus, despite the modifications to BFS-BiCC, GPU-BiCC performs better.

Algorithm GPU-BiCC was also tested from various starting points. The BFS was run from the maximum-degree vertex as well as from several random vertices. However, no substantial change in timings was observed; GPU-BiCC performs consistently irrespective of the starting point.

2.6 Extension to Dense Graphs

Algorithm GPU-BiCC performs one tree traversal from each non-tree edge of a BFS tree. In real-world graphs, which are usually sparse, the bottleneck step is the BFS. The number of non-tree edges is usually small and is in O(n). However, as the graph gets denser, the number of non-tree edges can get large. Since O(m) GPU threads are launched for performing the tree traversals, this step slows down and becomes the most time-consuming step. Since Algorithm BFS-BiCC performs BFSs from every vertex, as long as the average degree is moderate, Algorithm BFS-BiCC remains relatively unaffected by dense graphs. Dense graphs also generally have a low diameter and thus Algorithm BFS-BiCC does not suffer from the large-diameter drawback mentioned in the above section. As a result, the speedup observed by Algorithm GPU-BiCC over Algorithm BFS-BiCC is relatively smaller for dense graphs. Cong and Bader [10] mention an edge-pruning technique while presenting their algorithm for finding the biconnected components of a graph. The key idea of the technique from [10] is that most of the edges are non-essential for finding BCCs. Their pruning technique is as follows. Consider a graph G and its BFS tree T. Let G1 be G \ T and F be a spanning forest of G1. Cong and Bader [10] then prove that the non-tree edges in G which are not in F are non-essential for biconnectivity. Notice that a spanning forest on n nodes has at most n − 1 edges. Thus, if the above technique is applied, then even for dense graphs, the number of LCA traversals to be performed drops from O(m) to O(n). We apply this pruning technique, again using the modified connected components approach of Soman et al. [43] to generate a spanning forest, and then test it on dense graphs. We used random graph generators for generating dense graphs. The GTgraph suite [1] provides three random graph generators based on the Erdős–Rényi [3] and RMAT models. The graphs used are listed in Table 2.2. The results of the pruning technique can be seen in Figure 2.6. The new algorithm is named Cert-GPU-BiCC. On average, Cert-GPU-BiCC achieves a 2x speedup over GPU-BiCC for dense graphs. Cert-GPU-BiCC prunes roughly m − n edges. For sparse graphs, m − n is not a big number and hence a significant speedup is not observed; the cost of pruning the edges slows down Cert-GPU-BiCC for sparse graphs. Experimentally, we found that the overhead of pruning the edges pays off when m roughly crosses 10 times n.
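A host-side sketch of this pruning pipeline is given below. The type EdgeList and the helper functions (bfsTree, spanningForest, edgesNotIn, unionOf, runGpuBicc) are assumed names standing in for the corresponding steps of our implementation, not actual identifiers from it.

```cuda
// Host-side sketch of the Cong-Bader edge pruning [10] (assumed helper names).
#include <utility>
#include <vector>

using EdgeList = std::vector<std::pair<int, int>>;

// Assumed helpers: BFS tree, spanning forest via connected components [43],
// edge-list difference/union, and the main biconnectivity routine.
EdgeList bfsTree(const EdgeList &G, int n);
EdgeList spanningForest(const EdgeList &G, int n);
EdgeList edgesNotIn(const EdgeList &G, const EdgeList &S);
EdgeList unionOf(const EdgeList &A, const EdgeList &B);
void runGpuBicc(const EdgeList &H, int n);

void certGpuBicc(const EdgeList &G, int n)
{
    EdgeList T  = bfsTree(G, n);          // BFS spanning tree of G
    EdgeList G1 = edgesNotIn(G, T);       // the non-tree edges of G
    EdgeList F  = spanningForest(G1, n);  // spanning forest of G \ T
    // Non-tree edges outside F are non-essential for biconnectivity, so the
    // LCA traversals are run on T together with F (at most 2n - 2 edges).
    runGpuBicc(unionOf(T, F), n);
}
```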

Table 2.2 Dense Graphs for GPU-BiCC

Graph         Nodes      Edges        Diameter
liveJournal   4,847,571  68,993,773   9
com-Orkut     3,072,441  117,185,083  16
RMAT1M 50M    1,000,000  50,000,000   5
RMAT1M 70M    1,000,000  70,000,000   8
R500K 50M     500,000    50,000,000   5
R1M 25M       1,000,000  25,000,000   13
D1M 50M       1,000,000  50,000,000   5008
D2M 75M       2,000,000  75,000,000   11053

Figure 2.6 The primary Y-axis represents the timings in milliseconds. On the secondary Y-axis, Speedup-1 represents the speedup of Cert-GPU-BiCC over BFS-BiCC while Speedup-2 shows the speedup of GPU-BiCC over BFS-BiCC.

Chapter 3

GPU-TriCC: GPU Algorithm for 3-connectivity

3.1 Overview

Recall that a graph is said to be triconnected if every pair of nodes v, w ∈ V (G) has at least three vertex-disjoint paths between them. The maximal 3-connected subgraphs of G are called the triconnected components of G. A separating pair in a graph G is a pair of nodes v, w such that removing v and w from G disconnects G. Triconnectivity has applications in networks [18] and in determining isomorphism in planar graphs [22]. Hopcroft and Tarjan [23] published the first sequential algorithm for finding the triconnected components of a graph. The algorithm from [23] is based on a depth-first traversal (DFS) of the graph. Given that DFS is a P-complete problem [24], this approach would not be parallelizable in the PRAM sense. Over the years, a few PRAM-style parallel algorithms have been presented for finding the triconnected components of a graph. Ramachandran and Vishkin [38] present a PRAM algorithm for testing triconnectivity that runs in O(log n) time using O(n + m) work. Miller and Ramachandran [29] extend the work from [38] to also obtain the triconnected components of a graph using O(log² n) time and O(n + m) work on a CRCW PRAM. To date, the algorithm of Miller and Ramachandran is the fastest PRAM algorithm for identifying the triconnected components of a graph in parallel. Here, we implement the algorithm of Miller and Ramachandran [29] on a GPU and extend it by applying the certificate-based approach given by Cheriyan and Thurimella [8]. In the next chapter, we use our GPU implementation as the baseline against better algorithms. We start by briefly describing the algorithm of Miller and Ramachandran [29]. This is followed by our improvements and implementation on the GPU.

3.2 The Algorithm of Miller and Ramachandran for Graph Triconnectivity

The algorithm of Miller and Ramachandran [29] is based on an open ear decomposition of a graph. An open ear decomposition of a graph G(V, E) is a partition of E into ordered edge-disjoint paths P0, P1, P2, . . . such that P0 is a simple cycle and every other path Pi, i ≥ 1, has its endpoints on previous paths (ears) and no internal vertex of Pi lies on any Pj, j < i. Since a vertex cannot be internal to two ears, an open ear decomposition provides scope for traversing the graph in parallel. In addition, Miller and Ramachandran [29] prove that the two vertices of any separating pair in a biconnected graph are non-adjacent vertices of some ear Pi. The algorithm of Miller and Ramachandran [29] is based on this property. As a result, Miller and Ramachandran start with an open ear decomposition of a biconnected graph. The algorithm then generates the bridges for every non-trivial ear in parallel. (An ear is non-trivial if it has at least three vertices.) For a given subgraph S, the bridges with respect to S form a partition of V (G \ S) such that two vertices are in the same class if and only if there is a path connecting them that does not use any vertex of S. Each such bridge is then compressed into a single vertex. This single vertex is connected to the original ear through the same attachments as the corresponding bridge. This is done for every bridge of every non-trivial ear in parallel. The bridge graph is simplified into an ear graph by merging bridges which share the same attachments on an ear. Figure 3.1(b)–(d) show the above steps on the graph from Figure 3.1(a).

Figure 3.1 Figure showing the stages in the algorithm of Miller and Ramachandran [29].

The ear graph for every ear is further simplified by merging interlacing bridges into a non-overlapping graph, called a star graph. All separating pairs can be easily discovered through a star graph. Figure 3.1(e) shows the formation of a star graph from the corresponding ear graph and the subsequent separating pairs with respect to a single ear. This process is done across the graph for all ears in parallel. Once the separating pairs are identified, the triconnected components are generated by splitting the graph into Tutte splits [48] for every separating pair. The entire algorithm runs in O(log² n) time using O(n + m) work in the CRCW PRAM model. We refer the reader to [29] for further details.

3.3 Triconnectivity on GPU

To the best of our knowledge, there is no known GPU based algorithm for the graph triconnectivity problem. In this section, we provide a GPU based implementation for the algorithm of Miller and Ramachandran [29]. A brief summary of our implementation is given below. Henceforth, we refer to our GPU implementation for triconnectivity as Algorithm GPU-TriCC listed as Algorithm 2.

Algorithm 2: Algorithm for GPU-TriCC
Input: Biconnected graph G
Output: TriCC(G)
1 Find an open ear decomposition of G
2 for every nontrivial ear Pi do
3     Construct the bridge graph from the bridges
4     Obtain the ear graph Gi(Pi) from the bridge graph
5     Coalesce the interlaced ear graph into the star graph Gi*(Pi)
6     Identify the separating pairs from Gi*(Pi)
7 Use Tutte splits to obtain the triconnected components

We employ Ramachandran's [37] popular ear decomposition algorithm for generating the open ear decomposition. Our ear decomposition requires two graph traversals and a sorting of the edge list. Obtaining the bridges and the subsequent bridge graph requires a connected components algorithm; we use the GPU implementation of Soman et al. [43] for the same. The ear graphs are generated through a divide-and-conquer approach as mentioned in [29]. Assuming r ears, the first step in the divide-and-conquer approach generates the ear graph for the first and the last r/2 ears. Connected components of the ith stage are utilized at the (i + 1)th stage as we narrow down to generating the ear graph for every individual ear. Every ear graph is then coalesced in parallel to generate the star graph. Coalescing involves resolving all overlapping attachments in the ear graph. Separating pairs can be easily identified once the star graph is generated, as shown in Figure 3.1(e). The graph is then split into upper-split and lower-split graphs for every separating pair (a, b) on the ear Pi. The upper-split and lower-split graphs are basically a division of the vertices belonging to ears Pj, j < i and ears Pk, k > i. Miller and Ramachandran [29] prove that each of the splits is biconnected and that every other separating pair lies either in the upper-split graph or in the lower-split graph, but not in both. Hence this procedure is applied recursively till no separating pair is present in either of the split graphs generated. Thus the triconnected components of G are identified. Notice from the algorithm of [29] that the bulk of the work done can be associated with each ear subsequent to obtaining an open ear decomposition. As every graph G has m − n + 1 ears, the number of edges in G heavily impacts the performance. A method to filter the edges beforehand can give an increase in performance.

In a remarkable result, Cheriyan and Thurimella [8] showed that the k-connectivity of an undirected graph can be tested by using a kn-sized subgraph of the graph instead of the entire graph. Formally, let Ti for i ≥ 1 be the BFS spanning forest of G \ (T1 ∪ · · · ∪ Ti−1). Cheriyan and Thurimella show that the graph T1 ∪ · · · ∪ Tk is k-connected if and only if G is k-connected. One often says that the graph H := T1 ∪ · · · ∪ Tk is a certificate for the k-connectivity of G. Similar results are also shown by Khuller and Schieber [26]. The technique of Cheriyan and Thurimella [8] does improve the practical performance of parallel algorithms for testing the k-connectivity of a given undirected graph. Evidence for this can be seen from the work of Bader and Cong [10], Chaitanya and Kothapalli [6], and in our algorithm GPU-BiCC (Section 2.3.1) for finding the biconnected components of a graph on symmetric multiprocessors, multi-core CPUs, and GPUs respectively. Much of this improvement can be attributed to the smaller size of the certificate in terms of the number of edges in the input graph. Since the performance of GPU-TriCC heavily depends on the number of edges, a reduction in the size of the graph through the use of certificates provides scope for better performance. To this end, we make use of the idea of Cheriyan and Thurimella [8]. Accordingly, a certificate for the triconnectivity of G is obtained as the union of T, F1 = BFSSpanningForest(G \ T), and F2 = BFSSpanningForest(G \ (T ∪ F1)), where T is the BFS tree of G. The graph H := T ∪ F1 ∪ F2 is then provided as the input to Algorithm GPU-TriCC. This modification is named Algorithm Cert-GPU-TriCC and is listed as Algorithm 3.

Algorithm 3: Algorithm for Cert-GPU-TriCC
Input: Biconnected graph G
Output: TriCC(G)
1 T := BFS(G)
2 F1 := BFSSpanningForest(G \ T)
3 F2 := BFSSpanningForest(G \ (T ∪ F1))
4 H := T ∪ F1 ∪ F2
5 Run GPU-TriCC on H

3.3.1 Dataset

The graphs we use are taken from real-world datasets [14] and from random graphs following the Erdős–Rényi model [3] generated using the GTGraph generator [1]. All the graphs we consider are undirected and unweighted. Directed graphs are made undirected by removing the direction on the edges. Graphs that are not connected are augmented with additional edges to make them connected. Key properties of the graphs are shown in Table 3.1.

3.3.2 Results

We study the performance of Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The random graphs are generated to have a particular number of TCCs, as shown in Table 3.1.

Table 3.1 Graphs used in our experiments for triconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Graph          Description                                    Nodes  Edges
Real-World Graphs
nd24k          3D Mesh, ND set                                72K    14.3M
kron18         Kronecker, DIMACS10                            262K   10.5M
rm07r          3D viscous case                                381K   37.4M
coPaperDBLP    coauthor citation network                      540K   15.2M
bone010        3D trabecular bone                             986K   36.3M
dielFilterV3   High-order vector finite element method in EM  1.1M   45.2M
Random Graphs
rand-Tricc1    1 TCC                                          500K   30M
rand-Tricc2    10000 TCCs                                     500K   30M

The GPU used for these experiments is an Nvidia K40c GPU (cf. Section 2.5.1). From Figure 3.2, we can observe that Algorithm Cert-GPU-TriCC is on average 5x faster than Algorithm GPU-TriCC. In Figure 3.2 we also show the percentage of time spent by Algorithm Cert-GPU-TriCC in obtaining the certificate H using three BFS traversals of G. As can be observed from Figure 3.2, Algorithm Cert-GPU-TriCC, despite being 5x faster than Algorithm GPU-TriCC, spends nearly 63% of its total time in obtaining H. The high cost of procuring the certificate serves as the motivation to look for methods which can mitigate this cost.

Figure 3.2 Figure showing the time taken by Algorithms GPU-TriCC and Cert-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The number on each bar indicates the percentage of time spent by Algorithm Cert-GPU-TriCC in BFS operations.

Chapter 4

Expediting Parallel Graph Connectivity Algorithms

4.1 Motivation

In Section 2.3.1, we presented an improved version of the algorithm of Chaitanya and Kothapalli [6, 7] for finding biconnected components and showed that our GPU implementation was the fastest yet. For triconnectivity, we presented the first GPU implementation of the problem based on Miller and Ramachandran's [29] algorithm for triconnectivity. In both cases, we presented a further improvement through the use of certificates (Section 2.6 and Section 3.3 respectively). Much of this improvement can be attributed to the smaller size of the certificate in terms of the number of edges in the input graph. However, the time taken to obtain the certificate on large input graphs via parallel BFS operations is a significant portion of the total run time. Consider our algorithm Cert-GPU-BiCC from Section 2.6, which is so far the fastest known implementation for finding the biconnected components of a graph in parallel. Algorithm Cert-GPU-BiCC performs two BFS traversals on the graph G to obtain a certificate of size at most 2n − 2 edges for testing the biconnectivity of G. Figure 4.1 shows the time spent by Algorithm Cert-GPU-BiCC on BFS operations on a set of eight graphs. As shown in Figure 4.1, these two BFS operations consume on average 66% of the time spent by Algorithm Cert-GPU-BiCC. This indicates that to design faster parallel algorithms for graph k-connectivity, one must revisit the expensive BFS operations. The large time spent on BFS operations can be attributed to the fact that a BFS traversal requires assigning nodes to levels such that, for i ≥ 0, the shortest hop distance from the source of the BFS to any node in level i is i. Arriving at such an assignment in parallel requires expensive algorithmic/programming constructs such as synchronization, concurrent data structures, and work balancing among threads. Our contention here is that using a BFS is expensive due to its O(n + m) nature. The hidden constants behind synchronization, filtering, and load balancing make any parallel BFS slow in practice compared to the further phases of our algorithms for 2- and 3-connectivity. Figure 4.1 and Figure 3.2 in Section 3.3.2 serve as examples, where a BFS on the entire graph is more expensive than all the other steps combined. For all our experiments so far, we have used the parallel GPU BFS implementation of Merrill et al. [28]. There are other parallel implementations available, such as the Gunrock framework [52], which has two implementations: one largely based on that of Merrill et al. [28] and one based on Beamer et al. [2]. Studies on the Gunrock implementation [52] of Beamer's BFS [2] (cf. https://gunrock.github.io/gunrock/doc/latest/md_stats_do_ab_random.html) indicate that the performance tuning of the approach of Beamer et al. [2] is quite involved and depends significantly on the input graph. Employing a better BFS would definitely reduce the total time, but an approach which reduces the BFS burden itself is more desirable. This is what we explore in this chapter, by performing BFS only on a small sampled graph and correcting the output in later phases of our algorithm. Rather than focusing on optimizing implementation(s) of parallel BFS algorithms, we focus on using BFS on smaller graphs instead. The certificate-based algorithms for graph k-connectivity which we use for 2- and 3-connectivity are efficient only after obtaining the necessary certificate using k BFS traversals. Therefore, we suggest designing parallel algorithms that do not perform BFS operations on the input graph. One way of achieving this goal is to trade off the cost of obtaining the certificate against its accuracy. In this chapter, we show that by using novel strategies we can avoid performing BFS on the input graph G. Instead, we use a sparse, spanning, and connected subgraph H′ of the input graph. The subgraph H′ thus identified is used in testing the k-connectivity of G. It must be noted that H′ may not be an accurate certificate for the k-connectivity of G. Therefore, the k-connectivity of H′ may not immediately provide an answer to the k-connectivity of G. To make up for this inaccuracy, we include additional steps on an auxiliary graph F created out of G and the k-connectivity information obtained from H′. The auxiliary graph F is constructed such that G is k-connected if and only if F is k-connected. The sizes of H′ and F are usually small compared to that of G, resulting in a low overall run time.

Figure 4.1 Figure showing the percentage of time spent by Algorithm Cert-GPU-BiCC (cf. Section 2.6) on BFS operations.

We implement our approach for testing 2- and 3-connectivity and obtaining the 2- and 3-connected components of a graph on Nvidia GPUs. Our results on a variety of graphs indicate that our approach outperforms the corresponding best known implementations by factors of 2.2x and 2.1x respectively. We believe that our technique has applications to other graph problems where one can algorithmically replace structures that are expensive to compute with structures that are simple to obtain and possibly inaccurate, followed by a post-processing step. Our work therefore opens the possibility of reinterpreting important steps in parallel graph algorithms so as to make them more efficient in practice.

4.2 An Overview of our Approach

Several recent studies on parallel graph algorithms have explored varied techniques to improve their practical efficiency on multi-core and accelerator-based architectures. Many such studies use well-known graph computations such as traversals, spanning trees, and edge/vertex decompositions as a subroutine. These algorithms can be summarized as follows. From the input graph G, one obtains a structural subgraph H such that the computation on G can be translated or reduced to computations on H followed by additional post-processing steps as required. An example of this can be seen in the work of [35, 15], where a reduced graph that shrinks all nodes of degree two is used as the graph H. The above-mentioned approach of computing on a subgraph H often helps if H is of a smaller size than G. The benefits of the above approach, however, will be limited if identifying H is expensive, possibly due to the strict structural guarantees required of H. The large portion of time spent in obtaining H, as can be noticed from Figure 4.1, indicates scope for revisiting the approach. In this direction, we propose to reconsider the algorithmic implication of replacing H with a suitable, easy-to-create structure H′ such that the computation can be done on H′ instead of H. In case the result of the computation on H′ does not provide a correct result for the required computation on G, additional steps may be required depending on the nature of the problem. However, in these additional post-processing steps, the size of the problem is often much smaller than the size of the original graph. Therefore, it is expected that the cost of post-processing is small. The approach can be seen to have three stages. In Stage I, we obtain a subgraph H′ of G. Stage II performs the computation on H′. An optional Stage III introduces a post-processing step, if required. The technique as presented above allows for multiple possibilities at all stages. In Stage I, H′ can be obtained by (i) uniformly sampling the input graph G, (ii) relaxing the structural properties required of H, (iii) using importance sampling, and the like. In Stage II, the computation on H′ is chosen based on the input problem. Depending on the choices exercised in Stages I and II, we consider the question of whether the output of Stage II can lead to the required output on the original graph. If the output of Stage II is insufficient to arrive at the final answer, we consider Stage III as the post-processing stage. In Stage III also, the computation required depends on the nature of the problem and the utility of H′. Stage III, depending on the problem, can use possibilities such as iterating, augmenting the result, and constructing an auxiliary graph for suitable computation.

Figure 4.2 Figure illustrating our technique in comparison to other approaches towards practical parallel graph algorithms. The top arrow represents direct computation that is usually expensive. The middle arrow indicates preprocessing via strict structural subgraphs or constructs that are sometimes expensive to create. The bottom path corresponds to the approach proposed in this paper. In the figure, red/solid arrows indicate expensive operations while green/hollow arrows indicate operations that are easy to perform in general.

We note that as H′ and H are expected to be of similar size, the time taken for computing on H and H′ will not differ significantly. Hence, for our technique to be useful in practice, the cost of Stages I and III should be less than the cost of obtaining H from G. Figure 4.2 illustrates the idea of our approach. In Figure 4.2, we also list some of the possibilities at each stage of the approach and the particular choices that this thesis uses in these stages. In the following sections, we apply our approach to test the k-connectivity and find the k-connected components of an undirected graph G for k = 2 and k = 3. Cheriyan and Thurimella [8] show that a subgraph H can be constructed as a certificate for G via k BFS traversals. (The k-connectivity of H offers a quick way to test the k-connectivity of G.) As obtaining H via multiple BFS traversals of the input graph can consume a significant portion of the overall runtime (cf. Figure 4.1), we show that our approach can be helpful in arriving at faster parallel algorithms for graph k-connectivity.

4.3 Application to 2-Connectivity

In Chapter 2, we presented the only known GPU algorithm for this problem. Algorithm GPU-BiCC from Section 2.3.1 argues that in a parallel setting, finding the bridges of a graph G is much easier than finding the articulation points. Based on this observation, the algorithm first identifies the bridges of G and separates G into its 2-edge-connected components. To identify the articulation points in each 2-edge-connected component Gi of G, Algorithm GPU-BiCC builds an auxiliary graph Gi′ such that the bridges of Gi′ can be used to locate the articulation points of Gi, and hence those of G. This information is then used to subsequently identify the biconnected components of G. As shown in Section 2.5.2, Algorithm GPU-BiCC is 4x faster than other parallel approaches [42, 6]. On dense graphs, Algorithm Cert-GPU-BiCC from Section 2.6 uses the certificate as defined by Cheriyan and Thurimella [8] to obtain a further 2x speedup over Algorithm GPU-BiCC.

4.3.1 Our Approach

As mentioned in the previous section, one can take H as the subgraph formed by taking the union of a BFS tree T of G and a BFS spanning forest of G \ T. This certificate H will have n nodes and at most 2(n − 1) edges. However, as Figure 4.1 shows, obtaining H is an expensive step, taking on average 66% of the total time of Algorithm Cert-GPU-BiCC. We therefore use our approach as outlined in Section 4.2 by replacing H with a suitable H′. To this end, we start with H′ as a kn-sized spanning subgraph of G, for an appropriately chosen constant k, and proceed to find the biconnected components of H′. As H′ may miss certain edges critical to answering the biconnectivity of G, H′ is not a certificate for the biconnectivity of G. Nevertheless, the biconnected components of H′ can be used to create an auxiliary graph F. Each node in F roughly corresponds to a biconnected component of H′ and the edges of F represent edges between these components. The edges of G \ H′ are used to add additional edges to F so that F acts as a valid certificate for the biconnectivity of G. As H′ is of comparable size to H and the size of F is expected to be small, our approach can help reduce the time spent in BFS operations. More formal details of our approach are presented in the following section.

4.3.1.1 Algorithm Sample-GPU-BiCC

Algorithm Sample-GPU-BiCC for finding the biconnected components (BCCs) of a connected graph G is listed as Algorithm 4. Each of the steps is explained below.

Algorithm 4: Algorithm Sample-GPU-BiCC
Input: A connected graph G
Output: The Biconnected Components (BCCs) of G
1 Obtain subgraph H′ from G
2 Find the BCCs of H′ using Algorithm GPU-BiCC
3 Extract F using the BCCs of H′ and the edges of G
4 Find the BCCs of F using Algorithm GPU-BiCC

• Step 1 – Obtain subgraph H′ from G: Recall that H′ is a kn-sized subgraph of G. We identify H′ by viewing the edges of G as an edge list and including every (m/kn)-th edge for a total of kn edges. Note that no randomness is used, as any kn edges suffice for our purpose. A sketch of this sampling step is shown after this list.

• Step 2 – Find BCCs of H′: Once H′ is obtained, we find the BCCs of H′ using Algorithm GPU-BiCC from Section 2.3.1. These BCCs are used to define the vertices of F.

• Step 3 – Extract F using the BCCs of H′ and the edges of G: In this step, we create an auxiliary graph F. The BCCs of H′ are treated as super-vertices that correspond to the vertices of F. Recall that a node can be part of several BCCs; in particular, articulation points belong to multiple BCCs. Hence we keep each such node as a separate vertex in F. Two nodes in F are joined by an edge if there exists an edge vw ∈ E(G) such that v and w are in different super-vertices of F. This results in F being a multi-graph. Since we only need to know whether there are at least two edges between nodes in F, we add at most two edges between any two nodes of F. For every pair of nodes v, w in F with two edges between them, we keep only one such edge between v and w, and add an auxiliary vertex v′ and edges vv′, v′w to F. By doing so, F becomes a simple graph. We note that as the edges of G are all used to define the edges of F, F acts as a certificate for the 2-connectivity of G. In other words, if vertices x and y have more than one vertex-disjoint path between them in G, then either x and y belong to the same super-vertex of F or the super-vertices of F that contain x and y have more than one vertex-disjoint path between them. The former happens when all the edges of at least one cycle containing x and y are in H′. The latter happens when no cycle containing x and y lies entirely in H′, in which case the edges of the cycle(s) that are not in H′ induce edges in F that ensure that the super-vertices of F containing x and y have two vertex-disjoint paths.

• Step 4 – Find BCCs of F : In this step, we find the BCCs of F using Algorithm GPU-BiCC. The biconnected components identified in this step now map directly to the biconnected components of G.
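The deterministic sampling of Step 1 can be realized as in the following host-side sketch. The function name and types are illustrative assumptions; ensuring that H′ spans all vertices, if required, is handled separately.

```cuda
// Host-side sketch of Step 1: keep every (m / (k*n))-th edge so that H' has about k*n edges.
#include <algorithm>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

std::vector<Edge> sampleSubgraph(const std::vector<Edge> &edges, int n, int k)
{
    const long long m = static_cast<long long>(edges.size());
    const long long stride = std::max(1LL, m / (static_cast<long long>(k) * n));
    std::vector<Edge> H;
    for (long long i = 0; i < m; i += stride)
        H.push_back(edges[i]);
    return H;   // roughly k*n edges; any k*n edges suffice for the approach
}
```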

Figure 4.3 demonstrates an example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.

4.3.2 Implementation Details

We implement Algorithm Sample-GPU-BiCC on a GPU. For BFS on a GPU, we use the implementation from [28] that uses a fine-grained task management strategy. According to our approach, these BFS operations are done on the subgraphs H′ and F, thereby requiring less time. This is followed by identifying the Least Common Ancestor (LCA) of the endpoints of every non-tree edge. Here, we launch one thread for every non-tree edge. Generating F requires a lookup of the edge list of G along with the information of the BCCs of H′. This is easily implemented on a GPU by assigning a thread to every edge of G. We therefore note that Algorithm Sample-GPU-BiCC is amenable to a GPU-based execution where a massive thread pool is supported. We arrange the threads into blocks with 1024 threads per block.

Figure 4.3 An example run of Algorithm Sample-GPU-BiCC on the graph in part (a) of the figure.
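The per-edge generation of F described in the implementation paragraph above can be sketched as the following kernel, with one thread per edge of G. The array names (superVertexOf, fEdgeU, fEdgeV, fEdgeCount) are assumptions for illustration, and the reduction to at most two parallel edges per pair is assumed to be done in a subsequent pass.

```cuda
// Sketch: one thread per edge of G emits a candidate edge of F whenever its
// endpoints fall into different super-vertices (BCCs of H').
__global__ void buildAuxiliaryEdges(const int *edgeU, const int *edgeV,   // edge list of G
                                    const int *superVertexOf,             // vertex of G -> super-vertex in F
                                    int *fEdgeU, int *fEdgeV,             // output edge list of F
                                    int *fEdgeCount, int numEdges)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;

    int a = superVertexOf[edgeU[e]];
    int b = superVertexOf[edgeV[e]];
    if (a == b) return;                    // both endpoints lie in the same BCC of H'

    int slot = atomicAdd(fEdgeCount, 1);   // append the candidate edge of F
    fEdgeU[slot] = a;
    fEdgeV[slot] = b;
}
```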

4.3.3 Experimental Results, Analysis, and Discussion

4.3.3.1 Experimental Platform and Dataset

All our experiments are performed using an Nvidia Tesla K40c GPU as described in detail in Section 2.5.1. The dataset also mostly has the same graphs as in Table 3.1; Table 4.1 has the same real-world graphs as Table 3.1. Apart from the real-world graphs, we use random graphs based on the Erdős–Rényi model [3] generated using the GTGraph generator [1]. All the graphs we consider are undirected and unweighted. Directed graphs are made undirected by removing the direction on the edges. Graphs that are not connected are augmented with additional edges to make them connected. Key properties of the graphs are shown in Table 4.1.

4.3.3.2 Results

In this section, we compare our implementation of Algorithm Sample-GPU-BiCC to that of Algorithm Cert-GPU-BiCC (Section 2.6). The overall improvement in performance on the graphs listed in Table 4.1 is shown in Figure 4.4. Algorithm Sample-GPU-BiCC achieves a speedup ranging from 1.47x to 3.35x compared to Algorithm Cert-GPU-BiCC. The average speedup, as shown in Figure 4.4, is 2.2x. All the above experiments were run with k = 4. The time spent by our approach on BFS operations is listed on top of each bar. As can be noted, this time is on average only 15% of the total time, indicating that our approach is successful in mitigating the practical inefficiency of BFS operations in the context of parallel graph biconnectivity algorithms. The graph nd24k has a high speedup of 3.35x as it is dense and biconnected: even a small sample of edges keeps almost all the nodes in a single BCC. For the graph coPaperDBLP, the lower-than-average speedup can be attributed to its graph structure and the sampling strategy used. In this case, H′ has numerous long chains of degree-2 nodes. This increases the BFS time and the subsequent time for finding the BCCs of H′.

Table 4.1 Graphs used in our experiments for biconnectivity. In the table, the letter K stands for a thousand and the letter M stands for a million.

Graph        Description   Nodes  Edges
Real-World Graphs: same as in Table 3.1
Random Graphs
rand-Bicc1   1 BCC         1M     75M
rand-Bicc2   10000 BCCs    1M     75M

Figure 4.4 Figure showing the time taken by Algorithms Cert-GPU-BiCC and Sample-GPU-BiCC on the graphs listed in Table 4.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-BiCC over Algorithm Cert-GPU-BiCC.

To study the impact of the choice of k on the performance of our algorithm, we plot the time taken by our algorithm on two graphs from Table 4.1 as we vary k. The results of this experiment are shown in Figures 4.5 and 4.6 for the graphs kron18 and coPaperDBLP respectively. When k is small, the size of H′ is small. As a result, the time taken in Step 2 is small. However, if only a few edges of G are included in H′, the number of biconnected components found in Step 2 tends to be high. Therefore, the size of F grows, resulting in Step 4 consuming more time. On the contrary, if the value of k is high, the size of H′ is high. As a result, the time taken in Step 2 is high. But, since more edges have been included in H′, the size of F decreases, thereby making Step 4 relatively faster. Steps 1 and 3 are not significantly impacted by the choice of k and are hence omitted from Figures 4.5 and 4.6.

Figure 4.5 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph kron18 as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.

Figure 4.6 Figure represents the time taken by Algorithm Sample-GPU-BiCC on the graph coPaperDBLP as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F in thousands at k = 1, 4, 10, and 14.

Another factor to be noted is that the size of F, shown in Figures 4.5 and 4.6, indeed decreases as we increase k. This is in tune with our original motivation that doing a BFS on such a small graph will be significantly better than doing a BFS on G. The BFSs on the smaller-sized H′ and F combined are cheaper than a BFS on G. (Recall that obtaining a certificate from G requires two BFSs on G, not one.) The above discussion suggests that k should be chosen appropriately. We see from Figures 4.5 and 4.6 that a good value of k is around 4, while values of k between 3 and 5 offer similar results in general.

4.3.3.3 Discussion

In this section, we summarize a few important points concerning our approach.

• Obtaining H′: We observe that H′ can be generated in several other ways, such as a uniformly-at-random process on the edges, a selection based on the degree of the vertices, and other such strategies. We found that using randomness to obtain H′ is not necessary to make our approach work. With a deterministic post-processing phase, we believe that one should focus more on trying to reduce the overall run time instead of getting a "good" H′. In all the approaches, the impact on the performance was almost similar. Hence we use deterministic sampling.

• Certificate-based approaches: From the work of Bader and Cong [10] and also that of Cheriyan and Thurimella [8], it is apparent that using a certificate for testing the biconnectivity of a graph is practically efficient. In our approach, as the graphs H′ and F are very sparse, such a certificate is not required and Algorithm GPU-BiCC is enough.

4.4 Application to 3-connectivity

In Section 3.3, we adapt Miller and Ramachandran's [29] algorithm for triconnectivity and provide the first GPU implementation of the same. We apply Cheriyan and Thurimella's [8] certificate reduction to present Algorithm Cert-GPU-TriCC, which is shown in Figure 3.2 to be almost 5x faster than our base GPU implementation, GPU-TriCC. Figure 3.2 also shows that the time taken in BFSs is almost 63% of the total time.

4.4.1 Our Approach

As discussed in the previous section, the three BFS traversals required to obtain a certificate H for testing the triconnectivity of a graph take up almost 63% of the total time. A suitable H′ can reduce the initial cost of the three BFSs. We begin with a kn-sized sampled subgraph H′. However, as is the case in biconnectivity, H′ can miss out on some critical edges required for triconnectivity; it is not a valid certificate yet. We then find the TCCs of H′. The TCCs are treated as super-vertices, and these super-vertices form the vertex set of an auxiliary graph F. The edges of the rest of G are then used to define the edges of F. This auxiliary graph is refined to ensure that at most three edges are present between any two super-vertices. This is done to keep the size of F as small as possible. Finally, the TCCs of F are identified, and these correspond to the TCCs of G. The algorithm is explained in depth in the following subsection.

4.4.1.1 Algorithm Sample-GPU-TriCC

Algorithm Sample-GPU-TriCC for finding the triconnected components (TCCs) of a connected graph G is listed as Algorithm 5. Each of the steps is explained below.

Algorithm 5: Algorithm Sample-GPU-TriCC
Input: Biconnected graph G
Output: TCCs of G
1 Obtain a spanning subgraph H′ from G
2 Find the TCCs of H′
3 Extract F using the TCCs of H′ and the edges of G
4 Find the TCCs of F

• Step 1 – Obtain a spanning subgraph H′ from G: As in the case of Algorithm Sample-GPU-BiCC, we identify H′ by viewing the edges of G as an edge list and including every (m/kn)-th edge for a total of kn edges. Note that no randomness is used, as any kn edges suffice for our purpose.

• Step 2 – Find the TCCs of H′: We find the TCCs of H′ using Algorithm GPU-TriCC. Since H′ is a sampled subgraph, it may not be biconnected. We modify the ear decomposition algorithm of Ramachandran [37] to find the open ear decomposition within individual biconnected components. Due to this modification, the ears, although identified correctly, are not correctly numbered. However, Algorithm GPU-TriCC only requires the ears and not their numbering.

• Step 3 – Constructing F using the TCCs of H′ and the edges of G: The TCCs of H′ are compressed into super-vertices. Since a vertex in a separating pair can belong to multiple triconnected components, vertices in separating pairs are treated as independent super-vertices. These super-vertices form the vertices of F. The edges of F are identified in three steps. First, an edge is added between two nodes of F if there exists an edge vw ∈ E(G) such that v and w are part of different TCCs of H′. In the second step, F is filtered to ensure that no more than three edges are present between any two vertices of F. This is done to keep the size of F as small as possible. In the third step, we convert F into a simple graph. To this end, for every two nodes in F with more than one edge between them, we split each such edge by introducing an auxiliary vertex. Similar to the arguments provided in Section 4.3.1.1, we note that the graph F has the property that if vertices a, b, c of G have at least three vertex-disjoint paths between them in G, then either they belong to the same super-vertex of F, or the super-vertices of F containing these vertices have at least three vertex-disjoint paths between them in F. Therefore, F can be used to identify the triconnectivity and the triconnected components of G. A sequential sketch of this refinement is shown after this list.

• Step 4 – Find the TCCs of F: We run Algorithm GPU-TriCC on F to generate the TCCs of F. These components map directly to the triconnected components of G.
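The refinement described in Step 3 can be sketched sequentially as below; the GPU version distributes the same per-edge work over threads. The types and the superVertexOf mapping are assumptions for illustration, not the identifiers of our implementation.

```cuda
// Host-side sketch of Step 3: keep at most three edges between any two super-vertices
// and split parallel edges with auxiliary vertices so that F becomes a simple graph.
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

std::vector<Edge> refineAuxiliaryGraph(const std::vector<Edge> &edgesOfG,
                                       const std::vector<int> &superVertexOf,  // vertex of G -> super-vertex
                                       int &numVerticesF)   // in: number of super-vertices; out: |V(F)|
{
    std::map<Edge, int> multiplicity;      // parallel-edge count per super-vertex pair
    std::vector<Edge> F;
    for (const Edge &e : edgesOfG) {
        int a = superVertexOf[e.first], b = superVertexOf[e.second];
        if (a == b) continue;              // both endpoints lie in the same TCC of H'
        Edge key(std::min(a, b), std::max(a, b));
        int &count = multiplicity[key];
        if (count >= 3) continue;          // at most three edges between any two super-vertices
        ++count;
        if (count == 1) {
            F.push_back(key);              // the first edge between the pair is kept as-is
        } else {
            int aux = numVerticesF++;      // every further parallel edge is split by an auxiliary vertex
            F.push_back(Edge(key.first, aux));
            F.push_back(Edge(aux, key.second));
        }
    }
    return F;                              // F is simple and preserves the triconnectivity of G
}
```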

4.4.2 Implementation Details

As can be noticed from [29], for the graph triconnectivity problem, some computations such as BFS and LCA traversals are common with the biconnectivity problem. In this case too, on the GPU, we therefore use the BFS implementation from [28]. The open ear decomposition is implemented through sorting and LCA traversals. Sorting is performed using the Thrust library [47]. LCA traversals are done by assigning a thread to every non-tree edge. Generating the bridge graph for every ear involves finding the connected components of various appropriately defined subgraphs; for this purpose, we use the GPU-based algorithm of Soman et al. [43]. Generating the star graph with respect to every ear and the subsequent identification of triconnected components can also be done on a GPU by expressing the computation as a sequence of multiple kernels.
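For instance, the sort used while generating the open ear decomposition can be expressed directly with Thrust, as in the sketch below; the key/value names are assumptions for illustration.

```cuda
// Sketch: sort edge indices by ear number so that edges of the same ear become contiguous.
#include <thrust/device_vector.h>
#include <thrust/sort.h>

void sortEdgesByEar(thrust::device_vector<int> &earNumber,   // key: ear number of each edge
                    thrust::device_vector<int> &edgeIndex)   // value: index into the edge list
{
    thrust::sort_by_key(earNumber.begin(), earNumber.end(), edgeIndex.begin());
}
```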

4.4.3 Experimental Results, Analysis, and Discussion

The experimental platform we use for our experiments is described in Section 2.5.1. We scheduled the above algorithm with 1024 threads per block.

4.4.3.1 Dataset

We use the same dataset as the one used for comparing GPU-TriCC and Cert-GPU-TriCC. Table 3.1 lists all the graphs used for evaluating Algorithm Sample-GPU-TriCC.

4.4.3.2 Results

We compare the performance of Algorithm Sample-GPU-TriCC to that of Algorithm Cert-GPU-TriCC. As noted earlier, Algorithm Cert-GPU-TriCC is, to the best of our knowledge, the fastest algorithm on GPUs for finding the triconnected components of a graph. Figure 4.7 shows the time taken by Algorithm Sample-GPU-TriCC on the graphs listed in Table 3.1. As can be observed, Algorithm Sample-GPU-TriCC achieves a speedup of 2.1x on average over Algorithm Cert-GPU-TriCC. The value of k is set to 4 in this experiment. We now proceed to study the performance of Algorithm Sample-GPU-TriCC as k is varied. Figures 4.8 and 4.9 show the results of this study on the graphs rm07r and rand-Tricc1 respectively. As k increases, the size of H′ increases, resulting in an increase in the time taken by Step 2. On the other hand, the decrease in the size of F with increasing k reduces the time taken in Step 4. The choice of k is to be made considering this trade-off. From our experiments, we note that k = 4 is a good choice in the case of triconnectivity. However, values of k between 4 and 6 offer similar performance.

Figure 4.7 Figure showing the time taken by Algorithms Cert-GPU-TriCC and Sample-GPU-TriCC on the graphs listed in Table 3.1. The primary Y-axis represents time in milliseconds. The secondary Y-axis gives the speedup of Algorithm Sample-GPU-TriCC over Algorithm Cert-GPU-TriCC.

Figure 4.8 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rm07r as k is varied. Y-axis represents time in milliseconds with varying k on x-axis. Tuples on the line labeled ”Total Time” show the number of nodes and edges of F in millions at various values of k.

Figure 4.9 Figure represents the time taken by Algorithm Sample-GPU-TriCC on the graph rand-Tricc1 as k is varied. The Y-axis represents time in milliseconds with varying k on the x-axis. Tuples on the line labeled "Total Time" show the number of nodes and edges of F at various values of k.

4.4.3.3 Discussion

One can observe in Figures 4.8 and 4.9, or even in Figures 4.5 and 4.6 of the 2-connectivity section, that the time for Step 4 does not decrease significantly with increasing k. This is due to the fact that beyond some value of k, most of the biconnected/triconnected components of G are identified via H′ alone. In general, Algorithm Cert-GPU-TriCC involves more BFS operations than Algorithm Cert-GPU-BiCC. Thus, it seems that Algorithm Sample-GPU-TriCC should benefit more from our technique than Algorithm Sample-GPU-BiCC. However, as shown in Figures 4.4 and 4.7, our technique results in a near-similar speedup in both cases. This is because, for the graphs we considered in our dataset, and in general, we expect more triconnected components than biconnected components. So, the size of the auxiliary graph F generated using our technique is larger in the case of triconnectivity as compared to biconnectivity.

Chapter 5

Conclusions and Future Work

In this thesis, we have presented and implemented the first GPU algorithms for 2- and 3-connectivity. We have improved upon both of them by employing Cheriyan and Thurimella's [8] certificate approach. Our implementations are, to the best of our knowledge, the fastest implementations yet. We have then studied the impact of BFS on k-connectivity algorithms and have come up with an approach to mitigate the cost of performing BFSs. Our results indicate that a significant gain in performance can be obtained by reinterpreting algorithms to perform BFS on graphs that are much smaller in size compared to the input graph. We believe that our approach can be useful in other settings too. As our results show promise, a theoretical analysis can be done in the future to explain the speedups produced. The approach of avoiding BFS and working with a cheaper, inaccurate certificate can also be explored in a sequential setting.

Related Publications

• Mihir Wadwekar and Kishore Kothapalli. Expediting Parallel Graph Connectivity Algorithms. 26th IEEE International Conference on High Performance Computing, Data and Analytics (HiPC). 2018.

• Mihir Wadwekar and Kishore Kothapalli. A Fast GPU Algorithm for Biconnected Components. Tenth International Conference on Contemporary Computing (IC3). 2017.

Bibliography

[1] D. Bader and K. Madduri. GTgraph: A suite of synthetic graph generators. URL http://www.cse.psu.edu/~kxm85/software/GTgraph/.

[2] S. Beamer, K. Asanović, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3-4):137–148, 2013.

[3] B. Bollobás. Random graphs. In Modern graph theory, pages 215–252. Springer, 1998.

[4] R. A. Botafogo and B. Shneiderman. Identifying aggregates in hypertext structures. In Proceedings of the Third Annual ACM Conference on Hypertext, HYPERTEXT ’91, pages 63–74, New York, NY, USA, 1991. ACM. ISBN 0-89791-461-9. doi: 10.1145/122974.122981. URL http:// doi.acm.org/10.1145/122974.122981.

[5] U. Brandes and D. Wagner. Analysis and Visualization of Social Networks, pages 321–340. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004. ISBN 978-3-642-18638-7. doi: 10.1007/ 978-3-642-18638-7 15. URL https://doi.org/10.1007/978-3-642-18638-7_15.

[6] M. Chaitanya and K. Kothapalli. A simple parallel algorithm for biconnected components in sparse graphs. 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, pages 395–404, 2015.

[7] M. Chaitanya and K. Kothapalli. Efficient multicore algorithms for identifying biconnected com- ponents. International Journal of Networking and Computing, 6(1):87–106, 2016.

[8] J. Cheriyan and R. Thurimella. Algorithms for parallel k-vertex connectivity and sparse certifi- cates. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, pages 391–401. ACM, 1991.

[9] J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey. Fast and efficient graph traversal algorithm for cpus: Maximizing single-node efficiency. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 378–389. IEEE, 2012.

[10] G. Cong and D. A. Bader. An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (smps). In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, pages 9–pp. IEEE, 2005.

[11] S. Cook. CUDA programming: a developer’s guide to parallel computing with GPUs. Newnes, 2012.

[12] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms. MIT press, 2009.

[13] CUDA 7.5, 2016. URL http://developer.download.nvidia.com/compute/cuda/7.5/Prod/docs/sidebar/CUDA_Toolkit_Release_Notes.pdf.

[14] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38:1:1–1:25, Dec. 2011. ISSN 0098-3500. doi: 10.1145/2049662.2049663. URL http://doi.acm.org/10.1145/2049662.2049663.

[15] D. Dutta, M. Chaitanya, K. Kothapalli, and D. Bera. Applications of ear decomposition to efficient heterogeneous algorithms for shortest path/cycle problems. International Journal of Networking and Computing, 8(1):73–92, 2018.

[16] J. A. Edwards and U. Vishkin. Better speedups using simpler parallel programming for graph con- nectivity and biconnectivity. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores, pages 103–114. ACM, 2012.

[17] J. A. Edwards and U. Vishkin. Brief announcement: speedups for parallel graph triconnectivity. In Proceedings of the twenty-fourth annual ACM symposium on Parallelism in algorithms and architectures, pages 190–192. ACM, 2012.

[18] D. K. Goldenberg, P. Bihler, M. Cao, J. Fang, B. D. O. Anderson, A. S. Morse, and Y. R. Yang. Localization in sparse networks using sweeps. In Proceedings of the 12th Annual International Conference on Mobile Computing and Networking, MobiCom ’06, pages 110–121, New York, NY, USA, 2006. ACM. ISBN 1-59593-286-0. doi: 10.1145/1161089.1161103. URL http://doi.acm.org/10.1145/1161089.1161103.

[19] J. Greiner. A comparison of parallel algorithms for connected components. In Proceedings of the sixth annual ACM symposium on Parallel algorithms and architectures, pages 16–25. ACM, 1994.

[20] B. Haeupler, K. R. Jampani, and A. Lubiw. Testing simultaneous planarity when the common graph is 2-connected. In Algorithms and Computation, pages 410–421, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[21] D. S. Hirschberg, A. K. Chandra, and D. V. Sarwate. Computing connected components on parallel computers. Communications of the ACM, 22(8):461–464, 1979.

[22] J. E. Hopcroft and R. E. Tarjan. Isomorphism of Planar Graphs, pages 131–152. Springer US, Boston, MA, 1972. ISBN 978-1-4684-2001-2. doi: 10.1007/978-1-4684-2001-2_13. URL https://doi.org/10.1007/978-1-4684-2001-2_13.

[23] J. E. Hopcroft and R. E. Tarjan. Dividing a graph into triconnected components. SIAM Journal on Computing, 2(3):135–158, 1973.

[24] J. JáJá. An introduction to parallel algorithms, volume 17. Addison-Wesley Reading, 1992.

[25] A. Kanevsky and V. Ramachandran. Improved algorithms for graph four-connectivity. In Foundations of Computer Science, 1987., 28th Annual Symposium on, pages 252–259. IEEE, 1987.

[26] S. Khuller and B. Schieber. Efficient parallel algorithms for testing k-connectivity and finding disjoint s-t paths in graphs. SIAM Journal on Computing, 20(2):352–375, 1991.

[27] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[28] D. Merrill, M. Garland, and A. Grimshaw. Scalable gpu graph traversal. In ACM SIGPLAN Notices, volume 47, pages 117–128. ACM, 2012.

[29] G. L. Miller and V. Ramachandran. A new graph triconnectivity algorithm and its parallelization. Combinatorica, 12(1):53–76, Mar 1992. ISSN 1439-6912. doi: 10.1007/BF01191205. URL https://doi.org/10.1007/BF01191205.

[30] S. Nayyaroddeen, M. Gambhir, and K. Kothapalli. A study of graph decomposition algorithms for parallel symmetry breaking. In Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2017 IEEE International, pages 598–607. IEEE, 2017.

[31] Nvidia Tesla K40C. URL https://www.nvidia.in/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf.

[32] NVIDIA Turing Architecture, 2018. URL https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.

[33] OpenCL 2.2, 2018. URL https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_API.pdf.

[34] OpenMP 4.0, 2013. URL https://www.openmp.org/wp-content/uploads/OpenMP4.0.0.pdf.

[35] C. Pachorkar, M. Chaitanya, K. Kothapalli, and D. Bera. Efficient parallel ear decomposition of graphs with application to betweenness-centrality. In High Performance Computing (HiPC), 2016 IEEE 23rd International Conference on, pages 301–310. IEEE, 2016.

[36] G. Pandurangan, P. Robinson, and M. Scquizzato. Fast distributed algorithms for connectivity and mst in large graphs. ACM Transactions on Parallel Computing (TOPC), 5(1):4, 2018.

[37] V. Ramachandran. Parallel open ear decomposition with applications to graph biconnectivity and triconnectivity. Citeseer, 1992.

[38] V. Ramachandran and U. Vishkin. Efficient parallel triconnectivity in logarithmic time. In Aegean Workshop on Computing, pages 33–42. Springer, 1988.

[39] SAS-OPTGRAPH. http://support.sas.com/documentation/solutions/optgraph/index.html.

[40] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Technical report, Computer Science Department, Technion, 1980.

[41] J. Shun, L. Dhulipala, and G. Blelloch. A simple and practical linear-work parallel algorithm for connectivity. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, pages 143–153. ACM, 2014.

[42] G. M. Slota and K. Madduri. Simple parallel biconnectivity algorithms for multicore platforms. In High Performance Computing (HiPC), 2014 21st International Conference on, pages 1–10. IEEE, 2014.

[43] J. Soman, K. Kishore, and P. Narayanan. A fast gpu algorithm for graph connectivity. In Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, pages 1–8. IEEE, 2010.

[44] M. Sutton, T. Ben-Nun, and A. Barak. Optimizing parallel graph connectivity computation via subgraph sampling. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 12–21. IEEE, 2018.

[45] R. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.

[46] R. E. Tarjan and U. Vishkin. An efficient parallel biconnectivity algorithm. SIAM Journal on Computing, 14(4):862–874, 1985.

[47] Thrust C++ library. URL https://developer.nvidia.com/thrust.

[48] W. T. Tutte. Connectivity in graphs, volume 15. University of Toronto Press, 1966.

[49] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman. Explicit multi-threading (xmt) bridging models for instruction parallelism. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, pages 140–151. ACM, 1998.

[50] M. Wadwekar and K. Kothapalli. A fast gpu algorithm for biconnected components. In Contemporary Computing (IC3), 2017 Tenth International Conference on, pages 1–6. IEEE, 2017.

[51] F. Wang, M. Thai, and D. Du. On the construction of 2-connected virtual backbone in wireless networks. IEEE Transactions on Wireless Communications, 8(3):1230–1237, March 2009. ISSN 1536-1276. doi: 10.1109/TWC.2009.051053.

[52] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: A high-performance graph processing library on the gpu. In ACM SIGPLAN Notices, volume 51, page 11. ACM, 2016.

[53] Wikipedia contributors. K-vertex-connected graph — Wikipedia, the free encyclopedia, 2018. URL https://en.wikipedia.org/w/index.php?title=K-vertex-connected_graph&oldid=820650910. [Online; accessed 3-January-2019].

[54] Wikipedia contributors. Social network analysis — Wikipedia, the free encyclopedia, 2019. URL https://en.wikipedia.org/w/index.php?title=Social_network_analysis&oldid=876315796. [Online; accessed 2-January-2019].
