SUPPORTING EFFICIENT GRAPH ANALYTICS AND SCIENTIFIC COMPUTATION USING ASYNCHRONOUS DISTRIBUTED-MEMORY PROGRAMMING MODELS

By

SAYAN GHOSH

A dissertation submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

WASHINGTON STATE UNIVERSITY
School of Electrical Engineering and Computer Science

MAY 2019

© Copyright by SAYAN GHOSH, 2019
All Rights Reserved

To the Faculty of Washington State University:
The members of the Committee appointed to examine the dissertation of SAYAN GHOSH find it satisfactory and recommend that it be accepted.

Assefaw H. Gebremedhin, Ph.D., Chair

Carl Hauser, Ph.D.

Ananth Kalyanaraman, Ph.D.

Pavan Balaji, Ph.D.

Mahantesh Halappanavar, Ph.D.

ACKNOWLEDGEMENT

I thank my adviser, Dr. Assefaw Gebremedhin, for his generous guidance, unflagging support, and considerable enthusiasm toward my research. I greatly appreciate his persistence in pushing me to refine my writing and narration skills, which has helped me become a better researcher and communicator. I would like to thank Dr. Jeff Hammond for introducing me to one-sided communication models, which play an important role in my thesis. I would also like to thank Dr. Barbara Chapman and Dr. Sunita Chandrasekaran for their unwavering support during my Master's studies at the University of Houston. I am immensely fortunate to have had the opportunity to work with all of my thesis committee members. As a Teaching Assistant to Dr. Carl Hauser for the Computer Networks course, I appreciate that he encouraged me to solve the problem sets on my own, so that I could assist the students effectively. Through my discussions with Dr. Pavan Balaji, I have learned the importance of low-level performance analysis for a comprehensive evaluation of an application. I am grateful to Drs. Mahantesh Halappanavar and Ananth Kalyanaraman for introducing me to research on graph community detection. I sincerely believe that criticism outperforms praise. I have been lucky to have mentors who never settled for less and always pushed me to explore a bit more. I appreciate the supervision of Drs. Jeff Hammond, Pavan Balaji, Antonio Peña and Yanfei Guo during my internships at Argonne National Laboratory. I spent over a year as an intern and an Alternate Sponsored Fellow at Pacific Northwest National Laboratory, and I would like to thank Drs. Mahantesh Halappanavar and Antonino Tumeo for engaging me with research on graph community detection. I admire every one of them for their guidance and their efforts in enhancing my knowledge. Special thanks to the administrative staff of the Electrical Engineering and Computer Science department, the Graduate School, and the Office of International Programs at Washington State University for their commitment toward helping students.

I would like to thank my parents/in-laws for their unswerving support and deep empathy, despite the vast distance between us. Finally, I would like to recognize my wife Priyanka for her constructive criticisms, logical disagreements, unconditional love, and sharing all the hardships of student life with magnificent flair — "Strangers on this road we are on; We are not two, we are one".

SUPPORTING EFFICIENT GRAPH ANALYTICS AND SCIENTIFIC COMPUTATION USING ASYNCHRONOUS DISTRIBUTED-MEMORY PROGRAMMING MODELS

Abstract

by Sayan Ghosh, Ph.D.
Washington State University
May 2019

Chair: Assefaw H. Gebremedhin

Future High Performance Computing (HPC) nodes will have many more processors than contemporary architectures. In such a system with massive parallelism, it will be necessary to use all the available cores to drive the network performance. Hence, there is a need to explore one-sided models that decouple communication from synchronization. Apart from focusing on optimizing communication, it is also desirable to improve the productivity of existing one-sided models by designing convenient abstractions that can alleviate the complexities of parallel application development. Classically, a majority of applications running on HPC systems have been arithmetic intensive. However, data-driven applications are becoming more prominent, employing algorithms from areas such as graph theory, machine learning, and data mining. Most graph applications have minimal arithmetic requirements and exhibit irregular communication patterns. Therefore, it is useful to identify approximate methods that can enable communication-avoiding optimizations for graph applications, by potentially sacrificing some quality. The first part of this dissertation addresses the need to reduce synchronization by exploring one-sided communication models and designing convenient abstractions that serve the needs of distributed-memory scientific applications. The second part of the dissertation evaluates the impact of approximate methods and communication models on parallel graph applications.

We begin with the design and development of an asynchronous matrix communication interface that can be leveraged in parallel numerical linear algebra applications. Next, we discuss the design of a compact set of C++ abstractions over a one-sided communication model, which improves developer productivity significantly. Then, we study the challenges associated with parallelizing community detection in graphs, and develop a distributed-memory implementation that incorporates a number of approximate methods to optimize performance. Finally, we consider a half-approximation algorithm for graph matching, and evaluate the implications of different communication models in its distributed-memory implementation. We also examine the effect of data reordering on performance.

In summary, this dissertation provides concrete insights into designing low-overhead high-level interfaces over asynchronous distributed-memory models for building parallel scientific applications, and presents empirical analysis of the effect of approximate methods and communication models in deriving efficiency for irregular scientific applications, using distributed-memory graph applications as a use case.

TABLE OF CONTENTS

Page

ACKNOWLEDGEMENT ...... ii

ABSTRACT ...... v

LIST OF TABLES ...... xiii

LIST OF FIGURES ...... xvi

CHAPTER 1: INTRODUCTION ...... 1

1.1 HARDWARE TRENDS ...... 1

1.2 POWER CONSUMPTION GOVERNS FUTURE SYSTEM DESIGN ...... 1

1.3 IRREGULAR APPLICATION CHALLENGES ...... 3

1.4 USING SPARSE LINEAR ALGEBRA FOR GRAPH APPLICATIONS ...... 4

1.5 MOTIVATION ...... 5

1.5.1 Distributed-memory applications and Message Passing Interface ...... 6

1.5.2 One-sided communication model ...... 7

1.5.3 Approximate computing techniques ...... 8

1.5.4 Summary ...... 9

1.6 CONTRIBUTIONS ...... 10

1.7 PUBLICATIONS ...... 11

1.8 DISSERTATION ORGANIZATION ...... 12

CHAPTER 2: BACKGROUND ON MPI ONE-SIDED COMMUNICATION . . . . . 14

2.1 INTRODUCTION ...... 14

2.2 REMOTE DIRECT MEMORY ACCESS ...... 16

2.3 MEMORY MODEL ...... 17

2.3.1 Memory consistency ...... 17

2.3.2 MPI RMA memory model ...... 18

2.4 MPI-2 TO MPI-3 RMA ...... 19

2.5 CHAPTER SUMMARY ...... 20

CHAPTER 3: ONE-SIDED INTERFACE FOR MATRIX OPERATIONS USING MPI: A CASE STUDY WITH ELEMENTAL ...... 21

3.1 INTRODUCTION ...... 21

3.2 ABOUT ELEMENTAL ...... 23

3.2.1 Data Distribution ...... 24

3.2.2 Elemental AXPY Interface ...... 26

3.3 BEYOND THE ELEMENTAL AXPY INTERFACE ...... 27

3.3.1 Enhancing the Performance of the Existing AXPY Interface ...... 28

3.3.2 From the AXPY Interface to the RMA Interface ...... 29

3.4 PROPOSED ONE-SIDED APIS ...... 30

3.4.1 RMAInterface ...... 30

3.4.2 Distributed Arrays Interface (EL::DA) ...... 35

3.5 EXPERIMENTAL EVALUATION ...... 36

3.5.1 Microbenchmark Evaluation ...... 38

3.5.2 Application Evaluation – GTFock ...... 42

3.6 CHAPTER SUMMARY ...... 43

CHAPTER 4: RMACXX: AN EFFICIENT HIGH-LEVEL C++ INTERFACE OVER MPI-3 RMA ...... 45

4.1 INTRODUCTION ...... 45

4.2 RELATED WORK ...... 49

4.3 DESIGN PRINCIPLES OF RMACXX ...... 51

4.3.1 Window class ...... 52

4.3.2 Standard interface ...... 56

4.3.3 Expression interface ...... 60

4.4 EXPERIMENTAL EVALUATION ...... 66

4.4.1 Instruction count and latency analysis ...... 67

4.4.2 Message rate and remote atomics ...... 72

4.4.3 Application evaluations ...... 75

4.5 CHAPTER SUMMARY ...... 79

CHAPTER 5: DISTRIBUTED-MEMORY PARALLEL LOUVAIN METHOD FOR GRAPH COMMUNITY DETECTION ...... 80

5.1 INTRODUCTION ...... 80

5.2 RELATED WORK ...... 82

5.3 PRELIMINARIES ...... 83

5.3.1 Modularity ...... 83

5.3.2 Serial Louvain algorithm ...... 85

5.3.3 Challenges in distributed-memory parallelization ...... 85

5.4 THE PARALLEL ALGORITHM ...... 86

5.4.1 Input distribution ...... 87

5.4.2 Overview of the parallel algorithm ...... 87

5.5 APPROXIMATE METHODS FOR PERFORMANCE OPTIMIZATION ...... 91

5.5.1 Threshold Cycling ...... 93

5.5.2 Early Termination ...... 93

5.5.3 Incomplete Coloring ...... 95

5.6 EXPERIMENTAL EVALUATION ...... 96

5.6.1 Algorithms compared ...... 97

5.6.2 Experimental platforms ...... 97

5.6.3 Test graphs ...... 98

5.6.4 Comparison on a single node ...... 100

5.6.5 Strong scaling ...... 101

5.6.6 Weak scaling ...... 102

5.6.7 Analysis of performance of the approximate computing methods/heuristics 104

5.6.8 Combining approximate methods/heuristics delivers better performance . . 107

5.6.9 Solution quality assessment ...... 109

5.7 APPLICABILITY OF THE LOUVAIN METHOD AS A BENCHMARKING TOOL FOR GRAPH ANALYTICS ...... 110

5.7.1 Characteristics of distributed-memory Louvain method ...... 111

5.7.2 Synthetic Data Generation ...... 113

5.8 ANALYSIS OF MEMORY AFFINITY, POWER CONSUMPTION, AND COMMUNICATION PRIMITIVES ...... 115

5.8.1 Evaluation on Intel Knights Landing® architecture ...... 116

5.8.2 Power, energy and memory usage ...... 118

5.8.3 Impact of MPI communication method ...... 119

5.9 ADDRESSING THE RESOLUTION LIMIT PROBLEM ...... 123

5.10 CHAPTER SUMMARY ...... 126

CHAPTER 6: EXPLORING MPI COMMUNICATION MODELS FOR GRAPH APPLICATIONS USING GRAPH MATCHING AS A CASE STUDY ...... 127

6.1 INTRODUCTION ...... 127

6.2 IMPLEMENTING DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS USING MPI ...... 129

6.3 HALF-APPROXIMATE MATCHING ...... 132

6.3.1 Matching preliminaries ...... 132

6.3.2 Serial algorithm for half-approximate matching ...... 134

6.4 PARALLEL HALF-APPROXIMATE MATCHING ...... 134

6.4.1 Graph distribution ...... 135

6.4.2 Communication contexts ...... 135

6.4.3 Distributed-memory algorithm ...... 136

6.4.4 Implementation of the distributed-memory algorithms ...... 139

6.5 EXPERIMENTAL EVALUATION ...... 142

6.5.1 Notations and experimental setup ...... 142

6.5.2 Scaling analysis and comparison with MatchBox-P ...... 143

6.5.3 Impact of graph reordering ...... 146

6.5.4 Performance summary ...... 149

6.6 RELATED WORK ...... 153

6.7 CHAPTER SUMMARY ...... 154

CHAPTER 7: CONCLUSION AND FUTURE WORK ...... 156

7.1 SUMMARY OF FINDINGS ...... 156

7.2 FUTURE WORK ...... 157

BIBLIOGRAPHY ...... 159

LIST OF TABLES

1.1 2017 average residential electricity usage across the United States, compared with power consumption of a regional supercomputer...... 3

1.2 Energy, Power and Memory usage of five approximate computing variants of the distributed-memory implementation of the Louvain method for graph community detection (Chapter 5)...... 9

3.1 Test Molecules used for GTFock evaluation...... 43

4.1 RMACXX semantics compared to Global Arrays, Fortran 2008 Coarrays and UPC++. 49

4.2 Window class template parameter list. Default values are in bold...... 55

4.3 Communication scenarios for origin/target derived type creation...... 58

4.4 Expression completion characteristics...... 65

4.5 Experimental platforms...... 67

4.6 Atomic memory operations...... 69

4.7 Expression interface instructions and latencies...... 70

4.8 Global indexing put instructions and latencies...... 71

4.9 Bulk put with noncontiguous local buffer...... 72

4.10 Communication models and transport layers...... 72

4.11 RMACXX usage in applications ...... 78

5.1 Experimental platforms...... 98

5.2 Test graphs, listed in ascending order of edges...... 98

5.3 I/O performance (in secs.) for three real-world input graphs on NERSC Cori using Lustre file striping and burst buffers...... 99

5.4 Distributed memory vs shared memory (Grappolo) performance (runtime) of Louvain algorithm on a single NERSC Cori node using 4-64 threads. The input graph is soc-friendster (1.8B edges)...... 101

5.5 Versions yielding the best performance over the baseline version (run on 16-128 processes) for input graphs (listed in ascending order of edges)...... 103

5.6 GTgraph SSCA#2 generated graph dimensions and associated information. . . . . 103

5.7 Stochastic block partition dataset characteristics used for coloring analysis...... 107

5.8 Performance of ET(0.25) combined with Threshold Cycling for soc-friendster (1.8B edges). Relative percentage gains in performance are in braces...... 108

5.9 Quality comparisons of our distributed Louvain implementation and Grappolo with LFR ground truth community information...... 110

5.10 First phase of Louvain method versus the last phase for real-world inputs on 1K processes of NERSC Cori...... 112

5.11 Power/Energy and Memory consumption of distributed Louvain implementation using four real-world graphs exhibiting diverse characteristics on 1K processes (64 nodes) of NERSC Cori ...... 120

5.12 Execution time (in secs.) and Modularity (Q) on 1-4K processes for RGG datasets with unit edge weights ...... 121

5.13 Execution time (in secs.) and Modularity (Q) on 1-4K processes for RGG datasets with Euclidean distance weights ...... 122

5.14 Number of iterations, execution time (in secs.) and Modularity of Friendster (65.6M vertices, 1.8B edges) for various MPI communication models on 1024/2048 processes using 2 OpenMP threads/process ...... 122

5.15 Number of iterations, execution time (in secs.) and Modularity of Friendster for various MPI communication models on 1024/2048 processes using 4 OpenMP threads/process ...... 123

5.16 Quality comparison between Louvain and Fast-tracking resistance method using ball bearing graphs...... 124

5.17 Quality of Louvain compared to ground truth data obtained from Fast-tracking resistance method for small/moderate sized real-world graphs...... 126

6.1 Description of keywords used in algorithms ...... 137

6.2 Synthetic and real-world graphs used for evaluation ...... 144

6.3 Graph topology statistics for stochastic block partitioned graph on 512-2K processes ...... 145

6.4 Neighborhood graph topology statistics for Friendster and Orkut ...... 146

6.5 Impact of reordering depicted through the number of edges augmented with the number of ghost vertices for different partitions ...... 147

6.6 Neighborhood topology of original vs RCM reordered graphs ...... 148

6.7 Versions yielding the best performance over the Send-Recv baseline version (run on 512-16K processes) for various input graphs...... 151

6.8 Power/energy and memory usage on 1K processes ...... 151

LIST OF FIGURES

1.1 Processor trends for the past five decades. Figure courtesy: Rupp et al. [166]. . . . 2

1.2 Left: Evaluation of first 50 supercomputers in Top 500 list using HPL [136]. Performance evaluated by Floating-Point Operations Per Second (FLOPS). Right: Evaluation of first 50 supercomputers in Graph 500 list using Breadth First Search (BFS). Performance evaluated by Traversed Edges Per Second (TEPS) [140]. Plot data collected from the respective lists of Top500/Graph500 for November 2017. . 6

1.3 Generic layout of scientific applications with communication layers...... 7

2.1 MPI two-sided versus one-sided communication (RMA) operations ...... 15

2.2 MPI Remote Memory Access (RMA) versus MPI two-sided Send/Recv. Each pro- cess exposes some memory (window) before issuing one-sided put/get communi- cation calls...... 15

2.3 Unified (left) and Separate (right) memory models of MPI RMA. Figure courtesy: Hoefler et al. [100]...... 19

3.1 Overall structure of the Elemental library...... 24

3.2 Elemental element-wise cyclic distribution (MC×MR) of an 8 × 8 matrix on a 2 × 2 process grid (4 processes). Dark borders indicate local/physical chunks corresponding to a global chunk...... 25

3.3 AXPY interface components: A (synonymous to Y in “AXPY”) is a DistMatrix and X is a locally owned matrix...... 27

3.4 Elemental RMAInterface API...... 31

3.5 Logical diagram of a 2D blocked distributed matrix multiplication using RMAInterface. Each block of DistMatrix A, B and C contains noncontiguous elements. a, b and c are local matrices (i.e., Matrix)...... 32

3.6 Steps involved in a put/accumulate operation of RMAInterface. An 8×8 distributed matrix (as shown in Figure 3.2) is updated starting at position (3, 3) by a local 5×5 matrix M. A get operation would show the arrows in the opposite direction. Step 3 (MPI DDT creation) is optional...... 33

3.7 Bandwidth of put/get/accumulate operations with/without MPI DDT on 16 processes (higher is better). The X axis shows the size of the data transferred...... 34

3.8 Hartree-Fock proxy microbenchmark comparing the Elemental AXPY interface versions on 256 processes of Blues. The number inside braces denotes the number of tasks...... 39

3.9 Performance comparison of EL::DA and GA for the distributed matrix-matrix multiplication microbenchmark on Cori. Four processes per node are used by Casper for asynchronous progress...... 41

3.10 GTFock execution on Cori. Two processes per node are used by Casper for asynchronous progress...... 41

4.1 PGAS models that support both local and global view, showing flexibility...... 48

4.2 High-level organization of RMACXX...... 52

4.3 RMACXX window creation using the local indexing constructor. A window created by a process using the local indexing constructor is not associated with the windows created by other processes...... 53

4.4 RMACXX window creation with global indexing capabilities. A window of global dimensions 6 × 8 integers is created collectively by 4 processes laid out in a logical 2 × 2 grid...... 54

4.5 Top: Elementwise access translation for local indexing. Bottom: Bulk access translation for local indexing. operator() returns a reference to the current Window instance, which is used to call the subsequent operator>>/operator<< to initiate communication...... 57

4.6 Intermediate steps of a bulk communication operation (i.e., put) using global indices (e.g., win({1,2},{5,4}) << buf) ...... 58

4.7 An example of the RMACXX subarray constructor...... 59

4.8 Top: Translation of an elementwise expression (w(1, {0}) + 2 ∗ w(2, {1})). Bottom: Translation of a bulk expression (2 + w(2, {1, 1}, {4, 4}) ∗ w(0, {2, 2}, {5, 5})). Items in the boxes connected with solid lines are expression terms including operators. The arrows indicate underlying class instantiations. The nonmember operators (* and +) are special functions that accept instances of class EExpr/BExpr and return another instance of EExpr/BExpr after performing the requisite operation...... 62

4.9 Stages in expression processing...... 63

4.10 Instruction counts (top) and latencies (bottom) of MPI and RMACXX (local indexing) operations on ANL Blues...... 68

4.11 Local indexing concurrent versions: Instructions (left) and Latencies in seconds (right)...... 69

4.12 Blues: Intranode put (left) and get (right) rates...... 73

4.13 Edison: Intranode put (left) and get (right) rates...... 73

4.14 Blues: Internode put (left) and get (right) rates...... 73

4.15 Edison: Internode put (left) and get (right) rates...... 74

4.16 Blues: Intranode (left) and internode (right) fetch-add rates...... 74

4.17 Edison: Intranode (left) and internode (right) fetch-add rates...... 75

4.18 Application evaluations using RMACXX on NERSC Edison...... 76

5.1 Vertex-based graph distribution between two processes for an undirected graph with 4 vertices and 8 edges. Ghost vertices are retained by a process: for process #0, the “ghost” vertices are 2 and 3, whereas for process #1, the ghosts are 0 and 1. 87

5.2 Graph reconstruction. In the example, we suppose that the modularity optimization has assigned vertices {0, 1, 3} to community 0, vertex 2 to community 2 and vertex 4 to community 4 (i.e., vertices 2 and 4 are each one in their own community). Because community IDs originate from vertex IDs, we consider the community IDs from 0 to 2 owned (local) to process #0, and community IDs 3 and 4 local to process #1...... 92

5.3 Communication volume, in terms of mean send/recv message sizes (bytes) exchanged between pairs of processes, for two real-world inputs on 1024 processes. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids...... 99

5.4 Execution times of our distributed Louvain implementation for graphs listed in Table 5.2. X-axis: Number of processes (and nodes), Y-axis: Execution time (in secs.)...... 102

5.5 Weak scaling of baseline distributed Louvain implementation on GTgraph generated SSCA#2 graphs. X-axis: Input graphs listed in Table 5.6, Y-axis: Execution time (in secs.)...... 104

5.6 Convergence characteristics of nlpkkt240 (401.2M edges) on 64 processes...... 105

5.7 Convergence characteristics of web-cc12-PayLevelDomain (1.2B edges) on 64 processes...... 106

5.8 Modularities for stochastic block partition graphs of various sizes and sampling techniques on 16 nodes (192 processes) of NERSC Edison...... 107

5.9 Performance of com-orkut (117.1M edges) when coloring is combined with ET on NERSC Edison...... 108

5.10 Approximate computing techniques have little effect on RMAT generated Graph500 graphs...... 112

5.11 Distribution based on [0, 1]^2 on p = 4 processes and for N = 12. 1/p > d mandates that vertices in a process can only have edges with vertices owned by its up or down neighbor. The blocks between the parallel lines indicate vertices owned by a process...... 114

5.12 Communication volume, in terms of minimum send/recv message sizes (in bytes) exchanged between pairs of processes, of the single-phase Louvain implementation with basic RGG input vs RGG with random edges using 1024 processes. Adding extra edges increases overall communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids; the top-left corner represents id zero for both sender and receiver. Byte sizes vary from 8 (blue) to 32 (red) for the figure on left, and from 8 (blue) to 3000 (red) for the figure on right...... 115

5.13 Communication volumes (in terms of send/recv invocations, and mean send/recv message sizes exchanged between processes) of single-phase Louvain method and Graph500 BFS for 134M vertices on 1024 processes. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids; the top-left corner represents id zero for both sender and receiver. Blue represents the minimum and Red represents maximum volume for each of the figures at different minimum and maximum values (communication patterns are important)...... 116

5.14 Performance of four real-world graphs using different memory modes on KNL nodes of ALCF Theta (for the default quadrant clustering mode). X-axis: Number of processes; Y-axis: Execution time (secs.) in log-scale...... 117

5.15 The relative performance profiles for cache, equal, flat and split memory modes on Theta KNL nodes using a subset of inputs. The X-axis represents the factor by which a given scheme fares relative to the best performing scheme for that particular input. The Y-axis represents the fraction of problems. The closer a curve is aligned to the Y-axis the superior is its performance relative to the other schemes over a range of 40 inputs...... 118

5.16 A sample ball bearing graph consisting of a large component (referred to as ball) with 128 vertices and two small components (referred to as bearings) each with 9 vertices. Modularity based methods including Louvain fail to designate the small components into individual communities, and treat them as a single community. . . 123

6.1 Subset of a process neighborhood and MPI-3 RMA remote displacement computation. Number of ghost vertices shared between processes are placed next to edges. Each process maintains two O(neighbor) sized buffers (only shown for P7): one for storing prefix sum on the number of ghosts for maintaining outgoing communication counts, and the other for storing remote displacement start offsets used in MPI RMA calls. The second buffer is obtained from alltoall exchanges (depicted by arrows for P7) of the prefix sum buffer among the neighbors...... 131

6.2 Example of maximum weight matching; edges {0,2} and {1,4} are in the matching set ...... 132

6.3 Communication volumes (in terms of Send-Recv invocations) of MPI Send-Recv baseline implementation of half-approx matching using Friendster (1.8B edges) and Graph500 BFS using R-MAT graph of 2.14B edges on 1024 processes. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids...... 133

6.4 Communication contexts depicting different scenarios in distributed-memory half-approx matching. If y is a vertex, then y′ is its “ghost” vertex...... 136

6.5 Weak scaling of NSR, RMA, and NCL on synthetic graphs ...... 145

6.6 Strong scaling results on 1K-4K processes for different instances of Protein K-mer graphs ...... 145

6.7 Performance of RMA and NCL on social network graphs ...... 146

6.8 Rendering of the original graph and RCM reordered graph expressed through the adjacency matrix of the respective graphs (Cage15 and HV15R). Each non-zero entry in the matrix represents an edge between the corresponding row and column (vertices)...... 147

6.9 Comparison of original and RCM reordering on 1K/2K processes ...... 148

6.10 Communication volumes (in bytes) of original HV15R and RCM reordered HV15R. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids...... 149

6.11 Performance profiles for RMA, NCL and NSR using a subset of inputs used in the experiments. The X-axis shows the factor by which a given scheme fares relative to the best performing scheme. The Y-axis shows the fraction of problems for which this happened. The closer a curve is aligned to the Y-axis the superior its performance is...... 150

6.12 Communication volumes (in terms of bytes exchanged) of baseline implementation of half-approximate matching and Graph500 BFS, using R-MAT graph of 134.2M edges on 1024 processes...... 154

7.1 Original (left) vs Balanced (right) graph edge distribution of soc-friendster for graph clustering (running the first phase only) across 1K processes of NERSC Cori. 158

CHAPTER 1 INTRODUCTION

In this chapter we provide the motivation and rationale for this dissertation. We discuss the state of current High Performance Computing (HPC) systems, and the challenges of irregular, memory-access-intensive applications that are representative of many graph analytic workloads. This dissertation explores high-level abstractions over asynchronous distributed-memory programming models for easing the development of science applications in general, and for improving the efficiency of graph analytics by focusing on the communication model and approximate computation techniques.

1.1 Hardware trends

Throughout the last decade, there has been an approximately 10-50× increase in the core count of processors in HPC systems. Arithmetic operations are getting cheaper due to the many processor cores, but on-node and off-node (over the network) memory accesses still lag behind by more than 100× compared to the cost of an arithmetic operation. At the same time, single-thread performance has improved by 5-8× due to optimizations in the compiler toolchain. The relief that compiler improvements bring to single-thread/serial performance is only temporary; serial performance has already started to stagnate and is expected to remain so in the near future. Figure 1.1 shows these trends.

1.2 Power consumption governs future system design

It is speculated that if future HPC system design follows the present rate of progress, then an exascale computer (capable of 10^18 operations per second) would consume about 70 megawatts of power (about 3-7× more than current supercomputers) [168], making it untenable to support the exorbitant power bill. Improving the power efficiency of current and next-generation supercomputers is considered one of the grand challenges that would have an enormous impact on computation and humanity [17]. Currently, each of the top 20 supercomputers in the world consumes more than a megawatt of power (in contrast, the first general-purpose computer circa 1946, ENIAC [79], consumed 150 kW), costing more than a million dollars in electricity bills per machine. Table 1.1 lists the average 2017 power bill for U.S. census regions (data from the U.S. Energy Information Administration [2]), compared to the power consumption of a supercomputer in the same region. It is evident that running such a supercomputer for an hour uses about 2-10× more energy than the average monthly household power consumption¹.

Keeping power/energy efficiency in mind, the next generation of HPC nodes will comprise many low-frequency processor cores, dramatically increasing the amount of available parallelism. This means that the network infrastructure will be shared by many more cores than in present architectures, and it will be necessary to use all the available cores to drive the network performance. Therefore, communication methods with implicit synchronization² will be expensive. Hence, there is a need to explore asynchronous distributed-memory programming models.

[Figure 1.1: Processor trends for the past five decades ("42 Years of Microprocessor Trend Data"): transistors (thousands), single-thread performance (SpecINT × 10^3), frequency (MHz), typical power (Watts), and number of logical cores, 1970-2020. Original data up to 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; data for 2010-2017 collected by K. Rupp. Figure courtesy: Rupp et al. [166].]

¹ https://www.eia.gov/electricity/sales_revenue_price/pdf/table5_a.pdf
² A synchronization mechanism indicates when a communication operation starts and when it ends.

Table 1.1: 2017 average residential electricity usage across the United States, compared with power consumption of a regional supercomputer.

Census region | Average price of power in 2017 (cents per kWh) | Power consumption (in kW) of the fastest regional supercomputer | Average monthly residential power consumption (kWh)
New England | 16.52 | 1000 (TX-Green, MIT) | 603
Middle Atlantic | 12.63 | 3575.63 (IBM) | 668
East North Central | 10.12 | 3945 (Mira, ANL) | 745
West North Central | 9.77 | - | 889
South Atlantic | 9.96 | 1465.8 (Excalibur, DOD) | 1055
East South Central | 9.29 | 9783 (Summit, ORNL) | 1120
West South Central | 8.26 | 7200 (Stampede2, TACC) | 1107
Mountain | 9.51 | 1727 (Cheyenne, NCAR) | 841
Pacific Contiguous | 13.28 | 7438.3 (Sierra, LLNL) | 671
Pacific Noncontiguous | 23.28 | - | 544

1.3 Irregular application challenges

Compute-intensive applications (such as dense matrix operations) have the ability to overlap computation and asynchronous communication, among other compiler-based enhancements such as vectorization and loop optimizations, to derive a higher efficiency on modern HPC systems. In contrast, applications that are prone to irregular memory accesses are comparatively difficult to optimize. Among irregular applications with little computational need, graph applications are particularly challenging to optimize on distributed-memory systems. The availability of large-scale datasets [164, 122, 21] has led to the emergence of graph analytics as an important activity on modern computer systems. However, due to the irregular nature of memory accesses, high ratios of communication to computation, and the inherently bulk-synchronous pattern of a number of graph algorithms (owing to an implicit synchronization point at the end of an iteration), graph algorithms pose significant challenges for efficient implementation on parallel systems [129]. Many complex real-world problems can be represented and studied using graphs – social networks, bacterial growth, modeling the human brain connectome, image recognition, designing smart electricity grids, and many more. A number of graph and combinatorial algorithms exhibit irregular memory accesses. Moreover, many graph problems are classified as NP-hard in computational complexity studies; in other words, no polynomial-time solutions are known. Even when polynomial-time solutions exist, they are impractical as the problem sizes increase by orders of magnitude. Therefore, in practice, a number of such problems are implemented using approximation algorithms [105]. In the context of implementing efficient graph applications, approximate computing techniques and heuristics that trade off output quality with performance are frequently applicable.

Apart from their analytical capabilities, graph algorithms are also used in intermediate computation stages of a number of science applications [147, 23, 22, 73].

1.4 Using sparse linear algebra for graph applications

A popular mantra influencing performance optimization for HPC systems is: “computation is relatively cheap and memory accesses are the bottleneck”. A workaround to this limitation that is gaining traction in the graph analytics community involves transforming the original problem from a graph vertex/node-centric formulation to the linear algebra space [9, 113, 112]. The rationale for doing so is that by converting the problem from the memory access domain to an arithmetic operations domain, it is possible to utilize the processor cores uniformly, making memory accesses more regular and predictable. However, sparse linear algebra is an active research area, and efficient distributed-memory sparse linear algebra implementations are rare. In practice, depending on a sparse linear algebra library would require either conforming to its sparse matrix distribution or performing data re-distributions, which can be constraining for an application. Moreover, explicit linear algebra formulations of some common graph methods are excessive, as they increase instruction and memory usage (which may increase power consumption). For instance, extracting a subgraph from a larger graph using linear algebra entails sparse matrix multiplications, which can instead be implemented using simpler schemes.

Thus, in order to attain sustainable performance on next-generation HPC systems, this dissertation focuses on both the communication model, to reduce the overhead of data transfers, and algorithmic adjustments, to optimize the underlying resource utilization.

1.5 Motivation

A naïve assumption that could be made from Figure 1.1 is that the performance of applications is bound to increase with the number of processor cores (increasing the overall memory throughput or bandwidth), owing to the availability of more work units to share the computational workload. This is unfortunately not the case, as a number of scientific applications are dominated by irregular memory accesses and little computation, making it difficult to efficiently utilize current HPC systems. The key difference between the two classes of parallel applications (compute intensive, and prone to irregular memory accesses) is evident upon exploring present supercomputers through the yardstick of standard benchmarks. Figure 1.2 lists the top 50 supercomputers based on their relative performances using two prominent benchmarks, depicting the aforementioned classes of parallel applications. In general, while Figure 1.1 shows the hardware trends for the last five decades, Figure 1.2 compares the performance of two representative scientific application classes on current supercomputers.

The first benchmark is High Performance LINPACK (HPL) [59], which uses level-3 BLAS³ operations for performing Gaussian elimination with partial pivoting [57] on large matrices, typifying a compute-intensive workload. Supercomputers are ranked based on their HPL performances in terms of floating-point operations per second in the Top 500 list [136]. The second benchmark is Breadth First Search (BFS), a fundamental graph algorithm for traversing the vertices of a graph. In BFS, every vertex of a graph is visited starting from an arbitrary “root” vertex, exploring all neighboring vertices in the current level (i.e., edges incident on a vertex) before moving on to vertices in the next level. BFS is characterized primarily by random accesses to memory locations, and does not involve any arithmetic operations. Supercomputers are ranked based on their BFS performances in terms of traversed edges per second in the Graph 500 list [140].

HPL resembles dense regular workloads, whereas BFS deals with sparse irregular workloads. Each of these benchmarks emphasizes either computation or memory accesses. From Figure 1.2, the thousand-fold performance rift between regular compute-intensive and irregular data-driven approaches is evident. Ideally, in Figure 1.2 the performance line should be well above the number of cores. For graph workloads, even the simplest benchmark (i.e., BFS) is unable to attain acceptable performance and the requisite efficiency. In fact, following the processor trends in Figure 1.1, Figure 1.2 conveys that machine improvements do not always lead to sustainable application performance.

³ Basic Linear Algebra Subprograms (BLAS) provides standard building blocks for matrix operations [58].

[Figure 1.2: Left: Evaluation of the first 50 supercomputers in the Top 500 list using HPL [136]; performance evaluated by Floating-Point Operations Per Second (FLOPS). Right: Evaluation of the first 50 supercomputers in the Graph 500 list using Breadth First Search (BFS); performance evaluated by Traversed Edges Per Second (TEPS) [140]. Both plots also show the total number of cores and the execution time (in mins.). Plot data collected from the respective Top500/Graph500 lists for November 2017.]
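To make the contrast concrete, the following is a minimal serial sketch of level-synchronous BFS over an adjacency-list graph (an illustration only, not the Graph 500 reference implementation): the loop body performs only data-dependent memory accesses and no floating-point work.

// Minimal serial sketch of level-synchronous BFS (illustration only).
// The graph is assumed to be stored as adjacency lists; 'root' is arbitrary.
#include <cstdint>
#include <queue>
#include <vector>

std::vector<int64_t> bfs_levels(const std::vector<std::vector<int64_t>>& adj, int64_t root) {
    std::vector<int64_t> level(adj.size(), -1);   // -1 marks unvisited vertices
    std::queue<int64_t> frontier;
    level[root] = 0;
    frontier.push(root);
    while (!frontier.empty()) {
        int64_t u = frontier.front();
        frontier.pop();
        // Visiting each neighbor is an irregular, data-dependent memory access;
        // there is no arithmetic work beyond the level increment.
        for (int64_t v : adj[u]) {
            if (level[v] == -1) {
                level[v] = level[u] + 1;
                frontier.push(v);
            }
        }
    }
    return level;
}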

1.5.1 Distributed-memory applications and Message Passing Interface

The increasing volumes of data pose an unprecedented need for larger computational resources and efficient parallel algorithms to solve problems at extreme scale. As a consequence, data is usually spread across the compute nodes of a cluster, requiring intermediate communication steps between the nodes during application runs. Although modern network interconnects have microsecond latency for transferring a byte between any two nodes, communication is usually the most expensive part of a number of applications running on supercomputers.

Message Passing Interface (MPI) [189] is the de facto distributed-memory programming model used by scientific applications for data communication on the widest variety of HPC platforms. MPI is a standard specification, and it supports the various data communication patterns encountered by applications, such as two-sided message passing, collective and one-sided operations. One of the most popular and versatile communication models in MPI is two-sided message passing, a.k.a. Send/Recv. In the MPI two-sided interface, a Send operation initiated by a particular process must be matched by a Recv on another process. In contrast, the MPI one-sided interface extends the communication mechanisms of MPI by allowing one process to specify all communication parameters, both for the sending side and for the receiving side. This flexibility allows for more asynchrony in applications, and in certain cases improves the overall performance significantly. Figure 1.3 shows a generic layout of a scientific application on a distributed-memory system. The communication subsystem usually comprises multiple layers of software that add to the data transfer overhead. Software complexity increases when low-level interfaces are used, whereas productivity increases when applications make use of higher-level abstractions, at the expense of a tighter software dependency.
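As a minimal illustration of the two-sided pattern described above (not code from this dissertation; the message size is hypothetical), the sketch below pairs a Send on one process with a matching Recv on another, so both sides must participate in the transfer.

// Minimal sketch of MPI two-sided message passing: a Send posted by rank 0
// is matched by a Recv posted by rank 1 (run with at least two processes).
#include <mpi.h>
#include <vector>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1024;                               // hypothetical message size
    std::vector<double> buf(n, static_cast<double>(rank));

    if (rank == 0)
        MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}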

[Figure 1.3 diagram: scientific applications layered over parallel toolkits, solvers, and linear algebra libraries, which in turn sit on communication middleware, MPI, and the network; productivity increases toward the higher layers while complexity increases toward the lower layers.]

Figure 1.3: Generic layout of scientific applications with communication layers.

1.5.2 One-sided communication model

A number of applications can benefit from detaching communication and synchronization (a.k.a. the one-sided model); however, due to the relative recency and restrictive preliminary design of the MPI one-sided interface (discussed in Chapter 2), its adoption has been slow. Despite the issues raised by users over the years about the complexity of MPI, it remains one of the most successful parallel programming models at present, due to its singular commitment toward portability and performance. Other communication runtimes with functionalities complementary to MPI have been available for as long as MPI has existed; however, there are many challenges to leveraging them in practice. Due to the intrinsic complexity of low-level networking interfaces, it requires significantly more effort to use them directly for communication from an application. Another option is to use Partitioned Global Address Space (PGAS) communication middleware. For certain specific cases, these middleware libraries may provide slightly better performance than an MPI implementation. However, switching to an external PGAS library is often impractical for scientific software that uses MPI, since MPI already provides the necessary one-sided communication facilities that applications need for optimizing their communication pipeline. Moreover, using an external communication library in conjunction with MPI in an application may also increase its resource usage, due to the existence of separate runtimes, and cause silent interoperability issues.

1.5.3 Approximate computing techniques

Apart from improving the communication pipeline, it is equally important to analyze approximate computing strategies that trade off quality for performance [139]. For certain scientific applications, invoking such techniques can be as straightforward as choosing reduced or mixed precision arithmetic over full precision (affecting the number of bits used to store the exponent and mantissa of a floating-point number) during certain parts of the computation. For instance, quarter/half precision floating-point arithmetic requires only 8/16 bits of storage, compared to regular floating-point arithmetic requiring 32/64 bits. However, for graph applications that perform few arithmetic operations, such general measures may not yield any notable benefit. Redesigning the algorithm might be the only option to avoid communication for applications that are characterized by irregular memory accesses. A desirable secondary effect of algorithm redesign for communication avoidance can be a reduced memory footprint and power consumption. A number of graph algorithms are implemented using heuristics and approximate computing techniques. Heuristics provide certain flexibilities in invoking communication-avoiding optimizations that may also significantly reduce the resource utilization of a system. Table 1.2 shows the energy, power and memory consumption of five parallel approximate computing variants of the Louvain method for graph clustering/community detection (discussed in Chapter 5), and compares them with the parallel baseline version. It can be observed that lower memory traffic corresponds to lower power/energy consumption.

Table 1.2: Energy, Power and Memory usage of five approximate computing variants of the distributed-memory implementation of the Louvain method for graph community detection (Chapter 5).

Versions | Memory (MB) | Energy (kJ) | Power (kW) | Memory Traffic (GB)
Baseline | 867.6 | 9354.25 | 15.82 | 1633
Approximate #1 | 867.3 | 4625.91 | 14.65 | 755
Approximate #2 | 875.6 | 5740.31 | 15.01 | 829
Approximate #3 | 893.5 | 10924.59 | 15.60 | 1581
Approximate #4 | 1026.4 | 3149.93 | 14.02 | 522
Approximate #5 | 1025.6 | 2850.92 | 14.67 | 520

1.5.4 Summary

In summary, the motivation for this dissertation comes from:

• Communication models: The need for exploring models that separate communication from synchronization, and for identifying other communication strategies that can improve the communication pipeline of applications. It is also effective to identify common asynchronous communication patterns and develop high-performance abstractions to enhance the productivity of communication models that are commonly used in applications.

• Graph analytics: The dominant applications running on modern HPC systems are linear algebra based, favoring large numbers of arithmetic operations to keep processor cores busy and drive memory bandwidth. However, a new class of applications, recognized by irregular memory accesses instead of floating-point operations, is gaining prominence: graph analytics. For such applications, it is crucial to examine strategies that can improve efficiency at extreme scales.

1.6 Contributions

The contributions of this dissertation are briefly summarized below:

• In Chapter 3, we discuss the design and development of an asynchronous communication interface for matrix transfers in a state-of-the-art distributed-memory linear algebra library called Elemental. In order to facilitate unbiased performance evaluation, we first improve the synchronization scheme of the existing two-sided matrix transfer interface in Elemental by adopting a consensus mechanism using a nonblocking barrier. Then, we design a one-sided interface for matrix transfers that uses derived types internally to aggregate data. Overall, the new one-sided interface improved the performance of the existing matrix transfer interface by 5-40×.

• MPI RMA is a low-level API, and it takes major effort to familiarize oneself with its nuances. Chapter 4 introduces RMACXX, a high-level object-oriented modern C++ abstraction over MPI RMA. RMACXX is designed for users who are not well-versed with MPI, and makes developing applications using a one-sided interface more productive than existing low-level interfaces such as MPI or SHMEM. The added convenience of RMACXX does not come with performance penalties; it adds only about 20 instructions to the critical path for standard usage. Also, RMACXX allows combining one-sided communication with entrywise arithmetic operations in the form of RMACXX expressions.

• Chapter 5 explores the design and implementation of a graph clustering or community detection method. The goal of community detection is to partition a network into “communities” such that each community consists of a tightly-knit group of nodes with relatively sparser connections to the rest of the nodes in the network. To compute clustering on large-scale networks, efficient parallel algorithms capable of fully exploiting features of modern architectures are needed. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. We present a distributed-memory parallel implementation of the Louvain method, a widely used serial method for community detection. In addition to a baseline parallel implementation of the Louvain method, we discuss a number of approximate methods that significantly improve performance while preserving solution quality. We also introduce a proxy application that mimics our distributed Louvain implementation, and serves as a sandbox for designing the distributed Louvain method with diverse communication models and for measuring the impact of approximate methods on different input graphs. The proxy application is now part of the Exascale Proxy Applications suite, contributing to the co-design efforts of the US DOE Exascale Computing Project.

• Chapter 6 discusses distributed-memory implementations of a half-approximate algorithm for the graph matching problem. A matching in a graph is a subset of edges such that no two matched edges are incident on the same vertex. A maximum weight matching is a matching of maximum weight, computed as the sum of the weights of matched edges. Execution of graph matching is dominated by a high volume of irregular memory accesses, making it an ideal candidate for studying the effects of various MPI communication models on graph applications at scale. We investigated the performance implications of designing half-approximate graph matching using the MPI-3 RMA and neighborhood collective models, and compared them with a baseline Send-Recv implementation, using a variety of synthetic and real-world graphs. We also explored the impact of graph reordering on communication patterns by reducing sparse matrix bandwidth using the Reverse Cuthill-McKee algorithm.

1.7 Publications

Part of this dissertation is based on the following peer-reviewed publications:

1. Sayan Ghosh, Mahantesh Halappanavar, Ananth Kalyanaraman, Arif Khan, Assefaw Gebremedhin. Exploring MPI Communication Models for Graph Applications Using Graph Matching as a Case Study. 33rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2019).

2. Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Assefaw Gebremedhin. miniVite: A Graph Analytics Benchmarking Tool for Massively Parallel Systems. Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS 2019).

3. Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Assefaw Gebremedhin. Scalable Distributed Memory Community Detection Using Vite. 22nd IEEE High Performance Extreme Computing Conference (HPEC 2018).

4. Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Hao Lu, Daniel Chavarría-Miranda, Arif Khan, Assefaw Gebremedhin. Distributed Louvain Algorithm for Graph Community Detection. 32nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2018).

5. Sayan Ghosh, Jeff Hammond, Antonio J. Peña, Pavan Balaji, Assefaw Gebremedhin, Barbara Chapman. One-Sided Interface for Matrix Operations using MPI-3 RMA: A Case Study with Elemental. 45th International Conference on Parallel Processing (ICPP 2016).

1.8 Dissertation organization

The rest of the dissertation is organized as follows. Chapter 2 reviews the MPI one-sided programming model. Chapter 3 discusses the performance and programmability implications of designing a general one-sided interface for remote matrix operations in an existing linear algebra library. Chapter 4 presents a compact set of C++ abstractions over the MPI one-sided programming model to enhance the productivity of developing parallel applications with MPI without sacrificing performance. Chapter 5 presents the design and analysis of the distributed-memory implementation of the Louvain method for graph community detection. Chapter 6 studies the relatively underutilized communication models of MPI for graph applications, using distributed-memory implementations of the half-approximate graph matching algorithm as a case study. Finally, Chapter 7 concludes this dissertation.

CHAPTER 2 BACKGROUND ON MPI ONE-SIDED COMMUNICATION

2.1 Introduction

In the MPI two-sided interface, a Send initiated by a particular process will be matched by a Recv on another process. In contrast, the MPI one-sided programming model, or Remote Memory Access (RMA), decouples communication from synchronization. By eliminating the message matching restriction, RMA offers the possibility of lower latency for small data transfers as compared to the classic MPI two-sided model. Figure 2.1 shows a cartoon comparing the classic blocking two-sided model and the nonblocking one-sided communication model of MPI.

In MPI RMA, one process specifies all the communication parameters, both for the sending side and for the receiving side. To achieve this, every process involved in an RMA operation needs to expose part of its memory such that other processes can initiate one-sided data transfers targeting the memory of any particular process. The exposed memory is referred to as an MPI window. After creation of an MPI window, processes can initiate nonblocking put/get one-sided operations. For synchronization, users need to invoke a flush operation, which completes all outstanding RMA operations. Figure 2.2 illustrates MPI RMA as compared to the MPI two-sided model.

Two categories of MPI RMA process synchronization exist: active and passive. In active target communication, both origin and target processes are involved in the communication, whereas in passive target communication, only the origin process is involved in the data transfer. In this dissertation, we use passive target synchronization, since it is better suited for arbitrary accesses to different target processes [100]. Interested readers are recommended to review materials that provide a comprehensive description of the data and synchronization models of MPI RMA [86, 67, 84, 100].

RMA is not a general term for one-sided programming/communication; rather, it is used exclusively in the context of MPI.


Figure 2.1: MPI two-sided versus one-sided communication (RMA) operations

Figure 2.2: MPI Remote Memory Access (RMA) versus MPI two-sided Send/Recv. Each process exposes some memory (window) before issuing one-sided put/get communication calls.
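The following minimal sketch (an illustration of the standard MPI-3 calls under the passive-target model, not code taken from this dissertation) follows the sequence described above: each process exposes a window, issues a nonblocking put into a neighbor's window, and completes it with a flush.

// Minimal sketch of MPI-3 RMA with passive-target synchronization:
// allocate a window, put into the right neighbor's window, flush to complete.
#include <mpi.h>

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 8;                       // hypothetical window size (in doubles)
    double* base = nullptr;
    MPI_Win win;
    // Expose n doubles of this process's memory as an RMA window.
    MPI_Win_allocate(n * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    // Passive target: lock all windows once, then communicate without
    // involving the target processes in individual transfers.
    MPI_Win_lock_all(MPI_MODE_NOCHECK, win);

    double value = static_cast<double>(rank);
    int target = (rank + 1) % nprocs;
    // Nonblocking one-sided put into offset 0 of the target's window.
    MPI_Put(&value, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    // Flush completes all outstanding RMA operations issued to 'target'.
    MPI_Win_flush(target, win);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}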

MPI RMA provides a library interface over low-level networking APIs for moving data from local to remote memory. In the forthcoming sections, we briefly discuss some of the crucial aspects of the MPI one-sided model, such as the networking technology enabling one-sided communication in general (Section 2.2), the RMA memory model (Section 2.3), and the evolution of MPI RMA over the years (Section 2.4).

2.2 Remote Direct Memory Access

Interconnects are the central nervous system of a supercomputer. Broadly, interconnects comprise four parts: links (wires carrying the data), switches/routers (connecting parts of the interconnection network), channels (logical connections between the switches) and network interfaces. Network interfaces or adapters connect processor memory to the network, enabling the decoupling of computation and communication. Remote Direct Memory Access (RDMA) is a transport protocol [159] that allows direct memory access from the memory of a host (compute/storage node) to another without involving the network software stack or the CPU (a.k.a. zero-copy). Moreover, remote caches are unaffected, as the accessed memory contents of the remote CPU are not loaded into its cache, leaving the cache available for applications. Thus, RDMA frees up resources such as processors to pursue computational tasks without any interruptions to handle the data movement. RDMA technologies such as InfiniBand [5], iWARP (internet Wide Area RDMA Protocol) [41] and RoCE (RDMA over Converged Ethernet) [40] provide the high-throughput and low-latency data transfers required by modern data center traffic.

The ubiquity of RDMA-capable network interfaces in HPC systems is another reason to consider one-sided programming models. Low-level operations in the RDMA model such as read/write and atomics (fetch-and-op and compare-and-swap) map directly to high-level one-sided communication interfaces. RDMA also offers standard two-sided send/receive operations. Before initiating transfers between the memories of discrete compute nodes, the memory containing user buffers needs to be “registered” by the network driver software. Memory registration is a nontrivial process that includes pinning the physical memory pages (locking pages into physical memory in a way that prevents page migration, so that the memory remains at a fixed physical location) and accessing the on-chip memory of the interconnect to store the pinned memory region. If the overhead associated with registration/de-registration exceeds the overhead associated with a user-space buffer copy, then RDMA send/receive is usually used internally; otherwise, RDMA writes/reads can be employed for larger message transfers. The designated limit differentiating small and large messages decides the underlying RDMA operations used in data transfers, and is normally platform dependent. Liu et al. [125] and Sur et al. [175] discuss the design and implementation of MPI over RDMA in detail.

2.3 Memory model

Before discussing the MPI RMA memory model, we briefly explore the state of current memory systems in computer architectures.

2.3.1 Memory consistency

Standard von Neumann-based computer architectures make use of a memory hierarchy, usually consisting of the shared main memory, scratchpad memory (private and/or shared), and private caches. The memory hierarchy is instrumental in improving the latency of frequently used data accesses and in optimizing the overall memory access bandwidth of a compute node. The multiprocessor architecture and the software framework work in tandem to ensure that the issued loads/stores follow an expected order resulting in a correct program, respecting the underlying memory consistency model, with or without explicit user actions, as mandated by the platform. A cache-coherent architecture enforces certain conditions on loads/stores to maintain uniformity of the shared data resident in private caches throughout the computation. A software/runtime may have to perform some extra work to ensure memory consistency, which can actually achieve higher performance than hardware-based solutions that enforce a stronger consistency. For example, an x86 mfence instruction serializes all the loads/stores to

1Pinning allows locking pages into physical memory, in a way that prevents page migration such that the memory will be present at a fixed physical location.

memory prior to its invocation. It can be used by a low-level runtime such as MPI for data synchronization. In certain cases, due to the existence of separate memories or address spaces, maintaining consistency is harder. For instance, accelerators such as General-Purpose Graphics Processing Units (GPGPUs) have a memory subsystem separate from the CPU. Until recently [167], moving data across GPUs over a network required intermediate data copies involving the CPU, the GPU, and the network interface. Some architectures do not support cache coherence in hardware (such as NEC SX machines [89]). Hence, it is the responsibility of the runtime to invoke instructions for flushing or invalidating the cache, so that the updated data is reflected in memory. Communication runtimes such as MPI that aim to be portable across the widest range of platforms must consider all of these cases during interface standardization efforts.
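The portable C++ analogue of such a fence is std::atomic_thread_fence. The sketch below is illustrative only (not from the dissertation): it shows the producer/consumer ordering a runtime typically needs, where the sequentially consistent fence maps to an mfence-class instruction on x86.

#include <atomic>

// The full fence orders the payload store before the flag store, mirroring
// how a runtime might publish data for a later one-sided access.
static double payload = 0.0;
static std::atomic<bool> ready{false};

void producer() {
    payload = 42.0;                                      // write the data
    std::atomic_thread_fence(std::memory_order_seq_cst); // ~ mfence on x86
    ready.store(true, std::memory_order_relaxed);        // publish the flag
}

void consumer() {
    while (!ready.load(std::memory_order_relaxed)) { /* spin */ }
    std::atomic_thread_fence(std::memory_order_seq_cst);
    // payload is guaranteed to be visible here.
}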

2.3.2 MPI RMA memory model

We now discuss the interactions of the MPI RMA interface with the memory regions of general platforms. The memory of a machine can be broadly classified as either local to a process or addressable by all the processes. Usually, machines have a memory shared by the processor cores (i.e., main memory on multiprocessors) and processor-local private memories (i.e., caches and/or scratchpad memory). Cache-coherent systems have hardware support to ensure consistency between the data in main memory and the process-local private copies of that data, which is necessary for maintaining data integrity. However, systems that do not support hardware cache coherence require explicit data synchronization statements to ensure uniformity of the data in main memory with the process-local copies. Accordingly, MPI RMA proposes two memory models: unified, in which the public and private copies of the MPI window are identical, and separate, otherwise. Process-local loads/stores access the private copy of the window, whereas the put/get one-sided operations are applied to the public window. Figure 2.3 shows the distinction between public and private windows.

Figure 2.3: Unified (left) and Separate (right) memory models of MPI RMA. Figure courtesy: Hoefler et al. [100].

Systems supporting hardware cache coherence can make use of the unified model. For the separate memory model, synchronization between the private and public copies of the window is required to maintain a consistent view of the data. The separate model is intended for systems that do not support coherence in hardware, where coherence is maintained by the user through explicit synchronization statements. The RMA unified memory model is used for all the cases in this dissertation.
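For completeness, an application can check which model a given window provides at run time by querying the MPI_WIN_MODEL attribute, as in the following short sketch.

#include <mpi.h>
#include <cstdio>

// Report whether a window follows the unified or the separate memory model.
void report_memory_model(MPI_Win win) {
    int *model = nullptr;
    int flag = 0;
    MPI_Win_get_attr(win, MPI_WIN_MODEL, &model, &flag);
    if (!flag) {
        std::printf("memory model attribute not available\n");
        return;
    }
    std::printf("%s memory model\n",
                (*model == MPI_WIN_UNIFIED) ? "unified" : "separate");
}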

2.4 MPI-2 to MPI-3 RMA

MPI RMA was first introduced in 1997, as part of the MPI-2 standard specification. Prior to the relatively recent release of MPI-3 (circa 2012), MPI RMA lacked important atomic operations (e.g., fetch-and-add and compare-and-swap), had an inconvenient synchronization model (including a lack of separation between local and remote completion of a one-sided communication operation), and supported only the "separate" memory model (detached public and private windows), which made the MPI one-sided communication routines unattractive for one-sided operations in applications. However, despite the inconvenience, such a model was indeed required for developing applications on non-cache-coherent systems [194]. Bonachea and Duell [24] discuss the limitations of MPI-2 RMA as compared with contemporary PGAS models. The decisions that led to the design of the MPI RMA memory model were largely a consequence of the HPC ecosystem of the time. The rationale influencing the initial

design of MPI RMA was to support prevalent architectures that required some actions from the users to ensure memory consistency. Section 2.3.1 reviews some of the available software/hardware options that need to be considered when designing a generic one-sided interface. Tipparaju et al. [179] and Hoefler et al. [100] discuss the design constraints of MPI RMA with respect to memory consistency. It is difficult for the MPI Forum to accurately anticipate user preferences when designing a portable interface that could be universally adopted for building parallel scientific applications. MPI-2 RMA was not popular, as most users expected a PGAS-like interface. Moreover, the explicit synchronizations in MPI-2 RMA (owing to the separate RMA model, which was the only choice) added unnecessary overheads to the data transfers. Despite the improvements brought forth by MPI-3 RMA, users have contrasting opinions on memory consistency, making it challenging to come to a consensus on the quality of the design. While some view memory consistency as an absolute necessity that should be enforced by the hardware, others are reluctant to accept the synchronization overheads and prefer a weaker consistency model.

2.5 Chapter summary

This chapter presented a brief background on the MPI one-sided programming model, or Remote Memory Access (RMA). We covered the low-latency networking technology, namely RDMA, that provides the transport-layer acceleration for enhancing the performance of MPI RMA communication operations. We also introduced the memory model of MPI RMA, in the context of the consistency requirements of general computer architectures. MPI-2 RMA was unpopular due to its restrictive interface, while MPI-3 RMA strengthens the interface by providing convenient options to users, making RMA similar to existing PGAS models.

CHAPTER 3 ONE-SIDED INTERFACE FOR MATRIX OPERATIONS USING MPI: A CASE STUDY WITH ELEMENTAL

3.1 Introduction

Many scientific applications are a mixture of regular and irregular computations. For example, in quantum chemistry, methods such as density functional theory (DFT) are composed of a highly irregular step of forming the Fock matrix, which requires dynamic load balancing and unstructured communication in order to utilize all the processing elements, followed by parallel dense linear algebra to diagonalize this matrix. Other application domains have similar patterns, combining domain-specific matrix-formation steps with standard linear algebra procedures. It is critical to allow application developers to combine domain-specific code with the best available dense linear algebra libraries without compromising performance by restricting the data layouts or communication patterns they can use in the domain-specific parts of their code. Historically, the Global Arrays library (GA) [144] has met this need in quantum chemistry applications by providing a library that supports dense array data structures, a set of one-sided communication primitives that support arbitrary subarray access patterns, the necessary features for dynamic load balancing, and an interface to the parallel dense linear algebra capabilities of ScaLAPACK [39] (and, more recently, ELPA [133]). Thus, the domain scientist is able to write an efficient Fock matrix formation code by reading and updating distributed arrays, then calling the dense eigensolver, Cholesky, or other procedures from ScaLAPACK, without having to know anything about the ScaLAPACK interface. An alternative approach to using GA, which is implemented in linear algebra libraries such as PETSc [14], PLAPACK [185], and Elemental [153], involves queuing up updates to a distributed array locally, then completing them with a collective operation.1 This has the desirable property

1See http://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Mat/

of being highly portable, since it can be implemented by using two-sided messaging, such as MPI Send-Recv. However, the distributed data structure cannot be touched until the collective step happens, and there is no opportunity for overlapping communication and computation, because, unlike one-sided operations in GA, unmatched messages cannot complete asynchronously.2 With the relatively recent release of the MPI-3 standard [135], the communication primitives required to implement GA are widely available in both open-source and proprietary implementations of MPI. Furthermore, new distributed dense linear algebra libraries such as Elemental and DPLASMA [26] have emerged as modern alternatives to ScaLAPACK. These changes motivate a fresh investigation of the interplay between irregular computations and dense linear algebra computations. We seek an answer to the following question:

Can we achieve similar performance as Global Arrays and ScaLAPACK while pro- viding more flexible interfaces, support for different matrix distributions, and greater portability through MPI-3?

GA provides a convenient data access interface that uses high-level array indices; it also provides a rich lower-level interface for managing data distribution, exploiting locality, and managing communication. GA is built on top of the Aggregate Remote Memory Copy Interface (ARMCI), a low-level one-sided communication runtime system. ARMCI has implementations for several vendor-specific network conduits, such as uGNI and LAPI. ARMCI is compared with MPI by Dinan et al. [54]. Elemental has a similar interface for accessing portions of matrices distributed in memory, called the AXPY interface (named after the BLAS axpy routine, which performs a scaled vector addition, i.e., a × x + y, where a is a scalar and x and y are vectors). This interface provides a mechanism so that individual processes can independently submit local submatrices that will be automatically redistributed and added to the global distributed matrix. The interface also allows for the reverse: each process may asynchronously request an arbitrary subset of the global

MatSetValues.html and http://libelemental.org/documentation/0.81/core/axpy_interface.html for details on these interfaces.
2The MPI progress rules for Send apply only when a matching Recv has been posted. In the case here, the matching Recv can be posted only after information about the local queue has been communicated to the destination processes.

matrix. Both of these functionalities are effected by a single AXPY routine. Unlike GA, however, one needs collective synchronization before the global matrix or the local matrix (that is, the matrix submitted to the global matrix, or the one in which a subset of the global matrix would eventually reside) can be reused or accessed.3 This is strictly an artifact of the underlying implementation of the API using MPI point-to-point routines, and in certain applications it may require essentially replicating the matrix on every process. We note that the AXPY interface was developed at a time when the MPI RMA model was restrictive in its offerings. Therefore, we set out to redesign the Elemental AXPY interface in a way that reduces the amount of collective synchronization and brings the interface semantically closer to GA [77]. The overall purpose of this chapter is to demonstrate that MPI-3 RMA naturally fits any matrix computational model where multiple asynchronous remote operations are needed. The remainder of the chapter is organized as follows. In Section 3.2, we provide some background information and motivation for our work on Elemental. In Section 3.3, we identify some limitations of the existing asynchronous matrix update API of Elemental and propose a new API that aims to significantly improve the performance of the existing approach. In Section 3.4, we discuss implementing the new API using MPI-3 RMA and establish the need for a distributed arrays interface to increase productivity. In Section 3.5, we present performance evaluations using microbenchmarks and a quantum chemistry application. In Section 3.6, we draw conclusions and briefly discuss our future research plans.

3.2 About Elemental

Elemental is a C++ library for distributed-memory algorithms for dense/sparse linear algebra and interior-point methods for convex optimization. Similar to PLAPACK [185], Elemental was de- signed around the idea of building different matrix distributions and providing a simple API for moving a matrix from one such distribution to another throughout a computation. Elemental has a

3A way to circumvent this for the local matrix is to allocate a new matrix for every AXPY call.

thin abstraction layer on top of the necessary routines from BLAS, LAPACK, and MPI. Figure 3.1 shows the high-level organization of Elemental.

Figure 3.1: Overall structure of the Elemental library.

3.2.1 Data Distribution

One of the strong requirements for good performance of dense matrix computations is scalabil- ity. The way in which the data is distributed (or decomposed) over the memory hierarchy is of fundamental importance to scalability. Data distribution impacts the granularity of the compu- tation, which impacts the scalability and load balance. Most distributed-memory dense linear algebra packages differ in the way data distribution is performed. Unlike a few other linear al- gebra libraries that distribute contiguous blocks of data to processes (e.g., PLAPACK[185] and ScaLAPACK [39]), in Elemental the matrix distributions are designed to spread the matrix in an element-wise fashion. As depicted by Figure 3.2, in Elemental, individual elements of a matrix are distributed cyclically in column-major ordering following the 2D process-grid layout. Therefore, if there are 4 processes in total, then it is a 2×2 grid, and, elements are distributed in a round-robin fashion within each column/row of the 2D process grid. Elemental offers a number of element-wise distributions over the process grid and provides a convenient mechanism for performing basic matrix manipulations. The Matrix class builds

a 2D matrix owned only by the calling process, and the DistMatrix<T,U,V> class4 is its distributed-memory variant (U and V signify the distribution pattern in each dimension). The default matrix distribution is known as MC×MR (matrix column by matrix row), which distributes the elements of the first dimension in a round-robin fashion over each column of the process grid, and those of the second dimension in a similar way over each row of the 2D process grid. The logical 2D process grid is actually composed of different MPI communicators (for instance, there are communicators for the processes forming a row of the process grid and for those forming a column), because MPI collective communication is needed to distribute elements according to the specified distribution. Figure 3.2 shows the MC×MR logical and physical distribution of a matrix. Elemental has around 10 such distributions (distributions can also be paired). Throughout our work, we have used only the MC×MR distribution, because it leads to the best scalability for a variety of cases.

Figure 3.2: Elemental element-wise cyclic distribution (MC×MR) of an 8 × 8 matrix on a 2 × 2 process grid (4 processes). Dark borders indicate local/physical chunks corresponding to a global chunk.

4T stands for template substitution for datatypes, including complex types.
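To make the element-cyclic mapping concrete, the following small sketch (illustrative only, not Elemental's implementation) computes which process owns a global element under the MC×MR distribution and where that element lands locally.

#include <cstdio>

// Element-cyclic (MC x MR) mapping on a gridHeight x gridWidth process grid:
// global element (i, j) lives on process (i % gridHeight, j % gridWidth) at
// local coordinates (i / gridHeight, j / gridWidth), mirroring Figure 3.2.
struct Owner { int procRow, procCol, localRow, localCol; };

Owner locate(int i, int j, int gridHeight, int gridWidth) {
    return { i % gridHeight, j % gridWidth, i / gridHeight, j / gridWidth };
}

int main() {
    // 8 x 8 matrix on a 2 x 2 grid, as in Figure 3.2.
    Owner o = locate(5, 6, 2, 2);
    std::printf("element (5,6) -> process (%d,%d), local (%d,%d)\n",
                o.procRow, o.procCol, o.localRow, o.localCol);
    return 0;
}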

3.2.2 Elemental AXPY Interface

We started this work by investigating the performance and suitability of the Elemental AXPY interface, which offers functionality to add to and subtract from (or otherwise update) globally distributed matrices. The AXPY interface is implemented using MPI point-to-point communication routines. This interface has the look and feel (in terms of the API) of the Global Arrays toolkit [144], a well-known one-sided library extensively used in the NWChem computational chemistry package [184]. The AXPY interface has only three routines: ATTACH, AXPY, and DETACH; their individual functionalities are explained below.

1. ATTACH (collective) – Performs vector resizing and buffer allocation.

2. AXPY (point-to-point) – Sends or receives data to/from the globally distributed matrix based on the direction parameter specified by the user (LOCAL_TO_GLOBAL or GLOBAL_TO_LOCAL). This operation is analogous to the ARMCI/Global Arrays scaled accumulate, dst += scale * src (where dst and src are the origin buffer or the target buffer, depending on the direction).

3. DETACH (collective) – Finishes all outstanding communication, repeatedly driving MPI progress.

The components of the Elemental AXPY interface are shown in Figure 3.3. We identified several issues with the existing AXPY interface of Elemental that limit its capabilities considerably; these issues are discussed in detail in the next section. Hence, we redesigned the existing AXPY interface into an entirely new interface, which we call the RMAInterface, with the aim of improving the asynchrony of remote operations and thereby enhancing performance significantly.

Figure 3.3: AXPY interface components: A (synonymous with Y in "AXPY") is a DistMatrix and X is a locally owned matrix.

3.3 Beyond the Elemental AXPY Interface

This section offers general guidelines on designing an interface for asynchronous matrix opera- tions. We study some of the notable issues with the existing Elemental AXPY interface.

• Asynchrony: The existing interface does not allow overlapping of operations. When DETACH returns, all communication to/from the distributed matrix completes. This enforces a bulk synchronous model.

• Overallocation: Although one could issue multiple AXPY calls within an ATTACH-DETACH epoch (note that these are point-to-point nonblocking operations), different local matrices would have to be allocated as well, because local/remote completion of operations is not guaranteed until DETACH is called.

• Restrictive synchronization: In DETACH, every process must exchange "end-of-message" (EOM) messages to mark the end of communication; also, the ATTACH-DETACH pair of calls marks access epochs.

• Expression: The put/get/accumulate operations are all expressed through a single function, AXPY, where one must specify alpha (the scale factor) and a direction (local to global or vice versa) to select the particular operation.

3.3.1 Enhancing the Performance of the Existing AXPY Interface

Since AXPY operations are point-to-point and nonblocking, the communication may not even begin until DETACH is called. Apart from ensuring MPI progress, DETACH also needs a mechanism to mark the end of the current data communication, because there is no way of assessing in advance the number of messages to be received. Hence, each process sends an EOM message to every other process to complete the ATTACH-DETACH epoch. This mechanism poses a nonnegligible overhead. For instance, in a simple test program in which each MPI process updates different locations of the distributed matrix, we found that around 80% of the total time was spent in DETACH.

We can improve the performance of DETACH, however, if there is a way to improve the synchronization logic. Fortunately, MPI-3 introduces nonblocking barriers (MPI_Ibarrier5), which can be used to implement a synchronization scheme for cases in which the number of messages to receive is not known in advance. This is facilitated by alternately checking inside a loop for any incoming message (via MPI_Iprobe) and testing whether the synchronous sends have completed (via MPI_Testall). To improve the performance of the end-of-communication step, we introduced a consensus mechanism using a nonblocking barrier (MPI_Ibarrier) instead of explicitly sending messages to mark the end of communication (during DETACH). This scheme is referred to as nonblocking consensus and is inspired by prior research on data exchange protocols [97, Algorithm 2]. By leveraging this protocol we were able not only to significantly improve performance, but also to save memory by not allocating the buffers associated with EOM synchronization. The pseudocode of the enhanced DETACH is listed in Algorithm 1. The HANDLEDATA function handles the data posted

5A nonblocking barrier (MPI_Ibarrier) works as follows: a request object associated with the barrier tests as complete (via MPI_Test) only when all the processes in the communicator have started the nonblocking barrier.

during the AXPY call.

Algorithm 1: DETACH: Nonblocking barrier for determining end of communication
1: done ← false
2: while not done do
3:   HANDLEDATA()
4:   if nonblocking barrier is active then
5:     done ← test barrier for completion
6:   else
7:     if all sends are finished then
8:       activate nonblocking barrier

We note that only MPI synchronous sends will work in this particular case, because a test on an MPI request handle associated with a nonblocking synchronous send succeeds only when the corresponding MPI receive has been posted. By completely bypassing the requirement to send EOM packets, we were able to save approximately 3×p buffer allocations (where p is the total number of processes) and obtain a performance improvement of up to 14x (refer to Figure 3.8 in Section 3.5).
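The following sketch shows how Algorithm 1 could be realized; it assumes the sends were issued earlier with MPI_Issend (their request handles are in send_reqs), and handle_data() is a placeholder for the logic that receives and places incoming patches.

#include <mpi.h>
#include <vector>

// Nonblocking-consensus termination for the enhanced DETACH (illustrative).
void detach_nbx(std::vector<MPI_Request> &send_reqs, MPI_Comm comm) {
    MPI_Request barrier_req = MPI_REQUEST_NULL;
    bool barrier_active = false, done = false;

    while (!done) {
        // Drain any incoming AXPY data.
        int flag = 0;
        MPI_Status status;
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
        if (flag) {
            /* handle_data(status); */  // receive and place the patch
        }

        if (barrier_active) {
            int barrier_done = 0;
            MPI_Test(&barrier_req, &barrier_done, MPI_STATUS_IGNORE);
            done = (barrier_done != 0);
        } else {
            // Synchronous sends complete only once they have been matched,
            // so "all sends finished" implies our outgoing data was received.
            int sends_done = 0;
            MPI_Testall(static_cast<int>(send_reqs.size()), send_reqs.data(),
                        &sends_done, MPI_STATUSES_IGNORE);
            if (sends_done) {
                MPI_Ibarrier(comm, &barrier_req);
                barrier_active = true;
            }
        }
    }
}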

3.3.2 From the AXPY Interface to the RMA Interface

The design of the existing AXPY interface offers little possibility of overlapping computation and communication within the application using it. We therefore designed a new interface, RMAInter- face, offering one-sided semantics that overcomes the strict synchronization requirements in the existing interface while expressing the required functionality in a more natural manner. Following are the design highlights of this new interface.

• ATTACH-DETACH should be required to be called only once per distributed matrix, instead of every time we need to (re)use the buffers used by a prior AXPY call.

• We expose remote operations in the API, such as PUT/GET/ACCUMULATE.

• Instead of bulk synchronization facilitated through DETACH, we introduce operation-wise (noncollective) synchronization functions. We add a number of synchronization API routines (referred to as Flush operations), for enforcing local/remote completion.

We will henceforth refer to the Elemental AXPY interface as AxpyInterface and the new RMA interface using MPI-3 RMA as RMAInterface.

3.4 Proposed One-Sided APIs

In this section we present the two APIs we propose. We first discuss designing RMAInterface using MPI-3 RMA. Next, we introduce the Distributed Arrays interface that we designed on top of RMAInterface. Distributed Arrays expose an interface similar to Global Arrays, but with the ability to extend the functionality of GA significantly because of being tightly integrated to Elemental.

3.4.1 RMAInterface

Design Overview

The basic idea behind RMAInterface is to expose a set of one-sided communication and synchronization functions to enable fetching and updating portions of a distributed matrix. A distributed matrix is semantically similar to a global array, but is laid out across processes according to Elemental's MC×MR distribution. Low-level details, such as the processes involved in communication and MPI-related types, are abstracted away from the API. RMAInterface users only need to specify the local and distributed matrix handles and the (2D) coordinates within the distributed matrix in order to fetch or update a part of the globally distributed matrix. To familiarize the reader with the RMA interface functionality, we list the definitions of some of its basic functions in Figure 3.4. To demonstrate the efficacy of RMAInterface, we explore a common parallel programming motif, distributed blocked matrix multiplication. We will show that such a programming task can be easily developed using RMAInterface, whereas the existing AXPY interface, or any bulk-synchronous mechanism, will not work in this case. Figure 3.5 includes the corresponding pseudocode and a schematic diagram. As shown in the figure, in carrying out a matrix-matrix multiplication, an MPI process acquires a 2D tile of the distributed matrices, performs a local GEMM on the tile it owns, and asynchronously modifies the distributed matrix with the locally updated values. A distributed counter serves as a load balancer by mapping tasks to processes and ensuring that no two processes access the same tile.

/* Management */
void Attach( DistMatrix<T>& Y );
void Detach();

/* Remote Transfer */

/* locally blocking */
void Put( Matrix<T>& Z, Int i, Int j );
void Acc( Matrix<T>& Z, Int i, Int j );
void Get( Matrix<T>& Z, Int i, Int j );

/* locally non-blocking */
void IPut( Matrix<T>& Z, Int i, Int j );
void IAcc( Matrix<T>& Z, Int i, Int j );
void IGet( Matrix<T>& Z, Int i, Int j );

/* Synchronization */
void Flush( Matrix<T>& Z );
void LocalFlush( Matrix<T>& Z );

Figure 3.4: Elemental RMAInterface API.

Load imbalance occurs if there are more processes than tiles: some of the processes need to wait on a barrier for others to finish. This type of communication pattern, involving asynchronous updates to different portions of a distributed matrix, is not possible in the existing AXPY interface. This is because AXPY (which simulates a put/get/accumulate operation) is not locally complete unless DETACH is called. Since DETACH is a collective operation, it cannot be invoked (inside the innermost loop in Figure 3.5) if there is no guarantee that every process will obtain a task. Even if this were true, every AXPY call would need to be placed between ATTACH-DETACH calls in order to ensure local completion, since the local matrix c is reused.

Implementation Details

RMAInterface is implemented on top of the MPI-3 RMA API. By MPI-3 RMA/one-sided, we always mean passive target communication in which only the origin process—that is, the process initiating RMA calls—is actively involved in data transfers at the user level. This is in contrast with active target communication in MPI RMA, wherein both processes, origin and target, are

explicitly involved in the RMA communication at the user level. Implementing RMAInterface on top of MPI one-sided functionality is natural because of the one-sided nature of both APIs. The Matrix class in Elemental features methods for handling fundamental matrix operations; the underlying data storage is a contiguous buffer of elements (of data type T, which may be complex) in column-major order.

Figure 3.5: Logical diagram of a 2D blocked distributed matrix multiplication using RMAInterface. Each block of DistMatrix A, B, and C contains noncontiguous elements; a, b, and c are local matrices (i.e., Matrix objects).
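A sketch of the Figure 3.5 workflow, written against the RMAInterface functions listed in Figure 3.4, follows. The helpers next_task(), my_rank, and local_gemm() are illustrative assumptions, not code from the library; the distributed counter behind next_task() hands out tasks so that no two processes work on the same tile.

// Blocked C = A x B using one-sided gets/accumulates (illustrative sketch).
void blocked_gemm(RMAInterface<double> &A, RMAInterface<double> &B,
                  RMAInterface<double> &C, El::Int nblocks, El::Int bs) {
    El::Matrix<double> a(bs, bs), b(bs, bs), c(bs, bs);

    for (El::Int i = 0; i < nblocks; ++i) {
        for (El::Int k = 0; k < nblocks; ++k) {
            if (next_task() != my_rank) continue;  // counter assigns task (i,k)
            A.Get(a, i * bs, k * bs);              // fetch a tile of A; reuse it below
            for (El::Int j = 0; j < nblocks; ++j) {
                B.Get(b, k * bs, j * bs);          // fetch a tile of B
                local_gemm(a, b, c);               // c = a * b on this process
                C.Acc(c, i * bs, j * bs);          // accumulate partial product into C
            }
        }
    }
    C.Flush(c);  // remote completion of all outstanding accumulates
}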

ATTACH initializes the RMAInterface environment for a particular distributed matrix, starts an MPI RMA epoch (by invoking MPI_Win_lock_all), and creates the necessary buffers. We support either MPI_Win_allocate or MPI_Win_create for creating the MPI windows used for RMA communication in ATTACH, as per user configuration. While supporting MPI_Win_create just requires associating a pointer to a data buffer with the RMA window, most of the core Elemental interfaces had to be modified to support MPI_Win_allocate. Like existing linear algebra packages, core data structures in Elemental are resized dynamically as needed. This had to be prevented when MPI_Win_allocate was used, because a predefined amount of memory needs to be available for RMA operations. However, one can dynamically attach memory to an MPI window object via MPI_Win_create_dynamic. Although MPI_Win_create_dynamic allows exposing memory without an explicit remote synchronization, it usually results in low performance. The use of C++ makes the design more expressive, since ATTACH and DETACH can be realized as the constructor and destructor of an RMAInterface object, respectively.

Figure 3.6: Steps involved in a put/accumulate operation of RMAInterface. An 8 × 8 distributed matrix (as shown in Figure 3.2) is updated starting at position (3, 3) by a local 5 × 5 matrix M. A get operation would show the arrows in the opposite direction. Step 3 (MPI DDT creation) is optional.

We denote Put as locally blocking (when the function returns, the input matrix may be reused), whereas IPut is locally nonblocking (the input matrix cannot be reused upon return until a synchronization call is performed). The underlying interface uses the input coordinates (passed to the put/get/accumulate functions) to determine the target process for the individual elements in the input Matrix object. The coordinates are also used to calculate the displacement (from the window base) in the MPI window of the determined target process (steps 1 and 2 in Figure 3.6 depict these operations). Therefore, a single put/get/accumulate call usually results in a number of MPI_Get/MPI_Put/MPI_Accumulate calls to fetch/update data from/to the memory of several remote processes.


Figure 3.7: Bandwidth of put/get/accumulate operations with/without MPI DDT on 16 processes (higher is better). The X axis shows the size of the data transferred.

Figure 3.6 shows the various steps involved in a remote operation until completion (effected by calling the appropriate synchronization functions). Since the target data layout (chunks of the distributed matrix in individual processes) consists of contiguous blocks separated by fixed strides, we found the use of MPI derived datatypes (DDT) appropriate in this case. We use MPI_Type_vector to describe the target data layout, which helps limit the number of RMA operations. MPI implementations have been known to suffer from performance penalties when working with DDT, even though a substantial body of research has focused on improving them [182, 161, 30, 85]. Therefore, we made the use of MPI DDT in RMA operations user configurable; otherwise the code path falls back to the version that uses MPI standard datatypes (i.e., MPI_INT, MPI_DOUBLE). To assess the benefit of using MPI DDT, we use a simple test case that performs a number of one-sided put/get/accumulate operations on varied data sizes, from 8 B to 2 MB, in increments of 128 B, on 16 processes. Figure 3.7 shows the comparative bandwidth (in MB/s) of the RMAInterface put/get/accumulate functions with and without MPI DDT. We observe an improvement of up to 20% in bandwidth on average on the NERSC Cori platform (see Section 3.5 for platform details) as a result of using MPI DDT. This improvement is attributed to minimizing the number of overall RMA operations. In terms of synchronization, a flush operation ensures remote/local completion of all outstanding operations initiated from (or targeted toward) the current process. Flush translates to MPI_Win_flush_all, while LocalFlush enforces the local completion of all operations (i.e., LocalFlush translates to MPI_Win_flush_local_all).
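The sketch below shows the derived-datatype path in isolation (variable names are illustrative): a single MPI_Accumulate updates a set of fixed-stride runs in the target window described by an MPI_Type_vector, instead of issuing one RMA call per run.

#include <mpi.h>

// Accumulate "blocks" runs of "blocklen" doubles, separated by "stride"
// doubles, into the target window with one operation.
void strided_acc(const double *local, int blocks, int blocklen, int stride,
                 int target, MPI_Aint target_disp, MPI_Win win) {
    MPI_Datatype strided;
    MPI_Type_vector(blocks, blocklen, stride, MPI_DOUBLE, &strided);
    MPI_Type_commit(&strided);

    // Origin data is sent contiguously; the derived type describes how it is
    // scattered (with a fixed stride) in the target's window.
    MPI_Accumulate(local, blocks * blocklen, MPI_DOUBLE,
                   target, target_disp, 1, strided, MPI_SUM, win);
    MPI_Win_flush(target, win);  // remote completion of the accumulate

    MPI_Type_free(&strided);
}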

A relevant distinction between RMAInterface and the original AxpyInterface is that DETACH is no longer responsible for collective synchronization, since a flush completes outstanding operations in an asynchronous fashion. In RMAInterface, DETACH marks the end of an RMA epoch (by issuing MPI_Win_unlock_all) and clears the associated buffers.

3.4.2 Distributed Arrays Interface (EL::DA)

We create the Distributed Arrays interface (EL::DA) on top of the Elemental DistMatrix and RMAInterface interfaces. Both regular and irregular (GA) distributions internally map to Elemental's cell-cyclic data layout. EL::DA supports the most fundamental GA operations, and hence most applications written in GA may be easily ported to EL::DA. We also created a C interface to EL::DA (although Elemental is written in C++11, it offers C interfaces for almost all core modules), which allows it to be used with applications written in C as well. In EL::DA, one-sided functions (such as NGA_Put or NGA_Acc) are implemented using RMAInterface, whereas any other collective operation (such as GA_Add or GA_Symmetrize) uses the core Elemental API. However, supporting the GA local access functions (e.g., NGA_Access) efficiently is not possible because of the underlying element-cyclic data distribution. For example, consider NGA_Access, which returns a pointer to the local portion of the global array. In EL::DA, NGA_Access is no longer a local operation; it requires a remote get to pull the relevant portions of data from the global array into a local buffer. The reason is that the local portion of an Elemental DistMatrix contains elements that are globally noncontiguous (see Figure 3.2), and hence it cannot be used for computations that expect elements to be in their logical order. NGA_Release_update is another case in which EL::DA is nonconformant with the original GA. NGA_Release_update releases access to a local copy (owned by a particular process) of a GA in case the local copy was accessed for writing. For cache-coherent machines, NGA_Release_update is essentially a no-op. Nevertheless, since the local buffer (exposed by NGA_Access) was updated, it needs to be sent back to the global array distributed in Elemental's cell-cyclic fashion. Therefore, for EL::DA, NGA_Release_update is a remote put operation that updates the global matrix with local data. Since Elemental offers an unparalleled range of functionality pertaining to linear algebra and optimization, a rich set of the Elemental framework is exposed through the Distributed Arrays interface. In contrast, this is not possible in the existing GA, which offers limited linear algebra functionality.
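By contrast, a remote-access-dominated fragment such as the following avoids NGA_Access entirely and is the kind of pattern where EL::DA is competitive. It is written against the standard GA C API (which EL::DA is designed to accept); GA initialization is assumed to have happened elsewhere.

#include <ga.h>

// Accumulate a locally computed 8x8 patch into a 1024x1024 distributed array
// and read the same region back, using only remote operations.
void update_patch(double *patch /* 8x8, row-major */) {
    int dims[2] = {1024, 1024};
    int g_a = NGA_Create(C_DBL, 2, dims, (char *)"A", NULL);

    int lo[2] = {16, 32}, hi[2] = {23, 39};  // inclusive bounds of the patch
    int ld[1] = {8};                         // leading dimension of the buffer
    double alpha = 1.0;
    NGA_Acc(g_a, lo, hi, patch, ld, &alpha); // A(lo:hi) += alpha * patch

    double out[64];
    NGA_Get(g_a, lo, hi, out, ld);           // fetch the region back
    GA_Sync();                               // collective completion
    GA_Destroy(g_a);
}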

3.5 Experimental Evaluation

Our experiments are motivated by operations frequently arising in quantum chemistry applications. Quantum chemistry simulation of small or large molecular systems requires substantial computa- tion resources and suitable parallel programming models for scalability. In practice, quantum chemistry codes are not perfectly scalable, however, because of the significant volume of commu- nication that dominates the total time. We used two platforms for our experiments.

1. Argonne's Blues – a 310-node cluster with dual-socket Intel Xeon E5-2670 processors per node and a QLogic InfiniBand QDR interconnect.

2. NERSC Cori – a 2,388-node Cray XC40 machine with dual-socket Intel Xeon E5-2698v3 CPUs per node and the Cray XC series interconnect (Aries).

We used MVAPICH2 (version 2.2.1) on Blues and Cray MPI (version 7.2.5) on Cori (both of them are MPICH derivatives). MVAPICH2 uses hardware for contiguous put/get operations and implements accumulate and strided one-sided operations using software. Following the vendor rec- ommendations, we use the regular mode of Cray MPI, where RMA operations are implemented in software (as opposed to the DMAPP mode[176], which implements one-sided contiguous Put/Get

operations in hardware). We found no noticeable difference in the performance of Cray MPI runs with or without DMAPP for our test cases (all of which use accumulate operations). We compare EL::DA with Global Arrays (version 5.4). The native communication conduit of GA is ARMCI (Aggregate Remote Memory Copy Interface), which uses low-level network APIs for point-to-point communication and MPI for collective operations. ARMCI-MPI [54] is a completely rewritten implementation of the original ARMCI using MPI RMA (specifically, it supports both MPI-2 RMA and MPI-3 RMA) for one-sided communication. We used ARMCI-MPI with MPI-3 RMA (referred to as GA in the plots) for our evaluations. Also, Cray Aries (Cori's interconnect) cannot perform accumulate or atomic update operations in hardware and must use software. The MPI standard does not mandate RMA to be asynchronous: although there is no need for remote processes to be involved in passive RMA communication, in practice the remote/target process may issue calls to the MPI runtime system to ensure progress in communication. Asynchronous progress in MPI implementations is typically enabled by using a communication helper thread per process to handle messages from other processes or by utilizing hardware interrupts. Both MVAPICH2 and Cray MPI (in regular mode) implement asynchronous progress using a background thread per process as an optional feature. Because it requires deploying as many helper threads as MPI processes, this scheme leads either to processing-core oversubscription or to devoting half the cores to ensuring MPI progress. Both approaches result in losing a considerable amount of compute power in polling for incoming messages. To maximize the performance potential of leveraging MPI RMA communication instead of its two-sided counterpart, we use Casper [171], a process-based progress engine for MPI one-sided operations, which aims to alleviate most of the crucial drawbacks of thread-based or hardware-interrupt-based communication progress and to favor scalability [170]. Casper decouples the number of helpers devoted to progressing MPI RMA communication from the number of MPI processes; the optimal number of helper (ghost) processes is application dependent and currently is specified by the user at launch time. We leverage Casper in our evaluations involving MPI-

3 RMA (both Elemental and GA/ARMCI-MPI) and observe significantly improved performance with respect to the cases based on the original thread-based asynchronous progress models.

3.5.1 Microbenchmark Evaluation

We use two microbenchmarks to evaluate the performance of RMAInterface. The first microbench- mark loosely simulates a Fock matrix construction used commonly for electronic structure calcula- tion. This microbenchmark is designed to compare RMAInterface with the original and improved AxpyInterface (denoted by ORIG and ORIG-NBX). The second microbenchmark is a distributed matrix-matrix multiplication, to compare the performance of EL::DA with that of GA.

Hartree-Fock Proxy

This microbenchmark features two phases. In the first phase, each process requests a task, and upon receiving a task it issues a remote accumulate operation to different tiles of a 2D matrix distributed in Elemental’s elementwise cyclic fashion. In the second phase, each process requests a task again, and upon receiving a task it issues a remote get from a different tile of the global matrix to a local matrix. Processes that do not receive a task (because of insufficient tasks) just wait on a barrier. Accesses to different blocks are made possible via a distributed global counter, which ensures that at a time only a single process is accessing a tile of the global array. We compare the performance of six AxpyInterface versions in Figure 3.8. In this microbench- mark, we fix the number of processes to 256 and vary the workload (expressed as number of tasks, each task involving exactly one accumulate or a get operation). The various versions in Figure 3.8 are defined as follows.

1. ORIG – Original Elemental AXPY interface.

2. ORIG-NBX – Original Elemental AXPY interface with a nonblocking consensus mechanism

to test communication completion in DETACH.

3. RMA – Locally nonblocking API of RMAInterface (e.g., IPut/IAcc, see Figure 3.4).

4. RMAB – Locally blocking API of RMAInterface (e.g., Put/Acc, see Figure 3.4).

5. RMA-CSP – RMA with Casper.

6. RMAB-CSP – RMAB with Casper.

Figure 3.8: Hartree-Fock proxy microbenchmark comparing the Elemental AXPY interface versions on 256 processes of Blues. The number inside braces denotes the number of tasks.

With fewer than 256 tasks (the number of processes), all versions suffer from load imbalance, since some of the processes do useful work while the others wait on a barrier. With a larger number of tasks, the performance of the original AxpyInterface (ORIG) suffers because of the large number of messages to mark the end of communication. We improve the performance of these

situations by modifying DETACH to use a nonblocking consensus mechanism (ORIG-NBX) based on MPI_Ibarrier (see Section 3.3.1), which improved the performance of the original AxpyInterface significantly, up to 40x for a large number of tasks and on average by 5x.

The AXPY function of the AxpyInterface (which is analogous in terms of functionality to IPut/IAcc in the RMAInterface) issues synchronous nonblocking sends to the target process that owns a patch of noncontiguous elements. The target process handles the sends by posting matching receives (when DETACH is called) and places each element at its correct position with respect to the Elemental cyclic distribution. Because of the active participation of the target pro-

cess, AXPY is able to pack all the (noncontiguous) elements into a single MPI_Issend. In the case of RMAInterface, since we use MPI-3 passive RMA, we cannot involve the remote process in the communication. Hence, the origin process has to calculate the displacement in the remote process's MPI window and issue multiple RMA operations. Therefore, the RMA versions issue many one-sided accumulates (with comparatively small data sizes relative to ORIG/ORIG-NBX) over the network, whereas ORIG/ORIG-NBX can pack elements to limit the number of messages. With an increasing number of tasks, the number of one-sided operations also increases, and RMA/RMAB suffer. In particular, the performance of the RMA version starts degrading significantly because of the lack of progress of one-sided operations, especially beyond 4K tasks; in the worst case it is 5x slower than ORIG-NBX. In contrast, the RMAB version is around 10x better on average (for more than 4K tasks), because it enforces local completion (via MPI_Win_flush_local). Hence, we assert the importance of asynchronous progress in RMA communication. The performance of RMAB increases significantly when a number of "ghost" processes are used for asynchronous progress (for Figure 3.8 we use 4 ghost processes per node). A performance improvement of up to 4x is observed with the use of the RMAInterface locally blocking API in combination with the Casper progress engine (referred to as RMAB-CSP in Figure 3.8) as compared with the optimized AxpyInterface (i.e., ORIG-NBX).

Distributed Matrix-Matrix Multiplication

The matrix-matrix multiplication operation, that is, C = A × B, entails multiplying matrices (A and B) and storing the result on a third matrix (C). Since AxpyInterface cannot be used to simulate truly one-sided operations because of its bulk-synchronous nature (see Section 3.4.1), we compare only RMA approaches for this microbenchmark. Specifically, we compare the relative performance (demonstrated by an average of a million floating point operations per second, or MFLOPS) of a distributed matrix multiplication microbenchmark written in GA and EL::DA in Figure 3.9. We emphasize that the data distribution of Elemental is element-wise cyclic, whereas GA in this case

uses a regular distribution (each PE or process receives contiguous chunks of the global array). We use square matrices as input, with all four combinations of transpose operations for the A and B matrices. For instance, in Figure 3.9, "8192-NT" corresponds to a matrix multiplication where the input matrix dimensions are 8192 × 8192 and "NT" indicates that A was not transposed (N) whereas B was transposed (T).

An important factor for scalability in dense linear algebra computation on distributed-memory architectures is the matrix distribution over processes. Predicting the best distribution is challenging because, on the one hand, we want the tile size per process to be sufficiently large for BLAS (Basic Linear Algebra Subprograms) efficiency and, on the other hand, sufficiently small to avoid load imbalance due to communication. Elemental tries to attain a good compromise by removing the constraint of choosing block sizes per process and making the block size 1, which essentially means that each process gets noncontiguous elements of the input matrix. Element-wise cyclic data distribution has been proven to be more scalable than previous approaches that partition the matrices into contiguous blocks and distribute the blocks to the processes [153]. As shown in Figure 3.9, the performance of EL::DA suffers when the communication volume supersedes the number of arithmetic operations (for instance, when 512 processes are used for multiplying 8K matrices, particularly with A/B transposed). Because of the data distribution, however, EL::DA shows scalability for large (16K) matrices and is 6%–40% better than GA in terms of performance.

Figure 3.9: Performance comparison of EL::DA and GA for the distributed matrix-matrix multiplication microbenchmark on Cori. Four processes per node are used by Casper for asynchronous progress.

Figure 3.10: GTFock execution on Cori. Two processes per node are used by Casper for asynchronous progress. (a) 10 iterations of the 1hsg 28 molecule; (b) 1 iteration of the 1hsg 38 molecule; (c) 1 iteration of the 1hsg 45 molecule.

3.5.2 Application Evaluation – GTFock

Hartree-Fock calculations play a crucial role in quantum chemistry and are a useful prototype for parallel scalability studies. In computational chemistry, the Hartree-Fock (HF) or SCF (Self-Consistent Field) method is used to approximate the energy of a quantum many-body system in a stationary state. This is an iterative method; in each SCF iteration, the most computationally intensive part is the calculation of the Fock matrix. GTFock [127] is a new parallel algorithm for Hartree-Fock calculations that uses fine-grained tasks to balance the computation. It also enables dynamically assigning tasks to processes to reduce communication. GTFock is an excellent application candidate because it exploits domain (MPI and GA), thread (OpenMP), and data (vector loops) parallelism, which are necessary to achieve effective parallel efficiency on the next generation of supercomputers. Because of the similarity of the GA and EL::DA APIs, the EL::DA port of GTFock was straightforward, essentially a drop-in replacement of the GA functions. Table 3.1 lists the molecules that we used in our experiments, along with some of their properties that have a direct correlation with the input data size. All the molecules that we used as input are provided with the GTFock package as part of the cc-pVDZ basis set. Figure 3.10 shows the total execution time of only a single iteration for the 1hsg 38 and 1hsg 45 molecules, whereas the execution time of 10 iterations is reported for 1hsg 28.

In Figure 3.10, we observe that for all inputs, EL::DA shows a consistent improvement of around 20% over GA up to 128 processes. However, with an increasing number of processes, the volume of remote communication increases significantly in the Fock matrix building stage, which negatively affects the overall scalability of EL::DA. This is because, in EL::DA, local accesses to a global array (via NGA_Access/NGA_Release) entail extra MPI_Get calls to bring elements distributed across the processes (in Elemental cell-cyclic fashion) into a local buffer of the current process. Therefore, NGA_Access is not a local operation for EL::DA, whereas in GA it is a simple pointer assignment to a contiguous buffer (the local portion of a global array), owing to its data layout. Unfortunately, the design of GTFock relies on frequent local accesses to global arrays, which affects the performance of EL::DA beyond 256 processes. To maintain the integrity of the solution, EL::DA has to issue extra remote operations. Upon profiling GTFock with EL::DA on 512 processes, we found that more than 40% of the total execution time is spent in two functions that initialize and update the local portions of the global arrays (through NGA_Access-NGA_Release) before and after the Fock matrix computation, in every SCF iteration. In GTFock with GA, on the other hand, those functions contribute only around 1% of the entire execution time. If remote accesses to distributed arrays dominate an application (i.e., more NGA_Acc/Put/Get and less NGA_Access), then EL::DA will be more scalable and efficient than GA, as we demonstrated in the matrix multiplication microbenchmark.

Table 3.1: Test molecules used for GTFock evaluation.

Molecule   No. of Atoms   No. of Shells   No. of Basis Functions
1hsg 28    122            549             1159
1hsg 38    387            1701            3555
1hsg 45    554            2427            5065

3.6 Chapter summary

We presented a case study in designing a one-sided interface within a high-performance linear algebra and optimization framework. Our work started with improving the existing interface for updating distributed matrices in Elemental (AxpyInterface), and we justified the need for a new API (RMAInterface) for applications that require asynchronous one-sided operations on distributed

arrays. We built a Distributed Arrays interface (EL::DA) using the RMAInterface and the Elemental DistMatrix to enhance the productivity of developers requiring optimized one-sided operations and a high-performance linear algebra framework. Integrating such an interface into Elemental opens up interesting possibilities for directly accessing a rich set of scientific algorithms, which is otherwise not possible from a standalone API such as GA. Overall, we demonstrate that our proposed RMAInterface is an effective programming model for asynchronous distributed matrix updates, delivering competitive performance compared with existing MPI point-to-point APIs and GA.

CHAPTER 4 RMACXX: AN EFFICIENT HIGH-LEVEL C++ INTERFACE OVER MPI-3 RMA

4.1 Introduction

A significant number of scientific applications use MPI either directly or indirectly via third-party scientific computation libraries. Such applications often need features to support communication scenarios (such as asynchronous updates of distributed matrices) that would benefit from a one-sided communication model. Instead of switching to another Partitioned Global Address Space (PGAS) model to cater to the one-sided communication needs of an MPI-based application, an elegant and productive solution is to exploit the one-sided functionality already offered by MPI-3 RMA. Although MPI-3 RMA provides a variety of functions to complement use cases commonly arising in scientific applications, writing robust application codes is still difficult without a detailed understanding of the features of MPI-3 RMA. Many scientists prefer an approach that lets them develop parallel applications intuitively, without requiring detailed knowledge of the middleware. Language-based PGAS models such as Chapel [35], Coarray Fortran [145], Unified Parallel C (UPC) [32], X10 [38], Titanium [95], and XcalableMP (XMP) [121] address this need by offering a rich set of functionality embedded within a language or through directive-based language extensions. However, PGAS models whose runtimes are not MPI-based are susceptible to silent interoperability issues. An example where a PGAS model may cause a deadlock as a result of relying on a runtime separate from MPI is depicted in the code snippet below.

if (rank == 0) {
    PGAS_put(...);
    PGAS_sync(...);
}
MPI_Barrier(MPI_COMM_WORLD);

The PGAS_put function is a representative asynchronous one-sided put operation belonging to a PGAS model, and PGAS_sync is the corresponding synchronization function. If PGAS_sync requires the involvement of the target process for communication progress and completion, but the target process blocks in MPI_Barrier, then a deadlock will occur. Therefore, to avoid such issues, a PGAS model may require invoking some form of coarse-grained serialization in order to work with MPI, affecting the overall performance of an application. The use of multiple runtimes also significantly increases resource utilization, limiting the computing resources available to applications. Yang et al. [193] and Dinan et al. [54] discuss the importance of interoperability of PGAS models with MPI and the issues arising from multiple runtimes. Applications may also suffer from the disparity between the asynchronous progress models (which are usually thread, process, or interrupt based) of MPI and PGAS models. Since PGAS models are not designed with aggregate operations in mind, support for optimized collective operations in PGAS models is an active research area. On the other hand, a vast majority of legacy applications rely on MPI collectives to achieve peak performance, made possible by decades of past research effort in optimizing algorithms specifically for collective operations [138, 37, 157, 177, 36, 99]. The ability to quickly prototype parallel scientific codes is a desirable feature that a low-level interface such as MPI RMA lacks. In this work we strive to address this issue. Also, a large number of parallel scientific codes already use MPI and need features similar to those available in other PGAS models without incurring interoperability overheads. We seek a solution to the following question:

Can we develop an intuitive and composable interface over MPI-3 RMA that can pro- vide features needed by parallel scientific applications?

As an answer to this question, we present RMACXX (pronounced "r max"), a compact set of C++ bindings to the most common features of the MPI-3 RMA API. RMACXX enables users to build a variety of scientific codes without needing to consult the MPI specification. Any standard C++14 compiler will be able to compile and optimize code written using RMACXX.

Although RMACXX uses some features of C++ that are not available in standards prior to C++14, commonly used C++ compilers at present are already C++17 compliant [44]. The choice of the C++ language as the building block of RMACXX is deliberate, since it offers the features necessary to build an intuitive high-level interface. RMACXX employs advanced object-oriented concepts that are unavailable in languages commonly used for numerical computation, such as Fortran or Python. However, it is conceptually possible to partially expose a similar high-level interface using Fortran 2018 [160] and Python [186], with some restrictions. The C++ template functionality can be approximated in Fortran using parameterized user-defined types, and with generic types in Python. Fortran 2018 has implicit support for nonblocking communication (e.g., the asynchronous attribute) and operator overloading, both of which are key aspects of the RMACXX design. However, the stream-style operators RMACXX uses for communication (<< and >>) are not valid Fortran operators and cannot be overloaded, whereas () and [] are associated with Fortran arrays and coarrays and cannot be overloaded. The only option is to create new operators, which can make the interface verbose. In Python, it is possible to overload the index operator (i.e., []), which can be used to pass communication parameters. Python also supports parallel programming through the multiprocessing [69] and threading [70] modules. Unlike OpenMP [47], OpenCL [173], XcalableMP (XMP) [121], or CUDA [146], RMACXX is not a language extension; and because it does not require a separate runtime, it is compatible with any MPI implementation compliant with the MPI-3 standard. Moreover, since RMACXX uses MPI directly, it can automatically benefit from recent low-level improvements in MPI, such as in network address management [88], lock scalability for better multithreading support [4, 49], asynchronous progress [171], and overall runtime performance [158]. From the user perspective, PGAS programming models can be categorized based on how shared data is accessed during communication. Some models, such as SHMEM, Fortran 2008 Coarrays, and MPI RMA, require an explicit processing element (PE) or process in order to access an arbitrary remote memory location. Other models, such as Global Arrays, do not require the user to specify the target process of the shared data and instead use an index-based interface that encapsulates

RMACXX augments these approaches by providing an optimized local indexing interface to access arbitrary remote data ranges on a particular target process, and a convenient global indexing interface that uses mathematical coordinates to access data spread across multiple processes. Figure 4.1 places RMACXX alongside other contemporary models that support a local and global view of data.

Figure 4.1: PGAS models that support both local and global views of data, ordered by flexibility and ease of use. Global view (intuitive, easy to use): GA, UPC, Chapel, XcalableMP, UPC++, RMACXX. Local view (lower level, relatively difficult to program): ARMCI, CAF, SHMEM, and native MPI.

Users of MPI RMA must provide the memory offset (also referred to as target displacement) in the target process’s memory in order to initiate a one-sided data transfer. As an example, we present below a code snippet of a halo exchange for a simple 2D 5-point stencil (domain size is 2 × (bx + 2) × (by + 2) including the “ghost” cells), using a 1D memory layout and derived data types. Since this is an iterative algorithm, the domain size is doubled to hold the updated data from the previous iteration on the second half. Pointers are swapped at the end of each iteration. For simplicity, the example excludes other mandatory parts.

MPI_Put(&arr[bx+3], 1, north_south_type, north, ((by+1)*(bx+2))+1, 1, north_south_type, win);

MPI_Put(&arr[by*(bx+2)+1], 1, north_south_type, south, 1, 1, north_south_type, win);

MPI_Put(&arr[2*bx+2], 1, east_west_type, east, (bx+2), 1, east_west_type, win);

MPI_Put(&arr[bx+3], 1, east_west_type, west, (2*bx+3), 1, east_west_type, win);
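For completeness, the derived datatypes assumed by the snippet above could be set up roughly as follows. This is a hedged sketch, not part of the original benchmark code; the element type (double) is an assumption.

/* Sketch of the derived-type setup assumed by the halo-exchange snippet above. */
MPI_Datatype north_south_type, east_west_type;

/* a north/south halo is one contiguous row of bx interior elements */
MPI_Type_contiguous(bx, MPI_DOUBLE, &north_south_type);
MPI_Type_commit(&north_south_type);

/* an east/west halo is a column of by elements strided by the row length bx+2 */
MPI_Type_vector(by, 1, bx+2, MPI_DOUBLE, &east_west_type);
MPI_Type_commit(&east_west_type);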

Table 4.1: RMACXX semantics compared to Global Arrays, Fortran 2008 Coarrays, and UPC++.

RMACXX, global indexing:
  Data structures (distribute 10 ints among 2 processes):
    Window win({0},{4});      // process 0
    Window win({5},{9});      // process 1
  RMA operations:
    win({2}) << 3;            // put
    win({3},{7}) >> ptr;      // get
    win({2},SUM) << 3;        // accumulate
    win({2},SUM,1) >> val;    // atomics (fetch-and-add)

RMACXX, local indexing:
  Data structures:
    Window win({10});         // each rank allocates 10 ints
    Window win({10,10});      // each rank allocates 10x10 ints
  RMA operations:
    win(1,{2}) << 3;          // put
    win(0,{0},{4}) >> ptr;    // get
    win(1,{2},SUM) << 3;      // accumulate
    win(1,{2},SUM,1) >> val;  // atomics (fetch-and-add)

RMACXX synchronization (both interfaces): local_flush(...); flush(...)   // local completion / completion

Global Arrays:
  Data structures:
    int rga = NGA_Create(C_INT, 1, {10}, "GA", NULL);                                // 10 ints
    int iga = NGA_Create_irreg(C_INT, 2, {1,10}, {1,2} /*grid*/, {0,0,5} /*map*/);   // irregular distribution API
  RMA operations:
    NGA_Put/Get(rga, {3},{7}, ptr, {1});                                             // put/get
    double a = 0.1; NGA_Acc(iga, {0,1},{0,3}, ptr, &a);                              // accumulate (axpy)
    ga_nbhdl_t hdl; NGA_NbPut(rga, {3},{7}, ptr, {1}, &hdl); ... NGA_NbWait(&hdl);   // nonblocking operation
    long prev = NGA_Read_inc(rga, {4}, 1);                                           // fetch-and-add
  Synchronization: GA_Sync(); NGA_Barrier();

Fortran 2008 Coarrays:
  Data structures:
    integer(kind=4) :: arr(10)[*]    ! each rank allocates an array of 10 ints
    integer(kind=4) :: val[*]        ! coarray variable
  RMA operations:
    val1 = arr(5)[2]                 ! get
    arr(5)[2] = val2                 ! put
    type(event_type) :: counter[*]   ! atomics (using events)
    integer :: ctr, prev
    event post(counter[2]); event wait(counter); event query(counter[2], ctr)
    atomic_add(arr(5)[2], 1, prev)   ! atomics (using intrinsics)
  Synchronization: sync images(...); sync all; sync memory

UPC++:
  Data structures:
    dist_object<int> d(0);
    *d = val;                        // set local value
    d.fetch(target);                 // fetch value of dist_object
    global_ptr<int> gptr = new_array<int>(10);   // each rank allocates 10 ints
  RMA operations:
    std::future<...> f = rput/rget(T val, global_ptr<T> dst, Completion cx);   // put/get
    rpc(int recipient, Func &&func, Args &&...args);                           // RPC
    atomic_domain<int> ad({...}); global_ptr<int> g = ...;                     // atomics
    std::future<int> f = ad.fetch_add(g, val, ...);                            // fetch-and-add
  Synchronization: barrier(); wait()

Calculation of the target displacement (the fifth parameter of MPI_Put) is nontrivial and error prone. In RMACXX, n-dimensional (n-D) coordinates are used to access a chunk of remote memory, which is an intuitive way to access the distributed arrays used in many scientific applications. The remainder of this chapter is organized as follows. We discuss related work in Section 4.2. Section 4.3 presents the overall design of RMACXX, covering the standard and expression interfaces. In Section 4.4 we present performance evaluations of RMACXX compared with MPI RMA using microbenchmarks and applications. We draw conclusions in Section 4.5.

4.2 Related work

The Global Arrays (GA) toolkit [144] allows creation of distributed arrays that can be accessed using n-D coordinates, similar to RMACXX. GA has no dedicated local indexing interface and assumes global indices, adding extra overhead for cases that are better represented by a local indexing model. Unlike RMACXX, RMA operations in GA are not implicitly nonblocking and require an explicit handle per operation for local completion. The Fortran 2008 standard introduced Coarrays [145], which are local-view objects addressable from other processes. Several shortcomings of Fortran 2008 Coarrays have been identified in [134].

Coarrays enforce local completion semantics and mandate ordering of operations when the same location or process is accessed. In contrast, RMACXX is implicitly nonblocking, and put/get operations impose no ordering restrictions. XcalableMP [121] is a directive-based C/Fortran language extension that supports local/global indexing. It is developed within the Omni compiler toolchain [163], which uses MPI as its communication runtime. XMP performs data distribution and assigns work to a set of nodes through parallelization directives. It supports multidimensional arrays and variations of block/cyclic data distributions for global-view programming. For the local view, XMP uses the Coarray features of the Fortran 2008 standard. XMP has no specific support for remote atomic operations or active messages. However, it provides limited support for mixing XMP directives with OpenMP and MPI. UPC++ [7, 8] is a PGAS library that offers a set of C++ extensions over the GASNet-EX communication runtime [25, 92]. Unlike RMACXX and XMP, UPC++ does not provide a specific mechanism to distribute data, and it requires users to supply the target rank in asynchronous communication operations involving distributed objects (as in a local-view model). UPC++ uses global pointer and distributed object constructs to reference remote memory segments. Prior to communication, a global pointer needs to reference a distributed object in order to point to the shared-memory segment of a process. RMACXX uses overloaded operators for performing RMA operations, whereas UPC++ uses function-call semantics for the same purpose. Nonblocking operations in UPC++ have associated C++ future and promise objects (standard C++ concurrency mechanisms) that bind the status of the underlying operation with any result values. Apart from waiting for completion of an operation, UPC++ also provides several additional mechanisms for determining completion. In contrast, MPI RMA uses bulk synchronization functions to complete previously issued operations. Creating a future in one scope and evoking completion in another scope is challenging and requires ownership transfers that may incur significant overheads in copying intermediate objects. In RMACXX, the status of an operation is tied to a Window instance, which can easily be referenced across scopes without involving intermediate object copies. UPC++ also provides remote procedure calls (RPCs), which allow a function to be invoked on a target process.

MPI RMA provides an accumulate operation that is similar in concept to active messages [62]/RPCs, except that it is limited to predefined operations. The RMACXX interface for accumulate is very similar to that for put, with an extra parameter to specify the built-in operation. In Table 4.1, we compare RMACXX with Global Arrays, Fortran 2008 Coarrays, and UPC++, and list a subset of functionalities to demonstrate the relative semantic differences between the models.

4.3 Design principles of RMACXX

RMACXX supports diverse application use cases by allowing applications to maintain and update MPI windows in an intuitive fashion using customized C++ operators. We can perform the same halo exchange described in Section 4.1 using RMACXX, by employing 2D coordinates to access the MPI window of a target process (instead of remote memory offsets), as shown in the code snippet below.

/* Every process creates a 2D window of size (by+2) x (bx+2) */
Window win({by+2,bx+2});
...

/* Exchange data between neighbors */
win(me,{1,1},{1,bx})   >> win(north,{by+1,1},{by+1,bx});
win(me,{by,1},{by,bx}) >> win(south,{0,1},{0,bx});
win(me,{1,bx},{by,bx}) >> win(east,{1,0},{by,0});
win(me,{1,1},{by,1})   >> win(west,{1,bx+1},{by,bx+1});
...

/* Complete operations */
win.flush();

/* Free resources */
win.wfree();

The primary data structure of RMACXX is the composable Window class, which encapsulates an MPI RMA window and includes operators to access the distributed data using local or global indexing schemes. Figure 4.2 highlights the key interfaces of RMACXX. We distinguish between two communication mechanisms in RMACXX: elementwise access (i.e., communication of a single element) and bulk access (i.e., communication involving multiple elements).
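As a quick illustration of the two mechanisms, the following sketch uses the local indexing syntax summarized in Table 4.1 (window creation details and completion calls are omitted, and the variable names are placeholders):

Window win({10});         // each rank exposes 10 elements (local indexing)

win(1,{2}) << 3;          // elementwise put: write one element on rank 1
win(1,{2}) >> val;        // elementwise get: read one element from rank 1

win(0,{0},{4}) << ptr;    // bulk put: write elements 0..4 on rank 0
win(0,{0},{4}) >> ptr;    // bulk get: read elements 0..4 from rank 0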

Figure 4.2: High-level organization of RMACXX. The library sits on top of a standard C++ compiler and exposes two layers, the standard interface and the expression interface, supported by non-member operators for expression templates and by noncontiguous local buffer communication.

4.3.1 Window class

In Bjarne Stroustrup’s words, “When designing a class, it is often useful to consider what’s true for every object of the class and at all times. Such a property is called an invariant” [174]. The invariant of the Window class is that an instance of it contains the operators to access portions of the distributed data from different processes by using n-D coordinates.

Window creation

In MPI RMA, a window corresponds to memory exposed by a process before performing one-sided operations. In contrast, RMACXX introduces a notion of type and dimensionality into window creation. Users need to pass either the window dimensions or the n-D data ranges owned by a process as an input parameter. This helps RMACXX determine the overall data size per process and, for global indexing cases, construct an n-D process grid according to which data is globally distributed across processes. Creating a window for local indexing is simple. Invoking the following statement creates a 6 (height) × 2 (width) 2D window of double data type on every process in the default communicator: Window win({6,2}). A concrete example of the local indexing interface is shown in Figure 4.3, to demonstrate that a process is unaware of another process’s distribution.

// process 0
Window win({2,3});
// process 1
Window win({2,5});
// process 2
Window win({4,2});
// process 3
Window win({3,4});

Figure 4.3: RMACXX window creation using the local indexing constructor. Each process passes only its local window dimensions and is unaware of the distribution of any other process. A window created by a process using the local indexing constructor is not associated with the windows created by other processes.

Figure 4.4 demonstrates window creation for a global indexing case. Unlike the local indexing case, where there is no notion of a process grid, global indexing windows follow the layout of the underlying process grid. Users can also create a custom process grid through the MPI Cartesian topology routines [183] and pass it while instantiating an RMACXX window with global indexing. RMACXX does not limit the number of dimensions; the data within the braces {} is evaluated at compile time, and the empirical limit is compiler dependent.

// process 0
Window win({0,0},{1,2});
// process 1
Window win({0,3},{2,7});
// process 2
Window win({2,0},{5,2});
// process 3
Window win({3,3},{5,7});

Figure 4.4: RMACXX window creation with global indexing capabilities. A window of global dimensions 6 × 8 integers is created collectively by 4 processes laid out in a logical 2 × 2 grid.

Window template parameters

Apart from parameterizing the Window class with a type T (where T is a plain data type such as int, double, etc.), there are five template parameters that set various properties of a Window instance (discussed in Table 4.2).
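For illustration, a declaration that sets several of these properties might look as follows. This is only a sketch: the ordering of the template parameters shown here is an assumption for readability, not a prescription of the interface.

// Sketch only: parameter order is illustrative (see Table 4.2 for the values).
Window<double,           // element type T
       GLOBAL,           // WinType: target computed from global indices
       EXPR,             // WinUsage: window may appear in expressions
       REMOTE_FLUSH,     // WinCompletion: operations complete on return
       ATOMIC_NONE,      // WinAtomicity: put/get are not atomic
       NOT_CONCURRENT>   // WinThreadSafety: no internal locking
    win({0,0},{99,99});  // this rank owns the global block (0,0)-(99,99)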

Window destruction

Invoking the wfree() function on a Window instance ensures resource cleanup. Although the RMACXX destructor can handle cleanup of resources, we cannot rely on it to automatically release resources in every situation. The reason is that a destructor may be called just before the program’s main function exits and after MPI_Finalize has been invoked. This results in an ill-formed MPI program that will crash, since no MPI function (wfree invokes the function for MPI window destruction) can be called after MPI_Finalize has executed. Unfortunately, there is no portable way to let MPI_Finalize handle cleanup of RMACXX objects. Therefore, we rely on the user to explicitly call wfree() before the Window object goes out of scope, in order to avoid memory leaks or exceptions. RMACXX does not provide a mechanism to copy Window objects; because copying a Window would be expensive, we prevent the compiler from generating the copy constructor.
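A minimal usage sketch of the intended lifetime (the window size here is a placeholder):

MPI_Init(&argc, &argv);
{
    Window win({1024});   // local indexing window of 1024 elements per rank
    // ... one-sided operations ...
    win.flush();          // complete outstanding operations
    win.wfree();          // free the underlying MPI window here,
}                         // not in the destructor after MPI_Finalize
MPI_Finalize();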

Table 4.2: Window class template parameter list. The default value of each parameter is listed first.

WinType
  LOCAL (default): Users need to pass the target process along with indices in communication operations.
  GLOBAL: RMACXX calculates the target process from the passed global indices.

WinUsage
  NO_EXPR (default): Standard RMA operations such as put, get, accumulate, and atomics are supported.
  EXPR: A Window object can participate in expressions that combine RMA and arithmetic operations.

WinCompletion
  NO_FLUSH (default): Users are responsible for completing outstanding RMA operations.
  LOCAL_FLUSH: Ensures that all outstanding RMA operations complete at the origin process (i.e., the buffer in the origin process can be reused) when an RMA operation returns.
  REMOTE_FLUSH: All outstanding RMA operations are guaranteed to complete at the origin and target processes when an RMA operation returns.

WinAtomicity
  ATOMIC_NONE (default): RMA operations are not atomic.
  ATOMIC_PUT_GET: Makes put/get RMA operations atomic, by replacing the underlying MPI_Put with MPI_Accumulate and MPI_Get with MPI_Get_accumulate.

WinThreadSafety
  NOT_CONCURRENT (default): RMACXX operations are not thread-safe.
  CONCURRENT: RMACXX operations can be safely issued from multiple threads; locks are used internally to protect the data structures from concurrent accesses.

4.3.2 Standard interface

In this section, we focus on the standard interface of RMACXX, which essentially provides a convenient layer above the MPI-3 RMA interface.

Overloaded operators

The direction of communication (put or get) in RMACXX is represented by the << or >> symbols and is made possible by overloading the corresponding C++ operators. Three operators have been customized for one-sided operations in RMACXX: operator(), operator<<, and operator>>. A pair of operators is invoked internally for every one-sided operation in RMACXX. Communication parameters such as the target process or coordinates are passed via operator(); hence it is always the first operator to be called. For elementwise access, only one set of coordinates is passed to operator(), whereas for bulk cases two sets of coordinates are required, specifying a range of elements. The passed coordinates are parameterized as a reference to an aggregate initializer [43], thereby avoiding any extra copy. For atomic operations such as accumulate, get-accumulate, fetch-and-op, and compare-and-swap, there is an extra argument to operator(): the requisite operation for combining the input data with the data on the target window (e.g., sum, product, bitwise or). Additionally, fetch-and-op and compare-and-swap operations must supply the input value (and, for compare-and-swap, the value to be compared) to be operated on the target window, in addition to the specified operation. Like their MPI counterparts, fetch-and-op and compare-and-swap affect only a single memory location at a time; hence bulk access scenarios do not apply to them. The right-hand side (RHS) of >> or << is either a scalar of type T (for elementwise cases) or a pointer of type T* (for bulk cases), matching the argument types accepted by operator>> and operator<<. Figure 4.5 shows the internal translation of a representative elementwise and bulk access put operation for the local indexing case. In the case of global indexing, a communication operation may require multiple underlying calls to MPI RMA routines (as can be seen from Figure 4.4) to access specific portions of the local buffer/remote window.

win(0,{1}) << 2;              // elementwise put, local indexing
win(1,{1,2},{5,5}) << ptr;    // bulk put, local indexing

For the elementwise case, operator() computes the displacement and stores the target rank (0), and operator<< issues an MPI_Put with the input value (2) using the stored displacement and target. For the bulk case, operator() stores the target rank (1), calculates the input count, and creates an MPI subarray derived datatype to represent the remote n-D data; operator<< then issues an MPI_Put with the input buffer (ptr), using the subarray type for the target window.

Figure 4.5: Top: elementwise access translation for local indexing. Bottom: bulk access translation for local indexing. operator() returns a reference to the current Window instance, which is used to call the subsequent operator>>/operator<< to initiate communication.

For elementwise operations using global indexing, RMACXX has to calculate the target process and the remote displacement corresponding to the input global coordinates (which involves O(p) overhead, where p is the total number of processes). For bulk cases, since noncontiguous portions of the input buffer may be accessed, RMACXX creates an MPI subarray type (i.e., using the MPI_Type_create_subarray function) to designate the local data blocks (in addition to creating a subarray type for the target window) prior to communication. Therefore, the relative overhead of the global indexing interface is higher than that of local indexing, but the benefit is enhanced convenience.
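To illustrate why the target lookup costs O(p) with global indexing when each process can own an arbitrary block, a generic sketch is shown below. It is purely illustrative (the names and types are hypothetical) and is not RMACXX's internal code.

#include <cstddef>
#include <vector>

// Hypothetical helper: find the rank owning 2D global coordinate g when each
// rank owns an arbitrary rectangular block, and compute the local displacement.
struct Block { int lo[2], hi[2]; };   // inclusive global index range per rank

int find_owner(const std::vector<Block>& blocks, const int g[2], int disp[2]) {
    for (std::size_t r = 0; r < blocks.size(); ++r) {      // O(p) scan
        const Block& b = blocks[r];
        if (g[0] >= b.lo[0] && g[0] <= b.hi[0] &&
            g[1] >= b.lo[1] && g[1] <= b.hi[1]) {
            disp[0] = g[0] - b.lo[0];                       // local offsets on
            disp[1] = g[1] - b.lo[1];                       // the owner rank
            return static_cast<int>(r);
        }
    }
    return -1;   // invalid coordinate
}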

Synchronization

Like MPI RMA, RMACXX Window access operators are nonblocking and asynchronous unless WinCompletion is parameterized with LOCAL_FLUSH or REMOTE_FLUSH; otherwise, users have to invoke the local_flush/flush functions to complete outstanding operations.

Derived datatypes for bulk communication

To optimize bulk transfers, RMACXX creates an MPI subarray type on the fly during communication operations involving multidimensional windows. For global indexing cases, slices of the origin data buffer may correspond to multiple target windows, requiring RMACXX to incur the additional overhead of creating subarray derived types for every patch of remote data. Figure 4.6 demonstrates an RMACXX put operation using global indices in which a local buffer is communicated to multiple irregular-sized window segments spread across processes in a 2 × 3 grid.

Figure 4.6: Intermediate steps of a bulk communication operation (a put) using global indices (e.g., win({1,2},{5,4}) << buf): determine the target window of the first chunk, then create the origin/remote subarray types prior to communication, repeating for each chunk of the global indexing window spread over the 2 × 3 process grid.

Table 4.3: Communication scenarios for origin/target derived type creation.

No origin subarray, no target subarray: involves a 1D window, local/global indexing.
No origin subarray, target subarray: involves a >=2D window, local indexing, contiguous local buffer.
Origin subarray, no target subarray: no such scenario.
Origin and target subarrays: involves a >=2D window, local/global indexing, contiguous/noncontiguous local buffer.

Derived type creation overhead is an unavoidable trade-off to minimize the overall communication. For 1D local/global indexing operations, there is no need to create a derived type during communication because the data resides in contiguous blocks. Table 4.3 lists the scenarios in which subarray types are created by RMACXX as the origin (local) or target datatype during communication.

Noncontiguous data layout representation for local buffers

Users can specify multidimensional array regions (i.e., a subarray) in a local communication buffer by using RMACXX_Subarray_t objects, for both local and global indexing cases. An RMACXX_Subarray_t object is instantiated with the starting location of the subarray in the local buffer, the subarray sizes, and the sizes of the full local buffer. This is illustrated in Figure 4.7. For multidimensional local indexing cases, RMACXX internally manages an MPI subarray derived type corresponding to a particular RMACXX_Subarray_t instance, which is used during communication.

RMACXX_Subarray_t type({1,0},{5,3},{6,5});

Figure 4.7: An example of the RMACXX_Subarray_t constructor (starts, subsizes, and sizes).

For global indexing, even when an RMACXX_Subarray_t object is created in advance, there are distinct data blocks with different subarray dimensions (for both the origin buffer and the target windows) that cannot be determined before communication is initiated (see Figure 4.6). The following code snippet (similar to its MPI counterpart in Section 4.1) depicts the halo exchange between the east/west neighbors of a 2D grid of size (by+2) × (bx+2), using the RMACXX_Subarray_t class to specify noncontiguous portions of a local buffer used in communication.

Window win({by+2,bx+2});

/* starts, subsizes and sizes */
RMACXX_Subarray_t west_t({1,1},{by,1},{by+2,bx+2}),
                  east_t({1,bx},{by,1},{by+2,bx+2});
...

/* north/south uses contiguous data */
win(north,{by+1,1},{by+1,bx}) << &arr[bx+1];
win(south,{0,1},{0,bx}) << &arr[by*(bx+1)];
/* east/west uses RMACXX_Subarray_t */
win(east,{1,0},{by,0}) << east_t(arr);
win(west,{1,bx+1},{by,bx+1}) << west_t(arr);

Another flexible way to communicate halo regions is shown at the beginning of this section, involving RMACXX expressions. The RMACXX expression interface is discussed in the next section.

4.3.3 Expression interface

Some application use cases may allow combining a number of RMA operations using arithmetic operators to build an expression. Such an action is especially useful when intermediate results of an ongoing computation are not important and only the final output is sought. Following is a minimal code snippet illustrating an expression that uses the elementwise local indexing interface of RMACXX.

Window win({6,6});

2 + win(2,{1,1})*win(0,{2,2}) >> num;
...
win.local_flush();

A Window object must initialize its WinUsage parameter with EXPR in order to participate in expressions (see Section 4.3.1). Both local and global indexing are supported in the RMACXX expression interface.

The expression in the preceding example internally performs two get operations (for win(2,{1,1}) and win(0,{2,2}), which are referred to as window access terms) and the requisite local computation before pushing the final result into the variable num. Since the default value of WinCompletion for win is NO_FLUSH, the user has to invoke a flush operation (local_flush()) to guarantee expression completion, that is, to ensure that the result is available in the variable num. The result of an expression is moved to an object on the RHS of >>, which is referred to as the expression destination. Apart from accepting T and T* (for elementwise/bulk cases) as valid arguments to operator>>, Window& is also allowed, as illustrated in the example below.

Window win({10,10});

2 + win(3,{1,1},{3,3})*win(0,{4,3},{6,5}) >> win(1,{2,3},{4,5});
...
win.flush();

Internally, RMACXX issues the corresponding gets, evaluates the expression, and invokes a put to the window corresponding to the window access term on the RHS of >>. The Window template parameters, discussed in Section 4.3.1, are valid for expressions as well and play an important role in deferring evaluation of expressions, setting the desired atomicity, and providing the requisite multithreading support. We employed a C++ template metaprogramming technique called expression templates in building the RMACXX expression interface. Expression templates [187, 46] enable building expressions using overloaded arithmetic operators. The operators do not actually perform the computation but instead produce intermediate objects that are able to evaluate the desired expressions. The exact definitions of the intermediate objects are known at compile time and can be expanded inline. Without expression templates, we cannot guarantee correct evaluation of all combinations of expressions by just customizing operators, since there are far too many combinations. Also, a purely polymorphic approach would require creating extra data structures to hold intermediate results. Expression templates have been used in scientific computation libraries such as Blitz++ [188], uBLAS [190], Llano [115], and POOMA [162].

To the best of our knowledge, this work is the first to use expression templates for one-sided communication.
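To make the technique concrete, the following is a generic, self-contained sketch of expression templates for elementwise addition over arrays. It illustrates only the general idea and is not the RMACXX implementation; all names are illustrative.

#include <cstddef>
#include <vector>

// CRTP base so that the operator below only matches expression types.
template <typename E>
struct Expr {
    double operator[](std::size_t i) const {
        return static_cast<const E&>(*this)[i];   // dispatch to the derived type
    }
};

// Leaf node: wraps storage.
struct Vec : Expr<Vec> {
    std::vector<double> v;
    explicit Vec(std::size_t n, double x = 0.0) : v(n, x) {}
    double operator[](std::size_t i) const { return v[i]; }

    // Assignment from any expression evaluates the whole tree in one pass,
    // without temporaries for intermediate results.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] = e[i];
        return *this;
    }
};

// Interior node: encodes "lhs + rhs" in its type; nothing is computed here.
template <typename L, typename R>
struct AddExpr : Expr<AddExpr<L, R>> {
    const L& lhs; const R& rhs;
    AddExpr(const L& a, const R& b) : lhs(a), rhs(b) {}
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
};

// The non-member operator builds the expression's type at compile time.
template <typename L, typename R>
AddExpr<L, R> operator+(const Expr<L>& a, const Expr<R>& b) {
    return AddExpr<L, R>(static_cast<const L&>(a), static_cast<const R&>(b));
}

// Usage: given Vec a(n, 1.0), b(n, 2.0), c(n); the statement c = a + b + a;
// instantiates AddExpr<AddExpr<Vec,Vec>,Vec> and evaluates it elementwise.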

Hidden expression classes

Figure 4.8: Top: translation of an elementwise expression (w(1,{0}) + 2*w(2,{1})). Bottom: translation of a bulk expression (2 + w(2,{1,1},{4,4})*w(0,{2,2},{5,5})). Items in the boxes connected with solid lines are expression terms, including operators; the arrows indicate the underlying class instantiations. The nonmember operators (* and +) are special functions that accept instances of the EExpr/BExpr classes and return another EExpr/BExpr instance after performing the requisite operation.

RMACXX hides the templated expression classes from the user: EExpr for elementwise expressions and BExpr for bulk expressions. These are responsible for creating intermediate class types that can be combined with any number of arbitrary operators using C++ templates. When operator() is invoked on a Window, instead of returning a reference to the current Window instance (as in the RMACXX standard interface, see Section 4.3.2), it wraps the reference into an EExpr or a BExpr class instance (capturing the communication parameters, i.e., the coordinates and/or target) before returning.

Figure 4.8 shows an expression in the form of a parse tree, where the top-level nodes are the nonmember operators (+, −, ∗, /) and the leaf nodes are the expression instances. These nonmember operators return partial trees, which are encoded as a specific class type using C++ templates. When the parse tree is assigned to an object, the entire expression can be evaluated in one pass. Figure 4.9 shows the steps performed during evaluation of an expression. Overall, RMACXX attempts to issue communication immediately if space is available in a predefined static buffer. The expression object also needs to be stored before evaluation of the expression is deferred. In the remainder of this section, we discuss the RMACXX expression pipeline in detail.

Figure 4.9: Stages in expression processing.

Need for temporary storage

During expression processing, temporary storage is required primarily for two purposes: to hold data associated with window access terms, and to store intermediate EExpr/BExpr instances that are created through combinations of multiple expression class instances via arithmetic operators (see Figure 4.8). The question of storing data and intermediate objects arises because, depending on the WinCompletion parameter, an expression may not be evaluated (but is deferred) until the user invokes a flush operation. Therefore, when a Window is parameterized with EXPR, RMACXX maintains a static buffer for holding the intermediate results of the get operations. For storing the expression object, a separate chunk of memory is preallocated and used exclusively to construct expression instances with the C++ placement new operator, which allows expression objects to be constructed on preallocated memory.

When the static buffer is full, communication is deferred until the expression can be evaluated. In contrast, expression object construction cannot be deferred, and hence the object is allocated on the free store (using operator new) if the preallocated buffer is depleted. The storage requirement further increases for bulk global indexing cases, since RMACXX may need to store multiple communication parameters (in addition to local buffer offsets) pertaining to distinct communication operations.

Issuing communication operations

An expression is evaluated from left to right by the compiler, and the expression destination on the RHS of >> is not encountered by the compiler until operator>> is invoked. Even with a preallocated buffer, a problem with issuing the gets immediately (as soon as window access terms are encountered) is that the type of the expression destination is not known in advance. In particular, when the destination is a window access term on the RHS of >>, issuing a get operation would be erroneous, since a put operation needs to be initiated instead, and there is no way to roll back a get operation that has already been issued. To prevent such a scenario, we issue the gets pertaining to window access terms when operator>> is invoked (i.e., the earliest point at which an expression can begin evaluating).

Deferring expression evaluation

Since one cannot detach communication from its associated computation in an expression, just storing the communication parameters for later expression evaluation is not sufficient, because relative ordering matters for arithmetic operations. Figure 4.8 shows that the bulk/elementwise class templates are created at compile time based on the expression at hand, making it difficult to declare a container with a single “expression” class type. We bypass this issue by creating nontemplated abstract base classes [68] that each of the expression classes inherits, and we declare the relevant member functions as virtual. The abstract base class cannot be instantiated directly; hence we store a pointer to each derived-type object as a base-class pointer in a container that is accessible by all expression objects.

When the user invokes a flush operation, expression objects are popped from this container and evaluated one by one.
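A simplified, generic sketch of this pattern is shown below: derived expression objects are constructed with placement new in a preallocated buffer (falling back to the free store when the buffer is exhausted), tracked through base-class pointers, and evaluated at flush time. All names are illustrative; this is not the RMACXX code.

#include <cstddef>
#include <new>
#include <vector>

// Nontemplated abstract base: the only type the container needs to know about.
struct ExprBase {
    virtual void evaluate() = 0;    // issue the gets/puts and do the arithmetic
    virtual ~ExprBase() {}
};

// Stands in for a particular EExpr/BExpr instantiation; Op carries the work.
template <typename Op>
struct DeferredExpr : ExprBase {
    Op op;
    explicit DeferredExpr(Op o) : op(o) {}
    void evaluate() override { op(); }
};

struct ExprQueue {
    alignas(std::max_align_t) unsigned char buf[4096];   // preallocated chunk
    std::size_t used = 0;
    std::vector<ExprBase*> pending;                      // base-class pointers

    template <typename Op>
    void defer(Op op) {
        const std::size_t sz = sizeof(DeferredExpr<Op>);
        const std::size_t al = alignof(std::max_align_t);
        used = (used + al - 1) / al * al;                // keep placement new aligned
        void* p;
        if (used + sz <= sizeof(buf)) { p = buf + used; used += sz; }
        else { p = ::operator new(sz); }                 // fall back to the free store
        pending.push_back(new (p) DeferredExpr<Op>(op)); // placement new
    }

    void flush() {                                       // evaluate in issue order
        for (ExprBase* e : pending) e->evaluate();
        pending.clear();                                 // (destruction/reclamation of
        used = 0;                                        //  the objects omitted for brevity)
    }
};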

Expression completion

For the standard interface, completion refers only to completion of outstanding RMA operations. For expressions, however, an additional task of computing the entrywise arithmetic operations is included in the completion semantics. Table 4.4 shows expression completion requirements based on the WinCompletion attribute set during expression window creation time.

Table 4.4: Expression completion characteristics.

Expression RHS type: T* (bulk)
  NO_FLUSH: completes on (local_)flush.
  LOCAL_FLUSH: evaluated immediately, no flush required.
  REMOTE_FLUSH: evaluated immediately, no flush required.

Expression RHS type: T& (elementwise)
  NO_FLUSH: completes on (local_)flush.
  LOCAL_FLUSH: evaluated immediately, no flush required.
  REMOTE_FLUSH: evaluated immediately, no flush required.

Expression RHS type: Window<T, ..., EXPR>& (window)
  NO_FLUSH: completes on flush.
  LOCAL_FLUSH: evaluated immediately, but requires a flush to ensure put completion.
  REMOTE_FLUSH: evaluated immediately, no flush required.

Concurrency

As with the standard interface, multiple threads can issue expressions consisting of multiple window access terms. Intermediate objects created during expression evaluation are private to each thread, and access to the shared data structures associated with a Window object is protected by locks, as in the RMACXX standard interface.

Enhancing programmability

Using RMACXX expressions may lead to a significant reduction in lines of code when entrywise operations are involved. Entrywise operations such as Hadamard (or Schur) and Kronecker products are used extensively in linear algebra and statistics. MPI RMA does not support transferring data directly between windows, so in such a use case multiple MPI calls with intermediate buffers need to be managed explicitly.

In contrast, the RMACXX expression interface handles this case elegantly by hiding the implementation details from the user and requiring only the indices into the distributed windows. This has significant implications for scientific computations such as ab initio quantum many-body methods. For instance, the tensor contraction expression for a single term in the coupled cluster model of electronic structure calculation may span hundreds of terms, requiring hundreds of thousands of lines of code to be written in MPI [16]. We provide a concrete example to highlight the benefits of RMACXX. The following code snippet performs a 5-point stencil update using the RMACXX global indexing interface over a 2D process grid:

for (int x = 1; x < height; x++)
  for (int y = 1; y < width; y++)
    (win({x,y}) + win({x-1,y}) + win({x+1,y}) + win({x,y-1}) + win({x,y+1}))/5 >> buf[x][y];

This is very similar to serial code. If it were implemented using MPI RMA, users would have to calculate each remote offset, invoke five MPI get operations, issue synchronization statements, and perform the requisite arithmetic operations. Although the above example uses elementwise transfers, the RMACXX code remains fairly similar for bulk transfer cases (bulk transfers expect both lo and hi indices). If such a scenario with bulk communication and arithmetic operations were implemented using MPI RMA, users would have the additional responsibility of managing the intermediate buffers. In contrast, RMACXX provides convenience to the user by hiding the details of intermediate buffer management, while allowing customization of various aspects of communication.

4.4 Experimental evaluation

We use two platforms for our experiments: Argonne Blues and NERSC Edison. Details about the platforms are listed in Table 4.5. In this section, we compare the performance of RMACXX with that of other models, using both microbenchmarks and applications.

Table 4.5: Experimental platforms.

ANL Blues
  Compilers and MPI: GNU 4.9.3, Intel 17.0.4, MVAPICH2 2.2, GNU 6.1
  Processor architecture: dual-socket 8-core Intel “Sandy Bridge” Xeon E5-2670 at 2.6 GHz (16 cores/node), 64 GB memory/node, 20 MB L3 cache
  Network interconnect: Intel TrueScale QDR InfiniBand with PSM interface

NERSC Edison (Cray XC30)
  Compilers and MPI: GNU 4.9.3, Intel 17.0.4, Cray Fortran (CCE 8.6.2), Cray MPICH 7.6.2
  Processor architecture: dual-socket 12-core Intel “Ivy Bridge” Xeon E5-2695 v2 at 2.4 GHz (24 cores/node), 64 GB memory/node, 30 MB L3 cache
  Network interconnect: Cray Aries with Dragonfly topology

The microbenchmarks measure the performance of a single RMA operation, such as put, get, or accumulate. Additionally, we perform an instruction count analysis to quantify the cost of the various operations supported by RMACXX over MPI, which would otherwise be difficult to demonstrate through execution time analysis alone. The intent of RMACXX is to match the performance of MPI, since it is bounded by MPI’s performance. RMACXX is a header-only library, which enables compilers to perform optimizations such as inlining (since identical function definitions are guaranteed) without explicit user input.

4.4.1 Instruction count and latency analysis

We perform an instruction count and latency analysis on Blues to determine the relative cost of using RMACXX over MPI. In particular, in this section we emphasize the local indexing interface, in order to quantify the minimum overhead of RMACXX. The absolute overhead of RMACXX is measured by turning off communication (using MPI_PROC_NULL instead of a valid target rank) for local indexing cases. In such cases, the overhead is about 10–15%, since RMACXX still has to perform some calculations, unlike MPI. These microbenchmarks are executed for 10,000 iterations on two processes of a single node, and the 95% confidence interval (CI) was within 5% of the mean. We report the mean time per iteration in seconds. The RMACXX Window is parameterized with NO_FLUSH in every case.

For measuring instruction counts, we used the Intel Software Development Emulator (SDE) tool [104] v7.58. The synchronization time/instructions are not measured, because synchronization or flush operations are not deterministic and require polling to ensure completion of outstanding RMA operations. Following are the descriptors for the different variants of code, corresponding to a particular test case, used for the instruction count/latency analysis.


• mpi-pnull – MPI version of a microbenchmark; performs nil communication

• rmacxx-pnull – RMACXX version of a microbenchmark; performs nil communication, uses elementwise (E) or bulk (B) access interface

• mpi – MPI version of a microbenchmark

• rmacxx – RMACXX version of a microbenchmark; uses elementwise (E) or bulk (B) access interface

Figure 4.10: Instruction counts (top) and latencies in seconds (bottom) of MPI and RMACXX (local indexing) put, get, accumulate, and get-accumulate operations on ANL Blues.

RMACXX local indexing interface

Standard RMA operations Figure 4.10 shows that RMACXX adds 20–28 extra instructions and incurs an overhead of 1–4% compared with MPI for standard RMA operations. The instruction counts of MPI accumulate are relatively high (compared with put/get counts) because of the presence of atomic instructions, due to mutexes. The fetch-and-op and compare-and-swap atomic operations can access only a single memory address; hence bulk tests do not apply to them. As shown in Table 4.6, for fetch-and-op and compare-and-swap, RMACXX uses 18 and 19 extra instructions, respectively, and has an overhead of 1–1.5% as compared with MPI.

Table 4.6: Atomic memory operations.

Version        fetch-and-op            compare-and-swap
               Instr.    Lat. (s)      Instr.    Lat. (s)
mpi-pnull      137       0.0148        130       0.0146
rmacxx-pnull   155       0.0163        149       0.0174
mpi            448       0.2927        310       0.2393
rmacxx         466       0.2964        329       0.2395

Concurrent execution In this microbenchmark, two threads concurrently issue a put operation, essentially doubling the work compared with the put case discussed previously. The Window is parameterized with CONCURRENT. RMACXX provides the flexibility of choosing either a spin-lock-based mutex or the default blocking mutex (which falls under the category of a semaphore) provided by C++ (i.e., the lock/unlock functions of std::mutex). A spin lock is implemented using atomic operations, whereas a semaphore uses a futex (fast userspace mutex) kernel system call [71] on GNU/Linux platforms.
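For reference, a spin lock of the kind described here can be built from a single atomic boolean; this is a generic sketch, not the RMACXX code.

#include <atomic>

// Minimal spin lock: a waiting thread busy-waits on an atomic flag instead of
// being suspended by the kernel (as std::mutex may do via a futex).
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock() {
        while (locked.exchange(true, std::memory_order_acquire)) { /* spin */ }
    }
    void unlock() { locked.store(false, std::memory_order_release); }
};

Because lock()/unlock() mirror the std::mutex interface, the same critical sections can be guarded by either implementation.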

Figure 4.11: Local indexing concurrent versions: instruction counts (left) and latencies in seconds (right) for the std::mutex and spin-lock variants.

Figure 4.11 shows that when the std::mutex lock is chosen, 47 extra instructions are used compared with the standard RMACXX put (see Figure 4.10). In contrast, when a spin lock is chosen, only 8 extra instructions are used. In terms of latency, RMACXX incurs an overhead of 40% over MPI with the std::mutex lock, whereas with a spin lock RMACXX is only 2–4% slower than MPI. However, unlike std::mutex, which can suspend waiting threads, a spin lock makes a waiting thread busy-wait, which can negatively impact performance when a lock needs to be held for a long time.

Expressions Similar to the put/get microbenchmarks discussed above, we evaluate the performance of a trivial expression (i.e., win(1,{0}) >> val;), which is equivalent to a get operation (on an integer). Table 4.7 shows the overhead of a trivial RMACXX expression compared with an MPI_Get operation.

Table 4.7: Expression interface instructions and latencies.

Version        Instructions               Latencies (s)
               Elementwise   Bulk         Elementwise   Bulk
mpi-pnull      142           142          0.0165        0.0165
rmacxx-pnull   204           219          0.0237        0.0248
mpi            247           247          0.0308        0.0308
rmacxx         309           328          0.0333        0.0337

We observe that the rmacxx-pnull version has 40–50% overhead (using 62/81 extra instructions) compared with the MPI get operation. This result is expected, because RMACXX has to calculate the displacement and related offset into the static buffer (26 instructions), post the outstanding gets (153 instructions), and construct an intermediate object and store it in the preallocated buffer (21 instructions). Adding another 4 instructions for counters and for creating the expression instance from the window reference, we get a total of 204, which matches the overall instruction count for the rmacxx-pnull case (see Table 4.7). The rmacxx version has 8–10% overhead compared with MPI. Overall, the elementwise expression interface adds 41/46 instructions compared with the RMACXX standard interface. Similarly, the bulk expression interface uses 52/61 extra instructions compared with the standard interface.

RMACXX global indexing interface

Put microbenchmark We demonstrate the performance of a put operation using the RMACXX global indexing data model. Since the target process is calculated internally by RMACXX in this case, there is no straightforward way to set it to MPI_PROC_NULL (i.e., the rmacxx/mpi-pnull cases); hence we discuss only the mpi and rmacxx cases in Table 4.8. As shown in the table, for elementwise cases RMACXX has an overhead of about 7% compared with MPI, whereas for bulk cases the overhead is 8%/25% (for 1D/2D).

Table 4.8: Global indexing put instructions and latencies.

Version   Instructions                            Latencies (s)
          Elementwise   Bulk (1D)   Bulk (2D)     Elementwise   Bulk (1D)   Bulk (2D)
mpi       241           241         20632         0.0300        0.0300      3.220
rmacxx    284           307         21337         0.0320        0.0390      3.485

Unlike the local indexing interface, RMACXX has to retain the input coordinates, requiring copies of the input coordinates and hence adding to this overhead. For bulk cases, RMACXX also manages derived types (Section 4.3.2). To assess the relative cost of datatype creation, we include the instructions for subarray type creation at the origin and target for MPI when we compare MPI with RMACXX for the bulk (2D) cases. The instruction count of the base MPI version is significantly higher in this case because of the cost of datatype creation and internal copies inside the MPI software stack to store type metadata. Overall, the RMACXX versions add 43 instructions for elementwise cases and at least 66 instructions for bulk cases (705 extra instructions for a bulk 2D case) compared with MPI.

Overheads related to communication of a noncontiguous local buffer RMACXX uses subarray types for bulk transfers involving multidimensional windows. Communication of noncontiguous local chunks requires encapsulating the subarray information of the local buffer in an RMACXX_Subarray_t instance (refer to Table 4.3 in Section 4.3.2). This microbenchmark measures the overhead of communication with and without the intermediate subarray type creation for the origin/target, for a trivial case of communicating only one byte. Table 4.9 shows that for the 1D local/global indexing interface, RMACXX adds 43–67 extra instructions and has 2–4% overhead compared with MPI. In contrast, for 2D cases, RMACXX uses 120–717 extra instructions. Table 4.9 also lists local indexing cases for reference. For local indexing 2D transfers with noncontiguous local buffers, a subarray type for the origin can be created in advance when RMACXX_Subarray_t is invoked, whereas the target-side subarray is constructed during communication. This is why its instruction count is approximately half that of the global indexing counterpart.

Table 4.9: Bulk put with noncontiguous local buffer.

Version        Local Index Window      Global Index Window
               1D         2D           1D         2D
mpi-pnull      141        10184        -          -
rmacxx-pnull   184        10236        -          -
mpi            241        10530        241        20632
rmacxx         284        10651        308        21349

4.4.2 Message rate and remote atomics

We compare message rate and remote atomics (fetch-and-add) performance of RMACXX and MPI with other PGAS models on shared/distributed memory of the ANL Blues and NERSC Edison platforms. These tests are performed on two processes.

Table 4.10: Communication models and transport layers.

ANL Blues
  MVAPICH2 2.2 (includes MPICH 3.1.4); transport: PSM
  OpenCoarrays 2.1.0 [63] (GNU 6.1); transport: MPI-3 RMA, uses MVAPICH2 2.2
  Global Arrays 5.4; transport: Comex (MPI-1 two-sided with progress ranks), uses MVAPICH2 2.2
  UPC++ (GNU 6.1); transport: GASNet-EX (MPI conduit, MVAPICH2 2.2)

NERSC Edison
  Cray MPICH 7.6.2; transport: uGNI/XPMEM
  Cray Fortran 8.6.2; transport: DMAPP/XPMEM
  Global Arrays 5.4; transport: Comex, uses Cray MPICH 7.6.2
  UPC++; transport: GASNet-EX (uGNI/XPMEM)

Usually, synchronization time is not included in message rate tests; however, Fortran 2008 Coarrays (CAF) have local completion semantics, and Global Arrays (GA) does not support implicit nonblocking RMA operations. Therefore, to maintain comparability, we include synchronization time in our evaluation. In Table 4.10, we list the communication models and the low-level interfaces of Blues and Edison used in our evaluation.

Message rate The message rate microbenchmark measures time taken to launch a number of put/get operations.

Figure 4.12: Blues: Intranode put (left) and get (right) rates (messages/s vs. bytes transferred) for MPI, RMACXX, CAF, GA, and UPC++.

Figure 4.13: Edison: Intranode put (left) and get (right) rates.

Figure 4.14: Blues: Internode put (left) and get (right) rates.

Figures 4.12, 4.13, 4.14, and 4.15 show the message rates of MPI, RMACXX, CAF, GA, and UPC++ between two processes within a node (intranode) and on two different nodes (internode) of Blues and Edison, respectively.

Figure 4.15: Edison: Internode put (left) and get (right) rates.

Overall, on Edison we observe that CAF performs 1.2–5× and 6–10× better than the rest for small to medium-size messages on shared and distributed memory, respectively. The distributed-memory performance of GA on Blues was found to be competitive with that of MPI RMA/RMACXX for large data sizes, owing to a different protocol (MPI-1 two-sided, with implicit point-to-point synchronization) and an extra process per node for communication progress. UPC++ demonstrates performance comparable to that of MPI/RMACXX on Blues, since GASNet-EX uses an MPI conduit there.

Figure 4.16: Blues: Intranode (left) and internode (right) fetch-and-add rates.

Remote atomics The remote atomics microbenchmark measures the time taken to issue a number of 8-byte fetch-and-add operations, similar to the message rate microbenchmark. Figures 4.16 and 4.17 show the performance of fetch-and-add on the shared/distributed memory of Blues and Edison, respectively.

Figure 4.17: Edison: Intranode (left) and internode (right) fetch-and-add rates.

On Edison, the intranode performance of UPC++ is significantly better than that of MPI (Figure 4.17), which indicates scope for improvement in the remote atomics implementation of Cray MPICH 7.6.2.

4.4.3 Application evaluations

We use eight applications from different scientific domains to evaluate different capabilities of RMACXX and compare the results with MPI/GA on Edison, as shown in Figure 4.18. We compare the performance of 1,000 iterations of halo exchange for a 24K × 24K distributed array using RMACXX expressions (code snippet shown at the beginning of Section 4.3) with an MPI implementation using derived types (code snippet shown in Section 4.1). Despite the overhead of the expression interface, the performance of RMACXX is close to that of MPI across 64 to 1K processes. The stencil benchmark performs nearest-neighbor computation over a regular 2D grid for a 5-point stencil for heat diffusion equations. The entire data grid is distributed over a 2D grid of processes, such that each process gets a portion of the grid cells. The RMACXX version in this case uses the RMACXX_Subarray_t constructor to access noncontiguous locations of the origin buffer. Both the MPI and RMACXX versions work with a user-managed derived type (the MPI version uses a vector type, whereas RMACXX creates a subarray type) and have similar overheads; hence the performance difference is negligible.

Figure 4.18: Application evaluations using RMACXX on NERSC Edison. Panels: halo exchange (RMACXX expressions vs. MPI RMA with derived types), stencil (RMACXX_Subarray_t vs. MPI RMA with a vector type), LULESH, MILC, EBMS, NWChem TCE proxy, giga random updates (GUPS), and Lennard-Jones MD.

The Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH) [109] mini-application is from the area of shock hydrodynamics. The native version of LULESH uses the MPI two-sided nonblocking API. In each iteration of LULESH, a process needs to communicate with 26 of its neighbors in a 3D domain. We show weak-scaling results for 100 iterations on 8–4K processes. At 4K processes, the MPI two-sided nonblocking version was found to be about 4% faster than the RMA versions, while we observe no difference between the MPI RMA and RMACXX versions. The MIMD Lattice Computation (MILC) Collaboration studies quantum chromodynamics, the theory of the strong interaction between subatomic particles [18]. The code invokes a conjugate gradient solver on a four-dimensional rectangular grid. Our implementation is based on the foMPI implementation of MILC from [74]. We maintain a local lattice size of 4^4 for our parallel evaluations. Overall, we observe a performance variation of 1–3% between MPI RMA and RMACXX. Energy Banding Monte Carlo Simulation (EBMS) [64] is a mini-application that simulates neutron transport. The computations for 50M particles are divided between memory nodes (we use a memory group of 16) and tracking nodes. Between compute nodes, MPI RMA operations are invoked to get band data from remote nodes. EBMS achieves overlap of computation and communication by invoking a get operation to fetch the (k+1)th band of data while particles in band k are being tracked. RMACXX essentially replaces the only MPI_Get operation in this application, adding an overhead of about 30 extra instructions, which is too little to affect the overall execution time. The NWChem TCE proxy application simulates the distributed Fock matrix building computation in electronic structure calculations and employs a communication pattern similar to that exploited in the NWChem quantum chemistry suite’s TCE CCSD(T) module [184]. Each process requests a task, and upon receiving a task it issues get operations to fetch different blocks from distributed matrices. It then performs local computation on the fetched blocks and issues floating-point accumulate operations to push the updated blocks to a distributed matrix. Accesses to different blocks are coordinated via a distributed global counter, which ensures that only a single process is accessing a particular block of the global array at a time.

In this case, the performance of RMACXX is about 5× better than that of the GA version across the 16 to 4K processes considered. This is due to the subpar performance of GA accumulate operations on Edison compared with the MPI RMA accumulate operation, which was observed to be about 2–4× faster. The Random Access benchmark [130] is used to determine worst-case memory throughput by performing updates at random positions of a large distributed array. After determining a random target process, a fetch-and-add operation is used to calculate the remote offset where an update will be made, followed by the one-sided operation that writes to that position. The giga-random-updates-per-second (GUPS) metric is used to report memory throughput, calculated as the number of remote updates per second divided by 10^9. The performance of MPI and RMACXX was found to be similar; however, a variance of 4–10% was observed for this benchmark owing to the random access patterns. The force between two atoms or particles can be approximated by the Lennard-Jones potential energy function [151]. The size of the force matrix is the square of the number of atoms, and it is divided into multiple blocks for dynamic load balancing. Using Newton’s equations of motion and the velocity Verlet algorithm, the velocities and coordinates of the atoms are updated for subsequent time steps. The original code of this application uses GA, and we replaced its most heavily used one-sided put operation with the RMACXX global indexing interface, observing about a 5–10% improvement in performance. Table 4.11 lists the RMACXX interface usage in the applications discussed in this section.

Table 4.11: RMACXX usage in applications

Application           RMACXX API usage
Halo exchange         Bulk expressions
EBMS                  Bulk get
Hartree-Fock proxy    Bulk put, get, accumulate
Stencil               Bulk put with noncontiguous transfer
LULESH                Bulk put
GUPS                  Elementwise put, fetch-and-add
MILC                  Elementwise fetch-and-add, bulk get
Lennard-Jones MD      Bulk put

4.5 Chapter summary

We presented RMACXX, a compact set of C++ bindings to MPI-3 RMA, with which users with limited knowledge of MPI can quickly prototype a variety of parallel application codes. The primary purpose of RMACXX is to enhance the programmability of MPI-3 RMA through an intuitive interface, while at the same time providing options for advanced usage and keeping overheads at a minimum. To the best of our knowledge, this work is the first to establish the efficacy of using modern C++ over MPI RMA. RMACXX adds only about 20 instructions to the critical communication path of the standard interface compared with hand-written MPI RMA code. In addition to standard RMA operations, RMACXX supports arbitrary entrywise arithmetic operations on local objects, efficiently managing the memory of intermediate objects. RMACXX exhibited performance competitive with MPI in a wide variety of application case studies. In summary, RMACXX offers PGAS-like primitives without resorting to multiple runtimes or requiring special compilers, and it enjoys near-MPI performance.

CHAPTER 5
DISTRIBUTED-MEMORY PARALLEL LOUVAIN METHOD FOR GRAPH COMMUNITY DETECTION

5.1 Introduction

Community detection is a widely used operation in graph analytics. Given a graph G = (V, E), the goal of the community detection problem is to identify a partitioning of vertices into "communities" (or "clusters") such that related vertices are assigned to the same community and disparate/unrelated vertices are assigned to different communities. The community detection problem is different from the classical problem of graph partitioning in that neither the number of communities nor their size distribution is known a priori. Because of its ability to uncover structurally coherent modules of vertices, community detection has become a structure discovery tool in a number of scientific and industrial applications, including biological sciences, social networks, retail and financial networks, and literature mining. Comprehensive reviews on the various formulations, methods, and applications of community detection can be found in [42, 65, 141, 152]. Various measures have been proposed to evaluate the goodness of the partitioning produced by a community detection method [110, 117, 123]. Of these measures, modularity is one that is widely used. Proposed by Newman [142], the measure provides a statistical way to quantify the goodness of a given community-wise partitioning on the basis of the fraction of edges that lie within communities. Modularity has its limitations, however. As a metric, it suffers from what is known as a resolution limit [66]. Computationally, modularity optimization is an NP-complete problem [27]. Despite these limitations, the measure continues to be widely used in practice [65, 80]. Resolution-limit-free versions of modularity have been proposed [181], and numerous efficient community detection heuristics (based on maximizing modularity) have been developed over the years, making

the analysis of large-scale networks feasible in practice. One such efficient heuristic is the Louvain method proposed by Blondel et al. [20]. The method is a multi-phase, multi-iteration heuristic that starts from an initial state of |V| communities (with one vertex per community) and iteratively improves the quality of community assignment until the gain in quality (i.e., modularity gain) becomes negligible. From a computation standpoint, this translates into performing multiple sweeps of the graph (one per iteration) and graph coarsenings (between successive phases). Because of its speed and relatively high quality of output in practice [103], the Louvain method has been widely adopted by practitioners. Since its introduction to the field, there have been multiple attempts at parallelizing the Louvain heuristic (see Section 5.2). To the best of our knowledge, the fastest shared-memory multithreaded implementation of Louvain is the Grappolo software package [128]. The implementation was able to process a large real-world network (soc-friendster; 1.8B edges) in 812 seconds on a 20-core, 768

GB DDR3 memory Intel Xeon shared-memory machine [91]. We present a scalable distributed-memory implementation of the Louvain method for parallel community detection [76, 75]. One of the major challenges in the design of an efficient distributed-memory Louvain implementation is to enable efficient vertex neighborhood scans (for changes in neighboring community states), since with a distributed representation of the graph, communication cost can become significant. Another major challenge is the frequency with which community states are accessed for queries and updates; the serial algorithm has the benefit of progressing from one iteration to the next in a synchronized manner (always benefiting from the latest state information), while the cost of maintaining and propagating such latest information could become prohibitive in a distributed setting. The variable rates at which vertices are processed across the processor space present another layer of challenge in the distributed setting. The approach proposed in this chapter overcomes the above challenges using a combination of various approximate computing techniques and heuristics. As mentioned earlier, community detection algorithms based on the strategy of maximizing

modularity are susceptible to the resolution limit problem, a situation where the algorithms fail to distinguish between two clearly defined clusters (modules) that are smaller than a certain size with respect to the total size of the input and the interconnectedness of the clusters themselves. Consequently, for the Louvain method, resolution of modules smaller than the square root of the total number of edges is not guaranteed. Since scalability is a primary goal of our work, addressing the resolution limit problem is important. We therefore implemented the fast-tracking resistance method of Granell et al. [82] and present the empirical results we obtained. The remainder of the chapter is organized as follows. After a brief review of related work in Section 5.2, we provide in Section 5.3 essential preliminaries on the community detection problem, along with a description of the serial Louvain algorithm and the challenges associated with its parallelization. We present our distributed-memory parallel algorithm in its basic version in Section 5.4. We introduce approximate computing techniques and heuristics to improve the performance of our parallel baseline implementation in Section 5.5. We provide an extensive performance evaluation of our algorithm and its associated heuristics on real-world networks in Section 5.6. In Section 5.8, we present further analysis on a manycore platform (Intel KNL), discuss the resource usage patterns (e.g., power, energy, and memory) of the distributed implementations, and study the impact of MPI communication methods. In Section 5.9, we present an analysis of the resolution limit problem using the fast-tracking resistance method. Section 5.10 concludes the chapter.

5.2 Related work

There have been a number of prior research efforts on distributed parallel community detection [12, 29, 148, 154, 156, 191]. Among these, an MPI-based distributed-memory Louvain implementation is reported in [156], where, similar to our distribution strategy, the vertices and their edge lists are split among the processes using a 1D decomposition. Although our distribution strategies are similar, the overall methods are very different. Firstly, we use various approximate computing techniques and heuristics to optimize performance. Moreover, we use large real-world

datasets in our experimental evaluations, and compare the performance of our MPI+OpenMP Louvain algorithm with that of a pure OpenMP implementation. The authors of [156] report the execution time for their algorithm run on the uk-2007 real-world network (3.3B edges) to be about

45 seconds on 128 IBM Power7 nodes. In comparison, we report an all-inclusive execution

time of about 47 seconds for the same uk-2007 graph on 128 processes using 8 Intel Haswell nodes of NERSC Cori and 4 OpenMP threads per process. Another MPI implementation is discussed in [191], where ParMETIS [111] is used to partition the graph among processes before the distributed-memory community detection algorithm starts. Since graph partitioning is an NP-hard problem, we decided in our approach not to spend extra computational time on finding a near-optimal graph partition, and instead work with a simpler distribution. Zeng et al. [195] discuss their distributed-memory (MPI-based) Louvain implementation that replicates high-degree vertices among processes and redistributes edges to ensure an equivalent distribution of edges. The authors report that the execution time of the first two Louvain phases on the uk-2007 graph is over 100 seconds on 1024 processes of the ORNL Titan supercomputer. In contrast, the execution time of the baseline version of our distributed Louvain implementation, including all the Louvain phases, for the uk-2007 graph is about 38 seconds on 1024 processes of NERSC Cori.

5.3 Preliminaries

We review in this section preliminaries around the community detection problem and the Louvain method for solving it.

5.3.1 Modularity

A graph is represented by G = (V, E), where V is the set of vertices and E is the set of edges. An edge between vertices $i$ and $j$ may have an associated edge weight $w_{i,j}$. The community detection problem is one of identifying a set of communities in the input graph G, where the communities represent a partitioning of V. The goodness of clustering achieved by community detection can

be measured by a global metric such as modularity [142]. More specifically, given a community-wise partitioning of an input graph, modularity measures the difference between the fraction of edges that fall within communities and the expected fraction that would exist in a random graph with identical vertex and degree distribution characteristics. Given G and its adjacency matrix representation A (where the matrix entries correspond to edge weights), the modularity of G, denoted by Q, is given by:

$$Q = \frac{1}{2m} \sum_{i,j} \left( A_{ij} - \frac{k_i\, k_j}{2m} \right) \delta(c_i, c_j) \qquad (5.1)$$

where:

$m$ = sum of all the edge weights
$k_i$ = sum of weights of edges incident on vertex $i$
$c_i$ = community that contains vertex $i$
$\delta(c_i, c_j)$ = 1 if $c_i = c_j$, 0 otherwise.

In practical terms, modularity depends on the sum of all edge weights between vertices within a particular community (denoted by $e_{ij}$), and the sum of weights of all edges incident upon each community $c$ (denoted by $a_c$). Viewed that way, Equation 5.1 can be written as Equation 5.2, where C denotes the set of communities.

$$Q = \sum_{c \in C} \left[ \frac{e_{ij}}{2m} - \left( \frac{a_c}{2m} \right)^2 \right] \qquad (5.2)$$

where:

$e_{ij} = \sum w_{ij} : \forall\, i, j \in c \text{ and } \{i, j\} \in E$
$a_c = \sum_{i \in c} k_i$

We use the formulation in Equation 5.2 in our implementations.
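As a concrete illustration of Equation 5.2, the sketch below computes Q from per-community aggregates. The container names, and the assumption that the intra-community weights and incident degrees have already been accumulated (locally, or via a reduction in the distributed setting), are ours for illustration, not taken from the implementation.

// Minimal sketch of Equation 5.2: Q = sum over c of [ e_c/(2m) - (a_c/(2m))^2 ].
// e[c] is assumed to hold the total intra-community edge weight of community c,
// and a[c] the total degree (weight) incident on c.
#include <unordered_map>

double modularity(const std::unordered_map<long, double>& e,
                  const std::unordered_map<long, double>& a,
                  double m /* sum of all edge weights */) {
  const double two_m = 2.0 * m;
  double q = 0.0;
  for (const auto& [c, ac] : a) {
    auto it = e.find(c);
    const double ec = (it != e.end()) ? it->second : 0.0;
    q += ec / two_m - (ac / two_m) * (ac / two_m);
  }
  return q;
}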

Algorithm 2: Serial Louvain algorithm.
Input: Graph G = (V, E), threshold τ, initial community assignment C_init
Output: Community assignment C_curr
1: Q_prev ← −∞
2: C_curr ← C_init
3: while true do
4:   for all v ∈ V do
5:     N(v) ← neighboring communities of v
6:     targetComm ← arg max over t ∈ N(v) of ∆Q(v moving to t)
7:     if ∆Q > 0 then
8:       Move v to targetComm and update C_curr
9:   Q_curr ← ComputeModularity(G, C_curr)
10:  if Q_curr − Q_prev ≤ τ then
11:    break
12:  else
13:    Q_prev ← Q_curr

5.3.2 Serial Louvain algorithm

The Louvain method consists of multiple phases, each with multiple iterations. In particular, a phase runs for a number of iterations until convergence. Initially, each vertex is assigned to a separate community. Within each iteration, vertices are processed as follows: for a given vertex v, the gain in modularity (∆Q) that would result from moving v to each of its neighboring communities is calculated; if the maximum of such gains is positive, then v is moved to that community from its current community. The phase is continued until the gain in modularity between any two successive iterations falls below a user-specified threshold (τ). When a phase ends, the graph for the next phase is rebuilt by collapsing all vertices within a community into a single meta-vertex, and the process is continued until no appreciable gain in modularity is achieved between consecutive phases. A pseudocode summarizing the procedure just described is shown in Algorithm 2.
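The inner loop of Algorithm 2 (lines 4-8) is a neighborhood scan followed by an argmax over candidate communities. The sketch below illustrates that structure for a single vertex. The CSR layout, the array names, and the simplified gain expression (which drops terms that are constant across the candidate communities) are assumptions for illustration, not the dissertation's exact code.

// Hedged sketch of one Louvain local-move step (Algorithm 2, lines 4-8).
#include <unordered_map>
#include <vector>

struct Graph {                         // CSR storage, as in Section 5.4.1
  std::vector<long>   rowptr;          // size |V| + 1
  std::vector<long>   colidx;          // neighbor vertex ids
  std::vector<double> weight;          // edge weights
};

// Returns the community v should move to (possibly its current one).
long best_move(const Graph& g, long v,
               const std::vector<long>& comm,  // community of each vertex
               const std::vector<double>& k,   // weighted degree of each vertex
               const std::vector<double>& a,   // a_c: total degree incident on c
               double m) {
  std::unordered_map<long, double> w_to;       // weight from v to each neighboring community
  for (long e = g.rowptr[v]; e < g.rowptr[v + 1]; ++e)
    w_to[comm[g.colidx[e]]] += g.weight[e];

  const long cur = comm[v];
  long best = cur;
  // Score of staying put: exclude v's own contribution to a[cur].
  double best_score = w_to[cur] / m - (k[v] * (a[cur] - k[v])) / (2.0 * m * m);
  for (const auto& [c, w] : w_to) {
    if (c == cur) continue;
    // Simplified modularity-gain score (one commonly used form; an assumption).
    double score = w / m - (k[v] * a[c]) / (2.0 * m * m);
    if (score > best_score) { best_score = score; best = c; }
  }
  return best;
}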

5.3.3 Challenges in distributed-memory parallelization

The primary issue affecting the global modularity in distributed-memory parallelization of the Louvain algorithm stems from concurrent community updates. A particular process only has the updated vertex-community association information from its last synchronization point. Between

the last synchronization point and the time the current process accesses a community, it is possible that a remote process has marked some updates for the community. However, these changes will only be applied at the next synchronization point. Due to this lag in community updates, the global modularity score (and overall convergence) of a distributed-memory parallel implementation of the Louvain algorithm could differ from that of a comparable serial or shared-memory implementation. Lu et al. [128] discuss some challenges in parallelization, such as negative gain and local maxima scenarios, which are relevant for distributed-memory cases as well. There is significant communication overhead at every iteration of every phase, owing to the exchange of community updates (vertices entering and leaving communities). Updated community information is required for calculating the cumulative edge weights within a community, and incident on a community, which are part of the modularity calculation (see Equation 5.2). Therefore, at every iteration, we need updated community information for tail/ghost vertices (vertices owned by another process but stored as part of an edge list on the current process). Also, if a locally owned vertex moves to another community that is owned by a remote process, then the degree and edge weights pertaining to that vertex need to be communicated to the target community owner as well. Modularity calculation also requires global accumulation of the weights, requiring collective communication operations. Finally, at the end of a phase, the graph is rebuilt, which entails communicating new vertex-community mappings to the respective owners of vertices.

5.4 The Parallel Algorithm

In this section we describe our parallel Louvain implementation in its basic form (we will discuss the various approximate computing techniques and heuristics it additionally employs in the next section). We begin with a brief note on how we distribute the input graph across processes. We use p to denote the number of processes, and rank i to denote an arbitrary rank in the interval [0, p − 1].

5.4.1 Input distribution

Our parallel Louvain implementation does not employ sophisticated graph partitioning. Instead, we distribute the input vertices and their edge lists evenly across the available processes such that each process receives roughly the same number of edges. Each process stores the subset of vertices that it owns. Each process also keeps track of a "ghost" copy of any vertex that has an edge to one of its local vertices but is owned by a different (remote) process. Henceforth, we refer to the latter set of vertices as "ghost" vertices. We use the compressed sparse row (CSR) format to store the vertex and edge lists [55]. Similarly, each process owns a subset of communities (set initially to an equal number of communities per process), and also keeps track of a set of "ghost" communities to which the process's local communities have incident (inter-community) edges. Given the static nature of input loading, each process knows the vertex and community intervals owned by every other process as well. However, the information pertaining to those vertices and communities can change dynamically and therefore needs to be communicated. Figure 5.1 demonstrates the vertex-based graph distribution among two processes, each process maintaining its subgraph in CSR format.

Figure 5.1: Vertex-based graph distribution between two processes for an undirected graph with 4 vertices and 8 edges. Ghost vertices are retained by a process: for process #0, the "ghost" vertices are 2 and 3, whereas for process #1, the ghosts are 0 and 1.
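To make the ownership and ghost bookkeeping concrete, the following is a minimal sketch assuming a 1D block distribution over global vertex ids. The struct layout and helper names are illustrative assumptions, not the implementation's actual types; the same ghost collection is what Algorithm 5 later exchanges.

// Hedged sketch of per-process graph state under a 1D vertex distribution.
#include <unordered_set>
#include <vector>

struct LocalGraph {
  long base;                         // first global vertex id owned by this rank
  long nv;                           // number of locally owned vertices
  std::vector<long>   rowptr;        // CSR row pointers, size nv + 1
  std::vector<long>   colidx;        // global ids of edge endpoints
  std::vector<double> weight;        // edge weights
  std::vector<long>   range;         // range[r] = first global id owned by rank r (size p + 1)

  bool is_local(long g) const { return g >= base && g < base + nv; }

  // Owner lookup via the static per-rank vertex intervals (binary search).
  int owner_of(long g) const {
    int lo = 0, hi = static_cast<int>(range.size()) - 2;
    while (lo < hi) {
      int mid = (lo + hi) / 2;
      if (g >= range[mid + 1]) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  // Collect the ghost vertices touched by the local edge lists.
  std::unordered_set<long> ghosts() const {
    std::unordered_set<long> g;
    for (long u : colidx)
      if (!is_local(u)) g.insert(u);
    return g;
  }
};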

5.4.2 Overview of the parallel algorithm

As mentioned earlier, the Louvain algorithm comprises multiple phases, and each phase is run for a number of iterations. Initially, each vertex is in its own community, and as the community

Algorithm 3: Parallel Louvain algorithm (at rank i).
Input: Local portion G_i(V_i, E_i) (in CSR format)
Input: Threshold τ (default: 10^{-6})
Output: Community assignment C_curr
1: C_curr ← {{u} | ∀u ∈ V}   {Initial community assignment}
2: {currMod, prevMod} ← 0
3: while true do
4:   currMod ← LouvainIteration(G_i, C_curr)
5:   if currMod − prevMod ≤ τ then
6:     break and output the final set of communities
7:   BuildNextPhaseGraph(G_i, C_curr)
8:   prevMod ← currMod

detection progresses, vertices migrate by entering and leaving communities. Each vertex resides in one community at the start of an iteration, and decides which of its neighboring communities to move to by the end of the iteration. Algorithm 3 shows a high-level description of the parallel Louvain algorithm executed on a process. In this pseudocode, each iteration of the while loop corresponds to a Louvain "phase". The algorithm consists of two major steps. The first step invokes the Louvain iteration, which runs the Louvain heuristic for modularity maximization. The second step is graph reconstruction, where the vertices in each cluster are collapsed into a single meta-vertex, compacting the graph. In what follows, we describe these two steps in more detail.

Louvain iteration

Algorithm 4 lists the steps for performing a sequence of Louvain iterations within a phase. Since each process owns a subset of vertices and a subset of communities, communication typically involves information on vertices and/or communities. For each vertex owned locally, a community ID is stored; and for each community owned locally, its incident degree ($a_c$) is stored locally (as part of the vector C_curr in Algorithm 4). In addition, each process stores the list of its ghost vertices and their corresponding remote owner processes. Since this vertex mapping to the process space changes with every phase (due to graph compaction), we perform a single (one-time per phase) send-receive communication step

Algorithm 4: Algorithm for the Louvain iterations of a phase (at rank i).
Input: Local portion G_i(V_i, E_i) (in CSR format)
InOut: Community assignment C_curr
Output: Modularity at the end of the phase
1: function LouvainIteration(G_i, C_curr)
2:   V_g ← ExchangeGhostVertices(G_i)
3:   while true do
4:     send/receive latest information on all ghost vertices
5:     for v ∈ V_i do   {Local computation}
6:       Compute ∆Q by moving v to each of its neighboring communities
7:       Determine the target community for v based on the migration that maximizes ∆Q
8:       Mark both the current and target communities of v for an update
9:     send updated information on ghost communities to owner processes
10:    C_curr ← receive and update information on local communities
11:    currMod_i ← locally compute modularity based on G_i and C_curr
12:    currMod ← all-reduce: sum over all i of currMod_i
13:    if currMod − prevMod ≤ τ then
14:      break
15:    prevMod ← currMod
16:  return prevMod

to exchange this ghost vertex information (shown in line 2 of Algorithm 4 and further explained in Algorithm 5). Note that the initial ghost community information can be derived from the ghost vertex information, since at the start of every phase each vertex resides in its own community. However, after every iteration (within a phase), changes to the community membership information need to be relayed from the corresponding owner processes to all the processes that keep a ghost copy of those communities.

The main body of each Louvain iteration consists of the following major steps (see Algorithm 4):

i) At the beginning of each iteration, obtain information about ghost vertices (i.e., their latest community assignments) at each process (line 4);

ii) Using the latest vertex information, compute the new community assignments for all local vertices (lines 5-8). This is a local computation step, i.e., there is no communication;

iii) Send all updated information for ghost communities to their owner processes, and receive and update information on any local communities that were updated remotely (lines 9-10);

Algorithm 5: Algorithm to receive information about ghost vertices from remote (owner) processes.
Input: Local portion G_i(V_i, E_i) (in CSR format)
Output: List V_g of ghost vertices
1: function ExchangeGhostVertices(G_i)
2:   for v ∈ V_i do
3:     [e0, e1] ← getEdgeRangeForVertex(v)
4:     for u ∈ [e0, e1] do
5:       owner ← G_i.getOwner(u)
6:       if owner ≠ me then
7:         vmap[owner] ← vmap[owner] ∪ {u}
8:   for j ∈ [0, p − 1] do
9:     if j ≠ me then
10:      send vmap[j] to rank j
11:      receive data in V_g[j] list
12:  return V_g

iv) Compute the global modularity based on the new community state (lines 11-12); and v) If the net modularity gain (∆Q) achieved relative to the previous iteration is below the desired threshold τ, then terminate the phase, and continue otherwise (lines 13-15).
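Step iv) maps onto a single sum reduction across ranks. A minimal sketch, assuming each rank has already computed its local modularity contribution (currMod_i in Algorithm 4):

// Minimal sketch of lines 11-12 of Algorithm 4: each rank contributes its
// locally computed modularity term, and an all-reduce produces the global Q.
#include <mpi.h>

double global_modularity(double local_q /* currMod_i on this rank */,
                         MPI_Comm comm) {
  double q = 0.0;
  MPI_Allreduce(&local_q, &q, 1, MPI_DOUBLE, MPI_SUM, comm);
  return q;
}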

Graph reconstruction

The communities found at the end of the current phase become the new vertices of the compressed graph for the next phase. Edges within a community form a self-loop around the corresponding meta-vertex, whereas the weights of edges between communities are added and a single edge with the cumulative weight is placed between them. The graph reconstruction phase is illustrated with an example in Figure 5.2. Process #0 owns vertices {0, 1, 2}, while process #1 owns vertices {3, 4}. The figure shows the partitioning of the CSR representation. The index array employs local indices, whereas the edges array holds global vertex IDs. Each process has an array identifying community IDs for local vertices, and a hash map that associates remote neighbor vertices with their respective community IDs. The distributed graph reconstruction process proceeds according to the following steps, as illustrated in Figure 5.2.

1. Each process counts its unique local clusters, which are renumbered starting from 0. Renumbering is performed with a map that associates the old community ID with the new ID.

2. Each process checks for local community IDs that, during the Louvain iterations, may have been assigned to remote vertices but are no longer associated with any of the vertices in the local partition.

3. Local unique clusters are renumbered globally: this is achieved using a parallel prefix sum computation on the number of unique clusters (see the sketch after this list).

4. Processes communicate the new global community IDs for their local partitions. Only the new community IDs that replace old community IDs used by other processes need to be communicated.

5. Every process examines each of the vertices in its partition and starts creating partial new edge lists. For each vertex in the partition, a process checks its neighbor list. Neighbors associated with the same new community ID contribute to a “self loop” edge.

6. Once these new partial edge lists have been created, they are redistributed across processes. New partitions are generated so that every process owns an equal number of vertices (as much as possible).

7. New arrays for indices and vertices of the coarsened graph can thus finally be rebuilt from the edge lists.
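Step 3 above is a textbook use of an exclusive prefix sum over the per-process counts of unique local clusters. A minimal sketch assuming MPI; the variable names are ours:

// Hedged sketch of step 3: globally renumber local clusters with an exclusive
// prefix sum (MPI_Exscan) over the number of unique clusters on each rank.
// The new global ID of local cluster t (0 <= t < num_local_clusters) is base + t.
#include <mpi.h>

long global_cluster_base(long num_local_clusters, MPI_Comm comm) {
  long base = 0;
  MPI_Exscan(&num_local_clusters, &base, 1, MPI_LONG, MPI_SUM, comm);
  int rank = 0;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0) base = 0;   // MPI_Exscan leaves rank 0's receive buffer undefined
  return base;
}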

5.5 Approximate methods for performance optimization

In this section we discuss two approximate computing techniques and a heuristic which further improve the overall execution times or quality of the distributed Louvain algorithm overviewed in the previous section.

Figure 5.2: Graph reconstruction. In the example, we suppose that the modularity optimization has assigned vertices {0, 1, 3} to community 0, vertex 2 to community 2, and vertex 4 to community 4 (i.e., vertices 2 and 4 are each in their own community). Because community IDs originate from vertex IDs, we consider community IDs 0 to 2 local to process #0, and community IDs 3 and 4 local to process #1.

5.5.1 Threshold Cycling

The Louvain algorithm uses a threshold τ to decide termination: more specifically, if the net modularity gain achieved between two successive phases (Algorithm 3) falls below τ, then the algorithm terminates (achieves convergence). (Note that the same threshold is also used between consecutive iterations of a phase in Algorithm 4 to terminate a phase.) Typically, this τ parameter is kept fixed throughout the execution. We extend the concept presented by Lu et al. [128] for a multithreaded Louvain implementation, in which the threshold is tuned across phases. The main idea is that during the initial phases, when the graph is relatively large, the threshold is kept large, and it is then reduced incrementally for later phases. The rationale is that if the threshold is small, then each Louvain phase will typically undergo more iterations before it can exit, whereas a higher threshold can translate to fewer iterations until convergence. Such savings in the number of iterations are likely to yield larger performance savings in the earlier phases, when the graph is still large. We implemented a scheme in which the threshold is modulated in a cyclical fashion across phases: a range of threshold values is applied in successive phases, with the cycle repeating after every K phases, where K is prespecified.
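The cycling schedule itself is a small piece of policy code. A minimal sketch, assuming a fixed list of thresholds cycled with period K; the concrete values and the exact mapping from phase number to threshold are assumptions for illustration, not the dissertation's schedule.

// Hedged sketch of a threshold-cycling schedule: looser thresholds for earlier
// positions in the cycle, tightening toward the default, repeating every K phases.
#include <array>

double phase_threshold(int phase, int K) {
  static const std::array<double, 4> schedule = {1e-2, 1e-3, 1e-4, 1e-6};
  int pos = (phase % K) % static_cast<int>(schedule.size());
  return schedule[pos];
}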

5.5.2 Early Termination

In our parallel Louvain algorithm, one of the major contributors to communication cost is the exchange of ghost vertex information across processes. This cost can become a bottleneck if the original partitioning of the input graph has a large fraction of edges between vertices that reside on different processes. However, after experimenting with our multithreaded Louvain algorithm [128] on numerous inputs, we made a critical observation: the rate at which the overall modularity increases slows down significantly as the iterations within a phase progress. This diminishing-returns property in quality is reflected in almost all of the modularity evolution charts presented in [128]. The diminishing return happens because, within a phase, the rate at which vertices change their community affiliation tends to drop drastically as the iterations progress. In other words, vertices tend to hop around initially but soon collocate with their community partners,

thereby becoming less likely to move in the later stages. We present an approximate computing method that takes advantage of the above observation. We devise a probabilistic scheme by which a vertex decides to stay "active" or become "inactive" at any given iteration. Being "active" implies that the vertex will participate in the computation within the main body of the Louvain iteration (Algorithm 4, lines 5-8), and will recompute its current community affiliation. Alternatively, if the vertex is "inactive", it will be dropped from the processing queue during that iteration. Note that by making a vertex inactive during an iteration, we save on all the potential computation and communication that it generates. The savings can be particularly significant for vertices with large degrees. To identify which vertices to make inactive, we exploit the above observation by looking at the most recent activity of a vertex: intuitively, if the vertex has not moved lately, then we reduce the probability that it will stay active.

More specifically, consider a vertex $v$. Let $C_{v,j}$ denote the community containing $v$ at the end of a given iteration $j$, and let $P_{v,k}$ denote the probability that $v$ is active during iteration $k$. We define $P_{v,k}$ as follows:

$$P_{v,k} = \begin{cases} P_{v,k-1} \cdot (1 - \alpha), & \text{if } C_{v,k-1} = C_{v,k-2} \\ 1, & \text{otherwise} \end{cases} \qquad (5.3)$$

where α is a real number between 0 and 1. The idea is to rapidly decay the probability as a vertex continues to stay in its current community. As α approaches zero, the scheme becomes similar to the baseline; as it approaches one, it becomes highly aggressive in deactivating vertices early in the execution, with the potential risk of compromising quality. Consequently, we call this probabilistic method "early termination" (ET). We changed the main computation loop of Algorithm 4 (line 5) so that each vertex first marks itself as active or inactive based on this probabilistic scheme, and accordingly includes itself for further processing or not. We developed two minor variants of the above ET idea. In the first variant (called simply ET), when the probability for a given vertex becomes less

than 2%, we label it inactive. In the second variant (called ETC, where the C stands for communication), the early termination scheme is combined with another option to further improve performance. In particular, we provide an option to calculate the percentage of globally inactive vertices, and if 90% of the vertices are inactive in a particular phase, then the program exits. This option requires an extra remote communication involving a global summation of inactive vertices. In some cases, we observed early termination with remote communication to be around 1.2 to 2.3x better than using early termination alone. Further sophistication in the implementation is possible. First, note that if α is 1, then once a vertex marks itself as inactive, it will stay inactive forever. In fact, we can safely argue that this property holds (with high probability) for large α values as α nears 1. Second, any communication that relates to inactive vertices can be prevented/preempted by communicating the ghost vertex IDs that have become inactive to the other processes that still think they need them.
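A minimal sketch of the per-vertex bookkeeping implied by Equation 5.3 and the 2% cutoff used by ET. The struct layout, and the use of a per-vertex random draw against the probability, are illustrative assumptions rather than the implementation's exact code.

// Hedged sketch of the early-termination (ET) probability update of Eq. (5.3).
// A vertex that stayed in its community has its activity probability decayed by
// (1 - alpha); a vertex that moved resets to 1. Below 2% it is labeled inactive.
// The caller is assumed to update prev_comm/curr_comm after each iteration.
#include <random>

struct VertexState {
  long   prev_comm = -1;     // C_{v,k-2}
  long   curr_comm = -1;     // C_{v,k-1}
  double p_active  = 1.0;    // P_{v,k}
  bool   inactive  = false;
};

bool is_active_this_iteration(VertexState& s, double alpha, std::mt19937& rng) {
  if (s.inactive) return false;
  if (s.curr_comm == s.prev_comm)
    s.p_active *= (1.0 - alpha);     // Eq. (5.3), first case
  else
    s.p_active = 1.0;                // Eq. (5.3), second case
  if (s.p_active < 0.02) {           // ET cutoff from the text
    s.inactive = true;
    return false;
  }
  std::uniform_real_distribution<double> coin(0.0, 1.0);
  return coin(rng) < s.p_active;     // stay active with probability P_{v,k}
}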

5.5.3 Incomplete Coloring

A distance-1 coloring of a graph is an assignment of colors (unique labels) to vertices such that no two neighboring vertices are assigned the same color. Therefore, the vertices of the same color (a color class) form an independent set, where no two vertices are neighbors of each other. The objective of the coloring problem is to find a coloring where the number of colors used is the fewest possible. The problem is known to be NP-hard [34]. We call a coloring in which only a subset of the vertices is assigned colors, such that no two colored neighbors receive the same color, an incomplete coloring. Let us now consider coloring in the context of community detection. In the serial algorithm, the order in which vertices are processed within each iteration can impact performance as well as the final modularity of the solution. In parallel, the order can play a prominent role in performance, since concurrent processing of two vertices that are connected by an edge (a mutual dependency) could delay convergence. In our multithreaded implementation [128], we used distance-1 coloring to overcome this

challenge. We enabled concurrent processing of one color class at a time, since this guarantees that no two neighboring vertices are processed concurrently. In our distributed-memory implementation, we perform an incomplete coloring to reduce the overhead of switching between the colors. The basic idea is to color only a fraction of the vertices with a preselected number of colors using the Jones-Plassmann algorithm [106]. The algorithm proceeds by assigning a unique random number to each vertex. At a given iteration of the algorithm, if the random number of a vertex is the maximum among its neighbors, then the vertex colors itself at this step with a predetermined color for that iteration and removes itself from further consideration. Otherwise, it competes in subsequent steps until it gets a color or the given number of colors is exhausted. Another variant of this approach is to keep coloring until a certain minimum fraction of the vertices is colored, after which the remaining vertices are bundled into one color class. The vertices that are bundled into the final color class may have conflicts (i.e., be neighbors of each other). The performance/quality implications of incomplete coloring are discussed in Section 5.6.7.
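A minimal shared-memory sketch of the Jones-Plassmann style incomplete coloring described above, capped at a preselected number of colors. The adjacency-list representation and the tie-breaking rule are illustrative assumptions, not the distributed implementation's code.

// Hedged sketch of incomplete coloring: in each round, uncolored vertices whose
// random priority is a local maximum among their uncolored neighbors take the
// round's color; after max_colors rounds the rest stay uncolored (-1).
#include <cstdint>
#include <random>
#include <vector>

std::vector<int> incomplete_coloring(const std::vector<std::vector<int>>& adj,
                                     int max_colors, unsigned seed = 7) {
  const int n = static_cast<int>(adj.size());
  std::vector<int> color(n, -1);
  std::vector<std::uint32_t> prio(n);
  std::mt19937 rng(seed);
  for (int v = 0; v < n; ++v) prio[v] = static_cast<std::uint32_t>(rng());

  for (int c = 0; c < max_colors; ++c) {
    std::vector<int> winners;
    for (int v = 0; v < n; ++v) {
      if (color[v] != -1) continue;
      bool is_max = true;
      for (int u : adj[v]) {
        if (color[u] != -1) continue;              // colored neighbors no longer compete
        if (prio[u] > prio[v] || (prio[u] == prio[v] && u > v)) { is_max = false; break; }
      }
      if (is_max) winners.push_back(v);
    }
    for (int v : winners) color[v] = c;            // color all winners of this round
  }
  return color;                                    // -1 marks uncolored vertices
}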

5.6 Experimental evaluation

Using both real-world and synthetic graphs, we extensively evaluate our parallel algorithm and its variants on a wide selection of computing platforms. We generally conduct two kinds of evaluations: performance evaluation (runtime and scalability) and quality of solution evaluation (modularity). Our primary performance evaluation results are presented in Sections 5.6.4 to 5.6.8, and the quality of solution assessment results are presented in Section 5.6.9. Additionally, we present further performance analysis results on memory usage, power consumption and MPI communication methods in Section 5.8. We begin by first spelling out our experimental setup in Sections 5.6.1 to 5.6.3.

5.6.1 Algorithms compared

As far as we know, currently no MPI-based distributed-memory Louvain algorithm implementation is publicly available. Therefore, to compare our results, we use the multithreaded implementation of the Louvain algorithm available in Grappolo [128]. Below we summarize the descriptors/legends used in the figures/tables in this section to refer to the different variants of our parallel algorithm (discussed in Section 5.5).

• Baseline: the main parallel version (Algorithm 3) without the approximate computing methods or heuristics discussed in Section 5.5.

• Threshold Cycling: version with threshold cycling enabled.

• ET: version with adaptive early termination, which requires an input parameter (α). We report ET performance with α = 0.25 and α = 0.75.

• ETC: variant of ET with an extra communication step to gather inactive vertex count. We report ETC performance with α = 0.25 and α = 0.75.

• Color: version with incomplete coloring. We use from 32 to 40 colors in our experiments.

5.6.2 Experimental platforms

Our primary testbed for performing distributed/shared memory evaluations is the NERSC Cori supercomputer; however, we have also used the NERSC Edison and ALCF Theta supercomputers for some supplementary analysis. The details of the platforms are summarized in Table 5.1. We build the codes with -O3 -xHost compilation options (on ALCF Theta, we replaced -xHost with -xmic-avx512), and use 2/4 OpenMP threads per process, using all the cores available on a node.

Table 5.1: Experimental platforms.

Clusters | Compilers and MPI versions | Processor architecture | Network interconnect
NERSC Cori (Cray XC40) | Intel 17.0.2, Cray MPICH 7.6.0 | Dual-socket 16-core Intel "Haswell" Xeon E5-2698 v3 at 2.3 GHz (32 cores/node), 128 GB memory/node, 40 MB L3 cache | Cray Aries with Dragonfly topology
NERSC Edison (Cray XC30) | Intel 17.0.4, Cray MPICH 7.6.2 | Dual-socket 12-core Intel "Ivy Bridge" Xeon E5-2695 v2 at 2.4 GHz (24 cores/node), 64 GB memory/node, 30 MB L3 cache | Cray Aries with Dragonfly topology
ALCF Theta (Cray XC40) | Intel 17.0.4, Cray MPICH 7.7.0 | Intel "Knights Landing" Xeon Phi 7230 at 1.3 GHz (64 cores/node), 16 GB on-package memory, 192 GB memory/node | Cray Aries with Dragonfly topology

Table 5.2: Test graphs, listed in ascending order of edges.

Graphs                     #Vertices   #Edges    Modularity
channel                    4.8M        42.7M     0.943
com-orkut                  3M          117.1M    0.661
soc-sinaweibo              58.6M       261.3M    0.482
twitter-2010               21.2M       265M      0.478
nlpkkt240                  27.9M       401.2M    0.939
web-wiki-en-2013           27.1M       601M      0.671
arabic-2005                22.7M       640M      0.989
webbase-2001               118M        1B        0.983
web-cc12-PayLevelDomain    42.8M       1.2B      0.687
soc-friendster              65.6M      1.8B      0.624
sk-2005                    50.6M       1.9B      0.971
uk-2007                    105.8M      3.3B      0.972

5.6.3 Test graphs

Our primary testset for strong scaling analysis consists of 11 graphs (with number of edges greater than 100M) collected in their native formats from four sources: the UFL sparse matrix collection [51], the Network Repository [165], SNAP [122], and LAW [21]. The graphs are listed in Table 5.2, along with their respective output modularity as reported by Grappolo (using 1 thread). The overall performance of our distributed implementation is sensitive to the input graph (especially since our simple graph partitioning makes no assumption about the underlying graph structure). Figure 5.3 shows the inter-process communication volume of the distributed Louvain method for four inputs on 1024 processes; the inputs exhibit significantly different communication patterns. We converted the test graphs from their various native formats to an edge-list-based binary format, and used the binary file as input to our implementation. We make use of MPI I/O for

reading the input file in parallel (and follow best practices); our overall I/O time is about 1-2% of the overall execution time. The I/O time could be further optimized by file striping, which allows a file to be split across multiple disks (also called Object Storage Targets, or OSTs). Once a file is striped, read/write operations can access multiple OSTs concurrently, improving the overall I/O bandwidth. The burst buffer nodes of NERSC Cori use vendor-specific flash storage (Solid State Drive) middleware that can accelerate I/O performance. Table 5.3 shows a 2.7-22× speedup in I/O times between the default scheme (a single storage target) and burst buffers. Striping can also provide a similar performance benefit. However, in our experiments we did not explore advanced striping options, and we observed the burst buffer outperforming striping for large files with the default options.

Table 5.3: I/O performance (in secs.) for three real-world input graphs on NERSC Cori using Lustre file striping and burst buffers.

Graphs (size in GB)       Default (1 OST)   Striped (x OSTs)   Burst Buffer
orkut (3.6 GB)            1.45              0.59 (10)          0.54
soc-friendster (55 GB)    26.47             3.25 (25)          1.66
uk2007 (100 GB)           35.54             11.50 (50)         1.63

Figure 5.3: Communication volume, in terms of mean send/recv message sizes (bytes) exchanged between pairs of processes, for four real-world inputs on 1024 processes: (a) uk-2007, (b) soc-friendster, (c) nlpkkt240, (d) com-orkut. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids.

A number of graphs used in the current evaluation have on the order of a billion edges or more. Therefore, it is still feasible, from an execution time standpoint, to just use the parallel baseline version, despite the improvements gained from the approximate methods. However, larger real-world graphs with a high level of overlap in community structure are harder cases for community detection in

general. In such cases, the Louvain method may run for a significantly large number of iterations over multiple phases, consuming vast amounts of resources. Also, real-world graphs obtained from comparative genomics studies [180] can be particularly large, as they may result from trillions of pairwise interactions between sequences across thousands of genomes. At such massive scales, the only optimization that can make an impact is computation/communication avoidance. Therefore, approximate computing methods play a pivotal role at extreme scales. In fact, approximate computing methods provide users with the necessary tools to select the desired balance between performance and quality.

5.6.4 Comparison on a single node

To assess the overhead of our MPI+OpenMP distributed-memory Louvain implementation, we compared it with the multithreaded implementation from the Grappolo software package on a single Cori node, using a single process and multiple threads. Table 5.4 shows the runtimes in seconds of our distributed-memory implementation and the shared-memory (Grappolo) implementation for the input graph soc-friendster (1.8B edges). In our experiments, the soc-friendster input ran for over 400 iterations until convergence, with each phase taking more than 100 iterations (exhibiting relatively slow modularity growth), making it an ideal real-world dataset for clustering analysis. The table shows that the performance of the pure OpenMP version is about 2.3x better than our distributed version on all 32 cores of the node. On the other hand, the distributed version shows better scaling with the number of threads (about 4x speedup on 64 threads relative to 4 threads, whereas the shared-memory version scales by about 2x). In all these runs, the modularity difference was found to be under 1%. Furthermore, the distributed version obtains a speedup of up to 7x compared to the optimized shared-memory version on 64 threads when we scale out to 4K processes on 256 nodes (see Figure 5.4).

Table 5.4: Distributed memory vs shared memory (Grappolo) performance (runtime) of the Louvain algorithm on a single NERSC Cori node using 4-64 threads. The input graph is soc-friendster (1.8B edges).

#Threads   Distributed memory (secs.)   Shared memory (secs.)
4          6,082.25                     1,216.54
8          3,615.52                     843.37
16         2,252.09                     725.26
32         1,515.24                     689.38
64         1,303.98                     554.52

5.6.5 Strong scaling

We report the total runtime (inclusive of the time to read the input graph, perform modularity maximization, and perform graph reconstruction) for our test graphs in Figure 5.4. We observe that the process end points of best speedup vary by input, with moderate/large inputs showing reasonable scalability up to 1K-2K processes. However, some graphs, such as sk-2005, have a relatively low number of iterations per phase, which indicates that there is not enough work to utilize the increased parallelism beyond a certain point. These end points in scaling have to do with the balance between computation and communication times. For instance, we used HPCToolkit [1] to profile the baseline version on soc-friendster on 256 processes (32 nodes). The analysis shows that 98% of the entire execution time is spent in the main body of the Louvain iterations (with 1% in graph rebuilding and another 1% for reading the input graph using MPI I/O routines). Of the 98%, roughly 34% is used in communicating community-related information, and 40% is spent in the reduction operation (line 12 of Algorithm 4), whereas 22% of the time is used in computation (lines 5-8 of Algorithm 4). To compare relative performances, we calculated speedup as the ratio between execution times of the baseline parallel version and the fastest running version on 16-128 processes for a particular input. Table 5.5 shows these results. It can be seen that the early termination versions (ET or ETC) deliver the best performance in most cases. With an increasing number of processes, load imbalance also rises, affecting the overall scalability. Even in the worst case, we have observed about 1.12-3.78× speedup of the approximate computation methods relative to the parallel baseline for the same set of inputs listed in Table 5.5.

[Figure 5.4 comprises per-graph strong-scaling panels for channel, com-orkut, arabic-2005, nlpkkt240, web-cc12-PayLevelDomain, web-wiki-en-2013, webbase-2001, sk-2005, uk-2007, soc-friendster, twitter-2010, and soc-sinaweibo; legend: Baseline, Threshold Cycling, ET(0.25), ET(0.75), ETC(0.25), ETC(0.75).]

Figure 5.4: Execution times of our distributed Louvain implementation for graphs listed in Table 5.2. X-axis: Number of processes (and nodes), Y-axis: Execution time (in secs.).


5.6.6 Weak scaling

For weak scaling analysis, we use the GTgraph synthetic graph generator suite [11] to generate graphs according to the DARPA HPCS SSCA#2 benchmark [10]. Graphs following the SSCA#2 benchmark are composed of random-sized cliques, with various parameters to control the amount of vertex

Table 5.5: Versions yielding the best performance over the baseline version (run on 16-128 processes) for input graphs (listed in ascending order of edges).

Graphs                     Best speedup   Version
channel                    46.18x         ETC (0.25)
com-orkut                  14.6x          ETC (0.75)
soc-sinaweibo              3.4x           Threshold Cycling
twitter-2010               3.3x           ETC (0.25)
nlpkkt240                  8.68x          Threshold Cycling
web-wiki-en-2013           7.92x          ET (0.75)
arabic-2005                5.8x           ETC (0.25)
webbase-2001               7x             ETC (0.25)
web-cc12-PayLevelDomain    3.75x          ETC (0.25)
soc-friendster             23x            ETC (0.25)
sk-2005                    1.8x           ETC (0.75)
uk-2007                    2.4x           ETC (0.75)

connections and inter-clique edges, along with other options to set the maximum sizes of clusters and cliques.

Table 5.6: GTgraph SSCA#2 generated graph dimensions and associated information.

Name       #Vertices   #Edges    Modularity   #Processes (Nodes)
Graph#1    5M          333.7M    0.999981     1 (1)
Graph#2    10M         660.7M    0.999990     32 (2)
Graph#3    50M         3.3B      0.999998     208 (13)
Graph#4    100M        6.6B      0.999999     448 (28)
Graph#5    150M        6.9B      0.999999     512 (32)

We fix the maximum clique size of the generated graphs (of various dimensions) to 100 and deliberately keep the inter-clique edge probability low to enforce good community structure. Each graph is executed on a different combination of processes (and nodes), such that the overall work per process is fixed. A list of the generated graphs along with the process-node configuration on which they are run is provided in Table 5.6. Our distributed implementation reported the exact same convergence criteria for each graph listed in Table 5.6, since the underlying structures of the graphs are similar despite the difference in sizes. The weak scaling results we obtained are summarized in Figure 5.5. The figure shows nearly constant execution time for the Baseline version of our

distributed Louvain implementation using input SSCA#2 graphs of varying sizes and varying process counts (1-512).


Figure 5.5: Weak scaling of baseline distributed Louvain implementation on GTgraph generated SSCA#2 graphs. X-axis: Input graphs listed in Table 5.6, Y-axis: Execution time (in secs.).

5.6.7 Analysis of performance of the approximate computing methods/heuristics

Threshold cycling

The Threshold Cycling scheme provided a significant performance (runtime) benefit (compared to Baseline), with less than a 3% decrease in modularity, for over 90% of the test graphs. Meanwhile, threshold cycling performed only marginally better for the soc-sinaweibo and web-wiki-en-2013 graphs. These graphs ran for only 3 or 4 phases, and the Louvain algorithm converged before completing a cycle of threshold modifications. In such cases, our distributed implementation always forces the Louvain iteration to run once more with the lowest threshold (default τ = 10^{-6}) to ensure acceptable modularity; hence, threshold cycling yields only a nominal benefit for these graphs.

Early termination

The runtime charts in Figure 5.4 show that the early termination versions (ET or ETC) provide the best performance for most input graphs. We discuss the modularity growth and iterations

per phase characteristics on 64 processes for two of the test graphs: nlpkkt240 and web-cc12-PayLevelDomain. These results are shown in Figs. 5.6a and 5.6b (for nlpkkt240) and Figs. 5.7a and 5.7b (for web-cc12-PayLevelDomain). Generally, we observed one of two trends among the test graphs: one in which ET with α = 0.25 performs better than ET with α = 0.75 (Figs. 5.6a and 5.6b), and another in which the converse happens (Figs. 5.7a and 5.7b). In Figure 5.6a we observe slow growth in the modularity of ET(0.75) across many more phases (which increases the overall execution time) compared to ET(0.25). Also, we observe significantly more iterations per phase for ET(0.75) compared to ET(0.25) in Figure 5.6b. It is to be noted that although ET(0.75) requires 2.6x the number of phases and 1.3x the total iterations of Baseline, the overall execution time of ET(0.75) is still about 1.47x better than Baseline: owing to the prevalence of inactive vertices in each phase, the time spent per phase is significantly less than in the Baseline version. With α close to 1, the scheme of labeling vertices as inactive becomes more aggressive. This hurts the convergence characteristics, as evidenced by the higher number of phases for ET(0.75), because there are fewer (active) vertices in every phase that can move to other communities to maximize ∆Q.

[Figure 5.6 panels: (a) Modularity growth per phase, (b) Number of iterations per phase; legend: Baseline, Threshold Cycling, ET(0.25), ET(0.75), ETC(0.25), ETC(0.75).]

Figure 5.6: Convergence characteristics of nlpkkt240 (401.2M edges) on 64 processes.

However, we observe an interesting phenomenon with ET when the optional communication step is performed (i.e., version ETC). In Figure 5.6b, we see that ETC(0.25) and ETC(0.75)

[Figure 5.7 panels: (a) Modularity growth per phase, (b) Number of iterations per phase; legend: Baseline, Threshold Cycling, ET(0.25), ET(0.75), ETC(0.25), ETC(0.75).]

Figure 5.7: Convergence characteristics of web-cc12-PayLevelDomain (1.2B edges) on 64 processes.

display very similar performances, whereas ET(0.25) and ET(0.75) were quite different. The remote communication step computes the global number of inactive vertices, and if it is more than 90%, then the Louvain iteration exits. This is quite different from the ET counterparts, as ET still relies on comparing consecutive modularities against a fixed τ per iteration in a phase. Figure 5.6b shows a linear increase in modularity/iterations due to this exit condition, which yields about a 20-30% performance benefit compared to using ET alone. For web-cc12-PayLevelDomain, we observe from Figs. 5.7a and 5.7b that the aggressive version of ET with α = 0.75 (denoted ET(0.75)) performs better than ET(0.25), at the expense of a 4% decrease in modularity. The performance of ET(0.75) is 16% better than ET(0.25) owing to a smaller number of iterations per phase.

Incomplete coloring

We observed that incomplete coloring did not benefit most of the real-world graphs that we used in our evaluation. We therefore employed synthetic graphs to investigate the impact of coloring in improving the quality of community detection. In particular, we used synthetic graphs from the MIT Graph Challenge streaming partition datasets, which are based on stochastic blockmodels and use different sampling strategies [108]. We specifically selected two sets of graphs with an increased level of overlap between the blocks and low/high levels of size variation between the

blocks (denoted as HILO and HIHI), because they are in general hard use cases for modularity-based community detection. For instance, HILO denotes "high level of overlap but low level of size variation between blocks", indicating stronger interactions (i.e., a large number of edges or connections) between the individual blocks or clusters, making the task of community detection harder. The number of vertices and edges for the HILO and HIHI datasets are listed in Table 5.7.

Table 5.7: Stochastic block partition dataset characteristics used for coloring analysis.

Input label   #Nodes        #Edges
200K          200,000       4,750,333
1M            1,000,000     23,716,108
5M            5,000,000     118,738,395
20M           20,000,000    475,167,612

Figure 5.8 shows the modularities reported by the different approximate computing techniques for our distributed Louvain implementation; we observe that the incomplete coloring heuristic is 8-70% better than the rest in most of the cases.

[Figure 5.8 panels: HILO 200K, HILO 1M, HILO 5M, HILO 20M (top row) and HIHI 200K, HIHI 1M, HIHI 5M, HIHI 20M (bottom row); per panel, bars for Baseline, Threshold Cycling, ET(0.25), ET(0.75), ETC(0.25), ETC(0.75), and Color; legend: static, streamingEdge, streamingSnowball.]

Figure 5.8: Modularities for stochastic block partition graphs of various sizes and sampling techniques on 16 nodes (192 processes) of NERSC Edison.

5.6.8 Combining approximate methods/heuristics delivers better performance

Some of the heuristics/approximate computing methods can be combined with others to generate a complementary positive effect (on performance and/or quality). We discuss a few possibilities by combining early termination (ET) with threshold cycling and coloring. Combining ET with Threshold Cycling yields better performance in some cases due to a reduction in the number of iterations per phase, allowing a phase to end relatively faster. As shown in Table 5.8, we observe a consistent 10% improvement in performance for the soc-friendster input. Since the exit criteria of

Table 5.8: Performance of ET(0.25) combined with Threshold Cycling for soc-friendster (1.8B edges). Relative percentage gains in performance are shown in parentheses.

Processes (Nodes)   ET(0.25) execution time (secs.)   ET(0.25) + Threshold Cyc. execution time (secs.)
256 (16)            683.99                            614.78 (10%)
512 (32)            448.05                            398.77 (11%)
1024 (64)           299.74                            264.64 (12%)
2048 (128)          216.60                            195.12 (10%)
4096 (256)          186.87                            167.85 (10%)

[Figure 5.9 panels: execution time (in secs) and modularity versus the number of processes (96-1536); legend: Color, Color+ET(0.25), Color+ET(0.75), Color+ETC(0.25), Color+ETC(0.75).]

Figure 5.9: Performance of com-orkut (117.1M edges) when coloring is combined with ET on NERSC Edison.

ETC is not τ-based, we did not see any benefit from adding Threshold Cycling. Coloring helps generate an informed partial ordering of vertex processing, allowing vertices to settle into their final community states more quickly. Early termination exploits this behavior and quickens convergence (by reducing the number of iterations). In Figure 5.9, we demonstrate a 4-7x speedup when coloring is combined with ET, without affecting modularity significantly.

5.6.9 Solution quality assessment

In order to assess the quality of our distributed Louvain implementation, we compare our results against known ground truth communities for a variety of networks generated by the LFR benchmark [119]. When the quality assessment feature is turned on, our implementation performs extra collective operations per Louvain method phase to gather the vertex-community associations of the current graph into the root process. We list all possible pairs of vertices (a vertex pair {x, y} signifies an edge) in community

assignments obtained from our Louvain implementation (denoted $C_L$) and the ground truth information (denoted $C_G$). The community assignment of a vertex $x$ in $C_L$ is denoted by $C_L^x$, whereas in $C_G$ it is denoted by $C_G^x$. We categorize every pair of vertices $\{x, y\}$ in the respective community assignments of $C_L$ and $C_G$ into one of three bins:

• True Positive (TP): if $\{x, y\}$ belong to the same community in both $C_L$ and $C_G$, i.e., $C_L^x = C_L^y$ and $C_G^x = C_G^y$;

• False Negative (FN): if $\{x, y\}$ belong to the same community only in $C_G$, i.e., $C_L^x \neq C_L^y$ and $C_G^x = C_G^y$;

• False Positive (FP): if $\{x, y\}$ belong to the same community only in $C_L$, i.e., $C_L^x = C_L^y$ and $C_G^x \neq C_G^y$.

Based on the above categorization, we calculate the metrics of Precision, Recall, and F-score:

• Precision, $P = \frac{TP}{TP + FP}$;

• Recall, $R = \frac{TP}{TP + FN}$;

• F-score, $F = \frac{2 \cdot P \cdot R}{P + R}$.
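A minimal sketch of these metrics given already-accumulated pair counts; how the vertex pairs are enumerated and gathered onto the root process is not shown here and the type names are ours.

// Hedged sketch: precision, recall, and F-score from pairwise TP/FP/FN counts.
struct PairCounts { long long tp = 0, fp = 0, fn = 0; };

struct Quality { double precision, recall, fscore; };

Quality score(const PairCounts& c) {
  double p = c.tp ? static_cast<double>(c.tp) / static_cast<double>(c.tp + c.fp) : 0.0;
  double r = c.tp ? static_cast<double>(c.tp) / static_cast<double>(c.tp + c.fn) : 0.0;
  double f = (p + r > 0.0) ? 2.0 * p * r / (p + r) : 0.0;
  return {p, r, f};
}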

For the current set of LFR benchmark networks, we ran our distributed implementation on 2 nodes with 16 processes per node (with 2 OpenMP threads per process). As shown in Table 5.9,

Table 5.9: Quality comparisons of our distributed Louvain implementation and Grappolo with LFR ground truth community information.

#Vert.   #Edges    Precision   Recall   F-score (Grappolo)   F-score
350K     34.73M    0.980       1        0.990                0.990
600K     58.91M    0.981       1        0.990                0.990
1M       98.12M    0.962       1        0.981                0.981
1.5M     147.13M   0.937       1        0.967                0.967
2M       196.45M   0.896       1        0.951                0.945

high F-score and precision/recall correspond to a high-quality solution compared to the ground truth community assignments. We also observed nearly identical F-score results reported by Grappolo (i.e., the shared-memory Louvain implementation) for the same LFR benchmark networks.

5.7 Applicability of the Louvain method as a benchmarking tool for graph analytics

Benchmarking of high performance computing systems can help provide critical insights for efficient design of computing systems and software applications. Although a large number of tools for benchmarking exist, there is a lack of representative benchmarks for the class of irregular computations exemplified by graph analytics. Unlike other graph-based methods such as breadth-first search and betweenness centrality, distributed-memory graph community detection represents highly complex computational patterns stressing a variety of system features, which can provide crucial insight for the co-design of future computing systems. The importance of mini-application-driven co-design of architectures and algorithms has been established as a holistic approach to assessing key performance issues in large scientific applications [56, 15, 94, 60]. However, a significant number of mini-applications used in HPC co-design are characterized by regular updates to dense data structures such as meshes and matrices. Hence, there is an urgent need to explore mini/proxy applications characterized by irregular memory accesses, which are the mainstay of a large number of graph applications. For our benchmark, we implement the very first phase of the Louvain method, without rebuilding the graph. This allows us to accurately assess the overhead of community detection separately from the graph rebuilding process. The benchmark can also generate random geometric graphs

(RGG) in parallel (discussed in the forthcoming Section 5.7.2), thereby making it convenient for users to parameterize synthetic graphs (with different communication characteristics) to run the Louvain algorithm.

5.7.1 Characteristics of distributed-memory Louvain method

We argue that community detection is a better tool for benchmarking irregular applications, because it exhibits different characteristics in comparison to other graph-traversal based workloads. For example, the Louvain method involves floating-point arithmetic operations for computing modularity, whereas other graph algorithms such as breadth-first search and graph coloring do not have any floating-point arithmetic. Community detection is also communication-intensive, as within every Louvain iteration, information of ghost communities (such as the current size and degree of communities) needs to be updated for computing global modularity.

Two conflicting goals – simplicity of the benchmark, and true representation of real-world applications – drive the choice of a good benchmarking tool. The following two observations from the performance analysis of our distributed-memory Louvain method led us to the design of a potential benchmarking proxy application for the Exascale Computing Project1.

Louvain phase analysis

Although the Louvain method is executed for multiple phases until convergence, for a variety of real-world inputs we observed the first phase to be the most expensive in terms of overall execution time. Table 5.10 demonstrates that most of the input graphs exhibit a cumulative difference of only about 1–5% between the execution time of the first phase and that of the complete execution. Therefore, analyzing just the first phase provides sufficient information about the overall performance and community structure in most cases. Furthermore, graph rebuilding complicates the implementation and can distort benchmarking results when the graph sizes are small and utilize only a small portion of the total participating processors on a system.

1ECP Proxy Applications: https://proxyapps.exascaleproject.org

Table 5.10: First phase of the Louvain method versus the complete execution for real-world inputs on 1K processes of NERSC Cori.

                                          First phase                        Complete execution
Graphs                #Vertices  #Edges   Iterations  Modularity  Time       Phases  Iterations  Modularity  Time
friendster            65.6M      1.8B     143         0.619       565.201    3       440         0.624       567.173
it-2004               41.3M      1.15B    14          0.394       45.064     4       91          0.973       45.849
nlpkkt240             27.9M      401.2M   3           0.143       3.57       5       832         0.939       21.084
sk-2005               50.6M      1.9B     11          0.314       71.562     4       83          0.971       72.94
orkut                 3M         117.1M   89          0.643       59.5       3       281         0.658       59.64
sinaweibo             58.6M      261.3M   3           0.198       270.254    4       108         0.482       281.216
twitter-2010          21.2M      265M     3           0.028       209.385    4       184         0.478       386.483
uk2007                105.8M     3.3B     9           0.431       35.174     6       139         0.972       37.988
web-cc12-paylvladmin  42.8M      1.2B     31          0.541       140.493    4       159         0.687       146.92
webbase-2001          118M       1B       14          0.458       14.702     7       239         0.983       24.455

Figure 5.10: Approximate computing techniques have little effect on RMAT-generated Graph500 graphs (Scale 21–24). Each panel plots execution time (in secs) against the number of processes (96–1536) for the baseline, tscale, et1, et2, etc1 and etc2 variants.

Ineffective approximate computing techniques

Although we demonstrated significant performance improvements with the application of the parallel approximation techniques (refer to Section 5.5 and Section 5.6.7), the efficacy of these methods depends on the connectivity structure of the input graph. As an illustration, Figure 5.10 shows the scaling of our distributed-memory Louvain implementation on four Graph500 [140] Kronecker graphs that have poor community structure (modularity ≈ 0.0107–0.0199). While the figure shows about 2.3–3.5x speedup for strong scaling on up to 1,536 processes, it also shows that the approximate methods make a negligible difference with respect to the baseline performance.

Performance profiling

We profiled our distributed-memory Louvain implementation extensively using HPCToolkit [1] on a billion-edge graph, and observed that about 60% of the time was spent in managing and communicating vertex-community information, and about 40% was spent on the computation/communication (i.e., MPI_Allreduce) of global modularity. Profiling helped us in identifying communication-

intensive sites in the application, where we can apply alternate communication options such as MPI collectives or RMA and measure their impact.

5.7.2 Synthetic Data Generation

A benchmark should have the capability to generate test data for evaluation, such that it is convenient for a user to execute the program by just specifying a few parameters. In this case, it is also important to generate a test graph with some community structure, so that the Louvain iteration (Algorithm 4) runs for a number of iterations before converging. We have developed a distributed-memory parallel random geometric graph (RGG) generator for this purpose. The generator allows users to bypass the file I/O for reading an input graph and to create a synthetic graph in memory, which can be further parameterized to affect the overall communication intensity. We specifically chose RGGs because they are known to naturally exhibit consistent community structure with high modularity [50], as opposed to scale-free graphs.

An n-D random geometric graph (RGG), represented as G(n, d), is a graph generated by randomly placing N vertices in an n-D space and connecting pairs of vertices whose Euclidean distance is less than or equal to d. In our experiments we only consider 2D RGGs contained within a unit square, $[0, 1]^2$, and the Euclidean distance between two vertices is used as the weight of the edge connecting them. We calculate d from two quantities, as explained next. Connectivity is a monotonic property of an RGG; 2D unit-square RGGs have a sharp connectivity threshold at $d_c = \sqrt{\frac{\ln N}{\pi N}}$ [53]. The connectivity threshold is also the longest edge length of the minimum spanning tree of G [150]. The thermodynamic limit, at which a giant component appears with high probability, is given by $d_t = \sqrt{\frac{\lambda_c}{\pi N}}$ [53], where $\lambda_c = 2.0736$ for 2D unit-square RGGs. The particular value of d that we use is $d_{ct} = (d_c + d_t)/2$.

We distribute the domain such that each process receives N/p vertices (where p is the total number of processes). Each process owns a $1 \times \frac{1}{p}$ strip of the unit square, and generates that many random numbers, between specific ranges, as shown in Figure 5.11.
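As a concrete illustration of the formulas above, here is a minimal C++ sketch (not the generator itself) that computes $d_c$, $d_t$ and $d_{ct}$, and checks the strip-decomposition property used by the generator; the values of N and p are hypothetical.

```cpp
#include <cmath>
#include <cstdio>

// Sketch: connectivity threshold d_c, thermodynamic limit d_t, and the
// distance d_ct = (d_c + d_t)/2 used to connect vertices in a 2D unit-square RGG.
int main() {
  const double N = 134217728.0;   // number of vertices (hypothetical, 2^27)
  const int    p = 1024;          // number of processes (hypothetical)
  const double lambda_c = 2.0736; // constant for 2D unit-square RGGs
  const double pi = std::acos(-1.0);

  const double d_c  = std::sqrt(std::log(N) / (pi * N)); // connectivity threshold
  const double d_t  = std::sqrt(lambda_c / (pi * N));    // thermodynamic limit
  const double d_ct = 0.5 * (d_c + d_t);                 // distance used by the generator

  // Each process owns a 1 x (1/p) strip of the unit square; d_ct < 1/p means
  // edges can only cross into the strips of the up/down neighbor processes.
  const double strip_height = 1.0 / p;
  std::printf("d_c=%g d_t=%g d_ct=%g strip=%g neighbors-only=%d\n",
              d_c, d_t, d_ct, strip_height, d_ct < strip_height);
  return 0;
}
```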

Figure 5.11: Distribution based on $[0, 1]^2$ on p = 4 processes and for N = 12. $\frac{1}{p} > d$ mandates that vertices in a process can only have edges with vertices owned by its up or down neighbor. The blocks between the parallel lines indicate vertices owned by a process.

The generated random numbers are exchanged between neighboring processes in order to compute Euclidean distances between neighboring vertices. The ghost vertices are then exchanged between neighbors. Since the RGG generator relies on random numbers, it is important that the sequence of numbers be chosen from the same distribution across processes. We implement the linear congruential generator (LCG) algorithm using MPI. An LCG is defined by a linear recurrence relation that deterministically generates a sequence of pseudorandom numbers.

To increase communication pressure, we also provide an option to introduce some noise into an RGG by adding a percentage of the total edges randomly (following a uniform distribution) between vertices. Adding random edges increases the likelihood of a process communicating with non-neighboring processes, increasing the overall network congestion and thereby creating ideal scenarios for measuring the impact of different communication options. Figure 5.12 shows the inter-process communication (as reported by the TAU profiler [169]) for the single-phase Louvain implementation on 1024 processes of NERSC Cori, using a basic RGG as compared to an RGG with 20% random edges, for a graph of 134M vertices and 1.6B edges.
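For reference, a minimal sketch of the LCG recurrence follows. The multiplier/increment shown are the well-known drand48 constants and serve only as an illustration; they are not necessarily the parameters used in the dissertation's MPI generator.

```cpp
#include <cstdint>
#include <cstdio>

// Linear congruential generator: x_{n+1} = (a * x_n + c) mod 2^48.
struct LCG {
  std::uint64_t state;
  explicit LCG(std::uint64_t seed) : state(seed) {}
  double next() {  // uniform double in [0, 1)
    const std::uint64_t a = 0x5DEECE66DULL, c = 0xBULL, m = 1ULL << 48;
    state = (a * state + c) % m;  // 64-bit wraparound preserves the mod-2^48 result
    return double(state) / double(m);
  }
};

int main() {
  // Seeding every process identically keeps the global sequence reproducible;
  // each rank can then skip ahead to the block of numbers it owns.
  LCG rng(2019);
  for (int i = 0; i < 4; ++i) std::printf("%f\n", rng.next());
  return 0;
}
```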

(a) Basic RGG input (a black spot means zero exchange); (b) RGG input with 20% extra edges to increase overall communication.

Figure 5.12: Communication volume, in terms of minimum send/recv message sizes (in bytes) exchanged between pairs of processes, of the single-phase Louvain implementation with basic RGG input vs. RGG with random edges using 1024 processes. Adding extra edges increases overall communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids; the top-left corner represents id zero for both sender and receiver. Byte sizes vary from 8 (blue) to 32 (red) for the figure on the left, and from 8 (blue) to 3000 (red) for the figure on the right.

Figure 5.13 shows the inter-process communication patterns of the single-phase Louvain method with an RGG input of 134M vertices (20% of the overall number of edges added randomly) compared to Graph500 BFS [140] (with SCALE equal to 27). It is evident from the figure that only a subset of processes participate in communication for Graph500 BFS, whereas all of the processes contribute to the overall communication in the single-phase Louvain implementation.

5.8 Analysis of Memory Affinity, Power Consumption, and Communication Primitives

In this section, we present further performance results on the Intel Knights Landing manycore processor (Section 5.8.1), analyze power, energy and memory usage (Section 5.8.2), and investigate the impact of various MPI communication primitives in the context of our algorithm (Section 5.8.3).

(a) MPI calls: Louvain; (b) MPI calls: Graph500 BFS; (c) Mean message sizes: Louvain; (d) Mean message sizes: Graph500 BFS.

Figure 5.13: Communication volumes (in terms of send/recv invocations, and mean send/recv message sizes exchanged between processes) of the single-phase Louvain method and Graph500 BFS for 134M vertices on 1024 processes. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids; the top-left corner represents id zero for both sender and receiver. Blue represents the minimum and red the maximum volume for each of the figures, at different minimum and maximum values (the communication patterns are what matter).

5.8.1 Evaluation on Intel Knights Landing architecture

Intel Xeon Phi Knights Landing (KNL) is a manycore processor [172] that is the primary component of the latest Cray XC40 supercomputer at ALCF, named Theta. A KNL node in Theta consists of 64 cores, organized into 32 tiles (2 cores per tile, sharing an L2 cache of 1 MB) in a 2-D layout, a high-bandwidth in-package multi-channel DRAM memory of size 16 GB (MCDRAM), and 192 GB of DDR4 main memory. The tiles are connected by a mesh interconnect, and the mesh supports different levels of memory address affinity, known as clustering modes. It is also possible to configure the available memory into one of three modes: (i) cache, where MCDRAM is a cache for main memory; (ii) flat, where MCDRAM is treated as addressable memory (like main memory); and (iii) hybrid, where a portion of MCDRAM is treated as addressable memory and the rest is a cache for main memory.

We further classify the hybrid mode into equal and split. In equal memory mode, 50% of MCDRAM is addressable memory and the other 50% is a cache, whereas in split mode, 75% of MCDRAM is addressable memory and the remaining 25% is cache. We use a custom allocator (i.e., hbw::allocator) from the memkind library [31] to allocate some C++ data structures on the KNL MCDRAM. The performance differences between the clustering modes were not evident, therefore we selected the default quadrant mode for our distributed Louvain implementation. In

quadrant clustering mode, the tiles are divided into four parts (quadrants), which are spatially located near four groups of memory controllers. Keeping the clustering mode constant, we vary the memory modes and demonstrate performance using four real-world datasets in Figure 5.14.

Despite the inherent simplicity of the cache mode (no application code modification), the access latency of MCDRAM is higher than that of standard caches, and the overall memory bandwidth is impacted by main memory accesses (for the portion of data not resident in MCDRAM). Due to the irregular nature of memory accesses in our distributed Louvain implementation, cache misses are pervasive. In cache mode, the MCDRAM in KNL is treated as a direct-mapped cache (with a 64-byte cache line), in which an address in main memory is mapped to only one location in

the cache. In contrast, L3 caches in conventional CPU architectures such as Intel Haswell are multi-way set associative, in which an address in main memory can be mapped to any of multiple cache locations, significantly reducing conflict misses. An MCDRAM cache miss is more expensive than reading from main memory, because memory requests cannot travel from the processor L2 cache to main memory directly, and have to involve MCDRAM in between.
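To illustrate the flat/hybrid usage of hbw::allocator mentioned above, here is a minimal sketch assuming memkind's hbw_allocator.h header is available; the data structures shown are hypothetical stand-ins, not the dissertation's actual containers.

```cpp
#include <vector>
#include <hbw_allocator.h>  // memkind's high-bandwidth-memory C++ allocator

// Place a frequently and irregularly accessed array in MCDRAM while the rest
// of the data stays in DDR4. With memkind's default policy, the allocation
// falls back to DDR4 if no high-bandwidth memory is exposed on the node.
int main() {
  // Hypothetical vertex-community table allocated on MCDRAM.
  std::vector<long, hbw::allocator<long>> community(1 << 20, -1L);

  // Ordinary heap (DDR4) allocation for comparison.
  std::vector<long> degrees(1 << 20, 0L);

  community[12345] = 42;  // irregular accesses benefit from MCDRAM bandwidth
  return (int)community[12345] - 42 + (int)degrees[0];
}
```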

Figure 5.14: Performance of four real-world graphs (soc-friendster, 1.8B edges; nlpkkt240, 401.2M edges; com-orkut, 117.1M edges; uk-2007, 3.3B edges) using the cache, equal, flat and split memory modes on KNL nodes of ALCF Theta (for the default quadrant clustering mode). X-axis: Number of processes; Y-axis: Execution time (secs.) in log-scale.

We notice that the hybrid split mode is more scalable than the other modes, and the flat mode (the opposite of the cache mode) yields the best performance in most of the cases. In flat mode, we explicitly allocated some data structures on the MCDRAM, and observed about 30% better performance as compared to the cache mode. We capture the relative performance of the KNL memory modes for our distributed Louvain implementation in Figure 5.15.

Figure 5.15: The relative performance profiles for the cache, equal, flat and split memory modes on Theta KNL nodes using a subset of inputs. The X-axis represents the factor by which a given scheme fares relative to the best performing scheme for that particular input. The Y-axis represents the fraction of problems. The closer a curve is aligned to the Y-axis, the superior its performance relative to the other schemes over a range of 40 inputs.

5.8.2 Power, energy and memory usage

Performance cannot be analyzed using execution-time metrics alone; it is important to take a holistic approach to understanding the impact of the approximation techniques on the underlying system, in terms of compute-node power/energy, memory consumption, and the energy-delay product (EDP) [120]. Table 5.11 shows the power/energy and memory consumption of the distributed Louvain versions, measured using the CrayPat tool [52] on NERSC Cori. We observe that apart from providing the best performance for soc-friendster and com-orkut, ETC also reduces the power/energy consumption and memory traffic (L3 cache misses) by about 2–3× relative to the parallel baseline version. Similarly, for uk-2007 (which runs for approximately half the number of iterations of soc-friendster), ETC demonstrates about 40% improvement in energy consumption as compared to the baseline. For the nlpkkt240 input, the speedup of the Threshold Cycling version is about 9× relative to the baseline (refer to Table 6.7). Accordingly, the Threshold Cycling variant yields about 2× improvement in energy usage and memory traffic for nlpkkt240, whereas the ETC variants in this case report about 30% more memory traffic as compared to the baseline. In this case, ETC increases the number of phases/iterations to convergence, raising the overall memory traffic (the impact of ETC on the convergence characteristics of nlpkkt240 is shown in Figs. 5.6a and 5.6b).

5.8.3 Impact of MPI communication method

We use MPI nonblocking Send/Recv and collectives to perform communication in our distributed Louvain implementation. Exchanging vertex-community associations among processes takes place in every iteration of a phase, and is the most expensive communication operation. Therefore, we implemented a few communication-intensive sites using MPI nonblocking/blocking Send/Recv, collectives and MPI-3 RMA to study the effect of different communication models. We use the following notation for the different communication variants (a sketch of the collective variant follows the list):

• NBSR: Uses MPI nonblocking point-to-point communication routines, i.e., MPI_Isend/MPI_Irecv.

• SR: Uses blocking MPI send and receive, i.e., MPI_Sendrecv.

• COLL: Uses the blocking MPI collective operation MPI_Alltoallv.

• RMA: Uses MPI-3 RMA for one-sided communication with passive target synchronization; uses MPI_Put (default) or MPI_Accumulate, denoted as RMA(Put) and RMA(Acc), respectively.
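The following is a minimal sketch of the COLL variant (not the dissertation's exact code): each rank packs per-destination vertex-community updates into one send buffer and exchanges them with a single MPI_Alltoallv call per iteration. The payload layout is hypothetical.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Hypothetical payload: one (vertex, community) pair destined to every rank.
  std::vector<long> sendbuf(2 * size), recvbuf(2 * size);
  std::vector<int> scounts(size, 2), rcounts(size, 2), sdispl(size), rdispl(size);
  for (int r = 0; r < size; ++r) {
    sdispl[r] = rdispl[r] = 2 * r;
    sendbuf[2 * r]     = rank;      // vertex id (illustrative)
    sendbuf[2 * r + 1] = rank % 7;  // its current community (illustrative)
  }

  // One collective exchange replaces many point-to-point messages.
  MPI_Alltoallv(sendbuf.data(), scounts.data(), sdispl.data(), MPI_LONG,
                recvbuf.data(), rcounts.data(), rdispl.data(), MPI_LONG,
                MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}
```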

Due to the distribution of a random geometric graph (RGG) (refer to Section 5.7.2), if random edges are not added, then a process communicates with at most two neighboring processes (see Figure 5.12). Therefore, we also discuss performance of the RGG datasets with additional random edges (20% of the total number of edges, about 0.4 − 1.3B). Table 5.12 shows the performance of the single-phase Louvain implementation on the RGG datasets of over a billion edges with unit edge weights on 1K-4K processes. When extra edges

Table 5.11: Power/energy and memory consumption of the distributed Louvain implementation using four real-world graphs exhibiting diverse characteristics on 1K processes (64 nodes) of NERSC Cori

Ver.                Mem. (MB/p)   Node eng. (kJ)   Node pwr. (kW)   Mem. traffic/node (GB)   EDP
uk-2007 (3.3B edges)
Baseline            95.9          1465.43          13.02            14.40                    1.65E+08
Threshold Cycling   95.8          1419.73          13.07            14.35                    1.54E+08
ET(0.25)            71.5          1173.48          12.82            15.10                    1.07E+08
ET(0.75)            71.6          1090.97          12.72            13.12                    9.36E+07
ETC(0.25)           71.7          912.58           12.03            9.88                     6.92E+07
ETC(0.75)           71.8          893.93           12.18            9.88                     6.56E+07
soc-friendster (1.8B edges)
Baseline            867.6         9354.25          15.82            1633                     5.53E+09
Threshold Cycling   867.3         4625.91          14.64            755.09                   1.46E+09
ET(0.25)            875.6         5740.31          15               829.87                   2.20E+09
ET(0.75)            893.5         10924.59         15.59            1581                     7.65E+09
ETC(0.25)           1026.4        3149.93          14.02            522.2                    7.08E+08
ETC(0.75)           1025.6        2850.92          14.67            520.75                   5.54E+08
nlpkkt240 (401.2M edges)
Baseline            76.5          1343.61          15.21            85.23                    1.19E+08
Threshold Cycling   65.4          436.16           9.74             4.51                     1.95E+07
ET(0.25)            76.1          984.82           13.49            43.15                    7.19E+07
ET(0.75)            75.9          912.49           13.52            42.82                    6.16E+07
ETC(0.25)           110.2         1258.78          15.01            127.25                   1.06E+08
ETC(0.75)           109.9         1207.97          14.84            121.51                   9.83E+07
com-orkut (117.1M edges)
Baseline            115.4         1271.46          14.07            41.12                    1.15E+08
Threshold Cycling   115.5         953.45           12.33            23.41                    7.37E+07
ET(0.25)            115.6         961.39           11.88            19.66                    7.78E+07
ET(0.75)            115.6         812.7            11.46            15.40                    5.76E+07
ETC(0.25)           115.6         706.27           10.92            11.50                    4.57E+07
ETC(0.75)           115.6         676.67           11.08            11.52                    4.13E+07

are added, execution times increase by up to 3×, owing to an increase in overall communication volume. However, the change in modularity is more gradual across multi-process runs, and it declines by about 17% when extra edges were added.

We also analyzed the impact of using real edge weights (the Euclidean distance) between vertices, as shown in Table 5.13. If the Euclidean distance between a randomly selected vertex pair is unavailable (when the respective vertices are owned by non-neighboring processes), then we pick an edge weight uniformly in (0, 1). We ensure that the edge weight is consistent across MPI communication models by providing a seed to the random number generator that is a unique hash of the vertex pair. Unlike the extra-edge cases in Table 5.12, in Table 5.13 we observe about 3–9% variability in modularity across MPI communication models. In general, for the RGG graphs, comparing the different MPI communication models, the performance of SR is at least 1.5–2× worse than the others in every case, potentially due to internal message-ordering overheads. The performance of COLL is consistently superior, and for the largest RGG of 536.8M vertices, COLL is about 1.2–5× faster than the rest. The performance of the basic RGG datasets (with no extra edges) with unit and real edge weights is comparable, the overall modularity difference being about 3–6%. However, we observe a significant difference between the two approaches for the largest case (536.8M vertices) with the extra edges (a reduction of about 1.2–3× in execution time and up to 22% in modularity). Due to a significant reduction in the number of iterations to convergence, the execution time is lower than in the basic case.

Table 5.12: Execution time (in secs.) and Modularity (Q) on 1-4K processes for RGG datasets with unit edge weights

            1024 processes (|V| = 134.2M)      2048 processes (|V| = 268.4M)      4096 processes (|V| = 536.8M)
Versions    |E| = 1.59B      |E| = 1.9B        |E| = 3.24B      |E| = 3.89B       |E| = 6.64B      |E| = 7.97B
            Time     Q       Time     Q        Time     Q       Time     Q        Time     Q       Time      Q
NBSR        6.53     0.750   18.28    0.626    9.57     0.749   21.68    0.626    49.53    0.748   57.06     0.625
COLL        5.56     0.750   18.32    0.626    7.28     0.749   21.65    0.626    18.10    0.748   47.85     0.625
SR          13.30    0.750   28.32    0.626    31.50    0.749   49.27    0.626    94.41    0.748   115.87    0.625
RMA         5.76     0.751   19.05    0.626    8.82     0.753   23.23    0.626    47.18    0.750   60.95     0.626

Table 5.14 shows the performance of the single-phase Louvain implementation on Friendster over 1K/2K processes using different MPI communication models, with 2 OpenMP threads per process. From Table 5.14, we observe that the performance of nonblocking Send/Recv and collectives is competitive with RMA(Put), despite differences in the number of iterations to convergence. Due to the atomicity requirements of the MPI accumulate operation, the performance of RMA(Acc) suffers at scale, as

Table 5.13: Execution time (in secs.) and Modularity (Q) on 1–4K processes for RGG datasets with Euclidean distance weights

            1024 processes (|V| = 134.2M)      2048 processes (|V| = 268.4M)      4096 processes (|V| = 536.8M)
Versions    |E| = 1.59B      |E| = 1.9B        |E| = 3.24B      |E| = 3.9B        |E| = 6.64B      |E| = 7.97B
            Time     Q       Time     Q        Time     Q       Time     Q        Time     Q       Time      Q
NBSR        5.99     0.776   13.08    0.648    9.54     0.776   17.50    0.629    30.46    0.776   20.15     0.599
COLL        5.46     0.776   12.87    0.653    7.46     0.776   14.17    0.628    15.34    0.776   21.83     0.598
SR          13.17    0.776   19.6     0.649    32.05    0.776   31.97    0.624    88.81    0.776   65.17     0.598
RMA         5.80     0.777   13.3     0.628    9.18     0.776   15.43    0.624    21.96    0.776   22.41     0.544

Table 5.14: Number of iterations, execution time (in secs.) and Modularity of Friendster (65.6M vertices, 1.8B edges) for various MPI communication models on 1024/2048 processes using 2 OpenMP threads/process

            1024 processes                         2048 processes
Versions    Iterations   Time     Modularity      Iterations   Time     Modularity
NBSR        119          689.53   0.6146          111          399.63   0.6146
COLL        123          694.49   0.6147          115          403.75   0.6146
SR          119          703.09   0.6146          119          445.81   0.6147
RMA(Put)    117          678.11   0.6146          115          414.74   0.6146
RMA(Acc)    115          731.62   0.6146          111          677.66   0.6146

compared to the rest. The primary reason behind these fluctuations in runtime performance can be attributed to the variable number of iterations to convergence (111–123), indicating contrasting communication volumes between these versions.

Since the Louvain method is inherently sequential, the order of community updates from processes impacts the overall number of iterations to convergence, making it nondeterministic across program runs. The lack of implicit message ordering in the MPI communication methods also contributes to this disparity. This behavior makes it difficult to accurately measure the effect of the different communication models on the Louvain method for real-world graphs.

We use threads to parallelize the local computation of modularity (as shown in Line 11 of Algorithm 4; a thread-parallel sketch of this reduction is shown after this paragraph). Small differences in the double-precision modularity quantity across runs, caused by varying the number of threads, can impact the number of iterations to convergence and the overall execution times. For instance, when the number of threads per process was increased to 4 (from 2 in Table 5.14), the variability in iterations among the different versions increased as well (101–171), as demonstrated in Table 5.15. Additionally, we observe a modularity variation of about 2% across

the different versions in Table 5.15.
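The following is a minimal sketch of such a thread-parallel local modularity accumulation. It uses one common form of the modularity sum, Q = Σ_c [ e_c/m − (a_c/(2m))² ], where e_c is the intra-community edge weight and a_c the total degree of community c; the per-community totals shown are hypothetical, and the exact expression in Algorithm 4 may differ.

```cpp
#include <vector>
#include <cstdio>

int main() {
  // Hypothetical per-community totals held by this process.
  std::vector<double> e_c = {10.0, 7.5, 3.0};  // intra-community edge weight
  std::vector<double> a_c = {25.0, 18.0, 9.0}; // total degree of each community
  const double m = 26.0;                       // total edge weight

  double q = 0.0;
  // Partial sums are combined across OpenMP threads via the reduction clause.
  #pragma omp parallel for reduction(+ : q)
  for (long c = 0; c < (long)e_c.size(); ++c)
    q += e_c[c] / m - (a_c[c] / (2.0 * m)) * (a_c[c] / (2.0 * m));

  // The per-process values are then combined globally with MPI_Allreduce.
  std::printf("local modularity contribution = %f\n", q);
  return 0;
}
```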

Table 5.15: Number of iterations, execution time (in secs.) and Modularity of Friendster for various MPI communication models on 1024/2048 processes using 4 OpenMP threads/process

            1024 processes                         2048 processes
Versions    Iterations   Time     Modularity      Iterations   Time     Modularity
NBSR        101          437.08   0.6160          125          328.01   0.6231
COLL        101          432.98   0.6184          149          434.06   0.6260
SR          101          448.96   0.6167          167          444.68   0.6278
RMA(Put)    119          495.18   0.6182          171          440.28   0.6264
RMA(Acc)    145          607.51   0.6193          113          591.10   0.6222

5.9 Addressing the resolution limit problem

Figure 5.16: A sample ball bearing graph consisting of a large component (referred to as ball) with 128 vertices and two small components (referred to as bearings) each with 9 vertices. Modularity based methods including Louvain fail to designate the small components into individual communities, and treat them as a single community.

Community detection algorithms based on the maximization of modularity are susceptible to the resolution limit problem, where they cannot distinguish between two clearly defined clusters (modules) that are smaller than a certain size; this size depends on the total size of the input (defined to be the square root of the total number of edges in the network) and on the interconnectedness of the clusters themselves. See the work of Fortunato and Barthélemy for details [66].

In order to address the resolution limit problem, we implemented the fast-tracking resistance method discussed in [82]. Our preliminary analysis using pathological inputs suggests that this method addresses the resolution limit problem effectively. Intuitively, the key idea of this method is to formulate an additional global metric called resistance (denoted by r) that changes its value when the subgraph is split, and to compute a special global modularity (denoted as $Q_{AFG}$) that includes r in its calculation. The minimum value of r ($r_{min}$) is reached when the subgraph cannot be split anymore, and $Q_{AFG}(r_{min}) = 0$ when the minimum value of r is reached. Therefore, using the fast-tracking resistance algorithm for community detection, our exit criterion changes from comparing the modularity difference between successive phases to determining whether $Q_{AFG} = 0$ for a particular value of r.

For empirical analysis, we generate graphs that consist of a single large module (a maximal clique) and multiple small connected subgraphs with different granularities to observe the effect of the resolution limit problem. We refer to these graphs as ball bearing graphs, associating the large module with a ball and the small modules with bearings surrounding the ball. Figure 5.16 shows an example of a ball bearing graph with 128 vertices in the large module and two smaller modules of 9 vertices each. We summarize the results from comparing the quality of the Louvain method with the fast-tracking resistance method in Table 5.16. We use different ball bearing graphs (differing only in the number of bearings), with a large component consisting of 512 vertices.

Table 5.16: Quality comparison between Louvain and Fast-tracking resistance method using ball bearing graphs.

Input graph   Louvain method                                      Fast-tracking resistance method
# bearings    #Comm.   Precision   Recall   F-score   Gini       #Comm.   Precision   Recall   F-score   Gini
2             2        0.998       0.992    0.995     0.447      3        0.999       0.992    0.995     0.614
3             2        0.996       1        0.998     0.429      4        1           1        1         0.679
4             2        0.992       1        0.996     0.407      5        1           1        1         0.707
5             2        0.987       1        0.993     0.387      6        1           1        1         0.720
6             2        0.981       1        0.990     0.367      7        1           1        1         0.724
7             3        0.988       1        0.994     0.522      8        1           1        1         0.724

We observe that the Louvain method fails to identify the small components, whereas the fast-tracking resistance method is able to recognize them as independent communities (modules). In

order to express the relative community sizes, we compute the Gini coefficient, which measures the variability or relative inequality in a distribution of frequencies [192]. A Gini coefficient of zero would indicate a perfect distribution where all the communities are of the same size. For our context, a low Gini coefficient (closer to 0) would suggest that vertices are clustered into a small number of communities of similar sizes. In contrast, a Gini coefficient of one would indicate a perfect imbalance with a single large community. Thus, for a ball bearing graph, larger values of the Gini coefficient (closer to 1) would indicate that the bearings have been correctly identified as independent communities; in other words, the resolution limit problem has been correctly addressed. This observation corroborates the number of communities reported by the fast-tracking resistance method (the last column in Table 5.16). The F-scores reported by Louvain are still high, because the large component containing 95% of the total vertices in the network is accurately clustered. However, the F-scores reported by the fast-tracking resistance method are perfect in most of the cases. Thus, based on this empirical evidence, we propose the fast-tracking method as a better approach to generate ground truth information for real-world inputs without known ground truth information.

Towards a better ground truth: Obtaining ground truth information for real-world graphs is challenging for several reasons. Current schemes often use imprecise methods that make quality comparisons difficult. When ground truth information is not available, we therefore propose to use the final set of communities obtained from the fast-tracking resistance method to establish ground truth data for a variety of real-world graphs. The results are summarized in Table 5.17. Since modularity-based methods suffer from the resolution limit problem, the F-scores of graphs obtained from the Louvain algorithm are often low, as expected. Further, we note that since the fast-tracking resistance method splits clusters (modules), the total count of false positives (FP) will increase for the Louvain method with respect to the fast-tracking resistance method. Consequently, the precision scores are lower. Our current implementation of the fast-tracking resistance method is multithreaded.

Table 5.17: Quality of Louvain compared to ground truth data obtained from the Fast-tracking resistance method for small/moderate sized real-world graphs.

Real-world graphs                             Fast-tracking resistance        Louvain
Name              #Vertices   #Edges          #Comm.    Gini    Mod.          #Comm.    Gini    Mod.      Precision   Recall   F-score
p2p-Gnutella04    10.87K      39.99K          3,483     0.642   0.009         232       0.929   0.033     0.426       1        0.597
skirt             12.59K      196.52K         492       0.661   0.789         31        0.298   0.908     0.256       1        0.408
email-Enron       36.69K      367.66K         5,348     0.693   0.069         1,210     0.863   0.079     0.295       0.685    0.413
ca-AstroPh        18.77K      396.16K         2,304     0.658   0.111         332       0.919   0.122     0.731       1        0.844
loc-Brightkite    58.22K      428.15K         22,230    0.563   0.026         5,118     0.799   0.044     0.332       1        0.498
soc-Epinions1     75.88K      508.83K         27,280    0.593   0.014         5,389     0.841   0.021     0.248       1        0.397
soc-Slashdot0811  77.36K      905.46K         26,003    0.626   0.006         4,290     0.878   0.013     0.280       1        0.438
msc10848          10.84K      1.22M           53        0.495   0.786         12        0.389   0.856     0.295       1        0.456
loc-Gowalla       196.59K     1.9M            40,310    0.566   0.097         4,382     0.942   0.122     0.412       1        0.583
roadNet-CA        1.97M       5.53M           607,763   0.465   0.219         139,446   0.727   0.242     0.605       1        0.754

5.10 Chapter summary

We presented a distributed-memory implementation for community detection in graphs using the Louvain method based on modularity optimization. We introduced several approximate computing techniques/heuristics for improving performance and scalability, and demonstrated their efficacy using a large set of inputs from diverse real-world applications as well as synthetically generated graphs with ground truth information. We demonstrated speedups of 1.8x to 46x (using up to 4K processes) relative to the baseline version, for a wide variety of real-world networks. The modularities obtained by the different versions of our parallel algorithm are in most cases comparable to the best modularities obtained by a state-of-the-art multithreaded Louvain implementation. We also presented a proxy application that implements a single phase of the Louvain algorithm for parallel graph community detection. Constraining the Louvain method to run only for a single phase allows for quick analysis (especially since the method is a multi-phase heuristic). We further provided a discussion on addressing the resolution limit problem for the Louvain method and a comparison with the fast-tracking resistance method.

We believe the detailed discussion of the parallel implementation, the approximation methods introduced, and the experimental analysis provided in this chapter will benefit a wider range of graph algorithms that also have a greedy iterative structure with vertex-centric computations. What sets us apart from similar studies is our emphasis on exploring both the qualitative and quantitative aspects of the Louvain method using a variety of networks.

CHAPTER 6

EXPLORING MPI COMMUNICATION MODELS FOR GRAPH APPLICATIONS USING GRAPH MATCHING AS A CASE STUDY

6.1 Introduction

Single program multiple data (SPMD) using message passing is a popular programming model for numerous scientific computing applications running on distributed-memory parallel systems. The Message Passing Interface (MPI) is a standardized interface for supporting message passing, developed through the efforts of several vendors and research groups. Among the communication methods in MPI, point-to-point Send-Recv and collectives have been at the forefront in terms of their usage and wide applicability. For distributed-memory graph analytics in particular, Send-Recv remains a popular model, due to its wide portability across MPI implementations and its support for the communication patterns (asynchronous point-to-point updates) inherent in many graph workloads.

In addition to the classical Send-Recv model, Remote Memory Access (RMA), or the one-sided communication model, and neighborhood collective operations are recent features of MPI that are relevant to applications with irregular communication patterns. A one-sided communication model separates communication from synchronization, allowing a process to perform asynchronous nonblocking updates to remote memory. Relevant background on MPI RMA is in Chapter 2.

Data locality and task placement are crucial design considerations for deriving sustainable performance from the next generation of supercomputers. Presently, MPI offers enhanced support for virtual topologies to exploit nearest-neighborhood communication [99]. To derive the maximum benefit from the topology, there are special communication operations in MPI, namely neighborhood collectives (abbreviated as NCL), that collectively involve only a subset of processes in communication. MPI neighborhood collectives advance the concepts of standard collective operations,

and build on decades of past research effort in optimized algorithms for collective operations. We provide an overview of these models in Section 6.2.

From an application perspective, graph algorithms have recently emerged as an important class of applications on parallel systems, driven by the availability of large-scale data and novel algorithms of higher complexity. However, graph algorithms are challenging to implement [129]. Communication patterns and volume depend on the underlying graph structure. Therefore, a critical aspect of optimization in such applications lies in designing the granularity of communication operations and in the choice of communication primitives.

In this chapter, we use graph matching, a prototypical graph problem, as a case study to evaluate the efficacy of different communication models. Given a graph G = (V, E, ω), a matching M is a subset of edges such that no two edges in M are incident on the same vertex. The weight of a matching is the sum of the weights of the matched edges, and the objective of the maximum weight matching problem is to find a matching of maximum weight. Algorithms to compute optimal solutions are inherently serial and are impractical for large-scale problems, albeit having polynomial time complexity [72, 118]. However, efficient approximation algorithms with expected linear time complexity can be used to compute high quality solutions [61, 155]. Further, several of the half-approximation algorithms (algorithms that guarantee at least half of the optimal solution in terms of the weight of the matching) can be parallelized efficiently [33, 132]. We use the parallel variant of the locally-dominant algorithm in this work to explore the efficacy of three communication models (detailed in Section 6.4). In particular, we implemented the locally-dominant algorithm using Send-Recv, RMA and neighborhood collectives. To the best of our knowledge, this is the first time that these communication schemes have been used for half-approximate matching.

Secondly, we devised a detailed case study for understanding and characterizing the performance efficacy of different communication models for graph matching under different measures. Our working hypothesis is that the newer communication models of RMA and NCL are likely to outperform the classical Send-Recv model for graph matching. The goal of the case study was to test this hypothesis using the three implementations of half-approximate matching.

While our study demonstrates the general validity of our working hypothesis (at a high level), it also paints a more nuanced picture of how these individual models differ with input distributions, ordering schemes, and system sizes. We demonstrate that with our RMA and NCL implementations, one can achieve up to 4.5× speedup for a billion-edge real-world graph relative to Send-Recv. We show that while RMA is more consistent at delivering high performance, NCL is more sensitive to the type of input graph and to the input vertex ordering. Besides runtime performance, we also evaluate the energy and memory consumption costs of these different models for graph matching, and show that NCL and RMA significantly reduce these costs as well. Our experimental assessment makes a case for not viewing any one of these metrics in isolation, but for identifying models that are likely to achieve the best tradeoffs under different configurations.

The rest of the chapter is organized as follows. Section 6.2 contains a preliminary discussion of distributed-memory graph applications using the classic MPI Send-Recv communication model, and provides background information on the neighborhood collective operations of MPI. We introduce half-approximate graph matching in Section 6.3 and discuss the serial algorithm. We discuss the components of the parallel algorithm in Section 6.4. In Section 6.5, we analyze the relative performance of the RMA, NCL and Send-Recv versions for real-world/synthetic graphs. We review related work in Section 6.6. Finally, we draw conclusions in Section 6.7.

6.2 Implementing distributed-memory parallel graph algorithms using MPI

A number of graph-based algorithms are iterative in nature, requiring updates to different subsets of vertices in each iteration, with some local computation performed on the vertices owned by the current process. This is referred to as the owner-computes model. Implementing such owner-computes graph algorithms on distributed memory using MPI Send-Recv typically requires asynchronous message exchanges to update different portions of the graph simultaneously, giving rise to irregular communication patterns. Algorithm 6 illustrates a generic iterative graph algorithm representing the owner-computes model.

Within the realm of the owner-computes model, it is possible to replace Send-Recv with other

Algorithm 6: Prototypical distributed-memory owner-computes graph algorithm using nonblocking Send-Recv for communication. The Compute function represents some local computation by the process "owning" a vertex.
Input: G_i = (V_i, E_i), the portion of the graph G in rank i.
1:  while true do
2:    X_g ← Recv messages
3:    for {x, y} ∈ X_g do
4:      Compute(x, y) {local computation}
5:    for v ∈ V_i do
6:      for u ∈ Neighbor(v) do
7:        Compute(u, v) {local computation}
8:        if owner(u) ≠ i then
9:          Nonblocking Send(u, v) to owner(u)
10:   if processed all neighbors then
11:     break and output data

viable communication models. Our distributed-memory approximate matching implementations follow the owner-computes model and serve as an ideal use case to study the impact of different communication models, while being representative of a wide variety of distributed-memory graph algorithms. We briefly discuss the MPI neighborhood collective operations in this section, since we have already covered MPI RMA in Chapter 2. Neighborhood collective operations were introduced in the MPI-3 standard (circa 2012), along with significant extensions to MPI RMA.

Distributed-memory graph algorithms exhibit sparse communication patterns, and are therefore ideal candidates for exploring the relatively recent enhancements to the MPI process topology interface [99]. A virtual topology is an extra (optional) attribute associated with a communicator that captures the communication pattern between processes. At present, MPI supports three types of virtual topologies: cartesian, graph and distributed graph. The cartesian topology is used for communication patterns on regular grids in stencil-type computations. The graph topology requires each process to specify the entire global communication graph, whereas the distributed graph topology allows each process to define only the subset of processes (as edges) that it communicates with. The choice of a particular topology depends on the shape of the underlying data, and since graph-based workloads have sparse data, we create a distributed graph topology for our communication needs (a minimal sketch of creating such a topology follows).
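The sketch below shows one way to build such a distributed graph topology with MPI_Dist_graph_create_adjacent; it is illustrative only, and assumes each rank exchanges ghost data with rank−1 and rank+1 (mimicking the strip decomposition), whereas the real neighbor lists come from the graph distribution.

```cpp
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Hypothetical symmetric neighborhood: left and right ranks only.
  std::vector<int> nbrs;
  if (rank > 0)        nbrs.push_back(rank - 1);
  if (rank < size - 1) nbrs.push_back(rank + 1);

  MPI_Comm gcomm;
  // sources == destinations (symmetric), unweighted edges, no rank reordering.
  MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                 (int)nbrs.size(), nbrs.data(), MPI_UNWEIGHTED,
                                 (int)nbrs.size(), nbrs.data(), MPI_UNWEIGHTED,
                                 MPI_INFO_NULL, 0, &gcomm);

  // gcomm can now be used with MPI_Neighbor_alltoall / MPI_Neighbor_alltoallv.
  MPI_Comm_free(&gcomm);
  MPI_Finalize();
  return 0;
}
```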

Neighborhood collective operations make use of this graph topology to optimize the communication among neighboring processes [98, 96, 137]. Figure 6.1 depicts a process graph topology mimicking the underlying data distribution (we do not assume fixed neighborhoods). A graph topology can also be augmented with edge weights representing communication volumes between nodes, at the expense of extra memory. However, we only use unweighted process graphs in this chapter. The process topology routines in MPI have an option to reorder/remap ranks. However, we do not enable it, since most current MPI implementations treat the option as a no-op (at least for cartesian topologies) [87]. In the context of the owner-computes model, the primary impact of using a distributed graph topology is a reduction in memory consumption, which in this case is expected to be proportional to the subset of processes rather than to all the processes in the communicating group. We provide supporting information for this reduction in Table 6.8.

Figure 6.1: Subset of a process neighborhood and MPI-3 RMA remote displacement computation. The number of ghost vertices shared between processes is placed next to the edges. Each process maintains two O(neighbors)-sized buffers (shown only for P7): one storing a prefix sum over the number of ghosts, used to maintain outgoing communication counts, and the other storing the remote displacement start offsets used in MPI RMA calls. The second buffer is obtained from alltoall exchanges (depicted by arrows for P7) of the prefix-sum buffer among the neighbors.

6.3 Half-Approximate Matching

We discuss in this section the half-approximate matching algorithm in serial, and in Section 6.4 we discuss its parallelization in distributed-memory.

6.3.1 Matching preliminaries

A matching M in a graph is a subset of edges such that no two edges in M are incident on the same vertex. The objective of a matching problem can be to maximize the number of edges in a matching, known as the maximum matching, or to maximize the sum of the weights of the matched edges, known as the maximum weight matching (when weights are associated with the edges). Figure 6.2 demonstrates maximum weight matching.

Figure 6.2: Example of maximum weight matching; edges {0,2} and {1,4} are in the matching set.

A further distinction can be made on the optimality of the solutions computed – optimal or approximate. We limit our scope to half-approximate weighted matching algorithms in this chapter. We refer the reader to several classical articles on matching for further details [118, 102, 78].

A simple approach to compute a half-approximate matching is to consider edges in non-increasing order of weight and add them to the matching if possible. This algorithm was proposed by Avis and is guaranteed to produce matchings that are half-approximate to the optimal matching [6]. However, the ordering of edges serializes execution. Preis proposed a new algorithm based on identifying locally dominant edges – edges that are heavier than all their adjacent edges – without the need to sort the edges [155]. The locally dominant algorithm was adapted into a distributed algorithm by Hoepman [101], and into a practical parallel algorithm by Manne and Bisseling [132] and Halappanavar et al. [90]. We build on the work on MatchboxP [33] and implement novel communication

132 (a) MPI calls, graph matching (b) MPI calls, Graph500 BFS

Figure 6.3: Communication volumes (in terms of Send-Recv invocations) of the MPI Send-Recv baseline implementation of half-approx matching using Friendster (1.8B edges) and of Graph500 BFS using an R-MAT graph of 2.14B edges, on 1024 processes. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids.

schemes. We note that each communication model is a nontrivial implementation and requires significant modifications to the Send-Recv based algorithm. Further, the communication patterns generated by matching are distinctly different from those of available benchmarks such as Graph500, as shown in Figure 6.3, which makes it a better candidate for exploring the novel communication models in MPI-3.

Intuitively, the locally-dominant algorithm works by identifying dominant edges in parallel, adding them to the matching, pruning their neighbors, and iterating until no more edges can be added. A simple approach to identify locally dominant edges is to set a pointer from each vertex to its current heaviest neighbor. If two vertices point at each other, then the edge between these vertices is locally dominant. When we add this edge to the matching, only those vertices pointing to the endpoints of the matched edge need to be processed to find alternative matches. Thus, the algorithm iterates through edges until no new edges are matched. This algorithm has been proved to compute half-approximate matchings [132]. The algorithm has expected linear time complexity, but suffers from a weakness when there are no edge weights (or all weights are equal) and ties need to be broken using vertex ids [132]. However, a simple fix for this issue is to use a hash function on vertex ids to prevent linear dependences in pathological instances such as paths and grids with

ordered numbering of vertices.

6.3.2 Serial algorithm for half-approximate matching

Algorithm 7 demonstrates the serial half-approximate matching algorithm, based on [132]. There are two phases in the algorithm: in the first phase, the initial set of locally dominant edges in G = (V, E, ω) is identified and added to the matching set M; the next phase is iterative—for each vertex in M, its unmatched neighboring vertices are matched. For a particular vertex v, $N_v'$ represents the unmatched vertices in v's neighborhood. The vertex with the heaviest unmatched edge incident on v is referred to as v's mate, and this information is stored in a data structure called mate. Throughout the computation, the mate of a vertex can change, as it may try to match with multiple vertices in its neighborhood. We use similar notations for the parallel algorithms as well.

Algorithm 7: Serial matching algorithm.
Input: Graph G = (V, E, ω). Output: M, the set of matched vertices.
1:  mate_v ← ∅ ∀v ∈ V, M ← ∅
2:  for v ∈ V do
3:    u ← mate_v ← arg max_{u ∈ N'_v} ω_{u,v}
4:    if mate_u = v then
5:      M ← M ∪ {u, v}
6:  while true do
7:    v ← some vertex from M
8:    for x ∈ N'_v where mate_x = v and x ∉ M do
9:      y ← mate_x ← arg max_{y ∈ N'_x} ω_{x,y}
10:     if mate_y = x then
11:       M ← M ∪ {x, y}
12:   if processed all vertices in M then
13:     break

6.4 Parallel Half-Approximate Matching

In distributed-memory, we need to communicate information on candidate mates, and we need a way to express the context of the data being communicated. We discuss these communication contexts in detail in the next subsection and subsequently present our distributed-memory implementation of the matching algorithm.

6.4.1 Graph distribution

Our graph distribution is 1D vertex-based, which means each process owns some vertices and all of their edges. The input graph is stored as a directed graph, so each process also stores some vertices that are owned by a different process in order to maintain edge information. For example, if vertex u is owned by process #0 and vertex v is owned by process #1, and there is an edge between u and v, then process #0 stores u–v′ and process #1 stores v–u′ (where u′/v′ are "ghost" vertices, and there is an undirected edge between process #0 and #1 in the distributed graph process topology). We store locally owned vertices and edges using the Compressed Sparse Row (CSR) format [55]. The owner function takes a vertex as an input parameter and returns its owning process; please refer to the discussion in Section 5.4.1 of Chapter 5.

This simple distribution may cause load imbalance for graphs with irregular node degree distributions; in Section 6.5.3 we investigate the impact of graph reordering techniques and load imbalances.
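A minimal sketch of such a distributed CSR with an owner function is shown below; the class layout and block-partition rule are illustrative assumptions, not the dissertation's actual data structures.

```cpp
#include <vector>
#include <cstdint>

// 1D vertex-block distribution with a local CSR. Vertices are split into
// contiguous blocks of size ceil(N/p); owner() maps a global vertex id to its
// rank, and a ghost is any column index whose owner() differs from `rank`.
struct DistGraphCSR {
  std::int64_t nglobal;             // total number of vertices N
  int nprocs, rank;                 // p and this process id
  std::vector<std::int64_t> rowptr; // CSR offsets for locally owned vertices
  std::vector<std::int64_t> colidx; // global ids of edge targets (may be ghosts)
  std::vector<double> weight;       // edge weights

  std::int64_t block() const { return (nglobal + nprocs - 1) / nprocs; }
  int owner(std::int64_t v) const { return (int)(v / block()); }
  bool is_ghost(std::int64_t v) const { return owner(v) != rank; }
};

int main() {
  // Two ranks, 4 vertices; rank 0 owns {0,1}. Edge 0-2 crosses to rank 1,
  // so vertex 2 appears as a ghost in rank 0's column indices.
  DistGraphCSR g{4, 2, 0, {0, 2, 3}, {1, 2, 0}, {1.0, 0.5, 1.0}};
  return g.is_ghost(g.colidx[1]) ? 0 : 1;
}
```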

6.4.2 Communication contexts

The message contexts are used to delineate the actions taken by a vertex to inform its neighbors of its status and to avoid conflicts. For the Send-Recv implementation, these contexts are encoded in message tags, and for the RMA and neighborhood collective implementations, they are part of the communication data. A vertex is done communicating with its ghost vertices when all of its cross edges have been deactivated and it is no longer part of a candidate set. Figure 6.4 demonstrates the communication contexts arising from different scenarios as the distributed-memory algorithm progresses.

Figure 6.4: Communication contexts depicting different scenarios in distributed-memory half-approx matching. If y is a vertex, then y′ is its "ghost" vertex. Case 1: u and v are mates; both send a match request, which yields a match. Case 2: u sends a match request to v, but v's mate is not u, so it rejects u's request; this results in u recalculating its mate. Case 3: after recalculating u's mate, u sends a match request to v, which yields a match. Case 4: u is already matched to v, but its neighbor x has u as its mate; u sends REJECT messages to deactivate all edges with its remaining neighbors, such that they can recalculate their mates. Case 5: u has no neighbors with which to initiate a matching request, so u sends INVALID messages to its neighbors so that they can deactivate any edges to u. Case 6: an extension to Case 5, in which x sends a REQUEST to u before u could send an INVALID; hence u sends a REJECT in response to the REQUEST, and also sends an INVALID to x. Messages can arrive in any order.

From these scenarios, we conclude that a vertex may send at most 2 messages to a "ghost" vertex (i.e., an incident vertex owned by another process), so the overall number of outgoing/incoming messages is bounded by twice the number of ghost vertices. This allows us to precompute communication buffer sizes, making it convenient for memory allocation. Based on the ghost counts, a process can stage incoming or outgoing data associated with a particular neighbor process at discrete locations in its local buffers.

6.4.3 Distributed-memory algorithm

Algorithm 8 demonstrates the top-level implementation of our distributed-memory half-approx matching. Similar to the serial Algorithm 7, the parallel algorithm also consists of two phases: in the first phase, locally owned vertices attempt to match with the unmatched vertices in their neighborhood (FINDMATE, Algorithm 9), and in the next phase, the neighbors of matched vertices in M_i are processed one by one (PROCESSNEIGHBORS, Algorithm 10). We maintain a per-process counter array called nghosts that tracks the number of "active" ghost vertices in its neighborhood (those that are still unmatched and available). As the ghost vertices are processed, the count is decremented. When the sum of the contents of the nghosts array reaches 0, the process has no active ghost vertices and is free to exit if it has no pending communication. An MPI_Allreduce operation to aggregate ghost counts may also be required for RMA and NCL to exit the iteration.

Algorithm 8: Top-level distributed-memory algorithm.
Input: Local portion of graph G_i = (V_i, E_i) in rank i.
1:  mate_v ← ∅ ∀v ∈ V_i, nghosts_i ← ∅, M_i ← ∅
2:  for v ∈ V_i do
3:    FINDMATE(v)
4:  while true do
5:    PROCESSINCOMINGDATA()
6:    v ← some vertex from M_i
7:    if owner(v) = i then
8:      PROCESSNEIGHBORS(v)
9:    if SUM(nghosts_i) = 0 then
10:     break

We use generic keywords in the algorithms (such as Push, Evoke and Process) to represent the communication/synchronization/buffering methods used across the MPI versions. Table 6.1 provides a mapping of these keywords to the actual MPI functions used by a specific implementation.

Table 6.1: Description of keywords used in algorithms

Keyword/Action                                Send-Recv     RMA                             Neighborhood Collectives
Push (mark data for imminent communication)   MPI_Isend     MPI_Put                         Insert data into send buffer.
Evoke (evoke outstanding communication)       MPI_Iprobe    MPI_Win_flush_all,              MPI_Neighbor_alltoall,
                                                            MPI_Neighbor_alltoall           MPI_Neighbor_alltoallv
Process (handle incoming data)                MPI_Recv      Check data in local MPI window. Check data in receive buffer.

The FINDMATE function, depicted in Algorithm 9, is used whenever a locally owned vertex has to choose an unmatched vertex of maximum weight (i.e., its mate) from its neighborhood. Apart from initiating matching requests (the REQUEST communication context), if there are no available vertices in the neighborhood for matching, it can also eagerly send an INVALID message to all its neighbors, such that they can deactivate the edge (Case #5 from Figure 6.4). After a vertex receives an INVALID message from a neighboring vertex, that neighbor can no longer be considered as a potential candidate. Therefore, INVALID messages are broadcast by vertices that cannot be matched, with the goal of minimizing futile matching requests.

Edge deactivation involves evicting an unavailable vertex from the neighborhood candidate set of potential mates (e.g., N′_x \ y represents evicting vertex y from the candidate set of vertex x). When the unavailable vertex is a ghost, the nghosts counter needs to be decremented as well.

Algorithm 9: FINDMATE: Find candidate mate of a vertex in rank i.
Input: Locally owned vertex x.
1:  y ← mate_x ← arg max_{y ∈ N′_x} ω_{x,y}
2:  if y ≠ ∅ then {Initiate matching request}
3:    if owner(y) = i then
4:      if mate_y = x then
5:        N′_x \ y
6:        N′_y \ x
7:        M_i ← M_i ∪ {x, y}
8:    else {y is a ghost vertex}
9:      N′_x \ y
10:     nghosts_y ← nghosts_y − 1
11:     Push(REQUEST, owner(y), {y, x})
12: else {Invalidate N′_x}
13:   for z ∈ N′_x do
14:     if owner(z) = i then
15:       N′_x \ z
16:       N′_z \ x
17:     else
18:       N′_x \ z
19:       nghosts_z ← nghosts_z − 1
20:       Push(INVALID, owner(z), {z, x})

The PROCESSNEIGHBORS function helps to mitigate potential conflicts in the neighborhood of a matched vertex. After the first phase, multiple vertices (denoted by set X) in the neighborhood of a matched vertex v may list v as a mate. However, since v is already matched and unavailable from the candidate sets, ∀x ∈ X, x∈ / M. In that case, PROCESSNEIGHBORS evokes mate recalculation for x if it is locally owned, or sends a REJECT message to the owner of x (Case# 4 of Figure 6.4).

PROCESSINCOMINGDATA ensures proper handling of incoming data from another process. Based on the received communication context, the relevant action is taken: edge deactivation, along with mate recalculation, successful matching, or rejection of a matching request.

Algorithm 10: PROCESSNEIGHBORS: Process active neighbors of matched vertex in rank i.
Input: Locally owned matched vertex v.
1:  for x ∈ N'_v do
2:      if mate_v ≠ x then
3:          if owner(x) = i then
4:              N'_v \ x
5:              N'_x \ v
6:              if mate_x = v then   {Recalculate mate_x}
7:                  FINDMATE(x)
8:          else
9:              N'_v \ x
10:             nghosts_x ← nghosts_x − 1
11:             Push(REJECT, owner(x), {x, v})

Algorithm 11: PROCESSINCOMINGDATA: Process incoming data in rank i.
1:  flag ← Evoke()
2:  if flag = true then   {received data}
3:      {x, y, ctx} ← Process incoming data
4:      if ctx.id = REQUEST then
5:          matched ← false
6:          if x ∉ M_i then
7:              N'_x \ y
8:              if mate_x = y then
9:                  M_i ← M_i ∪ {x, y}
10:                 matched ← true
11:         if !matched then   {push REJECT if match not possible}
12:             N'_x \ y
13:             nghosts_y ← nghosts_y − 1
14:             Push(REJECT, ctx.source, {y, x})
15:     else if ctx.id = REJECT then
16:         N'_x \ y
17:         nghosts_y ← nghosts_y − 1
18:         if mate_x = y then
19:             FINDMATE(x)
20:     else   {received INVALID}
21:         N'_x \ y
22:         nghosts_y ← nghosts_y − 1

6.4.4 Implementation of the distributed-memory algorithms

In this section, we discuss implementations of the distributed-memory algorithms using MPI Send-Recv, RMA and neighborhood collectives.

MPI Send-Recv implementation

The baseline Send-Recv implementation uses MPI_Isend to initiate a nonblocking send operation for communicating with a neighbor process. It uses a nonblocking probe call (i.e., MPI_Iprobe) in every iteration before receiving a message (using MPI_Recv), as there is no prior information on incoming messages due to the irregular communication. The communication context is encoded in the message tags. At present, we do not aggregate outgoing messages; therefore, PROCESSINCOMINGDATA checks for incoming messages and receives them one at a time.
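To make the probe-and-receive pattern concrete, a minimal C++ sketch follows. The tag values and the two-integer payload layout are illustrative assumptions, not the exact encoding used in our code.

// Sketch of the Iprobe/Recv pattern used by the Send-Recv version: the
// communication context is carried in the tag, and at most one message is
// processed per call, mirroring the "one at a time" behavior described above.
#include <mpi.h>

enum Context { TAG_REQUEST = 1, TAG_REJECT = 2, TAG_INVALID = 3 };

void process_incoming_data(MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;
    // Poll for any incoming message from any neighbor.
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag) {
        long data[2];  // hypothetical payload: {target vertex, requesting vertex}
        MPI_Recv(data, 2, MPI_LONG, status.MPI_SOURCE, status.MPI_TAG, comm,
                 MPI_STATUS_IGNORE);
        switch (status.MPI_TAG) {
            case TAG_REQUEST: /* try to match, or push a REJECT back      */ break;
            case TAG_REJECT:  /* deactivate the edge, recompute the mate  */ break;
            case TAG_INVALID: /* deactivate the edge, decrement nghosts   */ break;
        }
    }
}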

MPI-3 RMA implementation

In MPI RMA, a process needs to calculate the memory offset (also referred to as the target displacement) in the target process's memory in order to initiate a one-sided data transfer. Calculation of the target displacement is nontrivial and error prone. A straightforward way to calculate the remote displacement is via a counter held by every process in the communicating group. However, maintaining a distributed counter requires extra communication and relatively expensive atomic operations.

In Figure 6.1, we show an alternate way to precompute remote data ranges for a process graph neighborhood. The amount of data exchanged between two nodes of the process graph is proportional to the number of ghost vertices. A prefix sum on the number of ghosts a process shares with each of its neighbor processes allows the process to logically partition its local outgoing buffer among target processes, avoiding overlaps or conflicts. The outgoing buffer maintains the "in-flight" data: since we use nonblocking MPI_Put for communication, we need to ensure that the input buffer is not reused until the put has completed, at least locally. After performing the prefix sum over the number of ghosts, an MPI_Neighbor_alltoall within a process neighborhood informs a process of a unique offset that it can use in RMA calls targeting a particular neighbor. In addition to this scheme, each process only needs to maintain a local counter per neighboring process. There is also no way to determine the incoming data size in MPI RMA without extra communication. Hence, we issue an MPI_Neighbor_alltoall on a subset of processes to exchange outgoing data counts within the process neighborhood. This may lead to load imbalance when the process neighborhood sizes are disproportionate. For the RMA version, before the MPI window can be accessed, we have to invoke a flush synchronization call to ensure completion of the current outstanding one-sided data transfers at the origin and the remote side. This is more efficient than the Send-Recv implementation, which has to probe for a message before posting the corresponding Recv for handling incoming data. The difference in synchronization also affects the exit criteria for the second iterative phase of the approximate matching algorithm (refer to Algorithm 7). For the Send-Recv version, a local summation over the nghosts array is enough to determine completion of outstanding Send operations. However, since processes do not coordinate with each other in RMA, a process may exit the iteration and wait on a barrier while another process that depends on it is stuck in an infinite loop. To avoid such situations, we have to perform a global reduction on the nghosts array to ascertain completion.
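As an illustration of the offset precomputation described above, the following C++ sketch performs the exclusive prefix sum over per-neighbor ghost counts and exchanges the resulting base offsets over the process neighborhood. The variable and function names are illustrative; this is a sketch of the idea, not our exact implementation.

// Sketch of displacement precomputation for the RMA version. graph_comm is
// assumed to be a distributed graph communicator whose neighbors are the
// processes this rank shares ghost vertices with.
#include <mpi.h>
#include <vector>

void setup_rma_offsets(MPI_Comm graph_comm,
                       const std::vector<int>& nghosts_per_nbr, // ghosts shared with each neighbor
                       std::vector<int>& my_base_offset,        // region reserved for each neighbor locally
                       std::vector<int>& target_disp)           // displacement to use when targeting each neighbor
{
    const int nnbrs = static_cast<int>(nghosts_per_nbr.size());
    my_base_offset.assign(nnbrs, 0);
    // Exclusive prefix sum: logically partitions the buffer without overlaps.
    for (int k = 1; k < nnbrs; ++k)
        my_base_offset[k] = my_base_offset[k - 1] + nghosts_per_nbr[k - 1];
    // Each neighbor learns the unique base offset it may use in puts targeting
    // this rank, and vice versa; only a local counter per neighbor is needed afterwards.
    target_disp.assign(nnbrs, 0);
    MPI_Neighbor_alltoall(my_base_offset.data(), 1, MPI_INT,
                          target_disp.data(), 1, MPI_INT, graph_comm);
}

Subsequent MPI_Put calls toward neighbor k would then use target_disp[k] plus a per-neighbor local counter as the target displacement, followed by a flush synchronization call before the windows are inspected.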

MPI-3 neighborhood collectives implementation

We use blocking neighborhood collective operations on a distributed graph topology. Specifically, we use MPI_Neighbor_alltoall and MPI_Neighbor_alltoallv to exchange data among neighbors. The distributed graph topology is created based on the ghost vertices that are shared between processes, following our 1D vertex-based graph distribution. Unlike the MPI RMA or Send-Recv cases, where communication is initiated immediately, in this case the outgoing data is stored in a buffer for later communication. Nearest-neighborhood communication is invoked using this buffer once every iteration. The idea is to allow data aggregation before initiating collective communication. The outgoing data counts are exchanged among neighbors using MPI_Neighbor_alltoall, allowing a process to receive incoming data counts and prepare the receive buffer. The performance of neighborhood collectives on our distributed graph topology may be suboptimal in certain cases, especially since we do not make any assumptions about the underlying graph structure.

Although we discuss distributed-memory implementations using half-approx graph matching as a case study, our MPI communication substrate comprising Send-Recv, RMA and neighborhood collective routines can be applied to any graph algorithm following the owner-computes model.
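The per-iteration count-then-data exchange used by the NCL version (described above) could look roughly like the following C++ sketch. The communicator is assumed to be a distributed graph communicator (e.g., created with MPI_Dist_graph_create_adjacent from the ghost-sharing pattern); buffer and function names are illustrative.

// Sketch of the NCL exchange: counts first (MPI_Neighbor_alltoall), then the
// aggregated data (MPI_Neighbor_alltoallv).
#include <mpi.h>
#include <numeric>
#include <vector>

void exchange_neighborhood(MPI_Comm graph_comm,
                           const std::vector<std::vector<long>>& sendbufs) // one buffer per neighbor
{
    const int nnbrs = static_cast<int>(sendbufs.size());
    std::vector<int> scounts(nnbrs), rcounts(nnbrs), sdispls(nnbrs, 0), rdispls(nnbrs, 0);
    for (int k = 0; k < nnbrs; ++k)
        scounts[k] = static_cast<int>(sendbufs[k].size());

    // Exchange outgoing data counts so receivers can size their buffers.
    MPI_Neighbor_alltoall(scounts.data(), 1, MPI_INT, rcounts.data(), 1, MPI_INT, graph_comm);

    // Flatten the per-neighbor send buffers and compute displacements.
    std::vector<long> sendflat, recvflat;
    for (int k = 0; k < nnbrs; ++k) {
        sdispls[k] = static_cast<int>(sendflat.size());
        sendflat.insert(sendflat.end(), sendbufs[k].begin(), sendbufs[k].end());
    }
    for (int k = 1; k < nnbrs; ++k)
        rdispls[k] = rdispls[k - 1] + rcounts[k - 1];
    recvflat.resize(std::accumulate(rcounts.begin(), rcounts.end(), 0));

    // Blocking neighborhood all-to-all-v carrying the aggregated messages.
    MPI_Neighbor_alltoallv(sendflat.data(), scounts.data(), sdispls.data(), MPI_LONG,
                           recvflat.data(), rcounts.data(), rdispls.data(), MPI_LONG,
                           graph_comm);
    // recvflat now holds the incoming records to be processed this iteration.
}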

6.5 Experimental evaluation

In this section, we present results and observations from our experimental evaluation using a set of synthetic and real-world graphs with diverse characteristics. In the context of the neighborhood collective model, we study the impact of graph reordering using the Reverse Cuthill-McKee (RCM) algorithm [126, 45]. We also provide results from comparing the performance of our implementations with a similar implementation of half-approx matching named MatchBox-P [33]. MatchBox-P uses the MPI Send-Recv model. Since this implementation has limitations on the size and types of inputs it can process, we compare our results with MatchBox-P only for moderate-sized inputs.

6.5.1 Notations and experimental setup

Notations. We use the following descriptors in the figures and tables listed in this section to refer to the different variants of our parallel algorithm presented in Section 6.4:

• NSR: Baseline parallel version using nonblocking MPI Send-Recv.

• RMA: Uses MPI-3 RMA; internally, it also uses neighborhood collectives to exchange incoming data counts among neighbors.

• NCL: Uses blocking MPI-3 neighborhood collectives.

• MBP: Nonblocking MPI Send-Recv in MatchBox-P.

In the process graph, each process is represented by a unique vertex (node), and an edge is added between two nodes whose processes share a neighborhood (see Figure 6.1). If the degree of a node is high, it participates in many neighborhoods, resulting in a larger volume of communication. The number of edges in the process graph is represented by |Ep|, whereas the total number of edges in the input graph, including the edges connected to ghost vertices, is denoted by |E'|. The average and maximum node degrees in the process graph are represented by davg and dmax. In this context, the standard deviation applies to degrees in the process graph (denoted by σd) and to edges augmented with ghost vertices (denoted by σ|E'|). We follow these notations throughout the rest of the chapter.

Computing platform. We used the NERSC Cori supercomputer for our experimental evaluations. NERSC Cori is a 2,388-node Cray XC40 machine with dual-socket Intel Xeon E5-2698v3 (Haswell) CPUs at 2.3 GHz per node, 32 cores per node, 128 GB main memory per node, 40 MB L3 cache per socket, and the Cray XC series interconnect (Cray Aries with Dragonfly topology). We use cray-mpich/7.7.0 as our MPI implementation, and the Intel 18.0.1 compiler with the -O3 -xHost compilation options to build the codes. We use the TAU profiling tool [169] to generate point-to-point communication matrix plots.

Dataset. We summarize the datasets used for evaluation in Table 6.2. We use different types of synthetically generated graphs: random geometric graphs (RGGs); R-MAT graphs, used in the Graph500 BFS benchmark; and stochastic block partition graphs (based on the degree-corrected stochastic block model). Datasets were obtained from the SuiteSparse Matrix Collection (https://sparse.tamu.edu) and the MIT Graph Challenge website (http://graphchallenge.mit.edu/data-sets).

6.5.2 Scaling analysis and comparison with MatchBox-P

We present execution time in seconds for different inputs in this section. Data is presented in log2 scale on both the X and Y axes. We present both strong scaling and weak scaling results. We first present weak scaling performance for three classes of synthetic graphs in Figure 6.5. Our distributed-memory implementation of the random geometric graph (RGG) generator is such that any process executing matching on the subgraphs communicates with at most two neighboring processes. By restricting the neighborhood size to two, we observe 2 − 3.5× speedup on 4K-16K processes for both the NCL and RMA versions relative to NSR for the multi-billion-edge RGGs (Figure 6.5a). In Figure 6.5b, we demonstrate the weak scaling performance of moderate-sized Graph500 R-MAT graphs (with 33 − 268M edges) on 512 to 4K processes. We observe about 1.2 − 3× speedup for RMA and NCL relative to NSR.


Table 6.2: Synthetic and real-world graphs used for evaluation

Graph category | Identifier | |V| | |E|
Random geometric graphs (RGG) | d=8.56E-05 | 536.87M | 6.64B
 | d=6.12E-05 | 1.07B | 13.57B
 | d=4.37E-05 | 2.14B | 27.73B
Graph500 R-MAT | Scale 21 | 2.09M | 33.55M
 | Scale 22 | 4.19M | 67.10M
 | Scale 23 | 8.38M | 134.21M
 | Scale 24 | 16.77M | 268.43M
Stochastic block partitioned graphs | high overlap, low block sizes (HILO) | 1M | 23.7M
 | —"— | 5M | 118.7M
 | —"— | 20M | 475.16M
Protein K-mer | V2a | 55M | 117.2M
 | U1a | 67.7M | 138.8M
 | P1a | 139.3M | 297.8M
 | V1r | 214M | 465.4M
DNA | Cage15 | 5.15M | 99.19M
CFD | HV15R | 2.01M | 283.07M
Social networks | Orkut | 3M | 117.1M
 | Friendster | 65.6M | 1.8B

Figure 6.5c demonstrates contrasting behavior, where NSR performs better than NCL and RMA, using a stochastic block partitioned graph (450M edges) comprising clusters of vertices with a high degree of connectivity between them. Although NCL scales with the number of processes, NSR performs at least 1.5 − 2.7× better across 512-2K processes. In order to gain better insight into the performance of NCL, we present statistics on the process graph for different numbers of processes with this input in Table 6.3. Due to the high degree of connectivity between processes, NCL/RMA is not efficient for this input (in stark contrast to the RGG distribution, where the maximum degree is bounded). We now present performance results from the execution of real-world graphs of moderate to large sizes.

[Plot panels, y-axis: execution time (in secs), x-axis: processes (# edges) — (a) Random geometric graphs on 4K-16K processes, (b) Graph500 R-MAT graphs on 512-4K processes, (c) Stochastic block-partitioned graphs on 512-2K processes.]

Figure 6.5: Weak scaling of NSR, RMA, and NCL on synthetic graphs

Table 6.3: Graph topology statistics for stochastic block partitioned graph on 512-2K processes

p | |Ep| | dmax | davg
512 | 1.31E+05 | 511 | 511
1024 | 5.24E+05 | 1023 | 1023
2048 | 2.10E+06 | 2047 | 2047

For the Protein k-mer graphs in Figure 6.6, we observe that RMA performs about 25 − 35% better than NSR and NCL. In some cases, the performance of both RMA and NCL was 2 − 3× better than NSR. The structure of k-mer graphs consists of grids of different sizes; when the grids are densely packed, it affects the performance of neighborhood collectives.

[Plot panels: V2A (|E|=117.2M), U1A (|E|=138.8M), P1A (|E|=297.8M), V1R (|E|=465.4M); y-axis: execution time (in secs), x-axis: processes (1024, 2048, 4096).]

Figure 6.6: Strong scaling results on 1K-4K processes for different instances of Protein K-mer graphs

Strong scaling results for the two social network graphs, Friendster (1.8B edges) and Orkut (117M edges), are presented in Figure 6.7. We observe 2 − 5× speedup for NCL and RMA on 1K and 2K processes, relative to NSR. However, for both inputs, the scalability of NCL and RMA is adversely affected at larger numbers of processes. Similar to the stochastic block-partitioned graph, we observe large neighborhoods, as shown in Table 6.4, resulting in poor performance for NCL. For Friendster, the number of edges connecting ghost vertices (|E'|) increases by 4× on 4K processes, whereas for Orkut the increase between 512 and 2048 processes is 14×. Since we use blocking collective operations (for RMA/NCL), the degree distribution adversely affects the performance at scale, as compared to the nonblocking Send-Recv implementation. Consequently, we next present reordering as a potential approach to address this problem (Section 6.5.3).

[Plot panels, y-axis: execution time (in secs), x-axis: processes — (a) Friendster (1K-4K processes), (b) Orkut (512-2K processes).]

Figure 6.7: Performance of RMA and NCL on social network graphs

Table 6.4: Neighborhood graph topology statistics for Friendster and Orkut

p | |Ep| | dmax | davg | σd
Friendster on 2K/4K processes
2048 | 2.09E+06 | 2047 | 2045 | 2045.29
4096 | 8.33E+06 | 4095 | 4069 | 4069.87
Orkut on 512/2K processes
512 | 1.30E+05 | 511 | 509 | 509.03
2048 | 1.84E+06 | 2047 | 1797 | 1808.03

6.5.3 Impact of graph reordering

We define the neighborhood-distance of a vertex v as the absolute difference between the maximum and minimum vertex labels (unique integers) in the neighborhood of v. Reordering (or renumbering) the vertices such that the average neighborhood-distance of the reordered graph is minimized is known as the Graph Optimal Linear Arrangement (GOLA) problem [3]. It is also known as the Minimum Linear Arrangement (MINLA) problem [116]. Since these problems are computationally expensive (NP-complete), we explore the minimization of the maximum neighborhood-distance, which has been studied in the context of sparse linear algebra as the bandwidth minimization problem, also known to be NP-complete [149]. The objective of the bandwidth minimization problem is to permute the rows and columns of a sparse matrix such that the non-zero elements are brought as close to the diagonal as possible. For example, consider the structure of a graph represented in the form of its adjacency matrix. A given vertex v is represented by Row v and Column v in the matrix. The neighbors of v are represented by the nonzero values in Row v (and Column v) of the matrix. An undirected graph has a symmetric structure in the matrix, and a directed graph can have an unsymmetric structure. We explore bandwidth minimization using the Reverse Cuthill-McKee (RCM) algorithm, which can be implemented in linear time and is therefore a practical heuristic for large-scale inputs. The reordered and original matrices that we use in our experiments are presented in Figure 6.8.
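For reference, a compact serial C++ sketch of the Cuthill-McKee ordering (reversed at the end to obtain RCM) is shown below. It starts each connected component from a minimum-degree vertex rather than a true pseudo-peripheral vertex, which production implementations typically use; it is illustrative only.

// Cuthill-McKee BFS ordering with neighbors visited in increasing degree
// order; reversing the result yields the RCM ordering.
#include <algorithm>
#include <queue>
#include <vector>

std::vector<int> rcm_order(const std::vector<std::vector<int>>& adj)
{
    const int n = static_cast<int>(adj.size());
    std::vector<int> order;
    std::vector<bool> visited(n, false);
    // Consider start vertices in increasing degree order (one BFS per component).
    std::vector<int> by_degree(n);
    for (int v = 0; v < n; ++v) by_degree[v] = v;
    std::sort(by_degree.begin(), by_degree.end(),
              [&](int a, int b) { return adj[a].size() < adj[b].size(); });
    for (int s : by_degree) {
        if (visited[s]) continue;
        std::queue<int> q;
        q.push(s); visited[s] = true;
        while (!q.empty()) {
            int v = q.front(); q.pop();
            order.push_back(v);
            // Enqueue unvisited neighbors in increasing degree order.
            std::vector<int> nbrs;
            for (int u : adj[v]) if (!visited[u]) nbrs.push_back(u);
            std::sort(nbrs.begin(), nbrs.end(),
                      [&](int a, int b) { return adj[a].size() < adj[b].size(); });
            for (int u : nbrs) { visited[u] = true; q.push(u); }
        }
    }
    std::reverse(order.begin(), order.end());  // the "Reverse" step of RCM
    return order;
}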

[Figure panels: (a) Original Cage15, (b) Reordered Cage15, (c) Original HV15R, (d) Reordered HV15R.]

Figure 6.8: Rendering of the original graph and RCM reordered graph expressed through the adjacency matrix of the respective graphs (Cage15 and HV15R). Each non-zero entry in the matrix represents an edge between the corresponding row and column (vertices).

We observe counter-intuitive results with graph reordering, due to the simple 1D vertex-based partitioning of data in our current implementations. We observe that every process experiences an increase in overall communication volume due to an increase in the number of ghost vertices. Table 6.5 summarizes this increase by presenting the number of edges augmented by ghost vertices for a given number of partitions (i.e., |E'|), for both the original and the RCM-reordered graphs. We pick two graphs to analyze the effects of graph reordering: Cage15 is distributed over 256 processes, whereas HV15R is spread over 512 processes.

Table 6.5: Impact of reordering depicted through the number of edges augmented with the number of ghost vertices for different partitions

Graph | |V| | |E| | Original: |E'| | |E'|max | |E'|avg | σ|E'| | RCM: |E'| | |E'|max | |E'|avg | σ|E'|
Cage15 | 5.15E+06 | 9.92E+07 | 1.88E+08 | 1.29E+06 | 7.34E+05 | 1.64E+05 | 1.98E+08 | 8.77E+05 | 7.74E+05 | 8.64E+04
HV15R | 2.01E+06 | 2.83E+08 | 5.62E+08 | 1.34E+06 | 1.10E+06 | 9.29E+04 | 5.66E+08 | 1.24E+06 | 1.11E+06 | 6.36E+04

Table 6.6: Neighborhood topology of original vs RCM reordered graphs

Graphs | Original: |Ep| | dmax | davg | σd | RCM: |Ep| | dmax | davg | σd
Cage15 (p = 256) | 3572 | 58 | 27.90 | 9.40 | 7423 | 87 | 57.99 | 23.91
HV15R (p = 512) | 5100 | 43 | 19.92 | 7.38 | 14403 | 83 | 56.26 | 18.12

Overall, we observe a 1 − 5% increase in the total and average number of edges for the reordered cases due to a balanced number of edges per process. We observe that the standard deviation (σ|E'|) is decreased by 30 − 40% relative to the original graph distribution. For the same datasets, we summarize the details of the process neighborhood in Table 6.6. We observe that the average node degree (davg) of the RCM-reordered graphs is about 2× that of the original graphs, thus increasing the volume of communication on average. Consequently, NSR suffers a slowdown of 1.2 − 1.7× for reordered graphs. As illustrated in Figure 6.9, NCL exhibits a speedup of 2 − 5× compared to the baseline Send-Recv version. In Figure 6.10, we show the communication profile for the original and reordered variants of HV15R. Although RCM reduces the bandwidth, the irregular block structures along the diagonal can lead to load imbalance. We also note that the two inputs chosen for evaluation have amenable sparsity structures and do not fully benefit from reordering. However, our goal is to show the efficacy of reordering as a good heuristic for challenging datasets in the context of neighborhood collectives.

[Plot panels comparing NSR, RMA, NCL and MBP, y-axis: execution time (in secs), x-axis: Cage15, Cage15(RCM), HV15R, HV15R(RCM) — left: graphs on 1K processes, right: graphs on 2K processes.]

Figure 6.9: Comparison of original and RCM reordering on 1K/2K processes

In Figure 6.9, we also observe NSR performing 1.2 − 2× better than MBP for large graphs, whereas NCL/RMA consistently outperformed MBP by 2.5 − 7×.

[Figure panels: (a) Original HV15R, (b) Reordered HV15R.]

Figure 6.10: Communication volumes (in bytes) of the original HV15R and the RCM reordered HV15R. Black spots indicate zero communication. The vertical axis represents the sender process ids and the horizontal axis represents the receiver process ids.

6.5.4 Performance summary

We hypothesized that the one-sided (RMA) and neighborhood collective (NCL) communication models are superior alternatives to the standard point-to-point (NSR) communication model. However, our findings reveal the trade-offs associated with these versions. In this section, we discuss three aspects to summarize our work: i) performance of the MPI implementations (Table 6.7 and Figure 6.11), ii) power and memory usage (Table 6.8), and iii) implementation remarks.

Performance of MPI implementations

We use a combination of absolute (summarized in Table 6.7) and relative (illustrated in Figure 6.11) performance information to summarize the overall performance of the three communication models used to implement half-approximate matching. For each input, we list the best performing variant in terms of speedup relative to NSR, using data from 512 to 16K process runs, in Table 6.7. We capture the relative performance using the performance profile shown in Figure 6.11. We include data from 50 representative (input, #processes) combinations to build this profile. We observe that RMA consistently outperforms the other two, with NCL relatively close to RMA. NSR, on the other hand, is up to 6× slower than the other two, but is competitive for about 10% of the inputs.

Figure 6.11: Performance profiles for RMA, NCL and NSR using a subset of inputs used in the experiments. The X-axis shows the factor by which a given scheme fares relative to the best performing scheme. The Y-axis shows the fraction of problems for which this happened. The closer a curve is aligned to the Y-axis, the better its performance.

Power and memory usage

Using three moderate to large inputs on 1K processes (32 nodes), we summarize energy and memory usage in Table 6.8. Information is collected using CrayPat [52], which reports power/energy per node and average memory consumption per process for a given system configuration. In Table 6.8, we see that the average memory consumption of NCL is the lowest, about 1.03 − 2.3× less than NSR and about 9 − 27% less than RMA in each case. The overall node energy consumption of NSR is about 4× that of NCL and RMA for Friendster. The higher communication percentages of NCL and RMA relative to NSR can be attributed to the exit criteria in the second phase of the algorithm (as described in Section 6.4). For NSR, a local summation on the nghosts array is sufficient to determine the completion of outstanding Send operations.

Table 6.7: Versions yielding the best performance over the Send-Recv baseline version (run on 512-16K processes) for various input graphs.

Graph category | Identifier | Best speedup | Version
Random geometric graphs (RGG) | d=8.56E-05 | 3.5× | NCL
 | d=6.12E-05 | 2.56× | NCL
 | d=4.37E-05 | 2× | NCL
Graph500 R-MAT | Scale 21 | 2.32× | NCL
 | Scale 22 | 3× | RMA
 | Scale 23 | 3.17× | RMA
 | Scale 24 | 2× | NCL
Protein K-mer | V2a | 1.4× | RMA
 | U1a | 2.2× | RMA
 | P1a | 2.32× | RMA
 | V1r | 3.3× | RMA
DNA | Cage15 | 6× | NCL
CFD | HV15R | 4× | NCL
Social network | Orkut | 3.26× | NCL
 | Friendster | 4.45× | RMA

Table 6.8: Power/energy and memory usage on 1K processes

Ver. | Mem. (MB/proc.) | Node eng. (kJ) | Node pwr. (kW) | Comp. % | MPI % | EDP
Friendster (1.8B edges)
NSR | 977.7 | 2868.04 | 10.7 | 61.6 | 38.4 | 8.29E+08
RMA | 577.4 | 793.27 | 9.78 | 21.4 | 78.6 | 1.35E+08
NCL | 419.3 | 740.13 | 9.65 | 20.8 | 79.1 | 1.27E+08
Stochastic block-partitioned graph (475.1M edges)
NSR | 154.8 | 485.80 | 8.18 | 57.5 | 42.5 | 2.88E+07
RMA | 196.3 | 690.41 | 9.09 | 7.2 | 92.8 | 5.24E+07
NCL | 149 | 593.90 | 8.82 | 7.2 | 92.7 | 4.00E+07
HV15R (283.07M edges)
NSR | 210.2 | 154.98 | 5.95 | 13.5 | 86.4 | 4.04E+06
RMA | 116.8 | 163.97 | 6.32 | 4.6 | 95.3 | 4.25E+06
NCL | 106.9 | 140.85 | 6.07 | 3.2 | 96.7 | 3.27E+06

However, for RMA and NCL, since processes do not coordinate with each other, a process may exit the iteration and wait on a barrier while another process that depends on it is stuck in an infinite loop. To avoid such situations, we have to perform a global reduction on the nghosts array to ascertain completion, which adds to the communication volume. Performance or energy values cannot be taken in isolation; to identify an approach that provides the best trade-offs, one needs to compute the Energy-Delay Product (EDP). While more testing is needed to that effect, based on the results we observed in this chapter, we find that NCL appears to provide a reasonable trade-off between power/energy and memory usage.
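For concreteness, assuming the conventional definition of the energy-delay product (energy consumed multiplied by time to solution, as in [120]), the metric can be written as

% Energy-Delay Product: lower values indicate a better energy/performance trade-off.
\[
  \mathrm{EDP} = E_{\mathrm{node}} \times T_{\mathrm{exec}},
\]

where E_node is the total node energy consumed and T_exec is the execution time of the run.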

Implementation remarks

Based on our experience in building distributed-memory half-approx matching (which is representative of a broader class of iterative graph algorithms), we posit that the RMA version provides reasonable benefits in terms of memory usage, power/energy consumption and overall scalability. While it is possible to make the Send-Recv version optimal, handling message aggregation in irregular applications is challenging. On the other hand, neighborhood collective performance is sensitive to the graph structure, and mitigating such issues requires careful graph partitioning (Section 6.5.3), which in itself is NP-hard.

The performance differences between the NSR, NCL and RMA versions of half-approx matching can be attributed to the respective parallel implementations (discussed in Section 6.4.3) and the quality of the platform MPI implementation. Currently, NSR uses MPI_Iprobe to poll for incoming messages and MPI_Recv to retrieve probed messages one by one (see Table 6.1). In contrast, the RMA version invokes a flush operation to complete outstanding transfers, and issues a neighborhood all-to-all operation to fetch the incoming data counts. Depending on the structure and distribution of the graph, this neighborhood all-to-all operation can affect the scalability of the RMA version, as compared to NSR. An obvious improvement of NCL over the other schemes is the implicit message aggregation (discussed in Section 6.4.4). However, since the NCL implementation uses the blocking MPI neighborhood collective interface, as compared to the nonblocking interfaces used for RMA and NSR, its performance is affected at scale. Also, unlike RMA or NCL, NSR is comparatively more efficient in determining the exit criteria (discussed in Section 6.5.4), as it does not require an extra round of communication per iteration.

It is also possible that RMA is better optimized than NCL on current Cray systems. Modern HPC interconnects have hardware support for optimizing collective operations [124, 81], and ideally neighborhood collective operations can be implemented internally using MPI collectives. However, constructing intra-communicators from the underlying graph topology is not straightforward. Hence, a number of contemporary MPI implementations (such as MPICH [13] and its derivatives) at present use point-to-point Send-Recv operations to implement neighborhood collectives functionality.

6.6 Related Work

Lumsdaine et al. [129] and Hendrickson et al. [93] provide excellent overviews of the challenges in implementing graph algorithms on HPC platforms. Gregor and Lumsdaine provide an overview of their experiences with the Parallel Boost Graph Library in [83]. Buluç and Gilbert provide an alternative means to implement graph algorithms using linear algebra kernels in [28].

Thorsen et al. propose a partitioned global address space (PGAS) implementation of maximum weight matching in [178]. We discussed a set of closely related work on half-approximate matching in Section 6.4.

Besta et al. provide an extensive analysis of the role of communication direction (push or pull) in graph analytics, and use MPI RMA in implementing push or pull variants of graph algorithms [19]. Our distributed-memory half-approximate matching is based on the push model.

Kandalla et al. study the impact of nonblocking neighborhood collectives on a two-level breadth-first search (BFS) algorithm [107]. Communication patterns for the matching algorithm are not comparable with the communication patterns for BFS. Since the authors experiment only with synthetic graphs featuring small-world properties (average shortest path lengths are small), BFS converges in a few iterations and the communication properties are conducive to collective operations. However, matching displays dynamic and unpredictable communication behavior compared to BFS, as shown in Figure 6.12.

Dang et al. provide a lightweight communication runtime for supporting distributed-memory thread-based graph algorithms in the Galois graph analytics system [48, 143].

[Figure panels: (a) Matching, (b) Graph500 BFS.]

Figure 6.12: Communication volumes (in terms of bytes exchanged) of the baseline implementations of half-approximate matching and Graph500 BFS, using an R-MAT graph of 134.2M edges on 1024 processes.

They use MPI RMA (not passive target synchronization as in our work, but active target synchronization, which is more restrictive) and Send-Recv (particularly MPI_Iprobe, which we use as well), but not neighborhood collectives, in their communication runtime. The Suitor algorithm proposed by Manne and Halappanavar is currently the fastest algorithm for half-approximate matching in practice, although several algorithms have been proposed in the recent past [131]. The Suitor algorithm is closely related to the locally-dominant algorithm discussed in this chapter. Khan et al. adapted the Suitor algorithm for the b-matching problem, which is a generalization of the matching problem [114].

6.7 Chapter summary

We investigated the performance implications of designing a prototypical graph algorithm, half-approx matching, with the MPI-3 RMA and neighborhood collective models, and compared them with a baseline Send-Recv implementation. We demonstrated speedups of 1.4 − 6× (using up to 16K processes) for the RMA and neighborhood collective implementations relative to the baseline version, using a variety of synthetic and real-world graphs.

We explored the concept of graph reordering by reducing bandwidth using the Reverse Cuthill-McKee algorithm. We demonstrated the impact of reordering on communication patterns and volume, especially for the neighborhood collective model. Although we did not observe the expected benefits in our limited experiments, we believe that careful distribution of reordered graphs can lead to significant performance benefits.

We believe that the insights presented in this chapter will benefit other researchers in exploring the novel MPI-3 features for irregular applications such as graph algorithms, especially on the impending exascale architectures with massive concurrency coupled with restricted memory and power footprints.

CHAPTER 7

CONCLUSION AND FUTURE WORK

The research discussed in this dissertation paves the way for future efforts in enhancing the state of the art in one-sided communication interfaces and distributed-memory graph analytics. We briefly summarize our findings and discuss future work in this chapter.

7.1 Summary of findings

Communication is one of the critical aspects in achieving the desired performance for scientific applications running on supercomputers. As such, the choice of the underlying communication model determines the performance of an application. With the advent of many cores on modern HPC compute nodes, investigating the applicability of one-sided communication models is important to optimize communication performance at extreme scales. Chapter 3 and Chapter 4 cover the basic functionality of MPI RMA, and introduce high-level communication interfaces that provide convenient building blocks for developing distributed-memory applications.

Distributed-memory graph analytics fall under the category of irregular applications, which are challenging to optimize due to frequent accesses to elements in noncontiguous locations of memory and little arithmetic intensity, restricting the overlap of communication and computation. A classic option to improve performance is to invoke communication-avoiding optimizations. As it turns out, some graph algorithms that are implemented using approximate computing techniques or heuristics have the capability to incorporate adaptive strategies that can significantly reduce communication and computation. This enhanced efficiency also has a desirable side effect of lowering the overall memory/power footprint of an application. Graph algorithms such as clustering and matching are useful in several domains such as computational biology, data analytics and cyber security. Due to the emergence of large-scale data from real-world applications such as social networks, there is a need to quickly filter graphs with tens of billions of edges, which graph matching and clustering algorithms enable. We discuss distributed-memory implementations of graph clustering and matching in Chapter 5 and Chapter 6, respectively. Lessons learned from them can be applied to a broad class of distributed-memory graph applications.

7.2 Future work

Many scientific applications require asynchronous updates to distributed data structures, and subsequent access to a high-performance scientific computation toolkit. Therefore, apart from productive asynchronous communication abstractions, users also desire numerical analysis capabilities without having to know anything about a particular scientific computation toolkit interface. At present, the RMACXX expression interface supports elementwise operations. In order to extend the functionality of the RMACXX expression interface, we propose bindings to a distributed-memory scientific computation toolkit. Strong numeric computation capabilities will propel RMACXX to be a truly convenient interface for end users.

Since graph partitioning is an NP-hard problem, our approach so far has been to implement a simple vertex-based partitioning scheme to minimize the I/O and communication costs pertaining to distribution, at the expense of a higher load imbalance for certain inputs. However, in some cases we observe significant performance benefits for irregular vertex distributions (processes owning a variable number of vertices and all the associated edges). A specific example using the soc-friendster graph (1.8B edges), which captures the difference between a simple vertex-based distribution and a distribution that attempts to balance the number of edges owned by a process by allowing a variable number of vertices, is shown in Figure 7.1. This optimization is made possible by accessing the input graph file twice: the first time, the file is partially read to capture the edge counts associated with vertices; the second time, the data is actually read into the process-local data structures. Depending on the input graph, this extra pass can add very little overhead. As shown in Figure 7.1, the balanced distribution takes only about 2.5 seconds extra for performing file I/O and managing the distributed graph on 1K processes of NERSC Cori. The significantly lower standard deviation for the balanced distribution lends credence to the fact that the variation of the number of edges across processes is optimal.
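A sketch of how such a balanced distribution could be computed from the per-vertex edge counts gathered in the first file pass is shown below; the function name and the greedy splitting rule are illustrative assumptions, not the exact scheme used in our implementation.

// Sketch: given per-vertex edge counts from a first pass over the input file,
// assign contiguous vertex ranges to processes so that each process owns
// roughly |E|/p edges (and hence a variable number of vertices).
#include <vector>

std::vector<long long> balanced_vertex_ranges(const std::vector<long long>& edge_count,
                                              int nprocs)
{
    long long total = 0;
    for (long long c : edge_count) total += c;
    const long long target = (total + nprocs - 1) / nprocs;  // ~|E|/p edges per rank

    // range_start[r] is the first vertex owned by rank r; rank r owns
    // vertices in [range_start[r], range_start[r+1]).
    std::vector<long long> range_start(nprocs + 1, 0);
    const long long n = static_cast<long long>(edge_count.size());
    long long acc = 0;
    int rank = 1;
    for (long long v = 0; v < n && rank < nprocs; ++v) {
        acc += edge_count[v];
        if (acc >= target) {          // close the current rank's range after vertex v
            range_start[rank++] = v + 1;
            acc = 0;
        }
    }
    for (; rank <= nprocs; ++rank)    // remaining split points end at the last vertex
        range_start[rank] = n;
    return range_start;
}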

Original distribution:
Graph edge distribution characteristics
Number of vertices: 65608366
Number of edges: 3612134270
Maximum number of edges: 10931433
Average number of edges: 3.52747e+06
Expected value of X^2: 2.19218e+13
Variance: 9.47869e+12
Standard deviation: 3.07875e+06
File I/O and dist-graph creation (in s): 3.13744
Modularity: 0.615409
Iterations: 113
Time (in s): 525.798

Balanced distribution:
Graph edge distribution characteristics
Number of vertices: 65608366
Number of edges: 3612134270
Maximum number of edges: 3530338
Average number of edges: 3.52747e+06
Expected value of X^2: 1.24431e+13
Variance: 4.09031e+07
Standard deviation: 6395.55
File I/O and dist-graph creation (in s): 5.74023
Modularity: 0.616169
Iterations: 119
Time (in s): 306.217

Figure 7.1: Original vs. balanced graph edge distribution of soc-friendster for graph clustering (running the first phase only) across 1K processes of NERSC Cori.

Since graph partitioning determines the volume of communication for distributed-memory graph applications in general, switching to a balanced edge distribution can lead to reasonable performance benefits. For example, Figure 7.1 demonstrates about a 42% improvement in the performance of graph clustering (discussed in Chapter 5) when using the balanced distribution as compared to the default vertex-based distribution.

BIBLIOGRAPHY

[1] Laksono Adhianto, Sinchan Banerjee, Mike Fagan, Mark Krentel, Gabriel Marin, John Mellor-Crummey, and Nathan R Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience, 22(6):685–701, 2010.
[2] U.S. Energy Information Administration. 2017 average monthly bill - residential, 2017.
[3] D. Adolphson and T. Hu. Optimal linear ordering. SIAM Journal on Applied Mathematics, 25(3):403–423, 1973.
[4] Abdelhalim Amer et al. Locking aspects in multithreaded MPI implementations. Argonne National Lab., Tech. Rep. P6005-0516, 2016.
[5] InfiniBand Trade Association et al. InfiniBand architecture specification. http://www.infinibandta.org, 2004.
[6] David Avis. A survey of heuristics for the weighted matching problem. Networks, 13(4):475–493, 1983.
[7] J Bachan et al. UPC++ specification v1.0, draft 6. 2018.
[8] John Bachan et al. The UPC++ PGAS library for exascale computing. In Proceedings of the Second Annual PGAS Applications Workshop, page 7. ACM, 2017.
[9] David Bader, Aydın Buluç, John Gilbert, Joseph Gonzalez, Jeremy Kepner, and Timothy Mattson. The GraphBLAS effort and its implications for exascale. In SIAM Workshop on Exascale Applied Mathematics Challenges and Opportunities (EX14), 2014.
[10] David Bader and Kamesh Madduri. Design and implementation of the HPCS graph analysis benchmark on symmetric multiprocessors. In International Conference on High-Performance Computing, pages 465–476. Springer, 2005.
[11] David A Bader and Kamesh Madduri. GTgraph: A synthetic graph generator suite. Atlanta, GA, February, 2006.
[12] Seung-Hee Bae and Bill Howe. GossipMap: A distributed community detection algorithm for billion-edge directed graphs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 27. ACM, 2015.
[13] Pavan Balaji, Wesley Bland, William Gropp, Rob Latham, Huiwei Lu, Antonio J Pena, Ken Raffenetti, Sangmin Seo, Rajeev Thakur, and Junchao Zhang. MPICH user's guide. Argonne National Laboratory, 2014.
[14] Satish Balay, William Gropp, Lois Curfman McInnes, and Barry F Smith. PETSc, the portable, extensible toolkit for scientific computation. Argonne National Laboratory, 2:17, 1998.

[15] Richard F Barrett, Paul S Crozier, DW Doerfler, Michael A Heroux, Paul T Lin, HK Thornquist, TG Trucano, and Courtenay T Vaughan. Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications. Journal of Parallel and Distributed Computing, 75:107–122, 2015.

[16] Gerald Baumgartner, Alexander Auer, David E Bernholdt, Alina Bibireata, Venkatesh Chop- pella, Daniel Cociorva, Xiaoyang Gao, Robert J Harrison, So Hirata, Sriram Krishnamoor- thy, et al. Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276–292, 2005.

[17] Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, et al. Exascale computing study: Technology challenges in achieving exascale systems. Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep, 15, 2008.

[18] Claude Bernard, Michael C Ogilvie, Thomas A DeGrand, Carleton E DeTar, Steven A Got- tlieb, A Krasnitz, Robert L Sugar, and Doug Toussaint. Studying quarks and gluons on mimd parallel computers. The International Journal of Supercomputing Applications, 5(4):61–70, 1991.

[19] Maciej Besta, Michał Podstawski, Linus Groner, Edgar Solomonik, and Torsten Hoefler. To push or to pull: On reducing communication and synchronization in graph computations. In Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, pages 93–104. ACM, 2017.

[20] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.

[21] Paolo Boldi and Sebastiano Vigna. The WebGraph framework I: Compression techniques. In Proc. of the Thirteenth International World Wide Web Conference (WWW 2004), pages 595–601, Manhattan, USA, 2004. ACM Press.

[22] Erik G Boman, Doruk Bozdag, Umit V Catalyurek, Karen D Devine, Assefaw H Gebremed- hin, Paul D Hovland, Alex Pothen, et al. Combinatorial algorithms for computational sci- ence and engineering. In Journal of Physics: Conference Series, volume 125, page 5. Institute of Physics Publishing, 2008.

[23] Erik G Boman, Doruk Bozdag, Umit V Catalyurek, Karen D Devine, Assefaw H Gebremed- hin, Paul D Hovland, Alex Pothen, and Michelle Mills Strout. Enabling high performance computational science through combinatorial algorithms. In Journal of Physics: Conference Series, volume 78, page 012058. IOP Publishing, 2007.

[24] Dan Bonachea and Jason Duell. Problems with using MPI 1.1 and 2.0 as compilation targets for parallel language implementations. International Journal of High Performance Comput- ing and Networking, 1(1-3):91–99, 2004.

[25] Dan Bonachea and Paul Hargrove. GASNet specification, v1.8.1. 2017.

[26] George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Azzam Haidar, Thomas Herault, Jakub Kurzak, Julien Langou, Pierre Lemarinier, Hatem Ltaief, et al. Flex- ible development of dense linear algebra algorithms on massively parallel architectures with DPLASMA. In International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), pages 1432–1441. IEEE, 2011.

[27] Ulrik Brandes, Daniel Delling, Marco Gaertler, Robert Görke, Martin Hoefer, Zoran Nikoloski, and Dorothea Wagner. Maximizing modularity is hard. arXiv preprint physics/0608255, 2006.

[28] Aydin Buluc and John R Gilbert. The combinatorial blas: Design, implementation, and applications. Int. J. High Perform. Comput. Appl., 25(4):496–509, November 2011.

[29] Nazar Buzun, Anton Korshunov, Valeriy Avanesov, Ilya Filonenko, Ilya Kozlov, Denis Tur- dakov, and Hangkyu Kim. Egolp: Fast and distributed community detection in billion-node social networks. In Data Mining Workshop (ICDMW), 2014 IEEE International Conference on, pages 533–540. IEEE, 2014.

[30] Surendra Byna, William Gropp, Xian-He Sun, and Rajeev Thakur. Improving the per- formance of MPI derived datatypes by optimizing memory-access cost. In International Conference on Cluster Computing (CLUSTER), pages 412–419. IEEE, 2003.

[31] Christopher Cantalupo, Vishwanath Venkatesan, Jeff Hammond, Krzysztof Czurlyo, and Simon David Hammond. memkind: An extensible heap memory manager for heteroge- neous memory platforms and mixed memory policies. Technical report, Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), 2015.

[32] William W Carlson et al. Introduction to UPC and language specification. Technical report, Technical Report CCS-TR-99-157, IDA Center for Computing Sciences, 1999.

[33] Ümit V. Çatalyürek, Florin Dobrian, Assefaw Gebremedhin, Mahantesh Halappanavar, and Alex Pothen. Distributed-memory parallel algorithms for matching and coloring. In 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pages 1971–1980. IEEE, 2011.

[34] Ümit V. Çatalyürek, John Feo, Assefaw H. Gebremedhin, Mahantesh Halappanavar, and Alex Pothen. Graph coloring algorithms for multi-core and massively multithreaded architectures. Parallel Computing, 38(10):576–594, 2012.

[35] Bradford L Chamberlain, David Callahan, and Hans P Zima. Parallel programmability and the Chapel language. The International Journal of High Performance Computing Applica- tions, 21(3):291–312, 2007.

[36] Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert Van De Geijn. Collective com- munication: theory, practice, and experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783, 2007.

[37] Ernie W Chan, Marcel F Heimlich, Avi Purkayastha, and Robert A Van De Geijn. On optimizing collective communication. In Cluster Computing, 2004 IEEE International Conference on, pages 145–155. IEEE, 2004.

[38] Philippe Charles et al. X10: an object-oriented approach to non-uniform cluster computing. In Acm Sigplan Notices, volume 40, pages 519–538. ACM, 2005.

[39] Jaeyoung Choi, James Demmel, Inderjiit Dhillon, Jack Dongarra, Susan Ostrouchov, An- toine Petitet, Ken Stanley, David Walker, and R Clinton Whaley. ScaLAPACK: A portable linear algebra library for distributed memory computers — design issues and performance. In Applied Parallel Computing Computations in Physics, Chemistry and Engineering Sci- ence, pages 95–106. Springer, 1996.

[40] David Cohen, Thomas Talpey, Arkady Kanevsky, Uri Cummings, Michael Krause, Renato Recio, Diego Crupnicoff, Lloyd Dickman, and Paul Grun. Remote direct memory access over the converged enhanced ethernet fabric: Evaluating the options. In 2009 17th IEEE Symposium on High Performance Interconnects, pages 123–130. IEEE, 2009.

[41] RDMA consortium et al. Architectural specifications for rdma over tcp/ip, 2009.

[42] Michele Coscia, Fosca Giannotti, and Dino Pedreschi. A classification for community dis- covery methods in complex networks. Statistical Analysis and Data Mining: The ASA Data Science Journal, 4(5):512–546, 2011.

[43] cppreference. aggregate initialization. http://en.cppreference.com/w/cpp/ language/aggregate_initialization, 2017.

[44] cppreference. C++ compiler support. https://en.cppreference.com/w/cpp/ compiler_support, 2018.

[45] Elizabeth Cuthill. Several strategies for reducing the bandwidth of matrices. In Sparse Matrices and Their Applications, pages 157–166. Springer, 1972.

[46] Krzysztof Czarnecki et al. Generative programming and active libraries. In Generic Pro- gramming, pages 25–39. Springer, 2000.

[47] Leonardo Dagum and Ramesh Menon. Openmp: an industry standard api for shared- memory programming. IEEE computational science and engineering, 5(1):46–55, 1998.

[48] H. Dang, R. Dathathri, G. Gill, A. Brooks, N. Dryden, A. Lenharth, L. Hoang, K. Pingali, and M. Snir. A lightweight communication runtime for distributed graph analytics. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 980– 989, May 2018.

[49] Hoang-Vu Dang et al. Advanced thread synchronization for multithreaded MPI implemen- tations. In Cluster, Cloud and Grid Computing (CCGRID), 2017 17th IEEE/ACM Interna- tional Symposium on, pages 314–324. IEEE, 2017.

[50] Erik Davis and Sunder Sethuraman. Consistency of modularity clustering on random geometric graphs. arXiv preprint arXiv:1604.03993, 2016.

[51] Timothy A Davis and Yifan Hu. The university of florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS), 38(1):1, 2011.

[52] Luiz DeRose, Bill Homer, Dean Johnson, Steve Kaufmann, and Heidi Poxon. Cray perfor- mance analysis tools. In Tools for High Performance Computing, pages 191–199. Springer, 2008.

[53] Josep Díaz, Dieter Mitsche, and Xavier Pérez-Giménez. Large connectivity for dynamic random geometric graphs. IEEE Transactions on Mobile Computing, 8(6):821–835, 2009.

[54] James Dinan et al. Supporting the global arrays PGAS model using MPI one-sided com- munication. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 739–750. IEEE, 2012.

[55] Jack Dongarra. Compressed row storage. http://www.netlib.org/utk/people/ JackDongarra/etemplates/node373.html.

[56] Jack Dongarra and Michael A Heroux. Toward a new metric for ranking high performance computing systems. Sandia Report, SAND2013-4744, 312:150, 2013.

[57] Jack J Dongarra, James R Bunch, Cleve B Moler, and Gilbert W Stewart. LINPACK users’ guide, volume 8. Siam, 1979.

[58] Jack J Dongarra, Jermey Du Cruz, Sven Hammarling, and Iain S Duff. Algorithm 679: A set of level 3 basic linear algebra subprograms: model implementation and test programs. ACM Transactions on Mathematical Software (TOMS), 16(1):18–28, 1990.

[59] Jack J Dongarra, Piotr Luszczek, and Antoine Petitet. The linpack benchmark: past, present and future. Concurrency and Computation: practice and experience, 15(9):803–820, 2003.

[60] Sudip S Dosanjh, Richard F Barrett, DW Doerfler, Simon D Hammond, Karl S Hemmert, Michael A Heroux, Paul T Lin, Kevin T Pedretti, Arun F Rodrigues, TG Trucano, et al. Exascale design space exploration and co-design. Future Generation Computer Systems, 30:46–58, 2014.

[61] Doratha E Drake and Stefan Hougardy. A simple approximation algorithm for the weighted matching problem. Information Processing Letters, 85(4):211–213, 2003.

[62] TV Eicken, David E Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Mes- sages: a mechanism for integrated communication and computation. In Computer Archi- tecture, 1992. Proceedings., The 19th Annual International Symposium on, pages 256–266. IEEE, 1992.

[63] Alessandro Fanfarillo et al. OpenCoarrays: open-source transport layers supporting coarray fortran compilers. In Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models, page 4. ACM, 2014.

[64] Kyle G Felker et al. The energy band memory server algorithm for parallel Monte Carlo transport calculations. In SNA+MC 2013 - Joint International Conference on Supercomputing in Nuclear Applications + Monte Carlo, page 04207. EDP Sciences, 2014.

[65] Santo Fortunato. Community detection in graphs. Physics reports, 486(3):75–174, 2010.

[66] Santo Fortunato and Marc Barthélemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.

[67] Message Passing Interface Forum. MPI: a message-passing interface standard version 3.0. http://mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, 2012.

[68] 2017 Standard C++ Foundation. Inheritance - abstract base classes (abcs). https:// isocpp.org/wiki/faq/abcs, 2017.

[69] Python Software Foundation. multiprocessing - process-based parallelism. https:// docs.python.org/3/library/multiprocessing.html, 2001-2019.

[70] Python Software Foundation. threading - thread-based parallelism. https://docs. python.org/3/library/threading.html, 2001-2019.

[71] Hubertus Franke, Rusty Russell, and Matthew Kirkwood. Fuss, futexes and furwocks: Fast userlevel locking in linux. In AUUG Conference Proceedings, volume 85. AUUG, Inc., 2002.

[72] Harold N Gabow. An efficient implementation of edmonds’ algorithm for maximum match- ing on graphs. Journal of the ACM (JACM), 23(2):221–234, 1976.

[73] Assefaw H Gebremedhin, Erik G Boman, and Bora Ucar. 2016 Proceedings of the Seventh SIAM Workshop on Combinatorial Scientific Computing. SIAM, 2016.

[74] Robert Gerstenberger, Maciej Besta, and Torsten Hoefler. Enabling highly-scalable remote memory access programming with mpi-3 one sided. Scientific Programming, 22(2):75–91, 2014.

[75] Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, and As- sefaw H Gebremedhin. Scalable Distributed Memory Community Detection Using Vite. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–7. IEEE, 2018.

[76] Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo, Ananth Kalyanaraman, Hao Lu, Daniel Chavarria-Miranda, Arif Khan, and Assefaw Gebremedhin. Distributed Louvain Algorithm for Graph Community Detection. In 2018 IEEE International Parallel and Dis- tributed Processing Symposium (IPDPS), pages 885–895. IEEE, 2018.

[77] Sayan Ghosh, Jeff R Hammond, Antonio J Pena, Pavan Balaji, Assefaw H Gebremedhin, and Barbara Chapman. One-sided interface for matrix operations using mpi-3 rma: A case study with elemental. In 2016 45th International Conference on Parallel Processing (ICPP), pages 185–194. IEEE, 2016.

[78] Steven Gold and Anand Rangarajan. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, 1996.

[79] Herman H. Goldstine and Adele Goldstine. The electronic numerical integrator and computer (ENIAC). IEEE Annals of the History of Computing, 18(1):10–16, 1996.

[80] Benjamin H Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of mod- ularity maximization in practical contexts. Physical Review E, 81(4):046106, 2010.

[81] Richard L Graham, Steve Poole, Pavel Shamis, Gil Bloch, Noam Bloch, Hillel Chapman, Michael Kagan, Ariel Shahar, Ishai Rabinovitz, and Gilad Shainer. Connectx-2 infiniband management queues: First investigation of the new support for network offloaded collective operations. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing, pages 53–62. IEEE, 2010.

[82] Clara Granell, Sergio Gomez, and Alex Arenas. Hierarchical multiresolution method to overcome the resolution limit in complex networks. International Journal of Bifurcation and Chaos, 22(07):1250171, 2012.

[83] Douglas P. Gregor and Andrew Lumsdaine. The parallel bgl : A generic library for dis- tributed graph computations. 2005.

[84] William Gropp, Torsten Hoefler, Rajeev Thakur, and Ewing Lusk. Using advanced MPI: Modern features of the message-passing interface. MIT Press, 2014.

[85] William Gropp, Torsten Hoefler, Rajeev Thakur, and Jesper Larsson Träff. Performance expectations and guidelines for MPI derived datatypes. In Recent Advances in the Message Passing Interface, pages 150–159. Springer, 2011.

[86] William Gropp, Rajeev Thakur, and Ewing Lusk. Using MPI-2: advanced features of the message passing interface. MIT press, 1999.

[87] William D Gropp. Using node information to implement mpi cartesian topologies. In Proceedings of the 25th European MPI Users’ Group Meeting, page 18. ACM, 2018.

[88] Yanfei Guo et al. Memory compression techniques for network address management in MPI. In Parallel and Distributed Processing Symposium (IPDPS), 2017 IEEE International, pages 1008–1017. IEEE, 2017.

[89] Shinichi Habata, Kazuhiko Umezawa, Mitsuo Yokokawa, and Shigemune Kitawaki. Hard- ware system of the earth simulator. Parallel Computing, 30(12):1287–1313, 2004.

[90] Mahantesh Halappanavar. Algorithms for Vertex-weighted Matching in Graphs. PhD thesis, Norfolk, VA, USA, 2009. AAI3371496.

[91] Mahantesh Halappanavar, Hao Lu, Ananth Kalyanaraman, and Antonino Tumeo. Scalable static and dynamic community detection using grappolo. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE, pages 1–6. IEEE, 2017.

[92] Paul H Hargrove and Dan Bonachea. GASNet-EX performance improvements due to specialization for the Cray Aries network. 2018.

[93] B. Hendrickson and J. W. Berry. Graph analysis with high-performance computing. Com- puting in Science Engineering, 10(2):14–19, March 2008.

[94] Michael A Heroux, Douglas W Doerfler, Paul S Crozier, James M Willenbring, H Carter Edwards, Alan Williams, Mahesh Rajan, Eric R Keiter, Heidi K Thornquist, and Robert W Numrich. Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep. SAND2009-5574, 3, 2009.

[95] Paul N Hilfinger et al. Titanium language reference manual, version 2.19. Technical report, UC Berkeley Tech Rep. UCB/EECS-2005-15, Tech. Rep, 2005.

[96] T. Hoefler and T. Schneider. Optimization principles for collective neighborhood commu- nications. In SC ’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 1–10, Nov 2012.

[97] Torsten Hoefler, Christian Siebert, and Andrew Lumsdaine. Scalable communication proto- cols for dynamic sparse data exchange. ACM Sigplan Notices, 45(5):159–168, 2010.

[98] Torsten Hoefler and Jesper Larsson Traff. Sparse collective operations for MPI. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1–8. IEEE, 2009.

[99] Torsten Hoefler et al. The scalable process topology interface of MPI 2.2. Concurrency and Computation: Practice and Experience, 23(4):293–310, 2011.

[100] Torsten Hoefler et al. Remote memory access programming in MPI-3. ACM Transactions on Parallel Computing, 2(2):9, 2015.

[101] Jaap-Henk Hoepman. Simple distributed weighted matchings. arXiv preprint cs/0410047, 2004.

[102] John E Hopcroft and Richard M Karp. An nˆ5/2 algorithm for maximum matchings in bipartite graphs. SIAM Journal on computing, 2(4):225–231, 1973.

[103] Darko Hric, Richard K Darst, and Santo Fortunato. Community detection in networks: Structural communities versus ground truth. Physical Review E, 90(6):062805, 2014.

[104] Intel. Intel® Software Development Emulator. https://software.intel.com/en-us/articles/intel-software-development-emulator, 2016.

[105] David S Johnson. Approximation algorithms for combinatorial problems. Journal of Computer and System Sciences, 9(3):256–278, 1974.

[106] Mark T. Jones and Paul E. Plassmann. A parallel graph coloring heuristic. SIAM J. Sci. Comput., 14(3):654–669, May 1993.

[107] K. Kandalla, A. Buluç, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda. Can network-offload based non-blocking neighborhood MPI collectives improve communication overheads of irregular graph algorithms? In 2012 IEEE International Conference on Cluster Computing Workshops, pages 222–230, Sept 2012.

[108] Edward Kao, Vijay Gadepally, Michael Hurley, Michael Jones, Jeremy Kepner, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Siddharth Samsi, William Song, et al. Streaming graph challenge: Stochastic block partition. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE, pages 1–12. IEEE, 2017.

[109] Ian Karlin et al. Exploring traditional and emerging parallel programming models using a proxy application. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 919–932. IEEE, 2013.

[110] Brian Karrer and Mark EJ Newman. Stochastic blockmodels and community structure in networks. Physical Review E, 83(1):016107, 2011.

[111] George Karypis, Kirk Schloegel, and Vipin Kumar. ParMETIS: Parallel graph partitioning and sparse matrix ordering library. Version 1.0, Dept. of Computer Science, University of Minnesota, 1997.

[112] Jeremy Kepner, Peter Aaltonen, David Bader, Aydin Buluç, Franz Franchetti, John Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, et al. Mathematical foundations of the GraphBLAS. In 2016 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–9. IEEE, 2016.

[113] Jeremy Kepner, David Bader, Aydın Buluç, John Gilbert, Timothy Mattson, and Henning Meyerhenke. Graphs, matrices, and the GraphBLAS: Seven good reasons. Procedia Computer Science, 51:2453–2462, 2015.

[114] A. Khan, A. Pothen, M. M. A. Patwary, M. Halappanavar, N. R. Satish, N. Sundaram, and P. Dubey. Designing scalable b-matching algorithms on distributed memory multiprocessors by approximation. In SC ’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 773–783, Nov 2016.

[115] Robert C Kirby. A new look at expression templates for matrix computation. Computing in Science & Engineering, 5(3):66–70, 2003.

[116] Yehuda Koren and David Harel. A multi-scale algorithm for the linear arrangement problem. In Revised Papers from the 28th International Workshop on Graph-Theoretic Concepts in Computer Science, WG ’02, pages 296–309, London, UK, 2002. Springer-Verlag.

[117] David Krackhardt and Robert N Stern. Informal networks and organizational crises: An experimental simulation. Social Psychology Quarterly, pages 123–140, 1988.

[118] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

[119] Andrea Lancichinetti, Santo Fortunato, and Filippo Radicchi. Benchmark graphs for testing community detection algorithms. Physical Review E, 78(4):046110, 2008.

[120] James H Laros III, Kevin Pedretti, Suzanne M Kelly, Wei Shu, Kurt Ferreira, John Vandyke, and Courtenay Vaughan. Energy delay product. In Energy-Efficient High Performance Computing, pages 51–55. Springer, 2013.

[121] Jinpil Lee and Mitsuhisa Sato. Implementation and performance evaluation of XcalableMP: A parallel programming language for distributed memory systems. In Parallel Processing Workshops (ICPPW), 2010 39th International Conference on, pages 413–420. IEEE, 2010.

[122] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.

[123] Zhenping Li, Shihua Zhang, Rui-Sheng Wang, Xiang-Sun Zhang, and Luonan Chen. Quantitative function for community detection. Physical Review E, 77(3):036109, 2008.

[124] Jiuxing Liu, Amith R Mamidala, and Dhabaleswar K Panda. Fast and scalable MPI-level broadcast using InfiniBand’s hardware multicast support. In 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings., page 10. IEEE, 2004.

[125] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K Panda. High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming, 32(3):167–198, 2004.

[126] Wai-Hung Liu and Andrew H. Sherman. Comparative analysis of the Cuthill-McKee and the reverse Cuthill-McKee ordering algorithms for sparse matrices. SIAM Journal on Numerical Analysis, 13(2):198–213, 1976.

[127] Xing Liu, Anup Patel, and Edmond Chow. A new scalable parallel algorithm for Fock matrix construction. In 28th International Parallel and Distributed Processing Symposium, pages 902–914. IEEE, 2014.

[128] Hao Lu, Mahantesh Halappanavar, and Ananth Kalyanaraman. Parallel heuristics for scalable community detection. Parallel Computing, 47:19–37, 2015.

[129] Andrew Lumsdaine, Douglas P. Gregor, Bruce Hendrickson, and Jonathan W. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17:5–20, 2007.

[130] Piotr R Luszczek et al. The HPC challenge (HPCC) benchmark suite. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 213, 2006.

[131] F. Manne and M. Halappanavar. New effective multithreaded matching algorithms. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pages 519–528, May 2014.

[132] Fredrik Manne and Rob H Bisseling. A parallel approximation algorithm for the weighted maximum matching problem. In International Conference on Parallel Processing and Applied Mathematics, pages 708–717. Springer, 2007.

[133] Andreas Marek, Volker Blum, Rainer Johanni, Ville Havu, Bruno Lang, Thomas Auckenthaler, Alexander Heinecke, Hans-Joachim Bungartz, and Hermann Lederer. The ELPA library: scalable parallel eigenvalue solutions for electronic structure theory and computational science. Journal of Physics: Condensed Matter, 26(21):213201, 2014.

[134] John Mellor-Crummey et al. A new vision for Coarray Fortran. In Proceedings of the Third Conference on Partitioned Global Address Space Programing Models, page 5. ACM, 2009.

[135] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard Version 3.0. www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, 2012.

[136] Hans Meuer, Erich Strohmaier, Jack Dongarra, Horst Simon, and Martin Meuer. Top 500 list, 2012.

[137] S. H. Mirsadeghi, J. L. Träff, P. Balaji, and A. Afsahi. Exploiting common neighborhoods to optimize MPI neighborhood collectives. In 2017 IEEE 24th International Conference on High Performance Computing (HiPC), pages 348–357, Dec 2017.

[138] Prasenjit Mitra et al. Fast collective communication libraries, please. In Proceedings of the Intel Supercomputing Users’ Group Meeting, volume 1995, 1995.

[139] Sparsh Mittal. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR), 48(4):62, 2016.

[140] Richard C Murphy, Kyle B Wheeler, Brian W Barrett, and James A Ang. Introducing the graph 500. Cray Users Group (CUG), 19:45–74, 2010.

[141] Mark EJ Newman. Communities, modules and large-scale structure in networks. Nature Physics, 8(1):25, 2012.

[142] Mark EJ Newman and Michelle Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.

[143] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471. ACM, 2013.

[144] Jaroslaw Nieplocha et al. Global Arrays: a nonuniform memory access programming model for high-performance computers. The Journal of Supercomputing, 10(2):169–189, 1996.

[145] Robert W Numrich and John Reid. Coarrays in the next Fortran standard. In ACM SIGPLAN Fortran Forum, volume 24, pages 4–17. ACM, 2005.

[146] NVIDIA. CUDA Compute Unified Device Architecture programming guide, 2007.

[147] Suely Oliveira et al. Using graph theory to improve some algorithms in scientific computing. In NEMACOM: New Methods in Applied and Computational Mathematics Workshop, pages 33–41. Centre for Mathematics and its Applications, Mathematical Sciences Institute, 1999.

[148] Michael Ovelgönne. Distributed community detection in web-scale networks. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 66–73. ACM, 2013.

[149] Christos H. Papadimitriou. The NP-completeness of the bandwidth minimization problem. Computing, 16:263–270, 1976.

[150] Mathew Penrose et al. Random geometric graphs. Number 5. Oxford University Press, 2003.

[151] Steve Plimpton. Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1):1–19, 1995.

[152] Mason A Porter, Jukka-Pekka Onnela, and Peter J Mucha. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009.

[153] Jack Poulson, Bryan Marker, Robert A Van de Geijn, Jeff R Hammond, and Nichols A Romero. Elemental: A new framework for distributed memory dense matrix computations. ACM Transactions on Mathematical Software (TOMS), 39(2):13, 2013.

[154] Viktor Prasanna. GoFFish: Graph-oriented framework for foresight and insight using scalable heuristics. Technical report, University of Southern California, Los Angeles, 2015.

[155] Robert Preis. Linear time 1/2-approximation algorithm for maximum weighted matching in general graphs. In Annual Symposium on Theoretical Aspects of Computer Science, pages 259–269. Springer, 1999.

[156] Xinyu Que, Fabio Checconi, Fabrizio Petrini, and John A Gunnels. Scalable community detection with the Louvain algorithm. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 28–37. IEEE, 2015.

[157] Rolf Rabenseifner. Optimization of collective reduction operations. In International Conference on Computational Science, pages 1–9. Springer, 2004.

[158] Ken Raffenetti et al. Why is MPI so slow?: analyzing the fundamental limits in implementing MPI-3.1. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 62. ACM, 2017.

[159] R. Recio et al. A Remote Direct Memory Access Protocol Specification. https://tools.ietf.org/html/rfc5040.

[160] John Reid. The new features of Fortran 2018. In ACM SIGPLAN Fortran Forum, volume 37, pages 5–43. ACM, 2018.

[161] Ralf Reussner, Jesper Larsson Träff, and Gunnar Hunzelmann. A benchmark for MPI derived datatypes. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 10–17. Springer, 2000.

[162] John VW Reynders et al. POOMA: A framework for scientific simulations on parallel architectures. Parallel Programming in C++, pages 547–588, 1996.

[163] RIKEN AICS and University of Tsukuba. Omni Compiler Project. http://omni-compiler.org.

[164] Ryan Rossi and Nesreen Ahmed. The network data repository with interactive graph analytics and visualization. In AAAI, volume 15, pages 4292–4293, 2015.

[165] Ryan A. Rossi and Nesreen K. Ahmed. The network data repository with interactive graph analytics and visualization. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[166] K. Rupp. 42 years of microprocessor trend data, 2017.

[167] Gilad Shainer, Ali Ayoub, Pak Lui, Tong Liu, Michael Kagan, Christian R Trott, Greg Scantlen, and Paul S Crozier. The development of Mellanox/NVIDIA GPUDirect over InfiniBand: a new model for GPU to GPU communications. Computer Science-Research and Development, 26(3-4):267–273, 2011.

[168] John Shalf, Sudip Dosanjh, and John Morrison. Exascale computing technology challenges. In International Conference on High Performance Computing for Computational Science, pages 1–25. Springer, 2010.

[169] Sameer S Shende and Allen D Malony. The TAU parallel performance system. International Journal of High Performance Computing Applications, 20(2):287–311, 2006.

[170] Min Si, Antonio J Peña, Jeff Hammond, Pavan Balaji, and Yutaka Ishikawa. Scaling NWChem with efficient and portable asynchronous communication in MPI RMA. In 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 811–816. IEEE, 2015.

[171] Min Si et al. Casper: An asynchronous progress model for MPI RMA on many-core architectures. In Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International, pages 665–676. IEEE, 2015.

[172] Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. Knights Landing: Second-generation Intel Xeon Phi product. IEEE Micro, 36(2):34–46, 2016.

[173] John E Stone et al. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering, 12(3):66–73, 2010.

[174] Bjarne Stroustrup. Bjarne Stroustrup’s FAQ. http://www.stroustrup.com/bs_faq.html, 2016.

[175] Sayantan Sur, Hyun-Wook Jin, Lei Chai, and Dhabaleswar K Panda. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits. In Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 32–39. ACM, 2006.

[176] Monika ten Bruggencate and Duncan Roweth. DMAPP - An API for One-Sided Program Models on Baker systems. In Cray User Group Conference, 2010.

[177] Rajeev Thakur et al. Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications, 19(1):49–66, 2005.

[178] Alicia Thorsen, Phillip Merkey, and Fredrik Manne. Maximum weighted matching using the partitioned global address space model. In Proceedings of the 2009 Spring Simulation Multiconference, page 109. Society for Computer Simulation International, 2009.

[179] Vinod Tipparaju, William Gropp, Hubert Ritzdorf, Rajeev Thakur, and Jesper L Träff. Investigating high performance RMA interfaces for the MPI-3 standard. In 2009 International Conference on Parallel Processing, pages 293–300. IEEE, 2009.

[180] Jeffrey Touchman. Comparative genomics. Nature Education Knowledge, 3(10):13, 2010.

[181] Vincent A Traag, Paul Van Dooren, and Yurii Nesterov. Narrow scope for resolution-limit-free community detection. Physical Review E, 84(1):016114, 2011.

[182] Jesper Träff, Rolf Hempel, Hubert Ritzdorf, and Falk Zimmermann. Flattening on the fly: Efficient handling of MPI derived datatypes. Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 678–678, 1999.

[183] Jesper Larsson Träff. Implementing the MPI process topology mechanism. In Supercomputing, ACM/IEEE 2002 Conference, pages 28–28. IEEE, 2002.

[184] Marat Valiev, Eric J Bylaska, Niranjan Govind, Karol Kowalski, Tjerk P Straatsma, Hubertus JJ Van Dam, Dunyou Wang, Jarek Nieplocha, Edoardo Apra, Theresa L Windus, et al. NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations. Computer Physics Communications, 181(9):1477–1489, 2010.

[185] Robert A Van de Geijn. Using PLAPACK: Parallel Linear Algebra Package. MIT Press, 1997.

[186] Guido Van Rossum and Fred L Drake. Python language reference manual. 2003.

[187] Todd Veldhuizen. Expression templates. C++ Report, 7(5):26–31, 1995.

[188] Todd Veldhuizen. Blitz++ user’s guide. http://oonumerics.org/blitz, 2001.

[189] David W Walker and Jack J Dongarra. MPI: A standard message passing interface. Supercomputer, 12:56–68, 1996.

[190] Jörg Walter and Mathias Koch. The Boost uBLAS library, 2002.

[191] Charith Wickramaarachchi, Marc Frincu, Patrick Small, and Viktor K Prasanna. Fast parallel algorithm for unfolding of communities in large graphs. In High Performance Extreme Computing Conference (HPEC), 2014 IEEE, pages 1–6. IEEE, 2014.

[192] Wikipedia. Gini coefficient. https://en.wikipedia.org/wiki/Gini_coefficient.

[193] Chaoran Yang et al. Portable, MPI-interoperable Coarray Fortran. In ACM SIGPLAN Notices, volume 49, pages 81–92. ACM, 2014.

[194] Mitsuo Yokokawa, Ken’ichi Itakura, Atsuya Uno, Takashi Ishihara, and Yukio Kaneda. 16.4-Tflops direct numerical simulation of turbulence by a Fourier spectral method on the Earth Simulator. In SC’02: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 50–50. IEEE, 2002.

[195] Jianping Zeng and Hongfeng Yu. A scalable distributed Louvain algorithm for large-scale graph community detection. In 2018 IEEE International Conference on Cluster Computing (CLUSTER), pages 268–278. IEEE, 2018.
