SUPPORTING EFFICIENT GRAPH ANALYTICS AND SCIENTIFIC COMPUTATION USING ASYNCHRONOUS DISTRIBUTED-MEMORY PROGRAMMING MODELS

By

SAYAN GHOSH

A dissertation submitted in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

WASHINGTON STATE UNIVERSITY
School of Electrical Engineering and Computer Science

MAY 2019

© Copyright by SAYAN GHOSH, 2019
All Rights Reserved

To the Faculty of Washington State University:

The members of the Committee appointed to examine the dissertation of SAYAN GHOSH find it satisfactory and recommend that it be accepted.

Assefaw H. Gebremedhin, Ph.D., Chair
Carl Hauser, Ph.D.
Ananth Kalyanaraman, Ph.D.
Pavan Balaji, Ph.D.
Mahantesh Halappanavar, Ph.D.

ACKNOWLEDGEMENT

I thank my adviser, Dr. Assefaw Gebremedhin, for his generous guidance, unflagging support, and considerable enthusiasm toward my research. I greatly appreciate his efforts in always pushing me to refine my writing and narration skills, which have helped me become a better researcher and communicator. I would like to thank Dr. Jeff Hammond for introducing me to one-sided communication models, which play an important role in my thesis. I would also like to thank Dr. Barbara Chapman and Dr. Sunita Chandrasekaran for their unwavering support during my Master's studies at the University of Houston.

I am immensely fortunate to have had the opportunity to work with all of my thesis committee members. As a Teaching Assistant to Dr. Carl Hauser for the Computer Networks course, I appreciate that he encouraged me to solve the problem sets on my own, so that I could assist the students effectively. Through discussions with Dr. Pavan Balaji, I learned the importance of low-level performance analysis for comprehensive evaluation of an application. I am grateful to Drs. Mahantesh Halappanavar and Ananth Kalyanaraman for introducing me to research on graph community detection.
I sincerely believe that criticism outperforms praise. I have been lucky to have mentors who never settled for less and always pushed me to explore a bit more. I appreciate the supervision of Drs. Jeff Hammond, Pavan Balaji, Antonio Peña, and Yanfei Guo during my internships at Argonne National Laboratory. I spent over a year as an intern and an Alternate Sponsored Fellow at Pacific Northwest National Laboratory, and I would like to thank Drs. Mahantesh Halappanavar and Antonino Tumeo for engaging me with research on graph community detection. I admire every one of them for their guidance and their efforts in enhancing my knowledge.

Special thanks to the administrative staff of the Electrical Engineering and Computer Science department, the Graduate School, and the Office of International Programs at Washington State University for their commitment toward helping students.

I would like to thank my parents and in-laws for their unswerving support and deep empathy, despite the vast distance between us. Finally, I would like to recognize my wife Priyanka for her constructive criticisms, logical disagreements, unconditional love, and for sharing all the hardships of student life with magnificent flair: "Strangers on this road we are on; We are not two, we are one".

ABSTRACT

by Sayan Ghosh, Ph.D.
Washington State University
May 2019

Chair: Assefaw H. Gebremedhin

Future High Performance Computing (HPC) nodes will have many more processors than contemporary architectures do. In such massively parallel systems, it will be necessary to use all available cores to drive network performance. Hence, there is a need to explore one-sided models, which decouple communication from synchronization.
Apart from optimizing communication, it is also desirable to improve the productivity of existing one-sided models by designing convenient abstractions that alleviate the complexities of parallel application development. Classically, a majority of applications running on HPC systems have been arithmetic-intensive. However, data-driven applications are becoming more prominent, employing algorithms from areas such as graph theory, machine learning, and data mining. Most graph applications have minimal arithmetic requirements and exhibit irregular communication patterns. Therefore, it is useful to identify approximate methods that can enable communication-avoiding optimizations for graph applications, potentially at the cost of some solution quality.

The first part of this dissertation addresses the need to reduce synchronization by exploring one-sided communication models and designing convenient abstractions that serve the needs of distributed-memory scientific applications. The second part of the dissertation evaluates the impact of approximate methods and communication models on parallel graph applications.

We begin with the design and development of an asynchronous matrix communication interface that can be leveraged in parallel numerical linear algebra applications. Next, we discuss the design of a compact set of C++ abstractions over a one-sided communication model, which significantly improves developer productivity. Then, we study the challenges associated with parallelizing community detection in graphs, and develop a distributed-memory implementation that incorporates a number of approximate methods to optimize performance. Finally, we consider a half-approximation algorithm for graph matching and evaluate the implications of different communication models in its distributed-memory implementation. We also examine the effect of data reordering on performance.
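For context on the matching study, the following is a minimal serial sketch of the classic locally greedy half-approximation for weighted matching: scan edges in order of decreasing weight and match an edge whenever both endpoints are still free, which guarantees at least half the weight of an optimal matching. This is a simplified illustration only, not the dissertation's distributed-memory implementation; the function name `greedy_matching` and the edge representation are our own choices.

```cpp
#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

// Greedy half-approximation for maximum weight matching.
// Edges are (weight, u, v) tuples over vertices 0..n-1.
std::vector<std::pair<int, int>>
greedy_matching(int n, std::vector<std::tuple<double, int, int>> edges) {
    // Consider edges from heaviest to lightest.
    std::sort(edges.begin(), edges.end(),
              [](const auto& a, const auto& b) {
                  return std::get<0>(a) > std::get<0>(b);
              });
    std::vector<bool> matched(n, false);
    std::vector<std::pair<int, int>> matching;
    for (const auto& [w, u, v] : edges) {
        (void)w;
        // Match the edge only if both endpoints are still free.
        if (!matched[u] && !matched[v]) {
            matched[u] = matched[v] = true;
            matching.emplace_back(u, v);
        }
    }
    return matching;
}
```

A distributed-memory version must resolve the "both endpoints free" check across process boundaries, which is exactly where the choice of MPI communication model studied in Chapter 6 comes into play.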
In summary, this dissertation provides concrete insights into designing low-overhead, high-level interfaces over asynchronous distributed-memory models for building parallel scientific applications, and presents empirical analysis of the effect of approximate methods and communication models in deriving efficiency for irregular scientific applications, using distributed-memory graph applications as a use case.

TABLE OF CONTENTS

ACKNOWLEDGEMENT
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

CHAPTER 1: INTRODUCTION
  1.1 HARDWARE TRENDS
  1.2 POWER CONSUMPTION GOVERNS FUTURE SYSTEM DESIGN
  1.3 IRREGULAR APPLICATION CHALLENGES
  1.4 USING SPARSE LINEAR ALGEBRA FOR GRAPH APPLICATIONS
  1.5 MOTIVATION
    1.5.1 Distributed-memory applications and Message Passing Interface
    1.5.2 One-sided communication model
    1.5.3 Approximate computing techniques
    1.5.4 Summary
  1.6 CONTRIBUTIONS
  1.7 PUBLICATIONS
  1.8 DISSERTATION ORGANIZATION

CHAPTER 2: BACKGROUND ON MPI ONE-SIDED COMMUNICATION
  2.1 INTRODUCTION
  2.2 REMOTE DIRECT MEMORY ACCESS
  2.3 MEMORY MODEL
    2.3.1 Memory consistency
    2.3.2 MPI RMA memory model
  2.4 MPI-2 TO MPI-3 RMA
  2.5 CHAPTER SUMMARY

CHAPTER 3: ONE-SIDED INTERFACE FOR MATRIX OPERATIONS USING MPI: A CASE STUDY WITH ELEMENTAL
  3.1 INTRODUCTION
  3.2 ABOUT ELEMENTAL
    3.2.1 Data Distribution
    3.2.2 Elemental AXPY Interface
  3.3 BEYOND THE ELEMENTAL AXPY INTERFACE
    3.3.1 Enhancing the Performance of the Existing AXPY Interface
    3.3.2 From the AXPY Interface to the RMA Interface
  3.4 PROPOSED ONE-SIDED APIS
    3.4.1 RMAInterface
    3.4.2 Distributed Arrays Interface (EL::DA)
  3.5 EXPERIMENTAL EVALUATION
    3.5.1 Microbenchmark Evaluation
    3.5.2 Application Evaluation – GTFock
  3.6 CHAPTER SUMMARY

CHAPTER 4: RMACXX: AN EFFICIENT HIGH-LEVEL C++ INTERFACE OVER MPI-3 RMA
  4.1 INTRODUCTION
  4.2 RELATED WORK
  4.3 DESIGN PRINCIPLES OF RMACXX
    4.3.1 Window class
    4.3.2 Standard interface
    4.3.3 Expression interface
  4.4 EXPERIMENTAL EVALUATION
    4.4.1 Instruction count and latency analysis
    4.4.2 Message rate and remote atomics
    4.4.3 Application evaluations
  4.5 CHAPTER SUMMARY

CHAPTER 5: DISTRIBUTED-MEMORY PARALLEL LOUVAIN METHOD FOR GRAPH COMMUNITY DETECTION
  5.1 INTRODUCTION
  5.2 RELATED WORK
  5.3 PRELIMINARIES
    5.3.1 Modularity
    5.3.2 Serial Louvain algorithm
    5.3.3 Challenges in distributed-memory parallelization
  5.4 THE PARALLEL ALGORITHM
    5.4.1 Input distribution
    5.4.2 Overview of the parallel algorithm
  5.5 APPROXIMATE METHODS FOR PERFORMANCE OPTIMIZATION
    5.5.1 Threshold Cycling
    5.5.2 Early Termination
    5.5.3 Incomplete Coloring
  5.6 EXPERIMENTAL EVALUATION
    5.6.1 Algorithms compared
    5.6.2 Experimental platforms
    5.6.3 Test graphs
    5.6.4 Comparison on a single node
    5.6.5 Strong scaling
    5.6.6 Weak scaling
    5.6.7 Analysis of performance of the approximate computing methods/heuristics
    5.6.8 Combining approximate methods/heuristics delivers better performance
    5.6.9 Solution quality assessment
  5.7 APPLICABILITY OF THE LOUVAIN METHOD AS A BENCHMARKING TOOL FOR GRAPH ANALYTICS
    5.7.1 Characteristics of distributed-memory Louvain method
    5.7.2 Synthetic Data Generation
  5.8 ANALYSIS OF MEMORY AFFINITY, POWER CONSUMPTION, AND COMMUNICATION PRIMITIVES
    5.8.1 Evaluation on Intel Knights Landing® architecture
    5.8.2 Power, energy and memory usage
    5.8.3 Impact of MPI communication method
  5.9 ADDRESSING THE RESOLUTION LIMIT PROBLEM
  5.10 CHAPTER SUMMARY

CHAPTER 6: EXPLORING MPI COMMUNICATION MODELS FOR GRAPH APPLICATIONS USING GRAPH MATCHING AS A CASE STUDY
  6.1 INTRODUCTION
  6.2 IMPLEMENTING DISTRIBUTED-MEMORY PARALLEL GRAPH ALGORITHMS USING MPI ..