Expressiveness, Programmability and Portable High Performance of Global Address Space Languages
RICE UNIVERSITY

Expressiveness, Programmability and Portable High Performance of Global Address Space Languages

by Yuri Dotsenko

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE Doctor of Philosophy

APPROVED, THESIS COMMITTEE:

Dr. John Mellor-Crummey, Chair, Associate Professor of Computer Science
Dr. Ken Kennedy, John and Ann Doerr University Professor of Computational Engineering
Dr. Peter Joseph Varman, Professor of Electrical & Computer Engineering

HOUSTON, TEXAS
JANUARY, 2007

Expressiveness, Programmability and Portable High Performance of Global Address Space Languages

Yuri Dotsenko

Abstract

The Message Passing Interface (MPI) is the library-based programming model employed by most scalable parallel applications today; however, it is not easy to use. To simplify program development, Partitioned Global Address Space (PGAS) languages have emerged as promising alternatives to MPI. Co-array Fortran (CAF), Titanium, and Unified Parallel C are explicitly parallel single-program multiple-data languages that provide the abstraction of a global shared memory and enable programmers to use one-sided communication to access remote data. This thesis focuses on evaluating PGAS languages and explores new language features to simplify the development of high-performance programs in CAF.

To simplify program development, we explore extending CAF with abstractions for group, Cartesian, and graph communication topologies that we call co-spaces. The combination of co-spaces, textual barriers, and single values enables effective analysis and optimization of CAF programs. We present an algorithm for synchronization strength reduction (SSR), which replaces textual barriers with faster point-to-point synchronization. This optimization is both difficult and error-prone for developers to perform manually. SSR-optimized versions of Jacobi iteration and the NAS MG and CG benchmarks yield performance similar to that of our best hand-optimized variants and demonstrate significant improvement over their barrier-based counterparts.

To simplify the development of codes that rely on producer-consumer communication, we explore extending CAF with multi-version variables (MVVs). MVVs increase programmer productivity by insulating application developers from the details of buffer management, communication, and synchronization. Sweep3D, NAS BT, and NAS SP codes expressed using MVVs are much simpler than the fastest hand-coded variants, and experiments show that they yield similar performance.

To avoid exposing latency in distributed memory systems, we explore extending CAF with distributed multithreading (DMT) based on the concept of function shipping. Function shipping facilitates co-locating computation with data as well as executing several asynchronous activities in both remote and local memory. DMT uses co-subroutines/co-functions to ship computation with either blocking or non-blocking semantics. A prototype implementation and experiments show that DMT simplifies the development of parallel search algorithms, and that the performance of DMT-based RandomAccess exceeds that of the reference MPI implementation.
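Everything in the abstract builds on CAF's one-sided communication model. As an illustrative sketch only (it is not taken from the thesis), the fragment below uses the CAF dialect of the period to show a co-array declaration, a one-sided PUT into a neighboring image, and the barrier synchronization that SSR aims to replace with cheaper point-to-point primitives such as sync_notify/sync_wait:

      program ring_sketch
        ! Co-array: every image holds its own copy of a; a remote copy
        ! is addressed with a bracketed co-subscript.
        integer :: a(10)[*]
        integer :: me, right

        me = this_image()
        right = me + 1                   ! right neighbor on a ring
        if (right > num_images()) right = 1

        a = me                           ! initialize the local copy
        call sync_all()                  ! barrier: all images initialized

        a(1)[right] = me                 ! one-sided PUT to the neighbor
        call sync_all()                  ! barrier before anyone reads a(1)
      end program ring_sketch

Here only neighboring images actually communicate, so replacing the barriers with notify/wait handshakes between communicating neighbors preserves correctness while avoiding global synchronization; automating exactly this transformation is what SSR does.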
Acknowledgments

I would like to thank my adviser, John Mellor-Crummey, for his inspiration, technical direction, and material support. I want to thank my other committee members, Ken Kennedy and Peter Varman, for their insightful comments and discussions. I am indebted to Tim Harvey for his help with data-flow analysis. I am grateful to Luay Nakhleh, Keith Cooper, and Daniel Chavarría-Miranda, who provided guidance and advice. I want to thank my colleague, Cristian Coarfa, for years of productive collaboration. I would like to thank Fengmei Zhao, Nathan Tallent, and Jason Eckhardt for their work on the Open64 infrastructure. Jarek Nieplocha and Vinod Tipparaju provided invaluable help with the ARMCI library. Kathy Yelick, Dan Bonachea, Parry Husbands, Christian Bell, Wei Chen, and Costin Iancu provided assistance with the GASNet library. Craig Rasmussen helped to reverse-engineer the dope vector format of several Fortran 95 vendor compilers. Collaboration with Tarek El-Ghazawi, Francois Cantonnet, Ashrujit Mohanti, and Yiyi Yao resulted in a successful joint publication.

I would like to thank the many people at Rice who helped me during my stay. Tim Harvey, Bill Scherer, and Charles Koelbel helped in revising and improving the manuscript. Daniel Chavarría-Miranda and Tim Harvey provided invaluable assistance with the qualification examination. I want to thank Robert Fowler, Yuan Zhao, Apan Qasem, Alex Grosul, Zoran Budimlic, Nathan Froyd, Arun Chauhan, Anirban Mandal, Guohua Jin, Cheryl McCosh, Rajarshi Bandyopadhyay, Anshuman Das Gupta, Todd Waterman, Mackale Joyner, Ajay Gulati, Rui Zhang, and John Garvin.

This dissertation is dedicated to my family: my dear wife Sofia, my son Nikolai, my sister, and my parents, for their never-ending love, infinite patience, and support. Sofia also contributed to this work by drawing some of the illustrations.

Contents

Abstract
Acknowledgments
List of Illustrations
List of Tables

1 Introduction
  1.1 Thesis overview
  1.2 Contributions of joint work
  1.3 Research contributions
    1.3.1 CAF communication topologies – co-spaces
    1.3.2 Synchronization strength reduction
    1.3.3 Multi-version variables
    1.3.4 Distributed multithreading
  1.4 Thesis outline

2 Related Work
  2.1 CAF compilers
  2.2 Data-parallel and task-parallel languages
    2.2.1 High Performance Fortran
    2.2.2 OpenMP
    2.2.3 UC
    2.2.4 Compiler-based parallelization
  2.3 PGAS programming models
    2.3.1 Unified Parallel C
    2.3.2 Titanium
    2.3.3 Barrier synchronization analysis and optimization
  2.4 Message-passing and RPC-based programming models
    2.4.1 Message Passing Interface
    2.4.2 Message passing in languages
  2.5 Concurrency in imperative languages
    2.5.1 Single-Assignment C
    2.5.2 Data-flow and stream-based languages
    2.5.3 Clocked final model
  2.6 Function shipping
    2.6.1 Remote procedure calls
    2.6.2 Active Messages
    2.6.3 Multilisp
    2.6.4 Cilk
    2.6.5 Java Remote Method Invocation

3 Background
  3.1 Co-array Fortran
    3.1.1 Co-arrays
    3.1.2 Accessing co-arrays
    3.1.3 Allocatable and pointer co-array components
    3.1.4 Procedure calls
    3.1.5 Synchronization
    3.1.6 CAF memory consistency model
  3.2 Communication support for PGAS languages
    3.2.1 ARMCI
    3.2.2 GASNet
  3.3 Experimental platforms
    3.3.1 Itanium2 + Myrinet 2000 cluster (RTC)
    3.3.2 Itanium2 + Quadrics cluster (MPP2)
    3.3.3 Alpha + Quadrics cluster (Lemieux)
    3.3.4 Altix 3000 (Altix1)
    3.3.5 SGI Origin 2000 (MAPY)
  3.4 Parallel benchmarks and applications
    3.4.1 NAS Parallel Benchmarks
    3.4.2 Sweep3D
    3.4.3 RandomAccess
    3.4.4 Data-flow analysis

4 Co-array Fortran for Distributed Memory Platforms
  4.1 Rice Co-array Fortran compiler — cafc
    4.1.1 Memory management
    4.1.2 Co-array descriptors and local co-array accesses
    4.1.3 Co-array parameters
    4.1.4 COMMON and SAVE co-arrays
    4.1.5 Procedure splitting
    4.1.6 Multiple co-dimensions
    4.1.7 Intrinsic functions
    4.1.8 Communication code generation
    4.1.9 Allocatable and pointer co-array components
  4.2 Experimental evaluation
    4.2.1 Co-array representation and local accesses
    4.2.2 Communication efficiency
    4.2.3 Cluster architectures
    4.2.4 Point-to-point vs. barrier-based synchronization
    4.2.5 Improving synchronization via buffering
    4.2.6 Performance evaluation of CAF and UPC

5 Co-spaces: Communication Topologies for CAF
  5.1 Communication topologies in CAF
  5.2 Co-space types
    5.2.1 Group
    5.2.2 Cartesian
    5.2.3 Graph
  5.3 Co-space usage examples
  5.4 Properties of co-space neighbor functions
  5.5 Implementation

6 Analyzing CAF Programs
  6.1 Difficulty of analyzing CAF programs
  6.2 Language enhancements
    6.2.1 Textual group barriers
    6.2.2 Group single values
  6.3 Inference of group single values and group-executable statements
    6.3.1 Algorithm applicability
    6.3.2 Forward propagation inference algorithm
  6.4 Analysis of communication structure
    6.4.1 Analyzable group-executable PUT/GET
    6.4.2 Analyzable non-group-executable PUT/GET
    6.4.3 Other analyzable communication patterns

7 Synchronization Strength Reduction
  7.1 Motivation
  7.2 Intuition behind SSR
    7.2.1 Correctness of SSR for analyzable group-executable PUTs/GETs
    7.2.2 Correctness of SSR for analyzable non-group-executable PUTs/GETs