Implementation and Evaluation of Additional Parallel Features in Coarray Fortran

A Thesis Presented to the Faculty of the Department of Computer Science, University of Houston

In Partial Fulfillment of the Requirements for the Degree Master of Science

By Shiyao Ge
May 2016

APPROVED:
Barbara Chapman, Chairman, Dept. of Computer Science
Edgar Gabriel, Dept. of Computer Science
Mikhail Sekachev, TOTAL E&P Research and Technology USA, LLC
Dean, College of Natural Sciences and Mathematics

Acknowledgements

I would like to express my deep gratitude to my advisor, Prof. Barbara Chapman, for her guidance and support in my work and for her patience during my time as her student. I would also like to thank my other committee members, Prof. Edgar Gabriel and Dr. Mikhail Sekachev, for taking the time to review my work and give valuable feedback.

I give special thanks to my mentors, Deepak Eachempati and Dounia Khaldi, for their valuable and constructive suggestions during the design and development of this research work. Many thanks to Tony Curtis, who helped me a great deal in my early days in the HPCTools group, assisting with technical issues and reviewing my work.

I would also like to extend my thanks to TOTAL for funding our work. Thanks to Mikhail Sekachev, Henri Calandra, and Maxime Hugues for allowing me to access their machines to verify our work and for giving me feedback. Thanks to the Texas Advanced Computing Center (TACC) for providing computing resources.

It is an honor for me to work within the HPCTools research group. I thank Siddhartha Jana for his countless valuable comments and discussions, and Pengfei Hao for helping me resolve technical problems and for giving wise advice.

Finally, I am very grateful to my parents for their care and support all the time.
Abstract

The Fortran 2008 language standard added a feature called "coarrays" to allow parallel programming in Fortran with only minimal changes to existing sequential Fortran programs. Coarrays turn Fortran into a Partitioned Global Address Space (PGAS) language, following the Single Program, Multiple Data (SPMD) model. The next revision of the Fortran standard is expected to introduce more sophisticated coarray language features. One such feature is the "team", a way of grouping the components (images) of a parallel Fortran program. Teams can, for example, be allocated different sub-tasks. Proposed team support in the standard includes statements for forming image teams, reassigning membership of teams, and performing communication and synchronization with respect to image teams. These features are collected and discussed in the Fortran Technical Specification Draft.

In this thesis, we present the implementation and evaluation of some of these new features. The open-source OpenUH compiler, developed by this research group, is extended to support teams and collectives. We discuss two optimizations we have applied to reduce network communication and local memory footprint in the compiler's coarray runtime. Experimental results using several micro-benchmarks, one benchmark from the NAS Parallel Benchmark suite, and the High Performance Linpack suite show that the new features make the program logic more concise while achieving good performance.

Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Thesis Organization
2 Background
  2.1 PGAS model
  2.2 Fortran in HPC
  2.3 Message Passing Interface
  2.4 Coarrays in Fortran 2008
    2.4.1 Execution unit
    2.4.2 Coarray
    2.4.3 Image control statement
    2.4.4 Termination
    2.4.5 Atomic variable and subroutines
  2.5 Task decomposition in parallel program
  2.6 Survey of Coarray Fortran implementations
    2.6.1 OpenCoarrays
    2.6.2 Rice CAF 2.0
3 Infrastructure
  3.1 OpenUH compiler
  3.2 CAF runtime structure
  3.3 GASNet
    3.3.1 Active Message
4 Implementation
  4.1 Additional parallel feature syntax
  4.2 Base Team Implementation
    4.2.1 Memory Management
    4.2.2 Forming and Changing Teams
    4.2.3 Synchronization and Collective Operations
  4.3 Runtime Optimizations
    4.3.1 Runtime Data Locality Optimization
    4.3.2 Distributed member mapping list
  4.4 Incomplete parallel features
5 Results
  5.1 Experiment Setup
  5.2 Benchmarks
    5.2.1 Team Microbenchmark
    5.2.2 Reduction
    5.2.3 Using Team-based Collectives for CG
    5.2.4 HPL
6 Conclusion
  6.1 Future Work
Bibliography

List of Figures

2.1 PGAS programming model
2.2 CAF code snippet using only Fortran 2008 features
3.1 OpenUH compiler infrastructure
3.2 CAF runtime structure in OpenUH compiler
3.3 Active Message to query the image index
4.1 OpenUH Coarray Fortran team implementation
4.2 Code depicting allocation of coarrays inside teams
4.3 Evolving state of managed heap during team symmetric allocations
4.4 Flow chart of FORM TEAM function in runtime
4.5 OpenUH Coarray Fortran team structure in memory
4.6 Flow chart of mapping image index to process id in distributed mapping table
4.7 Usage of team selectors to access coarrays across team boundary
5.1 Barrier synchronization for groups of images (4096 total images), on Stampede
5.2 Comparison of team barrier between UHCAF and CAF 2.0 (1024 total images), on Stampede
5.3 Performance evaluations for the 2-level reduction algorithm using the Teams Microbenchmark suite
5.4 CG benchmark (class D) on Stampede, using 16 images per node
5.5 Performance results for HPL

List of Tables

4.1 Team data structure

Chapter 1 Introduction

Over the past few decades, several parallel programming models have been defined to help abstract the interfaces of parallel computing systems. In recent years, a programming model referred to as Partitioned Global Address Space (PGAS) has attracted much attention as a highly scalable approach for programming large-scale parallel systems. The PGAS programming model is characterized by a logically partitioned global memory space, where partitions have affinity with the processes/threads executing the program. This property allows PGAS-based applications to specify an explicit data decomposition that reduces the number of remote accesses, which have longer latencies. The model marries the performance and data locality (partitioning) features of the distributed-memory model with the programmability and data-referencing simplicity of a shared-memory (global address space) model.

Several languages and libraries follow the PGAS programming model. OpenSHMEM[10] and Global Arrays[41] are examples of library-based PGAS implementations, while Unified Parallel C (UPC)[8], Titanium[26], X10[12], Chapel[9] and Coarray Fortran (CAF)[42] are examples of PGAS-based languages. Library-based implementations assume that the programmer will use library calls to implement the correct semantics according to the programming specification. Language-based implementations, by contrast, aim to ease the burden on non-expert programmers of writing applications that use these features efficiently and achieve their performance goals. However, adoption of language-based implementations has been much slower than that of library-based implementations.
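To make the affinity property concrete, the following minimal Python sketch models (rather than implements) a partitioned global address space: a global array is block-distributed over a few images, and every read is classified as local or remote. The class name `PartitionedArray`, the even block distribution, and the access counters are illustrative assumptions; they are not part of any PGAS library or of the OpenUH runtime.

```python
class PartitionedArray:
    """Toy model of a block-distributed global array (illustrative only)."""

    def __init__(self, n_elements, n_images):
        if n_elements % n_images != 0:
            raise ValueError("this sketch assumes an even block distribution")
        self.n_images = n_images
        self.block = n_elements // n_images
        # One partition per image; together they form the global address space.
        self.partitions = [[0] * self.block for _ in range(n_images)]
        self.local_accesses = 0
        self.remote_accesses = 0

    def owner(self, global_index):
        """Image (0-based) whose partition holds this global index."""
        return global_index // self.block

    def read(self, this_image, global_index):
        """Read an element on behalf of `this_image`, counting locality."""
        owner = self.owner(global_index)
        if owner == this_image:
            self.local_accesses += 1
        else:
            self.remote_accesses += 1  # would cross the network on a real system
        return self.partitions[owner][global_index % self.block]


def sweep(array, aligned):
    """Each image reads `block` elements; an aligned sweep touches only owned data."""
    n = array.block * array.n_images
    for image in range(array.n_images):
        start = image * array.block if aligned else 0
        for offset in range(array.block):
            array.read(image, (start + offset) % n)
    return array.local_accesses, array.remote_accesses


aligned_counts = sweep(PartitionedArray(16, 4), aligned=True)
unaligned_counts = sweep(PartitionedArray(16, 4), aligned=False)
print(aligned_counts, unaligned_counts)  # (16, 0) (4, 12)
```

With the same amount of work, the decomposition that follows data ownership performs no remote reads, while the one that ignores it sends three quarters of its reads to another image's partition.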
1.1 Motivation

Fortran has long been one of the dominant languages in the HPC area. According to the National Energy Research Scientific Computing Center (NERSC), the primary scientific computing facility for the Office of Science in the U.S. Department of Energy, over half of the hours on its systems are used by Fortran codes. Fortran 2008 introduced a set of language features that support the PGAS programming model, often referred to as Coarray Fortran or Fortran Coarrays (CAF). Currently, only a few compilers support these new features in their latest releases.

Although Fortran 2008 includes a set of simple but efficient PGAS features, users demand more advanced coarray features to express more complicated parallelism in their applications. In response, the Fortran working group has identified a set of advanced features and plans to introduce them into the next language standard[14].

The HPCTools Group at the University of Houston has developed a functional compiler and runtime implementation to support the coarray features in Fortran 2008[17]. In the global environment, each process always keeps a one-to-one mapping to its cosubscripts; the team, a nested construct that represents a subset of processes, brings more complexity. Communication and synchronization must be performed with respect to teams rather than in the flat global environment. This thesis describes the implementation of these features. It also presents two optimizations we applied to reduce network communication and local memory usage in the runtime, and discusses the cases in which these optimizations give better performance.
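The extra bookkeeping that teams introduce can be illustrated with a small Python sketch. Once images are grouped into teams, an image index is only meaningful relative to a team, so it has to be translated back to an index in the parent environment before a process can be addressed. The function names `form_teams` and `to_parent_index` are hypothetical, and numbering team members in ascending order of their parent index is an assumption made for this sketch; it is not a description of how the OpenUH runtime stores or resolves the mapping.

```python
def form_teams(team_numbers):
    """Group 1-based parent image indices by the team number each image chose.

    team_numbers[i] is the team number supplied by parent image i + 1.
    Returns {team_number: [parent image indices in ascending order]}.
    """
    teams = {}
    for parent_index, number in enumerate(team_numbers, start=1):
        teams.setdefault(number, []).append(parent_index)
    return teams


def to_parent_index(teams, team_number, team_image):
    """Translate a 1-based image index inside a team to the parent's index."""
    return teams[team_number][team_image - 1]


# Eight images split into two teams: odd images pick team 1, even images team 2.
teams = form_teams([1, 2, 1, 2, 1, 2, 1, 2])
print(teams)                         # {1: [1, 3, 5, 7], 2: [2, 4, 6, 8]}
print(to_parent_index(teams, 2, 3))  # 6
```

In the flat global environment this translation is the identity, which is why it only becomes a design concern once nested subsets of images exist.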
1.2 Contributions

The contributions of this work include:

• a description of an early implementation of additional parallel processing features, including teams, collectives, and barrier operations, which are complementary to the existing Fortran coarrays model and are being developed for incorporation into the next revision of the Fortran standard
• optimization techniques in the runtime, including locality-aware optimization and distributed mapping table
• evaluation of enhanced coarray features using benchmarks to assess the usefulness of team-based synchronizations and collectives.
