2016 IEEE International Parallel and Distributed Processing Symposium Workshops

CAFe: Coarray Fortran Extensions for Heterogeneous Computing

Craig Rasmussen∗, Matthew Sottile‡, Soren Rasmussen∗, Dan Nagle† and William Dumas∗
∗University of Oregon, Eugene, Oregon
†NCAR, Boulder, Colorado
‡Galois, Portland, Oregon

978-1-5090-3682-0/16 $31.00 © 2016 IEEE    DOI 10.1109/IPDPSW.2016.140

Abstract—Emerging hybrid accelerator architectures are often proposed for inclusion as components in an exascale machine, not only for performance reasons but also to reduce total power consumption. Unfortunately, programmers of these architectures face a daunting and steep learning curve that frequently requires learning a new language (e.g., OpenCL) or adopting a new programming model. Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides an additional level of complexity. This paper presents preliminary work examining how Fortran coarray syntax can be extended to provide simpler access to accelerator architectures. This programming model integrates the Partitioned Global Address Space (PGAS) features of Fortran with some of the more task-oriented constructs in OpenMP 4.0 and OpenACC. It also includes the potential for compiler-based transformations targeting the Open Community Runtime (OCR) environment. We demonstrate these Coarray Fortran extensions (CAFe) by implementing a multigrid Laplacian solver and transforming this high-level code to a mixture of standard coarray Fortran and OpenCL kernels.

Index Terms—distributed memory parallelism, domain specific language

I. INTRODUCTION

A goal of high-level programming languages should be to allow a programmer to develop software for a specific hardware architecture without actually exposing hardware features. In fact, allowing low-level hardware details to be controlled by the programmer is a barrier to software performance portability: the explicit use of low-level hardware details leads to code that is excessively specialized to a given target system. The path forward for high-level programming languages should be to allow a programmer to provide more semantic information about their intent, giving the compiler more freedom to choose how to instantiate this intent in a specific implementation and thus to retarget applications to new architectures.

Performance portability (or even simply portability) has been a perpetual problem in computing, especially within the high-performance computing community, in which relatively new and immature emerging technologies are constantly being tried and tested. As processor architectures hit limitations due to physical and manufacturing constraints, we are seeing a trend toward more diversity in processor designs than was experienced for the bulk of the 1990s and 2000s. Portability, especially performance portability, is becoming more challenging, as moving to a new CPU may entail more than simple tuning if the architecture is fundamentally different from previous ones.

A number of options are available to an HPC programmer beyond just using a serial language with MPI or OpenMP. New parallel languages have been under development for over a decade, for example, Chapel and the PGAS languages including Coarray Fortran (CAF). Options for heterogeneous computing include OpenACC, CUDA, CUDA Fortran, and OpenCL for programming attached accelerator devices. Each of these choices provides the programmer with abstractions over parallel systems that expose different levels of detail about the specific systems to compile to and, as such, entail different degrees of effort to design for and realize performance portability.

A reasonable solution given today's language limitations is to use MPI for distributed-memory parallelism plus OpenMP for on-chip parallelism. This hybrid approach is taken by two successful, large-scale AMR multiphysics code frameworks, FLASH and Chombo; currently, however, neither of these codes supports hardware accelerators [1]. Other choices for expressing on-chip parallelism are OpenCL [2] and NVIDIA's CUDA. Unfortunately, achieving high performance using either of these two languages can be a daunting challenge, and there are no guarantees that either language is suitable for future hardware generations.

Another potential option is to abandon the structure of message passing within a procedural programming language altogether. For example, the Open Community Runtime (OCR) project [3] provides an asynchronous task-based runtime designed for machines with high core counts [4], such as can be expected in any exascale hardware design. The project's goal is to create a framework and reference implementation to help developers explore programming methods that improve the power efficiency, programmability, and reliability of HPC applications while maintaining application performance.

It cannot be emphasized too strongly that using the OpenCL programming language and libraries (or programming for the OCR environment) presents a high hurdle for the application developer creating real scientific applications. A goal of this paper is to examine ways in which high-level syntax based on Fortran coarrays can be utilized by compiler-based tools to target programming environments like OpenCL, thus freeing the developer from the task of frequently porting the application to new, complex environments. The hope is that high-level syntax can hide the complexities of the underlying hardware and software by delegating some optimization tasks to the compiler instead of the programmer.

With the addition of coarrays in the 2008 standard, Fortran is now a parallel language. Originally called Coarray Fortran (CAF) by its authors [5], parallelism in Fortran is similar to MPI in that all cooperating CAF processes (called images) execute a Single Program but with different (Multiple) Data (SPMD). Key concepts in CAF are parallel execution, distributed memory allocation, and remote memory transfer. Parallel execution and memory allocation are straightforward, as the program is replicated NP times with each program image allocating its own memory locally. Special syntax is provided to indicate which arrays are visible between distributed-memory processes, and remote memory transfer is indicated with new array syntax using square brackets. For example,

   U(1) = U(1)[3]

copies memory from the first element of U residing on image number 3 into the corresponding element of the local array U. The fundamental programming model is similar to that provided by message-passing systems.

Since each CAF process executes the same program, there are no good avenues for exploiting task-based parallelism. Neither is there an adequate mechanism for dynamic coarray memory allocation, nor a mechanism for the remote execution of functions. In CAF, as with MPI, the programmer often combines distributed-memory processes running the same program image with a shared-memory threaded programming model using OpenMP (or OpenACC). This model breaks down when dealing with architectures that are not isomorphic to the model for which MPI and its relatives were designed – a distributed collection of homogeneous processing elements in which each processing element supports a small number of parallel threads of execution. For example, an HPC system that makes use of traditional multicore CPUs as well as GPGPU accelerators breaks this model and, to effectively program it,

II. COARRAY FORTRAN EXTENSIONS

Topics highlighted in this section are: 1. Distributed memory array allocation; 2. Explicit memory placement; 3. Remote memory transfer; and 4. Remote execution and synchronization. This description is within the context of extensions to Fortran; as shorthand, these extensions are referred to as CAFe, for Coarray Fortran extensions. Please note that throughout this paper, text appearing in all caps (such as GET_SUBIMAGE) indicates new syntax or functionality introduced by CAFe.

A. Subimages

We first introduce the important new concept of a CAFe subimage. Fortran images are a collection of distributed-memory processes that all run the same program. CAFe extends the concept of a Fortran image by allowing images to be hierarchical; by this we mean that an image may host a subimage (or several subimages). It is important to note that a subimage is not visible to Fortran images other than its hosting image. Subimages also execute differently from normal images and may even execute on different, non-homogeneous hardware, e.g., an attached accelerator device. Subimages are task based, while images all execute a Single Program but with different (Multiple) Data (SPMD). A task can be given to a subimage, but execution on the subimage terminates once the task is finished. Memory on a subimage is permanent, however, and must be explicitly allocated and deallocated.

A programmer requests a subimage by executing the new CAFe function

   device = GET_SUBIMAGE(device_id)

where the integer argument represents an attached hardware device (or perhaps a separate process or a team of threads; it is the compiler's responsibility to determine precisely what a subimage means in terms of the software and hardware environment under its control). If the function fails (e.g., the requested device is unavailable), it returns the local image number of the process that is executing the current program, obtained by a call to the Fortran 2008 function this_image(). Returning the current image if the GET_SUBIMAGE call fails,
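The subimage request and its fallback behavior can be sketched as follows. This is proposed CAFe syntax, not standard Fortran, so it will not compile with current compilers; the device number used here is purely illustrative.

```fortran
! Sketch of the proposed CAFe subimage request. GET_SUBIMAGE is the
! CAFe extension described above; this_image() is standard Fortran 2008.
integer :: device

! Ask for a subimage on attached device 1 (device id is illustrative).
device = GET_SUBIMAGE(1)

! If the device was unavailable, GET_SUBIMAGE returned this_image(),
! so any task given to `device` simply runs on the hosting image.
if (device == this_image()) then
   ! No accelerator available: fall back to host-only execution.
end if
```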