2016 IEEE International Parallel and Distributed Processing Symposium Workshops

CAFe: Coarray Extensions for Heterogeneous Computing

Craig Rasmussen∗, Matthew Sottile‡, Soren Rasmussen∗, Dan Nagle†, and William Dumas∗
∗University of Oregon, Eugene, Oregon
†NCAR, Boulder, Colorado
‡Galois, Portland, Oregon

Abstract—Emerging hybrid accelerator architectures are often proposed for inclusion as components in an exascale machine, not only for performance reasons but also to reduce total power consumption. Unfortunately, programmers of these architectures face a daunting and steep learning curve that frequently requires learning a new language (e.g., OpenCL) or adopting a new programming model. Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides an additional level of complexity. This paper presents preliminary work examining how Fortran coarray syntax can be extended to provide simpler access to accelerator architectures. This programming model integrates the Partitioned Global Address Space (PGAS) features of Fortran with some of the more task-oriented constructs in OpenMP 4.0 and OpenACC. It also includes the potential for compiler-based transformations targeting the Open Community Runtime (OCR) environment. We demonstrate these Coarray Fortran extensions (CAFe) by implementing a multigrid Laplacian solver and transforming this high-level code to a mixture of standard coarray Fortran and OpenCL kernels.

Index Terms—distributed memory parallelism, domain specific language

I. INTRODUCTION

A goal of high-level programming languages should be to allow a programmer to develop software for a specific hardware architecture, without actually exposing hardware features. In fact, allowing low-level hardware details to be controlled by the programmer is actually a barrier to software performance portability: the explicit use of low-level hardware details leads to code becoming excessively specialized to a given target system. The path forward for high-level programming languages should be to allow a programmer to provide more semantic information about their intent, thus allowing the compiler more freedom to choose how to instantiate this intent in a specific implementation in order to retarget applications to new architectures.

Performance portability (or, even simply portability) has been a perpetual problem in computing, especially within the high performance computing community in which relatively new and immature emerging technologies are constantly being tried and tested. As processor architectures hit limitations due to physical and manufacturing constraints, we are seeing a trend towards more diversity in processor designs than was experienced for the bulk of the 1990s and 2000s. Portability, especially performance portability, is becoming more challenging as moving to a new CPU may entail more than simple tuning if the architecture is fundamentally different than previous ones.

A number of options are available to an HPC programmer beyond just using a serial language with MPI or OpenMP. New parallel languages have been under development for over a decade, for example, Chapel and the PGAS languages including Coarray Fortran (CAF). Options for heterogeneous computing include usage of OpenACC, CUDA, CUDA Fortran, and OpenCL for programming attached accelerator devices. Each of these choices provides the programmer with abstractions over parallel systems that expose different levels of detail about the specific systems to compile to, and as such, entail different degrees of effort to design for and realize performance portability.

A reasonable solution given today's language limitations is to use MPI for distributed memory parallelism plus OpenMP for on-chip parallelism. This hybrid approach is taken by two successful, large-scale AMR multiphysics code frameworks, FLASH and Chombo; currently, however, neither of these codes supports hardware accelerators[1]. Other choices for expressing on-chip parallelism are OpenCL [2] and NVIDIA's CUDA. Unfortunately, achieving high performance using either of these two languages can be a daunting challenge and there are no guarantees that either language is suitable for future hardware generations.

Another potential option is to abandon the structure of message passing within a language altogether. For example, the Open Community Runtime (OCR) project[3] provides an asynchronous task-based runtime designed for machines with high core counts[4], such as can be expected in any exascale hardware design. The project's goal is to create a framework and reference implementation to help developers explore programming methods to improve the power efficiency, programmability, and reliability of HPC applications while maintaining application performance.

It cannot be emphasized too strongly that using the OpenCL programming language and libraries (or programming for the OCR environment) presents a high hurdle for the application developer creating real scientific applications. A goal of this paper is to examine ways in which high-level syntax based on Fortran coarrays can be utilized by compiler-based tools to target programming environments like OpenCL, thus freeing the developer from the task of frequently porting the application to new complex environments. The hope is that high-level syntax can hide the complexities of the underlying hardware and software by delegating some optimization tasks to the compiler instead of the programmer.

With the addition of coarrays in the 2008 standard, Fortran is now a parallel language. Originally called Coarray Fortran (CAF) by its authors[5], parallelism in Fortran is similar to MPI in that all cooperating CAF processes (called images) execute a Single Program but with different (Multiple) Data (SPMD). Key concepts in CAF are parallel execution, distributed memory allocation, and remote memory transfer. Parallel execution and memory allocation are straightforward as each program is replicated NP times with each program image allocating its own memory locally. Special syntax is provided to indicate which arrays are visible between distributed memory processes, and remote memory transfer is indicated with new array syntax using square brackets; for example,

U(1) = U(1)[3]

copies memory from the first element of U residing on image number 3 into the corresponding element of the local array U. The fundamental programming model is similar to that provided by message passing systems.
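For readers unfamiliar with coarrays, the following small program is a complete, standard Fortran 2008 illustration of the SPMD model and square-bracket transfer just described (our example, not taken from the paper); each image initializes its own copy of U and then reads one element from its right-hand neighbor:

program halo_example
   implicit none
   real    :: U(8)[*]               ! coarray: one copy of U per image
   integer :: me, np, right

   me = this_image()
   np = num_images()
   right = merge(1, me+1, me == np) ! periodic right-hand neighbor

   U = real(me)                     ! local initialization on every image
   sync all                         ! ensure all neighbors have initialized

   U(1) = U(8)[right]               ! remote read from the neighbor image
   sync all
end program halo_example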
Since each CAF image executes the same program, there are no good avenues for exploiting task-based parallelism. Neither is there an adequate mechanism for dynamic coarray memory allocation, nor a mechanism for the remote execution of functions. In CAF, like MPI, the programmer often combines distributed memory processes running the same program image with a threaded programming model using OpenMP (or OpenACC). This model breaks down when dealing with architectures that are not isomorphic to the model for which MPI and relatives were designed – a distributed collection of homogeneous processing elements in which each processing element supports a small number of parallel threads of execution. For example, an HPC system that makes use of traditional multicore CPUs as well as GPGPU accelerators breaks this model and, to effectively program it, currently requires a mixture of programming systems in a single application.

In this paper we propose CAFe: Coarray Fortran Extensions. These extensions to CAF introduce relatively minor additions to the existing coarray portion of the Fortran language. They include the concept of subimages in order to nest coarray images, semantics for distributed array declarations, memory placement, remote memory transfer, and remote task execution and synchronization. CAFe builds upon the existing standardized coarray features of Fortran. We will introduce the proposed CAFe extensions, define their semantics, demonstrate their use in a simple example, and provide experimental results demonstrating the relative performance of computation and communication in a prototype implementation executing in a heterogeneous environment. CAFe is complementary to previous work extending coarray Fortran[6], [7].

II. COARRAY FORTRAN EXTENSIONS

Topics highlighted in this section are: 1. Distributed memory array allocation; 2. Explicit memory placement; 3. Remote memory transfer; and 4. Remote execution and synchronization. This description is within the context of extensions to Fortran; as shorthand, these extensions are referred to as CAFe, for Coarray Fortran extensions. Please note that throughout this paper, text appearing in all caps (such as GET_SUBIMAGE) indicates new syntax or functionality extended by CAFe.

A. Subimages

We first introduce the important new concept of a CAFe subimage. Fortran images are a collection of distributed memory processes that all run the same program. CAFe extends the concept of a Fortran image by allowing images to be hierarchical; by this we mean that an image may host a subimage (or several subimages). It is important to note that a subimage is not visible to Fortran images other than its hosting image. Subimages also execute differently than normal images and may even execute on different non-homogeneous hardware, e.g., an attached accelerator device. Subimages are task based while images all execute a Single Program but with different (Multiple) Data (SPMD). A task can be given to a subimage, but execution on the subimage terminates once the task is finished. Memory on a subimage is permanent, however, and must be explicitly allocated and deallocated.

A programmer requests a subimage by executing the new CAFe function,

device = GET_SUBIMAGE(device_id)

where the integer argument represents an attached hardware device (or perhaps a separate process or a team of threads; it is the implementation's responsibility to determine precisely what a subimage means in terms of the software and hardware environment under its control). If the function fails (e.g., the requested device is unavailable) it returns the local image number of the process that is executing the current program, obtained by a call to the Fortran 2008 function this_image(). Returning the current image if the GET_SUBIMAGE call fails allows program execution to proceed correctly even if there are no attached devices.

Once a variable representing a subimage has been obtained, it can be used within a program to allocate memory, transfer memory, and execute tasks. This functionality is described below.
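As a concrete illustration, a minimal sketch of this pattern follows (our construction; GET_SUBIMAGE is CAFe syntax and will not compile with a standard Fortran compiler, and devices are assumed to be numbered from 1):

program subimage_example
   implicit none
   integer :: device

   ! Request the first attached device; on failure GET_SUBIMAGE
   ! returns this_image(), so the test below distinguishes the cases.
   device = GET_SUBIMAGE(1)

   if (device /= this_image()) then
      ! ... allocate subimage memory and launch tasks here ...
   else
      ! No device available: tasks simply execute on the hosting image.
   end if
end program subimage_example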

B. Distributed Memory Array Declaration

In Fortran, variables that are visible to other program images must be declared with the codimension attribute, for example,

real, allocatable :: U(:,:)[:], V(:,:)[:]

declares that coarrays U and V are allocatable with a rank of two and a corank of one. In Fortran, square brackets [] are used to indicate operations (possibly expensive) on distributed memory where parentheses () are used to select specific array elements; square brackets, on the other hand, denote the memory location of the array or array elements selected.

When declared thusly, coarrays must subsequently be allocated on all program images. With CAFe, this declaration also allows memory to be allocated on a subimage; however, memory allocation on a subimage is not required unless the coarray is specifically used by some task that will be executed on the subimage.

C. Explicit Memory Placement

As discussed, CAFe allows coarray memory to be conditionally allocated on a subimage. It should be allocated conditionally because the requested subimage in a GET_SUBIMAGE call may not be available. In addition, not all coarrays need be allocated on each subimage. For example, the following code segment allocates memory for coarrays U and V on all images, but only coarray U is allocated on the subimage specified by the variable device:

allocate(U(M,N)[*], V(M,N)[*])
if (device /= this_image()) then
   allocate(U(M,N)[*]) [[device]]
end if

where M and N are constant parameters. The first allocate statement allocates coarrays U and V on all program images. The * symbol in the allocation is required for the last corank dimension in CAF (in this case there is only one) to allow for the number of program images to be a runtime parameter, which can be ascertained with a num_images() function call. If device is available, its numeric value will be different from this_image() and thus coarray U will also be allocated on the device. Note that placement of the memory is specified by the use of double square bracket notation [[ ]]. Like single square brackets, double square brackets indicate something special (and possibly expensive) is to occur, in this case possibly remote memory allocation.

Deallocation of subimage memory is similar to allocation,

if (device /= this_image()) then
   deallocate(U) [[device]]
end if

D. Remote Memory Transfer

Once memory is allocated on the device, it can be initialized by copying memory from the hosting image to the device. This is explicitly done with normal CAF syntax. For example,

U[device] = 3.14

assigns 3.14 to all of the array elements of U located on device. Specific array elements can also be transferred,

U(1,:) = U(1,:)[device]
U(M,:) = U(M,:)[device]

where the first and last rows of U are copied from the device to the hosting image.
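Putting the declaration, placement, and transfer pieces of Sections II-B through II-D together, the full lifetime of a device-resident coarray might look like the following sketch (again our construction in CAFe syntax, with illustrative sizes; it will not compile with a standard Fortran compiler):

program device_memory_example
   implicit none
   real, allocatable :: U(:,:)[:]
   integer, parameter :: M = 64, N = 64
   integer :: device

   device = GET_SUBIMAGE(1)

   allocate(U(M,N)[*])                 ! allocate on every image
   if (device /= this_image()) then
      allocate(U(M,N)[*]) [[device]]   ! and also on the subimage
   end if

   U = 0.0                             ! initialize the host copy
   U[device] = 0.0                     ! initialize the device copy

   ! ... execute tasks on the subimage here ...

   U(1,:) = U(1,:)[device]             ! copy the first row back to the host

   if (device /= this_image()) then
      deallocate(U) [[device]]
   end if
   deallocate(U)
end program device_memory_example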
CAFe permits memory transfer only between a subimage and its hosting image; transferring memory to a subimage hosted by another image is not allowed. In addition, memory transfer can only be initiated by the hosting image. Thus code executing on a subimage cannot use coarray notation to select memory on another subimage nor on its host.

E. Remote Execution

The final CAFe concept introduced is that of remote execution. As discussed above, all CAF images (similar to MPI ranks) execute the same program. The only way for an image to execute something different is to conditionally execute a block of code based on explicitly checking that an executing image's rank (this_image()) meets some established criterion. There currently exists no mechanism for one image to execute a block of code or a procedure on another image.

However, CAFe allows images to execute tasks on hosted subimages using standard procedure calls. For example,

call relax(U[device]) [[device]]

executes the subroutine relax, on the subimage device, using coarray U located on the same subimage. This follows the same pattern as introduced for remote coarray allocation, whereby the double square bracket notation indicates where the function will execute. Functions to be executed on a subimage must be pure procedures (i.e., declared with the Fortran keyword pure) and scalar formal parameters must be passed by value (i.e., declared with the value attribute).

CAFe expands on CAF execution semantics by providing a task-based mechanism, where tasks are defined in terms of pure Fortran procedures. Since the procedures must be declared as pure, they cannot perform any I/O or call any impure procedures; nor can they initiate any communication between the subimage and the hosting image.

In addition, CAFe subimage tasks may also be defined by the body of a Fortran do concurrent construct and executed on a subimage. For example, the code segment,

do concurrent (i = 1:N-1) [[device]]
   block
      real :: w = 0.67
      S(i) = (1-w)*T(i) + w*(T(i-1) + T(i+1))/2
   end block
end do

will execute all of the code within the do concurrent construct on device. Execution of a do concurrent loop on a subimage acts as if the loop were extracted as an outlined function and the subsequent task executed on a subimage. Thus there is an implicit barrier at the end of the loop that blocks the hosting image from continuing execution until all of the iterates of the loop have finished.

In Fortran, the execution semantics of a do concurrent loop is that loop iterates can be executed concurrently in any order. Note the similarities between this example and the OpenMP pragma !$omp parallel do, private(w). The use of the Fortran block construct to declare private variables allows a natural and clear way to distinguish between variables that are private to a “thread” (although in the example above, the “private” distinction for the variable w is unnecessary because it is a constant over all loop iterations).
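For concreteness, the corresponding standard Fortran plus OpenMP version of this loop might be written as follows (our gloss of the correspondence noted above; the array bounds and halo layout are assumed for illustration):

subroutine relax_omp(n, S, T)
   implicit none
   integer, intent(in) :: n
   real, intent(out) :: S(0:n)
   real, intent(in)  :: T(-1:n+1)   ! assumed halo layout
   real    :: w
   integer :: i
   !$omp parallel do private(w)
   do i = 1, n-1
      w = 0.67
      S(i) = (1-w)*T(i) + w*(T(i-1) + T(i+1))/2
   end do
   !$omp end parallel do
end subroutine relax_omp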

The comparison with OpenMP also suggests that subimage execution may not be restricted to a single hardware execution unit, but may indeed be executed by a team of threads in a single (shared) memory address space. This association with a team of threads will be exploited below when considering a team of subimages.

F. Integration with Standard Coarray Features

The new CAFe features introduced above work in concert with the existing 2008 Fortran coarray constructs and with the new parallel features to appear in the next Fortran standard[8] (referred to here as Fortran 2015). This section describes in more detail how the proposed extensions of CAFe interact with the parallel language features of Fortran 2015.

1) Fortran 2008 Standard: Perhaps the most important of the 2008 parallel constructs are those involving the control of image execution, including sync all and sync images statements. A sync all statement implements a barrier synchronizing all CAF images while a sync images statement synchronizes a specified set of images. Neither of these synchronization primitives affects tasks potentially executing concurrently on a subimage. Once subimage tasks begin execution they run to completion from the perspective of the hosting image.

Other features of 2008 include the critical construct and the lock and unlock statements. The critical construct limits the execution of a block of code to one image at a time, while lock variables provide a facility for atomic operations. Both of these features are defined in terms of normal CAF images so subimages do not directly participate unless the parent (or host) of a subimage does.

2) Fortran 2015 Standard: The most important concept introduced by the 2015 standard is that of a team of images. Subimages may not be included within a team of standard CAF images. However, if one considers the comparison of the execution of a do concurrent block on a subimage with a team of OpenMP threads (as described above), the subimage could be considered as a team of one member that may be executing its tasks on a team of shared memory threads. This allows the use of a sync team statement to synchronize with possibly executing threads.

Thus we introduce another CAFe intrinsic function, GET_SUBIMAGE_TEAM(), that returns a variable representing a fully-formed (established) team consisting of only one team member, the current subimage. If this function is called outside the context of a task executing on a subimage it will return a team consisting of only the current image, this_image(). Synchronization across possible subimage threads is then possible using the statement

sync team (GET_SUBIMAGE_TEAM())

This allows implementation of the technique of double buffering within a subimage task so that all of the potential subimage threads will have completed operating on the temporary buffer before it is copied back to the primary array variable.
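A sketch of this double-buffering pattern is shown below; it is our construction, and it assumes that CAFe permits the sync team image control statement inside code run as a subimage task (which standard Fortran would not allow in a pure procedure):

subroutine relax_buffered(n, T, Buf)
   implicit none
   integer, value :: n
   real, intent(inout) :: T(0:n), Buf(0:n)
   integer :: i

   do concurrent (i = 1:n-1)
      Buf(i) = (T(i-1) + T(i+1))/2   ! write the new iterate to the buffer
   end do

   sync team (GET_SUBIMAGE_TEAM())   ! all subimage threads are done with Buf

   do concurrent (i = 1:n-1)
      T(i) = Buf(i)                  ! now safe to copy back to the primary array
   end do
end subroutine relax_buffered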
Fortran 2015 also introduces the concept of events. An event is a Fortran type that can be posted to by one image and waited on by another image. CAFe relaxes the constraint that events must be coarrays so that events can be used locally within a single image to control concurrent execution of multiple subimage tasks. For example, the following code segment shows how two tasks can be run concurrently on two separate subimages:

type(event_type) :: evt
call update_ocn [[dev1, WITH_EVENT=evt]]
call update_atm [[dev2, WITH_EVENT=evt]]
event wait (evt, until_count=2)

In this example, the syntax [[dev1, WITH_EVENT=evt]] indicates that the called subroutine is to be run on dev1 and that the event variable evt will receive notification (an implicit posting) once the task has completed. The event wait statement will cause execution on the hosting image to wait until both tasks have completed. One may even run multiple tasks on the same device using events; however, unless the subimage hardware allows concurrent execution of multiple tasks, only one task will be executed at a time.

Fortran 2015 also provides new intrinsic procedures for atomic and collective operations. At this time we do not consider atomic operations for usage with the CAFe extensions. Atomic variables and associated synchronization constructs place too great a demand on the executing software and hardware environment. For example, the Open Community Runtime environment does not provide for atomic variables, as this restriction helps OCR to better scale on a wider range of parallel computers.

Collective operations, however, could potentially be very useful and fit nicely within CAFe concepts. For example, a team consisting of hosted subimages could be formed and then data broadcast to team members and reduction operations such as co_add performed. However, since all team members of the current team must participate in collective operations, and since CAFe restricts subimages from being part of the same team as regular images, it is impossible for an image to participate with a team of its hosted subimages in collective operations. So for now, collectives are excluded from CAFe.

3) Alternative Syntax: It is noted that these extensions to Fortran have been described in terms of explicit language syntax modifying the base language. CAFe could equivalently have been described in terms of compiler directives. For example, subimage task execution could be written as

call relax(U[dev1]) !$OFP [[dev1]]

which would allow a standard CAF compiler to compile CAFe programs. This usage would make the CAFe constructs described here an embedded Domain Specific Language. In a future version of the implementation of CAFe, we intend to also support compiler directives in this form.

Alternative syntax for remote execution may be preferable to the double square bracket notation. While the syntax [[ ]] makes it clear that the programmer intends remote execution to take place, perhaps syntax closer to standard Fortran is better, especially if standardization of CAFe were to occur. For example, memory placement could be requested by a slight modification of the allocation statement

allocate(U(M,N)[*], DEVICE=dev1)

and similarly remote execution via a subroutine call could be replaced by

call relax(U[dev1]), DEVICE=dev1

However, while the alternative syntax for memory allocation appears very natural, using the syntax ,DEVICE=dev1 to denote remote execution is problematic when applied to Fortran function calls. For example, the double square bracket notation in

y = sqrt( sin(x)[[dev1]] * 2 )

cannot be replaced with ,DEVICE=dev1.

III. CAFE IMPLEMENTATION

This section briefly describes how CAFe extensions to Fortran have been implemented as source-to-source transformations via rewrite rules (for expressing basic transformations) and rewriting strategies (for controlling the application of the rewrite rules). A CAFe file is transformed to Fortran and OpenCL files through generative programming techniques using Stratego/XT tools[9]; this is accomplished in three phases: (1) parsing to produce a CAFe Abstract Syntax Tree (AST) represented in the Annotated Term Format (ATerm[10]); (2) transformation of CAFe AST nodes to Fortran and C AST nodes; and finally (3) pretty-printing to the base Fortran and OpenCL languages.

The foundation of CAFe is the syntax definition of the base language expressed in SDF (Syntax Definition Formalism) as part of the Open Fortran Project (OFP)[11]. CAFe is defined in a separate SDF module that extends the Fortran 2008 language standard with only 11 context-free syntax rules. Parsing is implemented in Stratego/XT by a scannerless generalized-LR parser and the conversion of transformed AST nodes to text is accomplished with simple pretty-printing rules.

A. Transformations for Concurrent Procedures

A key component of code generation for CAFe is the targeting of a Fortran pure procedure or a do concurrent construct for execution on a particular hardware architecture. The execution target can be one of several choices, including serial execution by the current program image, parallel execution by inlining with OpenMP compiler directives, or parallel execution by heterogeneous processing elements with a language like OpenCL.

We have developed rewrite rules and strategies in Stratego/XT to rewrite Fortran AST nodes to C AST nodes (extended with necessary OpenCL keywords). The C AST ATerms have mostly a one-to-one correspondence with Fortran terms: a pure Fortran procedure is transformed to an OpenCL kernel; Fortran formal parameters are transformed to C parameters (with a direct mapping of types); and local Fortran variable declarations are rewritten as C declarations. Similarly, Fortran executable statements are rewritten as C statements. The only minor complication is mapping the CAFe local array index view to the global C index space. This translation is facilitated by a Fortran symbol table that stores array shape and halo size information. However, these transformations are not entirely complete and some of the OpenCL kernels used in the example section have been coded by hand.
B. Transformations at the Calling Site

Transformations of a CAFe procedure call site are more difficult, though technically straightforward. The Fortran function call must be transformed to a call to run the OpenCL kernel (generated as described above). This is facilitated by use of the ForOpenCL library, which provides Fortran bindings to the OpenCL runtime[12]. However, this usage requires the declaration of extra variables, allocation of memory on the OpenCL device (subimage), transfer of memory, marshalling of kernel arguments, and synchronization.

These transformations are accomplished in several rewrite stages using Stratego/XT strategies: (1) a symbol table is produced in the first pass to store information related to arrays, including array shape and allocation status; (2) additional variables are declared that are used to maintain the OpenCL runtime state, including the OpenCL device, the OpenCL kernel, and OpenCL variables used to marshall kernel arguments; and (3) all CAFe code related to subimage usage is desugared (lowered) to standard Fortran with calls to the ForOpenCL library as needed. The latter step includes the allocation of OpenCL device memory, transfer of memory to and from the OpenCL device, marshalling of kernel arguments, running of the OpenCL kernel, and synchronization.

Though not yet available, similar rewrite strategies are planned for targeting programming models other than OpenCL, including parallel execution with OpenMP and OpenACC directives. However, as CAFe is designed, simple serial execution — with respect to subimage tasks — is automatically provided by the regular CAF compiler.

IV. MULTIGRID LAPLACIAN IMPLEMENTATION

In this section a one-dimensional solution to Laplace's equation is described using CAFe syntax. The solution uses multigrid techniques to improve the rate of convergence of iterative methods. We describe the implementation in terms of a one-dimensional problem for simplicity.

A. Multigrid Algorithm

The multigrid algorithm uses a series of successively coarser grids to iteratively approximate the solution obtained from the preceding finer grid. Higher frequency error modes are damped out as the solution progresses up the grid hierarchy.

At the top of the hierarchy, an exact solution is obtained and then this solution is iterated and propagated down the grid hierarchy from coarser to finer grids. After a few sweeps up and down the hierarchy the solution will have been obtained on the desired original grid.

Two levels of the grid hierarchy are shown in Fig. 1 running on a processor computing on an interior portion of the grid; additional images are computing on regions to the left and right of the region shown in the figure. The finest grid (size N) is shown at the bottom of the figure with cell indices running from (-1:N+1). Cells 0 and N are shared between image neighbors to the left and right (indicated by the light blue coloring and the vertical dashed lines). These shared cells are computed redundantly on each program image.

The red cells in the figure are computed concurrently on all subimages (one per image). This distribution of computation over the hierarchical domain of images and subimages requires communication of boundary regions between the various computational resources. The communication is shown by arrows in the grid level N of the figure. Data in cells 1 and N-1 must be transferred from the subimage (upwardly pointing arrows) and the shared cell computations must be copied to the subimage (downwardly pointing arrows). The horizontal arrows represent data computed and subsequently transferred from the left and right neighbors.

The second grid level N/2 is also shown in Figure 1. Data on this grid is obtained by simple interpolation from information on grid level N as shown in Figure 2. We call this interpolation a restriction and refer the reader to previous work on the multigrid technique for more information[13]. The approximate solution on the N/2 grid is relaxed a few times and then passed on to grid level N/4 for further iteration.

At the top of the grid hierarchy there are sufficiently few cells that an exact solution can be obtained (perhaps by a non-iterative method). This solution is then propagated down the grid hierarchy as shown in Figure 3.

Fig. 1. Two multigrid levels (N, N/2) on one image showing the two cells that are shared with neighbors on the left and the right and computed redundantly. The fine level grid (N) also shows the two cells that must be exchanged between neighbors. Arrows represent the direction of data exchange from the perspective of the local image.

Fig. 2. Restriction of fine grid to coarse grid. Arrows show the points used in mapping onto the coarse grid.

Fig. 3. Prolongation of coarse grid to fine grid. Arrows show the points used to interpolate onto the fine grid.
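For reference, the grid-transfer operators sketched in Figs. 2 and 3 can be written in one dimension as follows; this is our sketch of one common choice of stencils (full weighting and linear interpolation), not code taken from the paper:

pure subroutine restrict(nc, Uf, Uc)
   ! Restriction: fine grid (2*nc cells) down to coarse grid (nc cells).
   implicit none
   integer, value :: nc
   real, intent(in)  :: Uf(0:2*nc)
   real, intent(out) :: Uc(0:nc)
   integer :: i
   do concurrent (i = 1:nc-1)
      Uc(i) = 0.25*(Uf(2*i-1) + 2.0*Uf(2*i) + Uf(2*i+1))
   end do
   Uc(0)  = Uf(0)        ! boundary values are injected directly
   Uc(nc) = Uf(2*nc)
end subroutine restrict

pure subroutine prolongate(nc, Uc, Uf)
   ! Prolongation: coarse grid (nc cells) back up to fine grid (2*nc cells).
   implicit none
   integer, value :: nc
   real, intent(in)  :: Uc(0:nc)
   real, intent(out) :: Uf(0:2*nc)
   integer :: i
   do concurrent (i = 0:nc-1)
      Uf(2*i)   = Uc(i)                  ! coincident points copy through
      Uf(2*i+1) = 0.5*(Uc(i) + Uc(i+1))  ! in-between points interpolate
   end do
   Uf(2*nc) = Uc(nc)
end subroutine prolongate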

B. Implementation of the Relaxation Step

The multigrid algorithm has been implemented in Fortran using the CAFe syntax described in the previous section. Data coarrays at each grid level are declared, allocated, and initialized on each program image. These coarrays are then copied to the subimages by the parent, hosting images. Below we show the implementation of the relaxation loop on grid level N. Each iteration step incorporates concurrent computation on an image and its hosted subimage. In order to execute concurrently, we must declare an event variable to allow concurrent execution and then synchronize once computation on both the subimage and the image has completed.

The event variable is declared as,

type(event_type) :: evt

After initialization, this event variable has its count set to 0; the count is incremented each time the subimage task completes during an iteration of the code segment shown below:

do i = 1, nsteps                          ! 1
   call relax(N, V1h[dev], Buf[dev]) &
        [[dev, EVENT=evt]]                ! 2
   call relax_boundary(N, V1h)            ! 3

   event wait (evt, until_count=i)        ! 4

   V1h(0)[dev] = V1h(0)                   ! 5
   V1h(N)[dev] = V1h(N)                   ! 6
   V1h(1)   = V1h(1)[dev]                 ! 7
   V1h(N-1) = V1h(N-1)[dev]               ! 8
   sync all                               ! 9

   V1h(-1)  = V1h(N-1)[left]              ! 10
   V1h(N+1) = V1h(1)[right]               ! 11
   sync all                               ! 12
end do                                    ! 13

This example shows the first relaxation steps on the finest grid level N. The initial guess (provided by the coarray V1h) is relaxed nsteps times as specified by the loop beginning at statement 1.

The relaxation is computed on the subimage dev by the call to the relax procedure at statement 2. This call uses coarray memory located on the subimage as specified by the arguments V1h[dev] and Buf[dev], where the second coarray is used for temporary storage.

Since an event evt has been supplied to the call in statement 2, the program continues execution without waiting for the remote task on dev to complete. The count associated with the event will be increased by an implicit post to the event once the task has finished executing. Thus the task started by statement 2 will execute concurrently with the relax_boundary call at statement 3 — the first call executes on the subimage dev while the second call executes on the local image this_image().

Most of the computation is accomplished by the subimage executing the relax procedure (not shown) operating over the interior corresponding to fine grid indices (1:N-1). Concurrently, relax_boundary operates on the boundary indices 0 and N (note that memory for the boundary computation is located on the local image, as the coarray V1h is not coindexed using square bracket notation in statement 3).

Once relaxation of the boundary has completed, the program image will wait for the event indicating that the subimage task has finished executing relaxation on the interior (statement 4). The until_count for the event is increased by 1 each iteration of the loop, so the program waits until the event counter equals the iteration counter i.

Once the wait has completed, communication of the halo regions between the subimage and the hosting image can begin. Shared boundary cells at 0 and N are copied to the subimage by the execution of statements 5 and 6, and interior regions at cells 1 and N-1 are copied from the subimage by statements 7 and 8. Boundary cells are exchanged (copied) from neighboring images by statements 10 and 11. To ensure that no race conditions are introduced by the exchange of information between images, explicit synchronization is accomplished by the execution of statements 9 and 12.
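The relax procedure itself is not shown in the paper; the following is our minimal sketch of what such a pure interior-relaxation kernel might look like, with argument conventions assumed and the weighted-Jacobi form taken from the do concurrent example of Section II:

pure subroutine relax(n, T, Buf)
   implicit none
   integer, value :: n
   real, intent(inout) :: T(-1:n+1)     ! grid including halo cells -1 and n+1
   real, intent(inout) :: Buf(-1:n+1)   ! temporary storage for the old iterate
   real, parameter :: w = 0.67
   integer :: i

   do concurrent (i = 0:n)              ! save the previous iterate
      Buf(i) = T(i)
   end do
   do concurrent (i = 1:n-1)            ! weighted-Jacobi update of the interior
      T(i) = (1-w)*Buf(i) + w*(Buf(i-1) + Buf(i+1))/2
   end do
end subroutine relax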
V. PERFORMANCE MEASUREMENTS

It is important to note that the emphasis of this work is to explore and explain the new syntax and execution semantics of a CAFe application. A thorough examination of potential performance gains (if any) using CAFe for parallelization of code is beyond the scope of this paper. The primary purpose of CAFe is to combine the parallel features of Fortran coarrays — executing a single program — with concurrent execution of separate tasks executing on potentially heterogeneous hardware.

However, the relative performance of computation on the interior of a three-dimensional grid, performed by the GPU, compared with the time required to compute on boundary planes by the CPU is of interest. Table I shows average execution time for a relaxation step on the GPU (column 2) and on the CPU (column 3) in milliseconds, where N (column 1) is the number of cells in one dimension of a 3D cube; time for the exchange of halo information between the GPU and the CPU for each iteration is also shown in column 4.

TABLE I
SIZE OF CUBE AND TIME IN MILLISECONDS

   N     GPU     CPU    Comm
  16    0.02    0.01    0.07
  32    0.02    0.03    0.10
  64    0.04    0.12    0.27
 128    0.42    0.47    0.81
 256    4.59    1.97    2.99
 512   43.08   12.51   16.49

Note that all three times in Table I are roughly equivalent for N = 128. Also note that the communication and the computational times on the CPU are roughly equivalent and scale the same with increasing N. This is not surprising as both involve only the surface elements of the cube. However, computational time on the GPU grows much faster, O(N³), as it is computing on the interior of the cube.

VI. CONCLUSIONS

Fortran is a general-purpose programming language and is extremely important to the scientific and high performance computing (HPC) community. While new code development is often in languages other than Fortran (such as C++), surveys show that Fortran usage is still high[14]. Fortran will be with us well into the future as the lifetime of important scientific applications is measured in decades, and Fortran usage dominates in scientific communities that are critical to understanding the future of life on our planet, such as the climate community.

However, significant challenges arise as hardware adapts to the end of Moore's Law[15]. The revolution occurring in hardware architecture has completely altered the landscape for scientific computing, as a code designed for the distributed memory bulk synchronous model of parallelism may actually run slower on advanced hardware[16].

Parallelism was added to Fortran with the addition of coarrays in the 2008 standard[17] to enable the migration of codes as hardware evolves. Significant language features were left out of this early version of coarrays, as identified by Mellor-Crummey et al.[6]. Many of these features have been adopted in what is likely to be called the 2015 standard of the language[18], such as teams of program images and improved synchronization constructs like events.

However, the Fortran coarray model fails to address heterogeneity in hardware as is already being seen in HPC computing platforms with the adoption of distributed nodes consisting of general purpose CPUs with attached accelerator devices like GPUs. To address this shortcoming of the coarray model, CAFe introduces the concept of subimages executing on a hosting image. CAFe includes:

• Dynamic creation of subimages.
• Dynamic memory allocation and placement on a subimage.
• Explicit addressing of memory on subimages using standard coarray notation. It follows the Partitioned Global Address Space (PGAS) model, although with subimages the address space is hierarchical in the sense that the memory partitioning only allows memory exchange between a subimage and its hosting image, as initiated by the hosting image.

• Task creation and execution on subimages using extended coarray syntax with double square brackets [[ ]]. Tasks may be defined in terms of standard pure Fortran procedures, and task execution continues without interruption until completion. In addition, a Fortran 2015 event type may be notified upon completion of the task to allow concurrent execution on the subimage and on the hosting image.
• CAFe integrates Fortran 2015 features to allow relatively simple programs to be written that employ all of the heterogeneous components expected on exascale platforms.

A. Limitations

In this paper we have defined CAFe and described its semantics and relationship to the 2008 and 2015 Fortran standards. Possible criticism of this work includes the following:

• The source-to-source transformations that have been developed to implement CAFe are rudimentary and have not been optimized. For example, memory transfers have not been aggregated into larger message blocks to improve performance. Memory transfer is also blocking, so prefetching of data cannot be employed via techniques involving the reordering of statements (code motion) that are effectively employed in CAF compilers.
• The Laplacian example is in reality a “toy” problem. Real multiphysics codes employ hundreds (or more) of variables and would have entirely different performance characteristics because of the increased pressure on cache and register usage.
• Thorough examination of performance capabilities on real scientific applications (or at least scientific “mini-apps”) has not been attempted. Such a study of both CAFe and CAF itself is necessary to demonstrate the coarray model as a viable alternative for application codes with complexity above that of an idealized mini-app.
• Programming models like that proposed by CAFe already exist in the form of a combination of MPI and OpenMP (or OpenACC) programs. The advantage of CAFe — being a purely language construct — is that it opens up the possibility of compiler optimizations that would otherwise be impossible with a library approach intermixed with compiler directives.

B. Future Work

In spite of the limitations discussed, adoption of the task-based parallelism of CAFe and the synchronization event types in Fortran 2015 suggests intriguing possibilities. Tasks plus the dependencies represented by events suggest a correspondence with the Open Community Runtime (OCR) libraries. In OCR, the entire program must be broken into a set of tasks represented by a directed acyclic graph (DAG). In OCR, tasks are available for execution when all of their input dependencies are satisfied, such as occurs when a new iteration of data elements has been computed and other upstream tasks have completed.

An intriguing possibility exists for automatic conversion of CAFe codes to OCR:

• CAF codes have well defined boundaries (called segment boundaries) indicating where new OCR tasks must be created.
• CAF has many fewer library routines (primarily collectives) that must be converted.
• CAF has events that already define some of the dependencies necessary for the creation of OCR tasks.
• CAFe introduces the notion of task based parallelism (in addition to the existing Fortran constructs like do concurrent).

In the future we hope to examine whether CAFe codes can be automatically converted to an entirely different runtime model like the Open Community Runtime. Given the change in architectures being considered for future HPC platforms it is likely that similar, novel runtime models will be proposed, and we believe that CAFe provides a viable programming model for compiler-based mapping to these new systems.

ACKNOWLEDGMENT

This work was supported by grant 234910 funded by the U.S. Department of Energy, Office of Science. The authors would also like to thank Robert Robey and Wayne Weseloh at Los Alamos National Laboratory for several stimulating conversations with respect to programming models.

REFERENCES

[1] A. Dubey and B. van Straalen, “Experiences from software engineering of large scale AMR multiphysics code frameworks,” CoRR, vol. abs/1309.1781, 2013.
[2] Khronos OpenCL Working Group, The OpenCL Specification, Version 1.1, Document Revision 44, 2011.
coarray model as a viable alternative for application codes [Online]. Available: https://xstackwiki.modelado.org/images/1/13/Ocr- v0.9-spec.pdf with complexity above that of an idealized mini-app. [4] J. Dokulil and S. Benkner, “Retargeting of the Open Community • Programming models like that proposed by CAFe already Runtime to Intel Xeon Phi,” Procedia Computer Science, vol. 51, pp. exist in the form of a combination of MPI and OpenMP 1453 – 1462, 2015. [5] R. W. Numrich and J. Reid, “Co-array Fortran for parallel program- (or OpenACC) programs. The advantage of CAFe — by ming,” SIGPLAN Fortran Forum, vol. 17, no. 2, pp. 1–31, 1998. being a purely language construct — is that it opens up [6] J. Mellor-Crummey, L. Adhianto, W. N. Scherer, III, and G. Jin, “A new the possibility of compiler optimizations that would oth- vision for coarray Fortran,” in Proceedings of the Third Conference on erwise be impossible with a library approach intermixed Partitioned Global Address Space Programing Models, ser. PGAS ’09. New York, NY, USA: ACM, 2009. with compiler directives. [7] G. Jin, J. M. Mellor-Crummey, L. Adhianto, W. N. S. III, and C. Yang, “Implementation and performance evaluation of the HPC challenge B. Future Work benchmarks in coarray Fortran 2.0,” in 25th IEEE International Sympo- sium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, In spite of the limitations discussed, adoption of the task- Alaska, USA, 16-20 May, 2011 - Conference Proceedings, 2011. based parallelism of CAFe and the synchronization event types [8] The Fortran Committee, “TS 18508 Additional parallel features in in Fortran 2015 suggests intriguing possibilities. Tasks plus the Fortran,” ISO/IEC JTC1/SC22/WG5 N2007, Mar. 2014. [Online]. Available: ftp://ftp.nag.co.uk/sc22wg5/N2001-N2050/N2007.pdf dependencies represented by events suggest a correspondence [9] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser, “Stratego/XT with the Open Community Runtime (OCR) libraries. In OCR, 0.17. a language and toolset for program transformation,” Science of the entire program must be broken into a set of tasks repre- Computer Programming, vol. 72, no. 1–2, pp. 52 – 70, 2008. [10] M. van den Brand, H. A. de Jong, P. Klint, and P. A. Olivier, “Efficient sented by a directed acyclic graph (DAG). In OCR, tasks are annotated terms,” Softw., Pract. Exper., vol. 30, no. 3, pp. 259–291, available for executing when all of their input dependencies are 2000.

[11] “Open Fortran Project (OFP) software repository.” [Online]. Available: https://github.com/OpenFortranProject/ofp-sdf
[12] M. J. Sottile, C. E. Rasmussen, W. N. Weseloh, R. W. Robey, D. Quinlan, and J. Overbey, “ForOpenCL: Transformations exploiting array syntax in Fortran for accelerator programming,” Int. J. Comput. Sci. Eng., vol. 8, no. 1, pp. 47–57, Feb. 2013.
[13] A. Brandt, “Multi-level adaptive solutions to boundary-value problems,” Mathematics of Computation, vol. 31, no. 138, pp. 333–390, 1977.
[14] P. Prabhu, T. B. Jablin, A. Raman, Y. Zhang, J. Huang, H. Kim, N. P. Johnson, F. Liu, S. Ghosh, S. Beard et al., “A survey of the practice of computational science,” in State of the Practice Reports. ACM, 2011, p. 19.
[15] S. Ashby, P. Beckman, J. Chen, P. Colella, B. Collins, D. Crawford, J. Dongarra, D. Kothe, R. Lusk, P. Messina et al., “The opportunities and challenges of exascale computing,” Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, pp. 1–77, 2010.
[16] A. Dubey, “Stencils in scientific computations,” in Proceedings of the Second Workshop on Optimizing Stencil Computations, ser. WOSC ’14. New York, NY, USA: ACM, 2014, pp. 57–57.
[17] J. Reid, “The new features of Fortran 2008,” SIGPLAN Fortran Forum, vol. 27, no. 2, pp. 8–21, Aug. 2008. [Online]. Available: http://doi.acm.org/10.1145/1408643.1408645
[18] The Fortran Committee, “F2015 Working Document,” J3/15-007, Mar. 2014. [Online]. Available: http://j3-fortran.org/doc/year/15/15-007.pdf
