Foropencl: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming

ForOpenCL: Transformations Exploiting Array Syntax in Fortran for Accelerator Programming Matthew J. Sottile Craig E Rasmussen Wayne N. Weseloh Galois, Inc. Los Alamos National Los Alamos National Portland, OR 97204 Laboratory Laboratory [email protected] Los Alamos, NM 87545 [email protected] [email protected] Robert W. Robey Daniel Quinlan Jeffrey Overbey Los Alamos National Lawrence Livermore National University of Illinois at Laboratory Laboratory Urbanna-Champaign [email protected] [email protected] [email protected] ABSTRACT the performance of the M2090 NVIDIA Tesla processor is in Emerging GPU architectures for high performance comput- the same neighborhood [7]. Unfortunately the performance ing are well suited to a data-parallel programming model. that many of the new accelerator architectures offer comes This paper presents preliminary work examining a program- at a cost. Architectural changes are trending toward multi- ming methodology that provides Fortran programmers with ple heterogeneous cores and less of a reliance on superscalar access to these emerging systems. We use array constructs instruction level parallelism and hardware managed memory in Fortran to show how this infrequently exploited, stan- hierarchies (such as traditional caches). dardized language feature is easily transformed to lower-level accelerator code. The transformations in ForOpenCL are These changes place a heavy burden on application program- based on a simple mapping from Fortran to OpenCL. We mers as they work to adapt to these new systems. An es- demonstrate, using a stencil code solving the shallow-water pecially challenging problem is not only how to program to fluid equations, that the performance of the ForOpenCL these new architectures | considering the massive scale of compiler-generated transformations is comparable with that concurrency available | but also how to design programs of hand-optimized OpenCL code. that are portable across the changing landscape of com- puter architectures. How does a programmer write one pro- 1. INTRODUCTION gram that can perform well on both a conventional multicore CPU and a GPU (or any other emerging many-core archi- This paper presents a compiler-level approach for targeting tectures)? a single program to multiple, and possibly fundamentally different, processor architectures. This technique allows the A directive-based approach, such as OpenMP or the Accel- application programmer to adopt a single, high-level pro- erator programming model from the Portland Group [14], is gramming model without sacrificing performance. We sug- one solution to this problem. However, in this paper we take gest that existing data-parallel features in Fortran are well- a somewhat different approach. A common theme amongst suited to applying automatic transformations that generate the new processors is the emphasis on data-parallel program- code specifically tuned for different hardware architectures ming. This model is well-suited to architectures that are using low-level programming models such as OpenCL. For based on either vector processing or massively parallel col- algorithms that can be easily expressed in terms of whole lections of simple cores. The recent CUDA and OpenCL array, data-parallel operations, writing code in Fortran and programming languages are intended to support this pro- transforming it automatically to specific low-level implemen- gramming model. tations removes the burden of creating and maintaining multiple versions of architecture specific code. The problem with OpenCL and CUDA is that they expose too much detail about the machine architecture to the pro- The peak performance of these newer accelerator architec- grammer [15]. The programmer is responsible for explic- tures can be substantial. Intel expects a teraflop for the itly managing memory (including the staging of data back SGEMM benchmark with their Knights Ferry processor while and forth between the host CPU and the accelerator de- vice) and specifically taking into account architectural dif- ferences (such as whether the architecture contains vector units). While these languages have been attractive as a method for early adopters to utilize these new architectures, they are less attractive to programmers who do not have the time or resources to manually port their code to every new architecture and programming model that emerges. 1.1 Approach operations that we focus on in this paper. Additional pro- We demonstrate that a subset of Fortran map surprisingly gram analysis methods may be investigated to study their well onto GPUs when transformed to OpenCL kernels. This applicability in future versions of this work. data-parallel subset includes: array syntax using assign- ment statements and binary operators, array constructs like 2. PROGRAMMING MODEL WHERE, and the use of pure and elemental functions. In addi- A question that one may pose is \Why choose Fortran and tion, we provide new functions that explicitly take advantage not a more modern language like X for programming accel- of the stencil geometry of the problem domain we consider. erator architectures?" The recent rise in interest in concur- Note that this subset of the Fortran language is implicitly rency and parallelism at the language level due to multicore parallel. This programming model does not require explicit CPUs and many-core accelerators has driven a number of declaration of parallelism within the program. In addition, new language developments, both as novel languages and ex- programs are expressed using entirely standard Fortran so tensions on existing ones. However, for many scientific users it can be compiled for and executed on a single core without with existing codes written in Fortran, new languages and concurrency. language extensions to use novel new architectures present a challenge: how do programmers effectively use them while Transformations are supplied that provide a mechanism for avoiding rewriting code and potentially growing dependent converting Fortran procedures written in the Fortran sub- on a transient technology that will vanish tomorrow? In this set described in this paper to OpenCL kernels. We use the paper we explore the constructs in Fortran that are partic- 1 ROSE compiler infrastructure to develop these transforma- ularly relevant to GPU architectures. tions. ROSE uses the Open Fortran Parser2 to parse For- tran 2008 syntax and can generate C-based OpenCL. Since In this section we present the Fortran subset employed in ROSE's intermediate representation (IR) was constructed this paper. This sub-setting language will allow scientific to represent multiple languages, it is relatively straightfor- programmers to stay within the Fortran language and yet ward to transform high-level Fortran IR nodes to C OpenCL have direct access to GPU hardware. We start by examin- nodes. This work is also applicable to transformations to ing how this programming model relates to developments in vendor-specific languages, similar to OpenCL, such as the other languages. NVIDIA CUDA language. 2.1 Comparison to Prior Fortran Work Transformations for arbitrary Fortran procedures are not at- A number of previous efforts have exploited data-parallel tempted. Furthermore, a mechanism to transform the call- programming at the language level to utilize novel architec- ing site to automatically invoke OpenCL kernels is not pro- tures. The origin of the array syntax adopted by Fortran in vided at this time. While it is possible to accomplish this the 1990 standard can be found in the APL language [6]. task within ROSE, it is considered outside the scope of this These additions to Fortran allowed parallelism to be ex- paper. However, ForOpenCL provides via Fortran interfaces pressed with whole-array operations at the expression level, a mechanism to call the C OpenCL runtime and enable For- instead of via parallelism within explicit DO-loops, as im- tran programmers to access OpenCL kernels generated by plemented in earlier variants of the language (e.g., IVTRAN the supplied transformations. for the Illiac IV). We study the automatic transformations for an application The High Performance Fortran (HPF) extension of Fortran example that is typical of stencil codes that update array el- was proposed to add features to the language that would en- ements according to a fixed pattern. Stencil codes are often hance the ability of compilers to emit fast parallel code for employed in applications based on finite-difference or finite- distributed and shared memory parallel computers [10]. One volume methods in computational fluid dynamics (CFD). of the notable additions to the language in HPF was syntax The example described later in this paper is a simple shallow- to specify the distribution of data structures amongst a set water model in two dimensions using finite volume methods. of parallel processors. HPF also introduced an alternative Stencil-like patterns appear in a number of other contexts looping construct to the traditional DO-loop called FORALL as well. In image processing, they appear in convolution- that was better suited for parallel compilation. An addi- based algorithms in which small kernels are convolved with tional keyword, INDEPENDENT, was added to allow the pro- an image to implement denoising, edge detection, and other grammer to indicate when the loop contained no loop-order common operators. Similar stencil operators appear in gen- dependencies that allowed for parallel execution. These con- eral signal processing applications as well. structs are similar

Load more