Accelerating a Climate Physics Model with OpenCL
Fahad Zafar, Dibyajyoti Ghosh, Lawrence Sebald, Shujia Zhou
{fahad3, dg9, lsebald1, szhou}@umbc.edu
University of Maryland, Baltimore County

Abstract—Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real-world application, a solar radiation model component widely used in climate and weather models, we show that the OpenCL multi-threaded programming and execution model can dramatically increase performance even on CPU architectures. Our preliminary investigation indicates that low-level vector instructions and code representations in OpenCL contribute to dramatic performance improvement over the serial version when compared with the execution of the serial code compiled with various compilers on multiple platforms with auto-vectorization flags. However, the portability of OpenCL implementations needs to improve, even for CPU architectures.

Index Terms—Multi-threaded programming; Parallel computing; Heterogeneous architectures; Climate model; IBM Cell B.E.; OpenCL; Vectorization; Compilers; GCC; ICC; IBM XLC

I. INTRODUCTION

The demand to increase forecast predictability has been pushing climate and weather models to increase model grid resolution and include more physical processes. Accelerating climate simulations with the help of emerging multicore computing paradigms provides us with tools to address these demands. Current trends in the computing industry have moved from optimizing performance gains on single-core processors to increasing the overall performance through parallelism with many-core processors. This trend has been all too common in GPUs in the past. Now it is being widely adopted by CPU architectures as well.

OpenCL provides a standard heterogeneous parallel programming environment for applications to execute on CPUs, GPUs, and other accelerators such as DSPs and mobile processors [1], [2]. To support such a wide array of processor architectures, OpenCL puts forward an extensive model for programming. OpenCL code typically comprises multiple kernels that are executed over a multi-threaded grid. There is tremendous interest in developing an OpenCL application and running it on CPUs as well as GPUs, since OpenCL claims to be a standard cross-platform programming framework. Many authors have discussed architectural independence as a benefit of OpenCL [3], [1], [4]. Though there are many examples and code fragments available, cross-platform OpenCL implementation examples are hard to find for real-world applications. In addition, to our knowledge there is no report available on the portability of OpenCL, in terms of code and performance, with the CPU as the compute device across platforms.

The OpenCL programming model is based on kernel execution, where a kernel is a function declared in a program and executed on a compute device. In this paper, we use the terms OpenCL compute device and OpenCL device interchangeably. On submission of a kernel instance, called a work-item, the host allocates a global ID for the work-item. Each work-item executes the same code, but on different data, and can take a different path of execution. Work-items are organized into work-groups. Work-groups are given unique work-group IDs with the same dimensionality as the index space used to define the work-items. Following this logical assignment of data and tasks, work-items execute concurrently on the processing elements of a compute device. The host maintains a queuing data structure called a command-queue to coordinate in-order or out-of-order execution of kernels on the device.
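As a concrete illustration of this execution model, the following trivial OpenCL C kernel (ours, not taken from the solar radiation code; the kernel and argument names are invented) runs one work-item per array element; every work-item executes the same body on the element selected by its global ID.

    /* Illustrative only (not from SOLAR). */
    __kernel void scale_column(__global float *data, const float factor)
    {
        size_t gid = get_global_id(0);   /* global ID assigned on submission      */
        /* get_group_id(0) and get_local_id(0) would return the work-group ID
         * and the position of this work-item inside its work-group.             */
        data[gid] = data[gid] * factor;  /* same code, different element          */
    }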
In this paper we investigate how OpenCL performs on CPUs, and its portability among CPUs, through parallelization of a real-world climate physics application, the Goddard solar radiation model component [5], with OpenCL implementations on IBM PowerPC and POWER6 blades as well as on the Intel architecture with Mac OS X. The solar radiation model component was originally written in Fortran. Over the years the structure of this Fortran code has been stable. It represents a typical climate physics model component whose calculations are carried out along the vertical (altitudinal) direction (so-called column physics). This particular code was ported to the IBM Cell Broadband Engine by Zhou et al. [6], where detailed code structure analysis and performance gains were reported.

We implemented the serial version of the code in OpenCL version 1.0, manually optimized all sections of the code to run in a multi-threaded fashion using OpenCL kernels, and benchmarked the code on IBM JS21 and JS22 blades, on a POWER6 AIX system, and on Mac OS X versions 10.6.4 and 10.6.7. We used the approach of extracting compute-intensive kernels without changing the overall code structure. This allowed us to apply multiple manual optimizations for specific architectures while maintaining the code structure. We found a 3X∼4X performance improvement per core over the original serial code compiled with GCC. We believe these dramatic performance gains on the CPU show that OpenCL provides an interface to implement light-weight multi-threaded code on the CPU [7]. POSIX Threads are known to be heavy-weight when used in multi-threaded programming, which increases program memory requirements and adds context-switching costs. OpenCL provides access to a multi-threaded programming and execution model as well as a low-level API for memory and thread management. Its relaxed memory consistency model is similar to that of CUDA [8] developed for NVIDIA GPUs. We report our preliminary investigation results on OpenCL performance on CPUs and on serial code performance across platforms when compiled with the GCC, Intel C++, and IBM XLC compilers. We also touch upon vectorization results from these three compilers with respect to two reference code sections in the serial code.

Our findings show that GCC auto-vectorization [9], [10] fails in certain cases where the OpenCL compiler succeeds in generating fast executable code. This is particularly true for certain nested loop constructs, though the GCC vectorization unit performs as expected in many other cases in this model. Similar results were obtained from the Intel ICC compiler and the IBM XLC compiler for these nested loop constructs. We compiled the serial version of the solar radiation model component code with GCC for the SSE{2,3,4} [11] and Altivec [12] target instruction sets, respectively. IBM XLC serial code compilation used Altivec [13] as the target architecture. OpenCL kernels for the sections of code comprising these loop constructs have been compared with the GCC, ICC, and XLC vectorization reports for those code sections. The details of these findings are elaborated in the Performance Analysis subsection of Section III. The increased performance of our OpenCL implementation can be attributed to implicit instruction vectorization by the OpenCL compiler on the Intel [14], [15] and IBM platforms, as well as to efficient execution pipelining and better memory management.

Based on our experience, current implementations of OpenCL from IBM and Apple are not portable in terms of code. The same OpenCL code runnable on IBM JS21 and JS22 blades did not compile or run seamlessly on Mac OS X as one would expect. Some code modifications were required to get correct results. Thread scheduling differences on the platforms caused the program to output the correct answer in one case (IBM blades) and an incorrect answer in the other (Mac OS X). After successfully porting the first two sections of the code with OpenCL, the speedups on Mac OS X with an Intel CPU are similar to those on the IBM platform, which indicates that OpenCL performance is portable across platforms to some extent.

The rest of the paper is organized as follows. In Section II, we discuss the solar radiation code structure and its OpenCL implementation. In Section III we present the results on the IBM blades, POWER6 AIX, and Mac OS X, and we discuss potential reasons behind the dramatic performance gain on CPUs with OpenCL. In Section IV we discuss related work and further analysis of the performance gain, and we conclude in Section V.
II. CODE STRUCTURE AND OPENCL IMPLEMENTATION

A production-quality climate or weather model code can span up to a few hundred thousand lines. Most of these codes are written in Fortran. We used a single-precision C version of a representative, compute-intensive physics model component code of about 1500 lines as a serial baseline. This baseline is used to confirm the correctness of the results obtained with the multi-threaded parallel version as well as to calculate the performance achieved against it. The selected code is the solar radiation model component, SOLAR, which has been widely used in climate and weather models. This particular version of the code is from the production version of the NASA GEOS-5 climate physics model. This model component, along with IRRAD (the infrared radiation component), typically takes 20 percent of the execution time [6]. The exact fraction of SOLAR execution time compared to the total model varies among models. For example, in the NASA GEOS-5 atmospheric model, the solar radiation can take 5%∼10%; however, that fraction could increase significantly if the aerosol effect is included. Reducing the execution time of SOLAR can benefit climate and weather simulations in two ways: allowing a reduced overall simulation time and an increase in the complexity of models within the same simulation time. The climate physics model component executes calculations only along the vertical (altitudinal) direction. In a climate or weather physics model, most of the physical processes in a whole or partial globe are modeled with such columns at each point of a horizontal grid, such as a latitudinal-longitudinal grid.

In the parallel version, SOLAR is multi-threaded and runnable on multiple processor cores within a shared-memory system. The parallel code is composed of about 70 kernels. These kernels execute on an OpenCL compute device, which can be the host CPU cores or GPU streaming multiprocessors, depending on user-defined initialization parameters. Thus, the OpenCL code can be tested for various simulation system sizes (i.e., numbers of physics columns), bounded only by memory limitations. All performance results presented in this paper use 128 columns that are independent of each other. The kernels are multi-dimensional, depending on the data arrays used. The kernels do not have a one-to-one mapping with the serial solar radiation code functions; we have written multiple kernels for a subroutine in some particular cases. In this way we ensure maximal utilization of the hardware computing resources. Most functions with conditional flow of code were split into separate kernels working in synchronization to break the linear program flow, and these kernels were executed concurrently.

All data arrays are stored in OpenCL global memory. There is no data transfer between the host and the OpenCL device (CPU) other than the initialization variables and the final results. All the computations are done inside a single blade or desktop computer. We spread the kernel execution over an n by m thread grid, where n and m are chosen with respect to the arrays present in the specific kernels so that each thread deals with one element of the array in the typical OpenCL threading fashion. The code contains about 70 arrays ranging from 1 to 3 dimensions. Kernel executions are synchronized with barriers to ensure data integrity.
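A host-side sketch of launching a kernel over such an n by m index space is shown below (queue and kernel are assumed to have been created already, and error checking is omitted):

    size_t global[2] = { (size_t)n, (size_t)m };   /* one work-item per array element      */
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,        /* work dimensions                       */
                                        NULL,     /* global offset (must be NULL in CL 1.0)*/
                                        global,   /* n x m global work size                */
                                        NULL,     /* let the runtime choose the local size */
                                        0, NULL, NULL);
    /* With an in-order queue the next enqueued kernel does not start until this one
     * has finished; clFinish() blocks the host until everything enqueued completes.  */
    clFinish(queue);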
Two code segments which we analyze later have been labeled as section 1 (Appendix A) and section 2 (Appendix C) in this paper. The first code segment belongs to the Solar Radiation Initial() routine and the second belongs to the SolarIR() section of the climate physics model.

III. RESULTS

A. Performance Measurement

We tested the complete parallel implementation of the serial code in OpenCL on the IBM PowerPC and POWER processor architectures. The parallel code output results within a difference of less than 1 percent (maximum difference of 0.0001) compared to the original serial version. The solar radiation code is divided into four sections (Figure 1). Each section contains anywhere between 5 and 25 kernels. The code uses integer and floating point data types. Dramatic performance improvement was noticed for parallel execution on the CPU.

Fig. 1. The code structure of the solar radiation model component. The same routine structure was mapped to OpenCL kernels in the parallel version. One routine was mapped to multiple kernels in some cases.

Our analysis shows that the speedup can be attributed to implicit vectorization support by the OpenCL compiler infrastructure along with the kernel execution model and its threading architecture [16]. Further, splitting the functional routines into small and distributed kernels has allowed maximum parallelization of the code and resulted in faster completion of the task. These fine tunings for such a complex implementation were possible due to the simpler programming interface provided by OpenCL as compared to manual SIMD vectorization. Explicit vector programming is time consuming, error prone, and machine dependent. Moreover, vectorization is not trivial: it has been observed that the difficulty of optimizing code for SIMD architectures stems from hardware constraints too [17]. Certain kernels were executed out of order to maximize throughput, while all of them were executed according to the size and dimensions of the arrays being manipulated. Thus, no work cycles were wasted due to inconsistent sizes and dimensions of memory objects.
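For the out-of-order execution mentioned above, a hedged sketch of the host-side pattern we have in mind follows (kernel handles and work sizes are hypothetical): independent kernels go into an out-of-order queue, and a real dependence is expressed with an event.

    cl_int err;
    cl_command_queue q = clCreateCommandQueue(ctx, dev,
                             CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    cl_event a_done;
    clEnqueueNDRangeKernel(q, kernel_a, 1, NULL, gsz_a, NULL, 0, NULL, &a_done);
    /* kernel_b reads what kernel_a wrote, so it waits on the event ...          */
    clEnqueueNDRangeKernel(q, kernel_b, 1, NULL, gsz_b, NULL, 1, &a_done, NULL);
    /* ... while the independent kernel_c is free to run out of order.           */
    clEnqueueNDRangeKernel(q, kernel_c, 1, NULL, gsz_c, NULL, 0, NULL, NULL);

    clFinish(q);                 /* wait for everything before reading results   */
    clReleaseEvent(a_done);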

Fig. 2. Total speedup achieved per processor core using the OpenCL parallel implementation compared to the serial C version on CPU architectures.

Fig. 3. Performance gain comparison per section of the code for IBM JS21 and IBM JS22.

Figure 2 presents the performance results on the IBM JS21 and JS22 blades. The JS21 blade contains two dual-core PowerPC 970MP 2.3 GHz processors, while the JS22 blade has two quad-core POWER6 4.0 GHz processors. All the tests were run with the number of parallel compute cores on the OpenCL device set to CL_DEVICE_MAX_COMPUTE_UNITS. GCC has been used only for serial performance measurement on a single core. Figure 3 shows comparative performance gains per section of the code. The highest speedup is observed in the compute-intensive part (SolarIR()). The parallel and serial versions of the code were first run in a test mode in which they ran side by side. After each subroutine execution, all the arrays were cross-checked element by element to verify the results. At the end of execution, when all sections of the code had passed the checks, the parallel code was run in benchmark mode, where the timing data were collected. In both Figure 3 and Figure 4 we have plotted logarithmic values of the timing data attained on the various platforms for the four code sections.

1) Porting to Mac OS X: OpenCL is designed to support a broad range of processor architectures with the goal of writing code once and running it everywhere. A user simply compiles an application written in OpenCL with compile flags to locally optimize the code, and the code is supposed to be portable to all supported platforms. Our experience with porting OpenCL did not go this smoothly. We ran the OpenCL implementations written for the IBM PowerPC and POWER processors on Mac OS X with limited success. The initial OpenCL code did not compile due to some dependency issues, and the first results obtained were incorrect, as the Apple OpenCL drivers did not execute the code with the expected results. It appears that this is caused by premature OpenCL implementations from different vendors. We modified parts of the code to compile it correctly and reordered function calls to get correct results. Only then were we able to test the first two parts of the SOLAR code (Figure 4). One of the modifications deals with loading arrays of dimension higher than two onto the Mac OS X OpenCL compute device. The clCreateBuffer [18] OpenCL API takes a host pointer as one of its input parameters. The IBM blade compatible code handles this data reference without any error, while on Mac OpenCL we had to flatten the array representation from higher dimensionality to one dimension to pass it as a host pointer reference to the OpenCL compute device. Additionally, the clCreateContextFromType [19] API worked on the Mac OS X OpenCL implementation, while we used a custom procedure on the IBM systems to get the same result, as clCreateContextFromType did not work on the IBM OpenCL implementation. OpenCL version 1.1 might solve some of the porting issues we encountered.

At the time of writing this paper, we have results for the first two sections of the code, and they look very promising on an alternate platform compared to the complete IBM OpenCL implementation. The speedup registered for the parallel version of the SolarUV() section on Mac OS X (Intel 2.66 GHz Core 2 Duo) is about 2.67 (Figure 4), compared to 4.4 and 6.9 (Figure 3) for IBM JS21 and JS22 in the same section. One conclusion to be drawn at this point is that OpenCL vectorization speeds up applications across multiple CPU platforms. We are still in the process of optimizing the complete SOLAR code for Mac OS X, but the speedup achieved is consistent with the results obtained on the IBM platform.
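The sketch below gathers the host-side calls involved in these porting details; the names, sizes, and the explicit-device workaround are ours (the paper does not spell out the exact IBM-side procedure), and error handling is omitted.

    cl_platform_id plat;
    cl_device_id   dev;
    cl_int         err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, NULL);

    /* clCreateContextFromType worked for us only on Mac OS X; creating the
     * context from an explicitly enumerated device is one portable alternative. */
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);

    /* Number of compute units the kernels are spread over (see above). */
    cl_uint units;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);

    /* Arrays of more than two dimensions are flattened so that a single host
     * pointer can be handed to clCreateBuffer (hypothetical sizes).            */
    enum { NLAY = 75, NBAND = 8, NCOL = 128 };
    static float flux[NLAY][NBAND][NCOL];
    cl_mem d_flux = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   sizeof(flux), &flux[0][0][0], &err);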

Fig. 4. Performance gain comparison achieved for the first two sections of the code on Mac OS X. CPU: Intel 2.66 GHz Core 2 Duo.

Fig. 5. Timing results for 30 executions of the cldflx() subroutine with a single-kernel and a 4-kernel split implementation on the IBM JS22.

2) Subroutine multi-kernel implementation: Porting C code to OpenCL presents a design-decision challenge based on the nested dependency structure of the code. Dividing one subroutine with multiple levels of iteration loops and a mix of decision statements can be tricky at times. We noticed that splitting some subroutines into multiple kernels at times sped up the processing, while in some cases it reduced performance. Figure 5 presents the results obtained by splitting the cldflx() function into 4 kernels based on the 4 major iteration loops inside the cldflx() code structure. cldflx() computes the transmittances and reflectances for a composite of layers. Layers are added one at a time, going down from the top. There are 4 major iteration loops within the function, each executing nested code elements contained within the loop.

It is evident that dividing serial code into the smallest parallelizable elements is not the best decision every time when converting from serial to parallel OpenCL code. The slowdown occurs due to the additional cost of loading and setting up the kernels, which cannot be overridden by the speedup (if any) offered by the additional kernel executions in this case. Contrary to this example, not splitting the SolarIR() function, which contains a huge part of the physics implementation, into multiple kernels results in a loss of performance. The reason is that, without splitting, many parts of the code that could benefit from parallel execution remain serial and can only benefit from the limited compiler-specific vectorization. Thus it is important to know when to split the code and whether splitting the code will produce a performance benefit given the additional cost of executing a separate kernel.
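To make the trade-off concrete, the sketch below (not the real cldflx() source; array names, bounds, and loop bodies are invented) shows the shape of a per-loop split: each major loop becomes its own kernel, and a loop that carries a dependence across layers keeps that recurrence inside the kernel and parallelizes only across bands.

    /* Loop 1 of the routine: an independent initialisation, one work-item per element. */
    __kernel void cldflx_loop1(__global float *tt, const int n)
    {
        int i = get_global_id(0);
        if (i < n)
            tt[i] = 1.0f;                      /* n = np * nband elements              */
    }

    /* Loop 2: layers are added one at a time going down from the top, so the
     * k-recurrence stays serial inside the kernel and only the band index ib
     * is spread over work-items.                                                      */
    __kernel void cldflx_loop2(__global float *tt, __global const float *trn,
                               const int np, const int nband)
    {
        int ib = get_global_id(0);
        if (ib >= nband)
            return;
        for (int k = 1; k < np; k++)
            tt[k * nband + ib] = tt[(k - 1) * nband + ib] * trn[k * nband + ib];
    }

    /* Loops 3 and 4 follow the same pattern; whether the 4-kernel version beats the
     * single-kernel version depends on whether the extra setup cost (Fig. 5) is
     * recovered by the additional parallelism.                                        */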

B. Performance Analysis

In this subsection we explore the reasons behind the performance gains achieved in our work. We selected two code segments from the serial version of the climate model to analyze the vectorization behavior of complex nested loop structures across compilers. For comparing results we used disassembled output from GCC version 4.2.1 for the serial code against the OpenCL kernel assembly output generated by the Intel OpenCL Offline Compiler [14], along with the vectorization reports from GCC versions 4.1.2 and 4.2.1, Intel C++ Compiler (ICC) version 12.0.4, and IBM XLC compiler version 10.1 on multiple platforms. The solar radiation code was compiled on the Mac OS X version 10.6.7 platform using GCC version 4.2.1 (Apple Inc. build 5664) and Intel C++ Compiler version 12.0.4, on the IBM blades using GCC version 4.1.2 20080704 (Red Hat 4.1.2-48), and on POWER6 AIX using IBM XLC version 10.1 and GCC version 4.1.2. We refer to the first segment as section 1 (see Appendix A for a complete listing of the code segment) and the second segment as section 2 (see Appendix C for the section 2 nested loop structure listing) in this paper. We have used the first segment to represent sections of the serial code where GCC, the Intel C++ compiler, and the IBM XLC compiler all fail to vectorize instructions, leading to overall performance degradation of the solar radiation code. GCC's tree-ssa based loop vectorization module complains under a number of circumstances [20], two of which we noticed occurring frequently during compilation of the solar radiation model component:

1) Vectorization failure due to unhandled data references.
2) Vectorization failure due to too many basic blocks in the loop.

A small loop of the kind that triggers these diagnostics is sketched below.
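The toy loop nest that follows is illustrative only; it is not the SOLAR source, and the exact diagnostic text varies with compiler version and flags. It has the two properties that produced these reports for us: multi-dimensional data references and conditional control flow inside the loop body, which expands into many basic blocks.

    #define NB_MAX 16   /* hypothetical band bound */

    /* Illustrative only: not the SOLAR source. */
    void update_flux(int np, int nband, float cmin,
                     const float cld[][NB_MAX], const float taucl[][NB_MAX],
                     const float df[][NB_MAX], float flx[][NB_MAX], float acc[NB_MAX])
    {
        for (int k = 0; k < np; k++) {
            for (int i = 0; i < nband; i++) {
                if (cld[k][i] > cmin)          /* branch inside the loop body       */
                    flx[k][i] *= taucl[k][i];
                else
                    flx[k][i] += df[k][i];
                acc[i] += flx[k][i];           /* accumulation across k iterations  */
            }
        }
    }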

The ICC and XLC vectorization modules reported similar reasons for failing to vectorize. If there are no cycles in the data dependence graph of a loop, then the loop can be vectorized, although sometimes statement reordering becomes necessary [22]. A basic block refers to a sequence of instructions having a single entry point and a single exit point [23], meaning no jump instruction can be part of a basic block and no basic block can be the target of a jump instruction. During compilation GCC decomposes a program into multiple basic blocks, each of which forms a vertex of the control flow graph (CFG) used by the tree-ssa based vectorization unit of GCC.

In the section 1 code segment our zone of interest is the nested for-loop construct at lines 2-4 in Appendix A. GCC, ICC, and IBM XLC all report vectorization failure for this nested for-loop with unhandled data reference errors. Similarly, for section 2 all of these compilers reported vectorization failure. In our OpenCL kernel implementation we changed the two-dimensional array representation (lines 6-10, 13, 15, 21 in Appendix A) to a one-dimensional array (lines 27-30, 32, 34, 38, 39 in Appendix B), and only the innermost for-loop was retained (lines 37-40 in Appendix B). We noticed that changing the code representation in the serial code from multidimensional arrays to a one-dimensional row-major representation, to mimic the OpenCL kernel code structure, did not facilitate vectorization of section 1 and section 2. We conjecture that the elimination of complex loop constructs in the OpenCL kernel contributes to effective vectorization by the OpenCL compiler, because it bypasses the unhandled data references and the too-many-basic-blocks issues we encountered.

A part of the vectorized assembly code generated by the Intel OpenCL Offline Compiler for section 1 is given below.

    pshufd XMM0, XMM0, 0
    paddd XMM0, XMMWORD PTR [LCPI4_0]
    movaps XMMWORD PTR [ESP + 32], XMM0 # 16-byte Spill
    mov EAX, EBX # Reload Reuse
    imul EAX, DWORD PTR [ESP + 28] # 4-byte Folded Reload
    movd XMM0, EAX
    pshufd XMM0, XMM0, 0
    paddd XMM0, XMMWORD PTR [ESP + 32] # 16-byte Folded Reload
    movd EBX, XMM0
    mov EAX, DWORD PTR [EBP + 12]
    movups XMM0, XMMWORD PTR [EAX + 4*EBX]

Listing 3. A part of the OpenCL assembly dump of the section 1 kernel.

pshufd, paddd, movaps, and movups are SIMD instructions from the Intel SSE family of vector extensions. We noticed a speedup of about ∼40x for the section 1 code segment on Mac OS X. The second code segment we selected from the solar radiation code (refer to Appendix C for the complete listing of the second segment) contains five levels of nested loop constructs (refer to lines 30-60 in Appendix C).
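A minimal sketch of that transformation, with invented array names (the real code is in Appendices A and B): the serial loop nest walks two-dimensional arrays, while the kernel sees one-dimensional row-major buffers and keeps only the innermost loop, with the outer index supplied by the work-item's global ID.

    #define M_MAX 128   /* hypothetical inner dimension */

    /* Serial form (C): outer k loop, inner i loop, 2-D arrays. */
    void accumulate_serial(int np, int m,
                           float sdat[][M_MAX], const float delta[][M_MAX])
    {
        for (int k = 0; k < np; k++)
            for (int i = 0; i < m; i++)
                sdat[k][i] = sdat[k][i] + delta[k][i];
    }

    /* OpenCL form: 1-D row-major buffers, one work-item per k,
     * only the innermost loop retained.                          */
    __kernel void accumulate(__global float *sdat, __global const float *delta,
                             const int m)
    {
        int k = get_global_id(0);
        for (int i = 0; i < m; i++)
            sdat[k * m + i] = sdat[k * m + i] + delta[k * m + i];
    }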

OpenCL is itself implemented as a library with a built-in compiler to compile kernels at runtime. The current IBM implementation is based around a modified version of their XLC compiler. XLC is designed specifically for the POWER architecture. The use of XLC by IBM in their implementation of OpenCL should come as no surprise, and it explains why it is capable of sophisticated Altivec code generation. The OpenCL implementation in Mac OS X is based on the Low Level Virtual Machine (LLVM) with the Clang front-end. LLVM was designed as an infrastructure for building compilers, with a large focus on optimized code generation. LLVM supports the Intel architecture quite well, explaining why it creates such well-optimized code from the OpenCL kernel functions that we have implemented on Mac OS X thus far.

Fig. 6. Timing results for serial code compiled on JS21 (GCC), JS22 (GCC), Mac OS X (GCC, ICC), and POWER6 AIX (GCC, IBM XLC).

It should be noted that the OpenCL compiler can make certain assumptions that GCC cannot afford to make for naive C code. For example, it can assume that the kernel functions are not trying to do non-computational tasks such as file I/O. The OpenCL compiler can also assume that the given computation is meant to be run as a many-threaded piece of code. OpenCL kernel functions are generally small and self-contained. They are usually compute-intensive and are executed over an abstract index space called an N-dimensional computation domain. For instance, 16 parallel units of work could be associated with an index space from 0 to 15; alternatively, using 2-tuples, those 16 units of work could be associated with (0,0) to (3,3). OpenCL kernels can be grouped to form work-groups, and several of these work-groups are executed concurrently on the available OpenCL compute devices [15]. The OpenCL compiler can potentially take these assumptions into account in its code generation.
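As noted above, the kernels are handed to the OpenCL library as source and compiled for the selected device at run time. A minimal host-side sketch of that step (reusing the ctx and dev handles from the earlier sketch; error handling omitted):

    const char *src =
        "__kernel void scale(__global float *d, const float f) {"
        "    d[get_global_id(0)] *= f;"
        "}";

    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);   /* built-in compiler runs here */
    cl_kernel scale = clCreateKernel(prog, "scale", &err);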
For both section 1 and section 2, GCC, ICC, and IBM XLC reported vectorization failure. It will be interesting to further study how these compilers handle vectorization if the climate model's complex nested for-loop constructs are broken down into simpler loop constructs, as was done with the OpenCL kernel functions, but that would require a lot more effort and time.

We implemented a minor subset of the SOLAR code with the Altivec instruction set and found that the execution time was cut in half compared to the serial version. There is an extensively nested program flow in the SOLAR code, along with multi-dimensional arrays and different operations depending on conditional statements. The code becomes very complex, and it becomes very time consuming to implement a subroutine like SolarIR() with the Altivec instruction set. The complex code structure is also the reason why a typical C compiler such as GCC cannot utilize the vector instruction sets, since the compiler cannot find the appropriate known patterns in this code. This is the problem that OpenCL seems to solve, bringing about orders of magnitude of speedup for applications targeted to run on CPUs, compared to GPUs that are already built on the foundations of multi-core paradigms.

We plan to modify the OpenCL code appropriately to run on GPUs. We are interested in finding out how different OpenCL implementation techniques that perform well on CPUs will behave on GPUs, and whether the performance can be portable in addition to the code portability. However, the potential problem of GPU memory constraints will limit the use of large global-memory datasets. At the time of writing this paper we are still working on porting the IBM-specific parallel code in its entirety to the Mac OS X implementation of OpenCL as well as to Nvidia's Fermi architecture.

IV. DISCUSSION

These kinds of performance gains through vectorization on the CPU are not new. VAST [27] uses the native SIMD vector instructions and increases performance dramatically. [28] reports a relative speedup of 20.5X to 27.3X, depending on the vector compiler (vcc or vlc) used, for certain implementations on VMX hardware. A simple dot product example gains up to 4.9X speedup with automatic vectorization. They also show that some hand-tuning techniques can further improve the performance. Thus, if a programmer is provided with an easy interface to the vector libraries for complex programming tasks, better performance is often achievable.
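For reference, the kind of kernel behind the dot-product figure quoted above is just a single flat reduction loop (our illustration, not code from [28]), in contrast to SOLAR's deeply nested, branch-heavy loops:

    /* Illustrative only. */
    float dot(int n, const float *x, const float *y)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }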

V. CONCLUSION

We have successfully demonstrated that the multi-threaded programming and execution models of OpenCL can significantly increase the performance of a real-world climate and weather application, the solar radiation model component, on IBM PowerPC and POWER6 CPU architectures. Similar performance improvement has also been obtained on Intel CPUs. The immature implementations of OpenCL from various vendors prevent us from running on Intel CPU architectures the same OpenCL code that is runnable on IBM CPU architectures. Our preliminary investigations with Altivec instructions as well as OpenCL-generated assembly code reveal that the dramatic performance improvement on CPUs arises from the much better implicit vectorization support provided by the OpenCL compiler infrastructure as compared to the auto-vectorization support provided by popular compilers like GCC, ICC, and IBM XLC.

ACKNOWLEDGMENT

We would like to thank the reviewers for their insightful comments and suggestions. This work is partially supported by IBM through the Center for Hybrid Multicore Productivity Research, UMBC.

REFERENCES

[1] J. Y. Xu, "OpenCL: the open standard for parallel programming of heterogeneous systems," Institute of Information and Mathematical Sciences, Massey University at Albany, Auckland, New Zealand, 2008.
[2] D. Singh, "Higher level programming abstractions for FPGAs using OpenCL," Altera Corp., Toronto Technology Center, www.eecg.toronto.edu/~jayar/fpga11/Singh_OpenCL_FPGA11.pdf, 2011.
[3] A. Munshi, "OpenCL: parallel computing on the GPU and CPU," SIGGRAPH, 2008.
[4] J. Breitbart and C. Fohry, "OpenCL: an effective programming model for data parallel computations at the Cell Broadband Engine," Parallel and Distributed Processing, Workshops and PhD Forum (IPDPSW), IEEE International Symposium, 2010.
[5] M. D. Chou and M. J. Suarez, "A solar radiation parameterization (CLIRAD-SW) for atmospheric studies," 1999.
[6] S. Zhou, D. Duffy, T. Clune, M. Suarez, S. Williams, and M. Halem, "The impact of IBM Cell technology on the programming paradigm in the context of computer systems for climate and weather models," pp. 2176-2186, 2009.
[7] L. Howes, "OpenCL parallel computing for CPUs and GPUs," AMD presentation, http://developer.amd.com/gpu_assets/OpenCL_Parallel_Computing_for_CPUs_and_GPUs_201003.pdf.
[8] NVIDIA, "NVIDIA CUDA compute unified device architecture," 2007, http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf.
[9] GCC Team and I. Rosen, "Auto-vectorization in GCC," http://gcc.gnu.org/projects/tree-ssa/vectorization.html, Apr. 2011.
[10] D. Naishlos, "Autovectorization in GCC," 2004, pp. 105-117.
[11] Intel, "Product page," http://www.intel.com/support/processors/sb/CS-030123.htm, Apr. 2011.
[12] IBM, "Tuning options to consider with GCC," http://www.ibm.com/developerworks/wikis/display/hpccentral/Tuning+options+to+consider+with+gcc, Apr. 2009.
[13] B. V. Atyam and C. Sze (IBM), "Application performance tuning and optimization on POWER6," http://www2.fz-juelich.de/jsc/datapool/jump/JUMP-AIX-POWER6-AppsPerformanceTuning-wp032008.pdf, Feb. 2008.
[14] Intel, "Intel OpenCL SDK," http://software.intel.com/en-us/articles/intel-opencl-sdk/, Apr. 2011.
[15] O. Rosenberg (Intel), "Optimizing OpenCL on CPUs," http://www.khronos.org/developers/library/2010_siggraph_bof_opencl/OpenCL-BOF-Intel-SIGGRAPH-Jul10.pdf, July 2010.
[16] Khronos Group, "OpenCL 1.1 specification (revision 36, September 30, 2010)," http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf, Apr. 2011.
[17] M. S., Computational Research Laboratories (CRL) India, "Vectorization - writing C/C++ code in vector format," http://software.intel.com/en-us/articles/vectorization-writing-cc-code-in-vector-format/, Jan. 2011.
[18] Khronos Group, "clCreateBuffer, OpenCL 1.0 manual," http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateBuffer.html.
[19] Khronos Group, "clCreateContextFromType, OpenCL 1.0 manual," http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clCreateContextFromType.html.
[20] S. Maleki, University of Illinois at Urbana-Champaign, "Evaluation of vectorizing compilers," https://agora.cs.illinois.edu/download/attachments/26546085/Vectorization-Seminar.pdf, Apr. 2011.
[21] M. J. Garzaran and S. Maleki, University of Illinois at Urbana-Champaign, "Program optimization through loop vectorization," https://agora.cs.illinois.edu/download/attachments/38305904/9-Vectorization.pdf, 2010.
[22] M. J. Garzaran, "Loop vectorization," https://agora.cs.illinois.edu/download/attachments/28937737/10-Vectorization.pdf, Spring 2010.
[23] R. M. Stallman and the GCC Developer Community, "GNU Compiler Collection internals," http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gccint.pdf, 2010-11.
[24] B. Buros (IBM), "Linux on POWER performance considerations," http://www.ibm.com/developerworks/wikis/download/attachments/104533332/BillBurosSTG-LinuxonPower-PerformanceConsiderations.ppt, Feb. 2007.
[25] Red Hat Team, "Survey of GCC performance," http://people.redhat.com/bkoz/benchmarks/.
[26] Intel, "Intel OpenCL SDK release notes," http://software.intel.com/file/33857, 2010-11.
[27] Crescent Bay Software, "VAST: advanced compiler optimization for high-performance systems," 2005, http://www.crescentbaysoftware.com.
[28] B. Gibbs, R. Arenburg, D. Bonaventure, B. Elkin, R. Grima, and A. Wang, "IBM eServer BladeCenter JS20 PowerPC 970 programming environment," IBM Redbooks, 2005.
[29] J. Michalakes and M. Vachharajani, "GPU acceleration of numerical weather prediction," Proceedings of the Workshop on Large Scale Parallel Processing, IPDPS, Miami, 2008.

VI. APPENDIX A - SECTION 1 SERIAL CODE

2 for (k=0; k
...

VII. APPENDIX B - SECTION 1 OPENCL KERNEL SOURCE

...
15 __global const int *hM_BLOCK,
16 __global const int *hLM,
17 __global const int *hNA_NUM
...
36 int l = 0;
37 for (l=0; l
...

VIII. APPENDIX C - SECTION 2 NESTED LOOP STRUCTURE

...
5 for (mb=0; mb
...
10 for (mb=0; mb
...
12 }
13 }
14 /*level 1 nesting*/
15 for (im=0; im<2; im++) {
16 /*level 2 nesting*/
17 if(im==0) {
18 /*level 3 nesting*/
19 for (mb=0; mb
20 cm[mb]=ch[mb]*(1.0-cc[1][mb]);
21 }
22 }
23 /*level 2 nesting*/
24 else {
25 /*level 3 nesting*/
26 for (mb=0; mb
...
28 }
29 }
30 /*level 2 nesting*/
31 for (is=is1-1; is
...
37 }
38 }
39 /*level 3 nesting*/
40 else {
41 /*level 4 nesting*/
42 for (mb=0; mb
...
44 }
...
52 is==0) {
53 fclr[l][mb]=flxdn[l+1][mb];
54 }
55 fall[l][mb]=fall[l][mb]+flxdn[l+1][mb]*ct[mb];
...
57 }
58 /*.....*/
59 }
60 }
61 }