Center for Computation & Technology

CCT Technical Report Series CCT-TR-2008-1

A General Relativistic Evolution Code on CUDA Architectures

Burkhard Zink Center for Computation & Technology and Department of Physics & Astronomy Louisiana State University, Baton Rouge, LA 70803 USA

Posted January 2008.

cct.lsu.edu/CCT-TR/CCT-TR-2008-1

The author(s) retain all copyright privileges for this article and accompanying materials. Nothing may be copied or republished without the written consent of the author(s).

A general relativistic evolution code on CUDA architectures

Burkhard Zink Center for Computation and Technology, and Department of Physics and Astronomy, Louisiana State University, Baton Rouge, LA 70803, USA [email protected]

ABSTRACT

I describe the implementation of a finite-differencing code for solving Einstein's field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on an AMD Opteron 2.4 GHz. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.

1. INTRODUCTION

The high parallel processing performance of graphics processing units (GPUs), with current models achieving peak performances of up to 350 GFlop/s (for single precision floating-point operations), has been used traditionally to transform, light and rasterize triangles in three-dimensional computer graphics applications. In recent architectures, however, the vectorized pipeline for processing triangles has been replaced by a unified scalar processing model based on a large set of stream processors. This change has initiated a consideration of GPUs for solving general purpose computing problems, and has triggered the field of general-purpose computing on graphics processing units (GPGPU).

High-performance, massively parallel computing is one of the major tools for the scientific community to understand and quantify problems not amenable to traditional analytical techniques, and has led to ever-increasing hardware performance requirements for tackling more and more advanced questions. Therefore, GPGPU appears to be a natural target for scientific and engineering applications, many of which admit highly parallel algorithms which are already used on current-generation supercomputers based on multi-core CPUs.

In this technical report, I will describe an implementation of one of the most challenging problems in computational physics, solving Albert Einstein's field equations for general relativistic gravitation, on a GPU. The primary purpose is to make an estimate of potential performance gains in comparison to current CPUs, and to gain an understanding of architectural requirements for middleware solutions serving the needs of the scientific community, most notably the Cactus Computational Toolkit [4, 3, 9].

The particular GPU used for these experiments is an NVIDIA G80 series card (Quadro FX 5600). NVIDIA has also released a software development kit called CUDA (compute unified device architecture) [8] for development of GPU code using an extension of the C language. As opposed to earlier attempts to program GPUs with the languages supplied for graphics applications, this makes it easier to port existing general-purpose computation code to the target device.

Section 2 contains a description of the G80 hardware, the CUDA architectural model, and the performance considerations important for GPU codes. Then, in Section 3, we will turn to the particular problem of solving Einstein's field equations, which we will approach from a mostly algorithmic (as opposed to physical) point of view. Section 4 describes the structure and implementation of the code, and Section 5 discusses the benchmarking results obtained. I give a discussion of the results, and an outlook, in Section 6.

2. CUDA AND THE G80 ARCHITECTURE

2.1 The NVIDIA G80 architecture

The NVIDIA G80 hardware [7] is the foundation of the GeForce 8 series consumer graphics cards, the Quadro FX 4600 and 5600 workstation cards, and the new Tesla 870 set of GPGPU boards. G80 represents the third major architectural change for NVIDIA's line of graphics accelerators. Traditionally, GPU accelerators enhance the transformation and rendering of simple geometric shapes, usually triangles. The processing pipeline consists of transformation and lighting (now vertex shading), triangle setup, pixel shading, raster operations (blending, z-buffering, anti-aliasing) and the output to the frame buffer for scan-out to the display. First and second generation GPU architectures typically process these steps with special purpose hardware in a pipelined fashion.

The increase in demand for programmability of illumination models and geometry modifiers, and also load-balancing requirements with respect to vertex and pixel shading operations, has led to more generality in GPU design. The current G80 architecture consists of a parallel set of stream processors with a full set of integer and (up to FP32) floating point instructions. When processing triangles and textures, the individual steps of the processing pipeline are dynamically mapped to these processors, and since the operations are highly parallelizable, the scheduler can consistently maintain a high load.

Physically, eight stream processors (SP) are arranged in a multiprocessor with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache. Each multiprocessor is operated by an instruction decoding unit which executes a particular command in a warp: the same command is executed on all SPs for a set of clock cycles (because the instruction units and the SP ALUs have different clock speeds). This constitutes a minimal unit of SIMD computation on the multiprocessor called the warp size, and will be important later when considering code efficiency on the architecture.

The GPU card contains a set of such multiprocessors, with the number depending on the particular model (e.g., one in the GeForce 8400M G, 12 in the GeForce 8800 GTS, and 16 in the GeForce 8800 GTX, Quadro FX 5600 and Tesla 870 models). The multiprocessors are operated by a thread scheduling unit with fast switching capabilities. In addition, the board has frame buffer memory and an extra set of memory for constants. The particular numbers for the Quadro FX 5600 are listed in Table 1, and Fig. 1 shows a diagram of the architecture (cf. also [7]).

The actual peak performance of the card depends on how many operations can be performed in one cycle. The stream processors technically support one MAD (multiply-add) and one MUL (multiply) per cycle, which would correspond to 1.35 GHz * 3 * 128 = 518.4 GFlop/s. However, not all execution units can be used simultaneously, so a conservative estimate is to assume one MAD per cycle, leading to a peak performance of 345.6 GFlop/s.

Number of multiprocessors (MP)             16
Number of stream processors per MP         8
Warp size (see text)                       32
Parallel data cache                        16 kB
Number of banks in parallel data cache     16
Number of 32-bit registers per MP          8192
Clock frequency of each MP                 1.35 GHz
Frame buffer memory type                   GDDR3
Frame buffer interface width               384 bits
Frame buffer size                          1.5 GB
Constants memory size                      64 kB
Clock frequency of the board               800 MHz
Host bus interface                         PCI Express

Table 1: Technical specifications of a Quadro FX 5600 GPU.

2.2 CUDA (compute unified device architecture)

Since the G80 is based on general purpose stream processors with a high peak performance, it appears to be a natural target for general purpose parallel computing, and in particular scientific applications. NVIDIA has recently released the CUDA SDK [8] for running parallel computations on the device hardware. While earlier attempts to use GPUs for general-purpose computing had to use the various shading languages, e.g. Microsoft's HLSL or NVIDIA's Cg, CUDA is based on an extension of the C language.

The SDK consists of a compiler (nvcc), host and device run-time libraries, and a driver API. Architecturally, the driver builds the primary layer on top of the device hardware, and provides interfaces which can be accessed either by the CUDA runtime or the application. Furthermore, the CUDA runtime can be accessed by the application and the service libraries (currently for BLAS and FFT).

CUDA's parallelization model is a slight abstraction of the G80 hardware. Threads are arranged into blocks, where each block is executed on only one multiprocessor. Therefore, within a block, additional thread context and synchronization options exist (by use of the shared resources on the chip), whereas no global synchronization is available between blocks. A set of blocks constitutes a SIMD compute kernel. Kernel calls themselves are asynchronous to the host CPU: they return immediately after issuance. Currently only one kernel can be executed at any time on the device, which is a limitation of the driver software.

The thread-block model hides the particular number of multiprocessors and stream processors from the CUDA kernel insofar as the block grid is partially serialized into batches. Each multiprocessor can execute more than one block and more threads than the warp size per multiprocessor to hide device memory access latency. However, the limitations of this model are implicit performance characteristics: if the number of blocks is lower than the number of multiprocessors, or the number of threads is lower than the warp size, significant performance degradation will occur. Therefore, the physical configuration is only abstracted insofar as the kernel compiles and runs on different GPU configurations, but not in terms of achieving maximal performance.

Within the kernel thread code, the context provides a logical arrangement of blocks in one- or two-dimensional, and threads per block in one-, two-, and three-dimensional sets, which is convenient for the kind of numerical grid application we will present below since it avoids additional (costly) modulo operations inside the thread. Threads support local variables and access to the shared memory (which maps to the data parallel cache in G80), which is common for all threads within a block.

The kernel is called from the host code via a special syntax which specifies the block grid size and threads per block, and can be synchronized back to the host via a runtime function. Therefore, the host CPU can perform independent computations or issue additional kernels while the GPU operates. This may be interesting for delayed analysis and output processes performed on the data, although the bandwidth for transferring from the frame buffer (where grid variables will be stored for evolution) to the host memory has to be taken into account.
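To make the kernel launch syntax and the block/thread configuration concrete, the following is a minimal, self-contained sketch; it is not taken from the evolution code, and the kernel name, data layout and the use of cudaThreadSynchronize() are illustrative assumptions.

// Minimal sketch of a CUDA kernel launch with an explicit block grid and
// threads-per-block configuration (illustrative names and sizes only).
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, float factor, int n)
{
    // Global element index computed from block and thread indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 64 threads per block (cf. the recommendation quoted in Section 2.3),
    // and enough blocks to cover all n elements.
    dim3 threads(64);
    dim3 blocks((n + threads.x - 1) / threads.x);

    // The launch returns immediately; the host could do independent work here.
    scale_kernel<<<blocks, threads>>>(d_data, 2.0f, n);

    // Block the host until the kernel has finished.
    cudaThreadSynchronize();

    cudaFree(d_data);
    return 0;
}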

Figure 1: Simplified diagram of the G80 architecture. The GPU contains several multiprocessors (MPs) which are operated by a common thread scheduler with fast switching capabilities. Each multiprocessor can run several threads using stream processors (SPs) which share a common instruction unit, a set of registers and a data parallel cache. A memory bus connects the MPs to the frame buffer memory and a constants memory.
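The architectural parameters listed in Table 1 can also be queried at run time through the CUDA runtime. The following sketch uses the standard call cudaGetDeviceProperties; the exact set of fields printed (e.g. multiProcessorCount) assumes a reasonably recent CUDA toolkit.

// Sketch: query the device parameters discussed above at run time.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Device name:           %s\n", prop.name);
    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
    printf("Warp size:             %d\n", prop.warpSize);
    printf("Shared memory / block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("Registers / block:     %d\n", prop.regsPerBlock);
    printf("Constant memory:       %lu bytes\n", (unsigned long)prop.totalConstMem);
    printf("Global memory:         %lu bytes\n", (unsigned long)prop.totalGlobalMem);
    printf("Clock rate:            %d kHz\n", prop.clockRate);
    return 0;
}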

2.3 Performance optimization considerations

To achieve high performance of the GPU for a particular application problem, often approximately measured in terms of a speed-up compared to a serial code on the host CPU, the kernel would usually encompass those operations which are highly data-parallelizable due to the SIMD concept of CUDA.

An efficient port needs to maximize the ratio of arithmetic over device memory operations, the arithmetic intensity, since access to the frame buffer involves large latencies (up to 600 cycles on the G80) because it is not cached. One common strategy is to first copy selected parts of the device memory which are needed by all threads to the shared memory, then operate on them, and finally write the results back to device memory. At the same time, many threads can be started concurrently on the multiprocessors to hide the access latencies.

Each block issued to a multiprocessor reserves memory for the local variables of all threads in the block, and shared memory according to the allocation made in the kernel. The number of blocks and threads running concurrently on the multiprocessor is limited by the shared memory space (16 kB on the Quadro FX 5600), which implies a trade-off between hiding device memory access latency by multi-threading and effective cache size per thread.

Since threads are operated on in warps, i.e. each of the stream processors in a multiprocessor operates sequentially on batches of SIMD instructions, the resulting warp size is the minimum number of threads per block for efficient parallelization, and the actual number should be an integer multiple of it. In addition to the upper bound on the cache size by concurrent issuance of blocks, this gives a problem-dependent lower bound on the number of threads in a block. NVIDIA also states [8] that the optimal number of threads per block is actually 64 in the G80 architecture to allow the nvcc compiler to reduce the number of register bank conflicts in an optimal way.

On a G80 card, actual FP32 operations like multiply (MUL) and multiply+add (MAD) cost 4 cycles per warp, i.e. one cycle per thread, reciprocals cost 16 cycles, and for certain mathematical functions like sine and cosine microcode is available which performs the evaluation in 32 cycles. Control flow instructions are to be avoided if possible, since in many cases they may have to be serialized.

Accessing shared memory by threads should avoid bank conflicts to ensure maximal bandwidth. This requires the threads of a warp to read and write to the shared memory in a certain way, since consecutive addresses (e.g. obtained from a base address and an offset given by the thread id) are distributed into several independent partial warps for maximum bandwidth. There is, however, also a broadcast mechanism which can distribute data from one bank to all threads in the warp.

Finally, accessing device memory can be optimized by coalescing operations, usually again in the form base address plus thread-dependent offset. In addition, the base address should be aligned to a multiple of the half-warp size times the type size in bytes for coalescing the read/write operation.

All these performance requirements come into play when optimizing a code for high speed-ups. The experiments below will demonstrate that the actual speed-up can be very sensitive to the particular implementation of the memory management. While non-optimized implementations for problems with high arithmetic intensity may already achieve a significant speed-up, a full use of the GPU's floating point performance requires some thought. We will see that the nature of the problem also imposes trade-offs between different performance requirements which strongly affect the maximum speed-up, so that e.g. comparing even implementations of different finite-difference problems can produce completely different speed-ups, since there is a trade-off between local cache space (i.e. number of grid variables) and the number of threads in a block.

3. THE ALGORITHM TO SOLVE EINSTEIN'S FIELD EQUATIONS

In its best-known form, Einstein's field equation for general relativity can be written as

    G_{μν} = 8π T_{μν}                                    (1)

where μ, ν = 0...3. G_{μν} are the sixteen components of the Einstein tensor, and T_{μν} are the sixteen components of the energy-momentum tensor. Both functionally depend on the spacetime metric tensor g_{μν}, which describes the local curvature of spacetime and is therefore the central object of general relativity, since curvature relates to physical effects like light bending and the attraction of bodies by gravitation.

We will, however, not be concerned with the physical interpretation of these equations here (for an introduction see [11]), but only with the requirements to formulate a finite-differencing evolution algorithm from them. The field equations as formulated in eqn. 1 can be transformed into a regular initial-boundary value problem (i.e., a set of partial differential equations in time and space) for twelve variables: the six components of the three-metric, g_{ij}, i, j = 1...3, and the six components of the extrinsic curvature, K_{ij}. In addition, the equations contain the four free gauge functions α and β^i as parameters, which are usually also treated as evolutionary variables.

The usual approach to solving these equations proceeds as follows:

1. Define a spatial domain to solve on, e.g. give a range of coordinates [x_low, x_high] × [y_low, y_high] × [z_low, z_high] on which to solve the problem.

2. Discretize the domain in some appropriate way. We will only discuss uniform Cartesian discretizations here (also known as uni-grids), but there are supercomputer implementations to use adaptive mesh refinement and multiple blocks, e.g. in the Cactus Computational Toolkit.

3. On each discrete cell, specify the initial data by setting values for the evolutionary variables. Usually, a set of cells directly outside of the domain, called ghost zones, is used for setting boundary conditions.

4. The evolution proper proceeds by a loop over time steps. In each time step, the right hand side of the equation ∂_t A(x, y, z) = RHS needs to be evaluated, where A is a grid variable and RHS is a function which depends on A and other grid variables, their spatial derivatives, and potentially free parameters and the coordinate location.

5. Since the right hand side depends on the spatial derivatives, finite-differencing operations have to be performed on the grid. This usually involves some local stencil of variables around the cell, e.g. ∂_x A^i ≈ (A^{i+1} − A^{i−1}) / (2 Δx) for the second-order accurate central approximation to the first x derivative of A at position i.

6. Having obtained the right hand side, the set of evolution equations is advanced with a technique to discretely approximate ordinary differential equations, e.g. a Runge-Kutta step. This obtains the values of grid functions at time t + Δt from their values at t. The Runge-Kutta algorithm generates partial time approximations during the course of its operation, and therefore requires a set of scratch levels having the size of the main grid variables.

7. Finally, new boundary conditions need to be imposed after the time update. This usually involves operations on the ghost zones and their immediate neighbors, e.g. by extrapolation.

The actual representation of Einstein's field equations used here is the so-called ADM formalism (after Arnowitt, Deser and Misner [1]). This is not a commonly used method nowadays, since it tends to produce numerical instabilities, but it will suffice for demonstrating how to port such a code to a CUDA environment, and it is a good choice for a prototype implementation due to its simplicity. More advanced schemes (NOK-BSSN [6, 10, 2] and the generalized harmonic formalism [5]) contain more variables, which has consequences for the available set of caching schemes (see below), but should fundamentally be portable to CUDA in a similar manner. The particular test cases below operate on a simple dynamical test problem, a gauge wave, and use static boundaries.

4. IMPLEMENTATION OF EINSTEIN'S FIELD EQUATIONS IN CUDA

While it is possible to develop a GPU code directly, there are advantages to first implementing a stand-alone CPU implementation in C, and then porting this to CUDA. In particular, debugging the host code is easier, and it will yield a fairer assessment of speed-up as compared to a device code in emulation mode (the CUDA C compiler nvcc can also compile code which actually executes on the host processor, mostly for debugging purposes).

This CPU code is a C language implementation of the algorithm mentioned in Section 3 for three spatial dimensions, with a second-order Runge-Kutta scheme for time integration and second-order accurate first and second central finite-differencing operators. The actual right hand side was extracted from a particular implementation of the ADM system in Cactus. The code performs these steps:

1. Allocate memory for all grid variables, and the same amount for the scratch space variables needed by the Runge-Kutta scheme.

2. Write initial data into the grid variables. The exact data is irrelevant unless it produces NaNs by discrete instabilities during the evolution. The data used here is a Gaussian gauge pulse in x direction.

3. Perform the main loop:

   (a) Swap the pointers between the grid variables and their scratch counterparts.

   (b) Call the evolution function. This function loops over all grid cells (three nested loops, one for each direction), and in each cell (i) calculates all partial derivatives by finite differencing in second order, (ii) calculates a number of additional temporary variables only needed locally, and (iii) writes the evolved grid functions into the scratch space (a minimal sketch of such a loop is given further below).

   (c) The swapping and evolution is repeated, now for the second half step involved in the RK2 scheme.

   (d) Output the evolved data to the disk.

4. Release allocated storage.

The only relevant target for parallelization is the evolution routine inside the main loop, since it employs most of the wall clock time in almost all situations (unless there is 3d output at every time step), and because it naturally lends itself to parallel computation. Therefore, we will take a closer look at its operations.

Einstein's field equations are some of the most complex classical field equations in nature, and therefore codes solving them naturally have a high arithmetic intensity. For evaluating a single right hand side, hundreds of floating point operations act on only a few grid functions in device memory - and their locally obtained partial derivatives. Therefore, as soon as the partial derivatives are obtained, parallelization can easily proceed by assigning each cell to a single thread and performing the evaluation locally. The resulting write operations to device memory only involve that particular cell and therefore do not produce concurrency conflicts (though they may produce bank conflicts).
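As a point of reference for the parallelization discussion, here is a minimal sketch of the serial evolution loop of step 3(b), written for a single representative grid function; the function and variable names and the trivial "right hand side" are illustrative assumptions, not the actual Cactus-extracted ADM code.

/* Sketch: serial evolution loop over the interior cells of a uniform
 * Cartesian grid for one representative grid function A.  The real code
 * evolves the full set of ADM variables with the Cactus-extracted RHS. */
#define IDX(i, j, k) ((i) + nx * ((j) + ny * (k)))

void evolve(const double *A, double *A_scratch,
            int nx, int ny, int nz, double dx, double dt)
{
    for (int k = 1; k < nz - 1; k++)
        for (int j = 1; j < ny - 1; j++)
            for (int i = 1; i < nx - 1; i++) {
                /* (i) second-order central first derivative in x */
                double dA_dx = (A[IDX(i + 1, j, k)] - A[IDX(i - 1, j, k)])
                               / (2.0 * dx);

                /* (ii) local temporaries / right hand side; the real RHS
                 * involves hundreds of floating point operations */
                double rhs = -dA_dx;

                /* (iii) write the updated value into the scratch level */
                A_scratch[IDX(i, j, k)] = A[IDX(i, j, k)] + dt * rhs;
            }
}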

For the finite difference evaluation, each cell accesses its immediate neighbors and reads grid variables from that cell, operates on them, and writes the result into local variables which are only defined inside the innermost loop. While it is possible to calculate all finite differences outside of the loop, in an extra step, to logically decouple the (semantically different) finite difference and evolution operations, this implies multiplying the required storage by more than a factor of four (three additional variables for each directional derivative, and storage for second derivatives for some variables). Translated to a CUDA device, this would involve additional operations on device memory and therefore reduce performance.

This approach can be translated directly into a GPU device code by using these changes to the CPU implementation:

1. Allocate the grid functions on the device memory (frame buffer) in addition to allocating them on the host memory. The host memory allocation is still useful for setting up initial data and output of data to the disk.

2. After writing initial data on the host memory, copy it to the device memory using runtime functions.

3. In the main loop, perform the swap operations on the device memory pointers.

4. Replace the evolution function by a CUDA kernel which is distributed to the device multiprocessors.

5. Since every evolution half-step depends on results from the previous one, synchronize to the CUDA device after calling the kernel. This time span could potentially also be used for asynchronous output.

6. If output is requested in a particular iteration, copy the relevant grid variables back to host memory, and then perform the output as usual.

4.1 Stage 1 parallelization

The CUDA kernel function could in principle be ported from the CPU code by using this scheme: Divide any two-dimensional orthogonal projection of the three-dimensional Cartesian grid, i.e. any of the planes (x, y), (x, z) or (y, z), into independent blocks of the same size. Each thread in the block then operates on a one-dimensional slab in the remaining direction, i.e. the three nested loops in the evolution code are replaced by one loop over the remaining direction. The cell index calculation needed in the kernel is made easy by CUDA's support of two-dimensional block and thread indices, which directly correspond to the two-dimensional domain decomposition. With this algorithm, the problem is equally and independently parallelized, and were it not for the additional considerations imposed by the memory access latencies, the approach would be already optimal. For future reference, I will call it a Stage 1 approach, since it is a direct translation of the CPU code while reducing the appropriate number of nested loops, and since it does not take advantage of the data parallel cache of the G80 multiprocessors. Fig. 2 illustrates the parallelization technique; a code sketch of such a column kernel follows the figure caption.

A first efficiency improvement for the G80 can be obtained by adjusting the block size, i.e. the number of threads in each block, to a multiple of 64, which is a multiple of the warp size (a strong requirement for high performance) and also allows the compiler to avoid register conflicts. (This restricts the possible grid sizes to a multiple of 8 in each direction of the two-dimensional decomposition for optimal parallelization; we will later see that the best scheme actually has a multiple of 4 as a requirement.) Even without considering shared memory at this stage, this puts a significant constraint on how many registers are actually available per thread. On the G80, there are 8192 registers per multiprocessor, i.e. we can store 128 32-bit words locally per thread. The partial derivatives and other helper variables used in the ADM code already need about 100 per thread. More complicated codes may need to either (i) reduce the number of local helper variables at the cost of increased arithmetic operations, or (ii) reduce the block size. Usually, experiments need to be done to establish which option is preferable in practice. It would seem intuitive that reducing the number of threads to the warp size is better, but the high latency of device memory accesses can be hidden much more effectively by more threads, and therefore easily outweigh the additional local arithmetic cost involved in implicitly repeating operations usually mapped to helper variables.

Instead of operating on one-dimensional columns, it is also possible to decompose the grid into cubes (as is usual for MPI parallelization schemes), and therefore have each thread evolve exactly one cell. Since the block decomposition by CUDA is logically either one- or two-dimensional, there are two options for doing this: (i) use the block decomposition as before, but start a thread for each cell separately, (ii) let the kernel only operate on a grid slab of defined thickness and call it repeatedly from the host. Option (i) is impractical due to the limitations of available registers, so we are left with option (ii), which is illustrated in Fig. 3. From the onset, it is unclear how this compares to the column decomposition used before. However, the stage 2 parallelization will need to operate on blocks as opposed to columns, as discussed below.

4.2 Stage 2 parallelization

So far, the kernel has not made use of the shared memory (or data parallel cache in terms of the G80 architecture). Since the frame buffer accesses are un-cached, it is necessary to implement a manual cache algorithm for this. Unfortunately, the shared memory is even more limited than the register space: 16 kB per multiprocessor on the G80, which translates into only 256 bytes per thread when using the recommended block size of 64.

The most obvious target for a caching algorithm are the finite difference operations, since they repeatedly use the same data (the value of quantity A at cell (i, j, k) is accessed by six neighboring cells for second-order accurate first partial derivatives, and by an additional eight cells for second derivatives). To perform these operations from the cache while avoiding conditional statements in the kernel, a number of ghost zones outside the operated cube need to be stored as well. Generally, this number will be reduced for equally sized cubes, which suggests taking 4×4×4 as the interior cube and one ghost cell, resulting in 64 threads per block (double the warp size, and the suggested block size for efficient register usage) and 6×6×6 = 216 cells to store in the cache.

Figure 2: Stage I column parallelization. The grid, here as an example of 16×16×16 cells (boundary ghost zones are not drawn), is decomposed in a plane into blocks, which correspond to the conceptual thread blocks in CUDA. Each block is 8×8 cells large to obtain a block size (number of threads per block) which is a multiple of the warp size and also optimized for reducing register bank conflicts. Each thread operates on a one-dimensional column, i.e. the kernel contains a loop in the remaining direction. As a Stage I scheme, the operations do not use the data parallel cache on the multiprocessors.
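As mentioned in Section 4.1, a Stage I kernel replaces the three nested loops by a single loop over the remaining direction. The following is a minimal sketch of such a column kernel for one representative grid function; the names, the trivial right hand side and the launch configuration are illustrative assumptions.

// Sketch of a Stage I column kernel: a 2D grid of 8x8 thread blocks covers
// the (x,y) plane, and each thread loops over the z column of its cell.
#define IDX3(i, j, k, nx, ny) ((i) + (nx) * ((j) + (ny) * (k)))

__global__ void evolve_columns(const float *A, float *A_scratch,
                               int nx, int ny, int nz,
                               float dx, float dt)
{
    // Interior cell in the (x,y) plane assigned to this thread
    // (offset by 1 to skip the boundary ghost zone).
    int i = 1 + blockIdx.x * blockDim.x + threadIdx.x;
    int j = 1 + blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx - 1 || j >= ny - 1)
        return;

    // Loop over the remaining (z) direction.
    for (int k = 1; k < nz - 1; k++) {
        // Second-order central x derivative, read directly from device
        // memory (uncached in a Stage I scheme).
        float dA_dx = (A[IDX3(i + 1, j, k, nx, ny)]
                     - A[IDX3(i - 1, j, k, nx, ny)]) / (2.0f * dx);

        // Illustrative right hand side and update into the scratch level.
        A_scratch[IDX3(i, j, k, nx, ny)] =
            A[IDX3(i, j, k, nx, ny)] - dt * dA_dx;
    }
}

// Host-side launch for a 128^3 interior grid (130^3 including ghost zones):
//   dim3 threads(8, 8);
//   dim3 blocks(128 / 8, 128 / 8);
//   evolve_columns<<<blocks, threads>>>(d_A, d_A_scratch, 130, 130, 130, dx, dt);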

Figure 3: Stage I/II block parallelization. Instead of using columns, the grid is first decomposed into slabs of a certain height, here marked by blue lines; each slab is further distributed into blocks corresponding to the CUDA blocks, and finally each cube of size 4×4×4 is parallelized into 64 threads, which is optimal for the same reasons stated in Fig. 2. In contrast to the column scheme, the kernel does not contain a loop over cells for purposes of calculating the evolution, though a loop is used to copy data to the parallel cache for the Stage II parallelization (see text and Fig. 4). The kernel is called by the host repeatedly to operate on all slabs in the grid.
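The repeated host-side kernel calls over slabs mentioned in the caption could be organized as in the following sketch; the kernel name, its argument list and the synchronization policy are illustrative assumptions (cf. the cache-copy listing in Section 4.2 for the in-kernel index setup).

// Sketch: the host walks through the grid in slabs of 4 cells in z and
// launches one kernel call per slab; each CUDA block covers a 4x4x4 cube.
#include <cuda_runtime.h>

__global__ void evolve_block_kernel(const float *A, float *A_scratch,
                                    int nx, int ny, int nz,
                                    int kstart, float dx, float dt);

void evolve_by_slabs(float *d_A, float *d_A_scratch,
                     int nx, int ny, int nz, float dx, float dt)
{
    const int slab = 4;                  // slab thickness = block size in z
    dim3 threads(4, 4, 4);               // 64 threads per block
    dim3 blocks((nx - 2) / 4, (ny - 2) / 4);

    // kstart is the first interior z index of the current slab; the kernel
    // derives the ghost-inclusive start from it (kL = kstart - 1).
    for (int kstart = 1; kstart < nz - 1; kstart += slab)
        evolve_block_kernel<<<blocks, threads>>>(d_A, d_A_scratch,
                                                 nx, ny, nz, kstart, dx, dt);

    // All slabs work on disjoint cells of the same time level, so a single
    // synchronization after the loop is sufficient.
    cudaThreadSynchronize();
}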

Therefore, with this scheme, we can store at most 18 FP32 variables per cell. This is enough for the ADM grid functions (depending on the choice of gauge, there are 12 to 16), but not enough for more complicated schemes or more general systems. In those cases, the only option would be to reduce the size of the cube and the number of threads, which likely incurs a loss of performance.

The transfer of data from the frame buffer to the parallel cache is handled by each thread separately. Were it only for the interior cube without ghost zones, an efficient scheme would be to copy all data for exactly the cell which is operated on by this thread. However, the ghost zones also need to be set up, for which the most obvious scheme is to use conditional statements which act on the boundary surfaces. This is obviously not very efficient, since (i) many threads will not participate in these operations, and (ii) conditional statements are often serialized into the minimal instruction size (the warp size) and will therefore reduce performance even more.

A better approach is this: Recompute the three-dimensional thread index into a linear index, and then use integer division and subtraction operations to obtain the three-dimensional index corresponding to the cube including ghost zones; the two cache copy approaches are illustrated in Fig. 4. The operations for this are:

__shared__ CCTK_REAL g11S[SHARED_ARRAY_SIZE], ...

// Block index
bx = blockIdx.x;
by = blockIdx.y;

// Thread index
tx = threadIdx.x;
ty = threadIdx.y;
tz = threadIdx.z;

// Copy data to shared memory
iL = bx * THREADS_PER_BLOCK_X;
jL = by * THREADS_PER_BLOCK_Y;
kL = kstart - 1;
tid = tx + THREADS_PER_BLOCK_X * (ty + THREADS_PER_BLOCK_Y * tz);

// -- first read/write
o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
n = res / SHARED_ARRAY_X;
m = res - n * SHARED_ARRAY_X;
index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
g11S[indexS] = g11[index];
...

// -- second read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- third read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- fourth read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
if (tid < SHARED_ARRAY_X * SHARED_ARRAY_Y * SHARED_ARRAY_Z) {
  o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
  res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
  n = res / SHARED_ARRAY_X;
  m = res - n * SHARED_ARRAY_X;
  index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
  indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
  g11S[indexS] = g11[index];
  ...
}

Here, g11 is the grid array for the variable g11, g11S is its shared memory equivalent, and the preprocessor constants describe the size of the shared array (SHARED_ARRAY_X/Y/Z = 6), the block size (THREADS_PER_BLOCK_X/Y/Z = 4), and the global grid size (GRID_NX/NY/NZ).

Now every thread copies data for and evolves different cells. Since there are more cells including ghost zones than threads, several such operations need to be performed (in the case of a 4×4×4 cube and a ghost size of 1, we need four cycles). Each time, the effective thread index tid is increased by the block size, so that the last read operation either (i) contains a conditional statement for excluding invalid indices, or (ii) the grid arrays are artificially enlarged to accommodate copying the data without segmentation faults. Experiments have shown that option (i) does not lead to a degradation in performance, and it is preferable due to simplicity and reduced memory usage.

To increase coalescence for device memory operations and reduce bank conflicts for shared memory writes in these operations, it is important to operate on consecutive addresses. This is easiest done by doing the block decomposition in the (x, y) plane, since then the index ordering scheme we use accesses addresses in the form base address plus index for the shared memory (avoiding bank conflicts), and within each x direction read access to the device memory can be coalesced. An even more efficient scheme could possibly be obtained by (i) ordering device memory in a way that more reads can be coalesced at once, and (ii) adjusting the base address to a multiple of the half warp size times the FP32 size. These have not been implemented here and would require non-trivial changes to the data structures.
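Once the block's data sits in the shared arrays, the finite differences themselves are taken from the cache. The following fragment (in the same sketch style as the listing above, and assuming CCTK_REAL is single precision here) illustrates this step for the first and second x derivatives of g11; the names myS, dx, dg11_dx and d2g11_dxx are illustrative.

// Sketch: after the cache copy, synchronize the block and take the
// derivatives from the shared array instead of the frame buffer.  The
// interior cell of this thread sits at (tx+1, ty+1, tz+1) in the 6x6x6 cube.
__syncthreads();

int myS = (tx + 1) + SHARED_ARRAY_X * ((ty + 1) + SHARED_ARRAY_Y * (tz + 1));

// Second-order central first derivative in x, read purely from the cache
// (x is the fastest-varying index, so the neighbors are at myS +/- 1).
CCTK_REAL dg11_dx = (g11S[myS + 1] - g11S[myS - 1]) / (2.0f * dx);

// Second derivative in x from the same cached values.
CCTK_REAL d2g11_dxx = (g11S[myS + 1] - 2.0f * g11S[myS] + g11S[myS - 1])
                      / (dx * dx);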

Figure 4: Schemes for caching the frame buffer data in the data parallel cache of a GPU multiprocessor. The diagrams show two-dimensional slices of the 4×4×4 cubes for illustration purposes. Each thread is assigned to one interior cell, but it needs data from adjacent cells for calculating finite differences. The interior portion of the cube is delimited by red lines, while the cube including ghost cells is indicated by blue lines. A direct scheme would first have each thread copy data for its cell into the cached array, and then, by conditional statements, assign threads to the ghost cells. However, this forces the compiler to serialize parts of the copy operation. A more efficient scheme operates on the data parallel cache arrays in a linear addressing mode, such that threads do not necessarily copy their assigned cells. The linear scheme also reduces cache bank conflicts more efficiently.
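For contrast with the linear addressing scheme listed in Section 4.2, the direct (conditional) scheme described in the caption would look roughly like the following fragment; it is an illustrative sketch in the same fragment style as the other listings, not code used in the benchmarks.

// Sketch of the direct cache-copy scheme: each thread copies its own
// interior cell, and boundary threads are singled out by conditionals to
// fill the ghost faces.  The branches diverge within warps and are
// therefore partially serialized.
int myS = (tx + 1) + SHARED_ARRAY_X * ((ty + 1) + SHARED_ARRAY_Y * (tz + 1));
int my  = (iL + tx + 1) + GRID_NX * ((jL + ty + 1) + GRID_NY * (kL + tz + 1));

g11S[myS] = g11[my];                  // interior cell of this thread

if (tx == 0)                          // ghost face on the low-x side
    g11S[myS - 1] = g11[my - 1];
if (tx == THREADS_PER_BLOCK_X - 1)    // ghost face on the high-x side
    g11S[myS + 1] = g11[my + 1];
// ... analogous conditionals for the y and z faces, edges and corners ...

__syncthreads();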

After the partial derivatives have been obtained using the cached data, all local operations are performed (calculation of helper variables, right hand sides, and time update). The results can be written directly into device memory, but, as a final improvement, we can reuse the shared memory space by first writing the evolved variables into it, and then rewriting the results back to device memory in an extra set of grouped instructions. A direct write would look like

g11[index] = (local instructions)
g12[index] = (local instructions)
...
K33[index] = (local instructions)

whereas a delayed write is obtained by

g11S[indexS] = (local instructions)
g12S[indexS] = (local instructions)
...
K33S[indexS] = (local instructions)

// Coalesced write-out
g11[index] = g11S[indexS];
g12[index] = g12S[indexS];
...
K33[index] = K33S[indexS];

It would seem that the latter code contains more instructions and is therefore at a disadvantage to the former one; however, in the latter case, the compiler can coalesce the write instruction across all threads in a warp, and since the device memory operations are more expensive by orders of magnitude than all other operations, this can lead to a gain in performance.

In summary, the parallelization scheme on the CUDA device involves:

1. Identifying the most computationally intensive and parallelizable parts of the code. In the case discussed here, this is the evolution step inside the main loop.

2. Restructuring the parallel problem in units which correspond to the block/thread scheme used by CUDA, with particular note of the warp size and the register bank optimization requirements.

3. In the compute kernel, implementing a cache algorithm using the shared memory. Since the minimum effective number of threads is limited by the warp size, there may be trade-offs involved between the number of concurrent threads on a multiprocessor and the number of variables to cache.

4. Implementing an efficient scheme to copy the data from device memory to the cache in a thread-based order, with as few memory bank conflicts as possible, and by using coalesced device memory operations.

5. Synchronizing all threads in a block after the cache operation.

6. Performing all local operations, and writing the results back into shared memory.

7. Writing the shared memory data back into device memory in a coalesced operation.

5. PERFORMANCE RESULTS

The performance measurements were all conducted on the NVIDIA Quadro Plex 1000 VCS cluster qp at NCSA. The cluster consists of one login node and 16 compute nodes with two dual-core AMD Opteron 2.4 GHz processors, 8 GB of main memory, and four NVIDIA Quadro FX 5600 GPUs with 1.5 GB frame buffer memory each, making a total of 6 GB.

For comparison purposes, we measure the speed-up provided by one Quadro GPU compared to a serial code running on the CPU (which consequently only uses one core of the CPU). Both codes use single-precision floating point arithmetic, and are compiled using the GNU C compiler version 4.1.2 (revision 20070626), and the nvcc compiler provided by CUDA version 1.0. In both cases, level three optimizations have been performed.

The first benchmark is illustrated in Fig. 5. The codes operate on a 128×128×128 Cartesian grid (130×130×130 including boundary ghost zones), and perform 100 iterations. The boundary data is static, and analysis and output operations are not performed during the measuring cycle. Only the main evolution is measured, which excludes allocation, deallocation and device setup operations; those take roughly 3 seconds in total for CUDA and diminish in relation to the main loop for more iterations.

As the diagram shows, speed-ups of up to 23 can be achieved with a cached algorithm and coalesced write-out to the frame buffer. Not using the data-parallel cache results in many frame buffer accesses during the evaluation of the finite differences, inducing large latencies which can only partially be hidden by the thread scheduler due to the large register count needed by each thread.

The effect of the grid size is shown in Fig. 6. This demonstrates that larger local grid sizes lead to higher speed-ups. The frame buffer on the Quadro FX 5600 (1.5 GB) can store up to about 230×230×230 cells for this particular computational problem; for these problem sizes, speed-ups of about 26 can be achieved. This is a useful result for parallelizing larger problems across multiple nodes (via MPI), since the ratio of the number of the boundary ghost zones with respect to the internal cells diminishes with grid size, leading to a reduction of the communication overhead relative to computational operations in the interior.
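As a rough consistency check of this number (the per-cell storage is an assumption on my part: the 12 to 16 ADM grid functions plus an equal number of Runge-Kutta scratch levels, i.e. about 32 single-precision values per cell), 32 × 4 bytes = 128 bytes per cell, and 1.5 GB / 128 bytes ≈ 1.2 × 10^7 cells, i.e. a cube of roughly 230^3 cells, consistent with the figure quoted above.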

Figure 5: Speed-up of the general relativistic evolution code on an NVIDIA Quadro FX 5600 GPU, compared to the serial code running on an AMD Opteron 2.4 GHz. The code performs 100 iterations on a Cartesian grid with 128×128×128 cells. The left-most bar is the reference result on the host CPU, with a wall clock time of 430.3 seconds normalized to a speed-up of 1. The Stage Ia result is for the GPU code using a column decomposition (cf. Section 4), and the Stage Ib result is using a block decomposition, both without making use of the data parallel cache. Stage IIa is using the data parallel cache for read operations, and Stage IIb for read operations and a coalesced write operation.

Figure 6: Speed-up for different grid sizes. Each grid size is separately compared to the serial CPU code, which gives the normal sample for comparison. The GPU parallelization is more efficient for larger grids, leading to speed-ups of up to 26.

Figure 7: Speed-up with different caching strategies to accommodate more grid variables than fit into the data parallel cache. The reference point is the Stage IIb result for a 128×128×128 grid. Reducing the block size, i.e. the number of threads per block, increases the available shared memory per thread and thus allows storing more variables. However, it also induces a significant performance hit, as shown in the figure. Another approach is to treat all variables separately for purposes of taking finite differences, instead of caching all in advance. This still introduces an increased overhead, but performs better than the former option.

As mentioned in Section 4, the Stage IIb parallelization scheme caches all local grid variables by combined read operations. If the available shared space is not sufficient to store all variables, one could either (i) reduce the block size, or (ii) keep the block size, but read the variables separately. The latter option is possible because the finite difference operators are decoupled, and the local operations only act on the results stored in the registers. Fig. 7 shows the effect of these choices on the GPU code performance. In comparison to the Stage IIb 128×128×128 reference point with a speed-up of 23, reducing the block size to the warp size (32) results in a performance reduction of about 45%, while the individual caching of variables reduces performance by about 32%. Above all else, this shows how sensitively the actual speed-up depends on the memory access scheme.

6. DISCUSSION AND OUTLOOK

The benchmark results in Section 5 clearly demonstrate that a GPU can potentially enhance the computation of finite difference codes solving Einstein's field equations by significant amounts. It should be noted that the architecture under review here is limited to single-precision floating point operations, and it is expected that actual speed-ups in future GPUs performing double-precision operations are lower. However, even a speed-up of the order of 10 is quite useful for practical purposes, since it allows one to increase productivity and reduce turn-around times for test problems by an order of magnitude. Also, future parallel supercomputers may include GPU hardware which should be taken advantage of.

The speed-ups measured here have compared a serial code to a single-GPU parallel code. Clearly, the actual speed-up in a particular workstation setup would effectively compare CPU-parallelized (e.g. OpenMP, MPI) against GPU-parallelized situations, or even combinations where CPUs and GPUs are used at the same time. The ratio of CPU cores to GPUs is actually one in the cluster we have been using, so the general ratio of GPU vs CPU code should be of a similar order of magnitude, assuming the synchronization between the GPUs does not turn out to be overly expensive.

Another scenario is to use GPUs for the grid evolution code, while off-loading tasks which could potentially be done asynchronously (analysis, output) to the host CPUs. In this case, both resources could be used effectively at the same time, and the associated speed-ups compared to a pure CPU code would be even higher. All this requires copying data from the frame buffer to the main memory, however, which may turn out to be an additional bottleneck in the code's operation.

A problem with porting codes to GPUs is that a certain amount of expertise and experimentation is required from the researcher, and the efficiency of the parallel code is not always easy to predict. Also, there are tight memory constraints on the multiprocessors, which will become even more important for double-precision floating point operations. Since the performance of the code depends strongly on the memory access scheme as demonstrated above, and since limitations in memory determine the list of available caching schemes, there is a non-trivial influence of the particular finite-differencing problem on the solution algorithm.

The importance of middleware solutions which are able to efficiently use GPUs and clusters of combined CPUs/GPUs is therefore expected to increase, since they can hide most of the data structures and optimization details from application scientists or business users. The Cactus Computational Toolkit [4, 3, 9] is an example of a middleware which already provides abstractions for MPI-parallelized scientific codes and is being used widely in production work. It is planned to extend Cactus in a way that it can make use of combined CPU/GPU clusters with high performance, while providing the end user with a simple and unified interface to solve problems. In more general terms, a middleware like Cactus is useful to abstract different hardware architectures and even programming paradigms from the scientific problem at hand.

7. ACKNOWLEDGMENTS

The author would like to thank Gabrielle Allen, Daniel Katz, John Michalakes, and Erik Schnetter for discussion and comments. Calculations have been performed on the Quadro Plex 1000 VCS cluster at NCSA, with special thanks to Jeremy Enos for providing timely support with this machine.

8. REFERENCES

[1] R. Arnowitt, S. Deser, and C. W. Misner. The dynamics of general relativity. In L. Witten, editor, Gravitation: An Introduction to Current Research, pages 227-265. John Wiley, New York, 1962.

[2] T. W. Baumgarte and S. L. Shapiro. On the numerical integration of Einstein's field equations. Phys. Rev. D, 59:024007, 1999.

[3] Cactus Computational Toolkit home page, http://www.cactuscode.org/.

[4] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In Vector and Parallel Processing - VECPAR'2002, 5th International Conference, Lecture Notes in Computer Science, Berlin, 2003. Springer.

[5] L. Lindblom, M. A. Scheel, L. E. Kidder, R. Owen, and O. Rinne. A new generalized harmonic evolution system. Class. Quantum Grav., 23:S447-S462, 2006.

[6] T. Nakamura, K. Oohara, and Y. Kojima. General relativistic collapse to black holes and gravitational waves from black holes. Prog. Theor. Phys. Suppl., 90:1-218, 1987.

[7] NVIDIA. NVIDIA GeForce 8800 GPU Architecture Overview, 2006.

[8] NVIDIA. CUDA Programming Guide, Version 1.0, 2007.

[9] J. Shalf, E. Schnetter, G. Allen, and E. Seidel. Cactus as benchmarking platform. CCT Technical Report Series, CCT-TR-2006-3, 2006.

[10] M. Shibata and T. Nakamura. Evolution of three-dimensional gravitational waves: Harmonic slicing case. Phys. Rev. D, 52:5428, 1995.

[11] R. M. Wald. General Relativity. The University of Chicago Press, Chicago, 1984.
