A general relativistic evolution code on CUDA architectures

Burkhard Zink Center for Computation and Technology Louisiana State University Baton Rouge, LA 70803, USA

Abstract I describe the implementation of a finite-differencing code for solving Einstein’s field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on an AMD Opteron 2.4 GHz. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.

1 Introduction

The high parallel processing performance of graphics processing units (GPUs), with current models achieving peak performances of up to 350 GFlop/s (for single-precision floating-point operations), has traditionally been used to transform, light and rasterize triangles in three-dimensional computer graphics applications. In recent architectures, however, the vectorized pipeline for processing triangles has been replaced by a unified scalar processing model based on a large set of stream processors [7]. This change has initiated a consideration of GPUs for solving general-purpose problems, and triggered the field of general-purpose computing on graphics-processing units (GPGPU).

High-performance, massively parallel computing is one of the major tools for the scientific community to understand and quantify problems not amenable to traditional analytical techniques, and has led to ever-increasing hardware performance requirements for tackling more and more advanced questions. Therefore, GPGPU appears to be a natural target for scientific and engineering applications, many of which admit highly parallel algorithms which are already used on current-generation supercomputers based on multi-core CPUs.

In this technical report, I will describe an implementation of one of the most challenging problems in computational physics, solving Albert Einstein’s field equations for general relativistic gravitation, on a graphics processing unit. The primary purpose is to make an estimate of potential performance gains in comparison to current CPUs, and to gain an understanding of architectural requirements for middleware solutions serving the needs of the scientific community, most notably the Cactus Computational Toolkit [4, 3, 9]. The particular GPU used for these experiments is an NVIDIA G80 series card (Quadro FX 5600). NVIDIA has also released a software development kit called CUDA (compute unified device architecture) [8] for development of GPU code using an extension of the C language. As opposed to earlier attempts to program GPUs with the languages supplied for graphics applications, this makes it easier to port existing general-purpose computation code to the target device.

Section 2 contains a description of the G80 hardware, the CUDA architectural model, and the performance considerations important for GPU codes. Then, in Section 3, we will turn to the particular problem of solving Einstein’s field equations, which we will approach from a mostly algorithmic (as opposed to physical) point of view. Section 4 describes the structure and implementation of the code, and Section 5 discusses the benchmarking results obtained. I give a discussion of the results, and an outlook, in Section 6.

2 CUDA and the G80 architecture

2.1 The NVIDIA G80 architecture

The NVIDIA G80 hardware [7] is the foundation of the GeForce 8 series consumer graphics cards, the Quadro FX 4600 and 5600 workstation cards, and the new Tesla 870 set of GPGPU boards. G80 represents the third major architectural change for NVIDIA’s line of graphics accelerators. Traditionally, GPU accelerators enhance the transformation and rendering of simple geometric shapes, usually triangles. The processing pipeline consists of transformation and lighting (now vertex shading), triangle setup, pixel shading, raster operations (blending, z-buffering, anti-aliasing) and the output to the frame buffer for scan-out to the display. First and second generation GPU architectures typically process these steps with special-purpose hardware in a pipelined fashion.

The increase in demand for programmability of illumination models and geometry modifiers, and also load-balancing requirements with respect to vertex and pixel shading operations, has led to more generality in GPU design. The current G80 architecture consists of a parallel set of stream processors with a full set of integer and (up to FP32) floating-point instructions. When processing triangles and textures, the individual steps of the processing pipeline are dynamically mapped to these processors, and since the operations are highly parallelizable, the scheduler can consistently maintain a high load.

Physically, eight stream processors (SP) are arranged in a multiprocessor with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache.

Number of multiprocessors (MP): 16
Number of stream processors per MP: 8
Warp size (see text): 32
Parallel data cache: 16 kB
Number of banks in parallel data cache: 16
Number of 32-bit registers per MP: 8192
Clock frequency of each MP: 1.35 GHz
Frame buffer memory type: GDDR3
Frame buffer interface width: 384 bits
Frame buffer size: 1.5 GB
Constants memory size: 64 kB
Clock frequency of the board: 800 MHz
Host bus interface: PCI Express

Table 1: Technical specifications of a Quadro FX 5600 GPU.

Each multiprocessor is operated by an instruction decoding unit which executes a particular command in a warp: the same command is executed on all SPs for a set of clock cycles (because the instruction units and the SP ALUs run at different clock speeds). This constitutes a minimal unit of SIMD computation on the multiprocessor, called the warp size, and will be important later when considering code efficiency on the architecture.

The GPU card contains a set of such multiprocessors, with the number depending on the particular model (e.g., one in the GeForce 8400M G, 12 in the GeForce 8800 GTS, and 16 in the GeForce 8800 GTX, Quadro FX 5600 and Tesla 870 models). The multiprocessors are operated by a scheduling unit with fast switching capabilities. In addition, the board has frame buffer memory and an extra set of memory for constants. The particular numbers for the Quadro FX 5600 are listed in Table 1, and Fig. 1 shows a diagram of the architecture (cf. also [7]).

The actual peak performance of the card depends on how many operations can be performed in one cycle. The stream processors technically support one MAD (multiply-add) and one MUL (multiply) per cycle, which would correspond to 1.35 GHz * 3 * 128 = 518.4 GFlop/s. However, not all execution units can be used simultaneously, so a conservative estimate is to assume one MAD per cycle, leading to a peak performance of 345.6 GFlop/s.

2.2 CUDA (compute unified device architecture)

Since the G80 is based on general purpose stream processors with a high peak performance, it appears to be a natural target for general-purpose parallel computing, and in particular scientific applications. NVIDIA has recently released the CUDA SDK [8] for running parallel computations on the device hardware. While earlier attempts to use GPUs for general-purpose computing had to use the various shading languages, e.g. Microsoft’s HLSL or NVIDIA’s Cg, CUDA is based on an extension of the C language.

Figure 1: Simplified diagram of the G80 architecture. The GPU contains several multiprocessors (MPs) which are operated by a common thread scheduler with fast switching capabilities. Each multiprocessor can run several threads using stream processors (SPs) which share a common instruction unit, a set of registers and a data parallel cache. A memory bus connects the MPs to the frame buffer memory and a constants memory.

The SDK consists of a compiler (nvcc), host and device runtime libraries, and a driver API. Architecturally, the driver builds the primary layer on top of the device hardware, and provides interfaces which can be accessed either by the CUDA runtime or the application. Furthermore, the CUDA runtime can be accessed by the application and the service libraries (currently for BLAS and FFT).

CUDA’s parallelization model is a slight abstraction of the G80 hardware. Threads are arranged into blocks, where each block is executed on only one multiprocessor. Therefore, within a block, additional thread context and synchronization options exist (by use of the shared resources on the chip), whereas no global synchronization is available between blocks. A set of blocks constitutes a SIMD compute kernel. Kernel calls themselves are asynchronous to the host CPU: they return immediately after issuance. Currently only one kernel can be executed at any time on the device, which is a limitation of the driver software.

The thread-block model hides the particular number of multiprocessors and stream processors from the CUDA kernel insofar as the block grid is partially serialized into batches. Each multiprocessor can execute more than one block and

more threads than the warp size per multiprocessor to hide device memory access latency. However, the limitations of this model are implicit performance characteristics: if the number of blocks is lower than the number of multiprocessors, or the number of threads is lower than the warp size, significant performance degradation will occur. Therefore, the physical configuration is only abstracted insofar as the kernel compiles and runs on different GPU configurations, but not in terms of achieving maximal performance.

Within the kernel thread code, the context provides a logical arrangement of blocks in one- or two-dimensional, and threads per block in one-, two-, and three-dimensional sets, which is convenient for the kind of numerical grid application we will present below since it avoids additional (costly) modulo operations inside the thread. Threads support local variables and access to the shared memory (which maps to the data parallel cache in the G80), which is common to all threads within a block.

The kernel is called from the host code via a special syntax which specifies the block grid size and threads per block, and can be synchronized back to the host via a runtime function. Therefore, the host CPU can perform independent computations or issue additional kernels while the GPU operates. This may be interesting for delayed analysis and output processes performed on the data, although the bandwidth for transferring data from the frame buffer (where grid variables will be stored for evolution) to the host memory has to be taken into account.
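
To make the block/thread arrangement and the launch syntax concrete, the following minimal sketch (not part of the evolution code; all names are illustrative) shows a two-dimensional block grid with three-dimensional thread blocks, an asynchronous kernel launch, and host-side synchronization. cudaThreadSynchronize was the synchronization call in the CUDA 1.0 runtime; cudaDeviceSynchronize is its modern equivalent.

#include <cuda_runtime.h>

// Illustrative kernel: each thread writes its own global cell index.
// blockIdx is two-dimensional and threadIdx three-dimensional, matching the
// grid decomposition described in the text.
__global__ void fill_indices(int *out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = threadIdx.z;                          // one z slab per launch here
    int idx = i + nx * (j + ny * k);
    out[idx] = idx;
}

int main()
{
    const int nx = 32, ny = 32, nz = 4;
    int *d_out;
    cudaMalloc(&d_out, nx * ny * nz * sizeof(int));

    dim3 threads(8, 8, nz);                       // threads per block (3d)
    dim3 blocks(nx / threads.x, ny / threads.y);  // block grid (2d)

    // The launch returns immediately; the host could do other work here.
    fill_indices<<<blocks, threads>>>(d_out, nx, ny);

    // Block the host until the kernel has finished.
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}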

2.3 Performance optimization considerations

To achieve high performance of the GPU for a particular application problem, often approximately measured in terms of a speed-up compared to a serial code on the host CPU, the kernel would usually encompass those operations which are highly data-parallelizable, due to the SIMD concept of CUDA.

An efficient port needs to maximize the ratio of arithmetic over device memory operations, the arithmetic intensity, since access to the frame buffer involves large latencies (up to 600 cycles on the G80) because it is not cached. One common strategy is to first copy selected parts of the device memory which are needed by all threads to the shared memory, then operate on them, and finally write the results back to device memory. At the same time, many threads can be started concurrently on the multiprocessors to hide the access latencies.

Each block issued to a multiprocessor reserves memory for the local variables of all threads in the block, and shared memory according to the allocation made in the kernel. The number of blocks and threads running concurrently on the multiprocessor is limited by the shared memory space (16 kB on the Quadro FX 5600), which implies a trade-off between hiding device memory access latency by multi-threading and effective cache size per thread.

Since threads are operated on in warps, i.e. each of the stream processors in a multiprocessor operates sequentially on batches of SIMD instructions, the resulting warp size is the minimum number of threads per block for efficient parallelization, and the actual number should be an integer multiple of it. In

addition to the upper bound on the cache size imposed by concurrent issuance of blocks, this gives a problem-dependent lower bound on the number of threads in a block. NVIDIA also states [8] that the optimal number of threads per block is actually 64 on the G80 architecture, to allow the nvcc compiler to reduce the number of register bank conflicts in an optimal way.

On a G80 card, actual FP32 operations like multiply (MUL) and multiply-add (MAD) cost 4 cycles per warp, i.e. one cycle per thread, reciprocals cost 16 cycles, and for certain mathematical functions like sine and cosine microcode is available which performs the evaluation in 32 cycles. Control flow instructions are to be avoided if possible, since in many cases they may have to be serialized.

Accessing shared memory by threads should avoid bank conflicts to ensure maximal bandwidth. This requires the threads of a warp to read and write to the shared memory in a certain way, since consecutive addresses (e.g. obtained from a base address and an offset given by the thread id) are distributed into several independent partial warps for maximum bandwidth. There is, however, also a broadcast mechanism which can distribute data from one bank to all threads in the warp. Finally, accessing device memory can be optimized by coalescing operations, usually again in the form of a base address plus a thread-dependent offset. In addition, the base address should be aligned to a multiple of the half-warp size times the type size in bytes for coalescing the read/write operation.

All these performance requirements come into play when optimizing a code for high speed-ups. The experiments below will demonstrate that the actual speed-up can be very sensitive to the particular implementation of the memory management. While non-optimized implementations for problems with high arithmetic intensity may already achieve a significant speed-up, full use of the GPU’s floating-point performance requires some thought. We will see that the nature of the problem also imposes trade-offs between different performance requirements which strongly affect the maximum speed-up, so that e.g. comparing even implementations of different finite-difference problems can produce completely different speed-ups, since there is a trade-off between local cache space (i.e. number of grid variables) and the number of threads in a block.
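
The staging pattern described above — load a tile into shared memory, synchronize, compute, write back — can be illustrated by the following minimal one-dimensional sketch (illustrative only, with hypothetical names; the evolution code in Section 4 applies the same idea in three dimensions):

#include <cuda_runtime.h>

#define BLOCK 64   // a multiple of the warp size, as recommended in the text

// Illustrative 1d three-point average with a shared-memory stage. Each block
// loads its tile plus one halo cell on each side, synchronizes, computes from
// the cache, and writes the result back with thread-contiguous accesses.
// Launch as: smooth1d<<<n / BLOCK, BLOCK>>>(d_in, d_out, n);  (n a multiple of BLOCK)
__global__ void smooth1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];

    int i = blockIdx.x * BLOCK + threadIdx.x;   // global index
    int s = threadIdx.x + 1;                    // index into the tile

    tile[s] = in[i];                            // coalesced load
    if (threadIdx.x == 0)                       // left halo cell
        tile[0] = (i > 0) ? in[i - 1] : in[i];
    if (threadIdx.x == BLOCK - 1)               // right halo cell
        tile[BLOCK + 1] = (i < n - 1) ? in[i + 1] : in[i];

    __syncthreads();                            // cache is complete

    out[i] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;  // coalesced store
}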

3 The algorithm to solve Einstein’s field equations

In its best-known form, Einstein’s field equation for general relativity can be written as

Gµν = 8πTµν (1)

where µ, ν = 0 ... 3. Gµν are the sixteen components of the Einstein tensor, and Tµν are the sixteen components of the energy-momentum tensor. Both functionally depend on the spacetime metric tensor gµν, which describes the local

curvature of spacetime and is therefore the central object of general relativity, since curvature relates to physical effects like light bending and the attraction of bodies by gravitation.

We will, however, not be concerned with the physical interpretation of these equations here (for an introduction see [11]), but only with the requirements to formulate a finite-differencing evolution algorithm from them. The field equations as formulated in eqn. 1 can be transformed into a regular initial-boundary value problem (i.e., a set of partial differential equations in time and space) for twelve variables: the six components of the three-metric, gij, i, j = 1 ... 3, and the six components of the extrinsic curvature, Kij. In addition, the equations contain the four free gauge functions α and βi as parameters, which are usually also treated as evolutionary variables.

The usual approach to solving these equations proceeds as follows (a minimal sketch of steps 5 and 6 is given after this list):

1. Define a spatial domain to solve on, e.g. give a range of coordinates [xlow, xhigh] × [ylow, yhigh] × [zlow, zhigh] on which to solve the problem.

2. Discretize the domain in some appropriate way. We will only discuss uniform Cartesian discretizations here (also known as uni-grids), but there are supercomputer implementations which use adaptive mesh refinement and multiple blocks, e.g. in the Cactus Computational Toolkit.

3. On each discrete cell, specify the initial data by setting values for the evolutionary variables. Usually, a set of cells directly outside of the domain, called ghost zones, is used for setting boundary conditions.

4. The evolution proper proceeds by a loop over time steps. In each time step, the right hand side of the equation ∂tA(x, y, z) = RHS needs to be evaluated, where A is a grid variable and RHS is a function which depends on A and other grid variables, their spatial derivatives, and potentially free parameters and the coordinate location.

5. Since the right hand side depends on the spatial derivatives, finite-differencing operations have to be performed on the grid. This usually involves some local stencil of variables around the cell, e.g. ∂x A^i ≈ (A^{i+1} − A^{i−1}) / (2∆x) for the second-order accurate central approximation to the first x derivative of A at position i.

6. Having obtained the right hand side, the set of evolution equations is advanced with a technique to discretely approximate ordinary differential equations, e.g. a Runge-Kutta step. This obtains the values of grid functions at time t + ∆t from their values at t. The Runge-Kutta algorithm generates partial time approximations during the course of its operation, and therefore requires a set of scratch levels having the size of the main grid variables.

7. Finally, new boundary conditions need to be imposed after the time update. This usually involves operations on the ghost zones and their immediate neighbors, e.g. by extrapolation.
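
As a minimal sketch of steps 5 and 6, the following C fragment applies a second-order central difference and a second-order Runge-Kutta (midpoint) update to a single grid function in one dimension. It is illustrative only: the function names and the placeholder right hand side are not taken from the actual code, which evolves the full coupled ADM system in three dimensions and swaps pointers between grid and scratch storage instead of updating in place.

/* second-order accurate central first derivative at position i */
static float dx_central(const float *u, int i, float dx)
{
    return (u[i + 1] - u[i - 1]) / (2.0f * dx);
}

/* placeholder right hand side: a simple advection term, -du/dx */
static float rhs(const float *u, int i, float dx)
{
    return -dx_central(u, i, dx);
}

/* One second-order Runge-Kutta (midpoint) step: a half step into scratch
   storage, then a full step using the right hand side at the half-step data. */
static void rk2_step(float *u, float *scratch, int n, float dx, float dt)
{
    int i;
    for (i = 1; i < n - 1; i++)                        /* interior cells only */
        scratch[i] = u[i] + 0.5f * dt * rhs(u, i, dx); /* half step           */
    scratch[0] = u[0];                                 /* static boundaries   */
    scratch[n - 1] = u[n - 1];
    for (i = 1; i < n - 1; i++)
        u[i] += dt * rhs(scratch, i, dx);              /* full step           */
}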

The actual representation of Einstein’s field equations used here is the so-called ADM formalism (after Arnowitt, Deser and Misner [1]). This is not a commonly used method nowadays, since it tends to produce numerical instabilities, but it will suffice for demonstrating how to port such a code to a CUDA environment, and it is a good choice for a prototype implementation due to its simplicity. More advanced schemes (NOK-BSSN [6, 10, 2] and the generalized harmonic formalism [5]) contain more variables, which has consequences for the available set of caching schemes (see below), but should fundamentally be portable to CUDA in a similar manner. The particular test cases below operate on a simple dynamical test problem, a gauge wave, and use static boundaries.
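
For illustration, one possible host-side layout of the evolved variables is sketched below: one flat single-precision array per grid function, indexed as i + nx*(j + ny*k), which matches the index arithmetic used by the CUDA kernel in Section 4.2. The struct wrapper and its field names are hypothetical; the actual code may organize its storage differently.

#include <stddef.h>

/* Hypothetical container for the evolved ADM variables. */
typedef struct {
    int nx, ny, nz;                    /* grid size including ghost zones   */
    float *g11, *g12, *g13,            /* three-metric components           */
          *g22, *g23, *g33;
    float *K11, *K12, *K13,            /* extrinsic curvature components    */
          *K22, *K23, *K33;
    float *alpha;                      /* lapse gauge function              */
    float *beta1, *beta2, *beta3;      /* shift gauge functions             */
} adm_grid;

/* Linear cell index, x running fastest (cf. the kernel code in Section 4.2). */
static size_t cell_index(const adm_grid *g, int i, int j, int k)
{
    return (size_t)i + (size_t)g->nx * ((size_t)j + (size_t)g->ny * (size_t)k);
}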

4 Implementation of Einstein’s field equations in CUDA

While it is possible to develop a GPU code directly, there are advantages to first implementing a stand-alone CPU code in C, and then porting it to CUDA. In particular, debugging the host code is easier, and it will yield a fairer assessment of speed-up as compared to a device code in emulation mode¹.

This CPU code is a C language implementation of the algorithm mentioned in Section 3 for three spatial dimensions, with a second-order Runge-Kutta scheme for time integration and second-order accurate first and second central finite-differencing operators. The actual right hand side was extracted from a particular implementation of the ADM system in Cactus. The code performs these steps:

1. Allocate memory for all grid variables, and the same amount for the scratch space variables needed by the Runge-Kutta scheme.

2. Write initial data into the grid variables. The exact data is irrelevant unless it produces NaNs by discrete instabilities during the evolution. The data used here is a Gaussian gauge pulse in the x direction.

3. Perform the main loop:

   (a) Swap the pointers between the grid variables and their scratch counterparts.

   (b) Call the evolution function. This function loops over all grid cells (three nested loops, one for each direction), and in each cell (i) calculates all partial derivatives by finite differencing in second order, (ii) calculates a number of additional temporary variables only needed locally, and (iii) writes the evolved grid functions into the scratch space.

   (c) The swapping and evolution is repeated, now for the second half step involved in the RK2 scheme.

   (d) Output the evolved data to the disk.

4. Release allocated storage.

The only relevant target for parallelization is the evolution routine inside the main loop, since it consumes most of the wall clock time in almost all situations (unless there is 3d output at every time step), and because it naturally lends itself to parallel computation. Therefore, we will take a closer look at its operations.

Einstein’s field equations are some of the most complex classical field equations in nature, and therefore codes solving them naturally have a high arithmetic intensity. For evaluating a single right hand side, hundreds of floating point operations act on only a few grid functions in device memory and their locally obtained partial derivatives. Therefore, as soon as the partial derivatives are obtained, parallelization can easily proceed by assigning each cell to a single thread and performing the evaluation locally. The resulting write operations to device memory only involve that particular cell and therefore do not produce concurrency conflicts (though they may produce bank conflicts).

For the finite difference evaluation, each cell accesses its immediate neighbors, reads grid variables from those cells, operates on them, and writes the result into local variables which are only defined inside the innermost loop. While it is possible to calculate all finite differences outside of the loop, in an extra step, to logically decouple the (semantically different) finite difference and evolution operations, this implies multiplying the required storage by more than a factor of four². Translated to a CUDA device, this would involve additional operations on device memory and therefore reduce performance.

This approach can be translated directly into a GPU device code by applying these changes to the CPU implementation (a host-side sketch follows the list):

1. Allocate the grid functions on the device memory (frame buffer) in addition to allocating them on the host memory. The host memory allocation is still useful for setting up initial data and output of data to the disk.

2. After writing initial data on the host memory, copy it to the device memory using runtime functions.

3. In the main loop, perform the swap operations on the device memory pointers.

4. Replace the evolution function by a CUDA kernel which is distributed to the device multiprocessors.

5. Since every evolution half-step depends on results from the previous one, synchronize to the CUDA device after calling the kernel. This time span could potentially also be used for asynchronous output.

6. If output is requested in a particular iteration, copy the relevant grid variables back to host memory, and then perform the output as usual.

¹ The CUDA C compiler nvcc can compile code which actually executes on the host processor, mostly for debugging purposes.

² Three additional variables for each directional derivative, and storage for second derivatives for some variables.
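
A host-side sketch of these changes is given below, under the assumption of a single representative grid function and illustrative names; the real code handles all grid functions the same way and calls the kernel once per slab. The kernel body is a placeholder standing in for the actual ADM right hand side evaluation.

#include <cuda_runtime.h>

/* Placeholder kernel: the real kernel evaluates the ADM right hand side. */
__global__ void evolve_kernel(const float *in, float *out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = threadIdx.z;
    int idx = i + nx * (j + ny * k);
    out[idx] = in[idx];                        /* stand-in for the actual update */
}

void evolve_on_gpu(float *h_g11, int nx, int ny, int nz, int iterations)
{
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *d_g11, *d_scratch;

    cudaMalloc(&d_g11, bytes);                 /* grid variable in the frame buffer */
    cudaMalloc(&d_scratch, bytes);             /* Runge-Kutta scratch level         */
    cudaMemcpy(d_g11, h_g11, bytes, cudaMemcpyHostToDevice);    /* initial data     */

    dim3 threads(4, 4, 4);                     /* cf. the Stage 2 block size        */
    dim3 blocks(nx / 4, ny / 4);               /* one z slab shown; the actual code
                                                  launches the kernel per slab      */
    for (int it = 0; it < iterations; it++) {
        for (int half = 0; half < 2; half++) { /* the two RK2 half steps            */
            evolve_kernel<<<blocks, threads>>>(d_g11, d_scratch, nx, ny);
            cudaDeviceSynchronize();           /* next half step needs the results  */

            float *tmp = d_g11;                /* swap device pointers so that the  */
            d_g11 = d_scratch;                 /* newest data is the next input     */
            d_scratch = tmp;
        }
        /* if output is requested in this iteration:
           cudaMemcpy(h_g11, d_g11, bytes, cudaMemcpyDeviceToHost); and write to disk */
    }

    cudaFree(d_g11);
    cudaFree(d_scratch);
}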

4.1 Stage 1 parallelization

The CUDA kernel function could in principle be ported from the CPU code by using this scheme: divide any two-dimensional orthogonal projection of the three-dimensional Cartesian grid, i.e. any of the planes (x, y), (x, z) or (y, z), into independent blocks of the same size. Each thread in the block then operates on a one-dimensional slab in the remaining direction, i.e. the three nested loops in the evolution code are replaced by one loop over the remaining direction. The cell index calculation needed in the kernel is made easy by CUDA’s support of two-dimensional block and thread indices, which directly correspond to the two-dimensional domain decomposition. With this algorithm, the problem is equally and independently parallelized, and were it not for the additional considerations imposed by the memory access latencies, the approach would already be optimal. For future reference, I will call it a Stage 1 approach, since it is a direct translation of the CPU code which merely reduces the appropriate number of nested loops, and since it does not take advantage of the data parallel cache of the G80 multiprocessors. Fig. 2 illustrates the parallelization technique; a sketch of such a kernel is given at the end of this subsection.

A first efficiency improvement for the G80 can be obtained by adjusting the block size, i.e. the number of threads in each block, to a multiple of 64³, which is a multiple of the warp size (a strong requirement for high performance) and also allows the compiler to avoid register conflicts. Even without considering shared memory at this stage, this puts a significant constraint on how many registers are actually available per thread. On the G80, there are 8192 registers per multiprocessor, i.e. we can store 128 32-bit words locally per thread. The partial derivatives and other helper variables used in the ADM code already need about 100 per thread. More complicated codes may need to either (i) reduce the number of local helper variables at the cost of increased arithmetic operations, or (ii) reduce the block size. Usually, experiments need to be done to establish which option is preferable in practice. It would seem intuitive that reducing the number of threads to the warp size is better, but the high latency of device memory accesses can be hidden much more effectively by more threads, which can therefore easily outweigh the additional local arithmetic cost involved in implicitly repeating operations usually mapped to helper variables.

Instead of operating on one-dimensional columns, it is also possible to decompose the grid into cubes (as is usual for MPI parallelization schemes), and therefore have each thread evolve exactly one cell. Since the block decomposition by CUDA is logically either one- or two-dimensional, there are two options for doing this: (i) use the block decomposition as before, but start a thread for each cell separately, or (ii) let the kernel only operate on a grid slab of defined thickness and call it repeatedly from the host. Option (i) is impractical due to the limitations of available registers, so we are left with option (ii), which is illustrated in Fig. 3. From the onset, it is unclear how this compares to the column decomposition used before. However, the Stage 2 parallelization will need to operate on blocks as opposed to columns, as discussed below.

³ This restricts the possible grid sizes to a multiple of 8 in each direction of the two-dimensional decomposition for optimal parallelization. We will later see that the best scheme actually has a multiple of 4 as a requirement.

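
The Stage 1 column scheme can be sketched as follows (a hypothetical kernel fragment: one evolved variable and a placeholder update stand in for the full ADM system):

/* Illustrative Stage 1 (column) kernel: a 2d block/thread decomposition in the
   (x,y) plane, with each thread looping over the z direction. No shared memory
   is used; all reads and writes go directly to device memory. */
__global__ void evolve_columns(const float *u, float *u_out,
                               int nx, int ny, int nz, float dx, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  /* skip the ghost zone */
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    for (int k = 1; k < nz - 1; k++) {                  /* column loop in z    */
        int idx = i + nx * (j + ny * k);

        /* second-order central x derivative, read directly from device memory */
        float du_dx = (u[idx + 1] - u[idx - 1]) / (2.0f * dx);

        /* placeholder update; the real code evaluates the full right hand side */
        u_out[idx] = u[idx] - dt * du_dx;
    }
}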

4.2 Stage 2 parallelization

So far, the kernel has not made use of the shared memory (or data parallel cache in terms of the G80 architecture). Since the frame buffer accesses are un-cached, it is necessary to implement a manual cache algorithm for this. Unfortunately, the shared memory is even more limited than the register space: 16 kB per multiprocessor on the G80, which translates into only 256 bytes per thread when using the recommended block size of 64.

The most obvious target for a caching algorithm are the finite difference operations, since they repeatedly use the same data (the value of a quantity A at cell (i, j, k) is accessed by six neighboring cells for second-order accurate first partial derivatives, and by an additional eight cells for second derivatives). To perform these operations from the cache while avoiding conditional statements in the kernel, a number of ghost zones outside the operated cube need to be stored as well. Generally, this number is smallest for cubes with equally sized edges, which suggests taking 4 × 4 × 4 as the interior cube and one ghost cell, resulting in 64 threads per block (double the warp size, and the suggested block size for efficient register usage) and 6 × 6 × 6 = 216 cells to store in the cache. Therefore, with this scheme, we can store at most 18 FP32 variables per cell. This is enough for the ADM grid functions (depending on the choice of gauge, there are 12 to 16), but not enough for more complicated schemes or more general systems. In those cases, the only option would be to reduce the size of the cube and the number of threads, which likely incurs a loss of performance.

The transfer of data from the frame buffer to the parallel cache is handled by each thread separately. Were it only for the interior cube without ghost zones, an efficient scheme would be to copy all data for exactly the cell which is operated on by this thread. However, the ghost zones also need to be set up, for which the most obvious scheme is to use conditional statements which act on the boundary surfaces. This is obviously not very efficient, since (i) many threads will not participate in these operations, and (ii) conditional statements are often serialized into the minimal instruction size (the warp size) and will therefore reduce performance even more. A better approach is this: recompute the three-dimensional thread index into a linear index, and then use integer division and subtraction operations to obtain the three-dimensional index corresponding to the cube including ghost zones; the two cache copy approaches are illustrated in Fig. 4. The operations for this are:

__shared__ CCTK_REAL g11S[SHARED_ARRAY_SIZE], ...

// Block index
bx = blockIdx.x;
by = blockIdx.y;

// Thread index
tx = threadIdx.x;
ty = threadIdx.y;
tz = threadIdx.z;

// Copy data to shared memory
iL = bx * THREADS_PER_BLOCK_X;
jL = by * THREADS_PER_BLOCK_Y;
kL = kstart - 1;
tid = tx + THREADS_PER_BLOCK_X * (ty + THREADS_PER_BLOCK_Y * tz);

// -- first read/write
o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
n = res / SHARED_ARRAY_X;
m = res - n * SHARED_ARRAY_X;
index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
g11S[indexS] = g11[index];
...

// -- second read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- third read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- fourth read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
if (tid < SHARED_ARRAY_X * SHARED_ARRAY_Y * SHARED_ARRAY_Z) {
  o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
  res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
  n = res / SHARED_ARRAY_X;
  m = res - n * SHARED_ARRAY_X;
  index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
  indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
  g11S[indexS] = g11[index];
  ...
}

Here, g11 is the grid array for the variable g11, g11S is its shared memory equivalent, and the preprocessor constants describe the size of the shared array (SHARED_ARRAY_X/Y/Z = 6), the block size (THREADS_PER_BLOCK_X/Y/Z = 4), and the global grid size (GRID_NX/NY/NZ). Now every thread copies data for, and evolves, different cells. Since there are more cells including ghost zones than threads, several such operations need to be performed (in the case of a 4 × 4 × 4 cube and a ghost size of 1, we need four cycles). Each time, the effective thread index tid is increased by the block size, so that the last read operation either (i) contains a conditional statement for excluding invalid indices, or (ii) the grid arrays are artificially enlarged to accommodate copying the data without segmentation faults. Experiments have shown that option (i) does not lead to a degradation in performance, and it is preferable due to its simplicity and reduced memory usage.

To increase coalescence for device memory operations and reduce bank conflicts for shared memory writes in these operations, it is important to operate on consecutive addresses. This is most easily done by performing the block decomposition in the (x, y) plane, since then the index ordering scheme we use accesses addresses in the form base address plus index for the shared memory (avoiding bank conflicts), and within each x direction read access to the device memory can be coalesced. An even more efficient scheme could possibly be obtained by (i) ordering device memory in a way that more reads can be coalesced at once, and (ii) adjusting the base address to a multiple of the half warp size times the FP32 size. These have not been implemented here and would require non-trivial changes to the data structures.

In summary, the parallelization scheme on the CUDA device involves:

1. Identifying the most computationally intensive and parallelizable parts of the code. In the case discussed here, this is the evolution step inside the main loop.

2. Restructuring the parallel problem in units which correspond to the block/thread scheme used by CUDA, with particular note of the warp size and the register bank optimization requirements.

3. In the compute kernel, implementing a cache algorithm using the shared memory. Since the minimum effective number of threads is limited by the warp size, there may be trade-offs involved between the number of concurrent threads on a multiprocessor and the number of variables to cache.

4. Implementing an efficient scheme to copy the data from device memory to the cache in a thread-based order, with as few memory bank conflicts as possible, and by using coalesced device memory operations.

5. Synchronizing all threads in a block after the cache operation.

6. Performing all local operations, writing the results back into shared memory, and finally to device memory (a brief sketch of the resulting per-block kernel structure follows).
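
For concreteness, the per-block structure that results from steps 3 to 6 looks roughly as follows. This is a hypothetical, heavily reduced kernel fragment: one cached variable, one derivative, and a placeholder update stand in for the full ADM right hand side, and float stands in for CCTK_REAL.

__global__ void evolve_stage2(const float *g11, float *g11_out,
                              int nx, int ny, int kstart, float dx, float dt)
{
    // 6x6x6 cache tile for a 4x4x4 interior cube plus one ghost cell per side
    __shared__ float g11S[6 * 6 * 6];

    // ... fill g11S from device memory as in the listing above ...

    __syncthreads();   // step 5: the cache must be complete before it is used

    // interior cell assigned to this thread, offset by the ghost layer
    int m = threadIdx.x + 1, n = threadIdx.y + 1, o = threadIdx.z + 1;
    int indexS = m + 6 * (n + 6 * o);

    // step 6: finite differences evaluated from the cache ...
    float dg11_dx = (g11S[indexS + 1] - g11S[indexS - 1]) / (2.0f * dx);

    // ... and the (placeholder) update written back to device memory
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = kstart + threadIdx.z;
    int index = i + nx * (j + ny * k);
    g11_out[index] = g11[index] - dt * dg11_dx;
}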

5 Performance results

The performance measurements were all conducted on the NVIDIA Quadro Plex 1000 VCS cluster qp at NCSA. The cluster consists of one login node and 16 compute nodes with two dual-core AMD Opteron 2.4 GHz processors, 8 GB of main memory, and four NVIDIA Quadro FX 5600 GPUs with 1.5 GB of frame buffer memory each (6 GB in total). For comparison purposes, we measure the speed-up provided by one Quadro GPU relative to a serial code running on the CPU (which consequently uses only one core). Both codes use single-precision floating-point arithmetic, and are compiled using the GNU C compiler version 4.1.2 (revision 20070626) and the nvcc compiler provided by CUDA version 1.0. In both cases, level-three optimizations have been applied.

The first benchmark is illustrated in Fig. 5. The codes operate on a 128 × 128 × 128 Cartesian grid (130 × 130 × 130 including boundary ghost zones), and perform 100 iterations. The boundary data is static, and analysis and output operations are not performed during the measuring cycle. Only the main evolution is measured, which excludes allocation, deallocation and device setup operations; those take roughly 3 seconds in total for CUDA and diminish in relation to the main loop for more iterations.

As the diagram shows, speed-ups of up to 23 can be achieved with a cached algorithm and coalesced write-out to the frame buffer. Not using the data-parallel cache results in many frame buffer accesses during the evaluation of the finite differences, inducing large latencies which can only partially be hidden by the thread scheduler due to the large register count needed by each thread.

The effect of the grid size is shown in Fig. 6. This demonstrates that larger local grid sizes lead to higher speed-ups. The frame buffer on the Quadro FX 5600 (1.5 GB) can store up to about 230 × 230 × 230 cells for this particular computational problem; for these problem sizes, speed-ups of about 26 can be achieved. This is a useful result for parallelizing larger problems across multiple nodes (via MPI), since the ratio of the number of boundary ghost

zones with respect to the internal cells diminishes with grid size, leading to a reduction of the communication overhead relative to the computational operations in the interior.

As mentioned in Section 4, the Stage IIb parallelization scheme caches all local grid variables by combined read operations. If the available shared space is not sufficient to store all variables, one could either (i) reduce the block size, or (ii) keep the block size, but read the variables separately. The latter option is possible because the finite difference operators are decoupled, and the local operations only act on the results stored in the registers. Fig. 7 shows the effect of these choices on the GPU code performance. In comparison to the Stage IIb 128 × 128 × 128 reference point with a speed-up of 23, reducing the block size to the warp size (32) results in a performance reduction of about 45%, while the individual caching of variables reduces performance by about 32%. Above all else, this shows how sensitively the actual speed-up depends on the memory access scheme.

6 Discussion and outlook

The benchmark results⁴ in Section 5 clearly demonstrate that a GPU can potentially enhance the computation of finite difference codes solving Einstein’s field equations by significant amounts. It should be noted that the architecture under review here is limited to single-precision floating point operations, and it is expected that the actual speed-ups of future GPUs performing double-precision operations will be lower. However, even a speed-up of order 10 is quite useful for practical purposes, since it makes it possible to increase productivity and reduce turn-around times for test problems by an order of magnitude. Also, future parallel supercomputers may include GPU hardware which should be taken advantage of.

The speed-ups measured here compare a serial code to a single-GPU parallel code. Clearly, the actual speed-up in a particular workstation setup would effectively compare CPU-parallelized (e.g. OpenMP, MPI) against GPU-parallelized situations, or even combinations where CPUs and GPUs are used at the same time. The ratio of CPU cores to GPUs is actually one in the cluster we have been using, so the general ratio of GPU to CPU code performance should be of a similar order of magnitude, assuming the synchronization between the GPUs does not turn out to be overly expensive.

Another scenario is to use GPUs for the grid evolution code, but off-load tasks which could potentially be done asynchronously (analysis, output) to the host CPUs. In this case, both resources could be used effectively at the same time, and the associated speed-ups compared to a pure CPU code would be even higher. All this requires copying data from the frame buffer to the main

⁴ For this application, the compute kernel and the full application differ only by the host code calling the kernel, so the speed-up values for the kernel and the full application are almost identical.

memory, however, which may turn out to be an additional bottleneck in the code’s operation.

A problem with porting codes to GPUs is that a certain amount of expertise and experimentation is required from the researcher, and the efficiency of the parallel code is not always easy to predict. Also, there are tight memory constraints on the multiprocessors, which will become even more important for double-precision floating point operations. Since the performance of the code depends strongly on the memory access scheme, as demonstrated above, and since limitations in memory determine the list of available caching schemes, there is a non-trivial influence of the particular finite-differencing problem on the solution algorithm.

The importance of middleware solutions which are able to efficiently use GPUs and clusters of combined CPUs/GPUs is therefore expected to increase, since they can hide most of the data structures and optimization details from application scientists or business users. The Cactus Computational Toolkit [4, 3, 9] is an example of a middleware which already provides abstractions for MPI-parallelized scientific codes and is being used widely in production work. It is planned to extend Cactus in a way that it can make use of combined CPU/GPU clusters with high performance, while providing the end user with a simple and unified interface to solve problems. In more general terms, a middleware like Cactus is useful to abstract different hardware architectures and even programming paradigms away from the scientific problem at hand.

7 Acknowledgments

The author would like to thank Gabrielle Allen, Daniel Katz, John Michalakes, and Erik Schnetter for discussion and comments. Calculations have been performed on the Quadro Plex 1000 VCS cluster at NCSA, with special thanks to Jeremy Enos for providing timely support with this machine.

References

[1] R. Arnowitt, S. Deser, and C. W. Misner. The dynamics of general relativity. In L. Witten, editor, Gravitation: An introduction to current research, pages 227–265. John Wiley, New York, 1962.

[2] T. W. Baumgarte and S. L. Shapiro. On the numerical integration of Einstein’s field equations. Phys. Rev. D, 59:024007, 1999.

[3] Cactus Computational Toolkit home page, http://www.cactuscode.org/.

[4] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In Vector and Parallel Processing – VECPAR’2002, 5th International Conference, Lecture Notes in Computer Science, Berlin, 2003. Springer.

[5] L. Lindblom, M. A. Scheel, L. E. Kidder, R. Owen, and O. Rinne. A new generalized harmonic evolution system. Class. Quantum Grav., 23:S447–S462, 2006.

[6] T. Nakamura, K. Oohara, and Y. Kojima. General relativistic collapse to black holes and gravitational waves from black holes. Prog. Theor. Phys. Suppl., 90:1–218, 1987.

[7] NVIDIA. NVIDIA GeForce 8800 GPU Architecture Overview, 2006.

[8] NVIDIA. CUDA Programming Guide, Version 1.0, 2007.

[9] J. Shalf, E. Schnetter, G. Allen, and E. Seidel. Cactus as benchmarking platform. CCT Technical Report Series, CCT-TR-2006-3, 2006.

[10] M. Shibata and T. Nakamura. Evolution of three-dimensional gravitational waves: Harmonic slicing case. Phys. Rev. D, 52:5428, 1995.

[11] R. M. Wald. General relativity. The University of Chicago Press, Chicago, 1984.

Figure 2: Stage I column parallelization. The grid, here as an example of 16 × 16 × 16 cells (boundary ghost zones are not drawn), is decomposed in a plane into blocks, which correspond to the conceptual thread blocks in CUDA. Each block is 8 × 8 cells large to obtain a block size (number of threads per block) which is a multiple of the warp size and also optimized for reducing register bank conflicts. Each thread operates on a one-dimensional column, i.e. the kernel contains a loop in the remaining direction. As a Stage I scheme, the operations do not use the data parallel cache on the multiprocessors.

Figure 3: Stage I/II block parallelization. Instead of using columns, the grid is first decomposed into slabs of a certain height, here marked by blue lines; each slab is further distributed into blocks corresponding to the CUDA blocks, and finally each cube of size 4 × 4 × 4 is parallelized into 64 threads, which is optimal for the same reasons stated in Fig. 2. In contrast to the column scheme, the kernel does not contain a loop over cells for purposes of calculating the evolution, though a loop is used to copy data to the parallel cache for the Stage II parallelization (see text and Fig. 4). The kernel is called by the host repeatedly to operate on all slabs in the grid.

Figure 4: Schemes for caching the frame buffer data in the data parallel cache of a GPU multiprocessor. The diagrams show two-dimensional slices of the 4 × 4 × 4 cubes for illustration purposes. Each thread is assigned to one interior cell, but it needs data from adjacent cells for calculating finite differences. The interior portion of the cube is delimited by red lines, while the cube including ghost cells is indicated by blue lines. A direct scheme would first have each thread copy data for its cell into the cached array, and then, by conditional statements, assign threads to the ghost cells. However, this forces the compiler to serialize parts of the copy operation. A more efficient scheme operates on the data parallel cache arrays in a linear addressing mode, such that threads do not necessarily copy their assigned cells. The linear scheme also reduces cache bank conflicts more efficiently.

Figure 5: Speed-up of the general relativistic evolution code on an NVIDIA Quadro FX 5600 GPU, compared to the serial code running on an AMD Opteron 2.4 GHz. The code performs 100 iterations on a Cartesian grid with 128 × 128 × 128 cells. The left-most bar is the reference result on the host CPU, with a wall clock time of 430.3 seconds normalized to a speed-up of 1. The Stage Ia result is for the GPU code using a column decomposition (cf. Section 4), and the Stage Ib result is using a block decomposition, both without making use of the data parallel cache. Stage IIa is using the data parallel cache for read operations, and Stage IIb for read operations and a coalesced write operation.

Figure 6: Speed-up for different grid sizes. Each grid size is separately compared to the serial CPU code, which provides the normalization for the comparison. The GPU parallelization is more efficient for larger grids, leading to speed-ups of up to 26.

Figure 7: Speed-up with different caching strategies to accommodate more grid variables than fit into the data parallel cache. The reference point is the Stage IIb result for a 128 × 128 × 128 grid. Reducing the block size, i.e. the number of threads per block, increases the available shared memory per thread and thus makes it possible to store more variables. However, it also induces a significant performance hit, as shown in the figure. Another approach is to treat all variables separately for purposes of taking finite differences, instead of caching all in advance. This still introduces an increased overhead, but performs better than the former option.
