A general relativistic evolution code on CUDA architectures

Burkhard Zink Center for Computation and Technology Louisiana State University Baton Rouge, LA 70803, USA

Abstract I describe the implementation of a finite-differencing code for solving Einstein’s field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on an AMD Opteron 2.4 GHz. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.

1 Introduction

The high parallel processing performance of graphics processing units (GPUs), with current models achieving peak performances of up to 350 GFlop/s (for single-precision floating-point operations), has traditionally been used to transform, light and rasterize triangles in three-dimensional computer graphics applications. In recent architectures, however, the vectorized pipeline for processing triangles has been replaced by a unified scalar processing model based on a large set of stream processors [7]. This change has initiated a consideration of GPUs for solving general-purpose problems, and triggered the field of general-purpose computing on graphics-processing units (GPGPU).

High-performance, massively parallel computing is one of the major tools for the scientific community to understand and quantify problems not amenable to traditional analytical techniques, and has led to ever-increasing hardware performance requirements for tackling more and more advanced questions. Therefore, GPGPU appears to be a natural target for scientific and engineering applications, many of which admit highly parallel algorithms which are already used on current-generation supercomputers based on multi-core CPUs.

In this technical report, I will describe an implementation of one of the most challenging problems in computational physics, solving Albert Einstein’s field equations for general relativistic gravitation, on a graphics processing unit. The primary purpose is to make an estimate of potential performance gains in comparison to current CPUs, and to gain an understanding of architectural requirements for middleware solutions serving the needs of the scientific community, most notably the Cactus Computational Toolkit [4, 3, 9]. The particular GPU used for these experiments is an NVIDIA G80 series card (Quadro FX 5600). NVIDIA has also released a software development kit called CUDA (compute unified device architecture) [8] for development of GPU code using an extension of the C language. As opposed to earlier attempts to program GPUs with the languages supplied for graphics applications, this makes it easier to port existing general-purpose computation code to the target device.

Section 2 contains a description of the G80 hardware, the CUDA architectural model, and the performance considerations important for GPU codes. Then, in Section 3, we will turn to the particular problem of solving Einstein’s field equations, which we will approach from a mostly algorithmic (as opposed to physical) point of view. Section 4 describes the structure and implementation of the code, and Section 5 discusses the benchmarking results obtained. I give a discussion of the results, and an outlook, in Section 6.

2 CUDA and the G80 architecture

2.1 The NVIDIA G80 architecture

The NVIDIA G80 hardware [7] is the foundation of the GeForce 8 series consumer graphics cards, the Quadro FX 4600 and 5600 workstation cards, and the new Tesla 870 set of GPGPU boards. G80 represents the third major architectural change for NVIDIA’s line of graphics accelerators. Traditionally, GPU accelerators enhance the transformation and rendering of simple geometric shapes, usually triangles. The processing pipeline consists of transformation and lighting (now vertex shading), triangle setup, pixel shading, raster operations (blending, z-buffering, anti-aliasing) and the output to the frame buffer for scan-out to the display. First and second generation GPU architectures typically process these steps with special-purpose hardware in a pipelined fashion.

The increase in demand for programmability of illumination models and geometry modifiers, and also load-balancing requirements with respect to vertex and pixel shading operations, has led to more generality in GPU design. The current G80 architecture consists of a parallel set of stream processors with a full set of integer and (up to FP32) floating-point instructions. When processing triangles and textures, the individual steps of the processing pipeline are dynamically mapped to these processors, and since the operations are highly parallelizable, the scheduler can consistently maintain a high load.

Physically, eight stream processors (SP) are arranged in a multiprocessor with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache.

Number of multiprocessors (MP): 16
Number of stream processors per MP: 8
Warp size (see text): 32
Parallel data cache: 16 kB
Number of banks in parallel data cache: 16
Number of 32-bit registers per MP: 8192
Clock frequency of each MP: 1.35 GHz
Frame buffer memory type: GDDR3
Frame buffer interface width: 384 bits
Frame buffer size: 1.5 GB
Constants memory size: 64 kB
Clock frequency of the board: 800 MHz
Host bus interface: PCI Express

Table 1: Technical specifications of a Quadro FX 5600 GPU.

Each multiprocessor is operated by an instruction decoding unit which executes a particular command in a warp: the same command is executed on all SPs for a set of clock cycles (because the instruction units and the SP ALUs run at different clock speeds). This constitutes a minimal unit of SIMD computation on the multiprocessor, called the warp size, and will be important later when considering code efficiency on the architecture.

The GPU card contains a set of such multiprocessors, with the number depending on the particular model (e.g., one in the GeForce 8400M G, 12 in the GeForce 8800 GTS, and 16 in the GeForce 8800 GTX, Quadro FX 5600 and Tesla 870 models). The multiprocessors are operated by a scheduling unit with fast switching capabilities. In addition, the board has frame buffer memory and an extra set of memory for constants. The particular numbers for the Quadro FX 5600 are listed in Table 1, and Fig. 1 shows a diagram of the architecture (cf. also [7]).

The actual peak performance of the card depends on how many operations can be performed in one cycle. The stream processors technically support one MAD (multiply-add) and one MUL (multiply) per cycle, which would correspond to 1.35 GHz * 3 * 128 = 518.4 GFlop/s. However, not all execution units can be used simultaneously, so a conservative estimate is to assume one MAD per cycle, leading to a peak performance of 345.6 GFlop/s.

2.2 CUDA (compute unified device architecture)

Since the G80 is based on general purpose stream processors with a high peak performance, it appears to be a natural target for general-purpose parallel computing, and in particular scientific applications. NVIDIA has recently released the CUDA SDK [8] for running parallel computations on the device hardware. While earlier attempts to use GPUs for general-purpose computing had to use the various shading languages, e.g. Microsoft’s HLSL or NVIDIA’s Cg, CUDA is based on an extension of the C language.

Figure 1: Simplified diagram of the G80 architecture. The GPU contains several multiprocessors (MPs) which are operated by a common thread scheduler with fast switching capabilities. Each multiprocessor can run several threads using stream processors (SPs) which share a common instruction unit, a set of registers and a data parallel cache. A memory bus connects the MPs to the frame buffer memory and a constants memory.

The SDK consists of a compiler (nvcc), host and device runtime libraries, and a driver API. Architecturally, the driver builds the primary layer on top of the device hardware, and provides interfaces which can be accessed either by the CUDA runtime or the application. Furthermore, the CUDA runtime can be accessed by the application and the service libraries (currently for BLAS and FFT).

CUDA’s parallelization model is a slight abstraction of the G80 hardware. Threads are arranged into blocks, where each block is executed on only one multiprocessor. Therefore, within a block, additional thread context and synchronization options exist (by use of the shared resources on the chip), whereas no global synchronization is available between blocks. A set of blocks constitutes a SIMD compute kernel. Kernel calls themselves are asynchronous to the host CPU: they return immediately after issuance. Currently only one kernel can be executed at any time on the device, which is a limitation of the driver software.

The thread-block model hides the particular number of multiprocessors and stream processors from the CUDA kernel insofar as the block grid is partially serialized into batches. Each multiprocessor can execute more than one block and

more threads than the warp size per multiprocessor to hide device memory access latency. However, the limitations of this model are implicit performance characteristics: if the number of blocks is lower than the number of multiprocessors, or the number of threads is lower than the warp size, significant performance degradation will occur. Therefore, the physical configuration is only abstracted insofar as the kernel compiles and runs on different GPU configurations, but not in terms of achieving maximal performance.

Within the kernel thread code, the context provides a logical arrangement of blocks in one- or two-dimensional, and threads per block in one-, two-, and three-dimensional sets, which is convenient for the kind of numerical grid application we will present below since it avoids additional (costly) modulo operations inside the thread. Threads support local variables and access to the shared memory (which maps to the data parallel cache in the G80), which is common to all threads within a block.

The kernel is called from the host code via a special syntax which specifies the block grid size and threads per block, and can be synchronized back to the host via a runtime function. Therefore, the host CPU can perform independent computations or issue additional kernels while the GPU operates. This may be interesting for delayed analysis and output processes performed on the data, although the bandwidth for transferring data from the frame buffer (where grid variables will be stored for evolution) to the host memory has to be taken into account.
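
To make the block/thread arrangement and the launch syntax concrete, the following minimal sketch (not part of the evolution code; all names are illustrative) shows a two-dimensional block grid with three-dimensional thread blocks, an asynchronous kernel launch, and host-side synchronization. cudaThreadSynchronize was the synchronization call in the CUDA 1.0 runtime; cudaDeviceSynchronize is its modern equivalent.

#include <cuda_runtime.h>

// Illustrative kernel: each thread writes its own global cell index.
// blockIdx is two-dimensional and threadIdx three-dimensional, matching the
// grid decomposition described in the text.
__global__ void fill_indices(int *out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = threadIdx.z;                          // one z slab per launch here
    int idx = i + nx * (j + ny * k);
    out[idx] = idx;
}

int main()
{
    const int nx = 32, ny = 32, nz = 4;
    int *d_out;
    cudaMalloc(&d_out, nx * ny * nz * sizeof(int));

    dim3 threads(8, 8, nz);                       // threads per block (3d)
    dim3 blocks(nx / threads.x, ny / threads.y);  // block grid (2d)

    // The launch returns immediately; the host could do other work here.
    fill_indices<<<blocks, threads>>>(d_out, nx, ny);

    // Block the host until the kernel has finished.
    cudaDeviceSynchronize();

    cudaFree(d_out);
    return 0;
}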

2.3 Performance optimization considerations

To achieve high performance of the GPU for a particular application problem, often approximately measured in terms of a speed-up compared to a serial code on the host CPU, the kernel would usually encompass those operations which are highly data-parallelizable, due to the SIMD concept of CUDA.

An efficient port needs to maximize the ratio of arithmetic over device memory operations, the arithmetic intensity, since access to the frame buffer involves large latencies (up to 600 cycles on the G80) because it is not cached. One common strategy is to first copy selected parts of the device memory which are needed by all threads to the shared memory, then operate on them, and finally write the results back to device memory. At the same time, many threads can be started concurrently on the multiprocessors to hide the access latencies.

Each block issued to a multiprocessor reserves memory for the local variables of all threads in the block, and shared memory according to the allocation made in the kernel. The number of blocks and threads running concurrently on the multiprocessor is limited by the shared memory space (16 kB on the Quadro FX 5600), which implies a trade-off between hiding device memory access latency by multi-threading and effective cache size per thread.

Since threads are operated on in warps, i.e. each of the stream processors in a multiprocessor operates sequentially on batches of SIMD instructions, the resulting warp size is the minimum number of threads per block for efficient parallelization, and the actual number should be an integer multiple of it. In

addition to the upper bound on the cache size imposed by concurrent issuance of blocks, this gives a problem-dependent lower bound on the number of threads in a block. NVIDIA also states [8] that the optimal number of threads per block is actually 64 on the G80 architecture, to allow the nvcc compiler to reduce the number of register bank conflicts in an optimal way.

On a G80 card, actual FP32 operations like multiply (MUL) and multiply-add (MAD) cost 4 cycles per warp, i.e. one cycle per thread, reciprocals cost 16 cycles, and for certain mathematical functions like sine and cosine microcode is available which performs the evaluation in 32 cycles. Control flow instructions are to be avoided if possible, since in many cases they may have to be serialized.

Accessing shared memory by threads should avoid bank conflicts to ensure maximal bandwidth. This requires the threads of a warp to read and write to the shared memory in a certain way, since consecutive addresses (e.g. obtained from a base address and an offset given by the thread id) are distributed into several independent partial warps for maximum bandwidth. There is, however, also a broadcast mechanism which can distribute data from one bank to all threads in the warp. Finally, accessing device memory can be optimized by coalescing operations, usually again in the form of a base address plus a thread-dependent offset. In addition, the base address should be aligned to a multiple of the half-warp size times the type size in bytes for coalescing the read/write operation.

All these performance requirements come into play when optimizing a code for high speed-ups. The experiments below will demonstrate that the actual speed-up can be very sensitive to the particular implementation of the memory management. While non-optimized implementations for problems with high arithmetic intensity may already achieve a significant speed-up, full use of the GPU’s floating-point performance requires some thought. We will see that the nature of the problem also imposes trade-offs between different performance requirements which strongly affect the maximum speed-up, so that e.g. comparing even implementations of different finite-difference problems can produce completely different speed-ups, since there is a trade-off between local cache space (i.e. number of grid variables) and the number of threads in a block.
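
The staging pattern described above — load a tile into shared memory, synchronize, compute, write back — can be illustrated by the following minimal one-dimensional sketch (illustrative only, with hypothetical names; the evolution code in Section 4 applies the same idea in three dimensions):

#include <cuda_runtime.h>

#define BLOCK 64   // a multiple of the warp size, as recommended in the text

// Illustrative 1d three-point average with a shared-memory stage. Each block
// loads its tile plus one halo cell on each side, synchronizes, computes from
// the cache, and writes the result back with thread-contiguous accesses.
// Launch as: smooth1d<<<n / BLOCK, BLOCK>>>(d_in, d_out, n);  (n a multiple of BLOCK)
__global__ void smooth1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2];

    int i = blockIdx.x * BLOCK + threadIdx.x;   // global index
    int s = threadIdx.x + 1;                    // index into the tile

    tile[s] = in[i];                            // coalesced load
    if (threadIdx.x == 0)                       // left halo cell
        tile[0] = (i > 0) ? in[i - 1] : in[i];
    if (threadIdx.x == BLOCK - 1)               // right halo cell
        tile[BLOCK + 1] = (i < n - 1) ? in[i + 1] : in[i];

    __syncthreads();                            // cache is complete

    out[i] = (tile[s - 1] + tile[s] + tile[s + 1]) / 3.0f;  // coalesced store
}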

3 The algorithm to solve Einstein’s field equations

In its best-known form, Einstein’s field equation for general relativity can be written as

Gµν = 8πTµν (1)

where µ, ν = 0 ... 3. Gµν are the sixteen components of the Einstein tensor, and Tµν are the sixteen components of the energy-momentum tensor. Both functionally depend on the spacetime metric tensor gµν, which describes the local

curvature of spacetime and is therefore the central object of general relativity, since curvature relates to physical effects like light bending and the attraction of bodies by gravitation.

We will, however, not be concerned with the physical interpretation of these equations here (for an introduction see [11]), but only with the requirements to formulate a finite-differencing evolution algorithm from them. The field equations as formulated in eqn. 1 can be transformed into a regular initial-boundary value problem (i.e., a set of partial differential equations in time and space) for twelve variables: the six components of the three-metric, gij, i, j = 1 ... 3, and the six components of the extrinsic curvature, Kij. In addition, the equations contain the four free gauge functions α and βi as parameters, which are usually also treated as evolutionary variables.

The usual approach to solving these equations proceeds as follows (a minimal sketch of steps 5 and 6 is given after this list):

1. Define a spatial domain to solve on, e.g. give a range of coordinates [xlow, xhigh] × [ylow, yhigh] × [zlow, zhigh] on which to solve the problem.

2. Discretize the domain in some appropriate way. We will only discuss uniform Cartesian discretizations here (also known as uni-grids), but there are supercomputer implementations which use adaptive mesh refinement and multiple blocks, e.g. in the Cactus Computational Toolkit.

3. On each discrete cell, specify the initial data by setting values for the evolutionary variables. Usually, a set of cells directly outside of the domain, called ghost zones, is used for setting boundary conditions.

4. The evolution proper proceeds by a loop over time steps. In each time step, the right hand side of the equation ∂tA(x, y, z) = RHS needs to be evaluated, where A is a grid variable and RHS is a function which depends on A and other grid variables, their spatial derivatives, and potentially free parameters and the coordinate location.

5. Since the right hand side depends on the spatial derivatives, finite-differencing operations have to be performed on the grid. This usually involves some local stencil of variables around the cell, e.g. ∂x A^i ≈ (A^{i+1} − A^{i−1}) / (2∆x) for the second-order accurate central approximation to the first x derivative of A at position i.

6. Having obtained the right hand side, the set of evolution equations is advanced with a technique to discretely approximate ordinary differential equations, e.g. a Runge-Kutta step. This obtains the values of grid functions at time t + ∆t from their values at t. The Runge-Kutta algorithm generates partial time approximations during the course of its operation, and therefore requires a set of scratch levels having the size of the main grid variables.

7. Finally, new boundary conditions need to be imposed after the time update. This usually involves operations on the ghost zones and their immediate neighbors, e.g. by extrapolation.
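
As a minimal sketch of steps 5 and 6, the following C fragment applies a second-order central difference and a second-order Runge-Kutta (midpoint) update to a single grid function in one dimension. It is illustrative only: the function names and the placeholder right hand side are not taken from the actual code, which evolves the full coupled ADM system in three dimensions and swaps pointers between grid and scratch storage instead of updating in place.

/* second-order accurate central first derivative at position i */
static float dx_central(const float *u, int i, float dx)
{
    return (u[i + 1] - u[i - 1]) / (2.0f * dx);
}

/* placeholder right hand side: a simple advection term, -du/dx */
static float rhs(const float *u, int i, float dx)
{
    return -dx_central(u, i, dx);
}

/* One second-order Runge-Kutta (midpoint) step: a half step into scratch
   storage, then a full step using the right hand side at the half-step data. */
static void rk2_step(float *u, float *scratch, int n, float dx, float dt)
{
    int i;
    for (i = 1; i < n - 1; i++)                        /* interior cells only */
        scratch[i] = u[i] + 0.5f * dt * rhs(u, i, dx); /* half step           */
    scratch[0] = u[0];                                 /* static boundaries   */
    scratch[n - 1] = u[n - 1];
    for (i = 1; i < n - 1; i++)
        u[i] += dt * rhs(scratch, i, dx);              /* full step           */
}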

The actual representation of Einstein’s field equations used here is the so-called ADM formalism (after Arnowitt, Deser and Misner [1]). This is not a commonly used method nowadays, since it tends to produce numerical instabilities, but it will suffice for demonstrating how to port such a code to a CUDA environment, and it is a good choice for a prototype implementation due to its simplicity. More advanced schemes (NOK-BSSN [6, 10, 2] and the generalized harmonic formalism [5]) contain more variables, which has consequences for the available set of caching schemes (see below), but should fundamentally be portable to CUDA in a similar manner. The particular test cases below operate on a simple dynamical test problem, a gauge wave, and use static boundaries.
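
For illustration, one possible host-side layout of the evolved variables is sketched below: one flat single-precision array per grid function, indexed as i + nx*(j + ny*k), which matches the index arithmetic used by the CUDA kernel in Section 4.2. The struct wrapper and its field names are hypothetical; the actual code may organize its storage differently.

#include <stddef.h>

/* Hypothetical container for the evolved ADM variables. */
typedef struct {
    int nx, ny, nz;                    /* grid size including ghost zones   */
    float *g11, *g12, *g13,            /* three-metric components           */
          *g22, *g23, *g33;
    float *K11, *K12, *K13,            /* extrinsic curvature components    */
          *K22, *K23, *K33;
    float *alpha;                      /* lapse gauge function              */
    float *beta1, *beta2, *beta3;      /* shift gauge functions             */
} adm_grid;

/* Linear cell index, x running fastest (cf. the kernel code in Section 4.2). */
static size_t cell_index(const adm_grid *g, int i, int j, int k)
{
    return (size_t)i + (size_t)g->nx * ((size_t)j + (size_t)g->ny * (size_t)k);
}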

4 Implementation of Einstein’s field equations in CUDA

While it is possible to develop a GPU code directly, there are advantages to first implementing a stand-alone CPU code in C, and then porting it to CUDA. In particular, debugging the host code is easier, and it will yield a fairer assessment of speed-up as compared to a device code in emulation mode¹.

This CPU code is a C language implementation of the algorithm mentioned in Section 3 for three spatial dimensions, with a second-order Runge-Kutta scheme for time integration and second-order accurate first and second central finite-differencing operators. The actual right hand side was extracted from a particular implementation of the ADM system in Cactus. The code performs these steps:

1. Allocate memory for all grid variables, and the same amount for the scratch space variables needed by the Runge-Kutta scheme.

2. Write initial data into the grid variables. The exact data is irrelevant unless it produces NaNs by discrete instabilities during the evolution. The data used here is a Gaussian gauge pulse in the x direction.

3. Perform the main loop:

   (a) Swap the pointers between the grid variables and their scratch counterparts.

   (b) Call the evolution function. This function loops over all grid cells (three nested loops, one for each direction), and in each cell (i) calculates all partial derivatives by finite differencing in second order, (ii) calculates a number of additional temporary variables only needed locally, and (iii) writes the evolved grid functions into the scratch space.

   (c) The swapping and evolution is repeated, now for the second half step involved in the RK2 scheme.

   (d) Output the evolved data to the disk.

4. Release allocated storage.

The only relevant target for parallelization is the evolution routine inside the main loop, since it consumes most of the wall clock time in almost all situations (unless there is 3d output at every time step), and because it naturally lends itself to parallel computation. Therefore, we will take a closer look at its operations.

Einstein’s field equations are some of the most complex classical field equations in nature, and therefore codes solving them naturally have a high arithmetic intensity. For evaluating a single right hand side, hundreds of floating point operations act on only a few grid functions in device memory and their locally obtained partial derivatives. Therefore, as soon as the partial derivatives are obtained, parallelization can easily proceed by assigning each cell to a single thread and performing the evaluation locally. The resulting write operations to device memory only involve that particular cell and therefore do not produce concurrency conflicts (though they may produce bank conflicts).

For the finite difference evaluation, each cell accesses its immediate neighbors, reads grid variables from those cells, operates on them, and writes the result into local variables which are only defined inside the innermost loop. While it is possible to calculate all finite differences outside of the loop, in an extra step, to logically decouple the (semantically different) finite difference and evolution operations, this implies multiplying the required storage by more than a factor of four². Translated to a CUDA device, this would involve additional operations on device memory and therefore reduce performance.

This approach can be translated directly into a GPU device code by applying these changes to the CPU implementation (a host-side sketch follows the list):

1. Allocate the grid functions on the device memory (frame buffer) in addition to allocating them on the host memory. The host memory allocation is still useful for setting up initial data and output of data to the disk.

2. After writing initial data on the host memory, copy it to the device memory using runtime functions.

3. In the main loop, perform the swap operations on the device memory pointers.

4. Replace the evolution function by a CUDA kernel which is distributed to the device multiprocessors.

5. Since every evolution half-step depends on results from the previous one, synchronize to the CUDA device after calling the kernel. This time span could potentially also be used for asynchronous output.

6. If output is requested in a particular iteration, copy the relevant grid variables back to host memory, and then perform the output as usual.

¹ The CUDA C compiler nvcc can compile code which actually executes on the host processor, mostly for debugging purposes.

² Three additional variables for each directional derivative, and storage for second derivatives for some variables.
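
A host-side sketch of these changes is given below, under the assumption of a single representative grid function and illustrative names; the real code handles all grid functions the same way and calls the kernel once per slab. The kernel body is a placeholder standing in for the actual ADM right hand side evaluation.

#include <cuda_runtime.h>

/* Placeholder kernel: the real kernel evaluates the ADM right hand side. */
__global__ void evolve_kernel(const float *in, float *out, int nx, int ny)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = threadIdx.z;
    int idx = i + nx * (j + ny * k);
    out[idx] = in[idx];                        /* stand-in for the actual update */
}

void evolve_on_gpu(float *h_g11, int nx, int ny, int nz, int iterations)
{
    size_t bytes = (size_t)nx * ny * nz * sizeof(float);
    float *d_g11, *d_scratch;

    cudaMalloc(&d_g11, bytes);                 /* grid variable in the frame buffer */
    cudaMalloc(&d_scratch, bytes);             /* Runge-Kutta scratch level         */
    cudaMemcpy(d_g11, h_g11, bytes, cudaMemcpyHostToDevice);    /* initial data     */

    dim3 threads(4, 4, 4);                     /* cf. the Stage 2 block size        */
    dim3 blocks(nx / 4, ny / 4);               /* one z slab shown; the actual code
                                                  launches the kernel per slab      */
    for (int it = 0; it < iterations; it++) {
        for (int half = 0; half < 2; half++) { /* the two RK2 half steps            */
            evolve_kernel<<<blocks, threads>>>(d_g11, d_scratch, nx, ny);
            cudaDeviceSynchronize();           /* next half step needs the results  */

            float *tmp = d_g11;                /* swap device pointers so that the  */
            d_g11 = d_scratch;                 /* newest data is the next input     */
            d_scratch = tmp;
        }
        /* if output is requested in this iteration:
           cudaMemcpy(h_g11, d_g11, bytes, cudaMemcpyDeviceToHost); and write to disk */
    }

    cudaFree(d_g11);
    cudaFree(d_scratch);
}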

4.1 Stage 1 parallelization

The CUDA kernel function could in principle be ported from the CPU code by using this scheme: divide any two-dimensional orthogonal projection of the three-dimensional Cartesian grid, i.e. any of the planes (x, y), (x, z) or (y, z), into independent blocks of the same size. Each thread in the block then operates on a one-dimensional slab in the remaining direction, i.e. the three nested loops in the evolution code are replaced by one loop over the remaining direction. The cell index calculation needed in the kernel is made easy by CUDA’s support of two-dimensional block and thread indices, which directly correspond to the two-dimensional domain decomposition. With this algorithm, the problem is equally and independently parallelized, and were it not for the additional considerations imposed by the memory access latencies, the approach would already be optimal. For future reference, I will call it a Stage 1 approach, since it is a direct translation of the CPU code which merely reduces the appropriate number of nested loops, and since it does not take advantage of the data parallel cache of the G80 multiprocessors. Fig. 2 illustrates the parallelization technique; a sketch of such a kernel is given at the end of this subsection.

A first efficiency improvement for the G80 can be obtained by adjusting the block size, i.e. the number of threads in each block, to a multiple of 64³, which is a multiple of the warp size (a strong requirement for high performance) and also allows the compiler to avoid register conflicts. Even without considering shared memory at this stage, this puts a significant constraint on how many registers are actually available per thread. On the G80, there are 8192 registers per multiprocessor, i.e. we can store 128 32-bit words locally per thread. The partial derivatives and other helper variables used in the ADM code already need about 100 per thread. More complicated codes may need to either (i) reduce the number of local helper variables at the cost of increased arithmetic operations, or (ii) reduce the block size. Usually, experiments need to be done to establish which option is preferable in practice. It would seem intuitive that reducing the number of threads to the warp size is better, but the high latency of device memory accesses can be hidden much more effectively by more threads, which can therefore easily outweigh the additional local arithmetic cost involved in implicitly repeating operations usually mapped to helper variables.

Instead of operating on one-dimensional columns, it is also possible to decompose the grid into cubes (as is usual for MPI parallelization schemes), and therefore have each thread evolve exactly one cell. Since the block decomposition by CUDA is logically either one- or two-dimensional, there are two options for doing this: (i) use the block decomposition as before, but start a thread for each cell separately, or (ii) let the kernel only operate on a grid slab of defined thickness and call it repeatedly from the host. Option (i) is impractical due to the limitations of available registers, so we are left with option (ii), which is illustrated in Fig. 3. From the onset, it is unclear how this compares to the column decomposition used before. However, the Stage 2 parallelization will need to operate on blocks as opposed to columns, as discussed below.

³ This restricts the possible grid sizes to a multiple of 8 in each direction of the two-dimensional decomposition for optimal parallelization. We will later see that the best scheme actually has a multiple of 4 as a requirement.

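
The Stage 1 column scheme can be sketched as follows (a hypothetical kernel fragment: one evolved variable and a placeholder update stand in for the full ADM system):

/* Illustrative Stage 1 (column) kernel: a 2d block/thread decomposition in the
   (x,y) plane, with each thread looping over the z direction. No shared memory
   is used; all reads and writes go directly to device memory. */
__global__ void evolve_columns(const float *u, float *u_out,
                               int nx, int ny, int nz, float dx, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  /* skip the ghost zone */
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;

    for (int k = 1; k < nz - 1; k++) {                  /* column loop in z    */
        int idx = i + nx * (j + ny * k);

        /* second-order central x derivative, read directly from device memory */
        float du_dx = (u[idx + 1] - u[idx - 1]) / (2.0f * dx);

        /* placeholder update; the real code evaluates the full right hand side */
        u_out[idx] = u[idx] - dt * du_dx;
    }
}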

4.2 Stage 2 parallelization

So far, the kernel has not made use of the shared memory (or data parallel cache in terms of the G80 architecture). Since the frame buffer accesses are un-cached, it is necessary to implement a manual cache algorithm for this. Unfortunately, the shared memory is even more limited than the register space: 16 kB per multiprocessor on the G80, which translates into only 256 bytes per thread when using the recommended block size of 64.

The most obvious target for a caching algorithm are the finite difference operations, since they repeatedly use the same data (the value of a quantity A at cell (i, j, k) is accessed by six neighboring cells for second-order accurate first partial derivatives, and by an additional eight cells for second derivatives). To perform these operations from the cache while avoiding conditional statements in the kernel, a number of ghost zones outside the operated cube need to be stored as well. Generally, this number is smallest for cubes with equally sized edges, which suggests taking 4 × 4 × 4 as the interior cube and one ghost cell, resulting in 64 threads per block (double the warp size, and the suggested block size for efficient register usage) and 6 × 6 × 6 = 216 cells to store in the cache. Therefore, with this scheme, we can store at most 18 FP32 variables per cell. This is enough for the ADM grid functions (depending on the choice of gauge, there are 12 to 16), but not enough for more complicated schemes or more general systems. In those cases, the only option would be to reduce the size of the cube and the number of threads, which likely incurs a loss of performance.

The transfer of data from the frame buffer to the parallel cache is handled by each thread separately. Were it only for the interior cube without ghost zones, an efficient scheme would be to copy all data for exactly the cell which is operated on by this thread. However, the ghost zones also need to be set up, for which the most obvious scheme is to use conditional statements which act on the boundary surfaces. This is obviously not very efficient, since (i) many threads will not participate in these operations, and (ii) conditional statements are often serialized into the minimal instruction size (the warp size) and will therefore reduce performance even more. A better approach is this: recompute the three-dimensional thread index into a linear index, and then use integer division and subtraction operations to obtain the three-dimensional index corresponding to the cube including ghost zones; the two cache copy approaches are illustrated in Fig. 4. The operations for this are:

__shared__ CCTK_REAL g11S[SHARED_ARRAY_SIZE], ...

// Block index
bx = blockIdx.x;
by = blockIdx.y;

// Thread index
tx = threadIdx.x;
ty = threadIdx.y;
tz = threadIdx.z;

// Copy data to shared memory
iL = bx * THREADS_PER_BLOCK_X;
jL = by * THREADS_PER_BLOCK_Y;
kL = kstart - 1;
tid = tx + THREADS_PER_BLOCK_X * (ty + THREADS_PER_BLOCK_Y * tz);

// -- first read/write
o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
n = res / SHARED_ARRAY_X;
m = res - n * SHARED_ARRAY_X;
index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
g11S[indexS] = g11[index];
...

// -- second read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- third read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
...

// -- fourth read/write
tid += THREADS_PER_BLOCK_X * THREADS_PER_BLOCK_Y * THREADS_PER_BLOCK_Z;
if (tid < SHARED_ARRAY_X * SHARED_ARRAY_Y * SHARED_ARRAY_Z) {
  o = tid / (SHARED_ARRAY_X * SHARED_ARRAY_Y);
  res = tid - o * SHARED_ARRAY_X * SHARED_ARRAY_Y;
  n = res / SHARED_ARRAY_X;
  m = res - n * SHARED_ARRAY_X;
  index = (iL+m) + GRID_NX * ((jL+n) + GRID_NY * (kL+o));
  indexS = m + SHARED_ARRAY_X * (n + SHARED_ARRAY_Y * o);
  g11S[indexS] = g11[index];
  ...
}

Here, g11 is the grid array for the variable g11, g11S is its shared memory equivalent, and the preprocessor constants describe the size of the shared array (SHARED_ARRAY_X/Y/Z = 6), the block size (THREADS_PER_BLOCK_X/Y/Z = 4), and the global grid size (GRID_NX/NY/NZ). Now every thread copies data for, and evolves, different cells. Since there are more cells including ghost zones than threads, several such operations need to be performed (in the case of a 4 × 4 × 4 cube and a ghost size of 1, we need four cycles). Each time, the effective thread index tid is increased by the block size, so that the last read operation either (i) contains a conditional statement for excluding invalid indices, or (ii) the grid arrays are artificially enlarged to accommodate copying the data without segmentation faults. Experiments have shown that option (i) does not lead to a degradation in performance, and it is preferable due to its simplicity and reduced memory usage.

To increase coalescence for device memory operations and reduce bank conflicts for shared memory writes in these operations, it is important to operate on consecutive addresses. This is most easily done by performing the block decomposition in the (x, y) plane, since then the index ordering scheme we use accesses addresses in the form base address plus index for the shared memory (avoiding bank conflicts), and within each x direction read access to the device memory can be coalesced. An even more efficient scheme could possibly be obtained by (i) ordering device memory in a way that more reads can be coalesced at once, and (ii) adjusting the base address to a multiple of the half warp size times the FP32 size. These have not been implemented here and would require non-trivial changes to the data structures.

In summary, the parallelization scheme on the CUDA device involves:

1. Identifying the most computationally intensive and parallelizable parts of the code. In the case discussed here, this is the evolution step inside the main loop.

2. Restructuring the parallel problem in units which correspond to the block/thread scheme used by CUDA, with particular note of the warp size and the register bank optimization requirements.

3. In the compute kernel, implementing a cache algorithm using the shared memory. Since the minimum effective number of threads is limited by the warp size, there may be trade-offs involved between the number of concurrent threads on a multiprocessor and the number of variables to cache.

4. Implementing an efficient scheme to copy the data from device memory to the cache in a thread-based order, with as few memory bank conflicts as possible, and by using coalesced device memory operations.

5. Synchronizing all threads in a block after the cache operation.

6. Performing all local operations, writing the results back into shared memory, and finally to device memory (a brief sketch of the resulting per-block kernel structure follows).
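
For concreteness, the per-block structure that results from steps 3 to 6 looks roughly as follows. This is a hypothetical, heavily reduced kernel fragment: one cached variable, one derivative, and a placeholder update stand in for the full ADM right hand side, and float stands in for CCTK_REAL.

__global__ void evolve_stage2(const float *g11, float *g11_out,
                              int nx, int ny, int kstart, float dx, float dt)
{
    // 6x6x6 cache tile for a 4x4x4 interior cube plus one ghost cell per side
    __shared__ float g11S[6 * 6 * 6];

    // ... fill g11S from device memory as in the listing above ...

    __syncthreads();   // step 5: the cache must be complete before it is used

    // interior cell assigned to this thread, offset by the ghost layer
    int m = threadIdx.x + 1, n = threadIdx.y + 1, o = threadIdx.z + 1;
    int indexS = m + 6 * (n + 6 * o);

    // step 6: finite differences evaluated from the cache ...
    float dg11_dx = (g11S[indexS + 1] - g11S[indexS - 1]) / (2.0f * dx);

    // ... and the (placeholder) update written back to device memory
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
    int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
    int k = kstart + threadIdx.z;
    int index = i + nx * (j + ny * k);
    g11_out[index] = g11[index] - dt * dg11_dx;
}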

5 Performance results

The performance measurements were all conducted on the NVIDIA Quadro Plex 1000 VCS cluster qp at NCSA. The cluster consists of one login node and 16 compute nodes with two dual-core AMD Opteron 2.4 GHz processors, 8 GB of main memory, and four NVIDIA Quadro FX 5600 GPUs with 1.5 GB of frame buffer memory each (6 GB in total). For comparison purposes, we measure the speed-up provided by one Quadro GPU relative to a serial code running on the CPU (which consequently uses only one core). Both codes use single-precision floating-point arithmetic, and are compiled using the GNU C compiler version 4.1.2 (revision 20070626) and the nvcc compiler provided by CUDA version 1.0. In both cases, level-three optimizations have been applied.

The first benchmark is illustrated in Fig. 5. The codes operate on a 128 × 128 × 128 Cartesian grid (130 × 130 × 130 including boundary ghost zones), and perform 100 iterations. The boundary data is static, and analysis and output operations are not performed during the measuring cycle. Only the main evolution is measured, which excludes allocation, deallocation and device setup operations; those take roughly 3 seconds in total for CUDA and diminish in relation to the main loop for more iterations.

As the diagram shows, speed-ups of up to 23 can be achieved with a cached algorithm and coalesced write-out to the frame buffer. Not using the data-parallel cache results in many frame buffer accesses during the evaluation of the finite differences, inducing large latencies which can only partially be hidden by the thread scheduler due to the large register count needed by each thread.

The effect of the grid size is shown in Fig. 6. This demonstrates that larger local grid sizes lead to higher speed-ups. The frame buffer on the Quadro FX 5600 (1.5 GB) can store up to about 230 × 230 × 230 cells for this particular computational problem; for these problem sizes, speed-ups of about 26 can be achieved. This is a useful result for parallelizing larger problems across multiple nodes (via MPI), since the ratio of the number of boundary ghost

zones with respect to the internal cells diminishes with grid size, leading to a reduction of the communication overhead relative to the computational operations in the interior.

As mentioned in Section 4, the Stage IIb parallelization scheme caches all local grid variables by combined read operations. If the available shared space is not sufficient to store all variables, one could either (i) reduce the block size, or (ii) keep the block size, but read the variables separately. The latter option is possible because the finite difference operators are decoupled, and the local operations only act on the results stored in the registers. Fig. 7 shows the effect of these choices on the GPU code performance. In comparison to the Stage IIb 128 × 128 × 128 reference point with a speed-up of 23, reducing the block size to the warp size (32) results in a performance reduction of about 45%, while the individual caching of variables reduces performance by about 32%. Above all else, this shows how sensitively the actual speed-up depends on the memory access scheme.

6 Discussion and outlook

The benchmark results⁴ in Section 5 clearly demonstrate that a GPU can potentially enhance the computation of finite difference codes solving Einstein’s field equations by significant amounts. It should be noted that the architecture under review here is limited to single-precision floating point operations, and it is expected that the actual speed-ups of future GPUs performing double-precision operations will be lower. However, even a speed-up of order 10 is quite useful for practical purposes, since it makes it possible to increase productivity and reduce turn-around times for test problems by an order of magnitude. Also, future parallel supercomputers may include GPU hardware which should be taken advantage of.

The speed-ups measured here compare a serial code to a single-GPU parallel code. Clearly, the actual speed-up in a particular workstation setup would effectively compare CPU-parallelized (e.g. OpenMP, MPI) against GPU-parallelized situations, or even combinations where CPUs and GPUs are used at the same time. The ratio of CPU cores to GPUs is actually one in the cluster we have been using, so the general ratio of GPU to CPU code performance should be of a similar order of magnitude, assuming the synchronization between the GPUs does not turn out to be overly expensive.

Another scenario is to use GPUs for the grid evolution code, but off-load tasks which could potentially be done asynchronously (analysis, output) to the host CPUs. In this case, both resources could be used effectively at the same time, and the associated speed-ups compared to a pure CPU code would be even higher. All this requires copying data from the frame buffer to the main

⁴ For this application, the compute kernel and the full application differ only by the host code calling the kernel, so the speed-up values for the kernel and the full application are almost identical.

memory, however, which may turn out to be an additional bottleneck in the code’s operation.

A problem with porting codes to GPUs is that a certain amount of expertise and experimentation is required from the researcher, and the efficiency of the parallel code is not always easy to predict. Also, there are tight memory constraints on the multiprocessors, which will become even more important for double-precision floating point operations. Since the performance of the code depends strongly on the memory access scheme, as demonstrated above, and since limitations in memory determine the list of available caching schemes, there is a non-trivial influence of the particular finite-differencing problem on the solution algorithm.

The importance of middleware solutions which are able to efficiently use GPUs and clusters of combined CPUs/GPUs is therefore expected to increase, since they can hide most of the data structures and optimization details from application scientists or business users. The Cactus Computational Toolkit [4, 3, 9] is an example of a middleware which already provides abstractions for MPI-parallelized scientific codes and is being used widely in production work. It is planned to extend Cactus in a way that it can make use of combined CPU/GPU clusters with high performance, while providing the end user with a simple and unified interface to solve problems. In more general terms, a middleware like Cactus is useful to abstract different hardware architectures and even programming paradigms away from the scientific problem at hand.

7 Acknowledgments

The author would like to thank Gabrielle Allen, Daniel Katz, John Michalakes, and Erik Schnetter for discussion and comments. Calculations have been performed on the Quadro Plex 1000 VCS cluster at NCSA, with special thanks to Jeremy Enos for providing timely support with this machine.

References

[1] R. Arnowitt, S. Deser, and C. W. Misner. The dynamics of general relativity. In L. Witten, editor, Gravitation: An introduction to current research, pages 227–265. John Wiley, New York, 1962.

[2] T. W. Baumgarte and S. L. Shapiro. On the numerical integration of Einstein’s field equations. Phys. Rev. D, 59:024007, 1999.

[3] Cactus Computational Toolkit home page, http://www.cactuscode.org/.

[4] T. Goodale, G. Allen, G. Lanfermann, J. Massó, T. Radke, E. Seidel, and J. Shalf. The Cactus framework and toolkit: Design and applications. In Vector and Parallel Processing – VECPAR’2002, 5th International Conference, Lecture Notes in Computer Science, Berlin, 2003. Springer.

[5] L. Lindblom, M. A. Scheel, L. E. Kidder, R. Owen, and O. Rinne. A new generalized harmonic evolution system. Class. Quantum Grav., 23:S447–S462, 2006.

[6] T. Nakamura, K. Oohara, and Y. Kojima. General relativistic collapse to black holes and gravitational waves from black holes. Prog. Theor. Phys. Suppl., 90:1–218, 1987.

[7] NVIDIA. NVIDIA GeForce 8800 GPU Architecture Overview, 2006.

[8] NVIDIA. CUDA Programming Guide, Version 1.0, 2007.

[9] J. Shalf, E. Schnetter, G. Allen, and E. Seidel. Cactus as benchmarking platform. CCT Technical Report Series, CCT-TR-2006-3, 2006.

[10] M. Shibata and T. Nakamura. Evolution of three-dimensional gravitational waves: Harmonic slicing case. Phys. Rev. D, 52:5428, 1995.

[11] R. M. Wald. General relativity. The University of Chicago Press, Chicago, 1984.

Figure 2: Stage I column parallelization. The grid, here as an example of 16 × 16 × 16 cells (boundary ghost zones are not drawn), is decomposed in a plane into blocks, which correspond to the conceptual thread blocks in CUDA. Each block is 8 × 8 cells large to obtain a block size (number of threads per block) which is a multiple of the warp size and also optimized for reducing register bank conflicts. Each thread operates on a one-dimensional column, i.e. the kernel contains a loop in the remaining direction. As a Stage I scheme, the operations do not use the data parallel cache on the multiprocessors.

Figure 3: Stage I/II block parallelization. Instead of using columns, the grid is first decomposed into slabs of a certain height, here marked by blue lines; each slab is further distributed into blocks corresponding to the CUDA blocks, and finally each cube of size 4 × 4 × 4 is parallelized into 64 threads, which is optimal for the same reasons stated in Fig. 2. In contrast to the column scheme, the kernel does not contain a loop over cells for purposes of calculating the evolution, though a loop is used to copy data to the parallel cache for the Stage II parallelization (see text and Fig. 4). The kernel is called by the host repeatedly to operate on all slabs in the grid.

Figure 4: Schemes for caching the frame buffer data in the data parallel cache of a GPU multiprocessor. The diagrams show two-dimensional slices of the 4 × 4 × 4 cubes for illustration purposes. Each thread is assigned to one interior cell, but it needs data from adjacent cells for calculating finite differences. The interior portion of the cube is delimited by red lines, while the cube including ghost cells is indicated by blue lines. A direct scheme would first have each thread copy data for its cell into the cached array, and then, by conditional statements, assign threads to the ghost cells. However, this forces the compiler to serialize parts of the copy operation. A more efficient scheme operates on the data parallel cache arrays in a linear addressing mode, such that threads do not necessarily copy their assigned cells. The linear scheme also reduces cache bank conflicts more efficiently.

Figure 5: Speed-up of the general relativistic evolution code on an NVIDIA Quadro FX 5600 GPU, compared to the serial code running on an AMD Opteron 2.4 GHz. The code performs 100 iterations on a Cartesian grid with 128 × 128 × 128 cells. The left-most bar is the reference result on the host CPU, with a wall clock time of 430.3 seconds normalized to a speed-up of 1. The Stage Ia result is for the GPU code using a column decomposition (cf. Section 4), and the Stage Ib result is using a block decomposition, both without making use of the data parallel cache. Stage IIa is using the data parallel cache for read operations, and Stage IIb for read operations and a coalesced write operation.

Figure 6: Speed-up for different grid sizes. Each grid size is separately compared to the serial CPU code, which provides the normalization for the comparison. The GPU parallelization is more efficient for larger grids, leading to speed-ups of up to 26.

Figure 7: Speed-up with different caching strategies to accommodate more grid variables than fit into the data parallel cache. The reference point is the Stage IIb result for a 128 × 128 × 128 grid. Reducing the block size, i.e. the number of threads per block, increases the available shared memory per thread and thus makes it possible to store more variables. However, it also induces a significant performance hit, as shown in the figure. Another approach is to treat all variables separately for purposes of taking finite differences, instead of caching all in advance. This still introduces an increased overhead, but performs better than the former option.
