Center for Computation & Technology
CCT Technical Report Series
CCT-TR-2008-1

A General Relativistic Evolution Code on CUDA Architectures

Burkhard Zink
Center for Computation & Technology and Department of Physics & Astronomy
Louisiana State University, Baton Rouge, LA 70803 USA

Posted January 2008. cct.lsu.edu/CCT-TR/CCT-TR-2008-1

The author(s) retain all copyright privileges for this article and accompanying materials. Nothing may be copied or republished without the written consent of the author(s).

A general relativistic evolution code on CUDA architectures

Burkhard Zink
Center for Computation and Technology, and Department of Physics and Astronomy,
Louisiana State University, Baton Rouge, LA 70803, USA
[email protected]

ABSTRACT
I describe the implementation of a finite-differencing code for solving Einstein's field equations on a GPU, and measure speed-ups compared to a serial code on a CPU for different parallelization and caching schemes. Using the most efficient scheme, the (single precision) GPU code on an NVIDIA Quadro FX 5600 is shown to be up to 26 times faster than a serial CPU code running on an AMD Opteron at 2.4 GHz. Even though the actual speed-ups in production codes will vary with the particular problem, the results obtained here indicate that future GPUs supporting double-precision operations can potentially be a very useful platform for solving astrophysical problems.

1. INTRODUCTION
The high parallel processing performance of graphics processing units (GPUs), with current models achieving peak performances of up to 350 GFlop/s (for single-precision floating-point operations), has traditionally been used to transform, light and rasterize triangles in three-dimensional computer graphics applications. In recent architectures, however, the vectorized pipeline for processing triangles has been replaced by a unified scalar processing model based on a large set of stream processors. This change has initiated a consideration of GPUs for solving general-purpose computing problems, and triggered the field of general-purpose computing on graphics processing units (GPGPU).

High-performance, massively parallel computing is one of the major tools for the scientific community to understand and quantify problems not amenable to traditional analytical techniques, and has led to ever-increasing hardware performance requirements for tackling more and more advanced questions. Therefore, GPGPU appears to be a natural target for scientific and engineering applications, many of which admit highly parallel algorithms which are already used on current-generation supercomputers based on multi-core CPUs.

In this technical report, I will describe an implementation of one of the most challenging problems in computational physics, solving Albert Einstein's field equations for general relativistic gravitation, on a graphics processing unit. The primary purpose is to make an estimate of potential performance gains in comparison to current CPUs, and to gain an understanding of architectural requirements for middleware solutions serving the needs of the scientific community, most notably the Cactus Computational Toolkit [4, 3, 9].

The particular GPU used for these experiments is an NVIDIA G80 series card (Quadro FX 5600). NVIDIA has also released a software development kit called CUDA (compute unified device architecture) [8] for development of GPU code using an extension of the C language. As opposed to earlier attempts to program GPUs with the shader language supplied for graphics applications, this makes it easier to port existing general-purpose computation code to the target device.

Section 2 contains a description of the G80 hardware, the CUDA architectural model, and the performance considerations important for GPU codes. Then, in Section 3, we will turn to the particular problem of solving Einstein's field equations, which we will approach from a mostly algorithmic (as opposed to physical) point of view. Section 4 describes the structure and implementation of the code, and Section 5 discusses the benchmarking results obtained. I give a discussion of the results, and an outlook, in Section 6.

2. CUDA AND THE G80 ARCHITECTURE

2.1 The NVIDIA G80 architecture
The NVIDIA G80 hardware [7] is the foundation of the GeForce 8 series consumer graphics cards, the Quadro FX 4600 and 5600 workstation cards, and the new Tesla 870 set of GPGPU boards. G80 represents the third major architectural change for NVIDIA's line of graphics accelerators. Traditionally, GPU accelerators enhance the transformation and rendering of simple geometric shapes, usually triangles. The processing pipeline consists of transformation and lighting (now vertex shading), triangle setup, pixel shading, raster operations (blending, z-buffering, anti-aliasing) and the output to the frame buffer for scan-out to the display. First- and second-generation GPU architectures typically process these steps with special-purpose hardware in a pipelined fashion.

The increase in demand for programmability of illumination models and geometry modifiers, and also load-balancing requirements with respect to vertex and pixel shading operations, has led to more generality in GPU design. The current G80 architecture consists of a parallel set of stream processors with a full set of integer and (up to FP32) floating-point instructions. When processing triangles and textures, the individual steps of the processing pipeline are dynamically mapped to these processors, and since the operations are highly parallelizable, the scheduler can consistently maintain a high load.

Physically, eight stream processors (SP) are arranged in a multiprocessor with texture filtering and addressing units, a texture cache, a set of registers, a cache for constants, and a parallel data cache. Each multiprocessor is operated by an instruction decoding unit which executes a particular command in a warp: the same command is executed on all SPs for a set of clock cycles (because the instruction units and the SP ALUs have different clock speeds). This constitutes a minimal unit of SIMD computation on the multiprocessor, called the warp size, and will be important later when considering code efficiency on the architecture.

The GPU card contains a set of such multiprocessors, with the number depending on the particular model (e.g., one in the GeForce 8400M G, 12 in the GeForce 8800 GTS, and 16 in the GeForce 8800 GTX, Quadro FX 5600 and Tesla 870 models). The multiprocessors are operated by a thread scheduling unit with fast switching capabilities. In addition, the board has frame buffer memory and an extra set of memory for constants. The particular numbers for the Quadro FX 5600 are listed in Table 1, and Fig. 1 shows a diagram of the architecture (cf.

  Number of multiprocessors (MP)            16
  Number of stream processors per MP        8
  Warp size (see text)                      32
  Parallel data cache                       16 kB
  Number of banks in parallel data cache    16
  Number of 32-bit registers per MP         8192
  Clock frequency of each MP                1.35 GHz
  Frame buffer memory type                  GDDR3
  Frame buffer interface width              384 bits
  Frame buffer size                         1.5 GB
  Constants memory size                     64 kB
  Clock frequency of the board              800 MHz
  Host bus interface                        PCI Express

Table 1: Technical specifications of a Quadro FX 5600 GPU.
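The quantities in Table 1 can also be read back at run time through the CUDA runtime API. The following is a minimal sketch, not taken from the report's code; the device index 0 is an assumption, and some fields correspond only loosely to the table rows (for instance, registers and shared memory are reported per block rather than per multiprocessor):

    // Minimal sketch: querying device properties that correspond to Table 1.
    // Compile with: nvcc devquery.cu -o devquery
    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
            fprintf(stderr, "No CUDA device found.\n");
            return 1;
        }

        printf("Device                     : %s\n", prop.name);
        printf("Multiprocessors (MP)       : %d\n", prop.multiProcessorCount);
        printf("Warp size                  : %d\n", prop.warpSize);
        printf("Shared memory per block    : %lu kB\n",
               (unsigned long)(prop.sharedMemPerBlock / 1024));
        printf("32-bit registers per block : %d\n", prop.regsPerBlock);
        printf("MP clock frequency         : %.2f GHz\n", prop.clockRate / 1.0e6);
        printf("Frame buffer size          : %lu MB\n",
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
        printf("Constants memory size      : %lu kB\n",
               (unsigned long)(prop.totalConstMem / 1024));
        return 0;
    }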
2.2 CUDA (compute unified device architecture)
Since the G80 is based on general-purpose stream processors with a high peak performance, it appears to be a natural target for general-purpose parallel computing, and in particular scientific applications. NVIDIA has recently released the CUDA SDK [8] for running parallel computations on the device hardware. While earlier attempts to use GPUs for general-purpose computing had to use the various shading languages, e.g. Microsoft's HLSL or NVIDIA's Cg, CUDA is based on an extension of the C language.

The SDK consists of a compiler (nvcc), host and device run-time libraries, and a driver API. Architecturally, the driver builds the primary layer on top of the device hardware, and provides interfaces which can be accessed either by the CUDA runtime or the application. Furthermore, the CUDA runtime can be accessed by the application and the service libraries (currently for BLAS and FFT).

CUDA's parallelization model is a slight abstraction of the G80 hardware. Threads are arranged into blocks, where each block is executed on only one multiprocessor. Therefore, within a block, additional thread context and synchronization options exist (by use of the shared resources on the chip), whereas no global synchronization is available between blocks. A set of blocks constitutes a SIMD compute kernel. Kernel calls themselves are asynchronous to the host CPU: they return immediately after issuance. Currently only one kernel can be executed at any time on the device, which is a limitation of the driver software.
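As an illustration of this model, the following sketch shows a kernel operating on a two-dimensional grid function together with its host-side launch. It is purely illustrative and not the evolution code described later in this report; the kernel name scale2d, the grid dimensions and the block size are made up for the example.

    // Illustrative sketch of the CUDA block/thread model: each block runs on
    // one multiprocessor, threads within a block may synchronize, and the
    // kernel launch returns immediately on the host.
    #include <cuda_runtime.h>

    __global__ void scale2d(float *u, int nx, int ny, float a)
    {
        // Block and thread indices give the grid point directly,
        // without any modulo arithmetic inside the thread.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;

        if (i < nx && j < ny)
            u[j * nx + i] *= a;

        __syncthreads();   // barrier for the threads of this block only;
                           // no such barrier exists across blocks
    }

    int main(void)
    {
        const int nx = 256, ny = 256;
        float *d_u;
        cudaMalloc((void **)&d_u, nx * ny * sizeof(float));
        cudaMemset(d_u, 0, nx * ny * sizeof(float));

        dim3 block(16, 16);                      // 256 threads per block (8 warps)
        dim3 grid((nx + block.x - 1) / block.x,
                  (ny + block.y - 1) / block.y);

        scale2d<<<grid, block>>>(d_u, nx, ny, 2.0f);   // asynchronous launch
        cudaDeviceSynchronize();                       // block host until done

        cudaFree(d_u);
        return 0;
    }

The <<<grid, block>>> call returns immediately; cudaDeviceSynchronize() (cudaThreadSynchronize() in early CUDA toolkits) makes the host wait for completion, while __syncthreads() provides the barrier available to the threads of a single block.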
The thread-block model hides the particular number of multiprocessors and stream processors from the CUDA kernel insofar as the block grid is partially serialized into batches. Each multiprocessor can execute more than one block and more threads than the warp size per multiprocessor to hide device memory access latency. However, the limitations of this model are implicit performance characteristics: if the number of blocks is lower than the number of multiprocessors, or the number of threads is lower than the warp size, significant performance degradation will occur. Therefore, the physical configuration is only abstracted insofar as the kernel compiles and runs on different GPU configurations, but not in terms of achieving maximal performance.

Within the kernel thread code, the context provides a logical arrangement of blocks in one- or two-dimensional, and threads per block in one-, two-, or three-dimensional sets, which is convenient for the kind of numerical grid application we will present below since it avoids additional (costly) modulo operations inside the thread. Threads support local variables and access to the shared memory (which maps