Optimizing Parallel Reduction in CUDA to Reach GPU Peak Performance

1Adityo Mahardito 2Adang Suhendra 3Deni Tri Hasta

1Gunadarma University([email protected]) 2Gunadarma University([email protected]) 3Gunadarma University([email protected])

Abstract

GPUs are massively multithreaded many-core chips that have hundreds of cores and thousands of concurrent threads with high performance and memory bandwidth. Nowadays, GPUs are already used to accelerate numerically intensive high performance computing applications in parallel, not only for graphics processing. This thesis aims primarily to demonstrate programming model approaches that can maximize the performance of GPUs. This is accomplished by a proof of the maximum reachable memory bandwidth and the speedup obtained from a GPU used for parallel computation. The programming environment used is NVIDIA's CUDA, a parallel architecture and model for general-purpose computing on a GPU.

1 Introduction

Currently, the demand for hardware resources for processing data on computers is increasing. This can be seen in the continuously increasing minimum requirements of software applications. The advent of multi-core CPUs and many-core GPUs means that mainstream processor chips are now parallel systems. Many applications that process large data sets can use a data parallel programming model to speed up their computations. Applications outside the field of computation itself are also accelerated by data parallel processing, from general image rendering and processing, signal processing, and physics simulation to computational finance or computational biology. For data parallel processing, the GPU can be preferred over the CPU, because there are very significant differences between the capabilities of CPUs and GPUs when processing large data sets in parallel computation. On the other side, GPUs are very efficient not only at manipulating computer graphics; their highly parallel structure also makes them more effective than general-purpose CPUs for a range of complex algorithms [2].

The GPU is specialized for compute-intensive, highly parallel computation and is therefore designed such that more transistors are devoted to data processing rather than to data caching and flow control. Many companies have produced GPUs under a number of brand names; Intel, NVIDIA, AMD/ATI, S3 Graphics, and VIA Technologies all produce GPUs, but NVIDIA and ATI control nearly 100% of the market. Recently, NVIDIA began releasing cards supporting an API extension to the C programming language called CUDA ("Compute Unified Device Architecture"), which allows specified functions from a normal C program to run on the GPU's stream processors [8].

The performance offered by the GPU in fact depends on the approach used when creating a CUDA program, even when the same algorithm is used, because in parallel programming, and especially on CUDA, the things that must be considered to obtain optimal results are memory usage and instruction usage. The main focus of this thesis is the optimization of parallel programs in the CUDA environment, running on a GPU, using a parallel reduction algorithm. The parallel reduction algorithm that is implemented is a summation algorithm that computes the sum of large arrays of values.

2 Literature Review

2.1 Graphic Processor Unit

A Graphics Processing Unit (GPU) is a processor attached to a graphics card dedicated to calculating floating point operations. The GPU has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth.
The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation and is therefore designed such that more transistors are devoted to data processing rather than to data caching and flow control. This difference makes the CPU and GPU suited for different kinds of problems. In a simplified way, one can say that a CPU performs best when a few, small pieces of data are processed in a complex but sequential way. This lets the CPU utilize the many transistors used for caching, branch prediction and instruction-level parallelism. The GPU, on the other hand, needs massively data parallel problems to work efficiently. The programming model most commonly used when programming a GPU is based on the stream programming model. In the stream programming model, input to and output from a computation come in the form of streams. A stream is a collection of homogeneous data elements on which some operation, called a kernel, is to be performed, and the operation on one element is independent of the other elements in the stream.

Another important difference between a general-purpose processor and a typical GPU is the memory bandwidth. Because of simpler memory models and no requirements from legacy operating systems, the GPU can support more than 100 GB/s of memory bandwidth, while the bandwidth of general-purpose processors is around 20 GB/s [7].

2.1.1 GPU Architecture

Starting from the Streaming Processor (SP, see Figure 1): it is a fully pipelined, single-issue, in-order microprocessor complete with two ALUs and an FPU. An SP doesn't have any cache, so it's not particularly great at anything other than cranking through tons of mathematical operations. Since an SP spends most of its time working on pixel or vertex data, the fact that it doesn't have a cache doesn't really matter; if you add up enough of these cores you can start to get something productive, given that graphics rendering is a highly parallelizable task [3].

Figure 1: Streaming Processor (SP)

The GPU is a many-core processor containing an array of Streaming Multiprocessors (SMs). An SM is an array of SPs; it consists of 8 Streaming Processors along with two additional processors called Special Function Units (SFUs). Each SFU has four FP multiply units which are used for transcendental operations (e.g. sin, cosine) and interpolation, the latter being used in some of the calculations for things like anisotropic texture filtering. NVIDIA isn't specific about the SFU architecture; just assume that each SFU is also a fully pipelined, single-issue, in-order microprocessor. There's an MT issue unit that dispatches instructions to all of the SPs and SFUs in the group.

In addition to the processor cores in an SM, there's a very small instruction cache, a read-only data cache and 16KB of read/write shared memory. These cache sizes are kept purposefully small because, unlike a conventional desktop microprocessor, the datasets we're trying to cache here are small. Each SP ends up working on an individual pixel and, despite the move to 32-bit floating point values, there's only so much data associated with a single pixel. The 16KB memory is akin to Cell's local stores in that it's not a cache but a software-managed data store, so that latency is always predictable. With this many cores in a single SM, control and predictability are very important to making the whole thing work efficiently.

The next cluster is the Texture Processor Cluster (TPC), shown in Figure 2. NVIDIA purposefully designed its GPU architecture to be modular, so a single TPC can be made up of any number of SMs; in the GT200 architecture it is made up of three SMs, and this can differ in other series [4].

Figure 2: Texture Processor Cluster

The components of the TPC, however, haven't changed: a TPC is made up of SMs, some control logic and a texture block. Remember that an SM is a total of 8 SPs and 2 SFUs, so that brings the total up to 24 SPs and 6 SFUs per cluster in GT200. The texture block includes texture addressing and filtering logic as well as an L1 texture cache. The modular theme continues with the Streaming Processor Array (SPA), which is composed of a number of TPCs, see Figure 3 [5].

Figure 3: Streaming Processor Array

Input data must first be copied into the device's global memory, and the processed data elements need to be stored back in global memory for copying back to host memory. The task of effectively hiding the global memory access latency and managing the memory hierarchy is very crucial for obtaining maximal performance from the GPU. Each SM manages the creation, execution, synchronization and destruction of concurrent threads in hardware, with zero scheduling overhead, and this is one of the key factors in achieving very high execution throughput. Each parallel thread is mapped to an SP for execution, and each thread maintains its own register state. All the SPs in an SM execute their threads in lock-step, according to the order of instructions issued by the per-SM instruction unit. The SM creates and manages threads in groups of 32, and each such group is called a warp [6]. A warp is the smallest unit of scheduling within each SM. The GPU achieves efficiency by splitting its workload into multiple warps and multiplexing many warps onto the same SM.

2.2 CUDA Overview

CUDA, the Compute Unified Device Architecture, is a general-purpose hardware interface designed to let programmers use NVIDIA graphics hardware for purposes other than graphics in a more familiar way. In general, the hardware need not be a graphics-related card at all, as there are cards designed specifically for general-purpose calculations. CUDA defines a programming model and a memory model that are consistent across all CUDA devices. The programming model describes how parallel code is written, launched and executed on a device and how threads are grouped into blocks. The memory model defines the different types of memories that are available to a CUDA program.

2.3 CUDA Programming Model

NVIDIA's Compute Unified Device Architecture (CUDA) is a programming model and hardware/software environment for GPGPU, which consists of the following components:

• An extension to the C programming language that allows programmers to define GPGPU functions called kernels, which are executed by multiple threads on the GPU.

• A compiler and related tools for translating CUDA source code to GPU device-specific binary.

• A software stack consisting of GPGPU application libraries, CUDA runtime libraries and the CUDA device driver.

• A CUDA-enabled GPU device.

2.3.1 CUDA Kernels

The CUDA programming language provides a means for programmers to express a problem that exhibits significant data-parallelism as a CUDA kernel. A CUDA kernel is a function that is executed in Single Program Multiple Data (SPMD) fashion on a large set of data elements. Kernel code is written at the thread level, with access to built-in variables that identify the executing thread. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<<...>>> syntax.
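As a concrete illustration of the __global__ specifier and the <<<...>>> launch syntax described above, a minimal CUDA program might look like the sketch below; the kernel name, array size and launch configuration are illustrative assumptions, not code from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: each thread adds 1.0f to one element of the array.
__global__ void incrementKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against out-of-range threads
        data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;                 // 1M elements (illustrative size)
    size_t bytes = n * sizeof(float);

    float *d_data;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    incrementKernel<<<blocks, threadsPerBlock>>>(d_data, n);  // kernel launch
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
```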

2.3.2 CUDA Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution, as illustrated by Figure 4. Each thread has a private local memory. Each thread block has a shared memory visible to all threads of the block and with the same lifetime as the block.

Figure 4: Memory Access

2.3.3 Thread Hierarchy

The CUDA programming model organizes threads into a three-level hierarchy, as shown in Figure 5. At the highest level of the hierarchy is the grid. A grid is a 2D array of thread blocks, and thread blocks are in turn 3D arrays of threads.

Figure 5: Hierarchy of threads in the CUDA Programming Model

The sizes of the grid and of the thread blocks are determined by the programmer, according to the size of the problem being solved, and communicated to the driver at kernel launch time. Each thread block in a grid has its own unique identifier and each thread has a unique identifier within a block. Using a combination of block ID and thread ID, it is possible to distinguish each individual thread running on the entire device. Only a single grid of thread blocks can be launched on the GPU at once, and the hardware limits on the number of thread blocks and threads vary across different GPU architectures.

A kernel is executed by a grid, which contains blocks, and these blocks contain the threads. A thread block is a batch of threads that can cooperate by sharing data through shared memory and synchronizing their execution. Threads from different blocks operate independently.
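The block and thread identifiers from this hierarchy are typically combined into per-thread indices as in the following sketch; the matrix-scaling example, kernel name and block dimensions are illustrative assumptions.

```cuda
#include <cuda_runtime.h>

// Kernel indexed with a 2D grid of 2D blocks; each thread handles one matrix cell.
__global__ void scaleMatrix(float *m, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (x < width && y < height)
        m[y * width + x] *= factor;
}

void launchScale(float *d_m, int width, int height) {
    dim3 block(16, 16);                              // 256 threads per block
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);     // enough blocks to cover the matrix
    scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);
}
```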

2.4 CUDA Optimization Strategy

In a CUDA performance optimization strategy there are four approaches that can be used, i.e.:

• Instruction performance

• Number of threads per block

• Data transfer between host (CPU) and device (GPU)

• Texture fetch versus global or constant memory read

In this case, the strategy that will be used is the instruction performance approach.

2.4.1 Instruction Performance

To process an instruction for a warp of threads, a multiprocessor must read the instruction operands for each thread of the warp, execute the instruction, then write the result for each thread of the warp. Therefore, the effective instruction throughput depends on the nominal instruction throughput as well as on the memory latency and bandwidth. It is maximized by minimizing the use of instructions with low throughput and by maximizing the use of the available memory bandwidth for each category of memory.

2.4.2 Instruction Throughput

• Arithmetic instructions. To issue one instruction for a warp, a multiprocessor takes 4 clock cycles for single-precision floating-point add, multiply and multiply-add, integer add, bitwise operations, compare, min, max and type conversion instructions. Integer division and modulo operations are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible. If n is a power of 2, (i/n) is equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n − 1)); the compiler will perform these conversions if n is a literal (see the sketch after this list).

• Control flow instructions. Any flow control instruction (if, switch, do, for, while) can significantly impact the effective instruction throughput by causing threads of the same warp to diverge, that is, to follow different execution paths. If this happens, the different execution paths have to be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path. To obtain the best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps.

• Memory instructions. Memory instructions include any instruction that reads from or writes to shared, local or global memory. A multiprocessor takes 4 clock cycles to issue one memory instruction for a warp. When accessing local or global memory, there are, in addition, 400 to 600 clock cycles of memory latency. Much of this global memory latency can be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete.

• Synchronization. The __syncthreads instruction takes 4 clock cycles to issue for a warp if no thread has to wait for any other threads.
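The sketch below illustrates the division/modulo replacement and the warp-divergence point from the bullets above; the helper and kernel names and the power-of-two assumption are illustrative, not taken from the paper.

```cuda
// Assumes n = 1 << log2n is a power of two (illustrative helpers, not from the paper).
__device__ int fastDiv(int i, int log2n) {
    return i >> log2n;            // i / n  when n == 1 << log2n
}

__device__ int fastMod(int i, int n) {
    return i & (n - 1);           // i % n  when n is a power of two
}

// Divergent version: threads of one warp take different branches depending on tid % 2.
__global__ void divergentKernel(float *data) {
    int tid = threadIdx.x;
    if (tid % 2 == 0)             // even/odd split inside a warp -> serialized paths
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}

// Warp-aligned version: the branch condition is constant within a warp (warp size 32),
// so no warp diverges even though different warps do different work.
__global__ void warpAlignedKernel(float *data) {
    int tid = threadIdx.x;
    if ((tid / 32) % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;
}
```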
2.4.3 Memory Bandwidth

The effective bandwidth of each memory space depends significantly on the memory access pattern. Since device memory has much higher latency and lower bandwidth than on-chip memory, device memory accesses should be minimized. A typical programming pattern is to stage data coming from device memory into shared memory; in other words, to have each thread of a block (a sketch of this pattern follows the list):

1. Load data from device memory to shared memory.

2. Synchronize with all the other threads of the block so that each thread can safely read shared memory locations that were written by different threads.

3. Process the data in shared memory.

4. Synchronize again if necessary to make sure that shared memory has been updated with the results.

5. Write the results back to device memory.
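A minimal sketch of the five-step staging pattern, assuming a hypothetical kernel that reverses each 256-element block in place (the kernel name, block size and operation are illustrative assumptions):

```cuda
#define BLOCK_SIZE 256

// Reverses each BLOCK_SIZE-element segment of g_data in place.
// For simplicity, n is assumed to be a multiple of BLOCK_SIZE.
__global__ void reverseBlocks(float *g_data, int n) {
    __shared__ float s_data[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // 1. Load data from device (global) memory into shared memory.
    if (i < n)
        s_data[tid] = g_data[i];

    // 2. Synchronize so every element of the block is visible to every thread.
    __syncthreads();

    // 3. Process the data in shared memory: read an element written by another thread.
    float value = s_data[blockDim.x - 1 - tid];

    // 4. A second __syncthreads() would only be needed if s_data were written again here.

    // 5. Write the results back to device memory.
    if (i < n)
        g_data[i] = value;
}
```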

2.5 Parallel Reduction

Parallel reduction is one of the kinds of Parallel Random Access Machine (PRAM) algorithms. A Parallel Random Access Machine (PRAM) is a shared-memory abstract machine which is used by parallel algorithm designers to estimate algorithm performance (such as its time complexity). A parallel reduction is therefore a PRAM process that manipulates data stored in the memory of global registers. There are several operations used in parallel reduction, i.e. the addition operation, the subtraction operation, and the multiplication operation. Parallel reduction can be described with a binary tree, where a group of n values is added in log n addition steps in parallel.

2.5.1 Kahan Summation Algorithm

In numerical analysis, the Kahan summation algorithm (see Algorithm 1), also known as compensated summation, significantly reduces the numerical error in the total obtained by adding a sequence of finite-precision floating-point numbers, compared to the obvious approach. This is done by keeping a separate running compensation (a variable that accumulates small errors). In particular, simply summing n numbers in sequence has a worst-case error that grows proportionally to n, and a root mean square error that grows as √n for random inputs (the roundoff errors form a random walk). With compensated summation, the worst-case error bound is independent of n, so a large number of values can be summed with an error that depends only on the floating-point precision.

Algorithm 1 Kahan Summation Algorithm
function kahanSum(input)
    sum ← input[1]
    c ← 0.0                  ▷ running compensation for lost low-order bits
    for i = 2 to input.length
        y ← input[i] − c
        t ← sum + y
        c ← (t − sum) − y
        sum ← t
    next i
    return sum
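A direct C rendering of Algorithm 1, usable in the host code of a CUDA program, might look as follows; the function signature is an illustrative assumption.

```cuda
// Kahan (compensated) summation of n floats; mirrors Algorithm 1.
float kahanSum(const float *input, int n) {
    float sum = input[0];        // running total
    float c   = 0.0f;            // running compensation for lost low-order bits
    for (int i = 1; i < n; ++i) {
        float y = input[i] - c;  // corrected next term
        float t = sum + y;       // new (rounded) total
        c = (t - sum) - y;       // recover what was lost when rounding t
        sum = t;
    }
    return sum;
}
```

Note that aggressive fast-math compiler optimizations can remove the compensation term, so such code is normally compiled with strict floating-point semantics.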
2.5.2 Parallel Reduction Complexity and Cost

In parallel processing there are three kinds of complexity, i.e. step complexity, work complexity, and time complexity.

• Step complexity: there are log N parallel steps, and each step S performs N/2^S independent operations, so the step complexity is O(log N).

• Work complexity: for N = 2^D the reduction performs Σ_{S=1..D} 2^(D−S) = N − 1 operations, so the work complexity is O(N).

• Time complexity: O(N/P + log N) with P threads physically in parallel (P processors).

3 Methodology

3.1 General Scenario

A reduction is the process of combining the elements of a vector or array to yield a single aggregate element. It is commonly used in scientific computations and in parallel processing. Therefore, this research tries to implement this method to find a way of optimizing the parallel device. The reduction has several operations that can be used, i.e. addition, multiplication, subtraction, and division. The author chose the addition operation with the consideration that it will not produce elements that are negative, fractional, or too large. This reduction operation is then implemented to process a large set of array elements on the GPU device. The case used is the sum of incrementing array elements in the CUDA environment; the CUDA programming model is then explored to obtain different implementations, in an attempt to find an efficient and scalable implementation.

3.2 GPU Device

In this research, the GPU device to be optimized is a GeForce GT 240. This device is a desktop-class card, commonly used to play games on a PC. It has 96 CUDA cores to process in parallel, with 54.4 GB/s of memory bandwidth. The targets to reach on the GPU are the maximum memory bandwidth and the minimum processing time.

3.3 Research Method

There are several steps carried out in this research, i.e.:

1. Use the recent example program from Ian Buck's research [1].

2. Find a weakness or problem in the program.

3. Try to solve the problem, then upgrade it to a better program.

4. Test and compare all of them.

3.4 Program Implementation

3.4.1 Interleaved Addressing with Divergent Branching by Ian Buck

The program, from here on called the kernel, performs a reduction where each thread operates on two elements loaded from shared memory. It is an implementation of a sum EREW PRAM reduction process, where the indexing method uses a modulo operation in every iteration of the loop.

This method is simple but has a weakness: thread indexing is wasteful, because there is a hop between one active thread and the next, which produces highly divergent warps and an inefficient process. The solution, according to the performance strategy, is to revise this program with the arithmetic instruction and control flow instruction approaches, replacing the modulo test if (tid % (2*s) == 0) to minimize the number of divergent branches and warps. It can be replaced with an if (index < blockDim.x) control flow, where int index = 2 * s * tid.

3.4.2 Reduction kernel with bank conflict method

This method starts from the interleaved addressing solution with divergent branching and simply replaces the thread indexing with a strided index and a non-divergent branch, so the thread ID indexing is no longer based on the order of the element values in shared memory. By using this method, the thread indices are ordered correctly without hops between them, so no threads are wasted.

But the new problem is shared memory bank conflicts. To achieve high memory bandwidth, shared memory is divided into equally sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.

For all threads of a warp (32 threads), accessing shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads. So, the solution is to use the sequential addressing method for indexing the thread ID.

3.4.3 Reduction kernel with sequential addressing

The sequential addressing method avoids the bank conflict problem, because the thread IDs are indexed sequentially, so threads will not access the same address in a memory bank; each thread points to its own shared memory address. In the source code, the only thing that must be replaced is the strided indexing in the inner loop, which becomes a reversed loop with thread-ID-based indexing.

However, this approach still has a problem: half of the threads are idle in the first loop iteration, which is wasteful and of course affects performance. The solution is obtained by performing a reduction during the load of the elements from global memory to shared memory.
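The three inner-loop variants described in 3.4.1 to 3.4.3 are sketched below. The kernel skeleton follows the layout implied by the text (a shared-memory array sdata indexed by tid); the concrete names and the fixed block size are illustrative assumptions rather than the paper's exact source code.

```cuda
#define BLOCK_SIZE 256   // launch with BLOCK_SIZE threads per block

__global__ void reduceSum(const float *g_idata, float *g_odata, int n) {
    __shared__ float sdata[BLOCK_SIZE];

    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? g_idata[i] : 0.0f;     // one element per thread
    __syncthreads();

    // --- Variant 1 (3.4.1): interleaved addressing, divergent branching ---
    // for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    //     if (tid % (2 * s) == 0)                // modulo test -> highly divergent warps
    //         sdata[tid] += sdata[tid + s];
    //     __syncthreads();
    // }

    // --- Variant 2 (3.4.2): strided index, non-divergent branch, but bank conflicts ---
    // for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    //     unsigned int index = 2 * s * tid;
    //     if (index < blockDim.x)                // contiguous threads stay active
    //         sdata[index] += sdata[index + s];  // strided access -> bank conflicts
    //     __syncthreads();
    // }

    // --- Variant 3 (3.4.3): sequential addressing (reversed loop, conflict free) ---
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)                               // first half of the threads stay active
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];            // one partial sum per block
}
```

The kernel produces one partial sum per block, so the output array must itself be reduced again (for example by a second launch) to obtain the final total.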

3.4.4 Reduction kernel with first add during load

Idle threads in the sequential addressing method are avoided by performing the reduction while loading the elements from global memory to shared memory, which can be called the first add during load method. In this method, during the loop that loads elements from global memory, the reduction is also carried out on elements that have already been loaded before the result is finally written to shared memory, so the element processed in the kernel function is already the result of a sum computation. The point is to replace the single load with two loads and to perform the first add during the load.

This should be the best method for reaching GPU peak performance, but in fact the achieved memory bandwidth is still far from the bandwidth bound; this may be caused by an instruction bottleneck or instruction latency. An instruction's latency is the number of clock cycles it takes for the instruction to pass through the pipeline. To avoid this condition, the strategy that can be used is unrolling loops, because the reduction has low arithmetic intensity.

3.4.5 Reduction kernel with unroll last warp

Sometimes the compiler may unroll loops or optimize out if or switch statements by using branch predication instead; in these cases, no warp can ever diverge. The programmer can also control loop unrolling. As the reduction proceeds, the number of active threads decreases. In the GPU architecture, a kernel or an instruction is SIMD-synchronous within a warp; a warp is a group of threads executed physically in parallel, and each warp consists of 32 threads. That means that when s <= 32, __syncthreads() is no longer needed and the if (tid < s) test is no longer needed because it does not save any work, so the last iterations can be unrolled by hand. Unrolling the last 6 iterations of the inner loop saves useless work in all warps, not just the last one; without unrolling, all warps execute every iteration of the for loop and the if statement.

3.4.6 Reduction kernel with completely unrolled

If the number of iterations is known at compile time, the reduction can be completely unrolled. Luckily, the block size is limited by the GPU to 512 threads, and block sizes that are powers of 2 are used, so the loop can easily be unrolled for a fixed block size, replacing the iterative for loop with a completely unrolled version.

3.4.7 Reduction kernel with multiple elements per thread

Given a total number of threads per grid, the number of threads per block, or equivalently the number of blocks, should be chosen to maximize the utilization of the available computing resources. This means that there should be at least as many blocks as there are multiprocessors in the device. Furthermore, running only one block per multiprocessor will force the multiprocessor to idle during thread synchronization and also during device memory reads if there are not enough threads per block to cover the load latency. It is therefore usually better to allow two or more blocks to be active on each multiprocessor, to allow overlap between blocks that wait and blocks that can run. For this to happen, not only should there be at least twice as many blocks as there are multiprocessors in the device, but the amount of allocated shared memory per block should also be at most half the total amount of shared memory available per multiprocessor.
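Building on the sequential addressing kernel sketched earlier, the following sketch combines the ideas of 3.4.4, 3.4.5 and 3.4.7: the first add is performed during the load, the last warp is unrolled, and each thread sums multiple elements with a grid-stride loop before the shared-memory reduction. The names, the fixed block size of 256 and the use of a grid-stride loop are illustrative assumptions rather than the paper's exact code.

```cuda
#define BLOCK_SIZE 256   // launch with BLOCK_SIZE threads per block

// Unrolled final warp: within a warp no __syncthreads() is needed on the
// warp-synchronous hardware discussed here; volatile prevents register caching.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

__global__ void reduceSumOpt(const float *g_idata, float *g_odata, unsigned int n) {
    __shared__ float sdata[BLOCK_SIZE];
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // Multiple elements per thread: first add during load, then stride over the grid.
    float sum = 0.0f;
    while (i < n) {
        sum += g_idata[i];
        if (i + blockDim.x < n)
            sum += g_idata[i + blockDim.x];    // the "first add during load"
        i += gridSize;
    }
    sdata[tid] = sum;
    __syncthreads();

    // Sequential addressing reduction down to the last warp.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unroll the last warp (s <= 32).
    if (tid < 32)
        warpReduce(sdata, tid);

    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];         // one partial sum per block
}
```

The complete unrolling of 3.4.6 would additionally make the block size a compile-time template parameter so that the remaining for loop disappears entirely; on GPU generations newer than the GT200-class hardware discussed here, the unrolled warp would also require explicit __syncwarp() calls.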
4 Result and Comparison

4.1 Parameter

Before running the programs for each method, the following parameters were used on CUDA for all methods:

• Number of array elements: 2^17 to 2^25

• Number of threads per block: 512

• Number of blocks: 512

4.2 Result and Comparison

For the 2^24-element array, the results of running all of the methods, compared by memory bandwidth and time consumed, are shown in Figure 6. The figure describes the performance of each method when processing 2^24 array elements, where the "Time" column is the processing time needed in milliseconds, the "Bandwidth" column is the rate at which data can be read from or stored into memory (memory bandwidth is usually expressed in bytes/second), the "Step Speedup" column is the speedup of the current kernel relative to the previous kernel, and the "Cumulative Speedup" column is the speedup of the current kernel relative to kernel 1 (the divergent branching method).

Figure 6: Result performance by all of the methods
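Time and effective bandwidth figures of this kind are commonly obtained by timing the kernel with CUDA events and dividing the bytes read by the elapsed time. The host-side sketch below assumes the illustrative reduceSumOpt kernel from the previous sketch; it is not the measurement code used in the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Defined in the earlier sketch.
__global__ void reduceSumOpt(const float *g_idata, float *g_odata, unsigned int n);

// Times one reduction launch and reports effective bandwidth in GB/s.
void timeReduction(const float *d_in, float *d_out, unsigned int n,
                   int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    reduceSumOpt<<<blocks, threads>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);          // elapsed time in milliseconds

    // Each input element is read once: n * sizeof(float) bytes moved.
    double gbPerSec = (n * sizeof(float)) / (ms * 1.0e6);
    printf("time = %.3f ms, bandwidth = %.2f GB/s\n", ms, gbPerSec);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```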

For arrays of 2^17 to 2^25 elements, Figure 7 shows the performance comparison by time in graphical form. In that figure it can be seen that from 131,072 to 1,048,576 (2^17 to 2^20) array elements, all of the methods are still reliable to use. At 2,097,152 to 16,777,216 (2^21 to 2^24) array elements, the Divergent Branching, Bank Conflict, and Sequential Addressing methods were no longer reliable, because they need a time of over 1.0 second. Furthermore, at 2^25 elements only four methods remain reliable. At the end of the results, the best method is Multiple Elements per Thread, because it consumes less time than the others.

Figure 7: Result comparison by time

Figure 8 shows the performance comparison by bandwidth in graphical form; here the best method is the one that reaches the highest bandwidth, or comes closest to the GPU bandwidth boundary of 54.4 GB/s. The figure shows performance from 131,072 to 33,554,432 (2^17 to 2^25) elements.

Figure 8: Result comparison by bandwidth

5 Conclusions

5.1 Conclusion

According to the results, GPU peak performance has been reached. This is shown by the achieved memory bandwidth value (43.3 GB/s) compared with the memory bandwidth bound (54.4 GB/s) of the GPU device (GeForce GT 240).

To optimize the GPU for the parallel reduction process, keep the multiprocessors on the device as busy as possible. A device in which work is poorly balanced across the multiprocessors will deliver suboptimal performance. Hence, it is important to design your application to use threads and blocks in a way that maximizes hardware utilization and to limit practices that impede the free distribution of work. One of the keys to good performance is keeping the multiprocessors busy; another important concept is the management of the system resources allocated for a particular task. The CUDA optimization strategy can be used to maximize hardware utilization. In this case of the reduction process, the optimizations address divergent branching, bank conflicts, memory coalescing with sequential addressing, and latency hiding with loop unrolling.

Actually, many types of approaches can be used to optimize performance on a GPU, but especially for CUDA there are two types of strategy, i.e. instruction throughput and memory bandwidth.

5.2 Future Work

The author wishes to continue the research on GPU performance in future work and to try some different things, i.e.: a different type of GPU device, with higher specifications, to process larger data; the thread and block parameters used, to find the ideal portion of shared memory usage; and the algorithm processing, to break through the memory bandwidth boundary.

References

[1] Ian Buck. Parallel programming with CUDA. NVIDIA, 2008.

[2] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture. In NVIDIA CUDA Programming Guide 2.0, volume 2, page 1. 2008.

[3] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2, 2008. http://www.anandtech.com/show/2549/2, accessed August 2010.

[4] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. 2008. http://www.anandtech.com/show/2549/4, accessed April 2010.

[5] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 4, 2008. http://www.anandtech.com/show/2549/4, accessed April 2010.

[6] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2, 2008. http://www.anandtech.com/show/2549/2, accessed April 2010.

[7] Wikipedia. GPGPU, 2010. http://en.wikipedia.org/wiki/GPGPU.

[8] Wikipedia. Graphics processing unit, 2010. http://en.wikipedia.org/wiki/Graphics_processing_unit, accessed August 2010.