Evaluating CUDA Compared to Multicore Architectures∗

Laércio L. Pilla, Philippe O. A. Navaux
Institute of Informatics - Federal University of Rio Grande do Sul - Porto Alegre, Brazil
{laercio.pilla, navaux}@inf.ufrgs.br

∗ This research was partially supported by Microsoft.

Abstract

The use of GPUs (Graphics Processing Units), such as the CUDA architecture, as accelerators for general purpose applications has become a reality. Still, there is not a clear definition of the performance to be expected with applications on this complex architecture, especially when using double precision floating point operations. In this context, this paper presents a comparison between the CUDA architecture and the Nehalem microarchitecture considering three classes of applications from the Dwarf Mine classification. The results showed similar performances for two benchmarks and a speedup of 4 when comparing CUDA to a dual Nehalem system with the MapReduce application. These results indicate a successful mapping of three important classes of applications to GPUs.

1. Introduction

Industry is undergoing a dramatic shift towards parallel platforms, driven by the power and performance limitations of sequential processors [1, 2]. A popular alternative platform for high-performance computing is the use of Graphics Processing Units (GPUs) as accelerators, such as the CUDA architecture [8]. This architecture has been providing increases in performance for many applications, such as fluid dynamics [5, 9].

However, developing and porting applications to this architecture is not trivial. This is due to complex aspects of the GPU, such as its processing and memory hierarchies, as well as the different performance with single and double precision floating point operations. In addition, there is little available research on which classes of problems can profit from it.

A better outline of the application characteristics compatible with the CUDA architecture could benefit users by preventing unprofitable implementations and hardware investments. It could also help by better defining the expected behavior on GPUs and their utility.

In this context, this paper presents an evaluation of the CUDA architecture using three classes of applications from the Dwarf Mine classification [2]. The performance of this architecture was compared to the ones obtained with the Nehalem microarchitecture [3] and with a Dual Core processor, all using double precision floating point operations.

2. CUDA Architecture

Compute Unified Device Architecture (CUDA) [8] denotes both a SIMD (Single Instruction, Multiple Data) architecture and a programming model that extends languages (such as C) to use these GPUs. The GPU works as a parallel accelerator for the CPU, as illustrated in Figure 1(a). A thread on the CPU makes a special function call, named kernel, which executes on the GPU. All necessary data has to be transferred to the GPU before this call, since the GPU has its own memory.

A CUDA GPU is composed of simple processing units called Scalar Processors (SPs). These SPs are organized in Streaming Multiprocessors (SMs). In the architecture of the GPU considered in this paper, an SM is composed of eight SPs and one instruction unit. Due to this, all SPs in the same SM execute the same instruction at each cycle, which can be seen as a vector machine [9]. There is only one double precision floating point unit per SM.

The CUDA programming model works with the abstraction of thousands of threads computing in parallel. These threads are organized in blocks. Threads in the same block can be easily synchronized with a barrier function. Multiple blocks organized as a grid compose a kernel. This logical processing hierarchy is mapped to the physical hierarchy as shown in Figure 1(b). A block of threads runs on an SM and each of its threads executes on an SP. A thread block can have more threads than an SM may run in parallel.

The CUDA architecture also has a memory hierarchy, as depicted in Figure 1(c). All memories inside the SM have a small size and low latency. The global memory has a large size and a high latency (from 400 to 600 cycles [8]) and can be accessed by all threads. To increase efficiency, some access patterns can coalesce several memory requests into one request with a large word.
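To make this execution model concrete, the sketch below shows a minimal, hypothetical CUDA program (it is not taken from the paper's benchmarks): data is copied to the GPU's own memory, a kernel is launched over a grid of thread blocks, and the result is copied back. The kernel name scale and the sizes used are illustrative assumptions only.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale(double *v, double alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n)
            v[i] *= alpha;   // neighboring threads touch neighboring words,
                             // so these global memory accesses can be coalesced
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(double);
        double *h = (double *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0;

        cudaMalloc((void **)&d, bytes);                    // the GPU has its own memory
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // transfer data before the call

        int threads = 256;                                 // threads per block
        int blocks = (n + threads - 1) / threads;          // blocks in the grid (the kernel)
        scale<<<blocks, threads>>>(d, 2.0, n);             // kernel call made by a CPU thread

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy results back
        cudaFree(d); free(h);
        return 0;
    }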

[Figure 1: CUDA characteristics. (a) Execution model; (b) Logical to physical mapping; (c) Memory hierarchy.]

3. Dwarf Mine Classification

The Dwarf Mine classification, also known as the 13 Dwarfs [1, 2], organizes algorithmic methods based on their common computation and communication behaviors. It was first developed by Phillip Colella as the 7 Dwarfs and then extended by researchers at Berkeley [2]. This classification focuses on the underlying patterns that have persisted in applications, being independent of implementation.

In this paper, we focus on three of these dwarfs to evaluate the CUDA architecture: Sparse Linear Algebra, Spectral Methods and MapReduce. For this, we ported three kernels from the NAS Parallel Benchmarks (NPB) [7], originally implemented with OpenMP (original C+OpenMP codes available at the Omni Project homepage: http://www.hpcs.cs.tsukuba.ac.jp/omni-openmp/), to the GPU (implemented C+CUDA codes available at the HPCGPU Project homepage: http://hpcgpu.codeplex.com). The NPB are a free set of benchmarks used to evaluate parallel architectures and are composed of representative Computational Fluid Dynamics (CFD) kernels. These benchmarks can be easily mapped to the Dwarf Mine classification, as seen in [2]. Each benchmark contains its own performance and correctness verification codes, as well as test sets of different sizes (in ascending order: W, A, B, C and D). All floating point operations use double precision in the NPB.

4. Experimental Evaluation

In these experiments, we compare the performance of the three considered dwarfs on three different systems: a Dual Core, a GPU connected to the Dual Core, and a computer with two processors with the Nehalem microarchitecture. The configuration details of the systems are presented in Table 1. The dwarfs were compiled using gcc 4.3.1 for the multicore tests and nvcc 0.2.1221 for the GPU tests. The parallel versions running on the CPUs use OpenMP. All versions were compiled with the -O3 flag (which results in a performance similar to -O2). The CUDA driver used was version 2.3. The experiments were run over Ubuntu 9.04 32 bits (64 bits on Nehalem).

                        2x Nehalem           Dual Core                GTX280
  Model                 Intel Xeon E5530     Intel Core 2 Duo E8500   GTX 280
  Cores                 2x Quad + HT         Dual                     240 SP
  Frequency             2.40GHz              3.16GHz                  1.296GHz
  Cache or Shared Mem   8x 256KB + 2x 8MB    6MB                      30x 16KB
  Memory                12GB                 4GB                      1GB

Table 1: Systems setup.

The experimental results present a confidence of 99%, with a maximum tolerated deviation of 10% and a minimum of 20 executions. The speedups presented in this section consider the performance of the sequential execution of the dwarfs on the Dual Core processor. The performance of the GTX280 includes the time spent on memory allocation and transfer. The parallel versions of the benchmarks run with 2 threads on the Dual Core and 16 threads on the 2xNehalem (which includes HT technology). The size of the test instances was limited by the amount of memory on the GPU and by the execution time of the kernels, since the GPU was not dedicated (it was also used for video).

4.1. Sparse Linear Algebra - CG Benchmark

The Sparse Linear Algebra dwarf focuses on the characteristics of sparse matrix computations. Since this kind of matrix contains many zero values, it is usually stored in a compressed format. Because of that, data is accessed through indexed memory requests, which leads to unorganized memory accesses. These kinds of algorithms are known for their relatively poor performance, in some cases running at 10% of machine peak or less [10].
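As an illustration of these indexed accesses, the sketch below shows a generic CSR (compressed sparse row) sparse matrix-vector product with one thread per row. It is a textbook kernel, not the ported CG code, and the array names (row_ptr, col_idx, val) are assumptions; the gather through col_idx is the kind of irregular access that is hard to coalesce.

    // y = A*x for a sparse matrix A stored in CSR format, one thread per row.
    __global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                             const double *val, const double *x,
                             double *y, int num_rows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix row
        if (row < num_rows) {
            double dot = 0.0;
            // val and col_idx are read contiguously per row, but x is gathered
            // through col_idx, producing scattered loads in global memory
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                dot += val[j] * x[col_idx[j]];
            y[row] = dot;
        }
    }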
As can be seen in Figure 2, where the speedups are calculated considering the sequential execution of the benchmark on the Dual Core, there is a decrease in performance when running 16 threads on the Nehalem for the smallest instance (CG W). Also, the speedups for the other systems do not surpass 1.5. This happens mostly due to the small parallelism within the instance, added to the synchronizations needed by the benchmark. The GTX280 still obtained a speedup, even while being underused. The best performance was obtained with the Dual Core, which has the fastest clock and the smallest number of threads.

[Figure 2: Speedup for the CG Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the CG W and CG A instances.]

There is a decrease in throughput with the increase in instance size for most experimental environments, which is linked to the increase in data size (more cache misses and sparse memory accesses). This is most costly on the GPU, since it results in underused memory bandwidth and separate memory requests. Still, the increase in parallelism allowed the GTX280 to sustain its performance (with a variation of less than 4%), while it brought an increase of 2x in throughput for the 2xNehalem system. Both of these systems presented similar speedups of 1.5.

4.2. Spectral Methods - FT Benchmark

The Spectral Methods dwarf focuses on applications which work with data in the spectral domain [2]. These applications combine multiple butterfly stages, each with its own specific pattern of data access. This leads to data permutations and all-to-all communications between stages. Spectral Methods present a reduced arithmetic intensity (due to data movements) and use multiply-add operations. These methods are present in many different fields, such as physics, astronomy and computational finance [6].
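For illustration, the sketch below shows one radix-2 butterfly stage of a complex FFT, with one thread per butterfly and assuming the usual bit-reversed input ordering. It is a generic example, not the ported FT benchmark; note how the distance between the paired elements (half) changes from stage to stage, which changes the memory access pattern of each stage.

    #include <cuComplex.h>

    // One radix-2 decimation-in-time butterfly stage; 'half' is the butterfly
    // half-size of this stage (1, 2, 4, ..., n/2) and doubles at every stage.
    __global__ void butterfly_stage(cuDoubleComplex *data, int n, int half) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per butterfly
        if (t >= n / 2) return;

        int group = t / half;                 // which group of butterflies
        int pos   = t % half;                 // position inside the group
        int i     = group * 2 * half + pos;   // upper element
        int j     = i + half;                 // paired element, 'half' positions away

        // twiddle factor w = exp(-pi * I * pos / half)
        double angle = (-3.141592653589793 * pos) / half;
        cuDoubleComplex w = make_cuDoubleComplex(cos(angle), sin(angle));

        cuDoubleComplex a = data[i];
        cuDoubleComplex b = cuCmul(w, data[j]);
        data[i] = cuCadd(a, b);               // a + w*b
        data[j] = cuCsub(a, b);               // a - w*b
    }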
For this benchmark, the Dual Core system presents a small decrease in performance when using both cores, as shown in Figure 3. This is mostly related to memory and communication costs. For the smaller instance (FT W), the GTX280 shows the best performance, with a speedup of 2.5. Since the dwarf presents a regular behavior, there are no divergent paths to be computed sequentially on the GPU. Even with reduced parallelism and computations, the gains in performance are linked to the small amount of data of the instance. Still, the reduced parallelism limits the Nehalem performance to a speedup of 2.

[Figure 3: Speedup for the FT Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the FT W and FT A instances.]

When considering the second instance (FT A), the best results (a speedup of 3.6) were obtained by the 2xNehalem system - with its large cache shared between threads - due to the increase in parallelism. The GTX280 showed a performance 13% smaller than the 2xNehalem for this instance. The smaller speedup of 3.2 for the GPU is linked to a larger amount of non-coalesced memory requests and conflicts.

4.3. MapReduce - EP Benchmark

The MapReduce dwarf deals with applications with massive parallelism and little or no communication. This kind of application can be seen as three parts: (i) the mapping of independent tasks to multiple processors; (ii) the parallel computations; and (iii) the reduction of the independent results. This behavior can be seen in Monte Carlo methods [2] and is so important that it was added as a structural design pattern to architect parallel software [1].

As shown in Figure 4, the best speedup for all instances was obtained with the GTX280. The characteristics of the benchmark - massively parallel, high arithmetic intensity, no dependencies - are the ones that are best mapped to the GPU. The increase in performance of over 30% when comparing the EP W and EP A instances is linked to the increase in parallelism and, consequently, to a better utilization of the GPU. The smaller difference between the results for EP A and EP B - with speedups of 40.7 and 42.4, respectively - happened because the GPU was already being fully used. Thus, the difference in throughput is linked to the increase in computation time while keeping the same initial costs (the same amount of memory is allocated and transferred for the two instances).
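A minimal sketch of the map/compute/reduce structure described above is shown below. It uses a generic sum of squares as a stand-in for the per-task computation and is not the ported EP benchmark; the kernel and array names are illustrative assumptions. Each thread is mapped to independent elements, computes a local partial result, and the partial results are reduced in shared memory, leaving one sum per block for the host (or a second kernel) to combine.

    // (i) map independent work to threads, (ii) compute locally,
    // (iii) reduce partial results inside each block.
    __global__ void map_compute_reduce(const double *input, double *block_sums, int n) {
        extern __shared__ double partial[];            // one slot per thread
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        int stride = gridDim.x * blockDim.x;

        double local = 0.0;                            // (i) + (ii): grid-stride loop
        for (int i = gid; i < n; i += stride)
            local += input[i] * input[i];              // stand-in for the real computation
        partial[tid] = local;
        __syncthreads();                               // block-level barrier

        // (iii): tree reduction in shared memory (blockDim.x must be a power of two)
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = partial[0];       // one partial result per block
    }

    // Launch sketch: map_compute_reduce<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_sums, n);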

[Figure 4: Speedup for the EP Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the EP W, EP A and EP B instances.]

5. Related Work

The use of the NAS Parallel Benchmarks is commonplace practice to evaluate the performance of architectures and techniques. However, all publications found running these benchmarks or similar algorithms on the CUDA architecture only present results with single precision floating point operations, due to limitations of the initial versions of the architecture [4, 6].

The closest work to the present paper was developed by Dongarra et al. [5]. They presented some initial questions on using graphics processors and Field Programmable Gate Arrays (FPGAs) as accelerators for CFD kernels. They show results regarding the Dense Linear Algebra, Sparse Linear Algebra and Structured Grids dwarfs. They obtained a performance of 325 GFlops when running the Cholesky factorization (a Dense Linear Algebra application) on the GPU. Yet, there were no direct comparisons between the use of different accelerators and multicores.

6. Conclusions

This paper presented an evaluation of the CUDA architecture considering three representative dwarfs from the Dwarf Mine classification. The CUDA architecture presented its best results with the MapReduce dwarf (EP benchmark), with a speedup of 20 over the parallel version running on the Dual Core and a speedup of 4 when compared to the 2xNehalem system. Massive parallelism, regularity and data independence were crucial to obtain this performance. The Sparse Linear Algebra dwarf (CG benchmark) presented the smallest speedups. Still, CUDA's performance was as good as that of the 2xNehalem. For the Spectral Methods dwarf (FT benchmark), its performance was better than the Nehalem's for the FT W instance but 13% smaller for the second instance. This happened due to the increase in data size and the larger non-coalesced memory access overhead.

These results indicate a successful mapping of these three classes of applications to the CUDA architecture. Even when using double precision, the inclusion of a GPU in a system can bring an increase in performance that may surpass or equal that of a modern multicore architecture. At the time of this paper's publication, this represents an economy of approximately 50% when comparing the prices of a dual Nehalem system to a Dual Core processor plus a modern GPU.

Future work includes studies with different dwarfs, an energy analysis of these dwarfs on different systems, and experiments using both the CPU and GPU in parallel.

References

[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
[2] K. Asanovic, B. Catanzaro, K. Yelick, R. Bodik, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, and S. Williams. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.
[3] K. J. Barker, K. Davis, A. Hoisie, and D. J. Kerbyson. Performance Evaluation of the Nehalem Quad-Core Processor for Scientific Computing. Parallel Processing Letters, 18(4):453–469, 2008.
[4] A. Cevahir, A. Nukada, and S. Matsuoka. Fast Conjugate Gradients with Multiple GPUs. In Proceedings of the 9th International Conference on Computational Science: Part I, pages 893–903. Springer, 2009.
[5] J. Dongarra, S. Moore, G. Peterson, S. Tomov, J. Allred, V. Natoli, and D. Richie. Exploring new architectures in accelerating CFD for Air Force applications. In Proceedings of the HPCMP Users Group Conference, pages 14–17, 2008.
[6] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC '08 - International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2008.
[7] H. Jin, M. Frumkin, and J. Yan. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical Report NAS-99-011, NASA Ames Research Center, 1999.
[8] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2009.
[9] V. Volkov and J. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–11. IEEE Press, 2008.
[10] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521–530, 2005.