Evaluating CUDA Compared to Multicore Architectures∗

Laércio L. Pilla, Philippe O. A. Navaux
Institute of Informatics - Federal University of Rio Grande do Sul - Porto Alegre, Brazil
{laercio.pilla, navaux}@inf.ufrgs.br

∗ This research was partially supported by Microsoft.

Abstract

The use of GPUs (Graphics Processing Units), such as the CUDA architecture, as accelerators for general purpose applications has become a reality. Still, there is not a clear definition of the performance to be expected with applications on this complex architecture, especially when using double precision floating point operations. In this context, this paper presents a comparison between the CUDA architecture and the Nehalem microarchitecture considering three classes of applications from the Dwarf Mine classification. The results showed similar performances for two benchmarks and a speedup of 4 when comparing CUDA to a dual Nehalem system with the MapReduce application. These results indicate a successful mapping of three important classes of applications to GPUs.

1. Introduction

Industry is undergoing a dramatic shift towards parallel platforms, driven by the power and performance limitations of sequential processors [1, 2]. A popular alternative platform for high-performance computing is the use of Graphics Processing Units (GPUs) as accelerators, such as the CUDA architecture [8]. This architecture has been providing increases in performance for many applications, such as fluid dynamics [5, 9].

However, developing and porting applications to this architecture is not trivial. This is due to complex aspects of the GPU, such as its processing and memory hierarchies, as well as the different performance with single and double precision floating point operations. In addition, there is little available research on which classes of problems can profit from it.

A better outline of the application characteristics compatible with the CUDA architecture could benefit users by preventing unprofitable implementations and hardware investments. It could also help by better defining the expected behavior on GPUs and their utility.

In this context, this paper presents an evaluation of the CUDA architecture using three classes of applications from the Dwarf Mine classification [2]. The performance of this architecture was compared to the ones obtained with the Nehalem microarchitecture [3] and with a Dual Core processor, all using double precision floating point operations.

2. CUDA Architecture

Compute Unified Device Architecture (CUDA) [8] denotes both a SIMD (Single Instruction, Multiple Data) architecture and a programming model that extends languages (such as C) to use these GPUs. The GPU works as a parallel accelerator for the CPU, as illustrated in Figure 1(a). A thread on the CPU makes a special function call, named kernel, which executes on the GPU. All necessary data has to be transferred to the GPU before this call, since the GPU has its own memory.

A CUDA GPU is composed of simple processing units called Scalar Processors (SPs). These SPs are organized in Streaming Multiprocessors (SMs). In the architecture of the GPU considered in this paper, an SM is composed of eight SPs and one instruction unit. Due to this, all SPs in the same SM execute the same instruction at each cycle, which can be seen as a vector machine [9]. There is only one double precision floating point unit per SM.

The CUDA programming model works with the abstraction of thousands of threads computing in parallel. These threads are organized in blocks. Threads in the same block can be easily synchronized with a barrier function. Multiple blocks organized as a grid compose a kernel. This logical processing hierarchy is mapped to the physical hierarchy as shown in Figure 1(b). A block of threads runs on an SM and each of its threads executes on an SP. A thread block can have more threads than an SM may run in parallel.

The CUDA architecture also has a memory hierarchy, as depicted in Figure 1(c). All memories inside the SM have a small size and low latency. The global memory has a large size and a high latency (from 400 to 600 cycles [8]) and can be accessed by all threads. To increase efficiency, some access patterns can coalesce several memory requests into one request with a large word.
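To make this execution model concrete, the sketch below shows a minimal, hypothetical CUDA program (it is not taken from the paper's benchmarks): data is copied to the GPU's own memory, a kernel is launched over a grid of thread blocks, and the result is copied back. The kernel name scale and the sizes used are illustrative assumptions only.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    __global__ void scale(double *v, double alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n)
            v[i] *= alpha;   // neighboring threads touch neighboring words,
                             // so these global memory accesses can be coalesced
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(double);
        double *h = (double *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = 1.0;

        cudaMalloc((void **)&d, bytes);                    // the GPU has its own memory
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // transfer data before the call

        int threads = 256;                                 // threads per block
        int blocks = (n + threads - 1) / threads;          // blocks in the grid (the kernel)
        scale<<<blocks, threads>>>(d, 2.0, n);             // kernel call made by a CPU thread

        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // copy results back
        cudaFree(d); free(h);
        return 0;
    }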

[Figure 1: CUDA characteristics. (a) Execution model; (b) Logical to physical mapping; (c) Memory hierarchy.]

3. Dwarf Mine Classification

The Dwarf Mine classification, also known as the 13 Dwarfs [1, 2], organizes algorithmic methods based on their common computation and communication behaviors. It was first developed by Phillip Colella as the 7 Dwarfs and then extended by researchers at Berkeley [2]. This classification focuses on the underlying patterns that have persisted in applications, being independent of implementation.

In this paper, we focus on three of these dwarfs to evaluate the CUDA architecture: Sparse Linear Algebra, Spectral Methods and MapReduce. For this, we ported three kernels from the NAS Parallel Benchmarks (NPB) [7], originally implemented with OpenMP (original C+OpenMP codes available at the Omni Project homepage: http://www.hpcs.cs.tsukuba.ac.jp/omni-openmp/), to the GPU (implemented C+CUDA codes available at the HPCGPU Project homepage: http://hpcgpu.codeplex.com). The NPB are a free set of benchmarks used to evaluate parallel architectures and are composed of representative Computational Fluid Dynamics (CFD) kernels. These benchmarks can be easily mapped to the Dwarf Mine classification, as seen in [2]. Each benchmark contains its own performance and correctness verification codes, as well as test sets of different sizes (in ascending order: W, A, B, C and D). All floating point operations use double precision in the NPB.

4. Experimental Evaluation

In these experiments, we compare the performance of the three considered dwarfs on three different systems: a Dual Core, a GPU connected to the Dual Core, and a computer with two processors with the Nehalem microarchitecture. The configuration details of the systems are presented in Table 1. The dwarfs were compiled using gcc 4.3.1 for the multicore tests and nvcc 0.2.1221 for the GPU tests. The parallel versions running on the CPUs use OpenMP. All versions were compiled with the -O3 flag (which results in a performance similar to -O2). The CUDA driver used was version 2.3. The experiments were run over Ubuntu 9.04 32 bits (64 bits on Nehalem).

                        2x Nehalem           Dual Core                GTX280
  Model                 Intel Xeon E5530     Intel Core 2 Duo E8500   GTX 280
  Cores                 2x Quad + HT         Dual                     240 SP
  Frequency             2.40GHz              3.16GHz                  1.296GHz
  Cache or Shared Mem   8x 256KB + 2x 8MB    6MB                      30x 16KB
  Memory                12GB                 4GB                      1GB

Table 1: Systems setup.

The experimental results present a confidence of 99%, with a maximum tolerated deviation of 10% and a minimum of 20 executions. The speedups presented in this section consider the performance of the sequential execution of the dwarfs on the Dual Core processor. The performance of the GTX280 includes the time spent on memory allocation and transfer. The parallel versions of the benchmarks run with 2 threads on the Dual Core and 16 threads on the 2xNehalem (which includes HT technology). The size of the test instances was limited by the amount of memory on the GPU and by the execution time of the kernels, since the GPU was not dedicated (it was also used for video).

4.1. Sparse Linear Algebra - CG Benchmark

The Sparse Linear Algebra dwarf focuses on the characteristics of sparse matrix computations. Since this kind of matrix contains many zero values, it is usually stored in a compressed format. Because of that, data is accessed through indexed memory requests, which leads to unorganized memory accesses. These kinds of algorithms are known for their relatively poor performance, in some cases running at 10% of machine peak or less [10].
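As an illustration of these indexed accesses, the sketch below shows a generic CSR (compressed sparse row) sparse matrix-vector product with one thread per row. It is a textbook kernel, not the ported CG code, and the array names (row_ptr, col_idx, val) are assumptions; the gather through col_idx is the kind of irregular access that is hard to coalesce.

    // y = A*x for a sparse matrix A stored in CSR format, one thread per row.
    __global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                             const double *val, const double *x,
                             double *y, int num_rows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per matrix row
        if (row < num_rows) {
            double dot = 0.0;
            // val and col_idx are read contiguously per row, but x is gathered
            // through col_idx, producing scattered loads in global memory
            for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
                dot += val[j] * x[col_idx[j]];
            y[row] = dot;
        }
    }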
As can be seen in Figure 2, where the speedups are calculated considering the sequential execution of the benchmark on the Dual Core, there is a decrease in performance when running 16 threads on the Nehalem for the smallest instance (CG W). Also, the speedups for the other systems do not surpass 1.5. This happens mostly due to the small parallelism within the instance, added to the synchronizations needed by the benchmark. The GTX280 still obtained a speedup, even while being underused. The best performance was obtained with the Dual Core, which has the fastest clock and the smallest number of threads.

[Figure 2: Speedup for the CG Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the CG W and CG A instances.]

There is a decrease in throughput with the increase in instance size for most experimental environments, which is linked to the increase in data size (more cache misses and sparse memory accesses). This is most costly on the GPU, since it results in underused memory bandwidth and separate memory requests. Still, the increase in parallelism allowed the GTX280 to sustain its performance (with a variation of less than 4%), while it brought an increase of 2x in throughput for the 2xNehalem system. Both of these systems presented similar speedups of 1.5.

4.2. Spectral Methods - FT Benchmark

The Spectral Methods dwarf focuses on applications which work with data in the spectral domain [2]. These applications combine multiple butterfly stages, each with its own specific pattern of data access. This leads to data permutations and all-to-all communications between stages. Spectral Methods present a reduced arithmetic intensity (due to data movements) and use multiply-add operations. These methods are present in many different fields, such as physics, astronomy and computational finance [6].
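For illustration, the sketch below shows one radix-2 butterfly stage of a complex FFT, with one thread per butterfly and assuming the usual bit-reversed input ordering. It is a generic example, not the ported FT benchmark; note how the distance between the paired elements (half) changes from stage to stage, which changes the memory access pattern of each stage.

    #include <cuComplex.h>

    // One radix-2 decimation-in-time butterfly stage; 'half' is the butterfly
    // half-size of this stage (1, 2, 4, ..., n/2) and doubles at every stage.
    __global__ void butterfly_stage(cuDoubleComplex *data, int n, int half) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per butterfly
        if (t >= n / 2) return;

        int group = t / half;                 // which group of butterflies
        int pos   = t % half;                 // position inside the group
        int i     = group * 2 * half + pos;   // upper element
        int j     = i + half;                 // paired element, 'half' positions away

        // twiddle factor w = exp(-pi * I * pos / half)
        double angle = (-3.141592653589793 * pos) / half;
        cuDoubleComplex w = make_cuDoubleComplex(cos(angle), sin(angle));

        cuDoubleComplex a = data[i];
        cuDoubleComplex b = cuCmul(w, data[j]);
        data[i] = cuCadd(a, b);               // a + w*b
        data[j] = cuCsub(a, b);               // a - w*b
    }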
For this benchmark, the Dual Core system presents a small decrease in performance when using both cores, as shown in Figure 3. This is mostly related to memory and communication costs. For the smaller instance (FT W), the GTX280 shows the best performance, with a speedup of 2.5. Since the dwarf presents a regular behavior, there are no divergent paths to be computed sequentially on the GPU. Even with reduced parallelism and computations, the gains in performance are linked to the small amount of data of the instance. Still, the reduced parallelism limits the Nehalem performance to a speedup of 2.

[Figure 3: Speedup for the FT Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the FT W and FT A instances.]

When considering the second instance (FT A), the best results (a speedup of 3.6) were obtained by the 2xNehalem system - with its large cache shared between threads - due to the increase in parallelism. The GTX280 showed a performance 13% smaller than the 2xNehalem for this instance. The smaller speedup of 3.2 for the GPU is linked to a larger amount of non-coalesced memory requests and conflicts.

4.3. MapReduce - EP Benchmark

The MapReduce dwarf deals with applications with massive parallelism and little or no communication. This kind of application can be seen as three parts: (i) the mapping of independent tasks to multiple processors; (ii) the parallel computations; and (iii) the reduction of the independent results. This behavior can be seen in Monte Carlo methods [2] and is so important that it was added as a structural design pattern to architect parallel software [1].

As shown in Figure 4, the best speedup for all instances was obtained with the GTX280. The characteristics of the benchmark - massively parallel, high arithmetic intensity, no dependencies - are the ones that are best mapped to the GPU. The increase in performance of over 30% when comparing the EP W and EP A instances is linked to the increase in parallelism and, consequently, to a better utilization of the GPU. The smaller difference between the results for EP A and EP B - with speedups of 40.7 and 42.4, respectively - happened because the GPU was already being fully used. Thus, the difference in throughput is linked to the increase in computation time while keeping the same initial costs (the same amount of memory is allocated and transferred for the two instances).
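A minimal sketch of the map/compute/reduce structure described above is shown below. It uses a generic sum of squares as a stand-in for the per-task computation and is not the ported EP benchmark; the kernel and array names are illustrative assumptions. Each thread is mapped to independent elements, computes a local partial result, and the partial results are reduced in shared memory, leaving one sum per block for the host (or a second kernel) to combine.

    // (i) map independent work to threads, (ii) compute locally,
    // (iii) reduce partial results inside each block.
    __global__ void map_compute_reduce(const double *input, double *block_sums, int n) {
        extern __shared__ double partial[];            // one slot per thread
        int tid = threadIdx.x;
        int gid = blockIdx.x * blockDim.x + tid;
        int stride = gridDim.x * blockDim.x;

        double local = 0.0;                            // (i) + (ii): grid-stride loop
        for (int i = gid; i < n; i += stride)
            local += input[i] * input[i];              // stand-in for the real computation
        partial[tid] = local;
        __syncthreads();                               // block-level barrier

        // (iii): tree reduction in shared memory (blockDim.x must be a power of two)
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = partial[0];       // one partial result per block
    }

    // Launch sketch: map_compute_reduce<<<blocks, threads, threads * sizeof(double)>>>(d_in, d_sums, n);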

[Figure 4: Speedup for the EP Benchmark. Speedups of the 2x Nehalem (16 threads), Dual Core (2 threads) and GTX280 for the EP W, EP A and EP B instances.]

5. Related Work

The use of the NAS Parallel Benchmarks is commonplace practice to evaluate the performance of architectures and techniques. However, all publications found running these benchmarks or similar algorithms on the CUDA architecture only present results with single precision floating point operations, due to limitations of the initial versions of the architecture [4, 6].

The closest work to the present paper was developed by Dongarra et al. [5]. They presented some initial questions on using graphics processors and Field Programmable Gate Arrays (FPGAs) as accelerators for CFD kernels. They show results regarding the Dense Linear Algebra, Sparse Linear Algebra and Structured Grids dwarfs. They obtained a performance of 325 GFlops when running the Cholesky factorization (a Dense Linear Algebra application) on the GPU. Yet, there were no direct comparisons between the use of different accelerators and multicores.

6. Conclusions

This paper presented an evaluation of the CUDA architecture considering three representative dwarfs from the Dwarf Mine classification. The CUDA architecture presented its best results with the MapReduce dwarf (EP benchmark), with a speedup of 20 over the parallel version running on the Dual Core and a speedup of 4 when compared to the 2xNehalem system. Massive parallelism, regularity and data independence were crucial to obtain this performance. The Sparse Linear Algebra dwarf (CG benchmark) presented the smallest speedups. Still, CUDA's performance was as good as that of the 2xNehalem. For the Spectral Methods dwarf (FT benchmark), its performance was better than the Nehalem's for the FT W instance but 13% smaller for the second instance. This happened due to the increase in data size and the larger non-coalesced memory access overhead.

These results indicate a successful mapping of these three classes of applications to the CUDA architecture. Even when using double precision, the inclusion of a GPU in a system can bring an increase in performance that may surpass or equal that of a modern multicore architecture. At the time of this paper's publication, this represents an economy of approximately 50% when comparing the prices of a dual Nehalem system to a Dual Core processor plus a modern GPU.

Future work includes studies with different dwarfs, an energy analysis of these dwarfs on different systems, and experiments using both the CPU and GPU in parallel.

References

[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
[2] K. Asanovic, B. Catanzaro, K. Yelick, R. Bodik, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, and S. Williams. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences, University of California at Berkeley, 2006.
[3] K. J. Barker, K. Davis, A. Hoisie, and D. J. Kerbyson. Performance Evaluation of the Nehalem Quad-Core Processor for Scientific Computing. Parallel Processing Letters, 18(4):453–469, 2008.
[4] A. Cevahir, A. Nukada, and S. Matsuoka. Fast Conjugate Gradients with Multiple GPUs. In Proceedings of the 9th International Conference on Computational Science: Part I, pages 893–903. Springer, 2009.
[5] J. Dongarra, S. Moore, G. Peterson, S. Tomov, J. Allred, V. Natoli, and D. Richie. Exploring new architectures in accelerating CFD for Air Force applications. In Proceedings of the HPCMP Users Group Conference, pages 14–17, 2008.
[6] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC '08 - International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–12. IEEE, 2008.
[7] H. Jin, M. Frumkin, and J. Yan. The OpenMP implementation of NAS parallel benchmarks and its performance. Technical Report NAS-99-011, NASA Ames Research Center, 1999.
[8] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, 2009.
[9] V. Volkov and J. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pages 1–11. IEEE Press, 2008.
[10] R. Vuduc, J. Demmel, and K. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series, 16:521–530, 2005.