GPU: Power vs Performance
IT 14 015
Examensarbete 30 hp
June 2014

GPU: Power vs Performance

Siddhartha Sankar Mondal

Institutionen för informationsteknologi
Department of Information Technology

Abstract

GPUs are widely used to meet the ever-increasing demands of high performance computing. High-end GPUs are among the largest consumers of power in a computer, and power dissipation has always been a major concern for computer architects. Power efficiency demands have pushed modern CPUs towards multicore architectures; GPUs are already massively parallel architectures. There have been some encouraging results for power efficiency in CPUs by applying DVFS. The vision is that a similar approach would also provide encouraging results for power efficiency in GPUs.

In this thesis we have analyzed the power and performance characteristics of a GPU at different frequencies. To help us achieve that, we have made a set of microbenchmarks with different levels of memory boundedness and thread counts. We have also used benchmarks from the CUDA SDK and Parboil. We have used a GTX580 Fermi-based GPU. We have also made a hardware infrastructure that accurately measures the power being consumed by the GPU.

Supervisor (Handledare): Stefanos Kaxiras
Reviewer (Ämnesgranskare): Stefanos Kaxiras
Examiner (Examinator): Ivan Christoff
IT 14 015
Sponsor: UPMARC
Printed by (Tryckt av): Reprocentralen ITC

Acknowledgements

I would like to thank my supervisor Prof. Stefanos Kaxiras for giving me the wonderful opportunity to work on this interesting topic. I would also like to thank his PhD students Vasileios Spiliopoulos and Konstantinos Koukos for helping me get started with the benchmarks and hardware setup. The LaTeX community has been of great help while making this document. Also, I would like to thank the spirit that lives in the computer.

This thesis was funded by UPMARC.

Contents

Acknowledgements
1 Introduction
2 Background
  2.1 CUDA Programming model
  2.2 GPU Architecture
  2.3 Power issues
  2.4 Latency
  2.5 Previous work
3 Methodology
  3.1 Experimental Setup
  3.2 Power measurement
  3.3 DVFS
  3.4 Microbenchmarks
  3.5 Benchmarks
    3.5.1 Matrix Multiplication
    3.5.2 Matrix Transpose
    3.5.3 Histogram
    3.5.4 Radix Sort
    3.5.5 Merge Sort
    3.5.6 Conjugate Gradient
    3.5.7 BFS (Breadth First Search)
    3.5.8 Eigenvalues
    3.5.9 Black-Scholes option pricing
    3.5.10 3D FDTD (3D Finite Difference Time Domain method)
    3.5.11 Scalar Product
4 Evaluation
  4.1 Microbenchmarks
    4.1.1 Microbenchmark 1
    4.1.2 Microbenchmark 2
    4.1.3 Microbenchmark 3
    4.1.4 Microbenchmark 4
  4.2 Benchmarks
  4.3 Memory Bandwidth
    4.3.1 Black-Scholes
    4.3.2 Eigenvalue
    4.3.3 3D FDTD
    4.3.4 Scalar Product
5 Conclusions
Bibliography
Assembly code
Power measurements

Chapter 1: Introduction

Over the last few years CPU and GPU architectures have been evolving very rapidly. Since the introduction of programmable shaders in GPUs around the turn of the century, the scientific computing community has used the GPU as a powerful computational accelerator for the CPU. For many compute-intensive workloads GPUs give a few orders of magnitude better performance than a CPU, and the introduction of languages like CUDA and OpenCL has made it much easier to program a GPU.

In our evaluation we have used the Nvidia GTX580 GPU. It is based on the Fermi architecture [15]. With 512 compute units, called CUDA cores, it can handle 24,576 active threads [9, 15]. Its off-chip memory bandwidth is around 192 GB/s, which is considerably higher than a CPU's main memory bandwidth (around 32 GB/s for a Core i7). Theoretically it can deliver over 1.5 TFLOPS (with single precision FMA). Even with such great numbers to boast about, GPUs have a few bottlenecks that one has to consider before sending all workload down to the GPU. One of them is power; the other is the cost of transferring data from host to device (i.e. CPU to GPU) and back from device to host (i.e. GPU to CPU). The bandwidth of 16x PCI Express is around 8 GB/s.
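To put the PCI Express figure in perspective, the effective transfer bandwidth can be measured directly from a CUDA program. The following is a minimal sketch of such a measurement, written for this text as an illustration; it is not the instrumentation used in the thesis, and the 256 MB buffer size and all names are arbitrary choices.

    // Sketch: estimate host-to-device bandwidth with CUDA events.
    // Illustrative only; buffer size and names are arbitrary.
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t bytes = 256UL << 20;       // 256 MB test buffer
        float *h_buf, *d_buf;
        cudaMallocHost((void **)&h_buf, bytes); // pinned host memory
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop); // elapsed time in milliseconds
        printf("Host to device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }

Pinned (page-locked) host memory is used here because pageable transfers go through an extra staging copy and understate the achievable PCI Express bandwidth.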
Power is a very important constraint for every computer architecture [7]. Modern high-end GPUs have a thermal design power (TDP) of over 200 watts. The GTX580 has a TDP of 244 watts, whereas a high-end multicore CPU like the Intel Core i7 has a TDP of around 100 watts. In terms of GFLOPS/watt GPUs can be considered more power efficient than CPUs, but for a GPU to serve as a co-processor it consumes far too large a share of the total power budget.

Though GPUs have such high off-chip memory bandwidth, accessing the off-chip global memory is very expensive (a latency of around 400 to 800 clock cycles). GPUs hide this latency by scheduling a very large number of threads at a time. For memory-bound applications it becomes hard to hide this latency, and they have slack. On CPUs we can take advantage of such slack by applying DVFS to reduce the dynamic power.
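For reference, the first-order CMOS power model usually invoked to explain why DVFS saves power is the standard textbook relation (added here for clarity, not taken from the thesis):

    P_{total} = P_{dynamic} + P_{static}, \qquad P_{dynamic} = \alpha C V^2 f

where \alpha is the switching activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. Because the maximum stable frequency scales roughly with the supply voltage, reducing f permits reducing V as well, so dynamic power falls close to cubically with frequency. For a memory-bound phase whose runtime is dominated by off-chip accesses, execution time grows far more slowly than the frequency is reduced, which is where the energy savings come from.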
In this thesis work we investigate the power and performance characteristics of a GPU under DVFS (dynamic voltage and frequency scaling). To help us achieve that, we make use of microbenchmarks with different levels of memory boundedness and different thread counts. We follow it up by applying DVFS to some more benchmarks.

Apart from the introduction in this chapter, the thesis is structured as follows. In Chapter 2 we discuss GPU programming, GPU architecture, the concept behind DVFS-based power savings, and some previous work. In Chapter 3 we describe the method used to measure power and to implement DVFS on the GPU; we also throw some light on the benchmarks that were used for the evaluation. In Chapter 4 we evaluate the performance and power consumption of the benchmarks under different numbers of threads, shader frequencies and memory bandwidths. In Chapter 5 we reflect on the conclusions drawn from our evaluation. In Appendix: Assembly code, we provide the assembly code of the microbenchmarks we used. In Appendix: Power measurements, we provide all the execution times and power consumption readings from the evaluation.

Chapter 2: Background

2.1 CUDA Programming model

CUDA is a GPGPU programming language that can be used to write programs for NVIDIA GPUs. It is an extension of the C programming language. Code written in CUDA consists of a mix of host (CPU) code and device (GPU) code [9]. The NVCC compiler separates the device code from the host code and compiles it; at a later stage a C compiler can be used to compile the host code, in which the device code parts have been replaced by calls to the compiled device code. A data-parallel function that is to be run on a device is called a kernel. A kernel creates many threads to compute its data-parallel workload. A thread block consists of many threads, and a group of thread blocks forms a grid.

    __global__ void SimpleKernel(float *A, float *B, float *C, int N)
    {
        // calculate the global thread id
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        ......
        // kernel invocation
        SimpleKernel<<<numberOfBlocks, threadsPerBlock>>>(A, B, C, N);
        ......
    }

    Listing 2.1: A simple CUDA kernel

In Listing 2.1 we have shown a very simple CUDA program model. Inside the kernel SimpleKernel we have CUDA-specific variables that help us identify the data element for the particular thread. threadIdx identifies a thread inside a block; each thread inside a block has a unique thread id. blockIdx identifies the block in a grid. blockDim gives the size of a block in a particular dimension. When we invoke the kernel from the host code, along with the kernel arguments we also pass the thread details. The variable threadsPerBlock gives the number of threads in a block and numberOfBlocks indicates the number of blocks that are to be launched. All the blocks are of the same size (i.e. they have the same number of threads). Both numberOfBlocks and threadsPerBlock can be 1D, 2D or 3D data types. A more detailed explanation of CUDA programming can be found in [9].

2.2 GPU Architecture

[Figure 2.1: On the left, the CUDA core; in the centre, an SM, made up of CUDA cores; on the right, the Fermi GPU architecture with its many SMs. Source: NVIDIA Fermi architecture whitepaper]

In our setup we use the Nvidia GTX580 GPU with CUDA compute capability 2.0. It is based on the Fermi architecture [15]. Figure 2.1 gives an overview of the Fermi architecture. The basic computing unit is the CUDA core; there are 512 CUDA cores on the Fermi. Each CUDA core can execute integer or floating point instructions for a single thread. 32 CUDA cores make up a single Streaming Multiprocessor (SM), and there are in total 16 SMs on the GTX580. Each SM has its own configurable shared memory and L1 cache, which can be configured as a 16/48 KB split.
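The shared memory/L1 split is exposed per kernel through the CUDA runtime API. The sketch below is our own illustration (reusing the SimpleKernel of Listing 2.1, not code from the thesis): it queries the device and requests the 48 KB shared memory / 16 KB L1 configuration; cudaFuncCachePreferL1 would request the opposite split.

    // Sketch: inspect the GPU and choose the shared-memory/L1 split.
    // Illustrative only; reuses SimpleKernel from Listing 2.1.
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void SimpleKernel(float *A, float *B, float *C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)
            C[i] = A[i] + B[i];
    }

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("%s: %d SMs, compute capability %d.%d\n",
               prop.name, prop.multiProcessorCount, prop.major, prop.minor);

        // On Fermi the 64 KB of on-chip memory per SM is split between
        // shared memory and L1 cache as either 48/16 KB or 16/48 KB.
        cudaFuncSetCacheConfig(SimpleKernel, cudaFuncCachePreferShared);
        return 0;
    }

Kernels that make heavy use of shared memory benefit from the 48 KB shared configuration, while kernels with irregular global memory accesses tend to prefer the larger L1 cache.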