<<

VISVESVARAYA TECHNOLOGICAL UNIVERSITY

S.D.M COLLEGE OF ENGINEERING AND TECHNOLOGY

A seminar report on CUDA

Submitted by Sridhar Y (2SD06CS108), 8th semester

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

2009-10


VISVESVARAYA TECHNOLOGICAL UNIVERSITY

S.D.M COLLEGE OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

CERTIFICATE

Certified that the seminar work entitled “CUDA” is a bona fide work presented by Sridhar Y, bearing USN 2SD06CS108, in partial fulfillment of the requirements for the award of the degree of Bachelor of Engineering in Computer Science Engineering of Visvesvaraya Technological University, Belgaum, during the year 2009-10. The seminar report has been approved as it satisfies the academic requirements with respect to the seminar work presented for the Bachelor of Engineering degree.

Staff in charge H.O.D CSE

Name: Sridhar Y USN: 2SD06CS108


Contents

1. Introduction

2. Evolution of GPU programming and CUDA

3. CUDA Structure for parallel processing

4. Programming model of CUDA

5. Portability and Security of the code

6. Managing Threads with CUDA

7. Elimination of Deadlocks in CUDA

8. Data distribution among the Thread Processors in CUDA

9. Challenges in CUDA for the Developers

10. The Pros and Cons of CUDA

11. Conclusions

Bibliography


Abstract

Parallel processing on multi-core processors is the industry’s biggest software challenge, but the real problem is that there are too many solutions. One of them is Nvidia’s Compute Unified Device Architecture (CUDA), a software platform for massively parallel high-performance computing on the company’s powerful Graphics Processing Units (GPUs). CUDA is the computing engine in Nvidia GPUs and is accessible to software developers through industry-standard programming languages. Unlike CPUs, however, GPUs have a parallel "many-core" architecture, with each core capable of running thousands of threads simultaneously; if an application is suited to this kind of architecture, the GPU can offer large performance benefits. This report examines the conventional GPU structure and how CUDA is used in programming techniques for high-speed parallel processing.

1. Introduction

Formally introduced in 2006, CUDA is steadily winning adoption in scientific and engineering fields. A few years earlier, pioneering programmers had discovered that GPUs could be harnessed for tasks other than graphics. However, their improvised programming model was clumsy, and the programmable pixel shaders on the chips weren’t the ideal engines for general-purpose computing. Nvidia seized upon this opportunity to create a better programming model and to improve the shaders. In fact, for high-performance computing, Nvidia now prefers to call the shaders “stream processors” or “thread processors”. Currently, CUDA can be used from programming languages and environments such as C/C++, FORTRAN, MATLAB, Java, .NET, etc.

2. Evolution of GPU programming and CUDA

Graphics Processing Units (GPUs) have been used for non-graphics computation for several years; this approach is called General-Purpose computation on GPUs (GPGPU).

The GPU is specialized for compute-intensive, highly parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than to data caching and flow control.

The GPU is good at data-parallel processing: the same computation is executed on many data elements in parallel, with high arithmetic intensity (many calculations per memory access). Running the same computation everywhere lowers the requirement for sophisticated flow control, and high arithmetic intensity across many data elements means that memory-access latency can be hidden with calculations instead of big data caches.
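As a rough sketch of this data-parallel, high-arithmetic-intensity style (the kernel name, array size, and iteration count below are invented for illustration and are not taken from any of the sources cited in this report), the following CUDA program makes every thread perform the same short computation on its own array element, so each memory access is amortized over many floating-point operations:

// arithmetic_intensity.cu -- illustrative sketch only (compile with nvcc)
#include <cuda_runtime.h>
#include <cstdio>

// Every thread runs the same computation on its own element:
// one global-memory read, many floating-point operations, one write.
__global__ void manyOpsPerElement(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];                  // single memory read
    for (int k = 0; k < 64; ++k)      // about 128 flops per 4-byte load
        x = x * 1.0001f + 0.5f;
    out[i] = x;                       // single memory write
}

int main()
{
    const int n = 1 << 20;            // about one million elements
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // 256 threads per block; enough blocks to cover all n elements.
    manyOpsPerElement<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    printf("launched %d data-parallel threads\n", n);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}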


GPGPU applications can be found in:

• Physics simulation

• Signal processing

• Computational geometry

• Database management

• Computational biology

• Computational finance

• Computer vision, etc.

Drawbacks of the Legacy GPGPU Model:

• Software limitations:

– The GPU could only be programmed through a graphics API.

– High learning curve.

– Graphics-API overhead.

• Hardware limitations:

– No general DRAM memory addressing: only gather (reads from arbitrary memory locations) was supported, with no scatter (writes to arbitrary memory locations).

– Less programming flexibility.

– Applications can easily be limited by DRAM memory bandwidth.

– Waste of computation power.


CUDA Highlights:

• The API is an extension to the C programming language, so it has a low learning curve.

• The hardware is designed for a lightweight runtime and driver, which keeps performance high.

• CUDA provides generic DRAM memory addressing: both gather (reads from arbitrary addresses) and scatter (writes to arbitrary addresses) are supported, as sketched after this list.

• CUDA enables access to an on-chip shared memory for efficient inter-thread data sharing, which saves a great deal of memory bandwidth.
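The fragment below is a minimal sketch of what this looks like in device code (the kernel and variable names are invented, and a block size of 256 threads is assumed): a gather reads from a data-dependent address, a scatter writes to one, and the on-chip shared memory mentioned in the last point is visible to all threads of a block. None of this was available through the legacy graphics-API route.

// gather_scatter.cu -- illustrative fragment, assumes blockDim.x == 256
__global__ void gatherScatter(const float *in, float *out,
                              const int *srcIdx, const int *dstIdx, int n)
{
    // On-chip shared memory: read/write for every thread in this block.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Gather: read from an arbitrary, data-dependent address.
    if (i < n) tile[threadIdx.x] = in[srcIdx[i]];

    __syncthreads();   // every thread in the block reaches this barrier

    // Scatter: write to an arbitrary, data-dependent address.
    if (i < n) out[dstIdx[i]] = tile[threadIdx.x] * 2.0f;
}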

3. CUDA Structure for parallel processing

Each thread processor in a GPU (e.g., in the Nvidia GeForce 8 series) can manage 96 concurrent threads, and these processors have their own FPUs, registers, and shared local memory.

As Figure 1 shows, CUDA includes C/C++ software development tools, function libraries, and a hardware abstraction mechanism that hides the GPU hardware from developers. Although CUDA requires programmers to write special code for parallel processing, it doesn’t require them to explicitly manage threads in the conventional sense, which greatly simplifies the programming model.

CUDA development tools work alongside a conventional C/C++ compiler, so programmers can mix GPU code with general-purpose code for the host CPU. For now, CUDA aims at data-intensive applications that need single-precision floating-point math (mostly scientific, engineering, and high-performance computing, as well as consumer photo and video editing). Double-precision floating point is on Nvidia’s roadmap for a new GPU in the near future.
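A minimal sketch of this mixing (the file name, function name, and sizes are invented for illustration): a single .cu source file holds ordinary host C/C++ next to a GPU kernel, and Nvidia’s nvcc tool splits the two, forwarding the host portion to a conventional C/C++ compiler.

// mixed_host_device.cu -- illustrative sketch; compile with: nvcc mixed_host_device.cu
#include <cuda_runtime.h>
#include <cstdio>

// Device code: runs on the GPU's thread processors.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Host code: ordinary C++ running on the CPU.
int main()
{
    const int n = 1024;
    float h_data[n];
    for (int i = 0; i < n; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleArray<<<(n + 255) / 256, 256>>>(d_data, 0.5f, n);   // GPU launch

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[10] = %f (expected 5.0)\n", h_data[10]);
    return 0;
}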


Figure 1. Nvidia’s CUDA platform for parallel processing on GPUs. Key elements are common C/C++ source code with different compiler forks for CPUs and GPUs; function libraries that simplify programming; and a hardware-abstraction mechanism that hides the details of the GPU architecture from programmers.

4. Programming model of CUDA

Thread Batching

• A kernel is executed as a grid of thread blocks.

– All threads in the grid share the same global data memory space.

• A thread block is a batch of threads that can cooperate with each other by:

– Synchronizing their execution, for hazard-free shared-memory accesses.

– Efficiently sharing data through a low-latency shared memory.

• Two threads from two different blocks cannot cooperate.

• Threads and blocks have IDs (see the sketch after this list):

– So each thread can decide what data to work on.

– A block ID can be 1D or 2D.

– A thread ID can be 1D, 2D, or 3D.
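As a brief sketch of how these IDs are used (a hypothetical matrix-addition kernel; the dimensions and names are invented), each thread combines its 2D block ID and 2D thread ID to pick out the single element it works on:

// thread_ids.cu -- illustrative fragment
// Example launch configuration for a width x height matrix:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   addMatrices<<<grid, block>>>(a, b, c, width, height);
__global__ void addMatrices(const float *a, const float *b, float *c,
                            int width, int height)
{
    // Block ID (1D or 2D) and thread ID (1D, 2D, or 3D) locate one element.
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    if (col < width && row < height) {
        int i = row * width + col;    // each thread decides what data to work on
        c[i] = a[i] + b[i];
    }
}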


CUDA Device Memory Space

• Each thread can:

– R/W per-thread registers.

– R/W per-thread local memory.

– R/W per-block shared memory.

– R/W per-grid global memory.

– Read only per-grid constant memory.

– Read only per-grid texture memory.

• The host can R/W global, constant, and texture memories.

– Global memory is the main means of communicating read/write data between host and device, and its contents are visible to all threads.

– Texture and constant memories are read-only regions initialized by the host, with contents visible to all threads (the sketch below touches most of these memory spaces).
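The fragment below is a hedged sketch (all names are invented, and 256-thread blocks are assumed) showing device code touching several of these spaces: per-thread local variables held in registers, per-block shared memory, per-grid global memory, and per-grid constant memory initialized by the host with cudaMemcpyToSymbol.

// memory_spaces.cu -- illustrative sketch, assumes blocks of 256 threads
#include <cuda_runtime.h>

// Per-grid constant memory: written by the host, read-only on the device.
__constant__ float c_scale;

__global__ void scaleThroughShared(const float *g_in, float *g_out, int n)
{
    // Per-block shared memory: read/write for all threads in the block.
    __shared__ float s_tile[256];

    // Per-thread local variable, normally held in a register.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) s_tile[threadIdx.x] = g_in[i];              // global -> shared
    __syncthreads();

    if (i < n) g_out[i] = s_tile[threadIdx.x] * c_scale;   // shared + constant -> global
}

// Host side: the host reads/writes global memory and fills constant memory.
void launchScale(const float *d_in, float *d_out, int n)
{
    float scale = 0.25f;
    cudaMemcpyToSymbol(c_scale, &scale, sizeof(float));    // initialize constant memory
    scaleThroughShared<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
}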

5. Portability and Security of the code

Nvidia starts with an advantage that makes other aspirants to parallel processing envious. Nvidia has always hidden the architectures of its GPUs beneath an application programming interface (API). As a result, application programmers never write directly to the metal. Instead, they call predefined graphics functions in the API. Hardware abstraction has two benefits. First, it simplifies the high-level programming model, insulating programmers from the complex details of the GPU hardware. Second, hardware abstraction allows Nvidia to change the GPU architecture as often and as radically as desired.

As Figure 2 shows, Nvidia’s latest GeForce 8 architecture has 128 thread processors, each capable of managing up to 96 concurrent threads, for a maximum of 12,288 threads. Nvidia can completely redesign this architecture in the next generation of GPUs without making the API obsolescent or breaking anyone’s application software.

In contrast, most architectures are practically carved in stone. They can be extended from one generation to the next, as Intel has done with MMX and SSE in the x86, but removing even one instruction or altering a programmer-visible register will break someone’s software.

Nvidia’s virtual platform relies on install-time compilation. When PC users install application software that uses the graphics card, the installer queries the GPU’s identity and automatically compiles the APIs for the target architecture. Users never see this compiler (actually, an assembler), except in the form of the installer’s progress bar. Install-time translation eliminates the performance penalty suffered by run-time interpreted or just-in-time compiled platforms like Java, yet it allows API calls to execute transparently on any Nvidia GPU.


Figure 2. Nvidia GeForce 8 graphics-processor architecture. This particular GeForce 8 GPU has 128 thread processors, also known as stream processors. (Graphics programmers call them “programmable pixel shaders.”) Each thread processor has a single-precision FPU and 1,024 registers, 32 bits wide. Each cluster of eight thread processors has 16KB of shared local memory supporting parallel data accesses. A hardware thread-execution manager automatically issues threads to the processors without requiring programmers to write explicitly threaded code. The maximum number of concurrent threads is 12,288.


Although this technology was designed a few years ago to make life easier for graphics programmers and for GPU architects, hardware abstraction is ideally suited to the new age of parallel processing. Nvidia is free to design processors with any number of cores, any number of threads, any number of registers, and any instruction set. C/C++ source code written today for an Nvidia GPU with 128 thread processors can run without modification on future Nvidia GPUs with additional thread processors, or on Nvidia GPUs with completely different architectures. (However, modifications may be necessary to optimize performance.)

Hardware abstraction is an advantage for software development on parallel processors. In the past, Moore’s law enabled CPU architects to add more function units and pipeline stages. Now, architects are adding more processor cores. CUDA’s hardware abstraction saves programmers from having to learn a new GPU architecture every couple of years.

6. Managing Threads with CUDA

CUDA’s programming model differs significantly from single-threaded CPU code and even the parallel code that some programmers began writing for GPUs before CUDA. In a single-threaded model, the CPU fetches a single instruction stream that operates serially on the data. A superscalar CPU may route the instruction stream through multiple pipelines, but there’s still only one instruction stream, and the degree of instruction parallelism is severely limited by data and resource dependencies. Even the best four-, five-, or six-way superscalar CPUs struggle to average 1.5 instructions per cycle, which is why superscalar designs rarely venture beyond four-way pipelining. Single-instruction, multiple-data (SIMD) extensions permit many CPUs to extract some data parallelism from the code, but the practical limit is usually three or four operations per cycle.

Another programming model is general-purpose GPU (GPGPU) processing. This model is relatively new and has gained much attention in recent years. Essentially, developers hungry for high performance began using GPUs as general-purpose processors, although “general-purpose” in this context usually means data-intensive applications in scientific and engineering fields. Programmers use the GPU’s pixel shaders as general-purpose single-precision FPUs. GPGPU processing is highly parallel, but it relies heavily on off-chip “video” memory to operate on large data sets. (Video memory, normally used for texture maps and so forth in graphics applications, may store any kind of data in GPGPU applications.) Different threads must interact with each other through off-chip memory. These frequent memory accesses tend to limit performance.

As Figure 3 shows, CUDA takes a third approach. Like the GPGPU model, it’s highly parallel. But it divides the data set into smaller chunks stored in on-chip memory, and then allows multiple thread processors to share each chunk. Storing the data locally reduces the need to access off-chip memory, thereby improving performance. Occasionally, of course, a thread does need to access off-chip memory, such as when loading the off-chip data it needs into local memory.

In the CUDA model, off-chip memory accesses usually don’t stall a thread processor. Instead, the stalled thread enters an inactive queue and is replaced by another thread that’s ready to execute. When the stalled thread’s data becomes available, the thread enters another queue that signals it’s ready to go. Groups of threads take turns executing in round-robin fashion, ensuring that each thread gets execution time without delaying other threads.

An important feature of CUDA is that application programmers don’t write explicitly threaded code. A hardware thread manager handles threading automatically. Automatic thread management is vital when multithreading scales to thousands of threads—as with Nvidia’s GeForce 8 GPUs, which can manage as many as 12,288 concurrent threads.


Figure 3. Three different models for high-performance computing. At left, a typical general-purpose CPU executes a single thread of execution by fetching an instruction stream from an instruction cache and operating on data flowing through a data cache from main memory. The middle section of this figure illustrates the GPGPU model, which uses a GPU’s programmable pixel shaders for general-purpose processing. It’s highly parallel but I/O-bound by frequent accesses to off-chip memory. At right, Nvidia’s CUDA model caches smaller chunks of data on chip and shares the data among clusters of thread processors (shaders).

Although these are lightweight threads in the sense that each one operates on a small piece of data, they are fully fledged threads in the conventional sense. Each thread has its own stack, register file, program counter, and local memory. Each thread processor has 1,024 physical registers, 32 bits wide, implemented in SRAM instead of latches. The GPU preserves the state of inactive threads and restores their state when they become active again. Instructions from multiple threads can share a thread processor’s instruction pipeline at the same time, and the processors can switch their attention among these threads in a single clock cycle. All this run-time thread management is transparent to the programmer.

7. Elimination of Deadlocks in CUDA

By removing the burden of explicitly managing threads, Nvidia simplifies the programming model and eliminates a whole category of potential bugs. In theory, CUDA makes deadlocks among threads impossible. Deadlocks or “thread locks” happen when two or more threads block each other from manipulating the same data, creating a standoff in which none of the threads may continue. The danger of deadlocks is that they can lurk undetected in otherwise well-behaved code for years.

Nvidia says CUDA eliminates deadlocks, no matter how many threads are running. A special API call, __syncthreads(), provides explicit barrier synchronization. Calling __syncthreads() at the outer level of a section of code invokes a compiler-intrinsic function that translates into a single instruction for the GPU. This barrier instruction blocks threads from operating on data that another thread is using. Although __syncthreads() does expose some degree of thread management to programmers, it’s a relatively minor detail, and it’s almost foolproof. A problem might arise if a program embedded a __syncthreads() barrier within conditional-branch statements that aren’t executed in some cases, but using __syncthreads() in that manner is strongly discouraged. If programmers follow the rules, thread locks won’t happen with CUDA.
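The fragment below sketches that rule with a hypothetical kernel (the names, and the assumption that the array length is a multiple of the 256-thread block size, are invented for illustration). The barrier intrinsic sits at the outer level of the kernel so that every thread in the block reaches it; the commented-out line shows the conditional placement that is strongly discouraged.

// barrier_usage.cu -- illustrative fragment; assumes the array length is a multiple of 256
__global__ void reverseWithinBlock(float *data)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = data[i];

    // Correct: the barrier is at the outer level, reached by all threads.
    __syncthreads();

    // Discouraged: a barrier inside a branch that only some threads take,
    // e.g.  if (threadIdx.x < 128) __syncthreads();
    // Threads that skip the branch would never reach the barrier.

    data[i] = tile[blockDim.x - 1 - threadIdx.x];   // reverse each 256-element chunk
}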


8. Data distribution among the Thread Processors in CUDA

Although CUDA automates thread management, it doesn’t entirely relieve developers from thinking about threads. Developers must analyze their problem to determine how best to divide the data into smaller chunks for distribution among the thread processors. This data layout, or “decomposition”, does require programmers to exert some manual effort and write some explicit code.

For example, consider an application that has great potential for data parallelism: scanning network packets for malware. Worms, viruses, Trojans, rootkits, and other malicious programs have telltale binary signatures. An antivirus filter scans the packet for binary sequences matching the bit patterns of known viruses. With a conventional single-threaded CPU, the filter must repeatedly compare elements of one array (the virus signatures) with the elements of another array (the packet). If the contents of either array are too large for the CPU’s caches, the program must frequently access off-chip memory, severely impeding performance.

With CUDA, programmers can dedicate a lightweight thread to each virus signature. A block of these threads can run on a cluster of thread processors and operate on the same data (the packet) in shared memory. If much or all the data fits in local memory, the program needn’t access off-chip memory. If a thread does need to access off-chip memory, the stalled thread enters the inactive queue and yields to another thread. This continues until all threads in the block have scanned the packet. Meanwhile, other blocks of threads are simultaneously doing the same thing while running on other clusters of thread processors. An antivirus filter that has a database of thousands of virus signatures can use thousands of threads.
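One possible, heavily simplified rendering of that decomposition is sketched below (every name, the fixed signature length, and the matching scheme are invented for illustration; a real filter would be far more elaborate). The packet is staged once into the block’s shared memory, and each thread then scans it for its own signature.

// signature_scan.cu -- hypothetical sketch of the antivirus decomposition above
#define PACKET_BYTES 1500   // one Ethernet-sized packet, staged in shared memory
#define SIG_BYTES    16     // fixed-length signatures, for simplicity

// One lightweight thread per virus signature; each block shares one packet copy.
__global__ void scanPacket(const unsigned char *packet,
                           const unsigned char *signatures,   // numSigs * SIG_BYTES
                           int *hits, int numSigs)
{
    __shared__ unsigned char s_packet[PACKET_BYTES];

    // Cooperatively copy the packet into on-chip shared memory once, so the
    // scanning loop below never touches off-chip memory for the packet data.
    for (int b = threadIdx.x; b < PACKET_BYTES; b += blockDim.x)
        s_packet[b] = packet[b];
    __syncthreads();

    int sig = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's signature
    if (sig >= numSigs) return;
    const unsigned char *mySig = signatures + sig * SIG_BYTES;

    // Naive byte-by-byte comparison at every offset within the packet.
    for (int off = 0; off + SIG_BYTES <= PACKET_BYTES; ++off) {
        int match = 1;
        for (int k = 0; k < SIG_BYTES; ++k)
            if (s_packet[off + k] != mySig[k]) { match = 0; break; }
        if (match) { hits[sig] = 1; return; }
    }
    hits[sig] = 0;
}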


9. Challenges in CUDA for the Developers

For developers, one challenge of CUDA is analyzing their algorithms and data to find the optimal numbers of threads and blocks that will keep the GPU fully utilized. Factors include the size of the global data set, the maximum amount of local data that blocks of threads can share, the number of thread processors in the GPU, and the sizes of the on-chip local memories. One important limit on the number of concurrent threads, besides the obvious limit of the number of thread processors, is the number of registers each thread requires. The CUDA compiler automatically determines the optimal number of registers for each thread. To reach the maximum possible number of 12,288 threads in a 128-processor GeForce 8, the compiler can’t assign more than about 10 registers per thread. Usually, the compiler assigns 20 to 32 registers per thread, which limits the practical degree of concurrency to about 4,000 to 6,500 threads, still a stupendous number.
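Several of the factors listed above can also be read out at run time. The short host-only sketch below (illustrative, with no error handling) queries the device properties that feed into this analysis, such as the number of multiprocessors, registers per block, and shared-memory size; the compiler’s actual per-thread register count can be seen by building with nvcc’s verbose PTX assembler output (for example, nvcc --ptxas-options=-v).

// query_device.cu -- host-only sketch listing inputs to the occupancy analysis
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: %s\n", dev, prop.name);
        printf("  multiprocessors        : %d\n", prop.multiProcessorCount);
        printf("  registers per block    : %d\n", prop.regsPerBlock);
        printf("  shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("  max threads per block  : %d\n", prop.maxThreadsPerBlock);
        printf("  warp size              : %d\n", prop.warpSize);
    }
    return 0;
}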

To help developers perform this analysis, Nvidia provides an Excel spreadsheet called the Occupancy Calculator. It considers all these factors (and others) to suggest the best method of decomposing the data. In any case, programmers must be intimately familiar with their algorithms and data. Some applications have dependencies that will prevent decomposing the data to the extent described in the antivirus example.

10. The Pros and Cons of CUDA

The real problem with parallel processing is that there are too many solutions, and finding the best one for a particular application isn’t easy. Choosing a processor architecture is only the first step; developers must also weigh the software development tools, the amount of special programming required, the degree of forward code compatibility, and the anticipated longevity of the vendor. Weighed against those factors, CUDA has several points in its favor:

• The PC graphics market largely subsidizes the development and production of these GPUs, so the chips evolve at a fast rate and are relatively inexpensive and plentiful. (Nvidia has already sold 50 million CUDA-capable GPUs.)

• High-performance computing on GPUs has attracted an enthusiastic following in the academic community as well as within the industry, so there is growing expertise among programmers and a range of alternative approaches to software development.

• Data-parallel algorithms leverage GPU attributes:

– Large data arrays and streaming throughput.

– Fine-grain SIMD parallelism.

– Low-latency floating-point (FP) computation.

• GPU parallelism is doubling every year.

• The programming model scales transparently.

• The multithreaded SPMD model exploits both application data parallelism and thread parallelism.

The downsides are few:

• GPUs only recently became fully programmable devices (the shaders used to be hard-wired), so their programming interfaces and tools are somewhat immature.

• Single-precision floating point is sufficient for consumer graphics, so GPUs don’t yet support double precision.

• The primary market for GPUs is electronic gaming, so Nvidia must optimize the devices for that application, sometimes at the expense of auxiliary markets.

• GPUs are large, power-hungry chips, so cramming many of them into a small space requires special attention to cooling and energy.


11. Conclusions

When evaluating all these factors, keep in mind that CUDA isn’t the only option for software development on GPUs. CUDA is merely Nvidia’s platform.

Because of the steep learning curve involved, and the new code that must be written, developers will probably resist making a switch after adopting a particular development platform for their high-performance computing application.

Developers would prefer an architecture-independent, industry-standard solution that is widely supported by tool vendors. AMD, IBM, Intel, Microsoft, and other companies are working in the direction of standard parallel-processing extensions to C/C++, but it’s far from certain that their efforts will converge anytime soon. Therefore, developers needing high performance right now must choose a nonstandard approach. CUDA is a strong contender, but it’s only one option among many.

Bibliography

1. Tom R. Halfhill. Parallel Processing with CUDA: Nvidia’s High-Performance Computing Platform Uses Massive Multithreading.

2. NVIDIA. Accelerating MATLAB with CUDA.

3. David Luebke. CUDA Fundamentals.

4. NVIDIA. CUDA Programming Model Overview.

5. Punit Kishore. CUDA Programming Basics.

6. www.wikipedia.org/wiki/CUDA

7. www.nvidia.com/object/cuda_home.html (official CUDA website)
