
CUDA: Speeding Up

Maria Andreina Francisco Rodriguez
[email protected]
Department of Software Engineering, Tongji University, Shanghai, 201804, China

Abstract

CUDA is a computing engine in NVIDIA GPUs that is accessible to programmers through common programming languages. It is intended for developing parallel-processing applications. The CUDA parallel-computing architecture supports computational interfaces such as OpenCL, and it is being used to speed up a large variety of applications in cryptography, computational biology and many other fields.

Keywords: CUDA, Parallel Computing, GPU

1. INTRODUCTION

The arrival of multi-core CPUs and many-core GPUs meant that the world of computing was about to change: mainstream processor chips have become parallel systems. Graphics cards are now advanced enough not only to function as hardware accelerators, but to outperform the other alternatives currently offered on commodity hardware platforms [MV08]. With this change comes the challenge of developing new applications (and modifying old ones) that scale their parallelism transparently, so as to exploit the growing number of available processor cores in the same way that graphics applications already do on many-core GPUs. CUDA (Compute Unified Device Architecture) was designed to meet this challenge.

CUDA is NVIDIA's parallel-computing architecture for general-purpose computation. It is intended for developing multi-core and parallel-processing applications on GPUs, in particular NVIDIA's 8-series GPUs. Before CUDA was introduced, accessing the computational power of GPUs for non-graphics applications was a complicated task: the available GPUs could only be programmed through a graphics API, so users had to learn graphics programming before they could use the GPU for parallel computing [ZMRS08][Che08]. CUDA is meant to eliminate the need for learning graphics programming and to keep the learning curve low for programmers who are already familiar with common programming languages.

2. THE CUDA ARCHITECTURE

To program for the CUDA architecture, developers can use the CUDA Toolkit, a C-like language development environment for CUDA-enabled GPUs; the resulting applications can be executed on any CUDA-enabled processor. CUDA offers greater flexibility than earlier GPGPU programming tools, and it does not require programmers to rewrite their operations as if they were geometric primitives. It also has the advantage of supporting scattered memory access, with code operating on many separate memory locations. On the other hand, CUDA has a few limitations: it implements only a limited subset of the C language, and notably this subset includes neither recursion nor function pointers [CUD]. Furthermore, it is not possible to run arbitrary code on each thread, because all the threads of a kernel share the same code.

CUDA GPUs contain multiple cores able to execute a great number of threads at the same time, and the cores share memory and registers. The shared memory located on the chip makes it possible for the parallel tasks being executed to exchange information without sending it through device memory [NVI09]. These key abstractions are exposed to the developer as a small set of language extensions, allowing the developer to break the original problem into smaller, independent and more manageable subproblems.
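As a sketch of how this on-chip sharing looks in code (the kernel, the array names and the fixed block size of 256 are illustrative assumptions, not taken from any of the cited works), a block of threads can stage data in __shared__ memory and read each other's values after a barrier:

```cuda
// Sketch: threads in a block cooperate through on-chip shared memory.
// Assumes the input length is a multiple of BLOCK and the kernel is
// launched with exactly BLOCK threads per block.
#define BLOCK 256

__global__ void reverseSegments(const float *in, float *out)
{
    __shared__ float tile[BLOCK];      // on-chip, visible to the whole block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];         // each thread loads one element
    __syncthreads();                   // barrier: wait until the tile is full

    // Read a value written by another thread -- no round trip through DRAM.
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

The __syncthreads() barrier is what makes the exchange safe: without it, a thread could read a tile slot before its neighbour has written it.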

CUDA allows threads to collaborate with each other while a problem is being solved, and each thread can be scheduled to execute on any available processor core. This makes it possible to keep the expressivity of the language while enabling transparent scalability: a compiled CUDA application can run on any number of processor cores without knowing the actual number of processors [NVI08]. When the GPU is programmed using CUDA, it can be used like a general-purpose processor capable of running a great number of parallel threads. Data can be copied between the fast DRAM memories through optimized API calls and high-performance Direct Memory Access (DMA) engines. A limitation noted in [Che08] is that GPU programs could only gather data from DRAM rather than write data back to it, restricting application flexibility.
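A minimal host-side sketch of this pattern, assuming the CUDA runtime API and illustrative names (n, h_x, d_x, scale), might look as follows; cudaMalloc, cudaMemcpy and cudaFree are the standard runtime entry points:

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Each thread scales one array element; the guard handles a partially
// filled last block.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

    float *d_x;
    cudaMalloc((void **)&d_x, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // DMA transfer in

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    scale<<<blocks, threads>>>(d_x, 2.0f, n);  // hardware maps blocks to cores

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // DMA transfer out
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```

Note that the launch only specifies how many blocks and threads exist; which physical core runs each block is decided by the hardware, which is what lets the same binary scale across GPUs with different core counts.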

A kernel is a unit of work passed to the GPU by the host computer. A kernel runs as many threads, and those threads are divided into disjoint blocks; each multiprocessor executes blocks of threads. Every block is in turn divided into smaller groups of threads called warps, and every block of a kernel contains the same number of threads. The execution of active warps is time-sliced: a task scheduler switches between warps to make the most of the available resources [VAP+08].
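This hierarchy can be made concrete with a small sketch (the kernel and output array are hypothetical; warpSize is a built-in device-side constant, 32 on current NVIDIA GPUs):

```cuda
// Sketch: each thread records which warp of its block it belongs to.
__global__ void whoAmI(int *warpOfThread)
{
    int global = blockIdx.x * blockDim.x + threadIdx.x;  // position in the grid
    warpOfThread[global] = threadIdx.x / warpSize;       // warp index within the block
}
```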

3. APPLICATIONS USING CUDA

Beyond graphics rendering, CUDA is being used, for example, in computer games, where it has been utilized for physics calculations. It is also being used to speed up many applications in cryptography [VAP+08], computational biology [MV08] and other fields. One example is the BOINC (Berkeley Open Infrastructure for Network Computing) client, a non-commercial system for volunteer and grid computing [BOI].

In molecular biology, searching for similarities in DNA and protein databases is nowadays a routine procedure. Smith-Waterman is an algorithm that finds an optimal local alignment by examining every feasible alignment between two sequences. Unfortunately, the algorithm is computationally expensive, and the continuing growth of DNA and protein databases makes it impractical when there are too many sequences to search. What is believed to be the fastest implementation of the Smith-Waterman algorithm on commodity hardware was developed using CUDA [MV08]. This implementation runs 2 to 30 times faster than all previous efforts while also providing a cost-effective solution.
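For reference, the dynamic-programming recurrence at the heart of Smith-Waterman, with a linear gap penalty g and substitution score s (a standard formulation, not spelled out in [MV08]), is:

```latex
H_{i,j} = \max\left\{ 0,\; H_{i-1,j-1} + s(a_i, b_j),\; H_{i-1,j} - g,\; H_{i,j-1} - g \right\},
\qquad H_{i,0} = H_{0,j} = 0 .
```

Every cell depends only on its left, upper and upper-left neighbours, so the cells along each anti-diagonal can be computed in parallel; this is the parallelism that GPU implementations such as [MV08] exploit.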

4. CONCLUSION

CUDA-enabled graphics cards are advanced enough to be used as hardware accelerators, and they outperform the other alternatives offered on commodity hardware platforms. CUDA is more flexible than earlier GPGPU programming tools. It has many advantages for parallel computing, but it is limited by its lack of support for recursion and function pointers.

International Journal of Computer Science and Security 2 Maria Andreina Francisco Rodriguez

5. REFERENCES

[BOI] Berkeley Open Infrastructure for Network Computing (BOINC). http://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing. Last visited: September 26th, 2009.

[Che08] Jim X. Chen. Guide to Graphics Software Tools, pages 192-194. Springer, 2008.

[CUD] CUDA. http://en.wikipedia.org/wiki/CUDA. Last visited: September 26th, 2009.

[MV08] Svetlin A. Manavski and Giorgio Valle. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics, 9(2):10, 2008.

[NVI08] NVIDIA Corporation. NVIDIA CUDA Compute Unified Device Architecture: Programming Guide, June 2008.

[NVI09] NVIDIA Corporation. Getting Started. NVIDIA CUDA Development Tools 2.3, July 2009.

[VAP+08] Giorgos Vasiliadis, Spiros Antonatos, Michalis Polychronakis, Evangelos P. Markatos, and Sotiris Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In 11th International Symposium on Recent Advances in Intrusion Detection (RAID), October 2008.

[ZMRS08] Ahmed El Zein, Eric McCreath, Alistair Rendell, and Alex Smola. Performance evaluation of the NVIDIA GeForce 8800 GTX GPU for machine learning. In Marian Bubak, Geert Dick van Albada, and Jack Dongarra, editors, Computational Science - ICCS 2008: 8th International Conference, Proceedings Part I, Kraków, Poland, June 2008. Springer.
