Journal of Virtual Reality and Broadcasting, Volume n(200n), no. n

Geometric Algorithms on CUDA

Antonio Rueda, Lidia Ortega
Departamento de Informática, Universidad de Jaén
Paraje Las Lagunillas s/n, 23071 Jaén, Spain
email: {ajrueda,lidia}@ujaen.es

Abstract

The recent launch of the NVIDIA CUDA technology has opened a new era in the young field of GPGPU (General-Purpose computation on GPUs). This technology allows the design and implementation of parallel algorithms in a much simpler way than previous approaches based on shader programming. The present work explores the possibilities of CUDA for solving basic geometric problems on 3D triangle meshes, like the point inclusion test or the self-intersection detection. A solution to these problems can be implemented in CUDA with only a small fraction of the effort required to design and implement an equivalent solution using shader programming, and the results are impressive when compared to a CPU execution.

Keywords: GPGPU, CUDA, 3D triangle meshes, inclusion test, self-intersection test

Digital Peer Publishing Licence
Any party may pass on this Work by electronic means and make it available for download under the terms and conditions of the current version of the Digital Peer Publishing Licence (DPPL). The text of the licence may be accessed and retrieved via Internet at http://www.dipp.nrw.de/. First presented at the International Conference on Computer Graphics Theory and Applications (GRAPP) 2008, extended and revised for JVRB. urn:nbn:de:0009-6-348, ISSN 1860-2037

1 Introduction

General-purpose computing on graphics processing units (GPGPU) is a young area of research that has attracted the attention of many research groups in the last years. Although graphics hardware has been used for general-purpose computation since the 1970s, the flexibility and processing power of modern graphics processing units (GPUs) have generalized its use for solving many problems in Signal Processing, Computer Vision, Computational Geometry or Scientific Computing [OLG+07].

The programming capabilities of the GPU evolve very rapidly. The first models only allowed limited vertex programming; then pixel programming was added, and gradually the length of the programs and their flexibility (use of loops, conditionals, texture accesses, etc.) were increased. The last generation of GPUs (the 8 Series) supports programming at a new stage of the graphics pipeline: the geometry assembling. Several programming languages like the ARB GPU assembly language, GLSL [Ros06], HLSL or Cg [FK03] were developed aiming at exploiting GPU capabilities. GPU programming has been extensively used in the last years for implementing impressive real-time physical effects, new lighting models and complex animations [Fer04, PF05], and has allowed a major leap forward in the visual quality and realism of videogames.

But it should be kept in mind that vertex, pixel and geometry programming capabilities were aimed at implementing graphics computations. Their use for general-purpose computing is difficult in many cases, implying the complete redesign of algorithms whose implementation in the CPU requires only a few lines. Clearly the rigid memory model is the biggest problem: memory reads are only possible from textures or a limited set of global and varying parameters, while memory writes are usually performed on a fixed position in the framebuffer. Techniques such as multipass rendering, rendering to texture, and the use of textures as lookup tables are useful to overcome these limitations, but programming GPUs remains a slow and error-prone task. On the positive side, the implementation effort is usually rewarded with superb performance, up to 100X faster than CPU implementations in some cases.

The latest advance in GPGPU is represented by the CUDA technology of NVIDIA. For the first time, a GPU can be used without any knowledge of OpenGL, DirectX or the graphics pipeline, as a general-purpose coprocessor that helps the CPU in the more complex and time-expensive computations. With CUDA a GPU can be programmed in C, in a style very similar to a CPU implementation, and the memory model is now simpler and more flexible.

In this work we explore the possibilities of the CUDA technology for performing geometric computations, through two case studies: point-in-mesh inclusion test and self-intersection detection. So far CUDA has been used in a few applications [Ngu07], but this is the first work which specifically compares the performance of CPU vs CUDA in geometric applications. Our goal has been to study the cost of implementing two typical geometric algorithms in CUDA and their benefits in terms of performance against equivalent CPU implementations. The algorithms used in each problem are far from being the best, but the promising results in this initial study motivate a future development of optimized CUDA implementations of these and similar geometric algorithms.

2 Common Unified Device Architecture (CUDA)

The CUDA technology was presented by NVIDIA in 2006 and is supported by its latest generation of GPUs: the 8 series. A CUDA program can be implemented in C, but a preprocessor included in the CUDA toolkit is required to translate its special features into code that can be processed by a C compiler. Therefore host and device CUDA code can be combined in a straightforward way.

A CUDA-enabled GPU is composed of several MIMD multiprocessors that contain a set of SIMD processors [NVI07]. Each multiprocessor has a shared memory that can be accessed from each of its processors, and there is a large global memory space common to all the multiprocessors (Figure 1). Shared memory is very fast and is usually used for caching data from global memory. Both shared and global memory can be accessed from any thread for reading and writing operations without restrictions.

A CUDA execution runs several blocks of threads. Each thread performs a single computation and is executed by a SIMD processor. A block is a set of threads that are executed on the same multiprocessor, and its size should be chosen to maximize the use of the multiprocessor. A thread can store data in its local registers, share data with other threads from the same block through the shared memory, or access the device global memory. The number of blocks usually depends on the amount of data to process. Each thread is assigned a local index inside the block with three components, starting at (0, 0, 0), although in most cases only one component (x) is used. The blocks are indexed using a similar scheme.

A CUDA computation starts at a host function by allocating one or more buffers in the device global memory and transferring the data to process to them. Another buffer is usually necessary to store the results of the computation. Then the CUDA computation is launched one or more times by specifying the number of blocks, the threads per block, and the thread function. Pointers to the data and results buffers are passed as parameters of the thread function. After the computation has completed, the results buffer is copied back to CPU memory.

The learning curve of CUDA is much faster than that of GPGPU based on shader programming with OpenGL/DirectX and Cg/HLSL/GLSL. The programming model is more similar to CPU programming, and the use of the C language makes most programmers feel comfortable. CUDA is also designed as a stable, scalable API for developing GPGPU applications that will run on several generations of GPUs. On the negative side, CUDA loses the powerful and efficient mathematical matrix and vector operators that are available in the shader languages, in order to keep its compatibility with the C standard. Moreover, it is likely that in many cases an algorithm carefully implemented in a shader language could run faster than its equivalent CUDA implementation.
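The workflow described above can be condensed into a short, self-contained example. The following sketch is not taken from the paper; the kernel and variable names are ours. It follows the allocate/launch/copy-back pattern just described: each thread computes a global index from its block and thread indices and writes one result to a buffer in global memory.

#include <cuda_runtime.h>
#include <stdio.h>

#define THREADS_PER_BLOCK 64

// Trivial kernel: each thread derives its global index and stores one value
__global__ void exampleThread(int *dResults, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dResults[i] = 2 * i;                  // Any per-thread computation
}

int main()
{
    const unsigned n = 1024;
    int results[n];

    // Allocate a results buffer in device global memory
    int *dResults;
    cudaMalloc((void **)&dResults, n * sizeof(int));

    // Launch the computation, specifying the number of blocks and threads per block
    unsigned blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    exampleThread<<<blocks, THREADS_PER_BLOCK>>>(dResults, n);

    // Copy the results back to CPU memory
    cudaMemcpy(results, dResults, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dResults);

    printf("results[10] = %d\n", results[10]);
    return 0;
}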

Figure 1: CUDA architecture with n MIMD multiprocessors containing n × m SIMD processors.

3 Point-in-mesh inclusion test on CUDA

The point-in-mesh inclusion test is a simple classical geometric algorithm, useful in the implementation of collision detection algorithms or in the conversion to voxel-based representations. A GPU implementation of this algorithm is only of interest with large triangle meshes and many points to test, as the cost of setting up the computation is high.

For our purpose we have chosen the algorithm of Feito & Torres [FT97], which presents several advantages: it has a simple implementation, it is robust and it can be easily parallelized. The pseudocode is shown next:

bool inclusionTest(Mesh m, Point p) {
    Point o = pointCreate(0, 0, 0);    // Origin point
    float res = 0;                     // Inclusion counter
    for (int nf = 0; nf < meshNumFaces(m); nf++) {
        Face f = meshFace(m, nf);
        Tetrahedron t = tetrahedronCreate(f, o);
        if (tetrahedronPointInside(t, p)) {
            res += 1;
        } else if (tetrahedronPointAtFace(t, p)) {
            res += 0.5;
        }
    }
    return isOdd(res);
}

The algorithm constructs a set of tetrahedra between the origin of coordinates and each triangular face of the mesh. The point is tested for inclusion against each tetrahedron and a counter is incremented if the result of the test is positive. If the point is inside an odd number of tetrahedra, the point is inside the mesh. Notice that if the point falls at a face shared by two tetrahedra, each one adds 0.5 to the counter to avoid a double increment that would lead to incorrect results.

The programming model of CUDA fits especially well with problems whose solution can be expressed in a matrix form. In our case, we could construct a matrix in which the rows are the tetrahedra to process and the columns the points to test. This matrix is divided into blocks of threads, and each thread is made responsible for testing the point in column j against the tetrahedron in row i, adding the result of the test (0, 1 or 0.5) to the inclusion counter of point j (see Figure 2). This approach has a minor drawback: in order to ensure a correct result after several add operations on the same position in global memory, performed by concurrent threads, support for atomic functions is required. This feature is only available in newer devices of the GeForce series with compute capability 1.1 [NVI07]. The need for atomic functions can be avoided if each thread stores the result of the point-in-tetrahedron inclusion test in the position (i, j) of a matrix of integers, but the memory requirements for this matrix can be very high when working with large meshes and many points to test. But the main problem of these two approaches is the high number of memory access conflicts that they generate, as every thread in row i requires triangle i to work and every thread in column j requires testing point j. This leads to poor results when compared with a CPU implementation of the algorithm.

Figure 2: CUDA matrix-based implementation of the inclusion test.

We choose a different strategy, computing in each thread the inclusion test of one or several points against the entire mesh. Each thread iterates on the mesh, copying a triangle from global memory to a local variable and performing the inclusion test on the points; then it accumulates the result in a vector that stores an inclusion counter per point (Figure 3). It could be argued that the task assigned to each thread is very heavy, specially when compared with the matrix-based implementations, but in practice it works very well. Two implementation aspects require special attention. First, the accesses from the threads to the triangle list must be interleaved to avoid conflicts that could penalize performance. And second, the relatively high cost of retrieving a triangle from global memory makes it interesting to test it against several points. These points must be cached in processor registers for maximum performance.

Figure 3: CUDA implementation of the inclusion test.

The host part of the CUDA computation starts by allocating a buffer of triangles and a buffer of points to test in the device memory, and copying these from the data structures in host memory. Another buffer is allocated to store the inclusion counters, which are initialized to 0. The number of blocks of threads (B) is estimated as a function of the total number of points to test (n), the number of threads per block (T) and the number of points processed by a single thread (P): B = n/(T · P). The last two constants should be chosen with care to maximize performance: a high number of threads per block limits the number of registers available to each thread, and therefore the number of points that can be cached and processed. A low number of threads per block makes a poor use of the multiprocessors. Finally, after the GPU computation has completed, the buffer of inclusion counters is copied back to the host memory.

A thread begins by copying P points from the point buffer in global memory to a local array, which is stored in registers by the CUDA compiler. The copy starts at the position (bi · T + ti) · P, where bi is the index of the block of the thread and ti is the index of the thread in the corresponding block. This assigns a different set of points to each thread. The iteration on the triangle list starts by copying each triangle to a local variable and calling a point-in-tetrahedron inclusion test function. In case of success, the inclusion counter of the corresponding point is updated. A good interleaving is ensured by starting the iteration at the position given by bi · T + ti. Using float counters is not really necessary, as we can add 2 when the point is inside a tetrahedron and 1 when it falls at a face, and divide the counter by 2 at the end of the iteration. The full source code of this thread is shown in the appendix (Listing 1).

We have compared the performance of a CPU version of the algorithm against the CUDA implementation using blocks of 64 threads and testing 16 points per thread. The computer used for the experiments has an Intel Core Duo CPU running at 2.4GHz and a NVIDIA GeForce 8800GTX GPU. Four different models with an increasing number of faces were used (Figure 4). The improvements of the GPU against the CPU version of the algorithm are shown in Table 1. As expected, the CPU only beats the GPU implementation with very simple meshes. In the rest of the cases, the GPU outperforms the CPU, up to 77X in the case of the largest model and a high number of points to test. Notice that the GPU completes this computation in less than 7 seconds, while the CPU requires 8 minutes.

                              number of points tested
mesh (number of triangles)    1000      10000     100000
simple (42)                   0.05X     0.5X      3.7X
heart (1619)                  1.2X      10X       38X
bone (10044)                  2.6X      29X       55X
brain (36758)                 3.5X      35X       77X

Table 1: Improvements of the GPU vs the CPU implementation of the inclusion test algorithm.

Figure 4: Models used for running the inclusion and self-intersection tests.

Figure 5: Generation of a matrix with the intersecting triangle pairs using CUDA.

4 Self-intersection test on CUDA

Detecting self-intersections in a triangle mesh is useful in many applications. For instance, rapid prototyping using stereolithography technology usually requires hole-free meshes without self-intersections and with consistent normal orientation in order to ensure a correct result. In interactive simulation of deformable objects, self-collisions are solved by detecting the triangle pairs that do intersect. This problem is particularly difficult to solve efficiently, as updating pre-computed spatial data structures can be hard and inefficient with large deformations.

We find different CPU strategies in the literature to detect, prevent or eliminate self-intersections in triangle meshes [LAM01, GD01, LL06]. Most of these methods are complex and inappropriate for many real-time applications. The algorithm of Govindaraju et al. [GLM05, GKLM07] is a GPU-accelerated collision detection algorithm that uses occlusion queries to cull non-intersecting triangles, but it still requires a final exact CPU-based intersection test on the remaining triangles of the Potentially Colliding Set. The method proposed by Choi et al. [CKK06] solves triangle intersections entirely in the GPU by using shader programming. It uses three 1D textures to store the triangles, a 2D texture for all pairwise combinations and a hierarchical structure in order to improve the readback of collision results from the GPU. Prior to testing each pair of triangles for intersection, it performs a visibility culling similar to that proposed by Govindaraju et al. [GLM05], and a topology culling to avoid testing each pair of triangles twice. The remaining triangles are tested for collision using the Möller triangle-triangle test [Mol97]. Although this method improves a CPU implementation up to 17X, the number of triangles that can be processed in a single pass is limited by the maximum size of a 1D texture in the GPU architecture.

We propose a simple solution to this problem using CUDA, based on testing each pair of triangles for intersection. The results can be stored in a symmetric matrix of booleans as shown in Figure 5, but its high storage requirements suggest using a different alternative. Our implementation only generates a list of indices of the intersecting triangle pairs, stored in a results buffer (Figure 6). The first position of this buffer keeps the current size of the list, and is updated by each thread every time a pair is added, by using the atomic function atomicAdd.

Following a similar strategy to the point-in-mesh inclusion test described in the previous section, each thread tests a triangle against the rest of the triangles in the mesh, using the triangle-triangle test of Möller [Mol97]. In order to avoid duplicated comparisons, triangle ti is only tested for intersection against triangles ti+1...tn. See Listing 2 in the appendix for details.

Table 2 shows the results of a comparison against a CPU implementation considering different meshes. The system is similar to that used in the previous section, but the GPU is a 8800GTS with atomic functions support. In the CUDA implementation, the number of threads per block has been set to 64. The best results are found in the larger meshes: for instance, the brain model requires a computation time of 3.6 minutes in the CPU, while the GPU detects the self-intersections in less than 6 seconds.
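The host side of this computation is not shown in the paper either; a possible sketch follows, reusing the kernel, the Triangle type and the BLOCK_SIZE and MAX_RESULTS constants of Listing 2 in the appendix. The initial value of the size counter (2, so that the first pair of indices is written just after it), the assumption that the number of triangles is a multiple of the block size, and the buffer names are ours.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Host-side sketch of the self-intersection test (assumed, not from the paper)
void runSelfIntersectionTest(const Triangle *mesh, unsigned nTriangles)
{
    const unsigned T = BLOCK_SIZE;            // Threads per block (64 in the experiments)
    const unsigned B = nTriangles / T;        // One thread per triangle (nTriangles assumed multiple of T)

    Triangle *dMesh;
    int *dTestResults;
    cudaMalloc((void **)&dMesh, nTriangles * sizeof(Triangle));
    cudaMalloc((void **)&dTestResults, MAX_RESULTS * sizeof(int));
    cudaMemcpy(dMesh, mesh, nTriangles * sizeof(Triangle), cudaMemcpyHostToDevice);

    // Position 0 keeps the current size of the list; starting at 2 makes the
    // first intersecting pair land at positions 2 and 3 (assumed convention)
    int initialSize = 2;
    cudaMemcpy(dTestResults, &initialSize, sizeof(int), cudaMemcpyHostToDevice);

    // Launch the thread function of Listing 2
    selfIntersectionGPUThread<<<B, T>>>(dMesh, nTriangles, dTestResults);

    // Read back the list of intersecting triangle pairs
    int *results = (int *)malloc(MAX_RESULTS * sizeof(int));
    cudaMemcpy(results, dTestResults, MAX_RESULTS * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 2; i < results[0]; i += 2)
        printf("Triangles %d and %d intersect\n", results[i], results[i + 1]);

    free(results);
    cudaFree(dMesh);
    cudaFree(dTestResults);
}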

Figure 6: Self-intersection test using CUDA.

5 Conclusions and future works

The CUDA architecture allows the design and implementation of algorithms in the GPU with only a fraction of the effort required with shader-based programming. The memory model is now flexible, and the solution can be implemented in the C language with little or no previous knowledge of the graphics pipeline operation.

Most geometric algorithms can be rewritten for CUDA; however, the setup and readback time implied in the execution suggests its application to algorithms with a high computation cost. Both the self-intersection test and the point-in-mesh inclusion test for large point sets presented in this work are quadratic problems well suited for this purpose, running up to 77X faster than the CPU implementations with large meshes. We have implemented other geometric algorithms in CUDA, like the computation of the axis-aligned minimal bounding box and the convex hull of large meshes, but the results have been poor. The first is a simple linear problem that requires little computational power, and in the second case the problem can hardly be decomposed into simpler independent tasks that can be assigned to the threads.

The benefits of the CUDA technology can also be applied to other geometric problems like mesh simplification or the computation of boolean operations. Currently we are also studying the acceleration of GIS raster operations in the GPU using this programming model.

6 Acknowledgements

This work has been partially supported by the Ministerio de Ciencia y Tecnología of Spain and the European Union by means of the ERDF funds, under the research project TIN2007-67474-C03-03, and by the Consejería de Innovación, Ciencia y Empresa of the Junta de Andalucía, under the research projects P06-TIC-01403 and P07-TIC-02773.

mesh (number of triangles)    CPU time (ms)    GPU time (ms)    improvement
simple (42)                   0.57             104              0.06X
heart (1619)                  462              109              4.2X
bone (10044)                  24524            583              42X
brain (36758)                 217700           5596             39X

Table 2: Improvements of the GPU vs the CPU implementation of the self-intersection test algorithm.

References

[CKK06] Yoo-Joo Choi, Young J. Kim, and Myoung-Hee Kim, Rapid pairwise intersection tests using programmable GPUs, The Visual Computer 22 (2006), no. 2, 80–89.

[Fer04] Randima Fernando, GPU Gems: Programming techniques, tips and tricks for real-time graphics, Pearson Higher Education, 2004.

[FK03] Randima Fernando and Mark J. Kilgard, The Cg tutorial: The definitive guide to programmable real-time graphics, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2003.

[FT97] F. Feito and J.C. Torres, Inclusion test for general polyhedra, Computers & Graphics 21 (1997), 23–30.

[GD01] James E. Gain and Neil A. Dodgson, Preventing self-intersection under free-form deformation, IEEE Transactions on Visualization and Computer Graphics 7 (2001).

[GKLM07] Naga K. Govindaraju, Ilknur Kabul, Ming C. Lin, and Dinesh Manocha, Fast continuous collision detection among deformable models using graphics processors, Comput. Graph. 31 (2007), no. 1, 5–14.

[GLM05] Naga K. Govindaraju, Ming C. Lin, and Dinesh Manocha, Quick-CULLIDE: fast inter- and intra-object collision culling using graphics hardware, SIGGRAPH '05: ACM SIGGRAPH 2005 Courses (New York, NY, USA), ACM, 2005, p. 218.

[LAM01] T. Larsson and T. Akenine-Möller, Collision detection for continuously deforming bodies, Proceedings of Eurographics 2001, 2001, pp. 325–333.

[LL06] Jiing-Yih Lai and Hou-Chuan Lai, Repairing triangular meshes for reverse engineering applications, Advances in Engineering Software 37 (2006), no. 10, 667–683.

[Mol97] Tomas Möller, A fast triangle-triangle intersection test, Journal of Graphics Tools 2 (1997), no. 2, 25–30.

[Ngu07] Hubert Nguyen, GPU Gems 3, Addison-Wesley Professional, 2007.

[NVI07] NVIDIA Corporation, NVIDIA CUDA programming guide (version 1.0), NVIDIA Corporation, 2007.

[OLG+07] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell, A survey of general-purpose computation on graphics hardware, Computer Graphics Forum 26 (2007), no. 1, 80–113.

[PF05] Matt Pharr and Randima Fernando, GPU Gems 2: Programming techniques for high-performance graphics and general-purpose computation, Addison-Wesley Professional, March 2005.

[Ros06] Randi J. Rost, OpenGL(R) Shading Language (2nd edition), Addison-Wesley Professional, January 2006.

Appendix

Listing 1: Source code of the point-in-mesh inclusion test thread.

__global__ void inclusionTestGPUThread(Triangle *dMesh,
                                       unsigned nTriangles,
                                       Point *dPoints,
                                       int *dTestResults,
                                       unsigned nTests)
{
    // Index of first point to process
    int np = (blockIdx.x * THREADS_PER_BLOCK + threadIdx.x) * POINTS_PER_THREAD;
    int npl;

    // Local copy of the points to process
    Point points[POINTS_PER_THREAD];
    // Result counters
    int c[POINTS_PER_THREAD];

    // Copy points to process
    for (npl = 0; npl < POINTS_PER_THREAD; npl++) {
        vertexCopy(points[npl], dPoints[np + npl]);
        c[npl] = 0;
    }

    // Start at a different triangle to avoid conflicts
    int nt = np % nTriangles;
    int tc = nTriangles;

    // Local copy of the tetrahedron
    Point tetraPoints[3];

    // Iterate on the triangle list
    while (tc--) {
        vertexCopy(tetraPoints[0], dMesh[nt][0]);
        vertexCopy(tetraPoints[1], dMesh[nt][1]);
        vertexCopy(tetraPoints[2], dMesh[nt][2]);

        // Test points on the tetrahedron
        for (npl = 0; npl < POINTS_PER_THREAD; npl++) {
            // Add 2, 1 or 0 to the counter
            c[npl] += inclusionTestInTetrahedron(tetraPoints[0], tetraPoints[1],
                                                 tetraPoints[2], points[npl]);
        }
        nt = (nt + 1) % nTriangles;
    }

    // Write results, dividing the counters by 2 and checking parity
    for (npl = 0; npl < POINTS_PER_THREAD; npl++) {
        dTestResults[np + npl] = ((c[npl] >> 1) & 1);
    }
}

Listing 2: Source code of the self-intersection test thread.

__global__ void selfIntersectionGPUThread(Triangle *dMesh,
                                          unsigned nTriangles,
                                          int *dTestResults)
{
    // Index of triangle to process
    int nt = blockIdx.x * BLOCK_SIZE + threadIdx.x;

    // Get triangle from global memory
    Vertex v0, v1, v2, u0, u1, u2;
    vertexCopy(v0, dMesh[nt][0]);
    vertexCopy(v1, dMesh[nt][1]);
    vertexCopy(v2, dMesh[nt][2]);

    // Start testing triangles from index nt + 1
    int ntt = nt + 1;
    while (ntt < nTriangles) {
        // Get triangle to test
        vertexCopy(u0, dMesh[ntt][0]);
        vertexCopy(u1, dMesh[ntt][1]);
        vertexCopy(u2, dMesh[ntt][2]);

        // Test triangles for intersection
        if (triTriIntersect(v0, v1, v2, u0, u1, u2)) {
            if (dTestResults[0] < MAX_RESULTS) {
                // Get and move the current position of the list;
                // store the indices of both triangles
                int pos = atomicAdd(&dTestResults[0], 2);
                dTestResults[pos] = nt;
                dTestResults[pos + 1] = ntt;
            }
        }
        ++ntt;
    }
}
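Listing 1 relies on a point-in-tetrahedron primitive (inclusionTestInTetrahedron) whose source is not reproduced in the paper. The following device function is a hypothetical sketch of such a test based on signed volumes (the signs of the barycentric coordinates of the point); it assumes a Point type with x, y and z fields and does not attempt to reproduce the exact special-case handling of the Feito & Torres algorithm.

__device__ float signedVolume(Point a, Point b, Point c, Point d)
{
    // Six times the signed volume of the tetrahedron (a, b, c, d)
    float ax = a.x - d.x, ay = a.y - d.y, az = a.z - d.z;
    float bx = b.x - d.x, by = b.y - d.y, bz = b.z - d.z;
    float cx = c.x - d.x, cy = c.y - d.y, cz = c.z - d.z;
    return ax * (by * cz - bz * cy)
         - ay * (bx * cz - bz * cx)
         + az * (bx * cy - by * cx);
}

// Returns 2 if p is strictly inside the tetrahedron formed by triangle
// (v0, v1, v2) and the origin, 1 if p lies on its boundary, 0 otherwise
__device__ int inclusionTestInTetrahedron(Point v0, Point v1, Point v2, Point p)
{
    Point o = {0.0f, 0.0f, 0.0f};              // Fourth vertex: the origin

    float d0 = signedVolume(v0, v1, v2, o);    // Orientation of the tetrahedron
    float d1 = signedVolume(p,  v1, v2, o);    // v0 replaced by p
    float d2 = signedVolume(v0, p,  v2, o);    // v1 replaced by p
    float d3 = signedVolume(v0, v1, p,  o);    // v2 replaced by p
    float d4 = signedVolume(v0, v1, v2, p);    // the origin replaced by p

    if (d0 == 0.0f)
        return 0;                              // Degenerate tetrahedron

    // p is inside or on the boundary if all sub-volumes share the sign of d0
    bool pos = d0 > 0.0f;
    bool s1 = pos ? d1 >= 0.0f : d1 <= 0.0f;
    bool s2 = pos ? d2 >= 0.0f : d2 <= 0.0f;
    bool s3 = pos ? d3 >= 0.0f : d3 <= 0.0f;
    bool s4 = pos ? d4 >= 0.0f : d4 <= 0.0f;
    if (!(s1 && s2 && s3 && s4))
        return 0;                              // Outside

    if (d1 == 0.0f || d2 == 0.0f || d3 == 0.0f || d4 == 0.0f)
        return 1;                              // On a face, edge or vertex

    return 2;                                  // Strictly inside
}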
