Masaryk University
Faculty of Informatics

GPU Acceleration of Image Processing Algorithms

Dissertation Thesis Topic

Pavel Karas

Centre for Biomedical Image Analysis
Supervisor: Doc. RNDr. Michal Kozubek, CSc.
Supervisor-specialist: Mgr. David Svoboda, PhD.

Brno, September 3, 2010

Supervisor's signature: ......................

Acknowledgement

I would like to thank my supervisors, Doc. Kozubek and Dr. Svoboda, for their patience and their suggestions. I thank my colleagues for their ideas and inspiration. My thanks belong also to my friends and parents for their endless support.

Contents

1 Introduction
2 Report on Current Results
  2.1 Parallel Computing Architectures and Models
    2.1.1 NVidia CUDA
    2.1.2 ATI Stream
    2.1.3 OpenCL
    2.1.4 Other architectures
  2.2 General-Purpose Computation on GPU
  2.3 Image Processing Acceleration
  2.4 Image Formation
  2.5 Image Enhancement
    2.5.1 Tone Mapping
    2.5.2 Image Restoration
    2.5.3 Geometrical Methods
  2.6 Image Analysis
    2.6.1 Morphological Processing
    2.6.2 Edge Detection
    2.6.3 Image Segmentation
    2.6.4 Feature Extraction
    2.6.5 Shape Analysis
    2.6.6 Image Registration
  2.7 Decomposition of Large Images in Image Processing
3 Dissertation Thesis Intent
References
4 Summary of Study Results

1 Introduction

Image processing is a rapidly developing field based on both mathematics and computer science. It is naturally connected to and strongly influenced by image acquisition. The community of users is wide and includes not only common consumers with digital cameras, but also astronomers analysing data from telescopes and spacecraft [GW02][SM06], specialists dealing with magnetic resonance imaging (MRI) or computed tomography (CT) [Cas96][GW02][Jäh05], and biologists working with optical microscopes [Ver98][Jäh05]. The users' demands on image quality, in terms of resolution, signal-to-noise ratio or dynamic range, are constantly growing. As a result, both image acquisition devices and image processing algorithms are developing in order to satisfy the users' needs.

On the other hand, complex image processing methods can be very time-consuming, and their computation can be challenging even for recent computer hardware. Therefore, an integral part of the field is optimization, so that complex algorithms can be performed in a reasonable time. For example, if a patient undergoes a magnetic resonance or ultrasound examination, it is essential to have the data processed within several minutes, so that a specialist can discuss the results with the patient and propose next steps within a single visit [GLGN+08]. In other applications, such as user-assisted image segmentation, the data need to be processed in real time [HLSB04].

The most important computational hardware for image processing is the Central Processing Unit (CPU), thanks to its versatility and tremendous speed. According to the oft-quoted Moore's Law, CPU computing capability doubles every two years. However, this trend is now inhibited, as the semiconductor industry has reached physical limits in production. Soon it will no longer be possible to decrease the size of transistors, and therefore to increase the clock frequency. Thus, manufacturers seek different ways to enhance CPU speed. The most significant trend is to increase the CPU's ability to process more and more tasks in parallel, by extending the instruction pipeline, involving vector processing units, and increasing the number of CPU cores.

In contrast to CPUs, Graphics Processing Units (GPUs) have recently come to be considered very efficient parallel processing units. While their original purpose was to provide graphics output to users, their computing capabilities have become more and more complex, and their performance has surpassed that of common CPUs, in terms of both computing speed (in floating point operations per second) and memory bandwidth [GLG05][CBM+08][CDMS+08][PSL+10][LKC+10] (Figure 1). Moreover, their performance growth is not limited by current manufacturing technology, so the gulf between CPU and GPU computational power is still getting wider. GPU architectures benefit from massive fine-grained parallelization, as they are able to execute thousands of threads concurrently. Attempts to utilize GPUs not only for manipulating computer graphics, but also for other purposes, have led to a new research field called "general-purpose computation on GPU" (GPGPU).

Figure 1: A performance gap between GPUs and CPUs [KmH10].
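To illustrate the fine-grained model, consider the following minimal kernel written in NVidia's CUDA C (introduced below). It is only a sketch prepared for this text, with illustrative names not taken from any cited work; it inverts an 8-bit grayscale image by mapping one GPU thread to one pixel:

    // Illustrative CUDA kernel: one thread per pixel of an 8-bit
    // grayscale image stored row by row.
    __global__ void invertKernel(const unsigned char *in, unsigned char *out,
                                 int width, int height)
    {
        // Each thread derives the coordinates of its own pixel.
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        // Guard: the launch grid is rounded up to whole thread blocks,
        // so threads that fall outside the image must do nothing.
        if (x < width && y < height)
            out[y * width + x] = 255 - in[y * width + x];
    }

With 16×16-thread blocks, a 1024×1024 image is covered by 4,096 blocks, i.e., over a million logically concurrent threads, which is precisely the kind of workload the GPU architecture is designed for.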
Graphics cards provide huge computational performance at a good price. For instance, the NVidia GeForce GTX 480, a top desktop graphics card with 480 streaming processors built on the new FERMI architecture, is slightly cheaper than the quad-core CPU Intel Core i7-960 (as of July 2010). Furthermore, recent computer mainboards support up to 4 PCIe x16 slots, offering space for 4 GPU cards. With multiple slots occupied, a common, affordable PC can literally be turned into a supercomputer.

This approach also has some caveats. First, there used to be no high-level programming languages specifically designed for general computation purposes. Programmers instead had to use shading languages such as Cg, the High Level Shading Language (HLSL) or the OpenGL Shading Language (GLSL) [OLG+05][CDMS+08][RRB+08] to utilize the texture units. This no longer poses an obstacle. In 2004, the Brook [BFH+04] and Sh [MDTP+04] programming languages were released as extensions of ANSI C with concepts from stream programming. In 2007, NVidia released the first version of the Compute Unified Device Architecture (CUDA) [CUD10e][CUD10c], a programming model for coding algorithms for execution on the GPU. A more general, non-proprietary framework, developed by the Khronos Group, is the Open Computing Language (OpenCL), which can be used on heterogeneous platforms, including both CPUs and GPUs [Ope10c][Ope10b].

Second, not all methods are suitable for massively parallel processing. The GPU performance is limited by several conditions: a) the ratio of the parallel to the sequential fraction of the algorithm, b) the ratio of floating point operations to global memory accesses, c) branching diversity, d) global synchronization requirements, and e) data transfer overhead [CDMS+08][PSL+10]. Image processing algorithms in general are good candidates for GPU implementation, since parallelization is naturally provided by per-pixel (or per-voxel) operations. Therefore, they naturally fulfill condition a) in particular. This has been confirmed by many studies examining GPU acceleration of image processing algorithms. Nevertheless, programmers have to keep the other conditions in mind as well, in order to keep their implementations efficient.
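To indicate where condition e) enters in practice, the following host-side sketch (reusing the illustrative invertKernel above; error checking omitted for brevity) shows the typical allocate, upload, launch, download pattern of the CUDA runtime API:

    #include <cuda_runtime.h>

    // Illustrative host-side driver for the kernel above.
    void invertOnGpu(const unsigned char *hostIn, unsigned char *hostOut,
                     int width, int height)
    {
        size_t bytes = (size_t)width * height;
        unsigned char *devIn, *devOut;
        cudaMalloc(&devIn, bytes);   // allocate GPU global memory
        cudaMalloc(&devOut, bytes);
        // Upload: host-to-device transfer over the PCIe bus (condition e).
        cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);
        dim3 block(16, 16);          // 256 threads per block
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        invertKernel<<<grid, block>>>(devIn, devOut, width, height);
        // Download: the launch is asynchronous; this copy on the default
        // stream implicitly waits for the kernel to finish.
        cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
        cudaFree(devIn);
        cudaFree(devOut);
    }

For such a trivial kernel, the two cudaMemcpy calls can easily dominate the total running time, so in practice the data should be kept resident on the device and the transfer cost amortized over many operations.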
An enormous effort has recently been devoted to accelerating image processing on the GPU. Many examples of successful implementations are cited in the following section of the text.

Third, programmers should be aware that graphics cards perform best in single floating point precision. The performance in double and higher precisions is significantly worse. For instance, AMD/ATI GPUs provide approximately 5 times lower computational power in double precision compared to single precision [AMD10]; NVidia GPUs provide even 16 times lower power [CUD10e][Tes09]. The latter should change with the release of TESLA cards built on the recent FERMI architecture [Tes10] (Figure 2). In many applications this does not pose an obstacle, since single precision can be sufficient, as will be shown in the following section. In some cases, a computation can be divided into two parts, of which the critical one (in terms of precision) can be executed on the CPU, whereas the other one is executed on the GPU. In other cases, careful attention needs to be paid to precision, and the users should probably wait for better GPU architectures.

Figure 2: GPU performance in single and double precision [Gri10].

Fourth, one of the bottlenecks of graphics cards is their relatively expensive, and therefore small, global memory. A common modern GPU provides approximately 4–8 times less global memory than the main memory of a common CPU system. This can pose significant problems for programmers in their effort to implement algorithms on the GPU. The amount of processed data can be enormous, especially in applications which produce three- or more-dimensional images, such as optical microscopy or computed tomography. Another example is magnetic resonance imaging using diffusion tensor images. In this case, the image domain is relatively small, but the amount of information stored for every single image element is huge. As a result, the size of the image can exceed the size of GPU global memory, or, at best, there is enough