MMasarykasaryk U Universityniversity FFacultyaculty}w¡¢£¤¥¦§¨  of ofI Informatics!"#$%&'()+,-./012345

GPU Acceleration of Automatic Bug-finding Techniques for Image ProcessingLinux Kernel Algorithms

Dissertation Thesis Topic

DissertationPavel Karas Thesis Topic

Jiří Slabý

Brno, spring 2010

Centre for Biomedical Image Analysis

Supervisor: Doc. RNDr. Michal Kozubek, CSc. Supervisor: prof. RNDr. Antonín Kučera, Ph.D. Supervisor-specialist: Mgr. David Svoboda, PhD. Supervisor-specialist: Mgr. Jan Obdržálek, PhD. Brno, September 3, 2010

Supervisor’s Signature: ...... Supervisor’s signature: Acknowledgement

I would like to thank supervisors, both Doc. Kozubek and Dr. Svoboda for their patience and their suggestions. I thank my colleagues for their ideas and inspiration. My thanks belong also to my friends and parents for their endless support.

i Contents

1 Introduction1

2 Report on Current Results5 2.1 Architectures and Models...... 5 2.1.1 NVidia CUDA...... 5 2.1.2 ATI Stream...... 7 2.1.3 OpenCL...... 7 2.1.4 Other architectures...... 7 2.2 General-Purpose Computation on GPU...... 9 2.3 Image Processing Acceleration...... 12 2.4 Image Formation...... 13 2.5 Image Enhancement...... 15 2.5.1 Tone Mapping...... 15 2.5.2 Image Restoration...... 16 2.5.3 Geometrical Methods...... 17 2.6 Image Analysis...... 18 2.6.1 Morphological Processing...... 18 2.6.2 Edge Detection...... 19 2.6.3 Image Segmentation...... 20 2.6.4 Feature Extraction...... 21 2.6.5 Shape Analysis...... 23 2.6.6 Image Registration...... 23 2.7 Decomposition of Large Images in Image Processing...... 24

3 Dissertation Thesis Intent 27

References 29

4 Summary of Study Results 41

ii

1 Introduction

Image processing is a rapidly developing field based both on mathematics and com- puter science. It is naturally connected to and strongly influenced by image acqui- sition. Community of users is wide and includes not only common customers with digital cameras, but also astronoms analysing data from telescopes and spacecrafts [GW02][SM06], specialists dealing with magnetic resonance imaging (MRI) or com- puted tomography (CT) [Cas96][GW02][Jäh05], and biologists working with optical microscopes [Ver98][Jäh05]. The users’ demands for image quality, in terms of res- olution, signal-to-noise ratio or dynamic range, are constantly growing. As a result, both the image acquisition devices and image processing algorithms are developing in order to satisfy the users’ needs. On the other hand, the complex image processing methods can be very time- consuming and their computation can be challenging even for recent computer hard- ware. Therefore, an integral part of the field is optimization, so that the complex algorithms could be performed in a reasonable time. For example, if a patient under- goes an magnetic resonance or ultrasound investigation, it is essential to have the data processed in several minutes so that a specialist can consult the results with the patient and propose next steps within a single visit [GLGN+08]. In other applications like user-assisted image segmentation, the data need to be processed real-time [HLSB04]. The most important computational hardware for image processing is Central Pro- cessing Units (CPU), for their versatility and tremendous speed. According to the oft-quated Moore’s Law, the CPU’s computing capability is doubled every two years. However, this trend is now inhibited as the semiconductor industry has reached phys- ical limits in production. Soon, it will be no longer possible to decrease the size of transistors, and therefore to increase the clock frequency. Thus, manufacturers seek for a different ways to enhance the CPU speed. The most significant trend is to increase CPU’s ability to process more and more tasks in parallel, by extending the instruction pipeline, involving vector processing units, and increasing the number of CPU cores. In contrast to CPU, Graphics Processing Units (GPU) are recently considered to be very efficient parallel processing units. While their original purpose was to provide a graphic output to users, their computing capabilities have been getting more and more complex and their performance has overcome common CPUs, in terms of both computing speed (in floating point operations per second) and memory bandwidth [GLG05][CBM+08][CDMS+08][PSL+10][LKC+10] (Figure1). Besides, their perfor- mance growth is not limited by current manufacturing technology, so the gulf between CPU and GPU computational power is still getting wider. The GPU architecture bene- fit from a massive fine-grained parallelization, as they are able to execute as many as thousands of threads concurrently. Attempts to utilize GPUs not only for manipulat- ing computer graphics, but also for different purposes, lead to a new research field, called “General-purpose computation on GPU (GPGPU)”. The graphic cards provide huge computational performance for a good price. For instance, NVidia GeForce GTX 480, a top desktop graphics card with 480 streaming processors, built on a new FERMI architecture, is slightly cheaper than the quad-core CPU Intel Core i7-960 (as to July 2010). Furthermore, the recent computer mainboards support up to 4 PCIe x16 slots

1 Figure 1: A performance gap between GPUs and CPUs [KmH10]. offering space for 4 GPU cards. With multiple slots occupied, a common affordable single PC can be turned into a , literally. This approach also shows some caveats. First, there used to be no high-level pro- gramming languages specifically designed for general computation purposes. The programmers instead had to use shading languages such as Cg, High Level Shad- ing Language (HLSL) or OpenGL Shading Language (GLSL) [OLG+05][CDMS+08] [RRB+08], to utilize texture units. This no more poses an obstacle. In 2004, the Brook [BFH+04] and Sh [MDTP+04] programming languages were released as an extension of ANSI C with concepts from stream programming. In 2007, NVidia released the first version of Compute Unified Device Architecture (CUDA) [CUD10e][CUD10c], a programming model to code algorithms for execution on the GPU. A more general, non-proprietary framework, developped by Khronos Group, is Open Computing Lan- guage (OpenCL), which can be used for heterogenous platforms, including both the CPU and the GPU [Ope10c][Ope10b]. Second, not all methods are suitable for a massive parallel processing. The GPU performance is limited by several conditions: a) the ratio of parallel to sequential fraction of the algorithm, b) the ratio of floating point operations to global memory accesses, c) branching diversity, d) global synchronization requirements and e) data transfer overhead [CDMS+08][PSL+10]. Image processing algorithms in general are good candidates for GPU implementation, since the parallelization is naturally pro- vided by per-pixel (or per-voxel) operations. Therefore, they naturally fulfill the a) condition, particularly. This was confirmed by many studies examining GPU accelera- tion of image processing algorithms. Nevertheless, programmers have to keep in mind

2 Figure 2: GPU performance in single and double precision [Gri10]. also the other conditions in order to keep their implementations efficient. An enor- mous effort has been recently spent on accelerating image processing on GPU. Many examples of successful implementations are cited in the following section of the text. Third, programmers should be aware that graphic cards perform best in a single floating point precission. The performance in a double and higher precissions is signif- icantly worse. For instance, AMD/ATI GPUs provide approximately 5 times smaller computational power in a double precision compared to a single precision [AMD10], NVidia GPUs provide even 16 times smaller power [CUD10e][Tes09]. The latter should change with a release of TESLA cards built on a recent FERMI architecture [Tes10] (Fig- ure2). In many applications, this does not pose an obstacle, since a single precision can be enough, as will be shown in the following section. In some cases, a computation can be divided into two parts, of which the critical one (in means of precision) can be executed on CPU, whereas the other one is executed on GPU. In the other cases, a careful attention needs to be paid to precision, and the users should probably wait for better GPU architectures. Fourth, one of the bottlenecks of graphic cards is their relatively expensive, there- fore small global memory. A common modern GPU provides approximately 4–8 times smaller global memory than a common CPU. This can pose significant problems for programmers in their effort to implement algorithms on GPU. The amount of pro- cessed data can be enormous, especially in applications which produce three- or more

3 dimensional images, such as optical microscopy or computed tomography. Another example is magnetic resonance imaging, using diffusion tensor images. In this case, the image domain is relatively small, but the amount of information stored for every single image element is huge. As a result, the size of the image can exceed the size of GPU global memory, or, at best, there is enough global memory to allocate the space for the image itself, but not for auxilliary data structures. The simple fact, that image size exceeds the amount of available memory, does not itself pose an obstacle for a potential GPU implementation. There is a large class of image processing algorithms which can be referred to as methods with local knowledge. In this case, in order to process each image element only local information about the el- ement and its closest neighbours is needed. Image tonality adjustation, edge detection filters, and linear combination of multiple images are typical examples of this class. Linear and non-linear filters with relatively small kernels, as well as dilation and ero- sion with small structural elements can be also considered members of this class. Due to nature of local algorithms it is easy to split data into small chunks. Consequently, these parts can be processed independently, so only little effort has to be made on problem decomposition, actually. As will be shown in our survey, this is the case of the most current GPU implementations dealing with huge images. On the contrary, algorithms, which we refer to as methods with global knowledge, need to store the entire image. Fourier transform (and fast Fourier transform as well), Hough transform or wavelet transform are good examples of this class. Many methods, for instance, convolution and deconvolution with large kernels, are derived from these transforms, and therefore should be also considered global. The Euclidean distance transform, the watershed algorithm, and generally all segmentation methods based on a functional minimization are other candidates. In this case, it is necessary to seek for nontrivial methods to decompose a problem into several parts which are independent from each other, in order to process data on architectures with a small memory. In this decomposition process, it is important to pay attention to task dependency. The most desired goal is to split data into independent parts, so that they can be processed in parallel, in our case, on multiple GPU cores. Also, it may be possible to deal with huge data that cannot be stored not only in GPU memory but also in CPU memory. Without decomposition, the computation would have to be performed in a larger, but much slower external memory.

Intention of Thesis. The main goal of this dissertation thesis should be acceleration of image processing algorithms with respect to huge data. Since our laboratory, Centre for Biomedical Image Analysis (CBIA), is focused on optical microscopy, we deal with large 3-D images; therefore, it is necessary to have tools for processing such data. The thesis should focus on decomposition of image processing algorithms, especially of the class of global methods. For implementation we choose the GPU platform, since it is suitable not only for testing, but also for considerable speedup of computational- intensive algorithms. Although many papers concerning GPU acceleration were pub- lished, very few of them dealt with huge images and problem decomposition; there- fore, lot of questions remain unanswered in this field.

4 2 Report on Current Results

In this section we will give a summary of state of the art in GPU acceleration of im- age processing algorithms. The section is organized as follows: First, several popular parallel computing models are briefly described. Second, GPGPU is introduced and the most important surveys and performance studies are summarized. Third, a survey of GPU implementations is made, with respect to image processing algorithms. In the final subsection, we focus on implementations dealing with large images.

2.1 Parallel Computing Architectures and Models

2.1.1 NVidia CUDA

As the number of papers published shows, one of the most widely used parallel pro- gramming models is CUDA, developed by NVidia. It is very easy to use for program- mers, since it introduces a small number of extensions to C language, in order to pro- vide parallel execution. Another important features are flexibility of data structures, explicit access on the different physical memory levels of the GPU, and a good frame- work for programmers including a compiler, CUDA Software Development Kit (CUDA SDK), a debugger, a profiler, and CUFFT and CUBLAS scientific libraries [CUD10e] [CUD10d]. From the hardware point of view, NVidia graphics card architecture consists of a number of so-called streaming multiprocessors (SM). Each one includes 8 processor (SP) cores, a local memory shared by all SP, 16384 registers, and fast ALU units for hardware acceleration of trancendental functions. A global memory is shared by all SMs and provides capacity up to 4 GB and memory bandwidth up to 144 GB/s (to July 2010). FERMI architecture introduces new SMs equipped with 32 SPs and 32768 registers, improved ALU units for fast double precission floating point performance, and L1 cache. From the programmer’s point of view, every program consists of two parts, a host code ran on CPU (called a ’host’), and a kernel—a piece of code ran in parallel on GPU (called a ’device’). CUDA’s extensions to the C programming language allow programmer to define how many threads performing a kernel will be executed on a device. The number of threads can greatly (up to 10,000 ×) exceed the number of processing units which execute the program. This paradigm allows programmers to consider GPU as an abstract device providing a huge number of virtual resources. Therefore, not only it provides a good scalability for future hardware, it also allows to hide a global memory latency by very fast thread switching [CUD10c]. The threads executed on GPU form an 1D, 2-D, or 3-D structure, called a block (Figure3). The blocks are arranged in 1D or 2-D layout, called a grid. Thus, each thread can be exclusively defined by its coordinates within a block, and by coordinates of its block within a grid. Each block of threads is executed on a sigle SM; therefore, all threads of a single block can share data in a shared memory. Block’s threads are executed in warps of 32 threads each. A memory hierarchy is introduced in a number of levels (Figure4). In the lowest level, there are small, fast, non-shared registers. Then, each SM contains a shared

5 Chapter 2: Programming Model

Grid

Block (0, 0) Block (1, 0) Block (2, 0)

Block (0, 1) Block (1, 1) Block (2, 1)

Block (1, 1)

Thread (0, 0) Thread (1, 0) Thread (2, 0) Thread (3, 0)

Thread (0, 1) Thread (1, 1) Thread (2, 1) Thread (3, 1)

Thread (0, 2) Thread (1, 2) Thread (2, 2) Thread (3, 2)

Figure 2-1. Grid of Thread Blocks Figure 3: Threads hierarchy in CUDA programming model [CUD10c]. The number of threads per block and the number of blocks per grid specified in the <<<…>>> syntax can be of type int or dim3Chapter. Two 2:- dimensionalProgramming Model blocks or grids can be specified as in the example above. Each block withinThr theead grid can be identified by a one-dimensional or two- dimensional index accessible within the kernel throughPer -thethread built local-in blockIdx memory variable. The dimension of the thread block is accessible within the kernel through the built-in blockDim variable. Extending the previous MatAdd() example to handle multiple blocks, the code becomes as follows.Thread Block Per-block shared // Kernel definition memory __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) PDFill PDF Editor with Free{ Writer and Tools int i = blockIdx.x * blockDim.x + threadIdx.x; Grid 0int j = blockIdx.y * blockDim.y + threadIdx .y;

if (i < N && j < N) Block (0 , 0C[i][j]) Block (=1 , 0A[i][j]) Block (2,+ B[i][j];0)

CUDA ProgrammingBlock Guide (0 Version, 1) Block 3.0 (1, 1) Block (2, 1) 9

Grid 1 Global memory Block (0, 0) Block (1, 0)

Block (0, 1) Block (1, 1)

Block (0, 2) Block (1, 2)

Figure 2-2. Memory Hierarchy Figure 4: Memory hierarchy in CUDA programming model [CUD10c]. 2.4 Heterogeneous Programming

As illustrated by Figure 2-3, the CUDA programming model assumes that the CUDA threads execute on a physically separate device that operates as a coprocessor to the host running the C program. This is6 the case, for example, when the kernels execute on a GPU and the rest of the C program executes on a CPU.

CUDA Programming Guide Version 3.0 11 memory, as mentioned above. This memory is also fast and small; it can provide communication between threads in a block. Another important memory space is a constant memory, which is read-only. Since it is cached, it usually serves as a fast storage place for function parameters and constants. A texture memory is read-only and cached, as well, but it also includes several useful in-hardware features, such as a number of addressing modes, interpolation, and locality. In the highest level, there is a large but long-latency global memory space.

2.1.2 ATI Stream

ATI Stream is a parallel GPU architecure, designed for GPGPU computations, for- merly developed by ATI company, which was acquired by AMD in 2006. The series of graphics cards are branded as ATI , ATI FirePro, and AMD FireStream. Also, low-power cards for mobile devices are developed, such as ATI Mobility Radeon. Hardware implementation of units is architecturally different from NVidia’s stream processors, allowing a high number of stream processing units per core, L1 cache, and increased double precision performance. However, the cards offer lower memory bandwidth and processor clocking, so the overall performance is com- parable with the competitor’s. The GPGPU developers are supplied by ATI Stream SDK, allowing to develop applications in the high-level OpenCL language, described bellow [ATI10b]. Figure5 illustrates the relationship of the computing components in the ATI Stream architecture.

2.1.3 OpenCL

OpenCL is a framework for heteregenous platforms, consisting of CPU, GPU and other architectures. It is an open standard developed by Khronos Group [Ope10a]. It in- cludes a C-based cross-platform language, a platform layer API, a runtime API, and scientific libraries. In a context of GPU programming, it also supports an interoper- ability with the OpenGL standard. Analogous to CUDA, basic execution units are called kernels. They can be either data-parallel, allowing a fine-grained paralleliza- tion suitable for GPUs, or task-parallel, providing a coarse-grained parallelization for CPUs. A memory model is defined in levels of private, local, constant and global memory space [How10] (Figure6). OpenCL is supported by the both mainstream GPU vendors, however studies, such as by Kong et al. [KDY+10], revealed that with NVidia graphics cards, CUDA implementations generally achieve better performance than their OpenCL equivalents.

2.1.4 Other architectures

Although graphics cards are probably the most widely used devices nowadays, other hardware architectures can be exploited for massive parallel computing as well. Be- sides large expensive CPU clusters, a field-programmable gate array (FPGA) are pop- ular for their configurability and compact one-chip design, allowing them to be em- ployed in low-power autonomous devices [BR96]. Since FPGA architecure is much less accesible than GPU and requires low-level "close-to-metal" approach, we have decided

7 Figure 5: ATI Stream software enviroment [ATI10a].

Figure 6: Memory hierarchy in OpenCL programming model [ATI10a].

8 not to deal with FPGAs in our thesis topic. However, it can be mentioned that FPGAs can be also used in image processing [LDZZ09][YNAS09]. Also, a research in adapt- ing CUDA programming model into FPGA is conducted [PGS+09], building a bridge between GPGPU development and FPGA design.

2.2 General-Purpose Computation on GPU

GPGPU is the technique of mapping general-purpose algorithms to graphics hardware, exploiting its high computation and data throughput. The performance of graph- ics cards was employed in many applications, including traffic simulation [CBM+08], molecular dynamics [Joh07], seismic modelling [KEGM10], n-body simulation [Nyl07], and image processing [CDMS+08][PSL+10] (Figure7). For further information, the website GPGPU.org [GPG] can be recommended as an up-to-date source for both de- velopers and researchers, including notifications to recent papers. The OpenVIDIA framework [FR] is another project focused on the topic, specializing in OpenGL, Cg and CUDA implementations. One of the first methodical surveys made on GPGPU was made by Owens et al. [OLG+05]. Since published in 2005, it did not cover recent research based on popular high-level programming languages, such as CUDA or OpenCL. Not all the informa- tion, however, became out-of-date, as some paradigms in parallel computing remained unchanged. A streaming model, as an abstraction for GPU programming, was con- sidered in the survey. Several fundamental operations were introduced, such as map, reduce, scatter and gather, stream filtering, sort and search. A map operation applies a given function to every element in a given data stream. Computing a smaller output data stream from a larger input stream is called a re- duction. Scatter and gather are two fundamental memory operations, which represent indirect read and write memory acceses. Stream filtering provides an ability to select a subset of elements from a stream. Sort and search, two well-known operations, need very careful attention to be implemented efficiently on GPU. The survey also discussed several data structures that can be used in GPGPU, un- covering some constraints imposed by GPU for non-indexable structures, such as prior- ity queues, sets, and k-d trees. Many results of GPU acceleration were cited, showing a good effort that had been already made on GPGPU using high-level shading lan- guages. A revised version of the survey was published in 2007 [OLG+07]. Many performance studies comparing GPGPU and traditional CPU were pub- lished. A paper by Gong et al. [GLG05] provided a comparison between CPU (In- tel Pentium 4) and two commercially-succesful GPU platforms (ATI Radeon 9800 and NVidia GeForce 5950), in a number of real-time image processing applications. The au- thors used three different shading languages (HLSL, Cg and GLSL) and showed that both the GPU platforms are competitive, showing significant speedups (up to 36 × in the case of ATI and up to 22 × in the case of NVidia) over CPU. A study by Che et al. [CBM+08] examined several common, computationally de- manding applications, including traffic simulation, thermal simulation, and k-means algorithm. The speedup achieved by GPU implementations using CUDA was between 6 × and 40 ×, compared to CPU implementations, using a common commercially- available hardware.

9 (a) Fast and stable rendering of water surfaces (b) A brain cortex segmentation [CLW04]. [KW03].

(c) Calculating electrostatic potential maps in (d) Feature extraction [PSL+10]. molecular dynamics [Joh07].

(e) A simulation of a galaxy as an N-body system (f) A 3D reconstruction of a mouse placenta [Nyl07]. [RUCH09].

Figure 7: Applications in GPGPU.

10 Ryoo et al. [RRB+08] discussed NVidia GPU’s key features and optimization strate- gies. They studied GPU kernels, implemented in CUDA, for dozen various applica- tions, including H.264 video encoder, Lattice-Boltzman method, and MRI image re- construction. They claimed tremendous speedups comparing NVidia GeForce 8800 GTX to 2.2 GHz Opteron 248 (a single-core CPU), even using SIMD optimizations for CPU implementations. For example, MRI reconstruction algorithms achieved speedup rates over 100 ×, including data transfer overhead. The most of the kernels attained speedups over 10 ×, actually only two of them performed similarly on CPU and GPU, in particular, H.264 video encoder and FDTD electromagnetic wave propagation sim- ulator. The basic optimization principles described in the paper are maximizing the ratio of floating-point operations, hiding global memory latency by high streaming processor’s occupancy, and balancing the usage of global memory, shared memory, and registers, in order to achieve optimal memory bandwidth. Another survey of many different CUDA implementations was published by Gar- land et al. [GLGN+08]. The topics of interest, in this case, were molecular dynam- ics, numerical linear algebra, medical imaging, fluid dynamics, and seismic imaging. The most impressive results were attained in molecular dynamics. In the former ap- plication, parallelization was naturally provided by mapping each atom to a single thread. A CUDA application, called Highly Optimized Object-Oriented Molecular Dy- namics (HOOMD) was introduced and compared to another parallel software, called LAMMPS, showing that the former one executed on a single GeForce 8800 GPU per- formed similarly to the latter one executed on a cluster of 32 CPU cores. Another com- parison was made using a popular application Folding@Home; in this case, GeForce 8800 outperformed a dual-core CPU by rate of 100 ×. However, the results in other applications were not that impressive. For example, in linear algebra, a SGEMM dense matrix-matrix multiplication kernel by Volkov and Demmel was compared to Intel’s Math Kernel Library (MKL). Testing the graphics card mentioned above, and Intel Core2 Quad processor, the GPU implementation performed 3 × faster. The same re- sults were achieved for matrix factorization problem, in particular LU, Cholesky, and QR factorizations. In contrast with many previous studies, a recent paper by Lee et al. from Intel Corporation [LKC+10] shows that with a careful optimizing of CPU implementation, still the CPU can be succesful in many applications. They perform a comparison of 14 various implementations, showing speedups from 0.5 × to 15 × (GPU over CPU) which is significantly less than in many previous studies. Their experiment was made with Intel Core i7 and NVidia GTX280. The paper presents not only results of comparison, but also examination of the results, emphasising pros and cons of both architectures. The GPU benefits from higher global memory bandwidth and floating-point operation performance, hardware scatter-gather support, acceleration of transcendental (e.g. exponential and trigonomet- ric) functions, and fast texture lookup capability. As a result, kernels, such as bilateral filter, GJK collision detector, SAXPY (basic linear algebra task αx + y), and Lattice Boltz- mann method are accelerated by rate of 5 × and higher on GPU. On the other hand, the weaknesses of GPU are non-cached global memory (which does not stand for a new FERMI architecture), reduction overhead, and synchronization overhead. Thus, radix

11 sort and constraint solver perform better on CPU than on GPU; and kernels, such as Monte Carlo or tree index search, are performed on both architectures similarly. It can be pointed out that the results should slightly change using more recent graphics hardware, since NVidia has already released FERMI architecture by that time. Also, more impressive results can be achieved using multiple graphics cards in paral- lel. Nevertheless, it has been shown that the programmers indeed should be careful judging their CPU and GPU implementations; and that many tremendous speedups of 100 × and more, declared in GPGPU, should be treated with caution.

2.3 Image Processing Acceleration

Image processing is a wide field, including many various applications. It involves many specific algorithms. Among others, these are Fourier, Hough, Wavelet, and Radon transform [GW02][SM06], matrix multiplication, linear equation solvers, and principal component analysis [Pra01][Jäh05], partial differential equations (PDE) solvers [Kim05], sorting and searching algorithms [Pra01][GW02], statistical methods [Fra04], morphological operations [Ser83][Mar96], and graph algorithms. Many of these meth- ods are computational intensive and suitable for parallelization [BFRR01]. Castaño-Díez et al. [CDMS+08] presented a detailed performance evaluation of image processing algorithms, with respect to their GPU implementations. They tested a number of particular functions frequently used in image processing, such as a fast Fourier transform, a matrix multiplication, and several linear and non-linear filters. They also implemented more complex image processing algorithms, such as weighted back projection used for 3-D image reconstruction, and principal component analysis used for classification of large data sets. The GPU implementations were written in CUDA and Cg shading language and claimed speedups rates of at least 10 × in the most cases, comparing Intel Dual Core CPU and NVidia Quadro FX5600. Several GPU computing toolboxes for MATLAB were developed, such as Jacket [AE10] and Parallel Colt [WN]. Kong et al. [KDY+10] developped an open-source GPU library for the MATLAB Image Processing Toolbox. Mapping 12 various func- tions from IPT on GPU, using both CUDA and OpenCL parallel models, they studied which algorithm and hardware features influence the overall performance. Based on algorithm characteristics, they divided the functions into four categories: data inde- pendent (e.g. image intensity adjustment), data sharing (edge detection), algorithm dependent (Euclidean distance transform), and data dependent (dither). The functions of the first category map naturally on a GPU and achieve high kernel performance. They are usually limited by memory bandwidth and data transfer overhead. The data sharing algorithms are excellent candidates for massive parallelization. For each im- age element, the information about its neighbours is needed, therefore a careful opti- mization is required in order to hide global memory latency, by means of eliminating data reuse. With such optimization strategies, tremendous speedups can be achieved. The algorithm-dependent functions often require to be fundamentally revised, with respect to architecture differences. Still, they are good aspirants for GPU acceleration. The dither function, falling into the last category, performed poorly on GPU, due to large fraction of sequential processing, large synchronization overhead, and low ratio

12 of computations to memory accesses. The imaging process can be divided into several phases. First, an image is acquired by a device, digitized, and reconstructed. This is called image formation. Then, a pre- processing is made, applying various image enhancement techniques. This includes tone mapping, image restoration, and geometrical methods. Finally, image analysis is performed, extracting desired information from the image, by means of morphological processing, edge detection, image segmentation, feature extraction, shape analysis, and image registration [Pra01]. In the following text, we will discuss particular image processing methods and their GPU implementations.

2.4 Image Formation

Basically, most of imaging techniques use projection from 3-D space (real world) into 2-D space (imaging device). In many applications, these two-dimensional images are used for further processing, without special requirements for image formation. In other applications, such as magnetic resonance imaging (MRI), computed tomography (CT), and 4Pi confocal microscopy [CC74][HS92], a number of acquired images is used for reconstructing a complex 3-D image. This leads to advanced approaches to image formation. In [XM05], Xu and Mueller implemented three 3D reconstruction algorithms from CT, namely Feldkamp filtered backprojection, simultaneous algebraic reconstruction (SART), and expectation maximization (EM), on graphics hardware. From 80 projec- tions, they obtained reconstructions of 1283 pixels. The quality of GPU results was comparable to those obtained from CPU, achieving approximately 10 × speedup. In [XM09], Xu and Mueller introduced GPU implementation of the OS-SIRT iterative re- construction algorithm (Figure8). Furthermore, they implemented bilateral filters, for efficient regularization of reconstructed results. For the latter operation, they achieved tremendous speedups of more than 100 ×, using OpenGL Shading Language with NVidia 8800 GT graphics card. They approved the OS-SIRT method to outperform SART for data with high level of noise. Castaño-Díez et al. [CDMS+08] presented three different implementations of SART, two of them using CUDA, and one with a texture-based approach using Cg. All the implementations achieved approximately 50 × speedup, compared to their CPU equiv- alents. The Cg implementation performed the best, however, writing and debugging it required considerably higher effort than in CUDA, as noted by the authors. The GPU-accelerated weighted back projection, used for 3D reconstruction from 2-D elec- tron micrographs, was tested, as well. This included low-pass filtering of the acquired images, rotation, Fourier space weighting, and backprojection into a 3D image. They attained speedups up to 10 × on graphics hardware. Stone et al. [SHT+08] accelerated advanced MRI reconstruction algorithm, with respect to optimal reconstruction from non-Cartesian scan trajectories. They imple- mented anatomically constrained reconstruction algorithm by Haldar et al. [Hal08]. The key part of the algorithm was the conjugate gradient linear solver, designed to perform an iterative matrix inversion. As the number of elements in matrices was equal to the square of voxels in the reconstructed image, the matrices contained tril-

13 Figure 8: The iterative tomographic reconstruction of medical images with bilateral filtering [XM09]. lions of complex-valued elements for images of 1283 pixels. Fortunately, the matrices had either a sparse, or a convolutional structure, allowing programmers to handle with them in a compressed format, or in a Fourier domain, respectively. The authors tuned their algorithm, comparing seven versions with various optimizations. Their GPU best implementation was able to compute the MRI reconstruction within less than two minutes, while the CPU implementation completed in 23 minutes. Using four NVidia Quadro FX 5600 graphics cards, a multi-GPU implementation was tested as well, reducing the overall time to less than a minute. In ultrasound imaging, inverse-scattering algorithms are used for generating 3D volumetric images of a breast. These methods are time consuming, however, specialists would like to discuss results with their patients within the same visit. In Garland et al. [GLGN+08], GPU implementations of particular procedures used in these algorithms were discussed, especially 2D convolution by fast Fourier transform and CAXPBY rou- tine, which computes several basic operations from linear algebra. In the algorithm’s run, over 50 million 2D FFTs needed to be computed, accounting for over 75 percent of the computation time. Implementing a batched 2D FFT kernel, Garland et al. made the GPU convolution approximately 16 times faster than the CPU implementation. An iterative tomographic reconstruction was studied by Jang et al. [JKDP09]. Both forward and backward projection steps were implemented using CUDA on NVidia Tesla S870. This graphics hardware consists of four independent Tesla C870 GPUs, therefore, a multi-GPU approach was examined. For the forward projection, they

14 Figure 9: Examples of HDR images processed by tone mapping [GWWH05]. achieved 35 × speedup with a single GPU and 70 × speedup with all four GPU cores, compared to Intel Core 2 Duo CPU. For the back projection, GPU performed even better. The multi-GPU performance was significantly degraded due to CPU thread synchronization overhead, as each GPU core had to be driven by an individual CPU thread. However, this could be caused by the operating system’s thread scheduling policy, as stated by the authors. Using the single floating-point precision, the results showed negligible difference between the CPU and GPU reconstructions.

2.5 Image Enhancement

2.5.1 Tone Mapping

Tone mapping is the process of mapping pixel intensity values with high dynamic range (HDR) to a smaller dynamic range for the purposes of visualization. As recent display technologies still provide limiting dynamic range, tone mapping is an impor- tant part of image visualization. It also includes brightness and contrast enhancement. In this context, simple histogram-based adjustments can be also considered a part of tone mapping. As these methods involve a large number of pointwise operations, they can be parallelized on GPU very naturally; however, the speedup may be limited by global knowledge required (see Introduction) in a particular algorithm [BFRR01] [OLG+05]. Drago et al. [DMAC03] designed a fast HDR tone-mapping algorithm and achieved a modest speedup, comparing C implementation on Pentium 4, and OpenGL imple- mentation on ATI Fire GL X1. They tested their algorithm on common real images, such as a photograph of the Standford memorial church. Another method, tested on digital photographs, was published by Goodnight et al. [GWWH05] (Figure9). They implemented well-known algorithm of Reinhard et al. [RSSF02] and declared approximately 10 × speedup using an OpenGL implementation, using AMD Athlon 1800+ CPU with ATI Radeon 9800 Pro graphics card. Raspe and Müller presented

15 an interactive tone-mapping framework for medical volume data [RM07]. They built the cross-platform programming enviroment called "Cascada" using OpenGL Shading Language (GLSL). Their results revealed tremendous speeup factors up to 250 ×, tested on Intel Core 2 Duo and NVidia GeForce 8800 GTS. However, the performance was sig- nificantly degraded by data transfer overhead. Therefore, the algorithms worked best when avoiding data transfers. For example, user interface for tuning parameters of tone mapping maintained real-time performance.

2.5.2 Image Restoration

Image restoration aims to remove undesired artifacts in images, including blurring introduced by an optical system, a motion blur, and a noise from electronic sources. Thus, it is often an integral part of image formation. Removing blur leads to decon- volution techniques; eliminating noise usually involves linear and non-linear filters. However, for real images, use of advanced approaches is inevitable, taking both degra- dation features into account. Gong et al. [GLG05] implemented a number of simple image filters, including mean, Gaussian and median filter, using the HLSL shading language. They compared two graphics cards (ATI Radeon 9800 XT and NVidia GeForce FX 5950 Ultra) with the Intel Pentium 4 processor. For linear filters, they achieved approximately 4 × speedup with the ATI GPU and almost no speedup with the NVidia GPU. For nonlinear filters, such as the median, the ATI card attained the same processing times as the CPU, and the NVidia card was 6 × slower. It was shown that nonlinear filters indeed perform worse than linear filters, as they involve many comparing operations, which are not very suitable for paralelization of graphics hardware. In this work, only very small 3 × 3 filter kernels were discussed. Much larger kernels would probably give more precise results, eliminating the effect of overheads. In Castaño-Díez et al. [CDMS+08], both linear and median filters were accelerated, showing good performance, especially for large 3D images. For small images, such as 128 × 128 pixels, only a modest speedup was attained. This could be expected, due to some overheads, however, the authors did not comment this issue in detail. Kong et al. [KDY+10] implemented MATLAB function conv2, which provide a convolution of 2D images. With image sizes of over million pixels, the achieved kernel speedups were in order of hundreds, however, the overall speedups were approximately 10 ×, taking into account data transfers. A number of linear filter implementations can be found in CUDA SDK [CUD10d]. Among advanced restoration methods, Richardson-Lucy deconvolution algorithm [RIC72][Luc74] is one of the best known and widely used. Domanski et al. [DVW09] compared a CPU implementation, using the FFTW library [FJ09], and a GPU im- plementation, using the CUDA framework and the CUFFT library [CUD10b]. Two approaches were discussed, deconvolution in the spatial domain and in the Fourier domain. Even for relatively small sizes of a filter kernel, the FFT-based approach out- performed the spatial-domain approach. Comparing the CPU and the GPU platforms, i.e. the NVidia GTX 260 graphics card and the Intel Xeon Quad-Core processor, the GPU implementation was 8 × faster for 2D images and 4 × faster for 3D images. Quammen et al. [QFTI09] presented a performance study of three 3D deconvolution

16 (a) Original image (b) Blurred and noisy image (c) Reconstructed image

Figure 10: Reconstruction of an astronomical image [SZZ10].

algorithms, namely Wiener [Wie64], Jansson-van Cittert [SN06], and Richardson-Lucy, on both multi-core CPUs and GPUs. As the FFT-based approach was preferred, both the FFTW and the CUFFT libraries were employed. The scalability of algorithms on CPU was studied, measuring performance on various numbers of CPU cores, ranging from 1 to 16. The results revealed that the CPU platform, consisting of 8 dual-core AMD Opteron nodes connected by a Hyper-Transport network, performed best in the case of 8 CPU cores. For a larger number of cores, a sharp drop-off in scalability was revealed, probably caused by some implementation features in FFTW library. Then, the 8-core CPU and a single-core GPU, NVidia Quadro FX 5600 in particular, were compared. The CPU performed 2 × faster for the Wiener filter, and approximately 1.5 × faster for the other filters. As the time complexity of all the algorithms is strongly depended on FFT performance, other FFT libraries for GPU should be considered, such as [NOEM08] and [GLD+08]. For further information, refer to section 2.6.6. Expectation Maximization (EM) method, as described in [SV82], Scaled gradient projection (SGP) [BZZ09] and Total Variation (TV) regularization [ROF92] are examples of optimization techniques, used for edge-preserving removal of noise. Serafini et al. [SZZ10] proposed a GPU version of both SGP and EM using the CUDA framework, and the CUFFT [CUD10b] and CUBLAS [CUD10a] libraries (Figure 10). They used double precision in the CPU implementation, tested on AMD Athlon X2 Dual-Core, and single precision in the GPU implementation, ran on NVidia GTX 280. According to their results, SGP largely outperformed EM in means of the time consumption, obtaining the same image quality. The speedup achieved by GPU was approximately 20–30 × for both methods. Moazeni et al. [MBS09] described a GPU implementation of TV regularization for denoising of diffusion tensor images (DTI), used in MRI. Their algorithm required less than 2 minutes to process a 3-D image of 1283 voxels on NVidia Quadro FX 5600. The same algorithm needed more than 3 hours to be completed on a dual-core CPU.

2.5.3 Geometrical Methods

Geometrical modification of images is very common image processing operation, in- cluding basic transformations, such as translation, rotation, and scaling; and advanced

17 procedures, such as nonlinear warping and perspective transformation. These algo- rithms are very naturally accelerated on graphics hardware, since they actually repre- sent one of the most typical tasks for GPUs. Geometrical methods are often an integral part of image formation and image registration, such as in [SKSF07] and [MOOXS08]. For further information on the topics, refer to sections 2.4 and 2.6.6. In one of early works, Ren et al. [RPZ02] described hardware acceleration of the Elliptical weighted average (EWA) surface splatting method, used for high-quality ren- dering of point-sampled 3-D objects. In the paper, an object space formulation was pre- sented, and a GPU implementation was developed, using DirectX-supported graphics hardware. Kelly and Kokaram [KK04] proposed an algorithm for motion estimation in video sequences. The key part of the algorithm was fast image interpolation, us- ing textures on graphics hardware. Fan et al. [FEK+05] described an interpolation method called natural neighbor interpolation, based on Voronoi tessellation. Strengert et al. [SKE06] applied bilinear texture interpolation, implemented on modern graphics hardware, to accelerate pyramid methods, which are popular especially in fast im- age processing. Kraus et al. [KES07] proposed the edge-directed image interpolation method for image up-sampling, which avoided ringing artifacts, and preserved smooth and sharp edges. The graphics hardware was exploited for the computation, using OpenGL. Zhu et al. [ZLB+09] proposed Spherical Depth Image, a spherical represen- tation of image-based models, adopted to image warping. A real-time performance was achieved, implementing the algorithm on a graphics hardware. Terrain and surface rendering is a typical application for geometrical methods. Botsch and Kobbelt [BK03] proposed a fast framework for point-based geometry. The algorithm was implemented on the NVidia GeForceFX 5800 Ultra graphics card, using OpenGL. The splat filtering was used for smoothing textures, achieving more eye- appealing results. Dachsbacher and Stamminger [DS04] presented a view dependent level-of-detail approach for interactive terrain rendering in real-time. During render- ing, procedural geometric and texture detail adding was accelerated by graphics hard- ware. Schneider et al. [SBW06] described a method for creation and rendering of realistic landscapes. The real-time fractal terrain synthesizer, combined with WYSI- WYG user interface, was built and implemented on NVidia GeForce 7800 GTX using OpenGL. Levin et al. [LDS05] implemented the thin plate spline (TPS) warping algorithm [Duc76] using GLSL. Warping was applied on CT-based volumes of 512 × 512 × 173 voxels, using 92 user-defined displacement vectors. Xiao et al. [XCZ09] proposed a bilinear warping method for 2D CT reconstruction, accelerated on graphics hardware. Using OpenGL and a HP W4000 PC workstation having 3DLabs Wildcat 6110 graphic card, the GPU implementation achieved 20 × speedup for forward-projection stage and 15 × speedup for whole reconstruction.

2.6 Image Analysis

2.6.1 Morphological Processing

Mathematical morphology is based on analysis and processing spatial structures in images. The basic concepts of this theory trace back to set theory, integral geometry

18 and lattice algebra. Morphological methods can be used both for image pre-processing and for image analysis. For example, opening and closing can be adopted for smooth- ing contours of objects or eliminating small holes in objects. In the other application, image skeletonizing can be used to describe structure of objects. Watershed transform is one of the well known approaches to image segmentation. For information about GPU-based watershed implementations, refer to section 2.6.3. Dilation and erosion are two fundamental morphological operations. Due to their nature, they can be very simply implemented on graphics hardware, as presented in [HB08] or [KDY+10]. Another GPU implementation of dilation, erosion, and closing can be found in [ESA05]. In this paper, application of morphological methods to pre- processing in medical imaging was studied. Sugano and Miyamoto [SM08] proposed parallel implementation of several morphological operations, however, they did not use graphics hardware, but the high-performance Cell Broadband Engine processor. Dortmont et al. [vDvdWT06] implemented a method for computing distance trans- forms and skeletons of 3D objects on graphics hardware. Rong and Tan [RT06] pro- posed a fast GPU-based jump flooding algorithm to compute an approximations of the Voronoi diagram and the distance transform. Cao et al. [CTMT10] described a CUDA implementation of a parallel banding algorithm for computing the exact Euclidean dis- tance transform (EDT), which can be further used for performing such morphological operations as dilation, erosion, opening, and closing.

2.6.2 Edge Detection

Edges, i.e. changes or discontinuities in image luminance or color values, represent important regions in image. According to their character, edges are often detected by derivative-based operators, for example, first-order derivative methods (based on gra- dient or directional derivatives) and second-order derivative methods (e.g. Laplacian operator). Edge detectors can be often implemented by means of convolution, either by a basic convolution, or by a separable convolution, as many edge-detecting filters have a separable kernel. In the case of large unseparable kernels, a convolution in Fourier domain can be performed, as described in [FC06ˇ ]. CUDA implementation of convolution methods was described by Podlozhnyuk [Pod07]. Other approaches to edge detection are based on mathematical morphology [YQ08] or on active contour models [CV01]. An example GPU implementation of the latter can be found in [SA09]. Canny is one of widely used edge detectors, based on first-order derivatives. A sim- ple CUDA implementation of Canny operator is described in [Fun05]. OpenGL and Cg versions were presented by Ruiz et al. [RUG07]. A more detailed study on Canny operator was presented by Luo and Duraiswami [LD08]. The algorithm was imple- mented in CUDA and subsequently integrated in MATLAB, achieving approximately 3 × speedup over an optimized CPU implementation in the OpenCV library, tested on Intel Core 2 Duo 6600 and NVidia GeForce 8800 GTX. Kong et al. [KDY+10] presented GPU-accelerated procedure for MATLAB, performing the Prewitt edge detector. They declared more than 1000 × kernel speedup over the edge function implemented in the MATLAB IPT, without taking data transfers into account. Edge-detecting algorithms can be also included in image segmentation or feature extraction methods. For further information, please refer to sections 2.6.3 and 2.6.4.

19 Figure 11: 3D segmentation of fishes in acoustic underwater images [SA09].

2.6.3 Image Segmentation

Image segmentation aims to recognize objects in images. There are many approaches to image segmentation, namely pixel-based (e.g. thresholding), edge-based (derivative- based approaches), region-based (split-and-merge and region-growing), and model- based (Hough transform). Segmentation by global optimization leads to more complex regularization techniques [Jäh05]. Mathematical morphology introduces a specific cat- egory of segmentation algorithms, for example, a watershed transform [BM91]. Many examples of early work on GPU-based segmentation methods can be found in comprehensive studies by Hadwiger et al. [HLSB04] and Owens et al. [OLG+05]. We will briefly mention some of them. Diewald et al. [DPRS01] presented an early GPU implementation of anisotropic diffusion models. Sherbondy et al. [SHN03] de- scribed GPU implementation of isotropic diffusion based on the Perona-Malik model, with an application to 3-D volume medical data. Lefohn et al. [LCW03] used a stream- ing approach on a GPU to accelerate a level set method. They proposed an efficient memory management scheme using 2-D textures; and achieved 15 × speedup. Further- more, in [CLW04], they presented a new tool, called GIST, for interactive GPU-based 3-D segmentation. Sharma and Anton [SA09] proposed a level set method for 3D reconstruction of acoustic underwater images (Figure 11). Their approach was based on level set evolu- tion using Mumford-Shah segmentation functional, with a special view to noise sup- presion. They implemented a number of procedures in CUDA, including signed dis- tance transform, weighted median filter, and PDE solver. As the acoustic images from marine surveys can often result in gigabytes of data, the algorithm was optimized in means of memory usage. Due to large resolution, the images were divided into small volumes of size 1283 voxels, stored into 3D textures, and processed independently. They were able to complete the computation in one minute, using the mobile GPU, namely NVidia GeForce 8600M GT. An interesting application of segmentation can be found in Zitnick et al. [ZKU+04]. In this paper, a method for interactive control of viewpoint in video sequences was pro- posed, combining synchronized streams from multiple cameras. As a result, user could replay video from a new viewpoint, continously change the point of view in dramas, and create "freeze frame" visual effects. Their segmentation method was based on color matching. For more convenient results, a variant of Perona and Malik anisotropic diffu-

20 sion was used to reduce noise in images. Their GPU implementation showed real-time performance for data stored in the graphics hardware’s memory. The watershed transform views a greyscale image as a relief flooded by water. Once the particular flooded regions, called catchment basins, merge, the lines of merging are marked as watershed lines, i.e. the segmentation lines. The major issue with the stan- dard watershed is over-segmentation. Therefore, several advanced approaches were designed to address the problem. As the watershed is a global operation, it is fairly challenging to be implemented on graphics hardware. Several implementations can be found, such as [PGX08] and [KP08]. In the former work, the multi-degree immers- ing watershed algorithm was written in CUDA. Comparing NVidia GeForce 8500 GT graphics card with Pentium 4, more than a double speedup was achieved. In the latter work, a cellular automaton was designed to implement a number of graph algorithms to perform the watershed transform. Compared with the toboggan-based watershed implemented on CPU, the GPU implementation attained more than a double speedup, as well, using Intel Xeon and ATI Radeon X1950 Pro. Graph algorithms can be used not only in the watershed transform, but also in graph-cuts approaches. These methods are succesful competitors to level sets. Graph cuts can be used e.g. for Mumford-Shah functional computation [GA09]. They can perform fast, however, they require a lot of memory. Therefore, they are difficult to be applied on large images, especially when implemented on GPU. CUDA implementa- tion, achieving a 10 × speedup over a CPU implementation, can be found in [VN08]. Harish et al. [HN07] implemented several fundamental graph algorithms in CUDA, including breadth first search, single source shortest path, and all-pairs shortest path. They achieved a significant speedup (up to 50 ×) for artificial graphs, however, for real-world graphs with small per-vertex degree, the performance of GPU was compa- rable to CPU. Barnat et al. [BBBC10ˇ ] proposed a CUDA implementation for computing strongly connected components of a directed graph.

2.6.4 Feature Extraction

An image feature is a representative characteristic or attribute of an image. The goal of feature extraction is to transform input data, that can be very large, into a reduced set of information, which is used for consequential analysis and classification. Ampli- tude features are used for extracting information, contained in image amplitude, i.e. luminance, spectral value, or other units. Measures describing an image histogram, such as mean, standard deviation, skewness, or energy, often aid in amplitude fea- ture extraction. Image transforms, for example Fourier, Haar, and Hough transforms, can be employed to detect shape-based features. Texture-based approaches include edge detectors, autocorrelation methods, Gabor filters, dependency matrix methods, singular-value decomposition (SVD), and many others [Pra01]. For the classification of image data, principal component analysis (PCA) and related techniques are used. One of the methods, used not only for feature extraction, but also in other appli- cations, is the wavelet transform. Hopf and Ertl [HE00] proposed an early OpenGL implementation of the transform. Measured on a Silicon Graphics Octane with R10000 195MHz processor and a MXE graphics pipe, they achieved a significant speedup for

21 Figure 12: Feature extraction in a sunflower image [CVG08]. large images. Garcia and Shen [GS05] described a Cg implementation of 3D wavelet reconstruction, and evaluated the performance on the Intel Xeon processor and the NVidia Quadro FX 3400 graphics card. For a visible woman volume dataset, the GPU achieved 6 × speedup for Haar wavelets and 7 × speedup for Daubechies. Another OpenGL implementation of the wavelet transform, applied to wavelet-based geomet- ric design, stylized image processing, texture-illuminance decoupling, and JPEG2000 image encoding, was proposed by Wong et al. [WLHW07]. Fung et al. [FM05] implemented a number of feature detection methods in their OpenVIDIA library. The algorithm was designed to compute unique appearance-based feature vectors, and to track features in video sequences. A detailed description of fea- ture vectors computing can be also found in [Fun05]. Ruiz et al. [RUG07] accelerated the Hough transform for edge and circle feature detection, using OpenGL and Cg shading languages. Castaño-Díez et al. [CDMS+08] accelerated the PCA method on graphics hardware, using CUDA. As the algorithm includes a number of linear algebra operations, such as a covariance matrix computation, the CUBLAS library was used to implement such procedures. They achieved up to 10 × speedup for classification of data from electron micrographs. Cornelis and Van Gool [CVG08] presented a fast CUDA implementation of feature- detecting method, called Speeded Up Robust Features (SURF) [BTVG06] (Figure 12). With respect to optimization of data layout in GPU texture memory, a scale-space pyramid creation was proposed. Feature descriptor matching was performed using different criteria, such as the sum of squared differences (SSD) or the dot product

22 (DOT). In the tests, performed on images of 640 × 480 pixels, the framerate of over 100 fps was attained on NVidia GeForce 8800 GTX and 35 fps on mobile graphics hardware, namely NVidia Quadro FX 1600M. Gipp et al. [GMH+09] proposed a GPU-based algorithm for computing the Har- alick’s texture features for biological applications. An optimized approach to a co- occurence matrix generation was introduced, in order to save a memory space. The GPU implementation was written in CUDA and ran on the NVidia GeForce 280 GTX graphics card, to perform 32 × faster than an optimized software version, executed on an Intel Core 2 Quad machine. Ujaldon [Uja09] used CUDA to implement computa- tion of Zernike moments on NVidia GeForce 8800 GTX, claiming 5 × speedup over the best existing CPU implementation.

2.6.5 Shape Analysis

Shape-analysing techniques are useful tools for classification of objects within an im- age, including both qualitative and quantitative approaches. Many shape properties can be described, such as topological attributes, size measurements and other geomet- ric attributes, shape orientation descriptors, spatial moments, and Fourier descriptors [Pra01]. In shape analysis, distance transforms, e.g. the Euclidean distance transform (EDT), are often employed for measurements of geometric characteristics. A number of GPU implementation of EDT can be found in literature, for example in [RT06], [vDvdWT06], and [CTMT10]. Coeurjolly [Coe10] described several methods of volumetric analysis, based on distance transforms. In particular, separable approaches were discussed, in order to design high-performance parallel algorithms. Peterson et al. [PDHW10] proposed a particle system for triangular meshes, which can be used in statistical shape analysis applications. Considering a number of optimal- izations, both naive and tuned CUDA implementations were described. The optimized GPU version of the algorithm attained up to 300 × speedup over the CPU version, when evaluated on NVidia GeForce GTX 260 and Intel Core 2 Duo. The method was tested on both synthetic and real data, demonstrating similar results.

2.6.6 Image Registration

Geometrical alignment of images is called image registration. It can be used either for a fusion, in the case the images contain the same scene but different information, or for stitching images together, in order to get one large mosaic picture. Basically, there are two groups of image registration methods, area-based and feature-based. The ex- amples of the former category are correlation, mutual information, and optimization methods, in the latter category there are methods using invariant descriptors, relax- ation methods, and wavelets [Zit03]. One of the area-based registration methods is based on the gradient flow, where a constant brightness in the image is assumed. Thus, the motion corresponds to bright- ness changes and motion vectors can be calculated from partial space and time deriva- tives of image brightness. Strzodka et al. [SDR03][SDR04] implemented the multi- scale gradient flow method using DirectX9 graphics hardware, in particular, NVidia

23 Figure 13: Stitched panoramic view of the Durban Coastline [DDdV10].

GeForceFX 5800. They were able to register fully distorted images, with resolution of 513 × 513 pixels, within 10 seconds. In [SG04], Strzodka et al. presented a system for real-time motion evaluation and visualization on 640 × 480 25 Hz image sequences. Duvenhage et al. [DDdV10] proposed a GPU implementation of the Lucas-Kanade registration algorithm, using GLSL (Figure 13). Their algorithm was applied to provide 3-D stabilisation of the ocean-based wide area surveillance system. In their multi- resolution approach, the image pyramid is created and the registration is performed in the lowest resolution first, and subsequentially refined in higher resolutions. Dealing with images of 4096 × 1000 pixels, the implementation used regions of interest (ROI) of 256 × 256 pixels or 512 × 512 pixels for image registration. Using NVidia GeForce GTX285, a real-time performance (30 frames per second) was achieved. Classical representatives of the area-based methods are correlation methods, based on Fourier shift theorem. Thus, they can be efficiently computed in a Fourier domain, using the fast Fourier transform; on the other hand, they can be used only in the case of simple transformations, such as shift, rotation, and scaling. Acceleration on GPU is therefore limited by an efficient implementation of FFT algorithm. The CUFFT library by NVidia provides framework for computing FFT on GPU [CUD10b]. However, the speedup is bounded by memory bandwidth and data transfer overhead. Also, since the FFT is sensitive to accuracy, and the graphics hardware performs best in single precision, programmers have to pay attention to error rates. Nukada et al. [NOEM08] proposed a custom kernel for 3-D FFT, written in CUDA, claiming 3 × faster perfor- mance compared to CUFFT. However, they assumed image sizes to be power of two, therefore, the performance for general image sizes is unknown. A more general, well- described implementation was presented by Govindaraju et al. in [GLD+08], achieving 2–4 × speedup over CUFFT library.

2.7 Decomposition of Large Images in Image Processing

In certain applications, namely MRI, CT, and confocal microscopy, huge 3-D images need to be acquired and reconstructed. In particular, this can mean up to gigabytes of data per a single image. Even with 2-D images, excessively large resolution can be achieved, such as in Earth observation (satellite or panoramic views) or in pathology (images of tissues). For such data, image processing is difficult and requires advanced approaches, with respect to optimizations of memory usage. This is particularly true for GPU implementations, since graphics hardware have relatively small memory, as discussed in Introduction.

24 Sometimes, the solution is to simply buy a card with larger memory. For example, in [CTMT10], the algorithm is tested on NVidia GeForce GTX 280 with 1 GB of video memory for 2D images, and on NVidia Tesla C1060 with 4 GB of memory for larger 3D images. It can be also mentioned that the latter card achieves a slightly slower performance, due to a lower memory bandwidth. However, this is not a versatile solution, indeed. While the size of memory space on graphics hardware is constantly increasing, so does the resolution of acquiring devices, introducing a "race" between image and memory sizes. Furthermore, high-end graphics hardware is not always affordable for every user. Thus, it is advisable to search for optimizations of image processing methods, in order to be able to decompose images into smaller parts. A convolution and deconvolution are typical examples of operations, where a data decomposition can be performed in a straitghforward, but sub-optimal manner. Due to convolution properties, the (de)convolved image can be divided into arbitrarily small parts, but all the sub-images have to be extended with neighbouring pixels of at least size of the filter kernel (PSF), as described in [TH78]. Thus, a lot of redundant com- putation needs to be performed, proportionaly to the PSF size. This approach was succesfully used in [BRHM96], to compute spatially-variant deconvolution of Hubble Space Telescope images. In MRI and tomographic recontructions, a division of dataset can be used to per- form computation on multiple processing units, thereby accelerating time-consuming algorithms. In [SHT+08], a multi-GPU implementation of MRI recontruction was tested on four NVidia Quadro graphics cards. The key part of the algorithm was time- consuming computation of excessively large vector. Therefore, a simple data splitting could be used. Slightly sub-linear speedup was achieved, due to an overhead required to arrange the data. Jang et al. [JKDP09] proposed an algorithm for iterative tomo- graphic reconstruction on multiple GPUs, namely on NVidia Tesla S870, which con- sisted of four GPUs. A detailed description of multi-GPU utilization, however, was not introduced in the paper. Another example of decomposition can be found in [SA09]. In this paper, Sharma and Anton proposed a level set method for 3D reconstruction of large acoustic un- derwater images. With multi-beam echo sounders, a series of 2-D images is acquired, forming a 3-D volume of nearly 100 million voxels. These data use about 470 MB of memory along with the same amount of memory for storing the signed distance field. Therefore, the volume was divided into small sub-volumes and kept in 3D textures of 1283 voxels. Then, the chunks of data were processed independently, as described in the following paragraph. First, the signed distance transform was computed. However, it could not be im- plemented in a straightforward manner, since it is a global operation. Instead, a lo- cal approximation of the Euclidean distance was performed, using the Chamfer dis- tance. During every single pass, the information about the distance is propagated using 3 × 3× 3 neighbourhood voxels. Hence, every sub-volume need to be supported with a one voxel cover from other adjointing sub-volumes, thereby reducing the computa- tional domain to a volume of 1263 pixels. Second, average intensities were computed, by means of reduction. This does not require an advanced approached to handle the sub-volumes. Third, the weighted median was calculated, using iterative method. Fi-

25 nally, the PDE solver was executed, using information from a 5 × 5× 5 neighbourhood, thus reducing the computational domain further down to 1243 pixels. Takizawa and Kobayashi [TK06] proposed a k-means algorithm on a cluster of mul- tiple computers, supported by graphics hardware. Their method was based on three- level hierarchical approach to parallel processing. On the top, the divide-and-conquer approach was employed in two levels. In this manner, large-scale data were divided into partly-parallel subtasks, each performed by an individual PC. After obtaining the data, the k-means algorithm was executed in two phases. One part, the nearest neigh- bour seach, was implemented on GPU; the other one was performed on CPU, since it involved random access memory writes and conditional branches, that are unappro- priate for GPU acceleration. The algorithm was evaluated on various configurations. The GPU-assisted implementation on a single PC achieved 4 × speedup over the CPU- only version. Further performance gain, attained by 4 computers, was approximately 3 ×, showing relatively good scalability on PC clusters. Hartley et al. [HCR+08] designed a cooperative enviroment for the analysis of pathology images.The analysis included extraction of four second order statistical fea- tures, namely contrast, correlation, homogeneity, and energy. For the purpose of tex- ture analysis, the local binary pattern (LBP) feature was computed. Since one slice of a tissue can be up to 30 gigabytes of uncompressed data, the computation would take hours on a single CPU. Therefore, the images were decomposed into tiles of 1000 × 1000 pixels and processed independently and in parallel on a cluster of 16 nodes. Each node was equipped with two dual-core AMD Opteron 2218 CPUs and two NVidia Quadro FX 5600 GPUs. Additional six computers were used as reader nodes in order to provide enough disk bandwidth. The results revealed TIFF decom- pression stage to be a significant bottleneck when two graphics cards per node were used. Three different image sizes were tested, "small", middle" and "large", with res- olution of 2.1, 4.7, and 8.5 gigapixels, respectively. The total computation times for the particular image sizes were under 4, 7, and 12 seconds, respectively. Hereat, the analysis on a single CPU would take almost 3 hours for the large image size. For the parallelization in the work above, the DataCutter middleware framework by Beynon et al. [Dat01] was used. This enviroment was designed to support processing of large scientific datasets stored in archival storage systems across a network. After having previously developped Active Data Repository (ADR), Beynon et al. proposed DataCutter to extend and apply the features of ADR, in particular, to provide data pro- cessing on machines separated from storage servers. In the DataCutter architecture, clients are supported by two important services, namely indexing and filtering. The former was proposed to retrieve spatial subsets from large datasets. Since the dataset sizes could be up to petabytes, a multi-level hierarchical indexing scheme was used, based on R-trees structure. The latter was employed to perform non-spatial subsetting and data aggregation. For experimental testing of the DataCutter framework, the Vir- tual Microscope application was developed, to emulate a high power light microscope. For detail information refer to [BKC+01].

26 3 Dissertation Thesis Intent

The main intent of my dissertation thesis will be mapping of image processing algo- rithms to GPU. I will focus on processing of large 3D images, which we deal with in the CBIA laboratory. As the output of the thesis, the methods will be implemented and included in the i3dalgo library, which is developed by our laboratory. Many parallel programming models, as known to date, require considerably high effort for the GPU implementation [CDMS+08]. Considering the trade-off between computational speed and time consumption for software developping, I decided to choose CUDA, for it provides good framework for programmers and, generally, significant speedups of im- plementations. If there will be time enough, OpenCL implementations will be studied as well, in order to provide a comparison between the two most popular frameworks. Since the algorithms will be applied in the light microscopy, the following image pro- cessing methods, essential for this field, will be studied.

Image Restoration. Convolution and deconvolution are two related methods used for image restoration. The former is employed in CytoPacq, the online framework for simulating fluorescence microscopy images [SKS09], to imitate a blurring introduced by an optical system of a microscope. In contrast, the latter is used to reconstruct the original image from the blurred one.

Image Segmentation. Image segmentation is the fundamental part of image pro- cessing in biomedical application. Therefore, several segmentation algorithms will be considered, based on the methods from mathematical morphology (described be- low). These methods can be significantly time-consuming, the global knowledge is needed during computation in some cases, therefore, they are challenging for GPU implementation.

Morphological Processing. The implementations of basic algorithms, such as dila- tion, erosion, opening, and closing, have been already well-studied and described in lit- erature. Thus, more advanced approaches will be considered, for example, watershed transform, morphological reconstructions and reconstruction-based operators. These algorithms are often used for image segmentation, either for pre-processing, or for segmentation itself. The well-known bottleneck of the watershed is over-segmentation, thus, the pre-processing is an essential part of the segmentation process.

Schedule

I plan to finish my Ph.D. studies by dissertation defence in the term autumn 2012 (after 8 terms of doctoral studies). The expected time schedule of my work comes as follows:

• Spring 2011 — an Erasmus internship in foreign research centre. The methods from mathematical morphology will be studied. The particular algorithms will be chosen for my research topic, to be applied for segmentation of biomedical data. Subsequently, the research on parallelization of the algorithms will be conducted.

27 • Autumn 2011 — I will proceed with previous work on morphological process- ing and image segmentation. Also, deconvolution methods will be examined. Since large filter kernels (PSFs) are typical for the light microscopy, deconvolu- tion based on the fast Fourier transform will be studied, in particular.

• Spring 2012 — The deconvolution algorithms will be implemented on GPU and applied on biomedical data. Both real data from optical microscopy and artificial data from the Cytopacq simulator [SKS09] will be taken into account.

• Autumn 2012 — Dissertation write-up and defense.

During the Ph.D. studies, all results of the previous work should be published on an international conference or in a highly rated journal.

28 References

[AE10] AccelerEyes - MATLAB GPU Computing. http://www.accelereyes. com, 2010.

[AMD10] AMD FireStreamTM 9270 GPU Compute Accelerator. http://www.amd. com/us/products/workstation/firestream/firestream-9270/pages/ firestream-9270.aspx, Jul 2010.

[ATI10a] ATI Stream SDK OpenCL Programming Guide. http://developer. amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_ Programming_Guide.pdf, Aug 2010.

[ATI10b] ATI Stream Software Development Kit (SDK) v2.1. http://developer. amd.com/gpu/ATIStreamSDK/Pages/default.aspx, May 2010.

[BBBC10]ˇ Jiˇrí Barnat, Petr Bauch, Luboš Brim, and Milan Ceška.ˇ Computing Strongly Connected Components in Parallel on CUDA. Technical Re- port FIMU-RS-2010-10, Faculty of Informatics, Masaryk University, Brno, 2010.

[BFH+04] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, and Pat Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. ACM TRANSACTIONS ON GRAPHICS, 23:777– 786, 2004.

[BFRR01] Thomas Bräunl, Stefan Feyrer, Wolfgang Rapf, and Michael Reinhardt. Parallel Image Processing. Springer, 2001.

[BK03] Mario Botsch and Leif Kobbelt. High-Quality Point-Based Rendering on Modern GPUs. In PG ’03: Proceedings of the 11th Pacific Conference on Computer Graphics and Applications, page 335, Washington, DC, USA, 2003. IEEE Computer Society.

[BKC+01] Michael D. Beynon, Tahsin Kurc, Umit Catalyurek, Chialin Chang, Alan Sussman, and Joel Saltz. Distributed processing of very large datasets with DataCutter. Parallel Comput., 27(11):1457–1478, 2001.

[BM91] S. Beucher and Centre De Morphologie Mathmatique. The Watershed Transformation Applied To Image Segmentation. In Scanning Microscopy International, pages 299–314, 1991.

[BR96] Stephen Brown and Jonathan Rose. Architecture of FPGAs and CPLDs: A Tutorial. IEEE Design and Test of Computers, 13:42–57, 1996.

[BRHM96] A. F. Boden, D. C. Redding, R. J. Hanisch, and J. Mo. Massively par- allel spatially variant maximum-likelihood restoration of Hubble Space Telescope imagery. J. Opt. Soc. Am. A, 13(7):1537–1545, 1996.

29 [BTVG06] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: Speeded Up Robust Features. In Aleš Leonardis, Horst Bischof, and Axel Pinz, ed- itors, Computer Vision – ECCV 2006, volume 3951 of Lecture Notes in Computer Science, pages 404–417. Springer Berlin / Heidelberg, 2006. 10.1007/11744023_32.

[BZZ09] S. Bonettini, R. Zanella, and L. Zanni. A scaled gradient projection method for constrained image deblurring. Inverse Problems, 25(1):015002, 2009.

[Cas96] Kenneth R. Castleman. Digital Image Processing. Prentice Hall, 2nd edi- tion, 1996.

[CBM+08] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaf- fer, and Kevin Skadron. A performance study of general-purpose appli- cations on graphics processors using CUDA. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008.

[CC74] Christoph Cremer and Thomas Cremer. Considerations on a laser- scanning-microscope with high resolution and depth of field. 1974.

[CDMS+08] Daniel Casta´no-Díez, Dominik Moser, Andreas Schoenegger, Sabine Pruggnaller, and Achilleas S. Frangakis. Performance evaluation of im- age processing algorithms on the GPU. Journal of Structural Biology, 164(1):153 – 160, 2008.

[CLW04] Joshua E. Cates, Aaron E. Lefohn, and Ross T. Whitaker. GIST: an in- teractive, GPU-based level set segmentation tool for 3D medical images. Medical Image Analysis, 8(3):217 – 231, 2004. Medical Image Computing and Computer-Assisted Intervention - MICCAI 2003.

[Coe10] David Coeurjolly. Volumetric analysis of digital objects using dis- tance transformation: performance issues and applications, August 2010. Workshop on Applications of Digital Geometry and Mathematical Mor- phology.

[CTMT10] Thanh-Tung Cao, Ke Tang, Anis Mohamed, and Tiow-Seng Tan. Parallel Banding Algorithm to compute exact distance transform with the GPU. In I3D ’10: Proceedings of the 2010 ACM SIGGRAPH symposium on Inter- active 3D Graphics and Games, pages 83–90, New York, NY, USA, 2010. ACM.

[CUD10a] CUDATM CUBLAS Library 3.1. http://developer.download.nvidia. com/compute//3_1/toolkit/docs/CUBLAS_Library_3.1.pdf, Jun 2010.

[CUD10b] CUDATM CUFFT Library 3.1. http://developer.download.nvidia. com/compute/cuda/3_1/toolkit/docs/CUFFT_Library_3.1.pdf, Jun 2010.

30 [CUD10c] CUDATM Programming Guide 3.1. http://developer.download. nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ ProgrammingGuide_3.1.pdf, Jun 2010.

[CUD10d] CUDATM SDK Code Samples 3.1. http://developer.nvidia.com/ object/cuda_sdk_samples.html, Jun 2010.

[CUD10e] NVIDIA GPU Computing Developer Home Page. http://developer. nvidia.com/object/gpucomputing.html, Jun 2010.

[CV01] Tony F. Chan and Luminita A. Vese. A level set algorithm for minimizing the Mumford-Shah functional in image processing. In VLSM ’01: Proceed- ings of the IEEE Workshop on Variational and Level Set Methods (VLSM’01), page 161, Washington, DC, USA, 2001. IEEE Computer Society.

[CVG08] N. Cornelis and L. Van Gool. Fast scale invariant feature detection and matching on programmable graphics hardware. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pages 1 –8, 23-28 2008.

[Dat01] DataCutter Project. http://www.cs.umd.edu/projects/hpsl/ ResearchAreas/DataCutter.htm, 2001.

[DDdV10] Bernardt Duvenhage, J. P. Delport, and Jason de Villiers. Implementa- tion of the Lucas-Kanade image registration algorithm on a GPU for 3D computational platform stabilisation. In AFRIGRAPH ’10: Proceedings of the 7th International Conference on Computer Graphics, Virtual Reality, Visu- alisation and Interaction in Africa, pages 83–90, New York, NY, USA, 2010. ACM.

[DMAC03] F. Drago, K. Myszkowski, T. Annen, and N. Chiba. Adaptive Logarithmic Mapping For Displaying High Contrast Scenes. Computer Graphics Forum, 22:419–426, 2003.

[DPRS01] Udo Diewald, Tobias Preusser, Martin Rumpf, and Robert Strzodka. Dif- fusion Models And Their Accelerated Solution In Image And Surface Processing. Acta Math. Univ. Comenian, 70:15–34, 2001.

[DS04] Carsten Dachsbacher and Marc Stamminger. Rendering Procedural Ter- rain by Geometry Image Warping. In Rendering Techniques, pages 103– 110, 2004.

[Duc76] J. Duchon. Interpolation des fonctions de deux variables suivant le principe de la flexion des plaques minces. R.A.I.R.O. Analyse num ’erique, 10:5–12, 1976.

[DVW09] L. Domanski, P. Vallotton, and D. Wang. Two and Three-Dimensional Image Deconvolution on Graphics Hardware. In Proceedings of the 18th World IMACS/MODSIM Congress, pages 1010–1016, Cairns, Australia, 13- 17 July 2009.

31 [ESA05] O.C. Eidheim, J. Skjermo, and L. Aurdal. Real-time analysis of ultra- sound images using GPU. International Congress Series, 1281:284 – 289, 2005. CARS 2005: Computer Assisted Radiology and Surgery.

[FEK+05] Quanfu Fan, Alon Efrat, Vladlen Koltun, Shankar Krishnan, and Suresh Venkatasubramanian. Hardware-assisted natural neighbor interpola- tion. In In: Proc. 7th Workshop on Algorithm Engineering and Experiments (ALENEX), 2005.

[FJ09] Matteo Frigo and Steven G. Johnson. FFTW 3.2.2. http://www.fftw. org/fftw3.pdf, Jul 2009.

[FM05] James Fung and Steve Mann. OpenVIDIA: parallel GPU computer vi- sion. In MULTIMEDIA ’05: Proceedings of the 13th annual ACM interna- tional conference on Multimedia, pages 849–852, New York, NY, USA, 2005. ACM.

[FR] James Fung and John Roberts. OpenVIDIA: Parallel GPU Computer Vi- sion . http://openvidia.sourceforge.net/index.php/OpenVIDIA.

[Fra04] François Goudail and Philippe Réfrégier. Statistical image processing tech- niques for noisy images: an application-oriented approach. Springer, 2004.

[Fun05] J. Fung. Computer Vision on the GPU, chapter 40, pages 649–665. Addison- Wesley, 2005.

[FC06]ˇ O. Fialka and M. Cadík.ˇ FFT and Convolution Performance in Image Filtering on GPU. In Information Visualization, 2006. IV 2006. Tenth Inter- national Conference on, pages 609 –614, 5-7 2006.

[GA09] Leo Grady and Christopher V. Alvino. The piecewise smooth Mumford- Shah functional on an arbitrary graph. Trans. Img. Proc., 18(11):2547–2561, 2009.

[GLD+08] Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High performance discrete Fourier transforms on graphics processors. In SC ’08: Proceedings of the 2008 ACM/IEEE con- ference on Supercomputing, pages 1–12, Piscataway, NJ, USA, 2008. IEEE Press.

[GLG05] Minglun Gong, Aaron Langille, and Mingwei Gong. Image Analysis and Recognition, volume 3656/2005, chapter Real-Time Image Process- ing Using Graphics Hardware: A Performance Study, pages 1217–1225. Springer Berlin / Heidelberg, 2005.

[GLGN+08] Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov. Parallel Computing Experiences with CUDA. IEEE Micro, 28(4):13–27, 2008.

32 [GMH+09] Markus Gipp, Guillermo Marcus, Nathalie Harder, Apichat Suratanee, Karl Rohr, Rainer König, and Reinhard Männer. Haralick’s Texture Fea- tures Computed by GPUs for Biological Applications. 2009.

[GPG] General-Purpose computation on Graphics Processing Units. http:// www.gpgpu.org.

[Gri10] Future of computing: GPGPU? http://gridtalk-project.blogspot. com/2010/07/future-of-computing-gpgpu.html, Jul 2010.

[GS05] Antonio Garcia and Han-Wei Shen. GPU-based 3D wavelet recon- struction with tileboarding. The Visual Computer, 21:755–763, 2005. 10.1007/s00371-005-0332-0.

[GW02] Rafael C. Gonzales and Richard E. Woods. Digital Image Processing. Prentice-Hall, 2nd edition, 2002.

[GWWH05] Nolan Goodnight, Rui Wang, Cliff Woolley, and Greg Humphreys. In- teractive time-dependent tone mapping using programmable graphics hardware. In SIGGRAPH ’05: ACM SIGGRAPH 2005 Courses, page 180, New York, NY, USA, 2005. ACM.

[Hal08] Haldar, J. P. and Hernando, D. and Song, S.-K. and Liang, Z.-P. Anatom- ically constrained reconstruction from noisy data. Magnetic Resonance in Medicine, 59:810–818, 2008.

[HB08] S. Harding and W. Banzhaf. Genetic programming on GPUs for image processing. Int. J. High Perform. Syst. Archit., 1(4):231–240, 2008.

[HCR+08] Timothy D.R. Hartley, Umit Catalyurek, Antonio Ruiz, Francisco Igual, Rafael Mayo, and Manuel Ujaldon. Biomedical image analysis on a co- operative cluster of GPUs and multicores. In ICS ’08: Proceedings of the 22nd annual international conference on Supercomputing, pages 15–25, New York, NY, USA, 2008. ACM.

[HE00] Matthias Hopf and Thomas Ertl. Hardware Accelerated Wavelet Trans- formations. In In Proceedings of EG/IEEE TCVG Symposium on Visualization VisSym ’00, pages 93–103, 2000.

[HLSB04] Markus Hadwiger, Caroline Langer, Henning Scharsach, and Katja Büh- ler. State of the Art Report 2004 on GPU-Based Segmentation. Technical report, VRVis Research Center, Vienna, 2004.

[HN07] Pawan Harish and P. J. Narayanan. Accelerating large graph algorithms on the GPU using CUDA. In HiPC’07: Proceedings of the 14th international conference on High performance computing, pages 197–208, Berlin, Heidel- berg, 2007. Springer-Verlag.

[How10] Lee Howes. OpenCL: Parallel Computing for CPUs and GPUs. http://developer.amd.com/gpu_assets/OpenCL_Parallel_ Computing_for_CPUs_and_GPUs_201003.pdf, Mar 2010.

33 [HS92] Stefan Hell and Ernst H. K. Stelzer. Properties of a 4Pi confocal fluores- cence microscope. J. Opt. Soc. Am. A, 9(12):2159–2166, 1992.

[Jäh05] Bernd Jähne. Digital Image Processing. Springer, 6th edition, 2005.

[JKDP09] Byunghyun Jang, David Kaeli, Synho Do, and Homer Pien. Multi GPU implementation of iterative tomographic reconstruction algorithms. In ISBI’09: Proceedings of the Sixth IEEE international conference on Symposium on Biomedical Imaging, pages 185–188, Piscataway, NJ, USA, 2009. IEEE Press.

[Joh07] John E. Stone and James C. Phillips and Peter L. Freddolino and David J. Hardy and Leonardo G. Trabuco and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28(16):2618–2640, 2007.

[KDY+10] Jingfei Kong, Martin Dimitrov, Yi Yang, Janaka Liyanage, Lin Cao, Jacob Staples, Mike Mantor, and Huiyang Zhou. Accelerating MATLAB Image Processing Toolbox functions on GPUs. In GPGPU ’10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 75–85, New York, NY, USA, 2010. ACM.

[KEGM10] Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke, and David Michéa. High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster. Journal of Computational Physics, In Press, Corrected Proof:–, 2010.

[KES07] Martin Kraus, Mike Eissele, and Magnus Strengert. GPU-Based Edge- Directed Image Interpolation. In Bjarne Ersbøll and Kim Pedersen, ed- itors, Image Analysis, volume 4522 of Lecture Notes in Computer Science, pages 532–541. Springer Berlin / Heidelberg, 2007. 10.1007/978-3-540- 73040-8_54.

[Kim05] Kimmel, Ron and Sochen, Nir and Weickert, Joachim, editor. Scale Space and PDE Methods in Computer Vision. Springer, 2005. Proceedings of the 5th International Conference on Scale Space and PDE Methods in Com- puter Vision, Scale-Space 2005.

[KK04] Francis Kelly and Anil Kokaram. Fast Image Interpolation for Motion Estimation using. In in Real Time Imaging VIII, SPIE, pages 184–194, 2004.

[KmH10] David Kirk and Wen mei Hwu. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2010.

[KP08] C. Kauffmann and N. Piche. Cellular automaton for ultra-fast watershed transform on GPU. In Pattern Recognition, 2008. ICPR 2008. 19th Interna- tional Conference on, pages 1 –4, 8-11 2008.

[KW03] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In SIGGRAPH ’03: ACM SIG- GRAPH 2003 Papers, pages 908–916, New York, NY, USA, 2003. ACM.

34 [LCW03] Aaron Lefohn, Joshua Cates, and Ross Whitaker. Interactive, GPU-Based Level Sets for 3D Segmentation. pages 564–572. 2003.

[LD08] Yuancheng Luo and R. Duraiswami. Canny edge detection on NVIDIA CUDA. In Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ’08. IEEE Computer Society Conference on, pages 1 –8, 23-28 2008.

[LDS05] David Levin, Damini Dey, and Piotr Slomka. Efficient 3D nonlinear warping of computed tomography: two high-performance implemen- tations using OpenGL. volume 5744, pages 34–42. SPIE, 2005.

[LDZZ09] Baofeng Li, Yong Dou, Haifang Zhou, and Xingming Zhou. FPGA accel- erator for wavelet-based automated global image registration. EURASIP J. Embedded Syst., 2009:1–10, 2009.

[LKC+10] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Dae- hyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News, 38(3):451–460, 2010.

[Luc74] L. B. Lucy. An iterative technique for the rectification of observed distri- butions. The Astronomical Journal, 79:745–+, June 1974.

[Mar96] Petros A. Maragos. Mathematical Morphology and Its Applications to Image and Signal Processing. Kluwer Academic Publishers, Norwell, MA, USA, 1996.

[MBS09] Maryam Moazeni, Alex Bui, and Majid Sarrafzadeh. Accelerating total variation regularization for matrix-valued images on GPUs. In CF ’09: Proceedings of the 6th ACM conference on Computing frontiers, pages 137– 146, New York, NY, USA, 2009. ACM.

[MDTP+04] Michael McCool, Stefanus Du Toit, Tiberiu Popa, Bryan Chan, and Kevin Moule. Shader algebra. In SIGGRAPH ’04: ACM SIGGRAPH 2004 Papers, pages 787–795, New York, NY, USA, 2004. ACM.

[MOOXS08] P. Muyan-Ozcelik, J.D. Owens, Junyi Xia, and S.S. Samant. Fast De- formable Registration on the GPU: A CUDA Implementation of Demons. In Computational Sciences and Its Applications, 2008. ICCSA ’08. Interna- tional Conference on, pages 223 –233, june 2008.

[NOEM08] Akira Nukada, Yasuhiko Ogata, Toshio Endo, and Satoshi Matsuoka. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In SC ’08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, pages 1– 11, Piscataway, NJ, USA, 2008. IEEE Press.

[Nyl07] Nyland, Lars and Harris, Mark and Prins, Jan. Fast N-Body Simulation with CUDA. In Hubert Nguyen, editor, GPU Gems 3, chapter 31. Addison Wesley Professional, August 2007.

35 [OLG+05] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General- Purpose Computation on Graphics Hardware. pages 21–51, August 2005.

[OLG+07] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron Lefohn, and Timothy J. Purcell. A survey of general- purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80–113, 2007.

[Ope10a] OpenCL. http://www.khronos.org/opencl/, Jun 2010.

[Ope10b] OpenCL 1.1 Reference Pages. http://www.khronos.org/registry/cl/ sdk/1.1/docs/man/xhtml/, Jun 2010.

[Ope10c] The OpenCL Specification 1.1. http://developer.amd.com/gpu/ ATIStreamSDK/assets/opencl-1.1-rev33.pdf, Jun 2010.

[PDHW10] Brad Peterson, Manasi Datar, Mary Hall, and Ross Whitaker. GPU Ac- celerated Particle System for Triangulated Surface Meshes. Jul 2010.

[PGS+09] Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Dem- ing Chen, Jason Cong, and Wen-Mei W. Hwu. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. Application Specific Processors, Symposium on, 0:35–42, 2009.

[PGX08] Lei Pan, Lixu Gu, and Jianrong Xu. Implementation of medical im- age segmentation in CUDA. In Information Technology and Applications in Biomedicine, 2008. ITAB 2008. International Conference on, pages 82 –85, 30-31 2008.

[Pod07] Victor Podlozhnyuk. Image Convolution with CUDA. http: //developer.download.nvidia.com/compute/cuda/sdk/website/ projects/convolutionSeparable/doc/convolutionSeparable.pdf, Jun 2007.

[Pra01] William K. Pratt. Digital Image Processing. John Wiley & Sons, 3rd edition, 2001.

[PSL+10] In Kyu Park, Nitin Singhal, Man Hee Lee, Sungdae Cho, and Chris Kim. Design and Performance Evaluation of Image Processing Algorithms on GPUs. IEEE Transactions on Parallel and Distributed Systems, 99(PrePrints), 2010.

[QFTI09] Cory W. Quammen, David Feng, and Russell M. Taylor II. Performance of 3D Deconvolution Algorithms on Multi-Core and Many-Core Archi- tectures. Technical report, University of North Carolina at Chapel Hill, Department of Computer Science, 2009.

[RIC72] WILLIAM HADLEY RICHARDSON. Bayesian-Based Iterative Method of Image Restoration. J. Opt. Soc. Am., 62(1):55–59, 1972.

36 [RM07] Matthias Raspe and Stefan Müller. Using a GPU-based Framework for Interactive Tone Mapping of Medical Volume Data. In SIGRAD 2007: Computer Graphics in Healthcare, pages 3–10. Linköping University Elec- tronic Press, Linköpings universitet, 2007.

[ROF92] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total vari- ation based noise removal algorithms. Phys. D, 60(1-4):259–268, 1992.

[RPZ02] Liu Ren, Hanspeter Pfister, and Matthias Zwicker. Object Space EWA Surface Splatting: A Hardware Accelerated Approach to High Quality Point Rendering. In Computer Graphics Forum, pages 461–470, 2002.

[RRB+08] Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, Sam S. Stone, David B. Kirk, and Wen-mei W. Hwu. Optimization principles and ap- plication performance evaluation of a multithreaded GPU using CUDA. In PPoPP ’08: Proceedings of the 13th ACM SIGPLAN Symposium on Princi- ples and practice of parallel programming, pages 73–82, New York, NY, USA, 2008. ACM.

[RSSF02] Erik Reinhard, Michael Stark, Peter Shirley, and James Ferwerda. Pho- tographic tone reproduction for digital images. ACM Trans. Graph., 21(3):267–276, 2002.

[RT06] Guodong Rong and Tiow-Seng Tan. Jump flooding in GPU with applica- tions to Voronoi diagram and distance transform. In I3D ’06: Proceedings of the 2006 symposium on Interactive 3D graphics and games, pages 109–116, New York, NY, USA, 2006. ACM.

[RUCH09] Antonio Ruiz, Manuel Ujaldon, Lee Cooper, and Kun Huang. Non-rigid Registration for Large Sets of Microscopic Images on Graphics Proces- sors. J. Signal Process. Syst., 55(1-3):229–250, 2009.

[RUG07] Antonio Ruiz, Manuel Ujaldón, and Nicolás Guil. Using Graphics Hard- ware for Enhancing Edge and Circle Detection. In Joan Martí, José Benedí, Ana Mendonça, and Joan Serrat, editors, Pattern Recognition and Image Analysis, volume 4478 of Lecture Notes in Computer Science, pages 234–241. Springer Berlin / Heidelberg, 2007. 10.1007/978-3-540-72849- 8_30.

[SA09] Ojaswa Sharma and François Anton. CUDA based Level Set Method for 3D Reconstruction of Fishes from Large Acoustic Data. In Interna- tional Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (WSCG), 2009.

[SBW06] Jens Schneider, Tobias Boldte, and Rüdiger Westermann. Real-Time Edit- ing, Synthesis, and Rendering of Infinite Landscapes on GPUs. In Vision, Modeling and Visualization 2006, 2006.

37 [SDR03] Robert Strzodka, Marc Droske, and Martin Rumpf. Fast Image Reg- istration in DX9 Graphics Hardware. Journal of Medical Informatics and Technologies, 6:43–49, Nov 2003.

[SDR04] Robert Strzodka, Marc Droske, and Martin Rumpf. Image Registration by a Regularized Gradient Flow - A Streaming Implementation in DX9 Graphics Hardware. Computing, 73(4):373–389, November 2004.

[Ser83] Jean Serra. Image Analysis and Mathematical Morphology. Academic Press, Inc., Orlando, FL, USA, 1983.

[SG04] Robert Strzodka and Christoph Garbe. Real-Time Motion Estimation and Visualization on Graphics Cards. In Proceedings IEEE Visualization 2004, pages 545–552, 2004.

[SHN03] Anthony Sherbondy, Mike Houston, and Sandy Napel. Fast Volume Segmentation With Simultaneous Visualization Using Programmable Graphics Hardware. In VIS ’03: Proceedings of the 14th IEEE Visualiza- tion 2003 (VIS’03), page 23, Washington, DC, USA, 2003. IEEE Computer Society.

[SHT+08] S. S. Stone, J. P. Haldar, S. C. Tsao, W. m. W. Hwu, B. P. Sutton, and Z. P. Liang. Accelerating advanced MRI reconstructions on GPUs. J. Parallel Distrib. Comput., 68(10):1307–1318, 2008.

[SKE06] Magnus Strengert, Martin Kraus, and Thomas Ertl. Pyramid Methods in GPU-Based Image Processing. pages 169–176, 2006.

[SKS09] David Svoboda, Michal Kozubek, and Stanislav Stejskal. Generation of Digital Phantoms of Cell Nuclei and Simulation of Image Formation in 3D Image Cytometry. CYTOMETRY PART A, 75A(6):494–509, JUN 2009.

[SKSF07] G C Sharp, N Kandasamy, H Singh, and M Folkert. GPU-based streaming architectures for fast cone-beam CT image reconstruction and demons deformable registration. Physics in Medicine and Biology, 52(19):5771, 2007.

[SM06] Jean-Luc Starck and Fionn Murtagh. Astronomical Image and Data Analy- sis. Springer, 2006.

[SM08] H. Sugano and R. Miyamoto. Parallel implementation of morphological processing on Cell/BE with OpenCV interface. In Communications, Con- trol and Signal Processing, 2008. ISCCSP 2008. 3rd International Symposium on, pages 578 –583, 12-14 2008.

[SN06] P. Sarder and A. Nehorai. Deconvolution methods for 3-D fluorescence microscopy images. Signal Processing Magazine, IEEE, 23(3):32 – 45, may 2006.

38 [SV82] L. A. Shepp and Y. Vardi. Maximum Likelihood Reconstruction for Emis- sion Tomography. Medical Imaging, IEEE Transactions on, 1(2):113–122, 1982.

[SZZ10] Thomas Serafini, Riccardo Zanella, and Luca Zanni. Gradient projection methods for image deblurring and denoising on graphics processors. Advances in Parallel Computing, 19:59–66, 2010.

[Tes09] Tesla C1060 Computing Processor Board. http://www.nvidia.com/ docs/IO/43395/BD-04111-001_v05.pdf, Jan 2009.

[Tes10] Tesla C2050 / C2070 GPU Computing Processor. http://www.nvidia. com/object/product_tesla_C2050_C2070_us.html, Jul 2010.

[TH78] H. Trussell and B. Hunt. Image restoration of space variant blurs by sectioned methods. In Acoustics, Speech, and Signal Processing, IEEE Inter- national Conference on ICASSP ’78., volume 3, pages 196 – 198, apr 1978.

[TK06] Hiroyuki Takizawa and Hiroaki Kobayashi. Hierarchical parallel process- ing of large scale data clustering on a PC cluster with GPU co-processing. J. Supercomput., 36(3):219–234, 2006.

[Uja09] Manuel Ujaldon. GPU acceleration of Zernike moments for large-scale images. In IPDPS ’09: Proceedings of the 2009 IEEE International Symposium on Parallel&Distributed Processing, pages 1–8, Washington, DC, USA, 2009. IEEE Computer Society.

[vDvdWT06] M. van Dortmont, H. van de Wetering, and A. Telea. Skeletonization and Distance Transforms of 3D Volumes Using Graphics Hardware. In Attila Kuba, László Nyúl, and Kálmán Palágyi, editors, Discrete Geometry for Computer Imagery, volume 4245 of Lecture Notes in Computer Science, pages 617–629. Springer Berlin / Heidelberg, 2006. 10.1007/11907350_52.

[Ver98] Peter J. Verveer. Computational and optical methods for improving resolu- tion and signal quality in fluorescence microscopy. PhD thesis, Technische Universiteit Te Delft, 1998.

[VN08] Vibhav Vineet and P. J. Narayanan. CUDA cuts: Fast graph cuts on the GPU. Computer Vision and Pattern Recognition Workshop, 0:1–8, June 2008.

[Wie64] Norbert Wiener. Extrapolation, Interpolation, and Smoothing of Stationary Time Series. The MIT Press, 1964.

[WLHW07] Tien-Tsin Wong, Chi-Sing Leung, Pheng-Ann Heng, and Jianqing Wang. Discrete Wavelet Transform on Consumer-Level Graphics Hardware. Multimedia, IEEE Transactions on, 9(3):668 –673, april 2007.

[WN] Piotr Wendykier and James G. Nagy. Parallel Colt: A High Performance Java Library for Scientific Computing and Image Processing. ACM Trans- actions on Mathematical Software, 37(3).

39 [XCZ09] Yongshun Xiao, Zhiqiang Chen, and Li Zhang. Accelerated CT Recon- struction Using GPU SIMD Parallel Computing with Bilinear Warping Method. Information Science and Engineering, International Conference on, 0:95–98, 2009.

[XM05] Fang Xu and K. Mueller. Accelerating popular tomographic reconstruc- tion algorithms on commodity PC graphics hardware. Nuclear Science, IEEE Transactions on, 52(3):654 – 663, june 2005.

[XM09] Wei Xu and Klaus Mueller. Accelerating regularized iterative CT recon- struction on commodity graphics hardware (GPU). In ISBI’09: Proceed- ings of the Sixth IEEE international conference on Symposium on Biomedical Imaging, pages 1287–1290, Piscataway, NJ, USA, 2009. IEEE Press.

[YNAS09] Lee Seng Yeong, Christopher Wing Hong Ngau, Li-Minn Ang, and Kah Phooi Seng. Efficient processing of a rainfall simulation watershed on an FPGA-based architecture with fast access to neighbourhood pixels. EURASIP J. Embedded Syst., 2009:4–4, 2009.

[YQ08] Lei Yu and Dawei Qi. Improved multi-scale and structuring element morphological detection in the log CT image. In Cybernetics and Intelligent Systems, 2008 IEEE Conference on, pages 1248 –1253, 21-24 2008.

[Zit03] Barbora Zitová. Image registration methods: a survey. Image and Vision Computing, 21(11):977–1000, October 2003.

[ZKU+04] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation us- ing a layered representation. ACM Trans. Graph., 23(3):600–608, 2004.

[ZLB+09] Jian Zhu, Youquan Liu, Kai Bao, Yuanzhang Chang, and Enhua Wu. Warping of a spherical representation of image-based models on GPU. In VRCAI ’09: Proceedings of the 8th International Conference on Virtual Reality Continuum and its Applications in Industry, pages 89–94, New York, NY, USA, 2009. ACM.

40 4 Summary of Study Results

During the first three terms of doctoral study I succesfully passed courses focused on image processing methods, namely Advanced Methods of Digital Image Processing (FI:PA166), Digital Image Filtering (FI:PA171), Mathematical Morphology (FI:PA173), and Digital Geometry (FI:PA170). I learned a number of methods from mathematical morphology, image filtering, edge detection, and image segmentation. As a part of my study, I attended the Image Acquisition and Analysis in Microscopy workshop in Prague in May, 2009. I also attended the Spring School of Image Processing in Mariánská in June, 2010. I read papers with results of past researches in the area of GPU acceleration. In par- ticular, I focused on implementations of image processing algorithms. Also I learned the CUDA parallel programming model and succesfully passed the GPU Programming (FI:PV197) course. I implemented several algorithms in CUDA, namely convolution, morphological re- construction, and a number of reconstruction-based operators (for detailed description refer to the following paragraphs). At CBIA FI MU, we develop i3dcore and i3dalgo, cross-platform C++ libraries for 3D image processing. These libraries are now extended to support GPU-based processing and the image processing algorithms implemented in CUDA will be soon included in the libraries.

Convolution

I implemented convolution of large 3D images, based on the fast Fourier transform, on GPU. The CUDA framework and the CUFFT library were used for the implementation. With D. Svoboda we proposed a method to decompose large 3D images using the decimation in frequency (DIF) algorithm, so that we are able to process images that are too large to be stored in the GPU memory. The paper on the topic was submitted to EURASIP Journal on Advances in Signal Processing [KS10].

Morphological reconstruction

I studied efficient algorithms for computing the morphological reconstruction, de- scribed in literature. I proposed a GPU-based implementation in CUDA and com- pared the performance of all approaches. The paper on the topic was submitted to the MEMICS 2010 conference [K10].

References to the submitted papers [KS10] Pavel Karas and David Svoboda. Convolution of Large 3D Images on GPU and its Decomposition. EURASIP Journal on Advances in Signal Processing. Hindawi, submitted.

[K10] Pavel Karas. Efficient Computation of Morphological Greyscale Reconstruc- tion. MEMICS 2010 Annual Doctoral Workshop on Mathematical and Engineering Methods in Computer Science, submitted.

41 There is a theory which states that if ever anyone discovers exactly what the Universe is for and why it is here, it will instantly disappear and be replaced by something even more bizarre and inexplicable.

There is another theory which states that this has already happened.

Douglas Adams, The Restaurant at the End of the Universe

42