Hardware Efficient PDE Solvers in Quantized Image Processing

Dissertation approved by the Department of Mathematics of the Universität Duisburg-Essen (Campus Duisburg)

for the award of the academic degree of Doktor der Naturwissenschaften, submitted by

Robert Strzodka aus Tarnowitz

Referee: Prof. Dr. Martin Rumpf
Co-referee: Prof. Dr. Thomas Ertl
Date of submission: 30 Sep 2004
Date of oral examination: 20 Dec 2004

Contents

Abstract v

1 Introduction  1
   1.1 Motivation ...... 2
   1.2 Thesis Guide ...... 5
   1.3 Summary ...... 8
   Acknowledgments ...... 11

2 PDE Solvers in Quantized Image Processing  13
   2.1 Continuous PDE Based Image Processing ...... 15
   2.2 Discretization - Quantization ...... 24
   2.3 Anisotropic Diffusion ...... 43
   2.4 Level-Set Segmentation ...... 55
   2.5 Gradient Flow Registration ...... 59
   2.6 Data-Flow ...... 63
   2.7 Conclusions ...... 67

3 Data Processing  69
   3.1 Data Access ...... 71
   3.2 Computation ...... 84
   3.3 Hardware Architectures ...... 97
   3.4 Conclusions ...... 105

4 Hardware Efficient Implementations  107
   4.1 Graphics Hardware ...... 110
   4.2 Reconfigurable Logic ...... 156
   4.3 Reconfigurable Computing ...... 165
   4.4 Comparison of Architectures ...... 180

Bibliography 187

Acronyms 201

Index 205

Abstract

Performance and accuracy of scientific computations are competing aspects. A close interplay between the design of computational schemes and their implementation can improve both by making better use of the available resources. This thesis describes the design of robust schemes under strong quantization and their hardware efficient implementation on data-stream-based architectures for PDE based image processing.

The strong quantization improves execution time, but renders traditional error estimates useless: the precision of the number formats is too small to control the quantitative error in iterative schemes. Instead, quantized schemes are constructed which preserve the qualitative behavior of the continuous models. In particular, for the solution of the quantized anisotropic diffusion model one can derive a quantized scale-space with almost identical properties to the continuous one. Thus the image evolution is accurately reconstructed despite the inability to control the error in the long run, which is difficult even for high precision computations.

Nowadays, all memory intensive algorithms are burdened with the memory gap problem, which degrades performance enormously. The instruction-stream-based computing paradigm reinforces this problem, whereas architectures subscribing to data-stream-based computing offer more possibilities to bridge the gap between memory and logic performance, and also expose more parallelism. Three architectures of this type are covered: graphics hardware, reconfigurable logic and reconfigurable computing devices. They exploit the parallelism inherent in image processing applications and use memory efficiently. Their pros and cons and future development are discussed.

The combination of robust quantized schemes and hardware efficient implementations delivers an accurate reproduction of the continuous evolution and significant performance gains over standard software solutions.
The applied devices are available on affordable AGP/PCI boards, offering true alternatives even to small multi-processor systems.


AMS Subject Classification (MSC 2000)

• 65Y10 Numerical analysis: Algorithms for specific classes of architectures
• 68U10 Computer science: Image processing

ACM Computing Classification System (CCS 1998)

• G.4 Mathematical Software: Efficiency, Reliability and robustness, Parallel and vector implementations
• I.4.3 [Image Processing and Computer Vision]: Enhancement—Smoothing, Registration
• I.4.6 [Image Processing and Computer Vision]: Segmentation—Region growing and partitioning
• G.1.8 [Numerical Analysis]: Partial Differential Equations—Finite element methods, Finite difference methods, Parabolic equations, Hyperbolic equations, Multigrid and multilevel methods
• B.3.1 [Memory Structures]: Semiconductor Memories—Dynamic memory (DRAM)
• I.3.1 [Computer Graphics]: Hardware Architecture—Graphics processors
• B.7.1 [Integrated Circuits]: Types and Design Styles—Gate arrays
• C.1.3 [Processor Architectures]: Other Architecture Styles—Adaptable architectures
• C.4 Performance of Systems: Performance attributes
• J.3 Life and Medical Sciences: Health

General Terms: Algorithms, Languages, Performance, Theory

Keywords: quantization, qualitative error control, quantized scale-space, memory gap, performance, data-stream-based processing, graphics hardware, reconfigurable logic, reconfigurable computing

1 Introduction

Contents

1.1 Motivation ...... 2
   1.1.1 Operation Count and Performance ...... 2
   1.1.2 Precision and Accuracy ...... 3
   1.1.3 Choice of Applications and Architectures ...... 3
1.2 Thesis Guide ...... 5
   1.2.1 Thesis Structure ...... 5
   1.2.2 Index and Acronyms ...... 5
   1.2.3 Notation ...... 6
   1.2.4 Binary Prefixes ...... 7
1.3 Summary ...... 8
   1.3.1 PDE Solvers in Quantized Image Processing ...... 8
   1.3.2 Data Processing ...... 10
   1.3.3 Hardware Efficient Implementations ...... 10
Acknowledgments ...... 11

Tables

1.1 General notation ...... 6
1.2 International decimal and binary prefixes ...... 7

The motivation section presents the broader picture of the thesis and outlines ideas which embrace the different chapters. In the Thesis Guide we give a quick overview of the thesis and cover presentational aspects. The chapter ends with a summary of the results and acknowledgments.


1.1 Motivation

Numerical mathematics is concerned with the design of fast and accurate schemes for the approximate solution of mathematical problems. Computer systems are the target platforms for the implementation of these schemes. So the trade-off between the competing factors of performance and accuracy applies both to the mathematical level, where approximations of different accuracy order are chosen, and the implementational level, where number formats and operations of different precision are used. Traditionally, the optimization processes are performed separately by mathematicians and computer scientists respectively. The common interface is the operation count of a scheme, which one seeks to reduce. We argue that this measure is much too simple, as it completely ignores the diverse performance characteristics of computer systems. Thus apparently efficient mathematical schemes perform surprisingly badly on actual systems. In the area of partial differential equation (PDE) based image processing the thesis demonstrates how an early consideration of performance relevant hardware aspects and a close coupling of the scheme design and its implementation fully exploit the available resources and so deliver fast and accurate solutions.

1.1.1 Operation Count and Performance

The merits of the exponential development in semiconductors have benefited memory and computing elements in different respects: data transport and data processing have not developed at the same pace. The consequences are far-reaching but can be outlined by an example. If we consider a simple addition of two vectors C̄ = Ā + B̄ of size N, then the operation count is N. Modern micro-processors can process two operands made up of four 32-bit floats in one clock cycle. So if the processor runs at 3 GHz it can perform 12G floating point operations per second (FLOPS), and we should finish the addition in N/12 ns. Real performance values are at least an order of magnitude lower. The processor really can execute almost 12G FLOPS if not disturbed by anything else, but the data cannot be transported that quickly. Specifically, the parallel operations require 96 GB/s of input data and 48 GB/s for the output, but current memory systems provide a bandwidth of at most 6.4 GB/s. This means that the computational unit spends 95% of its time waiting for the data. So global data movement, and not local computations, is expensive and decisive for the overall performance. The situation becomes even worse when the components of the vectors Ā, B̄ are not arranged one after another in memory, e.g. if they are part of larger structures or arranged in irregular lists. Then memory latency, the time needed to find the individual components in memory, becomes dominant and the performance can drop by as much as an order of magnitude again. Therefore, it is often advisable to enforce a linear arrangement of vector components, even if this means the inclusion of additional entries to fill up the gaps of the irregular arrangement. The operation count is increased, but the data can be processed in a seamless data stream, avoiding the latencies. These two components, data addressing and data transport, dominate the execution times of

many algorithms. This fact has been acknowledged for some time already, and remedies have been developed, but the problem grows with each generation of new processors. Meanwhile hardware architectures subscribing to a data oriented computing paradigm have evolved. We evaluate three different representatives of this concept on image processing applications. The results show that the focus on regular data handling instead of minimal operation count delivers superior results.
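The vector addition arithmetic above can be reproduced with a back-of-the-envelope model. This sketch uses only the figures quoted in the text (3 GHz, 4-wide SIMD, 6.4 GB/s memory bandwidth); the vector length `N` and all constants are illustrative, not measurements.

```python
# Back-of-the-envelope model of the vector addition C = A + B
# with the figures quoted above (hypothetical 3 GHz, 4-wide SIMD CPU).
N = 10**8                      # vector length
flops = 12e9                   # 12 GFLOPS peak: 3 GHz * 4 floats per cycle
bandwidth = 6.4e9              # 6.4 GB/s peak memory bandwidth

t_compute = N / flops                # seconds if compute-bound
bytes_moved = N * 4 * 3              # read A, read B, write C (32-bit floats)
t_memory = bytes_moved / bandwidth   # seconds if memory-bound

print(f"compute-bound: {t_compute * 1e3:.1f} ms")
print(f"memory-bound:  {t_memory * 1e3:.1f} ms")
print(f"time spent waiting on memory: {1 - t_compute / t_memory:.0%}")
```

The last line reproduces the roughly 95% idle time claimed in the text: the memory-bound estimate exceeds the compute-bound one by more than a factor of twenty.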

1.1.2 Precision and Accuracy

In image processing applications performance is very critical. For this purpose one is often prepared to sacrifice strict accuracy requirements as long as the quality of the results does not suffer significantly. The question arises how we can secure robust results with less precise computations. A number format has only finitely many representations for the real numbers in a computer. So beside the error introduced by the discretization of the continuous PDE models, we also have an additional error associated with the quantized number representation and the approximate computations. Thus the quality of the final result depends both on the precision of the quantization and the properties of the schemes. It is very dangerous to trust in high precision alone. The simple computation 1 − 1.0002 · 0.9998 in the single float format, for example, evaluates to zero, although the correct result 4 · 10⁻⁸ can be represented exactly. Even long double float formats do not save us from these problems (see Section 2.2.2.1 on page 27). But for the sake of performance we want to operate on much smaller number formats, e.g. 8−16 bit. The mathematical challenge is to design schemes which can still deliver satisfactory results in this setting. For one iteration of a scheme strict error bounds can be derived. But these bounds avail to nothing if we iterate the scheme up to several hundred times. Since the number of iterations is sometimes even larger than the number of representable values in the number format (256 for 8 bit), all hope concerning accuracy seems to be lost. In fact, accuracy in the usual sense of error bounds, which quantitatively relate the computed to the continuous solution, cannot be obtained for such low precision. But the quality of the results does not necessarily depend on this quantitative relation, but rather on the qualitative behavior of the PDE model. So the guiding idea is to preserve invariants and characteristics of the evolution of the continuous models.
These properties depend more on the appropriate design of the quantized schemes than on the precision of the number formats. So despite a lack of control over the quantitative error we obtain accurate results in the sense of reproduced qualitative properties.
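The cancellation example 1 − 1.0002 · 0.9998 can be checked directly. This sketch assumes NumPy for IEEE single precision; the true result 4 · 10⁻⁸ is representable in float32, yet rounding the intermediate product to 1.0 destroys it entirely.

```python
import numpy as np

# Evaluate 1 - 1.0002 * 0.9998 in single and in double precision.
# In float32 the product 1.0002 * 0.9998 rounds to exactly 1.0,
# so the difference vanishes, although 4e-8 is representable.
single = np.float32(1.0) - np.float32(1.0002) * np.float32(0.9998)
double = 1.0 - 1.0002 * 0.9998

print(f"float32: {single:.3e}")   # → 0.000e+00
print(f"float64: {double:.3e}")   # → 4.000e-08
```

In double precision the cancellation is survived only because the format carries enough guard digits; the relative error of the float32 result is 100%.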

1.1.3 Choice of Applications and Architectures

We have chosen the field of image processing because it favors the use of parallel computations, which our architectures employ extensively, and allows the qualitative computational approach discussed in the previous section. We intentionally did not restrict ourselves to the

study of discrete algorithms, but chose the PDE based models to demonstrate that the continuous properties can be preserved even with low precision computations. Having the continuous models in the background also has the advantage that from their analysis one knows a-priori which qualitative properties of the evolution are desirable and how they are influenced by the parameters. There is no dependence on the discrete nature of a computing model at the continuous stage, so that all such effects can be more easily identified in the implementation.

The choice of architectures has been motivated by the focus on processing in data streams as discussed in Section 1.1.1 on page 2. We evaluate three different architectures: graphics hardware, reconfigurable logic and reconfigurable computing devices. Their route to performance gains is similar, massive parallelism and efficient memory usage, but the programming models are quite different. To fortify the practicability of the presented approach to image processing, we have deliberately selected architectures which are available on affordable AGP or PCI accelerator boards and can be used with a standard Personal Computer (PC). The processing capabilities of graphics hardware even come for free, since every current PC already contains a powerful Graphics Processor Unit (GPU).

Throughout the thesis we argue strongly for the early consideration of hardware characteristics in the design of numerical solvers. However, this does not mean that we favor machine-intimate, low level languages over standardized High Level Languages (HLLs). Although Hardware Description Languages (HDLs) offer more opportunities to utilize the full potential of the hardware, they have several disadvantages concerning the design effort, code reuse, compatibility and maintenance. We do not advocate the change from the currently predominant extreme of 'hardware blind' programming to the other extreme of low level HDLs. The primary advantages of data oriented architectures can be exploited with high level considerations about the arrangement of data, memory access and the use of parallel processing elements (PEs). In most of our implementations we had to resort to low level languages only because standard HLLs are basically blind to hardware aspects and hardware efficient HLLs are still under development. We hope that these new HLLs will soon make it possible to solve problems of much higher complexity with full hardware support.

Despite the high parallelism of the architectures and their good applicability to the image processing tasks, we have restricted the implementations to 2D data sets. The PDE models and the quantized schemes can be extended to 3D easily. The implementations require reorganization to a different extent for 3D, but basically no new implementational ideas are involved. The reason why we do not present 3D examples is our conviction that adaptive methods are indispensable for three dimensions. Even the massive parallelism of our architectures cannot compete with adaptive methods if the data volume grows cubically. Naturally, adaptivity in 2D can also gain performance, but because of the trade-off between operation count and regular data access (Section 1.1.1 on page 2) the advantages are less striking. Initial work on hardware efficient adaptive 3D methods has already been performed and will be continued, but it is an extensive topic of its own facing many new challenges and therefore is not part of this thesis.


1.2 Thesis Guide

This is a quick guide dealing with issues which concern the thesis as a whole.

1.2.1 Thesis Structure

The chapters begin with a list of contents, figures and tables, and a sketch of the discussed topics. The last section in each chapter contains a detailed summary of the themes with accurate references to the previous sections. It may serve as a good overview for someone familiar with the subject. For a first orientation we give very concise information on the contents and prerequisites of the following chapters. For a summary of the results we refer to Section 1.3 on page 8 at the end of this chapter.

2. PDE Solvers in Quantized Image Processing
   Here we introduce the PDE models for image processing and analyze the properties of the discrete, quantized solvers. The chapter assumes a general understanding of PDEs and Finite Element (FE) methods. It is fairly independent of the other material. Mathematicians not interested in the reasoning about the implementation may want to proceed, after the quantized scheme analysis in each section, directly to the corresponding result sections in Chapter 4.

3. Data Processing
   This chapter explains the reasons for the memory gap problem and how the different computing paradigms deal with it. It describes the suitability of various hardware architectures for image processing algorithms and thus motivates the selected devices used in Chapter 4. The chapter is basically self-contained and assumes only very basic knowledge about computer systems.

4. Hardware Efficient Implementations
   Chapter 4 picks up the quantized schemes from Chapter 2 and discusses their efficient implementation under the considerations from Chapter 3. For readers interested mainly in the implementations it is probably best to first get an idea of the continuous model properties in Section 2.1 on page 15, and then continue with the corresponding implementation section in Chapter 4. The chapter does not assume familiarity with the non-standard architectures used, but for those new to this subject we recommend reading Chapter 3 for a broader picture and a better understanding.

1.2.2 Index and Acronyms

The index contains a list of key words. Referenced occurrences of these terms appear in italic in the text. Bold italic marks the main reference position for a term. At this position the meaning and context of the term can be found.


Table 1.1 General notation.

Symbol          Example                                                  Explanation
u, φ            u(x) = x²                                                continuous functions
U, Φ            U(x) = Σ_α Ū_α Θ_α(x), (Θ_α)_α basis                     discrete functions
Ū, Φ̄            Ū = (Ū_(0,0), ..., Ū_(Nx−1,0), ..., Ū_(Nx−1,Ny−1))ᵀ      nodal vectors
α, β            α = (α_x, α_y) = (1, 2)                                  2D multi-indices
Q               Q = { n/255 | n = 0, ..., 255 }                          quantized number system
⊕, ⊖, ⊙, ⊘      V̄ ⊕ W̄                                                   quantized arithmetic
eval_Q(term)    eval_Q(V̄ − W̄) = V̄ ⊖ W̄                                   quantized evaluation
=_Q             √(1/2) =_Q 180/255                                       quantized evaluation
g(.)            g(x) = exp(−c_g x)                                       functions
f[.]            f[u](x) = ∫₀ˣ u(y) dy                                    operators

Symbol          Definition                                               Explanation
𝟙               𝟙(x) := x, (𝟙Ū)_α := Ū_α                                 continuous, discrete identity
0̄, 1̄            0̄ := (0, ..., 0)ᵀ, 1̄ := (1, ..., 1)ᵀ                     zero and one vector
#               #I := min_{f : I → ℕ injective} max f(I)                 number of elements in a set
δ_αβ            δ_αβ := { 1 if α = β; 0 else }                           Kronecker symbol
diag            diag(L) := (δ_αβ L_αβ)_αβ                                diagonal of a matrix
supp            supp(u) := { x ∈ Ω | u(x) ≠ 0 }, u : Ω → ℝ               support of a function

Acronyms are written out in the long form at least once, when they first appear in the text. Their meaning is also explained in an alphabetical listing on page 201 just before the index.

1.2.3 Notation

Table 1.1 summarizes the general notation. Most continuous quantities involved in the PDE models are typed in small letters, e.g. u. The corresponding discrete functions are set in capital letters (U). The nodal vectors defining the discrete functions are marked by a bar (Ū) and addressed (Ū_α) by two dimensional multi-indices, e.g. α = (α_x, α_y) ∈ ℕ × ℕ. For further details we refer to Section 2.2.1.2 on page 26. The nodal vectors are generally assumed to be already quantized elements of some quantization Q. Quantized arithmetic operations are denoted by a generic symbol from {⊕, ⊖, ⊙, ⊘} (Section 2.2.3.1 on page 36). Quantized evaluations of terms are written as eval_Q(term) or =_Q (Section 2.2.3.2 on page 38).
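The quantized arithmetic can be illustrated with a toy model of Q = { n/255 | n = 0, ..., 255 }. This is a sketch only: the rounding mode (round to nearest, ties to even, Python's default) and the helper names `quantize` and `qmul` are assumptions for illustration, not the definitions from Section 2.2.3.1.

```python
from fractions import Fraction

SCALE = 255  # Q = { n/255 | n = 0, ..., 255 }

def quantize(x):
    """eval_Q: round an exact value to the nearest element of Q, clamped to [0, 1]."""
    n = min(max(round(Fraction(x) * SCALE), 0), SCALE)
    return Fraction(n, SCALE)

def qmul(a, b):
    """A quantized product: the exact product, rounded back into Q."""
    return quantize(a * b)

half = quantize(Fraction(1, 2))
print(half)              # → 128/255: 1/2 itself is not an element of Q
print(qmul(half, half))  # → 64/255: the exact square 16384/65025 rounded into Q
```

Every intermediate result is forced back onto the 256-element grid, which is the source of the accumulating roundoff discussed above.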


Table 1.2 International decimal and binary prefixes.

Decimal factor   Decimal name   Decimal symbol   Binary factor   Binary name   Binary symbol
10³              kilo           k                2¹⁰             kibi          Ki
10⁶              mega           M                2²⁰             mebi          Mi
10⁹              giga           G                2³⁰             gibi          Gi
10¹²             tera           T                2⁴⁰             tebi          Ti
10¹⁵             peta           P                2⁵⁰             pebi          Pi
10¹⁸              exa           E                2⁶⁰             exbi          Ei

We distinguish between functions g(u(x)) and operators f[u](x) with different brackets. The first case is actually a composition of functions (g ∘ u)(x) and represents the value of the function g at the position u(x). In the second case f is an operator which takes the function u as an argument and returns a new function f[u] as a result. This new function is then evaluated at x. For clarity, we keep up this distinction in the discrete case, e.g. G(V̄), L[V̄], although a discrete operator could also be seen as a high dimensional function of the finitely many vector components.
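The bracket convention maps directly onto the distinction between first-order and higher-order functions. The definitions of `u`, `g` and `f` below are illustrative stand-ins, not the functions or operators from the thesis.

```python
# Functions versus operators, mirroring the convention g(u(x)) vs f[u](x).
def u(x):
    return x ** 2            # a function, u(x) = x^2

def g(y):
    return y + 1             # g acts on the VALUE u(x): g(u(x)) = (g ∘ u)(x)

def f(func):
    """f acts on the FUNCTION u and returns a new function f[u]."""
    def f_of_u(x):
        return func(x) + func(-x)   # an example operator: symmetrization
    return f_of_u

print(g(u(3)))    # → 10: composition, g evaluated at the number u(3) = 9
print(f(u)(3))    # → 18: the new function f[u] evaluated at x = 3
```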

1.2.4 Binary Prefixes

There is confusion about the use of the Système International d'Unités (SI) prefixes in computer science, where typically the binary system is used and byte (B) or bit (b) quantities are multiples of powers of 2. When the referred numbers were small, the capitalization of 1kB = 1000B to 1KB = 1024B provided an appropriate distinction, but this does not apply to the larger prefixes, e.g. 1MB has been used for both 1,000,000B and 1024²B = 1,048,576B depending on the context. Since 1999 there exists an international standard on the naming of 2¹⁰ⁿ factors [IEC, 2000]. Table 1.2 presents the decimal and binary symbols and names. Initially the standard was generally ignored, but the approval of the symbols by the Linux developers in 2001 and the trial-use announcements by the Institute of Electrical and Electronics Engineers (IEEE) and the American National Standards Institute (ANSI) in 2002 have helped to widen its popularity. We make use of the handy distinction throughout the work and also translate quantities from other sources into this terminology. Where the sources are not totally clear about which prefixes were meant, we have made a choice to the best of our knowledge. Due to the organization of memory (see Section 3.1.1 on page 71), sizes of memory chips, caches and sizes of data objects stored therein use almost exclusively the binary factors, while

bandwidth and throughput are expressed with the decimal factors, because they are based on frequencies given in MHz or GHz. For mass storage devices the situation is inconsistent: the capacity of hard discs, DVDs and most other disk or tape devices is given in the decimal system, while the dimensions of CDs and flash memory based devices (USB sticks, CompactFlash cards, etc.), and the file sizes in operating systems, are calculated in the binary system, but displayed misleadingly with the decimal factors. The confusion is complete for floppy disks, where '1.44MB' means neither 1.44MB nor 1.44MiB but 1,440KiB, which is twice the capacity of old 720KiB disks, derived as 512B (sector size) · 9 (sectors/track) · 80 (tracks) · 2 (sides) = 720KiB. Although current disk and tape devices also use binary sector sizes like 1KiB or 4KiB for the smallest storable data block, similar to the floppy disk their arrangement does not depend on powers of two, so the more marketing friendly decimal factors are used.
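The floppy disk arithmetic above can be checked by formatting the same byte count with both prefix families. The helper `fmt` is a hypothetical illustration; the '1.44MB' disk doubles the 9 sectors/track of the 720KiB disk to 18.

```python
# Formatting a byte count with decimal (SI) and binary (IEC) prefixes.
DECIMAL = [("GB", 10**9), ("MB", 10**6), ("kB", 10**3)]
BINARY  = [("GiB", 2**30), ("MiB", 2**20), ("KiB", 2**10)]

def fmt(n_bytes, prefixes):
    for symbol, factor in prefixes:
        if n_bytes >= factor:
            return f"{n_bytes / factor:.2f} {symbol}"
    return f"{n_bytes} B"

# The '1.44MB' floppy: 512 B * 18 sectors/track * 80 tracks * 2 sides.
floppy = 512 * 18 * 80 * 2
print(floppy // 1024, "KiB")      # → 1440 KiB
print(fmt(floppy, DECIMAL))       # → 1.47 MB
print(fmt(floppy, BINARY))        # → 1.41 MiB
```

Neither 1.44MB nor 1.44MiB matches; the marketing figure 1.44 arises only from the mixed unit 1440 · 1024 B / 10⁶.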

1.3 Summary

The main contribution of the thesis is the close coupling of the numerical scheme design with the implementation. Careful treatment of roundoff behavior in the discrete PDE solvers secures qualitatively accurate results despite strong quantization, and thorough consideration of the performance characteristics of the hardware architectures delivers high performance. The numerical and hardware aspects, which are usually dealt with separately by mathematicians and computer scientists, receive here an integral treatment to the benefit of both performance and accuracy. The following sections summarize the individual chapters. For a very concise overview and prerequisites of the chapters see Section 1.2.1 on page 5.

1.3.1 PDE Solvers in Quantized Image Processing

We deal with three important application areas of image processing: denoising, segmentation and registration (Section 2.1 on page 15). The solution to these problems is often needed in real-time, so the requirements on performance are very high. Each task is modeled by a PDE and an approximate, numerical solution can be obtained after time and space discretization (Section 2.2 on page 24). Bounds on the difference between the continuous and discrete solution exist, but these bounds assume error-free computations. In real computer systems quantization comes into play as a third source of error. It is very important to realize that even high precision floating point formats such as long double (s63e15) can easily lead to completely wrong results for simple computations (Section 2.2.2.1 on page 27). In PDE solvers many iterative computations must be performed, so that one cannot trust in high precision number formats alone. Deterministic, probabilistic and empirical methods can be used to derive bounds for the quantization errors of algorithms. But because of the very high performance requirements, image processing algorithms tend to use

low precision fixed point number systems (Section 2.2.2.2 on page 30). For these systems it is impossible to guarantee numerical stability of iterative schemes in either a deterministic or a probabilistic sense (Section 2.2.2.3 on page 33). In practice, empirical evidence from error simulators and test suites is used on a case by case basis to estimate the effects of quantization for a given algorithm. But this is a time consuming and intransparent procedure. Moreover, the empirical evidence cannot guarantee the desired behavior for all possible input data. The main result of Chapter 2 is the design of discrete schemes which can guarantee a certain behavior of the quantized PDE solvers despite strong quantization. The key idea is to preserve the desired global properties of the continuous model rather than to try to control the accuracy of individual computations. Hereby, the factorization and aggregation of numbers on different scales and the interactions between different node values deserve special attention (Sections 2.3.1 on page 43, 2.4.1 on page 55, 2.5.1 on page 59). In particular, a careful implementation of the matrix vector product as a main ingredient of the discrete schemes is decisive. The involved symmetric matrices often have unit column sums (∀j : Σᵢ Aᵢⱼ = 1), preserving the overall mass (sum of the vector components) in a matrix vector product (Eq. 2.26 on page 39). In low precision arithmetic underflows can easily violate the mass preservation and other global properties, and the iterative nature of the algorithm quickly accumulates these errors to produce visual artefacts (Figure 4.8 on page 140). The mass-exact matrix vector product (Section 2.2.3.3 on page 39) guarantees the mass preservation irrespective of the used quantization. Moreover, it is well suited for the field of image processing as it operates with highest accuracy around edges (Section 2.2.3.5 on page 42).
When used in the anisotropic diffusion scheme, the mass-exact matrix vector product also secures other important properties of the quantized scale-space (Section 2.3.3 on page 49), most notably the extremum principle (Eq. 2.53 on page 51) and Lyapunov functionals (Eq. 2.57 on page 52). The derivation of the quantized scale-space is a very satisfactory result, as it inherits almost all of the properties of the continuous scale-space of the anisotropic diffusion operator (Section 2.1.1.2 on page 17). In particular, the decrease of energy and central moments and the increase of entropy follow for arbitrary quantizations. In case of the level-set equation used for segmentation we have put the focus on the best possible resolution of different velocities during the evolution of the interface. Clearly the quantization restricts the number of distinct velocities, but scaling schemes can ensure that the faster moving parts evolve with the highest available precision. Moreover, the stationary asymptotic behavior equals that of the continuous model (Section 2.4.3 on page 58). The multi-scale regularization of the registration problem makes the quantized scheme very robust against quantization errors. Cutting the precision of the used number format in half hardly changes the results (Section 4.1.5.2 on page 147). This is also achieved by the robust diffusion schemes used in the regularization at various stages. So despite the high complexity of the algorithm and numerous intermediate result stages, the quality of the low precision results is not corrupted by roundoff errors. The design of the robust schemes partly depends on a space discretization with an equidistant grid (Section 2.2.1.2 on page 26). This impedes dynamic adaptivity, but trading operation

count for a regular memory access pattern (Section 1.1.1 on page 2) is often advantageous, because the regular data-flow of the image processing applications (Section 2.6 on page 63) makes them suitable for highly parallel architectures.
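The effect of naive rounding on mass preservation can be demonstrated in a few lines. This is a simplified illustration of the idea behind a mass-exact product, not the scheme from Section 2.2.3.3: here the rounding residue of each column is simply handed to the last row (which may leave that entry off the 8-bit grid), whereas the thesis develops its own correction; matrix, vector and function names are illustrative.

```python
from fractions import Fraction

SCALE = 255  # 8-bit fixed point: representable values n/255

def quantize(x):
    """Round an exact value to the nearest representable n/255."""
    return Fraction(round(x * SCALE), SCALE)

def matvec_naive(A, v):
    """Round every product A[i][j]*v[j] independently; mass can drift."""
    n = len(v)
    return [sum(quantize(A[i][j] * v[j]) for j in range(n)) for i in range(n)]

def matvec_mass_preserving(A, v):
    """Per column, give the accumulated rounding residue to the last row,
    so column j contributes exactly v[j] (unit column sum times v[j])."""
    n = len(v)
    out = [Fraction(0)] * n
    for j in range(n):
        residue = v[j]
        for i in range(n - 1):
            q = quantize(A[i][j] * v[j])
            out[i] += q
            residue -= q
        out[n - 1] += residue
    return out

# A symmetric matrix with unit column sums (mass preserving in exact arithmetic).
A = [[Fraction(3, 4), Fraction(1, 4), Fraction(0)],
     [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)],
     [Fraction(0),    Fraction(1, 4), Fraction(3, 4)]]
v = [Fraction(10, 255), Fraction(77, 255), Fraction(200, 255)]

mass = sum(v)
print(sum(matvec_naive(A, v)) == mass)             # → False: a rounding underflow lost mass
print(sum(matvec_mass_preserving(A, v)) == mass)   # → True: column sums are exact
```

Iterating the naive product accumulates one such loss per step, which is exactly the mechanism behind the visual artefacts of Figure 4.8.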

1.3.2 Data Processing

The exponential growth of transistors on the same area of silicon has influenced the characteristics of memory and computing logic differently (Section 3.1 on page 71). This is not a necessity; economical forces steer the development to a great extent. With smaller feature sizes the speed of PEs and the size of Dynamic RAM (DRAM) chips have risen exponentially, but the timings of the memory core which stores the 0s and 1s have improved comparatively little (Section 3.1.1.1 on page 72). An order of magnitude lies between each of the times needed for the addition of two numbers, their transport to the processor, and their localization in memory in case of a random memory access. The problem of diverging memory and logic performance, labeled the memory gap, grows each year, and from an economic point of view there is no remedy in sight. Current improvements benefit mainly bandwidth, while latency in comparison even worsens (Section 3.1.1.2 on page 76). For optimal performance almost all latencies can be hidden, but this requires a predictable data-flow of the algorithm. Processing of data in streams and maximal data reuse in memory hierarchies (Section 3.1.3 on page 81) have become very beneficial for performance. In view of the memory gap, Chapter 3 examines different computing paradigms (Section 3.2 on page 84), parallelization options (Section 3.2.2 on page 87), the status quo of current hardware architectures (Section 3.3.1 on page 97) and the future evolution of computing machines (Section 3.3.2 on page 103). The resource utilization of multiple PEs with different types of parallelism and the minimization of bandwidth and latency requirements play a decisive role in these sections. But in many cases the availability of high level programming tools, rather than the characteristics of the hardware, matters for the economic success of an architecture.
The easy serial programming model has so far mainly prevailed over performance considerations, but the sheer number of new architectures (Sections 3.2.4 on page 93, 3.2.5 on page 96), many of which are available as commercial products, indicates that the performance gains are too large to be ignored any longer. The quantized schemes for the image processing tasks from Chapter 2 are particularly suitable for acceleration on these new devices.

1.3.3 Hardware Efficient Implementations

Three data-stream-based (DSB) architectures have been used for the implementation of the image processing applications: GPUs, a Field Programmable Gate Array (FPGA) and the eXtreme Processing Platform (XPP). The distribution of the solvers on the architectures reflects their availability and ease of operation. GPUs are the most common and inexpensive, and have been used for several implementations. Availability of FPGAs is high, but it is not so much the costs of the hardware as those of the development tools that pose higher barriers to common

usage. Finally, the coarse-grain XPP array has only been available for a limited period of time as a clock cycle exact simulator, but now a PCI card with that device can be purchased. The early implementations of the diffusion models and the level-set equation in graphics hardware (Sections 4.1.3 on page 127 and 4.1.4 on page 132) demonstrated the applicability of GPUs as scientific coprocessors and inspired other work in that area. The virtual signed 16 bit format (Figure 4.10 on page 141) was the first to overcome the general 8 bit limitation of GPUs, providing more suitability for scientific computations (Figure 4.11 on page 142). On newer hardware more advanced numerical concepts such as multi-scale representation, multi-grid solvers and adaptive time-step control have been realized for the registration problem (Section 4.1.5 on page 143). Although the other architectures offer even higher resource utilization, the GPU now has the advantage of high level access to its functionality, which greatly facilitates the programming. The complex algorithm for the registration would be very difficult to code on an FPGA or the XPP. The FPGA implementation of the level-set equation exploits the full flexibility of the low level reconfigurability (Section 4.2.3 on page 161). Application specific data paths and operations, and variable precision arithmetic bring the small device (approx. one third of a DX8 GPU) into the lead of much larger and higher clocked processors. More recent FPGAs can further increase the performance of such algorithms by utilizing embedded hardwired multipliers, which otherwise consume a lot of configurable resources. The size of the XPP64-A1 architecture is comparable to a DX8 GPU (approx. 50M transistors), but its flexible access to massive parallelism beats the competitors (Section 4.3.3 on page 175).
Similar to the FPGA, the whole scheme is merged into a fully pipelined super computation unit which delivers a result pixel in each clock cycle.

Despite the low precision number formats available on the devices, the special quantized schemes from Chapter 2 make it possible to retain the main properties of the continuous models. At the same time the implementations can outperform a comparable micro-processor by factors of 5 to 20. Thus the goal of fast low precision computations with preserved global properties has been reached. The current hardware development and the better scalability of the architectures suggest that these factors will grow further in the future. Concurrently, the complexity of the problems will increase, so the availability of hardware efficient high level languages will gain even more relevance for the broad success of these architectures.

Acknowledgments

First of all my thanks go to my advisor Martin Rumpf, who encouraged me to work on this interdisciplinary topic and supported me in the sometimes difficult advancement of the research. I am particularly grateful for his initiative towards the use of different hardware architectures, which led to a broader and more abstract perspective on the topic. During the work I have received plenty of help from my colleagues and the administration of

the Institute of Applied Mathematics at the University of Bonn and the Numerical Analysis and Scientific Computing group at the University of Duisburg-Essen. In Bonn, Ralf Neubauer and Michael Spielberg were very helpful in designing the first implementations, Thomas Gerstner worked with me on volume rendering and Jochen Garcke often helped with computer problems. In the difficult beginning the former PhD students Martin Metscher and Olga Wilderotter offered advice. Both in Bonn and Duisburg the development of the applications was expertly supported by Tobias Preusser, Marc Droske and Ulrich Clarenz. Udo Diewald was an unebbing source of helpful tips and information, and Ulrich Weikard, my room mate in Duisburg, was always ready for an inspiring discussion. He and Martin Lenz proof-read and discussed the thesis with me. Martin's scientific vicinity to parts of the thesis also helped me with some complex improvements of it. Birgit Dunkel, our ever helpful secretary in Duisburg, deserves special mention. Many thanks to all my colleagues for their different contributions to the success of this work.

I am also grateful to many people at other institutions who have helped me in various ways. The successful research with Oliver Junge and Michael Dellnitz from the University of Paderborn brought about my first scientific publication and fortified my decision towards this thesis. The German National Academic Foundation supported me during my PhD, allowing me to participate in many stimulating discussions. Heinz-Josef Fabry from the Faculty of Catholic Theology at the University of Bonn was my consultant and gave me much insight outside of the mathematical universe. The work of Matthias Hopf and Thomas Ertl from the University of Stuttgart made me aware of the looming computing potential in graphics hardware. The fruitful discussions with Matthias also helped me with the peculiarities of early graphics hardware.
Markus Ernst, Steffen Klupsch and Sorin Huss from the Technical University of Darmstadt introduced me to the world of Reconfigurable Logic. Markus and Steffen directly contributed to the thesis by coding the hardware configuration for the level-set equation on an FPGA. The company PACT XPP Technologies from Munich granted me a temporary free research license to work with their XPP architecture, and Prashant Rao readily answered my questions concerning its programming. In late 2002 I joined the caesar research center in Bonn directed by Karl-Heinz Hoffmann, who has generously supported my further research. Although the recent results are not directly included in this thesis, my coworkers have positively influenced its contents. I am indebted to Marcus Magnor and Ivo Ihrke from the Max Planck Institute for Computer Science in Saarbrücken, Alexandru Telea from the Technical University of Eindhoven, Aaron Lefohn and John Owens from the University of California, Davis, Ross Whitaker from the University of Utah, Salt Lake City, and Christoph Garbe from the University of Heidelberg for inspiring discussions.

In such long lasting work not only the scientific but also the social support is of great importance. Special thanks to my friends and above all my parents and brothers and sisters who accompanied me through the highs and lows. Their constant encouragement made this work possible.

2 PDE Solvers in Quantized Image Processing

Contents

2.1 Continuous PDE Based Image Processing ...... 15
2.1.1 Denoising - Diffusion Processes ...... 15
2.1.1.1 Generic Diffusion Model ...... 15
2.1.1.2 Scale-Space ...... 17
2.1.1.3 Related Work ...... 19
2.1.2 Segmentation - Level-Set Methods ...... 19
2.1.2.1 Curve Evolution ...... 19
2.1.2.2 Hamilton-Jacobi Equation ...... 21
2.1.2.3 Related Work ...... 21
2.1.3 Registration - Energy Gradient Flows ...... 22
2.1.3.1 Energy Model ...... 22
2.1.3.2 Multi-Scale Hierarchy ...... 23
2.1.3.3 Related Work ...... 24
2.2 Discretization - Quantization ...... 24
2.2.1 Discrete Operators ...... 24
2.2.1.1 Time Discretization ...... 25
2.2.1.2 Space Discretization ...... 26
2.2.2 Quantization ...... 26
2.2.2.1 Floating Point Numbers ...... 27
2.2.2.2 Fixed Point Numbers ...... 30
2.2.2.3 Numerical Stability ...... 33
2.2.3 Quantized Operators ...... 36
2.2.3.1 Arithmetic Operations ...... 36
2.2.3.2 Composed Operations ...... 38
2.2.3.3 Matrix Vector Product ...... 39
2.2.3.4 Scaling ...... 41
2.2.3.5 Backward Error ...... 42


2.3 Anisotropic Diffusion ...... 43
2.3.1 FE Scheme ...... 43
2.3.1.1 Discrete Schemes ...... 44
2.3.1.2 Weighted Stiffness Matrix ...... 45
2.3.2 Quantized Diffusion ...... 48
2.3.3 Quantized Scale-Space ...... 49
2.4 Level-Set Segmentation ...... 55
2.4.1 Upwind Scheme ...... 55
2.4.2 Quantized Propagation ...... 56
2.4.3 Asymptotic Behavior ...... 58
2.5 Gradient Flow Registration ...... 59
2.5.1 Multi-Grid Discretization ...... 59
2.5.2 Quantized Registration ...... 61
2.5.3 Quantized Descent ...... 62
2.6 Data-Flow ...... 63
2.6.1 Explicit and Implicit Schemes ...... 63
2.6.2 Data Locality ...... 65
2.6.3 Local Computations ...... 65
2.7 Conclusions ...... 67

Figures
2.1 An equidistant grid with enumerated nodes and elements ...... 27
2.2 The unit spheres of different vector norms in 2D ...... 57
2.3 High level data-flow of the PDE solvers ...... 64

This chapter contains the theoretical analysis of different PDE based image processing algorithms with respect to quantization. For a given quantization every operation can be performed only up to a certain precision. The aggregation of this effect is similar to the effects of the time and space discretization of the continuous models, and accordingly we are interested in the convergence and approximation behavior. In particular for fixed point numbers, which exhibit a large relative error in many computations, we find equivalent reformulations of the discrete solvers which reproduce their qualitative behavior even under strong quantization.

The problems of denoising, segmentation and registration are modeled by different differential equations, but their discretizations focus on the creation of a common regular data-flow. The choice of discretization is also influenced by the quantization, since an appropriate re-ordering of operations in the schemes makes it possible to retain higher accuracy. Because such issues are taken into account, the developed solvers map well onto hardware architectures; the performance benefits are discussed in Chapter 4 on page 107.


2.1 Continuous PDE Based Image Processing

Images undergo many modifications on the way from the acquisition to the output device. A typical series of processing stages includes:
• Reconstruction of the image data from the raw data of the acquisition device.
• Removal of unavoidable noise in the image due to the acquisition process.
• Identification of segments in the image which are of special interest to the application.
• Correlation of the image information with other images of similar content.
• Visualization of the image with regard to the characteristics of the human visual system.
Each of these tasks has drawn a lot of research attention. The first is a little less accessible, since the manufacturers of image acquisition devices usually know the peculiarities of their hardware best and already include the reconstruction in the device. The last task, on the other hand, is a very extensive field of its own, since the required form of data presentation varies widely from application to application. Therefore, we restrict ourselves to the three central tasks of image denoising, segmentation and registration.

Still, there have been numerous ways of approaching these problems. We concentrate on continuous PDE models which encode the nature of the problem implicitly in functional equalities. This general choice of methods is motivated in Section 1.1.3 on page 3 in the introduction. The following sections review the PDE models associated with the three aforementioned image processing tasks.

2.1.1 Denoising - Diffusion Processes

Image data often contains noise stemming from the acquisition or transmission process. If certain characteristics of the noise distribution are known, one usually incorporates appropriate filters into the acquisition or transmission process itself. In the absence of other a-priori knowledge, noise is often assumed to be normally distributed. A very simple but effective method for reducing such noise is the application of a Gaussian filter, which is related to a linear diffusion process. Unfortunately, this technique also reduces highly relevant edge information in the image. A feature sensitive denoising requires non-linear methods which aim at distinguishing between relevant image information, which should be protected, and noise fluctuations, which should be removed. The anisotropic diffusion model represents a class of these non-linear methods.

2.1.1.1 Generic Diffusion Model

We consider a function $u : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ on the domain $\Omega := [0,1]^2$ and an initial noisy image given as a function $u_0 : \Omega \to \mathbb{R}$. The idea is to evolve the initial image through a partial

differential equation such that in the sequence of the resulting images $u(t,\cdot)$, $t > 0$, the noise dissolves while edges are preserved. The generic diffusion model reads:

$$\partial_t u - \operatorname{div}(G(\nabla u_\sigma)\,\nabla u) = 0 \quad \text{in } \mathbb{R}^+ \times \Omega, \tag{2.1}$$
$$u(0) = u_0 \quad \text{on } \Omega,$$
$$G(\nabla u_\sigma)\,\nabla u \cdot \nu = 0 \quad \text{on } \mathbb{R}^+ \times \partial\Omega,$$
where $G : \mathbb{R}^2 \to \mathbb{R}^{2\times 2}$ is the diffusion tensor depending on a mollification $u_\sigma$ of $u$. We may distinguish three major choices for $G$:

• Linear diffusion model $G(\nabla u_\sigma) := \mathbb{1}$
In this case we have the linear heat equation:

$$\partial_t u - \Delta u = 0.$$
The solution can be obtained by convolution (denoted by '$*$')

$$u(t) = \mathcal{G}_{0,\sqrt{2t}} * u_0,$$
with the Gaussian function

$$\mathcal{G}_{\mu,\sigma}(y) := (2\pi\sigma^2)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\,\frac{(y-\mu)^2}{\sigma^2}\right). \tag{2.2}$$
The parameter $\mu$ is not required here, but will be needed later in a different context (Eq. 2.6 on page 20). According to this model the intensities in the image dissolve homogeneously in every direction and with equal rate at every position. This model is used for fast and simple smoothing and regularization, e.g. the mollification $u_\sigma$ may be performed with the linear model:

$$u_\sigma := u(\sigma^2/2) = \mathcal{G}_{0,\sigma} * u_0.$$

• Non-linear diffusion model $G(\nabla u_\sigma) := \tilde{g}(\|\nabla u_\sigma\|)\,\mathbb{1}$
Here the diffusion tensor depends on a non-negative, monotone decreasing scalar function $\tilde{g} : \mathbb{R} \to \mathbb{R}^+_0$. A typical choice for $\tilde{g}$ is the Perona-Malik function
$$\mathcal{P}_{c_{\tilde{g}}}(x) := \frac{1}{1 + c_{\tilde{g}}\,x^2} \tag{2.3}$$

or an exponential equivalent like $\exp(-c_{\tilde{g}}\,x)$, such that for a large gradient modulus $\|\nabla u_\sigma\|$ the function $\tilde{g}$ has small values, meaning a small diffusion rate, and for a small gradient modulus large values, meaning a high diffusion rate. The mollification $u_\sigma$ helps to distinguish edges from singular high gradients due to noise. Consequently the diffusion rate in this model depends non-linearly on the modulus of the gradient of intensities at each position, but the diffusion direction is homogeneous.


• Anisotropic diffusion model $G(\nabla u_\sigma) := B(\nabla u_\sigma)^\top\, g(\|\nabla u_\sigma\|)\, B(\nabla u_\sigma)$
This extends the previous model to a direction dependent diffusion rate. Now $g$ is a diagonal matrix with independent diffusion coefficients for the direction along the smoothed gradient and orthogonal to it, while $B$ is the transformation from the vectors in the canonical basis to the vectors in the gradient-normal basis:

$$B(\nabla u_\sigma) := \frac{1}{\|\nabla u_\sigma\|}\begin{pmatrix} \partial_x u_\sigma & \partial_y u_\sigma \\ \partial_y u_\sigma & -\partial_x u_\sigma \end{pmatrix}, \qquad g(\|\nabla u_\sigma\|) := \begin{pmatrix} g_1(\|\nabla u_\sigma\|) & 0 \\ 0 & g_2(\|\nabla u_\sigma\|) \end{pmatrix}. \tag{2.4}$$
Typically $g_1$ is chosen as the Perona-Malik function $\mathcal{P}$ like in the previous case and $g_2(x) = c_{g_2}$ is some positive constant. In contrast to the previous model, this one can smooth distorted edges, because diffusion will only be inhibited in the direction orthogonal to the edge, while the smoothing along the edge will still take place in the non-linear fashion.
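As an illustration of Eq. 2.4, the following sketch (ours, not from the thesis; function names are hypothetical) assembles the anisotropic diffusion tensor $G = B^\top g B$ for a single pixel, using the Perona-Malik function of Eq. 2.3 for $g_1$ and a constant $g_2$:

```python
import numpy as np

def perona_malik(x, c=1.0):
    # g_1 from Eq. (2.3): small diffusivity for large gradient modulus x
    return 1.0 / (1.0 + c * x * x)

def diffusion_tensor(gx, gy, c=1.0, c_g2=1.0, eps=1e-12):
    # G = B^T g B from Eq. (2.4) for one pixel with smoothed gradient (gx, gy);
    # eps regularizes the division where the gradient vanishes
    norm = np.sqrt(gx * gx + gy * gy) + eps
    B = np.array([[gx, gy], [gy, -gx]]) / norm
    g = np.diag([perona_malik(norm, c), c_g2])
    return B.T @ g @ B
```

For a horizontal gradient $(g_x, g_y) = (10, 0)$ the tensor is (up to `eps`) diagonal, with a small entry $g_1(10) \approx 0.0099$ across the edge and the full rate $c_{g_2} = 1$ along it, i.e. diffusion is inhibited only orthogonally to the edge.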

2.1.1.2 Scale-Space

The solution $u : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ can be regarded as a multi-scale of successively diffused images $u(t)$, $t \ge 0$, which is called the scale-space of $u_0$. We want to study the properties of this scale-space in the form of the corresponding scale-space operator $S_t$, which is defined as

$$S_t[u_0] := u(t).$$

See Section 1.2.3 on page 6 concerning the square bracket notation. $S_t$ is thus the solution operator of the diffusion problem. We list several interesting properties of this operator. Details and proofs can be found in [Weickert, 1998], but many of the properties are straightforward consequences of the fact that the diffusion process is in divergence form and the diffusion tensor $G(\nabla u_\sigma)$ depends solely on $\nabla u_\sigma$. The following properties hold true for an arbitrary function $v \in L^2(\Omega)$, all $x \in \mathbb{R}$ and all $t, s \ge 0$ unless otherwise stated.
• Semigroup property
$$S_0 = \mathbb{1}, \qquad S_{t+s} = S_t \circ S_s.$$
• Grey level shift invariance
$$S_t[0] = 0, \qquad S_t[v + c] = S_t[v] + c, \quad \text{for } c \in \mathbb{R}.$$
• Reverse contrast invariance
$$S_t[-v] = -S_t[v].$$


• Average grey level invariance
$$\mathcal{M}[S_t[v]] = \mathcal{M}[v],$$
where $\mathcal{M} : L^1(\Omega) \to L^1(\Omega)$ is the averaging operator defined by
$$(\mathcal{M}[v])(x) := \frac{1}{|\Omega|}\int_\Omega v(y)\, dy.$$
• Translation invariance
$$(S_t \circ \tau_p)[v] = (\tau_p \circ S_t)[v],$$
for any translation $(\tau_p[v])(x) := v(x + p)$ with $\operatorname{supp} \tau_p[v], \operatorname{supp} (\tau_p \circ S_t)[v] \subseteq \Omega$.
• Isometry invariance
$$(S_t \circ R)[v] = (R \circ S_t)[v],$$
for any orthogonal transformation $R \in \mathbb{R}^{2\times 2}$ defined by $(R[v])(x) := v(Rx)$ with $\operatorname{supp} R[v], \operatorname{supp} (R \circ S_t)[v] \subseteq \Omega$.
• Extremum principle
$$\operatorname{ess\,inf}_\Omega v \le S_t[v(x)] \le \operatorname{ess\,sup}_\Omega v.$$

• Lyapunov functionals
For $v \in L^2(\Omega)$ and $r \in C^2[\operatorname{ess\,inf}_\Omega v, \operatorname{ess\,sup}_\Omega v]$ with $r'' \ge 0$, the functional
$$\Phi[t, v] := \int_\Omega r(S_t[v(x)])\, dx$$
is a Lyapunov functional:
$$\Phi[t, v] \ge \Phi[0, \mathcal{M}[v]] \quad \forall t > 0,$$
$$\Phi[\cdot, v] \in C[0, \infty) \cap C^1(0, \infty),$$
$$\partial_t \Phi[t, v] \le 0 \quad \forall t > 0.$$
If $r'' > 0$ on $[\operatorname{ess\,inf}_\Omega v, \operatorname{ess\,sup}_\Omega v]$ then $\Phi[t, v]$ is a strict Lyapunov functional:
$$\Phi[0, v] = \Phi[0, \mathcal{M}[v]] \iff v = \mathcal{M}[v] \text{ a.e. on } \Omega,$$
$$\partial_t \Phi[t, v] = 0 \iff \Phi[t, v] = \Phi[0, \mathcal{M}[v]] \iff S_t[v] = \mathcal{M}[v] \text{ on } \Omega \quad \forall t > 0,$$
$$v = \mathcal{M}[v] \text{ a.e. on } \Omega \text{ and } \Phi[0, v] = \Phi[t, v] \text{ for } t > 0 \iff S_s[v] = \mathcal{M}[v] \text{ on } \Omega,\ \forall s \in (0, t].$$
• Convergence
$$\lim_{t \to \infty} \|S_t[v] - \mathcal{M}[v]\|_p = 0 \quad \text{for } p \in [1, \infty).$$
We have listed these properties in such detail because later we want to show that many of these continuous properties can be preserved in equivalent form in the discrete-quantized setting (Section 2.3.3 on page 49).
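Several of these properties can be checked directly on a discrete scheme. A minimal sketch (ours, not the thesis' quantized scheme): one explicit step of 1D linear diffusion with zero-flux boundaries preserves the average grey level exactly and, for $\tau \le 1/2$, obeys the extremum principle, since each new value is a convex combination of old ones:

```python
import numpy as np

def heat_step(u, tau):
    # explicit Euler step of u_t = u_xx (grid size 1) with mirrored boundaries
    up = np.pad(u, 1, mode='edge')
    return u + tau * (up[2:] - 2.0 * u + up[:-2])
```

The zero-flux (mirror) boundary makes the discrete Laplacian sum to zero, which is the discrete analogue of the average grey level invariance above.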


2.1.1.3 Related Work

The non-linear diffusion models as we know them today were first introduced by the work of Perona and Malik [Perona and Malik, 1990]. Their model denoises images while retaining and enhancing edges. But analysis of the Perona-Malik model showed its mathematical ill-posedness [Kawohl and Kutev, 1998; Kichenassamy, 1997; You et al., 1996]. A regularized model [Catté et al., 1992] was derived, which can still preserve edges for an appropriately chosen regularization parameter. A thorough, rigorous axiomatic theory of these methods under the term scale-spaces was given in [Alvarez et al., 1993]. Different choices of the non-linearity, and especially the use of the so-called structure tensor for direction sensitive smoothing of images, were presented in [Weickert, 1998].

2.1.2 Segmentation - Level-Set Methods

The distribution of intensities in an image is usually perceived as a collection of objects by humans. Our visual system performs the task of segmenting the data into regions of certain characteristics. The fact that these regions often represent known objects simplifies the task. But even when meaning cannot be assigned, we still distinguish regions by certain features and sometimes even complete them in our mind despite the absence of the corresponding information in the image. The human low-level task of finding homogeneous regions of some characteristic is coupled with the recognition of objects. Computer segmentation, however, seeks to perform the low-level task using appearance criteria only. The main difficulty lies in the appropriate encoding of the target objects' characteristics into these criteria.

A very flexible method to account for these characteristics is the controlled expansion of a curve. Starting with a small circle, such a curve is supposed to expand up to the borders of the segment in which it lies. Different object characteristics can be encoded by steering the expansion velocity of the curve with the local image values, their derivatives and the form of the curve. The level-set method is an implicit formulation of this expansion process, which naturally incorporates curve splitting and merging.

2.1.2.1 Curve Evolution

We are given the original image $p : \Omega \to \mathbb{R}$ on $\Omega := [0,1]^2$ and the initial curve $C_0 : [0,1] \to \Omega$. We represent this curve implicitly as the zero level-set $\{x \in \Omega \mid \phi_0(x) = 0\}$ of a continuous function $\phi_0 : \Omega \to \mathbb{R}$ defined on the whole domain $\Omega$. Then we ask for a solution $\phi : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ to the level-set equation
$$\partial_t \phi + f^\sigma[\phi] \cdot \nabla\phi = 0 \quad \text{in } \mathbb{R}^+ \times \Omega, \tag{2.5}$$
$$\phi(0) = \phi_0 \quad \text{on } \Omega,$$
where $f^\sigma[\phi]$ is the velocity field which drives the evolution of the level-set function $\phi$. In the case of a complex velocity field with varying directions on the boundary it is not obvious how to

define boundary conditions for $\phi$. Following [Sethian, 1999] we impose a mirror boundary condition for $\phi$ on $\mathbb{R}^+ \times \partial\Omega$. The velocity field is usually composed of three different forces
$$f^\sigma[\phi] = f_g^\sigma\, \frac{\nabla\phi}{\|\nabla\phi\|} + f_\kappa[\phi]\, \frac{\nabla\phi}{\|\nabla\phi\|} + f_V,$$
where $f_g^\sigma$ and $f_\kappa[\phi]$ are forces in the normal direction of the level-sets and $f_V$ is independent thereof. The square brackets denote operators, see Section 1.2.3 on page 6. The forces have different impact on the evolution:
• External forces $f_g^\sigma(t, x) = c(t, x) + g_1(p(x)) + g_2(\|\nabla p_\sigma(x)\|)$.
The first term $c(t, x)$ can prescribe a default expansion speed and will usually be simply a constant value. By using a characteristic function $\chi_R$ it can also favor the expansion in an a-priori known region of interest $R$. The other terms take into account local information of the image $p$ and may take the following form:

$$g_1(y) := \mathcal{G}_{\mu,c_{g_1}}(y) = (2\pi c_{g_1}^2)^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\,\frac{(y-\mu)^2}{c_{g_1}^2}\right), \tag{2.6}$$
$$g_2(y) := \mathcal{P}_{c_{g_2}}(y) = \frac{1}{1 + c_{g_2}\,y^2},$$

with the similar use of the Gaussian function $\mathcal{G}_{\mu,c_{g_1}}$ and the Perona-Malik function $\mathcal{P}_{c_{g_2}}$ as for the diffusion processes in Section 2.1.1.1 on page 15. The effect is to promote expansion in areas with intensities similar to $\mu$ and small mollified gradients. The mollification controlled by $\sigma$ prevents too high a sensitivity of the gradient to noise. Such an evolution produces segments with smoothly distributed intensity around $\mu$. Depending

on the choice of $c_{g_1}$ and $c_{g_2}$ the similarity to $\mu$ or the smoothness of the segment is more relevant.
• Curvature $f_\kappa[\phi](t, x) = -c_\kappa\, \kappa[\phi](t, x)$.
The curvature $\kappa$ in two dimensions is given as

$$\kappa[\phi] := \nabla \cdot \frac{\nabla\phi}{\|\nabla\phi\|} = \frac{\phi_{xx}\phi_y^2 - 2\phi_x\phi_y\phi_{xy} + \phi_{yy}\phi_x^2}{(\phi_x^2 + \phi_y^2)^{3/2}},$$
where subscripts indicate partial derivatives here. The inclusion of this term will hamper the formation of high curvature bends in the level-sets. This is especially useful in preventing the propagation front from expanding through a narrow hole in an almost complete segment border. By this one wants to mimic to a certain degree the human ability of form completion. The general effect is an evolution of smoother level-sets.
• Velocity field $f_V(t, x) = V(t, x)$.
Level-set methods are often used in the tracking of phase boundaries between different materials or material states. In these applications the whole system is usually subjected to forces which act on the interface independently of its form. Therefore, the possibility of an external velocity field interfering with the evolution is given. But also in the case
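The curvature formula translates directly into central differences. A sketch (ours; the periodic `np.roll` stencils mean only interior values are meaningful):

```python
import numpy as np

def curvature(phi, h=1.0, eps=1e-12):
    # kappa = (phi_xx phi_y^2 - 2 phi_x phi_y phi_xy + phi_yy phi_x^2) / |grad phi|^3
    px  = (np.roll(phi, -1, 1) - np.roll(phi, 1, 1)) / (2 * h)
    py  = (np.roll(phi, -1, 0) - np.roll(phi, 1, 0)) / (2 * h)
    pxx = (np.roll(phi, -1, 1) - 2 * phi + np.roll(phi, 1, 1)) / h**2
    pyy = (np.roll(phi, -1, 0) - 2 * phi + np.roll(phi, 1, 0)) / h**2
    pxy = (np.roll(np.roll(phi, -1, 1), -1, 0) - np.roll(np.roll(phi, -1, 1), 1, 0)
         - np.roll(np.roll(phi, 1, 1), -1, 0) + np.roll(np.roll(phi, 1, 1), 1, 0)) / (4 * h**2)
    return (pxx * py**2 - 2 * px * py * pxy + pyy * px**2) / (px**2 + py**2 + eps)**1.5
```

For $\phi$ the distance to the origin, the level-sets are circles, and the computed curvature at radius $r$ approaches $1/r$.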


of segmentation such a field may be useful to penalize the deviation of the segment from an a-priori given default form. This feature could also help to complete a segment in areas where the boundary information is missing in the data.
The solution $\phi : \mathbb{R}^+ \times \Omega \to \mathbb{R}$ of the problem represents a family of successively evolved level-sets $L(t, c) := \{x \in \mathbb{R}^2 \mid \phi(t, x) = c\}$, $c \in \mathbb{R}$. The curves $L(t, 0)$, $t \ge 0$, describe the evolution of our initial curve $C_0 = L(0, 0)$. In the asymptotic state $t \to \infty$ the curve $L(t, 0)$ converges to an equilibrium of forces or expands over the borders of the image. For applications, however, we are interested only in a finite and preferably short evolution time $T$, such that the process is stopped as soon as the changes in the level-sets become very small.

2.1.2.2 Hamilton-Jacobi Equation

By introducing the Hamilton function $H(u) := f_p\|u\| + f_V \cdot u$ we may reformulate the level-set equation in the form of a Hamilton-Jacobi equation

$$\partial_t \phi + H(\nabla\phi) = c_\kappa\, \kappa[\phi]\, \|\nabla\phi\|,$$
with a parabolic viscosity term on the right-hand side. Further setting $u := \nabla\phi$ and differentiating the above equation leads to the hyperbolic conservation law:

$$\partial_t u + \nabla(H(u)) = c_\kappa\, \nabla\!\left(\|u\|\, \nabla \cdot \frac{u}{\|u\|}\right). \tag{2.7}$$
Because of their importance in physics, conservation laws and their numerical implementations have been studied thoroughly [Corrias et al., 1995]. Therefore, this alternative perspective on the level-set equation helps us later to find the appropriate discretization.

2.1.2.3 Related Work

Active contour models, which describe the evolution of an initial curve driven by the underlying image, external data and the curve form, have become a very popular tool in image segmentation [Osher and Sethian, 1988]. The development started with explicitly parameterized curve representations. But the key disadvantage of this method is a topological constraint: the curve cannot split to approximate the boundaries of segments which are not simply connected. Such problems have been solved by introducing implicit models [Caselles et al., 1993; Malladi et al., 1995], in which the initial curve is interpreted as the zero level-set of a function defined on the whole domain. The evolution of this function is controlled by a PDE [Sethian, 1999]. The sensitivity of the process to the initialization has later been reduced by an appropriate choice of driving forces [Xu and Prince, 1998]. Furthermore, a general variational framework for Mumford-Shah and Geman type functionals [Geman et al., 1990; Mumford and Shah, 1985] has been introduced [Hewer et al., 1998], in which edge boundaries are represented by a discontinuous function obtained by the minimization of an energy functional.


2.1.3 Registration - Energy Gradient Flows

Images often come as a series of exposures of the same object from different times or perspectives, sometimes even acquired by different imaging technologies. The series are used for the analysis of temporal changes or of the overall structure of the object. The analysis requires an information correlation between two images. A correlation is ideally given by a deformation of the images which matches the object representation of the first image onto that of the second. The quality of the deformation depends on the definition of optimal matching criteria.

For a clearly arranged configuration the human visual system quickly suggests an appropriate deformation. But without any semantic knowledge about the content of the images, the matching criteria in the computer must rely solely on the image intensities and their geometric distribution, and therefore often do not suffice to pinpoint the optimal deformation, especially when the images contain a diffuse distribution of intensities. Then we are faced with a multitude of possible solutions and must introduce some reasonable assumptions on the deformation to distinguish among them. Energy gradient flows allow a very flexible way to define and control these additional assumptions, which are crucial in the search for a reasonable image correlation.

2.1.3.1 Energy Model

Given two images, a template and a reference $T, R : \Omega \to \mathbb{R}$, $\Omega \subset \mathbb{R}^2$, we look for a deformation $\phi : \Omega \to \Omega$ which maps the intensities of $T$ via $\phi$ to the intensities of $R$ such that

$$T \circ \phi \approx R.$$
Since the deformation is usually small in comparison to $|\Omega|$, it can be suitably expressed as $\phi = \mathbb{1} + u$, with a displacement function $u$. The displacement $u$ is sought as the minimum of the energy
$$E[u] = \frac{1}{2}\int_\Omega |T \circ (\mathbb{1} + u) - R|^2.$$
The square brackets are used for operators, see Section 1.2.3 on page 6. A minimizer $u$ in some Banach space $\mathcal{V}$ is characterized by the condition

$$\int_\Omega E'[u] \cdot \theta = 0,$$
for all $\theta \in [C_0^\infty(\Omega)]^2$, with the $L^2$-representation of $E'$

$$E'[u] = (T \circ (\mathbb{1} + u) - R)\, \nabla T \circ (\mathbb{1} + u),$$
if we assume $T$ to be smoothly differentiable. This gradient may be used as the descent direction towards a minimum in a gradient descent method. But there may be many minima, since any displacements within a level-set of $T$ do not change the energy. To guarantee that the

gradient descent converges to a unique solution, we therefore have to exclude solutions which contain irregular mappings within the level-sets of $T$ by applying a regularization technique. The descent along the gradient will be regularized by $A(\sigma)^{-1}$, with

$$A(\sigma) = \mathbb{1} - \frac{\sigma^2}{2}\Delta, \qquad \sigma \in \mathbb{R}^+.$$
Then the regularized gradient flow

$$\partial_t u = -A(\sigma)^{-1} E'[u] \quad \text{in } \mathbb{R}^+ \times \Omega, \tag{2.8}$$
$$u(0) = u_0 \quad \text{on } \Omega,$$
has a unique solution $u$ with $u(t) \in \mathcal{V}$ for some function space $\mathcal{V}$ (Theorem 3.1 in [Clarenz et al., 2002]). In the implementation the regularization with $A(\sigma)^{-1}$ can be quickly realized by a multi-grid cycle with few smoothing steps, because we do not require the exact solution but rather the smoothing properties.
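The thesis realizes $A(\sigma)^{-1}$ by a multi-grid cycle; as a small self-contained illustration (ours, with a hypothetical `regularize` helper), one can apply $(\mathbb{1} - \frac{\sigma^2}{2}\Delta)^{-1}$ in 1D with a dense solve and observe its smoothing, positivity-preserving and mean-preserving character:

```python
import numpy as np

def regularize(w, sigma, h=1.0):
    # solve (I - sigma^2/2 * Laplacian) v = w in 1D with zero-flux boundaries
    n = len(w)
    L = np.zeros((n, n))
    for i in range(n):
        L[i, i] -= 2.0
        L[i, max(i - 1, 0)] += 1.0
        L[i, min(i + 1, n - 1)] += 1.0
    A = np.eye(n) - (sigma**2 / 2.0) * L / h**2
    return np.linalg.solve(A, w)
```

Applied to a delta spike, the result is a positive, spread-out bump with the same total mass, which is exactly the smoothing behavior required of the regularized descent direction.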

2.1.3.2 Multi-Scale Hierarchy

Since the energy $E$ is non-convex, the gradient descent path may easily get trapped in local minima instead of finding the global minimum of $E$. Therefore, a continuous annealing method is used by defining a multi-scale of image pairs

$$T_\epsilon := S(\epsilon)\,T, \qquad R_\epsilon := S(\epsilon)\,R, \tag{2.9}$$

for $\epsilon \ge 0$ with a filter operator $S$. The choice $S(\epsilon) = A(\epsilon)^{-1}$ corresponds again to a Gaussian filtering. The energy

$$E_\epsilon[u] = \frac{1}{2}\int_\Omega |T_\epsilon \circ (\mathbb{1} + u) - R_\epsilon|^2$$
induces the corresponding gradient flow on scale $\epsilon$, which has the solution $u_\epsilon$.

For the annealing process we choose an exponentially decreasing series of scales $(\epsilon_k)_{k=0,\dots,K}$, $K \in \mathbb{N}$, with $\epsilon_0 = 0$, and use the approximate solution of the gradient flow problem from a coarser scale $\epsilon_k$ at a sufficiently large time $t_{\epsilon_k}$ as the initialization for the gradient flow problem on the finer scale $\epsilon_{k-1}$, i.e. we perform:

$$u_{\epsilon_K}(0) := 0, \qquad u_{\epsilon_{k-1}}(0) := u_{\epsilon_k}(t_{\epsilon_k}),$$
for $k = K, \dots, 1$ until the final solution $u_{\epsilon_0}(t_{\epsilon_0})$ on the finest scale $\epsilon_0$ is reached. In the implementation the multi-scales can be efficiently encoded in a multi-grid hierarchy.
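The benefit of this cascade can be seen on a toy 1D analogue (ours, not from the thesis): for the non-convex energy $E(x) = x^4 - 3x^2 + x$, Gaussian smoothing with variance $s$ yields $E_s(x) = x^4 + (6s - 3)x^2 + x + \text{const}$, which is convex for $s \ge 1/2$. Plain descent from $x = 2$ gets trapped in the local minimum near $x \approx 1.13$, while the coarse-to-fine cascade, warm-starting each finer scale, reaches the global minimum near $x \approx -1.30$:

```python
def grad_smoothed(x, s):
    # derivative of the Gaussian-smoothed energy E_s(x) = x^4 + (6s - 3)x^2 + x
    return 4.0 * x**3 + 2.0 * (6.0 * s - 3.0) * x + 1.0

def descend(s, x, lr=0.01, steps=2000):
    # plain explicit gradient descent on scale s
    for _ in range(steps):
        x -= lr * grad_smoothed(x, s)
    return x

x_direct = descend(0.0, 2.0)           # no annealing: trapped near +1.13
x = 2.0
for s in [1.0, 0.5, 0.25, 0.1, 0.0]:   # scales from coarse to fine, s_0 = 0
    x = descend(s, x)                  # warm-start the next finer scale
```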


2.1.3.3 Related Work

If we measure image coherence by the energy $E[u] = \frac{1}{2}\int_\Omega |T \circ \phi - R|^2$, where $T, R$ are the intensity maps of the images and $\phi$ the deformation, then the minimization problem is ill-posed, because arbitrary deformations within the level-sets of $T$ do not change the energy. Therefore, many regularizations of the registration problem have been discussed in the literature [Christensen et al., 1997; Davatzikos et al., 1996; Grenander and Miller, 1998; Maes et al., 1997; Thirion, 1998]. The registration approach given above was introduced in [Clarenz et al., 2002]. But the ideas are similar to the iterative Tikhonov regularization methods [Hanke and Groetsch, 1998], fast multi-grid smoothers [Henn and Witsch, 2001] and the multi-scale approach for large displacements [Alvarez et al., 2000] presented previously.

2.2 Discretization - Quantization

In this section we develop a common setting for the discretization and quantization of the continuous models. The similar treatment offers advantages for the analysis of the discrete algorithms and already pays attention to hardware issues by explicitly exposing the inherent parallelism of the algorithms. The similar structure of the algorithms is also important for the reuse of the same architecture for different image processing tasks.

Concerning the quantization, we give a short overview of different number formats, their shortcomings with respect to accuracy in computations, and attempts to remedy these problems. Then we concentrate on methods which help to conserve the qualitative behavior of the continuous models even under quantization that is so coarse that it defeats the direct methods of error enclosure or statistical control. Error analysis of these methods also shows the limitations of this approach.
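A tiny sketch (ours) of why strong fixed point quantization defeats naive error control: when every intermediate result is rounded to the nearest representable value, an update smaller than half the quantum is lost completely and the iteration stalls — a qualitative failure that no bound on the per-step rounding error reveals:

```python
def to_fixed(x, frac_bits=8):
    # round to the nearest fixed point value k * 2^-frac_bits
    q = 2.0 ** -frac_bits
    return round(x / q) * q

u = 0.5
for _ in range(100):
    u = to_fixed(u + 1e-3)  # increment below half the quantum 2^-8
# u is still exactly 0.5: the accumulated update of 0.1 never materializes
```

This is why the quantized schemes in this chapter are reformulated so that the qualitative behavior, rather than the per-operation error, is preserved.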

2.2.1 Discrete Operators

Although the introduced models for denoising, segmentation and registration are fairly different, they have the common form:

$$\partial_t u + \mathcal{F}[\mathcal{C}^p_\sigma[u], u] = 0 \quad \text{in } \mathbb{R}^+ \times \Omega, \tag{2.10}$$
$$u(0) = u_0 \quad \text{on } \Omega,$$
with an additional condition on $\mathbb{R}^+ \times \partial\Omega$ for the generic diffusion model and the level-set equation, which involve spatial derivatives of the unknown. Here, $\mathcal{C}^p_\sigma[u]$ and $\mathcal{F}[u, v]$ are operators

on the Banach space $\mathcal{V}$. For the discussed problems they read:

    problem        equation              $\mathcal{C}^p_\sigma[u]$     $\mathcal{F}[u, v]$
    denoising      Eq. 2.1 on page 16    $G(\nabla u_\sigma)$          $-\operatorname{div}(u\,\nabla v)$     (2.11)
    segmentation   Eq. 2.5 on page 19    $f^\sigma[u]$                 $u \cdot \nabla v$
    registration   Eq. 2.8 on page 23    $A(\sigma)^{-1} E'[u]$        $u$

The non-linear operator $\mathcal{C}^p_\sigma[u]$ serves as a local classifier, i.e. the function $u$ is locally classified with respect to the desired evolution. The indices $p$ and $\sigma$ indicate that the classifier may depend on additional input data $p$ and regularization parameters $\sigma$, which control the data sensitivity of the process. The result of the operator can be considered as a weight function, which determines the magnitude of impact of the regions of $u$ at the current time $t$ on the evolution of $u$. The operator $\mathcal{F}[u, v]$ performs the application of these weights. It may still involve spatial derivatives, but has to be linear in $u$ and either also linear in $v$ or independent of $v$. The unified PDE scheme for the problems (Eq. 2.10 on the facing page) helps us to define a common discretization approach.
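The common structure of Eq. 2.10 suggests one generic solver kernel that is reused for all three problems. A minimal sketch (ours; names hypothetical), instantiated with a 1D non-linear diffusion as the classifier/weight-application pair:

```python
import numpy as np

def evolve_step(u, classifier, apply_weights, tau):
    # one explicit Euler step of  du/dt + F[C[u], u] = 0
    return u - tau * apply_weights(classifier(u), u)

def classifier(u, c=1.0):
    # C[u]: Perona-Malik weights evaluated on cell-face gradients
    ux = np.diff(u)
    return 1.0 / (1.0 + c * ux**2)

def apply_weights(w, u):
    # F[w, u] = -(w u_x)_x in divergence form with zero-flux boundaries
    flux = w * np.diff(u)
    out = np.zeros_like(u)
    out[:-1] -= flux
    out[1:] += flux
    return out
```

The divergence form makes the step conservative: the average grey level is preserved exactly, regardless of the non-linear weights.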

2.2.1.1 Time Discretization

We use explicit and implicit schemes for the time discretization. The simple explicit Euler scheme gives:
$$u^{n+1} = u^n - \tau^n \cdot \mathcal{F}[\mathcal{C}^p_\sigma[u^n], u^n]. \tag{2.12}$$
This scheme has the advantage of fast execution and allows good control of quantization effects, but is quite restricted in the time-step width $\tau^n$. The Courant-Friedrichs-Lewy (CFL) condition requires that
$$\|\tau^n \cdot \mathcal{F}[\mathcal{C}^p_\sigma[u^n], u^n]\|_\infty < h, \tag{2.13}$$
if $h$ is the element grid size in the spatial discretization, i.e. information is prohibited from crossing a whole grid cell in one time-step. We will use the explicit discretization for the level-set method in the segmentation problem and the gradient descent in the registration problem.
The fully implicit Euler scheme
$$u^{n+1} + \tau^n \cdot \mathcal{F}[\mathcal{C}^p_\sigma[u^{n+1}], u^{n+1}] = u^n \tag{2.14}$$
would give us unconditional stability, but the implicit treatment of the non-linearities in $\mathcal{C}^p_\sigma$ would require very sophisticated and computationally intensive solvers. Instead, we take a middle course by treating the linear terms implicitly and the non-linearities explicitly. This semi-implicit scheme gives us good stability and leads to a linear instead of a non-linear equation system after spatial discretization:
$$u^{n+1} + \tau^n \cdot \mathcal{F}[\mathcal{C}^p_\sigma[u^n], u^{n+1}] = u^n.$$
We will use this scheme for some of the diffusion models.
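Eq. 2.13 turns into a one-line time-step rule; a sketch (ours, with an assumed safety factor):

```python
import numpy as np

def cfl_timestep(F_u, h, safety=0.9):
    # largest tau with tau * ||F||_inf < h (CFL, Eq. 2.13), scaled by `safety`
    m = np.max(np.abs(F_u))
    return safety * h / m if m > 0 else h
```

In an explicit solver this is evaluated once per iteration on the current right-hand side, so the time-step adapts to the momentary speed of the evolution.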


2.2.1.2 Space Discretization

An equidistant quad mesh underlies our spatial discretizations. The reason for this choice lies in the tremendous parallelization and pipelining opportunities resulting from this simple structure. Certainly, the missing adaptivity requires a finer overall grid size $h$, but the simply structured algorithms can be mapped onto the hardware architectures so well that the gained performance almost always outweighs the computing time associated with the additional degrees of freedom. Moreover, input data in image processing naturally comes as rectangular images without any geometry beneath. In cases where input data comes from simulations on complicated, possibly unstructured grids, one may consider a fast resampling in the hardware onto a Cartesian grid before the processing [Westermann, 2001].
A hardware efficient approach to adaptivity is the division of the domain into blocks, where each block contains an equidistant grid, or more generally a deformed tensor grid. The domain is then processed block by block such that each time the operations are performed on a regular array of data and only boundary values must be appropriately exchanged. This technique provides a transition layer from fully adaptive meshes to hardware accelerated adaptive computing and with the appropriate choice of parameters can be used for different hardware architectures [H.Becker et al., 1999]. For parallel computers it has the additional advantage that the regular form of the blocks facilitates domain decomposition and load balancing. In the following we do not develop such a layer but rather concentrate on the optimal discretization and quantization of the individual blocks.
The domain $\Omega = [0, 1]^2$ is discretized by an $N_x \times N_y$ equidistant grid $\Omega_h$ with grid size $h$. The nodes of the grid are indexed by a 2-dimensional multi-index $\alpha = (\alpha_x, \alpha_y) \in (0, \dots, N_x - 1) \times (0, \dots, N_y - 1)$. This grid is used as a basis of either Finite Difference or Finite Element schemes.
The bilinear discrete functions, e.g. $U : \Omega_h \to \mathbb{R}^+_0$, are denoted by capital letters. The images of size $(N_x, N_y)$ are represented by the corresponding nodal vectors, e.g. $(\bar U_\alpha)_\alpha$ with $\bar U_\alpha := U(h \cdot \alpha)$. Figure 2.1 on the next page shows the indices of the grid, and the local situation with local index offsets $\gamma := \beta - \alpha$ and the neighboring elements $E^\gamma_\alpha$ for a given node $\alpha$.
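As an illustration of this nodal indexing, the sketch below (assuming a row-major memory layout, which the thesis does not prescribe) maps a multi-index $\alpha$ to a flat position in the nodal vector and enumerates the $3 \times 3$ neighborhood offsets $\gamma$:

```python
import numpy as np

Nx, Ny, h = 4, 3, 1.0 / 4

def flat(ax, ay):
    # row-major position of node alpha = (ax, ay) in the nodal vector
    return ay * Nx + ax

# nodal vector of U(x, y) = x + y sampled at the grid nodes h * alpha
U_bar = np.array([h * ax + h * ay for ay in range(Ny) for ax in range(Nx)])

# local offsets gamma = beta - alpha to the 8 neighbors and the node itself
offsets = [(gx, gy) for gy in (-1, 0, 1) for gx in (-1, 0, 1)]

ax, ay = 2, 1                       # an interior node
neighbors = [U_bar[flat(ax + gx, ay + gy)] for gx, gy in offsets]
print(len(neighbors))               # the 9 values of the 3x3 stencil
```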

2.2.2 Quantization

Quantization refers to the representation of a continuous quantity by a finite set of values. The values are often equally distributed over the range of the continuous quantity. Certainly, we commit an error in such a process. In digital systems quantization error arises from the quantization of the input data, the quantization of the problem parameters and roundoff errors, i.e. quantization of intermediate results. Because quantization is a necessity, both the data [Gray and Neuhoff, 1998] and the arithmetic error [Bornemann et al., 2004; Higham, 2002] have been studied intensely, and the references above cover a wide range of topics and point to an extensive additional literature. Although there are software packages which offer number formats of arbitrary precision or


Figure 2.1 On the left an equidistant $N_x \times N_y$ grid enumerated by a tuple, on the right the neighboring elements of a node and the local offsets to neighboring nodes.

algebraic number representations, one almost always opts for a compromise between accuracy and performance, because time is often a critical factor in applications. In image processing in particular, performance often has highest priority and the quality of results depends on global features rather than exact computations in detail. Because quantization of the analog signals occurs during the acquisition process of images, we will be mainly concerned with the roundoff error in the processing. Depending on the actual quantization, the effects of the roundoff as the cause of error in arithmetic operations and the interaction of these effects can be quite different. In the following we will consider the most common quantizations with fixed precision, namely the floating point and the fixed point number formats, their shortcomings and their contributions to numerical stability.

2.2.2.1 Floating Point Numbers

Floating point formats are the most common number representations in scientific computing. Their great advantage is that the separate representation of an exponent allows them to cover a large number range with relatively few bits. A floating point number system $\mathcal{Q}_{\mathrm{FP}}(\beta, t, e_{\min}, e_{\max}) \subset \mathbb{Q}$ is given by

(2.15)   $\mathcal{Q}_{\mathrm{FP}}(\beta, t, e_{\min}, e_{\max}) := \{ m \cdot \beta^{e-t} \mid m \in [0, \beta^t - 1] \cap \mathbb{N},\ e \in [e_{\min}, e_{\max}] \cap \mathbb{Z} \},$

where the variables have the following meaning:
• $\beta$, the base or radix. It defines the base of the number system used for the floating point numbers, typically $\beta = 2$.
• $e$, the exponent. It describes the scale on which a number resides. The range of the exponent $[e_{\min}, e_{\max}]$


depends on the number of bits reserved for its representation and the number of special cases, e.g. $e_{\min} = -126$, $e_{\max} = 127$ in the IEEE-754 single precision format. The exponent is represented as an unsigned 8 bit number $\bar e \in [0, 255]$ biased by $-127$. The value $\bar e = 0$ is reserved for denormalized numbers (see below), and $\bar e = 255$ for infinite or undefined numbers.
• $m$, the significand or mantissa. It represents the equidistantly spaced floating point numbers on a given scale.
• $t$, the precision. It describes the fineness of the number spacing on a given scale.

To ensure a unique representation one requires that for non-zero numbers $\beta^t > m \ge \beta^{t-1}$ holds, and speaks of a normalized representation. In case of base 2, the most significant bit of $m$ in the normalized representation is always one and thus need not be stored; one says that the hidden bit is 1. Many floating point number systems support so-called denormalized numbers with a hidden bit 0, indicated by a special value of the exponent. Denormalized numbers represent additional floating point numbers which lie between zero and the smallest normalized number. We will not further elaborate on these special cases indicated by reserved exponent values.
A typical example of a fixed precision floating point format is the IEEE-754 s23e8 single precision format with 1 sign bit, 23 bits of mantissa and 8 bits for the exponent. The format uses the base $\beta = 2$ and has the precision $t = 24$, which is one larger than the number of bits reserved for the mantissa, because of the hidden bit.
We define the increment operator and the decrement operator on $\mathcal{Q}$, which define the succeeding or preceding number respectively:

$\mathrm{inc}_{\mathcal{Q}}(x) := \min \{ q \in \mathcal{Q} \mid q > x \},$
$\mathrm{dec}_{\mathcal{Q}}(x) := \max \{ q \in \mathcal{Q} \mid q < x \}.$

Then $\mathrm{inc}_{\mathcal{Q}}(0)$ is the smallest representable positive non-zero number and $\mathrm{inc}_{\mathcal{Q}}(\mathrm{inc}_{\mathcal{Q}}(0))$ the next one. Now the question arises how the distances between succeeding numbers vary, and the following definition gives the answer. The machine epsilon $\varepsilon_{\mathcal{Q}}$ is defined as the distance from 1.0 to the next larger floating point number:

(2.16)   $\varepsilon_{\mathcal{Q}} := \mathrm{inc}_{\mathcal{Q}}(1.0) - 1.0,$

and since 1.0 is represented as $\beta^{t-1} \cdot \beta^{1-t}$ we have

$\varepsilon_{\mathcal{Q}} = \beta^{1-t}.$

This quantity describes the spacing of floating point numbers on a given scale, i.e. numbers on the scale $e$ lie in $[\beta^{e-1}, \beta^e)$ and have the equidistant spacing $\varepsilon_{\mathcal{Q}} \cdot \beta^{e-1} = \beta^{e-t}$, e.g. numbers on the scale 1 lie in $[1.0, \beta)$ and have the spacing $\varepsilon_{\mathcal{Q}}$.
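These spacing formulas can be checked directly with IEEE-754 arithmetic ($\beta = 2$, $t = 24$ for single and $t = 53$ for double precision); the sketch below uses NumPy's `nextafter` as the increment operator $\mathrm{inc}_{\mathcal{Q}}$:

```python
import numpy as np

# machine epsilon as the distance from 1.0 to the next larger float (Eq. 2.16)
eps_single = np.nextafter(np.float32(1.0), np.float32(2.0)) - np.float32(1.0)
eps_double = np.nextafter(1.0, 2.0) - 1.0
print(eps_single == 2.0 ** (-23))    # beta^(1-t) with t = 24
print(eps_double == 2.0 ** (-52))    # beta^(1-t) with t = 53

# spacing on scale e: numbers in [2^(e-1), 2^e) are eps * 2^(e-1) apart
x = 8.0                              # scale e = 4, since x lies in [8, 16)
spacing = np.nextafter(x, 16.0) - x
print(spacing == eps_double * 8.0)   # eps_Q * beta^(e-1) = beta^(e-t)
```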


The general question of how well we can approximate real numbers with the floating point number system across all scales is answered by the following quantity. The unit roundoff $u_{\mathcal{Q}}$ is the minimal upper bound on the relative error in approximating a real number:

(2.17)   $u_{\mathcal{Q}} := \min \left\{ u \in \mathbb{R} \;\middle|\; \forall x \in W_{\mathcal{Q}}\ \exists q \in \mathcal{Q} : \frac{|x - q|}{|x|} < u \right\},$
         $W_{\mathcal{Q}} := \left( \mathbb{R} \cap [\min_{\mathcal{Q}}, \max_{\mathcal{Q}}] \right) \setminus [\mathrm{dec}_{\mathcal{Q}}(0), \mathrm{inc}_{\mathcal{Q}}(0)].$

Since $\frac{|x-q|}{|x|}$ is invariant to scaling, we can rescale an arbitrary $x \in W_{\mathcal{Q}} \setminus \mathcal{Q}$ by powers of $\beta$ such that $\tilde x \in (1.0, \beta)$; if $x \in \mathcal{Q}$ then the relative error is zero. The floating point numbers in $[1.0, \beta)$ are equidistantly spaced with $\varepsilon_{\mathcal{Q}}$ and therefore there is a $\tilde q \in \mathcal{Q}$ such that $|\tilde x - \tilde q| \le \frac{1}{2} \varepsilon_{\mathcal{Q}}$. The inequality $\frac{|\tilde x - \tilde q|}{|\tilde x|} < \frac{1}{2} \varepsilon_{\mathcal{Q}}$ becomes strict, because $\tilde x \in (1.0, \beta)$. Thus we have

(2.18)   $u_{\mathcal{Q}} = \frac{1}{2} \varepsilon_{\mathcal{Q}} = \frac{1}{2} \beta^{1-t}.$

It is important to realize that floating point numbers only control the relative error $\frac{|x-q|}{|x|}$ very well, but the absolute error $|x - q|$ varies strongly, from as little as $\frac{1}{2} \beta^{e_{\min}-t}$ to as much as $\frac{1}{2} \beta^{e_{\max}-t}$. Most of the error analysis of floating point arithmetic therefore relies on the assumption:

$x \circledast y = (x \circ y)(1 + \varepsilon), \qquad \circ \in \{+, -, \cdot, /\},\ |\varepsilon| \le u_{\mathcal{Q}},$

where $\circledast$ denotes the corresponding approximate operation in floating point numbers. This means that the approximate result of the computation $x \circledast y$ has a relative error $\varepsilon$ bounded in the same way as if the result were exactly computed and then rounded to the nearest floating point number.
There exists a considerable number of results deduced from this assumption and we refer to [Higham, 2002] for an extensive discussion of them. Although computations in common floating point units fulfill stronger assumptions on the properties of floating point operations, these additional properties are seldom used to prove better error bounds [Priest, 1992]. But despite the fact that the machine results are often more accurate than the rigorous estimates predict, the finite precision arithmetic still bears some vicious roundoff effects. In the following we sketch some of the problems in floating point roundoff. We consider the function

(2.19)   $f(x, y) := (333.75 - x^2) y^6 + x^2 (11 x^2 y^2 - 121 y^4 - 2) + 5.5 y^8 + x/(2y)$

at $x_0 = 77617$, $y_0 = 33096$. This is a modified version [Loh and Walster, 2002] of Rump's function [Rump, 1988]. A straight-forward C++ program compiled with the GNU Compiler 2.95.3 on a Pentium 4 Linux system, with the powers expanded to multiplications, gives:

float (s23e8):   $f(x_0, y_0) =_{\mathcal{Q}} 1.1726$,
double (s52e11):   $f(x_0, y_0) =_{\mathcal{Q}} 1.17260394005318$,


long double (s63e15):   $f(x_0, y_0) =_{\mathcal{Q}} 1.172603940053178631$,

where $=_{\mathcal{Q}}$ denotes the evaluation under the respective quantization $\mathcal{Q}$. We are inclined to trust the result, since increased precision only appends further decimal positions. But all results are wrong and even the sign is wrong. A variable precision interval arithmetic tool [Ely, 1990] encloses the result in the interval

$[-0.827396059946821368141165095479816292005,$
$\ -0.827396059946821368141165095479816291986].$

Because floating point numbers work very well most of the time, we do not expect such sudden difficulties, even though the unavoidable rounding with roundoff errors as the actual cause of such problems is clear to us. The erroneous results stem from the handicap of the fixed precision floating point format, in which the exponent allows the representation of numbers on many scales, but their interference leads to roundoff errors, which can be dramatically enlarged by cancellation, e.g. for s23e8 floats we have:

additive roundoff      $a =_{\mathcal{Q}} 10^8 + 4 = 10^8$,
multiplicative roundoff   $b =_{\mathcal{Q}} (10^4 + 2) \cdot (10^4 - 2) = 10^8$,
cancellation           $c \in \{a, b\}$: $(c - 10^8) \cdot 10^8 =_{\mathcal{Q}} 0$.

We see how cancellation easily promotes the absolutely ($4$) and relatively ($4 \cdot 10^{-8}$) small additive and multiplicative errors to an absolute error of $4 \cdot 10^8$ and an infinite relative error. We also notice that the order of operations may play an important role in computations. Although in the above example changing the order to $(10^8 - 10^8 + 4) \cdot 10^8$ easily solves the problem, there is no general rule for avoiding the cancellation in concurrent additions [Higham, 1993].
However, despite the unavoidable roundoff one can still obtain reliable results with computer arithmetic. So-called enclosure methods can provide rigorous bounds for the result of an algorithm, including all intermediate rounding errors. They often use directed rounding modes, exact dot products and variable precision arithmetic to achieve this goal. An overview of the literature on these verified numerical results can be found in [Bohlender, 1996]. In our case of low precision implementations we would need unreasonably many computations to prevent the interval bounds derived in such methods from expanding to the whole number range.
We will use floating point arithmetic in the registration problem, not so much because of the exponent but rather because of a lack of higher precision fixed point formats in graphics hardware. Fixed point formats may be seen as a special case of floating point formats with zero bits for the exponent. Then all numbers are represented on the same scale and the problems with roundoff errors become rather different.
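The failure of (2.19) is easy to reproduce; this sketch evaluates the modified Rump function in IEEE-754 double precision, where the massive cancellation silently discards the constant $-2$ hidden among the huge terms and leaves essentially only $x/(2y)$, so even the sign of the result is wrong (the exact value is $\approx -0.82739606$):

```python
x, y = 77617.0, 33096.0

# powers expanded to multiplications, as in the C++ program quoted above
x2 = x * x
y2 = y * y
y4 = y2 * y2
y6 = y4 * y2
y8 = y4 * y4
f = (333.75 - x2) * y6 + x2 * (11.0 * x2 * y2 - 121.0 * y4 - 2.0) \
    + 5.5 * y8 + x / (2.0 * y)

print(f > 0)                              # True: the computed sign is wrong
print(abs(f - x / (2.0 * y)) < 1e-6)      # only the x/(2y) term survived
```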

2.2.2.2 Fixed Point Numbers

Fixed point computations require only integer arithmetic and are therefore extremely fast and efficient in terms of hardware resource requirements. However, they represent numbers on a

single scale and therefore quickly suffer from over- or underflow. Similar to the floating point number system we define the fixed point number system $\mathcal{Q}_{\mathrm{FX}}(\beta, t, \varepsilon) \subset \mathbb{Q}$ as

(2.20)   $\mathcal{Q}_{\mathrm{FX}}(\beta, t, \varepsilon) := \{ m \cdot \varepsilon \mid m \in [-\beta^t, \beta^t - 1] \cap \mathbb{Z} \},$

with $\beta$ as the base, $t$ the precision and $\varepsilon$ the scale. The numbers discretize the interval $[-\beta^t \cdot \varepsilon, (\beta^t - 1) \cdot \varepsilon]$ in equidistant steps of $\varepsilon$. In analogy to the floating point format (Eq. 2.16 on page 28) we will call the quantity $\varepsilon_{\mathcal{Q}}$ associated with the quantization $\mathcal{Q}$ the machine epsilon, and we have two major choices:

• $\varepsilon_{\mathcal{Q}} = \beta^{e-t}$.
The exponent $e < t$ gives the number of positions in the integral part, and $t - e$ gives the number of positions in the fractional part. The represented numbers have the form $m \cdot \beta^{e-t}$ and contain the floating point numbers with precision $t$ on scale $e$, but each lower scale $e_1 \in [e - t + 1, e - 1]$ is represented with the lower precision $t - (e - e_1)$. The problem with this format is the missing representation of the integer $\beta^e$. When we want to cover the interval $[-1, 1]$ (i.e. $e = 0$), this means that 1.0 cannot be represented. In floating point formats this is not a problem, because 1.0 is represented on the next scale, but here we have only one scale, and choosing $e = 1$ would mean that we reduce the precision on scale 0 for the interval $[-1, 1]$ by a factor of $\beta$.

• $\varepsilon_{\mathcal{Q}} = 1/(\beta^{t-e} - 1)$.
This format intends to overcome the problem of the non-representation of $\beta^e$, especially in the case $\beta^0 = 1.0$, at the cost of a non-power-of-two machine epsilon. It has the disadvantage that very few fractions can be represented exactly ($\beta^{t-e} - 1$ usually has few factors) and rounding in multiplications is more complicated. This choice also leads to a precision dependent over-representation outside of the integral interval $[-\beta^e, \beta^e]$, producing overflows at artificial values. But one can clamp the result to the above interval, and moreover in the most common case $e = 0$ there is only one over-represented number, namely $-\beta^t/(\beta^t - 1)$.
The great advantage of this format is definitely the exact representation of $-\beta^e, 0, \beta^e$, in particular $-1, 0, +1$ in the special case $e = 0$. The representation of the neutral elements of addition, multiplication and its inverse guarantees that in repeated additions and multiplications certain areas of the computational domain can remain unchanged or only change the sign, without having to specify a different operation on them.
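The difference between the two scalings can be made concrete for an unsigned 8 bit format ($\beta = 2$, $t = 8$, $e = 0$, our own illustration values); exact rational arithmetic shows that $\varepsilon = 2^{-8}$ cannot represent 1.0 while $\varepsilon = 1/(2^8 - 1)$ can, at the cost of losing simple fractions like 1/2:

```python
from fractions import Fraction

t = 8
eps_pow2 = Fraction(1, 2**t)         # eps = beta^(e-t): steps of 1/256
eps_frac = Fraction(1, 2**t - 1)     # eps = 1/(beta^(t-e) - 1): steps of 1/255

def representable(x, eps):
    # x is in the format iff x/eps is an integer m in [-beta^t, beta^t - 1]
    m = Fraction(x) / eps
    return m.denominator == 1 and -2**t <= m <= 2**t - 1

print(representable(1, eps_pow2))              # False: needs m = 256 > 255
print(representable(1, eps_frac))              # True: 1 = 255/255 exactly
print(representable(Fraction(1, 2), eps_pow2)) # True: 128/256
print(representable(Fraction(1, 2), eps_frac)) # False: 255/2 is not an integer
```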
In contrast to the small relative error of approximating real numbers by the floating point number system (Eq. 2.18 on page 29), the unit roundoff $u_{\mathcal{Q}}$ for a fixed point quantization is quite irrelevant, because the choice $x = \frac{3}{2} \varepsilon_{\mathcal{Q}}$ in Eq. 2.17 on page 29 implies that

(2.21)   $u_{\mathcal{Q}} > \frac{1}{3}.$

This means that fixed point formats have a very bad relative error $\frac{|x-q|}{|x|}$ in the approximation of real numbers, and instead we should consider the absolute error $|x - q|$.
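The bound (2.21) can be checked numerically: for $x = \frac{3}{2} \varepsilon_{\mathcal{Q}}$ the nearest grid point is off by $\varepsilon_{\mathcal{Q}}/2$, a relative error of exactly $1/3$, while for numbers far from zero the relative error is tiny (a quick check, assuming nearest rounding):

```python
EPS = 2.0 ** (-9)                    # fixed point format Q_FX(2, 9, 2^-9)

def rel_err_fx(x):
    q = round(x / EPS) * EPS         # nearest point on the eps-grid
    return abs(x - q) / abs(x)

print(abs(rel_err_fx(1.5 * EPS) - 1/3) < 1e-12)  # the worst case 1/3 is attained
print(rel_err_fx(0.75) < 0.01)                   # large x: small relative error
```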


Obviously we can define a mapping $\mathrm{round}^{\mathcal{Q}}_\mu : \mathbb{R} \to \mathcal{Q}$, called the rounding function, such that for all $x \in \mathbb{R} \cap [\min_{\mathcal{Q}}, \max_{\mathcal{Q}}]$ we have

$|\mathrm{round}^{\mathcal{Q}}_\mu(x) - x| < \varepsilon_{\mathcal{Q}}.$

The subscript $\mu \in d \cup n$ indicates the different rounding modes. The directed rounding modes $d := \{0, \infty, -\infty, +\infty\}$ with rounding towards zero, rounding towards infinity, rounding towards minus infinity, rounding towards plus infinity, and the nearest rounding modes $n := \{n_\uparrow, n_\downarrow, n_2\}$ are defined by

(2.22)   $-\varepsilon_{\mathcal{Q}} < |\mathrm{round}^{\mathcal{Q}}_0(x)| - |x| \le 0,$
         $\varepsilon_{\mathcal{Q}} > |\mathrm{round}^{\mathcal{Q}}_\infty(x)| - |x| \ge 0,$
         $-\varepsilon_{\mathcal{Q}} < \mathrm{round}^{\mathcal{Q}}_{-\infty}(x) - x \le 0,$
         $\varepsilon_{\mathcal{Q}} > \mathrm{round}^{\mathcal{Q}}_{+\infty}(x) - x \ge 0,$
         $-\varepsilon_{\mathcal{Q}}/2 \le \mathrm{round}^{\mathcal{Q}}_n(x) - x \le \varepsilon_{\mathcal{Q}}/2.$
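On a fixed point grid $\varepsilon_{\mathcal{Q}} \cdot \mathbb{Z}$ the modes of (2.22) reduce to integer roundings of $x/\varepsilon_{\mathcal{Q}}$; a minimal sketch (the mode names and helper are ours):

```python
import math

EPS = 2.0 ** (-9)                    # machine epsilon of Q_FX(2, 9, 2^-9)

def round_fx(x, mode):
    m = x / EPS                      # position on the eps-grid
    if mode == "zero":               # towards zero
        k = math.trunc(m)
    elif mode == "inf":              # away from zero
        k = math.copysign(math.ceil(abs(m)), m)
    elif mode == "minus":            # towards minus infinity
        k = math.floor(m)
    elif mode == "plus":             # towards plus infinity
        k = math.ceil(m)
    elif mode == "nearest_up":       # nearest, ties rounded up (n_up)
        k = math.floor(m + 0.5)
    elif mode == "nearest_even":     # nearest, ties to even (n_2)
        k = round(m)                 # Python's round is round-half-to-even
    return k * EPS

x = 3.5 * EPS                        # exactly halfway between 3*EPS and 4*EPS
print(round_fx(x, "zero") == 3 * EPS)
print(round_fx(x, "plus") == 4 * EPS)
print(round_fx(x, "nearest_up") == 4 * EPS)
print(round_fx(x, "nearest_even") == 4 * EPS)
print(round_fx(2.5 * EPS, "nearest_even") == 2 * EPS)
```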

The nearest modes are distinguished by the choice of rounding $\varepsilon_{\mathcal{Q}}/2$ up or off. The usual approach is either to constantly round up ($n_\uparrow$) or to round to even ($n_2$), such that for uniformly distributed numbers 50% of the $\varepsilon_{\mathcal{Q}}/2$ cases would be rounded up and the other 50% rounded off. Rounding to even has some advantages when considering consecutive rounding, because it prevents the so-called drift [Goldberg, 1991]. Constant off-rounding ($n_\downarrow$), however, has in some cases the great advantage that much fewer resources are needed for an associated multiplication with rounding (Section 4.2.3.1 on page 161). For the general error analysis, however, the choice of the tie-breaking strategy has little effect and most of the time we will refer to the nearest rounding modes simultaneously by the index $n$.
Even without a good global relative error bound (Eq. 2.21 on the preceding page), it is sometimes advantageous to consider the relative error relation:

(2.23)   $\mathrm{round}^{\mathcal{Q}}_\mu(x) = x \cdot \lambda_\mu(x),$

with $\lambda_\mu(x) \in \mathbb{R}^+_0$ and the bounds

(2.24)   $\lambda_0(x) \in [0, 1],$
         $\lambda_\infty(x) \in [1, \infty),$
         $\lambda_n(x) \in [0, 2].$

The first two bounds are clear from the definition. In case of nearest rounding, $\frac{\mathrm{round}^{\mathcal{Q}}_\mu(x)}{x}$ attains a maximum for the smallest $x$ which is rounded up, i.e.: $\frac{\mathrm{round}^{\mathcal{Q}}_\mu(\varepsilon_{\mathcal{Q}}/2)}{\varepsilon_{\mathcal{Q}}/2} = \frac{\varepsilon_{\mathcal{Q}}}{\varepsilon_{\mathcal{Q}}/2} = 2$. We obtain the interval $[0, 2)$ if we choose to round $\varepsilon_{\mathcal{Q}}/2$ down, but this is a negligible improvement. For the rounding modes $\{-\infty, +\infty\}$ there are no such bounds.
The hardware efficiency of fixed point computations has its price when it comes to roundoff errors. We review three major problems on the basis of the $\mathcal{Q}_{\mathrm{FX}}(2, 9, 2^{-9})$ quantization with the $(n_\uparrow)$ rounding mode:


• Underflow in multiplications: $2^{-5} \odot 2^{-5} =_{\mathcal{Q}} 0$.
This error occurs even though $2^{-5} = 1/32 = 0.03125$ is fairly large in comparison to $\varepsilon_{\mathcal{Q}} = 0.001953125$. Such underflow in particular violates the distributive law:

$a \odot \bigoplus_i b_i \;\neq_{\mathcal{Q}}\; \bigoplus_i a \odot b_i$

for positive $b_i$ and sufficiently small $a$. As we have seen above this can happen very easily, and the length of the sum further amplifies this effect.

• Overflow or saturation in additions: $(-0.5) \oplus (-0.7) =_{\mathcal{Q}} -1$.
Here we have applied saturation, which means that if the computed number would exceed the representable number range, the result clamps to the last representable number with the same sign. This is especially critical in scalar products and matrix vector operations. To avoid overflow or saturation we may be forced to rescale sums $\sum_{i \in I} b_i$ by the number of addends $|I|$ and compute $\sum_{i \in I} \frac{b_i}{|I|}$ instead; however, this reproduces the first problem.

• Aliasing in the evaluation of non-linear functions: $f(x) := \frac{1}{512 x}$, $f(\{\frac{342}{512}, \dots, \frac{511}{512}\}) =_{\mathcal{Q}} \frac{1}{512}$.
One third of the positive number range is mapped onto the same result. The information carried by the different values is lost and their differences cannot have any impact on the solution any more. Instead, the change from $f(\frac{342}{512}) = \frac{1}{512}$ to $f(\frac{341}{512}) = \frac{2}{512}$ artificially obtains high relevance for the solution of the algorithm, although the derivative of the function at this point is very small: $f'(\frac{341.5}{512}) \approx -0.0044$.

Since these problems are inherent to the fixed point format, we will have to cope with them in the algorithms. The primary and most effective strategy will be to reformulate the discrete schemes such that the whole representable number range is used and small and large factors equalize each other after factorization. On the implementational level we will also consider changing the order of operations, fixed point sensitive approximations to non-linear functions and, if possible, variable precision arithmetic in intermediate computations.
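All three effects are easy to reproduce by simulating a $\mathcal{Q}_{\mathrm{FX}}(2, 9, 2^{-9})$-style format with nearest rounding and saturation (the helper names are ours; the underflow operands are chosen so that the product falls strictly below $\varepsilon_{\mathcal{Q}}/2$, independently of the tie-breaking rule):

```python
import math

EPS = 2.0 ** (-9)                     # Q_FX(2, 9, 2^-9)
LO, HI = -1.0, 1.0 - EPS              # saturation bounds of the format

def q(x):
    # saturate, then round to the nearest multiple of EPS (ties away from zero)
    x = min(max(x, LO), HI)
    if x == 0.0:
        return 0.0
    k = math.floor(abs(x) / EPS + 0.5)
    return math.copysign(k * EPS, x)

mul = lambda a, b: q(a * b)
add = lambda a, b: q(a + b)

# 1. underflow: the exact product 2^-6 * 2^-6 = 2^-12 lies below EPS/2
print(mul(2**-6, 2**-6) == 0.0)
#    distributive law violation: a*(b1+b2) survives, a*b1 + a*b2 underflows
a, b1, b2 = 2**-5, 2**-6, 2**-6
print(mul(a, add(b1, b2)) != add(mul(a, b1), mul(a, b2)))

# 2. saturation: (-0.5) + (-0.7) clamps to the most negative number
print(add(-0.5, -0.7) == -1.0)

# 3. aliasing: f(x) = 1/(512 x) maps a third of the range onto one value
f = lambda x: q(1.0 / (512.0 * x))
print(f(342 / 512) == f(511 / 512) == 1 / 512)
```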

2.2.2.3 Numerical Stability

There are several ways of measuring a numerical error for a given quantization. Let us denote by $f : X \to Y$ the exact algorithm which produces an exact result $y \in Y$ for an exact input $x \in X$, and let $f^{\mathcal{Q}} : X \to Y$ denote the approximative algorithm working with some finite precision $\mathcal{Q}$. Then we may define the forward errors for $x \in X$ as

absolute forward error   $|f(x) - f^{\mathcal{Q}}(x)|$,
relative forward error   $|f(x) - f^{\mathcal{Q}}(x)| / |f(x)|$.

With these errors we measure how much the approximate solution $f^{\mathcal{Q}}(x)$ obtained with the precision $\mathcal{Q}$ differs from the exact solution $f(x)$.


The backward errors for $x \in X$ are defined as

absolute backward error   $\inf_{\{\Delta x \in X \mid f(x+\Delta x) = f^{\mathcal{Q}}(x)\}} |\Delta x|$,
relative backward error   $\inf_{\{\Delta x \in X \mid f(x+\Delta x) = f^{\mathcal{Q}}(x)\}} |\Delta x| / |x|$.

With these errors we measure by which perturbation of the data $x + \Delta x$ we can explain the approximate solution $f^{\mathcal{Q}}(x)$ we have obtained. One can also define mixed forward-backward errors requiring that there should exist absolutely or relatively small $\Delta x, \Delta y$ such that

$f(x + \Delta x) = f^{\mathcal{Q}}(x) + \Delta y.$

Then if $\frac{|\Delta x|}{|x|}$ and $\frac{|\Delta y|}{|f(x)|}$ are small in comparison to the number of operations necessary for the evaluation of $f^{\mathcal{Q}}$, we can call the implementation $f^{\mathcal{Q}}$ numerically stable. Depending on the problem one may want to vary the meaning of "small". A typical condition would require

$\frac{|\Delta x|}{|x|} \le N^{f^{\mathcal{Q}}} u_{\mathcal{Q}} + o(u_{\mathcal{Q}}),$
$\frac{|\Delta y|}{|f(x)|} \le N^{f^{\mathcal{Q}}} K_{\mathrm{rel}}(f)\, u_{\mathcal{Q}} + o(u_{\mathcal{Q}}),$

where $N^{f^{\mathcal{Q}}}$ is the number of elementary operations in $f^{\mathcal{Q}}$, $u_{\mathcal{Q}}$ is the unit roundoff and $K_{\mathrm{rel}}(f)$ the relative condition of $f$ defined by

$K_{\mathrm{rel}}(f) := \inf\left\{ K \in \mathbb{R} \;\middle|\; \forall x : \frac{|f(\tilde x) - f(x)|}{|f(x)|} \le K \frac{|\tilde x - x|}{|x|} + o\!\left(\frac{|\tilde x - x|}{|x|}\right) \text{ for } \tilde x \to x \right\}.$

The problem with these conditions is that they are only useful for quantizations which satisfy $u_{\mathcal{Q}} \ll 1/N^{f^{\mathcal{Q}}}$; otherwise it is hardly sensible to talk of numerical stability, although $\Delta x$ and $x$ are of approximately the same order. The common floating point formats satisfy this condition:

IEEE single (s23e8):   $u_{\mathcal{Q}} \approx 6.0 \cdot 10^{-8}$,
IEEE double (s52e11):   $u_{\mathcal{Q}} \approx 1.1 \cdot 10^{-16}$,
IEEE extended (s63e15):   $u_{\mathcal{Q}} \approx 5.4 \cdot 10^{-20}$,

although the single precision format may already run into problems, since current microprocessors execute elementary operations at several GHz $= 10^9\,$Hz.
In case of the fixed point number formats used in graphics hardware and reconfigurable computing, this definition of numerical stability makes no sense, since we have seen that $u_{\mathcal{Q}} > \frac{1}{3}$ (Eq. 2.21 on page 31). The fixed point format rather controls the absolute error, such that we are better advised to require that

$|\Delta x| \le N^{f^{\mathcal{Q}}} \varepsilon_{\mathcal{Q}} + o(\varepsilon_{\mathcal{Q}}),$
$|\Delta y| \le N^{f^{\mathcal{Q}}} K_{\mathrm{abs}}(f)\, \varepsilon_{\mathcal{Q}} + o(\varepsilon_{\mathcal{Q}}),$

where $N^{f^{\mathcal{Q}}}$ is again the number of operations in $f^{\mathcal{Q}}$, $\varepsilon_{\mathcal{Q}}$ is the machine epsilon and $K_{\mathrm{abs}}(f)$ the absolute condition of $f$ defined by

$K_{\mathrm{abs}}(f) := \inf\left\{ K \in \mathbb{R} \;\middle|\; \forall x : |f(\tilde x) - f(x)| \le K\, |\tilde x - x| + o(|\tilde x - x|) \text{ for } \tilde x \to x \right\}.$

In this case the size of the machine epsilon of the number formats available for implementation in comparison to the number of involved operations is crucial:

(2.25)   $\mathcal{Q}_{\mathrm{FX}}(2, 8, 2^{-8})$:   $\varepsilon_{\mathcal{Q}} \approx 3.9 \cdot 10^{-3}$,
         $\mathcal{Q}_{\mathrm{FX}}(2, 12, 2^{-12})$:   $\varepsilon_{\mathcal{Q}} \approx 2.4 \cdot 10^{-4}$,
         $\mathcal{Q}_{\mathrm{FX}}(2, 16, 2^{-16})$:   $\varepsilon_{\mathcal{Q}} \approx 1.5 \cdot 10^{-5}$,

where $\mathcal{Q}_{\mathrm{FX}}(\beta, t, \varepsilon)$ has been defined in Eq. 2.20 on page 31. For a typical image processing application with tens of local operations and tens to hundreds of iterations, we see that we are at a loss with the signed 9 bit format. The additional 8 bits in the signed 17 bit format can make a big difference here, possibly bounding the overall error below a few percent of the number range. However, even these few percent would become visually disturbing if they formed a pattern.
Therefore, in fixed point arithmetic a different approach is often chosen. Instead of trying to control the individual errors, one investigates the distribution of the roundoff error and tries to show that under certain conditions this distribution may be assumed uniform. This assumption is referred to as the white-noise model [Jackson, 1970; Urabe, 1968; Welch, 1969]. Although one can formulate conditions for which the white-noise assumption is justified [Barnes et al., 1985; Wong, 1990], the same papers demonstrate that various distributions of the input signal violate these conditions and that then the output error differs significantly from the predictions of the white-noise model. Especially when the errors are examined for digital filters and not single multiplications [Mulcahy, 1980; Wong, 1991], the error spectra differ dramatically from those derived by the white-noise model. Moreover, the output errors are also very sensitive to the characteristics of the input signal, such that it is very hard to give any reasonable bounds on the error unless the distribution of the input signals hardly varies and is known quite exactly in advance. Recently a thorough justification of the white-noise model for fixed point roundoff errors in digital systems under very general conditions on the input signal and the involved coefficients has been given [Vladimirov and Diamond, 2002].
The result states that the joint probabilistic distribution of the input and the error weakly converges to the joint distribution of the input and a uniformly distributed vector as the step size of the quantization converges to zero. From a general point of view this result is very satisfactory, since it shows that the deterministic dependence of the error on the input signal breaks down in favor of a uniform distribution in the limit. However, in our implementational cases where $\varepsilon_{\mathcal{Q}}$ is fairly large, we cannot deduce any bounds on the error from the asymptotic behavior.
This knotty situation, that in general neither good deterministic nor probabilistic error bounds can be attained for low precision fixed point computations, has led to the development of fixed

point error simulators and analysis tools which give the engineer the possibility to concentrate on the concrete implementation and optimize it in terms of performance and accuracy [Bauer and Leclerc, 1991; Coster et al., 1996; Kim et al., 1998; Sung and Kum, 1995; Wadekar and Parker, 1998; Willems et al., 1997a]. In conjunction with these tools, floating point to fixed point translators are used to minimize design time and to automate number range scaling and word-length optimizations [Cmar et al., 1999; Keding et al., 1998; Willems et al., 1997b].
We will also deal individually with the different applications, but our focus lies on the preservation of global invariants and the reproduction of the qualitative behavior known from the continuous models, rather than on optimizing the accuracy of each individual operation. This approach is most suitable for image processing applications, since we are interested in certain characteristics of the image evolution and not the individual values. In this way we will be able to obtain satisfactory results despite the inability to control the component-wise deterministic or probabilistic error in the long run.

2.2.3 Quantized Operators

Here we define the quantized operators which will be used throughout the rest of the work. We analyze the possibilities of controlling global features of vector operands which undergo quantized operations. The quantization model is tailored to the fixed point number system, but the floating point number system also fulfills the same conditions if we restrict the exponent to one or just a few scales.
In Section 2.2.2.2 on page 30 we have seen that fixed point number formats can vary in a number of parameters, and in fact different hardware architectures use different formats. In graphics hardware even incompatible number formats are used simultaneously. By introducing a formal set of quantized operators we want to abstract from the low level differences and allow a common analysis of their properties.

2.2.3.1 Arithmetic Operations

Given a quantization $\mathcal{Q}$ we will generally assume:

• Zero element: $0 \in \mathcal{Q}$.
This is reasonable since a missing neutral element of addition causes many problems, but not all quantizations in practice fulfill this requirement. The first realization of negative numbers in graphics hardware used the unsigned set $\{0, \dots, 255\}$ and interpreted the numbers $n$ as signed numbers $-1 = \frac{2 \cdot 0 - 255}{255}, \dots, \frac{2 \cdot n - 255}{255}, \dots, \frac{2 \cdot 255 - 255}{255} = 1$. Because the numerator is always odd, there is no representation of 0 in this quantization.

• Symmetry: $-a \in \mathcal{Q}$ if $a \in \mathcal{Q}$, $a > 0$.
This guarantees that $\mathcal{Q}$ contains a symmetric set of positive and negative numbers around zero. However, there may be more negative than positive numbers, such that some negative numbers cannot be negated in $\mathcal{Q}$.


Ideally $\mathcal{Q}$ will cover the interval $[-1, 1]$ with an exact representation of $-1, 0, 1$, but in general we can only assume the above two properties. Equivalent to the standard arithmetic operations $\circ \in \{+, -, \cdot, /\}$ we introduce quantized operations $\circledast \in \{\oplus, \ominus, \odot, \oslash\}$ on $\mathcal{Q}$; the notation is borrowed from [Goldberg, 1991]. Unfortunately, $\mathcal{Q}$ together with the quantized operations does not constitute any of the standard algebraic objects, but some of the properties are preserved.

• Saturation: $\mathrm{clamp}^{\mathcal{Q}}(x) := \min(\max(x, \min_{\mathcal{Q}}), \max_{\mathcal{Q}})$, $x \in \mathbb{R}$.
We will assume that all operations saturate rather than overflow. In low precision hardware this functionality is often explicitly implemented; otherwise it costs little additional resources. The advantage of saturation is its continuity and monotonicity, while an overflow violates both, e.g.:

saturation   $\max_{\mathcal{Q}} \oplus \mathrm{inc}_{\mathcal{Q}}(0) = \max_{\mathcal{Q}}$,   $\tfrac{511}{512} + \tfrac{1}{512} =_{\mathcal{Q}} \tfrac{511}{512}$,
overflow    $\max_{\mathcal{Q}} + \mathrm{inc}_{\mathcal{Q}}(0) = \min_{\mathcal{Q}}$,   $\tfrac{511}{512} + \tfrac{1}{512} =_{\mathcal{Q}} -1$,

where $\mathrm{inc}_{\mathcal{Q}}(0)$ is the smallest positive non-zero number and a $\mathcal{Q}_{\mathrm{FX}}(2, 9, 2^{-9})$ format is used in the examples on the right.

• Addition: $a \oplus b$, $a, b \in \mathcal{Q}$.

$a \oplus b = \mathrm{clamp}^{\mathcal{Q}}(a + b).$

This means that addition is exact as long as the result stays within the representable range; otherwise a saturation occurs.

• Subtraction: $a \ominus b$, $a, b \in \mathcal{Q}$.

$a \ominus b = \mathrm{clamp}^{\mathcal{Q}}(a - b).$

Here we have a similar restriction as for the addition. We may define negation as

$\ominus b := 0 \ominus b,$

and then

$a \ominus b = a \oplus (\ominus b)$

holds true if $-b \in \mathcal{Q}$.

• Multiplication: $a \odot b$, $a, b \in \mathcal{Q}$.

$a \odot b = \mathrm{round}^{\mathcal{Q}}_\mu(\mathrm{clamp}^{\mathcal{Q}}(a \cdot b)).$

The rounding modes indicated by $\mu$ are defined in Eq. 2.22 on page 32. Since different hardware architectures have different rounding modes and we are not always free to


choose among them, some results will depend on the modes. But independent of the rounding mode we have the following properties:

$1 \odot b = b$ if $1 \in \mathcal{Q}$,
$0 \odot b = 0$ if $0 \in \mathcal{Q}$,
$(-a) \odot b = -(a \odot b)$ if $-a, -(a \odot b) \in \mathcal{Q}$.

• Division: $a \oslash b$, $a, b \in \mathcal{Q}$.

$a \oslash b = \mathrm{round}^{\mathcal{Q}}_\mu(\mathrm{clamp}^{\mathcal{Q}}(a / b)).$

We have included the division for the sake of completeness. In concrete implementations we avoid divisions by non-powers of 2, because either they are impossible to implement in the architecture or their use would cost unreasonably many resources. Where divisions are unavoidable and $|b|$ is well bounded from below, a discrete inverse function $f^{\mathcal{Q}}(b) := \mathrm{round}^{\mathcal{Q}}_\mu(1/b)$ can be used; otherwise a two-dimensional discrete function $f^{\mathcal{Q}}(a, b) := \mathrm{round}^{\mathcal{Q}}_\mu(a/b)$ implements the division. However, in the latter case we must also avoid a large co-domain, which would lead to frequent saturation.
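These operations can be simulated exactly with integer arithmetic on the significands $m$; the sketch below (helper names are ours) implements $\oplus$, $\ominus$ and $\odot$ for a $\mathcal{Q}_{\mathrm{FX}}(2, 9, 2^{-9})$-style format with saturation and nearest rounding, and checks some of the listed properties:

```python
T = 9                                   # Q_FX(2, 9, 2^-9): significands m in [-512, 511]
clamp = lambda m: min(max(m, -2**T), 2**T - 1)

def add_q(a, b):                        # a (+) b = clamp_Q(a + b)
    return clamp(a + b)

def neg_q(b):                           # (-)b := 0 (-) b
    return clamp(0 - b)

def sub_q(a, b):                        # a (-) b = clamp_Q(a - b)
    return clamp(a - b)

def mul_q(a, b):                        # a (*) b = round_n(clamp_Q(a * b)), ties up
    p = a * b                           # exact product lives on the eps^2 grid
    return clamp((p + 2**(T - 1)) >> T) # rescale with nearest rounding

# saturation instead of overflow: 511/512 (+) 1/512 = 511/512
print(add_q(511, 1) == 511)
# 0 is preserved as the neutral element: 0 (*) b = 0
print(mul_q(0, 300) == 0)
# a (-) b = a (+) ((-)b) holds when -b is representable
a, b = 200, -100
print(sub_q(a, b) == add_q(a, neg_q(b)))
# underflow: (16/512) (*) (8/512) rounds to zero
print(mul_q(16, 8) == 0)
```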

2.2.3.2 Composed Operations

Once we have defined the quantized arithmetic operations, quantized composed operations such as the matrix vector product

$(A \odot \bar V)_i := \bigoplus_j A_{ij} \odot \bar V_j$

can be derived from the standard arithmetic operations in a natural way. As the formulas may be more complicated than that, and in most cases all operations will be replaced by their quantized equivalents, we define a quantized evaluation operator recursively by

$\mathrm{eval}^{\mathcal{Q}}(A) := A, \qquad A \in \mathcal{Q}^{m \times n},$
$\mathrm{eval}^{\mathcal{Q}}(\mathrm{term}_1 \circ \mathrm{term}_2) := \mathrm{eval}^{\mathcal{Q}}(\mathrm{term}_1) \circledast \mathrm{eval}^{\mathcal{Q}}(\mathrm{term}_2),$

where $\circ \in \{+, -, \cdot, /\}$ is the exact and $\circledast \in \{\oplus, \ominus, \odot, \oslash\}$ the corresponding quantized operation. Since the quantized operations are not associative in general, the result of the quantized evaluation operator depends on the order of the term splitting. We assume an evaluation order from left to right unless otherwise stated. The quantized matrix vector product now reads:

eval_Q(A · V̄) = A ⊙ V̄ .

In all our implementations the matrix A will be a thin band matrix, such that each component sum will have only few (e.g. 9) multiplications and additions. However, overflow in additions and underflow in multiplications can occur very quickly in low precision fixed point arithmetic (see the examples in Section 2.2.2.2). Moreover, we have also seen that the order of operations may have a great impact on the result, and because the matrix vector product is a key ingredient of the discrete schemes we must at least control the error distribution in its result, even if the precision is too low to guarantee certain error bounds in the long run.
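The dependence on the order of the term splitting can be seen already with three operands. This tiny self-contained experiment (8 fractional bits, illustrative operand values) evaluates the same term with two different splittings:

```python
# Quantized operations saturate and round, so they are not associative:
# a left-to-right evaluation may differ from a reordered one.
# Format and operand values are illustrative assumptions.

EPS = 2.0 ** -8

def q(x):                          # round + saturate to [0, 1 - EPS]
    x = min(max(x, 0.0), 1.0 - EPS)
    return round(x / EPS) * EPS

def q_add(a, b): return q(a + b)
def q_sub(a, b): return q(a - b)

left  = q_sub(q_add(0.75, 0.75), 0.75)   # saturates first, then subtracts
right = q_add(0.75, q_sub(0.75, 0.75))   # no saturation occurs
```

Here left evaluates to 0.24609375 because the intermediate sum saturated at 1 − 2⁻⁸, while right stays at 0.75.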


2.2.3.3 Matrix Vector Product

In concrete implementations one notices that despite elaborate algorithm adaption to the fixed point number system, the restricted precision still causes a significant deviation from the continuous results. But in the presented PDE based applications in image processing we are not interested in the exact continuous process, at least not practically, but in an image evolution which is governed by the characteristics modeled through the PDE. This qualitative behavior can be reconstructed with a quantized solver if we are able to control global attributes despite arbitrarily large local errors due to a coarse quantization. The sum of all elements of a vector V̄, representing the overall mass in the problem, is such a global attribute. The elements of V̄ are typically changed by the application of a matrix, say

A. Formally the mass is preserved if A has unit column sum, as Σ_i A_ij = 1 for all j implies:

(2.26)  Σ_i (AV̄)_i = Σ_i Σ_j A_ij V̄_j = Σ_j V̄_j Σ_i A_ij = Σ_j V̄_j .

Considering the underflow in multiplications we have seen in the previous Section 2.2.2, it is clear that we cannot preserve the equality in this formulation. For arbitrary numbers even floating point calculations violate it. Special exact dot product operations, which perform an exact computation of a scalar product of two vectors with a single final rounding after the summation, have been proposed to reduce roundoff in such matrix vector multiplications. Although in this way iterative applications of A can still accumulate errors, the error is not larger than that produced by a series of basic arithmetic operations. At least for standard floating point number formats this usually means that the error becomes negligible, but multiple iterations remain an issue for low precision fixed point formats even with the exact dot product. For more details on this and related variable precision techniques we refer to [Schulte and Swartzlander, 1996]. We will use variable precision arithmetic in intermediate computations when our hardware allows the required configuration (Section 4.2.2.2). For inexpensive standard hardware products (e.g. graphics hardware in Section 4.1), which have processing elements of fixed width, we must come up with a different solution. If A is symmetric and has unit row sum we may rewrite the matrix vector product as

(2.27)  AV̄ = V̄ + (A − I)V̄

((A − I)V̄)_i = Σ_j A_ij V̄_j − 1 · V̄_i = Σ_j A_ij V̄_j − Σ_j A_ij V̄_i = Σ_{j:j≠i} A_ij (V̄_j − V̄_i)

(AV̄)_i = V̄_i + Σ_{j:j≠i} A_ij (V̄_j − V̄_i) .

In this way we have encoded the constraint of unit row sum implicitly in the formula, i.e. we do not use A_ii in the evaluation any more, but the formula implies that A_ii = 1 − Σ_{j:j≠i} A_ij. The advantage we gain is the anti-symmetry of the addends F_ij := A_ij ⊙ (V̄_j ⊖ V̄_i), i.e. what F_ij contributes to (AV̄)_i, the negative −F_ij contributes to (AV̄)_j. For the mass preservation it therefore does not matter how inexactly we have computed F_ij. This pattern is well known

from Finite Volume discretizations, where one computes anti-symmetric fluxes across cell boundaries, and the addends F_ij are fluxes in this sense, but here they are associated with interactions between arbitrary nodes and not only cell neighbors. So the scheme can be applied to any type of discretization as long as the resulting matrix is symmetric with unit row sum.

The formula in Eq. 2.27 actually applies to arbitrary symmetric matrices with zero row sum like Z^A := A − I, because Z^A_ij = A_ij for all i, j with j ≠ i:

(2.28)  (Z^A V̄)_i = Σ_{j:j≠i} Z^A_ij (V̄_j − V̄_i) .

If we perform exact additions and subtractions it is guaranteed that the element sum of the result vector Z^A ⊙ V̄ is zero. An arbitrary symmetric matrix A can be split uniquely as A = Z^A + S^A into a symmetric zero row sum matrix Z^A and a diagonal matrix S^A. The decomposition is unique because the elements of S^A must contain the row sums of A to make Z^A a zero row sum matrix. We obtain

S^A_ii = Σ_j A_ij ,
Z^A_ij = A_ij , j ≠ i ,
Z^A_ii = −Σ_{j:j≠i} A_ij .

Then we can apply Z^A to produce a zero mass result vector, while the product with S^A will assign the new mass to the result. We define this mass-exact matrix vector product in the quantized setting:

(2.29)  (A ⊙ V̄)_i := eval_Q( S^A_ii V̄_i + Σ_{j:j≠i} A_ij (V̄_j − V̄_i) )
              = S^A_ii ⊙ V̄_i ⊕ ⊕_{j:j≠i} A_ij ⊙ (V̄_j ⊖ V̄_i) .

In future we will use the subscript ⊙ for the quantized evaluation operator eval_Q to indicate that the term evaluation makes use of this mass-exact matrix vector product, i.e.:

eval_Q⊙(A V̄) = (A ⊙ V̄) .

If no saturation occurs in the evaluation of A ⊙ V̄, then this matrix vector product has the property

Σ_i (A ⊙ V̄)_i = Σ_i (S^A_ii ⊙ V̄_i) ,

and in the special case of a unit row sum matrix A = Z^A + I, the mass is preserved:

(2.30)  Σ_i (A ⊙ V̄)_i = Σ_i V̄_i .
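The mass preservation of Eq. 2.30 can be checked numerically. The sketch below quantizes only the multiplications (round to nearest, 8 fractional bits) and assumes, as in the text, exact and saturation-free additions; the concrete stencil matrix and vector are illustrative assumptions.

```python
# Mass-exact matrix vector product of Eq. 2.29 for a symmetric matrix A
# with unit row sum (so S_ii = 1): out_i = V_i + sum_j A_ij (x) (V_j - V_i).
# The quantized fluxes F_ij are antisymmetric and cancel exactly, so the
# total mass of the result equals the total mass of the input.

EPS = 2.0 ** -8

def q_mul(a, b):
    """Quantized multiplication: exact product rounded to a multiple of EPS."""
    return round(a * b / EPS) * EPS

def mass_exact_matvec(A, V):
    n = len(V)
    out = []
    for i in range(n):
        acc = V[i]                  # S_ii = 1, so S_ii (x) V_i = V_i exactly
        for j in range(n):
            if j != i and A[i][j] != 0.0:
                acc += q_mul(A[i][j], V[j] - V[i])   # flux F_ij = -F_ji
        out.append(acc)
    return out

# symmetric, unit row sum: a small 1D diffusion stencil
A = [[0.8, 0.2, 0.0],
     [0.2, 0.6, 0.2],
     [0.0, 0.2, 0.8]]
V = [0.25, 0.5, 0.125]
W = mass_exact_matvec(A, V)
```

Every individual flux is rounded, yet sum(W) == sum(V) holds exactly, which is the content of Eq. 2.30.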

2.2.3.4 Scaling

We notice that the use of the differences V̄_j − V̄_i in Eq. 2.29 may easily lead to an underflow in the subsequent multiplication, because we will have A_ij ≠ 0 exclusively for small distances ‖i − j‖, i.e. for nodes lying closely together, and these tend to have similar values. So especially in smooth image areas we should expect a high error. To increase accuracy we can rescale the formula

(2.31)  (A ⊙_s V̄)_i := S^A_ii ⊙ V̄_i ⊕ (1/s_i) ⊙ ⊕_{j:j≠i} A_ij ⊙ ( s_i ⊙ (V̄_j ⊖ V̄_i) ) ,

with some scaling factors s_i, as long as no saturation occurs. For integer s_i and small (V̄_j − V̄_i) the scaling s_i ⊙ (V̄_j ⊖ V̄_i) = s_i · (V̄_j − V̄_i) is exact and will be performed first. The absolute forward error then decreases:

|(A ⊙ V̄)_i − (AV̄)_i| ≤ ε_Q/(1 + δ_{µ=n}) · ( 1 + #{j | A_ij ≠ 0} ) ,
|(A ⊙_s V̄)_i − (AV̄)_i| ≤ ε_Q/(1 + δ_{µ=n}) · ( 2 + #{j | A_ij ≠ 0}/s_i ) ,

where µ indicates the dependence on the rounding mode (see Eq. 2.22). Certainly this only makes sense if s_i > 1, i.e. the image in the area {j | A_ij ≠ 0} is sufficiently smooth. This should be the case in most parts of an image; however, in the vicinity of sharp edges s_i ≤ 1 is almost inevitable. Ultimately the choice of s_i depends on the matrix elements {A_ij ≠ 0}, since we must prevent the individual addends and the whole sum ⊕_{j:j≠i} A_ij ⊙ (s_i ⊙ (V̄_j ⊖ V̄_i)) from saturation. For large elements of A_ij we would even have to apply the contrary strategy and scale down the addends with an s_i < 1 before addition and scale up the result with 1/s_i > 1 afterwards. In this case the scaling with s_i would be applied implicitly by including the factor into the matrix assembly of the elements {A_ij ≠ 0}, such that the multiplication A_ij ⊙ s_i = A_ij · s_i would be exact.

The greatest disadvantage of the scaling is that in general we lose the property of the mass-exact computation. Even if we chose ∀i : s_i := s to be a global scaling factor, such that the addends would remain antisymmetric, the multiplication of the sum with 1/s would round off differently depending on the value of the sum. Only in the case of scaling down the addends to prevent saturation can we retain the mass-exact computation, by choosing s_i as the inverse of an integer, such that 1/s_i is an integer and the rescaling of the sum with this factor is exact. This means that we can always prevent the sum in Eq. 2.29 from saturation and thus preserve the mass-exact computation, but this may lead to an increase in the absolute

error. Otherwise, if we sacrifice the exact mass property, we can substantially decrease the error in smooth image areas. We could also apply the scaling directly to the natural matrix vector product:

(2.32)  (A ⊙_s V̄)_i := S^A_ii ⊙ V̄_i ⊕ (1/s_i) ⊙ ⊕_j A_ij ⊙ (s_i ⊙ V̄_j) .

There are two differences to the scaled version of the mass-exact matrix vector product in Eq. 2.31. First, in Eq. 2.31 some of the roundoff errors will still be compensated by the partial antisymmetry of the addends, especially if we use a global scaling factor ∀i : s_i := s, such that the mass defect should turn out smaller than in Eq. 2.32. Second, the scaling in Eq. 2.31 is independent of the absolute intensities and uses only their differences, such that it will have the same impact on each level of intensities, while Eq. 2.32 will only rescale small values. It is therefore not advisable to use a global scaling factor in this formula. In practice we will usually not need to scale down the addends to prevent a saturation, because the A_ij will be sufficiently small. Although this will increase the risk of underflow in multiplications with small differences, we will abstain from scaling up the addends, because the accurate processing of smooth areas is usually less important than the accurate treatment of edges. Moreover, the global property of preserving the mass in case of unit row sum in A produces higher quality results than uncorrelated high local accuracy.
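The effect of the rescaling in Eq. 2.31 on the underflow can be demonstrated in isolation; the matrix entry, the difference and the integer scale below are illustrative assumptions.

```python
# For smooth data the differences V_j - V_i are tiny and the quantized
# product A_ij (x) (V_j - V_i) underflows. Scaling the difference up by an
# integer s before the multiplication (exact for small differences) and
# dividing the quantized product by s afterwards recovers the lost bits.

EPS = 2.0 ** -8

def q_mul(a, b):
    return round(a * b / EPS) * EPS

a_ij = 0.25
diff = 2 * EPS                        # small difference in a smooth area

plain  = q_mul(a_ij, diff)            # underflows to 0
s      = 8                            # integer scaling factor s_i
scaled = q_mul(a_ij, s * diff) / s    # exact pre-scaling, rescaled result
```

Here plain loses the whole contribution, while scaled reproduces the exact value a_ij · diff; for a non-global s_i this final division is exactly what destroys the mass-exactness discussed above.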

2.2.3.5 Backward Error

Let us consider the effects of the roundoff errors in the mass-exact matrix vector product as a perturbation of the matrix. If we express the roundoff error in multiplications in the multiplicative form (Eq. 2.23 on page 32) and assume no overflow in additions then we have for a symmetric matrix A

(A ⊙ V̄)_i = S^A_ii ⊙ V̄_i ⊕ ⊕_{j:j≠i} A_ij ⊙ (V̄_j ⊖ V̄_i)
         = λ^ii_µ S^A_ii V̄_i + Σ_{j:j≠i} λ^ij_µ A_ij (V̄_j − V̄_i)
         = ( λ^ii_µ S^A_ii − Σ_{j:j≠i} λ^ij_µ A_ij ) V̄_i + Σ_{j:j≠i} λ^ij_µ A_ij V̄_j ,

with the abbreviations λ^ij_µ := λ_µ(A_ij (V̄_j − V̄_i)) and λ^ii_µ := λ_µ(S^A_ii V̄_i). The last formula line has the form of a natural matrix vector product. This means that we have

(2.33)  A ⊙ V̄ = B^{A,V̄} · V̄

(2.34)  (B^{A,V̄})_ij := λ^ij_µ A_ij , j ≠ i,


(B^{A,V̄})_ii := λ^ii_µ S^A_ii − Σ_{j:j≠i} λ^ij_µ A_ij .

Thus we can interpret the roundoff error as a component-wise absolute backward error from B^{A,V̄} to A:

|1 − λ^ij_µ| A_ij ≤ ε_Q / ( (1 + δ_{µ=n}) |V̄_j − V̄_i| ) , j ≠ i,

| λ^ii_µ S^A_ii − Σ_{j:j≠i} λ^ij_µ A_ij − A_ii | ≤ ε_Q/(1 + δ_{µ=n}) · ( 1/|V̄_i| + Σ_{j:j≠i, A_ij≠0} 1/|V̄_j − V̄_i| ) ,

where the term 1/|V̄_i| disappears if A has integer row sum. We see again that the scheme produces large errors in areas where the data is smooth and is fairly precise in areas of high gradients. The bound for the diagonal elements looks catastrophic, because all errors from the sub-diagonals are subtracted here and their interference is unknown. In practice it is highly improbable that all roundoffs will point in the same direction, such that usually error compensation will occur. But we cannot quantify this effect, since the errors are correlated and the white-noise model fails here. In all iterative smoothing processes, however, we have a self-regulated accuracy, since any exceptionally high random error in a smooth area leads to an increase in differences and thus to a more precise smoothing of this area in the next step.

2.3 Anisotropic Diffusion

We recall the problem statement from Section 2.1.1 on page 15:

∂_t u − div( G(∇u_σ) ∇u ) = 0 , in R⁺ × Ω ,
u(0) = u_0 , on Ω ,
G(∇u_σ) ∇u · ν = 0 , on R⁺ × ∂Ω .

The non-linear diffusion coefficient G(∇u_σ) steers the diffusion process in such a way that edges are preserved while noise in homogeneous areas is smoothed.
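For intuition, a typical choice for the scalar building blocks of such a diffusion coefficient is a decreasing function of the gradient magnitude; the Perona–Malik-type formula and the edge threshold lam below are illustrative assumptions, not the definition used later in Eq. 2.43.

```python
# A decreasing diffusion weight g(s) = 1 / (1 + (s/lam)^2): close to 1 for
# small gradient magnitudes s (strong smoothing in homogeneous areas) and
# close to 0 for large s (edges are preserved).

def g(s, lam=0.1):
    """Diffusion weight for gradient magnitude s; lam is the edge threshold."""
    return 1.0 / (1.0 + (s / lam) ** 2)
```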

2.3.1 FE Scheme

Finite Element (FE) methods are widely used to discretize this partial differential equation and were for instance studied by Kačur and Mikula [Kačur and Mikula, 1995]. Weickert [Weickert, 1998] proposed additive operator splitting techniques based on finite difference schemes to accelerate the solution process. Multi-grid methods have also been considered [Acton, 1998], which allow for a better convergence as long as the anisotropic behavior of the

diffusion process is not too strong. The use of adaptive FEs has been discussed in [Bänsch and Mikula, 1997; Preußer and Rumpf, 2000], and has also been applied in e.g. [Weickert et al., 1997]. We will use a simple FE scheme which is well suited for parallelization and minimizes underflow in multiplications by cancellation of large and small factors.

2.3.1.1 Discrete Schemes

We use the discretization scheme described in Section 2.2.1 with the operator F[C^p_σ[u], u] = −div( C^p_σ[u] ∇u ) and the classifier C^p_σ[u] = G(∇u_σ), and either the explicit (Eq. 2.12) or the semi-implicit (Eq. 2.14) time-discretization. In space we discretize the problem with bilinear Finite Elements (FEs) on a uniform quadrilateral grid (Figure 2.1). In variational formulation with respect to the FE space V^h we obtain for the explicit time-discretization

(U^{n+1}, Θ)_h = (U^n, Θ)_h − τ^n ( G(∇U^n_σ) ∇U^n, ∇Θ )

and for the semi-implicit time-discretization

(U^{n+1}, Θ)_h + τ^n ( G(∇U^n_σ) ∇U^{n+1}, ∇Θ ) = (U^n, Θ)_h

for all Θ ∈ V^h, where capital letters indicate the discrete functions in V^h corresponding to the continuous functions. Above (·, ·) denotes the L² product on the domain Ω, (·, ·)_h is the lumped mass product [Thomée, 1984], which approximates the L² product, and τ^n the current time-step width. The discrete solution U^n is expected to approximate u(Σ_{i=0}^{n−1} τ^i). Thus in the n-th time-step we have to solve either the explicit

M^h Ū^{n+1} = ( M^h − τ L[Ū^n_σ] ) Ū^n

or the implicit linear equation system

( M^h + τ L[Ū^n_σ] ) Ū^{n+1} = M^h Ū^n ,

where (Ū^n)_α is the solution vector consisting of the nodal values at the grid nodes α ∈ [0, N_x − 1] × [0, N_y − 1] (Figure 2.1). We have the lumped mass matrix

(2.35)  M^h_αβ := (Φ_α, Φ_β)_h ,  M^h = h² 𝟙

and the weighted stiffness matrix

(2.36)  L[Ū^n_σ]_αβ := ( G(∇Ū^n_σ) ∇Φ_α, ∇Φ_β ) ,

where Φ_α are the multi-linear basis functions. Finally we obtain the explicit update formula

(2.37)  Ū^{n+1} = A⁻[Ū^n_σ] · Ū^n


A⁻[Ū^n_σ] := 𝟙 − (τ^n/h²) L[Ū^n_σ]

and the linear equation system for the implicit case

(2.38)  A⁺[Ū^n_σ] · Ū^{n+1} = Ū^n ,
        A⁺[Ū^n_σ] := 𝟙 + (τ^n/h²) L[Ū^n_σ] .

The great advantage of the simple equidistant grid is the factorization of h² from M^h in Eq. 2.35, such that in the matrices A⁻ and A⁺ the time-step τ^n can be balanced against the grid size square h², and most coefficients in L[Ū^n_σ] are of the same order and can be well represented by fixed point numbers. The linear equation system is solved approximately by an iterative solver:

X̄⁰ = R̄ := Ū^n ,  X̄^{l+1} = F(X̄^l) ,  Ū^{n+1} = F(X̄^{l_max}) ,

for a fixed or adaptive number of iterations l_max ∈ N. Typical solvers are the Jacobi solver

(2.39)  F(X̄^l) = D^{−1} ( R̄ − (A − D) X̄^l ) ,  D := diag(A)

and the conjugate gradient solver

(2.40)  F(X̄^l) = X̄^l + (r̄^l · p̄^l)/(Ap̄^l · p̄^l) p̄^l ,  p̄^l = r̄^l + (r̄^l · r̄^l)/(r̄^{l−1} · r̄^{l−1}) p̄^{l−1} ,  r̄^l = R̄ − AX̄^l .

In the following we will see that the matrix A⁺ fulfills the properties under which the above schemes converge.
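A reference (unquantized) sketch of the Jacobi solver of Eq. 2.39, applied to a small strictly diagonally dominant system of the type A⁺X̄ = R̄; the concrete matrix and right hand side are illustrative assumptions.

```python
# Jacobi iteration X^{l+1} = D^{-1} (R - (A - D) X^l), D = diag(A),
# started from X^0 = R as in the text.

def jacobi(A, R, iters):
    n = len(R)
    X = list(R)                             # X^0 = R
    for _ in range(iters):
        X = [(R[i] - sum(A[i][j] * X[j] for j in range(n) if j != i))
             / A[i][i] for i in range(n)]
    return X

# symmetric, unit row sum, strictly diagonally dominant: a matrix of type A+
A = [[1.2, -0.2, 0.0],
     [-0.2, 1.4, -0.2],
     [0.0, -0.2, 1.2]]
R = [0.3, 0.6, 0.3]
X = jacobi(A, R, 50)
residual = max(abs(R[i] - sum(A[i][j] * X[j] for j in range(3)))
               for i in range(3))
```

Because A is strictly diagonally dominant, the iteration converges; after 50 iterations the residual is at the level of the floating point precision.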

2.3.1.2 Weighted Stiffness Matrix

In the general case of anisotropic diffusion, G(∇U^n_σ) = ( G(∇U^n_σ) )_{i,j}, i,j ∈ {x,y}, is a 2 by 2 matrix. The integration in Eq. 2.36 is performed separately on each grid element by applying the midpoint rule to the tensor components G(∇U^n_σ)_{i,j}:

(2.41)  L[Ū^n_σ]_αβ = Σ_{E ∈ E(α)} Σ_{i,j ∈ {x,y}} (G^n_E)_{i,j} (S^αβ_E)_{i,j}

G^n_E := G(∇U^n_σ(m_E)) ≈ G(∇U^n_σ(x))  ∀x ∈ E
(S^αβ_E)_{i,j} := (∂_i Φ_α, ∂_j Φ_β)|_E ,

where E(α) is defined as the set of the 4 elements around the node α and m_E is the center of the element E. The vector ∇U^n_σ(m_E) is computed with finite differences from the neighboring nodes of U^n_σ. The result is then inserted into the transformation matrix

(2.42)  B(∇U^n_σ(m_E)) := 1/‖∇U^n_σ(m_E)‖ · [  ∂_x U^n_σ(m_E)   ∂_y U^n_σ(m_E)
                                               ∂_y U^n_σ(m_E)  −∂_x U^n_σ(m_E) ]

and the weight matrix (Eq. 2.4)

(2.43)  g(‖∇U^n_σ(m_E)‖) := [ g_1(‖∇U^n_σ(m_E)‖)   0
                              0   g_2(‖∇U^n_σ(m_E)‖) ] .

With these matrices all components of the diffusion tensor

(2.44)  G(∇U^n_σ(m_E)) = B(∇U^n_σ(m_E))^T g(‖∇U^n_σ(m_E)‖) B(∇U^n_σ(m_E))
        = [ g^E_1 (b^E_x)² + g^E_2 (b^E_y)²    (g^E_1 − g^E_2) b^E_x b^E_y
            (g^E_1 − g^E_2) b^E_x b^E_y        g^E_1 (b^E_y)² + g^E_2 (b^E_x)² ]

g^E_1 := g_1(‖∇U^n_σ(m_E)‖) ,  g^E_2 := g_2(‖∇U^n_σ(m_E)‖) ,
b^E_x := ∂_x U^n_σ(m_E)/‖∇U^n_σ(m_E)‖ ,  b^E_y := ∂_y U^n_σ(m_E)/‖∇U^n_σ(m_E)‖

can be evaluated. The constants (S^αβ_E)_{i,j} depend only on the offset γ := β − α and the relative position of E to α and can be precomputed analytically. We obtain formulas for the inner sum in Eq. 2.41

(2.45)  Σ_{i,j ∈ {x,y}} (G^n_E)_{i,j} (S^αβ_E)_{i,j} =

          (g^E_1 + g^E_2)/3 + (g^E_2 − g^E_1) b^E_x b^E_y / 2             if α = β
        −(g^E_1 + g^E_2)/6 − (g^E_2 − g^E_1) b^E_x b^E_y / 2              if |α − β| = (1, 1)
        −(g^E_1 + g^E_2)/12 − (g^E_2 − g^E_1) ((b^E_x)² − (b^E_y)²)/4     if |α − β| = (1, 0)
        −(g^E_1 + g^E_2)/12 + (g^E_2 − g^E_1) ((b^E_x)² − (b^E_y)²)/4     if |α − β| = (0, 1)
          0                                                               else .

The anisotropic diffusion process consists of two components. The first addend in the above formula performs an isotropic linear (g^E_2) and non-linear (g^E_1) diffusion, while the second is responsible for the anisotropic behavior. The choice of the functions g_1, g_2 (Eq. 2.43) should guarantee that for all off-diagonal elements (α ≠ β) the results are non-positive, as this is important for the analysis later on. In general we assume 1 ≥ g_2 ≥ g_1 ≥ 0; then we may choose for example g_1(x) = (1 + P_cg(x))/2, g_2(x) = 1, or g_1(x) = P_cg(x)/2, g_2(x) = P_cg(x). We could also change the weights of the isotropic and anisotropic components or use a more sophisticated approximation for (G(∇Ū^n_σ) ∇Φ_α, ∇Φ_β)|_E to achieve the uniform non-positivity. Weickert proves that one can always find an appropriate approximation with finite differences for a sufficiently large stencil [Weickert, 1998, p. 88f]. For the non-linear diffusion model, when G(∇Ū^n_σ) = g̃(∇Ū^n_σ) is a scalar, we have the simpler form of the weighted stiffness matrix

(2.46)  L'[Ū^n_σ]_αβ = Σ_{E ∈ E(α)} G'^n_E S'^αβ_E

G'^n_E := g̃(∇U^n_σ(m_E)) ≈ g̃(∇U^n_σ(x))  ∀x ∈ E


S'^αβ_E := (∇Φ_α, ∇Φ_β)|_E = (S^αβ_E)_{x,x} + (S^αβ_E)_{y,y} .

If we set ∀x : g_1(x) = g_2(x) = g̃(x), we have G'^n_E S'^αβ_E = Σ_{i,j ∈ {x,y}} (G^n_E)_{i,j} (S^αβ_E)_{i,j} and the formula in Eq. 2.45 remains valid. In particular the off-diagonal elements are non-positive for all positive choices of g̃(x). Finally, in the simplest case of the linear diffusion we would have g̃(x) = 1. The mollification Ū^n_σ according to the parameter σ is performed in this way.

Let us now collect the properties of the weighted stiffness matrix L[Ū^n_σ]. We have seen that L[Ū^n_σ] has non-positive off-diagonal elements if either we choose g_1, g_2 appropriately or lessen the effect of the anisotropic addend in Eq. 2.45. Moreover L[Ū^n_σ] has zero row sum:

Σ_β L[Ū^n_σ]_αβ = Σ_{E ∈ E(α)} Σ_{i,j ∈ {x,y}} (G^n_E)_{i,j} ( Σ_β (S^αβ_E)_{i,j} ) = 0 ,

as Σ_β (S^αβ_E)_{i,j} = ( ∂_i Φ_α, ∂_j Σ_β Φ_β )|_E = 0, since Σ_β Φ_β = 1.

Obviously the same argument applies to L'[Ū^n_σ] and S'^αβ_E (Eq. 2.46). Under the general assumption 1 ≥ g_2 ≥ g_1 ≥ 0 we can also uniformly bound the diagonal elements from above using Eqs. 2.41 and 2.45:

(2.47)  |L[Ū^n_σ]_αα| ≤ 4 · ( (g^E_1 + g^E_2)/3 + (g^E_2 − g^E_1) |b^E_x b^E_y| / 2 )
               ≤ 4 · ( (g^E_1 + g^E_2)/3 + (g^E_2 − g^E_1)/4 )
               ≤ 4 · (7g^E_2 + g^E_1)/12 ≤ 8/3 ,

where we used |b^E_x b^E_y| ≤ 1/2 and g_1 ≤ g_2 ≤ 1.

The following properties of the matrices A^∓[Ū^n_σ] = 𝟙 ∓ (τ^n/h²) L[Ū^n_σ] follow:

• The matrices A^∓ are symmetric and have unit row sum.
• All off-diagonal elements of A⁻ are non-negative and the diagonal is positive if τ^n/h² < 1/max_α |L_αα|, where 1/max_α |L_αα| ≥ 3/8.
• All off-diagonal elements of A⁺ are non-positive, while the diagonal is positive.
• A⁺ is strictly diagonally dominant:

A⁺_αα = 1 − Σ_{β:β≠α} A⁺_αβ > Σ_{β:β≠α} |A⁺_αβ| ,

where we used the fact that A⁺ has unit row sum and non-positive off-diagonal elements.

We gather that the iterative schemes Eqs. 2.39 and 2.40 converge for A⁺.
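The assembly of the element diffusion tensor from Eqs. 2.42–2.44 can be sketched directly from the gradient at the element midpoint; the weight values passed in at the end are illustrative assumptions.

```python
# Diffusion tensor G = B^T g B of Eq. 2.44: g1 acts across the gradient
# direction (edge preservation), g2 along the level lines.
import math

def diffusion_tensor(gx, gy, g1, g2):
    """Assemble G from the gradient (gx, gy) at an element midpoint m_E."""
    norm = math.hypot(gx, gy)
    if norm == 0.0:                    # no preferred direction: isotropic
        return [[g2, 0.0], [0.0, g2]]
    bx, by = gx / norm, gy / norm      # rows of B: (bx, by) and (by, -bx)
    return [[g1 * bx * bx + g2 * by * by, (g1 - g2) * bx * by],
            [(g1 - g2) * bx * by,         g1 * by * by + g2 * bx * bx]]

# vertical edge (gradient in x direction): weak diffusion across it (g1),
# full diffusion along it (g2)
G = diffusion_tensor(1.0, 0.0, g1=0.1, g2=1.0)
```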


2.3.2 Quantized Diffusion

By applying the mass-exact matrix vector product (Eq. 2.29 on page 40) to the discrete scheme in Eq. 2.37 on page 44 we obtain a quantized explicit mass preserving scheme

(2.48)  Ū^{n+1} = eval_Q⊙( A⁻[Ū^n_σ] · Ū^n )
              = A⁻[Ū^n_σ] ⊙ Ū^n
              = Ū^n ⊖ ( (τ^n/h²) L[Ū^n_σ] ) ⊙ Ū^n ,

with the restriction τ^n/h² < 3/8. The factor τ^n/h² must be included in the matrix assembly of L[Ū^n_σ], because in the scalar vector multiplication (τ^n/h²) ⊙ (L[Ū^n_σ] ⊙ Ū^n) it would be impossible to ensure that the result has zero mass, even though it would be true for the input L[Ū^n_σ] ⊙ Ū^n.

With the semi-implicit time-discretization we eliminate the problem of the restricted time-step, since A⁺ is strictly diagonally dominant for all τ^n, but we encounter the problem of mass inexact scalar multiplications when evaluating the iterative solvers for the linear equation system (Eq. 2.38). Neither the scalar vector multiplications in the conjugate gradient solver (Eq. 2.40) nor the diagonal matrix vector multiplication in the Jacobi solver (Eq. 2.39) can be performed in a mass-exact fashion, because the local multiplicative roundoff errors do not necessarily cancel out across all elements. In case of the Jacobi solver we can minimize this effect by reformulating the scheme into a convex combination:

D^{−1} ( R̄ − (A − D) X̄^l )
   = ( 𝟙 + (τ^n/h²) L_D )^{−1} ( R̄ − (τ^n/h²) (L − L_D) X̄^l ) ,  L_D := diag(L)
(2.49)  = Λ R̄ + (𝟙 − Λ) L_D^{−1} (L_D − L) X̄^l ,  Λ := ( 𝟙 + (τ^n/h²) L_D )^{−1}
(2.50)  = X̄^l + Λ (R̄ − X̄^l) − (𝟙 − Λ) L_D^{−1} L X̄^l .

The second reformulation (Eq. 2.49) has the advantage that only positive coefficients are involved and that the convex combination favors roundoff error cancellations. Ideally the addends are evaluated by a two-dimensional lookup table. Otherwise it is usually better to explicitly factor out X̄^l to stabilize the iterations as in the last formula (Eq. 2.50). In both cases the time-step can take arbitrary values, even those which cannot be represented, because the coefficients of Λ will be evaluated by a lookup table or a linear approximation and the results are always bounded by 1. In practice, however, the range for τ^n is also restricted similarly to the explicit case, because a too large τ^n will imply Λ = 0 in the quantized application. Therefore, implicit methods cannot play out their full strength in low precision arithmetic, and despite their restrictions explicit methods are often preferable in such cases. If we want to guarantee mass preservation for the semi-implicit scheme we must correct the mass defect after the iterations. This can be done by computing the overall mass defect

(2.51)  m_d := Σ_α ( R̄_α − F(X̄^{l_max})_α ) ,

and dithering it back to the solution

(2.52)  Ū^{n+1} = F(X̄^{l_max}) + dither(m_d) ,

where dither(m_d) is a vector in which the defect m_d has been uniformly distributed across all components. Apart from keeping the mass balance, such additional dither favors the uniform distribution of roundoff errors [Wong, 1990]. However, losing the deterministic nature of the process may be undesirable, even if the random effects are marginal.
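The correction of Eqs. 2.51/2.52 can be sketched as follows; distributing the defect in single quantization steps at uniformly drawn positions is one possible reading of dither(m_d) and is an illustrative assumption, as are the concrete vectors.

```python
# Mass correction after the iterative solver: compute the defect
# m_d = sum(R) - sum(X) and dither it back in +/- EPS increments at
# randomly chosen components, so the total mass matches R again.
import random

EPS = 2.0 ** -8

def dither_correct(R, X, rng=random.Random(0)):
    defect = sum(R) - sum(X)              # m_d, here a multiple of EPS
    steps = round(defect / EPS)           # number of +/- EPS increments
    X = list(X)
    for _ in range(abs(steps)):
        i = rng.randrange(len(X))         # uniform random position
        X[i] += EPS if steps > 0 else -EPS
    return X

R = [0.5, 0.25, 0.25]
X = [0.5 - 2 * EPS, 0.25, 0.25 - EPS]     # solver output with mass defect
Xc = dither_correct(R, X)
```

After the correction the total mass of Xc equals the total mass of R exactly, while individual components differ from X by at most a few quantization steps.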

2.3.3 Quantized Scale-Space

In this section we will analyze the qualitative properties of the quantized diffusion processes in detail. This analysis is not only of direct interest for the implementations of the anisotropic dif- fusion methods, but also helps to understand the impact of roundoff error on other processes, since diffusion is a very common ingredient in various models. We recall the properties of the continuous scale-space (Section 2.1.1.2 on page 17) and try to maintain them in the quantized setting. Similar to the continuous scale-space operator St we now have a discrete quantized scale-space operator

S^Q_n[Ū⁰] := Ū^n ,

which, given an initial vector Ū⁰ which represents an image, delivers the quantized solution Ū^n after n time-steps by applying the quantized explicit mass preserving scheme in Eq. 2.48. The following analysis transfers the results of the discrete anisotropic filtering from [Weickert, 1998, Chapter 4] to the quantized setting. The subsequent properties hold true for a vector V̄ ∈ Q^{#Ω_h}, α ∈ Ω_h and n, m ∈ N₀:

• Semigroup property

S^Q_0 = 𝟙 ,  S^Q_{n+m} = S^Q_n ∘ S^Q_m .

PROOF This is clear from the definition. □

• Grey level shift invariance

S^Q_n[0] = 0 ,  S^Q_n[V̄ + c 1̄] = S^Q_n[V̄] + c 1̄ ,  c ∈ Q ,
1̄ := (1, . . . , 1)^T .

PROOF In the case of a zero row sum matrix like L[Ū^n_σ] in the quantized scheme Eq. 2.48, the mass-exact matrix vector product (Eq. 2.28) depends only on the differences of elements of V̄, and the evaluation of L[Ū^n_σ] uses finite differences to approximate the gradient, such that the results are invariant to shifts in grey level as long as the addition of c does not produce a saturation in the number representation. □

• Reverse contrast invariance

S^Q_n[−V̄] = −S^Q_n[V̄] .

PROOF Similar to the argument above, the differences in Eq. 2.28 allow a factorization of −1, and L[Ū^n_σ] is invariant to sign changes (Eqs. 2.41 and 2.44). □

• Average grey level invariance

M[ S^Q_n[V̄] ] = M[V̄] ,

where M : Q^{#Ω_h} → Q^{#Ω_h} is the averaging operator defined by

M[V̄] := ( 1/#Ω_h Σ_{α ∈ Ω_h} V̄_α ) 1̄ .

PROOF The preservation of this property was the reason for the introduction of the mass-exact matrix vector product, which preserves the overall sum of elements for symmetric matrices with unit row sum (Eq. 2.30). □

• Translation invariance

(S^Q_n ∘ τ_p)[V̄] = (τ_p ∘ S^Q_n)[V̄] ,

for translations (τ_p[V̄])_α := V̄_{α+p} with p ∈ hZ² and supp τ_p[V̄], supp (τ_p ∘ S^Q_n)[V̄] ⊆ Ω_h.

PROOF We obtain this invariance because the process does not depend explicitly on the position of the nodes. But obviously, identical results are only obtained if we translate with multiples of the element grid size h. □

• Isometry invariance

(S^Q_n ∘ R)[V̄] = (R ∘ S^Q_n)[V̄] ,

for orthogonal rotations R ∈ R^{2×2} by multiples of π/2 defined by (R[V̄])_α := V̄_{Rα} with supp R[V̄], supp (R ∘ S^Q_n)[V̄] ⊆ Ω_h.

PROOF As above, invariance is only valid if the nodes in Ω_h map one-to-one under R. □


• Extremum principle

(2.53)  min_β V̄_β ≤ S^Q_n[V̄]_α ≤ max_β V̄_β .

PROOF It suffices to prove the inequalities for n = 1, because the semigroup property allows us to set W̄ := S^Q_{n−1}[V̄] and then S^Q_n[V̄] = S^Q_1[W̄], such that the result for general n follows by induction. From the backward error analysis of the mass-exact matrix vector product we know that the quantized product corresponds to a perturbation of the original matrix (Eq. 2.33). For our scheme (Eq. 2.48) this means

(2.54)  S^Q_1[V̄] = A⁻[V̄_σ] ⊙ V̄ = B^{A⁻,V̄} · V̄

with the symmetric unit row sum matrix

(2.55)  (B^{A⁻,V̄})_αβ := λ^αβ_µ A⁻_αβ , β ≠ α,
        (B^{A⁻,V̄})_αα := 1 − Σ_{β:β≠α} λ^αβ_µ A⁻_αβ .

If all elements of B^{A⁻,V̄} are non-negative, the inequalities follow from

min_β V̄_β · Σ_β B^{A⁻,V̄}_αβ ≤ S^Q_1[V̄]_α = Σ_β B^{A⁻,V̄}_αβ V̄_β ≤ max_β V̄_β · Σ_β B^{A⁻,V̄}_αβ ,

because B^{A⁻,V̄} has unit row sum, i.e. Σ_β B^{A⁻,V̄}_αβ = 1.

Now we have to specify the condition under which all elements of B^{A⁻,V̄} are non-negative. From the properties of the weighted stiffness matrix L[Ū^n_σ] we know that all elements of A⁻ are non-negative under the assumption τ^n/h² < 1/max_α |L_αα|, where 1/max_α |L_αα| ≥ 3/8. The roundoff error factors λ^αβ_µ (Eq. 2.23) are always non-negative, such that the off-diagonal elements of B^{A⁻,V̄} are definitely non-negative. For the diagonal elements we have the condition:

(B^{A⁻,V̄})_αα ≥ 1 − max_{β:β≠α} λ^αβ_µ · Σ_{β:β≠α} A⁻_αβ = 1 − max_{β:β≠α} λ^αβ_µ · (τ^n/h²) L_αα ≥ 1 − (8/3) (τ^n/h²) max_{β:β≠α} λ^αβ_µ > 0 ,

where we use the bound on the diagonal elements L_αα from Eq. 2.47. For some rounding modes µ the correction factors λ^αβ_µ are bounded uniformly (Eq. 2.24). For the modes rounding to zero (µ = 0) and to nearest (µ = n) we thus obtain the time-step condition

(2.56)  τ^n/h² < 3/8 if µ = 0 ,  τ^n/h² < 3/16 if µ = n .


Under this condition the quantized scale-space satisfies the extremum principle. □
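The extremum principle and the mass preservation can be observed together in a quantized explicit step. The 1D linear stencil, the step size τ/h² = 1/4 and the 8 bit format below are illustrative assumptions; the asserted bounds are verified for this particular data.

```python
# One quantized explicit step of the type of Eq. 2.48 on a 1D chain:
# U'_i = U_i + sum_{j in {i-1, i+1}} TAU (x) (U_j - U_i).
# The fluxes are antisymmetric, so the mass is preserved exactly, and for
# this data the result also stays inside [min U, max U].

EPS = 2.0 ** -8
TAU = 0.25                        # plays the role of tau/h^2

def q_mul(a, b):                  # quantized multiplication, round to nearest
    return round(a * b / EPS) * EPS

def step(U):
    out = []
    for i in range(len(U)):
        acc = U[i]
        for j in (i - 1, i + 1):                 # 1D linear stencil
            if 0 <= j < len(U):
                acc += q_mul(TAU, U[j] - U[i])   # flux F_ij = -F_ji
        out.append(acc)
    return out

U0 = [0.0, 0.5, 1.0 - EPS, 0.25, 0.125]
U1 = step(U0)
```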

• Lyapunov functionals

For V̄ ∈ Q^{#Ω_h} and a convex r ∈ C[min_α V̄_α, max_α V̄_α], the functional

(2.57)  Φ[n, V̄] := Σ_α r( S^Q_n[V̄]_α )

is a Lyapunov functional:

(2.58)  Φ[0, M[V̄]] ≤ Φ[n, V̄] ,
(2.59)  Φ[n + 1, V̄] ≤ Φ[n, V̄] .

If r is strictly convex on [min_α V̄_α, max_α V̄_α] then Φ[n, V̄] is a strict Lyapunov functional:

(2.60)  Φ[0, M[V̄]] = Φ[n, V̄]  ⟺  S^Q_n[V̄] = M[V̄] ,
(2.61)  Φ[n + 1, V̄] = Φ[n, V̄]  ⟺  S^Q_n[V̄] = M[V̄]  or  B^{A⁻, S^Q_n[V̄]} = 𝟙 .

PROOF In the following we use the abbreviation V̄^n := S^Q_n[V̄]. The first inequality (2.58) is a consequence of the convexity of r and the average grey level invariance M[V̄] = M[V̄^n]:

(2.62)  Φ[0, M[V̄]] = Φ[0, M[V̄^n]] = Σ_α r( 1/#Ω_h Σ_β V̄^n_β ) ≤ Σ_α 1/#Ω_h Σ_β r(V̄^n_β) = Φ[n, V̄] ,

since Σ_α 1/#Ω_h = 1.

The second inequality (2.59) will be proved under the time-step condition from Eq. 2.56, as we require the equality A⁻[V̄_σ] ⊙ V̄ = B^{A⁻,V̄} V̄ (Eq. 2.54). The condition guarantees that all elements of B^{A⁻,V̄} (Eq. 2.55) are non-negative, such that the row elements, which have row sum one, can be used as convex factors:

(2.63)  Φ[n + 1, V̄] − Φ[n, V̄] = Σ_α r( Σ_β B^{A⁻,V̄}_αβ V̄^n_β ) − Σ_α r(V̄^n_α)
        ≤ Σ_α Σ_β B^{A⁻,V̄}_αβ r(V̄^n_β) − Σ_α r(V̄^n_α) = 0 ,

since by symmetry B^{A⁻,V̄} also has unit column sum, Σ_α B^{A⁻,V̄}_αβ = 1.


The first consequence of the strict convexity of r (Eq. 2.60) follows immediately from the fact that equality in the convex estimate in the derivation (2.62) holds true if and only if all addends in the sum are equal, i.e. V̄^n = M[V̄]. Similarly the second equivalence (Eq. 2.61) follows from the fact that equality in the convex estimate in the derivation (2.63) holds true if and only if all addends in the sum are equal, i.e. V̄^n = M[V̄], or there are no multiple addends, i.e. B^{A⁻,V̄} = 𝟙. □

The alternative condition B^{A⁻,V̄} = 𝟙 is the theoretic price we pay for the quantization. The matrix A⁻ itself always has several non-zero entries in a row, but depending on the current solution V̄^n, all the multiplications with off-diagonal elements may round to zero, such that effectively we apply a unit matrix.

• Convergence

(2.64)  ∀V̄ ∃ N^V̄ ∈ N₀ , C̄^V̄ ∈ Q^{#Ω_h} : ∀m ≥ N^V̄ : S^Q_m(V̄) = C̄^V̄ ,
(2.65)  lim_{ε_Q → 0} ‖ C̄^V̄ − M[V̄] ‖ = 0 .

PROOF The function r(x) := x² is strictly convex. The sequence (Φ[n, V̄])_n is then positive, monotone decreasing (Eq. 2.59) and can take only finitely many values (Q is finite), such that

∃ N^V̄ ∈ N₀ : ∀m ≥ N^V̄ : Φ[m, V̄] = Φ[N^V̄, V̄] .

If we set

C̄^V̄ := S^Q_{N^V̄}[V̄]

then Eq. 2.61 implies C̄^V̄ = M[V̄] or B^{A⁻,C̄^V̄} = 𝟙. In both cases the application of S^Q_1 will not change C̄^V̄ any more, such that Eq. 2.64 follows. To specify how much C̄^V̄ differs from M[V̄] we must examine the condition B^{A⁻,C̄^V̄} = 𝟙 more closely. For the condition to hold we must have (Eq. 2.55):

$\lambda^{\alpha\beta}_\mu = \frac{\mathrm{round}_{\mu Q}\big(A^-_{\alpha\beta}(\bar C^{\bar V}_\alpha - \bar C^{\bar V}_\beta)\big)}{A^-_{\alpha\beta}(\bar C^{\bar V}_\alpha - \bar C^{\bar V}_\beta)} = 0 .$

This means that $B^{A^-,\bar C^{\bar V}} = \mathbb 1$ is equivalent to

(2.66) $\mathrm{round}_{\mu Q}\Big(\frac{\tau^n}{h^2}\, L[\bar C^{\bar V}_\sigma]_{\alpha\beta}\, (\bar C^{\bar V}_\alpha - \bar C^{\bar V}_\beta)\Big) = 0$

for all $\alpha \ne \beta$. This condition includes the case $\bar C^{\bar V} = \mathcal M[\bar C^{\bar V}]$, because then all differences $\bar C^{\bar V}_\alpha - \bar C^{\bar V}_\beta$ are zero. If $g$ is bounded from below by some positive constant,


then the magnitudes of two of the three weights of the diagonal ($|\alpha-\beta| = (1,1)$), the horizontal ($|\alpha-\beta| = (0,1)$) or the vertical ($|\alpha-\beta| = (1,0)$) factors in each element (cf. Eq. 2.45 on page 46) have a positive lower bound. The contributions of the element weights to $\frac{\tau^n}{h^2} L[\bar C^{\bar V}_\sigma]$ do not cancel out in the summation (Eq. 2.41 on page 45), as they are all required to be non-positive, resulting in some uniform lower bound $c$. Thus, if all multiplications in Eq. 2.66 underflow to zero, all differences $|\bar C^{\bar V}_\alpha - \bar C^{\bar V}_\beta|$ are bounded by a multiple of $\varepsilon_Q$ proportional to $1/c$ within a grid element, and by a correspondingly larger multiple, growing like $1/h$, across the whole image. Consequently, the elements in $\bar C^{\bar V}$ converge to the same value as $\varepsilon_Q \to 0$, and because the mass is preserved this value must be the average. $\Box$
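The stopping behavior established in the proof can be observed in a small numerical experiment. The following Python sketch is an illustrative 1D stand-in (not the implementation discussed later in the thesis): explicit linear diffusion in which each off-diagonal product is rounded to an assumed 8-bit fixed point grid. Because the same rounded flux is exchanged between the two neighbors of an edge, mass is conserved exactly, and once all fluxes underflow the iteration freezes at a near-constant state.

```python
from fractions import Fraction

EPS_Q = Fraction(1, 256)  # quantization step of an assumed 8-bit fixed point format

def round_q(x):
    """Round to the nearest representable multiple of EPS_Q (ties to even)."""
    return round(x / EPS_Q) * EPS_Q

def quantized_diffusion_step(v, w=Fraction(1, 4)):
    """One explicit step of 1D linear diffusion with quantized off-diagonal
    products.  Node i exchanges round_q(w*(v[i+1]-v[i])) with its neighbor;
    since the same rounded flux is added on one side and subtracted on the
    other, the total mass is preserved exactly."""
    out = list(v)
    for i in range(len(v) - 1):
        flux = round_q(w * (v[i + 1] - v[i]))  # may underflow to zero
        out[i] += flux
        out[i + 1] -= flux
    return out

v = [Fraction(0), Fraction(1, 2), Fraction(1), Fraction(1, 2)]
mass0 = sum(v)
for _ in range(500):
    v = quantized_diffusion_step(v)
assert sum(v) == mass0                       # mass preserved exactly
assert quantized_diffusion_step(v) == v      # a fixed point has been reached
assert all(abs(v[i + 1] - v[i]) <= 2 * EPS_Q for i in range(len(v) - 1))
```

The iteration freezes with neighbor differences of at most $2\varepsilon_Q$ rather than at the exact average, in line with the qualitative statement of the proof.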

The exact time-step $N^{\bar V}$ at which the process stops for a given $\varepsilon_Q$ depends on the choice of the weight functions $g_1, g_2$. But in general we know that in areas with large gradients we will have $g_2 \ll g_1$, and in smooth areas $1 \approx g_1 \approx g_2$. In the first case the diffusion will continue, since $\bar V_\alpha - \bar V_\beta$ will be fairly large and one of the off-diagonal coefficients in $L[\bar V_\sigma]$ will continue the diffusion process. In the second case we have almost a linear diffusion, such that $L[\bar V_\sigma]_{\alpha\beta} \approx -\frac13$ for all $\|\alpha - \beta\|_\infty = 1$. Thus, if $\frac{\tau^n}{h^2}$ is chosen maximal according to the time-step condition (Eq. 2.56 on page 51), the diffusion stops roughly when all the differences $\bar V_\alpha - \bar V_\beta$ within a grid element become smaller than $8\varepsilon_Q$, while the differences across the image may still cover the whole number range, since the width or height of an image is often larger than $1/(8\varepsilon_Q)$ (cf. Figure 4.11 on page 142). If all differences are small we can compensate this effect by simply scaling up all discrete differences, because such an operation corresponds to an implicitly enlarged time-step width in the explicit scheme. However, this is only possible if all differences are small; otherwise we would violate the CFL condition (Eq. 2.13 on page 25) and destabilize the process.

Therefore, image processing with low precision number systems is bad at transporting information across large distances on a fine grid. A multi-grid strategy must be used if this is required. But multi-grid solvers are only successful if a certain precision $\varepsilon_Q$ is available; otherwise the operations of restriction and prolongation produce artefacts due to roundoff errors. For the anisotropic diffusion the local smoothing properties are more important, because in practice the process runs only as long as the edges in the image are still well preserved. In the registration problem we will apply a multi-grid strategy, because correlations between distant regions will have to be examined.

The fact that the quantized scheme allows us to define Lyapunov functionals similar to the continuous setting guarantees that despite strong quantization we can maintain the qualitative smoothing properties of the continuous model. The following special Lyapunov functionals exemplify these properties:

• Vector norms $\|\bar U^n\|_p$ for $p \ge 1$.
All such vector norms are decreasing with $n$; in particular the energy $\|\bar U^n\|_2$ is reduced.

• Even central moments $\frac{1}{\#\Omega_h} \sum_\alpha (\bar U^n_\alpha - \mathcal M[\bar U]_\alpha)^{2p}$ for $p \in \mathbb N$.
The decrease in central moments characterizes the smoothing of deviations from the mean value. In particular the variance, as the second central moment, becomes consecutively smaller with $n$.

• Entropy $-\sum_\alpha \bar U^n_\alpha \ln(\bar U^n_\alpha)$ for $\min_\alpha \bar U^0_\alpha > 0$.
The entropy is constantly increasing with $n$, and thus the quantized process subsequently reduces the information in the image.
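These monotonicity properties are easy to check numerically. The sketch below (plain floating point, 1D, mirror boundaries; an illustration rather than the quantized scheme itself) verifies on a small signal that the energy and the variance decrease and the entropy increases under explicit linear diffusion steps:

```python
import math

def smooth(u, w=0.25):
    """One explicit linear diffusion step on a 1D signal with mirror
    boundaries; a floating point stand-in for the quantized scheme."""
    ext = [u[0]] + list(u) + [u[-1]]
    return [ext[i] + w * (ext[i - 1] - 2.0 * ext[i] + ext[i + 1])
            for i in range(1, len(u) + 1)]

def p_norm(u, p):
    return sum(abs(x) ** p for x in u) ** (1.0 / p)

def central_moment(u, p):
    m = sum(u) / len(u)
    return sum((x - m) ** (2 * p) for x in u) / len(u)

def entropy(u):
    return -sum(x * math.log(x) for x in u)

u0 = [0.2, 0.9, 0.4, 0.7, 0.1]     # positive values, so the entropy is defined
u = list(u0)
e2, var, ent = p_norm(u, 2), central_moment(u, 1), entropy(u)
for _ in range(20):
    u = smooth(u)
    assert p_norm(u, 2) <= e2 + 1e-12           # energy decreases
    assert central_moment(u, 1) <= var + 1e-12  # variance decreases
    assert entropy(u) >= ent - 1e-12            # entropy increases
    e2, var, ent = p_norm(u, 2), central_moment(u, 1), entropy(u)
```

The step is a convex combination of neighboring values with symmetric weights, i.e. a doubly stochastic map, which is exactly why all three Lyapunov quantities behave monotonically while the mean is preserved.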

2.4 Level-Set Segmentation

For the analysis and the implementation we will consider the front propagation under external forces $f(t,x) = c(t,x) + g_1(p(x)) + g_2(\|\nabla p(x)\|)$ exclusively (cf. Section 2.1.2.1 on page 19), such that the level-set equation obtains the form:

$\partial_t \phi + f \|\nabla\phi\| = 0 ,\quad \text{in } \mathbb R^+ \times \Omega ,$
$\phi(0) = \phi_0 ,\quad \text{on } \Omega .$

The choice of the functions $g_1, g_2$, depending on the image values $p(x)$ and its derivatives $\nabla p(x)$, is determined by the user's input and should steer the evolution such that the desired segments are covered by the level-set function. The reason for the simple choice of the speed function lies in the very low prerequisites on the functionality of the hardware used for the implementation of this problem.

2.4.1 Upwind Scheme

By introducing the Hamilton function $H(u) := f\|u\|$ we obtain a Hamilton-Jacobi equation $\partial_t \phi + H(\nabla\phi) = 0$. Using the explicit Euler time discretization (Eq. 2.12 on page 25) we have

$\frac{\phi^{n+1} - \phi^n}{\tau^n} = -H(\nabla\phi^n) ,$

where $\tau^n$ is the current time-step width. The upwind methodology [Engquist and Osher, 1980; Osher and Sethian, 1988] used in schemes for hyperbolic conservation laws (Eq. 2.7 on page 21) gives us a numerical flux function $g : \mathbb R^2 \times \mathbb R^2 \to \mathbb R$ which approximates $H$, i.e. $H(u(x)) \approx g(u(x-\varepsilon), u(x+\varepsilon))$. The flux $g$ is constructed such that it uses values upwind of the direction of information propagation (see Eq. 2.70). For our equidistant grid (Figure 2.1 on page 27) we obtain the simple upwind level-set scheme:

(2.67) $\bar\Phi^{n+1}_\alpha = \bar\Phi^n_\alpha - \frac{\tau^n}{h}\, \bar g_\alpha(D^-_\alpha\bar\Phi^n, D^+_\alpha\bar\Phi^n) ,$


(2.68) $D^+_\alpha \bar\Phi^n := \begin{pmatrix} \bar\Phi^n_{\alpha+(0,1)} - \bar\Phi^n_\alpha \\ \bar\Phi^n_{\alpha+(1,0)} - \bar\Phi^n_\alpha \end{pmatrix} ,$
(2.69) $D^-_\alpha \bar\Phi^n := \begin{pmatrix} \bar\Phi^n_\alpha - \bar\Phi^n_{\alpha-(0,1)} \\ \bar\Phi^n_\alpha - \bar\Phi^n_{\alpha-(1,0)} \end{pmatrix} .$

The discrete solution $\Phi^n$ is expected to approximate $\phi(\sum_{j=0}^{n-1} \tau^j)$. We implement the mirror boundary condition by copying for each $n$ the border values $\bar\Phi^n_\alpha$ into a layer of ghost nodes surrounding the grid $\Omega_h$. In this way we can always evaluate $D^\pm_\alpha\bar\Phi^n$ for all $\alpha \in \Omega_h$. It is important to note that the equidistant grid allows us to factorize $1/h$ from the gradient arguments to $g$, such that this large number can be compensated against $\tau^n$. For convex $H$ a simple choice for the numerical flux $g$ is the Engquist-Osher flux [Engquist and Osher, 1980]:

(2.70) $\bar g_\alpha(\bar U, \bar V) = \bar F^+_\alpha \sqrt{\|\bar U^+_\alpha\|^2 + \|\bar V^-_\alpha\|^2} + \bar F^-_\alpha \sqrt{\|\bar U^-_\alpha\|^2 + \|\bar V^+_\alpha\|^2} ,$

with $X^+ := \max(X, 0)$ and $X^- := \min(X, 0)$ applied componentwise.

Using the explicit scheme we ensure stability by satisfying the CFL condition (Eq. 2.13 on page 25):

(2.71) $\frac{\tau^n}{h} \max_\alpha |\bar F_\alpha| < 1 .$

Because we are using only external forces to steer the front propagation, we can precompute and normalize the velocities in advance, such that this condition directly restricts the size of the time-step width.
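For illustration, the following unquantized floating point sketch assembles Eqs. 2.67-2.71 into one time-step; the grid size, the speed field and the initial level-set function are hypothetical choices, not taken from the thesis. For a positive speed, the zero level set of a signed distance function expands:

```python
import math

def upwind_step(phi, F, tau, h):
    """One explicit upwind step for  d_t(phi) + F*|grad(phi)| = 0  on a 2D
    grid with mirror (clamped ghost-node) boundaries.  The Engquist-Osher
    flux of Eq. 2.70 is used: for F > 0 the backward differences enter with
    max(.,0) and the forward ones with min(.,0), and vice versa for F < 0."""
    ny, nx = len(phi), len(phi[0])
    def at(i, j):
        return phi[min(max(i, 0), ny - 1)][min(max(j, 0), nx - 1)]
    out = [row[:] for row in phi]
    for i in range(ny):
        for j in range(nx):
            um = (at(i, j) - at(i, j - 1), at(i, j) - at(i - 1, j))  # D^- (backward)
            up = (at(i, j + 1) - at(i, j), at(i + 1, j) - at(i, j))  # D^+ (forward)
            gp = math.sqrt(sum(max(a, 0.0) ** 2 for a in um) +
                           sum(min(b, 0.0) ** 2 for b in up))
            gm = math.sqrt(sum(min(a, 0.0) ** 2 for a in um) +
                           sum(max(b, 0.0) ** 2 for b in up))
            f = F[i][j]
            out[i][j] = phi[i][j] - tau / h * (max(f, 0.0) * gp + min(f, 0.0) * gm)
    return out

# hypothetical test case: expanding circle, phi < 0 inside the front, speed 1
n, h = 16, 1.0 / 16
F = [[1.0] * n for _ in range(n)]
phi = [[math.hypot((j + 0.5) * h - 0.5, (i + 0.5) * h - 0.5) - 0.2
        for j in range(n)] for i in range(n)]
tau = 0.5 * h                      # CFL (Eq. 2.71): tau/h * max|F| = 0.5 < 1
inside0 = sum(v < 0 for row in phi for v in row)
for _ in range(4):
    phi = upwind_step(phi, F, tau, h)
assert sum(v < 0 for row in phi for v in row) > inside0  # the front expanded
```

Note that, as in the text, the factor $1/h$ is not applied to the differences themselves but compensated against $\tau^n$ in the prefactor.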

2.4.2 Quantized Propagation

The main difficulty in applying the upwind scheme (Eq. 2.67) lies in the evaluation of the flux $g$ (Eq. 2.70) under strong quantization. The involved Euclidean norm $\sqrt{\|U\|^2 + \|V\|^2}$ is a non-linear function of four values and thus very error-prone in fixed point calculations due to the aliasing effect (see examples on page 32 in Section 2.2.2.2). In particular, many evaluations would return zero, because the square function on $[0,1]$ would underflow. In the presence of a two dimensional lookup mechanism one finds a remedy by computing the norm as a three step lookup, evaluating $\|U\|^2$ first, then $\|V\|^2$ and finally $\sqrt{\|U\|^2 + \|V\|^2}$. But if we rely on simple arithmetic operations then it is necessary to work with a piecewise linear approximation:

(2.72) $\|X\|_2 \approx \|X\|_{\mathrm{lin}} := c\,\|X\|_1 + (1-c)\,\|X\|_\infty ,$

with $X \in \mathbb R^n$ and $c \in [0,1]$. The norms $\|.\|_1$ and $\|.\|_\infty$ have the advantage that even with the quantized operators they can be evaluated exactly within the represented number range. Figure 2.2 shows the unit spheres of the different norms.


Figure 2.2 The unit spheres of different vector norms in 2D. From left to right we see the Euclidean norm, the 1-norm, the infinity norm and the $\|.\|_{\mathrm{lin}}$ norm, which is a convex combination of the previous two (Eq. 2.72). It gives a good approximation and avoids the underflow in multiplications incurred in the evaluation of the Euclidean norm. The unit spheres are shown in 2D, but the approximation takes place in 4D.

The convex combination guarantees that we have equality for the unit vectors $X = e_i$. For $c$ we have various options. In an interpolative approach we choose $c = \frac{1}{\sqrt n + 1}$, such that we obtain equality in Eq. 2.72 also for the diagonal vectors $(\pm\frac{1}{\sqrt n})_{i=0,\dots,n-1}$. Another option is to match the volume of the $n$-dimensional Euclidean unit ball, $\frac{2\pi^{n/2}}{n\Gamma(n/2)}$, by the corresponding combination of the volumes of the unit balls of the norms $\|.\|_1$ and $\|.\|_\infty$, which are $\frac{2^n}{n!}$ and $2^n$ respectively. Finally, we can bound the relative error directly for uniformly distributed arguments in $V = [-1,1]^n$ by numerically minimizing $\int_V \frac{\|X\|_2 - \|X\|_{\mathrm{lin}}}{\|X\|_2}\, dX$. Unfortunately, none of these choices can guarantee global properties of the approximation, because the arguments are not equally distributed in $[-1,1]^n$ and the distribution can vary significantly in different parts of the image. However, any of these approximations is good enough to be used in our low precision quantized computations, which themselves produce strong deviations from the continuous expansion, such that the effect of this approximation can only be seen in absolutely homogeneous areas (Figure 4.3 on page 132).
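A small numerical check of the approximation and of the interpolative choice of $c$ (an illustration, not taken from the thesis): for $n = 4$ we get $c = 1/(\sqrt 4 + 1) = 1/3$, equality on the unit and diagonal vectors, and a small mean relative error for random arguments.

```python
import math, random

def lin_norm(x, c):
    """Piecewise linear approximation |x|_lin = c*|x|_1 + (1-c)*|x|_inf
    of the Euclidean norm (Eq. 2.72)."""
    ax = [abs(v) for v in x]
    return c * sum(ax) + (1.0 - c) * max(ax)

n = 4                                  # the flux evaluation uses 4-vectors
c_interp = 1.0 / (math.sqrt(n) + 1.0)  # interpolative choice of c

# exact on a unit vector (for any c) and on the diagonal (for c_interp)
assert abs(lin_norm([1.0, 0.0, 0.0, 0.0], c_interp) - 1.0) < 1e-9
assert abs(lin_norm([1.0 / math.sqrt(n)] * n, c_interp) - 1.0) < 1e-9

# crude Monte-Carlo estimate of the mean relative error on [-1,1]^n
random.seed(0)
errs = []
for _ in range(2000):
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    e2 = math.sqrt(sum(v * v for v in x))
    errs.append(abs(lin_norm(x, c_interp) - e2) / e2)
assert sum(errs) / len(errs) < 0.1     # only a few percent on average
```

The relative error is scale-invariant (both norms are 1-homogeneous), so it depends only on the direction of the argument; the worst directions deviate by roughly fifteen percent, typical ones by a few percent.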

With the new norm we obtain the quantized upwind scheme

(2.73)
$\bar\Phi^{n+1}_\alpha = \bar\Phi^n_\alpha - \bar g^{\mathrm{lin}}_\alpha(D^-\bar\Phi^n, D^+\bar\Phi^n) ,$
$\bar g^{\mathrm{lin}}_\alpha(U,V) := \big(\tfrac{\tau^n}{h}\bar F\big)^+_\alpha \|(\bar U^+_\alpha, \bar V^-_\alpha)\|_{\mathrm{lin}} \oplus \big(\tfrac{\tau^n}{h}\bar F\big)^-_\alpha \|(\bar U^-_\alpha, \bar V^+_\alpha)\|_{\mathrm{lin}} ,$
$\|X\|_{\mathrm{lin}} := c\,\|X\|_1 \oplus (1-c)\,\|X\|_\infty ,$

where we avoid an additional multiplication with $\frac{\tau^n}{h}$ by including this factor into the precomputed speed function, thus permanently satisfying the CFL condition (Eq. 2.71). We also realize that for the implementation we do not need any negative numbers, because the flux $\bar g^{\mathrm{lin}}_\alpha$ requires only the absolute values of the differences $D^-_\alpha\bar\Phi^n$, $D^+_\alpha\bar\Phi^n$, such that it suffices to compute differences in $[0,1]$.


2.4.3 Asymptotic Behavior

As for the diffusion processes, we cannot derive any bounds which relate the magnitude of the quantized solution $\bar\Phi$ directly to the continuous solution $\phi$, because after, say, 100 iterations the roundoff errors in the multiplications in Eq. 2.73, for a typical machine epsilon of $\varepsilon_Q = 3.9 \cdot 10^{-3}$, will easily corrupt any such result. But similar to the stopping criterion (Eq. 2.61 on page 52), we can find a condition describing the steady state of the evolution.

Obviously the evolution stops if $\bar g^{\mathrm{lin}}_\alpha(D^-\bar\Phi^n, D^+\bar\Phi^n) = 0$ for every node $\alpha \in \Omega_h$. This will happen when all the multiplications $(\tfrac{\tau^n}{h}\bar F)^\pm_\alpha \|(\bar U^\pm_\alpha, \bar V^\mp_\alpha)\|_{\mathrm{lin}}$ evaluate to zero. This is intended when $(\tfrac{\tau^n}{h}\bar F)_\alpha$ is close to zero, since we have modeled the equation to behave in this way. However, underflows due to small differences $\bar U_\alpha = D^-_\alpha\bar\Phi^n$, $\bar V_\alpha = D^+_\alpha\bar\Phi^n$ are undesired. The approximate norm (Eq. 2.72) already helps to minimize this effect by processing small arguments more accurately than a direct implementation of the Euclidean norm. But where the precomputed velocities $(\tfrac{\tau^n}{h}\bar F)_\alpha$ are small, the multiplications will underflow even for moderately small differences. We can compensate this effect by rescaling $\bar\Phi$ such that the front steepens and the differences become larger:

$\bar\Phi' := s_1 \odot \bar\Phi ,\quad s_1 \in \mathbb N ,$
$D\bar\Phi'^n = s_1 \cdot D\bar\Phi^n .$

Applying such a rescaling will saturate large positive and negative values, but this has little effect on the front propagation, as it only happens in regions far away from the front. Assuming exact computations, the rescaling is equivalent to a scaling of the time-step width, since $g^{\mathrm{lin}}$ is positively 1-homogeneous:

$\bar g^{\mathrm{lin}}(D^-\bar\Phi'^n, D^+\bar\Phi'^n) = \bar g^{\mathrm{lin}}(s_1 \cdot D^-\bar\Phi^n, s_1 \cdot D^+\bar\Phi^n) = s_1 \cdot \bar g^{\mathrm{lin}}(D^-\bar\Phi^n, D^+\bar\Phi^n) .$

Therefore, the scaling with $s_1$ is a natural parameter adaption when the evolution of the front threatens to stop. For the quantized scheme the multiplications in the evaluation of $\|.\|_{\mathrm{lin}}$ are more accurate when the argument is scaled than when the result is scaled, because more digits of $c$ have an impact on the result. This is desirable since we do not want to scale up existing errors. Additionally, we may also increase the time-step width directly by scaling up the factors $(\tfrac{\tau^n}{h}\bar F)_\alpha$:

$s_2 \odot \big(\tfrac{\tau^n}{h}\bar F\big)_\alpha \|(\bar U_\alpha, \bar V^\mp_\alpha)\|_{\mathrm{lin}} = \big(\tfrac{s_2 \cdot \tau^n}{h}\bar F\big)_\alpha \|(\bar U_\alpha, \bar V^\mp_\alpha)\|_{\mathrm{lin}} ,\quad s_2 \in \mathbb N .$

This may become necessary if all the differences $D_\alpha\bar\Phi^n$ along the front are fairly large, but the velocities are very small.
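The effect of the two scalings can be demonstrated with a toy quantized multiplication; the speed and difference values below are hypothetical and merely chosen to trigger the underflow:

```python
from fractions import Fraction

EPS_Q = Fraction(1, 256)     # assumed fixed point resolution

def round_q(x):
    """Round to the nearest representable multiple of EPS_Q."""
    return round(x / EPS_Q) * EPS_Q

def q_mul(a, b):
    """Quantized multiplication: exact product, then rounding to the grid."""
    return round_q(Fraction(a) * Fraction(b))

speed = Fraction(40, 256)    # hypothetical small precomputed (tau/h)*F value
diff = Fraction(2, 256)      # hypothetical small level-set difference

# the unscaled product underflows to zero: the front would stop here
assert q_mul(speed, diff) == 0
# scaling the differences by s1 (steepening the front) revives the motion
s1 = 4
assert q_mul(speed, s1 * diff) != 0
# scaling the argument is NOT the same as scaling the rounded result,
# which is why the front is rescaled before the multiplication
assert q_mul(speed, s1 * diff) != s1 * q_mul(speed, diff)
```

In exact arithmetic the two sides of the last comparison would coincide by 1-homogeneity; under quantization only scaling before rounding recovers the lost information.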

Thus by choosing $s_1, s_2$ sufficiently large we can always prevent a ubiquitous underflow in the multiplications $(\tfrac{\tau^n}{h}\bar F)_\alpha \|(\bar U_\alpha, \bar V^\mp_\alpha)\|_{\mathrm{lin}}$. In this way we can ensure that, similar to the

continuous propagation, the quantized front will finally stop only where the velocity equals zero. However, we cannot exclude the case that the propagation stops for some time in some parts of the image while it continues in others, simply because the limited number range can handle only a restricted number of different velocities. If, for example, the segmented region expands with a high velocity to the left and should simultaneously expand very slowly to the right, then the movement to the right can stop altogether if the difference between the velocities is too large, and it will resume as soon as the expansion velocity to the left has sufficiently decreased. The described technique thus makes the best use of the limited number range by providing the highest accuracy for the quickly moving parts of the interface and paying less attention to the slower parts, so as to minimize the overall difference to the continuous evolution.

2.5 Gradient Flow Registration

Our task in the registration problem is to minimize the energy between a template image $T$ and a reference image $R$ (see Section 2.1.3 on page 22),

$E_\varepsilon[u] = \frac12 \int_\Omega |T_\varepsilon \circ (\mathbb 1 + u) - R_\varepsilon|^2 ,$
$E'_\varepsilon[u] = \big(T_\varepsilon \circ (\mathbb 1 + u) - R_\varepsilon\big)\, \nabla T_\varepsilon \circ (\mathbb 1 + u) ,$

by a regularized gradient flow

$\partial_t u_{\varepsilon_k} = -A(\sigma)^{-1} E'_{\varepsilon_k}[u_{\varepsilon_k}] ,\quad \text{in } \mathbb R^+ \times \Omega ,$
(2.74) $u_{\varepsilon_k}(0) = u_{\varepsilon_{k+1}}(t_{\varepsilon_{k+1}}) ,\quad \text{on } \Omega ,$
$u_{\varepsilon_K}(0) = 0 ,\quad \text{on } \Omega ,$
(2.75) $A(\sigma) := \mathbb 1 - \frac{\sigma^2}{2}\, \Delta ,$

which starts on a coarse scale $\varepsilon_K$ and consecutively passes the result of the coarser scale $\varepsilon_{k+1}$ as the initialization to the next finer scale $\varepsilon_k$, until the final result $u_{\varepsilon_0}(t_{\varepsilon_0})$ on the finest scale $\varepsilon_0$ is obtained. The solution then describes the deformation under which the template correlates with the reference image. The parameter $\sigma \in \mathbb R^+$ controls the regularization of the gradient flow during the process.

2.5.1 Multi-Grid Discretization

To solve the gradient flow problem on a given scale $\varepsilon$ we use the explicit Euler time discretization (Eq. 2.12 on page 25) with the operator $\mathcal F[\mathcal C^p_\sigma[u], v] = \mathcal C^p_\sigma[u] = A(\sigma)^{-1} E'_\varepsilon[u]$:

$\frac{u^{n+1}_\varepsilon - u^n_\varepsilon}{\tau^n} = -A(\sigma)^{-1} E'_\varepsilon[u^n_\varepsilon] ,$


where $\tau^n$ is determined by Armijo's rule [Kosmol, 1991]:

(2.76) $\frac{E_\varepsilon[u^n] - E_\varepsilon[u^{n+1}]}{\tau^n\, \langle E'_\varepsilon[u^n],\ A(\sigma)^{-1} E'_\varepsilon[u^n] \rangle} \ge c ,$

for a fixed $c \in (0, \frac12)$, where $\langle .,. \rangle$ denotes the $L^2$ scalar product. This allows an adaptive acceleration of the gradient descent towards a minimum.

For the space discretization we use again an equidistant grid $\Omega_h$ (Figure 2.1 on page 27) with linear Finite Elements (FEs), but this time it serves as the bottom of a whole hierarchy of equidistant grids $(\Omega_{h_l})_{l=0,\dots,L}$, where $\Omega_{h_0}$ is the finest and $\Omega_{h_L}$ the coarsest grid. The element length doubles from grid to grid, $h_l = 2^{l-L}$. To transfer data within the pyramid we use restriction $\rho^l_{l+1} : \Omega_{h_l} \to \Omega_{h_{l+1}}$ and prolongation $\pi^{l+1}_l : \Omega_{h_{l+1}} \to \Omega_{h_l}$ operators:

(2.77) $\rho^l_{l+1}(\bar U)_\alpha := \sum_{\beta \in \Omega_{h_l}} a^l_{\alpha\beta}\, \bar U_\beta ,\quad \alpha \in \Omega_{h_{l+1}} ,$
$\pi^{l+1}_l(\bar U)_\beta := \sum_{\alpha \in \Omega_{h_{l+1}}} a^l_{\alpha\beta}\, \bar U_\alpha ,\quad \beta \in \Omega_{h_l} ,$

where the factors $a^l_{\alpha\beta}$ are defined through the representation of a coarse basis function $\Theta^{l+1}_\alpha$ by the fine basis functions $\Theta^l_\beta$:

$\Theta^{l+1}_\alpha = \sum_{\beta \in \Omega_{h_l}} a^l_{\alpha\beta}\, \Theta^l_\beta ,\quad \alpha \in \Omega_{h_{l+1}} .$
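In 1D with linear FEs the coefficients $a^l_{\alpha\beta}$ are the hat-function weights $(\frac12, 1, \frac12)$, so prolongation becomes linear interpolation and restriction its transpose. A minimal sketch (illustrative; node numbering and boundary handling are simplified assumptions):

```python
def prolongate(uc):
    """pi: interpolate a coarse 1D nodal vector to the next finer grid
    (linear FE): coarse nodes are copied, in-between nodes are averaged."""
    uf = [0.0] * (2 * len(uc) - 1)
    for a, v in enumerate(uc):
        uf[2 * a] = v
    for i in range(1, len(uf), 2):
        uf[i] = 0.5 * (uf[i - 1] + uf[i + 1])
    return uf

def restrict(uf):
    """rho: transpose of the prolongation -- each coarse node collects its
    fine neighborhood with the hat-function weights (1/2, 1, 1/2)."""
    uc = [0.0] * ((len(uf) + 1) // 2)
    for a in range(len(uc)):
        uc[a] = uf[2 * a]
        if 2 * a - 1 >= 0:
            uc[a] += 0.5 * uf[2 * a - 1]
        if 2 * a + 1 < len(uf):
            uc[a] += 0.5 * uf[2 * a + 1]
    return uc

coarse_hat = [0.0, 1.0, 0.0]
fine = prolongate(coarse_hat)
assert fine == [0.0, 0.5, 1.0, 0.5, 0.0]    # coarse hat reproduced exactly
assert restrict(fine) == [0.25, 1.5, 0.25]  # transposed weights accumulate
```

The prolongation reproduces a coarse hat function exactly on the fine grid, which is precisely the defining property of the coefficients $a^l_{\alpha\beta}$.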

The grid hierarchy is used for an efficient representation of the different image scales $\bar R_\varepsilon, \bar T_\varepsilon, \bar U_\varepsilon$.

Images on the scale $\varepsilon_k$ are represented on the coarsest grid $\Omega_{h_l}$ for which

(2.78) $\varepsilon_k \ge c_h \cdot h_l$

still holds true for some $c_h \in [\frac12, 2]$ (cf. Figure 4.12 on page 144), i.e. we have a monotone mapping $l : \varepsilon \to l(\varepsilon)$. Thus computations on coarse scales are performed on small grids, and only the last few scales, represented on the finest grid $\Omega_{h_0}$, consume the full processing power. For very coarse scales the function $l$ is bounded from above,

(2.79) $\forall \varepsilon :\ l(\varepsilon) \le L_0 < L ,$

to avoid the computation of initial deformations on very small grids; $\Omega_{h_L}$, for example, is only a 2x2 grid. Typical choices for the initial grid are $L - L_0 \in \{2, 3\}$. Naturally, the multi-grid V-cycle (Eq. 2.82) is not affected by this bound and uses all grids in the hierarchy. On scale $\varepsilon$ we have to compute the update formula

(2.80) $\bar U^{n+1}_\varepsilon = \bar U^n_\varepsilon - \frac{\tau^n}{h_{l(\varepsilon)}}\, A_{h_{l(\varepsilon)}}(\sigma)^{-1}\, \bar E'_\varepsilon[\bar U^n_\varepsilon] ,$
$\bar E'_\varepsilon[\bar U^n_\varepsilon] = \big(\bar T_\varepsilon \circ (\mathbb 1 + \bar U^n_\varepsilon) - \bar R_\varepsilon\big) \cdot h_{l(\varepsilon)} \nabla_{h_{l(\varepsilon)}} \bar T_\varepsilon \circ (\mathbb 1 + \bar U^n_\varepsilon) ,$


where the matrix $A_{h_{l(\varepsilon)}}(\sigma)$ is the discrete counterpart of the regularization operator $A(\sigma)$ (Eq. 2.75) in the linear FE space on $\Omega_{h_{l(\varepsilon)}}$, and we have factorized $1/h_{l(\varepsilon)}$ to make use of the whole number range in the regularization of the gradient with $A_{h_{l(\varepsilon)}}(\sigma)^{-1}$. We iterate this formula $N$ times until the update is sufficiently small:

(2.81) $\Big\| \frac{\tau^n}{h_{l(\varepsilon)}}\, A_{h_{l(\varepsilon)}}(\sigma)^{-1}\, \bar E'_\varepsilon[\bar U^N_\varepsilon] \Big\|_2 \le \delta .$

Since multi-grid solvers are the most efficient tools for solving linear systems of equations, the gradient smoothing $A_{h_{l(\varepsilon)}}(\sigma)^{-1}\, \bar E'_\varepsilon[\bar U^n_\varepsilon]$ is performed as a multi-grid V-cycle with Jacobi iterations (Eq. 2.39 on page 45) as smoother and the standard prolongation and restriction operators (Eq. 2.77). Indeed, to ensure an appropriate regularization it turns out to be sufficient to consider only a single multi-grid V-cycle:

(2.82) $\mathrm{MGM}_{l(\varepsilon)}(\sigma) \approx A_{h_{l(\varepsilon)}}(\sigma)^{-1} .$

The cycle starts on the grid $\Omega_{h_{l(\varepsilon)}}$, runs up to the coarsest grid $\Omega_{h_L}$ and back, and applies only very few Jacobi smoothing iterations on each grid.
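The structure of such a V-cycle can be sketched in a few lines; the following 1D toy version for $A(\sigma) = \mathbb 1 - \frac{\sigma^2}{2}\Delta$ with a damped Jacobi smoother is an illustration only (damping factor, cycle depth and grid sizes are assumptions, not the thesis implementation):

```python
def apply_A(u, s):
    """A(sigma)u = u - (sigma^2/2)*Laplacian(u) on 1D interior nodes with
    zero Dirichlet boundaries; s = sigma^2/(2*h^2) is the scaled weight."""
    n = len(u)
    return [u[i] - s * ((u[i - 1] if i > 0 else 0.0) - 2.0 * u[i]
                        + (u[i + 1] if i < n - 1 else 0.0)) for i in range(n)]

def jacobi(u, f, s, iters):
    """Damped Jacobi smoother (damping 0.8; the diagonal of A is 1+2s)."""
    d = 1.0 + 2.0 * s
    for _ in range(iters):
        Au = apply_A(u, s)
        u = [u[i] + 0.8 * (f[i] - Au[i]) / d for i in range(len(u))]
    return u

def restrict_fw(r):  # full weighting onto the coarse interior nodes
    return [0.25 * r[2 * a] + 0.5 * r[2 * a + 1] + 0.25 * r[2 * a + 2]
            for a in range((len(r) - 1) // 2)]

def prolong_lin(ec, n):  # linear interpolation back to the fine grid
    e = [0.0] * n
    for a, v in enumerate(ec):
        e[2 * a + 1] = v
    for i in range(0, n, 2):
        e[i] = 0.5 * ((e[i - 1] if i > 0 else 0.0) + (e[i + 1] if i < n - 1 else 0.0))
    return e

def v_cycle(u, f, s):
    """One V-cycle for A(sigma)u = f; h doubles on the coarser grid, so the
    scaled weight s is divided by four there."""
    if len(u) == 1:
        return [f[0] / (1.0 + 2.0 * s)]      # exact solve on the coarsest grid
    u = jacobi(u, f, s, 2)                   # pre-smoothing
    Au = apply_A(u, s)
    res = [f[i] - Au[i] for i in range(len(u))]
    ec = v_cycle([0.0] * ((len(u) - 1) // 2), restrict_fw(res), s / 4.0)
    e = prolong_lin(ec, len(u))
    return jacobi([u[i] + e[i] for i in range(len(u))], f, s, 2)  # post-smoothing

n, s = 15, 1.28                  # 15 interior nodes; s = sigma^2/(2 h^2)
f = [1.0] * n
u = [0.0] * n
def res_norm(u):
    Au = apply_A(u, s)
    return max(abs(f[i] - Au[i]) for i in range(n))
r0 = res_norm(u)
for _ in range(3):
    u = v_cycle(u, f, s)
assert res_norm(u) < 0.1 * r0    # a few cycles already give a good approximation
```

As in the text, an exact solution is not needed: a single cycle already realizes the regularization effect, and further cycles merely sharpen the approximation of $A(\sigma)^{-1}$.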

2.5.2 Quantized Registration

From the discrete update formula (Eq. 2.80) together with the multi-grid smoother (Eq. 2.82) we obtain the quantized scheme:

(2.83)
$\bar U^{n+1}_{\varepsilon_k} = \bar U^n_{\varepsilon_k} - \big(\tfrac{\tau^n_k}{h_{l(k)}}\big) \odot \mathrm{MGM}_{l(k)}(\sigma)\, \bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}] ,\quad n = 0,\dots,N_k - 1 ,$
$\bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}] = \big(\bar T_{\varepsilon_k} \,\}\, (\mathbb 1 + \bar U^n_{\varepsilon_k}) - \bar R_{\varepsilon_k}\big) \odot h_{l(k)} \nabla_{h_{l(k)}} \bar T_{\varepsilon_k} \,\}\, (\mathbb 1 + \bar U^n_{\varepsilon_k}) ,$
$\bar U^0_{\varepsilon_k} = \pi^{l(k+1)}_{l(k)}\, \bar U^{N_{k+1}}_{\varepsilon_{k+1}} ,\quad k = K-1,\dots,0 ,$
$\bar U^0_{\varepsilon_K} = 0 .$

We have three nested loops in this scheme. The outer loop with index $k$ runs from the coarse to the fine scale representations and uses the prolongation operator to transfer data onto finer grids, whereby the prolongation $\pi^{l(k+1)}_{l(k)}$ is the identity if the scales $\varepsilon_{k+1}$ and $\varepsilon_k$ are represented on the same grid. While the deformation vector $\bar U_{\varepsilon_k}$ is transported top to bottom, the image scales $\bar R_{\varepsilon_k}, \bar T_{\varepsilon_k}$ are generated bottom to top by applying the discrete filter operator

(2.84) $S_{h_{l(k)}}(\varepsilon_k) := \mathrm{MGM}_{l(k)}(\varepsilon_k) \approx A_{h_{l(k)}}(\varepsilon_k)^{-1}$

$T_{\varepsilon_k} := S_{h_{l(k)}}(\varepsilon_k)\, T ,$
$R_{\varepsilon_k} := S_{h_{l(k)}}(\varepsilon_k)\, R .$


The middle loop with index $n$ performs the gradient descent on a given scale until the change in the data becomes sufficiently small (Eq. 2.81). There are two computationally intensive parts in this loop. The first computes the energy gradient $\bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}]$, where the operation of a quantized offset access to the images $\bar T_{\varepsilon_k}$ and $h_{l(k)}\nabla_{h_{l(k)}}\bar T_{\varepsilon_k}$, denoted by '}', is performed through a bilinear interpolation of the value from neighboring nodes. In the second part the multi-grid cycle regularizes the gradient, where the formulas for the restriction, the prolongation (Eq. 2.77) and the Jacobi smoother (Eq. 2.39 on page 45) are applied on each grid. Since the regularization operator $A(\sigma) = \mathbb 1 - \frac{\sigma^2}{2}\Delta$ corresponds to the linear diffusion, the matrix $A$ used in the Jacobi iterations has constant entries. Then the convex reformulation (Eq. 2.49 on page 48) used for the non-linear diffusion is exceedingly efficient, as the constant values of the matrix $L_D^{-1}(L_D - L)$ are known in advance.

The inner loop has no explicit index, but determines for each update the maximal time-step width $\tau^n_k$ which satisfies Armijo's rule (Eq. 2.76), i.e. we maximize $\tau^n$ in

(2.85) $E_{\varepsilon_k}[\bar U^n_{\varepsilon_k}] - E_{\varepsilon_k}\Big[\bar U^n_{\varepsilon_k} - \frac{\tau^n}{h_{l(k)}}\, \mathrm{MGM}_{l(k)}(\sigma)\, \bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}]\Big] \ \ge\ c\, \frac{\tau^n}{h^2_{l(k)}}\, \Big\langle \bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}],\ \mathrm{MGM}_{l(k)}(\sigma)\, \bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}] \Big\rangle_{h_{l(k)}} .$

In the above formula only the energy $E_{\varepsilon_k}\big[\bar U^n_{\varepsilon_k} - \frac{\tau^n}{h_{l(k)}}\, \mathrm{MGM}_{l(k)}(\sigma)\, \bar E'_{\varepsilon_k}[\bar U^n_{\varepsilon_k}]\big]$, which depends non-linearly on $\tau^n$, needs to be recomputed iteratively. The computation of the energy $E_{\varepsilon_k}[.]$ and the lumped scalar product $\langle .,. \rangle_{h_{l(k)}}$ requires the summation of values across the whole image, which may be difficult to realize in some hardware architectures. Applying a local addition operator similar to the restriction operator $\rho^l_{l+1}$ until a 1-node domain remains, which can be quickly retrieved, offers a resort from this problem.
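The inner loop thus amounts to a step-width search. A generic sketch of such a search (halving a trial step until Armijo's rule holds; the quadratic toy energy below is purely illustrative, not the registration energy):

```python
def armijo_tau(E, grad_dot_dir, tau0=4.0, c=0.25, max_halvings=30):
    """Halve the trial step width until Armijo's rule (Eq. 2.76) holds:
    E(0) - E(tau) >= c * tau * <E'(u), d>, where d is the (regularized)
    descent direction and grad_dot_dir = <E'(u), d> > 0.  E maps a step
    width to the energy after the corresponding update."""
    e0 = E(0.0)
    tau = tau0
    for _ in range(max_halvings):
        if e0 - E(tau) >= c * tau * grad_dot_dir:
            return tau
        tau *= 0.5
    return tau

# toy problem E(u) = u^2/2 with descent direction d = E'(u) = u, i.e. the
# update is u -> u - tau*u (the regularization operator is the identity here)
u = 4.0
E = lambda tau: 0.5 * ((u - tau * u) ** 2)
tau = armijo_tau(E, grad_dot_dir=u * u)
assert tau == 1.0   # tau = 4 and tau = 2 are rejected, tau = 1 is accepted
```

For this quadratic energy any $\tau \le 2 - 2c = 1.5$ is admissible, so the halving sequence $4, 2, 1$ stops at $\tau = 1$; in the registration scheme each rejected trial costs one recomputation of the non-linear energy term.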

2.5.3 Quantized Descent

Since the non-convex energy $E[u]$ has many local minima in the energy landscape, we need a two-fold regularization procedure, where the gradient regularization with $A(\sigma)^{-1}$ guarantees the existence of a unique solution of the gradient flow on a given scale (Eq. 2.8 on page 23) and the descent through the scales imitates an annealing process. Because of this two-fold regularization we are not interested in the exact solution on a given scale; we use a fast approximation of the regularization operator $A(\sigma)^{-1}$ with a multi-grid V-cycle (Eq. 2.82) and stop the iterations of the gradient flow (Eq. 2.80) fairly quickly, as soon as the update changes fall under a certain bound (Eq. 2.81). These crude approximations are legitimate, as only the regularization effect and not the exact solution of the process is required. This means that the quantization of the procedures also does not change any qualitative behavior, since, in the view of a backward error analysis, it corresponds to a slightly different choice of the regularization operator, the smoothing operator for the images and the stopping condition for the update formula. The effect of the quantization may become problematic only if the quantized version of the linear diffusion, which is used for both the

regularization and the smoothing operator, does not perform the qualitative task of smoothing any more.

The analysis of the properties of the quantized diffusion processes in Section 2.3.3 on page 49 reveals that by using a mass preserving scheme (Eq. 2.48 on page 48) we obtain Lyapunov sequences of diffused images. This means that the process retains such qualitative properties as the extremum principle, decreasing energy, decreasing even central moments and increasing entropy despite the quantized computing. The major shortcoming is the inability of the quantized process to equalize intensity differences across large distances on a single scale. But this is not required, since the use of the multi-grid V-cycle for the regularization and the multi-scale strategy permit the exchange of information across large distances by transporting it through the grid hierarchy.

2.6 Data-Flow

For the sake of a common discretization approach, Section 2.2.1 on page 24 has already outlined the general formal structure of the discussed image processing PDEs (Eq. 2.10 on page 24):

$\partial_t u + \mathcal F[\mathcal C^p_\sigma[u], u] = 0 ,\quad \text{in } \mathbb R^+ \times \Omega ,$
$u(0) = u_0 ,\quad \text{on } \Omega ,$

where the non-linear classifier $\mathcal C^p_\sigma[u]$ locally classifies $u$ with respect to the desired evolution and the linear operator $\mathcal F[u, v]$ performs the composition of the classification result with the unknown (Eq. 2.11 on page 25).

In the following we want to analyze which data-flow is needed to implement the high level structure of the iterative algorithms and the low level computations which evaluate the operators $\mathcal C^p_\sigma[u]$ and $\mathcal F[u, v]$. The importance of a regular data-flow stems from the characteristics of memory chips (Section 3.1 on page 71).

2.6.1 Explicit and Implicit Schemes

Similar time (Section 2.2.1.1 on page 25) and space (Section 2.2.1.2 on page 26) discretization schemes applied to the common PDE model result in similar discrete iterative schemes:

problem       | equation            | scheme                                                                                                                                              | type
denoising     | Eq. 2.37 on page 44 | $\bar U^{n+1} = A^-_\sigma[\bar U^n] \cdot \bar U^n$                                                                                                | explicit
denoising     | Eq. 2.38 on page 45 | $A^+_\sigma[\bar U^n] \cdot \bar U^{n+1} = \bar U^n$                                                                                                | implicit
segmentation  | Eq. 2.67 on page 55 | $\bar\Phi^{n+1}_\alpha = \bar\Phi^n_\alpha - \frac{\tau^n}{h}\, g(D^-_\alpha\bar\Phi^n, D^+_\alpha\bar\Phi^n)$                                      | explicit
registration  | Eq. 2.80 on page 60 | $\bar U^{n+1}_\varepsilon = \bar U^n_\varepsilon - \frac{\tau^n}{h_{l(\varepsilon)}}\, A_{h_{l(\varepsilon)}}(\sigma)^{-1} \bar E'_\varepsilon[\bar U^n_\varepsilon]$ | implicit


Figure 2.3 High level data-flow of the PDE solvers for the explicit (left) and the implicit (right) discrete schemes.

The schemes have been labeled as being of explicit or implicit data-flow type. In the case of the explicit iteration, the new solution for the time point $n+1$ is computed directly from the current solution at time point $n$, while the implicit iteration requires solving a linear equation system in each time-step to obtain the new solution. Notice that the scheme for the gradient flow registration uses an explicit time discretization, but the iterative scheme is of implicit type, because the regularization with $A_{h_{l(\varepsilon)}}(\sigma)^{-1}$ corresponds to solving a linear equation system.

The explicit schemes are very attractive from the point of view of the high level data-flow structure, because we need only one configuration to process the stream of data from $\bar U^n$ to $\bar U^{n+1}$ (Figure 2.3, left). To implement the whole time step loop we only need to alternate the source and destination memory addresses.

As we use iterative solvers for the solution of linear equation systems, the implicit type practically requires an additional loop inside the time step loop. This means that the high level data-flow requires two different configurations of the PEs (Figure 2.3, right). Thinking of the usual software frameworks, the implementation of nested loops with different operations within the inner loop does not seem to impose any problems. But for the sake of efficiency we should rather think of the problem in terms of chip design: if we have two configurations, either the chip must be larger to hold both of them, or the functionality of the chip must allow some reconfiguration to perform one task or the other. In the first case the cost lies in the greater size of the chip as compared to the explicit scheme; in the second, the hardware must be more flexible and time is spent on the reconfiguration of the chip. We will elaborate on these thoughts in the next chapter.

We have seen that the type of the iterative schemes implies a different high level data-flow.
But once established, it can be reused for all problems of the same type, since only the individual PEs which evaluate the operators $\mathcal C^p_\sigma$ and $\mathcal F$ must be replaced. This means for a hardware architecture that the general data-flow structure may be hard-coded and only the computational kernels require programmability. But even in fully reconfigurable architectures it has the advantage that only a few parts must be reconfigured for a new task, which saves both

reconfiguration time and complexity of the overall design.

2.6.2 Data Locality

The low level data-flow in the computation of the operators $\mathcal C^p_\sigma$ and $\mathcal F$ also bears similarities between the problems, as the access pattern to other nodes in the computation of a new node value is always restricted to a small neighborhood, whether it occurs in the application of sparse band matrices, in the computation of spatial derivatives or as local offset accesses with bilinear interpolation of the values. In each case a certain number of values from a small input neighborhood of a node is used to compute a single new value for the current node. This data locality can be efficiently exploited in the implementation of the algorithms if mesh nodes are stored closely together in memory and the access pattern to the input neighborhood is the same for each node. On the equidistant meshes we use this is easily achieved, but on unstructured or adaptive meshes it is a difficult problem, and the effort to retain data locality in memory sometimes outweighs the gains from the fast access to node neighbors.

Although, in the general case, the classifier $\mathcal C^p_\sigma[u]$ may depend globally and non-linearly on the input data $p$ or the current state $u$, this is almost never the case in practical applications, since it increases the computational load by orders of magnitude. Instead, far less time consuming evaluations of single global properties (e.g. norms) of functions on the whole domain are used to control the processes. The computation of such properties may be implemented in a way which does not violate the general assumption of data locality in computations (see the implementations on pages 146 and 178). But even with the knowledge of local data access in all computations, the question still arises which hardware architectures are appropriate for such processing and which quantitative parameters of the algorithms influence the choice of an optimal hardware platform.
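The line-buffer technique commonly used on data-stream-based hardware illustrates how this locality is exploited: a 3x3 stencil needs only three image rows resident in fast memory, and every pixel crosses the slow memory interface exactly once. The following Python model of this behavior is illustrative only (the counting of 'external reads' is a stand-in for actual memory traffic):

```python
def stream_3x3(image, kernel):
    """Stream a 3x3 stencil over an image row by row.  Only three rows live
    in fast 'line buffers' at any time; every image row crosses the (slow)
    external memory interface exactly once and is then read many times from
    the buffers.  Clamped (mirror) boundaries; returns result and read count."""
    h, w = len(image), len(image[0])
    reads = 0
    buf = {}                              # row index -> line buffer
    def row(i):
        nonlocal reads
        i = min(max(i, 0), h - 1)
        if i not in buf:
            buf[i] = list(image[i])       # one external read per row
            reads += 1
        return buf[i]
    out = []
    for i in range(h):
        for k in list(buf):
            if k < i - 1:
                del buf[k]                # at most three buffers stay resident
        r = [row(i - 1), row(i), row(i + 1)]
        out.append([sum(kernel[a][b] * r[a][min(max(j + b - 1, 0), w - 1)]
                        for a in range(3) for b in range(3))
                    for j in range(w)])
    return out, reads

img = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
box = [[1.0 / 9.0] * 3 for _ in range(3)]
res, reads = stream_3x3(img, box)
assert reads == 4                        # each row fetched exactly once
assert abs(res[1][1] - 5.0) < 1e-9       # 3x3 average around pixel (1,1)
```

Each value thus enters the fast buffers once but is read up to nine times there, which is exactly the reuse ratio a hardware pipeline with three line buffers achieves for a 3x3 neighborhood.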

2.6.3 Local Computations

Three important properties of the PDE algorithms are the size of the input neighborhood, the computational intensity and the write-read delay. If the size of the input neighborhood is large, i.e. many neighboring values are used in the computation of a new node, then architectures with flexible caches and data routing are advantageous, because they can store the entire neighborhood of a node internally and deliver all values in parallel to the PEs. Ideally, each value is transferred only once to the cache, but read multiple times from the cache. If the size of the input neighborhood is small then less flexibility is needed and small automatic caches can reuse the data. In this case the raw bandwidth of the memory system and the throughput of the PEs are more important. Our algorithms usually work with a small neighborhood of 3 by 3. If the computational intensity is high, i.e. complex, time consuming operations must be performed on each node, then architectures with many internal registers and long pipelines

are advantageous, because the registers hold intermediate results very close to the PEs, and the long pipeline constitutes a kind of super computational unit which can perform many operations simultaneously. We speak of parallelism in depth. Ideally, the registers are integrated into a pipeline so long that it contains all the PEs necessary for the computation of the new node value, such that a new value is output in every clock cycle. In case of low computational intensity the bandwidth of the memory system becomes the limiting factor and should be as high as possible. Most of the operations in the discrete solvers of our applications have a moderate to low computational intensity.

If the write-read delay of node values is high, i.e. newly computed values are not immediately needed in the subsequent computations, then architectures with many PEs perform best, because several node values can be processed in parallel. We speak of parallelism in breadth. Ideally, each pixel in an image is assigned its own PE, but this is too costly in almost all cases. For low write-read delays fast sequential processing with local caches for the reuse of newly computed values is more advantageous. As we have chosen discrete schemes and linear equation solvers which compute the components of the solution vector independently of each other for each time step, the write-read delay is high. Together with the regular data-flow, the parallelism in breadth is the key to the fast implementations of the solvers.

In practice, we usually cannot afford to set up specific hardware which exactly meets the requirements of our application. Instead we use hardware platforms which are available at reasonable prices and serve our purpose best. Therefore, the next chapter will discuss the performance characteristics of various hardware platforms, and the following one presents the implementations on the chosen architectures.


2.7 Conclusions

We have examined PDE models for the image processing tasks of image denoising, segmentation and registration under the effects of roundoff errors in strongly quantized number systems. In general, even high precision floating point formats such as long double s63e15 do not immunize our computations against disastrous effects of roundoff errors (Section 2.2.2.1). For low precision fixed point computations the roundoff effects gain significant influence even more easily (Section 2.2.2.2), such that it becomes impossible to guarantee numerical stability for iterative schemes in either a deterministic or a probabilistic sense (Section 2.2.2.3). The price for the high performance of the implementations seems to be the loss of any quantitative bounds on the error resulting from the strong quantization.

However, control over the qualitative behavior of the low precision algorithms can be retained if special care is put into the factorization and aggregation of numbers on different scales (Sections 2.3.1, 2.4.1, 2.5.1), and into the implementation of interactions between different node values (Section 2.2.3.3). The former requires an underlying equidistant grid, which is good for parallelization but inhibits adaptivity, although arrangements of equidistant sub-grids also provide adaptive concepts to some extent (Section 2.2.1.2). The latter, the careful interaction between node values, asks for special implementations of matrix vector products and prohibits some local accuracy optimization (Section 2.2.3.4) in favor of global mass conservation. The error analysis reveals that this scheme is well suited for most image processing applications, as it exhibits the highest accuracy in the vicinity of edges (Section 2.2.3.5). Especially the diffusion processes benefit from the mass-exact matrix vector product, as almost all of the properties of the continuous scale-space (Section 2.1.1.2) can be preserved in the discrete-quantized setting (Section 2.3.3). Thus, without the knowledge of any error bounds we can still be sure that the quantized algorithm behaves similarly to the continuous model. In the case of segmentation we can ensure that the front propagation does not stop too early due to multiplicative underflow, by adapting the time-step width appropriately (Section 2.4). Finally, the registration (Section 2.5) is fairly insensitive to quantization errors, because the process aims at a hierarchically regularized descent in which the accuracy of each intermediate step is not so relevant. The qualitative properties of the descent, on the other hand, are guaranteed by the robust diffusion schemes used in the regularizations.

Besides the pursuit of reliable results we have also paid attention to the resulting data-flow of our schemes. Data locality and different forms of parallelism (Section 2.6) can be exploited by various hardware architectures (Chapter 3) and are crucial for the efficiency of the later implementations (Chapter 4).

Recapitulating, we have seen that one can control the qualitative behavior of certain PDE processes without high local accuracy or global error bounds. Image processing applications lend themselves to this kind of fast low precision algorithms, because usually there is no a-priori ground truth for their results. The characteristics of the image evolution are more important than the individual values. By concentrating the available precision on the treatment of the characteristic image features in a given application one can therefore obtain satisfactory results at very high performance.


3 Data Processing

Contents

3.1 Data Access ...... 71
3.1.1 Random Access Memory ...... 71
3.1.1.1 Latency ...... 72
3.1.1.2 Bandwidth ...... 76
3.1.2 Memory System ...... 79
3.1.3 Memory Hierarchy ...... 81
3.2 Computation ...... 84
3.2.1 Performance ...... 84
3.2.2 Parallelization ...... 87
3.2.3 Instruction-Stream-Based Computing ...... 92
3.2.4 Data-Stream-Based Computing ...... 93
3.2.5 Processor-in-Memory ...... 96
3.3 Hardware Architectures ...... 97
3.3.1 Status Quo ...... 97
3.3.1.1 Micro-Processors ...... 97
3.3.1.2 Parallel Computers ...... 98
3.3.1.3 DSP Processors ...... 99
3.3.1.4 Reconfigurable Logic ...... 100
3.3.1.5 Reconfigurable Computing ...... 102
3.3.2 No Exponential is Forever ...... 103
3.4 Conclusions ...... 105

Figures

3.1 Internal structure of a memory chip. ...... 73
3.2 Timings in a SDRAM chip for different read accesses. ...... 75
3.3 A seamless stream of data from an open row in the burst mode. ...... 76
3.4 An example of throughput increase through pipelining. ...... 86


Tables

3.1 Simplified memory access latencies. ...... 74
3.2 Sources, execution units and platforms of ISB and DSB processors. ...... 85

In this chapter we consider the general problem of data processing and recall some of the corresponding concepts in hardware design. In particular, we discuss the characteristics of data access such as latency and bandwidth, and the difference between instruction-stream-based (ISB) and data-stream-based (DSB) computing. We also examine the options of parallelization and how it can improve the different aspects of performance. Then, we consider the status quo of various hardware architectures and conclude with some thoughts on the future of computing architectures.

But first let us explain the general division between storing and computing silicon. The problem of data processing seems very simple: we have some data and want to apply a certain operation to it. Then, we either continue to apply another operation, or output or visualize the result. We need some piece of hardware, some processing elements (PEs), which can perform the desired operation, and some memory which holds the data. Obviously, it would be fastest to have the data already at the input of the PEs, and activating them would deliver the result. In fact, such solutions exist for small data amounts. However, for large data blocks (e.g. an image with float values: 1024^2 · 4B = 4MiB) the required memory size is very large. This means that we cannot afford to have a processing path for each data component, i.e. the same PE will have to work on different components consecutively, and due to the fabrication process the memory for the data cannot be placed on the same device as the PEs, i.e. it must reside in an external chip outside of the chip containing the PEs (Section 3.2.5 on page 96 discusses exceptions). Being forced into an arrangement where the PEs reside on one chip and the main memory on another, we can separately analyze the performance characteristics of memory chips and the different paradigms of performing the actual computations.


3.1 Data Access

Moore’s prediction from 1965 of an exponential growth of the number of transistors on the same size of silicon area still holds true. This is usually referred to as Moore’s Law, although Moore was more interested in transistor cost than pure count [Moore, 1965]. The exponential growth did not take place at a constant rate in the past either, with a 1 year transistor count doubling till 1974, a 2 year doubling since then and probably a 3 year period in the following years [SEMATECH, 2003]. But from the number of transistors alone, it is not yet clear how the characteristics of the produced chips have changed. For memory chips we have to consider three characteristics:

• memory size, the amount of data that can be stored on the chip;
• bandwidth, the amount of data that can be transported per second;
• latency, the delay from the data request to the data availability.

The merits of Moore’s Law have been mainly directed into size enlargement, which has experienced exponential growth, but bandwidth and latency have improved only slowly. This is especially bad for the latency, which improves very slowly and has already become the main degrading factor in PC performance. Special memory access modes and interleave strategies help to hide latency, but they require a regular data-flow, an important aspect of our solvers described in Section 2.6 on page 63. Although improvements around the memory core on the chip and in the whole memory system have realized an exponential growth of the system bandwidth, this growth is slower than the growth in the data bandwidth requirements of the PEs. So the gap between the available and the required bandwidth is widening, making bandwidth a performance bottleneck even when almost all latencies can be hidden. These memory problems are known as the memory gap, and many ways of attacking it have been thought of, but those which are cost efficient for mass production do not suffice to catch up with the development of the computing resources [Wilkes, 2000].
The next sections explain the current technology behind the memory chips and the reason for this important imbalance between the performance of the PEs and the memory system. Naturally, this problem has been anticipated for some time and many different technological improvements or even new memory concepts exist or are pending. Here, we can only sketch the technological mainstream. For more details and alternative technology we refer to [Alakarhu and Niittylahti, 2004; Davis et al., 2000a,b; Desikan et al., 2001; Keeth and Baker, 2000; Sharma, 2002a,b].

3.1.1 Random Access Memory

Random Access Memory (RAM) denotes a type of memory which allows access to the stored data in a random order, unlike e.g. First In First Out (FIFO) memory, which prescribes the order of the data access. As the access is random we must specify which part of the memory we want to access, i.e. we must specify a memory address. Logically the address space is one dimensional, but since we use two dimensional silicon wafers for chip production, memory cells are arranged in a two dimensional memory array and the address is split into a column

address and a row address.

A memory cell in the memory array is a circuit which stores a bit, i.e. the binary 0 or 1. In Static RAM (SRAM) a memory cell consists of either 6 transistors (robust, power efficient) or only 4 (compact, area efficient), and the bit is stored permanently until a write operation changes it or the power is switched off. A DRAM memory cell requires only 1 transistor and 1 capacitor, but the charge must be refreshed every few milliseconds, otherwise the information is lost. Despite the complication of the memory refresh, DRAM is preferred by memory chip manufacturers, because it yields a much larger memory size for the same number of transistors. Therefore, in a typical system DRAM is used for the large main memory, while SRAM serves as embedded and low power memory (see Section 3.1.3 on page 81). As SRAM is more architecture specific, its designs differ more strongly than the typical DRAM design used for mass production.

Since from each address we usually read more than just one bit, several memory cells are grouped together to a memory depth of 2^0, . . . , 2^6 bit. The value 8 is nowadays typical for the main memory of a PC. More would be desirable, since the system accesses memory in larger chunks anyway (see Section 3.1.2 on page 79), but higher depth also means more data pins on the chip, which raises the costs. On the other hand, high density memory chips are cost efficient, because one needs fewer chips for the same amount of memory. So instead of having more than 8 bit memory depth, we have several memory banks on a chip, typically 4, and need only 2 pins to choose among the banks (Figure 3.1 on the facing page). The reason for having 4 banks rather than quadrupling the address space of the chip, which would also require only 2 more pins, is lower power consumption (only one bank is active at a time) and shorter strobes. Section 3.1.1.2 on page 76 describes another advantage related to the memory access.
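As a toy illustration of the address splitting (the actual bit layout differs between chips and memory controllers, so the field order below is an assumption), a linear address can be decomposed into bank, row and column like this:

```python
def split_address(addr, col_bits=10, bank_bits=2):
    """Split a linear address into (bank, row, column).

    Illustrative layout: lowest bits select the column, the next 2 bits
    one of the 4 banks, and the remaining high bits the row.
    """
    col = addr & ((1 << col_bits) - 1)
    bank = (addr >> col_bits) & ((1 << bank_bits) - 1)
    row = addr >> (col_bits + bank_bits)
    return bank, row, col

# row 1, bank 1, column 3 under the assumed layout
assert split_address(0b1_01_0000000011) == (1, 1, 3)
```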
The standard size notation b-bank gMi×d for a memory chip says that the chip has g mebi (= 2^20) memory cell groups with a depth of d bit arranged in b banks, and thus holds (g·d)/8 MiB, or (g·d)/(8·b) MiB per bank; e.g. a 4-bank 32Mi×8 chip holds 32MiB in 4 banks of 8MiB (see Section 1.2.4 on page 7 for the definition of Mi, mebi, etc.).
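The capacity arithmetic behind this notation can be checked in a few lines (a sketch of the formula, not a datasheet model):

```python
def chip_capacity_MiB(banks, groups_Mi, depth_bits):
    """Total capacity in MiB of a 'b-bank gMi x d' memory chip:
    g mebi cell groups, each d bit deep, spread over b banks."""
    total_bits = groups_Mi * (2 ** 20) * depth_bits
    return total_bits / 8 / (2 ** 20)

# a 4-bank 32Mi x 8 chip holds 32 MiB, i.e. 8 MiB per bank
assert chip_capacity_MiB(4, 32, 8) == 32.0
assert chip_capacity_MiB(4, 32, 8) / 4 == 8.0
```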

3.1.1.1 Latency

The pins of a memory chip consist of address, data, and a few control and power supply pins. DRAM chips additionally save on address pins by receiving the column and row address consecutively, and not simultaneously as typical SRAM does. A read access at a certain address is performed by opening a memory row (or rather a memory page, see Section 3.1.2 on page 79 for the difference), i.e. measuring and amplifying the charge of the capacitors in the row with so-called sense amps. Then the data starting at the addressed column is selected from the sense amps. Obviously, once a row is open, one can quickly obtain other data from the same row by only specifying a new column address and accessing it from the sense amps. This is important, since each of the operations in the process involves a delay, and if fewer operations have to be performed, there is less delay. If data from a different row is needed, the current row must be closed, i.e. the content of the sense amps is transferred back to the row in the memory array, and a pre-charge must prepare the logic for the measuring of the charges in the new row. The new row can then be opened.


Figure 3.1 Internal structure of a memory chip. The left image shows a 4-bank 8 bit deep chip. To the right we see the memory array with the rows containing the memory cells.

There are two types of latencies in the above operations: access time, the time period necessary to complete an operation, and cycle time, the smallest possible time period between two states of the same signal. Table 3.1 on the next page lists important latencies and defines their symbols. Because of the different combinations of these latencies, a processor previously had to switch into a wait state and await the arrival of the requested data. Current Synchronous DRAM (SDRAM) chips run synchronously to a common clock cycle, such that latencies are multiples thereof. Consequently, the overall latency for each access mode is known as a number of clock cycles in advance, so that the processor can perform some other task and knows when the requested data will be at its input. This again points out the efficiency of an algorithm with a regular data-flow, where the input address can be issued in advance while the processor is busy with previous data, so that the new data arrives on time to be processed. Figure 3.2 on page 75 shows the timings of different read accesses and the devastating impact of the latencies in case of random access reads. We distinguish three modes:

• row hit: we read data from an already open row; latency: tCL,
• row miss from closed: we open a row and read data; latency: tRCD + tCL,
• row miss from open: we close the current row, open a new row and read data; latency: tRP + tRCD + tCL.

The standard chip timing notation SDRAM tCLK tCL–tRCD–tRP–tRAS gives four of the latencies from Table 3.1 on the following page and the minimal clock cycle time (tCLK), or


Table 3.1 Simplified memory access latencies. In reality memory timings are more complicated, because for every signal and transition from inactive to active there is a setup time which allows the signal to stabilize, and a hold time in which it can be accessed. See also Figure 3.2 on the facing page for a schematic visualization of the latencies.

latency                       | type        | definition
row to column delay (tRCD)    | access time | time from row active to column active
column access latency (tCL)   | access time | time from column active to data output
column active time (tCAS)     | cycle time  | minimal active time of the column signal
column precharge (tCP)        | cycle time  | minimal inactive time of the column signal
row active time (tRAS)        | cycle time  | minimal active time of the row signal
row precharge (tRP)           | cycle time  | minimal inactive time of the row signal

sometimes the maximal frequency (1/tCLK) instead, e.g. SDRAM 10ns 2-2-2-6 or SDRAM 100MHz 2-2-2-6. Notice that the latencies are usually given as multiples of tCLK, but in reality the latencies are sums of different internal delays, which round up to the next clock cycle for a given frequency. So if the above chip could also be run at 133MHz, the resulting latencies might be different, e.g. 2-2-2-6, 3-2-2-7 or 3-3-3-7, and we would need to consult the data sheet of the chip to find out which applies. The first three numbers cannot exceed 3, however, because from the first notation we know that the latency times for the chip are smaller than 2 · 10ns = 20ns, and 3 · 7.5ns > 20ns, where 7.5ns corresponds to the 133MHz frequency.

The delays tCAS and tCP do not show up in the timing notation, because in today's pipelined memory chips they are hidden by the burst mode. Thereby, a certain number of columns called the burst length (BL), typically 4 or 8, starting from the supplied column address is output one by one, without supplying a new address. A pipelined design can prepare the burst of the next BL columns while the data from the current columns is output. If

(3.1)  tCAS + tCP ≤ BL · tCLK,

then we can get a seamless stream of data from an open row after the initial tCL (Figure 3.3 on page 76). The other cases yield the following latencies:

mode                 | latency      | 1 packet     | 4 packets   | 8 packets
row hit              | tCL          | 60ns (17%)   | 60ns (67%)  | 100ns (80%)
row miss from closed | tRCD+tCL     | 80ns (12.5%) | 80ns (50%)  | 120ns (67%)
row miss from open   | tRP+tRCD+tCL | 100ns (10%)  | 100ns (40%) | 140ns (57%)

The example times are computed for SDRAM 10ns 2-2-2-6 with BL=4 and 1, 4 or 8 consecutive packets being read. The percentage numbers reflect the ratio of the sustained bandwidth,


Figure 3.2 Timings in a SDRAM chip for different read accesses. Table 3.1 on the preceding page defines the latencies. The Activate command activates a bank and opens a row by specifying a row address on the address bus. The Read command reads data from the open row at the specified column address. Commands and data related to the same row have a common color. In the beginning all rows are closed. The first Activate command opens a row and it remains open during the cycles 0-6. The second Activate closes the current row and opens a new one. Accordingly, the first Read is a row miss from closed, the second a row hit, the third a row miss from open and the fourth a row hit again.

the bandwidth achieved in the respective mode, and the peak bandwidth, the maximal achievable bandwidth assuming no latencies. Obviously, this ratio improves the more consecutive packets from a given address are requested, whereby we assume that all packets lie in the same row. Therefore, algorithms exhibiting a linear memory access pattern perform much better than algorithms with random memory access. In the table above the performance factor between the worst and best case is as much as 8, and it would grow further to almost 10 if we requested more consecutive packets.
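The latencies of the three modes and the percentages of the table above can be reproduced with a short script (SDRAM 10ns 2-2-2-6, BL=4, all packets assumed to lie in one row; an illustrative sketch of the simplified timing model, not of a real chip):

```python
import math

def read_latency_cycles(mode, tCL=2, tRCD=2, tRP=2):
    """Initial read latency in clock cycles for the three access modes."""
    return {"row hit": tCL,
            "row miss from closed": tRCD + tCL,
            "row miss from open": tRP + tRCD + tCL}[mode]

def access(mode, packets, BL=4, tCLK=10):
    """Total access time in ns and the sustained/peak bandwidth ratio.

    Bursts of BL packets take BL*tCLK each; the peak bandwidth would
    move one packet per clock cycle without any latency.
    """
    total = (read_latency_cycles(mode) + math.ceil(packets / BL) * BL) * tCLK
    return total, packets * tCLK / total

assert access("row hit", 1) == (60, 1/6)               # 60ns, ~17%
assert access("row hit", 8) == (100, 8/10)             # 100ns, 80%
assert access("row miss from closed", 4) == (80, 1/2)  # 80ns, 50%
assert access("row miss from open", 8) == (140, 4/7)   # 140ns, 57%
```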

One could argue that if we want only one packet from each row, then it would be more reasonable to set BL to 1. But since each row must stay open for tRAS (= 6 in this case), the row miss from open would take tRP + tRAS time plus one tCLK for the packet transmission, and thus


Figure 3.3 A seamless stream of data from an open row in the burst mode.

with 90ns only slightly less than the 100ns with BL=4. On the other hand, all transmissions of consecutive packets would worsen significantly, since in most chips we could not produce a seamless stream of data with BL=1, because Eq. 3.1 on page 74 would be violated. Therefore, BL is usually 4, Eq. 3.1 on page 74 is fulfilled and moreover we have

(3.2)  tRAS ≤ tRCD + tCL + BL · tCLK,

which means that a read access keeps a row open for at least tRCD + tCL + BL · tCLK time, and thus tRAS does not contribute to the latency calculations and is sometimes omitted from the timing notation. In Eqs. 3.3 and 3.4 on page 78 we extend this reasoning to multiple data rate memory chips like DDR, QDR.
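For the example chip, both inequalities are easily checked in clock cycles (tCAS and tCP do not appear in the timing notation, so the values below are assumptions for illustration only):

```python
# SDRAM 10ns 2-2-2-6, all times in clock cycles
tCL, tRCD, tRP, tRAS = 2, 2, 2, 6
tCAS, tCP = 2, 2        # assumed values; not part of the timing notation
BL = 4

assert tCAS + tCP <= BL          # Eq. 3.1: seamless burst stream possible
assert tRAS <= tRCD + tCL + BL   # Eq. 3.2: tRAS adds no extra latency
```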

3.1.1.2 Bandwidth

To improve the sustained bandwidth, we can stream a whole row without intermediate latencies from a memory chip (Figure 3.3), although this may not be true for the memory system (see Section 3.1.2 on page 79). But if we want to stream a larger piece of memory which occupies several rows, then the time intensive row miss from open mode would occur. We could avoid this by having several rows open simultaneously, with some internal management scheduling the opening and closing of rows such that no additional delays occur. There are memory chips which work in this way (Rambus [Crisp, 1997]), but because of cost and simplicity the standard SDRAMs can have only one open row. Instead, the different internal banks of the memory chip offer a cheaper solution to the problem, called bank interleaving. Thereby, only those row misses from open which open a new row in the same bank produce latencies. If a row in a different bank is requested, the latencies can be hidden by issuing the bank activation and read commands on the new row sufficiently early. Depending on how well the command pipeline in the memory chip can deal with the early commands, the latency of row

changing can be significantly reduced or even completely hidden. Then even large memory blocks can be streamed with only one initial latency, thus approaching the peak bandwidth.

Reaching the peak bandwidth is prevented by the fact that the memory refresh for DRAMs requires each row to be updated every few milliseconds to prevent the information from dissolving. This interferes with the normal read and write operations, because the memory refresh uses the same logic, namely the sense amps. We have not discussed this in detail, because it is done automatically and we have no influence on it. The average empirical bandwidth loss due to refreshing is small (1-4%), but it can be crucial for real-time systems, since it is difficult to pin down the exact delays for a given application [Atanassov and Puschner, 2001].

We have so far discussed the read operation. Writing is very similar, since after opening a row we only need to write to the sense amps. When the row is closed, the content is written back to the correct row. Writing is also less critical, because in most cases the processing elements (PEs) can issue several data packets for writing and do something else afterwards, not caring about when the data will actually be written. This is impossible if they wait for the result of a read operation which will deliver currently required data for a computation. So while writing itself is fairly trouble-free, the changes between reading and writing exhibit an additional latency, the turn-around time. This time is necessary to make sure that the operations do not interfere, i.e. that one completes before the other starts, so that it is clear whether the data at the data pins should be read or written.
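The bank interleaving discussed above relies on consecutive rows of a stream being mapped to different banks; a minimal sketch of such a mapping (hypothetical, real memory controllers use various schemes):

```python
def bank_of(row, n_banks=4):
    """Interleaved mapping: consecutive memory rows cycle through banks."""
    return row % n_banks

# when streaming rows 0..7, the next row always lies in another bank,
# so its activation can be issued while the current burst is running
banks = [bank_of(r) for r in range(8)]
assert banks == [0, 1, 2, 3, 0, 1, 2, 3]
assert all(banks[i] != banks[i + 1] for i in range(7))
```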
Recapitulating, we can say that it is impossible to reach the peak bandwidth even on the memory chip level, but we can come very close to it if all addresses are known well in advance and can be issued early, so that the inevitable latencies of the data access can be hidden.

Now we turn our attention to some techniques which increase bandwidth in general, but unfortunately hardly improve latency. We have three options to accomplish this:

• Increasing the rate of data packets per clock cycle.
So far we have only talked about Single Data Rate (SDR) memory chips, i.e. chips which can send or receive only one data packet per clock cycle. Double Data Rate (DDR) chips, which send two data packets per clock cycle, are already widespread; a few SRAM Quad Data Rate (QDR) chips also exist, and eXtreme Data Rate (XDR) DRAM with even higher transfer rates per clock cycle is being developed. We should emphasize that in such designs only multiple data is sent per clock cycle; other information such as control signals and addresses is not compressed in this way. Therefore, the latencies of the chips remain the same, or even worsen due to the more complex designs, e.g. DDR SDRAM 10ns 3-3-3-8. The faster data transmission requires a modification of Eqs. 3.1 and 3.2 on the preceding page. Eq. 3.1 now becomes

(3.3)  tCAS + tCP ≤ BL · tCLK/DR,

where the Data Rate (DR) is 2 for DDR and 4 for QDR. So the burst length (BL) is sometimes higher to allow more time for the compensation of tCAS and tCP. In a QDR chip with BL=4, for example, the burst would need only one clock cycle, so that we


would have to be able to issue read commands on every clock cycle to avoid gaps in the data transmission from an open row. This is seldom possible, so that the gaps in the stream would destroy the bandwidth advantage of the quadrupled data transport. Eq. 3.2 is also adapted to

(3.4)  tRAS ≤ tRCD + tCL + BL · tCLK/DR.

Because of the shorter data transmission, the above inequality is sometimes violated by multiple data rate chips. In a DDR 2-2-3-8 chip with BL=4, for example, a single read operation will keep the row open for tRCD(2) + tCL(2) + BL(4) · tCLK/DR(2) = 6 tCLK, but the specification requires it to stay open for tRAS = 8 tCLK. So although DDR doubles the bandwidth, additional latencies may occur, decreasing performance in codes with many random accesses.

The reason for pursuing the multiple data rate technology is its fairly easy and inexpensive fabrication from the SDR components. Because in a DDR chip all control signals operate with the normal clock cycle, we can simply use the normal memory core from a SDR chip and add a small buffer and additional circuits which multiplex two data packets, so that they are sent on the rising and falling edge of the clock signal. This means that we need a prefetch size of 2, i.e. we retrieve twice the number of bits from the memory array per clock cycle. Since the general design of the memory array remains the same and only a very small part of the chip must operate twice as fast, DDR chips have become very popular for increasing bandwidth. In fact, the peak bandwidth really doubles, but in everyday applications frequent random accesses make the performance gains far less significant. For our PDE solvers with a regular data-flow, however, the economical DDR chips are the first choice in hardware architectures, and graphics cards use them and the optimized Graphics DDR (GDDR) variant almost exclusively.

• Widening the data bus of the memory chip.
As explained in the beginning, a typical SDRAM memory chip has a depth of 8 bit.
Adding more pins is expensive, but to achieve high bandwidth for graphics cards, for example, the expenses are incurred and 32 bit deep chips are used. A complementing option is the widening of the data bus on the level of the memory system by lining up several memory chips (Section 3.1.2 on the facing page).

• Decreasing the clock cycle time (tCLK) of the memory chip.
This seems to be the simplest and most efficient option. Unlike the multiple data rate designs, we make everything faster, so that the latencies also decrease. Unfortunately, depending on the voltage and the density of the memory cell arrangement, there are physical limits to how fast the column and row address signals and the pre-charge can be made, and how long a row must stay open before it can be closed again. So while the frequency has been increasing, the absolute latencies have hardly improved, which means that they have worsened in multiples of clock cycles. Moreover, increasing the frequency of the whole memory array is a problem in itself, because as with other integrated circuits, higher frequencies mean higher power consumption and heat development. So instead, one reverts to increasing the prefetch size, as in the case of multiple data rate chips.


The memory array of a DDR-400 memory chip operates at a 200MHz clock speed, but data is output at 400MHz because of the prefetch size of 2. As it is costly to further increase the speed of the core, the DDR2-400 chip features a 100MHz memory array with a prefetch size of 4 and data output at 400MHz. The output logic operates at a clock speed of 200MHz and multiplexes the four data packets onto two clock cycles, such that the data output is again at 400MHz. The reason for introducing this Second Generation DDR (DDR2) standard is mainly economic: the 100MHz memory array is cheaper in production, and increasing the frequency of the core to 200MHz easily doubles the bandwidth again, while increasing the frequency from 200MHz to 400MHz in standard First Generation DDR (DDR1) is more costly. Nevertheless, for the high-end segment both techniques are combined in ultra-fast Third Generation GDDR (GDDR3) 1.25ns chips, which have the memory core running at 400MHz, delivering data at 1600MHz or 1.6Gbit/s per pin. The difference between the DDR2 and the QDR technology is that a DDR2 chip outputs the four prefetched packets on two clock cycles, while a QDR chip would do this in one. So the handling of QDR signals requires more precisely timed logic, since two of the data packets are not synchronous to the clock signal edges.

Recapitulating, we see that the exponential increase in bandwidth is based only in part on frequency increase. In 1990 both the i486 and the memory ran at 33MHz [Risley, 2001]; nowadays the Pentium 4 processor [Intel, 2004b] is at 3GHz, while the memory array of a standard DDR2-400 chip runs at 100MHz and the currently fastest GDDR cores operate at 500MHz. So even with the DDR technology the growth of the bandwidth of the memory chips lags behind. Additional arrangements on the system level, discussed in the following section, help to close this gap to a certain extent.
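The interplay of core clock and prefetch size described above reduces to a one-line formula (a sketch of the arithmetic, not a datasheet model):

```python
def data_rate_MHz(core_clock_MHz, prefetch):
    """Effective per-pin data rate: the core delivers 'prefetch' bits per
    core cycle, which the I/O logic serializes onto the pins."""
    return core_clock_MHz * prefetch

assert data_rate_MHz(200, 2) == 400    # DDR-400: 200MHz core, prefetch 2
assert data_rate_MHz(100, 4) == 400    # DDR2-400: 100MHz core, prefetch 4
assert data_rate_MHz(400, 4) == 1600   # GDDR3 1.25ns: 400MHz core
```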
Latencies have improved only slowly, from around 100ns in 1990 to 50ns (25ns high-end) for a random access nowadays. In many places of this section we have therefore emphasized how important a regular, or at least predictable, data-flow is, because it allows these latencies to be hidden.
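The interplay of core clock, prefetch size and external data rate described above can be summarized in a short calculation. The figures are the ones from the text; the helper functions themselves are only an illustrative sketch:

```python
# Sketch of the relation between memory core clock, prefetch size and
# external data rate for the DRAM generations discussed in the text.

def effective_data_rate_mhz(core_clock_mhz, prefetch):
    # The core delivers `prefetch` words per core cycle; the I/O logic
    # serializes them onto the bus at the correspondingly higher rate.
    return core_clock_mhz * prefetch

def peak_bandwidth_gb_s(data_rate_mhz, bus_width_bits):
    # Peak bandwidth of a bus (1 GB = 10^9 bytes).
    return data_rate_mhz * 1e6 * bus_width_bits / 8 / 1e9

assert effective_data_rate_mhz(200, 2) == 400    # DDR-400: prefetch 2
assert effective_data_rate_mhz(100, 4) == 400    # DDR2-400: prefetch 4
assert effective_data_rate_mhz(400, 4) == 1600   # GDDR3: 1.6 Gbit/s per pin
# A 64 bit wide bus at a 400 MHz data rate delivers 3.2 GB/s peak.
assert abs(peak_bandwidth_gb_s(400, 64) - 3.2) < 1e-9
```

The same two quantities, data rate per pin and bus width, reappear in the system level discussion of the next section.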

3.1.2 Memory System

The processing elements (PEs) in most computer systems communicate with a memory controller which handles the main memory over a data bus and an address bus. In most micro-processors the bandwidth of the Front Side Bus (FSB) determines how fast the processor can read or write data. The bandwidth of the memory system data bus does not have to equal the bandwidth of the FSB, although this is desirable. Currently, the standard data bus width in PCs is 64 or 128 bit, in graphics cards 128 or 256 bit. During a read or write operation all lines of the bus are used simultaneously and their content defines a bus word. Because a typical SDRAM memory chip has a depth of only 8 bit, eight of them are arranged on a memory module to provide the 64 bit of a word. Similarly, eight 32 bit deep GDDR chips add up to the high-end 256 bit wide graphics card buses. For the PC memory modules this means that physically the first byte of the address space lies in the first chip at (0,0) in the memory array, the second byte in the second chip at (0,0), . . . , the ninth byte in the first chip at (0,1), etc. When we open

a row in a chip, we thus actually open the addressed row in all chips simultaneously, and the union of these rows is called a memory page. Notice that for the data bus it does not matter how the page is arranged, whether in eight 8 bit deep chips or sixteen 4 bit deep ones. In the discussion of memory timings one therefore talks about an open or closed page, and about page hits and misses, rather than referring to the rows.

The options for increasing the bandwidth on the system level are similar to the ones available for the memory chips:

• Increasing the rate of data packets per clock cycle.
Similar to the DDR design of memory chips, there are also plans to multiplex data within the same clock cycle on a memory module basis. Thereby, two memory modules are stuck together and clocked with slightly displaced clock signals, such that their output can be multiplexed, delivering twice the number of data packets per clock cycle. The modules could even themselves contain DDR chips, so that the output would be Quad Data Rate (QDR).

• Widening the data bus.
This has already happened several times in the past. Doubling the data bus width immediately doubles the peak bandwidth, but programs only benefit from it if, on average, most of the bits contained in the transported word are really utilized. Otherwise, a lot of memory is moved although only a very small part of it is really needed. So for applications which have a regular access pattern to large memory blocks the gains of a widened bus are enormous, and architectures such as graphics cards, which are designed for the processing of large data chunks, have a 2-4 times wider bus than the common PC. On the mainboards of PCs the data bus can also be widened to 128 bit by using two memory modules as two parallel 64 bit channels, instead of widening the memory modules themselves. This has the advantage that one can still utilize the same memory modules as for single channel mainboards.
• Decreasing the clock cycle time (tCLK) of the data bus.
This puts requirements on the quality of the mainboard and its components, just as the production of equally fast memory chips requires special care. Therefore, high clock cycle buses are more common in graphics cards, where the interaction with the bus is more restricted and bandwidth is often the bottleneck, than in PCs, where other hardware components also interact with the bus and latency is the bottleneck.

As the memory controller must handle memory requests from all hardware components, the main PEs which should perform our computation do not always get full attention. If the access rights on the bus are handled in a naive way, e.g. by rotating them among the devices, then the sustained bandwidth for the PEs decreases even when no other requests are pending. Only in newer PCs can the processor obtain an uninterrupted stream of data from the memory, as provided by the burst mode of the memory chips (Figure 3.3 on page 76).

In the previous section we have seen that the clock frequency of memory chips has grown much more slowly than that of micro-processors. The reason why DDR2-400 chips with a 100MHz

memory core and 8 bit depth can nonetheless provide enough bandwidth for the 6.4GB/s demand of the Pentium 4 QDR FSB at 200MHz lies in additional techniques which increase the bandwidth, namely: the prefetch size of 4 with multiplexed output, the grouping of memory chips into 64 bit deep memory modules, and the use of two modules as separate channels, resulting in a 128 bit wide bus. Similarly, 32 bit deep DDR-950 chips lined up to a 256 bit wide bus currently offer the highest peak bandwidth of 30.4GB/s to feed the GeForceFX 5950 Ultra [, 2004] running at 475MHz.

As data is sent from the processor to the memory controller and further to the memory modules, additional latencies occur. As with the other latencies, this is not critical if a sequential data stream is requested, because then the system latencies occur only once at the beginning, but it makes random accesses even more devastating in terms of performance.
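The byte interleaving across the chips of a memory module, described at the beginning of this section, can be sketched as a small address mapping. The page geometry below is an illustrative assumption, not a figure from the text:

```python
# Hypothetical address mapping on a 64 bit module built from eight 8 bit
# deep chips: consecutive bytes land in consecutive chips, and all chips
# share the same row/column address, so opening a row opens a whole page.

CHIPS_PER_MODULE = 8      # eight 8 bit chips form one 64 bit bus word
COLUMNS_PER_ROW = 1024    # assumed page geometry (illustrative)

def locate_byte(address):
    """Map a linear byte address to (chip, row, column) on the module."""
    chip = address % CHIPS_PER_MODULE
    cell = address // CHIPS_PER_MODULE   # same cell index in every chip
    row, column = divmod(cell, COLUMNS_PER_ROW)
    return chip, row, column

assert locate_byte(0) == (0, 0, 0)   # first byte: first chip at (0,0)
assert locate_byte(1) == (1, 0, 0)   # second byte: second chip at (0,0)
assert locate_byte(8) == (0, 0, 1)   # ninth byte: first chip at (0,1)
```

Because all chips share one row address, a memory page here consists of the union of the open rows, exactly as in the text.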

3.1.3 Memory Hierarchy

Bandwidth can be increased and latencies reduced substantially if memory resides closer to the PEs, ideally on the same die, i.e. on the same piece of silicon on which the processor has been fabricated. It is, however, very expensive to put large amounts of memory on the same die as the PEs. Instead, small blocks of local memory store accessed data and intermediate results for fast reuse. The local memory is named a cache if it is transparent to the PEs, i.e. the PEs still use global addresses and the cache determines on its own whether it already contains the requested data and delivers it in place of an access to the main memory. Micro-processors tend to use caches, while reconfigurable hardware usually exposes the local memory directly to the programmer. Depending on how close the memory lies to the PEs, we speak of registers (L0) and L1, L2 or L3 memory/caches. In today's common micro-processors, registers come in numbers of tens to hundreds and sizes of 32 bit to 128 bit. Sizes of L1 caches go up to hundreds of KiB, for L2 up to several MiB, and for L3 up to tens of MiB. While the registers and the L1 cache are on-die with the processor, L2 is usually on-die or at least in the processor package, while L3 is less common and may be placed similarly to L2 or outside of the chip. The main advantage of the close integration is the opportunity to run the caches at the high processor clock rather than the much slower memory bus clock. Moreover, caches are built from SRAM and exhibit far lower latencies than the memory bus with DRAM. The L1 cache often has only 2-3 clock cycles latency, the L2 cache 5-10; for the L3 cache it is usually more than twice the L2 cache latency, but still less than a random access over the memory bus, which takes 100-200 clock cycles. Notice that here we talk about latencies as multiples of the processor clock cycle and not of the much slower memory bus clock cycle.
Bandwidth also decreases away from the processor core, and again there are significant differences between the theoretical peak bandwidth and the sustained bandwidth of different memory access modes. The different speeds of the caches lead to memory access performance graphs staggered by the sizes of the caches. Consult [Gavrichenkov, 2003] for an evaluation of such synthetic memory benchmarks for the current high-end PC processors Pentium 4 [Intel, 2004b] and Athlon 64 [AMD, 2004].


Caches have turned out to be a very effective way of reducing memory latencies and increasing the bandwidth to the PEs. Empirical analysis of common PC applications shows that the cache system has over a 98% hit rate, i.e. in less than 2% of the cases data must be fetched from the main memory. However, this figure suggests too high a performance gain. Recalling the general arrangement of memory cells from Section 3.1.1 on page 71, we understand that it is easiest to organize access to caches in whole lines. A typical L1 cache line is 64B wide, an L2 line 64-128B. This means that when we request one byte, we get 64 of them into the cache. This is done because we are likely to use some of the other bytes subsequently. But if we do not, this does not necessarily show up as a bad hit rate: we could repeatedly request several such isolated bytes, each from its own cache line, and obtain a good hit rate, although we never use any of the other bytes in the cache lines. For performance it is therefore important how predictable the access to a new memory address is. Only if the address is known well in advance does the fetching of a whole cache line incur no performance penalty, because then it does not stall the current computation, which takes place simultaneously with the data transmission.

The latencies and hit rates given above depend on the cache associativity. In an N-associative cache one has to check the tags of N cache lines to determine whether a given address in the main memory is cached, where the tags specify the main memory line which is mapped to the cache line. The smaller N is, the fewer comparisons against the tags must be done, decreasing the latency. But the larger N is, the greater the number of cache lines which may be used for the mirroring of a given address, increasing the probability of cache hits. The extreme cases are the directly mapped cache (associativity 1), which associates each address with exactly one cache line and thus needs only one tag comparison, but has a lot of memory competing for the same line; and the fully associative cache (associativity equal to the number of cache lines), which can map any main memory line to any of the cache lines and thus needs to check all tags, but can always involve all lines in the caching. 4- or 8-associative caches are common and form a good compromise between short latencies and a high hit rate.
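How an N-associative cache narrows a lookup down to N tag comparisons can be illustrated by the usual address decomposition. The cache parameters below are illustrative assumptions, not figures from the text:

```python
# Sketch of an N-associative cache lookup: only the tags of the N lines
# in one set must be compared against the address tag.

LINE_SIZE = 64                # bytes per cache line (typical L1, see text)
CACHE_SIZE = 32 * 1024        # assumed total size: 32 KiB
ASSOCIATIVITY = 8             # N lines per set
NUM_SETS = CACHE_SIZE // (LINE_SIZE * ASSOCIATIVITY)   # = 64 sets

def decompose(address):
    """Split an address into (tag, set index, offset within the line)."""
    offset = address % LINE_SIZE
    line_number = address // LINE_SIZE
    set_index = line_number % NUM_SETS
    tag = line_number // NUM_SETS
    return tag, set_index, offset

# Neighboring lines map to neighboring sets ...
assert decompose(0)[1] == 0 and decompose(64)[1] == 1
# ... while addresses NUM_SETS lines apart compete for the same set and
# are told apart only by their tags (at most ASSOCIATIVITY may coexist).
a, b = decompose(0), decompose(NUM_SETS * LINE_SIZE)
assert a[1] == b[1] and a[0] != b[0]
```

In a directly mapped cache NUM_SETS equals the number of lines and one tag comparison suffices; in a fully associative cache there is a single set and every tag must be checked.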

Let us summarize the data access handling by following a request from the processor for a single bit which resides in the main memory. First, we have a latency incurred by the L1 cache miss, and the L1 cache requests a 64B line from the L2 cache. The L2 cache also suffers a miss and requests the 64B from the memory system. After additional memory system latencies the memory controller opens the requested page. This takes some time, as it means measuring and amplifying the charge of the capacitors in the corresponding row. Then a series of read commands providing the column addresses starts the transmission of the cache line in burst mode. A few more cycles pass before the data from the memory chips, traveling through the mainboard, reaches the processor. The processor stores the incoming data in the caches and, on arrival of the requested integer, passes it to the PEs for the one bit check. Certainly, this is an extreme example, but it illustrates the complexity of a memory access. Moreover, even when a stream of instructions is available to the processor, a cache miss is still devastating, and the techniques to handle cache misses while several instruction threads are executed in parallel are clearly not trivial [Karkhanis and Smith, 2002].

To clarify the bandwidth mismatch between micro-processor requirements and memory system performance, we continue the example from the previous section. For a Pentium 4

with an FSB800 we have seen that dual channel DDR2-400 modules can provide the 6.4GB/s bandwidth for the Front Side Bus (FSB). But the processor itself, running at 3GHz, can process four 32-bit floats in one clock cycle using the Streaming SIMD Extensions (SSE) and would thus require 48GB/s of input bandwidth, which is 7.5 times the FSB bandwidth. If we consider simple additions, then two operands would have to be retrieved and one result written in each clock cycle, resulting in a 22.5-fold mismatch. Factors like these are typical for the disparity between the required peak processor bandwidth and the peak memory system bandwidth. Moreover, we have only considered the SSE unit, although a micro-processor also contains several additional Arithmetic and Logic Units (ALUs) which can run in parallel. The L2 cache in the Pentium 4 3GHz therefore provides 96GB/s peak bandwidth, so that it makes a lot of sense to load all data which will be needed in later calculations into the cache lines in advance; this is called a cache prefetch. Compilers should provide such hints to the processor; however, in programs which use so much data that the L2 cache cannot hold all of it, it is in general difficult to decide at compile time which data should be prefetched, in particular as this depends on the various memory timing parameters of the system. The processor itself issues prefetch instructions at runtime by speculating which code will be executed next, and such optimization of the memory behavior can be fairly successful when performed carefully [Hu et al., 2002]. But again, for optimal performance we are much better off with a regular data-flow.

In contrast to micro-processors, the bandwidth requirements of a GPU and the memory bandwidth provided on a graphics card used to be better balanced. The GeForceFX 5950 Ultra running at 475MHz, for example, can process 8 texture values in one clock cycle, and each texture value may consist of up to four 32 bit deep colors. So if we want to keep all PEs busy all the time, we have a memory bandwidth requirement of 475MHz · 8 · 4 · 32 bit = 60.8GB/s. The memory provides half this bandwidth, so the mismatch is much smaller. But with the new generation a change occurred, as the GeForce 6800 Ultra has 4 times as many PEs but only slightly higher memory bandwidth, resulting in a bandwidth mismatch factor of 5.8, more similar to the micro-processors. This tendency is likely to continue, because computer games now also use configurations with higher computational intensity, and the integration of additional PEs into the GPUs is cheaper than the corresponding bandwidth increases. What weighs more in favor of GPUs is the processing of the data in data streams, which minimizes latencies and maximizes throughput (Section 3.2.4 on page 93).
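The bandwidth mismatch estimates of the last two paragraphs amount to the following arithmetic (1 GB = 10^9 bytes; the helper function is only an illustrative sketch):

```python
# Reproduce the bandwidth mismatch figures for the Pentium 4 and the
# GeForceFX 5950 Ultra examples from the text.

def stream_bandwidth_gb_s(clock_hz, items_per_cycle, bytes_per_item):
    return clock_hz * items_per_cycle * bytes_per_item / 1e9

# Pentium 4 at 3 GHz: four 32 bit floats per cycle via SSE as input.
cpu_demand = stream_bandwidth_gb_s(3e9, 4, 4)   # 48 GB/s
fsb = 6.4                                       # GB/s, dual channel DDR2-400
assert cpu_demand == 48.0
assert abs(cpu_demand / fsb - 7.5) < 1e-9       # input-only mismatch
assert abs(3 * cpu_demand / fsb - 22.5) < 1e-9  # two operands in, one result out

# GeForceFX 5950 Ultra at 475 MHz: 8 texture values of four 32 bit colors.
gpu_demand = stream_bandwidth_gb_s(475e6, 8, 4 * 4)
assert abs(gpu_demand - 60.8) < 1e-9
assert abs(gpu_demand / 30.4 - 2.0) < 1e-9      # the memory provides half of it
```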

In the second part we turn our attention to the computation itself. Once the data has arrived at the PEs, the question arises which organization of the PEs will deliver the result fastest. Having experienced the relevance of the data access pattern to the overall performance, we will also examine the computing architectures with respect to latency and bandwidth efficient memory usage.


3.2 Computation

In general we have talked about processing elements (PEs) so far, because the hardware elements which perform the computation can be very different. However, sometimes we referred to micro-processors which serve as Central Processor Units (CPUs) in our PCs, because this implicitly defined a computing paradigm known as von Neumann or instruction-stream-based (ISB) computation. An ISB processor executes an instruction stream and reads and writes data according to the instructions. This is the predominant computing model, not necessarily because it is the best for all applications, but rather because of historic and economic considerations. We have already seen for the memory chip design that economy often has a stronger impact on the design of computers than performance. So only the ever growing size of multimedia data together with the memory gap problem made way, from an economic point of view, for the dichotomic anti machine paradigm. In the following we adopt the terminology for the classification of the architectures from [Hartenstein, 2003].

The anti machine paradigm is based upon data-stream-based (DSB) computations, i.e. the PEs in a DSB processor are configured to perform a certain computation, triggered by the arrival of data at their inputs. Obviously, the order of the data plays a crucial role in the evaluation. The data streams are generated by a data sequencer programmed with flowware, just as an instruction sequencer, which is usually part of an ISB processor, generates an instruction stream determined by software. Different operations are realized by reconfiguring the PEs to a new state and streaming the data through them again. For this to happen the PEs must be reconfigurable, in which case we speak of morphware, i.e. reconfigurable hardware, in contrast to hardwired and non-configurable hardware. The code which is used to reconfigure morphware is named configware; Table 3.2 on the facing page summarizes the relations in the nomenclature.
However, the classification in the table suggests a false homogeneity within the classes. The designs within each class may vary significantly, leading to optimal performance in different application areas. We can only scratch the surface of the different computing paradigms and refer to [Becker, 2002; Bell, 2000; Hartenstein, 2003; Herz et al., 2002] for further reading and references on this topic. But before looking at the two computing paradigms in more detail, we must first examine how to measure performance and how the general concept of parallelization can help to increase it.
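As a toy illustration of the data-stream-based idea, the following sketch configures a processing element once and then streams data through it. All names are purely illustrative; no real flowware or configware interface is modeled:

```python
# Toy sketch of the data-stream-based (DSB) idea: a PE is configured once
# (configware) and data then streams through it in the order produced by a
# data sequencer (flowware).

def data_sequencer(memory, addresses):
    """Flowware: generate the data stream in a prescribed address order."""
    for a in addresses:
        yield memory[a]

def configure_pe(operation):
    """Configware: fix the PE's operation; arriving data triggers execution."""
    def pe(stream):
        for x in stream:
            yield operation(x)
    return pe

memory = [3, 1, 4, 1, 5]
double = configure_pe(lambda x: 2 * x)
assert list(double(data_sequencer(memory, range(5)))) == [6, 2, 8, 2, 10]

# A different operation means reconfiguring and streaming the data again.
square = configure_pe(lambda x: x * x)
assert list(square(data_sequencer(memory, range(5)))) == [9, 1, 16, 1, 25]
```

In an ISB processor, by contrast, the instruction stream would drive the execution and request the operands one by one.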

3.2.1 Performance

Similar to the memory characteristics where we have considered the performance attributes of latency and bandwidth (Section 3.1 on page 71), for the computation we also have a two-fold measure of throughput, the amount of data that can be processed per second, and latency or execution time, the delay from the computation’s beginning to its end. Both measures depend on the problem, but allow a comparison of different architectures. Additionally, measuring operations per second (OPS) may be useful to compare different solvers on the same machine


Table 3.2 Sources, execution units and platforms of instruction-stream-based (ISB) and data- stream-based (DSB) processors. The ISB processor usually contains the instruction sequencer, so that software defines an instruction stream which determines both the instruction scheduling and the execution in the PEs. Similarly, a DSB processor may also contain the data sequencer, so that a common configuration determines both the data scheduling and the execution in the PEs. But the conceptual distinction between flowware and configware is kept up, as it is reflected in the co-design of two different parts of the configuration. Hybrid designs mixing the different execution units also exist. They may involve the co-design of all software, flowware and configware.

source      execution unit             platform
software    instruction sequencer      hardware
software    PEs in the ISB processor   hardware
flowware    data sequencer             hardware/morphware
configware  PEs in the DSB processor   morphware

and against the maximal OPS performance of the architecture. However, OPS numbers are less useful for comparing different processors, because the general term 'operation' may imply different amounts of actual processing. The common measure FLOPS, for example, ignores the fact that the transistor count for a multiplier grows quadratically with the data width of the operands, while the transistor count for an adder grows only linearly; thus 10 FLOPS (multiplications) is much more powerful than 10 FLOPS (additions) in reconfigurable hardware.

In general, we can say that latency is more important for the processing of small data packets, while throughput is crucial for large data blocks. Latency is often significant for applications like emergency signals, system surveillance, real time process control, financial orders, searches of database entries, etc. Throughput is the decisive factor in our case of large images and in many other scientific computations operating on huge data blocks.

In the design of hardware architectures or in reconfigurable systems, latency can often be traded for throughput by a technique named pipelining. Thereby, several PEs are lined up to perform a complex operation (Figure 3.4 on the following page). Each of them has the same latency, enabling a synchronized hand-in-hand execution. This has the advantage that the first PE does not have to wait for the last PE to complete before starting a new computation. Instead, as long as input is available, all PEs keep on computing and pass their intermediate results to the next one. If the pipeline is long, the overall latency may sum up to a large value, but the throughput depends only on the common latency of the individual PEs.
Obviously, less complex PEs have smaller latencies, so that the pipeline grows longer but the throughput increases. Longer pipelines also diminish the problem of hot spots, i.e. particularly high power dissipation on a small area of the die, and thus allow the frequency of the design to be increased further. Indeed, this is a rule of thumb: longer pipelines enable higher clock frequencies.


Figure 3.4 An example of throughput increase through pipelining. Three different designs show the implementation of a linear interpolation a + µ(b − a). The first uses an integrated circuit for this operation. The second is a pipeline assembled from components with latency 2. The overall latency worsens, but the throughput is doubled. The third design shows an optimized pipeline with a quadrupled throughput in comparison to the initial design.

Let us elaborate on the pipelining example from Figure 3.4. An operation like the linear interpolation a + µ(b − a), which takes, say, 4 cycles to complete in a certain integrated circuit, could be split into two adders with 1 cycle latency each and a multiplier with 2 cycles latency, doubling the throughput to 1 datum per 2 cycles. The pipeline requires more resources than the integrated circuit, because it cannot reuse one of the adders for both operations and more control logic is needed. But since the multiplier consumes most of the area, the percentage increase in transistors is moderate. The optimization of the pipeline continues in the third design. The additions have a smaller latency than the multiplication and are delayed in the pipeline to fit the common speed. But the multiplier can also be a pipelined design itself, with the latencies of its sub-circuits matching the unit latency of the adders. Since the new maximal latency of the components in the pipeline is 1, we have

doubled the throughput again to 1 datum per cycle. We see that pipelines can be optimized in detail if we have the possibility of splitting and merging the functional units for an optimal timing. In this way morphware often allows the throughput of the initial implementation to be multiplied. In the hardwired case only general, and not application specific, optimizations can be applied.
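The latency/throughput trade-off of the lerp pipeline can be restated with two small helper functions; the stage latencies are the ones from the example, while the functions themselves are an illustrative sketch:

```python
# Sketch of the latency/throughput trade-off from the lerp example above:
# a pipeline's total latency is the sum of its stage latencies, while its
# throughput is limited by the slowest stage.

def pipeline_latency(stage_latencies):
    return sum(stage_latencies)

def cycles_per_datum(stage_latencies):
    return max(stage_latencies)   # the slowest stage paces the pipeline

# Integrated circuit for a + mu*(b - a): 4 cycles, one datum per 4 cycles.
assert pipeline_latency([4]) == 4 and cycles_per_datum([4]) == 4
# Split into subtract (1), multiply (2), add (1): throughput doubles.
assert cycles_per_datum([1, 2, 1]) == 2
# Pipelining the multiplier into two 1-cycle stages doubles it again,
# at the price of a longer overall latency.
assert pipeline_latency([1, 1, 1, 1]) == 4
assert cycles_per_datum([1, 1, 1, 1]) == 1
```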

3.2.2 Parallelization

If we have only one elementary PE, there is not much to optimize concerning the computation itself. Similar to the memory design we have the options of:

• Increasing the frequency.
This has been done very successfully over the years, but at some point there is always a limit, and too aggressive frequency settings may impede the use of the other optimizations.

• Widening the data path.
This has also been done to some extent, but computing everything in 64-bit is not really a performance gain if all numbers lie well within 32-bit. We might make use of the other 32 bits by packing numbers together. This, however, will probably mess up the code or even decrease performance due to frequent packing and unpacking operations. Therefore, the hardware itself often offers to use a wide ALU as two parallel ALUs with halved data width.

• Increasing the functionality of the PE.
We construct a PE which will process more complex commands, which otherwise would have needed several simple commands and more time to execute; e.g. we could have a multiply-accumulate (MAC) instruction performing a multiplication and an addition in one cycle. However, within such a PE there must be a multiplier and an adder, so why not speak of two concatenated primitive PEs? The question is whether these separate logic circuits can be used individually. We therefore restrict the use of the term PE to the smallest individually controllable processing elements.

So we could realize the MAC operation by two elementary PEs, and this opens up the extensive topic of parallelization: the idea of arranging several PEs to accelerate the computation. The main reason for considering parallelization is economic cost. Of course, parallelization is concerned with performance, programmability, reliability, power consumption and flexibility, i.e. the efficiency in adapting to different application requirements, which includes features like mobility, security, maintainability and scalability (the efficiency of increasing performance by adding more computing units to the system). But ultimately this all boils down to optimizing costs under the restrictions of the envisaged application area. We have to make this point clear, because from a scientific point of view it would be more agreeable to think of a resource optimization under certain inherent constraints, but actually it is cost optimization

and we cannot exclude the impact of the external factors of the market, which sometimes render solutions with a poor utilization of resources cost efficient, e.g. due to the cheap availability of these resources through mass production.
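The packing of two 16 bit numbers into a 32 bit word mentioned under the widening of the data path can be sketched as follows. It also shows why a plain wide addition is no substitute for a split ALU: carries leak between the halves. The helper names are illustrative:

```python
# Sketch of packing two 16 bit numbers into one 32 bit word.

MASK16 = 0xFFFF

def pack(lo, hi):
    """Pack two unsigned 16 bit values into one 32 bit word."""
    return ((hi & MASK16) << 16) | (lo & MASK16)

def unpack(word):
    return word & MASK16, (word >> 16) & MASK16

def packed_add(w1, w2):
    """One wide addition adds both halves at once; it is only correct as
    long as the low halves do not overflow into the high halves."""
    return (w1 + w2) & 0xFFFFFFFF

assert unpack(packed_add(pack(10, 20), pack(1, 2))) == (11, 22)
# A carry from the low half corrupts the high half, which is why hardware
# SIMD splits the ALU instead of relying on plain wide additions:
assert unpack(packed_add(pack(0xFFFF, 0), pack(1, 0))) == (0, 1)
```

The frequent pack/unpack calls around every operation are exactly the overhead the text warns about.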

Similar to the hierarchy of caches, parallelization can take place at various levels, and the closer the parallel units lie together, the better the connection between them and thus the less delay occurs in information exchange. There are numerous levels of parallelization, and for clarity we order them into a strict hierarchy, omitting minor branches:

• Logic circuits in a PE.
Even the smallest PE may exhibit parallelism, either by pipelining or simply by multiple inputs and outputs. On this level the parallelism is usually not directly accessible to the programmer, because independent control of the individual circuits requires a huge instruction or configuration stream and a lot of interconnect.

• PEs in a core.
Most large scale hardware design projects do not start anew at the level of PEs. The design itself is hierarchical and uses cores, predefined assemblies of PEs of various complexity, to arrange a more complex design. A major advantage is the availability of Intellectual Property (IP)-cores with exactly specified functionality and behavior from various vendors. The opportunities for parallelization in a core are numerous. At this level the driving cost factor is the performance per transistor ratio, the main benefits being the absolute reduction of silicon area and decreased power dissipation.

• Cores on a die.
Some stand-alone processors on the market are also available as IP-cores, so that with additional communication logic one can quickly integrate several processors on a die. With modern fabrication processes even embedded DRAM, analog circuits and opto-electro-mechanical systems can be integrated in a System-on-a-Chip (SoC) on the same die. Frequently, reconfigurable hardware is also added to form a Configurable System-on-a-Chip (CSoC). Simultaneous Multi-Threading (SMT) is a contemporary strategy for parallelization, where a single core processor pretends that it contains two cores by distributing the computations intelligently onto its parallel PEs.

• Dies in a chip package.
The reason for having several dies in a chip package, rather than putting everything on one die, lies partly in the silicon processing: memory and PEs have optimized processes of their own, and smaller dies are less often corrupted by fabrication errors. Also, the individual dies may be mass-produced even though the concrete assemblies are needed in much smaller numbers. Therefore, from here on overall area efficiency is no longer the main cost criterion, since mass production of standard components may render their assembly cheaper than the design of a new, more area efficient, problem specific SoC.


• Chips on a card.
Several processor and memory chips may be closely interconnected on a card, although in a common PC the chips are usually placed directly on the board. The assembly on a card is similar to the assembly of dies in a chip package, but on a higher level. It makes sense mainly for Massively Parallel Processing (MPP) systems to introduce another level of interconnect proximity, because the larger the distance between the processors becomes, the slower is their connection.

• Cards on a board.
Several cards are put on a board. Additionally, the board includes memory and hardware components for interfaces to a network, mass storage devices, extension cards, etc. The typical example is the mainboard in a PC.

• Boards/racks in a computer.
While a PC usually contains only one board, there is a whole hierarchy of board integration in High Performance Computing (HPC), ranging from multi-board arrangements to meter high racks which in multitude form a supercomputer.

• Computers in a cluster.
Two to thousands of computers can be arranged in clusters, making use of the network capabilities of the computers for the interconnect. Such clusters may be built up from various categories of computers, ranging from simple PCs to HPC machines. Usually cheap standard PCs are used and only the interconnect is of higher quality, as it often is the performance bottleneck in these inexpensive parallel computers. In this context the term system node refers to the smallest common building block of a computer system. In the case of a cluster each of the computers forms a node, while for dedicated parallel computers the individual boards are the nodes.

• Clusters in the world.
Distributed computing has become a major keyword in recent years. Indeed, the idea of joining computing resources around the world is intriguing. Both fast fiber connections between major HPC centers and slow networks between numerous PCs exist.
Besides performance gains from parallelization, distributed computing addresses the very important property of system stability in large computer systems. At all these levels the question arises what the best topology for the communication between the computing units is. Again, costs play an important role, but so do the general assumptions on how much communication will be required between the units and of which structure the data streams are. Common interconnect topologies are linear, bus, ring, star, tree, hypercube, mesh, and fully connected; hybrid topologies also exist. This classification applies basically to the parallelization levels above the node level, because below it we often do not need any direct communication at all. Instead, information is exchanged indirectly by using shared memory for the intermediate results. This shared memory may be single registers, register files, embedded RAM or local caches. Depending on how close this memory resides to the PEs, different latencies will occur while exchanging the intermediate results between the PEs in

this way. On higher levels a direct communication may turn out faster; e.g. some processors have an additional bus for direct communication with other processors, which is much faster than the access to the shared memory on the node. It is clear that the distribution of memory and the choice of an appropriate interconnect topology play a crucial role for the performance of the parallel system. We distinguish the following basic types:

• Shared Memory
All nodes share the same global memory space. The memory is accessed over a common bus or a connection network. Caches embedded in today's micro-processors violate the totality of the sharing to some extent. Usually the processors themselves ensure that the contents of all caches are coherent. Non-cache-coherent architectures are easier to build and save on implicit synchronization, but are very difficult to program.

• Distributed Memory. Usually on the board and card level, memory is distributed over the whole system, and network requests to other boards must be issued if other than local data is required. The speed of the network is often a crucial parameter in distributed memory systems. In a Massively Parallel Processing (MPP) system, where several boards reside close together in a common rack, high speed networks can be used. For inexpensive clusters which connect standard PCs, slower standard network components are more common.

• Non-Uniform Memory Access (NUMA) architecture. This is a system in which the speed of an access to the logically shared memory varies with the actual physical location of the data. This is a common memory model for parallel arrangements with more than 8 nodes, because the costs of a symmetric access of all nodes to the memory grow quickly with their number. But more nodes also increase the complexity of the cache coherency problem. The logically shared memory space need not be shared physically. But NUMA alone usually refers to shared memory architectures or those with a fast hardware mechanism for retrieving data from different physical memory locations.

In the case of a truly distributed memory architecture which offers a logically shared memory space one speaks of Distributed Shared Memory (DSM). The sharing is achieved either transparently by specific hardware components or an operating system, or handled more explicitly by the user with an appropriate programming model. In either case the aim is to distribute the data efficiently with minimal traffic across large distances. In comparison to explicit message passing between the nodes of a distributed memory architecture, the DSM model is more convenient for the programmer, but may fall short of finding the optimal distribution and traffic of data among the nodes.

Since NUMA only states a non-uniformity, the performance of these architectures differs significantly depending on the relation of the access times to the memory address positions in the logically shared memory space. NUMA shared memory realized in software typically has an order of magnitude higher latency than the much more expensive hardware solutions.
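The practical difference between the shared memory and the message passing styles of communication can be sketched in software. The following illustrative Python model (the function names and the two-worker split are my own choices, not from the thesis) computes the same sum twice: once with workers writing into logically shared state guarded by a lock, and once with workers owning disjoint local data and sending explicit messages to a master:

```python
import threading
import queue

def shared_memory_sum(data, workers=2):
    """Shared memory model: workers write partial sums into shared state.
    The lock plays the role of the hardware coherency/synchronization mechanism."""
    partial = [0] * workers          # logically shared memory
    lock = threading.Lock()
    def work(i):
        s = sum(data[i::workers])
        with lock:                   # synchronized access to shared state
            partial[i] = s
    threads = [threading.Thread(target=work, args=(i,)) for i in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(partial)

def message_passing_sum(data, workers=2):
    """Distributed memory model: workers own disjoint local data and
    communicate results only through explicit messages over a channel."""
    inbox = queue.Queue()
    def node(local_data):
        inbox.put(sum(local_data))   # explicit message to the 'master' node
    threads = [threading.Thread(target=node, args=(data[i::workers],))
               for i in range(workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return sum(inbox.get() for _ in range(workers))

print(shared_memory_sum(list(range(100))))    # 4950
print(message_passing_sum(list(range(100))))  # 4950
```

Both variants produce the same result; what differs is who is responsible for moving the data, which is exactly the trade-off between the DSM convenience and explicit message passing discussed above.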


On the low levels where PEs or blocks of them have to be connected, we have the rough rule that transistor count equals costs, so buying only a slight performance increase with a high number of transistors is not an option. The speedup is a problem dependent function which for given p ∈ N equals the ratio of the time spent on solving the problem using one computational unit to the time spent using p computational units. Ideally, we expect a linear speedup, i.e. the computation time is reduced proportionally to the number of computational units. In rare cases super-linear speedup is achievable due to factors related to the overall architecture. If we dealt with only one problem, then we could optimize the arrangement of the PEs to obtain the best speedup. But for a wide range of applications it is more difficult to determine how to organize multiple PEs for optimal average speedup. Let us examine some cases:

• Simple computations on independent data entries. This is a situation with low computational intensity where one or very few operations are executed on each data entry. Then parallelism in breadth, i.e. the arrangement of many PEs side by side, allows several data entries to be processed in parallel, increasing speedup almost linearly. Usually all PEs can contribute to the computation, but the memory bandwidth requirement also grows linearly with each PE arranged in breadth order. Graphics hardware utilizes mainly parallelism in breadth.

• Complex computations on independent data entries. In this case many computations have to be performed on each data entry and we have a high computational intensity. Parallelism in depth, i.e. the arrangement of PEs one after another in a pipeline, allows each PE to work on a small piece of the problem with the results of the preceding PEs.
The side by side arrangement would also allow processing in parallel, but a lot of intermediate results would have to be written to and read from memory for each step of the problem solver, thus unnecessarily increasing the memory bandwidth requirement. The pipeline, on the other hand, is very memory efficient, because ideally each datum has to be accessed only once. However, it is more difficult to utilize all of the available PEs for a given problem when they are arranged in depth order. If after building the pipeline some PEs are still available, but do not suffice to build a second one or to lower the common latency, it is not obvious what to use them for. If there are too few PEs to build the whole pipeline, then intermediate results must be transported to memory, but less often than in the side by side arrangement. Reconfigurable hardware often opts for processing in deep pipelines.

• Computations on dependent data entries. Here one tries to exploit either of the above strategies, looking for opportunities to compute some results independently of each other in a breadth arrangement and others in a pipeline arrangement. But if the data dependencies are changing constantly, then this approach requires a very fast dynamic interconnect between the PEs; otherwise the overhead associated with finding and configuring the parallelism will outweigh its advantage. Also, some codes are so inherently serial that there is nothing that can be parallelized, and thus no speedup can be gained from multiple PEs. Amdahl's Law specifies this observation: If s is the fraction of a calculation that is inherently serial, and thus cannot be parallelized, and 1 − s the fraction that is parallelizable, then the maximum possible speedup on p PEs is 1 / (s + (1 − s)/p), which is consequently limited by 1/s from above. Because of the danger of poor utilization of multiple PEs in serial codes, general purpose micro-processors contain only few parallel PEs. PE arrays with a fast dynamic interconnect offer more opportunities for parallel execution if an appropriate fast logic for the identification of the required parallelism exists [Mai et al., 2000; Sankaralingam et al., 2003]. We see that ideally the interconnect between the PEs is programmable, so that depending on the problem structure the appropriate arrangement can be configured. In DSB processors this reconfigurability is often available, but in ISB processors the different arrangements are usually hardwired, leading to a large variety of designs.
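Amdahl's Law is easy to evaluate numerically. The short Python sketch below is purely illustrative; it implements the speedup formula and shows how the speedup saturates at the 1/s bound no matter how many PEs are added:

```python
def amdahl_speedup(s, p):
    """Maximum speedup on p PEs when a fraction s of the work is
    inherently serial: 1 / (s + (1 - s)/p)."""
    return 1.0 / (s + (1.0 - s) / p)

# With s = 10% serial work, 10 PEs give only about a 5.3x speedup,
# and even a million PEs cannot exceed the 1/s = 10x bound:
print(amdahl_speedup(0.1, 10))      # ~5.26
print(amdahl_speedup(0.1, 10**6))   # ~10.0, i.e. close to 1/s
```

The rapid saturation is the reason given above why general purpose micro-processors contain only few parallel PEs.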

3.2.3 Instruction-Stream-Based Computing

Flynn's taxonomy [Flynn, 1972] classifies ISB architectures based on the number of streams of instructions and data:

• Single Instruction Single Data (SISD) - scalar. This is the classical design used in a simple processor with one instruction stream operating on singular data packets. We also speak of a scalar processor. In today's super-scalar processors the processor has several PEs and cares about the run-time scheduling, i.e. the optimal distribution of the instructions onto the PEs for parallel execution during run-time.

• Multiple Instruction Single Data (MISD). A theoretic architecture, which would apply multiple instruction streams to a single stream of data. This could be suitable for problems where each datum undergoes a lot of computations.

• Single Instruction Multiple Data (SIMD) - vector, VLIW. This is a very common approach to accelerate computations when large data blocks are processed, e.g. a vector processor applies operations to whole data vectors by a row of PEs. But also some common micro-processors contain SIMD PEs to exploit parallelism in breadth. The instructions in the stream can be either simple or fairly complex, consisting of Very Long Instruction Words (VLIWs). A VLIW contains sub-instructions specifying different operations for the parallel PEs. In contrast to the super-scalar processor, VLIW machines have to figure out the parallelism of instructions statically at compile-time, and so we speak of compile-time scheduling. The VLIW parallel execution is similar to the following MIMD model, but here we still have one instruction stream and the sub-instructions cannot be issued individually.

• Multiple Instruction Multiple Data (MIMD). Independent PEs can be programmed to perform different tasks on different data packets in parallel, if the computations are independent. MIMD usually applies to the processor level, where each processor executes its own instruction stream, rather than individual PEs in one processor receiving the independent streams. Notice that the individual processors often work with the SIMD model. Because there are several instruction streams, the task execution must be synchronized. This requires a network and message passing between the nodes. Within a node, which may contain several processors, other more direct communication is established.

The last class encompasses a huge variety of designs which differ in the number of nodes, their complexity and the network topology. Also, additional distinctions about the relation of the instruction streams apply, e.g. the most common programming mode on several processors is Single Program Multiple Data (SPMD). Naturally, a detailed classification of real systems is even more complex, and hybrid architectures also exist [Duncan, 1990]. Often the programming interfaces rather than the details of hardware determine which sorts of parallelism can really be exploited [Rolfe, 2002].

Virtually all ISB architectures suffer from a memory bottleneck. The problem lies in the fact that the instruction stream prescribes both the consequent flow of instructions and the flow of data operands required by the current instructions. If the instruction stream changes unpredictably, we get an unpredictable access pattern to the memory for the operands. But Section 3.1 on page 71 has made very clear that it is crucial for the memory system to know the requested addresses in advance if high sustained bandwidth is desired. ISB processors fight the uncertainty in data access with speculative processing and memory prefetch, improving performance for inherently serial code, but not exploiting the full potential of highly parallel code.
This seems appropriate for general purpose processors, which mainly process serial code; but most processing time in scientific computations is spent in loops over large data blocks performing the same operation over and over again, and here massively parallel data-stream-based (DSB) execution is advantageous.
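The contrast between scalar (SISD) and SIMD execution in the taxonomy above can be modeled in a few lines. The sketch below is a software analogy only, not a hardware description; the scale-and-bias operation, the function names and the block width of 4 are arbitrary illustrative choices:

```python
def scalar_scale_bias(data, a, b):
    """SISD: one instruction stream, one datum at a time."""
    out = []
    for x in data:              # each iteration handles a single operand
        out.append(a * x + b)
    return out

def simd_scale_bias(data, a, b, width=4):
    """SIMD sketch: one 'instruction' (the op) is applied to a whole block
    of `width` operands at once; the blocks model vector registers."""
    op = lambda block: [a * x + b for x in block]   # one instruction, many data
    out = []
    for i in range(0, len(data), width):
        out.extend(op(data[i:i + width]))
    return out

print(simd_scale_bias([1, 2, 3, 4, 5], 2, 1))  # [3, 5, 7, 9, 11]
```

Both functions compute the same result; the SIMD variant issues one operation per block of operands rather than one per operand, which is exactly the parallelism in breadth exploited by vector PEs.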

3.2.4 Data-Stream-Based Computing

DSB architectures are often classified with respect to their granularity, i.e. the size and operand width of the smallest programmable units, the processing elements (PEs), and the arrangement of them. We distinguish three major flavors:

• Reconfigurable Logic (RL) [Bondalapati and Prasanna, 2002]. FPGAs are the most important devices of the RL family and the only ones providing high logic capacities. We refer to [Brown and Rose, 1996] for an overview of the other RL devices. In the mid 1980s FPGAs began their success story, starting as small hardware simulation devices and leading to today's universally applicable multi-million-gate chips. The PEs of FPGAs are configurable n-input, 1-output lookup tables (LUTs), with a typical value of n = 4. By filling the entries of the LUT with values, the desired logical function can be configured. Several LUTs are grouped together into Configurable Logic Blocks (CLBs), which usually also contain a register to facilitate the synchronization of the data-flow. The CLBs are organized in an array and the space between them is used for the interconnect network, i.e. configurable switches between data lines which allow the desired connection between the inputs and outputs of the CLBs. FPGAs are fine granular because the routing and processing of individual bitlines can be configured. Consequently, the configware for the FPGA consists of a large bit stream which contains the values for the LUTs and the routing of the data lines.

The advantage of FPGAs, the free configurability on the bit level, becomes a disadvantage when many standard ALUs are needed, which would consume far fewer transistors if hardwired or configured on the arithmetic level. Therefore, for the most common operations many FPGAs include hardwired units, ranging from multipliers for different operand widths to whole processor cores. Additionally, more and more RAM is available on the chip for the caching of input values or intermediate results. As sometimes even the largest FPGAs cannot hold the whole configuration for the problem solver, the chips now usually offer run-time reconfigurability, i.e. intermediate results can be stored in memory, and after reconfiguration of the FPGA the new configuration continues the computation. Some even offer partial reconfigurability, which allows reconfiguring only parts of the FPGA while other parts retain their configuration or even continue execution; this helps to hide the latency of the configuration process.

• Reconfigurable Computing (RC) [Hartenstein, 2001]. RC refers to coarse-grain architectures with data line widths of 4 to 32 bit. The general arrangement is an array or line of tiles with an interconnect network similar to that of FPGAs, or crossbar switches. Simple tiles contain an ALU as a PE and may also have local registers and routing capabilities. In designs with complex tiles, whole processors with local memory and communication resources are arranged. Naturally, architectures with simple tiles have many more of them and a more extensive network than the arrays of few complex tiles. Therefore, besides the different functionality of the tiles, the interconnection network differs strongly between the designs. Usually there is some nearest neighbor and some mid to long distance routing, but the number of the routing hierarchies and their complexity varies. Also the distribution of and access to local memory is organized in different ways. Similar to FPGAs, RC architectures may contain local memory blocks and special computational units.

In any case, the coarser structure in comparison to FPGAs needs far less configware, so that not only concurrent partial reconfigurability but sometimes even dynamic reconfigurability is available in RC systems, i.e. individual PEs are reconfigured and immediately start performing the new operation. In some architectures one can also adaptively reduce power consumption by switching off inactive PEs.

If the predefined operations of the PEs are required by the problem solver, RC architectures have a better utilization of transistors than FPGAs, because the functionality is hardwired in parts. But if irregular bit manipulations require many ALUs, or some unforeseen other functionality must be awkwardly simulated, then the more flexible RL architectures have an advantage. But while RL architectures have become a mass market and, apart from special purpose FPGAs, their structure has become strongly standardized, RC architectures still come in a large variety of designs. So the development tools for RC systems often lag behind. Moreover, there are a lot of hybrid architectures also including hardwired parts or even whole processors, which require the integration of different programming models. Not seldom everything is even on the same die, and we speak of a Configurable System-on-a-Chip (CSoC) [Becker, 2002].

• Stream processors [Rixner, 2002]. Stream processors use multiple SIMD PEs to quickly operate on data streams. Internally the computations are not necessarily data driven but use, for example, VLIWs to trigger the operations. Nevertheless, the whole system is data-stream-based (DSB), because all components are focused on the generation and processing of data streams. A data sequencer, which may also be an ISB processor, generates the data streams which are fed to the PEs. In the Imagine stream processor [Kapasi et al., 2002, 2003] the instructions to the PEs are issued only once at the beginning of the kernel execution, like a configuration, and then repeated from a local instruction cache. Other approaches propose to have an instruction stream corresponding to the data streams [Ulmann and Hoffmann, 2002].

The Imagine project also balances the ratio of the different components, mainly memory and PEs, against their costs and power consumption [Khailany et al., 2003], avoiding the unbalanced situation in micro-processors where almost 50% of the die is consumed by caches. This leads to an architecture with a very high local throughput compared to the global bandwidth, most suitable for problems with high computational intensity. A memory hierarchy which allows data reuse at different scales, and programming models which encourage the programmer to expose these opportunities, seek to reduce the demand for the narrow global bandwidth in data intensive applications. Also, the balanced power consumption of the system lends itself to scaling such stream processing up to the level of supercomputers [Dally et al., 2003].

Just like RC devices are basically specialized FPGAs, stream processors are in some sense specialized RC machines. A similar or even identical stream processing model is namely often used in various RC devices [Mai et al., 2000; Sankaralingam et al., 2003; Taylor et al., 2002]. By reducing the reconfigurability and focusing on the stream processing model alone, stream processors save in particular on the interconnect. So for the price of reduced flexibility they offer higher transistor efficiency for appropriate DSB applications.
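The n-input, 1-output LUTs described for FPGAs above can be emulated in software: filling a table with 2^n entries "configures" the logic function. The sketch below is illustrative only; the choice of a 1-bit full adder and the helper names are mine, not from the thesis. Two 3-input LUTs are configured as the sum and carry outputs:

```python
def make_lut(truth_table):
    """An n-input, 1-output LUT is a 2^n entry table indexed by the input
    bits; filling the table 'configures' the logic function."""
    def lut(*bits):
        index = 0
        for b in bits:
            index = (index << 1) | b   # inputs select the table entry
        return truth_table[index]
    return lut

# Configure two 3-input LUTs (three of the typical four inputs suffice)
# as the sum and carry functions of a 1-bit full adder:
sum_lut   = make_lut([a ^ b ^ c             for a in (0, 1) for b in (0, 1) for c in (0, 1)])
carry_lut = make_lut([(a & b) | (c & (a ^ b)) for a in (0, 1) for b in (0, 1) for c in (0, 1)])

print(sum_lut(1, 1, 0), carry_lut(1, 1, 0))  # 0 1
```

Changing only the table contents turns the same "hardware" into any other 3-input logic function, which is the essence of the fine-granular configurability of RL.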

DSB devices usually do not have problems with latencies, because the data sequencer receives the requested memory addresses ahead of time and thus has enough time to retrieve the information and generate a continuous data stream. As DSB architectures often exploit the memory efficient parallelism in depth, their overall demand for data bandwidth is not necessarily higher than that of ISB processors. But the bandwidth may be too small if the memory hungry parallelism in breadth is required. For RL and RC chips this problem is usually solved by providing a lot of Input/Output (IO) pins, which allow access to several memory chips simultaneously. The maximal bandwidth to external memory on large DSB devices can thus be raised to several GB per second. In practice, however, the problem is still present, because wide buses between the memory and the chip are expensive, so that the full potential bandwidth is often not provided on the boards. Luckily, internal RAM blocks can often be used as caches tailored exactly to the needs of the application, such that the external bandwidth requirements are further reduced. In stream processors the local memory cannot be used with absolute flexibility, but the hierarchical memory structure is already arranged in such a way as to maximize data reuse on different levels.
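The memory efficient parallelism in depth that DSB devices exploit can be modeled with lazily chained stages: each datum is fetched from "memory" once and flows through all PEs without intermediate results being written back. The sketch below is a behavioral analogy only; the three image-processing style stages (scale, clamp, quantize) are arbitrary examples of my own:

```python
def stream_pipeline(source, stages):
    """Parallelism in depth, modeled with lazy maps: each datum is read
    once and streamed through all PEs (stages) in turn, with no
    intermediate buffers written back to memory."""
    stream = iter(source)
    for stage in stages:
        stream = map(stage, stream)   # chain the PEs into a pipeline
    return stream

stages = [lambda x: 2 * x,        # PE 1: scale
          lambda x: min(x, 10),   # PE 2: clamp
          lambda x: x // 2]       # PE 3: quantize
print(list(stream_pipeline([1, 3, 7, 9], stages)))  # [1, 3, 5, 5]
```

A breadth arrangement would instead run each stage over the whole data set before the next one starts, writing and re-reading the full intermediate array each time; the pipeline touches each input exactly once.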

3.2.5 Processor-in-Memory

We have repeatedly stressed the fact that overall system performance is hampered by the memory gap, the inability to transport data from the memory to the PEs sufficiently fast. A radical solution to this problem is to bring the PEs directly to the memory array. This close combination of DRAM with PEs in the semiconductor fabrication has become possible on a large scale only recently. While reconfigurable architectures and micro-processors also have embedded RAM, it is still clearly separated from the PEs. In Processor-in-Memory (PIM) architectures the coupling of the memory array with the PEs is much closer. Often a whole memory row can be processed in parallel by a collection of simple PEs, similar to the reading or writing of a memory row. The gains in bandwidth and latency performance are tremendous, and several ambitious projects focus on the development of the architectures and appropriate language tools to use them [Draper et al., 2002; Fraguela et al., 2003; Murphy and Kogge, 2000; Sterling and Zima, 2002]. The projects differ in the homogeneity of the systems, i.e. whether a large number of PIM elements represents all computational resources or only accelerates typical parallel tasks besides a host processor; and in architectural innovation, i.e. whether the conventional DRAM array is augmented with processing capabilities for an inexpensive and gradual introduction of the technology, or a completely new architecture with fine-grained interplay of PEs and memory blocks is envisaged. The design of completely new architectures offers so much freedom that the question whether an ISB or DSB methodology should be used for the PIM elements is not that clear. The major problems of insufficient memory bandwidth and high latencies are solved by construction, and the challenges lie in an efficient communication between the fine-grained processing structures.
Therefore, many of the PIM projects introduce new, memory friendly concepts of communication between the PEs, which use their own buses and run in parallel to the usual addressing of memory. But despite all the innovations, PIM architectures cannot make an irregular data-flow algorithm run fast. They rather push the idea of exploiting data coherency on the lowest level to the extreme, and algorithms which expose this parallelism to the architecture can be enormously accelerated. Although the utilization of the PIM concepts is fairly new and many questions about the optimal arrangement and communication of memory and PEs are still open, the sample implementations and performance gains are very promising, and commercial products will hopefully be available soon.
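The row-parallel processing mentioned above can be sketched as a behavioral model: simple PEs attached to a memory row apply the same operation to all words of the row in one step, instead of streaming every word individually through a distant processor. This is a conceptual illustration only, not a description of any particular PIM architecture:

```python
def pim_row_op(memory, op):
    """PIM sketch: PEs at the row buffer apply `op` to a whole memory row
    'in parallel'; the step count is the number of rows, not of words."""
    return [[op(word) for word in row]   # all words of one row in one step
            for row in memory]           # one step per row

dram = [[1, 2, 3, 4],
        [5, 6, 7, 8]]
print(pim_row_op(dram, lambda w: w + 1))  # [[2, 3, 4, 5], [6, 7, 8, 9]]
```

For an elementwise image operation this reduces the number of memory transactions from one per pixel to one per row, which is the source of the bandwidth gains claimed for PIM.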


3.3 Hardware Architectures

In the previous sections we have discussed the concepts behind memory design and different computing approaches. Now we turn to some concrete architectures and examine how they relate to these concepts in general, and their efficiency for image processing applications in particular. At the end we dare to peek into the large scale future of hardware architectures and data processing.

3.3.1 Status Quo

Since mass production greatly reduces unit prices and significant initial investments are needed to benefit from the latest semiconductor fabrication technology, one could expect the hardware market to be dominated by very few architectures. But on the contrary, it is quite diverse. The diversity stems from the applications' requirements. Even the best price-performance ratio becomes irrelevant if the corresponding device is not suitable for the envisaged application. Performance is of course an important factor, but flexibility, reliability and, in recent years most notably, power consumption often tip the scales. In practice the hardware must comply with all requirements to some extent, such that several distinct markets dedicated to specific weightings of the requirements can thrive simultaneously. So far we have always discussed hardware with a focus on large scale image processing, a task unattainable for mobile devices. Although we will keep this focus on PC based solutions, the reasoning about a regular data-flow with exploitation of data coherency also serves the purpose of power reduction, a key feature of embedded processors.

3.3.1.1 Micro-Processors

Micro-processors are ubiquitous as CPUs in our PCs. The performance of a single micro-processor, which has been increasing rather through higher frequency than through more per-clock-cycle functionality, is cheap in comparison to HPC machines. The exponential frequency increase requires ever longer pipelines in the processors to avoid dangerous hot spots. But the long pipelines together with the von Neumann computing paradigm have led to the problem of performance destructive jumps in the instruction stream. The processor's effort to optimize and parallelize the execution of the instruction stream using multiple PEs thus requires a prediction mechanism for the conditional jumps. Micro-processors have also dashed forward with their peak performance, such that the memory system cannot supply the required bandwidth for all the PEs, and the latency of a random memory access in terms of processor clock cycles grows rapidly. Large caches and data prefetch strategies alleviate the problems, but a considerable memory performance gap remains. The ISB computation worsens the problem insofar as conditional execution of the instruction stream prohibits the early availability of later required memory addresses. So the prefetching depends strongly on a successful jump prediction.


The optimization techniques mentioned above are performed automatically by the micro-processors. Software may only provide hints for their application. This makes the programming model fairly care-free, but also wastes potential in cases where an optimal control of the task and data distribution could be prescribed explicitly. Two other processor extensions accommodate the programmer with more explicit control. Firstly, SSE PEs allow the use of SIMD techniques to accelerate homogeneous processing of data streams. Secondly, Simultaneous Multi-Threading (SMT) allows a lot of the memory determined latencies and wait states to be hidden through the interleaved execution of different instruction threads. The former technique offers an almost explicit access to the benefits of parallel computing, but depending on other pending commands some parts of the processing pipeline may be stalled, disrupting the expected one clock cycle execution. In the case of SMT the actual interleaving strategy is totally up to the processor, which is probably right, because static analysis for multi-threading at the compiler level is very restricted. In general, one may say that micro-processors offer a lot of performance for their price, though their resource utilization is fairly bad compared to other architectures. The main advantage is the general purpose applicability paired with an easy programming model. The downside of this easiness is the very restricted influence on the efficient utilization of the processor's resources. The available optimization mechanisms require good knowledge of the internal processor structure and a lot of patience to be effective, because the actual internal processing model is very complex and not totally transparent to the programmer. This leads to the unfortunate situation that the ineffectiveness of some optimization attempts cannot be traced back to its cause.
Moreover, very different optimizations may be required by different micro-processors depending on their complex inner arrangement. Despite the problems in the optimization of instruction streams for micro-processors mentioned above, there are some very successful demonstrations of the potential power of the processors. In particular, the media and image processing applications we are interested in allow on average speedup factors of 1.5 to 5 against unoptimized code [Abel et al., 1999; Franchetti and Püschel, 2003; Ranganathan et al., 1999]. The performance gains are mainly based on the exploitation of SIMD capabilities and the implicit parallel use of other PEs. The design of the Itanium(2) [Intel, 2004a] processor follows this observation and exposes several parallel PEs more explicitly to the programmer. In many cases, the compiler can parallelize the code accordingly on its own. But there is a need for appropriate HLLs and a hardware awareness to exploit such parallelism more thoroughly, and if this is not achieved by the highly parallel reconfigurable systems, then hopefully the growing number of PEs in micro-processors will promote these languages to common use.

3.3.1.2 Parallel Computers

With parallel computers we refer here to systems which utilize a number of micro-processors for parallelization. Micro-processors alone are already difficult to optimize for, but for parallel architectures the problem becomes really hard. From Section 3.2.3 on page 92 we know that a parallel arrangement of computing units can occur at many levels. Together with the variety of the individual processors, interconnects and memory distributions, there is a huge number of different parallel systems. It is not possible to compile a single program to efficient code on all of them. In the Message Passing Interface (MPI) [MPI committee, 2004] standard, for example, the programmer still has to consider the performance of the individual nodes, memory bandwidths at various levels, and the topology of the network to gain optimal performance. So the performance of parallel algorithms depends, more severely than in the case of single micro-processors, on the programmer's knowledge about the specific system and the different parallel programming models [Leopold, 2000]. Nevertheless, parallel computers can have a good price-performance ratio if the programmers take into account the features of the system. Especially clusters made up of standard PCs have gained high popularity in recent years [Baker/Ed., 2000]. They already occupy 7 of the top 10 places and more than 40% of the current (Nov. 2003) TOP500 list of the fastest computers [TOP500 committee]. However, the LINPACK benchmark used for the ranking in the list reflects the performance for only one application scenario. Actually, it favors distributed memory systems, because the required communication is moderate. For some application classes clusters tend to be communication rather than computation bound. The site [TOP500 committee] also lists initiatives for a more application oriented benchmarking of HPC. Most image processing tasks pose only moderate requirements on the communication between the PEs and have therefore been the target of many successful parallel implementations since the early days of parallel computing. Michael Duff gives a thorough overview in 'Thirty Years of Parallel Image Processing' [Duff, 2000].
Although more and more previously unattainable tasks can be handled with a PC, there are always more demanding image analysis and visualization tools and ever larger data sets. The implementations in the next chapter demonstrate that many data intensive applications can be solved in reasonable time with a PC and an appropriate Peripheral Component Interconnect (PCI) card, but there is a limit to this single component solution, which cannot render parallel systems obsolete. One can naturally think of parallel systems made up of graphics cards [Sud et al., 2002] or reconfigurable hardware [Nallatech; Star Bridge Systems Inc] rather than micro-processors. These architectures are evolving, but the challenge lies not only in the design of a high performance architecture, but also in the development of a programming model which makes the whole processing power accessible.

3.3.1.3 DSP Processors

Digital Signal Processing (DSP) covers a wide range of audio/speech, image/video and sensor/control applications. Because of this diversity there has always been a number of Application Specific Standard Products (ASSPs) and dedicated Application Specific Integrated Circuits (ASICs) to serve the particular needs of applications. On the other hand, many DSP tasks have a lot in common even though the signals stem from different sources. On the low level even more similarities appear, like certain memory access patterns and arithmetic operations: multiply-accumulate, dot-product, scaling, biasing, saturation. DSP processors offer these capabilities together with SIMD PEs and an instruction set which has enough flexibility to implement different applications. Naturally, different combinations of the features emphasize different applications, and the number of available chips is enormous [Cravotta, 2003]. So even within a specific application area a choice among many DSP processors must be made. However, measuring performance among different chips is a challenge [Su et al., 2004]. In contrast to computing with a PC, embedded computing puts more emphasis on component integration, power consumption, reliability, predictable system behavior, programmability and costs rather than pure computing power. High-end general purpose micro-processors used to be faster than the strongest DSP processors [Bier, 2002], but they are fairly unsuitable for most embedded applications because of high power consumption, difficulties in integration, unpredictable dynamic behavior (speculation, caches) and, last but not least, costs. Moreover, latest benchmarks on current DSP processor families [Su et al., 2004; Williston et al., 2004] show that for many applications the highly parallel architecture of DSP processors can now deliver performance comparable to the much larger micro-processors. The disadvantage of DSP processors is their lesser suitability for the implementation of the user interface and general operating system duties. Therefore, often both a micro-processor and a DSP processor are used together. If the processing requirements are still higher, then even more additional chips in the form of coprocessors are needed. These may include reconfigurable hardware, ASSPs or ASICs. So DSP boards nowadays hold a variety of different chips, and while the tools for the individual components leave room for improvement, a tool for the simultaneous co-design of hardware and software in such a polymorphous situation is still a challenge.
Concerning image processing we can say that in many embedded systems DSP processors still dominate, but the superior performance of highly parallel reconfigurable arrays has been widely recognized and put into practice [Tessier and Burleson, 2001]. A lot of commercial boards now offer a combination of DSP processors and FPGAs [Bondalapati and Prasanna, 2002], while first products with RC chips also exist [Becker et al., 2003]. While some Reconfigurable Logic (RL) and Reconfigurable Computing (RC) architectures claim to replace DSP processors altogether, which seems likely in the long run concerning the performance/power consumption ratio, the large DSP processor market is unlikely to disappear at once, also because the reconfigurable hardware requires a different programming approach, which still receives little attention in computer science curricula [Hartenstein, 2003].

3.3.1.4 Reconfigurable Logic

We can characterize the computing model of ISB processors as temporal computing and that of morphware as spatial computing. In the first case the scheduling of the time-ordered instructions onto the PEs is important, in the second the spatial arrangement and interconnection of the PEs. Obviously the spatial model offers more room for parallelization [DeHon, 2002]. In an instruction stream we can use consecutive independent instructions for parallel execution, but this can be done economically only on a small scale. In a PE array only the amount of required communication between the PEs and the data bandwidth limit the parallelization. Massively parallel MIMD architectures form a kind of transition from one model to the other.
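The distinction can be illustrated with a toy model of ours (not from the cited works): the same two-instruction computation executed temporally, one datum at a time on a single PE, and spatially, with one stage per instruction through which the whole data stream flows.

```python
def temporal(xs):
    # temporal computing: one PE executes the time-ordered
    # instruction stream for each datum in turn
    out = []
    for x in xs:
        t = x * 2    # instruction 1
        t = t + 3    # instruction 2
        out.append(t)
    return out

# spatial computing: each stage is a separate PE; all items
# within a stage could be processed in parallel
stages = [lambda x: x * 2, lambda x: x + 3]

def spatial(xs):
    for stage in stages:
        xs = [stage(x) for x in xs]
    return xs
```

Both variants compute the same results; the spatial form exposes the per-stage parallelism that a PE array can exploit directly.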


Although the individual instruction streams still enforce temporal computing, the distribution of the tasks already requires spatial considerations. But in RL the spatial considerations go much further. Concerning a processor's performance we do not care how many transistors have been spent on an adder or multiplier. If we have several PEs in the processor, then evaluating ab − bc + d might even be faster than (a − c)b + d. But for RL the latter would definitely be the better choice because it involves only one multiplication and thus fewer resources. The free resources can then be used to implement another instance of this computation, provided we need to process a lot of data in this way. This means that area efficiency translates directly into parallelism and further into performance gains in RL.
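A minimal sketch of this area argument (our own illustration): both forms compute the same value, but the factored form occupies only one multiplier, so in RL the saved multiplier can host a second parallel instance of the computation.

```python
def two_mult(a, b, c, d):
    # direct form: 2 multipliers, 1 subtractor, 1 adder
    return a * b - b * c + d

def one_mult(a, b, c, d):
    # factored form: 1 multiplier, 1 subtractor, 1 adder
    return (a - c) * b + d
```

On an instruction processor the direct form may even run faster because the two multiplications are independent; in RL the factored form wins because the freed multiplier area becomes extra throughput.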

FPGAs are the dominant devices in the RL domain [Brown and Rose, 1996]. They have experienced a rapid growth both in size and volume. The mass production has also standardized the products, so that now there are many commercially available boards and development tools [Compton and Hauck, 2002]. Different needs are covered with single FPGAs, multi-chip boards and hybrid architectures. Usually some processor performs the configuration of the FPGA, which is responsible for the fast highly parallel data processing. FPGAs have gained so much popularity because the development tools provided with the morphware have also matured. Compilation from High Level Languages (HLLs) like C and its hardware specific variants to a loadable configuration can now be done automatically. Even the programming of multi-chip boards is automated by introducing virtualized morphware resources [Enzler et al., 2003]. However, for optimal performance one often has to interfere with the automatic processes and hand-code at least some parts of the design in a HDL.

In image processing and other DSP applications FPGAs regularly outperform DSP- and micro-processors by an order of magnitude or even more, and this despite a lower clock frequency [Cerro-Prada and James-Roxby, 1998; Hammes et al., 2001]. The low clock frequency even has the advantage that less power is consumed. An analysis shows that FPGAs benefit mainly from the dense spatial parallelism [DeHon, 2000]. More beneficial factors are related to the DSB computing paradigm: efficient use of the available IO by data streaming, application tailored data reuse, and the flexibility of parallelism in depth or in breadth depending on the available data bandwidth [Guo et al., 2004]. Despite the efficient use of IO bandwidth, the increased number of parallel PEs as compared to an ISB processor may also lead to a memory bottleneck in FPGAs. For high speedups in RL it is therefore crucial to have sufficient memory bandwidth, and many problems scale directly proportionally to this parameter [Benitez, 2003].
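The application-tailored data reuse can be sketched as follows (a toy model of ours, not code from the cited works): a vertical 3-tap stencil streamed over an image with two on-chip line buffers, so that every pixel is fetched from external memory exactly once even though each output needs three rows.

```python
def vertical_3tap(rows):
    """Stream image rows; keep only the two previous rows on chip."""
    buffers = []          # at most two 'line buffers'
    out = []
    for row in rows:      # each pixel of the input is read exactly once
        if len(buffers) == 2:
            out.append([a + b + c
                        for a, b, c in zip(buffers[0], buffers[1], row)])
        buffers = (buffers + [row])[-2:]
    return out
```

The on-chip storage grows only with the image width, not its height, which is the pattern that lets FPGAs sustain stencil computations at full streaming bandwidth.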

Beside the merits, there are also drawbacks to FPGAs. To allow the fine-grained configuration, many more transistors are needed than in a hardwired design. If most of the Configurable Logic Blocks (CLBs) are used for the implementation of standard Arithmetic and Logic Units (ALUs), then this constitutes a waste, and we would have been better off having the ALUs hardwired with programmability only at the function level. The area-consuming multipliers especially are therefore integrated as hardwired resources into new FPGAs, which partly alleviates this problem. The fine-grained architecture also requires a very fine interconnect on the chip, which consumes more area than the CLBs. If we assume that data is in general changed in bytes rather than bits, then a lot of logic for the interconnect can be saved. Additionally, the high level of programmability implies a large configuration file, and

thus a comparably long configuration process. Introducing caches and compressed configurations cannot eliminate this problem in principle. The large configurations make FPGAs less suitable for dynamic reconfigurability. Finally, despite cycle accurate simulation tools, debugging of FPGA configurations can be a real hassle given the enormous amount of configured lookup tables (LUTs). By sacrificing the bit-level configurability we can diminish the scope of these problems, but still retain the advantageous massive parallelism and memory friendly DSB processing. This brings us to Reconfigurable Computing.

3.3.1.5 Reconfigurable Computing

In our rough distinction of RL and RC architectures we let RC start with a word length of 4 bit. Here, word refers to the smallest data packet which can be processed individually. Based on an empirical rule, one can derive that logic density, i.e. the fraction of area used for the PEs, is maximal for word lengths of 4-5 bit [Stansfield, 2002]. DSP applications often use even wider words and therefore many RC architectures use words of 16 or even 32 bit. If the architectural and the application word lengths are similar, then unnecessary routing is minimized. But the word length is only one parameter to choose when designing RC machines. Many more options like the complexity of the PEs, homogeneous or heterogeneous PEs, the structure of the interconnect (local, global, hierarchical), and the positioning of local RAM elements produce a large variety of designs. An overview of academic RC architectures can be found in [Hartenstein, 2001], while commercial vendors include, among others, [Elixent; IPflex; PACT; picoChip; QuickSilver]. Computing machines based on fine grain RL are covered in [Bondalapati and Prasanna, 2002; Compton and Hauck, 2002].
For image processing, where bit level operations are seldom required, but also for many other DSP applications, RC machines tend to be more suitable than the RL devices [Sueyoshi and Iida, 2002]. They form a happy medium between the fine grain configurability of RL and the high level programmability of DSP processors. The difficulties concerning RC are associated less with the hardware than with the unfamiliarity of many DSP developers with DSB programming in the form of configware and flowware, and the immature status of development tools. The situation is not fully comparable to the early days of FPGAs, because they could build upon the design tools used for ASICs.
These tools use Hardware Description Languages (HDLs) to describe the configuration, which are also hardly accessible to DSP developers, but meanwhile both the low level and the high level design tools have seen tremendous development, allowing access from HLLs to the processing power of FPGAs. This was surely a crucial step in the migration process of many DSP applications onto FPGAs. However, even these tools do not match the maturity of development suites for DSP processors, and therefore co-design with other hardware components is very common. Similarly, RC devices are intended for use as coprocessors alongside a DSP processor, either on a common board or in a Configurable System-on-a-Chip (CSoC), and the availability of advanced programming tools for such compositions will be crucial to the success of the new devices. The RC implementations in the next chapter make use of the preliminary simulation and visualization tool of PACT [PACT]. Programming on RC architectures has the tremendous advantage over RL that after placing

and routing of the algorithm, one can still identify the individual components in the configuration, such that debugging is a lot easier. With a visualization tool one can watch the cycle-exact simulation and find the clock cycle and ALU where the computation goes wrong. What has been said about the possible memory bottleneck in RL machines applies equally to RC. The RC devices can make very efficient use of the available bandwidth by data reuse and a stream computing model, but the chip should have several IO channels connected to memory banks to provide sufficient base bandwidth for the massive parallelism. We recall that an ambitious solution to the memory gap problem is the integration of both the memory and the PEs closely coupled on the same die (Section 3.2.5 on page 96). The stream processors also offer another way to reduce memory workload (Section 3.2.4 on page 93). Although RC devices, stream processors and PIM chips are all still targets of ongoing intensive research, the question arises how they compare in terms of performance on different applications. In [Suh et al., 2003] we find a performance analysis of three representatives on memory intensive DSP kernels. Not surprisingly, PIM performs well on particularly memory intensive kernels, where the large internal memory bandwidth comes into play. The tested stream processor has an architectural preference towards more computationally intensive tasks, where it gains the best score, but memory use in the other tests was also optimized as compared to a micro-processor. The RC array of PEs, which consists of small processors with local cache in this case, performs best on average. These results stem from the high flexibility which allows choosing the appropriate programming and communication model for each application.
The high flexibility of RC arrays has also been demonstrated by simulating a dedicated stream processor and a speculative multi-processor on the Smart Memory architecture [Mai et al., 2000], where the incurred performance loss could mainly be attributed to a greater number of local PEs or faster memory in the original architectures.

3.3.2 No Exponential is Forever

'No Exponential is Forever ... but We Can Delay "Forever"' was the fitting title of the opening keynote on the past, present and future of integrated circuits by Gordon Moore at the 50th International Solid State Circuits Conference 2003 [Moore, 2003]. Predictions in an exponentially developing environment are very difficult and many technological visions of the future have proven wrong. Accordingly, Moore's Law had been pronounced dead several times already, and Gordon Moore himself admits that some of his extrapolations into the future did not come true. But the general exponential process has continued, fed by an unending stream of new ideas which over and over again have overcome technological barriers standing in the way of the ongoing progress.
On the other hand, there is definitely an end to transistor miniaturization, at the latest at the atomic level. We may come to a stop well before that because of the also exponentially growing fabrication costs and the power leakage problem. Currently, the train is still on track, and the International Technology Roadmap for Semiconductors (ITRS) 2003 [SEMATECH, 2003] indicates that though the rate of the exponential growth may have slowed down to a transistor count doubling every 3 years, the exponential growth will most likely continue to at least the end

of this decade. Beyond that, major innovations in the design and fabrication of electronic integrated circuits will be required to uphold Moore's Law, and these innovations have not only to be technically feasible but also economically viable. The worldwide initiatives for Extreme Ultra-Violet (EUV) lithography [Gwyn and Silverman, 2003], paving the way for 40nm feature sizes and below, give an idea of the immense expenses associated with the intended adherence to the exponential growth. Finally, around 2020 we will probably be so close to the fundamental limit of a charge-based architecture that other state variable representations will have to be considered. In contrast to previous ITRS reports, the variety of new, not charge-based concepts has increased, but this makes it even harder to look into their future and to judge which or whether any of them have the potential to delay the end of the exponential for some more years.
To a small extent we have already touched future design concepts with the array of PEs and local interconnects in reconfigurable architectures, since in nano-scale devices (10¹² elements/cm²) a regular and homogeneous array arrangement will probably be the only one accessible to economic fabrication. In this respect nano-scale devices would also favor image processing applications, which exhibit local neighbor communication and a regular data-flow (Section 2.6 on page 63). However, other large scale, defect- and fault-tolerant programming models will be needed than the demanding but comparably simple top-down mappings used for the implementations on reconfigurable hardware in the next chapter.


3.4 Conclusions

The memory gap has become the main obstacle to fast processing of large data amounts. Almost all innovations in hardware architectures seek to overcome or at least alleviate the effects of insufficient memory performance. The high latencies of the memory chips can be traced back to the memory core structure (Section 3.1.1.1), which is unlikely to change in the near future. Instead, techniques from the memory chip over the memory system to the processor design level have been developed, which help to hide the latencies. Ideally, one obtains a seamless stream of data after an initial latency, and thus a maximal sustained bandwidth (Section 3.1.1.2). But the latencies can be hidden only when the requested addresses are known long before the data is needed. So the algorithm must exhibit either a regular or a highly predictable data-flow. Even when latencies are hidden, the bandwidth of memory chips cannot keep pace with the throughput of the processing elements (PEs). Only at the system level could an exponential increase of bandwidth be achieved (Section 3.1.2), but this still does not satisfy the requirements of the PEs, and thus a whole hierarchy of differently fast memories aimed at frequent data reuse has emerged (Section 3.1.3).
The instruction-stream-based (ISB) computing paradigm does not attach sufficient importance to the mismatch of computing and memory performance. Large caches and multiple data processing improve the situation, but only scratch the surface of the potential for parallelization (Section 3.2.3). Distributed memory systems do a better job in this respect, as they achieve a higher level of parallelism in the spatial domain (Section 3.2.2). Architectures based on data-stream-based (DSB) processing go even further by exploiting parallelization in breadth and depth already at the finest level (Section 3.2.4). High numbers of independent PEs provide an enormous throughput, amplified by the flexible use of pipelining (Section 3.2.1).
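A back-of-the-envelope model of this latency hiding (formulas and numbers are ours, purely illustrative): with a regular, predictable data-flow the memory latency is paid once, while chains of dependent accesses pay it on every request.

```python
def streamed_cycles(n, latency, cycles_per_item):
    # predictable addresses: one initial latency, then full bandwidth
    return latency + n * cycles_per_item

def dependent_cycles(n, latency, cycles_per_item):
    # each access must wait for the previous one: latency paid n times
    return n * (latency + cycles_per_item)
```

For 1000 items, a latency of 100 cycles and one cycle per item, the streamed access needs 1100 cycles versus 101000 for the dependent accesses, which is why regular data-flow is so decisive for memory intensive algorithms.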
The direct coupling of PEs and memory cells in Processor-in-Memory (PIM) designs seeks to overcome the bandwidth shortage in principle (Section 3.2.5).
Considering contemporary architectures, micro- and DSP-processors still hold a dominant position, but the superior performance of DSB architectures in data intensive applications gains increasingly more attention (Section 3.3.1). But the acceptance of a technology depends strongly on the availability of comfortable tools to use it. Field Programmable Gate Arrays (FPGAs) have matured farthest in this respect. The demand for performance also drives the research on development tools for Reconfigurable Computing (RC), PIM and stream processors. The adoption of these new technologies is still slow, since the immense parallelism they offer asks for different programming models and ultimately new High Level Languages (HLLs). But in the long term this is a small price to pay for their ability to fight the memory gap while offering high scalability. They even hint at the future processing on nano-scale devices, where massive parallelism will be the predominant feature (Section 3.3.2). So the architectures used for the image processing problems in the next chapter subscribe exclusively to the DSB model.


4 Hardware Efficient Implementations

Contents

4.1 Graphics Hardware ...... 110
4.1.1 Technology ...... 110
4.1.1.1 Development ...... 110
4.1.1.2 Graphics Pipeline ...... 112
4.1.1.3 Classification ...... 116
4.1.2 Computations ...... 119
4.1.2.1 Data-Flow ...... 119
4.1.2.2 Number Formats ...... 121
4.1.2.3 Operations ...... 124
4.1.3 Level-Set Segmentation ...... 127
4.1.3.1 Implementation ...... 128
4.1.3.2 Results ...... 130
4.1.4 Anisotropic Diffusion ...... 132
4.1.4.1 Implementation ...... 133
4.1.4.2 Results ...... 136
4.1.5 Gradient Flow Registration ...... 143
4.1.5.1 Implementation ...... 143
4.1.5.2 Results ...... 147
4.2 Reconfigurable Logic ...... 156
4.2.1 FPGA Card ...... 156
4.2.1.1 Technology ...... 156
4.2.1.2 Programming ...... 158
4.2.2 Computations ...... 159
4.2.2.1 Data-Flow ...... 159
4.2.2.2 Operations ...... 159
4.2.3 Level-Set Segmentation ...... 161
4.2.3.1 Implementation ...... 161
4.2.3.2 Results ...... 163


4.3 Reconfigurable Computing ...... 165
4.3.1 eXtreme Processing Platform ...... 166
4.3.1.1 Technology ...... 166
4.3.1.2 Programming ...... 170
4.3.2 Computations ...... 172
4.3.2.1 Data-Flow ...... 173
4.3.2.2 Operations ...... 175
4.3.3 Non-Linear Diffusion ...... 175
4.3.3.1 Implementation ...... 176
4.3.3.2 Results ...... 178
4.4 Comparison of Architectures ...... 180
4.4.1 Instruction- and Data-Stream-Based Architectures ...... 180
4.4.2 Graphics Hardware and Reconfigurable Hardware ...... 181
4.4.3 Efficiency and Abstraction ...... 184

Figures

4.1 A simplified diagram of the graphics pipeline. ...... 112
4.2 Segmentation of a human brain computed in DX7 graphics hardware. ...... 131
4.3 The visible effect of the Euclidean norm approximation during segmentation. ...... 132
4.4 Parallel segmentation of fence pickets in DX6 graphics hardware. ...... 133
4.5 Approximation of the Perona-Malik function by linear functions. ...... 136
4.6 Linear and non-linear diffusion on a graphics workstation. ...... 138
4.7 Non-linear diffusion solvers in graphics hardware and software. ...... 139
4.8 Mass defect and mass exact non-linear diffusion in graphics hardware. ...... 140
4.9 Anisotropic diffusion models in DX7 and DX8 graphics hardware. ...... 141
4.10 Anisotropic diffusion with the virtual signed 16 bit format. ...... 141
4.11 Comparison of the diffusion schemes in 8 bit and 16 bit. ...... 142
4.12 The multi-grid hierarchy encoded in textures of different spatial resolution. ...... 144
4.13 Elimination of low frequency distortions. ...... 149
4.14 Elimination of high frequency distortions. ...... 150
4.15 Registration of a rotated image. ...... 151
4.16 Energy decrease in registering the images from Figure 4.15. ...... 152
4.17 Registration of a large scale rigid deformation. ...... 152
4.18 Elimination of a possible acquisition artefact for a medical image. ...... 153

4.19 Registration of two brain slices of the same patient taken at different times. ...... 154
4.20 The enlarged central part of the error images from Figure 4.19. ...... 155
4.21 Energy decrease in registering the images from Figure 4.19. ...... 155
4.22 Layout of a FPGA architecture (Xilinx). ...... 157
4.23 A schematic view of our FPGA card and its system partitioning. ...... 157
4.24 The caching strategy for the level-set solver in Reconfigurable Logic. ...... 162
4.25 The pipeline arrangement for the level-set solver in Reconfigurable Logic. ...... 162
4.26 Final layout of the level-set solver in the FPGA. ...... 164
4.27 Segmentation result computed in a FPGA. ...... 164
4.28 Segmentation of tumors computed in a FPGA. ...... 165
4.29 Diagram of the XPP architecture. ...... 167
4.30 Processing Array Elements of the XPP. ...... 167
4.31 Visual representation of the XPP configuration from Listing 4.3. ...... 170
4.32 Implementation of a general 3x3 filter in the XPP. ...... 174
4.33 Configurations for the non-linear diffusion solver in the XPP architecture. ...... 177
4.34 Results of the non-linear diffusion as configured on the XPP architecture. ...... 179

Tables

4.1 Comparison of fixed point number formats in graphics hardware. ...... 122
4.2 Operations on encoded values for an affine mapping of number ranges. ...... 123
4.3 Setup and precision of floating point formats supported in graphics hardware. ...... 124
4.4 Area consumption and latency of some operations in Reconfigurable Logic. ...... 160

This chapter presents efficient implementations of quantized PDE based image processing tasks on data-stream-based (DSB) hardware architectures. The implementations achieve hardware efficiency by exploiting the performance characteristics and respecting the functional restrictions of the hardware platforms.
For three different platforms, the utilization for PDE solvers is described and efficient implementations are presented. The focus lies on image processing with graphics hardware, covering the whole range from simple standard OpenGL functionality back in 1999 to the high level programs from 2003, which eliminated many of the technical constraints present in previous years. The implementations in fine and coarse grain reconfigurable hardware reside even closer to the direct hardware programming level, using either a Hardware Description Language (HDL) or a Native Mapping Language (NML) to configure the appropriate hardware state. This approach offers additional optimization opportunities which result in utmost performance of the algorithms, and the range of the applications is only restricted by the size of the available hardware devices. At the end we compare the pros and cons of the different implementations, where not only performance but also the flexibility and programmability of the architectures play an important role.


4.1 Graphics Hardware

Graphics hardware has undergone a rapid development over the last 10 years. Starting as a primitive drawing device it is now a major computing resource. Together with the increasing functionality more and more general problems have been mapped to the graphics architecture, see [GPGPU] for an elaborate overview. We discuss the implementations of the PDE based image processing models from Chapter 2. But first we outline the technological development and logic layout of graphics hardware, and its usage for scientific computations. Parts of this exposition have been published in [Rumpf and Strzodka, 2005].

4.1.1 Technology

The rapid development of graphics hardware has introduced a lot of features and keywords. The following sections sketch this evolution. Omitting some details, we outline the common structures and the key innovations of the different generations.

4.1.1.1 Development

Up to the early 1990s standard graphics cards were fairly unimpressive devices from a computational point of view, although having 16 colors in a 640x350 display (EGA) as opposed to 4 colors in a 320x200 display (CGA) did make a big difference. Initially the cards were only responsible for the display of a pixel array prepared by the CPU of the computer. The first available effects included the fast changing of color tables, which enabled color animations and the apparent blending of images. Then the cards started to understand 2D drawing commands and some offered additional features like video frame grabbing or multi-display support. The revolutionary performance increase of graphics cards started in the mid 1990s with the availability of graphics accelerators for 3D games. The already well established game market welcomed this additional processing power with open arms and soon no graphics card would sell without 3D acceleration features. Since then the GPU has taken over more and more computational tasks from the CPU. The performance of GPUs has grown much faster than that of micro-processors, doubling performance approximately every 9 months, which can be referred to as Moore's Law squared.
During the late 1990s the number of GPU manufacturers boiled down to very few, at least for PC graphics cards. Although other companies try to gain or regain ground in the market, NVIDIA and ATI have been clearly dominating it both in performance and market shares for several years now. In the following discussions we therefore primarily cite their products. Concerning the market, we should mention that actually Intel is the largest producer of graphics chips in the form of integrated chip-sets. But these are inexpensive products and rank low

on the performance scale. Consequently, we will deal only with GPUs on dedicated graphics cards.

Together with the reduction of GPU designers, the number of different Application Programming Interfaces (APIs) to access their functionality also decreased. The OpenGL API [Ope] and the DirectX API [Mic] are the survivors. The API guarantees that despite the different hardware internals of GPUs from different companies, the programmer can access a common set of operations through the same software interface, namely the API. The proprietary graphics driver is responsible for translating the API calls into the proprietary commands understood by the specific GPU. In this respect the API is similar to an operating system, which also abstracts the underlying hardware for the programmer and offers standardized access to its functionality, although an operating system does more than that.

If the hardware offers new features and is downward compatible, the old API still functions, but it lacks the new functionality. The use of the new features in the new API, however, results in an incompatibility with older hardware. Therefore, programmers are reluctant to use new features as long as they expect a significant demand for their applications on older hardware. The hardware vendor can promote the use of the new API by emulating the new hardware features in software on older systems. But this may turn out to be very demanding or impractical if the software emulation is too slow. So in practice programmers voluntarily stick to very low hardware requirements and do not bother about incompatibility issues. Only the time critical parts of the code are sometimes implemented for each hardware standard separately and chosen dynamically upon hardware identification. The above applies both to programs for different versions of an operating system and to programs (mainly games) for different versions of graphics APIs. However, graphics hardware has evolved much more quickly and game performance is often a critical factor, such that the changes of API versions and the lowest common requirements move faster than in the micro-processor market.
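The dynamic selection of a code path upon hardware identification can be sketched like this (all names are hypothetical and of our own invention; real code would query the API, e.g. the OpenGL extension string, instead of receiving a capability set):

```python
def blur_fixed_function(img):    # baseline path, runs on any hardware
    return ("fixed", img)

def blur_multitexture(img):      # faster path for multi-texturing hardware
    return ("multitex", img)

def blur_fragment_program(img):  # fastest path, newest hardware only
    return ("fragprog", img)

def select_blur(capabilities):
    """Pick the best implementation the detected hardware supports."""
    if "fragment_program" in capabilities:
        return blur_fragment_program
    if "multitexture" in capabilities:
        return blur_multitexture
    return blur_fixed_function
```

The dispatch runs once at startup, so the per-frame code always calls the single function chosen for the present hardware.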

OpenGL and DirectX have incorporated the quickly evolving feature set of GPUs differently. OpenGL uses a very flexible extension system. Each vendor can expose the whole functionality of its hardware product by proprietary extensions to the API. The OpenGL Architecture Review Board (ARB) [Ope], which includes the main players in the graphics field, helps in the standardization of these extensions to prevent the undermining of the common interface idea through too many incompatible proprietary extensions. In practice, first the proprietary extensions appear and then the standard access points evolve over time. The different versions of DirectX, on the other hand, are prescribed by Microsoft and thus simply define a fixed set of requirements. Naturally, these requirements are discussed with the GPU designers beforehand. If the hardware supersedes them quantitatively, then DirectX often allows the use of these additional resources, but qualitatively new features have to wait for the next API generation. So the DirectX API changes more or less in step with the new graphics hardware generations, while OpenGL evolves continuously, first on proprietary and subsequently on ARB paths. Currently (2004), OpenGL is also undergoing its first major revision since 1992, from the 1.x versions to version 2.0 [Ope], in an attempt to include many of the already well established extensions into the core and prepare the API for future developments.


Figure 4.1 A simplified diagram of the graphics pipeline. Light gray represents data containers, dark gray processing units. The emphasized Vertex Processor (VP) and Fragment Processor (FP) are the units which evolved most in the graphics pipeline over the years, up to the stage where they accept freely programmable shader programs as configurations. Since the widespread support for one pass multi-texturing (1998), the thick arrow from the textures to the FP represents the largest data streams in the pipeline. Accordingly, the FP consumes the majority of resources in a GPU. The access to textures from the VP is a recent feature (2004), as is the upcoming full interchangeability of the data containers in the pipeline, i.e. a 2D data array can serve as an array of vertex data, a texture, or a destination buffer within the frame-buffer.

[Diagram of Figure 4.1: vertex data → Vertex Processor → vertex tests → Primitive Assembly → Rasterizer → Fragment Processor → Fragment Tests → Fragment Blending → frame-buffer, with the textures feeding the Fragment Processor.]

4.1.1.2 Graphics Pipeline

The Graphics Processor Unit (GPU), the central computational chip on a graphics card, may be seen as a restricted form of a stream processor (cf. Section 3.2.4 on page 93). With a number of commands one configures a state of the graphics pipeline in the GPU and then sends data streams through that pipeline. The output stream is visualized on the screen or sent again through the pipeline after a possible reconfiguration. Although graphics cards have not been seen in this context from the beginning, the current developments show a clear tendency towards a general stream processor. Therefore, we want to uphold this view when discussing the different GPU generations in the next section and subsequently. A schematic view of the graphics pipeline is presented in Figure 4.1. The abstraction omits some details but encompasses the whole evolution of the pipeline in a common diagram. The

changes from one graphics hardware generation to another can be identified by describing the increased flexibility and functionality of the components in the pipeline. For the sake of a consistent presentation, we use the current terminology for the components, even if it was not in use for the older pipeline generations back then. First let us describe the operational task of the individual components:

• vertex data: We need an array which defines the geometry of the objects to be rendered. Beside the vertex coordinates the vertex data may also contain color, normal and texture coordinate information (and a few more parameters). Although the data may be specified with 1 to 4 components, both coordinates (XYZW) and colors (RGBA) are internally always processed as 4-vectors. During the graphics hardware evolution mainly the choices for the places where the vertex data can be stored (cacheable, AGP or video memory) and the efficiency of handling that data increased. The modern Vertex Buffer Objects (VBOs) allow one to specify the intended use and let the graphics driver decide which type of memory is ideally suited for the given purpose.

• Vertex Processor (VP): The VP manipulates the data associated with each vertex individually. Over the years the number of possible operations has increased dramatically. In the beginning only multiplications with predefined matrices could be performed. Nowadays, the VP runs assembly programs on the vertex data, and the newest generation (May 2004) already has restricted texture access from the VP. But each vertex is still processed individually, without any implicit knowledge about the preceding or succeeding vertices, although this may also change in the future.

• vertex tests: Vertex tests determine the further processing of geometric primitives on the vertex level. They mainly include back-face culling, which eliminates polygons facing backwards (if the object is opaque one cannot see its back), and clipping, which determines the visible 3D space as an intersection of several 3D half spaces defined by clipping planes. The vertex tests are still controlled by parameters and have experienced only quantitative improvements in the number of clipping planes over time.

• primitive assembly, rasterizer: The geometric primitives which can be rendered are points, line segments, triangles, quads and polygons. Each vertex is processed individually, and the clipping of primitives may introduce new vertices, so primitives have to be reassembled before rasterization. Also, for simplicity the rasterizer in many graphics architectures operates exclusively on triangles, so other primitives must be converted into a set of triangles before processing. Given a triangle and the vertex data associated with each of its vertices, the rasterizer interpolates the data for all the pixels inside the triangle. The resulting data associated with a pixel position is called a fragment. From the beginning the rasterization could be controlled with parameters, for example defining patterns for lines and the interior.
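The interpolation step of the rasterizer can be sketched in a few lines. The following Python sketch (the function names are our own illustration, not part of any graphics API) computes barycentric weights of a pixel inside a triangle and uses them to interpolate the per-vertex attributes into a fragment:

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of point p with respect to triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    l1 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    l2 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return l1, l2, 1.0 - l1 - l2

def interpolate_fragment(p, verts, attribs):
    """Interpolate per-vertex attributes (e.g. an RGBA color) at pixel p."""
    l1, l2, l3 = barycentric(p, *verts)
    # each fragment component is a weighted sum of the three vertex values
    return tuple(l1 * u + l2 * v + l3 * w for u, v, w in zip(*attribs))
```

At a vertex the fragment reproduces that vertex's attributes; along an edge it is a linear blend of the two endpoint values.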


• textures: Textures are user defined 1D to 4D (typically 2D) data arrangements stored in the video memory of the graphics card. Their elements, which can have up to four components (RGBA), are called texels. In general the dimensions of all textures must be powers of 2, with the exception of rectangular 2D textures since 2001. Only recently (2004) has a general extension for non-power-of-2 textures been introduced. Input images of a problem are usually represented as textures on the graphics card and their values are processed by the FP and fragment blending. Over the years, quantitative improvements of textures included their maximal number, their maximal size and the precision of the used fixed point number format. Qualitative improvements are the support of various dimensionalities, the different access modes, the floating point number format, and the flexibility in the creation and reuse of texture data in different contexts. From the modern point of view, textures represent just a special use of data arrays which can serve as input to the FP (texture mode), as the destination for the output stream of the graphics pipeline (output mode), or even as an array defining vertex data (vertex mode).

• Fragment Processor (FP): The FP manipulates the individual fragments. Similar to the vertices, each fragment is processed independently of the others in the same data stream. With the interpolated texture coordinates, the FP can access additional data from textures. The functionality of the FP has improved enormously over the years. In a qualitative sense, the range of available access modes for texture data and of operations on these values has grown rapidly, culminating in an FP controlled by assembly or high level code with access to arbitrary texture positions and a rich set of mathematical and control operations. In a quantitative sense, the number of accessible textures and the number of admissible fragment operations have increased significantly.

• frame-buffer: The frame-buffer is the 2D destination of the output data stream. It contains different buffers of the same dimensions for the color, depth and stencil (and accumulation) values. Not all buffers need to be present at once. Also, each buffer allows certain data formats, but some combinations may not be available. There exists at least one color buffer, but typically we have a front buffer, which contains the scene displayed on the screen, and a back buffer, where the scene is built up. Over the years mainly the maximal size, the number and the precision of the buffers increased. A recent development (2004), already sketched in the texture item, regards the frame-buffer as an abstract frame for a collection of equally sized 2D data arrays. After the rendering, the same 2D data arrays may be used as textures or vertex data.

• fragment tests: Equivalent to the vertex tests for vertices, the fragment tests determine whether the current fragment should be processed further or discarded. But the fragment tests are more numerous and powerful, and some of them allow a comparison against the values


stored at the associated pixel position of the fragment in the depth or stencil buffer, and also a restricted manipulation of these values depending on the outcome of the tests. This direct manipulation is a particular benefit, as the FP does not offer it.

• fragment blending: Before the FP became a powerful computational resource, computations were mainly performed by different blending modes. The blending operation combines the color value of the fragment with the color value in the color-buffer, controlled by weighting factors and the blending mode, e.g. a convex combination of the values with a certain weight. Blending became less popular in recent years because on most GPUs it did not support the higher precision number formats as the much more powerful FP did. But currently the higher precision blending support is increasing again. The advantage of blending is the direct access to the destination value in the frame-buffer, which is officially not supported for the FP. The blending modes are continuous functions of the input values; additionally, logical operations can be performed at the end of the pipeline, but these are very seldom used, as they have received no hardware support from the GPU builders.

Figure 4.1 on page 112 visualizes the streaming nature of GPUs and the parallelism in depth innate to the pipelining concept. Because of the independent processing of data elements in a data stream, GPUs also exercise parallelism in breadth (Section 3.2.2 on page 87). The dual programming model of data-stream-based (DSB) architectures (Table 3.2 on page 85) applies. The configware consists of a large number of parameters which define the behavior of the various stages of the pipeline, and more recently also assembly or even High Level Language (HLL) programs which configure the VP and FP. The flowware is given by function calls to the graphics API, which are embedded in a software program for the CPU of the PC in which the graphics card resides.
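The convex-combination blending mode mentioned above can be modeled on 8-bit color values as follows; this is a plain sketch of the arithmetic, not actual graphics API code:

```python
def blend_convex(src, dst, alpha):
    """GL-style convex combination: out = alpha*src + (1 - alpha)*dst,
    applied per channel to 8-bit fixed point color values in 0..255."""
    return tuple(round(alpha * s + (1.0 - alpha) * d) for s, d in zip(src, dst))
```

For example, blending pure red over pure blue with weight 0.5 yields a half-intensity purple; the rounding step models the quantization back to the 8-bit frame-buffer format.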
For the efficiency of this approach it is crucial that the same configuration is applied to large data streams. Then the addresses of the required data from the textures form a regular pattern, and by using catenated burst mode reads a seamless stream of data can be obtained which hides all intermediate memory access latencies (Table 3.1 on page 74 without tCL, Figure 3.3 on page 76). Graphics cards also quickly adopted the bandwidth doubling Double Data Rate (DDR) memory and the efficient bank interleaving access for maximal sustained bandwidth (Section 3.1.1.2 on page 76). For the same purpose they use memory chips of high memory depth (currently 32 bit, Section 3.1.1 on page 71), which line up to very wide data buses (currently 256 bit, Section 3.1.2 on page 79). But for the overall performance not only the bandwidth of the memory but also the throughput of the processing elements (PEs) is decisive (Section 3.2.1 on page 84). In contrast to cache equipped instruction-stream-based (ISB) architectures, GPUs used to have an almost equally balanced bandwidth and throughput, which is most suitable for problems with low computational intensity. The latest (2004) hardware generation is weighted towards the computational side, because the higher programmability of the pipeline is increasingly used to solve more computationally intensive tasks. For regular access patterns to the textures the efficiency and synchronization of the graphics

pipeline is so high that the achievable performance for a simple pipeline configuration almost matches the theoretical peak bandwidth of the memory system. Since 2001, dependent and offset texture reads destroy the static nature of texture accesses. But despite dynamically computed addresses for texture access, the efficiency of the DSB paradigm does not break down. Especially if the computed addresses as a whole represent a smooth deformation of the texture source, e.g. a translation, shear or magnification, then small caches suffice to hide the additional latencies.
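The notion of computational intensity used above can be made concrete with a back-of-the-envelope estimate. The figures below are made-up illustration values, not measurements of any particular GPU:

```python
def computational_intensity(ops_per_item, bytes_per_item):
    """Arithmetic operations per byte of memory traffic for one data item."""
    return ops_per_item / bytes_per_item

# Hypothetical fragment workload: a 5-point stencil reads five RGBA8 texels
# (5 * 4 bytes) and writes one (4 bytes), performing 9 operations per channel.
ci = computational_intensity(9 * 4, 5 * 4 + 4)   # = 1.5 ops per byte
```

Comparing this value with the machine balance (peak operations per second divided by memory bandwidth in bytes per second) indicates whether such a pipeline configuration is bandwidth bound or compute bound.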

4.1.1.3 Classification

Because of the almost synchronous evolution of the DirectX (DX) API and the generations of graphics hardware in recent years, it is easiest to classify GPUs according to the highest DX version which they support. Actually we are only concerned with the API, but Microsoft releases the different APIs in a bundle, so one usually refers to the version of the whole release. From DX8 on, one can further differentiate by the functionality of the Vertex Shaders (VSs), which configure the VP, and the Pixel Shaders (PSs), which configure the FP. This DX,VS,PS classification is very common and we will refer to it, although all our implementations use the OpenGL API. Therefore, the new features of each hardware generation are accompanied by the corresponding OpenGL names for the extensions. This has also the advantage of a precise identification of certain extensions in specific GPUs, whereas the fractional numbers for VSs and PSs give only a hint at the increased functionality. The evolution of the functionality of the FP is also outlined in mathematical terms in Section 4.1.2.3 on page 124. We have tried to list the main GPU lines in each generation by studying the technical specifications and various benchmark results. The most important factor for scientific computing on GPUs is usually the raw fragment performance. Benchmarks, however, are strongly influenced by the choice of tested functionality (many testers use games as benchmarks), the quality of the available API implementations at the time, and the properties of the test computer system. As a consequence we excluded tile based renderers, notably the Kyro series from PowerVR Technologies, which perform very well in typical games relative to their raw processing power; but it is this raw power which is decisive for scientific computing. Also, it would be quite unfair to directly compare products released at the beginning and at the end of a year, since one year has meant a lot in the development of graphics hardware.
So the dates and example GPUs give a good orientation, but are not suitable for a definitive ranking of the GPUs. The overview is partially based on [Fernando and Kilgard, 2003].

• Proprietary graphics systems, e.g. Evans & Sutherland (E&S) [Evans & Sutherland, 2004], Silicon Graphics Inc. (SGI) [Lenerz, 2004]. As soon as computers were used for simulation or analysis of data sets, there was a need to visualize the results. Without a standard at hand, application specific solutions dominated in the early days of graphics. Nowadays, standard GPUs have become very powerful and easily outperform specialized graphics workstations from years ago, but the segments of professional graphics accelerators and proprietary graphics systems


still exist. Some professional cards now use (almost) the same GPUs as their gaming equivalents, but their focus goes much further in the direction of reliable, high-quality imaging, which concentrates more work on graphics driver development than on pure hardware performance. Proprietary graphics systems are not obsolete either, as the standard GPUs are not designed for collaborative work on huge data sets. But even when standard hardware components are used for large scale visualization, the proper distribution and processing of data is, similar to parallel computing, by no means trivial.

• First accelerators for OpenGL, 1992-1997, e.g. AccelGraphics AG300, DEC workstations with ZLX graphics, IBM Freedom Series /6000, Intergraph TD Series with GLZ graphics, HP Freedom Series, SGI Indy/Indigo/Onyx, GLINT 300SX. In 1992 SGI introduced the OpenGL API. This turned out to be a significant step in the evolution of graphics software and hardware. Within a few years most major players in the graphics field, often with the support of E&S, offered OpenGL implementations for their proprietary graphics accelerators. In this way graphics applications became system and platform independent, and the success of OpenGL up to now highlights the importance of this change. OpenGL 1.0 already defined the graphics pipeline and its basic functionality as described in the previous section, although many features were not supported in hardware at first.

• First accelerators for DirectX, 1996-1997, e.g. 3dfx Voodoo, 3DLabs Permedia2, NVIDIA Riva 128, ATI Rage Pro. In 1995 Microsoft released the first version of DirectX, and from 1996 on it contained the access points to 3D acceleration. Although there were quite a few games around already, the first 3D accelerator, the Voodoo by 3dfx, boosted both the game and the graphics hardware market. All companies were eager to promote their GPUs with the 3D label even if little actual 3D acceleration was present. Different APIs were used at this time (e.g. OpenGL, DirectX, Glide, Metal, MGA) and support for the increasingly popular DirectX was often realized by a translation layer. Naturally, the first DirectX versions lagged behind the already matured OpenGL API. DX5 shipped with Windows 98, and after another major overhaul in DX6 the development started to synchronize well with the graphics hardware generations.

• DX6 GPUs, 1998-1999, e.g. 3dfx /, Millennium G400 /MAX, NVIDIA Riva TNT/TNT2, ATI Rage Fury 128/MAXX. These GPUs offered multi-texturing (dual) in one pass (GL_ARB_multitexture), and the RGBA8 format for textures and frame-buffers, resolving each of the four color channels in 8 bit, had been adopted. At this time also hardware support for the stencil and depth buffers and for different blending modes (GL_EXT_blend_color, GL_EXT_blend_minmax, GL_EXT_blend_subtract) increased.


• DX7 GPUs, 1999-2000, e.g. 3dfx VSA-100 Voodoo4/Voodoo5, 2000 /+, NVIDIA GeForce256/GeForce2, ATI . Transform&Lighting was the key hardware feature of this period, which basically meant the introduction of a parameter controlled VP. From the computational point of view, however, the increased functionality of the FP was more important (GL_EXT_texture_env_combine). It allowed more arithmetic operations in the texture environments and was further enhanced by the prominent feature of bump mapping (GL_EXT_texture_env_dot3). Conditional assignments in the FP appeared (GL_NV_register_combiners). Support for 3D textures (GL_EXT_texture3D) and environment mapping (GL_EXT_texture_cube_map) also evolved.

• DX8 (VS1, PS1) GPUs, 2001-2002, e.g. 3DLabs Wildcat VP, 512 (VS2, PS1), NVIDIA GeForce 3/4, ATI Radeon 8500. These GPUs introduced assembly programs for the VP (GL_EXT_vertex_shader, GL_NV_vertex_program, GL_ARB_vertex_program) and a highly restricted programmability for the FP (GL_ATI_fragment_shader, GL_NV_texture_shader, GL_NV_register_combiners). Certain textures allowed non-power-of-2 dimensions (GL_NV_texture_rectangle). System dependent extensions enabled rendering to pbuffers which can be bound as textures (WGL_ARB_pbuffer, WGL_ARB_render_texture).

• DX9 (VS2, PS2) GPUs, 2002-2004, e.g. S3 DeltaChrome S8, XGI Volari Duo V8, NVIDIA GeForceFX 5800/5900, ATI Radeon 9700/9800. The programmability of the VP (GL_ARB_vertex_shader, GL_NV_vertex_program2) gained function calls, dynamic branching and looping. But the PS2 model for the FP was an even larger leap forward, now allowing freely programmable assembly code operating on up to 16 textures (GL_ARB_fragment_program, GL_NV_fragment_program). High level languages (Cg [NVIDIA, 2002], DX9 HLSL [Mic, 2003], GLSL [Ope, 2004]) allow easier programming of the VP and FP. Floating point formats for textures and pbuffers appeared (GL_ATI_texture_float, GL_NV_float_buffer). Vertex Buffer Objects (VBOs) allow the efficient transfer and reuse of vertex data (GL_ARB_vertex_buffer_object). Rendering to several pbuffers is possible (GL_ATI_draw_buffers).

• DX9+ (VS2-VS3, PS2-PS3) GPUs, 2004, e.g. 3DLabs Wildcat Realizm (VS2, PS3), NVIDIA GeForce 6800 (VS3, PS3), ATI Radeon X800 (VS2, PS2). In the VS3 model, the VP gains additional functionality in the form of restricted texture access and more functionality for register indexing (GL_NV_vertex_program3). The PS3 FP now also supports function calls and restricted forms of dynamic branching, looping and variable indexing of texture coordinates (GL_NV_fragment_program2). Pixel Buffer Objects (PBOs) allow the use of textures as vertex data (GL_EXT_pixel_buffer_object).

• WGF 1.0, 2006? The next Windows generation (Longhorn [Microsoft]) will contain a radically new graphics interface labeled Windows Graphics Foundation (WGF). The main expected features are a unified shader model, resource virtualization, better handling of state changes and a general IO model for data streams. Future GPU generations will probably support all these features in hardware.

Our implementations cover the different functionality levels from DX6 to DX9. The graphics hardware used includes the SGI Onyx2, NVIDIA GeForce 2/3/FX5800 and ATI Radeon 9800.

4.1.2 Computations

After getting to know the development and internal structure of GPUs, we now describe how the actual processing takes place. This includes the description of the general data-flow on the card, the available number formats for data representation, and the evolution of the PEs in the pipeline.

4.1.2.1 Data-Flow

The general data-flow in a GPU is prescribed by the graphics pipeline (Figure 4.1 on page 112). The standard data path from the main memory and the textures to the frame-buffer has been fast from the beginning. But in iterative PDE solvers we need more than one pass, and intermediate results must be reused for subsequent computations. This means that the content of the frame-buffer must be resent through the graphics pipeline again and again. The efficiency of this non-standard data-flow has improved only slowly, and a fully satisfactory implementation is still in development. There are four possibilities to further process the results from the frame-buffer:

• Read-back (glReadPixels). We can read the selected content of the frame-buffer back to the main memory. This is a slow operation, because data transfer has always been optimized in the direction from main memory to the graphics card. With the PCI Express (PCIe) bus and its symmetric bandwidth in both directions, this has finally changed in 2004. But even then the available bandwidth on-card is higher than over the bus, so transferring data to the main memory and back onto the card is inefficient. Data should be read back only if it requires analysis by the CPU.

• Copy-to-texture (glCopyTexSubImage1D/2D/3D). The frame-buffer content can be used to define parts of a texture. This also requires copying of data, but the high data bandwidth on-card makes this operation much faster


than the read-back. Unfortunately, the graphics drivers did not support the fast texture redefinition well until late 2000.

• Copy-to-frame-buffer (glCopyPixels). It is possible to copy data from the frame-buffer onto itself. Because the copied data enters the pipeline at the level of the FP and many of the per-fragment operations can be applied to the data stream, it is actually not raw copying but re-rendering of the data. Not all functionality of the FP can be used, however, since texel operations associated with the texture environment (glTexEnv) are not executed. Therefore, the new programmable FP programs also cannot be applied to the data. Copying can be performed between different color buffers, and later (2001) partial support for the depth buffer appeared. Similar to the copy-to-texture, at this time the copy-to-frame-buffer also greatly improved in speed in the general case, when the fragment pipeline is configured with different operations. Previously, only the raw copying was fast.

Considering the necessary data movements, this functionality is superior to the previous ones, as the re-rendering directly applies the new operations to the data instead of copying it to some other place first. The following option also avoids the copying, but allows the use of the entire pipeline for the re-processing.

• Render-to-texture (WGL_ARB_pbuffer, WGL_ARB_render_texture). One can allocate a pbuffer, i.e. an additional non-visible rendering buffer like the frame-buffer, which serves as the destination for the output data stream. As soon as the pbuffer is no longer a render target, it can be used as a texture, which ultimately means that one renders directly to a texture. The only problem with pbuffers is that they carry a lot of static information, which causes a performance penalty for every switch between their use as a data source (texture) and as a data destination (frame-buffer).

• ARB superbuffers. Current graphics driver development addresses the problem of slow pbuffer switches by introducing a new, lightweight mechanism for using raw data arrays as source or destination at various points in the graphics pipeline. The idea is to define a memory array together with some properties which describe the intended usage. The graphics driver then decides where to allocate the memory (cacheable, AGP or video memory) depending on these properties. To some extent this functionality is already available with the Vertex Buffer Object (VBO) and Pixel Buffer Object (PBO) extensions, but the OpenGL ARB superbuffer group works on a more general and fully flexible solution.

For all early implementations the access to the frame-buffer was a major bottleneck, as the performance factor between drawing a texture into the frame-buffer and reading it back into the texture was more than 50. Our early implementations use the copy-to-frame-buffer, later ones the copy-to-texture, and the newest ones the render-to-texture mechanism. If only a few pixel values have to be retrieved, the read-back is used, but computations on entire images are always performed on the card, so that images never have to be read back to main memory.
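The iterative reuse pattern behind copy-to-texture and render-to-texture is the classic ping-pong scheme: two buffers alternate between the roles of source and destination, so no data is ever copied. The Python sketch below illustrates the idea, with a 1D 3-point smoothing step standing in for a rendering pass (the function name is our own):

```python
def jacobi_pingpong(u, steps):
    """Ping-pong iteration between two buffers: each pass reads from src and
    writes to dst, then the buffers swap roles instead of being copied."""
    src, dst = list(u), [0.0] * len(u)
    for _ in range(steps):
        for i in range(len(src)):
            left = src[max(i - 1, 0)]            # clamped boundary access,
            right = src[min(i + 1, len(src) - 1)]  # like texture edge clamping
            dst[i] = 0.25 * left + 0.5 * src[i] + 0.25 * right
        src, dst = dst, src   # the swap replaces an on-card copy
    return src
```

On the GPU the swap corresponds to exchanging which pbuffer (or texture) is bound as render target and which as input texture.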

120 4.1 Graphics Hardware

4.1.2.2 Number Formats

For a long time the R5G6B5 format, requiring 16 bit for the colors red, green and blue, has been used for gaming. Even today this can still be done if the game is bandwidth bound. But the processing elements (PEs) in the available GPUs do not gain any performance from the low precision format anymore. Already the first DX6 GPUs offered the RGBA8 unsigned fixed point format in 1998, and this has remained a standard until today. Soon after, a sign was added for the internal computations, but negative results could not be stored as such. Little changed until 2001, when DX8 GPUs introduced some proprietary higher precision formats, in particular several signed ones, but these were not supported throughout the pipeline and thus were of little help in iterative computations. Over a year later the DX9 GPUs offered floating point precision throughout the pipeline and again added new fixed point formats. Graphics workstations, e.g. the SGI Onyx, supported 12 and 16 bit unsigned fixed point formats since the mid 1990s. The result of these developments is a confusing number of incompatible fixed point formats nowadays (Table 4.1 on the following page). Nearly all introduced fixed point formats represent numbers in [-1, 1], and initially the unsigned formats, covering the range [0, 1], dominated. But they all, even those with 16 bit precision, suffer from insufficiencies in the exact representation of certain values:

• The values represented by the RGBA8 format. The RGBA8 format is still the most widely used format, as it offers a good compromise between precision and memory consumption. Inaccurate conversions must be performed to represent its values in most of the new formats, including the floating point formats.

• The values -1, 0, 1. These are the neutral elements of addition and multiplication, and the negative of the latter.
Failure to represent them exactly makes it impossible to preserve the values of certain regions in iterative computations without the introduction of performance deteriorating exceptions.

• The values 2^-1, 2^-2, . . . , 2^-7. Most number formats based on the binary system can only represent fractions with a power of 2 denominator exactly. This may not seem much, but it is a very desirable property to be able to invert the power of 2 multiplications exactly. This applies especially to 1/2 and 1/4, which appear naturally in many formulas.

The apparent incompatibility of these conditions is a general problem of fixed point number systems and has to do with the choice of the machine epsilon (see Section 2.2.2.2 on page 30). It can be solved by choosing a machine epsilon whose denominator contains the appropriate factors. A virtual signed 16 bit format which fulfills the above conditions and serves as a superset of all 8-9 bit formats has been introduced in [Strzodka, 2002]. No native hardware implementation of this format exists. In this paper an unsigned RGBA8 format is used to represent the two virtual signed 16 bit numbers. With sufficient processing capabilities of the FP (DX8 GPUs) the usual arithmetic operations of addition, subtraction, multiplication and arbitrary function evaluation can be performed in one rendering pass. In DX9 floating point

121 4 Hardware Efficient Implementations

Table 4.1 Comparison of fixed point number formats in graphics hardware. The formats are grouped in pairs with a subset relation within each pair. The main point of the table is to show that the numbers represented by one pair of formats in general cannot be represented by the other pairs; therefore the table contains so many '-' signs. The unsigned 8 bit format is still the most common choice among the fixed point formats. Some machines also offer a 12 bit format built up correspondingly to the other unsigned formats. The signed 9 bit format is only an internal format used for computations, but it is the only one with an exact representation of -1, 0, 1, although it fails on 0.5 and other negative powers of 2. It was superseded by an internal 12 bit format which worked in the same way but covered the interval [-8, 8]. In the middle of the last pair one could also amend another internal signed 12 bit format with the formula a/2^10 and the codomain [-2, 2 - 1/1024]. It suffers from the same problems as the other two, i.e. no representation of the upper bound integer and it is not a superset of the standard unsigned 8 bit format.

                 unsigned 8 bit  unsigned 16 bit  expanded 8 bit  signed 9 bit   signed 8 bit  signed 16 bit
formula          a/(2^8-1)       a/(2^16-1)       2a/(2^8-1) - 1  a/(2^8-1)      a/2^7         a/2^15
range of a       [0,255]         [0,65535]        [0,255]         [-256,255]     [-128,127]    [-32768,32767]
codomain         [0,1]           [0,1]            [-1,1]          [-1-1/255,1]   [-1,1-1/128]  [-1,1-1/32768]
machine eps. εQ  3.9·10^-3       1.5·10^-5        7.8·10^-3       3.9·10^-3      7.8·10^-3     3.1·10^-5

represented      value of a, or non-representability ('-')
number
+1.0             255             255·257          255             255            -             -
+0.99998..       -               255·257-1        -               -              -             -
+0.99996..       -               -                -               -              -             128·256-1
+0.99607..       254             254·257          -               254            -             -
+0.99218..       -               -                -               -              127           127·256
+0.99215..       253             253·257          254             253            -             -
+0.5             -               -                -               -              64            64·256
 0.0             0               0                -               0              0             0
-0.5             -               -                -               -              -64           -64·256
-0.99215..       -               -                1               -253           -             -
-0.99218..       -               -                -               -              -127          -127·256
-0.99607..       -               -                -               -254           -             -
-0.99996..       -               -                -               -              -             -128·256+1
-0.99998..       -               -                -               -              -             -
-1.0             -               -                0               -255           -128          -128·256
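The representability gaps listed in the table can be checked mechanically. The helper below is our own illustration, using exact rational arithmetic to test whether a value is exactly representable as a/denom with an integer numerator in the format's range:

```python
from fractions import Fraction

def representable(x, denom, lo, hi):
    """Exact representability of x as a/denom with an integer a in [lo, hi]."""
    a = Fraction(x) * denom
    return a.denominator == 1 and lo <= a <= hi

# unsigned 8 bit (a/255): contains 1.0 but neither 0.5 nor any negative value
assert representable(1, 255, 0, 255)
assert not representable(Fraction(1, 2), 255, 0, 255)
# signed 8 bit (a/2^7): contains -1.0 and 0.5 but not +1.0
assert representable(Fraction(1, 2), 128, -128, 127)
assert not representable(1, 128, -128, 127)
```

This makes the mutual incompatibility of the format pairs concrete: no single 8-bit fixed point format contains -1, 0, 1 and the negative powers of 2 at the same time.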

Table 4.2 Operations on encoded values for an affine mapping of number ranges. Because for a long time only the [0, 1] codomain was supported by the graphics number formats, one had to use an affine mapping onto this range to represent a larger interval [-ρ, ρ]. The table lists the operations which must be performed on the encoded values to obtain the correct results.

encoding r : x → (x + ρ)/(2ρ),   a, b ∈ [-ρ, ρ],   r(a), r(b) ∈ [0, 1]

operation             operation on encoded values
a + b                 r(a) + r(b) - 1/2
a·b                   (1 + ρ)/2 - ρ·( r(a)·(1 - r(b)) + r(b)·(1 - r(a)) )
α·a + β               α·r(a) + ( β/(2ρ) + (1 - α)/2 )
max(a, b)             max( r(a), r(b) )
f(a_0, . . . , a_n)   (r ∘ f ∘ r^-1)( r(a_0), . . . , r(a_n) )
Σ_i α_i·a_i           Σ_i α_i·r(a_i) + (1 - Σ_i α_i)/2
decoding              r^-1 : y → ρ(2y - 1)

formats with at least 16 bit mantissa were introduced, so on DX8 hardware and beyond we can obtain at least 16 bit precision, but before that more care had to be taken.
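The rules of Table 4.2 are easy to verify numerically. Here is a small Python model of the encoding and of the encoded addition and multiplication (our own sketch; the formulas are valid as long as all arguments and results stay inside [-ρ, ρ]):

```python
def r(x, rho):
    """Encode x in [-rho, rho] as a value in [0, 1]."""
    return (x + rho) / (2.0 * rho)

def r_inv(y, rho):
    """Decode y in [0, 1] back to [-rho, rho]."""
    return rho * (2.0 * y - 1.0)

def add_encoded(ra, rb):
    """Encoded value of a + b, computed from r(a) and r(b)."""
    return ra + rb - 0.5

def mul_encoded(ra, rb, rho):
    """Encoded value of a * b, computed from r(a) and r(b)."""
    return (1.0 + rho) / 2.0 - rho * (ra * (1.0 - rb) + rb * (1.0 - ra))
```

For example, with ρ = 1 we have r_inv(mul_encoded(r(0.5, 1.0), r(-0.5, 1.0), 1.0), 1.0) = -0.25, matching the direct product.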

The error analysis of the quantized solvers in Chapter 2 on page 13 uses the general assumption that the quantized number system Q contains the zero and that its positive numbers can be negated in the system (Section 2.2.3.1 on page 36). The introduced notation (Eq. 2.20 on page 31) covers only fixed point number systems which fulfill these conditions. In Table 4.1 on the preceding page only the relatively new signed formats comply with this. The signed 9 bit format was already made available in 1999, but it is only an internal format which did not fit into the 8 bit precision available at this time, and it was supported only in the GL_NV_register_combiners extension. The other signed formats were also proprietary at first and not supported throughout the pipeline, because the frame-buffer still had to use an unsigned 8 bit format.

So the question is how to provide a signed format including a zero encoded in the unsigned 8 bit format. If it is possible to separate the positive and negative numbers, then one can use two textures and thus implicitly obtain a signed 9 bit format. Whenever the negative components are involved in the computations, a subtraction instead of an addition must be performed, but subtractive blending (GL_EXT_blend_subtract) has been supported since the DX6 GPUs. Another option is to implicitly shift the unsigned 8 bit format by 1/2 (this cannot be represented exactly but corresponds to 128) and thus obtain the signed 8 bit format, but with an awkward scaling factor of 255/128. The scaling is irrelevant for addition and for multiplications with constants, but one must apply correcting factors when multiplying two general values encoded in this way. The appropriate shifts could be performed in the fragment pipeline with the pixel


Table 4.3 Setup and precision of floating point formats supported in graphics hardware. These formats were introduced with DX9, which required the graphics hardware to have a format with at least the FP24 precision. The unit roundoff, i.e. the upper bound on the relative error in approximating a real number, is εQ/2 (Eq. 2.18 on page 29).

                                 FP16       FP24       FP32
setup: sign, mantissa, exponent  s10e5      s16e7      s23e8
machine epsilon εQ               9.8·10^-4  1.5·10^-5  1.2·10^-7

transfer modes (glPixelTransfer) enabled during a copy-to-frame-buffer operation, the extended texture environment (GL_EXT_texture_env_combine) or the register combiners (GL_NV_register_combiners), whatever was available and supported in hardware. Table 4.2 on the preceding page lists the necessary operations in the general case when we encode the range [-ρ, ρ] in [0, 1] by an affine transformation.

Three different floating point formats have been introduced with the DX9 GPUs (Table 4.3). The DX9+ chips have PEs working with a subset of the standard IEEE s23e8 format (without denormalized numbers) throughout the pipeline, such that the lower precision formats are only advisable if memory bandwidth or availability is critical. Image processing applications usually do not operate on very different number scales, so that higher precision fixed point formats are sufficient if one designs the numerical schemes appropriately. However, in other scientific computing problems, especially when using adaptive grids which resolve certain areas with high spatial accuracy, the number format must also be adaptive. So the introduction of floating point textures and pbuffers was a huge step forward for GPUs in the direction of a general stream processor. Equipped with higher precision, many of the meticulous reformulations of numerical schemes for low precision fixed point numbers can be omitted. However, we should not forget that the handling of numbers on very different scales can introduce unexpectedly large roundoff errors when used without care (Section 2.2.2.1 on page 27).
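The epsilons in Table 4.3 follow directly from the mantissa length: for a binary float with an implicit leading one, the spacing of representable numbers just above 1.0 is 2 to the power of minus the explicit mantissa length. A one-line check (our own sketch):

```python
def machine_epsilon(mantissa_bits):
    """Machine epsilon of a binary float with an implicit leading one and
    the given number of explicit mantissa bits: spacing just above 1.0."""
    return 2.0 ** (-mantissa_bits)

eps = {"FP16 (s10e5)": machine_epsilon(10),   # ~9.8e-4
       "FP24 (s16e7)": machine_epsilon(16),   # ~1.5e-5
       "FP32 (s23e8)": machine_epsilon(23)}   # ~1.2e-7
```

The unit roundoff of Table 4.3 is then half of each of these values.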

4.1.2.3 Operations

In the preceding two sections we have listed the different options for data-flow management and number representations. Now we turn our attention to the available operations depending on the GPU generation. Again, we will refer to the OpenGL extensions, but use the common DX classification from Section 4.1.1.3 on page 116. Because the presentation concentrates on PC graphics cards, we should mention that many of the innovations have been anticipated by vendors of graphics workstations, most notably SGI. Our first implementations of general purpose computations on graphics hardware in 1999 were based on the InfiniteReality2 graphics engine of the SGI Onyx2. It supported many different blending and texture environment modes, 3D textures, multi-texturing, 12 bit number formats, and other special features like convolution filters, color table, color matrix, or histogram extraction, although not all features delivered the appropriate performance or full functionality. Only around 2000 did the rapid development in the PC GPU market become so dominant that these GPUs had to be seen as the innovative motor of the quickly evolving new functionality. However, for some time many of the new extensions were still adapted or extended versions of features previously supported by graphics workstations.

As we use a uniform grid for the data representation (Figure 2.1 on page 27) we can easily establish a one-to-one correspondence between the node values and the texel values in a texture. For this data representation we use the vector notation V̄_α with a 2-dimensional multi-index α = (α_x, α_y) ∈ {0, …, N_x − 1} × {0, …, N_y − 1} (cf. Section 2.2.1.2 on page 26). As the texel values are usually 4-vectors themselves, where appropriate we will indicate the components by V̄_α.Ξ, where Ξ can be any subset of {x, y, z, w}, e.g. V̄_α.xyzw = V̄_α is the full 4-vector, V̄_α.y is the y-component of V̄_α and V̄_α.xyz is a three component sub-vector of V̄_α.
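As an illustration only (a numpy-style layout is assumed here, not the thesis's actual texture storage), the node-texel correspondence and the component selectors can be mimicked as follows:

```python
import numpy as np

Nx, Ny = 4, 3
# one RGBA 4-vector per grid node: V[alpha_y, alpha_x] is the texel
# associated with the multi-index alpha = (alpha_x, alpha_y)
V = np.arange(Ny * Nx * 4, dtype=float).reshape(Ny, Nx, 4)

alpha = (2, 1)                   # multi-index (alpha_x, alpha_y)
texel = V[alpha[1], alpha[0]]    # V_alpha.xyzw, the full 4-vector
y_comp = texel[1]                # V_alpha.y
xyz = texel[:3]                  # V_alpha.xyz, a 3-component sub-vector
```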

In the following list we refer only to the fragment pipeline functionality, because this is the part of the graphics pipeline with which we perform the image processing. The list is cumulative, i.e. GPUs with a higher DX version also support the features of previous DX generations.

• DX6 GPUs, 1998-1999.
    V̄_α · W̄_α                                          (glTexEnv or glBlendFunc)
    lerp(V̄_α, W̄_α, a) := (1 − a) V̄_α + a W̄_α           (glTexEnv or glBlendFunc)
    V̄_α ⊖ W̄_α                                          (glBlendEquation)
    min(V̄_α, W̄_α), max(V̄_α, W̄_α)                       (glBlendEquation)
    f_x(V̄_α.x), f_y(V̄_α.y), f_z(V̄_α.z), f_w(V̄_α.w)     (glColorTable)
    ‖V̄‖_k, k = 1, …, ∞                                  (glHistogram)

At this time all results were clamped to [0, 1], which was particularly annoying for the addition (values would saturate at 1) and subtraction (negative results evaluated to 0). The powerful glColorTable and glHistogram operations were unfortunately not well supported in graphics hardware. The computation of vector norms from a histogram H : {0, …, 2^m − 1} → ℕ, which assigns the number of appearances in V̄ to every value of the number format, can be performed as

(4.1)    ‖V̄‖_k = ( Σ_{y=0}^{2^m − 1} |r^{-1}(y)|^k · H(y) )^{1/k} ,

for k = 1, 2, …. For k = ∞ we simply pick the largest |r^{-1}(y)| with H(y) > 0, where r^{-1} is the inverse transformation from the encoded values to the represented numbers (cf. Table 4.2 on page 123).
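A minimal sketch of Eq. 4.1, assuming the affine encoding r of [−ρ, ρ] into the m-bit format (the helper names are hypothetical, and the encoding mirrors the role of Table 4.2 rather than copying it):

```python
import numpy as np

M, RHO = 8, 1.0                        # m-bit format, encoded range [-rho, rho]
CODES = 2**M                           # number of representable values

def encode(v):
    """Affine map r: [-rho, rho] -> {0, ..., 2^m - 1} (an assumption)."""
    return np.round((np.asarray(v) / (2 * RHO) + 0.5) * (CODES - 1)).astype(int)

def norms_from_histogram(V, k_values=(1, 2)):
    """Eq. 4.1: ||V||_k = (sum_y |r^-1(y)|^k H(y))^(1/k)."""
    H = np.bincount(encode(V), minlength=CODES)          # H(y)
    r_inv = (np.arange(CODES) / (CODES - 1) - 0.5) * 2 * RHO
    norms = {k: float((np.abs(r_inv) ** k @ H) ** (1.0 / k)) for k in k_values}
    norms["inf"] = float(np.abs(r_inv)[H > 0].max())     # k = infinity
    return norms

n = norms_from_histogram(np.array([0.5, -0.5, 0.25]))
```

Up to the 8-bit quantization of the encoding, the results match the exact norms ‖(0.5, −0.5, 0.25)‖₁ = 1.25, ‖·‖₂ = 0.75 and ‖·‖_∞ = 0.5.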


• DX7 GPUs, 1999-2000.
    V̄_α + W̄_α                                       (GL_EXT_texture_env_combine)
    V̄_α + W̄_α − 1/2                                 (GL_EXT_texture_env_combine)
    4 · dot3(V̄_α.xyz − 1/2, W̄_α.xyz − 1/2)           (GL_EXT_texture_env_dot3)
    (Ū_α < 1/2) ? V̄_α : W̄_α                          (GL_NV_register_combiners)

  Here dot3(·,·) denotes a dot product of 3-vectors. While DX6 GPUs offered support for two textures, DX7 GPUs extended this to four, each with an individual texture environment which was able to perform many consecutive operations (GL_EXT_texture_env_combine). The GL_NV_register_combiners extension offered similar arithmetic operations, a conditional statement and additional input and output mappings. It introduced the signed 9 bit format, but only internally.

• DX8 (VS1, PS1) GPUs, 2001-2002.
    f(V̄_α.x, V̄_α.y), f(α_x + V̄_α.x, α_y + V̄_α.y)    (GL_NV_texture_shader)
    +, −, ·, mul-add, lerp, dot3/4,                  (GL_NV_register_combiners,
    (. < 1/2)?, (. >= 0)?                            GL_ATI_fragment_shader)

  The GL_ATI_fragment_shader extension offers up to 16 instructions for computation and texture access. The GL_NV_register_combiners extension executes up to 8 instructions and GL_NV_texture_shader up to 4 different texture access modes, some of which perform implicit computations as well. So the extensions are similar in functionality. The GL_ATI_fragment_shader has more flexibility in dependent texture accesses and a larger number range [−8, 8], while specific access modes of GL_NV_texture_shader can save on computations and the combiners have a more general computing model (Ā_α · B̄_α + C̄_α · D̄_α), though only in [−1, 1]. The number of texture units increased to 8.

• DX9 (VS2, PS2) GPUs, 2002-2004.
    assembly language:                               (GL_ARB_fragment_program)
    arithmetic, reciprocal, trigonometric functions, conditional assignments

  The PS2 model together with the introduction of floating point number formats has moved GPU processing to a very general programming level. The limits are now set by the control features of the assembly language rather than any insufficiency in mathematical operations. Moreover, graphics HLLs (Cg [NVIDIA, 2002], GLSL [Ope, 2004]) also give much easier access to the functionality. The number of texture units is set to 16.

• DX9+ (VS2-VS3, PS2-PS3) GPUs, 2004.
    extended assembly language:                      (GL_NV_fragment_program2)
    sub-routines and restricted forms of dynamic branching, looping, variable indexing

  With the DX9 GPUs the desire for arithmetic functionality has been basically fulfilled.


The new generation improves on the length of the programs and removes restrictions on multiple dependent texture accesses. But the development focus has changed towards the evolution of flow control in the assembly programs and more flexibility in the large scale data-flow within the graphics pipeline.

Before turning to the concrete implementations we should emphasize some developments which may not be so clear from the above presentation and the previous discussions.

• The FP has quickly prevailed over fragment blending. Already the DX6 GPUs offered the combination of two textures in one pass. In the beginning the number of operations which could be applied at this stage was smaller than the number of blending modes, but the texture environments were evolving quickly. With the DX7 GPUs there were already four texture environments and thus more computational power than in the blending. Also, more different values could be combined at once, e.g. the four neighboring node values. Last but not least, many architectures have a higher texture than pixel fill rate, because each pixel pipeline contains several texture mapping units (TMUs).

• Before DX8 GPUs the use of non-linear functions was usually slow. The fast dependent texture access was a major breakthrough for the evaluation of non-linear functions. The previous alternatives (glPixelMap, glColorTable, GL_SGI_texture_color_table, GL_SGIS_pixel_texture) were either slow or not supported on PC GPUs.

• In DX9 GPUs the fixed pipeline is sometimes faster than the programmable one. If the required functionality is available in the fixed, parameter controlled pipeline, then the chances are good that it executes faster than an equivalent assembly program for the programmable FP. This is the case if the chip contains pipelined hardwired functionality to perform the task, which is usually faster than the instruction execution in the FP. This may change in the future, since the PEs of the FP are growing both in breadth and depth. Also, more and more code for the hardwired parts is translated by the graphics driver into the assembly language of the programmable PEs. Thus, transistors are saved and the percentage of programmable PEs in GPUs increases.

4.1.3 Level-Set Segmentation

We present the implementation of a solver for the level-set equation used for image segmentation. The continuous model is discussed in Section 2.1.2 on page 19, the discrete and quantized model in Section 2.4 on page 55. This implementation poses very weak requirements on the graphics hardware (a subset of DX6) and thus would run on almost any GPU. The following discussion is based on the publication [Rumpf and Strzodka, 2001a]. As discussed in Section 2.4 on page 55 we implement the level-set equation with external forces exclusively. This work was later extended by other researchers [Lefohn et al., 2003, 2004] to include curvature terms (cf. Section 2.1.2 on page 19) and an adaptive computing scheme on newer graphics hardware.

4.1.3.1 Implementation

We recall the quantized upwind scheme for the level-set equation, which needs to be implemented (Eq. 2.73 on page 57):

(4.2)    Φ̄^{n+1}_α = Φ̄^n_α ⊖ ḡ^lin_α(D⁻Φ̄^n, D⁺Φ̄^n)

         ḡ^lin_α(U, V) := (τ_n/h F̄)⁺_α ⊙ ‖(Ū⁺_α, V̄⁻_α)‖_lin ⊕ (τ_n/h F̄)⁻_α ⊙ ‖(Ū⁻_α, V̄⁺_α)‖_lin

         ‖X‖_lin := c ‖X‖_1 ⊕ (1 − c) ‖X‖_∞

         D⁺Φ̄^n_α := ( Φ̄^n_{α+(0,1)} − Φ̄^n_α ,  Φ̄^n_{α+(1,0)} − Φ̄^n_α )ᵀ

         D⁻Φ̄^n_α := ( Φ̄^n_α − Φ̄^n_{α−(0,1)} ,  Φ̄^n_α − Φ̄^n_{α−(1,0)} )ᵀ

with the quantized operations {⊕, ⊖, ⊙} corresponding to their natural analogs {+, −, ·} (Section 2.2.3.1 on page 36).
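The quantized operations and the norm approximation can be emulated in software; the following sketch assumes an m = 8 bit rounding model and treats the constant c as a free parameter (all names are illustrative, not the thesis implementation):

```python
import numpy as np

M = 8                                   # bits of the fixed point format
SCALE = 2**M - 1

def q(x):
    """Round to the m-bit grid in [0, 1] with clamping; composing exact
    arithmetic with q emulates the quantized operations."""
    return np.clip(np.round(np.asarray(x, dtype=float) * SCALE), 0, SCALE) / SCALE

def qadd(a, b): return q(np.asarray(a, dtype=float) + b)   # saturates at 1
def qsub(a, b): return q(np.asarray(a, dtype=float) - b)   # clamps at 0
def qmul(a, b): return q(np.asarray(a, dtype=float) * b)

def norm_lin(X, c=1/3):
    """||X||_lin := c ||X||_1 (+) (1 - c) ||X||_inf, the linear
    approximation of the Euclidean norm used in Eq. 4.2."""
    X = np.abs(np.asarray(X, dtype=float))
    return qadd(qmul(c, X.sum()), qmul(1 - c, X.max()))
```

For the 2-vector (0.6, 0.8) and c = 1/3 the approximation happens to coincide with the Euclidean norm ‖(0.6, 0.8)‖₂ = 1; in general it only stays within a bounded factor of it, which produces the octagonal fronts of Figure 4.3.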

The speed function F¯ is the discretization of

(4.3)    f(x) = c(x) + g_1(p(x)) + g_2(‖∇p(x)‖) ,

which depends on the image intensities p(x) and their gradient modulus ‖∇p(x)‖. The functions c, g_1, g_2 and their parameters are set by the user (cf. Section 2.1.2.1 on page 19). The initial level-set function Φ̄⁰ is interactively generated by the user by specifying some points in the image as the starting points of the level-set evolution (cf. Figure 4.4 on page 133).

The scheme has been designed such that small and large factors compensate and the number range [0, 1] can be fully utilized by all involved variables. Moreover, we can exploit all the available precision, because all results can be expressed by storing the positive and negative results separately without the need for a signed number format, e.g. for the precomputed velocities we reserve two textures (τ_n/h F̄)⁺_α and (τ_n/h F̄)⁻_α. In particular, the differences DΦ̄^n are only needed either as the positive part (DΦ̄^n)⁺ or the negative part (DΦ̄^n)⁻, which corresponds directly to the subtraction with the implicit clamping to [0, 1] in the fragment blending functionality (Section 4.1.2.3 on page 124). So the operations required to implement Eq. 4.2 are:


  operation        formula            fragment blending
  multiplication   V̄_α · W̄_α         glBlendFunc
  scalar factor    a V̄_α             glBlendFunc
  addition         V̄_α + W̄_α         glBlendFunc
  subtraction      (V̄_α − W̄_α)⁺      glBlendEquation
  maximum          max(V̄_α, W̄_α)     glBlendEquation
  index shift      V̄_{α+γ}           glVertex

The last operation, the index shift needed for the differences, is achieved simply by changing the drawing position of the desired image. The binary operations with fragment blending are performed in the following way. The first operand is displayed into the color buffer (cf. Section 4.1.1.2 on page 112). Then the setting of the source and destination factors and the blending equation determines in which manner the following image will be combined with the source. Rendering the second operand into the buffer thus performs the desired operation. The result can be further processed by another operation after copying within the color buffer or after reading it to a texture. We implemented both types of data-flow, i.e. copy-to-frame-buffer and copy-to-texture (Section 4.1.2.1 on page 119). Algorithm 4.1 outlines the overall program execution in pseudo code notation.
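The procedure just described can be emulated in a few lines; this is a simplified model with scalar blend factors and illustrative mode names, not the OpenGL API itself:

```python
import numpy as np

def blend(src, dst, equation="add", sf=1.0, df=1.0):
    """Emulate fragment blending: 'dst' plays the color buffer holding
    the first operand, 'src' is the incoming second operand; results
    are implicitly clamped to [0, 1] as on DX6-class hardware."""
    src, dst = np.asarray(src, dtype=float), np.asarray(dst, dtype=float)
    if equation == "add":                  # factors, then addition
        r = sf * src + df * dst
    elif equation == "reverse_subtract":   # dst - src, yields (dst - src)^+
        r = df * dst - sf * src
    elif equation == "modulate":           # per-channel product
        r = src * dst
    elif equation == "max":
        r = np.maximum(src, dst)
    else:
        raise ValueError(equation)
    return np.clip(r, 0.0, 1.0)
```

For instance, with the first operand V̄ in the buffer, `blend(w, v, "reverse_subtract")` realizes the clamped difference (V̄ − W̄)⁺ from the table above.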

Algorithm 4.1 Algorithm for the level-set segmentation in graphics hardware (Eq. 4.2 on the preceding page). The involved operations use only very basic graphics hardware functionality (a subset of DX6) and could be run on almost any GPU.

level set segmentation {
  load the original image P̄;
  compute the initial function Φ̄⁰ from the user defined seed points;
  compute the velocities (τ_n/h) F̄⁺ and (τ_n/h) F̄⁻ from P̄, ‖∇P̄‖ and user specified parameters;
  initialize the graphics hardware with (τ_n/h) F̄⁺, (τ_n/h) F̄⁻ and Φ̄⁰;
  for each time-step n {
    calculate the differences Ū⁺ = (D⁻Φ̄^n)⁺ and V̄⁻ = (D⁺Φ̄^n)⁻;
    approximate the Euclidean norm by ‖(Ū⁺, V̄⁻)‖_lin;
    compose the first flux addend (τ_n/h F̄)⁺ ⊙ ‖(Ū⁺, V̄⁻)‖_lin;
    calculate the differences Ū⁻ = (D⁻Φ̄^n)⁻ and V̄⁺ = (D⁺Φ̄^n)⁺;
    approximate the Euclidean norm by ‖(Ū⁻, V̄⁺)‖_lin;
    compose the second flux addend (τ_n/h F̄)⁻ ⊙ ‖(Ū⁻, V̄⁺)‖_lin;
    sum the flux addends to the full flux ḡ^lin(D⁻Φ̄^n, D⁺Φ̄^n);
    update the level-set function Φ̄^{n+1} = Φ̄^n ⊖ ḡ^lin(D⁻Φ̄^n, D⁺Φ̄^n);
  }
}


If the flux ḡ^lin(D⁻Φ̄^n, D⁺Φ̄^n) becomes very small or vanishes altogether, which can be checked by computing a norm of this vector (Eq. 4.1 on page 125), then a simple rescaling of Φ̄^n and/or the flux addends will continue the evolution until the velocities are really zero (Section 2.4.3 on page 58). In practice, however, the glHistogram functionality used for the evaluation of the vector norms performed poorly. So at this time it was unavoidable that the user himself would invoke the scaling if he wanted the curve evolution to continue. This was not a huge disadvantage, since interactivity with the user, who had to set the starting points and the velocity parameters, was assumed from the beginning.

4.1.3.2 Results

The computations were performed on the InfiniteReality2 graphics system of the SGI Onyx2 4x195MHz R10000 and on a PC graphics card powered by NVIDIA's GeForce2 Ultra 250MHz(4x2) / 230MHz(128 bit DDR) chip, with a precision of 12 and 8 bit respectively. Apart from the pre-calculation of the user controlled speed function (τ_n/h) F̄ and the initial level-set function Φ̄⁰, all computations took place in the graphics system. In the definition of F̄ (Eq. 4.3 on page 128), we usually used no constant velocity (c(x) = 0), a parameter controlled polynomial for g_1(x) and a Perona-Malik function for g_2(x).

Figure 4.2 on the next page shows the segmentation on a slice through the human brain. The images closely resemble software results. Moreover, the first two images of the sequence demonstrate that the allowance of negative values in the speed function enables the initially too large contours to withdraw from regions with unfitting intensities. This property requires the level-set formulation and cannot be implemented with the fast marching method [Sethian, 1999]. We also notice that the applied approximation of the Euclidean norm is not visible in real world applications. Only if we observe a curve evolution with a constant velocity does the approximative structure come into sight (Figure 4.3 on page 132). In Figure 4.4 on page 133 several differently colored seed points evolve independently to segment the pickets of a barbed wire fence. In this example all active contours use the same velocities, but in general the color components may evolve independently along different velocities while still being encoded in a single image, because the graphics hardware operates internally on 4-vectors (RGBA) anyway.

Initially we implemented the copy-to-frame-buffer data-flow (cf. Section 4.1.2.1 on page 119), as it avoids unnecessary data copying when working with the fragment blending operations exclusively.
However, copying turned out to be fairly slow on the SGI Onyx2 with enabled blending modes. In late 2000 the same applied to the GeForce2. But the graphics drivers for the GeForce2 then started to offer a fast copy-to-texture. Although the copy-to-texture involves additional unnecessary data movement compared to copy-to-frame-buffer, and it required the textures to have at least the same format (RGB) as the color buffer, it turned out to be much faster. The execution time for one time-step dropped by more than an order of magnitude from 30ms to 2ms for a 128² image, which corresponds approximately to 65% (128² pixel · 80 data transfers / 2ms = 655 Mpixel/s) of the theoretical maximum throughput of 1000 Mpixel/s (4 pipelines · 250MHz).

Figure 4.2 Segmentation of a human brain computed in DX7 graphics hardware on a 128² image resolved by 8 bit. Besides the original image, the timesteps 0, 10, 50, 150 and 350 are depicted. The computation of one timestep took 2ms.

The overhead is caused by the frequent implicit flushing of the graphics pipeline, which happens each time before the content of the frame-buffer can be copied back to a texture. In view of these complications, the 65% of the unreachable peak performance represented a very efficient utilization of the hardware resources. One should note that this implementation was designed for the fragment blending capabilities, at a time when subtraction in the FP was unavailable. The recourse to copy-to-texture on the GeForce2 was a late remedy for the continued lack of a fast copy-to-frame-buffer functionality. The extension of this work by Lefohn et al. is centered around the more powerful DX8 Fragment Processor (FP) [Lefohn et al., 2003] and thus relates performance to the texture fill rate rather than the pixel fill rate.
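The quoted utilization is a back-of-envelope computation that can be re-checked directly with the numbers from the measurement above:

```python
pixels = 128 * 128        # image resolution of one time-step
transfers = 80            # data transfers (rendering passes) per time-step
time_s = 2e-3             # measured 2 ms per time-step
achieved = pixels * transfers / time_s       # pixels per second
peak = 4 * 250e6          # 4 pixel pipelines at 250 MHz = 1000 Mpixel/s

print(f"{achieved / 1e6:.0f} Mpixel/s, {100 * achieved / peak:.1f}% of peak")
```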

Since the fast copy-to-texture enforced the use of RGB or RGBA textures, one could also make use of the extra colors. The parallel evolution of several level-sets could thus be performed at almost no extra cost (< 3%) (Figure 4.4 on page 133), as long as sufficient bandwidth was available, which was the case for the GeForce2 Ultra.


Figure 4.3 The visible effect of the Euclidean norm approximation during segmentation in totally homogeneous areas. The circle with the little hole on the upper left is the region to be segmented. The subsequent images show the expansion of a small circle under the approximate norm ‖.‖_lin. As expected from the form of the unit sphere of the norm (Figure 2.2 on page 57), the circle evolves into an octagon. This effect is only visible in totally homogeneous areas, where the entire front expands with the same velocity.

4.1.4 Anisotropic Diffusion

We discuss the implementation of explicit and implicit solvers for the anisotropic diffusion model for image denoising. This includes the isotropic non-linear diffusion model as a special case. The continuous model is discussed in Section 2.1.1 on page 15, the discrete, quantized model in Section 2.3 on page 43. The graphics hardware implementation of the diffusion models has been previously presented in the publications [Diewald et al., 2001; Rumpf and Strzodka, 2001b,c].

The ancestor of this work was the implementation of Gaussian filtering in [Hopf and Ertl, 1999], which corresponds to the solution of the linear diffusion problem. Hopf and Ertl also pioneered other hardware accelerated filter applications, such as transformations [Hopf and Ertl, 2000b] and morphological analysis [Hopf and Ertl, 2000a]. Because diffusion is a key ingredient in many physical processes, many successors considered its implementation as


Figure 4.4 Parallel segmentation of fence pickets in DX6 graphics hardware.

part of their solvers, e.g. [Harris et al., 2002; Kim and Lin, 2003]. In [Colantoni et al., 2003] anisotropic diffusion has been addressed explicitly again. Linear algebra with sparse matrices and implicit solvers in general have later also been discussed in [Bolz et al., 2003; Goodnight et al., 2003; Harris et al., 2003; Krueger and Westermann, 2003]. The succeeding papers differ from the following presentation in that they assume newer graphics hardware with floating point number formats (DX9). Only [Harris, 2002] examines the effects of roundoff error in fixed point diffusion under a local white-noise model assumption. But neither floating point numbers nor good local error bounds (which are unattainable for 8 bit anyway) are necessary for the preservation of the qualitative behavior of the continuous diffusion models over a long time, as demonstrated by the quantized scale-space (Section 2.3.3 on page 49) generated by our schemes.

4.1.4.1 Implementation

The quantized explicit mass preserving scheme (Eq. 2.48 on page 48) to be implemented reads

(4.4)    Ū^{n+1} = A⁻[Ū^n_σ] ⊡ Ū^n = Ū^n ⊖ (τ_n/h²) (L[Ū^n_σ] ⊡ Ū^n) ,

with the quantized operations ⊕, ⊖, ⊙ from Section 2.2.3.1 on page 36, and the quantized mass-exact matrix vector product denoted by ⊡ (Eq. 2.29 on page 40). In case of the semi-implicit scheme we have to solve

(4.5)    A⁺[Ū^n_σ] · Ū^{n+1} = Ū^n ,
         A⁺[Ū^n_σ] := 𝟙 + (τ_n/h²) L[Ū^n_σ] ,

with a quantized version of the iterative solvers (Eqs. 2.39, 2.40 on page 45). The components of the stiffness matrix L[Ū^n_σ]_{αβ} are given as a weighted sum of integrations over the elements adjacent to the node α (cf. Figure 2.1 on page 27):

(4.6)    L[Ū^n_σ]_{αβ} = Σ_{E ∈ E(α)} Σ_{i,j ∈ {x,y}} (G^n_E)_{i,j} (S^{αβ}_E)_{i,j} ,
         G^n_E := G(∇U^n_σ(m_E)) ,
         (S^{αβ}_E)_{i,j} := (∂_i Φ_α, ∂_j Φ_β)|_E ,

where (G^n_E)_{i,j} are the components of the diffusion tensor (Eq. 2.44 on page 46) and (S^{αβ}_E)_{i,j} pre-integrated constants depending on the Finite Element (FE) basis functions. On an equidistant mesh with linear FEs we have an explicit formula for the inner sum which is used for the computation (Eq. 2.45 on page 46). In case of the isotropic non-linear diffusion model, where the weight function g̃ is scalar, the local formula is much simpler:

(4.7)    L'[Ū^n_σ]_{αβ} = Σ_{E ∈ E(α)} G'^n_E S'^{αβ}_E ,
         G'^n_E := g̃(∇U^n_σ(m_E)) ,
         S'^{αβ}_E := (∇Φ_α, ∇Φ_β)|_E .

Because the formulas mix positive and negative numbers at various stages, a separate treatment of them would have caused a large overhead. Instead, we emulate a signed format in the unsigned 8 bit format. Our first approach was to reserve the range [−2, 2] for computations and apply the formulas in Table 4.2 on page 123 to obtain the correct results. With the availability of register combiners, which support a signed 9 bit format internally, we could easily map [0, 1] to [−1/2, 1/2) before each computation and work with the signed numbers (see Section 4.1.2.2 on page 121 for details). We recall that the quantized solution satisfies an extremum principle (Eq. 2.53 on page 51), so that we only have to avoid a saturation in intermediate computations.

The matrix vector product with the stiffness matrix, Σ_β L[Ū^n_σ]_{αβ} · X̄_β, is the computationally most demanding part for both the explicit (Eq. 4.4 on the preceding page) and the implicit scheme (Eq. 4.5), since the iterative linear equation system solvers (Eqs. 2.39, 2.40 on page 45) also contain the matrix vector product as their main component. Even the scalar product encountered in the conjugate gradient solver (Eq. 2.40 on page 45) is just a component-wise multiplication followed by the computation of the ‖.‖_1 norm as described in Eq. 4.1 on page 125. So these more elaborate schemes are just a quantitative increase in requirements compared to the solution of the level-set equation in the previous section. From the computational point of view the same arithmetic operations are involved: addition, subtraction, multiplication, vector norms. There is, however, one exception, namely the non-linear dependence of the weights (G^n_E)_{i,j} on the gradient ∇U^n_σ(m_E). For the level-set equation we also had to deal with the non-linear Euclidean norm of a 4-vector and used a linear approximation for it (Eq. 2.72 on page 56). For the reciprocal of diagonal matrix elements in the Jacobi solver (Eq. 2.49 on page 48) and the scalar weight function g̃(‖∇u_σ‖) in the non-linear diffusion model, typically given by the Perona-Malik function (Eq. 2.3 on page 16), also a simple linear approximation can be used (Figure 4.5 on the following page). This is, however, much more demanding in the anisotropic model for the normalized gradient components (cf. Eq. 2.44 on page 46)

(4.8)    b^E_x := ∂_x U^n_σ(m_E) / ‖∇U^n_σ(m_E)‖ ,    b^E_y := ∂_y U^n_σ(m_E) / ‖∇U^n_σ(m_E)‖ ,

which are two-dimensional non-linear functions:

         f(x, y) := x / √(x² + y²) = 1 / √(1 + (y/x)²) .

We should recall that there is no division operation in graphics hardware. Even the DX9+ GPUs compute a reciprocal and multiply with the numerator. This is feasible with floating point numbers, but in fixed point the reciprocal cannot be expressed appropriately, so that even with a one dimensional lookup table the function would be difficult to evaluate, because only a small part of the codomain of 1/√(x² + y²) or 1/x could be represented. Only a lengthy logarithmic representation could capture a larger codomain: f(x, y) = sgn(x) · exp(ln|x| − ½ ln(x² + y²)). But this would be the wrong strategy for fixed point numbers, which represent smaller values with increasingly less binary positions and thus precision. If one has to use an approximation, the best option is to resolve the larger values as well as possible, say the range [1/4, 1], and disregard everything below. For very small differences (almost homogeneous areas) it is not so important to evaluate the anisotropic term in Eq. 2.45 on page 46, since the non-linear diffusion term will diffuse the area anyway.

The dependent texture access introduced with the DX8 GPUs solved the above problems in general. One can encode the values of the non-linear function in a 2D texture and use the function variables as coordinates into that texture, thus implementing any two dimensional function (cf. Section 4.1.2.3 on page 124).

The data-flow for the diffusion models uses the copy-to-texture method (Section 4.1.2.1 on page 119), because the multiple texture environments in the FP allow to compute many more results in a single pass. Especially the summations in the frequent matrix vector products with the stiffness matrix (Eqs. 4.6, 4.7 on the preceding page) lend themselves to the processing in the FP. There were also no fast alternatives as long as pbuffers were not available. Algorithms 4.2 on the facing page and 4.3 on the next page present the pseudo-code for the explicit and implicit solver respectively.

Figure 4.5 Approximation of the Perona-Malik function by linear functions. If the evaluation of non-linear functions is not available in the hardware architecture, they can be approximated by linear ones. Computing the maximum of the above linear functions gives a good approximation without the need for explicit conditional statements. Clearly, general non-linearities can be very complicated and require many linear segments for a reasonable approximation.
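The strategy of Figure 4.5 can be sketched as follows; λ, the sample points and the tangent construction are illustrative choices, not the segments actually used in the implementation:

```python
import numpy as np

LAM = 0.5  # hypothetical Perona-Malik parameter lambda

def g(s):
    """Perona-Malik weight function g(s) = 1 / (1 + (s / lambda)^2)."""
    s = np.asarray(s, dtype=float)
    return 1.0 / (1.0 + (s / LAM) ** 2)

# tangent lines l_i(s) = g(s_i) + g'(s_i) (s - s_i) at a few sample points
S = np.array([0.2, 0.4, 0.7, 1.0])
GP = -2 * S / LAM**2 / (1.0 + (S / LAM) ** 2) ** 2     # g'(s_i)
A, B = GP, g(S) - GP * S                               # slopes, intercepts

def g_approx(s):
    """Maximum over the linear segments, clamped to [0, 1]: no explicit
    conditional statements are required, as noted in Figure 4.5."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    return np.clip((A[:, None] * s[None, :] + B[:, None]).max(axis=0), 0.0, 1.0)
```

With only four segments the sketch already stays within a few percent of g on [0, 1]; a sharper non-linearity would need more segments, as the caption warns.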

4.1.4.2 Results

Different versions of the solvers were run on the SGI Onyx2 4x195MHz R10000 with InfiniteReality2 graphics, an NVIDIA GeForce2 Ultra 250MHz(4x2) / 230MHz(128 bit DDR) and a GeForce3 200MHz(4x2) / 230MHz(128 bit DDR). The Onyx2 offered 12 bit precision, the GeForce chips 8 bit. In the non-linear case we used the Perona-Malik function for the weight function g̃; in the anisotropic case a constant for g_1 and a combination of a constant and the Perona-Malik function for g_2. In the explicit case the time-step width has to comply with τ_n/h² < 3/16 (Eq. 2.56 on page 51) to secure the properties of the quantized scale-space. For the semi-implicit model the time-step width is theoretically not restricted, but in view of the menacing underflow in the iterative schemes (cf. Eq. 2.50 on page 48) we kept τ_n/h² below 8.


Algorithm 4.2 Explicit scheme for image denoising by anisotropic diffusion implemented in graphics hardware (Eq. 4.4 on page 133). The mass-exact matrix vector product guarantees that the overall mass stays the same, i.e. Σ_α Ū^{n+1}_α = Σ_α Ū^n_α, despite quantization and roundoff errors. The scheme generates a quantized scale-space with qualitative properties very similar to the continuous scale-space (Section 2.3.3 on page 49).

anisotropic diffusion {
  load the original image Ū⁰;
  initialize the graphics hardware with Ū⁰;
  for each timestep n {
    mollify the gradient ∇U^n_σ with a Gaussian kernel;
    calculate/approximate the weights g_1^E, g_2^E and the directions b_x^E, b_y^E (Eq. 2.44 on page 46);
    assemble the stiffness matrix L[Ū^n_σ];
    compute the mass-exact matrix vector product (τ_n/h²) (L[Ū^n_σ] ⊡ Ū^n);
    update the solution Ū^{n+1} = Ū^n ⊖ (τ_n/h²) (L[Ū^n_σ] ⊡ Ū^n);
  }
}
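In plain floating point the mass conservation of the explicit step can be illustrated by writing the update in flux form; the edge-averaging of the weights and all names below are sketch assumptions, not the quantized mass-exact product itself:

```python
import numpy as np

def explicit_step(u, g, tau_h2=0.1):
    """One explicit step of isotropic non-linear diffusion written as a
    sum of antisymmetric edge fluxes; every flux is added to one node and
    subtracted from its neighbor, so sum(u) is conserved up to floating
    point rounding -- the role the mass-exact matrix vector product plays
    in the quantized scheme."""
    gx = 0.5 * (g[:, 1:] + g[:, :-1])        # weight on vertical edges
    gy = 0.5 * (g[1:, :] + g[:-1, :])        # weight on horizontal edges
    fx = tau_h2 * gx * (u[:, 1:] - u[:, :-1])
    fy = tau_h2 * gy * (u[1:, :] - u[:-1, :])
    un = u.copy()
    un[:, :-1] += fx; un[:, 1:] -= fx        # exchange mass along x
    un[:-1, :] += fy; un[1:, :] -= fy        # exchange mass along y
    return un

rng = np.random.default_rng(1)
u = rng.random((16, 16))
g = 1.0 / (1.0 + rng.random((16, 16)))       # some positive weights <= 1
u1 = explicit_step(u, g)
```

One step leaves the total mass unchanged while decreasing the variance of the image, the smoothing behavior expected from a diffusion scale-space.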

Algorithm 4.3 Semi-implicit scheme for image denoising by anisotropic diffusion implemented in graphics hardware (Eq. 4.5 on page 134). Each step of the iterative linear equation solver involves at least one matrix vector multiplication and several other operations (Eqs. 2.39, 2.40 on page 45). The iterative solver cannot exactly preserve the mass of the vector, but an additional step can correct the mass defect (Eq. 2.52 on page 49).

anisotropic diffusion {
  load the original image Ū⁰;
  initialize the graphics hardware with Ū⁰;
  for each timestep n {
    mollify the gradient ∇U^n_σ with a Gaussian kernel;
    calculate/approximate the weights g_1^E, g_2^E and the directions b_x^E, b_y^E (Eq. 2.44 on page 46);
    assemble the stiffness matrix L[Ū^n_σ];
    initialize the iterative solver R̄^n = Ū^n, X̄⁰ = R̄^n;
    for each iteration l {
      calculate a step of the iterative solver X̄^{l+1} = F⁺(X̄^l) with A⁺[Ū^n_σ] = 𝟙 + (τ_n/h²) L[Ū^n_σ];
    }
    update the solution Ū^{n+1} = X̄^{l_max};
  }
}
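A plain, unquantized sketch of the Jacobi variant of the iterative solver step F⁺, applied to a 1D model problem for A⁺ = 𝟙 + (τ_n/h²) L (the stencil and sizes are illustrative):

```python
import numpy as np

def jacobi(A, b, iters=200):
    """Jacobi iteration x^{l+1} = D^{-1} (b - (A - D) x^l); the conjugate
    gradient alternative additionally needs scalar products, computable
    with the norm machinery of Eq. 4.1."""
    D = np.diag(A)
    R = A - np.diag(D)
    x = b.copy()
    for _ in range(iters):
        x = (b - R @ x) / D
    return x

n, tau_h2 = 16, 0.5
L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Laplacian stencil
A = np.eye(n) + tau_h2 * L                             # model for A^+
b = np.random.default_rng(0).random(n)                 # right hand side, U^n
x = jacobi(A, b)                                       # approximates U^{n+1}
```

Since A⁺ is strictly diagonally dominant for any τ_n > 0, the Jacobi iteration converges; this unconditional stability is what makes the semi-implicit scheme attractive despite the extra solver cost.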


Figure 4.6 Linear (upper row) and non-linear (lower row) diffusion on a graphics workstation. These early results from 2000 demonstrated the potential for PDE solvers on graphics hardware and inspired our further research in this direction.

Convolution with a Gaussian kernel, which is equivalent to the application of a linear diffusion model, is compared to the results from the non-linear model in Figure 4.6. This test strongly underlines the edge conservation of the non-linear diffusion model, and the quantized computation can preserve this property known from the continuous model.

Figure 4.7 on the facing page shows computations with the SGI Onyx2 using the Jacobi and the conjugate gradient solver and compares them to software results. The 12 bit precision suffices for the task of denoising pictures by non-linear diffusion. But the lack of negative numbers required constant scaling and biasing (see Table 4.2 on page 123). The first implementation did not take explicit care of the resulting roundoff behavior, and therefore there is a fine roundoff pattern in the least significant bits in the sequence produced by the Jacobi solver (visible only in the electronic version). The same applies to Figure 4.6 and the upper row in Figure 4.8 on page 140. The subsequent schemes resolved this problem.

At that time many of the involved operations in the graphics pipeline were not supported well in hardware. On average, an iteration of the Jacobi, conjugate gradient solver on 2562 im- ages took approximately 170ms and 420ms respectively, which was slower than the software solution. The reason for this surprisingly weak performance was easily identified in the un- balanced performance of data transfer between the frame-buffer and video memory. We have already mentioned that the copy-to-texture machanism (Section 4.1.2.1 on page 119) was not well supported until late 2000. But because after each operation the result had to be copied back to a texture, the access times to the frame-buffer were highly relevant for the overall per- formance. The discrepancy in performance between writing an image from the video memory

138 4.1 Graphics Hardware

Figure 4.7 Non-linear diffusion solvers in graphics hardware and software; first row: adaptive software pre-conditioned conjugate gradient; second row: Jacobi sover in graphics hardware; third row: conjugate gradient solver in graphics hardware. The black borders are due to restrictions in handling of the boundary conditions in graphics hardware. The fine roundoff pattern in the images (electronic version only) is explained in the text (Section 4.1.4.2 on page 136).

to the frame-buffer and reading it back was as large as a factor of 60. The histogram extension used for the computation of the scalar products in the conjugate gradient solver was even worse, explaining the extremely long execution time of the conjugate gradient solver. Soon after the results from Figure 4.7 were generated [Rumpf and Strzodka, 2001b], the first optimized drivers with a fast copy-to-texture operation for the GeForce series appeared (late 2000). The execution time of the Jacobi solver on the GeForce2 Ultra dropped by more than a factor of 10. This was, however, still less than expected, because the lack of a fast dependent texture access, as needed for the evaluation of non-linear functions, then became the bottleneck. Another problem was the reduced precision of 8 bit on the PC GPUs as opposed to the 12 bit on the graphics workstation. In bright images and for fairly low numbers of time-steps the mass defect caused by the implicit schemes was hardly visible and could also be counteracted with

139 4 Hardware Efficient Implementations

Figure 4.8 Mass defect (upper row) and mass exact (lower row) non-linear diffusion in graphics hardware. The used precision is 8 bit. The upper row uses a naive implicit diffusion scheme which causes a mass defect due to many underflows for very small numbers (black areas). The mass-exact matrix vector product (Eq. 2.29 on page 40) preserves the overall mass despite low quantization. If the results of the mass defect scheme seem more pleasing, it is only because we expect the moon and the stars on a black sky.

scaling strategies (Eq. 2.32 on page 42), though the effect is already noticeable in Figure 4.7 on the preceding page. But for badly resolved, very low fixed point numbers and many iterations the multiplicative underflows produced a significant mass defect. Figure 4.8 compares these problematic effects with the remedy of the quantized explicit mass preserving scheme (Eq. 4.4 on page 133).

As explained in the previous section, the implementation of the anisotropic model is very difficult without a two dimensional lookup mechanism for non-linear function evaluation in graphics hardware. In 2001 the DX8 GPUs introduced fast dependent texture lookups, which also replaced the rather slow options for one dimensional non-linear functions (cf. end of Section 4.1.2.3 on page 124). Figure 4.9 on the next page compares the first anisotropic diffusion implementation working with linear interpolations with the later one based on dependent texture access. The former had a tendency to further darken dark colors and further brighten bright colors. The reason was the [−1/2, 1/2) → [0, 1] encoding, which mapped 0 to 128/255, and the inaccuracy of the approximations for numbers close to 0. This implicit contrast enhancement may seem desirable, but it violates the scale-space approach and destabilizes the evolution in the long run. The DX8 implementation had no such problems. The approximate DX7 version performed at about 20ms on the GeForce2 Ultra, the DX8 version at 8ms on the GeForce3 for one explicit time-step on 256² images.


Figure 4.9 Anisotropic diffusion models implemented in DX7 (middle) and DX8 (right) graphics hardware. The images show results after 10 time-steps in the explicit scheme on 128² images. The DX7 version had to use linear interpolations and thus produced stable results only for relatively few time-steps. With non-linear two dimensional functions in DX8 these restrictions were eliminated.

Figure 4.10 Anisotropic diffusion with the virtual signed 16 bit format in DX8 graphics hardware on 256² images.

The 16 bit formats introduced with DX8 GPUs were only applicable in special situations and there was still no 16 bit frame-buffer format, so that the precision could not be maintained throughout the pipeline. But these GPUs had enough computing resources to emulate a virtual signed 16 bit format [Strzodka, 2002] in the 8 bit color channels of RGBA8 textures. Figure 4.10 presents anisotropic diffusion results obtained with this format.

The advantage of the additional bits is best seen in Figure 4.11 on the next page, where we apply the 8 and 16 bit schemes as linear diffusion for a long time. Analysis of the quantized stop criteria (Eq. 2.66 on page 53) shows that the diffusion process stops roughly when


Figure 4.11 Comparison of the diffusion schemes in 8 bit (upper row) and 16 bit (lower row). A simple linear diffusion is computed with the mass preserving anisotropic scheme to empirically verify its quantized stop criteria (Eq. 2.66 on page 53). The 8 bit diffusion stops quickly due to insufficient precision in intermediate computations. The 16 bit results clearly demonstrate the huge advantage of the additional 8 bit in the virtual signed 16 bit format (Section 4.1.2.2 on page 121).

differences between neighbouring values in a grid element become smaller than 8ε_Q, where 8ε_Q = 2^−5 for 8 bit and 8ε_Q = 2^−13 for 16 bit (cf. Eq. 2.25 on page 35). In other words, a smooth ramp from 0 to 1 over 2^5 = 32 or 2^13 = 8192 pixels cannot be further diffused in 8 or 16 bit respectively.

A specific linear diffusion scheme can diffuse larger areas. The above numbers apply to the full anisotropic scheme with the guaranteed mass preservation and no scaling strategies (Section 2.2.3.4 on page 41) applied. They clearly emphasize the importance of having more significant bits in the number representation than in the initial data. They also explain why research on general purpose computations [GPGPU] increased significantly with the introduction of high precision floating point formats in DX9. In many applications it was not the exponent properties of the floating point formats that were decisive, but a mantissa with at least 16 bit (cf. Table 4.3 on page 124). The schemes presented in this and the previous section function with only 8 bit solely because of a very careful design. With a 16 bit fixed point number format many more applications could have been covered. But since 2003 the floating point number formats dominate the further development of GPUs, so the next section explores the use of this additional precision in a much more demanding context.


4.1.5 Gradient Flow Registration

This section deals with the image registration performed by a gradient flow PDE. The continuous model is discussed in Section 2.1.3 on page 22, the discrete, quantized model in Section 2.5 on page 59. Results on the graphics hardware implementation of this method appeared previously in [Strzodka et al., 2003, 2004]. Because many of the previous restrictions disappeared with DX9 graphics hardware, a lot of research on the general use of GPUs has evolved since then. We refer to [GPGPU] for a comprehensive overview. The most closely related work is that on multi-grid solvers on GPUs in [Bolz et al., 2003; Goodnight et al., 2003], where they have been applied to fluid dynamics. Here, we use the multi-grid hierarchy for a fast linear equation solver and additionally as an efficient representation of a multi-scale hierarchy, which we use for a robust problem regularization. We also present the implementation of an adaptive time-step control governed by Armijo's rule.

4.1.5.1 Implementation

We recall that the discrete, quantized scheme for the image registration consists of three nested loops (Eq. 2.83 on page 61):

\begin{align}
\bar U_k^{n+1} &= \bar U_k^n - \frac{\tau_k^n}{h_{l(k)}}\,\mathrm{MGM}_{l(k)}(\sigma)\,\bar E_k'[\bar U_k^n]\,, \qquad n = 0,\dots,N_k-1\,, \tag{4.9}\\
\bar E_k'[\bar U_k^n] &= \big(\bar T_k \circ (\mathbb{1} + \bar U_k^n) - \bar R_k\big)\,\nabla_{h_{l(k)}}\,\bar T_k \circ (\mathbb{1} + \bar U_k^n)\,, \notag\\
\bar U_k^0 &= \pi_{l(k)}^{l(k+1)}\,\bar U_{k+1}^{N_{k+1}}\,, \qquad k = K-1,\dots,0\,, \notag\\
\bar U_K^0 &= \bar 0\,. \notag
\end{align}

The outer loop with index k runs from the coarse (ε_K) to the fine (ε_0) scale representations and uses the prolongation operator π_{l(k)}^{l(k+1)} to transfer data onto finer grids. The middle loop with index n performs the gradient descent on a given scale until the last index N_k, for which the change in data is sufficiently small (Eq. 2.81 on page 61). The inner loop determines for each update (first formula line) the maximal time-step width τ_k^n which satisfies Armijo's rule (Eq. 2.76 on page 60) by maximizing the time-step width τ in Eq. 2.85 on page 62. Naturally, the multi-grid V-cycle (Eq. 2.82 on page 61) also performs an inner loop from the current grid Ω_{h_{l(k)}} up to the coarsest grid Ω_{h_L} and back, where Eq. 2.78 on page 60 describes the relation between the grid levels and the scales.

The DX9 graphics hardware allows a high level approach to the graphics functionality. First of all, the availability of a floating point format of at least s16e7 precision (cf. Table 4.3 on page 124) eliminates most of the low level implementational concerns about exceeded number ranges in intermediate computations. However, with floating point numbers we must be


Figure 4.12 The multi-grid hierarchy encoded in textures of different spatial resolution. Each grid level Ω_{h_l} serves for the representation of several scales ε_k of the multi-scale hierarchy. In contrast to the multi-scale hierarchy, which is a regularization concept of the registration model, the multi-grid hierarchy is an implementational construct for the reduction of computations.

aware of the dramatic errors resulting from interferences of numbers on different scales (Section 2.2.2.1 on page 27). Our quantized scheme does not run into these problems, since we hardly make use of the exponent and rather benefit from the large mantissa.

The hardly restricted programmability of the Vertex Processor (VP) and the Fragment Processor (FP) has an even greater impact on implementations for DX9 graphics hardware. These parts have become so powerful that the other processing elements in the graphics pipeline are hardly used for scientific computations anymore (cf. Figure 4.1 on page 112). The increased functionality is also available at a higher abstraction level with graphics High Level Languages (HLLs) included in the APIs OpenGL and DirectX for the configuration of the VP and FP. We have used Cg [NVIDIA, 2002] for the programming of the configurations. Cg is another graphics HLL with the advantage that it can compile the same code for different APIs and profiles, which express the functionality level of the VP and FP. Thus the following description of the implementation will focus less on the involved arithmetic operations and more on the higher level of required kernels, i.e. configurations of the programmable VP and FP.

The two dimensional input images T and R are represented as 2D textures on the finest grid Ω_{h_0}. The multi-grid hierarchy (Ω_{h_l})_{l=0,...,L} corresponds to textures of successively smaller size (Figure 4.12). Several such hierarchies are reserved in graphics memory to store any intermediate results. All textures are implemented as floating point pbuffers and we use the render-to-texture mechanism (Section 4.1.2.1 on page 119), which avoids unnecessary data transfers. Computations are performed by loading a computational kernel to the FP, e.g. a prolongation kernel, and streaming the texture operands through that kernel into a target pbuffer. Thereby the vertex processor is used to generate the texture coordinates for the access to neighboring nodes in the operand textures. The target pbuffer can then be used as a texture operand in the succeeding operation (cf. Figure 4.1 on page 112).

Algorithm 4.4 Algorithm for image registration in graphics hardware (Eq. 4.9 on page 143). Because of the involved numerical computations, an implementation is only feasible in DX9 or higher functionality graphics hardware.

image_registration {
    initialize the graphics hardware with T̄, R̄;
    reset the displacement on the coarsest scale ε_K: Ū_K^0 = 0̄;
    perform gradient_flow_at_scale ε_K;
    for each scale ε_k, k = K−1, ..., 0 {
        initialize the displacement with the previous scale solution, performing
            a prolongation if the grid level also changes: Ū_k^0 = π_{l(k)}^{l(k+1)} Ū_{k+1}^{N_{k+1}};
        perform gradient_flow_at_scale ε_k;
    }
}

gradient_flow_at_scale ε_k {
    compute new image scales T̄_k = MGM_{l(k)}(ε_k) T̄, R̄_k = MGM_{l(k)}(ε_k) R̄;
    for each n {
        evaluate the gradient Ē'_k[Ū_k^n] = (T̄_k ∘ (1 + Ū_k^n) − R̄_k) ∇_{h_{l(k)}} T̄_k ∘ (1 + Ū_k^n);
        perform the smoothing multi-grid V-cycle MGM_{l(k)}(σ) Ē'_k[Ū_k^n];
        maximize τ_k^n by Armijo's rule (Eq. 2.85);
        update the solution Ū_k^{n+1} = Ū_k^n − (τ_k^n / h_{l(k)}) MGM_{l(k)}(σ) Ē'_k[Ū_k^n];
        break the loop if ‖(τ_k^n / h_{l(k)}) MGM_{l(k)}(σ) Ē'_k[Ū_k^n]‖_2 < δ;
    }
}

Algorithm 4.4 outlines the implementation of the image registration scheme (Eq. 4.9 on page 143) in pseudo-code notation. Each line corresponds to the configuration of the fragment processor with the corresponding kernels and the streaming of the texture operands through the so configured graphics pipeline. Some operations require several passes with slightly different configurations. The involved kernels are listed below.


• Smoothing with the multi-grid V-cycle MGM_{l(k)}(σ):
  – Restriction operator ρ_{l+1}^{l}.
  – Prolongation operator π_{l}^{l+1}.
  – Jacobi iterations with A_{h_{l(k)}}.
  – Residuum computation Ū_k^n − A_{h_{l(k)}} X̄_k.
• Energy functional:
  – Error computation T̄_k ∘ (1 + Ū_k^n) − R̄_k.
  – Evaluation of the energy gradient Ē'_k[Ū_k^n].
• Utilities:
  – Multiply and accumulate Ā · B̄ + C̄.
  – Bilinear interpolation Σ_{i_x,i_y ∈ {0,1}} |i_x − μ_x| · |i_y − μ_y| · Ā_{α+(1−i_x,1−i_y)}.
  – Evaluation of the lumped L² scalar product ⟨·,·⟩_h.

These kernels are executed by the Fragment Processor (FP); as an example, Listing 4.1 on the next page shows the implementation of the Jacobi kernel in Cg. A different program in the Vertex Processor (VP) generates the texture coordinates in the structure Frag2dIn IN for the access to the neighboring nodes. The other parameters of jacobiFP are set in the application during the configuration process of the graphics pipeline (Figure 4.1 on page 112). Listing 4.2 on page 148 shows the pipeline configuration in a C++ program for the execution of one iteration of the Jacobi solver.

Because the FP accepts long configurations, the kernels perform their task in one pass, except for the lumped discrete L² scalar product. This could be evaluated with the glHistogram function (Eq. 4.1 on page 125), but we had already discontinued the use of this function for the diffusion processes because it is very slow. An alternative is to evaluate the scalar product by a component-wise multiplication and an iterative addition of local texels. Such a procedure involves a global access to all texels of a texture and would need a global register for accumulation, which is not supported in DX9 graphics hardware. Hence, we consider a hierarchical implementation with several passes. After the component-wise multiplication we consecutively halve the size of the resulting texture by applying local filters which sum up the local texel values. This step is repeated from the finest up to the coarsest grid level such that the final result of this hierarchical summation can be retrieved from the coarsest level as a single value for further processing by the CPU.

The energy computation $E_k\big[\bar U_k^n - \frac{\tau}{h_{l(k)}}\,\mathrm{MGM}_{l(k)}(\sigma)\,\bar E_k'[\bar U_k^n]\big]$ required in the evaluation of Armijo's rule (Eq. 2.85 on page 62) involves such a lumped discrete $L^2$ scalar product. Thus we compute

$$\bar V_\tau := \bar U_k^n - \frac{\tau}{h_{l(k)}}\,\mathrm{MGM}_{l(k)}(\sigma)\,\bar E_k'[\bar U_k^n]\,, \qquad E[\bar V_\tau] = \frac{1}{2}\,\big\langle \bar V_\tau, \bar V_\tau \big\rangle_{h_{l(k)}}\,,$$

where the scalar product is implemented as the hierarchical summation described above.


Listing 4.1 Implementation of the kernel of the Jacobi solver for A_h X = B in the graphics language Cg. Bold keywords belong to the language specification, italic ones are predefined types and functions of a self-made library which facilitates the access to neighboring nodes in a texture. Lines 9,10 assign local variables to data elements of the input data streams (given by the textures Tex_B, Tex_X), and the following lines define the actual processing of the data elements with the computation of the convolution and the linear interpolation: lerp(a,b,c) := (1-c)a + cb.

 1  FragOut
 2  jacobiFP( Frag2dIn IN,
 3            uniform sampler2d Tex_B : texunit0,
 4            uniform sampler2d Tex_X : texunit1,
 5            uniform float scale )
 6  {
 7      FragOut OUT;
 8
 9      float2 tex_B = f2texRECT( Tex_B, IN.cCoord.xy );
10      Stencil3x3r2 tex_X; texStar8( tex_X, IN, Tex_X );
11
12      float2 LN = ( + tex_X.mp + tex_X.cp + tex_X.pp
13                    + tex_X.mc            + tex_X.pc
14                    + tex_X.mm + tex_X.cm + tex_X.pm ) * (1/8.);
15      OUT.col = lerp( LN, tex_B, scale );
16
17      return OUT;
18  }

4.1.5.2 Results

The implementation runs on a NVIDIA GeForceFX 5800 Ultra 500MHz (4x2) / 500MHz (128 bit DDR). For DX9 GPUs it is, however, not possible to estimate the raw processing power from the clock frequencies, number of parallel pipes and memory bandwidth alone. The number of arithmetic processing elements (PEs), texture mapping units (TMUs) and their flexibility is very important, and GPU manufacturers have put different focus on these parameters, making a simple comparison impossible. It should suffice to say that the FX5800 is a fairly weak DX9 architecture; both the NVIDIA GeForceFX 5900 and the ATI Radeon 9800 have twice the processing power on computationally intensive problems.

The examples are composed of three different data sets: low and high frequency distortions (Figures 4.13 on page 149, 4.14 on page 150), large rigid deformations (Figures 4.17 on page 152, 4.15 on page 151) and medical data sets (Figures 4.18 on page 153, 4.19 on page 154). The corresponding figures show six different tiles, which are arranged in the


Listing 4.2 Configuration of the graphics pipeline and data streams for the execution of the Jacobi kernel in Listing 4.1. Bold keywords are functions of the Cg API, italic ones are predefined arrays pointing to texture objects, vertex and fragment programs and their variables. The first line sets the target pbuffer for the output data stream. At the end (line 16) we release the pbuffer, such that it can be used as a texture operand in the next pass. Lines 6,7 configure the VP and FP with the kernel programs. Line 5 sets the scale parameter, lines 9,10 bind the textures TEX_B, TEX_X as input data streams for jacobiFP (Listing 4.1). Finally, line 12 sends the geometry of the current multi-grid level (MGlev) to the VP and thus initiates the execution of the Jacobi iteration in the graphics pipeline (Figure 4.1 on page 112).

 1  tex[TEX_N].toTexture( MGlev );
 2
 3  cgGLSetStateMatrixParameter( verVar[VP_NEIGH2D][VV_MVP],
 4      CG_GL_MODELVIEW_PROJECTION_MATRIX, CG_GL_MATRIX_IDENTITY );
 5  cgGLSetParameter4fv( fragVar[FP_JACOBI][FV_SCALE], scale );
 6  cgGLBindProgram( verProg[VP_NEIGH2D] );
 7  cgGLBindProgram( fragProg[FP_JACOBI] );
 8
 9  tex[TEX_B].bind( MGlev, GL_TEXTURE0_ARB );
10  tex[TEX_X].bind( MGlev, GL_TEXTURE1_ARB );
11
12  drawTex( tex[TEX_N].pos[MGlev], tex[TEX_N].size[MGlev],
13           tex[TEX_B].pos[MGlev], tex[TEX_B].size[MGlev],
14           tex[TEX_X].pos[MGlev], tex[TEX_X].size[MGlev] );
15
16  tex[TEX_N].fromTexture( MGlev );

following way: on the upper left we see the template T̄ which should be deformed to fit the reference image R̄ to the right of it; on the lower left we see the computed deformation Ū applied to a uniform grid and to the right the registration result, i.e. the template after the deformation T̄ ∘ (1 + Ū). The rightmost column shows the quadratic difference between the template and the reference image before ($\|\bar T_\alpha - \bar R_\alpha\|_2^2$, upper row) and after ($\|(\bar T \circ (\mathbb 1 + \bar U))_\alpha - \bar R_\alpha\|_2^2$, lower row) the registration. With one exception (Figure 4.19 on page 154) the differences are scaled with a factor 10 and grey values are clamped at black and white; otherwise one would hardly see anything in the error images after registration.

Figures 4.16 on page 152 and 4.21 on page 155 show the decrease of the energy against the overall number of update iterations in the process of registering the examples 4.15 on page 151 and 4.19 on page 154 respectively. Each point in the graph stands for a gradient descent step (first line in Eq. 4.9 on page 143), which includes the computation of the energy gradient, the smoothing with the multi-grid V-cycle, the evaluation of Armijo's rule and the update of the solution (cf. Algorithm 4.4 on page 145). The graph discontinuities indicate scale changes


Figure 4.13 Elimination of low frequency distortions on 513² images in 9.8sec. We see that, apart from the image boundary where sufficient information is missing, the deformation can be completely eliminated.

(ε_k → ε_{k−1}), while the X's on the x-axis represent changes of the grid level in the multi-grid hierarchy (Ω_{h_l} → Ω_{h_{l−1}}). The number of X's depends on the choice of the grid used for the computation of the initial deformation (Eq. 2.79 on page 60). Usually each scale change increases the energy, because less smooth data is used in the computation of the local error. This effect can sometimes be particularly large for the last scale change, because on scale ε_0 = 0 no smoothing of the images takes place. Introducing more scales, especially for the finest grid, can lessen this effect, such that the energy graph looks nicer; but since the overall energy is hardly reduced in this way, we have not included these additional costly iterations on the finest grid in the standard parameter set. Sometimes we also observe energy decreases at the time we change the grid level in the multi-grid hierarchy (Figure 4.21 on page 155). This effect is due to the additional smoothing caused by the prolongation operator, which decreases the local error in areas of smooth deformations. The computations were performed either in the s23e8 full float or the graphics specific s10e5 half float format (cf. Table 4.3 on page 124). The results for both formats are very similar, which is an indicator for the stability of the algorithm. The performance varies only slightly in favor of the smaller format, as there exists a sufficient bandwidth in comparison to the number


Figure 4.14 Elimination of high frequency distortions on 513² images in 9.5sec. The algorithm performs well for high frequencies. Only some ambiguous situations around the strongly deformed small dark squares cannot be resolved.

of operations in the kernels to transport the full float values. The standard parameter set uses 15 scales, up to 10 iterations of the update loop and up to 10 iterations of the energy loop. We say 'up to' because the loops are aborted if the update is too small. The smoothing multi-grid V-cycle uses 3 Jacobi iterations on each grid, both up and down the V-cycle. The duration of the registration depends on the size of the images and the number of actual passes, since several loops allow adaptive loop abortion. In general, the registration of 257² images takes approx. 3sec and up to 10sec are needed for fully distorted 513² images. But for even larger (513 × 769) medical data, often less time is required (8.5sec), because for such data the deformations are usually not so severe (Figure 4.19 on page 154). An estimated software performance for this data set based on the highly optimized implementation in [Clarenz et al., 2002], which actually deals with 3D data, would amount to approx. 34sec and thus 4 times more time than the graphics implementation requires. As explained in the beginning of this section, this factor would be twice as large for the current DX9 GPUs and presumably four times for DX9+ GPUs. Beside the increasing hardware performance, we also expect efficiency improvements in the APIs. The render-to-texture mechanism with pbuffers (Section 4.1.2.1 on page 119) made


Figure 4.15 Registration of a rotated 257² image in 5.9sec. This is a very hard test for the non-rigid registration, which takes exceptionally long to finish for this image size, since without any a priori knowledge about the underlying rotation there are many possibilities to match the similar grey levels against each other. Obviously, in the area of the body the inverse rotation could be identified by the algorithm, whereas the background is rather poorly registered. This outlines that even without the guidance of a concrete deformation model the method performs well for large deformations if the structures to be matched are sufficiently pronounced.

unnecessary data movement obsolete, but the switches between pbuffers are still fairly heavyweight. Since these switches occur extremely often, after each operation, the mechanism for lightweight pbuffer switches currently under development will optimize the global data-flow on GPUs.


Figure 4.16 Energy decrease in registering the images from Figure 4.15 on the page before. The structure of the graph is explained on page 148.

Figure 4.17 Registration of a large scale rigid deformation on 257² images in 3sec. The multi-scale approach allows the reconstruction of even large deformations. However, without assumptions on the deformation model, the computed deformation might differ from our expectations.


Figure 4.18 Elimination of a possible acquisition artefact for a medical 257² image in 2.2sec. Here we have an example where some areas must expand while others must shrink to fit the reference image. The matching works well apart from the small fluctuation in the lower left part. Such deviations are striking to our perception but have little effect on the overall energy, because they reflect mismatches of the morphology rather than grey level deviations. Therefore additional morphological energy components will be considered in the future.


Figure 4.19 Registration of two brain slices of the same patient taken at different times on 513 × 769 images in 8.5sec. As the slices were extracted from 3D volumes, some anatomical structures are not present in both images and necessarily lead to an error in the registration result, especially in the upper right parts of the images. Also, in contrast to the other examples, the grey values of the corresponding structures do not have exactly the same value, such that a small error is present even in the case of a perfect fit of image edges (left skull). For this reason, here the error images on the right are not scaled. The algorithm, however, is not distracted by the different grey levels and complements missing data fairly smoothly. In particular the edges are matched nicely. Figure 4.20 on the facing page shows the scaled error in the interior of the images.


Figure 4.20 The enlarged central part of the error images from Figure 4.19 on the preceding page. Here the error has been multiplied by 10 again. The images demonstrate that the algorithm also does a good job in matching the central areas.

Figure 4.21 Energy decrease in registering the images from Figure 4.19 on the facing page. The structure of the graph is explained on page 148.


4.2 Reconfigurable Logic

Field Programmable Gate Arrays (FPGAs) have evolved since the mid 1980s from small rapid-prototyping devices to multi-million gate chips with large blocks of integrated memory, multipliers and special circuits or even embedded processors. Sections 3.2.4 on page 93 and 3.3.1.4 on page 100 introduce the subject, while [Bondalapati and Prasanna, 2002; Brown and Rose, 1996; Compton and Hauck, 2002] also explain other RL architectures and present an overview of many systems. Our implementation does not use a very sophisticated device, but the main considerations about efficient spatial computing come into play here just the same [DeHon, 2002]. The work has been done in collaboration with Steffen Klupsch and Markus Ernst from the Integrated Circuits and Systems Lab of the Technical University Darmstadt headed by Prof. Sorin Huss, and appeared in [Klupsch et al., 2002]. This section is based on the publication and associated material from presentations. First we present the hardware platform and the design flow, and then the concrete implementation.

4.2.1 FPGA Card

Here we describe the functionality and programming model of the simple FPGA card we have used. In principle the same aspects apply to FPGA systems in general, but FPGA cards often contain more components, both multi-FPGA arrangements and hybrid hardware-morphware combinations, so that the interaction patterns are more complicated.

4.2.1.1 Technology

We use a PCI accelerator card [Silicon Software, 1999], equipped with a XC4085XLA FPGA from Xilinx [Xilinx Inc.], for the implementation. The chip contains a 56x56 Configurable Logic Block (CLB) array (Figure 4.22 on the next page, on the left). Each CLB contains several lookup tables (LUTs) which are configured to perform logic operations on the input lines (Figure 4.22 on the facing page, on the right). The overall capacity of the XC4085XLA is equivalent to approx. 180k system gates, i.e. primitive logic operations in an ASIC. This is very little compared to the multi-million gate chips of today, which also contain embedded hardwired special circuits and a lot of memory. In particular, the lack of dedicated local RAM on the chip means that many LUTs must be consumed to provide local memory rather than serve as computational resources. The card has a programmable clock generator and 2MiB SRAM accessible directly from the FPGA. A schematic overview of the card with the bandwidth of the different buses is shown in Figure 4.23 on the next page, on the left. The card itself has no sophisticated host processor and needs external control for the configuration process. We connected it via the PCI interface to a PC. In this respect the technology is very similar to graphics cards, which also use a host for the general control functionality.


Figure 4.22 Layout of an FPGA architecture (Xilinx [Xilinx Inc.]). On the left we see the general arrangement of the Configurable Logic Blocks (CLBs) in an array. Between the CLBs there are also configurable routing channels. On the right the layout of a CLB with lookup tables (LUTs) as main computational resources is displayed.

Figure 4.23 A schematic view of our FPGA card and its system partitioning. The 66MHz access to the SRAM is a maximal value and can only be achieved with hand optimized pipelined access, but then the full bandwidth is available. The peak bandwidth of the PCI bus, on the other hand, cannot be reached, as the bus must share resources with other components in the PC.


4.2.1.2 Programming

A generic concept for the system partitioning based on the PCI connection is shown in Figure 4.23 on the preceding page, on the right. The connection is centered around the fixed PCI interface with reusable, but application specific, low level IO access functions. The algorithmic functionality is divided into a software part running on the CPU, and a flowware and configware (Table 3.2 on page 85) part configured in the FPGA. The software part holds the general control of the process and has been coded in C. The design of the sources for the execution on the FPGA is more sophisticated.

Nowadays, the design entry for most digital circuits is based on behavioral models in Hardware Description Languages (HDLs) such as Verilog, or VHDL, which we used. These descriptions are processed by synthesis tools to derive a netlist of basic logic elements, which can be fed into place&route tools. The result is a bitstream used for the configuration of the FPGA. Unlike the changing of code in a High Level Language (HLL) and a subsequent quick re-compilation to a micro-processor's assembly language, the transition from a changed HDL model to the bitstream is not an interactive process. It takes tens of minutes to hours for even relatively small projects like ours. For large FPGAs these times can easily grow to days, depending, of course, on the architecture on which the tools run. There is also a trade-off between the execution time and the target requirements (mainly the clock cycle) of the synthesis process.

Because the tools use heuristics to fulfill the requirements, a small change in the model or a slightly different target clock cycle can result in a completely different bitstream after the change. This makes debugging very hard, since errors may produce different effects depending on the actual bitstream. To avoid the place&route step after each change, there also exist simulators for the models. But clearly the clock accurate simulation takes its time, and some errors are so delicate that they only show up in the on-chip execution and not in the simulator. Recapitulating, we can say that despite many tools the configuration design for fine grain architectures is still very demanding and an art of its own.

The main option to accelerate the design and reduce the error-prone redevelopment of logic circuits is the incorporation of higher abstraction levels into synthesis tools, which we also exploited for our design. By using parametrised high level descriptions, we gain the possibility to do detailed design space explorations. The design complexity is reduced by using small abstract behavioral descriptions. By extensive hardware/software co-evaluation we add validation of the synthesis results without leaving the abstraction level [Klupsch, 2001]. The time saved is then used for selective optimization of critical design subcircuits.

We should mention that there exist compilers which translate C or certain dialects of it to a HDL model [Frigo et al., 2001; Hammes et al., 1999]. The idea is to further increase the abstraction level, and this is particularly important for large designs. But these approaches must necessarily omit some of the low level optimization options. Therefore, a two stage approach is sometimes taken, where the overall design is done in C but individual parts are modelled in a HDL. This is similar to the inclusion of assembler code in a HLL for a micro-processor. In

our implementation we have not made use of this strategy, because the small FPGA rather forced us in the opposite direction: coding the overall design in HDL and hand optimizing the netlists and their placement for critical parts of the configuration.

4.2.2 Computations

The low level reconfigurability allows very high flexibility. Both in the design of an efficient data-flow and in the computations, the available resources can be used economically, with exactly the amount needed.

4.2.2.1 Data-Flow

The global data handling on the card is as follows. The software part uses the PCI bus to configure the FPGA, load the input images to the SRAM of the card and retrieve the results after execution. The flowware on the FPGA consists of the SRAM controller, which generates data streams from the SRAM and stores the streams processed by the computational units. The configware defines the configuration of these units. Although we use a fairly small FPGA, there is still more parallelism available than could be served by the SRAM bus. For problems with high computational intensity this is not crucial, but image processing problems are rather memory intensive. So the guiding idea is to efficiently utilize the available memory bandwidth by reading each pixel of an image only once, and caching it on the chip for reuse [Benitez, 2003; Hammes et al., 2001]. If, for example, a 3 by 3 neighborhood is needed in a formula, one has to cache two lines on the chip to avoid rereading of values from external memory. On newer FPGAs dedicated RAM blocks embedded in the chip can be used for this purpose. In our case, where no RAM blocks are available, we had to use some of the CLBs for local memory. As a rule of thumb a CLB can be used for storing 32 bits. So the entire chip (3136 CLBs) can hold roughly 100 Kbit, but then we would have no resources for the computations.
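The line caching strategy can be sketched in a few lines of Python; the function below is our illustrative software model, not part of the actual configuration. It streams an image in raster order, keeps two complete lines plus the current partial line in a buffer, reads each external pixel exactly once, and applies a 3 by 3 box filter as a stand-in for an arbitrary stencil.

```python
def box3x3_stream(image):
    """Stream pixels in raster order over a list of rows; keep at most
    two complete lines plus the current partial line cached (the two
    line buffers of the text), and read each external pixel once.
    Returns the flat list of 3x3 box-filtered values and the number of
    external memory reads performed."""
    h, w = len(image), len(image[0])
    buf = []                     # on-chip cache: at most 3 lines
    reads = 0
    out = []
    for y in range(h):
        buf.append([])
        if len(buf) > 3:
            buf.pop(0)           # the oldest line leaves the cache
        for x in range(w):
            buf[-1].append(image[y][x])   # the only external read
            reads += 1
            if y >= 2 and x >= 2:
                # 3x3 window around (y-1, x-1) is now fully cached
                window = [buf[j][i] for j in range(3)
                          for i in range(x - 2, x + 1)]
                out.append(sum(window) // 9)   # integer mean
    return out, reads
```

Note that the output is smaller than the input (the valid region of the stencil), while the read counter equals the pixel count: each value enters the chip exactly once.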

4.2.2.2 Operations

Since FPGAs can be configured to execute any combination of boolean functions, it is clear that they can implement any functionality. After all, micro-processors and other ASICs are also implemented in this way, and FPGAs are still used to simulate these designs. But for a given application it would be a waste of resources to implement a standard processor design and then solve the problem with this configured processor. The advantage of the low level programmability of FPGAs is precisely the possibility to adapt the problem solver exactly to the needs of the application. This includes the use of non-standard number formats, application specific operations and pipelines.


Table 4.4 Area consumption and latency of some operations in Reconfigurable Logic. The variable Ba refers to the bitlength of the arguments, Bc to the bitlength of the results.

operation                         area                latency

c0 = min/max(a0, a1)              Ba + 1              2
c0 = a0 ⊕ a1                      2 Ba                Ba
(c1, c0) = a0 ⊕ a1 ⊕ a2           2 Ba                2
c0 = a0 · a1                      ≈ Ba (Ba − 1)       ≈ Ba log2 Ba
c0 = √a0                          ≈ 2 Bc (Bc − 5)     ≈ Bc (Bc + 3)

An application specific number format can save a lot of resources. Table 4.4 lists the resources and latencies of several operations. A 16 bit multiplier, for example, consumes almost twice (1.87 times) the resources of a 12 bit multiplier. But in spatial computing (Section 3.3.1.4 on page 100) resource gains translate directly into performance, since two parallel 12 bit multipliers could be implemented in almost the same area. Moreover, smaller number formats do not imply losing precision in intermediate computations, since we can use variable precision arithmetic. This means that intermediate results are resolved with as many bits as necessary for the desired accuracy, e.g. in an addition c = a + b of numbers resolved with Ba, Bb bits the result c should have Bc = max(Ba, Bb) + 1 bits to retain full precision. In a multiplication c = a · b we would need Bc = Ba + Bb bits for full precision, but we can represent c with fewer bits if this suffices for the envisioned accuracy in our application. In an iterative scheme it would be unreasonable to increase the format size with each iteration, but we can maintain an error free calculation throughout the pipeline and apply one of the rounding modes (Eq. 2.22 on page 32) only at the very end.
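The bookkeeping of variable precision arithmetic can be modeled as follows. The two helper functions are our illustration for unsigned operands; they simply encode the bit width rules just stated, and the exhaustive loop confirms that the bounds are sufficient for small widths.

```python
def add_bits(ba, bb):
    # c = a + b of Ba- and Bb-bit unsigned numbers fits exactly
    # in max(Ba, Bb) + 1 bits
    return max(ba, bb) + 1

def mul_bits(ba, bb):
    # c = a * b needs Ba + Bb bits for full precision
    return ba + bb

# exhaustive check for small widths that the rules never overflow
for ba in range(1, 6):
    for bb in range(1, 6):
        amax, bmax = 2**ba - 1, 2**bb - 1
        assert amax + bmax < 2**add_bits(ba, bb)
        assert amax * bmax < 2**mul_bits(ba, bb)
```

Chaining these rules through a pipeline gives the exact width of every intermediate result, so the single rounding step can be deferred to the very end, as described above.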

There are many options for application specific commands with configurability at the bit level. Something we usually do not encounter in software programs are redundant number representations. If many operations of the same kind have to take place, e.g. the sum ∑_{i=1}^{N} a_i, then computing not uniquely defined c_i such that ∑_{i=1}^{N} a_i = ∑_{i=1}^{N−1} c_i can be performed much faster than an ordinary addition c = a_N + a_{N−1} (cf. Table 4.4). So for the whole sum the speedup is enormous.
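The redundant representation above is the classical carry-save form; the following sketch is our illustration, not taken from the thesis. Three addends are reduced to two without carry propagation, so the latency is constant regardless of the word length, and only one full carry-propagating addition is needed at the very end of a long sum.

```python
def carry_save(a, b, c):
    """Reduce three addends to two (sum, carry) without carry
    propagation: each output bit depends only on the three input bits
    at its position, so the latency is constant in the word length."""
    s = a ^ b ^ c                              # bitwise sum
    k = ((a & b) | (a & c) | (b & c)) << 1     # carries, shifted up
    return s, k

# summing N numbers: keep folding each new addend into the redundant
# pair (s, k); the invariant s + k == partial sum holds throughout
vals = [13, 7, 25, 4, 9]
s, k = 0, 0
for v in vals:
    s, k = carry_save(s, k, v)
assert s + k == sum(vals)      # one full addition resolves the result
```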

Finally, the options for parallelism both in depth and in breadth (Section 3.2.2 on page 87) are manifold on FPGAs. In pipelining, for example, one can introduce many pipeline stages for even small computing units, because the atomic units (LUTs) are very fine. As a rule of thumb, longer pipelines enable higher clock frequencies and thus more throughput (cf. Section 3.2.1 on page 84). In the example of Figure 3.4 on page 86 one could divide the adders and the multiplier into even faster clocking stages.


4.2.3 Level-Set Segmentation

We implemented a solver for the level-set equation used for image segmentation on the FPGA card described in Section 4.2.1.1 on page 156. The implemented model is exactly the same as in Section 4.1.3 on page 127, where a graphics hardware implementation has been presented. In particular, we incorporate only external forces for the movement of the interface. Because image processing algorithms are so well suited for parallelization and FPGAs have become a mass market, the literature on implementations in RL is extensive. For a comprehensive overview of FPGA based DSP applications, including image and video processing, we refer to [Tessier and Burleson, 2001]. Related work on image segmentation can be found in [Demigny et al., 2000; Tian et al., 2003; Waldemark et al., 2000].

4.2.3.1 Implementation

The quantized upwind scheme for the level-set equation (Eq. 2.73 on page 57) reads:

(4.10)

    Φ̄_α^{n+1} = Φ̄_α^n ⊖ ḡ_α^{lin}(D⁻Φ̄^n, D⁺Φ̄^n)

    ḡ_α^{lin}(U, V) := (τ_n/h F̄)_α^+ ⊙ ‖(Ū_α^+, V̄_α^−)‖_lin ⊕ (τ_n/h F̄)_α^− ⊙ ‖(Ū_α^−, V̄_α^+)‖_lin

    ‖X‖_lin := c ⊙ ‖X‖_1 ⊕ (1 ⊖ c) ⊙ ‖X‖_∞

    D⁺Φ̄_α^n := ( Φ̄_{α+(0,1)}^n ⊖ Φ̄_α^n ,  Φ̄_{α+(1,0)}^n ⊖ Φ̄_α^n )ᵀ

    D⁻Φ̄_α^n := ( Φ̄_α^n ⊖ Φ̄_{α−(0,1)}^n ,  Φ̄_α^n ⊖ Φ̄_{α−(1,0)}^n )ᵀ

with the quantized operations {⊕, ⊖, ⊙} corresponding to their natural analogs {+, −, ·} (Section 2.2.3.1 on page 36). The speed function F̄ is the discretization of

(4.11)    f(x) = c(x) + g_1(p(x)) + g_2(‖∇p(x)‖) ,

which depends on the image intensities p(x) and their gradient modulus ‖∇p(x)‖. The functions c, g_1, g_2 and their parameters are set by the user (cf. Section 2.1.2.1 on page 19). The initial level-set function Φ̄^0 is interactively generated by the user by specifying some points in the image as the starting points of the level-set evolution (cf. Figure 4.4 on page 133).
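For illustration, one step of the scheme of Eq. 4.10 can be modeled in software with plain floating point arithmetic standing in for the quantized operations. The extension of the norm approximation to all four upwind differences and the clamped (Neumann) boundary treatment are our assumptions in this sketch.

```python
def nlin(X, c=0.5):
    """||X||_lin := c*||X||_1 + (1-c)*||X||_inf, the blend of 1- and
    max-norm that approximates the Euclidean norm."""
    return c * sum(abs(x) for x in X) + (1 - c) * max(abs(x) for x in X)

def levelset_step(phi, F, tau_h):
    """One upwind step; float ops stand in for the quantized ones.
    phi, F are lists of rows; tau_h plays the role of tau_n/h."""
    h, w = len(phi), len(phi[0])
    P = lambda y, x: phi[min(max(y, 0), h-1)][min(max(x, 0), w-1)]
    out = [row[:] for row in phi]
    for y in range(h):
        for x in range(w):
            dpx, dpy = P(y, x+1) - P(y, x), P(y+1, x) - P(y, x)  # D+
            dmx, dmy = P(y, x) - P(y, x-1), P(y, x) - P(y-1, x)  # D-
            f = tau_h * F[y][x]
            # positive speed uses (D-^+, D+^-), negative the converse
            g = (max(f, 0) * nlin([max(dmx, 0), min(dpx, 0),
                                   max(dmy, 0), min(dpy, 0)])
               + min(f, 0) * nlin([min(dmx, 0), max(dpx, 0),
                                   min(dmy, 0), max(dpy, 0)]))
            out[y][x] = phi[y][x] - g
    return out
```

On a plane phi(x, y) = x with unit speed and tau_h = 0.1, every interior value decreases by exactly 0.1, as the continuous model predicts for a front moving with speed one.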

In the case of graphics hardware (Section 4.1.3.1 on page 128) we used the approximation of the Euclidean norm, because there was no way of accurately evaluating this two dimensional non-linear function in the GPU at that time. As discussed in Section 4.2.2.2 on page 159, in an FPGA any function can be evaluated up to a given precision. Here, we apply the approximation because of the comparatively high resource demand of the Euclidean norm (cf. Table 4.4 on the preceding page). The multiplications in the approximation, on the other hand, are very cheap if we choose c = a/2^n with a small numerator a. Since we deal with an approximation anyway, c does not have to equal exactly one of the suggested choices (cf. Eq. 2.72 on page 56).
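With c = a/2^n the two scalings in the norm approximation reduce to small constant multiplies and a right shift. The sketch below uses c = 3/8 purely for illustration; for the choices actually suggested, the text refers to Eq. 2.72.

```python
def nlin_fixed(u, v, a=3, n=3):
    """Integer ||(u, v)||_lin with c = a / 2**n (here 3/8):
    c*l1 + (1-c)*linf  ==  (a*l1 + (2**n - a)*linf) >> n,
    i.e. two constant multiplies and one shift, no divider."""
    l1 = abs(u) + abs(v)               # 1-norm
    linf = max(abs(u), abs(v))         # max-norm
    return (a * l1 + ((1 << n) - a) * linf) >> n
```

For the classic 3-4-5 triangle this weighting happens to reproduce the Euclidean value exactly after truncation, which illustrates why a coarse c is acceptable in practice.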


Figure 4.24 The caching strategy for the level-set solver in Reconfigurable Logic. On the left we see the required input data for an update unit which evaluates the scheme from Eq. 4.10 on the page before, and how caching of 2 image lines reduces the number of input pixels to two per clock cycle. On the right two update units work in parallel, which leaves the size of the caches equal, but doubles the required input pixels to 4 and output pixels to 2.

Figure 4.25 The pipeline arrangement for the level-set solver in Reconfigurable Logic. On the left the number of update units which evaluate the scheme from Eq. 4.10 on the preceding page has been increased just enough to exactly meet the available bandwidth to the SRAM. More parallelism can only be reached in depth, which requires the duplication of both the caches and the pixel update, as can be seen on the right.

A good compromise between accuracy and performance was to use a 12 bit signed fixed point number format for the data representations. In intermediate calculations variable precision arithmetic ensures exact computation until the final result Ū_α^{n+1} is rounded and clamped to the 12 bit format again. Because of this format reduction the synthesis tool even eliminates some computations which only influence bits that are irrelevant after the rounding. For this to happen the correct tie breaking strategy among the nearest rounding modes (Eq. 2.22 on page 32) must be chosen. The update unit, which executes the entire formula (Eq. 4.10 on the preceding page), was automatically divided into several pipeline stages to find a good compromise between area consumption and the maximal attainable clock frequency.
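The final rounding and clamping stage can be modeled as follows. The choice of round-to-nearest with ties to even is one possible tie breaking strategy among the nearest rounding modes, picked here purely for illustration.

```python
Q_MIN, Q_MAX = -(1 << 11), (1 << 11) - 1   # 12 bit signed range

def round_clamp_12(x, frac_bits):
    """Reduce an exact wide intermediate carrying frac_bits extra
    fraction bits back to the 12 bit signed format: round to nearest
    (ties to even, one possible choice), then clamp (saturate)."""
    half = 1 << (frac_bits - 1)
    q, r = divmod(x, 1 << frac_bits)       # floor division, r >= 0
    if r > half or (r == half and q % 2 == 1):
        q += 1                             # round up / break tie to even
    return min(max(q, Q_MIN), Q_MAX)
```

The ties-to-even rule is visible at half-integers: 1.5 and 2.5 both round to 2, so no systematic drift accumulates over many iterations.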

In contrast to the DX7 GPUs (Section 4.1.3.1 on page 128), the FPGA offers so much parallelism that we must now focus on minimal data movement (Section 4.2.2.1 on page 159) to ensure that several pixels can be processed in parallel. At each node α five Ūⁿ values and one (τ_n/h F̄) value are needed for the computation of Ū_α^{n+1}. If we cache two lines of the image then


only one Ūⁿ and one (τ_n/h F̄) value are needed in each clock cycle. The other values can be retrieved from the internal caches (see Figure 4.24 on the preceding page on the left). Because the entire lines are cached, it suffices to read one more Ūⁿ and (τ_n/h F̄) value in the same clock cycle to feed all necessary values to a second update unit (Figure 4.24 on the facing page on the right). This increases the IO demand to 4 input and 2 output pixels. But the 36 bit SRAM bus can transport only three 12 bit values per clock cycle. The solution is to run two different clock frequencies on the FPGA. The SRAM controller, which performs the reads and writes to the external SRAM memory, has a unit which operates at 66MHz and can thus fulfill the IO demand of 6 pixels for the two update units running at 33MHz. The double clocked logic operates extremely close to the theoretical limit of the SRAM bus and had to be placed manually to ensure the short paths which allow this high clock frequency. More parallelism in breadth was not possible, since it would further increase the IO demand, which already operated at its very limit. But after the implementation of the two update units and the cache we had a pipeline (Figure 4.25 on the preceding page on the left) which left many CLBs unused. In such cases, when the bandwidth is restricted but computational resources are still available, parallelism in depth can be applied. In our case this meant duplicating the cache and update unit logic and using it for a second iteration (Figure 4.25 on the facing page on the right). This filled the FPGA to over 80% and exhausted the routing capabilities, so no further parallelization was possible (Figure 4.26 on the next page). For a larger FPGA this process of appending duplicated logic could be continued much longer, which nicely demonstrates how the spatial computing approach directly gains performance from a larger spatial computing domain.
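The IO budget of the double clocked design can be verified by a short calculation with the figures given above; the arithmetic is ours, the numbers are those of the text.

```python
# IO budget of the two-clock design: the SRAM controller runs at
# 66 MHz on a 36 bit bus, the update units at 33 MHz on 12 bit pixels
bus_bits, sram_mhz, logic_mhz, pixel_bits = 36, 66, 33, 12

# pixels the bus can move per 33 MHz logic cycle: 3 words x 2 bus cycles
words_per_logic_cycle = (bus_bits // pixel_bits) * (sram_mhz // logic_mhz)

# two update units need 4 input + 2 output pixels per logic cycle
demand = 4 + 2
assert words_per_logic_cycle == demand   # budget exactly met: 6 == 6
```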

4.2.3.2 Results

Figures 4.27 and 4.28 on page 165 present example segmentations computed with the FPGA. With the two update units each delivering an output pixel per clock cycle at 33MHz, we have a throughput of 66Mpixel/s. Because we compute two iterations in one go, this is equivalent to a throughput of 132Mpixel/s per iteration. Thus we can perform approximately 2000 iterations per second on a 256² image, or equivalently one iteration takes 0.5ms. This is eight times faster than the fragment blending based implementation in graphics hardware, although the throughput of the graphics pipeline for an individual operation was 1000 Mpixel/s (Section 4.1.3.2 on page 130). The advantage of the FPGA pipeline is that it does much more work in each clock cycle, namely obtaining two solution pixels. The GPU used only one operation per pass and moreover had to read back the result before processing it again. On a DX9 GPU one could also encode the whole update formula in a single configuration for the FP. But the FP is a hardwired unit, so the advantage of the long pipeline could still not be realized. Instead, the several pipelines would attack the problem in breadth. But on a modern FPGA there would also be much more space for parallelization, so the performance crown would most likely remain with the Reconfigurable Logic (RL) implementation.
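The quoted performance figures follow from a short calculation (ours, from the numbers above):

```python
units, mhz, iters_per_pass = 2, 33, 2      # two update units, two chained iterations

pixels_per_s = units * mhz * 1e6           # 66 Mpixel/s written out
effective = pixels_per_s * iters_per_pass  # 132 Mpixel/s per iteration
image = 256 * 256                          # pixels in a 256^2 image

iters_per_second = effective / image       # ~2014 iterations per second
ms_per_iter = 1000 / iters_per_second      # ~0.5 ms per iteration
```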


Figure 4.26 Final layout of the level-set solver in the FPGA. The term 'Pixel Engine' in the legend refers to the composition of 2 pixel update units and 2 cache lines for the Φ̄ values (cf. Figure 4.25 on page 162). We see that after the heuristic place&route step there is only little spatial coherency left in the functional units. On the small scale no correspondence between individual operations of the algorithm, like additions and multiplications, and the CLBs can be found, because these operations may spread and mix arbitrarily across the lookup table functionality of several CLBs. Debugging at this level is almost impossible and must take place in the simulator or through the output of control values.

Figure 4.27 Segmentation result computed in an FPGA.


Figure 4.28 Segmentation of tumors computed in an FPGA.

4.3 Reconfigurable Computing

Reconfigurable Computing (RC) concepts and architectures have experienced a rapid evolution since the mid 1990s. While [Hartenstein, 2001] offers an excellent survey of this development, we refer back to Sections 3.2.4 on page 93 and 3.3.1.5 on page 102 for a short introduction. But while GPUs and FPGAs have become a mass market, RC machines are still at the beginning of their exponential growth. Typical for this stage of development is the huge variety of competing designs with little standardization among them. Companies working in this field

are often only several-year-old start-ups which evolved directly from research activities at universities, e.g. [Elixent; IPflex; PACT; picoChip; QuickSilver]. These companies basically do not produce chips themselves, but offer to synthesize a chip design for fabrication, or license IP-cores. To evaluate a chip's properties before fabrication one uses clock cycle exact chip simulators. We thank PACT [PACT] for providing a 3-month free research license for a simulator of their XPP architecture in 2002. In the following we describe the architecture and its utilization for the implementation of basic image processing building blocks and the non-linear diffusion scheme.

4.3.1 eXtreme Processing Platform

The eXtreme Processing Platform (XPP) was presented in [Baumgarte et al., 2001] as a hardware prototype and was acknowledged as the fastest 32 bit integer processor at the time. Meanwhile a slightly different internal structure is preferred, but the key features have remained the same. We sketch the layout of the device, its programming model and tools. The presentation of the architecture in this section is based on the documentation of the XPP provided by PACT. For further details we refer to the company's web site [PACT].

4.3.1.1 Technology

Figure 4.29 on the next page shows a diagram of the XPP architecture. It is a regular array made up of IO elements, a Configuration Manager (CM) and PAEs, i.e. Arithmetic and Logic Units (ALUs) and RAM elements. Between rows of PAEs there are horizontal data and event lines which can be configured to connect to the PAEs in the rows below and above. Vertical connections are performed by FREGs and BREGs. Each PAE consists of three parts: a FREG, a BREG, and a main part which is an Arithmetic and Logic Unit (ALU) or a Random Access Memory (RAM) block (Figure 4.30 on the facing page). Data lines have a typical word width of several bytes, while event lines carry just a few bits. So the internal structure is similar to an FPGA (cf. Section 3.2.4 on page 93), but the reconfigurability resides on a coarser level. This has the advantage that the hardwired ALUs operate on data words more efficiently, that configurations are much smaller, allowing fast dynamic reconfigurability, and that the overall design and debugging are more transparent and less complicated. On the other hand, the coarse grain functionality loses the low level flexibility of the available resources: the data width is fixed, wasting resources for smaller number formats; pipelining (cf. Section 3.2.1 on page 84) must be performed on coarser elements; computational resources cannot serve as memory; application specific commands are not available; and we have the same design clock for the whole chip. But the great efficiency of massive parallelism in depth and in breadth (Section 3.2.2 on page 87) and the flexibility to exchange resources between these different types of parallelism remain. In the simulator one can change the arrangement of the array and different parameters like the number and functionality of ALUs, RAMs and IOs, or the number and width of data and event lines.


Figure 4.29 Diagram of the XPP architecture, courtesy of PACT [PACT]. We see one possible arrangement with two types of Processing Array Elements (PAEs): Arithmetic and Logic Units (ALUs) and RAM blocks.

Figure 4.30 Processing Array Elements of the XPP, courtesy of PACT [PACT]. Both types have a Forward Register (FREG) and a Backward Register (BREG) which are used for ver- tical routing and minor computations. The central parts offer either arithmetic and boolean functions (ALU) or a block of memory (RAM).


In a fabricated chip this is not possible, so we worked with a fixed layout in the simulator, which has later been made available in the chip XPP64-A1 [PACT]. The layout consists of an 8x8 array of ALU PAEs with 8 RAM PAEs and two IO elements on the left and right. The data width is 24 bit, event lines carry 1 bit. The routing capacities are set to 8 horizontal data and 6 event lines in each direction and 4 vertical data and event register routings in each FREG and BREG. The size of the RAM blocks is 2^9 words. The RAM and ALU PAEs occupy the same area, so the relatively small memory size exemplifies how cheap computational resources are in comparison to internal memory. The design of stream processors (Section 3.2.4 on page 93) also relies on this fact [Khailany et al., 2003]. We obtain an architecture with a lot of computational power, but because of its multiple IO elements, the XPP can also deal well with memory intensive problems of smaller computational intensity. Now, we look at the individual components in more detail.

• Processing Array Element (PAE) (Figure 4.30 on the page before)

  – ALU: The ALU can perform arithmetic and boolean operations like addition, multiplication, shifts, boolean AND/OR, comparisons, max/min. The operations come in at least two versions, assuming either signed or unsigned integers as operands. This means that there is no explicit number format; one rather decides for each configuration how to interpret the data lines.

  – RAM: The memory can be read and written to by specifying an address and the input word. Additionally, a First In First Out (FIFO) mode is available, which is particularly useful for caching and delaying of a data stream.

  – FREG: Besides vertical routing capabilities, FREGs can also execute conditional counters, assignments or discards triggered by event signals. This is very important for generating, merging and eliminating data elements in streams.
  – BREG: BREGs provide vertical routing channels and can also be used for data addition, shifts and a lookup table for events.

Each of the above operations performs in one clock cycle; only the BREGs can operate and route data through without a logic delay. It should be noted that the functionality of an ALU PAE is very similar to the functionality of the PEs in the Fragment Processor (FP) of DX9 GPUs (Section 4.1.2.3 on page 124). Here, we only miss the special units which perform the reciprocal and trigonometric computations, which make more sense for the floating point format in GPUs. With 128 floating point operations per clock cycle in the FP in the newest GPUs (DX9+), they have also reached a similar level of parallelism. The maximum number of operations in an 8x8 ALU arrangement of the


XPP is namely 64 multiplications (ALUs) and 64 additions (BREGs). The XPP has the advantage of programmable routing between the PEs, but this in turn complicates the programming model.

• Input/Output (IO)

  Similar to the internal RAM, the IO elements, which communicate with the external memory, can be used either in a random access mode, by providing a memory address, or in a streaming mode with a continuous in- or out-flow of data. From our detailed discussion of memory access (Section 3.1 on page 71) and the many delays associated with it, it is clear that these operations do not have a latency of one clock cycle. But in the streaming mode we can obtain a very high sustained bandwidth (Section 3.1.1.2 on page 76), ideally incurring only one initial latency. For the random access mode this is more difficult, but the DSB computing paradigm allows for more flexibility in hiding the incurred latencies. Although the memory will be accessed with random addresses, these usually can be generated well in advance of the actual processing of the data, so that the latency is hidden by the ongoing computation.

• Configuration Manager (CM)

  The CM can be triggered by event lines to select and load a new configuration for the array, i.e. the array is capable of self-reconfiguration, but the configurations themselves are static and must have been specified in advance. One of the great advantages of the XPP architecture is its dynamic reconfigurability, i.e. the ability to use each PAE immediately after its configuration. This means that during the configuration of a problem solver, the PAEs already work on the solution, although the configuration is not finished yet. In particular when changing configurations this feature hides most of the latencies incurred by the configuration process. The XPP even supports differential reconfigurability, i.e. to achieve a faster reconfiguration only the difference to the current state must be specified, e.g. the constant coefficient of a filter.
The reason why the dynamic reconfigurability works without losing any data, e.g. when the data stream reaches a PAE which has not been configured yet, is the automatic packet handling with self-synchronizing PAEs. Data words and event signals are sent as one packet per clock cycle and internal protocols ensure that no packets are lost. A self-synchronizing PAE executes its configured operation only if all input data and event packets required for that operation are available; otherwise it waits for the necessary packets to arrive. This makes programming and debugging of configurations much easier, since data is never lost unintentionally and the morphware never executes on undefined data. Overall, we identify the following main virtues of the XPP architecture:

• Massive parallelism.
• Dynamic self-reconfigurability.
• Self-synchronization of data packets.

We use only one array for our implementations. But the architecture also scales well, and designs with several processing arrays controlled by a hierarchy of CMs are possible. In fact, the first presented prototype [Baumgarte et al., 2001] was an arrangement of two 8x8 arrays.
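The self-synchronizing packet handling can be captured in a minimal software model (our illustration): an element fires only once every input port holds a packet, so the arrival order of the operands becomes irrelevant to the result.

```python
class PAE:
    """Minimal model of a self-synchronizing processing element: it
    fires only when a packet is present on every input port, so results
    never mix with undefined data, whatever the arrival order."""
    def __init__(self, op, n_inputs):
        self.op = op
        self.queues = [[] for _ in range(n_inputs)]
    def receive(self, port, packet):
        self.queues[port].append(packet)
    def step(self):
        if all(self.queues):                  # every port has a packet
            args = [q.pop(0) for q in self.queues]
            return self.op(*args)
        return None                           # wait for missing packets

# a + mu*(b - a): the adder must wait for the multiplier's result
add = PAE(lambda x, y: x + y, 2)
add.receive(0, 10)            # a arrives early
assert add.step() is None     # ...but nothing fires yet
add.receive(1, 6)             # mu*(b - a) arrives later
assert add.step() == 16
```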


Figure 4.31 Visual representation of the XPP configuration from Listing 4.3 on the facing page by a screenshot of the xmap tool. The left column contains the IO elements, the middle column the RAM PAE and the right column the ALU PAE. For this example a tiny XPP array has been defined; usually 6 rows lie between the upper left and the lower left IO element. The configuration implements a linear interpolation a + µ(b − a). For this purpose the input slots of the IO elements (a_in, b_in, mu_in) are appropriately connected to the slots of the subtracter (l.sub), multiplier (l.mul) and adder (l.add). Finally, the output slot of the adder is attached to the output slot of an IO element (r_out).

4.3.1.2 Programming

NML, a structural type of Hardware Description Language (HDL), is used to write configurations for the XPP. It specifies the configware (Table 3.2 on page 85) and may also specify the flowware for the problem, depending on the mode in which the IO elements are used. In the stream mode, the data stream assembly and storing must be performed by a data sequencer outside of the chip. In the random access mode, one has to configure several PAEs as address generators within the array to perform the task of the data sequencer.

A configuration of a PAE in NML is given by the operations to be executed and the connections of the inputs and outputs to constants, other PAEs or the IO elements (see Figure 4.31 and Listing 4.3 on the next page). Because of the lossless packet handling, a set of correctly connected and configured PAEs will deliver the correct result for the overall computation irrespective of the actual timings of the operations. So a formula only needs to be split into operations which can be executed on the PAEs, e.g. a + µ(b − a) needs two BREGs (SUB, ADD) and one ALU (MUL) connected as ADD( a, MUL( µ, SUB(b, a) ) ). The first operand a arrives very quickly at the input of the ADD BREG, but the self-synchronization guarantees that it does not execute the addition before the result from the multiplication arrives. In an un-synchronized circuit the addition would execute in each clock cycle and thus combine a with


Listing 4.3 Example configuration of the XPP in the Native Mapping Language (NML). The configuration implements a linear interpolation a + µ(b − a), see Figure 4.31 on the preceding page for a visualization of this configuration. NML keywords are printed in bold, italic emphasizes the predefined names of the input and output slots, e.g. line 11 connects the first input slot of the multiplier (A) to the first result slot of the subtracter (X). Lines 1-19 only define the LERP module; the actual object instances and their placement (@$x,$y) are given in the module main. Lines 23-28 set the names of the input and output lines to the IO elements, lines 30-34 connect these to the slots of an instance of the LERP module. Because only object functionality, connections and placement are specified, this is not a temporal but a spatial programming model, where the order of the statements bears no meaning.

 1  MODULE LERP(DIN a, b, mu, DOUT X)     // linear interpolation
 2  {
 3    X = add.X                           // module output
 4
 5    OBJ sub : BREG SUB @ 0,0 {          // subtracter b - a
 6      A = b
 7      B = a
 8    }
 9
10    OBJ mul : MUL @ 1,0 {               // multiplier mu(b - a)
11      A = sub.X
12      B = mu
13    }
14
15    OBJ add : BREG ADD @ 1,0 {          // adder a + mu(b - a)
16      A = mul.Y
17      B = a
18    }
19  }
20
21  MODULE main
22  {
23    OBJ a_in  : DIN0  @ $0,$1           // a stream in
24    OBJ b_in  : DIN1  @ $0,$1           // b stream in
25    OBJ mu_in : DIN0  @ $0,$0           // mu stream in
26    OBJ r_out : DOUT1 @ $0,$0 {         // r stream out
27      IN = l.X
28    }
29
30    OBJ l : LERP @ $1,$1 {              // module instance
31      a  = a_in.OUT
32      b  = b_in.OUT
33      mu = mu_in.OUT
34    }
35  }

some undefined value, and later the multiplication result with some other undefined value. In un-synchronized circuits the timings of the operations are therefore crucial to their functionality, and only a careful placement of delays delivers the correct result (cf. Figure 3.4 on page 86). For the XPP this is not necessary, but an automatic or hand-made data path balancing with delays should still be applied in a second step to maximize the throughput of the design.

If different configurations have been specified, we need one additional source file which tells the CM which of them should be configured at start-up, which will follow, and which may be selected by event signals from within the array.

Although far fewer details have to be specified in NML than in a typical HDL for FPGAs, it is still a very low level language. PACT envisions a development environment in which a standard HLL like C is split automatically into sequences of well vectorizable code, which are translated into XPP configurations, and inherently sequential code. In a CSoC both the processing array and an ISB processor reside. The processor executes the sequential code and selects the configurations for the array, which performs the computationally demanding parts of the code. First steps in this direction have been undertaken [Becker et al., 2003; Cardoso and Weinhardt, 2002]. There is a growing demand for such flexible CSoCs [Becker, 2002], but it is known from similar approaches in Reconfigurable Logic (RL) that the automatic extraction and parallelization of code from a sequential language is very difficult, and a naive approach often results in bad resource utilization. Our implementations were coded directly in NML to exploit the full processing power of the relatively few PAEs.

The NML programmer is supported by the XPP development suite (XDS), which consists of three programs: xmap, xsim and xvis. The first maps NML onto the array, i.e. places the PAE configurations and routes the inputs and outputs. The second simulates the entire execution in a clock accurate manner. The third visualizes the computation process. Unlike an FPGA, where finding an error by observing the computation on thousands of lookup tables (LUTs) is quite impossible, here the visual presentation allows one to trace individual data packets through the array and thus hopefully identify the error. Also, the development cycle which runs through these three steps is much shorter than for FPGAs, where the place&route step easily consumes hours after each modification. The only problem when working with XDS, which also applies to other hardware simulation tools, is that the simulator requires a lot of time for the clock cycle exact modification of all states in the XPP. So tests could only be run on very small data sets and not entire images. The simulation also does not cover all aspects of an implementation, since the processing with the XPP usually depends on external memory access. But the now available PCI board including the XPP64-A1, SRAM memory and a micro-controller offers a fully realistic development environment.

4.3.2 Computations

We discuss the data-flow for image processing and some computational aspects of the architecture. The coarse grain architecture offers many advantages but also some restrictions in comparison to fine grain FPGAs (Section 4.2.2 on page 159).


4.3.2.1 Data-Flow

In general the data-flow is very similar to the FPGA setting in Section 4.2.2.1 on page 159. But unlike the FPGA case, where we can also trade computational resources for local memory, here we must deal with a predefined ratio of ALUs to RAMs. The guiding idea is again to efficiently utilize the available memory bandwidth and all computing resources of the device, since bandwidth shortage is often the bottleneck (cf. Section 3.1 on page 71) and spatial computing translates for reconfigurable architectures directly into performance gains (Section 3.3.1.4 on page 100).

Because of the two dimensional arrangement of the PAEs with the RAMs in the side columns, we can have a quadratic stencil of PAEs operate on a corresponding quad in the image. Figure 4.32 on the next page shows a 3x3 filter arranged in this way. The simple example gives a very good idea of how the local data-flow in image processing is usually set up. The spatial arrangement of the ALUs corresponds to a local neighborhood in the image, and the RAM elements serve as line caches for maximal data reuse. More complex schemes operate similarly, adding the simultaneous processing of several streams and their interference.

These local data-flow structures are simpler to achieve on the XPP than on FPGAs, because the configuration of a RAM block requires only a few lines to use it as a FIFO with a specified delay. Also, there is a clear correspondence between the arithmetic operations and their functional units on the chip. Moreover, the configuration is more flexible, because with differential reconfigurability one can, for example, quickly change the constant weights for the multipliers. For a fully adaptive filter the weights have to change constantly, which can be achieved by defining the coefficients of the multipliers with previous computation results (see Section 4.3.3.1 on page 176).

The global data-flow is governed by the data sequencer which assembles and writes the data streams from and to external memory. From the description of the local data-flow in gather operations above, it is clear that the larger the image and the stencil on which we operate, the more local memory is needed for the caching. If the local memory suffices, then we may stream the entire image at once and no explicit address generation is required. But if the local memory is too small, we must split the image into smaller tiles and process these tiles individually. Unless a host processor organizes the data stream in a tile order, we need to invest some of the PAEs in the generation of the appropriate addresses in the array. These address generators are mainly counters which have to resolve different cases concerning the interior borders between tiles and the exterior border of the entire image. Because boundary conditions for the image domain can be realized by duplicating border pixels, it is often more convenient to generate these ghost cells with the address generator than to disturb the regular structure of the parallel computation with conditional statements. However, internal address generators should only be applied if memory bandwidth is not the limiting factor, because the IO elements have twice the bandwidth in the stream mode in comparison to the random access mode.
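The case distinctions such an address generator must resolve can be sketched in software. The following Python model (an illustration, not the actual PAE counter configuration) emits addresses in tile order and realizes the ghost cells by clamping to the image border:

```python
def tile_stream(width, height, tile_w, tile_h, halo=1):
    """Model of an address generator: yields (x, y) pixel addresses in
    tile order, extending each tile by a halo of ghost cells that are
    realized by duplicating (clamping to) the border pixels."""
    clamp = lambda v, lo, hi: max(lo, min(v, hi))
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            for y in range(ty - halo, ty + tile_h + halo):
                for x in range(tx - halo, tx + tile_w + halo):
                    yield clamp(x, 0, width - 1), clamp(y, 0, height - 1)
```

Each 2x2 tile with a one-pixel halo produces a 4x4 address block, so a 4x4 image split into four tiles yields 64 addresses, all within the image domain.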


Figure 4.32 Implementation of a general 3x3 filter in the XPP. A screenshot of the xmap tool after routing and placement with added information is shown. The filter is applied by processing an image as a stream of pixels in row order. The pixel stream enters with the upper line from the right. The bold numbers give the order in which the stream traverses the functional units. The multipliers MUL which apply the weights to the pixels are arranged in the same way as the 3x3 quad in the image they operate on. The light italic numbers give the order in which the result stream, as the sum of the multiplications, is generated. It is output by the lower line to the right. Because of the fully pipelined arrangement with local FIFO caches, in each clock cycle one input pixel enters and one result pixel leaves this configuration. The pixel values are shifted in the beginning (SHIFT 0), such that the integer multiplications correspond to the desired fixed point number format. Subsequently the pixels traverse the multipliers of the first row (MUL 1-3) and are then cached in the FIFO (4). The delay in the cache is such that, when the pixel leaves the cache and is processed by MUL (5), the result of the multiplication is added by ADD (3) to the partial sum (coming from ADD 2) of the corresponding weighted pixels from the image line above. After passing the second row of multipliers (MUL 5-7) the pixel stream is delayed again in the same manner. So for any multiplier and adjacent adder the pixel and result streams are synchronized, so that the weighted sum of the 3x3 neighborhood is accumulated one by one. Before being output, the final result is shifted back to the original format (SHIFT 9). The extension to larger filters operating in the same fashion is obvious. Here in each clock cycle 9 multiplications and additions are performed, but only one pixel is read, minimizing the required memory bandwidth.
The numbers apply to a general filter; a separable filter performs the horizontal and vertical calculation separately, requiring only 6 multiplications and 4 additions.
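The interplay of line caches and pipelined arithmetic described in the caption can be modeled in software. The following Python sketch (an illustrative model, not the NML configuration itself) consumes a row-order pixel stream, uses line buffers in place of the FIFO caches, and emits one interior result pixel per input pixel after the initial latency:

```python
def stream_filter3x3(pixels, width, w):
    """Software model of the pipelined 3x3 filter: the image arrives as a
    row-order pixel stream, two line buffers cache the previous rows, and
    each input pixel beyond the initial latency yields one output pixel.
    Border handling is omitted: only interior pixels are emitted.
    w is the 3x3 weight matrix."""
    rows = [[], [], []]          # rows y-2, y-1 and the current row y
    out = []
    for p in pixels:
        rows[2].append(p)
        x = len(rows[2]) - 1
        if len(rows[0]) and x >= 2:   # 3x3 window around (x-1, y-1) complete
            out.append(sum(w[r][c] * rows[r][x - 2 + c]
                           for r in range(3) for c in range(3)))
        if x == width - 1:            # end of line: shift the line buffers
            rows = [rows[1], rows[2], []]
    return out
```

With the identity kernel the model simply returns the interior pixels, which makes the synchronization of the line buffers easy to verify.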


4.3.2.2 Operations

The available operations in the individual components of the XPP have already been discussed in Section 4.3.1.1 on page 166. All operations compute on integers of a fixed width. The ALU PAE has two inputs and two outputs. The two outputs allow the delivery of the exact result of a multiplication, i.e. a high and a low word, and together with the inclusion of carry over functionality in the addition they could also be used to support wider data formats. To obtain a fixed point number format from the integers, we have to shift the numbers appropriately. For linear computations this is ideally done at the entrance of the data stream, only once for each element (cf. Figure 4.32 on the facing page). Whether the operands in operations are treated as signed or unsigned integers is determined by the choice of the operation at the ALU. So with the 24 bit integer format we have a signed fixed point format of the same size as the mantissa plus sign of the s23e8 floating point format used in DX9(+) graphics hardware. In general the missing exponent is not crucial, because we have designed the numerical schemes to operate on similar number scales.

But without floating point ALUs, non-linear functions cannot be represented accurately on the XPP. The RAM blocks can be used as very quick LUTs, but they allow only 9 address bits. A masking mechanism can be used to concatenate several RAM blocks into a common LUT, but even all of them would result in just 13 bit precision. GPUs do not offer 24 bit wide LUTs either, but their predefined non-linear functions like reciprocal, power and trigonometric functions can represent the results accurately in the floating point format. The 8 bit precise addressing of a LUT in the XPP is not that bad for one dimensional non-linear functions, since one can improve the accuracy by linear interpolation, but for two and more dimensional functions, e.g. for vector normalization (cf. Eq. 4.8 on page 135), one would have to use a LUT in external memory.
After all, GPUs do exactly this in dependent texture accesses. In an FPGA one could encode a LUT of exactly the required minimal precision, but for two and more dimensions this would still consume too many resources, and again external memory would have to be used. So if floating point operations are important for the envisioned application area, it makes sense to include corresponding hardwired ALUs in the chip. In the case of the XPP one could simply replace the fixed point PAEs by floating point PAEs. But such a decision would have to be made before chip fabrication.
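The accuracy gain from linearly interpolating a small LUT, as suggested above for one dimensional functions, is easy to quantify in a software model. The sketch below is illustrative: the 512 entries match the 9 address bits of a RAM PAE, but the interpolation logic and the test function are ours, not the solver's exact setup.

```python
def make_lut(f, smax, n=512):
    """Tabulate f on [0, smax] with n entries; n = 512 corresponds to the
    9 address bits of an XPP RAM PAE."""
    return [f(smax * i / (n - 1)) for i in range(n)]

def lut_eval(lut, smax, s, interpolate=True):
    """Evaluate f(s) via the table, optionally refining the result by
    linear interpolation between neighboring entries."""
    t = min(max(s / smax, 0.0), 1.0) * (len(lut) - 1)
    i = min(int(t), len(lut) - 2)
    if not interpolate:
        return lut[round(t)]           # plain nearest-entry lookup
    frac = t - i
    return lut[i] + frac * (lut[i + 1] - lut[i])
```

With g(s) = 1/(1+s^2) as a stand-in for an edge-stopping function, the interpolated evaluation is clearly more accurate on [0, 4] than the nearest-entry lookup.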

4.3.3 Non-Linear Diffusion

We configured the XPP with a semi-implicit solver for the non-linear diffusion model for image denoising. The model is the same as in Section 4.1.4 on page 132, where a graphics hardware implementation has been presented, but here we restrict ourselves to the isotropic non-linear diffusion case. The company site [PACT] presents some examples of basic signal and image filters, but because the architecture is fairly new and has been made available in hardware only recently

implementations for it are still sparse. An overview of other Reconfigurable Computing (RC) devices used for Digital Signal Processing (DSP) and in particular image processing can be found in [Sueyoshi and Iida, 2002].

4.3.3.1 Implementation

We use the semi-implicit scheme

\[
A'(\bar U^n_\sigma)\, \bar U^{n+1} = \bar U^n ,\qquad
A'(\bar U^n_\sigma) := \mathbb{1} + \frac{\tau^n}{h^2}\, L'[\bar U^n_\sigma] ,
\tag{4.12}
\]
and apply a quantized version of the Jacobi solver (Eq. 2.39 on page 45) to solve the linear equation system

\begin{align}
F(\bar X^l) &= D'^{-1}\bigl(\bar R - (A' - D')\,\bar X^l\bigr) , & D' &:= \operatorname{diag}(A') \tag{4.13}\\
&= \Bigl(\mathbb{1} + \tfrac{\tau^n}{h^2}\, L'_D\Bigr)^{-1}\Bigl(\bar R - \tfrac{\tau^n}{h^2}\bigl(L' - L'_D\bigr)\bar X^l\Bigr) , & L'_D &:= \operatorname{diag}(L') .
\end{align}
The prime indicates that we deal with the isotropic non-linear diffusion in contrast to the anisotropic diffusion model (see Eq. 2.1 on page 16). The matrix assembly is simpler in this case. The components of the stiffness matrix $L'[\bar U^n_\sigma]_{\alpha\beta}$ are given as a weighted sum of integrations over the elements adjacent to the current node (cf. Figure 2.1 on page 27):

\begin{align}
L'[\bar U^n_\sigma]_{\alpha\beta} &= \sum_{E \in \mathcal{E}(\alpha)} G'^{\,n}_E\, S'^{\,\alpha\beta}_E \tag{4.14}\\
G'^{\,n}_E &:= \tilde g\bigl(\nabla U^n_\sigma(m_E)\bigr)\\
S'^{\,\alpha\beta}_E &:= \bigl(\nabla\Phi_\alpha,\, \nabla\Phi_\beta\bigr)\big|_E ,
\end{align}
where $G'^{\,n}_E$ are the element weights and $S'^{\,\alpha\beta}_E$ are pre-integrated constants depending on the Finite Element (FE) basis functions; they evaluate to

\[
S'^{\,\alpha\beta}_E =
\begin{cases}
\;\;\tfrac{2}{3} & \text{if } \alpha = \beta \\
-\tfrac{1}{3} & \text{if } |\alpha - \beta| = (1,1) \\
-\tfrac{1}{6} & \text{if } |\alpha - \beta| = (1,0) \\
-\tfrac{1}{6} & \text{if } |\alpha - \beta| = (0,1) \\
\;\;0 & \text{else.}
\end{cases}
\]
Obviously, this gives a simpler matrix assembly than the general explicit formula in the anisotropic case (Eq. 2.45 on page 46).
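The assembly (Eq. 4.14) and the case table can be checked with exact rational arithmetic. The following Python sketch (our re-derivation for an interior node, not code from the implementation) assembles the 3x3 node stencil from the four adjacent element weights and verifies that constant images lie in the kernel of L' for constant weights:

```python
from fractions import Fraction as F

# Pre-integrated element constants S'^{ab}_E from the case table above,
# keyed by |alpha - beta| within one bilinear element.
S = {(0, 0): F(2, 3), (1, 1): F(-1, 3), (1, 0): F(-1, 6), (0, 1): F(-1, 6)}

def node_stencil(g):
    """Assemble the 3x3 stencil of L' at an interior node from the weights
    g = [SW, SE, NW, NE] of the four adjacent elements (Eq. 4.14)."""
    st = {(dx, dy): F(0) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
    corners = {(-1, -1): g[0], (1, -1): g[1], (-1, 1): g[2], (1, 1): g[3]}
    for (sx, sy), w in corners.items():
        # within element (sx, sy) the node couples to itself, its two
        # edge neighbors and the diagonal neighbor
        for dx in (0, sx):
            for dy in (0, sy):
                st[(dx, dy)] += w * S[(abs(dx), abs(dy))]
    return st
```

For unit weights the stencil reproduces the standard bilinear Laplacian (8/3 in the center, -1/3 on all eight neighbors), and its entries sum to zero.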

The XPP makes it feasible to run an implicit solver, which avoids the time-step restrictions of the explicit schemes, because we have a higher internal precision (24 bit) and dynamic reconfigurability. Fast reconfigurability is important, because the implicit scheme requires


Figure 4.33 Configurations for the non-linear diffusion solver in the XPP architecture. On the left we see the pre-calculation of weights, on the right the linear equation solver for the implicit scheme. The second configuration can execute several times by exchanging the role of input and output elements on the right side of the XPP. Partial reconfigurability allows both configurations to share the large Jacobi solver arrangement and thus reduce the configuration overhead.

two different configurations (see Figure 2.3 on page 64 and Algorithm 4.3 on page 137). The first configuration performs the mollification of the gradient $\nabla U^n_\sigma$ with a linear diffusion and then computes the weights $G'^{\,n}_E$ for each element. The second configuration executes several iterations of the Jacobi solver (Eq. 4.13 on the preceding page). To minimize the switching between the configurations, the Jacobi solver block can be used for both the main linear equation solver and the mollification. Only the connection to the appropriate weights must be changed (Figure 4.33).

The first configuration takes $\bar U^n$ as input and computes $G'^{\,n}_E := \tilde g(\nabla U^n_\sigma(m_E))$ as output. The mollification uses the Jacobi solver with equal weights for each element. The non-linear function $\tilde g$ is evaluated with a lookup table implemented in a RAM PAE. The RAM elements can hold $2^9$ different values, so that the evaluation of $\tilde g$ cannot take place with full precision. To reduce the resource requirements in the second configuration, we also precompute the inverse factors $(\mathbb{1} + \tfrac{\tau^n}{h^2} L'_D)^{-1}_\alpha$ with a lookup table and store them together with the weights in one word, i.e. with a resolution of 12 bit each. The values have to share one word, because there are not enough RAM elements for the necessary caching in the following configuration. In the second step, the configuration of the Jacobi solver remains the same, only the weights are not constant any more, but based on the results from the previous configuration. In the implementation we connect the coefficients of the general 3 by 3 filter from Figure 4.32 on page 174 to the matrix components resulting from Eq. 4.14 on the facing page. The available computing resources are exploited by a parallelism in depth approach, i.e. each configuration performs 4 Jacobi steps in a pipelined fashion by utilizing the RAM PAEs for the caching of the resulting intermediate image lines. As the diffusion process on linear FEs involves three lines for the local computation (see Figure 2.1 on page 27), two lines must be cached in each iteration step. The 8 RAM PAEs on the right side of the array serve this purpose (cf. Figure 4.29 on page 167). Besides the caching of the intermediate iteration vectors $\bar X^l$ we

must also cache $\bar U$, which serves as the right hand side in the equation, and the weight factors. This is done by the RAM elements on the left side of the array. Because of the restricted number of these elements, for a high number of pipelined iterations we must pack the weight and inverse factors into one word. The second configuration is run several times until the residual error becomes sufficiently small. In contrast to graphics hardware, which mainly applies hardwired parallelism in breadth on separate pipelines and thus cannot easily incorporate a global associative operation on all values of an image into the hardware, a single PAE suffices here to compute the 2-norm of the residual. A similar PAE could also be used to compute the mass defect after the iterations and dither this defect uniformly back into the solution (Eq. 2.52 on page 49) to preserve the overall mass exactly. This would require a long precomputed random list of positions for the image and some more ALU resources for the application of the dither. From a practical point of view this is not really necessary, since the higher internal precision (24 bit) keeps the mass defect low. The implicit scheme now allows us to specify larger time-steps, but this choice also affects the accuracy of the computation. If we increase $\tfrac{\tau^n}{h^2}$ in Eq. 4.13 on page 176 significantly, then our fixed-point number format must have many digits before the binary point, and we will lose many significant digits in the representation of $\bar R$. This is not a problem of the fixed-point format, but a general problem of fixed precision number formats. A floating point format could retain the precision in the representation of both addends, but during the addition the same digits would be lost. So a larger time-step width simply requires a larger number format if we do not want to lose significant information in the process.
In practice, this means that the time-step width in implicit schemes becomes restricted by the size of the used number format. For low precision number formats the restrictions for the implicit schemes are almost as severe as for the explicit ones, and thus the implicit formulation becomes less useful. Therefore, we have favored the explicit schemes for the 8 bit formats in graphics hardware, but here the use of the implicit schemes makes sense.
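The loss of significant digits for a large time-step factor can be illustrated with a toy fixed-point quantizer (a simplified model of ours; only the 24 bit width is taken from the XPP, the rounding details are illustrative):

```python
def quantize(x, total_bits, int_bits):
    """Round x to a signed fixed-point format with total_bits bits, of
    which int_bits lie before the binary point (one bit for the sign)."""
    frac_bits = total_bits - 1 - int_bits
    step = 2.0 ** -frac_bits
    return round(x / step) * step

r = 0.123456789              # a sample component of the right hand side R
small_step = quantize(r, 24, 1)    # small tau/h^2: 22 fractional bits remain
large_step = quantize(r, 24, 12)   # tau/h^2 ~ 2^12: only 11 fractional bits
```

The same 24 bit budget represents r far less accurately once the format must also hold values of the order of the large time-step factor.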

4.3.3.2 Results

Since the clock cycle exact hardware simulation consumes too much time, we could not generate results on entire images with the XPP development suite, but only check the functionality of individual components with small data sets. Instead, we present results computed with a self-made C simulation, which performs the individual steps of the algorithm with exactly the same precision as in the XPP. Figure 4.34 on the next page presents some results. The advantage of the clock cycle exact simulation is a reliable estimate of the throughput of the whole design. The four iterations of the Jacobi solver operate in a fully pipelined fashion, so that after an initial latency we obtain one result pixel in each clock cycle. Assuming a clock frequency of 40MHz (the frequency of the now available hardware development board from PACT), this amounts to more than 2400 Jacobi iterations per second on a 256² image, or equivalently one iteration takes approx. 0.41ms.
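The throughput estimate follows from the figures stated above with a few lines of arithmetic:

```python
clock_hz = 40e6          # clock of the PACT development board
pixels = 256 * 256       # image size
pipelined_iters = 4      # Jacobi steps per pass through the configuration

passes_per_s = clock_hz / pixels   # one pixel per clock, fully pipelined
iters_per_s = passes_per_s * pipelined_iters
ms_per_iter = 1000.0 / iters_per_s
```

This yields roughly 2440 Jacobi iterations per second, i.e. about 0.41ms per iteration, matching the numbers in the text.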


Figure 4.34 Non-linear diffusion as configured on the XPP architecture. The actual computation was performed by a simulator. The results of the second, fourth and fifth time-steps are shown. The mollification performed 4 and the implicit solver 8 iterations for each time-step.

It is worth noting that these values are very similar to the performance of the level-set solver on the FPGA (Section 4.2.3.2 on page 163). The difference stems only from the slightly different clock frequency, since both designs apply a similarly pipelined arrangement with a four-fold parallelization of the computation. Although the XPP has approx. 3 times more transistors, the coarse grain approach is more efficient, since the Jacobi solver requires more operations in each step than the level-set scheme, and the XPP computes everything in a twice as wide number format. In comparison to the GeForce3 GPU (8ms for an explicit anisotropic time-step, see Section 4.1.4.2 on page 136), which has approximately the same number of transistors but a much higher clock frequency, the XPP does very well, because in each step it can devote almost all of its computing resources to the current task in parallel, while only some of the hardwired PEs could be utilized on the GPU in each intermediate step. This handicap has been greatly reduced in the current DX9 graphics hardware. A configuration of the XPP with 4 steps for the mollification and 8 steps for the weighted solver could run 5 time-steps (as in Figure 4.34) in real-time (40Hz) on 256² images. Certainly, we need to subtract the times for the changing of the configurations and the initial latencies at the beginning of each iteration, but these times range in the order of $10^3$-$10^4$ cycles and thus would account for at most a 10% slowdown. Moreover, the frequency of 40MHz on the hardware development board is fairly low, and dedicated image processing solutions based on the XPP could be clocked much higher.
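The real-time claim can also be checked arithmetically, assuming (as suggested by the pipelined arrangement) that the 4 mollification iterations need one pass over the image and the 8 solver iterations two passes, with latencies and reconfiguration times neglected:

```python
clock_hz = 40e6
pixels = 256 * 256
passes_per_step = 1 + 2   # mollification: one pass of 4 iterations;
                          # solver: two passes of 4 = 8 iterations
cycles = 5 * passes_per_step * pixels   # 5 time-steps, overhead neglected
frame_rate = clock_hz / cycles
```

The result is just above 40Hz, leaving the estimated at most 10% overhead as the margin discussed above.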


4.4 Comparison of Architectures

There are many competing aspects associated with the choice of a hardware architecture, e.g. performance, flexibility, design effort, power consumption, costs. For each of these aspects there are ideal architectures, and the optimization between any two of them can still be handled to a certain extent, but in practice we always want more of them at once. In Section 3.3 on page 97 we have discussed several architectures on a high level with a focus on DSP and in particular image processing. We now continue the discussion referring to the experience with the different implementations.

4.4.1 Instruction- and Data-Stream-Based Architectures

For the implementations we have used data-stream-based (DSB) architectures exclusively. Chapter 3 motivates this choice in detail. It elaborates on the memory gap problem, which hits instruction-stream-based (ISB) architectures harder (Section 3.2.3 on page 92), and discusses the advantages of spatial computing over temporal computing (Section 3.3.1.4 on page 100). It is legitimate to ask what the price for the increased performance is. One of the key points of DSB architectures is that they require us to disclose more information about the internal structure of the problem. A software program for an ISB processor actually contains very little information about the nature of the problem. From the viewpoint of the processor it is a long inflowing stream of instructions of which it sees only very little ahead. Jump commands can make the stream change rapidly, but despite the short visibility of these events we expect the processor to keep up the pace. The only chance to do this is to speculate on the jump destination and continue the computation in this direction. It is a bit like driving on a densely fogged road and trying to guess which bend comes next. If the guess is wrong, one has to brake harshly and re-accelerate again. Naturally, one could go much faster if one saw more of the road ahead. The very fact that the processor has to make guesses about our code reveals how little information we have provided to support a fast execution. DSB architectures already require us to divulge separate information about the assembly of data streams in form of flowware (the road ahead) and the configuration of the processing elements (PEs) in form of configware (the engine setup).
This distinction is inherent to DSB processing even if the different sources execute on the same device (Sections 4.2.1.2 on page 158 and 4.3.1.2 on page 170), because the inability to provide timely insight into the assembly of data streams (bad road visibility) and insufficient computing power of the PEs (weak engine) are independent limiters of performance. Being forced into a dual programming model, one is constantly aware of the importance of balancing the different demands. This further structuring of the problem solver requires an additional design effort by the programmer. But it is reasonable to expect that the programmer has a better overview of the program and the required structuring than the PEs which are concerned with the low-level execution. There is a continuous trade-off between the degree of problem structuring and the obtained performance. The concept of spatial computing marks another major step on this scale. The GPU,

which still applies temporal computing, differs from the FPGA and XPP in this respect (see Section 4.4.2). By defining the execution in the spatial domain one reveals much more about the number of required resources, their communication and the frequency of configuration changes. The reward is a highly flexible parallelism and a higher resource utilization, but the design effort also increases. Going further down into smaller PE structures and the explicit synchronization and pipelining of elements, one can increase performance further. But the general concepts of efficient data stream generation and spatial computing already capture the principal advantages. In terms of code reuse and flexibility it is also not advisable to resolve each problem into low-level structures explicitly. This can be done to some extent by compilers. In fact, the lack of tools which could translate code from a high abstraction level, adhering only to the main DSB concepts of efficient configurations, hinders the faster spreading of DSB architectures. The abstraction entry point still seems too low for many developers. On the other hand, the entry point of standard software High Level Languages (HLLs) is so high and conveys so little information about the problem structure in explicit terms, that attempts to use it as a source for efficient configuration of DSB architectures depend on auxiliary directives in the HLLs. New, DSB oriented HLLs will hopefully fill this gap in the long run.

4.4.2 Graphics Hardware and Reconfigurable Hardware

All three architectures (Graphics Processor Unit (GPU), Reconfigurable Logic (RL), Reconfigurable Computing (RC)) have evolved dramatically in the last 10 years. RL has a slightly longer success story, but 10 years ago FPGAs were small devices not meant to compete with DSP- or micro-processors. For GPUs the general computing capabilities came up even more surprisingly, and although the functionality demands still stem from the computer game industry, GPU manufacturers now pay attention to the scientific computing possibilities of their products. Finally, RC machines are probably the least represented architectures in the market, but a number of ambitious start-ups promote their advantages, and several prototypes and first commercial products are available. Performance analysis of RL over micro-processors for DSP applications has been thoroughly examined, see for example [Benitez, 2003; Guo et al., 2004; Tessier and Burleson, 2001]. Performance of RC machines on DSP kernels can be found in [Sueyoshi and Iida, 2002; Suh et al., 2003]. For GPUs an overview of image processing implementations is not known to us. The site [GPGPU] contains a thematic collection of abstracts and links to general computing papers on GPUs. In the following we compare the architectures based on the previous sections.

• Graphics Processor Unit (GPU)

GPUs have improved a lot not only in terms of performance but also in terms of programmability (Section 4.1.1.3). The first implementations (Sections 4.1.3, 4.1.4) had to use a complicated translation from the world of numeric computations to the functionality of the fixed graphics pipeline (Figure 4.1). In particular, the severe restrictions of the number formats required many workarounds (Section 4.1.2.2). Only with the programmability of the Fragment Processor (FP) and the floating point formats in DX9 did GPUs gain more attention as solvers of scientific problems. This flexibility allows many more numerical concepts to be implemented (Section 4.1.5 on page 143). The implementation of the registration problem from Section 4.1.5 on page 143 in either RL or RC would presumably deliver higher performance, but the design effort would also be much higher. The flexibility of GPUs grew mainly in the processing elements (PEs), which now offer functions not even supported directly by the ALUs in RC arrays (Section 4.1.2.3). The free programmability of the PEs also allows a better resource utilization than the hardwired, parameter controlled PEs did. In the direct performance comparison of the level-set (Section 4.2.3.1 on page 161) and the non-linear diffusion solver (Section 4.3.3.1 on page 176) the FPGA and XPP win against the older GPUs exactly due to better resource utilization. GPUs apply a fixed parallelism in breadth given by the number of available pipelines. Parallelism in depth takes place in the different stages of the pipeline and the individual PEs. The pipeline itself is deep, but the PEs have few internal stages. The parallel use of these internal stages is hidden from the programmer and depends on the compiler optimizations in the temporal domain. Because of these restrictions the design of configurations has become fairly easy in comparison to Hardware Description Language (HDL) models for reconfigurable hardware. But the implicit command scheduling and caching of data for reuse requires more control resources not used for the actual computation. Graphics hardware is partly also burdened with compatibility issues which require the realization of the whole graphics pipeline, although the programmable parts could inherit almost all of this functionality.
However, in comparison to micro-processors, where the L2 cache consumes most of the transistors, GPUs devote by far the majority of their resources to computation. It is a difficult question how the fixed graphics pipeline could be released in favor of more configurable data-flows (Section 4.1.2.1) without sacrificing the simpler programming model. Where RL and RC architectures save on bandwidth requirements by application specific caches and extra long pipelines, GPUs are very bandwidth hungry and access external memory very often. This made them very suitable for problems with low computational intensity, but also dependent on this expensive fast external memory connection. With DX9+ GPUs the trend of balancing memory bandwidth and PE requirements started to shift towards a higher number of PEs. Figure 4.25 illustrates how in this case parallelism in depth must be used to increase performance. GPUs will probably not adapt to spatial computing for this, but the hierarchical stream caches of stream processors (see Section 3.2.4 on page 93) could be an option. Recapitulating, we can say that GPUs are currently the most easily accessible DSB architecture. They lack programmable parallelism in depth and data-flow flexibility, but compensate this with fixed parallelism in breadth and wide memory buses at high clock frequencies. The high clock frequencies still cannot compete with the utter parallelism in large RL or RC devices, and they render low power applications unattainable. But


the mass market implies a fantastic price-performance ratio and standard APIs ensure compatibility.

• Reconfigurable Logic (RL)

RL concentrated for a long time on fully homogeneous fine grain arrays (Figure 4.22). This approach guarantees utmost flexibility in the design of the circuits. Arbitrary operations on application specific number formats with variable precision arithmetic can be realized (Section 4.2.2.2). There is not even a clear distinction between PEs and memory, since the Configurable Logic Blocks (CLBs) can serve as both (Section 4.2.2.1). So all parallelization options can be fully exploited. The performance advantage of the FPGA (Section 4.2.3.2) against the GPU implementation (Section 4.1.3.2) for the level-set equation grew out of the multiple parallelization options.

The price for this universality is the overhead associated with the configuration logic. The same design in an ASIC consumes four to six times fewer transistors, which, besides the costs and performance, is often crucial in terms of power consumption. So RL depends on an efficient use of its resources. The approximations applied in Section 4.2.3.1 had the aim of replacing resource hungry operations (Table 4.4) by simpler ones. But our use of CLBs as memory, for example, consumes many more transistors than a dedicated memory block. Because these situations appeared in many implementations, modern FPGAs have many hardwired multipliers of a certain width and dedicated RAM blocks. Some FPGAs even embed entire processor cores. So one sacrifices the full reconfigurability, but gains performance and lower power consumption for all applications which would otherwise implement the hardwired functionality in the CLBs. By further mixing reconfigurable logic, DSP-processors and specialized hardware components on a Configurable System-on-a-Chip (CSoC), one obtains a continuous space expanding from fully general to very specific solutions.

Naturally, this heterogeneous environment makes the design of configurations even more demanding than before (Section 4.2.1.2). But the design tools have matured, and the synthesis of configurations from dialects of common HLLs now delivers respectable results. Hardwired blocks also reduce the inherent problem of fine grain architectures with large configuration bitstreams. The large bitstreams make it difficult to change the functionality in the FPGA quickly. This is not crucial if the entire application fits into the FPGA as in Section 4.2.3.1, but for large projects divided into several parts, or for different applications requiring fast transitions, it means a performance penalty. Partial reconfigurability helps to hide latencies in these cases, and the hardwired parts need far less configuration data.

FPGAs are still the most flexible DSB architectures. The parts with fixed functionality are usually so general (multipliers, RAM blocks) that most configurations benefit from them. Performance in comparison to other architectures depends strongly on the application, i.e. on whether its components can be easily implemented by the predefined PEs in GPUs or RC arrays or not. Power consumption per transistor is high, but the high flexibility allows using smaller devices and reconfiguring them repeatedly as needed. Compatibility among different systems is hardly given, but sufficiently abstract HDL code can be synthesized on different devices. Finally, mass production secures moderate costs of FPGAs and a continuous evolution of design tools.

Reconfigurable Computing (RC) • RC architectures give a different answer than the inhomogeneous FPGA-based CSoCs to the overhead associated with bit-level reconfigurability. They homogeneously specialize the data paths to fixed widths and lift the operational configurability to the arithmetic level (Section 4.3.1.1). This immensely reduces the configuration stream, which allows dynamic reconfigurability and greatly simplifies configuration design and debugging (Section 4.3.1.2). However, current design tools for RC lag behind the RL development, such that in practice the design effort is not necessarily smaller. While RC systems retain the numerous opportunities for parallelization, the fixed data path width and operation set miss some of the application specific optimizations (Section 4.3.2.2). This can even render an implementation totally unfeasible if the emulation of a special operation would consume too many resources in the RC chip. In this respect RC systems are similar to GPUs or DSP-processors, which must be defined with an operation set suitable for the envisioned application area before fabrication. But if the set fits the application, then very high performance and low power consumption similar to an ASIC can be achieved with RC. The superior performance of the slowly clocked XPP for the non-linear diffusion solver (Section 4.3.3.2) against the GPU implementation (Section 4.1.4.2) exemplifies this parallelization advantage. RC machines offer a very good compromise between high performance, low power consumption and flexibility. If the internal PEs contain the required functionality for a certain application area, then they can provide the best resource utilization in this area.
But because the fabrication of commercial products started only a few years ago, there are still several drawbacks. The available devices are relatively small and expensive. There are several competing RC systems with different internal structures on the market, so compatibility will not be given for a long time either. Also, the development suites are not as comfortable and powerful as for other devices, which in practice results in high design effort. But these shortcomings seem to be of a temporary nature.

4.4.3 Efficiency and Abstraction

The presented hardware architectures are very fast developing technologies. They approach the market of scientific computing from fields as different as computer games, rapid prototyping and telecommunications. They draw nearer to each other by adopting similar concepts: GPUs introduce more configurability into the graphics pipeline, and FPGAs partly coarsen their structure similar to RC machines. One even observes a weak convergence of architectures on a larger scale. Micro-processors include more parallel processing elements (PEs), and some DSP processors contain reconfigurable islands. Different hybrids made up of DSP-processors, micro-processors and reconfigurable hardware evolve as CSoCs.

This convergence will certainly not lead to one ideal architecture; the requirements of applications are too different for this to happen. But market forces will prefer consolidation into a few designs suitable for mass production. Since PEs are always arranged in the spatial domain, an important question is whether High Level Languages (HLLs) can be developed which efficiently exploit this spatial parallelism on the raw PEs. Current Hardware Description Languages (HDLs) are still quite inaccessible to high level programmers, but more abstract approaches require additional control logic around the PEs, which reduces efficiency. Our results in the field of image processing suggest that RC arrays offer the most convincing combination of performance and flexibility. But their development tools lag behind the FPGA design suites; therefore, FPGAs have already occupied many areas of the DSP market. Because the market is often ready to sacrifice even more hardware performance for ease of programming, graphics hardware with its easy access to data-stream-based (DSB) processing could finally win the race. It does not have a perfect resource utilization, but many may be willing to pay this price to avoid dealing with an unfamiliar spatial programming model.

It is clear that the more we uncover the structure of an algorithm to the hardware, the faster we can run an implementation. Most current software reveals so little about the structure of the implemented algorithms that its performance is very poor. We cannot continue in this way. But the extreme of always operating on the lowest implementational level is not an option either. We are really in need of HLLs which account for the main hardware characteristics without sacrificing system compatibility and algorithmic abstraction. This ideal balance between hardware efficiency and program abstraction is yet to be found.

185 4 Hardware Efficient Implementations

186 Bibliography

J. Abel, K. Balasubramanian, M. Bargeron, T. Craver, and M. Phlipot. Applications tuning for streaming SIMD extensions. Intel Technology Journal Q2, 1999. 98

S. T. Acton. Multigrid anisotropic diffusion. IEEE Transactions on Image Processing, 7: 280–291, 1998. 43

J. Alakarhu and J. Niittylahti. DRAM performance as a function of its structure and memory stream locality. Microprocessors and Microsystems, 28(2):57–68, Mar 2004. 71

L. Alvarez, F. Guichard, P. L. Lions, and J. M. Morel. Axioms and fundamental equations of image processing. Arch. Ration. Mech. Anal., 123(3):199–257, 1993. 19

L. Alvarez, J. Weickert, and J. Sanchez.´ Reliable estimation of dense optical flow fields with large displacements. International Journal of Computer Vision, 39:41–56, 2000. 24

AMD. AMD Athlon 64 Processor. http://www.amd.com/athlon64/, 2004. 81

P. Atanassov and P. Puschner. Impact of DRAM refresh on the execution time of real-time tasks. In Proc. IEEE International Workshop on Application of Reliable Computing and Communication, pages 29–34, Dec. 2001. 77

M. Baker/Ed. Cluster computing white paper. Technical report, IEEE Computer Society’s Task Force on Cluster Computing (TFCC), Dec. 2000. 99

E. Bansch¨ and K. Mikula. A coarsening finite element strategy in image selective smoothing. Computing and Visualization in Science, 1:53–63, 1997. 44

C. W. Barnes, B. N. Tran, and S. H. Leung. On the statistics of fixed-point roundoff error. IEEE Trans. , Speech and Signal Processing, ASSP-33(3):595, 1985. 35

P. H. Bauer and L.-J. Leclerc. A computer-aided test for the absence of limit cycles in fixed- point digital filters. IEEE Transactions on Signal Processing, 39(11), 1991. 36

V. Baumgarte, F. May, A. Nckel, M. Vorbach, and M. Weinhardt. PACT XPP - a self- reconfigurable data processing architecture. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA’2001), Las Vegas, USA, 2001. 166, 169

187 Bibliography

J. Becker. Configurable systems-on-chip: Commercial and academic approaches. In 9th IEEE International Conference on Electronics, Circuits and Systems - ICECS 2002, Dubrovnik, Croatia, Sep. 2002. 84, 95, 172 J. Becker, A. Thomas, M. Vorbach, and V. Baumgarten. An industrial/academic configurable system-on-chip project (CSoC): Coarse-grain XXP/Leon-based architecture integration. In Design, Automation and Test in Europe Conference and Exposition (DATE), pages 11120– 11121, 2003. 100, 172 G. Bell. All the chips outside: The architecture challenge. In International Symposium on Computer Architecture (ISCA) 2000, 2000. 84 D. Benitez. Performance of reconfigurable architectures for image-processing applications. Journal of Systems Architecture, 49(4-6):193–210, 2003. 101, 159, 181 J. Bier. Processors with DSP capabilities: Which is best? In Proceedings Embedded Systems Conference (ESC) 2002, 2002. 100 G. Bohlender. Literature on enclosure methods and related topics. Technical report, Univer- sitat¨ Karlsruhe, 1996. 30 J. Bolz, I. Farmer, E. Grinspun, and P. Schroder¨ . Sparse matrix solvers on the GPU: Conjugate gradients and multigrid. In Proceedings of SIGGRAPH 2003, 2003. 133, 143 K. Bondalapati and V. K. Prasanna. Reconfigurable computing systems. In Proceedings of the IEEE, July 2002. 93, 100, 102, 156 F. Bornemann, D. Laurie, S. Wagon, and J. Waldvogel. The SIAM 100-Digit Challenge: A Study in High-Accuracy Numerical Computing. Society of Industrial Applied Mathematics (SIAM), Philadelphia, 2004. 26 S. Brown and J. Rose. Architecture of FPGAs and CPLDs: A tutorial. IEEE Design and Test of Computers, 13(2):42–57, 1996. 93, 101, 156 J. M. P. Cardoso and M. Weinhardt. XPP-VC: A C compiler with temporal partitioning for the PACT-XPP architecture. Lecture Notes in Computer Science, 2438:864–??, 2002. ISSN 0302-9743. 172 V. Caselles, F. Catte,´ T. Coll, and F. Dibos. A geometric model for active contours in image processing. Numer. Math., 66, 1993. 21 F. Catte,´ P.-L. 
Lions, J.-M. Morel, and T. Coll. Image selective smoothing and by nonlinear diffusion. SIAM J. Numer. Anal., 29(1):182–193, 1992. 19 E. Cerro-Prada and P. B. James-Roxby. High speed low level image processing on FP- GAs using distributed arithmetic. In R. W. Hartenstein and A. Keevallik, editors, Field- Programmable Logic: From FPGAs to Computing Paradigm, Proceedings FPL 1998, pages 436–440. Springer-Verlag, Berlin, Aug/Sep 1998. 101 G. E. Christensen, S. C. Joshi, and M. I. Miller. Volumetric transformations of brain anatomy. IEEE Trans. Medical Imaging, 16, no. 6:864–877, 1997. 24

188 Bibliography

U. Clarenz, M. Droske, and M. Rumpf. Towards fast non-rigid registration. In Z. Nashed and O. Scherzer, editors, Contemporary Mathematics, Special Issue on Inverse Problems and Image Analysis. AMS, 2002. 23, 24, 150 R. Cmar, L. Rijnders, P. Schaumont, S. Vernalde, and I. Bolsens. A methodology and design environment for DSP ASIC fixed point refinement. In Proceedings of the Design Automa- tion and Test in Europe 1999, pages 271–276, 1999. 36 P. Colantoni, N. Boukala, and J. da Rugna. Fast and accurate color image processing using 3d graphics cards. In Proceedings Vision, Modeling and Visualization 2003, 2003. 133 K. Compton and S. Hauck. Reconfigurable computing: A survey of systems and software. ACM Computing Surveys, 34(2), 2002. 101, 102, 156 L. Corrias, M. Falcone, and R. Natalini. Numerical schemes for conservation laws via Hamilton-Jacobi equations. Mathematics of Computation, 64:555–580, 1995. 21 L. D. Coster, M. Engels, R. Lauwereins, and J. Peperstraete. Global approach for compiled bit-true simulation of DSP-applications. In Proceedings of Euro-Par’96, volume 2, pages 236–239, 1996. 36 R. Cravotta. DSP directory 2003. http://www.reed-electronics.com/ednmag/contents/images/- 286246.pdf, 2003. 100 R. Crisp. Direct Rambus technology: The next main memory standard. IEEE Micro, 17(6): 18–28, Nov./Dec. 1997. ISSN 0272-1732. 76 W. J. Dally, P. Hanrahan, M. Erez, T. J. Knight, F. Labont, J.-H. Ahn, N. Jayasena, U. J. Kapasi, A. Das, J. Gummaraju, and I. Buck. Merrimac: Supercomputing with streams. In Supercomputing Conference 2003, Nov. 2003. 95 C. A. Davatzikos, R. N. Bryan, and J. L. Prince. Image registration based on boundary map- ping. IEEE Trans. Medical Imaging, 15, no. 1:112–115, 1996. 24 B. Davis, B. Jacob, and T. Mudge. The new DRAM interfaces: SDRAM, RDRAM and variants. Lecture Notes in Computer Science, 1940, 2000a. ISSN 0302-9743. 71 B. Davis, T. Mudge, B. Jacob, and V. Cuppu. DDR2 and low latency variants. 
In Solving the Memory Wall Problem Workshop, 2000b. 71 A. DeHon. The density advantage of configurable computing. Computer, 33(4):41–49, Apr. 2000. ISSN 0018-9162. 101 A. DeHon. Very large scale spatial computing. Lecture Notes in Computer Science, 2509, 2002. ISSN 0302-9743. 100, 156 D. Demigny, L. Kessal, R. Bourguiba, and N. Boudouani. How to use high speed reconfig- urable FPGA for real time image processing? In Fifth IEEE International Workshop on Computer Architectures for Machine Perception (CAMP’00), page 240, 2000. 161

189 Bibliography

R. Desikan, S. W. Keckler, D. Burger, and T. Austin. Assessment of MRAM technology characteristics and architectures. Technical Report CS-TR-01-36, The University of Texas at Austin, Department of Computer Sciences, Apr 2001. 71 U. Diewald, T. Preusser, M. Rumpf, and R. Strzodka. Diffusion models and their acceler- ated solution in computer vision applications. Acta Mathematica Universitatis Comenianae (AMUC), 70(1):15–31, 2001. 132 J. Draper, J. Chame, M. Hall, C. Steele, T. Barrett, J. LaCoss, J. Granacki, J. Shin, C. Chen, C. W. Kang, I. Kim, and G. Daglikoca. The architecture of the diva processing-in-memory chip. In Proceedings of the International Conference on Supercomputing, June 2002. 96 M. J. B. Duff. Thirty years of parallel image processing. In Vector and Parallel Processing - VECPAR 2000, volume 1981 / 2001, pages 419–438, 2000. 99 R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5–16, 1990. ISSN 0018-9162. 93 Elixent. http://www.elixent.com/. 102, 166 J. Ely. Prospects for Using Variable Precision Interval Sotware in C++ for Solving Some Contemporary Scientific Problems. PhD thesis, The Ohio State University, 1990. 30 B. Engquist and S. Osher. Stable and entropy-satisfying approximations for transonic flow calculations. Math. Comp., 34(149):45–75, 1980. 55, 56 R. Enzler, C. Plessl, and M. Platzner. Virtualizing hardware with multi-context reconfigurable arrays. In Field Programmable Logic and Application, 13th International Conference, FPL 2003, pages 151–160, 2003. 101 Evans & Sutherland. Company’s history. http://www.es.com/about eands/history/index.asp, 2004. 116 R. Fernando and M. L. Kilgard. The Cg Tutorial: The Definitive Guide To Programmable Real-Time Graphics. Addison-Wesley, 2003. 116 M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computing, C-21(9):948–960, 1972. 92 B. Fraguela, P. Feautrier, J. Renau, D. Padua, and J. Torrellas. 
Programming the FlexRAM parallel intelligent memory system. In International Symposium on Principles and Practice of Parallel Programming (PPoPP), Jun. 2003. 96 F. Franchetti and M. Puschel.¨ Short vector code generation and adaptation for DSP algo- rithms. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing; Conference Proceedings (ICASSP ’03), 2003. 98 J. Frigo, M. Gokhale, and D. Lavenier. Evaluation of the Streams-C C-to-FPGA compiler: An applications perspective. In 9th ACM International Symposium on Field-Programmable Gate Arrays, Feb. 2001. 158

190 Bibliography

I. Gavrichenkov. AMD Athlon 64 FX-51 vs. Intel Pentium 4 Extreme Edition 3.2GHz: clash of strong wills. http://www.xbitlabs.com/articles/cpu/display/athlon64-fx51 [8,9].html, 2003. 81 D. Geman, S. Geman, C. Graffigne, and P. Dong. Boundary detection by constrained opti- mization. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 609–628, 1990. 21 D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, Mar. 1991. ISSN 0360-0300. 32, 37 N. Goodnight, C. Woolley, G. Lewin, D. Luebke, and G. Humphreys. A multigrid solver for boundary-value problems using programmable graphics hardware. In Eurographics/SIG- GRAPH Workshop on Graphics Hardware, 2003. 133, 143 GPGPU. GPGPU - general purpose computation using graphics hardware. http://- www.gpgpu.org/. Mark J. Harris/Ed. 110, 142, 143, 181 Gray and Neuhoff. Quantization. IEEETIT: IEEE Transactions on Information Theory, 44(6): 2325–2383, 1998. 26 U. Grenander and M. I. Miller. Computational anatomy: An emerging discipline. Quarterly Appl. Math., LVI, no. 4:617–694, 1998. 24 Z. Guo, W. Najjar, F. Vahid, and K. Vissers. A quantitative analysis of the speedup factors of FPGAs over processors. In Symp. on Field-Programmable gate Arrays (FPGA), Feb. 2004. 101, 181 C. Gwyn and P. Silverman. EUVL transition from research to commercialization. In Pho- tomask and Next-Generation Lithography Mask Technology X, volume 5130, Apr. 2003. 104 J. Hammes, A. Bohm, C. Ross, M. Chawathe, B. Draper, and W. Najjar. High performance image processing on FPGAs. In Los Alamos Computer Science Institute Symposium, Oct. 2001. 101, 159 J. Hammes, R. Rinker, W. Bohm,¨ and W. Najjar. Cameron: High level language compilation for reconfigurable systems. In PACT 99,, Okt. 1999. 158 M. Hanke and C. Groetsch. Nonstationary iterated Tikhonov regularization. J. Optim. Theory and Applications, 98:37–53, 1998. 24 M. J. Harris. Analysis of error in a CML diffusion operation. 
Technical report, UNC, 2002. 133 M. J. Harris, G. Coombe, T. Scheuermann, and A. Lastra. Physically-based visual simulation on graphics hardware. In Proceedings of Graphics Hardware 2002, pages 109–118, 2002. 133 M. J. Harris, W. V. B. III, T. Scheuermann, and A. Lastra. Simulation of cloud dynamics on graphics hardware. In Proceedings of Graphics Hardware 2003, 2003. 133

191 Bibliography

R. Hartenstein. A decade of reconfigurable computing: A visionary retrospective. In Design, Automation and Test in Europe - DATE 2001, Mar. 2001. 94, 102, 165 R. Hartenstein. Data-stream-based computing: Models and architectural resources. In In- ternational Conference on Microelectronics, Devices and Materials (MIDEM 2003), Ptuj, Slovenia, Oct. 2003. 84, 100 H.Becker, S.Kilian, and S.Turek. Some concepts of the software package feast. In J. et al., editor, Proc: Vector and Parallel Processing - VECPAR98, pages 271–284, 1999. 26 S. Henn and K. Witsch. Iterative multigrid regularization techniques for image matching. SIAM J. Sci. Comput. (SISC), Vol. 23 no. 4:pp. 1077–1093, 2001. 24 M. Herz, R. Hartenstein, M. Miranda, E. Brockmeyer, and F. Catthoor. Memory organisa- tion for stream-based reconfigurable computing. In 9th IEEE International Conference on Electronics, Circuits and Systems - ICECS 2002, Dubrovnik, Croatia, Sep. 2002. 84 G. A. Hewer, C. Kenney, and B. S. Manjunathg. Variational image segmentation using bound- ary functions. IEEE Transactions on Image Processing, 7(9), 1998. 21 N. J. Higham. The accuracy of floating point summation. SIAM Journal on Scientific Com- puting, 14(4):783–799, 1993. 30 N. J. Higham. Accuracy and stability of numerical algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2002. ISBN 0-89871-521-0. 26, 29 M. Hopf and T. Ertl. Accelerating 3D convolution using graphics hardware. In Proc. Visual- ization ’99, pages 471–474. IEEE, 1999. 132 M. Hopf and T. Ertl. Accelerating Morphological Analysis with Graphics Hardware. In Workshop on Vision, Modelling, and Visualization VMV ’00, pages 337–345, 2000a. 132 M. Hopf and T. Ertl. Hardware Accelerated Wavelet Transformations. In Proceedings of EG/IEEE TCVG Symposium on Visualization VisSym ’00, pages 93–103, 2000b. 132 Z. Hu, S. Kaxiras, and M. Martonosi. 
Timekeeping in the memory system: An efficient approach to predicting and optimizing memory behavior. In International Symposium on Computer Architecture (ISCA) 2002, 2002. 83 IEC. Letter symbols to be used in electrical technology - Part 2: Telecommunications and electronics, second edition edition, Nov. 2000. 7 Intel. Intel Itanium 2 Processor. http://www.intel.com/products/server/processors/server/- itanium2/, 2004a. 98 Intel. Intel Pentium 4 Processor. http://www.intel.com/products/desktop/processors/- pentium4/, 2004b. 79, 81 IPflex. http://www.ipflex.com/english/. 102, 166

192 Bibliography

L. Jackson. Roundoff noise analysis for fixed-point digital filters realized in cascaded or parallel form. IEEE Trans. Audio Electroacoust., AU-18:107–122, 1970. 35

U. J. Kapasi, W. J. Dally, B. Khailany, J. D. Owens, and S. Rixner. The Imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 282–288, Sep. 2002. 95

U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmable stream processors. IEEE Computer, pages 54–62, Aug. 2003. 95

T. Kapur, W. Grimsol, W. WellsIII, and R. Kikinis. Segmentation of brain tissue from magnetic resonance image. Medical Image Analysis, 1(2), 1997.

T. Karkhanis and J. Smith. A day in the life of a cache miss. In International Symposium on Computer Architecture (ISCA) 2002, 2002. 82

J. Kacurˇ and K. Mikula. Solution of nonlinear diffusion appearing in image smoothing and edge detection. Appl. Numer. Math., 17 (1):47–59, 1995. 43

B. Kawohl and N. Kutev. Maximum and comparison principle for one-dimensional anisotropic diffusion. Math. Ann., 311 (1):107–123, 1998. 19

H. Keding, F. Hurtgen,¨ M. Willems, and M. Coors. Transformation of floating-point into fixed-point algorithms by interpolation applying a statistical approach. In International Conference on Signal Processing Applications & Technology 1998 (ICSPAT-98), 1998. 36

B. Keeth and R. J. Baker. DRAM Circuit Design : A Tutorial. Wiley-IEEE Press, Nov 2000. 71

B. Khailany, W. Dally, S. Rixner, U. Kapasi, J. Owens, and B. Towles. Exploring the VLSI scalability of stream processors. In International Conference on High Performance Com- puter Architecture (HPCA-2003), 2003. 95, 168

S. Kichenassamy. The perona-malik paradox. SIAM J. Appl. Math., 57:1343–1372, 1997. 19

S. Kim, K.-I. Kum, and W. Sung. Fixed-point optimization utility for C and C++ based digital signal processing programs. IEEE Transactions on Circuits and Systems, 45(11), 1998. 36

T. Kim and M. Lin. Visual simulation of ice crystal growth. In Proc. ACM SIGGRAPH / Eurographics Symposium on Computer Animation, 2003. 133

S. Klupsch. Design, integration and validation of heterogeneous systems. In 2nd IEEE Inter- national Symposium on Quality Electronic Design (ISQED 2001), Mar. 2001. 158

S. Klupsch, M. Ernst, S. A. Huss, M. Rumpf, and R. Strzodka. Real time image process- ing based on reconfigurable . In Proc. Heterogeneous reconfigurable Systems on Chip, 2002. 156

P. Kosmol. Optimierung und Approximation. de Gruyter Lehrbuch, 1991. 60

193 Bibliography

J. Krueger and R. Westermann. Linear algebra operators for GPU implementation of nu- merical algorithms. ACM Transactions on Graphics (TOG), 22(3):908–916, 2003. ISSN 0730-0301. 133 A. Lefohn, J. Kniss, C. Handen, and R. Whitaker. Interactive visualization and deformation of level set surfaces using graphics hardware. In Proc. Visualization, pages 73–82. IEEE CS Press, 2003. 127, 131 A. E. Lefohn, J. M. Kniss, C. D. Hansen, and R. T. Whitaker. A streaming narrow-band algorithm: Interactive deformation and visualization of level sets. IEEE Transactions on Visualization and Computer Graphics, 2004. 128 G. Lenerz. Silicon Graphics history. http://sgistuff.g-lenerz.de/hardware/timeline.html, 2004. 116 C. Leopold. Parallel and Distributed Computing: A Survey of Models, Paradigms and Ap- proaches. Wiley, 2000. 99 E. Loh and G. W. Walster. Rump’s example revisited. Reliab. Comput., 8(3):245–248, 2002. ISSN 1385-3139. 29 F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens. Multi–modal volume registration by maximization of mutual information. IEEE Trans. Medical Imaging, 16, no. 7:187–198, 1997. 24 K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and M. Horowitz. Smart memories: A modular reconfigurable architecture. In 27th Annual International Symposium on Computer Architecture (27th ISCA-2000) Computer Architecture News, Vancouver, British Columbia, Canada, June 2000. ACM SIGARCH / IEEE. 92, 95, 103 R. Malladi, J. A. Sethian, and B. C. Vemuri. Shape modelling with front propagation. IEEE Trans. Pattern Anal. Machine Intell., 17, 1995. 21 DirectX: multimedia application programming interfaces. Microsoft, http://- www.microsoft.com/windows/directx/default.aspx. 111 Microsoft. Longhorn Developer Center. http://msdn.microsoft.com/longhorn. 119 DirectX9 Programmable HLSL Shaders. Microsoft, http://msdn.microsoft.com/library/- default.asp?url=/nhp/default.asp?contentid=28000410, 2003. 118 G. Moore. No exponential is forever ... but we can delay ”forever”. 
http://www.intel.com/- research/silicon/mooreslaw.htm, Feb. 2003. Event: International Solid State Circuits Con- ference (ISSCC). 103 G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8), 1965. 71 MPI committee. The message passing interface (MPI) standard. http://www- unix.mcs.anl.gov/mpi/, 2004. 99

194 Bibliography

L. P. Mulcahy. Two’s-complement fixed-point multiplications errors. Technical Report AD- A086826, Naval Ocean System Center, San Diego, CA, 1980. 35 D. Mumford and J. Shah. Boundary detection by minimizing functionals. In Proceedings, CVPR ’85 (IEEE Computer Society Conference on Computer Vision and Pattern Recogni- tion), IEEE Publ. 85CH2145-1., pages 22–26, 1985. 21 R. Murphy and P. M. Kogge. The characterization of data intensive memory workloads on distributed PIM systems. In Intelligent Memory Systems Workshop, Boston, MA, Nov. 2000. 96 Nallatech. http://www.nallatech.com/. 99 NVIDIA. Cg programming language. http://developer.nvidia.com/page/cg main, 2002. 118, 126, 144 NVIDIA. GeForce FX. http://www.nvidia.com/page/fx desktop.html, 2004. 81 OpenGL: graphics application programming interface. OpenGL Architectural Review Board (ARB), http://www.opengl.org/. 111 GLSL - OpenGL Language. OpenGL Architectural Review Board (ARB), http://- www.opengl.org/documentation/oglsl.html, 2004. 118, 126 S. J. Osher and J. A. Sethian. Fronts propagating with curvature dependent speed: Algorithms based on Hamilton–Jacobi formulations. J. of Comp. Physics, 79:12–49, 1988. 21, 55 PACT. http://www.pactcorp.com/. 102, 166, 167, 168, 175 P. Perona and J. Malik. Scale space and edge detection using anisotropic diffusion. IEEE Trans. Pattern Anal. Mach. Intell., 12:629–639, 1990. 19 picoChip. http://www.picochip.com/. 102, 166 T. Preußer and M. Rumpf. An adaptive finite element method for large scale image processing. Journal of Visual Comm. and Image Repres., 11:183–195, 2000. 44 D. M. Priest. On Properties of Floating Point Arithmetics: Numerical Stability and the Cost of Accurate Computations. PhD thesis, University of California, Berkeley, CA, USA, 1992. 29 QuickSilver. http://www.quicksilvertech.com/. 102, 166 P. Ranganathan, S. V. Adve, and N. P. Jouppi. Performance of image and video processing with general-purpose processors and media ISA extensions. 
In ISCA, pages 124–135, 1999. 98 D. Risley. A CPU history. http://www.pcmech.com/show/processors/35/, 2001. 79 S. Rixner. Stream Processor Architecture. Kluwer Academic Publishers, 2002. 95 T. Rolfe. Distributed multiprocessor environments. J. Comput. Small Coll., 18(2):95–104, 2002. 93

195 Bibliography

S. M. Rump. Algorithms for verified inclusions in theory and practice, pages 109–126. Aca- demic Press Professional, Inc., 1988. ISBN 0-12-505630-3. 29 M. Rumpf and R. Strzodka. Level set segmentation in graphics hardware. In Proceedings ICIP’01, volume 3, pages 1103–1106, 2001a. 127 M. Rumpf and R. Strzodka. Nonlinear diffusion in graphics hardware. In Proceedings of EG/IEEE TCVG Symposium on Visualization VisSym ’01, pages 75–84. Springer, 2001b. 132, 139 M. Rumpf and R. Strzodka. Using graphics cards for quantized FEM computations. In Pro- ceedings VIIP’01, pages 193–202, 2001c. 132 M. Rumpf and R. Strzodka. Numerical Solution of Partial Differential Equations on Paral- lel Computers, chapter Graphics Processor Units: New Prospects for Parallel Computing. Springer, 2005. 110 K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In ISCA 2003, pages 422–433, 2003. 92, 95 Y. Sato, S. Nakajima, N. Shiraga, H. Atsumi, S. Yoshida, T. Koller, G. Gerig, and R. Kikinis. Three-dimensional multi-scale line filter for segmentation and visualization of curvilinear structures in medical images. Medical Image Analysis, 2(2), 1998. M. J. Schulte and E. E. Swartzlander. Software and hardware techniques for accurate, self- validating arithmetic. In R. B. Kearfott and V. Kreinovich, editors, Applications of interval computations: Papers presented at an international workshop in El Paso, Texas, February 23–25, 1995, volume 3 of Applied optimization, pages 381–404, Norwell, MA, USA, and Dordrecht, The Netherlands, 1996. Kluwer Academic Publishers Group. ISBN 0-7923- 3847-2. 39 SEMATECH. International technology roadmap for semiconductors (ITRS). http://- public.itrs.net/Files/2003ITRS/Home2003.htm, 2003. 71, 103 J. A. Sethian. Level Set Methods and Fast Marching Methods. Cambridge University Press, 1999. 20, 21, 130 A. K. Sharma. 
Advanced Semiconductor Memories : Architectures, Designs, and Applica- tions. Wiley-IEEE Press, Oct 2002a. 71 A. K. Sharma. Semiconductor Memories : Technology, Testing, and Reliability. Wiley-IEEE Press, Aug 2002b. 71 K. Siddiqi, Y. B. Lauziere,` A. Tannenbaum, and S. W. Zucker. Area and length minimizing flows for shape segmentation. IEEE Transactions on Image Processing, 7(3), 1998. Silicon Software. microEnable Users Guide, 1999. 156 T. Stansfield. Wordlength as an architectural parameter for reconfigurable computing devices.

196 Bibliography

Field-Programmable Logic and Applications, Proceedings Lecture Notes in Computer Sci- ence, 2438:667–676, 2002. 102

Star Bridge Systems Inc. http://www.starbridgesystems.com/. 99

T. L. Sterling and H. P. Zima. Gilgamesh: A multithreaded processor-in-memory architec- ture for petaflops computing. In SC’2002 Conference CD, Baltimore, MD, Nov. 2002. IEEE/ACM SIGARCH. pap105. 96

R. Strzodka. Virtual 16 bit precise operations on RGBA8 textures. In Proceedings VMV’02, pages 171–178, 2002. 121, 141

R. Strzodka, M. Droske, and M. Rumpf. Fast image registration in DX9 graphics hardware. Journal of Medical Informatics and Technologies, 6:43–49, Nov 2003. 143

R. Strzodka, M. Droske, and M. Rumpf. Image registration by a regularized gradient flow - a streaming implementation in DX9 graphics hardware. Computing, 2004. to appear. 143

B. Su, E.-W. Hu, J. Manzano, S. Regula, J. Wang, and L. W. Leung. A new source-level benchmarking for DSP processors. In Proceedings of Global Signal Processing Expo & Conference (GSPx) 2004, 2004. 100

A. Sud, D. Manocha, N. K. Govindaraju, and S. eui Yoon. Parallel occlusion culling for interactive walkthroughs using multiple GPUs. Technical report, UNC Computer Science, 2002. 99

T. Sueyoshi and M. Iida. Configurable and reconfigurable computing for digital signal process- ing. IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences, E85A(3):591–599, 2002. 102, 176, 181

J. Suh, E.-G. Kim, S. P. Crago, L. Srinivasan, and M. C. French. A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels. In D. DeGroot, editor, Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA-03), volume 31, 2 of Computer Architecture News, pages 410–421, New York, June 9–11 2003. ACM Press. 103, 181

W. Sung and K. Kum. Simulation-based word-length optimization method for fixed-point digital signal processing systems. IEEE Transactions on Signal Processing, 43(12), 1995. 36

M. B. Taylor, J. S. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. P. Amarasinghe, and A. Agarwal. The Raw microprocessor: A computational fabric for software circuits and general purpose programs. IEEE Micro, 22(2):25–35, 2002. 95

R. Tessier and W. Burleson. Reconfigurable computing for digital signal processing: A survey. Journal of VLSI Signal Processing, 28(1):7–27, June 2001. 100, 161, 181


J. P. Thirion. Image matching as a diffusion process: An analogy with Maxwell's demon. Medical Image Analysis, 2:243–260, 1998. 24

V. Thomée. Galerkin - Finite Element Methods for Parabolic Problems. Springer, 1984. 44

H. Tian, S. Lam, and T. Srikanthan. Area-time efficient between-class variance module for adaptive segmentation process. IEEE Proceedings: Vision, Image and Signal Processing, 150(4):263–269, 2003. 161

TOP500 committee. TOP500 supercomputer sites. http://www.top500.org/. 99

B. Ulmann and R. Hoffmann. Instruction stream processing beyond vector computing. In Proceedings of the 2002 conference on Massively-Parallel Computing Systems, pages 118–123, 2002. 95

M. Urabe. Roundoff error distribution in fixed-point multiplication and a remark about the rounding rule. SIAM Journal of Numerical Analysis, 5(2):202–210, June 1968. 35

I. Vladimirov and P. Diamond. A uniform white-noise model for fixed-point roundoff error in digital systems. Automation and Remote Control, 63(5):753–765, 2002. 35

S. A. Wadekar and A. C. Parker. Accuracy sensitive word-length selection for algorithm optimization. In International Conference on Computer Design (ICCAD'98), pages 54–61, 1998. 36

J. Waldemark, M. Millberg, T. Lindblad, and K. Waldemark. Image analysis for airborne reconnaissance and missile applications. Pattern Recognition Letters, 21(3):239–251, Mar. 2000. 161

J. Weickert. Anisotropic diffusion in image processing. Teubner, 1998. 17, 19, 43, 46, 49

J. Weickert, K. Zuiderveld, B. ter Haar Romeny, and W. Niessen. Parallel implementations of AOS schemes: A fast way of nonlinear diffusion filtering. In Proc. Fourth IEEE International Conference on Image Processing, volume 3, pages 396–399, Santa Barbara, CA, Oct 1997. 44

P. Welch. A fixed-point fast Fourier transform error analysis. IEEE Trans. Audio Electroacoust., AU-17:151–157, 1969. 35

R. Westermann. The rendering of unstructured grids revisited. In Proceedings of EG/IEEE TCVG Symposium on Visualization VisSym '01. Springer, 2001. 26

M. Wilkes. The memory gap (keynote). In Solving the Memory Wall Problem Workshop, 2000. http://www.ece.neu.edu/conf/wall2k/wilkes1.pdf. 71

M. Willems, V. Bürsgens, H. Keding, T. Grötker, and H. Meyr. System level fixed-point design based on an interpolative approach. In Design Automation Conference 1997 (DAC-97), 1997a. 36

M. Willems, V. Bürsgens, and H. Meyr. FRIDGE: Floating-point programming of fixed-point digital signal processors. In Int. Conf. on Signal Processing Applications & Technology 1997 (ICSPAT-97), 1997b. 36

K. Williston, M. Tsai, and J. Bier. DSP benchmark results for the latest processors. In Proceedings of Global Signal Processing Expo & Conference (GSPx) 2004, 2004. 100

P. W. Wong. Quantization noise, fixed-point multiplicative roundoff noise, and dithering. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(2):286–300, 1990. 35, 49

P. W. Wong. Quantization and roundoff in fixed-point FIR digital filters. IEEE Transactions on Signal Processing, 39(7):1552–1563, 1991. 35

Xilinx Inc. http://www.xilinx.com. 156, 157

C. Xu and J. L. Prince. Snakes, shapes, and gradient vector flow. IEEE Transactions on Image Processing, 7(3), 1998. 21

Y.-L. You, W. Xu, A. Tannenbaum, and M. Kaveh. Behavioral analysis of anisotropic diffusion in image processing. IEEE Trans. Image Proc., 5:1539–1553, 1996. 19


Acronyms

AGP Accelerated Graphics Port
ALU Arithmetic and Logic Unit
ANSI American National Standards Institute
API Application Programming Interface
ARB Architectural Review Board
ASIC Application Specific Integrated Circuit
ASSP Application Specific Standard Product
BL burst length
BREG Backward Register
CFL Courant-Friedrichs-Lewy condition
CLB Configurable Logic Block
CM Configuration Manager
CPU Central Processor Unit
CSoC Configurable System-on-a-Chip
DDR Double Data Rate (memory)
DDR1 First Generation DDR memory
DDR2 Second Generation DDR memory
DR Data Rate
DRAM Dynamic RAM
DSB data-stream-based
DSM Distributed Shared Memory
DSP Digital Signal Processing
EUV Extreme Ultra-Violet
FE Finite Element


FIFO First In First Out
FLOPS floating point OPS
FP Fragment Processor
FPGA Field Programmable Gate Array
FPU Floating Point Unit
FREG Forward Register
FSB Front Side Bus
GDDR Graphics DDR memory
GDDR3 Third Generation GDDR memory
GPU Graphics Processor Unit
HDL Hardware Description Language
HLL High Level Language
HPC High Performance Computing
IEC International Electrotechnical Commission
IEEE Institute of Electrical and Electronics Engineers
IO Input/Output
IP Intellectual Property
ISB instruction-stream-based
ITRS International Technology Roadmap for Semiconductors
LUT lookup table
MAC multiply-accumulate instruction
MIMD Multiple Instruction Multiple Data
MISD Multiple Instruction Single Data
MPI Message Passing Interface
MPP Massively Parallel Processing
NML Native Mapping Language
NUMA Non-Uniform Memory Access
OPS operations per second
PAE Processing Array Element

PBO Pixel Buffer Object
PC Personal Computer
PCI Peripheral Component Interconnect
PCIe PCI Express
PDE partial differential equation
PE processing element
PIM Processor-in-Memory
PS Pixel Shader
QDR Quad Data Rate (memory)
RAM Random Access Memory
RC Reconfigurable Computing
RL Reconfigurable Logic
SDR Single Data Rate (memory)
SDRAM Synchronous DRAM
SI Système International d'Unités
SIMD Single Instruction Multiple Data
SISD Single Instruction Single Data
SMT Simultaneous Multi-Threading
SoC System-on-a-Chip
SPMD Single Program Multiple Data
SRAM Static RAM
SSE Streaming SIMD Extensions
tCAS column active time
tCL column access latency
tCLK clock cycle time
tCP column precharge time
TMU Texture Mapping Unit
tRAS row active time
tRCD row to column delay

tRP row precharge time
VBO Vertex Buffer Object
VLIW Very Long Instruction Word
VP Vertex Processor
VS Vertex Shader
WGF Windows Graphics Foundation
XDR eXtreme Data Rate (memory)
XPP eXtreme Processing Platform

Index

access time, 73
Amdahl's Law, 91
anti machine, 84
bandwidth, 2, 8, 10, 65, 66, 71, 115, 162
  peak, 75, 116
  sustained, 74, 115, 169
bank interleaving, 76, 115
burst mode, 74, 115
cache, 65, 81
  associativity, 82
  prefetch, 83
CFL - Courant-Friedrichs-Lewy, 25, 54, 56
classifier, 25, 63
computational intensity, 65, 83, 91, 95, 115, 158, 168, 183
computing
  spatial, 100, 155, 159, 162, 173, 181
  temporal, 100, 181, 182
configware, 84, 93, 115, 157, 170, 181
cycle time, 73
data sequencer, 84, 170
denormalized number, 28
die, 81
diffusion
  anisotropic, 9, 17, 45, 133, 176
  generic, 16, 24
  linear, 16, 47, 62, 133, 177
  non-linear, 16, 46, 133, 134, 176
  tensor, 16, 46, 134
DSB - data-stream-based, 84, 93–95
enclosure methods, 30
error
  backward, 34
  forward, 33
  mixed forward-backward, 34
evaluation operator, 38
exact dot product, 30, 39
exponent, 27, 141, 143, 175
fixed point number, 30–36, 39, 45, 114, 121–124, 135, 141, 161, 175
floating point number, 27–30, 34, 36, 39, 114, 124, 135, 141, 175
flowware, 84, 115, 157, 170, 181
FP - Fragment Processor, 114
fragment, 113
  blending, 115, 128
frame-buffer, 114, 119, 138, 140
Gaussian function, 16, 20
GPU - Graphics Processor Unit, 112
granularity, 93
hidden bit, 28
hot spot, 85, 97
input neighborhood, 65
instruction sequencer, 84
ISB - instruction-stream-based, 84, 92–93
kernel, 143
latency, 2, 10, 71, 85, 169
level-set equation, 9, 19, 24, 55, 127, 128, 134, 160
Lyapunov functional, 18, 52

machine epsilon, 28, 31, 121
mantissa, 28, 123, 141, 143
mass-exact matrix vector product, 9, 40, 48, 133, 140
memory
  address, 71
  array, 71
  bank, 72, 76
  cell, 72
  depth, 72, 115
  distributed, 90
  gap, 5, 10, 71, 84, 96, 105, 181
  module, 79
  page, 72, 79
  refresh, 72, 77
  sense amps, 72, 77
  shared, 89
  size, 71
  video, 113, 120, 138
Moore's Law, 71, 103
  squared, 110
morphware, 84
normalized representation, 28
numerically stable, 34
parallelism, 10
  in breadth, 66, 91, 92, 95, 101, 115, 159, 162, 168, 178
  in depth, 65, 91, 95, 101, 115, 159, 162, 168, 177
pbuffer, 118, 120, 144, 155
PE - processing element, 70, 87
Perona-Malik function, 16, 20, 130, 135, 137
pipelining, 85, 88, 115, 159, 168
precision, 28
prefetch size, 78
processor
  scalar, 92
  super-scalar, 92
  vector, 92
quantization, 26, 36
quantized operations, 37
reconfigurability
  differential, 169, 173
  dynamic, 94, 101, 166, 169, 177, 185
  partial, 94, 184
  run-time, 94
register, 81
rounding modes, 32, 37, 159
  directed, 30, 32
  nearest, 32, 161
saturation, 37, 134
scale-space, 17, 49
scheduling
  compile-time, 92
  run-time, 92
speedup, 90
system node, 89
texel, 114, 146
texture, 114
throughput, 8, 65, 85, 115, 159, 172, 178
turn around time, 77
unit roundoff, 29, 31, 124
variable precision arithmetic, 11, 30, 33, 39, 159, 161, 184
virtual signed 16 bit format, 11, 121, 140, 142
von Neumann, 84, 97
VP - Vertex Processor, 113
white-noise model, 35, 133
write-read delay, 66
