Efficient Algorithms for Large-Scale Image Analysis
by Jan Wassenberg

Schriftenreihe Automatische Sichtprüfung und Bildverarbeitung, Band 4
Series editor: Prof. Dr.-Ing. Jürgen Beyerer, Lehrstuhl für Interaktive Echtzeitsysteme am Karlsruher Institut für Technologie / Fraunhofer-Institut für Optronik, Systemtechnik und Bildauswertung IOSB
An overview of all volumes published in this series so far can be found at the end of the book.

Dissertation by Jan Wassenberg from Koblenz, accepted by the Fakultät für Informatik of the Karlsruher Institut für Technologie for the academic degree of Doktor der Ingenieurwissenschaften.
Date of the oral examination: 24 October 2011
First reviewer: Prof. Dr. Peter Sanders
Second reviewer: Prof. Dr.-Ing. Jürgen Beyerer

Imprint: Karlsruher Institut für Technologie (KIT), KIT Scientific Publishing, Straße am Forum 2, D-76131 Karlsruhe, www.ksp.kit.edu
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
This publication is available on the Internet under the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/3.0/de/
KIT Scientific Publishing 2012, Print on Demand
ISSN: 1866-5934
ISBN: 978-3-86644-786-8

Abstract

The past decade has seen major improvements in the capabilities and availability of imaging sensor systems. Commercial satellites routinely provide panchromatic images with sub-meter resolution. Airborne line scanner cameras yield multi-spectral data with a ground sample distance of 5 cm. The resulting overabundance of data brings with it the challenge of timely analysis. Fully automated processing still appears infeasible, but an intermediate step might involve a computer-assisted search for interesting objects. This would reduce the amount of data for an analyst to examine, but remains a challenge in terms of processing speed and working memory.

This work begins by discussing the trade-offs among the various hardware architectures that might be brought to bear upon the problem. FPGA and GPU-based solutions are less universal and entail longer development cycles, hence the choice of commodity multi-core CPU architectures. Distributed processing on a cluster is deemed too costly. We will demonstrate the feasibility of processing aerial images of 100 km × 100 km areas at 1 m resolution within 2 hours on a single workstation with two processors and a total of twelve cores. Because existing approaches cannot cope with such amounts of data, each stage of the image processing pipeline – from data access and signal processing to object extraction and feature computation – will have to be designed from the ground up for maximum performance. We introduce new efficient algorithms that provide useful results at faster speeds than previously possible.
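To put this target into perspective, a back-of-the-envelope calculation (ours, not taken from the thesis; it assumes a single band stored at 16 bits per pixel) gives the raw data volume and the average rate implied by the two-hour budget:

\[
\Bigl(\tfrac{100\,\mathrm{km}}{1\,\mathrm{m}}\Bigr)^{2} = 10^{5}\times 10^{5} = 10^{10}\ \text{pixels}
\approx 20\,\mathrm{GB}\ \text{at 2 bytes per pixel},
\qquad
\frac{10^{10}\ \text{pixels}}{7200\,\mathrm{s}} \approx 1.4\times 10^{6}\ \text{pixels/s}.
\]

Every stage of the pipeline must touch this volume at least once, which is one reason the per-module throughputs of several hundred MB/s reported below matter.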
Let us begin with the most time-critical task – the extraction of 'object' candidates from an image, also known as segmentation. This step is necessary because individual pixels do not provide enough information for the screening task. A simple but reasonable model for the objects involves grouping similar pixels together. High-quality clustering algorithms based on mean shift, maximum network flow and anisotropic diffusion are far too time-consuming. We introduce a new graph-based algorithm with the important property of avoiding both under- and oversegmentation. Its distinguishing feature is the independent parallel processing of image tiles without splitting objects at the boundaries. Our efficient implementation takes advantage of SIMD instructions and outperforms mean shift by a factor of 50 while producing results of similar quality. Recognizing the outstanding performance of its microarchitecture-aware virtual-memory counting sort subroutine, we develop it into a general 32-bit integer sorter, yielding the fastest known algorithm for shared-memory machines.

Because segmentation groups together similar pixels, it is helpful to suppress sensor noise. The 'Bilateral Filter' is an adaptive smoothing kernel that preserves edges by excluding pixels that are distant in the spatial or radiometric sense. Several fast approximation algorithms are known, e.g. convolution in a downsampled higher-dimensional space. We accelerate this technique by a factor of 14 via parallelization, vectorization and a SIMD-friendly approximation of the 3D Gauss kernel. The software is 73 times as fast as an exact computation on an FPGA and outperforms a GPU-based approximation by a factor of 1.8.
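To make the filter's weighting concrete, the following is a minimal scalar sketch of a brute-force bilateral filter. It is an illustration only: the function name and parameters are ours, and the thesis's actual implementation is the vectorized approximation in a downsampled higher-dimensional space described above, not this direct evaluation.

```cpp
// Brute-force bilateral filter sketch (illustrative, not the thesis code).
// Each output pixel is a weighted average of its neighbors; weights decay with
// both spatial distance and radiometric (intensity) difference, so pixels on
// the other side of an edge contribute almost nothing.
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> BilateralFilter(const std::vector<float>& img,
                                   std::size_t width, std::size_t height,
                                   int radius, float sigma_spatial,
                                   float sigma_range) {
  std::vector<float> out(img.size());
  const float inv2ss = 1.0f / (2.0f * sigma_spatial * sigma_spatial);
  const float inv2sr = 1.0f / (2.0f * sigma_range * sigma_range);
  for (std::size_t y = 0; y < height; ++y) {
    for (std::size_t x = 0; x < width; ++x) {
      const float center = img[y * width + x];
      float sum = 0.0f;
      float weights = 0.0f;
      for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
          const std::ptrdiff_t ny = static_cast<std::ptrdiff_t>(y) + dy;
          const std::ptrdiff_t nx = static_cast<std::ptrdiff_t>(x) + dx;
          if (ny < 0 || nx < 0 || ny >= static_cast<std::ptrdiff_t>(height) ||
              nx >= static_cast<std::ptrdiff_t>(width)) {
            continue;  // neighbor lies outside the image
          }
          const float value = img[static_cast<std::size_t>(ny) * width +
                                  static_cast<std::size_t>(nx)];
          const float spatial2 = static_cast<float>(dx * dx + dy * dy);
          const float range = value - center;  // radiometric difference
          // Large spatial OR radiometric distance yields a tiny weight,
          // which is what preserves edges.
          const float w = std::exp(-spatial2 * inv2ss - range * range * inv2sr);
          sum += w * value;
          weights += w;
        }
      }
      out[y * width + x] = sum / weights;  // center pixel has weight 1, so weights >= 1
    }
  }
  return out;
}
```

Evaluated directly, this costs on the order of radius² kernel weights per pixel, which is why the downsampled-grid approximation and the SIMD-friendly Gauss kernel mentioned above pay off at the throughputs targeted here.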
Physical limitations of satellite sensors constitute an additional hurdle. The narrow multispectral bands require larger detectors and usually have a lower resolution than the panchromatic band. Fusing both datasets is termed 'pan-sharpening' and improves the segmentation due to the additional color information. Previous techniques are vulnerable to color distortion because of mismatches between the bands' spectral response functions. To reduce this effect, we compute the optimal set of band weights for each input image. Our new algorithm outperforms existing approaches by a factor of 100, improves upon their color fidelity and also reduces noise in the panchromatic band.

Because these modules achieve throughputs on the order of several hundred MB/s, the next bottleneck to be addressed is I/O. The ubiquitous GDAL library is far slower than the theoretical disk throughput. We design an image representation that avoids unnecessary copying, and describe little-known techniques for efficient asynchronous I/O. The resulting software is up to 12 times as fast as GDAL. Further improvements are possible by compressing the data, provided decompression throughput is on par with the transfer speeds of a disk array. We develop a novel lossless asymmetric SIMD codec that achieves a compression ratio of 0.5 for 16-bit pixels and reaches decompression throughputs of 2,700 MB/s on a single core. This is about 100 times as fast as lossless JPEG 2000, with outputs only 20–60% larger on multispectral satellite datasets.

Let us now return to the extracted objects. Additional steps for detecting and simplifying their contours would provide useful information, e.g. for classifying them as man-made. To allow annotating large images with the resulting polygons, we devise a software rasterizer. High-quality antialiasing is achieved by deriving the optimal polynomial low-pass filter. Our implementation outperforms the Gupta-Sproull algorithm by a factor of 24 and exceeds the fill rate of a mid-range GPU.

The previously described processing chain is effective, but electro-optical sensors cannot penetrate cloud cover. Because much of the earth's surface is shrouded in clouds at any given time, we have added a workflow for (nearly) weather-independent synthetic aperture radar. Small, highly reflective objects can be differentiated from uniformly bright regions by subtracting each pixel's background, estimated from the darkest ring surrounding it. We reduce the asymptotic complexity of this approach to its lower bound by means of a new algorithm inspired by Range Minimum Queries. A sophisticated pipelining scheme ensures the working set fits in cache, and the vectorized and parallelized software outperforms an FPGA implementation by a factor of 100.

These results challenge the conventional wisdom that FPGA and GPU solutions enable significant speedups over general-purpose CPUs. Because all of the above algorithms have reached the lower bound of their complexity, their usefulness is decided by constant factors. It is the thesis of this work that optimized software running on general-purpose CPUs can compare favorably in this regard. The key enabling factors are vectorization, parallelization, and consideration of basic microarchitectural realities such as the memory hierarchy. We have shown these techniques to be applicable to a variety of image processing tasks. However, it is not sufficient to 'tune' software in the final phases of its development. Instead, each part of the algorithm engineering cycle – design, analysis, implementation and experimentation – should account for the computer architecture. For example, no amount of subsequent tuning would redeem an approach to segmentation that relies on a global ranking of pixels, which is fundamentally less amenable to parallelization than a graph-based method. The algorithms introduced in this work speed up seven separate tasks by factors of 10 to 100, thus dispelling the notion that such efforts are not worthwhile. We are surprised to have improved upon long-studied topics such as lossless image compression and line rasterization. The techniques described herein may allow similar successes in other domains.

Acknowledgements

I sincerely thank my advisor, Prof. Peter Sanders, for providing guidance – identifying promising avenues to explore, teaching algorithm engineering, and sharing the lore of clever optimizations. Thank you, Prof. Dr.-Ing. Jürgen Beyerer, for reviewing this thesis. Looking back earlier, I thank my parents for their love and support, and for allowing me access to a TRS-80