GPU-ACCELERATED TERRAIN PROCESSING

By

Wenli Li

A Dissertation Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY

Major Subject: COMPUTER SCIENCE

Examining Committee:

W. Randolph Franklin, Dissertation Adviser

Christopher D. Carothers, Member

Barbara M. Cutler, Member

Richard J. Radke, Member

Charles V. Stewart, Member

Rensselaer Polytechnic Institute
Troy, New York

July 2016 (For Graduation August 2016)

© Copyright 2016 by Wenli Li

All Rights Reserved

CONTENTS

LIST OF TABLES

LIST OF FIGURES

ABSTRACT

ACKNOWLEDGMENT

1. Introduction

2. Background and Related Work
   2.1 Terrain and Visibility
   2.2 CUDA
   2.3 ODETLAP
   2.4 Spatial Interpolation
   2.5 Spatial Data Compression

3. 1D ODETLAP Approximation and Compression
   3.1 1D ODETLAP Approximation
   3.2 1D ODETLAP Compression
   3.3 Suitability of Datasets
   3.4 Summary

4. 2D ODETLAP Approximation
   4.1 GPU Acceleration
   4.2 ODETLAP Approximation, Natural Neighbor Interpolation, and Radial Basis Function Interpolation
   4.3 Summary

5. 2D ODETLAP Compression
   5.1 2D Datasets
   5.2 The Simple Algorithm
   5.3 Adding and Removing Points
   5.4 Optimizing Point Values
   5.5 Anterior Optimizations
   5.6 Posterior Optimizations
   5.7 Anterior and Posterior Optimizations (the Complex Algorithm)
   5.8 Compressing Selected Points
   5.9 Regular Points
   5.10 JPEG 2000
   5.11 Summary

6. 3D ODETLAP Compression
   6.1 3D Datasets
   6.2 The Simple Algorithm
   6.3 Adding and Removing Points
   6.4 Optimizing Point Values
   6.5 Anterior Optimizations
   6.6 Posterior Optimizations
   6.7 The Complex Algorithm
   6.8 Compressing Selected Points
   6.9 Regular Points
   6.10 JP3D
   6.11 Summary

7. 3D Segmented ODETLAP Compression
   7.1 Introduction
   7.2 The Algorithm
   7.3 Results
   7.4 Evaluation
   7.5 Summary

8. GPU-Accelerated Multiple Observer Siting
   8.1 Introduction
   8.2 Multiple Observer Siting
   8.3 Optimization
   8.4 Parallelization
   8.5 Results and Discussion
   8.6 Summary

9. Conclusions and Future Work

REFERENCES

LIST OF TABLES

4.1 Average running time in seconds and device memory usage in MB, using different sparse matrix formats

4.2 Average running time in seconds and device memory usage in MB, using different iterative solvers

4.3 Average running time in seconds and GPU to CPU speedups

4.4 Average elevation, slope, and curvature errors of ODETLAP approximation on the three tests

4.5 Average elevation, slope, and curvature errors of natural neighbor interpolation on the three tests

4.6 Average elevation, slope, and curvature errors of multiquadric RBF interpolation/approximation on the three tests

5.1 Statistics of the 2D datasets

5.2 Elevation, slope, and curvature errors of the simple algorithm and different initial sets, dataset n43w074

5.3 Elevation, slope, and curvature errors of the simple algorithm and different Rs, dataset n43w074

5.4 Elevation, slope, and curvature errors of the simple algorithm and two smoothing factors, dataset n43w074

5.5 Elevation, slope, and curvature errors of the simple algorithm and a varying R, dataset n43w074

5.6 Elevation, slope, and curvature errors of the simple algorithm and different averaging equations, dataset n43w074

5.7 Elevation, slope, and curvature errors of Algorithms 5.1 and 5.2, dataset n43w074

5.8 Elevation, slope, and curvature errors of Algorithm 5.3 and different ωs, dataset n43w074

5.9 Elevation, slope, and curvature errors of Algorithms 5.1–5.3 and different accuracy and precision settings, dataset n43w074

5.10 Elevation, slope, and curvature errors of Algorithm 5.4, dataset n43w074

5.11 Elevation, slope, and curvature errors of Algorithms 5.3–5.5, dataset n43w074

5.12 Elevation, slope, and curvature errors of the complex algorithm and a varying R

5.13 Elevation, slope, and curvature errors of the simple algorithm and quantization, dataset n43w074

5.14 Elevation, slope, and curvature errors of the complex algorithm and quantization, dataset n43w074

5.15 Compressing the selected-point mask, dataset n43w074

5.16 Compressing the values of selected points, dataset n43w074

5.17 Elevation, slope, and curvature errors of ODETLAP compression

5.18 Compression pipelines and compressed sizes of ODETLAP compression

5.19 Specified compression ratio, compressed size, and compression errors of JPEG 2000, dataset n43w074

5.20 Relative elevation/value errors of ODETLAP compression to JPEG 2000

6.1 Statistics of the 3D datasets

6.2 Value and gradient errors of the simple algorithm and different Rs, dataset mri1

6.3 Value and gradient errors of the simple algorithm and two smoothing factors, dataset mri1

6.4 Value and gradient errors of the simple algorithm and a varying R, dataset mri1

6.5 Value and gradient errors of Algorithms 5.1 and 5.2, dataset mri1

6.6 Value and gradient errors of Algorithm 5.3 and different ωs, dataset mri1

6.7 Value and gradient errors of Algorithm 5.4, dataset mri1

6.8 Value and gradient errors of Algorithms 5.3–5.5, dataset mri1

6.9 Value and gradient errors of the complex algorithm and a varying R

6.10 Value and gradient errors of the simple algorithm and quantization, dataset mri1

6.11 Value and gradient errors of the complex algorithm and quantization, dataset mri1

6.12 Compressing the selected-point mask, dataset mri1

6.13 Compressing the values of selected points, dataset mri1

6.14 Value and gradient errors of ODETLAP compression

6.15 Compression pipelines and compressed sizes of ODETLAP compression

6.16 Relative value errors of ODETLAP compression to JP3D

7.1 Statistics of the datasets. Empty: the percentage of empty data points for the atmospheric datasets or the percentage of zero data points for the MRI dataset.

7.2 Results of segmented ODETLAP compression. Sizes are in bytes.

7.3 Smoothness measures of the datasets normalized on [0, 1], and the compression ratio for 2% target. RMSL: root-mean-square Laplacian. RMSB: root-mean-square biharmonic.

7.4 Interpolated compressed sizes in KB of segmented ODETLAP compression and JP3D when MAXE is 2% and 1%.

8.1 Results of vix using 10, 50, or 250 random targets per point. interval: the interval between successive evaluation points along the line of sight. Time: the running time of CUDA vix in seconds. RMSE: the RMSE of the approximate visibility index map. Observ.: the number of observers selected for 95% coverage.

8.2 Results of the CUDA, OpenMP, and sequential programs, averaged over 10 runs. vix, . . . , total: running time in seconds. Observers: the number of observers selected for 95% coverage.

8.3 Speedups of the CUDA and OpenMP programs over the sequential program.

LIST OF FIGURES

2.1 LOS algorithms

2.2 ODETLAP compression

3.1 Effects of the smoothing factor R

3.2 Some 1D ODETLAP compression results

3.3 Compressing a step function – red curve: approximation; green line: dataset; blue circles: selected points

4.1 DEM datasets

4.2 ODETLAP approximation error analysis

4.3 Natural neighbor interpolation error analysis

4.4 Multiquadric RBF interpolation error analysis

5.1 Grayscale images of the 2D datasets

5.2 The number of known points selected by ODETLAP compression and the errors of the approximation, dataset n43w074

5.3 Results of the simple algorithm and one or two smoothing factors, dataset n43w074

5.4 Results of the simple algorithm and a varying R, dataset n43w074

5.5 Averaging equation stencils

5.6 Optimizing known point values – optimizing one known point value in each iteration; vertical dotted lines separating rounds of iterations

5.7 JPEG 2000 and ODETLAP compression, dataset n41w076

5.8 JPEG 2000 and ODETLAP compression, dataset n41w077

5.9 JPEG 2000 and ODETLAP compression, dataset n42w076

5.10 JPEG 2000 and ODETLAP compression, dataset n42w077

5.11 JPEG 2000 and ODETLAP compression, dataset n43w074

5.12 JPEG 2000 and ODETLAP compression, dataset n43w075

5.13 JPEG 2000 and ODETLAP compression, dataset n44w074

5.14 JPEG 2000 and ODETLAP compression, dataset n44w075

6.1 Grayscale images of the 3D datasets

6.2 JP3D and ODETLAP compression, dataset mri1

6.3 JP3D and ODETLAP compression, dataset mri2

6.4 JP3D and ODETLAP compression, dataset mri3

6.5 JP3D and ODETLAP compression, dataset mri4

7.1 Slices 0, 8, and 16 of the 360 × 180 × 24 atmospheric datasets

7.2 Slices 0, 40, 80, and 120 of the 256 × 256 × 160 MRI dataset

7.3 Compressed sizes in KB and approximation errors in percentage of JP3D and segmented ODETLAP compression. AVGE: average absolute error. RMSE: root-mean-square error. MAXE: maximum absolute error. All n/e means all nonempty data points.

8.1 Computing a viewshed using the R2 algorithm

8.2 Shaded relief plot of the terrain dataset

8.3 Exact visibility index map of the terrain, normalized in integers 0–255, with roi = 100 and height = 100

8.4 The running time of CUDA vix in seconds versus the number of observers selected for 95% coverage. Data points are labeled with the value of interval.

ABSTRACT

This thesis extends Overdetermined Laplacian Partial Differential Equations (ODETLAP) for spatial data approximation and compression and parallelizes multiple observer siting on terrain, using General-Purpose Computing on Graphics Processing Units (GPGPU). Both ODETLAP compression and multiple observer siting use greedy algorithms that are parallelizable within iterations but sequential between iterations. They also demonstrate terrain-related research and applications that benefit from GPU acceleration and showcase the achievable speedups.

ODETLAP approximation approximates a spatial dataset from scattered data points on a regular grid by solving an overdetermined system of linear equations that minimizes the absolute Laplacian of the approximation and the value errors of the data points. We show that ODETLAP approximation is a linear operator and is comparable in accuracy with natural neighbor interpolation and the multiquadric-biharmonic method. Using ODETLAP approximation to approximate spatial datasets, ODETLAP compression compresses a dataset as a set of known points and their values and decompresses it as an approximation from the known points. We implement ODETLAP approximation and compression using the CUSP library and the speedup is 8 times on a GPU. We design multiple algorithms to improve the accuracy of ODETLAP compression using a limited number of known points and use them to compress 2D terrain datasets and 3D MRI datasets. The results show that ODETLAP compression is 30% to 50% better in minimizing the maximum absolute error than JPEG 2000 for the terrain datasets and 40% to 60% better than JP3D for the MRI datasets. To further increase speed, we design a segmented ODETLAP compression algorithm and use it to compress larger 3D atmospheric and MRI datasets. The results show that the compressed size of the dataset is 60% that of JP3D for the same maximum absolute error.

Multiple observer siting places multiple observer points above a terrain to maximize the total area visible from at least one observer. The algorithm first selects a set of highly visible points as tentative observers, and then iteratively selects

observers to maximize the cumulative viewshed. We improve the time and space complexities of the algorithm and parallelize it using CUDA on a GPU and using OpenMP on multi-core CPUs. The speedup is up to 60 times on the GPU and 16 times on two CPUs with 16 cores.

ACKNOWLEDGMENT

First, I would like to thank my advisor, Professor W. Randolph Franklin, for his dedicated guidance and unchanging support. This work would not have been possible without him. I would like to thank the members of my doctoral committee: Professor Christopher D. Carothers, Professor Barbara M. Cutler, Professor Richard J. Radke, and Professor Charles V. Stewart. Their comments and suggestions are extremely helpful. I would like to thank the staff of the Computer Science Department: Chris Coonrad, Terry Hayden, and Pam Paslow. They are the best friends of graduate students. I would like to thank my colleagues: Daniel N. Benedetti, David L. Hedin, Dr. Marcus V. A. Andrade, and Salles V. G. Magalhães. Their presence made life more enjoyable. Last but not least, I would like to thank my parents, Haihong Liu and Feng Li, for their enormous care and great faith in me.

CHAPTER 1

Introduction

While the performance of a single CPU core stagnates, significant performance gains come from the use of multiple cores. For example, current Intel Xeon and AMD FX CPUs have 4 to 18 cores. Another development in hardware is the popularity of accelerators like GPUs and coprocessors that have many more cores than CPUs. For example, current Intel Xeon Phi coprocessors have up to 72 x86 cores and NVIDIA Tesla GPUs have up to 26 streaming multiprocessors and 4992 CUDA cores. Except in a multitasking environment, software has to use multiple threads to utilize this hardware.

Terrain processing is usually time-consuming because the number of points is very large. For example, Shuttle Radar Topography Mission (SRTM) 3 arc-second (about 90 m) data has 1201 × 1201 points in a 1 × 1 degree block and SRTM 1 arc-second (about 30 m) data has 3601 × 3601 points in a 1 × 1 degree block [1]. The National Elevation Dataset (NED) has three nationwide layers at 2, 1, and 1/3 arc-second, and two high-resolution layers at 1/9 arc-second and 1 meter in limited areas [2]. Terrain processing benefits greatly from parallel computing because often each point can be processed separately or with a few neighboring points. For example, the aspect, slope, or curvature of a terrain can be computed by convolving it with a 3 × 3 stencil.

GPUs are massively parallel devices containing tens of streaming multiprocessors and hundreds or thousands of processing units. They are designed for computer graphics applications like video games but are fully programmable and optimized for SIMD vector operations. Using GPUs for non-graphics applications like scientific computing is called General-Purpose Computing on Graphics Processing Units (GPGPU) [3]. Although the performance of a GPU core is a small fraction of that of a CPU core, this is compensated by the number of cores. The latest GPUs boast several hundred GB/s of theoretical memory bandwidth and several TFLOPS of theoretical single-precision performance. However, theoretical performance is hard to achieve. Compute Unified Device Architecture (CUDA) is a parallel computing

platform and programming model for NVIDIA GPUs [4]. CUDA C/C++, CUDA Fortran, and wrappers of the API for other languages are the standard way to program a CUDA-enabled GPU, but it is often much easier to use a GPU-accelerated library [5]. For example, Thrust is a parallel algorithms library that provides a high-level interface for parallel programming [6].

In this thesis, we study terrain-related problems that use greedy algorithms. The major topic is Overdetermined Laplacian Partial Differential Equations (ODETLAP) for spatial data approximation and compression [7]. ODETLAP approximation is for scattered data approximation on a grid. To compute the approximate value of each grid point, it solves an overdetermined system of linear equations in the approximate values. The equations include an averaging equation for each grid point, which maximizes the smoothness of the approximation, and a known-value equation for each data point, which minimizes the approximation error at the data points. ODETLAP compression is for the compression of spatial datasets using ODETLAP approximation. It compresses a dataset as a set of data points and values and decompresses the dataset as an ODETLAP approximation from the data points. The data points are selected greedily to minimize the maximum absolute error of the approximation to the dataset. Research on ODETLAP approximation and compression benefits greatly from GPU acceleration because solving large overdetermined systems is slow.

Another topic is multiple observer siting on terrain [8]. Multiple observer siting is the placement of multiple observer points above a terrain to maximize the total area that is visible from at least one observer. It first computes a visibility index for each terrain point and selects a large set of points as tentative observers, and then greedily selects observers from the tentative observers to maximize the total visible area of the selected observers. Multiple observer siting has a lot of inherent parallelism and achieves great speedups from GPU acceleration.

The main contributions of this thesis are:

• We have shown the linearity of ODETLAP approximation and its best approximation to a dataset. We have greatly reduced the root-mean-square error of ODETLAP compression for 1D datasets.

• We have accelerated ODETLAP approximation on a GPU using the CUSP library and achieved about 8 times speedup.

• We have shown that ODETLAP approximation is comparable in accuracy to natural neighbor interpolation and the multiquadric-biharmonic method. ODETLAP approximation is not confined to the convex hull of the data points and can be faster than the multiquadric-biharmonic method with many data points.

• We have designed multiple algorithms that use a varying smoothing factor, adding and removing known points, or optimizing known point values to improve the accuracy of ODETLAP compression using a limited number of known points by about 10% to 30% for 2D and 3D datasets.

• We have used new methods to compress the set of known points and values selected by ODETLAP compression: run-length encoding along space-filling curves and delta encoding along space-filling curves or minimum spanning trees.

• We have shown that ODETLAP compression is about 30% to 50% better in the maximum absolute error than JPEG 2000 for the 2D terrain datasets that we tested. It is about 40% to 60% better in the maximum absolute error than JP3D for the 3D MRI datasets that we tested.

• We have designed a segmented ODETLAP compression algorithm that is much faster than the unsegmented algorithm and used it to compress larger 3D atmospheric and MRI datasets. The results show that the compressed size of the dataset is about 60% that of JP3D for the same maximum absolute error.

• We have reduced the time and space complexities of multiple observer siting from a polynomial in the terrain width to a polynomial in the radius of interest. We have parallelized the algorithm using CUDA on a GPU and using OpenMP on multi-core CPUs, and achieved up to 60 times speedup on the GPU and 16 times speedup on two CPUs with 16 cores.

CHAPTER 2

Background and Related Work

This chapter provides the background and discusses related work, including terrain models and visibility analysis, the CUDA architecture and programming model, ODETLAP approximation and compression, and spatial data interpolation and compression.

2.1 Terrain and Visibility

A terrain is a 2.5-D surface that has at most one intersection point with any line parallel to the z-axis [9]. In other words, a terrain is a 2D function f that maps a point (x, y) to an elevation f(x, y). Common terrain models in Geographical Information Science (GIS) are the contour map, Digital Elevation Map (DEM), and Triangulated Irregular Network (TIN) [10].

A contour map consists of contour lines that connect points of equal elevation. The difference in elevation between adjacent contour lines is called the contour interval, which is usually fixed in a contour map. As a result, the density of contour lines implies the direction and magnitude of the gradient. A contour map is usually represented as a Digital Line Graph (DLG). Contour lines are also called level sets in mathematics.

A digital elevation map is a discrete approximation of a continuous terrain. The domain of a DEM is a regular grid of points (r_i, c_j), where i and j = 0, 1, . . . , n − 1. Therefore, the number of points, or the size of the DEM, is n². The DEM is the terrain model that we use in this thesis.

A triangulated irregular network is a piecewise-linear approximation of a continuous terrain [11]. A TIN is defined on a triangulation of an irregular set of points (x_i, y_i), where i = 1, . . . , n, whose elevations are known. The elevation of a non-vertex point in the triangulation is linearly interpolated from the vertex elevations of the triangle that contains the point. The graph of a TIN is a polyhedral surface of triangular faces. A TIN can be created from a DEM by selecting points that are important in topology or by dropping points that are less important. In

this way, a TIN approximates a DEM with far fewer points and without a significant loss of accuracy, but the coordinates of the points are no longer implicit. Algorithm 2.1 creates a TIN from a DEM by greedily adding points to minimize the maximum absolute error.

Algorithm 2.1: Creating a TIN from a DEM
Input: a DEM
Output: a triangulation
  select an initial set of points from the DEM whose convex hull contains the DEM;
  compute the Delaunay triangulation of the selected points;
  while not done do
    find the point with the largest difference between its elevation in the DEM and in the TIN, and add it to the set of selected points;
    insert the newly added point into the Delaunay triangulation of the selected points and update the triangulation;
  end
  the output is the Delaunay triangulation of the selected points;

Terrain analysis extracts information from elevation data. Simple analyses include aspect, slope, and curvature; complex analyses include visibility and hydrology. The rates of change of a terrain f in the x and y directions can be computed as

f_x ≈ (f(x+1, y) − f(x−1, y)) / (2∆),
f_y ≈ (f(x, y+1) − f(x, y−1)) / (2∆),

or as below for higher accuracy, where ∆ is the distance between adjacent points [10]:

f_x ≈ ((f(x+1, y−1) + 2f(x+1, y) + f(x+1, y+1)) − (f(x−1, y−1) + 2f(x−1, y) + f(x−1, y+1))) / (8∆),
f_y ≈ ((f(x−1, y+1) + 2f(x, y+1) + f(x+1, y+1)) − (f(x−1, y−1) + 2f(x, y−1) + f(x+1, y−1))) / (8∆).

Aspect is the maximum downslope direction. It is computed as

aspect = (180/π) arctan(f_y / f_x)

and converted to a compass direction based on the signs of f_x and f_y [12]. Slope measures the steepness of a surface and its angle in degrees is computed as

slope = (180/π) arctan(√(f_x² + f_y²))

[12]. Curvature measures the convexity or concavity of a surface and is computed as

curvature = −(f_xx + f_yy)

[12]. Because terrain curvature is small, it is usually multiplied by 100 so that the unit becomes 0.01 m⁻¹.
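These central-difference formulas map directly onto array operations. The following NumPy sketch (ours, not code from the thesis) computes slope, aspect, and curvature for a DEM array; for brevity it wraps around at the grid boundary, which a real implementation would treat explicitly, and it omits the compass-direction conversion for aspect.

```python
import numpy as np

def surface_derivatives(dem, delta=1.0):
    """Slope (degrees), aspect (degrees), and curvature of a DEM array."""
    # Central differences; np.roll wraps at the boundary (a simplification).
    fx = (np.roll(dem, -1, axis=0) - np.roll(dem, 1, axis=0)) / (2 * delta)
    fy = (np.roll(dem, -1, axis=1) - np.roll(dem, 1, axis=1)) / (2 * delta)
    fxx = (np.roll(dem, -1, axis=0) - 2 * dem + np.roll(dem, 1, axis=0)) / delta**2
    fyy = (np.roll(dem, -1, axis=1) - 2 * dem + np.roll(dem, 1, axis=1)) / delta**2
    slope = np.degrees(np.arctan(np.hypot(fx, fy)))
    aspect = np.degrees(np.arctan2(fy, fx))   # still needs compass conversion
    curvature = -(fxx + fyy)
    return slope, aspect, curvature
```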

Given a terrain f on a grid G, a viewpoint v = (x_v, y_v, z_v), and a target point t = (x_t, y_t, z_t) such that (x_v, y_v) ∈ G, (x_t, y_t) ∈ G, z_v ≥ f(x_v, y_v), and z_t ≥ f(x_t, y_t), the line of sight (LOS) from v to t is the line segment between v and t [13, 14, 15]. t is said to be visible from v if the LOS does not intersect f except at the endpoints. Determining the visibility of a target point is called an LOS query. For a TIN, an LOS query is computed by checking the LOS for intersection with a series of triangles. For a DEM, there are an exact algorithm and multiple approximate algorithms. Illustrated in Figure 2.1 (a), the exact algorithm compares the elevations of the LOS and of the line segments between adjacent points at their intersections on the plane, by linearly interpolating the elevations of adjacent points. An approximate algorithm, illustrated in Figure 2.1 (b), discretizes the LOS on G by line rasterization and compares point elevations on the LOS and in the DEM. The viewshed of a viewpoint v is the set of terrain points visible from v [13, 15].

Figure 2.1: LOS algorithms. (a) Exact. (b) Approximate.

A maximum visible distance r on the plane, called the radius of interest, is usually specified for v, so that its viewshed is part of a disc of radius r on the plane. For a TIN, the viewshed consists of complex polygons with holes, whose complexity is O(n²) if the TIN has n vertices. For a DEM, the viewshed consists of a set of points, which can be represented as a binary image. The exact algorithm to compute a viewshed on a DEM uses the exact LOS algorithm to compute the visibility of each point within r. The complexity of the algorithm is Θ(r³) because there are Θ(r²) points and each takes Θ(r) time. An approximate algorithm typically computes the LOSs of boundary points and derives the visibility of inner points along the LOSs, with complexity Θ(r²). The total viewshed of a DEM gives the size of the viewshed of each point as a viewpoint [16]. When the heights of observer and target above the terrain are equal, all points are mutually visible or invisible. The size of a viewshed is the number of points visible from a point, or the number of points from which a point is visible.
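The approximate LOS query of Figure 2.1 (b) is easy to sketch. The code below is our illustration, not the thesis's implementation; the function name and arguments are assumptions. It rasterizes the sight line, linearly interpolates the LOS elevation, and compares it with the DEM elevation at the nearest grid point.

```python
import numpy as np

def visible(dem, v, t, hv=0.0, ht=0.0):
    """Approximate LOS query: is cell t visible from cell v on the DEM?"""
    (xv, yv), (xt, yt) = v, t
    zv, zt = dem[xv, yv] + hv, dem[xt, yt] + ht
    steps = max(abs(xt - xv), abs(yt - yv))
    for i in range(1, steps):
        s = i / steps
        # Nearest grid point along the rasterized sight line
        x, y = round(xv + s * (xt - xv)), round(yv + s * (yt - yv))
        if dem[x, y] > zv + s * (zt - zv):   # terrain blocks the LOS
            return False
    return True

dem = np.zeros((100, 100)); dem[50, 50] = 30.0     # one blocking cell
print(visible(dem, (10, 50), (90, 50), hv=5.0))    # -> False
```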

2.2 CUDA

Compute Unified Device Architecture (CUDA) is the NVIDIA solution to GPGPU [4]. It is based on NVIDIA GPUs and CUDA C/C++ and Fortran compilers, and supports third-party wrappers for other languages, compiler directives, and libraries. The main components of a CUDA-enabled GPU are streaming multiprocessors and global memory. Streaming multiprocessors are like vector processors and the global memory is like the main memory. In the Kepler compute architecture [17], each streaming multiprocessor, called an SMX, contains 192 single-precision CUDA cores and 64 double-precision units, so that double-precision performance is only 1/3 of single-precision performance. In the memory hierarchy of Kepler GPUs [17], each SMX has 64K or 128K 32-bit registers, 64KB or 128KB of memory configurable between shared memory and L1 cache, and a 48KB cache for read-only data. The GPU has 1536KB of L2 cache shared among the SMXs, and 5GB to 12GB of global memory, which is much slower than registers and shared memory.

A CUDA program starts on the CPU. In a simple workflow, it copies data from the main memory to GPU global memory, launches parallel functions called kernels on the GPU, and copies results from GPU global memory to the main memory. A kernel launch creates a hierarchy of CUDA threads executing the same kernel function. Threads are grouped into thread blocks and thread blocks are grouped into a grid. Each thread block is assigned to a single SMX for execution. Thread blocks are independent of each other and can be scheduled in any order. Each thread block has its own shared memory and the threads of a thread block can cooperate via shared memory and barrier synchronization. Kernel launches or thread grids can share results in the global memory via device synchronization.

While CUDA C/C++ is good for highly optimizable tasks that benefit from a detailed mapping onto hardware, Thrust [6] is good for problems that are less optimizable or do not benefit from a detailed mapping onto hardware. Thrust is a C++ template library that provides a high-level interface to CUDA programming. It allows the description of computation using common parallel algorithms and is fully interoperable with CUDA C/C++. Thrust provides two generic containers, one in host memory and one in device memory, and multiple algorithms for transformations, reductions, prefix sums, reordering, and sorting. Built on Thrust, CUSP [18] is a C++ template library for sparse linear algebra. CUSP provides multiple sparse matrix formats, iterative solvers, and preconditioners.

2.3 ODETLAP

Overdetermined Laplacian Partial Differential Equations (ODETLAP) is a spatial data approximation and compression method invented by Franklin [7, 20].

ODETLAP has two components: approximation and lossy compression. For approximation, it computes a value for each point of a regular grid from a set of known points and values by solving an overdetermined system of linear equations. The equations include an averaging equation for each point, known or unknown, and a known-value equation for each known point. The basic form of the averaging equation is the finite-difference approximation of Laplace's equation, which makes the value of each point equal to the average value of the adjacent points:

u(x − 1, y) + u(x + 1, y) + u(x, y − 1) + u(x, y + 1) − 4u(x, y) = 0.

Multiplying both sides of the equation by a positive parameter R gives

Ru(x − 1, y) + Ru(x + 1, y) + Ru(x, y − 1) + Ru(x, y + 1) − 4Ru(x, y) = 0. (2.1)

R is called the smoothing factor. It does not change the equation but changes its weight in the system. The known-value equation of each known point makes its value equal to its known value.

u(x, y) = v(x, y). (2.2)

Equations 2.1 and 2.2 constitute a system of more equations than unknowns, whose approximate solution gives a value to each point of the grid. The approximate value of each point, especially an unknown one, is nearly the average value of the adjacent points, and the approximate value of each known point is nearly its known value. R influences the smoothness of the approximation and its accuracy at known points by changing the relative weight of the averaging equations and the known-value equations. For lossy compression, the method selects a small set of points and values K from a point dataset as its compression, and reconstructs the dataset from K using ODETLAP approximation. K is selected so that the error of the reconstructed dataset is as small as possible. K is compressed using general data compression methods to reduce the size of the compression.

Figure 2.2: ODETLAP compression [19]

As Figure 2.2 illustrates, a small set of 1000 points is selected by ODETLAP compression from a 400 × 400 elevation dataset and compressed. The elevation dataset is reconstructed by ODETLAP approximation from the set of selected points. In addition, ODETLAP approximation can use other information, like contour lines and user-supplied points, besides the selected points in the reconstruction of the dataset.

Given a grid G_{n×n}, a set of known points (x_i, y_i), where i = 1, . . . , k, and values v(x_i, y_i), ODETLAP approximation takes the approximate value of each point u(x, y) as an unknown variable and computes it using an overdetermined system of Equations 2.1 and 2.2. The system has n² + k equations and n² unknowns,

    [ A1 ]       [ 0 ]
    [    ]  x  = [   ] ,
    [ A2 ]       [ v ]

where A1 is n² × n² with at most 5 nonzero elements in each row (the center point and its in-grid neighbors), A2 is k × n² with 1 nonzero element in each row, x = (u(0, 0), . . . , u(0, n − 1), u(1, 0), . . . , u(n − 1, n − 1))ᵀ is the vector of unknowns, and v = (v(x_1, y_1), . . . , v(x_k, y_k))ᵀ is the vector of known values. The direct method to solve an overdetermined system Ax = b is to compute x = (AᵀA)⁻¹Aᵀb. For ODETLAP approximation, A and AᵀA are sparse but (AᵀA)⁻¹ is dense and requires too much memory. Therefore, we solve the normal equations AᵀAx = Aᵀb using an iterative solver, where AᵀA is a sparse n² × n² matrix. R specifies the relative weight of the averaging equations and the known-value equations. If R = 1, they have equal weight. If R < 1, the approximation is less smooth but more accurate at known points. If R > 1, the solution is smoother but less accurate at known points.
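As a concrete illustration, here is a minimal SciPy sketch of this construction (our code, not the thesis's C++/CUSP implementation; the function and parameter names are assumptions). It assembles A1 and A2 with the Dirichlet convention described below, forms the normal equations, and solves them with conjugate gradients.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

def odetlap_approx(n, known, R=1.0):
    """Approximate an n x n grid from known points {(row, col): value}."""
    N = n * n
    idx = lambda r, c: r * n + c
    rows, cols, vals = [], [], []
    # One averaging equation per grid point (Equation 2.1); neighbors
    # outside the grid are folded into the center (Dirichlet convention).
    for r in range(n):
        for c in range(n):
            e, center = idx(r, c), -4.0 * R
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    rows.append(e); cols.append(idx(rr, cc)); vals.append(R)
                else:
                    center += R
            rows.append(e); cols.append(e); vals.append(center)
    A1 = sp.csr_matrix((vals, (rows, cols)), shape=(N, N))
    # One known-value equation per known point (Equation 2.2).
    pts = list(known)
    A2 = sp.csr_matrix((np.ones(len(pts)),
                        (range(len(pts)), [idx(r, c) for r, c in pts])),
                       shape=(len(pts), N))
    A = sp.vstack([A1, A2]).tocsr()
    b = np.concatenate([np.zeros(N), [known[p] for p in pts]])
    x, info = cg(A.T @ A, A.T @ b)   # iterative solve of the normal equations
    return x.reshape(n, n)

u = odetlap_approx(50, {(10, 10): 1.0, (40, 25): 3.0, (20, 45): 2.0})
```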

ODETLAP approximation extends naturally to higher dimensions. In 3D, the averaging equation in the form of the discrete Laplace's equation is

u(x−1, y, z) + u(x+1, y, z) + u(x, y−1, z) + u(x, y+1, z) + u(x, y, z−1) + u(x, y, z+1) − 6u(x, y, z) = 0.

In general, the averaging equation in the form of the discrete dD Laplace's equation is

Σ_{i=1}^{d} ( u(. . . , x_i − 1, . . .) + u(. . . , x_i + 1, . . .) ) − 2d u(x_1, . . . , x_d) = 0.

A boundary point of the grid has fewer adjacent points than an inner point. In this thesis, we use the Dirichlet boundary condition for ODETLAP approximation and specify that adjacent points outside of the grid take the same value as the boundary point. In dD, a boundary point on a bD boundary has d + b adjacent points.

The other component of ODETLAP is lossy data compression [20, 21, 22, 23]. The method selects an important set of data points from a point dataset as a compression of the dataset, and decompresses it from the data points using ODETLAP approximation. Algorithm 2.2 shows a greedy algorithm to select a set of known points from a DEM. It first selects an initial set of known points and computes an approximation from the known points. Then it iteratively adds a number of data points with large absolute differences between approximate and data values to the set of known points, and computes a new approximation from the known points. A minimum distance is imposed between points added in the same iteration to prevent clustering. The algorithm stops when the approximation is accurate enough.

Algorithm 2.2: ODETLAP compression
Input: a DEM
Output: a set of known points
  select an initial set of known points;
  compute the ODETLAP approximation of the known points;
  while not done do
    add a number of data points with large errors to the set of known points;
    compute the ODETLAP approximation of the known points;
  end
  the output is the set of known points;
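To make the selection loop concrete, here is a sketch of Algorithm 2.2 reusing the odetlap_approx sketch above; the initial corner points, batch size, minimum distance, and stopping test are our illustrative assumptions, not the thesis's settings.

```python
import numpy as np

def select_points(dem, max_err, batch=50, min_dist=5, R=1.0):
    """Greedy ODETLAP point selection (sketch of Algorithm 2.2)."""
    n = dem.shape[0]
    known = {(0, 0): dem[0, 0], (0, n - 1): dem[0, n - 1],
             (n - 1, 0): dem[n - 1, 0], (n - 1, n - 1): dem[n - 1, n - 1]}
    while True:
        u = odetlap_approx(n, known, R)
        err = np.abs(u - dem)
        if err.max() <= max_err:
            return known
        added = []
        # Add the worst points, skipping any too close to one another
        for flat in np.argsort(err, axis=None)[::-1]:
            p = np.unravel_index(flat, err.shape)
            if all(abs(p[0] - q[0]) + abs(p[1] - q[1]) >= min_dist
                   for q in added):
                known[p] = dem[p]
                added.append(p)
                if len(added) == batch:
                    break
```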

The set of known points and their values are compressed separately. The points are represented as a binary image the size of the grid, where 1 denotes a known point and 0 denotes an unknown point. The image is first compressed by Run-Length Encoding (RLE), counting the number of 0s between 1s; each run length l is then stored in the following format (an encoder sketch follows the list):

• if l < 254, a byte for l
• if 254 ≤ l < 510, a marker byte 0xFE and a byte for l − 254
• if 510 ≤ l < 766, a marker byte 0xFF and a byte for l − 510
• if l ≥ 766, two marker bytes 0xFFFF and two bytes for l
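A direct transcription of this byte format into code (our sketch; run lengths are assumed to fit in two bytes, and trailing zeros after the last 1 are not encoded):

```python
def encode_run(l):
    """Encode one run length l using the marker-byte format above."""
    if l < 254:
        return bytes([l])
    if l < 510:
        return bytes([0xFE, l - 254])
    if l < 766:
        return bytes([0xFF, l - 510])
    return bytes([0xFF, 0xFF]) + l.to_bytes(2, "big")

def encode_mask(bits):
    """RLE a flat 0/1 mask as the lengths of 0-runs between 1s."""
    out, run = bytearray(), 0
    for b in bits:
        if b:
            out += encode_run(run)
            run = 0
        else:
            run += 1
    return bytes(out)
```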

The set of known point values is compressed by delta encoding and a file compressor.

Lau et al. [24] used ODETLAP approximation for artifact-free bathymetry reconstruction from very unevenly distributed depth data. They proposed using a smaller R where data is sparse to preserve features and a bigger R where data is dense to reduce artifacts. To preserve more bathymetry features, they proposed a two-step approach that first reconstructs a surface using ODETLAP approximation with R = 1 and then smooths it using ODETLAP approximation with a bigger R [25]. They showed that the results of nearest-neighbor interpolation, inverse distance weighting, Kriging, and cubic spline interpolation are full of artifacts.

In another application of ODETLAP approximation, Lau and Franklin [26, 27, 28, 29] proposed an induced terrain method for completing fragmentary river networks. It first constructs a terrain with known river segments and elevations, and then derives a river network through all known river segments. They improved the accuracy of the method by considering known non-river locations. To prevent a river from crossing a non-river location, it raises the elevations of non-river locations above any known river locations. Later, they proposed a method for river segment connection, which exploits segment geometries to generate the induced terrain. It assigns the lowest elevations to river segments and increasingly higher elevations to more distant locations. It also assigns lower elevations to areas radiating from segment tips to attract water flow in river derivation.

In ODETLAP compression, Franklin et al. [30, 31] proposed a method to reconstruct terrains with accurate slope as well as elevation. Li et al. [32, 33] proposed a method for 3D oceanographic data compression and extended the method for a 5D geospatial dataset, which outperforms the 3D Set Partitioning In Hierarchical Trees (SPIHT) of Said, Kim, and Pearlman [34, 35]. Stuetzle et al. [36] proposed a terrain simplification method based on the compression of hydrology features and an error metric based on the potential energy of water flow. Tracy et al. [37, 38, 39] proposed a path planning method and several error metrics to evaluate the quality of various terrain compression methods.

ODETLAP approximation has been accelerated by parallel processing. Li and Franklin [19, 40] accelerated ODETLAP approximation on a GPU using CUSP and used it for ODETLAP compression in MATLAB. Benedetti et al. [41, 42] accelerated ODETLAP approximation and compression using Thrust and CUSP. Stookey et al. [43, 44] proposed a parallel ODETLAP approximation for terrain compression and decompression on a cluster. The method divides a grid into overlapping patches, applies ODETLAP approximation to each patch, and merges the approximations together. It not only reduces the complexity of ODETLAP approximation but also enables the concurrent processing of patches.

2.4 Spatial Interpolation

The problem of spatial interpolation is to predict the values of unknown points from the values of known points, mostly in 2D. Both the unknown and known points are usually points of a regular grid. Spatial interpolation includes both interpolation with exact values of known points and approximation with inexact values of known points [45]. Spatial interpolation methods are based on the assumption of spatial correlation between point values, as stated by Tobler's first law of geography: "Everything is related to everything else, but near things are more related than distant things" [46].

The simplest method is nearest-neighbor interpolation, which gives an unknown point the value of its nearest known point [10]. Let x be an unknown point and x_i, i = 1, . . . , k, be a set of known points. The value of x is computed as u(x) = v(x_i), where d(x, x_i) = min_j d(x, x_j) and d is a distance measure. With the Voronoi diagram of the known points, an unknown point is given the value of the known point whose Voronoi polygon contains the unknown point.

Like nearest-neighbor interpolation, natural neighbor interpolation is also based on the Voronoi diagram of the known points, but the value of an unknown point is computed as a weighted sum of known point values, u(x) = Σ_{i=1}^{k} w_i v(x_i), so that the interpolation is smooth [10]. The weight w_i of known point x_i is the fraction of its Voronoi polygon occupied by the Voronoi polygon of the unknown point when it is inserted into the Voronoi diagram. Therefore, only the known points that share a Voronoi edge with the unknown point contribute to its value.

Linear interpolation is based on the Delaunay triangulation of the known points [10]. The value of an unknown point is computed as a weighted sum of the values of the three known points defining the triangle that contains the unknown point, so that the unknown point is coplanar with the three known points when raised to their values. The interpolation is a piecewise linear function.

Inverse Distance Weighting (IDW) gives an unknown point a weighted average value of some or all of the known points. In the original method of Shepard [47], the value of an unknown point is computed as

u(x) = ( Σ_{i=1}^{k} w_i v(x_i) ) / ( Σ_{i=1}^{k} w_i ),   w_i(x) = d(x, x_i)^{−p},

where w_i is the weight function of x_i and p is a positive number called the power parameter. The weight of a known point is inversely proportional to its distance to the unknown point, so that more distant known points contribute less to the interpolated value. As p → ∞, an IDW interpolation becomes closer and closer to a nearest-neighbor interpolation. If the number of known points is large, it is expensive to compute unknown values using all the known values, so a maximum distance or number of known points is specified. In a modified IDW of Renka [48], the weight function is defined as

w_i(x) = ( max(R_i − d(x, x_i), 0) / (R_i d(x, x_i)) )²,

where R_i is a radius of influence such that w_i is nonzero only for unknown points within R_i of x_i.
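For concreteness, a minimal sketch of Shepard's original IDW (our code; the array shapes and names are illustrative):

```python
import numpy as np

def idw(x, xs, vs, p=2.0):
    """Shepard IDW at point x from known points xs (k, 2) with values vs."""
    d = np.linalg.norm(xs - x, axis=1)
    if np.any(d == 0):                 # exact hit on a known point
        return vs[np.argmin(d)]
    w = d ** -p                        # weights decay with distance
    return np.sum(w * vs) / np.sum(w)

xs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
vs = np.array([1.0, 2.0, 3.0])
print(idw(np.array([0.25, 0.25]), xs, vs))
```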

One class of interpolation methods computes the steady-state solution of a Partial Differential Equation (PDE) [49]. A PDE has partial derivatives and multiple variables. Simple PDEs can be solved analytically by reducing them to Ordinary Differential Equations (ODEs). Complex PDEs can be solved numerically by changing them to systems of finite-difference equations and using iterative methods. The finite-difference method is based on the discretization of continuous functions. For example, the forward-difference, backward-difference, and central-difference approximations of the derivative of f(x) are

f′(x) ≈ (f(x + h) − f(x)) / h,
f′(x) ≈ (f(x) − f(x − h)) / h,
f′(x) ≈ (f(x + h) − f(x − h)) / (2h),

of which the first two are first-order accurate and the last is second-order accurate. To solve a linear system Ax = b, basic methods include the Jacobi method, the Gauss-Seidel method, and Successive Over-Relaxation (SOR) [50]. Starting with an initial guess, these methods repeatedly update an approximate solution until it converges. Better methods are the Krylov subspace methods, which include the Conjugate Gradient method (CG), the Conjugate Residual method (CR), and the Generalized Minimum Residual method (GMRES). Based on the Krylov subspace of A and b, these methods approximate the solution A⁻¹b by p(A)b,

X X 2 2 t0 = ((z(x + 1, y) − z(x, y)) + (z(x, y + 1) − z(x, y)) ). x y

Adding the pycnophylactic constraint and Lagrange multipliers to the formula, it is

r X X t = t0 + λi(Hi − z(x, y)). i=1 (x,y)∈Ri

Then tz(x,y) = 0 and tλi = 0 form a linear system of n + r equations and n + r variables, where n is the number of grid points and r is the number of regions, which can be used to solve z(x, y) and λi. Splines are a class of interpolation methods that build a continuous function from scattered data points. Mitasova et al. [52] described the objective of spline methods as an approximation function that is both close to the data points and smooth. Given data points xj and their values zj, the problem is to find a function s that minimizes k X 2 (zj − s(xj)) w + I(f), j=1 where w is a smoothing parameter and I is a smoothness seminorm. For example, RR 2 2 2 if I(s) = (sxx + 2sxy + syy) dx dy, the method is a Thin Plate Spline (TPS) [53]. 17

The solution is

s(x) = t(x) + Σ_{j=1}^{k} λ_j r(x, x_j),

where t is a trend function and r is a Radial Basis Function (RBF), both depending on I. Hutchinson [54] used partial thin plate splines to interpolate mean rainfall from point data. The model approximates the value of point (x_i, y_i) as

z_i = f(x_i, y_i) + Σ_{j=1}^{k} β_j ψ_j(x_i, y_i) + ε_i,

where f is an unknown function and β_j are unknown parameters. ψ_j are known functions and ε_i are random errors with E(ε εᵀ) = V σ², where ε = (ε_1, . . . , ε_n)ᵀ and V and σ² are problem specific. Let z = (z_1, . . . , z_n)ᵀ and g = (g_1, . . . , g_n)ᵀ, with g_i = f(x_i, y_i) + Σ_{j=1}^{k} β_j ψ_j(x_i, y_i). f and β_j can be computed by minimizing

(z − g)ᵀ V⁻¹ (z − g) + ρ J_m(f),

where J_m(f) is a smoothness measure of f defined in terms of its mth derivatives and ρ is the smoothing parameter.

Hardy's multiquadric-biharmonic method [55] is similar to yet different from TPS. While TPS is biharmonic in 2D, the method is biharmonic in 3D. It is very effective in estimating steep gradients. In collocation mode, the interpolation function is

h(x, y) = Σ_{j=1}^{k} α_j ((x − x_j)² + (y − y_j)² + ∆²)^{1/2},

where α_j are unknown coefficients and ∆ is an arbitrary constant. In least squares mode, additional equations are written for maximum and minimum points, whose slopes are zero:

Σ_{j=1}^{k} α_j ((x − x_j)² + (y − y_j)² + ∆²)^{−1/2} (x − x_j) = 0,

Σ_{j=1}^{k} α_j ((x − x_j)² + (y − y_j)² + ∆²)^{−1/2} (y − y_j) = 0.

Slopes are fitted exactly but values are not. In osculating mode, both slopes and values are fitted exactly. Both TPS and the multiquadric-biharmonic method are instances of a category of RBF approximation/interpolation methods [56]. In general, the problem is to

find an interpolation function f that satisfies f(xi) = v(xi), i = 1, . . . , k, that is, f interpolates the values of known points. Such a function that also minimizes energy and maximizes smoothness has the form

f(x) = Σ_{i=1}^{k} w_i φ(d(x, x_i)) + p(x),

where w_i is a weight coefficient, φ is an RBF centered at x_i, and p is a polynomial related to φ. For example, φ(r) = r² log(r) is a TPS and φ(r) = √((r/ε)² + 1) is a multiquadric RBF, where ε is an adjustable constant. The relevant p in 2D is p(x, y) = c_1 + c_2 x + c_3 y. In addition, orthogonality requires

Σ_{i=1}^{k} w_i = Σ_{i=1}^{k} w_i x_i = Σ_{i=1}^{k} w_i y_i = 0.

The above equations constitute a linear system

[ A   P ] [ w ]   [ v ]
[ Pᵀ  0 ] [ c ] = [ 0 ],

where

A = ( φ(d((x_i, y_i), (x_j, y_j))) )_{k×k},
P = ( 1, x_i, y_i )_{k×3},
w = (w_1, . . . , w_k)ᵀ,
c = (c_1, c_2, c_3)ᵀ,
v = (v(x_1, y_1), . . . , v(x_k, y_k))ᵀ.

The solution to the system gives the coefficients w_i and c_i and determines the interpolation function f.
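The following NumPy sketch (ours, not thesis code) assembles and solves this system with the TPS basis φ(r) = r² log r; substituting the multiquadric basis only changes tps().

```python
import numpy as np

def tps(r):
    """Thin plate spline basis, with the r = 0 singularity handled."""
    out = np.zeros_like(r)
    mask = r > 0
    out[mask] = r[mask] ** 2 * np.log(r[mask])
    return out

def rbf_fit(xs, vs):
    """Solve the (A P; P^T 0) system for weights w and coefficients c."""
    k = len(xs)
    A = tps(np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=2))
    P = np.hstack([np.ones((k, 1)), xs])       # p(x, y) = c1 + c2 x + c3 y
    M = np.block([[A, P], [P.T, np.zeros((3, 3))]])
    sol = np.linalg.solve(M, np.concatenate([vs, np.zeros(3)]))
    return sol[:k], sol[k:]

def rbf_eval(x, xs, w, c):
    r = np.linalg.norm(xs - x, axis=1)
    return w @ tps(r) + c[0] + c[1] * x[0] + c[2] * x[1]
```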

Reviews of comparative studies of spatial interpolation methods can be found in the works of Li and Heap [57, 58, 59]. They considered three categories of methods: non-geostatistical methods, geostatistical methods, and combined methods. Gosciewski [60] proposed using a combination of interpolation methods to reduce deformation in DEM generation. Hugentobler and Schneider [61, 62] used triangular Coons patches in digital terrain modeling and proposed a modification to the method to allow the modeling of breaklines. Nittel et al. [63] extended data stream management systems for near real-time interpolation of continuous phenomena using 260K sensor updates. Pouderoux et al. [64] proposed a method for interpolating elevation data by dividing a surface into subdomains, interpolating each subdomain using RBF interpolation, and blending the subsurfaces hierarchically. Wise [65] evaluated the elevation and gradient errors of five interpolation methods using cross-validation, by down-sampling a terrain and interpolating it back to the original resolution. The results show that the elevation error has weak correlation with the visual quality of interpolated terrains, and that it is not a good predictor of the gradient error at individual points but a good predictor of the gradient error in a whole terrain.

PDE-based image interpolation is well studied. Bertalmio et al. [66] proposed a method for image inpainting by smoothly propagating information in isophote directions from the surroundings of holes. The method is inspired by the use of PDEs in image processing. The results are sharp and without color artifacts, but the method cannot reproduce large textured areas. Caselles et al. [67] discussed models of elliptic PDEs for interpolating from points or curves on the plane. In particular, they studied the Absolute Minimal Lipschitz Extension (AMLE) model and applied it to restore poor dynamic range images. AMLE is a nonlinear elliptic PDE and is solved by a nonlinear over-relaxation method. Almansa et al. [68] used AMLE to interpolate DEMs and discussed its relationship with other methods. Weickert [69] studied anisotropic diffusion filters using an adapted diffusion tensor instead of a scalar diffusivity. Anisotropic diffusion filtering smooths within a region while diffusing along edges. It can be used to eliminate noise and small details while preserving edges. Facciolo et al. [70] proposed a constrained anisotropic diffusion model and applied it to the interpolation of DEMs. Masnou and Morel [71] proposed a method to perform disocclusion using level lines. Disocclusion is the recovery of hidden parts of an image by interpolation from surrounding areas. The method can recover strong discontinuities, which is hard to achieve with PDE-based methods.

2.5 Spatial Data Compression

Geospatial data are stored in numerous open or proprietary vector and raster file formats, some of which are compressed formats like ESRI's SDC and JPEG 2000 [72]. These data files are often compressed by general file archivers and compressors.

Much related work is on terrain compression. Coll et al. [73] proposed a LEPP-surface triangulation method that starts with a coarse triangulation of an input terrain and iteratively adds points to reduce the worst edge approximation error. They then proposed a new LEPP-surface method to achieve both a small approximation error and well-shaped 3D triangles [74]. Löffler and Schumann [75] proposed a feature-sensitive terrain simplification method, which maintains the regularity of a terrain mesh and recomputes vertex positions according to the Quadric Error Metric (QEM). Zhou and Chen [76] proposed a DEM generalization method for converting a DEM to a TIN, which keeps important features and morphology using a minimum number of points. Chen et al. [77] compared three drainage-constrained DEM generalization methods, including a stream burning method, a surface fitting method, and a constrained-TIN method. Results show that the constrained-TIN method is the best. Solé et al. [78] proposed a compression method for DEMs, combining the Morse and drainage structures, Laplacian interpolation, and arithmetic coding. Franklin and Inanc [79] proposed a method for lossy terrain compression based on planar data segments. Inanc [80] proposed a terrain compression method called Overdetermined Compression (ODETCOM) that uses a causal template for prediction and compresses the error using arithmetic coding. Wei et al. [81] proposed an adaptive pattern-driven method for terrain compression by modeling and encoding visual patterns using a set of extracted features. Durdević and Tartalja [82] proposed a method for lossy or lossless compression and decompression of regular height fields. Lossy compression approximates a height field with quadratic Bézier surfaces and lossless compression superimposes the residual over the lossy approximation. The method is implemented in CUDA and supports single point decompression and progressive decompression. Zeng and Chen [83] proposed a terrain compression method based on the Discrete Wavelet Transform (DWT) and Run-Length Encoding (RLE). The method uses Message Passing Interface (MPI) for data distribution among compute nodes and CUDA for computation. It overlaps MPI communication, CPU-GPU memory transfers, and GPU computation to increase efficiency.

PDE-based image compression is extensively studied. Szirányi et al. [84] applied anisotropic diffusion in image compression. They used anisotropic diffusion as a preprocessing step before JPEG compression, which results in higher quality decompressed images in Peak Signal-to-Noise Ratio (PSNR) than JPEG compression without preprocessing. Dell [85] studied the problem of finding a good inpainting mask or seed points for lossy image compression. He proposed a significance measure of a seed point for the reconstruction of another image point. Zimmer [86] studied PDE-based inpainting for image compression by storing small neighborhoods of corners instead of edges and corners. Galić et al. [87] proposed using the Edge-Enhancing Anisotropic Diffusion (EED) of Weickert [69] for scattered data approximation and image compression. They used the B-Tree Triangular Coding (BTTC) of Distasi et al. [88] to encode an image and EED for decoding.
Then they proposed a method using EED within encoding [89]. Results show that the method outperforms JPEG and is close to JPEG 2000 at high compression ratios. Later, Schmaltz et al. [90] proposed a method using EED and an adaptive rectangular grid. Enhanced with other techniques, the method can outperform JPEG 2000 in Mean Square Error (MSE) at high compression ratios. Then they demonstrated the advantages of EED over other PDEs in image compression, various rectangular subdivision patterns and the extension of the method to 3D and shape data [91]. Peter and Weickert [92] extended the method with a luma preference mode in the YCbCr space for color image compression, which outperforms JPEG 2000 and the method in the RGB space. Mainberger et al. [93] studied the optimization of the positions and values of 22 selected points in image compression using the Laplace’s equation for inpainting. They proposed a two-step approach to optimize the point mask. The first step is called probabilistic sparsification, which iteratively removes points from an image until a desired point density is reached. In each iteration, a fraction of random mask points is removed and their inpainting errors computed, then a fraction of the removed points with the largest errors are put back in the mask. The second step is called nonlocal pixel exchange. In each iteration, it selects a number of random non-mask points and exchanges the worst one with a random mask point if the exchange improves the mask. The inpainting u = r(c, f) of a mask c and an image f is computed by solving a system of Laplace’s equations ∆u(x) = 0 for non-mask points and u(x) = f(x) for mask points. If c is fixed, then r(c, f) is a linear function of f. To optimize the values of mask points for the MSE of r(c, f) is to find an α that minimizes kf − r(c, f + α)k2.

k X r(c, f + α) = r(c, f) + r(c, α) = r(c, f) + αir(c, ei), i so that α can be solved by the least squares method. Because solving α is very slow, an iterative approach is proposed that optimizes point values individually by 2 minimizing kf − r(c, f + αiei)k with

T r(c, ei) (f − r(c, f)) αi = T . r(c, ei) r(c, ei) CHAPTER 3 1D ODETLAP Approximation and Compression

This chapter discusses ODETLAP approximation and compression in 1D. ODETLAP is very similar from 1D to 3D. However, 1D algorithms are significantly easier to implement and faster to execute than their 2D and 3D counterparts. 1D datasets are also easier to visualize than 2D and 3D datasets. Therefore, studying ODETLAP in 1D helps us make observations that are harder to see in higher dimensions.

3.1 1D ODETLAP Approximation In 1D, the discrete Laplace’s equation is f(x − 1) + f(x + 1) − 2f(x) = 0. 1D ODETLAP approximation is implemented in Python using the NumPy library, while 2D and 3D ODETLAP approximation are implemented in C++ using CUSP. 1D approximation does not use an iterative solver and computes the exact solution of an ODETLAP system instead of an approximate solution. The value of a known point influences the values of all unknown points in an exact solution but may appear to only influence nearby unknown points in an approximate solution because an iterative solver is like a diffusion process. In 1D, the influence of a known point is strong within one or two other known points but is weak beyond a few other known points. It also depends on the proximity and values of other known points. Figure 3.1 shows the approximation of three points at x = 100, 500, 900 using a wide range of smoothing factor R’s. Blue circles are the known points and red curves are approximations. When R = 1 × 10−16, the approximation is noisy due to numerical instability. When R = 1 × 10−8 to 100, the approximation is almost the same. When R = 10000, the approximation no longer goes near the known points. When R = 1 × 108, the approximation is the average value of the known points. When R = 1 × 1016, known point values no longer matter and the approximation is 0. Here proves the linearity of ODETLAP approximation. When the coordinates of known points are fixed, ODETLAP approximation is a linear operator of their


Figure 3.1: Effects of the smoothing factor R

When the coordinates of known points are fixed, ODETLAP approximation is a linear operator of their values, because the coefficient matrix of the ODETLAP system depends only on the coordinates. Let the system be $Ax = b = Bv$, where $B = \begin{pmatrix} O \\ I \end{pmatrix}$ and $v$ is the vector of known point values. Then $A^T A x = A^T B v$ and $x = (A^T A)^{-1} A^T B v$. Let $M = (A^T A)^{-1} A^T B$, so that $x = Mv$. If $v = v_1 + v_2$, then

$$x = Mv = M(v_1 + v_2) = Mv_1 + Mv_2 = x_1 + x_2,$$

where $x_1$ and $x_2$ are the solutions with known point values $v_1$ and $v_2$. If $v = kv_0$, then $x = Mv = Mkv_0 = kMv_0 = kx_0$, where $x_0$ is the solution with known point values $v_0$. Therefore, the ODETLAP approximation $x = Mv = \mathrm{ODL}(v)$ is a linear operator of $v$:

$$\mathrm{ODL}(v_1 + v_2) = \mathrm{ODL}(v_1) + \mathrm{ODL}(v_2),$$

$$\mathrm{ODL}(kv) = k\,\mathrm{ODL}(v).$$
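To make the 1D formulation concrete, here is a minimal NumPy/SciPy sketch of building and exactly solving a 1D ODETLAP system. It assumes one reading of the setup above (smoothness equations scaled by R with right-hand side 0, value equations unweighted, matching B = (O; I)); the helper name odetlap_1d is hypothetical, not the dissertation's code.

```python
# A minimal 1D ODETLAP sketch. Assumptions: smoothness equations are
# scaled by R with right-hand side 0; value equations are unweighted,
# matching B = (O; I) above. The name odetlap_1d is hypothetical.
import numpy as np
import scipy.sparse as sp

def odetlap_1d(n, known_idx, known_val, R):
    m = n - 2
    # Smoothness rows: R * (u[x-1] - 2 u[x] + u[x+1]) = 0 for interior x.
    L = sp.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(m, n))
    A = sp.vstack([R * L, sp.eye(n, format="csr")[known_idx]])
    b = np.concatenate([np.zeros(m), known_val])
    # Exact least-squares solution via dense normal equations (fine for
    # small n; the 2D/3D chapters use an iterative GPU solver instead).
    return np.linalg.solve((A.T @ A).toarray(), A.T @ b)

u = odetlap_1d(1000, [100, 500, 900], [1.0, 2.0, 1.5], R=0.01)
```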

3.2 1D ODETLAP Compression

The problem of ODETLAP compression is: given a set of $n$ points $S = \{x\}^n$ with values $f(x)$, find a set of $k$ points $T = \{x\}^k$ with values $g(x)$, such that the ODETLAP approximation $h(x)$ of $T$ and $g(x)$ on $S$ has the minimum p-norm $\|h - f\|_p$, where $p$ can be 1, 2, or infinity, corresponding to the average absolute, root mean square, or maximum absolute difference between $h$ and $f$. It is usually assumed that $S$ is the set of points of a regular grid, that $T \subset S$, and sometimes that $g(x) = f(x)$ for $x \in T$. The problem then becomes a combinatorial optimization problem with $\binom{n}{k}$ possible solutions. Approximate solutions can be computed using approximation algorithms such as greedy algorithms. There are at least two types of greedy algorithms: one starts with $T = \emptyset$ and iteratively adds points to $T$ until $|T| = k$; the other starts with $T = S$ and iteratively removes points from $T$ until $|T| = k$. The first type is more efficient when $k \ll n$. Below is a list of the 1D compression algorithms that we tested. More details of algorithms 3 and 5 are given in the chapter on 2D compression.

• Algorithm 1: in each iteration, add the point with the largest absolute error to the set of selected points (the error is the difference between the approximation and the dataset). This is a greedy algorithm of the first type (a sketch follows this list).
• Algorithm 2: the same as algorithm 1, but skipping points whose absolute curvature is larger than a threshold (70% of the largest absolute curvature). The purpose is to avoid overshoot/undershoot at points with large curvatures.
• Algorithm 3: in a loop, add the two points with the largest absolute errors, then remove the one point that has the smallest error when removed. To find the point to remove, each selected point is tentatively removed to compute its approximation error.
• Algorithm 4: select points as in algorithm 1, then optimize the values of the selected points to minimize the root mean square error (RMSE) of the approximation. Let the dataset be $f$ and the approximation of the known point

values be $x = \mathrm{ODL}(v) = Mv$. Optimizing $v$ to minimize $\|x - f\|$, the RMSE of $x$, means solving $x = Mv = f$. $M$ is relatively small in 1D, so the algorithm computes $M$ and solves for $v$ directly.
• Algorithm 5: select points as in algorithm 3, then optimize the values of the selected points to minimize RMSE.
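As a concrete illustration of the first greedy type, here is a sketch of algorithm 1 reusing the hypothetical odetlap_1d helper from Section 3.1; the choice of the two endpoint seed points is an assumption.

```python
# A sketch of algorithm 1: greedily add the worst (largest absolute
# error) point. Reuses the hypothetical odetlap_1d helper above.
import numpy as np

def greedy_1d(f, k, R=0.01):
    n = len(f)
    sel = [0, n - 1]                     # assumed seed points
    while len(sel) < k:
        approx = odetlap_1d(n, sel, f[sel], R)
        err = np.abs(approx - f)
        err[sel] = -1.0                  # never re-add a selected point
        sel.append(int(err.argmax()))
    return sel
```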

Figure 3.2 shows the results of algorithms 1-5 on two 1D terrain profiles. Each algorithm selects 30 points from a total of 600 points, with smoothing factor R = 0.01. Green curves are the datasets, blue lines are the selected points, and red curves are the approximations. The RMSE of the approximation is displayed at the top of each subfigure. The results show that algorithms 2-5 are all better than algorithm 1 in RMSE. Algorithms 4 and 5 are much better, but their selected points no longer take data values.

3.3 Suitability of Datasets

ODETLAP compression performs poorly on some datasets because ODETLAP approximation suffers from overshoot and undershoot when known point values change rapidly, similar to Runge's phenomenon in polynomial interpolation. Take 1D ODETLAP compression as an example. The dataset is a discrete step function, f(x) = 1 for x = 0…499 and f(x) = -1 for x = 500…999. ODETLAP compression adds one point in each iteration with R = 0.01. Figure 3.3 shows the results when 10, 12, 14, and 16 points are selected. Each subfigure shows the approximation (red), the dataset (green), and the selected points (blue). While the approximation is almost perfect with 16 points, it oscillates badly with fewer points. ODETLAP compression therefore selects many points around breakpoints. It is possible to compress the 1D step function using two types of known points and known-value equations. A type 1 known point x has the data value f(x) and equation u(x) = f(x). A type 2 known point x has the forward difference value f(x+1) - f(x) and equation u(x+1) - u(x) = f(x+1) - f(x). ODETLAP compression alternately adds a type 1 point and a type 2 point in each iteration. In this way, the approximation is almost perfect with as few as 6 points. The method works less well in higher dimensions because there is more than one difference value per point.

Figure 3.2: Some 1D ODETLAP compression results – panels (a)-(j) show the two profiles for each of Algorithms 1-5

Figure 3.3: Compressing a step function – (a) 10 points; (b) 12 points; (c) 14 points; (d) 16 points; red curve: approximation; green line: dataset; blue circles: selected points

3.4 Summary

We have shown that, with the known points fixed, ODETLAP approximation is a linear operator of the known point values, so that its best approximation to a dataset is well defined. We have greatly reduced the root-mean-square error of 1D ODETLAP compression using a fixed number of points.

CHAPTER 4
2D ODETLAP Approximation

This chapter discusses ODETLAP approximation in 2D, including GPU acceleration using CUSP, error analysis, and comparison with natural neighbor interpolation and the multiquadric-biharmonic method. In 2D, the discrete Laplace's equation is

f(x − 1, y) + f(x + 1, y) + f(x, y − 1) + f(x, y + 1) − 4f(x, y) = 0.

4.1 GPU Acceleration

Solving large systems is time-consuming and benefits from parallel processing. The CUSP implementation of ODETLAP approximation has three parts (a CPU analogue is sketched after the list):

1. build the coefficient matrix $A$ and right-hand side $b$ of an ODETLAP system on the CPU
2. compute $A^T$, $A^T A$, and $A^T b$ on the GPU
3. solve $A^T A x = A^T b$ on the GPU
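For readers without a CUDA setup, parts 2 and 3 can be mirrored on the CPU with SciPy's sparse module; the sketch below is an assumed analogue of the pipeline, not the dissertation's CUSP code.

```python
# A CPU sketch of parts 2-3: form the normal equations and solve them
# with conjugate gradient (a SciPy stand-in for the CUSP GPU solver).
import scipy.sparse.linalg as spla

def solve_odetlap_system(A, b):
    AtA = (A.T @ A).tocsr()      # part 2: A^T A
    Atb = A.T @ b                # part 2: A^T b
    x, info = spla.cg(AtA, Atb)  # part 3: CG on the normal equations
    if info != 0:
        raise RuntimeError("CG did not converge")
    return x
```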

CUSP has these sparse matrix formats:

• COO: coordinate matrix format
• CSR: compressed sparse row matrix format
• ELL: ELLPACK/ITPACK matrix format
• HYB: hybrid ELL/COO matrix format

Host $A$, device $A$, and device $A^T$ use the COO format, which is fast for basic operations. The format of device $A^T A$ affects the performance of the solver, whose main operation is sparse matrix-vector multiplication, so we evaluated each format for $A^T A$. The value type is single-precision float. The solver is the conjugate gradient method with a relative tolerance of 1 × 10^-6. The datasets are three 1200 × 1200 DEMs down-sized from NED 1 × 1 degree blocks n43w072, n43w073, and n43w074. Each dataset has 1% of its points randomly selected as known points, and ODETLAP approximation with R = 1 is run 10 times.


Table 4.1: Average running time in seconds and device memory usage in MB, using different sparse matrix formats

        COO    CSR    ELL    HYB
Part 1  0.06   0.06   0.06   0.07
Part 2  0.27   0.28   0.30   0.31
Part 3  21.59  20.42  10.69  10.72
Memory  493    428    422    422

Table 4.1 shows the average running time of each part in seconds, on our test machine with two Intel Xeon E5-2687W CPUs and one NVIDIA Tesla K20Xm GPU accelerator, and the device memory usage after part 3 in MB (peak memory usage is higher). Parts 1 and 2 take little time. Part 3 is fastest using ELL or HYB. CUSP has these iterative solvers:

• Relaxation methods

– Gauss-Seidel relaxation
– Jacobi relaxation
– Successive Over-Relaxation (SOR) relaxation

• Krylov subspace methods

– Biconjugate Gradient (BiCG) method
– Biconjugate Gradient Stabilized (BiCGstab) method
– Conjugate Gradient (CG) method
– Conjugate Residual (CR) method
– Generalized Minimum Residual (GMRES) method

ODETLAP approximation uses the ELL format. The Jacobi, BiCGstab, and CR methods do not converge; Gauss-Seidel and SOR are very slow, and GMRES is slow. Table 4.2 shows the results of BiCG and CG: CG is faster than BiCG. CUSP has these preconditioners: the smoothed aggregation-based algebraic multigrid (AMG) preconditioner, the approximate inverse (AINV) preconditioner, and the diagonal preconditioner. With the CG method, AINV and diagonal do not reduce the running time. Table 4.2 also shows the results of AMG-CG: the running time is more than halved, but memory usage is higher.

Table 4.2: Average running time in seconds and device memory usage in MB, using different iterative solvers

        BiCG   CG     AMG-CG
Part 1  0.06   0.06   0.06
Part 2  0.30   0.30   0.30
Part 3  16.98  10.66  4.71
Memory  422    422    730

Table 4.3: Average running time in seconds and GPU to CPU speedups

        CPU    GPU    Speedup
Part 1  0.04   0.06   0.61
Part 2  0.87   0.30   2.92
Part 3  38.01  4.71   8.07

In summary, ODETLAP approximation will use the ELL format, the CG method, and the AMG preconditioner. With CUSP, host data structures are computed on the CPU and device data structures on the GPU; storing all data structures on the host puts all computation on the CPU. Table 4.3 shows the sequential CPU time and the GPU speedup. The GPU is about 8 times as fast as the CPU. CUSP does not support CPU parallel processing.

4.2 ODETLAP Approximation, Natural Neighbor Interpolation, and Radial Basis Function Interpolation

We did some tests of scattered-point approximation using ODETLAP. The datasets are sixty 600 × 600 DEMs down-sized from NED 1 × 1 degree blocks between (40° N, 69° W) and (47° N, 80° W), as shown in Figure 4.1. The tests are:

1. 3600 regular known points at (10i+5, 10j+5), i, j = 0…59.
2. 3600 uniform random known points. The probability of a point being picked is 1/360000.
3. 3600 nonuniform random known points. The probability of a point being picked is proportional to its absolute curvature, implemented using the inverse transform sampling method (a sketch follows Figure 4.1).


Figure 4.1: DEM datasets

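A sketch of test 3's point selection via inverse transform sampling follows; duplicate draws are possible and would need resampling in practice, and the function name is hypothetical.

```python
# A sketch of inverse transform sampling: pick grid points with
# probability proportional to absolute curvature. Duplicates are
# possible and would need resampling; the name is hypothetical.
import numpy as np

def curvature_weighted_points(curv, k, seed=0):
    rng = np.random.default_rng(seed)
    w = np.abs(curv).ravel()
    cdf = np.cumsum(w) / w.sum()                # discrete CDF over all cells
    flat = np.searchsorted(cdf, rng.random(k))  # invert the CDF
    return np.unravel_index(flat, curv.shape)   # (rows, cols)
```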

The error metrics are:

• average absolute, RMS and maximum absolute errors in meters (AVGE, RMSE and MAXE)
• average, RMS and maximum slope errors in degrees (AVGSE, RMSSE and MAXSE)
• average absolute, RMS and maximum absolute curvature errors in 0.01 radians (AVGCE, RMSCE and MAXCE)

Slope is calculated as

$$\text{slope} = \frac{180}{\pi}\,\operatorname{atan}\!\left(\frac{\sqrt{(f(x+1,y)-f(x-1,y))^2 + (f(x,y+1)-f(x,y-1))^2}}{2h}\right)$$

[12].

Table 4.4: Average elevation, slope, and curvature errors of ODETLAP approximation on the three tests

Test AVGE RMSE MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
1    17.1 25.8 237.3 2.14  3.20  26.78 0.038 0.057 0.615
2    20.2 30.8 278.7 2.21  3.31  27.64 0.038 0.057 0.612
3    22.1 31.3 242.6 2.11  3.07  25.54 0.038 0.056 0.602

Curvature is calculated as

$$\text{curvature} = 100 \times \frac{4f(x,y) - f(x-1,y) - f(x+1,y) - f(x,y-1) - f(x,y+1)}{h^2}$$

[12].
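The two formulas translate directly into NumPy central differences; a sketch follows (with h = 180 m as defined in the next paragraph, and grid borders omitted).

```python
# Slope (degrees) and curvature (0.01 radians) from the formulas above,
# computed on interior grid points with central differences; h = 180 m.
import numpy as np

def slope_deg(f, h=180.0):
    gx = f[2:, 1:-1] - f[:-2, 1:-1]     # f(x+1,y) - f(x-1,y)
    gy = f[1:-1, 2:] - f[1:-1, :-2]     # f(x,y+1) - f(x,y-1)
    return np.degrees(np.arctan(np.hypot(gx, gy) / (2 * h)))

def curvature(f, h=180.0):
    lap = (4 * f[1:-1, 1:-1] - f[:-2, 1:-1] - f[2:, 1:-1]
           - f[1:-1, :-2] - f[1:-1, 2:])
    return 100 * lap / h**2
```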

Here h = 180 meters because the spatial resolution is 6 arc-seconds. For dataset n43w074 (Figure 5.1 (e)), the range of slope is [0.00, 34.20] degrees and the average slope is 3.93 degrees; the range of curvature is [-0.676, 0.627] and the average curvature is 0.000. We tested R = 0.001, 0.01, 0.1, 1, 10 and found that R = 1 is usually better in MAXE, but R = 0.01 or 0.1 is usually better in the other errors. Table 4.4 shows the average errors of the three tests with R = 0.01. Regular points are better in elevation errors, while nonuniform points are better in slope errors. Taking dataset n43w074 and test 2 as an example, Figure 4.2 (a) shows the probability densities of elevation, slope, and curvature for the dataset and the approximation. While the elevation probabilities are very similar between the dataset and the approximation, the slope and curvature probabilities of the approximation are much higher near 0 than those of the dataset. For slope, this is because the approximation is smoother; for curvature, it is because the Laplace equations drive the curvature at unknown points to zero. Figure 4.2 (b) shows the scatter plots of elevation, slope, or curvature versus the corresponding error of the approximation. Slope versus slope error and curvature versus curvature error show strong negative correlations. Curvature versus elevation error shows a weak negative correlation, so a concave point (negative curvature) is more likely to have an overestimated elevation and a convex point (positive curvature) an underestimated one. Natural neighbor interpolation is a popular method that generates smooth interpolations.

Figure 4.2: ODETLAP approximation error analysis – (a) probabilities; (b) scatter plots

We tested natural neighbor interpolation in MATLAB. Unknown points outside the convex hull of the known points are interpolated using linear interpolation. Table 4.5 shows the average errors of the three tests, which are worse than those of ODETLAP approximation. Figure 4.3 shows the scatter plots of each value versus each error for dataset n43w074 and test 2; they are similar to those of ODETLAP approximation.

Table 4.5: Average elevation, slope, and curvature errors of natural neighbor interpo- lation on the three tests

Test AVGE RMSE MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
1    18.1 26.8 257.4 2.26  3.35  27.16 0.038 0.057 0.623
2    21.2 31.9 321.6 2.39  3.55  29.86 0.039 0.059 0.774
3    22.9 32.0 303.4 2.31  3.37  29.35 0.039 0.060 0.919

Figure 4.3: Natural neighbor interpolation error analysis

RBF interpolation is a very good method that can be used for either interpolation or approximation, but it is more complicated to use because there are many options. We tested RBF interpolation in Python using the SciPy library. First we tested different functions: linear, cubic, quintic, thin plate, multiquadric, inverse multiquadric, and Gaussian, with different values of the epsilon parameter for the last three. We found the multiquadric function with epsilon = 2 to be good for our tests. The smooth parameter specifies either interpolation (smooth = 0) or approximation (smooth > 0). Table 4.6 shows the average errors of the three tests with smooth = 0 or smooth = 10. The errors of smooth = 10 are mostly worse than those of smooth = 0.

Table 4.6: Average elevation, slope, and curvature errors of multiquadric RBF interpolation/approximation on the three tests

Test AVGE RMSE MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
Interpolation, smooth = 0
1    17.0 25.5 232.8 2.16  3.23  26.81 0.038 0.057 0.614
2    19.8 30.2 273.0 2.24  3.35  27.75 0.038 0.057 0.612
3    21.5 30.4 231.3 2.12  3.09  25.59 0.038 0.056 0.605
Approximation, smooth = 10
1    22.3 31.5 251.3 2.74  3.96  29.03 0.038 0.057 0.614
2    24.0 34.3 278.0 2.76  3.98  29.25 0.038 0.057 0.615
3    26.0 34.8 229.7 2.73  3.91  28.32 0.038 0.057 0.613

Multiquadric RBF interpolation is slightly better than ODETLAP approximation in elevation errors. Figure 4.4 shows the scatter plots of each value versus each error for dataset n43w074 and test 2 with smooth = 0.

4.3 Summary

We have accelerated ODETLAP approximation using the CUSP library, achieving about an 8-times speedup. ODETLAP approximation is comparable in accuracy to natural neighbor interpolation and multiquadric RBF interpolation. Unlike natural neighbor interpolation, ODETLAP approximation can extrapolate as well as interpolate. As the number of known points increases, ODETLAP approximation gets faster while multiquadric RBF interpolation gets slower, so ODETLAP approximation is faster when there are many known points. As a smooth interpolation, ODETLAP approximation shares characteristics with natural neighbor interpolation and multiquadric RBF interpolation: all of them underestimate slope and minimize curvature.

Figure 4.4: Multiquadric RBF interpolation error analysis

CHAPTER 5
2D ODETLAP Compression

This chapter discusses new algorithms to improve the accuracy of ODETLAP compression and new methods to compress the set of known points and values. We test ODETLAP compression on 2D terrain datasets and show that it is 30% to 50% better than JPEG 2000 at minimizing the maximum absolute error. The maximum absolute error is important in many applications: a large error in elevation data indicates a potential hazard for a flying aircraft, a ship, or a submarine, while the average error is irrelevant. There is a similar concern with any collision detection problem, e.g., in robotics and mechanical assembly.

5.1 2D Datasets

We use eight 600 × 600 terrain datasets for the 2D compression experiments. They are down-sized from National Elevation Dataset (NED) 1 × 1 degree blocks by averaging the values of every 6 × 6 points and rounding the average as the value of one point; the datasets have integer values in meters. We use one dataset, n43w074, in most of the experiments and all eight datasets in the comparison with JPEG 2000. Table 5.1 shows the statistics and Figure 5.1 shows grayscale images of the datasets. In this chapter, the target number of selected points for ODETLAP compression is always 3600 (1%). ODETLAP compression uses the double-precision double type and a lower relative tolerance of 1 × 10^-9, because this is necessary for the optimization of known point values.

5.2 The Simple Algorithm

Figure 5.2 (a) shows the curves of average absolute error (AVGE), RMS error (RMSE), and maximum absolute error (MAXE) of the approximation of dataset n43w074, as the original ODETLAP compression algorithm adds one point per iteration, from 100 to 3600 known points. The y-axis is in log scale for better visualization.


Table 5.1: Statistics of the 2D datasets

Dataset  Minimum Maximum Average STD
n41w076  0       621     176.2   104.6
n41w077  62      633     220.5   106.8
n42w076  90      808     428.6   106.5
n42w077  133     782     400.9   139.9
n43w074  -1      1138    273.8   189.0
n43w075  43      1225    471.3   192.2
n44w074  23      1158    277.1   176.6
n44w075  93      1163    538.0   152.6

The errors go down as more points are added, but the gaps (or ratios) between AVGE, RMSE, and the bottom of MAXE do not change much. AVGE and RMSE are relatively stable, but MAXE is highly unstable. Figure 5.2 (b) shows the part with 1000-1100 known points with the y-axis in linear scale; there the bottom of MAXE is below 150, but the spikes are near 200. The problem is that after selecting a specific number k of known points, the MAXE of the approximation can be either good or bad; it is not very common for it to be the minimum of the MAXEs over all j ≤ k known points. For example, of the 3501 values of MAXE in Figure 5.2 (a), 2152 are not the minimum of the values to their left, 1068 exceed that minimum by more than 10%, and 549 exceed it by more than 20%. There are two methods to guarantee a good MAXE:

1. Select a specific number of points, then iteratively remove the last added point until the MAXE is good. The output may then have fewer than the specified number of points.
2. Select a specific number of points, then iteratively switch a known point with an unknown point (by removing a known point and adding an unknown point) until the MAXE is good. The output then has the specified number of points.

The resulting errors of the two methods are very close, but the first method is simpler than the second.

We use the first method to guarantee a good MAXE for the output set of selected points, and call the augmented ODETLAP compression algorithm ‘the simple algorithm’, as shown in Algorithm 5.1. The algorithm updates the minimum MAXE and remembers the corresponding set of known points after each ODETLAP approximation and before adding the next unknown point.

Algorithm 5.1: The simple algorithm
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  the output is the set of known points and values with the minimum MAXE;

Procedure: Add an unknown point
begin
  compute the ODETLAP approximation of the known points;
  if the MAXE of the approximation is less than the minimum MAXE found so far then
    update the minimum MAXE and remember the current set of known points;
  end
  add the (unknown) point with the largest error to the set of known points;
end
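In Python-like terms, the loop looks like the sketch below; odetlap_approx stands in for the Chapter 4 solver, and the names and the call signature are hypothetical.

```python
# A sketch of Algorithm 5.1. odetlap_approx stands in for the CUSP
# solver of Chapter 4; sel holds (row, col) tuples of known points.
import numpy as np

def simple_algorithm(f, target, initial, odetlap_approx):
    sel = list(initial)
    best = (np.inf, None)
    while len(sel) < target:
        approx = odetlap_approx(f, sel)       # values taken from f at sel
        err = np.abs(approx - f)
        if err.max() < best[0]:               # track the minimum MAXE
            best = (err.max(), list(sel))
        for p in sel:
            err[p] = -1.0                     # restrict to unknown points
        sel.append(np.unravel_index(int(err.argmax()), f.shape))
    return best[1]
```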

The purpose of the first experiment is to evaluate different initial sets of known points. We always select an initial set of regular points for even coverage of the domain. The points are at (i·I + I/2, j·I + I/2), where i and j are eligible integers and I is the interval between adjacent points. For example, when I = 10, the points are (5, 5), (5, 15), …, (15, 5), …, (15, 15), and so on. Table 5.2 shows the compression errors (the approximation errors of the output) of the simple algorithm with different initial sets of known points. The dataset is n43w074 and the smoothing factor is R = 0.01. The output contains no more than 3600 points. The initial sets correspond to I = 60, 30, 20, 15, 12, 10; the last result is the approximation of 3600 regular points. As the number of regular known points increases, AVGE and RMSE go down and then up, while the other errors go up. Unless otherwise specified, we use an initial set of 10 × 10 points in the following experiments.

Table 5.2: Elevation, slope, and curvature errors of the simple algorithm and different initial sets, dataset n43w074

Initial AVGE RMSE MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
10x10   18.1 22.9 69.5  1.73  2.39  15.57 0.039 0.057 0.550
20x20   17.6 22.3 70.0  1.76  2.43  17.48 0.039 0.057 0.583
30x30   16.5 21.4 71.3  1.79  2.48  18.42 0.040 0.058 0.552
40x40   16.3 21.5 74.7  1.85  2.59  18.52 0.040 0.058 0.586
50x50   16.6 22.6 87.5  1.96  2.79  21.41 0.040 0.059 0.536
60x60   18.5 28.7 316.4 2.27  3.46  29.44 0.042 0.062 0.681

Table 5.3: Elevation, slope, and curvature errors of the simple algorithm and different Rs, dataset n43w074

R    AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
0.01 18.1 22.9 69.5 1.73  2.39  15.57 0.039 0.057 0.550
0.02 18.2 23.0 69.5 1.73  2.39  15.56 0.039 0.057 0.551
0.05 17.9 22.7 69.2 1.73  2.39  15.56 0.039 0.057 0.551
0.1  17.8 22.4 68.9 1.73  2.40  16.00 0.039 0.057 0.544
0.2  17.6 22.2 67.6 1.74  2.41  20.36 0.039 0.057 0.560
0.5  16.7 20.8 63.6 1.81  2.56  18.01 0.040 0.058 0.534

The next experiment tests the simple algorithm with different values of the smoothing factor R. Table 5.3 shows the errors. In general, as R increases, elevation errors decrease but slope errors increase; ODETLAP approximation also becomes slower, and the errors at known points increase. The simple algorithm assumes that the point with the largest error is unknown. When R = 1, in one iteration the point with the largest error was a known point, so the algorithm could not proceed. The problem can be avoided by adding the unknown point with the largest error. However, unless otherwise specified, we use R = 0.01 in the experiments for its efficiency. There are other methods to reduce elevation errors. Figure 3.2 shows that ODETLAP approximation can be overly smooth and overshoot/undershoot at known points with large absolute curvatures. Reducing the smoothness of the approximation at known points may help increase its accuracy. The idea is to use a smaller smoothing factor R2 for known points than the factor R1 for unknown points, so that the smoothness of the approximation changes at known points.

Table 5.4: Elevation, slope, and curvature errors of the simple algorithm and two smoothing factors, dataset n43w074

R1   R2   AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
0.02 0.01 16.2 20.4 61.3 1.78  2.48  16.46 0.040 0.059 0.615
0.05 0.02 16.0 19.9 59.3 1.81  2.55  17.55 0.041 0.060 0.633
0.1  0.05 16.3 20.3 61.0 1.78  2.49  16.53 0.040 0.059 0.616
0.2  0.1  16.1 20.2 60.7 1.79  2.50  16.47 0.040 0.059 0.617
0.5  0.2  16.4 20.2 58.8 1.87  2.64  18.28 0.040 0.059 0.545
0.05 0.01 17.0 20.8 56.5 1.97  2.75  18.15 0.042 0.063 0.571
0.1  0.02 17.0 20.7 56.5 1.97  2.76  18.15 0.042 0.063 0.571
0.2  0.05 16.7 20.4 56.9 1.92  2.69  18.17 0.041 0.062 0.564
0.5  0.1  17.2 20.9 56.6 1.99  2.78  18.12 0.042 0.062 0.571

The experiment tests the simple algorithm with a smoothing factor R1 for unknown points and a smoothing factor R2 for known points. Table 5.4 shows the errors. The first five results have R1/R2 = 2 or 2.5 and the last four have R1/R2 = 4 or 5. Compared with the result of R = 0.01, these results are better in elevation errors but worse in slope and curvature errors. R1/R2 = 4 or 5 is better than R1/R2 = 2 or 2.5 in MAXE but worse in the other errors. However, the approximation with two smoothing factors causes a visual artifact. Figure 5.3 shows the relief plots of the dataset, the approximation with R = 0.01, the approximation with R1 = 0.02 and R2 = 0.01, and the approximation with R1 = 0.05 and R2 = 0.01. The known points are visible in Figure 5.3 (c) and (d) because the approximation is not smooth at them: they show up as sharp concave or convex points with large absolute Laplacian values, and concave points look bad on terrains. To reduce this visual artifact, we make the smoothing factor change gradually from known points to unknown points, so that the approximation changes gradually at known points. The experiment tests the simple algorithm with the smoothing factor varying linearly from R2 at known points to R1 at unknown points over a radius. If the radius is 0, the smoothing factor is R1 at all unknown points; if the radius is infinity, it is R2 at all points. Table 5.5 shows the results of R1 = 0.02 and R2 = 0.01, and of R1 = 0.05 and R2 = 0.01, with radii of 0, 10, 20, and 30 points. As the radius increases, the errors first decrease and then increase, eventually becoming those of a single R.

Table 5.5: Elevation, slope, and curvature errors of the simple algorithm and a varying R, dataset n43w074

R1   R2   r  AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
0.02 0.01 0  16.2 20.4 61.3 1.78  2.48  16.46 0.040 0.059 0.615
0.02 0.01 10 16.3 20.6 62.7 1.72  2.38  16.22 0.039 0.057 0.557
0.02 0.01 20 17.1 21.6 64.5 1.73  2.38  15.82 0.039 0.057 0.550
0.02 0.01 30 17.2 21.7 65.9 1.73  2.38  17.10 0.039 0.057 0.600
0.05 0.01 0  17.0 20.8 56.5 1.97  2.75  18.15 0.042 0.063 0.571
0.05 0.01 10 15.6 19.5 56.4 1.79  2.47  15.67 0.039 0.057 0.588
0.05 0.01 20 15.7 19.7 59.7 1.75  2.41  16.47 0.039 0.057 0.574
0.05 0.01 30 15.9 20.1 61.1 1.74  2.39  16.42 0.039 0.057 0.582

Figure 5.4 shows the relief plots of the approximation with R1 = 0.02, R2 = 0.01 and r = 10, and the approximation with R1 = 0.05, R2 = 0.01 and r = 20. They look better than Figure 5.3 (c) and (d). Let f be an interpolating function. According to Hutchinson, minimizing the roughness penalty $J_1(f) = \int (f_x^2 + f_y^2)\,dx\,dy$ leads to the minimum potential interpolation, and minimizing the roughness penalty $J_2(f) = \int (f_{xx}^2 + 2f_{xy}^2 + f_{yy}^2)\,dx\,dy$ leads to the minimum curvature interpolation. Because minimum curvature interpolation tends to overshoot/undershoot around packed data points of very different values, he created the roughness penalty $J(f) = 0.5 h^{-2} J_1(f) + J_2(f)$ for interpolating terrains. This experiment tests the simple algorithm with different averaging equations for ODETLAP approximation, listed below. $\Delta f = 0$ is Laplace's equation; $\Delta^2 f = 0$ is the biharmonic equation; $\Delta f + \Delta^2 f = 0$ is similar to Hutchinson's roughness penalty. Figure 5.5 shows the stencils of the first three equations.

1. $\Delta f = 0$, 5-point stencil, O(h^2) accuracy
2. $\Delta f = 0$, 9-point stencil, O(h^6) accuracy
3. $\Delta^2 f = 0$, 13-point stencil, O(h^2) accuracy
4. $\Delta f + \Delta^2 f = 0$, 5-point stencil + 13-point stencil, O(h^2) accuracy

We use a relative tolerance of 1 × 10^-6 for the solver because the biharmonic equation is very slow to solve. Table 5.6 shows the results. Laplace's equation is more accurate than the biharmonic equation for ODETLAP compression, and the 9-point stencil is not significantly better than the 5-point stencil that we are using.

Figure 5.1: Grayscale images of the 2D datasets – (a) n41w076; (b) n41w077; (c) n42w076; (d) n42w077; (e) n43w074; (f) n43w075; (g) n44w074; (h) n44w075

Table 5.6: Elevation, slope, and curvature errors of the simple algorithm and different averaging equations, dataset n43w074

Equ. AVGE RMSE MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
1    17.3 22.0 67.9  1.76  2.41  18.28 0.039 0.057 0.591
2    17.6 22.2 67.7  1.75  2.40  15.74 0.039 0.057 0.568
3    27.0 34.3 107.3 2.30  3.14  20.95 0.042 0.061 0.585
4    21.3 27.0 83.9  1.90  2.60  18.42 0.040 0.059 0.572

Figure 5.2: The number of known points selected by ODETLAP compression and the errors of the approximation, dataset n43w074 – (a) 100-3600 known points; (b) 1000-1100 known points

5.3 Adding and Removing Points

Data compression works because of redundancy. The simple algorithm is additive: the objective is to minimize the maximum error with a limited number of points, and the method is to add the worst point to eliminate its error, in the hope of reducing the maximum error. Adding the worst point may or may not reduce the maximum error; when it does not, it helps reduce the maximum error in the future. Adding the worst point approximates adding the point that would reduce the maximum error the most or increase it the least, and the simple algorithm approximates finding a limited number of points with the smallest maximum error. The solution of an approximation algorithm is not optimal, so there is redundancy in the set of selected points. If we define the redundancy of a known point as its error when it is made unknown, a known point is probably least redundant when it is added. As more points are added, the redundancy of a known point generally increases, and some points become more redundant than others. If the redundancy of a point is large, removing it from the set of known points would not change its approximate value, or the approximation, very much; in this case, the point can be replaced with another point. Algorithm 5.2 shows the algorithm that adds and removes points. To remove a known point, it first selects a subset of random known points and computes their errors by tentatively removing them all together, then actually removes the tentatively removed point with the smallest error. This approximates removing the most redundant point; the errors of tentatively removed points are close to their errors when individually removed, provided they are sparse and scattered. The algorithm first adds one point in each iteration, and then adds two points and removes one point in each iteration. Table 5.7 shows the results of Algorithms 5.1 and 5.2 for comparison. Algorithm 5.2 first selects 10 × 10 points, adds one point per iteration up to 1000 points, then adds two points and removes one point per iteration up to 3600 points. The procedure 'remove a known point' tentatively removes 1% of the known points, so it chooses a point to remove from at least 100 points. Algorithm 5.2 is better in elevation and slope errors, although the improvement is small. Removing points is more effective when there are more suboptimal points to choose from, for example, regular points.

Algorithm 5.2: Adding and removing points
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  while not done do
    add an unknown point;
    add an unknown point;
    remove a known point;
  end
  the output is the set of known points and values with the minimum MAXE;

Procedure: Remove a known point
begin
  tentatively remove a fraction of the known points, chosen at random;
  compute the ODETLAP approximation of the remaining known points;
  actually remove the tentatively removed known point with the smallest error;
end
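The removal step translates to a short sketch like the one below; the names and the odetlap_approx signature are hypothetical, matching the earlier sketch.

```python
# A sketch of the removal procedure in Algorithm 5.2: tentatively remove
# a random fraction of known points together, then actually remove the
# one whose error is smallest.
import random

def remove_a_point(f, sel, odetlap_approx, frac=0.01):
    trial = random.sample(sel, max(1, int(frac * len(sel))))
    kept = [p for p in sel if p not in trial]
    approx = odetlap_approx(f, kept)          # approximation without trial points
    victim = min(trial, key=lambda p: abs(approx[p] - f[p]))
    sel.remove(victim)                        # the most redundant trial point
    return sel
```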

Table 5.7: Elevation, slope, and curvature errors of Algorithms 5.1 and 5.2, dataset n43w074

Alg. AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
5.1  18.1 22.9 69.5 1.73  2.39  15.57 0.039 0.057 0.550
5.2  17.2 21.7 66.8 1.69  2.34  15.18 0.039 0.057 0.553

5.4 Optimizing Point Values

The RMSE of the approximation of a set of known points can be reduced by optimizing their values. The optimization is based on a fixed set of points, so it is best performed once most or all of the known points have been added, because adding or removing points changes the optimized values. A vector $x \in X$ is a best approximation of a vector $f$ if it minimizes $\|f - x\|$. A vector $x$ in the column space $C(A)$ of a matrix $A$ is a best approximation of $f$ if $x = Av$ and $v$ is a solution of $Av = f$. The ODETLAP approximation of known point values $v$ is $x = \mathrm{ODL}(v) = Mv$, where $\mathrm{ODL}$ is the ODETLAP approximation operator, $M = (A^T A)^{-1} A^T B$, and $B = \begin{pmatrix} O \\ I \end{pmatrix}$. $x$ is a best approximation of a dataset $f$ if $Mv = f$. A direct solution for $v$ is only possible if $A$ is small, because it requires computing $(A^T A)^{-1}$. Therefore, we optimize known point values individually and iteratively like

Mainberger et al. [93]. To optimize the value $v_i$ of the $i$th known point, we solve

$\mathrm{ODL}(v + d e_i) = f$ for $d$, where $e_i$ is the $i$th unit vector. Given the linearity of ODETLAP approximation, $\mathrm{ODL}(v + d e_i) = \mathrm{ODL}(v) + d\,\mathrm{ODL}(e_i) = f$, so that

$$d = \frac{\mathrm{ODL}(e_i)^T \, (f - \mathrm{ODL}(v))}{\mathrm{ODL}(e_i)^T \, \mathrm{ODL}(e_i)}.$$

The optimized value of the $i$th known point is $v_i + d$, as $v' = v + d e_i$ is the new value vector solving $\mathrm{ODL}(v') = f$ in the least squares sense. The new approximation is

$$x' = \mathrm{ODL}(v') = \mathrm{ODL}(v) + d\,\mathrm{ODL}(e_i) = x + d\,\mathrm{ODL}(e_i).$$

To avoid overshoot, the new value of the $i$th point is set to $v_i + \omega d$ with a relaxation factor $\omega < 1$. Algorithm 5.3 shows the algorithm that optimizes known point values after Algorithm 5.2. The procedure 'optimize known point values' optimizes point values in complete rounds, as long as the relative reduction of RMSE after a round is not less than E; we choose E = 0.001. The first experiment tests Algorithm 5.3 with different values of the relaxation factor ω. Table 5.8 shows the results, which are better than those of Algorithm 5.2 in AVGE and RMSE but worse in MAXE and slope errors. As ω increases, slope errors get worse, but elevation and curvature errors are roughly the same. We use ω = 0.5 in the following experiments; the optimization converges faster but is less stable with a bigger ω. Figure 5.6 shows the progress of the optimization with ω = 0.5. Figure 5.6 (a) shows the d of each iteration/point; vertical dotted lines separate rounds of optimization (each round has 3596 iterations). In general, the magnitude of d decreases as the optimization converges.

Algorithm 5.3: Optimizing point values
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  while not done do
    add an unknown point;
    add an unknown point;
    remove a known point;
  end
  set the set of known points and values to those with the minimum MAXE;
  optimize known point values;

Procedure: Optimize known point values
begin
  compute the ODETLAP approximation x of the known points;
  foreach known point i do
    compute ODL(e_i);
  end
  while the relative reduction of RMSE is not less than E do
    foreach known point i in random order do
      compute d = ODL(e_i)^T (f - x) / (ODL(e_i)^T ODL(e_i));
      set v_i ← v_i + ωd;
      set x ← x + ωd ODL(e_i);
    end
  end
end
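One optimization round of the procedure translates to the sketch below, with ODL_e a precomputed list of the ODL(e_i) arrays; the names are hypothetical.

```python
# A sketch of one round of the value optimization in Algorithm 5.3.
# ODL_e[i] caches ODL(e_i) as an array the shape of the dataset f.
import numpy as np

def optimize_round(f, v, x, ODL_e, omega=0.5):
    for i in np.random.permutation(len(v)):
        g = ODL_e[i].ravel()
        d = g @ (f - x).ravel() / (g @ g)   # optimal change for point i
        v[i] += omega * d                   # relaxed update of the value
        x += omega * d * ODL_e[i]           # update the approximation in place
    return v, x
```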

Table 5.8: Elevation, slope, and curvature errors of Algorithm 5.3 and different ωs, dataset n43w074

ω    AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
0.25 12.3 15.5 77.5 1.78  2.48  15.85 0.039 0.057 0.533
0.5  12.2 15.5 77.8 1.79  2.49  16.06 0.039 0.057 0.532
0.75 12.2 15.5 78.0 1.79  2.50  16.13 0.039 0.057 0.532
1.0  12.2 15.5 77.9 1.80  2.50  16.26 0.039 0.057 0.532
1.25 12.2 15.5 77.8 1.80  2.51  16.28 0.039 0.057 0.532
1.5  12.2 15.5 77.6 1.80  2.51  16.23 0.039 0.057 0.533

The magnitude of d is also visibly different between successive rounds of optimization. Figure 5.6 (b) shows the AVGE and RMSE of the approximation at each iteration; they decrease quickly at first, then slowly, as the optimization converges. The slopes of the curves change visibly after each round of optimization, at the vertical dotted lines; the first round is much more effective than the others, and the curves can be roughly fit by an exponential function. Figure 5.6 (c) shows the MAXE at each iteration. MAXE increases a lot at the beginning and fluctuates in the first round, when |d| is large; it increases more slowly as |d| decreases. Figure 5.6 (d) shows the RMSE-MAXE curve: the optimization starts from the lower-right endpoint of the curve and stops at the top-left endpoint. It would be better if we could reduce both RMSE and MAXE, or reduce RMSE without increasing MAXE. Formally, we have a multi-objective optimization problem min(RMSE(x), MAXE(x)), where x is the approximation of no more than a target number of known points; the objective function is $obj(x) = (RMSE(x), MAXE(x))^T$, and the Pareto front of optimal obj(x) lies to the lower-left of the MAXE-RMSE curve. In adding and removing points, the accuracy of the computation can be lower because the change to the approximation is larger. In optimizing point values, however, the change is smaller, especially in later rounds; accuracy is necessary for convergence and affects the outcome of the optimization. Using a more precise data type and a smaller relative tolerance for the solver increases the accuracy of the computation, at the expense of computing time. The next experiment tests Algorithms 5.1-5.3 with these accuracy and precision settings:

1. float type and a relative tolerance of 1 × 10^-6
2. double type and a relative tolerance of 1 × 10^-6
3. double type and a relative tolerance of 1 × 10^-9

Algorithm 5.3 uses the different settings only in the optimization. Table 5.9 shows the results. Algorithms 5.1 and 5.2 have similar errors under the three settings, but Algorithm 5.3 has much better MAXE under setting 3 than under settings 1 and 2. Therefore, we use setting 3 in the compression experiments.

Table 5.9: Elevation, slope, and curvature errors of Algorithms 5.1–5.3 and different accuracy and precision settings, dataset n43w074

Setting AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
Algorithm 5.1
1       17.4 22.1 68.4 1.76  2.40  17.27 0.039 0.057 0.558
2       17.3 22.0 67.9 1.76  2.41  18.28 0.039 0.057 0.591
3       18.1 22.9 69.5 1.73  2.39  15.57 0.039 0.057 0.550
Algorithm 5.2
1       17.7 22.2 66.6 1.73  2.37  16.61 0.039 0.057 0.548
2       17.4 21.9 65.5 1.72  2.37  16.11 0.039 0.057 0.580
3       17.2 21.7 66.8 1.69  2.34  15.18 0.039 0.057 0.553
Algorithm 5.3
1       12.7 16.2 96.5 1.78  2.48  16.10 0.039 0.057 0.532
2       12.7 16.1 96.6 1.78  2.48  16.11 0.039 0.057 0.532
3       12.2 15.5 77.8 1.79  2.49  16.06 0.039 0.057 0.532

Figure 5.3: Results of the simple algorithm and one or two smoothing factors, dataset n43w074 – (a) dataset; (b) R = 0.01; (c) R1 = 0.02 and R2 = 0.01; (d) R1 = 0.05 and R2 = 0.01

Figure 5.4: Results of the simple algorithm and a varying R, dataset n43w074 – (a) R1 = 0.02, R2 = 0.01 and r = 10; (b) R1 = 0.05, R2 = 0.01 and r = 20

Figure 5.5: Averaging equation stencils – the 5-point Laplacian stencil, the 9-point Laplacian stencil, and the 13-point biharmonic stencil

Figure 5.6: Optimizing known point values, one value per iteration; vertical dotted lines separate rounds of iterations – (a) the optimal change to a known point value vs. iteration; (b) the AVGE and RMSE of the approximation vs. iteration; (c) the MAXE of the approximation vs. iteration; (d) MAXE vs. RMSE over all iterations

5.5 Anterior Optimizations

We have two tools to reduce elevation errors: adding and removing points, which reduces MAXE, and optimizing point values, which reduces RMSE (and AVGE). However, optimizing point values increases MAXE, and adding and removing points after optimization increases RMSE, so RMSE and MAXE are conflicting objectives. We reduce the two-objective problem to a one-objective problem by defining a scalar objective function; the definition specifies the relative weights of RMSE and MAXE, and we use $obj = \sqrt{RMSE^2 + MAXE^2}$. Because MAXE is larger than RMSE, it weighs much more than RMSE in obj. An alternative definition is obj = RMSE + MAXE, which still weighs MAXE more than RMSE. Given the two tools, the idea is to use them alternately to reduce both RMSE and MAXE. Algorithm 5.4 optimizes known point values multiple times before the target number of points is selected; the optimizations are interleaved with, and followed by, adding and removing points to reduce MAXE. The procedure for adding an unknown point is different because the worst point may be a known point after optimization: if the worst point is known, the procedure shrinks the difference between its value and its data value by 1 - S (i.e., multiplies it by S) and computes a new approximation, repeating until the worst point is unknown. We use S = 0.9. The experiment tests the algorithm with different numbers of optimizations and different numbers of points added between and after optimizations, as shown in Table 5.10. The first three results have 100 points added between and after 2, 4, or 6 optimizations; the last three have 200 points added between and after 2, 4, or 6 optimizations. For example, the third result has optimizations after 3000, 3100, 3200, 3300, 3400, and 3500 points are selected. The value of the objective function usually decreases as the number of optimizations increases, but with diminishing gain. The third result is the best of these; compared with the result of Algorithm 5.3, it is worse in AVGE and RMSE but better in MAXE and the objective function.

Algorithm 5.4: Anterior optimizations
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  while not done do
    add an unknown point;
    add an unknown point;
    remove a known point;
  end
  set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  for multiple times do
    optimize known point values;
    while not done do
      add an unknown point;
      add an unknown point;
      remove a known point;
    end
    set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  end

Procedure: Add an unknown point
begin
  compute the ODETLAP approximation of the known points;
  if the MAXE of the approximation is less than the minimum MAXE found so far then
    update the minimum MAXE and remember the current set of known points;
  end
  while the point p with the largest error is known do
    set the value v(p) of the point to f(p) + S(v(p) - f(p)), where S < 1 and f(p) is the data value of the point;
    compute the ODETLAP approximation of the known points;
  end
  add the (unknown) point with the largest error to the set of known points;
end

Table 5.10: Elevation, slope, and curvature errors of Algorithm 5.4, dataset n43w074

Opt. at points      AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
3400,3500           13.5 16.9 60.1 1.80  2.50  16.33 0.039 0.057 0.536
3200,3300,...,3500  13.5 16.9 57.4 1.82  2.52  16.38 0.039 0.057 0.535
3000,3100,...,3500  13.4 16.9 56.8 1.81  2.52  15.42 0.039 0.057 0.536
3200,3400           14.2 17.8 58.5 1.81  2.51  16.25 0.039 0.057 0.540
2800,3000,...,3400  13.9 17.4 57.1 1.81  2.53  15.70 0.039 0.057 0.529
2400,2600,...,3400  14.2 17.7 57.4 1.81  2.54  16.59 0.039 0.057 0.548

5.6 Posterior Optimizations

As another way to use the two tools to reduce both RMSE and MAXE, Algorithm 5.5 optimizes known point values multiple times after the target number of points is selected. After each optimization, it adds one point and removes one point in each iteration to reduce MAXE, stopping (in our implementation) when the post-optimization objective function has not been reduced for 200 iterations. It repeats optimization and adding/removing until the objective function can no longer be reduced. Table 5.11 shows the results of Algorithms 5.3-5.5 for comparison; both Algorithm 5.4 and Algorithm 5.5 are better than Algorithm 5.3 in MAXE and the objective function.

5.7 Anterior and Posterior Optimizations (the Complex Algorithm)

Combining Algorithms 5.4 and 5.5 gives a third way to use the two tools to reduce both RMSE and MAXE, shown in Algorithm 5.6, which we call 'the complex algorithm'. The algorithm optimizes known point values multiple times both before and after the target number of points is selected. The experiment tests the algorithm first with R = 0.01 and then with varying R = 0.01-0.02 (R1 = 0.02, R2 = 0.01 and r = 10), using the two datasets n43w074 and n44w074. The program performs optimizations after 3000, 3100, 3200, 3300, 3400, and 3500 points are selected, and again after the target of 3600 points is selected. The first result in Table 5.12 shows that the complex algorithm is better in the objective function than the previous algorithms. On these two datasets, the results with varying R = 0.01-0.02 are better than those with R = 0.01 in MAXE and the objective function, but can be better or worse in AVGE and RMSE.

Algorithm 5.5: Posterior optimizations
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  while not done do
    add an unknown point;
    add an unknown point;
    remove a known point;
  end
  set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  while the minimum √(RMSE² + MAXE²) decreases do
    optimize known point values;
    while not done do
      add an unknown point;
      remove a known point;
    end
    set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  end

Table 5.11: Elevation, slope, and curvature errors of Algorithms 5.3–5.5, dataset n43w074

Alg. AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
5.3  12.2 15.5 77.8 1.79  2.49  16.06 0.039 0.057 0.532
5.4  13.4 16.9 56.8 1.81  2.52  15.42 0.039 0.057 0.536
5.5  12.8 16.1 58.6 1.80  2.51  15.39 0.039 0.057 0.532

Algorithm 5.6: Anterior and posterior optimizations (the complex algorithm)
Input: A point dataset
Output: A set of known points and values
  select an initial set of known points;
  while not done do
    add an unknown point;
  end
  while not done do
    add an unknown point;
    add an unknown point;
    remove a known point;
  end
  set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  for multiple times do
    optimize known point values;
    while not done do
      add an unknown point;
      add an unknown point;
      remove a known point;
    end
    set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  end
  while the minimum √(RMSE² + MAXE²) decreases do
    optimize known point values;
    while not done do
      add an unknown point;
      remove a known point;
    end
    set the set of known points and values to those with the minimum √(RMSE² + MAXE²);
  end

Table 5.12: Elevation, slope, and curvature errors of the complex algorithm and a varying R

Dataset AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
R = 0.01
n43w074 12.7 16.0 56.5 1.81  2.53  15.39 0.039 0.057 0.536
n44w074 17.9 22.7 78.0 2.40  3.34  21.23 0.051 0.074 0.775
Varying R = 0.01-0.02
n43w074 13.4 16.7 53.8 1.79  2.49  15.63 0.039 0.057 0.520
n44w074 17.2 21.6 73.3 2.39  3.29  19.75 0.050 0.074 0.628

5.8 Compressing Selected Points

This section discusses the compression of the output of ODETLAP compression, a set of selected points and values, to reduce the size of the compressed dataset. The selected points and their values are compressed separately. Mapping a set of values to a smaller range and reducing the number of unique values help reduce the compressed size of the values; quantization essentially reduces the precision of the values. This experiment quantizes the output of the simple algorithm onto different ranges of integers and computes the approximation errors. For example, to quantize the values onto the range 0-255, first find the minimum value vmin and the maximum value vmax, and let scale = 255/(vmax - vmin); then convert each value v to round(scale * (v - vmin)). To map a value w back to the original range, convert it to w/scale + vmin. Therefore, vmin and scale need to be saved after quantization. The range of the output of the simple algorithm is [0, 1138] integer meters, which is about the range of 10 bits. The values are quantized onto the range of 9, 8, 7, 6, or 5 bits. The results in Table 5.13 show that elevation errors do not increase much until range 0-127, and slope errors do not increase much until range 0-31; curvature errors (not shown) do not increase much until range 0-15. Unlike the simple algorithm, the complex algorithm outputs floating-point values because of the optimizations; the values can be rounded to integers or quantized onto a small range. Table 5.14 shows that quantization onto 0-255 does not increase the errors much.
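The quantization and its inverse translate directly to a short sketch (target range 0-255 shown):

```python
# A sketch of the quantization described above. The saved vmin and
# scale are needed to map values back to the original range.
import numpy as np

def quantize(v, levels=255):
    vmin = v.min()
    scale = levels / (v.max() - vmin)
    return np.round(scale * (v - vmin)).astype(int), vmin, scale

def dequantize(w, vmin, scale):
    return w / scale + vmin
```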

Table 5.13: Elevation, slope, and curvature errors of the simple algorithm and quantization, dataset n43w074

Range  AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
0-1138 18.1 22.9 69.5 1.73  2.39  15.57 0.039 0.057 0.550
0-511  18.1 22.9 70.5 1.73  2.39  15.55 0.039 0.057 0.551
0-255  18.1 22.9 71.0 1.73  2.39  15.47 0.039 0.057 0.552
0-127  18.3 23.1 73.9 1.73  2.39  15.45 0.039 0.057 0.547
0-63   18.6 23.5 79.5 1.74  2.40  15.53 0.039 0.057 0.546
0-31   19.4 24.4 96.0 1.75  2.42  16.06 0.039 0.057 0.553

Table 5.14: Elevation, slope, and curvature errors of the complex algorithm and quantization, dataset n43w074

Range   AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
Float   13.4 16.7 53.8 1.79  2.49  15.63 0.039 0.057 0.520
Integer 13.4 16.7 54.0 1.79  2.49  15.61 0.039 0.057 0.520
0-255   13.4 16.7 55.3 1.79  2.49  15.41 0.039 0.057 0.520

To compress the coordinates of the selected points, we represent them as a binary image or mask of the size of the dataset, with 1 standing for a known point and 0 for an unknown point, and encode the mask by run-length encoding (RLE) the 1 pixels. The run length of a 1 pixel is the number of consecutive 0s before it. The row-column coordinates of a known point are two integers, while its run length is one integer. Visiting the pixels of the mask in different orders gives a 1 pixel different numbers of 0s before it. We consider three orders (a sketch of the row-column case follows the list):

1. row-column order (RC)
2. Hilbert space-filling curve order (Hilbert)
3. Morton space-filling curve order (Morton)
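Here is a minimal sketch of the row-column case; the space-filling-curve orders only change how the mask is flattened.

```python
# A sketch of RLE in row-column order: each output value is the number
# of consecutive 0s before a 1 pixel in the flattened mask.
import numpy as np

def rle_mask_rc(mask):
    flat = mask.ravel(order="C")               # row-column order
    ones = np.flatnonzero(flat)
    return np.diff(np.concatenate(([-1], ones))) - 1

mask = np.zeros((600, 600), dtype=np.uint8)
mask[5, 5] = mask[5, 15] = 1
print(rle_mask_rc(mask))                       # [3005 9]
```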

We try the list of methods below to compress the run lengths of the three orders, and find the combination with the smallest size. Arithmetic coding and Huffman coding use the MATLAB implementation of Karl Skretting [94]. Variable-length quantity is a base-128 representation of nonnegative integers that uses one byte per digit in 0-127. 7zip, bzip2, gzip, and zpaq are open-source file archivers or compressors available on the Ubuntu 14.04.4 LTS operating system; we use their default settings.

Table 5.15: Compressing the selected-point mask, dataset n43w074

Source      AC   HC   VLQ-7zip VLQ-bzip2 VLQ-gzip VLQ-zpaq
RLE.RC      4050 3618 3635     3942      3576     3647
RLE.Hilbert 3490 3484 3523     3783      3460     3507
RLE.Morton  3517 3513 3541     3822      3475     3518

• AC: arithmetic coding
• HC: Huffman coding
• VLQ-7zip: variable-length quantity and 7zip v9.20
• VLQ-bzip2: variable-length quantity and bzip2 v1.0.6
• VLQ-gzip: variable-length quantity and gzip v1.6
• VLQ-zpaq: variable-length quantity and zpaq v1.10

We compress the output mask of the complex algorithm with varying R = 0.01-0.02, and Table 5.15 shows the compressed sizes in bytes. The smallest size is 3460 bytes, with compression pipeline RLE.Hilbert-VLQ-gzip. If the distribution of known points were truly random, the Shannon entropy of a pixel would be

$$H = -0.01 \log_2(0.01) - 0.99 \log_2(0.99) \approx 0.08$$
and the best average compressed size of the mask would be $600^2 \times H / 8 \approx 3636$ bytes, which is not far from what we achieve. We encode the values of the selected points by delta encoding (DE). The delta value of a value is its difference from the previous value, or 0 if it is the first. The values and delta values are all integers. Likewise, visiting the known points in different orders gives different delta values. We consider five orders:

1. row-column (RC)
2. Hilbert space-filling curve (Hilbert)
3. Morton space-filling curve (Morton)
4. the breadth-first traversal of a minimum spanning tree (MST1)
5. the depth-first traversal of a minimum spanning tree (MST2)

We try the list of methods below to compress the delta values of the five orders and find the combination with the smallest size (a sketch of the VLQ1 building blocks follows the list). Delta values can be negative. To encode them as variable-length quantities, we make them nonnegative either by mapping a value v to 2v+1 if it is nonnegative and to -2v if it is negative (VLQ1), or by adding an offset to each value and saving the offset (VLQ2). With the second method, the offset is the negative of the minimum value and is saved, encoded with the first method, before the delta values.

Table 5.16: Compressing the values of selected points, dataset n43w074

Source     AC   HC   VLQ1-7zip VLQ1-bzip2 VLQ1-gzip VLQ1-zpaq
DE.RC      3532 3464 3617      3848       3509      3612
DE.Hilbert 3024 3027 3176      3344       3086      3193
DE.Morton  3090 3084 3236      3407       3138      3244
DE.MST1    3035 3019 3182      3328       3069      3199
DE.MST2    2994 3019 3168      3303       3070      3174

Source     VLQ2-7zip VLQ2-bzip2 VLQ2-gzip VLQ2-zpaq
DE.RC      3955      3639       4350      3592
DE.Hilbert 3233      3357       3228      3256
DE.Morton  3721      3418       3905      3427
DE.MST1    3222      3352       3123      3248
DE.MST2    3169      3317       3123      3191

• AC: arithmetic coding
• HC: Huffman coding
• VLQ1-7zip: variable-length quantity (mapping) and 7zip
• VLQ1-bzip2: variable-length quantity (mapping) and bzip2
• VLQ1-gzip: variable-length quantity (mapping) and gzip
• VLQ1-zpaq: variable-length quantity (mapping) and zpaq
• VLQ2-7zip: variable-length quantity (offset) and 7zip
• VLQ2-bzip2: variable-length quantity (offset) and bzip2
• VLQ2-gzip: variable-length quantity (offset) and gzip
• VLQ2-zpaq: variable-length quantity (offset) and zpaq
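A sketch of the values-pipeline building blocks follows: delta encoding, the VLQ1 mapping (2v+1 / -2v), and a base-128 VLQ. The continuation-bit convention and treating the first delta as the value itself are assumptions, not the dissertation's exact specification.

```python
# A sketch of delta encoding, the VLQ1 mapping to nonnegative integers,
# and a base-128 variable-length quantity with a continuation bit.
def delta_encode(vals):
    # first "delta" is the value itself here (a simplification)
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

def zigzag(d):
    return 2 * d + 1 if d >= 0 else -2 * d   # VLQ1 mapping

def to_vlq(n):
    out = []
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | 0x80 if n else byte)  # high bit = more digits follow
        if not n:
            return bytes(out)

stream = b"".join(to_vlq(zigzag(d)) for d in delta_encode([120, 118, 125, 125]))
```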

We compress the output values of the complex algorithm with varying R = 0.01-0.02 after quantization onto 0-255, and Table 5.16 shows the compressed sizes in bytes. The smallest size is 2994 bytes, with compression pipeline DE.MST2-AC.

Besides the compressed mask and compressed values, additional information is necessary to decompress the selected points and values; it is all put in a header file. Thus the compressed output of ODETLAP compression has three files: a header, a mask, and a set of values. For 2D compression, the header has 24 bytes and contains the information listed below (a packing sketch follows the list). Items 1 and 2 specify the dimensions of the grid; items 3-5 specify the smoothing factor, which can be either constant or varying; items 6 and 7 specify the mask and values compression pipelines; items 8 and 9 specify the vmin and scale used in quantization.

1. the number of rows (2 bytes): 0-65535
2. the number of columns (2 bytes): 0-65535
3. R1 (4 bits): 0-15, for 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50 and 100
4. R2 (4 bits): as R1
5. radius (1 byte): 0-255
6. mask compression pipeline (1 byte): 0-255
7. values compression pipeline (1 byte): 0-255
8. the vmin in quantization (8 bytes): floating point
9. the scale in quantization (8 bytes): floating point
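For concreteness, a sketch of packing this 24-byte header with Python's struct module; the byte order and the exact 4-bit packing of R1/R2 into one byte are assumptions.

```python
# A sketch of packing the 24-byte header: 2+2+1+1+1+1+8+8 bytes.
import struct

R_CODES = [0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5,
           1, 2, 5, 10, 20, 50, 100]  # the sixteen 4-bit R codes

def pack_header(rows, cols, R1, R2, radius, mask_pipe, val_pipe, vmin, scale):
    r = (R_CODES.index(R1) << 4) | R_CODES.index(R2)  # two 4-bit codes in 1 byte
    return struct.pack("<HHBBBBdd", rows, cols, r, radius,
                       mask_pipe, val_pipe, vmin, scale)
```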

As a result, the total compressed size of the output of the complex algorithm with varying R = 0.01–0.02 and input dataset n43w074 is 24 + 3460 + 2994 = 6478 bytes. The mask compression pipeline is RLE.Hilbert-VLQ-gzip and the values compression pipeline is DE.MST2-AC.

5.9 Regular Points

Table 5.2 shows that more regular known points increase MAXE but not necessarily AVGE and RMSE. More regular known points also reduce the size of the compressed mask, because its run-length encoding in row-column order has many identical run lengths. Therefore, we also experiment with a version of the complex algorithm that selects more regular points as the initial set of known points and does not remove them in the process, so that the output contains these regular points. The only modification to the algorithm is that, in the procedure of removing a known point, the point to remove is chosen from the set of known points minus the regular initial points. Reducing the size of the compressed mask alone is less efficient than reducing the sizes of both the compressed mask and the compressed values. For a balanced compression, when the mask contains more regular points, the values are quantized onto a smaller range. In the experiment, we compress all eight datasets. The algorithm is the complex algorithm with varying R = 0.01-0.02. The datasets are compressed with (1) no regular known points and value quantization on 0-255, (2) 35 × 35 regular known points and value quantization on 0-127, or (3) 43 × 43 regular known points and value quantization on 0-63. 35² = 1225 is about a third of the 3600 target, and 43² = 1849 is about half of it. Table 5.17 shows the compression errors and Table 5.18 shows the compression pipelines and compressed sizes for four datasets of the output of 2D ODETLAP compression. Even though the mask compression pipeline and the values compression pipeline have many options, only a few are ever chosen.

5.10 JPEG 2000

For comparison, we compress the datasets by JPEG 2000 [95], using tools from the open-source OpenJPEG library v2.1.0 [96]. Each dataset is compressed with a compression ratio of 75, 80, 85, ..., 195 or 200 to get the size, and decompressed to compute the errors. Both compression and decompression use the default settings. For example, Table 5.19 shows the results of dataset n43w074. Ratios 190 and 195 have the same results. The JPEG 2000 compression of a dataset has a limited number of attainable sizes, and the actual compression ratio may differ from the specified compression ratio. Figures 5.7 to 5.14 show the results of JPEG 2000 and ODETLAP compression of each dataset. For each dataset, the figure shows each of 9 error metrics versus the compressed size. Black circles are the results of JPEG 2000.

Table 5.17: Elevation, slope, and curvature errors of ODETLAP compression

Reg.   Quant. AVGE RMSE  MAXE  AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
n43w074
N/A    0-255  13.4 16.7   55.3 1.79  2.49  15.41 0.039 0.057 0.520
35x35  0-127  13.1 16.8   63.4 1.87  2.61  17.17 0.040 0.058 0.601
43x43  0-63   13.5 17.5   72.2 1.92  2.70  18.29 0.040 0.059 0.569
n43w075
N/A    0-255  20.3 25.3   83.5 2.58  3.50  19.62 0.050 0.071 0.623
35x35  0-127  20.8 26.5   93.6 2.74  3.74  21.14 0.051 0.072 0.764
43x43  0-63   21.7 27.9  105.7 2.85  3.93  23.38 0.052 0.073 0.671
n44w074
N/A    0-255  17.2 21.6   74.8 2.39  3.29  19.74 0.050 0.074 0.628
35x35  0-127  17.6 22.9   86.7 2.48  3.45  22.29 0.051 0.075 0.699
43x43  0-63   18.2 23.7   97.9 2.57  3.58  24.45 0.051 0.076 0.703
n44w075
N/A    0-255  17.4 21.9   72.7 2.43  3.33  20.65 0.050 0.071 0.536
35x35  0-127  17.4 22.3   81.3 2.54  3.50  19.80 0.051 0.073 0.584
43x43  0-63   17.9 23.3   94.3 2.61  3.63  21.79 0.051 0.073 0.931

Table 5.18: Compression pipelines and compressed sizes of ODETLAP compression

Reg.   Quant. Mask compr. pipeline  Values compr. pipeline Mask Values Total
n43w074
N/A    0-255  RLE.Hilbert-VLQ-gzip  DE.MST2-AC             3460 2994   6478
35x35  0-127  RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2720 2510   5254
43x43  0-63   RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2139 2040   4203
n43w075
N/A    0-255  RLE.Hilbert-VLQ-gzip  DE.MST2-AC             3509 3244   6777
35x35  0-127  RLE.RC-VLQ-zpaq       DE.MST2-AC             2654 2784   5462
43x43  0-63   RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2093 2332   4449
n44w074
N/A    0-255  RLE.Hilbert-VLQ-gzip  DE.MST2-AC             3476 3160   6660
35x35  0-127  RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2738 2677   5439
43x43  0-63   RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2179 2182   4385
n44w075
N/A    0-255  RLE.Morton-VLQ-gzip   DE.MST2-AC             3509 3134   6667
35x35  0-127  RLE.RC-VLQ-zpaq       DE.Morton-AC           2724 2646   5394
43x43  0-63   RLE.RC-VLQ-zpaq       DE.Hilbert-AC          2171 2157   4352

Table 5.19: Specified compression ratio, compressed size, and compression errors of JPEG 2000, dataset n43w074

Ratio Size AVGE RMSE MAXE AVGSE RMSSE MAXSE AVGCE RMSCE MAXCE
75    9374  7.6 10.1  87 1.41 1.96 15.66 0.041 0.060 0.787
80    9015  7.8 10.3  87 1.43 1.99 15.66 0.042 0.060 0.787
85    8425  8.0 10.7  92 1.46 2.05 21.39 0.042 0.061 0.790
90    7961  8.3 11.1 100 1.49 2.10 21.23 0.042 0.061 0.784
95    7514  8.5 11.4 107 1.52 2.15 21.71 0.042 0.062 0.787
100   7164  8.7 11.7 107 1.54 2.19 20.76 0.042 0.062 0.787
105   6749  9.0 12.1 106 1.57 2.23 20.76 0.042 0.062 0.781
110   6544  9.1 12.3 107 1.59 2.25 18.07 0.042 0.062 0.790
115   6243  9.3 12.5 109 1.61 2.29 18.07 0.043 0.062 0.679
120   5790  9.7 13.0 109 1.65 2.35 20.45 0.043 0.063 0.673
125   5770  9.7 13.0 112 1.65 2.35 20.45 0.043 0.063 0.673
130   5433 10.0 13.4 112 1.67 2.38 20.45 0.043 0.063 0.673
135   5264 10.1 13.6 114 1.69 2.40 20.45 0.043 0.063 0.673
140   5074 10.3 13.8 114 1.70 2.42 20.45 0.043 0.063 0.673
145   4952 10.4 13.9 114 1.71 2.43 20.67 0.043 0.063 0.679
150   4809 10.5 14.1 114 1.72 2.45 20.89 0.043 0.063 0.673
155   4654 10.7 14.3 123 1.73 2.46 20.89 0.043 0.063 0.673
160   4436 10.9 14.6 117 1.75 2.49 20.89 0.043 0.063 0.673
165   4355 11.0 14.7 123 1.75 2.51 20.89 0.043 0.063 0.673
170   4236 11.1 14.9 136 1.77 2.55 23.81 0.043 0.064 0.679
175   4126 11.2 15.1 158 1.78 2.57 23.65 0.043 0.064 0.679
180   3974 11.4 15.3 164 1.78 2.58 23.81 0.043 0.064 0.673
185   3859 11.5 15.5 164 1.80 2.61 23.81 0.043 0.064 0.673
190   3695 11.8 15.9 164 1.83 2.65 23.81 0.043 0.064 0.673
195   3695 11.8 15.9 164 1.83 2.65 23.81 0.043 0.064 0.673
200   3604 11.9 16.1 164 1.85 2.68 23.81 0.043 0.064 0.673

Red, green and blue diamonds are the results of ODETLAP compression with none, 35 × 35, or 43 × 43 regular known points. The figures show that ODETLAP compression is better than JPEG 2000 in MAXE but worse in AVGE and RMSE. Including regular known points in the output helps reduce the gap between ODETLAP compression and JPEG 2000 in AVGE and RMSE, and keeps the advantage of ODETLAP compression in MAXE. ODETLAP compression is worse in AVGSE and RMSSE but may be better or worse in MAXSE. ODETLAP compression is a little better in AVGCE and RMSCE but may be better or worse in MAXCE. In summary, ODETLAP compression is better in the maximum absolute error but worse in the average absolute and RMS errors than JPEG 2000 for the terrain datasets.

Figure 5.7: JPEG 2000 and ODETLAP compression, dataset n41w076

Figure 5.8: JPEG 2000 and ODETLAP compression, dataset n41w077

Figure 5.9: JPEG 2000 and ODETLAP compression, dataset n42w076

Figure 5.10: JPEG 2000 and ODETLAP compression, dataset n42w077

Figure 5.11: JPEG 2000 and ODETLAP compression, dataset n43w074

Figure 5.12: JPEG 2000 and ODETLAP compression, dataset n43w075

Figure 5.13: JPEG 2000 and ODETLAP compression, dataset n44w074

Figure 5.14: JPEG 2000 and ODETLAP compression, dataset n44w075

Table 5.20: Relative elevation/value errors of ODETLAP compression to JPEG 2000

        No regular points      Some regular points    More regular points
Dataset AVGE   RMSE   MAXE     AVGE   RMSE   MAXE     AVGE   RMSE   MAXE
n41w076 146.5% 137.8% 56.7%    140.1% 134.0% 57.7%    128.5% 125.4% 62.5%
n41w077 173.5% 164.3% 60.8%    151.4% 148.4% 57.6%    140.4% 140.5% 61.3%
n42w076 133.4% 127.9% 60.8%    131.6% 127.5% 63.0%    123.4% 119.2% 56.5%
n42w077 142.4% 136.4% 67.4%    135.6% 129.9% 64.1%    128.6% 123.5% 56.9%
n43w074 146.4% 135.6% 51.7%    129.7% 123.5% 55.6%    121.6% 116.9% 53.9%
n43w075 150.1% 142.0% 63.8%    135.1% 129.3% 62.4%    127.7% 122.3% 68.2%
n44w074 141.2% 132.5% 71.3%    131.2% 126.0% 49.0%    122.6% 118.4% 53.5%
n44w075 145.5% 137.6% 66.7%    132.9% 127.9% 68.3%    124.4% 120.3% 57.5%

Table 5.20 shows the ratios of the elevation/value errors of ODETLAP compression to those of JPEG 2000 at similar compressed sizes for the 2D datasets. For each dataset, we calculate the actual compression ratios of the three ODETLAP results and use them as the specified compression ratios of JPEG 2000. The actual compression ratios of the JPEG 2000 results may be different but are close. The errors show that ODETLAP compression is about 30% to 50% better in MAXE, and regular known points help improve AVGE and RMSE.

5.11 Summary

We have designed multiple algorithms to improve the accuracy of ODETLAP compression, including a varying R, adding and removing points, and optimizing point values before and after selecting the target number of points. The complex algorithm is about 30% more accurate than the simple algorithm in elevation errors, with the same target number of points. We have used new methods to compress the selected points and values of ODETLAP compression. The results show that ODETLAP compression is effective for the 2D datasets and is about 30% to 50% better in the maximum absolute error than JPEG 2000, at similar compressed sizes. For the same maximum absolute error, the ODETLAP file is often about half the size of the JPEG 2000 file. Including some regular known points in the output brings ODETLAP compression closer to JPEG 2000 in the average absolute and RMS errors, while keeping the advantage of ODETLAP compression in the maximum absolute error. The accuracies of ODETLAP compression and JPEG 2000 are more similar in slope and curvature errors than in elevation errors.

CHAPTER 6
3D ODETLAP Compression

In this chapter, we test ODETLAP compression on 3D MRI datasets and show that it is 40% to 60% better in minimizing the maximum absolute error than JP3D. In 3D, the discrete Laplace’s equation is

f(x−1, y, z)+f(x+1, y, z)+f(x, y−1, z)+f(x, y+1, z)+f(x, y, z−1)+f(x, y, z+1) − 6f(x, y, z) = 0.

While a 2D grid has rows and columns, a 3D grid has slices, rows, and columns. A 3D grid usually has a higher percentage of boundary points. For example, a 1000 × 1000 grid has 3996 boundary points, while a 100 × 100 × 100 grid has 58808 boundary points.
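A quick check of these boundary-point counts (an n × n grid has n² − (n−2)² boundary points; an n × n × n grid has n³ − (n−2)³):

    n = 1000
    print(n**2 - (n - 2)**2)   # 3996
    n = 100
    print(n**3 - (n - 2)**3)   # 58808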

6.1 3D Datasets

We use four 64 × 64 × 64 magnetic resonance imaging (MRI) datasets for the 3D compression experiments. The MRI datasets are extracted from the 128 slices by 256 rows by 256 columns "OAS1_0001_MR1 mpr-1" dataset of the Open Access Series of Imaging Studies (OASIS) [97]. The list below shows the position of each MRI dataset within the original dataset. The original dataset has many zero-valued cells but the extracted datasets do not.

• mri1: slices 32-95, rows 64-127, and columns 64-127
• mri2: slices 32-95, rows 64-127, and columns 128-191
• mri3: slices 32-95, rows 128-191, and columns 64-127
• mri4: slices 32-95, rows 128-191, and columns 128-191
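In NumPy index terms, these four subvolumes correspond to the slices below of a hypothetical (slices, rows, columns) array holding the original 128 × 256 × 256 volume (the ranges above are inclusive, so e.g. slices 32-95 become 32:96):

    import numpy as np

    volume = np.zeros((128, 256, 256), dtype=np.int16)  # placeholder volume
    mri1 = volume[32:96,  64:128,  64:128]
    mri2 = volume[32:96,  64:128, 128:192]
    mri3 = volume[32:96, 128:192,  64:128]
    mri4 = volume[32:96, 128:192, 128:192]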

The datasets have nonnegative integer values. Table 6.1 shows the statistics of the datasets. Figure 6.1 shows a 2D image of each dataset with 8 × 8 slices in row-column order. We use the dataset mri1 in most of the 3D ODETLAP compression experiments.


Table 6.1: Statistics of the 3D datasets

Dataset Minimum Maximum Average STD
mri1    0       3745    1075.1  518.8
mri2    0       4095    1019.9  608.0
mri3    10      3947    891.5   419.1
mri4    0       4095    891.4   731.2

Figure 6.1: Grayscale images of the 3D datasets: (a) mri1, (b) mri2, (c) mri3, (d) mri4

For simplicity, ODETLAP compression always selects a target number of 2600 (∼ 1%) points.

6.2 The Simple Algorithm

The error metrics of the 3D compression experiments are:

1. average absolute, RMS and maximum absolute errors (AVGE, RMSE and MAXE)

Table 6.2: Value and gradient errors of the simple algorithm and different Rs, dataset mri1

R    AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
0.01 191.1 237.6 734.1 118.9 158.0 888.0
0.02 187.8 233.7 731.7 118.3 157.4 925.5
0.05 190.2 235.8 729.2 119.6 158.9 884.6
0.1  188.3 233.2 719.6 122.1 162.3 924.6

2. average, RMS and maximum gradient magnitude errors (AVGGE, RMSGE and MAXGE)
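As a sketch, the six metrics listed above can be computed as below; the gradient magnitude error is read here as the absolute difference of the two gradient magnitudes from central differences, which is an assumption about the exact definition:

    import numpy as np

    def error_metrics(approx, exact):
        e = np.abs(approx - exact)
        g1 = np.linalg.norm(np.gradient(approx.astype(float)), axis=0)
        g2 = np.linalg.norm(np.gradient(exact.astype(float)), axis=0)
        ge = np.abs(g1 - g2)
        return dict(AVGE=e.mean(), RMSE=np.sqrt((e ** 2).mean()), MAXE=e.max(),
                    AVGGE=ge.mean(), RMSGE=np.sqrt((ge ** 2).mean()),
                    MAXGE=ge.max())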

We start with the simple algorithm (Algorithm 5.1). The dataset is mri1. The algorithm selects an initial set of 4 × 4 × 4 points at (i*16+8, j*16+8, k*16+8) and adds one point in each iteration to 2600 points. The output is the set of known points and values with the smallest MAXE. Table 6.2 shows the results of the simple algorithm with different values of R. The results show that value errors tend to decrease as R increases, but gradient errors increase. With R = 0.2, in one iteration the point with the largest error is already a known point, so the algorithm cannot proceed. Like in 2D ODETLAP compression, we use R = 0.01 in the following experiments unless specified otherwise. Table 6.3 shows the results of the algorithm with a smoothing factor R1 for unknown points and a smaller smoothing factor R2 for known points. The first three results have R1/R2 = 2 or 2.5 and the last two have R1/R2 = 5. The last two results are worse than the first three. Compared with the result of the simple algorithm with R = 0.01, the first three results are better in MAXE but worse in all other errors. The results also show artifacts because the approximation is not smooth at known points. Table 6.4 shows the results of the algorithm with R varying from R2 at known points to R1 at unknown points over a radius. The first four results have R1 = 0.02 and R2 = 0.01 and the last four have R1 = 0.05 and R2 = 0.01. The results show that almost every error decreases as the radius increases, so that a varying R is better than two smoothing factors, but the results are still not better than that of R = 0.01 except in MAXE.

Table 6.3: Value and gradient errors of the simple algorithm and two smoothing factors, dataset mri1

R1   R2   AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
0.02 0.01 207.1 251.8 702.2 140.4 183.3 936.1
0.05 0.02 218.1 262.8 702.5 148.1 192.2 935.5
0.1  0.05 207.4 252.1 700.4 141.8 185.1 945.3
0.05 0.01 265.4 309.9 726.8 169.4 214.0 1000.8
0.1  0.02 266.4 311.0 729.0 169.6 214.3 974.0

Table 6.4: Value and gradient errors of the simple algorithm and a varying R, dataset mri1

R1   R2   r AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
0.02 0.01 0 207.7 252.4 702.2 140.2 183.2 936.1
0.02 0.01 2 205.0 249.9 697.6 137.3 179.6 906.4
0.02 0.01 4 197.5 241.9 700.0 130.8 171.6 877.7
0.02 0.01 6 194.6 239.1 703.2 127.5 167.6 886.1
0.05 0.01 0 265.4 309.9 726.8 169.4 214.0 1000.8
0.05 0.01 2 259.4 303.6 723.3 165.5 209.7 949.5
0.05 0.01 4 236.3 280.2 707.2 154.2 197.2 951.8
0.05 0.01 6 220.1 264.9 700.4 145.9 188.5 930.1

Therefore, two smoothing factors and a varying R do not work as well in the 3D experiments as in the 2D experiments. It seems that smoothness is more important in 3D ODETLAP compression, as point values are more interrelated.

6.3 Adding and Removing Points

This experiment tests the algorithm that adds and removes points (Algorithm 5.2). The algorithm first adds one point per iteration to 1000 points, then adds two points and removes one point per iteration to 2600 points. Table 6.5 compares the results of Algorithms 5.1 and 5.2. The result of Algorithm 5.2 is a little better.

6.4 Optimizing Point Values

This experiment tests Algorithm 5.3, the algorithm that optimizes known point values after Algorithm 5.2. Table 6.6 shows the results of the algorithm with different values of the relaxation factor ω.

Table 6.5: Value and gradient errors of Algorithms 5.1 and 5.2, dataset mri1

Alg. AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
5.1  191.1 237.6 734.1 118.9 158.0 888.0
5.2  189.9 235.4 724.8 118.3 156.5 828.8

Table 6.6: Value and gradient errors of Algorithm 5.3 and different ωs, dataset mri1

ω    AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
0.25 165.6 208.0 875.2 119.5 157.8 859.2
0.5  165.0 207.3 907.4 119.9 158.2 857.5
0.75 164.8 207.2 922.3 119.9 158.2 858.3
1.0  164.8 207.1 929.2 120.0 158.3 858.0

As ω increases, AVGE and RMSE decrease slightly but MAXE increases, and AVGGE and RMSGE increase slightly. The influence of ω on MAXE is much larger than in the 2D experiment. Compared with the result of Algorithm 5.2, the results are better in AVGE and RMSE but worse in MAXE, and a little worse in gradient errors. Like in 2D ODETLAP compression, we use ω = 0.5 in the following experiments.

6.5 Anterior Optimizations

To minimize both RMSE and MAXE, we use √(RMSE² + MAXE²) as the objective function and alternately reduce RMSE and MAXE. This experiment tests the algorithm that optimizes known point values multiple times before the target number of points is selected (Algorithm 5.4). The algorithm is tested with different numbers of optimizations and 100 points added between and after optimizations. Table 6.7 shows the results. As the number of optimizations increases, the value of the objective function decreases but gradient errors increase. The second result is used as the result of Algorithm 5.4.

6.6 Posterior Optimizations

This experiment tests the algorithm that optimizes known point values multiple times after the target number of points is selected (Algorithm 5.5). Table 6.8 shows the results of Algorithms 5.3–5.5 for comparison.

Table 6.7: Value and gradient errors of Algorithm 5.4, dataset mri1

Opt. at points      AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
2400,2500           171.7 214.1 696.5 119.9 158.2 863.2
2200,2300,...,2500  169.6 212.0 687.7 120.8 159.3 852.4
2000,2100,...,2500  170.3 212.6 684.2 121.0 159.9 923.4

Table 6.8: Value and gradient errors of Algorithms 5.3–5.5, dataset mri1

Alg. AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
5.3  165.0 207.3 907.4 119.9 158.2 857.5
5.4  169.6 212.0 687.7 120.8 159.3 852.4
5.5  172.2 214.6 705.0 118.8 156.8 806.9

The result of Algorithm 5.5 is worse than the result of Algorithm 5.4 in value errors but better in gradient errors.

6.7 The Complex Algorithm

We test the complex algorithm (Algorithm 5.6) with R = 0.01 or varying R = 0.01–0.02 (R1 = 0.02, R2 = 0.01 and r = 4). The algorithm does optimizations after 2200, 2300, 2400, 2500 and the target 2600 points are selected. Table 6.9 shows the results. First, the result of Algorithm 5.6 is the same as the result of Algorithm 5.4 (R = 0.01 and mri1) because, in this case, posterior optimizations do not improve the result of anterior optimizations. Second, the results of varying R = 0.01–0.02 are only better in MAXE but worse in all other errors than the results of R = 0.01. Therefore, unlike in 2D ODETLAP compression, we use the complex algorithm with R = 0.01 to compress the 3D datasets.

Table 6.9: Value and gradient errors of the complex algorithm and a varying R

R         AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
0.01      169.6 212.0 687.7 120.8 159.3 852.4
0.01–0.02 175.9 218.2 668.9 125.6 164.4 906.6

Table 6.10: Value and gradient errors of the simple algorithm and quantization, dataset mri1

Range   AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
15-3745 191.1 237.6 734.1 118.9 158.0 888.0
0-511   191.1 237.6 734.7 118.9 158.0 888.2
0-255   191.1 237.7 737.0 118.9 158.0 887.9
0-127   191.2 237.7 736.7 118.9 158.0 887.6
0-63    191.4 237.9 759.0 118.8 158.0 889.9
0-31    191.8 238.1 765.8 119.1 158.3 892.4

Table 6.11: Value and gradient errors of the complex algorithm and quantization, dataset mri1

Range   AVGE  RMSE  MAXE  AVGGE RMSGE MAXGE
Float   169.6 212.0 687.7 120.8 159.3 852.4
Integer 169.6 212.0 687.8 120.8 159.3 852.4
0-255   169.6 212.0 687.5 120.8 159.3 854.7

6.8 Compressing Selected Points

The output of 3D ODETLAP compression is compressed as a header, a mask and a set of values, in the same process as that of 2D ODETLAP compression. Table 6.10 shows the results of the simple algorithm, with the values of selected points quantized on different ranges. The first row shows the original range of values and the corresponding result. The results show that value errors do not increase much until range 0-63 and gradient errors do not increase much until range 0-31. The output values of the complex algorithm are floating-point, and Table 6.11 shows the results with the values rounded to integers or quantized on 0-255. The results are almost the same. The mask of selected points is a 3D binary image the size of the dataset. It is encoded by run-length encoding the number of 0's before each 1, with the pixels in slice-row-column order (SRC), 3D Hilbert space-filling curve order, or 3D Morton space-filling curve order. We compress the output mask of the complex algorithm with R = 0.01, as shown in Table 6.12. The smallest compressed size is 2467 bytes and the compression pipeline is RLE.Morton-HC. If the distribution of known points is random, the Shannon entropy of a pixel is about 0.08, and the best average compressed size of the mask is about 2630 bytes.
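A minimal sketch of this run-length encoding: linearize the mask in the chosen order (slice-row-column here; the Hilbert or Morton orders would first permute the cells) and emit the count of 0's before each 1:

    import numpy as np

    def rle_gaps(bits):
        runs, zeros = [], 0
        for b in bits:
            if b:
                runs.append(zeros)  # number of 0's preceding this 1
                zeros = 0
            else:
                zeros += 1
        return runs

    mask = np.array([0, 0, 1, 0, 0, 0, 1, 1, 0, 1], dtype=np.uint8)
    print(rle_gaps(mask))  # [2, 3, 0, 1]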

Table 6.12: Compressing the selected-point mask, dataset mri1

Source      AC   HC   VLQ-7zip VLQ-bzip2 VLQ-gzip VLQ-zpaq
RLE.SRC     2564 2525 2574     2844      2560     2585
RLE.Hilbert 2528 2475 2502     2789      2507     2499
RLE.Morton  2525 2467 2512     2791      2508     2496

Table 6.13: Compressing the values of selected points, dataset mri1

Source     AC   HC   VLQ1-7zip VLQ1-bzip2 VLQ1-gzip VLQ1-zpaq
DE.SRC     2786 2705 2853      3101       2819      2840
DE.Hilbert 2769 2706 2877      3140       2844      2848
DE.Morton  2752 2688 2845      3090       2832      2845
DE.MST1    2823 2725 2921      3138       2936      2885
DE.MST2    2799 2725 2899      3110       2929      2878

Source     VLQ2-7zip VLQ2-bzip2 VLQ2-gzip VLQ2-zpaq
DE.SRC     3093      2852       3146      2779
DE.Hilbert 2986      2898       3148      2794
DE.Morton  3170      2891       3143      2806
DE.MST1    3013      2761       3156      2710
DE.MST2    2987      2777       3163      2709

The values of selected points are encoded by delta encoding, with the known points in slice-row-column order, 3D Hilbert curve order, 3D Morton curve order, breadth-first minimum-spanning-tree order, or depth-first minimum-spanning-tree order. Delta values are either mapped or offset to nonnegative integers for variable-length quantity. We compress the output values of the complex algorithm with R = 0.01, quantized on 0-255, as shown in Table 6.13. The smallest compressed size is 2688 bytes and the compression pipeline is DE.Morton-HC. The header has 26 bytes, 2 bytes more than the header of 2D ODETLAP compression, for the number of slices. The total compressed size of the output of the complex algorithm with R = 0.01 and input dataset mri1 is therefore 26 + 2467 + 2688 = 5181 bytes, with mask compression pipeline RLE.Morton-HC and values compression pipeline DE.Morton-HC.

Table 6.14: Value and gradient errors of ODETLAP compression

Reg.     Quant. AVGE  RMSE  MAXE   AVGGE RMSGE MAXGE
mri1
N/A      0-255  169.6 212.0  687.5 120.8 159.3  854.7
9x9x9    0-127  174.8 219.5  738.7 124.4 164.5  863.9
11x11x11 0-63   179.9 227.8  803.4 129.1 171.3 1016.3
mri2
N/A      0-255  255.4 315.4 1006.1 170.1 226.5 1441.7
9x9x9    0-127  237.2 300.4 1102.0 174.7 234.1 1471.0
11x11x11 0-63   233.3 302.7 1243.4 173.8 241.0 1616.3
mri3
N/A      0-255  171.0 213.7  712.0 137.2 174.5  865.9
9x9x9    0-127  175.9 221.7  750.7 142.0 180.7  940.5
11x11x11 0-63   181.6 229.7  820.0 145.2 185.9 1001.1
mri4
N/A      0-255  260.6 335.0 1143.6 185.0 254.3 1641.5
9x9x9    0-127  241.3 321.1 1245.3 187.8 259.9 1515.0
11x11x11 0-63   246.6 332.9 1347.9 191.5 269.0 1508.4

6.9 Regular Points

As in 2D ODETLAP compression, regular known points reduce the compressed size of the mask. To include regular points in the output, we use a version of the complex algorithm that selects more regular points as the initial set of known points and does not remove them in the process. Output values are quantized on a smaller range for balance. We compress the 4 datasets using the complex algorithm with R = 0.01. The datasets are compressed in three configurations: (1) no regular known points and value quantization on 0-255, (2) 9 × 9 × 9 regular known points and value quantization on 0-127, and (3) 11 × 11 × 11 regular known points and value quantization on 0-63. 9³ and 11³ points are about a third and a half of the 2600 target number of points. Table 6.14 shows the compression errors and Table 6.15 shows the mask compression pipeline, values compression pipeline and compressed sizes for the datasets. Again, only a few of the many possible compression pipelines are chosen.

Table 6.15: Compression pipelines and compressed sizes of ODETLAP compression

Reg.     Quant. Mask compr. pipeline Values compr. pipeline Mask Values Total
mri1
N/A      0-255  RLE.Morton-HC        DE.Morton-HC           2467 2688   5181
9x9x9    0-127  RLE.SRC-VLQ-zpaq     DE.SRC-HC              2040 2334   4400
11x11x11 0-63   RLE.SRC-VLQ-gzip     DE.SRC-HC              1516 1903   3445
mri2
N/A      0-255  RLE.Hilbert-HC       DE.Hilbert-VLQ2-zpaq   2233 2729   4988
9x9x9    0-127  RLE.SRC-VLQ-zpaq     DE.Hilbert-HC          1863 2391   4280
11x11x11 0-63   RLE.SRC-VLQ-zpaq     DE.Hilbert-AC          1433 1975   3434
mri3
N/A      0-255  RLE.Morton-VLQ-gzip  DE.SRC-HC              2493 2638   5157
9x9x9    0-127  RLE.SRC-VLQ-gzip     DE.SRC-HC              2150 2248   4424
11x11x11 0-63   RLE.SRC-VLQ-gzip     DE.SRC-AC              1611 1853   3490
mri4
N/A      0-255  RLE.Morton-VLQ-gzip  DE.SRC-HC              2387 2713   5126
9x9x9    0-127  RLE.SRC-VLQ-gzip     DE.SRC-HC              2096 2339   4461
11x11x11 0-63   RLE.SRC-VLQ-gzip     DE.SRC-AC              1634 1907   3567

6.10 JP3D

For comparison, we use JP3D [98] to compress the 3D datasets. JPEG 2000 Part 10 JP3D extends JPEG 2000 Part 1 for 3D volumetric compression. JP3D uses a 3D version of the Embedded Block Coding by Optimized Truncation (EBCOT) coder of JPEG 2000 Part 1. It divides a volumetric dataset into cuboid tiles and applies a Discrete Wavelet Transform (DWT) to each tile. The DWT coefficients of each tile are divided into dyadically-sized code blocks and each code block is encoded by the EBCOT coder. We use the tools from OpenJPEG v2.1.0 and compress each dataset with a compression ratio of 75, 80, 85, ..., 195 or 200. The number of resolutions in each axis is 6 and the coding algorithm is 3D-EBCOT. Figures 6.2 to 6.5 show the results of JP3D and ODETLAP compression of each dataset. Each figure shows each of 6 error metrics versus the compressed size. Black circles are the results of JP3D, and red, green and blue diamonds are the results of ODETLAP compression with none, 9 × 9 × 9 or 11 × 11 × 11 regular known points. The figures show that ODETLAP compression is much better than JP3D in MAXE but worse in AVGE and RMSE.

Figure 6.2: JP3D and ODETLAP compression, dataset mri1

Figure 6.3: JP3D and ODETLAP compression, dataset mri2

Like in 2D ODETLAP compression, including regular known points in the output helps reduce the gap between ODETLAP compression and JP3D in AVGE and RMSE, and keeps the advantage of ODETLAP compression in MAXE. ODETLAP compression is worse in AVGGE and RMSGE but better in MAXGE. In summary, ODETLAP compression is much better in the maximum absolute error but worse in average absolute and RMS errors than JP3D for the MRI datasets. Table 6.16 shows the ratios of the value errors of ODETLAP compression to those of JP3D at similar compressed sizes for the 3D datasets. For each dataset, we calculate the actual compression ratios of the three ODETLAP results and use them as the specified compression ratios of JP3D. The actual compression ratios of the JP3D results may be different but are close.

Figure 6.4: JP3D and ODETLAP compression, dataset mri3

Figure 6.5: JP3D and ODETLAP compression, dataset mri4

The errors show that ODETLAP compression is about 40% to 60% better in MAXE for the MRI datasets, and regular known points help improve AVGE and RMSE.

6.11 Summary

The complex algorithm is about 10% more accurate than the simple algorithm with the same target number of points. For the 3D datasets, ODETLAP compression is about 40% to 60% better in the maximum absolute error than JP3D at similar compressed sizes. For the same maximum absolute error, the ODETLAP file is less than half the size of the JP3D file. Like in 2D ODETLAP compression, including some regular known points in the output brings ODETLAP compression closer to JP3D in the average absolute and RMS errors, while keeping the advantage of ODETLAP compression in the maximum absolute error.

Table 6.16: Relative value errors of ODETLAP compression to JP3D

        No regular points      Some regular points    More regular points
Dataset AVGE   RMSE   MAXE     AVGE   RMSE   MAXE     AVGE   RMSE   MAXE
mri1    126.7% 122.5% 45.6%    125.0% 120.9% 40.4%    115.0% 112.1% 41.8%
mri2    151.1% 139.1% 50.6%    131.6% 125.5% 56.7%    122.8% 118.3% 42.8%
mri3    117.5% 112.9% 46.2%    116.6% 113.1% 46.9%    110.6% 106.9% 34.1%
mri4    133.2% 124.8% 54.2%    119.5% 115.1% 58.1%    116.5% 112.3% 62.6%

CHAPTER 7
3D Segmented ODETLAP Compression

This chapter discusses a segmented ODETLAP compression algorithm that increases speed and reduces the memory requirement so that it can compress larger datasets. We test the algorithm on 3D atmospheric and MRI datasets and show that the compressed size of a dataset is 40% smaller than that of JP3D for the same maximum absolute error.

7.1 Introduction

ODETLAP approximation is reasonably fast, especially with GPU acceleration. However, ODETLAP compression is slow because it may involve hundreds of thousands of ODETLAP approximations, depending on the size of the dataset. We can either reduce the number of approximations or reduce the size of each approximation to increase the speed. As the inspiration for this work, Mitášová and Mitáš [100, 101] developed a segmentation procedure for the interpolation of large datasets using completely regularized splines. The idea is based on the local behavior of the interpolation function. The procedure divides a regular grid into square segments such that the number of data points in the 3 × 3 neighborhood of each segment is less than a threshold. Interpolated values in each segment are computed from data points in its 3 × 3, 5 × 5, or larger neighborhood such that the number of data points is larger than a second threshold. This improves the smoothness of the interpolation across segment boundaries. Segmentation reduces both computation time and memory requirements. JPEG [102] and JPEG 2000 [95] split an image into blocks or tiles and transform and encode each block separately. Tiling reduces the memory requirement but may produce blocking artifacts.

This chapter has been submitted to: W. Li, W. R. Franklin, S. V. G. Magalhães, M. V. A. Andrade, and D. L. Hedin, "3D segmented ODETLAP compression," submitted for publication.


Tiling was also used to reduce the complexity of ODETLAP approximation, and overlapping tiles were used to enhance smoothness. Stookey et al. [43, 44] proposed a parallel ODETLAP approximation algorithm for terrain approximation on an IBM Blue Gene/L. The algorithm divides a grid into overlapping patches and computes an ODETLAP approximation for each patch, then merges the results using bilinear interpolation. For example, it divides the grid into patches of w × w points, whose lower left corners are at (iw/2, jw/2), i, j = 0, 1, ..., so that each point is in at most four patches. The patches are grouped into blocks, and the blocks are approximated and interpolated in parallel. The algorithm can be used in ODETLAP compression. Li et al. [32, 33, 40] proposed an ODETLAP compression algorithm for 3D and 4D oceanographic data compression and obtained better results than the 3D set partitioning in hierarchical trees (3D SPIHT) of Said, Kim, and Pearlman [34, 35]. The algorithm uses an ODETLAP approximation algorithm that divides a grid into two interleaved meshes of boxes, computes an ODETLAP approximation for each box, and merges the results by weighted average. For example, it divides the grid into a mesh of boxes whose vertices are at (iw, jw, kw), i, j, k = 0, 1, ..., and another mesh of boxes whose vertices are at (iw + w/2, jw + w/2, kw + w/2), i, j, k = 0, 1, ..., so that each point is in at most two boxes. The approximation is accelerated on the GPU using an iterative solver from the CUSP library, while the rest of the algorithm is implemented in MATLAB. Benedetti et al. [41, 42] accelerated ODETLAP approximation and compression in C++ on the GPU using the CUSP and Thrust libraries. In this chapter, we propose a segmented ODETLAP compression algorithm that increases the speed by reducing the size of each approximation. The algorithm is fundamentally different from the patched algorithm of Stookey et al. and the boxed algorithm of Li et al. because it does not approximate the entire dataset each time and does not have a separate merging step. We use the algorithm to compress 3D datasets and evaluate its performance relative to JP3D. Results show that the algorithm is usually better in the maximum absolute error.

7.2 The Algorithm

ODETLAP compression can be accelerated by dividing a dataset into segments and compressing each segment separately, like JPEG and JPEG 2000, because the running time of ODETLAP approximation grows fast as the size of the grid increases. In addition, segmentation reduces the memory requirement of ODETLAP approximation. However, compressing segments separately produces blocking artifacts across segment boundaries in the decompressed dataset, which consists of an ODETLAP approximation in each segment, computed from the known points in the segment. To improve smoothness and reduce artifacts, the approximation in each segment can be computed as the center of a larger approximation in a neighborhood of the segment, using the known points in the neighborhood (including the segment), like Mitášová and Mitáš [100, 101]. The idea is based on the local behavior of ODETLAP approximation: the approximate value of a grid point is mostly determined by the values of the nearest known points. If there are sufficient known points in the neighborhood and outside of the segment, the value of a point in the segment approximation is close to its value in an ODETLAP approximation of the dataset from the known points in all segments.

The main idea of the segmented ODETLAP compression algorithm is to divide a dataset into segments and compress each segment in association with adjacent segments. The algorithm selects a set of known points in each segment so that the maximum absolute error (MAXE) of the segment approximation is not larger than a target threshold. The segment approximation is computed as the center of an ODETLAP approximation in its neighborhood. The algorithm adds known points in the segments in a round-robin fashion so as to compress them in synchrony. As Algorithm 7.1 shows, given a segmented point dataset, the algorithm first selects an initial set of known points and marks all segments as needing processing, which means the MAXE of the segment approximation may be larger than the target. While there are segments that need processing, and for each such segment in random order, the algorithm computes an approximation in the segment; if the MAXE of the approximation is larger than the target, it adds a known point in the segment, otherwise it marks the segment as not needing processing. The approximation in a segment changes, and thus the segment needs processing, if a known point is added in its neighborhood. Therefore, if the algorithm adds a known point in a segment, it marks all other segments whose neighborhood contains the point as needing processing. The algorithm terminates after the MAXE of each segment approximation is not larger than the target, so that the MAXE of the dataset approximation is not larger than the target. As datasets have different ranges, we specify error metrics and the target relative to the data range. The parameters of the algorithm are:

• R: the smoothing factor of ODETLAP approximation, which is 0.01 in the rest of the chapter;
• segment width (SW): the width of a segment; each segment is SW × SW × SW (in 3D);
• outer width (OW): the width between a segment boundary and its neighborhood boundary; the neighborhood is (SW+2OW) × (SW+2OW) × (SW+2OW) (see the sketch after this list);
• target: the target MAXE of each segment approximation.
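A sketch of the segment/neighborhood geometry these parameters define, with hypothetical variable names: each segment is an SW-cube, and its neighborhood extends OW cells on every side, clipped to the dataset:

    def neighborhood(seg_origin, sw, ow, dims):
        # Bounds of the (SW+2OW)-cube around a segment, clipped to the grid.
        lo = [max(0, o - ow) for o in seg_origin]
        hi = [min(d, o + sw + ow) for o, d in zip(seg_origin, dims)]
        return tuple(slice(l, h) for l, h in zip(lo, hi))

    # Segment at origin (12, 24, 0) of a 360 x 180 x 24 grid, SW=12, OW=6:
    print(neighborhood((12, 24, 0), 12, 6, (360, 180, 24)))
    # (slice(6, 30), slice(18, 42), slice(0, 18))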

Let n be the size of the grid (dataset). The time complexity of the conjugate gradient method is O(n^(4/3)) for three-dimensional elliptic problems [103]. With segmentation, the size of the grid (neighborhood) is (SW + 2OW)^3, and the time complexity of the method is O((SW + 2OW)^4). However, as the neighborhood size decreases, the speedup of GPU-accelerated ODETLAP approximation may also decrease, because the GPU needs a high occupancy to be efficient.

A dataset may have empty data points that do not have a value, which is often indicated by a special value not in the data range. A dataset may also have many data points that have the same value. In the first case, the values of empty data points can be interpolated for ODETLAP compression, but it is impossible to distinguish between empty and nonempty points in the decompressed dataset. In the second case, it is less efficient to select many known points of the same value. The solution is to include a nonempty mask of the dataset in its compression. The nonempty mask is a binary image the size of the dataset, with 1 denoting nonempty and 0 denoting empty. An empty value is also included in the compression, which is either the special value of empty points or a common value in the dataset. The empty value is assigned to the empty points in the dataset approximation. With the nonempty mask, the algorithm does not select empty points as known points, and the decompressed dataset is accurate at the empty points.

Algorithm 7.1: Segmented ODETLAP compression
Input: A segmented point dataset
Output: A set of known points and values
select an initial set of known points;
mark all segments as needing processing;
while there are segments that need processing do
    foreach segment s that needs processing do
        compute an approximation in s as the center of an ODETLAP approximation from the known points in its neighborhood;
        if the MAXE of the approximation is larger than a target then
            add the data point p ∈ s with the largest error to the set of known points;
            foreach segment t ≠ s intersecting the neighborhood of s do
                if p is in the neighborhood of t then
                    mark t as needing processing;
                end
            end
        else
            mark s as not needing processing;
        end
    end
end

Using a lower precision or fewer bits for known point values can effectively reduce the size of their compression without significantly increasing the error of the decompressed dataset (if the precision is not too low). Regardless of whether the values are integer or floating point, they can be quantized on a small range of integers and encoded in a few bits. We can quantize the values of known points either before or after they are selected. Quantizing the values after the known points are selected makes it harder to control the error of the dataset approximation. Therefore, we quantize the dataset before selecting the known points. In this chapter, we quantize the dataset on integers 0–255, the range of 8 bits. To quantize a dataset on 0–255, we find the maximum and minimum values vmax and vmin of the dataset, and calculate scale = 255/(vmax − vmin). Then we convert each value v to round((v − vmin) · scale). Now the dataset approximation has a different range from the dataset. To decompress the dataset, we clamp the dataset approximation between 0 and 255, and convert each approximate value v to v/scale + vmin.

Let the target MAXE be a percentage of the data range. Quantizing the dataset on 0–255 changes its range to 255 and induces an absolute error of at most 0.5/255 = 0.196% at each point. The algorithm guarantees the target MAXE of the dataset approximation from the quantized dataset. The quantized and inverse-quantized dataset has an absolute error of at most 0.196% from the dataset. Therefore, the decompressed dataset has a MAXE of at most 0.196% more than the target.

The algorithm to decompress a dataset from a nonempty mask and a set of known points and values is:

1. compute an approximation of the quantized dataset by computing an approximation for each segment;
2. clamp the dataset approximation between 0 and 255 and inverse-quantize it;
3. assign the empty value to the empty points.
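The quantization and inverse quantization just described, as a runnable sketch:

    import numpy as np

    def quantize(data):
        vmin, vmax = float(data.min()), float(data.max())
        scale = 255.0 / (vmax - vmin)
        q = np.round((data - vmin) * scale).astype(np.uint8)
        return q, vmin, scale

    def dequantize(q, vmin, scale):
        # Clamp to 0-255, then invert the quantization.
        return np.clip(q, 0, 255) / scale + vmin

    data = np.random.rand(24, 24, 24) * 100.0
    q, vmin, scale = quantize(data)
    # Round-off error is at most half a quantization step: 0.196% of the range.
    assert np.abs(dequantize(q, vmin, scale) - data).max() <= 0.5 / scale + 1e-9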

The ODETLAP compression of a dataset consists of a header, a nonempty mask, a known point mask, and the known point values. The nonempty mask is compressed by a pipeline of run-length encoding (RLE), variable-length quantity (VLQ), and the ZPAQ v1.10 archiver. The RLE of the nonempty mask is the alternating run lengths of 0s and 1s, starting with a run length of 0s. VLQ is a binary format that uses one byte for an integer in 0–127, two bytes for an integer in 128–(128² − 1), and so on. ZPAQ is an open-source file archiver. The known point mask is also compressed by a pipeline of RLE, VLQ, and ZPAQ. However, the RLE of the known point mask is the run lengths of 0s between 1s, which is not strictly an RLE but is more efficient if the ratio of 1s is not too large. The known point values are compressed by a pipeline of offset delta encoding, VLQ, and ZPAQ. The delta encoding of the values contains negative deltas, which cannot be stored in VLQ. Offsetting the deltas makes them nonnegative: find the minimum delta dmin; subtract dmin from each delta; and prepend -dmin to the deltas. The header has 18 bytes:

• the numbers of slices, rows, and columns of the dataset: 2 bytes each;
• SW and OW: 1 byte each;

• vmin and scale for inverse-quantization: 4 bytes each;
• the empty value: 2 bytes.
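The VLQ format used by these pipelines can be sketched as follows; the standard continuation-bit layout is assumed here, and it matches the stated sizes (one byte for 0–127, two bytes for 128–16383):

    def vlq_encode(n):
        out = [n & 0x7F]  # last byte has the continuation bit clear
        n >>= 7
        while n:
            out.append((n & 0x7F) | 0x80)
            n >>= 7
        return bytes(reversed(out))

    def vlq_decode(buf):
        n = 0
        for b in buf:
            n = (n << 7) | (b & 0x7F)
            if not b & 0x80:
                break
        return n

    assert len(vlq_encode(127)) == 1 and len(vlq_encode(128)) == 2
    assert vlq_decode(vlq_encode(300)) == 300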

We use regularly spaced data points as the initial known points of the algorithm because they are evenly distributed across the dataset and they help reduce the compressed size of the known point mask. For a given target, if the mask size is larger than the values size, using more regular initial known points often reduces the sum of the two sizes; if the values size is larger than the mask size, quantizing the dataset on a smaller range often reduces the sum. If the target is large, the algorithm should use fewer regular initial known points, and if the target is small, more. The optimal initial set of known points depends on the dataset and the target. In this chapter, we use a fixed initial set of known points at (4i + 2, 4j + 2, 4k + 2), i, j, k = 0, 1, ..., except empty data points. The interval between them is 4 in each dimension, so they are at most 1/64 of the dataset.

7.3 Results

The datasets tested are five 360 × 180 × 24 atmospheric datasets and one 256 × 256 × 160 MRI dataset. The atmospheric datasets are the January 2015 AIRX3STM [104] grid fields that have an extra dimension of 24 standard pressure levels. The basenames of the grid fields are CH4_VMR (CH4 volume mixing ratio), CO_VMR (CO volume mixing ratio), GPHeight (geopotential height), O3_VMR (O3 volume mixing ratio), and Temperature (atmospheric temperature). Each field appears in four grids, tagged _A, _D, _TqJ_A, and _TqJ_D. The grids are 360 × 180 with 1° × 1° cells. For example, CH4_VMR_A and CH4_VMR_D are the same field in two grids. We only use the fields in the grid tagged _A, meaning ascending orbit and individual quality control. We use each grid field as a regular point dataset and call them by their basenames: CH4_VMR, CO_VMR, GPHeight, O3_VMR, and Temperature. Table 7.1 shows some statistics of the atmospheric datasets. GPHeight is integer and the other four are floating point. They all have about 2% empty data points, indicated by the special value -9999. Figure 7.1 shows slices 0, 8, and 16 of each of the atmospheric datasets.

Table 7.1: Statistics of the datasets. Empty: the percentage of empty data points for the atmospheric datasets or the percentage of zero data points for the MRI dataset.

Dataset     Minimum     Maximum     Median      Empty
CH4_VMR     1.55 · 10⁻⁷ 2.02 · 10⁻⁶ 1.62 · 10⁻⁶ 2.02%
CO_VMR      1.52 · 10⁻⁸ 3.03 · 10⁻⁷ 4.07 · 10⁻⁸ 2.00%
GPHeight    1           50017       18204       1.99%
O3_VMR      1.23 · 10⁻⁸ 1.41 · 10⁻⁵ 1.08 · 10⁻⁶ 1.99%
Temperature 188.0       311.9       235.5       1.99%
MRI         0           3866        0           60.59%

Empty data points are shown in white; they appear to be landmasses. The MRI dataset is the co-registered, averaged image from four individual scan images of MR session OAS1_0001_MR1, OASIS cross-sectional MRI data [97]. The individual scan images are 256 × 256 × 128 with 1 × 1 × 1.25 mm³ voxels. The averaged image is 256 × 256 × 160 with 1 × 1 × 1 mm³ voxels. Table 7.1 shows that the MRI dataset has about 60% zero data points. Figure 7.2 shows slices 0, 40, 80, and 120 of the MRI dataset. Nonzero data points are concentrated in the center.

We compress the datasets using the segmented ODETLAP compression algorithm. Data structures use single-precision floats, and the relative tolerance of the iterative solver is 1 × 10⁻⁶. The smoothing factor R of ODETLAP approximation is 0.01. The atmospheric datasets are compressed with SW = 12 and OW = 6, so that a segment is 12 × 12 × 12 and the neighborhood is 24 × 24 × 24. The MRI dataset is compressed with SW = 16 and OW = 8. In general, as SW and OW increase, the running time of the algorithm increases, but the number of known points selected for a given target MAXE decreases. The algorithm selects an initial set of known points at (4i + 2, 4j + 2, 4k + 2), i, j, k = 0, 1, ..., except empty or zero data points. For the atmospheric datasets, the initial known points are about 24000 data points. For the MRI dataset, the initial known points are 64613 data points. Using more initial known points decreases the running time but increases the number of known points selected, which may or may not increase the size of the compression. Error metrics and the target are specified as percentages of the data range. We compress the atmospheric datasets with targets 3%, 2.5%, 2%, 1.5%, 1%, and 0.5%, and the MRI dataset with targets 6%, 5%, 4%, 3%, 2%, and 1%.


Figure 7.1: Slices 0, 8, and 16 of the 360 × 180 × 24 atmospheric datasets.

In addition, we compress the datasets with the target 0%, by selecting all nonempty or nonzero data points as known points. Selecting all nonempty data points does not require the algorithm, and the compression does not have a known point mask. Table 7.2 shows the results of the algorithm for each dataset and target. The results are: the number of known points selected, where "All n/e" means all nonempty or nonzero data points; the average absolute error (AVGE) of the decompressed dataset; the root-mean-square error (RMSE) of the decompressed dataset; the maximum absolute error (MAXE) of the decompressed dataset; the size of the compressed known point mask; the size of the compressed known point values; the total size of the compressed dataset; and the compression ratio. The compressed dataset includes the header and the compressed nonempty mask, known point mask, and known point values.


Figure 7.2: Slices 0, 40, 80, and 120 of the 256 × 256 × 160 MRI dataset.

The size of the nonempty mask is 2433, 2375, 2357, 2356, 2358, and 17206 bytes, respectively, for datasets CH4_VMR, CO_VMR, GPHeight, O3_VMR, Temperature, and MRI. The original size of each atmospheric dataset is 6220800 bytes (5.9 MB) and the original size of the MRI dataset is 20971520 bytes (20 MB). The results show that MAXE is at most 0.20% larger than the target. AVGE and RMSE are usually much smaller than MAXE. The number of known points and the compressed sizes increase fast as the target decreases. The size of the known point mask is 0 when all nonempty data points are selected. The results for dataset GPHeight show an anomaly: the compression ratio for the 0% target is larger than the compression ratio for the 0.5% target. The reason is that the size of the known point values is much smaller than the size of the known point mask for this dataset. Although the size of the known point values increases a lot from the 0.5% to the 0% target, the size of the known point mask decreases even more (to 0). Therefore, it is important to balance the two sizes. Some datasets compress better than others. For example, the compression ratio is 396.0 for dataset CH4_VMR and 127.6 for dataset Temperature when the target is 2%. We found that smoother datasets compress better. The smoothness of a dataset can be characterized by its Laplacian. Because the datasets have different ranges, we normalize them on [0, 1], and compute the root-mean-square of the Laplacian of the normalized datasets. As shown in Table 7.3, this value is loosely related to the compression ratios when the target is 2%, but it indicates that datasets CH4_VMR, CO_VMR, and GPHeight compress better. A better smoothness measure is the Laplacian of the Laplacian, or the biharmonic. We compute the root-mean-square of the biharmonic of the normalized datasets.

Table 7.2: Results of segmented ODETLAP compression. Sizes are in bytes.

Dataset     Target Points  AVGE  RMSE  MAXE  Mask   Values  Total   Ratio
CH4_VMR     3.0%   26711   0.53% 0.74% 3.17% 2516   5829    10794   576.3
CH4_VMR     2.5%   28494   0.48% 0.65% 2.68% 3785   6478    12712   489.4
CH4_VMR     2.0%   31557   0.43% 0.57% 2.18% 5792   7470    15711   396.0
CH4_VMR     1.5%   39691   0.40% 0.51% 1.66% 10890  9359    22698   274.1
CH4_VMR     1.0%   69667   0.30% 0.38% 1.19% 27592  16022   46063   135.0
CH4_VMR     0.5%   202848  0.16% 0.20% 0.69% 80552  40360   123361  50.4
CH4_VMR     0.0%   All n/e 0.10% 0.11% 0.20% 0      210841  213290  29.2
CO_VMR      3.0%   26868   0.67% 0.90% 3.16% 2922   8038    13351   465.9
CO_VMR      2.5%   29199   0.62% 0.83% 2.69% 4834   9253    16478   377.5
CO_VMR      2.0%   34681   0.55% 0.68% 2.17% 8688   11144   22223   279.9
CO_VMR      1.5%   43567   0.46% 0.55% 1.68% 14291  14770   31452   197.8
CO_VMR      1.0%   67289   0.31% 0.37% 1.18% 27635  23477   53503   116.3
CO_VMR      0.5%   153145  0.16% 0.19% 0.69% 64784  49068   116243  53.5
CO_VMR      0.0%   All n/e 0.10% 0.11% 0.20% 0      213858  216249  28.8
GPHeight    3.0%   28813   0.59% 0.79% 3.17% 3149   1994    7516    827.7
GPHeight    2.5%   30734   0.56% 0.74% 2.66% 3726   2149    8248    754.2
GPHeight    2.0%   33642   0.50% 0.65% 2.18% 5118   2419    9910    627.7
GPHeight    1.5%   39659   0.46% 0.58% 1.68% 8134   2992    13499   460.8
GPHeight    1.0%   67466   0.39% 0.46% 1.18% 21090  5186    28649   217.1
GPHeight    0.5%   208150  0.22% 0.26% 0.69% 72046  12130   86549   71.9
GPHeight    0.0%   All n/e 0.10% 0.11% 0.20% 0      41872   44245   140.6
O3_VMR      3.0%   42321   0.69% 0.96% 3.17% 11425  11382   25179   247.1
O3_VMR      2.5%   48743   0.60% 0.82% 2.68% 14834  13753   30959   200.9
O3_VMR      2.0%   59703   0.50% 0.68% 2.19% 20161  17585   40118   155.1
O3_VMR      1.5%   81677   0.40% 0.53% 1.69% 29728  24583   56683   109.7
O3_VMR      1.0%   131809  0.28% 0.36% 1.19% 48028  38395   88795   70.1
O3_VMR      0.5%   262396  0.15% 0.19% 0.69% 80220  68702   151294  41.1
O3_VMR      0.0%   All n/e 0.10% 0.12% 0.20% 0      186058  188430  33.0
Temperature 3.0%   41074   0.95% 1.18% 3.16% 11713  12517   26604   233.8
Temperature 2.5%   50313   0.83% 1.02% 2.68% 17080  15669   35123   177.1
Temperature 2.0%   65957   0.70% 0.84% 2.18% 25683  20695   48752   127.6
Temperature 1.5%   96786   0.53% 0.64% 1.70% 40782  29797   72953   85.3
Temperature 1.0%   163326  0.36% 0.43% 1.19% 66899  45963   115236  54.0
Temperature 0.5%   347499  0.18% 0.22% 0.69% 117735 80366   200475  31.0
Temperature 0.0%   All n/e 0.10% 0.11% 0.20% 0      209563  211937  29.4
MRI         6%     292724  0.60% 1.27% 6.19% 154989 258259  430472  48.7
MRI         5%     362331  0.52% 1.09% 5.19% 186616 319131  522971  40.1
MRI         4%     471515  0.43% 0.90% 4.19% 229597 412245  659066  31.8
MRI         3%     652309  0.33% 0.69% 3.19% 288544 563325  869093  24.1
MRI         2%     982647  0.22% 0.47% 2.19% 372541 825539  1215304 17.3
MRI         1%     1755817 0.11% 0.23% 1.19% 501646 1370377 1889247 11.1
MRI         0%     All n/e 0.04% 0.07% 0.20% 0      2532834 2550058 8.2

Table 7.3: Smoothness measures of the datasets normalized on [0, 1], and the compression ratio for 2% target. RMSL: root-mean-square Laplacian. RMSB: root-mean-square biharmonic.

Dataset     RMSL  RMSB  Ratio - 2%
CH4_VMR     0.017 0.051 396.0
CO_VMR      0.015 0.054 279.9
GPHeight    0.018 0.044 627.7
O3_VMR      0.034 0.071 155.1
Temperature 0.032 0.077 127.6
MRI         0.089 0.400 17.3

As shown in Table 7.3, this value is closely related to the compression ratios when the target is 2%. This analysis does not take quantization into consideration.
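A sketch of the two smoothness measures, assuming scipy's 6-neighbor discrete Laplacian and ignoring the handling of empty points:

    import numpy as np
    from scipy import ndimage

    def rms_laplacian(data):
        x = (data - data.min()) / (data.max() - data.min())  # normalize to [0, 1]
        return np.sqrt((ndimage.laplace(x) ** 2).mean())

    def rms_biharmonic(data):
        x = (data - data.min()) / (data.max() - data.min())
        return np.sqrt((ndimage.laplace(ndimage.laplace(x)) ** 2).mean())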

7.4 Evaluation

To evaluate the results of the algorithm, we also compress the datasets using JPEG 2000 Part 10 JP3D [98]. JP3D extends JPEG 2000 Part 1 for 3D volumetric compression. It uses a 3D version of the embedded block coding by optimized truncation (EBCOT) algorithm of JPEG 2000 Part 1. The algorithm divides a volumetric dataset into cuboid tiles and uses a discrete wavelet transform (DWT) to decompose each tile into subbands. The subbands are partitioned into dyadic-sized code blocks and each code block is encoded independently. We use OpenJPEG v2.1.0, an open-source JPEG 2000 codec. JP3D also needs a nonempty mask for datasets with empty data points, because there is no way to distinguish between empty and nonempty points in the decompression. To compress the atmospheric datasets, we first quantize them on integers 0–255, because JP3D compresses integers while the datasets are floating point. Quantizing the datasets on 0–255 produces very small errors, as shown in Table 7.2 for the 0% target. Then we interpolate the values of empty points using nearest-neighbor interpolation, because JP3D requires a value at every point. Then we manually find the best possible compression parameters for the datasets. The size of the code block is 64 × 32 × 8, which has to do with the shape of the datasets. The number of resolutions in the x, y, and z axes is 8, 8, and 1 (the resolutions in the x and y axes have to be the same), which also has to do with the shape of the datasets. The coding algorithm is 3D-EBCOT. We compress the quantized datasets with the target PSNR 40, 40.5, 41, ..., until lossless compression is achieved. The JP3D compression of an atmospheric dataset consists of a 10-byte header (vmin and scale for inverse-quantization, and the empty value), a nonempty mask, and a JP3D file. To decompress the dataset, we decompress and inverse-quantize the JP3D file, and assign the empty value to the empty points. The process of compressing the MRI dataset is simpler because it is integer and does not have empty data points. We manually find the best possible compression parameters for the dataset. The size of the code block is 64 × 64 × 32, related to the shape of the dataset. The number of resolutions in the x, y, and z axes is 10, 10, and 3. We compress the dataset with the target PSNR 40, 40.5, 41, ..., 99.5. The compressed dataset is a JP3D file.

Figure 7.3 shows the compressed size and approximation error plots of JP3D and the algorithm for each dataset and error metric. The horizontal axis is the compressed size in KB and the vertical axis is an approximation error in percentage. In each plot, the blue circles are the JP3D results (some may be outside the plot), the green circles are the ODETLAP results, and the red circle is the result of compressing all nonempty or nonzero data points. The plots show that JP3D is usually better in AVGE and RMSE but ODETLAP is usually better in MAXE. The relative performance of ODETLAP to JP3D, however, differs from dataset to dataset. ODETLAP is much better in MAXE and/or closer in AVGE and RMSE for datasets CH4_VMR, CO_VMR, and MRI. It performs less favorably for datasets GPHeight, O3_VMR, and Temperature. Table 7.4 shows the compressed sizes of the algorithm and JP3D when MAXE is 2% and 1% for each dataset. The sizes are linearly interpolated from the two results closest to each MAXE. The ODETLAP size is smaller than the JP3D size in most cases; it is about 60% of the JP3D size on average.
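The nearest-neighbor fill of empty points used before JP3D compression can be done with a Euclidean distance transform, a common idiom; -9999 is the empty marker of the atmospheric datasets described above:

    import numpy as np
    from scipy import ndimage

    def fill_empty(data, empty_value=-9999):
        empty = data == empty_value
        # Indices of the nearest non-empty point for every cell.
        idx = ndimage.distance_transform_edt(empty, return_distances=False,
                                             return_indices=True)
        return data[tuple(idx)]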


Figure 7.3: Compressed sizes in KB and approximation errors in percentage of JP3D and segmented ODETLAP compression. AVGE: average absolute error. RMSE: root-mean-square error. MAXE: maximum absolute error. All n/e means all nonempty data points.

Table 7.4: Interpolated compressed sizes in KB of segmented ODETLAP compression and JP3D when MAXE is 2% and 1%.

             ODETLAP            JP3D
Dataset      2%       1%       2%       1%
CH4_VMR      17.7     73.3     71.8     192.6
CO_VMR       24.9     74.9     90.9     180.0
GPHeight     10.9     48.9     18.0     48.2
O3_VMR       45.3     110.2    71.6     152.4
Temperature  56.3     144.4    111.2    207.5
MRI          1313.8   1970.3   1556.1   1984.8

7.5 Summary

We use segmentation to increase the speed of ODETLAP compression by reducing the size of each approximation in iterative selection. The algorithm can be further accelerated by reducing the number of approximations; for example, it can add multiple known points in each iteration and use sub-segmentation to spread out the points. If the target MAXE is small, using more regular initial known points not only reduces the time but also improves the performance. Segmentation is good for capturing local trends but bad for capturing global trends, because it limits the influence of known points.

The objective of the algorithm is to minimize the maximum absolute error. Results for 3D atmospheric and MRI datasets show that it is usually better in MAXE than JP3D. For the same MAXE, the compressed size of a dataset is about 60% that of JP3D.

CHAPTER 8
GPU-Accelerated Multiple Observer Siting

This chapter discusses the optimization and parallelization of the multiple observer siting algorithm of Franklin and Vogt [105]. Like spatial interpolation, visibility analysis is a common terrain application with considerable inherent parallelism. Like compression, siting uses a greedy algorithm that is parallelizable within iterations but sequential between iterations. We parallelize the algorithm using CUDA on a GPU and using OpenMP on multi-core CPUs. The speedup is up to 60 times on a GPU and about 16 times on two CPUs with 16 cores.

8.1 Introduction

(This chapter has been submitted for publication as: W. Li, W. R. Franklin, S. V. G. Magalhães, and M. V. A. Andrade, “GPU-accelerated multiple observer siting” [106].)

The purpose of multiple observer siting [8] is to place observers to cover the surface of a terrain or targets above the terrain. It is useful in the placement of radio transmission towers, mobile ad hoc networks, and environmental monitoring sites. Given a terrain represented as a digital elevation map (DEM), the algorithm assumes that observer and target points are placed above terrain points. Therefore, the number of possible observer and target positions on the plane is the size of the terrain. The objective of the algorithm can be to cover as many targets as possible using a given number of observers, or to cover a given number of targets using as few observers as possible. In the usual sense, an observer covers a target if the target is within a radius of interest of the observer and is visible, i.e., has a direct line of sight, from it. If the targets are the terrain points (raised to a common height), then the targets visible from an observer constitute the viewshed of the observer, and the number of targets covered by an observer is the viewshed area of the observer.

There are many variations of the problem. The radius of interest is usually defined in 2D, but can also be defined in 3D. An observer can cover a target with a quality or probability [107]. A target may require coverage by at least k > 1 observers for positioning or reliability. An observer

may require coverage by at least one other observer for communication purposes. Both the observers and the targets can be mobile [108]. The application may also have restrictions other than the number of observers or targets, for example, the cost of placing an observer at each position.

Multiple observer siting is a compute-intensive problem with considerable inherent parallelism. The multiple observer siting algorithm of Franklin and Vogt [105, 109] can be optimized to reduce its time and space complexities. Parallelizing the algorithm greatly increases its speed, so that the program is very fast for small terrains and not too slow for large terrains. GPUs are massively parallel devices containing thousands of processing units. They were designed for computer graphics applications but are fully programmable and optimized for SIMD vector operations. Using GPUs for scientific computing is called general-purpose computing on graphics processing units (GPGPU) [3]. Although a CPU core is much more powerful than a GPU core, a GPU has many more cores than a CPU. The latest GPUs have up to a few hundred GB/s of memory bandwidth and a few TFLOPS of single-precision processing power, although this theoretical peak performance is very hard to achieve.

The parallelization of line-of-sight and viewshed algorithms on terrains using GPGPU or multi-core CPUs is an active research topic. Strnad [110] parallelized the line-of-sight calculations between two sets of points, a source set and a destination set, on a GPU, and implemented them on a multi-core CPU for comparison. Zhao et al. [111] parallelized the R3 algorithm [112] to compute viewsheds on a GPU. Their parallel algorithm combines coarse-scale and fine-scale domain decompositions to deal with the limited memory and to enhance memory access performance. Osterman [113] parallelized the r.los module (R3 algorithm) of the open-source GRASS GIS on a GPU. Osterman et al. [114] also parallelized the R2 algorithm [112]. Axell and Fridén [115] parallelized and compared the R2 algorithm on a GPU and on a multi-core CPU. Bravo et al. [116] parallelized the XDRAW algorithm [112] to compute viewsheds on a multi-core CPU, after improving its IO efficiency and compatibility with SIMD instructions. Ferreira et al. [117, 118] parallelized the sweep-line algorithm of Van Kreveld [119] to compute viewsheds on multi-core CPUs. In multiple observer siting, Magalhães et al.

[120] proposed a local search heuristic to increase the percentage of a terrain covered by a set of k observers. Given a set of candidate observers, each subset of k observers is a solution S, and each solution with one observer different from S is a neighbor of S. Starting from an initial solution, the heuristic repeatedly replaces the current solution with a better neighbor, until a local optimum is found. Pena et al. [121] improved the performance of the heuristic by accelerating the overlay of viewsheds on a GPU using dynamic programming and a sparse-dense matrix multiplication algorithm.

In this chapter, we optimize and parallelize the multiple observer siting algorithm of Franklin and Vogt on a GPU. We also implement it on multi-core CPUs for comparison. Although visibility computation is an essential part of it, the siting algorithm is different from and more complicated than a pure visibility computation. We review the multiple observer siting algorithm in the next section.

8.2 Multiple Observer Siting

Franklin and Vogt [105, 109] proposed an algorithm to select a set of observers to cover the surface of a terrain. Let the visibility index of a given terrain point be the number of terrain points visible from an observer at that point, divided by the total number of terrain points within the given radius of interest. The algorithm first computes an approximate visibility index for each terrain point, and then selects a set of terrain points with high visibility indexes as candidate observer positions. The observers at the candidate positions are called tentative observers. Then the algorithm computes the viewshed of each tentative observer and iteratively and greedily selects observers from the tentative observers to cover the terrain surface. As an option, the algorithm can select observers that are visible from other observers. At the top level, this algorithm has four sequential steps: vix, findmax, viewshed, and site.

vix, the first step, computes an approximate visibility index for each terrain point, which is normalized as an integer from 0 to 255 and stored in a byte. The parameters of vix include the number of rows or columns of the terrain, nrows, the radius of interest of observers, roi, the height of observers or targets above the terrain, height, and the number of random targets for each observer. For simplicity, the algorithm assumes that the terrain is square and that the observer height and the target height are equal, but these restrictions are easy to remove. Placing an observer at each terrain point, vix picks a number of random terrain points within roi as targets and calculates their line-of-sight visibility. Then it calculates the ratio of visible targets, normalized in 0–255, as the approximate visibility index of the terrain point.

findmax, the second step, selects a small subset of terrain points with high visibility indexes as tentative observers. For performance reasons, the number of tentative observers selected is much smaller than the size of the terrain. However, highly visible points are often close to each other, for example, along ridges and on water surfaces. Selecting only the most visible points as tentative observers would not cover much of the terrain. Therefore, findmax divides the terrain into square blocks and selects a number of highly visible points in each block as tentative observers. The parameters of findmax include the number of points to select per block and the block width. findmax adjusts their values so that the last block is not too small and each block has the same number of tentative observers.

viewshed, the third step, computes the viewshed of each tentative observer using the R2 algorithm. With a square of width (2roi + 1) centered at an observer, the algorithm shoots a ray from the observer to each point on the boundary of the square and calculates the visibility of the points along the rasterized ray. Terrain points within roi along a ray are visited in order and their visibility is determined by the local horizon. The observer, and the points when treated as targets, are height above the terrain, while the points when treated as obstacles are on the terrain. Figure 8.1 shows a schematic illustration of the algorithm with roi = 3. On the left of the figure, an observer O is at the center of a 7 × 7 square, wherein the shaded point cells are not within roi and the dashed lines are the rays from O to the boundary points.
On the right of the figure, the points along rays OA and OB are shown on a 1D terrain and their visibility is determined by the local horizon (dotted lines). Because of the discrete nature of the algorithm, a point is often on multiple rays and is considered visible if it is visible along any ray.


Figure 8.1: Computing a viewshed using the R2 algorithm.
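The per-ray computation can be sketched as follows. This is our illustration rather than the dissertation's code: the grid layout and names (elev, nrows) are assumptions, and a full R2 kernel would record the visibility of every point along the ray rather than only the endpoint.

    // Walk outward along one rasterized ray; the local horizon is the largest
    // slope (tangent of the altitude angle) of any obstacle seen so far.
    __device__ bool visibleAlongRay(const short* elev, int nrows,
                                    int ox, int oy, int tx, int ty, int height) {
        float oz = elev[oy * nrows + ox] + height;   // observer is `height` above terrain
        float horizon = -1e30f;
        int dx = tx - ox, dy = ty - oy;
        int steps = max(abs(dx), abs(dy));           // rasterized ray length
        bool visible = true;
        for (int s = 1; s <= steps; ++s) {
            int x = ox + (int)roundf(dx * (float)s / steps);
            int y = oy + (int)roundf(dy * (float)s / steps);
            float dist = sqrtf((float)((x - ox) * (x - ox) + (y - oy) * (y - oy)));
            float target = elev[y * nrows + x] + height;   // point as target
            float obstacle = elev[y * nrows + x];          // point as obstacle
            visible = (target - oz) / dist >= horizon;     // above the local horizon?
            horizon = fmaxf(horizon, (obstacle - oz) / dist);
        }
        return visible;   // visibility of the endpoint (tx, ty)
    }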

site, the last step, selects a set of observers from the tentative observers to cover the terrain. It uses a greedy algorithm that adds one tentative observer to the set of observers at a time. At first, the set of observers and their cumulative viewshed are empty. Each iteration of the algorithm computes the union area of the viewshed of each unused tentative observer and the cumulative viewshed, and finds the unused tentative observer with the largest union area. Then it adds that tentative observer to the set of observers, and updates the cumulative viewshed as its union with the viewshed of the new observer. Since an unused viewshed cannot add more area to the cumulative viewshed in the current iteration than it would have added in the previous iteration, it is unnecessary to compute its union area in the current iteration once another unused viewshed that would add more area has been encountered. The algorithm may stop at a maximum number of observers or a minimum coverage of the terrain.

8.3 Optimization

First we optimize the multiple observer siting algorithm to reduce its time and space complexities. It is important that we parallelize an efficient sequential algorithm rather than an inefficient one. The space requirement matters for GPU parallel programs because the GPU memory is smaller than the CPU memory, and memory copies between them are expensive. The parallel algorithm has the same four steps as the sequential one: vix, findmax, viewshed, and site.

vix is often the most time-consuming step. Just as it computes approximate visibility indexes by target sampling, it can compute approximate line-of-sight visibilities by point sampling to further increase speed. It still uses random points within roi of an observer as targets. Rana [122] proposed using topographic feature points as targets, but that is not clearly better than using random points. For each target, instead of computing a line-of-sight visibility by evaluating all points along the line between the observer and the target, vix computes an approximate line-of-sight visibility by evaluating a subset of the points, with an interval between successive points of evaluation along the line of sight. The idea is similar to volume ray casting in computer graphics, which uses equidistant sampling points along a viewing ray for shading and compositing. Setting interval = 1 is equivalent to evaluating all the points. The choice of interval and its effects on the outcome will be discussed in the results section.

Instead of selecting multiple tentative observers per block, findmax selects one tentative observer per block. There are two benefits. First, selecting one tentative observer is accomplished by scanning the block points for the highest visibility index, while selecting multiple tentative observers requires sorting the points by visibility index. Second, highly visible points are still close to each other within a block. With the same total number of tentative observers, a smaller block width with one tentative observer per block gives better results than a larger block width with multiple tentative observers per block. The time complexity of findmax is linear in the data size, i.e., Θ(nrows^2).

viewshed still uses the R2 algorithm with rasterized rays, but it could use other algorithms to increase speed. For example, the XDRAW algorithm is faster but less accurate than R2. Wang et al. [123] proposed a viewshed algorithm that uses a plane instead of lines of sight in each of 8 standard sectors around the observer to approximate the local horizon; the algorithm is faster but less accurate than XDRAW. Israelevitz [124] extended XDRAW to increase accuracy by sacrificing speed.

A more compact representation is used for the viewsheds of tentative observers and the cumulative viewshed in site. A viewshed is (2roi + 1) × (2roi + 1) pixels, and the cumulative viewshed is nrows × nrows pixels. Each pixel uses only one bit and each row of a viewshed is padded to word size. The bit representation is compact and fast to process using bitwise operators. If rows are not padded, indexing is easier but boundary detection is harder. The difficulty of the representation is the misalignment between a viewshed and the cumulative viewshed: a word in a viewshed usually overlaps two words in the cumulative viewshed. The number of tentative observers is nrows^2/bw^2 (bw = block width) and the time complexity of viewshed is O((nrows^2/bw^2) · roi^2). The time cost of site depends very much on the number of tentative observers.
Previously, site computed the union area of each unused viewshed V and the cumulative viewshed C in each iteration, which is very time-consuming. Two modifications to the algorithm of site greatly reduce its time complexity. First, the size of V is (2roi + 1)^2 and the size of C is nrows^2, so the time to compute the area of V ∪ C is O(nrows^2). To find the tentative observer to add, however, site can look for the unused viewshed that would add the largest extra area to the cumulative viewshed, instead of looking for the one with the largest union area. The extra area of V can be computed as the area of V − C_V, where C_V is the corresponding (2roi + 1)^2 region in C (or partially in C), which takes O(roi^2) time. In bit representation, V − C_V can be implemented as V and (not C_V). Second, it is unnecessary to compute the extra area of all unused tentative observers in each iteration. The ones whose extra area needs computing are within 2roi of the last added observer: if the distance between an unused tentative observer and the last added observer is larger than 2roi, then C_V and V − C_V stay the same. The number of tentative observers within 2roi is 4π·roi^2/bw^2. In each iteration of site, the time to find the unused tentative observers whose extra area needs computing is O(nrows^2/bw^2). The time to compute their extra areas is O(roi^4/bw^2). The time to find the tentative observer to add (the one with the largest extra area) is O(nrows^2/bw^2). The time to update C is O(roi^2). Summing up, the time complexity of an iteration of site is O(nrows^2/bw^2 + roi^4/bw^2 + roi^2). If O(roi/bw) = O(1), which is reasonable, the time complexity is O(nrows^2/roi^2 + roi^2). The iteration stops when the coverage of C is not lower than a threshold, or when no unused tentative observer has a positive extra area. Algorithm 8.1 shows the algorithm.
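To make these counts concrete with the parameters tested later in Section 8.5 (nrows = 16385, roi = 100, bw = 50): there are (16385/50)^2 ≈ 328^2 = 107,584 tentative observers, only about 4π · 100^2/50^2 ≈ 50 of them lie within 2roi of the last added observer in a given iteration, and roi/bw = 2, so the O(roi/bw) = O(1) assumption holds.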

Algorithm 8.1: The algorithm of site
Input: A DEM, a set of tentative observers, and their viewsheds
Output: A set of observers

    the set of observers and the cumulative viewshed are empty;
    while coverage is lower than a threshold and can be increased do
        foreach unused tentative observer do
            if the set of observers is empty or it is within 2roi of the last added observer then
                compute the extra area that its viewshed would add to the cumulative viewshed;
            end
        end
        add the unused tentative observer with the largest extra area to the set of observers;
        update the cumulative viewshed as its union with the viewshed of the newly added observer;
    end
    return the set of observers;
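With the bit representation, the extra-area computation in Algorithm 8.1 reduces to population counts of V and (not C_V). The following host-side sketch is our illustration, not the dissertation's code; for brevity it assumes that the words of V happen to align with those of C, whereas the real code must also shift V to handle the usual misalignment:

    #include <cstdint>

    // Extra area of a viewshed V relative to the cumulative viewshed C,
    // both bit-packed into 32-bit words with padded rows.
    int extraArea(const uint32_t* V, int vWordsPerRow, int vRows,
                  const uint32_t* C, int cWordsPerRow,
                  int rowOffset, int wordColOffset) {
        int area = 0;
        for (int r = 0; r < vRows; ++r) {
            const uint32_t* vRow = V + r * vWordsPerRow;
            const uint32_t* cRow = C + (r + rowOffset) * cWordsPerRow + wordColOffset;
            for (int w = 0; w < vWordsPerRow; ++w)
                area += __builtin_popcount(vRow[w] & ~cRow[w]);  // bits set in V but not in C
        }
        return area;   // number of new terrain points this viewshed would cover
    }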

8.4 Parallelization

The multiple observer siting algorithm is compute-intensive, but has considerable inherent parallelism. Parallelizing the algorithm greatly increases its speed. The Compute Unified Device Architecture (CUDA) [4] is a parallel computing platform and programming model for NVIDIA GPUs. We parallelize the algorithm using CUDA on an NVIDIA GPU. The four steps of the algorithm are parallelized separately.

An NVIDIA GPU has a few streaming multiprocessors and each multiprocessor has many CUDA cores. To perform a task using CUDA, it is necessary to define a kernel function that is executed by a large number of CUDA threads. Each thread does a fraction of the work and is executed on a CUDA core. The threads are grouped into thread blocks and each thread block is executed on a multiprocessor. The threads of a block are divided into warps of 32 threads and each warp is instruction-locked. The threads of a block can be synchronized, while the threads of different blocks cannot.

vix, the first step, computes a visibility index for every terrain point. The number of points is usually very large, so we define a kernel function to compute the visibility index of one point in each CUDA thread. The function picks random targets for the point and calculates their line-of-sight visibility.

findmax, the second step, finds the point with the highest visibility index in every terrain block as a tentative observer. The number of terrain blocks is much smaller than the number of points. We define a kernel function to find the most visible point of one terrain block in each thread block. The function finds the most visible point in a portion of the terrain block in each thread of the block, and then performs a parallel reduction to find the most visible point of the terrain block.

viewshed, the third step, computes the viewshed of every tentative observer. The number of tentative observers is the same as the number of terrain blocks. We define a kernel function to compute the viewshed of one tentative observer in each thread block. The function computes a slice of the viewshed in each thread of the block. For example, if the block has 4 threads, then each thread computes a quarter of the viewshed. In the case of Figure 8.1, each thread shoots 6 consecutive rays from the observer to points on the boundary of the square.

The major task in an iteration of site, the last step, is to compute the extra area of the unused tentative observers within 2roi of the last added observer. In an earlier attempt, we defined a kernel function that, in each CUDA thread, checks whether the extra area of a tentative observer needs computing and, if so, computes it. The function is slow because the extra area of most tentative observers does not need computing, so the workload is very unbalanced among the threads. The problem is that the function does two tasks: finding the tentative observers whose extra area needs computing, and computing the extra area. Defining separate functions for the two tasks eliminates the problem. The first function checks whether the extra area of a tentative observer needs computing in each CUDA thread. The second function computes the extra area of one tentative observer in each thread block, with each thread processing one or more rows of the viewshed. The two functions can be called in sequence from the CPU, or the second function can be called from the first function. We choose the latter and define the first function to call the second function if an extra area needs computing.
A third kernel function is defined to find, in each thread block, the unused tentative observer with the largest extra area within a portion of the tentative observers. The function finds a candidate in each thread of the block and performs a parallel reduction over the block. On the CPU, the tentative observer to add is then found from the results of the thread blocks. A fourth kernel function is defined to update the cumulative viewshed. The function processes one row of the viewshed of the newly added observer in each CUDA thread. A viewshed is too small to fully utilize the GPU, but it is still slower to update the cumulative viewshed on the CPU and copy it to the GPU.

To sum up, the CUDA program has seven kernel functions, one each for vix, findmax, and viewshed, and four for site. Functions 1–3 are called once, while functions 4–7 are called in a loop. Function 5 is called from function 4. A sketch of kernel function 1 follows the list.

1. compute visibility indexes
2. select tentative observers
3. compute the viewsheds of tentative observers
4. find the unused tentative observers whose extra area needs computing
5. compute the extra area of a viewshed
6. find unused tentative observers with the largest extra area
7. update the cumulative viewshed
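As an illustration of kernel function 1, here is a minimal self-contained sketch. It is ours, not the dissertation's code: the grid layout, the curand-based target sampling, and the inlined approximate line-of-sight test with ‘exponential’ interval sampling (Section 8.3) are all assumptions.

    #include <curand_kernel.h>

    // One CUDA thread computes the approximate visibility index of one
    // terrain point by sampling random targets within roi.
    __global__ void vixKernel(const short* elev, unsigned char* vix,
                              int nrows, int roi, int height, int ntargets,
                              unsigned long long seed) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= nrows * nrows) return;
        int ox = idx % nrows, oy = idx / nrows;
        float oz = elev[oy * nrows + ox] + height;

        curandState rng;
        curand_init(seed, idx, 0, &rng);

        int visibleCount = 0;
        for (int t = 0; t < ntargets; ) {
            // pick a random target: a point in the bounding box, kept if in the disk
            int tx = ox + (int)(curand_uniform(&rng) * (2 * roi + 1)) - roi;
            int ty = oy + (int)(curand_uniform(&rng) * (2 * roi + 1)) - roi;
            int dx = tx - ox, dy = ty - oy;
            if (tx < 0 || tx >= nrows || ty < 0 || ty >= nrows ||
                dx * dx + dy * dy > roi * roi) continue;   // not a valid target
            ++t;
            float dist = sqrtf((float)(dx * dx + dy * dy));
            float tz = elev[ty * nrows + tx] + height;
            bool vis = true;
            for (int s = 1; s < (int)dist; s *= 2) {       // 'exponential' interval
                float f = s / dist;
                int px = ox + (int)roundf(dx * f), py = oy + (int)roundf(dy * f);
                if ((elev[py * nrows + px] - oz) / s > (tz - oz) / dist) {
                    vis = false; break;                    // blocked by an obstacle
                }
            }
            if (vis) ++visibleCount;
        }
        vix[idx] = (unsigned char)(255 * visibleCount / ntargets);  // normalized 0-255
    }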

For comparison, we parallelize the algorithm using OpenMP on multi-core CPUs. OpenMP uses compiler directives and library routines to direct parallel execution. In the OpenMP program, we use a #pragma omp parallel for schedule(guided) directive for the following for loops:

• compute the visibility index for each terrain point
• find the most visible point as a tentative observer for each terrain block
• compute the viewshed for each tentative observer
• check if the extra area needs computing for each tentative observer
• compute the extra area for each row of a viewshed
• update the cumulative viewshed for each row of a viewshed

In addition, a parallel region (using a #pragma omp parallel directive) finds the unused tentative observer with the largest extra area. Each thread finds a tentative observer in a portion of the tentative observers, and updates the global tentative observer with the largest extra area in a critical region (using a #pragma omp critical directive).
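A minimal sketch of this reduction pattern (ours, not the dissertation's code; the array names extraArea and used are assumptions):

    #include <omp.h>

    // Each thread scans a portion of the tentative observers for the largest
    // extra area, then merges its local best into the global best inside a
    // critical region, as described above.
    int findBestObserver(const int* extraArea, const bool* used, int nTentative) {
        int bestIdx = -1, bestArea = 0;
        #pragma omp parallel
        {
            int localIdx = -1, localArea = 0;
            #pragma omp for schedule(guided) nowait
            for (int i = 0; i < nTentative; ++i)
                if (!used[i] && extraArea[i] > localArea) {
                    localArea = extraArea[i];
                    localIdx = i;
                }
            #pragma omp critical
            if (localArea > bestArea) { bestArea = localArea; bestIdx = localIdx; }
        }
        return bestIdx;   // -1 if no unused tentative observer adds positive area
    }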

8.5 Results and Discussion

We test the parallel and sequential programs on a machine with an NVIDIA Tesla K20Xm GPU accelerator (14 streaming multiprocessors, 192 CUDA cores per multiprocessor, and 6GB GDDR5 memory), two Intel Xeon E5-2687W CPUs at 3.1GHz (8 cores and 16 hyper-threads per CPU), and 128GB DDR3 memory, running Ubuntu 16.04 LTS. The CUDA program is compiled with nvcc 7.5.17 and gcc 4.9.3 at optimization level -O2. The OpenMP and sequential programs are compiled with gcc 4.9.3 at optimization level -O2.

The terrain dataset is a 16385 × 16385 Puget Sound DEM from Lindstrom and Pascucci [125], which was extracted from USGS 10-meter DEMs. The unit of value is 0.1 meter and the range of values is [0, 43930]. Figure 8.2 shows a shaded relief plot of the terrain, which is about half mountains and half plains.

Figure 8.2: Shaded relief plot of the terrain dataset.

vix computes an approximate line-of-sight visibility with an interval between successive evaluation points along each line of sight. We use the CUDA program to test the effects of interval because it is the fastest. To evaluate the accuracy of the approximate visibility index, the exact visibility index map of the terrain, normalized in integers 0–255, is computed with roi = 100 (1000 meters) and observer/target height = 100 (10 meters). Figure 8.3 shows the exact visibility index map of the terrain. The figure shows that highly visible points (light colored) are on flat terrain: plains, valleys, and water. In the simplest case, interval can be a constant value. However, we found that points closer to the observer along the line of sight are more important for determining visibility. For example, a closer point with a higher elevation than the observer appears higher (has a larger altitude angle) than a more distant point with the same elevation. Therefore, evaluation points should become denser and interval should become smaller towards the observer. The same is true from the viewpoint of the target, so interval should also become smaller towards the target.

We test the CUDA program with roi = 100, height = 100, block width (bw) = 50 (107584 tentative observers), and target coverage = 95%. vix uses 10, 50, or 250 random targets per point and different values of interval. Let the observer be at point 0 and the target be at point 100 along the line of sight. Starting from point 1, the evaluation points are 1, 1 + interval, 1 + 2·interval, and so on, if interval is constant. The following values of interval (and the corresponding evaluation points) are tested (a small sketch generating the non-constant sequences follows the list):

• 1: 99 points (1, 2, 3, . . . , 99)
• 2: 50 points (1, 3, 5, . . . , 99)
• 4: 25 points (1, 5, 9, . . . , 97)
• 8: 13 points (1, 9, 17, . . . , 97)
• 16: 7 points (1, 17, 33, . . . , 97)
• 32: 4 points (1, 33, 65, 97)
• ‘exponential’: 7 points (1, 2, 4, . . . , 64)
• ‘Fibonacci’: 10 points (1, 2, 3, . . . , 89)
• ‘bidirectional exponential’: 12 points (1, 2, 4, . . . , 96, 98, 99)
• ‘bidirectional Fibonacci’: 16 points (1, 2, 3, . . . , 97, 98, 99)
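As an illustration only (our sketch, not the dissertation's code; the function names are ours), the two non-constant sequences can be generated as follows:

    #include <vector>

    // Evaluation-point offsets for a line of sight of length `len`
    // (observer at 0, target at len); both sequences stop before the target.
    std::vector<int> exponentialOffsets(int len) {
        std::vector<int> pts;
        for (int s = 1; s < len; s *= 2) pts.push_back(s);
        return pts;   // for len = 100: 1, 2, 4, 8, 16, 32, 64 (7 points)
    }

    std::vector<int> fibonacciOffsets(int len) {
        std::vector<int> pts;
        for (int a = 1, b = 2; a < len; ) { pts.push_back(a); int t = a + b; a = b; b = t; }
        return pts;   // for len = 100: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 (10 points)
    }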

Table 8.1 shows the running time of vix in seconds, the RMSE of the approximate visibility index map (VIM), and the number of observers selected for the target coverage. The smaller the number of observers, the better the result. More random targets per point produce a longer running time for vix, a smaller RMSE of the VIM, and a smaller number of observers. The improvement from 10 to 50 targets is larger than the improvement from 50 to 250 targets. However, a smaller RMSE does not necessarily mean fewer observers. For example, interval = 8 has a smaller RMSE but more observers than interval = ‘exponential’ or ‘Fibonacci’. A larger interval produces a shorter time, a larger RMSE, and more observers. Figure 8.4 shows that ‘exponential’ and ‘Fibonacci’ are below, and thus better than, the curve through interval = 1, 2, . . . , 32, while ‘bidirectional exponential’ and ‘bidirectional Fibonacci’ are almost on the curve. We choose 50 targets and interval = ‘exponential’ for vix, so that its time complexity is O(nrows^2 log(roi)).

Table 8.2 shows the results of the parallel and sequential programs with different combinations of roi and bw: roi = 50 and bw = 25, roi = 100 and bw = 50, roi = 200 and bw = 100, roi = 100 and bw = 25, and roi = 200 and bw = 50. The first three combinations have roi/bw = 2 and the last two have roi/bw = 4.

Figure 8.3: Exact visibility index map of the terrain, normalized in integers 0–255, with roi = 100 and height = 100.

The other parameters are height = 100 and target coverage = 95%. The number of tentative observers is 429025, 107584, or 26896 when bw = 25, 50, or 100, respectively. Each result is an average over 10 runs of the program. The OpenMP program uses 50 threads with dynamic threads disabled. The results are the running times of vix, findmax, viewshed, site, and the whole program (total) in seconds, and the number of observers. The total time includes I/O time, which is about 0.2 seconds. The number of observers is slightly different among the programs because of parallel execution and randomization.

Table 8.1: Results of vix using 10, 50, or 250 random targets per point. interval: the interval between successive evaluation points along the line of sight. Time: the running time of CUDA vix in seconds. RMSE: the RMSE of the approximate visibility index map. Observ.: the number of observers selected for 95% coverage.

            10 targets              50 targets              250 targets
interval  Time   RMSE   Observ.   Time   RMSE   Observ.   Time    RMSE   Observ.
1         16.2   174.8  16323     80.4   172.3  15152     402.3   171.8  14833
2          9.3   175.8  16380     45.9   173.2  15184     230.1   172.7  14856
4          5.9   177.3  16436     28.1   174.8  15262     139.4   174.2  14908
8          4.1   180.6  16651     18.3   178.1  15444      89.6   177.6  15070
16         3.0   188.8  17254     12.7   186.6  15929      61.3   186.1  15469
32         2.3   209.9  18588      9.2   208.3  16921      43.8   207.9  16312
exp        2.9   187.5  16589     12.2   185.2  15267      58.5   184.8  14917
fib        3.4   183.8  16456     14.4   181.5  15197      69.9   181.0  14859
biexp      6.2   181.8  16506     29.3   179.4  15275     145.1   178.9  14924
bifib      7.5   179.1  16466     36.2   176.6  15215     179.7   176.1  14866

A smaller bw and more tentative observers produce longer times for viewshed and site but fewer observers. The time of vix is roughly proportional to log(roi). The time of findmax is very small. The time of viewshed is roughly proportional to roi and to the number of tentative observers. The time of site varies a lot; it is related to roi, bw, and the number of observers. With a fixed roi, the time of site increases as bw decreases and the number of tentative observers increases. With a fixed bw, it may either decrease or increase as roi increases.

Table 8.3 shows the speedups of the parallel programs over the sequential program. The speedup of vix is about 72 to 95 times for the CUDA program and 19 to 20 times for the OpenMP program. The speedup for the CUDA program decreases as roi increases, because evaluation points are farther from the observer and memory accesses for their elevations are less local, and because CUDA threads in a warp are instruction-locked and access memory at the same time. The speedup for the OpenMP program also decreases a little as roi increases. The speedup of findmax is about 5 to 10 times for the CUDA program and 13 to 16 times for the OpenMP program; the workload is too small for the CUDA program. As bw increases, the speedup for the CUDA program increases because each thread has more independent work, while the speedup for the OpenMP program decreases because the granularity of parallelism is larger.


Figure 8.4: The running time of CUDA vix in seconds versus the number of observers selected for 95% coverage. Data points are labeled with the value of interval.

The speedup of viewshed is about 39 to 57 times for the CUDA program and 17 to 18 times for the OpenMP program. For the same reason as for vix, the speedup for the CUDA program decreases as roi increases (with the same bw), but increases as bw decreases and the number of tentative observers increases (with the same roi). However, the speedup for the OpenMP program does not decrease as roi increases (with the same bw). The speedup of site is about 5 to 9 times for the CUDA program and 2 to 7 times for the OpenMP program.

Table 8.2: Results of the CUDA, OpenMP, and sequential programs, averaged over 10 runs. vix, . . . , total: running time in seconds. Observers: the number of observers selected for 95% coverage.

CUDA program
roi   bw    vix      findmax  viewshed  site    Total    Observers
50    25       9.9   0.3        2.4      37.3     50.1   49199
100   50      12.0   0.1        2.8       4.7     20.0   15273
200   100     15.4   0.1        3.3       1.8     20.8    5461
100   25      12.0   0.3       10.4      14.2     37.1   14789
200   50      15.4   0.1       11.9       3.8     31.5    5185

OpenMP program
roi   bw    vix      findmax  viewshed  site    Total    Observers
50    25      46.8   0.1        7.8      36.4     91.3   49201
100   50      53.0   0.1        7.7       9.5     70.5   15258
200   100     59.3   0.1        7.5       4.4     71.5    5449
100   25      52.9   0.1       29.8      21.8    104.8   14781
200   50      59.2   0.1       29.3      11.8    100.6    5182

Sequential program
roi   bw    vix      findmax  viewshed  site    Total    Observers
50    25     939.8   1.5      138.0     249.2   1328.8   49263
100   50    1029.6   1.3      132.6      25.2   1188.9   15282
200   100   1100.2   1.1      128.7       9.3   1239.6    5452
100   25    1032.8   1.3      529.2      93.6   1657.2   14787
200   50    1109.9   1.2      509.2      35.3   1655.8    5186

The speedup of site increases as bw decreases (with the same roi) but may increase or decrease as roi increases (with the same bw). The speedup of the whole program is about 27 to 60 times for the CUDA program and 15 to 17 times for the OpenMP program. It may increase or decrease as roi increases (with the same bw), and decreases as bw decreases (with the same roi). The CUDA program is about 3 times as fast as the OpenMP program.

8.6 Summary

We have optimized the multiple observer siting algorithm and parallelized it using CUDA on a GPU and using OpenMP on multi-core CPUs. In general, the speedup of the CUDA program is about 30 to 60 times on an NVIDIA Tesla K20Xm GPU accelerator over the sequential program on a CPU core.

Table 8.3: Speedups of the CUDA and OpenMP programs over the sequential program.

CUDA program
roi   bw    vix    findmax  viewshed  site   Total
50    25    94.6   5.6      57.0      6.7    26.5
100   50    85.5   9.8      46.7      5.3    59.5
200   100   71.6   9.5      39.3      5.2    59.7
100   25    85.8   4.9      50.9      6.6    44.6
200   50    72.2   8.5      42.7      9.2    52.5

OpenMP program
roi   bw    vix    findmax  viewshed  site   Total
50    25    20.1   15.5     17.7      6.9    14.6
100   50    19.4   14.6     17.1      2.7    16.9
200   100   18.6   13.1     17.1      2.1    17.3
100   25    19.5   15.1     17.8      4.3    15.8
200   50    18.8   15.0     17.4      3.0    16.5

The speedup of the OpenMP program is about 16 times on two Intel Xeon E5-2687W CPUs with 16 cores over the sequential program. Both techniques are very effective in accelerating the program. The CUDA program is faster, while the OpenMP program is easier to implement. Due to overhead, the GPU is more efficient for long computations, while the CPU is more efficient for short computations. If a program contains both long and short computations, it is possible to achieve greater efficiency by combining GPU and CPU parallel execution.

CHAPTER 9
Conclusions and Future Work

Terrain-related research and applications benefit greatly from GPU acceleration. Using the CUSP library, ODETLAP approximation and compression are about 8 times as fast on a GPU as on a CPU core. ODETLAP compression is slow because it may involve thousands of ODETLAP approximations. By reducing the size of each approximation, segmented ODETLAP compression is much faster and more memory-efficient, and would be even faster if the segments were processed in parallel.

ODETLAP approximation is comparable in accuracy to natural neighbor interpolation and the multiquadric-biharmonic method, with similar error characteristics. Unlike natural neighbor interpolation, ODETLAP approximation does extrapolation as well as interpolation. In addition, it can be faster than the multiquadric-biharmonic method if there are many data points. ODETLAP approximation is a linear operator on the values of known points, which can be optimized to minimize the distance between the approximation and the dataset.

ODETLAP compression is better at minimizing the maximum absolute error than JPEG 2000 and JP3D for the test datasets. We designed new algorithms to improve the accuracy of ODETLAP compression by about 10% to 30% and used new methods to compress the set of known points and values. We tested ODETLAP compression on 2D terrain and 3D MRI datasets. The results show that ODETLAP compression is about 30% to 50% better in the maximum absolute error than JPEG 2000 for the terrain datasets, and about 40% to 60% better in the maximum absolute error than JP3D for the MRI datasets. For the same maximum absolute error, the ODETLAP file is often half the size of the JPEG 2000 file for the terrain datasets and less than half the size of the JP3D file for the MRI datasets. We tested the segmented ODETLAP compression algorithm on larger 3D atmospheric and MRI datasets and showed that the compressed size of a dataset is about 60% that of JP3D for the same maximum absolute error.


Future work on ODETLAP includes approximation and compression on a non-regular grid, using other types of known points and equations, and looking for new suitable datasets. We would also like to find better ways to exploit the regularity of known points. For the segmented algorithm, we may use multi-core CPUs to parallelize segment processing, and we will use it to compress 4D datasets.

The multiple observer siting algorithm has considerable inherent parallelism and is orders of magnitude faster with parallel processing. We optimized and parallelized the algorithm using CUDA on a GPU and using OpenMP on multi-core CPUs. The program is up to 60 times as fast on an NVIDIA Tesla GPU as on a CPU core, and about 16 times as fast on two Intel Xeon CPUs with 16 cores. The step that computes the visibility indexes of terrain points has the largest speedup on the GPU, up to 95 times, while the steps that find tentative observers and select observers have smaller speedups. The CUDA program can site thousands of observers above a terrain with hundreds of millions of points within a minute.

There are two directions for future work on multiple observer siting. The first is to further increase speed by selecting multiple observers in each iteration of the greedy algorithm to increase parallelism. Because parallel execution is fast, the second is to reduce the number of observers selected for a target coverage by computing a more accurate visibility index map or by using more tentative observers.

REFERENCES

[1] T. G. Farr, P. A. Rosen, E. Caro, R. Crippen, R. Duren, S. Hensley, M. Kobrick, M. Paller, E. Rodriguez, L. Roth, D. Seal, S. Shaffer, J. Shimada, J. Umland, M. Werner, M. Oskin, D. Burbank, and D. Alsdorf, “The shuttle radar topography mission,” Rev. Geophys., vol. 45, no. 2, RG2004, June 2007.

[2] D. B. Gesch, M. J. Oimoen, S. K. Greenlee, C. A. Nelson, M. Steuck, and D. J. Tyler, “The national elevation dataset,” Photogramm. Eng. Remote Sens., vol. 68, no. 1, pp. 5–11, Jan. 2002.

[3] D. Luebke, M. Harris, J. Krüger, T. Purcell, N. Govindaraju, I. Buck, C. Woolley, and A. Lefohn, “GPGPU: general purpose computation on graphics hardware,” in ACM SIGGRAPH 2004 Course Notes, Los Angeles, CA, 2004, art. 33.

[4] NVIDIA Corporation. (2016). “CUDA parallel computing platform.” [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html (Date Last Accessed, July, 1, 2016).

[5] NVIDIA Corporation. (2016). “GPU-accelerated libraries.” [Online]. Available: https://developer.nvidia.com/gpu-accelerated-libraries (Date Last Accessed, July, 1, 2016).

[6] N. Bell and J. Hoberock, “Thrust: a productivity-oriented library for CUDA,” in GPU Computing Gems Jade Edition. San Francisco, CA: Morgan Kaufmann, 2011, pp. 359–372.

[7] W. R. Franklin and M. Gousie, “Terrain elevation data structure operations,” in Proc. 19th Int. Cartographic Conf., Ottawa, Canada, 1999, pp. 1011–1020.

[8] W. R. Franklin, “Siting observers on terrain,” in Advances in Spatial Data Handling: 10th Int. Symp. Spatial Data Handling. Berlin, Germany: Springer- Verlag, 2002, pp. 109–120.

[9] M. de Berg, O. Cheong, M. van Kreveld, and M. Overmars, Computational Geometry: Algorithms and Applications, 3rd ed. Berlin, Germany: Springer-Verlag, 2010.

[10] P. A. Burrough and R. A. McDonnell, Principles of Geographical Information Systems, 2nd ed. New York, NY: Oxford UP, 1998.

[11] T. K. Peucker, R. J. Fowler, J. J. Little, and D. M. Mark, “The triangulated irregular network,” in Proc. Amer. Soc. Photogrammetry Digital Terrain Models Symp., St. Louis, MO, 1978, pp. 516–540.


[12] L. W. Zevenbergen and C. R. Thorne, “Quantitative analysis of land surface topography,” Earth Surf. Process. Landforms, vol. 12, no. 1, pp. 47–56, Jan./Feb. 1987.

[13] L. De Floriani and P. Magillo, “Algorithms for visibility computation on digital terrain models,” in Proc. 1993 ACM/SIGAPP Symp. Appl. Computing: States of the Art and Practice, Indianapolis, IN, 1993, pp. 380–387.

[14] C. K. Ray, “Representing visibility for siting problems,” Ph.D. dissertation, Dept. Elect. Comput. Syst. Eng., Rensselaer Polytechnic Inst., Troy, NY, 1994.

[15] B. Kaučič and B. Zalik, “Comparison of viewshed algorithms on regular spaced points,” in Proc. 18th Spring Conf. Comput. Graph., San Diego, CA, 2002, pp. 177–183.

[16] S. Tabik, E. L. Zapata, and L. F. Romero, “Simultaneous computation of total viewshed on large high resolution grids,” Int. J. Geogr. Inform. Sci., vol. 27, no. 4, pp. 804–814, Apr. 2013.

[17] NVIDIA Corporation. (2014). “Kepler compute architecture white paper.” [Online]. Available: http://international.download.nvidia.com/pdf/kepler/NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf (Date Last Accessed, July, 1, 2016).

[18] S. Dalton, N. Bell, L. Olson, and M. Garland. (2015). “Cusp: generic parallel algorithms for sparse matrix and graph computations.” [Online]. Available: http://cusplibrary.github.io/ (Date Last Accessed, July, 1, 2016).

[19] W. R. Franklin, Y. Li, T.-Y. Lau, and P. Fox, “CUDA-accelerated HD-ODETLAP: lossy high dimensional gridded data compression,” in Modern Accelerator Technologies for Geographic Information Science. Boston, MA: Springer, 2013, ch. 8, pp. 95–111.

[20] W. R. Franklin, M. Inanc, and Z. Xie, “Two novel surface representation techniques,” in Proc. AutoCarto 2006, Vancouver, WA, 2006. Available: http://www.cartogis.org/docs/proceedings/2006/franklin_inanc_xie.pdf (Date Last Accessed, July, 1, 2016).

[21] Z. Xie, W. R. Franklin, B. Cutler, M. A. Andrade, M. Inanc, and D. M. Tracy, “Surface compression using over-determined laplacian approximation,” in Proc. SPIE 6697, Advanced Signal Process. Algorithms, Architectures, and Implementations XVII, 2007, art. 66970F.

[22] Z. Xie, M. A. Andrade, W. R. Franklin, B. Cutler, M. Inanc, J. Muckell, and D. M. Tracy, “Progressive transmission of lossily compressed terrain,” in Proc. XXXIV Conferencia Latinoamericana de Informática, Santa Fe, Argentina, 2008, pp. 8–12.

[23] Z. Xie, “Representation, compression and progressive transmission of digital terrain data using over-determined laplacian partial differential equations,” M.S. thesis, Comput. Sci. Dept., Rensselaer Polytechnic Inst., Troy, NY, 2008.

[24] T.-Y. Lau, Y. Li, Z. Xie, and W. R. Franklin, “Sea floor bathymetry trackline surface fitting without visible artifacts using ODETLAP,” in Proc. 17th ACM SIGSPATIAL Int. Conf. Advances in Geographic Inform. Syst., Seattle, WA, 2009, pp. 508–511.

[25] T.-Y. Lau and W. R. Franklin, “Automated artifact-free seafloor surface reconstruction with two-step ODETLAP,” SIGSPATIAL Special, vol. 4, no. 3, pp. 8–13, Nov. 2012.

[26] T.-Y. Lau and W. R. Franklin, “Completing fragmentary river networks via induced terrain,” Cartogr. Geogr. Inform. Sci., vol. 38, no. 2, pp. 161–173, Apr. 2011.

[27] T.-Y. Lau and W. R. Franklin, “Better completion of fragmentary river networks with the induced terrain approach by using known non-river locations,” in Proc. 15th Int. Symp. Spatial Data Handling, Bonn, Germany, 2012.

[28] T.-Y. Lau, “Two-step ODETLAP and induced terrain framework for improved geographical data reconstruction,” Ph.D. dissertation, Comput. Sci. Dept., Rensselaer Polytechnic Inst., Troy, NY, 2012.

[29] T.-Y. Lau and W. R. Franklin, “River network completion without height samples using geometry-based induced terrain,” Cartogr. Geogr. Inform. Sci., vol. 40, no. 4, pp. 316–325, Apr. 2013.

[30] W. R. Franklin, D. M. Tracy, M. A. Andrade, J. Muckell, M. Inanc, Z. Xie, and B. M. Cutler, “Slope accuracy and path planning on compressed terrain,” in Headway in Spatial Data Handling: 13th Int. Symp. Spatial Data Handling. Berlin, Germany: Springer-Verlag, 2008, pp. 335–349.

[31] Z. Xie, W. R. Franklin, and D. M. Tracy, “Slope preserving lossy terrain compression,” SIGSPATIAL Special, vol. 2, no. 3, pp. 19–24, Nov. 2010.

[32] Y. Li and W. R. Franklin, “5D-ODETLAP: a novel high-dimensional compression method on time-varying geospatial data,” in Proc. AutoCarto 2010, Orlando, FL, 2010. Available: https://www.asprs.org/a/publications/proceedings/orlando2010/files/You%20Li.pdf (Date Last Accessed, July, 1, 2016).

[33] Y. Li, T.-Y. Lau, C. S. Stuetzle, P. Fox, and W. R. Franklin, “3D oceanographic data compression using 3D-ODETLAP,” SIGSPATIAL Special, vol. 2, no. 3, pp. 7–12, Nov. 2010.

[34] A. Said and W. A. Pearlman, “A new, fast, and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, June 1996.

[35] B.-J. Kim and W. Pearlman, “An embedded wavelet video coder using three-dimensional set partitioning in hierarchical trees (SPIHT),” in Proc. Data Compression Conf., Snowbird, UT, 1997, pp. 251–260.

[36] C. Stuetzle, W. R. Franklin, and B. Cutler, “Evaluating hydrology preservation of simplified terrain representations,” SIGSPATIAL Special, vol. 1, no. 1, pp. 51–56, Mar. 2009.

[37] D. M. Tracy, W. R. Franklin, B. Cutler, M. Andrade, F. T. Luk, M. Inanc, and Z. Xie, “Multiple observer siting and path planning on a compressed terrain,” in Proc. SPIE 6697, Advanced Signal Process. Algorithms, Architectures, and Implementations XVII, 2007, art. 66970G.

[38] D. M. Tracy, W. R. Franklin, B. Cutler, F. T. Luk, and M. Andrade, “Path planning on a compressed terrain,” in Proc. 16th ACM SIGSPATIAL Int. Conf. Advances in Geographic Inform. Syst., Irvine, CA, 2008, art. 56.

[39] D. M. Tracy, “Path planning and slope representation on compressed terrain,” Ph.D. dissertation, Comput. Sci. Dept., Rensselaer Polytechnic Inst., Troy, NY, 2009.

[40] Y. Li, “CUDA-accelerated HD-ODETLAP: a high dimensional geospatial data compression framework,” Ph.D. dissertation, Comput. Sci. Dept., Rensselaer Polytechnic Inst., Troy, NY, 2011.

[41] D. N. Benedetti, W. R. Franklin, and W. Li, “CUDA-accelerated ODETLAP: a parallel lossy compression implementation,” presented at the 23rd Fall Workshop on Computational Geometry, New York, NY, 2013.

[42] D. N. Benedetti, “CUDA-accelerated ODETLAP: a parallel lossy compression implementation for multidimensional data,” M.S. thesis, Dept. Elect. Comput. Syst. Eng., Rensselaer Polytechnic Inst., Troy, NY, 2014.

[43] J. Stookey, Z. Xie, B. Cutler, W. R. Franklin, D. Tracy, and M. V. A. Andrade, “Parallel ODETLAP for terrain compression and reconstruction,” in Proc. 16th ACM SIGSPATIAL Int. Conf. Advances in Geographic Inform. Syst., Irvine, CA, 2008, art. 17.

[44] J. Stookey, “Parallel terrain compression and reconstruction,” M.S. thesis, Dept. Elect. Comput. Syst. Eng., Rensselaer Polytechnic Inst., Troy, NY, 2008.

[45] L. Mitas and H. Mitasova, “Spatial interpolation,” in Geographical Information Systems: Principles, Techniques, Management and Applications. Hoboken, NJ: Wiley, 2005, ch. 34.

[46] W. Tobler, “A computer movie simulating urban growth in the Detroit region,” Econ. Geogr., vol. 46, no. 2, pp. 234–240, June 1970.

[47] D. Shepard, “A two-dimensional interpolation function for irregularly-spaced data,” in Proc. 23rd ACM Nat. Conf., Las Vegas, NV, 1968, pp. 517–524.

[48] R. J. Renka, “Multivariate interpolation of large sets of scattered data,” ACM Trans. Math. Softw., vol. 14, no. 2, pp. 139–148, June 1988.

[49] S. J. Farlow, Partial Differential Equations for Scientists and Engineers, reprint ed. Mineola, NY: Dover, 1993.

[50] Y. Saad, Iterative Methods for Sparse Linear Systems, 2nd ed. Philadelphia, PA: SIAM, 2003.

[51] W. R. Tobler, “Smooth pycnophylactic interpolation for geographical regions,” J. Amer. Stat. Assoc., vol. 74, no. 367, pp. 519–536, Sep. 1979.

[52] H. Mitasova, L. Mitas, W. M. Brown, D. P. Gerdes, I. Kosinovsky, and T. Baker, “Modelling spatially and temporally distributed phenomena: new methods and tools for GRASS GIS,” Int. J. Geogr. Inform. Sci., vol. 9, no. 4, pp. 433–446, June 1995.

[53] J. Duchon, “Splines minimizing rotation-invariant semi-norms in Sobolev spaces,” in Constructive Theory of Functions of Several Variables: Proc. Conf. Held at Oberwolfach. Berlin, Germany: Springer-Verlag, 1977, pp. 85–100.

[54] M. F. Hutchinson, “Interpolating mean rainfall using thin plate smoothing splines,” Int. J. Geogr. Inform. Sci., vol. 9, no. 4, pp. 385–403, June 1995.

[55] R. L. Hardy, “Theory and applications of the multiquadric-biharmonic method,” Comput. Math. Applicat., vol. 19, no. 8/9, pp. 163–208, 1990.

[56] J. C. Carr, R. K. Beatson, J. B. Cherrie, T. J. Mitchell, W. R. Fright, B. C. McCallum, and T. R. Evans, “Reconstruction and representation of 3D objects with radial basis functions,” in Proc. 28th Annu. Conf. Comput. Graph. and Interactive Techniques, Los Angeles, CA, 2001, pp. 67–76.

[57] J. Li and A. D. Heap, A Review of Spatial Interpolation Methods for Environ- mental Scientists. Canberra, Australia: Geoscience Australia, 2008.

[58] J. Li and A. D. Heap, “A review of comparative studies of spatial interpolation methods in environmental sciences: performance and impact factors,” Ecol. Informat., vol. 6, no. 3/4, pp. 228–241, July 2011.

[59] J. Li and A. D. Heap, “Spatial interpolation methods applied in the environmental sciences: a review,” Environ. Modell. Softw., vol./no. 53, pp. 173–189, Mar. 2014.

[60] D. Gosciewski, “Reduction of deformations of the digital terrain model by merging interpolation algorithms,” Comput. Geosci., vol./no. 64, pp. 61–71, Mar. 2014.

[61] M. Hugentobler, “Terrain modelling with triangle based free-form surfaces,” Ph.D. dissertation, Dept. Geogr., Univ. Zurich, Zürich, Switzerland, 2004.

[62] M. Hugentobler and B. Schneider, “Breaklines in Coons surfaces over triangles for the use in terrain modelling,” Comput. Geosci., vol. 31, no. 1, pp. 45–54, Feb. 2005.

[63] S. Nittel, J. C. Whittier, and Q. Liang, “Real-time spatial interpolation of continuous phenomena using mobile sensor data streams,” in Proc. 20th Int. Conf. Advances in Geographic Inform. Syst., Redondo Beach, CA, 2012, pp. 530–533.

[64] J. Pouderoux, J.-C. Gonzato, I. Tobor, and P. Guitton, “Adaptive hierarchical rbf interpolation for creating smooth digital elevation models,” in Proc. 12th Annu. ACM Int. Workshop on Geographic Inform. Syst., Washington DC, 2004, pp. 232–240.

[65] S. Wise, “Cross-validation as a means of investigating DEM interpolation error,” Comput. Geosci., vol. 37, no. 8, pp. 978–991, Aug. 2011.

[66] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, “Image inpainting,” in Proc. 27th Annu. Conf. Comput. Graph. and Interactive Techniques, New Orleans, LA, 2000, pp. 417–424.

[67] V. Caselles, J. M. Morel, and C. Sbert, “An axiomatic approach to image interpolation,” IEEE Trans. Image Process., vol. 7, no. 3, pp. 376–386, Mar. 1998.

[68] A. Almansa, F. Cao, Y. Gousseau, and B. Rougé, “Interpolation of digital elevation models using AMLE and related methods,” IEEE Trans. Geosci. Remote Sens., vol. 40, no. 2, pp. 314–325, Feb. 2002.

[69] J. Weickert, “Theoretical foundations of anisotropic diffusion in image processing,” in Proc. 7th Workshop on Theoretical Foundations of Comput. Vision, Dagstuhl, Germany, 1994, pp. 221–236.

[70] G. Facciolo, F. Lecumberry, A. Almansa, A. Pardo, V. Caselles, and B. Rougé, “Constrained anisotropic diffusion and some applications,” in Proc. British Machine Vision Conf., Edinburgh, UK, 2006, pp. 1049–1058.

[71] S. Masnou and J.-M. Morel, “Level lines based disocclusion,” in Proc. IEEE Int. Conf. Image Process., Chicago, IL, 1998, pp. 259–263.

[72] J. Essic. (2016). “Geospatial data formats.” [Online]. Available: https://www.lib.ncsu.edu/gis/formats.html (Date Last Accessed, July, 1, 2016).

[73] N. Coll, M. Guerrieri, M.-C. Rivara, and J. A. Sellarès, “Terrain approximation from grid data points,” in Actas XIII Encuentros de Geometría Computacional, Zaragoza, Spain, 2009, pp. 101–108.

[74] N. Coll, M. Guerrieri, M.-C. Rivara, and J. A. Sellarès, “Adaptive simplification of huge sets of terrain grid data for geosciences applications,” J. Comput. Appl. Math., vol. 236, no. 6, pp. 1410–1422, Oct. 2011.

[75] F. Löffler and H. Schumann, “QEM-filtering: a new technique for feature-sensitive terrain mesh simplification,” in Proc. 15th Int. Workshop on Vision, Modeling and Visualization, Siegen, Germany, 2010, pp. 1–8.

[76] Q. Zhou and Y. Chen, “Generalization of DEM for terrain analysis using a compound method,” ISPRS J. Photogramm. Remote Sens., vol. 66, no. 1, pp. 38–45, Jan. 2011.

[77] Y. Chen, J. P. Wilson, Q. Zhu, and Q. Zhou, “Comparison of drainage-constrained methods for DEM generalization,” Comput. Geosci., vol./no. 48, pp. 41–49, Nov. 2012.

[78] A. Solé, V. Caselles, G. Sapiro, and F. Arándiga, “Morse description and geometric encoding of digital elevation maps,” IEEE Trans. Image Process., vol. 13, no. 9, pp. 1245–1262, Sep. 2004.

[79] W. R. Franklin and M. Inanc, “Compressing terrain datasets using segmentation,” in Proc. SPIE 6313, Advanced Signal Process. Algorithms, Architectures, and Implementations XVI, 2006, art. 63130H.

[80] M. Inanc, “Compressing terrain elevation datasets,” Ph.D. dissertation, Comput. Sci. Dept., Rensselaer Polytechnic Inst., Troy, NY, 2008.

[81] H. Wei, S. Zabuawala, L. Zhang, J. Zhu, J. Yadegar, J. de la Cruz, and H. J. Gonzalez, “Adaptive pattern-driven compression of large-area high-resolution terrain data,” in Proc. IEEE Int. Symp. Multimedia, Dana Point, CA, 2011, pp. 339–344.

[82] D. M. Durdević and I. I. Tartalja, “HFPaC: GPU friendly height field parallel compression,” GeoInformatica, vol. 17, no. 1, pp. 207–233, Jan. 2013.

[83] Z. Fei and C. Yumin, “An MPI-CUDA implementation for the compression of DEM,” in Proc. Geomorphometry, Nanjing, China, 2013. Available: http://geomorphometry.org/system/files/FeiYumin2013.pdf (Date Last Accessed, July, 1, 2016).

[84] T. Szirányi, I. Kopilovic, and B. P. Tóth, “Anisotropic diffusion as a preprocessing step for efficient image compression,” in Proc. 14th Int. Conf. Pattern Recognition, Stockholm, Sweden, 1998, pp. 1565–1567.

[85] H. Dell, “Seed points in PDE-driven interpolation,” B.S. thesis, Dept. Comput. Sci., Saarland Univ., Saarbrücken, Germany, 2006.

[86] H. L. Zimmer, “PDE-based image compression using corner information,” M.S. thesis, Dept. Comput. Sci., Saarland Univ., Saarbrücken, Germany, 2007.

[87] I. Galić, J. Weickert, M. Welk, A. Bruhn, A. Belyaev, and H.-P. Seidel, “Towards PDE-based image compression,” in Proc. 3rd Int. Workshop Variational, Geometric, and Level Set Methods in Comput. Vision, Beijing, China, 2005, pp. 37–48.

[88] R. Distasi, M. Nappi, and S. Vitulano, “Image compression by B-tree triangular coding,” IEEE Trans. Commun., vol. 45, no. 9, pp. 1095–1100, Sep. 1997.

[89] I. Galić, J. Weickert, M. Welk, A. Bruhn, A. Belyaev, and H.-P. Seidel, “Image compression with anisotropic diffusion,” J. Math. Imaging and Vision, vol. 31, no. 2/3, pp. 255–269, July 2008.

[90] C. Schmaltz, J. Weickert, and A. Bruhn, “Beating the quality of JPEG 2000 with anisotropic diffusion,” in Proc. 31st DAGM Symp. Pattern Recognition, Jena, Germany, 2009, pp. 452–461.

[91] C. Schmaltz, P. Peter, M. Mainberger, F. Ebel, J. Weickert, and A. Bruhn, “Understanding, optimising, and extending data compression with anisotropic diffusion,” Int. J. Comput. Vision, vol. 108, no. 3, pp. 222–240, July 2014.

[92] P. Peter and J. Weickert, “Colour image compression with anisotropic diffusion,” in Proc. IEEE Int. Conf. Image Process., Paris, France, 2014, pp. 4822–4826.

[93] M. Mainberger, S. Hoffmann, J. Weickert, C. H. Tang, D. Johannsen, F. Neumann, and B. Doerr, “Optimising spatial and tonal data for homogeneous diffusion inpainting,” in Proc. 3rd Int. Conf. Scale Space and Variational Methods in Comput. Vision, Ein-Gedi, Israel, 2012, pp. 26–37.

[94] K. Skretting, J. H. Husøy, and S. O. Aase, “Improved Huffman coding using recursive splitting,” in Proc. Norwegian Signal Process. Symp., Trondheim, Norway, 1999. Available: http://www.ux.uis.no/~karlsk/proj99/norsig99.pdf (Date Last Accessed, July 1, 2016).

[95] D. S. Taubman and M. W. Marcellin, “JPEG2000: standard for interactive imaging,” Proc. IEEE, vol. 90, no. 8, pp. 1336–1357, Aug. 2002.

[96] Université catholique de Louvain. (2015). “OpenJPEG.” [Online]. Available: http://www.openjpeg.org/ (Date Last Accessed, July 1, 2016).

[97] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,” J. Cogn. Neurosci., vol. 19, no. 9, pp. 1498–1507, Sep. 2007.

[98] P. Schelkens, A. Munteanu, A. Tzannes, and C. Brislawn, “JPEG2000 Part 10 – volumetric data encoding,” in Proc. IEEE Int. Symp. Circuits Syst., Kos, Greece, 2006, pp. 3874–3877.

[99] W. Li, W. R. Franklin, S. V. G. Magalhães, M. V. A. Andrade, and D. L. Hedin, “3D segmented ODETLAP compression,” submitted for publication.

[100] H. Mitášová and L. Mitáš, “Interpolation by regularized spline with tension: I. Theory and implementation,” Math. Geol., vol. 25, no. 6, pp. 641–655, Aug. 1993.

[101] H. Mitášová and J. Hofierka, “Interpolation by regularized spline with tension: II. Application to terrain modeling and surface geometry analysis,” Math. Geol., vol. 25, no. 6, pp. 657–669, Aug. 1993.

[102] W. B. Pennebaker and J. L. Mitchell, JPEG: Still Image Data Compression Standard. Boston, MA: Springer, 1993.

[103] J. R. Shewchuk, “An introduction to the conjugate gradient method without the agonizing pain,” Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep., 1994.

[104] AIRS Science Team/Joao Teixeira. (2013). “Aqua AIRS Level 3 Monthly Standard Physical Retrieval (AIRS+AMSU), version 006,” NASA Goddard Earth Science Data and Information Services Center (GES DISC), Greenbelt, MD. [Online]. Available: doi:10.5067/AQUA/AIRS/DATA319 (Date Last Accessed, July 1, 2016).

[105] W. R. Franklin and C. Vogt, “Efficient observer siting on large terrain cells,” presented at the 3rd Int. Conf. Geographic Information Science, Adelphi, MD, 2004.

[106] W. Li, W. R. Franklin, S. V. G. Magalhães, and M. V. A. Andrade, “GPU- accelerated multiple observer siting,” submitted for publication.

[107] V. Akbarzadeh, C. Gagné, M. Parizeau, M. Argany, and M. A. Mostafavi, “Probabilistic sensing model for sensor placement optimization based on line-of-sight coverage,” IEEE Trans. Instrum. Meas., vol. 62, no. 2, pp. 293–303, Feb. 2013.

[108] A. Efrat, J. S. B. Mitchell, S. Sankararaman, and P. Myers, “Efficient algorithms for pursuing moving evaders in terrains,” in Proc. 20th Int. Conf. Advances in Geographic Inform. Syst., Redondo Beach, CA, 2012, pp. 33–42.

[109] W. R. Franklin and C. Vogt, “Tradeoffs when multiple observer siting on large terrain cells,” in Proc. 12th Int. Symp. Spatial Data Handling, Vienna, Austria, 2006, pp. 845–861.

[110] D. Strnad, “Parallel terrain visibility calculation on the graphics processing unit,” Concurr. Comput.: Pract. Exper., vol. 23, no. 18, pp. 2452–2462, Dec. 2011.

[111] Y. Zhao, A. Padmanabhan, and S. Wang, “A parallel computing approach to viewshed analysis of large terrain data using graphics processing units,” Int. J. Geogr. Inform. Sci., vol. 27, no. 2, pp. 363–384, Feb. 2013.

[112] W. R. Franklin and C. Ray, “Higher isn’t necessarily better: visibility algorithms and experiments,” in Advances in GIS Research: 6th Int. Symp. Spatial Data Handling. Berlin, Germany: Springer-Verlag, 1994, pp. 751–770.

[113] A. Osterman, “Implementation of the r.cuda.los module in the open source GRASS GIS by using parallel computation on the NVIDIA CUDA graphic cards,” Elektrotehniški Vestnik, vol. 79, no. 1/2, pp. 19–24, June 2012.

[114] A. Osterman, L. Benedičič, and P. Ritoša, “An IO-efficient parallel implementation of an R2 viewshed algorithm for large terrain maps on a CUDA GPU,” Int. J. Geogr. Inform. Sci., vol. 28, no. 11, pp. 2304–2327, Nov. 2014.

[115] T. Axell and M. Fridén, “Comparison between GPU and parallel CPU optimizations in viewshed analysis,” M.S. thesis, Dept. Comput. Sci. Eng., Chalmers Univ. Technology, Göteborg, Sweden, 2015.

[116] J. C. C. Bravo, T. Sarjakoski, and J. Westerholm, “Efficient implementation of a fast viewshed algorithm on SIMD architectures,” in Proc. 23rd Euromicro Int. Conf. Parallel, Distributed, and Network-Based Process., Turku, Finland, 2015, pp. 199–202.

[117] C. R. Ferreira, M. V. A. Andrade, S. V. G. Magalhães, W. R. Franklin, and G. C. Pena, “A parallel algorithm for viewshed computation on grid terrains,” J. Inform. Data Manage., vol. 5, no. 2, pp. 171–180, June 2014.

[118] C. R. Ferreira, M. V. A. Andrade, S. V. G. Magalhães, and W. R. Franklin, “An efficient external memory algorithm for terrain viewshed computation,” ACM Trans. Spatial Algorithms Syst., vol. 2, no. 2, art. 6, July 2016.

[119] M. van Kreveld, “Variations on sweep algorithms: efficient computation of extended viewsheds and class intervals,” in Proc. 7th Int. Symp. Spatial Data Handling, Delft, Netherlands, 1996, pp. 15–27.

[120] S. V. G. Magalhães, M. V. A. Andrade, and C. Ferreira, “Heuristics to site observers in a terrain represented by a digital elevation matrix,” in Proc. XI Brazilian Symp. Geoinformatics, Campos do Jordão, Brazil, 2010, pp. 110–121.

[121] G. C. Pena, S. V. G. Magalhães, M. V. A. Andrade, W. R. Franklin, C. R. Ferreira, and W. Li, “An efficient GPU multiple-observer siting method based on sparse-matrix multiplication,” in Proc. 3rd ACM SIGSPATIAL Int. Workshop on Analytics for Big Geospatial Data, Dallas, TX, 2014, pp. 54–63.

[122] S. Rana, “Fast approximation of visibility dominance using topographic features as targets and the associated uncertainty,” Photogramm. Eng. Remote Sens., vol. 69, no. 8, pp. 881–888, Aug. 2003.

[123] J. Wang, G. J. Robinson, and K. White, “Generating viewsheds without using sightlines,” Photogramm. Eng. Remote Sens., vol. 66, no. 1, pp. 87–90, Jan. 2000.

[124] D. Israelevitz, “A fast algorithm for approximate viewshed computation,” Photogramm. Eng. Remote Sens., vol. 69, no. 7, pp. 767–774, July 2003.

[125] P. Lindstrom and V. Pascucci, “Visualization of large terrains made easy,” in Proc. IEEE Conf. Visualization, San Diego, CA, 2001, pp. 363–371.