Cairn Detection in Southern Arabia Using a Supervised Automatic Detection

Algorithm and Multiple Sample Data Spectroscopic Clustering

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of

Philosophy in the Graduate School of The Ohio State University

By

Jared Michael Schuetter, M.S.

Graduate Program in Statistics

The Ohio State University

2010

Dissertation Committee:

Professor Tao Shi, Co-Advisor

Professor Prem Goel, Co-Advisor

Professor Joy McCorriston

Professor Yoon Lee

Professor Stuart Ludsin, GFR

Copyright by

Jared Michael Schuetter

2010

ABSTRACT

Excavating cairns in southern Arabia is a way for anthropologists to understand which factors led ancient settlers to transition from a pastoral lifestyle and tribal narrative to the formation of states that exist today. Locating these monuments has traditionally been done in the field, relying on eyewitness reports and costly searches through the arid landscape.

In this thesis, an algorithm for automatically detecting cairns in satellite imagery is presented. The algorithm uses a set of filters in a window-based approach to eliminate background pixels and other objects that do not look like cairns. The resulting set of detected objects constitutes fewer than 0.001% of the pixels in the satellite image, and contains the objects that look most like cairns in the imagery. When a training set of cairns is available, a further reduction of this set of objects can take place, along with a likelihood-based ranking system.

To aid in cairn detection, the satellite image is also clustered to determine landform classes that tend to be consistent with the presence of cairns. Due to the large number of pixels in the image, a subsample spectral clustering algorithm called “Multiple Sample Data Spectroscopic clustering” is used. This multiple sample clustering procedure is motivated by perturbation studies on single sample spectral algorithms.

The studies presented in this thesis show that sampling variability in the single sample approach can cause an unsatisfactory level of instability in clustering results.

The multiple sample data spectroscopic clustering algorithm is intended to counteract this instability by combining information from different samples. While sampling variability is still present, the use of multiple samples mitigates its effect on the cluster results.

Finally, a step-through of the cairn detection algorithm and the satellite image clustering is given for an image in the Hadramawt region of Yemen. The top-ranked detected objects are presented, and a discussion of parameter selection and future work follows.

Dedicated to Michelle, who has patiently waited for 5 years to see this finished dissertation, and to Claudia, who gave me the deadline for finishing it.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my co-advisors, Dr. Shi and Dr. Goel. Both of you have provided a great deal of input for the material presented in this thesis. In addition, I appreciate your continued optimism that a solution for any problem will turn up eventually. On multiple occasions, I have met with each of you to share the results of my most recent abject failure at clustering and/or cairn detection, only to have you suggest a number of other possible approaches to try. I’m starting to learn that you can’t make any progress without failing a few times along the way.

I would also like to thank the NSF-HSD team. I am grateful to Dr. McCorriston for her unwavering excitement about the project, her leadership, and her confidence that I will eventually come up with something that will work in the cairn detection algorithm. Thank you to Matt for always making sure you understand the latest detection techniques, providing your insights as an anthropologist, and taking all of those horrible meeting notes. Finally, I’m especially thankful to Jihye, who has on numerous occasions provided me with last-minute satellite imagery for some crazy detection technique I’ve come up with, spent a great deal of her time helping me create (and recreate) the cairn training set, and processed imagery so that I can use Matlab for the detection algorithm.

I am also grateful to Dr. Lee for agreeing to serve on both my candidacy and dissertation committees, and for being one of the best professors I’ve had at OSU. In addition, I want to thank Drs. Notz, Santner, and Dean for allowing me to moonlight as a computer experimenter. I also appreciate the friendships I’ve made with other graduate students in the department, including Danel, Candace, Jenny, Josh, Arun, Soma, and especially Mallik, who will really get a kick out of the ridiculous title of my dissertation. Finally, I would like to thank Dr. Stasny, whose confidence in my abilities is the reason I even came to OSU. I wasn’t sure if I had the chops to get a Ph.D., but she sure was, and I appreciate it.

Last but not least, I would like to thank my family for their support over the last 5 years. I am especially grateful to Michelle, who has been a source of inspiration (and motivation), and still makes me feel like the luckiest guy in the world.

VITA

May 1998...... duPont Manual High School

May 2002...... B.S. Math & Education, Denison University

August 2005 ...... M.S. Applied Statistics, Bowling Green State University

FIELDS OF STUDY

Major Field: Statistics

TABLE OF CONTENTS

Page

Abstract ...... ii

Dedication ...... iv

Acknowledgments ...... v

Vita ...... vii

List of Tables ...... xi

List of Figures ...... xii

Chapters:

1. INTRODUCTION ...... 1

1.1 The NSF-HSD Project ...... 1

1.2 Multiple Sample Data Spectroscopic Clustering ...... 4

1.3 Combining Clustering and Cairn Detection ...... 5

2. PROJECT OVERVIEW ...... 7

2.1 Southern Arabia in the Holocene ...... 7

2.2 Goals of the Project ...... 8

2.3 Cairn Detection ...... 10

3. TECHNIQUES FOR DETECTING OBJECTS IN IMAGERY ...... 17

3.1 Introduction ...... 17

3.2 Basic Image Processing Techniques ...... 19

3.2.1 Point Operators ...... 19

3.2.2 Template Operators ...... 25

3.2.3 Group/Window Operators ...... 29

3.3 Edge Detection ...... 35

3.3.1 Edge Vector Formulation ...... 36

3.3.2 Improvements on Edge Detection ...... 38

3.3.3 Other Edge Detection Techniques ...... 47

3.4 Shape Matching and Extraction ...... 49

3.4.1 Basic Techniques ...... 49

3.4.2 Template Matching ...... 50

3.4.3 Methods ...... 52

3.4.4 Deformable Templates ...... 60

3.5 Post-Detection Processing ...... 63

4. CAIRN DETECTION ...... 64

4.1 Introduction ...... 64

4.2 ...... 70

4.3 Vegetation Removal ...... 73

4.4 Size Metrics ...... 77

4.5 Measuring Circularity ...... 81

4.5.1 Hough Transform Circle Fitting ...... 81

4.5.2 Boundary Extraction ...... 85

4.5.3 Circularity Calculation ...... 87

4.6 Reduction to Cairn Region ...... 91

4.7 Assigning Cairn Likelihoods ...... 95

4.8 Algorithm Summary ...... 97

4.9 Results for Polygon 9 ...... 101

4.10 Discussion ...... 112

5. TECHNIQUES FOR CLUSTERING DATA ...... 118

5.1 Introduction ...... 118

5.1.1 Data Clustering ...... 119

5.1.2 Applications of Clustering ...... 121

5.2 Clustering Algorithms ...... 123

5.2.1 Clustering by Central Tendency ...... 124

5.2.2 Model Based Clustering ...... 130

5.2.3 Spectral Clustering Algorithms ...... 133

5.2.4 Measuring the Quality of Results ...... 144

5.3 Spectral Clustering for Large Datasets ...... 146

5.3.1 Sparse Matrix Representations ...... 147

5.3.2 Single Subsample Approximation ...... 150

6. MULTIPLE SAMPLE DATA SPECTROSCOPIC CLUSTERING .... 158

6.1 Algorithm Overview ...... 158

6.2 Sparse Extension for Faster Computation ...... 164

6.3 Performance on Real and Simulated Datasets ...... 167

6.3.1 Comparison to Single Subsample Approach ...... 168

6.3.2 Image Segmentation Applications ...... 169

6.3.3 Sparse Extension vs. Full Extension Comparison ...... 170

6.3.4 Parameter Selection ...... 171

6.4 Conclusions ...... 173

7. REDUCTION OF FALSE DETECTIONS BY CLUSTERING ...... 180

7.1 Introduction ...... 180

7.2 Satellite Image Clustering ...... 182

7.2.1 Size Reduction for Computation ...... 183

7.2.2 Equalized DEM Measure ...... 185

7.2.3 Algorithm Summary ...... 188

7.3 Cluster Results for Polygon 9 ...... 190

7.4 Discussion ...... 196

8. SUMMARY AND FUTURE WORK ...... 198

8.1 Algorithm Summaries ...... 198

8.1.1 The Cairn Detection Algorithm ...... 198

8.1.2 The Multiple Sample DaSpec Algorithm ...... 200

8.1.3 Clustering in Cairn Detection ...... 202

8.2 Future Work ...... 203

8.2.1 Cairn Detection ...... 203

8.2.2 Multiple Sample DaSpec Clustering ...... 206

LIST OF TABLES

Table Page

3.1 Pascal’s Triangles for Addition and Subtraction ...... 41

4.1 Parameters for Cairn Detection Algorithm ...... 99

4.2 Cairn Detection Algorithm ...... 100

4.3 Initialization Parameters for Polygon 9 Detection ...... 104

5.1 Data Spectroscopic Clustering Algorithm ...... 141

5.2 Comparison of Spectral Algorithms ...... 142

6.1 Multi-Sample DaSpec Algorithm ...... 164

7.1 Satellite Image Clustering Algorithm ...... 190

LIST OF FIGURES

Figure Page

2.1 A map of modern day southern Arabia. The focus region for the project is indicated by the blue oval...... 9

2.2 (a) A small-sized cairn roughly 3 meters in diameter. (b) A larger cairn roughly 5-6 meters in diameter...... 11

2.3 (a) A QuickBird image of a wadi in Yemen. The field team physically walked through the areas indicated in green to do a monument survey. As they conducted the survey, they also looked into the distance for other monuments. The extent of this visual inspection is indicated in yellow. (b) Examples of some cairns from the training set as they appear in the panchromatic layer of the imagery...... 15

2.4 (a) A cairn as it appears in the field. (b) The same cairn in 0.6 meter QuickBird satellite imagery. (False color is created with pan-merged RGB pixels) ...... 16

3.1 (a) An image of the ostrich. (b) Image thresholded at the median value of c = 0.5843. (c) Image thresholded at the 75th percentile value of c = 0.6804. (d) Brightness Eq. (3.1) applied to the image, where a = 1.5 and b = 0. Resulting intensities larger than 1 are truncated to 1. (e) Gamma corrected image of the ostrich, γ = 0.5. (f) Gamma corrected image of the ostrich, γ = 3...... 21

3.2 (a) Original image of the ostrich. (b) Normalized image of the ostrich. (c) Equalized image of the ostrich using M = 64 bins. (d) Histogram of intensities for the original image. (e) Histogram of intensities for the normalized image. (f) Histogram of intensities for the equalized image. 24

3.3 (a) Original image to be filtered with a 3 × 3 window operator. (b) Template centered at the top left pixel (a pixel). (c) Repetition of image to fill the border template...... 27

3.4 (a) The template for the 7 × 7 averaging operator. (b) The template for the 7 × 7 gaussian operator (σ = 4/3). (c) Original image of the ostrich. (d) Convolution of the ostrich image with the 7 × 7 averaging operator. (e) Convolution with the 7 × 7 gaussian operator. Border pixels were coded as zero (black)...... 30

3.5 (a) Original image of the ostrich. (b) Gaussian smoothed ostrich (3×3 window, σ = 2/3). (c) Median filtered ostrich (3 × 3 window). (d) Ostrich image with 5% of pixels corrupted with noise (intensity set to 0). (e) Gaussian smoothed noisy ostrich (3 × 3 window, σ = 2/3). (f) Median filtered noisy ostrich (3 × 3 window)...... 32

3.6 Vector representation of an edge...... 38

3.7 (a) An image of an urn. (b) Horizontal edge component Mh(x, y) as defined in Eq. (3.4). (c) Vertical edge component Mv(x, y) as defined in Eq. (3.5). (d) Magnitude M of the edge detector (Eq. (3.2)). (e) Edges thresholded at the 0.9 quantile magnitude (M > 16.0312). .. 39

3.8 Non-maximal suppression for Canny’s edge detector. Edge magnitudes are linearly interpolated at locations A and B, then compared to the central pixel magnitude M(x, y). If M(x, y) is the largest, pixel (x, y) is considered an edge pixel...... 44

3.9 (a) Urn image. (b) Canny edge detection results for L = 0.03, U = 0.08, and σ = 1. (c) Canny edge detection results for L = 0, U = 0.60, and σ = 1. (d) Canny edge detection results for L = 0.10, U = 0.20, and σ = 0.50...... 46

3.10 (a) Convolution of urn image with 13 × 13 LoG template (σ = 2). (b) Zero crossing pixels. (c) A closer look at the center of image. .... 47

3.11 (a) Image of an outdoor cafe. (b) The template T. (c) Accumulator function for the image, calculated at valid pixels. Blue indicates low values, and red high values. (d) The template superimposed on the minimizing pixel of the accumulator function...... 51

3.12 (a) Image of the outdoor cafe, with salt-and-pepper noise. 20% of pixels had intensities flipped. (b) The template T. (c) Accumulator function for the noisy image, calculated at valid pixels. (d) The template superimposed on the minimizing pixel of the accumulator function. (e) Canny edges for the cafe image. (f) A version of the template with an outline of the object. (g) Accumulator function for the edge image and template, with a close-up view at the minimizing pixel. (h) The edge template superimposed on the minimizing pixel of the accumulator function...... 53

3.13 (a) Polar representation of a line where θ < π/2. (b) Polar representation of a line where θ > π/2...... 56

3.14 (a) Image of a hawk. (b) Canny edges for the hawk image. (c) Accumulator function, with ρ = 1, ..., ⌊√(p² + q²)⌋ on the vertical and θ = 0°, ..., 180° on the horizontal. (d) Line corresponding to ρ = 76 pixels, θ = 7°, the maximizing parameter pair in the accumulator function (seen in (c)). (e) Lines corresponding to the four highest peaks in the accumulator function...... 57

3.15 (a) Image of a cactus flower. (b) Canny edges for the cactus image. (c) Accumulator function over the (x0, y0) parameter space for r = 100 pixels. (d) Circle corresponding to (x0, y0) = (171, 179), the maximizing parameter pair in the accumulator function. (e) Image of a starfish. (f) Canny edges for the starfish image. (g) Accumulator function over the (x0, y0) parameter space for r = 10 pixels. (h) Top 10 circles corresponding to the largest parameter pairs in the accumulator function...... 58

3.16 (a) Grid over the unit square. (b) Deformation from Eq. (3.11) applied for M = 3, N = 3, with ξ^x_11 = ξ^x_13 = 1, ξ^y_12 = ξ^y_21 = ξ^y_23 = ξ^y_32 = 1, and all other ξ^x_mn and ξ^y_mn set to 0. Here, α = 1. (c) The same deformation coefficients, but now α = 0.5. (d) A template of a bird on the unit square (centered at (0, 0) for easier rotation). (e) Deformation from (b) applied to the bird template. (f) Bird template rotated by 30°, deformed using the displacement function from (c), and then scaled by 1/2...... 61

4.1 (a) Statistics on 288 cairns discovered in Yemen in 2007. (b) Histogram of cairn preservation levels. (c) Histogram of cairn diameters. .... 66

4.2 A random selection of 49 cairns from the training set. Each cairn window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only...... 67

4.3 Examples of 25 × 25 vegetation windows passing the first filter from Yemen and Oman. To display properly, intensities were scaled to [0,1] within each cairn window before plotting...... 74

4.4 A window I containing vegetation. NDVI values INDVI (i, j) are overlaid on the image. The grid shows the size of the 2.4 meter resolution pixels that IR and Red values are measured in. The red box in the center is the 5 × 5 window over which NDVI values are averaged. In this case, the average is 0.101, which would likely fall above the upper threshold UV. This bush would not pass the vegetation filter...... 75

4.5 A window I containing a cairn. NDVI values INDVI (i, j) are overlaid on the image. The grid shows the size of the 2.4 meter resolution pixels that IR and Red values are measured in. The red box in the center is the 5 × 5 window over which NDVI values are averaged. In this case, the average is 0.055, which would likely fall between LV and UV . Therefore, the cairn would pass the vegetation filter...... 76

4.6 (a) Window I containing a small cairn. (b) Window I containing a large cairn. (c) Successive boxplots of inner window intensities for cairn (a) for w = 3,..., 13. The red dotted line is the median for the entire window I, and the circle on the boxplot shows the quantile value qp(W ) for p = 0.95. The size of this cairn is JS = 6. (d) The same plots for cairn (b), which has size JS = 12...... 80

4.7 (a) Window I containing a cairn. (b) The median filtered intensity window I0. (c) The best fitting Hough circle (blue) and its center pixel (yellow) overlaid on I. The radius of the circle is 5 pixels...... 82

4.8 (a) Window I from Figure 4.7. (b) Canny edges for the raw intensity I. (c) Intensity edges overlaid on the cairn. (d) The binary map B, with p = 0.90. (e) Canny edges for the binary map B. (f) Binary map edges overlaid on the cairn. Note that the edge pixels match the cairn boundary much better, the curve is more complete, and there are not as many extraneous edge pixels...... 87

4.9 (a)-(c) Binary map edges for three different cairns. (d)-(f) Extracted boundaries. The center pixel (13, 13) is colored magenta, and the starting locations for the extraction are in red. Cairn (c) shows an unfortunate side-effect of this procedure, where the chain of boundary pixels strays from the object...... 88

4.10 (a)-(c) Cairn circularity scores indicated for the extracted boundaries (in blue). The pixel containing the center of the object is in yellow. (d)-(f) Examples of circularity scores when objects are not as circular. These are cairns for which background intensities interfered with the boundary extraction...... 92

4.11 Pairwise correlations for the six features across all 106 cairns in the training set...... 93

4.12 (a) A set of points in ℝ². (b) The Delaunay tessellation over those points...... 95

4.13 A visualization of the cairn detection procedure...... 98

4.14 The cairn training set for Polygon 9. Each subimage shows the 25 × 25 panchromatic intensity window. To illustrate the contrast between the cairn and the background, each window is normalized to a 0 (dark) to 1 (light) scale for display purposes only...... 102

4.15 Empirical cumulative distribution functions for the 60 cairns in the Polygon 9 training set for (a) JB, (b) JV , (c) JS, (d) JHR, (e) JHS, and (f) JC . Note that the size metric JS ranges from 6 to 12, which corresponds to 3.6 meters to 7.2 meters in diameter. This is roughly the range of cairn diameters observed in Figure 4.1(c)...... 103

4.16 Histograms of the six features across the Polygon 9 training set. ... 107

4.17 A random selection of 20 objects that passed the JB blob detector in Polygon 9 with LB = 10 and UB = 23. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only...... 108

4.18 A random selection of 20 objects that passed the JV vegetation filter in Polygon 9 with LV = 0.045 and UV = 0.065. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only...... 108

4.19 A random selection of 20 objects that passed the JS size filter in Polygon 9 with LS = 8 and US = 12. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only...... 109

4.20 A random selection of 20 objects that passed the circularity filters in Polygon 9. These include the Hough ratio JHR, with LHR = 1.5,UHR = 5, then the Hough score JHS, with LHS = 5,UHS = 32, and finally the circularity measure JC with thresholds LC = 0.7 and UC = 1. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only. . 109

4.21 A map of the 1634 detected objects in Polygon 9...... 111

4.22 The approximate marginal distributions f̂1(x), ..., f̂6(x) for the 60 training cairns in Polygon 9...... 115

4.23 The top 50 likelihood ranked objects (left to right by row) from Polygon 9. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only. 116

4.24 Boxplots of the six features for training cairns located in two regions of Oman and two from Yemen. There are 5 training cairns in Oman 3, 9 in Oman 16, 60 in Yemen 9, and 37 in Yemen 17...... 117

5.1 Row 1: Histogram of the original dataset (n = 1000) drawn i.i.d. from a mixture Gaussian distribution defined in (5.4). Row 2: Top five eigenvectors of Kn. Rows 3 through 6: Top five eigenvectors of Km^* for each of 4 samples (m = 300). To help the comparison to the first row, the vectors are plotted using the ordering in the original data space. Row 7: Top five eigenvectors from each of four subsamples (size m = 300) overlaid...... 153

5.2 Row 1: Histogram of the original dataset (n = 1000) drawn i.i.d. from a mixture Gaussian distribution defined in (5.4). Rows 2, 4, 6: Top five eigenvectors of Kn, Dn^(-1)Wn, and Dn^(-1/2)WnDn^(-1/2), respectively. Rows 3, 5, 7: Top five eigenvectors from each of four subsamples (size m = 300) overlaid...... 156

6.1 Overview of the Multiple Sample DaSpec Clustering Algorithm. ... 175

6.2 (a)-(c) Multi-sample data spectroscopic clustering results for three consecutive runs using T = 5 samples of size m = 50. The dataset has n = 45,000 points, with 5000 sampled from each of nine bivariate gaussian distributions with centers in {1, 3, 5} × {1, 3, 5} and covariance matrices σ²I, where σ² = 0.1. Bandwidths used were ω = 0.4 and τ = 0.5. (d)-(f) Single-sample data spectroscopic clustering results for three consecutive runs using T = 1 sample of size m = 250. The bandwidth ω was set to 0.15...... 176

6.3 Column 1: Original Image. Column 2: Multi-sample data spectroscopic clustering results using the indicated number of subsamples and a sample size of m = 100. Parameters, runtimes, and groups detected for the procedures are given in the table...... 177

6.4 Column 1: A simple image segmented with T = 1, T = 5, and T = 10. Column 2: A more complex image segmented with T = 1, T = 5, and T = 10...... 178

6.5 First Row: Original Image. Second Row: Multiple subsample data spectroscopic clustering results using sparse LASSO extension. Third Row: Multiple subsample data spectroscopic clustering results using the full extension...... 179

7.1 (a) Polygon 9 scaled by a factor of w = 8. (b) Polygon 9 scaled by a factor of w = 25. (c) Polygon 9 scaled by a factor of w = 50. (d)-(f) Close-up views of the top left corner of the image for each scale factor. 184

7.2 A plot of the Polygon 9 digital elevation model (DEM)...... 186

7.3 (a) The local elevation comparison IE^L for Polygon 9. (b) The global elevation comparison IE^G for the same image. (c) The equalized DEM IE^*. (d) The original DEM IE, for comparison...... 189

7.4 Cluster results from Polygon 9. The image was scaled by a factor of 25 and clustered using RGB, IR, and Equalized DEM bands. Weights were αr = αc = α1 = α2 = α3 = α4 = 1 and α5 = 8. Parameters for the multi-sample DaSpec algorithm were T = 20, m = 100, ω = 0.25, and τ = 0.20. A total of Ĝ = 6 groups were uncovered, and are shown in the plot. Runtime was 7.111 seconds...... 192

7.5 All 76 training cairns from Polygon 9 (including the poorly preserved ones) overlaid on the cluster results...... 193

7.6 (a) A bar graph showing which clusters the training cairns fall into. (b) Cairn clusters 1 (red) and 4 (blue)...... 194

7.7 (a) The top 50 detected objects in both clusters 1 and 4. (b) The top 50 detected objects in cluster 4 only. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only...... 195

8.1 A high circular tomb (HCT) with a tail (a) in the satellite imagery, and (b) on the ground...... 205

CHAPTER 1

INTRODUCTION

In this thesis, I present the work I have done in two related areas over the past three years. The first is an interdisciplinary research project I have worked on with Dr. Prem Goel, and the second is the development of a clustering algorithm with Dr. Tao Shi. I also present some background material on techniques in image processing, object detection, and data clustering. A brief overview of the thesis is given in the sections below.

1.1 The NSF-HSD Project

In the summer of 2008, I joined the team on a project funded by the Human and Social Dynamics (HSD) initiative in the National Science Foundation (NSF). The project has members from three different departments on campus: Statistics, Geodetic Science, and Anthropology. The NSF-HSD project has grown out of previous work done by Dr. Joy McCorriston in the Anthropology department. This previous work was called the RASA project, which stands for Roots of Agriculture in Southern Arabia.

Through RASA, and then into NSF-HSD, the goal has been to better understand how inhabitants of southern Arabia transitioned over time from pastoralism to a sedentary lifestyle. It is clear that many factors influenced this transition, but one of the dominant factors appears to be climate change. The Holocene era, which lasted from 10,000 to 3500 BP (before present), saw a gradual decrease in precipitation that led to dwindling resources. This may have led to the settlement of tribal people in areas where resources were present, and pressured them to make territorial claims of ownership.

Dr. McCorriston and her team, including doctoral student Matthew Senn and Dr. Michael Harrower, have made several visits to the Hadramawt region of Yemen and the Dhofar region of Oman to excavate monuments built by these ancient people. A large number of the monuments are classified as high circular tombs (HCTs), also called cairns, which are prominent cylindrical structures about 3 to 8 meters in diameter and 1 to 2 meters tall.

Burial monuments were important to ancient people, as they are to us today. The items interred with the bodies reveal what was important to the people that buried them. Furthermore, bone fragments from the cairns often provide reliable carbon dating results. By analyzing the objects found in these tombs, Dr. McCorriston and her team hope to better understand what led the ancient people to settle in the region. Ultimately, the transition to a sedentary lifestyle and the rise of agriculture led to a move away from tribal governing structures and into the hybrid tribal and state structures seen today in southern Arabia. Therefore, understanding the paradigm shift in settlement patterns could help inform studies on the cultural and political history of the entire region.

Prior to the involvement of the Statistics and Geodetic Science departments, the anthropologists could find cairns and other monuments in a limited number of ways. Satellite imagery is available in southern Arabia, and can theoretically be searched through by eye to locate monuments. The problem with this approach is that even the best available imagery is on the order of 0.6 meters per pixel. When cairns are 3 to 8 meters in diameter, they only cover 5 to 13 square pixel regions in the image. Considering that these images cover hundreds of square kilometers, it is a test of patience to comb through a large image for such small objects.
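As a quick sanity check on these numbers (a back-of-the-envelope sketch, not part of the detection algorithm), the pixel footprint of a cairn follows directly from the ground resolution:

```python
def pixels_across(diameter_m, resolution_m=0.6):
    """Approximate number of pixels an object of the given diameter spans
    at a given ground resolution (QuickBird panchromatic: ~0.6 m/pixel)."""
    return round(diameter_m / resolution_m)

# cairns of 3 to 8 meters span only about 5 to 13 pixels across
spans = [pixels_across(d) for d in (3, 8)]
```

Even a large cairn therefore occupies a window barely a dozen pixels on a side, which is why a visual search over hundreds of square kilometers is impractical.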

Another way to find monuments is to talk to local inhabitants while in the field. These people often know about the existence of structures in the local region, and can point the way there. There are a couple of problems with this approach. First, the monuments people will know about are going to be heavily concentrated along transportation routes in the area. There could be a large number of monuments that no one knows about because they are in unpopulated areas. Second, local inhabitants are usually not anthropologists, and often lead the team to “ancient” monuments that are really only a few decades old.

To explore a way of automatically detecting cairns, Dr. McCorriston enlisted the help of the Geodetic Science and Statistics departments: Dr. Dorota Brzezinska and doctoral student Jihye Park from the former, and Dr. Goel and graduate student Jacob Reidhead from the latter. I took over for Jacob in the summer of 2008 after a year of work had transpired. The desired output of the collaboration between the three departments was an algorithm that would read in a satellite image and return a list of GPS coordinates corresponding to likely cairns in the region. Dr. Goel and I have worked over the past two years to make the algorithm a reality, and the work presented here gives a procedure that meets that description. Of course, there is always room for improvement.

An overview of the history of southern Arabia and of the NSF-HSD project is given in Chapter 2. In order to lay the groundwork for discussion of the algorithm, I use Chapter 3 to discuss background material on image processing and object detection. Finally, Chapter 4 contains a detailed description of the algorithm itself, as well as a discussion of the motivations behind its development.

1.2 Multiple Sample Data Spectroscopic Clustering

In 2007, I began work with Dr. Tao Shi on a newly developed spectral clustering algorithm called Data Spectroscopic clustering (DaSpec). The goal at that time was to compare its performance with other competing spectral algorithms by applying it to simulated datasets and natural images. In doing so, I noticed that clustering imagery was difficult, even for small to moderately sized images. This led me to research existing ways to deal with computational limitations in spectral clustering.

The main problem with spectral clustering is the need for storage and eigendecomposition of an n × n affinity matrix, which can be difficult when n is large. One way to reduce computation in spectral algorithms is to use a single sample drawn from the dataset to approximate the eigenvectors used for the clustering. These eigenvectors can then be extended to the entire dataset and inform cluster label assignment.
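A minimal sketch of this single sample idea, using a generic Nyström-style extension on a toy one-dimensional dataset (this is not the specific algorithm developed in this thesis; the kernel bandwidth `omega` and the cluster layout are illustrative assumptions):

```python
import numpy as np

def gaussian_kernel(A, B, omega=0.5):
    """Gaussian affinities between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * omega ** 2))

rng = np.random.default_rng(0)
n, m = 1000, 100
# toy dataset: two well-separated 1-D Gaussian clusters
X = np.concatenate([rng.normal(0.0, 0.3, n // 2),
                    rng.normal(3.0, 0.3, n // 2)])[:, None]

# eigen-decompose only the m x m affinity matrix of a single subsample
idx = rng.choice(n, size=m, replace=False)
S = X[idx]
vals, vecs = np.linalg.eigh(gaussian_kernel(S, S))   # ascending eigenvalues
top_vals, top_vecs = vals[-2:], vecs[:, -2:]         # top two eigenpairs

# extend the sample eigenvectors to all n points (Nystrom-style)
ext = gaussian_kernel(X, S) @ top_vecs / top_vals
# for well-separated clusters each top eigenvector concentrates on one
# cluster, so assigning by largest magnitude recovers the grouping
labels = np.argmax(np.abs(ext), axis=1)
```

Only a 100 × 100 matrix is decomposed here instead of the full 1000 × 1000 affinity matrix; the instability discussed next comes from the fact that a different draw of `idx` can give noticeably different extended eigenvectors.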

After trying to apply the single sample approach to DaSpec, we noticed that results tended to be unstable due to sampling variability. This led us to conduct a perturbation analysis of the spectrum of the affinity matrix for different clustering algorithms. We found that, indeed, sampling variability was often introducing an unacceptable level of instability to the algorithm.

Following this analysis, we began development of an algorithm that would use eigenvectors from multiple samples drawn from the dataset. The information in these eigenvectors would be combined and extended to the full dataset, thus stabilizing the results by reducing the effect of sampling variability. The hardest part of the process was deciding how to combine those eigenvectors, since they are evaluated at different locations in the data space.
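The actual combination rule is the subject of Chapter 6. As a generic illustration of why pooling several samples reduces the variance of the final labels, consider a toy majority-vote scheme on one-dimensional data (this is emphatically not the DaSpec combination procedure; the subsample clustering here is a crude 2-means, chosen only for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 600
# toy dataset: two well-separated 1-D clusters
X = np.concatenate([rng.normal(0.0, 0.3, n // 2),
                    rng.normal(3.0, 0.3, n // 2)])

def sample_labeling(X, m, rng):
    """Cluster one subsample with a crude 1-D 2-means, then extend the
    labels to every point by distance to the nearest center."""
    S = rng.choice(X, size=m, replace=False)
    centers = np.array([S.min(), S.max()])              # spread-out init
    for _ in range(20):                                 # Lloyd iterations
        assign = np.abs(S[:, None] - centers).argmin(axis=1)
        centers = np.array([S[assign == k].mean() for k in (0, 1)])
    labels = np.abs(X[:, None] - centers).argmin(axis=1)
    return labels if centers[0] < centers[1] else 1 - labels

# any single subsample labeling is noisy; averaging several stabilizes it
votes = np.mean([sample_labeling(X, 40, rng) for _ in range(9)], axis=0)
consensus = (votes > 0.5).astype(int)
```

The harder problem solved in Chapter 6 is that spectral eigenvectors, unlike these hard labels, live on different sample locations and cannot simply be averaged point by point.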

Chapter 5 contains background material on different clustering techniques, including spectral algorithms. In the same chapter, I also present some of the challenges of using spectral clustering algorithms for large datasets, and the remedies that are available. In the last part of the chapter, I give the results of the perturbation analysis on the single sample approximation method. The multiple sample data spectroscopic clustering algorithm is described in Chapter 6. I have also provided a section showing the algorithm’s performance on real and simulated datasets, including images.

1.3 Combining Clustering and Cairn Detection

In the NSF-HSD project, we have recently been exploring ways to reduce the number of false detections made by the algorithm. One of these approaches is to cluster the satellite image and determine which regions tend to contain cairns, and which do not. This information can be useful in deciding which of the detected objects are most likely to be cairns by virtue of the landform type on which they are located.

From my work on the multiple sample DaSpec algorithm, I knew it could do a good job of segmenting imagery given the right parameters. Therefore, I tried to apply it to satellite imagery from Yemen and Oman to assist the cairn detection project. It took some work to adapt the method to these images, primarily because they are so large that all of the pixels in the image cannot be stored in memory simultaneously. I was able to cluster the full image by breaking it into pieces and processing each piece separately, but the process was time consuming and made parameter optimization difficult.

Ultimately, I decided to lower the resolution of the image so that clustering could be done more quickly and easily. This resolution reduction does not seem to hurt performance, because the goal is to identify large scale landform classes, not smaller scale local features. A description of the satellite image clustering procedure is given in Chapter 7, along with a case study on one of the images from Yemen called Polygon 9. Finally, Chapter 8 summarizes the results presented in the thesis and identifies areas for future work and algorithm development.

CHAPTER 2

PROJECT OVERVIEW

2.1 Southern Arabia in the Holocene

The Holocene era (roughly 10,000 to 3500 years BP) marked a paradigm shift in the pattern of human settlement in southern Arabia. The region itself (modern day Yemen and Oman) is bordered to the north by Saudi Arabia, and on the other three sides by the Arabian and Red Seas, the Persian Gulf, and the Gulfs of Oman and Aden (see Figure 2.1). The landscape is riddled with drainage channels called wadis that are incised into the mountainous region bordering the desert and slope southward toward the sea. The moist early Holocene, which lasted from 10,000 to 6000 years ago, was characterized by a relatively wet climate and dense vegetation.

Human settlement seems to have reached a peak around 6500 BP, when rainfall began to diminish, and the climate may have adopted seasonal patterning (McCorriston et al., 2002).

The following middle and late Holocene, lasting 6000 to 3500 BP, was marked by a decrease in precipitation and human activity. This period was punctuated by droughts in 5200 and 4200 BP, the second of which lasted roughly 300 years (Staubwasser and Weiss, 2006). This climate change transformed southern Arabia into an arid region dominated by highland plateaus and empty wadi channels. By the late Holocene, habitable areas were few.

Brunner (1997) describes late Holocene southern Arabia as a mix of five types of geography. The coastal plains were hot and relatively humid, which made agriculture unpleasant. The west and south of Yemen had a more ideal climate, but the terrain was mountainous and required extensive terracing to settle. To the east and north, precipitation was much sparser, and settlement was only possible in deeply eroded valleys using runoff. The thin region where wadis drained into the desert provided one place for settlers to grow crops, due to the accumulation of fertile sediment and a milder climate. Finally, the highland regions supported the majority of the population, due to the combination of flat plains and two rainy seasons. There is also evidence of a return to the pastoral herding lifestyles that had been commonplace in the early Holocene (see, e.g., McCorriston et al., 2002).

2.2 Goals of the Project

It is the period of roughly 5000 to 3000 BP that is the focus of the project discussed in this thesis. The effort is funded through the National Science Foundation (NSF) under the Human and Social Dynamics (HSD) priority area. It is also an offshoot of another project, called "Roots of Agriculture in Southern Arabia" (RASA). The team, led by Dr. Joy McCorriston of the Anthropology Department at OSU, is a collaboration between the Anthropology, Geodetic Sciences, and Statistics departments.

In a broad sense, our goals are to better understand the environmental and social factors affecting tribal dynamics in southern Arabia. In the late Holocene, the harsh climate and scant resources reinforced a tribal structure in which social dynamics were driven by a narrative of kinship and lineage in relation to an ancestral founder.

Figure 2.1: A map of modern day southern Arabia. The focus region for the project is indicated by the blue oval.

Over time, however, states eventually formed. In modern times, social dynamics are influenced not only by genealogical factors, but also by economic status, ethnicity, and so on. How did this transition occur, and what factors guided the change from tribal communities to the formation of states? We can gain insight into these dynamics by excavating the remnants of past settlements, which now take the form of monuments scattered throughout the landscape. The scope of the project is constrained to the Hadramawt region in eastern Yemen and the Dhofar region in western Oman.

One important indicator of habitation and claims of resource ownership is the presence of burial monuments, called cairns (see Figure 2.2). The cairns found in this region from the late Holocene, also called high circular tombs (HCTs), range from 3 to 8 meters in diameter. They are circular walled tombs standing roughly a meter or two high, and may contain multiple interred bodies. Evidence suggests that some tombs were reused over time, with older bodies being covered with rocks and newer bodies laid on top. Cairns also tend to appear in clusters, and seem to mark territorial ownership due to their frequent placement on easily observed ridges overlooking wadis.

The theory is that individuals passing through an area would surely have used the dried wadi channels, because of the ease of travel compared to the terraced terrain that dominates the landscape. As they walked through a wadi, they would notice the cairns on the hillsides and recognize that another tribe controlled the resources in the area. Cairns were signposts showing that a tribe had dominated the region for a sustained period of time (at least a generation). Such displays of ownership would have become necessary for survival as the resources across southern Arabia dwindled.

2.3 Cairn Detection

Due to the aforementioned difficult terrain in modern-day Yemen and Oman, discovering cairns can be a tedious task. Anthropologists have had to rely on directions from natives who are familiar with the area, or drive up and down the wadis in jeeps to look for cairns. Unfortunately, ground surveying is time consuming and expensive; furthermore, it does not allow the discovery of any monuments that are not within viewing distance of the wadi basin. The unforgiving terrain also makes excursions into the higher terraced areas extremely difficult.

Figure 2.2: (a) A small-sized cairn roughly 3 meters in diameter. (b) A larger cairn roughly 5-6 meters in diameter.

As remote sensing technology has improved, satellite imagery now has a fine enough resolution to distinguish cairns. However, the size of the region of interest (roughly 500 × 1000 km), coupled with the sparse distribution of cairns, makes it nearly impossible to locate them by visual inspection. The difficulty of human cairn detection in satellite imagery is the reason statistics is involved in this project. My work on the NSF-HSD project has been the development of a supervised learning algorithm to automatically detect cairns in satellite imagery.

Available information comes in several forms, the first of which is the imagery itself. Each QuickBird satellite image acquired by the team covers roughly 100 to 200 square kilometers and is composed of multiple bands, each of which provides different information. The finest resolution band is the panchromatic (intensity) band, in which each pixel gives the brightness of a 0.6 × 0.6 square meter patch of ground surface. QuickBird imagery also has four multispectral bands that measure the reflectance of the surface in different wavelengths of visible and near-visible light. In order of increasing wavelength, these are Blue (450–520 nm), Green (520–600 nm), Red (630–690 nm), and near infrared (760–900 nm). The multispectral bands are at 2.4 meter resolution, which means that one pixel of an R/G/B/IR band covers a 4 × 4 pixel region of the panchromatic band, or a 2.4 × 2.4 square meter patch of ground.

An ASTER digital elevation model (DEM) is also available for southern Arabia. Digital elevation models are generated by imaging the same region of the earth with the same satellite at two or more different angles. Even though the angle differences are slight, this stereo representation of the earth's surface gives a way to estimate the elevation of the ground at every location in the image. Since higher resolution incurs more error, DEMs tend to be at much lower resolution to preserve accuracy. The ASTER DEM is at 30 meter resolution, which means one ASTER pixel is the size of a 50 × 50 square of QuickBird panchromatic intensity pixels.

Regarding satellite imagery, field expeditions to a particular region have two primary goals. The first is to take GPS coordinates at ground control points (GCPs) that are easily identifiable from space. These GCPs are used to correct mismatches between the image coordinates and the actual coordinates of the imaged region. Such mismatches generally occur because of the angle of the satellite to the earth as the image is taken, the curvature of the earth itself, and other small errors due to the sensors in the satellite. The process of fitting the two coordinate systems together is called geo-rectification, and must be done to match imagery to reality.

The second goal of the field expeditions is to locate and excavate monuments. Locating the monuments, as discussed before, is generally done by a drive-through survey of the area guided by conversation with indigenous people. As an alternative, the detection algorithm discussed in this thesis could be used to generate potential monument locations. Each discovered monument is photographed and measured. Relevant information (including its GPS location) is recorded on a monument survey form, and a small number of the monuments are also excavated. The information on the survey forms is then assembled into a spreadsheet, which is the second piece of information available.

The GPS coordinates of cairns can then be used to match ground locations to the imagery (after rectification with the GCPs) and obtain a training set of known cairns. This training set is the third and most crucial piece of information available for the development of the detection algorithm. Examples of the imagery, the field expedition, and the cairn training set are given in Figures 2.3 and 2.4.

From the training set images, it is apparent that cairns share certain features in common, even in satellite imagery. They appear as nearly circular patches of relatively darker color than their surroundings. The size of the patch ranges anywhere from a 5 × 5 to a 13 × 13 pixel region. Sometimes a shadow effect appears as a crescent around one edge of the cairn, but it is not always present. Exploiting these commonalities among cairns requires a variety of techniques. An overview of image processing and object detection methods appears in the next chapter. The automatic cairn detection algorithm is then described in Chapter 4.

Figure 2.3: (a) A QuickBird image of a wadi in Yemen. The field team physically walked through the areas indicated in green to do a monument survey. As they conducted the survey, they also looked into the distance for other monuments. The extent of this visual inspection is indicated in yellow. (b) Examples of some cairns from the training set as they appear in the panchromatic layer of the imagery.

Figure 2.4: (a) A cairn as it appears in the field. (b) The same cairn in 0.6 meter QuickBird satellite imagery. (False color is created with pan-merged RGB pixels.)

CHAPTER 3

TECHNIQUES FOR DETECTING OBJECTS IN IMAGERY

3.1 Introduction

Many scientific applications involve image processing in some form or another.

In many situations, the goal of the processing is to detect objects that are present in the image. After detection, other analyses can be run, e.g. object extraction, calculation of relevant features, or shape classification. Detection techniques range from simple to complex, and processing time usually scales with the complexity of the method. Ideally, one uses the simplest detection methods sufficient for the particular application, avoiding overly complex methods in order to keep computation time at a minimum.

To begin, consider a p × q gray-level image composed of pixels I(x, y), where x = 1, . . . , p and y = 1, . . . , q. Each N-bit pixel I(x, y) takes an integer value in the range [0, 2^N − 1], so the image is represented by a matrix of pixel values with p rows and q columns. Without loss of generality, we will scale the intensity values by 2^N − 1 to place the pixel values in the range [0, 1]. The techniques described in this chapter will only consider gray-level imagery, but they can also be applied to color imagery, which is expressed as a higher dimensional array with p rows, q columns, and depth d. The combination of values in each layer of the array gives unique representations of colors; more colors are available if the layers are given higher bit depths. Common representations of colors are RGB (Red / Green / Blue), HSI (Hue / Saturation / Intensity), HSV (Hue / Saturation / Value), Y'UV (Luma / Chrominance), or Y'CbCr (Luma / BlueDiff / RedDiff).
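As a concrete illustration of this scaling convention, consider the following sketch (Python with NumPy; the toy array stands in for real imagery and is not from the thesis — the original experiments were run in Matlab):

```python
import numpy as np

# Toy stand-in for an 8-bit (N = 8) gray-level image.
N = 8
raw = np.array([[0, 128, 255],
                [64, 32, 16]], dtype=np.uint8)

# Scale by 2^N - 1 = 255 so pixel values lie in [0, 1].
I = raw.astype(float) / (2**N - 1)
```

For color imagery, the same division is applied to each layer of the array.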

The Webster’s dictionary definition of an object is “anything that is visible or tangible and is relatively stable in form” (1996). In 2D imagery, then, an object will manifest itself as something that is visible – that is, a region of the image that is distinguishable from the background. Being able to distinguish the object implies that there is a boundary – i.e. a set of pixel location pairs (x, y) – around this region that separates the interior of the object from the background (or other objects). If the object is occluded, this boundary may not be completely visible, but there will still be a boundary between the object and the source of the occlusion. Some of the most basic image processing techniques (Section 3.2) modify the image in an attempt to render the boundary more recognizable.

When the object has a clear boundary, a more complex set of edge detection techniques can be used to locate the boundary pixels in the image (Section 3.3).

Discovering the object boundary makes higher level processing algorithms available.

Using this edge information, one can determine whether certain types of objects appear in the image by searching for their outlines among the edge pixels (Section 3.4). At this point, the object has been detected and can be used in further analysis, like shape description and classification (Section 3.5). What follows is a description of some of these available techniques and the situations for which they were designed. The examples in this chapter use cropped versions of images found in the Berkeley Image Database (Martin et al., 2001). Application of the techniques on these images was performed in Matlab R2009a. For more details about these techniques, see Nixon and Aguado (2008) and Ritter and Wilson (2001).

3.2 Basic Image Processing Techniques

Some of the simplest image processing algorithms aim to clarify the location and boundary of the object to make detection easier. Typically, an object's boundary might be hard to distinguish due to illumination, contrast, and/or noise. Basic processing tries to mitigate some of these effects, and the algorithms fall into three general classes: point operators, template operators, and group/window operators.

3.2.1 Point Operators

Point operators are the quickest and easiest algorithms to perform on imagery.

These operators convert each original image pixel value I(x, y) to a new value f(I(x, y)) for some function f. Some examples of point operators are given below, listed in order of increasing computational complexity.

Intensity Thresholding

The simplest example of a point operator is the thresholding operation. The image is converted into a binary matrix, where all intensities below a threshold c are coded as 0, and the others are coded as 1. In other words, thresholding is the point operator defined by f(I(x, y)) = 1(I(x, y) ≥ c), where 1(A) is the indicator function for event A and c is any constant. For more complex thresholding operations, c could also be a function of the pair (x, y).

Thresholding is highly sensitive to illumination effects and noise that may be present in the image. It also requires knowledge of the optimal threshold, which may be difficult to determine without prior information. Fortunately, automatic methods of determining this threshold do exist (e.g. Otsu, 1979). In low-noise situations with controlled lighting, thresholding might be enough to find and extract objects. However, the vast majority of applications do not meet these criteria, and more complex image processing methods are required.
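The thresholding operator, together with a minimal sketch of Otsu's automatic threshold choice (pick the cut that maximizes the between-class variance of the two groups it induces), might look as follows in Python with NumPy. This is an illustrative reimplementation, not the thesis's code, and the bimodal toy image is hypothetical:

```python
import numpy as np

def threshold(I, c):
    """Point operator f(I(x, y)) = 1(I(x, y) >= c) for a [0, 1]-scaled image."""
    return (I >= c).astype(int)

def otsu_threshold(I, nbins=256):
    """Minimal sketch of Otsu's method: choose the cut maximizing the
    between-class variance of the two induced groups."""
    counts, edges = np.histogram(I, bins=nbins, range=(0.0, 1.0))
    p = counts / counts.sum()                # bin probabilities
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(p)                        # weight of the "dark" class at each cut
    mu = np.cumsum(p * centers)              # cumulative mean
    mu_T = mu[-1]                            # global mean
    valid = (w0 > 0) & (w0 < 1)
    sigma_b = np.zeros(nbins)
    sigma_b[valid] = (mu_T * w0[valid] - mu[valid]) ** 2 / (w0[valid] * (1 - w0[valid]))
    return centers[np.argmax(sigma_b)]

# Bimodal toy image: dark background near 0.2, bright object near 0.8.
I = np.concatenate([np.full(50, 0.2), np.full(50, 0.8)]).reshape(10, 10)
c = otsu_threshold(I)
B = threshold(I, c)
```

On the toy image the chosen c falls between the two modes, so the binary output separates background from object exactly.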

Brightness Adjustment

Beyond thresholding, the most basic example of intensity adjustment is a linear increase or decrease in value, where

f(I(x, y)) = a · I(x, y) + b (3.1)

Here a and b are constants chosen such that f(I(x, y)) is still in the range [0, 1] (otherwise, intensities outside that range are truncated to [0, 1]). Of course, even more interesting changes to brightness could take place if intensity changes are non-linear functions of I(x, y).

Gamma Correction

Gamma correction adjusts intensity based on a power transformation, where f(I(x, y)) = I(x, y)^γ. Since the pixel values are normalized to the range [0, 1], the gamma corrected intensities will also fall in the range [0, 1]. A value of γ = 1 yields the identity function. Values of γ less than 1 put relatively more weight on lower intensity (darker) pixels, while values of γ larger than 1 increase higher intensity (brighter) pixels more. Unlike a linear brightness scale/shift, gamma correction allows the contrast between pixels to be controlled, which can make object boundaries clearer.

Figure 3.1: (a) An image of the ostrich. (b) Image thresholded at the median value of c = 0.5843. (c) Image thresholded at the 75th percentile value of c = 0.6804. (d) Brightness Eq. (3.1) applied to the image, where a = 1.5 and b = 0. Resulting intensities larger than 1 are truncated to 1. (e) Gamma corrected image of the ostrich, γ = 0.5. (f) Gamma corrected image of the ostrich, γ = 3.
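Both operators are one-liners on a [0, 1]-scaled array. The sketch below (Python with NumPy, toy values rather than the ostrich image) mirrors Eq. (3.1) and the power transformation:

```python
import numpy as np

def brightness(I, a, b):
    """Linear point operator of Eq. (3.1), truncated back to [0, 1]."""
    return np.clip(a * I + b, 0.0, 1.0)

def gamma_correct(I, gamma):
    """Power-law point operator f(I) = I^gamma on a [0, 1]-scaled image."""
    return I ** gamma

I = np.array([[0.1, 0.4],
              [0.5, 0.9]])

bright = brightness(I, a=1.5, b=0.0)   # intensities above 1 are truncated
lifted = gamma_correct(I, 0.5)         # gamma < 1 brightens dark pixels
crushed = gamma_correct(I, 3.0)        # gamma > 1 darkens most pixels
```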

Histogram Normalization

While the pixel values of the image fall in the range [0, 1], it is possible that the range of intensities does not completely cover the interval. For example, the image of the ostrich in Figure 3.2(a) only contains intensities between 0.2980 and 0.8877. To increase the contrast between lighter and darker regions, histogram normalization scales the intensities so that the histogram stretches over the range [0, 1]. In general, if Imin and Imax are the minimum and maximum intensities in the image, respectively, and one wishes to stretch the histogram over the range [c1, c2] for c1 < c2, the new intensity is:

f(I(x, y)) = (c2 − c1) / (Imax − Imin) · (I(x, y) − Imin) + c1

Note, however, that depending on the shape of the histogram there still may be intensity levels that are poorly represented in the image. For the ostrich example, the histogram after normalization (Figure 3.2(e)) now stretches over [0, 1] as intended, but almost all of the intensities are still between 0.2 and 0.8. To provide even further contrast in the image, it is necessary to try to maximize the amount of information spread across the interval.
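The stretch formula translates directly into code. This sketch (Python with NumPy) uses a toy intensity vector spanning the same range as the ostrich example:

```python
import numpy as np

def hist_normalize(I, c1=0.0, c2=1.0):
    """Stretch intensities linearly so [Imin, Imax] maps onto [c1, c2]."""
    Imin, Imax = I.min(), I.max()
    return (c2 - c1) / (Imax - Imin) * (I - Imin) + c1

# Toy intensities spanning [0.2980, 0.8877], as in the ostrich example.
I = np.array([0.2980, 0.5, 0.7, 0.8877])
J = hist_normalize(I)
```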

Histogram Equalization

Histogram equalization is a non-linear map of intensity values that is intended to flatten out the intensity histogram so that all intensities in the interval [c1, c2] are equally represented. The result is an image with even more contrast between light and dark regions than histogram normalization can provide. To accomplish this, suppose the original image takes intensities in the range [Imin, Imax], and the new image is intended to have intensities in the range [c1, c2]. Each of these intervals is first split into M equal sized bins. For the old image, denote the bins b_1^old, . . . , b_M^old and the number of pixels in each bin n_1^old, . . . , n_M^old. Likewise, denote the bins for the new image b_1^new, . . . , b_M^new and the number of pixels in each bin n_1^new, . . . , n_M^new.

The goal now is to move all pixels in bin b_i^old from the original image into a new bin b_j^new for the new image. Since the intent is to flatten out the histogram, the number of pixels in each new bin should be roughly equal. There are M bins over the interval [c1, c2], which means that for a p × q image, each bin b_k^new should contain roughly n_k^new = pq/M pixels. To put this another way, the cumulative sum of all bins up to bin b_j^new should be

Σ_{k=1}^{j} n_k^new = Σ_{k=1}^{j} pq/M = j · pq/M

To retain the ordering of pixels from the old image to the new one, it is also important to make sure that the cumulative sum of all bins up to b_i^old in the old image is the same as the cumulative sum up to b_j^new in the new one. That is:

Σ_{k=1}^{i} n_k^old = Σ_{k=1}^{j} n_k^new = j · pq/M

Using this relationship, the index of the new bin matching a given bin b_i^old is:

j = (M/pq) · Σ_{k=1}^{i} n_k^old

Once the matching bin is determined, one need only recode the pixels in bin b_i^old with values that place them in the matching bin b_j^new for the new image. One possibility is to recode them with a value corresponding to the midpoint of bin b_j^new, or perhaps interpolate a value inside the destination bin based on the relative ordering of the pixel within its original bin. One could describe histogram equalization, then, as the following point operator:

f(I(x, y)) = midpoint(b_{g(i)}^new),

where I(x, y) falls into bin b_i^old and g(i) = (M/pq) · Σ_{k=1}^{i} n_k^old.

Figure 3.2: (a) Original image of the ostrich. (b) Normalized image of the ostrich. (c) Equalized image of the ostrich using M = 64 bins. (d) Histogram of intensities for the original image. (e) Histogram of intensities for the normalized image. (f) Histogram of intensities for the equalized image.
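The bin-matching derivation above can be sketched as follows (Python with NumPy; bins are 0-indexed in the code, unlike the 1-indexed derivation, and the midpoint recoding is used rather than interpolation). This is an illustrative reimplementation, not the code used for Figure 3.2, and the random toy image is hypothetical:

```python
import numpy as np

def equalize(I, M=64, c1=0.0, c2=1.0):
    """Histogram equalization via the bin-matching rule: pixels in old bin
    b_i are recoded with the midpoint of new bin b_j, where
    j = (M/pq) * (cumulative pixel count up to bin i)."""
    pq = I.size
    Imin, Imax = I.min(), I.max()
    # Old bin index of each pixel (M equal-width bins over [Imin, Imax]).
    i = np.minimum(((I - Imin) / (Imax - Imin) * M).astype(int), M - 1)
    counts = np.bincount(i.ravel(), minlength=M)
    cum = np.cumsum(counts)                    # cumulative counts per old bin
    j = np.ceil(M * cum / pq).astype(int)      # matching new bin index (1..M)
    midpoint = c1 + (j - 0.5) * (c2 - c1) / M  # midpoint of each new bin
    return midpoint[i]

rng = np.random.default_rng(0)
I = rng.beta(2, 5, size=(32, 32))   # toy image with a skewed histogram
J = equalize(I)
```

Because the cumulative counts are non-decreasing, the mapping preserves the ordering of pixel intensities, as required by the derivation.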

Histogram equalization has the desirable property of invariance to linear brightness transformations. That is, if the intensities are all scaled or shifted by a constant amount across the entire image, the histogram equalized image will not be any different. Unfortunately, histogram equalization also suffers from sensitivity to noise. Noisy pixels can alter the histogram of the original image, which necessarily alters the equalized histogram as well. To deal with noise in the image, one needs to move beyond point operators and consider spatial information.

3.2.2 Template Operators

Point operators can resolve some of the basic issues inherent in imagery, like brightness and contrast. However, as the new intensity values are functions only of a single value in the original image, there are some kinds of problems they cannot handle. One serious drawback to these operators is their sensitivity to noise and outlier pixels. If a pixel has a drastically different intensity value compared to its immediate neighbors due to a noise source (e.g. sensor error, dust and dirt, etc.), there is no way for the algorithm to take this into account. More robust operators work on groups of neighboring pixels to incorporate a spatial element to image processing.

Many smoothers and filters can be found in this class of operators.

These operators convolve an image with a template, which is a matrix of weighting coefficients. Usually, the matrix is a w × w square where w is odd, so the template can be defined by the weights α_{i,j}, for i, j ∈ {−r, . . . , r} and window radius r = (w − 1)/2. The template is centered at each pixel in the image, which is then replaced by a weighted sum of the neighboring pixels according to the α_{i,j} values. In other words, pixel (x, y) has its original intensity I(x, y) replaced with:

f(I(x, y)) = Σ_{i=−r}^{r} Σ_{j=−r}^{r} α_{i,j} I(x + i, y + j)

For example, a 3 × 3 template would take the form

α_{−1,1}   α_{0,1}   α_{1,1}
α_{−1,0}   α_{0,0}   α_{1,0}
α_{−1,−1}  α_{0,−1}  α_{1,−1}

and replace the center pixel I(x, y) with:

f(I(x, y)) = α_{1,1} I(x+1, y+1) + α_{1,0} I(x+1, y) + α_{1,−1} I(x+1, y−1)
           + α_{0,1} I(x, y+1) + α_{0,0} I(x, y) + α_{0,−1} I(x, y−1)
           + α_{−1,1} I(x−1, y+1) + α_{−1,0} I(x−1, y) + α_{−1,−1} I(x−1, y−1)
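A direct image-domain implementation of this operator is a pair of nested loops. The sketch below (Python with NumPy, toy data) simply drops border pixels, so the output shrinks by r on each side:

```python
import numpy as np

def apply_template(I, T):
    """Replace each pixel with the weighted sum of its w x w neighborhood
    under template T (w odd). Border pixels are excluded, so the output
    shrinks to (p - 2r) x (q - 2r)."""
    p, q = I.shape
    w = T.shape[0]
    r = (w - 1) // 2
    out = np.zeros((p - 2 * r, q - 2 * r))
    for x in range(r, p - r):
        for y in range(r, q - r):
            out[x - r, y - r] = np.sum(T * I[x - r:x + r + 1, y - r:y + r + 1])
    return out

T = np.full((3, 3), 1.0 / 9.0)                     # box template, all weights 1/9
I = np.arange(25, dtype=float).reshape(5, 5) / 24  # toy ramp image in [0, 1]
S = apply_template(I, T)                           # smoothed 3 x 3 interior
```

On the linear ramp, averaging a symmetric window reproduces the center value, which makes the result easy to check by hand.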

25 Also note that at border pixels, the template runs over the side of the image. There are several remedies in these cases:

1. Exclude the pixels, which results in a smaller image of size (p − 2r) × (q − 2r). This is the easiest way to deal with the situation, and should only be done if the region of interest in the image is not on the edges to begin with.

2. The pixels could remain in the image, but not be altered. This could be done if the image needs to be kept the same size as the original.

3. The pixels could be given a border value, e.g. zero, which would also keep the same image size.

4. The pixels could have a smaller version of the template applied to them. For example, if the template is a simple 3 × 3 window average, the calculation for the top left corner could be done with the 2 × 2 block of the template that lies in the image. For more complicated templates, this may not be an option.

5. Although in many circumstances it may not be appropriate, one could also assume the image repeats in a tiled fashion in the areas outside the image boundary. In this case, the template could be computed for border pixels by substituting values from a repeated version of the image (see Figure 3.3).

To speed computation, convolution of the template with the image is often done in the Fourier domain. For a p × q image, the discrete Fourier transform from an image to two frequency components ω1 and ω2 is given by

F(I(x, y)) = F_I(ω1, ω2) = (1/pq) Σ_{x=0}^{p−1} Σ_{y=0}^{q−1} I(x, y) exp(−2πi ω1 x / p) exp(−2πi ω2 y / q)

Figure 3.3: (a) Original image to be filtered with a 3 × 3 window operator. (b) Template centered at the top left pixel (a border pixel). (c) Repetition of the image to fill the border template.

The conversion back from the Fourier domain to the image is given by the inverse Fourier transform:

F^{−1}(F_I(ω1, ω2)) = I(x, y) = Σ_{ω1=0}^{p−1} Σ_{ω2=0}^{q−1} F_I(ω1, ω2) exp(2πi ω1 x / p) exp(2πi ω2 y / q)

This process can be done efficiently using a method called the fast Fourier transform (FFT). The major advantage of working in the Fourier domain is the fact that convolution of an image with a template can be represented as elementwise multiplication of their Fourier versions. That is, for a template T and image I, we have:

F(T ∗ I) = F(T) · F(I).

Thus one need only compute the Fourier version of the template and the image, then multiply them together. An inverse Fourier transform of this product gives the result of the convolution in the image domain. Since the FFT is fast, as is the elementwise multiplication of matrices, this approach is computationally superior to applying the template pixel by pixel in the image domain. In the Fourier formulation of the problem, the image is assumed to repeat indefinitely outside the borders. Therefore, the template is applied to border pixels as in Figure 3.3.
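The sketch below (Python with NumPy, toy data) implements this recipe: zero-pad the template to the image size, multiply the FFTs elementwise, and invert. The circular wrap-around matches the tiled-image border assumption; note that with the template placed at the array origin the result is shifted by r relative to a centered template:

```python
import numpy as np

def fft_convolve(I, T):
    """Circular convolution of image I with template T via the FFT.
    T is zero-padded to the image size; the wrap-around at the borders
    matches the tiled-image assumption of Figure 3.3."""
    p, q = I.shape
    Tpad = np.zeros((p, q))
    Tpad[:T.shape[0], :T.shape[1]] = T
    # Convolution in the image domain = elementwise product of the FFTs.
    return np.real(np.fft.ifft2(np.fft.fft2(I) * np.fft.fft2(Tpad)))

I = np.random.default_rng(1).random((8, 8))   # toy 8 x 8 image
T = np.full((3, 3), 1.0 / 9.0)                # 3 x 3 box template
C = fft_convolve(I, T)
```

Because the box template's weights sum to 1, the circular convolution preserves the total intensity of the image.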

Operators that use templates include averaging operators, gaussian smoothers, and some edge detectors. Several of these template operators are described below.

Discussion of edge detectors can be found in Section 3.3.

Averaging Operators

These operators use a window template whose coefficients sum to 1. The result is the replacement of the central pixel with a weighted average of its neighbors. The most basic example, often called a box operator, is a w × w template with equal weights α_{ij} = w^{−2}. However, other weights on the coefficients could be used to achieve other effects.

Since the central pixel is being replaced with an average of its neighbors, convolving the image with this template results in a smoother version of the image. The degree of smoothing depends on the coefficients and the size of the template. Larger windows result in more smoothing, since each pixel is averaged with a larger number of its neighbors. The purpose of an operator like this one is to reduce the effect of noise on further analyses by smoothing noisy pixels into their neighbors. An example of this operator at work is given in Figure 3.4.

Gaussian Smoothers

Gaussian smoothers are a special case of averaging operator where the coefficients of the template are derived according to the gaussian distribution. Treating the center of the template as the origin (0, 0), the row and column directions are given independent marginal gaussian distributions with mean 0 and common variance σ^2. Therefore, the joint density at point (x, y) is given by:

g(x, y | σ) = (1/(√(2π) σ)) e^{−x^2/(2σ^2)} · (1/(√(2π) σ)) e^{−y^2/(2σ^2)} = (1/(2πσ^2)) e^{−(x^2 + y^2)/(2σ^2)}

The coefficients of the template are calculated at the discrete locations corresponding to the pixels in the template: α_{ij} = g(i, j | σ), for i, j ∈ {−r, . . . , r} and window radius r = (w − 1)/2.

The shape of this template is like a bell, rather than being flat. This places a majority of the weight on the center pixel, but still smooths its intensity with its neighbors (see Figure 3.4). As a result, when the center pixel’s intensity is much higher or lower than its surroundings (e.g. for noise pixels), the weights on the surrounding neighbors will bring its intensity to a level more consistent with the surroundings. However, when there is not such a disparity in intensities, the original value at the center pixel is not altered much. This template, therefore, has the effect of smoothing out noise while simultaneously maintaining sharpness of features and edges. Compare the smoothed images from Figure 3.4 (d) and (e). The gaussian template is just as efficient at removing noise as the simple averaging filter, but the resulting image does not look as blurry.

Gaussian smoothing is generally considered to be the optimal choice for noise reduction. Furthermore, the first derivative of gaussian operator has been shown to approximate the optimal edge detecting filter for the popular Canny edge detector (Canny, 1986) described in Section 3.3.
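Constructing the gaussian template amounts to evaluating g(i, j | σ) on the integer grid. The sketch below (Python with NumPy) also renormalizes the sampled coefficients so they sum to 1, keeping the template a valid averaging operator — a detail the sampled density does not satisfy exactly on its own:

```python
import numpy as np

def gaussian_template(w, sigma):
    """Build a w x w gaussian template (w odd) by evaluating g(i, j | sigma)
    at the integer offsets i, j in {-r, ..., r}, with r = (w - 1)/2."""
    r = (w - 1) // 2
    i, j = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    g = np.exp(-(i**2 + j**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()   # renormalize so the coefficients sum to 1

T = gaussian_template(7, 4.0 / 3.0)   # the 7 x 7, sigma = 4/3 template of Figure 3.4
```

The resulting template is symmetric, with its largest weight on the center pixel, giving the bell shape described above.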

3.2.3 Group/Window Operators

Some filters perform a similar function to template operators in the sense that they alter the value of a pixel based on the values of its closest neighbors. However, unlike the previously mentioned template operators, they cannot be expressed as a convolution of the image with a particular template. A window is still passed over the image as before, but the value assigned to the central pixel is a non-linear function of the other pixels in the window. While these operators tend to be slower and more complicated, they can also handle messier images and still yield impressive results.

Figure 3.4: (a) The template for the 7 × 7 averaging operator. (b) The template for the 7 × 7 gaussian operator (σ = 4/3). (c) Original image of the ostrich. (d) Convolution of the ostrich image with the 7 × 7 averaging operator. (e) Convolution with the 7 × 7 gaussian operator. Border pixels were coded as zero (black).

Examples of these operators include median and mode filters, as well as other more exotic approaches. Some of the more complex edge detectors fall into this category as well (see Section 3.3).

Median Filtering

The concept of median filtering is simple: each pixel intensity I(x, y) is replaced with the median value in a window of size w×w centered at (x, y). The benefit of this approach comes from the robustness of the median to outliers. In cases where noise mechanisms produce large outlying pixel intensities (e.g. salt and pepper noise), an averaging filter can run into trouble due to the inherent sensitivity of the mean to outliers. However, the median intensity is less sensitive to these values, and can deal with this kind of noise more effectively.

The drawback of median filtering, however, is the increase in computational complexity. Calculation of the mean of pixel intensities requires summing (complexity O(n)), while computing the median requires sorting of the values (complexity at least O(n log n), depending on the sorting method). In this case, n = w², so as the size of the window increases, the difference between median computation and mean computation accelerates. For noisy images and a small window width, however, median filtering can perform well (see Figure 3.5).
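A direct implementation of the median filter described above might look as follows. This is illustrative code; here border pixels are simply left unchanged, which is one of several common conventions.

```python
import numpy as np

def median_filter(image, w):
    """Replace each interior pixel with the median of its w x w window."""
    half = w // 2
    out = image.astype(float).copy()
    rows, cols = image.shape
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            window = image[i - half:i + half + 1, j - half:j + half + 1]
            out[i, j] = np.median(window)  # robust to outlying intensities
    return out
```

A single salt-and-pepper pixel in an otherwise flat region is removed completely, whereas an averaging filter would smear it across the window.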


Figure 3.5: (a) Original image of the ostrich. (b) Gaussian smoothed ostrich (3 × 3 window, σ = 2/3). (c) Median filtered ostrich (3×3 window). (d) Ostrich image with 5% of pixels corrupted with noise (intensity set to 0). (e) Gaussian smoothed noisy ostrich (3 × 3 window, σ = 2/3). (f) Median filtered noisy ostrich (3 × 3 window).

Mode Filtering

The mode filter is most useful in situations where pixels take categorical values. Although images typically are not of this sort, it does happen in some applications. For example, in image segmentation, images are split into homogeneous regions that are spatially connected and similar in intensity/color. The result of an image segmentation algorithm is an image the same size as the original, where each pixel is coded with a group label. If one wanted to obtain a smoother pattern of labeling across the image, none of the group operators mentioned previously would be appropriate. Notwithstanding the fact that averaging group labels makes no sense, the operators would also introduce non-integer values which would not correspond to any of the group labels.

The solution to this problem, then, is to replace each pixel with the same group label as the majority of its neighbors. Such a filter is called a mode filter. A window of size w × w is centered on each pixel, and the value of the central pixel is replaced with the mode of the pixel values inside the window. The result is that small holes that disrupt the continuity of the groups across the image are filled in, yielding a smoother looking image. Larger window sizes will smooth the image even more by considering a larger range of neighbors around each point.
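A sketch of the mode filter for a label image follows. This is illustrative code; ties are broken arbitrarily by `Counter`, and border pixels are left unchanged.

```python
import numpy as np
from collections import Counter

def mode_filter(labels, w):
    """Replace each interior label with the most common label in its window."""
    half = w // 2
    out = labels.copy()
    rows, cols = labels.shape
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            window = labels[i - half:i + half + 1, j - half:j + half + 1]
            out[i, j] = Counter(window.ravel().tolist()).most_common(1)[0][0]
    return out
```

A single mislabeled pixel inside an otherwise uniform segment, the "small hole" described above, is filled in with the surrounding label.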

Anisotropic Diffusion

This algorithm, developed by Perona and Malik (1990), is a smoothing technique that aims to marry the benefits of gaussian smoothing with the preservation of edges.

The image is viewed at different scales by iterated gaussian smoothing at increasing levels of σ. The algorithm then draws from heat diffusion theory to characterize the nature of the resulting image. There are two parameters in the model. The first parameter controls the amount of smoothing that occurs, while a second parameter controls the amount of diffusion (i.e. blending) that occurs in edge regions of the image. By carefully choosing a combination of parameter values, one can smooth homogeneous regions of the image (resulting in a more uniform matte appearance) while still preserving crisp edges between differing regions.

Mean Shift Filtering

The mean shift filter, developed by Fukunaga and Hostetler (1975), then rediscovered by Cheng (1995), attempts to regress pixel intensities to the mean of the local region. Like anisotropic diffusion, the effect is a smoothing of homogeneous regions with edge preservation. The mean shift occurs over several iterations, and the amount of shift depends on several parameters. The algorithm uses a kernel K(·) and a weighting function w(x, y, x′, y′) for two points (x, y) and (x′, y′). In each iteration, the intensity I(x, y) at each pixel (x, y) is replaced by the value of the function

f(x, y) = [ Σ_{(x′, y′) ∈ W} K(|I(x′, y′) − I(x, y)|) w(x, y, x′, y′) I(x′, y′) ] / [ Σ_{(x′, y′) ∈ W} K(|I(x′, y′) − I(x, y)|) w(x, y, x′, y′) ],

where W is the set of pixels in a k × k window around (x, y). After each pass through the algorithm, the intensities of pixels are shifted toward a local weighted mean, where the contribution of neighboring pixels to the mean is weighted by the closeness of their intensities to the central pixel. As a result, pixels with similar intensities will eventually shift toward a common value, producing homogeneous regions of the image at different equilibrium intensity levels. The mean shift filter can be used as a first step in clustering algorithms.
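The formula above can be sketched for a single iteration. This is illustrative code with specific choices that the general description leaves open: K(·) is taken to be a gaussian kernel with a bandwidth parameter, and the spatial weight w(x, y, x′, y′) is set identically to 1.

```python
import numpy as np

def mean_shift_iteration(image, k, bandwidth):
    """One mean shift pass: each interior pixel moves toward the weighted
    mean of its k x k window, weighting neighbors by intensity similarity."""
    half = k // 2
    out = image.astype(float).copy()
    rows, cols = image.shape
    for i in range(half, rows - half):
        for j in range(half, cols - half):
            window = image[i - half:i + half + 1, j - half:j + half + 1].astype(float)
            diff = window - float(image[i, j])
            # K(|I(x', y') - I(x, y)|), gaussian in the intensity difference
            weights = np.exp(-(diff ** 2) / (2.0 * bandwidth ** 2))
            out[i, j] = np.sum(weights * window) / np.sum(weights)
    return out
```

Pixels on the far side of a sharp edge receive near-zero weight, which is why homogeneous regions are smoothed while the edge itself is preserved.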

Force Field Transform

The futuristic sounding force field transform (Hurley et al., 2005) was developed for applications in ear biometrics, i.e. distinguishing between individuals based on images of the ear. This method assumes that each pixel exerts a force on its neighbor that is proportional to the intensity at that pixel, and inversely proportional to the square of the distance to its neighbor. Taking all of these forces into account, the entire image can be modeled as a force field. New values at each pixel are given by the net force acting on that pixel from all of the other pixels in the image. The result is a smoother, undulating surface over the image that can reduce the effect of noise.

Dilation and Erosion

Dilation and erosion operators can also be used to reduce noise in images. In the binary image case, pixels are coded as either 0 or 1. These pixels can be split into two groups by intensity. One of these groups is selected, and a w × w window is then centered over each pixel of that group. When all pixels in the window have the same value, nothing is done to the central pixel. However, when pixels from both groups appear in the window, one of two things takes place. If dilation is desired, all pixels in the window are given the same value as the central pixel. If erosion is desired, the central pixel is flipped to the other group.

In effect, the dilation operator expands the boundary of one of the groups by (w − 1)/2 pixels, while the erosion operator contracts the boundary by (w − 1)/2 pixels. In the binary case, a dilation of one group is equivalent to an erosion of the other group. However, these operators can be extended to the general setting where pixels take on all gray-leveled values. In this case, the two groups are defined by a split at a given intensity threshold. Values on the edges of one group are then shifted toward (erosion) or away (dilation) from the threshold in proportion to their distance from the threshold.
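In the binary case, these operators reduce to taking a window maximum (dilation) or minimum (erosion) over the group coded as 1. The sketch below uses that common max/min formulation rather than the exact central-pixel procedure described above; pixels outside the image are treated as 0.

```python
import numpy as np

def dilate(image, w):
    """Binary dilation: a pixel becomes 1 if any pixel in its w x w window is 1."""
    half = w // 2
    padded = np.pad(image, half, mode='constant')
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + w, j:j + w].max()
    return out

def erode(image, w):
    """Binary erosion: a pixel stays 1 only if every pixel in its window is 1."""
    half = w // 2
    padded = np.pad(image, half, mode='constant')
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = padded[i:i + w, j:j + w].min()
    return out
```

The duality mentioned in the text is visible here: dilating the 1s is the same as eroding the 0s.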

3.3 Edge Detection

The basic image processing techniques described in the previous section serve the role of cleaning the image. They can help to reduce the effects of noise in the image, which makes it easier to distinguish which regions in the image belong to objects and where the boundaries between those regions lie. Techniques such as histogram equalization also alter the range of intensities in the image to yield a better contrast between objects and their backgrounds. The next step to locating objects in the image is to determine where the edges of those objects are. Once these edges are found, objects can be extracted from the image and higher level functions come into play. Typically, next steps would involve calculating descriptors of the object's shape, comparison of the shape to other objects, classifying the object into a group, and ultimately identifying the object.

A short chronological history of edge detection is given below, but all these detectors work similarly. Edges in an image occur where the intensity changes sharply from one value to another. Edge detectors try to find boundary pixels by examining the intensity gradient in a local neighborhood around each pixel in the image. If this gradient is steep, there is evidence that the central pixel is an edge pixel. By their nature, edge detectors are very sensitive to noise. The local gradient will be large around noisy pixels, which results in large numbers of falsely detected edge pixels that can obscure the actual shape of the object. Edge detection can be improved by combining it with noise reduction techniques mentioned in the previous section, and often these techniques are worked into the edge detection algorithm itself.

3.3.1 Edge Vector Formulation

Since the goal of edge detection is to find pixels where the intensity change is largest, the standard approach is to approximate the local derivative in intensity by differencing neighboring pixels. For a two dimensional image surface, differencing could be considered at an infinite number of angles straddling the central pixel. However, due to the discrete nature of images and the local nature of edge information, the overall edge potential at a pixel could be thought of as a combination of only two components: one in the horizontal direction and the other in the vertical direction.

Generally, edge information in these directions is obtained through convolution of two templates with the image. The horizontal template, denoted Mh, differences pixels along the rows and is designed to pick up horizontal edges. The vertical template, denoted Mv, differences along columns to pick up vertical edges. These templates are convolved with the image, resulting in vertical and horizontal measures of edge potential at each pixel.

Combining these two functions can be done in a variety of ways. The easiest is to take the absolute value of each measure, then examine their sum or the maximum of the two. However, in practice what is often done is to treat Mh and Mv as the horizontal and vertical components of an edge vector. The magnitude M of the vector measures how strong the edge is at the pixel, and the angle θ of the vector gives the estimated angle of the edge. Using this formulation allows one to characterize an edge not just by the sharpness of the intensity change (magnitude), but also by its orientation (angle). See Figure 3.6 for a visualization.

Typically, edges are detected by thresholding the magnitude M of the edge vector at a certain value. If the value is too large, few if any edge pixels will be detected. If the value is too small, there will be a large number of false edges.

Early edge detectors used the following templates, which approximate the first derivative with a single difference between neighboring pixels:

M = √(Mh² + Mv²)  (3.2)

θ = tan⁻¹(Mv(x, y) / Mh(x, y))  (3.3)

Figure 3.6: Vector representation of an edge.

Mh = ( 1  −1 )          Mv = ( 1  −1 )ᵀ

This corresponds to calculating at each pixel (x, y) the horizontal function

Mh(x, y) = |Mh ∗ I| = |I(x, y) − I(x, y + 1)| (3.4) and the vertical function

Mv(x, y) = |Mv ∗ I| = |I(x, y) − I(x + 1, y)|. (3.5)
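A direct implementation of Eqs. (3.2), (3.4), and (3.5), assuming the image is a numpy array of intensities, is sketched below. This is illustrative code; pixels in the last row and column, which have no forward neighbor, are assigned zero edge potential.

```python
import numpy as np

def first_difference_edges(image):
    """Horizontal and vertical edge potentials from single forward differences,
    plus the combined edge magnitude."""
    I = image.astype(float)
    Mh = np.zeros_like(I)
    Mv = np.zeros_like(I)
    Mh[:, :-1] = np.abs(I[:, :-1] - I[:, 1:])  # |I(x, y) - I(x, y+1)|, Eq. (3.4)
    Mv[:-1, :] = np.abs(I[:-1, :] - I[1:, :])  # |I(x, y) - I(x+1, y)|, Eq. (3.5)
    M = np.sqrt(Mh ** 2 + Mv ** 2)             # edge magnitude, Eq. (3.2)
    return Mh, Mv, M
```

On an image containing a single vertical step edge, only the Mh component responds, and the magnitude M peaks along the step.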

3.3.2 Improvements on Edge Detection

In the formulation above, the horizontal and vertical edge potential is defined by the absolute difference between a pixel and its neighbor. This is akin to approximating the derivative of a function f(·) at a point x by f̂′(x) = f(x + 1) − f(x). Or, in a more general sense,

f̂′(x) = (f(x + h) − f(x)) / h  (3.6)

Figure 3.7: (a) An image of an urn. (b) Horizontal edge component Mh(x, y) as defined in Eq. (3.4). (c) Vertical edge component Mv(x, y) as defined in Eq. (3.5). (d) Magnitude M of the edge detector (Eq. (3.2)). (e) Edges thresholded at the 0.9 quantile magnitude (M > 16.0312).

However, this formulation only considers the difference in the function from point x to its neighbor in one direction. A better approximation is to use the average of the derivative estimates from both sides:

f̂′(x) = (1/2) (f(x + h) − f(x)) / h + (1/2) (f(x) − f(x − h)) / h = (f(x + h) − f(x − h)) / (2h)  (3.7)

That is, it is better to use the following templates, which incorporate the pixels above and below (horizontal detection) or to the left and right (vertical detection) of the central pixel:

Mh = ( 1  0  −1 )ᵀ          Mv = ( 1  0  −1 )
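A quick numerical check of why the two-sided form is preferred: for f(x) = x², the forward difference of Eq. (3.6) carries an error proportional to h, while the central difference of Eq. (3.7) is exact for quadratics. This is illustrative code only.

```python
def forward_diff(f, x, h):
    """One-sided approximation, Eq. (3.6)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Two-sided approximation, Eq. (3.7)."""
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x ** 2  # true derivative at x = 1 is 2
fwd = forward_diff(f, 1.0, 0.1)   # approximately 2.1, an error of about h
ctr = central_diff(f, 1.0, 0.1)   # approximately 2.0
```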

The Roberts Cross Operator

Successive historical improvements on edge detection resulted from more accurate approximations of the intensity gradient, as well as a consideration of edge continuity and directionality. For example, the Roberts cross operator (Roberts, 1965) extends Eq. (3.7) to two dimensions. Rather than estimating the derivative in the horizontal and vertical directions independently, the Roberts cross operator uses templates that consider both at once. This is done by convolving 2 × 2 templates with the image that difference along the diagonals instead of the rows and columns:

 1   0         0   1
 0  −1        −1   0

  Mh            Mv

The Prewitt Edge Detector

The Prewitt detector (see Prewitt, 1970; Prewitt and Mendelsohn, 1966) improved detection results by incorporating averaging into the process. Rather than examine the difference between pixels neighboring the central pixel only, the Prewitt detector looks at differences across three rows and columns simultaneously. The templates for Prewitt's edge detector are given below:

 1   1   1         1   0  −1
 0   0   0         1   0  −1
−1  −1  −1         1   0  −1

     Mh                Mv

Edges tend to have a much greater contrast with the Prewitt detector compared to the Roberts cross operator. The detected edges also appear cleaner, with fewer noisy false edge pixels scattered throughout the image.

Sobel’s Edge Detector

Originally presented in a PhD thesis (Sobel, 1970), Sobel’s edge detector is a variation of the Prewitt detector where the weights on the differenced pixels are modified. These weights are largest in the rows and columns containing the central pixel, and decrease outward. A method also exists for extending the filter window to larger sizes. The weights for the Sobel detector are given by Pascal’s triangles for addition and subtraction (Table 3.1).

1  1                 1  −1
1  2  1              1   0  −1
1  3  3  1           1   1  −1  −1
1  4  6  4  1        1   2   0  −2  −1

Pascal's Triangle    Pascal's Triangle
for Addition         for Subtraction

Table 3.1: Pascal's Triangles for Addition and Subtraction

To create a Sobel template, one need only select a row v+ from the triangle for addition and a row v− from the triangle for subtraction. Then, the Mh and Mv templates are given by the equations below.

Mh = v− v+ᵀ  (3.8)

Mv = v+ v−ᵀ = Mhᵀ  (3.9)

For example, a 5 × 5 Sobel template is obtained by selecting the rows from each triangle that contain 5 elements. This gives v+ = (1, 4, 6, 4, 1) and v− = (1, 2, 0, −2, −1). The templates are then calculated as follows:

Mh = v− v+ᵀ =

   1    4    6    4    1
   2    8   12    8    2
   0    0    0    0    0
  −2   −8  −12   −8   −2
  −1   −4   −6   −4   −1

Mv = v+ v−ᵀ =

   1    2    0   −2   −1
   4    8    0   −8   −4
   6   12    0  −12   −6
   4    8    0   −8   −4
   1    2    0   −2   −1

At the time it was introduced, Sobel's detector had a number of advantages over other methods. First and foremost, it tended to work better. Also, the ability to extend templates to arbitrary size gave it more flexibility than other methods.
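The construction of Eqs. (3.8) and (3.9) can be sketched with outer products, generating the Pascal rows by repeated convolution. This is illustrative code with my own function names.

```python
import numpy as np

def pascal_addition(n):
    """Row with n elements from Pascal's triangle for addition."""
    row = np.array([1])
    for _ in range(n - 1):
        row = np.convolve(row, [1, 1])
    return row

def pascal_subtraction(n):
    """Row with n elements from Pascal's triangle for subtraction."""
    return np.convolve(pascal_addition(n - 1), [1, -1])

def sobel_templates(n):
    """n x n Sobel templates via Mh = v- v+^T and Mv = v+ v-^T = Mh^T."""
    v_plus = pascal_addition(n)
    v_minus = pascal_subtraction(n)
    Mh = np.outer(v_minus, v_plus)
    Mv = np.outer(v_plus, v_minus)
    return Mh, Mv
```

For n = 3 this recovers the familiar 3 × 3 Sobel templates, and for n = 5 it reproduces the matrices shown above.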

Canny’s Edge Detector

This detector debuted in 1986 in a paper by John Canny (Canny, 1986). Shortly thereafter, it became one of the most popular edge detectors being used in practice.

Today, it is likely the most commonly used edge detector. The algorithm combines denoising, smoothing, and non-maximal suppression to obtain edge boundaries around objects in the image. An outline of the algorithm is as follows:

1. Run gaussian smoothing on the image with standard deviation σ, as in Section 3.2.2.

2. Convolve the image with a Sobel template from Section 3.3.2. Typically the template width depends on σ.

3. Use non-maximal suppression to remove multiple edge responses.

4. Connect edge pixels by using hysteresis thresholding.

Implementation of the first two steps has been described in previous sections. The smoothing step is intended to minimize the impact of noise to eliminate spurious responses. The amount of noise reduction is dependent on σ; however, if σ is too large, the edges will be blurry and hard to detect. After eliminating noise, the Sobel detector is used to locate the edge pixels. However, even when thresholded, the Sobel detector produces thick edges around objects. That is, for a single edge in the image, there may be multiple layers of pixels that exceed the threshold and are classified as edge pixels. This makes it difficult to pin down the precise location of an edge.

The third step of the algorithm is where this issue is resolved. In the non-maximal suppression step, the local 3 × 3 neighborhood of each pixel is examined to see if that location has the largest edge magnitude in the direction perpendicular to the edge. The edge direction is calculated using the outputs of each Sobel template as per Eq. (3.3). In the perpendicular direction, edge magnitudes are then linearly interpolated between the closest neighboring pixels. This is done by estimating the slope of the edge by Mv(x, y)/Mh(x, y), which is a result of the edge vector formulation (see Figure 3.6).

For example, consider the case depicted in Figure 3.8. In this case, the interpolated magnitudes at locations A and B would be:

M(A) = (Mv(x, y)/Mh(x, y)) M(x − 1, y − 1) + (1 − Mv(x, y)/Mh(x, y)) M(x − 1, y)

M(B) = (Mv(x, y)/Mh(x, y)) M(x + 1, y + 1) + (1 − Mv(x, y)/Mh(x, y)) M(x + 1, y)

Figure 3.8: Non-maximal suppression for Canny's edge detector. Edge magnitudes are linearly interpolated at locations A and B, then compared to the central pixel magnitude M(x, y). If M(x, y) is the largest, pixel (x, y) is considered an edge pixel.

The interpolated magnitudes are then compared to M(x, y), the edge magnitude at the central pixel. If M(x, y) is the largest of the three, the central pixel is retained as an edge pixel. Otherwise, its edge magnitude is set to zero. As a result, non-maximal suppression eliminates multiple responses to the same edge, and only retains the pixels that are closest to the edge, rather than those that are further away.

The final step of the algorithm is hysteresis thresholding, which is intended to connect chains of edge pixels where gaps may exist. An upper threshold U and a lower threshold L are defined. Then, pixels in the image are consecutively searched until one is found whose edge magnitude exceeds the upper threshold U. The neighbors of this pixel are examined, and any of those pixels whose edge magnitude exceeds the lower threshold L are converted to edge pixels. The neighbors of those edge pixels are searched in turn, and any neighbors exceeding L will again be converted.

This process continues until no more neighbors have edge magnitudes that exceed the lower threshold. As a result, one obtains a chain of edge pixels along an edge of the image. The initial pixel is the only one whose edge magnitude must exceed the upper threshold U. After the discovery of this chain, the image is searched for other pixels exceeding the upper threshold to start a new chain.

The result of the algorithm is a collection of edge pixel chains, each only one pixel wide. For this reason, when compared to other methods, the Canny edge detector has more precise edges with fewer gaps. The output of the algorithm on the urn image for different parameter values is given in Figure 3.9.

The Marr-Hildreth Edge Detector

Another very popular detector is the Marr-Hildreth operator (Marr and Hildreth, 1980). Rather than looking for the pixels where the magnitude of the first derivative is largest, this approach searches for pixels where the second derivative of the intensity is zero. The simplest approximation to the second derivative is the laplacian operator, which takes the difference of the first derivative at consecutive points:

f″(x) = f′(x) − f′(x + 1).

However, this approximation is even more sensitive to noise than the first derivative approximations discussed in Sections 3.3.1 and 3.3.2. To overcome the noise issue, Marr-Hildreth edge detection uses a laplacian of gaussian (LoG) operator that simultaneously smooths and twice differentiates the pixels.


Figure 3.9: (a) Urn image. (b) Canny edge detection results for L = 0.03, U = 0.08, and σ = 1. (c) Canny edge detection results for L = 0, U = 0.60, and σ = 1. (d) Canny edge detection results for L = 0.10, U = 0.20, and σ = 0.50.


Figure 3.10: (a) Convolution of urn image with 13 × 13 LoG template (σ = 2). (b) Zero crossing pixels. (c) A closer look at the center of the image.

Edge pixels are defined to be those pixels that contain the zero crossings of the second derivative. Unfortunately, this can be difficult to determine due to the discrete nature of the image. Finding zero crossings can be done by manual thresholding, local least squares fitting, or window based calculations. Since this detector is essentially pulling out contours in the image that follow the zero regions of the second derivative, the detected edges are all enclosed areas.

3.3.3 Other Edge Detection Techniques

There are a multitude of other edge detection techniques in the literature, but only the most popular and/or historically significant have been discussed in the previous section. Several other detectors are outlined below. In general, these techniques do not perform as well as the Canny or Marr-Hildreth edge detectors, but they do illustrate how edge detection has been formulated in many different ways, all of which are completely valid.

• The Kirsch edge detector (Kirsch, 1971) uses a set of masks rotated around the central pixel to determine edge information.

• The Wallis logarithmic edge detection technique (see Pratt, 1977) compares the logarithm of the intensity at the central pixel to the log intensity of its four immediate neighbors.

• Frei and Chen (1977) formulated each 3 × 3 window in the image as a 9-dimensional vector, and projected these vectors onto an orthogonal basis {v1, . . . , v9} spanning that space. The first four basis vectors v1, v2, v3, and v4 are specially chosen to emphasize edge information, and the other five basis vectors contain other information. The strength of the edge is measured by how small the angle is between the projection into the edge subspace and the projection onto all nine basis vectors.

• Phase congruency is a method that grew out of the concept of local energy (Morrone and Owens, 1987). In this formulation of the edge detection problem, intensity levels in the image are approximated by a sum of sine waves of different frequencies. At the location of an edge, the intensity levels look like a step function, where intensity is relatively constant on either side of the edge, but there is a step up on the edge itself. In order for the sine components to produce this step phenomenon, they must all be in phase at the step location. By locating pixels at which the sine waves are maximally in phase, edges can be revealed.

3.4 Shape Matching and Extraction

Section 3.2 gave a summary of techniques for noise reduction in imagery, the goals of which were to clean the image and prepare it for edge detection. In Section 3.3, a number of edge detection techniques were outlined. The goal of these algorithms is to extract a boundary around objects in the image as the next step toward object retrieval or recognition. Once the edges have been found in an image, there are a number of possible avenues to pursue. One of the simplest questions to ask would be whether a collection of edge pixels in the image matches an existing outline in a library or database of object shapes. This might be the first step toward identifying which objects are in the image, and what the scene depicts. Of course, the outline of an object in a database will not always have an exact match, even if the same object that produced the outline is present in the image. In this case, one may need to change the outline (e.g. scale, rotate, etc.) to find a match.

When a match is found, the object can be extracted from the image. From this point, certain features of the object’s shape can be quantified. Using these descriptors, one could then try to determine which existing class of objects this new outline fits into. These techniques are more complex and are briefly covered in Section 3.5. For now, we look at techniques for finding objects in images.

3.4.1 Basic Techniques

There are a couple of very basic approaches to object extraction that can be taken, even without detecting edges first. In situations where objects tend to be at the same intensity level and appear much lighter or darker than the background, a simple intensity threshold will do the trick. However, there is a clear weakness to a thresholding approach. When the brightness increases or decreases in the image, the threshold must be changed. In addition, if parts of the object have varying illumination levels (e.g. due to shadow), the shape extraction may be incomplete. A thresholding approach would only work in a very small number of applications.

In applications involving imagery over time, there is another easy approach to object extraction. If one has imagery of the same scene depicted multiple times, it may be possible to piece together a background image that shows the scene devoid of objects. In imagery where objects are present, a simple subtraction of the background image will yield zeros everywhere except at the object pixels. This is still subject to problems, however. For example, if part of the object matches closely enough to the background, those pixels may be removed with the background subtraction. The effect would be not unlike a weatherman who wears the wrong colored shirt to work and gives the forecast without a torso!

In the vast majority of applications, simple thresholding and background subtraction will not be enough to consistently extract objects. To account for differences in illumination and contrast, one can first extract the edge pixels from the image, then apply one of the more complex techniques in subsequent sections.

3.4.2 Template Matching

Aside from thresholding and background subtraction, template matching is the next simplest procedure to implement in practice. However, it too is useful in only a limited number of applications. The idea is that a sub-image of a known object is available, and one wishes to see if the larger image contains this sub-image. The sub-image is called the template (not to be confused with template operators in the previous section), and it can take the form of an edge outline, a binary shape representation of an object, or a picture of an object itself. A k × k′ template is represented by a function T(i, j), where i = 0, 1, . . . , k and j = 0, 1, . . . , k′.

Figure 3.11: (a) Image of an outdoor cafe. (b) The template T. (c) Accumulator function for the image, calculated at valid pixels. Blue indicates low values, and red high values. (d) The template superimposed on the minimizing pixel of the accumulator function.

Finding the best match to the template in the image is the same as finding the pixel (x, y) for which a squared error loss function is minimized. That is, find (x∗, y∗) satisfying:

(x∗, y∗) = arg min_{(x, y)} Σ_{i=0}^{k} Σ_{j=0}^{k′} (I(x + i, y + j) − T(i, j))²

Since an image is a discrete space, finding this minimizer is as easy as computing the function above for every pixel in the image. The collection of function values for all of the pixels is called an accumulator function, and the minimizer will be the pixel where this accumulator function is smallest (see Figure 3.11).
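The squared error accumulator can be sketched directly. This is illustrative code; the accumulator is computed only at placements where the template fits entirely inside the image, matching the "valid pixels" of Figure 3.11.

```python
import numpy as np

def template_match(image, template):
    """Accumulator of squared errors at each valid placement, and the
    minimizing (x, y) location of the template's top left corner."""
    I = image.astype(float)
    T = template.astype(float)
    h, w = I.shape
    th, tw = T.shape
    acc = np.empty((h - th + 1, w - tw + 1))
    for x in range(h - th + 1):
        for y in range(w - tw + 1):
            acc[x, y] = np.sum((I[x:x + th, y:y + tw] - T) ** 2)
    best = np.unravel_index(np.argmin(acc), acc.shape)
    return acc, best
```

Where the template occurs exactly in the image, the accumulator drops to zero, and the argmin recovers the template's location.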

This technique can also be generalized to include considerations of size and orientation. If multiple templates are available that show the original template rotated and scaled in various ways, the accumulator function can be calculated for each template separately. The minimizing pixel over all of these functions will be the one that matches the original template best, and at the specific rotation angle and size depicted in the template used to generate that function. Typically, one does not have access to templates rotated in a variety of angles and scaled to different sizes, so this approach may not be an option. However, image processing software can often produce rotated and scaled versions of images, which make this a possibility in some applications.

It is worth mentioning that template matching is very robust to noise and occlusion of objects in the image (Figure 3.12). Since it is only a best fit match, global noise in an image will only globally dampen the accumulator function. However, the value at the minimizing pixel will still be the smallest compared to the other pixels in the image. One could also implement a lower threshold on the sum of squares, for which all pixels with values below that threshold signify a template match. This could be useful in cases when multiple copies of the template are present. Similarly, in cases when no match is present, an upper threshold on the sum of squares may be appropriate. These thresholds would have to depend on both the size of the template and how strong a template match is desired.

3.4.3 Hough Transform Methods

Template matching is useful in certain applications, but in most cases, the method is too rigid. Often, it is preferable to search for objects or features in an image that are not constrained to match a specific template (or set of templates), but instead only need to satisfy some relaxed properties. As long as these properties are general enough, a robust class of objects can be detected in imagery without a significant amount of computation.

Figure 3.12: (a) Image of the outdoor cafe, with salt-and-pepper noise. 20% of pixels had intensities flipped. (b) The template T. (c) Accumulator function for the noisy image, calculated at valid pixels. (d) The template superimposed on the minimizing pixel of the accumulator function. (e) Canny edges for the cafe image. (f) A version of the template with an outline of the object. (g) Accumulator function for the edge image and template, with a close-up view at the minimizing pixel. (h) The edge template superimposed on the minimizing pixel of the accumulator function.

One of the most popular ways to do this is the Hough transform, proposed by Paul Hough (1962). Subsequent work, e.g., Rosenfeld (1969); Duda and Hart (1972), popularized it as an image processing technique. It is faster than template matching, and can provide similar results, but for more generalized shapes like lines and circles.

At its root, the Hough transform is an evidence gathering approach to object detection. The first step is to use an edge detector to reduce an image to binary form, where 1 represents object edges and 0 represents object interiors and background. The transform also requires a parameterized shape, which it then tries to fit to the image.

For each edge pixel, the transform considers all possible parameter values of the shape that would result in that pixel lying on the shape boundary. For each combination of parameters where this is true, a vote is cast in an accumulator function over the parameter space. After this process is repeated for every edge pixel, the combination of parameters with the most votes gives the best fitting shape.

This evidence gathering approach allows a large number of sizes and orientations of a shape to be considered, all with low computational cost. Specifics are given in the sections below outlining this approach for lines and circles. However, the method can be generalized to any shape through the generalized Hough transform (see Ballard, 1981, for details).

The Hough Transform for Lines

Consider the standard form of a line, Ax + By = C. This equation can be rewritten as αx + βy + 1 = 0, where α = −A/C and β = −B/C. Typically the line is considered an object in (x, y) space for a fixed pair of parameters (α, β). However, one could also consider the line to be an object in (α, β) space for a fixed point (x, y). In this formulation, a certain point (x∗, y∗) lies on an infinite number of lines, the parameters of which are arranged in linear fashion in the parameter space of (α, β). The equation of this line is αx∗ + βy∗ + 1 = 0.

For each edge pixel in the image, then, one can plot the line through that pixel in the parameter space. The point where the most lines cross will yield a specific parameter pair (α∗, β∗) which corresponds to the line that best fits the arrangement of edge pixels in the image.

In practice, one uses an accumulator function over a mesh in the parameter space.

For each edge pixel (x, y), the accumulator function is increased by one at every parameter pair (α, β) on the line αx + βy + 1 = 0. After all edge pixels have been processed, the optimal set of parameters is the one at the location where the accumulator function is largest. The relative heights of other peaks in the accumulator function reveal the degree to which other parameter pairs fit the image. By choosing parameters corresponding to the largest peaks in the accumulator function, one can detect multiple lines that are present in an image.

This formulation is difficult to implement, because the parameters α and β correspond to m/b and −1/b, respectively, where m is the slope of the line and b is the y-intercept. Since a y-intercept of b = 0 makes the parameters infinite valued, discretization of the parameter space is difficult. To use the Hough transform, then, it is helpful to parameterize the line in such a way as to have bounded parameters.

One solution is the polar representation of a line, illustrated in Figure 3.13. A perpendicular is drawn to the line from the top left corner of the image. The length of this perpendicular is ρ, and the angle of the perpendicular to the x-axis (the top of


Figure 3.13: (a) Polar representation of a line where θ < π/2. (b) Polar representation of a line where θ > π/2.

the image) is θ. Every line that could appear in the p × q image is uniquely determined by a pair (θ, ρ), where θ ∈ [0, 2π) and ρ ∈ [0, √(p² + q²)].

The parameter space for (θ, ρ) can be easily discretized and an accumulator function generated from each edge point. Note that since each line is no longer represented as a linear function in (θ, ρ), the edge pixels will not create lines in the parameter space, but rather sinusoidal curves. Figure 3.14 shows an example of an image, the accumulator function, and the detected lines for the Hough transform.
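To make the voting concrete, the polar-form procedure can be sketched in a few lines. This is a minimal illustration assuming numpy; the function name hough_lines is hypothetical, and θ is restricted to [0°, 180°) as in Figure 3.14 rather than the full [0, 2π) range.

```python
import numpy as np

def hough_lines(edges, n_theta=180):
    """Accumulate votes in (rho, theta) space for a binary edge image.

    Each edge pixel (x, y) traces the sinusoid rho = x cos(theta) + y sin(theta),
    and the accumulator counts how many sinusoids pass through each cell.
    """
    p, q = edges.shape
    rho_max = int(np.ceil(np.sqrt(p**2 + q**2)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((rho_max + 1, n_theta), dtype=int)
    ys, xs = np.nonzero(edges)              # rows are y, columns are x
    for x, y in zip(xs, ys):
        rhos = x * np.cos(thetas) + y * np.sin(thetas)
        valid = rhos >= 0                   # keep rho in [0, rho_max]
        acc[np.round(rhos[valid]).astype(int), np.nonzero(valid)[0]] += 1
    return acc, thetas

# A vertical line of edge pixels at x = 5 votes most heavily near theta = 0.
edges = np.zeros((20, 20), dtype=bool)
edges[:, 5] = True
acc, thetas = hough_lines(edges)
rho_hat, theta_idx = np.unravel_index(acc.argmax(), acc.shape)
```

On the synthetic image, the accumulator peaks at ρ = 5, θ = 0, recovering the vertical line x = 5.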

The Hough Transform for Circles

As with lines, the equation of a circle can also be formulated in two ways. The equation of a circle centered at (x0, y0) with radius r is given by (x − x0)² + (y − y0)² = r². This can be thought of as a function for fixed parameters (x0, y0, r) and variables (x, y), or as a function for a fixed point (x, y) and variables (x0, y0, r). In this case, a reformulation of the parameters is not necessary, since they are all bounded.


Figure 3.14: (a) Image of a hawk. (b) Canny edges for the hawk image. (c) Accumulator function, with ρ = 1, . . . , ⌊√(p² + q²)⌋ on the vertical and θ = 0◦, . . . , 180◦ on the horizontal. (d) Line corresponding to ρ = 76 pixels, θ = 7◦, the maximizing parameter pair in the accumulator function (see arrow in (c)). (e) Lines corresponding to the four highest peaks in the accumulator function.

For a p × q image, we have x0 ∈ [1, p], y0 ∈ [1, q], and r ∈ [1, (1/2)√(p² + q²)]. The space can again be discretized, and evidence is gathered in an accumulator function.

However, unlike the Hough transform for lines, the accumulator function is now three dimensional.

A possible approach could be to generate the entire accumulator function and choose the center and radius combination that gives the highest value. In other applications, the radius r is already known (or approximated), and a two-dimensional


Figure 3.15: (a) Image of a cactus flower. (b) Canny edges for the cactus image. (c) Accumulator function over the (x0, y0) parameter space for r = 100 pixels. (d) Circle corresponding to (x0, y0) = (171, 179), the maximizing parameter pair in the accumulator function. (e) Image of a starfish. (f) Canny edges for the starfish image. (g) Accumulator function over the (x0, y0) parameter space for r = 10 pixels. (h) Top 10 circles corresponding to the largest parameter pairs in the accumulator function.

accumulator function can be used to find the best center (x0, y0). Examples for two images and different circle radii can be found in Figure 3.15.
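The fixed-radius case can be sketched as follows; this is an illustrative numpy version (hough_circles is a name chosen here, not code from this project), with each edge pixel voting along a circle of candidate centers.

```python
import numpy as np

def hough_circles(edges, r, n_angles=None):
    """Vote in (x0, y0) space for circles of known radius r.

    Each edge pixel (x, y) votes for every candidate center lying on the
    circle of radius r around it.
    """
    p, q = edges.shape
    acc = np.zeros((p, q), dtype=int)
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles or 8 * r, endpoint=False)
    ys, xs = np.nonzero(edges)
    for x, y in zip(xs, ys):
        x0 = np.round(x - r * np.cos(angles)).astype(int)
        y0 = np.round(y - r * np.sin(angles)).astype(int)
        keep = (x0 >= 0) & (x0 < q) & (y0 >= 0) & (y0 < p)
        np.add.at(acc, (y0[keep], x0[keep]), 1)   # unbuffered accumulation
    return acc

# Rasterize a circle of radius 8 centered at (x, y) = (20, 25) and recover it.
edges = np.zeros((50, 50), dtype=bool)
t = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
edges[np.round(25 + 8 * np.sin(t)).astype(int),
      np.round(20 + 8 * np.cos(t)).astype(int)] = True
acc = hough_circles(edges, r=8)
cy, cx = np.unravel_index(acc.argmax(), acc.shape)
```

Because of rounding in both the rasterized edges and the votes, the recovered center may be off by a pixel; the accumulator peak still lands at or immediately next to (20, 25).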

The Generalized Hough Transform

The evidence gathering approach taken by the Hough transform can be extended to arbitrary shapes (see Ballard, 1981; Merlin and Farber, 1975) in a technique called the generalized Hough transform (GHT). In essence, the shape template can be described by a curve

v(θ) = (x(θ), y(θ)).

The actualization of the shape in the image, denoted w(θ), will be a rotated, scaled, and translated version of the original template. For a rotation matrix R, scale factor λ, and coordinates translated by (x0, y0), the image shape can be written as:

w(θ) = (x0, y0) + λRv(θ).

Each edge pixel of the image shape is a realization of the curve w(θ) in the discrete image space. Therefore, for fixed R and λ, an edge pixel w(θ∗) can cast votes in a two-dimensional accumulator function for the variables (x0, y0). These votes are cast for all points (x0, y0) satisfying

(x0, y0) = w(θ∗) − λRv(θ).

To find the optimal values of R and λ, this process can be repeated over a grid, effectively creating a four-dimensional accumulator function.

There are a few difficulties in implementation, mostly relating to the fact that a continuous shape is being represented in a discrete image space at different orientations and scales. This makes it challenging to determine which point in the shape template an edge pixel represents, especially with low resolution images. However, the process above is a basic outline of the procedure.
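Those caveats aside, the voting rule for fixed R and λ can be sketched directly. The following numpy illustration (ght_votes, the square template, and the 100 × 100 accumulator are all choices made here) matches template points against edge pixels exactly:

```python
import numpy as np

def ght_votes(edge_pts, template_pts, scale=1.0, angle=0.0, shape=(100, 100)):
    """Two-dimensional GHT accumulator for fixed rotation and scale.

    Every edge pixel w votes at (x0, y0) = w - scale * R v for each
    template point v, following the voting rule above.
    """
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    acc = np.zeros(shape, dtype=int)
    for w in edge_pts:
        centers = np.round(w - scale * (template_pts @ R.T)).astype(int)
        keep = ((centers[:, 0] >= 0) & (centers[:, 0] < shape[0]) &
                (centers[:, 1] >= 0) & (centers[:, 1] < shape[1]))
        for x0, y0 in centers[keep]:
            acc[x0, y0] += 1
    return acc

# A square outline template translated to (x0, y0) = (30, 40), unrotated:
template = np.array([(x, y) for x in range(10) for y in range(10)
                     if x in (0, 9) or y in (0, 9)], dtype=float)
edge_pts = template + np.array([30.0, 40.0])
acc = ght_votes(edge_pts, template)
```

With no rotation or scaling, every one of the 36 edge pixels votes for the true translation (30, 40), which therefore receives the unique maximum of 36 votes.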

Although not discussed here, there are other parameterizations of shapes that enable quicker matching. For example, invariant generalized Hough transforms use parameterizations of shapes that are invariant to one of the operations (e.g. rotation).

Other approaches bypass calculation of the entire accumulator space, and instead perform a search to locate the maximum.

3.4.4 Deformable Templates

Deformable template matching is an elastic matching method, in the sense that the template being fit to the image is allowed to be transformed in more ways than simple scaling, translation, and rotation. The method was originally used by Yuille et al. (1989) for facial recognition, and the formulation here was given by Jain et al. (1996). The algorithm first requires a template of the object one wishes to find in the image. Without loss of generality, this template is assumed to be a curve over the unit square [0, 1] × [0, 1].

A deformation of the template is achieved through the use of a displacement function D(x, y), for each point (x, y) on the template. The form of the displacement function is a sum that utilizes trigonometric functions of increasing frequency, with higher frequencies being responsible for smaller scale deformation. A displacement function uses two orthogonal bases

e^x_mn(x, y) = (2 sin(πnx) cos(πmy), 0)

e^y_mn(x, y) = (0, 2 sin(πny) cos(πmx)),

where m, n ∈ Z⁺, as well as a set of corresponding parameters (ξ^x_mn, ξ^y_mn). The displacement function itself is given by

D∗(x, y) = Σ_{m=1}^{∞} Σ_{n=1}^{∞} [ξ^x_mn · e^x_mn + ξ^y_mn · e^y_mn] / [απ²(n² + m²)],   (3.10)

with constant α, which controls the strength of the displacement. Typically, this function is approximated by setting ξ^x_mn = 0 and ξ^y_mn = 0 for all m > M and n > N. In this case, the displacement function becomes

D(x, y) = Σ_{m=1}^{M} Σ_{n=1}^{N} [ξ^x_mn · e^x_mn + ξ^y_mn · e^y_mn] / [απ²(n² + m²)].   (3.11)


Figure 3.16: (a) Grid over the unit square. (b) Deformation from Eq. (3.11) applied for M = 3, N = 3, with ξ^x_11 = ξ^x_13 = 1, ξ^y_12 = ξ^y_21 = ξ^y_23 = ξ^y_32 = 1, and all other ξ^x_mn and ξ^y_mn set to 0. Here, α = 1. (c) The same deformation coefficients, but now α = 0.5. (d) A template of a bird on the unit square (centered at (0, 0) for easier rotation). (e) Deformation from (b) applied to the bird template. (f) Bird template rotated by 30◦, deformed using the displacement function from (c), and then scaled by 1/2.

An example of a displacement function and its effect on a shape is given in Figure 3.16.
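Eq. (3.11) is straightforward to evaluate directly. A small numpy sketch follows; the function name and coefficient layout are chosen here for illustration.

```python
import numpy as np

def displacement(x, y, xi_x, xi_y, alpha=1.0):
    """Evaluate the truncated displacement field D(x, y) of Eq. (3.11).

    xi_x and xi_y are M-by-N arrays holding the coefficients xi^x_mn and
    xi^y_mn; higher (m, n) terms contribute smaller-scale deformation.
    """
    M, N = xi_x.shape
    dx, dy = 0.0, 0.0
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            norm = alpha * np.pi**2 * (n**2 + m**2)
            # basis functions e^x_mn and e^y_mn from the text
            ex = 2.0 * np.sin(np.pi * n * x) * np.cos(np.pi * m * y)
            ey = 2.0 * np.sin(np.pi * n * y) * np.cos(np.pi * m * x)
            dx += xi_x[m - 1, n - 1] * ex / norm
            dy += xi_y[m - 1, n - 1] * ey / norm
    return dx, dy

# With only xi^x_11 nonzero, points are bent horizontally, most strongly
# near the center of the unit square.
xi_x = np.zeros((3, 3)); xi_y = np.zeros((3, 3))
xi_x[0, 0] = 1.0
dx, dy = displacement(0.5, 0.25, xi_x, xi_y)
```

Adding the returned (dx, dy) to each point of a grid reproduces deformations like those in Figure 3.16; note that the x-component vanishes at x = 0 and x = 1 because sin(πnx) = 0 there.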

A deformed template is a combination of displacement, scaling, rotation, and translation. The original template T0 is acted on by rotation operator RΘ at angle Θ, deformed by displacement function D(·, ·), scaled by a factor of s, and translated by d = (d^x, d^y). As a result, the deformed template can be written as:

T(x, y) = T0(s · {(x, y) + D(RΘ(x, y))} + (d^x, d^y)).

To attract edges of the deformed template to features in the image, an edge potential is defined. Using the output of an edge detector (e.g. Canny’s detector) on the image, the edge potential at point (x, y) is defined by:

Φ(x, y) = − exp{−ρ · δ(x, y)},

where δ(x, y) is the L2 distance from (x, y) to the nearest edge pixel. If the template fits the edges well, this potential will be close to −1, since the distances δ(x, y) will all be close to 0.

To incorporate directional information, let β(x, y) denote the angle between the tangent lines at the template edge and the image edge at (x, y). This angle is used to modify the edge potential and calculate the energy function

E(T) = (1/nT) Σ_{(x,y)} [1 + Φ(x, y)|cos(β(x, y))|],

where nT is the number of pixels on the template. The energy is a measure of how badly the template fits the edges of the image. If the edge directions at (x, y) for the image and the template are close to orthogonal, the contribution of that pixel to the sum will be nearly 1. If instead the directions align, the contribution is (1 + Φ(x, y)). If the template matches the edge pixels well and the directions line up correctly, the energy will move closer to zero. Therefore, to fit a deformed template to an image, the goal is to find the specific rotation, scale, translation, and displacement function that minimizes this energy.
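The potential and energy are easy to prototype. This numpy sketch uses a brute-force nearest-edge search in place of a proper distance transform; the function names and toy edge set are illustrative only.

```python
import numpy as np

def edge_potential(points, edge_pts, rho=1.0):
    """Edge potential Phi(x, y) = -exp(-rho * delta(x, y)).

    delta is the L2 distance to the nearest edge pixel, found here by
    brute force; a distance transform would be used in practice.
    """
    d = np.sqrt(((points[:, None, :] - edge_pts[None, :, :])**2).sum(-1))
    return -np.exp(-rho * d.min(axis=1))

def energy(template_pts, edge_pts, cos_beta, rho=1.0):
    """E(T) = (1/n_T) * sum over template pixels of 1 + Phi * |cos(beta)|."""
    phi = edge_potential(template_pts, edge_pts, rho)
    return np.mean(1.0 + phi * np.abs(cos_beta))

# If the template sits exactly on the edges with matching tangents
# (|cos beta| = 1), every Phi is -1 and the energy is 0.
edge_pts = np.array([[i, 0.0] for i in range(10)])
E_good = energy(edge_pts, edge_pts, cos_beta=np.ones(10))
# Orthogonal tangents contribute 1 each, the worst case for this fit.
E_bad = energy(edge_pts, edge_pts, cos_beta=np.zeros(10))
```

A perfectly aligned template attains the minimum energy of 0, while orthogonal edge directions push every term to 1.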

Jain et al. (1996) use a multiresolution algorithm to accomplish this task. At the first step, they discretize the parameter space and only consider displacement functions with M = N = 1. The best fitting templates in this step are used as original templates in the next step, where a finer grid can be used and M = N = 2. The best fits are used as input to the finest scale resolution matching step, where M = N = 3. Through proper choice of grid size and thresholds on the energy function, a template can be fit to an image with reasonable computing time.

3.5 Post-Detection Processing

After edge detection and some form of template matching or extraction has taken place, an object can be considered detected. What happens next depends on the application, but it typically involves an attempt to determine what the object is and what its relation is to other objects in the image. An object can be described in a number of ways:

• Outline description - Includes techniques like chain coding, Fourier descriptors, connectivity, circularity/ellipticity, and curve/shape analysis

• Basic region description - Includes descriptors like area, perimeter, and invariant moments

• Advanced region description - Includes color and texture description, and image segmentation

Using these descriptions of objects, one can go on to describe relationships between objects, usually through clustering or classification. These techniques are beyond the scope of this thesis, but they are the cutting edge of image processing and recognition.

Given this summary of image processing techniques, we now return to the cairn detection problem, and use some of these algorithms to find needles in haystacks.

CHAPTER 4

CAIRN DETECTION

4.1 Introduction

The first step in detecting a particular type of object (e.g. a cairn) is to characterize exactly what makes that object different from other objects and the background. In the NSF-HSD project, one way to do this is to observe a training set of cairns found in the field. In January of 2007 and 2008, a field team was able to uncover a large number of monuments in Yemen and Oman. In all, there were more than 350 monuments with labels “HCT” (cairns) or “possible HCT” (possible cairns). When a cairn was discovered in the field, a protocol was followed to ensure that certain information was recorded.

First, the GPS coordinates were taken at various locations around the perimeter of the cairn. Then, a monument form was filled out by one of the team members.

The form contains information measuring physical dimensions of the cairn (height, diameter, preservation level) as well as location information (landform type, number of other monuments nearby, visibility from the wadi). Figure 4.1 gives summary statistics for the cairns discovered in Yemen in 2007.

Note that over half of the cairns are poorly preserved. These cairns have fallen over or collapsed, and often look like a pile. Also, there is evidence that some of the oldest cairns were dismantled for materials to build newer cairns. More recently, cairns have also been robbed for valuables, which negatively impacts the preservation level. Many of the most poorly preserved cairns are not visible in the satellite imagery, and have been removed from the training set. However, some of the better preserved cairns are clearly visible.

In all, the training set contains 9 cairns from Oman imagery, and 97 from Yemen.

Less time has been spent in Oman until recently, hence the lower number of cairns from that region in the dataset. Figure 4.2 shows a group of cairns randomly selected from the training set. One can see that cairns do stand out from their immediate environment, but there is also quite a lot of variability from cairn to cairn. This variability is partly due to the different preservation levels of the cairns; however, the cairns still seem to have several features in common that help distinguish them from the rest of the landscape.

One of the first noticeable features of cairns is the relative intensity difference between the cairn and its background. When the cairns were built, the most likely scenario is that the builders used stones taken from the immediate vicinity, since they would not have to carry them a long distance. Therefore, one of the patterns we see in the satellite imagery is a clearing of rubble around the cairns. This produces a relatively lighter colored background behind the cairn.

Another distinctive feature of the cairns is their tendency to be circular. All of the cairns found in the field were originally built as cylinders, and many of them remain this way. Some of them have collapsed, especially when they are built too close to

               Minimum   First Quartile   Median   Third Quartile   Maximum
Height (m):       0.15             0.75     0.97             1.30      2.60
Diameter (m):     0.30             2.80     3.50             4.40      7.70
Elevation (m):  500.76           846.02   879.43           982.11   1331.60


Figure 4.1: (a) Statistics on 288 cairns discovered in Yemen in 2007. (b) Histogram of cairn preservation levels. (c) Histogram of cairn diameters.

the wadi slopes. As a result, most cairns look circular from space, but some of them might look oblong if they have partially collapsed down the slope.

Finally, depending on the time of day the satellite imagery was taken, cairns might display some shadowing. This is especially true for better preserved and/or newer cairns, because they tend to stand higher. Poorer preserved cairns tend to be flatter and wider because the interior chamber has collapsed.

Designing a successful cairn detection algorithm requires locating all pixels in a satellite image that might belong to a cairn. There are a number of options for how to proceed, and a few are listed below:

Figure 4.2: A random selection of 49 cairns from the training set. Each cairn window has intensities normalized to [0, 1] scale within that window. This is to enhance the contrast for display purposes only.

1. Threshold pixels based on panchromatic intensity, RGB layers, near infrared, or digital elevation model (DEM) values.

2. Model a background process for the satellite image and then measure significant deviations from this process.

3. Cluster the image into many groups and locate the “cairn group”.

4. Center a window at each pixel in the image and run an analysis on the window to determine whether a cairn is present.

Dealing with satellite imagery introduces several complicating factors. First of all, the images themselves are massive. The image shown in Figure 2.3 (Chapter 2) is of the Wadi Idim area in Yemen, and is labeled as Polygon 9. Since a large number of training cairns came from this region, it will be used throughout the chapter to show how the detection algorithm works. Polygon 9 is 17,650 pixels long and 10,824 pixels wide, which means it contains almost 200 million pixels. This is much too large for programs like Matlab to hold the entire image in memory at once for processing.

Also, the sheer size makes it much too time consuming to run complex operators (e.g. anisotropic diffusion, mean shift filtering, template matching) over the image.

A second problem is that the cairns themselves constitute a very small proportion of the pixels in the image. Even assuming 1000 cairns were present in the image, each taking up a 13 × 13 pixel region of imagery, they would still collectively comprise less than 0.1% of the 200 million pixels overall. Indications from the field are that far fewer than this number are actually present in each image region. Due to the relatively small size of the cairns, almost the entire image could be considered background. This makes background modeling processes like #2 far too noisy to pick out process deviations at the cairn locations.

Also, not only is the background itself highly variable, but so are the cairns. Depending on the local soil type, cairns in one part of the region can have different intensities and color compositions than those in other areas. While cairns are relatively darker than their immediate surroundings, the actual intensity values may differ between cairns. This makes approaches like #1 and #3 difficult, or even impossible.

Ultimately, owing to the small proportion of cairn pixels in the image and large- scale variation across the image, the best approach seems to be #4. One need only consider local information to determine whether an object is a cairn. Furthermore, the calculations involved must be basic enough to obtain detection results in a timely fashion.

The cairn detection algorithm we ended up developing uses a sliding window approach. The image itself is diced into 1048 × 1048 pieces that overlap by 24 pixels on each edge. Then, a 25 × 25 window is centered around each pixel in the image (the overlap in the diced pieces allows this window to fit around all of the pixels). Each window is sent through a battery of increasingly complex tests, with a window only progressing to a test if it has passed all of the previous ones. In this way, the least cairn-like objects are weeded out immediately by the simple tests, and objects that look more like cairns are tested more extensively. The overall effect of the algorithm is like passing the image through a series of screens, each with a much smaller mesh size. These screens first filter out the objects that look least like cairns, until only the best candidates remain.

To put this in the context of image processing, each filter is a group operator (Section 3.2.3) passed over all candidate pixels in the image. After obtaining values from the operator, the pixels are thresholded (as in Section 3.2.1). Pixels that do not meet the threshold criterion are removed from the candidate pool. Ultimately, after all operators and thresholds are applied, the remaining candidate pixels and their neighbors should form objects that look like cairns.

Finally, to avoid confusion, common notation will be used in all of the discussions of operator values and thresholds. In keeping with notation from Chapter 3, a 25×25 window is considered an image in itself and will be called I. The notation I implicitly assumes the intensity (panchromatic) band in the satellite image is being used. In cases when another band is used, a subscript will indicate this (e.g. IR for the red band of imagery). The pixel value in the ith row and jth column of the window I will be denoted I(i, j). Note that the central pixel in each window will therefore be I(13, 13). Each operator will be denoted J, with a subscript indicating which operator was used. The lower and upper thresholds for J will be denoted L and U, respectively, with a subscript matching the one for J.

4.2 Blob Detection

The first filter used in the detector must be as basic as possible, because it will be computed over every single pixel in the satellite image. In addition, it needs to discriminate well enough to eliminate the vast majority of pixels in the image and keep them from moving on to more complicated testing. The filter that fits this description is a simple “blob detector”.

The idea is to locate 25 × 25 windows of imagery that contain something dark in the middle and lighter around the edges. To make this comparison, only the panchromatic intensity layer of the image is used. The intensity window I is broken into an inner w × w window (with w odd) and the remaining outer section. Let the set of inner window pixels be

W = {(i, j) : i = 13 − r, . . . , 13 + r, j = 13 − r, . . . , 13 + r},

where r = (w − 1)/2 is the radius of the window. The corresponding outer region is W^c = {(i, j) : (i, j) ∉ W}.

From Figure 4.1, we see that most cairn diameters are between 2 meters and 7 meters, which translates into 3.5 pixels and 12 pixels, respectively. The inner window is intended to cover the darkest pixels in the cairn, while the outer region should contain mostly lighter pixels. I tested different window widths in the 3 to 11 pixel range, and it seems that w = 7 is the best window size to use for most cairns.

After the window I is split into the w × w inner window W and its complement, one can imagine running a test to see if the pixels on the inside are significantly darker than those on the outside. The w² pixels in the inner window are considered independent draws from one group, and the 625 − w² remaining pixels are treated as draws from a second group. Let the mean and standard deviation of the inner window pixels be Ī(W) and sI(W), respectively. Similarly, let the mean and standard deviation of the outer region be Ī(W^c) and sI(W^c). Then, the intensities between the regions are compared using the pooled two-sample t-statistic:

JB = [Ī(W^c) − Ī(W)] / √(Sp² · (1/w² + 1/(625 − w²)))   (4.1)

Here, the pooled variance Sp² is defined by

Sp² = [(w² − 1)sI(W)² + (625 − w² − 1)sI(W^c)²] / [w² + (625 − w²) − 2] = (1/623)[(w² − 1)sI(W)² + (624 − w²)sI(W^c)²].
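Eq. (4.1) can be prototyped directly. In this numpy sketch (blob_statistic is a name chosen here, not the dissertation's code), the synthetic blob has far stronger contrast than real imagery, so its JB value is much larger than the 7 to 30 range observed for actual cairns.

```python
import numpy as np

def blob_statistic(I, w=7):
    """Pooled two-sample t-statistic J_B of Eq. (4.1) for a 25x25 window.

    Compares the inner w-by-w window W against the remaining outer region;
    large positive values mean a dark blob on a lighter background.
    """
    assert I.shape == (25, 25) and w % 2 == 1
    r = (w - 1) // 2
    mask = np.zeros((25, 25), dtype=bool)
    mask[12 - r:13 + r, 12 - r:13 + r] = True   # inner window W, center I(13,13)
    inner, outer = I[mask], I[~mask]
    n_in, n_out = w**2, 625 - w**2
    sp2 = ((n_in - 1) * inner.var(ddof=1) +
           (n_out - 1) * outer.var(ddof=1)) / (n_in + n_out - 2)
    return (outer.mean() - inner.mean()) / np.sqrt(sp2 * (1 / n_in + 1 / n_out))

# A dark 7x7 blob (intensity ~20) on a light, noisy background (~100):
rng = np.random.default_rng(0)
I = 100.0 + rng.normal(0.0, 5.0, (25, 25))
I[9:16, 9:16] = 20.0 + rng.normal(0.0, 5.0, (7, 7))
JB = blob_statistic(I)
```

On this exaggerated example JB comes out well above 20; a pure-noise window would instead give a value near 0.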

It is important to note that even though the independence assumptions are clearly violated (images have a definite correlation structure), the t-statistic still gives a useful measure of interior darkness vs. exterior lightness. Furthermore, the statistic is normalized by the pooled variance and only measures a relative difference in intensity. Therefore, this statistic is robust to changes in soil type and variability from window to window. This helps to ensure that most cairns will have large JB values, even though they may be in regions of the image that appear much different.

Large positive values of JB indicate a large discrepancy between a dark inner window and a lighter outer region. Therefore, in order to filter out non-cairn pixels, a lower threshold LB is set on the t-statistic, as well as an optional upper threshold UB. These thresholds are selected by examination of a training set of cairns. In the training set of cairns from Yemen and Oman, I have found that most JB values for cairns fall between LB = 7 and UB = 30. These limits do change when comparing regions in different climates (e.g. Yemen vs. Oman), but seem to be relatively stable within those areas. For example, Oman cairns tend to have better contrast with their background and get higher JB values (∼ 15 to 30) than Yemen cairns (∼ 7 to 25).

Note that several pixels in an object may have large JB values that exceed LB. To avoid multiple detections of the same object, a window I will pass the detector only if the JB value at the central pixel is the maximum in its local 9 × 9 window.

Depending on how close cairns are expected to be to one another, this window could be chosen larger or smaller than 9 × 9.

After the initial pass with this detector, the only objects that remain will be those which appear as dark blobs against a lighter background. Since the detector only requires the calculation of averages, it is very fast. Still, computation of these values for an entire 200 million pixel satellite image can take up to 8 hours. The price paid in computation is worth it in the end, however, because the filter tends to eliminate more than 95% of the pixels in the image from further consideration. See Section 4.9 for an example of this filter's performance on Polygon 9.

4.3 Vegetation Removal

Even with the thresholding on JB, a large number of non-cairn objects still make it through. One of the largest groups of these non-cairns is vegetation. The stream channels (called wadis) that wind their way through the region are dotted with small, dry bushes that are well suited to the arid climate. The size of these bushes is often close to that of a cairn, and their sparse distribution around the landscape means that the area around each bush is usually clear of other vegetation. As a result, bushes appear to be cairn-sized blotches against a relatively lighter background, and tend to have large JB values. Figure 4.3 shows an example of 20 bushes that pass the thresholds LB and UB for Polygon 9. Note that they all look a lot like cairns, except for the higher contrast between the bush and the background soil.

To remove most of the vegetation from the detected objects, the next filter uses a value called the Normalized Difference Vegetation Index (NDVI). This value is typically used in the remote sensing literature to find large sections of satellite imagery that contain forests, grassland, and other forms of vegetation. The usefulness of NDVI rests on basic scientific knowledge about photosynthesis and the effect it has on vegetation imagery. Plants absorb energy from the sun, but only at wavelengths that are useful for synthesizing organic molecules. These useful wavelengths absorbed by plants fall in the part of the visible spectrum corresponding to red light, which is why plants appear green. However, at longer wavelengths in the near infrared range, plants cannot use the energy in the photons, which are reflected away. Other

Figure 4.3: Examples of 25 × 25 vegetation windows passing the first filter from Yemen and Oman. To display properly, intensities were scaled to [0, 1] within each window before plotting.

materials (soil, clouds, water) do not tend to absorb as much red light relative to near infrared, so this difference can be exploited to find patches of vegetation in the imagery. NDVI is calculated by

NDVI = (IR − R) / (IR + R),

where IR is the near infrared value for the pixel and R is its red value. Note that this value lies in the range [−1, 1]. Typically, dense canopies of vegetation have NDVI values in the range of 0.3 to 0.8. Unfortunately, owing to the sparseness of vegetation in southern Arabia, vegetation here tends to have an NDVI of only around 0.1 or 0.2. However, since cairns have NDVI values closer to 0.05, there is only a slight overlap between the two groups.

Figure 4.4: A window I containing vegetation. NDVI values INDVI(i, j) are overlaid on the image. The grid shows the size of the 2.4 meter resolution pixels in which the IR and Red values are measured. The red box in the center is the 5 × 5 window over which NDVI values are averaged. In this case, the average is 0.101, which would likely fall above the upper threshold UV. This bush would not pass the vegetation filter.

Figure 4.5: A window I containing a cairn. NDVI values INDVI(i, j) are overlaid on the image. The grid shows the size of the 2.4 meter resolution pixels in which the IR and Red values are measured. The red box in the center is the 5 × 5 window over which NDVI values are averaged. In this case, the average is 0.055, which would likely fall between LV and UV. Therefore, the cairn would pass the vegetation filter.

In the cairn detection algorithm, the filter JB returns a collection of windows I containing darker blobs on lighter backgrounds. To implement the second filter, NDVI values are calculated for these objects. Since the RGB and IR channels are all at 2.4 meter resolution, one of these pixels takes up a 4 × 4 pixel block at the 0.6 meter resolution of the panchromatic imagery. Therefore, the 25 × 25 windows in the red and near infrared bands (IR and IIR, respectively) will exhibit a block structure.

Since bushes are not very large, one 2.4m pixel often covers part of the bush and part of the background soil. The IR and IIR values for that pixel, therefore, are a blend of the vegetation and the soil, which gives a lower NDVI value than the vegetation itself. To account for this blending phenomenon, NDVI values are averaged for all 0.6m pixels in a small window centered on the object. There was almost no difference between using a 5 × 5 window instead of a 7 × 7 or 9 × 9 window, so the 5 × 5 was chosen for faster computation. Therefore, the calculated value is given by

JV = (1/25) Σ_{i=11}^{15} Σ_{j=11}^{15} [IIR(i, j) − IR(i, j)] / [IIR(i, j) + IR(i, j)].   (4.2)

Figures 4.4 and 4.5 show how the process would work for a bush and a cairn, respectively.
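Eq. (4.2) amounts to averaging per-pixel NDVI over the central 5 × 5 block. A numpy sketch with hypothetical band values follows; note the 1-indexed range i, j = 11, . . . , 15 becomes the 0-indexed slice 10:15.

```python
import numpy as np

def vegetation_index(I_ir, I_red):
    """Average NDVI J_V of Eq. (4.2) over the central 5x5 of a 25x25 window.

    I_ir and I_red are the 25x25 near-infrared and red windows (already
    resampled to the 0.6 m panchromatic grid, hence the block structure).
    """
    ir = I_ir[10:15, 10:15].astype(float)    # rows/cols 11..15, 1-indexed
    red = I_red[10:15, 10:15].astype(float)
    return np.mean((ir - red) / (ir + red))

# A hypothetical bush-like window (higher IR reflectance than red) versus
# a cairn-like window (nearly equal bands):
bush_ir, bush_red = np.full((25, 25), 130.0), np.full((25, 25), 105.0)
cairn_ir, cairn_red = np.full((25, 25), 110.0), np.full((25, 25), 99.0)
JV_bush = vegetation_index(bush_ir, bush_red)
JV_cairn = vegetation_index(cairn_ir, cairn_red)
```

The bush-like values give JV near 0.11 and the cairn-like values near 0.05, mirroring the separation described above.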

The average NDVI in the 5 × 5 window is compared to a lower threshold, LV, and an upper threshold, UV. As in the first filter, the thresholds are chosen by examining the training set of cairns. Values of JV seem to vary more from region to region: generally, the lowest values are around 0.04, and the highest around 0.12.

4.4 Size Metrics

At this point in the algorithm, there are still quite a few candidate objects left (see Section 4.9 for an example), so the next filter still needs to be computationally simple. However, since a large number of pixels have been eliminated from consideration, something more complicated than averaging can be implemented. So far, the candidate pool consists of darker blobs against lighter backgrounds, and most of the vegetation should be removed. In the next step, objects that are not cairn-sized will be eliminated.

The first filter already helped to eliminate objects that are almost as large as the 25 × 25 window I, since they would have had low JB values. Depending on the contrast between the object and its surroundings, however, a large JB statistic for a 7 × 7 inner window does not necessarily mean that the object is 7 pixels in diameter. A small, dark object against a light background could still have a moderate JB value, even though the inner window variation will be large. An object larger than 7 pixels in diameter could still pass the first filter as well, because the size of the outer region is large enough to absorb several darker pixels and still have a high mean and low variance.

To estimate the size of the object, consider starting at the center pixel I(13, 13) in the window, which should be in the darkest region of the object. Moving outward from the center, pixel intensities I(i, j) should get lighter and lighter, until they cross the object boundary and dramatically increase in brightness. The goal is to determine how many pixels from the center one needs to travel before this dramatic increase occurs, since that corresponds to the edge of the object. The count of the number of pixels from the center to the edge will give a rough estimate of the radius of the object.

To do this, consider that the outer window region contains more pixels than the object, so the median intensity for the window I will be a lighter value consistent with background soil. Moving outward from the interior of the object, intensities should lie below this median value until the object boundary is crossed, at which time the dramatic increase in intensity should jump over the median value. To observe this phenomenon in all directions at once, consecutive nested windows of size 3 × 3, 5 × 5, 7 × 7, etc. are examined in order. For each window, one can then see if the lighter pixels in the inner window have crossed the median for the entire window. When this occurs, the diameter of the object can be estimated by the approximate window size at the time of the median crossing. A summary of the calculation is given below, and Figure 4.6 shows the procedure applied to two different cairns.

1. Calculate the median intensity of the entire window,

M = median{I(i, j): i, j = 1,..., 25}.

2. Initialize the inner window width at w = 3 (with radius r = (w − 1)/2 = 1), and define the inner window to be

W = {(i, j): i = 13 − r, . . . , 13 + r, j = 13 − r, . . . , 13 + r}.

3. Find qp(W), the p-th quantile intensity of the inner window W:

qp(W ) = quantilep{I(i, j):(i, j) ∈ W }

4. If qp(W ) > M, estimate the size of the cairn to be:

JS = w − 1 (4.3)

Otherwise, let w = w + 2 and return to step 3.

A visualization of the procedure for two different sized cairns appears in Figure 4.6.

As with the previous filters, a lower threshold LS and upper threshold US are used to remove objects that are larger or smaller than most cairns in the region.
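A minimal sketch of the expanding-window size estimate, assuming 0-based array indexing (so the center pixel I(13, 13) of the text becomes I[12, 12]); the function name and the fallback when no median crossing occurs are illustrative choices.

```python
import numpy as np

def size_metric(I, p=0.95, center=12):
    """Estimate object size J_S by expanding nested inner windows.

    `I` is the 25 x 25 intensity window.  Starting from a 3 x 3 inner
    window, grow the width w by 2 until the p-th quantile of the inner
    window exceeds the median of the entire window; the estimated size
    is then J_S = w - 1 (Eq. 4.3).
    """
    M = np.median(I)
    w = 3
    while w <= I.shape[0]:
        r = (w - 1) // 2
        W = I[center - r:center + r + 1, center - r:center + r + 1]
        if np.quantile(W, p) > M:
            return w - 1
        w += 2
    return I.shape[0]  # no crossing: the object fills the window
```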


Figure 4.6: (a) Window I containing a small cairn. (b) Window I containing a large cairn. (c) Successive boxplots of inner window intensities for cairn (a) for w = 3,..., 13. The red dotted line is the median for the entire window I, and the circle on the boxplot shows the quantile value qp(W ) for p = 0.95. The size of this cairn is JS = 6. (d) The same plots for cairn (b), which has size JS = 12.

4.5 Measuring Circularity

At this point in the process, most candidate pixels have been removed from consideration. The only windows remaining contain objects that are roughly the same size as a cairn, and are unlikely to be vegetation. In order to further reduce this set, we now turn to another distinguishing feature of cairns: their circular shape.

To test the circularity of the objects in the candidate pool, two different approaches are used. The first is a measurement of how well a circle Hough transform (Section 3.4.3) fits the window for different radii. The second is a closed boundary extraction on the object and analysis of the circularity of that boundary. Together, these filters give a comprehensive description of the object’s circularity.

4.5.1 Hough Transform Circle Fitting

Fitting a circle to the window with the Hough transform is the first step toward extracting an object boundary (Section 4.5.2), and ultimately calculating a value that measures the object’s circularity (Section 4.5.3). The idea is to assume the object is perfectly circular, and determine where the center of that circle would be. In order to choose the radius of the circle, we use knowledge about cairns found in the field.

From Figure 4.1, we see that the diameters of cairns tend to be between 2 and 7 meters. This translates into approximately 3.5 to 12 pixels. If the circular object is a cairn, then, that circle should have a radius between 2 and 6 pixels. Using this information, we successively fit circles of radii 2, 3, 4, 5, and 6 pixels to the window I.

In order to fit the Hough circle to the window, edge information must first be extracted. A 3 × 3 median filter (see Section 3.2.3) is used to smooth out the intensity values in the window I, resulting in a smoothed window I′. Next, a Canny edge detector is used (see Section 3.3.2 for details) to locate the pixels (i, j) with large changes in the local intensity gradient at I′(i, j). The resulting output is a window E, where E(i, j) = 1 if (i, j) is an edge pixel, and E(i, j) = 0 otherwise.

Figure 4.7: (a) Window I containing a cairn. (b) The median filtered intensity window I′. (c) The best fitting Hough circle (blue) and its center pixel (yellow) overlaid on I. The radius of the circle is 5 pixels.

Then, by following the procedure outlined in Section 3.4.3, a three dimensional accumulator function f(i, j, r) is generated, with size 25 × 25 × 5 (one slice per candidate radius). The point where this function is maximized gives the location (i*, j*) of the central pixel for the best fitted circle in the XY plane, and the radius r* of that circle on the Z-axis. To ensure that the best fitting circle is placed on the object in the middle of the window and not something in the background, the circle center is restricted to be somewhere in the middle 7 × 7 block {(i, j): i, j = 10, . . . , 16} of the entire window. An example of this process is shown in Figure 4.7.

Hough Ratio

After fitting the circle to the window, two features are calculated that are intended to capture information about how good the fit is. The first measure is called the “Hough Ratio”. If the circle fits the object perfectly, we would expect to see complete alignment between the pixels on the boundary of the circle and the edge pixels found by the Canny detector. Additionally, since the immediate zone near the cairn should be clear, there should not be many other edge pixels present in the background of the window.

One measure of how well edge pixels fit the boundary of the Hough circle is the value of the maximum in the accumulator function. The higher the value of the accumulator, the more edge pixels are consistent with the best fitting Hough circle and contribute to the function at that point. If the area around the object is clear, there should not be many edge pixels present. Therefore, the ratio of the maximum to the number of edge pixels should be large for cairns. The Hough ratio is defined by:

$$ J_{HR} = \frac{f(i^*, j^*, r^*)}{\sum_{i=1}^{25} \sum_{j=1}^{25} E(i, j)}, \qquad (4.4) $$

where f(i*, j*, r*) is the value of the accumulator function at the best fitting Hough circle of radius r* centered at (i*, j*). A lower threshold LHR can be used to eliminate objects that are not circular enough to be cairns, and this threshold can be tuned based on the training set. Technically, one could also define a nominal upper threshold of UHR = 1.

Hough Score

The second measure of fit for the Hough circle is called the “Hough Score”. The thinking behind this measure is the same as for the blob detector JB (Eq. (4.1)). If the circle fits the object well, all of the pixels inside the circle will have low (dark) intensity values. Likewise, all pixels outside the circle will have much lighter intensities. A t-statistic comparing the difference between these intensities will give a measure of how well the circle fits the object.

The set of pixels inside the Hough circle of radius r* centered at (i*, j*) is defined by

$$ H = \{(i, j) : (i - i^*)^2 + (j - j^*)^2 \le r^{*2}\}, $$

and the outer region is Hc = {(i, j): (i, j) ∉ H}. The intensities of the |H| points inside the circle are compared to those of the |Hc| points outside the circle with the Hough score:

$$ J_{HS} = \frac{\bar{I}(H) - \bar{I}(H^c)}{\sqrt{S_p^2 \left( \frac{1}{|H|} + \frac{1}{|H^c|} \right)}} \qquad (4.5) $$

Here, (Ī(H), sI(H)) and (Ī(Hc), sI(Hc)) are the intensity means and standard deviations of the two sets of pixels, and Sp² is the pooled variance, given by

$$ S_p^2 = \frac{(|H| - 1)\, s_{I(H)}^2 + (|H^c| - 1)\, s_{I(H^c)}^2}{|H| + |H^c| - 2}. $$
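The Hough score is an ordinary pooled-variance two-sample t statistic over the two pixel groups. In the sketch below the difference is taken as outside minus inside, so that a dark cairn interior yields a large positive score consistent with the positive thresholds LHS and UHS; Eq. (4.5) writes the difference in the opposite order. The function name and 0-based indexing are illustrative assumptions.

```python
import numpy as np

def hough_score(I, center, radius):
    """Pooled-variance two-sample t statistic J_HS for a fitted circle.

    Compares pixel intensities outside the Hough circle to those inside.
    The difference is taken as outside minus inside so that a dark cairn
    interior yields a large positive score.
    """
    ii, jj = np.indices(I.shape)
    inside = (ii - center[0]) ** 2 + (jj - center[1]) ** 2 <= radius ** 2
    a, b = I[inside], I[~inside]
    # Pooled variance S_p^2 of the two pixel groups.
    sp2 = ((a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)) \
        / (a.size + b.size - 2)
    return (b.mean() - a.mean()) / np.sqrt(sp2 * (1 / a.size + 1 / b.size))
```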

The Hough score is positively correlated with the JB measure (Eq. (4.1)), but it captures different information. The JB value, needing to be computationally quick, is blindly calculated using a square window W centered on the middle pixel (13, 13). However, due to shadowing effects, the 25 × 25 window I is not always centered exactly on the cairn. This fact, coupled with the use of the square template W, means that a handful of background pixels are likely to intrude into the inner window, and some darker cairn pixels are likely to bleed into the background region. As a result, the JB value may not accurately represent the true contrast between the cairn and its surroundings.

The Hough circle fitting ensures that a circular template is used to enclose the cairn, and it is not required to be centered on pixel (13, 13). Therefore, the t-statistic calculated in Eq. (4.5) is a better reflection of the normalized intensity difference between the object interior and the background. By examining the training dataset, a lower and upper threshold (LHS and UHS, respectively) for typical Hough score values of cairns can be determined.

4.5.2 Boundary Extraction

The Hough circle fitting step is a very crude measure of the circularity of an object. It is useful in eliminating clearly non-circular objects, but neither the Hough ratio nor the Hough score uses all of the details of the object boundary to measure circularity. However, the fitted circle isolates the specific region the object occupies in the window. This information can be used to extract an object boundary, which can then be analyzed in more detail.

To determine the object boundary, a binary map B of the 25 × 25 window I is created. Object pixels should be of a relatively lower intensity than the background, and the fitted Hough circle should be placed over the object in the window. Therefore, the intensities of the pixels inside the Hough circle should be representative of the intensity of the object itself. Recall that the set of pixels inside the circle is denoted H, and let qp(H) be the p-th quantile intensity of the pixels inside the circle. Then the binary map is defined by:

$$ B(i, j) = \begin{cases} 1 & \text{if } I(i, j) < q_p(H) \\ 0 & \text{otherwise} \end{cases} $$

The binary map B should contain almost all of the pixels in the object, and might possibly contain some darker regions in the background. Note that if p ≠ 1, there may be a few pixels in the interior of the object that are not selected. To account for this situation, all holes in the binary map are filled. A pixel (i, j) is in a hole if B(i, j) = 0 and there is not a chain of other pixels with value 0 that reaches the edge of the window.

A Canny edge detector is used to extract the edge pixels for B. Using the binary mapping technique helps to give a better estimate of the object boundary, and the curve tends to be closed, with no edge pixels in the interior of the object. I find that in many cases, when p is too large (e.g. p = 1), too many background pixels are included in the binary map, which can make it harder to extract the object boundary. In practice, p = 0.90 tends to provide reasonable results. A comparison of the edge pixels of B to the edge pixels of the original image I is given in Figure 4.8. Note that the edges of the binary map match the edges of the object much better.

At this point, boundary extraction is almost complete. However, there are likely a few edge pixels in the background of the window that are not a part of the object itself. Since the object is roughly centered in the 25 × 25 window, the object boundary pixels will be close to the center pixel (13, 13). Therefore, all pixels within 4 units of the center pixel are selected, and the object boundary is defined by all edge pixels connected in a chain to those starting pixels. If no pixel is within 4 units of the center, the closest edge pixel is used as the starting point. A connection is defined by a pixel being adjacent in any of the 8 directions. An example of boundary extraction is given in Figure 4.9.

Figure 4.8: (a) Window I from Figure 4.7. (b) Canny edges for the raw intensity I. (c) Intensity edges overlaid on the cairn. (d) The binary map B, with p = 0.90. (e) Canny edges for the binary map B. (f) Binary map edges overlaid on the cairn. Note that the edge pixels match the cairn boundary much better, the curve is more complete, and there are not as many extraneous edge pixels.
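The binary map and hole-filling steps can be sketched as below. The breadth-first search implements the definition in the text directly (a 0-pixel is a hole unless a chain of 0-pixels reaches the window edge); the 4-connectivity used for that chain, along with the function names, is an assumption.

```python
import numpy as np
from collections import deque

def binary_map(I, inside_circle, p=0.90):
    """Binary map B: 1 where the intensity is below the p-th quantile of
    the pixels inside the fitted Hough circle (Section 4.5.2)."""
    q = np.quantile(I[inside_circle], p)
    return (I < q).astype(int)

def fill_holes(B):
    """Fill holes in B: a 0-pixel is a hole unless a chain of 0-pixels
    reaches the edge of the window (4-connectivity assumed here)."""
    n, m = B.shape
    reach = np.zeros((n, m), bool)
    dq = deque((i, j) for i in range(n) for j in range(m)
               if B[i, j] == 0 and (i in (0, n - 1) or j in (0, m - 1)))
    for i, j in dq:            # 0-pixels on the border can reach the edge
        reach[i, j] = True
    while dq:                  # flood outward through connected 0-pixels
        i, j = dq.popleft()
        for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= a < n and 0 <= b < m and B[a, b] == 0 and not reach[a, b]:
                reach[a, b] = True
                dq.append((a, b))
    filled = B.copy()
    filled[(B == 0) & ~reach] = 1  # 0-pixels that cannot reach the edge
    return filled
```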

4.5.3 Circularity Calculation

After the boundary extraction step, we are left with a collection of edge pixels that should describe the shape of the object in the window. Considering these as a collection of points in ℝ², the goal is to measure how circular the distribution of these points is. There are a number of ways to measure the circularity of collections of points, and a few of the most popular approaches are described below.

Figure 4.9: (a)-(c) Binary map edges for three different cairns. (d)-(f) Extracted boundaries. The center pixel (13, 13) is colored magenta, and the starting locations for the extraction are in red. Cairn (c) shows an unfortunate side-effect of this procedure, where the chain of boundary pixels strays from the object.

Measures of Circularity

One of the earliest proposed measures of the circularity of a shape with area A and perimeter P is given by

$$ C_O = \frac{4\pi A}{P^2}. $$

Note that for a circle, A = πr² and P = 2πr give a value C_O = 1. For any other shape, the perimeter will be larger relative to the area, resulting in a value C_O < 1.

This method was extended for use in digital curves by Haralick (1974), but has many inherent flaws. First of all, if there is a gap in the boundary, the perimeter and area cannot be calculated without severe error in measurement. Even when there are no gaps, the perimeter cannot be calculated without first ordering the boundary points, which is not always a trivial task.

Proffitt (1982) suggests calculation of the radii r1, . . . , rn of each of the n border pixels from the center of the object. Using r̄ and sr as the mean and standard deviation of these radii, respectively, the circularity measure is defined by

$$ C_P = \sqrt{1 - \left( \frac{s_r}{\bar{r}} \right)^2}. \qquad (4.6) $$

Since the radii r1, . . . , rn are positive, the standard deviation sr can be no larger than the sample mean r̄. Therefore, CP as defined in Eq. (4.6) will be between 0 and 1, where 1 represents a perfect circle.

Giger et al. (1988) define circularity by the proportion of overlap between a shape A and the discrete disk D that has the same area (number of pixels) and the same center of gravity as A. The measure is defined by

$$ C_G = \frac{|A \setminus (A \cap D)|}{|A|}, $$

where a discrete disk has a circularity score CG = 0 and all other shapes have strictly positive values. Bottema (2000) expanded on this and dealt with the cases when such a discrete disk does not exist. He suggests the measure

$$ C_B = \frac{|A \setminus (A \cap D_n(c))|}{|A|}, $$

where n is the number of pixels in A, c is the center pixel of A, and Dn(c) is the set of n pixels closest to c. Dn(c) is a discrete disk for pairs (n, c) where such a disk exists. However, this shape can still be defined when the disk does not exist.

Other more complex approaches exist as well. For instance, Stojmenović and Nayak (2007) propose calculating the center of the object and then converting its boundary points to polar coordinates using the center as the origin. In polar coordinate space, circular objects will have a linear arrangement of points. They suggest measuring the linearity of these polar coordinates with the techniques described in Stojmenović et al. (2006). Zunic and Hirota (2008) have a method based on calculating geometric moments of shapes. A recent method by Roussillon et al. (2010) converts the object boundary to a set of parallel planes in a dual space. The circularity is calculated by examining their intersecting regions.

There are many other approaches in the literature that are not listed here. Some of them are given in the paper by Stojmenović and Nayak (2007), which compares several different approaches on a large test set of object boundaries. Ultimately, the best circularity measure for the purposes of cairn detection is one that has a good balance of performance and computational simplicity.

Circularity for Cairn Detection

After testing several of the methods, I decided that Proffitt’s method (Eq. (4.6)) was the one that best fit this description. However, due to the small size of the objects (no larger than 25 × 25 pixels), there cannot be much variation in the number of edge pixels or their distance from the object center. To penalize non-circularity a bit more, I modified Proffitt’s equation by removing the exponentiation.

Suppose there are ne edge pixels extracted along the boundary of the detected object. This collection of edge pixels can be considered a set of ordered pairs (xi, yi), where xi is the row and yi is the column of the i-th edge pixel. This is a strange coordinate system, with x on the vertical axis and y on the horizontal, but it is just a rotated and flipped version of the Cartesian coordinate system, and circularity calculations are invariant under such transformations.

The center of the object is defined by (x̄, ȳ), and the radius to each edge pixel is

$$ r_i = \sqrt{(x_i - \bar{x})^2 + (y_i - \bar{y})^2}, \quad i = 1, \ldots, n_e. $$

Object circularity is then defined by a modified version of Proffitt’s equation:

$$ J_C = 1 - \frac{s_r}{\bar{r}}, \qquad (4.7) $$

where r̄ and sr are the sample mean and standard deviation of the radii, respectively.

Higher values of JC denote more circular objects, and 0 ≤ JC ≤ 1. Several examples of circularity scores are given in Figure 4.10.
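The modified Proffitt measure of Eq. (4.7) is nearly a one-liner once the boundary coordinates are collected; the use of the sample (ddof = 1) standard deviation and the function name are assumptions.

```python
import numpy as np

def circularity(edge_pixels):
    """Modified Proffitt circularity J_C = 1 - s_r / r_bar (Eq. 4.7).

    `edge_pixels` is an (n_e, 2) array of (row, column) boundary
    coordinates; the object center is taken as the coordinate mean.
    """
    pts = np.asarray(edge_pixels, float)
    center = pts.mean(axis=0)                        # (x_bar, y_bar)
    radii = np.sqrt(((pts - center) ** 2).sum(axis=1))
    return 1 - radii.std(ddof=1) / radii.mean()      # 1 for a perfect circle
```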

As with the other operators, JC is given a lower threshold LC and an upper threshold UC. These thresholds can be determined using the cairn training set. However, as the results in Figure 4.10 illustrate, the JC value can be diminished significantly if the boundary extraction does not work as expected. Therefore, it might be a good idea to set LC to at least 0.6 or 0.7 to avoid detecting a large number of clearly non-circular objects.

4.6 Reduction to Cairn Region

At this point in the algorithm, all objects that have passed through the six filters have been mapped to a six-dimensional feature space J. The i-th object is represented by the six-tuple (J_B^i, J_V^i, J_S^i, J_HR^i, J_HS^i, J_C^i), and the space itself is a six-dimensional rectangular solid defined by:

J = [LB, UB] × [LV, UV] × [LS, US] × [LHR, UHR] × [LHS, UHS] × [LC, UC].

Figure 4.10: (a)-(c) Cairn circularity scores indicated for the extracted boundaries (in blue). The pixel containing the center of the object is in yellow. (d)-(f) Examples of circularity scores when objects are not as circular. These are cairns for which background intensities interfered with the boundary extraction.

A plot of the pairwise correlations between features across all 106 cairns in the training set is given in Figure 4.11. Notice that the largest correlations are between the two measures of Hough circle fit, JHR and JHS, and the blob detector JB. Since the largest correlation is only 0.551, and most of them are well below 0.4, it seems that each feature is measuring something different about the cairns. Therefore, dimension reduction should not be required.

Figure 4.11: Pairwise correlations for the six features across all 106 cairns in the training set.

Given a large enough training set, there is an optional step to the algorithm that is available. Although the specific values for features may change depending on the region covered by a satellite image (e.g. Yemen vs. Oman), almost all of these features should have a unimodal distribution. The two measures JB and JHS are based on a t-statistic. Even though independence assumptions are violated, their distributions still appear to be roughly unimodal and symmetric. The vegetation measure JV is based on an average of values, and so should also be roughly symmetric and unimodal. The distribution of cairn diameters is roughly normal (see Figure 4.1(c)), so the size measure JS should be as well. Although not likely symmetric, the distribution of Hough ratios JHR across the cairns should also tend to a single mode. Finally, since most cairns are roughly circular, the circularity measure JC should also tend to be unimodal.

The unimodal marginal distributions of each of the six features suggest that the training cairns are distributed in a cloud in the space J . Cairn-like objects can be found in the denser center of the cloud, which corresponds to the marginal modes, and as one moves out of the cloud, the objects look less like cairns. Given a large enough sample of training cairns to define the cloud, one can now further subset detected objects by only retaining those objects that lie within the convex hull of the training set.

In practice, this is done using a Delaunay tessellation of the training set in the space J. The Delaunay tessellation (also called the Delaunay triangulation) of a set of points X in n dimensions is a unique triangulation such that for each simplex (i.e. n-dimensional triangle), the hypersphere passing through its vertices does not contain any point in X. For an example in ℝ², see Figure 4.12. Determining whether a detected object falls into the convex hull is as easy as determining which simplex of the tessellation it is a member of. If the object is not inside any of the simplices, it is discarded as a non-cairn.

The Delaunay tessellation and reduction removes a large number of objects from consideration. The two-dimensional example in Figure 4.12 shows that the corners of the rectangle occupied by the points are not covered by the tessellation. In the case of the feature space J, all of the corners in six dimensions are removed. In such a large space, the region encompassed by the training set convex hull is proportionally much smaller, and many of the detected objects can be eliminated.

Figure 4.12: (a) A set of points in ℝ². (b) The Delaunay tessellation over those points.

4.7 Assigning Cairn Likelihoods

After filtering all of the pixels in the satellite image by the six J operators and subsetting to the convex hull of the training set, the number of remaining candidate objects will be small. The size and variability of the training set certainly affects the number of such objects, but empirically, I typically see on the order of 500 to 2500 objects remaining. Now that the candidate pool is a manageable size, the goal is to rank the objects by a “cairn likelihood”. Ideally, this value would be the probability that the candidate object is a cairn.

To do this, we first use the mean and standard deviation of the training set to transform J to the new space J′ by mean centering and scaling each of the six dimensions. That is, for training set mean vector μ and standard deviation vector σ, we apply the function g(x) to all x ∈ J, where

$$ g(x) = \left( \frac{x_1 - \mu_1}{\sigma_1}, \ldots, \frac{x_6 - \mu_6}{\sigma_6} \right)^T. $$

In this centered and scaled space, we treat the six features JB, JV, JS, JHR, JHS, and JC as if they were independent variables.

The marginal probability of an object being a cairn can be described by a density function fi(x), i = 1, . . . , 6, where x ∈ J′ is the point corresponding to the object in the transformed feature space. Then, the joint probability of the object being a cairn is given by the likelihood:

$$ P(x \text{ is cairn}) = L(x) = \prod_{i=1}^{6} f_i(x). $$

While the continuous densities fi(x) are unknown, they can be estimated by step functions, which can be obtained from the training set. For the i-th feature, split the training set into bins [j − 0.5, j + 0.5) for j ∈ Z. Then the marginal density fi(x) can be approximated by the step function

$$ \hat{f}_i(x) = \sum_{j=-\infty}^{\infty} \mathbf{1}(j - 0.5 \le x < j + 0.5) \sum_{k=1}^{n_c} \frac{\mathbf{1}(j - 0.5 \le y_{k,i} < j + 0.5)}{n_c}, $$

where y_1, . . . , y_{n_c} are the n_c cairns in the training set, y_{k,i} is the i-th feature of the k-th cairn, and 1(A) is the indicator function of event A. Essentially, the height of the step function at a point x corresponds to the height of its bin in a relative frequency histogram, with the bins defined as above.

Using these approximate marginal distributions, the likelihood becomes

$$ \hat{L}(x) = \prod_{i=1}^{6} \hat{f}_i(x). $$

Now, to compare two objects x1 and x2 to see which is more likely to be a cairn, one need only compare L̂(x1) and L̂(x2). Larger values of L̂ correspond to objects that are more likely to be cairns. In implementation, the likelihood is negative log transformed to avoid roundoff error with very small numbers:

$$ L(x) = -\log(\hat{L}(x)) = -\sum_{i=1}^{6} \log(\hat{f}_i(x)) \qquad (4.8) $$

Note that since the log function is monotonic, the same interpretation applies as before, only flipped due to the negative sign. Each detected object x can be ranked according to the value L(x). Now, small values indicate objects that are the most likely to be cairns.
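The histogram-based ranking can be sketched as follows. The unit-width bins follow the [j − 0.5, j + 0.5) definition above; the small eps guarding against log(0) for empty bins, and the function name, are implementation conveniences assumed here rather than taken from the thesis.

```python
import numpy as np

def neg_log_likelihood(train, candidates, eps=1e-12):
    """Rank candidates by the negative log-likelihood of Eq. (4.8).

    Both arrays hold standardized feature vectors (the g(x) transform).
    Each marginal density is a step function over the unit-width bins
    [j - 0.5, j + 0.5); eps guards against log(0) for empty bins.
    Smaller scores correspond to more cairn-like objects.
    """
    train = np.asarray(train, float)
    cand = np.asarray(candidates, float)
    scores = np.zeros(len(cand))
    for i in range(train.shape[1]):
        bins_t = np.floor(train[:, i] + 0.5)   # bin index j per training value
        bins_c = np.floor(cand[:, i] + 0.5)
        for k, b in enumerate(bins_c):
            fhat = np.mean(bins_t == b)        # histogram height for bin b
            scores[k] += -np.log(fhat + eps)
    return scores
```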

4.8 Algorithm Summary

The basic idea of the cairn detection algorithm is summarized in Figure 4.13. In the first step, objects are identified by locating dark blotches of intensity contrasted against lighter colored background material. Next, these objects are characterized by features that help distinguish cairns from other objects. This includes seeing whether the object is vegetation, if it is the right size to be a cairn, and whether it is as circular as a cairn. Objects that make it through this series of filters are guaranteed to look a lot like cairns. The reduction to the convex hull formed by a training set is a further attempt to subset down to the most important objects. Finally, the estimated likelihood values serve to rank detected objects by how cairn-like they appear to be.

97 Figure 4.13: A visualization of the cairn detection procedure.

Parameter    Description
w0           Size of outer window I (* I used w0 = 25)
wB           Size of inner window for blob detector JB (* I used wB = 7)
LB, UB       Lower and upper thresholds for JB
wBmax        Size of local window to see if JB is a maximum (* I used wBmax = 9)
wV           Size of inner window for vegetation feature JV (* I used wV = 5)
LV, UV       Lower and upper thresholds for JV
pS           Inner window quantile used for size measure JS (* I used pS = 0.95 in Figure 4.6)
LS, US       Lower and upper thresholds for JS
rmin, rmax   Smallest and largest circle radius used for Hough fitting (* I used rmin = 2 and rmax = 6)
wH           Size of window for median filtering when fitting Hough circles (* I used wH = 3)
LHR, UHR     Lower and upper thresholds for the Hough ratio JHR
LHS, UHS     Lower and upper thresholds for the Hough score JHS
pC           Quantile used inside fitted Hough circle to generate binary map B (* I used pC = 0.90 in Figure 4.8)
LC, UC       Lower and upper thresholds for the circularity measure JC

Table 4.1: Parameters for Cairn Detection Algorithm

Step 1 (Section 4.1): Around each pixel, extract the w0 × w0 window I.

Step 2 (Section 4.2): Split I into the wB × wB inner window and complementary outer region. Then calculate JB using Eq. (4.1). Eliminate all windows I whose JB values do not fall in the interval [LB, UB] or whose JB value is not the maximum in a local wBmax × wBmax window.

Step 3 (Section 4.3): Calculate the NDVI values in the inner wV × wV window of I. Next, compute the average of those values, JV (see Eq. (4.2)). Eliminate all windows I whose JV values do not fall in the interval [LV, UV].

Step 4 (Section 4.4): In each surviving window I, calculate the median window intensity M. Expand windows of size 3, 5, 7, etc. from the center until the pS-th percentile of the inner window crosses the median M. Set JS with Eq. (4.3). Eliminate all windows I whose JS values do not fall in the interval [LS, US].

Step 5 (Section 4.5.1): For remaining windows I, smooth I with a wH × wH median filter. Calculate the Canny edges for the smoothed window, and successively fit Hough circles of radius rmin to rmax. Find the best fitting triple in the accumulator function, and the corresponding Hough circle. Calculate JHR and JHS using Eqs. (4.4) and (4.5). Eliminate all windows I for which JHR ∉ [LHR, UHR] or JHS ∉ [LHS, UHS].

Step 6 (Section 4.5.2): For any candidate windows I left, create the binary map B by identifying all pixels with intensity below the pC-th quantile intensity in the interior of the Hough circle from Step 5. Find the Canny edges of the binary map B.

Step 7 (Section 4.5.3): Using the row and column coordinates of the edge pixels, calculate the circularity JC using Eq. (4.7). Eliminate all windows I whose JC values do not fall in the interval [LC, UC].

Step 8 (Section 4.6): Perform a Delaunay tessellation on the training set of cairns. Treating the detected objects as points in the 6-D space J, subset to only those candidate objects inside the convex hull for the training set.

Step 9 (Section 4.7): Estimate the marginal densities f̂i(x) for the training set. For each remaining object, calculate the negative log-likelihood L(x) given in Eq. (4.8). Rank the objects by L(x) and export them to be analyzed by hand.

Table 4.2: Cairn Detection Algorithm

A detailed list of parameters required for the algorithm is given in Table 4.1. Note that many of these parameters were only mentioned in passing during the description of the algorithm given in this chapter. The only values that would really need to change depending on the region being analyzed would be the lower and upper thresholds for each of the operators. Given a set of parameter values, a skeleton of the algorithm is presented in Table 4.2.

4.9 Results for Polygon 9

To demonstrate the effectiveness of the cairn detection algorithm, the process was applied to Polygon 9 in Yemen (Chapter 2, Figure 2.3). This image was selected because of the large number of training cairns available in the area. In all, there are 76 cairns, but 16 of them are of poor quality in the imagery. This set is reduced to 60 cairns, which are displayed in Figure 4.14.

Determination of the parameters for the algorithm is done by examining the empirical cumulative distribution functions for the six features over the 60 training cairns from the image. The functions are given in Figure 4.15. One way to set thresholds is to choose upper and lower bounds that will retain all cairns in the training set. In this case, a good set of thresholds for this purpose might be:

[LB, UB] = [9, 23]           [LHR, UHR] = [1.5, 5]
[LV, UV] = [0.045, 0.065]    [LHS, UHS] = [4, 32]
[LS, US] = [6, 12]           [LC, UC] = [0.45, 1]

However, insisting that every cairn in the training set pass the thresholds may be a bit conservative, and allow a large number of false detections. Often, the thresholds can be tightened with only the loss of a few training cairns, but a large reduction in falsely detected objects. For example, the bounds can be reduced to the levels below while only losing 14 of the cairns (23% of the training set). Most of these cairns are eliminated because they have low circularity scores JC or a small size JS.

[LB, UB] = [10, 23]          [LHR, UHR] = [1.5, 5]
[LV, UV] = [0.045, 0.065]    [LHS, UHS] = [5, 32]
[LS, US] = [8, 12]           [LC, UC] = [0.7, 1]

Figure 4.14: The cairn training set for Polygon 9. Each subimage shows the 25 × 25 panchromatic intensity window. To illustrate the contrast between the cairn and the background, each window is normalized to a 0 (dark) to 1 (light) scale for display purposes only.

Figure 4.15: Empirical cumulative distribution functions for the 60 cairns in the Polygon 9 training set for (a) JB, (b) JV, (c) JS, (d) JHR, (e) JHS, and (f) JC. Note that the size metric JS ranges from 6 to 12, which corresponds to 3.6 meters to 7.2 meters in diameter. This is roughly the range of cairn diameters observed in Figure 4.1(c).

General Purpose: w0 = 25
Blob Detector: LB = 10, UB = 23, wB = 7, wBmax = 9
Vegetation Removal: LV = 0.045, UV = 0.065, wV = 5
Size Metric: LS = 8, US = 12, pS = 0.95
Hough Methods: LHR = 1.5, UHR = 5, LHS = 5, UHS = 32, wH = 3, rmin = 2, rmax = 6
Circularity Score: LC = 0.7, UC = 1, pC = 0.90

Table 4.3: Initialization Parameters for Polygon 9 Detection

To run the detection algorithm for Polygon 9, the parameters were initialized with the values in Table 4.3. For a description of the parameters, refer to Table 4.1. Note that the threshold values used are the more restrictive ones suggested above.

Polygon 9 has 17,650 rows and 10,824 columns, which results in 191,043,600 total pixels covering 68.775 square kilometers. In the absence of any other information, there are 191 million possible pixels that could be cairns. Scanning the entire image for cairns by eye would be quite time consuming, and it would be difficult to maintain the focus required to recognize them from their surroundings. Fortunately, the cairn detection algorithm is able to whittle down the number of candidate pixels substantially.

The first filter, JB, is calculated for all of the pixels in the image. Due to the large number of pixels, this is by far the most computationally intensive part of the algorithm, and takes approximately 8 hours to complete. The pixels are then subset to those that have the maximum JB value in their local 9 × 9 neighborhood and simultaneously have JB ∈ [LB, UB]. This reduction takes the candidate pool from 191,043,600 pixels to 91,280 pixels, a reduction of 99.95%. A random selection of 20 objects passing this filter is shown in Figure 4.17.
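The local-maximum subsetting step can be sketched as follows. This is a hypothetical helper (the thesis's actual implementation is not shown here), assuming the JB values have already been computed into a 2-D array:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_max_candidates(jb, lb, ub, nbhd=9):
    """Keep pixels whose J_B value is the maximum of their local
    nbhd x nbhd neighborhood and also lies in [lb, ub]."""
    local_max = maximum_filter(jb, size=nbhd)      # max over each window
    keep = (jb == local_max) & (jb >= lb) & (jb <= ub)
    return np.argwhere(keep)                       # (row, col) candidates
```

Applying such a helper to the full JB image with [LB, UB] = [10, 23] corresponds to the subsetting described above.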

Note that there are several types of objects detected by this filter besides cairns. Bushes and trees tend to be detected quite well due to the high contrast between the plant and the background soil (e.g., the two objects in the bottom left corner). Discolorations in the soil can also be detected, as long as the contrast is large enough (see, e.g., the object in the top right corner). Finally, shadows cast by cliffs and ridgelines show up in large numbers as well (e.g., the top left corner). Since they are narrow and pass through the center pixel, the inner window can be very dark while a majority of the outer window is still much lighter. This situation leads to a high JB value.

At this point, the remaining 91,280 candidate objects are sent through the vegetation filter JV. By thresholding on the average NDVI of the object, the hope is to eliminate a large amount of the vegetation that made it through the first filter. A total of 67,325 objects make it through the vegetation filter, another 26.24% reduction in the candidate pool. Figure 4.18 shows 20 randomly selected objects after this filtering step. Notice that there are fewer objects that look like vegetation; unfortunately, the soil blotches and ridgelines still persist in the candidate pool.
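A sketch of this step, assuming NDVI is computed from the near infrared and red bands in the usual way, NDVI = (NIR − Red)/(NIR + Red); the function name and array arguments are illustrative, not the thesis's implementation:

```python
import numpy as np

def passes_vegetation_filter(nir, red, lv=0.045, uv=0.065):
    """Keep an object window only if its average NDVI lies in [lv, uv];
    vegetation tends to produce much higher NDVI values."""
    ndvi = (nir - red) / (nir + red + 1e-12)   # guard against 0/0
    return lv <= ndvi.mean() <= uv
```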

The size filter, JS, is designed to remove objects that are either too large or too small to be cairns. Thresholding this metric to sizes between LS = 8 and US = 12 pixels in diameter cuts the pool of 67,325 objects down to 46,547 objects, a further reduction of 30.86%. Some of the larger soil discolorations disappear, but many of the cliffs and ridgelines remain. This is because the size is calculated by expanding inner windows until the 95th percentile inner intensity crosses the median intensity for the entire 25 × 25 window. Since the ridge shadow is thin in one direction, the size is estimated to be the approximate width of the ridge in that narrow direction.

For many ridges, the width of the shadow is within the size threshold values. Figure 4.19 shows a collection of 20 randomly sampled objects passing the size filter.

In the next step of the algorithm, the 46,547 remaining objects are sent through the Hough transform, where circles of radius 2–6 pixels are successively fit to the edge pixels of the window. The best fitting circle is used to compute the Hough ratio from Eq. (4.4) and the Hough score from Eq. (4.5). Both values give a measure of how well this circle fits the object boundary. The Hough filters drop the number of detected objects to 40,618, a further 12.74% reduction.
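Eqs. (4.4) and (4.5) are not reproduced in this excerpt; the sketch below shows only the underlying voting step of a circular Hough transform over a window's edge pixels. The function name and the 72-angle discretization are illustrative assumptions:

```python
import numpy as np

def best_hough_circle(edge_pixels, shape=(25, 25), rmin=2, rmax=6):
    """For each candidate radius r, every edge pixel votes for the centers
    of all circles of radius r passing through it; the (row, col, r) cell
    with the most votes is the best fitting circle."""
    thetas = np.linspace(0, 2 * np.pi, 72, endpoint=False)
    best, best_votes = None, -1
    for r in range(rmin, rmax + 1):
        acc = np.zeros(shape, dtype=int)
        for (y, x) in edge_pixels:
            cy = np.round(y - r * np.sin(thetas)).astype(int)
            cx = np.round(x - r * np.cos(thetas)).astype(int)
            ok = (cy >= 0) & (cy < shape[0]) & (cx >= 0) & (cx < shape[1])
            np.add.at(acc, (cy[ok], cx[ok]), 1)    # unbuffered vote tally
        if acc.max() > best_votes:
            best_votes = int(acc.max())
            best = (*np.unravel_index(acc.argmax(), shape), r)
    return best, best_votes
```

The winning circle's vote count, relative to the amount of edge evidence, is the kind of quantity that ratio- and score-type summaries of boundary fit are built from.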

To give a more rigorous examination of object circularity, the final filter JC is applied. The binary map B is created for each of the 40,618 objects, and the edges of that map are used to extract a better boundary of the object. The circularity of this boundary is measured using the modified version of Proffitt's equation, and all objects with JC < LC = 0.7 are removed from the candidate pool. The thresholding reduces the set of objects by another 49.82% to 20,381 survivors. Figure 4.20 shows 20 randomly selected objects that passed the three circularity filters. Notice that the objects now have a definite circular appearance.

After filtering Polygon 9 with all six filters, the original field of 191,043,600 pixels has been reduced by 99.99% to only 20,381 pixels. To further narrow down this set of

Figure 4.16: Histograms of the six features across the Polygon 9 training set.

objects, we now subset to the convex hull formed by the 60 training cairns in the six-dimensional feature space J. Figure 4.16 shows histograms of the six features for all of the training cairns. Note that the distributions are (mostly) unimodal. The distributions for JHR and JC are skewed and not quite unimodal; the appearance of the extra mode is likely due to the fact that boundary extraction for cairns can occasionally go awry (see Figure 4.10(d)-(f), for example). The JHR and JC values therefore end up with two modes: one that coincides with correct boundary extraction, and one for the times that boundary extraction fails.

From the marginal distributions of the training cairns, it seems likely that the cairns are distributed in a cloud. To find detected objects within this cloud, a Delaunay tessellation is performed over the training cairns, where each cairn is treated as a point in the feature space J. Then, each of the detected objects is examined to check whether it falls inside one of the simplices (and therefore inside the convex hull) or not. By subsetting to only the objects residing inside the convex hull, the number of detections is reduced by another 91.98%, from 20,381 to 1634. Note

Figure 4.17: A random selection of 20 objects that passed the JB blob detector in Polygon 9 with LB = 10 and UB = 23. Each 25 × 25 window has intensities normalized to [0, 1] scale within that window. This is to enhance the contrast for display purposes only.

Figure 4.18: A random selection of 20 objects that passed the JV vegetation filter in Polygon 9 with LV = 0.045 and UV = 0.065. Each 25 × 25 window has intensities normalized to [0,1] scale within that window. This is to enhance the contrast for display purposes only.

Figure 4.19: A random selection of 20 objects that passed the JS size filter in Polygon 9 with LS = 8 and US = 12. Each 25 × 25 window has intensities normalized to [0, 1] scale within that window. This is to enhance the contrast for display purposes only.

Figure 4.20: A random selection of 20 objects that passed the circularity filters in Polygon 9. These include the Hough ratio JHR, with LHR = 1.5 and UHR = 5, then the Hough score JHS, with LHS = 5 and UHS = 32, and finally the circularity measure JC with thresholds LC = 0.7 and UC = 1. Each 25 × 25 window has intensities normalized to [0, 1] scale within that window. This is to enhance the contrast for display purposes only.

that this procedure also ensures that none of the training cairns are removed from the detected objects, since they are the vertices of the Delaunay tessellation. At this point, the final set of detected objects is in place, and they can now be ranked with the likelihood-based approach. A map of the detected object locations is shown in

Figure 4.21.
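The convex hull subsetting above can be sketched with SciPy's Delaunay tessellation, where find_simplex returns −1 for points outside every simplex; the 2-D toy data in the usage below stand in for the six-dimensional feature space J:

```python
import numpy as np
from scipy.spatial import Delaunay

def inside_hull(train_features, detected_features):
    """Keep only the detected objects that fall inside some simplex of the
    Delaunay tessellation of the training cairns, i.e. inside their convex
    hull in feature space."""
    tess = Delaunay(train_features)
    detected_features = np.asarray(detected_features)
    return detected_features[tess.find_simplex(detected_features) >= 0]
```

Because the training cairns are the tessellation's vertices, they trivially lie inside the hull, which mirrors the guarantee noted above.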

Each feature is mean centered and scaled, and the marginal distributions f̂(x) for each normalized feature are estimated from the training set. These appear in Figure 4.22. The likelihood value L(x) (see Eq. (4.8)) for each object x is calculated, and the objects with the smallest L(x) value are the most likely to be cairns.
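Eq. (4.8) is not reproduced in this excerpt, so the sketch below substitutes a sum of negative log kernel-density marginals as the score; it matches the ranking direction described above (smaller is more cairn-like) but is only a stand-in for the thesis's exact L(x):

```python
import numpy as np
from scipy.stats import gaussian_kde

def rank_by_likelihood(train, detected):
    """Standardize each feature by the training mean/sd, estimate each
    marginal density by a kernel method, and rank detected objects by a
    negative log-likelihood score (smallest score = best rank)."""
    mu, sd = train.mean(axis=0), train.std(axis=0)
    z_train, z_det = (train - mu) / sd, (detected - mu) / sd
    kdes = [gaussian_kde(z_train[:, j]) for j in range(train.shape[1])]
    score = sum(-np.log(k(z_det[:, j]) + 1e-300) for j, k in enumerate(kdes))
    return np.argsort(score)       # detected-object indices, best first
```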

The final output of the algorithm is a list of 1634 objects ranked in order by their L(x) values. The top 50 ranked objects are shown in Figure 4.23. Note that these objects share many properties with the set of training cairns in Figure 4.14. In fact, 42 of the 60 cairns in the training set appear among these 1634 objects (3 of them in the Top 50). The other 18 training cairns were not detected primarily due to the restrictive lower bounds LB and LC chosen for the blob detector JB and circularity measure JC, respectively. These restrictive bounds were used to keep the number of false detections at a manageable size. It is also worth mentioning that of the 16 poorer quality training cairns (the ones removed from the initial set of 76 cairns), 4 passed all of the filters. However, they were eliminated in the convex hull reduction step of the algorithm. At this point, a field inspection could be done to determine which of the detected objects are actually cairns.

Figure 4.21: A map of the 1634 detected objects in Polygon 9.

4.10 Discussion

The algorithm in this chapter provides one way to automatically detect objects in satellite imagery that appear to be cairns. The critical insight for the algorithm is the characterization of objects as 25 × 25 windows that have certain features.

Upon examination of a training set of ground-truthed cairns, commonalities between the monuments became apparent. The window features used in the algorithm were developed to exploit these commonalities and reduce the set of candidate objects to only those that had the most in common with the cairn training set.

Note that this approach assumes the existence of a training set of cairns. In many situations, such a training set may not be available. However, the example provided in the previous section gives a ballpark region for setting thresholds on the different filters. To illustrate the variability of features over different regions of southern Arabia, Figure 4.24 shows boxplots of JB, JV , JS, JHR, JHS, and JC values of training cairns from areas of Oman and Yemen.

For the most part, the features stay relatively stable across the different regions, with minor fluctuations (e.g., JV in Yemen 17). This is likely because the filters do not depend on the raw values of the multispectral bands. Instead, they look for a contrast within a local window (JB, JS), a relative difference between bands (JV), or detect edges and evaluate the shape of the object (JHR, JHS, JC).

All of these measures are robust to differences in soil intensity, brightness of sunlight, elevation, and other large scale changes from region to region.

Whether a training set is available or not, there are other ways to incorporate prior information in the cairn detection process. This prior information comes from field knowledge and theory from anthropology, geography, and geology. For instance, my

team members indicate that cairns often appear in clusters in the field. Since cairns are burial monuments, this phenomenon is likely due to multiple family members being buried in the same vicinity. Knowing that cairns appear in clusters can be useful in ruling out detected objects that are solitary and far removed from the others.

Additionally, the working assumption is that cairns were used not only for burial, but also as symbols of territorial ownership and control over resources. In this capacity, cairns would need to be seen by travellers passing through in order to warn them not to cause trouble. Indeed, the archaeologists in the field note that the overwhelming majority of cairns dot the sides of the wadis (river channels), which people in the region use as natural roadways between settlements. Since almost all of the cairns discovered thus far are visible from the wadi, this is a strong indication that detected objects far from the wadi are not cairns.

Finally, there is evidence that cairns only appear on certain types of landforms.

Clearly, for a cairn to exist there are a few prerequisites. The ground must be flat enough to keep the cairn from collapsing downhill, and the immediate area should be rocky enough to provide building materials. The region must also be accessible by foot, since people had to get to the location to build the cairn in the first place. In the field, cairns often tend to appear on bedrock terraces, which are flat and provide plenty of stones for construction. However, it is difficult to look at a satellite image and recognize where such landforms appear, and if there may be other regions suitable for cairn building.

To incorporate these latter two concepts (that cairns are near wadi channels and only appear on certain landforms), it is useful to identify homogeneous regions of the image that correspond to the wadi and the various landforms. Available information comes in the form of spectral information and elevation information. The spectral bands include panchromatic intensity, red, green, blue, and near infrared, with intensity at 0.6 meter resolution and the others at 2.4 meter resolution. Elevation is given in a 30 meter resolution digital elevation model (DEM), which estimates the elevation of each pixel above sea level.

For large datasets (e.g., satellite imagery), a class of clustering algorithms called spectral clustering algorithms can be used. However, approximation methods (e.g., matrix sparsification or subsampling) must be used due to the size of the dataset. In this thesis, a multiple sample clustering technique is used to segment the satellite image into different landform types based on spectral and elevation information. Then, the clustering results and training set can be used together to subset the collection of detected objects to only those which lie in regions with a high cairn density. Chapter 5 gives some background on spectral clustering methods, and discusses the problems that arise in large data situations (e.g., when segmenting satellite imagery). These large data issues are the motivation for the multiple sample clustering algorithm, a description of which appears in Chapter 6. Finally, the application of the method to Polygon 9 is discussed in Chapter 7.

Figure 4.22: The approximate marginal distributions f̂1(x), ..., f̂6(x) for the 60 training cairns in Polygon 9.

Figure 4.23: The top 50 likelihood ranked objects (left to right by row) from Polygon 9. Each 25 × 25 window has intensities normalized to [0, 1] scale within that window. This is to enhance the contrast for display purposes only.


Figure 4.24: Boxplots of the six features for training cairns located in two regions of Oman and two from Yemen. There are 5 training cairns in Oman 3, 9 in Oman 16, 60 in Yemen 9, and 37 in Yemen 17.

CHAPTER 5

TECHNIQUES FOR CLUSTERING DATA

5.1 Introduction

Twenty years ago, a statistician confronted with a dataset typically had to make do with far fewer observations than he or she would like. With skyrocketing computing power and sensing technology, however, the statistician of today is often faced with a somewhat ironic question: where do I even begin to analyze all these data? With the growing number of industries collecting and storing observations, as well as the ballooning complexity of these data, it is becoming more common for statisticians to encounter very large, high dimensional data sets. The question of how to deal with these data has been fueling innovations in the fields of data mining and statistical learning.

Suppose we are given a set of observations {x1, x2, ..., xn} of d-dimensional vectors, where xi = (xi1, xi2, ..., xid)^T. We call each of these dimensions a feature, or attribute, of the observation, and the space spanned by these features is called the feature space. The observations are generated via some stochastic mechanism underlying this space, but its exact nature is generally unknown. If d is large, understanding what form this mechanism takes is even more difficult. When n is large, memory and processing power may be insufficient to use data analysis techniques on the full set of observations. The question is how to gain an approximate understanding of the processes at work in the data space, and to do so with tractable computation.

One common approach is to use dimension reduction techniques such as principal component analysis, factor analysis, and projection pursuit (see, e.g., Fodor, 2002).

In general, each of these approaches tries to project the high-dimensional data down into a lower-dimensional space without losing much of its relevant structure. This can result in a more intuitive and understandable interpretation of the observations and their relationship to the physical mechanism that is producing them.

In many applications, though (e.g., image segmentation), the dimension is smaller and reduction may not be as helpful. Also, tractable computation is a concern. In these situations, it may be more appropriate to cluster the data. The idea is to divide the data set into a number of homogeneous groups, which provides a natural way of summarizing the data. This can ease computational burdens because it may be possible to run analyses on each cluster separately, rather than on the entire data set at once.

We begin with some background about the concept and its applications.

5.1.1 Data Clustering

The goal of clustering, given a data set, is to determine how many naturally occurring groups are present in the data, and to assign each point in the set to the cluster of which it is a member. There are several ambiguous terms in this definition. The first is the question of how one defines a cluster. Intuitively, a cluster is a group of points that are "close", or exhibit "similar" behavior. Carmichael et al. (1968) define a cluster as a region of space with a continuous relative density of points that is surrounded by a region of continuous relative emptiness. From a statistical perspective, we may claim that the data are generated according to some underlying density p(x). In this case, the objective is to locate convex regions of the feature space that contain modes of this density. Of course, one needs to be careful about how similarity is defined. For points in R^d, the L2 distance, or some function of it, may be appropriate. However, for more exotic feature spaces this distance measure may need to be defined by a subject matter expert.

By their nature, clustering algorithms provide insight into the underlying structure of the data by identifying homogeneous subgroups of points. Since similar observa- tions may behave in similar ways, such information allows one to “step back” from the data and view it from a broader perspective, which can guide future analyses. Due to this fact, clustering also allows the data to be represented in much more efficient ways. In information theory (Shannon, 1948), entropy refers to the expected amount of information required to describe the distribution of data to be communicated. For many tasks, exact data values are not needed, but rather some sort of local average.

In these cases, the entropy can be reduced by representing each point by a cluster representative. For instance, in vector quantization for image analysis, small patches of pixels are clustered (e.g. by k-means) and then replaced by the group centroids.

As a result, the number of bits required to store each pixel is drastically reduced. If only large scale objects need to be preserved in the image, the reduction in visual quality is not an issue, and the exact values of the original pixels do not need to be stored.

Another example that illustrates the usefulness of clustering is the task of searching through a collection. Suppose a candidate point is proposed, and the goal is to search through the existing data set to find the point closest to this candidate. A naïve solution is to compute pairwise distances between the candidate and all other points and find the one for which the distance is minimized. The search time for this solution is O(n). However, if the data are clustered into k groups, one can compare the candidate to the cluster centroids to locate the nearest group. Then, the minimization need only take place over this nearest group. Now, the solution would be O(k + max_g N_g), where N_g is the number of points in the g-th cluster, g = 1, ..., k. Of course, this method assumes that the groups are compactly distributed around the cluster centers, and not, for example, on a lower dimensional manifold. However, in situations where such assumptions are reasonable, this can result in substantially faster search times.
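The two-stage search just described can be sketched as follows; the names are illustrative, and correctness relies on the compactness assumption noted above:

```python
import numpy as np

def cluster_nn_search(candidate, centroids, clusters):
    """Find the nearest centroid in O(k), then scan only that cluster's
    members, O(max_g N_g), instead of all n points.  clusters[g] holds
    the points assigned to centroid g."""
    g = int(np.argmin(np.linalg.norm(centroids - candidate, axis=1)))
    members = clusters[g]
    j = int(np.argmin(np.linalg.norm(members - candidate, axis=1)))
    return members[j]
```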

5.1.2 Applications of Clustering

Clustering can reveal the characteristics of the mechanism generating a data set, and can also reduce storage and searching time. There are many fields of study for which these properties can be applied. First of all, clustering enjoys many applications in the area of social behavior and interaction. For example, marketing studies use it to identify groups of people that will serve as target markets for new products

(see, e.g. Arabie and Hubert, 1994). It is also used for product positioning, in which companies try to change perceptions of an existing product. The product is modeled as a collection of attributes, and the target market is used to identify which attributes should be changed and in which combinations in order to create a desired perception

change. Clustering also plays a part in social network analysis (see, e.g., Scott, 1988), where social structures of interdependency are uncovered to identify communities of people. This interdependency can take the form of monetary compensation, the flow of ideas, conflict, or kinship ties. This theory can be used to guide decisions in organizational management and in foreign and domestic policy. Nor is identifying communities limited to people. The field of animal ecology uses clustering methods to describe communities of animals linked in some way, e.g. through habitat or feeding relationships (see, e.g., Crozier and Zabel, 2006).

There are many web applications for clustering as well. For instance, consumer websites like Amazon.com group items together to offer suggestions based on previous purchases (Linden et al., 2003). In addition, some internet search engines use a cluster-based approach. One of the best known of these is called Clusty, which arranges search results into topic-based clusters. Since most searches can be interpreted in multiple ways, this approach allows users to select the class of results that matches their intended interpretation. Finally, cluster analysis has uses in internet security, for example in the proactive detection of distributed denial of service (DDoS) attacks (Lee et al., 2008).

Clustering also finds a great deal of use in scientific applications like gene expression data (e.g., Jiang et al., 2004). By grouping the data in different ways, one can find clusters of genes with similar expression patterns, or classes of samples that exhibit comparable expression profiles. Along these lines, early work in clustering led to developments in creating natural taxonomies of all living things. Psychological studies also make use of data clustering to identify groups of people who may benefit from particular goods or services (see, e.g., Clatworthy et al., 2005). For instance, cluster analysis has been used to identify different kinds of depression. Clustering has also been used in chemical applications to identify groups of compounds that have similar properties.

5.2 Clustering Algorithms

Clustering algorithms fall into three broad categories. The first class of algorithms focuses on identifying clusters by central tendency. The assumption is that a cluster of points will be distributed about some center, so that the overall variation within clusters is far less than the variation between clusters. Combinatorial algorithms consider all possible configurations of points arranged in a given number of clusters. They seek the configuration that globally minimizes a relevant objective function (e.g., within cluster variability). Hierarchical algorithms create a separate clustering result for each number of groups, from a single large cluster to n singletons. This sequence of results is chosen in such a way as to have nested clusters across the spectrum.

Model based methods, like mixture modeling, provide a different perspective. Rather than approach the clustering task from a combinatorial standpoint, these methods assume the underlying data are a random draw from a particular probability model. The data are then used to estimate the parameters of this model. This completely specifies the distribution of the data, which can then be used for clustering and classification.

The third group of algorithms, spectral clustering algorithms, approaches the problem by computing pairwise similarity measures. These similarities are arranged in a matrix, which is then decomposed into eigenvectors and eigenvalues. By examining these eigenvectors, decisions can be made to obtain a final grouping. Since assumptions about central tendency are not made in this case, spectral clustering can perform much better when clusters are arranged on manifolds, provided an appropriate metric is defined.

5.2.1 Clustering by Central Tendency

This class of algorithms does not assume a specific statistical model for the data.

However, some assumptions about the underlying data density are still made. In particular, points in the same cluster are assumed to be distributed around a cluster center with relatively small variance. The goal of these algorithms, then, is to identify where these cluster centers reside by finding the configuration of clusters that optimizes some objective function capturing this variation. To be more specific, for a given set of d-dimensional points {x1, x2, ..., xn}, an algorithm finds the set of k clusters {C1, C2, ..., Ck} that minimizes (or maximizes) an objective function Wk(C).

One example is the measure of within cluster point scatter, which is to be minimized:

W_k(C) = \frac{1}{2} \sum_{g=1}^{k} \sum_{x_i \in C_g} \sum_{x_j \in C_g} d(x_i, x_j) \qquad (5.1)

Here, d(xi, xj) is a distance measure between xi and xj. Unfortunately, globally minimizing Eq. (5.1) requires checking every possible configuration of cluster labels, which is not feasible even for some data sets that would be considered small. Algorithms instead minimize the function using a greedy descent approach, which produces a local, rather than global, minimum.
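For concreteness, Eq. (5.1) can be evaluated directly for a given label configuration; the sketch below uses squared Euclidean distance, and the names are illustrative:

```python
import numpy as np

def within_cluster_scatter(X, labels):
    """W_k(C) of Eq. (5.1) with d(x, y) = ||x - y||^2: half the sum of all
    within-cluster pairwise squared distances."""
    total = 0.0
    for g in np.unique(labels):
        pts = X[labels == g]
        diffs = pts[:, None, :] - pts[None, :, :]   # all pairwise differences
        total += 0.5 * (diffs ** 2).sum()
    return total
```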

k-Means

The k-means algorithm is one of the earliest clustering techniques, and has been investigated by a number of individuals (e.g., Dalenius, 1951; Steinhaus, 1956; Cox, 1957; MacQueen, 1967; Engelman and Hartigan, 1969; Fisher and Van Ness, 1971; Sebestyen, 1962; Lloyd, 1982). It is intended for use with points in R^d, and uses squared L2 distance d(x, y) = \|x - y\|^2. The number of clusters is assumed to be k, and the goal of the algorithm is to locally minimize the function in Eq. (5.1). Within each cluster Cg, the point minimizing the distance to all cluster points is the mean of those points, x̄_g. With Ng being the size of cluster Cg, the objective function can be rewritten:

W_k(C) = \frac{1}{2} \sum_{g=1}^{k} \sum_{x_i \in C_g} \sum_{x_j \in C_g} d(x_i, x_j) = \sum_{g=1}^{k} N_g \sum_{x_i \in C_g} \| x_i - \bar{x}_g \|^2 \qquad (5.2)

Usually, this function is locally minimized using Lloyd’s algorithm (Lloyd, 1982), which iteratively groups points to the nearest of k randomly initialized centroids, then recalculates the centroids of each group by averaging its members.
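A minimal sketch of Lloyd's algorithm follows; the optional init argument, an assumption added here for reproducibility, replaces the random start:

```python
import numpy as np

def lloyd_kmeans(X, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm: assign each point to its nearest centroid, then
    recompute each centroid as the mean of its members; repeat until the
    centroids stop moving (a local minimum of Eq. (5.2))."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    centers = (np.asarray(init, float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)])
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)                  # nearest-centroid labels
        new = np.array([X[labels == g].mean(axis=0) if np.any(labels == g)
                        else centers[g] for g in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```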

There are several drawbacks to this algorithm. The first is its dependence on random starting locations. For different choices of starting points, the algorithm can converge to different results. This can be especially problematic when cluster sizes are not equal. Secondly, note that this algorithm requires specification of the cluster number k, which heavily influences the results. In cases when subject matter knowledge cannot be used to determine k, some automatic methods exist (e.g., the "gap" statistic of Tibshirani et al., 2001).

K-medoids is a variation on the k-means algorithm that restricts each cluster centroid to be one of the elements of the cluster, rather than the mean of the cluster. While this optimization is computationally expensive, it can be run without explicit knowledge of the observations' values. All that is needed is a proximity matrix containing the pairwise distances between points. Methods do exist for efficiently locating cluster centroids (see, e.g., Kaufman and Rousseeuw, 1990; Massart et al., 1983).

Quality Threshold Clustering

QT Clustering, developed by Heyer et al. (1999), has several advantages over k-means. First of all, there are no randomly chosen starting locations, which means clustering results will always be the same for the same set of points. Additionally, the algorithm gives the user more flexibility in the number of groups. Rather than directly specifying the number of groups k, the user instead defines a threshold parameter t that indirectly controls how many groups are detected.

For each point xi, this algorithm forms a candidate cluster Gi by combining xi with its consecutive nearest neighbors y until the distance between Gi and every y ∉ Gi is larger than a threshold t. The candidate set Gi with the largest number of points is then taken to be the first group, and those points are removed from the dataset. The process is then reiterated on the remainder of the points, and continues until all points have been placed in a cluster.

In the QT Clust algorithm, the threshold t affects the size and number of groups. If it is very small, the algorithm will yield a large number of very small clusters. If it is large, the algorithm will give a small number of large clusters. The thresholding also means that very large clusters will often be split into smaller pieces. To use this algorithm, a distance measure must also be defined between a point x and a set S. Usually, complete linkage is used: d(x, S) = \max_{s \in S} d(x, s).
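The QT procedure can be sketched as follows; this quadratic-time toy version favors clarity over the efficiency of a production implementation:

```python
import numpy as np

def qt_clust(X, t):
    """QT clustering sketch: for each point, grow a candidate cluster by
    repeatedly adding the point with the smallest complete-linkage distance
    to the cluster, stopping when every outside point is farther than t;
    keep the largest candidate, remove it, and repeat."""
    X = np.asarray(X, float)
    active = list(range(len(X)))
    clusters = []
    while active:
        best = []
        for i in active:
            cand = [i]
            rest = [j for j in active if j != i]
            while rest:
                # complete linkage: distance from j to the candidate set
                dists = [max(np.linalg.norm(X[j] - X[m]) for m in cand)
                         for j in rest]
                jmin = int(np.argmin(dists))
                if dists[jmin] > t:
                    break
                cand.append(rest.pop(jmin))
            if len(cand) > len(best):
                best = cand
        clusters.append(sorted(best))
        active = [j for j in active if j not in best]
    return clusters
```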

Hierarchical Clustering

Hierarchical algorithms avoid the issue of specifying the number of groups, or having the number of groups depend on a parameter value. These approaches compute cluster results for a range of values of the number of groups k. These clusters are arranged so that each result is nested in the one before it.

There are two kinds of hierarchical algorithms. The first kind, called an agglomerative algorithm, uses a bottom-up approach. Points begin in singleton groups (i.e., k = n) and clusters are progressively merged together as the number of groups k decreases. At each step, the clusters chosen for the merge are the ones that are the least dissimilar, as defined by some dissimilarity measure D. Some commonly used measures D are given below for two groups A and B:

• Single Linkage: D(A, B) = \min_{x_i \in A, x_j \in B} d(x_i, x_j)

• Complete Linkage: D(A, B) = \max_{x_i \in A, x_j \in B} d(x_i, x_j)

• Average Linkage: D(A, B) = \frac{1}{N_A N_B} \sum_{x_i \in A} \sum_{x_j \in B} d(x_i, x_j)

• Centroid Linkage: D(A, B) = \left\| \frac{1}{N_A} \sum_{x_i \in A} x_i - \frac{1}{N_B} \sum_{x_j \in B} x_j \right\|^2 = \| \bar{x}_A - \bar{x}_B \|^2

• D is the sum of within cluster variances.

• D is the increase in variation when clusters are merged (Ward's Criterion).
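In practice these merges are rarely coded by hand; a small illustration with SciPy's agglomerative routines (the toy points are hypothetical):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points; average linkage merges each pair first.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(X, method='average')            # also 'single', 'complete', 'ward'
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the dendrogram at k = 2
```

Here fcluster cuts the resulting hierarchy at a requested number of groups, which mirrors choosing k from the nested sequence described below.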

The second kind of hierarchical algorithm is called a divisive algorithm. These algorithms begin with all points in a single group (i.e., k = 1) and clusters are split apart as the number of groups increases. Options for choosing this split are not as well studied as methods for merging clusters. To choose the cluster to be divided at each stage, two approaches are to choose either the cluster with the largest diameter (Kaufman and Rousseeuw, 1990) or the one with the largest within-cluster pairwise distance. Then, to make the split, one can use k-means on the points inside the cluster, with k = 2.

Either hierarchical clustering method will yield a sequence of nested cluster results \{C_1, \ldots, C_k\}_{k=1}^{n}. The decision is then up to the statistician as to which value of k to choose. Some methods exist for choosing this k. For example, in agglomerative clustering, one can compute the intergroup distance between each of the sub-clusters that are merged at each step. When the intergroup distance jumps from being relatively small to large, this is a sign that the groups being combined have become quite different from each other. Choosing the k that corresponds to this shift is one possibility for deciding on the number of groups. In similar fashion, one can examine the sub-groups after each split in a divisive algorithm to decide on a value of k.

A visual method for choosing the number of groups is to use a dendrogram with the height of the cluster splits determined by the intergroup distances. Clusters that merge high in the plot relative to later splits indicate natural groupings.
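As a concrete illustration (my own code, not part of the original analysis; it relies on scipy's hierarchical clustering routines), the merge criteria listed above can be tried on a toy dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated 2-d groups of 20 points each.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])

# These linkage methods correspond to the dissimilarity measures D above.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                    # the full merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at k = 2
    # Every merge criterion should recover the two obvious groups.
    assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
```

The merge tree Z is exactly the structure a dendrogram displays; scipy.cluster.hierarchy.dendrogram(Z) plots it with merge heights given by the intergroup distances.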

DBSCAN

This algorithm, an acronym for Density-Based Spatial Clustering of Applications with Noise, was created by Ester et al. (1996). This technique chains points together along high-density regions of the feature space to obtain clusters. This chaining approach allows the algorithm to not only pick up compact clusters, but also elongated, irregular structures arranged on lower-dimensional manifolds. Note, however, that since this method depends on the density of clusters of points, it does not perform as well in high dimensional settings.

To cluster the data, DBSCAN first chooses an arbitrary point x and looks at an ε-neighborhood around it. If the number of other points in the neighborhood exceeds a threshold minPts, the point and its neighbors are set aside as a new group. Next, the points within each of the neighbors’ ε-neighborhoods are added to the group, and then their neighbors, and so on. The process continues until no more points exist in the ε-neighborhoods of the boundary points of the group. At this point, all of the points in the cluster are removed from the dataset, and the procedure above is repeated for a new randomly selected point until all points in the dataset have been placed in a cluster.

Note that all clusters that result from this algorithm are forced to be at least ε distance apart at the closest point, since otherwise points in the two clusters would be connected through an ε-neighborhood. However, this also means that two naturally occurring clusters could still be labeled as a single cluster if some chain of points closer than ε distance apart is able to connect them. When this happens, the authors suggest recursively applying the algorithm to the points in that cluster with a larger value of minPts. This should separate the two clusters.
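The chaining procedure can be sketched as follows (my own minimal implementation for illustration, not the authors' code; the full pairwise distance matrix makes it O(n²) and suitable only for small n):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: labels >= 0 are clusters, -1 is noise."""
    n = len(X)
    # All pairwise distances up front: O(n^2), fine only for small n.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = np.flatnonzero(D[i] <= eps)
        if len(neighbors) < min_pts:
            continue                 # not a core point; may be claimed later
        labels[i] = cluster          # start a new group at this core point
        frontier = list(neighbors)
        while frontier:              # chain outward through eps-neighborhoods
            j = frontier.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster
            nbrs_j = np.flatnonzero(D[j] <= eps)
            if len(nbrs_j) >= min_pts:       # expand only through core points
                frontier.extend(nbrs_j)
        cluster += 1
    return labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(4, 0.2, (30, 2))])
labels = dbscan(X, eps=1.0, min_pts=4)
assert set(labels[:30].tolist()) == {0} and set(labels[30:].tolist()) == {1}
```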

5.2.2 Model Based Clustering

Model based approaches to clustering assume that an underlying stochastic mechanism is generating the observed data. By making some assumptions about this mechanism, a parametric model can be developed and used to cluster the points in the data set.

Mixture Modeling

Given a set of points X = {x1,..., xn}, this approach assumes that each point in the data set is generated by one of several components. In particular, we say that a point in cluster i is randomly drawn from some density pi(x|θi) with parameters θi.

Supposing that each of the points is drawn from one of these components 1, . . . , k, the entire set of data represents a random draw from a mixture distribution p defined to be:

p(x|Θ) = Σ_{i=1}^k πi pi(x|θi)

Θ = (π1, . . . , πk, θ1, . . . , θk) is a vector of component parameters θi and component mixing weights πi. By assuming a particular density for each pi(x|θi), the form of the mixture density can be completely specified in terms of the parameter vector Θ.

Ideally, these parameters could be estimated from the data by calculating the MLEs.

The likelihood function is given by:

L(Θ|X) = Π_{i=1}^n p(xi|Θ) = Π_{i=1}^n Σ_{j=1}^k πj pj(xi|θj)

To derive the MLEs, one typically would maximize the log-likelihood:

l(Θ|X) = log L(Θ|X) = Σ_{i=1}^n log p(xi|Θ) = Σ_{i=1}^n log ( Σ_{j=1}^k πj pj(xi|θj) )

However, in this case the maximization is difficult due to the sum in the log term.

Usually, the values of the parameters are instead estimated by the Expectation-Maximization (EM) algorithm (Dempster et al., 1977). The idea is that if only the cluster identities c = {c1, . . . , cn} of the observations were known, the complete log-likelihood l(Θ|X, c) would be much easier to maximize, as it would take the form:

l(Θ|X, c) = Σ_{i=1}^n log [ π_{ci} p_{ci}(xi|θ_{ci}) ] = Σ_{i=1}^n log π_{ci} + Σ_{i=1}^n log p_{ci}(xi|θ_{ci})

However, since c is unknown, the maximization cannot be done. Instead, we estimate f(c|X, Θ*), the distribution of c given the data X and a current estimate of the parameters Θ*:

f(ci|xi, Θ*) = f(ci, xi|Θ*) / f(xi|Θ*) = f(ci|Θ*) f(xi|ci, Θ*) / f(xi|Θ*) = π*_{ci} p_{ci}(xi|θ*_{ci}) / Σ_{j=1}^k π*_j pj(xi|θ*_j)

Using this distribution, we can calculate the expected value of the complete log-likelihood given the data and estimated parameters:

Q(Θ, Θ*) = E_c [ l(Θ|X, c) | X, Θ* ]

This expected value is solely in terms of the parameter vector Θ, and can be maximized. The EM algorithm repeats the expectation and maximization steps for many iterations, updating the parameter vector each time, until the parameters converge.

For example, gaussian mixture modeling treats the underlying distribution of the observations as a mixture of multivariate gaussian distributions:

p(x|Θ) = Σ_{i=1}^k ( πi / ((2π)^{d/2} |Σi|^{1/2}) ) exp{ −(1/2)(x − μi)^T Σi^{−1} (x − μi) },

where Θ = (π1, . . . , πk, μ1, . . . , μk, Σ1, . . . , Σk). One can then use the EM algorithm to estimate the parameters (πi, μi, Σi), i = 1, . . . , k. See the tutorial by Dinov (2008) for details.

Once the parameters of a mixture model have been estimated, the points are treated as a sample from a weighted mixture of known component densities. To assign group labels to the points, one can observe which of the weighted densities is largest at that point. That is, assign point x the cluster label argmax_i π̂i pi(x|θ̂i).
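The E and M steps can be sketched for a one-dimensional, two-component gaussian mixture (illustrative code of my own; the closed-form M-step updates are the standard ones for gaussian components, and the deterministic starting values are my own choice):

```python
import numpy as np

def component_dens(x, mu, var):
    """Gaussian density of each point under each component: shape (n, k)."""
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k=2, iters=100):
    """EM for a 1-d gaussian mixture: a minimal sketch of the steps above."""
    pi = np.full(k, 1.0 / k)
    mu = np.linspace(x.min(), x.max(), k)      # crude deterministic start
    var = np.full(k, np.var(x))
    for _ in range(iters):
        # E-step: responsibilities f(c_i | x_i, Theta*).
        dens = pi * component_dens(x, mu, var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of Q(Theta, Theta*).
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Cluster label: argmax of the weighted component densities.
    labels = (pi * component_dens(x, mu, var)).argmax(axis=1)
    return labels, pi, mu, var

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
labels, pi, mu, var = em_gmm_1d(x)
assert abs(min(mu) + 3) < 0.5 and abs(max(mu) - 3) < 0.5
```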

Mode Association Clustering

This non-parametric clustering technique has recently been advanced by Li et al. (2007). The intuition behind this method is that the underlying distribution of the observations has a certain number of modes. These modes should correspond with the regions in the feature space where the density of the sampled points is largest.

If one were to begin at each of the sampled points and ascend the density to the nearest local maximum, points in the same cluster should yield the same destination – the mode corresponding to their cluster. Points in different clusters will ascend to different modes.

The authors present a technique called Modal EM (MEM) that allows the local maximization of a mixture density. They then estimate the underlying data distribution using gaussian kernel density estimation for a given bandwidth σ. Since the estimate itself is a mixture distribution, one can use Modal EM to locally maximize it, beginning the ascent at each of the observations. The observations that ascend to the same mode are placed in the same cluster.

A modification to this algorithm also creates a hierarchical structure to cluster configurations over a sequence of bandwidths σ1 < . . . < σ_η. Essentially, this is done by first finding cluster modes at the smallest bandwidth. Then, for the next largest bandwidth, the Modal EM algorithm is applied with starting locations at those modes. Thus, all points which converged to the same mode are grouped in the same way according to that mode’s behavior in the next step. As the bandwidth gets larger, the density estimates yield fewer modes, until eventually there is a single mode (and hence, a single cluster). This gives a hierarchical sequence of cluster labels for the observations, and allows one to choose the results that seem the most intuitive.
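For a gaussian kernel density estimate, the MEM ascent step reduces to a simple fixed-point iteration (a weighted-mean update). The sketch below is my own illustration of the mode-association idea, not the authors' implementation; the rounding tolerance used to merge numerically identical modes is an arbitrary choice:

```python
import numpy as np

def ascend(x0, X, sigma, iters=200):
    """Ascend the gaussian KDE from x0 via the MEM fixed-point update."""
    x = x0.astype(float)
    for _ in range(iters):
        w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
        x = (w[:, None] * X).sum(axis=0) / w.sum()   # weighted-mean step
    return x

def mode_cluster(X, sigma):
    """Group points whose ascents land on (numerically) the same mode."""
    modes = np.array([ascend(x, X, sigma) for x in X])
    keys = [tuple(np.round(m, 2)) for m in modes]    # merge equal modes
    uniq = sorted(set(keys))
    return np.array([uniq.index(k) for k in keys])

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
labels = mode_cluster(X, sigma=1.0)
assert len(set(labels.tolist())) == 2
```

Re-running mode_cluster over an increasing sequence of sigma values, starting each ascent from the previous modes, gives the hierarchical variant described above.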

5.2.3 Spectral Clustering Algorithms

Rather than modeling the data distribution or optimizing objective functions to find the optimal cluster, spectral algorithms focus on pairwise distances, or similarities, between points. For two points xi and xj, one can define a symmetric, non-negative similarity function s(xi, xj). Then a useful way of representing the data is in the form of an affinity matrix (a.k.a. similarity matrix or proximity matrix) whose (i, j)th entry is (Kn)i,j = s(xi, xj). Different choices of similarity function can result in different clustering results. For example, three popular kinds of similarity functions are listed below.

• ε-Neighborhood:

s(xi, xj) = 1 if d(xi, xj) ≤ ε, and s(xi, xj) = 0 if d(xi, xj) > ε

Here, d(xi, xj) is a distance measure between points xi and xj (e.g. the L2 distance d(xi, xj) = ||xi − xj||). Points are only similar if they are within some small distance ε of each other. Since the distance is typically very small, the similarity is assigned a value of 1.

• k-Nearest Neighbor:

s(xi, xj) = f(d(xi, xj)) if xi ∈ kNN(xj) or xj ∈ kNN(xi), and s(xi, xj) = 0 otherwise

kNN(x) is the set of k nearest neighbors to the point x, and f(·) is a decreasing function. By defining s(xi, xj) in this way, it is not guaranteed to be symmetric, which is problematic. However, an alternate symmetric version of the k-nearest neighbor similarity function requires both points to be in the other’s group of nearest neighbors. Such a function is called a mutual k-nearest neighbor function.

• Kernel function:

s(xi, xj) = K(d(xi, xj))

The function K can be any kernel, but it is usually restricted to be a symmetric, positive semi-definite function, called a Mercer kernel. A popular choice is K(x) = exp{−x²/(2ω²)}, the gaussian kernel with bandwidth ω.

Once a similarity function is decided upon, spectral algorithms make use of its eigenvectors to arrive at a clustering decision. There are many algorithms in this category, and each one uses eigenvectors from a slightly different matrix. All of these matrices are a version of the affinity matrix Kn, the weight matrix Wn = Kn − I, or its laplacian Dn − Wn, where Dn is a diagonal matrix called the degree matrix, with (Dn)ii = Σ_{j=1}^n (Wn)ij. Several historical algorithms are briefly described below. In the descriptions of the algorithms, the terms largest eigenvector and smallest eigenvector are used to describe the eigenvectors with the largest and smallest associated eigenvalues, respectively. Other such terms have similar meanings.

The Algorithm of Scott and Longuet-Higgins (1990)

The developers of this algorithm (Scott and Longuet-Higgins, 1990) approached the problem of clustering from a molecular physics perspective. For a dataset {x1,..., xn}, they use the gaussian kernel function to construct the affinity matrix Kn, which is given by:

(Kn)i,j = exp( −||xi − xj||² / (2ω²) )

The algorithm rests on the observation that the eigenvectors of Kn with the few largest associated eigenvalues seem to contain information about cluster membership.

The algorithm works by first computing the top k eigenvectors of Kn, then arranging them in columns in an n × k matrix V. Each row of this matrix is normalized to length 1, and the matrix Q = V V^T is computed. Points xi and xj that are in the same group will have similar magnitudes in each of the eigenvectors, which results in similar rows i and j of the normalized V matrix. In this case, the entry Qij, which is the cosine of the angle between the rows, will be approximately 1. When xi and xj are in different groups, Qij will be closer to 0. Therefore, to cluster the points, one should examine the Q matrix and split it into groups of points with entries near 1.

The Normalized Cuts Algorithm (1997)

The normalized cuts procedure was created by Shi and Malik (2000). They approached clustering from a graph theoretic point of view, and considered the problem of how to split the data set into two groups in an optimal fashion. The points x1,..., xn are represented as nodes in a graph G, and the entries of the affinity matrix are represented by weights wij on the edges connecting those nodes. However, self edges on a graph are not permitted, so the weights wii = 0 for i = 1, . . . , n. Thus the new affinity matrix (a.k.a. weight matrix) is Wn = Kn − I, the original affinity matrix modified to have zeroes on the diagonal.

The goal of the algorithm is to define two subgraphs A and B such that the sum of edge weights between the groups is smallest, relative to the sizes of the groups.

The split between two groups is called the normalized cut, which is to be minimized.

The authors show that determining which groups A and B minimize this cut value is equivalent to doing a discrete minimization for the vector y on the Rayleigh Quotient,

( y^T (Dn − Wn) y ) / ( y^T Dn y )

where Dn is the degree matrix with (Dn)ii = Σ_{j=1}^n (Wn)ij, and the elements of y are restricted to be either 1 or a specific negative constant. Unfortunately, completing the discrete minimization would require calculating the quotient for every possible configuration of two groups, which is not feasible even for relatively small data sets.

In order to carry out the computation, Shi and Malik suggest a relaxation on the discrete constraint to allow a continuous solution to the minimization. However, there is no assurance that this continuous solution in any way resembles the discrete one.

Minimizing the Rayleigh Quotient is equivalent to solving the generalized eigenvalue system (Dn − Wn) y = λ Dn y. The smallest eigenvector is always the trivial solution 1_{n×1}, with eigenvalue 0. Thus, the non-trivial minimizer is the second-smallest eigenvector of this generalized eigensystem. Note that if one multiplies both sides of the equation by Dn^{−1}, the minimizer is also the second-smallest eigenvector of Dn^{−1}(Dn − Wn) (see von Luxburg, 2007 for more information).

Therefore, to split the dataset into two groups, the normalized cuts algorithm first computes the normalized weight matrix L = Dn^{−1}(Dn − Wn). The second smallest eigenvector v_{n−1} of L will show a stepwise constant structure, which can be used to segment the dataset into two pieces. The next smallest eigenvectors v_{n−2}, v_{n−3}, etc. will recursively partition each of the clusters into subclusters, if more than two groups exist.
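The two-way split can be sketched as follows (illustrative code of my own; the bandwidth and the median threshold on the second-smallest eigenvector are my own choices):

```python
import numpy as np

def ncut_split(X, omega):
    """Two-way split via the second-smallest eigenvector of Dn^{-1}(Dn - Wn)."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2 * omega ** 2)) - np.eye(len(X))  # zero the self edges
    D = np.diag(W.sum(axis=1))
    L = np.linalg.inv(D) @ (D - W)
    vals, vecs = np.linalg.eig(L)                # L is not symmetric
    order = np.argsort(vals.real)
    v = vecs[:, order[1]].real                   # second-smallest eigenvector
    return (v > np.median(v)).astype(int)        # threshold its two plateaus

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = ncut_split(X, omega=2.0)
assert len(set(labels[:20].tolist())) == 1 and len(set(labels[20:].tolist())) == 1
```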

The Algorithm of Perona and Freeman (1998)

As Scott and Longuet-Higgins suggested, the largest eigenvectors seem to contain information about cluster membership. Perona and Freeman (1998) offered an explanation as to why this is the case. They proved that for a block diagonal affinity matrix Kn, the top eigenvector will be exactly zero for points in all but one of the blocks, and non-negative for the remaining ones.

This means that if the similarity function is chosen such that all points in different clusters have similarities of zero, the top eigenvector of the affinity matrix will separate out one of the groups (the foreground) from the rest (the background) by thresholding the vector to positive values. In general, if pairwise similarities for points in different groups are no more than ε, then the top eigenvector will be no more than ε for all but one of the groups, and non-negative (presumably larger) for the other.

Perona and Freeman proposed using a gaussian affinity matrix (Kn)i,j and its top eigenvector v1 to split the data into two parts. With a properly chosen bandwidth, their proofs showed that the top eigenvector should have a value near zero for points in the foreground, and larger positive values for those in the background. However, this rests on the assumption that the affinity matrix will have a block diagonal structure, which is not always the case (see Weiss, 1999, for an example).

The Modified Normalized Cuts Algorithm (2001)

In this spectral clustering approach, Meila and Shi (2001) adapted the normalized cuts algorithm using a random walk perspective. Once again, the points x1,..., xn are represented as nodes on a graph, and similarities between points are weights on the edges of the graph. The affinity matrix is Wn = Kn − I, since self edges are not permitted.

As before, let the diagonal degree matrix Dn be defined by (Dn)ii = Σ_{j=1}^n (Wn)ij. Then each row of the matrix Pn = Dn^{−1} Wn sums to one, and Pn represents transition probabilities from one node to another. The authors prove that among the eigenvectors of Pn, k of them will be piecewise constant over some partition (A1,...,Ak) of the points. Here, a vector v is piecewise constant over a partition (Ai), i = 1, . . . , k, if v(x) = v(y) whenever x, y ∈ Aj. In this context, the partition represents the separation of the points into naturally occurring clusters.

In practice, these k piecewise constant eigenvectors tend to be the top eigenvectors of Pn. Therefore, one need only calculate these top k eigenvectors of Pn and arrange them in columns in a matrix V. The rows of V are then treated as points in R^k and can be clustered with any algorithm (e.g. k-means). The cluster label for row i is then given to the point xi.

The Algorithm of Ng, Jordan, and Weiss (2001)

Building on the techniques developed by Scott and Longuet-Higgins (1990) and Meila and Shi (2001), the authors of this algorithm chose to examine eigenvectors of a slightly different matrix. Rather than decomposing the matrix Pn = Dn^{−1} Wn as in the modified normalized cuts algorithm, they use the matrix

P*_n = Dn^{−1/2} Wn Dn^{−1/2}.

As in the Scott and Longuet-Higgins algorithm, the top k eigenvectors of P*_n are arranged in columns in the matrix V, which then has its rows normalized to length 1. Group labels are then extracted by treating the rows of V as observations in R^k and clustering them with k-means. The point xi is assigned the cluster label given to row i.
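The full procedure can be sketched as follows (illustrative code of my own; the tiny k-means with deterministic farthest-point initialization stands in for any k-means routine):

```python
import numpy as np

def njw_cluster(X, k, omega, iters=20):
    """Spectral clustering via the top-k eigenvectors of Dn^{-1/2} Wn Dn^{-1/2}."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-D2 / (2 * omega ** 2)) - np.eye(len(X))
    d = W.sum(axis=1)
    P = W / np.sqrt(np.outer(d, d))                   # Dn^{-1/2} Wn Dn^{-1/2}
    vals, vecs = np.linalg.eigh(P)
    V = vecs[:, -k:]                                  # top k eigenvectors
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # normalize the rows
    # Tiny k-means on the rows of V, with deterministic farthest-point init.
    centers = V[[0]]
    for _ in range(k - 1):
        dists = ((V[:, None] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, V[dists.argmax()]])
    for _ in range(iters):
        labels = ((V[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.vstack([V[labels == j].mean(0) for j in range(k)])
    return labels

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
labels = njw_cluster(X, k=2, omega=1.0)
assert len(set(labels[:20].tolist())) == 1 and len(set(labels[20:].tolist())) == 1
```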

Data Spectroscopic Clustering (2009)

This newer method by Shi et al. (2009) exploits properties of a density-dependent convolution operator to both identify the number of clusters and assign points to those clusters. For a density P, the convolution operator with respect to kernel K(x, y) is given by:

K_P f(x) = ∫ K(x, y) f(y) P(y) dy

This operator is itself approximated by an empirical operator defined over the randomly sampled observations from P, i.e.,

K_{Pn} f(x) = (1/n) Σ_{i=1}^n K(xi, x) f(xi)

Finally, it can be shown that the eigenvectors of the affinity matrix Kn evaluate the eigenfunctions of K_{Pn} at locations x1,..., xn. In other words, the eigenvectors of the affinity matrix represent a discretization of eigenfunctions of K_{Pn}, and are therefore an approximate discretization of eigenfunctions of K_P.

This means that as long as n is sufficiently large, one would expect the properties of eigenfunctions of KP to be present in the eigenvectors of Kn, with some perturbation.

The authors prove that the convolution operator does in fact have some interesting properties:

• All of the eigenfunctions of K_P decay away from the high density area of P.

• The top eigenfunction has no sign change on R^d, has multiplicity one, and is non-zero on the support of P.

• If P is a mixture density, the top eigenfunctions of each component will appear in perturbed form in the eigenfunctions of K_P. Given sufficient separation between high density areas of these components, these eigenfunctions will also have no sign change up to small threshold ε.

Step 1: Calculate the affinity matrix (Kn)i,j = exp( −||xi − xj||² / (2ω²) ).
Step 2: Compute a large number of its top eigenvectors, say k of them: {v1, ..., vk}.
Step 3: Locate the Ĝ eigenvectors {v(1), ..., v(Ĝ)} for which there is no sign change, up to threshold ε.
Step 4: Put point x into group g* = argmax_g |v(g)(x)|.

Table 5.1: Data Spectroscopic Clustering Algorithm

These results suggest that if one examines eigenvectors of Kn, they will have similar properties. Namely, there will be one eigenvector for each mixture component that has no sign change up to a small threshold ε. Furthermore, this eigenvector will be large over the high density region of the component, and decay to zero in the tails. Therefore, cluster membership can be determined for a point x by finding the component eigenvector that has the largest magnitude at location x. The algorithm is summarized in Table 5.1.

In practice, this method seems well-equipped to locate lower dimensional manifolds in the data. Furthermore, unlike k-means, it does not assume balanced groups, and is able to work well even if groups are of varying sizes. One major asset of this algorithm is that the number of groups need not be specified, but is instead controlled by the bandwidth ω. Also, the cluster results are deterministic – since there is no k-means clustering of the points in a projected space (as in Meila and Shi, 2001 and Ng et al., 2004), the cluster labeling is not subject to random starting locations.
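The steps of Table 5.1 can be sketched as follows (illustrative code of my own; the number of eigenvectors examined and the threshold ε are my own choices):

```python
import numpy as np

def daspec(X, omega, eps=1e-6, n_top=10):
    """Data spectroscopic clustering sketch, following the steps of Table 5.1."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-D2 / (2 * omega ** 2))               # Step 1
    vals, vecs = np.linalg.eigh(K)
    top = vecs[:, ::-1][:, :n_top]                   # Step 2: top eigenvectors
    # Step 3: keep eigenvectors with no sign change, up to threshold eps.
    no_flip = [v for v in top.T if (v >= -eps).all() or (v <= eps).all()]
    G = np.array(no_flip)                            # one row per detected group
    labels = np.abs(G).argmax(axis=0)                # Step 4: argmax_g |v(g)(x)|
    return labels, len(no_flip)

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (25, 2)), rng.normal(5, 0.3, (25, 2))])
labels, n_groups = daspec(X, omega=0.7)
assert n_groups == 2
```

Note that the number of groups (here 2) comes out of Step 3 rather than being supplied in advance.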

Algorithm                  Eigenvectors Used          Matrix as function of Wn and Dn
Scott & Longuet-Higgins    k largest (normalized)     Wn + I
Shi & Malik                2nd through kth smallest   I − Dn^{−1} Wn
Perona & Freeman           single largest             Wn + I
Meila & Shi                k largest                  Dn^{−1} Wn
Ng, Jordan & Weiss         k largest (normalized)     Dn^{−1/2} Wn Dn^{−1/2}
Shi, Belkin & Yu           k with no sign change      Wn + I

Table 5.2: Comparison of Spectral Algorithms

Similarity of Spectral Algorithms

Several different spectral methods have been presented here, and the intuition behind them arises in different ways. Some of the formulations use a graph theoretic approach, while others directly consider properties of the eigenvectors of Kn.

However, the calculations performed in each algorithm are strikingly similar.

In every case, one uses either the top k or bottom k eigenvectors from a matrix. These matrices are all cousins of each other, and can most easily be expressed in terms of the weight matrix Wn = Kn − I and its degree matrix Dn. Recall that Dn is diagonal, with (Dn)ii = Σ_{j=1}^n (Wn)ij.

Weiss (1999) also provides some general results on perturbed block-diagonal matrices and shows how some of these methods share properties. For instance, the second smallest generalized eigenvector of the Shi and Malik algorithm can be written as a componentwise ratio of the first and second smallest generalized eigenvectors (since the first is a vector of ones). This ratio actually corresponds to the largest eigenvector of the affinity matrix that is used in the Perona and Freeman algorithm.

Weiss also discusses the shared properties between the Shi and Malik algorithm and the Scott and Longuet-Higgins algorithm, whose marriage is seen in the approach by Ng, Jordan, and Weiss.

Comparison of Spectral Clustering to Other Approaches

Spectral algorithms approach the task of data clustering from a completely different angle than other approaches. These methods map the observations into another space through a kernel function, then find the solution of a particular eigensystem involving the affinity matrix. Combinatorial, hierarchical, and modeling approaches, on the other hand, attempt to locate clusters by directly discovering relatively dense regions in the feature space. There are advantages and disadvantages to both approaches.

One major drawback to most classical approaches is an implicit assumption that clusters are compact. While such an assumption is often reasonable, this is not always the case (e.g. in image segmentation, they are not always compact in pixel space). Some algorithms like Single Linkage Agglomerative Clustering and DBSCAN are able to locate irregular clusters, but at the cost of poorer performance in other ways. Spectral algorithms, on the other hand, create clusters solely through the examination of pairwise similarity, which assumes nothing about the overall shape of the clusters. Different choices of the kernel can allow one to uncover very irregularly shaped clusters lying on lower-dimensional manifolds of the feature space.

Spectral algorithms’ dependence on pairwise similarity also affords another benefit. In situations when the exact observations are unknown, but relational information allows the construction of the affinity matrix Kn, spectral algorithms can still be used for segmentation. In contrast, most classical clustering methods must use the values of the observations.

The Achilles’ heel for spectral algorithms is the price paid for computation. Calculating the entries of the affinity matrix and performing the eigenvector decomposition can be taxing for large data sets. While some classical algorithms like k-means may take time computing a large number of pairwise distances, they still eventually converge. Since spectral algorithms have to store all pairwise similarities simultaneously, and perform complex calculations on the matrix of those values, they can sometimes be intractable. Classical algorithms also lessen the burden on memory, because fewer values must be stored at the same time. A discussion of how to circumvent some of the computational problems of spectral algorithms is presented in Section 5.3.

5.2.4 Measuring the Quality of Results

Clearly, a multitude of algorithms exist to cluster data into naturally occurring groups. Once this has been done, how does one evaluate whether the results are appropriate? How does one decide which of two results is “better”? Unfortunately, such comparisons are highly subjective. Two people, when presented with a pair of clustering results, may disagree over which one is better. They may not even agree about how many groups are present in the data. Nevertheless, one can attempt to objectively measure the quality of a clustering result.

If one happens to have validation data, i.e. data that is already “correctly” clustered, this data could be used to assess the appropriateness of a particular cluster configuration. For example, suppose a validation partition P exists, and a given cluster configuration C is to be evaluated. One can consider every possible pair of observations in the dataset and calculate the following quantities:

• a = Total # of pairs placed in the same cluster in both C and P .

• b = Total # of pairs placed in the same cluster in C but different clusters in P .

• c = Total # of pairs placed in different clusters in C but the same cluster in P .

• d = Total # of pairs placed in different clusters in both C and P .

Functions of these quantities can be used to assess similarity between C and P . High values reflect the quality of the cluster configuration. Some examples include:

• The Rand Statistic: R = (a + d) / (a + b + c + d)

• The Jaccard Coefficient: J = a / (a + b + c)

• The Fowlkes and Mallows Index: FM = √( (a / (a + b)) · (a / (a + c)) )
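Computing the four pair counts and the three indices for a small example (illustrative code of my own):

```python
from itertools import combinations

def pair_counts(C, P):
    """The pair quantities a, b, c, d for configurations C and P."""
    a = b = c = d = 0
    for i, j in combinations(range(len(C)), 2):
        same_c, same_p = C[i] == C[j], P[i] == P[j]
        a += same_c and same_p          # same cluster in both
        b += same_c and not same_p      # same in C, different in P
        c += not same_c and same_p      # different in C, same in P
        d += not same_c and not same_p  # different in both
    return a, b, c, d

C = [0, 0, 0, 1, 1, 1]                  # a clustering result to evaluate
P = [0, 0, 1, 1, 1, 1]                  # the validation partition
a, b, c, d = pair_counts(C, P)          # (4, 2, 3, 6) over the 15 pairs
rand = (a + d) / (a + b + c + d)
jaccard = a / (a + b + c)
fm = ((a / (a + b)) * (a / (a + c))) ** 0.5
```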

Typically, data clustering is done in an unsupervised situation where such a validation partition P does not exist. In this case, one approach is to evaluate criteria that in some way measure desired properties of the ideal configuration. A commonly used technique is a comparison between the affinity matrix Kn and the configuration C via Hubert’s Γ statistic:

Γ = ( 2 / (n(n − 1)) ) Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (Kn)i,j · Yij

Here, Yij is an indicator function that takes value 1 if xi and xj are in different groups, and 0 otherwise. When the configuration represents a hierarchical clustering result, the comparison between Kn and C can be made using a cophenetic correlation coefficient index (CPCC). Many other methods exist for evaluating hard clustering results, and a thorough list can be found in Gan et al. (2007). Delling et al. (2006) have also developed a battery of unit-tests – i.e., testing of individual units, or steps of the algorithm through examples whose solutions are known – that ensure the cluster configuration has certain desirable properties.

Finally, clustering a collection of observations is usually one step in a larger decision-making process. As a result, one could also consider evaluating a cluster configuration by observing the success of decisions made as a result of the cluster analysis. Configurations that lead to bad decisions could be considered poor, while those that yield good decisions would be highly desirable. Such measures of “good” and “poor” could be quantitatively measured for many applications. However, this method of evaluation is case specific – as such, it would only be useful if the decision- making process is repeated periodically on similar data.

5.3 Spectral Clustering for Large Datasets

Spectral algorithms provide a powerful way to segment a data set into groups without necessarily assuming characteristics like compactness and equal cluster size.

However, construction of the affinity matrix Kn requires n(n − 1)/2 = O(n²) computations. The size of this matrix can cause problems for moderate to large values of n. Section 5.3.1 contains a discussion of the ways this can affect the application of spectral methods, and sparse matrix techniques that are commonly used to overcome difficulties.

Another method, the Nyström extension, gives a way to approximate the eigenvectors of the affinity matrix using a single subsample from the data. Part of my research with Dr. Tao Shi has aimed to better understand the perturbation that is introduced by sampling variability. In particular, random variation from sample to sample can lead to unstable clustering results depending on what spectral algorithm is used. Section 5.3.2 shows empirical results that illustrate the impact this variability can have on spectral clustering algorithms.

In order to lessen the random perturbation effects of the single sample approximation, I worked on a new algorithm that combines information from multiple samples to approximate the eigenvectors of the full affinity matrix. The multiple sampling algorithm is tailored to work with data spectroscopic clustering, but the ideas could be extended to other algorithms as well. A detailed outline of the procedure is given in Chapter 6.

5.3.1 Sparse Matrix Representations

A common approach to overcome computational requirements is to introduce sparsity to the affinity matrix Kn, and then use a sparse eigensolver to compute its eigenvectors. There are a variety of ways to do this, and they each have their own merits.

A description of these methods is given in the following sections.

Computation of Pairwise Distances

Clearly, one of the potential problems in computing the affinity matrix is the amount of time it takes to perform all n(n − 1)/2 calculations. Especially when n is large, this can easily make an algorithm intractable. Approaches to circumvent this issue rely on the fact that when two points are far apart, the similarity function will be very close to zero. Rather than compute the similarity function for these points, it is instead set to zero.

Practically speaking, this means that the similarity function is assumed to be zero unless the points are close enough to each other. Making such approximations does not seem to generate enough perturbation to change clustering results, and the resulting sparse matrix representation of Kn quickens later calculations as well. There are two commonly used methods for introducing matrix sparsity:

• ε-Neighborhood Approach: Only compute the similarity between points that are within a certain distance threshold ε of each other. While this method drastically reduces the number of calculations made (especially if ε is very small), it also has the unfortunate effect of splitting apart large, homogeneous regions of the feature space that intuitively should be considered a single cluster.

• k-Nearest Neighbors Approach: Only compute the similarity between x and the k nearest neighbors of x. This method can be modified so that the similarity between x and y is only calculated if x is a nearest neighbor of y and vice versa (see, e.g., von Luxburg, 2007). This is called the mutual k-nearest neighbor method, and has the symmetry property that similarity functions require. As in the ε-neighborhood approach, very large clusters will tend to be split into smaller pieces. In this case, the size of the pieces depends on how densely populated the cluster is.
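The mutual k-nearest neighbor construction can be sketched as follows (illustrative code of my own; note that this version still computes all pairwise distances, so it saves storage rather than distance computations):

```python
import numpy as np
from scipy.sparse import csr_matrix

def mutual_knn_affinity(X, k, omega):
    """Sparse affinity: keep s(xi, xj) only for mutual k-nearest neighbors."""
    n = len(X)
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D2, np.inf)            # exclude self from neighbor lists
    nn = np.argsort(D2, axis=1)[:, :k]      # each point's k nearest neighbors
    is_nn = np.zeros((n, n), dtype=bool)
    is_nn[np.arange(n)[:, None], nn] = True
    mutual = is_nn & is_nn.T                # both points in each other's list
    S = np.where(mutual, np.exp(-D2 / (2 * omega ** 2)), 0.0)
    return csr_matrix(S)                    # store only the non-zero triples

rng = np.random.default_rng(8)
X = rng.normal(0, 1, (100, 2))
S = mutual_knn_affinity(X, k=5, omega=1.0)
assert S.nnz <= 100 * 5                     # far fewer than n^2 stored entries
```

The csr_matrix at the end is exactly the (i, j, value) triple storage discussed in the Matrix Storage section below.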

Note that these methods are only useful when there is some computationally cheap way to determine the distance between two points. For instance, in image applications, a circle of a given radius ε can be drawn around a given pixel x. Then, one need only see if a new pixel y is in that set of pixels covered by the circle to determine whether s(x, y) should be calculated. This can save computation time for large images.

In many situations, however, the similarity function is just a function of the distance, i.e. computing the distance takes nearly as much time as calculating the similarity. In these cases, the above approaches yield a negligible reduction in computation time.

Matrix Storage

For large data sets, the n × n matrix K_n is simply too large to fit into memory. In this case, the usual approach is to represent K_n as a sparse matrix. Most programming packages offer this representation, which only requires storage of the (i, j, value) triples for the non-zero entries of the matrix. To obtain such a sparse matrix, one would use one of the methods in Section 5.3.1.
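As an illustration of the (i, j, value) triple representation, consider the following minimal Python sketch (the tiny dense matrix is invented for the example; real sparse matrix libraries implement this far more efficiently):

```python
# Sparse (i, j, value) storage for a symmetric affinity matrix.
# A small dense 4x4 matrix with mostly-zero entries, purely for illustration.
dense = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

# Keep only the non-zero entries as (i, j, value) triples.
triples = [(i, j, v) for i, row in enumerate(dense)
           for j, v in enumerate(row) if v != 0.0]

def lookup(triples, i, j):
    """Retrieve K[i, j] from the triple list (zero if the entry is absent)."""
    return next((v for a, b, v in triples if (a, b) == (i, j)), 0.0)
```

For a matrix with two well-separated blocks like this one, the triple list stores 8 entries instead of 16; the savings grow dramatically as the matrix becomes larger and sparser.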

Decomposing the Matrix

Each of the spectral algorithms requires calculation of a number of eigenvectors of the affinity matrix K_n or its (normalized) graph Laplacian. For large matrices, this can cause some difficulty. Typically, when the size of the matrix makes the decomposition intractable, sparseness is introduced into the matrix. Many eigensolvers exist for decomposing such sparse matrices. Perhaps the most popular of these is the Lanczos algorithm (see, e.g., Cullum and Willoughby, 2002), which converts the matrix to tri-diagonal form using an iterative process. Eigenvectors of matrices in this form can be computed easily. Introducing sparseness into the matrix can be done using thresholding or nearest neighbor approaches, as previously mentioned. However, if the affinity matrix is already in memory, some other options are available as well.

One possibility is to sort the elements of the affinity matrix and zero out the smallest entries. The definition of "smallest" can be used to control how sparse the resulting matrix is. Truncating only a small number of entries will not yield enough sparseness to allow the eigensolvers to converge quickly. Truncating many of the entries could create problems similar to thresholding; namely, clusters will only contain local collections of points, and large clusters will be broken into smaller pieces.

Achlioptas and Frank (2001) suggest two other methods for converting K_n to a sparse matrix. The first is to randomly zero out elements of the affinity matrix, which allows one to avoid computing all of the entries. The second is to round the entries to ±b, for some value b. The authors show that these methods amount to using a "randomized" kernel that behaves like the true kernel in expectation, and they provide results concerning the accuracy of this approximation.
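A hedged Python sketch of the two randomized schemes as described above. The exact probabilities used here, keep-with-probability p (rescaled by 1/p) and round-to-±b with probability (1 + k/b)/2, are one natural way to obtain the stated expectation property, not necessarily the authors' exact construction:

```python
import random

def sparsify_random(K, p):
    """Zero out each entry independently with probability 1 - p, rescaling
    kept entries by 1/p so the result equals K in expectation."""
    n = len(K)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if random.random() < p:
                S[i][j] = K[i][j] / p
    return S

def round_entries(K, b):
    """Round each entry k to +b with probability (1 + k/b) / 2, else to -b.
    The rounded entry then equals k in expectation (assumes |k| <= b)."""
    n = len(K)
    return [[b if random.random() < (1 + K[i][j] / b) / 2 else -b
             for j in range(n)] for i in range(n)]

random.seed(0)
K = [[1.0, 0.5], [0.5, 1.0]]       # a tiny invented affinity matrix
S = sparsify_random(K, p=0.5)
R = round_entries(K, b=1.0)
```

A quick check of the expectation for the rounding scheme: b·(1 + k/b)/2 + (−b)·(1 − k/b)/2 = k, so each randomized entry is unbiased for the true kernel value.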

5.3.2 Single Subsample Approximation

One other way of improving the efficiency of spectral algorithms is to approximate the results through subsampling. Fowlkes et al. (2004) suggest a procedure called the Nyström extension, which uses eigenvectors of a subsample affinity matrix to estimate the eigenvectors v_j of the full affinity matrix K_n. A subsample {x*_1, ..., x*_m} of size m is drawn from the dataset, and these sampled observations are used to create a subsample affinity matrix K*_m. Denote the eigenvectors and eigenvalues of this matrix v*_j and λ*_j, respectively. Then the eigenvectors v_j from the full dataset can be approximated at location x by a weighted average of the subsample eigenvector values:

$$\tilde{v}_j[x] = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda_j^*} \sum_{i=1}^{m} K(x_i^*, x)\, v_j^*[x_i^*], \qquad (5.3)$$

where j = 1, ..., m and v[x] denotes the element of the vector v that corresponds to the data point x. The vector ṽ_j approximates an eigenvector v_j obtained from the affinity matrix K_n using all n observations, rather than the subsample. The eigenvalues λ_j can also be approximated by

$$\tilde{\lambda}_j = \frac{n}{m}\,\lambda_j^*.$$

To simplify the process, the extension can be expressed in matrix form:

$$\tilde{v}_j = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda_j^*}\, K_{n,m}\, v_j^*,$$

where the (i, j)-th element of the n × m matrix K_{n,m} is K(x_i, x*_j).

Note that the weights in Eq. (5.3) are the kernel values K(x*_i, x), which measure the similarity between a sampled observation x*_i and the approximation location x. Therefore, the vector values v*_j[x*_i] are weighted more when x*_i is closer to x, and less when the points are further apart. Essentially, the Nyström extension works like a kernel smoother that extends the subsample eigenvectors to full length.
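The following numpy sketch illustrates Eq. (5.3) on a hypothetical one-dimensional dataset (the data, bandwidth, and sample sizes are invented for illustration; the thesis code is in Matlab):

```python
import numpy as np

rng = np.random.default_rng(0)
omega = 0.5

def kernel(a, b):
    """Gaussian kernel matrix between two 1-d arrays of points."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * omega ** 2))

# Full dataset of n points and a subsample of size m.
x = np.sort(rng.normal(size=200))
n, m = len(x), 40
xs = rng.choice(x, size=m, replace=False)

# Eigendecomposition of the subsample affinity matrix K*_m.
Km = kernel(xs, xs)
lam, V = np.linalg.eigh(Km)          # eigh returns ascending eigenvalues
lam, V = lam[::-1], V[:, ::-1]       # reorder: largest eigenvalue first

# Nystrom extension of the j-th subsample eigenvector to all n points (Eq. 5.3):
# v_tilde[x] = sqrt(m/n) * (1/lambda*_j) * sum_i K(x*_i, x) v*_j[x*_i]
j = 0
v_tilde = np.sqrt(m / n) * (1.0 / lam[j]) * kernel(x, xs) @ V[:, j]
```

At the sampled locations the extension reproduces (a rescaling of) the subsample eigenvector itself; elsewhere it interpolates, which is the kernel-smoother behavior described above.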

Empirical Perturbation Analysis

The approximate eigenvectors obtained via the Nyström extension are used in subsequent analysis and assignment of cluster labels. It is therefore natural to investigate how well this approximation performs.

Depending on the spectral method being used for clustering, certain behaviors could be considered desirable for the sample eigenvectors of the affinity matrix (or its graph Laplacian) to have. First of all, each sample eigenvector should be a good approximation to one of the eigenvectors obtained using the full matrix; in this way, minimal group membership information is lost in the sampling process. Second, the ordering of eigenvectors can impact cluster results, especially for methods that only examine the top or bottom few eigenvectors of the matrix. Thus, another desirable property is that the sample eigenvectors preserve the same ordering as the full eigenvectors. Combining these two properties, a reasonable expectation of a sampling procedure is that for the top k sample eigenvectors v*_1, v*_2, ..., v*_k and the top k full eigenvectors v_1, v_2, ..., v_k, we have v*_i[x*] ≈ v_i[x*] for all i = 1, ..., k and sample points x*.

First, consider the simulation study presented in Figure 5.1. In this case, a dataset of size n = 1000 is generated from a univariate mixture Gaussian distribution defined by

$$P(x) = \frac{2}{10}N(-4, 0.4^2) + \frac{2}{10}N(-2, 0.4^2) + \frac{4}{10}N(0, 0.4^2) + \frac{1}{10}N(2, 0.3^2) + \frac{1}{10}N(4, 0.3^2). \qquad (5.4)$$

The top of the figure shows a histogram of the dataset, followed by the top five eigenvectors of the affinity matrix. The matrix was constructed using a Gaussian kernel with bandwidth ω = 0.25. Clearly, four of the five eigenvectors identify distinct groups in the dataset, while one of them (#2) carries redundant information associated with the center group.
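The simulation setup can be reproduced in outline with the following numpy sketch (illustrative; the thesis code is in Matlab, and the random seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, omega = 1000, 0.25

# Mixture from Eq. (5.4): component weights, means, and standard deviations.
w = np.array([0.2, 0.2, 0.4, 0.1, 0.1])
mu = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
sd = np.array([0.4, 0.4, 0.4, 0.3, 0.3])

# Draw component labels, then the observations.
comp = rng.choice(5, size=n, p=w)
x = rng.normal(mu[comp], sd[comp])

# Gaussian-kernel affinity matrix and its top five eigenvectors.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * omega ** 2))
lam, V = np.linalg.eigh(K)           # ascending eigenvalues
top5 = V[:, -5:][:, ::-1]            # columns ordered by decreasing eigenvalue
```

Plotting the columns of `top5` against x (sorted) would reproduce the qualitative behavior shown in row 2 of Figure 5.1: eigenvectors concentrated on individual mixture components.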

Figure 5.1: Row 1: Histogram of the original dataset (n = 1000) drawn i.i.d. from the mixture Gaussian distribution defined in (5.4). Row 2: Top five eigenvectors of K_n. Rows 3 through 6: Top five eigenvectors of K*_m for each of 4 samples (m = 300). To aid comparison with the first row, the vectors are plotted using the ordering in the original data space. Row 7: Top five eigenvectors from each of the four subsamples (size m = 300) overlaid.

Next, four different subsamples of size m were drawn from the dataset, and the subsample affinity matrix K*_m was constructed for each one. The next four rows of Figure 5.1 show the top five eigenvectors of these subsample affinity matrices. Clearly, these eigenvectors also contain grouping information that is similar to the full eigenvectors in row 2. However, the ordering of the eigenvectors varies from sample to sample. This means that there is no guarantee that the ith eigenvector from a subsample will contain the same cluster information as the ith eigenvector of K_n.

Additionally, note that the subspaces spanned by the top five eigenvectors are not even the same. The first two subsamples identify four of the five groups in the data. However, the remaining subsamples only identify the three groups on the left.

Theoretical Perturbation Analysis

To understand why this perturbation occurs from a theoretical standpoint, consider the dataset to be a realization from a probability density P. The kernel K used in the affinity matrix is strongly connected to the convolution operator given by

$$K_P f(x) = \int K(x, y) f(y)\, dP(y). \qquad (5.5)$$

As discussed in Section 5.2.3, this operator is approximated by the empirical operator K_{P_n} defined over observations in the dataset, where

$$K_{P_n} f(x) = \int K(x, y) f(y)\, dP_n(y) = \frac{1}{n} \sum_{i=1}^{n} K(x_i, x) f(x_i). \qquad (5.6)$$

As shown in von Luxburg et al. (2008) and Shi et al. (2009), the eigenvalues of K_{P_n} are the same as those of (1/n) K_n. Also, at each data point x_i, the value of the eigenfunction φ(x_i) for this operator matches the element v[x_i] of the corresponding eigenvector of (1/n) K_n.

The subsample density P*_m will differ slightly from P_n due to sampling perturbation. Therefore, the spectrum of K_{P_n} will appear in perturbed form as the spectrum of the subsample empirical operator K_{P*_m}, where

$$K_{P_m^*} f(x) = \int K(x, y) f(y)\, dP_m^*(y) = \frac{1}{m} \sum_{i=1}^{m} K(x_i^*, x) f(x_i^*). \qquad (5.7)$$

The difference between the spectra of K_{P_n} and K_{P*_m} is connected to the distance between P_n and P*_m. When this distance is small (e.g., as the sample size m increases), the spectrum of K_{P*_m} approximates that of K_{P_n} well. However, the approximation may be poor if the sample size is not large enough. This phenomenon is summed up in the following lemma, which appears in Shi et al. (2009):

Lemma 1. Consider the operator K_{P_n} defined in Eq. (5.6). If

$$\| K_{P_n} f - \lambda f \|_{L^2_{P_n}} \le \epsilon \qquad (5.8)$$

for some λ, ε > 0 and f ∈ L²_{P_n}, then K_{P_n} has an eigenvalue λ_k such that |λ_k − λ| ≤ ε. If we further assume that t = min_{i: λ_i ≠ λ_k} |λ_i − λ_k| > ε, then K_{P_n} has an eigenfunction f_k corresponding to λ_k such that ‖f − f_k‖_{L²_{P_n}} ≤ ε/(t − ε).

In other words, if there is a function f that is close to being an eigenfunction of K_{P_n} with eigenvalue λ, then K_{P_n} has an eigenvalue that is approximately λ and an eigenfunction that is approximately f. This holds provided there is enough separation in the spectrum (i.e., the eigengap at λ is large enough).

If f is taken to be an eigenfunction φ*_i(·) of K_{P*_m}, the lemma says that as long as φ*_i(·) is approximately equal to an eigenfunction of K_{P_n}, it will also be an approximate version of an eigenfunction φ_j(·) of K_{P_n}. In addition, the corresponding eigenvalues λ*_i and λ_j will be close. However, note that i and j need not be the same value, which is the source of the order-switching phenomenon among the top eigenvectors observed in Figure 5.1.

Figure 5.2: Row 1: Histogram of the original dataset (n = 1000) drawn i.i.d. from the mixture Gaussian distribution defined in (5.4). Rows 2, 4, 6: Top five eigenvectors of K_n, D_n^{-1} W_n, and D_n^{-1/2} W_n D_n^{-1/2}, respectively. Rows 3, 5, 7: Top five eigenvectors from each of four subsamples (size m = 300) overlaid.

As mentioned before, the distance between the spectra of K_{P_n} and K_{P*_m} is controlled by the sample size m. If one takes larger samples, this distance will decrease and the approximation will improve. As m increases to n, the perturbation will decrease until eventually m = n and K_{P*_m} = K_{P_n}. Approximation error hinges on the size of the perturbation relative to the gap between the eigenvalues. Therefore, one might expect to see stabilization of the top eigenvectors with a relatively low sample size, but m may need to be increased further to stabilize the lower vectors, where the eigengap is much smaller.

It is important to note that while these results apply directly to the affinity matrix K_n, they can be extended to its graph Laplacian W_n. Figure 5.2 shows the effect of subsample perturbation on the top spectra of several matrices used in spectral clustering algorithms. In each case, there are clear parallels between the eigenvector behavior of K_n and that of the various forms of the Laplacian: the eigenvectors appear in different orders and do not always identify the same groups found using the eigenvectors of the full matrix.

The results in this section demonstrate that the eigenvectors used in subsample spectral clustering methods may change order, flip signs, and contain redundant group information. A desirable sub-sampling spectral algorithm, therefore, must be robust to these changes in the eigenvectors and reliably produce similar clustering results across different samples.

CHAPTER 6

MULTIPLE SAMPLE DATA SPECTROSCOPIC CLUSTERING

6.1 Algorithm Overview

Given the instability of the spectrum of K*_m illustrated in Section 5.3.2, a single subsample may not always produce stable results. To stabilize the clustering process, I have developed a multiple subsample approach, an overview of which appears in Figure 6.1.

First, T subsamples of size m are randomly drawn from the dataset. For each subsample, an affinity matrix is calculated and the eigenvectors with no sign change are selected in the same manner as the original DaSpec clustering (Section 5.2.3).

After this, the no sign change eigenvectors from all of the subsamples are considered as a single dataset of m-dimensional objects and are clustered using DaSpec. This second clustering step determines which eigenvectors from the subsamples are identifying the same groups in the original data.

Next, each cluster of no sign change eigenvectors is combined into a single vector, which is then used to approximate an eigenvector of the original dataset. Cluster labels are assigned to each observation based upon which group has the largest approximate vector magnitude at that location. The procedure is discussed in detail below.

Step 1: Draw subsamples of size m from the dataset T times

Begin by drawing T samples of size m, denoted X^t = {x^t_1, x^t_2, ..., x^t_m}, where the sample number is t = 1, ..., T. Each sample yields its own affinity matrix

$$(K_m^t)_{i,j} = K(x_i^t, x_j^t) = \exp\left( -\frac{\| x_i^t - x_j^t \|^2}{2\omega^2} \right),$$

where the bandwidth ω is common to all subsamples. Each affinity matrix is decomposed and its eigenvectors are denoted v^t_1, v^t_2, ..., v^t_m, with associated eigenvalues λ^t_1 ≥ ... ≥ λ^t_m.
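A minimal numpy sketch of this first step, on an invented one-dimensional dataset (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T, m, omega = 4, 50, 0.25

# A hypothetical full dataset (any 1-d sample works for illustration).
data = rng.normal(size=500)

def affinity(sample, omega):
    """Gaussian affinity matrix K^t_m for one subsample."""
    d2 = (sample[:, None] - sample[None, :]) ** 2
    return np.exp(-d2 / (2 * omega ** 2))

# Draw T subsamples of size m and decompose each affinity matrix.
samples = [rng.choice(data, size=m, replace=False) for _ in range(T)]
decompositions = []
for s in samples:
    lam, V = np.linalg.eigh(affinity(s, omega))
    # Store eigenpairs with eigenvalues in decreasing order.
    decompositions.append((lam[::-1], V[:, ::-1]))
```

Each entry of `decompositions` then plays the role of (λ^t_1 ≥ ... ≥ λ^t_m, v^t_1, ..., v^t_m) for one subsample.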

Step 2: Find eigenvectors with no sign change from each sample affinity matrix

The eigenvectors v^t_1, v^t_2, ..., v^t_m, t = 1, ..., T, are examined for the no sign change property, up to a small threshold ε (as in Section 5.2.3). This gives, for each sample, a collection of eigenvectors v^t_{(1)}, v^t_{(2)}, ..., v^t_{(Ĝ^t)}, with eigenvalues λ^t_{(1)}, λ^t_{(2)}, ..., λ^t_{(Ĝ^t)}. Here, Ĝ^t is the number of groups identified by subsample t. Note that if only a single subsample were used, the number of detected clusters in the data would correspond to the number of these eigenvectors with no sign change.
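The no sign change test itself is simple to state in code; here is an illustrative Python sketch (the threshold handling, treating entries with |v_i| ≤ ε as numerically zero, follows the description above):

```python
def no_sign_change(v, eps):
    """True if all entries of v larger than eps in magnitude share one sign.
    Entries with |v_i| <= eps are treated as numerically zero."""
    signs = {1 if vi > 0 else -1 for vi in v if abs(vi) > eps}
    return len(signs) <= 1

# An eigenvector concentrated on one group (plus tiny noise) passes the test;
# one that oscillates between groups does not. Both vectors are invented.
concentrated = [0.001, 0.62, 0.71, 0.33, -0.0005]
oscillating = [0.5, -0.5, 0.5, -0.5, 0.5]
```

In the multi-sample algorithm, this test would be applied to each column returned by the eigendecomposition of every subsample affinity matrix.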

Perturbation of the values of the eigenvectors, introduced by subsampling, can affect whether an eigenvector has this property. As a result, subsampling introduces variation in the number of groups identified by DaSpec. This variation appears more often when small groups are present or the number of groups is large, since points in those groups are less likely to be chosen, resulting in a larger perturbation effect on the eigenvectors representing those groups.

Notice that in the overview shown in Figure 6.1, T = 4 samples are drawn. Only one of the samples (Sample 3) finds all four of the Gaussian clusters, while the other samples find only two or three. If only a single subsample had been used, the cluster results might have returned fewer groups than are actually present. However, we will see that the multiple subsample approach combines the information from all of the samples and detects each of the four groups.

Step 3: Cluster the collection of all no sign change eigenvectors

To summarize the process so far, the subsamples X^1, X^2, ..., X^T have together identified 𝒢 = Σ_{t=1}^T Ĝ^t eigenvectors with no sign change. Assuming that there is a small number G ≪ 𝒢 of naturally occurring groups in the full dataset, it is expected that many of these samples have identified the same groups, but using different observations sampled from those groups. In the example illustrated in Figure 6.1, there are G = 4 clearly separable groups in the full dataset. Although the eigenvectors for each subsample take values at different observations than the others, it is clear that some of the absolute values of the vectors are large in the same regions of the data space (in this case ℝ¹).

The goal now is to combine the information from multiple samples to find where these overlaps occur and condense the full set of 𝒢 eigenvectors down to a set of G eigenvectors that identify the unique groups in the full dataset. Figuring out which of the vectors identify the same groups, though, is not as easy as it might seem.

First of all, the order of detected groups is not always the same from sample to sample. Even if two subsamples are able to isolate the same group from the dataset, the corresponding eigenvectors with no sign change may not be in the same position in the eigenspectrum. For example, all of the samples in Figure 6.1 identify the left-most cluster centered at -5. However, the eigenvector corresponding to that cluster shows up in the second (v^4_{(2)}), third (v^1_{(3)} and v^2_{(3)}), and fourth (v^3_{(4)}) positions. Second, there is the aforementioned fact that the subsamples are drawn at different observations. While two eigenvectors might be large over the region corresponding to the same group, it is hard to directly compare the values in the vectors since they do not line up.

To avoid the ordering problem and still determine which vectors match, the eigenvectors themselves are combined using DaSpec clustering. The procedure from Section 5.2.3 can be used for this task, with simple modifications. Since the eigenvectors are not always evaluated at the same points in the dataset, an L2 distance measure is inappropriate. We must therefore define a new similarity measure between vectors that quantifies the degree to which the vectors have large absolute values in the same region of the data space.

An intuitive way to compare vectors is to identify the "representative" set of points at which each vector takes on large absolute values, and then compute the distance between these sets of points. For example, for vectors v^t_{(i)} and v^{t'}_{(j)}, define these sets of points to be

$$R_{(i)}^{t} = \{ x \in X^{t} : |v_{(i)}^{t}[x]| > c \cdot \max_{y \in X^{t}} |v_{(i)}^{t}[y]| \}, \qquad i \in \{1, 2, \ldots, \hat{G}^{t}\},$$
$$R_{(j)}^{t'} = \{ x \in X^{t'} : |v_{(j)}^{t'}[x]| > c \cdot \max_{y \in X^{t'}} |v_{(j)}^{t'}[y]| \}, \qquad j \in \{1, 2, \ldots, \hat{G}^{t'}\},$$

where c ∈ [0, 1]. The dissimilarity between v^t_{(i)} and v^{t'}_{(j)} can then be defined as D(R^t_{(i)}, R^{t'}_{(j)}), where D is any measure of closeness between sets of points (e.g., complete linkage, average linkage, single linkage, etc.). I suggest the following measure, which seems to work well:

$$S(v_{(i)}^{t}, v_{(j)}^{t'}) = D(R_{(i)}^{t}, R_{(j)}^{t'}) = \frac{1}{|R_{(i)}^{t}|} \sum_{x \in R_{(i)}^{t}} \min_{y \in R_{(j)}^{t'}} d(x, y) + \frac{1}{|R_{(j)}^{t'}|} \sum_{x \in R_{(j)}^{t'}} \min_{y \in R_{(i)}^{t}} d(x, y), \qquad (6.1)$$

where d(x, y) is a relevant distance measure between two points x and y from the data (e.g., ‖x − y‖ for real-valued vector data). Using this measure, the affinity matrix for the set of 𝒢 sample eigenvectors with no sign change is (K_𝒢)_{i,j} = K(v^t_{(i)}, v^{t'}_{(j)}), where

$$K(v_{(i)}^{t}, v_{(j)}^{t'}) = \exp\left( -\frac{[S(v_{(i)}^{t}, v_{(j)}^{t'})]^{2}}{2\tau^{2}} \right).$$

Here, τ is the bandwidth for this kernel, which is different from the bandwidth ω used in the first DaSpec step. Following the rest of the DaSpec procedure (Section 5.2.3) using this K_𝒢 will yield a cluster label in {1, ..., G} for each of the 𝒢 vectors.
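A small Python sketch of the representative sets and the measure in Eq. (6.1), using one-dimensional points for illustration (the data and parameter values below are made up):

```python
import math

def representative_set(points, v, c):
    """R: the sample points where |v| exceeds c times its maximum magnitude."""
    cutoff = c * max(abs(vi) for vi in v)
    return [x for x, vi in zip(points, v) if abs(vi) > cutoff]

def set_distance(R1, R2):
    """Symmetric average minimum distance between two point sets (Eq. 6.1)."""
    d = lambda x, y: abs(x - y)          # 1-d example; use a norm in general
    a = sum(min(d(x, y) for y in R2) for x in R1) / len(R1)
    b = sum(min(d(x, y) for y in R1) for x in R2) / len(R2)
    return a + b

def vector_kernel(S, tau):
    """Affinity between two no-sign-change eigenvectors given their distance S."""
    return math.exp(-S ** 2 / (2 * tau ** 2))

# Two invented eigenvectors on the same sample points: the first peaks over
# {0, 1, 2}, the second over {10}, so their representative sets are far apart.
R1 = representative_set([0, 1, 2, 10], [0.9, 0.8, 0.7, 0.01], c=0.5)
R2 = representative_set([0, 1, 2, 10], [0.05, 0.02, 0.01, 1.3], c=0.5)
S12 = set_distance(R1, R2)
```

Two eigenvectors that peak over the same region give S ≈ 0 and hence affinity ≈ 1, while eigenvectors identifying distant groups get a large S and near-zero affinity.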

Step 4: Combine eigenvectors within each cluster found in step 3 to obtain a single eigenvector for that group

For each group g = 1, ..., G, denote the n_g vectors with the label g by u_{(g),1}, ..., u_{(g),n_g}, where each vector u_{(g),i} comes from one of the subsamples X^i_{(g)} ∈ {X^1, ..., X^T}. These vectors can then be merged into a vector u_{(g)} of length m · n_g over the set X_{(g)} = ∪_{i=1}^{n_g} X^i_{(g)}.

There are a couple of notes to make here. First, if an observation is repeated in two or more of the samples, it can simply be repeated in u_{(g)} so that the vector is still of length m · n_g. Second, since the signs of the vectors u_{(g),1}, ..., u_{(g),n_g} might differ, it is important to flip them all in the same direction. As the data spectroscopic clustering procedure identifies eigenvectors with approximately no sign change, this is a trivial operation. For instance, one could simply multiply each vector u_{(g),i} by sign{max_{x ∈ X^i_{(g)}} u_{(g),i}[x]}.

At the end of this procedure, then, the 𝒢 sample eigenvectors with no sign change (each of size m × 1) have been converted into G group eigenvectors u_{(1)}, u_{(2)}, ..., u_{(G)} of size m · n_g × 1. Each u_{(g)} represents an approximate version of a no sign change eigenvector of the full affinity matrix K_n.
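The sign alignment and merging of Step 4 can be sketched as follows (illustrative Python; the example vectors are invented):

```python
def align_sign(u):
    """Flip u so its dominant direction is positive, i.e., multiply by
    sign(max(u)), as suggested above for approximately no-sign-change vectors."""
    s = 1.0 if max(u) > 0 else -1.0
    return [s * ui for ui in u]

def merge_group(vectors):
    """Concatenate the aligned vectors of one cluster into a single group
    eigenvector of length m * n_g (repeated observations are simply repeated)."""
    merged = []
    for u in vectors:
        merged.extend(align_sign(u))
    return merged

u1 = [0.1, 0.9, 0.8]          # already positive
u2 = [-0.2, -0.85, -0.7]      # same group, but with flipped sign
u_g = merge_group([u1, u2])
```

After alignment, both pieces of `u_g` point in the same direction, so the merged vector behaves like one long no-sign-change eigenvector over the union of the two samples.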

Step 5: Extend each group eigenvector and assign cluster labels

To assign group labels to the full set of points {x_1, x_2, ..., x_n} ⊂ ℝ^d, the Nyström extension is used. However, since each group eigenvector u_{(g)} is a combination of eigenvectors from different samples, it is unclear what normalization term to use in the extension. I suggest leaving out the normalization term from the original extension (Eq. (5.3)) and normalizing the approximated eigenvectors by length instead.

For each group eigenvector u_{(g)} taking values on the set X_{(g)} ≡ {x*_1, x*_2, ..., x*_{m·n_g}}, g = 1, ..., G, estimate the value of the vector v_{(g)} at an observation x_i by

$$\tilde{v}_{(g)}[x_i] = \sum_{j=1}^{m \cdot n_g} K(x_i, x_j^*)\, u_{(g)}[x_j^*]. \qquad (6.2)$$
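An illustrative Python sketch of Eq. (6.2) and the label assignment, using invented one-dimensional data and group eigenvectors:

```python
import math

omega = 0.25

def kernel(a, b):
    return math.exp(-(a - b) ** 2 / (2 * omega ** 2))

def extend(u_g, support):
    """Eq. (6.2): extend a group eigenvector, with no Nystrom normalization term."""
    return lambda x: sum(kernel(x, xj) * uj for xj, uj in zip(support, u_g))

def assign_labels(data, group_vectors, supports):
    """Normalize each extended vector to length 1, then label each point by
    the group with the largest |v~_(g)[x_i]|."""
    cols = []
    for u_g, sup in zip(group_vectors, supports):
        col = [extend(u_g, sup)(x) for x in data]
        norm = math.sqrt(sum(v * v for v in col)) or 1.0
        cols.append([v / norm for v in col])
    return [max(range(len(cols)), key=lambda g: abs(cols[g][i]))
            for i in range(len(data))]

# Two hypothetical groups, near 0 and near 5, each with a tiny support set.
labels = assign_labels(
    data=[0.0, 0.1, 5.0, 5.1],
    group_vectors=[[1.0, 0.9], [1.0, 0.8]],
    supports=[[0.0, 0.2], [5.0, 5.2]],
)
```

The length normalization is what replaces the sqrt(m/n)/λ* factor of the original extension: only the relative magnitudes across groups matter for the argmax labeling.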

Step 1: Draw T samples of size m from the data set.

Step 2: Run DaSpec on each sample to obtain the eigenvectors v^t_{(1)}, ..., v^t_{(Ĝ^t)}.

Step 3: Cluster the 𝒢 = Σ_{g=1}^G n_g eigenvectors using DaSpec and the similarity measure given in Eq. (6.1).

Step 4: Combine the vectors u_{(g),1}, ..., u_{(g),n_g} sharing the same cluster label g ∈ {1, ..., G}, yielding G group eigenvectors u_{(1)}, ..., u_{(G)}, where u_{(g)} is of length m · n_g.

Step 5: Extend each vector u_{(g)} to length n using the Nyström extension and normalize to length 1, yielding the approximate full eigenvectors ṽ_{(1)}, ..., ṽ_{(G)}. Assign to each observation x_i the cluster label argmax_{g=1,...,G} |ṽ_{(g)}[x_i]|.

Table 6.1: Multi-Sample DaSpec Algorithm

Then, normalize the vectors ṽ_{(1)}, ṽ_{(2)}, ..., ṽ_{(G)} to length 1 and assign to point x_i the label argmax_{g=1,...,G} |ṽ_{(g)}[x_i]|, for i = 1, ..., n. A summary of the multiple-sample procedure is given in Table 6.1.

6.2 Sparse Extension for Faster Computation

Implementation of either a single sample or multiple sample approach to clustering using the Nyström extension shows that the extension step takes the bulk of the computing time. Recall the Nyström extension for an eigenvector v, which is reproduced below:

$$\tilde{v}[x] = \sqrt{\frac{m}{n}}\,\frac{1}{\lambda^*} \sum_{i=1}^{m} K(x_i^*, x)\, v^*[x_i^*]. \qquad (6.3)$$

Here, {x*_1, ..., x*_m} is a sample of the observations and v* is an eigenvector of the subsample affinity matrix K*_m(i, j) = K(x*_i, x*_j) with eigenvalue λ*. The computational complexity of the extension is O(nm²), which is still too slow for large n and reasonable m.

To ease the computational burden, the vector extension can be modified using a sparse approximation method. Note that the form of Eq. (6.3) is not unlike that of a regression problem. Treating K(x*_i, ·) as an explanatory variable and √(m/n) (1/λ*) v*[x*_i] as its coefficient, we have a regression model setup. Also, since

$$v^*[x_j^*] = \frac{1}{\lambda^*} \sum_{i=1}^{m} K(x_i^*, x_j^*)\, v^*[x_i^*], \qquad j = 1, \ldots, m, \qquad (6.4)$$

the response for the linear model is observed at the locations x*_1, ..., x*_m.

The explanatory variables K(x*_i, ·), i = 1, ..., m, are highly correlated, and many of them are close to zero. Also, since the no sign change vectors decay away from the support of the group they identify, the majority of the values v*[x*_i] are close to zero. This means that most of the terms in Eq. (6.4) are near zero, and a sparse representation of the model should approximate the equation well. In this way, each sample is reduced to a (potentially) smaller set of points that serves as a proxy for the sample in extensions to any point x in the data space.

To determine which of the K(x*_i, ·) are influential in determining the values of ṽ, one can use a variable selection method (e.g., LASSO regression). That is, one can find the sparse vector w* such that

$$v^* \approx K_m^* w^*. \qquad (6.5)$$

Here, the non-zero entries of w* correspond to a small number (say, r) of influential points chosen from x*_1, ..., x*_m. The quality of the extension will depend on how many non-zero entries are allowed in w*: more entries mean a slower extension but a better approximation. A surprisingly small number of predictors seems necessary to achieve the level of approximation required for clustering. For the datasets in Section 6.3, LASSO regression with as few as r = 5 to r = 10 non-zero coefficients in w* is sufficient for clustering purposes.

In summary, the sparse extension, using the sample x*_1, ..., x*_m and an eigenvector v* of the sample affinity matrix K*_m, is given by

$$\tilde{v} = \sqrt{\frac{r}{n}}\, K_{n,m}\, w^*, \qquad (6.6)$$

where K_{n,m}(i, j) = K(x_i, x*_j). As a result of the sparsity in w*, only a small number (r · n) of the K(x_i, x*_j) values need to be evaluated in the extension step, which further reduces the computation cost since r ≪ m.
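The sketch below illustrates the sparse extension idea in numpy. Since a LASSO solver is not assumed available here, a simple greedy (matching-pursuit style) selection stands in for the LASSO step used in the thesis; any sparse selector plays the same role. All data and parameter values are invented:

```python
import numpy as np

rng = np.random.default_rng(3)
omega, m, r = 0.5, 40, 5

def kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * omega ** 2))

xs = np.sort(rng.normal(size=m))          # subsample points
Km = kernel(xs, xs)
lam, V = np.linalg.eigh(Km)
v_star = V[:, -1]                         # top subsample eigenvector

# Greedily select r columns of K*_m so that K*_m w* ~ v* (cf. Eq. 6.5).
support, resid = [], v_star.copy()
for _ in range(r):
    scores = np.abs(Km.T @ resid)
    scores[support] = -np.inf             # do not reselect a column
    support.append(int(np.argmax(scores)))
    w_sub, *_ = np.linalg.lstsq(Km[:, support], v_star, rcond=None)
    resid = v_star - Km[:, support] @ w_sub

w = np.zeros(m)
w[support] = w_sub

# Extension now needs kernel values only at the r selected points (cf. Eq. 6.6,
# up to the scaling constant, which is absorbed by the later length normalization).
x_new = np.linspace(-2, 2, 100)
v_tilde = kernel(x_new, xs[support]) @ w[support]
```

The key saving is visible in the last line: each new point requires only r kernel evaluations rather than m.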

In the multiple sampling situation, I suggest doing the sparse selection on the original 𝒢 eigenvectors with no sign change, which are of length m. The affinity matrices K^1_m, ..., K^T_m already need to be computed to find these eigenvectors, and the variable selection procedure runs much more quickly with only m sampled points. In each group g, the sparse representations of u_{(g),1}, ..., u_{(g),n_g} are functions that each approximate v_{(g)}. By virtue of random sampling, each of these functions carries equal weight. Therefore, in the extension step of the multiple sample procedure, estimate v_{(g)}[x] by an average of the predictions given by the vectors u_{(g),i} that were combined to form u_{(g)}. This will only require computing r · n_g additional kernel values.

6.3 Performance on Real and Simulated Datasets

The examples in this section highlight a few of the properties of the multiple sample procedure. We begin with a brief discussion of parameter selection. The number of samples, T, and the sample size, m, both control the level of detail that can be achieved in the clustering results.

Increasing T means that more samples are drawn from the dataset, which increases the chances that smaller groups in the data can be discovered. If T is not large enough, some groups might not have any of their members sampled and will be merged into larger groups. The sample size m needs to be large enough to distinguish between a few of the groups in the data, and also large enough to have a chance of detecting smaller groups. However, making the sample size too large will slow down the algorithm, since a total of T eigen-decompositions must be performed on m × m sample affinity matrices.

Fowlkes et al. (2004) suggest that a sample size of fewer than one percent of all pixels is sufficient to properly segment a dataset. In many cases, a multiple sample procedure can make do with even fewer. Indeed, in the first example of this section, T = 5 samples of size m = 50 are used to recover 9 groups in a dataset with 45,000 points.

Determination of the bandwidth parameters ω and τ is not always straightforward, but a coarse estimate can be obtained from the data. Some discussion of this calculation is given in Section 6.3.4. In general, smaller values of ω encourage the detection of more groups within each subsample, while larger values tend to put all of the points into a few larger groups. The value of τ controls the degree to which these subsample groups are combined to obtain the final grouping vectors. Larger τ values force more dissimilar groups to be combined, while smaller values discourage this, resulting in more groups.

Runtimes for these datasets are shown for a PC with a 2.39 GHz processor and 4 GB RAM. All of the code for the clustering algorithm is written in Matlab 2009a, except for the sparse extension, which is written in C.

6.3.1 Comparison to Single Subsample Approach

This simulation shows how the use of multiple samples can increase clustering stability and extract more grouping information from the data than the single sample approach with the same total number of points. The dataset in ℝ² shown in Figure 6.2 consists of nine bivariate Gaussian clusters centered at {1, 3, 5} × {1, 3, 5} with covariance matrices (0.1)²I. There are n = 45,000 points in the dataset, with 5000 drawn from each cluster. Each of the nine Gaussian clusters is consistently recovered with the multiple sampling approach; however, if the single sample procedure is used with the same total number of sampled points, at most 7 clusters can be detected.

The multiple sample approach leverages the fact that each sample is looking at the dataset with a fresh perspective. While each of the samples might only identify a few of the nine clusters, these will not always be the same clusters identified by the other samples. When the cluster information is combined, all nine clusters have been detected. Clearly, as few as five samples seem to be enough to consistently avoid missing one of the nine groups.

It is important to note that in this case, a single sample of size m = 1000 or 2000 would recover all nine of the groups. However, if this problem were scaled up to have more groups or higher dimension, drawing such a proportionately larger single sample might not be possible. In other words, if this problem were run on a computer that could only handle 250 × 250 affinity matrices (or even only 50 × 50 matrices), the multiple sample approach could still identify the nine groups. A single sample of size 50, or even 250 as shown in the figure, would not be able to do this.

A quick note also needs to be made about the choice of ω for the single subsample run. The value of ω was lowered as far as possible, since lower bandwidths tend to recover more groups in the data. When ω was decreased below 0.15, the affinity matrix became nearly singular and the eigendecomposition could not be performed. Therefore, the results in the second row show the best one can do to identify all 9 clusters at once with a single subsample of size 250.

6.3.2 Image Segmentation Applications

To explore the performance of the multiple sampling procedure in real-world applications, the method was also applied in the context of image segmentation. The test images, each of size 481 × 321, were taken from the Berkeley Image Database (Martin et al., 2001).

Each pixel is treated as an observation in 5-dimensional space. To be more specific, for pixel x in row r, column c, with RGB values R, G, and B, its associated vector is (0.5 (r−1)/320, 0.5 (c−1)/480, R/255, G/255, B/255). This puts the first two coordinates (spatial information) in the range [0, 0.5] and the other three (color information) in the range [0, 1]. In this way, closer pixels are more likely to be placed in the same group, but pixel colors still matter more in deciding group membership.
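The pixel-to-feature mapping can be written directly; here is a small Python sketch of the 5-d vector described above (row and column indices are taken to be 1-indexed, matching the r − 1 and c − 1 offsets):

```python
def pixel_feature(r, c, R, G, B):
    """5-d feature for the pixel in row r, column c (1-indexed) of a 481 x 321
    image with 8-bit RGB values, using the scaling described in the text."""
    return (0.5 * (r - 1) / 320,
            0.5 * (c - 1) / 480,
            R / 255, G / 255, B / 255)

# Spatial coordinates span [0, 0.5]; color coordinates span [0, 1].
top_left = pixel_feature(1, 1, 0, 0, 0)
bottom_right = pixel_feature(321, 481, 255, 255, 255)
```

Because the spatial coordinates are compressed into [0, 0.5] while color stays in [0, 1], two pixels with very different colors remain distant in feature space even when they are spatially adjacent, which is the intended behavior.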

I found that bandwidths of ω ∈ [0.03, 0.06] and τ ∈ [0.10, 0.20] work well for images of this size. Larger ω values seem to work better for smaller subsamples; since less information is available, a larger ω only allows each subsample to find a few groups. When larger subsamples are used, more information is available, and smaller ω values seem to work well by allowing more groups to be discovered. A more general procedure for determining good starting points for these bandwidths can be found in Section 6.3.4. Figure 6.3 shows three images accompanied by their cluster results. In each case, T = 10 or 20 samples of size m = 100 were taken.

One might also ask how many subsamples are necessary to achieve satisfactory clustering results. I have investigated the effects of changes to the number of samples T and the sample size m. It appears that the choice of these parameters depends on the kind of image being segmented. In images with a few large, well separated groups, a small single subsample is usually enough to perform the segmentation. However, in more complex images, single subsamples tend to produce results that are more variable and less satisfactory. In these cases, the multiple subsample approach seems to make a difference by stabilizing the results (see Figure 6.4).

With either type of image, there is no evidence that using multiple subsamples can hurt classification. There is an added computational cost in situations for which a single subsample would do the job, but one might be willing to pay that cost in an automatic procedure that is used to segment a variety of image types.

6.3.3 Sparse Extension vs. Full Extension Comparison

Figure 6.5 demonstrates the usefulness of the proposed sparse extension method by comparing its results to those of full extension on the same image. In both cases, T = 20 samples of size m = 100 were used. For the sparse extension, LASSO regression was used to express each subsample eigenvector using only five points as predictors. During the extension, eigenvector values for each pixel in the image were predicted using all 20 of the regression equations and then averaged. This is compared to the full extension method, where pixel eigenvector values are approximated with the Nyström extension, which uses all 100 sampled points.
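The two extension schemes can be sketched in a few dozen lines. This is an illustration under assumptions: a Gaussian kernel, synthetic 2-d data, a hand-rolled coordinate-descent LASSO in place of a library solver, and an illustrative penalty λ; none of these settings come from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_kernel(X, Y, omega):
    """Gaussian kernel matrix K[i, j] = exp(-||X_i - Y_j||^2 / (2 omega^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * omega ** 2))

def lasso_cd(A, y, lam, iters=100):
    """Coordinate-descent LASSO: min_b 0.5 ||y - A b||^2 + lam ||b||_1."""
    beta = np.zeros(A.shape[1])
    for _ in range(iters):
        for j in range(A.shape[1]):
            resid = y - A @ beta + A[:, j] * beta[j]
            rho = A[:, j] @ resid
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (A[:, j] @ A[:, j])
    return beta

m, omega = 100, 0.3
Xs = rng.normal(size=(m, 2))                    # one subsample of m points
K = gauss_kernel(Xs, Xs, omega)
u = np.linalg.eigh(K)[1][:, -1]                 # a sample eigenvector

Xnew = rng.normal(size=(50, 2))                 # "pixels" to extend to

# Full extension: a kernel sum over all m sampled points per new point.
v_full = gauss_kernel(Xnew, Xs, omega) @ u

# Sparse extension: LASSO expresses u with a handful of kernel columns,
# so extending needs only those few sampled points.
beta = lasso_cd(K, u, lam=0.5)
keep = np.argsort(np.abs(beta))[-5:]            # five-point representation
v_sparse = gauss_kernel(Xnew, Xs[keep], omega) @ beta[keep]
```

In the multiple sample setting described above, predictions like `v_sparse` would be computed from each of the T per-sample regression equations and then averaged.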

Clearly, the sparse extension runs orders of magnitude faster than the full extension, and the time savings increase with the number of groups detected. Furthermore, the cluster labels match fairly well, especially in the first image. For the second image, where more groups are present, there are slight differences. However, due to the sparse approximation, group labels are distributed more smoothly across the image. One could argue that the sparse extension yields more satisfactory results than the full extension in this example.

6.3.4 Parameter Selection

There are several issues with the multiple sample data spectroscopic clustering algorithm which I am working to address. Most notably, there is the problem of choosing parameter values. In general, there are five parameters: the number of samples T, the size of samples m, the bandwidths ω and τ, and the threshold parameter ε used in the data spectroscopic clustering algorithm. In some situations, there may be more. For example, in image segmentation there is also a weight one can place on the row and column indices (0.5 was used for the results in the previous section). The broad effects of changing these parameter values are known, but the interaction of these effects is not as well understood.

The choice of subsample number T and size m depends on the kind of dataset used in the clustering. In general, I find that for simpler datasets with a small number of large, well separated groups, a single subsample is sufficient to obtain a stable and satisfactory clustering result. In this case, the subsample size m can be chosen well under 1% of the total number of observations without any undesirable effects. For datasets with a larger number of groups and/or smaller groups, a larger number of sampled observations is necessary.

While a single sample can occasionally yield satisfactory results, it appears that the cluster labels are too unstable from run to run. To stabilize the process in these cases, anywhere from T = 5 to T = 20 subsamples might be necessary. For situations with large numbers of groups, I find that the subsample size m works in concert with the number of subsamples T to identify groups. The size m must be large enough to allow each subsample to identify several of the groups in the dataset, while the number of samples T must be high enough to ensure that each natural grouping is found by at least a few of the subsamples.

To identify a ballpark region of decent bandwidth values ω and τ, I follow the method given in Shi et al. (2009) for data spectroscopic clustering. For a similarity (or affinity) matrix S, calculate the vector q = {5th percentile of the rows of S}. Then, estimate the bandwidth ω using

    ω = quantile(q, 0.95) / √(χ²_{0.95,d}),

where d is the dimension of each observation. This procedure can also be used to find values for τ by using the matrix Sij = S(v^t(i), v^t′(j)) of similarities between all eigenvectors with no sign change from each of the T samples (d is again the dimension of each observation).

I find that this gives a good starting point for determining bandwidths. To fine-tune these parameters, the approach is to either shrink or expand ω and τ by multiplying by a fixed value, usually between 0.5 and 1.5. In practice, I find that it often helps to use a smaller ω and a larger τ (i.e. shrink ω and expand τ, each by a small amount). In this way, subsamples are encouraged to find multiple groups, but many of the neighboring groups are then merged together in the second clustering step. Well separated groups will not be merged, however, so these settings encourage the discovery of smaller groups in the data that may not have been found with only a single subsample.
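A sketch of this starting-point heuristic, with two assumptions flagged: the rows of S are read here as pairwise Euclidean distances (so that ω carries distance units), and the chi-square quantile is computed with the Wilson-Hilferty approximation rather than a statistics library.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile(p, d):
    """Wilson-Hilferty approximation to the chi-square p-quantile with d
    degrees of freedom (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(p)
    return d * (1.0 - 2.0 / (9.0 * d) + z * (2.0 / (9.0 * d)) ** 0.5) ** 3

def starting_omega(X, p=0.95, q_level=0.05):
    """q holds the 5th percentile of each row of the pairwise distance
    matrix; omega is the 95th percentile of q over sqrt(chi2_{0.95,d})."""
    d = X.shape[1]
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    q = np.quantile(dist, q_level, axis=1)   # 5th percentile of each row
    return np.quantile(q, p) / np.sqrt(chi2_quantile(p, d))

X = np.random.default_rng(2).normal(size=(200, 3))
omega0 = starting_omega(X)
```

The fine-tuning described above then amounts to multiplying `omega0` (and the analogous starting τ) by fixed factors between 0.5 and 1.5.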

To choose ε, I also follow the heuristic given in Shi et al. (2009) for data spectroscopic clustering. This parameter represents the threshold for determining whether an eigenvector has the no sign change property. The vector is said to have no sign change if (i) all of its elements are larger than −ε, or (ii) all of its elements are less than ε.

Larger values of ε lead to more eigenvectors having the no sign change property, which results in more groups being detected by the algorithm. Smaller values of ε result in fewer groups. For an m-dimensional vector v taking values at locations x1, ..., xm, Shi et al. (2009) suggest using ε = (1/m) max_{i∈{1,...,m}} |v[xi]|. This is the convention I use in my implementation of the multiple subsample procedure.
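The threshold and the no-sign-change test can be sketched directly (using the maximum absolute entry in the default threshold is my reading of the heuristic, and the two test vectors are made up):

```python
import numpy as np

def no_sign_change(v, eps=None):
    """Test the no sign change property: all elements of v exceed -eps,
    or all elements are below eps. By default eps = max_i |v[i]| / m,
    following the heuristic described in the text."""
    m = len(v)
    if eps is None:
        eps = np.max(np.abs(v)) / m
    return bool(np.all(v > -eps) or np.all(v < eps))

v_pos = np.array([0.2, 0.5, 0.1, 0.3])     # one-signed: should pass
v_mix = np.array([0.4, -0.4, 0.3, -0.2])   # large entries of both signs
```

Passing a larger explicit `eps` makes the test more permissive, which is exactly how a larger ε leads to more eigenvectors, and hence more groups, being accepted.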

6.4 Conclusions

In the previous chapter, we saw that the eigenvectors of a sample affinity matrix, or its Laplacian, experience perturbations in sign and ordering. Unless a clustering algorithm is robust to these differences (e.g. data spectroscopic clustering), perturbation can have a serious impact on the final groupings.

Even when the algorithm is robust, instabilities arise due to the random nature of the single sample chosen. This effect is most apparent when the number of natural groupings is large and the sample size is small. If the sample size, due to computational burdens, cannot be chosen large enough to uncover these groups, cluster results can be very unstable.

I have proposed the use of multiple samples, each of which uncovers different groups in the data. By carefully combining the information contained in these samples, all of the clusters can be uncovered with tractable computation. The method described here works for data spectroscopic clustering, but it is by no means the only way to combine multiple samples.

In addition to the multiple sampling procedure, I have also outlined a sparse extension method that can be used instead of the full Nyström extension, whether a single sample or multiple samples are used. Through image segmentation results, I showed that the computation required by the sparse extension is orders of magnitude smaller than that of the full extension, while the cluster results for the two methods look nearly indistinguishable. Even if a single sample approach seems more prudent in a particular application, sparse extension should still be used to speed up the runtime of the clustering algorithm.

Figure 6.1: Overview of the Multiple Sample DaSpec Clustering Algorithm.

Algorithm               Panel   Runtime
Multiple Subsample      (a)     3.037 sec.
Multiple Subsample      (b)     3.441 sec.
Multiple Subsample      (c)     3.120 sec.
Single Subsample        (d)     3.229 sec.
Single Subsample        (e)     3.194 sec.
Single Subsample        (f)     2.856 sec.

Figure 6.2: (a)-(c) Multi-sample data spectroscopic clustering results for three consecutive runs using T = 5 samples of size m = 50. The dataset has n = 45,000 points, with 5000 sampled from each of nine bivariate Gaussian distributions with centers in {1, 3, 5} × {1, 3, 5} and covariance matrices σ²I, where σ² = 0.1. Bandwidths used were ω = 0.4 and τ = 0.5. (d)-(f) Single-sample data spectroscopic clustering results for three consecutive runs using T = 1 sample of size m = 250. The bandwidth ω was set to 0.15.

Image   Sample Number   Sample Size   Groups   ω      τ      Runtime
(a)     10              100           3        0.06   0.20    5.714 sec.
(b)     20              100           3        0.06   0.15   17.416 sec.
(c)     20              100           7        0.06   0.14   26.279 sec.

Figure 6.3: Column 1: Original Image. Column 2: Multi-sample data spectroscopic clustering results using the indicated number of subsamples and a sample size of m = 100. Parameters, runtimes, and groups detected for the procedures are given in the table.

Image   Sample Number   Sample Size   Groups   ω      τ      Runtime
(a)     1               100           3        0.06   —      0.652 sec.
(b)     5               100           3        0.06   0.20   2.456 sec.
(c)     10              100           2        0.06   0.20   4.386 sec.
(d)     1               100           7        0.06   —      1.246 sec.
(e)     5               100           6        0.06   0.15   5.318 sec.
(f)     10              100           7        0.06   0.15   8.185 sec.

Figure 6.4: Column 1: A simple image segmented with T = 1, T = 5, and T = 10. Column 2: A more complex image segmented with T = 1, T = 5, and T = 10.

Image   Extension Method   Sample Number   Sample Size   Groups   ω      τ      Time for Extension
(a)     Sparse             20              100           2        0.15   0.25      7.878 sec.
(b)     Full               20              100           2        0.15   0.25    527.271 sec.
(c)     Sparse             20              100           7        0.06   0.15     26.448 sec.
(d)     Full               20              100           7        0.06   0.15   2177.272 sec.

Figure 6.5: First Row: Original Image. Second Row: Multiple subsample data spectroscopic clustering results using the sparse LASSO extension. Third Row: Multiple subsample data spectroscopic clustering results using the full extension.

CHAPTER 7

REDUCTION OF FALSE DETECTIONS BY CLUSTERING

7.1 Introduction

The automatic cairn detection algorithm described in Chapter 4 located pixels in a satellite image that met several criteria. These criteria were embedded in several filters that are successively run over the image in 25 × 25 windows.

The blob detector JB finds all windows with a dark patch in the center and relatively lighter material in the outer region. These dark patches are then examined by the vegetation filter JV to see if the normalized difference vegetation index (NDVI) is large enough to indicate that the object is a plant. If it is not, a size metric JS is calculated to see if the object is of a size consistent with that of a cairn. The two features JHR and JHS measure how well a circle fits around the edges of the object. This fitting is done with a Hough transform. Finally, the object's circularity JC is measured using a modified form of Proffitt's equation. Windows with objects that pass all six of these filters are potential cairns. If a training set is available, these objects can also be reduced to only those within the convex hull of the training cairns, and then be ranked using a likelihood-based approach.
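As a toy illustration of the window-based idea (the window contents, the 9 × 9 inner region, and the scoring function are hypothetical stand-ins, not the actual JB of Chapter 4):

```python
import numpy as np

def blob_score(window):
    """Blob-detector-style score for a 25 x 25 window: how much darker the
    central region is than the surrounding border ring. Higher scores
    suggest a dark patch ringed by lighter material, as with a cairn."""
    inner = window[8:17, 8:17]                      # central 9 x 9 region
    outer_mean = (window.sum() - inner.sum()) / (window.size - inner.size)
    return outer_mean - inner.mean()

rng = np.random.default_rng(3)
background = rng.uniform(0.4, 0.6, size=(25, 25))   # featureless window
cairn_like = background.copy()
cairn_like[10:15, 10:15] = 0.05                     # dark central patch
```

A real filter would compare such a score against a tuned threshold before passing the window on to the next filter in the chain.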

At the beginning of the procedure, any pixel in the image could be part of a cairn. Through the use of these filters, the set of potential pixels is dramatically reduced, by more than 99.9%. However, even this large a reduction can still leave a high number of detected objects. To lower this number further, some other information must be incorporated. In this case, we make the observation that cairns only appear on certain types of landforms. These landforms should tend to have similar spectral values and should sit at similar local elevations. Spectral clustering provides a way to partition the satellite image into regions that are homogeneous in color and elevation. By examining these clusters, we can see which regions cairns tend to appear in. The locations of the detected objects can then be intersected with these cairn regions, which reduces the set of candidate objects even further.
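The intersection step itself is simple set membership; a minimal sketch with made-up coordinates and cluster labels:

```python
# Keep only detected objects whose location falls in a cluster that the
# training cairns tend to occupy. All values here are hypothetical.
cluster_of = {(10, 12): 4, (40, 7): 2, (55, 90): 1, (70, 3): 5}
detections = [(10, 12), (40, 7), (55, 90), (70, 3)]
cairn_clusters = {1, 4}        # clusters rich in training cairns

kept = [p for p in detections if cluster_of[p] in cairn_clusters]
```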

Recall that the satellite image is composed of multiple bands, each of which provides different information. The panchromatic intensity layer is at 0.6 meter resolution, and measures the brightness (luminosity) of the ground surface. The four multispectral bands measure reflectance of the surface in different wavelengths of visible and near-visible light: near infrared, red, green, and blue. The multispectral bands are at 2.4 meter resolution, which means that one pixel of an R/G/B/IR band covers a 4 × 4 pixel region of the panchromatic band. Finally, the digital elevation model (DEM) is available at 30 meter resolution, the equivalent of a 50 × 50 square of panchromatic pixels.

Due to the size of the satellite images, the straightforward use of a spectral clustering algorithm is very time consuming, computationally intensive, and, in some situations, not even feasible with today's computers. For this reason, the multiple sample data spectroscopic clustering algorithm is a good candidate for clustering the image into different landforms. Section 7.2 explains how the algorithm is adapted to satellite images, and Section 7.3 shows an application of the clustering on Polygon 9.

7.2 Satellite Image Clustering

In order to cluster the satellite image into landforms, each pixel is treated as a vector (r, c, b1, b2, ..., bk). Here, r and c are the row and column indices of the pixel, respectively, and b1, ..., bk are the values of the k bands of imagery at that location. The row and column indices are included to introduce spatial smoothness to the clustering by making it more likely that adjacent pixels will be placed into the same group.

In order to give equal weight to each dimension, the ranges of the indices and bands are converted to a [0, 1] scale. However, if one wants to emphasize certain features (e.g. color, elevation, spatial location) over others, weights can be applied to the dimensions of the vector. Therefore, the points used in the clustering algorithm are of the form (αr r*, αc c*, α1 b1*, ..., αk bk*), where r*, c*, b1*, ..., bk* are the normalized versions of r, c, b1, ..., bk and αr, αc, α1, ..., αk > 0 are weights on the dimensions.

However, before the clustering algorithm can be applied directly to the satellite image, there are a couple of issues to overcome. First, the image is quite large, which can make clustering difficult and time consuming. Second, the DEM is not useful for clustering in its original state. A description of these problems and their solutions follows.

7.2.1 Size Reduction for Computation

The large size of the satellite image can make it difficult to perform a clustering analysis, even with the multiple sample data spectroscopic clustering algorithm. For example, the size of Polygon 9 is approximately 200 million pixels. Even though approximation of the no sign change eigenvectors can be quickly done for several samples drawn from the image, these vectors still need to be extended to the full image in order to determine cluster membership.

In Eq. (6.2), the extension equation is given by

    ṽ(g)[xi] = Σ_{j=1}^{m·ng} K(xi, xj*) u(g)[xj*].

Here, u(g) is the approximate eigenvector formed by combining ng sample eigenvectors from samples of size m. The observations used to form u(g) are denoted {x1*, ..., x(m·ng)*}.

Note that approximating the value of ṽ(g) at a single pixel xi requires the computation of m·ng kernel values K(xi, xj*), j = 1, ..., m·ng. Even using the sparse extension, there are still r·ng kernels to compute, where r is the number of points in the sparse representation of the eigenvector. Therefore, extending a single eigenvector to all pixels in the image requires N·m·ng (or N·r·ng) calculations, where N is the total number of pixels in the image (e.g. 200 million). Considering that a total of G eigenvectors must be extended in this way, the amount of computation required is immense.

As an example, I clustered Polygon 9 into three groups, and the process took over four hours, even using the sparse LASSO extension. When the calculation takes this long, it is difficult to optimize the parameters of the model, e.g. the sample size m, the sample number T, and the bandwidths ω and τ. It therefore becomes necessary to reduce the processing time in other ways.

Figure 7.1: (a) Polygon 9 scaled by a factor of w = 8. (b) Polygon 9 scaled by a factor of w = 25. (c) Polygon 9 scaled by a factor of w = 50. (d)-(f) Close-up views of the top left corner of the image for each scale factor.

Since landform clustering only needs to happen on a large scale, one easy remedy is to scale the image to a lower resolution. For a scaling factor w, each w × w block of pixels is averaged to form a single pixel in the new image. If N is the number of pixels in the satellite image, this scaling reduces the number of pixels to N/w². By choosing a reasonable scaling factor, large scale features in the image are preserved, but the computation required for clustering is much reduced. Figure 7.1 gives an example of Polygon 9 converted to several lower resolution versions. Note that even at w = 25 or 50, different types of landforms are still distinguishable.
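Assuming the image is held as a numpy array, the w × w block averaging can be done with a single reshape (a sketch; border pixels that do not fill a complete block are dropped):

```python
import numpy as np

def downscale(img, w):
    """Average each w x w block of pixels into a single pixel, reducing an
    image of N pixels to roughly N / w**2 pixels."""
    p, q = (img.shape[0] // w) * w, (img.shape[1] // w) * w
    blocks = img[:p, :q].reshape(p // w, w, q // w, w)
    return blocks.mean(axis=(1, 3))

img = np.arange(36.0).reshape(6, 6)
small = downscale(img, 3)      # 6 x 6 image -> 2 x 2 image
```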

7.2.2 Equalized DEM Measure

The digital elevation model (DEM) is useful for distinguishing between different landforms, because they tend to appear at different elevation levels. Elevation differences are shaped by erosion from natural processes like water flow and wind.

These processes affect soil composition and appearance, and determine what kinds of natural features are present in certain areas (e.g. vegetation, water, rocks and boulders). Therefore, landform type is strongly linked to local elevation.

The problem is that the base elevation level changes over a region, which makes it difficult to use the DEM for clustering in its original state. For example, Figure 7.2 shows the DEM for Polygon 9 on the original scale. Note that the elevation levels of different landform types shift from north to south. Pixels in the wadi region at the top of the image are at approximately 775 meters, while pixels in the same wadi at the bottom of the image are at 825 meters. The mountainous regions around the wadi show the reverse trend in elevations, with the hills in the north at a higher elevation than those in the south.

Even though the wadi pixels are at different elevations throughout the image, intuitively it would be good for all of them to be placed in the same group. The same is true for similar landform types in the mountain regions. This makes sense because the relative elevations between those pixels and their surroundings are roughly the same, even though the actual elevations are not.

Figure 7.2: A plot of the Polygon 9 digital elevation model (DEM).

To prepare the DEM for use in clustering, we therefore convert the actual elevation into a relative elevation. This new "equalized" DEM measure should be the same for all pixels found in similar landform types. To do this, we use a mix of local and global elevation comparisons. First, denote the DEM value at pixel (i, j) by IE(i, j).

The local elevation comparison IE^L(i, j) at pixel (i, j) is done by ranking its elevation among neighboring pixels. For a distance threshold δ, first locate all pixels (i′, j′) whose distance from pixel (i, j) is at most δ. That is, form the set of pixels

    Nδ = {(i′, j′) : √((i − i′)² + (j − j′)²) ≤ δ}.

The local value at (i, j) is then defined by

    IE^L(i, j) = (1 / (|Nδ| − 1)) Σ_{(i′,j′)∈Nδ} 1(IE(i′, j′) < IE(i, j)),    (7.1)

where 1(A) is the indicator function that takes value 1 if event A is true, and 0 otherwise, and |Nδ| denotes the size of the set Nδ. Note that if (i, j) is the highest pixel within a δ radius, then IE^L(i, j) = 1. Similarly, if it is the lowest pixel within the radius, then IE^L(i, j) = 0.

A plot of IE^L is given in Figure 7.3(a). Note that the values are now more equalized across the landforms, but there are still some artifacts of the procedure that produce undesirable results. For example, the small wadi tributaries in the top right corner of the image become much more pronounced, because they are incised into the hills in that region. Since this method only uses neighboring pixels within a δ radius, local oddities like this occur. To obtain better results, this local measure is tempered with a global comparison.

The global elevation comparison IE^G(i, j) at pixel (i, j) is done by ranking its elevation among all pixels within its row i and column j. That is, for an image with p rows and q columns,

    IE^G(i, j) = (1 / (p + q − 2)) [ Σ_{i′=1}^{p} 1(IE(i′, j) < IE(i, j)) + Σ_{j′=1}^{q} 1(IE(i, j′) < IE(i, j)) ].    (7.2)

This metric approximates the rank of the pixel's elevation among all pixels in the image by its rank among all pixels in its row or column. By restricting to the row and column of the pixel (i, j), the number of comparisons made to IE(i, j) is reduced from p·q − 1 to p + q − 2. Note that as with the local comparison, the value of IE^G(i, j) is bounded by [0, 1].

Figure 7.3(b) shows a plot of the IE^G values for Polygon 9. Due to the nature of its computation, the map of IE^G values shows some oddities as well. For instance, the tributary in the center of the south half of the polygon has much lower values than similar regions elsewhere in the image, because its rows and columns contain quite a few wadi pixels.

To combine the effects of the local and global elevation comparisons, the maximum of the two rankings is used as the final equalized DEM metric. That is, the equalized DEM IE* is defined by

    IE*(i, j) = max(IE^L(i, j), IE^G(i, j)).    (7.3)

Note that the equalized DEM (Figure 7.3(c)) balances the effects of the local and global DEM comparisons, thus avoiding some of the oddities in the results of those procedures.
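Putting Eqs. (7.1)-(7.3) together, a direct (unoptimized) sketch of the equalized DEM computation might look like this; the tiny test elevation grid is made up:

```python
import numpy as np

def equalized_dem(E, delta):
    """Equalized DEM per Eqs. (7.1)-(7.3): at each pixel, take the maximum
    of a local elevation rank within a delta-radius neighborhood and a
    global rank within the pixel's row and column."""
    p, q = E.shape
    ii, jj = np.meshgrid(np.arange(p), np.arange(q), indexing="ij")
    out = np.zeros((p, q))
    for i in range(p):
        for j in range(q):
            nbr = (ii - i) ** 2 + (jj - j) ** 2 <= delta ** 2   # N_delta
            local = (E[nbr] < E[i, j]).sum() / (nbr.sum() - 1)  # Eq. (7.1)
            glob = ((E[:, j] < E[i, j]).sum() +
                    (E[i, :] < E[i, j]).sum()) / (p + q - 2)    # Eq. (7.2)
            out[i, j] = max(local, glob)                        # Eq. (7.3)
    return out

E = np.array([[1.0, 2.0],
              [3.0, 4.0]])
eq = equalized_dem(E, delta=2.0)
```

Written this way the loop is O(p·q) neighborhood scans, which is fine for the downscaled images the clustering actually uses.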

7.2.3 Algorithm Summary

After DEM equalization and proper scaling of the image, the pixels can be put through the multiple sample data spectroscopic clustering procedure. The output of the algorithm is a cluster label for each pixel in the image. Through the use of the cairn training set, one can identify which clusters cairns appear in. The set of detected objects obtained from the cairn detection algorithm can then be reduced even further by subsetting to only those clusters. Alternatively, one could define a likelihood that an object is a cairn based on the number of training cairns falling into its cluster and/or its distance to the clusters with the highest density of training cairns. A comprehensive summary of the clustering procedure is given in Table 7.1.

Figure 7.3: (a) The local elevation comparison IE^L for Polygon 9. (b) The global elevation comparison IE^G for the same image. (c) The equalized DEM IE*. (d) The original DEM IE, for comparison.

Table 7.1: Satellite Image Clustering Algorithm

Step 1: Initialize the row and column weights αr, αc. Determine which bands b1, ..., bk will be used for the clustering and initialize the weights α1, ..., αk. Finally, determine the scaling factor w used to lower the image resolution.

Step 2: If the DEM band is being used (i.e. bi = IE for some i), replace it with the equalized DEM band IE*.

Step 3: For each band bi, scale the image by a factor of w. If the number of rows or columns is not divisible by w, ignore the extra pixels on the border. After scaling, normalize the range of bi to [0, 1] across the image to form the normalized image bi*.

Step 4: Treat each low resolution pixel as a vector of weighted elements in k + 2 dimensions. That is, for a p × q image I, let pixel (i, j) be the point x(i, j) = (αr (i−1)/(p−1), αc (j−1)/(q−1), α1 b1*(i, j), ..., αk bk*(i, j)).

Step 5: Cluster the set of p·q points x(i, j), i = 1, ..., p, j = 1, ..., q using the multiple sample data spectroscopic clustering algorithm from Table 6.1 in Chapter 6.
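Steps 3 and 4 of Table 7.1 (normalization, weighting, and stacking into (k+2)-dimensional points) can be sketched as follows; the band arrays and weights here are illustrative, not the Polygon 9 settings:

```python
import numpy as np

def make_points(bands, alpha_r=1.0, alpha_c=1.0, alphas=None):
    """Normalize each (already downscaled) band to [0, 1], then stack
    weighted row/column coordinates and band values into one
    (k + 2)-dimensional point per pixel (0-indexed rows/columns)."""
    p, q = bands[0].shape
    k = len(bands)
    alphas = [1.0] * k if alphas is None else alphas
    i, j = np.meshgrid(np.arange(p), np.arange(q), indexing="ij")
    cols = [alpha_r * i / (p - 1), alpha_c * j / (q - 1)]
    for a, b in zip(alphas, bands):
        b = (b - b.min()) / (b.max() - b.min())   # normalize to [0, 1]
        cols.append(a * b)
    return np.stack(cols, axis=-1).reshape(p * q, k + 2)

rng = np.random.default_rng(4)
bands = [rng.uniform(size=(5, 4)), rng.uniform(size=(5, 4))]
pts = make_points(bands, alphas=[1.0, 8.0])   # e.g. heavier weight on a DEM band
```

The resulting point set is what Step 5 hands to the multiple sample DaSpec algorithm.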

7.3 Cluster Results for Polygon 9

To illustrate the image clustering algorithm on a satellite image, consider the running example of Polygon 9. To divide this image into homogeneous regions, it was first scaled by a factor of 25, as in Figure 7.1(b). Next, the red, green, blue, near infrared, and equalized DEM bands were used to cluster the image with multiple sample data spectroscopic clustering. The row and column weights were initialized to αr = αc = 1, and the color band weights were set at α1 = α2 = α3 = α4 = 1 as well. The equalized DEM was given a weight of α5 = 8, which is 8 times the weight on the rows and columns, and double the combined weight on the color bands.

For the multiple sample DaSpec algorithm, T = 20 samples of size m = 100 pixels were drawn from the image. The bandwidth for the first stage of clustering was set to ω = 0.25, and the bandwidth used in the second stage was τ = 0.20. The scaled image was of size 705 × 432, for a total of 304,560 points in the dataset. The time taken to perform the clustering was 7.111 seconds, and the number of estimated groups was Ĝ = 6. The clustering results are shown in Figure 7.4.

Recall that in Chapter 4, the cairn detection algorithm detected 1634 objects, which were shown in Figure 4.21. With cluster results in hand, we can now determine which of these objects fall into regions typically inhabited by cairns, and which do not. An overlay of the training cairn locations on the clusters is given in Figure 7.5.

Note that most of the cairns appear to fall in similar clusters – the largest of these is the medium green colored group that lines the wadi.

The bar graph in Figure 7.6(a) shows the distribution of cluster labels across the training cairns. Most of the cairns are placed into one of two groups: clusters 1 and 4. A plot of the two clusters (Figure 7.6(b)) reveals that these are precisely the two clusters that line the wadi region of the image. Out of the 1634 detected objects, 562 are in cluster 4 and 137 are in cluster 1, a total of 699 in either cluster. This is an additional 57.22% reduction of the detected object set.

Figure 7.7 shows the top 50 ranked objects from clusters 1 and 4, followed by the top 50 ranked objects from cluster 4 alone. Note that cluster 1 is primarily relegated to the boundary of the wadi, although it sometimes extends into the wadi as well. This explains the presence of a couple of detections that look like vegetation in the first plot. These objects must have barely passed the JV filter and scored well on the other features.

Figure 7.4: Cluster results for Polygon 9. The image was scaled by a factor of 25 and clustered using the RGB, IR, and equalized DEM bands. Weights were αr = αc = α1 = α2 = α3 = α4 = 1 and α5 = 8. Parameters for the multi-sample DaSpec algorithm were T = 20, m = 100, ω = 0.25, and τ = 0.20. A total of Ĝ = 6 groups were uncovered, and are shown in the plot. Runtime was 7.111 seconds.

If only the dominant cluster 4 is used, there seems to be less vegetation. However, in either case there is strong evidence that the remaining objects are very cairn-like in appearance. Indeed, 33 of the original 60 well preserved training cairns are in the cluster 4 set, and 3 are in cluster 1. With the number of objects on the order of hundreds, a visual inspection of the satellite imagery at those object locations could reveal clues about which ones are the most likely to be cairns.

Figure 7.5: All 76 training cairns from Polygon 9 (including the poorly preserved ones) overlaid on the cluster results.

Figure 7.6: (a) A bar graph showing which clusters the training cairns fall into. (b) Cairn clusters 1 (red) and 4 (blue).

Figure 7.7: (a) The top 50 detected objects in both clusters 1 and 4. (b) The top 50 detected objects in cluster 4 only. Each 25 × 25 window has its intensities normalized to a [0, 1] scale within that window, to enhance contrast for display purposes only.

7.4 Discussion

In this chapter, a method was set forward to cluster a satellite image into homogeneous landform groups based on multispectral data. In essence, after scaling the image to a more manageable size, each pixel is treated as a weighted vector containing row/column, color, and elevation information. These pixels are clustered using the multiple sample DaSpec algorithm, and the resulting clusters can then inform decisions about which detected objects are actually cairns.

The major complicating factor in this procedure is the variability of imagery across different regions and times. Depending on the location, the values in the color and elevation bands may change a great deal. The time of day at which the image was taken could influence color information, as well. The landforms themselves could also complicate matters – for example, an image with a large expanse of forest may require a different approach than the arid landscapes the algorithm was designed for.

Ultimately, the procedure described above should be robust to regional changes. However, determining which parameter settings are optimal for a given image is not a trivial task. Most likely, there will be some trial and error involved, as well as reliance on prior knowledge. When I was deciding on settings to cluster Polygon 9, I used the automatically detected bandwidths ω and τ as a starting point, and then looked at cluster results for different values in that neighborhood. I had an idea of what kind of cluster results were desirable: namely, somewhere between 4 and 8 groups, with heavier weight put on the DEM than on the color bands. Unfortunately, much of the parameter selection is more art than science.

Finally, finding cairn clusters is easier when a training set is available. However, if prior information is incorporated, a training set may not be necessary. For instance, one of our team's hypotheses is that cairns should be visible from the wadi because they doubled as territorial markers. If this theory is correct, the clusters lining the wadi (clusters 1 and 4) could be designated cairn clusters without a training set.

At the very least, the algorithm laid out in this chapter provides a first step in landform classification using spectral clustering. It works successfully in parts of southern Arabia, and seems to improve cairn detection results by removing objects that are not in locations consistent with cairn construction.

CHAPTER 8

SUMMARY AND FUTURE WORK

In this thesis, I have presented the background, motivation, and development of two different algorithms. The first automatically detects burial monuments in satellite imagery, with or without a training set of cairns. The second is a spectral clustering algorithm that uses multiple samples to overcome computational issues in large data situations while stabilizing the perturbation arising from sampling variability. A brief summary of these procedures is given in Section 8.1, and future research directions for both projects are given in Section 8.2.

8.1 Algorithm Summaries

8.1.1 The Cairn Detection Algorithm

In Chapter 4, I presented a method for automatically detecting cairns in satellite imagery from southern Arabia. The motivation behind the algorithm was to remove the need for anthropologists to comb through images by eye, or to rely as heavily on native inhabitants for the locations of ancient monuments. The resulting algorithm successfully eliminates over 99.99% of the pixels in the image, thereby reducing the number of detected objects to a manageable size.

The algorithm itself is built around the notion of moving a 25 × 25 window operator over each pixel in the image. For each window, a number of values are calculated that help distinguish cairns from other objects and background pixels. The first filter is the "blob detector" JB (Eq. (4.1)), which measures the relative intensity of the inner part of the window compared to the outer border region. Next, the windows are sent through a vegetation filter JV (Eq. (4.2)), which measures the normalized difference vegetation index to determine whether the detected objects are bushes or trees. The sizes of the objects are then measured using the operator JS (Eq. (4.3)), which finds the smallest inner window such that the 95th percentile inner window intensity is larger than the median intensity for the window.

Objects that pass these three filters are then measured for circularity. First, a circle is fit to the window using a Hough transform, and two measurements based on this circle are taken on the object. The first, JHR (Eq. (4.4)), is the ratio of the maximum accumulator function value (a measure of how well the circle fits) to the number of edge pixels in the window. Large ratios indicate circular objects with minimal background clutter, a property shared by cairns. The Hough score JHS (Eq. (4.5)) is also calculated for the window; it is essentially the same concept as the JB value, but instead of comparing the inner window and border region intensities, it compares the Hough circle interior and border region intensities. Objects passing the Hough filters then have their boundaries extracted, and a circularity score JC (Eq. (4.7)) is calculated using a modified version of Proffitt's method.
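To illustrate the kind of boundary-based circularity statistic involved, the sketch below uses a Haralick-style radial measure (mean over standard deviation of boundary-to-centroid distances) as a stand-in; the modified Proffitt score of Eq. (4.7) differs in its details.

```python
import numpy as np

def radial_circularity(boundary):
    """Radial circularity in the spirit of Haralick (1974): the ratio
    of the mean to the standard deviation of boundary-to-centroid
    distances.  Large values indicate near-circular boundaries.  This
    is an illustrative stand-in for the modified Proffitt score J_C
    (Eq. (4.7)) used in Chapter 4."""
    pts = np.asarray(boundary, dtype=float)
    r = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
    return r.mean() / (r.std() + 1e-12)   # epsilon guards a perfect circle

theta = np.linspace(0.0, 2.0 * np.pi, 100, endpoint=False)
circle  = np.c_[np.cos(theta), np.sin(theta)]         # unit circle
ellipse = np.c_[3.0 * np.cos(theta), np.sin(theta)]   # stretched 3:1
print(radial_circularity(circle) > radial_circularity(ellipse))   # True
```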

If a training set of cairns is available, it can be used to tune the thresholds for the filters. It can also be used to further reduce the set of detected objects by subsetting to only those objects that lie within the convex hull formed by the cairns in the six-dimensional feature space J. Finally, the objects can be ranked using a likelihood-based approach built on the marginal distributions of the training cairns in each dimension.
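The marginal-likelihood ranking can be sketched as follows. The per-dimension Gaussian kernel density estimate and its bandwidth are illustrative assumptions, not the marginal distributions actually fit in Chapter 4.

```python
import numpy as np

def marginal_log_likelihood(train, objects, bw=0.5):
    """Rank detected objects by the sum, over feature dimensions, of
    log kernel-density estimates fit to the training cairns' marginals.
    The Gaussian kernel and bandwidth are illustrative; Chapter 4
    specifies the marginal distributions actually used."""
    train = np.asarray(train, dtype=float)
    objects = np.asarray(objects, dtype=float)
    # Pairwise standardized differences: shape (n_objects, n_train, n_dims)
    diff = (objects[:, None, :] - train[None, :, :]) / bw
    # 1-D Gaussian KDE evaluated per dimension: shape (n_objects, n_dims)
    dens = np.exp(-0.5 * diff**2).mean(axis=1) / (bw * np.sqrt(2.0 * np.pi))
    return np.log(dens).sum(axis=1)      # higher = more cairn-like

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(60, 6))   # 60 training cairns, 6 features
near  = np.zeros((1, 6))                     # object near the training mode
far   = np.full((1, 6), 5.0)                 # object far from every cairn
scores = marginal_log_likelihood(train, np.vstack([near, far]))
print(scores[0] > scores[1])   # True: the cairn-like object ranks higher
```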

At the end of Chapter 4, the algorithm was demonstrated on Polygon 9. Results show that the algorithm filtered out a large number of objects from the image, including vegetation, ridgelines, and some blotchy patches of soil. Visual inspection of the top-ranked objects shows that all of them share features with cairns; in fact, 42 of the 60 training cairns are detected in the final set. Additionally, 4 of the 16 poorer-quality training cairns passed the filters but were eliminated in the convex hull reduction step. These results demonstrate the usefulness of the algorithm and the promise it holds for automatic cairn detection.

8.1.2 The Multiple Sample DaSpec Algorithm

In Chapter 5, I described the computational difficulties that arise when spectral clustering algorithms are applied to large datasets: storage and eigendecomposition of the n × n affinity matrix can exceed the capabilities of the machine being used. One way to deal with this problem is to use a single sample to approximate the eigenvectors of the affinity matrix and then extend them (via the Nyström extension) to the full dataset. However, a perturbation analysis reveals that the eigenvectors of the sample affinity matrix experience changes in sign and ordering due to sampling variability. This perturbation can seriously impact the results of clustering analyses that are not robust to those changes. In an attempt to reduce the effects of sampling variability on cluster results, Dr. Shi and I developed a multiple sample adaptation of the data spectroscopic clustering algorithm, which appears in Chapter 6.

The method works by first calculating the eigenvectors of affinity matrices from T samples of size m. These eigenvectors are then themselves clustered using DaSpec and a similarity measure defined over vectors (see Eq. (6.1)). Matching eigenvectors are combined to form G group eigenvectors, which are then extended to the full dataset and used to assign cluster labels. There are two bandwidth parameters, ω and τ: the former controls the kernel bandwidth for the original sample affinity matrices, while the latter is the kernel bandwidth for DaSpec in the eigenvector matching stage.
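Because the sign of an eigenvector is arbitrary, any similarity used in the matching stage must treat v and -v as the same vector. The absolute cosine below is a plausible stand-in for the measure in Eq. (6.1), not the measure itself.

```python
import numpy as np

def sign_invariant_similarity(u, v):
    """Sign-invariant similarity between two sample eigenvectors: the
    absolute cosine of the angle between them.  Used here only as an
    illustrative stand-in for the similarity measure of Eq. (6.1)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([1.0, 2.0, 3.0])
print(sign_invariant_similarity(u, -u))   # flipping the sign changes nothing: 1.0
```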

Results of the algorithm are given at the end of Chapter 6 and illustrate two major points. First, a simulation shows how the use of multiple samples can increase clustering stability and extract more grouping information from the data than the single sample approach with the same total number of points. In the multiple sample setting, each sample isolates different groups in the data, and when this information is combined, all of the groups can be recovered. This contrasts with the single sample approach, which can identify only the groups represented in that one sample. In datasets with a large number of groups relative to the sample size, multiple samples have a clear advantage.

Second, the algorithm was applied to real images from the Berkeley Image Database. Results show that the multiple sample DaSpec procedure is able to quickly and satisfactorily segment imagery into homogeneous groups.

In addition to the multiple sampling procedure, I also outlined a sparse extension method that can be used instead of the full Nyström extension. The method approximates the Nyström extension by calculating the sum using only a few influential observations rather than the entire sample. Image clustering results show that the sparse extension method gives runtimes orders of magnitude faster than the full extension, with very little performance sacrificed. Importantly, the sparse extension can be used with both the single sample and multiple sample clustering approaches.
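The full and sparse extensions can be sketched together as below. The Gaussian kernel, the way the eigenpair is obtained, and the choice of the k most influential points are illustrative assumptions, not the exact construction of Chapters 5 and 6.

```python
import numpy as np

def nystrom_extend(x, sample, v, lam, omega, k=None):
    """Extend a sample eigenvector v (eigenvalue lam) of a Gaussian
    affinity matrix to a new point x via the Nystrom extension:
    a kernel-weighted sum of the eigenvector's entries, scaled by the
    eigenvalue.  With k set, only the k sample points with the largest
    kernel weights enter the sum -- the idea behind the sparse
    extension.  Kernel form and k are illustrative."""
    w = np.exp(-np.sum((sample - x) ** 2, axis=1) / (2.0 * omega ** 2))
    if k is not None:
        keep = np.argsort(w)[-k:]          # k most influential observations
        return w[keep] @ v[keep] / lam
    return w @ v / lam                     # full Nystrom extension

rng = np.random.default_rng(1)
sample = rng.normal(size=(200, 2))
omega = 1.0
sq = ((sample[:, None] - sample[None]) ** 2).sum(-1)
K = np.exp(-sq / (2.0 * omega ** 2))       # sample affinity matrix
evals, evecs = np.linalg.eigh(K)
v, lam = evecs[:, -1], evals[-1]           # top eigenpair
x = np.zeros(2)
full   = nystrom_extend(x, sample, v, lam, omega)
sparse = nystrom_extend(x, sample, v, lam, omega, k=20)
print(abs(full - sparse) < abs(full))      # True: the sparse sum tracks the full one
```

Since the top eigenvector of a strictly positive affinity matrix is one-signed, dropping small-weight terms shrinks the sum without flipping its sign, which is why the sparse extension degrades gracefully.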

8.1.3 Clustering in Cairn Detection

To combine these two areas of research and aid the cairn detection procedure, I also applied the multiple sample DaSpec algorithm to a satellite image from Yemen. The idea is that cairns will tend to appear in landform types that are conducive to cairn construction. In particular, cairns should be in locations that are flat (so the cairn does not fall over), not at high elevations (so the builders could walk there), and rocky (to provide building materials).

To cluster the image into different kinds of landforms, I first converted the image into a lower-resolution version that smoothed out local variation in color. I also changed the digital elevation model (DEM), which measures absolute elevation, into an equalized DEM that measures the relative local elevation of each 30-meter pixel.

Each of the image pixels is then represented as a weighted vector in seven dimensions: row index, column index, red, green, blue, near infrared, and equalized DEM. Finally, these pixels are clustered using multiple sample data spectroscopic clustering.
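One way to sketch the DEM equalization is a sliding-window min/max normalization, so each pixel records its height relative to its neighborhood rather than above sea level. The window radius and the exact normalization are illustrative assumptions; Chapter 7 gives the equalization actually used.

```python
import numpy as np

def equalize_dem(dem, radius=5):
    """Convert absolute elevation into relative local elevation: each
    pixel's height minus the minimum over a surrounding square window,
    rescaled by the local range.  The window radius (in 30 m pixels)
    and min/max normalization are illustrative assumptions."""
    dem = np.asarray(dem, dtype=float)
    out = np.empty_like(dem)
    n, m = dem.shape
    for i in range(n):
        for j in range(m):
            patch = dem[max(i - radius, 0):i + radius + 1,
                        max(j - radius, 0):j + radius + 1]
            lo, hi = patch.min(), patch.max()
            out[i, j] = (dem[i, j] - lo) / (hi - lo) if hi > lo else 0.0
    return out

# On a smooth ramp, interior pixels away from the hill equalize to 0.5,
# while an isolated hill tops its neighborhood and maps to 1.0.
dem = np.add.outer(np.arange(20.0), np.arange(20.0))   # smooth ramp
dem[10, 10] += 50.0                                    # isolated hill
eq = equalize_dem(dem)
print(eq[10, 10])   # -> 1.0
```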

Details of the algorithm and results are given in Chapter 7. To summarize, there was one dominant cluster in Polygon 9 that contained most of the training cairns, as well as a secondary cluster that contained a smaller proportion of the cairns. By subsetting to these two clusters, the number of detected objects decreases by over 50%, and the remaining objects appear much more cairn-like.

8.2 Future Work

8.2.1 Cairn Detection

While the results of the cairn detection are promising, there are still improvements to be made. First, the algorithm as it stands relies heavily on filters based only on contrasts in intensity. The only color information used is in the vegetation filter, which computes NDVI from the red and near-infrared bands. There is certainly more work to be done designing new ways to eliminate non-cairns that incorporate bands of the imagery beyond panchromatic, red, and near infrared.

Other, more complex methods could also be brought to bear on the problem. Over the last couple of years, I have experimented with classification techniques such as support vector machines and linear discriminant analysis to separate cairns from non-cairn detected objects. While these approaches have not yet borne fruit, I am confident there is a way to use them at some step of the algorithm to reduce the number of false detections. Recently, Dr. Goel and I have been examining the covariance structure of 7 × 7 windows centered on training cairns to see whether a common pattern emerges that could be exploited. Finally, some cairns exhibit a shadowing pattern that has not yet been used for detection purposes. The presence of a shadow indicates that a detected object has substantial height, as opposed to being a simple discoloration of the soil or a local variation in stone color.

The anthropologists in the field are also surveying the areas around the coordinates of detected objects to assess the accuracy of the algorithm. Preliminary reports indicate that many of the false detections are circular, relatively dark, cairn-sized regions arising from a variety of sources (e.g., soil discoloration, large stones). This is certainly encouraging, and their feedback will be valuable in determining where the algorithm fails and what can be done to improve it.

In the distant future, the hope is that this algorithm can be extended for use in other locales besides southern Arabia. Clearly, large expanses of water, forest, or urban development could cause the algorithm to fail, and additional techniques may be necessary to deal with these kinds of environments.

This algorithm could also be adapted to detect objects other than cairns. The window-based filter approach is flexible and could support a variety of operators that measure different characteristics of detected objects. Even in southern Arabia, there are other monuments of interest to anthropologists. For instance, some high circular tombs have tails made of stone piles (see Figure 8.1), the purpose of which is still a mystery. In addition, Al-Shahrî (2007) identifies six different kinds of graves in the Dhofar region of Oman, each with its own unique characteristics that may or may not be visible in satellite imagery.

There are other monuments in southern Arabia that are not related to burial but may still be of interest to anthropologists. Foremost among these are structures called triliths, which are composed of three stones standing upright and a fourth capstone on top. Triliths are about one meter in height and tend to be arranged in patterns, along with a set of four stones lying on the ground (see Al-Shahrî, 2007, for details and an illustration). Indications are that the triliths carried ceremonial importance, and their proximity to large numbers of hearths suggests that they may have been used in regional feasts held between different tribes. In some instances, the distinct pattern of triliths and hearths is visible from space, and the detection algorithm could be tuned to detect these monuments.

Figure 8.1: A high circular tomb (HCT) with a tail (a) in the satellite imagery, and (b) on the ground.

8.2.2 Multiple Sample DaSpec Clustering

The next step toward improving the multiple sample DaSpec algorithm is to streamline the parameter selection process. In some situations, the algorithm works well for a wide range of settings; in others, it performs well only when the parameters lie in a narrow range of values. This is especially true for the bandwidth parameters ω and τ. Future research should first identify which kinds of data situations, in terms of the number of groups and dataset size, make parameter selection difficult. By examining the performance of the algorithm in these scenarios, it may be possible to understand where the procedure breaks down.

Along these lines, the trade-off between the number of samples T and the sample size m is still unclear. Anecdotally, I have observed that a law of diminishing returns applies to both parameters: the improvement in clustering performance declines as the number of samples and their sizes increase. For example, in the 481 × 321 images shown in Chapter 6, visual inspection of cluster results for T > 20 or m > 500 indicated no apparent advantage over T = 20 and m = 500. Identifying situations that require large numbers of samples versus large sample sizes would help improve the robustness of the algorithm to the characteristics of different datasets.

I have also done some work on improving clustering by introducing an uncertainty measure into the group labeling process. In the final step of the algorithm, each point x is given the label corresponding to the eigenvector v with the largest value at that point. However, when one or more groups in the data are not identified, the members of those groups must still be given a label. Usually, assigning this label amounts to choosing the maximum among several near-zero values, since none of the eigenvectors are large for points in the missing groups. If v^(1), ..., v^(G) are the G eigenvectors with no sign change identified by the algorithm, the clustering uncertainty for a point x is defined by

\[
\left\| \left( v^{(1)}[x], \ldots, v^{(G)}[x] \right) \right\|
  = \left[ \sum_{i=1}^{G} \left( v^{(i)}[x] \right)^{2} \right]^{1/2}
\tag{8.1}
\]

When the true group for point x is not identified, the norm in Eq. (8.1) will be close to zero. To improve cluster results, points with uncertainty values below a certain threshold could be given a separate label, rather than being assigned to a group to which they clearly do not belong. Preliminary results on images show that this uncertainty measure can often make cluster results more appealing, at least upon visual inspection. Future work along these lines could build on the uncertainty measure to detect small groups that are obscured by other dominant groups, or to identify outlier groups in datasets.
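The uncertainty-aware labeling can be sketched directly from Eq. (8.1); the threshold value below is illustrative.

```python
import numpy as np

def cluster_labels_with_uncertainty(V, threshold=0.1):
    """Assign each point the label of its largest group eigenvector,
    but give points whose uncertainty norm (Eq. (8.1)) falls below the
    threshold a separate 'unassigned' label (-1).  V is n x G, one
    column per no-sign-change eigenvector extended to the full data;
    the threshold value is an illustrative choice."""
    V = np.asarray(V, dtype=float)
    norms = np.sqrt((V ** 2).sum(axis=1))   # Eq. (8.1) for each point
    labels = np.argmax(V, axis=1)
    labels[norms < threshold] = -1          # point belongs to a missed group
    return labels, norms

V = np.array([[0.9,   0.01],     # clearly group 0
              [0.02,  0.8],      # clearly group 1
              [0.001, 0.002]])   # near-zero everywhere: a missed group
labels, norms = cluster_labels_with_uncertainty(V)
print(labels.tolist())   # -> [0, 1, -1]
```

Rather than forcing the third point into group 1 on the strength of a 0.002-versus-0.001 comparison, the threshold sets it aside, which is the behavior described above.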

BIBLIOGRAPHY

D. Achlioptas and F. McSherry. Fast computation of low rank matrix approximations. In Proceedings of the 33rd Annual ACM Symposium on Theory of Computing, pages 611–618, 2001.

A. Al-Shahrî. Grave types and "triliths" in Dhofar. Arabian Archaeology and Epigraphy, 2(3):182–195, 2007.

P. Arabie and L. Hubert. Cluster analysis in marketing research. In R. Bagozzi, editor, Advanced Methods for Marketing Research, pages 160–189. Blackwell, Malden, MA, 1994.

D. H. Ballard. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2):111–122, 1981.

M. J. Bottema. Circularity of objects in images. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 4, 2000.

U. Brunner. Geography and human settlements in ancient southern Arabia. Arabian Archaeology and Epigraphy, 8(2):190–202, 1997.

J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8(6):679–698, 1986.

J. Carmichael, J. George, and R. Julius. Finding natural clusters. Systematic Zoology, 17(2):144–150, 1968.

Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

J. Clatworthy, D. Buick, M. Hankins, J. Weinman, and R. Horne. The use and reporting of cluster analysis in health psychology: A review. British Journal of Health Psychology, 10:329–358, 2005.

R. Costello, editor. Random House Webster's College Dictionary. Random House, Inc., New York, NY, 1996.

D. R. Cox. Note on grouping. Journal of the American Statistical Association, 52:543–547, 1957.

L. Crozier and R. W. Zabel. Climate impacts at multiple scales: evidence for differential population responses in juvenile Chinook salmon. Journal of Animal Ecology, 75(5):1100–1109, 2006.

J. K. Cullum and R. A. Willoughby. Lanczos Algorithms for Large Symmetric Eigenvalue Computations, Vol. 1. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2002.

T. Dalenius. The problem of optimum stratification. Skandinavisk Aktuarietidskrift, 34:133–148, 1951.

D. Delling, M. Gaertler, R. Görke, Z. Nikoloski, and D. Wagner. How to evaluate clustering techniques. Technical report, 2006.

A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–21, 1977.

I. Dinov. Expectation maximization and mixture modeling tutorial. Statistics Online Computational Resource, December 2008.

R. O. Duda and P. E. Hart. Use of the Hough transformation to detect lines and curves in pictures. Communications of the ACM, 15(1):11–15, 1972.

L. Engelman and J. A. Hartigan. Percentage points of a test for clusters. Journal of the American Statistical Association, 64:1647–1648, 1969.

M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231, August 1996.

L. Fisher and J. W. Van Ness. Admissible clustering procedures. Biometrika, 58:91–104, 1971.

I. Fodor. A survey of dimension reduction techniques. LLNL Technical Report UCRL-ID-148494, 2002.

C. Fowlkes, S. Belongie, F. Chung, and J. Malik. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.

W. Frei and C. Chen. Fast boundary detection: A generalization and a new algorithm. IEEE Transactions on Computers, C-26, 1977.

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.

G. Gan, C. Ma, and J. Wu. Data Clustering: Theory, Algorithms, and Applications. Series on Statistics and Applied Probability. ASA-SIAM, Philadelphia, PA, 2007.

M. Giger, K. Doi, and H. MacMahon. Image feature analysis and computer-aided diagnosis in digital radiography. 3. Automated detection of nodules in peripheral lung fields. Medical Physics, 15(2):158–166, 1988.

R. M. Haralick. A measure for circularity of digital figures. IEEE Transactions on Systems, Man and Cybernetics, 4(2):394–396, 1974.

L. Heyer, S. Kruglyak, and S. Yooseph. Exploring expression data: identification and analysis of coexpressed genes. Genome Research, 9(11):1106–1115, November 1999.

P. V. C. Hough. Method and means for recognizing complex patterns. US Patent 3069654, 1962.

D. J. Hurley, M. S. Nixon, and J. N. Carter. Force field feature extraction for ear biometrics. Computer Vision and Image Understanding, 98(3):491–512, 2005.

A. Jain, Y. Zhong, and S. Lakshmanan. Object matching using deformable templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(3):267–278, 1996.

D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE Transactions on Knowledge and Data Engineering, 16(11):1370–1386, 2004.

L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.

R. Kirsch. Computer determination of the constituent structure of biological images. Computers and Biomedical Research, 4, 1971.

K. Lee, J. Kim, K. H. Kwon, Y. Han, and S. Kim. DDoS attack detection method using cluster analysis. Expert Systems with Applications, 34(3):1659–1665, 2008.

J. Li, S. Ray, and B. Lindsay. A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8:1687–1723, 2007.

G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.

S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137, 1982.

J. MacQueen. Some methods for classification and analysis of multivariate observations. In L. Le Cam and J. Neyman, editors, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

D. Marr and E. C. Hildreth. Theory of edge detection. Proceedings of the Royal Society of London, B-207:187–217, 1980.

D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the 8th International Conference on Computer Vision, volume 2, pages 416–423, July 2001.

D. Massart, F. Plastria, and L. Kaufman. Non-hierarchical clustering with MASLOC. Pattern Recognition, 16:507–516, 1983.

J. McCorriston, E. A. Oches, D. E. Walter, and K. L. Cole. Holocene paleoecology and prehistory in highland southern Arabia. Paléorient, 28(1):61–88, 2002.

M. Meila and J. Shi. A random walks view of spectral segmentation. In International Conference on AI and Statistics (AISTATS), January 2001.

P. M. Merlin and D. J. Farber. A parallel mechanism for detecting curves in pictures. IEEE Transactions on Computers, 24:96–98, 1975.

M. C. Morrone and R. A. Owens. Feature detection from local energy. Pattern Recognition Letters, 6(5):303–313, 1987.

A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:955–962, 2004.

M. S. Nixon and A. S. Aguado. Feature Extraction and Image Processing. Elsevier Ltd., Oxford, England, 2nd edition, 2008.

N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, SMC-9(1):62–66, 1979.

P. Perona and W. Freeman. A factorization approach to grouping. In Proceedings of ECCV, pages 655–670, 1998.

P. Perona and J. Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.

W. K. Pratt. Digital Image Processing. John Wiley, New York, NY, USA, 1977.

J. M. S. Prewitt. Object enhancement and extraction. In Picture Processing and Psychopictorics, pages 75–149, 1970.

J. M. S. Prewitt and M. L. Mendelsohn. The analysis of cell images. Annals of the New York Academy of Sciences, 128(3):1035–1053, 1966.

D. Proffitt. The measurement of circularity and ellipticity on a digital grid. Pattern Recognition, 15(5):383–387, 1982.

G. X. Ritter and J. N. Wilson. Handbook of Computer Vision Algorithms in Image Algebra. CRC Press, New York, NY, USA, 2nd edition, 2001.

L. G. Roberts. Machine perception of three-dimensional solids. In Optical and Electro-Optical Information Processing, pages 159–197, 1965.

A. Rosenfeld. Picture Processing by Computer. Academic Press, 1969.

T. Roussillon, I. Sivignon, and L. Tougne. A measure of circularity for parts of digital boundaries and its fast computation. Pattern Recognition, 43(1):37–46, 2010.

G. Scott and H. Longuet-Higgins. Feature grouping by relocalisation of eigenvectors of the proximity matrix. In Proceedings of the British Machine Vision Conference, 1990.

J. Scott. Social network analysis. Sociology, 22(1):109–127, 1988.

G. S. Sebestyen. Decision Making Processes in Pattern Recognition. Macmillan, 1962.

C. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, July and October 1948.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

T. Shi, M. Belkin, and B. Yu. Data spectroscopy: Eigenspace of convolution operators and clustering. Annals of Statistics, 37(6B):3960–3984, 2009.

I. E. Sobel. Camera Models and Machine Perception. PhD thesis, Stanford University, 1970.

M. Staubwasser and H. Weiss. Holocene climate and cultural evolution in late prehistoric–early historic West Asia. Quaternary Research, 66:372–387, 2006.

H. Steinhaus. Sur la division des corps matériels en parties. Bulletin de l'Académie Polonaise des Sciences, 1(4):801–804, 1956.

M. Stojmenović and A. Nayak. Shape based circularity measures of planar point sets. In IEEE International Conference on Signal Processing and Communications, pages 1279–1282, 2007.

M. Stojmenović, A. Nayak, and J. Zunic. Measuring linearity of a finite set of points. In IEEE International Conference on Cybernetics and Intelligent Systems, pages 222–227, 2006.

R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a dataset via the gap statistic. Journal of the Royal Statistical Society: Series B, 63(2):411–423, 2001.

M. Tosi. The emerging picture of prehistoric Arabia. Annual Review of Anthropology, 15(1):461–490, 1986.

U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, 36(2):555–586, 2008.

Y. Weiss. Segmentation using eigenvectors: a unifying view. In International Conference on Computer Vision, 1999.

A. L. Yuille, D. S. Cohen, and P. W. Hallinan. Feature extraction from faces using deformable templates. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 104–109, 1989.

J. Zunic and K. Hirota. Measuring shape circularity. Progress in Pattern Recognition, Image Analysis and Applications, LNCS 5197:94–101, 2008.