<<

Precise Image Registration and Occlusion Detection

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Vinod Khare, B. Tech.

Graduate Program in Civil Engineering

The Ohio State University

2011

Master’s Examination Committee:

Asst. Prof. Alper Yilmaz, Advisor

Prof. Ron Li

Prof. Carolyn Merry

© Copyright by

Vinod Khare

2011

Abstract

Image registration and mosaicking is a fundamental problem in computer vision.

The large number of approaches developed to achieve this end can be broadly divided into two categories: direct methods and feature-based methods. Direct methods work by shifting or warping the images relative to each other and looking at how much the pixels agree. Feature-based methods work by estimating a parametric transformation between two images using point correspondences.

In this work, we extend the standard feature-based approach to multiple images and adopt the photogrammetric process to improve the accuracy of registration. In particular, we use a multi-head camera mount that provides multiple non-overlapping images per time epoch, and we use multiple epochs, which increases the number of images to be considered during the estimation process. The existence of a dominant scene plane in 3-space, visible in all the images acquired from the multi-head platform, is formulated in a bundle block adjustment framework in the image space and provides precise registration between the images.

We also develop an appearance-based method for detecting potential occluders in the scene. Our method builds upon existing appearance-based approaches and extends them to multiple views.

To my mother

Acknowledgments

I would like to extend my heartfelt gratitude to my advisor, Dr. Alper Yilmaz, whose invaluable guidance in work and in personal life made this work possible. His never-ending support was valuable every step of the way. He made sure that he was always available and had constructive suggestions for my research. I would also like to thank Dr. Li and Dr. Merry for their participation as committee members for my Master's Thesis Examination.

I would like to thank all the members of the Photogrammetric Computer Vision Lab for a friendly and intellectually stimulating environment that made my stay at OSU the most memorable experience of my life. I would also like to thank my friends at Ohio State for some of the most fun times ever.

Vita

April 1986 ...... Born - Indore, India

June 2007 ...... B. Tech. Civil Engineering, Indian Institute of Technology Kanpur, India

September 2008 - present ...... Graduate Student, Department of Civil & Environmental Eng. and Geodetic Science, The Ohio State University

Publications

Research Publications

V. Khare, A. Yilmaz, and O. Mendoza-Schrok. Precise Image Registration and Occlusion Detection. In Proceedings of the IEEE National Aeronautics and Aerospace Electronics Conference, 2010.

V. Khare, S. Yadav, A. Rastogi, and O. Dikshit. Textural Classification of SAR Images. In Proceedings of the XXVII INCA International Congress, 2007.

Fields of Study

Major Field: Civil Engineering

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

1. Introduction
   1.1 Overview
   1.2 Related Work
   1.3 Motivations and Contributions
   1.4 Organization of the Thesis
   1.5 Notation

2. Projective Geometry and Homography Estimation
   2.1 Projective Space and Homogeneous Coordinates
   2.2 Projective Transformations and Planar Homography
   2.3 Estimation of Planar Homography
       2.3.1 Number of Measurements Required
       2.3.2 Direct Linear Transform (DLT)
   2.4 Cost Functions
       2.4.1 Algebraic Distance
       2.4.2 Symmetric Transfer Error
       2.4.3 Reprojection Error
   2.5 Transformation Invariance and Normalization
       2.5.1 Image Coordinate Transforms
       2.5.2 Non-Invariance of DLT
       2.5.3 Invariance of Geometric Error
       2.5.4 Normalization
   2.6 Iterative Minimization
       2.6.1 Gold Standard Algorithm

3. Image Mosaicking
   3.1 Key-point Detection
       3.1.1 Harris and Stephens Corner Detector
       3.1.2 Scale Invariant Feature Transform (SIFT)
   3.2 Key-point Matching
   3.3 Robust Estimation (RANSAC)

4. Image Registration for Multi-Head Cameras
   4.1 Projection Model and Initial Estimates
   4.2 Algorithm Overview
       4.2.1 Key-point Detection and Matching
       4.2.2 Image Connectivity Graph
       4.2.3 Initial Estimation by Graph Traversal
       4.2.4 Optimizing with the Levenberg-Marquardt Algorithm
   4.3 Occlusion Detection

5. Experiments and Results
   5.1 Data Description
   5.2 Experiments with Simulated Data
   5.3 Registration of CLIF Dataset
   5.4 Details of Occlusion Detection

6. Conclusions and Future Work

Bibliography

Appendices

A. Software Implementation Details
   A.1 Software Environment
   A.2 Input
   A.3 Caching
   A.4 Output
   A.5 Running the program
   A.6 Viewing Results
   A.7 Notes for Programmers

List of Tables

2.1 Characteristics of various geometries and invariants therein.

5.1 RMS errors for the registration of the simulated dataset.

List of Figures

2.1 Points become lines and lines become planes in projective space.

2.2 A central projection projects all points from one plane to another along rays passing through a perspective center. This is a simpler case of projectivity called perspectivity.

2.3 Examples of a 2D homography, x′ = Hx. A planar homography is induced between two images by a world plane.

2.4 An illustration of (a) symmetric transfer error and (b) reprojection error. The points x and x′ are measured points and hence do not correspond perfectly because of noise. The estimated points x̂ and x̂′ do match perfectly for the estimated homography Ĥ.

3.1 Detecting key-points with SIFT and finding geometrically consistent points with RANSAC. The upper figure shows the key-points detected with SIFT and matched using BBF. The lower figure shows the matches geometrically consistent with the homography calculated using RANSAC.

3.2 Mosaicking images after estimating the homography using RANSAC.

4.1 The projection model used for bundle adjustment using Levenberg-Marquardt estimation.

4.2 Flow chart depicting the various stages of registering images from a multi-head camera.

4.3 The number of points detected decreases as one moves down the scale-space octaves.

4.4 Schematic of an image connectivity graph.

4.5 A typical scenario in creating the mosaic. The images I_1, ..., I_{n-1} are warped into the reference image space I_0 as I'_1, ..., I'_{n-1} to create the panorama.

5.1 The relative positioning of the cameras as shown from the camera viewpoint onboard the aircraft.

5.2 The images from the six camera heads arranged side by side. Note that there is some overlap between the images as the principal axes of the cameras are not parallel.

5.3 Capturing simulated data points over an urban terrain with two camera heads in linear motion. (a) Two camera heads forming the multi-head camera move in an oblique direction to simulate flight for four time epochs. (b), (c), (d) Random terrains are generated with a 'flat' ground and random high-rise 'buildings'. Data points are captured for the four time epochs.

5.4 Bundle adjustment of the synthetic dataset. Points captured for epoch 2 are projected onto points captured for epoch 1. (a), (b) and (c) plot the points for three random datasets. Points are divided into ground points and potential occluders (building points). Ground points register accurately while building points show misregistration due to parallax.

5.5 'Mosaicking' for adjacent cameras for the simulated dataset. Despite having no overlap between them, adjacent cameras are connected in the connectivity graph and hence can be 'stitched' together. The projected points are compared to the real point locations.

5.6 Mosaicking for two images. The two images taken from camera 2 shown in (a) and (b) were part of the bundle adjustment and have been precisely registered in (c). The difference in pixel values is visible in (d). The difference image was contrast enhanced to improve visibility.

5.7 Mosaicking for two images. The two images taken from camera 4 shown in (a) and (b) were part of the bundle adjustment and have been precisely registered in (c). The difference in pixel values is visible in (d). The difference image was contrast enhanced to improve visibility.

5.8 Comparison with the traditional approach. (a) shows the mosaic generated for 20 epochs using the traditional approach of sequentially registering pairs of images. (b) shows the same images registered using our method. Note that in (a) the misregistration increases as we move away from the reference image, and that our approach (b) successfully reduces this misregistration.

5.9 Mosaic of the CLIF dataset across 20 epochs. This mosaic is generated by using pixel values from only one transformed image.

5.10 Occlusion detection on the CLIF dataset across 20 epochs. The occlusion is overlaid in red.

5.11 Mosaic of the CLIF dataset across 20 epochs. This mosaic is generated by blending pixel values from all the transformed images.

5.12 (a) Mosaic across 20 epochs for a different area. (b) Mosaic with overlaid occlusions.

5.13 Mosaic across 150 epochs. (a) shows the panorama and (b) shows the overlaid potential occluders.

5.14 Details of occlusion detection. (a) We choose a small highlighted region on a mosaic of two views for illustration. (b) and (c) show the detected occlusion for T_o = 0.5 and T_o = 0.15, respectively. A white (= 1) pixel denotes a ground point and a black (= 0) pixel denotes a potential occluder. Finally, (d) shows the occlusion overlaid on the mosaic. The areas in red represent the ground-plane non-occluders.

5.15 Detected occlusion back-projected onto the original views.

A.1 Screenshot of the GUI for processing data.

A.2 Matching key-points using the GUI.

A.3 GUI for viewing results.

Chapter 1: INTRODUCTION

1.1 Overview

Cameras, being real physical devices, do not have an infinite field of view. They can only see a small portion of a scene at a given time. However, it is often useful to combine several different images into one large single image. For example, we might need to combine several aerial photographs to create a large-scale map of the area. In photography, images may be mosaicked together to get aesthetically pleasing ultra-wide-angle images or 360° panorama views. Mosaicking consecutive frames from a video can be used to generate a "video summary" [1].

Additionally, image alignment is also used to stabilize images in a digital camera by aligning consecutive frames to obtain a sharp image instead of letting them overlap and create a blurred view. The same technique can be used to stabilize videos. Video compression also uses image alignment to calculate which parts of the video do not change from frame to frame.

The availability of cheap digital cameras and the evolution of large internet databases of images have posed new challenges. Work has been done to automatically recognize panoramas from a collection of mixed images. Attempts have been made to create virtual-reality 3D models of the world by registering large collections of images from the internet. Commercial products, such as Google StreetView and Microsoft Photosynth, provide virtual views of real locations using image registration.

Thus, image mosaicking is one of the classical problems of computer vision. Algorithms for image alignment and stitching have been in widespread use since the early 1980s.

Image mosaicking is a two-step process. First, a relationship between the pixel coordinates of two images must be established; this constitutes what is called a 'motion model'. Then, a transformation between point locations must be estimated. Once the transformation is known, warping one image to align with the other is trivial.

There are two broad approaches to image alignment: direct (pixel-based) methods and feature-based methods. Direct methods work by shifting or warping the images relative to each other and looking at how much the pixels agree. Obviously, an exhaustive search of all possible configurations is computationally prohibitive, so a hierarchical coarse-to-fine refinement technique is used to make the problem tractable.

Feature-based methods start by detecting keypoints or features on the image and obtaining pairs of point matches. These 'tie-points' are then used as observations within the assumed model, and some error metric is minimized to solve for the unknown parameters of the motion model. When a large number of images is given, the motion models can be solved simultaneously for all image pairs using bundle adjustment. However, classical bundle adjustment as used within the photogrammetry community requires knowledge of the camera parameters to calculate the relative orientation of the cameras and the 3D locations of points.

When camera parameters are not available, image alignment can be performed by assuming the existence of a dominant plane in the scene (for example, the ground plane or a building facade). The relationship between image points on this plane is given by a linear transformation, which can be estimated robustly using a RANSAC approach. Even in the absence of camera parameters, it is possible to formulate the problem within a bundle adjustment framework to utilize the redundancy of the available point correspondences in order to increase precision.

An associated problem is that of occlusion detection. When two views are registered together, pixels visible in one of the views might not be visible in the other view. In some applications, such as surveillance, it might be important to know when a pixel has been occluded. Several approaches have been developed to deal with this problem, ranging from simple appearance-based algorithms to more sophisticated algorithms estimating depth fields.

1.2 Related Work

Being one of the oldest problems to be tackled by researchers in computer vision, image mosaicking has a rich literature. In this section we give a brief overview of the work that is most relevant to this thesis.

One of the two major approaches to image registration is feature-based registration. Features (keypoints) are extracted from the images, a global correspondence is established between points on different views, and a parametric transformation between them is estimated.

The first step in this process is the extraction of keypoints. Older terminology referred to 'corner-like' features [2]; modern usage is keypoints, interest points or salient points.

It has been shown [3] that the reliability of the motion estimate depends most critically on the size of the smallest eigenvalue of the image Hessian matrix, $\lambda_0$. Shi and Tomasi [4] use this quantity to propose good features to track. They use a combination of translational and affine-based patch alignment to track points through an image sequence. However, using a square image patch with equal weighting might not be the best idea; instead, Gaussian weighting can be used to increase the accuracy of the location of the detected keypoint [2, 5].

Different researchers have used different functions of the eigenvalues of the image Hessian matrix to characterize a key-point. Harris and Stephens (1988) [2] use the following quantity to formulate a combined corner and edge detector:

$$\det(A) - \alpha\,\mathrm{tr}(A)^2 = \lambda_0\lambda_1 - \alpha\,(\lambda_0 + \lambda_1)^2, \qquad (1.1)$$

where $A$ is the image Hessian matrix, $\lambda_0$ and $\lambda_1$ are its two eigenvalues, and $\alpha$ is a scalar constant; usually $\alpha = 0.06$ is chosen.

To reduce the effect of 1D edges, [6] uses the quantity

$$\lambda_0 - \alpha\lambda_1, \qquad (1.2)$$

with $\alpha = 0.05$. Another effective function is the harmonic mean used by [7]:

$$\frac{\det(A)}{\mathrm{tr}(A)} = \frac{\lambda_0\lambda_1}{\lambda_0 + \lambda_1}. \qquad (1.3)$$
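For concreteness, these corner-response measures can be evaluated directly from the two eigenvalues of A. The following Python sketch is purely illustrative; the function names and the small constant guarding against division by zero are my own choices, not part of the cited works.

```python
import numpy as np

def harris_measure(l0, l1, alpha=0.06):
    # Equation 1.1: det(A) - alpha * tr(A)^2
    return l0 * l1 - alpha * (l0 + l1) ** 2

def edge_suppressed_measure(l0, l1, alpha=0.05):
    # Equation 1.2: discounts the larger eigenvalue to reduce the effect of 1D edges
    return l0 - alpha * l1

def harmonic_mean_measure(l0, l1, eps=1e-12):
    # Equation 1.3: det(A) / tr(A), the harmonic mean of the eigenvalues up to a factor of 2
    return (l0 * l1) / (l0 + l1 + eps)
```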

In most applications it is important for the key-points to be invariant to scale, rotation and illumination differences. The Scale Invariant Feature Transform (SIFT) [8] proposes using the scale-space extrema of the difference of Gaussians (DoG) to extract features, which are invariant to scale and rotation and partially invariant to illumination differences. Speeded Up Robust Features (SURF) [9] build on the strengths of existing detectors and provide similar repeatability, distinctiveness and robustness while greatly speeding up the computation.

Given the profusion of available techniques and definitions, it is important to have an experimental comparison of the various key-point detectors. Such a comparison was provided by [10], focusing on the repeatability of feature detectors. They also evaluated the information content of the features using local entropy. A more up-to-date evaluation was performed by [11], including the more recent scale space-based feature detectors. They found that SIFT and SIFT-like features performed best.

Once key-points have been found on the images, a global correspondence across views has to be established. This is a difficult task for corner-like features, which are characterized only by a scalar quantity; the problem becomes one of tracking the feature across views, an approach called detect then track. Features are compared using local appearance, or local appearance modified using affine or other motion models.

Feature transforms, such as SIFT, provide a unique feature vector for each key-point. SIFT in particular gives us a 128-dimensional feature vector, which can be easily compared to determine point correspondence. The brute-force approach is to exhaustively search all point pairs. However, this soon becomes intractable as the number of points increases. More efficient indexing schemes have been developed using spatial data structures, such as kd-trees. A Best Bin First approach was proposed by [12], which uses a modified search ordering for the kd-tree algorithm so that bins in feature space are searched in the order of their closest distance from the query location.

Such matches are often putative in the sense that they might not be geometrically consistent. It is useful to fit these putative matches to a geometric relation, such as a homography or the fundamental matrix, and reject ill-fitting matches. Algorithms such as RANSAC [13] allow for a robust estimation of geometrically consistent matches. RANSAC has been modified by [14] to add the most confident matches first in an effort to increase performance. This approach has been used by [15] to estimate homography based on the modified median filter.

Once point matches are known, an appropriate geometric relation can be established between the views and solved for the transformation between the images. The most general form of this problem is structure from motion, which is dealt with extensively in [16, 17, 18].

To register more than two views, researchers build upon the techniques that have been developed to register two images. The simplest approach is to register two images and then add subsequent views one at a time [19]. This way we can also discover which images overlap each other [20]. As more views are added, the accumulation of error may lead to a gap (or excessive overlap) between the beginning and end frames. Gap closing can be performed by stretching out the alignment of all images [19].

A better approach is to simultaneously align all views at the same time using a least squares approach. This approach is well known to the photogrammetry community as bundle adjustment [21]. Bundle adjustment was first applied in computer vision by [22] to solve the generic structure from motion problem. This was further specialized for panoramic image stitching by [23, 20, 24]. An analogous direct approach is formulated by dividing the image into patches and creating a virtual feature correspondence between them [23].

Even after such global adjustment, the images may not be perfectly aligned and may appear blurry or ghosted. Several factors may contribute to this misalignment, including 3D parallax, radial distortion and small or large scene movement, such as swaying trees or walking people. Each problem needs a different solution. For example, the plumb-line method straightens radial distortion until curved lines become straight [25]. A mosaic-based approach adjusts the views until the misregistration is reduced in the overlap areas [26, 20].

A related problem is figuring out which images actually go together to form a panorama. For example, this problem arises in internet photo collections where some of the photos might not be part of the panorama at all. This problem is called recognizing panoramas by [27].

If the images forming the panorama have been obtained in a sequence that covers the entire scene and the first and last images are given, topology inference can be used to assemble a panorama automatically [20]. A system for creating multi-viewpoint panoramas of long, largely planar scenes is given by [28]. However, in most real scenarios such a sequence is not available. The user may have taken the pictures at random, or the pictures may be obtained from disparate datasets like different Internet photo collections. They may thus differ greatly in position, orientation, scale and lighting conditions. Such a dataset was successfully matched by [29]. A similar problem was solved for wide-area surveillance by [30].

Pairwise overlap based on a feature-based technique has been used to construct an overlap graph, on which a connected component analysis is used to find a panorama [27]. The features are matched using the SIFT key-point detector by Lowe [8]. Many of the above-mentioned techniques have been combined by [31] to determine 3D point locations along with the camera coordinates to generate an immersive 3D photo tourism environment.

Once the transformations between the images are available, the next task is to mosaic the images into one single view. This task involves selecting a compositing surface (flat, cylindrical, spherical) and view (the reference image). Also, the actual pixels contributing to the final mosaic have to be selected such that ghosting, blurring and visible seams are minimized. This is as much an artistic endeavor as a technical one. To assist users with creating mosaics, several interactive techniques have been proposed [32, 33, 34, 35].

If only a few images have to be stitched, it is natural to select a flat surface as the compositing surface with one of the images as the reference view. Such a panorama is called a flat panorama. The projection onto the final surface is still a perspective projection in this case, and straight lines remain straight. However, for a large field of view, the edge pixels start getting stretched in a flat panorama. It has been observed that beyond a 90° field of view, significant distortion occurs. Thus, a cylindrical [36] or spherical [19] surface is chosen instead to reduce pixel distortion. In practice, any surface used for environment mapping in computer graphics can be used. Cartographers have developed a number of methods more suitable for representing the surface of the earth.

The selection of the parameterization is really dependent on the application. The trade-off is between keeping the local geometry undistorted (straight lines stay straight) and providing uniform sampling (preventing pixel stretching).

Finally, we need to decide how to blend the pixels to create an attractive panorama. This is a simple task if the registration is perfect and the exposure of the different views is identical. However, for real images, artifacts such as seams (due to exposure differences), ghosting (due to moving objects) and blurring (due to misregistration) occur.

The simplest approach is to fill the pixel locations with an average value across views. However, if the motion is large, this does not work very well. Instead, a median filter was used by [37] to remove rapidly moving objects. Sometimes multiple copies of a moving object may need to be retained; center-weighting and maximum likelihood selection can be used in such cases [32]. Seamless stitching while preserving the dynamic range was performed by [38].

Rich literature exists on image stitching, which will not be discussed here because it is beyond the scope of this work. However, the reader is referred to [39] for a good overview of the subject.

An associated problem that arises in image mosaicking is that of occlusion detection. Pixels that are visible in one view may be occluded in another view. Some pixels may not be visible at all. In some applications, such as surveillance, it might be useful to identify these potentially occluded or occluding regions in order to characterize parts of the scene that may not be visible.

The approaches for detecting occlusion can be divided into two groups: appearance-based and geometry-based. Appearance-based methods compare the appearance of the pixels in two or more views to determine if occlusion is present. This may involve checking for local bimodality in the pixel values, match-goodness jumps along occlusion boundaries, a left-right appearance check, or checking for the ordering of scene points [40]. Alternate methods try to extract depth information in an attempt to reason about the 3D geometric structure of the scene [44].

An empirical comparison of five appearance-based methods has been given by [40]. They conclude that disparity matching, which is the matching of the difference of pixel values between views, gives the best results for most cases. This approach has been used by [41] in simultaneously performing stereo matching and occlusion detection.

Occlusion detection was used by [42] to create artificial unoccluded views. Only a single image was used by [43] to detect occlusion boundaries by using traditional edge and region cues, along with 3D surface and depth cues. Depth cues using pseudo-depth were used by [44] to detect occlusion boundaries. Occlusion boundaries were detected in video sequences by using a spatio-temporal edge detector [45].

1.3 Motivations and Contributions

In case a series of images ordered in time is provided, the traditional approach cumulatively increases the registration error [46]. The error accumulation is usually attributed to the fact that the estimation process is independent between different image pairs. An important constraint imposed by all image registration algorithms is the existence of a dominant scene plane, to which the estimated homography transformation corresponds. However, while processing a series of images, this requirement may cause the algorithms to select different planes when the registration is estimated independently for different image pairs.

Especially in the case of multi-head cameras, there is little or no overlap between images captured at the same time epoch. A simple pair-wise registration will fail to register images captured from different cameras at the same time epoch.

We propose a novel approach to overcome the problems stated above by simultaneously estimating the transformations between all images, which, in turn, ensures a single dominant plane across multiple images. Our approach follows a process similar to the bundle adjustment framework commonly employed in photogrammetry to estimate camera parameters across multiple images. Particularly, we first extract a set of interest points on each image using the SIFT detector. These points are then matched using a kd-tree structure across multiple images. The corresponding interest points across images of different time epochs provide the necessary geometric constraints to estimate the transformation. A dominant scene plane is assumed to exist in 3-space. We define a cost function based on the projection of points in 3-space onto all images obtained from the multi-head camera system and apply a global minimization to simultaneously estimate the transformations between all images. In our approach we assume that the point coordinates are in 3-space and the homography transformations between the scene plane and the image planes are unknown, which we estimate from measured image coordinates using the Levenberg-Marquardt minimization technique.

We also propose a method to accurately detect potential occluders within the scene. Our approach is based on calculating the cross-correlation across multiple views. Correlation has been found to be the most effective metric to characterize image matching [40] and hence is expected to give good results for our case.
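As a point of reference for the correlation measure mentioned above, a generic zero-mean normalized cross-correlation between two registered patches can be computed as in the sketch below. This is a standard textbook formulation and an assumption on my part, not the exact procedure developed in Chapter 4; the threshold symbol T_o mirrors the one used in the experiments of Chapter 5.

```python
import numpy as np

def normalized_cross_correlation(patch_a, patch_b):
    """Zero-mean normalized cross-correlation between two equally sized patches.

    Values close to 1 indicate agreement (likely ground-plane pixels after
    registration); values below a threshold such as T_o flag potential occluders.
    """
    a = patch_a.astype(float) - patch_a.mean()
    b = patch_b.astype(float) - patch_b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```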

1.4 Organization of the Thesis

This thesis is organized as follows. Chapter 2 introduces the basic concepts of projective geometry and the estimation of the planar homography using the DLT and Gold Standard algorithms. Chapter 3 provides the key principles of creating panoramic mosaics using homography estimation and bundle adjustment. Chapter 4 gives the details of the algorithm developed in this work; the mosaicking of images from multi-head cameras and appearance-based occlusion detection are explained. Chapter 5 describes the experiments performed and the results obtained. Finally, Chapter 6 provides the conclusions and future directions for this research. Additional material regarding the software developed, and tips for future maintainers of the source code, are included in Appendix A.

1.5 Notation

Throughout this work, the following notation is used. An upper case letter indicates a matrix, e.g. the homography matrix $H$. Bold-faced letters indicate a vector, e.g. the homogeneous point vector $\mathbf{x} = (x, y, w)^T$. Lower case letters indicate scalars. The superscript $T$ indicates the transpose of a matrix or vector. The correspondence between a point $\mathbf{x}_i$ on Image 1 and a point $\mathbf{x}'_i$ on Image 2 is indicated by $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$; the views are always distinguished by using a prime ($'$) on the respective quantities. The $L_2$ norm of a vector is denoted by $||\cdot||$. Estimated quantities are denoted by a hat, e.g. the estimated homography vector $\hat{\mathbf{h}}$.

Chapter 2: PROJECTIVE GEOMETRY AND HOMOGRAPHY ESTIMATION

The chief agenda of photogrammetry and computer vision is to extract geometric, spatial and content information from images recorded by cameras. The vast majority of cameras in use are projective cameras: these devices project the 3D world onto a flat 2D image using a projective transformation.

Euclidean geometry is very effective at describing the 3D world. A Euclidean transformation preserves the distance between two points and the angle between two lines. This models the real world, in which objects retain their shape when moving and parallel lines remain parallel.

However, when the 3D world is projected onto an image, Euclidean geometry becomes nearly useless. Parallel lines in a scene become convergent in a photograph, and a circle can appear to be an ellipse. A new kind of geometry is needed to study images and to facilitate an easy mathematical formulation of perspective transformations.

The disadvantages of Euclidean geometry in dealing with images are noted by [18]:

• Euclidean geometry leads to complicated formulations in the case of images.

• Lenses introduce perspective geometry while projecting the 3D world onto a 2D image.

• Euclidean geometry is not preserved in images. For example, parallel lines do not remain parallel.

• Projective transformations are easier to formulate with projective geometry.

Euclidean geometry is actually only a special case of the more general projective geometry. By applying an increasing number of constraints on projective geometry we get affine, similarity and then Euclidean geometry. This relationship is illustrated in Table 2.1.

                            Euclidean   Similarity   Affine   Projective
Transformations
  Rotation, Translation         X           X          X          X
  Uniform scaling                           X          X          X
  Shear                                                X          X
  Perspective Projection                                           X
Invariants
  Length                        X
  Angle, Ratio of Length        X           X
  Parallelism                   X           X          X
  Incidence, Cross Ratio        X           X          X          X

Table 2.1: Characteristics of various geometries and invariants therein.

Since projective geometry captures the most general case of geometric transformations, it is used to formulate the perspective transformations between images. In this chapter, we provide the fundamental definitions of projective geometry and their use in establishing relationships between image planes (known as homography).

2.1 Projective Space and Homogeneous Coordinates

Many rigorous definitions of projective space are available (e.g. see [47, p. 41] and [48, p. 29]). However, we restrict our discussion to the real projective space, which is most useful for studying images. Our definition will also lead to the idea of homogeneous coordinates. According to [49, chapter 2], the projective space is defined as:

Definition 1. The real projective space of dimension n. Given a real $(n+1)$-dimensional vector space $\mathbb{R}^{n+1}$, the projective space of dimension n, $\mathbb{P}^n$, induced by $\mathbb{R}^{n+1}$ is the set $\mathbb{R}^{n+1} \setminus \mathbf{0}_{n+1}$ under the equivalence relation $\approx$, defined such that $\forall\, \mathbf{x}, \mathbf{x}' \in \mathbb{R}^{n+1} \setminus \mathbf{0}_{n+1}$

$$\mathbf{x} \approx \mathbf{x}' \iff \exists\, \lambda \neq 0,\ \mathbf{x} = \lambda \mathbf{x}', \qquad (2.1)$$

where $\lambda$ is a scale and $\mathbf{0}_{n+1}$ is the $(n+1)$-dimensional null vector.

A projective point is a vector $\mathbf{x} \in \mathbb{P}^n$, and all vectors $\mathbf{x}' \in \mathbb{R}^{n+1}$ that are equivalent to $\mathbf{x}$ under the equivalence relation $\approx$ are used to represent the projective point $\mathbf{x} \in \mathbb{P}^n$. This is the so-called homogeneous coordinate representation. Thus, the proportional vectors $\mathbf{x}$ and $\mathbf{x}'$ ($\mathbf{x} = \lambda\mathbf{x}'$) represent the same point in projective space, which means the magnitudes of the vectors do not matter, only their directions do, and regardless of the value of $\lambda$ all projective properties of the point are preserved. Furthermore, the null vector has no meaning and is not defined in projective space according to Definition 1. This definition is illustrated by developing the homogeneous representations of points and lines [16, p. 26].

Homogeneous Representation of Lines

A line in the plane is represented by the equation $ax + by + c = 0$. Different choices of a, b and c result in different lines; thus, the line can be represented by the vector $(a, b, c)^T$. This representation is not unique, as a vector $k(a, b, c)^T$ also satisfies the equation of the line for any $k \neq 0$. The set of all vectors $k(a, b, c)^T$ forms an equivalence class and is known as a homogeneous vector. The set of equivalence classes of vectors in $\mathbb{R}^3 - (0, 0, 0)^T$ forms the projective space $\mathbb{P}^2$ in accordance with the above definition. The vector $(0, 0, 0)^T$ is not included in $\mathbb{P}^2$ because it does not correspond to any line.

Homogeneous Representation of Points

The point $\mathbf{x} = (x, y)^T$ lies on the line $\mathbf{l} = (a, b, c)^T$ iff $ax + by + c = 0$. This can be expressed as the inner product $(x, y, 1)(a, b, c)^T = (x, y, 1)\,\mathbf{l} = 0$. This means that the vector $(x, y)^T$ in $\mathbb{R}^2$ can be represented by $(x, y, 1)^T$. Once again, this representation is not unique, as any vector $k(x, y, 1)^T$ also satisfies the above equation. Thus the homogeneous vector $(kx, ky, k)^T$ is considered a representation of the point $(x, y)^T$. An arbitrary homogeneous vector $(x_1, x_2, x_3)^T$ represents the point $(x_1/x_3, x_2/x_3)^T$.

Model for the Projective Plane

We can think of $\mathbb{P}^2$ as a set of rays in $\mathbb{R}^3$. The set of all vectors $k(x_1, x_2, x_3)^T$ is a ray in $\mathbb{R}^3$ passing through the origin, and a line in $\mathbb{P}^2$ is a plane in $\mathbb{R}^3$. This is illustrated in Figure 2.1.

2.2 Projective Transformations and Planar Homography

A projective transformation is defined as follows [16, p. 33].

Figure 2.1: Points become lines and lines become planes in projective space.

Definition 2. Projective Transformation. A planar projective transformation is a linear transformation on homogeneous 3-vectors represented by a non-singular 3 × 3 matrix:

$$\begin{pmatrix} x_1' \\ x_2' \\ x_3' \end{pmatrix} = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \qquad (2.2)$$

or, more succinctly, $\mathbf{x}' = H\mathbf{x}$.

Once again, the matrix H can be multiplied by a non-zero scalar constant without changing the projective transformation (the corresponding point vectors can be divided by this constant since they are homogeneous). Thus, the matrix H is a homogeneous transformation. It contains eight independent ratios and has eight degrees of freedom. A projective transformation projects a set of points to a projectively equivalent set of points, leaving all the projective properties invariant.
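As a small illustration of Equation 2.2, the following Python sketch lifts inhomogeneous 2D points to homogeneous coordinates, applies a homography, and divides by the third coordinate. The function name and array layout are illustrative assumptions, not part of the thesis software.

```python
import numpy as np

def apply_homography(H, pts):
    """Apply a 3x3 homography H to an (n, 2) array of inhomogeneous 2D points."""
    pts_h = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (x, y) -> (x, y, 1)
    mapped = pts_h @ H.T                                   # x' = H x, row-wise
    return mapped[:, :2] / mapped[:, 2:3]                  # (x1'/x3', x2'/x3')
```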

Perspective (Central) Projection

The simplest form of a projective transformation is the central projection, which is a projection along rays that pass through a common point. This is illustrated in Figure 2.2. This is essentially the process that a camera performs under a simplified pin-hole camera model. In fact, if the coordinates associated with both planes are Euclidean, then the projection reduces to a simpler perspectivity that has only six degrees of freedom.

Planar Homography

Another type of projective transformation is shown in Figure 2.3. This kind of projective transformation is induced between two images by a world plane and is called a 2D homography or planar homography. It represents the situation in which a scene plane is photographed in two different images. In this thesis we are most concerned with this type of transformation, as we use it to find the relationship between views in order to mosaic them.

2.3 Estimation of Planar Homography

The problem of estimating the planar homography is thus: given a set of points $\mathbf{X}_i$ in scene space and their corresponding projections in two views $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$, estimate the linear transformation $H$ such that $\mathbf{x}'_i = H\mathbf{x}_i\ \forall i$.

Figure 2.2: A central projection projects all points from one plane to another along rays passing through a perspective center. This is a simpler case of projectivity called perspectivity.

2.3.1 Number of Measurements Required

As discussed above, the 3 × 3 homography matrix has only eight independent ratios. Each point-to-point correspondence $\mathbf{x}'_i \leftrightarrow H\mathbf{x}_i$ accounts for two constraints: a 2D point has two degrees of freedom, in the x and the y directions, respectively. Alternately, each point is a homogeneous vector with two degrees of freedom and hence contributes two constraints. This can be expressed as:

$$\frac{x_1'}{x_3'} = \frac{h_{11}x_1 + h_{12}x_2 + h_{13}x_3}{h_{31}x_1 + h_{32}x_2 + h_{33}x_3}, \qquad (2.3)$$

$$\frac{x_2'}{x_3'} = \frac{h_{21}x_1 + h_{22}x_2 + h_{23}x_3}{h_{31}x_1 + h_{32}x_2 + h_{33}x_3}. \qquad (2.4)$$

Therefore, to solve for the eight unknowns, a minimum of four point correspondences is needed.

Figure 2.3: Examples of a 2D homography, $\mathbf{x}' = H\mathbf{x}$. A planar homography is induced between two images by a world plane.

2.3.2 Direct Linear Transform (DLT)

When four point correspondences $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ are given, it is a simple matter to find a linear solution for H. We use the relation $\mathbf{x}'_i = H\mathbf{x}_i$. Note that since these are homogeneous vectors, they are equal only up to a scale; however, they have the same direction, implying that the cross product is zero, $\mathbf{x}'_i \times H\mathbf{x}_i = \mathbf{0}$. Denoting the jth row of H by $\mathbf{h}^{jT}$, we get

$$H\mathbf{x}_i = \begin{pmatrix} \mathbf{h}^{1T}\mathbf{x}_i \\ \mathbf{h}^{2T}\mathbf{x}_i \\ \mathbf{h}^{3T}\mathbf{x}_i \end{pmatrix}. \qquad (2.5)$$

Again, writing $\mathbf{x}'_i = (x'_i, y'_i, w'_i)^T$, we get

$$\mathbf{x}'_i \times H\mathbf{x}_i = \begin{pmatrix} y'_i\,\mathbf{h}^{3T}\mathbf{x}_i - w'_i\,\mathbf{h}^{2T}\mathbf{x}_i \\ w'_i\,\mathbf{h}^{1T}\mathbf{x}_i - x'_i\,\mathbf{h}^{3T}\mathbf{x}_i \\ x'_i\,\mathbf{h}^{2T}\mathbf{x}_i - y'_i\,\mathbf{h}^{1T}\mathbf{x}_i \end{pmatrix}. \qquad (2.6)$$

Substituting $\mathbf{h}^{jT}\mathbf{x}_i = \mathbf{x}_i^T\mathbf{h}^j$ for $j = 1, \dots, 3$ results in

$$\begin{pmatrix} \mathbf{0}^T & -w'_i\,\mathbf{x}_i^T & y'_i\,\mathbf{x}_i^T \\ w'_i\,\mathbf{x}_i^T & \mathbf{0}^T & -x'_i\,\mathbf{x}_i^T \\ -y'_i\,\mathbf{x}_i^T & x'_i\,\mathbf{x}_i^T & \mathbf{0}^T \end{pmatrix} \begin{pmatrix} \mathbf{h}^1 \\ \mathbf{h}^2 \\ \mathbf{h}^3 \end{pmatrix} = \mathbf{0}. \qquad (2.7)$$

It can be seen that this is a linear equation of the form $A_i\mathbf{h} = \mathbf{0}$ in the unknown $\mathbf{h}$. Since this is an equation in homogeneous coordinates, the third row of the above equation can be obtained, up to a scale, by a linear combination of the first two rows. Thus, we have only two independent equations, which can be written out as

$$\begin{pmatrix} \mathbf{0}^T & -w'_i\,\mathbf{x}_i^T & y'_i\,\mathbf{x}_i^T \\ w'_i\,\mathbf{x}_i^T & \mathbf{0}^T & -x'_i\,\mathbf{x}_i^T \end{pmatrix} \begin{pmatrix} \mathbf{h}^1 \\ \mathbf{h}^2 \\ \mathbf{h}^3 \end{pmatrix} = \mathbf{0}, \qquad (2.8)$$

or, in short,

$$A_i\mathbf{h} = \mathbf{0}. \qquad (2.9)$$

Here, $A_i$ is a 2 × 9 matrix and $\mathbf{h}$ is a 9 × 1 unknown vector. Each point correspondence gives us two independent equations in the unknown $\mathbf{h}$. Stacking up the equations for four point correspondences gives us $A$, an 8 × 9 matrix with rank 8. Thus, $A$ will have a 1-dimensional null space giving a solution for $\mathbf{h}$. Obviously, the trivial solution $\mathbf{h} = \mathbf{0}$ is not interesting. The non-zero solution for $\mathbf{h}$ is determined only up to a scale; this is fine, as $H$ is only defined up to a scale. An arbitrary scale can be imposed on $\mathbf{h}$ by using conditions such as $||\mathbf{h}|| = 1$ or $h_9 = 1$.

Over-Determined Solution

If more than four point correspondences $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ are given, the set of equations $A\mathbf{h} = \mathbf{0}$ is over-determined. If the point locations are exact, then the rank of $A$ is 8 and it still has an exact solution. However, in real life, point locations will always have some noise associated with them. Thus, instead of solving for an exact solution, an approximate solution can be determined by finding an $\mathbf{h}$ that minimizes a suitable cost function. Generally, the condition $||\mathbf{h}|| = 1$ is imposed. Since $A\mathbf{h} = \mathbf{0}$, it also seems natural to minimize $||A\mathbf{h}||$. This is the same as finding the minimum of $||A\mathbf{h}||/||\mathbf{h}||$. The problem can be equivalently stated as finding the (unit) eigenvector of $A^TA$ with the least eigenvalue. This formulation results in the simple Direct Linear Transform (DLT) algorithm for solving for $\mathbf{h}$, listed in Algorithm 1.

Algorithm 1 The Direct Linear Transform (DLT) algorithm.

  for all $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ do
    Compute $A_i$
  end for
  Combine the n 2 × 9 matrices $A_i$ into a 2n × 9 matrix $A$.
  $[U, D, V^T]$ = SVD($A$).
  $\mathbf{h}$ = last column of $V$.
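A minimal numpy sketch of Algorithm 1 is given below, assuming inhomogeneous input points (w = w′ = 1); the function name and the unit-norm convention returned by the SVD are illustrative choices.

```python
import numpy as np

def dlt_homography(x, xp):
    """Basic DLT (Algorithm 1): estimate H such that x' ~ H x.

    x, xp : (n, 2) arrays of corresponding image points, n >= 4.
    Each correspondence contributes the two rows of Equation 2.8.
    """
    n = x.shape[0]
    A = np.zeros((2 * n, 9))
    for i in range(n):
        X = np.array([x[i, 0], x[i, 1], 1.0])        # homogeneous x_i
        xp_i, yp_i, wp_i = xp[i, 0], xp[i, 1], 1.0   # homogeneous x'_i
        A[2 * i]     = np.concatenate([np.zeros(3), -wp_i * X,  yp_i * X])
        A[2 * i + 1] = np.concatenate([ wp_i * X, np.zeros(3), -xp_i * X])
    # h is the right singular vector of A with the smallest singular value,
    # i.e. the last column of V (last row of V^T); ||h|| = 1 by construction.
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)
```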

2.4 Cost Functions

As described above, we can minimize different cost functions while solving the over-determined system $A\mathbf{h} = \mathbf{0}$. Some commonly used cost functions are described in this section.

2.4.1 Algebraic Distance

The DLT algorithm minimizes the norm $||A\mathbf{h}||$. The vector $\boldsymbol{\epsilon} = A\mathbf{h}$ is called the residual vector. Each point correspondence $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ contributes a partial residual $\boldsymbol{\epsilon}_i$ to $\boldsymbol{\epsilon}$; $\boldsymbol{\epsilon}_i$ is called the algebraic error vector associated with $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ and $H$. The norm $||\boldsymbol{\epsilon}_i||$ is a scalar called the algebraic distance

$$d_{alg}(\mathbf{x}'_i, H\mathbf{x}_i)^2 = ||\boldsymbol{\epsilon}_i||^2 = \left|\left| \begin{pmatrix} \mathbf{0}^T & -w'_i\,\mathbf{x}_i^T & y'_i\,\mathbf{x}_i^T \\ w'_i\,\mathbf{x}_i^T & \mathbf{0}^T & -x'_i\,\mathbf{x}_i^T \end{pmatrix} \mathbf{h} \right|\right|^2. \qquad (2.10)$$

For the complete vector $\boldsymbol{\epsilon} = A\mathbf{h}$, the total algebraic distance error is given by

$$||\boldsymbol{\epsilon}||^2 = \sum_i d_{alg}(\mathbf{x}'_i, H\mathbf{x}_i)^2 = \sum_i ||\boldsymbol{\epsilon}_i||^2. \qquad (2.11)$$

The advantage of using the algebraic distance is that it is linear (and thus unique) and computationally cheap to calculate. However, it does not make any geometric or statistical sense and may not correspond to the solution that is intuitively expected [50]. With proper normalization, however, the algebraic distance can give a good enough solution. As a result, it is often used as a first approximation for seeding a non-linear algorithm that minimizes a geometric or statistical cost function.

2.4.2 Symmetric Transfer Error

Another cost function can be defined based on the Euclidean distance measured on the image. This definition is based on minimizing the difference between the measured coordinates of a point and the estimated values. The following notation is used for this discussion:

• $\mathbf{x}$: the measured coordinates of an image point

• $\hat{\mathbf{x}}$: the estimated coordinates of an image point

• $\tilde{\mathbf{x}}$: the true values of an image point

Let $d(\mathbf{x}, \mathbf{y})$ be the Euclidean distance between the inhomogeneous coordinates of two image points. Then the transfer error between an image point $\mathbf{x}'_i$ and its estimate $\hat{\mathbf{x}}'_i = \hat{H}\mathbf{x}_i$ is defined as

$$\sum_i d(\mathbf{x}'_i, \hat{H}\mathbf{x}_i)^2. \qquad (2.12)$$

This cost function can be minimized for the estimated $\hat{H}$. Since the point is visible in two images while estimating $\hat{H}$, it is logical to sum the error on both images. This is called the symmetric transfer error. If the forward transformation is $H$ and the inverse $H^{-1}$, then the symmetric transfer error can be written as

$$\sum_i d(\mathbf{x}'_i, \hat{H}\mathbf{x}_i)^2 + d(\mathbf{x}_i, \hat{H}^{-1}\mathbf{x}'_i)^2. \qquad (2.13)$$

Once again, this cost function is minimized for the estimated $\hat{H}$.
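The symmetric transfer error of Equation 2.13 can be evaluated as in the following sketch, which reuses the apply_homography helper from the earlier illustration; this is an illustrative formulation, not code from the thesis software.

```python
import numpy as np

def symmetric_transfer_error(H, x, xp):
    """Symmetric transfer error (Equation 2.13) for measured correspondences.

    x, xp : (n, 2) arrays of measured points x_i <-> x'_i.
    """
    H_inv = np.linalg.inv(H)
    forward  = np.sum((xp - apply_homography(H, x)) ** 2, axis=1)       # d(x', Hx)^2
    backward = np.sum((x  - apply_homography(H_inv, xp)) ** 2, axis=1)  # d(x, H^-1 x')^2
    return float(np.sum(forward + backward))
```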

2.4.3 Reprojection Error

The idea behind the reprojection error is to estimate a correction to each point correspondence such that an exactly matching estimated correspondence is obtained. That is, we seek an estimated $\hat{H}$ and a perfectly matched estimated point correspondence $\hat{\mathbf{x}}_i \leftrightarrow \hat{\mathbf{x}}'_i$ that minimize the cost function

$$\sum_i d(\mathbf{x}_i, \hat{\mathbf{x}}_i)^2 + d(\mathbf{x}'_i, \hat{\mathbf{x}}'_i)^2 \qquad (2.14)$$

subject to the condition $\hat{\mathbf{x}}'_i = \hat{H}\hat{\mathbf{x}}_i\ \forall i$. Thus, with the reprojection error we estimate both $\hat{H}$ and the point correspondences. The difference between the symmetric transfer error and the reprojection error is illustrated in Figure 2.4.


Figure 2.4: An illustration of (a) symmetric transfer error and (b) reprojection error. The points $\mathbf{x}$ and $\mathbf{x}'$ are measured points and hence do not correspond perfectly because of noise. The estimated points $\hat{\mathbf{x}}$ and $\hat{\mathbf{x}}'$ do match perfectly for the estimated homography $\hat{H}$.

2.5 Transformation Invariance and Normalization

The results of the DLT algorithm described earlier depend on the particular choice of coordinate system for the images. It is clearly not desirable to have the algorithm depend on arbitrary choices of scale or origin. However, scale and translation invariance can be obtained by normalizing the image coordinates before performing the DLT. This approach is described below.

2.5.1 Image Coordinate Transforms

The invariance of the DLT to translation and scaling is now explored. Consider the point correspondences $\mathbf{x}_i \leftrightarrow \mathbf{x}'_i$ related by the equation $\mathbf{x}'_i = H\mathbf{x}_i$. Let these points be transformed by the linear transformations

$$\tilde{\mathbf{x}} = T\mathbf{x}, \qquad \tilde{\mathbf{x}}' = T'\mathbf{x}'. \qquad (2.15)$$

Substituting in $\mathbf{x}'_i = H\mathbf{x}_i$, we get $\tilde{\mathbf{x}}' = T'HT^{-1}\tilde{\mathbf{x}}$, which implies that $\tilde{H} = T'HT^{-1}$. Thus, we have an alternate method of determining $H$ for a set of point correspondences:

1. Transform the coordinates using Equation 2.15.

2. Find the relation $\tilde{H}$ from the transformed coordinates.

3. Set $H = T'^{-1}\tilde{H}T$.

Depending on whether the algorithm is invariant or not, this alternate method will give the same or a different value of H.

2.5.2 Non-Invariance of DLT

In this section it will be shown that the DLT is not invariant to the kind of transformation described above. This discussion follows [16].

Let $T'$ be a similarity transformation with scale factor $s$, and let $T$ be an arbitrary projective transformation. Further, suppose $H$ is any 2D homography and let $\tilde{H}$ be defined by $\tilde{H} = T'HT^{-1}$. Then $||\tilde{A}\tilde{\mathbf{h}}|| = s\,||A\mathbf{h}||$, where $\mathbf{h}$ and $\tilde{\mathbf{h}}$ are the vectors of entries of $H$ and $\tilde{H}$.

The proof of the above and further details can be found in [16]. It is found that the solutions for $H$ and $\tilde{H}$ are not related in any simple manner, because the condition $||H|| = 1$ imposed in the DLT algorithm is not equivalent to the condition $||\tilde{H}|| = 1$.

2.5.3 Invariance of Geometric Error

Geometric error, on the other hand, is invariant to similarity transformations. Using the notation above, suppose that $T$ and $T'$ are similarity transformations in $\mathbb{P}^2$. It can be seen that

$$d(\tilde{\mathbf{x}}', \tilde{H}\tilde{\mathbf{x}}) = d(T'\mathbf{x}', T'HT^{-1}T\mathbf{x}) = d(T'\mathbf{x}', T'H\mathbf{x}) = d(\mathbf{x}', H\mathbf{x}), \qquad (2.16)$$

since the Euclidean distance between two points remains the same under a similarity transformation. Thus, the two minimizations are equivalent in this case.

2.5.4 Normalization

The image coordinates of the point correspondences used for the DLT can be normalized in a way such that the DLT does become invariant to similarity transformations. This can be done by performing an isotropic scaling as follows:

1. Translate the points such that their centroid is at the origin.

2. Scale the points such that the average distance from the origin is $\sqrt{2}$.

3. Apply this transformation to each image independently.

It has been shown that normalizing points before the DLT not only results in an invariant $H$, but also improves accuracy. It also improves stability for other DLT computations, such as the fundamental matrix or the trifocal tensor. A modified DLT algorithm with normalization is listed in Algorithm 2.

Algorithm 2 Direct Linear Transform (DLT) Algorithm with Normalization

1. Normalization of $\mathbf{x}_i$: Compute the similarity transformation $T$ such that the centroid of the points $\mathbf{x}_i$ lies at the origin and their mean distance from the origin is $\sqrt{2}$.

2. Normalization of $\mathbf{x}'_i$: Perform Step 1 for the points $\mathbf{x}'_i$, obtaining $T'$.

3. DLT: Perform the DLT (Algorithm 1) on the point correspondences $\tilde{\mathbf{x}}_i \leftrightarrow \tilde{\mathbf{x}}'_i$, obtaining the homography $\tilde{H}$.

4. Denormalization: Set $H = T'^{-1}\tilde{H}T$.
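A sketch of the normalizing similarity transformation and of the full normalized DLT is shown below; it assumes the dlt_homography function from the earlier sketch and inhomogeneous (n, 2) point arrays.

```python
import numpy as np

def normalization_transform(pts):
    """Similarity T that moves the centroid of pts to the origin and scales the
    points so their mean distance from the origin is sqrt(2) (Algorithm 2, steps 1-2)."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])

def normalized_dlt(x, xp):
    """Normalized DLT (Algorithm 2): normalize, run the basic DLT, denormalize."""
    T, Tp = normalization_transform(x), normalization_transform(xp)
    xn  = x  @ T[:2, :2].T  + T[:2, 2]       # apply the similarity to each point
    xpn = xp @ Tp[:2, :2].T + Tp[:2, 2]
    H_tilde = dlt_homography(xn, xpn)
    return np.linalg.inv(Tp) @ H_tilde @ T   # H = T'^(-1) H~ T  (step 4)
```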

2.6 Iterative Minimization

For cost functions using geometric error, like the symmetric transfer error or the reprojection error, iterative minimization techniques have to be used to obtain a solution. However, iterative minimization techniques have a number of disadvantages. These include:

• They are slower than linear methods.

• They need an initial estimate to begin with. Often the final solution depends heavily on the choice of the initial estimate.

• They are not guaranteed to converge, or might converge to a local minimum.

• The selection of the stopping criterion for the iteration can be tricky. It is difficult to determine when a good solution has been reached.

Despite these shortcomings, iterative techniques are widely used to estimate parametric transformations between images.

An appropriate cost function has to be chosen to begin an iterative minimization. Many different cost functions may be available, and the appropriate one is chosen based on the characteristics of the desired solution. The problem is then parameterized. Once again, several different choices of parameterization may be available. For example, the problem of finding H may be parameterized in the nine unknowns of H. Alternately, only eight unknowns can be used, as we know that H has only eight degrees of freedom. In general, sophisticated non-linear methods like the Levenberg-Marquardt optimization are smart enough to deal with overparameterization.

Next, an appropriate functional relation must be established expressing the cost function in terms of the parameter vector. Once this is done, the iteration can begin by specifying a suitable initial value for the parameters. Initial values are often obtained by using an approximate linear solution to the problem.

2.6.1 Gold Standard Algorithm

The Gold Standard Algorithm is a non-linear algorithm for estimating the homography using the reprojection error. It works well for a small number of point correspondences but may become intractable if the number of point correspondences is large.

The Gold Standard Algorithm is parameterized using the estimated point locations $\hat{\mathbf{x}}_i$ and the estimated homography $\hat{H}$. If the number of point correspondences is $n$, then the total number of unknowns is $2n + 9$. The other set of points $\hat{\mathbf{x}}'_i$ is ignored because these are obtained by the relation $\hat{\mathbf{x}}'_i = \hat{H}\hat{\mathbf{x}}_i$. Thus, we get the parameter vector

$$\mathbf{R} = (\hat{\mathbf{h}}, \hat{\mathbf{x}}_1, \dots, \hat{\mathbf{x}}_n) \qquad (2.17)$$

and the functional relation

$$f: (\hat{\mathbf{h}}, \hat{\mathbf{x}}_1, \dots, \hat{\mathbf{x}}_n) \rightarrow (\hat{\mathbf{x}}_1, \hat{\mathbf{x}}'_1, \dots, \hat{\mathbf{x}}_n, \hat{\mathbf{x}}'_n), \qquad (2.18)$$

where $||\mathbf{X} - f(\mathbf{R})||^2$ is the cost function and $\mathbf{X}$ is the 4n-vector of measured image coordinates.

The above formulation can now be solved using the standard Levenberg-Marquardt algorithm. The Gold Standard Algorithm using this formulation is listed in Algorithm 3.

Algorithm 3 Gold Standard Algorithm

1. Initialization: Initialize $\hat{H}$. This can be done using the DLT algorithm described in Algorithm 2.

2. Calculate initial estimates of $\hat{\mathbf{x}}_i$ using the measured points $\mathbf{x}_i$.

3. Minimize the cost
$$\sum_i d(\mathbf{x}_i, \hat{\mathbf{x}}_i)^2 + d(\mathbf{x}'_i, \hat{\mathbf{x}}'_i)^2$$
using the Levenberg-Marquardt optimization.
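The following sketch shows one way to carry out the minimization in Algorithm 3 with SciPy's Levenberg-Marquardt solver; the parameterization follows Equation 2.17 (the nine entries of H plus the n estimated points), but the function name and the use of scipy.optimize.least_squares are my own illustrative choices, not the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def gold_standard_refine(x, xp, H0):
    """Refine a homography by minimizing the reprojection error (Algorithm 3).

    x, xp : (n, 2) measured correspondences; H0 : initial estimate, e.g. from
    the normalized DLT. Requires n >= 5 so the 4n residuals outnumber the
    2n + 9 parameters (a condition of the 'lm' method).
    """
    n = x.shape[0]

    def residuals(p):
        H = p[:9].reshape(3, 3)
        x_hat = p[9:].reshape(n, 2)
        xh = np.hstack([x_hat, np.ones((n, 1))]) @ H.T   # x_hat' = H x_hat
        xp_hat = xh[:, :2] / xh[:, 2:3]
        # 4n residuals corresponding to d(x, x_hat) and d(x', x_hat') in Eq. 2.14
        return np.concatenate([(x - x_hat).ravel(), (xp - xp_hat).ravel()])

    p0 = np.concatenate([H0.ravel(), x.ravel()])  # seed x_hat with the measurements
    sol = least_squares(residuals, p0, method='lm')
    H = sol.x[:9].reshape(3, 3)
    return H / H[2, 2]
```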

Chapter 3: IMAGE MOSAICKING

There are two broad approaches to image mosaicking: direct methods and feature-based methods. Direct methods shift (or warp) the images relative to each other and examine how much the pixels agree based on a suitable error metric. Error metrics such as the sum of squared differences (SSD), the sum of absolute differences (SAD) or normalized cross-correlation can be used. A suitable search technique must be devised to check all possible warping configurations. The simplest approach is a brute-force search; however, this can be too slow, so hierarchical coarse-to-fine techniques based on image pyramids are employed to make the search tractable [51]. Even though the direct methods use information from all the pixels present in the image, large parallax can cause significant errors in matching. It is also observed that direct methods have a limited range of convergence and that they fail too often to be useful [39].

Feature-based methods first detect interest points using a suitable interest point detector, such as the Harris corner detector, SIFT or SURF, and then evaluate the transformation between the images using view geometry constraints on these points. Many researchers have used the homography transformation to register two different views in the absence of camera parameters [27, 15]. Given a set of putative point correspondences across the two views, these methods estimate the affine or homography transform between the two views using estimation techniques such as Random Sample Consensus (RANSAC) [13]. The resulting transformation is then used to warp one image onto the other.

In case a series of images ordered in time is provided, the traditional approach of sequentially registering images cumulatively increases the registration error as the number of images increases [46]. The error accumulation is usually attributed to the fact that the estimation process is independent between different image pairs. An important constraint imposed by all image registration algorithms is the existence of a dominant scene plane, to which the estimated homography transformation corresponds. However, while processing a series of images, this requirement may cause the algorithms to select different planes when the registration is estimated independently for different image pairs. Especially in the case of multi-head cameras, there is little or no overlap between images captured at the same time epoch. A simple pair-wise registration will fail to register images captured from different cameras at the same time epoch.

3.1 Key-point Detection

The process of feature-based registration of two images begins with detecting corresponding points on these images. This is accomplished by detecting key-points (or interest points) on each image and matching them to find correspondences. At the least the viewpoint, and possibly also the illumination, exposure and other image properties, change between the views. Thus, the ideal key-point detector needs to be invariant to scale, rotation, projective transformation, illumination and exposure variations. Several key-point detectors have been developed with these goals in mind. An overview of two popular ones, the Harris and Stephens corner detector and the Scale Invariant Feature Transform (SIFT), is provided here.

3.1.1 Harris and Stephens Corner Detector

Harris and Stephens [2] define a 'corner' by a large variation in the SSD between an image patch and all possible shifts of it within its neighborhood. Without loss of generality, a two-dimensional gray scale image I can be assumed. Consider an image patch around the point (u, v) shifted by an amount (x, y). The weighted sum of squared differences (SSD) between these two patches is given by:

$$S(x, y) = \sum_u \sum_v w(u, v)\,\big(I(u + x, v + y) - I(u, v)\big)^2. \qquad (3.1)$$

Taking the Taylor expansion of $I(u + x, v + y)$ and simplifying, we get the approximation:

$$S(x, y) \approx \sum_u \sum_v w(u, v)\,\big(I_x(u, v)\,x + I_y(u, v)\,y\big)^2, \qquad (3.2)$$

where $I_x$ and $I_y$ are the image derivatives in the x and y directions, respectively. This can be written in matrix form:

$$S(x, y) \approx \begin{pmatrix} x & y \end{pmatrix} A \begin{pmatrix} x \\ y \end{pmatrix}, \qquad (3.3)$$

where A is the structure tensor,

$$A = \sum_u \sum_v w(u, v) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix} = \begin{pmatrix} \langle I_x^2 \rangle & \langle I_x I_y \rangle \\ \langle I_x I_y \rangle & \langle I_y^2 \rangle \end{pmatrix}. \qquad (3.4)$$

This matrix is also called the Harris matrix. As noted above, a corner is characterized by a large variation in S in all directions; thus, for an interest point, both eigenvalues of A should be large. In practice, actually calculating the eigenvalues is computationally expensive. Instead, the following function $M_c$ is evaluated, where $\kappa$ is a tunable sensitivity parameter:

$$M_c = \lambda_0 \lambda_1 - \kappa\,(\lambda_0 + \lambda_1)^2 = \det(A) - \kappa\,\mathrm{trace}^2(A). \qquad (3.5)$$

The value of $\kappa$ has to be determined empirically; in the literature, values in the range 0.04 - 0.15 have been reported as feasible.
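A compact Python sketch of the detector follows, using Gaussian smoothing of the derivative products as the window w(u, v); the sigma and kappa defaults are illustrative, and the final non-maximum suppression step is only indicated in the comment.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_corner_response(image, sigma=1.0, kappa=0.06):
    """Harris corner response M_c (Equation 3.5) for a grayscale float image."""
    Iy, Ix = np.gradient(image)                # image derivatives I_y, I_x
    # entries of the structure tensor A, accumulated under a Gaussian window
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det_A = Ixx * Iyy - Ixy ** 2
    trace_A = Ixx + Iyy
    return det_A - kappa * trace_A ** 2

# Key-points are then taken as local maxima of the response above a threshold.
```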

3.1.2 Scale Invariant Feature Transform (SIFT)

Scale Invariant Feature Transform (SIFT) is an algorithm to detect and describe local features in an image [52]. SIFT transforms a given gray scale image to a large set of feature vectors. The features are invariant to translation, rotation and scaling.

Additionally, they are partially invariant to illumination changes and robust to local affine geometric distortions.

SIFT works by extrema detection. The image is convolved with Gaussians at different scales and the difference of Gaussians (DoG) between adjacent scales is constructed. For an image I(x, y), the convolution is denoted by:

$$L(x, y, k\sigma) = G(x, y, k\sigma) * I(x, y), \qquad (3.6)$$

where, $G(x, y, k\sigma)$ denotes the Gaussian kernel with scale $k\sigma$. Within an octave (a doubling of the scale), the DoG is computed between adjacent scales, as defined by:

$$D(x, y, \sigma) = L(x, y, k_i\sigma) - L(x, y, k_j\sigma). \qquad (3.7)$$

Features or key-points are characterized by maxima or minima in the scale space so constructed. A point at (x, y) is compared to its eight neighbors at the same scale and nine neighbors in each of the two adjacent scales to find a maximum or a minimum.

Scale-space extrema detection produces too many keypoint candidates, some of which are unstable. The next step in the algorithm is to perform a detailed fit to the nearby data for accurate location, scale, and ratio of principal curvatures. This information allows points to be rejected that have low contrast (and are therefore sensitive to noise) or are poorly localized along an edge. After these refinements, the SIFT algorithm produces a 128-dimensional feature vector that characterizes each keypoint.
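As a rough illustration of Eqs. (3.6)–(3.7) and the 26-neighbor extremum test, the sketch below builds a DoG stack and scans it for scale-space extrema. It is purely didactic: the function names, the contrast threshold, and the brute-force triple loop are assumptions, and the later SIFT stages (subpixel refinement, edge rejection, orientation assignment and the 128-dimensional descriptor) are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_pyramid(image, sigma=1.6, k=2 ** 0.5, num_scales=5):
    """Difference-of-Gaussians stack, Eqs. (3.6)-(3.7):
    L_i = G(k^i * sigma) * I and D_i = L_{i+1} - L_i."""
    blurred = [gaussian_filter(image, sigma * (k ** i)) for i in range(num_scales)]
    return np.stack([blurred[i + 1] - blurred[i] for i in range(num_scales - 1)])

def scale_space_extrema(dog, threshold=0.03):
    """Return (scale, row, col) triples that are maxima or minima of their
    26-neighborhood (8 neighbors in the same DoG slice, 9 in each adjacent slice)."""
    keypoints = []
    ns, rows, cols = dog.shape
    for s in range(1, ns - 1):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                patch = dog[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
                v = dog[s, r, c]
                if abs(v) > threshold and (v == patch.max() or v == patch.min()):
                    keypoints.append((s, r, c))
    return keypoints
```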

3.2 Key-point Matching

The next step in the process is to match a set of key-points to their corresponding points on another image. Once key-point matches (called tie-points in the photogrammetry community) are known, the point locations can be used in a bundle adjustment framework to solve for point and camera geometry.

Key-point detectors such as SIFT are ideally suited for point matching because they provide an information-rich feature vector for each key-point. Point matching then becomes the simple matter of finding the Euclidean nearest neighbors in an n-dimensional space (n = 128 for SIFT). The simplest way is to perform a brute-force search over all points. However, this can be prohibitively expensive in most practical situations, where the number of points can range from a few hundred to a few thousand. For low-dimensional spaces, hashing techniques have often been used to match feature vectors. However, these suffer from the curse of dimensionality and become intractable in high-dimensional spaces. An alternative approach is to construct a kd-tree structure to efficiently find the nearest neighbor in n-dimensional space [53].

To create a kd-tree for an n-dimensional space, we begin with N points in $\mathbb{R}^n$. The space is split along the dimension i in which the data show the maximum variance. A cut is made at the median value m in that dimension, which ensures that there are an equal number of points in each half. A node is created in the kd-tree to store i and m. The procedure iterates over each half of the data to create a complete binary tree with depth $d = \log_2 N$. This tree can be searched in $O(\log N)$ time.

Even the nearest-neighbor search with a kd-tree may be too slow as the dimensionality increases. The search may be further sped up if we are willing to accept an approximate solution. One such approach is the approximate nearest neighbor (ANN) search using the Best Bin First (BBF) strategy [12]. The BBF algorithm searches the tree according to a priority queue instead of a simple queue, where the priority is based on the proximity of the particular dimension being checked. It also prunes the search space by stopping when a certain number of nodes have been checked and returning the closest neighbor found at that point. Thus, BBF returns the true nearest neighbor in a large number of cases and a very close neighbor for the rest. In this way, BBF manages to speed up the search by two orders of magnitude.
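A hedged sketch of descriptor matching with a kd-tree is given below, using SciPy's cKDTree; its eps parameter plays the role of the approximate search discussed above (the returned neighbor is guaranteed to be within a factor (1 + eps) of the true nearest distance), although it is not the BBF algorithm itself. The distance-ratio check is an additional filter commonly used with SIFT descriptors and is not part of the discussion above; all names and thresholds are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def match_descriptors(desc_a, desc_b, eps=0.5, max_ratio=0.8):
    """Match two (N, 128) arrays of SIFT descriptors by (approximate)
    nearest-neighbor search in a kd-tree built over desc_b."""
    tree = cKDTree(desc_b)
    # Query the two closest neighbors of each descriptor in desc_a
    dist, idx = tree.query(desc_a, k=2, eps=eps)
    matches = []
    for i, (d, j) in enumerate(zip(dist, idx)):
        # Keep only distinctive matches: nearest clearly better than second nearest
        if d[0] < max_ratio * d[1]:
            matches.append((i, j[0]))
    return matches
```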

3.3 Robust Estimation (RANSAC)

Once point correspondences are known, it is a simple matter to estimate the homographic relationship using the DLT or Gold Standard algorithms. However, these algorithms assume that the uncertainties in the point measurements come from measurement noise, that the error is Gaussian, and that the estimation process will be able to successfully 'absorb' them. In real scenarios this may not always be the case. The point correspondences may not be accurately known due to an error in the matching algorithm, human error if the correspondences are determined manually, or the limitations of approximate algorithms such as BBF. Alternatively, two or more different solutions may be present in the measured data. For example, the scene may contain two or more dominant planes, giving two or more homographic relationships, only one of which is the desired solution. Thus, real data will have an (often large) number of outliers that contribute a large amount of error to the estimation process.

To deal with this scenario, a variety of robust estimation techniques tolerant to outliers have been developed. The most popular of these is RANdom SAmple Consensus (RANSAC) [13].

When there are outliers in the measured values, often more than one distinct solution is possible. RANSAC repeatedly fits candidate solutions to random subsets of the data and then chooses the one with the maximal number of data points fitting the solution, which gives the least residual error. The algorithm is summarized in Algorithm 4.

Algorithm 4 Random Sample Consensus (RANSAC)

1. Let S be the set of data points and let s be the size of the minimum subset needed to solve the model.

2. Select a random subset s_i of size s from S and solve for the model M_i.

3. Compute the inlier set S_i such that all points in S_i are within the distance threshold t of M_i.

4. If the number of points in S_i is greater than T, a threshold on the number of inliers, then terminate the iteration.

5. If the number of points in S_i is less than T, then select a new random subset s_i and repeat.

6. If the number of iterations is greater than N, the threshold on the number of iterations, then terminate.

7. Re-estimate the model using S_i only.

The RANSAC algorithm can be successfully employed to estimate homography. The algorithm to automatically compute the homography between two views is presented in Algorithm 5.

Algorithm 5 Automatically computing the homography using RANSAC

1. Compute interest points on the two given images.

2. Find putative point matches between images

3. Estimate homography using RANSAC

(a) Select four random putative matches and calculate H using DLT or Gold Standard algorithm

(b) Calculate a suitable distance metric d⊥ for each putative match. Algebraic or geometric distance can be used.

(c) Compute the inlier set, such that d⊥ is less than some distance threshold t.

(d) Terminate if the number of inliers is greater than a threshold T or the number of iterations is greater than some threshold N.

(e) Re-estimate H using only the inlier set.
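The following NumPy sketch mirrors Algorithm 5 for a pair of views, under simplifying assumptions: the basic DLT without coordinate normalization, a one-way geometric transfer error as d⊥, a fixed number of iterations instead of the adaptive termination of step (d), and illustrative parameter values. It is not the implementation used in this thesis.

```python
import numpy as np

def dlt_homography(src, dst):
    """Estimate H (3x3, up to scale) from >= 4 point pairs with the basic DLT."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)      # null vector of A, reshaped to 3x3

def transfer_error(H, src, dst):
    """One-way geometric transfer error d(x', H x) for every putative match."""
    ones = np.ones((len(src), 1))
    proj = (H @ np.hstack([src, ones]).T).T
    proj = proj[:, :2] / proj[:, 2:3]
    return np.linalg.norm(proj - dst, axis=1)

def ransac_homography(src, dst, t=3.0, n_iter=1000, seed=None):
    """Algorithm 5 sketch: fit H to 4 random putative matches, keep the
    hypothesis with the largest inlier set, then re-estimate from the inliers."""
    rng = np.random.default_rng(seed)
    best_H, best_inliers = None, np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):
        pick = rng.choice(len(src), 4, replace=False)
        H = dlt_homography(src[pick], dst[pick])
        inliers = transfer_error(H, src, dst) < t
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    if best_inliers.sum() >= 4:
        best_H = dlt_homography(src[best_inliers], dst[best_inliers])
    return best_H, best_inliers
```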

These techniques are demonstrated in Figures 3.1 and 3.2.

Figure 3.1: Detecting key-points with SIFT and finding geometrically consistent points with RANSAC. The upper figure shows the key-points detected with SIFT and matched using BBF. The lower figure shows matches geometrically consistent with the homography calculated using RANSAC.

Figure 3.2: Mosaicing images after estimating homography using RANSAC.

Chapter 4: IMAGE REGISTRATION FOR MULTI-HEAD CAMERAS

In this work, we develop a new technique for registering images from a multi-head camera and for detecting potential occluders in the scene. The images are captured from an airborne camera with six camera heads. The images are of a predominantly urban scene with large and small buildings.

We make the following two assumptions to solve this problem:

1. A dominant plane exists in the scene for which the homography can be estimated across views.

2. Any scene points not on the dominant plane are potential sources of occlusion.

As will be evident from the data description, these are reasonable assumptions to make for the scope of our problem. We additionally assume that no information about the camera calibration, relative orientation or aircraft position is available. We seek to perform image registration using image appearance only.

4.1 Projection Model and Initial Estimates

Let $\mathbf{X}_1, \ldots, \mathbf{X}_{n_p}$ be the homogeneous coordinates of $n_p$ points in three-space. The projection of point $\mathbf{X}_i$ on image $I_j$ is denoted by $\mathbf{x}_i^j$. Then $\mathbf{X}_i$ and $\mathbf{x}_i^j$ are related by the camera projection matrix $P_j$ for image $I_j$:

$$\mathbf{x}_i^j = P_j \mathbf{X}_i, \qquad (4.1)$$

where, $\mathbf{X}_i = \begin{pmatrix} X_i & Y_i & Z_i & W_i \end{pmatrix}^T$ and $P_j$ is a $3 \times 4$ projection matrix. Imposing the constraint that all points lie on a plane, we set $Z_i = 0$ for all points. Then (4.1) reduces to

$$\mathbf{x}_i^j = H_j \tilde{\mathbf{X}}_i, \qquad (4.2)$$

where, $H_j$ defines a homography from projective 2D scene coordinates to projective 2D image coordinates. This is illustrated in Figure 4.1.

To perform the bundle adjustment we need initial estimates of the $H_j$s and of the projective 2D points $\tilde{\mathbf{X}}_i$s. Without loss of generality, we set $H_1$ to be the identity. Then

$$\mathbf{x}_i^1 = H_1 \tilde{\mathbf{X}}_i \;\Rightarrow\; \mathbf{x}_i^1 = \tilde{\mathbf{X}}_i.$$

Then for the $j$th image we get

$$\mathbf{x}_i^j = H_j \tilde{\mathbf{X}}_i \;\Rightarrow\; \mathbf{x}_i^j = H_j \mathbf{x}_i^1.$$

Using the above relation, $H_j$ can be estimated using the RANSAC approach. For two arbitrary images $I_j$ and $I_{j'}$:

$$\mathbf{x}_i^{j'} = H_{j'} H_j^{-1} \mathbf{x}_i^j \qquad (4.3)$$

Figure 4.1: The projection model used for bundle adjustment using Levenberg-Marquardt estimation.

and

$$\hat{\mathbf{X}}_i = H_{j'}^{-1} \mathbf{x}_i^{j'}. \qquad (4.4)$$

Thus, if $H_j$ is known, $H_{j'}$ and $\hat{\mathbf{X}}_i$ may be estimated.

4.2 Algorithm Overview

Figure 4.2 shows the flow chart for the algorithm developed in this work. The process starts with detecting key-points on each image separately using the SIFT key-point detector. Thereupon, homography-consistent key-points are matched on each image pair. Since finding matches on every possible image pair is prohibitively expensive, it is assumed that maximum overlap exists only between images whose time epochs are close to each other. Since the images are sequential in time as well as space, this is again a reasonable assumption to make.

Using this information, a plane-projective image overlap graph can now be created. The homography transformation for the first image is set to be identity. The graph is then traversed to find the initial estimates of the homographies for each image. After this, a global bundle adjustment is performed using the Levenberg-Marquardt optimization technique [54] to generate precise transformations between images.

Once the transformations have been obtained, the images can be mosaicked to generate a panorama. Additionally, an analysis of the local appearance across views gives a method to determine potential occluders in the scene.

Each of these steps is now explained in detail.

Figure 4.2: Flow chart depicting the various stages of registering images from a multi-head camera: key-point detection, homography-consistent key-point matching, the plane-projective image connectivity graph, initial estimates of H_j, bundle adjustment using Levenberg-Marquardt, transformed images, and finally the mosaic and occlusion outputs.

4.2.1 Key-point Detection and Matching

Key-points are detected on each available image independently using the SIFT key-point detector. Ideally, the points should be uniformly distributed in space, yet be distinct enough for accurate matching. Too many points increase the dimensionality of the problem and make it unsolvable using iterative techniques. While the number of points output by SIFT can be increased by decreasing the various thresholds for detecting peaks, edges and the minimum value of the L2 norm of the descriptor, these tend to concentrate the detected points along edges and peaks. In order to get spatially well distributed points, it is best to begin processing at a lower octave in the scale space. Figure 4.3 illustrates how the number of points decreases as one moves down the scale-space octaves, while the points remain spatially well distributed.

Figure 4.3: The number of points detected decreases as one moves down the scale-space octaves (panels, left to right: First Octave = 0, 1, 2).

Key-points are then matched for all available image pairs. The points are first matched by an approximate nearest-neighbor search using the Best Bin First (BBF) algorithm. These putative matches are further refined by finding geometrically consistent points within a RANSAC framework. The points are tested for the existence of a homography and hence for the existence of a dominant plane in the scene. The homography is calculated using a simple DLT algorithm because Gold Standard cannot deal with high dimensionality and only an approximate solution is needed at this point.

Since the epochs in the data are sequential in time and space, it is logical to assume that an image overlaps only with images near to it in time and space. Thus, an image is matched only to n epochs coming before and after it in the dataset. This value can be changed in the algorithm.

We also want to avoid the selection of different planes in different image pairs.

Thus, for a given set of image pair matches, care is taken that only the points lying on a previously detected plane are used in the next set of matches. This ensures that the same dominant plane is selected by all image pair matchings.

4.2.2 Image Connectivity Graph

From the detected matches, a plane-projective image connectivity graph is created. Each node on this graph is an image, and an edge exists between two nodes if the corresponding images overlap. The 'strength' of the edge is defined as the number of geometrically consistent point matches between the images that it joins. In practice, weak edges representing matches below a certain threshold are pruned altogether. This prevents a weak overlap between images from contributing to the adjustment process.

Figure 4.4 presents a schematic of what such a graph might look like. The images obtained from each of the two cameras will be strongly connected among themselves. However, very few strong connections may exist between the two cameras due to the little overlap between them.

Figure 4.4: Schematic of an image connectivity graph for two cameras (Camera 1 and Camera 2).

4.2.3 Initial Estimation by Graph Traversal

The image connectivity graph is constructed in order to efficiently estimate initial values for the bundle adjustment. We select one image on the graph as the reference image. All other images will be warped onto this image to create a flat panorama. We need to estimate the transformation (homography matrix) that maps each image $I_i$ to the reference image $I_1$. Thus each node on the image connectivity graph gets associated with a homography matrix $H_i$, which maps it to the reference image.

We begin by assigning $H_1$ = identity. From Equation 4.3 we know that if $H_j$ is known, $H_{j'}$ can be estimated. Thus, the homography can be estimated for each node connected to $I_1$. These nodes in turn give homography estimates for further connections. The entire graph can be estimated in this manner by propagating from the reference image.

Many different approaches can be taken for traversing the image connectivity graph. The key concern is to get an accurate initial estimation, since the non-linear optimization is sensitive to the starting values of the iteration. In general, the further away we move from the reference node, the poorer our estimate of the homography gets, because the error tends to accumulate as we hop across views. Thus, it is best to keep the estimation path between a given node and the reference as short as possible. Additionally, the estimation is better when the edge between the nodes is strong, because a strong edge indicates a larger number of point correspondences, resulting in a better homography estimate by the DLT.

Keeping this in mind, two different approaches to graph traversal may be formulated.

1. Move along the strongest edge: at each stage in the graph traversal we move along the strongest available edge. This is the greedy approach; it is presented in Algorithm 6.

2. Estimate along the shortest path: each edge is weighted in inverse relation to its strength. The shortest path then becomes the globally strongest path containing the least number of nodes. It can be computed using Dijkstra's shortest path algorithm [55]. This is the more globally optimal approach but takes longer to compute.

In practice, the first approach results in very long traversal chains on the graph, which produces a large error in the estimates. It is preferable to use the second approach: a shortest-path tree is constructed using Dijkstra's algorithm and traversed breadth first to estimate all nodes (a small sketch of this propagation is given after Algorithm 6). However, this approach takes a longer time to compute.

Algorithm 6 The strongest-edge-first (greedy) algorithm for traversing the image connectivity graph.

Let G = (V, E) be the graph, where E is the set of edges and V is the set of vertices.
Set S = {V_1}, where S is the set of estimated vertices and V_1 is the reference vertex.
repeat
    for all V_i ∈ S do
        Find max E(i, j) = E(i_max, j_max)
    end for
    Estimate V_{j_max}
    S = S ∪ {V_{j_max}}
until S = V
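The sketch below illustrates the shortest-path alternative: a Dijkstra shortest-path tree over the image connectivity graph with inverse-strength edge weights, followed by chaining of pairwise homographies along the tree (Eq. 4.3). The containers edges (mapping an image pair to its match count) and pairwise_H (mapping a pair (i, j) to the homography taking image i coordinates to image j coordinates) are hypothetical stand-ins for whatever the matching stage produces; this is not the thesis implementation.

```python
import heapq
import numpy as np

def shortest_path_tree(n_images, edges, ref=0):
    """Dijkstra over the image connectivity graph; each edge weight is the
    inverse of its strength (number of consistent matches), so strong edges
    are preferred. Returns a parent map describing the shortest-path tree."""
    adj = {i: [] for i in range(n_images)}
    for (i, j), strength in edges.items():
        w = 1.0 / strength
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = {i: np.inf for i in range(n_images)}
    parent = {ref: None}
    dist[ref] = 0.0
    heap = [(0.0, ref)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                parent[v] = u
                heapq.heappush(heap, (d + w, v))
    return parent

def propagate_homographies(parent, pairwise_H, n_images, ref=0):
    """Chain pairwise homographies along the tree: H_j = H_{parent->j} @ H_parent,
    with H_ref = I, so every H_j maps reference coordinates to image j."""
    H = {ref: np.eye(3)}

    def resolve(j):
        if j in H:
            return H[j]
        p = parent[j]
        H[j] = pairwise_H[(p, j)] @ resolve(p)
        return H[j]

    for j in range(n_images):
        if j in parent:          # skip images not connected to the reference
            resolve(j)
    return H
```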

4.2.4 Optimizing with the Levenberg-Marquardt Algorithm

Using this projection model we can set up a bundle adjustment-like framework to simultaneously evaluate the $H_j$s and $\mathbf{X}_i$s up to a general homographic ambiguity. We perform the optimization using the Levenberg-Marquardt minimization technique. We split the homography matrix $H_j$ into its three rows and combine them into a single vector:

$$H_j = \begin{pmatrix} \mathbf{h}_{j1}^T \\ \mathbf{h}_{j2}^T \\ \mathbf{h}_{j3}^T \end{pmatrix}, \qquad \mathbf{h}_j = \begin{pmatrix} \mathbf{h}_{j1} \\ \mathbf{h}_{j2} \\ \mathbf{h}_{j3} \end{pmatrix}.$$

Within the Levenberg-Marquardt framework we can choose the parameter vector in several ways. A complete parameterization can be achieved by using

$$\mathbf{R} = \begin{pmatrix} \mathbf{h}_1^T & \ldots & \mathbf{h}_{n_i}^T & \hat{\mathbf{X}}_1^T & \ldots & \hat{\mathbf{X}}_{n_p}^T \end{pmatrix}^T, \qquad (4.5)$$

where $\hat{\mathbf{X}}_1, \ldots, \hat{\mathbf{X}}_{n_p}$ are the estimated values of $\tilde{\mathbf{X}}_1, \ldots, \tilde{\mathbf{X}}_{n_p}$ and the measurement vector is set to be:

$$\mathbf{V} = \left( \mathbf{x}_i^j \right). \qquad (4.6)$$

Reducing Dimensionality

However, in practice we found that this parameterization had too high a dimensionality to be solved in reasonable time, because the number of 3D points in this problem is very large. The dimensionality in this case is $9 n_i + 3 n_p$, and here $n_p$ is usually of the order of $10^4$. Thus, a more prudent choice is to use only the homographies as parameters:

$$\mathbf{R} = \begin{pmatrix} \mathbf{h}_1^T & \ldots & \mathbf{h}_{n_i}^T \end{pmatrix}^T. \qquad (4.7)$$

A functional relation can then be established between $\mathbf{R}$ and the measurement vector,

$$f : \begin{pmatrix} \mathbf{h}_1^T & \ldots & \mathbf{h}_{n_i}^T \end{pmatrix}^T \rightarrow \left( \mathbf{x}_i^j \right), \qquad (4.8)$$

where the estimated value $\hat{\mathbf{x}}_i^j = f(\mathbf{h}_j^T, \hat{\mathbf{X}}_i)$. Thus we manage to reduce the dimensionality of the problem to $9 n_i$, which is solvable for small $n_i$.

Finally, the error vector is defined as:

$$e_i^j = d(\mathbf{x}_i^j, \hat{\mathbf{x}}_i^j)^2, \qquad (4.9)$$

where $d(\cdot, \cdot)$ is the Euclidean distance between the image point $\mathbf{x}_i^j$ and the estimated point $\hat{\mathbf{x}}_i^j$. This set of equations can now be solved using the standard Levenberg-Marquardt optimization [54].

Once precise homographies between the images are known, the images can be transformed to form a panoramic mosaic. Several excellent algorithms exist for creating appealing seamless images; we do not explore this problem further in this work. We use a simple cubic interpolation to transform the images and do not construct seamless mosaics.
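As a rough sketch of the reduced-parameter adjustment, the code below refines the homographies with SciPy's Levenberg-Marquardt solver. It assumes, as a simplification consistent with setting H_1 to the identity, that the plane points are held fixed at their reference-image coordinates (X_ref) while only H_2 ... H_{n_i} are optimized; the data structures observations and X_ref are hypothetical, and this is not the thesis implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def pack(H_list):
    """Flatten H_2 ... H_{n_i} into the parameter vector (cf. Eq. 4.7); H_1 stays identity."""
    return np.concatenate([H.ravel() for H in H_list[1:]])

def unpack(params, n_images):
    H_list = [np.eye(3)]
    for k in range(n_images - 1):
        H_list.append(params[9 * k: 9 * k + 9].reshape(3, 3))
    return H_list

def residuals(params, observations, X_ref, n_images):
    """Reprojection errors x_i^j - f(h_j, X_i) (cf. Eqs. 4.8-4.9).
    'observations' is a list of (i, j, x, y): point i seen in image j at (x, y);
    'X_ref' holds the reference-image coordinates standing in for the plane points."""
    H_list = unpack(params, n_images)
    res = []
    for i, j, x, y in observations:
        X = np.array([X_ref[i][0], X_ref[i][1], 1.0])
        p = H_list[j] @ X
        res.extend([p[0] / p[2] - x, p[1] / p[2] - y])
    return np.asarray(res)

# H_init: initial homographies from the graph traversal; the solver refines them
# jointly, which is the global adjustment described above.
# result = least_squares(residuals, pack(H_init),
#                        args=(observations, X_ref, len(H_init)), method='lm')
# H_refined = unpack(result.x, len(H_init))
```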

4.3 Occlusion Detection

In this work, we register images by assuming the existence of a dominant plane in the scene and finding the homography between two views on this plane. We posit that the scene points that do not lie on this dominant plane are potential occluders and seek to identify them in the image space.

Two approaches can be taken for occlusion detection. The first is a geometry-based approach. Since the homographic relations for the points on the dominant plane are known, any points that do not satisfy this relation lie above or below the plane and can be characterized as occluders. However, in our particular case there are two problems with this approach:

• The distance between the ground and the aircraft is very large compared to the height of built-up features on the ground. Thus, the parallax for non-ground points (e.g., buildings) is too small to sufficiently distinguish them from ground points.

• It is desirable to determine potential-occluder information for every pixel in the generated panorama. A point-based approach does not give this dense map.

More sophisticated geometry-based methods calculate pseudo depth in an attempt to reason about the 3D structure of the scene and identify occluders. However, these methods are computationally very expensive and difficult to generalize for more than two views.

Keeping this in mind, we develop an appearance-based approach that works well across multiple views. A typical scenario obtained after registering images using our algorithm is shown in Figure 4.5. The images $I_1, \ldots, I_{n-1}$ are warped to the reference image space $I_0$ to create the mosaic. A pixel $\mathbf{x}_0$ on $I_0$ corresponds to pixels $\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}$ on the untransformed views. Let the appearance of these points on the transformed views $I'_1, \ldots, I'_{n-1}$ be $\mathbf{x}'_1, \ldots, \mathbf{x}'_{n-1}$.

We make the following assumption regarding the appearances of $\mathbf{x}'_1, \ldots, \mathbf{x}'_{n-1}$: if $\mathbf{x}_0$ is a ground pixel, then it will register precisely with $\mathbf{x}_1, \ldots, \mathbf{x}_{n-1}$ and its appearance will be very similar to $\mathbf{x}'_1, \ldots, \mathbf{x}'_{n-1}$. However, if it is not a point on the ground, then its appearance will differ from at least one of $\mathbf{x}'_1, \ldots, \mathbf{x}'_{n-1}$. We develop a normalized correlation-based framework to measure this similarity.

To compute the normalized correlation across views, we proceed as follows. Consider a window of size $s_w \times s_w$ around the point $\mathbf{x}_0$. Let the pixel gray values in this window be collected into an $s_w^2 \times 1$ vector $\mathbf{w}_0$. Consider similar windows around the corresponding points $\mathbf{x}'_1, \ldots, \mathbf{x}'_{n-1}$ in the transformed images $I'_1, \ldots, I'_{n-1}$, and let the pixel gray values in these windows be collected in similar vectors $\mathbf{w}_1, \ldots, \mathbf{w}_{n-1}$.

Figure 4.5: A typical scenario in creating the mosaic. The images $I_1, \ldots, I_{n-1}$ are warped into the reference image space $I_0$ as $I'_1, \ldots, I'_{n-1}$ to create the panorama.

We form the $s_w^2 \times n$ data matrix $W = [\mathbf{w}_0\ \mathbf{w}_1\ \ldots\ \mathbf{w}_{n-1}]$ and calculate the normalized variance-covariance matrix as:

$$\Sigma = \mathrm{cor}(W). \qquad (4.10)$$

Σ is an n × n normalized correlation matrix where all the diagonal entries are 1.

We make the following observations about Σ:

• $\Sigma(i, j)$ is close to 1 if the appearances of views i and j match closely. This indicates a non-occluding pixel in these views.

• $\Sigma(i, j)$ is close to 0 if the appearances of views i and j differ from each other. This indicates an occluding pixel.

However, $\Sigma$ is an information-rich feature that can be interpreted in many ways. In this work we use the following simple interpretation: if there exists at least one pair (i, j) such that $\Sigma(i, j) < T_o$, then that pixel is an occluding pixel. Here $T_o$ is a threshold on the minimum acceptable correlation between views; in practice, we have used $T_o = 0.15$. By requiring that $\Sigma(i, j)$ stay above the threshold across all pairs of views before labeling a pixel as ground, we err on the side of caution.
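A minimal sketch of this test is shown below, assuming the warped views are available as a single grayscale array stack that fully overlaps (no handling of pixels that fall outside a view). The pixel-by-pixel loop is written for clarity rather than speed, and the window size and function name are illustrative assumptions.

```python
import numpy as np

def occlusion_mask(warped_views, window=7, t_o=0.15):
    """Appearance-based occlusion test. 'warped_views' is an (n, rows, cols) stack
    of the transformed views I'_0 ... I'_{n-1}; a pixel is flagged as a potential
    occluder if any pair of views has normalized correlation Sigma(i, j) < t_o
    over the surrounding window (Eq. 4.10)."""
    n, rows, cols = warped_views.shape
    half = window // 2
    mask = np.zeros((rows, cols), dtype=bool)
    for r in range(half, rows - half):
        for c in range(half, cols - half):
            # Collect the s_w^2 x n data matrix W for this pixel (one row per view)
            W = warped_views[:, r - half:r + half + 1,
                                c - half:c + half + 1].reshape(n, -1)
            sigma = np.corrcoef(W)              # n x n normalized correlation matrix
            off_diag = sigma[~np.eye(n, dtype=bool)]
            if np.nanmin(off_diag) < t_o:
                mask[r, c] = True
    return mask
```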

Chapter 5: EXPERIMENTS AND RESULTS

5.1 Data Description

The data used for this work was obtained from a multi-head camera with six camera heads. It is known as the Columbus Large Image Format (CLIF) data. The salient features of this data-set are given below.

• The image frames were taken simultaneously from a matrix of six cameras.

• There are nominally 4095 frames from each of the six cameras.

• The relative positioning of the cameras as seen from the camera viewpoint onboard the aircraft is shown in Figure 5.1. Each of these cameras is independent and their relative orientations are not known.

• The images from the cameras along the top row of the matrix need to be rotated clockwise 90° and flipped horizontally to be put in the correct orientation with respect to the plane.

• The images from the cameras along the bottom row of the matrix need to be rotated counterclockwise 90° and flipped horizontally. These rotations are an artifact of the recording process.

• The scene shows a flyover of The Ohio State University (OSU) campus. The plane approaches the campus and makes several circling passes, taking approximately two frames per second from each camera.

• The images are numbered with the camera (from 000000 to 000005) followed by a dash, followed by the frame number (from 000001 through 004095).

• Each image is 4008 (cols) by 2672 (rows) pixels, in 8-bit grayscale raw format.

• In addition, there is an accompanying .txt file for each frame which records the aircraft position and time for each epoch. However, since the aim of this work is to produce a mosaic without using camera orientation, this information is not used.

• The data was approved for public release by the Air Force Research Labs under AFRL/WS-07-1129.

Figure 5.2 shows the six images from one epoch side by side. It is easy to see that the principal axes of the cameras are at an angle to each other and there is some overlap between the cameras. The images are of an urban scene with a dominant ground plane and buildings and other structures on it. The altitude of the plane is large compared to the height of the buildings.

5.2 Experiments with Simulated Data

In order to better understand the problem, we conducted experiments with simulated data. We created a camera head with only two independent cameras. To simulate the urban scene we generated a flat-terrain DEM with a dominant ground plane and random high-rise buildings.

Figure 5.1: The relative positioning of the cameras as shown from the camera viewpoint onboard the aircraft (top row, left to right: 5, 3, 1; bottom row: 4, 2, 0).

'Images' are generated at four epochs as the cameras move parallel to the ground in close approximation to the flight of the aircraft. Points in three-space are selected randomly on this terrain and are back-projected onto the virtual cameras to produce image points. Small Gaussian noise is added to make the measurements more realistic. Figure 5.3 graphically illustrates the generated data.

We apply the technique detailed in Section 4.2.4 to this simulated dataset. We can begin directly with the bundle adjustment because the point correspondences are already known. Figure 5.4 shows the results for three random datasets. Points from epoch 2 are projected onto points from epoch 1 for camera 1. Camera 1, epoch 1 is chosen as the reference ‘image’.

The points are divided into two categories: 'ground points', which lie on the ground plane, and 'potential occluders', which are points that do not lie on the ground plane and may be potential occluders in the scene. We note that the ground points register accurately while building points show misregistration due to parallax. Table 5.1 lists the RMS errors for the points plotted in Figure 5.4.

Figure 5.2: The images from the six camera heads arranged side by side. Note that there is some overlap between the images as the principal axes of the cameras are not parallel.

Data Set    RMS Error, Ground Points (m)    RMS Error, Building Points (m)
1           1.16×10⁻⁴                       7.49×10⁻³
2           1.23×10⁻⁴                       6.32×10⁻³
3           1.19×10⁻⁴                       7.01×10⁻³

Table 5.1: RMS errors for the registration of the simulated dataset.

Figure 5.3: Capturing simulated data points over an urban terrain with two camera heads in linear motion. (a) Two camera heads forming the multi-head camera move in an oblique direction to simulate flight for four time epochs. (b), (c), (d) Random terrains are generated with a 'flat' ground and random high-rise 'buildings'. Data points are captured for the four time epochs.

While adjacent cameras do not have overlap in this simulated scenario, they are connected through other views in the image connectivity graph. This way, a transformation can be estimated between them, allowing us to mosaic them independently, if required. Figure 5.5 shows the 'mosaic' from adjacent cameras 1 and 2 for the first random dataset. The projected set of points from camera 2 is compared to the 'real' point locations. The RMS error for these projected points is 2.45×10⁻⁴ meters.

Figure 5.4: Bundle adjustment of the synthetic dataset. Points captured for epoch 2 are projected onto points captured for epoch 1. (a), (b) and (c) plot the points for three random datasets. Points are divided into ground points and potential occluders (building points). Ground points register accurately while building points show misregistration due to parallax.

Using this simulated experiment we are able to show that the proposed bundle adjustment can achieve a high degree of accuracy. We also find that potential occluders might be identified using the points that do not lie on the selected dominant plane. However, this approach does not work for the real data, as observed in Section 4.3.

5.3 Registration of CLIF Dataset

Our algorithm was successfully tested on the CLIF dataset. Due to the large dimensionality of the data, registration was limited to 20 epochs at a time.

Figures 5.6 and 5.7 show the results for the registration of two images at a time. The original images are shown along with the mosaic and the difference image. The difference image has been contrast enhanced to make the differences more visible. We see that the images have been precisely registered. The difference in appearance is small at the ground plane and is high for points away from the ground plane, such as building tops. Figure 5.8 compares the traditional approach of sequentially registering pairs of images to our global error-minimization approach. We note that our method successfully reduces the accumulation of errors as frames are added.

Figure 5.5: 'Mosaicing' for adjacent cameras for the simulated dataset. Despite having no overlap between them, adjacent cameras are connected in the connectivity graph and hence can be 'stitched' together. The projected points are compared to the real point locations.

Figure 5.9 shows the complete mosaic of the CLIF dataset across 20 epochs. Once again, the images have been precisely registered. This mosaic is created by using the pixel value from only one transformed view. Figure 5.10 shows the detected potential occluders overlaid in red; we note that the potential occluders are mostly around buildings in the scene, as expected. Figure 5.11 shows the mosaic created by blending the transformed images using averaging. Due to small misregistrations, ghosting and blurring occur in this mosaic. The precision of registration decreases towards the region with very busy texture, since the precision of key-point detection is low in that area.

Figure 5.6: Mosaicking for two images. The two images taken from camera 2, shown in (a) and (b), were part of the bundle adjustment and have been precisely registered in (c). The difference in pixel values is visible in (d). The difference image was contrast enhanced to improve visibility.

Another mosaic is shown with detected occlusions in Figure 5.12.

Finally, we extend the process to an even larger number of frames. Figure 5.13 shows the mosaic across 150 epochs. Since bundle adjustment for such large dimensionality is intractable, we perform the optimization for sets of 20 epochs and carry forward the transformations to the next set. This is done by choosing an image from the previously evaluated set as the reference image for the next set and setting the initial estimate of its homography to the value already computed.

Figure 5.7: Mosaicking for two images. The two images taken from camera 4, shown in (a) and (b), were part of the bundle adjustment and have been precisely registered in (c). The difference in pixel values is visible in (d). The difference image was contrast enhanced to improve visibility.

5.4 Details of Occlusion Detection

Figure 5.14 shows the details of occlusion detection. For illustrative purposes, we use a mosaic of only two views. We choose a small highlighted rectangle as the study area. We compute potential occluders using T_o = 0.5 and T_o = 0.15, respectively. Note that as T_o decreases, the area detected as occluding also decreases.

The detected occlusion can also be projected back on the original images. This is shown in Figure 5.15 for two images from the CLIF dataset.

Figure 5.8: Comparison with the traditional approach. (a) shows the mosaic generated for 20 epochs using the traditional approach of sequentially registering pairs of images. (b) shows the same images registered using our method. Note that in (a) the misregistration increases as we move away from the reference image, and our approach in (b) successfully reduces this misregistration.

Figure 5.9: Mosaic of the CLIF dataset across 20 epochs. This mosaic is generated by using pixel values from only one transformed image.

Figure 5.10: Occlusion detection on the CLIF dataset across 20 epochs. The occlusion is overlaid in red.

Figure 5.11: Mosaic of the CLIF dataset across 20 epochs. This mosaic is generated by blending pixel values from all the transformed images.

Figure 5.12: (a) Mosaic across 20 epochs for a different area. (b) Mosaic with overlaid occlusions.

Figure 5.13: Mosaic across 150 epochs. (a) shows the panorama and (b) shows the overlaid potential occluders.

Figure 5.14: Details of occlusion detection. (a) We choose a small highlighted region on a mosaic of two views for illustration. (b) and (c) show the detected occlusion for T_o = 0.5 and T_o = 0.15, respectively. A white (= 1) pixel denotes a ground point and a black (= 0) pixel denotes a potential occluder. Finally, (d) shows the occlusion overlaid on the mosaic. The areas in red represent the ground-plane non-occluders.

Figure 5.15: Detected occlusion back projected onto original views.

Chapter 6: CONCLUSIONS AND FUTURE WORK

The objective of this study was to produce an algorithm for the precise registration of aerial images obtained from multi-head cameras in the absence of camera calibration or the relative orientation of the camera heads. We also sought to develop a method for the detection of potential occluders in the scene based on appearance alone. As detailed in Chapter 1, several methods for generating panoramic mosaics have been developed in the past. However, these often rely on the availability of some information about the camera parameters (focal length from EXIF tags) or scene geometry (assuming a planar scene). Additionally, no satisfactory algorithm exists for detecting occlusion across multiple views.

In this work, we have successfully developed a bundle adjustment framework that uses only point correspondences across multiple views. The error is adjusted globally, ensuring that all geometric constraints in the data are enforced and a precise registration is achieved. We have reduced the dimensionality of the Levenberg-Marquardt algorithm to make the problem tractable and devised a novel method to estimate the initial point and homography estimates using the image connectivity graph.

We extend appearance-based occlusion detection to multiple views and satisfactorily detect potential occluders in the scene. We use information from multiple views to characterize occlusions and are able to back-project detected occlusions onto the original views independently.

However, despite the dimensionality reduction, registration for a large number of images becomes computationally too expensive. In this study we were able to register 20 × 6 = 120 images at a time. Datasets larger than this had to be broken down into smaller sets and computed independently, propagating the homography estimates from one set to another to finally produce a complete mosaic. This decreased the accuracy of registration, and we found error accumulating as the number of frames increased. Other optimization techniques could be explored in the future to reduce the computational complexity.

The information-rich matrix Σ calculated during occlusion detection was also not fully utilized. We use a very simple measure for defining an occluding pixel. More sophisticated measures based on Σ could be developed for a more robust detection of occlusion.

Bibliography

[1] D. Steedly, C. Pal, and R. Szeliski, Efficiently Registering Video into Panoramic Mosaics. IEEE, 2005.

[2] C. Harris and M. Stephens, “A combined corner and edge detector,” in Alvey vision conference, vol. 15, p. 50, Manchester, UK, 1988.

[3] P. Anandan, “A computational framework and an algorithm for the measure- ment of visual motion,” International Journal of Computer Vision, vol. 2, no. 3, pp. 283–310, 1989.

[4] J. Shi and C. Tomasi, Good features to track. IEEE Comput. Soc. Press, 1994.

[5] W. Förstner, "A feature based correspondence algorithm for image matching," International Archives of Photogrammetry and Remote Sensing, vol. 26, no. 3, pp. 150–166, 1986.

[6] B. Triggs, “Detecting keypoints with stable position, orientation, and scale un- der illumination changes,” in Eighth European Conference on Computer Vision, pp. 100–113, In Eighth European Conference on Computer Vision (ECCV 2004), 2004.

[7] M. Brown, R. Szeliski, and S. Winder, Multi-Image Matching Using Multi-Scale Oriented Patches. IEEE, 2005.

[8] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” In- ternational Journal of Computer Vision, vol. 60, pp. 91–110, Nov. 2004.

[9] H. Bay, T. Tuytelaars, and L. Van Gool, "Surf: Speeded up robust features," Computer Vision - ECCV 2006, vol. 3951, pp. 404–417, 2006.

[10] C. Schmid, R. Mohr, and C. Bauckhage, “Evaluation of Interest Point Detec- tors,” 2000.

[11] K. Mikolajczyk and C. Schmid, “Performance evaluation of local descrip- tors.,” IEEE transactions on pattern analysis and machine intelligence, vol. 27, pp. 1615–30, Oct. 2005.

[12] J. Beis and D. Lowe, "Shape indexing using approximate nearest-neighbour search in high-dimensional spaces," in Computer Vision and Pattern Recognition, 1997. Proceedings., 1997 IEEE Computer Society Conference on, pp. 1000–1006, IEEE, 1997.

[13] M. A. Fischler and R. C. Bolles, “Random Sample Consensus: A Paradigm for Model Fitting,” Communications of the ACM, vol. 24, no. 6, 1981.

[14] O. Chum and J. Matas, “Matching with PROSAC-progressive sample consen- sus,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 220–226, Ieee, 2005.

[15] F. Wu and X. Fang, “An improved RANSAC homography algorithm for feature based image mosaic,” Proceedings of the 7th WSEAS International Conference on Signal Processing, Computational Geometry & Artificial Vision, 2007.

[16] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, UK: Cambridge University Press, second ed., 2003.

[17] P. Wolf and B. Dewitt, “Elements of Photogrammetry with Applications in GIS,” New York, USA, 2000.

[18] O. Faugeras, Three-dimensional computer vision: a geometric viewpoint. the MIT Press, Dec. 1993.

[19] R. Szeliski and H.-Y. Shum, “Creating full view panoramic image mosaics and environment maps,” International Conference on Computer Graphics and Inter- active Techniques, 1997.

[20] H. Sawhney and R. Kumar, “True multi-image alignment and its application to mosaicing and lens distortion correction,” Pattern Analysis and Machine Intel- ligence, IEEE Transactions on, vol. 21, no. 3, pp. 235–243, 1999.

[21] B. Triggs, A. Zisserman, R. Szeliski, P. McLauchlan, R. Hartley, and A. Fitzgib- bon, Vision Algorithms: Theory and Practice, vol. 1883 of Lecture Notes in Computer Science. Berlin, Heidelberg: Springer Berlin Heidelberg, Apr. 2000.

[22] R. Szeliski and S. Kang, Recovering 3D shape and motion from image streams using nonlinear least squares. IEEE Comput. Soc. Press, 1993.

[23] H.-Y. Shum and R. Szeliski, “Systems and Experiment Paper: Construction of Panoramic Image Mosaics with Global and Local Alignment,” International Journal of Computer Vision, vol. 36, pp. 101–130, Feb. 2000.

[24] S. Coorg and S. Teller, "Spherical Mosaics with Quaternions and Dense Correlation," International Journal of Computer Vision, vol. 37, pp. 259–273, June 2000.

[25] D. C. Brown, “Close-range camera calibration,” 1971.

[26] G. Stein, Lens distortion calibration using point correspondences. IEEE Comput. Soc, 1997.

[27] M. Brown and D. G. Lowe, Recognising panoramas. IEEE, 2003.

[28] A. Agarwala, M. Agrawala, M. Cohen, D. Salesin, and R. Szeliski, “Photograph- ing long scenes with multi-viewpoint panoramas,” International Conference on Computer Graphics and Interactive Techniques, vol. 25, no. 3, p. 853, 2006.

[29] F. Schaffalitzky and A. Zisserman, "Multi-view matching for unordered image sets, or How do I organize my holiday snaps?," Computer Vision - ECCV 2002, vol. 2350, pp. 414–431, Apr. 2002.

[30] M. Heikkilä and M. Pietikäinen, "An image mosaicing module for wide-area surveillance," International Multimedia Conference, 2005.

[31] N. Snavely, S. M. Seitz, and R. Szeliski, “Modeling the World from Internet Photo Collections,” International Journal of Computer Vision, vol. 80, no. 2, pp. 189–210, 2007.

[32] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen, “Interactive digital photomontage,” ACM Transac- tions on Graphics, vol. 23, p. 294, Aug. 2004.

[33] Y. Li, J. Sun, C.-K. Tang, and H.-Y. Shum, “Lazy snapping,” ACM Transactions on Graphics, vol. 23, p. 303, Aug. 2004.

[34] C. Rother, V. Kolmogorov, and A. Blake, “”GrabCut”,” ACM Transactions on Graphics, vol. 23, p. 309, Aug. 2004.

[35] P. Baudisch, D. Tan, D. Steedly, E. Rudolph, M. Uyttendaele, C. Pal, and R. Szeliski, “Panoramic viewfinder: providing a real-time preview to help users avoid flaws in panoramic pictures,” OZCHI; Vol. 122, 2005.

[36] R. Szeliski, Image mosaicing for tele-reality applications. IEEE Comput. Soc. Press, 1994.

[37] M. Irani and P. Anandan, “Video indexing based on mosaic representations,” Proceedings of the IEEE, vol. 86, pp. 905–921, May 1998.

[38] A. Eden, M. Uyttendaele, and R. Szeliski, "Seamless Image Stitching of Scenes with Large Motions and Exposure Differences," CVPR, 2006.

[39] R. Szeliski, “Image alignment and stitching: a tutorial,” Foundations and Trends in Computer Graphics and Vision, vol. 2, no. 1, 2006.

[40] G. Egnal and R. Wildes, “Detecting binocular half-occlusions: empirical compar- isons of five approaches,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 1127–1133, Aug. 2002.

[41] C. Zitnick and T. Kanade, “A cooperative algorithm for stereo matching and occlusion detection,” IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 22, pp. 675–684, July 2000.

[42] C. Herley, Automatic occlusion removal from minimum number of images. IEEE, 2005.

[43] D. Hoiem, A. N. Stein, A. A. Efros, and M. Hebert, Recovering Occlusion Bound- aries from a Single Image. IEEE, Oct. 2007.

[44] X. He and A. Yuille, "Occlusion boundary detection using pseudo-depth," Computer Vision - ECCV 2010, pp. 539–552, 2010.

[45] A. N. Stein and M. Hebert, “Local detection of occlusion boundaries in video,” Image and Vision Computing, vol. 27, pp. 514–522, Apr. 2009.

[46] P. McLauchlan, “Image mosaicing using sequential bundle adjustment,” Image and Vision Computing, vol. 20, pp. 751–759, Aug. 2002.

[47] M. K. Bennet, Affine and Projective Geometry. John Wiley and Sons, Inc., 1995.

[48] R. Casse, Projective Geometry: An Introduction. New York: Oxford University Press, Inc., 2006.

[49] H. Li, Invariant Algebras and Geometric Reasoning. Singapore: World Scientific Publishing, 2008.

[50] F. L. Bookstein, “Fitting conic sections to scattered data,” Computer Graphics and Image Processing, vol. 9, pp. 56–71, Jan. 1979.

[51] J. Bergen, P. Anandan, K. Hanna, and R. Hingorani, “Hierarchical model-based motion estimation,” 1992.

[52] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” In- ternational Journal of Computer Vision, vol. 60, pp. 91–110, Nov. 2004.

[53] J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, vol. 3, pp. 209–226, Sept. 1977.

[54] D. W. Marquardt, “An Algorithm for Least-Squares Estimation of Nonlinear Parameters,” SIAM Journal on Applied Mathematics, vol. 11, no. 2, p. 431, 1963.

[55] E. W. Dijkstra, “A note on two problems in connexion with graphs,” Numerische Mathematik, vol. 1, pp. 269–271, Dec. 1959.

Appendix A: SOFTWARE IMPLEMENTATION DETAILS

A.1 Software Environment

The software was developed and tested using Matlab R2009b on an Ubuntu Linux 10.10 system. Several third-party libraries have been used; care is taken that these are cross-platform. The software has been run successfully on Windows 7, but extensive testing was not performed and bugs may remain.

Parts of the program are written in C++ and compiled with GCC 4.5; however, older versions of GCC should be able to compile these parts. On Windows, the MinGW GCC compiler should be used. Two third-party libraries are used: Eigen (a C++ linear-algebra library) and CImg for image I/O. Check the documentation for these libraries for details on how to get them working. CImg requires the installation of ImageMagick to work properly on Windows. Both are distributed as header files under open-source licenses and can be freely used for non-commercial purposes.

A.2 Input

The input to the system consists of all images stored in a single directory. The images can be in the raw format or any of the popular image formats (tif, bmp, jpg) supported by Matlab. If the raw image format is provided, the program will prompt the user to enter the row × col size. The images are assumed to be named as in the CLIF dataset.

The input directory, the cache directory and an output directory need to be specified. Also, a start epoch and an end epoch are to be provided.

A.3 Caching

Various intermediate results are cached so that processing can be paused at any point and resumed at a later time. A cache directory has to be provided. Results are cached as Matlab .mat files. The caching scheme is as follows.

Filenames (or combinations thereof) are used as handles.

1. Key-points: an SHA-256 hash of the filename is calculated and the key-points are stored as [hash].sift.

2. Point matches: the filenames are combined into a string filename1-filename2. Again an SHA-256 hash is calculated and the point matches are stored as [hash].hmatch.
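For illustration only, this naming scheme can be reproduced with a few lines of Python (the actual program is written in Matlab); the function names here are hypothetical.

```python
import hashlib

def keypoint_cache_name(filename):
    # SHA-256 hash of the image filename, used as the key-point cache handle
    return hashlib.sha256(filename.encode("utf-8")).hexdigest() + ".sift"

def match_cache_name(filename1, filename2):
    # The two filenames are combined into "filename1-filename2" before hashing
    combined = filename1 + "-" + filename2
    return hashlib.sha256(combined.encode("utf-8")).hexdigest() + ".hmatch"
```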

A.4 Output

Output is generated in a provided output folder. A subfolder named [fromEpoch]-[toEpoch] is created and all output is generated in this subfolder. The output consists of the following components:

1. Warped images: warped images numbered 1 . . . n are created. These are the images that have been warped onto the reference image.

2. xdata.txt, ydata.txt: two text files containing the x and y translation parameters.

3. mosaic.tif: the generated mosaic.

4. occlusion.tif: a file containing the occlusion results. Gray values indicate the occlusion probability at each pixel.

A.5 Running the Program

To run the program, launch Matlab and navigate to the folder containing the code. Type uiProcessData at the Matlab prompt. The GUI shown in Figure A.1 will launch. Provide the input as described above. The data processing has been broken down into three steps because each step takes a long time to run; thus, processing can be split across different sessions. Since all results are cached, no data is lost from one run to another.

1. Detect key-points: clicking Detect Key-points will detect key-points on all images between the specified epochs.

2. Match key-points: clicking Match Key-points will match all image pairs between the specified epochs.

3. Generate results: clicking Generate Results will generate the mosaics and occlusions in the specified output directory.

A.6 Viewing Results

A separate GUI is provided for viewing results. To launch it, type uiDisplayMosaic at the Matlab command prompt. Provide the output directory and the start and end epochs. Click the View Panorama and View Occlusion buttons to view the respective views. The output is displayed in the standard Matlab image viewer and all the Matlab tools are available.

Figure A.1: Screenshot of GUI for processing data.

A.7 Notes for Programmers

The program has been written so as to make it easier for future maintainers to modify the source code. The core components for key-point detection and matching, optimization, error reporting, image stitching and occlusion detection have been separated out in an API-like approach. Standard Matlab help documentation has been provided for most of the source code, and API documentation can be generated using the m2html documentation system.

Source code management was done using Git. The entire Git repository, going back to the initial commits, is provided.

Figure A.2: Matching key-points using the GUI.

Figure A.3: GUI for viewing results.
