Improved Algorithms for Incremental Self-calibrated Reconstruction from Video

by Rafael Lemuz López

A thesis submitted in partial fulfillment of the requirements for the degree of DOCTOR OF SCIENCE IN COMPUTER SCIENCE at the Instituto Nacional de Astrofísica, Óptica y Electrónica. April 2008, Tonantzintla, Puebla. Supervised by: Dr. Miguel Octavio Arias Estrada, INAOE.

© INAOE 2008. The author grants INAOE permission to reproduce and distribute copies of this thesis in whole or in part.

Summary

Self-calibrated 3D reconstruction algorithms deal with the problem of recovering the three-dimensional structure of a scene and the camera motion using 2D images. A distinctive property of self-calibrated reconstruction methods is that camera calibration (the estimation of the camera intrinsic parameters: focal length, principal point, and radial lens distortion; and extrinsic parameters: orientation and position) is computed using geometric information intrinsically contained in the projective images of real scenes. Algorithms that solve 3D reconstruction problems rely heavily on finding correct matches between salient features that correspond to the same scene elements in different images. Then, using correspondence data, a projective estimate of the 3D scene structure and camera motion is computed. Finally, using geometric constraints, the camera parameters are recovered and the projective model is upgraded to a metric one.

This thesis proposes new algorithms to solve problems involved in self-calibrated reconstruction methods, including salient point detection, robust feature matching, and projective reconstruction. An improved salient point detection algorithm is proposed that ranks interest points better, according to the intuitive notion of a corner point, by directly computing the angular difference between dominant edges. A robust feature matching algorithm is also proposed that merges spatial and appearance properties of putative match candidates, increasing the number of correct matches while discarding false match pairs. In addition, a projective reconstruction algorithm is proposed that selects on-line the most contributing frames in the projective reconstruction process, overcoming one of the intrinsic limitations of factorization-like algorithms and addressing the problem of key frame selection in the 3D self-calibrated pipeline. A full 3D reconstruction pipeline is developed with the proposed algorithms. Promising

results are shown, and the contributions and limitations of this work are discussed.

Resumen

Self-calibrated 3D reconstruction algorithms deal with the problem of recovering the 3D information of a scene and the camera motion from images. A distinctive property of self-calibrated reconstruction methods is that the intrinsic camera parameters (focal length, principal point, and even radial distortion) as well as the extrinsic parameters (the orientation and relative position of the camera with respect to the scene) are computed using geometric information intrinsically contained in the images of a real static scene. That is, these methods do not use additional tools such as feedback motors to compute the focal length or prefabricated calibration patterns.

However, the self-calibrated reconstruction process depends strongly on having identified corresponding points between image regions that represent the same scene element captured from different viewpoints. Using only corresponding points, a first estimate of the scene structure and camera motion is obtained that does not preserve distances and angles, called a projective reconstruction. Subsequently, by making some assumptions and imposing restrictions on some of the camera parameters, the projective model is upgraded to a Euclidean model that differs from the representation of the real scene by a scale factor and the original orientation.

This thesis proposes new algorithms for the self-calibrated reconstruction problem, in particular for the problems of interest point detection, correspondence search, and projective reconstruction.

An interest point detection algorithm is proposed that ranks the detected points better, according to the intuitive notion of a corner, by directly computing the angular difference between the dominant edges. A new correspondence search algorithm is proposed that integrates spatial and appearance properties into a similarity metric between candidate corresponding points; the new algorithm increases the number of correspondence pairs while reducing matching errors. In addition, a projective reconstruction algorithm is proposed that selects at run time the images that contribute most during the reconstruction process, in order to overcome one of the limitations inherent to projective reconstruction algorithms based on the factorization method: the selection of the most important frames during the complete self-calibrated reconstruction process. Finally, promising results are shown and the contributions and limitations of this work are discussed.

Acknowledgements

There are many people who have provided guidance and support throughout the years, and to whom I wish to give thanks. First, my advisor, Miguel Octavio Arias Estrada, who has guided me through these years and has taught me what it means to be a researcher. Secondly, Patrick Hebert, who pointed out to me the significance of clear and precise communication of research results. I want to thank Professors Leopoldo Altamirano Robles, Olac Fuentes Chaves, and Aurelio López López, because they had a great impact on my academic and professional skills by giving me the opportunity to interact with them during my stay at the INAOE. Then Eliezer Jara, for teaching me the way of systematic analysis in laboratory practice and for sharing his invaluable experience in building prototypes for diverse applications, which had an enormous impact on my professional formation. I also want to thank the interesting people I have met along the way, with whom I had the opportunity of interacting through informal discussions, and some of whom provided support and encouragement: Blanca, Rita, Irene, Luis, Jorge, and Marco Aurelio. I especially want to express my gratitude to Carlos Guillen for the hours invested in clarifying some mathematical concepts during the last year, and to the people of the LVSN lab at Laval University, in particular Jean-Daniel Deschênes and Jean-Nicolas Ouellet, for making the visit to Quebec so pleasant. Finally, I also want to acknowledge the facilities provided by the technical staff of the INAOE, in particular the people of the computer science department.

This research was done with the financial support of CONACYT scholarship grant 184921.

Dedicatory

To my parents and brothers ....

Contents

1 Introduction
1.1 Overview of 3D reconstruction from video
1.1.1 Interest point detector
1.1.2 Matching correspondence
1.1.3 Projective reconstruction
1.1.4 Self-Calibration
1.1.5 Rectification
1.1.6 Dense Stereo Reconstruction
1.2 Objectives
1.2.1 Main Objective
1.2.2 Particular Objectives
1.3 Contributions
1.3.1 Robust feature matching
1.3.2 Incremental 3D reconstruction by inter-frame selection
1.4 Organization of the Thesis
1.5 Conclusions

2 Multiple View Geometry
2.1 Preliminaries
2.1.1 Homogeneous Coordinates
2.2 Camera Models
2.2.1 Perspective model
2.2.2 Orthographic Model
2.2.3 Lens Distortion
2.3 Multiple View Constraints
2.3.1 Two view Geometry
2.3.2 Fundamental Matrix estimation
2.3.3 Planar Homography
2.3.4 Homography estimation
    Number of Measurements
2.3.5 Projective Reconstruction
    Merging Projective matrices using Epipolar Geometry
    The Factorization Method
    Non-linear Bundle Adjustment
2.3.6 Incremental Projective Reconstruction
2.4 3D Scene Reconstruction
2.4.1 Camera Calibration
2.4.2 Triangulation
2.4.3 Survey of Camera Calibration
    Photogrammetric calibration
    Self-calibration
2.4.4 Absolute Conic
2.5 Stratified Self-calibration
2.5.1 Affine Stratification
2.6 RANSAC computation
2.7 Conclusions

3 The Correspondence Problem
3.1 Introduction
3.2 Feature Correspondence Overview
3.3 Salient point detection
3.3.1 Pioneer Feature Detectors
    First Derivative Methods
    Second derivative methods
    Local energy methods
    Detectors of junction regions
3.3.2 Invariant Feature Detectors
3.4 Salient point Descriptor
3.4.1 SIFT descriptor
3.5 Matching salient points
3.6 Geometric Constraints for Matching
3.7 The importance of Gaussian Integration Scale and Derivative filters
3.8 Cov-Harris: Improved Harris
3.8.1 Segmentation of Partial Derivatives
3.8.2 Edge direction estimation by Covariance Matrix
3.8.3 Ranking Corner Points by the Angular difference between dominant edges
3.9 Discussion

4 IC-SIFT: Robust Feature Matching Algorithm
4.1 Introduction
4.2 Related Work
4.2.1 Scale Invariant Feature Transform
4.2.2 Iterative Closest Point ICP
4.3 IC-SIFT: Iterative Closest SIFT
4.3.1 Finding Initial Matching Pairs
4.3.2 Matching SIFT features: adding a weighted distance factor
4.3.3 Differencing Registration Error
4.4 Robust feature Matching Experimental Results
4.5 Discussion

5 A new Incremental Projective Factorization Algorithm
5.1 Introduction
5.2 Related Work
5.3 Projective Factorization
5.4 Proposed Incremental Projective Reconstruction Algorithm
5.4.1 Domain Reduction by inter-frame Selection
5.4.2 Incremental Projective Reconstruction Algorithm
5.5 Incremental Projective Reconstruction Experimental Results
5.5.1 Incremental Projective Reconstruction Accuracy
5.5.2 Processing Time
5.5.3 Real Image Sequence experiments
5.5.4 Conclusions

6 Implementation and Experimental Results
6.1 Self-calibrated reconstruction from video experiments
6.2 Salient Point detection
6.3 Salient point detection by Harris algorithm
6.4 Matching restricted list to estimate geometric constraints
6.4.1 Robust fundamental matrix estimation
6.4.2 Enforcing Epipolar Constraint for semi-dense matching
6.5 Projective and Euclidean Reconstruction
6.6 Discussion

7 Conclusions
7.1 Summary of contributions
7.1.1 Robust feature matching for wide separated views
7.1.2 Incremental 3D reconstruction by inter-frame selection
7.1.3 Robust feature matching on video sequences
7.2 Future work
7.2.1 Tracking algorithm with motion blur
7.2.2 Inter-frame selection removing critical configurations
7.2.3 Collaborative structure from motion
7.2.4 Real-time processing

Chapter 1

Introduction

Recovering the three-dimensional information of a scene from multiple images captured with a camera is one of the fundamental problems of computer vision. There are numerous methods to deal with this problem. The methods can be classified into different taxonomies according to their intrinsic properties, for example by the kind of sensor (sonar, laser range finder, fringe projector, inertial measurement unit), by whether the scene can be modified through lighting (passive and active), or by the source of information analyzed to extract depth information (shadows, texture, contour, geometry, focus, defocus, symmetry, disparity, reciprocity, light fields, and photometry). A further distinction between methods is whether the scene remains static or dynamic while information is processed. When video cameras are used to recover depth information, reconstruction methods are called pre-calibrated if the parameters of the camera image formation mapping are known, and self-calibrated when the camera parameters are unknown.

The application of each method depends on the requirements of specific problems, ranging from accuracy, precision, processing speed, mobility, accessibility to information sources, modification of natural ambient light, and dimensional constraints, to budget, to mention just a few. The ideal method for each particular application is a trade-off between these and other, less obvious constraints, for example:

the need for portability, whether human user interaction is allowed, the need for a specific 3D model representation (depth map, voxels, mesh, level sets, or vector fields), and the amount and quality of the generated information; i.e., some applications require a special model representation and a full description of the scene, while for others a sparse model representation is enough. A distinguishing work that highlights the importance of using the same data representation throughout the reconstruction process, from 3D reconstruction and partial view registration to rendering and visualization, is presented in [THL02, THL03, THDL04], where a common framework based on vector fields allows real-time reconstruction using range curves with a hand-held scanner.

An in-depth description of 3D reconstruction methods is beyond the scope of this thesis; we refer the interested reader to excellent recent surveys in different computer vision domains [Cur, SCMS01a, Héb01, SCD+06].

This thesis deals with the estimation of the structure of a scene from images by self-calibrated methods. These methods have attracted the attention of numerous research groups in recent years because they can extract three-dimensional information from a set of images without prior knowledge of the camera. This problem is also called Structure from Motion (SfM), and simultaneous localization and mapping in the robotics literature. In the last few years important progress has been made in this research area, but the problem is still hard to solve, and there is no method that can be applied to general scenes and that fulfills most of the requirements expressed in the previous paragraphs. Assuming a static scene viewed by a camera undergoing rigid motion, the problem has been formulated with several approaches, and state-of-the-art research has focused on individual image processing stages and on developing robust high-level stages to recover the unknown camera parameters for different camera models using only a set of images as input data.

Some properties of the self-calibrated reconstruction method that highlight its advantages over more sophisticated ones with expensive set-ups (for example, laser range finders, pattern projectors, lighting arrays, Global Positioning Systems, and Inertial Measurement Units) are described here; some of them derive from the fact that self-calibrated reconstruction can recover the structure and motion using only one moving camera:

• Automatic recovery of the camera location and orientation with respect to the scene, up to a Euclidean transformation.

• The possibility to compute an estimate of 3D models from a set of images taken with the same camera without further information.

• Low cost, since in recent years the widespread use of video cameras has driven their cost down.

• Three-dimensional reconstruction of scenes viewed from near and far (indoor and outdoor model generation).

• Portability, mobility, and low energy consumption.

Low-cost cameras, like those used in cellular phones, are increasing in resolution and image quality, which makes them feasible for self-calibrated 3D model reconstruction.

However, some drawbacks of self-calibration methods when compared with those that use specialized setups are:

• Self-calibration requires textured information for modeling a scene, and hence cannot cope with scenes of homogeneous texture.

• A model is recovered with a sparse set of 3D points instead of dense depth maps.

• Lower-quality 3D models are recovered compared to methods using more complex hardware components, such as structured-light-based methods.

• High dependency on establishing correspondences between salient points in images that represent the same scene element, under conditions of widely separated views.

1.1 Overview of 3D reconstruction from video

Methods that recover 3D models from images taken with an uncalibrated camera [MHOP01, ST01, RP05, MP05, HZ00b, GSV01] presume that a sequence of images is available. They rely on the assumption that the scene remains static while the capturing camera moves around the scene to be modeled. An important requirement of self-calibrated reconstruction methods is that the scene contains a distinctive set of image regions that can be distinguished in different views. The main processing steps involved in self-calibrated reconstruction are shown in figure 1.1.

The first step is to identify salient points in the images. Pioneering approaches to self-calibrated reconstruction used standard corner detection algorithms, but recently the need for affine-invariant point detection algorithms has emerged and important progress in this area has been made. The reason invariant salient point detection algorithms are needed is that even small viewpoint changes during image capture modify the appearance properties of salient points, due to varying lighting conditions and projective deformation during the image formation process.

After a set of salient points has been identified, the next step is to find, for each salient point in the first image, the feature points in subsequent images that correspond to the same scene element. This is called the correspondence problem. An important assumption made during this step is that consecutive frames do not differ too much. This makes it possible to restrict the search space for finding corresponding features in different images and to match them using cross-correlation methods. However, since camera motion is unconstrained and unknown, more sophisticated approaches have appeared that use invariant feature point descriptors, reduce the search space by means of geometric constraints, and compute robust estimates to select only the best match candidates.

The third stage assumes that the correspondence problem has been solved and a measurement matrix with true salient point matches between all features has been built. Then, an estimate of camera motion and scene structure is recovered. However, since the camera parameters are unknown, this estimate is a projective representation of the real metric scene.

Thus, the next step is to find a projective mapping that transforms the projective reconstruction into a metric reconstruction. There are two main approaches to this problem. The first is to explicitly estimate the affine transformation by finding the plane at infinity and then to use the absolute conic (a special conic that lives in the plane at infinity) to find the camera parameters, imposing restrictions on the camera parameters (e.g., rectangular or square pixels, constant aspect ratio, and principal point in the middle of the image).

The model recovered up to this stage consists of a sparse set of points that differs from the real scene by a scale factor, which can be resolved if one real distance between salient points in the scene is known. However, once the camera parameters are known, it is possible to compute a dense reconstruction of the scene using standard calibrated stereo reconstruction frameworks.

If a dense map is needed, the camera parameters of a pair of images can be used to compute a rectification that aligns the images in such a way that corresponding points can be found by searching along a line. Then, robust dense stereo matching algorithms can establish correspondences for almost every pixel between the images. However, even with this geometric restriction the problem is difficult, due to the absence of texture information and to occluded image areas.

Figure 1.1 illustrates the steps to achieve self-calibrated 3D modeling from video taken from [PGV+04]. Different state of the art methods have their own specific components but follow a similar pipeline.

The following subsections give an overview of the stages of the multiple view reconstruction method and the algorithms commonly used.

Figure 1.1: The steps to achieve self-calibration from multiple images taken from [PGV+04].

1.1.1 Interest point detector

The first step consists in automatically detecting ’interest points’ in the images that are sufficiently different from their neighbor pixels.

Numerous algorithms have been proposed to extract interest points from images. Different region properties around a point are used to define what points in an image are 'interesting'. Some detectors find points of highly varying texture, while others locate corner points. Corner points are formed when two or more non-parallel edges meet. An edge in an image is a sharp variation of the intensity function. Edges usually define the boundary between two different objects or parts of the same object.

In general, interest point detectors find image areas with high variance in at least two directions. The variance along different directions, computed using all pixels in a window centered on a point, is a good measure of distinctness. Usually the Harris and Stephens corner detector is selected for this task [HS88], since the corner responses estimated by the Harris operator through eigenvalue analysis are invariant to scale when pyramidal processing is used as in [Lin98, MS02]. Even though there are other alternatives, such as Sojka [Soj03], SUSAN [SB97], and KLT [KT91, ST94], a recent study of the corner stability and corner localization properties of the features extracted by different algorithms suggests that the KLT and Harris corner detectors are more suitable for tracking features in long sequences [TS04]. State-of-the-art algorithms have extended the Harris algorithm to make it stable under affine image transformations [MS02, TG04, MS05a] and applicable in scenarios where small viewpoint changes modify the local appearance of salient points [Low99].
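As a concrete illustration of the corner response just described, the following sketch computes the Harris/Stephens measure with numpy and scipy; the Sobel derivatives, the Gaussian integration scale, and the constant k = 0.04 are implementation choices assumed here, not values prescribed by the thesis.

```python
import numpy as np
from scipy.ndimage import sobel, gaussian_filter

def harris_response(img, sigma=1.5, k=0.04):
    """Harris/Stephens corner response R = det(M) - k * trace(M)^2, where M is
    the second-moment matrix integrated over a Gaussian window of scale sigma."""
    img = img.astype(np.float64)
    Ix = sobel(img, axis=1)                 # horizontal image derivative
    Iy = sobel(img, axis=0)                 # vertical image derivative
    Sxx = gaussian_filter(Ix * Ix, sigma)   # windowed second-moment entries
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    det_M = Sxx * Syy - Sxy ** 2
    trace_M = Sxx + Syy
    return det_M - k * trace_M ** 2         # large positive values at corner-like points
```

Local maxima of this response above a threshold are then kept as interest points.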

1.1.2 Matching correspondence

After detecting interest points, the next step is to track those features across different images in a video sequence. The goal is to find for every interest point in the first image the corresponding point in subsequent images associated with the same scene element.

The correspondence problem has been studied in depth in two different setups. In the 'stereo' correspondence problem, the camera motion is restricted to be mainly translational and the images of the same scene are pre-aligned, limiting the search for corresponding points to the same image row; see [SS02, BBH03] for recent reviews. Even under these constraints the problem is difficult to solve due to image noise, object occlusions, varying lighting conditions, specular highlights, shadows, and motion blur.

On the other hand, a harder setup of the correspondence problem, 'wide baseline matching', occurs when the images are captured under large and unknown camera motion, because of perspective effects, varying scale, and stronger variations in lighting conditions.

The problem of looking for correspondences in video streams is known in the literature as multi-feature tracking. Although many tracking algorithms exist [SPFP96, HB96, FTTR99, SHF01], the Kanade-Lucas-Tomasi tracking algorithm is commonly used [KT91]. When only a few images of the object or scene are available, wide-baseline matching methods can be used [ZDFL95a, FTG03]. These methods use affine-invariant regions for matching images, which are robust but computationally more expensive [MS05a].

1.1.3 Projective reconstruction

Projective reconstruction is the best that can be done without camera calibration or additional metric information about the scene [Tri97]. Thus, knowing only feature correspondences, the recovered camera pose and scene structure differ from the metric reconstruction by a projective transformation.

There are two kinds of methods (with many variants) for the projective reconstruction step: those based on epipolar geometry and those based on factorization. In methods based on epipolar geometry [FLM92, GSV01, RP05, MP05], the first two images are used to initialize a reference frame. The world frame is aligned with the first camera and, from the third image on, the rotation part of each fundamental matrix is aligned with the fundamental matrix of the previous image. The epipolar-geometry-based method estimates camera motion and 3D structure for each view. When the last image is processed, a nonlinear optimization algorithm can refine the camera matrices and 3D structure.

Factorization methods solve the projective reconstruction problem using a data matrix (the image coordinates of corresponding points in all the images). The data matrix is factorized using singular value decomposition (SVD) into two matrices, which represent object shape and camera motion respectively. The factorization method, first developed for the orthographic projection model [TK92a, TK92b], was later extended to weak-perspective, para-perspective, and projective camera models [MK94, PK97, ST01, HK00, MHOP01]. The factorization method is preferable to the epipolar approach due to its accuracy, numerical stability, and robustness, and because it avoids computing the epipolar geometry, which is prone to errors when the separation between images is short, so that implicit human intervention is needed to select appropriate images.

1.1.4 Self-Calibration

A projective reconstruction does not preserve the parallelism, length ratios, and angles between lines of real 3D scenes. The process of upgrading a projective reconstruction to a metric one where those properties are preserved is called self-calibration or auto-calibration. To upgrade from a projective reconstruction to a metric reconstruction, both the parameters of the perspective projection that models the image formation process and the camera location must be estimated.

Assuming that all images are taken by the same camera and some internal camera parameters are known, Euclidean structure of the scene can be recovered. Furthermore, the camera calibration can be solved.

The first self-calibration method [FLM92, MF92] directly finds the intrinsic camera parameters that are consistent with the underlying projective geometry of a sequence of images using pairwise epipolar geometry.

Hartley et al. [Har92, AZH96] proposed a non-linear least squares method to solve self-calibration, but it needs a good initial guess for the unknowns. In [PG97] Pollefeys and Van Gool extend the Hartley method: the projective reconstruction is first upgraded to an affine reconstruction and then to a metric reconstruction, assuming a variable focal length while the other camera parameters remain constant. In [HK00] Han et al. proposed a linear algorithm to recover the intrinsic parameters when the principal point and the focal lengths are unknown, and to convert the projective solution to a Euclidean solution simultaneously.

The structure of the scene recovered after self-calibration is a sparse set of points. A dense depth map must be estimated to build a realistic 3D model. Two additional steps can accomplish this task: Rectification and Dense Stereo Reconstruction.

1.1.5 Rectification

Given two or more images with matching points between them, the rectification process exploits the epipolar constraint to align the images in such a way that all corresponding points have the same y-coordinate in both images. This image transformation reduces the search for feature point correspondences to a thin band of scan-lines: because of image noise, the estimated epipolar geometry is prone to small errors, making it necessary to extend the search for corresponding points to neighboring scan-lines.

When there is no close-up motion between images, planar rectification can correctly align the images [Har98] by projecting both images onto a plane parallel to the baseline. To handle forward/backward camera motion, non-planar rectification algorithms project the images using projective matrices [FTV97, FTV00]. Roy et al. proposed using cylindrical coordinates [RMC97] and Pollefeys et al. used polar coordinates [PKG99] to reduce the computational cost.

1.1.6 Dense Stereo Reconstruction

Dense stereo reconstruction is the task of establishing a dense correspondence map between points of different calibrated views and recovering three-dimensional information for each pair of matched points. When only two images are available, the problem is called binocular stereo.

Combining information from several images makes the process more robust and precise, and has led to multi-view stereo methods such as voxel coloring, space carving, and light-field methods, which also benefit from known camera parameters. See [SCMS01b, LZWL04, SCD+06] for recent reviews of multi-ocular stereo reconstruction methods.

1.2 Objectives

1.2.1 Main Objective

The main objective of this thesis is to develop new algorithms for incremental self-calibrated 3D reconstruction from video streams where, for each captured frame, an estimate of the camera pose and the structure of the scene can be computed on-line.

However, since even individual stages of the full self-calibrated reconstruction method are open problems, we have identified more specific objectives to address some drawbacks of state-of-the-art algorithms. In particular, since self-calibrated reconstruction methods depend heavily on finding correct matches between points in images, we decided to address this problem to improve the applicability of reconstruction methods to real scenes captured with unstabilized cameras. In addition, for the projective reconstruction algorithm based on the factorization method, we identified the need for new algorithms that operate under time constraints.

1.2.2 Particular Objectives

• Propose a robust matching algorithm to find corresponding salient points in different images, even when repetitive patterns exist in local areas of the scene.

• Investigate a collaborative approach between stages of the reconstruction pipeline to improve the robustness of matching algorithms.

• Develop a new projective reconstruction algorithm, based on the factorization method, that operates under time constraints.

1.3 Contributions

In this work, new algorithms are proposed to solve specific problems of self-calibrated reconstruction from video. Specifically, we improve on the following issues:

1.3.1 Robust feature matching

A robust feature matching algorithm is proposed [LLAE06a]. A matching metric is introduced to enforce geometric and photometric properties in the matching criterion. Thus, corresponding points are matched within an iterative framework using a local motion descriptor and the similarity between scale invariant region descriptors (SIFT) to avoid mismatch errors between distant points.

1.3.2 Incremental 3D reconstruction by inter-frame selection

A new algorithm is presented for selecting the frames used to recover camera motion and scene structure under the projective camera model [LLAE06b] within the factorization method. Directly measuring the contribution of each frame to the progressive quality of the 3D model reconstruction allows memory resources to be reduced and keeps the computational cost approximately constant for every frame of an image sequence.

1.4 Organization of the Thesis

The thesis is structured as follows. This chapter gave a general introduction to the problem of self-calibrated reconstruction and stated the objectives and contributions of this work. Chapter 2 then reviews the background on multiple view geometry. Chapter 3 presents a review of the state of the art in the correspondence problem.

Chapters 4 and 5 introduce the proposed algorithms, describe their properties, and analyze their advantages with respect to state-of-the-art approaches. Chapter 4 presents a novel method for robust feature matching, and Chapter 5 an incremental projective reconstruction algorithm. In Chapter 6, experimental results are presented and the performance of the proposed algorithms is evaluated and discussed for the applications of tracking and 3D reconstruction from video. Finally, in Chapter 7 the conclusions and possible improvements in this research area are discussed.

1.5 Conclusions

In this chapter we have given a brief introduction to the problem of self-calibrated 3D reconstruction from images. The advantages and disadvantages of self-calibrated methods have been explained, and the objectives and contributions were presented. Finally, a general overview of the document describing the thesis content was given.

Chapter 2

Multiple View Geometry

2.1 Preliminaries

In this chapter some computer vision basics are introduced. For a more thorough description, a book on geometry and 3D vision, for example [TV98, FL01, Atk01, HZ00a, FP02], is recommended.

2.1.1 Homogeneous Coordinates

In homogeneous or projective coordinates, the Euclidean 3D vector (x, y, z)^T is represented by k(x, y, z, 1)^T and the 2D vector (x, y)^T is written as k(x, y, 1)^T, where k ≠ 0 is a real number. In the 2D case, this means that every point is represented by a line in 3D. Homogeneous points with the last coordinate equal to zero have no counterpart in Euclidean space. These are points at infinity and play an important role in the upgrade from a projective reconstruction to a metric one. Given the Euclidean point (x/k, y/k)^T, in homogeneous coordinates this point is represented by (x/k, y/k, 1)^T, which is the same as (x, y, k)^T. As k approaches 0, the point goes to infinity in a certain direction; (x, y, 0)^T is the corresponding point at infinity (vanishing point). Using the homogeneous representation, the 2D line ax + by + c = 0 is written as k(a, b, c)^T. This means that a point k(x, y, 1)^T lies on the line k(a, b, c)^T if and only if (x, y, 1)(a, b, c)^T = 0. The intersection between two 2D lines is computed as the cross product of the two lines. Lines that are parallel in Euclidean space meet at infinity in projective space.

Projective geometry is useful for modeling the perspective mapping that occurs during the image formation process, because in this geometry the perspective transformation of a camera is expressed as a linear operation.
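As a small illustration of these operations (not part of the thesis pipeline), the incidence test and the line intersection reduce to dot and cross products in numpy; the particular lines used below are arbitrary examples.

```python
import numpy as np

# Two 2D lines ax + by + c = 0 in homogeneous form (a, b, c)
l1 = np.array([1.0, -1.0, 0.0])   # y = x
l2 = np.array([1.0,  1.0, -2.0])  # y = -x + 2

# Intersection point as the cross product of the two lines
p = np.cross(l1, l2)
p = p / p[2]                      # normalize so the last coordinate is 1
print(p)                          # [1, 1, 1] -> Euclidean point (1, 1)

# Incidence test: a point lies on a line iff their dot product is zero
assert abs(np.dot(p, l1)) < 1e-12 and abs(np.dot(p, l2)) < 1e-12

# Parallel lines intersect at a point at infinity (last coordinate 0)
l3 = np.array([1.0, -1.0, -3.0])  # y = x - 3, parallel to l1
print(np.cross(l1, l3))           # last component is 0
```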

2.2 Camera Models

In this section the projection model that maps real 3D scene points into image points is reviewed.

2.2.1 Perspective model

Let us consider the perspective model shown in figure 2.1. Every 3D scene point X = (X, Y, Z) is projected on the image plane to a point x = (u, v) through the optical center C. The optical axis is the line perpendicular to the image plane passing through the optical center. The center of radial symmetry in the image, or principal point (i.e., the point of intersection of the optical axis and the image plane), is denoted O. The distance between C (the optical center) and the image plane is the focal length f. We define the camera coordinate system as follows. The optical center of the camera is the origin of the coordinate system. The image plane is parallel to the XY plane, held at a distance f from the origin. Using basic trigonometry the following relations are derived:

u = fX / Z,   v = fY / Z

Expressed in homogeneous coordinates, the above relations become

[u; v; 1] ∼ [f 0 0 0; 0 f 0 0; 0 0 1 0] [X; Y; Z; 1]

where the relationship ∼ stands for 'equal up to a scale'.

Figure 2.1: Projective Camera Model.

Practically all available digital cameras deviate from this ideal perspective model. First, the principal point (u_0, v_0) does not necessarily lie at the geometric center of the image. Second, the horizontal and vertical axes (u and v) of the image are not perpendicular; let the angle between the two axes be θ. Finally, each pixel is not a perfect square, and consequently we have f_u and f_v as the two focal lengths measured in terms of the unit lengths along the u and v directions. Incorporating these deviations into the camera model, the transformation that maps scene points (X, Y, Z) to their image coordinates (u, v) is described as follows:

[u; v; 1] ∼ [f_u  f_v cot θ  u_0  0;  0  f_v / sin θ  v_0  0;  0  0  1  0] [X; Y; Z; 1]

In practice the 3D point is given in the world coordinate system, which differs from the camera coordinate system. The motion between these coordinate systems is given by (R, t):

[u; v; 1] ∼ [f_u  f_v cot θ  u_0;  0  f_v / sin θ  v_0;  0  0  1] [R | −Rt] [X; Y; Z; 1]

P = [f_u  f_v cot θ  u_0;  0  f_v / sin θ  v_0;  0  0  1] [R | −Rt]

K = [f_u  f_v cot θ  u_0;  0  f_v / sin θ  v_0;  0  0  1]

The 3 × 4 matrix P that projects a 3D scene point X to the corresponding image point x is called the projection matrix. The 3 × 3 matrix K that contains the internal parameters (u_0, v_0, θ, f_u, f_v) is generally referred to as the intrinsic matrix of the camera.

In back-projection, given an image point x, the goal is to find the set of 3D points that project to it. The back-projection of an image point is a ray in space. We can compute this ray by identifying two points on it. The first point can be the optical center C, since it lies on this ray; because PC = 0, C is the right null space of P. Second, the point P^+ x, where P^+ is the pseudoinverse¹ of P, lies on the back-projected ray because it projects to the point x in the image. Thus, the back-projection of x can be computed as follows:

(2.2.1)   X(λ) = P^+ x + λC

The parameter λ gives the different points on the back-projected ray.

¹ The pseudoinverse A^+ of a matrix A is a generalization of the inverse and exists for a general (m, n) matrix. If m > n and A has full rank (n), then A^+ = (A^T A)^{-1} A^T.
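A small numerical sketch of the projection P = K [R | −Rt] and the back-projection of Equation (2.2.1); the intrinsic and pose values are arbitrary illustration choices.

```python
import numpy as np

# Hypothetical intrinsics: fu = fv = 800, zero skew, principal point (320, 240)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                                   # camera aligned with the world axes
t = np.array([0.0, 0.0, -5.0])                  # camera center in world coordinates
P = K @ np.hstack([R, (-R @ t).reshape(3, 1)])  # P = K [R | -Rt]

X = np.array([1.0, 0.5, 5.0, 1.0])              # homogeneous scene point
x = P @ X
x = x / x[2]                                    # pixel coordinates (u, v, 1)

# Back-projection (Eq. 2.2.1): X(lambda) = P^+ x + lambda * C
C = np.append(t, 1.0)                           # optical center, right null space of P
P_pinv = np.linalg.pinv(P)
for lam in (0.0, 1.0, 10.0):
    X_ray = P_pinv @ x + lam * C
    x_rep = P @ X_ray
    assert np.allclose(x_rep / x_rep[2], x)     # every point of the ray projects to x
```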

2.2.2 Orthographic Model

Figure 2.2 shows the orthographic camera model. This is an affine camera model, i.e., one whose projection matrix P has a last row of the form (0, 0, 0, 1). In particular, the orthographic camera model has a projection matrix P of the following form:

P = [1 0 0 0; 0 1 0 0; 0 0 0 1] [R t; 0^T 1]

Figure 2.2: Orthographic Camera Model.

The projection of a 3D point X into the image point x is given below:

x = PX

Similar to the perspective camera, the back-projected ray is obtained as:

X(λ) = P+x + λC

However, the optical center C, which is the right null space of P, is a point at infinity in an orthographic camera.

Under the orthographic projection model, the projection (u, v) of the p-th point X = (X, Y, Z)^T in 3D space into image frame f is given by the following expression:

(2.2.2)   u = X,   v = Y

2.2.3 Lens Distortion

The linear projection equations do not take into account the lens shape, which affects the projection in a non-linear way. The lens shape causes a radial lens distortion (δ_rx, δ_ry)^T. Let (u_x, v_y)^T be the projected point without lens distortion, while (u, v)^T represents the observed coordinates, and let r = sqrt((u − u_0)^2 + (v − v_0)^2) be the radial distance from the principal point in the projected image, where (u_0, v_0) shifts the center of the image to (0, 0). The radial lens distortion [Atk01] may then be approximated by the series

(2.2.3)   δ_rx = u(K_1 r^3 + K_2 r^5 + ...),   δ_ry = v(K_1 r^3 + K_2 r^5 + ...)
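A sketch of evaluating the first two terms of this series; the direction of the mapping (from ideal to displaced coordinates) and the coefficient values are assumptions made for illustration.

```python
import numpy as np

def radial_distortion(u, v, u0, v0, K1, K2):
    """Two-term radial distortion of Eq. (2.2.3), centred on the principal
    point (u0, v0); returns the displaced coordinates."""
    x, y = u - u0, v - v0                    # shift the principal point to the origin
    r = np.sqrt(x ** 2 + y ** 2)             # radial distance from the principal point
    factor = K1 * r ** 3 + K2 * r ** 5
    return u + x * factor, v + y * factor    # add delta_rx, delta_ry

# Example: the displacement grows with distance from the image centre
print(radial_distortion(400.0, 300.0, 320.0, 240.0, K1=-1e-7, K2=0.0))
```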

The effects of radial distortion are shown in figure (2.3).

Figure 2.3: The effect of radial distortion.

2.3 Multiple View Constraints

In this section we examine the relations that arise when a single scene is imaged by two or more cameras. By analyzing these relationships, the location of an image point can be constrained to lie in a restricted image region. In addition, 3D reconstruction is solvable when a minimum of five or four real scene points are observed from two or three camera positions, respectively, with varying viewpoint.

2.3.1 Two view Geometry

Figure 2.4 shows the inherent geometric constraints of two projective cameras imaging the same scene X. Two image points x and x' are in correspondence when they are images of the same world point.

Figure 2.4: Two view geometry constraints modeled by the fundamental matrix F. The two camera centers are indicated by C and C’. The camera centers, a 3D-space point X, and its images x and x’ lie in a common plane Π. The ray defined by the first camera center, C, and the point X is imaged as a line l’. The 3D-space point X which projects to x must lie on l’.

X is a common world 3D point, x its image in the first view, x' its image in the second view, and C and C' are the two camera centers. The line segment that connects them is called the baseline. The points X, C, and C' define a plane, called the epipolar plane Π. l and l' are the epipolar lines of the two projections of X. The projections of each camera center onto the other image, e and e', are named epipoles. The relation among all these elements forms the epipolar constraint.

The fundamental matrix F is a 3 × 3 singular matrix describing the relation between two different images of the same scene. For corresponding points in two images the following equation holds: x'^T F x = 0, where x is a point in the first image and x' is the corresponding point in the second image.

Assume that the 3D point X is projected to the point x in the first image of a stereo pair. l = Fx defines a line in the second image. This is called the epipolar line, and it is the projection in the second camera of the line going through the first camera center and the 3D point X. If the point X is visible in the second camera, its image x' must lie on the epipolar line. In homogeneous coordinates a point x' lies on a line Fx if and only if x'^T F x = 0. This means that if we know the fundamental matrix, i.e. the epipolar geometry, for an image pair, stereo matching becomes much easier. To find correspondences between two images, the search is restricted to the epipolar line.

2.3.2 Fundamental Matrix estimation

The fundamental matrix can be recovered from only seven correspondences [FLM92, BS03] by means of non-linear methods. However, if 8 correspondences are known, linear algorithms exist to solve the problem; one of them is the eight-point algorithm proposed by Hartley in [Har92, Har95].

Given two corresponding points x = (u, v, 1)^T and x' = (u', v', 1)^T expressed in homogeneous coordinates, each match pair gives rise to one linear equation in the unknown entries of F:

(2.3.1)   u'u f_11 + u'v f_12 + u' f_13 + v'u f_21 + v'v f_22 + v' f_23 + u f_31 + v f_32 + f_33 = 0

From a set of n point correspondences we obtain a system of linear equations of the form

(2.3.2)   A f = [u'_1 u_1  u'_1 v_1  u'_1  v'_1 u_1  v'_1 v_1  v'_1  u_1  v_1  1;  ... ;  u'_n u_n  u'_n v_n  u'_n  v'_n u_n  v'_n v_n  v'_n  u_n  v_n  1] f = 0,

where f is a 9-vector containing the entries of the matrix F. The least-squares solution for f is the singular vector corresponding to the smallest singular value of A, that is, the last column of V in the SVD A = UDV^T. This yields the unconstrained fundamental matrix F'_c, since the rank-2 constraint has not been enforced. To obtain the correct rank-2 fundamental matrix, let the diagonal matrix obtained from the SVD of F'_c be D = diag(d_1, d_2, d_3); then the correct rank-2 fundamental matrix F is given by

(2.3.3)   F = U × diag(d_1, d_2, 0) × V^T.

In general, algebraic computations are unstable when using raw image coordinate measurements, due to large numerical variations. Hence normalization of the input data is required. Hartley [Har95] proposed to normalize by centering the measurement data at the origin and scaling so that the mean distance of the measurements from the origin is √2. The image coordinates are transformed according to x̂ = Tx and x̂' = T'x', where T and T' are normalizing transformations consisting of a translation and a scaling:

(2.3.4)   T = [1/σ_x  0  −µ_x/σ_x;  0  1/σ_y  −µ_y/σ_y;  0  0  1],

where the means µ_x, µ_y and standard deviations σ_x, σ_y are given by:

µ_x = (1/n) Σ_{i=1..n} x_i,   σ_x = sqrt( (1/n) Σ_{i=1..n} (x_i − µ_x)^2 ),
µ_y = (1/n) Σ_{i=1..n} y_i,   σ_y = sqrt( (1/n) Σ_{i=1..n} (y_i − µ_y)^2 ).

Then, after the fundamental matrix estimation, the obtained matrix F̃ must be de-normalized by F = T'^T F̃ T.
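A compact sketch of the normalized eight-point procedure described above (normalization as in Eq. (2.3.4), linear solution by SVD, rank-2 enforcement, de-normalization); the input arrays of pixel coordinates x1, x2 are assumptions of this example.

```python
import numpy as np

def normalize(pts):
    """Normalizing transform T of Eq. (2.3.4): centre at the origin, scale by 1/sigma."""
    mu = pts.mean(axis=0)
    sigma = pts.std(axis=0)
    T = np.array([[1.0 / sigma[0], 0.0, -mu[0] / sigma[0]],
                  [0.0, 1.0 / sigma[1], -mu[1] / sigma[1]],
                  [0.0, 0.0, 1.0]])
    ph = np.column_stack([pts, np.ones(len(pts))])   # homogeneous coordinates
    return (T @ ph.T).T, T

def eight_point(x1, x2):
    """Fundamental matrix from n >= 8 correspondences satisfying x2^T F x1 = 0."""
    p1, T1 = normalize(np.asarray(x1, float))
    p2, T2 = normalize(np.asarray(x2, float))
    # One row of A per correspondence, following Eq. (2.3.2)
    A = np.column_stack([p2[:, 0] * p1[:, 0], p2[:, 0] * p1[:, 1], p2[:, 0],
                         p2[:, 1] * p1[:, 0], p2[:, 1] * p1[:, 1], p2[:, 1],
                         p1[:, 0], p1[:, 1], np.ones(len(p1))])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)                 # singular vector of the smallest singular value
    U, d, Vt = np.linalg.svd(F)
    F = U @ np.diag([d[0], d[1], 0.0]) @ Vt  # enforce the rank-2 constraint (Eq. 2.3.3)
    F = T2.T @ F @ T1                        # de-normalize
    return F / F[2, 2]
```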

2.3.3 Planar Homography

The planar homography is a non-singular linear transformation that maps points between two different planes. The homography between two views plays an important role in the geometry of multiple views [TV98, HZ00b].

When a planar object is imaged from multiple viewpoints, or when a scene is imaged by cameras having the same optical center, the images are related by a unique homography. For a plane Π = (v^T, 1)^T defined by a vector v in the scene, the ray corresponding to a point X_Π projects to x' in the other image (see figure 2.5). Given the projection matrices P = [I | 0] and P' = [A | a] for the two views, the homography induced by the plane is given by (assuming Π_4 = 1, since the plane does not pass through the center of the first camera [HZ00a]):

Figure 2.5: A planar homography H maps a point x from the plane Π to a point x’ in the plane Π0.

(2.3.5)   x' = Hx   with   H = A − a v^T

If the cameras have different intrinsic matrices K' and K respectively, the homography due to the plane is given by [HZ00a]:

(2.3.6)   H = K'(A − a v^T) K^{-1}

2.3.4 Homography estimation

A homography H can be used to transfer feature points on a plane from one view to the other. A point x on the plane can be transferred to its image x' in the other view using:

(2.3.7)   x' = Hx

where H is a 3×3 matrix known up to a scale factor, and hence has only 8 degrees of freedom. H can be estimated by a linear algorithm given a set of four point correspondences (x_i, x'_i), as follows. Expanding equation 2.3.7 for a given point correspondence and normalizing with respect to the homogeneous component yields

(2.3.8)   x'_i = (h_1 x_i + h_2 y_i + h_3) / (h_7 x_i + h_8 y_i + h_9)   and   y'_i = (h_4 x_i + h_5 y_i + h_6) / (h_7 x_i + h_8 y_i + h_9)

Rearranging leads to two equations that are linear in the elements of the homography H, i.e.

(2.3.9)   A h = [u_1  v_1  1  0  0  0  −u'_1 u_1  −u'_1 v_1  −u'_1;  0  0  0  u_1  v_1  1  −v'_1 u_1  −v'_1 v_1  −v'_1;  ... ;  u_n  v_n  1  0  0  0  −u'_n u_n  −u'_n v_n  −u'_n;  0  0  0  u_n  v_n  1  −v'_n u_n  −v'_n v_n  −v'_n] h = 0;

hence, one point correspondence yields two equations, and at least four point correspondences are required to obtain a rank-deficient 8×9 matrix [HZ00a]. When more than four point correspondences are known, a least-squares estimate can solve for the unknown parameters h_i.
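A sketch of this linear (DLT) estimate of H from n ≥ 4 correspondences, following Eq. (2.3.9); the point format and the example quadrilateral are illustration choices.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H (dst ~ H * src) from n >= 4 point pairs by solving A h = 0."""
    rows = []
    for (x, y), (xp, yp) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y, -xp])
        rows.append([0, 0, 0, x, y, 1, -yp * x, -yp * y, -yp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    H = Vt[-1].reshape(3, 3)          # null vector = least-squares solution
    return H / H[2, 2]                # fix the free scale

# Usage: map the unit square to a quadrilateral
src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (2, 0.1), (2.2, 1.9), (-0.1, 2.0)]
H = homography_dlt(src, dst)
p = H @ np.array([1.0, 0.0, 1.0])
print(p[:2] / p[2])                   # approximately (2, 0.1)
```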

Number of Measurements

The matrix H contains 9 entries, but it is defined only up to scale. Thus, the total number of degrees of freedom in a 2D projective transformation is 8.

Each corresponding 2D point or line between views generates two constraints on H (Equation 2.3.8), and hence the correspondence of four points or four lines is sufficient to compute H. For a planar affine transformation, with 6 degrees of freedom, only three corresponding points or lines are required, and so on.

A conic equation provides five constraints on a 2D homography. Hence two matching conics are sufficient to recover the homography.

In practice, the salient points, lines, and conics detected in an image may be too noisy to obtain a good solution using the minimum number of them. A larger number of features is used to obtain a robust solution [HZ00a].

2.3.5 Projective Reconstruction

There are three principal approaches to recovering the structure and motion of a scene from images up to a projective transformation: 1) epipolar-geometry-based methods that merge partial results; 2) factorization methods, where all the correspondences are treated simultaneously to estimate the camera poses; and 3) robust non-linear methods.

Merging Projective matrices using Epipolar Geometry

The epipoles contain information about the extrinsic camera parameters, namely the position of the camera center C and the orientation of the optical axis. However, this information cannot be directly retrieved.

First, two images are selected and an initial reconstruction frame is set up. Then the pose of the camera for the other views is determined in this frame, and each time the initial reconstruction is refined and extended. In this way, pose estimation for views that have no common features with the reference views also becomes possible.

We define the projection matrices P and P' for the first and second views, respectively, and choose a specific canonical form for the camera matrices, in which the first camera is:

(2.3.10)   P = [I_{3×3} | 0]

Note that it is always possible to make a set of camera matrices canonical by applying a projective transformation obtained as follows: augment the first matrix P with an additional row to make it a 4 × 4 non-singular matrix P̃, and then apply the homography H ∼ P̃ to all the cameras and world points as

(2.3.11)   P^{ic} ∼ P^i H^{-1},   X_j^c ∼ H X_j

where P^i is the i-th camera and X_j the j-th world point (the superscript c denotes the transformed entities; also note that at this point we do not yet have world points, nor do we need them). Observe that the set of cameras is still not unique: we have a four-parameter choice for the last row of P̃ (for finite cameras we can use (0, 0, 0, 1)^T). For such a canonical pair of cameras, P ∼ [I_{3×3} | 0] and P' ∼ [M | m].

As seen above, there is a four-parameter choice in the set of canonical cameras. Without further proof, the general four-parameter formula for a pair of canonical camera matrices corresponding to a fundamental matrix F is given by

(2.3.12)   P ∼ [I_{3×3} | 0],   P' ∼ [[e']_× F + e' v^T | λ e']

where e' is an epipole, v is any 3-vector, and λ is a non-zero scalar; together v and λ encode the four unknown parameters.
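A sketch of building such a canonical camera pair from a given F; choosing v = 0 and λ = 1 is one arbitrary instance of the four-parameter family.

```python
import numpy as np

def skew(a):
    """Cross-product matrix [a]_x such that [a]_x b = a x b."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def canonical_cameras(F, v=np.zeros(3), lam=1.0):
    """P = [I | 0] and P' = [[e']_x F + e' v^T | lam * e'] for a given F (Eq. 2.3.12)."""
    # e' is the left null vector of F (F^T e' = 0)
    _, _, Vt = np.linalg.svd(F.T)
    e2 = Vt[-1]
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    M = skew(e2) @ F + np.outer(e2, v)
    P2 = np.hstack([M, (lam * e2).reshape(3, 1)])
    return P1, P2
```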

The Factorization Method

The factorization method described in [MHOP01] was first proposed for the orthographic camera model. The original method assumes that n feature points are observed by m orthographic cameras. Then, by stacking the n corresponding points of the m frames, a registered measurement matrix W with dimensions 3m × n is formed, as follows:

(2.3.13)   W = [x_11  x_12  ...  x_1n;  ... ;  x_m1  x_m2  ...  x_mn] = [P_1; ... ; P_m] [X_1  ...  X_n]

where X_j = (x_j, y_j, z_j, 1)^T, (j = 1, ..., n) are the unknown homogeneous 3D point vectors, P_i (i = 1, ..., m) are the unknown 3 × 4 projection matrices associated with camera i, and x_ij = (u_ij, v_ij, 1)^T are the measured homogeneous image point vectors.

Then, using Singular Value Decomposition (SVD), two matrices are computed, which represent object shape and camera motion respectively.

Mahamud et al. in [MHOP01] proposed a bilinear iterative algorithm by adding new constraints to the error function minimized by the Sturm-Triggs method [ST01]. In the Mahamud et al. method, initial projective depth values are obtained using the Kanade orthographic method [TK92b] as an initial estimate, avoiding the need to estimate projective depths (projective scale factors that represent the depth information lost during image projection) from epipolar geometry. They showed that their minimization algorithm is guaranteed to converge to a local minimum of its error function. Implementation results show that their method converges in fewer than 20 iterations and yields errors comparable to the Sturm-Triggs method [ST01].

Then, the original iterative projective factorization algorithm [MHOP01] is described as follows:

1. Compute the current scaled measurement matrix W by equation (1);

2. Normalize W by subtracting the mean of each frame from every point;

3. Perform the rank-3 factorization of W by SVD, W = USV^T, to generate an estimate of the projection matrix P and the shape matrix X: P = U_3 and X = S_3 V_3^T, where U_3, S_3 and V_3 are the sub-matrices obtained from U, S and V using only the first 3 columns (the ones associated with the 3 largest singular values), and S is a diagonal matrix whose elements σ are known as the singular values of W.

Algorithm 1 - Original projective factorization algorithm.
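A sketch of the SVD factorization pass in step 3; interpreting the per-frame normalization as a per-row mean subtraction is an assumption of this sketch.

```python
import numpy as np

def factorize_measurements(W, rank=3):
    """One factorization pass: W (3m x n) -> motion factor P (3m x rank) and
    shape factor X (rank x n), keeping only the largest `rank` singular values."""
    Wn = W - W.mean(axis=1, keepdims=True)        # per-row (per-frame) centring, assumed
    U, S, Vt = np.linalg.svd(Wn, full_matrices=False)
    P = U[:, :rank]                               # motion / projection factor
    X = np.diag(S[:rank]) @ Vt[:rank, :]          # shape factor
    return P, X                                   # P @ X is the best rank-`rank` fit to Wn
```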

Non-linear Bundle Adjustment

Ideally, to solve the structure from motion problem, the mean-squared distance between the observed image points and the point positions predicted from the parameters λ_ij, P_i and X_j should be minimized, i.e.:

(2.3.14)   E = min Σ_ij || x_ij − (1/λ_ij) P_i X_j ||^2

However, the corresponding problem is difficult since the error is highly non-linear in the unknowns λ_ij, P_i, and X_j. As with linear versus non-linear calibration algorithms, the main disadvantage of non-linear projective reconstruction methods is that they are iterative and need an initial solution close enough to the true solution to avoid local minima. This is why linear methods are still useful as a way to obtain an initial solution.
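For reference, a sketch of the reprojection residual of Eq. (2.3.14) that such a non-linear refinement would minimize over P_i, X_j, and λ_ij; the data layout (arrays indexed by view and point) is an assumption of this example.

```python
import numpy as np

def reprojection_error(x_obs, Ps, Xs, lambdas):
    """Mean squared residual of Eq. (2.3.14): observed homogeneous points
    x_obs[i, j] versus the predictions (1 / lambda_ij) * P_i X_j."""
    total, count = 0.0, 0
    for i, P in enumerate(Ps):                    # loop over cameras
        for j, X in enumerate(Xs):                # loop over 3D points
            pred = (P @ X) / lambdas[i, j]
            total += np.sum((x_obs[i, j] - pred) ** 2)
            count += 1
    return total / count
```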

2.3.6 Incremental Projective Reconstruction

Time-critical applications can accept the sub-optimal reconstruction estimates obtained by incremental approaches. There are two categories of incremental techniques. The first is based on probabilistic methods, such as extended Kalman filter theory [BCC90, MRM94, SPFP96, Dav05, DRMS07], which can model the non-linearity between structure and motion estimates. Other probabilistic approaches use the particle filter [GTS+07, KRD07, TM06].

The second category relies on the subdivision of a video stream into sub-sequences and is based on the concatenation of successive views related by epipolar geometry. This problem has been investigated by Repko and Pollefeys in [MP05, PGV+02] by counting the number of feature points tracked in successive pairs of images and analyzing the reprojection errors for pairs and triplets of views, using epipolar geometry and homographies to compute the Geometric Robust Information Criterion (GRIC) proposed in [TFZ98]. Their keyframe criterion selects two or three views where the GRIC score of the epipolar model is lower than the score of the homography model. A similar idea was presented in [GCH+02]. Recently, Martinec and Pajdla [MP05, MP06] have proposed incremental methods using triplets of images that can cope with missing data.

2.4 3D Scene Reconstruction

2.4.1 Camera Calibration

Geometric camera calibration is a necessary step for recovering the 3D position of a scene point when only its projections in two images with different viewpoints are known. Each projection defines a ray in space; the intersection of the two rays is the 3D point location.

By calibration we mean the determination of the intrinsic matrix K and the external pose parameters (R, t) of the image formation model. Sometimes radial or other distortions are modeled using additional parameters. The computed geometric model relates the 3D coordinates of a point in the scene to the 2D coordinates of the point projected into the image.

2.4.2 Triangulation

Once camera calibration is known, it is possible to compute the 3D positions of the image points observed by multiple cameras through a process called triangulation.

Figure 2.6: Calibrated Reconstruction by Triangulation.

Triangulation in its simplest form is illustrated in Fig. 2.6: the direction towards a target position X in space is determined from two different locations.

Given a three-dimensional point X, the first step consists in finding the points xl and xr projected in the left and right image planes Il and Ir respectively. Then, point X lies on line Ll joining xl and the left optical center CL and, similarly, we know that X lies along a line Lr joining xr and CR. Assuming that the camera parameters (intrinsic and extrinsic) are known, the parameters of Ll and Lr can be explicitly computed. Therefore, the point X is at the intersection of the two lines. This procedure is called triangulation.

Knowing the projection equations for each camera view, x_l = PX and x_r = P′X, these equations can be combined into the form AX = 0, which is linear in X, to find the three-dimensional location of the point X.

Thus the triangulation process can be used to reconstruct a scene from point correspondences, but only after the set of camera views has been calibrated. Finding the calibration matrices P and P′ of a set of camera views in a shared coordinate system means reconstructing the camera views. If only a projective reconstruction of the camera views can be determined, then only a projective reconstruction of the scene can be created (see Section 2.3.5).
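A minimal sketch of the linear triangulation just described, assuming the two projection matrices are known: it builds the AX = 0 system from the two projections and solves it with an SVD. All names are illustrative.

import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation from two views.

    P1, P2 are 3x4 projection matrices; x1, x2 are the matching image points.
    Returns the 3D point in inhomogeneous coordinates.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The solution is the right singular vector associated with the smallest singular value
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]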

2.4.3 Survey of Camera Calibration

Geometric calibration methods can be classified into two main groups depending on the nature of the information used:

Photogrammetric calibration

Photogrammetric methods use a calibration pattern with known geometry to calibrate the cameras. The input to the calibration algorithm is the set of 3D points of the pattern and their corresponding 2D projections. To recover the 11 parameters that define the camera projection matrix P, a cost criterion is optimized. Linear methods for camera calibration, known as DLT (Direct Linear Transform), were the first to appear [AAK71], formulating the calibration problem as the solution of a system of linear equations (see [HZ00a] or [FL01] for a detailed description and theory).

Figure 2.7: Examples of calibration patterns.

The DLT algorithm needs a minimum of 6 points to solve for the 11 unknowns of the projection matrix. Different authors have proposed algorithms using a lower number of points at the cost of not recovering all the parameters. In [QL98] 4 points are sufficient to estimate the camera pose. [Tri99] proposes a 4-point algorithm to recover the pose and the focal length, and a 5-point algorithm to recover the pose, the focal length and the principal point.
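For illustration, here is a minimal, unnormalized DLT sketch that recovers the 3 x 4 projection matrix from n >= 6 known 3D-2D correspondences; in practice the data should be normalized and the result refined non-linearly. Function and argument names are assumptions.

import numpy as np

def dlt_camera_matrix(X3d, x2d):
    """Direct Linear Transform: estimate a 3x4 projection matrix P from n >= 6
    3D points X3d (n x 3) and their image projections x2d (n x 2).

    Each correspondence contributes two rows of the homogeneous system A p = 0;
    P is the right singular vector of A with the smallest singular value.
    """
    rows = []
    for (X, Y, Z), (u, v) in zip(X3d, x2d):
        Xh = [X, Y, Z, 1.0]
        rows.append([*Xh, 0, 0, 0, 0, *(-u * np.asarray(Xh))])
        rows.append([0, 0, 0, 0, *Xh, *(-v * np.asarray(Xh))])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 4)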

DLT calibration methods have the problem that their results are not very accurate. To solve this problem, non-linear methods emerged using more complex non-linear camera models, such as those including lens distortion, and more complex cost functions [Tsa87, Bro76, Tri99, LVD98, Hei00].

The main disadvantage of non-linear methods is that they are iterative, their computation time is higher, and they need an initial solution close enough to the real solution to avoid local minima. This is why linear methods are still useful as an initial solution for the non-linear problem.

In photogrammetric calibration the cameras are calibrated before using them. Then, no changes are allowed in the camera parameters (such as focal length or zoom) while capturing the images.

Self-calibration

The second technique, namely self-calibration, does not need calibration grids. The 11 projection matrix parameters are computed using images of unknown but static scenes [FLM92, PGP96].

Self-calibration depends on the constraint that there is only one possible recon- struction consistent with both the image sequences and a priori constraints on the internal parameters.

In general three types of constraints can be applied to "self-calibrate" a camera: scene constraints, camera motion constraints and constraints on the camera intrinsics. All of these have been tried separately or in combination. In the case of a hand-held camera and an unknown scene only the last type of constraint can be used. Reducing the ambiguity of the reconstruction by imposing restrictions on the intrinsic camera parameters is termed self-calibration (in the area of computer vision). In recent years many researchers have been working on this subject. Most self-calibration algorithms are concerned with unknown but constant intrinsic camera parameters (see for example Faugeras et al. [FLM92], Hartley [Har93], Pollefeys and Van Gool [PGP96, PG97, PG99], Heyden and Åström [HA97], and Triggs [Tri99]). Recently, the problem of self-calibration in the case of varying intrinsic camera parameters has also been studied (see Pollefeys et al. [PKG98, PGV+04] and Heyden and Åström [HA99]).

The general approach to self-calibration is similar for different methods:

• Obtain the projective camera matrices;

• Update the projective camera matrices to Euclidean matrices using self- calibration constraints.

The projective camera matrices are computed using epipolar geometry from the correspondence of the same features detected in different views, the most common features being points. Updating the camera matrices from projective to Euclidean can be done in a 'direct' way or in a 'stratified' way. Direct methods pass directly from the projective form to the Euclidean form. Stratified methods first update the projective camera matrices to affine camera matrices (i.e., find the plane at infinity) and then from affine camera matrices to Euclidean camera matrices. Stratified methods are in general more robust and simplify the passage from the affine stratum to the Euclidean one.

Direct Methods

Kruppa equations. The first auto-calibration method is historically due to Faugeras et al [FLM92, MF92]. They proposed an approach based on the Kruppa equations [Kru13] and established the relation between the camera intrinsic parameters and the absolute conic.

QR decomposition. Hartley proposed a different technique for self-calibration based on the QR decomposition of the projection matrix [Har93]. This solution is more robust than using the Kruppa equations and it can be applied to any number of views.

Absolute quadric. Triggs [Tri97] proposed two algorithms (one linear and one non-linear) to directly estimate the absolute quadric from a set of images (the absolute quadric is the dual of the absolute conic). Similar equations had already been used by [HA97] but without relating them to the absolute quadric concept.

Stratified Self-Calibration Methods

Stratified methods, as opposed to direct ones, first recover the plane at infinity and then find the intrinsic parameters. The idea of separating the computation of the plane at infinity from the intrinsic parameters already appears in the methods of Hartley [Har93, AZH96] and Faugeras [FR96]. One of the best results on stratified methods is due to Pollefeys and Van Gool [PG97], who developed a complete stratified auto-calibration approach based on the recovery of the plane at infinity using the modulus constraint [MPG99].

2.4.4 Absolute Conic

By using previous knowledge (known scene points) or making some realistic assumptions, for example perpendicularity or parallelism of some scene lines or planes, or knowledge of some intrinsic camera parameters (principal point located at the center of the image, constant aspect ratio), the projective reconstruction can be transformed to obtain an affine estimate.

The absolute conic Ω is a particular conic in the plane at infinity and its equation is (2.4.1):

(2.4.1)    X^2 + Y^2 + Z^2 = 0.

All points in the conic have complex coordinates. It defines a circle of radius i = \sqrt{-1} in the projective plane, with equation x^2 + y^2 = -1 (with x = X/Z and y = Y/Z).

The conic Ω is invariant under rigid motions and under uniform changes of scale so its relative position to a moving camera is constant [FLM92]. Therefore, its image ω will be constant if the intrinsic camera parameters are constant. The conic Ω can be considered a calibration object that is always present in all scenes.

The conic Ω can be represented by the dual absolute quadric Ω′. In this case, both Ω and its supporting plane, the plane at infinity Π, are expressed through one geometric entity, and the relationship between Ω and its image ω is obtained using the projection equation for Ω′:

(2.4.2)    \omega' \approx P_i \,\Omega'\, P_i^T ,

where the operator ≈ means equality up to a scale factor, ω′ represents the dual of ω, Ω′ the dual of Ω, and P_i the projection matrix for view i (see Figure 2.8).

Figure 2.8: The absolute conic in the plane at infinity is projected to the same image location by a moving camera.

2.5 Stratified Self-calibration

After applying one of the projective reconstruction methods discussed in Section 2.3.5, the estimated scene points X_n and camera matrices P_m differ from the real Euclidean points \hat{X}_n and the real projection matrices \hat{P}_m by a projective mapping H. This means

(2.5.1)    \lambda^k P_m = \hat{P}_m H^{-1}, \qquad X_n = H \hat{X}_n .

The stratified self-calibration algorithm due to Pollefeys and van Gool [PG97] uses the rigidity constraint about the scene and assumes knowledge about some intrinsic camera parameters to restrict the space of possible H, and estimates a stratified reconstruction updating from projective to affine and then to Euclidean.

A general projective transformation H can be uniquely decomposed as follows:

(2.5.2)    H = H_P H_A H_E = \begin{bmatrix} I_3 & 0_3 \\ \omega^T & 1 \end{bmatrix} \begin{bmatrix} K & 0_3 \\ 0_3^T & 1 \end{bmatrix} \begin{bmatrix} R & -Rt \\ 0_3^T & 1 \end{bmatrix} ,

where H_P is an element of the 3D projective group and [\omega^T\; 1]^T represents the plane at infinity in the original scene. Once we know H_P, the scene is known up to an affine transformation H_A H_E; that is, the recovered scene is an affine reconstruction.

H_A represents an element of the 3D affine group. The upper triangular matrix K describes anisotropic scaling (3 diagonal entries) and a skewing of the coordinate axes (3 off-diagonal entries).

H_E represents an element of the 3D Euclidean group. The rotation matrix R (RR^T = I) represents rotation and the vector t translation. Once we know H_E, we know the scene in absolute world coordinates.

Any projective reconstruction can be transformed by some projective transformation such that one camera projection matrix equals [I_3|0_3]. Assume further that the original camera matrices have the form:

(2.5.3)    P^k = K^k [R^k \,|\, -R^k t^k] .

It can be verified that if, e.g., P^1 = I and (2.5.3) holds, then

(2.5.4)    K = K^1, \quad R = R^1, \quad t = t^1 .

2.5.1 Affine Stratification

Knowing at least 3 scene points that are at infinity in the original scene, the plane at infinity, represented by the vector [-\omega^T\; 1]^T, can be estimated, thus recovering H_P. These points can be found as the intersections of planes or lines that are known to be parallel in the original scene.

From (2.5.1), (2.5.2) and (2.5.3),

(2.5.5)    \frac{1}{\lambda^k} K^k [R^k \,|\, -R^k t^k] = P^k \begin{bmatrix} KR & -KRt \\ \omega^T & 1 \end{bmatrix} .

By multiplying the equation by the matrix Ψ = diag([1 1 1 0]^T) and multiplying each side by its own transpose, the following expression is obtained:

(2.5.6)    (\lambda^k)^{-2} K^k K^{kT} = P^k \begin{bmatrix} K \\ \omega^T \end{bmatrix} [K^T \;\; \omega] \, P^{kT} .

Assume (i) all cameras have zero skew and aspect ratio equal to 1, and (ii) the principal points are approximately known for all cameras. Then, K^k can be transformed (transforming image points accordingly) to K^k = diag([f^k f^k 1]^T). Substituting this into (2.5.6) and assuming P^1 = [I_3|0_3] yields

(2.5.7)    (\lambda^k)^{-2} K^k K^{kT} = P^k \begin{bmatrix} (f^1)^2 & 0 & 0 & f^1\omega_1 \\ 0 & (f^1)^2 & 0 & f^1\omega_2 \\ 0 & 0 & 1 & \omega_3 \\ f^1\omega_1 & f^1\omega_2 & \omega_3 & \omega^T\omega \end{bmatrix} P^{kT} .

Due to the fact that c^k_{13} = c^k_{23} = 0 and c^k_{11} = c^k_{22}, where (\lambda^k)^{-2} K^k K^{kT} = [c^k_{ij}], this yields a linear system in the 5 unknowns (f^1)^2, f^1\omega_1, f^1\omega_2, \omega_3 and \omega^T\omega. These unknowns can be computed from n > 2 views.

2.6 RANSAC computation

The linear methods for estimating the epipolar geometry and planar homography transformations are sensitive to image errors and to outliers resulting from mismatching.

However, by random selection of subsets during the implementation of a linear method, it is possible to obtain the correct solution.

RANSAC is an algorithm for robust model fitting that works by repeatedly selecting the minimum sample set required to instantiate the model. Model hypotheses computed from samples containing outliers are rejected since they do not generate sufficient consensus.

Assume that the whole set of data may contain up to a fraction ε of outliers. Then the probability that all N data in a subset are good is (1 − ε)^N, and the probability that all of s different subsets contain at least one outlier is (1 − (1 − ε)^N)^s. So the probability that at least one random subset has no outliers is given by

(2.6.1)    P = 1 - (1 - (1 - \varepsilon)^N)^s .

Thus the number of iterations needed is computed as

(2.6.2)    s = \frac{\ln(1 - P)}{\ln(1 - (1 - \varepsilon)^N)} .
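As a small worked example, equation (2.6.2) can be evaluated directly; the function name and the default confidence value below are illustrative.

import math

def ransac_iterations(eps, N, P=0.99):
    """Number of random samples s (eq. 2.6.2) so that, with probability P,
    at least one size-N sample is outlier free, given an outlier fraction eps."""
    return math.ceil(math.log(1.0 - P) / math.log(1.0 - (1.0 - eps) ** N))

# e.g. fundamental-matrix estimation with the 8-point algorithm (N = 8) and
# 40% outliers requires about ransac_iterations(0.4, 8) samples (roughly 272).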

2.7 Conclusions

This chapter covered the relationships between two or more cameras imaging a common scene, or equivalently one camera that moves around a static scene. We have shown the geometric setting for two views, the epipolar geometry, which gives rise to the fundamental matrix. It encodes the constraint that the image point corresponding to a point in another image must lie on a specific image line. This can be used for matching more feature points in the tracking phase. Although not presented here, there is a similar relationship for three views called the trifocal tensor.

For multiple views we have presented two algorithms that provide a complete reconstruction based on point correspondences: the epipolar-geometry-based method and the perspective factorization method. However, the reconstruction obtained with these methods, named "projective reconstruction", is not a metric one due to an ambiguity which makes the creation of realistic new views impossible. We have also seen how this ambiguity can be resolved, allowing us to upgrade the projective reconstruction to a metric reconstruction. This level of reconstruction is sufficient to create realistic images of the reconstructed scene from new viewpoints, because the relative distances between points are appropriately recovered and differ from the real scene structure only by a scale factor.

Chapter 3

The Correspondence Problem

3.1 Introduction

The correspondence problem remains an important topic in computer vision since matching feature points between images is a fundamental step in many computer vision applications such as detection of moving objects, object recognition, motion segmentation, image compression, surveillance, image registration, recovery of 3D scene structure, and the synthesis of new camera views.

In this section we present an overview of the problem of salient point correspondence. Then, a survey of state-of-the-art algorithms to solve this problem is given. Special emphasis in the survey is placed on the constraints that each algorithm imposes to remove many of the mismatches found while keeping most of the correct matches. Figure 3.1 shows the difficulty of solving the correspondence problem when images are captured from different viewpoints. Note how the image regions of the same scene element have different scale and shear effects due to perspective distortion.

Figure 3.1: The correspondence problem. Two corresponding image regions of the same scene element have different appearance due to projective distortion.

3.2 Feature Correspondence Overview

The correspondence problem, also called point matching, consists in finding points in different images that correspond to the same scene element. To simplify the discussion we consider the case when a pair of images is available, but the discussion can be extended to multiple-image setups.

Finding a dense correspondence mapping given two images of the same scene taken under different camera setup conditions is difficult even for static scenes due to the following issues:

• Occlusions. Many feature points are visible in only one of the two images.

• Noise. Even when the same set of feature points is observed in both images according to human image interpretation, the feature points in one image may show strong photometric variations such as blurring effects, pixelization due to discretization, and image noise introduced in the capture process, which prevent appropriate matching between the same features.

• Projective deformations. Images taken with a moving camera subject to general rigid motion generate a set of images differing by general projective transformations.

• Aperture problem. For texture-less image areas, such as feature points located on an edge, multiple regions are putative candidates and it is then difficult to assign correct matching pairs.

Algorithms for matching correspondences can be broadly divided according to the change in viewpoint between images. There are methods that assume short displacement between images. Then, by imposing this constraint, the motion between salient points can be expressed by a pure translation model, simplifying to some extent the correspondence problem. The goal of stereo-like methods is to compute a dense disparity map. Recent surveys of stereo matching algorithms can be found in [SS02, BBH03].

For applications such as self-calibrated 3D reconstruction, even when the images come from a moving video camera, feature correspondences are needed between widely separated viewpoints. Under wide-baseline conditions the correspondence problem is even more difficult since lighting conditions, deformation due to perspective effects, image noise and different camera responses tend to modify corresponding image regions between successive images. Then the goal is limited to finding a sparse set of correct correspondences between so-called salient points. A salient region is an image area where gradient variation is observed in at least two principal directions, although blob regions and multi-junction regions are also taken as salient regions.

Three important issues in feature matching methods are: the interest point de- tector algorithm, the local descriptor and, the similarity metric that matches descriptors of the reference and target images.

3.3 Salient point detection

The first step consists in automatically detecting regions of the images that are sufficiently different from their neighbors. Salient point detectors find image areas with high variance. The variance along different directions, computed using all pixels in a window centered around a point, is a good measure of distinctiveness.

Before describing the most representative interest point detectors, we list the requirements for an optimal interest operator as proposed by Haralick and Shapiro [HS92]:

• Distinctiveness: An interest point should stand out clearly against the background and be unique in its neighborhood.

• Invariance: The determination should be independent of geometric and radiometric distortions.

• Stability: The selection of interest points should be robust to noise and blunders.

• Uniqueness: Beside from local distinctiveness an interest point detector should also possess a global uniqueness, in order to improve the distinction of repetitive patterns.

• Interpretability: Interest values should have a significant meaning, so that they can be used for correspondence analysis and higher image interpreta- tion.

3.3.1 Pioneer Feature Detectors

Interest point detectors can be characterized by the type of features they select.

First Derivative Methods

The initial interest point detection approach works on a single image scale [Mor79]. The Moravec corner detection algorithm computes an un-normalized local autocorrelation function of the image in four directions (horizontal, vertical, and the two diagonals) and takes the lowest result as the interest measure of the surrounding area. Therefore, it detects points where there are large intensity variations in every direction.

The Harris detector [HS88] is based on the second moment matrix, also called the auto-correlation matrix, which is often used for feature detection or for describing corner-like regions. This matrix describes the gradient distribution in a local region around a point x:

(3.3.1)    M(x, \sigma_I, \sigma_D) = g(\sigma_I) * \begin{bmatrix} I_x^2(x, \sigma_D) & I_x I_y(x, \sigma_D) \\ I_x I_y(x, \sigma_D) & I_y^2(x, \sigma_D) \end{bmatrix} ,

where the local image derivatives [I_x, I_y] = [∂I/∂x, ∂I/∂y] are computed with Gaussian kernels of scale σ_D (differentiation scale). Then, the squared derivatives are smoothed with a Gaussian filter g(σ_I) of scale σ_I (integration scale).

The eigenvalues (λ_1, λ_2) of this matrix are the principal curvatures of the auto-correlation function and represent the two principal signal changes in a neighborhood of the point x. This property enables the extraction of points for which the magnitude of both directional gradients is high, that is, the signal change is significant in orthogonal directions.
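The following NumPy/SciPy sketch computes the Harris response from the second moment matrix (3.3.1) using Gaussian derivative and integration filters as described above; the default scales and the k = 0.04 constant are illustrative, not the thesis settings.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma_d=1.0, sigma_i=2.0, k=0.04):
    """Harris cornerness from the second moment matrix (eq. 3.3.1).

    Derivatives at scale sigma_d, squared terms smoothed at scale sigma_i;
    det(M) - k*trace(M)^2 is large where both eigenvalues are large.
    """
    # Gaussian derivative filters (differentiation scale)
    Ix = gaussian_filter(image, sigma_d, order=(0, 1))
    Iy = gaussian_filter(image, sigma_d, order=(1, 0))

    # Smoothed squared derivative terms (integration scale)
    Ixx = gaussian_filter(Ix * Ix, sigma_i)
    Iyy = gaussian_filter(Iy * Iy, sigma_i)
    Ixy = gaussian_filter(Ix * Iy, sigma_i)

    det_M = Ixx * Iyy - Ixy ** 2
    trace_M = Ixx + Iyy
    return det_M - k * trace_M ** 2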

Figure 3.2: Feature points detected by the Harris operator.

Various methods, including KLT [KT91] and Förstner [FG87], follow a similar idea by evaluating the cornerness strength of each point through the eigenvalues of the auto-correlation matrix, either explicitly or via the determinant and the trace of the second moment matrix.

In the KLT detector a feature is detected if the two eigenvalues of an image patch are bigger than an empirically computed threshold. In the Förstner method, local statistics automatically estimate the thresholds used to select corner points.

The Förstner method [FG87] is also based on the auto-correlation function. Pixels are classified into interest points, edges or regions. The method separates the detection and localization stages into the selection of windows in which features are known to reside, and feature location within the selected windows. Further statistics computed locally allow the thresholds for the classification to be estimated automatically.

Second derivative methods

A similar approach to detect salient regions computes the Hessian of an image with second-order partial derivatives. These derivatives encode shape information by providing a description of blob-like regions derived from the shape of the Hessian filter (a Mexican-hat type of filter). Second-derivative salient region detectors are based on the determinant and the trace of this matrix, similar to the first-derivative salient detectors. Examples of Hessian local feature extractors are those proposed by Beaudet [Bea78] and Kitchen and Rosenfeld [KR82]. The trace of this matrix is the Laplacian filter, which is often used as an isotropic detector [TP86].

The Beaudet [Bea78] corner detection algorithm computes the determinant of the Hessian to detect salient points at which the gray-level surface is neither a maximum nor a minimum:

Table 3.1: Cornerness strength functions for similar corner detectors

    Detector    Cornerness strength function
    Harris      C_Harris   = det(M) - 0.04 trace(M)^2
    KLT         C_KLT      = λ_2
    Förstner    C_Förstner = det(M) / trace(M)

(3.3.2)    C_{Beaudet} = I_{xx} I_{yy} - I_{xy}^2 .

The Kitchen and Rosenfeld detector [KR82] models the intensity as a continuous function and estimates the product of the curvature of the intensity contour line and the edge strength at each point in the image:

(3.3.3)    C_{KR} = \frac{I_{xx} I_y^2 + I_{yy} I_x^2 - 2 I_{xy} I_x I_y}{I_x^2 + I_y^2} .
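For comparison, a short sketch of the two second-derivative responses (3.3.2)-(3.3.3) computed with Gaussian derivative filters; the smoothing scale and the small constant guarding the denominator are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def second_derivative_responses(image, sigma=1.5):
    """Hessian-based (Beaudet) and curvature-based (Kitchen-Rosenfeld) cornerness."""
    Ix  = gaussian_filter(image, sigma, order=(0, 1))
    Iy  = gaussian_filter(image, sigma, order=(1, 0))
    Ixx = gaussian_filter(image, sigma, order=(0, 2))
    Iyy = gaussian_filter(image, sigma, order=(2, 0))
    Ixy = gaussian_filter(image, sigma, order=(1, 1))

    c_beaudet = Ixx * Iyy - Ixy ** 2                 # determinant of the Hessian, eq. (3.3.2)
    c_kr = (Ixx * Iy ** 2 + Iyy * Ix ** 2 - 2 * Ixy * Ix * Iy) / (Ix ** 2 + Iy ** 2 + 1e-12)
    return c_beaudet, c_kr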

Local energy methods

The local energy based methods [MO87, Kov00, CJ02] use the principal moments of the phase congruency information to determine corner and edge information. Phase congruency is a dimensionless quantity and provides information that is invariant to image contrast. This allows the magnitudes of the principal moments of phase congruency to be used directly to determine the edge and corner strength. The minimum and maximum moments provide feature information in their own right; it is not necessary to analyze their ratios. If the maximum moment of phase congruency at a point is large then that point should be marked as an edge. If the minimum moment of phase congruency is also large then that point should also be marked as a corner.

Detectors of junction regions

Another alternative for salient point detection is to detect junction regions. Illustrative examples of this approach are:

The SUSAN detector [SB97] takes its name from Smallest Univalue Segment Assimilating Nucleus. It is based on extracting the segment of a small circular image mask having approximately the same intensity as the central pixel (or nucleus). This segment (called the USAN) has an area which is smaller at a corner between two regions of unequal intensity (about one quarter of the total area for a perpendicular corner), increases to one half on an edge, and becomes the entire area in the center of a region with uniform intensity. 2D interest features are detected by computing the size, centroid, and second moment of this area.

The Sojka [Soj03] algorithm measures the variance of the gradient directions. Then, the algorithm computes the angle between the segments intersecting at the candidate points by using a Bayesian estimation technique. This approach deals with the problem of detecting corner points whose neighborhood contains another partial corner region by considering only relevant self-corner areas during the angle computation.

A recent study [TS04] of the corner stability and corner localization properties of the features extracted by different algorithms suggests that Hessian-based salient region detectors are more stable across different image data sets.

3.3.2 Invariant Feature Detectors

Recent progress in the development of feature detectors has aimed at the extraction of salient regions invariant to image scale and rotation, and partially robust to projective distortion due to changes in viewpoint, addition of noise, and illumination variations.

Automatic scale selection has been studied by Lindeberg [Lin98]. The idea is to select the characteristic scale, for which a given feature detector has its maximum response over scales. The size of the region can be selected independently of image resolution for each salient region. The Laplacian operator is used for scale selection since it gives the best results in the experimental comparison in [Lin98].

An example of a viewpoint-invariant corner detector is the one proposed in [KB01], which maximizes the entropy within the region. Mikolajczyk and Schmid [MS02] proposed the Harris-Laplace and Harris-Affine detectors, which are extensions of the Harris corner detector. Tuytelaars and Van Gool [TG04] proposed the use of edge-based regions, which also exploits Harris corners on nearby edges. Focusing on speed, [Low99] proposed to approximate the Laplacian of Gaussians (LoG) by a Difference of Gaussians (DoG) filter. An excellent review of the state of the art in feature detectors can be consulted in [MS04, MS05a]. Figure 3.3 shows an example of the keypoints detected by the SIFT algorithm ([Low99]).

Figure 3.3: Salient points detected by the SIFT operator. Arrow lengths indicate the keypoint scale.

3.4 Salient point Descriptor

The description of local image patterns is the next step, after the extraction of features. The objective of the salient point descriptor is to obtain a compact and complete feature description that enables a similarity measure to be applied. The descriptors should capture the information about the shape and the texture of a local structure. The information content and the invariance are two important properties of local descriptors.

Many different techniques for describing local image regions have been developed. The determinant and the trace of the Hessian matrix have been used directly for describing interest points by Koenderink [Koe84] and Schmid [SM97]. One commonly used descriptor is a window of image pixels enclosing the salient region. Mikolajczyk and Schmid [MS02] have proposed the GLOH descriptor and compared it against state-of-the-art salient point descriptors, including steerable filters [FA91], shape context [BMP02], moment invariants [FS93], complex filters [SZ02], and SIFT [Low99]. The SIFT descriptor has shown its superiority with respect to the other feature descriptors when images are subject to local transformations like scale, rotation, and partial 3D pose changes.

3.4.1 SIFT descriptor

The Scale Invariant Feature Transform (SIFT) developed by Lowe [Low04] has 4 major stages:

1. Scale-space extrema detection: The first stage searches over scale space using a Difference of Gaussians function to identify potential interest points.

2. Keypoint localization: The location and scale of each candidate point is de- termined and Keypoints are selected based on measures of stability.

3. Orientation assignment: One or more orientations are assigned to each Key- point based on local image gradients.

4. Keypoint descriptor: A descriptor is generated for each Keypoint from local image gradients information at the scale found in stage 2.

The SIFT feature algorithm is based upon finding locations within the scale space of an image which can be reliably extracted. The first stage finds scale-space extrema of the Difference of Gaussians (DoG) function, which can be computed from the difference of two nearby scaled images separated by a multiplicative factor k. The DoG scale space is sampled by blurring an image with successively larger Gaussian filters and subtracting each blurred image from the adjacent (more blurred) image

(3.4.1) D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ),

where L(x, y, σ) is the scale space of an image, built by convolving the image I(x, y) with the Gaussian kernel G(x, y, σ). Points in the DoG function which are local extrema in their own scale and in the scales above and below are extracted as keypoints. For a rotation-invariant descriptor, the local image gradients are used to determine the main orientation of each region. For each image sample L(x, y), the gradient magnitude m(x, y) and orientation θ(x, y) are computed using pixel differences:

(3.4.2)    m(x, y)^2 = [L(x + 1, y) - L(x - 1, y)]^2 + [L(x, y + 1) - L(x, y - 1)]^2

(3.4.3)    \theta(x, y) = \tan^{-1} \frac{L(x, y + 1) - L(x, y - 1)}{L(x + 1, y) - L(x - 1, y)} .

The resulting feature descriptor is a 128-element vector with a total support window of 16 × 16 scaled pixels. For a more detailed explanation of SIFT see [Low04].
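The core scale-space quantities of the first and third SIFT stages can be sketched as follows for a single scale; a full implementation builds an octave pyramid and searches for 3D extrema. sigma, k and the pixel-difference gradients follow equations (3.4.1)-(3.4.3); everything else (names, defaults) is illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_and_gradients(image, sigma=1.6, k=2 ** 0.5):
    """DoG response (eq. 3.4.1) and gradient magnitude/orientation (eqs. 3.4.2-3.4.3)."""
    L1 = gaussian_filter(image, sigma)       # L(x, y, sigma)
    L2 = gaussian_filter(image, k * sigma)   # L(x, y, k*sigma)
    dog = L2 - L1                            # D(x, y, sigma)

    # pixel-difference gradients of the blurred image
    dx = np.zeros_like(L1); dy = np.zeros_like(L1)
    dx[:, 1:-1] = L1[:, 2:] - L1[:, :-2]
    dy[1:-1, :] = L1[2:, :] - L1[:-2, :]
    magnitude = np.hypot(dx, dy)
    orientation = np.arctan2(dy, dx)
    return dog, magnitude, orientation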

3.5 Matching salient points

Using one of the simplest descriptors, a rectangular window of pixels around a salient point [LK81], cross-correlation can be used to compute a similarity score between two descriptors. However, since window-based descriptors are sensitive to the size of the descriptor, and to deal with occlusions, variations of the algorithm have been proposed using sliding and adaptive local windows [BI99, NMSO96, OK92] and hierarchical matching strategies [GD05]. Focusing on speed, the sum of absolute differences and simple correlation have been used. Other common matching metrics are the sum of squared differences [AG93], Census [ZW94], and Rank metrics [BN98].

3.6 Geometric Constraints for Matching

The correspondence problem is especially difficult between two wide-baseline frames due to large appearance variations and illumination changes. Furthermore, when the scene contains repetitive patterns in local image regions the problem is still harder. To address these issues, constraints on the spatial locations of putative matches have been considered to improve matching results. Methods that employ geometric constraints to refine the matching assignment can be divided according to the geometric entity that restricts the validity of a match between salient points. There are methods that compute the relation between pairs of images based on the epipolar geometry. Some methods rely on the relation between triplets of images expressed by the trifocal tensor. Another category of methods computes a homography, assuming that salient regions are located on a dominant plane of the scene. Finally, there are methods based on a rigid transformation relating salient regions.

Zhang et al. [ZDFL95a], and Schmid and Zisserman [SZ97], assume that either the epipolar geometry of two views or the trifocal geometry of three views is known. Faugeras and Robert [FR96] assume the availability of the epipolar geometry among three views to predict the location in the third view of features that are matched between the first two views. Related to the previous method is the work reported in [LF94, AS98], which synthesizes novel views based on a set of reference views using knowledge of the associated epipolar and trifocal geometry, respectively.

A homography is a 3 × 3 matrix capable of describing the projective transformation of points viewed on a plane between images. Georgis et al. in [GPK98] require that the projections of four corresponding coplanar points at arbitrary positions are known. Meer et al. [MLR98] employ projective and permutation invariants to obtain representations of coplanar point sets that are insensitive to both projective transformations and permutations in the labeling of the set. Pritchett and Zisserman [PZ98] rely on the existence of suitable coplanar feature structures, namely parallelograms, to estimate local plane homographies which are then used to compensate for viewpoint differences and generate putative point matches.

The estimation of homographies between two images is discussed by Hartley and Zisserman [HZ00a].

3.7 The importance of Gaussian Integration Scale and Derivative filters

To properly detect corner points in images using the Harris algorithm, it is important to analyze the derivative filter used and the selection of the integration scale of the Gaussian filter. A small integration scale makes the algorithm susceptible to noise but improves the localization accuracy, retrieving corners near intersecting edges, whereas a large integration scale smooths the noise in the partial derivatives but sacrifices localization accuracy.

Rohr [Roh94, Roh05] has analyzed the negative effects of a small aperture angle between 3D straight-line edges and of an increased size of the Gaussian smoothing filters on the localization properties of corner detectors. The study showed that the absolute deviation of corner points decreases as the aperture angle between edges tends to 90 degrees, while for small angles significant deviations occur.

In applications of salient point detection algorithms, like multiple feature tracking, it is desirable to localize corner points near the edge intersections, since such corner points are less ambiguous for feature matching and more accurate in terms of localization.

Two derivative filters are commonly cited in the literature: the Gaussian derivative filter and the Sobel operator, which is a numerical approximation of the analytical derivative function. However, the Sobel operator has the inconvenience that it magnifies the effect of noise. To illustrate the importance of using the Gaussian derivative, figure 3.4 shows the effect of the Sobel operator and the Gaussian derivative on the smoothed squared terms of the Harris operator in a synthetic noise-free image.

Observe that using the Sobel operator important artifacts appear on diagonal edges (spurious edges and local discontinuities along the edge profile), while with Gaussian derivatives the presence of noise in the squared smoothed terms is significantly reduced. However, when images are corrupted with noise, even Gaussian derivatives can present some image artifacts before the appropriate integration scale is selected.

Figure 3.4: The effect of the Sobel operator on the smoothed squared terms of the Harris operator (left). The result of using the Gaussian derivative with scale σD = 9 pixels and σI = 2 × σD (right).

Figures 3.5 and 3.6 show comparative results for the Harris corner detector when the Sobel convolution mask replaces the Gaussian derivative. Corners are plotted using a color scale; a lighter corner corresponds to a higher cornerness strength value.

Note that, using the Sobel operator, the Harris algorithm assigns high cornerness strength values to isolated edges even in noise-free images. Moreover, when images are corrupted with noise, some problems persist even when using Gaussian derivative filters (see figure 3.6, where low-significance salient points are highlighted by a bounding box), since some corners are recovered in image regions with scarce texture information.

Figure 3.5: The effect of using different derivative filters in the Harris algorithm. The Sobel operator (a) and the Gaussian derivative with parameters σD = 9 pixels and σI = 2 × σD (b).

3.8 Cov-Harris: Improved Harris corner Detection

We explore the idea of using angular information for corner detection, similar to the approaches proposed in [Ros96, ZKPF99, Soj03], where a polar mapping is used to find the subtended angle of dominant edges. The key difference of the algorithm presented in this section (Cov-Harris) is that the dominant orientation is directly estimated from the covariance matrix of each squared partial derivative term.

For every putative corner point arising from the original Harris corner detector with the standard k = 0.04 value empirically obtained in [HS88], the new algorithm adds directional gradient information to rank corner points, as described in the following sections.

3.8.1 Segmentation of Partial Derivatives

Given a putative corner point selected using the Harris function 6.3.2 (see figure 3.8 for an example), its associated image region I, and squared partial derivatives I_m = {I_x^2, I_y^2}, the Cov-Harris algorithm extracts the angular difference between dominant edges by analyzing each squared smoothed term.

Figure 3.6: The effect of using the Sobel Mask and the Gaussian derivative in the Harris algorithm.

Figure 3.7: Example of a probable corner point and its associated squared partial derivatives.

A proper corner consists of two adjacent edges with different angular orientations. Each squared partial derivative can be segmented to obtain the image coordinates (x, y) where the gradient magnitude is greater than a threshold value, I_m > Threshold_m, defined as follows:

(3.8.1)    Threshold_m = mean(I_m)

Gaussian derivative filters highlight edges in the horizontal and vertical directions in their corresponding squared partial derivatives I_x^2 and I_y^2. Then, since corner points are presumably centered on intersecting edges, we divide the segmented region centered on the corner candidate into two symmetrical windows, as shown in figure 3.8. Then, only the (x, y) coordinates of the region with the larger number of pixels represent a dominant edge for each partial derivative.

Figure 3.8: A probable corner point and the segmentation process to extract dominant edges (Top). Dominant edge estimation from the covariance matrices for each partial derivative and the angular difference between dominant edges.

3.8.2 Edge direction estimation by Covariance Matrix

Two arrays of image coordinates can be defined as X = (x_{Ix}, y_{Ix}) and Y = (x_{Iy}, y_{Iy}), representing the dominant edges for each squared partial derivative; their corresponding covariance matrices are then estimated as follows:

(3.8.2)    Cov_X = (X - mean(X))^T (X - mean(X)), \qquad Cov_Y = (Y - mean(Y))^T (Y - mean(Y)),

where each covariance matrix has the form:

(3.8.3)    Cov_{m=X,Y} = \begin{bmatrix} x^2 & xy \\ xy & y^2 \end{bmatrix} .

Thus, the covariance matrix defines an ellipse whose major axis (the direction of maximum variance) corresponds to the dominant edge direction for the associated partial derivative.

By analyzing the covariance matrices individually we can obtain the orientations of the two most dominant edges in a region window, one edge for each partial derivative.

The angular orientation of a dominant edge can be estimated directly from the covariance matrix and it is given by:

(3.8.4)    \theta_m = \tan^{-1} \frac{2xy}{y^2 - x^2} .

Then, the angular difference between the dominant edges is directly computed from θ_m by (3.8.5):

(3.8.5)    \theta_{Dif}(\theta_{mx}, \theta_{my}) = |\theta_{mx} - \theta_{my}| .

3.8.3 Ranking Corner Points by the Angular difference between dominant edges

Intuitively, corners are better defined as the difference between dominant edge directions approaches 90 degrees. This important fact is not considered in the cornerness strength functions of previous corner point detectors. We show that by adding a weighting factor corresponding to the difference in edge directions, the resulting Harris-based function ranks corner-like points better than the original function. In addition, the proposed cornerness function (3.8.6) assigns lower strength values to corner points located where discrete-derivative artifacts exist in images, characterized by edge discontinuities and unsharp edge regions.

(3.8.6)    C_{str} = \log(\det(M) - k\,\mathrm{trace}(M)^2) \cdot \theta_{Dif}(\theta_x, \theta_y),

where (θ_x, θ_y) are in degrees. Proper threshold setting is key for finding coherent corner points. When false corner responses arise due to image noise, the problem can partially be solved by over-smoothing the images at the cost of reducing localization accuracy. However, when the partial derivatives present overlapping responses it is especially difficult to set an appropriate threshold value, since a high value will discard corner regions with average gradient responses while a lower threshold will bring up corner points in noisy regions.

To solve this problem we first apply the logarithmic function to the original Harris response, which has the effect of normalizing the corner response and avoiding large variations. Then, the log(Harris) strength function is weighted by the angular difference between dominant edges θ_Dif(θ_x, θ_y), which yields stronger corner responses as the putative corner regions define L-shaped junction corners. The proposed algorithm assigns a higher cornerness strength value when the angular difference between dominant edges is approximately 90 degrees and decreases the corner value as the angular difference between edges tends to zero. Then, by setting a threshold value greater than zero, many spurious corner points are discarded.
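A compact sketch of the proposed weighting, assuming the segmented pixel coordinates of each squared partial derivative are already available. Here the dominant edge direction is taken as the major axis of the coordinate covariance matrix, which is one way to realize (3.8.4); the final strength follows (3.8.6). This is an illustration of the idea with assumed names, not the thesis implementation.

import numpy as np

def dominant_edge_angle(coords):
    """Dominant edge orientation (degrees) from the covariance matrix of the
    (x, y) coordinates of a segmented squared-derivative region, taken as the
    direction of the eigenvector with the largest eigenvalue."""
    c = np.cov(coords.T)
    eigvals, eigvecs = np.linalg.eigh(c)
    major = eigvecs[:, np.argmax(eigvals)]
    return np.degrees(np.arctan2(major[1], major[0])) % 180.0

def cov_harris_strength(harris_value, coords_x, coords_y):
    """Cov-Harris cornerness (eq. 3.8.6): log-normalized Harris response weighted
    by the angular difference between the dominant edges of Ix^2 and Iy^2.
    Assumes harris_value > 0 (a putative corner) and n x 2 coordinate arrays."""
    theta_x = dominant_edge_angle(coords_x)
    theta_y = dominant_edge_angle(coords_y)
    theta_diff = abs(theta_x - theta_y)
    return np.log(harris_value) * theta_diff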

Figures 3.9 and 3.10 show the result of using the angular difference between dominant edges to weight the log(Harris) corner strength function.

The algorithm was compared to the original Harris algorithm when both the Sobel operator and the Gaussian derivative filters are used, on the same test images shown in figures 3.5 and 3.6. The detected corners are displayed along with their associated corner strength value using a color scale, sorting corners in ascending order. A lighter salient point location corresponds to a higher cornerness value and vice versa.

Note how the proposed Cov-Harris response function assigns low significance values to corner regions located on an edge, where both partial derivatives define similar dominant edge directions.

Figure 3.9: Comparative results between the Harris (a and b) and Cov-Harris (c and d) algorithms. The effect of using in both algorithms the Sobel operator and the Gaussian derivative with parameters σD = 9 pixels and σI = 2 × σD.

3.9 Discussion

In this chapter an overview of state-of-the-art algorithms for the correspondence problem has been presented. Three relevant subproblems were described and a variety of different approaches to solve them were presented. Concerning the matching of putative salient points in different images, the importance of applying geometric constraints to reject false matching pairs has been reported in previous works.

In the salient point detection stage, the majority of previous approaches have employed the gradient magnitude of partial derivatives to compute a measure of salient point strength. However, edges are not fully described by the edge magnitude. Additional edge features such as edge direction and gradient normal have scarcely been considered for salient point detection. In addition, since salient point detection for wide-baseline images remains a difficult task, recent efforts have focused on developing invariant salient point detection algorithms robust to scale and small viewpoint variations. Since these recent salient point detection algorithms are based on the pioneering works, they suffer from the same problems as the former approaches.

Figure 3.10: Comparison between Harris (top) and Cov-Harris (bottom). The effect of using the Sobel operator and the Gaussian derivative with parameters σD = 9 pixels and σI = 2 × σD.

In addition, there is a topic that has remained largely unaddressed in previous works: the ranking of salient points. Since a common indoor/outdoor image has between 200 and 2000 salient points, the ranking of salient points according to the corner quality function is as important as the localization, repeatability, consistency, and other similar quality metrics that have been used to evaluate salient point algorithms. Having features well sorted by the corner strength function can avoid the construction of complex descriptors in false corner regions and, at the same time, heavily reduce the computational cost of solving the subsequent matching stage.

Chapter 4

IC-SIFT: Robust Feature Matching Algorithm

4.1 Introduction

Matching salient points in images is an essential component of many computer vision tasks, ranging from medical imaging, industrial inspection, and compression to surveillance, and recently it has received much attention for uncalibrated 3D reconstruction. Despite the enormous efforts to build robust matching algorithms, there is no method that can be applied to every situation. Template matching techniques, for example, are commonly used for tracking multiple regions of the same object [KT91, HB96]. Nevertheless, as current state-of-the-art matching methods cannot guarantee the absence of mismatches due to image noise and illumination changes, extensions to the original algorithms use robust statistical methods for target position prediction like the Kalman filter [Dav03, SS96]. Other extensions involve outlier rejection methods based on geometric constraints like epipolar geometry [ZDFL95b], and recently prediction of the 3D camera motion has been considered to reduce the search space of feature correspondences [Dav03, Nis03] and decrease mismatch errors. This work is related to multiple-feature matching methods. The main contribution of this work is the integration of the discriminative properties encoded in the SIFT descriptor into the ICP algorithm to increase the number of matched features when processing images with repetitive patterns. A matching metric that incorporates appearance and the predicted distance to the matching location in successive pairs of images is proposed. In addition, a Motion Context box, computed over a uniformly sampled area of feature points, is introduced to explicitly distinguish local registration errors and improve the correct match-pair assignment.

4.2 Related Work

Three important issues in feature matching methods are: the interest point detector algorithm, the local descriptor, and the similarity metric that matches descriptors of the reference and target images. Feature detector algorithms find image locations where large gradient variations occur in at least two principal directions. The Harris [HS88] and KLT [KT91] feature detectors select regions where the eigenvalues of the second moment matrix are large. In the last few years feature detectors have been extended to increase their repeatability rate under scale, illumination, and small pose variations [SZ02, MS04, KB01]. Once a salient point has been identified, a local descriptor is built using photometric information. Recently, Mikolajczyk and Schmid [MS05b] have compared various state-of-the-art salient point descriptors, including steerable filters [FA91], shape context [BMP02], moment invariants [GMU96], complex filters [SZ02], SIFT [Low04] and cross-correlation. The SIFT descriptor has shown its superiority with respect to the other feature descriptors when images are subject to local transformations like scale, rotation, and partial 3D pose changes. Finally, match assignment is done by computing a similarity metric between descriptors. Commonly used similarity metrics include: sum of squared differences, sum of absolute differences, normalized correlation, and the Mahalanobis distance.

A closely related work [MDS05] by Mortensen et al. presents a feature descriptor that extends SIFT with a vector that adds maximum curvature information from the surrounding area of a salient point to improve the matching rate. Local features having similar appearance are disambiguated by a two-component feature vector (4.2.1), based on a SIFT descriptor representing local properties and a global shape context vector that considers global curvilinear information. Figure 4.1 shows the extended SIFT descriptor with added shape representation proposed in [MDS05].

Figure 4.1: SIFT-Global Context taken from [MDS05]. (a) Original images with selected feature points marked. (b) SIFT (left) and shape context (right) of point marked in (a). (c) Reversed Curvature image of (a) with shape context bins overlaid.

The feature vector proposed in [MDS05] is defined as:

(4.2.1)    F = \begin{bmatrix} \omega S \\ (1 - \omega) G \end{bmatrix} ,

where S is the 128-dimensional local SIFT descriptor, G is the 60-dimensional global shape context vector, and ω is a relative weighting factor. Thus, given two descriptors S_i and S_j, their distance metric is a simple Euclidean distance

(4.2.2)    d_S = |S_i - S_j| = \sqrt{\sum_{k=1}^{128} (S_{i,k} - S_{j,k})^2}

for the SIFT component S of the feature vector, and a χ² statistic for the shape context component G,

(4.2.3)    d_G = \chi^2 = \frac{1}{2} \sum_{k=1}^{60} \frac{(h_{i,k} - h_{j,k})^2}{h_{i,k} + h_{j,k}} ,

where h_k are the histogram bins that represent the global curvature shape. Then, the final distance measure is estimated by the following expression:

(4.2.4) d = ωdS + (1 − ω)dG.
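A direct transcription of the combined distance (4.2.2)-(4.2.4) of [MDS05] for two feature descriptors; the default weighting factor and the small epsilon guarding the chi-square denominator are illustrative assumptions.

import numpy as np

def combined_distance(S_i, S_j, G_i, G_j, w=0.5):
    """Weighted distance of eq. (4.2.4): Euclidean distance between the 128-D
    SIFT parts plus a chi-square statistic between the 60-bin shape context
    histograms."""
    d_sift = np.linalg.norm(np.asarray(S_i) - np.asarray(S_j))          # eq. (4.2.2)
    h_i, h_j = np.asarray(G_i, float), np.asarray(G_j, float)
    d_shape = 0.5 * np.sum((h_i - h_j) ** 2 / (h_i + h_j + 1e-12))      # eq. (4.2.3)
    return w * d_sift + (1.0 - w) * d_shape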

Finally, matches with a distance larger than a threshold T_d are discarded. A limitation of this method is that it does not work well for images with large scale variations, since the algorithm was developed for an application where the scene was constrained to small scale changes and the images were captured with a camera placed fronto-parallel to the observed scene, which also reduces projective distortion effects.

This is in part the motivation of this work. The suggested approach is to consider different cues to disambiguate the matching problem when repetitive patterns exist in the images. In particular, by adding spatial information to the matching criterion, the proposed algorithm can cope with limited scale variations and projective distortion effects.

4.2.1 Scale Invariant Feature Transform

The Scale Invariant Feature Transform (SIFT) developed by Lowe [Low04] has 4 major stages:

1. Scale-space extrema detection: The first stage searches over scale space using a Difference of Gaussian function to identify potential interest points.

2. Keypoint localization: The location and scale of each candidate point is de- termined and keypoints are selected based on measures of stability.

3. Orientation assignment: One or more orientations are assigned to each key- point based on local image gradients. 4.2 Related Work 67

4. Keypoint descriptor: A descriptor is generated for each keypoint from local image gradients information at the scale found in stage 2.

The SIFT feature algorithm is based upon finding locations within the scale space of an image which can be reliably extracted. The first stage finds scale- space extrema located in the Difference of Gaussians (DoG) function, which can be computed from the difference of two nearby scaled images separated by a multiplicative factor k. DoG scale space is sampled by blurring an image with successively larger Gaussian filters and subtracting each blurred image from the adjacent (more blurred) image

(4.2.5)    D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) - L(x, y, σ), where L(x, y, σ) is the scale space of an image, built by convolving the image I(x, y) with the Gaussian kernel G(x, y, σ). Points in the DoG function which are local extrema in their own scale and in the scales above and below are extracted as keypoints. For a rotation-invariant descriptor, the local image gradients are used to determine the main orientation of each region. For each image sample L(x, y), the gradient magnitude m(x, y) and orientation θ(x, y) are computed using pixel differences:

(4.2.6)    m(x, y)^2 = [L(x + 1, y) - L(x - 1, y)]^2 + [L(x, y + 1) - L(x, y - 1)]^2

(4.2.7)    \theta(x, y) = \tan^{-1} \frac{L(x, y + 1) - L(x, y - 1)}{L(x + 1, y) - L(x - 1, y)}

The resulting feature descriptors are 128-element vectors with a total support window of 16 × 16 scaled pixels. For a more detailed explanation of SIFT see [Low04].

4.2.2 Iterative Closest Point ICP

The ICP algorithm is another approach for solving the correspondence problem. Although originally introduced to register 3D data sets by Besl and McKay [BM92] and Chen and Medioni [CM92], it is also used with 2D data sets to register images, mainly in medical applications. The ICP algorithm iteratively performs two operations until convergence: data matching and estimation of the transformation that aligns the data sets. The ICP algorithm takes as input two data sets representing salient points of a reference image I_1 and a target image I_2. Let P and Q be two input data sets containing N_p and N_q feature points respectively, that is, P = {p_i}_{i=1}^{N_p} and Q = {q_i}_{i=1}^{N_q}. The goal is to compute the parameters of the transformation T that best aligns the transformed points P_k = T(P) to the point set Q. For a 2D Euclidean transformation, the parameters of T are the rotation R and the translation vector t = (t_x, t_y). A summary of the ICP algorithm is as follows:

1. Initialize: k = 0; Pk = P , R = R0, t = t0;

2. Match closest points: For each point in P_k, find its closest point in Q. A list C = {c_i}_{i=1}^{N_p} is obtained, where C ⊂ Q and c_i is the closest point to p_i.

(4.2.8)    c_i = \arg\min_{q_m} \| q_m - R p_i - t \|

3. Compute the transformation parameters of T with a least mean squares estimator, where the objective function to be minimized is:

(4.2.9)    f(R, t) = \frac{1}{N_p} \sum_{i=1}^{N_p} \| c_i - R p_i - t \|^2

4. Apply the transformation: P_{k+1} = RP + t.

5. If f_k - f_{k+1} < τ or k > K_max: terminate; else k = k + 1 and repeat steps 2-5.
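A self-contained 2D ICP sketch following steps 1-5 above, with a closed-form SVD-based estimate of R and t in step 3; the defaults and names are illustrative and not the thesis code.

import numpy as np

def icp_2d(P, Q, max_iter=50, tol=1e-6):
    """2D Euclidean ICP: register P (Np x 2) onto Q (Nq x 2), returning R, t."""
    R, t = np.eye(2), np.zeros(2)
    prev_err = np.inf
    for _ in range(max_iter):
        Pk = P @ R.T + t
        # step 2: for each transformed point, find its closest point in Q
        d2 = ((Pk[:, None, :] - Q[None, :, :]) ** 2).sum(-1)
        C = Q[np.argmin(d2, axis=1)]
        # step 3: closed-form least-squares estimate of R, t (Procrustes/SVD)
        mu_p, mu_c = P.mean(0), C.mean(0)
        U, _, Vt = np.linalg.svd((P - mu_p).T @ (C - mu_c))
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # keep a proper rotation
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_c - R @ mu_p
        # step 5: stop when the mean squared registration error stabilizes
        err = ((P @ R.T + t - C) ** 2).sum(1).mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R, t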

Figure 4.2 shows a graphical example of ICP registering synthetic data. Observe that ICP can obtain an approximate solution even in cases where the sets P_k and Q are at different scales and no rigid transformation can exactly align the data sets.

Figure 4.2: Example of using the ICP algorithm to align two sets of points at different scales.

We use standard stopping conditions: the maximum allowable number of iterations K_max, the change in mean square registration error falling below a preset threshold τ > 0, and a small variation of the transformation parameters between successive iterations. Notice that the transformation parameters are computed from the original data set P, whereas the closest points in the matching stage are computed using the transformed points P_k. Closed-form solutions for computing the 2D Euclidean transformation exist based on SVD [CM92] and unit quaternions [BH87].

The ICP algorithm has two important problems. First, it needs a good initial set of feature correspondences, especially when image deformation is present due to viewpoint changes. As a result, given two data sets, the ICP algorithm converges to different local minima depending on the initial correspondences assigned between the data sets. To deal with this problem, other methods

have been proposed that use extra information, such as color [GRB94, JK97], where the color information is incorporated with the shape information in the correspondence computation, or combinations of multiple attributes [SJH98, GLB01], which are merged with the Euclidean distance when searching for the closest point during the matching process. When different attributes are available to measure the similarity between points, it is difficult to establish a single global similarity metric, because each attribute may have a different scale and varying relevance. The second drawback of the ICP algorithm is that it is sensitive to the presence of false match correspondences or "outliers". Zhang [Zha94] introduced a variant of ICP that rejects corresponding points if the distance between the pair is greater

than a dynamically adapted threshold called Dmax. The threshold is updated at each iteration using the mean and the standard deviation of the computed distances between the corresponding points as follows:

if µ < D            // The registration is very good
    Dmax = µ + 3η
else if µ < 3D      // The registration is still good
    Dmax = µ + 2η
else if µ < 6D      // The registration is not too bad
    Dmax = µ + η
else                // The registration is bad
    Dmax = ξ

where D is a constant defined by the user that indicates the expected mean distance between corresponding points when the registration is good, and ξ is a maximum tolerance distance value used when the registration is bad. Following these ideas, we propose to integrate the SIFT descriptor into the ICP algorithm and to add local statistical measures to improve the robustness of the feature matching stage. Recent reviews and comparative studies of ICP variants, focused on alternatives that improve convergence speed, are available in [RL01, GA05].

4.3 IC-SIFT: Iterative Closest SIFT

SIFT features are distinctive invariant features used to robustly describe and match salient points between different views of a scene. The SIFT feature description is invariant to scale and rotation, and robust to other image transforms, but it has problems when repetitive structures occur in images. The insight is that by adding spatial information, the ambiguity between similar features can be reduced significantly. In this section we show how to integrate the SIFT and ICP methods to overcome an intrinsic limitation of each algorithm. First, an algorithm is introduced that provides a good initial set of correspondences, based on SIFT local descriptors, to initialize the ICP algorithm. Then, we show how to add spatial information to the matching stage of SIFT to disambiguate corresponding points when repetitive patterns exist in the images. The motivation is to increase the number of feature correspondences in images containing repetitive patterns and to eliminate false match pairs.

4.3.1 Finding Initial Matching Pairs

Since ICP is an iterative descent algorithm, it requires a good initial estimate in order to converge to a good local minimum. In the original ICP algorithm it is assumed that the data sets P and Q are partially aligned or that a rough estimate of the transformation parameters is known. If this assumption is satisfied then, during the first matching step, a significant number of feature correspondences can be found and the algorithm can converge to a good local minimum. In the proposed algorithm a first estimate of the correspondences is computed from the SIFT local descriptors. This allows finding a good number of match pairs even when the initial partial-alignment constraint is not satisfied.

The SIFT descriptors are 128-dimensional vectors that describe the appearance of the local region. The search strategy is to go through the entire list of SIFT feature descriptors Sp = {spi}_{i=1}^{Np} looking for descriptors that are similar to those in the set Sq = {sqi}_{i=1}^{Nq}. Then, given two SIFT feature descriptors spi and sqj, one from each data set P and Q, the similarity function is defined as the Euclidean distance, as shown in equation 4.3.1:

(4.3.1)  ds = ‖ spi − sqj ‖ = √( Σ_{k=1}^{128} (spi,k − sqj,k)² )

Then, following the approach of [Low04], we use the distance ratio to the second-best match to discard outliers and take this matching metric as the final similarity measure in the initialization stage. Given ds1 and ds2, the distances of the first and second best matches for a feature descriptor in the set Sp, a match is accepted if:

(4.3.2)  ds1 < 0.8 ds2

Notice that the distance ratio has a low value when the best match performs significantly better than the second-best match, while a high value of the distance ratio is obtained when the feature point has at least one strong competitor in terms of descriptor appearance.
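As a sketch of this initialization stage, the following Python fragment finds, for each descriptor in Sp, its two nearest neighbors in Sq and keeps the pair only when the 0.8 distance-ratio test of equation 4.3.2 holds; the function name and the brute-force nearest-neighbor search are illustrative assumptions rather than the thesis implementation.

import numpy as np

def initial_sift_matches(Sp, Sq, ratio=0.8):
    """Initial correspondences by descriptor distance and the ratio test."""
    matches = []
    for i, sp in enumerate(Sp):
        d = np.linalg.norm(Sq - sp, axis=1)   # eq. 4.3.1 against all of Sq
        j1, j2 = np.argsort(d)[:2]            # best and second-best match
        if d[j1] < ratio * d[j2]:             # eq. 4.3.2
            matches.append((i, j1))
    return matches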

4.3.2 Matching SIFT features: adding a weighted distance factor

Equation 4.2.8 of the original ICP algorithm is adapted to consider both the appearance similarity and the geometric proximity between feature points:

(4.3.3)  ci = arg min_{qm} ( ‖ qm − R pi − t ‖ / δ + ω ‖ spi − sqm ‖ )

The coefficient ω is a factor that controls the weight of the SIFT-based appearance term in the matching stage. We have observed that a good tradeoff between appearance and distance information is obtained by setting the coefficient equal to 0.5. The parameter δ manages the degree of interaction between features according to their distance: a small δ value limits matching candidates to a reduced neighborhood, while a larger value allows matching more distant features. Then, following [Zha94], the maximum tolerable distance to accept or reject matches can be computed automatically. We let the parameter δ vary according to the mean µ and standard deviation η computed in each iteration of the IC-SIFT algorithm:

(4.3.4) δ = µ + 2η.
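A minimal sketch of this combined matching criterion is shown below; it assumes point arrays Q and descriptor arrays Sq indexed consistently, and the helper name closest_point_ic_sift is an illustrative choice, not the thesis code. The value of delta would be updated each iteration with equation 4.3.4.

import numpy as np

def closest_point_ic_sift(p_i, s_pi, Q, Sq, R, t, delta, omega=0.5):
    """Closest point under the combined geometric/appearance metric (eq. 4.3.3)."""
    geometric = np.linalg.norm(Q - (R @ p_i + t), axis=1) / delta
    appearance = omega * np.linalg.norm(Sq - s_pi, axis=1)
    m = np.argmin(geometric + appearance)
    return m, Q[m]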

4.3.3 Differentiating Registration Errors

Traditional ICP algorithms do not distinguish among the registration errors of different point matches. As a consequence, many feature points can be rejected because their registration error is larger than the global registration error computed from all the matched pairs in previous iterations. This problem is especially evident when working with real images, since features of closer objects will have larger displacements than features corresponding to far-away objects, which are associated with small motion displacements. We propose to add a local motion descriptor to overcome this limitation of the original method. The suggested Motion Context uses a data structure similar to the one introduced in [Che91] to speed up the search for correspondences, but it differs in the information it holds. The Motion Context is an M × N rectangular grid that encloses the feature points of the reference image.

Each Motion Context box Bij stores the characteristic displacement vector vBij, the mean registration error µBij, and the standard deviation ηBij of the NBij features lying in the corresponding box (equation 4.3.5):

(4.3.5)  vBij = ( Σ_{i=1}^{NBij} ( ci − (R pi + t) ) ) / NBij

The stored values are used to find matches while keeping a clear distinction between local registration error estimates. The experimental tests in the next section show that using local registration errors is especially useful in boundary regions.

Let (Xmin, Ymin) be the minimum and (Xmax, Ymax) the maximum coordinates of the feature points in P. Since the bounding rectangular region that contains all the feature points is divided uniformly, the number of horizontal subdivisions H and vertical subdivisions V is defined by:

(4.3.6)  H = (Xmax − Xmin) / ∆x,   V = (Ymax − Ymin) / ∆y

where ∆x and ∆y are the horizontal and vertical sizes of each subdivision. Then,

given the coordinates (x, y) of a feature point pi, its corresponding Bij indices are obtained as follows:

(4.3.7)  i = ⌊ (x − Xmin) / ∆x ⌋,  j = ⌊ (y − Ymin) / ∆y ⌋,  if x ≤ Xmax, y ≤ Ymax
         i = H,  j = V,  if x > Xmax, y > Ymax
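A brief Python sketch of this grid assignment and of the per-box statistics of equation 4.3.5 is given below; the function names and grid parameters are illustrative assumptions rather than the original code.

import numpy as np

def bin_indices(x, y, x_min, y_min, dx, dy, H, V):
    """Map a feature point to its Motion Context box (eq. 4.3.7)."""
    i = min(int((x - x_min) // dx), H - 1)   # clamp points on the far border
    j = min(int((y - y_min) // dy), V - 1)
    return i, j

def motion_context(P, C, R, t, x_min, y_min, dx, dy, H, V):
    """Per-box characteristic displacement, mean error and deviation (eq. 4.3.5)."""
    residuals = C - (P @ R.T + t)
    boxes = {}
    for p, r in zip(P, residuals):
        ij = bin_indices(p[0], p[1], x_min, y_min, dx, dy, H, V)
        boxes.setdefault(ij, []).append(r)
    stats = {}
    for ij, rs in boxes.items():
        rs = np.array(rs)
        err = np.linalg.norm(rs, axis=1)
        stats[ij] = (rs.mean(axis=0), err.mean(), err.std())  # (v_Bij, mu_Bij, eta_Bij)
    return stats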

To remove the effect of outliers in the computation of the local mean and standard deviation, only the subset of features whose registration error is lower than three times the global mean registration error is considered to obtain the values of each Motion Context box. During the initialization phase of the IC-SIFT algorithm, the Motion Context is estimated using only the feature correspondences extracted from the appearance information of the SIFT descriptors. Then, at the matching stage, for each feature point pi its closest Motion Context box Bij is estimated and, if the corresponding Motion Context box lies on the boundary of the image, the transformed point pk is set according to:

(4.3.8)  pk = R pi + t + 0.5 vBij.

Finally, for each feature point pki its closest Motion Context box Bij is estimated, and the local estimates µBij and ηBij are used to accept or reject matching pairs

Figure 4.3: Example of a 4×4 Motion Context grid describing the characteristic local motion of the neighbouring features.

based on local thresholds dynamically updated as follows:

if µBij < D          // The registration is very good
    Dmax = µBij + 3ηBij
else if µBij < 3D    // The registration is still good
    Dmax = µBij + 2ηBij
else if µBij < 6D    // The registration is not too bad
    Dmax = µBij + ηBij
else                 // The registration is bad
    Dmax = ξ
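A minimal Python sketch of this per-box threshold update, assuming mu and eta are the local statistics of a Motion Context box and D, xi are user-set constants as in [Zha94]:

def local_dmax(mu, eta, D, xi):
    """Per-box distance threshold, adapted from Zhang's global rule."""
    if mu < D:            # the registration is very good
        return mu + 3 * eta
    elif mu < 3 * D:      # the registration is still good
        return mu + 2 * eta
    elif mu < 6 * D:      # the registration is not too bad
        return mu + eta
    else:                 # the registration is bad
        return xi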

4.4 Robust Feature Matching Experimental Results

This section shows comparative results of the proposed IC-SIFT algorithm against the original SIFT and ICP algorithms. In all the experiments we evaluate the number of correctly matched features on image pairs containing repetitive patterns. The performance of each matching algorithm is measured by the number of correct feature matches and the number of false positive matches, i.e., incorrect matches between non-corresponding feature points. In addition, the RMS registration error is measured. We begin by using the RANSAC fundamental-matrix computation procedure [ZDFL95a]. Then, the estimated epipolar lines are used to classify a match as incorrect when its distance from the corresponding epipolar line is more than one pixel. However, since this approach has problems identifying false positives, we carried out a second, manual classification process to obtain a more representative measurement of correct and incorrect matches. The verification was done by two different users for each individual feature point. Figure 4.4 shows the real data sets employed to evaluate the proposed algorithm.

Figure 4.4: Original test image pairs. Images are subject to rotation, scale and viewpoint change transformations. From left to right: Boat, Cube, Library, Sculpture, and Chessboard image pairs.

In the first experiment we use the first and sixth image of the boat image sequence from [MS05a]. Notice that this image pair has a large rotation and scale change.

Figure 4.5 shows the comparative results on the boat image pair for the SIFT, ICP, and IC-SIFT algorithms.

Figure 4.5: Feature correspondences initially found by SIFT and the proposed IC-SIFT algorithm (Top). Registered points using matches found by SIFT, ICP, and IC-SIFT algorithms (Middle). The Initial and Final Motion Context. (Bottom)

The registered points are plotted to provide a better idea of the match reliability between points. The cross marks (+) represent the target keypoints in the left image and the dots (·) correspond to the keypoints in the right image after applying the rigid transformation that aligns corresponding points. Matching points are connected by a linking line. The rigid transformation was estimated from the matches found by each algorithm.

Figure 4.5 also shows the characteristic displacement vectors stored in the Motion Context for the boat image pair. Observe that the Motion Context divides

the image area where corresponding points have been matched into a uniform rectangular grid.

Table 4.1 summarizes the results for the boat image pair. In this experiment the SIFT algorithm gives a good initial set of corresponding points; the refinement done by the IC-SIFT algorithm then outperforms the original algorithms, increasing the number of correct matches by 30 without introducing match errors, since the transformation parameters that align the keypoints can be reliably obtained.

Table 4.1: Matching results for the boat image pair

Match Method   Initial Points   Correct Matches   False Matches   RMS Error
IC-SIFT        200              85                0               1.87
SIFT           200              55                6               3.92
ICP            200              73                23              2.53

The second experiment shows the performance of the new method on the cube stereo image pair, with camera motion from top to bottom. Observe that many areas of the checkerboard pattern and of the cube contain similar patterns. In addition, there are illumination changes on the left and right of the images, since two spotlights were used to illuminate the scene.

Notice that, since this image pair has repetitive patterns, using the SIFT matching strategy yields poor results: correspondences are found only in relatively large regions where ambiguities between features can be resolved to some extent. On the other hand, the IC-SIFT algorithm outperforms the original algorithms, with an improvement in the number of correctly matched points and a significant reduction of false positive matches. One reason for this is that different registration errors are considered for features lying in each Motion Context box. This property of the IC-SIFT algorithm can be observed in the characteristic displacement vectors associated with the Motion Context boxes, as shown in figure 4.7.

200 SIFT keypoint descriptors were initially detected in the first image, and their matching pairs were looked for among 500 keypoints detected in the right image. Table 4.2 summarizes the results obtained with the proposed IC-SIFT, the original SIFT, and ICP algorithms on the cube image pair.

Figure 4.6: Matches found by the proposed IC-SIFT algorithm on the cube image pair (Top). Registered points using matches found by the SIFT, ICP, and IC-SIFT algorithms (Bottom).

In the next experiment, we investigate the robustness of the new matching algorithm on a natural image pair (the proprietary library sequence); see figure 4.8. Table 4.3 summarizes the results.

Notice that there is a viewpoint change of about 20 degrees. Furthermore, the camera follows a translational motion from left to right. 200 interest points

Table 4.2: Matching results for the cube image pair

Match Method   Initial Points   Correct Matches   False Matches   RMS Error
IC-SIFT        200              110               3               1.91
SIFT           200              96                26              2.51
ICP            200              92                15              4.08

Figure 4.7: The Motion Context for the cube image pair and the associated characteristic displacement vectors

were initially detected in the image on the left and matched to 500 putative keypoints in the right image.

In the next experiment, we evaluate the performance of the new algorithm on a manmade scene (Sculpture). Figure 4.9 shows comparative results between the SIFT and IC-SIFT algorithms. 200 interest points were initially detected in the image on the left and matched to 500 putative keypoints in the right image. Table 4.4 summarizes the results of the different methods for the Sculpture image pair. Note that, starting from the 26 correct matches found by SIFT, the new algorithm nearly doubles the number of good matching pairs while adding only one mismatch error.

Note that keypoints matched using the standard SIFT criteria [Low04] are located in a small subregion of the image. However, the IC-SIFT algorithm, by using the

Table 4.3: Matching results for the library image pair

Match Method   Initial Points   Correct Matches   False Matches   RMS Error
IC-SIFT        200              118               7               2.84
SIFT           200              99                43              3.23
ICP            200              82                20              4.89

Figure 4.8: Match pairs found using the SIFT, ICP and IC-SIFT algorithms on the library image pair (Top). Registered points using matches found by the SIFT, ICP, and IC-SIFT algorithms (Bottom).

Motion Context, propagates the spatial information to disambiguate matching conflicts. Observe how the Motion Context dynamically updates its size, covering larger image areas as new matches are unambiguously assigned (see figure 4.10).

Table 4.4: Matching results for the Sculpture image pair

Match Method   Initial Points   Correct Matches   False Matches   RMS Error
IC-SIFT        200              47                2               2.24
SIFT           200              26                1               3.13

The last experiment was performed to examine the effectiveness of the IC-SIFT algorithm when evident repetitive patterns exist in the scene. A chessboard calibration pattern was placed in front of the camera. Figure 4.11 shows the corresponding image pair; the left image is near the camera and the right one was captured after the calibration pattern was moved farther away.

The results in figure 4.11 show that SIFT keypoints matched by the Euclidean norm between descriptors and the match ratio proposed in [Low04] are vulnerable to the presence of repetitive patterns. However, the proposed IC-SIFT recovers

Figure 4.9: From left to right matches by SIFT and IC-SIFT (Top). The registration error for SIFT and IC-SIFT (Bottom).

the lost matches, and no erroneous matches remain after three iterations.

Observing the final registered points in the middle row of figure 4.11, there are no matches in the top-left region. This is because no SIFT points were detected in that region during initialization, and the IC-SIFT iterative process does not look for additional keypoints in the images.

Table 4.5: Matching results for the Chessboard image pair

Match Method   Initial Points   Correct Matches   False Matches   RMS Error
IC-SIFT        1000             946               0               0.59
SIFT           1000             476               88              5.21

Figure 4.10: The evolution of the Motion Context in the IC-SIFT algorithm propagates spatial information to resolve matching conflicts.

4.5 Discussion

In this chapter the proposed feature matching algorithm has been presented. The new algorithm integrates spatial proximity information with the SIFT descriptor and appearance similarity with the ICP algorithm to overcome the problem of identifying correct matches in images with repetitive patterns under wide-baseline conditions. The new algorithm uses the Motion Context and the characteristic displacement vector to allow different registration errors in local image areas. Then, by propagating spatial information, it properly disambiguates similar appearance descriptors to improve the matching assignment.

The proposed algorithm still has some limitations; one is due to the rigid transformation used to align the sets of points. The method could be extended by using a homography mapping that can model shear effects, or by a non-rigid registration model for the alignment process of the IC-SIFT algorithm. Furthermore, there are problems when objects at different depths contribute to the same local motion description: in that situation the characteristic displacement vector vBij will not be stable during the iterative process, in particular if there is a significant depth variation between nearby objects. This problem can be addressed by varying the weighting term ω according to the expected stability of the corresponding Motion Context box, for example by relating the weighting term to the standard deviation of the local motion descriptor. In

Figure 4.11: Performance comparison between SIFT and IC-SIFT for matching points in a calibration chessboard pattern.

addition, a coarse-to-fine representation of the proposed Motion Context could help to manage different depths in the presence of nearby points. Future work will explore these possibilities, and color information could be integrated to eliminate even more mismatch errors.

Despite its limitations, the proposed IC-SIFT algorithm was applied to feature matching with excellent results. Compared with the original SIFT, the novel IC-SIFT matching algorithm finds 30% more matches and significantly reduces the number of false positive matches when processing real image pairs containing repetitive patterns. We observe from the results that the IC-SIFT algorithm always gives the lowest registration errors. This is due to the fact that it always gives the fewest false positive matches and the highest number of correct matches.

Chapter 5

A new Incremental Projective Factorization Algorithm

5.1 Introduction

The problem known as structure from motion consists in recovering the structure of the scene and the camera motion at the same time, using feature point correspondences from multiple views. In recent years, different methods to obtain 3D models from an image sequence have been developed [MHOP01, ST01, RP05, MP05, HZ00b, GSV01].

Typical image sequences in real applications of structure from motion methods consist of 20-30 frames per second with hundreds of feature point correspondences between successive frames. In addition, for certain periods of time the camera motion will be slow, adding negligible information to the reconstruction process while making the estimation of the scene structure and camera motion computationally unaffordable. Hence, selecting the subset of frames that is most suitable to obtain a reliable projective reconstruction can reduce the amount of information and therefore the processing time.

The main contribution of the new algorithm is the development of an online domain reduction step. By analyzing the contribution of previous frames to the

projective reconstruction process, a subset of frames is selected. The proposed algorithm has the advantage of keeping the robust character of the original algorithm while reducing the computing time, with qualitative results comparable to the original iterative factorization method.

5.2 Related Work

Projective reconstruction methods can be divided into four categories: epipolar geometry, recursive, factorization, and robust non-linear methods. Epipolar-geometry-based methods [RP05, MP05, GSV01] compute the epipolar geometry between successive pairs of images and then merge the partial results to compute the structure and motion of the scene. Recursive methods [BCC90, SPFP96] are based on nonlinear filtering that fuses estimates over time and, more recently [Nis03], on a preemptive scoring of motion hypotheses from a calibrated camera. In factorization methods [MHOP01, ST01, MK94, TMHF99], the correspondence information of all frames is treated simultaneously. The optimal way to recover motion and structure from images is to use robust non-linear methods like bundle adjustment [TMHF99]. However, bundle adjustment does not give a direct solution; it is a refinement process and requires a good starting point.

The original factorization method [TK92a] factors a measurement matrix representing the image positions of feature points tracked over multiple frames; then, using Singular Value Decomposition (SVD), two matrices are computed, which represent object shape and camera motion respectively. Three important problems that arise with factorization-like methods are:

• SVD is a computationally expensive procedure.

• The size of the measurement matrix increases for each new frame processed.

• All correspondence points must be available simultaneously for factorization.

To overcome these problems, various strategies have exploited the fact that the structure of the scene remains the same over time. Morita and Kanade [MK94] developed a sequential version for the orthographic and para-perspective camera models that has almost the same accuracy but lower computational cost. Sturm and Triggs [ST01] extended the factorization method to shape recovery with projective cameras. Their algorithm recursively finds the unknown projective depths associated with the full perspective camera model from the epipolar geometry between views. After that, the SVD factorization method is applied to find the shape and motion matrices. Mahamud et al. [MHOP01] proposed a bilinear iterative algorithm by adding new constraints to the error function minimized by the Sturm-Triggs method. In the Mahamud et al. method, the initial projective depth values are obtained using the Kanade orthographic method as an initial estimate, avoiding the need to compute projective depths from the epipolar geometry.

The problem of selecting the subset of frames most suitable for obtaining a reliable estimation of structure and motion has been investigated for epipolar-based projective reconstruction methods by Repko and Pollefeys [MP05, PGV+02], by counting the number of feature points tracked in successive pairs of images and analyzing the reprojection errors of pairs and triplets of views, using epipolar geometry and homographies under the Geometric Robust Information Criterion (GRIC) proposed in [TFZ98]. Their keyframe criterion selects two or three views where the GRIC score of the epipolar model is lower than the score of the homography model. A similar idea was presented in [GCH+02].

5.3 Projective Factorization

The original projective factorization method described in [MHOP01] assumes that n feature points are observed by m perspective cameras. Then, a registered measurement matrix W with dimensions 3m × n is formed, as follows:

(5.3.1)   W = \begin{bmatrix} \lambda_{11}x_{11} & \lambda_{12}x_{12} & \cdots & \lambda_{1n}x_{1n} \\ \vdots & \vdots & & \vdots \\ \lambda_{m1}x_{m1} & \lambda_{m2}x_{m2} & \cdots & \lambda_{mn}x_{mn} \end{bmatrix} = \begin{bmatrix} P_1 \\ \vdots \\ P_m \end{bmatrix} \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}

Here Xj = (xj, yj, zj, 1)^T, (j = 1, ..., n) are the unknown homogeneous 3D point vectors, Pi (i = 1, ..., m) are the unknown 3 × 4 projection matrices associated with camera i, λij are scale factors called projective depths, and xij = (uij, vij, 1)^T are the measured homogeneous image point vectors. Ideally, to solve the structure from motion problem one should minimize the mean-squared distance between the observed image points and the point positions predicted from the parameters λij, Pi and Xj, i.e. equation 5.3.2:

(5.3.2)  E = Σ_{ij} ‖ xij − (1/λij) Pi Xj ‖²

However, the corresponding problem is difficult since the error is highly non-linear

in the unknowns λij, Pi and Xj. Mahamud et al. [MHOP01] proposed an iterative algorithm to minimize the algebraic error function 5.3.3:

(5.3.3)  E = Σ_{ij} ‖ λij xij − Pi Xj ‖².

Their iterative algorithm minimizes E by alternating steps in which the motion and structure parameters are estimated from the measurement matrix with steps in which the projective depths are computed from the motion and structure estimates. Initially, a rank-4 decomposition determines the best set of projective depths for factorization. Two matrices are found: P (of dimensions 3m × 4) and X (of dimensions 4 × n), representing camera motion and object shape respectively. Then, by imposing the constraint that the columns of W have unit norm, the projective depths are computed.

To summarize the original projective factorization algorithm in a compact notation, two new vectors and one new matrix are defined:

(5.3.4)  λj = (λ1j, ..., λmj)^T,  j = 1, ..., n

(5.3.5)   \Phi_j = \begin{bmatrix} x_{1j} & 0 & \cdots & 0 \\ 0 & x_{2j} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_{mj} \end{bmatrix}

(5.3.6)  dj = Φj λj = (λ1j x1j, ..., λmj xmj)^T

Then, the full original iterative projective factorization algorithm [MHOP01] is described as follows:

1. Set λij = 1, for i = 1, . . . , m and j = 1, . . . , n;

2. Compute the current scaled measurement matrix W by equation (5.3.1);

3. Normalize W, constraining the columns to ‖Wj‖ = 1, j = 1, . . . , n;

4. Perform the rank-4 factorization of W by SVD, W = U S V^T, to generate an estimate of the projection matrix P and the shape matrix X: P = U4 and X = S4 V4^T, where U4, S4 and V4 are the submatrices obtained from U, S and V using only the first 4 columns (the ones associated with the 4 largest singular values), and S is a diagonal matrix whose elements σi are the singular values of W;

5. For j = 1, . . . , n do

   (a) compute Aj = Φj^T U4 U4^T Φj and Bj = Φj^T Φj;

   (b) solve the generalized eigenvalue problem Aj λ = σ Bj λ and set λj to be the generalized eigenvector associated with the largest eigenvalue;

6. If the λij are approximately the same as in the previous iteration, stop; else go to step 2.

Algorithm 1 - Original projective factorization algorithm.
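A compact Python sketch of one pass of this alternation (rank-4 SVD factorization followed by the per-column depth update) is shown below. It is a simplified illustration under the assumption of a fully observed measurement matrix, and the helper names are not from the thesis.

import numpy as np
from scipy.linalg import eigh

def projective_factorization(x, n_iter=20, tol=1e-8):
    """Iterative projective factorization sketch.

    x : array of shape (m, n, 3) with homogeneous image points x_ij = (u, v, 1).
    Returns P (3m x 4), X (4 x n) and the projective depths lam (m x n).
    """
    m, n, _ = x.shape
    lam = np.ones((m, n))                       # step 1: lambda_ij = 1
    prev_sigma5 = np.inf
    for _ in range(n_iter):
        # step 2: scaled measurement matrix W (3m x n)
        W = (lam[:, :, None] * x).transpose(0, 2, 1).reshape(3 * m, n)
        W /= np.linalg.norm(W, axis=0)          # step 3: unit-norm columns
        U, S, Vt = np.linalg.svd(W, full_matrices=False)
        P = U[:, :4]                            # step 4: rank-4 factors
        X = np.diag(S[:4]) @ Vt[:4]
        # step 5: per-column generalized eigenproblem for the depths
        for j in range(n):
            Phi = np.zeros((3 * m, m))
            for i in range(m):
                Phi[3 * i:3 * i + 3, i] = x[i, j]
            A = Phi.T @ (P @ P.T) @ Phi
            B = Phi.T @ Phi
            w, V = eigh(A, B)                   # largest generalized eigenvector
            lam[:, j] = V[:, -1]
        if abs(S[4] - prev_sigma5) < tol:       # step 6: sigma_5 has stabilized
            break
        prev_sigma5 = S[4]
    return P, X, lam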

In step 3, if W corresponds to a projection of real 3D points then, with the correct depth values, it must be a rank-4 matrix [HZ00a]. The fifth singular value σ5 of the diagonal matrix S represents the distance, in the Frobenius norm, from the rank-4 subspace. Hence, σ5 can be used as a measure of this distance and it should tend to zero if the iterative process converges. Due to the effect of noise and misleading matches in the preprocessing stage of the tracking algorithms, the σ5 singular value will not be zero but, after the iterative process of the projective reconstruction algorithm, its value will be approximately the same as in the previous iteration, reaching a stable value near zero.

5.4 Proposed Incremental Projective Reconstruction Algorithm

The proposed algorithm for incremental projective reconstruction (IPF) is similar to the Mahamud et al. [MHOP01] approach, but it assumes that only a predefined maximum number r of frames can be kept in W. The proposed method differs from the original one when the k-th frame, with k ≥ r, is processed. A domain reduction stage is added to keep the processing time constant by restricting the size of W to allocate only r frames. Then, from the k-th ≥ r frame onwards, the structure and motion of the scene are incrementally computed for each additional frame in nearly constant processing time.

5.4.1 Domain Reduction by inter-frame Selection

Our objective is to find the subset of r frames {Wr} from W that yields a projective reconstruction preserving the properties of the original method, such as reconstruction accuracy and robustness, while at the same time significantly reducing the processing time. In order to obtain this result we propose to add an online domain reduction stage. The row subset selection from W is accomplished by imposing a constraint on the rows using the σ5 singular value. Let us assume that whenever an image is added at time t, we keep a record of the last value assigned to σ5(t) after the last iteration of the projective reconstruction method. Intuitively, whenever a new frame is processed, at the end of the iterative projective reconstruction algorithm the σ5(t + 1) singular value gives an idea of the contribution of this new frame to improving the reconstruction quality, according to the proximity of σ5(t + 1) to the zero limit. When the maximum number of frames r that can be considered in the measurement matrix W has been reached, a decision must be taken about which frames of W will give the best projective reconstruction quality in terms of the approximation to the rank-4 subspace.

It would be desirable to estimate, at the k-th frame, which combination C(k, r) of the k previous frames taken r at a time gives the lowest σ5 values. But this test would take a long time to compute, since there are k!/((k − r)! r!) possible combinations. Therefore, we introduce a simplification: only one row index of the data matrix (the one corresponding to the frame with the least contribution to the reconstruction process) is located, since it must be replaced.

Prior to the frame selection, the proposed algorithm favors particular frames corresponding to local minima of σ5 by computing the discrete derivative σ5′(t):

(5.4.1)  σ5′(t) = σ5(t) − σ5(t − 1),  t = 2, . . . , k

Local minima are located by analyzing sign changes from negative to positive in σ5′(t):

(5.4.2)  localmins = { t | σ5′(t) > 0 ∧ σ5′(t − 1) < 0 },  t = 2, . . . , k

Then, the σ5 values corresponding to the local minima are re-estimated by:

(5.4.3)  σ_localmins = (1/nf) σ5(localmins)

where nf is the number of frames from the last local minimum.

Assume without loss of generality that the localmins are ordered in the form {σl1 < σl2 < ... < σlk}. Finally, the subset of r frames forming W is selected according to the following expression:

(5.4.4)  Wtr = {Wi : σ5(i) ∈ σ_localmins, i = 1, . . . , r − 1} ∪ {Wk},

where Wi is the subset of rows constructed from the corresponding set of points at frame i.

Observe that, by construction (equation 5.4.4), we ensure that the first and last frames of the image sequence are considered, since projective reconstruction benefits from widely separated views and because the interest is to have an estimate of structure and motion for the current frame. The rest of W includes the frames with the highest contribution to keeping the distance to the rank-4 subspace small.

The decision about keeping or replacing rows of W is based on the analysis of σ5. Thus, if the σ5(t) corresponding to the last frame is smaller than the values associated with previous frames in W, the least contributing frame will be deleted from W.
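A small Python sketch of this frame-selection rule follows, under the assumption that sigma5 is the history of σ5(t) values for frames 1..k and r is the budget of frames kept in W; the function name and return convention are illustrative.

import numpy as np

def select_frames(sigma5, r):
    """Pick r frame indices: local minima of sigma5 (eqs. 5.4.1-5.4.4) plus
    the first and the current frame."""
    sigma5 = np.asarray(sigma5, dtype=float)
    k = len(sigma5)
    d = np.diff(sigma5)                       # sigma5'(t) (eq. 5.4.1)
    # local minima: derivative changes sign from negative to positive (eq. 5.4.2)
    mins = [t for t in range(1, k - 1) if d[t] > 0 and d[t - 1] < 0]
    # re-weight each minimum by the number of frames since the previous one (eq. 5.4.3)
    scores = {}
    prev = 0
    for t in mins:
        scores[t] = sigma5[t] / max(t - prev, 1)
        prev = t
    best = sorted(scores, key=scores.get)[: max(r - 2, 0)]
    # always keep the first and the current frame (eq. 5.4.4)
    return sorted(set([0, k - 1] + best))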

5.4.2 Incremental Projective Reconstruction Algorithm

Incorporating the new domain reduction algorithm described in section (5.4.1) to limit the size of the measurement matrix W , the proposed incremental projective factorization algorithm is as follows:

1. Set λij = λi−1,j, for i = r, . . . , m and j = 1, . . . , n;

2. Compute σ5′(t) = σ5(t) − σ5(t − 1);

3. Find the set of localmins by analyzing sign changes in σ5′(t);

4. Set σ_localmins = (1/nf) σ5(localmins);

5. Find the new subset of r frames Wtr = {Wi : σ5(i) ∈ σ_localmins, i = 1, . . . , r − 1} ∪ {Wk};

6. Replace the rows of W with Wtr;

7. Perform steps 3 to 6 of the original Algorithm 1.

Algorithm 2 - Proposed incremental projective factorization algorithm.

5.5 Incremental Projective Reconstruction Experimental Results

Synthetic data was used to quantitatively compare the accuracy of the final 3D reconstruction of the original algorithm and the proposed incremental algorithm, using 500 randomly generated points within a hemisphere of 100 units radius. Thirty views of these points were taken by a camera looking at the center of the sphere. The camera was located at a distance of 200 units from the origin of the sphere to introduce perspective effects. Figure 5.1 shows the synthetic data used for the experimental test.

Figure 5.1: The synthetic hemisphere points and camera locations used for comparing the original Batch Projective Factorization vs Incremental Projective Factorization methods.

5.5.1 Incremental Projective Reconstruction Accuracy

To compare the original Batch Projective Factorization (BPF) [MHOP01] and the proposed Incremental Projective Factorization (IPF) algorithm, the following computation is performed. A submatrix Wt is built containing the feature correspondences up to frame t, to get the best estimate achievable by the original algorithm. Then, the original algorithm is applied to the submatrix Wt. The proposed algorithm, on the other hand, is evaluated with a maximum predefined number of frames explicitly shown in the experiments. Several experiments were conducted adding noise with varying standard deviation to both the x and y image coordinates. Figure 5.2 plots the RMS reprojection error resulting from adding zero-mean Gaussian noise with standard deviation 2 to simulate tracking drift and image noise. On the left, three and five frames are kept in W when using the IPF algorithm; on the right, a closer view for frames 10 to 30 when seven frames are kept in W. Notice that the RMS reprojection error (about 1 pixel of variation) of the projective reconstruction is similar for the original [MHOP01] and the proposed method. However, the new method needs less computational time and fewer memory resources.

Figure 5.2: The RMS reprojection error for the original BPF vs IPF algorithm using 3,5 and 7 frames.

Figure 5.3 shows the variation of the σ5 singular value during incremental projective reconstruction using the synthetic data of figure 5.1, when the maximum number of frames allowed in the measurement matrix is 7. Note that at frame 30 the intermediate frames selected for projective reconstruction are frames 6, 9, 12, 20 and 24.

5.5.2 Processing Time

Figure 5.4 shows the comparison of processing time for the original and the proposed method. The structure and motion are computed using 100 randomly generated points projected along a sequence of 10 frames. Notice how the proposed

Figure 5.3: The variation of the fifth singular value during projective reconstruction using the Incremental Projective Factorization method on frames 2 to 30.

incremental reconstruction algorithm requires constant processing time (0.32 ms) once the number of previous frames to be considered in the reconstruction pipeline has been reached (4 in this example). The processing time was measured on an Intel Pentium IV 1.5 GHz processor with 256 MB of RAM.

Figure 5.4: Processing time for the Incremental Projective Factorization algorithm vs. Batch Projective Factorization.

The reduction in processing time may appear marginal in the last example but, keeping in mind that real image sequences can have hundreds of frames, the advantage of our approach is evident. For example, 10 seconds of video comprise 300 frames; computing the 300th frame with a full measurement matrix would take minutes, compared to less than a second when the proposed algorithm is used allowing only 10 frames in W during projective reconstruction.

5.5.3 Real Image Sequence experiments

In this section we qualitatively evaluate the structure of the scene recovered by the proposed incremental projective factorization method on three real image sequences.

Figure 5.5: Three frames of three real video sequences used in our experiments. Top, the pot sequence taken from [GSV01] consisting of 10 frames, 520 × 390 of image size and 44 feature points. Middle, the cube sequence contains 30 images and 32 salient points. Bottom, two cube sequences with 40 images and 52 salient points.

Considering the general scheme proposed in [RP05, MP05], a 3D self-calibrated reconstruction algorithm is developed. A texture-mapping technique is used to show an interpolated 3D structure of the original scenes. In all cases the texture of the first image is arbitrarily used to show realistic models.

Figure 5.6 shows that the proposed factorization method can efficiently recover the shape and motion from the corresponding image sequences. This can be appreciated from the identifiable geometric components recovered by our method.

Figure 5.6: Reconstructed 3D models from the video sequences (Top). Measured reprojection error for the original BPF and the proposed IPF algorithms (Bottom). Left: 6 and 7 frames were considered for the IPF algorithm in the Pot sequence. Right: 9 and 13 frames are automatically considered for the cube sequences using IPF algorithm.

5.5.4 Conclusions

In this section an algorithm for incremental projective reconstruction has been proposed. We have shown that, by adding an online selection criterion to keep or reject frames, incremental projective factorization can achieve results similar to the original algorithm when a predefined number of frames is allowed to take part in the reconstruction process. An important advantage of the proposed method is that, after a latency period once the maximum allowable number of frames has been reached, the processing time remains constant and similar results are obtained. Experimental results using synthetic and real scenes illustrate the accuracy and performance of the new factorization method.

Chapter 6

Implementation and Experimental Results

This chapter describes how the algorithms proposed in the previous chapters were integrated to deal with the problem of self-calibrated reconstruction. Experiments are presented using video streams to show the overall system performance. The algorithms were programmed in Matlab 7.0 and tested on a personal computer with an Intel Pentium IV 3.0 GHz processor and 2 GB of RAM under the Windows XP operating system.

In order to assess the performance of the new algorithms, real image sequences were captured with a Panasonic DMC-F27 photo camera using a 640 × 480 image resolution. A snapshot camera capturing video streams was preferred over a specialized video camera in order to measure the performance of the algorithms under more challenging conditions. Video sequences grabbed with a photo camera are challenging because the integration time from frame to frame is larger than with video cameras, and significant changes between images are observed even in consecutively captured frames. The camera parameters were fixed during acquisition and the use of zoom was avoided to limit, to some extent, the effects of lens distortion in the captured images.

In the first section we give a brief description of the video sequences used to

test the integrated 3D reconstruction framework. Then we show, step by step, the performance of the novel algorithms on each video stream. Furthermore, we discuss how the novel salient point detection and robust feature matching algorithms collaborate during the full 3D reconstruction pipeline to improve the track length of individual salient points. Finally, the incremental projective factorization is evaluated to assess its performance and to show quantitative and qualitative reconstruction results.

6.1 Self-calibrated reconstruction from video experiments

Figure 6.1 shows 8 images from the 70-frame Outdoor Building video sequence. Figure 6.2 shows 8 images from the 90-frame Library sequence. Observe that both image sequences have repetitive patterns, in the windows and stairs regions respectively, and important viewpoint changes. In both scenes the camera followed a left-to-right trajectory with translational and rotational components while capturing the images.

Figure 6.1: Sample images of the Buildings Sequence.

Figure 6.2: Sample images of the Library Sequence.

6.2 Salient Point detection

A fundamental stage in the development of a full three-dimensional self-calibrated reconstruction algorithm from video is the creation of the data matrix containing corresponding points from successive frames of a video stream. In this section we show a modification of the matching algorithm presented in Chapter 4 to solve the correspondence problem in video streams, which gives better results when the separation between views is short. The improved algorithm increases the number of correct matches and, at the same time, reduces mismatch errors.

The algorithm variation consists of a two-step matching process. First, we restrict the number of salient points used to robustly compute the two-view geometric constraints arising from the fundamental matrix. Then, using the epipolar geometry constraint on the full set of salient points, a semi-dense correspondence matrix is built by combining geometric and photometric properties during the matching stage.

Given a pair of images, the first step restricts the selection of salient points to keep only the most prominent corner points in both images. Here, the idea is to select salient regions with a high probability of being properly matched with their corresponding points in the next image. We use the criterion that salient points with the strongest angular difference between dominant edges are useful for this task. Then, after computing a reliable estimate of the geometric constraints

between images, the full set of detected salient points is evaluated to find their corresponding points between images, using geometric and appearance constraints.

This is an important difference compared to previous approaches, where all the detected salient points in both images are used first to find the fundamental matrix geometric constraints and then to look for their corresponding points.

6.3 Salient point detection by Harris algorithm

The Harris detector [HS88] is based on the second moment matrix 6.3.1, which describes the gradient distribution in a local region around a point:

(6.3.1)   M(\sigma_I, \sigma_D) = \sigma_D\, g(\sigma_I) * \begin{bmatrix} I_x^2(x, \sigma_D) & I_x I_y(x, \sigma_D) \\ I_x I_y(x, \sigma_D) & I_y^2(x, \sigma_D) \end{bmatrix}

Local image derivatives are computed with Gaussian kernels of scale σD (differentiation scale). Then, the computed terms of M are smoothed with a Gaussian window of scale σI (integration scale). The eigenvalues (λ1, λ2) of this matrix represent the two principal signal changes in a neighborhood of the point. This property enables the extraction of points for which the magnitude of both orthogonal directional gradients is high. In particular, the Harris corner detector uses the following corner strength function to select salient points:

(6.3.2)  Cstr = det(M) − 0.04 trace(M)²
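A short Python sketch of this corner response follows, with σD/σI Gaussian smoothing as described above; the function name and the NumPy/SciPy usage are illustrative assumptions, while the 0.04 constant follows equation 6.3.2.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris_response(image, sigma_d=1.0, sigma_i=2.0, k=0.04):
    """Harris corner strength (eq. 6.3.2) from the second moment matrix (eq. 6.3.1)."""
    I = gaussian_filter(image.astype(float), sigma_d)
    Iy, Ix = np.gradient(I)                      # derivatives at scale sigma_d
    # smooth the products with the integration scale sigma_i
    Ixx = gaussian_filter(Ix * Ix, sigma_i)
    Iyy = gaussian_filter(Iy * Iy, sigma_i)
    Ixy = gaussian_filter(Ix * Iy, sigma_i)
    det_M = Ixx * Iyy - Ixy ** 2
    trace_M = Ixx + Iyy
    return det_M - k * trace_M ** 2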

Figure 6.3 shows the result of applying the Harris corner detector and the proposed Cov-Harris algorithm to the Building sequence.

In the top figure, using the Harris corner detector, there is no clear distinction between relevant corners and noise regions, since similar cornerness strength values are given to strong and weak corner regions; see for example the corners located

Figure 6.3: Detected Corners in the first image for the Building Sequence. Top: Using Harris corner detector. Bottom: Using Cov-Harris algorithm.

in the windows and the presumed corners located on the edge of the cylindrical structure. On the other hand, the proposed algorithm ranks the putative corner regions better. Note that the novel algorithm assigns low cornerness responses, and even brings to zero many corner points having subtle texture information when there is only one dominant edge, while highlighting true corner points where two dominant edges intersect in approximately perpendicular directions (see for example the corners near the windows).

On the left of figures 6.3 and 6.4, the sorted cornerness strength values computed with the Harris and Cov-Harris operators are shown. Note that the proposed Cov-Harris algorithm gives zero strength values to the majority of the noisy corner points first estimated by Harris.

The modified Harris algorithm favors the early elimination of bad corner points,

reducing the drift problem in tracking algorithms when they attempt to track salient points located on a single dominant edge.

In the next example (see Figure 6.4) the result of applying the Harris corner detector and the proposed Cov-Harris algorithm to the "Library sequence" is presented. Observe how the Harris corner detector again assigns a high score to many untextured image regions that lack the intuitive notion of a corner.

Figure 6.4: Detected Corners in the first image for the Library sequence. Top: Using Harris corner detector. Bottom: Using Cov-Harris algorithm.

6.4 Matching restricted list to estimate geometric constraints

For each salient point detected by the Cov-Harris operator with a corner strength value greater than zero, a SIFT descriptor is computed. Matches are established between features by using the Euclidean distance between descriptors and, following the approach of [Low04], the best-match ratio helps to reject some mismatch errors. Given ds1 and ds2, the distances of the first and second best matches for a feature descriptor in the set Sp, a match is accepted if:

(6.4.1)  ds1 < 0.8 ds2.

Thus, from the pairwise list of corresponding points, we estimate the fundamental matrix with the robust RANSAC framework.

6.4.1 Robust fundamental matrix estimation

At each iteration a minimal sample (8 pairs of corresponding points) is selected, from which a tentative F is calculated with the eight-point method. The fundamental matrix with the lowest error is kept as the optimal fundamental matrix.

The epipolar lines for each pair of matching points can be derived from F. Those matches whose distance from the corresponding epipolar line is smaller than a threshold are kept as inliers. The process is iterated until a sufficient number of samples has been taken. F is then calculated from the optimal subset, the one with the largest number of inliers, and the final F is derived from all correct matches. Figures 6.5 and 6.6 show the estimated epipolar geometry for a pair of successive images in the test sequences.
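The following Python sketch illustrates this RANSAC loop with a plain (unnormalized) linear eight-point solver; a production implementation would normalize the coordinates first, and all function names here are illustrative rather than the thesis code.

import numpy as np

def eight_point(x1, x2):
    """Linear eight-point estimate of F from >= 8 homogeneous correspondences."""
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1, _), (u2, v2, _) in zip(x1, x2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    U, S, Vt = np.linalg.svd(F)              # enforce rank 2
    return U @ np.diag([S[0], S[1], 0.0]) @ Vt

def epipolar_distance(F, x1, x2):
    """Distance of x2 from the epipolar line F x1."""
    l = (F @ x1.T).T
    return np.abs(np.sum(l * x2, axis=1)) / np.linalg.norm(l[:, :2], axis=1)

def ransac_fundamental(x1, x2, n_iter=500, thresh=1.0, seed=0):
    """RANSAC loop over minimal 8-point samples; returns F and the inlier mask."""
    rng = np.random.default_rng(seed)
    best_F, best_inliers = None, np.zeros(len(x1), dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(len(x1), 8, replace=False)
        F = eight_point(x1[idx], x2[idx])
        inliers = epipolar_distance(F, x1, x2) < thresh
        if inliers.sum() > best_inliers.sum():
            best_F, best_inliers = F, inliers
    if best_inliers.sum() >= 8:              # final refit on all inliers
        best_F = eight_point(x1[best_inliers], x2[best_inliers])
    return best_F, best_inliers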

6.4.2 Enforcing Epipolar Constraint for semi-dense matching

Let I and I′ be two successive images, with salient points x and x′ respectively, and let F be the estimated fundamental matrix representing the geometric relation between both views. Then the following equation is satisfied:

Figure 6.5: Estimated Epipolar Geometry for an image pair of the library sequence.

Figure 6.6: Estimated Epipolar Geometry for an image pair of the Building Sequence.

(6.4.2)  x′ᵀ F x = 0.

Given a salient point x, by selecting the subset of salient points {x′} whose distance to the corresponding epipolar line Fx is below a threshold (set to 2 pixels in all our experiments), the corresponding match is obtained by analyzing geometric and photometric properties of a local window around the candidate corresponding points (see Chapter 4, sub-section 4.3.2 for details):

(6.4.3)  arg min ( ‖ ci − R pi − t ‖ / δ + ω ‖ sxi − sx′i ‖ ).

where sxi and sx′i are SIFT descriptors. The coefficient ω is a factor that controls the weight of the SIFT-based appearance matching metric. The parameter δ manages the degree of interaction between features according to their distance: a small δ value limits matching candidates to a reduced neighborhood, while a larger value allows matching more distant features. In both experiments δ was fixed to 20, which translates into searching for corresponding feature points located up to 20 pixels along the epipolar line, given the estimated rotation, translation and scale that align the feature points using the IC-SIFT algorithm.
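A condensed Python sketch of this epipolar-constrained matching step is shown below; it reuses the combined geometric/appearance score of equation 6.4.3 and assumes homogeneous points, SIFT descriptor arrays and a current rigid estimate (R, t) are available; all names are illustrative.

import numpy as np

def epipolar_guided_match(x1, s1, x2, s2, F, R, t,
                          delta=20.0, omega=0.5, epi_thresh=2.0):
    """For each point in image 1, match within the band around its epipolar line."""
    matches = []
    for i, (p, sp) in enumerate(zip(x1, s1)):
        line = F @ p                                    # epipolar line in image 2
        d_epi = np.abs(x2 @ line) / np.linalg.norm(line[:2])
        candidates = np.where(d_epi < epi_thresh)[0]    # within 2 pixels (eq. 6.4.2)
        if candidates.size == 0:
            continue
        pred = R @ p[:2] + t                            # current rigid prediction
        geo = np.linalg.norm(x2[candidates, :2] - pred, axis=1) / delta
        app = omega * np.linalg.norm(s2[candidates] - sp, axis=1)
        j = candidates[np.argmin(geo + app)]            # eq. 6.4.3
        matches.append((i, j))
    return matches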

Figures 6.7 and 6.8 show the salient points tracked across all frames of the test sequences using the proposed algorithm. On the right of each figure, the trajectory followed by an individual feature point is shown, which reveals the unstabilized nature of the local motion between frames.

Figure 6.7: Salient points tracked in all frames of the Library image sequence, and the trajectory of a tracked salient point in the Library image sequence.

Figure 6.8: Salient points Tracked by KLT in all frames in the Building image Sequence.

We compare the proposed tracking algorithm against the commonly used KLT

algorithm. Figure 6.9 shows the salient points tracked on an intermediate frame of the video stream. Observe how KLT produces many false match correspondences on intermediate frames and loses many salient points early, even in the initial frames of the sequence. This erroneous behavior of the KLT algorithm is accentuated when repetitive patterns exist in the images (see the stairs in the proprietary "Library sequence"). On the other hand, the figure shows that the proposed algorithm keeps a larger number of true matches while significantly reducing mismatch errors.

Figure 6.9: The number of salient points tracked in an intermediate frame. Tracked points found by the KLT algorithm (Left). Tracked points in the same frame using the proposed tracking algorithm (Right).

In these examples there is a remarkable improvement with the proposed algorithm, due to the use of both geometric and appearance constraints, which significantly reduce the chance of assigning false matches on repetitive patterns, since many of them do not satisfy the geometric constraints imposed by the camera motion.

6.5 Projective and Euclidean Reconstruction

When the matching stage is finished, a projective reconstruction is estimated from the first two images. Then, for all subsequent images, the projective reconstruction is upgraded to a Euclidean one. In Figures 6.10 and 6.11 we show the recovered structure of the scenes used for evaluation, shown in figures 6.1 and 6.2 respectively.

Figure 6.10: The projective and Euclidean Reconstruction for a sparse set of tracked points in the Library Sequence.

On the left, the structure of the scenes recovered by the projective reconstruction algorithm is shown using a color scale associated with the estimated projective depth. On the right, the reprojection error is plotted. Red dots show the image locations of tracked points, while circles are drawn at the reprojected locations computed using the projective camera matrices and projective depth estimates. A representative region is displayed in a small zoomed window.

Figure 6.11: The projective and Euclidean Reconstruction for a sparse set of tracked points in the Buildings Sequence.

The reprojection error is a geometric error and is equal to the distance between

the projected point and the measured image location of every salient point. Given the recovered projection matrix P̃ and the estimated 3D point X̃, the associated reprojected image location is given by x̃ = P̃X̃, and the reprojection error is computed using the Euclidean distance d(x, x̃).
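A small Python sketch of this reprojection-error computation follows, assuming P_tilde is a 3 × 4 projection matrix, X a 4 × n array of homogeneous 3D points and x a 2 × n array of measured image points; the function name is illustrative.

import numpy as np

def reprojection_error(P_tilde, X, x):
    """Euclidean distance d(x, x_tilde) between measured and reprojected points."""
    x_h = P_tilde @ X                      # homogeneous reprojections, 3 x n
    x_tilde = x_h[:2] / x_h[2]             # convert to inhomogeneous coordinates
    return np.linalg.norm(x - x_tilde, axis=0)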

In Figure 6.12 the reprojection error is used to quantify how closely the estimated projective depths and camera locations recreate the measured point projections. Note that even image points located on the boundary of the image (close-up view in figure 6.12) have less than one pixel of registration error. Observe that the reprojection error distribution for the Building and Library projective reconstruction estimates has zero mean and 0.5 pixel standard deviation, which indicates a good projective estimate.

Figure 6.12: The reprojection error distribution. Left: Building Sequence, Right: Library Sequence.

6.6 Discussion

The success of self-calibrated reconstruction methods depends strongly on the reliable estimation of pairwise corresponding points in different images. State-of-the-art approaches for finding reliable matching pairs use robust statistical methods to filter the so-called outliers, i.e., wrong matching pairs. Other methods gain robustness during the matching stage through the use of invariant local descriptors or by enforcing geometric constraints. However, enforcing geometric constraints arising from the epipolar geometry or from a planar homography requires knowing the matching salient points. Thus, to find the transformation parameters that establish the geometric constraints, usually the full set of salient points is used, obtained by setting standard threshold parameters on the list of detected salient points. However, as shown in section 3.7, chapter 3, using angular information between dominant edges can improve the selection of feature points with larger gradient variation in at least two principal directions.

In this section we have presented an alternative method for the reliable estimation of matching pairs between images, by preselecting the best-ranked salient points using the angular difference between dominant edges. The pre-selection stage removes unreliable salient points at an early stage, so that the transformation parameters establishing the geometric constraints can be robustly estimated. Then, by enforcing the geometric constraints and by jointly considering geometric and photometric information, a semi-dense set of matches is retrieved.

Remember that the objective of self-calibrated reconstruction is not necessarily to recover a good approximation of the intrinsic structure of the scene being modeled, but to estimate the camera parameters: once the camera parameters are known, methods such as depth from stereo can yield dense depth maps that reveal the underlying three-dimensional structure of the scene.

Chapter 7

Conclusions

7.1 Summary of contributions

This thesis addressed the problem of recovering the three-dimensional scene geometry and the camera motion from a set of images taken with an uncalibrated camera.

Self-calibrated three-dimensional reconstruction algorithms have several processing stages that use a large number of projected image points to estimate the structure of the scene and the camera motion. The quality of every intermediate result has an enormous impact on the quality of the recovered models; in particular, the overall reconstruction depends strongly on the early processing steps.

In this thesis, novel algorithms were proposed for the initial stages of self-calibration: an innovative salient point detection algorithm with an improved quality metric for better ranking of corner points in images; a new algorithm to better match corresponding points in wide-baseline images, together with a variant for multiple feature tracking in unstabilized video streams; and a novel algorithm that selects the best images for the projective reconstruction so that explicit time-limit constraints can be imposed. The integration of the proposed algorithms in a self-calibrated reconstruction framework was also described.

7.1.1 Robust feature matching for wide separated views

In chapter 5 [LLAE06a], a robust feature matching algorithm was proposed, and experimental results showed its improvement on difficult image data sets. The most important difference of the proposed algorithm with respect to state-of-the-art approaches is that it combines geometric and photometric aspects of a local region around each salient point to assign matching correspondences. A matching metric was introduced that enforces geometric and photometric properties in the matching criterion. In addition, a local motion descriptor was used to restrict the search space during match association. The main advantage of the proposed algorithm is that it avoids false match associations and at the same time increases the number of correct matches, even when repetitive patterns exist in local image areas, as is frequently the case in scenes with man-made objects.
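The exact metric is defined in the corresponding chapter; purely as an illustration of the idea of merging photometric and geometric evidence, a weighted combination could look like the following sketch, where the descriptors, motion-context vectors, and the weight alpha are hypothetical.

```python
import numpy as np

def combined_match_cost(desc_a, desc_b, motion_a, motion_b, alpha=0.5):
    """Schematic weighted combination of appearance and spatial evidence;
    the weighting scheme and the names are illustrative, not the exact
    metric proposed in the thesis."""
    photometric = np.linalg.norm(desc_a - desc_b)    # local descriptor distance
    geometric = np.linalg.norm(motion_a - motion_b)  # local motion-context distance
    return alpha * photometric + (1.0 - alpha) * geometric

# The candidate with the smallest combined cost inside the restricted search
# region would be accepted as the match for a given salient point.
```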

7.1.2 Incremental 3D reconstruction by inter-frame selection

A new algorithm for selecting the frames used to recover the camera motion and scene structure under the projective camera model was presented in chapter 6 [LLAE06b]. A new quality metric was introduced based on the fifth singular value computed from a rank-4 measurement matrix. Measuring the contribution of each frame throughout the progressive 3D model reconstruction reduces the memory requirements and keeps the computational cost nearly constant for each subsequent frame after a latency period. Using the fifth singular value as a quality metric has the advantage of being a direct estimate, since its value is already computed inside the projective reconstruction algorithm. With the proposed algorithm it is now possible to enforce time-limit constraints during projective reconstruction, which is a desirable property in applications such as robot guidance and industrial inspection.
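A minimal sketch of this measurement is shown below: the fifth singular value of the rescaled measurement matrix is read off directly from its singular value decomposition. The matrix name and the decision rule in the comment are illustrative assumptions, not the thesis' exact selection criterion.

```python
import numpy as np

def fifth_singular_value(W):
    """Frame-contribution measure (sketch).

    W : (3m, n) rescaled measurement matrix built from m frames and n points.
        For exact projective data W has rank 4, so sigma_5 indicates how far
        the current set of frames deviates from the rank-4 model.
    """
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending order
    return s[4]

# A hypothetical rule: a new frame is retained only if it changes sigma_5
# enough to contribute new information to the projective reconstruction.
```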

7.1.3 Robust feature matching on video sequences

The capability of recovering the structure of a static scene and the camera motion from unstabilized video streams strongly depends on the success of multi-feature tracking algorithms. We addressed this issue by integrating the Cov-Harris interest point detector with the matching algorithm, which uses both geometric and photometric properties of match candidates to robustly assign correspondences between features.

A variant of the matching algorithm proposed in chapter 5, addressing the problem of tracking multiple salient points in video streams, was presented in chapter 6. The proposed algorithm matches salient points in a two-stage process. First, using a limited set of salient points with the strongest angular difference between dominant edges, the geometric constraints based on the fundamental matrix are estimated. Then, by imposing epipolar constraints, a semi-dense matching stage is computed from the full set of corner-like points using an affine invariant local descriptor. Finally, by considering geometric and photometric properties in the matching metric, matches between salient features are assigned.
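For the epipolar test of the second stage, the usual point-to-epipolar-line distance can be used; the following sketch assumes points given in homogeneous coordinates with unit last component, and the one-pixel threshold mentioned in the comment is only an example value.

```python
import numpy as np

def epipolar_distances(F, x1, x2):
    """Point-to-epipolar-line distances used to enforce the epipolar constraint.

    F  : (3, 3) fundamental matrix estimated in the first stage
    x1 : (3, n) homogeneous points in the first image (last coordinate = 1)
    x2 : (3, n) homogeneous points in the second image (last coordinate = 1)
    """
    lines = F @ x1                                  # epipolar lines l' = F x1
    numer = np.abs(np.sum(x2 * lines, axis=0))      # |x2^T F x1|
    denom = np.sqrt(lines[0] ** 2 + lines[1] ** 2)  # normalise by the line gradient
    return numer / denom

# Candidate pairs whose distance exceeds, say, one pixel would be rejected
# before the photometric comparison of the second stage.
```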

In the first step, the robust salient point detection algorithm presented in this thesis plays an important role in the success of the new tracking algorithm: by considering only a subset of truly distinctive salient points, the transformation parameters that impose the geometric constraints for feature matching are estimated. In this way, a remarkable reduction of the search space of the robust RANSAC framework is achieved by removing noisy corner-like regions.

7.2 Future work

The 3D reconstruction problem with uncalibrated cameras has been studied by several research groups, and enhancements to the pioneering algorithms have recently been proposed. Further investigation is still necessary before a self-calibrated reconstruction method can run in real applications. Due to the limited time frame of the research, some important stages were left uncovered; in particular, we are interested in addressing the following open themes:

Recovering dense three-dimensional models by exploring multi-view stereo reconstruction methods that use the camera pose information to rectify images, reducing the ambiguity in the search for corresponding points.

Adding the previous stage also highlights the need to integrate an algorithm that merges partial models, by using a 3D registration algorithm extending the IC-SIFT algorithm.

Another avenue for future work is the combination of self-calibrated reconstruction algorithms with photometric stereo methods, for example using reciprocal image pairs, which eliminates the common assumption that light irradiance is the same in all directions and allows the recovery of three-dimensional information even in textureless surface areas.

Moreover, even the most recent state-of-the-art algorithms, including those proposed here, have limitations, especially in the first stages such as feature detection and the correspondence problem: false matching results give rise to low-quality reconstructed models and can even prevent the recovery of camera motion and scene structure.

7.2.1 Tracking algorithm with motion blur

The track length of single features in multi-feature tracking algorithms is typically short, between 10 and 25 frames, for unstabilized video streams. The rapid decrease of the matching scores is primarily due to the motion blur caused by camera motion. It would be desirable to add processing stages that deal with this problem by exploring local and global approaches. As a local approach, the Cov-Harris operator could be extended to measure the degree of local motion blur. As a global approach, the degree of motion blur could be measured every time a frame is captured, skipping frames until sharp corners are detected.
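As a possible starting point for the global approach, a standard sharpness proxy such as the variance of the Laplacian could be computed per frame. This is only a suggestion under stated assumptions, not a technique evaluated in this thesis.

```python
import cv2

def sharpness(gray_frame):
    """Variance of the Laplacian, a common sharpness proxy for detecting
    motion-blurred frames (suggested measure, not part of the thesis)."""
    return cv2.Laplacian(gray_frame, cv2.CV_64F).var()

# Frames whose sharpness falls below a threshold could be skipped until a
# sharp frame arrives, before running the Cov-Harris detector and the tracker.
```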

7.2.2 Inter-frame selection removing critical configurations

General self-calibrated reconstruction algorithms have problems with degenerate camera setups, for example when the camera motion is restricted to pure translation or pure rotation. Rotation and translation information provided by the robust feature matching algorithm could be integrated into the inter-frame selection stage to prevent the use of images captured under degenerate camera motion.

In addition, we are interested in exploring some emerging research topics in the current literature, such as collaborative approaches to the structure-from-motion problem and real-time processing of the full self-calibrated pipeline.

7.2.3 Collaborative structure from motion

Research on self-calibrated 3D reconstruction from images has been centered on algorithmic solutions for individual stages of the complete task using a sequential computing approach. In most methods the tracking algorithm is considered a preliminary step that builds the data matrix used by the reconstruction algorithms. In general, tracking and reconstruction are assumed to be independent processes, and little work has been done on algorithms that employ a collaborative approach allowing feedback between the 3D reconstruction stages.

7.2.4 Real-time processing

Self-calibrated three-dimensional reconstruction algorithms have several process- ing stages. Stages like salient point detection, multiple feature matching and even projective reconstruction are solved by applying the same processing steps on multiple similar data structures.

In particular, the interest point detector and the tracking algorithms are computationally expensive processes that have so far prevented real-time 3D reconstruction on general purpose processors. Profiling the full reconstruction process shows that the robust detection and tracking algorithms take 45 to 50 percent of the CPU time, which prevents estimating shape and motion under real-time constraints.

It would be interesting to develop a hardware-software architecture to assist in the self-calibrated reconstruction process. In the initial reconstruction stages, robust detection and tracking of salient points require high computational power, and in the last stages image rectification and dense stereo computation are also computationally expensive.

Our work on dimensional reduction for the projective factorization method points in this direction. The proposed factorization algorithm is well suited for a hardware implementation where strict time-limit constraints can be imposed, since, by keeping the size of the internal matrices fixed, the processing time is approximately constant. In addition, the memory management could be implemented under hardware space constraints for mobile applications.

List of Tables

3.1 Cornerness strength function for similar corner detectors . . . . . 46

4.1 Matching results for the boat image pair ...... 78

4.2 Matching results for the cube image pair ...... 79

4.3 Matching results for the library image pair ...... 80

4.4 Matching results for the Sculpture image pair ...... 81

4.5 Matching results for the Chessboard image pair ...... 82

List of Figures

1.1 The steps to achieve self-calibration from multiple images taken from [PGV+04]...... 6

2.1 Projective Camera Model...... 17

2.2 Orthographic Camera Model...... 19

2.3 The effect of radial distortion...... 20

2.4 Two view geometry constraints modeled by the fundamental ma- trix F. The two camera centers are indicated by C and C’. The camera centers, a 3D-space point X, and its images x and x’ lie in a common plane Π. The ray defined by the first camera center, C, and the point X is imaged as a line l’. The 3D-space point X which projects to x must lie on l’...... 21

2.5 A planar homography H maps a point x from the plane Π to a point x' in the plane Π′...... 24

2.6 Calibrated Reconstruction by Triangulation...... 31

2.7 Examples of calibration patterns...... 32

2.8 The absolute Conic in the plane at infinity is projected in the same image location of a moving camera...... 36

3.1 The correspondence problem. Two corresponding image regions of the same scene element have different appearance due to projective distortion...... 42

3.2 Feature points detected by the Harris operator...... 45

3.3 Salient points detected by the SIFT operator. Arrow lengths indicate the keypoint scale...... 49

3.4 The effect of the Sobel operator in the smoothed squared terms of the Harris operator (left). The result of using the Gaussian derivative with scale σD = 9 pixels, σI = 2 × σD ...... 54

3.5 The effect of using different derivative filters in the Harris algorithm. The Sobel Operator (a) and Gaussian Derivative with parameters σD = 9 pixels and σI = 2 × σD (b)...... 55

3.6 The effect of using the Sobel Mask and the Gaussian derivative in the Harris algorithm...... 56

3.7 Example of a probable corner point and its associated squared par- tial derivatives...... 56

3.8 A probable corner point and the segmentation process to extract dominant edges (Top). Dominant edge estimation from the covari- ance matrices for each partial derivative and the angular difference between dominant edges...... 57

3.9 Comparative results between Harris (a & b) and Cov-Harris algorithms (c & d). The effect of using in both algorithms the Sobel Operator and the Gaussian derivative with parameter σD = 9 pixels and σI = 2 × σD...... 60

3.10 Comparison between Harris (top) and Cov-Harris (bottom). The effect of using the Sobel Operator and the Gaussian derivative with parameter σD = 9 pixels and σI = 2 × σD...... 61

4.1 SIFT-Global Context taken from [MDS05]. (a) Original images with selected feature points marked. (b) SIFT (left) and shape context (right) of point marked in (a). (c) Reversed Curvature image of (a) with shape context bins overlaid...... 65

4.2 Example of use of ICP algorithm to align two set of points under different scales...... 69

4.3 Example of the 4x4 Motion Context box describing the character- istic local motion of the neighbour features...... 75

4.4 Original test image pairs. Images are subject to rotation, scale and viewpoint change transformations. From left to right: Boat, Cube, Library, Sculpture, and Chessboard image pairs...... 76

4.5 Feature correspondences initially found by SIFT and the proposed IC-SIFT algorithm (Top). Registered points using matches found by SIFT, ICP, and IC-SIFT algorithms (Middle). The Initial and Final Motion Context. (Bottom) ...... 77

4.6 Matches found by proposed IC-SIFT Algorithm in the cube image pair (Top). Registered points using matches found by SIFT, ICP, and IC- SIFT algorithms (Bottom)...... 79

4.7 The Motion Context for the cube image pair and the associated characteristic displacement vectors ...... 80

4.8 Match pairs found using SIFT, ICP and IC-SIFT algorithms in the library image pair (Top). Registered points using matches found by SIFT, ICP, and IC- SIFT algorithms (Bottom)...... 81

4.9 From left to right matches by SIFT and IC-SIFT (Top). The reg- istration error for SIFT and IC-SIFT (Bottom)...... 82

4.10 The evolution of the Motion Context in the IC-SIFT algorithm propagates spatial information to resolve matching conflicts. . . . 83

4.11 Performance comparison between SIFT and IC-SIFT for matching points in a calibration chessboard pattern...... 84

5.1 The synthetic hemisphere points and camera locations used for comparing the original Batch Projective Factorization vs Incremental Projective Factorization methods...... 94

5.2 The RMS reprojection error for the original BPF vs IPF algorithm using 3, 5 and 7 frames...... 95

5.3 The variation of the fifth singular value during projective reconstruction using the Incremental Projective Factorization method on frames 2 to 30...... 96

5.4 The processing time for the Incremental Projective Factorization algorithm vs Batch Projective Factorization...... 96

5.5 Three frames of three real video sequences used in our experiments. Top, the pot sequence taken from [GSV01] consisting of 10 frames, 520 × 390 of image size and 44 feature points. Middle, the cube sequence contains 30 images and 32 salient points. Bottom, two cube sequences with 40 images and 52 salient points...... 97

5.6 Reconstructed 3D models from the video sequences (Top). Measured reprojection error for the original BPF and the proposed IPF algorithms (Bottom). Left: 6 and 7 frames were considered for the IPF algorithm in the Pot sequence. Right: 9 and 13 frames are automatically considered for the cube sequences using the IPF algorithm. 98

6.1 Sample images of the Buildings Sequence...... 100

6.2 Sample images of the Library Sequence...... 101

6.3 Detected Corners in the first image for the Building Sequence. Top: Using Harris corner detector. Bottom: Using Cov-Harris algorithm. 103

6.4 Detected Corners in the first image for the Library sequence. Top: Using Harris corner detector. Bottom: Using Cov-Harris algorithm. 104

6.5 Estimated Epipolar Geometry for an image pair of the library sequence...... 106

6.6 Estimated Epipolar Geometry for an image pair of the Building Sequence...... 106

6.7 Salient points tracked in all frames of the Library image sequence. The trajectory of a salient point tracked in the Library image sequence. . 107

6.8 Salient points Tracked by KLT in all frames in the Building image Sequence...... 107

6.9 The number of salient points tracked in an intermediate frame. Tracked points found by the KLT algorithm (Left). Tracked points in the same frame using the proposed tracking algorithm (Right). 108

6.10 The projective and Euclidean Reconstruction for a sparse set of tracked points in the Library Sequence...... 109

6.11 The projective and Euclidean Reconstruction for a sparse set of tracked points in the Buildings Sequence...... 109

6.12 The reprojection error distribution. Left: Building Sequence, Right: Library Sequence...... 110

Bibliography

[AAK71] Y. Abdel-Aziz and H. Karara, Direct linear transformation from com- parator coordinates into object space coordinates in close-range pho- togrammetry, In Proc. IEEE Int. Conf. on Image Processing 1 (1971), 1–18.

[AG93] P. Aschwanden and W. Guggenbuhl, Experimental results from a comparative study on correlation-type registration algorithms, In Ro- bust Computer Vision 20 (1993), 268–289.

[AS98] S. Avidan and A. Shashua, Novel view synthesis by cascading trilinear tensors, IEEE Trans. Visualization and Computer Graphics 4 (1998), 293–306.

[Atk01] K. B. Atkinson, Close range photogrammetry and machine vision, Cambridge University Press, Whittles, 2001.

[AZH96] M. Armstrong, A. Zisserman, and R. I. Hartley, Self-calibration from image triplets, ECCV (1), 1996, pp. 3–16.

[BBH03] M.Z. Brown, D. Burschka, and G.D. Hager, Advances in computa- tional stereo, IEEE Trans. Pattern Analysis and Machine Intelligence 25 (2003), 993–1008.

[BCC90] T. J. Broida, S. Chandrashekhar, and R. Chellappa, Recursive estimation of 3d motion from a monocular image sequence, IEEE Transactions on Aerosp. Electron 26 (1990), 639–656.

[Bea78] P. R. Beaudet, Rotationally invariant image operators, In Proceedings of the 4th International Joint Conference on Pattern Recognition 1 (1978), 579–583.

[BH87] K. Berthold and P. Horn, Closed–form solution of absolute orientation using unit quaternions, Optical Society of America 4 (1987), 629–642.

[BI99] A. F. Bobick and S. S. Intille, Large occlusion stereo, Intl. J. Comp. Vision 33(3) (1999), 181–200.

[BM92] P. J. Besl and N. D. McKay, A method for registration of 3–d shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (1992), 239–256.

[BMP02] S. Belongie, J. Malik, and J. Puzicha, Shape matching and ob- ject recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002), 509–522.

[BN98] D. N. Bhat and S. K. Nayar, Ordinal measures for image correspon- dence, IEEE Trans. Pattern Analysis and Machine Intelligence 20 (1998), 415–423.

[Bro76] D. C. Brown, The bundle method – progress and prospects, Interna- tional Archives of Photogrammetry 21 (1976), 303–336.

[BS03] A. Bartoli and P. F. Sturm, Nonlinear estimation of the fundamental matrix with minimal parameters, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2003), no. 3, 426–432.

[Che91] D. Chetverikov, Fast neighborhood search in planar point sets, Pattern Recognition Letters 12 (1991), 409–412.

[CJ02] G. Carneiro and A. Jepson, Phase–based local features, In ECCV 1 (2002), 282–296.

[CM92] Y. Chen and G. Medioni, Object modelling by registration of multiple range images, Image and Vision Computing 10 (1992), 145–155.

[Cur] B. L. Curless, New methods for surface reconstruction from range images, Ph.D. Thesis CSL-TR-97-733, 1997, pp. 1–206.

[Dav03] A. J. Davison, Real–time simultaneous localisation and mapping with a single camera, ICCV (2003), 1403–1410.

[Dav05] A. Davison, Active search for real-time vision, In Proceedings the International Conference on Computer Vision 1 (2005), 66–73.

[DRMS07] A. J. Davison, I. D. Reid, N. Molton, and O. Stasse, Monoslam: Real- time single camera slam, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007), no. 6, 1052–1067.

[FA91] W. Freeman and E. Adelson, The design and use of steerable filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991), 891–906.

[FG87] W. Förstner and E. Gülch, A fast operator for detection and precise location of distinct points, corners and centres of circular features, In Intercommission Conference on Fast Processing of Photogrammetric Data 1 (1987), 281–305.

[FL01] O. Faugeras and Q. T. Luong., The geometry of multiple images, MIT Press, 2001.

[FLM92] O. D. Faugeras, Q. T. Luong, and S. J. Maybank, Camera self- calibration: Theory and experiments, ECCV (1992), 321–334.

[FP02] D. A. Forsyth and J. Ponce, Computer vision: A modern approach, Prentice Hall, 2002.

[FR96] O. Faugeras and L. Robert, What can two images tell us about a third one?, Intl J. Computer Vision 1 (1996), 5–20.

[FS93] J. Flusser and T. Suk, Pattern recognition by affine moment invari- ants, International Journal of Computer Vision 26 (1993), 167–174.

[FTG03] V. Ferrari, T. Tuytelaars, and L. J. Van Gool, Wide baseline multiple-view correspondences, in Proc. IEEE Conf. on Computer Vision and Pattern Recognition 1 (2003), 718–725.

[FTTR99] A. Fusiello, E. Trucco, T. Tommasini, and V. Roberto, Improving fea- ture tracking with robust statistics, Pattern Analysis and Applications 2 (1999), 312–320.

[FTV97] A. Fusiello, E. Trucco, and A. Verri, Rectification with unconstrained stereo geometry, Proceedings of the British Machine Vision Confer- ence (1997), 400–409.

[FTV00] , A compact algorithm for rectification of stereo pairs, Mach. Vis. Appl. 12 (2000), no. 1, 16–22.

[GA05] A. Gruen and D. Akca, Least squares 3d surface and curve matching, International Journal of Photogrammetry and Remote Sensing 59 (2005), 151–174.

[GCH+02] S. Gibson, J. Cook, T. Howard, R. Hubbold, and D. Oram, Accu- rate camera calibration for off–line, video–based augmented reality, ISMAR (2002), 37–46.

[GD05] K. Grauman and T. Darrell, Pyramid match kernels: Discriminative classification with sets of image features, In Proc. ICCV 1 (2005), 1458–1465.

[GLB01] G. Godin, D. Laurendeau, and R. Bergevin, A method for the registration of attributed range images, IEEE International Conference on 3D Imaging and Modeling (2001), 179–186.

[GMU96] L. J. Van Gool, T. Moons, and D. Ungureanu, Affine/photometric invariants for planar intensity patterns, In Proceedings of the 4th European Conference on Computer Vision (1996), 642–651.

[GPK98] N. Georgis, M. Petrou, and J. Kittler, On the correspondence problem for wide angular separation of non–coplanar points, Image and Vision Computing 16 (1998), 35–41.

[GRB94] G. Godin, M. Rioux, and R. Baribeau, Three–dimensional registration using range and intensity information, Videometrics III, Proc. SPIE 2350 (1994), 279–290.

[GSV01] E. Grossman and J. Santos-Victor, Algebraic aspects of reconstruction of structured scenes from one or more views, In Proceedings of the BMVC (2001), 633–642.

[GTS+07] G. Grisetti, G. D. Tipaldi, C. Stachniss, W. Burgard, and D. Nardi, Fast and accurate slam with rao-blackwellized particle filters, Robotics and Autonomous Systems 55 (2007), no. 1, 30–38.

[HA97] A. Heyden and K. Åström, Euclidean reconstruction from image sequences with varying and unknown focal length and principal point, In IEEE Conf. Computer Vision and Pattern Recognition (1997), 438–443.

[HA99] , Flexible calibration: Minimal cases for auto-calibration, ICCV, 1999, pp. 350–355.

[Har92] R. I. Hartley, Estimation of relative camera positions for uncalibrated cameras, ECCV, 1992, pp. 579–587.

[Har93] , Euclidean reconstruction from uncalibrated cameras, in: J. L. Mundy, A. Zisserman, and D. Forsyth (eds.), Applications of Invariance in Computer Vision, LNCS, Springer Verlag 1 (1993), 237–256.

[Har95] R. I. Hartley., In defence of the 8-point algorithm, In Proceedings of the IEEE International Conference on Computer Vision, 1995, pp. 1064–1070.

[Har98] R. I. Hartley, Theory and practice of projective rectification, Journal of Computer Vision, 1998, pp. 1–16.

[HB96] G. Hager and P. Belhumeur, Real–time tracking of image regions with changes in geometry and illumination, In proceedings of IEEE Con- ference on Computer vision and Pattern Recognition (1996), 403–410.

[Héb01] P. Hébert, A self-referenced hand-held range sensor, 3DIM, 2001, pp. 5–12.

[Hei00] J. Heikkilä, Geometric camera calibration using circular control points, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (2000), 1066–1077.

[HK00] M. Han and T. Kanade, Creating 3d models with uncalibrated cam- eras, Workshop on the Application of Computer Vision (WACV2000) 9 (2000), 137–154.

[HS88] C. Harris and M. Stephens, A combined corner and edge detector, In M. M. Matthews, editor, Proc. Of the 4th ALVEY vision conference 27 (1988), 147–151.

[HS92] R. M. Haralick and L. G. Shapiro, Computer and robot vision, Addison–Wesley, 1992.

[HZ00a] R. I. Hartley and A. Zisserman, Multiple view geometry in computer vision, Cambridge University Press, Cambridge, UK, 2000.

[HZ00b] R. I. Hartley and A. Zisserman (eds.), Multiple view geometry in computer vision, 1 edn., Cambridge, 2000.

[JK97] A. Johnson and S. Kang, Registration and integration of textured 3–d data, In Conference on Recent Advances in 3–D Digital Imaging and Modeling (1997), 234–241.

[KB01] T. Kadir and M. Brady, Scale, saliency and image description, Inter- national Journal of Computer Vision 45 (2001), 83–105.

[Koe84] J. J. Koenderink, The structure of images, Biological Cybernetics 50 (1984), 363–396.

[Kov00] P. Kovesi, Phase congruency: A low-level image invariant, Psychological Research Psychologische Forschung, Springer-Verlag 64 (2000), 136–148.

[KR82] L. Kitchen and A. Rosenfeld, Gray–level corner detection, Pattern Recognition Letters 1 (1982), 95–102.

[KRD07] M. Kaess, A. Ranganathan, and F. Dellaert, Fast incremental square root information smoothing, Proceedings of the 20th International Joint Conference on Artificial Intelligence 1 (2007), 2129–2134.

[Kru13] E. Kruppa, Zur ermittlung eines objektes aus zwei perspektiven mit innerer orientierung (1913), 1939–1948.

[KT91] T. Kanade and C. Tomasi, Detection and tracking of point features, CMU Technical Report CMU–CS–91–132 1 (1991), 91–132.

[LF94] S. Laveau and O. D. Faugeras, 3–d scene representation as a collec- tion of images and fundamental matrices, Technical Report RR–2205, INRIA 1 (1994), 1–25.

[Lin98] T. Lindeberg, Feature detection with automatic scale selection, Inter- national Journal of Computer Vision 30 (1998), 79–116.

[LK81] B. D. Lucas and T. Kanade, An iterative image registration technique with an application to stereo vision, In IJCAI–81 1 (1981), 674–679.

[LLAE06a] R. Lemuz-López and M. Arias-Estrada, A domain reduction algorithm for incremental projective reconstruction, International Symposium in Visual Computing (2), 2006, pp. 564–575.

[LLAE06b] , Iterative closest sift formulation for robust feature matching, International Symposium in Visual Computing (2), 2006, pp. 502–513.

[Low99] D.G Lowe, Object recognition from local scale-invariant features, In- ternational Conference on Computer Vision 1 (1999), 1150–1157.

[Low04] D. Lowe, Distinctive image features from scale–invariant keypoints, International Journal of Computer Vision 60 (2004), 91–110.

[LVD98] J. M. Lavest, M. Viala, and M. Dhome, Do we really need an accurate calibration pattern to achieve a reliable camera calibration?, In Proc. ECCV 1 (1998), 158–174.

[LZWL04] Y. Lu, J. Z. Zhang, Q. M. J. Wu, and Z. N. Li, A survey of motion- parallax-based 3-d reconstruction algorithms, IEEE Trans. on Sys- tems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 34, 2004, pp. 532–548.

[MDS05] E. N. Mortensen, H. Deng, and L. Shapiro, Descriptor with global context, In Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition 1 (2005), 184–190.

[MF92] S. J. Maybank and O. D. Faugeras, A theory of self-calibration of a moving camera, International Journal of Computer Vision 8 (1992), 123–151.

[MHOP01] S. Mahamud, M. Hebert, Y. Omori, and J. Ponce, Provably-convergent iterative methods for projective structure from motion, CVPR 1 (2001), 1018–1025.

[MK94] T. Morita and T. Kanade., A sequential factorization method for re- covering shape and motion from image streams, In Proceedings of ARPA Image Understanding Workshop 2 (1994), 1177–1188.

[MLR98] P. Meer, R. Lenz, and S. Ramakrishna, Efficient invariant represen- tations, Intl J. Computer Vison 26 (1998), 137–152.

[MO87] M. C. Morrone and R. A. Owens, Feature detection from local energy, Pattern Recognition Letters 6 (1987), 303–313.

[Mor79] H. P. Moravec, Visual mapping by a robot rover, in International Joint Conference on Artificial Intelligence 1 (1979), 598–600.

[MP05] D. Martinec and T. Pajdla, 3d reconstruction by fitting low–rank ma- trices with missing data, CVPR 1 (2005), 198–205.

[MP06] , 3d reconstruction by gluing pair-wise euclidean reconstructions, or "how to achieve a good reconstruction from bad images", In 3DPVT 1 (2006), 25–32.

[MPG99] M. Pollefeys, R. Koch, and L. J. Van Gool, Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters, Int'l J. Computer Vision 1 (1999), 7–25.

[MRM94] P. McLauchlan, I. Reid, and D. Murray, Recursive affine structure and motion from image sequences, In Proceedings of the 3rd European Conference on Computer Vision (1994), 217–224.

[MS02] K. Mikolajczyk. and C. Schmid, An affine invariant interest point detector, In Proc. of 7th ECCV 1 (2002), 128–142.

[MS04] K. Mikolajczyk and C. Schmid., Scale and affine invariant interest point detectors, Int. Journal Computer Vision 60 (2004), 63–86.

[MS05a] K. Mikolajczyk and C. Schmid, A comparison of affine region detectors, International Journal of Computer Vision (2005), 43–72.

[MS05b] K. Mikolajczyk and C. Schmid., A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis & Machine In- telligence (2005), 1615–1630.

[Nis03] D. Nist´er, Preemptive ransac for live structure and motion estimation, ICCV (2003), 199–206.

[NMSO96] Y. Nakamura, T. Matsuura, K. Satoh, and Y. Ohta, Occlusion de- tectable stereo – occlusion patterns in camera matrix, In CVPR96 1 (1996), 371–378.

[OK92] M. Okutomi and T. Kanade, A locally adaptive window for signal matching, Intl. J. Comp. Vision 7 (1992), no. 2, 143–162.

[PG97] M. Pollefeys and L. J. Van Gool, A stratified approach to metric self- calibration, CVPR, 1997, pp. 407–412.

[PG99] , Stratified self-calibration with the modulus constraint, IEEE Trans. Pattern Anal. Mach. Intell. 21 (1999), 707–724.

[PGP96] M. Pollefeys, L. J. Van Gool, and M. Proesmans, Euclidean 3d re- construction from image sequences with variable focal lengths, ECCV (1), 1996, pp. 31–42.

[PGV+02] M. Pollefeys, L. J. Gool, M. Vergauwen, K. Cornelis, F. Verbiest, and J. Tops, Video–to–3d, In Proceedings of Photogrammetric Com- puter Vision 2002 (ISPRS Commission III Symposium), International Archive of Photogrammetry and Remote Sensing (2002), 247–252.

[PGV+04] M. Pollefeys, L. J. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, Visual modeling with a hand–held camera, International Journal of Computer Vision 59 (2004), 207–232.

[PK97] C. J. Poelman and T. Kanade, A paraperspective factorization method for shape and motion recovery, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997), no. 3, 206–218.

[PKG98] M. Pollefeys, R. Koch, and L. J. Van Gool, Self-calibration and met- ric reconstruction in spite of varying and unknown internal camera parameters, ICCV, 1998, pp. 90–95.

[PKG99] , A simple and efficient rectification method for general mo- tion., ICCV, 1999, pp. 496–501.

[PZ98] P. Pritchett and A. Zisserman, Wide baseline stereo matching, Proc. Intl Conf. Computer Vision (1998), 754–760.

[QL98] L. Quan and Z. Lan, Linear n ≤ 4–point pose determination, In IEEE Int. Conf. Computer Vision (1998), 778–783.

[RL01] S. Rusinkiewicz and M. Levoy, Efficient variants of the icp algorithm, Proceedings of IEEE 3DIM (2001), 145–152.

[RMC97] S. Roy, J. Meunier, and I. J. Cox., Cylindrical rectification to mini- mize epipolar distortion, Conference on Computer Vision and Pattern Recognition, 1997, pp. 393–399.

[Roh94] K. Rohr, Localization properties of direct corner detectors, Journal of Mathematical Imaging and Vision 4 (1994), no. 2, 139–150.

[Roh05] , Fundamental Limits in 3D Landmark Localization, Proc. 19th Internat. Conf. on Information Processing in Medical Imaging (IPMI’05) 3565 (2005), 286–298.

[Ros96] P. L. Rosin, Augmenting corner descriptors, Graphical Models and Image Processing 58 (1996), 286–294.

[RP05] J. Repko and M. Pollefeys, 3d models from extended uncalibrated video sequences: Addressing key–frame selection and projective drift, In Proceedings of 3DIM 1 (2005), 150–157.

[SB97] S. M. Smith and J. M. Brady, Susan – a new approach to low level im- age processing, International Journal of Computer Vision 23 (1997), 45–78.

[SCD+06] S. Seitz, B. L. Curless, J. Diebel, D. Scharstein, and R. Szeliski, A comparison and evaluation of multi-view stereo reconstruction algorithms, In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (2006), 519–526.

[SCMS01a] G. Slabaugh, B. Culbertson, T. Malzbender, and R. Schafer, A sur- vey of methods for volumetric scene reconstruction from photographs, 2001, pp. 81–100.

[SCMS01b] G. Slabaugh, B. Culbertson, T. Malzbender, and R. Shafer, A survey of methods for volumetric scene reconstruction from photographs, In Intl. WS on Volume Graphics, vol. 57, 2001, pp. 81–100.

[SHF01] S. Soatto, H. Jin, and P. Favaro, Real-time feature tracking and outlier rejection with changes in illumination, Proc. IEEE ICCV'01 1 (2001), 684–689.

[SJH98] T. Schuts, T. Jost, and H. Hugli, Multi–featured matching algorithm for free–form 3d surface registration, In IEEE International Confer- ence on Pattern Recognition (1998), 982–984.

[SM97] C. Schmid and R. Mohr, Local grayvalue invariants for image re- trieval, IEEE Transactions on Pattern Analysis and Machine Intelli- gence 19 (1997), no. 5, 530–534.

[Soj03] E. Sojka, A new approach to detecting the corners in digital images, In Proc. IEEE Int. Conf. on Image Processing 2 (2003), 445–448.

[SPFP96] S. Soatto, P. Perona, R. Frezza, and G. Picci, Motion estimation via dynamic vision, IEEE Trans. Automat. Contr. 41 (1996), 393–413.

[SS96] S. Soatto, R. Frezza, and P. Perona, Motion estimation via dynamic vision, IEEE Transactions on Automatic Control 41 (1996), 393–413.

[SS02] D. Scharstein and R. Szeliski, A taxonomy and evaluation of dense two–frame stereo correspondence algorithms, Int´l J. Computer Vision 47 (2002), 7–42.

[ST94] J. Shi and C. Tomasi, Good features to track, IEEE Conference on Computer Vision and Pattern Recognition (CVPR’94) (1994), 593– 600.

[ST01] P. Sturm and B. Triggs, A factorization based algorithm for multi– image projective structure and motion, In Proceedings of European Conference on Computer Vision (Eccv’96) 1065 (2001), 709–720.

[SZ97] C. Schmid and A. Zisserman, Automatic line matching across views, Proc. Conf. Computer Vision and Pattern Recognition (1997), 666– 671.

[SZ02] F. Schaffalitzky and A. Zisserman, Multi–view matching for unordered image sets, In Proceedings of the 7th European Conference on Com- puter Vision 2350 (2002), 414–431.

[TFZ98] P. Torr, A. FitzGibbon, and A. Zisserman, Maintaining multiple mo- tion model hypotheses through many views to recover matching and structure, ICCV 1 (1998), 485–491.

[TG04] T. Tuytelaars and L. J. Van Gool, Matching widely separated views based on affine invariant regions, International Journal of Computer Vision 59 (2004), 61–85.

[THDL04] D. Tubic, P. Hébert, J. D. Deschênes, and D. Laurendeau, A unified representation for interactive 3d modeling, 3DPVT, 2004, pp. 175–182.

[THL02] D. Tubic, P. Hébert, and D. Laurendeau, A volumetric approach for interactive 3d modeling, 3DPVT, 2002, pp. 150–158.

[THL03] D. Tubic, P. Hébert, and D. Laurendeau, Efficient surface reconstruction from range curves, 3DIM, 2003, pp. 54–61.

[TK92a] C. Tomasi and T. Kanade, Shape and motion from image streams – a factorization method, Int’l J. of Computer Vision 9 (1992), 137–154.

[TK92b] C. Tomasi and T. Kanade, Shape and motion from image streams under orthography: A factorization method, IJCV 9 (1992), 137–154.

[TM06] S. Thrun and M. Montemerlo, The graph slam algorithm with appli- cations to large-scale mapping of urban structures, I. J. Robotic Res. 25 (2006), no. 5–6, 403–429.

[TMHF99] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, Bundle adjustment–a modern synthesis, Workshop on Vision Algo- rithms (1999), 298–372.

[TP86] V. Torre and T. A. Poggio, On edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (1986), 147–163.

[Tri97] B. Triggs, Autocalibration and the absolute quadric, In Proc. IEEE Conf. Computer Vision and Pattern Recognition (1997), 609–614.

[Tri99] , Camera pose and calibration from 4 or 5 known 3d points, Proceedings of the 7th International Conference on Computer Vision (1999), 278–284.

[TS04] P. Tissainayagam and D. Suter, Assessing the performance of corner detectors for point feature tracking applications, Image Vision Com- put. 22 (2004), 663–679.

[Tsa87] R. Tsai, A versatile camera calibration technique for high-accuracy 3d machine vision metrology using off–the–shelf tv cameras and lenses, IEEE J. Robotics and Automation 3 (1987), 323–344.

[TV98] E. Trucco and A. Verri, Introductory techniques for 3-d computer vision, Prentice Hall, 1998.

[ZDFL95a] Z. Zhang, R. Deriche, O. Faugeras, and Q. Luong, A robust technique for matching two uncalibrated images through the recovery of the un- known epipolar geometry, Artificial Intelligence Journal 78 (1995), 87–119.

[ZDFL95b] Z. Zhang, R. Deriche, O. D. Faugeras, and Q. T. Luong, A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry, Artif. Intell 78 (1995), 87–119.

[Zha94] Z. Zhang, Iterative point matching for registration of free–form curves and surfaces, Int. J. Comput. Vision 13 (1994), 119–152.

[ZKPF99] B. Zitova, J. Kautsky, G. Peters, and J. Flusser, Augmenting corner descriptors, Pattern Recognition Letters 2 (1999), 199–206.

[ZW94] R. Zabih and J. Woodfill, Non-parametric local transforms for com- puting visual correspondence, Proc. 3rd European Conf. Computer Vision 2 (1994), 150–158.