Grouping, Matching and Reconstruction in Multiple View Geometry

F. Schaffalitzky

Robotics Research Group Department of Engineering Science University of Oxford, UK

April 2002 Contents

1 Introduction 6

1.1 Objective and Motivation ...... 6

1.2 Overview ...... 8

2 Grouping with geometric constraints 12

2.1 Overview ...... 12

2.2 Objective and background ...... 12

2.3 Theory and geometry ...... 14

2.3.1 Parallel lines ...... 14

2.3.2 Equally spaced parallel lines ...... 15

2.3.3 Elations ...... 18

2.3.4 Regular grids ...... 21

2.4 Algorithms ...... 21

2.4.1 Vanishing point detection ...... 22

2.4.2 Computing the equally spaced line geometry ...... 24

2.4.3 Computing elations ...... 27

2.4.4 Computing grids ...... 28

2.5 Results ...... 30

2.5.1 Vanishing points ...... 30

2.5.2 Equally spaced lines ...... 30

2.5.3 Elations ...... 35

2.5.4 Grids ...... 40

2.6 Assessment ...... 40

3 Projective reconstruction 43

3.1 Overview ...... 43

3.2 Motivation and objective ...... 43

3.3 Theory ...... 45

3.3.1 Problem formulation and notation ...... 46

3.3.2 Pencils of cameras ...... 47

3.3.3 Derivation of quadratic constraints ...... 47

3.3.4 Five irrelevant solutions ...... 48

3.3.5 Factoring the constraints ...... 48

3.3.6 Cubic constraint ...... 50

3.3.7 Quasi-linear method for reconstruction ...... 51

3.3.8 Scaling the constraints ...... 53

3.3.9 Geometric error ...... 53

3.3.10 Related Work ...... 55

3.4 Algorithm details ...... 55

3.4.1 Computing camera pencils ...... 55

3.4.2 Inverting ...... 58

3.4.3 Robust Reconstruction Algorithm ...... 59

3.5 Results ...... 60

3.5.1 Synthetic data ...... 60

3.5.2 Real data I ...... 62

3.5.3 Real data II ...... 64

3.6 Assessment ...... 66

4 Auto-calibration 67

4.1 Overview ...... 67

4.2 Motivation ...... 67

4.3 Modulus constraints ...... 70

4.3.1 Motivation ...... 71

4.3.2 Algebraic nullspaces ...... 72

4.3.3 Horopters ...... 73

4.3.4 Characteristic equation of the infinite homography ...... 76

4.3.5 Two-view (strong) modulus constraint ...... 77

4.3.6 Two-view (algebraic) quartic modulus constraint ...... 78

4.3.7 Three-view (algebraic) cubic modulus constraint ...... 79

4.3.8 Solving the equations ...... 79

4.3.9 Experimental Results...... 87

4.3.10 Partial constraint ...... 89

4.3.11 Conclusion and further work ...... 90

4.4 Square pixels ...... 91

4.4.1 Notation ...... 93

4.4.2 Co-conicity constraint ...... 95

4.4.3 Pascal’s theorem ...... 95

4.4.4 Collinearity in 3D ...... 97

4.4.5 Octic constraint ...... 98

4.4.6 Special case of 3D collinearity bracket ...... 99

4.4.7 Formula for sextic ...... 100

4.4.8 Formula for quintic ...... 100

4.4.9 Forty solutions ...... 103

4.5 Algorithms ...... 108

4.5.1 Recovering calibration from the plane at infinity ...... 108

4.5.2 Numerical Computation of Ideal Saturation ...... 109

4.5.3 Computing the Multiplication Operators ...... 111

4.5.4 Solving the Generalized Eigensystem ...... 111

4.6 Results ...... 112

4.7 Assessment ...... 113

5 Matching images 114

5.1 Objective and motivation ...... 114

5.1.1 What can one hope for? ...... 114

5.1.2 Direct methods ...... 116

5.1.3 Wide versus short base-line...... 117

5.2 Framework of invariant descriptors ...... 118

5.2.1 Interest point neighbourhood descriptors ...... 120

5.2.2 Intensity profiles descriptors ...... 127

5.2.3 Region descriptors ...... 127

5.2.4 Texture descriptors ...... 128

5.2.5 Geometric features ...... 128

5.3 Local (affine) transformations ...... 129

5.4 Algorithms ...... 131

5.4.1 Corner detector ...... 132

5.4.2 Shape adaptation ...... 133

5.4.3 Invariants used ...... 134

5.4.4 Registration ...... 135

5.4.5 Growing new matches from old ...... 136

5.4.6 Constraints on multi-view geometry from local affine transformation 136

5.5 Results ...... 138

5.6 Assessment ...... 138

6 Texture 153

6.1 Objective ...... 153

6.2 Background ...... 153

6.2.1 Texture description and models ...... 155

6.3 Theory ...... 161

6.3.1 Segmentation ...... 161

6.3.2 The 2nd moment matrix ...... 162

6.3.3 Behaviour under affine transformations ...... 162

6.3.4 Weak (gradient) isotropy ...... 165

6.3.5 Previous use of the 2nd moment matrix ...... 165

6.3.6 Description in normalized frame ...... 166

6.4 Algorithms ...... 169

6.4.1 Normalized Cut Segmentation ...... 169

6.4.2 Practical calculation of 2nd moment matrix ...... 170

6.4.3 Descriptors ...... 171

6.5 Results ...... 173

6.5.1 Synthetic data ...... 173

6.5.2 Intra-image texture matching ...... 174

6.5.3 Inter-image texture matching ...... 179

6.6 Assessment ...... 183

7 Discussion and Future Work 185

7.1 Grouping ...... 185

7.2 Reconstruction ...... 187

7.3 Auto-calibration ...... 188

7.4 Texture ...... 189

7.5 Matching ...... 191

A Estimating projectivities between projective spaces. 196

Bibliography 197

Chapter 1 Introduction

1.1 Objective and Motivation

The objective of the work presented in this dissertation is to advance the goal of constructing a multiple-view matching and reconstruction system which is flexible, fully automatic, reliable, efficient and accurate. There are three parts to this problem:

Image description: the raw, numerical image data which comes from cameras must be condensed and "syntactified" to facilitate further processing. We need a higher-level description of the contents of our images.

Feature matching: reasoning from 2D images about general 3D scenes (i.e. containing more types of object than could be treated by hand-crafted specific models) requires the ability to form correspondence between images. This capability provides the link between past and present video frames, between the left and right image from a stereo pair, and between a query image and its matches in a database. It must be efficient and reliable.

Structure estimation: the existence of a common 3D structure is the constraint which makes correspondence tractable.

The work in this dissertation consists of three more or less separate strands of research, corresponding to these three parts of the general multiple-view reconstruction (a.k.a. "structure-from-motion") and matching problem.

A few words on why one would still be researching structure-from-motion in the first decade of the twenty-first century are in order. To the uninitiated, the fact that structure-from-motion has a solution at all (i.e. that structure and motion can be recovered from images alone) may seem surprising, but it has been known to photogrammetrists for at least a century [17, 1, 59, 146, 58, 68], and large scale structure estimation on computer has been an active field of research for half a century (see historical overview in [160]). The interest that the community has shown, over the last decade, in what is essentially photogrammetry or algebraic geometry [29, 28, 27, 15, 30, 31, 110] can be explained by the need for more flexible and robust tools that will work in a "point-and-shoot" setting without carefully planning an experiment. There is a demand for routinely tracking and reconstructing fairly unconstrained camera geometry, but only in the last decade has it become possible to do this with off-the-shelf software that doesn't require a trained operator.

This is partly due to the increased power of small workstations but also to advances in automation, whose inadequacies have been reduced but not eliminated. It is still not the case that waving a video camera about in a scene suffices to recover useful 3D information, a competence that would be of great use in site modelling and robot navigation applications and in model acquisition (computer graphics) for films and games.

What is likely to happen next? The demand for automated structure-from-motion will not disappear; it will become embedded and with it will arise a demand for more sophisticated and high-level processing capabilities such as content-based database querying and object-level video indexing.

The competence that the community has at present is: reliable short-baseline robust feature tracking and reliable structure refinement using bundle adjustment. There are still nails that we cannot hit with this hammer, for example the correspondence problems when the camera motion is large or when the scene is non-rigid. Solutions to these problems would be valuable in the more content-oriented applications of the future due to the possibility of using 3D structure when reasoning in these domains, in contrast to most current methods [10, 89, 138] based on appearance alone (e.g. color histograms, appearance manifolds, (wavelet) texture features). As motivation for this, it is clear from considering many of the well-known optical illusions that human vision relies on forming spatial hypotheses when interpreting static images (this improves performance in the "normal" domain of life but it does of course degrade it in the abnormal domain of optical illusions).

The chapters of this dissertation relate to the three parts of the general structure-from-motion problem as follows:

The work on geometric grouping (chapter 2) and texture description (chapter 6) attempts to provide high-level features that are very distinctive for matching.

The work on matching (chapter 5) was useful to gain insights into what is difficult and what is easy about the feature matching problem.

The geometric work on projective reconstruction (chapter 3) and auto-calibration (chapter 4) addresses the estimation side of the structure-from-motion pipeline.

1.2 Overview

Because the different strands of work are to a large extent independent, there is no single "literature review" chapter but instead papers deemed relevant are discussed as needed.

Each chapter ends with a short assessment of the work presented but the final chapter 7 discusses the entire thesis in one place and suggests directions in which the research should move next.

The following is a brief overview of the individual chapters.

Chapter 2: Grouping

This chapter presents early work [123, 124, 125] on grouping image features using geometric constraints. The purpose of grouping is to aggregate low-level image features into groupings that are somehow related or salient; groupings are features themselves but usually more distinctive than their parts. The starting point was the work of Leung and Malik [69] on grouping repetition using the loose constraint of "registrability" with affine transformations. The chapter shows how making stronger assumptions about the geometry of the repetition can be used to aid the grouping process as well as to recover scene information (such as vanishing points and lines). Three models are investigated, all to do with repetition on a plane, a common case in man-made environments.

Firstly, the familiar fact that parallel lines in the world converge at a single point when imaged is the starting point for an analysis of what happens if the lines in the world are co-planar and equally spaced. The cases where this happens in man-made environments are usually perceptually significant: stairs, windows and pillars are but three examples.

Secondly, the chapter develops a within-image relation which holds when general patterns on a plane are repeated by translation. In the terminology of projective geometry, the phenomenon is a special form of homography called an elation which exactly models the perspective effects sometimes approximated with affine transformations, but with fewer parameters.

Thirdly, regular grid-like arrangements on a plane are investigated. Such patterns are common as floor tiles or, again, windows on facades.

In all three cases, the geometric relation can be estimated from image data and the estimated parameters yield information about the plane in the world, e.g. vanishing lines and points.

Chapter 3: Reconstruction

A method [129] for projective structure recovery from six point tracks over three or more views is presented, addressing the problem of initializing structure either directly for bundle adjustment or as part of a RANSAC robust estimator.

The geometric-algebraic constraints derived are not strictly speaking new but their derivation without the need for changing image coordinates is.

Two variants of the method are compared against projective factorization and (the statistically optimal) bundle adjustment for varying levels of synthetic noise.

Lastly, the estimator is used within a RANSAC framework to compute a projective reconstruction on real data.

The contribution of the multiple-view estimator is that the entire structure is computed at once, whereas with structure initialization from epipolar or trifocal geometry it would be necessary to re-section those cameras not used for initialization.

Chapter 4: Calibration

An investigation into the geometry and algebra of two "minimal cases" for auto-calibration. Both cases solve a similar problem: given camera matrices from the best possible (i.e. projectively bundle-adjusted) projective reconstruction of a scene, to recover from them a Euclidean reconstruction by making a suitable coordinate change. This is useful because, if it is close enough to the truth, the Euclidean reconstruction gives an initialization for bundle-adjustment using one's camera model of choice. The use of scene constraints is not considered.

The first calibration model is the case of constant but unknown calibration over three views and its closed-form solution via the modulus constraints was published in [122].

The second calibration model is the case of “square pixels” – i.e. zero skew and unit aspect ratio but the focal lengths and principal points are unknown and changing across all (four) views. These are “Euclidean” image planes in the sense that concepts such as angle and relative distance can be transferred from the (raster) image plane to the physical projection plane of the camera.

For both models, the generic degree (number of solutions) is determined and a numerical method of solution for each problem is constructed. Experimental results are shown on real data.

Chapter 5: Matching

The chapter explores a combination of the affine invariant point feature neighbourhood descriptors of Baumberg [5] and the idea of Pritchett and Zisserman [109] of growing new matches from old ones using local (affine) transformations.

A device for overcoming problems caused by scale change between images to be matched is to use an image-driven scale selection process at each feature; to overcome problems caused by foreshortening effects Baumberg invented a method for selecting a canonical affine neighbourhood of a corner feature. A variation of the algorithm is implemented and explored.

In addition, when each feature has an associated affine neighbourhood, each feature match has an associated local affine transformation, which can be estimated and used to (1) search for more matches and (2) estimate epipolar geometry.

Chapter 6: Texture

The goal of this work, published in [127], was to construct a texture descriptor which is invariant to affine geometric and photometric changes, as well as to moderate errors of segmentation. The eventual applications for this image description tool are database or video indexing and wide baseline correspondence.

The descriptor’s ability to discriminate, and its invariance, is evaluated on a selection of Brodatz textures by drawing ROC curves. A wide baseline correspondence application is built using the descriptor and results shown on real images.

Chapter 2 Grouping with geometric constraints

2.1 Overview

The chapter is organized as follows: the first section explains what grouping is in general and why it is useful. It is argued that looking for repetition in a single image gives a useful image descriptor. The contribution of the new work presented here is a set of three geometric constraints that often apply to repeated structures; the geometries are analyzed first. Next, a section of implementation details explains how the results presented in the last section were produced.

2.2 Objective and background

One view of computer vision processing is that, starting from the low-level input that digital images are, algorithms are applied to produce new representations at successively higher levels until a stage is reached when the desired task can be carried out naturally.

Grouping refers to algorithms for enhancing, condensing and aggregating low-level description/segmentation data into higher-level descriptions by associating ("grouping") features that appear to come from a semantically meaningful entity in the scene. For example, the process of edge detection and line fitting to an intensity image yields a low-level description; the process of determining sets of lines that could be junctions or opposite, parallel sides of boxes, houses, etc. is an example of grouping.

Images of real scenes often contain elements that are repeated and this repetition is meaningful in the sense that it is pertinent to many visual processing tasks. For example, the presence of repeated window-like features in an image is strongly correlated with the presence of buildings in the scene from which the image was formed. The response of a grouping algorithm which detects repeated window-like features can therefore be used as evidence for or against the presence of buildings. This idea was made explicit by Fleck and Forsyth in [34] where the response of a human limb grouper is used to detect naked people in images.

For another example, in the problem of image correspondence, repeated structure is a source of ambiguity because there is no way to use image appearance alone to determine which of the many windows in one image corresponds to which window in the other image. Traditional correspondence algorithms usually discard ambiguous matches (e.g. by using the winner-takes-all strategy), which is a shame because repeated structure is a very strong matching cue. This will be discussed further in chapter 5.

It is therefore of interest to detect and group together repeated elements from images.

Compared with searching for specific types of object in images, the search for repetition is a more general strategy because no a priori model is needed (in principle at least) but rather, the model is discovered from the image itself as any image feature which is repeated.

The work [69] by Leung and Malik is a canonical starting point. The goal is to find repeated and spatially neighbouring surface elements at possibly various orientations. The image is represented by covering it with overlapping tiles in which the intensity variation (expressed via the second moment matrix of the image gradient) suggests something of interest is happening (this is similar to the Harris cornerness measure [50]). Nearby tiles with similar appearance are then grouped together: to account for changes in surface orientation, the similarity of tiles is evaluated by attempting to affinely register tiles using the Lucas-Kanade algorithm [80]. The output of the algorithm is a graph whose vertices are image patches and whose edges correspond to successful registrations. Since many objects (e.g. fabrics and walls) contain, as part of their appearance, elements that are repeated, such a graph is a useful starting point for description or recognition.

A practical problem with this approach is the cost of comparing large numbers of image patches to detect similarity. One way to address this is to use invariant descriptors and hashing techniques as illustrated in figure 2.2. The idea is to transform each image patch into a descriptor which is invariant to certain appearance changes (such as affine spatial distortion or intensity changes) and then compare the descriptors instead.

Figure 2.1: Grouping on local affine transformations: using corners and cross-correlation. From left to right: (1) original image, (2) detected Harris corners, (3) regions grouped on appearance, (4) verified groupings, (5) new element found by search.

This chapter explores "harder", more geometric, constraints that apply when the repetition is subject to some geometric regularity. It is demonstrated that grouping features which satisfy a geometric relationship can be used both for (automatic) detection and for estimation of geometric quantities such as vanishing points and lines.

Geometric groupings are useful image features because, being more likely to correspond to meaningful structures in the world than local appearance ("corner") features, they are highly distinctive and provide strong constraints for matching and recognition.

2.3 Theory and geometry

2.3.1 Parallel lines

In many man-made scenes the presence of planes and straight lines gives rise to characteristic image appearances which can be exploited. For example, parallel lines in the world, such as railroad tracks, appear to converge at a single point when imaged by a perspective camera. From the point of view of projective geometry, parallel lines do actually meet in a point on the plane at infinity and the perspective image of that point is the point of image convergence.

(a) original, (b) closed curves, (c) clustered invariants; (d) one cluster, (e) verified grouping, (f) enlarged grouping

Figure 2.2: Seed matches using closed curves. The idea here is to identify interesting regions by detecting closed Canny [13] edge contours, and then determine if these regions are related by affine transformations by computing their affine texture moment invariants [164]. Regions which are related by an affine transformation have the same value for affine invariants. Thus clustering on the invariants yields a putative grouping of regions. Eight affine invariants are computed, so each curve gives a point in an 8-dimensional space. The points are clustered in this 8D space by the k-means clustering algorithm. The plot shows the distribution and clustering of the zeroth order moments of shape (horizontal axis) and intensity (vertical axis). The cluster used as a hypothesised grouping is the bottom left-most one.

The geometry of imaged parallel lines is thus quite simple and can be exploited to build reliable yet simple algorithms for detecting vanishing points from images alone.

2.3.2 Equally spaced parallel lines

A characteristic pattern arises when the parallel lines are coplanar and equally spaced. The image appearance is quite striking and immediately gives the viewer an impression of a plane. It occurs frequently in the form of stairs, fences, pillars, windows and other architectural structures.

The pattern is characteristic because it is possible to tell from the image alone, despite perspective distortion, that the lines are equally spaced. The configuration is highly constrained and one can use the constraints to reject or verify grouping hypotheses.

Figure 2.3: Vanishing point: The many (approximately) parallel linear structures in this image appear to converge at a point near the right hand side of the image. Projectively, this point is the perspective projection of a real point on the plane at infinity.

Figure 2.4: The pattern of lines arising from the equally spaced bars of a fence. The lines are not equally spaced in the image but form a characteristic pattern. It is characteristic because any three consecutive lines determine the whole pattern.

Much previous research has addressed the automatic detection and estimation of vanishing points from parallel lines, e.g. using Hough transforms (clustering) or other robust methods (see, for example, [11, 20, 81, 94, 141, 163, 112, 21, 73], where earlier references are given). The novelty of the work presented here is in employing the additional constraints that arise from structured groupings of parallel elements on planes.

To analyze the geometry we note that, in some suitable coordinate system on the plane that contains the lines, a set of equally spaced lines must have the form
$$a\,x + b\,y + c + \mu \;=\; 0$$

Figure 2.5: Diagram showing plane to plane projection: a world point X on the scene plane maps through the centre of projection to an image point x on the image plane.

where $(a, b)$ is a normal vector to the line (determines the line's direction) and the parameter $\mu$ determines the position of the line. A family of equally spaced lines can be obtained by sampling $\mu$ at equally spaced (e.g. integer) values. The coefficient (column) vectors
$$\mathbf{l}_\mu \;=\; \begin{pmatrix} a \\ b \\ c + \mu \end{pmatrix}$$
of the lines can be written in matrix form as
$$\mathbf{l}_\mu \;=\; \begin{pmatrix} a & 0 \\ b & 0 \\ c & 1 \end{pmatrix}\begin{pmatrix} 1 \\ \mu \end{pmatrix}$$
which separates the properties of the pencil and its parameter into a $3 \times 2$ matrix and a $2$-vector. The vector $(1, \mu)^\top$ corresponds to a finite point on the projective line; replacing it by the ideal point $(0, 1)^\top$, the corresponding line obtained is the line at infinity of the scene plane.

Now, under perspective imaging (see figure (2.5)), which is associated with a homography matrix $\mathtt{H}$ mapping points according to $\mathbf{x} = \mathtt{H}\mathbf{X}$ and lines according to $\mathbf{l}' = \mathtt{H}^{-\top}\mathbf{l}$ (again, using column vectors for lines), the imaged lines have coordinates
$$\mathbf{l}'_\mu \;=\; \mathtt{H}^{-\top}\begin{pmatrix} a & 0 \\ b & 0 \\ c & 1 \end{pmatrix}\begin{pmatrix} 1 \\ \mu \end{pmatrix} \;=\; \mathtt{M}\begin{pmatrix} 1 \\ \mu \end{pmatrix}$$
where $\mathtt{M}$ is a $3 \times 2$ matrix of rank $2$, determined up to scale only (i.e. scalar multiples of $\mathtt{M}$ give rise to the same geometry). Such matrices form a $5$-parameter family. The existence of a factorization
$$\mathtt{M} \;=\; \mathtt{H}^{-\top}\begin{pmatrix} a & 0 \\ b & 0 \\ c & 1 \end{pmatrix}$$
places no constraint on $\mathtt{M}$ because any $3 \times 2$ matrix of rank two can be reduced to echelon form by row operations (i.e. by a suitable choice of the general matrix $\mathtt{H}$).

Figure 2.6: Elements related by a translation on a plane appear similar when imaged but aren't related by a translation in the image.

While the matrix $\mathtt{M}$ describes a necessary condition on the image of a regularly spaced pencil of parallel lines, this is of course not sufficient.
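The relation $\mathbf{l}'_\mu = \mathtt{M}(1, \mu)^\top$ is easy to verify numerically. The following is a minimal sketch in Python/NumPy; the homography, the line parameters and all variable names are invented for illustration and are not from the text:

```python
import numpy as np

# Scene-plane pencil of equally spaced lines: a*x + b*y + c + mu = 0.
a, b, c = 1.0, 0.0, -2.0
pencil = np.array([[a, 0.0],
                   [b, 0.0],
                   [c, 1.0]])          # 3x2 matrix describing the pencil

H = np.array([[1.0, 0.1, 5.0],         # some plane-to-image homography
              [0.0, 1.2, 3.0],
              [0.1, 0.0, 1.0]])

M = np.linalg.inv(H).T @ pencil        # imaged pencil: l'_mu = M (1, mu)^T

for mu in range(4):
    # Image of the scene line with parameter mu, mapped line-by-line ...
    img_line = np.linalg.inv(H).T @ np.array([a, b, c + mu])
    # ... agrees (up to scale) with the column combination M (1, mu)^T.
    from_M = M @ np.array([1.0, mu])
    assert np.allclose(np.cross(img_line, from_M), 0, atol=1e-9)

# M has rank 2 and, being defined only up to scale, carries 5 parameters.
assert np.linalg.matrix_rank(M) == 2
# The second column of M is the imaged vanishing line of the scene plane.
vanishing_line = M[:, 1]
```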

2.3.3 Elations

A common feature of man-made scenes is the repetition of planar patterns along a plane in the world by translation. The canonical example of this is windows on the side of a building. What is the geometry which governs images of such scenes? An element, $\mathbf{X}$, on the plane and its translation, $\mathbf{X}'$, are both projected by the imaging process into the image plane. Obviously these two projections will "look" similar but they won't in general be related by a translation. Figure 2.6 illustrates the situation.

As is well known, the world plane and the image plane are related by a homography $\mathtt{H}$ mapping a world point $\mathbf{X}$ to the image point $\mathbf{x} = \mathtt{H}\mathbf{X}$. Let us choose a coordinate system on the world plane such that the translation on the world plane is given by
$$\mathbf{X} \;\mapsto\; \mathtt{T}\,\mathbf{X}, \qquad \mathtt{T} \;=\; \begin{pmatrix} \mathtt{I} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{pmatrix} \;=\; \mathtt{I}_{3\times 3} + \begin{pmatrix} \mathbf{t} \\ 0 \end{pmatrix}\begin{pmatrix} 0 & 0 & 1 \end{pmatrix}$$
where $\mathtt{I}$ denotes a $2 \times 2$ identity matrix. If two image points $\mathbf{x}, \mathbf{x}'$ are projections of two world points $\mathbf{X}, \mathbf{X}'$ which are related by the translation, then $\mathbf{x} = \mathtt{H}\mathbf{X}$ and $\mathbf{x}' = \mathtt{H}\mathtt{T}\mathbf{X}$,

Figure 2.7: Left, geometric interpretation of the parameters of a conjugate translation. Right, example of elation in a real scene: the white tiles on the floor can be generated from a single tile by translation in the plane of the floor in the room.

and so, by back-projection, translation and re-projection:
$$\mathbf{x}' \;=\; \mathtt{H}\,\mathtt{T}\,\mathtt{H}^{-1}\,\mathbf{x} \;=\; \big(\mathtt{I} + \mathbf{v}\,\mathbf{l}^\top\big)\,\mathbf{x}, \qquad \mathbf{v} \;=\; \mathtt{H}\begin{pmatrix} \mathbf{t} \\ 0 \end{pmatrix}, \quad \mathbf{l}^\top \;=\; \begin{pmatrix} 0 & 0 & 1 \end{pmatrix}\mathtt{H}^{-1}$$
where the row vector $\mathbf{l}^\top$ represents the imaged vanishing line of the world plane and the column vector $\mathbf{v}$ represents the imaged vanishing point of the direction of translation. Note that the scaling of these vectors is significant in that they give rise to different transformations, albeit all in the same direction.

These geometric interpretations of the parameters are illustrated in figure 2.7.

This type of homography is known as an elation [136, 142] or conjugate translation. The line $\mathbf{l}$ is called the vanishing line of the elation and the point $\mathbf{v}$ is called the vertex of the elation. Note that the vertex lies on the vanishing line.

How large is the class of elations? The formula above requires two vectors, representing a line and a point which must be incident (because the direction of translation is parallel to the plane). The admissible vectors thus form a $5$-parameter family but, because a scale factor from one vector can be moved into the other, each pair of incident vectors comes from a $1$-parameter family giving rise to the same transformation:
$$\mathtt{I} + (\lambda\,\mathbf{v})(\lambda^{-1}\,\mathbf{l})^\top \;=\; \mathtt{I} + \mathbf{v}\,\mathbf{l}^\top$$
so elations form a $4$-parameter family. By contrast, affine transformations, which have been used in the past (e.g. [69]) to group together repeated elements on (approximately) planar surfaces, form a $6$-parameter family, yet this is only an approximation to the imaging geometry.

What use are elations in grouping? In the first instance, they provide a powerful constraint for verifying grouping hypotheses, by fitting the elation parameters to the image data and evaluating the goodness of fit. Secondly, a fitted elation allows extrapolation and so is a powerful tool for searching for new elements given old ones. Thirdly, the fitted parameters yield geometrically meaningful quantities such as direction of translation and plane orientation (from the vanishing line).
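To make the construction concrete, the following sketch (Python/NumPy, with an invented plane-to-image homography; none of these names come from the text) builds an elation $\mathtt{I} + \mathbf{v}\mathbf{l}^\top$ from a world-plane translation and checks that it is conjugate to that translation:

```python
import numpy as np

def elation(v, l):
    """Elation with vertex v and vanishing line l; requires l . v = 0."""
    assert abs(np.dot(l, v)) < 1e-9
    return np.eye(3) + np.outer(v, l)

# A synthetic plane-to-image homography and a world-plane translation t.
H = np.array([[1.0, 0.2, 4.0],
              [0.1, 1.1, 2.0],
              [0.0, 0.1, 1.0]])
t = np.array([1.0, 0.0])               # one unit of translation in the world plane

# Imaged vanishing point of the translation direction, imaged vanishing line.
v = H @ np.array([t[0], t[1], 0.0])
l = np.linalg.inv(H).T @ np.array([0.0, 0.0, 1.0])

E = elation(v, l)

# Check: E is conjugate to the world translation, i.e. E = H T H^{-1}.
T = np.array([[1.0, 0.0, t[0]],
              [0.0, 1.0, t[1]],
              [0.0, 0.0, 1.0]])
assert np.allclose(E, H @ T @ np.linalg.inv(H))
```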

Elation from equally spaced parallel lines?

The geometry of equally spaced parallel lines from section 2.3.2 can of course be viewed as a special case of repetition by translation on a plane. This means that there is a homography, acting on the image plane, which maps the line $\mathbf{l}'_\mu$ to the line $\mathbf{l}'_{\mu+\delta}$ for any value of the parameter $\mu$ and spacing $\delta$. However, for a given pencil of equally spaced parallel lines, there are many such elations due to the usual "aperture problem": in effect, the pencil of lines determines the vanishing line of the plane but it does not uniquely determine the direction of translation, because the lines only "care about" the amount of translation in their normal direction.

Algebraically, the compatibility between the elation $\mathtt{I} + \mathbf{v}\,\mathbf{l}^\top$ and the $\mathtt{M}$-matrix is expressed by the two relations
$$\mathtt{M}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \;=\; \mathbf{l}, \qquad \mathbf{v}^\top\mathtt{M}\begin{pmatrix} 1 \\ 0 \end{pmatrix} \;=\; -\,\delta .$$
Geometrically, what this means is that there is a $1$-parameter family of elations compatible with the given $\mathtt{M}$: for any point (represented by $\mathbf{v}$) on the vanishing line $\mathbf{l}$, not equal to the common intersection of the lines in the pencil, there is a unique compatible elation with that point as its direction of translation.

Figure 2.8: Projective distortion of a regular grid.

2.3.4 Regular grids

An extension to repetition by a single translation is where there is a repetition in two directions so that the world pattern is a regular grid of repeated elements. To each element can be assigned a pair of integer coordinates: one for the "row" and one for the "column". See figure 2.8.

The geometry can be described simply by the plane homography $\mathtt{H}$ which maps from these integer coordinates to the image coordinates.

Within the image, there is a family of associated transformations corresponding to translation of the grid in the world. These are
$$\mathtt{H}_{a,b} \;=\; \mathtt{I} + \big(a\,\mathbf{v}_1 + b\,\mathbf{v}_2\big)\,\mathbf{l}^\top \qquad (2.1)$$
where $\mathbf{v}_1, \mathbf{v}_2$ are the vanishing points of the two translation directions and $\mathbf{l}$ is the (common) vanishing line (so $\mathbf{l}^\top\mathbf{v}_1 = \mathbf{l}^\top\mathbf{v}_2 = 0$ as before), and $a$ and $b$ denote units of translation along rows and columns. This geometry has eight degrees of freedom.
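A small numerical sketch of equation (2.1), with an invented grid-to-image homography and illustrative variable names, assuming the conventions used above:

```python
import numpy as np

# Grid-to-image homography: maps integer grid coordinates (i, j, 1) to image points.
H = np.array([[ 20.0,  2.0, 100.0],
              [  1.0, 18.0,  80.0],
              [0.001, 0.002,  1.0]])

v1 = H @ np.array([1.0, 0.0, 0.0])     # vanishing point of the "row" direction
v2 = H @ np.array([0.0, 1.0, 0.0])     # vanishing point of the "column" direction
l  = np.linalg.inv(H).T @ np.array([0.0, 0.0, 1.0])   # common vanishing line

def grid_shift(a, b):
    """Image transformation for shifting the grid by (a, b) units, as in (2.1)."""
    return np.eye(3) + np.outer(a * v1 + b * v2, l)

# The image of grid element (i, j) is mapped to the image of element (i+a, j+b).
i, j, a, b = 2, 3, 1, -1
x  = H @ np.array([i, j, 1.0])
xs = grid_shift(a, b) @ x
assert np.allclose(np.cross(xs, H @ np.array([i + a, j + b, 1.0])), 0, atol=1e-6)
```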

2.4 Algorithms

The previous section described the relations between image elements for three classes of geometric repetition on a scene plane. This section describes how these relations may be harnessed for grouping algorithms. The algorithms automatically detect and associate elements satisfying the relation and, for example, thereby estimate the plane's vanishing line.

In all three cases, we are trying to find a set of image features which (a) "look" similar in the image, and (b) satisfy some geometric constraint. The similarity must be expressed in photometric terms, such as intensity correlation. The particular constraint is specified by a geometry with a number of parameters (e.g. the matrix $\mathtt{M}$ with five parameters, or an elation with four parameters), but the parameters are a priori unknown to us. Moreover, the features available in the image usually contain a large number of outliers, that is, features which don't belong to the sought grouping.

In all cases, the RANSAC [32] paradigm can be used. This is a hypothesize-verify technique which randomly samples from the set of possible models and evaluates the support for each model. The model with most support is deemed to be the salient one. The sampling strategy consists of choosing, from the set of input features, random subsets that are large enough to instantiate (fit) the model but small enough that there is a reasonable chance of eventually finding a subset containing only inliers. The principle is illustrated in detail by the algorithm for detecting vanishing points.

2.4.1 Vanishing point detection

The overall algorithm is:

1. Detect straight lines.

2. Use RANSAC+MLE to find a large subset of concurrent lines.

3. Remove subset and repeat.

The detection of lines can be carried out by traditional edge detection [13] followed by segmentation of the detected edges into straight segments by breaking at points of high curvature followed by incremental least squares fitting.

In the context of vanishing point detection, the model fitted by the RANSAC algorithm is a single point (namely the vertex of the grouping of concurrent lines). The vertex can be computed from any pair of lines in the grouping so the search consists of randomly choosing pairs of lines, computing their point of intersection and evaluating the support for that point by counting the number of image line segments which pass through the point within some tolerance.

Figure 2.9: Left, vanishing points are sometimes estimated by determining the point that minimizes the sum of squared perpendicular distances to the fitted lines, where the fitted lines are estimated from the measured edge points. This is not the Maximum Likelihood Estimate (MLE) of the point; in general the vanishing point determined in this manner will not lie on the fitted lines. Instead, right, the MLE of the vanishing point is found by determining a set of lines concurrent with the point, such that the perpendicular distances of the lines to the measured edge points are minimized.

In practice the test is carried out by drawing a line from the vertex through the image edge to see what the closest fit would be. This ensures that decisions are made based on image fitting errors, which is consistent with the error model that image measurements are correct except for additive isotropic Gaussian noise.

The vertex with the largest support is chosen and a maximum likelihood fit (optimization) to the inlying line segments is performed. The parameters to be determined should represent a configuration of lines passing through a common point. The MLE is that configuration for which the likelihood is greatest. See figure 2.9. The MLE provides us not only with a parameter estimate, but also with a set of corrected lines which are exactly concurrent [73]. The maximum likelihood estimate is computed using a non-linear optimization algorithm such as Levenberg-Marquardt [108].

This algorithm is very simple, efficient and reliable. It is a useful pre-processing step for other algorithms, such as the one presented below for computing equally spaced parallel lines by robustly fitting an $\mathtt{M}$-matrix to image data.
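The RANSAC search described above can be sketched as follows. This is a simplified illustration in Python/NumPy rather than the implementation used for the results: the support test below uses the line through the candidate vertex and each segment's midpoint as a cheap stand-in for the exact "closest line through the vertex" fit, and the function and parameter names are invented.

```python
import numpy as np

def vanishing_point_ransac(lines, n_samples=1000, tol=1.5, rng=None):
    """RANSAC search for a vanishing point from detected line segments.

    `lines` is a list of (line, endpoints) pairs: `line` is a homogeneous
    3-vector (a, b, c) with a^2 + b^2 = 1, `endpoints` a 2x2 array of the
    segment's endpoints.  Returns the best vertex and its inlier indices.
    """
    rng = rng or np.random.default_rng()
    best_vertex, best_inliers = None, []
    for _ in range(n_samples):
        i, j = rng.choice(len(lines), size=2, replace=False)
        vertex = np.cross(lines[i][0], lines[j][0])     # intersection of two lines
        if np.linalg.norm(vertex) < 1e-12:              # degenerate sample
            continue
        inliers = []
        for k, (_, pts) in enumerate(lines):
            mid = np.append(pts.mean(axis=0), 1.0)
            cand = np.cross(vertex, mid)                # line through vertex and midpoint
            norm = np.linalg.norm(cand[:2])
            if norm < 1e-12:
                continue
            # Perpendicular distance of both endpoints to the candidate line.
            d = np.abs(np.hstack([pts, np.ones((2, 1))]) @ cand) / norm
            if d.max() < tol:
                inliers.append(k)
        if len(inliers) > len(best_inliers):
            best_vertex, best_inliers = vertex, inliers
    return best_vertex, best_inliers
```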

Figure 2.10: Diagram to illustrate the formula for the vanishing line $\mathbf{l}_\infty$ given three lines.

2.4.2 Computing the equally spaced line geometry

Directly from three lines

Let the three lines be $\mathbf{l}_1, \mathbf{l}_2, \mathbf{l}_3$ and choose a fourth line $\hat{\mathbf{l}}$ which does not pass through the common intersection of the first three. Then the vanishing line is given by the formula
$$\mathbf{l}_\infty \;=\; [\,\mathbf{l}_2\;\mathbf{l}_3\;\hat{\mathbf{l}}\,]\;\mathbf{l}_1 \;-\; [\,\mathbf{l}_1\;\mathbf{l}_2\;\hat{\mathbf{l}}\,]\;\mathbf{l}_3$$
where brackets denote the determinant of the $3 \times 3$ matrix formed from the three lines (sketch proof: project the three given lines and the proposed line at infinity into $\mathbb{P}^1$ using linear operators of the form $\mathbf{m} \mapsto [\,\mathbf{m}\;\mathbf{l}_i\;\hat{\mathbf{l}}\,]$ and check that the resulting points on the projective line have the expected cross-ratio). The geometry is illustrated in figure (2.10).
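A minimal numerical sketch of this formula (Python/NumPy; the toy lines below are chosen only to make the check easy to follow, and all names are illustrative):

```python
import numpy as np

def vanishing_line_from_three(l1, l2, l3, l_aux):
    """Vanishing line of the plane from three consecutive equally spaced,
    coplanar parallel lines, using the bracket formula above.  `l_aux` is any
    line not through the common intersection of l1, l2, l3."""
    det = lambda a, b, c: np.linalg.det(np.column_stack([a, b, c]))
    return det(l2, l3, l_aux) * l1 - det(l1, l2, l_aux) * l3

# Toy check: the (unimaged) lines x = 0, x = -1, x = -2 are equally spaced;
# their vanishing line is the line at infinity (0, 0, 1).
l1 = np.array([1.0, 0.0, 0.0])
l2 = np.array([1.0, 0.0, 1.0])
l3 = np.array([1.0, 0.0, 2.0])
l_aux = np.array([0.0, 1.0, 0.0])      # the x-axis, not through their intersection
linf = vanishing_line_from_three(l1, l2, l3, l_aux)
assert np.allclose(np.cross(linf, [0.0, 0.0, 1.0]), 0)
```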

Linearly from three or more lines

This is a special case of a more general method for estimating projectivities between projective spaces from point and (hyper)plane correspondences. Any correspondence between a pair of points or a pair of (hyper)planes can be written as a set of linear equations in the elements of the projectivity. In this case we have to estimate a projectivity from $\mathbb{P}^1$ to $\mathbb{P}^2$; see appendix for details.

As always with a "linear" estimate, it is important to condition the line data first. One technique (suggested by Hartley [54] in the context of estimating epipolar geometry from 8 point correspondences) which works well is to scale and translate the image or region containing the lines to roughly fill the inside of the unit square. The scaling used should be isotropic so as not to distort the measurement error process which is assumed isotropic.
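A sketch of such a conditioning transformation, assuming a Hartley-style isotropic scaling and translation computed from the segment endpoints (illustrative code, not the thesis implementation). Lines must be transformed contravariantly so that conditioned lines still pass through conditioned points:

```python
import numpy as np

def conditioning_similarity(points):
    """Isotropic scale + translation taking a 2D point cloud roughly into the
    unit square centred at the origin."""
    centroid = points.mean(axis=0)
    scale = 1.0 / np.abs(points - centroid).max()
    return np.array([[scale, 0.0, -scale * centroid[0]],
                     [0.0, scale, -scale * centroid[1]],
                     [0.0, 0.0, 1.0]])

# Points transform as x -> S x; lines transform as l -> S^{-T} l.
pts = np.array([[10.0, 200.0], [640.0, 220.0], [320.0, 480.0]])
S = conditioning_similarity(pts)
line = np.cross(np.append(pts[0], 1.0), np.append(pts[1], 1.0))
cond_line = np.linalg.inv(S).T @ line
cond_pt = S @ np.append(pts[0], 1.0)
assert abs(cond_line @ cond_pt) < 1e-6   # conditioned point still on conditioned line
```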

Least-squares non-linear fit

In practice a serious algorithm will also employ a non-linear minimization, starting from an algebraic estimate. The form of the cost function depends on the precise nature of the input data, but if the lines are given as edgel-chains it makes sense to minimize the total squared sum of distances from the fitted model lines to the edgels, entirely analogous to the point illustrated by figure 2.9.

This computes the MLE under the measurement assumption that the localisation errors of the points on the segmented edgel chains are independent and normally distributed. The parameters to be fitted are the entries of the matrix $\mathtt{M}$. The cost function to be minimized is
$$\sum_{i}\;\sum_{j=1}^{n_i} d\!\left(\mathbf{x}_{ij},\; \mathtt{M}\begin{pmatrix} 1 \\ \mu_i \end{pmatrix}\right)^{2}$$
where the $i$th image line is in correspondence with the scene line with parameter $\mu_i$ and $\mathbf{x}_{ij}$ (for $j = 1, \dots, n_i$) are the points on the $i$th image line (the notation $d(\mathbf{x}, \mathbf{l})$ denotes perpendicular distance from a point $\mathbf{x}$ to a line $\mathbf{l}$). Note that in the ML estimation of the vanishing point above it was necessary to estimate both the point and the lines concurrent at that point. The use of the $\mathtt{M}$ matrix avoids estimating the lines in this case because it provides a map from the exactly known equally spaced scene lines to the image. The MLE of $\mathtt{M}$ also provides the MLE of the vanishing line $\mathbf{l}_\infty$.

Robustly

Again, RANSAC can be used to robustly compute a model of equally spaced parallel lines from input data of concurrent lines that might not all fit the model. Briefly, the strategy is to choose random sets of three lines, instantiate the model (the matrix $\mathtt{M}$) and evaluate the support for each such generated model.

Two pre-processing steps are useful. Firstly, the computational burden is reduced by grouping concurrent (i.e. not necessarily equally spaced) lines and searching within such groupings for equally spaced lines. Secondly, assuming similar visual appearance of the lines to be grouped, each group of concurrent lines can be split into two groups, depending on the direction of the image gradient across the lines (i.e. depending on whether the intensity changes from dark to bright or bright to dark across the edge).

An image edge $e$ is deemed to be supporting for $\mathtt{M}$ if there exists an integer $\mu$ such that the RMS fitting error of the line $\mathtt{M}(1, \mu)^\top$ to the points of $e$ is less than some fixed threshold.

In practice the sampling can be made more efficient by incorporating some heuristics based on the observation that while perspective distortion in images is significant it is usually not severe. For example, it is reasonable only to investigate fitted models which place all three lines on the same side of the image vanishing line. Secondly, the angle between the first two lines should not be too different from the angle between the last two lines.

Thirdly, the lines used both for initialization and to support a hypothesis should probably "overlap" approximately in a direction perpendicular to the average direction of lines in the group.

Once the most highly supported grouping has been found, the MLE of the model (the matrix $\mathtt{M}$) is computed and used for a guided search for further image lines that satisfy the same relation (at integer values of $\mu$) but that may have been missed in the original RANSAC sample. As new lines are included the MLE of $\mathtt{M}$ is recomputed.¹

¹ A more general approach to the constrained grouping problem is to consider a search over the set of all grouping hypotheses, where each hypothesis is a datum consisting of (1) model parameters (e.g. $\mathtt{M}$) and (2) classification of each image feature as inlier or outlier. The problem then is to find the most likely hypothesis, or hypotheses, or more generally to discover and compactly represent the plausible regions of the search space. The basic RANSAC algorithm is a global way to generate hypotheses; the incremental search for more inliers is a local optimization strategy.

2.4.3 Computing elations

From two point correspondences

Suppose given two points $\mathbf{a}, \mathbf{b}$ and their corresponding points $\mathbf{a}', \mathbf{b}'$ under an elation acting in the same image (see figure (2.11)). By intersecting the line joining $\mathbf{a}$ to $\mathbf{a}'$ with the line joining $\mathbf{b}$ to $\mathbf{b}'$ one obtains the position of $\mathbf{v}$, the vanishing point of the direction of translation (by the assumption that the points actually satisfy an elation relationship, no three are collinear, unless they are all collinear in which case this construction will of course fail).

To recover the vanishing line it suffices to compute one point, other than $\mathbf{v}$, on it. This is easily done by intersecting the line joining $\mathbf{a}$ to $\mathbf{b}$ with the line joining $\mathbf{a}'$ to $\mathbf{b}'$.

Thus far, the elation has been determined up to the one-parameter family
$$\mathtt{I} + \lambda\,\mathbf{v}\,\mathbf{l}^\top$$
but the remaining overall scale factor $\lambda$ can be computed by reference to the original correspondence constraints that
$$\mathbf{a}' \;\simeq\; \big(\mathtt{I} + \lambda\,\mathbf{v}\,\mathbf{l}^\top\big)\,\mathbf{a} \qquad \text{and} \qquad \mathbf{b}' \;\simeq\; \big(\mathtt{I} + \lambda\,\mathbf{v}\,\mathbf{l}^\top\big)\,\mathbf{b}$$
must hold, up to scale.

An alternative, and simpler, way to recover the elation is to map (rectify) the input points to the four points of a unit square (for which it is clear what the elation we seek is – it is just a translation) and then conjugate the elation (translation) on the rectified plane with the rectifying homography.
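The two-correspondence construction can be written out directly in a few lines (Python/NumPy sketch; the point names follow figure 2.11, everything else is illustrative). The scale $\lambda$ is fixed by requiring $\mathbf{a}' \simeq (\mathtt{I} + \lambda\,\mathbf{v}\mathbf{l}^\top)\,\mathbf{a}$, which after a cross product with $\mathbf{a}'$ becomes a linear equation in $\lambda$:

```python
import numpy as np

def elation_from_two_correspondences(a, a2, b, b2):
    """Elation mapping a -> a2 and b -> b2 (homogeneous 3-vectors)."""
    v = np.cross(np.cross(a, a2), np.cross(b, b2))   # vertex
    p = np.cross(np.cross(a, b), np.cross(a2, b2))   # second point on the vanishing line
    l = np.cross(v, p)                               # vanishing line
    # Fix lambda from  a2 ~ (I + lambda v l^T) a, i.e. cross(a + lambda (l.a) v, a2) = 0.
    num = -np.cross(a, a2)
    den = (l @ a) * np.cross(v, a2)
    k = np.argmax(np.abs(den))
    lam = num[k] / den[k]
    return np.eye(3) + lam * np.outer(v, l)

# Check on a synthetic elation.
v_true = np.array([2.0, 1.0, 0.0])
l_true = np.array([0.0, 0.0, 1.0])                   # v_true lies on l_true
E_true = np.eye(3) + np.outer(v_true, l_true)
a, b = np.array([1.0, 2.0, 1.0]), np.array([-1.0, 3.0, 1.0])
E = elation_from_two_correspondences(a, E_true @ a, b, E_true @ b)
assert np.allclose(E / E[2, 2], E_true / E_true[2, 2])
```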

From two line correspondences

This case is dual to the case of estimation from point correspondences, by interchanging the rôles of points and lines. The requirement that no three of the points be collinear becomes the requirement that no three of the lines are concurrent.

Figure 2.11: How to compute an elation from two point correspondences.

Least-squares non-linear fit

To carry out a least-squares fit one can use the following (over)parameterization of the space of pairs $(\mathbf{v}, \mathbf{l})$ of $3$-vectors satisfying $\mathbf{l}^\top\mathbf{v} = 0$:
$$(\mathbf{v},\; \mathbf{l}) \;=\; \big(\mathbf{p},\;\; \alpha\,\mathbf{q}_1 + \beta\,\mathbf{q}_2\big)$$
where $\mathbf{p}, \mathbf{q}_1, \mathbf{q}_2$ are $3$-vectors and $\mathbf{q}_1, \mathbf{q}_2$ (spanning the plane orthogonal to $\mathbf{p}$) are obtained by Gram-Schmidt orthogonalization.

Robustly

The algorithm for estimation from two point (or line) correspondences makes it easy to implement a RANSAC search for an elation. The overall method is:

1. Detect features, e.g. interest points or intersections of line segments.

2. Form putative within-image correspondences on the basis of proximity and local image appearance (e.g. correlation).

3. RANSAC+incremental MLE by randomly sampling from the set of putative correspondences (a sketch of the verification step is given below).

2.4.4 Computing grids

Many variations are possible; here a two-stage algorithm (based on interest points, but other features are possible in principle) is given which first finds 1D grids and then tries to organize these into a 2D grid. This can be thought of as "cascaded RANSAC" by analogy with the "cascaded Hough transform" of [163].

1. RANSAC search for 1D collinear repetitions.

(a) Interest point detection (Harris corners [50]).

(b) Grouping interest points based on proximity and image appearance (correlation), as in Leung and Malik [69]. Each of the resulting groupings is fed separately to the subsequent stages.

(c) RANSAC search for collinear subsets of points.

(d) RANSAC search for 1D regular grids. Identifying a 1D regular grid allows the vanishing point of the line to be computed.

2. Organization of the 2D grid.

(a) Clustering vanishing points. Grouping those rows with the same vanishing point reduces the complexity of the data even further. The grouped rows correspond to parallel rows in the scene plane.

(b) Final combination. Given a "row" and a "column" which overlap, it is easy to construct the regular 2D grid. An MLE is then computed (see below) using all the grouped elements.

(c) Guided search. Once the 2D regular grid has been estimated, we can predict with accuracy the image locations of elements that were missed by the earlier stages. The guided search consists of inspecting these locations, including any elements there which are found to look similar to the other elements in the grouping and then updating the estimate of the grid accordingly.

In this case, computing the MLE is conveniently posed as the estimation of a homography $\mathtt{H}$ from the (integer) grid coordinates to image feature coordinates, where errors are minimized only in the image.

The grouping algorithm produces (noisy) image points $\mathbf{x}_{ij}$ which have been associated with integer grid coordinates $(i, j)$. For a Gaussian error model the likelihood function (and also the posterior distribution in the case of a uniform prior) is:
$$p\big(\{\mathbf{x}_{ij}\} \mid \mathtt{H}\big) \;=\; \text{const}\cdot\prod_{ij}\exp\!\left(-\,\frac{d\big(\mathbf{x}_{ij},\,\mathtt{H}\,(i,\, j,\, 1)^\top\big)^2}{2\sigma^2}\right) \qquad (2.2)$$
where $\sigma$ is the standard deviation of the errors in the image, $\mathtt{H}$ is the homography mapping from the scene grid to the image, and $d(\cdot\,,\cdot)$ denotes the Euclidean distance between two points in the image. Thus, the MLE is the homography which minimizes the sum of squared residuals in the image:
$$\min_{\mathtt{H}}\;\sum_{ij} d\big(\mathbf{x}_{ij},\,\mathtt{H}\,(i,\, j,\, 1)^\top\big)^2 \qquad (2.3)$$
As usual, once a model has been fitted it can be used to search for new correspondences.
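A minimal sketch of this estimation (Python/NumPy; illustrative only): a DLT estimate of $\mathtt{H}$ from the grid-to-image correspondences provides a starting point, and (2.3) is the cost that a subsequent non-linear optimizer would refine.

```python
import numpy as np

def homography_dlt(grid_coords, image_points):
    """Linear (DLT) estimate of the homography mapping integer grid coordinates
    (i, j) to measured image points (x, y); needs at least four correspondences."""
    rows = []
    for (i, j), (x, y) in zip(grid_coords, image_points):
        X = np.array([i, j, 1.0])
        rows.append(np.r_[np.zeros(3), -X, y * X])
        rows.append(np.r_[X, np.zeros(3), -x * X])
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)          # right null vector, reshaped to 3x3

def reprojection_cost(H, grid_coords, image_points):
    """Sum of squared image residuals: the quantity minimized in (2.3)."""
    cost = 0.0
    for (i, j), (x, y) in zip(grid_coords, image_points):
        p = H @ np.array([i, j, 1.0])
        cost += (p[0] / p[2] - x) ** 2 + (p[1] / p[2] - y) ** 2
    return cost
```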

2.5 Results

2.5.1 Vanishing points

Sometimes there are multiple instances of a type of grouping in an image. In this case, a RANSAC-based estimator will usually detect the dominant (largest) grouping. If the affected features are removed and the algorithm invoked on what is left, the second largest grouping can be extracted and so forth. This is illustrated in figure 2.12 which shows three groupings of concurrent lines detected from a single image.

2.5.2 Equally spaced lines

Figure 2.15 shows the process of grouping equally spaced lines on an outdoor picture of a fence.

Figures 2.16 and 2.17 show the same process applied to an indoor picture of a radiator and an outdoor picture of a building facade. In the latter case, the slight irregularity of the building (the pillars are not all of the same width) causes the algorithm to fail if the inlier threshold is too low.

Figure 2.12: Multiple groupings of concurrent lines within a single image. 1164 straight line segments were detected in the original image (top left). The three largest groupings found are shown in red. Top right: 490 lines. Bottom left: 347 lines. Bottom right: 129 lines. In each case, 1000 random samples were used. The three directions found in this case are orthogonal and would suffice to compute the camera calibration using mild assumptions. Note the mistake made in the bottom left image due to accidental alignment of lines from different faces of the building.

31 Original image Edgel map Figure 2.13: Line grouping: original image and computed edgel map.

Concurrency groupings Fitted MLE model Figure 2.14: Left: lines grouped together using concurrency. The lines shown in white have been grouped together using the “equally spaced lines” constraint. Some lines have been left out due to insufficient sampling. Right: the model fitted by MLE to the grouping found on the left.

Vanishing line A larger grouping

Figure 2.15: Left: vanishing line of the fitted model above. Right: a larger grouping found by increasing the number of samples.

32 (a) (b) Figure 2.16: (a) original image. (b) the response of the equally-spaced-lines grouping algo- rithm can be used to group the bars of a radiator in a scene.

Figure 2.17: The same algorithm as in figure 2.15 was applied to the image (a) to obtain the grouping (b). Some lines are missed on the left of the building. Turning up the inlier threshold gave the more complete grouping shown in (c). Also shown is the vanishing line computed from the grouping.

Figure 2.18: Multiple equally spaced line groupings can occur in the same image. (a) The original image. Figures (b), (c), (d) show in turn the 3 groupings which are automatically extracted from the image. For each grouping a selection of lines predicted by the fitted model is shown, so while some of these lines agree with image contrast edges, those that don't are extrapolated and are there to illustrate the nature of the fitted model. Figure 2.19 shows a vanishing line estimated from the first grouping.

Figures 2.18 and 2.19 show three groupings of equally spaced lines detected from a single image.

Figure 2.19: MLE of the vanishing line estimated from the first grouping in figure 2.18.

2.5.3 Elations

The first example shows the steps of the process applied to “The Music Lesson” by Vermeer.

Although a painting and not a photograph, the perspective is correct to the extent that the models investigated in this chapter apply (the correct perspective of many renaissance paintings was demonstrated by Criminisi in [23], though not for the images used in this text). The captions of figures 2.20 to 2.24 explain and comment on the stages of the algorithm.

The second example, shown in figure 2.25, also illustrates the process.

The third example of the elation grouper in action is shown in figure 2.26 where an elation is found on two different planes in the same scene.

Figure 2.27 shows another example of two groupings being found in the same image, but this time on the same plane.

The last example, in figure 2.28, shows the elation grouper working on an image used earlier for the equally spaced parallel lines grouper.

Original image. Fitted lines. Figure 2.20: The sought (but unknown) element/grouping is the repeated floor tiling. The features which successfully provide this element/grouping are line pair intersections. The first stage in determining the features is fitting straight lines to the edge detector output.

Line intersections. Lines and points together. Figure 2.21: Left: points of intersection of the lines found above. The line segments are extended slightly to allow intersections just beyond their endpoints. Right: lines and intersection points together. Note that line intersections do identify the corners of the floor tilings, but these points are only a small proportion of all the intersections detected.

36 Closeup of features. Seed matches. Figure 2.22: Left: a closeup of the computed intersections. Right: the black lines join pairs of intersections which are deemed to look similar on the basis of intensity neighbourhood correlation. Line pairs which reverse orientation are excluded, since these cannot map under an elation.

Sample with highest support. Next highest support. Figure 2.23: These two images demonstrate the core of the method : each seed match (shown in white), consisting of two matched lines, is sufficient to determine an elation in the image. These putative elations can be verified or rejected by scoring them according to the number of feature correspondences consistent with them. The two seed matches whose corresponding elations received the highest support are shown.

37 MLE vanishing line and point. Features transferred. Figure 2.24: Given the inliers to the elation grouping, the parameters of the elation can be estimated (MLE) more accurately. Left: the ground plane vanishing line and vanishing point of the translation direction. Note that the horizontal line is a very plausible horizon line and that the feature tracks all pass through the vanishing point. Right: the accuracy of the estimated parameters is also demonstrated by transferring elements under the elation: the extended tiling is obtained by mapping the original image lines under the estimated elation.

(a) (b) (c) (d) Figure 2.25: More elation grouping of line pair intersections. (a) Original image, (b) Seed correspondences based on cross-correlation, (c) Correspondences consistent with a seed (shown in white), (d) Result of guided search and the vanishing line of the plane (shown in black) computed from the MLE.

38 Grouping found. Another grouping found. Figure 2.26: Two groupings found in the same image by the elation grouper. Despite the difference in pose of the planes in the world, the same grouping algorithm is successful for both cases.

(a) (b) (c) Figure 2.27: Conjugate translation grouping of line pair intersections. (a) Original image, (b) One grouping after guided search, (c) Another grouping after guided search. Note that the bricks come in two lengths, long and short, but they can be grouped together because the same elation works for both brick lengths. Thus the grouper fires not on “shape”, but on the coherent spatial configuration described by the computed elation.

Figure 2.28: Another elation grouper example. The corners of the windows of the building on the left have been grouped together by the elation constraint.

2.5.4 Grids

The stages of the grid grouping algorithm are illustrated in figure 2.29. Apart from providing an image descriptor as well as information about the geometry of the imaged plane, there is the possibility that such a grouping can be used to re-synthesize the image; i.e. it is a form of image compression.

A second example is shown in figure 2.30, with the estimated texture element shown as an inset in the second image.

2.6 Assessment

The chapter demonstrates that the geometric relations investigated can be used to guide grouping algorithms and that the fitted models give accurate estimates of vanishing lines, points etc.

In terms of getting the methods to work on real images of real interest, the implementations are plagued by ad hoc fixes (see e.g. figure 2.15) and the range of features that can be grouped is somewhat limited. Chapter 7 lists some cures for these ailments.

40 Initial grouping found. Organization of 2D grid.

New elements from guided search. Synthesis from element and grouping. Figure 2.29: Grid grouper in action. Top left: the initial grouping of interest points, based on neighbourhood correlation alone. Top right: a 2D grid found within the initial grouping. Bottom left: new elements found by a search guided by the 2D grid. Bottom right: image synthesis from MLE of final 2D grid grouping.

41 Original image. Synthesized pattern. Figure 2.30: Another example of the grid grouper, showing the original image and the pat- tern synthesized from the fitted grouping and the estimated pattern element (which is in- cluded as an inset). Note the algorithm only selects elements belonging to the grid. The two planes in the scene are geometrically almost indistinguishable.

Chapter 3 Projective reconstruction

3.1 Overview

This chapter presents an algorithm for projective reconstruction from six point correspondences in three or more views and compares it with some existing standard techniques.

3.2 Motivation and objective

A large number of methods exist for obtaining 3D structure and motion from features tracked through image sequences. Their characteristics vary from the so-called minimal methods [110, 111, 155] which work with the least data necessary to compute structure and motion, through intermediate methods [33, 145] which may perform mismatch (outlier) rejection as well, to the full-bore bundle adjustment.

The minimal solutions can be used as search engines in robust estimation algorithms which automatically compute structure and cameras over multiple views. For example, the two-view seven-point solution is used in the RANSAC estimation of the fundamental matrix in [155], and the three-view six-point solution in the RANSAC estimation of trifocal geometry in [154]. It would seem natural then to use a minimal solution as a search engine in four or more views. The problem is that in four or more views a solution is forced to include, say, a minimization to account for measurement error (noise). This is because in the two-view seven-point and three-view six-point cases there are the same number of measurement constraints as degrees of freedom in the geometry to be estimated, whereas the four-view six-point case provides one more constraint than the number of degrees of freedom of the four-view geometry. This means that, unlike in the two- and three-view cases where one can recover geometry which exactly relates the measured points, this is not possible in the four (or more) view case. Instead it is necessary to minimize a fitting error, whether algebraic or geometric.

This chapter presents a quasi-linear solution for the ¢ point case in three or more views.

The solution minimizes an algebraic image error, and its computation involves only SVD [45]

and the solution of a cubic equation in a single variable. This is described in section 3.3.

Also described is a sub-optimal method (compared to bundle adjustment) which mini- mizes geometric image error at the cost of only a three parameter optimization.

This method is then employed in a RANSAC algorithm for computing a reconstruction of cameras and 3D scene points from a sequence of point tracks from real images. The objectives of such algorithms are now well established:

1. Minimize reprojection error. A common statistical noise model assumes that mea-

surement error is isotropic and Gaussian in the image. The Maximum Likelihood

Estimate in this case involves minimizing the total squared reprojection error over

the cameras and 3D points. This is bundle adjustment.

2. Cope with missing data. Structure-from-motion data often arises from tracking fea-

tures through image sequences and any one track may persist only in few of the total

frames.

3. Cope with mismatches. Appearance-based tracking can produce tracks of non-features.

A common example is a T-junction which generates a strong corner, moving slowly

between frames, but which is not the image of any one point in the world.

Bundle adjustment [51] is the most accurate and theoretically best justified technique.

It can cope with missing data and, with a suitable robust statistical cost function, can cope with mismatches. It will almost always be the final step of a reconstruction algorithm.

However, it is expensive to carry out and, more significantly, requires a good initial estimate in order to be effective (meaning, in order to converge in a few iterations and to not get stuck in a local minimum). Current methods of initializating a bundle adjustment include factorization [60, 84, 145, 156], hierarchical combination of sub-sequences [33], and the sequential Variable State Dimension Filter (VSDF) [93]. 44 In the special case of affine cameras, the factorization method of Tomasi and Kanade [152]

minimizes reprojection error, as pointed out by Reid [113], and so gives the optimal so- lution also found by bundle adjustment. However, factorization cannot cope with mis- matches, and methods to overcome missing data suggested by Jabobs [62] lose the op- timality of the solution. In the general case of perspective projection iterative factoriza- tion methods have been successfully developed and have can produce excellent results although they do not minimize the geometric reprojection error. The problems of missing data and mismatches remain though.

Sequential methods are computationally less expensive but suffer from sensitivity to errors committed early in the process. This can be remedied by re-visiting previous frames but taking this to its logical conclusion leads to a form of full-scale bundle adjustment.

The fact that one’s feature tracker may in fact be the first stage of a stratified system where the second stage tracks epipolar geometry and the third stage tracks trifocal geom- etry suggests the idea of using reconstructions of consecutive pairs, triplets etc of image frames, or other small sub-sequences, as the initial substrate for reconstruction. Such a hierarchical approach exploits the efficiency of bundle adjustment for short sequences to reduce the dimensionality of the global estimation problem. It remains an open question what the best hierarchy is, though. A variation is suggested by noting that a hierarchical approach consists of taking slices out of the camera-vs-world-point matrix of image mea-

sures and concentrating effort on each slice: it is of course possible also to keep all cameras and restrict attention to subsets of points instead, though this is generally less useful for long sequences since feature tracks tend to be short-lived.

3.3 Theory

This section describes a method of reconstruction (recovery of cameras and structure)

¥ ¢ from six point correspondences across ¡ views.

It is quite similar to the development given by Hartley [52] and Quan [110] for a re- ¢ construction of points from ¢ views. The crucial difference is that Quan used a standard

projective basis for the image points (and this distorts the measurement noise differently

45 for each point), whereas here the image coordinates are not transformed. The use of a stan-

dard basis in the image can seriously bias the estimates obtained [56, 129] in the presence

of measurement errors; the numerical results shown later demonstrate that the method described here produces a solution with pleasant noise resistance properties.

3.3.1 Problem formulation and notation

©

¥

¡ 



Given are six points in correspondence over images. The six points in image are £

©

 § ¥ ¥

¡ ¢

¢

¡

 ¡  

¦ ¥  ¦

, . The aim is to recover camera projection matrices ¦ and

¢

£ ¦

world points £ which explain the image data, in the sense that

©

©



¡ 

¡

¡ 

 £ £

£

¥

(up to scale) for all . In general an exact solution may not be possible due to measure-

©

¡

¡  £ ment noise, in which case one seeks image projections £ which are close to the mea-

sured image points in a suitable sense.

§¦§¦§¦



££¢ Assuming that £ are in general position (i.e. no four are coplanar), their con-

figuration is projectively equivalent [136] to that of the canonical projective basis

£ ¥ £ ¥ £ ¥ £ £ ¥ ¥

£ £ £

¤ ¦ ¤ ¦ ¤ ¦ ¤ ¤ ¦ ¦

£

£ £

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¥ ¦ ¦

    

¤ ¤ ¤ ¤¨§ ¤

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¥ ¦ ¦

£ £

£

 

¢ (3.1)

£ £ £

and so one may without loss of generality assume that



¤

£ £ £

 §

¡

¢ for ¥ . This fixes the coordinate system to be used in the world and reduces the problem to that of recovering (a) each camera matrix (with respect to the chosen coordi-

nates) and (b) the position of the sixth world point

¥ £

¤

¦ ¤

¦ ¥

 

¦ ¥

£ ££©

In what follows the subscript on the sixth world point will be dropped.

46 This choice of the canonical basis is not crucial, and any five points in general position could be used, but it is traditional and also computationally convenient. 1

3.3.2 Pencils of cameras

Since (the coordinates of) the first five world points are known, and the image points are given, one can compute for each view the (linear) family of projection matrices which map the first five points to the first five image points. The family is linear because the constraint

that such a projection matrix ¡ must satisfy is



£

¡

 £ £ £

 §

¡

¢ ¡

for ¥ . Each point gives three linear equations for whose combined rank is

. Counting, the linear family of projection matrices thus has dimension , i.e. it forms a pencil of possible cameras for each view. It is simple to compute this pencil using linear algebra (see section 3.4.1 below).

3.3.3 Derivation of quadratic constraints ¡ So far, the problem has been reduced to a problem in each view by satisfying the con-

straints imposed by the first five points. Next, consider the constraint imposed by the sixth

point in a single view. £¢

If the pencil of cameras for a particular view is spanned by ¡ then there are scalars

 £¤    ¥¤¦¢

such that sought camera matrix takes the form ¡ and so

  ¡ §¤¦¢

£

¡

 © £  © ¥ £ £ ©

 ¢

© £ £

which is to say that  lies in the line of image points spanned by and :

¢  £¢ 

£

©

£ £ ¤  (3.2)

1It is of course necessary for practical reconstruction purposes that the points one wishes to reconstruct are in general position, or it will be impossible to map them to the canonical basis using a collineation. This may be a source or instability for problems where there is little scene depth since the algebra will force nearly- coplanar points apart. The closer the true points are to being coplanar, the more passing to the canonical frame will amplify measurement noise.

47



£

¥ £ © £

which is a homogeneous quadratic equation in whose coefficients are readily

¡ £¢ § §

computed from . In matrix form, the quadric is the symmetric part of the matrix

 ¢ ¢



 © ¤¢¡

If effect, computing the sixth world point amounts to intersecting ¥ quadric surfaces in

¦ .

3.3.4 Five irrelevant solutions

By construction, the pencil of cameras computed for a particular view will map each of

¡  ¡

the first five world points £ to the corresponding image point . In particular, the two

 ¢

¡ ¡ £

projections £ and are equal (up to scale) so the expression in equation ( 3.2) vanishes

  §

¡

¢

£ ¡ for £ ( ).

It follows that whatever method is used to solve the quadratic equations to estimate ¥

£ , the first five basis points will appear as irrelevant solutions. Alternatively, each of the

quadric surfaces pass through each of the five canonical basis points.

¥  ¤£

For example, in the case ¢ , Bez´ out’s theorem [139] tells us that there

¥ 

¢

solutions which arise as the intersections of the quadric surfaces in ¦ . Since five of these are irrelevant, there are in general only three solutions of interest in this case.

3.3.5 Factoring the constraints

In order to solve the quadratic equations derived, and to take advantage of the five known

but irrelevant solutions, one can employ the trick of factoring the constraints through a

rational map.

To see what this means, consider that the set of quadrics which pass through the five

canonical basis points is a linear system of dimension ¡ : the set of all quadrics forms a £

linear system of dimension (this is the number of coefficients of a quadric) and each



£ £ ¡ ¡ basis point imposes one linear constraint, leaving a linear system of dimension .

For the canonical basis in equation (3.1) the linear system of quadrics through them is

48

§

 

spanned by the quadratic functions ¢ defined by:

£ ¥

£ £ £

¤ ¦

£ £ £

¥ ¦



£

¤ ¤ ¢¡

¥ ¦

£ £ £ £

 ¥ £ ©

£ £ £ £

£ ¥

£ £ £

¤ ¦

£ £ £ £

¥ ¦



£

¤ ¤ £¡

¥ ¦

£ £ £

 ¥ £ ©

£ £ £ £

£ ¥

£ £ £ £

¤ ¦

£ £ £

¥ ¦



£

¢¡

¥ ¦

£ £ £

¥ £ ©

(3.3)

£ £ £ £

£ ¥

£ £ £ £

¤ ¦

£ £ £

¥ ¦



§ £

¤ £¡

¥ ¦

£ £ £ £

¥ £ ©

£ £ £

£ ¥

£ £ £ £

¤ ¦

£ £ £ £

¥ ¦



£

¤ ¤¡

¥ ¦

£ £ £

¥ £ ©

¢

£ £ £

To see this, note that if is such a quadratic function, corresponding to a symmetric

  § 

¤ £ £

¢

¡¤£ ¥ ¡ © ¡ ¡

matrix , then for implies that for those . The fact that



¤ £ £

¥ ¢ © means that the remaining six (independent) entries of must add up to and

it is easy then to give the above basis.



£

¡

¥ £ ©

Since each quadric passes through each basis point, it must lie in the linear

© ©

§¦§¦§¦ §¦§¦§¦

¡  ¡ 

 ¥ ¥

¢ 

span of . In other words, there are scalars ¢ such that

© ©

 ¦§¦§¦

¡  ¡ 

¡ ¥  ¥

¢



¢



£

¡

¥ £ ©

Now the constraint that takes the form

© ©

©

¦

§ ¨§ ¡  §¦§¦§¦ ¡  § §¦§¦§¦ 

£

¡ 

¥ £ © ¥ ¥ ¥  ¥ £ © ¥ £ © ©

¢

 © ¢ (3.4)

which is a linear constraint on the (homogenous) ¡ -vector

¥ £ ¥ £ £ ¥

§

£

¤ ¤



¥ £ ©

£

¤ ¤

¦ ¥ ¦ ¥ ¥ ¦

 ¥ £ ©

¦ ¥ ¦ ¥ ¥ ¦

¦ ¤ ¦ ¤ ¤ ¦

  

£

¤

¦ ¥ ¦ ¥ ¥ ¦

¥ £ © £ ©

¥ (3.5)

¦ ¥ ¦ ¥ ¥ ¦

¤

£ §

¤

¥ £ ©

£

¤

¥ £ ©

¢

49



£ ¥ ¡¤£ ©

By comparing coefficients, it is easy to obtain the scalars ¥ from a quadric ; they

§ §

    

are .

¥   §¦§¦§¦ ¥

£

¡ ¥ £ ©

To summarize, the (one from each view) quadratic constraints

©

¥ §  

£

¡ 

¥ ¥ £ ©

on the sixth world point £ are equivalent to the linear constraints

§¦§¦§¦ ¥

. This is what is meant by factorization: the constraints can all be expressed in

¡

£ © terms of ¥ so it suffices to solve for that homogenenous -vector.

The equations obtained here are algebraically equivalent to the equations obtained by

Quan [110] and Carlsson and Weinshall [15]. However, here the original image coordinate

system is used, with the consequence that a different numerical solution is obtained in the

over-constrained case. It will be seen in section 3.3.7 that this algebraic solution may be a

better approximation of the solution which minimizes geometric error.

3.3.6 Cubic constraint

  

§ § §

 §  ¥

¢¢¡

¦ ¦ ¦  ¦ ¦ However, since , the image of the map is not all of .

In fact the image is the hypersurface £ cut out by the cubic equation

¤ ¤

¤ ¤

¤ ¤

¤ ¤ ¤ ¤ ¤

 § § § § 

£ £ £ £

¤ ¤

(3.6)

¤ ¤

¤

§ §

¤ ¤

The constraint can be verified by substitution. It can be computed by elimination theory

(but see also below).

When is singular, i.e. where in ¦ is not well defined by the formula (3.5)? A necessary

condition is clearly that

    

£ £ £ £ £ £

¤ ¤ ¤ ¤ ¤

¥ © ¥ © ¥ © ¥ © and be enumerating the possibilities one sees easily that the only possible points of sin-

gularity are the five canonical basic points. 2 It remains to analyze the structure of the

singularities. Consider what happens close to the first canonical basis point by passing the

¨ ¨ 

2 ¨ ©

Proof by cases: Suppose first ¥§¦ . Then from the first two equations. If were also non-zero

¨ ¨ ¨ 

then the last equation implies ¥ , giving the fifth canonical basis point. On the other hand, if is ¨©

zero then we get the first canonical basis point. Thus far we have covered the case ¥¦ so now we assume

¨© ¨ ¨ ¨© 

  ¥ . The equations reduce to so that no two of can be non-zero.

50

parameterized curve

£ ¥

¤ ¦

¦¢¡

¥ ¦



¦

¥ ¦

¦¤£

¥ ©

¦¤¥

£ ¥

through (here, ¡ are not all zero) and studying the local behaviour:

¥ £ £ ¥

¦ ¡ £¦¥ ¡ £©¥

¥ ©

¦ £ £¦¥ £¦£©¥

¦ ¥ ¥ ¦

¥ ©

¦ ¥ ¥ ¦

¦ ¤ ¤ ¦



¦ ¡§£¤£ ¦¤¥ ¦ £ ¥



¦ ¥ ¥ ¦

¥ ¥ © © 

¦ ¥ ¥ ¦

¦ ¡§¥ £ ¦¤¥ £ ¥



¦ £¨¥ £ ¦¢¥ £ ¥



¦ £ as  . Note that the limit always exists but is different for each direction of approach to

the base point, and that the same analysis applies to each of the five base points, so that §

each basepoint gives rise to a (projective) plane of points in ¦ . In general, we have

£ ¥

£ £ £

¤ ¤

£ ¥ £ ¥ £ ¥

¤ ¤

¤ ¤ ¤

£ £ £

¤ ¤

¥ ¦

¤ ¦ ¤ ¦ ¤ ¦

¤ ¤

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¦

¤ ¦



£ £ £

¤

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¦

¤ ¤

higher order terms

¥ ¦

£ £ £

¤

¤ ¤

£ £ £

¤

from which it is easy to extract generators for the five planes (substitute the coordinates

§ ¡ of each base point into the matrix above – the column space of the resulting matrix

spans the desired plane) and use linear algebra to check they are pairwise disjoint. This

has shown that “blows up” ¦ at the five base points: it is a device for adding new points to projective space, one for each direction of approach to a base point.

For the purpose of the remainder of this chapter, the significance of is that it maps

¦

any quadric through the five canonical basis points into a (subset of a) (hyper)plane

§ §

¦ ¦ ¦ . Alternatively, it pulls back hyperplanes in to quadrics in which pass through

the five base points.

3.3.7 Quasi-linear method for reconstruction

Putting together the pieces from previous sections: to solve for the position of the sixth

¥

¡

world point £ , one can solve the homogeneous linear system

©

¥ £

¦

§¨§¨§ §¨§¨§

 

©

¦ ¤

¦

§¨§¨§ §¨§¨§



¦ ¥



£

¦ ¥

§¨§¨§ §¨§¨§ §¨§¨§

£ ©

¥ (3.7)

©

¦

§¨§¨§ §¨§¨§

¢ 

51

¥ £ © and then try to recover £ from . In practice there are certain problems which will be

discussed now.

¥ 

In the case ¢ the linear system has nullity (at least) so there is a -parameter

£ © family of possible solutions for ¥ . But the cubic constraint from section 3.3.6 saves the day because imposing this non-linear constraint reduces the ambiguity to three – being

the number of solutions to a cubic equation.

¨§ For ¥ , the linear system will in general have a unique (up to scale) solution but

usually, due to measurement noise in the input data, this will not satisfy the cubic con-

¥ £ ©

straint and so it is not clear how best to recover £ from the estimated . The solution

¥ §

advocated next generalizes to any ¡ .

¥ § ¥¡ Suppose then that ¡ . The linear system in equation (3.7) will in general (if

§ ) not have a solution (but if the input data is noise-free it will) and the usual technique

of choosing the singular vector with smallest singular value can be employed. However, §

the point in ¦ so obtained in general will not lie on the cubic constraint surface and so

recovering an estimate of £ is problematic. The estimate needs to be projected down onto the constraint surface. In which direction should one project? The usefulness of the SVD in

solving a homogeneous but over-constrained linear system is that it elicits the “directions”

in which the system is strongly constrained (large singular values) and those in which it

is weakly constrained (small singular values). The singular vector with smallest singular

value is thus a useful starting point; the singular vector with the second smallest singular value is a plausible direction to search in.

This is the heuristic recommended in this work: use the last two singular vectors (i.e. §

those with smallest singular values) to span a line in ¦ and search along the line for so-

lutions that satisfy the cubic constraint. Geometrically, this means intersecting a cubic

surface with a line so one expects in general ¢ solutions.

¥  Note that the case ¢ is subsumed by this approach, because in that case the design matrix has rank ¢ , so the two smallest singular values are in fact zero.

52 S

Figure 3.1: Intersecting the cubic constraint surface with a line in general yields three so- lutions.

3.3.8 Scaling the constraints

When using the SVD to solve an over-constrained homogeneous linear system it makes a difference to the result how one chooses to scale the rows of the design matrix. The scheme

recommended here is described in section 3.4.1 and has the advantage of being invariant

to image similarity transformations.

3.3.9 Geometric error



¡ £

In each image, the fitting error is the distance from the reprojected point to the



¡ £



¥ © measured image point  . The reprojected point will depend both on the posi-

tion of the sixth world point and on the choice of camera in the pencil for that image. But

 ¤  ¢¡ ¢ ¡

for a given world point £ , and choice of camera in the pencil, the residual is the

  ¤  £¡ ¢  ¢

¡

£ £ £  £ £

D image vector from  to the point on the line joining and .

¤ ¤¡

The optimal choice of for given £ is thus easy to deduce; it must be such as to make

the perpendicular projection of  onto this line (figure 3.2). What this means is that explicit minimization over camera parameters is unnecessary and so only the ¢ degrees of freedom

for £ remain.

53 AX

x y

BX

Figure 3.2: Minimizing reprojection error in the reduced model. For a given world point £ , each choice of camera ¡ from the pencil projects the world point to a point on a line in the

image. The best choice of camera is the one which maps the world point to the orthogonal 

projection of the measured image point onto that line.

 ¢

¡

¥ £ ©  ¥ £ © £ £ Due to the cross-product, the components of the line are express-

ible as homogeneous quadratic functions of £ , and these in turn are expressible as linear

 ¢

¤

¡

£ © £ £

functions of ¥ . This is because the quadratic function vanishes at each .

£ ¥ £ ¥

¤ ¦ ¤

Thus: ¦

§¨§¨§  §¨§¨§

 ¥ £ © 

  ¢   §¨§¨§  §¨§¨§

 ¥ £ © £ £  ¥ £ ©  ¥ £ ©

§¨§¨§  §¨§¨§

¥ £ ©



¡

¢

for some matrix with rows ¡ whose coefficients can be determined from those

 ¢ 

¡ £

¥ ©  of and . If the sixth image point is  as before, then the squared geometric

image residual becomes:

¡ £



¤



¥  ¥ £ ©  ¥ £ © ¥ ¥ £ © ¥



¥ ¥   ¥ £ © © ¥

 

¥  ¥ £ © ¥ ¥  ¥ £ © ¥

  

¡ £





¥  ¥ £ ©  ¥ £ © ¥ £ © ¥

  

 (3.8)

¥  ¥ £ © ¥ ¥  ¥ £ © ¥

  £

¡ £





 

¥ ¥ © ¥ £ © ¥

 

 

¥  ¥ £ © ¥ ¥  ¥ £ © ¥ and this is the geometric error (summed over each image) which must be minimized over 54

¦

£ . One can now compare the algebraic cost to the geometric cost. The algebraic



¡

¥ £ © ¥ ¥ 

error minimized is , which corresponds to summing an algebraic residual

 

£



© ¥ £ © ¥  over each image. Thus, the algebraic cost neglects the denominator of the

geometric cost (3.8).

In principle this minimization problem could be solved by equating the partial deriva- ¥

tives of the cost to zero and attacking the resulting § algebraic equations but the degree

§ ¥

would be very high (probably £ ) because the denominators are incompatible.

An alternative is to use iterated least squares because the correct scaling of each design

£ © matrix row can be estimated from a previous estimate of ¥ but in practice this may not be useful compared with just running a bundle adjustment.

3.3.10 Related Work

§ ¡

Yan et al [169] describe a linear method for reconstruction from views of six points. £ The method described here and theirs both turn the set of quadratic equations in

into a set of linear equations in some auxiliary variables but differ in how they handle

the remaining non-linear constraints. The method of this chapter tries to combine all lin-

ear constraints using SVD and then imposes the cubic constraint in the least harmful way;

their method uses the non-linear constraints solely as a way to resolve ambiguity and in

fact neglects constraints that are not needed to do so. Secondly, their method uses projec-

tively transformed image coordinates, and so potentially suffers from the bias described

by Hartley in [55].

3.4 Algorithm details

3.4.1 Computing camera pencils

The choice of canonical basis is particularly convenient for computing the pencil of cam-

§¦§¦§¦ §¦§¦§¦

¤ ¤

¡

    ¢ eras which map the five world points ¢ to the five image points re-

55

 §

¢

¡

¥  £

spectively. For ¥ the th column of must be a multiple of so ¥

£ . . . . ¦

¤ . . . .

¥ ¦

   



§ §

¡

¥ ¦

        . . . .

. . . .

 §¦§¦§¦ 

§ ¤

  ¢

for some scalars . Now, the fifth point ¢ must map to :

     § 

¤ § £

¡

 ¥         © 

¢ ¢ ¢

  §¦§¦§¦ 

§

¥  © which comprises three linear equations of total rank on . In practice one

translates the image points so as to place the fifth point at the origin and the constraints

on are then given by

£ ¥



§



¤ ¦



§

¡ ¡ ¡ ¡

¥ ¦



£

  

¥ ¦



©

£ £ £ £ §

 



§

§

¡ £

¥ £ £ © where the ¥ th image point after translation is . A basis for the nullspace of this

matrix can be found by applying Givens columns operations to put the matrix into lower

 §  triangular form. Given a generator it is easy to recover the corresponding ¢ camera

matrix. ¢ This gives, of course, only one of infinitely many possible choices of generators  , for the pencil of cameras and, since the choice of generators affects the scaling of the linear

constraint it is necessary to be specific about the choice. Suppose we took another pair of

generators, say

   ¢

¡ ¡

£

   

¢   ¢

¡ ¡

£

   

£ ©

Then the corresponding row of the design matrix for ¥ will be given by

  

¡

¡

£ ¡

£¢¥¤ ¥ ¡¤£ © and so in order to give a meaningful recipe, the choice of generators must be specified

up to a unimodular (i.e. with determinant ) linear change of basis. The normalization 56 recommended can be described by saying that the choice of generators must form an or-

thonormal basis with respect to the “inner product” given by

 £¢ 

¥ © ¡¤£ ¡¤£

(3.9)

¡¡ £¢¥¤



¦

£¢¥¤ ¤ §¨¤ © 

which is similar to the Frobenius inner product, given by

 £¢ 

¥ ©¨  ¡ £ ¡¤£

¡¡ £¢¥¤ ¤ §



¦

¢¥¤ ¤ §¨¤ ©  except that the former leaves out the last row of the matrices.

Now, because any two orthonormal bases are related by an orthogonal transformation,

any two choices of orthonormal basis generators of the pencil of cameras will give the same

scaling of the relevant rows of the design matrix. This is true for both the Frobenius inner

product and the one advocated here, but the latter has the extra advantage of giving a nor-

malization that is invariant to similarity transformations of the images. To see this, note

that translating the fifth point to the origin gives translational invariance. If the image

points were rotated and scaled prior to computing the camera pencil, this would corre-

§

spond to applying a scaled rotation matrix to the two submatrix of the cameras

and this only affects the inner product by a constant factor, hence after normalization the

result would be the same. ¢ Summary. The generators  , of the pencil are computed by finding any two generators

for the linear system



£ ¤

¡



¢ ¢ and orthonormalizing them using the dot product (3.9).

57

3.4.2 Inverting

¤

§  

¤

© ¥ £ © £ ¥ © Having solved for ¥ it is necessary to recover and this is

simple, using the following relations :

 §

£ £

¤

¥ ©

 §

£

¤

¥ ©



£

¤

¥ ©

 §

£ £

¤

¥ ©

¤

 §

£ £

¤

¥ ©

¤



£ £

¥ ©

(3.10)



£ £

¤

¥ ©



£ £

¤

¥ ©



£ £

¥ ©

¤



£ £

¤

¥ ©



£ £

¤

¥ ©

¤



£ £

¥ ©

¤

¥ § ¥ ¡

£ £

¤

For example, the ratio can be obtained as the ratio . Thus lies in the

§ ¢

kernel of the following design matrix :

£ ¥

¤

§ ¥

£ £ £ £

¤

§ ¥

£ £ £

¤

¥ ¦

¥ ¦

¤

¥

£ £ £

¤

¥ ¦

¥ ¦

¤ ¦

§ ¥

£ £ £

(3.11)

¥ ¦

¥ ¦

¤

§ ¥

£ £ £ £

¤

¥

£ £ £

(Next to each row is shown the ratio used to produce the row.) This will have rank ¢ if

¤

§

£

the point with coordinates really does lie on . In fact, imposing this degeneracy

¤

§ § §

on submatrices gives quartic algebraic expressions in which are all multiples

of the cubic expression (3.6) defining £ .

¡ 

Alternatively, since is only defined up to scale one can assume, say, that ¤ and

¤

then recover the values of since their ratios to are known. Of course, one could

equally well start by assuming , or to equal and doing this produces the following

58 1. Compute, for each image, the pencil of cameras which map the five points of the canonical world basis to the first five image points.

2. Form, for each image, the quadratic constraint on the sixth point from

section 3.3.5

£ © 3. Form the corresponding linear system of equations in ¥ , scaled as explained in section 3.4.1

4. Extract the pencil of solutions corresponding to the two smallest sin- gular values and impose the non-linear cubic constraint from sec- tion 3.3.6 to get three solutions.

5. Extract the three corresponding values for .

Figure 3.3: Summary of 6-point reconstruction algorithm.

four inverses to ¡ :

£ ¥ £ ¥ £ ¥ £ ¥

¤ ¤

§ §

£ £ £ £

¢ ¥ © ¢ ¥ © ¥ ©£¢ ¥ ©

¤ ¦ ¤ ¦ ¤ ¦ ¤ ¦

¤ ¤

§ §

£ £ £ £

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¦

¥ ©£¢ ¥ ©£¢ ¥ © ¥ ©£¢

¥ ¦ ¥ ¦ ¥ ¦ ¥ ¦

¤ ¤

§ §

£ £ £

£

¥ ©£¢ ¥ ©£¢ ¥ © ¥ ©£¢

¤ ¤

§ §

£ £ £ £

¥ ©£¢ ¥ © ¤¢ ¥ © ¢ ¥ ©

(3.12)

¤

§

Whether or not a particular formula is applicable depends on the values of and

a practical way to perform the inversion is to try each formula in turn until the values of

¤

§¡

¤

all have absolute value at most (this is bound to happen so long as satisfy the cubic constraint 3.3.6).

3.4.3 Robust Reconstruction Algorithm

In this section we describe a robust algorithm for reconstruction built on the ¢ -point en-

gine of section 3.3. The input to the algorithm is a set of point tracks, some of which will contain mismatches. Robustness means that the algorithm is capable of rejecting mis-

matches, using the RANSAC [32] paradigm. It is a straightforward generalization of the

¢

¢ corresponding algorithm for ¥ points in views [153, 171] and points in views [6, 154].

Algorithm Summary. The input is a set of measured image projections. A number of world points have been tracked through a number of images. Some tracks may last for

59 many images, some for only a few (i.e. there may be missing data). There may be mis-

matches. Repeat the following steps as required:

1. From the set of tracks which appear in all images, select six at random. This set of

tracks will be called a basis.

2. Initialize a projective reconstruction using those six tracks. This will provide the

world coordinates (of the six points whose tracks we chose) and cameras for all the

views (either quasi-linear or with ¢ degrees of freedom optimization on the sixth

point – see below).

3. For all remaining tracks, compute optimal world point positions using the computed

cameras by minimizing the reprojection error over all views in which the point ap-

pears. This involves a numerical minimization but it is not very expensive.

4. Reject tracks whose image reprojection errors exceed a threshold. The number of

tracks which pass this criterion is used to score the reconstruction.

The justification for this algorithm is, as always with RANSAC, that once a “good” basis

is found it will (a) score highly and (b) provide a reconstruction against which other points

can be tested (to reject mismatches).

3.5 Results

3.5.1 Synthetic data

To assess what, if anything, is gained by using the quasi-linear algorithm instead of, say, a duality method this section will compare four different estimators on synthetic ¢ -point data sets with varying amounts of noise. The estimators are

Quasi-linear: minimizes an algebraic error on sixth point only.

Sub-optimal: minimizes the geometric error on sixth point only. The point of this is to

assess how closely the algebraic error approximate geometric error.

Factorization: a simple implementation of projective factorization. 60 −3 1.6 x 10 20 5 1.4 4.5 quasi−linear 1.2 4 quasi−linear 15 3.5 1 factorization sub−optimal 3 0.8 10 quasi−linear 2.5 0.6 2 bundle % failures reprojection error sub−optimal sub−optimal reconstruction error 1.5 0.4 factorization 5 bundle 1 0.2 0.5 factorization 0 0 bundle 0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5 0 0.5 1 1.5 2 2.5

added noise (pixels) added noise (pixels) added noise (pixels)

£ £ £

Figure 3.4: Summary of experiments on synthetic data. data sets were generated ran- ¢ domly ( ¥ views of points) and each algorithm tried on each data set. Left: For each of

the four estimators (quasi-linear, sub-optimal, factorization and bundle adjustment), the

£ £ £ graph shows the average rms reprojection error over all data sets. Middle: the average

reconstruction error, for each estimator, into the ground truth frame. Right: the average £ number of times each estimator failed (i.e. gave a reprojection error greater than pix- els). In all cases, the factorization approach is the odd one out, sometimes performing better, sometimes worse than the others. The cause of this was not investigated for lack of time.

Bundle adjustment: minimizes reprojection error for all six points over all camera and

structure parameters.

There are three performance measures of interest:

1. reprojection error, which measures image fitting residuals.

2. reconstruction error, which measure deviation from the ground truth.

3. stability, which measures whether the algorithm converges nicely.

The claim is that the quasi-linear algorithm performs nearly as well as the more expensive variants and can safely be used in practice.

Synthetic data. Results are shown for testing the algorithm on synthetic data with varying

amounts of pixel localisation noise added; the noise model is isotropic Gaussian noise with

£ £ £  standard deviation  . For each value of , the algorithm is run on randomly generated

data sets. Each data set is produced by choosing six world points at random uniformly in

¢  §

£ ¡

the cube ¤ and six cameras with centres between and units from the origin and

principal rays passing through the cube. After projecting each point under each chosen

61

¡ ¡

camera, artificial noise is added. The images are , with square pixels, and the

principal point is at the centre of the image. Figure 3.4 summarizes the results. £ The “failures” refer to reconstructions for which some reprojection error exceeded

pixels. The quality of reconstruction degrades gracefully as the noise is turned up from the

¦ ¦

£ ¡ ¡

slightly optimistic to the somewhat pessimistic ; the rms and maximum reprojection

¦¡ ¢ ¢ error are highly correlated, with correlation coefficient £ in each case (which may also be an indicator of graceful degradation).

3.5.2 Real data I

£ §

£ ¢ £

The image sequence consists of colour images (JPEG, ¥ ) of a turntable, see fig- ure 3.5. The algorithms from before, except factorization, are compared on this sequence and the results tabulated also in figure 3.5. Points were entered and matched by hand using

a mouse (estimated accuracy is pixels standard deviation). Ground truth was obtained by

¦

£ ¡

measuring the turntable with vernier calipers, and is estimated to be accurate to .

While reprojection errors are measured in the images, to assess the reconstruction error it is necessary to homographically register a proposed reconstruction into the ground truth coordinate frame (minimizing registration error there); the reconstruction error is the fit- ting error of that registration. There were tracks, all seen in all views. Of course, in prin- ciple any six tracks could be used to compute a projective reconstruction, but in practice some bases are much better than others. Examples of poor bases include ones which are almost coplanar in the world or which have points very close together.

Bundle adjustment achieves the smallest reprojection error over all residuals, because it has greater freedom in distributing the error. Our method minimizes error on the sixth point of a six point basis. Thus it is no surprise that the effect of applying bundle adjust- ment to all points is to increase the error on the basis point (column ) but to decrease the error over all points (column ). These figures support our claim that the quasi-linear method gives a very good approximation to the optimized methods. Figure 3.6 shows the reprojected reconstruction in a representative view of the sequence.

62 basis residuals all residuals reconstruction (pixels) (pixels) error (mm) ¢ points quasi-linear 0.363 /2.32 0.750/2.32 0.467/0.676 ¢ points sub-optimal 0.358 /2.33 0.744/2.33 0.424/0.596 ¢ points bundle adjustment 0.115 /0.476 0.693/2.68 0.405/0.558

All points (and cameras) bundled 0.334 /0.822 0.409/1.08 0.355/0.521

£ Figure 3.5: Results for the tracks over the turntable images. The reconstruction is compared for the three different algorithms, residuals (reported as rms/max) are shown for the ¢ points which formed the basis (first column) and for all reconstructed points taken as a whole (second column). The last row shows the corresponding residuals after performing a full bundle adjustment.

0.8 sub−optimal quasi−linear

0.7 bundle

0.6

0.5 final bundle 0.4

0.3

0.2 rms reprojection error over all points (pixels) 0.1

0 3 4 5 6 7 8 9 10 number of views

Figure 3.6: Left: Reprojected reconstruction in view ¢ . The large white dots are the in- put points, measured from the images alone. The smaller, dark points are the reprojected points. Note that the reprojected points lie very close to the centre of each white dot. The reconstruction is computed with the ¢ -point sub-optimal algorithm. Right: The graph shows for each algorithm, the rms reprojection error for all tracks as a function of the number of views used. For comparison the corresponding error after full-bore bundle ad- justment is included.

63 Dinosaur sequence results basis residuals (pixels) all residuals (pixels) inliers ¢ points quasi-linear 0.0443/0.183 0.401/1.24 95 ¢ points sub-optimal 0.0443/0.183 0.401/1.24 95 ¢ points bundle adjustment 0.0422/0.127 0.383/1.181 97 All points (and cameras) bundled 0.313 /0.718 0.234/0.925 95

Figure 3.7: The top row shows the images and inlying tracks used from the dinosaur se- quence. The table in the bottom row summarizes the result of comparing the three differ- ent fitting algorithms (quasi-linear, sub-optimal, bundle adjustment). There were ¢ views.

For each mode of operation, the number of points marked as inliers by the algorithm is

shown in the third column. There were ¥ tracks seen in four or more views.

3.5.3 Real data II

The second sequence is of a dinosaur model rotating on a turntable (figure 3.7). The image

£ ¡ ¢

¥

size is ¥ . Motion tracks were obtained using the fundamental matrix based tracker

£ £

described in [33]. The robust reconstruction algorithm is applied using samples to the

§

£ ¡ ¢ £

subsequence consisting of images to . For these views, there were ¥ tracks of which

§

¢

only were seen in all views. ¥ tracks were seen in or more views. The sequence

contains both missing points and mismatched tracks.

For the six point RANSAC basis, a quasi-linear reconstruction was rejected if any re-

64 Figure 3.8: Dinosaur sequence reconstruction: a view of the reconstructed cameras (and points). Left: quasi-linear model, cameras computed from just ¢ tracks. Middle: after resectioning the cameras using the computed structure. Right: after bundle adjustment of

all points and cameras (the unit cube is for visualization only).

£ projection error exceeded pixels, and the subsequent ¢ degrees of freedom sub-optimal

solution was rejected if any reprojection error exceeded a threshold of ¡ pixels. These are

very generous thresholds and are only intended to avoid spending computation on very

bad initializations. The real criterion of quality is how much support an initialization has.

When backprojecting tracks to score the reconstruction, only tracks seen in § or more views

¦

¡ were used and tracks were rejected as mismatches if any residual exceed pixels after backprojection.

The algorithms of section 3.5.1 (except factorization) are again compared on this se- quence. The errors are summarized in figure 3.7. The last row shows an additional com- parison where bundle adjustment is applied to all the points and cameras of the final re- construction. Figure 3.7 also shows the tracks accepted by the algorithm. Figure 3.8 shows the computed cameras (and world points).

Remarks entirely analogous to the ones made about the previous sequence apply to this one, but note specifically that optimizing makes very little difference to the residuals. This means that the quasi-linear algorithm performs almost as well as the sub-optimal one. Ap- plying bundle adjustment to each initial ¢ -point reconstruction improves the fit somewhat, but the gain in accuracy and support is rather small compared to the extra computational

cost (in this example, there was a ¥ -fold increase in computation time).

¡ ¢ The results shown for view £ to are typical of results obtained for other segments of

65 consecutive views from this sequence. Decreasing the number of views used has the dis- advantage of narrowing the baseline, which generally leads to both structure and cameras

being less well determined. The advantage of using only a small number of points (i.e. ¢

instead of ¥ ) is that there is a higher probability that sufficient tracks will exist over many

views.

3.6 Assessment

Algorithms have been developed which estimate a six point reconstruction over views

by a quasi-linear or sub-optimal method. It has been demonstrated that these reconstruc-

¥ tions provide cameras which are sufficient for a robust reconstruction of ¢ points and

cameras over views from tracks which include mismatches and missing data.

It is a shame there was no time to implement, and properly compare the method of

this chapter with the duality method tested by Hartley [55]. Another valuable experiment

that could have been carried out is to investigate how sensitive the method is to an ill-

conditioned (i.e. near-planar) 3D basis.

66 Chapter 4 Auto-calibration

4.1 Overview

The auto-calibration problem is first described in general and two special cases investi- gated: first, the classical three-view case of constant internal parameters and second, the case known as “square pixels” or “known aspect ratio”. In each case the emphasis is on understanding the geometry and algebra of the specific problem.

4.2 Motivation

The camera matrices and structure obtained from a projective reconstruction algorithm, followed by projective bundle adjustment, can be highly accurate in terms of explaining

the image data. If the image data is given as points matched across several views and the

 ¡¤£

image of the ¥ th world point under the th camera is denoted by then a projective re-

¡

£ ¡

construction recovers cameras and world points such that



¡

£  ¡¤£

¡ (up to scale)

¥ © ¥ for all pairs ¥ for which the world point was seen in the th view.

However, the coordinate frame in which the reconstruction is expressed is (projectively)



¡ ¡



¡¡ ¡ £

arbitrary in the sense that for any collination , the cameras ¡ and world points



£ £

£ explain the image data equally well:

 

¡ ¡





£ £

¡ £ ¡¤£



¡ £

For many applications (e.g. generation of 3D models, or for measurement), such projective distorted coordinate frames are unsuitable. This give rise to the problem of recovering the

real, Euclidean, coordinate frame given a projective reconstruction. To a good approximation (assuming a wholly linear projection model and ignoring, for example, radial lens distortion), the real cameras which took the images can be modelled

as linear projection by matrices of the form

¡ £¢ ¢

£

¡

¡ ¤

¥ (4.1)

¢ ¢

for each view. Here, ¡ is the position -vector of the camera’s centre of projection, is a

¢ ¢ ¢

¢ rotation matrix which models the camera’s orientation and is the calibration

£ ¥

¥¤§¦ ¦

matrix. ¤

¡

¤©¨



£ £

£ £

The calibration matrix encodes information about the camera which is independent of its position and orientation in the world. It describes the coordinate transformation which

maps a ray in a camera-centred coordinate system to a (pixel) position in the image that is

read from the camera, so is affected by things such as focal length (distance from the light

sensor array to the centre of projection) and the shape and layout of the elements that

make up the sensor array. For physical reasons, the distortions modelled are translations, horizontal and vertical scaling and shearing along the horizontal axis, so the calibration

matrix is modelled as an upper-triangular matrix.

§

It is easy to see that every ¢ camera matrix may be decomposed as in equation (4.1);

¢ ¢ this can be achieved by taking the leading submatrix of ¡ and reducing it to upper

triangular form by post-multiplication with Givens rotations (“RQ decomposition”). The decomposition is essentially unique, so without making further assumptions about the

calibration (or motion), all coordinate frames are equally valid.

Let us count “degrees of freedom”: the datum of a perspective camera has degrees

£ ¡

¢ ¢

of freedom, whether this is counted as the of the projection matrix or as

£ ¢¡ from equation (4.1). A projective reconstruction thus provides parameters of

information (the subtracted ¢¡ is the dimension of the group of collineations, or projective

£

£ ¥ coordinate changes). A Euclidean reconstruction requires . The difference of must be made up by making modelling assumptions.

For example, if one assumes constant but unknown calibration across all views, the 68

¡ ¢ ¡ ¢ £

¥¡

term in the last expression must be replaced by so that we require

£ ¢¡

¡ ¢

, or , for the recovery of Euclidean structure to be possible. This case is treated in section 4.3.

Another common assumption is to take each camera to have zero skew; the sensor ele-

£

ments are taken to be rectangular. The term is then replaced by and so we require

£

£ £ £ ¢¡

¡

¥¢ , or . Ponce [106] discusses methods for solving this problem.

If in addition to zero skew we also assume unit 1 aspect ratio we get the case known as

§

¡

“square pixels”, where is replaced by and we require to recover Euclidean structure. This is the subject of section 4.4. The possibility of Euclidean reconstruction from square pixels was first noted theoretically by Heyden and Astr˚ om¨ [61] and later prac-

tically by Pollefeys et al [104] who noted that assuming zero skew was sufficient.

Auto-calibration, which here refers to the recovery of Euclidean structure from a pro-

jective structure, is a problem of choosing a coordinate system but it can be posed more

geometrically (i.e. in a coordinate-free way) as the problem of determining the position of

certain geometric entities in the (computed) projective world. The two entities are (1) the

plane at infinity ( £ ), which encodes all information about parallelism and (2) the abso- ¤ lute conic ( ), which lies on the plane at infinity and encodes all metric information such

as angle and relative distance.

Because the absolute conic lies on the plane at infinity it is natural to try to determine

the latter first and then to compute the former (this last step is quite simple). This approach

is known as stratified auto-calibration. Both the methods described in this chapter work in

this way.

There are other common approaches to auto-calibration. The simplest, due to Hart-

ley [53], is based on the assumption that the camera does not move but only rotates about

its centre of projection; this does not apply to the general motion case. The case of general

motion but with planar scenes was investigated by Triggs [158]. For general motion and

general scenes, a competitor of the stratified approach is to solve for the absolute quadric,

which is the locus of planes tangent to the absolute conic. This device was invented by

1If the aspect ratio is known, but not unity the calibration problem can be reduced to the unit aspect ratio easily so the “square pixel” case is effectively equivalent to the “known aspect ratio” case

69 Triggs in [157]. It has the advantage of encoding both affine and metric information in a

single object (which satisfies an algebraic constraint though – it is a singular quadric).

4.3 Modulus constraints

The modulus constraint is available in the situation where the calibration of the cameras

is the same, say when a single camera moves through a scene without re-focusing. In this

case the cameras recovered from projective reconstruction have the form

¡ ¢ ¢

£

¡



¡



¡ ¥ ¡ ¤ ¡ (4.2)

Now let § be any plane in the world (e.g. the plane at infinity). Such a plane induces a homography between any two images, by (back)projection from the first image out onto the plane in the world followed by (re)projection from the world into the second image.

This homography will be denoted



¡ ¡

¡

¡¤£¦¥ £ © ¥ ¡ £ § ©

§

and maps from image ¥ to image . In the special case where is the plane at infinity, the

homography is called the infinite homography, denoted ¡¤£ , and has a special structure; it

is easily verified that it takes the form

¡ ¢ ¢

 

 

¥ ¡ ©

¡¤£ £

which means that it is conjugate (in the algebraic sense) to a rotation matrix and so has

¢

¡¤£ ¡¥£§¦

 spectrum (up to scale) of the form because these are the eigenvalues of a

rotation by an angle ¥ .

Auto-calibration using the modulus constraint proceeds by searching for planes § for

¡¤£¦¥¨§ © which the induced homographies have spectra of this form.

The modulus constraint was first formulated by Pollefeys and Van Gool [105] who de- rived constraints (equations) of degree § on the position of the plane at infinity and used numerical optimization to find solutions which could be used as the starting point for a bundle adjustment.

70

The original quartic constraints of Pollefeys and VanGool were constructed from pairs

§  § ¢ of views, so three views are enough to reduce the ambiguity to a finite number ( ) of

solutions. This section shows how to construct a new constraint of degree ¢ by combining

information from triples of views instead. For a three-view projective reconstruction, the three quartic and one cubic constraint taken together admit only twenty-one solutions, a

considerable improvement.

4.3.1 Motivation

¡ ¡ ¡

Consider equation (4.2). If we subtract £ from we get:

¢ ¢  ¢ ¢

£ £ £ £

¡ ¡

 

¡ ¡

 

¡ £ ¥ £ ¤ £ ¡ ¥ ¡ ¤

¢  ¢ ¢

£

 

 

£ ¥£¢¥¤ ¡¡

¡

§

and it is easy to see that a null-vector of this ¢ matrix is

§¨§

¦ 

¡¤£

£ ©

¡ £

§

¢ £

where the -vector ¡ is the axis of rotation (eigenvector with eigenvalue ) of the matrix

¢  ¢ ¢





¡ £ £

¡ . It is the point on the plane at infinity corresponding to the direction of the axis ¥

of rotation of the camera motion between views and .

 ¦

¥ © ¥ Given three cameras, there are three pairs ¥ with and so we obtain three points on the plane at infinity, sufficient to recover it.

Unfortunately, this solution is flawed because the camera matrices that come from a

projective reconstruction may have been scaled relative to the right hand side of equation

(4.2) in ways that are not known a priori. In effect, all we know is that there are non-zero 

scalars ¡ such that the null-vectors of

 

£

¡ ¡

¡ ¡ £ £

lie on the plane at infinity.

¢

£

But something has been gained: instead of the -parameter problem of solving for

   



¥   © ¦ ¦ it suffices to solve the -parameter problem of solving for . 71

In order to make progress on this it is necessary first to understand how the null-space

§

of a ¢ projection matrix varies with the entries of the matrix.

4.3.2 Algebraic nullspaces

§

¢ A generic matrix ¡ has a -dimensional null-space which represents the optical centre of the associated perspective camera. Numerically, this can be computed very stably using a matrix decomposition such as QR or SVD but from an algebraic point of view these aren’t very enlightening. What is needed is an algebraic formula for the null-space and this will

be developed next.

§ §

¡ ¡

Consider the matrix obtained by adjoining the th row of to the end of . This

matrix is singular: §

§¨§¨§ §¨§¨§

¡



£

§ ©

£¢¥¤

¡  ¡  ¡ ¡ ¡

Expanding the determinant along the last row we find that each row (e.g. the th one) of

¢ ¢

¡ ¡

¥ ©

is annihilated by (i.e. orthogonal to) the vector ¡ of (signed) minors of :

§  

§  £

¢

¡

¡  ¡  ¡ ¡ ¡ ¥ © for

To make the choice of sign clear, and for definiteness, this is the expression used:

£ ¥

¤ ¤

§

¤ ¤

   

¤ ¤

§ £

¥ ¦

¤ ¤

   

¥ ¦

¤ ¤

§

¥ ¦

¤ ¤



¥ ¦

¥ ¦

¥ ¦

¤ ¤

§

¥ ¦

¤ ¤

   

¥ ¦

¤ ¤

§

¥ ¦

¤ ¤

   

¥ ¦

¤ ¤

§

¥ ¦

¤ ¤



¥ ¦



¡

¥ ¦

© ¡ ¥ ¥

¦ (4.3)

¤ ¤

§

¥ ¦

¤ ¤

¥      ¦

¤ ¤

§ £

¥ ¦

¤ ¤

¥ ¦

    

¤ ¤

¥ ¦

§

¤ ¤

¥ ¦

 

¥ ¦

¥ ¦

¤ ¤

¥ ¦

¤ ¦

¤ ¤

¥ ¦

    

¤ ¤

¥ ¦

¤ ¤

    

¤ ¤

¤ ¤

 

¡

¥ ©

A more intrinsic description of ¡ is to note that it satisfies the equation

§

¡



¡

©

§¢¡¦¥ © ¢¥¤

(4.4) §

72 §

for all (row) vectors § . In particular, a world plane represented by the vector passes

§ §

through the optical centre if and only if the matrix on the left is singular, i.e. if and ¡ only if § is a linear combination of the rows of . This is a well known fact; the advantage of using the algebraic form is that the scaling of the null-vector is defined canonically by the

scaling of the matrix.

4.3.3 Horopters

What is the geometric meaning of the null-spaces of linear combination of camera matri-

ces? A null-vector £ which satisfies

  ¡

£

¡ ¡

¥ ¡ ¡ £ £ ©¤£

also satisfies

  

¡ ¡

¡ ¡ £ £

£ £

¡ ¡

¡ £ £ £

and so corresponds to a point in ¦ whose image projections , are equal (up to

 ¥ 



£ ¦

scale). As the ratio ¡ varies the null-vectors

  

£

¡ ¡

£ ¡¤£ ¥ © ¡ ¥ ¡ ¡ £ £ ©

thus trace out a curve in space which is the locus of points which project to the same point ¥ under cameras and .

Such a curve is called a horopter curve and has been used before in the context of cali- bration [3] and ambiguity of reconstruction [92]. In general the horopter curve is a twisted cubic [136] which passes through both camera centres and which is asymptotic to the axis of rotation of the between-view camera motion. In special cases the curve may degenerate

into a union of a line and a conic, or worse. ¥

We next investigate the form of the parameterized horopter for views and . The ex-

¡ ¡

¦¥ ©

pression ¡ is a homogeneous cubic function of the entries of , so the expression above

  £

expands as a homogeneous cubic linear expression in ¡ and :

        

£ £ £

¡ ¡

 

¡ ¡ £ £ £¥¤ ¡

¡ ¥ ©

¡£¢ ¡ £§¦ £©¨

73

§

¡ ¡

¤

¡ £

¦ ¨ for some -vectors ¢ which are functions of and only. Writing it (perhaps) less

confusingly as

 §¤¡   ¢  ¤ ¢  ¤ ¢ §¤ ¢

¡ ¡ ¡ ¡ ¡

 

¤

¡ ¥ © ¥ © ¥ © ¥ © ¥ ©

¢ ¦ ¨

¨

and comparing coefficients shows that ¢ just reduce to the single-argument function

§

¥ ©

¡ :

¢ 

¡ ¡

¥ © ¡¦¥ ©

¢

¢ 

¡

¥ © ¡¦¥ ©

¨

§ ¨§

¤

£ ¥ ©

and that ¦ reduce to a two-argument function :

¢  ¢

¡ ¡

¤

¥ © £ ¥ ©

¢ 

¡ ¡

¥ © £ ¥ ©

¦

§

¥ © Also, the function £ is quadratic in its first argument and linear in its second argument.

To summarize:

        

£ £ £

¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡

 

¡ ¥ ¡ ¡ £ £ © ¡ ¥ ¡ © £¤£ ¥ ¡ £ © ¡ £ ¥ £ ¡ © ¡ ¥ £ ©

¡ ¡ £ £

      

£ £

 

¡ £ £ £ ¡ £¥£ ¡ £ ¡ ¡ ¡

£ £ ¡ ¡

¥ £ ¡¤£

The geometry for views is illustrated in figure 4.1. The points do not lie on the

¡ £ ¡¤£ ¡ ¡ horopter but the line joining each camera centre ¡ to is tangent to the horopter at .

For three cameras, each pair of views gives rise to a horopter and the coefficients of the

parameterization can be arranged in the following diagram:

¡ 

  

£ £

(4.5)

  ¦ 

£ £

¡  £  £  ¡

74 f f1 2

c σ σ 21 12

Figure 4.1: The horopter curve is defined for a pair of cameras and passes through the two

 ¡ 

camera centres ¡ , . The tangents to the curve at the camera centres pass through the ¡¤£ points £ . The curve is asymptotic to the axis of rotation (the screw axis) of the between- view camera motion. (The point labelled is the centre of the horopter [92] but is not

relevant to this discussion.)

¡  ¡¤£ The use of instead of the original camera matrices gives a representation of the problem which is independent of the camera calibration because these quantities are

expressed in terms that are invariant to changes in .

Note the appearance of a new entity, ¦ , in the centre of the diagram. The definition of

¦ , and all the other entities, can be expressed by the following expansion:

¡

       

¡ ¡ ¡ ¡ ¡ ¡ ¡



¡ ¡

¦

¡ ¡ £ ¥ ¡ £ ¡ ¡£¢ © ¡ ¥ ¡ © £ £ ¥ ¡ £ © ¡

¡ (4.6)

¦

¡ ¤ ¤ ¦

¡ ¡

¡¥¤£¢£ distinct

We will not have much to say about ¦ apart from some remarks in section 4.3.8.

Armed with the notation from this section we are now ready to relate the characteristic

¡¤£¦¥¨§ ©

equation of the plane-induced homography to the parameterization of the horopter

¥ for views .

75

¢¡¤£¦¥ 4.3.4 Characteristic equation of ¡¤£

Consider two camera projection matrices of the form

 ¢ 

¡

¥ ¤

 ¢ ¢ ¡

¥ ¤

 ¢

¢

where the leading ¢ submatrices are and respectively. The homography induced by



£ £ £ 

the plane § is clearly

 ¢   ¢

¡



¡



¥ § ©

and so its characteristic equation is

 ¢  ¢ 

¦ £ ¦ £





¢¥¤ ¥ © £¢¥¤ ¥ ©

¢

£¢¥¤

¤

 §

¦ £ ¦ ¦ £



¢

¥ ©

£¢¥¤

¤

§  ¢   

£¢¥¤ ¢¥¤ where , and the other two coefficients involve entries from both

and ¢ .

¤

§

The four coefficients contain all the information about the spectrum of the ho-

 

¡ ¡

mography . Now, if differently scaled camera matrices had been used, say £ and

 ¤¡

£ then the four coefficients would be

¤ ¤

§   §   ¤   ¤  ¤

 

£ £ £ £ and

and so it is natural to identify any two sets of coefficients that differ in this way:

¤ ¤

§ ¥¨ ¥ ¥  § ¥  ¤ ¥  ¤ ¥ ¤



 

¡

¥ ©

 £¤  ¦ for all £ . The result of this process is a new kind of “space” (similar to projective space where one identifies vectors that differ by a scalar factor) whose points describe the spectra of homographies. By abuse of language, such a § -tuple will also be known as the characteristic equation of a homography. 76 The next result is the crucial connection between horopters and the characteristic equa-

tions of plane-induced homographies:

Lemma. The characteristic equation of the plane-induced homography

¥¨§ ©

¡¤£ is (equivalent to):

¥ ¥ ¥

¥¨§ ¡ §  ¡¤£ §  £ ¡ § £ ©

Proof: Without loss of generality choose coordinates as above, that is:

 ¢ 

¡

¡ ¥ ¤

 ¢ ¢ ¡

¡

£ ¥ ¤



£ £ £

§ ¤

  ¢

 

The induced homography is so the characteristic equation is

¤

§ ¥¨ ¥ ¥

©

¥ where

¤

 §  ¤  ¤ §¤    §¤¦¢

 

£¢¥¤ ¥ ©

§

 ¤

¡ ¡



¡ £

© £¢¥¤

(by equation (4.4))

§

  ¥¤

¡ ¡

§ ¥ ¡ £ ©

   ¤  ¤ §¤



 

§ ¡  ¡¤£  £ ¡ £

   ¤  ¤ ¥¤

 

§ ¡ §  ¡¤£ §  £ ¡ § £

Comparing coefficients gives the required conclusion. QED

It now remains simply to combine this result with some knowledge of the properties of

the eigenvalues of rotation matrices.

4.3.5 Two-view (strong) modulus constraint

£¢ ¥¤¢§¦ ¢ 

¡¤£

¥ A rotation matrix has eigenvalues ¡ where for some , which is the angle of

rotation. The characteristic equation of such a rotation matrix is therefore

¥ ¨¢ ©¤¢ ¥ ¢ ¤¢ ¢ ¤ ¢ ¥ ¢ ¤¢  ¥ ¥ ¥

¥ ¥

¥ © ¥ ¥ ¥ ©

so we have the following necessary condition for § to be the plane at infinity:

¥ ¥ ¥ ¥ ¥ ¥

¡

¡ ¡¤£ £ ¡ £

¥¨§ §  §  § © ¥ ©

(4.7)

£

for some

77 4.3.6 Two-view (algebraic) quartic modulus constraint

In this section we eliminate the unknown scalar factors from equation (4.7). Recall that the

 £¤  ¦

meaning of equation (4.7) is that there are £ such that

 

¡

§

  ¤



§  ¡ £ ¥ ©

  ¤



£ ¡

§  ¥ ©

 ¤

§ £

and so, by direct substitution:

  ¤ 

© ©

¡ £ ¡ £ ¡¤£

¥¨§ © ¥¨§  © ¥ © ¥¨§ © ¥¨§  ©

 ¦ ¥

which gives the following algebraic equation, valid for each pair :

¤ ¤

¤ ¤



£

§ ¡ ¥¨§  ¡ £ ©

¤ ¤

(4.8)

¤ ¤

§ £ ¥¨§  £ ¡ ©

This is the same constraint as derived by Pollefeys and VanGool [105]. For three views,

 ¦

¥ © ¥ § ¦ there are three pairs ¥ with and so one gets three quartic equations in

which is sufficient to reduce the search to a finite number of solutions. By Bez´ out’s theo-

§ § §  §

¢ rem [139] the number of solutions, counting complex and repeated ones, is which is rather many. Pollefeys and VanGool solve the equations by starting from an esti-

mate based on certain assumptions about the calibration and then numerically minimiz- ing the expressions in equation (4.8).

A practical problem is that one may not find the correct solution in this way because

there are so many spurious ones. A particularly spurious example is the trifocal plane (the

plane spanned by the three camera centres) which is in fact a solution admitted by the ¡ £¢ quartic modulus constraints. To see this note that by definition the trifocal plane § sat-

isfies

 

£

¢

¤ ¥¢ ¡ § for and so each determinant in equation (4.8) vanishes. 78

§ 

¢ £ ¢

This still leaves ¢ possible solutions to be investigated. It will now be shown

that this set splits into three sets of solutions of which only one is of interest.

4.3.7 Three-view (algebraic) cubic modulus constraint

¥

¡ § £ ©

Equation (4.8) can be interpreted as saying that the ratio ¥¨§ be equal to the cube

¥ 

¢

§  ¡¤£ §  £ ¡ © ¥ ¥ © ¥ © ¥ ©

of the ratio ¥ Taking the product of the three ratios for and

¢

§ ©

¥ gives:

 ¥¨§  © ¥¨§  © ¥¨§ ©  ¥¨§    © ¥¨§   © ¥¨§   ©

©

¥¨§  © ¥¨§ © ¥¨§  © ¥¨§    © ¥¨§   © ¥¨§   ©

whence it follows that



¥¨§    © ¥¨§   © ¥¨§   ©

 ¤

¤ or (4.9)

¥¨§    © ¥¨§   © ¥¨§   ©



¡ ¡  where ¤ is a complex cube root of . Now, for physical reasons it is clear that

only solutions for which this ratio is real are of interest so one obtains the following cubic

constraint:



   © ¥¨§   © ¥¨§   © ¥¨§    © ¥¨§   © ¥¨§   © ¥¨§ (4.10)

With reference to the diagram (4.5), the quartic constraints can be seen to involve only two-view information (i.e. data coming from just one side of the triangle) whereas the new cubic constraint involves data from all three views at once.

The cubic constraint (4.10) was first described in [122].

By Bez´ out’s theorem, taking the cubic constraint and the two quartic constraints for

§ §  § £

view (say) gives exactly ¢ solutions, counting as before both real and com- plex, with suitable multiplicities. In particular, there must be some solutions to the three quartic constraints which do not satisfy the cubic constraint and this shows that all three categories (4.9) are non-empty (since the two complex categories must be of the same size because the complex conjugate of a solution from one category is a solution from the other category).

4.3.8 Solving the equations

A direct method for finding all solutions to the equations (4.8) and (4.10) for three cameras is given next. 79 In overview, the approach is to construct a parameterization of the set of solutions to

the cubic constraint and to substitute this into the three quartic constraints. In doing so, some things will simplify and it will be seen analytically that there are solutions.

Parameterization of cubic constraint

Consider equation (4.10). The product of the three ratios

¥ ¥ ¥

   §    © ¥ §   §   © ¥¨§   §   ©

¥¨§ and

  



is unity so there must be (non-zero) scalars  such that

¥   ¥ 

¥ §    §    © ¥   ©

¥   ¥ 

¥ §   §   © ¥  ©

¥   ¥ 

¥ §   §   © ¥  ©

 ¦ ¥

Put differently, this means that for :

  

£ £

§ ¥ £  ¡ £ ¡ £ ¡ ©

which is to say that the point

 

£

 ¡¤£ ¡ £ ¡ £ (4.11) must lie on the plane at infinity. A parameterization of the cubic constraint is then obtained

by taking the plane spanned by the three points in equation (4.11):

¡       

£ £ £

           

 § ¥ © ¥  ©£¢ ¥  ©£¢ ¥  ©

(4.12)

¢

 ¦ This parameterization has degree in .

Substitution into quartic constraints

The next step is to substitute equation (4.12) into the quartic constraints (4.8). Note that

by construction, the ratio of the elements in the second column of the matrices in

 ¥ 

£

equation (4.8) will be ¡ so the three quartic constraints can be written jointly as

£¤ ¥¦





¥ ©¤¢





¥ ©¤¢ 

rank 



¥ ©¤¢

80 £¥

which means that there is a scalar such that

 

¥ ¥ ©£¢ ¡

¡ (4.13)



¢

for . We can consider the equations

 ¡

¥ ©£¢ ¡ ¡ (4.14)

not as a system of plane cubics with an unknown parameter but as a system of three

  

¦ ¦

equations on ¦ with providing a coordinate on .



¦

The strategy will be to solve for and jointly.

 £

Special solutions at



£

For a general choice of , the system will have no solution, but for there are in general

© ¡

six solutions as will be shown next. There are several ways that ¥ can vanish for each



¢

:

©

Firstly, one of the three vectors making up the product ¥ could be zero. Since (in

 ¦  

¡ £  £ ¡ ¡ £ general)  the only way this can happen is if both and are zero and so this case

gives the three solutions:

£ ¥ £ ¥ £ ¥

¤ ¦ ¤ ¦ ¤ ¦

£ £

 

£

£ and

£ £

Secondly, the three factors could be non-zero but their product may vanish. This hap-

pens if and only if the factors are linearly dependent vectors (i.e. they represent collinear

¤

¢¡

points in ¦ ). There must be scalars (not all zero) such that

¤

¤ ¤ £¡ ¤ ¤ ¤ ¤ 

£ £ £ £

¥        © ¥     © ¥     ©

or equivalently, in matrix notation :

¤

£ ¥

¤



£ ¥ ¤

¤

£

¥ ¦

...... 

¥ ¦ ¦

¤ ......

¤¡ ¤

......

¥ ¦ ¥ ¦



£

¥ ¦ ¥ ¦

¤ ¦

¡ ¤

£

             

¥ ¦



¥ ¦ ¤ ......

...... 

¤

£

81

§ § ¢ Generically, the matrix on the left has rank so that there is a pencil of possible values

for the vector. But it is easy to verify that

¤ ¤

¤ ¤ ¤¡ ¤ ¤ ¡ ¤ ¤ 

£ £ £ £

¥  © ¥  © ¥ © ¥  © ¥  © ¥ © which is a cubic constraint on the elements of the vector. Hence there are (in general) three

solutions.

© ¥ © ¦¡ Thirdly, the product ¥ could be non-zero but each could vanish. This would

mean that the trifocal plane satisfied the cubic constraint. In certain degenerate cases that

can happen but, generically, it does not.

An alternative way to see the existence of these special solutions is to use the classical

result [57, 137] that a rational parameterization of a (smooth) cubic surface has six “base-



£

© points”, which is what the solutions to ¥ are.

Elimination using resultants

It remains to solve equations (4.14) in general. For this purpose one can use the Macaulay

multi-polynomial resultant [22, 82, 83] (see also [159] for an introduction in the context of camera calibration) to eliminate and obtain a univariate equation in , as follows.

A plane cubic curve is described by a homogeneous cubic polynomial in three vari-

£

 

ables. Suppose are three plane cubics (such polynomials have coefficients).



¢ In general, any two of them will meet in a finite number of points (namely ¢ by

Bez´ out’s theorem), so that the three plane cubics have no common points at all, in general.

The condition for the three cubics to have a common point is given by the resultant, which

¥¦   ©

is a polynomial function ¡£¢¥¤ of the coefficients of the cubics. This function

vanishes if and only if the three cubics have a common point. The resultant ¡£¢§¤ has

¡ ¥

degree in each and a total degree of .

  

 The point of all this is that, treating  as the variables of plane cubics, the ex-

pression

      

£ £ £

¤

¥ © ¡£¢§¤ ¥ ¥ ©  ¥ ©  ¥ © ©

 

(4.15)

 ¦§¦§¦

©¨  ©

¢

¤ ¤

 © ¢

82



¥ ¥

is a univariate polynomial in of degree which has a root at . It is a property of

¤

©

the resultant that the leading term of ¥ has coefficient . The fact that the term of degree

¢

is zero is a property of the particular cubics used. The discussion above showed that ¤



¢ £

has a zero of order at .

 

£ ¢

¡ Having solved for the ¥ non-zero values of , the can be recovered linearly.

One way to recover them is to use the description in [22] of the resultant as the determinant



¢¡ ¢¡

of a matrix whose null-vector(s) is the vector of quartic monomials in .

In practice, finding the roots of a univariate polynomial of degree is a notoriously

ill-conditioned problem [22, 166] and some care has to be taken. The approach adopted

for the work described here was to avoid explicit polynomial arithmetic by representing

¤

© ¥ functionally by the formula (4.15). This approach allows incremental reduction of

the order of ¤ by “peeling off” known solutions as they become available from a Newton-

Raphson polishing step applied to (4.14).

Counting the solutions

It has been shown that the three quartic and one cubic constraint admit exactly solu- § tions. This completes the classification of the ¢ solutions to the classical quartic modulus constraints. They are:

The trifocal plane spanned by the camera centres.

¤

¤

The two complex conjugate sets of solutions which give values ¤ in (4.9).

The solutions which additionally satisfy the cubic modulus constraint (4.10).

Of these, only the last class of solutions is of practical interest.

The mysterious

Recall that ¦ is defined by the expansion

       

¡



£ ¡¤£   ¦ ¡ ¡

£ ¡ ¥ © ¡

¡ ¡

¡ ¡

¡¥¤£¢£

83

from section 4.3.3 and that the (strong) modulus constraint (4.7) can be phrased as saying

  ¦ £

there must exist of scalars ¡ such that

 

§ ¡ ¡

¡

  

¥



¡¤£ £ ¡¤£

§ £ ¥ ¥ ©

¡

¥

¡¤£ ¥ ¥ ¡ £ where ¥ is the angle of rotation between views and . The scalar factors thus

have direct connections with the camera motion parameters. They are in fact the Frobe- ¥ nius dot products between the rotation matrices for views and where the Frobenius dot

product is defined by:

 £¢¢¡ ¤£¦¥  ¢



¥ ©

  

¦

 § What is the meaning of the corresponding scalar multiplying  in ? This section provides a partial answer to that question.

The starting point will be the direct calculation of ¦ for the three camera matrices

 ¢

¡

 ¤

 ¢ ¢ ¡

£

¡ 

¤ (4.16)

 ¢¨§

£ ©

¡

¤

which (using symbolic computer algebra) turns out to be:

£ ¥

£ £ £ £

               

£ £ £ £

¥ ¦

               

¥ ¦

¤ ¦

£ £ £ £

¥ ¦

               

(4.17)

¥ ¦

¦§¦§¦

£ £ £

            

¦§¦§¦

£ £ £

                   

which reduces to:

£ ¥ § § £ ¥ ¢ ¢

£ £

¥ ¥ © © ¥ ¥ © ©

£ ¥ ¢ £¦¥ § £ ¥ ¢ §

£ (4.18)

¥ © ¥ © ¥ © 

¢ § ¡

©

¢ Note that this holds for any ¢ matrices and vectors .

Eventually we are interested in ¦ for the three camera matrices

¡¢ ¢  ¢

£ ¤

¡

¡

¡ ¡ ¥ ¡¥¤ ¥ ¤ ¡

where 

¢

£ £

¡



¡ ¡

£ £

¡

 

84



¢

   

for but these are not in the simple form above. However, definition (4.4) implies



¡ ¡ ¡ ¡ ¡ ¡

   

   

¥   © ¥   ©

   that  to which we can apply the formula (4.18) by

setting

¢  ¢ ¢









¡  ¢

£

¡ ¡

 ¥   ©

§  ¢ ¢







 ¢

© £

¡ ¡

¥  ©

The end result is the following formula:

¡ ¡ ¡



¡ ¡    ¤   ¡

¢ ¢ ¡

 © ¥ 

  

where we have introduced the new notation

¢ ¢ ¡ ¤£ ¥ £ ¥ £¦¥

£

   

  ¥  © ¥  © ¥   ©

 

¢ ¢

¦



which is in fact (as should be expected from the definition of ) symmetric in 

¡ 

(this can be proved by applying the Cayley-Hamilton theorem to the matrix ¡ ). The

¡ ¥

interested reader may also verify that for distinct

¨§ ¢ ¡

£

¡ ¡

 

 £ ¡ £

¡ ¡

©

¡ £ ¥  ¢ ¢ ¡

 £   which is symmetric in and that (again using the Cayley-Hamilton

theorem). We can now state the result of this section:

 ¢ ¢ ¡

¡

£ ¡ £

Proposition. Let ¡ be the Frobenious dot products between

¢  ¢ ¢ ¢ ¡

¦

 

the three rotation matrices ¡ and let . Then:

 §

¦ ¡ ¡ ¡ ¡ ¡ ¡



¥     © ¥   © ¥  © ¥  ©

This relation can be computed as follows : Specialize the three rotation matrices to be

¤

¢¡

rotations by angles about the three coordinate axes and write out expressions for

¤

©    ¡ £¢ 

¦ ¡

¥ ¥ ¥

¡¤£ in terms of :

 ©   ¢ ¨¢ © ©  ¢

¦

¥ ©

 ©  ©¦ 

¡

 

  ¢  ¨¢

¡ (4.19)



 ¢ © ¢ ©

¡



85

 £¢

Using a Grobner¨ basis to eliminating © from these equations produces the stated rela-

¦ ¡

tion between ¡¤£ above. The reader may have reservations about this construction as a means of proof (we seem to have only proved the result for a special class of rotations), but

with a little care the apparent gap can be closed. 2

£ ¢ ¦ £

We note in closing the bound and that in terms of the between-view ¡¤£

rotations ¥ we have

§

 §

¥   ¥  ¥ 

¡ ¡ ¡ £

¥ ¥ ¥

  

©

   

§



§  ¥   ¥  ¥ 

¡ ¡ ¡

¡ ¥ ¥

£¢

©

¥   © ¥  © ¥  ©

which is interesting but not immediately or obviously applicable.

¡ £ ¡¤£ Constructing more constraints Just as calculating the ¡ and allowed the construc-

tion of cubic and quartic modulus constraints, the introduction of ¦ gives rise to some new

constraints, albeit of rather high degree. This paragraph briefly outlines the construction.

¤

¡¥¤ ¡¦¤ ¡¨§ ¡¨§

¢ ¢

 

Given two pairs of cubic monomials , and , satisfying the relation 

¤

 ¡ ¡

¡ £ ¥ ¦

¡  ¡¤£  

 , there is a relation between and . E.g. the relation

¤ ¤

 

 ¡ ¡

£  £   

   

gives rise to the relation

¥  ¥

¡ ¡ ¦

 

   ¦   

¥¨§ £ © ¥¨§ £ © ¥¨§ ©

and



£ £ £ £

   

¢

gives the relation (expressed as an equality of ratios):

¥  ¥

¡ ¦ ¡



¦

¥¨§    © ¥¨§ ¡ © ¥¨§ © ¥¨§ £  ©   

2 ¢

Very sketchy outline: Firstly, and without loss of generality, we can take © to be an identity matrix. Sec- § ondly, by an orthogonal change of coordinates we may assume the axis of rotation of © to be about the z-axis. Thus, the class of matrices for which we have proved the relation in fact exercises all the degrees of freedom that exist

86

¢ Many relations can be constructed in this way, by taking all -ic, -ic, -ic (etc) mono- mials which can be written as a product of ¢ -ic monomials in more than one way. For future

reference, the following table lists all the relations arising from ¢ -ics monomials (shown in

the left column):

¢¡

£ ¨ © ©

¢¥¤ ¢§¦¨¤ ¢ ¤ ¢ ¢ ¦ ¢ ¢

     

©¢© £ ¨ ©

¢¥ ¦¨¤ ¢ ¤ ¢ § ¦ ¢ §¨¢

 

¡

£ ¨ © ©

¢ §¨¢ ¢ § ¢ § §¨¢ §¨¢ §¨¢

¤ ¦¨¤ ¤ ¦



© £ £ ¨ ©¨©

¢ ¢ ¢ ¢ ¢

¦¨¤ ¤ ¦

    

¡



© £ ¨ © 

¢¤ §¦¨¤ ¢ ¦¨¤ ¢ § ¤ ¢ §¦ ¢ ¦ §¨¢ ¢

     

¡



© £ ¨ © 

¢¤ § ¦¨¤ ¢ ¤ §¨¢§¦¨¤ ¢ § §¦ ¢ §¨¢¦ §¨¢

   

 

© £ £ ¨ ©¨©

¢ § ¢ § §¨¢ §¨¢ §¨¢

¦¨¤ ¤ ¦

¡

© £ ¨ ©

¢ ¢ ¢ ¢ ¢ ¢

¤ ¦¨¤ ¤ ¦

      

¡



© £ ¨ © 

¤ ¢ ¤ §¦¨¤ ¢ § ¦¨¤ ¢ ¢ §¦ §¨¢ ¦ ¢

      

¡¡¢¡

¨ 

¤ ¢ ¤ § ¦¨¤ ¢ § ¤ §¦¨¤ ¢¥¤ §¨¢§¦  ¢ §¦ §¨¢ §¦ ¢ §¨¢§¦

       

¡



© £ ¨ © 

¢ § ¢ § § §¨¢ ¢ §¨¢ § §¨¢

¤ ¦¨¤ ¤ ¦ ¤ ¦ ¦

   

¡

© £ ¨ ©

¤ ¢ § §¦¨¤ §¨¢¥¤ §¨¢ §¨¢ ¦ §¨¢ §¨¢

£ ¨ © © ©

¤ ¢¥¤ §¦ ¢ §¦

    

¡



© £ ¨  ©

¤ ¢¥¤ § ¦ ¤ §¦¨¤ §¨¢ ¢ §¦ §¦ §¨¢

      

¡



£ ¨ ©  ©

¢ § § §¨¢ § ¢ § §¨¢ §

¤ ¦ ¤ ¦¨¤ ¤ ¦ ¦

     

©© £ ¨ ©

§¦¨¤ §¨¢¥¤ § ¦ §¨¢ §

 

© £ ¨ ©

¤ § ¦¨¤ § ¤ § §¦ § §

      

¢

£ £ ¨ ©¨© ©

§ § § § §

¦¨¤ ¤ ¦

    

¡

© £ ¨ ©

§ § § § § § §

¤ ¦¨¤ ¤ ¦

     

¦ ¡

Eliminating ¡¤£ from a set of ratios on the right hand side and substituting the corre- sponding terms from the left hand side then gives a constraint on the original target, § .

4.3.9 Experimental Results.

The algorithm described can be run on well-conditioned real data (but the same condition- ing must be used in each image, or the assumption of constant internals will be violated).

Figures 4.2 and 4.3 show three views from a sequence of images of a turntable. A pro- jective reconstruction was obtained by manually selecting feature points and running the

projective reconstruction algorithm described in chapter 3, culminating in full projective

bundle adjustment.

The zeroth, third and seventh cameras were then used as input to the modulus con-

¢

straint solver. Of the solutions found, were complex and so discarded. Of the remain-

£

£

¢ ing , only two gave values between and for the middle two coefficients in (4.7)

and so the other ¢ were discarded on the grounds that this is the possible range of values

¥ ¥ taken by ¥ for all angles .

For the remaining two solutions (see figure 4.3), a metric upgrade was attempted by 87 Figure 4.2: Three images from a sequence of used in the experiments.

(a) (b)

Figure 4.3: Result of metric rectification from the images shown. Of the solutions, only two were admissible (for reasons described in the main text). The first solution (a) is im- plausible. The second solution (b) is reasonable.

solving the linear equations on the image of the absolute conic obtained from the eigen-

¥ ¥

vectors of the infinite homographies. Specifically, if a homography has eigenvectors

¥ ¥

with real (corresponding to the axis of rotation) and complex conjugates (correspond-

  ¥  ¥ 

£

¥ © ¤ ¥ ¥ ¥ © ¤ ¥ © ¤ ¥ ¥ © ing to circular points) then the constraints are ¤ .

One was obviously wrong (by visual inspection) and the other looks good. As a test of

£¢ ¦ £¢ ¦ £ £ £¢ ¦

¢¤£ £ ¢¡

correctness, the angles at the corners of the base of the object are ¥ , , and

£¢ ¦

£ £

degrees (the base of the real object is planar, but the base of the reconstructed object

¢¤£ is not planar, so these angles do not add up to ¢ degrees).

Another example is provided by figure 4.4 which shows three images from a sequence

¢¤£ of seven. The structure was computed from manual point correspondences and pro-

88 Figure 4.4: Three out of seven images from the “Kampa” (Prague) sequence, courtesy of the

Center for Machine Perception, Prague.

£ ¢

jectively bundle adjusted. The calibration matrix recovered from views ¢ and is:

¡ ¢£

§ ¦ £ £ ¦ § § ¦ §

¡ £ ¡ ¢ £ ¢

¢ ¢

¥

 £ £ § ¦ £ ¦ §

£ £ ¡ ¡ ¡ ¡ ¡

¢

£ £

¡

¢ ¢ which shows plausible focal lengths (the images are ¢ ) and skew (almost zero).

By comparison, the calibration matrix obtained by bundle adjustment of the entire ¥ -view,

¢¤£ -point reconstruction using a constant intrinsics model (initialization by T. Werner us-

ing scene constraints) gives a calibration matrix

¡ ¢£

¦ ¦ § ¦ §

£ ¡ £ £ ¢ ¡ £ £

¢ ¢ ¢ ¢ ¢

¥ ¥ ¥

 ¢ ¦ ¢ £ § ¦

£ ¡ £ £ ¡

¢

£ £ which is reasonably close.

4.3.10 Partial constraint

Unfortunately the modulus constraint is only a partial constraint; it is necessary but not

sufficient in the sense that it may be possible to find planes for which the induced homo- graphies all have the required form, but there is no real conic on that plane which has the

image projection into all three images.

¤

¢¡ ¦

¡

In general, a homography with eigenvalues ¡ induces a collineation on the ( -

¢ ¢

dimensional) projective space of plane conics, represented by a matrix with eigenval-

¢

¤ ¤ ¤

¡ ¢¡ ¢¡

  ¦ 

ues .

¢

¡¥£ ¦ ¡¥£



In our case, the eigenvalues of of the homography, corresponding to eigen-

§¦ ¥

vector that represent the axis of rotation ¤ and its two conjugate points , gives the six

¢

 ¡¥£ ¡¥£ ¡¥£  ¡¤£§¦

 

eigenvalues for the collineation on the space of conics. The

89 I N

J

§¦ ¥ Figure 4.5: The pencil of conics determined by an “axis of rotation” ¤ and two points

conjugate to it.

eigenspace of dimension (eigenvalue here) corresponds to a pencil of conics fixed un-

der the homography.

What can go wrong is that the three pencils obtained from the three pairs of views may

not have a common element.

For ideal, synthetic data it is the case that there is a common fixed conic, but in the

presence of measurement noise this need not be the case. It should be stressed that this is

a weakness of the modulus constraint itself and not of the numerical algorithm presented.

For example, the example in figure 4.3 computed using views zero, three and seven works

fine, but using views zero, three and six from the same sequence does not produce a good

result because the camera motions for this triplet are close to having a single direction of

rotation axes. In practice this shows up as a near-ambiguity in the solution of the least squares problem for the absolute conic. It is thus easy to detect by inspecting the two smallest singular values of the linear system being solved.

4.3.11 Conclusion and further work

Traditionally the modulus constraint has been solved using an initial guess followed by

non-linear minimization. Our method is comparable in speed and has the added advan- tage of providing all solutions, which avoids the risk of computing a local minimum.

90 The extra complexity is a disadvantage but the exposition itself has thrown new light on

the problem.

The experiments show that the results obtained from the algorithm can be used as the

starting point for non-linear optimization if some care is taken to detect degenerate con-

figurations.

The problem of deciding which of the solutions is correct remains. Apart from re-

jecting complex solutions and solutions for which the constant from equation (4.7) does

not lie between £ and , the obvious approach is to try to perform a metric rectification

and test the fit of the computed model to the original image data. An interesting possibility is that the configuration of planes has a special structure which might single out certain

planes. For example, a simple counting argument (a homogenous cubic in three variables

£

has coefficients) shows that given planes in general position, there exists a unique cubic which they all satisfy and so, since the solutions to the modulus constraint satisfy

a cubic constraint, they cannot be in general position. (To appreciate the significance of ¢ this, consider that conics have ¢ coefficients and so points which lie on a conic cannot be

in general position because any ¡ of them determine the conic.)

4.4 Square pixels

The modulus constraint applies in the “classical” auto-calibration problem where the in-

ternal parameters are unknown, general and unchanging between views. In reality, camera

calibration is rarely completely unknown and often changes between views (e.g. as a result

of focusing). For example, one often has a priori beliefs about the skew or aspect ratio.

A common assumption is “square pixels”. The term refers to the case of zero skew and

unit aspect ratio; the sensor array can be thought of as consisting of square cells, one per

image pixel. The calibration matrix has the form

¥¦ £¤

£ ¡



£ £

(4.20)

£ £

where is the focal length (distance from image plane to center of projection), measured

£ in pixels. The numbers ¡ represent the principal point, i.e. the projection of the ray 91 through the centre of projection and orthogonal to the image plane.

The set of such matrices forms the (Lie) group of translations and isotropic scalings of

the plane.

Focusing for the moment on the Euclidean properties to be recovered in the world, ¡

it is clear that these are unchanged if any of the projective cameras ¡ is pre-multiplied

by a matrix of the form (4.20). In other words, any proposed algorithm for recovering this

Euclidean world data must be invariant to that group action on the input data. What, then,

is left of a projective camera if one factors out this group action? Or, put differently, which

properties of such a ¡ are invariant to the group action?

Clearly, the camera centre (being the right null space) remains the same. Since the group fixes (pointwise) the image line at infinity, rays through the camera centre which are backprojections of ideal image points are also invariant. This means that given two such rays in the projective reconstruction, the angle between them is known (because it would

be known for a calibrated camera – i.e. one whose calibration matrix is the identity matrix). ¡

In other words, the data provided by each projective camera ¡ is sufficient to compute

angles between rays lying in the principal plane and passing through the camera centre.

Metric information in the image plane is encoded by the two circular points

£ ¥ £ ¥

¤ ¦ ¤ ¦

 ¦ 

£

¥

£ £

They are the (only) two directions in the plane which are self-orthogonal. It follows that the back-projections of the circular points are self-orthogonal rays in space. In other words, they intersect the absolute conic. This is illustrated in figure 4.6.

The rest of the section is laid out as follows. The idea is that the observation about rays from the camera centres meeting the absolute conic can be used to formulate a constraint

(section 4.4.2) on the position of the plane at infinity. This is developed (section 4.4.3) using

Pascal’s Theorem which concerns the conditions under which six points in a plane lie on

conic. The exterior algebra is used as a technical tool to express this, but the bottom line § is that we will obtain polynomial constraints on : four constraints of degree five and

twenty-four constraints of degree six. These are worked out in sections 4.4.4 through 4.4.8. 92 ω

π oo

f3 f2

f1 f4

Figure 4.6: The circular points in each image plane give rise to two backprojected rays which intersect the absolute conic on the plane at infinity.

Section 4.4.9 discusses the number (forty) of solutions admitted by the constraints and

ways to solve them numerically.

4.4.1 Notation

¡ ¡

The th camera centre will be denoted by . The three row of the th camera matrix will be

§

¢¡ ¤£ ©

¡

¡

denoted by ¡ :

£¤ ¥¦

§¨§¨§ §¨§¨§

¡

 §¨§¨§¥¡ §¨§¨§

¡

¡

¡

§¨§¨§¦£ §¨§¨§ ¡

The plane at infinity is denoted § .

¢ § §¦§¦§¦ £¢

The bracket notation ¤ will be used to denote the determinant of the square matrix formed by concatenating the argument vectors. It vanishes if and only if the argu- ments are linearly dependent.

Exterior algebra will be used without remorse or apology. Briefly, any ¥ -dimensional

¡  §¦§¦§¦ ¥

¡ £

¨ §

vector space § gives rise to a collection of vector spaces in a natural

¥

¡ ¡

§

¨ § ¨ © §  ¨ © §

way. There is an associative bilinear multiplication operation ¢

¥



£ £ £ £ ¡

©

¢ § § called “wedge” which satisfies for all . The dimension of ¨ is .

93

§

 ©

§

 

For example, taking § with basis elements gives

 ©

¥

¨ §

  ¦

§



§ § ¡  

¨ span

 ¦

§ § §



     

§ ¡ ¢ ¢ ¢ ¢ ¢ ¢

¨ span

 ¦

§ § §

§ ¡  ¢ ¢  ¢ ¢  ¢  ¢  ¢  ¢

¨ span

§

 ¦

§

 

§ ¡ ¢ ¢ ¢

¨ span

¥ © ¥

¥ § 

The dual space § of linear functionals is a vector space of dimension too.

¡ ¡

¨ § ¨

There is a natural bilinear form on § given by

¦§¦§¦ ¦§¦§¦ ¡  ¡

£ £ £

¡ ¡

¥  ¢ ¢ ¥  ¢ ¢ £¢¥¤ ¥ ¥ ¡ £ ©

£

¥ ¡ § §

for all and £ .

¢

¨ §

The choice of a non-zero element (the “weight element” or “integral”) ¢¡ gives

¥

¡ ¡

¢



¨ §  ¨ §

rise to a natural “duality” isomorphism ¢ given by

¤ ¤

¢¡ ¡  ¡

£¡ ¢

¤

¢¡

¡ ¡

¢



§ ¨ § for all ¨ .

The reader who is not familiar with exterior algebra is probably lost by now; it should be possible to read the text without understanding all the details of the algebra, though. In any case, past section 4.4.6, it is no longer needed. It is used here because it was a useful

tool in the research and so to leave it out would obscure the origins of the ideas. §

The exterior algebras on © and its dual are used to represented projective geometric

§

¥¤ ©

quantities. Thus, the line joining two points in space is represented by the

§ §

¤ © ¢¡ ©



¢ ¨

product and, dually, the line formed by intersecting two planes

§

¡ ©



¢ ¨ in space is represented by the product

The back-projection from camera ¥ of the circular points are the 3D lines given by:

¦ ¦

¨§ £

¢

£ £ £ (4.21)

94

where we have set

¦

§  ¡

£¡

£

£ £ 4.4.2 Co-conicity constraint on

Recall from the counting arguments in section 4.2 that four (or more) cameras will be needed for auto-calibration with the square pixel constraint. Assume, then, given four cameras with square pixels, but expressed in an (unknown) projective coordinate frame.

The £ points given by

§ §

¦

¡

© § £ ©



§ ¢ ¢ ¨

¡

¡ (4.22) § are the intersections of with the rays obtained by back-projection of the image circular

points. Alternatively, they are the back-projections of the circular points onto the plane at infinity.

As observed earlier, these points all lie on the absolute conic, and this leads to a con-

straint between the four camera matrices ¡ and the plane at infinity, namely that the eight points given by (4.22) must lie on a conic.

This should motivate studying how to express the constraint of co-conicity (the prop- erty of lying on some conic – analogy with “collinearity = co-linearity”) algebraically.

4.4.3 Pascal’s theorem

The condition that ¢ points in a plane lie on a conic is expressed by Pascal’s Theorem [43,

¤ ¤ ¤

§ § §

114]. With reference to figure 4.8 let the six points be and draw the lines .

Then the condition is that the three points of intersection

¤ ¤

§ §

£¢ ¢ ¤¢ and

be collinear. This constraint is easily seen to be an algebraic function (of the homogeneous

¤

§¡

coordinates ) which is quadratic in each of the six variables. Of course, any of

¤

§ ¦

£

the partitions of ¡ into two sets of three could be used but they all give the

same constraint.

The constraint can (also) be written down explicitly using cross ratios, as follows (the

reader is referred to figure 4.7). First recall that a non-singular (projective) conic is isomor- 95

¡

¡

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡

¡

¡

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡

¡

¡

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡

 c

¡



¡ ¤

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡



¡



¡ £¤

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡



¡



¡ £

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡b

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡

¢¡¢



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡

¢¡¢



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¥¡¥

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡ ¦¡¦

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡ ¡ ¡ ¡ ¡

¡¡¡¡a

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¥¡¥

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡ ¦¡¦

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡d ¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡ ¨



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡ §¡§¨



¡



¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

¡¡¡¡

§¡§





¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡

¡ ¡ ¡ ¡

©

¡¡¡¡





¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡¡ ¡¡¡¡¡¡¡¡¡¡¡¡¡¡ © e f Figure 4.7: Projection from a point on a conic defines a notion of cross ratio for four (other) points on the conic. The notion is in fact independent of the choice of point to project

from so, in this figure, the four lines emanating from have the same cross ratio as those

emanating from . phic to a (projective) line and any such isomorphism gives rise to a notion of cross ratio on the conic, by evaluating the cross ratio of the corresponding points on the line. Two different isomorphisms from the conic to the line differ by an automorphism of the line, which must be a collineation (because the automorphism group of projective space is just the group of collineations [57]) and so leaves the cross ratio invariant. In other words, the

choice of isomorphism between conic and line does not affect the cross ratio on the conic.

¢ 

Now choose any isomorphism from ¦ to the conic, given as a -tuple of quadratic poly-

 nomials. If we compose this morphism with any linear projection (of rank ) to ¦ whose

points of projection is at a point on the conic we get a -tuple of quadratic polynomials  with a common root (corresponding to the point of projection), i.e. a collineation of ¦ . In

other words, any rank projection which is singular on the conic provides an isomorphism 

with ¦ . A more classical treatment of cross ratios on conics is given in [136].



¦

The cross ratio of four points in , represented by four 2-vectors is given by

the following expression involving brackets:

¢ ¢



¤ ¤

¡

¢ ¢

¥ ©

¤ ¤

¤

§¡

To get the cross ratio of the four lines from through we simply have to add at the 96

c

¤

¡¡¡¡¡¡

¡¡¡¡¡¡

£¤

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

£

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡

¡ ¡¡¡¡¡¡

¡¡¡¡¡¡b

¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡

¢¡¢

¡

¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡

¢¡¢

¡

¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡ ¡

¡

¦

¡

¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¥¡¥¦

¡

¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡

¡¡

¡

¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡



a

¥¡¥

¡

¡

¡¡

¡

¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡ ¡¡¡¡¡¡

¡¡¡¡¡¡ d



¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

§¡§

¨¡¨

¡¡¡¡¡¡¡¡¡

¡¡¡¡¡¡¡¡¡

¡

¡

¡¡

¡

¡¡¡¡¡¡

¡¡¡¡¡¡

§¡§

¨¡¨

¡

¡

©

¡¡¡¡¡¡ ¡¡¡¡¡¡ © e f

Figure 4.8: Pascal’s theorem: necessary and sufficient condition for ¢ points to lie on a conic.

end of each bracket:

¤

¢ § ¢

¤

§¡ 

¨¤ ¤

¡

¤

¥ © ¢ § ¢

¨¤ ¤ 

This trick, which relates the projective invariants of points in ¦ to the invariants of their 

projections in ¦ was also used by Carlsson [14] for the case of perspective projection

§¦§¦§¦

£ £ £



 ¦ ¥ © ¦ . In suitable coordinates, where the point of projection is , it sim- ply corresponds to discarding the last component of each vector.

Now, since the cross ratio is independent of the choice of projection, we can replace

by and equate the two cross ratios to get the constraint

¤ ¤ ¤ ¤ ¤

§ £¢ §¡ ¢ ¢ §¡ ¢ ¢ § ¢ ¢ §¡ ¢ 

£ £

© ¤ ¤ ¨¤ ¨¤ ¤ ¤ ¤ ¤ ¥ (4.23) which is an algebraic form of Pascal’s Theorem.

4.4.4 Collinearity in 3D

When applying Pascal’s Theorem to the problem of expressing the co-conicity of the inter- sections (4.22) one encounters the difficulty that the intersections are § -vectors so that the

brackets used to measure collinearity of ¢ points cannot be used.



£

¡

¥   § ©

To overcome this, we will construct another expression which mea-

§ §

© ©



£ ¨ § sures whether the intersections of three lines with a plane are collinear. 97

( λ v π ) ∗

1

v ∗ v v ∗ v ( λ π ) ∗ 2 ( λ π ) ( λ π ) 1 2

π λ 1 λ 2 λ

3

 

Figure 4.9: Geometry of collinearity for the three intersections of lines with a

plane § .

By replacing the ¢ -brackets from above with these new brackets we will obtain the desired

constraint.

With reference to figure 4.9, this can be rephrased by saying that the line must meet



¡ ¢ §

the line joining the two intersections ( ), or

 

£

¡

¥   § © ¥  ¢ § © ¢ ¥  ¢ § © ¢ (4.24)

3

¡ This formula shows that is linear in each of its first three arguments and quadratic in

the fourth and last argument, § . It is actually an alternating function of its three arguments but apart from that exactly the same constraint is obtained by permuting the three lines.

4.4.5 Octic constraint

Putting together the results from sections 4.4.3 and 4.4.4 above (by replacing each ¢ -bracket

in equation (4.23) with the appropriate -bracket), we obtain the following constraint on §

:

¢¡¤£ £ £ £ £¦¥ £¦§©¨  ¡¤£ £ £§¨  ¡¤£ £ £§¨  ¡¤£ £ £¥¨   ¡¤£ £ £¥¨  

¨    

¢ § © ¢ § © ¢ § ©

            

  

¡¤£ £ £¦¥©¨  ¡¤£ £ £¥¨  ¡¤£ £ £§©¨   ¡¤£ £ £¦§©¨  

   



¢ § © ¢ § ©

       

 

¨ ©

(4.25)

3 £

The geometric meaning of the formula is as follows: “ intersect the ray ¢ with the plane to get the point

¡¤£   £ ¡¤£  

¢ 



and intersect the ray  with the plane to get the point ; the line joining these two points £ must meet the line § ”.

98

¡

which holds for any six of the lines from equation (4.21). This constraint has degree in

£

¡ §

each line and degree in .

¡ £ In the case when two of the rays , meet, this expression factors into a heptic (i.e. of

degree ¥ ) and a linear term.

To see this, suppose that if § passes through the point of intersection (e.g. in our appli-

cation it will be one of the camera centres), call it ¡ , of two rays. Then two of the resulting

six points become coincident and, since there exists a conic through any five points, the

 § 

£ £

§ § ¡

constraint is therefore satisfied. Put differently: for any plane with we

§¨§¨§ 

£

¡

¥ £ ©

must also have . This means that vanishes on the variety defined by the

£ ©

ideal ¥ so that, by the Nullstellensatz, some power of lies in the ideal, i.e. is divisible by

. However, the linear form is a prime (in the coordinate ring) so in fact itself is divis-

ible by . Note: this also showed that the extraneous linear factor arises from the point of intersection of the two rays that are incident.

The same argument applies if more than two rays meet. For example, taking only the six rays arising from the first three cameras gives three pairs of coincident lines so the con- straint factors into one quintic and three linear factors (which are of no interest).

To carry out this factorization in practice one needs to study special forms of the collinear-

ity bracket from above.

4.4.6 Special case of 3D collinearity bracket ¡

When each from section 4.4.4 is given as the exterior product of two planes

¨§ £

¡ ¡ ¢ ¡

the formula (4.24) expands as follows:

§ £ ¥§ £ ¥§ £  § £ § £ § £

¡

¥  ¢  ¢ ¢ § © ¥  ¢ ¢ § © ¢ ¥  ¢ ¢ § © ¢ ¥ ¢ ©

   

 § £ § £ § £ ¡

§

 

¢ ¥ ¢ ¢ § © ¢ ¥ ¢ ¢ § ©

 

§ § £ ¡ £ § £ ¡



¥  ¢ ¢ § © ¥  ¢ ¢ § ©

 

§ § £ ¡ £ § £ ¡

©

¢¥¤

§

 

¥ ¢ ¢ § © ¥ ¢ ¢ § ©

 

¢ ¥§ ¤£ ¥§ ¢ ¥§ ¤£ ¤£



§  ¤ §  ¤

 

¢ ¥§ ¤£ ¥§ ¢ ¥§ ¤£ ¤£

©

¢¥¤

§  ¤ §  ¤

  (4.26) up to sign. 99 The reader who is more comfortable with the bracket notation could prove this by other means and forget all about exterior algebra at this point.

4.4.7 Formula for sextic

By expanding out equation (4.25) using the formula in equation (4.26) one is left with an

explicit formula for the sextic constraint that the points of intesection of the plane § with

the six lines

¨§ £

 ¢





¨§ £





¢





¨§ £

¢





¨§ £

§



¢





¨§ £

¢

¢

¨§ £

§

§

© ¢

lie on a conic. The explicit formula is:

 ¢ £ ¥§ ¤£ ¢ £ ¥§ ¤£ § £ ¥§ £ ¥§ £ § £ ¥§ £ ¥§ £

£ § §

§ §

¡ ¡ ¡ ¡

 

§ ¤ § ¤ ¥ ¢ ¢ ¢ § © ¥ ¢ ¢ ¢ § ©

     

   

¢ £ ¥§ ¤£ ¢ £ ¥§ ¤£ § £ ¥§ § £ ¥§ £ £ ¥§ £ ¥§ £

£ § §

§ §

¡ ¡ ¡ ¡

 

§ ¤ § ¤ ¥ ¢ § © ¥ ¢ ¢ ¢ § © ¢ ¢

     

    (4.27)

4.4.8 Formula for quintic

§  § §  §

§



We put and in the sextic formula (4.27) to derive an explicit formula for

the quintic constraint that the three pairs of lines

¦ ¦

¨§ £

¢

¡

¡ ¡



¢ § for meet the plane in a conic.

A complication

The result should be divisible by the linear factor

§ ¥§ ¤£

¡



§

¤

(4.28) but unfortunately it isn’t straightforward to see what the quotient is because there are terms

in the expanded expression which are not divisible by (4.28). The reason for this is that

¦

§ ¤£ ¡ there are relations in the bracket algebra – the “variables” ¡ inside the brackets are

100 independent quantities but the brackets are not algebraically independent. A simple ex-

ample of this can be seen by taking the co-conicity constraint (4.23) derived earlier and

permuting the variables; the new formula expresses the same constraint but it doesn’t look

like the original formula. Another example is provided by Cramer’s rule:

¢ § © £¢ § © ¢ §¡ © ¢ © §

¤ ¤ ¤ ¤

¢ §  £¢

If we substitute this into a bracket ¤ we arrive at the relation

¢ § ¢ ©  £¢ £¢ § © ¢  £¢ ¢ §¡ © ¢  £¢ ¢ © ¢ §  £¢

¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤

§¡ ©  £¢ ¢ between the -brackets of . Without knowing the relation it is not obvious that the right hand side factors into a product of two brackets.

Digression on bracket algebra

¤

¥ ¥

£

¡

¥ ¡¤£ ©

Take an matrix of indeterminates and consider the vectors ¡ formed from its

§¦§¦§¦ ¢

 

rows. For every sequence of indices we can form the bracket

¢

¢ ¦§¦§¦

£ £ £

¢ ¢

¢ ¡  ¡ 

¡

¢ ¦§¦§¦

£ £ £

£ ¢

£ ¡

¡  ¡ 

¡

¢ §¦§¦§¦ 

£ ¢

 



¡ ¡ ¡

¢

¡ ¡ ¡¡ ¤ £¢¥¤

£ ¢

 . . . .

. . . .

¢ ¦§¦§¦

£ £ £

¡  ¡ 

¡

¤ ¤ ¤ which is the determinant of the matrix formed from the entries of the chosen vec-

tors. If two indices are equal then the bracket vanishes and if two indices are transposed,

¦§¦§¦ ¢

¡ ¡ ¡

 

the bracket changes sign. We can therefore assume that whenever conve-

nient. The set of polynomial expressions formed from brackets forms a sub-algebra of the

¢

£

¡

¤ ¡¤£¨¤

ring and relations between brackets in the ¡ can be verified by expanding out de-

£

¡

£ ¡ terminants and evaluating polynomials over the entries ¡ of the . This is cumbersome,

though.

Instead we form an abstract algebra which is just the set of polynomial expressions in

“formal brackets”, which are symbols of the form

¢ §¦§¦§¦ ¢

 

¤

101

and have no a priori meaning. However, if we define a homomorphism from this algebra

¢

£

¡¤£¨¤

into ¤ by

¥ ¢ §¦§¦§¦ ¢ ¢ §¦§¦§¦

¡ ¡ ¡

¢

  ¤  ¡ ¡ ¡¡ ¤  then it is clear that the bracket relations we are interested in knowing about are precisely the elements of the kernel of . A Grobner¨ basis for this ideal of relations is given in [147] as well as an algorithm (the straightening algorithm) for putting any given bracket expression into normal form (useful because two bracket expressions are equal if and only if they have the same normal form). We briefly describe the notion of normal form: Every term in a

bracket expression is a scalar multiple of a product of ( , say)brackets; the tableau of such

¤

a term is the table whose rows are made from the factors in the product. A tableau is in normal form if each row is an increasing sequence and each column is a non-decreasing

sequence. For example, the tableau

¡ ¢£

§

£

¢

§

£ ¡

¢

§

¡ ¢ is not in normal form because (2nd column) and (3rd column) whereas each

term in its straightened version

¡ ¡ £ £ £ ¡

¢ ¢ ¢

£ £ £

¢ ¢

§ § §

£ £ £ £

¢ ¢ ¢

§ § §

¡ ¡ ¡

¢ is in normal form. Note that the normal form of a bracket expression is not necessarily

“nicer” than the original expression.

The final observation of this section is that if a tableau is in normal form and one ap- pends another row by taking the last ¤ variables in increasing order then the new tableau is

also in normal form. It follows that to determine if a given bracket polynomial is divisible

by a given bracket it suffices to choose a variable ordering for which the elements of are last and then compute the normal form of with respect to that ordering.

The formula

Using the final observation from the previous section, it is easy to remove the nuisance

§ ¥§ ¤£



factor (4.28) from the constraint, by choosing a variable order in which come 102

last. Without further ado, the formula is (dropping § for simplicity):

§ ¤£ ¤£ § ¤£ ¤£ ¢ £ ¤£ ¤£ § ¥§ ¤£ § ¥§ ¤£

£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¤£ § ¤£ ¤£ £ ¥§ ¤£ § ¥§ ¤£ £ ¥§ ¤£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¤£ § ¤£ ¤£ £ ¥§ ¤£ § ¥§ ¤£ £ ¥§ ¤£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¤£ § ¤£ ¤£ £ ¥§ ¤£ § ¤£ ¤£ § ¥§ ¤£

£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¥§ § ¤£ ¤£ £ ¥§ ¤£ £ ¥§ ¤£ £ ¥§ ¤£

£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¥§ § ¤£ ¤£ £ ¥§ ¤£ £ ¥§ ¤£ £ ¥§ ¤£

£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¤£ ¥§ § ¤£ ¥§ ¢ £ ¤£ ¤£ £ ¥§ ¤£ £ ¥§ ¤£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¥§ ¤£ £ ¥§ ¤£ £ ¥§ ¤£ £ ¥§ ¤£ £ ¥§ ¤£

  

¤ ¤ ¤ ¤ ¤

     

   

§ ¥§ ¤£ £ ¥§ ¥§ ¢ £ ¤£ ¤£ £ ¥§ ¤£ £ ¥§ ¤£

£

  

¤

¤ ¤ ¤ ¤

     

   

The value of this expression is (clearly) not its intrinsic beauty or any geometric insight it yields, but merely its usefulness for practical implementations. It might be expected that

choosing a different variable ordering might give a less complicated expression but no such

ordering was immediately forthcoming.

4.4.9 Forty solutions

§ § §

§ For cameras we thus obtain quintic equations and sextic equations in . How

many solutions do these equations admit? How can they be computed? This section dis- cusses two methods, one based on resultants and the other drawing on techniques from commutative algebra to reduce the problem to a generalized eigensystem.

Resultant methods

¢ 

Consider adjoining a linear equation £ to the original system and asking: when does

¢  the new system have solutions? Obviously, only when the plane described by £ con- tains a solution from the original system. However, the condition is also expressed by the

resultant of the augmented system, which is a polynomial constraint on the coefficients of

¢  ¢ £  ¢ . By sweeping a pencil of planes one obtains univariate equations in and hence

the solutions to the original system can be localized to a finite set of planes. By taking dif-

ferent choices of pencils (in general position with respect to each other) the solutions can

be localized to a finite set of points.

Since resultants are difficult to evaluate efficiently, let alone symbolically, this scheme

is modified to use functions that are necessary (if not sufficient) conditions for the aug-

103

mented system to have a solution. The idea is simply to multiply the original equations

¤

 ¢ 

£ £

¡ and the linear equation by suitable monomials to get a set of -ic homogene-

nous equations which is at least as large as the set of ¤ -ic monomials. The vanishing of the

maximal minors of the corresponding linear system are then necessary conditions for so-

lutions to exist. There is no compelling reason why the multipliers of ¢ must be monomials

¤

£ © and any choice of linearly independent ¥ -ics would do.

How large must ¤ be? This depends on the nature of the system to be solved but can

¤ 

usually be estimated by counting arguments. In our case ¢ will suffice because the

§ § §¦ §

£

four quintics and twenty-four sextics give rise to sextics whereas the linear

 £ § §

¡ ¢ ¡ ¢ £ ¡ ¢

equation gives sextics and there are only ¡ sextic monomials. 4 In fact

§ §  £ § § ¢ £ ¢ §¦§¦§¦ £ ¢

£ £ § § ¡

 ¡

it suffices to take products  (with each of degree ) of

¢ § §

¢ , so the constraint on has degree at most . To recap, we obtain a polynomial function

§ § ¢

©

¥ of degree such that

¢ 

£

¥ ©

 ¢  £

is a necessary condition for the system ¡ to have a solution. Moreover, numer-

¢ ¢

¡

¡ © ¡

ically evaluating ¥ for a simple choice of cameras , multipliers and plane shows ¡

that it does not vanish identically. Thus, for almost all choices of ¡ , almost all choices of

¢  ¢ £ § §

multipliers ¡ and almost all pencils of planes, there are at most solutions. This

shows that the number of solutions to the original system is, firstly, finite and, secondly, is §

no greater than § . However, this approach is not practical due to the numerical difficulties § involved in computing roots of polynomials of degree § .

Reduction to generalized eigensystem

An alternative method for counting solutions is to reduce the coefficients of the system to be solved modulo some prime and use symbolic methods to compute the degree of

the solution set. Experiments carried out (with hundreds of different primes) using the §

Singular [47] commutative algebra package strongly suggest that there are £ solutions.

4Evaluating a simple numerical example, it is found that the © sextics are linearly independent. Since linear

dependence can be characterized by the vanishing of polynomials in the entries of the input data, the set of

input data for which the © sextics are linearly dependent has positive codimension, hence measure zero.

104 To understand this section, some familiarity with commutative algebra really is re-

quired; the books [22, 26, 115] are suitable introductions to the subject.

© ¤

Let be a field (e.g. the set of rationals , reals or complex numbers ) and let

 ¢ ¡ ¡ ¡ ¡

£ ¥   ¤

be the ring of polynomials in four variables over (these variables will be the components

¦ ¥ of the sought plane at infinity, so £ is the homogenenous coordinate ring of ). Let be

the ideal of £ generated by the quintic and sextic equations described above. Thus, the

§ §

¥ ¡ ¡ ¡ ¡ ¡ £

elements of are linear combinations ¡ of the polynomials with .

¤ ¤

¢

£

¡ £

For any integer , let £ be the collection of elements of with degree exactly .

¥ £  £ 

Thus, £ is just the field , is the set of linear functions and is the set of quadratic

¢  ¢

¥ ¢ £ ¥

functions. Similarly, let ¥ be the subset of consisting of polynomials of degree

¤ ¢ exactly . Note that each £ is a finite dimensional vector space spanned by the monomials

(given in the multi-index notation)

¤£¢

¤ ¤£¤

¡  ¡ ¦§¦§¦ ¡

¢

¥

¤ ¤

¤

 ¢ ¢

£

¡

¡ ¡ ¥ £

where are integers and ¡ . Since is a subspace of , it is also finite- ¢

dimensional. The dimension of £ is

§

¤ ¤ ¤

¤ ¢

¢



¥ © ¥ © ¥ ©

©

¢

¦ ¦

¢

¤ ¢

The following table shows the dimension of £ as a function of : 

 d 0 1 2 3 4 5 6 7 8

¢ £

1 4 10 20 35 56 84 120 165

¢

¡ 

What are the dimensions of ¥ ? Since each generator has degree or more it is clear that

¤

¢   § ¢ §

£ £

¢

¥ ¥ for . There are four quintic constraints so we must have

(allowing for the possibility that the constraints are linearly dependent). Each quintic con-

§ ¡ ¡ ¡ ¡

 

 

straint gives rise to sextic constraints via multiplication by ¥ so the dimen-

§ § §

© ¥ ©

sion of ¥ is at most (from the original sextic constraint) plus times , i.e.

§ § §  §

£

¨

. Similarly, the elements of ¥ are obtained as linear combinations of 

quadratic monomials times the four quintic constraints with linear monomials times the

§ § § 

£ ¢

¢

¥ ¨ twenty-four sextic constraints so that . However, this last 105

 



¢ £

¢

£ ¨

bound is vacuous since . What this shows is that if one multiplies up the

¢ ¢ quintic and sextic constraints to get polynomials of degree ¥ , the resulting polynomials

are highly linearly dependent. ¢

One can use numerical experiments to discover the actual dimensions of ¥ . They are

given in the following extended table: 

 d 0 1 2 3 4 5 6 7 8

 

¢ £

1 4 10 20 35 56 84 120 165

 

¢ ¥

0 0 0 0 0 4 40 80 125

   

¢

¥

0 0 0 0 0 16 44 80 125

  

¢ ¢

£

£ ¥

1 4 10 20 35 52 44 40 40

¢

£

¢

£ ¥

1 4 10 20 35 40 40 40 40

¥ ¥

where ¥ is the so-called saturation of , which is an ideal constructed from as follows:

¤

¡ ¡

¤ ¤

£

¡

¥ a polynomial is in ¥ if and only if there exists a such that for all of

degree ¤ .

¤

¤ ¤

¡ ¤

£

¡

¥

¥ there exists such that for all of degree

§ § What is the purpose of this? Well, the solutions of the original equations are clearly

the same as the solutions of all the equations in ¥ , which are also the same as the solutions

§ §

¥

of ¥ . The processes of going from the equations to generates new equations in

¥ a fairly trivial way whereas the process of going from ¥ to generates new equations in a

less trivial way where the degree of the new equations can actually be less than the degree

of the old equations:

¢¡ ¡

¥ ¥ £

 §

¢ £

For example, the table shows that new quintic constraints have appeared in

§  § § § £ degree five and that £ new sextic constraints have approach in degree six, so these inclusions are (in general) proper. The saturation can be computed using simple

linear algebra (see section 4.5.2); in our case it is not necessary to consider any degrees

¤  above ¥ .

Poincare´ series. The dimension results were obtained by choosing random camera ma- trices with integer coefficients and reducing the resulting polynomials modulo primes ¤ .

106

Using a computer algebra package [47], the Poincare´ (power) series

   

¢

¡  ¢ ¤ ¢

¦ £ ¦

¥ © ¥ £ ©

¢£¢

¥

¤  ¤ 

¥

can be computed for ¥ and . Taking a moment to read this definition, the

¢

¤ ¢ ¢ ¦

reader can see that the coefficient of is the codimension of in £ , i.e. the number of

¤ ¢ ¢

linearly independent constraints that membership of places on an element of £ . In the



£

context of equation solving, the constraints on an ¥ are that the locus contains

the solutions to the system of equations whence ¥ was formed, so naively one would expect

¥

¦ ©

the coefficients of ¥ to equal the number of solutions (because the property that the



locus £ contains a given point places one linear constraint on ). In general, this is ¢

false (if only because the number of solutions might exceed the dimension of £ ) but it is a

¤

¦ ©

theorem [22] that for sufficiently large values of , the coefficients of ¥ are eventually § constant and equal to the degree of the solution set, which in this case is £ . For future

reference, the Poincare´ series in question are:

§

§ £

£ ¦ £ ¦ ¦ ¢¡ ¦ £ ¦ ¢ ¦ ¦

¢

¨ © 

¢

¡¦ 

¥

¦

¥ ©

£ ¦

§

§ § § § § ¦§¦§¦  §

¡ ¦ ¦ £ ¦ £ ¦ £ ¦¨§ ¦ £ ¦ £ ¦ ¡ ¦

¢

© ¨ 

¢

§

£ ¦ ¢ ¦ ¦ ¡ ¦ ¢¡ ¦

¢



¢

©¦ 

¦



¥ ©

£ ¦

§

 § § § § § § § ¦§¦§¦

¦ £ ¦ £ ¦ ¡ ¦ £ ¦ £ ¦ £ ¦ £ ¦ £ ¦

¢

 © ¨

¢ §

The point of all this is that it leads to a numerical algorithm for computing the £ so-

¥ © §¦§¦§¦ © ¢

lutions. In affine -space, with coordinates  , the solutions to a set of equations

© §¦§¦§¦ ©   §¦§¦§¦

£

¡ ¥  ¢ © ¥ ¥   ©

can be computed by forming the ideal of the polyno-

 ¢ © §¦§¦§¦ ©

 ¢ ¤

mial ring and considering the (finite-dimensional) vector space quotiont

¢ ¥

. If is any polynomial then the operation of “multiplication by ” induces a linear

¢ ¥  ¥

transformation ¥ of whose eigenvalues are the scalars obtained by evalu-

 © © §¦§¦§¦ ©

  ¢ ating at the solutions to the original system of equations [22]. By taking 107

in turn the solutions can be localized.

¢ ¥ However, to carry this out numerically it is necessary to choose a basis for and this is usually done with Grobner¨ bases which can have unpleasant numerical behaviour due to the pivoting intrinsic to the method. In projective space, the use of Grobner¨ bases

can be avoided by making use of the grading (homogeneity), as follows: both vector space

¤  ¤  § ¢

£

£ ¢ ¥ © £ © ¢ ¥ £ 

¢ ¢ ©

quotients ¢ and have dimension and any linear induces a

¢ ¥ ¤ ¤

¥ ©  ©

linear transformation ¢ . We get the following result:

¢ ¢

 £ 

Lemma. Let  . The solutions to the generalized eigenvalue problem

 ¢  ¢ 

£ £

£¢¥¤ ¥  ¥  ©  ¥  © ©

 ¥   ¢ ¥ ¢

  © ¥  ¥ £ ©  ¥ £ © © £ ¥

are the ratios ¥ where ranges over the solutions to .

 ¡

¥ ¡ © To make use of this in practice, one forms the four multiplication operators ¡ and solves the associated generalized eigenvalue problem. This means we look for non-

zero vectors ¤ such that

   

§

¤ ¤ ¤ ¤

  and

are multiples of the same vector ¡ :

  

£

¢

¤

¡

§ ¡

¡ for (4.29) ¡ where § are scalars, not all zero. These scalars form the homogenous coordinates of the

sought solution (plane at infinity) § .

This has been implemented and results are shown in section 4.6.

4.5 Algorithms

4.5.1 Recovering calibration from the plane at infinity

¡ £

Given an estimate of the plane at infinity, the calibration parameters of

¡ ¢£

£ ¡



£ £

£ £

108 can be recovered as follows: Intersect the plane with the £ rays and project these points

into the relevant image. Fit a conic through the £ image points; it should have matrix

£ ¥

¤ ¦

£ £ ¡

£ £ £



¡ 

 (4.30)

£ ¡ £ £ ¡ £

  

¡ £  from which the values of can be read off directly. The value of can then be recov-

ered easily too, but there is no guarantee in general that it will be positive (if the plane in

question happens to be the geometrically correct one it will of course be positive).

4.5.2 Numerical Computation of Ideal Saturation

¤ ¤

¢  

¡

¢

¥ ¥ ¥ ¥

After degree ¥ , the ideal is already saturated, i.e. we have for all . It only

¥ ¥ ©

remains to compute ¢ and .

£ §

¡ ¢ £

© ¨

¢ £ £ For calculations, identify £ , and with the set of vectors of size , and

respectively:

§

¡ ¡ ¡

  

©   ¥

¢

© ¨

£ ¤ ¢ ¤ £ ¤

£ and

¡

¡ £  £ £ ©  £ ¨ ¢

The operation “multiplication by ” gives linear maps ¢ and and the corre-

£ § £ §

¡ ¢ £

© ¨ ©

¢

  ¡

sponding matrices of size and will be denoted by ¡ and , respectively.

¢ ¢

¢

¥ £ Both ¥ and and are vector subspaces of and there are two complementary ways

to represent such subspaces: either explicitly by giving a set of generators or implicitly by

giving a set of linear equations that describe the subspace (i.e. a set of generators of its

¡



©

¢

¢ ¤ ¥ ¢

annihilator). For example, in the case of £ a basis of generators of is given by

§ §

¡ ¢

§ specifying a matrix ¢ of rank ; each column is a generator so that a quintic

belongs to the ideal if and only if its vector ¡ is in the column space:

§



¤ ¤

¡

§ ¥ ¤ ¢

¢ there exists such that

¡ ¡ ¢

¡ §

Implicitly, the subspace would be given by a matrix ¢ and a quintic belongs to ¡

the ideal if and only if it is annihilated by ¢ :



¡

¡

¢ § ¥ ¢

109 To convert from an explicit representation to an implicit one, take a basis for the left nullspace

of the representing matrix (e.g. by taking part of its SVD). To convert from implicit to the

explict, take a basis for the right nullspace.

Obviously, the ideal ¥ was given initially in terms of generators, i.e. explicitly; the idea ¢

used here is to express membership of ¥ implicitly, carry out the saturation operation to

§

¡ ¢

¢

get an implicit form of ¥ and then convert back to explicit form. In detail, start with

£ § §

©

and matrices ¢ and representing the known generators of degrees five and six. To

¨ ¨

get a generating set for ¥ we form a new matrix by horizontally concatenating matrices

¨ © ©

¢

 

¢

¡ £

§

¥

for and

¨ ©

 ©

¡

§ £

£ ¢ £

¢

for . This matrix has rows and columns but only has rank (at most) .

§

£ £

¡

Converting this to implicit form gives a matrix ¨ which describes membership of

¨

¥ ¥ ¥ ©

. Now it is easy to compute implicit forms of and ¢

¡ 

£

¨ ©

¡

¡

¡ § ¥ ¨ § ¥ ¨  © for all ¡ for all

and

¡ ¡ 

£

¨ © ©

¢

¡

¡

§ ¥ ¡ £ § ¥ ¨ ¥ ¨   ¥

¢ £

for all ¡ for all

£ §

¢¤£

In other words, the saturated ideal is given implicitly by the matrix

¢

¨ ©

¡

¨





¨ ©

¡

¢ £

¡ £



¡ ¨ 

¢ £



¢ £

¨ ©

¡ ©

¨ 

¨ ©

¡

§

¨ 

§

£ £ ¡ ¢

and by the matrix

¢

¨ © ©

¢

¡

¨  

 

¨ © ©

¢

¡

£ ¢

¨  

 

£ ¢

¨ © ©

¢

¡

£ ¢

¨

 

£ ¢



£ ¢

¨ © ©

¢

¡

§

£ ¢

¨  



£ ¢

¨ © ©

¢

¡

£ ¢



¨   ¡

 

£ ¢

¨ © ©

¢

¡

¢

£ ¢

¨  



£ ¢

¨ © ©

¢

¡

§

£ ¢

¨  

£ ¢



¨ © ©

£ ¢ ¢

¡

£ ¡

£ ¢

¨  

£ ¢

¨ © ©

¢

¡

§

¨

 

¨ © ©

¢

¡

§ §

¨  

110

§

£

¡

In practice, the rank of © is only and by orthogonalizing its row space one arrives at a

§ £ §

£

¥ © ¥

implicit form for . This can then be used to compute an implicit form for ¢ by

¢¤£ ¡ ¢

using the matrix

¢

©

¢

¡



©



©

¢

¡

£ £ ¢¡



 ¡

£ ¢

©



£ ¢

©

¢

¡

¢



©

©

¢

¡

§

 © instead.

4.5.3 Computing the Multiplication Operators

§ §  ¡

£ £

¥ ¡ ©

One needs to compute matrices corresponding to the operators ¡ ,



£

¢

used to pose the generalized eigenproblem (4.29). A convenient way to rep-

¢ §

resent linear maps between quotient vector spaces is to identify the quotient with a

§

complement of in and then work in .

§

§¦§¦§¦ §¦§¦§¦

§

©

¢

 ©  ¤ ¤

Specifically, choose orthonormal bases ¢ and of and such

§¦§¦§¦ §¦§¦§¦

§ § §

 © ¥  ¥

¢ ©

that generate ¢ and generate . This can be achieved easily using

§

¡ ¢ £

the Singular Value Decomposition. The matrix

 §¨§¨§

§

¡

§

¤

  ¥

¢

£ § § £

and the matrix

 §¨§¨§

§

¡

§

©   ¥

¤

§

¥

¤ ¥ ¥ £

¢ ©

give linear parameterizations from to the (orthogonal) complements of ¢ and in ©

and £ respectively.

§ §

£ £

The desired matrices are then given by

 

©

¢

¡¢¡ ¡

¡ 

¢

© ¡

where denotes the Hermitian transpose.

4.5.4 Solving the Generalized Eigensystem

One method for solving the generalized eigensystem

¤   ¢ 

£ £

£¢¥¤ ¥ ©

111

¥ ¥ ¡ £¢ for given matrices , is to simultaneously reduce both matrices to upper triangular form by row and column operations; the generalized eigenvalue pairs can then be read off

as corresponding diagonal elements.

¢¡ ¢£¡ ¤¡

 

In practice, one computes orthogonal such that is upper triangular and is upper block triangular with blocks of size or . This is known as a generalized Schur decomposition and is implemented by the LAPACK routine DGGES [102].

In our case, there are four matrices, not two. However, due to special structure (which

derives from the fact that multiplication of polynomials is commutative) of the eigensys-

  ¢¡

tem it suffices to compute the decomposition for, say, ¥ and . The same matrices

  

can then be used for ¡ and ( ), except possibly in degenerate cases.

4.6 Results

Experiments were carried out on synthetic data. To generate the synthetic data, a set of

four cameras with calibration matrix

¡ ¢£

¡ £ ¡ ¢



£ ¡ ¡ ¢

£ £

and ¡¤£ points were chosen. The points were chosen uniformly at random within the cube

 ¢ 

£ ¡ ¢

¤ and the cameras centres were all between and units of distance from the

origin. The principal ray of each camera passed through the cube .

£

Now, for varying values of  , isotropic Gaussian noise was added to the true image

reprojections and a projection reconstruction from the corrupted points performed. The

minimal solution estimator was run on this projective reconstruction and the values of

¡ £ £

closest to the ground truth chosen. The experiment was performed times for each

value of  and the mean and standard deviation computed. Figure 4.10 shows the results. £ The estimator mean is roughly correct but the standard deviation is approximately ¢ pixels per unit of synthetic noise. This indicates that the model is unstable; in fact no real

data on which the algorithm gave a usable result was found.

112 550 35

30 500

25 450

20

400 15

mean (pixels) 350 stddev (pixels) 10

300 5

250 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 σ (pixels) σ (pixels)

Figure 4.10: Graphs of estimator mean and standard deviation against varying levels of

¡ £

synthetic noise. The parameters are plotted in red, green and blue respectively.

4.7 Assessment

The analyses in this chapter have cast some new light on the geometry and algebra of two common calibration models. The emphasis was on elucidating the abstract structure of the problems and on finding direct methods of solution, so the experimental evaluation is a little weak. The practical conclusions drawn are that in problems of real interest, these

minimal solutions require very good data in order to give usable results.

It is possible that the synthetic data experiment was unrealistically hard since (this

pointed out by W. Triggs) there is a singularity (ambiguity) when the principal rays of the

cameras intersect and in the data used, the principal rays all passed through the same small

volume in space. More experimentation, also on real data, would be desirable.

113 Chapter 5 Matching images

5.1 Objective and motivation

This chapter addresses the problem of finding correspondences between two or more given

images. The main applications that drive this problem are reconstruction (structure recov-

ery) and object recognition from un-organised image sets. For example, a stroll around a

building, or other scene, with a stills camera might produce a few tens of images taken under no particular constraints and the task of simultaneously reconstructing the scene and inferring the camera positions relative to it requires the ability to find corresponding images of objects in the scene. Another important example is the possibility of search- ing large data sets (image data bases) using an example image of the sought object as the query. Thirdly, in video-structuring, where one wishes to annotate each shot from a film according to location, characters present etc, it is necessary to be able to detect the fact that two shots were taken in the same location.

The emphasis of this chapter will be mostly on reconstruction.

5.1.1 What can one hope for?

Consider the correspondence problem from first principles: one wants to find (localize) the same thing in two images. One immediately encounters two problems:

The thing can “look” very different in the two images. It is hard to tell when one has

been successful.

Even if one can find things that look the same, it doesn’t mean they are the same

thing. It is not possible to be certain. The second item is trite but it has an important philosophical consequence: any algo- rithm or program which prints the message “I found the same thing” must be wrong, or based on an unfounded heuristic because that outcome is not computable. The point is that one should realize from the outset that what a correspondence algorithm computes is not answers (as in “the right answer”) but hypotheses. No algorithm can compute all hy- potheses; a good algorithm computes plausible hypotheses and doesn’t generate unlikely hypotheses.

For example, suppose given ¢ images of a tree. An algorithm which gives the (entirely valid) hypothesis that the imaged scene is a cuboid on whose sides are painted the individ- ual input images, is useless. An algorithm which generates hypotheses about tree-like 3D scenes, even when the given images are actually of such a cuboid, is useful.

Most correspondence algorithms can therefore be thought of as consisting of two (pos- sibly coupled) processes:

Mechanisms for generating hypotheses.

Ways to explore, verify or reject given hypotheses.

For example, a corner tracker for video streams might use an interest point detector to- gether with intensity correlation within some disparity threshold to hypothesize matches between consecutive frames. To verify matches, it might impose a global consistency con- straint, e.g. by robustly fitting epipolar geometry to the data. Of course, the boundary between generation and verification is not completely clear: one might think of intensity correlation (and even the checking of a disparity limit) as verification of the hypotheses generated by the driving, outer loop over corner features.

When the between-image camera motion is expected to be large, the use of a dispar- ity threshold and simple intensity correlation will not generate plausible hypotheses and much more work must go into both the generation of hypotheses and the methods for ver- ification. This is discussed in section 5.1.3

If the scenes of interest are not (mostly) static, epipolar geometry constraints cannot be used, so the need for verification based on appearance or less-than-global constraint

115 is even greater. Note that this makes the matching/correspondence problem quite similar

to the problem of object recognition, where one wishes to recognise some object from an

image where the background is insignificant clutter (i.e. it cannot be used for correspon-

dence). In this chapter, though, it will be assumed that all scenes are static.

5.1.2 Direct methods

A direct method for finding correspondence between two images is a method which does not rely on an intermediate image description (feature extraction) but works directly with image intensities.

The major example of this is optic flow which models the appearance change between two images as the result of a vector flow field along which pixels are displaced. When the model applies, there are very effective coarse-to-fine alignment techniques for computing the flow [7, 8, 49] and this yields a dense set of correspondences. Moreover, since the need for an initial stage of feature extraction (which can be unreliable and only gives a sparse image representation) is avoided, direct methods can be very robust.

When the model does not apply, i.e. when there is no reasonable sense in which cor- respondence can be modelled as a point-to-point image transformation (consider cases with plenty of parallax), direct methods will perform poorly. In some of these cases, a direct method may succeed at a coarse scale but this is dependent upon having salient structure at large scales.

In many practical situations optic flow can be computed meaningfully at least for cer- tain parts of images; usually those that correspond to imaged surfaces that are reasonably smooth. Such local registration can be very useful and the technique is used in this chapter to verify potential interest point matches.

Perhaps the main drawback of direct methods is the complexity they entail for process- ing un-structured (i.e. non-video) data sets because they are incompatible with indexing methods. This makes direct approaches unsuitable for, say, database searches.

116 Figure 5.1: Illustration of the change in appearance which makes both hypothesis gen- eration and verification difficult for wide baseline image matching. The illumination has changed. The viewpoint has changed. The resolution is different. The inset parts of the windows visible in one image are not visible in the other.

5.1.3 Wide versus short base-line.

When the camera (or scene motion) is small the matching problem is largely solved by ex-

isting techniques such as optical flow or interest point tracking. Why do these techniques

fail for wide baseline images?

When the camera motion is large, or significant time elapses between the time images

are taken, the following new problems appear:

Significant occlusion is likely. The structures that can be seen in one image may not

be fully visible in other images.

The same scene element can “look” very different in two images due to lighting changes,

viewpoint changes and changes to the imaging device.

The correspondents of a feature in one image can, a priori, be anywhere in the other

image. This leads to a larger computational burden, as more hypotheses must be

investigated, and to a greater uncertainty when making decisions due to the greater

risk that several features in the other image have similar appearance.

117 The first two points are illustrated in figure 5.1. A partial remedy for these problems is presented next, in section 5.2.

5.2 Framework of invariant descriptors

The problems outlined in section 5.1.3 can be partly addressed by choosing feature de- scriptors which are invariant to the effects of illumination and viewpoint changes and matching the descriptions instead of the features directly. The combinatorial problem of choosing correspondences remains but it is ameliorated by being amenable to the tech- niques of hashing and spatial indexing (such as -d trees etc): to find similar-looking fea- tures it suffices to find neighbours in the space of invariants.

This is the essence of the framework of using invariant descriptors for matching; it is

a method to generate plausible match hypotheses with a small amount of effort. Unless

the invariants are exceptionally discriminating, a stage of evaluation/verification of these

hypotheses is necessary. In outline the matching paradigm is:

1. Detect features and compute invariant descriptors.

2. Hypothesize feature matches based on proximity in invariant space.

3. Verify or reject hypothesized feature matches based on appearance.

4. Enforce (robustly estimate) semi-local, then global constraints.

The purpose of invariant descriptors is to enhance (speed up) performance; the alter-

native, which is to evaluate all combinations of features in the images to be matched, will

always give better matches and so the only reason for not doing that is its cost. What is

 

the worst case complexity of matching features in one image with features in an-

  other? Comparing each pair of features for similarity has complexity but this does

not include the cost of assigning correspondences. The number of possible correspon-

  dence assignments is enormous and certainly dwarfs which is expensive enough in the first place.

118 Figure 5.2: Seen through a local rotation-invariant operator, these two textures are indis- tinguishable.

For a two-view matching system, speed may be merely a usability issue but for an ¤ -

view matching system, the algorithm complexity becomes a theoretical issue as well. An advantage of hypothesis generation based on invariant descriptors is that most of the work

is done on single images, so the complexity is linear, not quadratic, in ¤ .

The formation of invariant descriptors discards information which can be useful in later verification stages. For example, a change in scene illumination will usually affect more than one image feature and the effect is a coherent change in all the affected features

(a simple assumption is that all the image intensities are scaled by the same factor). If one

computed a number of features and described them in ways that are invariant to affine

intensity changes then each putative feature match implies an affine intensity change and

it is likely that this intensity change is common across all matches. By allowing a sepa-

rate intensity change for each match the discriminatory power of the matching system is

greatly reduced. An example of this aspect of description using invariants is illustrated

in figure 5.2. When computing an invariant it is worth noting the information which was

discarded (e.g. a normalization factor) or referring back to the original information (e.g.

image intensities) when verifying a hypotheses. In the example, above, imposing a co- herent intensity correction across all matched features in a pair of image yields a global photometric matching constraint.

Despite their limitations, invariant descriptors are useful tools and the rest of this sec-

tion will describe some of the techniques that are available.

Affine invariance. The degree of invariance required will be taken to be to: affine pho-

119 tometric changes and affine geometric transformations. Theoretically the notion of affine

photometric change is based on the assumption that camera responses are approximately linear over their range of normal operation. While not completely true it is a useful model.

The approximation of using affine transformations for image deformations is just an ex- pression of the assumption that the true deformations are differentiable (at some scale).

There are three main classes of invariant feature descriptors: interest point based, re- gion based and geometric. They will be discussed in turn and in varying detail.

5.2.1 Interest point neighbourhood descriptors

Interest points are image locations which are locally distinctive in some sense. For exam-

ple, Gilles [44] showed that the (information) entropy of local greyvalue histograms can be used to detect perceptually significant regions in images. The method consists of com- puting the entropy of a histogram of greyvalues in a window centred on each pixel, and selecting as interest points those locations where the entropy attains a local extremum.

Many other interest point detectors work in this way of computing a scalar measure of

“interest” and looking for local extrema. Schmid et al [135] compared a selection of interest point operators and found that the Harris corner detector [50] gave the most repeatable results. It is used in the work presented in this chapter.

Scale space and scale selection

Most interest point detectors use a “cornerness” measure (measure of interest) with a scale parameter which controls the size of the neighbourhood used in the calculation of the cornerness at a point.

Since objects in images often appear at different sizes (due to the perspective) it is nec- essary to adapt the choice of scale to the situation at hand. For example, if there is a scale

change between two images so that an object appears twice as large in image ¥ as in image

¦ , the difference in sampling will make image measurement incomparable for two rea- ¦ sons: firstly, fine scale intensity variation in image ¥ may not be detectable in image and, secondly, image (first) derivatives taken in ¦ will be twice as large as those taken at corre-

120

¥

¢ sponding points in ¥ ( th order derivatives will by times larger).

The notion of , popularized by Witkin [168] and Koenderink [65], addresses both these issues by making the observation that whenever an image is to be queried for

information, one should be explicit about the scale of the phenomenon one is interested

¥ 

in. Computationally it is achieved by embedding a given image ¥ in a family of images ,

£ ¡

where  is a scale parameter, by convolution with Gaussian filter kernels:

©   ©  © 

¥  ¥ ©  ¥ © ¢ ¥ ¥ ©

¢¡¤£ 

©  







¥ © ¦¥



£ 

The smoothing effect of convolution suppresses details at scales smaller than  . The sec-

ond ingredient is to always use scale stabilized derivatives, which means multiplying any

¥

¢

  ordinary th order (partial) image derivative of ¥ by (while “derivative” means rate of

change in the signal per unit change in the parameter, a scale stabilized derivative is the

rate of change in the signal per  units of change in the parameter) to give a dimensionless ¦ quantity. This ensures that one gets the same value at corresponding points of ¥ and [76].

How can one know which scale to use at any given point? One common framework

for this is to apply one’s cornerness (or whatever) measure at multiple scales and declare

the scale at which the response is largest to be the salient one. This very influential idea

is discussed in detail by Lindeberg in [74, 75, 76]. It was used by Baumberg [5], in work

that is the basis for this chapter, though it was found that the maximum often is quite “flat”

and the detected scale can be unreliable. Mikolajczyk and Schmid [97] reported that the

reliability of the technique can be improved by using two cornerness measures and looking

at their joint behaviour across scale.

Invariants from filter responses

Consider the problem of obtaining a description of the neighbourhood of an interest point

¥ © ©

 

 which is invariant to some group of (smooth) image transformations . E.g.

the transformations could be the group of rotations, the group of affine transformations

or the group of projective transformations. Without loss of generality one can assume that

£ £

¥ © each fixes the interest point, which can be taken to be at the origin . 121 A popular method is to apply a bank of (linear) filters to the neighbourhood and study the effect of the transformation group on the vector of filter responses in the hope of find-

ing some invariants.

¥ Now, transforming an image ¥ by a transformation gives a new image defined

by

©  ©





¥ ¥ © ¥ ¥ ¥ © ©

  ©

¥ ¥ ©

and the result of applying a filter to is (by putting ):

¤ ¤

¤ ¤

¤ ¤

¡      © © ©

¤ ¤

¥ ¥ ¥ © ¥ © ¥ ¥ © ¥ ¥ © © ©

¤¢¡ ¤

¡



which is equivalent to transforming the filter,  , via the rule

¤ ¤

¤ ¤

©  ©

¤ ¤

¥ © ¥ ¥ © © ©

¤£¡ ¤ ¡

and applying it to the original image:

¡  ¡

¥ ¥ which can be summarized by saying that applying the filter to the transformed image gives the same result as applying the transformed filter to the original image.

For the group action on images to induce a (linear) group action on filter response vec-

tors, say

¤

¡  ¡

¡ ¥ ¡ £ ¥ © £ ¥

£

§

 1

¢© ¥ ¡¤£¦¥ © ©

for an invertible matrix ¥ , it is in fact necessary that a group action be

§¦§¦§¦ ¢

induced on the space of filters. In other words, for a (finite) filter bank  and any

¡

group action , the transformed filter must be a linear combination of the original

filters, namely:

 §

¡ ¡¤£ £

¥ ©

£

¡  ¡ 

¨ ¨

¦ ¦ ¦

1 ¦

¡ ¡ ¡

¦ ¦

¡ ¡

¤¦¥ ¨§ © § ¤¥ © ¥ ©

Proof: The requirement is that § for all all im-

¡

¤¦§ ¥ ¨© ©

ages © of interest. But the left hand side is so the equality holds (for all images ) if and only if

¡ 

 ¨ ©

¦ ¦



¡ ¡

¦

§ ¥ © ©

§ ¥ for all in the class of images under consideration. If we take that class to be

¡ 



¨

¦ ¦

¡ ¡

¦

§ ¥ ¥ (bounded, locally integrable functions) then we can conclude that § , as required. Even the class of piece-wise constant images (or any other class which is dense in  ) would yield this conclusion, but the class of constant images would not, as it admits any filter bank under any group action.

122 In other words, if the filter responses are to be (somehow) co-variant with the group action

on images, it is necessary that the filter bank on its own be co-variant with the group action too.

This property is known as steerability and was introduced to the computer vision com- munity by Freeman and Adelson [40] for the rotation group.

For most groups of practical interest, it is a fairly strong restriction on the filter bank,

which is a drawback of the approach. The constraint can be relaxed by replacing the

 §

¡ ¡¤£¦¥ © £ ¡ constraint £ by some approximation. This would give rise to quasi-

invariants. Perona [103] discusses the notion of steerable kernel approximations in a Hilbert

space setting and also notes that for non-compact groups one cannot in general expect the

co-variance to be valid across the whole group anyway.

Otherwise, it is necessary to pass to a more general framework which isn’t just based on

filter responses taken at a point. This will be the approach taken later in section 5.2.1.

Rotation invariant descriptors

Schmid et al [132, 133, 134] obtain a local rotation invariant description of the image in- tensity near an interest point by computing invariants from the partial derivatives of the image intensity at that point. The collection of (stabilized) partial derivatives, computed at a particular scale, was introduced by Koenderink [65] who called it the local jet. The rota- tional invariants were worked out by Romeny et al [35, 150, 151] By applying the descriptor at multiple scales Schmid et al succeeded in extracting a rotation invariant description which is applicable across a range of scales.

An alternative method for rotational invariance is described by Baumberg in [5]. Here

the filters used are of the form



¡ £

£¢ ¢ ¥ ©

¥ ¢

where is the th derivative of a Gaussian. The effect of the group action is much simpler

£ (being just shifting of the angular coordinate ¥ modulo ) than for the Gaussian local jet

since the coefficent

¤ ¤

 ©  © 

¡ £

£¢ ¥ ¥ © ¢ ¥ ©

123

transforms as:

¤



¡

£

¢

£¢ ¤

under a rotation by an angle . To generate invariants to this group action is easy, e.g.

¤

¡

any product will do. The use of complex conjugation instead of division (as in [5]) © avoids division by zero.

Affine-invariant descriptors and affine shape adaptation

Baumberg [5] and Ballester et al [4] describe a technique for reducing the problem of affine invariance to one of rotational invariance. Both schemes are based on a device, affine scale space, invented by Lindeberg and Gar˚ ding and described in [78]. It will be described next.

The idea is a generalization of traditional (Gaussian) scale space, the observation be- ing that when there is foreshortening in the imaging process, the simple notion of scale change between two images is not applicable. Locally, at least for a smooth surface being imaged, the distortion can be a general affine transformation. To address this, Lindeberg and Gar˚ ding proposed to extend the set of linear “probes” from isotropic Gaussians to an-

isotropics ones. Specifically, for each positive definite matrix one forms the smoothing §

kernel §

©

©   © 

£









© ©

¢¡ ¥ © ¢¦¥¨§ £

  

£ ¥ £¢¥¤¤£ ¥





£ 

The isotropic density from above corresponds to the special case of letting be times

¢

the identity matrix. Where Gaussian scale space embeds a given image in a image

¡ by adding a scale axis, affine scale space embeds the image in a image whose extra

axes are the entries of £ :

©   ©  © 

¥©¡ ¥ © ¢¡ ¥ © ¢ ¥ ¥ ©

However, just as Gaussian scale space is useful if one knows something about the scale change between two images, affine scale space is useful only if one has some idea of which

matrix to use. The method of scale selection (see earlier) to find some scale at which

the image signal is doing something unusual generalizes, but in affine scale space there is

¢  more data to detect ( coefficients of £ , as opposed to just one for ).

124 Shape adapted smoothing. To illustrate this, and because it is needed later, consider

the problem described by Lindeberg, of aggregating a second moment matrix. In Gaussian

 

 scale space one chooses two scales  (with ) and computes derivatives at the

smaller scale §

§

  

 

§

¥

©

© ©

¥ ¥ ¢ ¥ ¥ ¢ ¥

¡ ¡

¥

¡ ¡

 ¥

while averaging the tensor product ¥ using the larger scale:

§

§ § §

¤   £

¥ ¥ ¥ ¥

¡

§ ©

¥ ¥  © ¢



¥ ¥ ¥ ¥

The result is a symmetric -tensor at each point of the image. It measures local signal

variation.

In affine scale space, the measured image is thought to arise as the affine projection of

some surface and to aggregate a second moment matrix one should both compute deriva-

tives and perform the averaging with respect to the surface’s affine frame. This should

motivate replacing  above with a general positive definite matrix :



¢¡

§

¥ © ¢ ¥

¡

¡



¢¡

©

¥ ¢ ¥

¡ (5.1)

§

§ § §

¡

 

£

¥ ¥ ¥ ¥

¡

§ ©

 ¥ ¥ © ¢

¡



¥ ¥ ¥ ¥

The scale factor  is an algorithm parameter, to be adjusted according to the application

at hand, but the choice of must agree with the particular image data. How can it be

chosen? Lindeberg and Gar˚ ding suggest choosing to be proportional to the inverse of the



¡

© ¥ ¥

corresponding aggregated second moment matrix  :

 



¡



¥ ¥ ©  (up to scale) (5.2)

What does this mean? The second moment matrix measures an-isotropy of the distribu-

 

   

tion of gradients and the affine transformation  transforms to a coordinate frame



 where it is isotropic. If is proportional to then is also isotropic in that coordinate frame (“isotropy” for a co-variance matrix means that the level curves of the corresponding

Gaussian p.d.f. are circular). To summarize, the condition (5.2) expresses the existence of 125 A

Original image Normalized image an affine coordinate frame in which the second moment matrix, computed with isotropic smoothing kernels, is itself isotropic.

For example, a surface in the world for which the gradient distribution is isotropic will,

when distorted by an affine transformation give rise to an image whose gradient distri-

 

   bution is an-isotropic. The affine tranformation  recovers the original coordi-

nate frame in which using isotropic smoothing corresponds to using the shape adapted smoothing kernel ¡ . This is illustrated in figure 5.2.1.

However, to find such a is not so straightforward because the two equations (5.1)

and (5.2) are coupled; one needs to know  to compute and vice versa. This suggests  an iterative technique: starting from, say, an isotropic £ , one can estimate using (5.1)

and then use equation (5.2) to get a new estimate of and so on.

While the convergence of this method has been proved by Lindeberg et al [78] for cer- tain special cases, it is known to diverge in many cases of real interest. The symptom is usually locking onto a cycle of period two and the solution is to damp the iteration (by introducing a gain parameter) when this happens. In general, the reason one might ex- pect there to exist a fixed point for the iteration (whether or not the algorithm will reach it) is that the desired fixed point maximizes isotropy, defined in a suitable sense. See more

details in section 5.4.2

There are certainly images for which one would not expect the method to converge at all, namely those where all the image variation is in a single direction (so the moment matrix has rank ) such as linear stripes; in practice the algorithm may converge for such an image due to interpolation and quantization errors but what the method is computing in that case is noise, not signal, and the result will not be reliable. The method works best

126 when there is a fair amount of intensity variation in more than one direction. This should be expected from the geometric interpretation that it is computing a coordinate frame in which image variation, computed on a circular region, is isotropic (as measured by the 2nd

moment matrix).

Affine invariant descriptor. A shape adapted second moment matrix can be used to com- pute an image descriptor which will be invariant to affine transformations of the original image. This is because, by construction, the shape adapted moment matrix is co-variant with affine transformations so that passing to the normalized frame (where the moment matrix is isotropic) is an affine invariant procedure. Not quite, though. While the moment matrix itself is co-variant, the affine transformation which maps it to an isotropic matrix is only determined up to a rotation. However, if one computes a rotation-invariant descrip- tion in the normalized frame, the end result will be an affine invariant descriptor of the original image.

Equation (5.2) does not determine the choice of scale. To constrain the iterative pro-

cedure (and make it more likely to converge) Lindeberg and Gar˚ ding advocate keeping the

smallest eigenvalue of  equal to whereas Baumberg chooses to set it equal to a constant.

These details will be discussed in the algorithms section 5.4.2.

5.2.2 Intensity profiles descriptors

An intensity profile is a 1D signal obtained by evaluating the 2D image signal along a curve.

If the curve is a line segment, affine invariants can be constructed simply by normalizing

¢

£ the power (or some other measure) of the profile signal and rescaling its domain to ¤ .

This method was proposed by Tell [149] who used it to vote for feature matches in a wide baseline matching algorithm and by Matas et al [91].

5.2.3 Region descriptors ¥ Given a region of an image , what description can one extract which will be invariant to affine transformations of (the region and) the image?

Van Gool et al [98, 164] describe a method based on computing moments of the image

127

intensity (generalization to colour is straightforward):

¤ ¤

© 

¡ ¡

¤  ©  © ©  

£ £

¢

¥ ¥ © ¥ ¥ © ¥ ¥ © ¥ ¥ ©

¢

¥ ¥

© 

¥ ¥

© where ¥ is the centroid of . Under an affine transformation of the image and region,

these coefficients transform in a predictable way which depends only on the linear part

¢

¥ © of the transformation. By computing -invariants of this group action, one obtains

affine invariants of the image region. Details can be found in the cited papers.

Many quite stable invariants can be generated in this way, especially for colour im-

ages, but the method has the disadvantage of needing an accurately segmented region in

the first instance. When matching regions across multiple views one needs segmentation

algorithm which finds the same region in both images and some ingenuity is needed to ac-

complish this. Van Gool et al describe methods for growing a region in an affine invariant way from a feature segmentation [161] or from low-level intensity extrema [162].

When applying these region descriptors, the hard part is finding the regions.

5.2.4 Texture descriptors

A texture descriptor is a sort of region descriptor which applies to regions that contain only one kind of texture. The descriptor should capture the general, statistical, characteristics of the texture and discard the particular characteristics of the sample, such as its shape.

The advantage of this is that accurate segmentation is not needed; specifically, the method will still work if there is over-segmentation. This idea is pursued in more detail in chapter 6.

5.2.5 Geometric features

Geometric features are things like curves (e.g. straight lines), intersections/inflexions of curves and configurations of points. Unlike the other features described so far, they do not involve any (explicit) intensity information, only geometry.

The advantage of geometric features is their robustness to photometric changes and the possibility of integrating information from quite a large region of an image. The projec- tive invariance geometry of points, lines, conics and other algebraic curves under (plane) 128

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

¨

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

§¨

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

§

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

¡ ¡ ¡ ¡

¢¡¢¡¢¡¢¡¢

£¡£¡£¡£

¤¡¤¡¤¡¤

£¡£¡£¡£

¤¡¤¡¤¡¤

£¡£¡£¡£

¤¡¤¡¤¡¤

£¡£¡£¡£

¤¡¤¡¤¡¤

£¡£¡£¡£

¤¡¤¡¤¡¤

£¡£¡£¡£

¤¡¤¡¤¡¤

¥¡¥¡¥

¦¡¦¡¦

¥¡¥¡¥

¦¡¦¡¦

©

¥¡¥¡¥

¦¡¦¡¦

©

¥¡¥¡¥

¦¡¦¡¦

¥¡¥¡¥ ¦¡¦¡¦

Figure 5.3: A match of features on a surface implies a local transformation of feature neigh- bourhoods. projective transformation is very rich [37, 101, 100, 119, 117, 118] but is somewhat limited

in scope by the restrictive nature of the objects of study. For man-made scenes, which are

rich in straight line and planar structures it is justified, though.

5.3 Local (affine) transformations

From a single feature (point) match one can often obtain a local affine transformation of

feature neighbourhoods. This is the case if the feature arises from a marking on an approx-

imately flat surface, see figure 5.3.

The local transformation can be estimated by registration techniques which extremize

129 ¢ a registration error such as SSD [80] (i.e. the  -norm of the intensity differences) or mutual information [165]. Its usefulness is demonstrated by the following examples

Reducing number of false matches. Geometric hashing with invariant descriptors may give many putative correspondences, some wrong. Verification by neighbourhood regis- tration reduces the number of false matches.

Resolving ambiguity. The local affine transformation between a pair of interest point matches is determined up to a -parameter family by the affine skew normalizations. A neighbourhood registration determines the remaining degree of freedom.

Growing new matches from old ones. From a given match one can “grow” the neigh- bourhood registration to find nearby new matches on the same surface, as described by

Pritchett and Zisserman [109]. This puts at a disadvantage matches which “disagree” with their neighbours so tends to improve the spatial coherence of matches.

The idea of using the semi-local neighbourhood of a feature correspondence to eval- uate the quality of the match is not new, it was used by Zhang et al [171], by Schmid and

Mohr [134] and by Baumberg [5]. The approach of these authors is to use it for rejection: failing to find neighbourhood support leads to a match hypothesis being discarded. In contrast, the growing approach advocated here uses the neighbourhood for both hypoth- esis generation and verification but matches are not rejected simply because they are am- biguous.

Extra constraints for multi-view relations. The data from local registrations provides extra constraints for estimating multi-view relations such as epipolar geometry and plane-

induced homographies. If, say, the neighbourhoods of two interest points

§ §

© ©

 

 

 

© ©

  

 and

 

130

have been locally registered with an affine transformation whose linear part is  then the

§ §

epipolar constraint §



£

   

© © ©

can be extended to an infinitesimal neighbourhood in the sense that

§ § §



¡ ¡



£

 

 

© © ©

©

¢

 must hold to first order in ¡ . This provides (in general) linear constraints on the entries of instead of the single linear constraint obtained from the point correspondences

alone.

A related idea is due to Anandan and Avidan [2] who show how to compute global epipolar geometry from several local affine approximations. The difference between their

approach and the one given here is that the constraint from each affine-registered interest

point applies to the global epipolar geometry so it is not necessary to first estimate local

affine epipolar geometry.

5.4 Algorithms

This section consists of many small parts but the overall purpose is to set down the details

of an interest point matching algorithm put together from the techniques investigated in

the chapter. The objective of the algorithm is to compute correspondences and epipo-

lar geometry from a pair of images. The overall flow of hypothesis generation/verification

is to (1) generate hypotheses based on affine invariant interest point descriptors, (2) ver-

ify hypotheses using affine intensity registration of feature neighbourhoods, (3) gener-

ate new hypotheses using local search guided by affine registrations from previous match

hypotheses and (4) verify hypotheses by robustly fitting global epipolar geometry using

RANSAC [32].

We first outline the overall algorithm in broad terms and the following sections elabo-

rate on the details of each part of the algorithm. Results are shown in section 5.5.

1. Detect interest points and characteristic scales.

131 2. At each corner, compute an affine adapted moment matrix  (if possible – discard it

otherwise).

¦

  

3. In the frame normalized by  compute invariants as described above.

 4. Using Mahalanobis distance in invariant space, compute the ( ¢ , say) nearest

neighbours for each corner. These are the initial hypotheses.

5. Verify each hypothesis by computing an image registration, minimizing the sum of

squared differences (SSD) between the target image and an affinely transformed (both

in position and intensity) source image. Reject unlikely registrations.

6. Grow matches by locally propagating the affine transformation for each putative match

found so far and verify new matches as above. Repeat this growing stage till no more

new matches are found.

7. Robustly fit epipolar geometry to match hypotheses.

8. Grow corners once more.

9. Fit epipolar geometry again.

5.4.1 Corner detector



£

The (scale stabilized) Harris cornerness measure was calculated at scales between 

¦  ¦

£ £

¢  ¥ and . The pointwise maximum of the resulting cornerness maps was computed

and local maxima detected, followed by sub-pixel localization by fitting a quadratic.

Apart from corner locations, this gives a scale estimate at each corner. The method has

the disadvantage that corners at large scales “hide” nearby corners at small scales since the response from a high-scale corner is (usually) wider than that of a small-scale corner.

More recent experiments use the corner detector of Mikolajczyk and Schmid [97] which does not suffer from this problem but has a slight tendency to detect the “same” corner at multiple scales. For the purpose of feature based matching this is a lesser evil than detec-

tion failure.

132 Small scale peak Large scale peak

Figure 5.4: A problem suffered by the interest point detector is that the width of the re- sponse of large scale interest points can hide the response of small scale interest points.

5.4.2 Shape adaptation

 ¦

£ ¥ Smoothing for computing gradients is done with a Gaussian of width  relative to the

detection scale.

The “integration scale”, the width of the Gaussian mask for accumulating the second

¦ £ moment matrix, is taken to be ¡ times the detected interest point scale.

In the iterative method of shape adaptation, the determinant of the smoothing kernel’s

§  £ is kept fixed (at ) throughout the iteration; this makes the algorithm more stable. Still,

failure to converge is not uncommon and the iteration is aborted if the ratio of eigenvalues

£ £

of the moment matrix exceeds ; the interest point is then discarded. Otherwise, the

£ £

algorithm runs for at most iterations or until the difference between successive £ s is

£

 ¥

a negligible  , whichever happens first. Interest points for which convergence is not

reached are discarded.

 ¤ ¤   ¢

Note that searching for isotropy is equivalent to minimizing the term ¢

 £¤ 

£



 © ¢ £¢¥¤ ¥  ©  tr ¥ (where are the eigenvalues of ) down to . This means that the above fixed point method could be replaced by a local optimization over the matrix  .

In practice, one is only interest in a limited range of distortions, making the search space

of the optimization a compact set. Since the cost function is continuous (except where  is singular, which is unlikely to happen in practice on images of real interest), it will at- tain a minimum somewhere on the search space. If the minimum happens to be on the

boundary of the search space, the solution is rejected. Through lack of time, this approach

was not investigated in this thesis, though it is likely to be a more reliable algorithm. Later

experiments with adaptive damping (based on detection of cycles) of the proposed fixed

133 point algorithm have in any case reduced the failure rate from approximately one in ¡ to approximately one in ¡¤£ .

5.4.3 Invariants used

To achieve invariance to affine photometric changes, the intensity in the neighbourhood

of the corner is normalized to lie in the range £ to . If this implies a scaling the original

image intensity by a factor of more than ¡ times the (dynamic) range of the image, then the

interest point is rejected as being too weak.

In the normalized frame, the coefficients of a Fourier-Mellin transform 2 are computed

according to:

¤ ¤



¡ £

£¢ ¢

¥ ¥ ¥ © ¥  © ¥



¥  §

¡



¢ ¥  ©   where is the th derivative of a Gaussian with variance . A value of is used in the normalized frame (due to the affine normalization, this is a dimensionless

quantity, it is not measured in pixels).

¤

¤

¡

The effect of rotating the image by an angle is to multiply £¢ by so the expression

¤

¥

£¢ is invariant to rotation; these are the invariants used in this chapter. The values

¥ § § 

£ ¢ ¡

¢

of were taken between and , giving complex coefficients and real

invariants. The following table shows the number of real invariants obtained from each

¥

choice of : n 0 1 2 3 0 1 1 1 1 m 1 1 2 2 2 2 1 2 2 2

3 1 2 2 2

 £

In the first row ( ), the filter response is real, so gives only one invariant. In the first

¥ 

£



¥ ¥ column ( ), the invariant combination is just ¥ , which is also real. All the other entries are complex, so give two invariants.

Figure 5.5 shows results of an experimental evaluation of the invariant descriptors. The

interest points tracked over the the sequence of images used have quite stable invariants.

¡  ¡¡ ¥¤§¦ ©¨ ¢¡¡ 

2 ¢¡

£¢ ¢  ¢

Later experiments [128] used filter kernels of the form  where is a Gaussian mask.

¢¡¡  ¢ This bank of filters differs from the bank of derivatives of  only by a linear coordinate change but the rotational group action on them is much simpler.

134 8 3 6

6 2 5

4 1 4 2 0

0 3

−1 −2 2 −2 −4

1 −3 −6

−8 −4 0

0 5 10 15 0 5 10 15 0 5 10 15

¤ ¤

¤

¥ ¥ ¥ ¥ ¥ ¥ ¥ 

¥  ¥ Invariant : Invariant : Invariant : 

Figure 5.5: Experimental evaluation of interest point invariance. Top: the matching algo- ¢ rithm was run on a turntable sequence of images, and projective structure computed us-

ing the algorithm from chapter 3. The reprojection error, in pixels, after bundle adjustment

¦ ¦ £

£ £ £

was (rms) and (max), which justifies a high degree of confidence in the correctness

£ ¢¡ of the point tracks. The picture shows frames £ , and and the point correspondences

found between them. Bottom: evolution of selected invariants across the sequence. Tracks

¢ £ of full length are shown in red and tracks of length or more are shown in green to give an impression of the range of values taken by these invariants.

5.4.4 Registration

An adaptation of the Levenberg-Marquardt [108] algorithm is used to solve the least squares ¥

intensity registration problem for the parameters of the affine spatial transformation ¢

 ¡

   .

The “translation part” ¡ is kept fixed (assumed to be determined by the corner features)

and only the “linear part”  is fitted.

Sometimes a registration converges to a solution which is unlikely because the implied intensity or image distortion is too great (e.g. a strong step edge might happen to fit sensor noise, or a JPEG artifact, in a smooth region if the latter is affinely amplified). If the final

estimated intensity scale factor between two neighbourhoods is greater than , or the reg-

£¡ istration RMS error is above of the signal range, the solution is considered bad and is

135 rejected (this can lead to features in shadow being rejected).

5.4.5 Growing new matches from old

¤ ¤

¡ 

Given an interest point match hypothesis  , with an estimated local affine transfor-

© ©

 ¢ ¥ ©  

mation  , we search for other point pairs such that

¤ ¤

£ £

¤ ¤

¥   © ¥   ©

¤

£

¢ ¥  ©  ©

and ¥

¤

¡ ¡

I.e. the point ¡ lies in a disk centred on and the agree with the affine transformation £ too ( pixels is a very generous threshold). Such pairs are tested for photometric similarity

by normalized cross-correlation guided by the (linear part of the) affine transformation

¦¡ £ and accepted as new match hypotheses if the correlation score is above £ . 3

Obviously, in this process it is necessary to keep track of pairs of features which have already been tested; new matches found are added to the list of matches to be grown from.

5.4.6 Constraints on multi-view geometry from local affine transformation

The basic equations were given above in section 5.3 and it is straightforward to implement the method by writing out the equations in full. Some care has to be taken when condi- tioning the resulting linear system; see below for details.

Rank of constraints

To prove the statements about the rank of the linear constraints obtained, choose coordi-

nate systems in which the corresponding points are at the origin(s) and the local affinity is

the identity.

In the epipolar case, the constraint then becomes

¥ £ ¥ £

¦ ¤ ¦ ¤

©

£ ¡

¢ ¢

  £¢ © ©   ¥¤  © 

£ £ £



 

¥ © ¢ ¢

¡ £ £

  

£ £ £

    which implies , and , three linear constraints.

3Since these thresholds are given in absolute pixel distances the construction is, of course, not scale or affine invariant. However, in the correlation test the decision is still based on the local affine transformation associated with the seed match.

136

For a homography, the constraint is

£ ¥ £ ¥ £ ¥

¤ ¦ ¤ ¦ ¤ ¦

© ©

£ £

  ¡ ¢ © ©    ¤ 

£ £

 

¢ ¥ ©

¢ ¢

     

¢ £

         which implies  and , linear constraints:

Conditioning

For a linear estimate from these constraints, it is necessary to condition the input data



£ £ £ first, say by scaling the image coordinates by a factor of pixels. One then needs to

weight the linear constraints coming from the location of the correspondence differently

from those coming from the local affinity (this is to get a better approximation to geometric

error). The relative weights should be and ¢ , respectively. To see why this is, we general-

ize the formulation slightly. Instead of a single affine transformation  between the images

   we suppose given two affine transformations,  , from some local parameterization of

the surface patch, into each image. Thus, to convert back to the previous formulation, we

  



 

set  . The constraint on epipolar geometry then becomes

§ § §

¡ ¡

 



£

     

© © ©

¡

© which must hold to first order in  . The two extra constraints arising from the lo-

cal affinities encode (in a geometrically correct way) compatibility with but the order of

magnitude of the error terms is the same as the (geometrically incorrect) error terms aris-

¡

4

¥ ¥ ¥ ing from requiring that points with ¥ should correspond under . Now, if we scale

the image coordinates by a factor , the error terms arising from the correspondence of

   





    

with scales by the same factor, wheras the between-view matrix  stays the

   



 

 

 

¥ ©  same (due to cancellation:  ). So it is necessary to scale the error terms from  “by hand”.

4 

The natural domain of these affinities should be thought of as being the unit disk in ¢ and the (set) image of the disk in each (raster) image should be thought of as the region over which the affine approximation is believed to be valid.

137 5.5 Results

Figures 5.6 – 5.17 , 5.18 – 5.29 and 5.30 – 5.41 show the stages of the matching algorithm on three pairs of stereo images from Merton College, Oxford. Note that the algorithm copes well with the change of scale in the first example. In the second example the base line is not very wide, but no use was made of that by the algorithm; in this case almost all the found matches can be grown from a single good seed match by propagating affine registra- tions. In the last example there are many false matches due to the repetitive structure of the scene. Figures 5.42, 5.44 and 5.45 show epipolar geometry and, in one case, 3D structure computed using the algorithm from chapter 3 (the Euclidean calibration by T. Werner).

Figures 5.48, 5.51 and 5.54 show some of the successful matchings for parts of a se- § quence of § images from Raglan Castle, Wales. No corner features are detected on the

floor in the first two images, so no matches are found there.

5.6 Assessment

The chapter presented an interest point matching system based on a combination of the invariant descriptors of Baumberg [5] and the local image transformation of Pritchett and

Zisserman [109].

The results section demonstrated the effectiveness of the method on real images with large camera motion as well as rotation (e.g. figures 5.51 and 5.54). This stems from the stable invariance of the corner descriptors. The failure mode is almost always matching ambiguity arising from scene repetition, as shown in figure 5.41.

Future work should concentrate on incorporating different types of image descriptors in an opportunistic way since some types of feature are more distinctive than others, de- pending on the type of scene. Ideally, the matching system should be able to tell from each image what sort of feature is likely to work well for that image and then employ it.

138 Figure 5.6: Stereo image pair.

Figure 5.7: Detected multi-scale Harris corners.

Figure 5.8: Close-up: the size of circles indicate the selected scales.

Figure 5.9: Affine adapted corner neighbourhoods. 139 Figure 5.10: Close-up of affine adapted corner neighbourhoods.

Figure 5.11: Corners which survived intensity normalization.

Figure 5.12: 1081 putative matches generated from invariants.

Figure 5.13: 892 putative matches verified by SSD registration. 140 Figure 5.14: 1335 putative matches after growing stage.

Figure 5.15: 592 corner matches surviving robust fit of epipolar geometry.

Figure 5.16: 829 corner matches after second growing stage.

Figure 5.17: 589 corner matches after final robust epipolar geometry fit. 141 Figure 5.18: Stereo image pair.

Figure 5.19: Detected multi-scale Harris corners.

Figure 5.20: Close-up: the size of circles indicate the selected scales.

Figure 5.21: Affine adapted corner neighbourhoods. 142 Figure 5.22: Close-up of affine adapted corner neighbourhoods.

Figure 5.23: Corners which survived intensity normalization.

Figure 5.24: 952 putative matches generated from invariants.

Figure 5.25: 806 putative matches verified by SSD registration. 143 Figure 5.26: 1540 putative matches after growing stage.

Figure 5.27: 429 corner matches surviving robust fit of epipolar geometry.

Figure 5.28: 919 corner matches after second growing stage.

Figure 5.29: 446 corner matches after final robust epipolar geometry fit. 144 Figure 5.30: Stereo image pair.

Figure 5.31: Detected multi-scale Harris corners.

Figure 5.32: Close-up: the size of circles indicate the selected scales.

Figure 5.33: Affine adapted corner neighbourhoods. 145 Figure 5.34: Close-up of affine adapted corner neighbourhoods.

Figure 5.35: Corners which survived intensity normalization.

Figure 5.36: 672 putative matches generated from invariants.

Figure 5.37: 589 putative matches verified by SSD registration. 146 Figure 5.38: 1253 putative matches after growing stage.

Figure 5.39: 551 corner matches surviving robust fit of epipolar geometry.

Figure 5.40: 904 corner matches after second growing stage.

Figure 5.41: 577 corner matches after final robust epipolar geometry fit. 147 Figure 5.42: Computed epipolar geometry for the stereo pair in figure 5.6. It is quite accu- rate.

Left Middle Right

Figure 5.43: Three views of a ¢ model produced by integrating the correspondences found.

Figure 5.44: Computed epipolar geometry for the stereo pair in figure 5.18. This too is quite accurate.

148 Figure 5.45: Computed epipolar geometry for the stereo pair in figure 5.30. It is not very accurate due to the matching failures on the right-most wall of the building. The cause of this is the large number of mismatches due to the repeated structure on the left-most wall.

149

Figure 5.46: Input images.

Figure 5.47: Corners detected.

Figure 5.48: Matches found.


Figure 5.49: Input images.

Figure 5.50: Corners detected.

Figure 5.51: Matches found.


Figure 5.52: Input images.

Figure 5.53: Corners detected.

Figure 5.54: Matches found.

Chapter 6

Texture

6.1 Objective

This chapter addresses the problem of computing texture descriptors invariant to viewpoint and illumination changes, and how to use such descriptions in visual tasks such as finding correspondence and indexing into image databases.

Why texture? Firstly, texture is a very vivid visual cue for humans and it would be useful to be able to use it in machine vision. Secondly, many vision algorithms rely on extracting some form of features for subsequent processing, such as interest points or edge contours, but feature detection often fails, especially in the presence of strong texture, so the ability to detect and reason with textures by machine would complement existing methods nicely.

Overview. The chapter starts with a review of some background literature, which also introduces some terms to be used later. Next, the theory explored in this chapter is presented in detail: mostly this is about the second moment matrix as a texture descriptor and how to use it to normalize texture samples for viewpoint changes by the familiar device of reducing affine ambiguity to rotational ambiguity. Details of algorithm implementation are given before the results section, which presents some classification results on synthetic data as well as example applications to inter- and intra-image matching.

6.2 Background

The texture "problem" in computer vision has many strands to it:

Shape from texture: to determine the shape of an object from the way texture on the surface deforms due to orientation and distance relative to the viewer.

Texture segmentation/grouping: to use texture to separate distinct objects and to associate related ones.

Recognition/description from texture: the material properties of objects are reflected in the observed surface texture, so image texture is correlated with object identity.

Texture synthesis: to generate convincing-looking "new" image texture, e.g. from a given example.

These strands are related in (at least) the following ways: shape from texture must be preceded by some understanding of the effects of viewpoint change on imaged texture. So must recognition and description, if it is to be viewpoint insensitive. An ideal texture descriptor extracts, from a finite sample, the essentials of a texture and discards the irrelevant, but how can one tell what a descriptor encodes and what it doesn't encode? Synthesis is one way: by generating a new texture with the same description as the original, it becomes clear what the description has captured and what it hasn't.

Texture synthesis is important on its own due to its potential for realistic graphical rendering of natural scenes.

There are several ways in which changes of viewpoint and illumination can affect the imaged appearance of a texture:

the colour or intensity of the light source can change;

a change in viewpoint changes the foreshortening experienced by, and the position in the image of, an imaged texture;

changes in illumination direction or viewpoint can radically affect the appearance of a rough texture (due to shadows or self-occlusions) or a texture with specular components.

The last issue, 3D effects, is probably the hardest to deal with.

As for the first item, normalization techniques such as histogram equalization or normalization of a (robust estimate of) signal range can be brought to bear.

The second item is somewhat harder and is the topic of this chapter.

6.2.1 Texture description and models

Origins. This chapter is mostly about texture description. Attempts to describe texture by machine go back at least to Julesz [63, 64] who famously postulated that the joint empirical n-th order statistics of image pixels account for human perception. (Here, "joint" refers to studying how statistics such as intensity or gradient vary together and not just separately, and "n-th order" refers to considering not just one, or two, but several statistics at once.) I.e. the claim was that there exists some n such that any two texture samples with the same empirical joint statistics will appear (to humans) to come from the same texture. This is a very strong claim and taken literally is not hard to refute (as Julesz did himself for n = 2) but, as a notion of what one should record about a texture, the idea has been very influential and has led to generative models which work very well for certain textures [170].

Filter banks. The ability of humans to discriminate textures pre-attentively has attracted plenty of attention in computer texture analysis. It is known that early phases of biological vision processing are performed by retinal cells sensitive to certain local visual appearances such as edges, bars and blobs. These observations led to the now common technique of applying a bank of filters to input texture images and using the filter responses as a representation on which further processing is carried out. A canonical reference for this approach is the work [86, 87] by Malik and Perona. Usually, the filters are linear convolution filters, such as derivatives/differences of Gaussians, Gabor filters or other wavelets, possibly followed by non-linear processes such as half- or full-wave rectification and local non-maximum suppression, or inhibition, within and across the filter response channels. The use of local filters captures the notion that textures are composed of repeated elements and the local inhibition amongst channels encodes some of the aspects of typical spatial arrangements of the elements.
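As a concrete illustration of the filter-bank representation described above, the following is a minimal sketch (not the implementation of any of the cited works) that applies a small bank of Gaussian-derivative filters to an image and half-wave rectifies the responses; the scales and the choice of filters are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def filter_bank_responses(image, sigmas=(1.0, 2.0, 4.0)):
    """Apply a small bank of Gaussian-derivative filters and rectify.

    Returns an array of shape (n_filters, H, W); the scales and filter
    choices here are illustrative only."""
    responses = []
    for sigma in sigmas:
        # First derivatives in x and y (oriented "edge" filters).
        dx = gaussian_filter(image, sigma, order=(0, 1))
        dy = gaussian_filter(image, sigma, order=(1, 0))
        # Isotropic second derivative (a "blob" filter).
        lap = gaussian_filter(image, sigma, order=(0, 2)) + \
              gaussian_filter(image, sigma, order=(2, 0))
        for r in (dx, dy, lap):
            # Half-wave rectification: keep positive and negative
            # parts of each response as separate channels.
            responses.append(np.maximum(r, 0.0))
            responses.append(np.maximum(-r, 0.0))
    return np.stack(responses)
```

Further non-linear stages (inhibition across channels, non-maximum suppression) would be applied on top of such a response stack.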

Random fields. A popular texture model which directly addresses the statistical dependency between neighbouring texture elements is the theory of random fields from statistical mechanics. Specifically, Markov Random Fields (MRFs) are often used as image [42] or texture [170] models. A MRF consists of (1) a set of sites (e.g. pixels), (2) a set of cliques, which are finite sets of sites (e.g. pairs of neighbouring pixels) and (3) a collection of random variables (e.g. image intensities) indexed by the sites, such that (Markov property) the conditional behaviour of the variable at any site depends only on the variables at sites sharing a clique with it. The notion that a texture can be described by its n-th order statistics can be given concrete meaning by modelling a texture as a sample from a certain MRF whose parameters are a priori unknown. Texture synthesis based on MRFs consists of recording the (empirical) conditional behaviour of pixel values and then finding a way to sample from a MRF with those statistics. The latter is usually carried out by finding a Gibbs distribution (a technical device which parameterizes MRFs) with the same conditional behaviour and sampling from that instead, using Gibbs or Metropolis [95] sampling. This is quite slow.

Co-occurrence. "Co-occurrence" just refers to recording the joint behaviour of image statistics instead of marginal behaviour. For example, the (marginal) distribution of pixel intensities is not very discriminating (it says nothing about the spatial arrangement of local features) whereas the joint distribution of pixel values in a given relative position can say a lot about the texture. This is spatial co-occurrence; a variation is to apply two or more image filters and to record the co-occurrence of filter responses at the same point.

This approach has been used by Leung and Malik [85] to operationalize the texton notion of Julesz by using clustering in the high-dimensional space of filter response vectors (collected over an image region) to find "typical" combinations of filter responses. The centre of each cluster corresponds to an idealized patch of local image appearance, the texton.
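To make the spatial co-occurrence idea concrete, the following minimal sketch (not taken from the cited works) accumulates the joint histogram of quantised intensities at pixel pairs separated by a fixed displacement; the quantisation level and displacement are illustrative choices.

```python
import numpy as np

def cooccurrence_histogram(image, dy=0, dx=1, levels=16):
    """Joint histogram of quantised intensities of pixel pairs separated by
    the displacement (dy, dx), with dy, dx >= 0.  Assumes image in [0, 1]."""
    q = np.clip((image * levels).astype(int), 0, levels - 1)
    h, w = q.shape
    a = q[: h - dy, : w - dx]      # value at pixel x
    b = q[dy:, dx:]                # value at pixel x + (dy, dx)
    hist = np.zeros((levels, levels))
    np.add.at(hist, (a.ravel(), b.ravel()), 1)
    return hist / hist.sum()       # empirical joint distribution
```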

Hybrids. Zhu and Mumford [172] combine the MRF and filter paradigms to form a new unified theory, called FRAME, but it is still based on Gibbs sampling both for synthesis and learning. Portilla and Simoncelli [107] combine several of the above ideas to produce an impressive texture synthesis system. Firstly, a multi-scale wavelet filter bank is used to generate several filter channels. Secondly, the joint statistics of these channels are recorded (mostly in the form of correlations). Thirdly, new texture is generated by iteratively adjusting an estimate (initially random noise) using straightforward projection in the space of images. I.e. their observation is that most image statistics are essentially scalar functions $\phi$ on the space of all images, so one can adjust a given estimate by searching in the direction of the gradient of $\phi$ at the current estimate for a new estimate which has the desired value of the statistic:

$$I_{\text{new}} = I_{\text{old}} + \lambda\, \nabla\phi(I_{\text{old}})$$

where the value of $\lambda$ is chosen to be as small as possible subject to the requirement that $\phi(I_{\text{new}})$ should equal the empirical value of the statistic. The idea harks back to Julesz but the complication and cost of sampling from (the technical device that is) MRFs is avoided and the method allows considerable flexibility in the kinds of statistics which can be used.

Figure 6.1: Left: a homogeneous texture. Middle: an isotropic texture. Right: a regular texture.

Previous authors used the device of projection onto convex sets (POCS) for similar purposes, but it is not clear what the benefit of convexity is, except maybe for theoretical reasons (such as convergence properties).
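The projection step just described can be sketched in a few lines; the bisection search for the step length $\lambda$ below is one plausible way to satisfy the constraint on $\phi$, and is an assumption of this sketch rather than the published implementation.

```python
import numpy as np

def project_to_statistic(image, phi, grad_phi, target, max_step=1.0, iters=40):
    """Move `image` along the gradient of the scalar statistic `phi` until
    phi(image) matches `target`.  A sketch of the projection idea only;
    assumes phi is monotone in the step length over [0, max_step] and that
    the target value is bracketed."""
    direction = grad_phi(image)
    lo, hi = 0.0, max_step
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(image + mid * direction) < target:
            lo = mid
        else:
            hi = mid
    return image + 0.5 * (lo + hi) * direction
```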

Homogeneity. A texture can be homogeneous or not; the term will be explained. It is convenient to think of texture as the result of a spatially invariant process being driven by spatially varying parameters. For example, the appearance of grass is the result of the biological process of growing driven by the parameters of soil properties and weather at the particular location. A sample of texture is said to be homogeneous if the parameters are constant across the sample. For example, homogeneous lawns of grass are made by ensuring that conditions (parameters) are the same everywhere; fungi, shade, moss and dandelions cause spatial variations and give rise to inhomogeneous lawns. Natural brick textures are often inhomogeneous because walls tend to get more dirty near the ground.

One of the holy grails of texture analysis is a computer algorithm which, given a small texture sample, can separate the process from the parameters. A simpler, and more commonly attacked, problem is to try to recover the process from a sample of homogeneous texture, i.e. to ignore spatial parameter variation for the sake of tractability. On the other hand, in shape from texture, the inhomogeneity of imaged texture is exploited by assuming it arises from a homogeneous world texture due (only) to changes in surface orientation and the effects of perspective [19, 24, 67, 77, 88, 90, 116, 121, 144].

Isotropy. A related notion is isotropy, which refers to a texture (or any thing) which has no preferred or distinguished directions. The fronto-parallel visual appearance of randomly scattered rice grain in purely ambient lighting is isotropic – if the camera or observer were rotated about its, his or her optical axis, the essential appearance would not change. Directional lighting quickly introduces an-isotropy (by shading, or casting shadows). Brick texture is often an-isotropic due to the distinguished vertical and horizontal directions which individual bricks line up with. Brick made from natural stone can be isotropic. Polka dots arranged in a regular grid form an an-isotropic texture. True isotropy is rare but weaker forms (e.g. conditions which are implied by, but do not imply, isotropy – see section 6.3.4 below) are more commonly useful in practice.

In the classic paper [167], Witkin made the observation that a (weakly) isotropic texture can become (weakly) an-isotropic under perspective projection, so that, assuming a texture to be isotropic out in the world, measures of an-isotropy give information about surface orientation. The idea was taken up and generalized by Blake and Marinos [90].

Regularity. Texture sometimes consists of repetition of a small number of image elements. The repetition can be regular (e.g. polka dots on fabric) or irregular (e.g. scattered rice grain). Regularity (strictly speaking, periodicity) is often measured with spectral methods, i.e. using the Fourier transform. One approach is to compute the power spectrum of the texture and look for peaks as done by Krumm and Shafer [67] where peaks are tracked and used to obtain information about changes in surface orientation across an image. Another is to use auto-correlation, deemed by some to be more robust. It was used by Chetverikov [18] to define a scalar measure of regularity.

Related to regularity is the Wold model [79, 143, 38, 39], a texture model which aims to decompose every texture as a linear combination of three parts: a random part, a directional part and a periodic part. It is a theorem that such a decomposition exists but the meaning of these terms is defined relative to a total ordering of the image pixels and this introduces extraneous structure which is irrelevant to the texture itself. The cited papers seem mostly concerned with detecting peaks of the power spectrum and auto-correlation map, i.e. with finding the periodic part.

Scale. All three notions of homogeneity, isotropy and regularity are entangled with the question of scale. Scale refers to the length scale at which measurements are taken. For example, suppose that a texture is described by its joint 2nd order statistics, i.e. by the joint probability distributions $P_{x,y}$ of the pixel values at all pairs of pixels $(x, y)$. The length scale of each distribution is just the distance $\|x - y\|$ between the pixels $x$ and $y$.

In this context, homogeneity means that the distributions $P_{x,y}$ are translation invariant: $P_{x+d,\,y+d} = P_{x,y}$ for all displacements $d$. For example, in a "white dots on black" pattern with varying density of dots, there can be a length scale above which the random variables $I(x)$, $I(y)$ become practically independent because the length scale dwarfs the scale of the dot sizes. Thus, a texture can be homogeneous at one scale and inhomogeneous at another.

Similarly, for isotropy, it often happens that pairwise dependencies vary with the direction from $x$ to $y$ for short length scales but are independent of it at long length scales. For example, this is the case with iron filings in a magnetic field.

Similar remarks apply to regularity, and an example is furnished by a brick wall, where the bricks themselves are arranged in a highly regular pattern whereas at the smaller scale of a few millimeters, the imperfections in the bricks are highly irregular.

It seems that in every case, the property in question (homogeneity, isotropy or regularity) may fail to apply at one, small, scale but appears gradually as the scale increases. Of course, in practice what happens is that the length scale exceeds the size of the texture sample (the lawn, or the brick wall is only finite in size) so we have no information about what would happen at the longer scale if "the wall were infinite". If we modelled the meaning of each part of an image (e.g. as a random field taking the value "brick", "grass", "foliage") and then modelled the pixel appearance as a smaller-scale random field driven by the large-scale field then we could think of the large-scale field as "parameters" and the small-scale field as "process". Note how the parameters would then be varying whereas the process would not.

Invariance. The current state of the art in texture synthesis demonstrates that there are texture descriptors which are strong enough to capture the essential properties of textures.

However, many texture analysis, and most texture synthesis, methods are inherently tied to the coordinate system of the given training image, or do not use representations that are useful for texture recognition. For example, the texton clustering algorithm of Malik and

Leung is invariant to rotations (and translations) of the image but not to scale changes or

shearing. In multi-view applications, where an object must be recognised and reasoned

about from more than one point of view, an affinely invariant texture description would be valuable. For another example, the algorithm of Efros and Leung [25] can generate new texture from a sample image but is useless for texture recognition because the model it uses is the sample image; it doesn’t discard any of the inessential information present in

the particular training data.

Rotationally invariant texture classification has been investigated by several authors [16,

48, 89] and is usually based on applying rotationally invariant filter banks (e.g. using rotationally symmetric linear convolution kernels or complex polar-separable filters followed by rectification – see below).

A segmentation algorithm whose performance is invariant to affine image transformations was presented by Ballester et al [4]. The variational framework of Mumford and Shah [99] was used to carry out the actual segmentation but, as far as description goes, the contribution is the construction of a bank of affinely invariant local filters using the now common technique of adaptive estimation of local (weak) gradient an-isotropy (see section 6.3.4 below) to reduce the problem to achieving rotational invariance.

The regularity measure constructed by Chetverikov in [18] is affinely invariant by design.

3D effects. Often, texture appearance is to a great extent due to fine-scale variation in surface geometry, such as bumps, cracks or roughness in general. The imaged appearance of such textures can change dramatically if the illumination conditions or viewpoint changes.

Chantler [16] investigated the effect of illumination direction on non-flat textures using

spectral methods and photometric stereo, finding that in the absence of cast shadows the

effect can be modelled using spectral methods.

Leung and Malik [71, 72] very ambitiously addressed the problem of extending their texton clustering method to the case of varying illumination and viewing direction by learning a dictionary of what they call 3D textons. In the learning phase, the joint behaviour of filter responses at a location on the training material is recorded over several known lighting and viewpoint conditions and clustering is carried out as before. When recognising a new texture, the problem is that the given image has unknown illumination and viewpoint conditions and these must now be estimated as part of the decision process (in the training stage they were given).

6.3 Theory

This section describes the main tools investigated in the work. The second moment matrix of a texture region, and its properties and behaviour under affine transformations, are discussed in sections 6.3.2 through 6.3.5. It is shown that passing to a coordinate frame where the moment matrix is isotropic gives a normalization of a texture region which is canonical up to a similarity transformation. Section 6.3.6 shows how to make similarity-invariant measurements in the canonical frame, thereby achieving full affine invariance.

6.3.1 Segmentation

For texture descriptors to be useful in practice, it is necessary to do segmentation, because most images of real interest contain more than one texture or a mixture of texture and non-texture. It is necessary to know where a texture is and how far it extends. Having said that, no attention will be paid to segmentation in this section; it will be assumed that textures are presented, pre-segmented, as intensity images $I(x, y)$ defined over regions $\Omega$. (One could use colour images and this would no doubt improve performance but as far as understanding and overcoming the problems of texture description are concerned it is not interesting.)

6.3.2 The 2nd moment matrix

Consider an intensity image $I(x, y)$ and a direction vector $u$. The derivative of the image in the direction of $u$ is

$$\nabla_u I = u^\top \nabla I = \mathrm{d}I\, u,$$

where $\mathrm{d}I = (\partial I/\partial x,\ \partial I/\partial y)$ and $\nabla I = \mathrm{d}I^\top$ are the differential and gradient of the intensity image. The amount of variation of the intensity image in the direction of $u$ over a region $\Omega$ can be measured by the variance of the scalar image $\nabla_u I$ over $\Omega$. In matrix notation, this is $u^\top M\, u$, where the second moment matrix $M$ is the $2 \times 2$ matrix

$$M = \frac{1}{|\Omega|} \int_\Omega \nabla I \, \nabla I^\top \, dx,$$

which is the average value, over $\Omega$, of the outer product $\nabla I\, \nabla I^\top$. It compactly represents all information about image intensity variation (in the above sense) in all directions $u$.
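A minimal numerical sketch of this computation (the function and its interface are my own, not from the thesis): given an intensity image and a boolean mask for the region, average the outer products of the image gradient over the region.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def second_moment_matrix(image, mask, sigma=1.0):
    """Average of grad(I) grad(I)^T over the region given by `mask`.
    Gaussian-derivative gradients; sigma is an illustrative choice."""
    Ix = gaussian_filter(image, sigma, order=(0, 1))
    Iy = gaussian_filter(image, sigma, order=(1, 0))
    m = mask.astype(bool)
    return np.array([[np.mean(Ix[m] * Ix[m]), np.mean(Ix[m] * Iy[m])],
                     [np.mean(Ix[m] * Iy[m]), np.mean(Iy[m] * Iy[m])]])
```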

6.3.3 Behaviour under affine transformations

If the original image and region are affinely related to another image and region (see figure 6.2), the components of the 2nd moment matrix computed over the other region will be different. Let the new image be $J$, given by

$$J(x') = I(x), \qquad x' = A x + b,$$

where $A$ is the linear part and $b$ the translation. The differentials of $I$ and $J$ at corresponding points are related by $\mathrm{d}I = \mathrm{d}J\, A$, or equivalently $\nabla I = A^\top \nabla J$.

Figure 6.2: An affine transformation of a region within an image changes the 2nd moment matrix in a predictable way.

Therefore, the corresponding moment matrices are related by

$$M_I(\Omega) = A^\top\, M_J(A\,\Omega)\, A \qquad (6.1)$$

Note that the translation part, $b$, plays no part. The only part of the transformation which is relevant is the linear part, which says how local neighbourhoods are deformed.

The relationship is useful because it allows one to predict something about the appearance of a texture region from a viewpoint that is different from the original, given one.

Visualization: co- and contra-variance

The second moment matrix looks like a covariance matrix and covariance matrices are often visualized as ellipses with minor and major axes along the principal directions, of magnitude proportional to the square roots of the corresponding eigenvalues. However, this is only appropriate when the matrix represents displacement vectors; the second moment matrix represents co-vectors. This is sometimes called an information matrix and the correct way to visualize it is to draw the level curves

$$x^\top M\, x = \text{const}$$

where $x$ is the displacement vector from the chosen centre of the ellipse to be drawn (this choice has no particular meaning). This is used in figure 6.3 to show second moment matrices computed from images of the same surface seen from different points of view, and in figure 6.4 to show moment matrices computed from different parts of the same texture in a single image.

Figure 6.3: Two images of the same textured surfaces acquired from different points of view. The affine distortion between the imaged sides of the tower is evident, as is the difference in brightness. The second moment matrix is computed independently in each of the marked regions (indicated by colour and boundaries), and is represented as an ellipse centred on the region's centroid. Note the computed ellipse appears to be attached to the surface: it transforms in the same manner (covariantly) as the surface between views. Furthermore, for different regions on the same side of the tower the ellipse is virtually identical (up to size). This demonstrates the viewpoint invariance of the computation, and also the insensitivity to the shape of the support regions.

Figure 6.4: Computed second moment matrices (ellipses) for different support regions (highlighted) from various planar textures. The ellipses vary very little within each texture, despite the change in position, shape and size of the support region.

6.3.4 Weak (gradient) isotropy

A texture is weakly (gradient) isotropic if its second moment matrix is isotropic, i.e. it is a scalar multiple of the identity matrix.

The main idea explored in this chapter is the observation that, given a textured region, affinely transforming it such that it becomes weakly isotropic is a way to normalize the texture to compensate for certain aspects of viewpoint change. Since the group of transformations which preserve weak (gradient) isotropy is the group of similarity (rotation and scaling) transformations, the normalization of a texture is not really unique but is only determined up to a similarity. Computationally (because of its numerical stability and invariance to rotations) a convenient choice of $A$ is the (unique) positive square root of $M$:

$$A = M^{1/2} \qquad (6.2)$$

which has the desired effect because, according to equation (6.1), the moment matrix of the transformed region is

$$A^{-\top} M A^{-1} = M^{-1/2}\, M\, M^{-1/2} = \mathbf{1}$$

as required.

As remarked earlier, this idea originates with Witkin [167] and has been taken up by several other authors [5, 9, 4].

Examples of this normalization process can be seen in figure 6.5.
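A small sketch of this normalization, using an eigendecomposition to form the symmetric square root (the function name and interface are assumptions of this sketch):

```python
import numpy as np

def isotropy_normalizing_transform(M):
    """Return A = M^(1/2), the symmetric positive square root of the
    second moment matrix M, so that A^-T M A^-1 is the identity."""
    w, V = np.linalg.eigh(M)              # M = V diag(w) V^T, w > 0
    return V @ np.diag(np.sqrt(w)) @ V.T

# Sanity check of the claimed effect of equation (6.1):
# M = np.array([[4.0, 1.0], [1.0, 2.0]])
# A = isotropy_normalizing_transform(M)
# Ainv = np.linalg.inv(A)
# print(Ainv.T @ M @ Ainv)   # approximately the 2x2 identity
```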

6.3.5 Previous use of the 2nd moment matrix

The 2nd moment matrix has a long history of use in feature detection and texture analysis [9, 77, 88]. A common variation is known as centre weighting:

$$M(x_0) = \int w(x - x_0)\, \nabla I(x)\, \nabla I(x)^\top \, dx$$

where $w$ is a scalar field which integrates to one and which is mostly supported near the region of interest. The corner detector associated with (at least) the names of Harris [50], Nitzberg [66] and Förstner [36] declares a pixel to be a corner based on the eigenvalues of a moment matrix computed as a local average around the pixel.

Figure 6.5: Examples of normalization by making the second moment matrix isotropic for the two upper-most marked regions in figure 6.3. The ellipses shown in that figure would be circular if shown in this figure.

Many shape from

texture algorithms, e.g. [9, 77], use the second moment matrix to estimate the pose of the

surface being viewed, assuming the texture on the surface in the world is isotropic.

The use of a windowing function allows compromise between getting a local measurement (small support needed) and getting sufficient sample size (large support needed) but is not useful here because the desired measurement is non-local: the given region is assumed to contain a single texture, all of which is equally valid, so the same weighting is used everywhere. At most, some boundary effects can corrupt the estimate but this can be avoided by morphological erosion of the region – there is no reason to weight the interior pixels differently.

6.3.6 Description in normalized frame

By passing to a normalized frame, the affine invariance problem has been reduced to invariance under similarity transformations.

Rotational invariance

A convenient way to accomplish rotational invariance is to collect the outputs of a bank of rotationally invariant (non-linear) filters. How does one construct such filters? In the well-known work [134, 135] by Schmid et al, it is accomplished by computing (Gaussian-blurred) derivatives as suggested by Koenderink [65] and using algebraically invariant functions of these, as worked out by Romeny [151]. The work was extended to colour images by Gouet et al [46]. The theory behind the (algebraic, polynomial) invariants used by Schmid et al is classical invariant theory; by abandoning the restriction of looking for polynomial invariants there are simpler ways to construct many rotational invariants:

One approach is simply to apply a rotationally symmetric linear convolution filter, i.e. one whose kernel has the form

$$K(x, y) = f(x^2 + y^2),$$

where $f$ can be any function. An alternative, which extracts more information from the image data, is to use polar-separable convolution filters of the form

$$K(r, \theta) = f(r)\, e^{i k \theta}$$

and take the absolute value of the filter response (this is a non-algebraic operation). A rotation of the image by an angle $\alpha$ changes the phase of the convolution response by a factor of $e^{i k \alpha}$, so the absolute value

$$|K * I| \qquad (6.3)$$

is rotationally invariant. Some possible choices are:

Zernike moments: an orthogonal basis for the set of polynomial functions in $x$ and $y$ on the unit disk $x^2 + y^2 \le 1$.

Gaussian derivatives (Fourier-Mellin kernels): polar-separable kernels whose radial part is built from an associated Gaussian.

Zernike moments were used in [4] but have the unpleasant property of being discontinuous near the boundary of their support, so that small changes in the placement of the filter can result in relatively large changes in the filter response (an alternative explanation is that the Fourier transform of a discontinuous kernel decays slowly, so more weight is given to high-frequency components of the image signal). By comparison, image derivatives are usually computed by convolution with the derivative of a Gaussian, which tapers off smoothly; a sharply truncated convolution kernel does not suppress noise so well.

Gaussian derivatives were used by Baumberg [5].

In practice, little difference was found between using the Zernike moments and derivatives of Gaussians (see figure 6.11). If at all, the Zernike moments outperform the Fourier-Mellin filters slightly for the parameters tested.
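As a sketch of the polar-separable idea above (the exact kernels used in the thesis are not reproduced here; the Gaussian-weighted radial profile and kernel size are assumptions), one can build a complex kernel $f(r)e^{ik\theta}$, convolve, and keep only the magnitude of the response:

```python
import numpy as np
from scipy.signal import fftconvolve

def polar_separable_kernel(k, sigma=3.0, size=21):
    """Complex kernel f(r) * exp(i*k*theta) with a Gaussian-weighted radial
    profile; the profile and size are illustrative assumptions."""
    ax = np.arange(size) - (size - 1) / 2.0
    x, y = np.meshgrid(ax, ax)
    r = np.hypot(x, y)
    theta = np.arctan2(y, x)
    return (r ** abs(k)) * np.exp(-r**2 / (2 * sigma**2)) * np.exp(1j * k * theta)

def rotation_invariant_response(image, k, sigma=3.0):
    """Magnitude of the complex filter response; invariant to image rotation
    about the filter centre (up to resampling effects)."""
    kern = polar_separable_kernel(k, sigma)
    return np.abs(fftconvolve(image, kern, mode="same"))
```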

Histogram descriptors

Having obtained a new, rotation-invariant, representation of the texture region in terms of filter responses, it still remains to extract a compact description which depends only on the intrinsic properties of the texture and not, say, the shape of the region. In this work, the method chosen was to label each pixel in the (normalized) region according to which filter fires most strongly at that pixel. A histogram of labels is collected, and this histogram is the texture descriptor. The approach is similar to that used by Malik et al [85]. The process is illustrated in figure 6.6 which shows the histograms collected for two differently textured regions. The histograms are different, whereas for two regions containing the same texture the histograms are similar.

Histograms can be compared in many different ways, e.g. using the Kullback-Leibler divergence, the $\chi^2$ statistic, the Earth mover's distance [120] or histogram intersection [148]; the relative merits of these are discussed by Schiele in [130]. For the experiments in this chapter, the $\chi^2$ statistic was used, inspired by Shi and Malik [140], the theoretical advantage of a statistic which can be readily related to a likelihood, and the finding of [130] that it outperforms the histogram intersection measure.
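A compact sketch of the label histogram and the $\chi^2$ comparison just described (the layout of the response array is an assumption of the sketch):

```python
import numpy as np

def label_histogram(responses, mask):
    """responses: (n_filters, H, W) rectified filter outputs; mask: region.
    Each pixel is labelled by its strongest filter; returns the normalized
    label histogram used as the texture descriptor."""
    labels = np.argmax(responses, axis=0)[mask.astype(bool)]
    hist = np.bincount(labels, minlength=responses.shape[0]).astype(float)
    return hist / hist.sum()

def chi_squared(h1, h2, eps=1e-12):
    """Chi-squared statistic between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```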

Scale invariance

To extract information across scales, a histogram is collected at a number of scales and comparisons between sets of histograms must be modified to take a possible scale change into account.

Figure 6.6: Histograms collected, at the same scale, for three texture regions. Histograms coming from the same type of texture (red arrows) are similar.

6.4 Algorithms

6.4.1 Normalized Cut Segmentation

To accomplish actual visual tasks using texture one must work with images containing more than one texture, and non-texture. Thus it is necessary to perform image segmentation to isolate the regions to which processing can meaningfully be applied.

There is a large literature on texture segmentation [4, 67, 86, 70, 85, 140]. For the purposes of the investigations in this chapter, a rudimentary implementation of the Normalized Cuts algorithm in [85] was used. More details can be found in [70, 85, 140] but briefly, the algorithm poses segmentation as a graph partitioning problem where the vertices are pixels and the edges are labelled with affinities which are local measures of similarity. The aim is to find partitions with large intra-group affinities and small inter-group affinities.
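The following is a rudimentary sketch of the spectral relaxation usually used for Normalized Cuts (a generalized eigenvector of $(D - W, D)$, thresholded to bipartition the nodes); it is offered as an illustration, not as the implementation used in this work.

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(W):
    """W: symmetric (n, n) affinity matrix between pixels (or superpixels).
    Returns a boolean group assignment from the second smallest generalized
    eigenvector of (D - W) y = lambda D y, split at its median."""
    d = W.sum(axis=1)
    D = np.diag(d)
    vals, vecs = eigh(D - W, D)       # generalized symmetric eigenproblem
    y = vecs[:, 1]                    # skip the trivial constant eigenvector
    return y > np.median(y)
```

For real images a sparse eigensolver and recursive splitting would be needed; the dense solver above is only workable for small affinity matrices.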

6.4.2 Practical calculation of 2nd moment matrix

Computing the second moment matrix over a region requires more care than intuition

from the domain of continuous images suggests, for two reasons.

Iteration

Firstly, suppose one has computed the 2nd moment matrix over a region; one obtains a rectifying affine map as in figure 6.2. Now, in the original image, the image operators (Gaussian derivatives) used to compute intensity derivatives were circular, so the corresponding operators in the normalized frame in general won't be (due to the effect of $A$). To get true covariance it is necessary to use derivative filters that are circular in the normalized image. But since the normalized frame is a priori unknown, an iterative scheme is necessary (see figure 6.7):

1. Start with $A_0$ equal to the identity matrix.

2. Compute a second moment matrix, $M_n$, in the $A_n$-normalized frame.

3. Pull back to get a moment matrix $M = A_n^\top M_n A_n$ for the original coordinate frame.

4. Set $A_{n+1}$ from $M$, taking only a partial step towards the full normalization $M^{1/2}$ according to an adaptation rate.

5. Repeat till convergence.

This idea was first formulated by Lindeberg and Gårding [78]. Convergence can be taken, say, to mean that no pixel moved between two successive iterations. Derivatives are computed with circular Gaussian derivative filters in the current normalized frame, and a fixed adaptation rate was used.
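A schematic version of the adaptation loop follows; the moment measurement and the damped update $A \leftarrow M_n^{\gamma/2} A$ are one plausible reading of the procedure, labelled as assumptions, not a transcription of the exact update used in the thesis.

```python
import numpy as np

def spd_power(M, p):
    """M^p for a symmetric positive definite 2x2 matrix M."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** p) @ V.T

def adapt_normalization(moment_in_frame, gamma=0.5, iters=20, tol=1e-3):
    """Iteratively estimate an affine normalization A for a texture region.

    moment_in_frame(A) must return the 2x2 second moment matrix measured in
    the A-normalized frame (i.e. on the region warped by A).  The damped
    composition update below is an assumption of this sketch."""
    A = np.eye(2)
    for _ in range(iters):
        Mn = moment_in_frame(A)                          # step 2
        if np.linalg.norm(2 * Mn / Mn.trace() - np.eye(2)) < tol:
            break                                        # already isotropic
        A = spd_power(Mn, 0.5 * gamma) @ A               # damped normalization
    return A
```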

Sampling issues

Figure 6.7: Diagram showing the iterative adaptation procedure.

Secondly, sampling issues must be considered when warping (rectifying). To see why this is, consider that the rectification process works by stretching the input image in the principal directions of the moment matrix, by factors proportional to the square roots of the corresponding eigenvalues. The effect of stretching, as predicted by equation (6.1), by a factor $s$ in a given direction is to reduce the amount of variation (standard deviation) in that direction by a factor of $s$. But, in the domain of discretely sampled images, if the amount of stretching in some direction is less than $1$ then the interpolation inherent in warping will cause undersampling (and hence both smoothing and aliasing) in that direction and so the actual variation in the resampled image will be less than predicted. As a result, the smallest eigenvalue of the rectifying transformation must always be larger than $1$, lest information be lost due to discretization. For example, if $A$ is always chosen to have determinant $1$, the process is likely to diverge as the variation in the smallest principal direction is smoothed away completely. A practical way to overcome this sampling problem is to always scale $A$ such that the smallest eigenvalue is equal to $1$.

6.4.3 Descriptors

The form of the descriptors has already been given in section 6.3.6; here it just remains to

set down the details of implementation omitted earlier.

Filter kernel sizes

As described in earlier sections, the descriptors applied in the normalized frames are simple histograms of filter responses. As for the sizes of kernels used, a Zernike filter kernel is supported on a disk of radius $r$ and the width of a Fourier-Mellin kernel is described by the standard deviation $\sigma$ of the associated Gaussian.

Figure 6.8: The texture region at the top has more variation in the vertical direction than the horizontal direction, so the affine normalization step will stretch more in the vertical direction than the horizontal one. The normalization on the left has smallest eigenvalue less than 1, so during rectification there is loss of resolution in the horizontal direction, which means that in the normalized frame, horizontal variation will be under-estimated (green ellipse) relative to that predicted by theory (red circle). On the next iteration, the region will be stretched even further in the vertical direction. By contrast, the normalization on the right preserves relevant resolution and signal variation.

In the experiments reported here, a small set of values of $r$ and $\sigma$, covering an octave of scale change, was used. The Zernike moments are indexed by pairs of integers $(n, m)$ with $|m| \le n$ and $n \equiv m \pmod 2$; a small range of these indices was used, each contributing one histogram bin. The Fourier-Mellin kernels are likewise indexed by integers, each kernel used contributing one histogram bin.

Intensity normalization

Before applying these descriptors, the intensity should be normalized to reduce the effect of illumination variations between different views of the texture. The basic idea is to offset and scale the intensity to fit the signal range into a fixed interval, but this must be done robustly to reduce the influence of "flaws" in the texture (and other noise such as "dead" pixels).

Figure 6.9: Twenty Brodatz images used for the synthetic data experiments.

The method used was to block the region into small windows and compute (non-robustly) the minimum and maximum intensity values over each window. This gives two data sets, the medians of which were used as robust estimates of the minimum and maximum of the texture region intensities.
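A direct sketch of this robust range estimate (the window size and interface are illustrative assumptions):

```python
import numpy as np

def robust_intensity_range(image, mask, win=16):
    """Estimate a robust (min, max) of the region's intensities by taking
    per-window extremes and then the median of those extremes."""
    mins, maxs = [], []
    h, w = image.shape
    for i in range(0, h, win):
        for j in range(0, w, win):
            m = mask[i:i + win, j:j + win].astype(bool)
            if m.any():
                block = image[i:i + win, j:j + win][m]
                mins.append(block.min())
                maxs.append(block.max())
    return np.median(mins), np.median(maxs)

def normalize_intensity(image, mask):
    lo, hi = robust_intensity_range(image, mask)
    return np.clip((image - lo) / (hi - lo + 1e-12), 0.0, 1.0)
```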

6.5 Results

6.5.1 Synthetic data

The end result of the method presented is a way to compute, for a pair of textures, a similar- ity measure which is invariant to affine deformations. As such this is not a texture classifier; to make a decision between “same texture” and “different textures” would require choosing a decision boundary, i.e. a threshold.

173

¡ ¡

¡ ¡

©¡©¡©¡©¡©¡©¡©

¡ ¡ ¡ ¡ ¡ ¡

¡ ¡

¡ ¡

¡ ¡ ¡

¢¡¢¡¢¡¢

©¡©¡©¡©¡©¡©¡©

¡ ¡ ¡ ¡ ¡ ¡

¡ ¡

¡ ¡

¡ ¡ ¡

¢¡¢¡¢¡¢

©¡©¡©¡©¡©¡©¡©

¡ ¡ ¡ ¡ ¡ ¡

¡ ¡

¡ ¡

¡ ¡ ¡

¢¡¢¡¢¡¢

©¡©¡©¡©¡©¡©¡©

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡

¡¡¡

¡ ¡

¡ ¡

¡¡¡

¡¡¡

¡¡¡

¡¡¡

¡¡¡¡

¡¡¡¡

§¡§¡§¡§¡§¡§

¨¡¨¡¨¡¨¡¨¡¨

£¡£¡£¡£¡£

¤¡¤¡¤¡¤¡¤

¥¡¥¡¥¡¥¡¥¡¥¡¥

¦¡¦¡¦¡¦¡¦¡¦

¡¡¡

¡¡¡

¡¡¡¡

¡¡¡¡

§¡§¡§¡§¡§¡§

¨¡¨¡¨¡¨¡¨¡¨

£¡£¡£¡£¡£

¤¡¤¡¤¡¤¡¤

¥¡¥¡¥¡¥¡¥¡¥¡¥

¦¡¦¡¦¡¦¡¦¡¦

¡¡¡¡

¡¡¡¡

§¡§¡§¡§¡§¡§

¨¡¨¡¨¡¨¡¨¡¨

£¡£¡£¡£¡£

¤¡¤¡¤¡¤¡¤

¥¡¥¡¥¡¥¡¥¡¥¡¥

¦¡¦¡¦¡¦¡¦¡¦

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

§¡§¡§¡§¡§¡§

¨¡¨¡¨¡¨¡¨¡¨

£¡£¡£¡£¡£

¤¡¤¡¤¡¤¡¤

¥¡¥¡¥¡¥¡¥¡¥¡¥

¦¡¦¡¦¡¦¡¦¡¦

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

£¡£¡£¡£¡£

¤¡¤¡¤¡¤¡¤

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡¡

¡¡¡

¡ ¡ ¡ ¡ ¡ ¡

¡¡¡¡¡



¢ Figure 6.10: Diagram demonstrating the generation of ¢ synthetically warped tex- ture patches from a single large Brodatz texture.

The stability and discriminability of the descriptor were tested on synthetic data generated from the Brodatz [12] collection of textures. The Brodatz textures used are shown in figure 6.9. For a given Brodatz texture, three sub-regions were chosen; these should be classified as the same texture (if the descriptor is insensitive to segmentation). Three synthetic affine warps were then applied to each of the three regions; the resulting warped texture regions should also be classified as the same texture (if the descriptor is affine invariant). See figure 6.10. From a selection of twenty Brodatz textures one thus obtains 20 × 3 × 3 = 180 texture samples for which the ground truth classification decisions are known, and the only parameter left to tune is the threshold on the similarity measure. The ROC curves for this classifier are shown in figure 6.11 (for two choices of filter kernel used to compute histograms). The results demonstrate the effectiveness of this classifier, assuming a suitable threshold has been found.

Another test of the affine invariance of the descriptor is illustrated in figure 6.14, which shows the stability of the (normalized) histograms over viewpoint changes for real data. As can be seen, the histograms overlap to a large extent despite the large changes in viewpoint and illumination.

6.5.2 Intra-image texture matching

In many cases, similar texture occurs multiple times in the same image in different orientations. Regions of textures of the same type can be determined using the texture descriptor.

Figure 6.11: ROC curves for texture matching using the texture descriptor, applied to the set of regions of warped Brodatz textures. The parameter used to trace the ROC curves is a threshold on the similarity measure. Four support region sizes were tested, giving four curves in each plot; the performance of the classifier increases with support region size. Left: full plot. Right: detail. Top: Zernike kernels. Bottom: Fourier-Mellin kernels.

This is demonstrated in figure 6.15 which shows different hand selected parts of the same

texture matched in a single image.

Intra-image texture matching can be automated by first segmenting the image into texture patches, and then matching these patches using the invariant descriptor. The segmentation algorithm is image-based so will place boundaries between regions of the same texture type (but different orientation), but the viewpoint-invariant descriptor is able to detect their similarity and group them together. We illustrate this for the right image of figure 6.16.

The image is first segmented (see section 6.4.1) into texture regions. It is deliberately

over-segmented so that textures are not grouped over weak boundaries. This is illustrated

in figure 6.17 which shows the regions found. The sky has been extensively over-segmented,

Figure 6.12: Two regions of surface texture and their projections into fifteen real images. Top row: regions indicated by colouring. Middle row: regions indicated by their boundaries. Bottom row: estimated second moment matrices; as usual, the ellipses appear to stick to the surface.

Figure 6.13: All the histograms overlaid on the same graph. Left: for the top region. Right: for the bottom region. The histograms are mostly quite similar.

Figure 6.14: The first three histogram entries (red, green, blue respectively) as a function of frame number. Left: for the top region. Right: for the bottom region.

Figure 6.15: Two different textures and four regions manually selected within each texture. Invariant descriptors were computed for each region and, using the similarity measure, each region's three most similar other regions determined. In all eight cases, the three regions from the same texture were chosen.

Figure 6.16: A wide baseline pair of images.

Figure 6.17: The segmentation of the images in figure 6.16 into regions of homogeneous texture using the method of [85]. The over-segmentation is intentional.

Figure 6.18: Similar textures at different orientation can be detected within a single image. For one region, the most similar other regions in the image are shown by arrows. The original image is shown on the left.

but regions of (locally) uniform intensity are rejected as non-texture so this has no effect

on the later stages of the algorithm anyway. One of the final groupings of texture patches is illustrated in figure 6.18.

6.5.3 Inter-image texture matching

This section describes an application of the texture descriptors to inter-image matching. It is shown that the difficulties of wide baseline matching are ameliorated by the use of these descriptors. The goal is to compute epipolar geometry via region correspondences.

Image matching using texture descriptors

The approach adopted here is to first segment each image into homogeneous texture regions (as in figure 6.17) and then attempt to match these regions between images using the invariant descriptor for each region. Since several regions will have the same texture, each region in one image will have a set of possible matches in the other. This ambiguity is resolved by computing and matching features, since sets of features (such as interest points around a distinguishing mark) are unique to a particular region. The result is a set of interest point matches which then forms the basis for computing epipolar geometry.

In more detail, suppose there is a putative match between a region in one image and a region in another. This match arises because the texture descriptors of the two regions are similar. We want to verify or reject this match. Note that even if the match is correct the two regions may not correspond to precisely the same surface patch due to imperfect segmentation, but there will be some overlap. The putative region match partially provides the affine transformation between the regions. This transformation is obtained by composing the affine normalizations associated with each region, and is determined up to a rotation and translation (i.e. a planar Euclidean transformation) between the normalized images. If the rotation were known, then interest points computed in each region could be matched by using the transformation to essentially reduce the correspondence problem to that of tracking, where only translation remains. Since only a rotation needs to be estimated, we try a number of rotations and choose the one with the maximum number of

179 feature correspondences that have normalized cross correlation above a threshold. In this

manner feature correspondences are used to verify or reject the putative region matches.

Finally, the most powerful advantage of having the affine transformation is that further

matches can be generated very efficiently for verified regions, in fact with the ease of the

short baseline case with horizontal epipolar lines.
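The rotation search described above might be sketched as follows; the hypothetical helper `count_ncc_matches` stands in for the interest-point matching under a fixed Euclidean map, and the angular sampling is an illustrative choice.

```python
import numpy as np

def best_rotation(count_ncc_matches, n_angles=36):
    """Try a discrete set of rotations between two affinely normalized
    regions and keep the one yielding the most correlation-verified
    interest point matches.

    count_ncc_matches(theta) -> number of point matches whose normalized
    cross correlation exceeds a threshold when the second region is
    rotated by theta before matching."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    scores = [count_ncc_matches(theta) for theta in angles]
    best = int(np.argmax(scores))
    return angles[best], scores[best]
```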

Algorithm summary

1. Basic segmentation: Segment each image into regions with different textures.

2. Texture labelling: Compute invariant descriptors for each texture patch.

3. Putative inter-image texture matching: Guided by the similarity measure for invariant descriptors, establish a set of putative inter-image texture region matches.

4. Verification: Verify the inter-image texture matches using interest points within each

region and the affine transformations provided by the texture descriptors.

5. Point match generation: Grow further point matches for each of the verified region

matches.

6. Robust estimation of the fundamental matrix: Using the matched interest points

use a standard method to robustly estimate epipolar geometry and a consistent set

of point correspondences.

Example and implementation details

The method will be illustrated on the image pair of figure 6.16. In this case the segmentation yields a set of texture regions in each image (as shown in figure 6.17), and interest points are computed in each image. The putative region matches (e.g. see figure 6.19), computed from the texture descriptor ranking, are verified or rejected based on their associated interest point matches. Point

matches are assessed by affinely registering their local intensity neighbourhoods and then minimizing an SSD error that corrects for lighting changes. A local neighbourhood match is accepted if the RMS fitting error is below a threshold (with intensity scaled to a fixed range) and the scale factor for lighting correction lies within a fixed interval. The interest point matches from the region match verification step which pass this test are retained; the verified region matches are shown in figure 6.20.

Figure 6.19: A region (left) and the regions (right) with which it is putatively matched. The white dots are the centroids of the regions and are for identification only; the translational part of the affine transformation between the regions is computed from image point correspondences.

In order to improve the effectiveness of the RANSAC epipolar geometry estimator, we "drown" the outliers that persist by growing more point matches (with a limited search window) using correlation guided by the local affine transformations obtained from SSD registrations. This yielded many more point matches, a large fraction of which survived the robust fitting step.

This shows that incorrect matches are far less likely to produce more incorrect matches by guided correlation than correct matches are to produce correct matches. The point matches are shown in figure 6.21.

The computed epipolar geometry is illustrated in figure 6.22. The quality is excellent, and all steps of the algorithm are automatic. The two images used are part of a larger set, and we have run the algorithm on all pairs from this set, with more than 65% of pairs giving a substantial number of interest point matches, and more than 80% giving at least some matches.

An example on a different type of scene is given in figure 6.23. Again the entire process

Figure 6.20: Verified region matches. These are the putative region matches which have been verified using correlation on corner features. Note that regions are matched despite foreshortening size changes of 50% or more, and significant changes in segmentation between the two images. There are multiple matches for some regions due to this segmentation difference.

Figure 6.21: The final interest point matches after a robust fit and non-linear optimization of the epipolar geometry. A match is indicated by the line linking the corner to its position in the other view.

Figure 6.22: Epipolar geometry computed from the point matches found.

from images to computed epipolar geometry is automatic. These two scene types (brick building facades and textured rocks) demonstrate that the invariant descriptor has sufficient stability and discriminability for both semi-regular and stochastic textures.

6.6 Assessment

We have described a texture descriptor which is affine invariant and insensitive to region segmentation. The discriminatory power of the descriptor was demonstrated by drawing its ROC curves on a small set of Brodatz textures.

The chapter also achieved the goal of constructing a prototype texture region matching system although, due to limitations of the segmentation algorithm, the real images used for demonstration were not really suitable for the algorithm – they contain many instances of the same kind of texture (brick) which makes invariant descriptors less useful.

Figure 6.23: Another example of epipolar geometry obtained using the method described in this chapter. Top: original images. Middle: some of the matched regions. Bottom: final epipolar geometry computed from interest point matches.

Chapter 7

Discussion and Future Work

This chapter suggests improvements to the work described in the dissertation as well as entirely new investigations that spring from it. The chapters will be discussed individually for specific points.

7.1 Grouping

The work on grouping with geometric constraints provided proof of concept; the prototype algorithms and models were shown to apply to various classes of real images.

Future work could certainly include working out more geometries to exploit, and examples can be seen in table 7.1, but working out the geometry is fairly easy compared to the task of making practical algorithms which can work on a larger class of real images.

Table 7.1: Some other geometric constraints which could be explored: a family of planar homologies, and a conjugate rotation (a homography) for n-fold symmetry, each model having a small number of parameters. The left column shows example images, the middle column shows a schematic diagram to emphasize the geometry in the example image and the right column alludes briefly to a possible model.

The main obstacles to practicality are (1) feature detection and (2) the inlier/outlier

decision scheme used.

The paradigm of feature detection followed by processes working just on the features is too simplistic. For example, although the model of a set of concurrent lines can be verified/rejected by looking for inliers in the form of detected line segments, a more direct verification based on direct inspection of image intensities would be less sensitive to failures of segmentation and therefore more robust in the presence of weak edges or edges at unexpected scales. Similarly, elation grouping by matching up pairs of line intersections is a valid proof of concept but it would be more interesting to generalize this to less restrictive feature types, or to use grey-level correlation to estimate elations directly. Features may be useful for generating hypotheses efficiently; a more general method might be to use a clustering method, such as the TCA (Transformed Component Analysis) method of Frey et al [41], to detect which features to look for.

Some observations about outlier rejection: robust fitting with RANSAC is often thought of as the search for a basis of inliers but this is only half of the criterion for success – the model instantiated from the basis of inliers must also be accurate enough to correctly evaluate the support for the model. In the case of fitting epipolar geometry, or a homography, such good bases of inliers usually have wide image extent. But in the case of fitting an elation, or a pencil of equally spaced parallel lines, any basis of inliers generally has small image extent (because the basis elements are assumed to be "adjacent" in the repeating pattern). Thus the two criteria for a good basis conflict; an (obvious) solution to the problem is to use incremental fitting when evaluating the support for a given basis, making sure only to include new image data from regions where the fitted model is accurate. A general framework would be to incorporate (first order) error propagation into the model fitting and to accept/reject image support data with least uncertainty first.

The algorithms should be put to some practical uses, e.g. by running them on large collections of images of architectural scenes.

7.2 Reconstruction

It was hoped that a better approximation to geometric error could be achieved by careful analysis or refinement of the constraints 3.2 and 3.4 given in chapter 3, but such attempts quickly led to schemes only slightly different from specialized bundle adjustment or to iterative methods (re-weighted least squares), which defeats the original purpose of something which is almost as accurate as, but faster than, bundle adjustment.

A comparison of the proposed method with the duality method of Hartley [56] is lacking from the experiments. The original publication [129] includes such results (provided by

Hartley), but not on the same data set.

There is a multitude of possibilities for different motion and calibration (including lens distortion) models which may or may not make structure estimation more accurate. However, within the current dominant framework of seeking the Maximum Likelihood Estimate (under the assumption of Gaussian measurement noise) there is really only one correct method of solution, namely to minimize the sum of squared residuals (or a robustified version [160]), by whatever means. The problems with this are well known: (1) how to get an initial estimate, (2) how to get to the global minimum and (3) how to get there quickly.

What could be done differently?

Perhaps one could change the error model; no sensible noise model should allow a single reprojection error to be of the order of the image size, and the Gaussian model does penalize these strongly, but what about 10 pixels? Or 5 pixels? Is that "impossible" or merely "unlikely"? This of course depends on the size of image and one's beliefs about measurement accuracy but there is something to be said for a bounded error model; only with such a model can a solution be rejected as outright wrong and not merely unlikely.

Incorporating inequality constraints into a gradient-based minimizer, such as Levenberg-Marquardt, is of course both tedious and not clearly of much use, but might it be possible to go about it in the opposite order, i.e. to solve the inequality constraints directly (by some clever and yet-to-be-worked-out computational geometry magic) to compute an entire space of solutions and not just a single estimate? One strong advantage of such a scheme

187 would of course be its global and provable “convergence”. That assumes, of course, that the

given correspondences are correct but a related advantage of the approach is that when it

fails, it will be due to a discernible subset of inconsistent data, so may be better able to

report the reason for failing than generic numerical minimizers which myopically explore a priori completely general error surfaces. Considering that the dimensionality of practical structure-from-motion estimation problems runs into thousands of parameters, the com-

putational geometry needed would require a representation that is very clever indeed (e.g.

something must be done about the gauge problem) but it may be possible and it might

even be efficient.

7.3 Auto-calibration

In theoretical geometric terms, the work on calibration was quite successful: new facts about the problems studied were discovered and practical numerical methods for their solution developed.

The practical justification for studying “minimal solutions” is two-fold. Firstly, such studies lead to insights about the structure of the problem, such as critical configurations. The number of solutions is of interest because it gives a measure of the ambiguity of the model; the more solutions the minimal case has, the more solutions (i.e. local minima) are to be expected in a bundle adjustment scheme. Secondly, if such minimal solutions are accurate enough, they would be useful in robust estimation with the RANSAC paradigm.

Just as the minimal “2-view, 7-point” method for estimating epipolar geometry provides a useful constraint for interest point tracking by weeding out solutions with unlikely image motions, a minimal auto-calibration solver would allow such a tracker to impose prior knowledge about, say, focal length. The accuracy required for this is a point which remains to be proven, though.
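The role of such a solver inside the RANSAC paradigm can be sketched generically; the following is an illustration under assumed interfaces (`minimal_solver`, `residual` and the data layout are hypothetical and stand in for, e.g., the 7-point fundamental-matrix solver or a minimal auto-calibration solver), not an implementation from this thesis.

```python
import numpy as np

def ransac(data, minimal_solver, sample_size, residual, thresh, iters=500, seed=0):
    """Sketch of RANSAC with a pluggable minimal solver.

    `data` is an (n, d) array of measurements; `minimal_solver` maps a minimal
    sample to a *list* of candidate models (minimal problems often have several
    solutions); `residual(model, data)` returns one error per datum.
    """
    rng = np.random.default_rng(seed)
    best_model, best_support = None, -1
    for _ in range(iters):
        sample = data[rng.choice(len(data), size=sample_size, replace=False)]
        for model in minimal_solver(sample):
            support = int(np.sum(residual(model, data) < thresh))
            if support > best_support:
                best_model, best_support = model, support
    return best_model, best_support
```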

The following specific technical investigations are related to the work on calibration and should be carried out:

Symbolic/generic saturation. In the square pixel case, the saturation of the ideal of constraints can be computed numerically; it is desirable to avoid this extra source of cost and numerical error and be able to write down all quintic constraints directly.

Linear method for five or more views. For five or more cameras, the number of quintic equations obtained is greater than the total number of quintic monomials, so in principle a linear solution is available. Preliminary experiments with this have not been successful.

Resistance to noise. How gracefully does the estimator performance degrade as a function of noise levels? In many practical cases which aren’t degenerate configurations, the estimators fail to give useful results; some idea of why this is, and what can be done about it, would be useful.

7.4 Texture

The work in chapter 6 achieved the goal of implementing a prototype system for computing correspondence between textured regions; the results are encouraging and efforts to improve the performance (the segmentation and normalization steps are currently very slow) will hopefully lead to its becoming a practical method.

Using affine skew normalization before computing texture invariants was novel. The power of the descriptor is demonstrated by the experiments on Brodatz textures but is less clear on the real images due to the difficulty of getting the texture segmentation algorithm to produce usable regions for real images of interest.

The algorithm involves many stages, and the list of theoretical problems whose solution would make a practical difference is necessarily quite long:

Tuning existing descriptors. The texture descriptor has several parameters in it which could be tuned for better performance. The area under ROC curves, cross-validation performance, or some other measure of the power of a descriptor could be used as the criterion for finding optimal parameter settings. Tuning the parameters for optimal performance would also make the comparison of different algorithms, or of variations such as Zernike versus Fourier-Mellin filters, more meaningful.
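One way to operationalize the ROC criterion is sketched below; the names `descriptor`, `region_pairs` and `labels`, and the simple distance-based matcher, are illustrative assumptions rather than the system of chapter 6.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_for_setting(descriptor, region_pairs, labels):
    """Score one parameter setting of a texture descriptor by the area under the
    ROC curve of a distance-based matcher.

    `descriptor(region)` returns a feature vector for the given setting;
    `region_pairs` is a list of (region_a, region_b) and `labels` marks which
    pairs come from the same texture.
    """
    # Smaller distance should mean "same texture", so negate it to obtain a
    # score that increases with the positive (matching) class.
    scores = [-np.linalg.norm(descriptor(a) - descriptor(b)) for a, b in region_pairs]
    return roc_auc_score(labels, scores)

# Tuning is then a search over candidate settings, e.g.
# best = max(settings, key=lambda s: auc_for_setting(make_descriptor(s), pairs, labels))
```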

Texture scale selection. Using the second moment matrix to normalize for affine skew simplifies the affine invariance problem by reducing it to the similarity invariance problem, and invariance to similarity transformations is easier to achieve because it is close to rotational invariance. A mechanism for scale selection would still be desirable, to avoid the need for running rotationally invariant descriptors at multiple scales. Without a choice of scale, the texture descriptor is also not a true invariant in the sense that it doesn’t yield a small set of numbers that are invariant to affine transformations.
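For reference, the skew normalization step itself can be summarized in a few lines; the gradient filter and patch-wide windowing below are assumptions for illustration, not the settings used in chapter 6.

```python
import numpy as np
from scipy.ndimage import sobel

def skew_normalizing_transform(patch):
    """Sketch of affine skew normalization via the second moment matrix.

    Returns a 2x2 matrix N = M^{1/2} such that in coordinates x' = N x the
    patch's second moment matrix becomes (a multiple of) the identity; the
    residual ambiguity is a similarity (rotation and scale), as discussed above.
    """
    p = patch.astype(float)
    Ix, Iy = sobel(p, axis=1), sobel(p, axis=0)
    M = np.array([[(Ix * Ix).mean(), (Ix * Iy).mean()],
                  [(Ix * Iy).mean(), (Iy * Iy).mean()]])
    w, V = np.linalg.eigh(M)                      # M is symmetric positive semi-definite
    return V @ np.diag(np.sqrt(np.maximum(w, 1e-12))) @ V.T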

Using relative phase. The rotationally invariant descriptor given in equation (6.3), obtained by taking the absolute value of responses to polar-separable complex linear filters, discards the relative (complex) phase information between neighbouring pixels. If there were some way to retain this it should greatly improve discrimination, as illustrated in figure 5.2 from chapter 5. Whether it is worth selecting only some pairs (and how one would do that), or whether recording relative phases from all possible pairs is best, remains to be seen.
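One simple way of recording relative phase is sketched below; it is an assumption, not the descriptor of chapter 6, and whether it preserves rotational invariance depends on the filters’ angular orders, which is not addressed here.

```python
import numpy as np

def pair_phase_feature(r1, r2):
    """For complex filter responses r1, r2 at two neighbouring positions,
    r1 * conj(r2) has magnitude |r1||r2| and argument equal to their relative
    phase, which a magnitude-only descriptor discards."""
    return np.abs(r1), np.abs(r2), np.angle(r1 * np.conj(r2))
```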

Other texture descriptors. Other texture descriptors should be considered; the empirical joint distributions of pixel values modelled by MRF models, or any other statistics known to be sufficient for texture synthesis, are obviously attractive because they are complete descriptions. The obstacle to using them is their lack of invariance properties.

Improve segmentation. The current matching system is limited by the performance of the segmentation algorithm. Improving the segmentation algorithm, e.g. by implementing techniques from the literature on texture segmentation, is one direction of future effort but in some sense the description of texture must precede its segmentation. Combining these two processes is perhaps the better strategy.

Better ways to verify texture region matches. In the wide baseline matching application in chapter 6, the affine invariant texture descriptor was used to hypothesize region matches. The verification of a match hypothesis was done by means of interest point matching. This is a weakness of the current design because the cases where interest point matching is most likely to fail are exactly the textured regions.

Shape from texture. Retaining affine-variant [sic] texture descriptors can be useful, e.g. for segmenting/grouping texture or for structure recovery: if the local texture statistics measured from an image of a curved surface can be used to estimate the local affine transformation between nearby neighbourhoods, then it is in principle possible to estimate the entire local differential geometry of the surface covered by the texture, but without forming correspondence between individual texture elements as in [88]. Clerc and Mallat investigated something close to what is asked for here in [19], but the technique is still far from practical.

7.5 Matching

The chapter is mostly about the results of implementing a method published elsewhere, but includes some novel contributions: the use of local neighbourhood registration for verification of match hypotheses and for generation of new ones.

Anandan and Avidan [2] used local affine registrations (from optical flow) to estimate local affine epipolar geometry and hence global projective epipolar geometry, and so do not pass directly from local registrations to constraints on the global epipolar geometry via constraints like those from section 5.3.

Many implementation details can be improved: the interest point detection algorithm described is apparently inferior [96] to the one reported by Mikolajczyk and Schmid [97], since large-scale corners tend to obliterate nearby small-scale corners; the algorithm from [97] should be used instead. The intensity normalization scheme fails on very dim images since most interest points are rejected as being too weak; the familiar ad hoc idiom of selecting only the n strongest interest points is tempting, but still ad hoc.
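That ad hoc idiom is easy to state as code; the detector, the value of n and the minimum separation below are illustrative assumptions, not the settings of the implementation described here.

```python
import numpy as np
from skimage.feature import corner_harris, corner_peaks

def top_n_interest_points(image, n=500):
    """Keep the n strongest corner responses rather than applying an absolute
    threshold, so that very dim images still yield interest points."""
    response = corner_harris(image.astype(float))
    return corner_peaks(response, min_distance=5, num_peaks=n)
```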

Experimental comparison of the “match growing” approach to enforcing local match neighbourhood coherence with existing methods, such as those reported by Schmid [131], should be carried out.

The chapter is mostly about matching as in “find correspondences and epipolar geometry” and not as in “are these images of the same thing or scene?”. The latter interpretation, which includes finding correspondences for non-rigid scenes, is certainly important, but the demands on feature detection and image description were too great for the time available. An opportunistic matching system incorporating several feature types is probably needed for this.

Other features and invariant descriptors. A matching system using several feature types will need several types of invariant descriptor. There are still plenty of technical challenges in this area:

Scale selection, being a 1D search, is fairly cheap to carry out in practice, but selection of local affine neighbourhoods, being an iterative search over three parameters, is quite expensive. There may be ways to speed it up, or to avoid iteration altogether (a minimal sketch of the 1D scale search is given after this list).

Can one compute local invariant descriptors for non-flat neighbourhoods, such as the corners of windows on a facade? Currently, these are either discarded (when there is large camera motion) or cause estimation bias (when the baseline is short enough that a flat model “works”).

The grouping together of low level features gives rise to new features with more discriminating and stable invariants.

Chapter 5 listed many types of feature and invariants. If these were available, how would one weight their relative importance?
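The 1D scale search referred to in the first item above can be sketched as follows (Lindeberg-style scale selection over a scale-normalized Laplacian); the scale range is an assumption, and filtering the whole image per scale just to read one pixel is wasteful and done only for clarity.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def select_scale(image, row, col, scales=np.geomspace(1.0, 16.0, 20)):
    """Evaluate the scale-normalized Laplacian at one point over a range of
    scales and keep the extremum."""
    img = image.astype(float)
    responses = [abs((s ** 2) * gaussian_laplace(img, s)[row, col]) for s in scales]
    return scales[int(np.argmax(responses))]
```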

Resolving ambiguity. Ambiguous features are features which look similar to many other features in the same image; such features are difficult to match correctly. Generally, ambiguity arises not by chance but from repetitive scene structure. If in addition the scene has structural symmetries the problem is even worse. A serious example of this is shown in figure 7.1, where the task is to find correspondences across the sides of a building. The disambiguating cues (image information which can resolve the ambiguity) in this example are very subtle indeed.

Figure 7.1: Illustration of ambiguity in wide baseline image matching. Firstly, all the windows on this building look similar at some level; it is hard to know which windows match without using global constraints. Secondly, these are images of two sides of a house. Which side is each picture taken from? The clues which disambiguate are so small (little surface markings on the walls) that it is almost necessary to segment foreground (such as people passing in the street) from background (the house) to use them. How else to tell that a person seen in one view but not the other is an outlier, as opposed to a disambiguating feature?

In the past, ambiguous features have mostly been discarded [5, 171] when found. This is effective when one can afford to throw away that information. An alternative is to look for local coherence [133] of matches, to do grouping to find more distinctive features, or to process the least ambiguous features first in the hope that geometric constraints (e.g. from estimated structure) can disambiguate the remaining features later.
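The “least ambiguous first” strategy can be sketched with a simple distinctiveness ranking; the descriptor arrays and the best-to-second-best ratio criterion below are illustrative assumptions, not the matching system of chapter 5.

```python
import numpy as np

def rank_by_ambiguity(desc1, desc2):
    """For each feature in image 1, compare its best and second-best descriptor
    distances to image 2, and order the features by that ratio so the most
    distinctive matches are processed first."""
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)
    rows = np.arange(len(desc1))
    ratio = d[rows, nearest[:, 0]] / np.maximum(d[rows, nearest[:, 1]], 1e-12)
    order = np.argsort(ratio)                     # small ratio = least ambiguous
    return order, nearest[:, 0], ratio
```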

Unify correspondence and structure estimation. The epipolar constraint is now a ubiquitous matching constraint. Albeit expressed in image coordinates, it is nevertheless a constraint imposed by 3D structure on the possible 2D correspondences. Passing to three views, there are trifocal matching constraints, and so on for more views. In an n-view matching problem, reasoning about structure is probably crucial when trying to resolve ambiguities. In principle robust fitting with RANSAC can solve this, but in practice there might not be a basis satisfying both requirements of being inlying and giving a sufficiently accurate structure estimate, especially when correspondences are hard to come by.

For example, the image sequence used in figure 5.5 of chapter 5 has almost horizontal epipolar lines, so that many false matches arise from specularities, which are at fixed image positions (the vase is on a turntable) and so will be admitted by the epipolar constraints. Using 3D structure as a matching constraint resolved those ambiguities completely.

A possible general framework for this is to acknowledge the measurement error in each RANSAC basis and to evaluate a given RANSAC hypothesis with reference to a “confidence region” of possible estimates.

Beyond stereo pairs – applications. The matching problem is often reduced to the case of two views, e.g. by the reasoning that for several images we can just match each pair separately. Consider the following applications, though:

Photo-organization. Given a pile of images, divide them into related sub-piles and organize each pile by tying it together via estimation of 3D structure.

Database search. Given a query image, find matching images in a database of other images.

Video indexing. Given a video divided into shots of many frames each, group together shots taken of the same location.

In all cases, the time it would take to do pair-wise matching is too great for practical purposes. In the case of video indexing, it would be madness not to exploit the special structure, e.g. by using short base-line techniques to track features over each shot; the resulting feature tracks will be much more discriminating since they (1) were seen in more views and (2) (may) have associated 3D structure. This reduces the problem to matching a few hundred shots, a similar situation to the photo-organisation problem.

The complexities of the three applications listed are of similar orders of magnitude, too large for exhaustive methods. Luckily the combination of invariant descriptors and efficient indexing structures is applicable in all three cases, so significant successes in these areas should be expected within three years or so.
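A minimal sketch of the “invariant descriptors plus efficient indexing” combination is given below, using a kd-tree and nearest-neighbour voting; the descriptor arrays and the voting scheme are assumptions for illustration, not the system of chapter 5 or 6.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_index(descriptors_per_image):
    """Index all invariant descriptors of a collection once; descriptors are
    assumed to be given as one (n_i, d) array per image."""
    all_desc = np.vstack(descriptors_per_image)
    owner = np.concatenate([np.full(len(d), i) for i, d in enumerate(descriptors_per_image)])
    return cKDTree(all_desc), owner

def rank_images(tree, owner, query_desc, n_images):
    """Answer a query image by letting each of its descriptors vote for the
    image owning its nearest neighbour in the index."""
    _, idx = tree.query(query_desc, k=1)
    votes = np.bincount(owner[idx], minlength=n_images)
    return np.argsort(votes)[::-1]                # images ranked by vote count
```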

Appendix A

Estimating projectivities between projective spaces

Consider the problem of estimating a projectivity $\mathtt{H}$ between two projective spaces from point and plane correspondences. If a point $\mathbf{X}$ corresponds to a point $\mathbf{X}'$ then $\mathtt{H}\mathbf{X} \simeq \mathbf{X}'$ up to scale, which gives the linear constraint(s) on the entries of $\mathtt{H}$, easy to write down in coordinates:
$$ X'_i\,(\mathtt{H}\mathbf{X})_j - X'_j\,(\mathtt{H}\mathbf{X})_i = 0 \qquad \text{for all } i < j. $$
For a plane $\boldsymbol{\pi}$ corresponding to a plane $\boldsymbol{\pi}'$, incidence is preserved exactly when $\mathtt{H}^{\top}\boldsymbol{\pi}' \simeq \boldsymbol{\pi}$ up to scale, so the constraint again becomes linear equation(s) in the entries of $\mathtt{H}$, which can be written out in coordinates similarly:
$$ \pi_i\,(\mathtt{H}^{\top}\boldsymbol{\pi}')_j - \pi_j\,(\mathtt{H}^{\top}\boldsymbol{\pi}')_i = 0 \qquad \text{for all } i < j. $$
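The linear estimation from point correspondences can be sketched in DLT style as follows; coordinate normalization and the plane constraints are omitted, and nothing below is taken from the implementation used in this thesis.

```python
import numpy as np

def estimate_projectivity(X, Xp):
    """Estimate H from point correspondences X <-> X' in P^3 (homogeneous
    4-vectors).  Each pair gives the constraints
        X'_i (H X)_j - X'_j (H X)_i = 0,
    linear in the 16 entries of H; at least five points in general position
    are needed."""
    rows = []
    for x, xp in zip(np.asarray(X, float), np.asarray(Xp, float)):
        for i in range(4):
            for j in range(i + 1, 4):
                r = np.zeros((4, 4))
                r[j] += xp[i] * x                 # coefficients of row j of H
                r[i] -= xp[j] * x                 # coefficients of row i of H
                rows.append(r.ravel())
    _, _, Vt = np.linalg.svd(np.array(rows))
    return Vt[-1].reshape(4, 4)                   # null vector of the system, up to scale
```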

Bibliography

[1] M. Abadie. Solution de la question 296. Nouv. Ann. Math., 14(50):142–145, 1855.

[2] P. Anandan and S. Avidan. Integrating local affine into global projective images in the

joint image space. In Proc. European Conference on Computer Vision, pages 907–921.

Springer-Verlag, June 2000.

[3] M. Armstrong, A. Zisserman, and R. Hartley. Self-calibration from image triplets. In

Proc. European Conference on Computer Vision, LNCS 1064/5, pages 3–16. Springer-

Verlag, 1996.

[4] C. Ballester and M. González. Affine invariant texture segmentation and shape from

texture by variational methods. Journal of Mathematical Imaging and Vision, 9:141–

171, 1998.

[5] A. Baumberg. Reliable feature matching across widely separated views. In Proc. IEEE

Conference on Computer Vision and Pattern Recognition, pages 774–781, 2000.

[6] P. A. Beardsley, P. H. S. Torr, and A. Zisserman. 3D model acquisition from extended

image sequences. In Proc. 4th European Conference on Computer Vision, LNCS 1065,

Cambridge, pages 683–695, 1996.

[7] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based

motion estimation. In Proc. European Conference on Computer Vision, pages 237–

252, 1992.

[8] M. J. Black and P. Anandan. A framework for the robust estimation of optical flow. In

Proc. 4th International Conference on Computer Vision, Berlin, pages 231–236, 1993.

[9] A. Blake and C. Marinos. Shape from texture: estimation, isotropy and moments.

Artificial Intelligence, 45(3):323–380, 1990.

[10] N. Boujemaa, S. Boughorbel, and C. Vertan. Soft color signatures for image retrieval

by content. In EUSFLAT 2001, 2001.

[11] B. Brillault-O’Mahony. New method for vanishing point detection. CVGIP: Image

Understanding, 54(2):289–300, 1991.

[12] P. Brodatz. Textures: A Photographic Album for Artists & Designers. Dover, New York,

1966.

[13] J. F. Canny. A computational approach to edge detection. IEEE Transactions on Pat-

tern Analysis and Machine Intelligence, 8(6):679–698, 1986.

[14] S. Carlsson. Duality of reconstruction and positioning from projective views. In IEEE

Workshop on Representation of Visual Scenes, Boston, 1995.

[15] S. Carlsson and D. Weinshall. Dual computation of projective shape and camera

positions from multiple images. International Journal of Computer Vision, 27(3),

1998.

[16] M. J. Chantler, G. McGunnigle, and J. Wu. Surface rotation invariant texture classifi-

cation using photometric stereo and surface magnitude spectra. In Proc. 11th British

Machine Vision Conference, Bristol, pages 486–495, 2000.

[17] M. Chasles. Question 296. Nouv. Ann. Math., 14(50), 1855.

[18] D. Chetverikov. Pattern regularity as a visual key. Image and Vision Computing,

18:975–985, 2000.

[19] M. Clerc and S. Mallat. Shape from texture through deformations. In Proc. 7th Inter-

national Conference on Computer Vision, Kerkyra, Greece, pages 405–410, September

1999.

[20] C. Coelho, M. Straforini, and M. Campani. Using geometrical rules and a priori

knowledge for the understanding of indoor scenes. In Proc. British Machine Vision

Conference, pages 229–234, 1990.

[21] R. T. Collins and R. S. Weiss. Vanishing point calculation as a statistical inference on

the unit sphere. In Proc. 3rd International Conference on Computer Vision, Osaka,

pages 400–403, December 1990.

[22] D. Cox, J. Little, and D. O’Shea. Using Algebraic Geometry. Springer, 1998.

[23] A. Criminisi. Accurate Visual Metrology from Single and Multiple Uncalibrated Im-

ages. PhD thesis, University of Oxford, Dept. Engineering Science, December 1999.

D.Phil. thesis.

[24] A. Criminisi and A. Zisserman. Shape from texture: homogeneity revisited. In Proc.

11th British Machine Vision Conference, Bristol, pages 82–91, UK, September 2000.

[25] A. Efros and T. Leung. Texture synthesis by non-parametric sampling. In Proc.

7th International Conference on Computer Vision, Kerkyra, Greece, pages 1039–1046,

September 1999.

[26] D. Eisenbud. Commutative Algebra with a View Toward Algebraic Geometry.

Springer-Verlag, 1988.

[27] O. D. Faugeras. What can be computed in three dimensions with an uncalibrated

stereo rig? Journal of the Optical Society of America, 1993. Submitted.

[28] O. D. Faugeras, Q.-T. Luong, and S. Maybank. Camera self-calibration: Theory and

experiments. In Proc. European Conference on Computer Vision, LNCS 588, pages

321–334. Springer-Verlag, 1992.

[29] O. D. Faugeras and S. J. Maybank. Motion from point matches: Multiplicity of solu-

tions. International Journal of Computer Vision, 4:225–246, 1990.

[30] O. D. Faugeras and B. Mourrain. On the geometry and algebra of point and line cor-

respondences between n images. In Proc. International Conference on Computer

Vision, pages 951–962, 1995.

[31] O. D. Faugeras and T. Papadopoulo. Grassmann-cayley algebra for modeling systems

of cameras and the algebraic equations of the manifold of trifocal tensors. Technical

Report 3225, INRIA, Sophia-Antipolis, France, 1997.

[32] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fit-

ting with applications to image analysis and automated cartography. Comm. Assoc.

Comp. Mach., 24(6):381–395, 1981.

[33] A. W. Fitzgibbon and A. Zisserman. Automatic camera recovery for closed or open

image sequences. In Proc. European Conference on Computer Vision, pages 311–326.

Springer-Verlag, June 1998.

[34] M. M. Fleck, D. A. Forsyth, and C. Bregler. Finding naked people. In Proc. European

Conference on Computer Vision, LNCS 1065, pages 591–602. Springer-Verlag, 1996.

[35] L. Florack. The Syntactical Structure of scalar Images. PhD thesis, University of

Utrecht, Holland, 1993.

[36] W. Förstner and E. Gülch. A fast operator for detection and precise location of dis-

tinct points, corners and center of circular features. In Proc. of ISPRS Intercommis-

sion Conference on Fast Processing of Photogrammetric Data, Interlaken, Switzerland,

pages 281–305, June 2-4 1987.

[37] D. A. Forsyth. Recognizing algebraic surfaces from their outlines. In Proc. 4th Inter-

national Conference on Computer Vision, Berlin, pages 476–480, 1993.

[38] J. M. Francos, A. Z. Meiri, and B. Porat. A Wold-like decomposition of 2-d discrete

homogeneous random fields. Annals Appl. Prob., 5:248–260, February 1995.

[39] J. M. Francos, B. Porat, and A. Z. Meiri. Orthogonal decompositions of 2-d nonhomo-

geneous discrete random fields. Math. Control, Sig. Syst., 8:375–389, October 1995.

[40] W. T. Freeman and E. H. Adelson. The design and use of steerable filters. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 13:891–906, 1991.

[41] B. J. Frey and N. Jojic. Transformed component analysis: Joint estimation of spatial

transformations and image components. In Proc. 7th International Conference on

Computer Vision, Kerkyra, Greece, pages 1190–1196, September 1999.

[42] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian

restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelli-

gence, 6(6):721–741, November 1984.

[43] C. G. Gibson. Elementary Geometry of Algebraic Curves. Cambridge University Press,

1998.

[44] S. Gilles. Robust Description and Matching of Images. PhD thesis, Dept. of Engineer-

ing Science, University of Oxford, 1998.

[45] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University

Press, Baltimore, MD, second edition, 1989.

[46] V. Gouet, P. Montesinos, and D. Pele.´ A fast matching method for color uncalibrated

images using differential invariants. In Proc. 9th British Machine Vision Conference,

Southampton, pages 367–376, 1998.

[47] G.-M. Greuel, G. Pfister, and H. Schönemann. Singular version 1.2 user man-

ual. In Reports On Computer Algebra, number 21 in Reports On Computer

Algebra. Centre for Computer Algebra, University of Kaiserslautern, June 1998.

[48] G. M. Haley and B. S. Manjunath. Rotation-invariant texture classification using a

complete space-frequency model. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 8(2):255–269, 1999.

[49] K.J. Hanna and N.E. Okamoto. Combining stereo and motion analysis for direct esti-

mation of scene structure. In Proc. 4th International Conference on Computer Vision,

Berlin, pages 357–365, 1993.

[50] C. J. Harris and M. Stephens. A combined corner and edge detector. In Proc. 4th

Alvey Vision Conference, Manchester, pages 147–151, 1988.

[51] R. I. Hartley. Euclidean reconstruction from uncalibrated views. In J. L. Mundy,

A. Zisserman, and D. Forsyth, editors, Proc. 2nd European-US Workshop on Invari-

ance, Azores, pages 187–202, 1993.

[52] R. I. Hartley. Projective reconstruction and invariants from multiple images. IEEE

Transactions on Pattern Analysis and Machine Intelligence, 16:1036–1041, October

1994.

[53] R. I. Hartley. Self-calibration from multiple views with a rotating camera. In Proc.

European Conference on Computer Vision, LNCS 800/801, pages 471–478. Springer-

Verlag, 1994.

[54] R. I. Hartley. In defence of the 8-point algorithm. In Proc. International Conference

on Computer Vision, pages 1064–1070, 1995.

[55] R. I. Hartley. Multilinear relationships between coordinates of corresponding image

points and lines. In Proceedings of the Sophus Lie Symposium, Nordfjordeid, Norway,

1995.

[56] R. I. Hartley. Minimizing algebraic error. Philosophical Transactions of the Royal

Society of London, Series A, 356(1740):1175–1192, 1998.

[57] R. Hartshorne. Algebraic Geometry. Springer, 1997.

[58] F. Helmert. Die Mathematischen und Physikalischen Theorien der höheren

Geodäsie. Teil 1. Teubner, Leipzig, 1880.

[59] O. Hesse. Die cubische Gleichung, von welcher die Lösung des Problems der Homo-

graphie von M. Chasles abhängt. J. Reine Angew. Math., 62:188–192, 1863.

[60] A. Heyden. Projective structure and motion from image sequences using subspace

methods. In Scandinavian Conference on Image Analysis, Lappenraanta, 1997.

[61] A. Heyden and K. Åström. Euclidean reconstruction from image sequences with

varying and unknown focal length and principal point. In Proc. IEEE Conference on

Computer Vision and Pattern Recognition, 1997.

[62] D. Jacobs. Linear fitting with missing data: Applications to structure from motion

and to characterizing intensity images. In Proc. IEEE Conference on Computer Vision

and Pattern Recognition, pages 206–212, 1997.

[63] B. Julesz. Visual pattern discrimination. IRE Transactions on Information Theory, IT-8:

84–92, 1962.

[64] B. Julesz. Foundations of Cyclopean Perception. University of Chicago Press, 1971.

[65] J. J. Koenderink. The structure of images. Biol. Cybernetics, 50:363–370, 1984.

[66] S. Konishi and A. L. Yuille. Statistical cues for domain specific image segmentation

with performance analysis. In Proc. IEEE Conference on Computer Vision and Pattern

Recognition, 2000.

[67] J. Krumm and S. A. Shafer. Texture segmentation and shape in the same image. In

Proc. 5th International Conference on Computer Vision, Boston, pages 121–127, 1995.

[68] E. Kruppa. Zur Ermittlung eines Objektes aus zwei Perspektiven mit innerer Orien-

tierung. Sitz.-Ber. Akad. Wiss., Wien, Math. Naturw. Abt. IIa, 122:1939–1948, 1913.

[69] T. Leung and J. Malik. Detecting, localizing and grouping repeated scene elements

from an image. In Proc. European Conference on Computer Vision, LNCS 1064, pages

546–555. Springer-Verlag, 1996.

[70] T. Leung and J. Malik. Contour continuity and region based image segmenta-

tion. In Proc. European Conference on Computer Vision, LNCS 1406, pages 544–559.

Springer-Verlag, 1998.

[71] T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In Proc.

7th International Conference on Computer Vision, Kerkyra, Greece, pages 1010–1017,

Kerkyra, Greece, September 1999.

[72] T. Leung and J. Malik. Representing and recognizing the visual appearance of ma-

terials using three-dimensional textons. International Journal of Computer Vision,

43(1):29–44, June 2001.

[73] D. Liebowitz and A. Zisserman. Metric rectification for perspective images of planes.

In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages 482–

488, June 1998.

[74] T. Lindeberg. Feature detection with automatic scale selection. International Journal

of Computer Vision, 30(2):77–116, 1998.

[75] T. Lindeberg. A scale selection principle for estimating image deformations. Image

and Vision Computing, 16(14):961–977, August 1998.

[76] T. Lindeberg. Principles for automatic scale selection. In B. Jahne, editor, Handbook

of Computer Vision and Applications, chapter 11, pages 239–275. Academic Press,

1999.

[77] T. Lindeberg and J. Gårding. Shape from texture from a multi-scale perspective. In

Proc. 4th International Conference on Computer Vision, Berlin, pages 683–691, May

1993.

[78] T. Lindeberg and J. Gårding. Shape-adapted smoothing in estimation of 3-d depth

cues from affine distortions of local 2-d brightness structure. In Proc. 3rd European

Conference on Computer Vision, Stockholm, Sweden, LNCS 800, pages 389–400, May

1994.

[79] F. Liu and R. Picard. Periodicity, directionality, and randomness: Wold features for

image modeling and retrieval. IEEE Transactions on Pattern Analysis and Machine

Intelligence, 1994.

[80] B. D. Lucas and T. Kanade. An iterative image registration technique with an appli-

cation to stereo vision. In Proc. of the 7th International Joint Conference on Artificial

Intelligence, pages 674–679, 1981.

[81] E. Lutton, H. Maitre, and J. Lopez-Krahe. Contribution to the determination of van-

ishing points using Hough transform. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 16(4):430–438, April 1994.

[82] F. MacAulay. On some formulae in elimination. In Proceedings of the London Math-

ematical Society (3-27), May 1902.

[83] F. MacAulay. Note on the resultant of a number of polynomials of the same degree

(14–21). In Proceedings of the London Mathematical Society, June 1921.

[84] S. Mahamud and M. Hebert. Iterative projective reconstruction from multiple views.

In Proc. IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head

Island, South Carolina, 2000.

[85] J. Malik, S. Belongie, J. Shi, and T. Leung. Textons, contours and regions: Cue com-

bination in image segmentation. In Proc. 7th International Conference on Computer

Vision, Kerkyra, Greece, pages 918–925, Kerkyra, Greece, September 1999.

[86] J. Malik and P. Perona. A computational model of texture segmentation. In Proc. IEEE

Conference on Computer Vision and Pattern Recognition, San Diego, pages 326–332,

June 1989.

[87] J. Malik and P. Perona. Preattentive texture discrimination with early vision mecha-

nisms. Journal of the Optical Society of America, 10:923–932, 1989.

[88] J. Malik and R. Rosenholtz. Computing local surface orientation and shape from

texture. International Journal of Computer Vision, 2(23):143–168, 1997.

[89] B. S. Manjunath and W. Y. Ma. Texture features for browsing and retrieval of im-

age data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(8):837–842, 1996.

[90] C. Marinos and A. Blake. Shape from texture: the homogeneity hypothesis. In Proc.

3rd International Conference on Computer Vision, Osaka, pages 350–353, 1990.

[91] J. Matas, J. Burianek, and J. Kittler. Object recognition using the invariant pixel-set

signature. In Proc. British Machine Vision Conference, pages 606–615, 2000.

[92] S. J. Maybank. Theory of reconstruction from image motion. Springer-Verlag, Berlin,

1993.

[93] P. F. McLauchlan and D. W. Murray. A unifying framework for structure from mo-

tion recovery from image sequences. In Proc. International Conference on Computer

Vision, pages 314–320, 1995.

[94] G. F. McLean and D. Kotturi. Vanishing point detection by line clustering. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 17(11):1090–1095, 1995.

[95] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of

state calculations by fast computer machines. Journal of Chemical Physics, 21:1087–

1092, 1953.

[96] K. Mikolajczyk. Interest point detectors – comparison, 2001. personal communica-

tion.

[97] K. Mikolajczyk and C. Schmid. Indexing based on scale invariant interest points. In

Proc. 8th International Conference on Computer Vision, Vancouver, Canada, 2001.

[98] F. Mindru, T. Moons, and L. Van Gool. Recognizing color patterns irrespective of

viewpoint and illumination. In Proc. IEEE Conference on Computer Vision and Pat-

tern Recognition, Fort Collins, Colorado, pages 368–373, 1999.

[99] D. Mumford and J. Shah. Optimal approximations by piecewise smooth functions

and associated variational problems. Comm. Pure Appl. Math., 42:577–684, 1989.

[100] J. Mundy, C. Huang, J. Liu, W. Hoffman, D. Forsyth, C. Rothwell, A. Zisserman,

S. Utcke, and O. Bournez. MORSE: A system based on geo-

metric invariants. In ARPA Image Understanding Workshop, pages 1393–1402, 1994.

[101] J. Mundy and A. Zisserman. Repeated structures: Image correspondence constraints

and ambiguity of 3D reconstruction. In J. Mundy, A. Zisserman, and D. Forsyth, ed-

itors, Applications of invariance in computer vision, pages 89–106. Springer-Verlag,

1994.

[102] The NETLIB numerical software repository, http://www.netlib.org.

[103] P. Perona. Deformable kernels for early vision. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 17(5):488–499, 1995.

[104] M. Pollefeys, R. Koch, and L. Van Gool. Self calibration and metric reconstruction in

spite of varying and unknown internal camera parameters. In Proc. 6th International

Conference on Computer Vision, Bombay, India, pages 90–96, 1998.

[105] M. Pollefeys and L. Van Gool. A stratified approach to metric self calibration. In Proc.

IEEE Conference on Computer Vision and Pattern Recognition, 1997.

[106] J. Ponce. On computing metric upgrades of projective reconstructions under the

rectangular pixel assumption. In M. Pollefeys, L. Van Gool, A. Zisserman, and

A. Fitzgibbon, editors, 3D Structure from Multiple Images of Large-Scale Environ-

ments, LNCS 2018, pages 52–67. Springer-Verlag, July 2000.

[107] J. Portilla and E. Simoncelli. A parametric texture model based on joint statistics of

complex wavelet coefficients. International Journal of Computer Vision, 40(1):49–70,

2000.

[108] W. Press, B. Flannery, S. Teukolsky, and W. Vetterling. Numerical Recipes in C. Cam-

bridge University Press, 1988.

[109] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In Proc. 6th Interna-

tional Conference on Computer Vision, Bombay, India, pages 754–760, January 1998.

[110] L. Quan. Invariants of 6 points from 3 uncalibrated images. In J. O. Eckland, editor,

Proc. 3rd European Conference on Computer Vision, Stockholm, Sweden, pages 459–

469. Springer-Verlag, 1994.

[111] L. Quan, A. Heyden, and F. Kahl. Minimal projective reconstruction with missing

data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pages

210–216, June 1999.

[112] L. Quan and R. Mohr. Determining perspective structures using hierarchical Hough

transform. Pattern Recognition Letters, 9(1):279–286, 1989.

[113] I. D. Reid and D. W. Murray. Active tracking of foveated feature clusters using affine

structure. International Journal of Computer Vision, 18(1):41–60, 1996.

[114] M. Reid. Undergraduate Algebraic Geometry. Cambridge University Press, 1988.

[115] M. Reid. Undergraduate Commutative Algebra. Cambridge University Press, 1988.

[116] E. Ribeiro and E. R. Hancock. Adapting spectral scale for shape from texture. In Proc.

European Conference on Computer Vision, LNCS 1842, pages 421–433. Springer-

Verlag, 2000.

[117] C. Rothwell, A. Zisserman, D. Forsyth, and J. Mundy. Canonical frames for planar

object recognition. In Proc. European Conference on Computer Vision, LNCS 588.

Springer-Verlag, 1992.

[118] C. Rothwell, A. Zisserman, C. I. Marinos, D. Forsyth, and J. Mundy. Relative motion

and pose from arbitrary plane curves. Image and Vision Computing, 10(4):250–262,

1992.

[119] C. Rothwell, A. Zisserman, J. Mundy, and D. Forsyth. Efficient model library access

by projectively invariant indexing functions. In Proc. IEEE Conference on Computer

Vision and Pattern Recognition, pages 109–114, 1992.

[120] Y. Rubner, C. Tomasi, and L.J. Guibas. The earth mover’s distance as a metric for

image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.

[121] J. Sato and R. Cipolla. Extracting the affine transformation from texture moments.

Technical Report CUED/F-INFENG/TR 167, Dept. Engineering, University of Cam-

bridge, 1994.

[122] F. Schaffalitzky. Direct solution of modulus constraints. In Proc. Indian Conference on

Computer Vision, Graphics and Image Processing, Bangalore, pages 314–321, 2000.

[123] F. Schaffalitzky and A. Zisserman. Geometric grouping of repeated elements within

images. In Proc. 9th British Machine Vision Conference, Southampton, pages 13–22,

1998.

[124] F. Schaffalitzky and A. Zisserman. Geometric grouping of repeated elements within

images. In D. A. Forsyth, J. L. Mundy, V. Di Gesu, and R. Cipolla, editors, Shape, Con-

tour and Grouping in Computer Vision, LNCS 1681, pages 165–181. Springer-Verlag,

1999.

[125] F. Schaffalitzky and A. Zisserman. Planar grouping for automatic detection of van-

ishing lines and points. Image and Vision Computing, 18:647–658, 2000.

[126] F. Schaffalitzky and A. Zisserman. Viewpoint invariant texture description and

matching. Technical report, Dept. of Engineering Science, University of Oxford,

2001.

[127] F. Schaffalitzky and A. Zisserman. Viewpoint invariant texture matching and wide

baseline stereo. In Proc. 8th International Conference on Computer Vision, Vancou-

ver, Canada, July 2001.

[128] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets,

or “How do I organize my holiday snaps?”. In Proc. 7th European Conference on

Computer Vision, Copenhagen, Denmark, pages 414–431. Springer-Verlag, 2002.

[129] F. Schaffalitzky, A. Zisserman, R. I. Hartley, and P. H. S. Torr. A six point solution for

structure and motion. In Proc. European Conference on Computer Vision, pages 632–

648. Springer-Verlag, June 2000.

[130] B. Schiele and J. L. Crowley. Recognition without correspondence using multidimen-

sional receptive field histograms. Technical Report 453, MIT Media Lab, 1997.

[131] C. Schmid. Appariement d’Images par Invariants Locaux de Niveaux de Gris. PhD

thesis, L’Institut National Polytechnique de Grenoble, Grenoble, 1997.

[132] C. Schmid and R. Mohr. Matching by local invariants. Research report 2644, INRIA

Rhône-Alpes, Grenoble, France, 1995.

[133] C. Schmid and R. Mohr. Combining greyvalue invariants with local constraints for

object recognition. Technical report, INRIA Rhône-Alpes, Grenoble, France, 1996.

[134] C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Trans-

actions on Pattern Analysis and Machine Intelligence, 19(5):530–534, May 1997.

[135] C. Schmid, R. Mohr, and C. Bauckhage. Comparing and evaluating interest points.

In Proc. International Conference on Computer Vision, pages 230–235, 1998.

[136] J. G. Semple and G. T. Kneebone. Algebraic Projective Geometry. Oxford University

Press, 1979.

[137] J. G. Semple and L. Roth. Introduction to Algebraic Geometry. Oxford University

Press, 1949.

[138] C. J. Setchell and N. W. Campbell. Using colour gabor texture features for scene un-

derstanding. In 7th. International Conference on Image Processing and its Applica-

tions, pages 372–376, July 1999.

[139] I. Shafarevich. Basic Algebraic Geometry. Springer, 1994.

[140] J. Shi and J. Malik. Normalized cuts and image segmentation. In Proc. IEEE Confer-

ence on Computer Vision and Pattern Recognition, pages 731–743, 1997.

[141] J. A. Shufelt. Performance evaluation and analysis of vanishing point detection tech-

niques. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(3):282–

288, March 1999.

[142] C. E. Springer. Geometry and Analysis of Projective Spaces. Freeman, 1964.

[143] R. Sriram, J. M. Francos, and W. A. Pearlman. Texture coding using a Wold decompo-

sition model. Proc. International Conference on Pattern Recognition, 94:35–39, 1994.

[144] J. V. Stone. Shape from texture: Textural invariance and the problem of scale in per-

spective images of textured surfaces. In Proc. 1st British Machine Vision Conference,

Oxford. BMVA Press, 1990.

[145] P. Sturm and W. Triggs. A factorization based algorithm for multi-image projective

structure and motion. In Proc. 4th European Conference on Computer Vision, Cam-

bridge, UK, pages 709–720, 1996.

[146] R. Sturm. Das Problem der Projektivität und seine Anwendung auf die Flächen

zweiten Grades. Math. Ann., 1:533–574, 1869.

[147] B. Sturmfels. Algorithms in Invariant Theory. Springer-Verlag, 1993.

[148] M. J. Swain and D. H. Ballard. Color indexing. International Journal of Computer

Vision, 7(1):11–32, November 1991.

[149] D. Tell and S. Carlsson. Wide baseline point matching using affine invariants com-

puted from intensity profiles. In Proc. 6th European Conference on Computer Vision,

Dublin, Ireland, LNCS 1842-1843, pages 814–828. Springer-Verlag, June 2000.

[150] B. M. ter Haar Romeny. Geometry-Driven Diffusion in Computer Vision. Kluwer Aca-

demic Press, 1994.

[151] B. M. ter Haar Romeny, L. M. J. Florack, A. H. Salden, and M. A. Viergever. Higher

order differential structure of images. Image and Vision Computing, 12(6):317–325,

1994.

[152] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography:

A factorization approach. International Journal of Computer Vision, 9(2):137–154,

November 1992.

[153] P. H. S. Torr and D. W. Murray. The development and comparison of robust methods

for estimating the fundamental matrix. International Journal of Computer Vision,

24(3):271–300, 1997.

[154] P. H. S. Torr and A. Zisserman. Robust parameterization and computation of the

trifocal tensor. Image and Vision Computing, 15:591–605, 1997.

[155] P. H. S. Torr and A. Zisserman. Robust computation and parameterization of multiple

view relations. In Proc. 6th International Conference on Computer Vision, Bombay,

India, pages 727–732, January 1998.

[156] W. Triggs. Factorization methods for projective structure and motion. In Proc. IEEE

Conference on Computer Vision and Pattern Recognition, pages 845–851, 1996.

[157] W. Triggs. Auto-calibration and the absolute quadric. In Proc. IEEE Conference on

Computer Vision and Pattern Recognition, pages 609–614, 1997.

[158] W. Triggs. Autocalibration from planar scenes. In Proc. 5th European Conference on

Computer Vision, Freiburg, Germany, 1998.

[159] W. Triggs. Camera pose and calibration from 4 or 5 known 3D points. In Proc. 7th

International Conference on Computer Vision, Kerkyra, Greece, pages 278–284, 1999.

[160] W. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment: A mod-

ern synthesis. In W. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms:

Theory and Practice, LNCS. Springer Verlag, 2000.

[161] T. Tuytelaars and L. Van Gool. Content-based image retrieval based on local affinely

invariant regions. In Int. Conf. on Visual Information Systems, pages 493–500, 1999.

[162] T. Tuytelaars and L. Van Gool. Wide baseline stereo matching based on local, affinely

invariant regions. In Proc. 11th British Machine Vision Conference, Bristol, pages 412–

425, 2000.

[163] T. Tuytelaars, L. Van Gool, M. Proesmans, and T. Moons. The cascaded Hough trans-

form as an aid in aerial image interpretation. In Proc. 6th International Conference

on Computer Vision, Bombay, India, pages 67–72, January 1998.

[164] L. Van Gool, T. Moons, and D. Ungureanu. Affine / photometric invariants for planar

intensity patterns. In Proc. 4th European Conference on Computer Vision, Cambridge,

UK, pages 642–651. Springer-Verlag, 1996.

[165] P. Viola and W. Wells. Alignment by maximization of mutual information. In IEEE

Computer Society Press, editor, Proc. 5th International Conference on Computer Vi-

sion, Boston, pages 16–23, June 1995.

[166] J. Wilkinson. The evaluation of the zeros of ill-conditioned polynomials, part 1.

Numerische Mathematik, 1:150–166, 1959.

[167] A. P. Witkin. Recovering surface shape and orientation from texture. Artificial Intel-

ligence, 17(1–3):17–45, August 1981.

[168] A. P. Witkin. Scale-space filtering. In Proc. 8th IJCAI, pages 1019–1022, 1983.

[169] X. Yan, X. Dong-hui, P. Jia-xiong, and D. Ming-yue. The unique solution of projec-

tive invariants of six points from four uncalibrated images. Pattern Recognition,

30(3):513–517, 1997.

[170] A. Zalesny and L. Van Gool. A compact model for viewpoint dependent texture syn-

thesis. In Proc. European Conference on Computer Vision, LNCS 2018/5, pages 124–

143. Springer-Verlag, 2000.

[171] Z. Zhang, R. Deriche, O. D. Faugeras, and Q.-T. Luong. A robust technique for match-

ing two uncalibrated images through the recovery of the unknown epipolar geome-

try. Artificial Intelligence, 78:87–119, 1995.

[172] S.C. Zhu, Y. Wu, and D. Mumford. Filters, random-fields and maximum-entropy

(FRAME): Towards a unified theory for texture modeling. IJCV, 27(2):107–126, March

1998.
