Structure from Motion without Correspondence

Frank Dellaert, Steven M. Seitz, Charles E. Thorpe, Sebastian Thrun
Computer Science Department & Robotics Institute, Carnegie Mellon University, Pittsburgh PA 15213

Abstract

A method is presented to recover 3D scene structure and camera motion from multiple images without the need for correspondence information. The problem is framed as finding the maximum likelihood structure and motion given only the 2D measurements, integrating over all possible assignments of 3D features to 2D measurements. This goal is achieved by means of an algorithm which iteratively refines a probability distribution over the set of all correspondence assignments. At each iteration a new structure from motion problem is solved, using as input a set of 'virtual measurements' derived from this probability distribution. The distribution needed can be efficiently obtained by Markov Chain Monte Carlo sampling. The approach is cast within the framework of Expectation-Maximization, which guarantees convergence to a local maximizer of the likelihood. The algorithm works well in practice, as will be demonstrated using results on several real image sequences.

1 Introduction

A primary objective of computer vision is to enable reconstructing 3D scene geometry and camera motion from a set of images of a static scene. The current state of the art provides solutions that apply only under special conditions. Specifically, existing techniques generally assume one or more of the following:

- Known correspondence: given a set of image feature trajectories over time, solve for their 3D positions and camera motion. This classical formulation has been studied in the context of structure from motion [26, 20, 18] and, more recently, self-calibration [8, 21].

- Known cameras: given calibrated images from known camera viewpoints, solve for 3D scene shape. Stereo correspondence [6] and volumetric methods [7, 14] fit within this category.

- Known shape: given one or more images and a 3D model of the scene, determine the camera viewpoint corresponding to each image [12, 15].

The applicability of each of these methods is limited by the need for accurate correspondence, camera calibration, or shape information as input. Reliable pixel correspondence is difficult to obtain, especially over a long sequence of images. Feature-tracking techniques often fail to produce correct matches due to large motions, occlusions, or ambiguities. Furthermore, errors in one frame are likely to propagate to all subsequent frames. Outlier rejection techniques [3, 28, 9] can reduce these problems, but at the cost of eliminating valid features from the reconstruction, resulting in a model that does not take into account all available measurements. A priori knowledge of camera parameters or epipolar geometry can simplify the correspondence problem [6]. However, obtaining accurate calibrated sequences is difficult even in controlled laboratory environments. While recent progress in self-calibration techniques [8, 21] promises to alleviate these difficulties, these techniques require point correspondence as input and are therefore sensitive to errors due to incorrectly-tracked features. In short, existing shape recovery techniques are strongly limited by their reliance on error-prone correspondence techniques.

In this paper, we address the structure from motion (SFM) problem without prior knowledge of point correspondence or camera viewpoints. We frame the problem as finding the maximum likelihood estimate of structure and motion given only the measurements, integrating over all possible assignments of 3D features to 2D measurements. While the full computation of this likelihood function is generally intractable, we propose to use the Expectation-Maximization algorithm (EM) [10, 5] as a practical method for finding its maxima. We will show that EM has a simple and intuitive interpretation in this context.

The outline of our method is as follows: instead of solving for structure and motion given the original image measurements, we solve a new SFM problem using newly synthesized 'virtual' measurements, computed using our current knowledge about the correspondences. This knowledge comes in the form of a probability distribution, obtained using the actual image data and an initial guess for the structure and motion. By solving this new SFM problem we obtain a new estimate for the structure and motion. This basic step is iterated until convergence. A key step in our method is computing the probability distribution over correspondences, which is hard to obtain analytically. To circumvent this problem, we use Markov Chain Monte Carlo methods to sample from the distribution, which can be done efficiently. In this respect, our approach resembles that of Forsyth et al. [9], who also applied MCMC to structure from motion, assuming known correspondence. A key difference is that we solve for correspondence, structure, and motion simultaneously, a much more difficult problem.

The problem of computing structure and motion from a set of images without correspondence information remains largely unaddressed in the literature. Several authors considered the special case of correct but incomplete correspondence, by hallucinating occluded features [26, 2], or by expanding a minimal correspondence into a complete correspondence [22]. However, these approaches require that a sufficient and non-degenerate set of initial correspondences, assumed to be correct, be provided a priori. A few authors have proposed methods for using geometric constraints to facilitate the correspondence problem in uncalibrated images. In particular, Irani [13] described how geometric rank constraints can be used to facilitate optical flow computation over closely-spaced views. Beardsley et al. [3] proposed a two-phase approach for robustly computing feature correspondences by processing image triplets. In the first phase, a minimal point correspondence is computed from a set of candidate matches using a RANSAC-based algorithm. The results from this phase are used to compute the trifocal tensor [23], which in turn constrains the search for other feature correspondences. Although we adopt a different approach and do not require closely-spaced views, we follow these authors' lead in coupling the estimation of correspondence and structure. Rather than consider a small set of images or features at a time, however, our strategy is to simultaneously optimize over all features in all images.

The remainder of this paper is structured as follows: in Section 2 we state the problem, introduce our notation, and sketch the outline of our approach. Section 3 provides the intuitive interpretation in terms of virtual measurements. In Section 4, we discuss the use of MCMC sampling to implement the E-step. Section 5 presents the results.

2 SFM without Correspondences

2.1 Problem Statement and Notation

The structure from motion (SFM) problem is this: given a set of images of a scene, taken from different viewpoints, recover the 3D structure of the scene along with the camera parameters. In the feature-based approach to SFM, we consider the situation in which a set of $n$ 3D features $x_j$ is viewed by a set of $m$ cameras $m_i$. As input data we are given the set of 2D measurements $u_{ik}$, where $k \in \{1..K_i\}$ and $K_i$ is the number of measurements in the $i$-th image. To model correspondence information, we introduce for each measurement $u_{ik}$ the indicator variable $j_{ik}$, indicating that $u_{ik}$ is a measurement of the $j_{ik}$-th feature $x_{j_{ik}}$. Our notation is illustrated in Figure 1.

Figure 1. An example with 4 features seen in 2 images. The 7 measurements $u_{ik}$ are assigned to the individual features $x_j$ by means of the assignment variables $j_{ik}$.

The choice of feature type and camera model defines the measurement function $h(m_i, x_j)$, predicting the measurement $u_{ik}$ given $m_i$ and $x_j$ (with $j = j_{ik}$):

$$ u_{ik} = h(m_i, x_j) + n $$

where $n$ is the measurement noise. Without loss of generality, let us consider the case in which the features $x_j$ are 3D points and the measurements $u_{ik}$ are points in the 2D image. In this case the measurement function can be written as a 3D rigid displacement followed by a projection:

$$ h(m_i, x_j) = \Pi_i [ R_i x_j + t_i ] \quad (1) $$

where $R_i$ and $t_i$ are the rotation matrix and translation of the $i$-th camera, respectively, and $\Pi_i : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ is a projection operator which projects a 3D point to the 2D image plane. Various camera models can be defined by specifying the action of this projection operator on a point $x = (x, y, z)^T$ [19]. For example, the projection operators for orthography and calibrated perspective are defined as:

$$ \Pi_i^o[x] = \begin{bmatrix} x \\ y \end{bmatrix}, \qquad \Pi_i^p[x] = \begin{bmatrix} x/z \\ y/z \end{bmatrix} $$
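The measurement model above is simple to sketch in code. The following is a minimal illustration (not code from the paper; the helper names are hypothetical) of the orthographic and perspective projection operators and the measurement function of equation (1):

```python
import numpy as np

def project_orthographic(x):
    """Orthographic projection Pi^o: keep (x, y), drop the depth z."""
    x = np.asarray(x, dtype=float)
    return x[:2]

def project_perspective(x):
    """Calibrated perspective projection Pi^p: divide (x, y) by depth z."""
    x = np.asarray(x, dtype=float)
    return x[:2] / x[2]

def measure(R, t, x, project):
    """Measurement function h(m_i, x_j) = Pi_i[R_i x_j + t_i], eq. (1)."""
    return project(R @ np.asarray(x, dtype=float) + t)
```

For example, the point (2, 4, 4) projects orthographically to (2, 4) and perspectively to (0.5, 1).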

2.2 SFM with Known Correspondences

To set the stage for the rest of the paper, it is convenient to view SFM as a maximum likelihood (ML) estimation problem. Let us denote the set of 3D points as $X$, the set of cameras as $M$, the set of measurements as $U$, and a set of assignments as $J$. Furthermore, define $\Theta = (X, M)$. The maximum likelihood estimate $\Theta^* = (X^*, M^*)$ of structure and motion given the data $U$ and $J$ is given by

$$ \Theta^* = \arg\max_{\Theta} \log L(\Theta; U, J) \quad (2) $$

where the likelihood $L(\Theta; U, J)$ of $\Theta$ given $U$ and $J$ is defined as any function proportional to $P(U, J \mid \Theta)$ [25].

If we are given the correspondence information $J$, $\log L(\Theta; U, J)$ is easy to evaluate. In the case that the noise $n$ on the measurements is i.i.d. zero-mean Gaussian with standard deviation $\sigma$, the negative log-likelihood is simply a sum of squared re-projection errors:

$$ \log L(\Theta; U, J) = -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{k=1}^{K_i} \| u_{ik} - h(m_i, x_{j_{ik}}) \|^2 \quad (3) $$

In the case of orthographic, weak- and para-perspective camera models, we find the estimate $\Theta^*$ that minimizes (3) using the factorization approach. Using this technique, affine structure and motion $X^a$ and $M^a$ are first obtained from the measurements $U$ by means of singular value decomposition. They are then upgraded to Euclidean structure and motion by imposing metric constraints on $M^a$. This is a well developed technique, and the reader is referred to [26, 20, 18] for details and additional references.

In the case of full perspective cameras the measurement function $h(m_i, x_j)$ is non-linear, and we resort to non-linear optimization to minimize the re-projection error. This procedure is known in photogrammetry and computer vision as bundle adjustment, and details can be found in [24, 11, 4]. We use factorization to obtain an initial estimate and then use the Levenberg-Marquardt optimization method to find $\Theta^*$. Sparse matrix techniques discussed in [11] are used to significantly reduce the computational cost.

Figure 2. Example where 2 features $x_1$ and $x_2$, constrained to have $y = 2$, are seen in one image. The associated likelihoods are shown in Figure 3.

 ; U While the full computation of L in (5) is gener-

2.3 SFM without Correspondences ally intractable, the EM algorithm [10, 5] provides a prac-

 ; U

In the case that the correspondences are unknown, the tical method for finding its maxima. In general, L

 

 is hard to obtain explicitly, as it involves summing over a

=X ; M  maximum likelihood estimate  of structure combinatorial number of possible assignments. However, it and motion given only the measurements U is given by:

can be proven that the EM algorithm converges to a local



L; U

log L; U =  argmax (4) maximum of .

 The idea of EM is to maximize the expected log likeli-

E flog L ; U; Jg

hood function t , where the expectation

Although this might seem counterintuitive at first, the above f



t

J =

states that we can find the ML structure and motion with- is taken with respect to the posterior distribution f

t

JjU;   J

out explicitly reasoning about which correspondence as- P over all possible assignments given the data

t 

signment might be correct. We ’only’ need to maximize U and a current guess for structure and motion. The EM

 ; U J the likelihood L , which does not depend on . algorithm then iterates over [25]:

To gain a better understanding for what the function

t

Q  

; U

L looks like, consider the example of Figure 2, 1. E-step: Calculate the expected log likelihood :

X

x x 2

where two features 1 and are seen in a 1D camera. In

t t

Q   = J log L ; U; J

f (6)

u u 22

this case there are two measurements 11 and ,andtwo

J

J u x

11 1 possible assignments: 1 (shown) assigns to and L( X; U, J ) L( X; U, J ) 1 2 2 2 t

To see this, let us first calculate the probability f that ij k

1.5 1.5

u i x j a measurement ik in image is assigned to a feature , 1 1 regardless of how the other measurements are assigned.

0.5 0.5 t

2 2 f

x 0 x 0 In other words, is the marginal posterior probability ij k

−0.5 −0.5 t

j = j jU;  

P , and it can be calculated by summing ik

−1 −1

t

J J j = j f over all possible assignments where : −1.5 −1.5 ik

−2 −2 X

−2 −1 0 1 2 −2 −1 0 1 2



t t x x t

1 1

= P j = j jU;  =  j ;jf J

f (8)

ik ik L( X; U) ij k

2 J

1.5

:; : 1 where  is the Kronecker delta function. 0.5 Equation (8) allows us to rewrite the expected log-

2 t

x 0  likelihood Q from (7) in a form that only depends in- −0.5

directly on the assignment variables j : ik −1

−1.5

m n K

−2 i

X X X

−2 −1 0 1 2 1

2 t

x t

1

f ku hm ; x k Q  =

i j

ik (9)

ij k

2

2

i=1 j =1

k =1

x x

2 Figure 3. The joint likelihood of 1 and from Figure 2 Now we state the main result in this section: it can be in three cases: (left) given U and the ’obvious’ assign-

shown by simple algebraic manipulation that (9) can be

J U ment 1 , (middle) given and the ’reverse’ assignment

written as the sum of a constant that does not depend on

J U

2 , and (right) given only, which is their sum .

n m

, and a new re-projection error of features in images

m n

t+1

X X 

2. M-step: Find the ML estimate for structure and 1

2

t t

Q  = C + hm ; x k kv

t j

i (10)

ij

t

Q  

motion, by maximizing : 2

2 

ij

i=1 j =1

t+1 t

= Q 

 argmax

t v

 where the virtual measurements and virtual measure-

ij

t 2



ment variances  are defined as

ij

t 

It is important to note that Q is calculated in the E-step

P

t t

K

i

t

f J 

by evaluating using the current guess for structure 2

u f

ik 

 

ij k

k =1

t 2

t

= ;  v  =

P P

t (11) ij

and motion (hence the superscript ), whereas in the M-step ij

K K

i i

t t

f f

t

ij k ij k

k =1 k =1  

we are optimizing Q with respect to the free variable

t+1

 

to obtain the new estimate . t
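The bimodality of the total likelihood (5) in the two-feature example of Figure 2 can be checked numerically. The following toy sketch (hypothetical, assuming an identity 1D camera and $\sigma = 0.5$) sums the two assignment terms:

```python
import numpy as np

def total_likelihood(x1, x2, u, sigma=0.5):
    """Total likelihood (5) for the toy example of Figure 2: two 1D
    feature positions x1, x2 and two measurements u, summed over the
    two possible assignments J1 and J2."""
    def term(a, b):
        # L(Theta; U, J) for one assignment, up to a constant factor
        return np.exp(-((u[0] - a) ** 2 + (u[1] - b) ** 2) / (2 * sigma ** 2))
    return term(x1, x2) + term(x2, x1)
```

The function is symmetric under swapping $x_1$ and $x_2$, so it has two equal peaks, one per assignment, and a lower value at the midpoint between them.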

3 SFM with Virtual Measurements

In this section we show that the EM algorithm outlined above can be interpreted in a simple and intuitive way. We show that the expected log-likelihood can be rewritten such that the M-step amounts to solving a similar SFM problem, but using as input a newly synthesized set of virtual measurements, created in the E-step.

3.1 Virtual Measurements

In the context of SFM, we substitute the expression for the log likelihood $\log L(\Theta; U, J)$ from (3) in equation (6), and obtain the following expression for $Q^t(\Theta)$:

$$ Q^t(\Theta) = -\sum_{J} f^t(J) \frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{k=1}^{K_i} \| u_{ik} - h(m_i, x_{j_{ik}}) \|^2 \quad (7) $$

It is clear that a direct evaluation of (7) is infeasible, as the number of possible assignments is combinatorial in $n$. An efficient implementation is nevertheless possible.

To see this, let us first calculate the probability $f^t_{ijk}$ that a measurement $u_{ik}$ in image $i$ is assigned to a feature $x_j$, regardless of how the other measurements are assigned. In other words, $f^t_{ijk}$ is the marginal posterior probability $P(j_{ik} = j \mid U, \Theta^t)$, and it can be calculated by summing $f^t(J)$ over all possible assignments $J$ where $j_{ik} = j$:

$$ f^t_{ijk} = P(j_{ik} = j \mid U, \Theta^t) = \sum_{J} \delta(j_{ik}, j) \, f^t(J) \quad (8) $$

where $\delta(\cdot, \cdot)$ is the Kronecker delta function. Equation (8) allows us to rewrite the expected log-likelihood $Q^t(\Theta)$ from (7) in a form that only depends indirectly on the assignment variables $j_{ik}$:

$$ Q^t(\Theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} \sum_{k=1}^{K_i} f^t_{ijk} \| u_{ik} - h(m_i, x_j) \|^2 \quad (9) $$

Now we state the main result of this section: it can be shown by simple algebraic manipulation that (9) can be written as the sum of a constant $C_t$ that does not depend on $\Theta$, and a new re-projection error of $n$ features in $m$ images:

$$ Q^t(\Theta) = C_t - \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{2 (\sigma^t_{ij})^2} \| v^t_{ij} - h(m_i, x_j) \|^2 \quad (10) $$

where the virtual measurements $v^t_{ij}$ and virtual measurement variances $(\sigma^t_{ij})^2$ are defined as

$$ v^t_{ij} = \frac{\sum_{k=1}^{K_i} f^t_{ijk} \, u_{ik}}{\sum_{k=1}^{K_i} f^t_{ijk}}, \qquad (\sigma^t_{ij})^2 = \frac{\sigma^2}{\sum_{k=1}^{K_i} f^t_{ijk}} \quad (11) $$

Each virtual measurement $v^t_{ij}$ is simply a weighted average of the original measurements $u_{ik}$ in the $i$-th image, and the weights are the marginal probabilities $f^t_{ijk}$. If there is no occlusion and all features are seen in all images, then $\sum_{k=1}^{K_i} f^t_{ijk} = 1$ and the expressions further simplify.

3.2 Summary and Implementation Outline

Writing $Q^t(\Theta)$ as a re-projection error with respect to virtual measurements as in equation (10) provides an intuitive interpretation for the overall algorithm:

1. E-step: Calculate the weights $f^t_{ijk}$ from the distribution over assignments. Then, in each of the $m$ images calculate $n$ virtual measurements $v^t_{ij}$.

2. M-step: Find the structure and motion estimate $\Theta^{t+1}$ that minimizes the (weighted) re-projection error given the virtual measurements:

$$ \Theta^{t+1} = \arg\min_{\Theta} \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{1}{2 (\sigma^t_{ij})^2} \| v^t_{ij} - h(m_i, x_j) \|^2 $$

In other words, the E-step synthesizes new measurement data, and the M-step is a conventional SFM problem of the same size as before. This means that we can use any known SFM algorithm at our disposal as is, from straightforward orthographic factorization to the more recent algorithms that work with uncalibrated images. What is left is to show how the E-step can be implemented.
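The virtual measurements and variances of equation (11) reduce to a few array operations once the marginal weights are known. A minimal sketch for one image (hypothetical helper, not code from the paper; it assumes every feature receives nonzero total weight):

```python
import numpy as np

def virtual_measurements(f, U, sigma):
    """Virtual measurements and variances, eq. (11), for one image.
    f: (K, n) array of marginal weights f_ijk for this image;
    U: (K, 2) array of 2D measurements u_ik."""
    f = np.asarray(f, dtype=float)
    U = np.asarray(U, dtype=float)
    w = f.sum(axis=0)              # sum_k f_ijk, one total weight per feature j
    v = (f.T @ U) / w[:, None]     # weighted average of measurements per feature
    var = sigma ** 2 / w           # virtual measurement variances
    return v, var
```

With a deterministic one-to-one weight matrix the virtual measurements coincide with the originals; with uniform weights every virtual measurement collapses to the centroid of the image measurements.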

4 Implementing the E-step

Since the M-step can be implemented using known SFM approaches, we need only concern ourselves with the implementation of the E-step. In particular, we need to calculate the marginal probabilities $f^t_{ijk} = P(j_{ik} = j \mid U, \Theta^t)$.

4.1 Conditional Independence, Mutual Exclusion

If we assume that the feature assignments $j_{ik}$ are conditionally independent given $U$ and $\Theta^t$, we can write the probability of each assignment $J$ as the product of the probabilities of the individual assignments $j_{ik}$:

$$ P(J \mid U, \Theta^t) = \prod_{i=1}^{m} \prod_{k=1}^{K_i} P(j_{ik} \mid u_{ik}, \Theta^t) \quad (12) $$

If this were a good approximation it would lead to an efficient implementation, as in that case the marginal probabilities $f^t_{ijk}$ only depend on the distance from the measurement $u_{ik}$ to the projected feature point:

$$ f^t_{ijk} = C_{ik} \exp\left( -\frac{1}{2\sigma^2} \| u_{ik} - h(m_i, x^t_j) \|^2 \right) \quad (13) $$

with $C_{ik}$ a normalization constant. However, in reality the assignments are not independent: if a measurement $u_{ik}$ has been assigned $j_{ik} = j$, then no other measurement in the same image should be assigned the same feature point $x_j$. The probability of such a double assignment is zero, which cannot be modeled by the expression above. In other words, it does not take into account the important global constraint of mutual exclusion.

Imposing the mutual exclusion constraint, however, makes it difficult to analytically express the weights $f^t_{ijk}$. Conditional independence no longer holds, while the simple expression (13) for the weights $f^t_{ijk}$ relies crucially on this assumption. We know of no efficient closed-form expression for $f^t_{ijk}$ that allows only permutations.

4.2 Sampling the Correspondence Distribution

The solution we propose is to instead sample from the posterior probability distribution $f^t(J)$ over valid assignments $J$, to obtain approximate values for the weights $f^t_{ijk}$. Formally this can be justified in the context of Monte Carlo EM or MCEM, a version of the EM algorithm where the E-step is executed by a Monte Carlo process [25, 17]. The sample can be efficiently obtained using the Metropolis algorithm, as will be described below.

First, note that we can do the sampling for each image individually, as the assignment $J_i$ within each image $i$ is conditionally independent of the assignments in other images. This means that $f^t(J)$ can be factored as:

$$ f^t(J) = \prod_{i=1}^{m} P(J_i \mid U_i, \Theta^t) $$

where $U_i$ are the measurements in image $i$ only.

To sample from $P(J_i \mid U_i, \Theta^t)$ we use the Metropolis algorithm, an instance of the Markov Chain Monte Carlo (MCMC) methods, which involve a Markov chain in which a sequence of samples is generated [16, 9]. If we set up the transition probabilities correctly, the equilibrium distribution of the Markov chain will be equal to the posterior distribution we would like to sample from. In our case, we would like to generate a sequence of $R$ samples $J^r_i$ from the posterior $P(J_i \mid U_i, \Theta^t)$, and the Metropolis algorithm can be formulated in the current context as follows (adapted from the general description in [16]):

1. Start with a valid initial assignment $J^0_i$.

2. Propose a new valid assignment $J'_i$, which is probabilistically generated from $J^r_i$.

3. Compute the ratio

$$ a = \frac{P(J'_i \mid U_i, \Theta^t)}{P(J^r_i \mid U_i, \Theta^t)} \quad (14) $$

4. If $a \geq 1$ then accept $J'_i$, i.e. we set $J^{r+1}_i = J'_i$. Otherwise, accept $J'_i$ with probability $a$. If the proposal is rejected, then we keep the previous sample, i.e. we set $J^{r+1}_i = J^r_i$.

To actually implement this scheme, we need to specify three elements: (a) a definition of what a 'valid' assignment is, (b) a way to probabilistically perturb assignments, and (c) an explicit expression for $a$. Below we do this for the case when there is no occlusion, no spurious features, and all features are seen in all images. Note that this is without loss of generality: in particular, this assumption is not needed for the general description of the algorithm above. In this restricted case, the only valid assignments $j_{ik}$ are permutations of the feature indices $1..n$. The proposal step can be implemented by swapping the assignment variables $j_{ik}$ of two randomly chosen measurements $u_{ik}$, which conserves the permutation property. Finally, the posterior ratio $a$ can be evaluated very efficiently, as it can be shown to depend only on the dot product of two vectors related to the swap (proof omitted):

$$ a = \exp\left( \frac{1}{\sigma^2} (u_1 - u_2)^T (h_2 - h_1) \right) $$

where $u_1$ and $u_2$ are the measurements whose assignments will be swapped, and $h_1$ and $h_2$ are the projections of the features originally assigned to them.

To conclude the E-step and compute the virtual measurements in (11), the only thing left to do is to compute the marginal probabilities $f^t_{ijk}$ from the sample $\{J^r_i\}$. Fortunately, this can be done without explicitly storing the samples, by keeping running counts of how many times each measurement $u_{ik}$ is assigned to feature $j$, and using these counts to compute $f^t_{ijk}$. If we define $C^t_{ijk}$ to be this count, we have:

$$ f^t_{ijk} \approx \frac{C^t_{ijk}}{R} \quad (15) $$

Figure 4. Three out of 11 cube images. Although the images were originally taken as a sequence in time, the ordering of the images is irrelevant to our method.
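The swap-proposal Metropolis sampler for a single image, with the dot-product acceptance ratio and the running counts of equation (15), can be sketched as follows (a hypothetical implementation under the no-occlusion assumption, not the authors' code):

```python
import numpy as np

def sample_marginals(U, H, sigma, R=10000, rng=None):
    """Metropolis sampler over permutation assignments for one image.
    U: (K, 2) measurements u_ik; H: (K, 2) predicted projections h(m_i, x_j).
    Returns a (K, K) matrix f with f[k, j] ~ P(j_ik = j), eq. (15)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(U)
    J = rng.permutation(K)                 # valid initial assignment (a permutation)
    counts = np.zeros((K, K))
    for _ in range(R):
        # propose: swap the assignments of two randomly chosen measurements
        k1, k2 = rng.choice(K, size=2, replace=False)
        h1, h2 = H[J[k1]], H[J[k2]]
        # posterior ratio a = exp((u1 - u2)^T (h2 - h1) / sigma^2)
        a = np.exp((U[k1] - U[k2]) @ (h2 - h1) / sigma ** 2)
        if a >= 1 or rng.random() < a:
            J[k1], J[k2] = J[k2], J[k1]    # accept the swap
        counts[np.arange(K), J] += 1       # running counts, no sample storage
    return counts / R
```

When the predicted projections sit exactly on well-separated measurements, the marginals concentrate on the correct permutation.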

4.3 Implementation in Practice

The pseudo-code for the final algorithm is as follows:

1. Generate an initial structure and motion estimate $\Theta^0$.

2. Given $\Theta^t$ and the data $U$, run the Metropolis sampler in each image to obtain approximate values for the weights $f^t_{ijk}$, using equation (15).

3. Calculate the virtual measurements $v^t_{ij}$ with (11).

4. Find the new estimate $\Theta^{t+1}$ for structure and motion using the virtual measurements $v^t_{ij}$ as data. This can be done using any SFM method compatible with the projection model assumed.

5. If not converged, return to step 2.

Figure 5. The structure estimate as initialized and at successive iterations $t$ of the algorithm.
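The five steps above amount to a short driver loop. The sketch below is hypothetical (the `e_step` and `m_step` callables stand in for the Metropolis sampler plus equation (11) and for any compatible SFM solver, respectively); it also anneals the noise parameter exponentially, as in the experiments reported later:

```python
import numpy as np

def run_mcem(theta0, n_iters=100, sigma_start=25.0, sigma_end=1.0,
             e_step=None, m_step=None):
    """Driver for the algorithm of Section 4.3 (a sketch, not the
    authors' code). e_step(theta, sigma) returns virtual measurements;
    m_step(virtual) returns the new structure and motion estimate."""
    # exponential annealing schedule from sigma_start down to sigma_end
    sigmas = sigma_start * (sigma_end / sigma_start) ** (
        np.arange(n_iters) / (n_iters - 1))
    theta = theta0
    for sigma in sigmas:
        virtual = e_step(theta, sigma)   # step 2-3: sampling + eq. (11)
        theta = m_step(virtual)          # step 4: solve synthetic SFM problem
    return theta
```

In practice one would also monitor convergence and stop early instead of always running the full schedule.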

To avoid getting stuck in local minima, it is important in practice to add annealing to this basic scheme. In annealing we artificially increase the noise parameter $\sigma$ for the early iterations, gradually decreasing it to its correct value. This has two beneficial consequences. First, the posterior distribution $f^t(J)$ will be less peaked when $\sigma$ is high, so that the Metropolis sampler will explore the space of assignments more easily and avoid getting stuck on islands of high probability. Second, the expected log likelihood $Q^t(\Theta)$ is smoother and has fewer local maxima at higher values of $\sigma$. We use an exponentially decreasing annealing scheme, but have found that the algorithm is not sensitive to the exact scheme used.

5 Results

In this section we show the results obtained with three different sets of images. For each set we highlight a particular property of our method. For all the results we present, the input to the algorithm was a set of manually obtained image measurements. To initialize, the 3D points $x_j$ were generated randomly in a normally distributed cloud around a depth of 1, whereas the cameras $m_i$ were all initialized at the origin. We ran the EM algorithm for 100 iterations each time, with the annealing parameter $\sigma$ decreasing exponentially from 25 pixels to 1 pixel. For each EM iteration, we ran the sampler in each image for 10,000 steps. An entire run takes about a minute of CPU time on a standard PC. As is typical for EM, the algorithm can sometimes get stuck in local minima, in which case we restart it manually.

In practice, the algorithm converges consistently and fast to an estimate for the structure and motion where the correct correspondence is the most probable one, and where most if not all assignments in the different images agree with each other. We illustrate this using the image set shown in Figure 4, which was taken under orthographic projection. The typical evolution of the algorithm is illustrated in Figure 5, where we have shown a wire-frame model of the recovered structure at successive instants of time. There are two important points to note: (a) the gross structure is recovered in the very first iteration, starting from random initial structure, and (b) finer details of the structure are gradually resolved as the parameter $\sigma$ is decreased. The estimate for the structure after convergence is almost identical to the one found by factorization when given the correct correspondence. Incidentally, we found the algorithm converges less often when we replace the random initialization by a 'good' initial estimate where all the points in some image are projected onto a plane of constant depth.

Figure 6. 4 out of 5 perspective images of a house.

To illustrate the EM iterations, consider the set of images in Figure 6, taken under perspective projection. In the perspective case, we implement the M-step as para-perspective factorization followed by bundle adjustment. In this example we do not show the recovered structure (which is good), but show the marginal probabilities $f^t_{ijk}$ at two different times during the course of the algorithm, in Figure 7. In early iterations, $\sigma$ is high and there is still a lot of ambiguity. Towards the end, the distribution focuses in on one consistent assignment. If all the probability were concentrated in one consistent assignment over all images, the large $f^t_{ijk}$ matrix would be a set of identical permutation matrices stacked one upon the other.

Figure 7. The marginal probabilities $f^t_{ijk}$ at an early and at a later iteration, respectively. Each row corresponds to a measurement $u_{ik}$, grouped according to image index, whereas the columns represent the $n$ features $x_j$. In this example $n = 58$ and $m = 5$. Black corresponds to a marginal probability of 1.

The algorithm also deals with situations where the images are taken from widely separate viewpoints, as is the case for the images in Figure 8. In this sequence, the image features used were the colored beads on the wire-frame toy in the image, plus four points on the ground plane. Images were taken from both sides of the object. Because of the 'see-through' nature of the object, there is also a lot of potential confusion between image measurements. Figure 9 shows the wire-frame model obtained by our method, where each of the wires corresponds to one of the wires on the toy. Although in the final iteration there is still disagreement between images about the most likely feature assignment, the overall structure of the model is recovered despite the arbitrary configuration of the cameras.

Figure 8. 6 out of 8 images of a wire-frame toy, taken from widely different viewpoints.

Figure 9. Recovered structure for wire-frame toy re-projected in 2 images.

6 Conclusions and Future Directions

In this paper we have presented a novel tool, which enables us to solve the structure from motion problem without a priori correspondence information. In addition, it can cope with images given in arbitrary order and taken from widely separate viewpoints.

Despite the space we have devoted to explaining the rationale behind it, the final algorithm is simple and easy to implement. As summarized in Section 4.3, at each iteration one only needs to obtain a sample of probable assignments, compute the virtual measurements, and solve a synthetic SFM problem using known methods. In addition, it is fast: the Metropolis sampler, which is the main computational bottleneck, can be implemented very efficiently due to the incremental computation of the posterior ratios, and the fact that we do not need to store the samples.

However, there is plenty of opportunity for future work. In the current paper, all features were hand-picked, and we have not shown results on sequences with occlusions or spurious features. This was a conscious choice on our part, in order to study our particular approach independent of other factors. Clearly, in order to be more widely applicable, the performance of the algorithm under these more general conditions needs to be evaluated. This is a practical rather than a theoretical hurdle, as in theory occlusions and spurious features are easily incorporated in the formulation above. However, it does introduce the issue of how many features need to be instantiated, since now this will not be known a priori. This problem of model selection has been addressed successfully before in [1, 27], and it is hoped that the lessons learned there can equally apply in the current context.

References
[1] S. Ayer and H. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In Int. Conf. on Computer Vision (ICCV), pages 777-784, June 1995.
[2] R. Basri, A. Grove, and D. Jacobs. Efficient determination of shape from multiple images containing partial information. Pattern Recognition, 31(11):1691-1703, November 1998.
[3] P. Beardsley, P. Torr, and A. Zisserman. 3D model acquisition from extended image sequences. In Eur. Conf. on Computer Vision (ECCV), pages II:683-695, 1996.
[4] M. Cooper and S. Robson. Theory of close range photogrammetry. In K. Atkinson, editor, Close range photogrammetry and machine vision, chapter 1, pages 9-51. Whittles Publishing, 1996.
[5] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. of the Royal Statistical Society B, 39(1):1-38, 1977.
[6] O. Faugeras. Three-dimensional computer vision: A geometric viewpoint. The MIT Press, Cambridge, MA, 1993.
[7] O. Faugeras and R. Keriven. Complete dense stereovision using level set methods. In Proc. 5th European Conf. on Computer Vision, pages 379-393, 1998.
[8] O. Faugeras, Q. Luong, and S. Maybank. Camera self-calibration: Theory and experiments. In Eur. Conf. on Computer Vision (ECCV), pages 321-334, 1992.
[9] D. Forsyth, S. Ioffe, and J. Haddon. Bayesian structure from motion. In Int. Conf. on Computer Vision (ICCV), pages 660-665, 1999.
[10] H. Hartley. Maximum likelihood estimation from incomplete data. Biometrics, 14:174-194, 1958.
[11] R. Hartley. Euclidean reconstruction from uncalibrated views. In Application of Invariance in Computer Vision, pages 237-256, 1994.
[12] D. Huttenlocher and S. Ullman. Recognizing solid objects by alignment with an image. Int. J. of Computer Vision, 5(2):195-212, November 1990.
[13] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Int. Conf. on Computer Vision (ICCV), pages 626-633, 1999.
[14] K. Kutulakos and S. Seitz. A theory of shape by space carving. In Proc. Seventh Int. Conf. on Computer Vision, pages 307-314, 1999.
[15] D. Lowe. Fitting parameterized three-dimensional models to images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(5):441-450, May 1991.
[16] D. MacKay. Introduction to Monte Carlo methods. In M. I. Jordan, editor, Learning in graphical models, pages 175-204. MIT Press, 1999.
[17] G. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley series in probability and statistics. John Wiley & Sons, 1997.
[18] D. Morris and T. Kanade. A unified factorization algorithm for points, line segments and planes with uncertainty models. In Int. Conf. on Computer Vision (ICCV), pages 696-702, 1998.
[19] D. Morris, K. Kanatani, and T. Kanade. Uncertainty modeling for optimal structure from motion. In ICCV Workshop on Vision Algorithms: Theory and Practice, 1999.
[20] C. Poelman and T. Kanade. A paraperspective factorization method for shape and motion recovery. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(3):206-218, March 1997.
[21] M. Pollefeys, R. Koch, and L. VanGool. Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters. Int. J. of Computer Vision, 32(1):7-25, August 1999.
[22] S. Seitz and C. Dyer. Complete structure from four point correspondences. In Proc. Fifth Int. Conf. on Computer Vision, pages 330-337, 1995.
[23] A. Shashua. Trilinear tensor: The fundamental construct of multiple-view geometry and its applications. In Intl. Workshop on Algebraic Frames For The Perception Action Cycle (AFPAC), Kiel, Germany, September 1997.
[24] R. Szeliski and S. Kang. Recovering 3D shape and motion from image streams using non-linear least squares. Technical Report CRL 93/3, DEC Cambridge Research Lab, 1993.
[25] M. Tanner. Tools for Statistical Inference. Springer, 1996.
[26] C. Tomasi and T. Kanade. Shape and motion from image streams under orthography: a factorization method. Int. J. of Computer Vision, 9(2):137-154, Nov. 1992.
[27] P. Torr. An assessment of information criteria for motion model selection. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 47-53, 1997.
[28] Z. Zhang. Determining the epipolar geometry and its uncertainty: a review. Int. J. of Computer Vision, 27(2):161-195, March 1998.