Structure from Motion without Correspondence
Frank Dellaert Steven M. Seitz Charles E. Thorpe Sebastian Thrun Computer Science Department & Robotics Institute Carnegie Mellon University, Pittsburgh PA 15213
Abstract

A method is presented to recover 3D scene structure and camera motion from multiple images without the need for correspondence information. The problem is framed as finding the maximum likelihood structure and motion given only the 2D measurements, integrating over all possible assignments of 3D features to 2D measurements. This goal is achieved by means of an algorithm which iteratively refines a probability distribution over the set of all correspondence assignments. At each iteration a new structure from motion problem is solved, using as input a set of 'virtual measurements' derived from this probability distribution. The distribution needed can be efficiently obtained by Markov Chain Monte Carlo sampling. The approach is cast within the framework of Expectation-Maximization, which guarantees convergence to a local maximizer of the likelihood. The algorithm works well in practice, as will be demonstrated using results on several real image sequences.

1 Introduction

A primary objective of computer vision is to enable reconstructing 3D scene geometry and camera motion from a set of images of a static scene. The current state of the art provides solutions that apply only under special conditions. Specifically, existing techniques generally assume one or more of the following:

• Known correspondence: given a set of image feature trajectories over time, solve for their 3D positions and camera motion. This classical formulation has been studied in the context of structure from motion [26, 20, 18] and, more recently, self-calibration [8, 21].

• Known cameras: given calibrated images from known camera viewpoints, solve for 3D scene shape. Stereo correspondence [6] and volumetric methods [7, 14] fit within this category.

• Known shape: given one or more images and a 3D model of the scene, determine the camera viewpoint corresponding to each image [12, 15].

The applicability of each of these methods is limited by the need for accurate correspondence, camera calibration, or shape information as input. Reliable pixel correspondence is difficult to obtain, especially over a long sequence of images. Feature-tracking techniques often fail to produce correct matches due to large motions, occlusions, or ambiguities. Furthermore, errors in one frame are likely to propagate to all subsequent frames. Outlier rejection techniques [3, 28, 9] can reduce these problems, but at the cost of eliminating valid features from the reconstruction, resulting in a model that does not take into account all available measurements. A priori knowledge of camera parameters or epipolar geometry can simplify the correspondence problem [6]. However, obtaining accurate calibrated sequences is difficult even in controlled laboratory environments. While recent progress in self-calibration techniques [8, 21] promises to alleviate these difficulties, these techniques require point correspondence as input and therefore are sensitive to errors due to incorrectly-tracked features. In short, existing shape recovery techniques are strongly limited by their reliance on error-prone correspondence techniques.

In this paper, we address the structure from motion problem (SFM) without prior knowledge of point correspondence or camera viewpoints. We frame the problem as finding the maximum likelihood estimate of structure and motion given only the measurements, integrating over all possible assignments of 3D features to 2D measurements. While the full computation of this likelihood function is generally intractable, we propose to use the Expectation-Maximization algorithm (EM) [10, 5] as a practical method for finding its maxima. We will show that EM has a simple and intuitive interpretation in this context.

The outline of our method is as follows: instead of solving for structure and motion given the original image measurements, we solve a new SFM problem using newly synthesized 'virtual' measurements, computed using our current knowledge about the correspondences. This knowledge comes in the form of a probability distribution, obtained using the actual image data and an initial guess for the structure and motion. By solving this new SFM problem we obtain a new estimate for the structure and motion. This basic step is iterated until convergence.

A key step in our method is in computing the probability distribution over correspondences, which is hard to obtain analytically. To circumvent this problem, we use Markov Chain Monte Carlo methods to sample from the distribution, which can be done efficiently. In this respect, our approach resembles that of Forsyth et al. [9] who also applied MCMC for structure from motion, assuming known correspondence. A key difference is that we solve for correspondence, structure, and motion simultaneously, a much more difficult problem.

The problem of computing structure and motion from a set of images without correspondence information remains largely unaddressed in the literature. Several authors considered the special case of correct but incomplete correspondence, by hallucinating occluded features [26, 2], or expanding a minimal correspondence into a complete correspondence [22]. However, these approaches require that a sufficient and non-degenerate set of initial correspondences be provided a priori, which is assumed to be correct. A few authors have proposed methods for using geometric constraints to facilitate the correspondence problem in uncalibrated images. In particular, Irani [13] described how geometric rank constraints can be used to facilitate optical flow computation over closely-spaced views. Beardsley et al. [3] proposed a two-phase approach for robustly computing feature correspondences by processing image triplets. In the first phase, a minimal point correspondence is computed from a set of candidate matches using a RANSAC-based algorithm. The results from this phase are used to compute the trifocal tensor [23] which in turn constrains the search for other feature correspondences. Although we adopt a different approach and do not require closely-spaced views, we follow these authors' lead in coupling the estimation of correspondence and structure. Rather than consider a small set of images or features at a time, however, our strategy is to simultaneously optimize over all features in all images.

The remainder of this paper is structured as follows: in Section 2 we state the problem, introduce our notation, and sketch the outline of our approach. Section 3 provides the intuitive interpretation in terms of virtual measurements. In Section 4, we discuss the use of MCMC sampling to implement the E-step. Section 5 presents the results.

2 SFM without Correspondences

2.1 Problem Statement and Notation

The structure from motion (SFM) problem is this: given a set of images of a scene, taken from different viewpoints, recover the 3D structure of the scene along with the camera parameters. In the feature-based approach to SFM, we consider the situation in which a set of $n$ 3D features $x_j$ is viewed by a set of $m$ cameras $m_i$. As input data we are given the set of 2D measurements $u_{ik}$, where $k \in \{1..K_i\}$ and $K_i$ is the number of measurements in the $i$-th image. To model correspondence information, we introduce for each measurement $u_{ik}$ the indicator variable $j_{ik}$, indicating that $u_{ik}$ is a measurement of the $j_{ik}$-th feature $x_{j_{ik}}$. Our notation is illustrated in Figure 1.

[Figure 1 diagram: features $x_j$, measurements $u_{ik}$ with their assignment variables $j_{ik}$, and cameras $m_1$ and $m_2$.]
Figure 1. An example with 4 features seen in 2 images. The 7 measurements $u_{ik}$ are assigned to the individual features $x_j$ by means of the assignment variables $j_{ik}$.

The choice of feature type and camera model defines the measurement function $h(m_i, x_j)$, predicting the measurement $u_{ik}$ given $m_i$ and $x_j$ (with $j = j_{ik}$):

$$u_{ik} = h(m_i, x_j) + n$$

where $n$ is the measurement noise. Without loss of generality, let us consider the case in which the features $x_j$ are 3D points and the measurements $u_{ik}$ are points in the 2D image. In this case the measurement function can be written as a 3D rigid displacement followed by a projection:

$$h(m_i, x_j) = \Phi_i[R_i(x_j - t_i)] \tag{1}$$

where $R_i$ and $t_i$ are the rotation matrix and translation of the $i$-th camera, respectively, and $\Phi_i : \mathbb{R}^3 \to \mathbb{R}^2$ is a projection operator which projects a 3D point to the 2D image plane. Various camera models can be defined by specifying the action of this projection operator on a point $\mathbf{x} = (x, y, z)^T$ [19]. For example, the projection operators for orthography and calibrated perspective are defined as:

$$\Phi_i^o[\mathbf{x}] = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad \Phi_i^p[\mathbf{x}] = \begin{pmatrix} x/z \\ y/z \end{pmatrix}$$

2.2 SFM with Known Correspondences

To set the stage for the rest of the paper, it is convenient to view SFM as a maximum likelihood (ML) estimation problem. Let us denote the set of 3D points as $X$, the set of cameras as $M$, the set of measurements as $U$, and a set of assignments as $J$. Furthermore, define $\Theta = (X, M)$. The maximum likelihood estimate $\Theta^* = (X^*, M^*)$ of structure and motion given the data $U$ and $J$ is given by

$$\Theta^* = \operatorname*{argmax}_{\Theta} \log L(\Theta; U, J) \tag{2}$$

where the likelihood $L(\Theta; U, J)$ of $\Theta$ given $U$ and $J$ is defined as any function proportional to $P(U, J \mid \Theta)$ [25].

[Figure 2 plot: features $x_1$ and $x_2$, measurements $u_{11}$ and $u_{12}$ with assignments $j_{11} = 1$ and $j_{12} = 2$, and camera $m_1$.]
Figure 2. Example where 2 features $x_1$ and $x_2$, con-

If we are given the correspondence information $J$, $\log L(\Theta; U, J)$ is easy to evaluate. In the case that the noise $n$ on the measurements is i.i.d. zero-mean Gaussian noise with standard deviation $\sigma$, the negative log-likelihood is simply a sum of squared re-projection errors:

$$-\log L(\Theta; U, J) = \frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{k=1}^{K_i} \left\| u_{ik} - h(m_i, x_{j_{ik}}) \right\|^2$$
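As an illustration of the measurement model in Equation (1) and of the known-correspondence objective of Section 2.2, the following is a minimal Python sketch, not the authors' implementation. It assumes the rigid displacement has the form R_i (x_j - t_i), that the Gaussian negative log-likelihood is scaled by 1/(2σ²), and that indices are zero-based; all function names (`project_orthographic`, `project_perspective`, `measure`, `neg_log_likelihood`) are hypothetical.

```python
import numpy as np

def project_orthographic(p):
    """Orthographic projection: keep x and y, drop the depth z."""
    return p[:2]

def project_perspective(p):
    """Calibrated perspective projection: divide x and y by the depth z."""
    return p[:2] / p[2]

def measure(R, t, x, project=project_perspective):
    """Predicted measurement h(m_i, x_j) for a camera m_i = (R, t).

    Assumed form: rigid displacement R (x - t) followed by a projection.
    """
    return project(R @ (x - t))

def neg_log_likelihood(cameras, points, measurements, assignments, sigma):
    """Sum of squared re-projection errors, scaled by 1 / (2 sigma^2).

    cameras:      list of (R, t) pairs, one per image
    points:       array of shape (n, 3) holding the 3D features x_j
    measurements: list of arrays of shape (K_i, 2), one per image
    assignments:  list of index lists j_ik (zero-based), one per image
    """
    total = 0.0
    for (R, t), u_i, j_i in zip(cameras, measurements, assignments):
        for u_ik, j_ik in zip(u_i, j_i):
            residual = u_ik - measure(R, t, points[j_ik])
            total += residual @ residual
    return total / (2.0 * sigma ** 2)

# Toy scene: one camera at the origin observing two 3D points.
points = np.array([[1.0, 2.0, 4.0], [-1.0, 0.0, 2.0]])
cameras = [(np.eye(3), np.zeros(3))]
measurements = [np.array([[0.25, 0.5], [-0.5, 0.0]])]  # noise-free projections

nll_correct = neg_log_likelihood(cameras, points, measurements, [[0, 1]], sigma=1.0)
nll_swapped = neg_log_likelihood(cameras, points, measurements, [[1, 0]], sigma=1.0)
print(nll_correct, nll_swapped)  # 0.0 0.8125
```

In this toy scene the correct assignment reproduces the measurements exactly, while swapping the two assignments raises the objective. This is the dependence on J that makes the problem hard once the correspondences are unknown.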