MASSACHUSETTS INSTITUTE OF TECHNOLOGY
ARTIFICIAL INTELLIGENCE LABORATORY
A.I. Memo No. 1363 September, 1992
Projective Structure from two Uncalibrated Images: Structure from
Motion and Recognition
Amnon Shashua
Abstract
This paper addresses the problem of recovering relative structure, in the form of an invariant, from two views of a 3D
scene. The invariant structure is computed without any prior knowledge of camera geometry, or internal calibration,
and with the property that perspective and orthographic projections are treated alike, namely, the system makes no
assumption regarding the existence of perspective distortions in the input images.
We show that, given the location of epipoles, the projective structure invariant can be constructed from only four
corresponding points projected from four non-coplanar points in space (as in the case of parallel projection). This
result leads to two algorithms for computing projective structure. The first algorithm requires six corresponding
points, four of which are assumed to be projected from four coplanar points in space. Alternatively, the second
algorithm requires eight corresponding points, without assumptions of coplanarity of object points.
Our study of projective structure is applicable to both structure from motion and visual recognition. We use
projective structure to re-project the 3D scene from two model images and six or eight corresponding points with a
novel view of the scene. The re-projection process is well-defined under all cases of central projection, including the
case of parallel projection.
© Copyright Massachusetts Institute of Technology, 1992
This report describes research done at the Artificial Intelligence Laboratory of the Massachusetts Institute of Technology.
Support for the laboratory's artificial intelligence research is provided in part by the Advanced Research Projects Agency of
the Department of Defense under Office of Naval Research contract N00014-85-K-0124. A. Shashua was also supported by NSF-IRI8900267.
1 Introduction

The problem we address in this paper is that of recovering relative, non-metric, structure of a three-dimensional scene from two images, taken from different viewing positions. The relative structure information is in the form of an invariant that can be computed without any prior knowledge of camera geometry, and under all central projections, including the case of parallel projection. The non-metric nature of the invariant allows the cameras to be internally uncalibrated (the intrinsic parameters of the camera are unknown). The unique nature of the invariant allows the system to make no assumptions about the existence of perspective distortions in the input images. Therefore, any degree of perspective distortion is allowed, i.e., orthographic and perspective projections are treated alike, or in other words, no assumptions are made on the size of the field of view.

We envision this study as having applications both in the area of structure from motion and in the area of visual recognition. In structure from motion our contribution is an addition to the recent studies of non-metric structure from motion pioneered by Koenderink and Van Doorn (1991) in parallel projection, followed by Faugeras (1992) and Mohr, Quan, Veillon & Boufama (1992) for reconstructing the projective coordinates of a scene up to an unknown projective transformation of 3D projective space. Our approach is similar to Koenderink and Van Doorn's in the sense that we derive an invariant, based on a geometric construction, that records the 3D structure of the scene as a variation from two fixed reference planes measured along the line of sight. Unlike Faugeras and Mohr et al. we do not recover the projective coordinates of the scene, and, as a result, we use a smaller number of corresponding points: in addition to the location of epipoles we need only four corresponding points, coming from four non-coplanar points in the scene, whereas Faugeras and Mohr et al. require correspondences coming from five points in general position.

The second contribution of our study is to visual recognition of 3D objects from 2D images. We show that our projective invariant can be used to predict novel views of the object, given two model views in full correspondence and a small number of corresponding points with the novel view. The predicted view is then matched against the novel input view, and if the two match, then the novel view is considered to be an instance of the same object that gave rise to the two model views stored in memory. This paradigm of recognition is within the general framework of alignment (Fischler and Bolles 1981, Lowe 1985, Ullman 1989, Huttenlocher and Ullman 1987) and, more specifically, of the paradigm proposed by Ullman and Basri (1989) that recognition can proceed using only 2D images, both for representing the model, and when matching the model to the input image. We refer to the problem of predicting a novel view from a set of model views, using a limited number of corresponding points, as the problem of re-projection.

The problem of re-projection has been dealt with in the past primarily assuming parallel projection (Ullman and Basri 1989, Koenderink and Van Doorn 1991). For the more general case of central projection, Barrett, Brill, Haag & Payton (1991) have recently introduced a quadratic invariant based on the fundamental matrix of Longuet-Higgins (1981), which is computed from eight corresponding points. In Appendix E we show that their result is equivalent to intersecting epipolar lines, and therefore, is singular for certain viewing transformations depending on the viewing geometry between the two model views. Our projective invariant is not based on an epipolar intersection, but is based directly on the relative structure of the object, and does not suffer from any singularities, a finding that implies greater stability in the presence of errors.

The projective structure invariant, and the re-projection method that follows, is based on an extension of Koenderink and Van Doorn's representation of affine structure as an invariant defined with respect to a reference plane and a reference point. We start by introducing an alternative affine invariant, using two reference planes (section 5), which can easily be extended to projective space. As a result we obtain a projective structure invariant (section 6).

We show that the difference between the affine and projective case lies entirely in the location of the epipoles, i.e., given the location of epipoles both the affine and projective structures are constructed by linear methods using the information captured from four corresponding points projected from four non-coplanar points in space. In the projective case we need additional corresponding points, solely for the purpose of recovering the location of the epipoles (Theorem 1, section 6).

We show that the projective structure invariant can be recovered from two views (produced by parallel or central projection) and six corresponding points, four of which are assumed to be projected from four coplanar points in space (section 7.1). Alternatively, the projective structure can be recovered from eight corresponding points, without assuming coplanarity of object points (section 8.1). The 8-point method uses the fundamental matrix approach (Longuet-Higgins, 1981) for recovering the location of epipoles (as suggested by Faugeras, 1992).

Finally, we show that, for both schemes, it is possible to limit the viewing transformations to the group of rigid motions, i.e., it is possible to work with perspective projection assuming the cameras are calibrated. The result, however, does not include orthographic projection.

Experiments were conducted with both algorithms, and the results show that the 6-point algorithm is stable under noise and under conditions that violate the assumption that four object points are coplanar. The 8-point algorithm, although theoretically superior because it lacks the coplanarity assumption, is considerably more sensitive to noise.

2 Why not Classical SFM?

The work of Koenderink and Van Doorn (1991) on affine structure from two orthographic views, and the work of Ullman and Basri (1989) on re-projection from two orthographic views, have a clear practical aspect: it is known that at least three orthographic views are required to recover metric structure, i.e., relative depth
(Ullman 1979, Huang & Lee 1989, Aloimonos & Brown 1989). Therefore, the suggestion to use affine structure instead of metric structure allows a recognition system to perform re-projection from two model views (Ullman & Basri), and to generate novel views of the object produced by affine transformations in space, rather than by rigid transformations (Koenderink & Van Doorn).

This advantage, of working with two rather than three views, is not present under perspective projection, however. It is known that two perspective views are sufficient for recovering metric structure (Roach & Aggarwal 1979, Longuet-Higgins 1981, Tsai & Huang 1984, Faugeras & Maybank 1990). The question, therefore, is why look for alternative representations of structure, and new methods for performing re-projection?

There are three major problems in structure from motion methods: (i) critical dependence on an orthographic or perspective model of projection, (ii) internal camera calibration, and (iii) the problem of stereo-triangulation.

The first problem is the strict division between methods that assume orthographic projection and methods that assume perspective projection. These two classes of methods do not overlap in their domain of application. The perspective model operates under conditions of significant perspective distortions, such as driving on a stretch of highway, and requires a relatively large field of view and relatively large depth variations between scene points (Adiv 1989, Dutta & Snyder 1990, Tomasi 1991, Broida et al. 1990). The orthographic model, on the other hand, provides a reasonable approximation when the imaging situation is at the other extreme, i.e., small field of view and small depth variation between object points (a situation for which perspective schemes often break down). Typical imaging situations are at neither end of these extremes and, therefore, would be vulnerable to errors in both models. From the standpoint of performing recognition, this problem implies that the viewer has control over his field of view, a property that may be reasonable to assume at the time of model acquisition, but less reasonable to assume at recognition time.

The second problem is related to internal camera calibration. The assumption of perspective projection includes a distinguishable point, known as the principal point, which is at the intersection of the optical axis and the image plane. The location of the principal point is an internal parameter of the camera, which may deviate somewhat from the geometric center of the image plane, and therefore, may require calibration. Perspective projection also assumes that the image plane is perpendicular to the optical axis, and the possibility of imperfections in the camera requires, therefore, the recovery of the two axes describing the image frame, and of the focal length. Although the calibration process is somewhat tedious, it is sometimes necessary for many of the available commercial cameras (Brown 1971, Faig 1975, Lenz and Tsai 1987, Faugeras, Luong and Maybank 1992). The problem of calibration is lesser under orthographic projection because the projection does not have a distinguishable ray; therefore any point can serve as an origin. Calibration, however, must still be considered because of the assumption that the image plane is perpendicular to the projecting rays.

The third problem is related to the way shape is typically represented under the perspective projection model. Because the center of projection is also the origin of the coordinate system for describing shape, the shape difference (e.g., difference in depth between two object points) is orders of magnitude smaller than the distance to the scene, and this makes the computations very sensitive to noise. The sensitivity to noise is reduced if images are taken from distant viewpoints (large base-line in stereo triangulation), but that makes the process of establishing correspondence between points in both views more of a problem, and hence, may make the situation even worse. This problem does not occur under the assumption of orthographic projection because translation in depth is lost under orthographic projection, and therefore, the origin of the coordinate system for describing shape (metric and non-metric) is object-centered, rather than viewer-centered (Tomasi, 1991).

These problems, in isolation or put together, account for much of the sensitivity of structure from motion methods to errors. The recent work of Faugeras (1992) and Mohr et al. (1992) addresses the problem of internal calibration by assuming central projection instead of perspective projection. Faugeras and Mohr et al. then proceed to reconstruct the projective coordinates of the scene. Since projective coordinates are measured relative to the center of projection, this approach does not address the problem of stereo-triangulation or the problem of uniformity under both orthographic and perspective projection models.

3 Camera Model and Notations

We assume that objects in the world are rigid and are viewed under central projection. In central projection the center of projection is the origin of the camera coordinate frame and can be located anywhere in projective space. In other words, the center of projection can be a point in Euclidean space or an ideal point (such as happens in parallel projection). The image plane is assumed to be arbitrarily positioned with respect to the camera coordinate frame (unlike perspective projection where it is parallel to the xy plane). We refer to this as a non-rigid camera configuration. The motion of the camera, therefore, consists of the translation of the center of projection, rotation of the coordinate frame around the new location of the center of projection, followed by tilt, pan, and focal length scale of the image plane with respect to the new optical axis. This model of projection will also be referred to as perspective projection with an uncalibrated camera.

We also include in our derivations the possibility of having a rigid camera configuration. A rigid camera is simply the familiar model of perspective projection in which the center of projection is a point in Euclidean space and the image plane is fixed with respect to the camera coordinate frame. A rigid camera motion, therefore, consists of translation of the center of projection followed by rotation of the coordinate frame and focal length scaling. Note that a rigid camera implicitly assumes internal calibration, i.e., the optical axis pierces
through a fixed point in the image and the image plane is perpendicular to the optical axis.

We denote object points in capital letters and image points in small letters. If $P$ denotes an object point in 3D space, $p, p', p''$ denote its projections onto the first, second and novel projections, respectively. We treat image points as rays (homogeneous coordinates) in 3D space, and refer to the notation $p = (x, y, 1)$ as the standard representation of the image plane. We note that the true coordinates of the image plane are related to the standard representation by means of a projective transformation of the plane. In case we deal with central projection, all representations of image coordinates are allowed, and therefore, without loss of generality we work with the standard representation (more on that in Appendix A).

4 Affine Structure: Koenderink and Van Doorn's Version

The affine structure invariant described by Koenderink and Van Doorn (1991) is based on a geometric construction using a single reference plane, and a reference point not coplanar with the reference plane. In affine geometry (induced by parallel projection), it is known from the fundamental theorem of plane projectivity that three (non-collinear) corresponding points are sufficient to uniquely determine all other correspondences (see Appendix A for more details on plane projectivity under affine and projective geometry). Using three corresponding points between two views provides us, therefore, with a transformation (affine transformation) for determining the location of all points of the plane passing through the three reference points in the second image plane.

Let $P$ be an arbitrary point in the scene projecting onto $p, p'$ on the two image planes. Let $\tilde{P}$ be the projection of $P$ onto the reference plane along the ray towards the first image plane, and let $\tilde{p}'$ be the projection of $\tilde{P}$ onto the second image plane ($p'$ and $\tilde{p}'$ coincide if $P$ is on the reference plane). Note that the location of $\tilde{p}'$ is known via the affine transformation determined by the projections of the three reference points. Finally, let $Q$ be the fourth reference point (not on the reference plane). Using a simple geometric drawing, the affine structure invariant is derived as follows.

Consider Figure 1. The projections of the reference point $Q$ and an arbitrary point of interest $P$ form two similar trapezoids: $P\tilde{P}p'\tilde{p}'$ and $Q\tilde{Q}q'\tilde{q}'$. From similarity of trapezoids we have,

$$\gamma_p = \frac{|P - \tilde{P}|}{|Q - \tilde{Q}|} = \frac{|p' - \tilde{p}'|}{|q' - \tilde{q}'|}.$$

Figure 1: Koenderink and Van Doorn's Affine Structure.

By assuming that $q, q'$ is a given corresponding point, we obtain a shape invariant that is invariant under parallel projection (the object points are fixed while the camera changes the location and position of the image plane towards the projecting rays).

Before we extend this result to central projection by using projective geometry, we first describe a different affine invariant using two reference planes, rather than one reference plane and a reference point. The new affine invariant is the one that will be applied later to central projection.

5 Affine Structure Using Two Reference Planes

We make use of the same information, the projections of four non-coplanar points, to set up two reference planes. Let $P_j$, $j = 1, \ldots, 4$, be the four non-coplanar reference points in space, and let $p_j \rightarrow p'_j$ be their observed projections in both views. The points $P_1, P_2, P_3$ and $P_2, P_3, P_4$ lie on two different planes, therefore, we can account for the motion of all points coplanar with each of these two planes. Let $P$ be a point of interest, not coplanar with either of the reference planes, and let $\tilde{P}$ and $\hat{P}$ be its projections onto the two reference planes along the ray towards the first view.

Consider Figure 2. The projection of $P$, $\tilde{P}$ and $\hat{P}$ onto $p'$, $\tilde{p}'$ and $\hat{p}'$ respectively, gives rise to two similar trapezoids from which we derive the following relation:

$$\alpha_p = \frac{|P - \tilde{P}|}{|P - \hat{P}|} = \frac{|p' - \tilde{p}'|}{|p' - \hat{p}'|}.$$

The ratio $\alpha_p$ is invariant under parallel projection. There is no particular advantage for preferring $\alpha_p$ over $\gamma_p$ as a measure of affine structure, but as will be described below, this new construction forms the basis for extending affine structure to projective structure, whereas the single reference plane construction does not (see Appendix D for proof).

In the projective plane, we need four coplanar points to determine the motion of a reference plane. We show that, given the epipoles, only three corresponding points for each reference plane are sufficient for recovering the associated projective transformations induced by those planes. Altogether, the construction provides us with four points along each epipolar line. The similarity of trapezoids in the affine case turns, therefore, into a cross-ratio in the projective case.
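As an illustrative aside, the two-reference-plane construction can be sketched numerically. The code below is only a minimal sketch under stated assumptions, not the memo's implementation: it assumes NumPy, all function names (`affine_from_3`, `affine_structure`, `homography_from_4`, `projective_structure`) are hypothetical, the plane-induced transformations are estimated with a standard direct linear transformation (a swapped-in choice of linear method), and the epipoles are taken as given. The affine part evaluates the ratio $|p' - \tilde{p}'| / |p' - \hat{p}'|$; the projective part evaluates the cross-ratio of $p', \tilde{p}', \hat{p}'$ and the epipole $V_l$ in the ordering used in section 6.

```python
import numpy as np

def affine_from_3(src, dst):
    """Affine map as a 3x2 matrix T with [x, y, 1] @ T = [x', y'],
    determined by three non-collinear correspondences."""
    S = np.hstack([np.asarray(src, float), np.ones((3, 1))])
    return np.linalg.solve(S, np.asarray(dst, float))

def affine_structure(p, p2, T_tilde, T_hat):
    """Ratio |p' - p~'| / |p' - p^'| for inhomogeneous image points p, p'."""
    ph = np.append(np.asarray(p, float), 1.0)
    p2 = np.asarray(p2, float)
    return np.linalg.norm(p2 - ph @ T_tilde) / np.linalg.norm(p2 - ph @ T_hat)

def homography_from_4(src, dst):
    """3x3 matrix H with dst ~ H @ src (homogeneous 3-vectors), estimated by a
    direct linear transformation from four correspondences, here three
    image points plus the pair of epipoles."""
    rows = []
    z = np.zeros(3)
    for X, (u, v, t) in zip(np.asarray(src, float), np.asarray(dst, float)):
        # The three rows encode dst x (H @ src) = 0; only two are independent.
        rows.append(np.concatenate([z, -t * X, v * X]))
        rows.append(np.concatenate([t * X, z, -u * X]))
        rows.append(np.concatenate([-v * X, u * X, z]))
    # The null vector of the stacked system is the last right singular vector.
    return np.linalg.svd(np.array(rows))[2][-1].reshape(3, 3)

def dehom(x):
    """Homogeneous 3-vector to inhomogeneous 2-vector (assumes a finite point)."""
    return x[:2] / x[2]

def projective_structure(p, p2, A_tilde, A_hat, v_l):
    """Cross-ratio (|p' - p~'| / |p' - p^'|) * (|V_l - p^'| / |V_l - p~'|),
    with p, p', V_l given in homogeneous coordinates."""
    q, v = dehom(p2), dehom(v_l)
    pt, ph = dehom(A_tilde @ p), dehom(A_hat @ p)
    d = np.linalg.norm
    return (d(q - pt) / d(q - ph)) * (d(v - ph) / d(v - pt))
```

In this sketch, `T_tilde` and `A_tilde` would be built from the correspondences of $P_1, P_2, P_3$, and `T_hat` and `A_hat` from those of $P_2, P_3, P_4$, with the epipole pair supplying the fourth correspondence in the projective case. Because distances are measured on dehomogenized points, the projective function assumes that the epipole and the transferred points are finite in the second image.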
Figure 2: Affine structure using two reference planes.

This leads to the result (Theorem 1) that, in addition to the epipoles, only four corresponding points, projected from four non-coplanar points in the scene, are sufficient for recovering the projective structure invariant for all other points. The epipoles can be recovered by either extending the Koenderink and Van Doorn (1991) construction to projective space using six points (four of which are assumed to be coplanar), or by using other methods, notably those based on the Longuet-Higgins fundamental matrix. This leads to projective structure from eight points in general position.

6 Projective Structure

We assume for now that the location of both epipoles is known, and we will address the problem of finding the epipoles later. The epipoles, also known as the foci of expansion, are the intersections of the line in space connecting the two centers of projection and the image planes. There are two epipoles, one on each image plane: the epipole on the second image we call the left epipole, and the epipole on the first image we call the right epipole. The image lines emanating from the epipoles are known as the epipolar lines.

Consider Figure 3 which illustrates the two reference plane construction, defined earlier for parallel projection, now displayed in the case of central projection. The left epipole is denoted by $V_l$, and because it is on the line $V_1V_2$ (connecting the two centers of projection), the line $PV_1$ projects onto the epipolar line $p'V_l$. Therefore, the points $\tilde{P}$ and $\hat{P}$ project onto the points $\tilde{p}'$ and $\hat{p}'$, which are both on the epipolar line $p'V_l$. The points $p', \tilde{p}', \hat{p}'$ and $V_l$ are collinear and projectively related to $P, \tilde{P}, \hat{P}, V_1$, and therefore have the same cross-ratio:

$$\alpha_p = \frac{|P - \tilde{P}|}{|P - \hat{P}|} \cdot \frac{|V_1 - \hat{P}|}{|V_1 - \tilde{P}|} = \frac{|p' - \tilde{p}'|}{|p' - \hat{p}'|} \cdot \frac{|V_l - \hat{p}'|}{|V_l - \tilde{p}'|}.$$

Figure 3: Definition of projective shape as the cross ratio of $p', \tilde{p}', \hat{p}', V_l$.

Note that when the epipole $V_l$ becomes an ideal point, $\alpha_p$ is the same as the affine invariant defined in section 5 for parallel projection.

The cross-ratio $\alpha_p$ is a direct extension of the affine structure invariant defined in section 5 and is referred to as projective structure. We can use this invariant to reconstruct any novel view of the object (taken by a non-rigid camera) without ever recovering depth or even projective coordinates of the object.

Having defined the projective shape invariant, and assuming we still are given the locations of the epipoles, we show next how to recover the projections of the two reference planes onto the second image plane, i.e., we describe the computations leading to $\tilde{p}'$ and $\hat{p}'$.

Since we are working under central projection, we need to identify four coplanar points on each reference plane. In other words, in the projective geometry of the plane, four corresponding points, no three of which are collinear, are sufficient to determine uniquely all other correspondences (see Appendix A for more details). We must, therefore, identify four corresponding points that are projected from four coplanar points in space, and then recover the projective transformation that accounts for all other correspondences induced from that plane. The following proposition states that the corresponding epipoles can be used as a fourth corresponding point for any three corresponding points selected from the pair of images.

Proposition 1 A projective transformation, $A$, that is determined from three arbitrary, non-collinear, corresponding points and the corresponding epipoles, is a projective transformation of the plane passing through the three object points which project onto the corresponding image points. The transformation $A$ is an induced epipolar transformation, i.e., the ray $Ap$ intersects the epipolar line $p'V_l$ for any arbitrary image point $p$ and its corresponding point $p'$.

Comment: An epipolar transformation $F$ is a mapping between corresponding epipolar lines and is determined (not uniquely) from three corresponding epipolar lines and the epipoles. The induced point transformation is