Mathematical Flaws in the Essential Matrix Theory

Recent Advances in Signals and Systems

TAYEB BASTA College of Computing Al Ghurair University Po Box 37374, Dubai Academic City, Dubai UNITED ARAB EMIRATES [email protected]

Abstract: - Extracting 3D structure from two views is a flourishing subject in the computer vision literature. In 1981 Longuet-Higgins introduced what seemed to be a mathematically founded theory that relates the corresponding points of the two images independently of the extrinsic camera parameters. Since then a number of contributions based on this theory have emerged. Longuet-Higgins defined the world point in two different reference frames and derived a formula relating the image points defined in the two frames through a matrix. Trucco presented Longuet-Higgins’ solution by formulating the problem as the product of three coplanar vectors. He then derived an algebraic formula relating the two image points through an algebraic entity called the essential matrix. Such a matrix is independent of the position of the cameras used to capture the two views. In this paper I show that the reasoning of Longuet-Higgins in its original form is based on an undefined vector operation. The reasoning as presented in Trucco was misled by the assumption that the world reference frame is fixed onto the left camera frame. It did not take into account that (1) dividing the coordinates of a world point by its Z-coordinate yields a point belonging to a plane parallel to the XY-plane, and (2) fixing the world reference frame onto the left camera implies that the coordinates of any projection belong simultaneously to the world reference frame and the left camera frame. The contribution of this paper is to unveil a misconception in the theory of Longuet-Higgins’ algorithm that has remained hidden to date.

Key-Words: - Essential matrix, fundamental matrix, two-view, epipolar geometry, 3D extraction

1 Introduction

Extracting 3D structure from two views is a significant step in recovering 3D structure from a sequence of camera images. This theme in computer vision became prominent in 1981 when Longuet-Higgins produced his famous algorithm "A computer algorithm for reconstructing a scene from two projections" [3]. This algorithm was supplemented in 1992 by Faugeras’ article “What can be seen in three dimensions with an uncalibrated stereo rig?” [2]. Since then a large number of relevant publications have appeared in the computer vision discipline. Most of that literature is concerned only with estimating the essential and fundamental matrices.

The contribution of this paper is to unveil a misconception in the theory of Longuet-Higgins’ algorithm that has remained hidden to date.

The rest of the paper is organized as follows: section 2 presents the basic theory of the essential and fundamental matrices. Section 3 introduces Longuet-Higgins’ algorithm and identifies the misconception that occurred in its theory. Finally, section 4 concludes the work.

2 Essential and Fundamental Matrices

Correspondence methods for extracting 3D structure from two views of a given scene are based on the epipolar geometry. Such a geometry is represented by a 3×3 singular matrix. The image points of a given world point are related by this matrix. The 3D

ISSN: 1790-5109 215 ISBN: 978-960-474-114-4 Recent Advances in Signals and Systems

extraction problem is classified into two categories depending on the availability of information about the intrinsic camera parameters. When these parameters are known, the problem belongs to the calibrated case and the matrix is called the essential matrix; when the parameters are unknown, the problem belongs to the uncalibrated case and the matrix is called the fundamental matrix. The intrinsic camera parameters are the focal length, the pixel scale, the optical center, and the pixel skew.

Fig. 1 Epipolar Geometry

Assuming that there is no occlusion, $M = (X, Y, Z)^T$ is a point in the scene projected onto the left and right views as $m_l$ and $m_r$, respectively, as shown in Fig. 1. Image points $m_l$ and $m_r$ are called corresponding points.

Thus, a point $m_l$ in one image generates a line $l_r$ in the other image on which its corresponding point $m_r$ must lie. The search for correspondences is thus reduced from a region to a line; this is known as the epipolar constraint.

The relationship between $m_l$ and $m_r$ is described by the formula $m_r^T E m_l = 0$, where $E$ is the essential matrix.

In 1992 Olivier Faugeras published a paper [2] that overturned existing thinking about camera calibration and the extraction of metric information from our environment using cameras. He set out to determine what information could be extracted from a binocular stereo rig for which there was no three-dimensional metric calibration data available. All that is assumed is that we have a stereo camera system that is capable, by comparing the two images, of establishing some correspondence between them. Each such correspondence, written $(m_l, m_r)$, indicates that the two image points $m_l$ and $m_r$ are the images of the same world point $M$. Thus, the system does not know the camera intrinsic and extrinsic parameters. This is known as the uncalibrated system. The extrinsic camera parameters are the parameters that define the location and orientation of the camera reference frame with respect to a known world reference frame.

In the uncalibrated system, the fundamental matrix $F$ encapsulates both intrinsic and extrinsic camera parameters, and the relationship between $m_l$ and $m_r$ is expressed by the formula $m_r^T F m_l = 0$.

The knowledge of image correspondences enables 3D structure extraction and scene reconstruction from images. Calculating the essential matrix requires knowledge of the intrinsic camera parameters plus a number of corresponding pairs of points from the two views. Calculating such a matrix therefore requires detecting points in the two views of the scene under study, and then matching pairs of those detected points across both views.

3 Longuet-Higgins' algorithm

In 1981 Longuet-Higgins [3] presented a solution to the 3D extraction problem by providing a relation between the two image points of a given world point. This relation is expressed through an algebraic entity called the essential matrix, and it is independent of the extrinsic camera parameters.

3.1. Original Longuet-Higgins’ approach

a. The author considered a world point $P$ with $(X_1, X_2, X_3)$ and $(X'_1, X'_2, X'_3)$ as its coordinates with respect to the two camera reference frames.

b. He calculated the homogeneous coordinates of the two image points of $P$ in the two views: $x_u = X_u / X_3$ and $x'_v = X'_v / X'_3$ $(u, v = 1, 2, 3)$.

c. The two sets of three-dimensional coordinates are connected by an arbitrary displacement, so $X'_u = R_{uv}(X_v - T_v)$.


d. The author created a new matrix $Q = RS$, where

$$S = \begin{bmatrix} 0 & T_3 & -T_2 \\ -T_3 & 0 & T_1 \\ T_2 & -T_1 & 0 \end{bmatrix}.$$

e. He then built a new term $X'_u Q_{uv} X_v$ and used the above arithmetic transformations to conclude that $X'_u Q_{uv} X_v$ is equal to zero.

f. Finally he divided $X'_u Q_{uv} X_v$ by $X'_3 X_3$ to arrive at the desired relation between the image coordinates, $x'_u Q_{uv} x_v = 0$. The entity $Q_{uv}$ is the essential matrix.

3.2. The Flaw in Longuet-Higgins’ derivation

The flaw in Longuet-Higgins' derivation lies in building the term $X'_u Q_{uv} X_v$. Such a term includes the vector $X'_u = (X'_1, X'_2, X'_3)$, which is defined with respect to the right camera frame, the vector $X_v = (X_1, X_2, X_3)$, which is defined with respect to the left camera frame, and the 3×3 matrix $Q_{uv}$, which is the result of operations on the rotation and translation of the right camera frame with respect to the left camera frame.

Before substituting the elements of the term $X'_u Q_{uv} X_v$, let us see whether it is a defined entity or not. The evaluation of this term starts by computing $V = X'_u Q_{uv}$, which is the product of a 3-dimensional vector by a 3×3 matrix. $V$ is a 3-dimensional vector defined with respect to the same frame as the vector $X'_u$. When $Q_{uv}$ is not the identity matrix, the magnitude or direction of $V$ will be different from those of $X'_u$. So the above term can be simplified to $X'_u Q_{uv} X_v = V \cdot X_v$.

$V \cdot X_v$ is the dot (scalar or inner) product of two vectors. The scalar product is an operation between two vectors. The result is a scalar [1]. While it is not explicitly stated, the two vectors must be defined with respect to the same reference frame. The vector $V$ is defined in the right camera frame and the vector $X_v$ is defined in the left camera frame.

The example of Fig. 2 below illustrates that the dot product $V \cdot X_v$ is undefined. The example consists of a vector space in which two coordinate frames are defined. The vectors $V$ and $U$ are defined in the right coordinate frame as $V = [3, 2, 4]^T$ and $U = [U_1, U_2, U_3]^T$, respectively. The vector $W$ is defined in the left frame as $W = [3, 2, 4]^T$.

What about the dot product defined as $[3, 2, 4] \begin{bmatrix} U_1 \\ U_2 \\ U_3 \end{bmatrix}$? Is it equal to $V \cdot U$ or $W \cdot U$?

It is obvious that

$$V \cdot U = [3, 2, 4] \begin{bmatrix} U_1 \\ U_2 \\ U_3 \end{bmatrix} \tag{1}$$

The product $W \cdot U$, however, is not a legal operation unless the vectors $W$ and $U$ are defined with respect to the same coordinate frame; otherwise, we can have an unlimited number of frames and an unlimited number of vectors $W = [3, 2, 4]^T$, each of which is defined in one of these frames and satisfies equation (1).

Fig. 2 Two vectors V and W with equal coordinates and defined in two reference frames

Building an expression that includes vectors or points defined with respect to different coordinate frames is mathematically undefined. The term $X'_u Q_{uv} X_v$ can only be valid if the vectors $X'_u$ and $X_v$ are defined with respect to one single coordinate frame, which is not the case in Longuet-Higgins’ work.
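The Fig. 2 example can be sketched numerically. The following is only an illustrative sketch under assumptions that are not in the paper: the two frames are taken to differ by a 90° rotation about the Z-axis, $U$ is given the concrete value $[1, 0, 0]^T$, and the names `rotation_z` and `R_right_from_left` are hypothetical. It shows that multiplying the coordinate triples of $W$ (left frame) and $U$ (right frame) directly gives the same number as $V \cdot U$, whereas first expressing $W$ in the right frame gives a different number.

```python
import numpy as np


def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the Z-axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


# ASSUMPTION (not from the paper): right-frame coordinates are obtained
# from left-frame coordinates by a 90-degree rotation about Z.
R_right_from_left = rotation_z(np.pi / 2)

V_right = np.array([3.0, 2.0, 4.0])  # V, coordinates in the right frame
W_left = np.array([3.0, 2.0, 4.0])   # W, same triple but in the left frame
U_right = np.array([1.0, 0.0, 0.0])  # ASSUMPTION: a concrete value for U

# Equation (1): both operands are expressed in the right frame.
v_dot_u = V_right @ U_right

# Multiplying the raw triples of W and U mixes two frames; numerically it
# cannot be told apart from V . U.
naive = W_left @ U_right

# Expressing W in the right frame first, then taking the dot product.
W_right = R_right_from_left @ W_left
consistent = W_right @ U_right

print(v_dot_u, naive, consistent)
```

Under these assumed values the mixed-frame product coincides with $V \cdot U$, while the frame-consistent value of $W \cdot U$ differs; a different assumed rotation would give a different frame-consistent value.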


It is important to distinguish between building the term $X'_u Q_{uv} X_v$ and defining a relationship (e.g. equality) that can exist between two expressions including vectors defined in two different coordinate frames. The latter is done to transform vectors and points from one coordinate frame to another, whereas the term $X'_u Q_{uv} X_v$ is built to be evaluated, and it is undefined unless its components are defined with respect to the same coordinate frame.

3.3. New presentation of Longuet-Higgins' approach

In the rest of this section I consider the same solution approach to 3D extraction as it is presented, in a less ambiguous way, in [4]. The author succeeded in formulating the two-view problem as the product of three coplanar vectors in a single coordinate frame, namely the world reference frame (Fig. 3 below). He then extracted the essential matrix formula.

Fig. 3 Two-views as three coplanar vectors

“The reference frames of the left and the right cameras are related via the extrinsic parameters. These define a rigid transformation in 3-D space, defined by a translation vector, $T = (O_r - O_l)$, and a rotation matrix $R$. Given a point $P$ in space, the relation between $P_l$ and $P_r$ is therefore

$$P_r = R(P_l - T) \tag{2}$$

The relation between a point in 3-D space and its projections is described by the usual equations of perspective projection, in vector form:

$$p_l = \frac{f_l}{Z_l} P_l \tag{3}$$

and

$$p_r = \frac{f_r}{Z_r} P_r \tag{4}$$

The equation of the epipolar plane through $P$ can be written as the coplanarity condition of the vectors $P_l$, $T$, and $P_l - T$, or $(P_l - T)^T \, T \times P_l = 0$.

Using equation (2), we obtain

$$(R^T P_r)^T \, T \times P_l = 0 \tag{5}$$

Recalling that a vector product can be written as a multiplication by a rank-deficient matrix, we can write $T \times P_l = S P_l$, where

$$S = \begin{bmatrix} 0 & -T_z & T_y \\ T_z & 0 & -T_x \\ -T_y & T_x & 0 \end{bmatrix} \tag{6}$$

Using this fact, (5) becomes

$$P_r^T E P_l = 0 \tag{7}$$

With

$$E = RS \tag{8}$$

Observe that, using (3) and (4), and dividing by $Z_r Z_l$, (7) can be written as

$$p_r^T E p_l = 0 \tag{9}$$

[4]"

3.4. Flaw in the new presentation of Longuet-Higgins’ approach

Researchers place the following assumption at the beginning of derivations of the essential and fundamental matrices: Without loss of generality it is assumed that the world coordinate system coincides with the left camera system [5]. Such an assumption implies that the plane of the left camera is defined by the equation $Z = f_l$. This is the plane that is parallel to the XY-plane and intersects the Z-axis at the point $(0, 0, f_l)$.

In this context I would like to cite some simple mathematical facts.

a. Camera centers, vectors, points, translation, and rotation are described with respect to a single world reference frame. Such a reference frame can be fixed onto the first camera, onto the second camera, or anywhere else. However, it has to be fixed exactly once and before starting the analysis of the problem under study. For the current problem, it is not correct to fix it onto the


first camera to project the vector $P_l$ and then move it to the second camera to project the vector $P_r$.

b. Given a constant scalar $a$, the plane that is parallel to the XY-plane and intersects the Z-axis at the point $(0, 0, a)$ is defined by the equation $Z = a$. The points of this plane are defined by the set of coordinates $(x, y, a)$.

From this point on I would like to invite the reader to set aside, for a while, what are considered achievements in the computer vision discipline and listen to mathematics.

The vectors $P_l$ and $P_r$ are defined in the world reference frame by

$$P_l = \begin{pmatrix} X_l \\ Y_l \\ Z_l \end{pmatrix} \quad \text{and} \quad P_r = \begin{pmatrix} X_r \\ Y_r \\ Z_r \end{pmatrix},$$

respectively.

I rewrite the above equations that relate a given world point to its two image points in more detail. Equation (3) can be rewritten as follows:

$$p_l = \frac{f_l}{Z_l} P_l = \begin{pmatrix} u \\ v \\ f_l \end{pmatrix}, \quad \text{where } u = f_l \frac{X_l}{Z_l} \text{ and } v = f_l \frac{Y_l}{Z_l} \tag{10}$$

• As specified in the simple mathematical facts above, the fact that the focal length of the left camera $f_l$ is a constant implies that $p_l$ is a point of the plane $Z = f_l$, that is, the left camera plane.

• The coordinates $(u, v, f_l)$ are those of $p_l$ in the world reference frame. At the same time they are the coordinates of $p_l$ in the left camera reference frame because of the assumption, stated at the beginning of the current section, that “Without loss of generality it is assumed that the world reference frame is fixed onto the first camera”.

• If the world reference frame is fixed neither onto the left nor onto the right camera, $p_l = (u, v, f_l)$ is a point of the plane that is parallel to the XY-plane and intersects the Z-axis at the point $(0, 0, f_l)$. Such a plane is completely different from both camera planes.

Equation (4) can be written as follows:

$$p_r = \frac{f_r}{Z_r} P_r = \begin{pmatrix} u' \\ v' \\ f_r \end{pmatrix}, \quad \text{where } u' = f_r \frac{X_r}{Z_r} \text{ and } v' = f_r \frac{Y_r}{Z_r} \tag{11}$$

The focal length of the right camera $f_r$ is a constant. $(u', v', f_r)$ are the coordinates of the point $p_r$ in the world reference frame.

Let us now ask the following question:

• Is $p_r$ a point of the right camera plane or of the plane defined by the equation $Z = f_r$?

The answer is:

• Whatever the position of the world reference frame is, the point $p_l = (u, v, f_l)$ belongs to the plane defined by the equation $Z = f_l$, which is the plane parallel to the XY-plane that intersects the Z-axis at the point $(0, 0, f_l)$, and the point $p_r = (u', v', f_r)$ belongs to the plane defined by the equation $Z = f_r$, which is the plane parallel to the XY-plane that intersects the Z-axis at the point $(0, 0, f_r)$.

• If the world reference frame is fixed onto the left camera, the point $p_l = (u, v, f_l)$ belongs to the left camera plane and $p_r = (u', v', f_r)$ belongs to a plane parallel to the left camera plane.

• If the world reference frame is fixed onto the right camera, the point $p_l = (u, v, f_l)$ belongs to a plane parallel to the right camera plane and $p_r = (u', v', f_r)$ belongs to the right camera plane.

Consequently, the two points $p_l$ and $p_r$ belong to two parallel planes, and if both cameras have the same focal length, these points belong to the same plane, i.e. the plane of the left camera onto which the world reference frame is fixed. If the world reference frame is fixed somewhere else, then these planes are both parallel to the XY-plane of that frame. In the latter case neither of the points $p_l$ and $p_r$ belongs to any camera plane.

So the point $p_r$ defined by equation (4) is not the image of the world point $P$ on the right view as expected or as its name suggests. Consequently, the


relationship $p_r^T E p_l = 0$ is between two points that belong to two planes parallel to the XY-plane and is not a relationship between a point from the left camera plane and a point from the right camera plane.

4 Conclusion

While the use of the formula $m_r^T E m_l = 0$ to calculate the essential matrix in order to extract 3D structure is seen as a landmark in the history of the computer vision discipline, it still needs to be further revised and scrutinized for accuracy. Unfortunately, instead of doing so, researchers have focused their attention on estimating this supposed entity, the essential matrix.

Longuet-Higgins' error is completely obvious and simple. He used two vectors defined with respect to two different coordinate frames in a scalar product operation, which is incorrect. The expression $X'_u Q_{uv} X_v$ is undefined.

In the second presentation of the approach, the authors succeeded in defining the problem in one reference frame, but they were misled by the similarity, at least in shape, of the image points. They did not take into account that the divisions of the left vector $P_l$ and the right vector $P_r$ by the scalars $Z_l$ and $Z_r$, respectively, generate projections onto planes parallel to the XY-plane. The assumption “Without loss of generality it is assumed that the world reference frame coincides with the left camera frame” means that the vector $P_r$ and its projection $p_r$ are defined in the world reference frame, so they are expressed in the left camera frame and are not defined in the right camera frame.

The findings of this study could defuse a delicate hidden mistake in the derivation of the essential matrix.

References:
[1] B. Blank, S. Krantz, Calculus Multivariable, Wiley & Sons, 2008.
[2] O.D. Faugeras, What can be seen in three dimensions with an uncalibrated stereo rig?, in G. Sandini (ed.), Proceedings of the 2nd European Conference on Computer Vision, Santa Margherita Ligure, Italy, Springer-Verlag, 1992, pp. 563-578.
[3] H.C. Longuet-Higgins, A computer algorithm for reconstructing a scene from two projections, Nature, 293, 1981, pp. 133-135.
[4] E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, Prentice Hall, 1998.
[5] Z. Y. Zhang, Determining the epipolar geometry and its uncertainty - a review, International Journal of Computer Vision, 27(2), March 1998, pp. 161-195.
