Robust Structure from Motion Using Motion Parallax
Roberto Cipolla
Department of Engineering
University of Cambridge
Cambridge CB2 1PZ.

Yasukasu Okamoto and Yoshinori Kuno
Research and Development Center
Toshiba Corporation
Kawasaki 210, Japan.

Abstract

We present an efficient and geometrically intuitive algorithm to reliably interpret the image velocities of moving objects in 3D. It is well known that under weak perspective the image motion of points on a plane can be characterised by an affine transformation. We show that the relative image motion of a nearby non-coplanar point and its projection on the plane is equivalent to motion parallax and, because it is independent of viewer rotations, is a reliable geometric cue to 3D shape and viewer/object motion.

In particular we show how to interpret the motion parallax vector of non-coplanar points (and contours) and the curl, divergence and deformation components of the affine transformation (defined by three points or a closed contour of the plane) in order to recover: the projection of the axis of rotation of a moving object; the change in relative position of the object; the rotation about the ray; the tilt of the surface; and a one-parameter family of solutions for the slant as a function of the magnitude of the rotation of the object. The latter is a manifestation of the bas–relief ambiguity. These measurements, although representing an incomplete solution to structure from motion, are the only subset of structure and motion parameters which can be reliably extracted from two views when perspective effects are small.

We present a real-time example in which the 3D visual interpretation of hand gestures or of a hand-held object is used as part of a man–machine interface. This is an alternative to the Polhemus coil instrumented Dataglove commonly used in sensing manual gestures.

1 Introduction

Structure from motion

The way appearances change in the image due to relative motion between the viewer and the scene is a well-known cue for the perception of 3D shape and motion. Computational attempts to quantify the perception of 3D shape have determined the number of points and the number of views needed to recover the spatial configuration of the points and the motion compatible with the views. Ullman, in his well-known structure from motion theorem [13], showed that a minimum of three distinct orthographic views of four non-planar points in a rigid configuration allow the structure and motion to be completely determined. If perspective projection is assumed, two views are, in principle, sufficient. In fact two views of eight points allow the problem to be solved with linear methods [8], while five points from two views give a finite number of solutions [3].

Problems with this approach

Although structure from motion algorithms based on these formulations have been successfully applied in photogrammetry and some robotics systems [4] when a wide field of view, a large range of depths and a large number of accurately measured image data points are assured, these algorithms have been of little or no practical use in analysing imagery in which the object of interest occupies a small part of the field of view or is distant.

In this paper we summarise why structure from motion algorithms are often very sensitive to errors in the measured image velocities, and then show how to efficiently and reliably extract an incomplete, qualitative solution. We also show how to augment this into a complete solution if additional constraints or views are available.

The main problems with existing structure from motion formulations are:

1. Perspective effects are often small

Structure from motion algorithms attempt to deliver a complete quantitative solution to the viewer or object motion (both 3D translation and 3D rotation) and then to reconstruct a Euclidean 3D copy of the scene. Such complete quantitative solutions to the structure from motion problem, however, are not only often too difficult, but are numerically ill-conditioned, often failing in a graceless fashion in the presence of image measurement noise [14]. This is because they rely on the measurement of perspective effects, which can be very small. In such cases the effects in the image of viewer translations parallel to the image plane are very difficult to discern from rotations about axes parallel to the image plane. Another ambiguity which often arises is the bas–relief ambiguity, which concerns the difficulty of distinguishing between a "shallow" structure close to the viewer and a "deep" structure further away when perspective effects are small. Note that this concerns surface orientation, and its effect – unlike the speed–scale ambiguity – is to distort the shape.

2. Global rigidity and independent motion

Existing approaches place a lot of emphasis on global rigidity. Despite this, it is well known that two (even orthographic) views give vivid 3D impressions even in the presence of a degree of non-rigidity, such as the class of smooth transformations, e.g. bending transformations which are locally rigid [6]. Many existing methods cannot deal with multiple moving objects, and they usually require the input image to be segmented into parts corresponding to the same rigid body motion. Segmentation using image velocities should be performed locally and with a small number of measurements. This is a non-trivial task if the image velocity data is noisy.

Our approach

In this paper we present an efficient and reliable solution to the structure from motion problem which ignores small perspective effects and the constraint of global rigidity. We assume weak perspective [11] in a small neighbourhood and concentrate on shape and motion parameters which do not rely on perspective effects or global rigidity. The solution is, however, incomplete, and motion and shape are expressed more qualitatively by spatial order (relative depths) and affine structure (Euclidean shape up to an arbitrary 3D affine transformation [6]).

The algorithm consists of two parts. First, relative velocities in a small neighbourhood are processed to remove the effect of any viewer rotations, leaving a velocity component that depends purely on 3D shape and viewer translational motion. Second, it decomposes the differential invariants of the image velocity field (the divergence, curl and deformation components of the affine transformation) to recover the components which depend on: (1) the change of scale due to the change in distance between the viewer and the object (for a general motion this is not encoded by divergence alone); (2) the rotation of the object about the visual ray; and (3) relative surface orientation. It is a development of the pioneering work of Longuet–Higgins and Prazdny [9] and of Koenderink and Van Doorn [5, 6] (reviewed below).

2 Theoretical framework

2.1 Interpretation of image velocities under perspective projection

Consider an arbitrary co-ordinate system with the x–y plane spanning the image plane (at distance f from the optical centre) and the z-axis aligned with the ray. Assume the viewer to have a translational velocity with components {U1, U2, U3} and an angular velocity with components {Ω1, Ω2, Ω3}. Let the image velocity field at a point (x, y) in the vicinity of a visual direction be represented as a 2D vector field ⃗v(x, y) with x and y components (u, v). The two components of the image velocity of a point in space (X, Y, Z), due to relative motion between the observer and the scene under perspective projection, are given by [9]:

    u = (fU1 − xU3)/Z + fΩ2 − yΩ3 − (xy/f)Ω1 + (x²/f)Ω2
    v = (fU2 − yU3)/Z − fΩ1 + xΩ3 + (xy/f)Ω2 − (y²/f)Ω1        (1)

The image velocity consists of two components. The first component is determined by the relative translational velocity and encodes the structure of the scene, Z. The second component depends only on rotational motion about the viewer centre (eye movements). It gives no useful information about the depth of the point or the shape of the visible surface. It is this rotational component which complicates the interpretation of visual motion. The effects of rotation are, however, hard to extricate, although numerous solutions have been proposed. As a consequence, point image velocities and disparities do not encode shape in a simple, efficient way, since the rotational component is often arbitrarily chosen to shift attention and gaze by camera rotations or eye movements. The rotational component can be removed if, instead of raw image motion, the difference of the image motions of a pair of points is used. This is called motion parallax.

2.2 Motion Parallax

Consider two visual features at different depths whose projections on the image plane are instantaneously at (xi, yi), i = 1, 2, and which have image velocities given by (1). If these two features are instantaneously coincident in the image, (x1, y1) = (x2, y2) = (x, y), their relative image velocity, (∆u, ∆v) – motion parallax – depends only on their relative inverse depths and on the viewer's translational velocity.

Under weak perspective, the image velocity field at a point (x, y) in the neighbourhood of a given visual direction is given by:

    ( u )   ( u0 )   ( ux  uy ) ( x )
    ( v ) ≈ ( v0 ) + ( vx  vy ) ( y )        (4)

The first term is a vector [u0, v0] representing a pure translation (specifying the change in image position of the centroid of the shape), while the second term is a 2×2 tensor – the velocity gradient tensor – which represents the distortion of the image shape. The latter can be decomposed into independent components which have simple geometric interpretations: a 2D rigid rotation (vorticity), specifying the change in orientation, curl ⃗v; an isotropic expansion (divergence), specifying a change in scale, div ⃗v; and a pure shear (deformation).
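As a numerical illustration of the perspective projection equations (1), the sketch below (Python; the function name, the default focal length f = 1 and the sample values are illustrative, not from the paper) evaluates the translational and rotational components of the image velocity separately:

```python
def image_velocity(x, y, Z, U, Omega, f=1.0):
    """Image velocity (u, v) of a point seen at image position (x, y)
    and depth Z, for viewer translation U = (U1, U2, U3) and angular
    velocity Omega = (Omega1, Omega2, Omega3), as in equation (1)."""
    U1, U2, U3 = U
    O1, O2, O3 = Omega
    # translational component: encodes the scene structure Z
    u_t = (f * U1 - x * U3) / Z
    v_t = (f * U2 - y * U3) / Z
    # rotational component: independent of depth Z
    u_r = f * O2 - y * O3 - (x * y / f) * O1 + (x * x / f) * O2
    v_r = -f * O1 + x * O3 + (x * y / f) * O2 - (y * y / f) * O1
    return u_t + u_r, v_t + v_r
```

For a pure rotation (U = 0) the result is the same for every depth Z, which is the statement in the text that the rotational component carries no information about the depth of the point.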
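The motion parallax claim of section 2.2 can be checked directly from equations (1): for two features momentarily coincident at (x, y), the rotational terms are identical for both and cancel in the velocity difference, leaving ∆u = (fU1 − xU3)(1/Z1 − 1/Z2) and ∆v = (fU2 − yU3)(1/Z1 − 1/Z2). A minimal sketch (illustrative names; f defaults to 1):

```python
def motion_parallax(x, y, Z1, Z2, U, Omega, f=1.0):
    """Relative image velocity (Du, Dv) of two features instantaneously
    coincident at (x, y) but at depths Z1 and Z2: equation (1) applied
    twice; the rotational terms are common to both and cancel."""
    def velocity(Z):
        U1, U2, U3 = U
        O1, O2, O3 = Omega
        u = (f * U1 - x * U3) / Z + f * O2 - y * O3 \
            - (x * y / f) * O1 + (x * x / f) * O2
        v = (f * U2 - y * U3) / Z - f * O1 + x * O3 \
            + (x * y / f) * O2 - (y * y / f) * O1
        return u, v
    u1, v1 = velocity(Z1)
    u2, v2 = velocity(Z2)
    return u1 - u2, v1 - v2
```

The difference depends only on the relative inverse depths (1/Z1 − 1/Z2) and the translational velocity, so it is unchanged under any choice of Omega.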
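The decomposition of the velocity gradient tensor of equation (4) into its differential invariants can be computed directly from the four entries (ux, uy, vx, vy). A minimal sketch (illustrative function name; the sign conventions are the usual definitions of div and curl for a 2D field):

```python
def decompose_gradient_tensor(ux, uy, vx, vy):
    """Differential invariants of the 2x2 velocity gradient tensor:
    divergence (isotropic change of scale), curl (2D rigid rotation,
    i.e. vorticity) and the magnitude of the pure shear (deformation)."""
    div = ux + vy            # isotropic expansion
    curl = vx - uy           # vorticity: change of image orientation
    shear_1 = ux - vy        # shear along the image x axis
    shear_2 = uy + vx        # shear along the 45-degree axis
    deformation = (shear_1 ** 2 + shear_2 ** 2) ** 0.5
    return div, curl, deformation
```

A pure rotation tensor (ux = vy = 0, vx = −uy = ω) gives zero divergence and deformation and curl 2ω; an isotropic expansion (ux = vy = k, uy = vx = 0) gives divergence 2k and nothing else.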