
An Overview of Matchmoving using Structure from Motion Methods

Kamyar Haji Allahverdi Pour
Department of Computer Engineering
Sharif University of Technology
Tehran, Iran
Email: [email protected]

Alimohammad Rabbani
Department of Computer Engineering
Sharif University of Technology
Tehran, Iran
Email: [email protected]

Abstract—Nowadays, seeing supernatural events in movies is not rare. Here we discuss a method for adding artificial objects to an ordinary video. This process of adding computer-generated objects to a natural video is called Matchmoving [1]. The main part of this process is Structure from Motion (SfM), the process that enables matchmovers to use the 3D information of the scene. We will mainly discuss this part: how it works and the different methods used to extract its information.

I. INTRODUCTION

As a matchmover tries to add an artificial object to a real scene, he needs the 3D information of the scene. This information can be extracted from a sequence of images: using the differences between images, a perception of the 3D structure can be computed. As projecting 3D information down to 2D inevitably introduces data loss, this task is challenging and needs extra information to perform the reconstruction. That information can come from prior knowledge about the scene, e.g. looking for parallel lines in the images. Another way is to find corresponding points in different views and construct 3D points using triangulation [8].

Fig. 1. Basic Steps of SfM [5]

II. STRUCTURE FROM MOTION

First of all, we discuss the process by which we can find information about the real scene. This task is done using a method called Structure from Motion; the basic steps of this process are shown in Figure 1. The process starts with understanding the mapping between the 3D world and the 2D image plane. After that, we discuss the method to find 3D points from 2D points.

A. From 3D to 2D

We want to model the process by which a 3D point is mapped to a 2D point by a camera. The most common model is the pinhole projection model. If there is no non-linear effect, like radial distortion, this model is a good approximation for most real cameras. The model has 3 components through which a 3D point is converted to a 2D point:

1) The first component transforms real-world coordinates to camera coordinates. This transform consists of a rotation and a translation, which together are called the camera extrinsic parameters.

2) The second component maps 3D points to 2D image points. This mapping is a matrix like the one below:

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

where f is the focal length of the camera. Here we are using homogeneous coordinates, so we should keep the last element equal to 1. As we don't have access to the overall scale factor, we only have an arbitrary perception of depth. This means that using the information of just one image, we can't tell how far an object is from the camera. We can set this scale factor to 1, as we don't know the exact number.

3) The final component converts 2D image-plane coordinates to pixel coordinates. This transformation is done using an upper-triangular matrix K which holds the camera intrinsic parameters. This matrix can be written as:

$$K = \begin{bmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

By combining the aforementioned matrices, we can compute a projection matrix P which maps 3D points to 2D pixel coordinates:

$$x \sim PX$$

Using this equation, each point gives two linear equations in the elements of P. This matrix has 12 elements, of which 11 are unknown since the scale factor is arbitrary. So if we have at least 6 points in 3D space and their correspondents in the 2D image plane, we can write 12 equations to find the elements of P. A numerical sketch of this projection follows Section II-B below.

B. Finding Correspondent Points

As mentioned before, we need to find corresponding points to extract the 3D scene structure. One way to find these correspondent points is to ask a user to tag them in our images. This approach is time consuming, and the results depend on how accurately the user selects the correspondent points. Doing this automatically is, in general, a difficult task. Methods that find correspondent points automatically use the information around important pixels to describe them; these important points are called interest points. We can measure the similarity between these descriptors to estimate how likely two points are to be a match. For example, we can use the Harris corner detector [3] to find such interest points; it finds points that are maxima of the image auto-correlation function.
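To make the pinhole model concrete, the following is a minimal sketch of the projection pipeline in NumPy. All numbers are made-up example values (an 800-pixel focal length, a 640x480 principal point), not calibration data from any real camera:

```python
import numpy as np

# Intrinsic matrix K (upper triangular), with assumed example values.
alpha_u, alpha_v = 800.0, 800.0   # focal length expressed in pixels
u0, v0, s = 320.0, 240.0, 0.0     # principal point and skew
K = np.array([[alpha_u, s,       u0],
              [0.0,     alpha_v, v0],
              [0.0,     0.0,     1.0]])

# Extrinsic parameters: rotation R and translation t (here: identity pose).
R = np.eye(3)
t = np.zeros((3, 1))

# Projection matrix P = K [R | t], mapping homogeneous 3D points to pixels.
P = K @ np.hstack([R, t])

# Project a homogeneous 3D point X = (X, Y, Z, 1)^T.
X = np.array([0.5, -0.2, 4.0, 1.0])
x = P @ X
u, v = x[0] / x[2], x[1] / x[2]   # divide out the arbitrary scale factor
print(u, v)
```

Dividing by the last element of the result is exactly the homogeneous normalization discussed above; the depth Z is lost in that division, which is why a single image gives no absolute depth.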

C. Epipolar Geometry

Epipolar geometry, or two-view geometry, provides the relation between two views of a scene. If we have a point x in the first view, epipolar geometry restricts the corresponding point x' in the second view to lie on a line called the epipolar line (Figure 2). This constraint can be formulated using a matrix called the essential matrix.

Fig. 2. Epipolar Geometry

1) Essential Matrix: This matrix gives the relation that finds epipolar lines:

$$x^{\top} E x' = 0$$

where Ex' is the epipolar line in the second view corresponding to the point x in the first view. The essential matrix has 5 parameters and can be found using 5 different matches of points [2].

2) Fundamental Matrix: We can further expand this relation to obtain the formulation for two corresponding pixel points in two images. The resulting matrix is called the fundamental matrix:

$$u^{\top} F u' = 0$$

This relation can be derived by replacing the points in the essential matrix relation with the pixel points multiplied by K^{-1}, the inverse of the camera intrinsic matrix:

$$(K^{-1}u)^{\top} E (K'^{-1}u') = 0$$
$$u^{\top} (K^{-\top} E K'^{-1}) u' = 0$$
$$u^{\top} F u' = 0$$

So we can see that the fundamental matrix is given by $F \sim K^{-\top} E K'^{-1}$. This matrix has 9 elements and can be computed up to an arbitrary scale, so we can calculate it using 8 correspondent points.

3) Projection Matrices: After finding the fundamental matrix, we can calculate the projection matrices. This can be accomplished by decomposing the fundamental matrix [4].

D. Triangulation: From 2D to 3D

Having found the camera projection matrices, we can find the correspondent 3D point of every pair of 2D points. This process is called triangulation. It is done by intersecting projection rays, the rays passing through a camera center and a 2D point on the image plane. As there is always some noise in our measurements, the projected rays of the 2D points will not intersect exactly, so the main task of triangulation is to find the 3D point that minimizes this error:

$$X = \arg\min_X \sum_i \| u_i - u'_i \|$$

where $u_i$ is the 2D point in image i and $u'_i$ is the 2D point predicted from the 3D point X and the projection matrix $P_i$. After doing triangulation we have found the information we were looking for about the structure of the scene.
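The whole two-view chain above (eight-point fundamental matrix, essential matrix, projection matrices, triangulation) can be sketched with OpenCV. This is a hedged illustration, not the paper's implementation: `pts1`/`pts2` stand for already-matched interest points (Nx2 float arrays) and `K` for a known intrinsic matrix, both assumed as inputs here:

```python
import numpy as np
import cv2

def two_view_reconstruction(pts1, pts2, K):
    # Fundamental matrix from >= 8 correspondences (eight-point algorithm).
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

    # With known intrinsics, recover the essential matrix: E = K^T F K'
    # (same K assumed for both views), then decompose it into relative pose.
    E = K.T @ F @ K
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Projection matrices: first camera at the origin, second from (R, t).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulation: intersect the (noisy) projection rays of each pair.
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous
    X = (X_h[:3] / X_h[3]).T                             # Nx3 Euclidean
    return X, P1, P2
```

Note that the reconstruction is only defined up to the arbitrary global scale discussed earlier; `recoverPose` returns a unit-length translation.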
E. Multiple-View Structure from Motion

As we discussed in the previous section, given two views we can find the correspondent 3D points to reconstruct the structure of the scene. But with a sequence of images, e.g. the frames of a video, we have more than two views. The methods used to reconstruct a 3D scene from multiple views are called sequential methods: as new views are given to these methods, they extend the partially reconstructed scene with the new information. One of these approaches uses sequential Monte Carlo (SMC) methods to estimate the 3D scene [7]; SMC is originally a non-linear filter for smoothing state-space models. Another well-known method for refining 3D points when we have multiple views is Bundle Adjustment, the act of refining the 3D points and the projection matrices simultaneously; a code sketch of this refinement appears at the end of this section.

F. Multibody Structure from Motion

Multibody structure from motion is the extension of SfM to dynamic scenes, where objects can move and the scene is not static. This task is challenging because the number of moving objects can change over time. Also, moving objects are often small, and therefore only a few features can be tracked on them [6]. The main requirements for a system that enables dynamic SfM are: 1) finding the number of moving objects at the beginning of the image sequence, 2) segmenting the feature tracks to find the different motions in the scene, and 3) finding the 3D structure of each moving object. The accuracy of the method depends heavily on how accurately we segment the feature tracks.
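As a minimal sketch of the bundle adjustment mentioned in Section II-E, the refinement can be posed as a nonlinear least-squares problem over the reprojection error. Parameterizing each camera directly by the 12 entries of its projection matrix is a simplification assumed here for brevity; real systems use a rotation parameterization:

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_2d):
    # Unpack stacked cameras (n_cams x 3 x 4) and 3D points (n_pts x 3).
    Ps = params[:n_cams * 12].reshape(n_cams, 3, 4)
    Xs = params[n_cams * 12:].reshape(n_pts, 3)
    # Project each observed 3D point with its observing camera.
    X_h = np.hstack([Xs[pt_idx], np.ones((len(pt_idx), 1))])
    proj = np.einsum('nij,nj->ni', Ps[cam_idx], X_h)
    uv = proj[:, :2] / proj[:, 2:3]
    # Residual = predicted 2D point minus tracked 2D point.
    return (uv - obs_2d).ravel()

# `x0` stacks initial cameras and points (e.g. from two-view reconstruction
# plus triangulation); observation k says camera cam_idx[k] sees 3D point
# pt_idx[k] at pixel obs_2d[k]. These inputs are assumptions of the sketch.
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, cam_idx, pt_idx, obs_2d))
```

Refining all cameras and points jointly against all observations is exactly what distinguishes bundle adjustment from per-pair triangulation.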

III. 3D MATCH MOVING

Match moving is a visual-effects technique that allows us to insert computer-generated objects into real camera-captured scenes. Technically, it is hard to insert these objects in a way that they maintain correct position, orientation, scale, and motion relative to the other objects in the real scene.

The key process in match moving is to precisely determine the camera movements in 3D space. Having the camera movements, newly inserted 3D objects will appear perfectly matched with regard to orientation, position and scale. This means that if we have a camera-tracked blank scene, we can create a 3D object in it. After regenerating the 2D video scene with the new object inserted, we will see the object moving in the 2D scene; these movements give us the impression that the camera is really moving around our 3D object.

Now, if we had a 3D structure of the objects present in the footage, we could recreate the scene, adding new objects with regard to the real objects. Note that the final result must comply with basic rules and concepts of human vision such as motion parallax and occlusion. This is where SfM comes in handy: we create 3D structures of the real objects in the original footage in order to comply with these concepts as we add new objects. Figure 3 depicts our sample match-moved footage.

Fig. 3. Sample Match Moved Footage

With the previous notes in mind, we can break the process of match moving into the following steps:
1) Feature tracking
2) Calculating 3D information
3) Camera tracking
4) 3D modeling
5) Video composition

A. Feature Tracking

Feature tracking is a process very similar to motion estimation. This is the first step that needs to be done before calculating 3D information or tracking the camera. Features are points of interest in the image that can be tracked across several frames with a tracking algorithm; depending on the tracking algorithm, we may have different approaches to selecting features. The more precisely the algorithm tracks features, the more precisely we can calculate 3D information. As in motion estimation, there are motion vectors for the active features, frame by frame. The result is that we can find the position of each feature throughout the entire footage from its motion vectors and its position in a single frame. Statistical information may be used to eliminate or correct mistakes in feature tracking. A sample visualization of feature tracking is shown in Figure 4; a code sketch of this step follows Section III-B below.

Fig. 4. Feature Tracking

B. Calculating 3D Information

Extracting 3D information from the original footage is an essential part of match moving. First, it helps solve the camera's 3D motion; then, that information is used in the next steps to maintain the visual relations between added objects and real objects.

Here is an example one can keep in mind describing why this information is needed to maintain visual relations. Suppose we are placing a cylinder on top of a table. First, we need to know where in space the surface of the table is located. In addition, some parts of the cylinder's shadow may be cast outside the table's surface onto the ground, so the ground plane is also needed.

In this simple example, at least two surfaces were needed; in more complex situations and scenes, more points and surfaces are required. At this point, SfM methods are used to build the 3D structure of the scene. The detail needed in the structure is determined by the complexity of the scene and the artist's requirements.
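As a rough illustration of the feature-tracking step described in Section III-A, the following sketch selects corner-like interest points and follows them frame by frame with OpenCV's pyramidal Lucas-Kanade tracker; `video.mp4` is a placeholder for the real footage, and for brevity lost features are simply dropped, whereas a real tracker keeps per-track identities:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture('video.mp4')
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Select interest points (corner-like features) to track.
features = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                   qualityLevel=0.01, minDistance=7)

tracks = [features]
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Motion vectors for the active features, frame by frame.
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                features, None)
    features = nxt[status.ravel() == 1].reshape(-1, 1, 2)
    tracks.append(features)
    prev_gray = gray
```

The per-frame displacements `nxt - features` are the motion vectors the text refers to; chaining them gives each feature's position throughout the footage.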
C. Camera Tracking

Another step where SfM methods show up is tracking the camera and solving its motion in 3D space. This is the key process in match moving. Again using SfM techniques, the exact characteristics of the camera are extracted throughout the footage; these characteristics include orientation, position, and focal length. In a nutshell, camera tracking is done by inverse projection of the 2D paths of rigid objects in the footage. During camera tracking, a vector is created for the camera in each frame, specifying its characteristics. In a perspective 3D view, the camera's path and orientation can be seen clearly using these vectors. Figure 5 shows a perspective view of an original footage; in this figure, the 3D structure of the scene is partially reconstructed and the camera is tracked as well.
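A hedged sketch of this inverse projection, per frame: given the reconstructed 3D points of rigid scene objects and their tracked 2D positions in one frame (both assumed to come from the earlier SfM and feature-tracking steps, along with the intrinsic matrix `K`), the camera pose for that frame can be recovered with OpenCV's PnP solver:

```python
import cv2
import numpy as np

def track_camera(points_3d, points_2d, K):
    # Assume lens distortion is negligible or already corrected.
    dist = np.zeros(4)
    ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, dist)
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 matrix
    cam_center = -R.T @ tvec          # camera position in world coordinates
    return R, tvec, cam_center        # one such "vector" per frame
```

Running this for every frame yields the sequence of camera vectors whose positions trace the camera path shown in Figure 5.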

Fig. 5. Camera Tracking

The red line in the figure depicts the path that the camera moves along during the footage.

D. 3D Modeling

A 3D artist may now get involved to design new objects for the scene. The camera and the reconstructed points from the footage are imported into a 3D modeling software, so the artist can see the important points in 3D space from the camera's view. Since the camera is moving in the scene, objects that are inserted will move accordingly in the final 2D output. Figure 6 demonstrates the interface of a 3D modeling software and shows a cylinder added on the surface of a table.

Fig. 6. 3D Model of a Cylinder

In this step, the artist's talent plays a crucial role in the quality: he is the one who places the new objects and adds details to them. The outputs of this step are image sequences of different layers; each layer contains one set of characteristics of the different objects. For example, a single layer may be used for shadows, meaning that only the shadows of the different objects are visible in that layer.

The image sequence of each layer is technically a set of new frames that show the current position of the objects in each frame, consistent with the frames of the original footage. In the next step, these image sequences are composited into the original footage to create the final result. Figure 7 indicates different layers of the rendered output.

Fig. 7. Different Rendered Layers of Output

E. Video Composition

This is the next and final step in match moving. The different layers of newly added objects need to be composited into the original footage. As a result, the computer-generated objects seem perfectly matched into the sequence, as if they had actually been there when the footage was recorded. Many different software packages are available for video composition.

Using blending modes, alpha-channel masking, and similar image-composition techniques, the artist is able to reconstruct the objects from the different layers. The layering done in the previous step helps the artist deal with more complex situations in composition. Suppose a computer-generated object needs to cast its shadow on an object from the original footage; depending on the situation, it may be necessary to reduce the intensity of that shadow. If the layers for color and shadow are not separated, the artist may not be able to reduce the intensity of the object's shadow without changing its color characteristics. This may result in a visual error and is not acceptable in match moving. Figure 8 shows our sample footage being composited.

Fig. 8. Composited Footage
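A minimal sketch of the alpha masking just described, under the assumption that `fg_rgba` is one rendered layer with an alpha channel and `bg_rgb` the matching original frame, both as float arrays in [0, 1]; the `shadow_gain` knob illustrates why a separated shadow layer can be dimmed without touching color:

```python
import numpy as np

def composite_over(fg_rgba, bg_rgb, shadow_gain=1.0):
    # Standard "over" operator: foreground weighted by its alpha mask.
    alpha = fg_rgba[..., 3:4] * shadow_gain  # e.g. soften a shadow layer
    return fg_rgba[..., :3] * alpha + bg_rgb * (1.0 - alpha)
```

Compositing the layers in sequence (shadows first, then color layers) reproduces the layered workflow of the previous step.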

IV. CONCLUSION

In this overview we covered the process of match moving. We also discussed Structure from Motion, which is the main element of match moving. We saw that new methods of SfM enable us to create convincing visual effects.

REFERENCES

[1] T. Dobbert. Matchmoving: The Invisible Art of Camera Tracking. Wiley Desktop Editions. John Wiley & Sons, 2006.
[2] F. Fraundorfer, P. Tanskanen, and M. Pollefeys. A minimal case solution to the calibrated relative pose problem for the case of two known orientation angles. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 269-282, Berlin, Heidelberg, 2010. Springer-Verlag.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, pages 147-151, Manchester, UK, 1988.
[4] R. I. Hartley and A. Zisserman. Retrieving the camera matrices, pages 253-256. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[5] G. Liu, R. Klette, and B. Rosenhahn. Structure from motion in the presence of noise. In Image and Vision Computing New Zealand, pages 138-143, 2005.
[6] K. E. Ozden, K. Schindler, and L. Van Gool. Multibody structure-from-motion in practice. IEEE Trans. Pattern Anal. Mach. Intell., 32(6):1134-1141, June 2010.
[7] G. Qian and R. Chellappa. Structure from motion using sequential Monte Carlo methods. International Journal of Computer Vision, 59(1):5-31, 2004.
[8] D. P. Robertson and R. Cipolla. Structure from Motion. In M. Varga, editor, Practical Image Processing and Computer Vision. John Wiley, 2009.