
An Overview of Matchmoving using Structure from Motion Methods

Kamyar Haji Allahverdi Pour
Department of Computer Engineering
Sharif University of Technology
Tehran, Iran
Email: [email protected]

Alimohammad Rabbani
Department of Computer Engineering
Sharif University of Technology
Tehran, Iran
Email: [email protected]

Abstract—Nowadays, seeing supernatural events in movies is not rare. Here we discuss a method for adding artificial objects to an ordinary video. This process of adding computer-generated objects to a natural video is called Matchmoving [1]. The main part of this process is Structure from Motion (SfM), the process that enables matchmovers to use the 3D information of the scene. We will mainly discuss this part: how it works and the different methods used to extract its information.

I. INTRODUCTION

As a matchmover tries to add an artificial object to a real scene, he needs the 3D information of the scene. This information can be extracted from a sequence of images: using the differences between images, a perception of the 3D structure can be computed. As projecting 3D information down to 2D inevitably introduces data loss, this task is challenging and needs extra information to perform the reconstruction. That information can come from prior knowledge about the scene, e.g. looking for parallel lines in the images. Another way is to find corresponding points in different views and construct 3D points using triangulation [8].

Fig. 1. Basic Steps of SfM [5]

II. STRUCTURE FROM MOTION

First of all, we discuss the process by which we can find information about the real scene. This task is done using a method called Structure from Motion; the basic steps of this process are shown in Figure 1. The process starts with understanding the mapping between the 3D world and the 2D image plane. After that, we discuss the method to find 3D points from 2D points.

A. From 3D to 2D

We want to model the process by which a 3D point is mapped to a 2D point by a camera. The most common model is the pinhole projection model. If there is no non-linear effect, like radial distortion, this model is a good approximation for most real cameras. The model has 3 components through which a 3D point is converted to a 2D point:

1) The first component transforms real-world coordinates to camera coordinates. This transform consists of a rotation and a translation, which together are called the camera extrinsic parameters.

2) The second component maps 3D points to 2D image points. This mapping is a matrix like the one below:

$$\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \sim \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$

where f is the focal length of the camera. Here we are using homogeneous coordinates, so we should keep the last element equal to 1. As we don't have access to the overall scale factor, we only have an arbitrary perception of depth. This means that using the information of just one image, we can't tell how far an object is from the camera. We can set this scale factor to 1, as we don't know the exact number.

3) The final component converts 2D image-plane coordinates to pixel coordinates. This transformation is done using an upper-triangular matrix K which holds the camera intrinsic parameters. This matrix can be written as:

$$K = \begin{bmatrix} \alpha_u & s & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}$$

By combining the aforementioned matrices, we can compute a projection matrix P which maps 3D points to 2D pixel coordinates:

$$x \sim PX$$

Using this equation, each point gives two linear equations in the elements of P. This matrix has 12 elements, of which 11 are unknown since the scale factor is arbitrary. So if we have at least 6 points in 3D space and their correspondents in the 2D image plane, we can write 12 equations to find the elements of P. A numerical sketch of this projection follows Section II-B below.

B. Finding Correspondent Points

As mentioned before, we need to find corresponding points to extract the 3D scene structure. One way to find these correspondent points is to ask a user to tag them in our images. This approach is time consuming, and the results depend on how accurately the user selects the correspondent points. Doing this automatically is, in general, a difficult task. Methods that find correspondent points automatically use the information around important pixels to describe them; these important points are called interest points. We can measure the similarity between these descriptors to estimate how likely two points are to be a match. For example, we can use the Harris corner detector [3] to find such interest points; it finds points that are maxima of the image auto-correlation function.
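To make the pinhole model concrete, the following is a minimal sketch of the projection pipeline in NumPy. All numbers are made-up example values (an 800-pixel focal length, a 640x480 principal point), not calibration data from any real camera:

```python
import numpy as np

# Intrinsic matrix K (upper triangular), with assumed example values.
alpha_u, alpha_v = 800.0, 800.0   # focal length expressed in pixels
u0, v0, s = 320.0, 240.0, 0.0     # principal point and skew
K = np.array([[alpha_u, s,       u0],
              [0.0,     alpha_v, v0],
              [0.0,     0.0,     1.0]])

# Extrinsic parameters: rotation R and translation t (here: identity pose).
R = np.eye(3)
t = np.zeros((3, 1))

# Projection matrix P = K [R | t], mapping homogeneous 3D points to pixels.
P = K @ np.hstack([R, t])

# Project a homogeneous 3D point X = (X, Y, Z, 1)^T.
X = np.array([0.5, -0.2, 4.0, 1.0])
x = P @ X
u, v = x[0] / x[2], x[1] / x[2]   # divide out the arbitrary scale factor
print(u, v)
```

Dividing by the last element of the result is exactly the homogeneous normalization discussed above; the depth Z is lost in that division, which is why a single image gives no absolute depth.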

C. Epipolar Geometry

Epipolar geometry, or two-view geometry, provides the relation between two views of a scene. If we have a point x in the first view, epipolar geometry restricts the corresponding point x' in the second view to lie on a line called the epipolar line (Figure 2). This constraint can be formulated using a matrix called the essential matrix.

Fig. 2. Epipolar Geometry

1) Essential Matrix: This matrix gives the relation that finds epipolar lines:

$$x^{\top} E x' = 0$$

where Ex' is the epipolar line in the second view corresponding to the point x in the first view. The essential matrix has 5 parameters and can be found using 5 different matches of points [2].

2) Fundamental Matrix: We can further expand this relation to obtain the formulation for two corresponding pixel points in two images. The resulting matrix is called the fundamental matrix:

$$u^{\top} F u' = 0$$

This relation can be derived by replacing the points in the essential matrix relation with the pixel points multiplied by K^{-1}, the inverse of the camera intrinsic matrix:

$$(K^{-1}u)^{\top} E (K'^{-1}u') = 0$$
$$u^{\top} (K^{-\top} E K'^{-1}) u' = 0$$
$$u^{\top} F u' = 0$$

So we can see that the fundamental matrix is given by $F \sim K^{-\top} E K'^{-1}$. This matrix has 9 elements and can be computed up to an arbitrary scale, so we can calculate it using 8 correspondent points.

3) Projection Matrices: After finding the fundamental matrix, we can calculate the projection matrices. This can be accomplished by decomposing the fundamental matrix [4].

D. Triangulation: From 2D to 3D

Having found the camera projection matrices, we can find the correspondent 3D point of every pair of 2D points. This process is called triangulation. It is done by intersecting projection rays, the rays passing through a camera center and a 2D point on the image plane. As there is always some noise in our measurements, the projected rays of the 2D points will not intersect exactly, so the main task of triangulation is to find the 3D point that minimizes this error:

$$X = \arg\min_X \sum_i \| u_i - u'_i \|$$

where $u_i$ is the 2D point in image i and $u'_i$ is the 2D point predicted from the 3D point X and the projection matrix $P_i$. After doing triangulation we have found the information we were looking for about the structure of the scene.
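The whole two-view chain above (eight-point fundamental matrix, essential matrix, projection matrices, triangulation) can be sketched with OpenCV. This is a hedged illustration, not the paper's implementation: `pts1`/`pts2` stand for already-matched interest points (Nx2 float arrays) and `K` for a known intrinsic matrix, both assumed as inputs here:

```python
import numpy as np
import cv2

def two_view_reconstruction(pts1, pts2, K):
    # Fundamental matrix from >= 8 correspondences (eight-point algorithm).
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_8POINT)

    # With known intrinsics, recover the essential matrix: E = K^T F K'
    # (same K assumed for both views), then decompose it into relative pose.
    E = K.T @ F @ K
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Projection matrices: first camera at the origin, second from (R, t).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])

    # Triangulation: intersect the (noisy) projection rays of each pair.
    X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)  # 4xN homogeneous
    X = (X_h[:3] / X_h[3]).T                             # Nx3 Euclidean
    return X, P1, P2
```

Note that the reconstruction is only defined up to the arbitrary global scale discussed earlier; `recoverPose` returns a unit-length translation.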
E. Multiple-View Structure from Motion

As we discussed in the previous section, given two views we can find the correspondent 3D points to reconstruct the structure of the scene. But with a sequence of images, e.g. the frames of a video, we have more than two views. The methods used to reconstruct a 3D scene from multiple views are called sequential methods: as new views are given to these methods, they extend the partially reconstructed scene with the new information. One of these approaches uses sequential Monte Carlo (SMC) methods to estimate the 3D scene [7]; SMC is originally a non-linear filter for smoothing state-space models. Another well-known method for refining 3D points when we have multiple views is Bundle Adjustment, the act of refining the 3D points and the projection matrices simultaneously; a code sketch of this refinement appears at the end of this section.

F. Multibody Structure from Motion

Multibody structure from motion is the extension of SfM to dynamic scenes, where objects can move and the scene is not static. This task is challenging because the number of moving objects can change over time. Also, moving objects are often small, and therefore only a few features can be tracked on them [6]. The main requirements for a system that enables dynamic SfM are: 1) finding the number of moving objects at the beginning of the image sequence, 2) segmenting the feature tracks to find the different motions in the scene, and 3) finding the 3D structure of each moving object. The accuracy of the method depends heavily on how accurately we segment the feature tracks.
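As a minimal sketch of the bundle adjustment mentioned in Section II-E, the refinement can be posed as a nonlinear least-squares problem over the reprojection error. Parameterizing each camera directly by the 12 entries of its projection matrix is a simplification assumed here for brevity; real systems use a rotation parameterization:

```python
import numpy as np
from scipy.optimize import least_squares

def reprojection_residuals(params, n_cams, n_pts, cam_idx, pt_idx, obs_2d):
    # Unpack stacked cameras (n_cams x 3 x 4) and 3D points (n_pts x 3).
    Ps = params[:n_cams * 12].reshape(n_cams, 3, 4)
    Xs = params[n_cams * 12:].reshape(n_pts, 3)
    # Project each observed 3D point with its observing camera.
    X_h = np.hstack([Xs[pt_idx], np.ones((len(pt_idx), 1))])
    proj = np.einsum('nij,nj->ni', Ps[cam_idx], X_h)
    uv = proj[:, :2] / proj[:, 2:3]
    # Residual = predicted 2D point minus tracked 2D point.
    return (uv - obs_2d).ravel()

# `x0` stacks initial cameras and points (e.g. from two-view reconstruction
# plus triangulation); observation k says camera cam_idx[k] sees 3D point
# pt_idx[k] at pixel obs_2d[k]. These inputs are assumptions of the sketch.
# result = least_squares(reprojection_residuals, x0,
#                        args=(n_cams, n_pts, cam_idx, pt_idx, obs_2d))
```

Refining all cameras and points jointly against all observations is exactly what distinguishes bundle adjustment from per-pair triangulation.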

III. 3D MATCH MOVING

Match moving is a visual-effects technique that allows us to insert computer-generated objects into real camera-captured scenes. Technically, it is hard to insert these objects in a way that they maintain correct position, orientation, scale, and motion relative to the other objects in the real scene.

The key process in match moving is to precisely determine the camera movements in 3D space. Having the camera movements, newly inserted 3D objects will appear perfectly matched with regard to orientation, position and scale. This means that if we have a camera-tracked blank scene, we can create a 3D object in it. After regenerating the 2D video scene with the new object inserted, we will see the object moving in the 2D scene; these movements give us the impression that the camera is really moving around our 3D object.

Now, if we had a 3D structure of the objects present in the footage, we could recreate the scene, adding new objects with regard to the real objects. Note that the final result must comply with basic rules and concepts of human vision such as motion parallax and occlusion. This is where SfM comes in handy: we create 3D structures of the real objects in the original footage in order to comply with these concepts as we add new objects. Figure 3 depicts our sample match-moved footage.

Fig. 3. Sample Match Moved Footage

With the previous notes in mind, we can break the process of match moving into the following steps:
1) Feature tracking
2) Calculating 3D information
3) Camera tracking
4) 3D modeling
5) Video composition

A. Feature Tracking

Feature tracking is a process very similar to motion estimation. This is the first step that needs to be done before calculating 3D information or tracking the camera. Features are points of interest in the image that can be tracked across several frames with a tracking algorithm; depending on the tracking algorithm, we may have different approaches to selecting features. The more precisely the algorithm tracks features, the more precisely we can calculate 3D information. As in motion estimation, there are motion vectors for the active features, frame by frame. The result is that we can find the position of each feature throughout the entire footage from its motion vectors and its position in a single frame. Statistical information may be used to eliminate or correct mistakes in feature tracking. A sample visualization of feature tracking is shown in Figure 4; a code sketch of this step follows Section III-B below.

Fig. 4. Feature Tracking

B. Calculating 3D Information

Extracting 3D information from the original footage is an essential part of match moving. First, it helps solve the camera's 3D motion; then, that information is used in the next steps to maintain the visual relations between added objects and real objects.

Here is an example one can keep in mind describing why this information is needed to maintain visual relations. Suppose we are placing a cylinder on top of a table. First, we need to know where in space the surface of the table is located. In addition, some parts of the cylinder's shadow may be cast outside the table's surface onto the ground, so the ground plane is also needed.

In this simple example, at least two surfaces were needed; in more complex situations and scenes, more points and surfaces are required. At this point, SfM methods are used to build the 3D structure of the scene. The detail needed in the structure is determined by the complexity of the scene and the artist's requirements.
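As a rough illustration of the feature-tracking step described in Section III-A, the following sketch selects corner-like interest points and follows them frame by frame with OpenCV's pyramidal Lucas-Kanade tracker; `video.mp4` is a placeholder for the real footage, and for brevity lost features are simply dropped, whereas a real tracker keeps per-track identities:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture('video.mp4')
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Select interest points (corner-like features) to track.
features = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                   qualityLevel=0.01, minDistance=7)

tracks = [features]
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Motion vectors for the active features, frame by frame.
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, gray,
                                                features, None)
    features = nxt[status.ravel() == 1].reshape(-1, 1, 2)
    tracks.append(features)
    prev_gray = gray
```

The per-frame displacements `nxt - features` are the motion vectors the text refers to; chaining them gives each feature's position throughout the footage.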
C. Camera Tracking

Another step where SfM methods show up is tracking the camera and solving its motion in 3D space. This is the key process in match moving. Again using SfM techniques, the exact characteristics of the camera are extracted throughout the footage; these characteristics include orientation, position, and focal length. In a nutshell, camera tracking is done by inverse projection of the 2D paths of rigid objects in the footage. During camera tracking, a vector is created for the camera in each frame, specifying its characteristics. In a perspective 3D view, the camera's path and orientation can be seen clearly using these vectors. Figure 5 shows a perspective view of an original footage; in this figure, the 3D structure of the scene is partially reconstructed and the camera is tracked as well.
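A hedged sketch of this inverse projection, per frame: given the reconstructed 3D points of rigid scene objects and their tracked 2D positions in one frame (both assumed to come from the earlier SfM and feature-tracking steps, along with the intrinsic matrix `K`), the camera pose for that frame can be recovered with OpenCV's PnP solver:

```python
import cv2
import numpy as np

def track_camera(points_3d, points_2d, K):
    # Assume lens distortion is negligible or already corrected.
    dist = np.zeros(4)
    ok, rvec, tvec = cv2.solvePnP(points_3d, points_2d, K, dist)
    R, _ = cv2.Rodrigues(rvec)        # rotation vector -> 3x3 matrix
    cam_center = -R.T @ tvec          # camera position in world coordinates
    return R, tvec, cam_center        # one such "vector" per frame
```

Running this for every frame yields the sequence of camera vectors whose positions trace the camera path shown in Figure 5.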

Fig. 5. Camera Tracking

The red line in the figure depicts the path that the camera moves along during the footage.

D. 3D Modeling

A 3D artist may now get involved to design new objects for the scene. The camera and the reconstructed points from the footage are imported into a 3D modeling software, so the artist can see the important points in 3D space from the camera's view. Since the camera is moving in the scene, objects that are inserted will move accordingly in the final 2D output. Figure 6 demonstrates the interface of a 3D modeling software and shows a cylinder added on the surface of a table.

Fig. 6. 3D Model of a Cylinder

In this step, the artist's talent plays a crucial role in the quality: he is the one who places the new objects and adds details to them. The outputs of this step are image sequences of different layers; each layer contains one set of characteristics of the different objects. For example, a single layer may be used for shadows, meaning that only the shadows of the different objects are visible in that layer.

The image sequence of each layer is technically a set of new frames that show the current position of the objects in each frame, consistent with the frames of the original footage. In the next step, these image sequences are composited into the original footage to create the final result. Figure 7 indicates different layers of the rendered output.

Fig. 7. Different Rendered Layers of Output

E. Video Composition

This is the next and final step in match moving. The different layers of newly added objects need to be composited into the original footage. As a result, the computer-generated objects seem perfectly matched into the sequence, as if they had actually been there when the footage was recorded. Many different software packages are available for video composition.

Using blending modes, alpha-channel masking, and similar image-composition techniques, the artist is able to reconstruct the objects from the different layers. The layering done in the previous step helps the artist deal with more complex situations in composition. Suppose a computer-generated object needs to cast its shadow on an object from the original footage; depending on the situation, it may be necessary to reduce the intensity of that shadow. If the layers for color and shadow are not separated, the artist may not be able to reduce the intensity of the object's shadow without changing its color characteristics. This may result in a visual error and is not acceptable in match moving. Figure 8 shows our sample footage being composited.

Fig. 8. Composited Footage
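A minimal sketch of the alpha masking just described, under the assumption that `fg_rgba` is one rendered layer with an alpha channel and `bg_rgb` the matching original frame, both as float arrays in [0, 1]; the `shadow_gain` knob illustrates why a separated shadow layer can be dimmed without touching color:

```python
import numpy as np

def composite_over(fg_rgba, bg_rgb, shadow_gain=1.0):
    # Standard "over" operator: foreground weighted by its alpha mask.
    alpha = fg_rgba[..., 3:4] * shadow_gain  # e.g. soften a shadow layer
    return fg_rgba[..., :3] * alpha + bg_rgb * (1.0 - alpha)
```

Compositing the layers in sequence (shadows first, then color layers) reproduces the layered workflow of the previous step.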

IV. CONCLUSION

In this overview we covered the process of match moving. We also discussed Structure from Motion, which is the main element of match moving. We saw that new methods of SfM enable us to create convincing visual effects.

REFERENCES

[1] T. Dobbert. Matchmoving: The Invisible Art of Camera Tracking. Wiley Desktop Editions. John Wiley & Sons, 2006.
[2] F. Fraundorfer, P. Tanskanen, and M. Pollefeys. A minimal case solution to the calibrated relative pose problem for the case of two known orientation angles. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 269-282, Berlin, Heidelberg, 2010. Springer-Verlag.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, volume 15, pages 147-151, Manchester, UK, 1988.
[4] R. I. Hartley and A. Zisserman. Retrieving the camera matrices, pages 253-256. Cambridge University Press, ISBN: 0521540518, second edition, 2004.
[5] G. Liu, R. Klette, and B. Rosenhahn. Structure from motion in the presence of noise. In Image and Vision Computing New Zealand, pages 138-143, 2005.
[6] K. E. Ozden, K. Schindler, and L. Van Gool. Multibody structure-from-motion in practice. IEEE Trans. Pattern Anal. Mach. Intell., 32(6):1134-1141, June 2010.
[7] G. Qian and R. Chellappa. Structure from motion using sequential Monte Carlo methods. International Journal of Computer Vision, 59(1):5-31, 2004.
[8] D. P. Robertson and R. Cipolla. Structure from Motion. In M. Varga, editor, Practical Image Processing and Computer Vision. John Wiley, 2009.