The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science
Video Synchronization Using Temporal Signals from Epipolar Lines
M.Sc. dissertation for research project
Submitted by Dmitry Pundik
Under the supervision of Dr. Yael Moses
May 2010

Acknowledgments
I owe my deepest gratitude to my advisor, Dr. Yael Moses, who, with her graceful attitude and endless patience, stood by me during my research and kept her faith when I had doubts. With her extensive knowledge and vast experience, she steered my study towards maturity and was fully involved until the very last paragraph of this thesis.
I would also like to thank the wonderful teachers I had here at IDC, who reminded me why I had chosen this vocation. Specifically, I would like to express my gratitude to Dr. Yacov Hel-Or, who was always ready to lend an ear and share valuable advice.
This research would not have been completed without the support of my direct management at Advasense Image Sensors Ltd., who understood the importance of academic education and allowed me to pursue and accomplish this thesis.
Finally, I would like to thank my fellow students: Ran Eshel, for his significant contribution to the video experiments and for his willingness to assist in any way, from mathematical discussions to document proofreading; and all the neighbors in the IDC laboratory, who made my stay extremely pleasant.
Abstract
Time synchronization of video sequences in a multi-camera system is necessary for successfully analyzing the acquired visual information. Even if synchronization is established, its quality may deteriorate over time due to a variety of reasons, most notably frame dropping. Consequently, synchronization must be actively maintained. This thesis presents a method for online synchronization that relies only on the video sequences. We introduce a novel definition of low level temporal signals computed from epipolar lines. The spatial matching of two such temporal signals is given by the fundamental matrix. Thus, no pixel correspondence is required, bypassing the problem of correspondence changes in the presence of motion. The synchronization is determined from registration of the temporal signals. We consider general video data with substantial movement in the scene, for which high level information may be hard to extract from each individual camera (e.g., computing trajectories in crowded scenes). Furthermore, a trivial correspondence between the sequences is not assumed to exist. The method is online and can be used to resynchronize video sequences every few seconds, with only a small delay.
Experiments on indoor and outdoor sequences demonstrate the effectiveness of the method.
Table of Contents
Acknowledgments ...... i
Abstract ...... ii
Table of Contents ...... iii
List of Figures ...... v
List of Algorithms ...... vi
1 Introduction 1
2 Previous work 5
3 Method 9
3.1 A temporal signal ...... 9
3.2 Signal Registration ...... 11
3.3 Epipolar line filtering ...... 15
3.4 Algorithms flow ...... 15
4 Experiments 17
4.1 Basic results ...... 19
4.2 Frame dropping ...... 21
4.3 Using a prior on P(∆t) ...... 22
4.4 Setting the parameters ...... 23
4.5 Verification of Calibration ...... 24
4.6 Temporal signal of an entire frame ...... 25
5 Conclusion 27
Bibliography 30
List of Figures
3.1 Motion indicators calculation ...... 10
3.2 Temporal signals ...... 12
4.1 Set 1 ...... 18
4.2 Set 2 ...... 19
4.3 Set 3 ...... 20
4.4 Set 4 ...... 21
4.5 Basic results, Sets 1,2 ...... 22
4.6 Basic results, Sets 3,4 ...... 23
4.7 Using prior, Set 3 ...... 24
4.8 Confidence and frame drops ...... 25
4.9 Incorrect synchronization ...... 26
List of Algorithms
3.1 Temporal signal update ...... 15
3.2 Synchronization iteration ...... 16
Chapter 1
Introduction
Applications of multiple camera systems range from video surveillance of large areas such as airports or shopping centers, to videography and filmmaking. As more and more of these applications utilize the information obtained in the overlapping fields of view of the cameras, precise camera synchronization and its constant maintenance are indispensable. Examples of such applications include multi-view object tracking (e.g. Eshel and Moses [6], Gavrila et al. [8]), action recognition (e.g. Weinland et al. [18]), and scene flow computation (e.g. Basha et al. [1], Pons et al. [12], Furukawa and Ponce [7]).
In practice, given enough running time, synchronization will be violated by technical imperfections that cause frame dropping or incorrect timing between sequences. The tendency to use mostly inexpensive components makes such violations a certainty in many video systems.
Manual synchronization is out of the question, as it is labor-intensive and cannot be performed continuously; thus, it cannot handle arbitrary frame dropping. Precise time synchronization via satellite, as in GPS systems, may be too expensive or of limited use in indoor environments. Clock synchronization using distributed protocols depends on the properties of the communication network and is sensitive to communication failures. Obvious alternative sources of time information are the video streams themselves, which often provide sufficient and reliable information for automatic synchronization.
In this work we address the problem of computing and maintaining the temporal synchronization between a pair of video streams with the same frame rate, relying only on the video data. The temporal synchronization task can be described as the problem of finding the best correlation of temporal events in a pair of video streams. A number of approaches for defining such events have been proposed. Here we briefly review the main approaches; Chapter 2 gives more details on existing studies. Some studies attempt to define unique high-level motion signatures appearing in both video streams (e.g. Dexter et al. [5]). These motion features consist of statistics over the entire image, hence no spatial correspondence between the events is required. As such approaches tend to have high computational complexity, lower-level features have also been used in the literature. Another family of methods relies on tracked trajectories of features or objects visible in both streams. Such methods require some level of correspondence between the considered trajectories, which can be difficult to establish due to the ambiguities of 3D shapes in a two-dimensional image. In order to overcome these ambiguities, some methods assumed planar motion in the scene (e.g. Caspi and Irani [3]), while others relied on special properties of the trajectories (e.g. Tresadern and Reid [16], Whitehead et al. [19]). Algorithms that do not explicitly compute point correspondence were also proposed, assuming a linear combination between the objects' views or verifying spatial geometry during the synchronization search (e.g. Wolf and Zomet [20], Lei and Yang [10]). The computation of trajectories and their quality strongly depend on the scene. Therefore, a number of works tried to avoid the use of trajectories. For instance, a direct approach using low-level video features, such as time gradients, was introduced (e.g. Caspi and Irani [3]). But using such simple features requires an exact point-to-point correspondence, which is possible only in the case of a full homography transformation between the views; that is, again, an assumption of a planar scene or a negligible translation between the camera locations.
Proposed approach
We present a method for obtaining online time synchronization of a pair of video sequences acquired by two static cameras, possibly in a wide-baseline setup. The fundamental matrix between each pair of sequences, which provides epipolar line-to-line correspondence, is assumed to be known. (For example, it can be computed directly from static corresponding features of the videos when there is no motion in the scene.) This is the only spatial correspondence required by our method. We consider sequences of general 3D scenes which contain a large number of moving objects, focusing on sequences for which feature or object trajectories may be hard to compute due to occlusions and substantial movement (see Figure 4.1). Furthermore, a trivial correspondence (e.g., homography) between the sequences is not assumed. The temporal misalignment is considered to be only a translation, i.e., the sequences have the same frame rate.
Therefore, we do not detect sub-frame time shifts. Our method allows online synchronization correction when a number of frames are dropped.
Our method is based on matching temporal signals defined on epipolar lines of each of the sequences. Hence, the spatial matching is given by the fundamental matrix. The temporal matching is performed using a probabilistic optimization framework; independent simultaneous motion occurring on different epipolar lines improves our synchronization. Failure to find such a matching (despite observed motion in the scene) indicates that the epipolar geometry is incorrect. The temporal signal is defined as an integration of the information along an epipolar line during a sufficient interval of time (at least two seconds). A simple background subtraction algorithm is used as the input to this integration. Integrating the information along epipolar lines rather than considering signals at the pixel level not only avoids the search for correspondence but also allows the handling of general moving scenes. Note that in a general scene, the correspondence between pixels at different time steps changes due to the 3D motion of objects in space. Therefore, the synchronization cannot rely on corresponding pixels.
The main contribution of this thesis is the use of low-level temporal events along corresponding epipolar lines for video synchronization. Our method does not require high-level computation such as tracking, which may be hard to perform in crowded scenes such as the ones considered in our experiments. Furthermore, we bypass the need to compute point-to-point correspondences between pixels. Finally, our method can be used in an online framework, because it detects synchronization errors (e.g., frame drops) within seconds, as they occur in the video.
Chapter 2
Previous work
Synchronization can be achieved using visual information by correlating spatio-temporal features or events viewed by two or more cameras. Our method considers static cameras acquiring a moving scene. Previous attempts to synchronize such sequences can be classified by the choice of features used for matching. The most straightforward approach is finding both spatial and temporal correspondence between point features at frames taken at all possible time shifts between the two video streams. Such approaches are vulnerable to correspondence ambiguities and require a large search space. A method for reducing the complexity of this search was suggested by Carceroni et al. [2]. In this method, an intersection between the 2D trajectory of a feature in one sequence and an epipolar line of some other feature from the other sequence is considered. Such an intersection in time is a possible synchronization point between the sequences. By collecting all such intersections for all the features in the considered sequences and using RANSAC to select the best hypothesis, the correct synchronization is computed.
Higher-level features that contain temporal information also help to reduce the matching ambiguity and the search complexity. An example of such features is the motion trajectories of objects. In order to perform the synchronization, some kind of spatial correspondence between the trajectories must be calculated. Since the motion of the objects is three-dimensional, matching the observed 2D trajectories in each sequence is ill-posed. Thus, different researchers proposed methods to ease or avoid such spatial correspondence. Caspi and Irani [3] proposed a method assuming planar motion in a scene (with a homography transformation between the trajectories). Such an assumption makes it possible to calculate spatial correspondence between the trajectories, as it eliminates the 2D-3D ambiguities. The method was based on feature extraction with subsequent correspondence detection between the two trajectories of the same feature. A transformation in time-space was found for aligning those trajectories. This work assumes an affine transformation in the time domain.
Wolf and Zomet [20] performed correspondence-free synchronization in a non-rigid scene, assuming orthographic projection. The main assumption is that each observed 3D feature in one sequence can be expressed as a linear combination of the 3D features observed in the other sequence. This assumption makes it possible to avoid calculating spatial correspondence between the trajectories. By applying a rank constraint on a measurement matrix containing the locations of the tracked features from both sequences, the correct time shift is found. A similar rank constraint on a constructed measurement matrix was used by Tresadern and Reid [16]. An affine projection is assumed, and a correspondence between the tracked 2D trajectories of features in both sequences is computed. A special measurement matrix containing the corresponding points is created. Two frames from the two sequences can be synchronous only if the measurement matrix has a specific rank. After all possible frame pairs are detected, RANSAC is used to determine the correct synchronization of the sequences.
Lei and Yang [10] synchronized three or more freely moving cameras. Corresponding point and line features are detected, matched, and then tracked in all sequences. Each combination of time shifts between the first and the second cameras, and between the first and the third cameras, is a synchronization hypothesis for the three cameras. Spatial correspondence is avoided with the aid of spatial geometry consistency for any three possible time shifts. This consistency is assessed for each hypothesis, and the correct synchronization is found. The spatial geometry can be estimated from the point and line correspondences. In another study, Singh et al. [14] introduced event dynamics for the synchronization task. Working on a spatially calibrated pair of video sequences, they used regression on the trajectory points in space and time. The temporal misalignment model was known and the motion trajectories were extracted. This working scheme effectively dealt with noisy data.
Whitehead et al. [19] also used trajectories of tracked objects for synchronizing three cameras. The inflection points of the trajectories are used for detecting spatial correspondence between trajectories, as those points are space-time invariant and can be detected in a 2D video.
Another way to use tracked features was proposed by Tuytelaars and Van Gool [17]. Their approach deals with freely moving cameras in a non-rigid 3D moving scene. First, a set of corresponding points must be found and tracked along the two video sequences, followed by backprojecting the corresponding points as lines in the 3D world. By searching for intersections of these lines, the rigidity of the point set can be verified. A set of freely moving points in the 3D world can be rigid only when the two video frames are synchronized.
The computation of trajectories and their quality strongly depend on the scene, and trajectories can often be hard to compute, as in the videos considered in this thesis. In order to avoid feature tracking, highly discriminative action recognition features were proposed for synchronization by Dexter et al. [5]. A feature-based alignment scheme is used, where the preferred features are self-similarity matrices (SSMs). SSMs are used as temporal signatures of image sequences that are not affected by viewing angle. Those signatures are tracked during the video. This work focused on human action detection with free camera motion. Naturally, such high-level features are limited to scenes in which these actions appear and can be detected. Another limitation of this work is the high computational complexity of such calculations.
In an effort to avoid complex computations such as tracking and action recognition, an approach based on brightness variation over the entire image was suggested by Caspi and Irani [3]. However, this method requires full spatial alignment of the sequences, and a homography transformation between the views was assumed. It is a direct alignment algorithm, based on the brightness variations of the sequences. After defining an error function including all the required parameters of the space-time transformation, an optimization problem was solved utilizing a pyramid framework. Another approach, by Yan and Pollefeys [21], suggested using statistics over low-level space-time interest points in each of the sequences. In that research, the space-time interest points, based on the Harris corner detector, were extended to deal with spatio-temporal corners. After detecting and integrating such features, a simple correlation was used to synchronize the video sequences. This concept steers clear of computing point-to-point, trajectory, or action correspondence. However, since the statistics are computed over the entire image, the approach is strongly sensitive to the overlapping regions of the two views, the relative viewing angle, and the complexity of the motion appearing in the scene. The limitations of these two approaches motivate the solution suggested in this thesis.
It is worth mentioning that several synchronization methods considered moving cameras viewing a static scene. For instance, Spencer and Shah [15] performed temporal synchronization between two moving cameras with an unknown but fixed relative location. Their work concentrated on detecting the frame-to-frame movement of the cameras, assuming limited or non-existent motion in the scene. Scenes with relatively little motion were also assumed by several other studies. For example, Serrat et al. [13] proposed a method for synchronizing two sequences from freely moving cameras, not requiring any spatial overlap but assuming that the paths of the moving cameras are very similar. Limited motion of the objects in the scene is also assumed. The algorithm relies on frame-to-frame registration of the sequences in order to synchronize them.
Chapter 3
Method
Given a pair of color (or gray-level) sequences and a fundamental matrix, we achieve synchronization by time registration of the temporal signals from the two sequences. We first present our novel definition of the temporal signals of a sequence, followed by a probabilistic approach for registering two of them. A summary of the algorithm flow is presented in Algorithms 3.1 and 3.2.
There are several preconditions required for the method. The video frame rate is assumed to be equal for all sequences; thus, the synchronization is only a translation in time. In addition, the sequences are assumed to be roughly synchronized prior to running the algorithm, such that the residual de-synchronization is at most a few dozen frames. In general, the synchronization in the system is assumed to change infrequently. It is also assumed that there is some motion in the overlapping areas of the cameras' fields of view. The set of epipolar lines in the two images, together with their correspondence, is computed from the given fundamental matrix.
3.1 A temporal signal
To define the temporal signals, we use the epipolar geometry of a pair of images. Given the fundamental matrix F for a pair of images, a set of epipolar lines L = {ℓ_r} and L′ = {ℓ′_r} and their correspondence, ℓ_r ↔ ℓ′_r, are computed using the given fundamental matrix (Hartley and Zisserman [9]).
Figure 3.1: The calculation of motion indicators for two video sequences: (a) the background frames selected for each of the video sequences; (b) the currently analyzed video frames; (c) the results of the background subtraction, with exemplar sets of corresponding epipolar lines drawn. The motion indicators are computed by accumulating the data along each epipolar line, as described in Equation 3.1.
The correspondence of a given point p̂ ∈ ℓ_r is constrained to lie on the epipolar line ℓ′_r = F p̂ in its synchronized frame (the points and the lines are given in homogeneous coordinates). Traditionally, this property is used for constraining the correspondence search in stereo or motion algorithms. Pixel correspondence is not guaranteed to remain the same over time due to 3D motion. However, two corresponding pixels are guaranteed to remain on two corresponding epipolar lines in both sequences. (The only possible exception is a major occlusion in one of the views.) Using this observation, we define the signals on the entire epipolar line, avoiding not only the problem caused by the change of pixel correspondence over time but also the general challenge of computing spatial correspondence between frames.
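To make the line-to-line correspondence concrete, the following is a minimal sketch, not the implementation used in this thesis, of deriving corresponding epipolar line pairs from a given fundamental matrix with NumPy. The anchor-point sampling, the placeholder F, and all names are illustrative assumptions.

import numpy as np

def corresponding_epipolar_lines(F, anchor_points):
    # For each homogeneous anchor point p in image 1, return the pair
    # (l_r, l'_r): l_r joins p with the epipole e (satisfying F e = 0)
    # in image 1, and l'_r = F p is the corresponding line in image 2.
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]                        # right null vector of F: the epipole
    pairs = []
    for p in anchor_points:
        l1 = np.cross(e, p)           # epipolar line through p in image 1
        l2 = F @ p                    # corresponding epipolar line in image 2
        pairs.append((l1 / np.linalg.norm(l1[:2]),
                      l2 / np.linalg.norm(l2[:2])))
    return pairs

# Example: anchor points a few pixels apart along an image column, since
# only a few dozen epipolar lines per frame are used (Section 3.1).
F = np.random.randn(3, 3)             # placeholder for a real, rank-2 F
anchors = [np.array([0.0, y, 1.0]) for y in range(0, 480, 8)]
line_pairs = corresponding_epipolar_lines(F, anchors)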
A background subtraction algorithm is used for defining the temporal signal of each sequence. The basis of the motion signal is the Euclidean distance between the data frame and the selected background frame at each pixel. For each epipolar line, a motion indicator is taken to be the sum of these distances over the line's pixels, as demonstrated in Figure 3.1. The temporal signal of an epipolar line, the line signal, is defined to be the set of motion indicators on an epipolar line as a function of time. Formally, let I(p, t) and B(p, t) be the intensity values of a pixel p ∈ ℓ_r in a video frame and in the corresponding background frame¹, respectively. The line signal of that epipolar line, S_r(t), is defined to be the distance between the two vectors:
S_r(t) = Σ_{p∈ℓ_r} ‖I(p, t) − B(p, t)‖.   (3.1)
The collection of line signals for all the epipolar lines in a video is the temporal signal of the video sequence. The temporal signals of the two considered sequences are represented by matrices S and S′ (Figure 3.2), where each row r of the matrix consists of a line signal, S_r. That is, S_{r,t} is the motion indicator of an epipolar line ℓ_r at a time step t. Only a few dozen epipolar lines from each frame, a few pixels apart, are considered.
3.2 Signal Registration
In this section we present the time registration of a given pair of temporal signals of the video sequences. For robust results, and in order to combine information from different line signals, the matching is determined using a probabilistic framework, utilizing maximum a posteriori estimation. The time shift is detected by finding the maximum likelihood value for the two signals, with different time shifts applied to the second signal. A sliding window over a predefined range is used to determine ∆t.
Let S and S′ be a pair of line signals, extracted from corresponding epipolar lines in two video sequences. At this stage, assume a single consistent time shift between the two sequences and no frame drops in either of them. We begin by considering the probability distribution of a time shift ∆t of S′ to match S. Applying Bayes' law we obtain:
P(∆t | S, S′) = P(S, S′ | ∆t) P(∆t) / P(S, S′).   (3.2)
¹ B(p, t) is a function of t because, in the general case, an adaptive background subtraction can be used.
Figure 3.2: (a), (b) examples of temporal signals S and S′ of two sequences, containing 130 epipolar lines over a time period of 8 seconds (200 frames). Each pixel in the signal is the motion indicator of an epipolar line at a time point. (c) the matching result for those signals, with a high-confidence peak at the correct time shift of ∆t = −1 frames.
The denominator term, P(S, S′), is an a priori joint probability distribution of S and S′. A uniform distribution is assumed for this probability; hence it is taken to be a constant normalization factor. In general, prior knowledge about the time and space distribution of objects in the scene, or about the overlapping regions of the two sequences, could be used for computing this prior. Extracting such knowledge is out of the scope of this thesis. The term P(∆t) is another prior, in this case on the probability distribution of ∆t. The use of this prior is discussed in the experimental part (see Section 4.3).
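Before deriving the likelihood term, the overall MAP search implied by Equation 3.2 can be sketched as follows. For illustration only, the log-likelihood is assumed to reduce, under the Gaussian noise model introduced next, to a normalized sum of squared differences over the overlapping parts of the signals; the noise variance, the shift range, and all names are assumptions rather than the thesis implementation.

import numpy as np

def map_time_shift(S, S_prime, max_shift, log_prior, sigma2=1.0):
    # Return the shift maximizing log P(S, S'|dt) + log P(dt), comparing
    # S[:, t] against S_prime[:, t + dt] on their temporal overlap.
    T = S.shape[1]
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = np.empty(len(shifts))
    for i, dt in enumerate(shifts):
        a = S[:, max(0, -dt):T - max(0, dt)]          # S(t)
        b = S_prime[:, max(0, dt):T - max(0, -dt)]    # S'(t + dt)
        # Gaussian model: the log-likelihood falls off with the squared
        # error, averaged so different overlap lengths stay comparable.
        scores[i] = -np.mean((a - b) ** 2) / (2.0 * sigma2) + log_prior[i]
    return shifts[np.argmax(scores)], scores

With a uniform prior, log_prior is a constant vector and the estimate reduces to maximum likelihood; the use of a non-uniform P(∆t) is discussed in Section 4.3.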
For estimating the likelihood term, P(S, S′ | ∆t), we apply a simple stochastic model to the temporal signals. The commonly used assumption of additive white Gaussian noise is applied to the two line signals:

S(t) = S′(t + ∆t) + N(µ, σ²),   (3.3)

where ∆t is the correct time shift between the two signals, and µ is the difference between the averages of the two. For simplicity, in the rest of the thesis we omit the value µ: from here on, each line signal S participates in the computation after its average has been subtracted. Using this model assumption, the likelihood of two line signals, given ∆t, is obtained by: