The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science
Video Synchronization Using Temporal Signals from Epipolar Lines
M.Sc. dissertation for research project
Submitted by Dmitry Pundik
Under the supervision of Dr. Yael Moses
May 2010

Acknowledgments
I owe my deepest gratitude to my advisor, Dr. Yael Moses, who, with her graceful attitude and endless patience, stood by me during my research and kept her faith when I had doubts. With her extensive knowledge and vast experience, she steered my study towards maturity and was fully involved until the very last paragraph of this thesis.
I would also like to thank the wonderful teachers I had here at IDC, who reminded me why I had chosen this vocation. Specifically, I would like to express my gratitude to Dr. Yacov Hel-Or, who was always ready to lend an ear and share valuable advice.
This research would not have been completed without the support of my direct management at Advasense Image Sensors Ltd., who understood the importance of academic education and allowed me to pursue and accomplish this thesis.
Finally, I would like to thank my fellow students: Ran Eshel, for his significant contribution to the video experiments and for his willingness to assist in any way, from mathematical discussions to document proofreading; and all the neighbors in the IDC laboratory, who made my stay extremely pleasant.
Abstract
Time synchronization of video sequences in a multi-camera system is necessary for successfully analyzing the acquired visual information. Even if synchronization is established, its quality may deteriorate over time due to a variety of reasons, most notably frame dropping. Consequently, synchronization must be actively maintained. This thesis presents a method for online synchronization that relies only on the video sequences. We introduce a novel definition of low level temporal signals computed from epipolar lines. The spatial matching of two such temporal signals is given by the fundamental matrix. Thus, no pixel correspondence is required, bypassing the problem of correspondence changes in the presence of motion. The synchronization is determined from registration of the temporal signals. We consider general video data with substantial movement in the scene, for which high level information may be hard to extract from each individual camera (e.g., computing trajectories in crowded scenes). Furthermore, a trivial correspondence between the sequences is not assumed to exist. The method is online and can be used to resynchronize video sequences every few seconds, with only a small delay.
Experiments on indoor and outdoor sequences demonstrate the effectiveness of the method.
Table of Contents
Acknowledgments ...... i
Abstract ...... ii
Table of Contents ...... iii
List of Figures ...... v
List of Algorithms ...... vi
1 Introduction 1
2 Previous work 5
3 Method 9
3.1 A temporal signal ...... 9
3.2 Signal Registration ...... 11
3.3 Epipolar line filtering ...... 15
3.4 Algorithms flow ...... 15
4 Experiments 17
4.1 Basic results ...... 19
4.2 Frame dropping ...... 21
4.3 Using a prior on P(∆t) ...... 22
4.4 Setting the parameters ...... 23
4.5 Verification of Calibration ...... 24
4.6 Temporal signal of an entire frame ...... 25
5 Conclusion 27
Bibliography 30
List of Figures
3.1 Motion indicators calculation ...... 10
3.2 Temporal signals ...... 12
4.1 Set 1 ...... 18
4.2 Set 2 ...... 19
4.3 Set 3 ...... 20
4.4 Set 4 ...... 21
4.5 Basic results, Sets 1,2 ...... 22
4.6 Basic results, Sets 3,4 ...... 23
4.7 Using prior, Set 3 ...... 24
4.8 Confidence and frame drops ...... 25
4.9 Incorrect synchronization ...... 26
List of Algorithms
3.1 Temporal signal update ...... 15
3.2 Synchronization iteration ...... 16
Chapter 1
Introduction
Applications of multiple camera systems range from video surveillance of large areas such as airports or shopping centers, to videography and filmmaking. As more and more of these applications utilize the information obtained in the overlapping fields of view of the cameras, precise camera synchronization and its constant maintenance are indispensable. Examples of such applications include multi-view object tracking (e.g. Eshel and Moses [6], Gavrila et al. [8]), action recognition (e.g. Weinland et al. [18]), and scene flow computation (e.g. Basha et al. [1], Pons et al. [12], Furukawa and Ponce [7]).
In practice, given enough running time, synchronization will be violated by technical imperfections that cause frame dropping or incorrect timing between sequences. The tendency to use mostly inexpensive components makes such violations a certainty in many video systems.
Manual synchronization is out of the question, as it is labor-intensive and cannot be performed continuously; thus, it cannot handle arbitrary frame dropping. Precise time synchronization via satellite, as in GPS systems, may be too expensive or of limited use in indoor environments. Clock synchronization using distributed protocols depends on the properties of the communication network and is sensitive to communication failures. Obvious alternative sources of time information are the video streams themselves, which often provide sufficient and reliable information for automatic synchronization.
In this work we address the problem of computing and maintaining the temporal synchronization between a pair of video streams with the same frame rate, relying only on the video data. The temporal synchronization task can be described as the problem of finding the best correlation of temporal events in a pair of video streams. A number of approaches for defining such events have been proposed. Here we briefly review the main approaches; Chapter 2 gives more details on existing studies. Some studies attempt to define unique high-level motion signatures appearing in both video streams (e.g. Dexter et al. [5]). These motion features consist of statistics over the entire image, hence no spatial correspondence between the events is required. As such approaches tend to have high computational complexity, lower-level features have also been used in the literature. Another family of methods relies on tracked trajectories of features or objects visible in both streams. Such methods require some level of correspondence between the considered trajectories, which can be difficult to establish due to the ambiguities of 3D shapes in a two-dimensional image. In order to overcome these ambiguities, some methods assumed planar motion in the scene (e.g. Caspi and Irani [3]), while others relied on special properties of the trajectories (e.g. Tresadern and Reid [16], Whitehead et al. [19]). Algorithms that do not explicitly compute point correspondence were also proposed, assuming a linear combination between the objects' views or verifying spatial geometry during the synchronization search (e.g. Wolf and Zomet [20], Lei and Yang [10]). The computation of trajectories and their quality strongly depend on the scene. Therefore, a number of works tried to avoid the use of trajectories. For instance, a direct approach using low-level video features, such as time gradients, was introduced (e.g. Caspi and Irani [3]). But using such simple features requires an exact point-to-point correspondence, which is possible only in the case of a full homography transformation between the views; that is, again, an assumption of a planar scene or a negligible translation between the camera locations.
Proposed approach
We present a method for obtaining online time synchronization of a pair of video sequences acquired by two static cameras, possibly in a wide-baseline setup. The fundamental matrix between each pair of sequences, which provides epipolar line-to-line correspondence, is assumed to be known. (For example, it can be computed directly from static corresponding features of the videos when there is no motion in the scene.) This is the only spatial correspondence required by our method. We consider sequences of general 3D scenes which contain a large number of moving objects, focusing on sequences for which feature or object trajectories may be hard to compute due to occlusions and substantial movement (see Figure 4.1). Furthermore, a trivial correspondence (e.g., homography) between the sequences is not assumed. The temporal misalignment is considered to be only a translation, i.e., the sequences have the same frame rate.
Therefore, we do not detect sub-frame time shifts. Our method allows online synchronization correction when a number of frames are dropped.
Our method is based on matching temporal signals defined on epipolar lines of each of the sequences. Hence, the spatial matching is given by the fundamental matrix. The temporal matching is performed using a probabilistic optimization framework; independent simultaneous motion occurring on different epipolar lines improves our synchronization. Failure to find such a matching (despite observed motion in the scene) indicates that the epipolar geometry is incorrect. The temporal signal is defined as an integration of the information along an epipolar line during a sufficient interval of time (at least two seconds). A simple background subtraction algorithm is used as the input to this integration. Integrating the information along epipolar lines rather than considering signals at the pixel level not only avoids the search for correspondence but also allows the handling of general moving scenes. Note that in a general scene, the correspondence between pixels at different time steps changes due to the 3D motion of objects in space. Therefore, the synchronization cannot rely on corresponding pixels.
The main contribution of this thesis is the use of low-level temporal events along corresponding epipolar lines for video synchronization. Our method does not require high-level computation such as tracking, which may be hard to perform in crowded scenes such as the ones considered in our experiments. Furthermore, we bypass the need to compute point-to-point correspondences between pixels. Finally, our method can be used in an online framework, because it detects synchronization errors (e.g., frame drops) within seconds, as they occur in the video.
Chapter 2
Previous work
Synchronization can be achieved using visual information by correlating spatio-temporal features or events viewed by two or more cameras. Our method considers static cameras acquiring a moving scene. Previous attempts to synchronize such sequences can be classified by the choice of features used for matching. The most straightforward approach is finding both spatial and temporal correspondence between point features at frames taken at all possible time shifts between the two video streams. Such approaches are vulnerable to correspondence ambiguities and require a large search space. A method for reducing the complexity of this search was suggested by Carceroni et al. [2]. In this method, an intersection between the 2D trajectory of a feature in one sequence and an epipolar line of some other feature from the other sequence is considered. Such an intersection in time is a possible synchronization point between the sequences. By collecting all such intersections for all the features in the considered sequences and using RANSAC to select the best hypothesis, the correct synchronization is computed.
Higher-level features that contain temporal information also help to reduce the matching ambiguity and the search complexity. An example of such features is the motion trajectories of objects. In order to perform the synchronization, some kind of spatial correspondence between the trajectories must be calculated. Since the motion of the objects is three-dimensional, matching the observed 2D trajectories in each sequence is ill-posed. Thus, different researchers proposed methods to ease or avoid such spatial correspondence. Caspi and Irani [3] proposed a method assuming planar motion in a scene (with a homography transformation between the trajectories). Such an assumption makes it possible to calculate spatial correspondence between the trajectories, as it eliminates the 2D-3D ambiguities. The method was based on feature extraction with subsequent correspondence detection between the two trajectories of the same feature. A transformation in time-space was found for aligning those trajectories. This work assumes an affine transformation in the time domain.
Wolf and Zomet [20] performed correspondence-free synchronization in a non-rigid scene, assuming orthographic projection. The main assumption is that each observed 3D feature in one sequence can be expressed as a linear combination of the 3D features observed in the other sequence. This assumption makes it possible to avoid calculating spatial correspondence between the trajectories. By applying a rank constraint on a measurement matrix containing the locations of the tracked features from both sequences, the correct time shift is found. A similar rank constraint on a constructed measurement matrix was used by Tresadern and Reid [16]. An affine projection is assumed, and a correspondence between the tracked 2D trajectories of features in both sequences is computed. A special measurement matrix containing the corresponding points is created. Two frames from the two sequences can be synchronous only if the measurement matrix has a specific rank. After all possible frame pairs are detected, RANSAC is used to determine the correct synchronization of the sequences.
Lei and Yang [10] synchronized three or more freely moving cameras. Corresponding point and line features are detected, matched, and then tracked in all sequences. Each combination of time shifts between the first and the second cameras, and between the first and the third cameras, is a synchronization hypothesis for the three cameras. Spatial correspondence is avoided with the aid of spatial geometry consistency for any three possible time shifts. This consistency is assessed for each hypothesis, and the correct synchronization is found. The spatial geometry can be estimated from the point and line correspondences. In another study, Singh et al. [14] introduced event dynamics for the synchronization task. Working on a spatially calibrated pair of video sequences, they used regression on the trajectory points in space and time. The temporal misalignment model was known and the motion trajectories were extracted. This working scheme effectively dealt with noisy data.
Whitehead et al. [19] also used trajectories of tracked objects for synchronizing three cameras. The inflection points of the trajectories are used for detecting spatial correspondence between trajectories, as those points are space-time invariant and can be detected in a 2D video.
Another way to use tracked features was proposed by Tuytelaars and Van Gool [17]. Their approach deals with freely moving cameras in a non-rigid 3D moving scene. First, a set of corresponding points must be found and tracked along the two video sequences, followed by backprojecting the corresponding points as lines in the 3D world. By searching for intersections of these lines, the rigidity of the point set can be verified. A set of freely moving points in the 3D world can be rigid only when the two video frames are synchronized.
The computation of trajectories and their quality strongly depend on the scene, and trajectories can often be hard to compute, as in the videos considered in this thesis. In order to avoid feature tracking, highly discriminative action recognition features were proposed for synchronization by Dexter et al. [5]. A feature-based alignment scheme is used, where the preferred features are self-similarity matrices (SSMs). SSMs are used as temporal signatures of image sequences that are not affected by viewing angle. Those signatures are tracked during the video. This work focused on human action detection with free camera motion. Naturally, such high-level features are limited to scenes in which these actions appear and can be detected. Another limitation of this work is the high computational complexity of such calculations.
In an effort to avoid complex computations such as tracking and action recognition, an approach based on brightness variation over the entire image was suggested by Caspi and Irani [3]. However, this method requires full spatial alignment of the sequences, and a homography transformation between the views was assumed. It is a direct alignment algorithm, based on the brightness variations of the sequences. After defining an error function including all the required parameters of the space-time transformation, an optimization problem was solved utilizing a pyramid framework. Another approach, by Yan and Pollefeys [21], suggested using statistics over low-level space-time interest points in each of the sequences. In that research, the space-time interest points, based on the Harris corner detector, were extended to deal with spatio-temporal corners. After detecting and integrating such features, a simple correlation was used to synchronize the video sequences. This concept steers clear of computing point-to-point, trajectory, or action correspondence. However, since the statistics are computed over the entire image, the approach is strongly sensitive to the overlapping regions of the two views, the relative viewing angle, and the complexity of the motion appearing in the scene. The limitations of these two approaches motivate the solution suggested in this thesis.
It is worth mentioning that several synchronization methods considered moving cameras viewing a static scene. For instance, Spencer and Shah [15] performed temporal synchronization between two moving cameras with an unknown but fixed relative location. Their work concentrated on detecting the frame-to-frame movement of the cameras, assuming limited or non-existent motion in the scene. Scenes with relatively little motion were also assumed by several other studies. For example, Serrat et al. [13] proposed a method for synchronizing two sequences from freely moving cameras, not requiring any spatial overlap but assuming that the paths of the moving cameras are very similar. Limited motion of the objects in the scene is also assumed. The algorithm relies on frame-to-frame registration of the sequences in order to synchronize them.
Chapter 3
Method
Given a pair of color (or gray-level) sequences and a fundamental matrix, we achieve synchronization by time registration of the temporal signals from the two sequences. We first present our novel definition of the temporal signals of a sequence, followed by a probabilistic approach for registering two of them. A summary of the algorithm flow is presented in Algorithms 3.1 and 3.2.
There are several preconditions required for the method. The video frame rate is assumed to be equal for all sequences; thus, the synchronization is only a translation in time. In addition, the sequences are assumed to be roughly synchronized prior to running the algorithm, such that the residual de-synchronization is at most a few dozen frames. In general, the synchronization in the system is assumed to change infrequently. It is also assumed that there is some motion in the overlapping areas of the cameras' fields of view. The set of epipolar lines in the two images, together with their correspondence, is computed from the given fundamental matrix.
3.1 A temporal signal
To define the temporal signals, we use the epipolar geometry of a pair of images. Given the fundamental matrix F for a pair of images, a set of epipolar lines L = {ℓ_r} and L′ = {ℓ′_r} and their correspondence, ℓ_r ↔ ℓ′_r, are computed using the given fundamental matrix (Hartley and Zisserman [9]).
Figure 3.1: The calculation of motion indicators for two video sequences: (a) the background frames selected for each of the video sequences; (b) the currently analyzed video frames; (c) the results of the background subtraction, with exemplar sets of corresponding epipolar lines drawn. The motion indicators are computed by accumulating the data along each epipolar line, as described in Equation 3.1.
The correspondence of a given point p̂ ∈ ℓ_r is constrained to lie on the epipolar line ℓ′_r = F p̂ in its synchronized frame (the points and the lines are given in homogeneous coordinates). Traditionally, this property is used for constraining the correspondence search in stereo or motion algorithms. Pixel correspondence is not guaranteed to remain the same over time due to 3D motion. However, two corresponding pixels are guaranteed to remain on two corresponding epipolar lines in both sequences. (The only possible exception is a major occlusion in one of the views.) Using this observation, we define the signals on the entire epipolar line, avoiding not only the problem caused by the change of pixel correspondence over time but also the general challenge of computing spatial correspondence between frames.
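To make the line-to-line correspondence concrete, the following is a minimal sketch, not the implementation used in this thesis, of deriving corresponding epipolar line pairs from a given fundamental matrix with NumPy. The anchor-point sampling, the placeholder F, and all names are illustrative assumptions.

import numpy as np

def corresponding_epipolar_lines(F, anchor_points):
    # For each homogeneous anchor point p in image 1, return the pair
    # (l_r, l'_r): l_r joins p with the epipole e (satisfying F e = 0)
    # in image 1, and l'_r = F p is the corresponding line in image 2.
    _, _, Vt = np.linalg.svd(F)
    e = Vt[-1]                        # right null vector of F: the epipole
    pairs = []
    for p in anchor_points:
        l1 = np.cross(e, p)           # epipolar line through p in image 1
        l2 = F @ p                    # corresponding epipolar line in image 2
        pairs.append((l1 / np.linalg.norm(l1[:2]),
                      l2 / np.linalg.norm(l2[:2])))
    return pairs

# Example: anchor points a few pixels apart along an image column, since
# only a few dozen epipolar lines per frame are used (Section 3.1).
F = np.random.randn(3, 3)             # placeholder for a real, rank-2 F
anchors = [np.array([0.0, y, 1.0]) for y in range(0, 480, 8)]
line_pairs = corresponding_epipolar_lines(F, anchors)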
A background subtraction algorithm is used for defining the temporal signal of each sequence. The basis of the motion signal is the Euclidean distance between the data frame and the selected background frame at each pixel. For each epipolar line, a motion indicator is taken to be the sum of these distances over the line's pixels, as demonstrated in Figure 3.1. The temporal signal of an epipolar line, the line signal, is defined to be the set of motion indicators on an epipolar line as a function of time. Formally, let I(p, t) and B(p, t) be the intensity values of a pixel p ∈ ℓ_r in a video frame and in the corresponding background frame¹, respectively. The line signal of that epipolar line, S_r(t), is defined to be the distance between the two vectors:
S_r(t) = Σ_{p∈ℓ_r} ‖I(p, t) − B(p, t)‖.   (3.1)
The collection of line signals for all the epipolar lines in a video is the temporal signal of the video sequence. The temporal signals of the two considered sequences are represented by matrices S and S′ (Figure 3.2), where each row r of the matrix consists of a line signal, S_r. That is, S_{r,t} is the motion indicator of an epipolar line ℓ_r at a time step t. Only a few dozen epipolar lines from each frame, a few pixels apart, are considered.
3.2 Signal Registration
In this section we present the time registration of a given pair of temporal signals of the video sequences. For robust results, and in order to combine information from different line signals, the matching is determined using a probabilistic framework, utilizing maximum a posteriori estimation. The time shift is detected by finding the maximum likelihood value for the two signals, with different time shifts applied to the second signal. A sliding window over a predefined range is used to determine ∆t.
Let S and S′ be a pair of line signals, extracted from corresponding epipolar lines in two video sequences. At this stage, assume a single consistent time shift between the two sequences and no frame drops in either of them. We begin by considering the probability distribution of a time shift ∆t of S′ to match S. Applying Bayes' law we obtain:
P(∆t | S, S′) = P(S, S′ | ∆t) P(∆t) / P(S, S′).   (3.2)
¹ B(p, t) is a function of t because, in the general case, an adaptive background subtraction can be used.
Figure 3.2: (a), (b) examples of temporal signals S and S′ of two sequences, containing 130 epipolar lines over a time period of 8 seconds (200 frames). Each pixel in the signal is the motion indicator of an epipolar line at a time point. (c) the matching result for those signals, with a high-confidence peak at the correct time shift of ∆t = −1 frames.
The denominator term, P(S, S′), is an a priori joint probability distribution of S and S′. A uniform distribution is assumed for this probability; hence it is taken to be a constant normalization factor. In general, prior knowledge about the time and space distribution of objects in the scene, or about the overlapping regions of the two sequences, could be used for computing this prior. Extracting such knowledge is out of the scope of this thesis. The term P(∆t) is another prior, in this case on the probability distribution of ∆t. The use of this prior is discussed in the experimental part (see Section 4.3).
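Before deriving the likelihood term, the overall MAP search implied by Equation 3.2 can be sketched as follows. For illustration only, the log-likelihood is assumed to reduce, under the Gaussian noise model introduced next, to a normalized sum of squared differences over the overlapping parts of the signals; the noise variance, the shift range, and all names are assumptions rather than the thesis implementation.

import numpy as np

def map_time_shift(S, S_prime, max_shift, log_prior, sigma2=1.0):
    # Return the shift maximizing log P(S, S'|dt) + log P(dt), comparing
    # S[:, t] against S_prime[:, t + dt] on their temporal overlap.
    T = S.shape[1]
    shifts = np.arange(-max_shift, max_shift + 1)
    scores = np.empty(len(shifts))
    for i, dt in enumerate(shifts):
        a = S[:, max(0, -dt):T - max(0, dt)]          # S(t)
        b = S_prime[:, max(0, dt):T - max(0, -dt)]    # S'(t + dt)
        # Gaussian model: the log-likelihood falls off with the squared
        # error, averaged so different overlap lengths stay comparable.
        scores[i] = -np.mean((a - b) ** 2) / (2.0 * sigma2) + log_prior[i]
    return shifts[np.argmax(scores)], scores

With a uniform prior, log_prior is a constant vector and the estimate reduces to maximum likelihood; the use of a non-uniform P(∆t) is discussed in Section 4.3.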
For estimating the likelihood term, P(S, S′ | ∆t), we apply a simple stochastic model to the temporal signals. The commonly used assumption of additive white Gaussian noise is applied to the two line signals:

S(t) = S′(t + ∆t) + N(µ, σ²),   (3.3)

where ∆t is the correct time shift between the two signals, and µ is the difference between the averages of the two. For simplicity, in the rest of the thesis we omit the value µ: from here on, each line signal S participates in the computation after its average has been subtracted. Using this model assumption, the likelihood of two line signals, given ∆t, is obtained by: