
The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Video Synchronization Using Temporal Signals from Epipolar Lines

M.Sc. dissertation for research project

Submitted by Dmitry Pundik
Under the supervision of Dr. Yael Moses

May 2010

Acknowledgments

I owe my deepest gratitude to my advisor, Dr. Yael Moses, who, with her graceful attitude and endless patience, stood by me during my research and kept her faith when I had doubts.

With her extensive knowledge and vast experience, she steered my study toward maturity and was fully involved until the very last paragraph of this thesis.

I would also like to thank the wonderful teachers I had here at IDC, who reminded me why I had chosen this vocation. Specifically, I would like to express my gratitude to Dr. Yacov Hel-Or, who was always ready to lend an ear and share valuable advice.

The research would not have been completed without the support of my direct management at Advasense Image Sensors Ltd., who understood the importance of academic education and allowed me to pursue and accomplish this thesis.

Finally, I would like to thank my fellow students: Ran Eshel – for his significant contribution to the video experiments and for his willingness to assist in any way, from mathematical discussions to document proofreading; and all the neighbors in the IDC laboratory, who made my stay extremely pleasant.

Abstract

Time synchronization of video sequences in a multi-camera system is necessary for successfully analyzing the acquired visual information. Even if synchronization is established, its quality may deteriorate over time due to a variety of reasons, most notably frame dropping. Consequently, synchronization must be actively maintained. This thesis presents a method for online synchronization that relies only on the video sequences. We introduce a novel definition of low-level temporal signals computed from epipolar lines. The spatial matching of two such temporal signals is given by the fundamental matrix. Thus, no pixel correspondence is required, bypassing the problem of correspondence changes in the presence of motion. The synchronization is determined from registration of the temporal signals. We consider general video data with substantial movement in the scene, for which high-level information may be hard to extract from each individual camera (e.g., computing trajectories in crowded scenes). Furthermore, a trivial correspondence between the sequences is not assumed to exist. The method is online and can be used to resynchronize video sequences every few seconds, with only a small delay.

Experiments on indoor and outdoor sequences demonstrate the effectiveness of the method.

Table of Contents

Acknowledgments
Abstract
Table of Contents
List of Figures
List of Algorithms
1 Introduction
2 Previous work
3 Method
3.1 A temporal signal
3.2 Signal Registration
3.3 Epipolar line filtering
3.4 Algorithm flow
4 Experiments
4.1 Basic results
4.2 Frame dropping
4.3 Using a prior on P(∆t)
4.4 Setting the parameters
4.5 Verification of Calibration
4.6 Temporal signal of an entire frame
5 Conclusion
Bibliography

List of Figures

3.1 Motion indicators calculation
3.2 Temporal signals
4.1 Set 1
4.2 Set 2
4.3 Set 3
4.4 Set 4
4.5 Basic results, Sets 1,2
4.6 Basic results, Sets 3,4
4.7 Using prior, Set 3
4.8 Confidence and frame drops
4.9 Incorrect synchronization

List of Algorithms

3.1 Temporal signal update
3.2 Synchronization iteration

Chapter 1

Introduction

Applications of multiple camera systems range from video surveillance of large areas such as airports or shopping centers, to videography and filmmaking. As more and more of these applications utilize the information obtained in the overlapping fields of view of the cameras, precise camera synchronization and its constant maintenance are indispensable. Examples of such applications include multi-view object tracking (e.g. Eshel and Moses [6], Gavrila et al. [8]), action recognition (e.g. Weinland et al. [18]), and scene flow computation (e.g. Basha et al. [1], Pons et al. [12], Furukawa and Ponce [7]).

In practice, given enough video time, synchronization will be violated because of technical imperfections that cause frame dropping or incorrect timing between sequences. The tendency to use mostly inexpensive components makes such violations a certainty in many video systems.

Manual synchronization is out of the question, as it is labor-intensive and cannot be performed constantly; thus, it cannot handle arbitrary frame-dropping. Precise time synchronization via satellite, as in GPS systems, may be too expensive or limited in indoor environments.

Using distributed synchronization protocols depends on the properties of the communication network and is sensitive to communication failures. Obvious alternative sources of time information are the video streams themselves, which often provide sufficient and reliable information for automatic synchronization.

In this work we address the problem of computing and maintaining the temporal synchronization between a pair of video streams with the same frame rate, relying only on the video data. The temporal synchronization task can be described as a problem of finding the best correlation of temporal events in a pair of video streams. There have been a number of previous approaches for defining such events. Here we briefly review the main approaches, while in Chapter 2 we give more details on existing studies. Some studies attempt to define unique high-level motion signatures appearing in both video streams (e.g. Dexter et al. [5]). These motion features consist of statistics over the entire image, hence no spatial correspondence between the events is required. As such approaches tend to have high computational complexity, less high-level features were also used in the literature. Another family of methods relies on tracked trajectories of features or objects visible in both streams. Such methods require some level of correspondence between the considered trajectories, which can be difficult to establish due to the ambiguities of 3D shapes in a two-dimensional image. In order to overcome these ambiguities, some methods assumed planar motion in the scene (e.g. Caspi and Irani [3]), while others relied on special properties of the trajectories (e.g. Tresadern and Reid [16], Whitehead et al. [19]). Algorithms that avoid explicitly computing point correspondence were also proposed, assuming a linear combination between the objects' views or verifying spatial geometry during the synchronization search (e.g. Wolf and Zomet [20], Lei and Yang [10]). The computation of the trajectories and their quality strongly depend on the scene. Therefore, a number of works tried to avoid the use of trajectories. For instance, a direct approach using low-level video features, such as time gradients, was introduced (e.g. Caspi and Irani [3]). However, using such simple features requires an exact point-to-point correspondence, which is possible only when a full homography transformation relates the views; that is, again, an assumption of a planar scene or a negligible translation between the cameras' locations.

Proposed approach

We present a method for obtaining online time synchronization of a pair of video sequences acquired by two static cameras, possibly in a wide-baseline setup. The fundamental matrix

between each pair of sequences, which provides epipolar line-to-line correspondence, is assumed to be known. (For example, it can be computed directly from static corresponding features of the videos when there is no motion in the scene.) This is the only spatial correspondence required by our method. We consider sequences of general 3D scenes which contain a large number of moving objects, focusing on sequences for which features or object trajectories may be hard to compute due to occlusions and substantial movement (see Figure 4.1). Furthermore, trivial correspondence (e.g., homography) between the sequences is not assumed. The temporal misalignment is considered to be only a translation, i.e., the sequences have the same frame rate.

Therefore, we do not detect sub-frame time shifts. Our method allows online correction of the synchronization when frames are dropped.

Our method is based on matching temporal signals defined on epipolar lines of each of the sequences. Hence, the spatial matching is given by the fundamental matrix. The temporal matching is performed using a probabilistic optimization framework; independent simultaneous motion occurring on different epipolar lines improves our synchronization. Failure to find such a matching (despite observed motion in the scene) indicates that the epipolar geometry is incorrect. The temporal signal is defined as an integration of the information along an epipolar line during a sufficient interval of time (at least 2 seconds). A simple background subtraction algorithm is used as the input to the integration. Integrating the information along epipolar lines rather than considering signals at the pixel level not only avoids the search for correspondence but also allows the handling of general moving scenes. Note that in a general scene, the correspondence between pixels at different time steps changes due to 3D motion of objects in space. Therefore, the synchronization cannot rely on corresponding pixels.

The main contribution of this thesis is the use of low-level temporal events along corresponding epipolar lines for video synchronization. Our method does not require high-level computation such as tracking, which may be hard to perform in crowded scenes such as the ones considered in our experiments. Furthermore, we bypass the need to compute point-to-point correspondences between pixels. Finally, our method can be used in an online framework, because it detects synchronization errors (e.g., frame drops) in a matter of seconds, as they occur in the video.

Chapter 2

Previous work

Synchronization can be achieved using visual information by correlating spatio-temporal features or events viewed by two or more cameras. Our method considers static cameras acquiring a moving scene. Previous attempts to synchronize such sequences can be classified by the choice of features used for matching. The most straightforward approach is finding both spatial and temporal correspondence between point features at frames taken at all possible time shifts between the two video streams. Such approaches are vulnerable to correspondence ambiguities and require a large search space. A method for reducing the complexity of this search was suggested by Carceroni et al. [2]. In this method, an intersection between the 2D trajectory of a feature in one sequence and an epipolar line of some other feature from the other sequence is considered. Such an intersection in time is a possible synchronization point between the sequences. By collecting all such intersections for all the features in the considered sequences and using RANSAC for selecting the best hypothesis, the correct synchronization is computed.

Higher-level features that contain temporal information also help to reduce the matching ambiguity and the search complexity. An example of such features are motion trajectories of objects. In order to perform the synchronization, some kind of spatial correspondence between the trajectories must be calculated. Since the motion of the objects is three-dimensional, matching the observed 2D trajectories in each sequence is ill posed. Thus, different studies proposed methods to ease or avoid such spatial correspondence. Caspi and Irani [3] proposed a method assuming planar motion in the scene (with a homography transformation between the trajectories). Such an assumption makes it possible to calculate spatial correspondence between the trajectories, as it eliminates the 2D-3D ambiguities. The method was based on feature extraction with subsequent correspondence detection between the two trajectories of the same feature. A transformation in time-space was found for aligning those trajectories. This work assumes an affine transformation in the time domain.

Wolf and Zomet [20] performed correspondence-free synchronization in a non-rigid scene, assuming orthographic projection. The main assumption is that each observed 3D feature in one sequence can be expressed as a linear combination of the 3D features observed by the other sequence. This assumption makes it possible to avoid calculating spatial correspondence between the trajectories. By applying a rank constraint on a measurement matrix containing the locations of the tracked features from both sequences, the correct time shift is found. A similar rank constraint on a constructed measurement matrix was used by Tresadern and Reid [16].

An affine projection is assumed, and a correspondence between the tracked 2D trajectories of features in both sequences is computed. A special measurement matrix containing the correspondence points is created. Two frames from the two sequences can be synchronous if the measurement matrix has a specific rank. After all possible frame pairs are detected, RANSAC is used in order to determine the correct synchronization of the sequences.

Lei and Yang [10] synchronized three or more freely moving cameras. Corresponding point and line features are detected, matched and then tracked in all sequences. Each combination of time shifts between the first and the second, and the first and the third cameras is a synchronization hypothesis for the three cameras. The spatial correspondence is avoided with the aid of spatial geometry consistency for any three possible time shifts. This consistency is assessed for each hypothesis, and the correct synchronization is found. The spatial geometry can be estimated from the point and line correspondences. In another study, Singh et al. [14] introduced event dynamics for the synchronization task. Working on a spatially calibrated pair of video sequences, they used regression on the trajectory points in space and time. The temporal misalignment model was known and the motion trajectories were extracted. This working scheme effectively dealt with noisy data.

Whitehead et al. [19] also used trajectories of tracked objects for synchronizing three cameras. The inflection points of the trajectories are used for detecting spatial correspondence between trajectories, as those points are space-time invariant and can be detected in a 2D video.

Another way to use tracked features was proposed by Tuytelaars and Van Gool [17]. The proposed approach deals with freely moving cameras in a non-rigid 3D moving scene. First, a set of corresponding points must be found and tracked along the two video sequences, followed by backprojecting the corresponding points as lines in the 3D world. By searching for intersections of these lines, the rigidity of the point set can be verified. The set of freely moving points in the 3D world can be rigid only when the two video frames are synchronized.

The computation of the trajectories and their quality strongly depend on the scene, and trajectories can often be hard to compute, as in the videos considered in this thesis. In order to avoid feature tracking, highly discriminative action recognition features were proposed for synchronization by Dexter et al. [5]. A feature-based alignment scheme is used, where the preferred features are self-similarity matrices (SSMs). SSMs are used as temporal signatures of image sequences that are not affected by the viewing angle. Those signatures are tracked during the video. This work focused on human action detection with free camera motion. Naturally, such high-level features are limited to scenes in which these actions appear and can be detected. Another limitation of this work is the high computational complexity of such calculations.

In an effort to avoid complex computations such as tracking and action recognition, an approach based on brightness variation over the entire image was suggested by Caspi and Irani [3]. However, this method requires full spatial alignment of the sequences, and a homography transformation between the views was assumed. It is a direct alignment algorithm, based on the direct brightness variations of the sequences. After defining an error function including all the required parameters of the space-time transformation, an optimization problem was solved, utilizing a pyramid framework. Another approach, by Yan and Pollefeys [21], suggested using statistics over low-level space-time interest points in each of the sequences. In that research, the space-time interest points, based on the Harris corner detector, were extended to deal with spatio-temporal corners. After detecting and integrating such features, a simple correlation was used in order to synchronize the video sequences. This concept steers clear of computing point-to-point, trajectory, or action correspondence. However, since the statistics are computed over the entire image, the approach is strongly sensitive to the overlapping regions of the two views, the relative viewing angle, and the complexity of the motion appearing in the scene. The limitations of these two approaches motivate the solution suggested in this thesis.

It is worth mentioning that several synchronization methods considered moving cameras viewing a static scene. For instance, Spencer and Shah [15] performed temporal synchronization between two moving cameras with unknown but fixed relative location. Their work concentrated on detecting the frame-to-frame movement of the cameras, assuming limited or non-existent motion in the scene. Scenes with relatively little motion were also assumed by several studies. For example, Serrat et al. [13] proposed a method for synchronizing two sequences from freely moving cameras, not requiring any spatial overlap, but assuming the paths of the moving cameras are very similar. Limited motion of the objects in the scene is also assumed. The algorithm relies on frame-to-frame registration of the sequences in order to synchronize them.

Chapter 3

Method

Given a pair of color (or gray-level) sequences and a fundamental matrix, we achieve synchronization by time registration of the temporal signals from the two sequences. We first present our novel definition of temporal signals of a sequence, followed by a probabilistic approach for registering two of them. The summary of the algorithm flow is presented in Algorithms 3.1 and 3.2.

There are several preconditions required for the method. The video frame rate is assumed to be equal for all sequences, thus the synchronization is only a translation in time. In addition, the sequences are roughly synchronized prior to running the algorithm, such that the residual de-synchronization is only up to a few dozen frames. In general, the synchronization in the system is assumed to change infrequently. It is also assumed that there is some motion in the overlapping areas of the cameras' fields of view. The set of epipolar lines in the two images, together with their correspondence, is computed from the given fundamental matrix.

3.1 A temporal signal

To define the temporal signals, we use the epipolar geometry of a pair of images. Given the fundamental matrix F for a pair of images, a set of epipolar lines L = {ℓ_r} and L′ = {ℓ′_r} and their correspondence, ℓ_r ↔ ℓ′_r, are computed using the given fundamental matrix (Hartley and Zisserman [9]).



Figure 3.1: The calculation of motion indicators for two video sequences: (a) the background frames selected for each of the video sequences; (b) the currently analyzed video frames; (c) the results of the background subtraction, with exemplar sets of corresponding epipolar lines drawn. The motion indicators are computed by accumulating the data along each epipolar line, as described in Equation 3.1.

The correspondence of a given point p̂ ∈ ℓ_r is constrained to lie on the epipolar line ℓ̂′_r = F p̂ in its synchronized frame (the points and the lines are given in homogeneous coordinates). Traditionally, this property is used for constraining the correspondence search in stereo or motion algorithms. Pixel correspondence is not guaranteed to remain the same over time due to 3D motion. However, two corresponding pixels are guaranteed to remain on two corresponding epipolar lines in both sequences. (The only possible exception is a major occlusion in one of the views.) Using this observation, we define the signals on the entire epipolar line, avoiding not only the problem caused by the change of pixel correspondence over time but also the general challenge of computing spatial correspondence between frames.
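The line-to-line correspondence used throughout this chapter follows directly from the epipolar constraint. As a concrete illustration, the following minimal NumPy sketch computes the epipolar line ℓ′ = F p̂ in the second image for a point p̂ in the first image and rasterizes the pixels lying on it. The fundamental matrix, the image point, and the helper names are hypothetical and serve only to illustrate the geometry; the thesis implementation itself was written in Matlab.

import numpy as np

def epipolar_line(F, p):
    # Epipolar line l' = F p in the second image for a homogeneous point p in
    # the first image, normalized so that (a, b) is a unit normal.
    l = F @ p
    return l / np.linalg.norm(l[:2])

def line_pixels(line, width, height):
    # Integer pixel coordinates lying on the line a*x + b*y + c = 0 inside a
    # width x height image. A simple rasterization that steps along x and
    # solves for y; a full implementation would also handle near-vertical lines.
    a, b, c = line
    cols = np.arange(width)
    rows = np.round(-(a * cols + c) / b).astype(int)
    keep = (rows >= 0) & (rows < height)
    return rows[keep], cols[keep]

# Hypothetical fundamental matrix and image point, for illustration only.
F = np.array([[0.0, -1.0e-4, 1.0e-2],
              [1.0e-4, 0.0, -3.0e-2],
              [-1.0e-2, 3.0e-2, 1.0]])
p_hat = np.array([320.0, 240.0, 1.0])     # homogeneous point in the first image
l_prime = epipolar_line(F, p_hat)         # corresponding epipolar line in the second image
rows, cols = line_pixels(l_prime, width=640, height=480)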

A background subtraction algorithm is used for defining the temporal signal of each sequence. The base of the motion signal is the Euclidean distance between the data frame and the selected background frame for each pixel. For each epipolar line, a motion indicator is taken to be the sum of these distances over the line's pixels, as demonstrated in Figure 3.1. The temporal signal of an epipolar line, the line signal, is defined to be the set of motion indicators on an epipolar line as a function of time. Formally, let I(p, t) and B(p, t) be the intensity values of a pixel p ∈ ℓ_r in a video frame and in the corresponding background frame, respectively (B(p, t) is a function of t because, in the general case, an adaptive background subtraction can be used). The line signal of that epipolar line, S_r(t), is defined to be the distance between the two vectors:

S_r(t) = \sum_{p \in \ell_r} \lVert I(p, t) - B(p, t) \rVert.   (3.1)

The collection of line signals for all the epipolar lines in a video is the temporal signal of the video sequence. The temporal signals of the two considered sequences are represented by matrices S and S′ (Figure 3.2), where each row r of such a matrix consists of a line signal, S_r. That is, S_{r,t} is the motion indicator of an epipolar line ℓ_r at a time step t. Only a few dozen epipolar lines from each frame, a few pixels apart, are considered.
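To make the construction of the line signals concrete, the following sketch evaluates Equation 3.1 for one frame with plain NumPy. It assumes that the pixel coordinates of each selected epipolar line have already been extracted from the known epipolar geometry (e.g., as in the previous snippet) and that a single static background frame is used, matching the naive background subtraction of the experiments in Chapter 4; the function and variable names are illustrative and not taken from the thesis implementation.

import numpy as np

def motion_indicators(frame, background, lines_pixels):
    # Motion indicators S_r(t) of Equation 3.1 for a single frame: the per-pixel
    # Euclidean distance between the frame and the background, summed along
    # every selected epipolar line.
    #   frame, background: float arrays of shape (H, W, C), or (H, W) for gray-level
    #   lines_pixels: list of (rows, cols) index arrays, one entry per epipolar line
    diff = frame.astype(np.float64) - background.astype(np.float64)
    if diff.ndim == 2:                           # gray-level: treat as one channel
        diff = diff[..., None]
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # Euclidean distance per pixel
    return np.array([dist[rows, cols].sum() for rows, cols in lines_pixels])

# Building the temporal signal S (one row per epipolar line) over a toy sequence:
rng = np.random.default_rng(0)
H, W, T = 48, 64, 10
background = rng.random((H, W, 3))
lines_pixels = [(np.full(W, r), np.arange(W)) for r in range(0, H, 4)]  # toy "epipolar lines"
S = np.empty((len(lines_pixels), T))
for t in range(T):
    frame = background + 0.1 * rng.random((H, W, 3))    # synthetic frame with small changes
    S[:, t] = motion_indicators(frame, background, lines_pixels)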

3.2 Signal Registration

In this section we present the time registration of a given pair of temporal signals of the video sequences. For robust results, and in order to combine information from different line signals, the matching is determined using a probabilistic framework, utilizing a maximum a posteriori estimation. The time shift is detected by finding a maximum likelihood value for the two signals, with different time shifts applied to the second signal. A sliding window in a predefined range is used to determine ∆t.

Let S and S′ be a pair of line signals, extracted from corresponding epipolar lines in the two video sequences. At this stage, assume a single consistent time shift between the two sequences and no frame drops in either of them. We begin by considering the probability distribution of a time shift ∆t of S′ to match S. Applying Bayes' law we obtain:

P(\Delta t \mid S, S') = \frac{P(S, S' \mid \Delta t)\, P(\Delta t)}{P(S, S')}.   (3.2)




Figure 3.2: (a), (b) Examples of temporal signals S and S′ of two sequences, containing 130 epipolar lines over a time period of 8 seconds (200 frames). Each pixel in the signal is the motion indicator of an epipolar line at a time point. (c) The matching result for those signals, with a high-confidence peak at the correct time shift of ∆t = −1 frames.

The denominator term, P(S, S′), is an a priori joint probability distribution of S and S′. A uniform distribution of this probability is assumed, hence it is taken to be a constant normalization factor. In general, prior knowledge about the time and space distribution of objects in the scene, or about the overlapping regions of the two sequences, could be used for computing this prior. Extracting such knowledge is beyond the scope of this thesis. The term P(∆t) is another prior, in this case on the probability distribution of ∆t. Use of this prior is discussed in the experimental part (see Section 4.3).

For estimating the likelihood term, P(S, S′ | ∆t), we apply a simple stochastic model to the temporal signals. A commonly used assumption of additive white Gaussian noise is applied to the two line signals:

S(t) = S'(t + \Delta t) + N(\mu, \sigma^2),   (3.3)

where ∆t is the correct time shift between the two signals, and µ is the difference between the averages of the two signals. For simplicity, in the rest of the thesis we omit the value µ: from here on, each line signal S participates in the computation after its average has been subtracted. Using this model assumption, the likelihood of two line signals, given ∆t, is obtained by:


L(S, S', \Delta t) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( - \frac{\sum_t \left( S(t) - S'(t + \Delta t) \right)^2}{2\sigma^2} \right).   (3.4)

In reality, the relation between the signals is more complicated than this simple presentation. Differences in the cameras' photometric parameters and the foreshortening effects caused by perspective projection may result in some gain effect between these signals. Another simplification is a hidden assumption of independence between the motion indicators in a single line signal. Adjacent indicators are expected to be correlated to some degree, because the objects captured in the video have finite speed relative to the sampling frame rate. Despite these simplifications, the obtained results are satisfactory, as demonstrated in our experiments.

Note that the maximal value of P(∆t | S, S′) and the maximal value of P(S, S′ | ∆t)P(∆t) will be obtained for the same value of ∆t. Hence the desired value of the time shift is computed from:

\arg\max_{\Delta t} P(\Delta t \mid S, S') = \arg\max_{\Delta t} P(\Delta t)\, \frac{1}{\sigma \sqrt{2\pi}} \exp\left( - \frac{\sum_t \left( S(t) - S'(t + \Delta t) \right)^2}{2\sigma^2} \right).   (3.5)

As defined above, each row in S and S′ represents a line signal for an epipolar line ℓ_r ∈ L. We consider those signals to be independent, due to the spatial distance between the selected epipolar lines. Therefore, the likelihood computation can be extended to the sequence signals S and S′ by taking the product of the likelihoods of all the line signals, as shown in Equation 3.6. Up to this point, we have assumed a single consistent time shift between S and S′. In order to incorporate the computation into an online framework, the algorithm must work on a finite time interval at each iteration. Thus, the synchronization at a given time step, t₀, is determined only from an interval of length k of the sequence signal, taken from t₀ − k up to t₀. Furthermore, the sought ∆t is bounded by some finite range −c ≤ ∆t ≤ c. (In our experiments, k corresponds to roughly 4 to 8 seconds and c corresponds to 1 to 3 seconds.) Inserting all of the above into Equations 3.2 and 3.5, we obtain:

\arg\max_{\Delta t} P(\Delta t \mid S, S') = \arg\max_{\Delta t} P(\Delta t) \prod_{\ell_r \in \hat{L}(t)} P(\Delta t \mid S_r, S'_r)   (3.6)

= \arg\max_{-c \le \Delta t \le c} P(\Delta t) \exp\left( - \sum_{\ell_r \in \hat{L}(t)} \sum_{t = t_0 - k}^{t_0} \frac{\left( S_{r,t} - S'_{r,t+\Delta t} \right)^2}{2\sigma^2} \right),

where L̂ ⊆ L is the subset of epipolar lines participating in the computation (defined in Section 3.3),

and S_r and S′_r are the signals of corresponding epipolar lines ℓ_r. The time shift ∆t that yields the maximal likelihood according to Equation 3.6 is the desired time shift for the two given video sequences (see Figure 3.2(c)). The actual value of the likelihood is used as a confidence level of the resulting ∆t. This value is taken after a normalization step, which ensures that the probability distribution of ∆t in the range −c ≤ ∆t ≤ c sums up to one. The higher the probability, the more robust the answer. In the online synchronization framework, only the high-confidence results are taken into account.

In the calculations presented in this section, we have performed two important normalization steps. The first normalization follows Equation 3.3: the average subtraction from each line signal S.

This step effectively distinguishes our method from the whole-frame statistics (see Section 4.6).

Without this normalization, the following calculations would amount to summing data from all the epipolar lines, with no individual operations on each of them. The second normalization step, described in the previous paragraph, ensures that the probability distribution of ∆t in the range −c ≤ ∆t ≤ c sums up to one. Because we omit the a priori probability distribution terms, the final value of the resulting probability needs to be normalized. Since we only search for ∆t in a finite range, we assume the probability density outside this range is zero. This normalization makes it possible to combine the probability density functions of different temporal signals, thus allowing comparison of different confidence levels, as well as future developments that combine information from more than two video streams.
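The registration step of Equation 3.6 reduces to accumulating squared differences between corresponding, mean-subtracted line signals for every candidate shift and normalizing the result into a distribution over ∆t. The sketch below is one possible NumPy implementation under those assumptions; the overlap handling at the interval borders, the log-domain normalization (the constant 1/(σ√(2π)) cancels when normalizing), and all names are illustrative rather than taken from the thesis implementation.

import numpy as np

def register_signals(S, S_prime, c, sigma, prior=None):
    # Estimate the time shift between two sequence signals (Equation 3.6).
    #   S, S_prime: arrays of shape (num_lines, k) holding mean-subtracted line
    #               signals over the current interval; rows are corresponding
    #               epipolar lines (after the filtering of Section 3.3).
    #   c:          maximal absolute shift, in frames, to consider.
    #   sigma:      noise scale of the Gaussian model.
    #   prior:      optional array of length 2*c + 1 with P(dt) for dt = -c..c.
    # Returns (best_dt, confidence, posterior), where the posterior sums to one
    # over the candidate shifts and confidence is the posterior of best_dt.
    k = S.shape[1]
    shifts = np.arange(-c, c + 1)
    log_post = np.zeros(len(shifts))
    for i, dt in enumerate(shifts):
        # Compare S(t) with S'(t + dt) on the overlapping part of the interval.
        if dt >= 0:
            a, b = S[:, :k - dt], S_prime[:, dt:]
        else:
            a, b = S[:, -dt:], S_prime[:, :k + dt]
        log_post[i] = -((a - b) ** 2).sum() / (2.0 * sigma ** 2)
        if prior is not None:
            log_post[i] += np.log(prior[i])
    # Normalize so the distribution over -c..c sums to one (log domain for safety).
    posterior = np.exp(log_post - log_post.max())
    posterior /= posterior.sum()
    best = int(np.argmax(posterior))
    return int(shifts[best]), float(posterior[best]), posterior

# Example: build a toy signal, shift it by 3 frames, and recover the shift.
rng = np.random.default_rng(1)
base = rng.random((30, 200))
S = base[:, 10:150]
S_prime = base[:, 13:153] + 0.01 * rng.standard_normal((30, 140))
S = S - S.mean(axis=1, keepdims=True)
S_prime = S_prime - S_prime.mean(axis=1, keepdims=True)
print(register_signals(S, S_prime, c=30, sigma=1.0)[0])   # recovers dt = -3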


3.3 Epipolar line filtering

Registration of only a subset of the line signals is sufficient for synchronization. Moreover, line signals that contain negligible motion information may introduce noise into the registration process, and are therefore removed from the computation. We next define the subset of epipolar lines L̂ ⊆ L that participate in the computation for a given time step t. The signals are removed on the basis of both sequences. We test for motion information only at a single time step. We do so by computing the temporal gradient along an epipolar line, taking into consideration some noise estimation of such a gradient. The noise at each image pixel is assumed to be additive white Gaussian noise with some variance σ_m². Hence, we determine that there is significant motion on the epipolar line ℓ_r only if the residual information in the time gradient along the epipolar line goes beyond the estimated noise threshold. In the case of no real motion, this time gradient yields only noise. Formally, the motion probability at a given time t, for an epipolar line ℓ_r, is given by:

P_{\mathrm{motion}}(\ell_r, t) = \frac{1}{\sigma_m \sqrt{2\pi}} \exp\left( - \frac{\sum_{p \in \ell_r} \left( I(p, t) - I(p, t-1) \right)^2}{2\sigma_m^2} \right).   (3.7)

The subset L̂ consists only of epipolar lines with motion probability over some threshold.

This simple filtering process compensates for errors of the background subtraction algorithms and eliminates any wrongly detected residual motion caused by them.
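As one possible reading of this filter, the sketch below evaluates the noise-only likelihood of Equation 3.7 along each line and keeps the lines whose temporal gradient is unlikely to be explained by pixel noise alone; here the "motion probability" is taken as the complement of that likelihood, so that larger values indicate more motion and the over-a-threshold rule above applies directly. The constant factor 1/(σ_m√(2π)) is dropped since it only rescales the threshold, and the parameter values are illustrative.

import numpy as np

def motion_probability(frame_t, frame_prev, rows, cols, sigma_m):
    # Complement of the noise-only likelihood of Equation 3.7 for one epipolar
    # line: values near 1 indicate a temporal gradient well beyond the assumed
    # additive Gaussian pixel noise, i.e. significant motion on the line.
    grad = frame_t[rows, cols].astype(np.float64) - frame_prev[rows, cols].astype(np.float64)
    noise_only = np.exp(-(grad ** 2).sum() / (2.0 * sigma_m ** 2))
    return 1.0 - float(noise_only)

def select_lines(frame_t, frame_prev, lines_pixels, sigma_m=5.0, threshold=0.9):
    # Indices of the epipolar lines forming L_hat at the current time step:
    # only lines whose motion probability exceeds the threshold are kept.
    # sigma_m and threshold are illustrative values that would be tuned per camera.
    return [r for r, (rows, cols) in enumerate(lines_pixels)
            if motion_probability(frame_t, frame_prev, rows, cols, sigma_m) > threshold]

# Example with two synthetic gray-level frames:
rng = np.random.default_rng(2)
prev = rng.random((48, 64)) * 10
curr = prev.copy()
curr[20:24, :] += 50.0                                   # strong change on a few rows
lines_pixels = [(np.full(64, r), np.arange(64)) for r in range(0, 48, 4)]
print(select_lines(curr, prev, lines_pixels))            # [5]: only the line crossing the changed rows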

3.4 Algorithm flow

Algorithm 3.1 Temporal signal update
The algorithm is triggered for every new frame acquired.
Input: two new frames from the video sequences.

1. Perform background subtraction.

2. For each epipolar line ℓ_r: calculate the motion indicator (Equation 3.1).

3. Update the matrices S and S′.

Algorithm 3.2 Synchronization iteration
The algorithm is triggered every 0.8 seconds.
Input: two temporal signals S and S′.

1. Extract the data corresponding to the time interval k from S and S′.

2. Compute L̂ by filtering ℓ_r ∈ L for the current time step (Section 3.3).

3. For each ℓ_r ∈ L̂, subtract its average µ_r.

4. Compute the likelihood for each −c ≤ ∆t ≤ c using Equation 3.6.

5. Apply the prior for P(∆t).

6. Normalize the resulting probability distribution such that it sums up to 1.

7. Find the maximal value of the probability and the corresponding best time shift.

Chapter 4

Experiments

We conducted a number of experiments to test the effectiveness of our method. The input for each is a pair of video sequences taken with the same frame rate. In addition, a fundamental matrix (computed manually) and a rough synchronization (up to an error of 50 frames) are assumed to be given. The method was implemented in Matlab. The corresponding epipolar lines of each pair of sequences were computed using a standard rectification method. A naive background subtraction was used where the background consists of an empty frame, subtracted from all the other frames in the video stream.

Four sets of video sequences were taken, as shown in Figures 4.1, 4.2, 4.3 and 4.4. In the indoor video sequences a flicker effect is evident, caused by fluorescent lighting in the scene. In order to avoid distractions to the synchronization algorithm, the flicker was removed by temporal low-pass filtering of the video. The framework triggers the synchronization computation every 0.8 seconds of the video.

Set 1: In Set 1 an indoor scenario was acquired, in which a dense crowd – around 30 people – walks about. The cameras were placed at an elevation of approximately 6 meters. The cameras' fields of view have a relatively large overlap. The videos were recorded at 25 fps, with a frame size of 640 × 480. The total runtime of Set 1 is 5000 frames (3.33 minutes).

Figure 4.1: Set 1: The two rows show four successive frames from the first and second views, respectively; the first column contains an exemplary subset of the epipolar lines used for this set.

Set 2: Set 2 is similar to Set 1, but the cameras' fields of view only partially overlap. This set represents a difficult case in terms of viewing angles, since there is a big difference between the viewpoints of the two cameras. The total runtime of Set 2 is 5000 frames (3.33 minutes).

Set 3: Set 3 is of a relatively dark outdoor scene with only a few people walking around. This case represents another difficult scenario, with a small amount of motion in dark conditions. A pair of cameras was located at an elevation of about 6 meters; the videos were recorded at 15 fps, with a frame size of 640 × 512 pixels. The total runtime of Set 3 is 1125 frames (1.25 minutes).

Set 4: To verify that our method works properly on other video sequences used in the literature, we performed the synchronization of a pair of short videos used in the work of Caspi and Irani [3]. The sequences contain a single car moving in a parking lot¹, as demonstrated in Figure 4.4. The videos were recorded at 25 fps, with a frame size of 512 × 288 pixels. The total runtime of Set 4 is 520 frames (20.8 seconds).

¹The video sequences are taken from http://www.wisdom.weizmann.ac.il/~vision/VideoAnalysis/Demos/Seq2Seq/Seq2Seq.html


Figure 4.2: Set 2: The two rows show four successive frames from the first and second views, respectively; the first column contains an exemplary subset of the epipolar lines used for this set.

4.1 Basic results

The presented tests were performed on the four sets. In Sets 1-3, the interval size was taken to be k = 140, and no prior on P(∆t) was used (i.e., a uniform distribution is assumed for P(∆t)). The value of σ for Set 1 and Set 2 was set to 1300, and for Set 3 to 600. For Set 4, the following parameters were used: σ = 400 and k = 80. We discuss the parameter settings in Section 4.4. The results consist of a set of time shifts between the two video streams, with a probability (confidence) value for each shift. Each of the time shifts for Set 1, Set 2, Set 3 and Set 4 is represented by a single dot in Figure 4.5(a), Figure 4.5(b), Figure 4.6(a) and Figure 4.6(b), respectively. The x-axis is the computed time shift and the y-axis is the confidence in the computed result. Ideally, we would like the dots to align along the correct time shift and to have high confidence. The correct time shift, computed by hand, is ∆t = −1 frames for all sets.

To evaluate the percentage of correct results, it is necessary to set a threshold on the confidence value. The threshold 0.7 is considered in the analysis of the four data sets. A result is considered to be correct if it is in the range of ±1 frames from the correct synchronization.

Figure 4.3: Set 3: The two rows show four successive frames from the first and second views, respectively; the first column contains an exemplary subset of the epipolar lines used for this set.

Using this threshold on Set 1, approximately 50% of the obtained results have high levels of confidence, and 95% of them are correct. That is, the system yields, on average, a high-confidence result every 1.6 seconds.

The percentage of correct high-confidence results obtained for Set 2 is 100%. However, only 12% of the obtained results had high confidence (> 0.7). This is mostly due to the relatively small overlapping field of view of the two cameras, resulting in a small number of epipolar lines that can participate in the registration. As the working area is small, the algorithm analyzes long time periods without motion, which yield low-confidence results.

For Set 3, the percentage of correct results is 100%, with only 13% of the results having high confidence. In addition, the low-confidence results contain a relatively large number of errors. This is due to the small number of moving objects in the scene and to objects moving along the direction of epipolar lines. Note that movement along an epipolar line is not expected to produce good synchronization, since it induces ambiguities, as discussed in Chapter 5. The effect of a non-uniform prior on P(∆t) when incorporated into this set is discussed in Section 4.3.

Figure 4.4: Set 4: The two rows show four successive frames from the first and second views, respectively; the first column contains an exemplary subset of the epipolar lines used for this set.

To summarize, our method constantly and reliably maintains the time synchronization between two sequences. It is important to note that tracking objects or features in the crowded scenes of Set 1 and Set 2 from a single camera is considered to be an extremely difficult task due to substantial movement and a large number of occlusions. Hence, synchronization studies that rely on trajectories detected by each of the cameras (e.g., Whitehead et al. [19], Caspi et al. [4]) are not adequate in this case. Furthermore, the scene consists of a genuine 3D structure and the distance between the cameras is non-negligible. Hence, a homography transformation of the pair of sequences cannot be used to match pixels or trajectories (as used by Caspi and Irani [3]). Finally, general statistics of spatio-temporal features over the whole frame (proposed by Yan and Pollefeys [21]) cannot be applied, since the cameras' fields of view are not fully overlapping. In addition, there is a large number of objects moving simultaneously in the frame, making it impossible to rely on whole-frame statistics.

4.2 Frame dropping

Frame dropping is expected in a simple commercial system when it operates over a long period of time. The need to detect frame dropping and resynchronize is one of the main motivations for an online synchronization algorithm. To test the robustness of our method in the presence of frame dropping, we applied our algorithm to Set 1, where 3 frame drops occurred during the video. That is, the correct time shift changed from −1 to 16, then to −8 and finally back to −1. The rest of the experiment setup was identical to the basic one. The results are presented in Figure 4.8(b), where the detected time shift is plotted as a function of time. The result demonstrates that the correct time shift is detected, and that the reaction time to a drop is approximately 7-8 seconds. This reaction time is due to the interval of 140 frames, which, in addition to the search range c = 30, corresponds to 8 seconds. During this time period the two registered temporal signals contain inconsistent information with a frame drop in it. Hence, the results are incorrect and have low confidence.

Figure 4.5: Basic results, Sets 1,2: Each of the 250 computed time shifts for (a) Set 1 and (b) Set 2, one every 0.8 seconds, is represented by a single dot. Each dot in the graph represents the computed time shift for a single time step. Low-confidence results are marked in blue. Correct and incorrect high-confidence results are marked in green and red, respectively. The x-axis is the computed time shift and the y-axis is the confidence in the computed result.

4.3 Using a prior on P (∆t)

In an online framework, a non-uniform probability distribution on ∆t can be applied, using the result of the previous synchronization iteration. It is assumed that the time synchronization rarely changes during the video, and that the changes are of a few frames only (due to frame dropping). We tested our method using a Gaussian distribution for P(∆t), with σ = 2 and a mean set to the previously detected high-confidence time shift (starting with 0). Comparing the results with (Figure 4.7(a)) and without (Figure 4.7(b)) the use of the prior shows that the prior reduces the instability of the low-confidence results. We tested the effect of using a prior on Set 1 (with and without frame dropping) and on Set 2. In all these tests the results remained the same. Hence we can conclude that, on the one hand, the prior can reduce errors for unstable results and, on the other hand, it does not impair other results.

Figure 4.6: Basic results, Sets 3,4: (a) Synchronization results for Set 3: each of the 60 computed time shifts, one every 0.8 seconds, is represented by a single dot. (b) Synchronization results for Set 4: each of the 16 computed time shifts, one every 0.8 seconds, is represented by a single dot. The axes description and color code interpretation are as in Figure 4.5.
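To make the prior of this section concrete, the following small sketch builds the discrete Gaussian prior over the candidate shifts, centered on the previously detected high-confidence shift and normalized over the search range −c ≤ ∆t ≤ c; it could be supplied as the prior argument of a registration routine such as the sketch in Section 3.2. The default σ = 2 follows the description above; the function name and everything else are illustrative.

import numpy as np

def gaussian_prior(prev_dt, c, sigma=2.0):
    # Discrete Gaussian prior P(dt) over the candidate shifts dt = -c..c,
    # centered at the previously detected high-confidence time shift
    # (0 at the first iteration) and normalized to sum to one over the range.
    shifts = np.arange(-c, c + 1)
    p = np.exp(-((shifts - prev_dt) ** 2) / (2.0 * sigma ** 2))
    return p / p.sum()

# e.g. prior = gaussian_prior(prev_dt=-1, c=30), passed to the registration step.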

4.4 Setting the parameters

In addition to the confidence threshold, there are two more parameters that have to be set.

The time interval k controls the number of frames that participate in the signal registration procedure. Longer intervals lead to more robust results, especially for areas with limited motion. According to our tests, in a video pair with a lot of motion, an interval of k = 20 frames (0.8 seconds) is sufficient for obtaining robust synchronization results. However, for limited and sporadic motion, such an interval yields a somewhat noisy output; therefore k = 140 frames was used in all our experiments. The downside of large intervals is the increase in computation time and the slower reaction time in the presence of frame drops. The reaction time to such changes can, in the worst case, be as long as the interval time, as discussed in Section 4.2.

Figure 4.7: Using prior, Set 3: Each of the 60 computed time shifts, one every 0.8 seconds, is represented by a single dot computed for Set 3, (a) without a prior and (b) with a prior. The axes description and color code interpretation are as in Figure 4.5.

The other parameter is σ, a normalization factor in the probability calculations (Equations 3.3-3.6). In general, σ depends on the photometric parameters of the cameras, as well as on the epipolar geometry. In the experiments, the value of σ was set empirically. This factor affects the numerical outcome of the confidence for each time shift, as demonstrated in Figure 4.8(a). High values of σ suppress the confidence and hence flatten the probability distribution of P(∆t | S, S′), causing indecisiveness and noisy output. However, lower values of σ increase the confidence of all the measurements, and as a result, the confidence of incorrect time shifts increases as well. Thus, in order to preserve the correct output of the framework, the final confidence threshold must be selected in accordance with the value of σ.

4.5 Verification of Calibration

The main goal of our method was to compute synchronization between a pair of sequences, while the camera calibration (i.e., the epipolar geometry) is assumed to be given to the system.

Incorrect epipolar geometry causes the motion indicators on corresponding epipolar lines to be uncorrelated. In particular, the confidence of all the possible synchronization results is expected to be low. An experiment demonstrating this observation was conducted, simulating a scenario of a small tilt in one of the cameras. The tilt effectively moves the entire set of epipolar lines in the video along the Y-axis, breaking the correspondence between the epipolar lines of the two video sequences. This leads to a total synchronization failure, as no high-confidence time shift is detected. The result of this experiment is demonstrated in Figure 4.9(a). Consequently, it is impossible to use our method when the system is out of calibration. Yet, this property of our method can be used to verify calibration, i.e., to distinguish between correct and incorrect calibration of the cameras. Although it cannot be used in a straightforward manner for camera calibration, because the search space for a fundamental matrix is too large, it does serve as an essential first step towards recalibration, following calibration failure.

Figure 4.8: Confidence and frame drops: (a) A graph showing the success rate as a function of the confidence threshold for three different values of σ (see Equation 3.6). The blue, red and green lines represent σ = 1600, 1300, and 1000, respectively. (b) Frame dropping example, with a drop reaction time of 7 seconds. The green and the red dots represent high and low confidence, respectively. The vertical blue lines indicate the times at which the frame drops occurred. The black line is the correct time shift.

4.6 Temporal signal of an entire frame

We discussed in the introduction and the method chapters why we chose to use epipolar line signals rather than point signals. Here we challenge our choice to use epipolar line signals rather than a similar signal defined by a motion indicator based on the entire frame (similar to the approach taken by Yan and Pollefeys [21]). When the temporal signal is defined on the entire frame, any spatial correspondence between motion indicators is neglected. We modified our method to sum the motion indicators over the entire frame in order to obtain the motion signal. As expected, the obtained result cannot be used for sequence synchronization. Such an approach fails in the presence of complex motion in the scene. The result of this experiment is presented in Figure 4.9(b).

Figure 4.9: Incorrect synchronization: (a) Incorrect calibration of the cameras can be detected by the demonstrated failure to detect synchronization (Section 4.5). (b) The failure to register the sequence signals of Set 1 based on the entire-frame motion indicator (Section 4.6). Comparing this result with Figure 4.5(a) demonstrates the benefit of using the epipolar line signals proposed by our method. The axes description and color code interpretation are as in Figure 4.5.

Chapter 5

Conclusion

Synchronization of two video sequences is a vital step in any automatic video analysis in a multi-camera system. Furthermore, maintaining such synchronization during the video runtime is also essential.

We presented a novel method for synchronizing a pair of sequences using only motion signals of corresponding epipolar lines. Our method is suitable for detecting and correcting frame dropping. Its simplicity is in bypassing the computation of spatial correspondence between features, tracked trajectories or image points, which may be hard to compute in the scenes considered in our experiments. The only spatial correspondence required is between epipolar lines, which are computed directly from a fundamental matrix of the video pairs.

Such a fundamental matrix can be computed using different methods known in the literature, for instance, using the 8-point method (as described by Hartley and Zisserman [9]) on static features extracted from the videos. The relatively low computational effort of our algorithm will allow it to be incorporated into real-time systems, after a future optimization cycle.

Furthermore, the method detects synchronization errors (e.g., frame drops) in a matter of seconds, as they occur in the video. Thus, it can be used in an online framework. Finally, the method can be used for detecting calibration failures, as a basis for a recalibration process.

It is worth noting that our method may fail in rare cases, such as when an object moves strictly along an epipolar line and no other information is available. In this case the temporal matching is expected to yield the same probability for all time shifts. In addition, if the object also moves across non-overlapping regions of the cameras, an incorrect synchronization is expected. In order to overcome such problems, a method for detecting the overlapping regions of the cameras can be utilized, e.g. the method suggested by Mandel et al. [11]. Another way to resolve this problem is by considering a system with more than two cameras. Other pairs of video sequences, with different sets of epipolar lines, can resolve this ambiguity.

Future work

We intend to study the extension of the proposed approach to handle more than two sequences. Such an extension will make the synchronization more robust, as it will make it possible to solve cases with limited motion by relying on information from different camera pairs. This extension is expected to be natural due to the probabilistic properties of the algorithm. It can be achieved by combining the probability distributions of time shifts for different sequence pairs, calculated from the registration of their temporal signals. At the next stage, the best possible configuration of the time shifts for the whole camera network has to be found.

In the future, we intend to study the possibility of developing an automatic procedure for setting the algorithm's parameters. The value of σ (see Equation 3.6) can be evaluated from noise estimations on the original data, and the time interval k (see Equation 3.6) can be set by analyzing the events occurring in the video.

In addition, some research effort should be invested in detecting the overlapping areas along the epipolar lines. Such knowledge can be used to eliminate the rare failure cases in which motion along epipolar lines occurs (as discussed above).

The usage of the a priori joint probability distribution of the line signals can improve the synchronization calculation (see Section 3.2). It is worth exploring how such knowledge can be automatically obtained.

Our algorithm is intended to work in real-time video systems. In order to incorporate it into a real-time framework, an optimization cycle is required. We intend to implement this algorithm in an efficient and optimized manner, using the C++ programming language with optimized image processing libraries, combined with different acceleration methods, such as intrinsic functions and GPU support.

Another area for future research is the recalibration of a camera pair. As our method is capable of detecting incorrect or compromised calibration, it can be used to search for the correct calibration. Applying a naive exhaustive search is problematic due to the time complexity it demands. Therefore, the main challenge of such an approach is finding a way to perform effective calibration detection.

Bibliography

[1] T. Basha, Y. Moses, and N. Kiryati. Multi-view scene flow estimation: A view centered variational approach. In Proc. IEEE Conf. Comp. Vision Patt. Recog., 2010.

[2] R. Carceroni, F. Padua, G. Santos, and K. Kutulakos. Linear sequence-to-sequence alignment. In Proc. IEEE Conf. Comp. Vision Patt. Recog., volume 1, 2004.

[3] Y. Caspi and M. Irani. Spatio-temporal alignment of sequences. IEEE Trans. Patt. Anal. Mach. Intell., pages 1409–1424, 2002.

[4] Y. Caspi, D. Simakov, and M. Irani. Feature-based sequence-to-sequence matching. Int. J. of Comp. Vision, 68(1):53–64, 2006.

[5] E. Dexter, P. Pérez, and I. Laptev. Multi-view synchronization of human actions and dynamic scenes. In Proc. British Machine Vision Conference, 2009.

[6] R. Eshel and Y. Moses. Homography based multiple camera detection and tracking of people in a dense crowd. In CVPR, pages 1–8, 2008.

[7] Y. Furukawa and J. Ponce. Dense 3D motion capture from synchronized video streams. In Proc. IEEE Conf. Comp. Vision Patt. Recog., pages 1–8, 2008.

[8] D.M. Gavrila and L.S. Davis. 3-D model-based tracking of humans in action: a multi-view approach. In CVPR, pages 73–80, 1996.

[9] R. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2000.

[10] C. Lei and Y. Yang. Tri-focal tensor-based multiple video synchronization with subframe optimization. IEEE Transactions on Image Processing, 15(9), 2006.

[11] Z. Mandel, I. Shimshoni, and D. Keren. Multi-camera topology recovery from coherent motion. In ACM/IEEE International Conference on Distributed Smart Cameras, pages 243–250, 2007.

[12] J.P. Pons, R. Keriven, and O. Faugeras. Multi-view stereo reconstruction and scene flow estimation with a global image-based matching score. Int. J. of Comp. Vision, 72(2):179–193, 2007.

[13] J. Serrat, F. Diego, F. Lumbreras, and J.M. Álvarez. Synchronization of video sequences from free-moving cameras. In Proc. Iberian Conf. on Pattern Recognition and Image Analysis, page 627, 2007.

[14] M. Singh, A. Basu, and M. Mandal. Event dynamics based temporal registration. IEEE Transactions on Multimedia, 9(5), 2007.

[15] L. Spencer and M. Shah. Temporal synchronization from camera motion. In Proc. Asian Conf. Comp. Vision, 2004.

[16] P. Tresadern and I. Reid. Synchronizing image sequences of non-rigid objects. In Proc. British Machine Vision Conference, volume 2, pages 629–638, 2003.

[17] T. Tuytelaars and L. Van Gool. Synchronizing video sequences. In Proc. IEEE Conf. Comp. Vision Patt. Recog., volume 1, 2004.

[18] D. Weinland, R. Ronfard, and E. Boyer. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257, 2006.

[19] A. Whitehead, R. Laganiere, and P. Bose. Temporal synchronization of video sequences in theory and in practice. In IEEE Workshop on Motion and Video Computing, 2005.

[20] L. Wolf and A. Zomet. Correspondence-free synchronization and reconstruction in a non-rigid scene. In Proc. Workshop Vision and Modeling of Dynamic Scenes, 2002.

[21] J. Yan and M. Pollefeys. Video synchronization via space-time interest point distribution. In Proc. Advanced Concepts for Intelligent Vision Systems, 2004.
