Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and Motion-Based Detection and Tracking

Ming Ye

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2002

Program Authorized to Offer Degree: Electrical Engineering

University of Washington

Abstract

Robust Visual Motion Analysis: Piecewise-Smooth Optical Flow and Motion-Based Detection and Tracking

by Ming Ye

Co-Chairs of Supervisory Committee:

Professor Robert M. Haralick, Electrical Engineering

Professor Linda G. Shapiro, Computer Science and Engineering

This thesis describes new approaches to optical flow estimation and motion-based detection and tracking. Statistical methods, particularly outlier rejection, error analysis and Bayesian inference, are extensively exploited in our study and are shown to be crucial to the robust analysis of visual motion.

To recover optical flow, or 2D velocity fields, from image sequences, certain models of brightness conservation and flow smoothness must be assumed. How to cope with violations of these models, especially motion discontinuities, thus becomes a very challenging issue. We first tackle this problem with a local approach, that is, finding the most representative flow vector for each small image region. We recast the popular gradient-based method as a two-stage regression problem and apply adaptive robust estimators to both stages. The estimators are adaptive in the sense that their complexity increases with the amount of outlier contamination. Because of its limited contextual information, the local approach has spatially varying uncertainty, which we evaluate systematically through covariance propagation.

Pointing out the limitations of local and gradient-based methods, we further propose a matching-based global optimization technique. The optimal estimate is formulated as maximizing the a posteriori probability of the optical flow given three image frames. Using a Markov random field flow model and robust statistics, the formulation reduces to minimizing a regularization type of global energy function, which we carefully design to accommodate outliers, occlusions and local adaptivity. Minimizing the resulting large-scale nonconvex function is nontrivial and is often the performance bottleneck of previous global techniques. To overcome this problem, we develop a three-step graduated solution method which inherits the advantages of various popular approaches while avoiding their drawbacks. The technique is highly efficient and accurate; its performance is demonstrated through experiments on both synthetic and real data and through comparison with competing techniques.

By making only weak assumptions of spatiotemporal continuity, the two proposed techniques are applicable to general scenarios, for example, to both rigid and nonrigid motion. They serve as a foundation for object-based motion analysis, and many of their conclusions extend to other visual surface reconstruction problems such as image restoration and stereo matching.

The last part of the thesis describes a motion-based detection and tracking system designed for an airborne visual surveillance application, in which challenges arise from the small target size (1 × 2 to 3 × 3 pixels), low image quality, substantial camera wobbling and heavy background clutter. The system is composed of a detector and a tracker. The former identifies suspicious objects by the statistical difference between their motion and the background motion; the latter employs a Kalman filter to track the dynamic behavior of objects in order to detect real targets and update their states. Both components operate in a Bayesian mode, and each benefits from the other's accuracy. The system exhibits excellent performance in experiments: in an 1800-frame real video, it produces no false detections and tracks the true target from the second frame on, with average position error below 1 pixel. This probabilistic approach reduces parameter tuning to a minimum and facilitates data fusion from different information channels.

TABLE OF CONTENTS

List of Figures

List of Tables

Chapter 1: Introduction
1.1 Optical Flow Estimation
1.2 A Local Method with Error Analysis
1.3 A Global Optimization Method
1.4 Motion-Based Target Detection and Tracking
1.5 Thesis Outline

Chapter 2: Estimating Optical Flow: Approaches and Issues
2.1 Brightness Conservation
2.2 Flow Field Coherence
2.3 Typical Approaches
2.4 Robust Methods
2.5 Error Analysis
2.6 Hierarchical Processing

Chapter 3: Local Flow Estimation and Error Analysis
3.1 A Two-Stage-Robust Adaptive Technique
3.1.1 Linear Regression and Robustness
3.1.2 Two-Stage Regression Model
3.1.3 Choosing Estimators
3.1.4 Experiments and Analysis

3.2 Adaptive High-Breakdown Robust Methods for Visual Reconstruction
3.2.1 The Approach
3.2.2 Experiments and Analysis
3.2.3 Discussion
3.3 Error Analysis on Robust Local Flow
3.3.1 Covariance Propagation
3.3.2 Experiments
3.3.3 Discussion

Chapter 4: Global Matching with Graduated Optimization
4.1 Formulation
4.1.1 MAP Estimation
4.1.2 MRF Prior Model
4.1.3 Likelihood Model: Robust Three-Frame Matching
4.1.4 Global Energy with Local Adaptivity
4.2 Optimization
4.2.1 Step I: Gradient-Based Local Regression
4.2.2 Step II: Gradient-Based Global Optimization
4.2.3 Step III: Global Matching
4.2.4 Overall Algorithm
4.3 Experiments
4.3.1 Quantitative Measures
4.3.2 TS: An Illustrative Example
4.3.3 Barron's Synthetic Data
4.3.4 Real Data
4.4 Conclusions and Discussion

Chapter 5: Motion-Based Detection and Tracking
5.1 Bayesian State Estimation

5.2 Kalman Filter
5.3 Tracking
5.4 Motion-Based Detection
5.5 Bayesian Detection
5.6 The Algorithm
5.7 Experiments
5.8 Discussion

Chapter 6: Conclusions
6.1 Summary and Contributions
6.2 Open Questions and Future Work

Bibliography

LIST OF FIGURES

1.1 Example optical flow on flower garden sequence
1.2 Motion estimation by template matching
1.3 Motion analysis for airborne video surveillance

2.1 Aperture problem
2.2 Hierarchical processing

3.1 Comparison of Geman-McClure norm and L2 norm
3.2 Block diagram of the two-stage-robust adaptive algorithm
3.3 Central frame of the synthetic sequence (5 frames, 32 × 32)
3.4 Correct flow field
3.5 OFC cluster plots at three typical pixels
3.6 TS sequence results
3.7 Pepsi sequence results
3.8 Pepsi: estimated flow fields
3.9 Random-sampling-based algorithm for high-breakdown robust estimators
3.10 Adaptive algorithm for high-breakdown robust estimators
3.11 TS: trial set size map
3.12 TS: correct and estimated flow fields
3.13 TT, DT middle frame
3.14 YOS middle frame
3.15 OTTE sequence
3.16 TAXI sequence results
3.17 TAXI: intensity images of x-component

3.18 TS motion boundary
3.19 TAXI motion boundary
3.20 TAXI: motion boundary on images subsampled by 2

4.1 Comparison of Geman-McClure norm and L2 norm
4.2 System diagram (operations at each pyramid level)
4.3 TS sequence results
4.4 Error cdf curves
4.5 DTTT sequence results (motion boundaries highlighted in (a))
4.6 Taxi results
4.7 Flower garden results
4.8 Traffic results
4.9 Pepsi can results

5.1 A typical detection-tracking system
5.2 Proposed Bayesian system
5.3 Example data sets
5.4 f16502 target pixel candidates
5.5 f18300 and f19000 target pixel candidates
5.6 Target pixels for f16502, f18300 and f19000
5.7 Detection results with and without priors on f16503
5.8 Two sample frames

LIST OF TABLES

3.1 Comparison of four popular regression criteria (estimators)
3.2 TS sequence: comparison of average error percentage
3.3 Quantitative comparison

4.1 Quantitative measures
4.2 Comparison of various techniques on Yosemite (cloud part excluded) with Barron's angular error measure

5.1 Quantitative measures in 1800 frames

ACKNOWLEDGMENTS

It is a great pleasure to express my gratitude to all those who have made this dissertation possible. First, I thank my co-advisor Prof. Robert Haralick, a man of wisdom and rigor, for his guidance and support during both my master's and doctoral study. I am deeply indebted to Prof. Linda Shapiro, who became my co-advisor in my final year and helped me through that critical period with constant support and encouragement. I would also like to thank the other members of my supervisory committee: Prof. Jenq-Neng Hwang, Prof. Qiang Ji, Prof. Werner Stuetzle, Prof. Ming-Ting Sun and Prof. David Thouless, who monitored my work and put in the effort to read earlier versions of this dissertation.

My former colleagues in the Intelligence Systems Laboratory: Dr. Qiang Ji, Dr. Gang Liu, Dr. Desikachari Nadadur, Dr. Selim Aksoy, Dr. Mingzhou Song, Dr. Jisheng Liang, Dr. Lei Sui and Dr. Yalin Wang, deserve many thanks for their friendship and help. I especially want to thank Dr. Qiang Ji and Dr. Gang Liu for pleasant and fruitful discussions and for brotherly advice that helped me stay encouraged and on the right track.

I am grateful to the Electrical Engineering Department for providing a good work environment during my final year. In particular, I must thank Helene Obradovich for her efforts to support me with a teaching assistantship, Frankye Jones for keeping an eye on my progress, and Sekar Thiagarajan and his team for the computing support.

I wish to express sincere appreciation to Dr. Marshall Bern and Dr. David Goldberg for giving me the opportunity to work at the Xerox Palo Alto Research Center

(PARC). Their advice, encouragement and friendship made my summer internship at PARC a very productive and enjoyable one. Last but certainly not least, I am forever indebted to the love and caring of my family and friends. Special thanks go to my dear husband Chengyang Li, who will receive his Ph.D. at about the same time, and to my dear parents and sister for supporting and encouraging me in pursuing my academic aspirations.


Chapter 1

INTRODUCTION

Visual motion is the 2D velocity field corresponding to the movement of brightness patterns in the image plane of a visual sensor. It usually arises from the relative motion between 3D objects and the observer, and it provides rich information about the surface structures of the objects and their dynamic behavior [58, 89]. Human beings rely on the skills of perceiving and understanding visual motion in order to move around, meet with people, watch movies and perform many other essential daily tasks. If we want computers to assist us and interact with us, we must endow them with a similar capability for analyzing visual motion, that is, accurately measuring and appropriately interpreting the 2D velocity present in digital images. This has turned out to be a highly complicated and error-prone process. The co-existence of profound significance and great challenge makes visual motion analysis a very important and active research area in computer vision.

Optical flow is a flexible representation of visual motion that is particularly suitable for computers analyzing digital images. It associates each image pixel (x, y) with a two-component vector u = (u(x, y), v(x, y))^T, indicating its apparent instantaneous 2D velocity. The optical flow representation is adopted throughout this thesis, and henceforth we use the terms "visual motion" and "optical flow" interchangeably. To illustrate the concept of optical flow, Figure 1.1 shows three frames from a video sequence taken by a camera passing in front of a flower garden. The optical flow estimated for the second frame, subsampled by 8 in each direction to keep the plot readable, is shown in Figure 1.1(d). Overall it agrees with our perception of motion in the scene.

Once available, optical flow estimates can be used to infer many 3D structural and dynamic properties of the scene [54, 36]. In a general scenario, 2D image motion can


Figure 1.1: Three frames in a video sequence taken by a camera passing in front of a flower garden and the estimated optical flow field

be caused by camera motion (ego-motion), by the motion of independent moving objects in the scene, or by a composite of the two. If a video sequence is taken by a moving camera in a rigid 3D scene, as in the case of the flower garden sequence, analysis of the sequence can lead to recovery of the camera motion (pose) [48, 64] and the 3D surface structure of the scene [31, 78, 108, 115, 67]. When there are independent moving objects in the scene, motion analysis can help determine the number of objects, their individual 3D motions, their distances to the observer and their surface structures. Such study is vital to applications in environment modelling [8, 132], target detection and tracking [80, 92, 110, 32, 81, 101], auto-navigation [2, 45, 123], video event analysis [60, 107] and medical image registration [1].

Analyzing optical flow in the 2D domain is important in its own right. Many dynamic features such as the focus of expansion [122], motion boundaries [93, 16] and occlusion relationships [73] can be extracted from optical flow fields (although the extraction is much less straightforward than one might intuitively expect; we will come to this topic in the next section). These dynamic features can assist in image segmentation [88, 113, 14, 86, 131] and independent motion detection [80, 92], and usually serve as intermediate measures for object-based representations [110]. Moreover, the temporal continuity encoded in visual motion has been exploited for redundancy reduction in video compression [114, 150, 109], image/video super-resolution [112], and the removal of image noise [120] and image distortion [142].

Visual motion, as a compelling cue to the perception of 3D structure and realism, can also be used for graphics and animation [29, 100]. For example, a cartoon character can be made to mimic a human character's expression by first measuring the human character's facial motion and then warping the cartoon character accordingly. Such concepts have already been utilized in film production [100], and they are expected to play an increasingly important role in the future with advances in computational technology.

All the visual motion applications discussed above assume that accurate optical flow estimates are already available or can be conveniently computed. Unfortunately, recovering optical flow from images is very difficult, for three reasons. First of all, the movements of brightness patterns in the image plane might not impose sufficient constraints on the actual 2D motion; this is the intrinsic ambiguity of optical flow. Secondly, in formulating the problem of optical flow estimation, certain assumptions about the motion and the image observation must be made; these assumptions, as simplifications of real-world phenomena, can easily be violated and result in erroneous estimates. Finally, the computation involved can be intensive and even prohibitive, so that a more appropriate formulation might not lead to higher practical accuracy. Even worse, these difficulties are usually entangled, which makes it very hard to tell which factors contribute to a failure.

For the above reasons, despite decades of active research and steady progress, the performance of existing optical flow estimation techniques remains unsatisfactory. It is thus the main theme of this thesis to explore new approaches to optical flow estimation that handle these problems more effectively.

The rest of this chapter serves as a high-level overview of the dissertation. The following section briefly reviews optical flow research and motivates our study. Two novel techniques, exploiting local and global motion coherence respectively, are described in Section 1.2 and Section 1.3. Section 1.4 discusses a detection and tracking system which can be considered an application of visual motion and is built on top of various results established in our study of optical flow estimation. Finally, Section 1.5 gives an outline of the thesis. Conclusions and contributions of the various pieces of our work will be pointed out in each individual section.

1.1 Optical Flow Estimation

Basics

Given the two images in Figure 1.2, the task of estimating the optical flow in the first frame is to determine where each pixel in this frame moves to in the next frame. The most intuitive method of doing this is probably template matching. Consider the pixel at the center of the box, which is near the center of the front tree trunk, in Frame 1. In order to find its corresponding point in Frame 2, we may take the image block within the box as a template, search Frame 2 for the block most similar to it, and compute the optical flow vector from the displacement between the centers of the blocks. Two assumptions are implied in this matching process: (i) the template maintains its brightness pattern and (ii)


Figure 1.2: Motion estimation by template matching. Consider the pixel at the center of the white box in Frame 1. In order to find its corresponding point in Frame 2, we may take the image block within the box as a template, search Frame 2 for the block most similar to it, and then the displacement between the centers of the blocks is the optical flow vector. Templates 1, 2 and 4 show the aperture problem. Template 3 shows a case of assumption violation caused by motion boundaries.

all pixels within the template move at the same speed. These are simple embodiments of the brightness conservation and flow field smoothness assumptions, which are the foundation of all motion estimation methods.

The above template matching method, however, does not work well everywhere. Some problematic positions are marked in Figure 1.2. Template 1 (upper-left in Frame 1, on the roof) contains an intensity edge, and many blocks along the edge in the second frame match it almost equally well; as a result, only the motion perpendicular to the edge can be reliably recovered. Template 2 belongs to the sky and is poorly textured; where the block matching process finds the best match in the next frame is determined largely by image noise. Template 4 shows a similar problem: its sky part is attached to the twigs and is assigned the foreground motion (see Figure 1.1). These three cases illustrate the aperture problem [58]: if we examine motion only through a small aperture (region), the local image information can be insufficient for uniquely determining the motion. The aperture problem is the intrinsic difficulty of visual motion perception. Some of the ambiguity it induces may be resolvable with appropriate contextual knowledge. For instance, human

viewers can recognize Templates 1 and 2 as part of the house and the sky, respectively, and can associate their motions with the rest of the scene. There have been efforts to mimic this ability, including adaptive template window selection [72] and flow propagation [58]. Nonetheless, the aperture problem is unavoidable in general; it always exists in the form of spatially varying uncertainty. For such reasons, error analysis [52, 144] is an integral part of optical flow estimation and will be addressed in this thesis.

Challenges in motion estimation also arise from assumption violations. One example is given by Template 3. The correct motion of its center pixel is the motion of the front tree trunk. But since the template also includes a part of the flower bed, which moves differently, flow constancy no longer holds in this block and the outcome of the matching process can be arbitrarily wrong. Motion discontinuities have received the most attention in combating assumption violations, not only because they are abundant in real imagery, but also because they often correspond to significant scene features, which in some applications can be of even greater interest than the motion itself. The brightness conservation assumption can also become invalid due to large image noise and illumination changes. To deal with these problems, we may either adopt new models that accommodate the abnormalities or develop techniques that keep working gracefully even when violations are present. The latter measure is indispensable because any assumption, as a simplification of a real-world phenomenon, will potentially be violated. For this reason, devising methods robust to unmodelled incidences has become a central issue in motion estimation as well as in the entire computer vision community [50, 85, 51].

Over more than two decades of intensive research, optical flow estimation has been tackled from different angles with variable success. Early studies [58, 82, 3] establish basic models for brightness conservation and flow smoothness. Recent efforts [77, 15, 5, 97] emphasize enhancing robustness against model violations and solving the associated optimization problems. The following overview is a glimpse of this broad area, especially of methods related to our work. More literature review will be given in Chapter 2.

Overview of Related Work

Two main types of constraints are derived from the brightness conservation assumption:

matching-based constraints [3] and gradient-based constraints [58, 82]. Matching-based constraints, as used in the template matching process, determine the motion vector by trying a number of candidate positions and keeping the one with the minimum matching error. This method can handle large motion, but the search process can be computationally expensive and yield poor sub-pixel accuracy [7, 14]. Gradient-based constraints are linear approximations of matching-based constraints. By exploiting gradient information, they achieve much better efficiency and accuracy and hence have become the most popular in practice. But their reliance on derivative computation makes their applicability more limited [145].

Based on how flow smoothness is imposed, the approaches are further divided into two types: local parametric and global optimization. Local parametric methods assume that within a certain region the flow field is described by a parametric model [12]. The simplest, yet one of the most popular, models is the local constant model, as implied in template matching. Local models usually involve simple computation and can achieve good local accuracy [7, 39], but they degrade or fail when the model is inappropriate or the local information becomes insufficient or unreliable. Global optimization methods cast optical flow estimation in a regularization framework: every vector satisfies its brightness constraint while maintaining coherence with its neighbors [58]. Because they propagate flow between different parts of the flow field, such approaches are less sensitive to the aperture problem; but for the same reason, they tend to oversmooth the flow field. Most popular approaches are gradient-based. The best known classical techniques are perhaps the global gradient-based method of Horn and Schunck [58] and the local gradient-based method of Lucas and Kanade [82].

Traditional techniques [7] usually require brightness conservation and flow smoothness to be satisfied everywhere in the flow field. These restrictive assumptions result in smeared motion discontinuities and high sensitivity to abrupt noise. As their limitations have become widely recognized, a large number of recent efforts have been devoted to increasing robustness, especially to allowing motion discontinuities. For gradient-based local parametric methods, various robust regression techniques, such as robust clustering [113, 94], M-estimators [15, 109] and high-breakdown robust estimators [5, 97, 145], are substituted for the traditional least-squares estimator. They reduce the impact of model violations by fitting the structure of the majority of the data. Global optimization approaches are reformulated in terms of anisotropic diffusion [91, 19], Markov random fields with line processes [88, 77, 57, 14], weak continuity [20, 15], or robust statistics [116, 15, 86], among many others. These techniques generally outperform their non-robust counterparts in terms of accuracy, but computational complexity quickly becomes the new performance bottleneck. This is especially true for global methods involving large-scale nonconvex optimization, which are considered the most promising methods [111].

Why Low-level Approaches

This thesis is concerned with low-level approaches to optical flow estimation (in fact, when people talk about optical flow methods, they normally refer to low-level methods). "Low-level" means that only primitive image descriptors (intensity values) and weak assumptions (piecewise spatiotemporal continuity) are exploited. Given the small amount of prior knowledge, the limitations of such approaches, for instance in handling motion discontinuities, are obvious and understandable.

The reader may then wonder why we do not use other channels of information or stronger assumptions—it seems to make perfect sense to extract motion for each object separately. Such ideas are compelling and have been exploited in a number of applications. Examples include using color segmentation to assist motion boundary localization [129]; assuming the motion field to be a mixture [68, 109], single/multiple rigid bodies [12] or layers [131]; and explicitly modelling and tracking motion boundaries [86, 16]. Replacing the optical flow representation of visual motion by an object-based representation has also been suggested [118, 48, 101].

Nonetheless, low-level approaches continue to be extensively studied, for good reasons [83]. First of all, by making weaker assumptions, low-level methods are more general and are applicable to different types of visual motion, for example, both rigid and nonrigid motion. Secondly, low-level methods are indispensable building blocks, leading in a bottom-up fashion to more complex motion analysis [118]; in fact, higher-level methods usually need low-level methods in model selection [130] and in initialization and optimization procedures [68], and advances in low-level research are applicable to them as well. Finally, there is still plenty of room for improvement in low-level motion estimation, particularly in robustness and error analysis. Due to compromises in formulations and solution methods, existing techniques can fail even in ideal settings. As an example, many methods intended to preserve motion discontinuities use gradient-based brightness constraints, which can break down at discontinuities because derivative evaluation fails there. Error analysis of motion estimates is a crucial task due to the inherent ambiguity of visual motion. The insufficiency of robustness and error analysis in optical flow estimation is the major motivation of our research.

We have considered both local and global approaches to piecewise-smooth optical flow estimation. The following two sections overview the main results and contributions of our work.

1.2 A Local Method with Error Analysis

A Two-Stage-Robust Adaptive Scheme. Gradient-based optical flow estimation techniques essentially consist of two stages: estimating derivatives, and organizing and solving optical flow constraints (OFC). Both stages pool information in a certain neighborhood and are regression procedures in nature. Least-squares (LS) solutions to the regression problems break down in the presence of outliers such as motion boundaries. To cope with this problem, a few robust regression tools [15, 86, 97, 5] have been introduced at the OFC stage. However, although derivative calculation is a very similar information pooling step, it has seldom received proper attention in optical flow estimation. Crude derivative estimators are widely used; as a consequence, robust-OFC (one-stage-robust) methods still break down near motion boundaries. Pointing out this limitation, we propose to calculate derivatives from a robust facet model [146, 145]. To reduce the computational overhead, we carry out the robust derivative stage adaptively, according to a confidence measure of the flow estimate. Preliminary experimental results show that the two-stage-robust scheme permits correct flow recovery even immediately at motion boundaries.

A Deterministic Algorithm for High-Breakdown Robust Regression. High-breakdown criteria are employed in both of the above regression problems. They have no closed-form solutions, and past research has resorted to approximation schemes. So far, all applications of high-breakdown robust methods in visual reconstruction [121, 75, 5, 97, 117] have adopted a random-sampling algorithm given in [106]: the estimate with the best criterion value is picked from a random pool of trial estimates. These methods apply the algorithm uniformly to all pixels in an image, disregarding the actual amount of outliers, and suffer from heavy computation as well as unstable accuracy. By taking advantage of the piecewise smoothness of the visual field and the selection capability of robust estimators, we propose a deterministic adaptive algorithm for high-breakdown local parametric estimation. Starting from LS estimates, we iteratively choose neighbors' values as trial solutions and use robust criteria to adapt them to the local constraint. This method provides an estimator whose complexity depends on the actual outlier contamination. It inherits the merits of both LS and robust estimators, producing crisp boundaries as well as smooth inner surfaces; it is also faster than algorithms based on random sampling.

Error Analysis Through Covariance Propagation. Due to the aperture problem and outlying structures, an optical flow estimate generally has spatially varying reliability. In order for subsequent applications to make judicious use of the results [34], the error statistics of the flow estimate have to be analyzed. In our earlier work [141], we conducted error analysis for the least-squares-based local estimation method using covariance propagation theory for approximately linear systems and small errors. Here we generalize the results to the newer robust method. Our analysis estimates image noise and derivative errors in an adaptive fashion and takes into account the correlation of derivative errors at adjacent positions. It is more complete, systematic and reliable than previous efforts.
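For reference, the backbone of such an analysis is the standard first-order covariance propagation rule; the contribution here lies in how the input covariances (image noise and correlated derivative errors) are estimated. If \(\hat\theta\) minimizes a smooth criterion \(F(x, \theta)\) of noisy data \(x\) with covariance \(\Sigma_x\), then to first order

\[
\Sigma_{\hat\theta} \;\approx\;
\left(\frac{\partial^2 F}{\partial \theta\,\partial \theta^T}\right)^{-1}
\frac{\partial^2 F}{\partial \theta\,\partial x^T}\,
\Sigma_x
\left(\frac{\partial^2 F}{\partial \theta\,\partial x^T}\right)^{T}
\left(\frac{\partial^2 F}{\partial \theta\,\partial \theta^T}\right)^{-T}.
\]

For the explicit LS flow estimate u = (A^T A)^{-1} A^T b with noise confined to b, this reduces to Σ_u = (A^T A)^{-1} A^T Σ_b A (A^T A)^{-1}.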

1.3 A Global Optimization Method

By drawing information from the entire visual field, the global optimization approach [58, 15] to optical flow estimation is conceptually more effective in handling the aperture problem and outliers than the local approach. But its actual performance has been somewhat disappointing, due to formulation defects and solution complexity. On one hand, approximate formulations are frequently adopted for ease of computation, with the consequence that the correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated formulations typically involve large-scale nonconvex optimization problems, which are so hard to solve that the practical accuracy might not be competitive with simpler methods. The global optimization method we have developed is aimed at better solutions to both problems.

From a Bayesian perspective, we assume the flow field prior distribution to be a Markov random field (MRF) and formulate the optimal optical flow as the maximum a posteriori (MAP) estimate, which is equivalent to the minimum of a robust global energy function. The novelty of our formulation lies mainly in two aspects: 1) three-frame matching is proposed to detect correspondences, which overcomes the visibility problem at occlusions; 2) the strengths of the brightness and smoothness errors in the global energy are automatically balanced according to local data variation, and consequently parameter tuning is reduced. These features enable our method to achieve a higher accuracy upper-bound than previous algorithms.
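Schematically, the resulting objective has the generic robust-regularization form (illustrative only; the precise three-frame data term, neighborhood system and adaptive balancing are specified in Chapter 4):

\[
E(\mathbf{u}) \;=\; \sum_{s} \rho_D\big(e_B(\mathbf{u}_s)\big)
\;+\; \sum_{s} \sum_{s' \in N_s} \lambda_s\, \rho_S\big(\|\mathbf{u}_s - \mathbf{u}_{s'}\|\big),
\]

where e_B is a matching error computed over three frames, ρ_D and ρ_S are robust norms such as Geman-McClure, N_s is the MRF neighborhood of site s, and λ_s is the locally adapted balance between the two terms.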

In order to solve the resultant energy minimization problem, we develop a hierarchical three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient method with a deterministic iterative implementation, which provides a high-quality initial flow estimate. Step II is a global gradient-based formulation solved by Successive Over-Relaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes the original energy by greedy propagation; it corrects gross errors introduced by derivative evaluation and pyramid operations. In this process, the merits of all three steps are inherited and their drawbacks are largely avoided. As a result, high accuracy is obtained both on and off motion boundaries.

Performance of this technique is demonstrated on a number of standard test data sets. On Barron's synthetic data, which have become the benchmark since the publication of [7], this method achieves the best accuracy among all low-level techniques. Close comparison with the well-known dense regularization technique of Black and Anandan (BA) [14] shows that our method yields uniformly higher accuracy in all experiments at a similar computational cost.


Figure 1.3: Motion analysis for airborne video surveillance. A tiny airplane is observable only by its distinct motion.

1.4 Motion-Based Target Detection and Tracking

In a visual surveillance project funded by the Boeing Company, we have investigated an application of optical flow to airborne target detection and tracking. The greatest difficulty in this problem lies in the extremely small target size, typically 2 × 1 to 3 × 3 pixels, which makes the results of most previous aerial visual surveillance studies inapplicable. Challenges also arise from low image quality, substantial camera wobbling and heavy background clutter. A sample frame of the client data is given in Figure 1.3, together with a copy in which the target is marked.

The proposed system consists of two components: a moving object detector, which identifies objects by the statistical difference between their motions and the background motion, and a Kalman filter, which tracks their dynamic behaviors in order to detect targets and update their states. Both the detector and the tracker operate in a Bayesian mode, and each benefits from the other's accuracy. The system exhibits excellent performance in experiments. On an 1800-frame real video clip with heavy clutter and a true target (1 × 2 to 3 × 3 pixels in size), it produces no false targets and tracks the true target from the second frame with average position error below 1 pixel. This probabilistic approach reduces parameter tuning to a minimum. It also facilitates data fusion from different information channels.
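For reference, a single predict/update cycle of a standard Kalman filter with a constant-velocity target model can be sketched as follows in Python/NumPy; the state layout and the noise levels are illustrative assumptions, not the tuned values used in Chapter 5.

import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # Predict the state forward, then correct it with the new measurement z.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state (x, y, vx, vy), position-only measurements.
dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1.0]])
H = np.array([[1.0, 0, 0, 0], [0, 1, 0, 0]])
Q = 0.01 * np.eye(4)  # process noise (assumed)
R = np.eye(2)         # measurement noise (assumed)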

1.5 Thesis Outline

The first and major half of the dissertation is devoted to optical flow estimation; the rest describes the motion-based target detection and tracking system. To enhance the robustness of visual motion analysis, which is the central issue in our study, statistical tools are extensively exploited at every stage. Given the diversity of the topics, previous work is summarized and mathematical and statistical tools are introduced as the need arises.

Chapter 2 serves as a literature review on piecewise-smooth optical flow estimation. Standard constraints derived from the brightness conservation and flow smoothness assumptions and common techniques such as hierarchical processing are described. Representative methods, both classical ones and more robust ones, are discussed. The relative merits of different approaches are important considerations in designing our methods.

Chapter 3 addresses the two-stage-robust adaptive approach to local flow estimation and its error analysis. Using the facet model, the popular local gradient-based approach is reformulated as a two-stage regression problem. Appropriate robust estimators are identified for both stages and the adaptive scheme is introduced. A deterministic algorithm for high-breakdown robust regression in visual reconstruction is proposed, and its effectiveness is demonstrated at the OFC-solving stage. Error analysis carried out for the least-squares version of the method is reviewed, and the results are then generalized to the robust version. Experimental results on both synthetic and real data are given for each of the above three parts. Robust estimation is formally introduced in this chapter and will be used extensively in the rest of the thesis.

Chapter 4 discusses the global optimization approach to optical flow estimation. From a Bayesian perspective, the maximum a posteriori (MAP) criterion is used with a Markov random field (MRF) prior distribution to formulate optical flow estimation as minimizing a global energy function. The global energy is carefully designed to allow occlusions, flow discontinuities and local adaptivity. Furthermore, a graduated deterministic solution technique is developed for the minimization problem. It exploits the advantages of various formulations and solution techniques for accuracy and efficiency. The theoretical and practical advantages of this method are illustrated by experimental results and comparisons with other techniques on various synthetic and real image sequences. This chapter concludes by pointing out contributions and future research directions along this line.

Chapter 5 presents the motion-based target detection and tracking system. It begins by describing the Kalman-filter-based tracker; in doing so, the Bayesian state estimation theory, which is also used in the detection phase, is explained. A hybrid motion estimator is devised to locate independently moving objects. Its measurements are integrated with priors from previous tracking results, so that the detector can operate in a Bayesian mode. Performance of the system is demonstrated on real airborne video.

Chapter 6 concludes the dissertation by summarizing the results, contributions and future research avenues of each individual piece of our work.

Chapter 2

ESTIMATING OPTICAL FLOW: APPROACHES AND ISSUES

Optical flow estimation has long been an active research area in computer vision. Pioneering work on calculating image velocity for compressing TV signals [79, 28] dates back to the mid-1970s. During the 1980s, the fundamental assumptions enabling optical flow estimation, namely, brightness conservation and flow field coherence, were examined from different angles, resulting in a large number of techniques, which are compared in the influential review articles by Barron, Fleet and Beauchemin [7, 10]. A drawback common to many of these early techniques is that they usually require the assumptions to be satisfied in a strict (least-squares) sense, so that their performance degrades severely in the presence of unmodelled events, especially motion discontinuities. As such limitations have been widely recognized, the theme of optical flow research in recent years has shifted to enhancing the robustness of classical approaches. Encouraging progress has been made along this line and the estimation accuracy has been greatly improved. However, due to problems in formulations and solution techniques, there still exists a considerable gap between the achieved performance and what is desired in real-world applications. In addition, visual motion has an intrinsic ambiguity, which cannot be resolved by any estimation method; this makes reliable error analysis of optical flow estimates a crucial issue that needs to be addressed more adequately. This unsatisfactory state of affairs continues to motivate investigation in the area.

This chapter reviews piecewise-smooth optical flow estimation. We describe typical formulations, representative techniques and their relative merits. The purpose is not to give a comprehensive literature review, which is beyond our scope, but to provide background for understanding the difficulties of this problem, the major achievements of previous work and the motivations for our study. We organize this chapter as follows. The first two sections discuss the modelling of brightness conservation and flow coherence respectively, and Section 2.3 describes typical formulations resulting from combinations of these models. Section 2.4 addresses challenges arising from model violations and efforts at ameliorating them. Section 2.5 points out the inherent ambiguity of optical flow and introduces previous work on error analysis. Finally, Section 2.6 explains the hierarchical processing that is widely employed to handle large motions.

2.1 Brightness Conservation

Let I(x, y, t) be the image intensity at a point (x, y) at time t. The brightness conservation assumption can be expressed as

I(x, y, t) = I(x + δx, y + δy, t + δt) = I(x + uδt, y + vδt, t + δt),   (2.1)

where (δx, δy) is the spatial displacement during the time interval δt, and (u, v) is the optical flow vector. This equation simply states that a point maintains its intensity value during motion, or that corresponding points in different frames have the same brightness.

Matching-based methods [3, 120] find the flow vector or displacement that yields the best match between image regions in different frames. Best match can be defined in terms of maximizing a similarity measure such as the normalized cross-correlation, or minimizing a distance measure such as the sum-of-squares difference (SSD):

EB(u, v) = Σ_{(x,y)∈R} [I(x, y, t) − I(x + uδt, y + vδt, t + δt)]^2,   (2.2)

where EB designates the brightness conservation error, and R is the image region spanned by the template. Such matching criteria normally do not lead to closed-form solutions. In order to find the best match, usually a set of candidate displacements is hypothesized and the one with the best matching score is retained. This discrete exhaustive search process has poor efficiency and often results in low subpixel accuracy. For this reason, gradient-based methods have gained popularity in the optical flow estimation community.
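As an illustration of this exhaustive search, a minimal SSD block matcher over integer displacements might look as follows in Python/NumPy (the window and search radii are arbitrary assumptions):

import numpy as np

def ssd_match(I0, I1, x, y, half=7, search=10):
    # Return the integer displacement (du, dv) minimizing Eq. 2.2 for the
    # template centered at (x, y) in frame I0, searched for in frame I1.
    T = I0[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    best, best_uv = np.inf, (0, 0)
    for dv in range(-search, search + 1):
        for du in range(-search, search + 1):
            y0, y1 = y + dv - half, y + dv + half + 1
            x0, x1 = x + du - half, x + du + half + 1
            if y0 < 0 or x0 < 0 or y1 > I1.shape[0] or x1 > I1.shape[1]:
                continue  # candidate window falls off the image
            e = np.sum((T - I1[y0:y1, x0:x1]) ** 2)
            if e < best:
                best, best_uv = e, (du, dv)
    return best_uv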

Gradient-based methods [58, 82, 53] make use of differential approximations of the brightness constancy constraint Eq. 2.1. When the spatiotemporal image intensity I is differentiable at the point (x, y, t), the right side of Eq. 2.1 can be expanded as a Taylor series, yielding

I(x, y, t) = I(x, y, t) + Ix uδt + Iy vδt + It δt + ε,

where (Ix, Iy, It) is the image intensity gradient vector at the point (x, y, t) and ε represents the higher-order terms. If the displacement (uδt, vδt) is infinitesimally small, ε becomes negligible and the equation simplifies to the well-known optical flow constraint equation (OFCE) [58]

Ix u + Iy v + It = 0,   (2.3)

which is a linear equation in the two unknowns u and v. Given n ≥ 2 pixels sharing the same 2D motion, their OFCEs can be grouped together, and u and v can then be calculated through linear regression. Another way of obtaining additional constraints is to exploit second-order image derivatives. Differentiating Eq. 2.3 with respect to x, y and t respectively gives three more equations:

Ixx u + Iyx v + Itx = 0,
Ixy u + Iyy v + Ity = 0,
Ixt u + Iyt v + Itt = 0.

They can be used alone [7] or combined with the OFCE [53] to solve for u. The most distinct attraction of gradient-based constraints, compared with matching-based constraints, is their ease of computation. The use of derivatives allows more efficient exploration of the solution space and hence achieves lower complexity and higher floating-point precision [7, 9]. However, it is important to point out that such advantages come at a price: the additional assumptions made in deriving the gradient-based constraints dictate their more limited applicability. First of all, gradient-based constraints are valid only for small displacements, which in practice means magnitudes below roughly 1 to 2 pixels/frame. Secondly, in order for the higher-order terms to be negligible, the local image intensity function should be close to a planar structure, which is also often violated. Finally, derivative estimation is itself a problematic process. Commonly used methods include neighborhood differences [58], facet model fitting [145] and spatiotemporal filtering [119]. They all imply constant optical flow in the neighborhood and therefore break down near motion boundaries. In fact, derivatives are low-level visual measures just like optical flow, and thus their computation also meets with difficulties produced by the aperture problem and assumption violations [20, 13].
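For concreteness, the crudest of these schemes, simple central differencing over three frames, can be written in a few lines (illustrative only; the facet-model estimator used later in this thesis is more elaborate):

import numpy as np

def central_differences(I_prev, I, I_next):
    # Central-difference estimates of (Ix, Iy, It) on the middle frame.
    I = I.astype(float)
    Ix = np.zeros_like(I)
    Iy = np.zeros_like(I)
    Ix[:, 1:-1] = (I[:, 2:] - I[:, :-2]) / 2.0
    Iy[1:-1, :] = (I[2:, :] - I[:-2, :]) / 2.0
    It = (I_next.astype(float) - I_prev.astype(float)) / 2.0
    return Ix, Iy, It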

Maintaining high derivative quality, identifying unusable estimates and diagnosing failures are crucial to the robustness of (gradient-based) optical flow estimation. We address these issues in developing both of our new techniques (Chapters 3 and 4).

Frequency-based methods. Performing the Fourier transform on the brightness constancy constraint Eq. 2.1 yields

Î(ωx, ωy, ωt) = Î(ωx, ωy, ωt) e^(−j(uδt ωx + vδt ωy + δt ωt)),

where Î(ωx, ωy, ωt) is the Fourier transform of I(x, y, t) and ωx, ωy, ωt denote spatiotemporal frequency. Clearly, for this equation to hold, the flow must satisfy

uωx + vωy + ωt = 0. (2.4)

This is the basic constraint for frequency-based approaches. It states that all nonzero energy associated with a translating 2D pattern lies on a plane through the origin in the frequency space, and the normal of the plane defines the optical flow vector. Frequency-based approaches are often presented as biological models of human motion sensing. They can handle cases that are difficult for matching approaches, e.g., the motion of random dot patterns. But in most cases, they are close to the frequency-domain equivalents of matching-based and gradient-based methods [10], and extracting the nonzero energy plane usually involves heavy computation. As a consequence, they are not as popular as the other two types of approaches.

2.2 Flow Field Coherence

For each pixel, the brightness conservation constraint (Eq. (2.1), (2.3) or (2.4)) provides one constraint on the two unknowns u and v. Additional constraints come from the flow field coherence assumption, which states that neighboring pixels experience consistent motion. Based on how coherence is imposed, the approaches can be further divided into two major types: local parametric and global optimization.

Local parametric methods assume that within a certain region the flow field is described by a parametric model: u(x) = u(x; p).

Here boldface letters denote column vectors: u = (u, v)^T, x = (x, y)^T, and p is the vector of model parameters. Common models include the constant model

u(x; p) = (u(x, y), v(x, y))^T = (p0, p1)^T,

which holds at any location as the region size approaches zero; the affine model

u(x; p) = (p0 + p1 x + p2 y, p3 + p4 x + p5 y)^T,

which approximates the 2D motion of a remote 3D surface; and the quadratic model

u(x; p) = (p0 + p1 x + p2 y + p6 x^2 + p7 xy, p3 + p4 x + p5 y + p6 xy + p7 y^2)^T,   (2.5)

which describes the instantaneous 2D motion of a planar surface undergoing 3D rigid motion (we will use this model in the airborne visual surveillance application in Chapter 5).

Low-order polynomial flow models owe their popularity to their clear physical meanings and simple computation. But how to select a region appropriate for a given model and how to choose models suitable for a given region are very complicated problems [130, 25]. The common practice of applying the same model uniformly to all image locations risks under-fitting, over-fitting and compromising between different models, and usually results in a flow field of highly uneven accuracy.
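As a small illustration, evaluating the affine and quadratic models at given pixel coordinates is direct (a sketch; the function names are ours):

import numpy as np

def affine_flow(p, xs, ys):
    # u = p0 + p1*x + p2*y,  v = p3 + p4*x + p5*y
    return p[0] + p[1] * xs + p[2] * ys, p[3] + p[4] * xs + p[5] * ys

def quadratic_flow(p, xs, ys):
    # Eight-parameter quadratic model of Eq. 2.5.
    u = p[0] + p[1] * xs + p[2] * ys + p[6] * xs**2 + p[7] * xs * ys
    v = p[3] + p[4] * xs + p[5] * ys + p[6] * xs * ys + p[7] * ys**2
    return u, v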

Global optimization methods can avoid the region selection problem to a certain extent. Instead of assuming a rigid model for an entire region, they allow arbitrary local variations as long as the flow field is smooth (almost) everywhere. Such a global smoothness assumption usually leads to a regularization type of formulation. A classical technique of this kind is due to Horn and Schunck [58]. They define the best optical flow field as the one minimizing the overall OFCE error and local flow variation:

Σ_s [ (Ixs us + Iys vs + Its)^2 + λ ((ūs − us)^2 + (v̄s − vs)^2) ].   (2.6)

Here s is a one-dimensional index of pixel locations (x, y), which traverses all pixel locations in a progressive scan manner. The first quadratic term in the summation is the OFCE error at location s; the second term requires minimal deviation between the flow vector and (ūs, v̄s), the average over its neighbors i ∈ Ns. The constant λ is a tuning parameter which controls the relative importance of the data and the flow variation. Global optimization models deal with the aperture problem more effectively than local parametric models by propagating flow estimates between different locations, but for the same reason they tend to over-smooth the field. In addition, global models are sensitive to the choice of the control parameter λ and their computation is more involved.

2.3 Typical Approaches

In principle, any of the above brightness conservation models and flow coherence models can be paired up to derive a formulation for optical flow estimation. Among all possible combinations, gradient-based local parametric, gradient-based global optimization and spatiotemporal filtering approaches, especially the first two, have attracted the most attention because of the good balance between their accuracy and complexity.

Gradient-based local parametrization

Combining gradient-based constraints and low-order polynomial flow models, one usually arrives at a linear equation in the flow model parameter p:

Ap = b.

In particular, using first-order constraints and the constant flow model, we have

Au = b,   (2.7)

A = [Ix1 Iy1; … ; Ixn Iyn],   b = −[It1; … ; Itn],

where row i collects the derivatives at the i-th pixel. When A^T A is nonsingular, the least-squares (LS) solution to the equation is

u = (A^T A)^{-1} A^T b.   (2.8)

Both sides of Eq. 2.7 can be multiplied by a window function W = diag[W1, ..., Wn] to assign heavier weights to certain constraints. The corresponding equation is WAu = Wb. If the weights are absorbed by A and b, i.e. A ← WA, b ← Wb, the same LS solution Eq. 2.8 is obtained. Lucas and Kanade [82] employ an iterative version of the above algorithm for stereo registration. Since they were probably the first to formalize this approach, the (weighted) LS fit of local first-order constraints to a constant flow model is usually referred to as the Lucas and Kanade technique, which we abbreviate as LK. This technique is reported to be the most efficient and accurate, especially after confidence-based selection [7]. An early technique using second-order constraints is due to Haralick and Lee [53]. They interpret the OFCE as the interception line of the isocontour plane with a successive image frame, calculate image derivatives from the facet model, and solve the first- and second-order constraints at each pixel by singular value decomposition (SVD) [102].
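A minimal implementation of the LS solution Eq. 2.8 over one window might read as follows (the window size, weighting and conditioning test are illustrative assumptions):

import numpy as np

def lk_flow(Ix, Iy, It, x, y, half=2, w=None):
    # Weighted least-squares fit of the constant flow model in a
    # (2*half+1) x (2*half+1) window centered at (x, y); assumes the
    # window lies inside the image.
    sl = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    A = np.column_stack([Ix[sl].ravel(), Iy[sl].ravel()])
    b = -It[sl].ravel()
    if w is not None:              # optional window weights W
        A, b = A * w[:, None], b * w
    AtA = A.T @ A
    if np.linalg.cond(AtA) > 1e8:  # ill-conditioned: aperture problem
        return None
    return np.linalg.solve(AtA, A.T @ b)  # u = (A^T A)^{-1} A^T b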

Gradient-based global optimization

The seminal technique of this category, by Horn and Schunck (HS) [58], was introduced in the last section. They solve Eq. 2.6 for the flow field by iterative relaxation:

us^n = ūs^(n−1) − Ixs (Ixs ūs^(n−1) + Iys v̄s^(n−1) + Its) / (λ + Ixs^2 + Iys^2),
vs^n = v̄s^(n−1) − Iys (Ixs ūs^(n−1) + Iys v̄s^(n−1) + Its) / (λ + Ixs^2 + Iys^2),

where n denotes the iteration number, (u^0, v^0) denotes the initial flow estimate (set to zero), and λ is chosen empirically. Typically, flow fields obtained from this technique are visually pleasing because of the smoothness, but their quantitative accuracy is not as good as that of local gradient-based methods [7], due to over-smoothing and slow convergence of the relaxation process.
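For reference, this relaxation takes only a few lines in Python/NumPy (the four-neighbor averaging and the wrap-around border handling of np.roll are simplifying choices):

import numpy as np

def horn_schunck(Ix, Iy, It, lam=100.0, n_iter=200):
    # Jacobi-style relaxation for the update equations above.
    u = np.zeros_like(Ix, dtype=float)
    v = np.zeros_like(Ix, dtype=float)
    denom = lam + Ix**2 + Iy**2
    for _ in range(n_iter):
        u_bar = 0.25 * (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                        np.roll(u, 1, 1) + np.roll(u, -1, 1))
        v_bar = 0.25 * (np.roll(v, 1, 0) + np.roll(v, -1, 0) +
                        np.roll(v, 1, 1) + np.roll(v, -1, 1))
        t = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * t
        v = v_bar - Iy * t
    return u, v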

Spatiotemporal filtering

Movements in the spatiotemporal image volume, formed by stacking the images of a sequence, induce structures with certain orientations. For example, the trace of a translating point is a line whose direction in the volume directly corresponds to its velocity. Different methods have been proposed to extract the orientations, including inertia tensors [66], hypergeometric filters [140] and orientation tensors [35]. Since determining 2D velocity in the frequency domain amounts to finding a nonzero energy plane (Section 2.1), the filtering approach is also adopted by frequency-based methods [56]. A recent filtering method with good reported accuracy is due to Farnebäck [35]. He fits data in an image neighborhood to a quadratic polynomial model I(x) = x^T Ax + b^T x + c, derives an orientation tensor from the model parameters, T = AA^T + η bb^T, and finds the flow vector by minimizing v^T T v. Here we temporarily adopt his notation x = (x, y, t)^T, v = (u, v, 1)^T / |(u, v, 1)^T| for convenience of presentation. It is not hard to see that this method closely resembles local gradient-based approaches: tensor construction is equivalent to (first- and second-order) derivative calculation, and solving the homogeneous constraint v^T T v = 0 in the augmented flow vector is equivalent to solving the linear equation Eq. 2.7 in the original vector (u, v)^T. The efficiency of this technique is mainly enabled by the intermediate step of tensor construction. Without this intermediate step, or if a filter bank is used instead [56, 38], the computation can become cumbersome and only discrete estimates can be obtained. This contrast is also similar to that between gradient-based and matching-based approaches. The equivalence between spatiotemporal filtering/frequency-domain approaches and certain matching-based/gradient-based methods has been pointed out previously [118, 7].

Others

Block matching (SSD) methods can be used to find a pixel-accuracy displacement, and a quadratic surface fit to the neighboring matching errors can then produce an estimate of subpixel accuracy [126]. The techniques of Anandan [3] and Singh [120] initialize the flow field using this method, and then employ global smoothness constraints to propagate flow from places of higher confidence to places of lower confidence. Matching-based approaches handle large motion better than gradient-based approaches, but their computational difficulties and poor subpixel accuracy make them less competitive in optical flow estimation. For similar reasons, matching-based global optimization schemes have been attempted with very limited success [88, 14], and frequency-based global optimization approaches are almost never explored.

2.4 Robust Methods

Most early techniques, as described in the above three sections, require brightness conservation and flow smoothness to be satisfied everywhere in the flow field. These restrictive assumptions make them break down easily in reality, where model violations are abundant. An obvious source of violation is motion discontinuity. Imposing flow smoothness on a region containing multiple motion modes results in a compromise between these modes and smeared flow estimates. Such failure not only is detrimental to optical flow accuracy, but also obscures important geometric or physical properties of the scene. Violations of the brightness constancy assumption also occur commonly in natural scenes. Conditions such as specular reflections, shadows and illumination variations induce non-motion brightness changes. In cases of transparency, due to the interaction of translucent reflective surfaces, the image intensity of a single pixel can be a composite of multiple 3D points' brightness values [14]; examples include looking into a running creek and watching through a pane of glass. Applying simple brightness matching criteria in these situations does not produce meaningful motion estimates. As these limitations of traditional techniques [7] are widely recognized, a large number of recent efforts have been devoted to increasing robustness against assumption violations, especially to allowing motion discontinuities.

Explicit segmentation

Assuming that motion boundaries coincide with intensity discontinuities and that the former are a subset of the latter, a number of researchers [17, 129] first segment the visual field using image intensity, then compute parametric (e.g. affine) motion in each segment, and finally group neighboring segments into regions of coherent motion. Such approaches face two problems. First, accurate image segmentation is itself very difficult. Second, the assumed relationship between motion and intensity discontinuities is not necessarily correct. Motion estimation and motion-based segmentation form a chicken-and-egg dilemma: the motion estimator needs to know where motion boundaries are in order to avoid smoothing across them, whereas the motion-based segmenter requires an accurate motion field in order to divide the scene into regions of consistent motion. In an attempt to circumvent this problem, motion estimation and segmentation have been carried out simultaneously. The generic approach can be described as finding a segmentation of the flow field, and the motion (parameters) in each segment, that minimizes the difference between the observed and predicted image data [151]. Actual techniques differ in the flow models, optimization criteria and solution methods employed.

Kanade and Okutomi [72] develop an adaptive window technique that adjusts the rectangular window size to minimize the uncertainty in the estimate. Schweitzer [114] devises a recursive algorithm to split the motion field into square patches according to the minimal encoding length criterion. These methods use a rectangular division of the flow field and cannot adapt to irregular motion boundaries. Wang and Adelson [134] assume that an image region is modeled by a set of overlapping layers which can be irregularly shaped or even transparent. They compute initial motion estimates using a least-squares approach within image patches, then use K-means clustering to group motion estimates into regions of consistent affine motion. Jepson and Black [68] use a probabilistic mixture model to explicitly represent multiple motions within a patch, and use the EM algorithm to estimate parameters for a fixed number of layers. Darrell and Pentland [33] and Sawhney and Ayer [109] automatically determine the number of layers under a minimum description length (MDL) encoding principle, which regards the most compact interpretation as the best among all possibilities [114].

The explicit segmentation approach usually involves modeling the visual field as a collection of (rigid) objects with certain parametric motions. Appropriately choosing the motion models and the number of objects, especially in a dynamic situation, is very difficult [25, 130]

and can be impossible when nonrigid motion such as human movement and facial expression is present. Furthermore, due to the extremely high dimension of the problem, how to efficiently solve the associated numerical optimization problems remains a challenging issue. In general, iterative methods are used in which each updating step consists of sequential estimation and segmentation of the motion field. The initial guess is also given by a sequential method, and its quality is crucial for convergence. For the above reasons, the explicit segmentation approach is not suitable for general optical flow estimation and is not pursued in this thesis.

Outlier-suppressed regression

A major cause of the failure of traditional gradient-based local parametric techniques is the use of least-squares regression, which finds a compromise among all constraints and can break down even in the presence of a single model outlier. To repair this problem, various mechanisms have been attempted to reject outliers and fit the structure of the majority of constraints.

It is sometimes possible to detect outliers by examining the residuals of the least-squares solution. After obtaining an initial least-squares estimate, Irani et al. [64] iteratively remove outliers and recompute the least-squares solution. This process is still least-squares in essence; it is sensitive to the initial least-squares estimate, which may be arbitrarily bad. A number of researchers (e.g. Fennema and Thompson [37], Schunck [113], Nesi et al. [94]) investigate robust clustering [70] based on the Hough transform. Such approaches have better outlier resistance but are computationally very expensive. More success has been achieved by employing robust estimators, particularly M-estimators [15, 109] and high-breakdown robust estimators [5, 97, 145], in local optical flow constraint fitting. These estimators will be formally introduced and compared in Section 3.1.1. Among these methods, the ones reporting the best accuracy first identify and reject outliers using high-breakdown criteria and then estimate the parameters from the remaining constraints [5, 97, 145].

The computational burden of high-breakdown robust estimators increases with the amount of outlier contamination. Applying the same algorithm uniformly to the entire flow field incurs excessive computation, since most places contain few outliers. We tackle the efficiency problem with an adaptive algorithm (Section 3.2). Also, the limitations of gradient-based and local regression approaches (Sections 2.1, 2.2) remain regardless of the regression technique. We will propose a matching-based global optimization formulation to overcome such limitations (Section 4).

Discontinuity-preserving regularization

A significant amount of attention has been paid to reformulating the regularization problem to alleviate over-smoothing. Nagel and Enkelmann [91] suggest an oriented-smoothness constraint in which smoothness is not imposed across steep intensity gradients (edges).

Their formulation differs from HS's (Eq. 2.6) in that the terms (ū_s, v̄_s) are augmented by functionals of local flow derivatives and first- and second-order image derivatives. Despite the added complexity, this method yields experimental results similar to HS [7], which is not surprising. On one hand, image discontinuities and flow discontinuities do not necessarily overlap; reducing smoothing in all places of large image gradient hurts flow propagation from area to area. On the other hand, in the vicinity of flow discontinuities, where smoothing needs to be stopped, image derivatives are of poor precision and do not serve as a reliable indicator of occlusion.

Following Geman and Geman's work on stochastic image restoration [42], Markov random field (MRF) formulations [88, 77, 57, 14] have become an important class of techniques for coping with spatial discontinuities in optical flow estimation. An MRF is a distribution of a random field in which the probability of a site having a particular value depends on its neighbors' values. The distribution of a piecewise-smooth field can be modeled by a dual pair of MRFs, one representing the observed field values and the other representing the unobserved discontinuities (line process); the best interpretation of the field can then be found as the one maximizing the a posteriori (MAP) probability. Utilizing the equivalence of the MRF and the Gibbs distribution, the MAP formulation reduces to minimizing a regularization energy, which is often solved by stochastic relaxation. Blake and Zisserman show that similar formulations can be obtained by modeling piecewise smoothness using weak continuity [20], and they tackle the optimization problem using a graduated non-convexity (GNC) strategy. Their formulation is more compact, eliminating the line process, and their optimization strategy is more effective in practice than stochastic relaxation.

Shulman and Hervé [116] first point out that spatial discontinuities can be treated as outliers, and they propose an approach based on Huber's minimax estimator. This choice of estimator leads to a convex optimization problem which is relatively easy to solve. Black and Anandan propose a robust framework in which both the brightness and flow smoothness terms are modeled with robust estimators. They use redescending estimators, which suppress outliers more effectively than convex estimators, and solve the optimization problem by hierarchical continuation [15, 20]. Sim and Park adopt high-breakdown robust estimators to achieve even more effective outlier rejection [117] than the commonly adopted M-estimators [116, 15, 86]. Black and Rangarajan [18] unify the line process and robust statistics perspectives and suggest that the unified view can benefit both problem formulation and solution.

It is important to point out that, in refining optical flow formulations, computational complexity increases rapidly with model sophistication. This is especially true for global methods, which usually involve large-scale nonconvex optimization problems. There are two approaches to global optimization: stochastic and deterministic. Stochastic methods such as simulated annealing [42] make updates probabilistically to avoid local minima and use a temperature parameter to gradually dampen the randomness. They converge too slowly to be practically useful [88, 77, 14, 15, 23]. Deterministic methods such as continuation [15] and multigrid [86] assume a good initial flow estimate is available and make greedy updates towards a local minimum. The procedure can be multi-stage, resembling an annealing schedule. These methods have achieved more success in practice, but they have a limited capability for avoiding local minima and their performance depends on the initialization quality. Since global optimization is widely recognized as a powerful formulation technique for inverse problems, and computing technology looks promising for solving the associated numerical problems, developing global optimization algorithms has become a very hot topic in computer vision [23, 137, 111].

Brightness conservation violations

Phenomena violating the brightness constancy assumption have only been studied to a limited extent [10]. Transparency can be modeled by layered/mixture representations [12, 134, 17, 33], which assign to each pixel a set of ownership weights indicating how different surface layers contribute to the observed pixel brightness. Bergen et al. [12] first consider the problem of extracting two motions, induced by either transparency or motion discontinuity, from three image frames. They use an iterative algorithm to estimate one motion, perform a nulling operation to remove the intensity pattern giving rise to that motion, and then solve for the second motion. Variable illumination can be accommodated by deriving more complex brightness conservation models such as the linear model (see [116] for one example), or by matching less illumination-sensitive image features such as phase [38]. When violations comprise only a small fraction of the observations in an area, they can be treated as outliers in a robust estimation framework [15]. Considering that most real objects are opaque and global illumination variation is usually negligible over a small interval, we adopt a robust estimation framework in our study.

2.5 Error Analysis

Despite steady progress on robust visual motion analysis, accurate optical flow estimates are generally inaccessible. One reason is that, in making the assumptions necessary to turn the estimation problem into a well-posed one, errors are inevitably introduced by assumption violations. Even under the (unrealistic) condition that no violations are encountered, the estimate can have a large uncertainty due to the aperture problem [58]: brightness variation can be insufficient for uniquely determining the 2D velocity (Figure 2.1, also 1.2). The aperture problem shows the intrinsic ambiguity in visual motion perception: optical flow only approximates the projected image motion. In its most severe form, i.e., when the image is completely textureless, recovering the projected motion is impossible; more generally, optical flow estimates in regions of more appropriate texture have higher confidence. The sensitivity to assumption violations and to the aperture problem varies from technique to technique and from place to place in a visual field, and so does the uncertainty of the estimated optical flow. If subsequent applications are to make judicious use of such a flow field estimate, they must be equipped with error measurements indicating this uneven reliability [52].


Figure 2.1: Aperture problem: local information in an aperture might be insufficient to determine the 2D motion vector. Each circle is an aperture. (1) corner: reliable estimate; (2) boundary: normal flow only; (3) homogeneous region: ambiguous; (4) highly textured region: multiple solutions (aliasing).

Extracting the 2D velocity at a pixel requires exploiting a spatiotemporal image neighborhood of that pixel. This fact introduces correlation between errors in nearby flow estimates. Accounting for such error correlation, especially in a global formulation, is a daunting task for both optical flow error analysis and subsequent applications; therefore it has seldom been tackled. Most previous efforts seek to provide an error measure with each individual estimate by analyzing the error behavior of local methods. Barron et al. compare and modify a number of one-dimensional confidence measures and use them to select reliable optical flow estimates [7, 10]. Since errors in optical flow estimates are in general directional and anisotropic, a two-dimensional confidence measure, particularly the covariance matrix, is more appropriate and informative.

Performance analysis in computer vision is often carried out with covariance propagation [36, 52]. Haralick illustrates the derivation and application of covariance propagation theory for a wide variety of vision algorithms [52]. A trivial case is to propagate additive random perturbations through a linear system y = Tx with input x and output y, in which the output covariance Σ_y can be expressed in terms of the input covariance Σ_x as Σ_y = T Σ_x Tᵀ. The solution to the constraint equation Eq. 2.7 under the least-squares criterion is optimal when the only error source is additive iid noise in b (the temporal derivatives) with zero mean and variance σ_b². Under this assumption, the above conclusion applies and the covariance of the optical flow estimate is simply

    Σ_u = σ_b² (AᵀA)⁻¹        (2.9)

where σ_b² can be estimated from the residual errors r_i = I_{xi}u + I_{yi}v + I_{ti} as

    σ̂_b² = (1/(n − 2)) Σ_{i=1}^{n} r_i².

The error analysis of a local matching-based method by Szeliski [126] and that of a local spatiotemporal filtering method by Heeger [56] make similar assumptions and obtain similar results. The assumptions enabling the above derivation are apparently unrealistic, because (i) the spatial derivatives in A also contain noise, and (ii) errors in the derivatives are correlated due to the overlapping data supports used to compute them. Ignoring these factors biases both the velocity and covariance estimates. Efforts have been made to calculate unbiased estimates using generalized least-squares [90, 96, 95]. However, these methods bring little accuracy improvement at the cost of much heavier computation, because bias is a much weaker error source than variance [27] and outliers [15] in optical flow estimation. More details of the related work will be given in Section 3.3 to facilitate comparison with our methods. In our earlier work [141], we conducted an error analysis of the least-squares-based local estimation method using covariance propagation theory for approximately linear systems and small errors. In this thesis, we generalize the results to the newer robust method. Our analysis estimates image noise and derivative errors in an adaptive fashion, taking into account the correlation of derivative errors at adjacent positions. It is more complete, systematic and reliable than previous efforts.
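As a concrete illustration of Eq. 2.9, the sketch below computes the least-squares flow and its propagated covariance from a stacked constraint system. The use of the unbiased residual variance as σ̂_b² follows the text; the function name and everything else are our assumptions.

```python
import numpy as np

def ls_flow_with_covariance(A, b):
    """Least-squares flow u from A u ~ b plus its covariance (Eq. 2.9),
    assuming the only errors are iid, zero-mean noise in b."""
    n, m = A.shape                      # m = 2 for (u, v), so n - m = n - 2
    AtA_inv = np.linalg.inv(A.T @ A)
    u = AtA_inv @ A.T @ b
    r = b - A @ u                       # residuals r_i
    sigma_b2 = (r @ r) / (n - m)        # unbiased estimate of sigma_b^2
    return u, sigma_b2 * AtA_inv        # Sigma_u = sigma_b^2 (A^T A)^{-1}
```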

2.6 Hierarchical Processing

Recall that gradient-based constraints are valid only for small image motion; in practice this typically means below 2 pixels/frame. While matching-based and frequency-based formulations may cope with larger motion, the computational burden and the chance of false matches (aliasing) increase rapidly with the search range. A general way of circumventing the large-motion and aliasing problems is to adopt a hierarchical, coarse-to-fine strategy [12, 14]. The basic idea of hierarchical processing is to construct a pyramid representation [26] of an image sequence in which higher levels of the pyramid contain filtered and sub-sampled versions of the original images. Going up the pyramid, the image resolution decreases and the motion magnitude reduces proportionally. When a certain level of reduction is reached, the motion becomes small enough for estimation. Computation then proceeds in a top-down fashion: on each level the incremental flow is estimated, added to the initial value, and projected down to the next lower level as its initial value. This process continues until the flow in the original images is recovered. In what follows, we describe the implementation of the hierarchical process that is used in our algorithms. Much of the recipe is adapted from [14].

• Gaussian pyramid construction. We create a P-level image pyramid I^p, p = 0, …, P−1, with I^0 the original sequence. Each upper-level image sequence I^{p+1} is a smoothed and sub-sampled version of the images I^p one level below, expressed as

    I^{p+1}(x/2, y/2) = (f ∗ I^p)(x, y),  ∀ (x, y) at level p

where f is a 3 × 3 Gaussian filter, "∗" represents convolution, and the resolution reduction rate is 2, which means each upper-level image is one-fourth the size of its ancestor.

• Flow projection with interpolation. Once the optical flow field u^p is available at level p, it is projected down to level p − 1. The simplest projection scheme is "projection with duplication": u^{p−1}(x, y) = 2u^p(⌊x/2⌋, ⌊y/2⌋), ∀ (x, y) at level p − 1. To reduce the blocky effect, we use "projection with interpolation":

    u^{p−1}(2x, 2y) = 2u^p(x, y),  ∀ (x, y) at level p,
    u^{p−1}(x, y) = ¼ [u^{p−1}(x−1, y−1) + u^{p−1}(x−1, y+1) + u^{p−1}(x+1, y−1) + u^{p−1}(x+1, y+1)],  for all other (x, y) at level p − 1.

construct image pyramid I^p, p = 0, …, P−1;
I_w^{P−1} ← I^{P−1};
for (p : P−1 → 0) {
    estimate residual flow ∆u^p;
    current total flow: u^p ← u^p + ∆u^p;
    stop if (p = 0);
    project flow u^p to level p−1, yielding u^{p−1};
    warp I^{p−1}, yielding I_w^{p−1};
}

Figure 2.2: Hierarchical processing

• Image warping. Given the flow field u^p that explains the motion from image I^p(t) to image I^p(t + dt), the image I^p(t + dt) can be warped to remove (compensate for) the motion such that the two images are almost aligned. Using "backward warping", the stabilized version of I^p(t + dt) is defined as

    I_w^p(x, y, t + dt) = I^p(x + u(x, y), y + v(x, y), t + dt).

(x + u(x, y), y + v(x, y)) usually does not fall on a regular grid; we use bilinear interpolation to estimate its intensity value. The warped images I_w^p(t), I_w^p(t + dt) exhibit the residual motion ∆u^p.

• Motion estimation. The residual motion is estimated from the warped sequence: ∆u^p ← I_w^p(t), I_w^p(t + dt), and the overall motion at level p is the sum of the projected and residual motion: u^p ← u^p + ∆u^p.

At the top level, the initial (projected) motion is assumed to be zero and the warped sequence is the same as the pyramid sequence: I_w^{P−1} ← I^{P−1}. Finally, the procedure of the hierarchical, coarse-to-fine framework is given in Figure 2.2.

Hierarchical schemes like the above have been used in a wide variety of motion estimation algorithms, but their limitations [9] are often overlooked: (i) the blind projection and warping operations may extrapolate and interpolate across motion boundaries; (ii) in the top-down fashion, errors produced at coarser levels are magnified and propagated to finer levels and are generally irreversible [14]. Solving the first problem again brings up the estimation-segmentation dilemma. Correcting errors made at coarser levels requires multi-resolution schemes that can also propagate results in a bottom-up fashion [102]. Since each additional pyramid level introduces new sources of error, the number of levels should be large enough to allow incremental flow estimation but no larger. Appropriately choosing the number of levels is a difficult problem that has been addressed only to a very limited extent [9]. Most current techniques, including ours, determine the number empirically.
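For reference, a minimal sketch of the backward-warping step with bilinear interpolation is given below; the array layout, boundary clamping and function name are our assumptions, not the thesis implementation.

```python
import numpy as np

def backward_warp(I_next, u, v):
    """Backward-warp image I_next (at time t+dt) by flow (u, v) so that it
    aligns with the frame at time t: I_w(x, y) = I(x + u, y + v, t + dt)."""
    h, w = I_next.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # source coordinates; clamp so the 2x2 interpolation stencil stays inside
    sx = np.clip(xs + u, 0, w - 1.001)
    sy = np.clip(ys + v, 0, h - 1.001)
    x0, y0 = sx.astype(int), sy.astype(int)
    ax, ay = sx - x0, sy - y0
    # bilinear interpolation of the four neighboring pixels
    return ((1 - ay) * ((1 - ax) * I_next[y0, x0] + ax * I_next[y0, x0 + 1])
            + ay * ((1 - ax) * I_next[y0 + 1, x0] + ax * I_next[y0 + 1, x0 + 1]))
```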

Chapter 3

LOCAL FLOW ESTIMATION AND ERROR ANALYSIS

This chapter considers the problem of finding the most representative translation within a small spatiotemporal image neighborhood and presents new algorithms to address the accuracy, efficiency and uncertainty-measuring issues involved. In particular, (i) the popular local gradient-based approach is reformulated as a two-stage regression problem, appropriate robust estimators are identified for both stages, and an adaptive scheme is introduced into derivative evaluation to obtain sharp motion boundaries; (ii) a deterministic algorithm for high-breakdown robust regression in visual reconstruction is proposed, and its effectiveness is demonstrated at the optical flow constraint solving stage; and (iii) an error analysis is carried out by covariance propagation; it accounts for spatially varying image noise and derivative errors and for the correlation of derivative errors at adjacent positions, and it provides a reliable measure of the estimation uncertainty. This chapter is composed of three sections dedicated to these three topics respectively. Experimental results on both synthetic and real data are given in each individual section.

3.1 A Two-Stage-Robust Adaptive Technique

The gradient-based local regression approach to optical flow estimation has become very popular because of its good overall accuracy and efficiency. Despite varying formulations, methods of this type are generally composed of two stages: derivative estimation and optical flow constraint (OFC) solving. Both stages involve optimization by pooling information in a certain neighborhood, and both are regression procedures in nature. Classical techniques solve both regression problems in a least-squares (LS) sense [7]. In places where the motion is multi-modal, their results can be arbitrarily bad. To cope with this problem, a few robust regression tools such as M-estimators [15, 86] and least-median-of-squares (LMedS) estimators [5, 97] have been introduced at the OFC stage. By carefully analyzing the characteristics of the optical flow constraints and comparing the strengths and weaknesses of different robust regression tools [138, 106, 105, 104], we identify the least-trimmed-squares (LTS) technique as more appropriate for the OFC stage. Meanwhile, although it is a very similar information pooling step, derivative calculation has seldom received proper attention in optical flow estimation. Crude (least-squares-based) estimators are widely used in the hope that derivative estimation errors can be averaged out or treated as outliers in the OFC regression stage. However, as illustrated in Figure 3.5, near motion boundaries derivative evaluation can completely fail and most of the constraints become outliers; in such a situation, no matter what robust tool is employed, OFC regression breaks down and motion boundaries cannot be preserved. Pointing out this limitation, we use a 3D facet model to formulate derivative estimation as an explicit regression problem, which can be robustified when the LS technique fails. We choose an LTS estimator for robust facet model fitting. LTS is costly, and it may yield less accurate estimates where there are no outliers and LS suffices; therefore, it should be applied only when necessary. We calculate a confidence measure for each estimate from the LTS OFC step, and update the derivatives and the flow vector if the measure takes a small value. In this way the one-stage and two-stage robust methods are carried out adaptively. Preliminary experimental results show that this adaptive LTS scheme permits correct flow recovery even at immediate motion boundaries. Below we provide details of the two-stage-robust adaptive scheme. We start by introducing robust regression, which is the backbone of the proposed method and will be extensively exploited in the rest of the thesis.

3.1.1 Linear Regression and Robustness

A linear regression model relates the output of a system y_i, i = 1, …, n to its m-dimensional input x_i = (x_{i1}, …, x_{im})ᵀ by a linear transform with an additive noise term ξ_i, i.e.,

    y_i = x_{i1}θ_1 + x_{i2}θ_2 + … + x_{im}θ_m + ξ_i,

or in a more compact form,

    y_{n×1} = X_{n×m} θ_{m×1} + ξ_{n×1}.        (3.1)

With sufficient data points (X; y) collected (n ≫ m), the model parameters θ can be estimated by minimizing a scalar criterion function F(r):

    θ̂ = argmin_θ F(r),

where r is the residual fitting error

    r = y − ŷ = y − Xθ̂.

The criterion function F (r) differs among estimators depending on what error models are assumed.

Least-squares estimator

The least-squares estimator uses the quadratic error function

    F(r) = ‖r‖² = Σ_{i=1}^{n} r_i²

and has the closed-form solution θ̂ = (XᵀX)⁻¹Xᵀy.

It is optimal only if X is error-free and ξ_i is iid Gaussian with zero mean and variance σ². When either condition is significantly violated, the least-squares estimate can be completely disrupted. There are two major types of significant model violations, or gross errors: those caused by bad y values are called y-outliers, and those caused by errors in X are leverage points. The performance of a regression estimator is usually characterized by its statistical efficiency and its breakdown point. Simply put, statistical efficiency indicates the accuracy (in terms of estimate variance) when no gross error is present, and the breakdown point is the smallest fraction of contamination that can cause the estimator to take on values arbitrarily far from the truth. These two factors usually trade off against each other; a good regression tool should score high on both. The reason for the poor accuracy of least squares in many situations is its 0% breakdown point, which means that a single outlier can lead to arbitrarily wrong estimates. The goal of robust regression is thus to develop regression tools that are relatively insensitive to gross errors while maintaining sufficiently high statistical efficiency [106, 138].

M-estimators

An M-estimator uses the criterion function

    F(r) = Σ_{i=1}^{n} ρ(r_i, σ_i)

where the σ_i are scale parameters for the ρ-function. It includes the least-squares estimator as the special case in which ρ is the L2 (quadratic) error norm. The impact of each datum on the overall solution is measured by the influence function ψ(x, σ) = ∂ρ(x, σ)/∂x. The least-squares estimator has ψ_LS(x, σ) = 2x/σ², which allows an outlier to introduce infinite bias into the estimate. One way to reduce outlier influence is to adopt a less drastic error norm, e.g., the Geman-McClure error norm [43]

    ρ_GM(x, σ) = x² / (x² + σ²).

Its ρ and ψ curves are compared to those of the L2 norm in Figure 3.1. The Geman-McClure error norm saturates at 1 as the error increases. Its ψ function is bounded and redescending: the influence of small errors is almost linear while that of abnormally large ones tends to zero.

Finding an M-estimate is a nonlinear minimization problem. It is usually solved by iteratively reweighted least-squares,

    θ̂^{(k)} = argmin_θ Σ_{i=1}^{n} w(r_i^{(k−1)}) r_i²

where the superscript (k) designates the iteration number, the weight function is defined by w(x) = ψ(x)/x, and r_i is the residual evaluated with the current estimate. M-estimators are resistant to y-outliers and have relatively high statistical efficiency, but they meet with computational difficulties such as initial-guess dependency and non-convexity (for redescending estimators), have a low breakdown point (about 1/(m + 1)), and are vulnerable to leverage points [106, 138].
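The reweighting loop is easy to state in code. The sketch below is a generic IRLS solver with Geman-McClure weights, for which w(x) = ψ(x)/x = 2σ²/(x² + σ²)²; the fixed scale, iteration count and function name are our assumptions.

```python
import numpy as np

def irls_geman_mcclure(X, y, sigma=1.0, iters=20):
    """M-estimation by iteratively reweighted least-squares with the
    Geman-McClure norm rho(x) = x^2 / (x^2 + sigma^2), whose weight
    function is w(x) = psi(x)/x = 2*sigma^2 / (x^2 + sigma^2)^2."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # LS initial guess
    for _ in range(iters):
        r = y - X @ theta
        w = 2.0 * sigma**2 / (r**2 + sigma**2) ** 2
        sw = np.sqrt(w)
        # weighted LS step: minimize sum_i w_i r_i^2
        theta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
    return theta
```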

High-breakdown robust estimators

Two popular high-breakdown robust estimators are the least-median-of-squares (LMedS) estimator and the least-trimmed-squares (LTS) estimator [106]. The LMedS estimator

    θ̂ = argmin_θ med_{i=1}^{n} r_i²        (3.2)

(a) ρ(x, σ)   (b) ψ(x, σ) = ρ′(x, σ)

Figure 3.1: Comparison of Geman-McClure norm (solid line) and L2 norm (dashed line): (a) error norms (σ = 1), (b) influence function.

overcomes most limitations of M-estimators: it is resistant to both types of gross errors, has a breakdown point as high as 50%, does not need an initial guess, and is guaranteed to converge. However it has extremely low statistical efficiency, which means that it tends to have very large estimation variances when no gross error is present. The LTS estimator was introduced to repair the low efficiency of LMedS. It is defined as

    θ̂ = argmin_θ Σ_{i=1}^{h} (r²)_{i:n}        (3.3)

where h < n and (r²)_{1:n} ≤ … ≤ (r²)_{n:n} are the ordered squared residuals. LTS allows the fit to stay away from the gross errors by excluding the largest squared residuals from the summation. Possessing almost all the merits of LMedS plus better statistical efficiency, LTS is considered preferable to LMedS [104, 103].

High-breakdown estimators usually do not have closed-form solutions and are approximated by Monte-Carlo-like algorithms [106]. A pool of trial solutions is constructed by p random draws from the C_n^m possible m-subsets, each yielding an exact solution and a corresponding criterion value; the one with the minimum value is picked as the solution. The value p is chosen so that the probability of having at least one good subset,

    1 − (1 − (1 − ε)^m)^p,        (3.4)

where ε is the fraction of outliers (up to 50%), is close to 1. The randomness in the solution is obvious, especially when p is chosen small. A subsequent weighted least-squares (WLS) step is recommended to enhance the statistical efficiency. In particular, a preliminary error scale is defined as σ̂ = C√F(r), where C makes σ̂ roughly unbiased under a Gaussian error distribution [105]; then regression outliers, which have |r_i/σ̂| > 2.5, are removed. Finally a WLS estimate is calculated from the inliers as

    θ̂ = argmin_θ Σ_{i=1}^{n} w_i r_i²        (3.5)

and a more efficient scale estimate is given by the sample variance of the inliers. According to the above recipe, LTS takes slightly longer to compute than LMedS, since finding the smallest n/2 numbers is more costly than finding the median of n numbers. A new algorithm for approximating LTS, so-called FAST-LTS, has been introduced recently; it runs faster than all programs for LMedS and makes LTS the preferred choice of high-breakdown robust estimator. What enables FAST-LTS is the concentration property of LTS: starting from any approximate LTS estimate θ_old and its associated criterion value Q_old, it is possible to compute another approximation θ_new yielding an even lower criterion value Q_new [104]. In algorithmic terms, the C-step can be described as follows.

Given the h-subset H_old:

• compute θ̂_old ← LS estimate from H_old

• compute the residuals r_old(i) for i = 1, …, n

• sort the absolute values of these residuals, which yields a permutation π for which |r_old(π(1))| ≤ |r_old(π(2))| ≤ … ≤ |r_old(π(n))|

• put H_new ← {π(1), π(2), …, π(h)}

• compute θ̂_new ← LS estimate from H_new

The C-step can iterate until convergence. It speeds up LTS computation by providing a more efficient way of selecting trial solutions than random sampling.
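A direct transcription of the C-step, under the same assumptions (LS fits on h-subsets, squared-residual criterion), might look as follows; the iteration control and names are ours.

```python
import numpy as np

def c_step(X, y, H, h):
    """One concentration step for LTS: refit on the current h-subset,
    then keep the h observations with the smallest absolute residuals."""
    theta, *_ = np.linalg.lstsq(X[H], y[H], rcond=None)
    r = y - X @ theta
    H_new = np.argsort(np.abs(r))[:h]      # permutation pi, truncated at h
    Q = np.sum(np.sort(r**2)[:h])          # LTS criterion value of theta
    return H_new, Q

def fast_lts_inner(X, y, H0, h, max_iter=50):
    """Iterate C-steps from an initial h-subset until the criterion stops
    decreasing (a sketch of the FAST-LTS inner loop)."""
    H, Q = c_step(X, y, H0, h)
    for _ in range(max_iter):
        H_next, Q_next = c_step(X, y, H, h)
        if Q_next >= Q:
            break
        H, Q = H_next, Q_next
    return H, Q
```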

Estimator   Criterion F(r)                Statistical Efficiency   Breakdown Point   Y-Outliers   Leverage Points   Solution Technique
LS          Σ_{i=1}^{n} r_i²              High                     0%                No           No                Closed-form
M           Σ_{i=1}^{n} ρ(r_i)            High                     100/(1 + m)%      Yes          No                Approximate
LMedS       med_{i=1}^{n} r_i²            Low                      50%               Yes          Yes               Approximate
LTS         Σ_{i=1}^{n/2} (r²)_{i:n} ᵃ    Low                      50%               Yes          Yes               Approximate

ᵃ (r²)_{1:n} ≤ ⋯ ≤ (r²)_{n:n}: ordered squared residuals

Table 3.1: Comparison of four popular regression criteria (estimators)

To summarize the above discussion, properties of four popular estimators are given in Table 3.1 for a regression problem of n equations and m unknowns.

3.1.2 Two-Stage Regression Model

In this section we show that both derivative estimation and optical flow constraint solving stages in the gradient-based local approach can be formulated as linear regression problems.

Optical flow constraint

Following Haralick and Lee [53], we constrain the optical flow vector u = (u, v)ᵀ at location (x, y, t)ᵀ by

    Au + ξ = b        (3.6)

where

    A = [ I_x   I_y  ]        b = − [ I_t  ]
        [ I_xx  I_xy ]              [ I_xt ]
        [ I_yx  I_yy ]              [ I_yt ]
        [ I_tx  I_ty ]              [ I_tt ]

We further assume that the flow vectors in each small neighborhood of N pixels are constant, so that each vector u conforms to N sets of constraints simultaneously. This constitutes our optical flow constraint [144], a linear regression model

    A_s u + ξ = b_s        (3.7)

where A_s = (A_1ᵀ, A_2ᵀ, …, A_Nᵀ)ᵀ, b_s = (b_1ᵀ, b_2ᵀ, …, b_Nᵀ)ᵀ, and each pair A_i, b_i are the A, b defined by Eq. (3.6) at pixel i, i = 1, …, N. In our experiments, we choose the constant-flow neighborhood size to be 5 × 5, so N = 25. Compared to the first-order constraint Eq. 2.7, this mixed-order constraint has the advantage that a large number of equations are provided on a small data support (100 equations in a 9 × 9 × 5 neighborhood). Such compactness is desirable because a smaller neighborhood size means less chance of encountering multiple motions, and a larger sample size brings higher statistical efficiency. Although second-order constraints alone are often avoided due to derivative quality concerns [7], we argue that they are beneficial when used together with first-order constraints under a robust criterion, because (i) they are automatically assigned smaller weights, owing to the fact that second-order derivatives normally take much smaller values than first-order derivatives in real imagery, and (ii) outliers among them can be ignored under the robust criterion. In addition, experiments show that second-order derivatives of reasonable accuracy can be obtained from the facet model.
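The assembly of the stacked system in Eq. 3.7 might look as follows; the derivative-image dictionary, its key names and the function name are illustrative assumptions.

```python
import numpy as np

def stacked_constraints(derivs, y, x, half=2):
    """Assemble the mixed-order system A_s u = b_s (Eq. 3.7) over a
    (2*half+1)^2 constant-flow window centered at (y, x).

    derivs: dict of derivative images with keys such as 'Ix', 'Iy', 'It',
    'Ixx', 'Ixy', 'Iyx', 'Iyy', 'Ixt', 'Iyt', 'Itx', 'Ity', 'Itt'.
    Returns A_s (4N x 2) and b_s (4N,), four equations per pixel.
    """
    rows_A, rows_b = [], []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            j, i = y + dy, x + dx
            rows_A.append(np.array([
                [derivs['Ix'][j, i],  derivs['Iy'][j, i]],
                [derivs['Ixx'][j, i], derivs['Ixy'][j, i]],
                [derivs['Iyx'][j, i], derivs['Iyy'][j, i]],
                [derivs['Itx'][j, i], derivs['Ity'][j, i]],
            ]))
            rows_b.append(-np.array([derivs['It'][j, i], derivs['Ixt'][j, i],
                                     derivs['Iyt'][j, i], derivs['Itt'][j, i]]))
    return np.vstack(rows_A), np.concatenate(rows_b)
```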

Derivatives From the Facet Model

The facet model characterizes each small image data neighborhood by a signal model and a noise model [54]. Low-order polynomials are the most commonly used signal form. We use a 3D cubic polynomial for derivative estimation [143]. Here "3D" means that the polynomial is a function of the spatiotemporal variable (x, y, t); "cubic" means that the highest order of a term is 3. The facet model finds the polynomial coefficient vector a from the linear regression model

    Da + ξ = J        (3.8)

where J is the observed image data vector (formed by traversing the neighborhood data lexicographically), and D is the design matrix composed of the 20 canonical polynomial bases (1, x, y, t, x², …, xyt). We use a facet model neighborhood size of 5 × 5 × 5, so D has dimension 125 × 20. Once a is found, the spatiotemporal derivatives are merely scaled versions of its elements. More details about derivatives from the facet model can be found in our earlier work [143, 141, 146]. The most popular derivative estimators in optical flow estimation are neighborhood masks.

They essentially come from facet models [54] of different dimension (1D, 2D or 3D), order (1st, 2nd or 3rd) or neighborhood size (2, 3 or 5). For example, the four-point central difference mask (−1, 8, 0, −8, 1)/12 that Barron et al. use [7] is actually a 1D cubic facet model on a neighborhood of 5 pixels [146]. Our facet model outperforms it on most image sequences.
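A least-squares facet fit and derivative read-out can be sketched as below; the basis ordering, neighborhood handling and names are our assumptions, and the coefficient scaling conventions of [54] are omitted for brevity.

```python
import numpy as np
from itertools import product

def facet_first_derivatives(volume, center, size=5, order=3):
    """LS fit of a 3D cubic facet model (Eq. 3.8) around `center` and
    read the first derivatives off the coefficients; a sketch only."""
    half = size // 2
    cy, cx, ct = center
    # canonical monomials x^a y^b t^c with a+b+c <= order (20 for cubic)
    exps = [(a, b, c) for a in range(order + 1) for b in range(order + 1)
            for c in range(order + 1) if a + b + c <= order]
    coords = list(product(range(-half, half + 1), repeat=3))
    D = np.array([[x**a * y**b * t**c for (a, b, c) in exps]
                  for (x, y, t) in coords])                 # 125 x 20 design
    J = np.array([volume[cy + y, cx + x, ct + t] for (x, y, t) in coords])
    a_hat, *_ = np.linalg.lstsq(D, J, rcond=None)
    # at the neighborhood center, I_x, I_y, I_t are the x, y, t coefficients
    return (a_hat[exps.index((1, 0, 0))],
            a_hat[exps.index((0, 1, 0))],
            a_hat[exps.index((0, 0, 1))])
```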

3.1.3 Choosing Estimators

In this section, we analyze the characteristics of the two regression problems and identify appropriate regression estimators for them.

Solving OFC by LTS.LS

We observe that (i) both y-outliers and leverage points can occur in Eq. (3.7), because both A_s and b_s are composed of derivative estimates; (ii) leverage points are roughly twice as likely as y-outliers due to the size contrast of A_s and b_s; (iii) a significant portion of the constraints can be gross errors when, for example, multiple motion modes occur in a neighborhood; and (iv) the number of constraints is relatively small. Therefore the desired estimator for the OFC stage should be resistant to both types of gross errors and have a high breakdown point and good statistical efficiency on a small sample size. M-estimators [15, 86] and LMedS estimators [97, 5] were previously used at the OFC stage. M-estimators are resistant to y-outliers and have relatively high statistical efficiency, but they have a low breakdown point of about 1/(1 + m) and are vulnerable to leverage points. The LMedS estimator [105] is resistant to both types of gross errors and has a high breakdown point of 50%, but it has extremely low statistical efficiency, which means it tends to perform poorly when there is no gross error (Table 3.1). Possessing almost all the merits of LMedS plus better statistical efficiency, LTS is preferred to LMedS [138, 103, 104]. We use least-trimmed-squares followed by (weighted) least-squares to solve the optical flow constraint, and call the procedure "LTS.LS".

LS or LTS: Adaptive Derivative Estimation

By default we solve the 3D cubic facet model in an LS sense to find the derivatives. When the estimation quality is poor, we update the derivatives by robust facet model fitting. To reduce computation and prevent over-fitting, we use a 3D quadratic facet model for this purpose. As the dimension of the parameter vector a is as large as 10, and the breakdown point has to be high, the LTS estimator is again a better choice than M- and LMedS estimators. Unlike the OFC stage, where we estimate two parameters from 100 constraint equations, this stage has 10 parameters but only 125 constraint equations. With such a small sample size, WLS can hardly improve the results of LTS, so we use plain LTS for robust facet fitting.

Note that it would not be best to apply LTS facet model fitting uniformly, because LTS tends to have lower statistical efficiency than LS when there is no gross error, and it involves much more computation. Therefore the LTS facet model should be used when and only when estimation fails because of poor LS facet quality. We take the coefficient of determination (R²) [105] from the LTS.LS OFC step as a confidence measure of the flow estimate. R² measures the proportion of observation variability explained by the regression model. Here it is defined as

    R² = 1 − (Σ_{i∈inliers} r_i²) / (Σ_{i∈inliers} y_i²).

We detect poor flow estimates as those having R² < T, and apply robust facet model fitting to improve them. It is worth mentioning that our OFC stage and that of Bab-Hadiashar and Suter use a similar local optimization formulation, the difference being that they use LMedS while we use LTS as the regression tool. Both detect bad estimates by low R² values, but the treatments differ: they remove such estimates as unreliable, whereas we apply the two-stage LTS to improve their accuracy. Finally, a diagram of the proposed algorithm is given in Figure 3.2.
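The confidence test is a one-liner once the LTS inlier set is known (a sketch; the threshold T is set empirically in the experiments below).

```python
import numpy as np

def r_squared(r, y, inliers):
    """Coefficient of determination over the LTS inlier set, used as the
    confidence measure R^2 of the flow estimate."""
    return 1.0 - np.sum(r[inliers]**2) / np.sum(y[inliers]**2)

# a pixel is flagged for two-stage (LTS facet) refinement when
# r_squared(r, y, inliers) < T
```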

3.1.4 Experiments and Analysis

We demonstrate on both synthetic and real data how optical flow accuracy improves as the method is upgraded from purely LS-based (LS-LS) to one-stage robust (LS-LTS.LS) and two-stage robust (LTS-LTS.LS). We also compare our results with those from Bab-Hadiashar and Suter's technique (BS) [5], which applies LMedS at the OFC stage.

Image data → LS derivatives → OFC → optical flow and confidence; high confidence? Y: done; N: robust facet fitting → OFC

Figure 3.2: Block diagram of the two-stage-robust adaptive algorithm

The results were computed using their own C program, with all parameters set to their defaults. For fair comparison, the facet and OFC neighborhood sizes are fixed to 5 pixels for both techniques.

An illustrative example

We first use the synthetic data set in Figure 3.3 to demonstrate the necessity of robust regression in both stages. The image size is 32 × 32. The motions of the left and right halves are vertical and horizontal respectively, both at 1 pixel/frame. Since an optical flow constraint equation forms a line au + bv + c = 0 in the (u, v) coordinate system, with its distance to the true velocity an indicator of the degree of modeling imperfection, we use OFC cluster plots to visualize derivative quality and the results of different estimators. Three typical points, (5, 5), (5, 20) and (14, 17) in Figures 3.3 and 3.4, are closely examined. Their true velocities are marked by black dots in Figure 3.5. (5, 5) is a point where most derivatives are of good quality, as we can tell from the tight OFC cluster at the true velocity (Figure 3.5(a)). However, even in this favorable case, LS-LS yields only (0.9734, 0.0015), while LS-LTS.LS yields (numerically) exactly (1, 0). Of the 9 × 9 × 5 data support of point (5, 20), 1/9 conveys the left motion mode. Accordingly we observe a clear cluster at the true velocity and a small vague cluster at the left velocity (Figure 3.5(b)). LS is totally lost here, yielding a compromise of (0.5933, 0.518), as opposed to LS-LTS.LS, which gives (−0.0051, 0.9913). These two cases suggest that LS-LTS.LS significantly outperforms LS-LS at the OFC stage. In the above cases, the facet model fitting errors can be accommodated by robust OFC.

Figure 3.3: Central frame of the synthetic sequence (5 frames, 32 × 32)

Figure 3.4: Correct flow field

But this is not the case with (14, 17), a boundary point on the right side. Figure 3.5(c) shows constraint lines scattering around, with two very vague clusters at (0, 1) and (1, 0). The estimates from LS-LS, (−0.3937, 0.2482), and LS-LTS.LS, (0.0708, 0.1267), are both totally wrong. Here applying robust regression at the OFC stage alone no longer helps. The reason is that derivative estimation fails at most points and a large portion of the constraints become gross errors, so that no majority optical flow constraint model exists. Figure 3.5(d) shows the OFC plot from the robust facet model fitting. The major motion model becomes clear, so that LTS.LS yields a reasonably accurate estimate of (0.0109, 1.0000).

Translating Squares Sequence (TS)

Figures 3.6(a) and 3.6(b) show the central frame and the correct flow field of another synthetic sequence, Translating Squares (TS). It contains two squares translating at 1 pixel/frame. The image size is 64 × 64. LTS-LTS.LS is applied at places with R² < 0.99.

We calculate the error percentage as the quantitative accuracy measure: the error vector magnitude normalized by the true velocity magnitude and multiplied by 100. We report the average error percentage over the entire flow field (AEP) as well as that measured in the motion boundary area (AEPB).

(a) (5, 5): LS facet   (b) (5, 20): LS facet   (c) (14, 17): LS facet   (d) (14, 17): LTS facet

Figure 3.5: OFC cluster plots at three typical pixels. Each line represents a constraint equation. (5, 5): good derivative quality; (5, 20): a small number of bad derivatives; (14, 17): on a motion boundary, most derivative estimates are bad and robust facet fitting becomes necessary.

(a) Central frame   (b) Correct flow   (c) BS   (d) LS-LS   (e) LS-LTS.LS   (f) LTS-LTS.LS   (g) R² map

Figure 3.6: TS sequence results. Flow field estimates are subsampled by 2. Estimates with error percentages larger than 0.1% are shaded.

Technique     AEP (%)   AEPB (%)
LS-LS         18.83     50.61
BS             8.03     26.24
LS-LTS.LS      7.53     24.59
LTS-LTS.LS     4.75     15.51

Table 3.2: TS sequence: comparison of average error percentages

(a) Central frame (b) BS (c) LS-LTS.LS (d) LTS-LTS.LS

Figure 3.7: Pepsi sequence central frame and horizontal flow (darker pixels indicate larger speeds to the left).

The motion boundary area is defined as a 9-pixel-wide band. Since the spatiotemporal data support for each flow estimate is 9 × 9 × 5, outside this band there are no outliers from motion boundaries at either the derivative or the OFC stage. The AEP and AEPB values are summarized in Table 3.2. The flow fields estimated by the four algorithms are given in Figure 3.6. To facilitate visual comparison, we shade estimates with error percentages larger than 0.1%. To keep the flow field plots from being too crowded, we subsample them by 2 in both the x and y directions. We observe from the results that (i) robust methods outperform LS methods, (ii) LTS is slightly better than LMedS at the OFC stage, and (iii) LTS derivative estimation significantly reduces boundary errors.

The Pepsi Sequence

This is a real image sequence in which a Pepsi can and the background move approximately 0.8 and 0.35 pixels to the left, respectively (Figure 3.7(a)).

(a) BS (b) LS-LTS.LS (c) LTS-LTS.LS

Figure 3.8: Pepsi: estimated flow fields

We show the subsampled flow fields of the techniques in Figure 3.8 and the (linearly scaled) horizontal flow values in Figure 3.7. BS's result (Figures 3.8(a), 3.7(b)) has significant vertical speed components in the upper-left and lower parts, and the flow is still over-smoothed. Figures 3.8(b), 3.7(c) show the result of LS-LTS.LS (first- and second-order constraints); motion contrast and discontinuities are much clearer. LTS-LTS.LS (Figures 3.8(c), 3.7(d)) updates the LS-LTS.LS estimates with R² < 0.75 and further improves the boundary accuracy.

Discussion

The primary contribution of the above work is that it formulates optical flow estimation as two regression problems and adaptively solves them using one-stage or two-stage LTS methods. Preliminary experimental results on both synthetic and real image sequences verify its effectiveness. Since derivative estimation is a fundamental step in many computer vision problems, and most optimization problems can be cast in the regression framework, the conclusions of this work may extend to other fields. A limitation of the proposed method lies in its high computational cost, induced by uniformly applying an expensive high-breakdown robust estimator at both regression stages. In the next section, we exploit the piecewise-smooth property of visual fields to develop a deterministic algorithm whose complexity adapts to the degree of local outlier contamination. It converges faster and achieves more stable accuracy than the random-sampling-based algorithm.

3.2 Adaptive High-Breakdown Robust Methods For Visual Reconstruction

Visual reconstruction is the process of recovering the underlying true visual field from a noisy observation [20]. It includes many fundamental tasks in early vision such as image restoration, 3D surface reconstruction, stereo matching and optical flow estimation. What permits the reconstruction is the piecewise-continuity property of a visual field, which is often imposed through local parametric models [55, 53]. In recent years, many robust methods have been employed to solve the associated regression problems [13, 121, 75]; among them, those based on high-breakdown robust criteria [106, 85], e.g. least-median-of-squares and least-trimmed-squares, have reported the best accuracy. High-breakdown criteria usually have no closed-form solutions, so certain approximation schemes must be used. Different approximation methods may lead to very different accuracy and convergence rates, and research is still going on in the statistics community to find more appropriate methods [104]. So far almost all high-breakdown robust methods in visual reconstruction applications [121, 75, 5, 97, 117, 145] adopt the random-sampling-based algorithm outlined by Rousseeuw and Leroy [106]: the estimate with the best criterion value is picked from a random pool of trial estimates, and the algorithm is applied uniformly to all pixels in an image. The generic scheme is summarized in Figure 3.9. Using the same number of trial subsets p at all locations raises both efficiency and accuracy concerns. According to Eq. 3.4, p must be chosen large enough to ensure a high breakdown point. Since evaluating the criterion value F is an expensive operation, a large p value incurs a heavy computational burden. Meanwhile, much of the burden is unnecessary, because most places in a normal visual surface have few outliers and the above complicated process often ends up generating least-squares estimates. If a priority is placed on saving computation and a smaller p value is chosen, the probability of locking onto the correct solution suffers, especially at locations with significant contamination.

choose number of subsets p according to Eq. 3.4;
for all pixels {
    for p subsets {
        compute LS solution u_LS;
        store criterion value F;
    }
    select solution with best criterion value;
    compute WLS solution from Eq. 3.5;
}

Figure 3.9: Random sampling based algorithm for high-breakdown robust estimators

What is needed, in order to circumvent the efficiency-accuracy predicament, is an adaptive scheme which performs least-squares estimation when no outlier is present and increases the p value as the noise contamination becomes more severe. This does not seem possible for an isolated regression problem, in which no prior information about outlier contamination is available. But in visual reconstruction problems, we can exploit the piecewise-smooth property of visual surfaces to achieve the adaptiveness.

3.2.1 The Approach

We now present an adaptive algorithm for high-breakdown robust visual reconstruction, using as an example the problem of estimating piecewise-constant optical flow from noisy measurements of first-order derivatives, i.e., solving the Lucas-Kanade constraint equation Eq. 2.7 at all pixel locations under a high-breakdown robust criterion. Observing that in normal image sequences (i) the majority of pixel locations have no outliers and least-squares estimates are reasonably good, and (ii) flow fields are smooth and nearby estimates have similar values (true even at motion boundaries), we initialize the flow field using least-squares estimates, and then iteratively generate trial solutions for each pixel using its neighbors' values. Given a trial solution, which may come from either the least-squares initialization or a neighbor's value, we identify the subset of local constraints consistent with it and calculate a new solution following the weighted least-squares (WLS) procedure (Eq. 3.5). In this way, we obtain an updated trial solution which represents the local constraints more closely. We then compute its criterion value and retain this trial solution if it achieves the best criterion value so far. The algorithm is described in Figure 3.10.

for all pixels {
    compute LS estimate V_LS;
    V, F ← WLS on V_LS;
}
while #{pixels updated} > 0 {
    for all pixels {
        for all its neighbors V_n {
            if (V_n updated and |V_n − V| > T) {
                V_try, F_try ← WLS on V_n;
                if (F_try < F) { update V, F; }
            }
        }
    }
}

Figure 3.10: Adaptive algorithm for high-breakdown robust estimators


Because of the "locking" capability of the WLS update step, similar neighbor values usually result in very close or identical trial solutions, and hence not all neighboring values need to be borrowed. Also, to expedite convergence, it is better to use neighbors a few pixels away rather than immediate ones. We therefore use the four neighbors in the N, S, E, W directions that are w/2 pixels away, where w is the window size of the constant local flow.

In this approach, the piecewise-smoothness property of the visual field and the selection capability of robust estimators are exploited to produce trial solutions in a much more educated way than in random sampling methods. The complexity of the estimator varies with the local structure: least-squares solutions are used where no outlier is present, and the number of trials increases with the outlier percentage. The adaptive nature is revealed in Figure 3.11, which shows the number of trials at each pixel as an intensity image for the TS sequence (see the description in Section 3.1.4). More trials, indicated by brighter pixels, are carried out closer to the boundary, where the structure is more complex. The trial set size ranges between 1 and 13 in this case, as opposed to the uniform p = 30 in the random-sampling-based algorithm [5].
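One WLS "locking" update, as used for each borrowed trial solution, might be sketched as follows under an LMedS criterion; the robust scale constant 1.4826 and the 2.5σ inlier cut are standard choices, and the function name is ours.

```python
import numpy as np

def wls_update(A, b, v0, cut=2.5):
    """Refit from trial solution v0: keep the constraints consistent with
    v0, solve LS on them, and return the new solution with its criterion."""
    r = b - A @ v0
    scale = 1.4826 * np.median(np.abs(r)) + 1e-12   # robust residual scale
    inliers = np.abs(r) <= cut * scale
    v, *_ = np.linalg.lstsq(A[inliers], b[inliers], rcond=None)
    F = np.median((b - A @ v) ** 2)                 # LMedS criterion value
    return v, F
```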

3.2.2 Experiments and Analysis

We calculate derivatives from a first-order spatiotemporal facet model on a support of size 3 × 3 × 3 [145], and solve the optical flow constraint under the LMedS criterion. Optical flow is estimated on the middle frame of every three frames. A hierarchical scheme [12] is adopted to handle large motions, with the number of pyramid levels determined empirically. We carefully handle boundary cases so that the resulting flow field is of the same size as the original image.

Comparison is made with modified versions of Lucas and Kanade's LS-based method (LK) [82] and Bab-Hadiashar and Suter's random-sampling LMedS-based method (BS) [5]. Their original implementations do not include a hierarchical process, and their derivatives are estimated differently. To emphasize the contrast between the regression methods, we implemented LK and BS by modifying the code of our algorithm. No pre-smoothing is done and the constant-flow window size is fixed at 9 × 9. In BS, p = 30 random subsets are drawn, as [5] suggests. Experiments are carried out on a PIII 500 MHz PC running Solaris. The vector plots below are appropriately subsampled and scaled to facilitate visual inspection.

Five image sequences with flow ground truth are used for quantitative comparison. Two error measures are reported. One is the angular error e_∠ used in [7]; it is defined on the augmented flow vector ũ = (uᵀ, 1)ᵀ (normalized to unit length) as arccos(ũ · ũ_0), where u_0 is the correct flow vector. The other is the error vector magnitude measure e_{|·|} = |u − u_0|/|u_0|. We also report the consumed CPU time in seconds to give a rough idea of the speed contrast.
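Both measures are straightforward to compute; the sketch below states them exactly as defined above (the unit normalization in the angular error follows the convention of [7]).

```python
import numpy as np

def angular_error_deg(u, u0):
    """Angular error between estimated flow u and true flow u0, computed
    on the unit-normalized augmented vectors (u, v, 1)."""
    a = np.append(u, 1.0);  a /= np.linalg.norm(a)
    b = np.append(u0, 1.0); b /= np.linalg.norm(b)
    return np.degrees(np.arccos(np.clip(a @ b, -1.0, 1.0)))

def magnitude_error(u, u0):
    """Relative error-vector magnitude |u - u0| / |u0|."""
    return np.linalg.norm(np.asarray(u) - u0) / np.linalg.norm(u0)
```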

Translating Squares Sequence (TS). This data set was introduced in Section 3.1.4 (Figure 3.6). Calculation is done at the original resolution. The correct and estimated vector plots are given in Figure 3.12. LK's result is smeared around motion boundaries and good elsewhere. This shows the necessity of robust estimation, and also justifies using LS for initialization in our method. BS's result and ours look close for this data set. The reason is that for the case m = 2 examined here, p = 30 samples make the probability of having at least one good initial subset as high as 99.98%, even with 50% outliers present.

Figure 3.11: TS trial set size map. Value range: 1 (darkest) → 13 (brightest). More trial solutions are generated in places of higher motion complexity.

For these reasons, and because of the simplicity of the TS sequence, the random scheme succeeds most of the time. This similarity in turn justifies the viability of implementing high-breakdown criteria without random sampling.

TT, DT, YOS Sequences. Three popular synthetic data sets, Translating Tree (TT), Diverging Tree (DT) and Yosemite (YOS), are obtained from Barron [7]. Their middle frames, the 20th in the TT and DT data sets and the 9th in YOS, are shown in Figures 3.13 and 3.14. TT and DT (150 × 150) simulate translational camera motion with respect to a textured planar surface. TT's motion is horizontal, DT's is divergent, and their maximum speeds are about 2 pixels/frame. The motion in YOS (316 × 252) is mostly divergent, with a maximum speed of about 4-5 pixels/frame. The cloud region is excluded from evaluation. We use two levels of pyramid for TT and DT, and three levels for YOS.

OTTE Sequence. This real image sequence is provided by Nagel [98]. The scene is stationary except for a marble block in the center moving leftwards; and the camera is translating. Groundtruth is available where the vector is nonzero (Figure 3.15). Three levels of pyramid are used. Measures on all sequences are summarized in Table 3.3.

Robust methods are more accurate but much slower. Quite noticeably, however, the accuracy advantage fades with the use of image pyramids. This is caused by the limitations [14] of the simple hierarchical strategy [12].

(a) Groundtruth (b) LK

(c) BS (d) Ours

Figure 3.12: TS: correct and estimated flow fields

Data    Technique   e_∠ (°)   e_{|·|} (%)   time (sec)
TS      LK           6.14       15.12         0
        BS           1.10        2.64         2
        Ours         1.09        2.65         0
TT      LK           2.36        5.48         1
        BS           1.67        3.75        16
        Ours         1.39        3.22         6
DT      LK           6.12       18.33         1
        BS           5.73       18.53        16
        Ours         5.00       16.14        10
YOS     LK           3.69       12.68         4
        BS           3.81       11.87        61
        Ours         3.42       11.10        40
OTTE    LK          17.22       48.56        13
        BS          17.02       48.23       205
        Ours        16.84       47.90       121

Table 3.3: Quantitative comparison of the proposed adaptive LMedS algorithm to Lucas and Kanade (LK) [82] and Bab-Hadiashar and Suter (BS) [5]. The new algorithm is more accurate, and more efficient than BS.

Figure 3.13: TT, DT middle frames

Figure 3.14: YOS middle frame

The error it introduces can be greater than that from LS estimation and becomes the quality bottleneck. This issue will be addressed in the next chapter. Note in Table 3.3 that while our method significantly outperforms LK in all cases, BS produces larger errors than LK on the DT and YOS sequences. This suggests the unstable nature of random-sampling-based LMedS.

TAXI Sequence. TAXI is a real sequence with no ground truth, also from Barron [7]. In the street scene there are four moving objects: a taxi turning the corner, a car in the lower left driving leftwards, a van in the lower right driving towards the right, and a pedestrian in the upper left. Their image speeds are approximately 1.0, 3.0, 3.0 and 0.3 pixels/frame, respectively. Two levels of pyramid are used. To enhance details, we display the horizontal flow component as intensity images in Figure 3.17, in which brighter pixels represent larger speeds to the right. In LK's estimate the flow fields of the vehicles have severely invaded the background. BS and our method preserve motion boundaries better. However, it is quite obvious that BS has bumpier boundaries and produces more gross errors, for instance on the taxi. In addition, BS took 36 seconds of CPU time while ours took only 13 seconds. We show the trial solution set sizes, with the minimum and maximum numbers of trials on the two pyramid levels and their sum, in Figure 3.16. Apparently larger numbers of trials were used in places with more complex motion.

Figure 3.15: OTTE sequence. (a) Middle frame, (b) true flow.

This observation suggests that the trial set size map might help motion structure analysis.

3.2.3 Discussion

In this section we have presented an adaptive high-breakdown robust method for visual reconstruction and applied it to optical flow estimation. By taking advantage of the piecewise smoothness property of visual fields and the selection capability of robust estimators, this algorithm can be faster and more accurate than algorithms based on random sampling. Although we have chosen locally constant flow estimation to illustrate its effectiveness, the strengths of this approach should be more apparent in problems of higher dimensions, such as affine flow estimation and piecewise-cubic image restoration, for which random sampling methods quickly become computationally formidable. One of our future work directions is to extend this approach to these applications. Also worth further investigation is a less expensive alternative to the WLS estimator (Eq. 3.5) for updating estimates during each visit.

Figure 3.16: TAXI: snapshot and trial set size maps (in parentheses: min. and max. numbers of trials). (a) Middle frame, (b) Level 1: (1,14), (c) Level 0: (1,13), (d) Total: (2,25).

Figure 3.17: TAXI: intensity images of x-component. (a) BS (36 sec), (b) Ours (13 sec). Note that BS has bumpier boundaries and produces more gross errors, e.g., near the center of the taxi.

One possibility is to use the concentration property of LTS [104]. Finally, it would be interesting to see how the trial set size could be used as an early cue for analyzing scene complexity.

3.3 Error Analysis on Robust Local Flow

In this section we provide an error analysis of the first-order differential local regression technique through covariance propagation [52]. By using a high-breakdown robust criterion, we minimize the impact of outliers on both optical flow estimation and its error evaluation. We calculate spatiotemporal derivatives from a facet model, which enables us to take the correlation of adjacent pixels into account, and estimate image noise and derivative errors in an adaptive fashion. In the regression problem, we consider errors from both the observations and the measurements. In addition, we adopt a hierarchical process to handle large motion. Our error analysis is more complete, systematic and reliable than previous attempts. The advantages are demonstrated in a motion boundary detection application.

3.3.1 Covariance Propagation

Covariance Propagation Theory

Consider a system relating the output y to the input x by the function

y = f(x).

Generally f(·) is nonlinear; but when the perturbation ∆x is small enough to fit its linear range, the output error is well approximated by

∆y = (df(x)/dx) ∆x.

Then the covariance of the output is

Σ_y = (df(x)/dx) Σ_x (df(x)/dx)′.   (3.9)

For most real systems, f(·) cannot be expressed explicitly. Instead a relationship

g(x, y) = 0

usually exists. In such cases we have

df(x)/dx = −(∂g(x, y)/∂y)^{−1} (∂g(x, y)/∂x),

and finally

Σ_y = [(∂g(x, y)/∂y)^{−1} (∂g(x, y)/∂x)] Σ_x [(∂g(x, y)/∂y)^{−1} (∂g(x, y)/∂x)]′.   (3.10)

Evaluating the covariance requires knowledge of the true x, y values, which are seldom available and are commonly approximated by their estimates in practice. The assumptions of unimodal noise and accurate observations and estimates limit the application of the theory to near-perfect systems with no outliers present. Below we introduce the application of the theory to our robust optical flow estimator. The explanation proceeds in two steps: the OFC step and the facet model step. Results presented here are more general than those reported in an earlier paper about a least-squares technique [144].
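For illustration, both propagation formulas reduce to a few lines of numerical code. The following Python sketch is our own, not part of the original analysis; it assumes the Jacobians have already been evaluated at the working point, and the sign of the implicit-function Jacobian is dropped since it cancels in the quadratic form:

    import numpy as np

    def propagate_explicit(J, Sigma_x):
        # Eq. 3.9: Sigma_y = J Sigma_x J', with J = df/dx at the estimate.
        return J @ Sigma_x @ J.T

    def propagate_implicit(g_y, g_x, Sigma_x):
        # Eq. 3.10: J = (dg/dy)^{-1} (dg/dx); g_y must be square and invertible.
        J = np.linalg.solve(g_y, g_x)
        return J @ Sigma_x @ J.T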

The OFC Step

After outliers are removed under the robust criterion, the actual OFC we use is composed of the rows in Eq. 2.7 corresponding to the inliers. For simplicity, from now on we let Eq. 2.7 denote this actual OFC and let n be the number of inliers. Solving Eq. 2.7 using least-squares is optimal for the model in which only b is contaminated by iid additive zero-mean noise with variance σ_b². Under this assumption, the optical flow covariance is simply

Σ_u = σ_b² (A′A)^{−1},   (3.11)

where σ_b² can be estimated from the residual errors as

σ̂_b² = (1/(n − 2)) Σ_{i=1}^{n} r_i².

The above error model is clearly unrealistic because the spatial derivatives in A are noisy as well. Under the Error-In-Variable (EIV) condition the LS estimate is biased. Accordingly, efforts have been made to calculate the unbiased estimate using generalized least-squares [90, 96, 95]. However, at the cost of much heavier computation, these methods bring little accuracy improvement. This is because bias is a much weaker error source than outliers [15] in optical flow estimation; methods which can suppress outliers [15, 147] achieve fairly good

accuracy, much better than what generalized least-squares can do. In addition, bias has also turned out to be less significant than estimation variance in motion estimation [27]. Therefore, we solve the outlier-suppressed OFC using an LS estimator, and analyze the estimation error by propagating covariance from the derivative estimates. The input to the system Eq. 2.7 is the spatiotemporal derivative vector d and the output is the optical flow vector u. We assume their errors are both zero-mean and have covariance matrices Σ_d and Σ_u respectively. u and d do not have a linear relationship, but they are related by

g(u, d) = ∂F(d, u)/∂u = A′(Au − b) = 0,

where F(d, u) = |Au − b|² is the criterion function. Proceeding as the covariance propagation theory (Section 3.3.1) suggests, we obtain

∂g(d, u)/∂u = A′A

and

∂g(d, u)/∂d = (∂g(d, u)/∂d_1, . . . , ∂g(d, u)/∂d_n),

where

∂g(d, u)/∂d_i = [ r_i + I_xi u    I_xi v          I_xi
                  I_yi u          r_i + I_yi v    I_yi ].

Applying Eq. 3.10 yields the optical flow estimate covariance

Σ_u = [(∂g(d, u)/∂u)^{−1} (∂g(d, u)/∂d)] Σ_d [(∂g(d, u)/∂u)^{−1} (∂g(d, u)/∂d)]′.   (3.12)

This expression reveals that the error in the optical flow estimate not only depends on the residual errors and the derivative values (through system conditioning) as indicated by Eq. 3.11, but also relates to the optical flow value and the errors in the derivatives. Such observations have been made in many previous studies [7, 15, 5]. Assuming the derivatives and optical flow estimates are sufficiently accurate, we use them in place of the unknown true values for evaluation. Now the only missing piece in the above expression is the derivative error covariance Σ_d. Its modeling has posed a great difficulty to many previous studies [119, 90, 96]. Below we tackle the problem using the facet model.
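To make Eq. 3.12 concrete, the following Python sketch (our own illustration; all names are ours) assembles ∂g/∂d block by block and propagates a given derivative covariance Σ_d, as supplied by the facet model step below, to the flow estimate:

    import numpy as np

    def flow_covariance(Ix, Iy, It, u, v, Sigma_d):
        # Ix, Iy, It: length-n inlier derivative arrays; (u, v): flow estimate.
        # Sigma_d: (3n, 3n) covariance of d = (Ix1, Iy1, It1, ..., Ixn, Iyn, Itn).
        n = len(Ix)
        r = Ix * u + Iy * v + It                 # OFC residuals
        dg_du = np.array([[np.sum(Ix * Ix), np.sum(Ix * Iy)],
                          [np.sum(Ix * Iy), np.sum(Iy * Iy)]])   # A'A
        dg_dd = np.zeros((2, 3 * n))
        for i in range(n):                       # one 2x3 block per pixel
            dg_dd[:, 3*i:3*i+3] = [[r[i] + Ix[i]*u, Ix[i]*v,        Ix[i]],
                                   [Iy[i]*u,        r[i] + Iy[i]*v, Iy[i]]]
        J = np.linalg.solve(dg_du, dg_dd)        # (dg/du)^{-1} dg/dd
        return J @ Sigma_d @ J.T                 # Eq. 3.12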

The Facet Model Step

We assume the image noise is an iid zero-mean variable with variance σ². From Section 3.3.1 we know that the gradient vector d_i at pixel i, i = 1, . . . , n, is linearly related to its neighborhood data J_i by d_i = M J_i. This permits directly applying Eq. 3.9 and leads to

Σ_di = σ_i² M M′.

Similarly, for any pair of gradient vectors d_i, d_j we have

Σ_didj = σ_ij² M_i M_j′,

where M_i, M_j are the weights on their overlapping support, and σ_ij² is approximated by σ_i σ_j. Finally the full derivative covariance matrix is assembled from Σ_didj, i, j = 1, . . . , n. Notice that M_i M_j′ depends on the positional relationship of pixels i and j, and takes only a few forms once the supports for the OFC and the facet model are determined. Hence in implementation we create a lookup table of all possible M_i M_j′ beforehand and refer to it during pixel-by-pixel error estimation.

The above procedure defines the structure of Σ_d. [90, 95] arrive at similar conclusions using their derivative masks. However, they meet difficulty with image noise variance estimation. [90] attempts to evaluate the variance empirically, but the method turns out to be unsuccessful. The reason is that here the image noise is not caused simply by acquisition errors; it also depends on the derivative masks and the local image texture [119, 96, 92]. We derive an estimate of σ² from the facet model fitting residual error

σ̂² = |D â − J|² / (n_d − n_b),

where n_d, n_b are respectively the number of pixels in J and the number of polynomial bases. This measure reflects the deviation of the local image texture from the assumed polynomial model, which arises from either image noise or complex textures. It is a by-product of derivative estimation, and is adaptive across the image. The use of the facet model fully automates the error propagation from the image data to the optical flow estimation.
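For illustration, the noise variance estimate is just the normalized residual of the facet fit. A minimal Python sketch (ours, assuming the design matrix D of polynomial basis values is given):

    import numpy as np

    def facet_noise_variance(J, D):
        # J: flattened neighborhood intensities, length n_d.
        # D: (n_d, n_b) polynomial basis design matrix of the facet model.
        a_hat, *_ = np.linalg.lstsq(D, J, rcond=None)
        r = D @ a_hat - J
        return (r @ r) / (D.shape[0] - D.shape[1])   # sigma^2 estimate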

Hierarchical Processing

We build our optical flow estimation and covariance propagation method into a hierarchical scheme to cope with large motions [12]. [119] propagates covariance down the pyramid by a Kalman-filtering-like scheme. Currently we assume results on different pyramid levels are independent of each other, and hence we combine covariance matrices of different levels simply by multiplying the values at the higher level by 4 and adding them to the values at the lower level. Due to the limitations of hierarchical schemes [147] and the crude combination method, we observe performance degradation as the number of pyramid levels increases. Handling large motions remains a very difficult problem and needs further investigation.

3.3.2 Experiments

Motion Boundary Detection. Inspired by [93, 90], we demonstrate the performance of our error analysis method through a local statistical motion boundary detector. Given two adjacent optical flow vectors and their covariance matrices (u_i, Σ_ui) and (u_j, Σ_uj), we examine the hypothesis H0 that they originate from normal distributions of the same mean. Under H0, their difference vector u = u_i − u_j obeys a bivariate normal distribution

u ∼ N(0, Σ_ui + Σ_uj).

Thus the statistic

T = u′ (Σ_ui + Σ_uj)^{−1} u

should obey a χ² distribution with 2 degrees of freedom. We reject H0, or declare a boundary pixel pair, when T > T_α. Each T_α corresponds to a significance level α, which is the theoretical false alarm rate. We estimate optical flow on the middle frame of an odd number of frames. The constant flow window size is fixed at 9 × 9. We handle estimates at image borders specially, so that they also have good accuracy and the resultant flow field is of the same size as the original image [142]. We compare two optical flow estimators, LS and LMS, and covariance propagation under two noise models: iid b error (Eq. 3.11) and correlated EIV (Eq. 3.12). This forms four combinations: (a) LS OFC and covariance from Eq. 3.11, (b) LS OFC and covariance from Eq. 3.12, (c) LMS OFC and covariance from Eq. 3.11, and (d) LMS OFC and covariance from Eq. 3.12. (d) is the proposed method. (a) is similar to [126, 56], and (b) is similar to [90].

Figure 3.18: TS motion boundary. α values: (a) 0.01, (b) 0.15, (c) 2e−11, (d) 0.05.

In all experiments, we adjust the α value to produce the best visual results. The performance of the different methods is compared by inspecting the false-alarm and misdetection rates. The closeness of the α value and the observed false alarm rate is an indicator of the statistical validity of the results.
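The boundary test itself is only a few lines of code. The following Python sketch is our own illustration of the χ² test just described (function names are ours):

    import numpy as np
    from scipy.stats import chi2

    def is_motion_boundary(ui, Si, uj, Sj, alpha=0.05):
        # (ui, Si), (uj, Sj): adjacent flow vectors and their covariances.
        d = ui - uj
        T = d @ np.linalg.solve(Si + Sj, d)   # T = d'(Si + Sj)^{-1} d
        return T > chi2.ppf(1.0 - alpha, df=2)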

TS Sequence. We calculate derivatives from a first-order spatiotemporal facet model on a support of size 3 × 3 × 3 [145]. Three frames are used. The results are shown in Figure 3.18, with α values given in the caption. Simple as it is, the χ² test is effective in detecting motion boundaries. LS methods break down around motion boundaries, and the covariance from neither error model is reliable. This verifies that performance analysis based on covariance propagation only works for near-perfect systems and is unable to detect its own failure. (c,d) are similar by visual inspection. However, with an inappropriate noise model, (c) severely underestimates the error. Its associated α = 2e−11 makes little statistical sense, and thus in practice there is no good method for choosing the threshold. With outliers rejected and the correlated EIV model assumed, good results with solid statistical meaning are produced by (d).

Hamburg Taxi Sequence (TAXI). Derivatives are calculated from a cubic spatiotemporal facet model on a support of size 5 × 5 × 5 [145]. Five frames are used. Figure 3.19 gives the results using a 2-level pyramid. Due to limitations of the hierarchical processing scheme (Section 3.3.1), the results are noisier than those computed on the higher level of the pyramid

only (original sequence spatially subsampled by 2) in Figure 3.20. But the overall observation is that robust estimates have more faithful boundaries, and the correlated EIV model yields far fewer false alarms and misdetections. Quite noticeably though, the α values become more problematic on real data. This suggests that our error modeling still needs refining to meet the demands of real complexity.

3.3.3 Discussion

In this section we have presented an error analysis of a robust optical flow estimation technique. Our work extends previous research in several directions. First of all, we make explicit the dependence of the popular covariance propagation theory on accurate estimates, and perform our analysis with a highly robust technique. We employ a high-breakdown robust criterion to reject outliers, which are most detrimental to both optical flow estimation and error analysis. By using a 3D facet model we obtain good derivative estimation, and in addition we systematically estimate the image noise strength and the correlated errors in spatiotemporal derivatives. We also adopt a hierarchical scheme to handle the large motion case. We illustrate the effectiveness of our error analysis on an application of statistical motion boundary detection. Compared to least-squares based methods, our method has significantly higher motion estimation accuracy and boundary fidelity, and produces fewer false alarms and misdetections. These results exhibit the potential of our approach in a wide range of applications such as Structure From Motion (SFM) and camera calibration [34]. Automatic performance analysis is a very important yet very difficult problem. Although it is one step beyond previous attempts, our approach is still based on the covariance propagation theory and breaks down when the estimate quality becomes too low. The open issue is how to make the system aware that the quality of the estimates has become too low to make further inference.

Figure 3.19: TAXI motion boundary. α values: (a) 0.05, (b) 0.4, (c) 0.001, (d) 0.5.

Figure 3.20: TAXI: motion boundary on images subsampled by 2. α values: (a) 0.0005, (b) 0.2, (c) 3e−7, (d) 0.25.

Chapter 4

GLOBAL MATCHING WITH GRADUATED OPTIMIZATION

The local approaches we presented in the previous chapter analyze each optical flow vector by exploring image data in a small spatiotemporal neighborhood surrounding the corresponding pixel location. Due to the limited contextual information, the drawbacks of such approaches are obvious: if data in a neighborhood do not have enough brightness variation or happen to be very noisy, the analysis can completely fail; in other words, local approaches are highly sensitive to the aperture problem and their reliability can vary greatly within a single image. In order to overcome such limitations, appropriate global approaches must be developed to incorporate contextual information more effectively. Global optimization techniques for optical flow estimation have been extensively studied throughout the years, but the state-of-the-art performance remains unsatisfactory due to formulation defects and solution complexity. On one hand, approximate formulations are frequently adopted for ease of computation, with the consequence that the correct flow is unrecoverable even in ideal settings. As an example, many methods intended to preserve motion discontinuities use gradient-based brightness constraints, which can break down at discontinuities due to derivative evaluation failure and thus cannot reach the goal of precise boundary localization [145]. On the other hand, more sophisticated formulations typically involve large-scale nonconvex optimization problems, which are so hard to solve that the practical accuracy might not be competitive with simpler methods. Motion estimation research has arrived at a stage in which a good collection of ingredients is available; but in order to significantly improve performance, both problem formulation and solution methods need to be carefully considered and optimized. In this chapter, we discuss the problem of optimal optical flow estimation assuming brightness conservation and piecewise smoothness and propose a matching-based global optimization method with a practical solution technique.

From a Bayesian perspective, we assume the flow field prior distribution to be a Markov random field (MRF) and formulate the optimal optical flow as the maximum a posteriori (MAP) estimate, which is equivalent to the minimum of a robust global energy function. The novelty in this formulation lies mainly in two aspects. 1) Three-frame matching is proposed to detect correspondences; this overcomes the visibility problem at occlusions. 2) The strengths of brightness and smoothness errors in the global energy are automatically balanced according to local data variation, and consequently parameter tuning is reduced. These features enable our method to achieve a higher accuracy upper-bound than previous algorithms. In order to solve the resultant energy minimization problem, we develop a hierarchical three-step graduated optimization strategy. Step I is the robust local gradient method that we have proposed in Section 3.2. It provides a high-quality initial flow estimate. Step II is a global gradient-based formulation solved by Successive OverRelaxation (SOR), which efficiently improves the flow field coherence. Step III minimizes the original energy by greedy propagation. It corrects gross errors introduced by derivative evaluation and pyramid operations. In this process, merits are inherited and drawbacks are largely avoided in all three steps. As a result, high accuracy is obtained both on and off motion boundaries. Performance of this technique is demonstrated on a number of standard test data sets. On Barron’s benchmark synthetic data [7], this method achieves the best accuracy among all low-level techniques. Close comparison with the well known Black and Anandan’s dense regularization technique (BA) [14] shows that our method yields uniformly higher accuracy in all experiments at a similar computational cost.

4.1 Formulation

Let I(x, y, t) be the image intensity at a point s = (x, y) at time t. The optical flow at time t is a 2D vector field V with the vector at each site s denoted by V_s = (u_s, v_s)^T, where u_s, v_s represent the horizontal and vertical velocity components, respectively. At places with no confusion, we may drop the s index and denote an image frame by I(t) and a flow vector by V = (u, v)^T. The task of estimating optical flow can be described

as finding V to best interpret the spatiotemporal intensity variation in the image frames

I = {I(t1),...,I(t),...,I(t2)}, t1 ≤ t ≤ t2. We consider it as a Bayesian inference problem and define the optimal solution under the maximum a posteriori (MAP) criterion.

4.1.1 MAP Estimation

Let P(V |I) be the posterior probability density of the flow field V conditioned on the intensity observation I. According to the maximum a posteriori (MAP) criterion, the best optical flow Ṽ is at the mode of this density, i.e.,

Ṽ = argmax_V P(V |I).

Applying Bayes rule, the posterior pdf can be factored as

P(V |I) = P(I(t)|V, I − I(t)) P(V |I − I(t)) / P(I(t)|I − I(t)),   (4.1)

where I − I(t) designates the image frames excluding the one on which we estimate optical flow. Ignoring the denominator, which does not involve V, we have

Ṽ = argmax_V P(I(t)|V, I − I(t)) P(V |I − I(t)),   (4.2)

where P(I(t)|V, I − I(t)) is the likelihood of observing the image I(t) given the optical flow V and its neighboring frames I − I(t); P(V |I − I(t)) is the prior probability density of the optical flow.

4.1.2 MRF Prior Model

We model the prior distribution of the optical flow using a Markov random field. The MRF is a highly effective model for piecewise smoothness. It was first introduced for image restoration by Geman and Geman [42] and has been widely employed in motion estimation to preserve boundaries [88, 77, 57, 14]. The elegance of the MRF lies in that once a neighborhood system N is defined, due to MRF/Gibbs distribution equivalence, the prior distribution with respect to N can be expressed in terms of a potential function ES(V ) as

P(V |I − I(t)) = exp(−E_S(V))/Z.

Figure 4.1: Comparison of the Geman-McClure norm (solid line) and L2 norm (dashed line): (a) error norms ρ(x, σ) (σ = 1), (b) influence functions ψ(x, σ) = ρ′(x, σ), (c) corresponding probability density functions pdf ∝ exp(−ρ(x, σ)) (truncated).

The partition function Z is a normalizing constant. E_S(V) is the flow smoothness energy, modeled as a sum of site potentials: E_S(V) = Σ_s E_S(V_s).

We use a second-order neighborhood system of only pairwise interactions in the flow field prior. Correspondingly, the local flow smoothness potential E_S(V_s) is specified by the average deviation of V_s from its 8-connected neighbors V_i, i ∈ N_s^8:

E_S(V_s) = (1/8) Σ_{i∈N_s^8} ρ(V_s − V_i, σ_Ss).   (4.3)

Here σ_Ss is the flow variation scale at the site s, and the error norm ρ(x, σ) reflects the flow deviation distribution. The choice of ρ is the decisive factor in the boundary preservation capability of an MRF formulation. If ρ is an L2 norm and σ_Ss is a fixed global parameter, the flow prior potential reduces to the smoothness error in the Horn and Schunck formulation (Eq. (2.6)), which does not preserve motion discontinuities at all. Geman and Geman [42] modeled continuous surfaces as an MRF, and introduced the “line process”, a set of binary variables indicating edges, as a dual MRF. This formulation has been widely adopted in motion estimation [88, 77, 57, 14]. It was shown by Blake and Zisserman [20] to be equivalent to assuming ρ to be a truncated quadratic function. In a robust statistics context, Black [14, 19] generalized the line process to an analog “outlier process”. We adopt this point of view in designing the error norm. More insight into the error norm and our design, from the distribution and robust statistics perspectives, is given in the following two sections.

4.1.3 Likelihood Model: Robust Three-Frame Matching

If the likelihood P(I(t)|V, I − I(t)) is a site-independent exponential distribution proportional to exp(−Σ_s E_B(V_s)), the posterior distribution is also Gibbs, with the potential resembling the regularization global energy (Eq. (2.6)). We take this approach so that specifying the likelihood term reduces to modeling the brightness conservation error and its potential function. We use the matching constraint Eq. (2.1) to model brightness conservation. The traditional assumption that pixels are visible in all frames is a major source of gross errors in occlusion areas. Taking such violations as outliers [14] may prevent error from propagating to nearby regions, but does not provide constraints for occlusion pixels and thus does not help their motion estimation. We observe that without temporal aliasing, all points in a frame are visible in the previous or the next frame. Therefore we define the matching error as the minimum of the backward and forward warping errors over three frames, i.e.,

e_W(V_s) = min(|I_b(V_s) − I_s|, |I_f(V_s) − I_s|),

where I_s is the intensity of pixel s in the middle frame, and I_b(V_s), I_f(V_s) are the warped intensities in the previous and the next frames respectively. We are the first to explicitly model correspondence at occlusions in optical flow estimation [147]. A similar idea, known as a temporally shiftable window, has shown high effectiveness in handling occlusions in multi-view stereo [73]. It is conventional to assume that the matching error comes from iid Gaussian noise and correspondingly to use its L2 norm as the potential function [58]. However, image noise is not always Gaussian due to abrupt perturbations, and the matching error can come from other sources such as warping failures. It may often take large values and thus has a distribution with fatter tails than the Gaussian. To represent the distribution more realistically, we use a

robust error norm ρ(x, σ) to define the potential, yielding

E_B(V_s) = ρ(e_W(V_s), σ_Bs),   (4.4)

where σ_Bs is the local brightness variation scale. Figure 4.1 gives the ρ, ψ curves for the L2 and Geman-McClure error norms [15] and their corresponding pdfs.¹ The Geman-McClure pdf (Figure 4.1(c)) has much fatter tails than the Gaussian. The above prior and likelihood models are also justified from the robust statistics perspective. In optical flow estimation, small matching errors and smooth flows are dominant; large errors and motion discontinuities can be considered as outliers to the modelled structure. Hence, applying robust constraints to both the E_B and E_S terms serves to reduce the impact of local perturbations and prevent flow smoothing across boundaries. In addition, it gracefully handles motion estimation and segmentation, a difficult “chicken-and-egg” problem, since motion boundaries can be easily located as flow smoothness outliers [15].

4.1.4 Global Energy with Local Adaptivity

A robust error norm is usually chosen to possess certain desirable properties to suit the problem at hand. We use the Geman-McClure robust error function [15]

ρ(x, σ) = x²/(x² + σ²)

in both the E_B and E_S terms for its redescending [15, 106] and normalizing properties. The first property ensures that the outlier influence tends to zero. We take errors exceeding

τ = σ/√3,   (4.5)

where the influence function begins to decrease, as outliers [15]. This is equivalent to identifying pixels with error norm ≥ 0.25 as outliers. The normalization property is desirable because it makes the degrees of flow smoothness and brightness conservation comparable. Together with the spatially varying scales, it allows the relative strength of these two terms to be adaptive: where the observation is not trustworthy (σ_Bs is large), stronger smoothness is enforced, and vice versa.

¹ ρ(x, σ) might not necessarily represent a proper distribution, like the Gaussian, which is defined on x ∈ (−∞, ∞). But we consider it appropriate in an application to define a reasonable range of expected errors and obtain a pdf by normalization over this range. Figure 4.1(c) shows the pdfs for x ∈ (−5, 5).

The scales are gradually learned from the image data, as we will discuss in Sections 4.2.2 and 4.2.3. Finally, the complete global energy is expressed as

E(V) = Σ_s [ ρ(min(|I_b(V_s) − I_s|, |I_f(V_s) − I_s|), σ_Bs) + (1/8) Σ_{i∈N_s^8} ρ(V_s − V_i, σ_Ss) ].   (4.6)

This design extends current robust global formulations [15, 86] in two aspects. First of all, the three-frame matching error models correspondences even at occlusions and enables higher accuracy upper bounds, which gradient-based or two-frame methods cannot achieve.

Secondly, the locally adaptive scheme is more reasonable than those taking σB, σS, λ as fixed global parameters and eases parameter tuning in experiments.
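To fix ideas before moving to the optimization, here is a minimal Python sketch (ours, purely illustrative) of the Geman-McClure norm and the energy contribution of one site in Eq. (4.6); applying ρ to the magnitude of the flow difference vector is our reading of the vector-valued smoothness term:

    import numpy as np

    def rho(x, sigma):
        # Geman-McClure error norm: rho(x, sigma) = x^2 / (x^2 + sigma^2).
        return x * x / (x * x + sigma * sigma)

    def site_energy(Ib_w, If_w, I, V, y, x, sigma_B, sigma_S):
        # Ib_w, If_w: previous/next frames warped to the middle frame I by V.
        # V: (H, W, 2) flow field; (y, x): the site s.
        e_w = min(abs(Ib_w[y, x] - I[y, x]), abs(If_w[y, x] - I[y, x]))
        E_B = rho(e_w, sigma_B)
        H, W = I.shape
        E_S = 0.0
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy or dx) and 0 <= y + dy < H and 0 <= x + dx < W:
                    d = np.linalg.norm(V[y, x] - V[y + dy, x + dx])
                    E_S += rho(d, sigma_S) / 8.0
        return E_B + E_S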

4.2 Optimization

As we have discussed in Section 2.4, the global energy Eq. (4.6) resides in a high-dimensional space and is nonconvex. Even finding its local minima is not easy, since an explicit gradient expression is lacking. Because no general global optimization technique is known to provide a practical solution, we take a graduated deterministic approach to the minimization problem. We start from an initial estimate and progressively minimize a series of finer approximations to the original energy. In this process, we exploit the advantages of various formulations and solution techniques for accuracy and efficiency. Our first approximation is to replace the matching error by the OFCE, which enables simple gradient evaluation and more efficient exploration of the solution space. This step needs a good-quality initial estimate to start with. We provide this initial estimate from a yet cruder approximation, a gradient-based local regression method. This method is cruder because the global smoothness is not enforced and estimation is based solely on local data. After these two steps, we usually have very high accuracy except at motion boundaries, and then we can directly minimize the original energy to correct the residual errors. We build the process in a coarse-to-fine framework to handle large motions and expedite convergence. Details of this algorithm are explained below.

4.2.1 Step I: Gradient-Based Local Regression

Suppose a crude flow estimate V0 is available and has been compensated for. Step I uses the robust gradient-based local regression method that we developed in Section 3.2 to compute the incremental flow ∆V. Both least-median-of-squares (LMedS) and least-trimmed-squares (LTS) were tried and yielded similar results, so henceforth our discussion is based on LMedS. This step generates high-quality initial flow estimates. Its effectiveness as an independent optical flow estimation approach has been verified in various studies [5, 97, 145].

4.2.2 Step II: Gradient-Based Global Optimization

∆V0, the incremental flow resulting from Step I, has good accuracy at most places, but its quality degrades where local constraints become unreliable. We improve its coherence using a gradient-based global optimization method, which is a better approximation to Eq. (4.6). The energy to minimize is

E(∆V) = Σ_s [ ρ(e_G(∆V_s), σ_Bs) + (1/8) Σ_{i∈N_s^8} ρ(V_s + ∆V_s − V_i − ∆V_i, σ_Ss) ],   (4.7)

where e_G is the OFCE error (Eq. (2.3)), and V_s is the sth vector of the initial flow V0. The local scales σ_Bs, σ_Ss are important parameters which control the shape of E and hence the solution. Below we describe how to estimate them from Step I’s results. Suppose normal errors are Gaussian variables with zero mean and standard deviation σ̃; then those exceeding 2.5σ̃ can be considered as outliers. Contrasting this threshold with Eq. (4.5), we can establish an equivalence between the Geman-McClure scale σ and the Gaussian standard deviation σ̃ as:

σ = 2.5√3 σ̃.

If we have the sample standard deviation σ̃, we may compute σ using the above formula.

At a site s, we calculate σ̃_Ss as the sample standard deviation of the “inliers” in V_i − V_s, i ∈ N_s^8. Inliers are selected using the RLS procedure described in the previous section.

Some σ̃_Ss values might be very large due to bad flow estimates. We put a cap on them: 1.4826 · median_s σ̃_Ss. For stability concerns, as well as for the estimate to be reasonable, we also put a lower limit of 0.001 pixels/frame on these values. We calculate σ̃_Bs as the OFCE residual, and similarly we bound the value to the range [0.01, 1.4826 · median_s σ̃_Bs]. Now that the scales are all specified, we minimize the energy using Successive OverRelaxation (SOR) [20, 14, 102]. Starting with the initial estimate ∆V0, on the nth iteration, each u component (and v similarly) is updated as

u_s^n = u_s^{n−1} − ω (1/T(u_s)) (∂E/∂u_s)|_{u_s = u_s^{n−1}},

where

T(u_s) = I_x²/σ_Bs² + 8/σ_Ss².

SOR is well known to be good at removing high-frequency errors while very slow at removing low-frequency errors [137, 23]. In our algorithm, the initial estimate has predominantly high-frequency errors: it has good accuracy at most places but may lack coherence due to the local constraints. In such a case, the SOR procedure is very effective and converges fast. In addition, the update step size is adaptively adjusted by the local scales, which further improves the efficiency in exploring the solution space.
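As an illustration (our own sketch; the overrelaxation factor ω and the callable computing ∂E/∂u are assumptions, not values from this work), one SOR sweep over the u component might look like:

    import numpy as np

    def sor_sweep(u, dE_du, T, omega=1.9):
        # u: (H, W) flow component, updated in place, Gauss-Seidel style.
        # dE_du(y, x): derivative of the global energy w.r.t. u[y, x],
        #              evaluated at the current state of the field.
        # T: (H, W) preconditioner, T = Ix^2/sigma_Bs^2 + 8/sigma_Ss^2.
        H, W = u.shape
        for y in range(H):
            for x in range(W):
                u[y, x] -= omega * dE_du(y, x) / T[y, x]
        return u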

4.2.3 Step III: Global Matching

The incremental flow from Step II, ∆V1, and the initial estimate V0 add up to V1, which still exhibits gross errors at motion boundaries and other places where gradient evaluation fails. But it is overall a sufficiently precise representation, based on which we are ready to consider the original formulation Eq. (4.6). The computation of the local scales is similar to that in Step II, with a few differences. We adopt a globally constant matching error standard deviation σ̃_B. It is a bounded robust estimate from all matching errors σ̃_Bs: max{0.08, 1.4826 · median_s σ̃_Bs}. The flow vector standard deviation is kept spatially varying within [0.004, 0.02] pixels/frame. We minimize the global energy function by greedy propagation. We first calculate the energy E_B(V_s) + E_S(V_s) from V1 for all pixels. Then we iteratively visit each pixel, examining whether a trial estimate from a candidate set results in a lower global energy E.

The candidate set consists of the 8-connected neighbors and their average, which were updated in the last visit. Once a pixel energy decrease occurs, we accept the candidate and update the related energy terms. This simple scheme works reasonably well because bad estimates are confined to narrow areas in the initial flow V1. It converged quickly in our experiments. Since each flow estimate V_i affects E only through its own energy and the smoothness energies of its 8-connected neighbors, the updating is entirely local and can be carried out in parallel [42]. It is worth mentioning that a similar greedy propagation scheme was successfully applied to solving a global matching stereo formulation in an independent study [129].
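A minimal Python sketch of the greedy propagation loop (our own illustration; energy_at is assumed to return the sum of the energy terms that involve site (y, x), including the neighbors' smoothness terms):

    import numpy as np

    def greedy_propagation(V, energy_at, max_sweeps=20):
        # V: (H, W, 2) flow field from Step II, refined in place.
        H, W, _ = V.shape
        for _ in range(max_sweeps):
            changed = False
            for y in range(H):
                for x in range(W):
                    nbrs = [V[yy, xx].copy()
                            for yy in range(max(0, y - 1), min(H, y + 2))
                            for xx in range(max(0, x - 1), min(W, x + 2))
                            if (yy, xx) != (y, x)]
                    for cand in nbrs + [np.mean(nbrs, axis=0)]:
                        old, e_old = V[y, x].copy(), energy_at(V, y, x)
                        V[y, x] = cand
                        if energy_at(V, y, x) < e_old:
                            changed = True        # keep the better candidate
                        else:
                            V[y, x] = old         # revert
            if not changed:
                break
        return V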

4.2.4 Overall Algorithm

We employ a hierarchical process [12] to cope with large motion and to expedite convergence. We create a P-level image pyramid I^p, p = 0, . . . , P − 1, and start the estimation from the top (coarsest) level P − 1 with a zero initial flow field. At each level p, we warp the image sequence I^p using the initial flow V_0^p, obtaining image frames I_w^p. On I_w^p we initialize the residual flow using the local gradient method, enhance it using the global gradient method, and add it to V_0^p, yielding V_1^p. Then we refine V_1^p by applying the global matching method to I^p, resulting in the final flow estimate on Level p, V_2^p, which is projected down to Level p − 1 as its initial flow field V_0^{p−1}. At last the flow estimate at the original resolution is V_2^0. Operations on each pyramid level are illustrated in Figure 4.2. There is one exception: when more than one pyramid level is used, we skip Step III on the coarsest level. The rationale is that gradient-based methods suffice on the coarsest level, since the data are substantially smoothed and the flow is incremental; applying the matching constraint there is usually harmful due to the smoothing and possible aliasing.

Hierarchical schemes have become standard in motion estimation, but their limitations are often overlooked. The projection and warping operations oversmooth the flow field; errors at coarser levels are magnified and propagated to finer levels and are generally irreversible [14]. These problems are much alleviated by the global matching step—it works on the original pyramid images and corrects gross errors caused by derivative computation, projection and warping.

Figure 4.2: System diagram (operations at each pyramid level): I^p is warped by V_0^p into I_w^p; after gradient computation, Step I (local gradient) and Step II (global gradient, using the scales σ_Bi, σ_Si) produce the increments ∆V_0^p and ∆V_1^p, which are added to V_0^p to give V_1^p; Step III (global matching) yields V_2^p, which is projected down to V_0^{p−1} at Level p−1.

From a practical point of view, the graduated scheme benefits from the merits of all three popular optical flow approaches and overcomes their limitations. Step I uses a gradient-based local regression method for high-quality initialization, while leaving local ambiguities to be resolved later in more global formulations. Step II improves the flow coherence using the gradient-based global optimization method, which converges fast because of the good initialization. Step III adopts a matching-based global formulation to correct gross errors introduced by derivative computation and the hierarchical process. Matching-based formulations have been studied before, but their advantages over gradient-based counterparts were not apparent due to computational difficulties [88, 77, 14]. We provide, for the first time, a practical solution and achieve highly competitive accuracy and efficiency.

4.3 Experiments

This section demonstrates the performance of the proposed technique on various synthetic and real data and makes comparison with previous techniques.

The settings of our algorithm are given below. Optical flow is estimated on the middle frame of every three frames. No image pre-smoothing is done. Derivatives are calculated from a first-order spatiotemporal facet model [54] on a support of size 3 × 3 × 3 [145]. The constant flow window size in Step I is set to W = 9. Sites at image borders use the valid part of the window, so that the resulting flow field is of the same size as the original image. In Step II, 20 iterations are used for SOR. The values of the local scale bounds have been given in Sections 4.2.2 and 4.2.3. The image pyramid is constructed by sub-sampling 3 × 3 Gaussian-smoothed images. Projection expands the flow field by 2 × 2 times with bilinear interpolation. Bilinear interpolation is also used for image warping. The above factors are kept constant in all experiments. The only tuning parameter is the number of pyramid levels. For each data set, we choose the number of levels to be just large enough for the gradient-based constraints to hold on the finest level. Larger numbers introduce more errors due to the suppression of fine structures in reduction and the smoothing in projection and warping (Section 4.2.4). Adaptive hierarchical control [9, 14] is an important open problem, which is not tackled in this work.
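For concreteness, the projection step can be sketched as follows (our own Python illustration; doubling the vector lengths at the finer level is standard hierarchical practice and our assumption here):

    import numpy as np

    def project_flow(V):
        # Expand a flow field one pyramid level down: 2x2 upsampling with
        # bilinear interpolation, then double the vector lengths.
        H, W, _ = V.shape
        ys = np.minimum(np.arange(2 * H) / 2.0, H - 1)
        xs = np.minimum(np.arange(2 * W) / 2.0, W - 1)
        y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
        y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
        fy = (ys - y0)[:, None, None]
        fx = (xs - x0)[None, :, None]
        out = ((1 - fy) * (1 - fx) * V[y0][:, x0] + (1 - fy) * fx * V[y0][:, x1]
               + fy * (1 - fx) * V[y1][:, x0] + fy * fx * V[y1][:, x1])
        return 2.0 * out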

Close comparison is given with Black and Anandan’s dense regularization technique (BA) [14], whose code is publicly available. We modified their code to output floating-point data. BA calculates flow on the second of two frames. It uses the same number of pyramid levels as ours, and other parameters are set as suggested in [14]. All experiments are carried out on a PIII 900 MHz PC running Linux. The computing time of our algorithm depends on the motion complexity in the input data. It is typically close to that of BA. Some sample CPU time values (in seconds) for our algorithm and BA are: 11.7 and 14.7 (Taxi), 29.5 and 27.4 (Flower Garden), 36.8 and 24.2 (Yosemite). Note that neither algorithm has been optimized for speed.

4.3.1 Quantitative Measures

Quantitative evaluation can be conducted on data sets with flow groundtruth by reporting statistics of certain error measures. The most frequently adopted error measures are the angle and magnitude of the error vector; the first and second order statistics are commonly reported. We propose to use e, the absolute u or v error, as the error measure. It is a consistent and fair measure, since u, v components and positive and negative errors are treated symmetrically in optical flow estimation. Also, this 1-D measure is much easier to work with than 2-D or higher dimensional measures. In considering what statistics to use, we find the popular first and second order statistics not representative enough for such a highly skewed e distribution. Therefore we give the empirical cumulative distribution function (cdf) of e in addition to its mean ē. Better estimates should have cdfs closer to the ideal unit step function. In order to facilitate comparison with other techniques, we also report the popular average angular error e∠ [7]. The ē and e∠ values for five synthetic image sequences are summarized in Table 4.1.
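A small Python sketch of the empirical cdf we plot (our own illustration):

    import numpy as np

    def error_cdf(e, grid):
        # e: pooled |u - u_true| and |v - v_true| values, any shape.
        # grid: error values at which to evaluate the empirical cdf.
        e = np.sort(np.ravel(e))
        return np.searchsorted(e, grid, side='right') / e.size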

4.3.2 TS: An Illustrative Example

The Translating Squares (TS) sequence (64 × 64, Figure 4.3(a)) was created to examine the theoretical merits of different approaches. It contains two squares translating at exactly 1 pixel/frame.

Figure 4.3: TS sequence results. (a) Middle frame; the motion boundary is highlighted with a solid white line. In order to provide details near motion boundaries, (b,c,d,f) show flows in the window outlined by the dotted line. (b) Groundtruth and our estimate, which look the same. (c) BA estimate. (d) LS estimate in Step I. (f) Step I result; the Step II result looks identical and is hence not shown separately. (g) Error cdf curves.

The foreground square is outlined in solid white. The vector plot of the flow groundtruth is given for the part near the boundary (marked by white dots). The images are well textured and noise-free. The motion is small and thus no hierarchical process is needed. In such an ideal setting, an optimal formulation assuming brightness conservation and piecewise smoothness should fully recover the flow.

Our method does achieve the performance upper bound. The result is almost perfect; its vector plot looks the same as the groundtruth (Figure 4.3(b)), the average errors are negligible (Table 4.1), and the error cdf curve (Figure 4.3(g), curve “S3”) closely resembles the unit step function.

Figure 4.3(d) shows the flow estimate from the LS initialization in Step I, which can be considered an embodiment of the Lucas and Kanade technique [82]. Because of LS’s zero tolerance to outliers, the flow is completely smoothed out near the motion boundary (shadowed). Figure 4.3(e) shows the final result of Step I. Replacing LS by LMS dramatically improves the boundary accuracy, as is also clear from comparing curves “LS” and “S1” in Figure 4.3(g). This proves the necessity of robustification.

Due to gradient evaluation failure, gross errors still remain at the motion boundary in Figure 4.3(e). Moreover, the corners are rounded because the background motion becomes dominant there. These problems are characteristic of robust local gradient techniques [5, 97, 145], and they become more severe as the number of pyramid levels and the constant flow window size W increase. Since the TS sequence is well textured and there is no serious aperture problem, the improvement from the global OFC formulation (Step II) is minimal (see Figure 4.3(g), curve “S2”). The remaining gross errors at the motion boundary are inevitable for gradient-based techniques. They are finally corrected in Step III.

BA yields poor accuracy (Figure 4.3(c), Table 4.1) on this data set. The oversmoothing bias introduced by the LS initialization is not effectively corrected in the continuation process. The SOR procedure converges very slowly. The suggested 20 iterations [15] do not seem to be sufficient (see Figure 4.3(g), curve “BA”). Even after 400 iterations (curve “BA400”) the bias persists and the accuracy remains low.

Data    Technique   e∠ (°)    ē (pix)
TS      BA          8.04      0.12
        Ours        1.1e-2    2.2e-4
TT      BA          2.60      0.07
        Ours        0.05      9.8e-3
DT      BA          6.36      0.11
        Ours        2.60      0.05
YOS     BA          2.71      0.12
        Ours        1.92      0.08
DTTT    BA          10.9      0.20
        Ours        4.03      0.08

Table 4.1: Quantitative measures

4.3.3 Barron’s Synthetic Data

The Translating Tree (TT), Diverging Tree (DT) and Yosemite (YOS) data sets were introduced in Section 3.2.2. We use two levels of pyramid for TT and DT, and three levels for YOS. The cloud part in YOS is excluded from evaluation. As is consistently observed from the average error measures (Table 4.1) and the error cdf curves (Figure 4.4), our method achieves very high accuracy and consistently outperforms BA.²

Most optical flow papers published after [7] report the average angular error e∠ on YOS. Some of the results are quoted in Table 4.2. The first group takes a dense regularization approach assuming piecewise constant flow. To our knowledge, our method gives the smallest error among such techniques. The second group makes stronger flow model assumptions, such as local affine flow or constant flow over a considerable number of frames. These assumptions are appropriate on the YOS data set and may lead to higher accuracy. The smallest error on YOS was reported by Farnebäck [35]. The algorithm couples orientation-tensor-based spatiotemporal filtering with region-competition-based segmentation, and estimates locally affine motion over 9 frames. Although our method uses only low-level models and 3 frames, it also compares favorably with these techniques.

² The BA angular error obtained here differs from the one reported by Black and Anandan [15], most probably because their data differ from Barron’s and they calculated flow on the 14th frame. We adopt Barron’s experimental setup for wider comparability.

Figure 4.4: Error cdf curves (Ours vs. BA): (a) TT, (b) DT, (c) YOS.


DTTT: Motion Discontinuities

The above three data sets, including YOS, contain smooth motions and cannot display the discontinuity-preserving capability of our method. We synthesized the DTTT sequence (150 × 150) for this purpose. DTTT was generated from TT, DT and “cookie cutters”: image data inside the cookie cutters come from TT and those outside come from DT. Its middle frame, with motion boundaries highlighted, is given in Figure 4.5(a). We use two pyramid levels for this set. For images of realistic sizes, vector plots with enough detail do not fit the page. Following [15], we show the horizontal and vertical flow components u, v as intensity images, with brighter pixels representing larger speeds to the right. We linearly stretch the image contrast so as to use the full intensity range. Our flow estimate (Figure 4.5) has a clear layered look—it exhibits crisp motion discontinuities and smoothness elsewhere. Figure 4.5(d) marks motion boundary pixels in black. They are located as smoothness outliers, i.e., those with final pixel smoothness energy exceeding 0.25 (see Eq. 4.5).

Technique                              e∠ (°)
Ye, Haralick and Shapiro (proposed)    1.92
Sim and Park [117]                     4.13
Black and Anandan [15]                 2.71
Szeliski and Coughlan [127]            2.45
Mémin and Pérez [86]                   2.34
Black and Jepson [35]                  2.29
Ju, Black and Jepson [71]              2.16
Bab-Hadiashar and Suter [5]            1.97
Farnebäck [35]                         1.14

Table 4.2: Comparison of various techniques on Yosemite (cloud part excluded) with Barron’s angular error measure

BA’s result is oversmoothed with local perturbations: at many places foreground and background motions invade one another; meanwhile, a number of false boundaries arise, corresponding to noise or intensity edges erroneously taken as motion discontinuities. This is also reflected by its boundary map (output from their code), which has many spurious detections. Note that their discontinuity is 1-pixel thick since they mark only one of each pair of mutual outliers. Our result has much higher quantitative accuracy than BA’s, as shown in Figure 4.4 and Table 4.1.

However, in our estimate we do notice some gross errors near the motion boundaries. For example, the right corner of the triangle is smoothed into the background. A closer look reveals that most of these errors happen in textureless regions, where even human viewers are unable to tell what the actual motion is (the aperture problem). In such situations, the correctness of the “groundtruth” becomes questionable, and so does the authority of quantitative evaluation based on it. For this reason, together with the simplicity of synthetic data and error measures, “quantitative” results should be considered qualitative at best. The above suggests that the inherent ambiguity of optical flow should be considered in quantitative evaluation—a more convincing evaluation method should allow larger errors in regions of less local information.

Figure 4.5: DTTT sequence results (motion boundaries highlighted in (a)). (a) Middle frame, (b) our horizontal flow, (c) our vertical flow, (d) our motion boundaries, (e) e cdf curves (Ours vs. BA), (f) BA horizontal flow, (g) BA vertical flow, (h) BA motion boundaries.

Figure 4.6: Taxi results. (a) Middle frame, (b) BA horizontal flow, (c) our horizontal flow, (d) our smoothness error (E_S) map.


Also noticeable in our estimate is that our motion boundaries are not as smooth as one would like. This is partly due to the weakness of the simple optimization method in Step III. Developing more suitable optimization methods is an important direction in our future work.

4.3.4 Real Data

In this section we show results on four well-known real image sequences: Taxi, Flower Garden, Traffic and Pepsi. Taxi and Traffic contain independent motions; motions in the other two data sets are caused by camera motion and scene depth. For each data set we give the middle frame, the horizontal flow u from BA and from our method, and the smoothness error E_S map from our method.

The Taxi sequence (256 × 190) is obtained from Barron [7]. It mainly contains three moving cars (from left to right) at image speeds of about 3.0, 1.0 and 3.0 pixels/frame respectively. The van on the left has low contrast and surface reflectance. The truck on the right is fragmented by a tree in front of it. Difficulties also arise from the low image quality. Optical flow is estimated on the 9th frame. Two pyramid levels are used. BA’s result is almost completely smoothed out. Better boundary performance might be obtained by tuning parameters, but as we have discussed earlier, smoothing seems to be inevitable for BA, especially on data with such diverse motions. Our method yields a reasonable flow estimate and a motion boundary map. Note that the car regions include shadows, which move with the cars at the same speeds. Motion boundaries inside the truck reflect the motion fragmentation.

Motion in the Flower Garden sequence (360 × 240, from Black) is caused by camera translation and scene depth. The image speed of the front tree is as large as about 7 pixels/frame. Optical flow is estimated on the 2nd frame. Three pyramid levels are used. In both BA’s and our results, the motion of the tree twigs smears into the background. This is another example of inherent flow ambiguity (the aperture problem). BA’s estimate has considerable oversmoothing between layers. Our result shows clear-cut motion boundaries and smooth flows within each layer. Its accuracy is highly competitive with those from model- or layer-based techniques [109, 131].

Figure 4.7: Flower garden results. (a) Middle frame, (b) BA horizontal flow, (c) our horizontal flow, (d) our smoothness error (E_S) map.

Consistent observations are made on the remaining two data sets. The Traffic sequence (512 × 512, from Nagel) contains eleven moving vehicles with the maximum image speed at about 6 pixels/frame. Optical flow is estimated on the 8th frame with three pyramid levels. The motorcycle in the building shadow (upper middle) is missed by BA but picked out by our method. The Pepsi sequence (201 × 201) was used by Black [15] to illustrate motion boundary preservation capability. Like Flower Garden, its motion discontinuities are caused by camera translation and scene depth. The maximum image speed is about 2 pixels/frame. Optical flow is estimated on the 3rd frame with three pyramid levels. We exclude a 5-pixel-wide border from BA’s result to obtain better contrast.

Figure 4.8: Traffic results. (a) Middle frame, (b) BA horizontal flow, (c) our horizontal flow, (d) our smoothness error (E_S) map.

Figure 4.9: Pepsi can results. (a) Middle frame, (b) BA horizontal flow, (c) our horizontal flow, (d) our E_S map.

The erroneous flow and discontinuity estimated at the lower-left corner are also caused by the poor texture.

4.4 Conclusions and Discussion

This chapter has presented a novel approach to optical flow estimation assuming brightness conservation and piecewise smoothness. From a Bayesian perspective, we propose a formulation based on three-frame matching and global optimization allowing local variation, and we solve it using a graduated minimization strategy. Extensive experiments verify that the new method outperforms its competitors and yields good accuracy on a wide variety of data. The contributions of our work to visual motion estimation are summarized as follows.

• We introduced backward-forward matching for optical flow estimation. It avoids problematic derivative evaluation and models correspondences more faithfully than popular gradient-based constraints and those ignoring the visibility problem at occlusions.

• We designed the global energy to automatically balance the strength of brightness and smoothness errors according to local data variation. It is more complete and adaptive than previous designs containing rigid tuning parameters.

• As a by-product of the robust formulation, motion discontinuities can be reliably located as flow smoothness outliers.

• We developed a three-step graduated optimization strategy to minimize the resultant energy. It is the first efficient algorithm yielding good accuracy for a global matching formulation.

• The solution technique takes advantage of gradient-based local regression, gradient-based global optimization and matching-based global optimization methods and overcomes their limitations. The local gradient step provides a high-quality initial flow, while leaving local ambiguities to be resolved later in more global formulations. The global gradient step improves the flow coherence and it converges fast because of the good initialization. The global matching step corrects gross errors introduced by derivative computation and the hierarchical process.

• We proposed a deterministic algorithm to approximate the high-breakdown robust estimator in the local gradient step. It can be faster and more accurate than algorithms based on random sampling.

Many of the above conclusions are also applicable to other low-level visual problems such as stereo matching, 3D surface reconstruction and image restoration. As an accurate and efficient low-level approach to visual motion analysis, the new method has great potential in a wide variety of applications. First of all, it provides a good starting point for higher-level motion analysis. Our flow estimates already take a layered look, and the motion boundaries of layers are closed curves. They can reliably initialize motion segmentation [109], contour-based [86] and layered [131] representations. Model selection [130] is a crucial problem in automatic scene analysis [16], which is difficult because comparing a collection of models on the raw image data involves formidable computation. Our results can ease this task by supplying a higher ground for scene knowledge learning. The backward-forward matching error, together with the detected motion boundaries, can facilitate occlusion reasoning [16]. It may also guide image warping to avoid smoothing across motion discontinuities.

Some success has been obtained in our preliminary experiments. This is important for motion estimation as well as for novel view synthesis. A noticeable problem in our results is that motion boundaries are not very smooth. This is in part due to the simplicity of our minimization method: it has a limited ability to generate new values, and propagating only among immediate neighbors can be slow and can get stuck at trivial local minima. For the purpose of global optimization, methods such as graph cuts [23], which yield very nice results in stereo matching, full multigrid methods [102], Bayesian belief propagation (BBP) [137], and local minimization methods alternative to SOR [102] are worth studying. Furthermore, the benefits of the Bayesian framework should be fully exploited. Among all the criteria from which the global energy may arise, we find the Bayesian approach most appealing, in both theoretical and practical interest. Estimating optical flow from a few images is inherently ambiguous: areas with more appropriate textures have higher estimation certainty. This indicates that the nature of the problem is probabilistic instead of deterministic. Furthermore, the Bayesian formulation may provide a graceful solution to two important problems: global optimization and confidence estimation [126, 7, 144]. Interesting results from a global optimization method, Bayesian belief propagation (BBP), have been shown on a limited domain of vision problems [137]. BBP propagates estimates together with their covariances. If it converges, it converges in a small number of iterations with covariance as a by-product. Confidence measures such as covariances are critical for subsequent applications to make judicious use of the results. It will be interesting to see if ideas like BBP are applicable and beneficial to optical flow estimation.

Chapter 5

MOTION-BASED DETECTION AND TRACKING

This chapter considers an application of visual motion to detecting and tracking point targets in an airborne video of intensity images. The research has been carried out with Engineering 2000 Inc. and the Boeing Company, as a part of the effort to develop a UAV See And Avoid System. The greatest difficulty in this project lies in the extremely small target size. For many purposes of airborne visual surveillance, such as collision avoidance, targets need to be identified as far away, and thus as early, as possible. This requires handling targets no more than a few pixels large. Meanwhile, it is common for airborne video imagery to have low image quality, substantial camera wobbling and plenty of background clutter. How to reliably detect and track point targets irrespective of these distractions is a very challenging issue that has seldom been dealt with.

The primary cue for detection in aerial surveillance is the motion difference between the target and the background. This is especially true in our problem, because the tiny target has almost no other features to separate it from background clutter. Detecting and associating objects based on brightness patterns [40] easily leads to false matches and tracking failure. A popular motion-based detection method fits the background motion to a gradient-based parametric model and takes pixels with large fitting residuals as belonging to potential targets [12, 63, 101]. This kind of approach, unfortunately, only works for objects with extended spatial support, and does not apply to fast-moving point targets. We develop a hybrid motion estimation method and a hypothesis test to identify small independently moving objects. Specifically, for each pair of adjacent frames, we compute the global motion with a hierarchical model-based method, estimate individual pixel motions by template matching, and detect object pixels as those for which the two values are statistically significantly different. The detection threshold is chosen with a clear statistical meaning and remains fixed for all frames.

[Block diagram: Data → Detector → Target Measurement → Tracker → Target State]

Figure 5.1: A typical detection-tracking system

Tracking has been intensively studied in a wide variety of areas and also as a general information processing problem [6, 30, 139, 65]. It can become highly involved and error-prone when dealing with multiple maneuvering targets and low-quality measurements. In aerial surveillance applications, tracking can be considered relatively easy, since aerial targets are normally well separated and have predictable dynamics. Single-target tracking is usually formulated, either explicitly or implicitly, as a Bayesian state estimation problem. The Kalman filter is a Bayesian tracker under linear/Gaussian assumptions and is the most widely used tracking technique in practice. We assume that the target position, after the global motion is compensated, conforms to a second-order kinematic model, and track it using a Kalman filter. Gandhi et al. [40] take a similar approach, but they rely on a navigation system for the global motion parameters. The temporal integration method proposed by Pless et al. [101] is also similar to a Kalman filter but involves more heuristics.

In a typical detection-tracking system, as Fig. 5.1 shows, the data flow between the two components takes only one direction: from the detector to the tracker. The detector assumes a uniform prior distribution over the measurement space and makes decisions in a Neyman-Pearson (NP) mode. This is what the detector starts with when it has no idea of target presence. It is also how most existing detection-tracking systems operate [139, 101, 40].

Once an object is detected and a track is formed for it, priors become available to the detector in the form of the predicted state and its associated covariance. Taking the tracker feedback into account, the detector can operate in a Bayesian mode. This amounts to boosting the prior surface near the expected value, or equivalently lowering the NP test threshold for states consistent with the priors. At a minimal computational cost, the Bayesian detector achieves remarkably lower false-alarm and misdetection rates and higher position accuracy than the Neyman-Pearson detector. The data flow in the Bayesian system is

[Block diagram: Image Sequence → Motion-Based Bayesian Detector → Measurement & Covariance → Kalman Filtering Tracker → State & Covariance, with a Prior Prediction feedback from the tracker to the detector]

Figure 5.2: Proposed Bayesian system

bidirectional: the tracker tells the detector where to look for measurements, and the detector returns what it finds. The hybrid motion-based detector and the Bayesian detection method are crucial to high tracking accuracy and efficiency, and they form the major contributions of our work. In an experiment on an 1800-frame real video clip, no false targets are detected; the true target is tracked from the second frame, with position error mean and standard deviation as low as 0.88 and 0.44 pixels, respectively.

5.1 Bayesian State Estimation

Bayes’ theorem gives the rule for updating the belief in a hypothesis H (i.e., the probability of H) given background information (context) I and additional evidence E:

p(H|E,I) = p(E|H,I) p(H|I) / p(E|I).

The posterior probability p(H|E,I) gives the probability of the hypothesis H after considering the effect of the evidence E in context I. The p(H|I) term is the prior probability of H given I alone; that is, the belief in H before the evidence E is considered. The likelihood term p(E|H,I) gives the probability of the evidence assuming the hypothesis H and the background information I are true. The denominator, p(E|I), is independent of H and can be regarded as a normalizing or scaling constant. The information I is a conjunction of (at least) all of the other statements relevant to determining p(H|I) and p(E|I). A Bayesian estimate is optimal in the sense that it is derived from all available information, and no other inference can do better.
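To make the rule concrete, the following minimal numeric sketch (in Python; all probabilities are hypothetical, chosen only for illustration) updates the belief in a hypothesis H, say "this pixel belongs to a target", after one piece of evidence:

    # Hypothetical illustration of Bayes' rule; the numbers are made up.
    p_H = 0.01             # prior p(H|I): targets are rare a priori
    p_E_given_H = 0.90     # likelihood p(E|H,I)
    p_E_given_notH = 0.05  # likelihood p(E|not H,I): the false-alarm rate

    p_E = p_E_given_H * p_H + p_E_given_notH * (1 - p_H)  # normalizer p(E|I)
    p_H_given_E = p_E_given_H * p_H / p_E                 # posterior p(H|E,I)
    print(round(p_H_given_E, 3))  # about 0.154: one observation is rarely conclusive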

Suppose we want to estimate the state variable x_t of a dynamic system at time t given all the observations up to time t. In the Bayesian framework, we need to propagate the conditional probability p(x_t|Y_t), where Y_t = {y_i | i = 1, . . . , t} is the entire set of observations.

We define X_t = {x_i | i = 0, 1, . . . , t} as the history of the system state; x_0 reflects our prior knowledge about the state before any evidence is collected. Applying Bayes’ rule we have

p(x_t|Y_t) = p(y_t|x_t, Y_{t-1}) p(x_t|Y_{t-1}) / p(y_t|Y_{t-1}).

Note that to make an inference at time t, we would need to carry the entire history of states and observations along. This incurs great modeling and computational difficulties. To keep the problem manageable, the following three assumptions are commonly made.

• the y_t’s are mutually independent: p(y_i, y_j) = p(y_i) p(y_j);

• each y_t is independent of the rest of the dynamic process: p(y_t|X_t) = p(y_t|x_t);

• the system has the Markov property, so that each new state is conditioned only on the immediately preceding state: p(x_t|X_{t-1}) = p(x_t|x_{t-1}).

These assumptions are reasonable in our application, and they dramatically simplify p(x_t|Y_t) to

p(x_t|Y_t) = p(y_t|x_t) p(x_t|Y_{t-1}) / p(y_t),

where

p(x_t|Y_{t-1}) = ∫ p(x_t|x_{t-1}) p(x_{t-1}|Y_{t-1}) dx_{t-1}.

Since the denominator p(y_t) is not related to x_t, we may take it as a scaling factor and rewrite the first equation as

p(x_t|Y_t) = C_t p(y_t|x_t) p(x_t|Y_{t-1}).

These equations suggest a recursive procedure for updating p(x_t|Y_t). Once we have

• the measurement model p(y_t|x_t),

• the system model p(x_t|x_{t-1}), and

• the prior model p(x_0) = p(x_0|x_{-1}),

we may propagate the probability in two phases:

• prediction: p(x_t|Y_{t-1}) = ∫ p(x_t|x_{t-1}) p(x_{t-1}|Y_{t-1}) dx_{t-1},

• correction: p(x_t|Y_t) = C_t p(y_t|x_t) p(x_t|Y_{t-1}).

While the equations appear deceptively simple, propagating the conditional probability density is not easy. In most real systems the three models take complex forms and might not be expressible analytically. Usually Monte Carlo methods have to be employed to provide a sampled representation of the density function. As is well known, such methods can be very computationally demanding, and when efforts are made to reduce the computational burden, the resulting density function might not represent the underlying truth faithfully. Some work along this direction has been done [46, 65]. The general Bayesian approach may have to be pursued in cases of multiple maneuvering targets in clutter. But for our problem a special case of the approach, the Kalman filter, is more appropriate.
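For concreteness, the following Python sketch implements the two-phase recursion with a simple Monte Carlo (particle) representation of p(x_t|Y_t); the scalar random-walk transition and Gaussian likelihood are placeholder models for illustration, not the models used in this thesis.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict(particles, q=0.5):
        # prediction: sample from a placeholder transition p(x_t | x_{t-1})
        return particles + rng.normal(0.0, q, size=particles.shape)

    def correct(particles, y, r=1.0):
        # correction: weight by the likelihood p(y_t | x_t), then resample
        w = np.exp(-0.5 * ((y - particles) / r) ** 2)
        w /= w.sum()
        idx = rng.choice(len(particles), size=len(particles), p=w)
        return particles[idx]

    particles = rng.normal(0.0, 2.0, size=1000)  # samples from the prior p(x_0)
    for y in [0.8, 1.1, 0.9]:                    # a toy observation stream
        particles = correct(predict(particles), y)
    print(particles.mean())                      # posterior mean estimate of x_t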

5.2 Kalman Filter

The Kalman filter is a recursive Bayesian state estimator that is optimal for linear systems with Gaussian noise. It is derived from three probabilistic models.

• The prior model: x_0 ∼ N(x̂_0, P_0). Here x̂_0 and P_0 are the mean and covariance of the state before any observation is made.

• The system model: x_{t+1} = F_t x_t + G_t u_t + w_t. Here F_t governs the linear evolution of the state variable with time; G_t u_t reflects a control input to the system, which is taken as a known constant; and w_t ∼ N(0, Q_t) is the process noise.

• The measurement model: y_t = H_t x_t + v_t. Here H_t relates the observed measurement y_t to the underlying true state, and v_t ∼ N(0, R_t) is the measurement noise.

The updating process can be summarized as

• Prediction:

x̂_t^- = F_{t-1} x̂_{t-1} + G_{t-1} u_{t-1}

P_t^- = F_{t-1} P_{t-1} F_{t-1}' + Q_{t-1}.

• Correction:

P_t = ((P_t^-)^{-1} + H_t' R_t^{-1} H_t)^{-1} = P_t^- − P_t^- H_t' (H_t P_t^- H_t' + R_t)^{-1} H_t P_t^-

K_t = P_t H_t' R_t^{-1} = P_t^- H_t' (H_t P_t^- H_t' + R_t)^{-1}

x̂_t = x̂_t^- + K_t (y_t − H_t x̂_t^-)

where P_t is the posterior covariance of x_t, and K_t is the gain matrix. The computation required by Kalman filtering is very light. The crucial factors for a successful Kalman filter are (i) accurate modeling (defining the state variable, the system model and the measurement model), including appropriate parameter settings, and (ii) supplying high-quality measurements. Below we describe the Kalman filter we built for target tracking.
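The predict/correct cycle above translates directly into code; the following generic sketch (NumPy matrices, names following the notation of this section) is included for reference.

    import numpy as np

    def kf_predict(x, P, F, G, u, Q):
        # prediction: x_t^- = F x_{t-1} + G u_{t-1},  P_t^- = F P_{t-1} F' + Q
        x_pred = F @ x + G @ u
        P_pred = F @ P @ F.T + Q
        return x_pred, P_pred

    def kf_correct(x_pred, P_pred, y, H, R):
        # correction: K_t = P_t^- H' (H P_t^- H' + R)^{-1}
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x = x_pred + K @ (y - H @ x_pred)
        P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred
        return x, P

With H = I (the full state observed), this reduces to the tracker equations of the next section.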

5.3 Tracking

The first step in designing the tracker is to model the dynamic behavior of the target. The simplest model is the second-order kinematic model, or constant translation model,

p_t = p_{t-1} + v_{t-1},

where p_t = (x_t, y_t)' is the target position and v_t = (v_{xt}, v_{yt})' is the velocity, which should be constant. In our problem, v_t is not constant: it is the sum of two velocities, the velocity v_t^R of the target itself and the velocity v_t^G of the background due to camera/airplane motion. v_t^G is quite random, and so is v_t. But the component v_t^R = v_t − v_t^G remains quite steady over time. This gives us a way of predicting v_t:

v_t = v_{t-1} + v_t^G − v_{t-1}^G.

Here the background motion can be accurately estimated between each pair of frames (Section 5.4), and is considered a known control input to the system. In Kalman filter notation, the tracking problem can be formalized as follows.

State Variable.

θ_t = (p_t', v_t')'

where p_t = (x_t, y_t)' is the centroid position of the target, and v_t = (v_{xt}, v_{yt})' is its image velocity.

Prior Model. There are many ways of specifying the prior model. When no prior knowledge about the target motion is available, a diffuse prior (infinite covariance) is used, and the Kalman filtering process reduces to recursive least-squares estimation.

System Model.

θ_t = F θ_{t-1} + u_t + w_t    (5.1)

where the control input is

u_t = (0, 0, (v_t^G − v_{t-1}^G)')',

the stationary system matrix is

F = | 1 0 1 0 |
    | 0 1 0 1 |
    | 0 0 1 0 |
    | 0 0 0 1 |,

and w_t ∼ N(0, Q). We assume the process noise covariance to be

Q = ε F P_{t-1} F'

in the covariance prediction (Eq. 5.4). The effect is to multiply F P_{t-1} F' by (1 + ε) to obtain the covariance of the predicted state P_t^-. This is an ad hoc way of accounting for errors from unmodeled sources, known in stochastic estimation as exponential aging [126].

Measurement Model. We assume the observed state y_t is a contaminated version of the true value:

y_t = θ_t + ν_t    (5.2)

where the noise ν_t ∼ N(0, R_t), and R_t is block-diagonal (position errors are correlated and velocity errors are correlated). Accurate estimation of (y_t, R_t) is crucial to the success of the filter and is the topic of the next two sections. The optimal solution procedure is given below.

Prediction.

θ̂_t^- = F θ̂_{t-1} + u_t    (5.3)

P_t^- = F P_{t-1} F' + Q    (5.4)

Correction.

K_t = P_t^- (P_t^- + R_t)^{-1}    (5.5)

θ̂_t = θ̂_t^- + K_t (y_t − θ̂_t^-)    (5.6)

P_t = (I − K_t) P_t^-    (5.7)

The numerical instability of the Kalman filter is well known. It arises from the matrix inversion in evaluating K_t (Eq. 5.5): if (P_t^- + R_t) is poorly conditioned, the value obtained for K_t is dominated by round-off errors. We deal with the problem using a simple method: adding a tiny positive number ε to the diagonal entries of the posterior covariance P_t. Currently we set ε to 1% of the smallest diagonal entry. The importance of parameter tuning cannot be overstated in optimizing the performance of real Kalman filtering systems, but here we emphasize the theoretical side of the problem and leave the practical aspects for future consideration.
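Putting the pieces of this section together, a minimal sketch of one tracker step follows. The value of the exponential-aging factor is an assumption for illustration (the thesis does not state its ε for Q here); the 1% diagonal regularization follows the text, and H = I because the full state is measured (Eq. 5.2).

    import numpy as np

    F = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [0., 0., 1., 0.],
                  [0., 0., 0., 1.]])

    def track_step(theta, P, u, y, R, aging=0.1):
        # prediction with exponential aging: Q = aging * F P F'  (aging is illustrative)
        FPFt = F @ P @ F.T
        theta_pred = F @ theta + u
        P_pred = (1.0 + aging) * FPFt          # P_t^- = (1 + eps) F P F'
        # correction with H = I (Eqs. 5.5-5.7)
        K = P_pred @ np.linalg.inv(P_pred + R)
        theta_new = theta_pred + K @ (y - theta_pred)
        P_new = (np.eye(4) - K) @ P_pred
        # keep (P^- + R) well conditioned: add 1% of the smallest diagonal entry
        P_new += 0.01 * P_new.diagonal().min() * np.eye(4)
        return theta_new, P_new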

5.4 Motion-Based Detection

This section addresses the problem of detecting independently moving pixels between two frames, I_t and I_{t+1}, and measuring the object state. Examples are given on three 100 × 100 sample frames (Fig. 5.3). f16502 and f18300 have a target near the center. f19000 has no

target but many ground objects that resemble targets in appearance. These data sets are cropped out of the full data set (Section 5.7).

Candidates as Background Motion Outliers.

The background motion v^G is introduced by the relative ground-camera movement. The ground is well approximated by a planar object, and its image motion conforms to the quadratic model of Eq. 2.5. Substituting this model into the optical flow constraint equation (Eq. 2.3), we obtain a linear constraint A_s a = b_s at each pixel location s, where

A_s = [ I_x  I_y  xI_x  xI_y  yI_x  yI_y  (x^2 I_x + xy I_y)  (xy I_x + y^2 I_y) ],  b_s = −I_t.

We solve this regression problem using least squares. To reduce the impact of outliers, we refine the LS estimate under the least-trimmed-squares criterion (Eq. 3.3) using the C-step in the FAST-LTS implementation (Section 3.1.1). The number of equations n is equal to the number of pixels in each frame. Processing time increases proportionally with n while estimation accuracy quickly saturates, since the number of unknowns is fixed at 8. Our experiments show that using 2500 of the n equations achieves the same accuracy at a small constant cost. To handle large motion, up to 4 pixels in our client data, we adopt a hierarchical scheme with two pyramid levels (Section 2.6 and Fig. 2.2). The projection operation for the planar flow parameter a can be expressed as

a_p^{i-1}(0, 1) = 2 a_p^i(0, 1),

a_p^{i-1}(2, 3, 4, 5) = a_p^i(2, 3, 4, 5),

a_p^{i-1}(6, 7) = a_p^i(6, 7) / 2.

Once the estimate â is available, the velocity v^G(x, y) at any position (x, y) can be calculated from Eq. 2.5. Background motion can be very accurately recovered by the above process. Fig. 5.4(a) shows the frame difference between f16502 and f16503 after the background motion is removed by image warping. The difference is very small except near the target, which moves independently, and near the image border, where warping errors are significant. It is this high accuracy that allows us to consider v^G a known control input to the system in the Kalman filter model (Eq. 5.1).
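A sketch of the constraint stacking and the least-squares stage follows (Python/NumPy; the LTS C-step refinement and the pyramid are omitted, and the derivative images Ix, Iy, It are assumed to be given as same-sized float arrays):

    import numpy as np

    def planar_flow_lsq(Ix, Iy, It, n_eq=2500, seed=0):
        # Stack the linear constraints A_s a = b_s over all pixels and solve
        # by least squares on a random subsample of n_eq equations.
        h, w = Ix.shape
        ys, xs = np.mgrid[0:h, 0:w]
        x, y = xs.ravel().astype(float), ys.ravel().astype(float)
        ix, iy, it = Ix.ravel(), Iy.ravel(), It.ravel()
        A = np.column_stack([ix, iy, x*ix, x*iy, y*ix, y*iy,
                             x*x*ix + x*y*iy, x*y*ix + y*y*iy])
        b = -it
        pick = np.random.default_rng(seed).choice(
            len(b), size=min(n_eq, len(b)), replace=False)
        a, *_ = np.linalg.lstsq(A[pick], b[pick], rcond=None)
        return a  # the 8 quadratic-model parameters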

Figure 5.3: Example data sets. Column 1: first frame, Column 2: second frame, Column 3: frame difference (first frame minus second frame). Row 1: f16502 (target is the white dot near the center), Row 2: f18300 (target near the center), Row 3: f19000 (no target but many target-like ground objects).


Figure 5.4: f16502 target pixel candidates. (a) frame difference after background motion is removed. (b) pixels of large warping errors. Postprocessing: (c) isolated pixels removed. (d) dilated by 3 × 3.

For slow-moving objects of sufficiently large size, the planar model fitting error serves as a good indicator of independent motion [63, 101]. Another method, which applies to small objects with considerable image motion, is to warp I_{t+1} towards I_t according to v^G, yielding a new image I_t^W, and to consider pixels with large warping errors |I_t − I_t^W| as potentially independently moving [40]. We use this method to find candidate pixels. We estimate the image noise variance σ^2 by fitting a facet model to the image data (Section 3.1.2). Denoting by σ̂_i the standard deviation estimate at the ith pixel, we set σ̂ = 1.4826 median_i σ̂_i. We then take pixels with warping errors exceeding 2.5σ̂ as candidates. Two processes further refine the candidate set: isolated pixels are pruned and the remaining ones are dilated by one pixel in each direction. Results for f16502 are given in Fig. 5.4. We observe that the target pixels are successfully selected and the number of false alarms is very small. Results for f18300 and f19000 are given in Fig. 5.5.
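The candidate-selection step (warping-error threshold plus the two refinement passes) might look as sketched below; sigma_map stands for the per-pixel facet-model noise estimates, which are assumed to be computed elsewhere:

    import numpy as np
    from scipy.ndimage import convolve, binary_dilation

    def candidate_pixels(I_t, I_warped, sigma_map):
        # robust noise scale: sigma_hat = 1.4826 * median_i sigma_hat_i
        sigma = 1.4826 * np.median(sigma_map)
        cand = np.abs(I_t - I_warped) > 2.5 * sigma
        # prune isolated pixels (no other set pixel in the 3 x 3 window)
        counts = convolve(cand.astype(int), np.ones((3, 3), int), mode='constant')
        cand &= counts > 1
        # dilate the survivors by one pixel in each direction
        return binary_dilation(cand, structure=np.ones((3, 3), bool))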

As we see in the above results, this method does locate the target, but it also produces a considerable number of false detections, especially for f19000 (Fig. 5.5). This is because a large intensity change can result from independent motion as well as from strong intensity variation. Feeding such measurements to the tracker imposes a penalty on both tracking accuracy and response time [40]. Therefore we further exploit motion information to resolve the ambiguity.

Candidate Pixel Motion and Covariance.

It is important to point out that hierarchical gradient-based methods cannot be extended to calculating candidate pixel motion, because they require information aggregation in a neighborhood much larger than the spatial support of a point target. Therefore, we calculate target candidate pixel motions using the matching-based (template matching) method [3, 126], which requires minimal spatial integration.

For each candidate pixel, we take its 3 × 3 neighborhood as the template and find its best match in a window of size w × w in the next frame I_{t+1}. The displacement gives us a pixel-accuracy solution v̂_0. The window size w should be chosen large enough to encompass the range of the velocity, but as small as possible to avoid false matches. Here we let w = 4 according to the observed maximum image motion. To achieve sub-pixel accuracy, we refine v̂_0 by a

Figure 5.5: f18300 and f19000 target pixel candidates. Left column: f18300. Right column: f19000. Row 1: frame difference after background motion is removed. Row 2: pixels of large warping errors. Row 3: isolated pixels removed.

quadratic fit of the error surface surrounding it. In particular, we take the matching errors in the 3 × 3 neighborhood centered at v̂_0 and fit them to a 2D quadratic surface

e = v'Av + b'v + c = a_1 + a_2 v_x + a_3 v_y + a_4 v_x^2 + a_5 v_x v_y + a_6 v_y^2,

then find the minimum of the surface and the displacement achieving it as, respectively,

e_min = c − b'A^{-1}b/4 = c + b'v_min/2,

v_min = −A^{-1}b/2,

and obtain the final motion estimate as

v̂ = v̂_0 + v_min.

The covariance matrix of v̂ is available as a by-product of the quadratic fitting. It can be shown [126] that

Σ_v̂ = (e_min / 9) A^{-1}.

When v_min is greater than 1 pixel in either direction, e_min < 0, or a diagonal entry of Σ_v̂ is negative, the motion estimator has clearly made a mistake; we give up on that pixel and take it as belonging to the background. We also give up on the estimate when the larger eigenvalue of Σ_v̂ exceeds 0.25, which corresponds to position errors above 0.5 pixels.

Independent Motion from a χ² Test.

If a candidate pixel actually belongs to the background, its image motion v should be no different from the background motion v^G. That is, under the null hypothesis H_0: v = v^G, we should have v − v^G ∼ N(0, Σ), and the test statistic

T = (v − v^G)' Σ^{-1} (v − v^G)

conforms to a χ² distribution with 2 degrees of freedom. An independent motion is detected when H_0 is rejected. We reject H_0 at the significance level α = 0.05, which amounts to finding T > T_α = 5.9915. This test is simple yet very effective. As Fig. 5.6 shows, almost all the clutter is now gone and the target stands out as the only significant connected pixel set.

Data Association, Object Detection and State Measurement.

Figure 5.6: Target pixels for f16502, f18300 and f19000. Left column: target pixel candidates. Right column: target pixels detected by the statistical test (small connected pixel sets are considered noise and removed). Row 1: f16502. Row 2: f18300. Row 3: f19000.

Moving pixels are first assigned to existing tracks according to the nearest-neighbor rule. New tracks are formed for the remaining large connected sets (≥ 5 pixels). Given a cluster of pixels, we calculate the object velocity and its covariance by averaging all the (v, Σ) values, and calculate the position and its covariance from the sample mean and sample variance. This provides accurate (y_t, R_t) estimates to the tracker.

5.5 Bayesian Detection

The object detector described above assumes every pixel is equally likely to be an object pixel, and tries to make the distinction solely on the evidence it collects from two adjacent frames. That is the best we can do when initiating a track with no prior knowledge. Due to the extremely small object size, detection is still difficult, especially for small independent motions. Meanwhile, once a track is formed, priors on the object state become immediately available from the tracker in the form of a predicted state distribution. From a Bayesian point of view, to pursue optimal detection results we are obliged to exploit these priors. This section introduces a Bayesian object detector. It is an important feature of our system; most detectors in previous visual surveillance applications [101, 40] always operate in the Neyman-Pearson mode. The prior distribution of the state at time t is exactly the distribution predicted from time t − 1 by the tracker. Our system is in all respects linear/Gaussian, and hence the distribution is defined by the mean θ̂_t^- and covariance P_t^- as in Eqs. 5.3 and 5.4. The Bayesian detector utilizes the priors in two phases: (i) augmenting the candidate set by adding pixels falling into predicted 3σ regions, and (ii) validating/updating the velocity estimates. Each predicted candidate pixel has two sets of velocity and covariance available: one from the matching-based motion estimation step and the other from prediction, which we denote as (v_0, Σ_0) and (v^-, Σ^-), respectively. To combine these two pieces of evidence, we first conduct a consistency test: candidates with pixel motion significantly different from the prediction are rejected. As summarized below, the χ² test works in the same way as in the detection of independent motions. Pixels failing the test are considered to have poor motion estimates and are taken as background pixels.

• Null hypothesis: H_0: v_0 ∼ N(v^-, Σ^-).

• Test statistic: T = (v_0 − v^-)' (Σ^-)^{-1} (v_0 − v^-) ∼ χ²_2 under H_0.

• Reject H_0 when T > T_α; (α, T_α) is fixed at (0.05, 5.9915).

We next calculate the posterior motion estimate (v̂, Σ̂) for each remaining candidate using formulae similar to the correction phase of the Kalman filter [148]:

K = Σ^- (Σ^- + Σ_0)^{-1}

v̂ = v^- + K (v_0 − v^-)

Σ̂ = (I − K) Σ^-.

The independent motion test is then done for both the prior and posterior estimates. Object pixels are much easier to identify in the Bayesian mode because of the boosted density distribution, or equivalently the dampened threshold in the χ² test, around the predicted values [139]. Fig. 5.7 illustrates the impact of priors on f16503. Comparing the original candidate set (a) with the augmented one (b, predicted pixels in gray), we see that the position priors help to locate candidates missed by the motion-based detector. Object pixels detected with no priors, with position priors only, and with full priors are given in (c), (d) and (e), respectively. Lower misdetection rates are achieved as more priors are incorporated.


Figure 5.7: Detection results with and without priors on f16503.
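A sketch of the per-candidate prior handling (the consistency test followed by the Kalman-style fusion above) is:

    import numpy as np

    def fuse_with_prior(v0, Sigma0, v_pred, Sigma_pred, T_alpha=5.9915):
        # consistency test of the measured motion against the prediction
        d = v0 - v_pred
        if float(d @ np.linalg.inv(Sigma_pred) @ d) > T_alpha:
            return None  # inconsistent: treat the pixel as background
        # posterior fusion, same form as the Kalman correction phase
        K = Sigma_pred @ np.linalg.inv(Sigma_pred + Sigma0)
        v_hat = v_pred + K @ (v0 - v_pred)
        Sigma_hat = (np.eye(len(v0)) - K) @ Sigma_pred
        return v_hat, Sigma_hat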

5.6 The Algorithm

Note that so far we have been careful to use the word “object” instead of “target” when talking about detection and tracking. This is because not all objects we track are targets, only those with consistent dynamic behavior. In our application, we watch any object for N_i = 25 frames (about 0.8 seconds), and declare it a target only if, during that period, it never misses a measurement and its position covariance always stays within an allowed range. We also permit a target to miss measurements for up to N_t = 5 frames. When a measurement is missing, we use the prediction as the new state, and we confine the position covariance to the allowed range. The range is defined by the larger eigenvalue of the covariance matrix, and the maximum is set to 2 squared pixels. Tracks corresponding to false objects and dead targets are terminated. We record four properties for the track associated with each object: a unique ID id, the track length (since the first detection) hist, the number of successive frames in which the object is not measured miss, and the object state X_t. Here we briefly synopsize the execution of the algorithm on each frame; a small code sketch of the track-management rules follows the synopsis.

1. Prediction. Extrapolate the new state¹ and update the covariance (Eqs. 5.3 and 5.4).

2. Detection.

(a) Calculate global motion parameters. Find candidates from both large warping errors and position priors.

(b) Estimate candidate motion. Additionally, compute posterior estimates for predicted candidates.

(c) Detect independently moving pixels using the χ² test.

(d) Assign detected pixels to existing tracks; initiate new tracks from unassigned large connected sets.

¹We do prediction before detection in order to supply priors to the Bayesian detector. However, since predicting target positions needs the new global motion parameters (Eq. 5.1), the prediction process is actually completed in the detection phase.

Figure 5.8: Two sample frames in the video clip, (a) frame 16540 and (b) frame 18000 (targets at the center of the 41 × 41 windows).

(e) Measure track states.

3. Correction.

(a) Update object states (if there is no new measurement, use the prediction and increase miss by 1; otherwise perform the Kalman filter correction (Eq. 5.6) and set miss ← 0).

(b) Remove a target when miss = N_t. Terminate an object track if miss > 0 or the position covariance is too large.
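The track-management rules above can be condensed into the small sketch below; the Track class is a simplified stand-in for the bookkeeping described in the text (id, hist, miss, state), and the thresholds follow the values given in this section.

    import numpy as np

    N_I, N_T, MAX_VAR = 25, 5, 2.0  # confirm frames, allowed misses, eigenvalue bound

    class Track:
        def __init__(self, tid):
            self.id, self.hist, self.miss = tid, 0, 0
            self.is_target = False

        def update(self, has_measurement, pos_cov):
            # Returns False when the track should be terminated.
            self.hist += 1
            self.miss = 0 if has_measurement else self.miss + 1
            well_localized = np.linalg.eigvalsh(pos_cov).max() <= MAX_VAR
            if not self.is_target:
                # confirm as target only after N_i clean, well-localized frames
                if self.miss > 0 or not well_localized:
                    return False            # false object: terminate
                self.is_target = self.hist >= N_I
            elif self.miss >= N_T:
                return False                # dead target: terminate
            return True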

5.7 Experiments

We demonstrate the system performance on an 1800-frame video clip (sample frames in Figure 5.8) obtained from real flight data. The frame rate is 30 frames/second and the image size is 256 × 192 pixels. There is one target in the clip. It is about 2 × 1 to 3 × 3 pixels in size, and its maximum image motion is about 5 pixels/frame. Many ground objects resemble the target in brightness and/or shape. The camera is constantly wobbling and the image quality is low.

Exp   MD   FA   ITR     min    max    med    mean   sd
I      0    0   0.999   0.05   3.88   0.82   0.88   0.44
II     0    2   0.988   0.12   3.82   0.99   1.06   0.52
III    0    0   0.618   0.14   3.78   0.83   0.83   0.37

Table 5.1: Quantitative measures in 1800 frames. MD: number of missed targets. FA: number of false targets. ITR: in-track rate (target track length divided by 1800). The remaining measures refer to position error vector magnitude statistics (unit: pixel).

All experiments are carried out on a PIII 500 MHz PC running Solaris. The current implementation is not optimized for speed, but it is reasonably fast, spending about 2.3 seconds per frame, of which about 2 seconds go to background motion estimation. The algorithm has the potential to run in real time.

In the 1800-frame clip, a total of 252 objects are detected, and only one of them is identified as a target. The target is in track from the second frame on. We mark the target by placing a 7 × 7 white box centered at its estimated position. As can be observed in the output video [148], the marker encloses the target throughout the sequence.

We managed to locate target centroid positions in 564 frames and used them as the ground truth in quantitative evaluation. Table 5.1 gives results from three experiments. Experiment I uses the proposed method. To illustrate the effectiveness of the Bayesian detector, we also performed Experiment II, in which only position priors are used, and Experiment III, in which no priors are used.

In both Experiments II and III, the true target is still detected and thus there is no misdetection. Experiment II has two false alarms due to the absence of the consistency test between the estimated and predicted motion vectors (Section 5.5); its in-track rate is slightly lower, while its localization accuracy severely degrades. Experiment III's error measures are comparable to Experiment I's, but its in-track rate suffers a drastic decrease: the target is not confirmed until 23 seconds later and its track is broken twice. This could be unacceptable in time-critical applications such as collision avoidance.

5.8 Discussion

We have introduced a novel approach to point target detection and tracking in low-quality airborne video. We identify objects by the statistical difference between their motions and the background motion, and track their dynamic behavior in order to detect targets and update their states. Compared to most previous visual surveillance studies, our method has four main advantages.

• The hybrid motion-based detector is highly efficient in suppressing background clutter, locating moving objects and modeling their dynamics. It enables us to employ a simple Kalman filter for object tracking.

• With priors exploited in detection, the false-alarm, misdetection and in-track rates improve significantly.

• The extensive use of statistical tests rather than heuristics reduces parameter tuning to a minimum.

• The Bayesian detection-tracking approach is readily applicable to other data sources such as UV and RGB images. It allows results from different channels to be easily integrated to yield more reliable output.

Performance of the proposed technique has been very encouraging in preliminary experiments. More real and synthetic data are needed for further evaluation. Currently the approach is being integrated into a UAV See And Avoid System being jointly developed by Engineering 2000 Inc. and the Boeing Company.

Chapter 6

CONCLUSIONS

Visual motion is a compelling cue to the structure and dynamics of the world around us. Its analysis is crucial to many key problems in today's vision research, such as object/environment/human modeling, video compression, event analysis and image-based rendering. This dissertation has addressed two fundamental problems in visual motion analysis: optical flow estimation and motion-based detection and tracking. Two new approaches, exploiting local and global motion coherence respectively, have been proposed for estimating piecewise-smooth optical flow. A video surveillance system has been designed based on motion cues and Bayesian estimation theory to achieve reliable target detection and tracking. In the process of developing these techniques, statistical methods have been extensively used to measure estimation uncertainty, facilitate information fusion and achieve high robustness. This chapter summarizes the main contributions of the dissertation and points out some open questions and directions for future work.

6.1 Summary and Contributions

A two-stage-robust adaptive scheme for gradient-based local flow estimation.

Gradient-based optical flow estimation techniques consist of two stages: estimating derivatives, and organizing and solving the optical flow constraints (OFC). Both stages pool information in a certain neighborhood and are regression procedures in nature. Least-squares solutions to these regression problems break down in the presence of outliers such as motion boundaries. To cope with this problem, a few robust regression tools have been introduced for the OFC-solving stage. By carefully analyzing the characteristics of the optical flow constraints and comparing the strengths and weaknesses of different robust regression tools, we identified the least-trimmed-squares (LTS) technique as the most appropriate for the

OFC stage. Derivative calculation, a very similar information-pooling step, has seldom received proper attention in optical flow estimation. Crude derivative estimators are widely used; as a consequence, robust-OFC (one-stage robust) methods still break down near motion boundaries. Pointing out this limitation, we proposed to calculate derivatives from a robust facet model. To reduce the computational overhead, we carry out the robust derivative stage adaptively, according to a confidence measure of the flow estimate. Preliminary experimental results show that the two-stage robust scheme permits correct flow recovery even at immediate motion boundaries.

A deterministic high-breakdown robust method for visual reconstruction.

High-breakdown criteria are employed in both of the above regression problems. They have no closed-form solutions, and past research has resorted to certain approximation schemes. So far, all applications of high-breakdown robust methods in visual reconstruction have adopted a random-sampling algorithm: the estimate with the best criterion value is picked from a random pool of trial estimates. These methods apply the algorithms uniformly to all pixels in an image, disregarding the actual number of outliers, and suffer from heavy computation as well as unstable accuracy. Taking advantage of the piecewise smoothness of the visual field and the selection capability of robust estimators, we proposed a deterministic adaptive algorithm for high-breakdown local parametric estimation. Starting from least-squares estimates, we iteratively choose neighbors' values as trial solutions and use robust criteria to adapt them to the local constraint. This yields an estimator whose complexity depends on the actual outlier contamination. It inherits the merits of both least-squares and robust estimators and produces crisp boundaries as well as smooth inner surfaces; it is also faster than algorithms based on random sampling.

Error analysis on gradient-based local flow.

Due to the intrinsic ambiguity of visual motion and modeling imperfections, an optical flow estimate generally has spatially varying reliability. In order for subsequent applications to make judicious use of the results, the error statistics of the flow estimate have to be analyzed. In our earlier work, we conducted an error analysis for the least-squares-based local estimation method using covariance propagation theory for approximately linear systems and small errors. In this thesis, we have generalized the results to the newer robust method. Our analysis estimates image noise and derivative errors in an adaptive fashion, taking into account the correlation of derivative errors at adjacent positions. It is more complete, systematic and reliable than previous efforts.

Piecewise-smooth optical flow from global matching and graduated optimiza- tion.

By drawing information from the entire visual field, the global optimization approach to optical flow estimation is conceptually more effective at handling the aperture problem and outliers than the local approach. But its actual performance has been somewhat disappointing due to formulation defects and solution complexity. On one hand, approximate formulations are frequently adopted for ease of computation, with the consequence that the correct flow is unrecoverable even in ideal settings. On the other hand, more sophisticated formulations typically involve large-scale nonconvex optimization problems, which are so hard to solve that the practical accuracy might not be competitive with simpler methods. The global optimization method we developed provides better solutions to both problems.

From a Bayesian perspective, we assume the prior distribution of the flow field to be a Markov random field (MRF) and formulate the optimal optical flow as the maximum a posteriori (MAP) estimate, which is equivalent to the minimum of a robust global energy function. The novelty of our formulation lies mainly in two aspects: 1) three-frame matching is proposed to detect correspondences, which overcomes the visibility problem at occlusions; 2) the strengths of the brightness and smoothness errors in the global energy are automatically balanced according to local data variation, and consequently parameter tuning is reduced. These features enable our method to achieve a higher accuracy upper bound than previous algorithms.

In order to solve the resultant energy minimization problem, we developed a hierarchical three-step graduated optimization strategy. Step I is a high-breakdown robust local gradient method with a deterministic iterative implementation, which provides a high-quality initial flow estimate. Step II is a global gradient-based formulation solved by Successive Over-Relaxation (SOR), which efficiently improves the coherence of the flow field. Step III minimizes the original energy by greedy propagation; it corrects gross errors introduced by derivative evaluation and pyramid operations. In this process, the merits of all three steps are inherited and their drawbacks are largely avoided. As a result, high accuracy is obtained both on and off motion boundaries.

The performance of this technique was demonstrated on a number of homebrew and standard test data sets. On Barron's synthetic data, which have become the benchmark since the publication of [7], this method achieved the best accuracy among all low-level techniques. Close comparison with the well-known dense regularization technique of Black and Anandan (BA) [14] showed that in all of our experiments the new method yields uniformly higher accuracy at a similar computational cost.

A motion-based Bayesian approach to aerial point target detection and tracking.

In a visual surveillance project funded by the Boeing Company, we investigated an application of optical flow to airborne target detection and tracking. The greatest difficulty in this problem lies in the extremely small target size, typically 2 × 1 to 3 × 3 pixels, which makes the results of most previous aerial visual surveillance studies inapplicable. Challenges also arise from low image quality, substantial camera wobbling and plenty of background clutter.

The proposed system consists of two components: a moving object detector identifies objects by the statistical difference between their motions and the background motion, and a Kalman filter tracks their dynamic behavior in order to detect targets and update their states. Both the detector and the tracker operate in a Bayesian mode, and each benefits from the other's accuracy. The system exhibited excellent performance in experiments. On an 1800-frame real video clip with heavy clutter and a true target (1 × 2 to 3 × 3 pixels large), it produced no false targets and tracked the true target from the second frame with an average position error below 1 pixel. This probabilistic approach reduces parameter tuning to a minimum. It also facilitates data fusion from different information channels.

6.2 Open Questions and Future Work

Recovering optical flow from image sequences is a very challenging problem due to the intrinsic ambiguity of visual motion, inevitable modeling imperfections, computational difficulties, and the interweaving of these issues. Although more effective methods have been presented in this thesis to tackle these difficulties, our exploration is only a beginning; a host of issues deserve further investigation and attention.

Optical flow formulation.

The formulation of an optical flow estimation technique determines the upper bound of that technique's accuracy. For example, robust formulations model noise effects more realistically than least-squares formulations, and hence their practical accuracy uniformly supersedes that of the latter. Years of research effort have been devoted to developing more precise formulations. A question that naturally arises is: is there an optimal formulation?

Answering this question requires defining the best optical flow, which is largely application-specific: if the application is 3D structure/dynamics analysis, the best flow should be identical to the projected velocity field, whereas if the application is motion-compensated video coding, the best flow need not coincide with the projected motion as long as it minimizes the coding cost. Low-level approaches do not aim at any specific application (it is exactly their goal and strength to measure visual motion in a general setting), and therefore the best flow cannot be defined for them. It is usually implied in developing low-level techniques that the best flow estimate is the projected 2D velocity field. But since the projected motion is normally unknown, it cannot be used to derive the objective function. For these reasons, there is no optimal formulation for low-level approaches.

Progress in refining low-level formulations has been made by identifying more appropriate models for each individual component. In more than two decades of intensive research, a large number of formulations have been studied; among them, those considered most promising employ global optimization and robust criteria. Our work advances the state of the art in this direction by introducing three-frame matching to overcome the visibility problem at occlusions and by allowing local variation in the global scheme to reduce parameter tuning and improve local adaptivity. For further improvement, problems such as the modeling of the three-frame matching error, the choice of robust estimator and the learning of parameters will be investigated in our future work. Developing more appropriate formulations continues to be a significant topic in the field of optical flow estimation.

Energy minimization.

In refining optical flow formulations, computational complexity increases rapidly with model sophistication. This is especially true for global approaches involving large-scale nonconvex optimization problems. No practical numerical methods exist for finding the global optimum; only a local optimum can be found. This fact causes great difficulties in problem diagnosis: when a technique yields a poor estimate, it can be very hard to tell whether this is due to formulation weaknesses or to the local-minimum nature of the solution.

Investigating more globally optimal solutions to large-scale nonconvex problems is one of our immediate future work directions. Towards this end, methods such as graph cuts [23], which have yielded nice results in stereo matching, full multigrid methods [102], Bayesian belief propagation (BBP) [137], and local minimization methods alternative to SOR [102] are worth studying. As many areas of computer vision are converging to energy minimization formulations, progress in this research is expected to have impact in a wide context.

Uncertainty analysis.

Systematic uncertainty analysis is a very crucial, yet very difficult, problem. We made an attempt to examine the uncertainty in the local gradient-based flow estimate and demonstrated its effectiveness in a motion boundary detection application. Although our approach goes one step beyond previous efforts, it is based on propagating small perturbations through approximately linear systems, and it breaks down when the estimate quality becomes too low. How to make a system aware of its own failure remains an open issue. In addition, almost all previous error analyses were performed for local techniques; how to measure the uncertainty of a global approach is another open issue.

Performance evaluation and comparison.

The significance of comparative evaluation cannot be overstated: it is necessary to assess the performance of both established and new algorithms and to gauge progress in the field [7, 111]. So far the most popular evaluation method is to measure the difference between the flow estimate and the projected 2D velocity field. To maintain comparability with previously published results, we followed the methodology of Barron et al. in this thesis and reported certain statistics of the difference between our flow estimates and the synthesized ground truth. However, as we pointed out in Section 4.3.3, such evaluation methods are flawed by the aperture problem; they become problematic in textureless regions, where the correctness of the "ground truth" becomes questionable and so does the authority of quantitative evaluation based on it. For this reason, together with the simplicity of the synthetic data and error measures, "quantitative" results should be considered qualitative at best. This suggests that the inherent ambiguity of optical flow should be taken into account in quantitative evaluation: larger errors should be allowed in regions with less local information. Developing more convincing evaluation methods deserves serious attention.

Bayesian framework.

The benefits of the Bayesian framework should be fully exploited. Among all criteria from which the global energy may arise, we find the Bayesian approach the most appealing in both theoretical and practical terms. Estimating optical flow from a few images is inherently ambiguous: areas with more appropriate textures have higher estimation certainty. This indicates that the nature of the problem is probabilistic rather than deterministic. Furthermore, the Bayesian formulation may provide a graceful solution to two important problems: global optimization and uncertainty analysis [126, 7, 144]. Interesting results from a global optimization method, Bayesian belief propagation (BBP), have been shown on a limited domain of vision problems [137]. BBP propagates estimates together with their covariances; if it converges, it converges in a small number of iterations, with covariances as a by-product. It will be interesting to see whether ideas like BBP are applicable and beneficial to optical flow estimation.

Applications.

As an accurate and efficient low-level approach to visual motion analysis, the new method has great potential in a wide variety of applications. First of all, it provides a good starting point for higher-level motion analysis. Our flow estimates already exhibit a layered appearance, and the motion boundaries of the layers are closed curves. They can reliably initialize motion segmentation [109], contour-based [86] and layered [131] representations. Model selection [130] is a crucial problem in automatic scene analysis [16]; it is difficult because comparing a collection of models on the raw image data involves formidable computation. Our results can ease this task by supplying a higher ground for scene knowledge learning. The backward-forward matching error, together with detected motion boundaries, can facilitate occlusion reasoning [16]. It may also guide image warping to avoid smoothing across motion discontinuities. Some success has been obtained in our preliminary experiments. This is important to motion estimation as well as to novel view synthesis.

Visual reconstruction.

Motion estimation is one of many low-level visual reconstruction problems. Many conclusions from our work are also extendable to other low-level visual problems such as stereo matching, 3D surface reconstruction and image restoration.

BIBLIOGRAPHY

[1] M.D. Abramoff, W. J. Niessen, and M. A. Viergever. Objective quantification of the motion of soft tissues in the orbit. IEEE Trans. on Medical Imaging, 19(10):986–995, 2000.

[2] G. Adiv. Determining three-dimensional motion and structure from optical flow generated by several moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence, 7(4):384–401, 1985.

[3] P. Anandan. A computational framework and an algorithm for the measurement of visual motion. International Journal of Computer Vision, 2:283–310, 1989.

[4] S. Ayer, P. Schroeter, and J. Bigun. Segmentation of moving objects by robust motion parameter estimation over multiple frames. In Proc. European Conf. on Computer Vision, volume 2, pages 316–327, 1994.

[5] A. Bab-Hadiashar and D. Suter. Robust optical flow estimation. International Journal of Computer Vision, 29(1):59–77, 1998.

[6] Y. Bar-Shalom and T.E. Fortmann. Tracking and Data Association. Academic Press, 1988.

[7] J. L. Barron, S. S. Beauchemin, and D. J. Fleet. Performance of optical flow techniques. International Journal of Computer Vision, 12(1):43–77, 1994.

[8] J. L. Barron. A survey of approaches for determining optic flow, environmental layout and egomotion. In RBCV-TR, 1984.

[9] R. Battiti, E. Amaldi, and C. Koch. Computing optical flow across multiple scales: An adaptive coarse-to-fine strategy. International Journal of Computer Vision, 6(2):133–145, 1991.

[10] S. S. Beauchemin and J. L. Barron. The computation of optical flow. ACM Computing Surveys, 27(3):433–467, 1995.

[11] J. R. Bergen, P. Anandan, K. J. Hanna, and R. Hingorani. Hierarchical model-based motion estimation. In Proc. European Conf. on Computer Vision, pages 237–252, 1992.

[12] J. R. Bergen, P. J. Burt, R. Hingorani, and S. Peleg. A three-frame algorithm for estimating two-component image motion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(9):886–896, 1992.

[13] P. J. Besl, J. B. Birch, and L. T. Watson. Robust window operators. In Proc. European Conf. on Computer Vision, pages 591–600, 1988.

[14] M. J. Black. Robust Incremental Optical Flow. Doctoral dissertation (research report), Yale Univ., 1992.

[15] M. J. Black and P. Anandan. The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.

[16] M. J. Black and D. J. Fleet. Probabilistic detection and tracking of motion discontinuities. International Journal of Computer Vision, 38:229–243, 2000.

[17] M. J. Black and A. Jepson. Estimating optical flow in segmented images using variable-order parametric models with local deformations. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(10):972–986, 1996.

[18] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19:57–91, 1996.

[19] M. J. Black, G. Sapiro, D. Marimont, and D. Heeger. Robust anisotropic diffusion. IEEE Trans. Image Processing: Special issue on Partial Differential Equations and Geometry Driven Diffusion in Image Processing and Analysis, 7(3):421–432, 1998.

[20] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, Cambridge, MA, 1987.

[21] P. Bouthemy and E. François. Motion segmentation and qualitative dynamic scene analysis from an image sequence. International Journal of Computer Vision, 10(2):157–182, 1993.

[22] P. Bouthemy and J. S. Rivero. A hierarchical likelihood approach for region segmentation according to motion-based criteria. In Proc. International Conf. on Computer Vision, pages 463–467, 1987.

[23] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(11):1–18, 2001.

[24] M. Brooks, W. Chojnacki, D. Gawley, and A. van den Hengel. What value covariance information in estimating vision parameters? In Proc. International Conf. on Computer Vision, pages 302–308, 2001.

[25] K. Bubna and C. V. Stewart. Model selection and surface merging in reconstruction algorithms. In Proc. International Conf. on Computer Vision, pages 895–902, 1998.

[26] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. IEEE Trans. on Communication, 31:532–540, 1983.

[27] C. Fermüller, R. Pless, and Y. Aloimonos. Statistical biases in optical flow. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 561–566, 1999.

[28] C. Cafforio and F. Rocca. Methods for measuring small displacements of television images. IEEE Trans. on Information Theory, (5):573–579, 1976.

[29] S. E. Chen and L. Williams. View interpolation for image synthesis. Computer Graphics, 27(Annual Conference Series):279–288, 1993.

[30] C.-Y. Chong, D. Garren, and T. P. Grayson. Ground target tracking: a historical perspective. In IEEE Proc. Aerospace Conference, volume 3, pages 433–448, 2000.

[31] R. Cipolla, K. Okamoto, and Y. Kuno. Robust structure from motion using motion parallax. In Proc. International Conf. on Computer Vision, pages 374–382, 1993.

[32] C. Colombo and A. del Bimbo. Generalized bounds for time to collision from first-order image motion. In Proc. International Conf. on Computer Vision, pages 220–226, 1999.

[33] T. Darrell and A. Pentland. Cooperative robust estimation using layers of support. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(5):474–487, 1995.

[34] A. M. Earnshaw and S. D. Blostein. The performance of camera translation direction estimators from optical flow: Analysis, comparison, and theoretical limits. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(9):927–932, 1996.

[35] G. Farnebäck. Very high accuracy velocity estimation using orientation tensors, parametric motion, and simultaneous segmentation of the motion field. In Proc. International Conf. on Computer Vision, volume 1, pages 171–177, 2001.

[36] O. Faugeras. Three-dimensional computer vision: a geometric viewpoint. MIT Press, 1993.

[37] C.L. Fennema and W.B. Thompson. Velocity determination in scenes containing several moving objects. Computer Graphics and Image Processing, 9:301–315, 1979.

[38] D. J. Fleet and A. D. Jepson. Computation of component image velocity from local phase information. International Journal of Computer Vision, 5(1):77–104, 1990.

[39] B. Galvin, B. McCane, K. Novins, D. Mason, and S. Mills. Recovering motion fields: An evaluation of eight optical flow algorithms. In Proc. British Machine Vision Conf., volume 1, pages 195–204, 1998.

[40] T. Gandhi, M. Yang, R. Kasturi, O. Camps, and L. Coraor. Detection of obstacles in the flight path of an aircraft. In Proc. Computer Vision and Pattern Recognition, volume 2, pages 304–311, 2000.

[41] D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(3):367–384, 1992.

[42] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6(6):721–741, 1984.

[43] S. Geman and D. E. McClure. Statistical methods for tomographic image reconstruction. Bull. Int. Statist. Inst., 2(4):5–21, 1987.

[44] S. Ghosal and P. Vanek. A fast scalable algorithm for discontinuous optical flow estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(2):181–194, 1996.

[45] A. Giachetti, M. Campani, and V. Torre. The use of optical flow for the autonomous navigation. In Proc. European Conf. on Computer Vision, pages A:146–151, 1994.

[46] N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2):107–113, 1993.

[47] N. Gupta and L. Kanal. Gradient based motion estimation without computing gradients. International Journal of Computer Vision, 22:81–101, 1997.

[48] K.J. Hanna. Direct multi-resolution estimation of ego-motion and structure from motion. In Proc. Workshop on Visual Motion, pages 156–162, 1991.

[49] R. M. Haralick. Computer vision theory: the lack thereof. CVGIP, 36:372–386, 1986.

[50] R. M. Haralick, editor. Proc. 1st International Workshop on Robust Computer Vision, Seattle, WA, Oct. 1990.

[51] R. M. Haralick, editor. Workshop Proc. Performance vs. Methodology in Computer Vision, Seattle, WA, June 1994.

[52] R. M. Haralick. Propagating covariance in computer vision. International Journal of Pattern Recognition and Artificial Intelligence, 10(5):561–572, 1996.

[53] R. M. Haralick and J. S. Lee. The facet approach to optic flow. In Proc. Image Understanding Workshop, pages 84–93, 1983.

[54] R. M. Haralick and L. G. Shapiro. Computer and Robot Vision. Addison-Wesley publishing company, 1992.

[55] R. M. Haralick and L. Watson. A facet model for image data. CVGIP, 15:113–129, 1981.

[56] D. Heeger. Optical flow using spatio-temporal filters. International Journal of Computer Vision, 1(4):279–302, 1988.

[57] F. Heitz and P. Bouthemy. Multimodal estimation of discontinuous optical flow using Markov random fields. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15(12):1217–1232, 1993.

[58] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17:185–203, 1981.

[59] Y. Huang, K. Palaniappan, X. Zhuang, and J. E. Cavanaugh. Optical flow field segmentation and motion estimation using a robust genetic partitioning algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(12):1177–1190, 1995.

[60] M. Ioka and M. Kurokawa. Estimation of motion vectors and their application to scene retrieval. MVA, 7(3):199–208, 1994.

[61] M. Irani. Multi-frame optical flow estimation using subspace constraints. In Proc. International Conf. on Computer Vision, pages 626–633, 1999.

[62] M. Irani and P. Anandan. A unified approach to moving object detection in 2D and 3D scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(6):577–589, 1998.

[63] M. Irani, B. Rousso, and S. Peleg. Computing occluding and transparent motions. International Journal of Computer Vision, 12:5–16, 1994.

[64] M. Irani, B. Rousso, and S. Peleg. Recovery of ego-motion using region alignment. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(3):268–272, 1997.

[65] M. Isard and A. Blake. Condensation — conditional density propagation for visual tracking. International Journal of Computer Vision, 29(1):5–28, 1998.

[66] B. Jähne. Motion determination in space-time images. In Proc. European Conf. on Computer Vision, pages 161–173, 1990.

[67] T. Jebara, A. Azarbayejani, and A. Pentland. 3D structure from 2D motion. IEEE Signal Processing Magazine, pages 66–84, May 1999.

[68] A. Jepson and M. J. Black. Mixture models for optical flow computation. Tech. Report, Res. in Biol. and Comp. Vision RBCV-TR-93-44, Univ. of Toronto, 1993.

[69] A. D. Jepson, D. J. Fleet, and T. El-Maraghi. Robust, on-line appearance models for visual tracking. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 415–422, 2001.

[70] J.-M. Jolion, P. Meer, and S. Bataouche. Robust clustering with applications in computer vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 13(8):791–802, 1991.

[71] S. X. Ju, M. J. Black, and A. D. Jepson. Skin and bones: multi-layer, locally affine, optical flow and regularization with transparency. In Proc. Computer Vision and Pattern Recognition, pages 307–314, 1996.

[72] T. Kanade and M. Okutomi. A stereo matching algorithm with an adaptive window: theory and experiment. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(9):920–932, 1994.

[73] S. B. Kang, R. Szeliski, and J. Chai. Handling occlusions in dense multi-view stereo. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 103–110, 2001.

[74] J. K. Kearney, W. B. Thompson, and D. L. Boley. Optical flow estimation: an error analysis of gradient-based methods with local optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):229–244, 1987.

[75] V. Koivunen. A robust nonlinear filter for image restoration. IEEE Trans. Image Processing, 4(5):569–578, 1995.

[76] D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion reasoning. In Proc. European Conf. on Computer Vision, volume 1, pages 189–196, 1994.

[77] J. Konrad and E. Dubois. Bayesian estimation of motion vector fields. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(9):910–927, 1992.

[78] R. Kumar, P. Anandan, and K. Hanna. Direct recovery of shape from multiple views: a parallax based approach. In Proc. International Conf. on Pattern Recognition, pages 685–688, 1994.

[79] J. O. Limb and J. A. Murphy. Estimating the velocity of moving images in television signals. Computer Graphics and Image Processing, 4:311–327, 1975.

[80] M. I. A. Lourakis, A. A. Argyros, and S. C. Orphanoudakis. Independent 3D motion detection using residual parallax normal flow fields. In Proc. International Conf. on Computer Vision, pages 1012–1017, 1998.

[81] M. I. A. Lourakis and S. C. Orphanoudakis. Using planar parallax to estimate the time-to-contact. In Proc. Computer Vision and Pattern Recognition, pages 640–645, 1999.

[82] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Image Understanding Workshop, pages 121–130, 1981.

[83] D. Marr. On the purpose of low-level vision. MIT AI Memo, 1974.

[84] L. H. Matthies, R. Szeliski, and T. Kanade. Kalman filter-based algorithms for estimating depth from image sequences. International Journal of Computer Vision, 3:209–236, 1989.

[85] P. Meer, D. Mintz, D. Y. Kim, and A. Rosenfeld. Robust regression methods for computer vision: a review. International Journal of Computer Vision, 6(1):59–70, 1991.

[86] E. Mémin and P. Pérez. Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Trans. Image Processing, 7(5):703–719, 1998.

[87] M. Middendorf and H. H. Nagel. Estimation and interpretation of discontinuities in optical flow fields. In Proc. International Conf. on Computer Vision, pages 178–183, 2001.

[88] D. W. Murray and B. F. Buxton. Scene segmentation from visual motion using global optimization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):220–228, 1987.

[89] H. H. Nagel. On the estimation of optical flow: Relations between different approaches and some new results. Artificial Intelligence, 33(3):299–324, 1987.

[90] H. H. Nagel. Optical flow estimation and the interaction between measurement errors at adjacent pixel positions. International Journal of Computer Vision, 15:271–288, 1995.

[91] H. H. Nagel and W. Enkelmann. An investigation of smoothness constraints for the estimation of displacement vector fields from image sequences. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8(5):565–593, 1986.

[92] H. H. Nagel and M. Haag. Bias-corrected optical flow estimation for road vehicle tracking. In Proc. International Conf. on Computer Vision, pages 1006–1011, 1998.

[93] H. H. Nagel, G. Socher, H. Kollnig, and M. Otte. Motion boundary detection in image sequences by local stochastic tests. In Proc. European Conf. on Computer Vision, volume 2, pages 305–315, 1994.

[94] P. Nesi, A. Del Bimbo, and D. Ben-Tzvi. A robust algorithm for optical flow estimation. Computer Vision and Image Understanding, 62(1):59–68, 1995.

[95] L. Ng and V. Solo. Errors-in-variable modelling in optical flow problems. In Proc. Int. Conf. on Acoustics Speech and Signal Processing, volume 5, pages 2773–2776, 1998.

[96] N. Ohta. Uncertainty models of the gradient constraint for optical flow computation. IEICE Trans. Info. & Sys., E79-D(7):958–962, 1996.

[97] E. P. Ong and M. Spann. Robust optical flow computation based on least-median-of-squares. International Journal of Computer Vision, 31(1):51–82, 1999.

[98] M. Otte and H. H. Nagel. Optical flow estimation: Advances and comparisons. In Proc. European Conf. on Computer Vision, pages 51–60, 1994.

[99] N. Peterfreund. Robust tracking of position and velocity with Kalman snakes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 21(6):564–569, 1999.

[100] D. Piponi. Virtual cinematography in “The Matrix”. http://www2.parc.com/ops/projects/forum/2000/forum-07-13.html, 2000.

[101] R. Pless, T. Brodsky, and Y. Aloimonos. Detecting independent motion: The statistics of temporal continuity. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(8):768–773, 2000.

[102] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C. Cambridge Univ. Press, 2nd edition, 1997.

[103] P. J. Rousseeuw and S. Van Aelst. Positive-breakdown robust methods in computer vision. Computing Science and Statistics, 31:451–460, 1999.

[104] P. J. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets. Tech. report (submitted), Univ. of Antwerp.

[105] P. J. Rousseeuw and M. Hubert. Recent developments in PROGRESS. In Y. Dodge, editor, L1-Statistical Procedures and Related Topics, volume 31, pages 201–214. Institute of Mathematical Statistics Lecture Notes-Monograph Series, 1997.

[106] P. J. Rousseeuw and A. M. Leroy. Robust Regression and Outlier Detection. John Wiley and Sons, 1987.

[107] Y. Rui and P. Anandan. Segmenting visual actions based on spatio-temporal motion patterns. In Proc. Computer Vision and Pattern Recognition, volume 1, pages 111–118, 2000.

[108] H. S. Sawhney. 3D geometry from planar parallax. In Proc. Computer Vision and Pattern Recognition, pages 929–934, 1994.

[109] H. S. Sawhney and S. Ayer. Compact representations of videos through dominant and multiple motion estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(8):814–830, 1996.

[110] H. S. Sawhney, Y. Guo, and R. Kumar. Independent motion detection in 3D scenes. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(10):1191–1199, 2000.

[111] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1):7–42, 2002.

[112] R. R. Schultz, L. Meng, and R. L. Stevenson. Subpixel motion estimation for super-resolution image sequence enhancement. Journal of Visual Communication and Image Representation, 9(1):38–50, 1998.

[113] B. G. Schunck. Image flow segmentation and estimation by constraint line clustering. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(10):1010–1027, 1989.

[114] H. Schweitzer. Occam algorithms for computing visual motion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 17(11):1033–1042, 1995.

[115] A. Shashua and N. Navab. Relative affine structure: theory and application to 3D reconstruction from perspective views. In Proc. Computer Vision and Pattern Recognition, pages 483–489, 1994.

[116] D. Shulman and J. Hervé. Regularization of discontinuous flow fields. In Proc. Workshop on Visual Motion, pages 81–85, 1989.

[117] D.-G. Sim and R.-H. Park. Robust reweighted MAP motion estimation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(4):353–365, 1998.

[118] E. P. Simoncelli. Distributed Analysis and Representation of Visual Motion. Doctoral dissertation, MIT, 1993.

[119] E. P. Simoncelli, E. H. Adelson, and D. Heeger. Probability distributions of optical flow. In Proc. Computer Vision and Pattern Recognition, pages 310–315, 1991.

[120] A. Singh. Optic Flow Computation: A Unified Perspective. IEEE Press, 1990.

[121] S. S. Sinha and B. G. Schunck. A two-stage algorithm for discontinuity-preserving surface reconstruction. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(1):36–55, 1992.

[122] S. Srinivasan. In Proc. International Conf. on Computer Vision, volume 1.

[123] G. P. Stein and A. Shashua. Model-based brightness constraints: on direct estimation of structure and motion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(9):992–1015, 2000.

[124] C. V. Stewart. Expected performance of robust estimators near discontinuities. In Proc. International Conf. on Computer Vision, pages 969–974, 1995.

[125] S. Sun, D. Haynor, and Y. Kim. Motion estimation based on optical flow with adaptive gradients. In Proc. International Conf. on Image Processing, pages 852–855, 2000.

[126] R. Szeliski. Bayesian Modeling of Uncertainty in Low-Level Vision. Kluwer Academic Publishers, 1989.

[127] R. Szeliski and J. Coughlan. Hierarchical spline-based image registration. In Proc. International Conf. on Image Processing, pages 194–201, 1994.

[128] R. Szeliski and H.-Y. Shum. Motion estimation with quadtree splines. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(12):1199–1210, 1996.

[129] H. Tao, H. S. Sawhney, and R. Kumar. A global matching framework for stereo computation. In Proc. International Conf. on Computer Vision, pages 532–539, 2001.

[130] P. H. S. Torr. Geometric motion segmentation and model selection. In J. Lasenby et al., editors, Philosophical Transactions of the Royal Society of London A, pages 1321–1340, 1998.

[131] P. H. S. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. In Proc. International Conf. on Computer Vision, pages 983–990, 1998.

[132] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon. Bundle adjustment – A modern synthesis. In B. Triggs, A. Zisserman, and R. Szeliski, editors, Vision Algorithms: Theory and Practice, LNCS, pages 298–375. Springer-Verlag, 2000.

[133] W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-Plus, 2nd Edition. Springer, 1997.

[134] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Trans. Image Processing, 3(5):625–638, 1994.

[135] J. Weber and J. Malik. Robust computation of optical flow in a multi-scale differential framework. International Journal of Computer Vision, 14(1):67–81, 1995.

[136] Y. Weiss. Bayesian belief propagation for image understanding. Workshop on Statistical and Computational Theories of Vision 1999 – Modeling, Learning, Computing, and Sampling (submitted for publication), 1999.

[137] Y. Weiss and W. T. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. Neural Comp., 13(10):2173–2200, 2001.

[138] R. R. Wilcox. Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 1997.

[139] P. Willett, R. Niu, and Y. Bar-Shalom. Integration of Bayes detection with target tracking. IEEE Trans. on Signal Processing, 49(1):17–30, 2000.

[140] Y. Xiong and S. A. Shafer. Moment and hypergeometric filters for high precision computation of focus, stereo and optical flow. International Journal of Computer Vision, 22(1):25–59, 1997.

[141] M. Ye. Image flow estimation using facet model and covariance propagation. M.S. Thesis, Univ. of Washington, Seattle, WA, USA, 1999.

[142] M. Ye, M. Bern, and D. Goldberg. Document image matching and annotation lifting. In Proc. International Conference on Document Analysis and Recognition, pages 753–760, 2001.

[143] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance propagation. In Vision Interface, pages 51–58, 1998.

[144] M. Ye and R. M. Haralick. Image flow estimation using facet model and covariance propagation. In M. Cheriet and Y.H. Yang, editors, Vision Interface: Real World Applications of Computer Vision, pages 209–241. World Sci., 2000.

[145] M. Ye and R. M. Haralick. Optical flow from a least-trimmed squares based adaptive approach. In Proc. International Conf. on Pattern Recognition, pages 1052–1055, 2000.

[146] M. Ye and R. M. Haralick. Two-stage robust optical flow estimation. In Proc. Computer Vision and Pattern Recognition, pages 623–628, 2000.

[147] M. Ye and R. M. Haralick. Local gradient, global matching, piecewise-smooth optical flow. In Proc. Computer Vision and Pattern Recognition, pages 712–717, 2001.

[148] M. Ye and R. M. Haralick. Point aerial target detection and tracking — a motion-based Bayesian approach. ISL Tech Report, Univ. of Washington, 2001.

[149] M. Ye, R. M. Haralick, and L. G. Shapiro. Estimating optical flow using a global matching formulation and graduated optimization. In Proc. International Conf. on Image Processing, Rochester, NY, September 2002. to appear.

[150] K. Zhang, M. Bober, and J. Kittler. Motion based image segmentation for video coding. In Proc. International Conf. on Image Processing, pages 476–479, 1995.

[151] S. C. Zhu and A. Yuille. Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(9):884–900, 1996.

VITA

Ming Ye was born in Chengdu, P.R. China, in January 1975. She received her B.S. degree in Electrical Engineering from the University of Electronic Science and Technology of China in June 1997. She then joined the Intelligent Systems Laboratory at the University of Washington as a research assistant, where she obtained her M.S. degree in March 1999 and will receive her Ph.D. degree by December 2002, both in Electrical Engineering. She was a research intern at the Xerox Palo Alto Research Center during the summer of 2000. Her research interests are in computer vision and image processing, with a focus on statistical and robust approaches to visual motion analysis and their applications.