Fast Rigid Motion Segmentation Via Incrementally-Complex Local Models
Total Page:16
File Type:pdf, Size:1020Kb
Fast Rigid Motion Segmentation via Incrementally-Complex Local Models Fernando Flores-Mangas Allan D. Jepson Department of Computer Science, University of Toronto fmangas,[email protected] Abstract The problem of rigid motion segmentation of trajectory data under orthography has been long solved for non- degenerate motions in the absence of noise. But because real trajectory data often incorporates noise, outliers, mo- tion degeneracies and motion dependencies, recently pro- posed motion segmentation methods resort to non-trivial (a) Original video frames (b) Class-labeled trajectories representations to achieve state of the art segmentation ac- curacies, at the expense of a large computational cost. This paper proposes a method that dramatically reduces this cost (by two or three orders of magnitude) with minimal accu- racy loss (from 98.8% achieved by the state of the art, to 96.2% achieved by our method on the standard Hopkins 155 dataset). Computational efficiency comes from the use of a (c) Model support region (d) Model inliers and control points simple but powerful representation of motion that explicitly incorporates mechanisms to deal with noise, outliers and motion degeneracies. Subsets of motion models with the best balance between prediction accuracy and model com- plexity are chosen from a pool of candidates, which are then used for segmentation. (e) Model residuals (f) Segmentation result Figure 1: Model instantiation and segmentation. a) f th orig- 1. Rigid Motion Segmentation inal frame, Italian Grand Prix c 2012 Formula 1. b) Class- Rigid motion segmentation (MS) consists on separating labeled, trajectory data W (red, green, blue and black cor- regions, features, or trajectories from a video sequence into respond to chassis, helmet, background and outlier classes f spatio-temporally coherent subsets that correspond to in- respectively). c) Spatially-local support subset W^ for a dependent, rigidly-moving objects in the scene (Figure1.b candidate motion in blue. d) Candidate motion model in- f or1.f). The problem currently receives renewed attention, liers in red, control points (ci from Eq.3) in white. e) f partly because of the extensive amount of video sources and Residuals (ri from Eq. 11) color-coded with label data, the applications that benefit from MS to perform higher level radial coordinate is logarithmic. f) Segmentation result. computer vision tasks, but also because the state of the art is reaching functional maturity. methods [11,6,8] balance model complexity with modeling Motion Segmentation methods are widely diverse, but accuracy and have been successful at incorporating more most capture only a small subset of constraints or alge- of these aspects into a single formulation. For instance, braic properties from those that govern the image formation in [8] most model parameters are estimated automatically process of moving objects and their corresponding trajec- from the data, including the number of independent motions tories, such as the rank limit theorem [9, 10], the linear in- and their complexity, as well as the segmentation labels (in- dependence constraint (between trajectories from indepen- cluding outliers). However, because of the large number of dent motions) [2, 13], the epipolar constraint [7], and the necessary motion hypotheses that need to be instantiated, reduced rank property [11, 15, 13]. Model-selection based as well as the varying and potentially very large number of 1 model parameters that must be estimated, the flexibility of- 2D affine transformation models the motion of an arbitrary fered by this method comes at a large computational cost. plane, and the epipolar residuals account for relative depths. Current state of the art methods follow the trend of us- Crucially, these two components can be computed sepa- ing sparse low-dimensional subspaces to represent trajec- rately and incrementally, which enables an explicit mech- tory data. This representation is then fed into a clustering anism to deal with motion degeneracy. algorithm to obtain a segmentation result. A prime example In the context of 3D motion, a motion is degenerate when of this type of method is Sparse Subspace Clustering (SSC) the trajectories originate from a planar (or linear) object, or [3] in which each trajectory is represented as a sparse linear when neither the camera nor the imaged object exercise all combination of a few other basis trajectories. The assump- of their degrees of freedom, such as when the object only tion is that the basis trajectories must belong to the same translates, or when the camera only rotates. These are com- rigid motion as the reconstructed trajectory (or else, the re- mon situations in real world video sequences. The incre- construction would be impossible). When the assumption mental nature of the decompositions described above, fa- is true, the sparse mixing coefficients can be interpreted as cilitate the transition between degenerate motions and non- the connectivity weights of a graph (or a similarity matrix), degenerate ones. which is then (spectral) clustered to obtain a segmentation result. At the time of publication, SSC produced segmenta- Planar Model Under orthography, the projection of tra- tion results three times more accurate than the best prede- jectories from a planar surface can be modeled with the cessor. The practical downside, however, is the inherently affine transformation: large computational cost of finding the optimal sparse rep- 2 xc 3 2 xw 3 2 xw 3 resentation, which is at least cubic on the number of trajec- D t yc = yw = Aw!c yw ; (1) tories. 4 5 0> 1 4 5 2D 4 5 1 1 1 The work of [14] also falls within the class of subspace separation algorithms. Their approach is based on cluster- where D 2 R2×2 is an invertible matrix, and t 2 R2 is ing the principal angles (CPA) of the local subspaces associ- a translation vector. Trajectory coordinates (xw; yw) are in ated to each trajectory and its nearest neighbors. The clus- the plane’s reference frame (modulo a 2D affine transfor- tering re-weights a traditional metric of subspace affinity mation) and (xc; yc) are image coordinates. between principal angles. Re-weighted affinities are then Now, let W 2 R2F ×P be matrix of trajectory data that used for segmentation. The approach produces segmenta- contains the x and y image coordinates of P feature points tion results with accuracies similar to those of SSC, but the tracked through F frames, as in computational cost is close to 10 times bigger than SSC’s. 2 3 x1;1 ··· x1;P In this work we argue that competitive segmentation re- 6 y1;1 ··· y1;P 7 6 7 sults are possible using a simple but powerful representation 6 . .. 7 W = 6 . 7 : (2) of motion that explicitly incorporates mechanisms to deal 6 7 with noise, outliers and motion degeneracies. The proposed 4 xF;1 ··· xF;P 5 method is approximately 2 or 3 orders of magnitude faster yF;1 ··· yF;P than [3] and [14] respectively, currently considered the state To compute the parameters of A2D from trajectory data, let 2f×3 of the art. C = [c1; c2; c3] 2 R be three columns (three full tra- f 2f−1 2f > 1.1. Affine Motion jectories) from W, and let ci = [ci ; ci ] be the x and y coordinates of the i-th control trajectory at frame f. Then Projective Geometry is often used to model the image the transformation between points from an arbitrary source motion of trajectories from rigid objects between pairs of frame s to a target frame f can be written as: frames. However, alternative geometric relationships that cf cf cf cs cs cs facilitate parameter computation have also been proven use- 1 2 3 = As!f 1 2 3 ; (3) ful for this purpose. For instance, in perspective projection, 1 1 1 2D 1 1 1 general image motion from rigid objects can be modeled and As!f can be simply computed as: via the composition of two elements: a 2D homography, 2D −1 and parallax residual displacements [5]. The homography cf cf cf cs cs cs As!f = 1 2 3 1 2 3 : (4) describes the motion of an arbitrary plane, and the parallax 2D 1 1 1 1 1 1 residuals account for relative depths, that are unaccounted for by the planar surface model. The inverse in the right-hand side matrix of Eq.4 exists so s Under orthography, in contrast, image motion of rigid long as the points ci are not collinear. For simplicity we s!f f s objects can be modeled via the composition of a 2D affine refer to A2D as A2D and consequently A2D is the identity transformation plus epipolar residual displacements. The matrix. 3D Model In order to upgrade a planar (degenerate) that after this stage, the segmentation that results from the model into a full 3D one, relative depth must be accounted optimal model combination could be reported as a segmen- using the epipolar residual displacements. This means ex- tation result (x5). tending Eq.1 with a direction vector, scaled by the corre- The third stage incorporates the results from a set of sponding relative depth of each point, as in: model combinations that are closest to the optimal. Seg- mentation results are aggregated into an affinity matrix, 2 xc 3 2 xw 3 2 a 3 D ~t 13 which is then passed to a spectral clustering algorithm to yc = yw + δzw a : (5) 4 5 ~0> 1 4 5 4 23 5 produce the final segmentation result. This refinement stage 1 1 0 generally results in improved accuracy and reduced seg- mentation variability (x6). The depth δzw is relative to the arbitrary plane whose mo- tion is modeled by A2D; a point that lies on such plane would have δzw = 0.