Conditional Random People: Tracking Humans with CRFs and Grid Filters

Leonid Taycher†, Gregory Shakhnarovich‡, David Demirdjian†, and Trevor Darrell†

† Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
‡ Department of Computer Science, Brown University, Providence, RI 02912

{lodrion, demirdji, trevor}@csail.mit.edu [email protected]

Abstract

We describe a state-space tracking approach based on a Conditional Random Field (CRF) model, where the observation potentials are learned from data. We find functions that embed both state and observation into a space where similarity corresponds to L1 distance, and define an observation potential based on distance in this space. This potential is extremely fast to compute and, in conjunction with a grid-filtering framework, can be used to reduce a continuous state estimation problem to a discrete one. We show how a state temporal prior in the grid filter can be computed in a manner similar to a sparse HMM, resulting in real-time system performance. The resulting system is used for human pose tracking in video sequences.

1 Introduction

Tracking articulated objects (such as humans) is an example of state estimation in a high-dimensional space with a non-linear observation model, a problem that has been a focus of considerable attention. The combination of frequent self-occlusion and unobservable degrees of freedom with the large volume of the pose space makes probabilistic methods appealing. The vast majority of probabilistic articulated tracking methods are based on a generative model formulation.

Current state-of-the-art generative tracking algorithms use non-parametric density estimators, such as particle filters, due to their ability to model arbitrary multimodal distributions [17, 10]. Unfortunately, several properties conspire to make particle filtering extremely computationally intensive. On one hand, a large number of particles is needed in order to faithfully model the distributions in question. On the other hand, a complex likelihood function needs to be evaluated for each particle at every iteration of the algorithm. A further drawback of generative-model-based algorithms is that the likelihood function is too complicated to be learned from data and is usually specified in an ad-hoc fashion. Recently, the use of directed discriminative models with parameters learned directly from data has been proposed [1, 24, 20].

In this work we pose state estimation as inference in an undirected Conditional Random Field (CRF) model [16]. This allows us to replace the likelihood function with a more general observation potential (compatibility) function that can be automatically learned from training data. These functions might be expensive to evaluate in general, but can be made efficient at run-time if all state (pose) values at which they can be evaluated are known in advance. In this case much of the computation can be performed off-line, thus greatly reducing run-time complexity.

This algorithm naturally operates on a discrete set of samples, and we will show how we can estimate the posterior probability in a continuous state space using grid filtering methods. The idea underlying these methods is that if the state space can be partitioned into regions that are small enough, then the posterior can be well approximated by a constant function within each region.

The direct application of grid filtering would result in the need to evaluate the potential function in each region in the partition, which is impossible to do in real time even with a fast implementation. Fortunately this is not necessary, since at a particular time-step the prior state probability is negligible in the vast majority of the regions, allowing us to concentrate only on locations with a significant prior.

Our algorithm operates in a standard predict-update framework: at every step we first estimate the temporal prior probability of the state being in each of the regions in the partition. We then evaluate the observation potential only for regions with non-negligible prior. While one can view the resulting algorithm as a version of particle filtering where particles can assume only a fixed, finite set of values, this is not quite the case. When the cells are fixed, the transition probabilities between cells can be precomputed, and the temporal prior computation reduced to a single sparse matrix/vector multiplication, in a manner similar to HMMs [18]. This computation avoids the sampling step altogether, producing better distribution estimates and significantly reducing computation time compared to Monte Carlo methods.

After reviewing prior work, we describe the CRF-based tracking formulation and a way to learn an observation potential function based on image embedding (Section 3). We then discuss a grid-filter-based inference method which can be realized with a sparse HMM computation (Section 4). The results of our method are demonstrated and compared against competing algorithms in Section 5.

2 Prior Work

Probabilistic articulated pose estimation is often approached using state-space methods. The majority of the approaches have been based on a generative model formulation, with varying assumptions about the forms of the pose distribution and transition probabilities. Early methods [13, 19] assumed that both were Gaussian and used Kalman filtering. Extended and Unscented [25] Kalman filters enabled modeling of non-linear transitions, but still constrained the pose distribution to be Gaussian. These methods required a relatively small number of evaluations of the likelihood function, but lost track because of their restrictive distribution models.

The need to relax the unimodality assumption led first to the use of mixture models [11, 5], and then to Monte Carlo methods that represent distributions with sets of discrete samples (particles) [17, 10, 22, 23]. While theoretically sound, particle filtering methods are not very successful in high dimensions [14]: they require large numbers of particles to faithfully represent the distribution, which entails large computational costs of likelihood evaluation. Furthermore, the emission probability model used in likelihood evaluation is very expensive to train, and is often hand-designed in an ad-hoc fashion.

Several discriminative methods have been proposed for visual pose tracking. These algorithms apply various regression techniques while leveraging large numbers of annotated image sequences. For example, a single expert [1] or a mixture of simple experts [24] was trained to predict the current pose based on the past pose estimates and the current observation. Robust regression combined with fast nearest neighbor search was used for single-frame pose estimation in [21].

In this paper we dispense with directed models altogether and opt for a Conditional Random Field (CRF) [16] model. The main advantage of CRFs over generative models is that CRFs do not require specification (and evaluation) of the emission probability, but only of the similarity between state and observation(s). CRFs are also a more flexible model than the previously proposed regression methods. They allow for modeling the relationship between the state and an arbitrary subset of observations. They are also better able to adjust to sequences not appearing in the training data. For example, the MEMM model (similar to the one used in [24]) has been shown to be subject to the label bias problem [16].

While in the present work we use a simple chain-structured CRF (Figure 1(b)), which directly models the dependency between concurrent state and observation, it can be extended by introducing more general relationships between state and observations.

We learn the observation potential function for our model using the parameter-sensitive embedding introduced in [21]. This algorithm allows us to learn a transformation of images of humans into a space where the distance between embeddings of two images is likely to be small if the poses are similar and large otherwise. The observation potential of a particular pose is then determined by the distance between the embedding of the rendering of the figure in this pose and the embedding of the observed image.

If for every pose at which we would like to evaluate the potential we had to render the corresponding image, our method would be extremely slow. By discretizing the continuous pose space, we are able to precompute the embeddings of all discrete poses off-line, thus drastically reducing run-time complexity. Fixing the set of poses at which the observation potential can be computed would seem to be an unreasonable restriction, since we are operating in a continuous pose space, but we overcome this problem by using a variant of the grid-filtering technique [6, 2].

The main idea underlying grid filtering is that a sufficiently finely discretized random variable is indistinguishable from a continuous one. That is, if the distribution can be approximated by a piece-wise constant function, then it is sufficient to evaluate it only at one point in every "constancy region" (cell) [6]. This reduces a continuous estimation problem to a discrete one (albeit with a very large number of discrete points). We show that in the case where both the observation potential and the temporal prior are constant in each cell, tracking can be formulated as state estimation in an HMM framework, allowing us to use existing inference algorithms; further, we have found in practice that a manageable number of cells suffices for realistic tracking tasks.

3 Tracking with Conditional Random Fields

Figure 1(a) shows the dynamic generative model that is commonly used in tracking applications. The state (pose¹) at time t is denoted as θ_t, and the observed images as I_t. The full model is specified by the initial distribution p(θ_0),

¹ In this work we consider only first-order Markov models of motion.

Figure 1: Chain-structured generative (a), CRF (b), and MEMM (c) tracking models. Each panel shows the states θ_{t−1}, θ_t, θ_{t+1} and the observations I_{t−1}, I_t, I_{t+1}.
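
As an illustration of the predict-update scheme outlined in the Introduction, the following Python sketch performs one grid-filter step over a toy set of cells. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the exponential form exp(-scale * d) of the potential, the `threshold` and `scale` parameters, and the three-cell example data are all hypothetical. The text above only states that the potential is based on L1 distance in the embedding space, that the temporal prior reduces to a sparse matrix/vector product, and that the potential is evaluated only where the prior is non-negligible.

```python
import math


def l1_distance(a, b):
    """L1 distance between two equal-length embedding vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def observation_potential(cell_embedding, image_embedding, scale=1.0):
    # Assumed form: the potential decays exponentially with L1 distance.
    return math.exp(-scale * l1_distance(cell_embedding, image_embedding))


def predict(posterior, transitions):
    """Temporal prior as a sparse matrix/vector product.

    posterior:   dict cell -> probability (only non-zero entries stored)
    transitions: dict source cell -> dict target cell -> probability
    """
    prior = {}
    for src, p_src in posterior.items():
        for dst, p_trans in transitions.get(src, {}).items():
            prior[dst] = prior.get(dst, 0.0) + p_src * p_trans
    return prior


def update(prior, cell_embeddings, image_embedding, threshold=1e-3, scale=1.0):
    """Weight cells with non-negligible prior by the potential, then normalize."""
    weighted = {}
    for cell, p in prior.items():
        if p < threshold:  # skip cells whose prior is negligible
            continue
        potential = observation_potential(
            cell_embeddings[cell], image_embedding, scale
        )
        weighted[cell] = p * potential
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("no cell has both significant prior and potential")
    return {cell: w / total for cell, w in weighted.items()}


# Toy example with three cells (hypothetical data).
transitions = {
    0: {0: 0.5, 1: 0.5},
    1: {0: 0.25, 1: 0.5, 2: 0.25},
    2: {1: 0.5, 2: 0.5},
}
# Embeddings of the cell poses, which would be precomputed off-line.
cell_embeddings = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0)}
image_embedding = (1.0, 0.0)  # embedding of the current observed image

previous_posterior = {0: 1.0}
prior = predict(previous_posterior, transitions)
posterior = update(prior, cell_embeddings, image_embedding)
```

In this toy run only cells 0 and 1 receive prior mass, so the potential is never evaluated for cell 2; the cost of a step scales with the number of cells carrying significant prior rather than with the size of the grid.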
