
Appeared in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2012.

Multi-View Latent Variable Discriminative Models For Action Recognition

Yale Song¹, Louis-Philippe Morency², Randall Davis¹
¹MIT Computer Science and Artificial Intelligence Laboratory
²USC Institute for Creative Technologies
{yalesong,davis}@csail.mit.edu, [email protected]

Abstract

Many human action recognition tasks involve data that can be factorized into multiple views such as body postures and hand shapes. These views often interact with each other over time, providing important cues to understanding the action. We present multi-view latent variable discriminative models that jointly learn both view-shared and view-specific sub-structures to capture the interaction between views. Knowledge about the underlying structure of the data is formulated as a multi-chain structured latent conditional model, explicitly learning the interaction between multiple views using disjoint sets of hidden variables in a discriminative manner. The chains are tied using a predetermined topology that repeats over time. We present three topologies – linked, coupled, and linked-coupled – that differ in the type of interaction between views that they model. We evaluate our approach on both segmented and unsegmented human action recognition tasks, using the ArmGesture, the NATOPS, and the ArmGesture-Continuous datasets. Experimental results show that our approach outperforms previous state-of-the-art action recognition models.

1. Introduction

Many real-world human action recognition tasks involve data that can be factorized into multiple views. For example, the gestures made by baseball coaches involve complex combinations of body and hand signals. The use of multiple views in human action recognition has been shown to improve recognition accuracy [1, 3]. Evidence from psychological research provides theoretical justification [25], showing that people reason about the interaction between views (i.e., causal inference) when given combined input signals.

We introduce the term multi-view dynamic learning as a mechanism for such tasks. The task involves sequential data, where each view is generated by a temporal process and encodes a different source of information. These views often exhibit both view-shared and view-specific sub-structures [11], and usually interact with each other over time, providing important cues to understanding the data.

Single-view latent variable discriminative models (e.g., HCRF [18] for segmented sequence data, and LDCRF [15] for unsegmented sequence data) have shown promising results in many human activity recognition tasks such as gesture and emotion recognition. However, when applied to multi-view latent dynamic learning, existing latent models (e.g., early fusion [27]) often prove to be inefficient or inappropriate. The main difficulty with this approach is that it needs a set of latent variables that is the product set of the latent variables from each original view [16]. This increase in complexity is exponential: with C views and D latent variables per view, the product set of all latent variables is O(D^C). This in turn causes the model to require much more data to estimate the underlying distributions correctly (as confirmed in our experiments shown in Section 4), which makes this solution impractical for many real-world applications. The task can get even more difficult when, as shown in [5], one process with high dynamics (e.g., high variance, noise, or frame rate) masks another with low dynamics, with the result that both the view-shared and view-specific sub-structures are dominated by the view with high dynamics.

We present here multi-view latent variable discriminative models that jointly learn both view-shared and view-specific sub-structures. Our approach makes the assumption that observed features from different views are conditionally independent given their respective sets of latent variables, and uses disjoint sets of latent variables to capture the interaction between views. We introduce multi-view HCRF (MV-HCRF) and multi-view LDCRF (MV-LDCRF) models, which extend previous work on HCRF [18] and LDCRF [15] to the multi-view domain (see Figure 1).

Knowledge about the underlying structure of the data is represented as a multi-chain structured conditional latent model. The chains are tied using a predetermined topology that repeats over time. Specifically, we present three topologies – linked, coupled, and linked-coupled – that differ in the type of interaction between views that they model. We demonstrate the superiority of our approach over existing single-view models using three real-world human action datasets – the ArmGesture [18], the NATOPS [22], and the ArmGesture-Continuous datasets – for both segmented and unsegmented human action recognition tasks.

Section 2 reviews related work, Section 3 presents our models, Section 4 demonstrates our approach using a synthetic example, and Section 5 describes experiments and results on the real-world data. Section 6 concludes with our contributions and suggests directions for future work.

2. Related Work

Conventional approaches to multi-view learning include early fusion [27], i.e., combining the views at the input feature level, and late fusion [27], i.e., combining the views at the output level. But these approaches often fail to learn important sub-structures in the data, because they do not take multi-view characteristics into consideration.

Several approaches have been proposed to exploit the multi-view nature of the data. Co-training [2] and multiple kernel learning [14, 23] have shown promising results when the views are independent, i.e., when they provide different and complementary information about the data. However, when the views are not independent, as is common in human activity recognition, these methods often fail to learn from the data correctly [12]. Canonical correlation analysis (CCA) [8] and sparse coding methods [11] have shown a powerful generalization ability to model dependencies between views. However, these approaches are applicable only to classification and regression problems, and cannot be applied directly to dynamic learning problems.

Probabilistic graphical models have been shown to be extremely successful in dynamic learning. In particular, multi-view latent dynamic learning using a generative model (e.g., an HMM) has long been an active research area [3, 16]. Brand et al. [3] introduced a coupled HMM for action recognition, and Murphy introduced Dynamic Bayesian Networks [16], which provide a general framework for modeling complex dependencies in hidden (and observed) state variables.

In a discriminative setting, Sutton et al. [24] introduced a dynamic CRF (DCRF), and presented a factorial CRF as an instance of the DCRF, which performs multi-labeling tasks. However, their approach only works with single-view input, and may not capture the sub-structures in the data because it does not use latent variables [18]. More recently, Chen et al. presented a multi-view latent space Markov Network for multi-view object classification and annotation tasks [4].

Our work differs from the previous work in that, instead of making the view independence assumption as in [2, 14, 23], we make a conditional independence assumption between views, maintaining computational efficiency while capturing the interaction between views. Unlike [24], we formulate our graph to handle multiple input streams independently, and use latent variables to model the sub-structure of the multi-view data. We also concentrate on multi-view dynamic sequence modeling, as compared to multi-view object recognition [4, 8, 11, 14, 23].

3. Our Multi-view Models

In this section we describe our multi-view latent variable discriminative models. In particular, we introduce two new families of models, called multi-view HCRF (MV-HCRF) and multi-view LDCRF (MV-LDCRF), that extend previous work on HCRF [18] and LDCRF [15] to the multi-view domain. The main difference between the two models is that the MV-HCRF is for segmented sequence labeling (i.e., one label per sequence) while the MV-LDCRF is for unsegmented sequence labeling (i.e., one label per frame). We first introduce the notation, describe MV-HCRF and MV-LDCRF, present three topologies that define how the views interact, and explain inference and parameter estimation.

Input to our model is a set of multi-view sequences $\hat{x} = \{x^{(1)}, \cdots, x^{(C)}\}$, where each $x^{(c)} = \{x_1^{(c)}, \cdots, x_T^{(c)}\}$ is an observation sequence of length $T$ from the $c$-th view. Each $\hat{x}_t$ is associated with a label $y_t$ that is a member of a finite discrete set $\mathcal{Y}$; for segmented sequences, there is only one $y$ for all $t$. We represent each observation $x_t^{(c)}$ with a feature vector $\phi(x_t^{(c)}) \in \mathbb{R}^N$. To model the sub-structure of the multi-view sequences, we use a set of latent variables $\hat{h} = \{h^{(1)}, \cdots, h^{(C)}\}$, where each $h^{(c)} = \{h_1^{(c)}, \cdots, h_T^{(c)}\}$ is a hidden state sequence of length $T$. Each random variable $h_t^{(c)}$ is a member of a finite discrete set $\mathcal{H}^{(c)}$ of the $c$-th view, which is disjoint from view to view. Each hidden variable $h_s^{(c)}$ is indexed by a pair $(s, c)$. An edge between two hidden variables $h_s^{(c)}$ and $h_t^{(d)}$ is indexed by a quadruple $(s, t, c, d)$, where $\{s, t\}$ describes the time indices and $\{c, d\}$ describes the view indices.
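To make the notation above concrete, the following is a minimal sketch of how the multi-view containers might be organized in code. The names (MultiViewSequence, num_hidden_per_view, etc.) are our own illustration and are not part of the paper.

```python
# A minimal, hypothetical sketch of the multi-view notation in Section 3.
# Names (MultiViewSequence, etc.) are illustrative, not from the paper.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MultiViewSequence:
    # x[c] is the observation sequence of view c: an array of shape (T, N_c),
    # i.e., T frames of feature vectors phi(x_t^(c)).
    x: List[np.ndarray]
    # y is a single label (segmented, MV-HCRF) or a length-T label sequence
    # (unsegmented, MV-LDCRF).
    y: object

    @property
    def num_views(self) -> int:        # C
        return len(self.x)

    @property
    def length(self) -> int:           # T (the same for every view)
        T = self.x[0].shape[0]
        assert all(v.shape[0] == T for v in self.x)
        return T

# Disjoint hidden-state alphabets H^(c), one per view; hidden variables are
# indexed by (s, c) and edges between them by (s, t, c, d).
num_hidden_per_view = [4, 4]           # |H^(1)|, |H^(2)|
```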

Figure 1. Graphical representations of multi-view latent variable discriminative models: (a) linked HCRF (LHCRF), (b) coupled HCRF (CHCRF), (c) linked LDCRF (LLDCRF), and (d) coupled LDCRF (CLDCRF). Grey nodes are observed variables and white nodes are unobserved variables. The topologies (i.e., linked and coupled) differ in the type of interaction between views that they model; see the text for detail. The linked-coupled multi-view topologies (not shown here) are a combination of the linked and coupled topologies. Note that we illustrate two-view models for simplicity; generalization to more than two views can be done easily by following the rules stated in Equation 5.

3.1. Multi-view HCRF

We represent our model as a conditional probability distribution that factorizes according to an undirected graph $G = (\mathcal{V}, \mathcal{E}_P, \mathcal{E}_S)$ defined over a multi-chain structured stochastic process, where each chain is a discrete representation of one view. The set of vertices $\mathcal{V}$ represents random variables (observed or unobserved), and the two sets of edges $\mathcal{E}_P$ and $\mathcal{E}_S$ represent dependencies among random variables. The unobserved (hidden) variables are marginalized out to compute the conditional probability. We call $\mathcal{E}_P$ the set of view-specific edges; they encode temporal dependencies specific to each view. $\mathcal{E}_S$ is the set of view-shared edges, which encode interactions between views.

Similar to HCRF [18], we construct a conditional probability distribution with a set of weight parameters $\Lambda = \{\lambda, \omega\}$ as

$$p(y \mid \hat{x}; \Lambda) = \sum_{\hat{h}} p(y, \hat{h} \mid \hat{x}; \Lambda) = \frac{1}{Z} \sum_{\hat{h}} e^{\Lambda^\top \Phi(y, \hat{h}, \hat{x})} \qquad (1)$$

where $\Lambda^\top \Phi(y, \hat{h}, \hat{x})$ is a potential function and $Z = \sum_{y' \in \mathcal{Y}, \hat{h}} e^{\Lambda^\top \Phi(y', \hat{h}, \hat{x})}$ is a partition function for normalization. The potential function is factorized with feature functions $f_k(\cdot)$ and $g_k(\cdot)$ as

$$\Lambda^\top \Phi(y, \hat{h}, \hat{x}) = \sum_{(s,c) \in \mathcal{V}} \sum_k \lambda_k f_k(y, h_s^{(c)}, x^{(c)}) + \sum_{(s,t,c,d) \in \mathcal{E}} \sum_k \omega_k g_k(y, h_s^{(c)}, h_t^{(d)}, \hat{x}) \qquad (2)$$

where $\mathcal{E} = \mathcal{E}_P \cup \mathcal{E}_S$. The first term $\lambda_k f_k(\cdot)$ represents singleton potentials defined over a single hidden variable $h_s^{(c)} \in \mathcal{V}$, and the second term $\omega_k g_k(\cdot)$ represents pairwise potentials over a pair of hidden variables $(h_s^{(c)}, h_t^{(d)}) \in \mathcal{E}$.

We define two types of $f_k(\cdot)$ feature functions. The label feature function $f_k(y, h_s^{(c)})$ models the relationship between a hidden state $h_s^{(c)} \in \mathcal{H}^{(c)}$ and a label $y \in \mathcal{Y}$; thus, the number of label feature functions is $\sum_c |\mathcal{Y}| \times |\mathcal{H}^{(c)}|$. The observation feature function $f_k(h_s^{(c)}, x^{(c)})$ represents the relationship between a hidden state $h_s^{(c)} \in \mathcal{H}^{(c)}$ and the observations $\phi(x^{(c)})$, and is of length $\sum_c |\mathcal{H}^{(c)}| \times |\phi(x^{(c)})|$. Note that the $f_k(\cdot)$ are modeled under the assumption that views are conditionally independent given hidden variables, and thus encode the view-specific sub-structures.

The feature function $g_k(\cdot)$ encodes both view-shared and view-specific sub-structures. The definition of $g_k(\cdot)$ depends on how the two sets of edges $\mathcal{E}_P$ and $\mathcal{E}_S$ are defined; we detail these in Section 3.3.

Once we obtain the optimal set of parameters $\Lambda^*$ (described in Section 3.4), a class label $y^*$ for a new observation sequence $\hat{x}$ is determined as $y^* = \arg\max_{y \in \mathcal{Y}} p(y \mid \hat{x}; \Lambda^*)$.

Note that the multi-view HCRF is similar to the HCRF [18]; the difference lies in the multi-chain structured model (c.f., the tree-structured model in HCRF), which makes our model capable of multi-view dynamic learning.
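As a concrete illustration of Equations 1 and 2, the toy sketch below scores each label by brute-force enumeration over all joint hidden-state assignments of a small two-view model. It is for intuition only: the paper uses junction tree or loopy BP for inference (Section 3.4), the pairwise weights here are simplified to not depend on the label or the observations, and all function and variable names are ours.

```python
# Toy, brute-force evaluation of Eq. (1)-(2) for a tiny multi-view HCRF.
# For intuition only; real inference uses junction tree or loopy BP (Sec. 3.4).
import itertools
import numpy as np

def joint_potential(y, h_hat, x_hat, lam, omega, edges):
    """Lambda^T Phi(y, h_hat, x_hat) as in Eq. (2), simplified so that the
    pairwise weights do not depend on y or x_hat.

    h_hat[c][t]   : hidden state of view c at time t
    x_hat[c]      : (T, N_c) observation features of view c
    lam[c]        : (|Y|, |H_c|, N_c) singleton (label+observation) weights
    omega[(c, d)] : (|H_c|, |H_d|) pairwise weights for edges between views c, d
    edges         : list of (s, t, c, d) hidden-hidden edges (E_P union E_S)
    """
    score = 0.0
    for c, x_c in enumerate(x_hat):                     # singleton potentials
        for t in range(x_c.shape[0]):
            score += lam[c][y, h_hat[c][t]] @ x_c[t]
    for (s, t, c, d) in edges:                          # pairwise potentials
        score += omega[(c, d)][h_hat[c][s], h_hat[d][t]]
    return score

def label_posterior(x_hat, lam, omega, edges, num_hidden, num_labels):
    """p(y | x_hat) of Eq. (1) by enumerating every joint hidden assignment."""
    T = x_hat[0].shape[0]
    scores = np.zeros(num_labels)
    for y in range(num_labels):
        for h_joint in itertools.product(
                *[itertools.product(range(H), repeat=T) for H in num_hidden]):
            scores[y] += np.exp(joint_potential(y, h_joint, x_hat,
                                                lam, omega, edges))
    return scores / scores.sum()                        # normalize by Z
```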
h :∀hs ∈Hys Once we obtain the optimal set of parameters Λ∗ ∗ P (described in Section 3.4), a class label y for a where C is the number of views, and c αc = 1 sets rela- new observation sequence xˆ is determined as y∗ = tive weights on the marginal probability from the c-th view. ∗ arg maxy∈Y p(y | xˆ;Λ ). In our experiments, since we had no prior knowledge about Note that multi-view HCRF is similar to HCRF [18]; the the relative importance of each view, we set all αc to 1/C. ∗ difference lies in the multi-chain structured model (c.f., the To estimate the label yt of the t-th frame, we compute the (c) 0 ∗ tree-structured model in HCRF), which makes our model marginal probabilities p(ht = h | xˆ;Λ ) for all views capable of multi-view dynamic learning. c ∈ C and for all hidden states h0 ∈ H(c) of each view. Then, for each view c, we sum the marginal probabilities where the first term is an L2-norm regularization factor. The (c) second term, inside the summation, can be re-written as: according to Hys , and compute a weighted of them across all views. Finally, the label y0 that is associated with X X t log p(y |xˆ ; Λ) = Λ|Φ(y, hˆ, xˆ) − Λ|Φ(y0, hˆ, xˆ)(7) the optimal set is chosen. i i hˆ y0,hˆ 3.3. Topologies: Linked and Coupled As with other latent models (e.g., HMMs [19]), introducing The configuration of EP and ES encodes the view-shared hidden variables in Equation7 makes our objective function and view-specific sub-structures. Inspired by [16], we non-convex. We find the optimal parameters Λ∗ using the present three topologies that differ in defining the view- recently proposed non-convex regularized bundle method shared edges ES: linked, coupled, and linked-coupled. Be- (NRBM) [7], which has been proven to converge to a so- cause these topologies have repeating patterns, they make lution with an accuracy  at the rate O(1/). The method the algorithm simple yet powerful, as in HCRF [18]. Fig- aims at iteratively building an increasingly accurate piece- ure1 illustrates graphical representations of linked and cou- wise quadratic lower bound of L(Λ) based on its subgradi- pled topologies for both the MV-HCRF and MV-LDCRF ent ∂ΛL(Λ) = ∂λL(Λ) + ∂ωL(Λ). ∂λL(Λ) can be com- families. puted as follows (∂ωL(Λ) omitted for space): The linked multi-view topologies (Figure 1(a) and 1(c)) ∂L(Λ) X (c) 0 (c) model contemporaneous connections between views, i.e., = p(hs = h | y, x ; Λ)fk(·) (8) ∂λk the current state in one view concurrently affects the cur- (s,c),h0 X rent state in the other view. Intuitively, this captures the syn- − p(y0, h(c) = h0 | x(c); Λ)f (·) chronization points between views. The coupled multi-view s k (s,c),h0,y0 topologies (Figure 1(b) and 1(d)) model first-order Markov connections between views, i.e., the current state in one The most computationally intensive part of solving Equa- view affects the next state in the other view. Intuitively, this tion6 is the inference task of computing the marginal prob- captures the “poker game” interaction, where one player’s abilities in Equation8. We implemented both the junction move is affected by the other player’s previous move (but tree (JT) algorithm [6] for an exact inference and the loopy no synchronization between players at the current move). belief propagation (LBP) [17] for an efficient approximate The linked-coupled multi-view topologies are a combina- inference. For LBP, the message update is done with ran- tion of the linked and coupled topologies, i.e., it models the dom scheduling. 
3.3. Topologies: Linked and Coupled

The configuration of $\mathcal{E}_P$ and $\mathcal{E}_S$ encodes the view-shared and view-specific sub-structures. Inspired by [16], we present three topologies that differ in how they define the view-shared edges $\mathcal{E}_S$: linked, coupled, and linked-coupled. Because these topologies have repeating patterns, they make the algorithm simple yet powerful, as in HCRF [18]. Figure 1 illustrates graphical representations of the linked and coupled topologies for both the MV-HCRF and MV-LDCRF families.

The linked multi-view topologies (Figure 1(a) and 1(c)) model contemporaneous connections between views, i.e., the current state in one view concurrently affects the current state in the other view. Intuitively, this captures the synchronization points between views. The coupled multi-view topologies (Figure 1(b) and 1(d)) model first-order Markov connections between views, i.e., the current state in one view affects the next state in the other view. Intuitively, this captures the "poker game" interaction, where one player's move is affected by the other player's previous move (but there is no synchronization between players at the current move). The linked-coupled multi-view topologies are a combination of the linked and coupled topologies, i.e., they model the "full" interaction between each pair of sequential chains. In all our models, we assume a view-specific first-order Markov chain structure, i.e., the current state affects the next state in the same view.

We encode these dependencies by defining the transition feature functions $g_k(\cdot)$ as

$$g_k(y, h_s^{(c)}, h_t^{(d)}) = 1 \iff \begin{cases} (s+1 = t \wedge c = d) \vee (s = t \wedge c \neq d) & \text{(linked)} \\ (s+1 = t) & \text{(coupled)} \\ (s+1 = t) \vee (s = t \wedge c \neq d) & \text{(linked-coupled)} \end{cases} \qquad (5)$$

In other words, each feature $g_k(\cdot)$ is non-zero only when there is an edge between $h_s^{(c)}$ and $h_t^{(d)}$ as specified in the union of $\mathcal{E}_P$ and $\mathcal{E}_S$.

3.4. Parameter Estimation and Inference

Given a training dataset $\mathcal{D} = \{y_i, \hat{x}_i\}_{i=1}^N$, we find the optimal parameter set $\Lambda^* = \{\lambda^*, \omega^*\}$ by minimizing the regularized negative conditional log-likelihood¹

$$\min_{\Lambda} L(\Lambda) = \frac{\gamma}{2} \|\Lambda\|^2 - \sum_{i=1}^{N} \log p(y_i \mid \hat{x}_i; \Lambda) \qquad (6)$$

where the first term is an L2-norm regularization factor. The second term, inside the summation, can be re-written as:

$$\log p(y_i \mid \hat{x}_i; \Lambda) = \log \sum_{\hat{h}} e^{\Lambda^\top \Phi(y_i, \hat{h}, \hat{x}_i)} - \log \sum_{y', \hat{h}} e^{\Lambda^\top \Phi(y', \hat{h}, \hat{x}_i)} \qquad (7)$$

As with other latent models (e.g., HMMs [19]), introducing hidden variables in Equation 7 makes our objective function non-convex. We find the optimal parameters $\Lambda^*$ using the recently proposed non-convex regularized bundle method (NRBM) [7], which has been proven to converge to a solution with an accuracy $\epsilon$ at the rate $O(1/\epsilon)$. The method aims at iteratively building an increasingly accurate piecewise quadratic lower bound of $L(\Lambda)$ based on its subgradient $\partial_\Lambda L(\Lambda) = \partial_\lambda L(\Lambda) + \partial_\omega L(\Lambda)$. The component $\partial_\lambda L(\Lambda)$ can be computed as follows ($\partial_\omega L(\Lambda)$ is omitted for space):

$$\frac{\partial L(\Lambda)}{\partial \lambda_k} = \sum_{(s,c), h'} p(h_s^{(c)} = h' \mid y, x^{(c)}; \Lambda) f_k(\cdot) - \sum_{(s,c), h', y'} p(y', h_s^{(c)} = h' \mid x^{(c)}; \Lambda) f_k(\cdot) \qquad (8)$$

The most computationally intensive part of solving Equation 6 is the inference task of computing the marginal probabilities in Equation 8. We implemented both the junction tree (JT) algorithm [6] for exact inference and loopy belief propagation (LBP) [17] for efficient approximate inference. For LBP, the message update is done with random scheduling. The update is considered converged when the previous and the current marginals differ by less than $10^{-4}$.

Note that we can easily change our optimization problem in Equation 6 into the max-margin approach [26] by replacing $\sum_{\hat{h}}$ and $\sum_{y', \hat{h}}$ in Equation 7 with $\max_{\hat{h}}$ and $\max_{y', \hat{h}}$ and solving a MAP inference problem. Max-margin approaches have recently been shown to improve the performance of HCRF [26]; we plan to implement the max-margin approach for our multi-view models in the future.

¹ The derivations in this section are for the MV-HCRF; they can be adapted easily for the MV-LDCRF by replacing $y_i$ with $\mathbf{y}_i$.
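Referring back to Equation 5 in Section 3.3, the sketch below enumerates the view-specific edges $\mathcal{E}_P$ and the view-shared edges $\mathcal{E}_S$ for each of the three topologies. The function and argument names are ours, not the paper's.

```python
# Enumerate hidden-variable edges (s, t, c, d) for the topologies of Eq. (5).
# E_P: within-view first-order Markov edges; E_S: between-view edges.
from itertools import combinations

def build_edges(T, num_views, topology):
    E_P = [(s, s + 1, c, c) for c in range(num_views) for s in range(T - 1)]
    E_S = []
    for c, d in combinations(range(num_views), 2):
        if topology in ("linked", "linked-coupled"):
            # contemporaneous connections: h_t^(c) -- h_t^(d)
            E_S += [(t, t, c, d) for t in range(T)]
        if topology in ("coupled", "linked-coupled"):
            # first-order cross-view connections, in both directions
            E_S += [(s, s + 1, c, d) for s in range(T - 1)]
            E_S += [(s, s + 1, d, c) for s in range(T - 1)]
    return E_P, E_S

# Example: a two-view, 3-frame model with the linked topology.
E_P, E_S = build_edges(T=3, num_views=2, topology="linked")
```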

4. Experiments on Synthetic Data

As an initial demonstration of our approach, we use a synthetic example and compare three MV-HCRF models (LHCRF, CHCRF, and LCHCRF) to a single-view HCRF. Specifically, we focus on showing the advantage of the exponentially reduced model complexity of our approach by comparing performance with varying training dataset sizes.

Dataset: Synthetic data for a three-view binary classification task was generated using the Gibbs sampler [21]. Three first-order Markov chains, with 4 hidden states each, were tied using the linked-coupled topology (i.e., LCHCRF; see Equation 5). In order to simulate three views having strong interaction, we set the weights on view-shared edges to random real values in the range [-10, 10], while view-specific edge weights were in [-1, 1]. The observation model for each view was defined by a 4D exponential distribution; to simulate three views having different dynamics, the min-max range of the distribution was set to [-1, 1], [-5, 5], and [-10, 10], respectively. To draw samples, we randomly initialized model parameters and iterated 50 times using the Gibbs sampler. The sequence length was set to 30.
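The sketch below is a greatly simplified, self-contained illustration of single-site Gibbs sampling over the hidden states of three tied chains, in the spirit of the setup described above. It does not reproduce the authors' generator: the observation and label sampling are omitted, and the weight choices and names are our assumptions.

```python
# Simplified single-site Gibbs sweeps over three tied hidden chains
# (4 states each, linked-coupled topology). Illustrative only; the
# authors' full synthetic generator (observations, labels) is not shown.
import numpy as np

rng = np.random.default_rng(0)
T, C, H = 30, 3, 4                       # sequence length, #views, #states

# Edges: within-view first-order Markov plus linked-coupled cross-view edges.
edges = [(s, s + 1, c, c) for c in range(C) for s in range(T - 1)]
for c in range(C):
    for d in range(c + 1, C):
        edges += [(t, t, c, d) for t in range(T)]                # linked
for c in range(C):
    for d in range(C):
        if c != d:
            edges += [(s, s + 1, c, d) for s in range(T - 1)]    # coupled

# Pairwise weights: strong on view-shared edges, weak on view-specific ones.
W = {e: rng.uniform(-10, 10, (H, H)) if e[2] != e[3]
        else rng.uniform(-1, 1, (H, H)) for e in edges}

# Neighbor lists so each variable can find the edges it participates in.
nbrs = {(t, c): [] for t in range(T) for c in range(C)}
for e in edges:
    s, t, c, d = e
    nbrs[(s, c)].append((e, 0))          # variable is the first endpoint
    nbrs[(t, d)].append((e, 1))          # variable is the second endpoint

h = rng.integers(0, H, size=(C, T))      # random initialization
for sweep in range(50):                  # 50 sweeps, as in the paper
    for t in range(T):
        for c in range(C):
            logits = np.zeros(H)
            for (s_, t_, c_, d_), end in nbrs[(t, c)]:
                other = h[d_, t_] if end == 0 else h[c_, s_]
                logits += (W[(s_, t_, c_, d_)][:, other] if end == 0
                           else W[(s_, t_, c_, d_)][other, :])
            p = np.exp(logits - logits.max())
            h[c, t] = rng.choice(H, p=p / p.sum())
```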

Gibb’s sampler. The sequence length was set to 30. Accuracy (%) LCHCRF−JT 55 Methodology: Inference in the MV-HCRF models was HCRF performed using both the junction tree and the loopy BP 50 0 100 200 300 400 500 algorithms. Since we know the exact number of hidden Training Dataset Size states per view, the MV-HCRF models were set to have that many hidden states. The optimal number of hidden states Figure 2. Accuracy graph as a function of the training dataset for the HCRF was selected automatically based on valida- size on the synthetic dataset. tion, varying the number from 12 to 72 with an increment Accuracy (%) of 12. The size of the training and validation splits were Models N=100 N=300 N=500 varied from 100 to 500 with an increment of 100, the size LHCRF-JT 69.58 (p<.01) 72.76 (p=.01) 73.52 (p=.12) of the test split was always 1,000. For each split size, we CHCRF-JT 69.88 (p<.01) 72.94 (p<.01) 72.66 (p=.56) performed 5-fold cross validation except that the test split LCHCRF-JT 69.10 (p<.01) 72.92 (p<.01) 75.56 (p=.03) size stayed constant at 1,000. Each model was trained with LHCRF-LBP 69.94 (p<.01) 72.70 (p=.01) 75.28 (p=.01) five random initializations, and the best validation parame- CHCRF-LBP 69.66 (p<.01) 71.50 (p=.01) 72.60 (p=.59) ter was selected based on classification accuracy. LCHCRF-LBP 68.86 (p<.01) 72.22 (p=.03) 73.44 (p=.17) Result: Figure2 and Table1 show classification accu- HCRF 60.18 67.04 72.14 racy as a function of dataset size, comparing HCRF and the Table 1. Experimental results on the synthetic dataset. The MV- three MV-HCRF models. The results are averaged values HCRF models statistically significantly outperformed the single- over the 5 splits. The MV-HCRF models always outper- view HCRF at the dataset size of 100 and 300. Values in paren- formed the HCRF; these differences were statistically sig- thesis show p-values from t-tests against HCRF. Bold faced values nificant for the dataset size of 100 and 300. Our result also indicate the difference against HCRF was statistically significant. shows that an approximate inference method (i.e., LBP) on the multi-view models achieves as good accuracy as done with an exact inference method (i.e., JT). on [18]. Below we describe the datasets, detail our experi- The optimal number of hidden states for the best per- mental methods with baselines, and report and discuss the forming HCRF was 48, which resulted in a sizable differ- results. ence in the model complexity, with 6,144 parameters to es- 5.1. Datasets timate using HCRF, compared to the number of parame- ters needed for LHCRF (240), CHCRF (336), and LCHCRF ArmGesture [18]: This dataset includes the six arm ges- (432). Consequently, the MV-HCRF models outperformed tures shown in Figure3. Observation features include auto- the HCRF consistently even with one third of the train- matically tracked 2D joint angles and 3D euclidean coordi- ing dataset size (see Table1, dataset size 100 and 300). nates for left/right shoulders and elbows; each observation Note that the performance difference between multi-view is represented as a 20D feature vector. The dataset was col- and single-view HCRFs is larger at a smaller training split lected from 13 participants with an average of 120 samples size. This implies that our multi-view approach is advan- per class.3 Following [20], we subsampled the data by the tageous especially when there is not enough training data, factor of 2. For multi-view models, we divided signals into which is often the case in practice. 
5. Experiments on Real-world Data

We evaluated our multi-view models on segmented and unsegmented human action recognition tasks using three datasets: the ArmGesture dataset [18], the NATOPS dataset [22], and the ArmGesture-Continuous dataset.² The first two datasets involve segmented gestures, while the third involves unsegmented gestures that we created based on [18]. Below we describe the datasets, detail our experimental methods with baselines, and report and discuss the results.

² The dataset and the source code of our model are available at http://people.csail.mit.edu/yalesong/cvpr12/

5.1. Datasets

ArmGesture [18]: This dataset includes the six arm gestures shown in Figure 3. Observation features include automatically tracked 2D joint angles and 3D Euclidean coordinates for the left/right shoulders and elbows; each observation is represented as a 20D feature vector. The dataset was collected from 13 participants with an average of 120 samples per class.³ Following [20], we subsampled the data by a factor of 2. For multi-view models, we divided the signals into the left and right arms.

³ The exact sample counts per class were [88, 117, 118, 132, 179, 90].

Figure 3. ArmGesture dataset [18]. Illustration of the 6 gesture classes (Flip Back, Shrink Vertically, Expand Vertically, Double Back, Point and Back, Expand Horizontally). The green arrows are the motion trajectory of the fingertip.

NATOPS [22]: This dataset includes twenty-four body-hand gestures used when handling aircraft on the deck of an aircraft carrier. We used the six gestures shown in Figure 4. The dataset includes automatically tracked 3D body postures and hand shapes. The body feature includes 3D joint velocities for the left/right elbows and wrists, and is represented as a 12D input feature vector. The hand feature includes probability estimates of five predefined hand shapes – opened/closed palm, thumb up/down, and "no hand". The fifth shape, no hand, was dropped in the final representation, resulting in an 8D input feature vector (both hands). Note that, in the second gesture pair (#3 and #4), the hand signals contained mostly zero values, because these hand shapes were not included in the predefined hand poses. We included this gesture pair to see how the multi-view models would perform when one view has almost no information. We used 10 samples per person, for a total of 200 samples per class. Similar to the previous experiment, we subsampled the data by a factor of 2. For multi-view models, we divided the signals into body postures and hand shapes.⁴

⁴ We divided the views in this way because the two views had different dynamics (i.e., scales); hand signals contained normalized probability estimates, whereas body signals contained joint velocities.

Figure 4. NATOPS dataset [22]. Illustration of the 6 aircraft handling signal gestures. Body movements are illustrated with yellow arrows, and hand poses are illustrated with synthesized images of hands. Red rectangles indicate that hand poses are important in distinguishing the gesture pair.

ArmGesture-Continuous: The two datasets mentioned above include segmented sequences only. To evaluate the MV-LDCRF models on unsegmented sequences, we created a new dataset based on the ArmGesture dataset, called ArmGesture-Continuous. To generate an unsegmented sequence, we randomly selected 3 to 5 (segmented) samples from different classes and concatenated them in random order. This resulted in 182 samples in total, with an average of 92 frames per sample. Similar to the previous experiments, we subsampled the data by a factor of 2, and divided the signals into the left and right arms for multi-view models.

5.2. Methodology

Based on the previous experiment on the synthetic data, we selected two topologies, linked and coupled, to compare to several baselines. Also, the junction tree algorithm was selected as the inference method for the multi-view models. In all experiments, we performed four random initializations of the model parameters, and the best validation parameter was selected based on classification accuracy. Below we detail the experimental methods and baselines used in each experiment.

ArmGesture: Five baselines were chosen: HMM [19], CRF [13], HCRF [18], max-margin HCRF [26], and S-KDR-SVM [20]. Following [20], we performed 5-fold cross validation.⁵ The number of hidden states was automatically validated; for the single-view model (i.e., MM-HCRF), we varied it from 8 to 16, increasing by 4; and for multi-view models, we varied it from 8 (4 per view) to 16 (8 per view), increasing by 4 (2 per view). No regularization was used in this experiment.

⁵ We did this for direct comparison to the state-of-the-art result on this dataset [20].

NATOPS: Three baselines were chosen: HMM [19], CRF [13], and HCRF [18]. We performed hold-out testing, where we selected samples from the last 10 subjects for training, the first 5 subjects for testing, and the remaining 5 subjects for validation. The number of hidden states was automatically validated; for single-view models (i.e., HMM and HCRF), we varied it from 12 to 120, increasing by 12; for multi-view models, we varied it from 6 (3 per view) to 60 (30 per view). No regularization was used in this experiment.

ArmGesture-Continuous: Unlike the two previous experiments, the task in this experiment was continuous sequence labeling. We compared a linked LDCRF to two baselines: CRF [13] and LDCRF [15]. We performed hold-out testing, where we selected the second half of the dataset for training, the first quarter for testing, and the remaining quarter for validation. The number of hidden states was automatically validated; for the single-view model (i.e., LDCRF), we varied it from 2 to 4; for multi-view models, we varied it from 4 (2 per view) to 8 (4 per view). The regularization coefficient was also automatically validated with values 0 and 10^k, k = [-4:2:4].
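As a preprocessing illustration for the datasets above, the sketch below shows one way the per-frame feature vectors could be subsampled by a factor of 2 and split into per-view streams before being fed to the multi-view models. The exact column layout (e.g., which dimensions belong to the body view or the hand view) is our assumption and is not specified at this level of detail in the text.

```python
# Hypothetical preprocessing: temporal subsampling by a factor of 2 and
# splitting a single feature stream into per-view streams. The column
# assignments below are illustrative assumptions, not the paper's layout.
import numpy as np

def subsample(seq, factor=2):
    """Keep every `factor`-th frame of a (T, D) feature sequence."""
    return seq[::factor]

def split_views(seq, view_slices):
    """Split a (T, D) sequence into a list of per-view (T, D_c) sequences."""
    return [seq[:, sl] for sl in view_slices]

# Example with NATOPS-like features: 12D body + 8D hand = 20D per frame.
frames = np.random.randn(92, 20)                  # a dummy sequence
body_hand_slices = [slice(0, 12), slice(12, 20)]  # assumed column layout
views = split_views(subsample(frames), body_hand_slices)
```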

HMM [18] 84.22 1 1 0.95 CRF [18] 86.03 0.9 0.8 HCRF [18] 91.64 2 0.88 HCRF (ω = 1)[18] 93.86 0.7 MM-HCRF 93.79 0.6 3 0.85 S-KDR-SVM [20] 95.30 0.5 4 0.55 0.4 Ground Truth Linked HCRF 97.65 HMM True Positive Rate 0.3 Coupled HCRF 97.24 CRF 5 0.90 0.2 HCRF LHCRF Table 2. Experimental results on the ArmGesture dataset. We 0.1 CHCRF 6 0.97 include the classification accuracy reported in [18, 20]. Our multi- 0 0.2 0.4 0.6 0.8 1 1 2 3 4 5 6 False Positive Rate Prediction view HCRF models (LHCRF and CHCRF) outperformed all the baselines; to the best of our knowledge, this is the best classifi- ROC curve and a confusion matrix from the NATOPS cation accuracy reported in the literature. ω = 1 that the Figure 5. experiments. previous and next observations are concatenated to produce an ob- As expected, most labeling errors occurred within servation. each gesture pair. Top of the confusion matrix shows the F1 score and the number of hidden states of CHCRF. See the text for detail. Models Accuracy (%) HMM 77.67 gesture pair classification accuracy; for the first and the third CRF 53.30 HCRF 78.00 gesture pairs (i.e., #1-#2 and #5-#6), CHCRF achieved an Linked HCRF 87.00 accuracy of 91.5% and 93.5%, while for HCRF they were Coupled HCRF 86.00 75% and 76%. We claim that this result empirically demon- strates the benefit of our approach when each view has dif- Table 3. Experimental results on the NATOPS dataset. Our ferent dynamics (i.e., one view with velocity measures vs. MV-HCRF models outperformed all the baselines. The only non- another view with probability measures). The second ges- latent model, i.e., CRF, performed the worst, suggesting that it is ture pair showed inferior performance in our multi-view crucial to learn a model with latent variables on this dataset. models; CHCRF achieved 70%, while for HCRF it was Models Accuracy (%) 80%. This raises an interesting point: when one view (hand) contains almost no information, forcing the model to cap- CRF 90.80 LDCRF 91.02 ture the interaction between views results in inferior perfor- Linked LDCRF 92.51 mance, possibly due to the increased complexity. This ob- Coupled LDCRF 92.44 servation suggests automatically learning the optimal topol- ogy of EP and ES could help solve this problem; we leave Table 4. Experimental results on the ArmGesture-Continuous this as future work. dataset. Our linked LDCRF outperformed the two baselines. Table4 shows per-frame classification accuracy on the ArmGesture-Continuous dataset. Consistent with our previ- ous experimental results, the linked LDCRF outperformed when the left arm is lifted (or lowered), the right arm is low- the single-view LDCRF. This shows that our approach out- ered (or lifted) (see the gestures EV and SV in Figure3). We performs single-view discriminative models on both seg- note that S-KDR-SVM was trained on a smaller dataset size mented and unsegmented action recognition tasks. (N=10) [20]. To the best of our knowledge, our result is the Although the linked and coupled HCRF models capture best classification accuracy reported in the literature. different interaction patterns, these models performed al- Table3 shows classification accuracy on the NATOPS most similarly (see Table2 and Table3). We believe this dataset. All of our multi-view models outperformed the is due to the high rate of the two datasets, mak- baselines. The accuracy of CRF was significantly lower ing the two models almost equivalent. We plan to investi- than other latent models. 
This suggests that, on this dataset, gate the difference between these models using higher order it is crucial to learn the sub-structure of the data using latent Markov connections between views for the coupled models. variables. Figure5 shows an ROC plot averaged over all 6 classes, and a confusion matrix from the result of CHCRF. 6. Conclusion As expected, most labeling errors occurred within each ges- ture pair (i.e., #1-#2, #3-#4, and #5-#6). We can see from We introduced multi-view latent variable discrimina- the ROC plot that our multi-view models reduce both false tive models that jointly learn both view-shared and view- positives and false negatives. specific sub-structures, explicitly capturing the interaction Detailed analysis from the NATOPS experiment revealed between views using disjoint sets of latent variables. We that the multi-view models dramatically increased the per evaluated our approach using synthetic and real world data, for both segmented and unsegmented human action recog- [8] D. R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical nition, and demonstrated empirically that our approach correlation analysis: An overview with application to learn- successfully captures the latent interaction between views, ing methods. Neural Comp., 16(12):2639–2664, 2004.2 achieving superior classification accuracy on all three hu- [9] A. Hyvarinen, J. Karhunen, and E. Oja. Independent Com- man action datasets we evaluated. ponent Analysis. Wiley-Interscience, 2001.8 Our multi-view approach has a number of clear advan- [10] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A tages over the single-view approach. Since each view is review. ACM Comput. Surv., 31(3):264–323, 1999.8 [11] Y. Jia, M. Salzmann, and T. Darrell. Factorized latent spaces treated independently at the input feature level, the model with structured sparsity. In NIPS, pages 982–990, 2010.1,2 captures different dynamics from each view more precisely. [12] M.-A. Krogel and T. Scheffer. Multi-relational learning, It also explicitly models the interaction between views us- text mining, and semi- for functional ge- ing disjoint sets of latent variables, thus the total number nomics. Machine Learning, 57(1-2):61–81, 2004.2 of latent variables increases only linearly in the number of [13] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Con- views, as compared to the early fusion approach where it ditional random fields: Probabilistic models for segmenting increases exponentially. Since the model has fewer param- and labeling sequence data. In ICML, 2001.6 eters to estimate, model training requires far less training [14] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. data, making our approach favorable in real world applica- Ghaoui, and M. I. Jordan. Learning the kernel matrix with tions. semidefinite programming. JMLR, 5:27–72, 2004.2 In the future, we plan to extend our models to work with [15] L.-P. Morency, A. Quattoni, and T. Darrell. Latent-dynamic data where the views are not explicitly defined. Currently discriminative models for continuous gesture recognition. In we assume a priori knowledge about the data and manually CVPR, 2007.1,2,3,6 define the views. However, in many real world tasks the [16] K. P. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UCB, 2002.1,2,4 views are not explicitly defined. In this case, we may be [17] K. P. Murphy, Y. Weiss, and M. I. Jordan. 
Loopy belief prop- able to perform independent component analysis [9] or data agation for approximate inference: An empirical study. In clustering [10] to automatically learn the optimal view con- UAI, pages 467–475, 1999.4 figuration in an unsupervised manner. We look forward to [18] A. Quattoni, S. B. Wang, L.-P. Morency, M. Collins, and exploring this in the future. T. Darrell. Hidden conditional random fields. T-PAMI, 29(10):1848–1852, 2007.1,2,3,4,5,6,7 Acknowledgments [19] L. R. Rabiner. A tutorial on hidden Markov models and se- lected applications in speech recognition, pages 267–296. This work was funded by the Office of Naval Research Sci- Morgan Kaufmann Publishers Inc., 1990.4,6 ence of Autonomy program #N000140910625, the National [20] A. Shyr, R. Urtasun, and M. I. Jordan. Sufficient dimension Science Foundation #IIS-1018055, and the U.S. Army Re- reduction for visual sequence classification. In CVPR, pages search, Development, and Engineering Command (RDE- 3610–3617, 2010.5,6,7 COM). [21] A. F. M. Smith and G. O. Roberts. Bayesian computation via the gibbs sampler and related markov chain monte carlo References methods. Roy. Stat. Soc. Series B, 55:2–23, 1993.4 [22] Y. Song, D. Demirdjian, and R. Davis. Tracking body and [1] P. K. Atrey, M. A. Hossain, A. El-Saddik, and M. S. Kankan- hands for gesture recognition: Natops aircraft handling sig- halli. Multimodal fusion for multimedia analysis: a survey. nals database. In FG, pages 500–506, 2011.1,5,6 Multimedia Syst., 16(6):345–379, 2010.1 [23] S. Sonnenburg, G. Ratsch,¨ C. Schafer,¨ and B. Scholkopf.¨ [2] A. Blum and T. M. Mitchell. Combining labeled and unla- Large scale multiple kernel learning. JMLR, 7:1531–1565, beled sata with co-training. In COLT, 1998.2 2006.2 [3] M. Brand, N. Oliver, and A. Pentland. Coupled hidden [24] C. A. Sutton, K. Rohanimanesh, and A. McCallum. Dynamic markov models for complex action recognition. In CVPR, conditional random fields: factorized probabilistic models pages 994–999, 1997.1,2 for labeling and segmenting sequence data. In ICML, 2004. [4] N. Chen, J. Zhu, and E. Xing. Predictive subspace learning 2 for multi-view data: a large margin approach. In NIPS, pages [25] J. B. Tenenbaum, C. Kemp, T. L. Griffiths, and N. D. Good- 361–369, 2010.2 man. How to grow a mind: Statistics, structure, and abstrac- [5] C. M. Christoudias, R. Urtasun, and T. Darrell. Multi-view tion. Science, 331(6022):1279–1285, 2011.1 learning in presence of view disagreement. In UAI, 2008.1 [26] Y. Wang and G. Mori. Max-margin hidden conditional ran- [6] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, and D. J. Spiegel- dom fields for human action recognition. In CVPR, pages halter. Probabilistic Networks and Expert Systems. Springer- 872–879, 2009.4,6 Verlag, 1999.4 [27] L. Wu, S. L. Oviatt, and P. R. Cohen. Multimodal integra- [7] T. M. T. Do and T. Artieres.` Large margin training for hid- tion - a statistical view. IEEE Transactions on Multimedia, den markov models with partially observed states. In ICML, 1(4):334–341, 1999.1,2 page 34, 2009.4