
A Latent Manifold Markovian Dynamics Gaussian Process

Sotirios P. Chatzis and Dimitrios Kosmopoulos

Abstract— In this paper, we propose a Gaussian process (GP) model for analysis of nonlinear time series. Formulation of our model is based on the consideration that the observed data are functions of latent variables, with the associated mapping between observations and latent representations modeled through GP priors. In addition, to capture the temporal dynamics in the modeled data, we assume that subsequent latent representations depend on each other on the basis of a hidden Markov prior imposed over them. Derivation of our model is performed by marginalizing out the model parameters in closed form using GP priors for observation mappings, and appropriate stick-breaking priors for the latent variable (Markovian) dynamics. This way, we eventually obtain a nonparametric Bayesian model for dynamical systems that accounts for uncertainty in the modeled data. We provide efficient inference algorithms for our model on the basis of a truncated variational Bayesian approximation. We demonstrate the efficacy of our approach considering a number of applications dealing with real-world data, and compare it with the related state-of-the-art approaches.

Index Terms— Gaussian process (GP), latent manifold, Markovian dynamics, stick-breaking process, variational Bayes.

I. INTRODUCTION

THERE is a wide variety of generative models used to perform analysis of nonlinear time series [1]. Approaches based on hidden Markov models (HMMs) and linear dynamical systems (LDS) are quite ubiquitous in the current literature due to their simplicity, efficiency, and generally satisfactory performance in many applications. More expressive models, such as switching LDS and nonlinear dynamical systems (NLDS), have also been proposed; however, these approaches are faced with difficulties in terms of their learning and inference algorithms, due to the entailed large number of parameters that must be estimated, and the hence needed large amounts of training data [1].

Recently, a nonparametric Bayesian approach designed to resolve these issues of NLDS, namely the Gaussian process dynamical model (GPDM), was introduced in [2]. This approach is fully defined by a set of low-dimensional representations of the observed data, with both the observation and dynamical processes learned by means of Gaussian process (GP) regression [3]. This GP-based formulation of the model gives rise to a nonparametric Bayesian nature, which removes the need to select a large number of parameters associated with function approximators, while retaining the power of nonlinear dynamics and observation models. GPDM is essentially an extension of the GP latent variable model (GPLVM) of [4], which models the joint distribution of the observed data and their representation in a low-dimensional latent space through a GP prior. GPDM extends GPLVM by augmenting it with a model of temporal dynamics captured through imposition of a dedicated GP prior. This way, GPDM allows for not only obtaining predictions about future data, but also regularizing the latent space to allow for more effective modeling of temporal dynamics.

Despite the merits of GPDM, a significant drawback of this model consists in the need of its inference algorithm to obtain maximum a posteriori (MAP) estimates of its parameters through type-II maximum-likelihood [2] (performed by means of scaled conjugate gradient descent [5]). This formulation poses a significant bottleneck to GPDM, due to both the entailed high computational costs, as well as the possibility of obtaining bad estimates due to the algorithm getting stuck to poor local maxima in cases of limited training datasets. In addition, to increase computational efficiency, GPDM imposes an oversimplistic spherical Gaussian prior over its model of temporal dynamics, which probably undermines its data modeling capacity. Finally, the use of GP priors to describe the temporal dynamics between the latent variables of the model leads to significant computational overheads, as it gives rise to calculations that entail inverting very large Gram matrices [3].

To resolve these issues, in this paper, we propose a flexible approach for modeling sequential data by means of nonparametric component densities. Formulation of our proposed model is based on the assumption that, when modeling sequential data, each observation in a given sequence is related to a vector in a latent space, and is generated through a latent nonlinear function that maps the latent space to the space of observations. We use a GP prior to infer this unknown mapping function from the data in a flexible manner.

In addition, the latent vectors that generate the observed sequential data are assumed to possess strong temporal interdependencies; to capture these dependencies, we assume that these latent variables are generated from an HMM in the manifold of latent variables. Specifically, we assume a latent space HMM with infinite hidden states, and use flexible stick-breaking priors [6], [7] to infer its hidden state dynamics; this formulation allows us to automatically determine the required number of states in this latent manifold HMM in a data-driven fashion. We dub our approach the latent manifold hidden Markov GP (LM2GP) model. Inference for our model is performed by means of an efficient truncated variational Bayesian algorithm [8], [9]. We evaluate the efficacy of our approach in several applications, dealing with sequential data clustering, classification, and generation. A high-level illustration of the conceptual configuration of our model is shown in Fig. 1.

Fig. 1. Intuitive illustration of the generative construction of our model.

The remainder of this paper is organized as follows. In Section II, we provide a brief presentation of the theoretical background of the proposed method. Initially, we present the Dirichlet process (DP) and its function as a prior in nonparametric Bayesian models; furthermore, we provide a brief summary of the GPDM model. In Section III, we introduce the proposed LM2GP model, and derive efficient model inference algorithms based on the variational Bayesian framework. In Section IV, we conduct the experimental evaluation of our proposed model, considering a number of applications dealing with several real-world datasets, and we compare its performance with state-of-the-art-related approaches. Finally, in Section V, we conclude and summarize this paper.


II. PRELIMINARIES

A. Dirichlet Process

DP models were first introduced in [7]. A DP is characterized by a base distribution G_0 and a positive scalar α, usually referred to as the innovation parameter, and is denoted as DP(α, G_0). Essentially, a DP is a distribution placed over a distribution. Let us suppose we randomly draw a sample distribution G from a DP, and, subsequently, we independently draw M random variables {Θ*_m}_{m=1}^M from G

G | α, G_0 ∼ DP(α, G_0)    (1)
Θ*_m | G ∼ G,  m = 1, ..., M.    (2)

Integrating out G, the joint distribution of the variables {Θ*_m}_{m=1}^M can be shown to exhibit a clustering effect. Specifically, given the first M − 1 samples of G, {Θ*_m}_{m=1}^{M−1}, it can be shown that a new sample Θ*_M is either 1) drawn from the base distribution G_0 with probability α/(α + M − 1), or 2) selected from the existing draws, according to a multinomial allocation, with probabilities proportional to the number of the previous draws with the same allocation [10]. Let {Θ_c}_{c=1}^C be the set of distinct values taken by the variables {Θ*_m}_{m=1}^{M−1}. Denoting as ν_c^{M−1} the number of values in {Θ*_m}_{m=1}^{M−1} that equal Θ_c, the distribution of Θ*_M given {Θ*_m}_{m=1}^{M−1} can be shown to be of the form [10]

p(Θ*_M | {Θ*_m}_{m=1}^{M−1}, α, G_0) = α/(α + M − 1) G_0 + ∑_{c=1}^{C} ν_c^{M−1}/(α + M − 1) δ_{Θ_c}    (3)

where δ_{Θ_c} denotes the distribution concentrated at the single point Θ_c. These results illustrate two key properties of the DP scheme. First, the innovation parameter α plays a key role in determining the number of distinct parameter values. A larger α induces a higher tendency of drawing new parameters from the base distribution G_0; indeed, as α → ∞, we get G → G_0. Conversely, as α → 0, all {Θ*_m}_{m=1}^M tend to cluster to a single random variable. Second, the more often a parameter is shared, the more likely it will be shared in the future.

A characterization of the (unconditional) distribution of the random variable G drawn from a DP, DP(α, G_0), is provided by the stick-breaking construction in [6]. Consider two infinite collections of independent random variables v = [v_c]_{c=1}^∞ and {Θ_c}_{c=1}^∞, where the v_c are drawn from a Beta distribution, and the Θ_c are independently drawn from the base distribution G_0. The stick-breaking representation of G is then given by

G = ∑_{c=1}^{∞} ϖ_c(v) δ_{Θ_c}    (4)

where

p(v_c) = Beta(1, α)    (5)
ϖ_c(v) = v_c ∏_{j=1}^{c−1} (1 − v_j) ∈ [0, 1]    (6)

and

∑_{c=1}^{∞} ϖ_c(v) = 1.    (7)

Under the stick-breaking representation of the DP, the atoms Θ_c, drawn independently from the base distribution G_0, can be seen as the parameters of the component distributions of a mixture model comprising an unbounded number of component densities, with mixing proportions ϖ_c(v).
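As a concrete illustration of the stick-breaking construction in (4)–(7), the following short Python sketch draws a truncated set of stick-breaking weights and atoms from a DP. The truncation level C and the standard-normal base distribution are illustrative assumptions made only for this sketch, not part of the paper.

```python
import numpy as np

def truncated_stick_breaking(alpha, C, rng, base_sampler):
    """Draw C stick-breaking weights and atoms from DP(alpha, G0).

    alpha: innovation parameter; C: truncation level (assumed here);
    base_sampler: callable drawing one atom from the base distribution G0.
    """
    v = rng.beta(1.0, alpha, size=C)       # v_c ~ Beta(1, alpha), cf. (5)
    v[-1] = 1.0                            # close the stick at the truncation level
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * remaining                # pi_c(v) = v_c * prod_{j<c}(1 - v_j), cf. (6)
    atoms = np.array([base_sampler(rng) for _ in range(C)])
    return weights, atoms

rng = np.random.default_rng(0)
w, theta = truncated_stick_breaking(alpha=2.0, C=10, rng=rng,
                                    base_sampler=lambda r: r.normal())
print(w.sum())  # equals 1 by construction, cf. (7)
```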
B. GP Dynamical Model

1) GP Models: Let us begin with a brief description of GP regression. Consider an observation space X; a GP f(x), x ∈ X, is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [3]. We typically use the notation

f(x) ∼ GP(m(x), k(x, x'))    (8)

where m(x) is the mean function of the process and k(x, x') is the covariance function of the process. Usually, for simplicity, and without any loss of generality, the mean of the process is taken to be zero, m(x) = 0, although this is not necessary. Concerning selection of the covariance function, a large variety of kernel functions k(x, x') might be employed, depending on the application considered [3]. Eventually, the real process f(x) drawn from a GP with mean zero and kernel function k(x, x') follows a Gaussian distribution

p(f | x) = N(f | 0, k(x, x')).    (9)

Let us suppose a set of independent and identically distributed samples D = {(x_i, y_i) | i = 1, ..., N}, with the d-dimensional variables x_i being the observations related to a modeled phenomenon, and the scalars y_i being the associated target values. The goal of a regression model is, given a new observation x_*, to predict the corresponding target value y_*, based on the information contained in the training set D. The basic notion behind GP regression consists in the assumption that the observable (training) target values y in a considered regression problem can be expressed as the superposition of a GP over the input space X, f(x), and an independent white noise term

y = f(x) + ε    (10)

where f(x) is given by (8), and

p(ε) = N(ε | 0, σ²).    (11)

Under this regard, the joint normality of the training target values Y = [y_i]_{i=1}^N and some unknown target value y_*, approximated by the value f_* of the postulated GP evaluated at the observation point x_*, is a Gaussian of the form [3]

[Y; f_*] ∼ N( 0, [ K(X, X) + σ² I_N , k(x_*) ; k(x_*)^T , k(x_*, x_*) ] )    (12)

where

k(x_*) ≜ [k(x_1, x_*), ..., k(x_N, x_*)]^T    (13)

X = [x_i]_{i=1}^N, I_N is the N × N identity matrix, and K(X, X) is the N × N covariance matrix of the N training data points (Gram matrix)

K(X, X) ≜ [k(x_i, x_j)]_{i,j=1}^N.    (14)

From (12), and conditioning on the available training samples, we can derive the expression of the model predictive distribution, yielding

p(f_* | x_*, D) = N(f_* | μ_*, σ_*²)    (15)
μ_* = k(x_*)^T (K(X, X) + σ² I_N)^{−1} Y    (16)
σ_*² = σ² − k(x_*)^T (K(X, X) + σ² I_N)^{−1} k(x_*) + k(x_*, x_*).    (17)

2) GP Latent Variable Models: Building upon the GP model, the GPLVM is essentially a GP the input variables x of which are considered latent variables rather than observed ones. Specifically, GPLVM considers that the y ∈ R^D are observed multidimensional variables, while the latent vectors x are variables lying in some lower dimensional manifold that generates the observed variables y through an unknown latent mapping f, modeled through a GP. This way, we have

p(y | x) = ∏_{d=1}^{D} N(y_d | 0, k_d(x, x))    (18)

which, considering a set of observations Y = [y_n]_{n=1}^N, obtains

p(Y | X) = ∏_{d=1}^{D} N(Y_d | 0, K_d(X, X))    (19)

where k_d(·, ·) is the kernel of the model for the dth observed dimension, X = [x_n]_{n=1}^N, Y_d = [y_{nd}]_{n=1}^N, and K_d(X, X) is the Gram matrix (with inputs X) corresponding to the dth dimension of the modeled data. GPLVM learns the values of the latent variables x corresponding to the observed data y through the maximization of the model marginal likelihood, i.e., in a MAP fashion. As shown in [4], GPLVM can be viewed as a GP-based nonparametric Bayesian extension of probabilistic principal component analysis [11], [12], a popular method for latent manifold modeling of high-dimensional data.

3) Dynamic GPLVMs: Inspired from GPLVM, GPDM performs modeling of sequential data by considering a GP prior as in GPLVM, and introducing an additional model of temporal interdependencies between successive latent vectors x. Specifically, considering a sequence of observed data Y = [y_n]_{n=1}^N, where y_n = [y_{nd}]_{d=1}^D, with corresponding latent manifold projections X = [x_n]_{n=1}^N, x_n ∈ R^Q, GPDM models the dependencies between the latent and observed data as in (19), while also considering a GP-based model of the interdependencies between the latent vectors x_n. In detail, this latter model initially postulates

p(X) = p(x_1) ∫ ∏_{n=2}^{N} p(x_n | x_{n−1}; A) p(A) dA    (20)

which, assuming

p(x_n | x_{n−1}; A) = N(x_n | A x_{n−1}, σ² I)    (21)

and a simplistic isotropic Gaussian prior on the columns of the parameters matrix A, yields a GP prior of the form

p(X) = p(x_1) / √( (2π)^{Q(N−1)} |K̂(X, X)|^Q ) exp( −(1/2) tr( K̂(X, X)^{−1} X_{2:N} X_{2:N}^T ) )    (22)

where X_{2:N} ≜ [x_n]_{n=2}^N and K̂(X, X) is the Gram matrix over the first N − 1 latent points

K̂(X, X) ≜ [k(x_i, x_j)]_{i,j=1}^{N−1}.    (23)

Estimation of the set of (dynamically interdependent) latent variables X of the GPDM is performed by means of type-II maximum-likelihood, similar to GPLVM.
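For concreteness, the following Python sketch evaluates the GPDM latent-dynamics prior of (22)–(23) for a given latent trajectory. The squared-exponential kernel, the jitter term, and the standard-normal density assumed for p(x_1) are illustrative choices made only for this sketch, not part of the original GPDM formulation.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between row-vector sets A and B (assumed kernel)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def gpdm_dynamics_log_prior(X, kernel=rbf_kernel, jitter=1e-6):
    """log p(X) as in (22); X is the N x Q matrix of latent trajectories."""
    N, Q = X.shape
    K_hat = kernel(X[:-1], X[:-1]) + jitter * np.eye(N - 1)   # Gram matrix over x_1..x_{N-1}, cf. (23)
    X_2N = X[1:]                                              # X_{2:N}
    _, logdet = np.linalg.slogdet(K_hat)
    quad = np.trace(np.linalg.solve(K_hat, X_2N) @ X_2N.T)    # tr(K_hat^{-1} X_{2:N} X_{2:N}^T)
    log_p_x1 = -0.5 * (Q * np.log(2 * np.pi) + X[0] @ X[0])   # assumed N(0, I) density for x_1
    return (log_p_x1
            - 0.5 * Q * (N - 1) * np.log(2 * np.pi)
            - 0.5 * Q * logdet
            - 0.5 * quad)

X = np.random.default_rng(0).normal(size=(50, 3))
print(gpdm_dynamics_log_prior(X))
```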

As we observe, GPDM utilizes a rather simplistic modeling approach for the dependencies between the latent variables x, which is based on a linear model of dependencies, as in (21), combined with a rather implausible spherical prior over A. This model formulation is clearly restrictive, and limits the level of complexity of the temporal structure that GPDM can capture. In addition, the requirement of MAP estimation of the latent variables x ∈ R^Q results in high computational requirements. This is aggravated even more by the form of (22), which gives rise to computations entailing inverting the Gram matrix K̂(X, X), in addition to computing the inverse of the matrices K_d(X, X). Our proposed model is designed to ameliorate these issues, as we shall discuss next.

III. PROPOSED APPROACH

A. Model Definition

Let us consider a sequence of observations with strong temporal interdependencies Y = [y_n]_{n=1}^N. Similar to GPLVM, we assume a latent lower dimensional manifold that generates the observation space, and a smooth nonlinear mapping from the latent space to the observations space. To infer this nonlinear mapping, we postulate a GP prior, similar to GPLVM. In addition though, we also postulate a model capable of capturing the temporal dynamics in the modeled data. For this purpose, we prefer to work in the lower dimensional manifold of the latent vectors generating our observations. Specifically, we consider a generative nonparametric Bayesian model for these latent coordinates: we assume that they are generated from an HMM comprising infinite (latent) states. To make formulation of this latent manifold infinite-state HMM possible, we impose suitable stick-breaking priors over its state-transition dynamics, similar to [13].

In particular, formulation of our model commences by considering a smooth nonlinear (generative) mapping of the latent vectors x_n to the observed ones y_n, described by means of a GP, with the addition of some additive white noise. Given these assumptions, we yield a likelihood function of the form

p(y_n | x_n) = ∏_{d=1}^{D} N(y_{nd} | f_d(x_n), β^{−1})    (24)

where β is the white noise precision, and f_d(x_n) is modeled by means of a GP prior, such that

p(f_d | X) = N(f_d | 0, K_d(X, X))    (25)

where f_d is the vector of the f_d(x_n) ∀n, i.e., f_d ≜ [f_d(x_n)]_{n=1}^N, and K_d(X, X) is the Gram matrix pertaining to the dth dimension of the observations space, with kernel hyperparameters set ψ_d.

Subsequently, we consider that the latent manifold projections x are generated by an HMM comprising infinite states. Let us introduce the set of variables Z = {z_{nc}}_{n=1,c=1}^{N,∞}, with z_{nc} = 1 if x_n is considered to be generated from the cth HMM model state, z_{nc} = 0 otherwise. Then, our introduced generative model for the latent vectors x_n comprises the assumptions [13]

p(x_n | z_{nc} = 1) = N(x_n | μ_c, R_c^{−1})    (26)
p(z_{nj} = 1 | z_{n−1,i} = 1; v_i) = ϖ_j(v_i),  n > 1    (27)
ϖ_j(v_i) = v_{ij} ∏_{k=1}^{j−1} (1 − v_{ik}) ∈ [0, 1]    (28)
p(z_{1i} = 1; v^π) = ϖ_i^π(v^π)    (29)
ϖ_i^π(v^π) = v_i^π ∏_{j=1}^{i−1} (1 − v_j^π) ∈ [0, 1]    (30)

with

∑_{j=1}^{∞} ϖ_j(v_i) = 1  ∀i    (31)
∑_{j=1}^{∞} ϖ_j^π(v^π) = 1.    (32)

For the stick variables {v_i}_{i=1}^∞, where v_i = [v_{ij}]_{j=1}^∞, and v^π = [v_i^π]_{i=1}^∞, we impose Beta priors [13] of the form

p(v_{ij}) = Beta(1, α_i)  ∀j    (33)
p(v_i^π) = Beta(1, α^π)  ∀i.    (34)

In essence, this parameterization of the imposed Beta priors gives rise to a DP-based prior construction for the transition dynamics of our model [6].

Finally, due to the effect of the innovation parameters on the number of effective latent states, we also impose Gamma priors over them

p(α_i) = G(α_i | η_1, η_2)    (35)
p(α^π) = G(α^π | η_1^π, η_2^π).    (36)

We also impose suitable conjugate priors over the mean and precision parameters of the (Markov chain-)state-conditional likelihoods of the latent vectors. Specifically, we choose normal-Wishart priors, yielding

p(μ_c, R_c) = NW(μ_c, R_c | λ_c, m_c, ω_c, Ψ_c).    (37)

This completes the definition of our proposed LM2GP model. A plate diagram of our model is shown in Fig. 2.
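To make the generative construction of (24)–(37) concrete, the following Python sketch draws a synthetic sequence from a truncated version of the model: latent states follow stick-breaking Markovian dynamics, latent vectors are emitted from state-conditional Gaussians, and observations are obtained by pushing the latent trajectory through functions drawn from GP priors. The truncation level, the squared-exponential kernel shared across dimensions, and all numerical hyperparameter values are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, Q, D, N = 5, 2, 3, 100         # truncation level, latent dim, observed dim, length (assumed)
alpha, beta = 1.0, 100.0          # innovation parameter and observation noise precision (assumed)

def stick_breaking(alpha, C, rng):
    """Truncated stick-breaking weights, cf. (28) and (30)."""
    v = rng.beta(1.0, alpha, size=C)
    v[-1] = 1.0
    return v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

# Initial-state and per-row transition probabilities, cf. (27)-(30)
pi = stick_breaking(alpha, C, rng)
trans = np.stack([stick_breaking(alpha, C, rng) for _ in range(C)])

# State-conditional emission parameters of the latent vectors, cf. (26)
mu = rng.normal(scale=2.0, size=(C, Q))
R_inv = np.stack([0.05 * np.eye(Q) for _ in range(C)])      # R_c^{-1} (assumed values)

# Sample the latent state chain and latent vectors
z = np.empty(N, dtype=int)
X = np.empty((N, Q))
z[0] = rng.choice(C, p=pi)
X[0] = rng.multivariate_normal(mu[z[0]], R_inv[z[0]])
for n in range(1, N):
    z[n] = rng.choice(C, p=trans[z[n - 1]])
    X[n] = rng.multivariate_normal(mu[z[n]], R_inv[z[n]])

# Observation mapping: one GP draw per observed dimension plus white noise, cf. (24)-(25)
def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

K = rbf(X, X) + 1e-6 * np.eye(N)
F = rng.multivariate_normal(np.zeros(N), K, size=D).T       # f_d(x_n) for each dimension d
Y = F + rng.normal(scale=beta ** -0.5, size=(N, D))         # additive white noise with precision beta
print(Y.shape)
```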
B. Inference Algorithm

Inference for nonparametric models can be conducted under a Bayesian setting, typically by means of variational Bayes [9], or Markov chain Monte Carlo (MCMC) techniques [14]. Here, we prefer

is nonnegative, L(q) forms a strict lower bound of the log evidence, and would become exact if q(W) = p(W|, Y ). Hence, by maximizing this lower bound L(q) (variational free energy) so that it becomes as tight as possible, not only do we minimize the KL-divergence between the true and the variational posterior, but we also implicitly integrate out the unknowns W [8]. To facilitate variational for our model, we assume that the posterior factorizes similar to the prior of our model (mean-field approximation) [15], [16]. This way, the variational free energy L(q) yields (ignoring constant terms) ˆˆ D p( f |X) L(q) = dXd f q(X)q( f ) log d d d q( f ) Fig. 2. Plate diagram representation of the LM2GP model. The arrows d=1 d represent conditional dependencies. The plates indicate independent copies N for states i and j. −1 + logp(ynd| fd (xn), β ) n=1 a variational Bayesian approach, due to its considerably better C ˆˆ scalability in terms of computational costs, which becomes of + dμ dRcq(μ , Rc) major importance when having to deal with large data corpora. c c c=1 Our variational Bayesian inference algorithm for the p(μ , R |λ , m ,ω , ) 2 × c c c c c c LM GP model comprises derivation of a family of variational log (μ , ) (.) ˆ q c Rc posterior distributions q , which approximate the true pos- (απ |ηπ ,ηπ ) {μ , }∞ vπ π π p 1 2 terior distribution over the infinite sets Z, c Rc = , , + dα q(α ) log  ∞  ∞ π c 1 (απ ) {v } ,and{α } , as well the parameters α , the latent q i i=1 i i=1 ˆ mapping function values f =[f (x )]N , and the latent C−1 (vπ |απ ) d d n n=1 + vπ (vπ ) p c =[ ]N d q log π mappings X xn n=1. Apparently, Bayesian inference is c c q(v ) not tractable under this setting, since we are dealing with an c=1 c − ˆ infinite number of parameters. C1 p(α |η ,η ) + α q(α ) c 1 2 For this reason, we employ a common strategy in the d c c log (α ) q c literature of Bayesian nonparametrics, formulated on the basis c=1 C−1 ˆ   of a truncation of the stick-breaking process [9]. Specifically, p(v  |α ) + v (v ) cc c d cc q cc log  we fix a value C and we let the variational posterior over q(v  ) v vπ (v = ) = , c=1 cc the ij ,andthe i have the property q iC 1 1 ∀i = 1,...,C,andq(vπ = 1) = 1. In other words, we C C N C + ( = | = ) π (vπ )  (v ) > q znj 1 zn−1,i 1 set c and c i equal to zero for c C.Note 2 i=1 j=1 n=2 that, under this setting, the treated LM GP model involves ˆ  p(z = 1|z − , = 1; v ) full stick-breaking priors; truncation is not imposed on the × v (v ) nj n 1 i i d i q i log model itself, but only on the variational distribution to allow q(znj = 1|zn−1,i = 1) for tractable inference. Hence, the truncation level C is a C ˆ π π π p(z1c = 1|v ) variational parameter that can be freely set, and not part of + q(z1c = 1) dv q(v )log q(z c = 1) the prior model specification. c=1 1  { , , vπ ,απ , { }D , {v ,α , μ , }C } C N ˆˆˆ Let W X Z f d d=1 c c c Rc c=1 be 2 + ( = ) μ the set of all the parameters of the LM GP model over which q znc 1 dxnd cdRc a prior distribution has been imposed, and  be the set of c=1 n=1 ( |μ , ) the hyperparameters of the model priors and kernel functions. p xn c Rc × q(xn)q(μc, Rc)log . 
(40) Variational Bayesian inference introduces an arbitrary distri- q(xn) bution q(W) to approximate the actual posterior p(W|, Y ) ( ) that is computationally intractable, yielding [8] Derivation of the variational posterior distribution q W involves maximization of the variational free energy L(q) logp(Y ) = L(q) + KL(q||p) (38) over each one of the factors of q(W) in turn, holding the where others fixed, in an iterative manner [15]. On each iteration, ˆ apart from variational posterior updating, we also update the p(Y, W|) L(q) = Wq(W) estimates of the model hyperparameters in , by maximization d log ( ) (39) q W of the variational free energy L(q) over each one of them. By and KL(q||p) stands for the KullbackÐLeibler (KL) diver- construction, this iterative consecutive updating of the model gence between the (approximate) variational posterior q(W) is guaranteed to monotonically and maximally increase the and the actual posterior p(W|, Y ). Since KL divergence free energy L(q) [17]. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 75

1) Variational Posteriors: Let us denote as . q(·) the pos- (55) terior expectation of a quantity, that is the quantity’s mean and, we have value considering the entailed random variables follow the variational posteriors q(·). In the following, all these pos- N ω˜ = ω + ( = ) terior means can be computed analytically, except for the c c q znc 1 (56) n=1 expectations with respect to the posterior q(xn) over the N latent manifold representations of our modeled data. We shall λc = q(z =1) ˜ = n 1 nc (m −¯x )(m −¯x )T (57) elaborate on computation of these latter quantities later in this c λ + N ( = ) c c c c c n=1 q znc 1 section. + c + c From (40), we obtain the following variational (approxi- N mate) posteriors over the parameters of our model. ˜ λc = λc + q(znc = 1) (58) (v ) 1) For the q i ,wehave n=1 λ +¯ N ( = ) (v ) = (β˜ , βˆ ) cmc xc n=1 q znc 1 q ij Beta ij ij (41) m˜ c = . (59) λ + N q(z = 1) N c n=1 nc β˜ = + ( = | = ) ij 1 q znj 1 zn−1,i 1 (42) In the above expressions, xn q(xn) is the (poste- n=2 rior) expectation of the latent variable xn given its C N posterior q(xn). βˆ = α + ( = | = ). ij i (α ) q znρ 1 zn−1,i 1 6) Regarding the posteriors over the latent functions f d , q i = j+1 n=2 we have (43) ( ) = N ( |ˆμ , ) q f d f d d d (60) 2) Similar, for the q(vπ ),wehave where π π π q(v ) = Beta(β˜ , βˆ ) (44) −1 i i i = ( , )−1 + β β˜π = + ( = ) d K d X X I (61) i 1 q z1i 1 (45) q(X) μˆ = β C d d IYd (62) π π βˆ = α + ( ρ = ). i q(απ ) q z1 1 (46)  [ ]N ( , )−1 Yd ynd = ,and K d X X ( ) is the posterior =i+1 n 1 q X −1 mean of the inverse Gram matrix K d (X, X) given q(α ) 3) For the i ,wehave the variational posteriors q(xn) ∀n.     q(α ) = G(α |˜ , ˆ ) (47) 7) Similarly, the posteriors over the indicator variables Z i i i i yield where N−1 N   ∗ ∗ ∗ ˜ = η + − q(Z) ∝ πδ δ δ p (xn|μδ , Rδ ) (63) i 1 C 1 (48) 1 n n+1 n n = = C−1 n 1 n 1      ˆ = η − ψ(βˆ ) − ψ(β˜ + βˆ ) . (49) where i 2 ij ij ij j=1 π∗  π (vπ ) c exp log c (vπ ) (64) (απ ) q 4) For the q ,wehave ∗    exp log (v ) (v ) (65) π π π π ij j i q i q(α ) = G(α |˜ε , εˆ ) (50) ∗( |μ , )  ( |μ , ) p xt i Ri exp logp xt i Ri (μ , ), ( ) where q i Ri q xt (66) ε˜π = ηπ + − 1 C 1 (51) C−1 and εˆπ = ηπ − ψ(βˆπ ) − ψ(β˜π + βˆπ ) . 2 i i i (52) δn  arg(znc = 1). i=1 c (μ , ) From (63), and comparing this expression with the 5) The posteriors q c Rc are approximated in the fol- lowing form: corresponding expressions of standard HMMs [18], it is easy to observe that computation of the probabilities (μ , ) = NW μ , |λ˜ , ˜ , ω˜ , ˜ q c Rc c Rc c mc c c (53) q(znj = 1|zn−1,i = 1),andq(znc = 1), which constitute ( ) where we introduce the notation the variational posterior q Z , can be easily performed by means of the forwardÐbackward algorithm for sim- N ( = ) n=1 q znc 1 xn q(xn) ple HMMs trained by means of maximum-likelihood, x¯ c  (54) N ( = ) exactly as described in [19]; specifically, it suffices to n=1 q znc 1 N run forwardÐbackward for a simple HMM model with  ( = ) −¯ −¯ T its optimized values (point estimates) of the Markov c q znc 1 xn q(xn) xc xn q(xn) xc n=1 chain probabilities set equal to the posterior expected 76 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015

π∗  ∗ values c , ij in (64) and (65), and its state-conditional C. Predictive Density ∗( |μ , ) likelihoods set equal to the expectations p xt i Ri . Let us now consider the problem of computing the proba- ( ) ∗ ∗ N∗ 8) Finally, regarding the latent variable posteriors q xn , bility of a given sequence of observations Y =[y ] with L( ) n n=1 optimization over q yields respect to a trained LM2GP model. This quantity is useful, e.g., in applying our model to (sequence-level) classification C applications. For this purpose, we need to derive the predictive logq(x ) = q(z = 1) logp(x |z = 1) (μ , ) density of the model n nc n nc q c Rc ˆˆ = c 1 ( ∗| )= ( ∗| ∗; , ) ( ∗| , ) ( | ) ∗. D p Y Y p Y X Y X p X X Y p X Y dXdX + ( | , ( , )) . logp f 0 K d X X ( ) (67) d q f d (68) d=1 To begin with, the expression of p(Y ∗|X ∗; Y, X) is analogous to the predictive density expression of standard From (67), it becomes apparent that model configuration GP regression, yielding is not conjugate when it comes to the latent variables xn. ∗ As a consequence, the model does not yield a closed-form N D ( ∗| ∗; , ) = N ( ∗ | ∗ , σ ∗ 2) expression for the posteriors q(xn). A repercussion of this fact p Y X Y X ynd and nd (69) ( ) = = is that the entailed posterior expectations with respect to q xn n 1 d 1 −1 cannot be computed in an analytical fashion, wherever they ∗ ∗ T −1 a = kd (x ) K d (X, X) + β Yd (70) appear in the inference algorithm of our model (namely the nd n − ( , )−1 ∗ 2 ∗ − 1 ∗ quantities K d X X ( ) and xn q(x )). To resolve this σ =− ( )T ( , ) + β 1 ( ) q xn n kd xn K d X X kd xn issue, we can either resort to a deterministic approximation nd + kd (x∗, x∗) (71) of the posteriors q(xn), e.g., by the application of Laplace approximation, or perform by means of MCMC. Our where investigations have shown that Laplace approximation does not ( ∗)  [ ( , ∗),..., ( , ∗)]T. perform very well for our model. For this reason, in this paper, kd xn kd x1 xn kd x N xn (72) we opt for the latter alternative. Now, regarding the computation of the entailed expectation Specifically, for this purpose, we draw samples from of p(Y ∗|X ∗; Y, X) with respect to the distributions p(X|Y ) the variational posterior (67) using hybrid Monte Carlo and p(X ∗|X, Y ), we proceed as follows: first, we substi- (HMC) [20]. HMC provides an efficient method to draw tute p(X|Y ) with its approximation (variational posterior) samples from the variational posterior distribution q(xn) by ( ) = N ( ) q X n=1 q xn . This way, the corresponding expectation performing a physical simulation of an energy-conserving can be approximated by making use of the samples of the system to generate proposal moves. In detail, we add kinetic latent variables xn previously drawn through HMC. Finally, to energy variables, p, for each latent dimension, while the obtain the expectations with respect to the density p(X ∗|X, Y ), − ( ) expression of logq xn is used to specify a potential energy we resort to from the inferred latent-manifold over the latent variables. Each HMC step involves sampling infinite-state HMM. This obtains a set of samples of X ∗ from p and carrying out a -based simulation using leap-frog the latent manifold, which can be used to approximate the discretization. The final state of the simulation is accepted or sought (remaining) expectation with respect to p(X ∗|X, Y ). rejected based on the MetropolisÐHastings algorithm. 
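As an illustration of the HMC procedure described above, here is a minimal, generic Python sketch of one HMC transition with leap-frog integration and a Metropolis–Hastings accept/reject step. The potential function, step size, and number of leap-frog steps are placeholders for illustration only, not the settings used in the paper.

```python
import numpy as np

def hmc_step(x, neg_log_q, grad_neg_log_q, rng, step=0.01, n_leapfrog=25):
    """One HMC transition targeting q(x) proportional to exp(-neg_log_q(x))."""
    p = rng.normal(size=x.shape)                    # sample kinetic-energy (momentum) variables
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step * grad_neg_log_q(x_new)     # initial half step for momentum
    for _ in range(n_leapfrog):
        x_new += step * p_new                       # full step for position
        p_new -= step * grad_neg_log_q(x_new)       # full step for momentum
    p_new += 0.5 * step * grad_neg_log_q(x_new)     # undo half of the last momentum update
    # Metropolis-Hastings accept/reject on the total (potential + kinetic) energy
    h_old = neg_log_q(x) + 0.5 * np.dot(p, p)
    h_new = neg_log_q(x_new) + 0.5 * np.dot(p_new, p_new)
    if rng.uniform() < np.exp(h_old - h_new):
        return x_new
    return x

# Toy usage: sample from a standard 2-D Gaussian
rng = np.random.default_rng(0)
x = np.zeros(2)
for _ in range(100):
    x = hmc_step(x, lambda u: 0.5 * u @ u, lambda u: u, rng)
print(x)
```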
HMC sampling of xn requires computation of the gradient of q(xn), which can be analytically performed in a straightforward D. Relations to Existing Approaches manner. As we have already discussed, our model is related to the 2) Hyperparameter Selection: Regarding the values of the GPDM method of [2]. The main difference between this paper hyperparameters of the model priors, we set mc = 0, and GPDM is that the latter uses a simplistic linear model of λc = 1, ωc = 10, and c = 100 I. Regarding the temporal dynamics combined with a spherical GP prior over hyperparameters of the kernel functions kd (·, ·), we estimate the latent variables. In contrast, our approach employs a more them by performing maximization of the model variational flexible generative construction, which considers latent vari- free energy L(q) over each one of them. Computation of able emission from a latent stick-breaking HMM (SB-HMM). the variational free energy L(q) and its gradient requires In addition, one can also observe that our method essentially derivation of the entailed posterior expectations in (40), reduces to a GPLVM model by setting the truncation threshold which can be obtained analytically, except for the expec- C equal to one, i.e., C = 1. tations with respect to the posteriors q(xn). These latter quantities can be obtained by means of MCMC, utilizing IV. EXPERIMENT the samples of xn, previously drawn by HMC sampling. Here, we experimentally evaluate our method in three Given the fact that none of the sought quantities yields scenarios: 1) unsupervised segmentation (frame-level clus- closed-form estimators, to perform L(q) maximization, we tering) of sequential data; 2) supervised segmentation resort to the L-Broyden-Fletcher-Goldfarb-Shanno (BFGS) (frame-level classification) of sequential data; and 3) (whole) algorithm [21]. sequence classification. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 77

In the context of our unsupervised sequence segmentation TABLE I experiments, we estimate the assignment of the data points UNSUPERVISED SEQUENCE SEGMENTATION:WORKFLOW (frames) of each sequence to the discovered latent classes RECOGNITION.ERROR RATES (%) FOR OPTIMAL LATENT (model states) using q(Z), and compare the estimated values MANIFOLD DIMENSIONALITY with the available ground-truth sequence segmentation. To compute the optimal assignment of the discovered class labels to the available ground-truth class labels, we resort to the Munkres assignment algorithm. This set of experiments makes it thus possible for us to evaluate the quality of the latent temporal dynamics discovered by our model. On the other hand, the considered supervised sequence segmentation experiments allow for us to additionally evaluate details). The frame-level tasks to recognize in an unsupervised the quality of the drawn latent manifold representations of fashion in these workflows are the following: the modeled data. Specifically, in this set of experiments, the 1) worker 1 picks up part 1 from rack 1 (upper) and places drawn latent manifold representations xn are subsequently fed it on the welding cell, and the mean duration is 8Ð10 s; to a simple multiclass classifier (specifically, a multiclass linear 2) workers 1 and 2 pick part 2a from rack 2 and place it support vector machine [22]), which is trained to discover on the welding cell; the correct (frame-level) classes by being presented with the 3) workers 1 and 2 pick part 2b from rack 3 and place it generated manifold representations xn. We expect that a good on the welding cell; performance of our model under such an experimental setup 4) worker 2 picks up spare parts 3a and 3b from rack 4 would indicate high-quality representational capabilities for and places them on the welding cell; both the generated latent manifold representations xn and the 5) worker 2 picks up spare part 4 from rack 1 and places temporal dynamics discovered by our model. Furthermore, in it on the welding cell; the context of this set of experiments, we also explore whether 6) workers 1 and 2 pick up part 5 from rack 5 and place our model could be used to obtain a architecture, it on the welding cell. by stacking multiple layers of models, each one fed with Feature extraction is performed as follows: to extract the latent state representations generated from the previous the spatiotemporal variations, we use pixel change layer (and the first layer fed with the actual observations). We history images to capture the motion history [25], and believe that, by stacking multiple layers of models, we might compute the complex Zernike moments A00, A11, A20, be capable of extracting more complex, higher level temporal A22, A31, A33, A40, A42, A44, A51, A53, A55, A60, A62, dynamics encoded in the final-layer latent representations xn, A64, A66, for each of which we compute the norm and the thus allowing for eventually obtaining increased supervised angle. In addition, the center of gravity and the area of the (frame-level) classification performance. found blobs are also used. In total, this feature extraction Finally, in the case of the sequence classification experi- procedure results in 31-D observation vectors. Zernike ments, we train one model for each one of the considered moments are calculated in rectangular regions of interest classes. 
Evaluation is performed by finding for each test of approximately 15K pixels in each image to limit the sequence the model that yields the highest predictive prob- processing and allow real-time feature extraction (performed ability, and assigning it to the corresponding class. at a rate of approximately 50Ð60 frames per second). In our To obtain some comparative results, in our experiments, experiments, we use a total of 40 sequences representing full apart from our method, we also evaluate GPLVM [4] and assembly cycles and containing at least one of the considered GPDM models [2], as well as other state-of-the-art approaches behaviors, with each sequence being approximately 1K relevant to each scenario. In our experiments, we make use of frames long. Frame annotation has been performed manually. the large-scale linear support vector machine (SVM) library In Table I, we illustrate the obtained error rates for optimal of [23], and the HMM-SVM implementation of Thorsten selection of the latent manifold dimensionality of our method, Joachims. In all our experiments, we use RBF kernels for all as well as the GPDM and GPLVM methods. Regarding the the evaluated GP-based models. Finally, HMC is run for 100 GPDM and GPLVM methods, clustering was performed by iterations, with 25 leap-frog√ steps at each iteration, and with presenting the generated latent subspace representations to a step-length equal to 0.001 N,whereN is the number of an SB-HMM [13]. All these results are means and variances modeled data points. over 10 repetitions of the experiment. As a baseline, we also evaluate SB-HMMs presented with the original (observed) data. As we observe, our approach yields a clear advantage A. Unsupervised Sequence Segmentation over the competition; we also observe that all the other latent 1) Workflow Recognition: We first consider an unsupervised manifold models yield inferior performance compared with sequence segmentation (frame-level clustering) application. the baseline SB-HMM. This result is a clear indication of the For this purpose, we use a public benchmark dataset involving much improved capability of our approach to capture latent action recognition of humans, namely the workflow recogni- temporal dynamics in the modeled data. tion (WR) database [24]. Specifically, we use the first two In addition, in Fig. 3, we show how model performance workflows pertaining to car assembly (see [24] for more changes as a function of the postulated latent manifold dimen- 78 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015

Fig. 3. Unsupervised sequence segmentation: workflow recognition. Error Fig. 5. Unsupervised sequence segmentation: Honeybee. Error rate fluctua- rate fluctuation with latent manifold dimensionality. tion with latent manifold dimensionality.

TABLE II UNSUPERVISED SEQUENCE SEGMENTATION:HONEYBEE.ERROR RATES (%) FOR OPTIMAL LATENT MANIFOLD DIMENSIONALITY

dance, the bee moves roughly in a straight line while rapidly shaking its body from left to right; the duration and orientation of this phase correspond to the distance and the orientation to Fig. 4. Workflow Recognition. Average execution time (per sequence) of the food source. At the endpoint of a waggle dance, the bee inference algorithm. turns in a clockwise or counter-clockwise direction to form a turning dance. Our dataset consists of six video sequences with lengths sionality, for the LM2GP, GPDM, and GPLVM methods. 1058, 1125, 1054, 757, 609, and 814 frames, respectively. The We observe that the model performance increases with man- bees were visually tracked, and their locations and head angles ifold dimensionality in a consistent fashion. We also observe were recorded. This resulted in obtaining 4-D frame-level fea- that our model benefits the most by an increase in latent ture vectors comprising the 2-D location of the bee and the sine manifold dimensionality. Furthermore, we run the Student’s-t and cosine of its head angle. Once the sequence observations test on all pairs of performances across the evaluated methods, were obtained, the trajectories were preprocessed as in [27]. to assess the statistical significance of the obtained differences; Specifically, the trajectory sequences were rotated so that the we obtain that all performance differences are deemed statis- waggle dances had head angle measurements centered about tically significant. zero radian. The sequences were then translated to center at Finally, in Fig. 4, we show an indicative empirical com- (0, 0), and the 2-D coordinates were scaled to the [−1, 1] parison of computational times, for processing one of the range. Aligning the waggle dances was possible by looking at sequences in our dataset (means and error bars over all the the high-frequency portions of the head angle measurements. used sequences). As we observe, GPDM requires the longest Following the suggestion of [26], the data were smoothed time between the compared methods; our method is faster than using a Gaussian FIR pulse-shaping filter with 0.5-dB GPDM, while imposing a reasonable increase in computational bandwidth-symbol time. costs compared with GPLVM. This difference from GPDM In our evaluations, we adopt the experimental setup of was expected, since our method involves one less Gram matrix the previous experiment. In Fig. 5, we show how model compared with GPDM. performance changes as a function of the postulated latent 2) Honeybee Dataset: Furthermore, we perform a second manifold dimensionality, for the LM2GP, GPDM, and GPLVM unsupervised sequence segmentation experiment using the methods. In Table II, we provide the error rates of all the Honeybee dataset [26]; it contains video sequences of hon- evaluated models for optimal latent manifold dimensionality eybees, which communicate the location and distance to a (wherever applicable). These results are means and variances food source through a dance that takes place within the hive. over 10 repetitions of the experiment. We again observe that The dance can be decomposed into three different movement both GPDM and GPLVM are outperformed by the baseline patterns that must be recognized by the evaluated algorithms: SB-HMM. Our method works better than all the alternatives. 1) waggle; 2) right turn; and 3) left turn. During the waggle Finally, we run the Student’s-t test on all pairs of perfor- CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 79

TABLE III SUPERVISED SEQUENCE SEGMENTATION:WORKFLOW RECOGNITION.ERROR RATES (%) FOR OPTIMAL LATENT MANIFOLD DIMENSIONALITY

Fig. 7. Supervised sequence segmentation: workflow recognition. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures. inferior performance compared with the evaluated latent vari- able models. This result seems to indicate that extracting latent manifold representations of the modeled data is much more effective for classification purposes than directly using the observed data and assuming temporal dynamics in the observations space. In addition, in Fig. 6, we show how model performance changes with the number of latent features for 2 Fig. 6. Supervised sequence segmentation: workflow recognition. Error rate the LM GP, GPDM, and GPLVM methods. We observe that fluctuation with latent manifold dimensionality. all methods yield their optimal performance with 10 latent features. mances across the evaluated methods, to assess the statistical Furthermore, we turn to our deep learning scenario, examin- significance of the obtained differences; we obtain that all ing how model performance changes if we add a second layer performance differences are deemed statistically significant, of models (with the same latent dimensionality), fed with the except for the differences between the optimal performance latent manifold representations generated by the original mod- of GPLVM and the optimal performance of GPDM (obtained els. We present the output (latent manifold representations) for three and four latent features, respectively). of the 2-layer models as input to a multiclass linear SVM classifier, and we evaluate the classification performance of the so-obtained classifier in the considered tasks. B. Supervised Sequence Segmentation In Fig. 7, we illustrate the classification error rates of the 1) Workflow Recognition: Here, we use the same dataset as resulting models as a function of latent manifold dimension- in Section IV-A1, but with the goal to perform supervised ality. As we observe, two-layer deep learning architectures sequence segmentation using the obtained latent manifold yield a significant improvement over the corresponding single- representations of the modeled data. For this purpose, we first layer architectures when the latent manifold dimensionality is use our approach to obtain the latent manifold representations low; however, performance for high latent manifold dimen- of the considered datasets. Subsequently, we use half of sionality turns out to be similar in the cases of GPDM and the resulting representations (corresponding to all the mod- our model, and substantially worse in the case of GPLVM. eled classes) for initial training of a (frame-level) multiclass To our perception, this result indicates that our model encodes linear SVM classifier [22], and keep the rest for testing richer and more consistent temporal dynamics information of the obtained classifier. Apart from our method, we also in the obtained latent manifold representations, which can evaluate GPDM and GPLVM under the same experimental be combined in a hierarchical manner to extract even more setup. In addition, as a baseline, we evaluate the HMM-SVM complex temporal dynamics. approach of [28] (presented with the original data, not the Finally, the Student’s-t test on all pairs of performances latent encodings). across the evaluated methods obtains here some very interest- In Table III, we provide the obtained classification error ing results. rates for the considered models. 
These results correspond 1) Our method yields statistically significant differences to optimal latent manifold dimensionality selection for the from GPDM and GPLVM for latent manifold dimen- LM2GP, GPDM, and GPLVM methods, and constitute means sionality between 10 and 20. However, for more or and variances over 10 repetitions of the experiment. As we less latent dimensions, the obtained differences are not observe, our method outperforms the competition. We also statistically significant. observe that GPDM and GPLVM yield essentially identi- 2) GPDM and GPLVM are comparable, with no statisti- cal results; this fact indicates that GPDM does not capture cally significant differences, whatsoever. richer temporal dynamics in our data compared with GPLVM. 3) Models with two layers have statistically significant per- Another significant finding from these results is that the formance differences with respect to similar one-layer state-of-the-art HMM-SVM approach yields a considerably models in cases of low latent manifold dimensionality. 80 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015

TABLE IV observations space. In addition, in Fig. 8, we show how SUPERVISED SEQUENCE SEGMENTATION:HONEYBEE.ERROR RATES (%) model performance changes with the number of latent features FOR OPTIMAL LATENT MANIFOLD DIMENSIONALITY for the LM2GP, GPDM, and GPLVM methods. We observe a consistent performance improvement with latent manifold dimensionality. This is not unexpected, since we are dealing with models trained on low-dimensional data using some thousands of training points. Furthermore, we turn to our deep learning scenario, exam- ining how model performance changes if we add a second layer of models (with the same latent dimensionality), fed with the latent manifold representations generated by the original models. In Fig. 9, we show the classification error rates of the resulting models as a function of latent manifold dimensionality. As we observe, two-layer deep learning archi- tectures yield improved performance for low latent manifold dimensionality (2-D latent space); however, any performance gains quickly disappear as latent dimensionality increases, with GPLVM performance eventually becoming inferior to the shallow modeling scenario, similar to the previous experiment. Fig. 8. Supervised sequence segmentation: Honeybee. Error rate fluctuation Finally, the Student’s-t test in this scenario finds that all with latent manifold dimensionality. pairs of differences are statistically significant, except for the case of one-layer versus two-layer model comparison, where the assumption of statistical significance is rejected for models with four latent dimensions.

C. Sequence-Level Classification 1) Learning Music-to-Dance Mappings: Finally, we evalu- ate our method in sequence-level classification (classification of whole sequences). Initially, we consider the problem of learning music-to-dance mappings. In this experiment, the observed sequences presented to our model constitute the chroma features extracted from a collection of music clips. Chroma analysis [29] is an interesting and powerful represen- Fig. 9. Supervised sequence segmentation: Honeybee. Error rate fluctuation with latent manifold dimensionality for two-layer deep learning architectures. tation for music audio in which the entire spectrum is projected onto 12 bins representing the 12 distinct semitones (or chroma) However, these differences quickly become insignificant of the musical octave. Since, in music, notes exactly one as the number of latent dimensions increases. octave apart are perceived as particularly similar, knowing the distribution of chroma even without the absolute frequency 2) Honeybee Dataset: Furthermore, we perform additional (i.e., the original octave) can give useful musical information frame-level classification experiments using the Honeybee about the audio, and may even reveal perceived musical dataset described in Section IV-A2. In our evaluations, we similarity that is not apparent in the original spectra [30]. adopt the experimental setup of the previous experiment. First, In our experiments, we use a dataset of 600 music we consider a shallow modeling scenario. In Table IV, we clips; they are split into sets of 100 clips pertaining to provide the obtained classification error rates for the con- each one of the dance classes: Waltz, Tango, Foxtrot, Cha sidered models. These results correspond to optimal latent Cha, Quickstep, and Samba.1 We preprocess these clips manifold dimensionality selection for the LM2GP, GPDM, as described above to obtain chroma features; this way, we and GPLVM methods, and are means and variances over eventually obtain sequences of 12-D observations, each one 10 repetitions of the experiment. 35KÐ184K frames long. The clips are further segmented into As we observe, our method outperforms the competition. approximately 8K frames long subsequences to form our We also observe that GPDM works better than GPLVM. dataset. We use half of our data for model training, and the rest Another significant finding from these results is that the for testing. We train one model for each one of the six classes state-of-the-art HMM-SVM approach yields again inferior we want to recognize. Classification of our test sequences performance compared with the evaluated latent variable is conducted by computing the sequence predictive densities models. This result corroborates our intuition that extracting with respect to the model of each class, and assigning each latent manifold representations of the modeled data is much more effective for classification purposes than directly using 1Music clips were downloaded from:http://www.ballroomdancers.com/Music/ the observed data and assuming temporal dynamics in the Default.asp?Tab=2; we converted them to WAV format for our experiments. CHATZIS AND KOSMOPOULOS: LATENT MANIFOLD MARKOVIAN DYNAMICS GP 81

TABLE V TABLE VI SEQUENCE-LEVEL CLASSIFICATION:LEARNING MUSIC-TO-DANCE SEQUENCE-LEVEL CLASSIFICATION:BIMANUAL GESTURE MAPPINGS.ERROR RATES (%) FOR OPTIMAL LATENT RECOGNITION.ERROR RATES (%) FOR OPTIMAL LATENT MANIFOLD DIMENSIONALITY MANIFOLD DIMENSIONALITY

Fig. 10. Sequence-level classification: learning music-to-dance mappings. Fig. 11. Sequence-level classification: bimanual gesture recognition. Error Error rate fluctuation with latent manifold dimensionality. rate fluctuation with latent manifold dimensionality.

sequence to the class the model of which yields the highest over 10 experiment repetitions) for optimal latent manifold predictive density. Apart from our method, we also evaluate dimensionality (wherever applicable) are depicted in Table VI. SB-HMMs [13], GPLVMs [4], and GPDMs [2] under the same As we observe, our approach works much better than the com- experimental setup. petition. Furthermore, in Fig. 11, we show how average model The obtained error rates for optimal latent manifold dimen- performance changes with latent manifold dimensionality. We sionality (wherever applicable) are depicted in Table V. These observe that all models yield a performance increase for results are means and variances over 10 repetitions of the moderate latent space dimensionality. Finally, the Student’s-t experiment. As we observe, our approach works much better test finds that all performance differences are statistically than the competition. We also observe that, in this experiment, significant. GPDM does actually work better than GPLVM. In Fig. 10, we show how average model performance changes with latent V. C ONCLUSION manifold dimensionality. It becomes apparent from this graph that the modeled data contain a great deal of redundancy, since In this paper, we proposed a method for sequential data all models yield a performance deterioration for high latent modeling that leverages the strengths of GPs, allowing for space dimensionality. Finally, the Student’s-t test finds that all more flexibly capturing temporal dynamics compared with performance differences are statistically significant. the existing nonparametric Bayesian approaches. Our method 2) Bimanual Gesture Recognition: Furthermore, we per- considers a latent manifold representation of the modeled data, form evaluations considering the problem of bimanual ges- and chooses to postulate a model of temporal dependencies ture recognition. For this purpose, we experiment with the on this latent manifold. Temporal dependencies in our model American Sign Language gestures for the words: against, aim, are captured through consideration of infinite-state Markovian balloon, bandit, cake, chair, computer, concentrate, cross, deaf, dynamics, and imposition of stick-breaking priors over the explore, hunt, knife, relay, reverse, and role. The used dataset entailed Markov chain probabilities. Inference for our model was obtained from four different persons executing each one of was performed by means of an efficient variational Bayesian these gestures and comprises 40 videos per gesture; 30 of these algorithm. videos are used for training and the rest for model evaluation. As we showed through experimental evaluations, our From this dataset, we extracted several features representing approach is suitable for unsupervised sequence segmentation the relative position of the hands and the face in the images, (frame-level clustering), supervised sequence segmentation as well as the shape of the respective skin regions, by means (frame-level classification), and whole sequence classification of the complex Zernike moments [31], as described in [18]. (sequence-level operation). We evaluated our approach in all This way, each used video comprises 1KÐ4K frames of 12-D these scenarios using real-world datasets, and observed that feature vectors used in our experiments. our method yields very competitive results, outperforming For each one of these 16 gestures, we fitted one model to popular, recently proposed related approaches, e.g., GPDM recognize it. 
The obtained error rates (means and variances and GPLVM. 82 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 26, NO. 1, JANUARY 2015

Finally, we examined whether our method can be also Similarly, the derivatives of q(xn) with respect to the latent employed to obtain a deep learning architecture, by stack- vectors xn yield ing multiple (layers of) LM2GP models, each one fed with ∂logq(xn) the latent manifold representations generated from the previ- = q(z = 1) ω˜ ˜ m˜ − xT ω˜ ˜ ∂ x nc c c c n c c ous layer (and the first one fed with the observable data); n c ∂ ( , )−1 specifically, we experimented with two-layer architectures. − 1 μˆ μˆ T + K d X X d d d As we observed, our method seems to yield much more 2 ∂ xn significant gains in such a scenario than GPDM- or GPLVM- d ∂ ( , )−1 based models, especially for low-dimensional manifold 1 K d X X + K d (X, X) . (74) 2 ∂ x assumptions. d n Our future research endeavors in this line of research mainly focus on addressing two open questions: the first one concerns REFERENCES the possibility of sharing the precision matrices between close [1] C. M. Bishop, Neural Networks for . Phoenix, AZ, hidden states. A question that must be answered is what USA: Clarendon, 1995. proximity criterion we could use for this purpose in the context [2] J. M. Wang, D. J. Fleet, and A. Hertzmann, “Gaussian process dynamical of our model. Would, e.g., comparing the values of the latent models for human motion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283Ð298, Feb. 2008. vectors xn generated from different states provide a valid [3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine proximity criterion? In such a case, what type of proximity Learning. Cambridge, MA, USA: MIT Press, 2006. measure would be more effective? [4] N. Lawrence, “Probabilistic non-linear principal component analysis with Gaussian process latent variable models,” J. Mach. Learn. Res., The second open question we want to investigate is the vol. 6, pp. 1783Ð1816, Nov. 2005. possibility of obtaining stacked denoising [32] [5] M. F. Moller, “A scaled conjugate gradient algorithm for fast supervised for sequential data modeling using our LM2GP model as learning,” Neural Netw., vol. 6, no. 4, pp. 525Ð533, 1993. [6] J. Sethuraman, “A constructive definition of the Dirichlet prior,” Statist. the main building block. Stacked denoising autoencoders are Sinica, vol. 2, no. 2, pp. 639Ð650, 1994. deep learning architectures that apart from extracting the most [7] T. Ferguson, “A Bayesian analysis of some nonparametric problems,” informative latent subspace representations of observed data Ann. Statist., vol. 1, no. 2, pp. 209Ð230, 1973. [8] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, “An introduc- are also capable of denoising the observed data, which are tion to variational methods for graphical models,” in Learning in considered contaminated with noise at each phase of the model Graphical Models, M. Jordan, Ed. Boston, MA, USA: Kluwer, 1998, training algorithm. GPLVM models have already been used pp. 105Ð162. as building blocks for obtaining stacked denoising autoen- [9] D. M. Blei and M. I. Jordan, “Variational inference for Dirichlet process mixtures,” Bayesian Anal., vol. 1, no. 1, pp. 121Ð144, 2006. coder architectures with great success [33]. However, existing [10] D. Blackwell and J. MacQueen, “Ferguson distributions via Pólya urn formulations are not designed for capturing and exploiting schemes,” Ann. Statist., vol. 1, no. 2, pp. 353Ð355, 1973. temporal dependencies in the modeled data. In this paper, we [11] M. Tipping and C. 
We shall publish source codes pertaining to our method at: http://www.cut.ac.cy/eecei/staff/sotirios.chatzis/?languageId=2.

APPENDIX

In this Appendix, we provide (for the sake of completeness) the expressions of the derivatives of the variational free energy L(q) with respect to the kernel hyperparameters, required for hyperparameter optimization by means of L-BFGS, as well as the expressions of the derivatives of q(x_n) with respect to the latent vectors x_n, required for HMC sampling from q(x_n).

Regarding the derivatives of L(q) with respect to the kernel hyperparameters, say ϕ_d, d = 1, ..., D, we have

\frac{\partial \mathcal{L}(q)}{\partial \varphi_d} = -\frac{1}{2}\left\langle \left(\hat{\mu}_d \hat{\mu}_d^{T} + K_d(X, X)\right) \frac{\partial K_d(X, X)^{-1}}{\partial \varphi_d} \right\rangle_{q(X)} + \frac{1}{2}\left\langle K_d(X, X)\, \frac{\partial K_d(X, X)^{-1}}{\partial \varphi_d} \right\rangle_{q(X)}.   (73)

Similarly, the derivatives of log q(x_n) with respect to the latent vectors x_n yield

\frac{\partial \log q(x_n)}{\partial x_n} = \sum_{c} q(z_{nc} = 1) \left(\tilde{\omega}_c \tilde{R}_c \tilde{m}_c - x_n^{T} \tilde{\omega}_c \tilde{R}_c\right) - \frac{1}{2}\sum_{d} \left(\hat{\mu}_d \hat{\mu}_d^{T} + K_d(X, X)\right) \frac{\partial K_d(X, X)^{-1}}{\partial x_n} + \frac{1}{2}\sum_{d} K_d(X, X)\, \frac{\partial K_d(X, X)^{-1}}{\partial x_n}.   (74)
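Equation (73) is the quantity handed to a gradient-based optimizer during the hyperparameter updates. As a rough illustration of that plumbing only (not the authors' implementation), the sketch below passes an analytic objective/gradient pair to SciPy's L-BFGS-B routine; the function neg_free_energy_and_grad and the dictionary stats are hypothetical placeholders standing in for -L(q) and the posterior statistics needed to evaluate (73).

```python
# Sketch of the hyperparameter update: analytic gradients of the (negative) variational
# free energy, as in (73), are handed to an off-the-shelf L-BFGS routine. The function
# neg_free_energy_and_grad is a hypothetical placeholder; in the actual model it would
# evaluate -L(q) and its derivatives with respect to the kernel hyperparameters.

import numpy as np
from scipy.optimize import minimize


def neg_free_energy_and_grad(log_phi, stats):
    """Placeholder objective working on log-hyperparameters (keeps phi_d > 0).
    Returns (-L(q), gradient w.r.t. log_phi); a simple quadratic stands in for (73)."""
    phi = np.exp(log_phi)
    target = stats["target"]            # stand-in for the required posterior statistics
    neg_L = 0.5 * np.sum((phi - target) ** 2)
    grad_phi = phi - target             # stand-in for the analytic derivative in (73)
    return neg_L, grad_phi * phi        # chain rule through the log reparameterization


stats = {"target": np.array([1.0, 0.5, 2.0])}   # e.g., one hyperparameter per dimension d
res = minimize(neg_free_energy_and_grad, x0=np.zeros(3), args=(stats,),
               jac=True, method="L-BFGS-B")
print(np.exp(res.x))                    # optimized hyperparameters phi_d
```

The log-reparameterization is only a convenience to keep the hyperparameters positive during unconstrained optimization; any equivalent positivity constraint would do.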

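Likewise, (74) supplies the gradient consumed by Hamiltonian Monte Carlo updates of the latent vectors x_n. The following is a generic leapfrog-based HMC step that works with any log-density and its gradient; log_q and grad_log_q are standard-normal placeholders used purely for illustration, not the model's actual variational posterior.

```python
# A generic Hamiltonian Monte Carlo step of the kind that consumes log q(x_n) and its
# gradient, cf. (74). The Gaussian log_q / grad_log_q below are placeholders, not the
# model's actual variational posterior over the latent vectors.

import numpy as np


def log_q(x):
    return -0.5 * np.dot(x, x)              # placeholder for log q(x_n)


def grad_log_q(x):
    return -x                                # placeholder for the gradient in (74)


def hmc_step(x, rng, step_size=0.1, n_leapfrog=20):
    p = rng.standard_normal(x.shape)                    # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_log_q(x_new)        # initial half step for momentum
    for _ in range(n_leapfrog):
        x_new += step_size * p_new                      # full step for position
        p_new += step_size * grad_log_q(x_new)          # full step for momentum
    p_new -= 0.5 * step_size * grad_log_q(x_new)        # roll back to a final half step
    # Metropolis correction keeps the chain targeting q(x_n)
    log_accept = (log_q(x_new) - 0.5 * np.dot(p_new, p_new)) \
               - (log_q(x) - 0.5 * np.dot(p, p))
    return x_new if np.log(rng.uniform()) < log_accept else x


rng = np.random.default_rng(0)
x = np.zeros(3)
samples = []
for _ in range(500):
    x = hmc_step(x, rng)
    samples.append(x)
print(np.mean(samples, axis=0))             # should be near zero for the placeholder
```

In practice, the step size and number of leapfrog steps have to be tuned so that the acceptance rate stays reasonable; the Metropolis correction makes the sampler exact regardless of discretization error.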
REFERENCES

[1] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Clarendon, 1995.
[2] J. M. Wang, D. J. Fleet, and A. Hertzmann, "Gaussian process dynamical models for human motion," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 283–298, Feb. 2008.
[3] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, MA, USA: MIT Press, 2006.
[4] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," J. Mach. Learn. Res., vol. 6, pp. 1783–1816, Nov. 2005.
[5] M. F. Moller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Netw., vol. 6, no. 4, pp. 525–533, 1993.
[6] J. Sethuraman, "A constructive definition of the Dirichlet prior," Statist. Sinica, vol. 4, pp. 639–650, 1994.
[7] T. Ferguson, "A Bayesian analysis of some nonparametric problems," Ann. Statist., vol. 1, no. 2, pp. 209–230, 1973.
[8] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," in Learning in Graphical Models, M. Jordan, Ed. Boston, MA, USA: Kluwer, 1998, pp. 105–162.
[9] D. M. Blei and M. I. Jordan, "Variational inference for Dirichlet process mixtures," Bayesian Anal., vol. 1, no. 1, pp. 121–144, 2006.
[10] D. Blackwell and J. MacQueen, "Ferguson distributions via Pólya urn schemes," Ann. Statist., vol. 1, no. 2, pp. 353–355, 1973.
[11] M. Tipping and C. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Comput., vol. 11, no. 2, pp. 443–482, 1999.
[12] J. Zhao and Q. Jiang, "Probabilistic PCA for t distributions," Neurocomputing, vol. 69, nos. 16–18, pp. 2217–2226, Oct. 2006.
[13] J. Paisley and L. Carin, "Hidden Markov models with stick-breaking priors," IEEE Trans. Signal Process., vol. 57, no. 10, pp. 3905–3917, Oct. 2009.
[14] Y. Qi, J. W. Paisley, and L. Carin, "Music analysis using hidden Markov mixture models," IEEE Trans. Signal Process., vol. 55, no. 11, pp. 5209–5224, Nov. 2007.
[15] D. Chandler, Introduction to Modern Statistical Mechanics. New York, NY, USA: Oxford University Press, 1987.
[16] D. Blei and M. Jordan, "Variational methods for the Dirichlet process," in Proc. 21st Int. Conf. Mach. Learn., New York, NY, USA, Jul. 2004, pp. 12–19.
[17] P. Muller and F. Quintana, "Nonparametric Bayesian data analysis," Statist. Sci., vol. 19, no. 1, pp. 95–110, 2004.
[18] S. P. Chatzis, D. I. Kosmopoulos, and T. A. Varvarigou, "Robust sequential data modeling using an outlier tolerant hidden Markov model," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 9, pp. 1657–1669, Sep. 2009.
[19] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[20] R. M. Neal, "Probabilistic inference using Markov chain Monte Carlo methods," Dept. Comput. Sci., Univ. Toronto, Toronto, ON, Canada, Tech. Rep. CRG-TR-93-1, 1993.
[21] D. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program. B, vol. 45, no. 3, pp. 503–528, 1989.
[22] V. N. Vapnik, Statistical Learning Theory. New York, NY, USA: Wiley, 1998.
[23] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, Jan. 2008.
[24] A. Voulodimos et al., "A threefold dataset for activity and workflow recognition in complex industrial environments," IEEE MultiMedia, vol. 19, no. 3, pp. 42–52, Jul./Sep. 2012.
[25] D. Kosmopoulos and S. Chatzis, "Robust visual behavior recognition," IEEE Signal Process. Mag., vol. 27, no. 5, pp. 34–45, Sep. 2010.
[26] S. M. Oh, J. M. Rehg, T. Balch, and F. Dellaert, "Learning and inferring motion patterns using parametric segmental switching linear dynamic systems," Int. J. Comput. Vis., vol. 77, nos. 1–3, pp. 103–124, 2008.
[27] E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky, "Nonparametric Bayesian learning of switching linear dynamical systems," in Proc. Neural Inf. Process. Syst., 2009, pp. 1–3.
[28] Y. Altun, I. Tsochantaridis, and T. Hofmann, "Hidden Markov support vector machines," in Proc. ICML, 2004, pp. 1–9.
[29] M. A. Bartsch and G. H. Wakefield, "To catch a chorus: Using chroma-based representations for audio thumbnailing," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., Jun. 2001, pp. 15–18.
[30] H. Jensen, M. G. Christensen, D. Ellis, and S. H. Jensen, "A tempo-insensitive distance measure for cover song identification based on chroma features," in Proc. ICASSP, 2008, pp. 2209–2212.
[31] R. Mukundan and K. R. Ramakrishnan, Moment Functions in Image Analysis: Theory and Applications. Singapore: World Sci., 1998.
[32] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Jan. 2010.
[33] J. Snoek, R. P. Adams, and H. Larochelle, "Nonparametric guidance of autoencoder representations using label information," J. Mach. Learn. Res., vol. 13, pp. 2567–2588, Sep. 2012.

Dimitrios Kosmopoulos received the B.Eng. degree in electrical and computer engineering and the Ph.D. degree from the National Technical University of Athens, Athens, Greece, in 1997 and 2002, respectively.
He was a Research Assistant Professor with the Department of Computer Science, Rutgers University, New Brunswick, NJ, USA, and an Adjunct Assistant Professor with the Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, USA. His former collaborations include the Institute of Informatics and Telecommunications, National Center for Scientific Research Demokritos, Athens, and the National Technical University of Athens. In the same period, he was with the University of Central Greece, Lamia, Greece, the University of Peloponnese, Sparti, Greece, and the Technical Educational Institute of Athens, Egaleo, Greece. He is currently an Assistant Professor with the Department of Applied Informatics and Multimedia, Technical Educational Institute of Crete, Heraklion, Greece. He has published more than 70 papers in the field of computer/robotic vision and related areas.

Sotirios P. Chatzis received the M.Eng. (Hons.) degree in electrical and computer engineering and the Ph.D. degree in machine learning from the National Technical University of Athens, Athens, Greece, in 2005 and 2008, respectively.
He was a Post-Doctoral Fellow with the University of Miami, Coral Gables, FL, USA, from 2009 to 2010, and a Post-Doctoral Researcher with the Department of Electrical and Electronic Engineering, Imperial College London, London, U.K., from 2010 to 2012. He is currently an Assistant Professor with the Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, Limassol, Cyprus. He has authored more than 40 papers in the most prestigious journals and conferences of the research field in his first seven years as a researcher. His current research interests include machine learning theory and methodologies, with a special focus on hierarchical Bayesian models, Bayesian nonparametrics, quantum statistics, and neuroscience. His Ph.D. research was supported by the Bodossaki Foundation, Greece, and the Greek Ministry for Economic Development.
Dr. Chatzis was a recipient of the Dean's scholarship for Ph.D. studies, being the best performing Ph.D. student of his class.