A Latent Manifold Markovian Dynamics Gaussian Process

Sotirios P. Chatzis and Dimitrios Kosmopoulos

Manuscript received July 21, 2013; revised February 24, 2014; accepted March 8, 2014. Date of publication March 21, 2014; date of current version December 16, 2014.
S. P. Chatzis is with the Department of Electrical and Computer Engineering, Cyprus University of Technology, Limassol 3036, Cyprus (e-mail: [email protected]).
D. Kosmopoulos is with the Technological Educational Institute of Crete, Heraklion 71004, Greece (e-mail: [email protected]).
Digital Object Identifier 10.1109/TNNLS.2014.2311073
Abstract— In this paper, we propose a Gaussian process (GP) model for the analysis of nonlinear time series. The formulation of our model is based on the consideration that the observed data are functions of latent variables, with the associated mapping between observations and latent representations modeled through GP priors. In addition, to capture the temporal dynamics in the modeled data, we assume that subsequent latent representations depend on each other on the basis of a hidden Markov prior imposed over them. Derivation of our model is performed by marginalizing out the model parameters in closed form, using GP priors for the observation mappings, and appropriate stick-breaking priors for the latent variable (Markovian) dynamics. This way, we eventually obtain a nonparametric Bayesian model for dynamical systems that accounts for uncertainty in the modeled data. We provide efficient inference algorithms for our model on the basis of a truncated variational Bayesian approximation. We demonstrate the efficacy of our approach in a number of applications dealing with real-world data, and compare it with related state-of-the-art approaches.

Index Terms— Gaussian process (GP), latent manifold, Markovian dynamics, stick-breaking process, variational Bayes.

I. INTRODUCTION

There is a wide variety of generative models used to perform analysis of nonlinear time series [1]. Approaches based on hidden Markov models (HMMs) and linear dynamical systems (LDS) are quite ubiquitous in the current literature, due to their simplicity, efficiency, and generally satisfactory performance in many applications. More expressive models, such as switching LDS and nonlinear dynamical systems (NLDS), have also been proposed; however, these approaches are faced with difficulties in terms of their learning and inference algorithms, due to the entailed large number of parameters that must be estimated, and the hence needed large amounts of training data [1].

Recently, a nonparametric Bayesian approach designed to resolve these issues of NLDS, namely the Gaussian process dynamical model (GPDM), was introduced in [2]. This approach is fully defined by a set of low-dimensional representations of the observed data, with both the observation and dynamical processes learned by means of Gaussian process (GP) regression [3]. This GP-based formulation endows the model with a nonparametric Bayesian nature, which removes the need to select a large number of parameters associated with function approximators, while retaining the power of nonlinear dynamics and observation. GPDM is essentially an extension of the GP latent variable model (GPLVM) of [4], which models the joint distribution of the observed data and their representation in a low-dimensional latent space through a GP prior. GPDM extends GPLVM by augmenting it with a model of temporal dynamics, captured through the imposition of a dedicated GP prior. This way, GPDM allows not only for obtaining predictions about future data, but also for regularizing the latent space, allowing for more effective modeling of temporal dynamics.

Despite the merits of GPDM, a significant drawback of this model consists in the need of its inference algorithm to obtain maximum a posteriori (MAP) estimates of its parameters through type-II maximum likelihood [2] (performed by means of scaled conjugate gradient descent [5]). This formulation poses a significant bottleneck for GPDM, due to both the entailed high computational costs and the possibility of obtaining bad estimates, with the algorithm getting stuck at poor local maxima in cases of limited training data. In addition, to increase computational efficiency, GPDM imposes an oversimplistic spherical Gaussian prior over its model of temporal dynamics, which probably undermines its data modeling capacity. Finally, the use of GP priors to describe the temporal dynamics between the latent variables of the model leads to significant computational overheads, as it gives rise to calculations that entail inverting very large Gram matrices [3].

To resolve these issues, in this paper, we propose a flexible generative model for modeling sequential data by means of nonparametric component densities. The formulation of our proposed model is based on the assumption that, when modeling sequential data, each observation in a given sequence is related to a vector in a latent space, and is generated through a latent nonlinear function that maps the latent space to the space of observations. We use a GP prior to infer this unknown mapping function from the data in a flexible manner.

In addition, the latent vectors that generate the observed sequential data are assumed to possess strong temporal interdependencies; to capture these dependencies, we assume that these latent variables are generated from an HMM in the manifold of latent variables. Specifically, we assume a latent space HMM with infinite hidden states, and use flexible stick-breaking priors [6], [7] to infer its hidden state dynamics; this formulation allows us to automatically determine the required number of states in this latent manifold HMM in a data-driven fashion. We dub our approach the latent manifold hidden Markov GP (LM2GP) model. Inference for our model is performed by means of an efficient truncated variational Bayesian algorithm [8], [9]. We evaluate the efficacy of our approach in several applications, dealing with sequential data clustering, classification, and generation. A high-level illustration of the conceptual configuration of our model is shown in Fig. 1.

Fig. 1. Intuitive illustration of the generative construction of our model.

The remainder of this paper is organized as follows. In Section II, we provide a brief presentation of the theoretical background of the proposed method. Initially, we present the Dirichlet process (DP) and its function as a prior in nonparametric Bayesian models; furthermore, we provide a brief summary of the GPDM model. In Section III, we introduce the proposed LM2GP model, and derive efficient model inference algorithms based on the variational Bayesian framework. In Section IV, we conduct the experimental evaluation of our proposed model, considering a number of applications dealing with several real-world datasets, and we compare its performance with related state-of-the-art approaches. Finally, in Section V, we conclude and summarize this paper.
II. PRELIMINARIES

A. Dirichlet Process

DP models were first introduced in [7]. A DP is characterized by a base distribution $G_0$ and a positive scalar $\alpha$, usually referred to as the innovation parameter, and is denoted as $\mathrm{DP}(\alpha, G_0)$. Essentially, a DP is a distribution placed over a distribution. Let us suppose we randomly draw a sample distribution $G$ from a DP and, subsequently, we independently draw $M$ random variables $\{\theta_m^*\}_{m=1}^{M}$ from $G$:

$G \mid \alpha, G_0 \sim \mathrm{DP}(\alpha, G_0)$  (1)

$\theta_m^* \mid G \sim G, \quad m = 1, \ldots, M.$  (2)

Integrating out $G$, the joint distribution of the variables $\{\theta_m^*\}_{m=1}^{M}$ can be shown to exhibit a clustering effect. Specifically, given the first $M - 1$ samples of $G$, $\{\theta_m^*\}_{m=1}^{M-1}$, it can be shown that a new sample $\theta_M^*$ is either 1) drawn from the base distribution $G_0$, with probability $\alpha/(\alpha + M - 1)$, or 2) selected from the existing draws, according to a multinomial allocation, with probabilities proportional to the number of the previous draws with the same allocation [10]. Let $\{\theta_c\}_{c=1}^{C}$ be the set of distinct values taken by the variables $\{\theta_m^*\}_{m=1}^{M-1}$. Denoting as $\nu_c^{M-1}$ the number of values in $\{\theta_m^*\}_{m=1}^{M-1}$ that equal $\theta_c$, the distribution of $\theta_M^*$ given $\{\theta_m^*\}_{m=1}^{M-1}$ can be shown to be of the form [10]

$p(\theta_M^* \mid \{\theta_m^*\}_{m=1}^{M-1}, \alpha, G_0) = \dfrac{\alpha}{\alpha + M - 1} G_0 + \displaystyle\sum_{c=1}^{C} \dfrac{\nu_c^{M-1}}{\alpha + M - 1} \delta_{\theta_c}$  (3)

where $\delta_{\theta_c}$ denotes the distribution concentrated at the single point $\theta_c$. These results illustrate two key properties of the DP scheme. First, the innovation parameter $\alpha$ plays a key role in determining the number of distinct parameter values. A larger $\alpha$ induces a higher tendency of drawing new parameters from the base distribution $G_0$; indeed, as $\alpha \to \infty$, we get $G \to G_0$. Conversely, as $\alpha \to 0$, all $\{\theta_m^*\}_{m=1}^{M}$ tend to cluster to a single random variable. Second, the more often a parameter is shared, the more likely it will be shared in the future.

A characterization of the (unconditional) distribution of the random variable $G$ drawn from a DP, $\mathrm{DP}(\alpha, G_0)$, is provided by the stick-breaking construction in [6]. Consider two infinite collections of independent random variables $\boldsymbol{v} = [v_c]_{c=1}^{\infty}$ and $\{\theta_c\}_{c=1}^{\infty}$, where the $v_c$ are drawn from a Beta distribution, and the $\theta_c$ are independently drawn from the base distribution $G_0$. The stick-breaking representation of $G$ is then given by

$G = \displaystyle\sum_{c=1}^{\infty} \pi_c(\boldsymbol{v}) \, \delta_{\theta_c}$  (4)

where

$p(v_c) = \mathrm{Beta}(1, \alpha)$  (5)

$\pi_c(\boldsymbol{v}) = v_c \displaystyle\prod_{j=1}^{c-1} (1 - v_j) \in [0, 1]$  (6)

and

$\displaystyle\sum_{c=1}^{\infty} \pi_c(\boldsymbol{v}) = 1.$  (7)

Under the stick-breaking representation of the DP, the atoms $\theta_c$, drawn independently from the base distribution $G_0$, can be seen as the parameters of the component distributions of a mixture model comprising an unbounded number of component densities, with mixing proportions $\pi_c(\boldsymbol{v})$.
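To make the stick-breaking construction (4)-(7) concrete, the following sketch draws a truncated approximation of a DP sample $G$. It is a minimal illustration under assumptions of our own, namely a standard normal base distribution $G_0$, a finite truncation level $C$, and closing the stick at the last break so that the truncated weights sum to one; the function names are hypothetical, not part of any reference implementation.

```python
import numpy as np

def stick_breaking_weights(alpha, C, rng):
    """Truncated mixing proportions pi_c(v) per (5) and (6):
    v_c ~ Beta(1, alpha), pi_c = v_c * prod_{j<c} (1 - v_j)."""
    v = rng.beta(1.0, alpha, size=C)
    v[-1] = 1.0  # close the stick so the truncated weights sum to one
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * stick_left

def draw_truncated_dp(alpha, C, rng):
    """Truncated draw G = sum_c pi_c(v) delta_{theta_c} as in (4),
    with atoms theta_c drawn i.i.d. from an assumed N(0, 1) base G0."""
    pi = stick_breaking_weights(alpha, C, rng)
    theta = rng.standard_normal(C)  # theta_c ~ G0
    return pi, theta

rng = np.random.default_rng(0)
for alpha in (0.5, 50.0):
    pi, _ = draw_truncated_dp(alpha, C=100, rng=rng)
    print(alpha, np.sort(pi)[::-1][:3])  # three largest mixing proportions
```

Printing the largest weights for a small and a large $\alpha$ illustrates the first property discussed above: for $\alpha = 0.5$, almost all probability mass concentrates on a handful of atoms, whereas for $\alpha = 50$, the mass spreads thinly over many atoms, so that $G$ more closely resembles $G_0$.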
B. GP Dynamical Model

1) GP Models: Let us begin with a brief description of GP regression. Consider an observation space $\mathcal{X}$; a GP $f(x)$, $x \in \mathcal{X}$, is defined as a collection of random variables, any finite number of which have a joint Gaussian distribution [3]. We typically use the notation

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$  (8)

where $m(x)$ is the mean function of the process and $k(x, x')$ is the covariance function of the process. Usually, for simplicity, and without any loss of generality, the mean of the process is taken to be zero, $m(x) = 0$, although this is not necessary. Concerning the selection of the covariance function, a large variety of kernel functions $k(x, x')$ might be employed, depending on the application considered [3]. Eventually, the real process $f(x)$ drawn from a GP with mean zero and kernel function $k(x, x')$ follows a Gaussian distribution with

$p(f \mid x) = \mathcal{N}(f \mid 0, k(x, x)).$  (9)

Let us suppose a set of independent and identically distributed samples $\mathcal{D} = \{(x_i, y_i) \mid i = 1, \ldots, N\}$, with the $d$-dimensional variables $x_i$ being the observations related to a modeled phenomenon, and the scalars $y_i$ being the associated target values. The goal of a regression model is, given a new observation $x_*$, to predict the corresponding target value $y_*$, based on the information contained in the training set $\mathcal{D}$. The basic notion behind GP regression consists in the assumption that the observable (training) target values $y$ in a considered regression problem can be expressed as the superposition of a GP over the input space $\mathcal{X}$, $f(x)$, and an independent white Gaussian noise term $\epsilon$:

$y = f(x) + \epsilon$  (10)

where $f(x)$ is given by (8), and

$p(\epsilon) = \mathcal{N}(\epsilon \mid 0, \sigma^2).$  (11)

Under these assumptions, the joint normality of the training target values $Y = [y_i]_{i=1}^{N}$ and some unknown target value $y_*$, approximated by the value $f_*$ of the postulated GP evaluated at the observation point $x_*$, is a Gaussian of the form [3]

$\begin{bmatrix} Y \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) + \sigma^2 I_N & k(x_*) \\ k(x_*)^T & k(x_*, x_*) \end{bmatrix}\right)$  (12)

where

$k(x_*) \triangleq [k(x_1, x_*), \ldots, k(x_N, x_*)]^T$  (13)

$X = [x_i]_{i=1}^{N}$, $I_N$ is the $N \times N$ identity matrix, and $K(X, X)$ is the $N \times N$ matrix of the covariances between the $N$ training data points (Gram matrix)

$K(X, X) \triangleq \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \cdots & k(x_N, x_N) \end{bmatrix}.$  (14)

From (12), and conditioning on the available training samples, we can derive the expression of the model predictive distribution, yielding

$p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}(f_* \mid \mu_*, \sigma_*^2)$  (15)

$\mu_* = k(x_*)^T \left(K(X, X) + \sigma^2 I_N\right)^{-1} Y$  (16)

$\sigma_*^2 = \sigma^2 - k(x_*)^T \left(K(X, X) + \sigma^2 I_N\right)^{-1} k(x_*) + k(x_*, x_*).$  (17)
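As a concrete illustration of (13)-(17), the following sketch implements the GP regression predictive equations in Python. The squared-exponential kernel and the hyperparameter values are assumptions made for this example only, and the matrix inverse in (16) and (17) is applied through a Cholesky factorization for numerical stability rather than computed explicitly.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_predict(X, Y, x_star, noise_var=0.1):
    """Predictive mean (16) and variance (17) at a single test input x_star."""
    K = rbf_kernel(X, X)                              # Gram matrix, (14)
    k_star = rbf_kernel(X, x_star[None, :])[:, 0]     # k(x_*), (13)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X)))  # K + sigma^2 I_N
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    mu = k_star @ alpha                               # (16)
    v = np.linalg.solve(L, k_star)
    var = noise_var - v @ v + 1.0  # (17), with k(x_*, x_*) = 1 for this kernel
    return mu, var

# Toy one-dimensional example: noisy samples of a sine function.
rng = np.random.default_rng(1)
X = rng.uniform(-3.0, 3.0, size=(20, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
print(gp_predict(X, Y, np.array([0.5])))
```

Note that, as written in (17), the predictive variance includes the noise variance $\sigma^2$; dropping the leading noise_var term in the sketch would yield the variance of the latent value $f_*$ alone.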
2) GP Latent Variable Models: Building upon the GP model, the GPLVM is essentially a GP, the input variables $x$ of which are considered latent variables rather than observed ones. Specifically, GPLVM considers that the $y \in \mathbb{R}^D$ are observed multidimensional variables, while the latent vectors $x$ are variables lying in some lower dimensional manifold that generates the observed variables $y$ through an unknown latent mapping $f$, modeled through a GP. This way, we have

$p(y \mid x) = \displaystyle\prod_{d=1}^{D} \mathcal{N}(y_d \mid 0, k_d(x, x))$  (18)

which, considering a set of observations $Y = [y_n]_{n=1}^{N}$, yields

$p(Y \mid X) = \displaystyle\prod_{d=1}^{D} \mathcal{N}(Y_d \mid 0, K_d(X, X))$  (19)

where $k_d(\cdot, \cdot)$ is the kernel of the model for the $d$th observed dimension, $X = [x_n]_{n=1}^{N}$, $Y_d = [y_{nd}]_{n=1}^{N}$, and $K_d(X, X)$ is the Gram matrix (with inputs $X$) corresponding to the $d$th dimension of the modeled data. GPLVM learns the values of the latent variables $x$ corresponding to the observed data $y$ through the maximization of the model marginal likelihood, i.e., in a MAP fashion. As shown in [4], GPLVM can be viewed as a GP-based nonparametric Bayesian extension of probabilistic principal component analysis [11], [12], a popular method for latent manifold modeling of high-dimensional data.

3) Dynamic GPLVMs: Inspired by GPLVM, GPDM performs modeling of sequential data by considering a GP prior as in GPLVM, and introducing an additional model of the temporal interdependencies between successive latent vectors $x$. Specifically, considering a sequence of observed data $Y = [y_n]_{n=1}^{N}$, where $y_n = [y_{nd}]_{d=1}^{D}$, with corresponding latent manifold projections $X = [x_n]_{n=1}^{N}$, $x_n \in \mathbb{R}^Q$, GPDM models the dependencies between the latent and observed data as in (19), while also considering a GP-based model of the interdependencies between the latent vectors $x_n$. In detail, this latter model initially postulates

$p(X) = p(x_1) \displaystyle\int \prod_{n=2}^{N} p(x_n \mid x_{n-1}; A) \, p(A) \, \mathrm{d}A$  (20)

which, assuming

$p(x_n \mid x_{n-1}; A) = \mathcal{N}(x_n \mid A x_{n-1}, \sigma^2 I)$  (21)

and a simplistic isotropic Gaussian prior on the columns of the parameter matrix $A$, yields a GP prior of the form

$p(X) = \dfrac{p(x_1)}{\sqrt{(2\pi)^{Q(N-1)} \left|\hat{K}(X, X)\right|^{Q}}} \exp\left(-\dfrac{1}{2} \mathrm{tr}\left(\hat{K}(X, X)^{-1} X_{2:N} X_{2:N}^T\right)\right)$  (22)

where $X_{2:N} \triangleq [x_n]_{n=2}^{N}$, and $\hat{K}(X, X)$ is the Gram matrix over the latent points $x_1, \ldots, x_{N-1}$ arising from the marginalization over $A$.
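To illustrate how the marginalized dynamics prior (22) can be evaluated for a given latent trajectory, the following sketch computes its log-density up to the $\log p(x_1)$ term. The construction of $\hat{K}(X, X)$ as a linear kernel plus noise follows from marginalizing the linear map $A$ of (20) and (21) under a unit-variance isotropic prior; this is a sketch under our own reading of the equations and parameter choices, not the authors' implementation.

```python
import numpy as np

def gpdm_dynamics_log_prior(X, noise_var=0.1):
    """Log of the marginalized dynamics prior (22), up to the log p(x_1) term.

    Marginalizing the linear map A of (21) under a unit-variance isotropic
    Gaussian prior yields K_hat = X_in X_in^T + sigma^2 I, where X_in stacks
    the conditioning points x_1, ..., x_{N-1} (our reading of (22); the
    noise variance value is illustrative).
    """
    N, Q = X.shape
    X_in, X_out = X[:-1], X[1:]          # X_out corresponds to X_{2:N}
    K_hat = X_in @ X_in.T + noise_var * np.eye(N - 1)
    L = np.linalg.cholesky(K_hat)
    log_det = 2.0 * np.log(np.diag(L)).sum()        # log |K_hat|
    quad = (np.linalg.solve(L, X_out) ** 2).sum()   # tr(K_hat^{-1} X_out X_out^T)
    return -0.5 * (Q * (N - 1) * np.log(2.0 * np.pi) + Q * log_det + quad)

# A smooth (near-linear-dynamics) trajectory scores far higher than noise.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 4.0 * np.pi, 30)
smooth = np.stack([np.sin(t), np.cos(t)], axis=1)
noisy = rng.standard_normal((30, 2))
print(gpdm_dynamics_log_prior(smooth), gpdm_dynamics_log_prior(noisy))
```

As a sanity check, the final lines compare a smooth trajectory, which is well explained by a linear map between successive points, against white noise of comparable scale; the former receives a much higher log-density under (22).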