
Bayesian Filtering with Online Gaussian Process Latent Variable Models

Yali Wang (Laval University), Marcus A. Brubaker (TTI Chicago), Brahim Chaib-draa (Laval University), Raquel Urtasun (University of Toronto)

Abstract

In this paper we present a novel non-parametric approach to Bayesian filtering, where the prediction and observation models are learned in an online fashion. Our approach is able to handle multimodal distributions over both models by employing a mixture model representation with Gaussian Process (GP) based components. To cope with the increasing complexity of the estimation process, we explore two computationally efficient GP variants, sparse online GP and local GP, which help to manage the computation requirements of each mixture component. Our experiments demonstrate that our approach can track human motion much more accurately than existing approaches that learn the prediction and observation models offline and do not update these models with the incoming data stream.

1 INTRODUCTION

Many real world problems involve high dimensional data. In this paper we are interested in modeling and tracking human motion. In this setting, dimensionality reduction techniques are widely employed to avoid the curse of dimensionality.

Linear approaches such as principal component analysis (PCA) are very popular as they are simple to use. However, they often fail to capture complex dependencies due to their assumption of linearity. Non-linear dimensionality reduction techniques that attempt to preserve the local structure of the manifold (e.g., Isomap [21, 8], LLE [19, 14]) can capture more complex dependencies, but often suffer when the manifold assumptions are violated, e.g., in the presence of noise.

Probabilistic latent variable models have the advantage of being able to take the uncertainties into account when learning the latent representations. Perhaps the most successful model in the context of modelling human motion is the Gaussian process latent variable model (GPLVM) [12], where the non-linear mapping between the latent space and the high dimensional space is modeled with a Gaussian process. This provides powerful prior models, which have been employed for character animation [28, 26, 15] and human body tracking [24, 16, 25].

In the context of tracking, one is interested in estimating the state of a dynamic system. The most commonly used technique for state estimation is Bayesian filtering, which recursively estimates the posterior probability of the state of the system. The two key components in the filter are the prediction model, which describes the temporal evolution of the process, and the observation model, which links the state and the observations. A parametric form is typically employed for both models.

Ko and Fox [10] introduced the GP-BayesFilter, which defines the prediction and observation models in a non-parametric way via Gaussian processes. This approach is well suited when accurate parametric models are difficult to obtain. Its main limitation, however, resides in the fact that it requires ground truth states (as GPs are supervised), which are typically not available. GPLVMs were employed in [11] to learn the latent space in an unsupervised manner, bypassing the need for labeled data. This, however, cannot exploit the incoming stream of data available in the online setting, as the latent space is learned offline. Furthermore, only unimodal prediction and observation models can be captured, due to the fact that the models learned by GPs are nonlinear but Gaussian.

In this paper we extend the previous non-parametric filters to learn the latent space in an online fashion as well as to handle multimodal distributions for both the prediction and observation models. Towards this goal, we employ a mixture model representation in the particle filtering framework. For the mixture components, we investigate two computationally efficient GP variants which can update the prediction and observation models in an online fashion and cope with the growth in complexity as the number of data points increases over time. More specifically, the sparse online GP [3] selects the active set in an online fashion to efficiently maintain sparse approximations to the models. Alternatively, the local GP [26] reduces the computation by imposing local sparsity.

Recently, a number of GP-based Bayesian filters were proposed by learning the prediction and observation models using GP regression [10, 4]. This is a promising alternative, as GPs are non-parametric and can capture complex mappings. However, training these methods requires access to ground truth data before filtering. Unfortunately, the inputs of the training set are the hidden states, which are not always known in real-world applications. Two extensions were introduced to learn the hidden states of the training set via a non-linear latent variable model [11] or a sparse pseudo-input GP regression [22]. However, these methods require offline learning procedures, which are not able to exploit the incoming data streams. In contrast, we propose two non-parametric particle filters that are able to exploit the incoming data to learn better models in an online fashion.

We demonstrate the effectiveness of our approach on a wide variety of motions, and show that both approaches perform better than existing algorithms. In the remainder of the paper we first present a review of Bayesian filtering and the GPLVM. We then introduce our algorithm and show our experimental evaluation, followed by the conclusions.

2 BACKGROUND

In this section we review Bayesian filtering and Gaussian process latent variable models.

2.1 BAYESIAN FILTERING

Bayesian filtering is a sequential inference technique typically employed to perform state estimation in dynamic systems. Specifically, the goal is to recursively compute the posterior distribution of the current hidden state x_t given the history of observations y_{1:t} = (y_1, ..., y_t) up to the current time step:

p(x_t | y_{1:t}) ∝ p(y_t | x_t) ∫ p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}

where p(x_t | x_{t-1}) is the prediction model that represents the system dynamics, and p(y_t | x_t) is the observation model that represents the likelihood of an observation y_t given the state x_t.

One of the most fundamental Bayesian filters is the Kalman filter, which is a maximum-a-posteriori estimator for linear and Gaussian models. Unfortunately, it is often not applicable in practice since most real dynamical systems are non-linear and/or non-Gaussian. Two popular extensions for non-linear systems are the extended Kalman filter (EKF) and the unscented Kalman filter (UKF) [9]. However, similar to the Kalman filter, the performance of the EKF and UKF is poor when the models are multimodal [5].

In contrast, particle filters, which are not restricted to linear and Gaussian models, have been developed by using sequential Monte Carlo sampling to represent the underlying posterior p(x_t | y_{1:t}) [5]. More specifically, at each time step, N_p particles of x_t are drawn from the prediction model p(x_t | x_{t-1}), and then all the particles are weighted according to the observation model p(y_t | x_t). The posterior p(x_t | y_{1:t}) is approximated using these N_p weighted particles. Finally, the N_p particles are resampled for the next step. Unfortunately, the parametric description of the dynamic models limits the estimation accuracy of Bayesian filters.
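To make the recursion concrete, the following is a minimal bootstrap particle filter sketch in Python/NumPy. The Gaussian random-walk prediction model and the Gaussian likelihood are placeholder assumptions chosen only to keep the example self-contained; they are not the GP-based models introduced later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(x_prev):
    # Placeholder prediction model p(x_t | x_{t-1}): a Gaussian random walk.
    return x_prev + 0.1 * rng.standard_normal(x_prev.shape)

def likelihood(y, x):
    # Placeholder observation model p(y_t | x_t): Gaussian density around x.
    return np.exp(-0.5 * np.sum((y - x) ** 2, axis=-1))

def particle_filter_step(particles, y):
    # 1. Draw N_p particles from the prediction model.
    particles = predict(particles)
    # 2. Weight every particle according to the observation model.
    w = likelihood(y, particles)
    w /= w.sum()
    # 3. Resample N_p particles for the next step.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx], w

# Toy run: 2-D state, 100 particles, a handful of observations.
particles = rng.standard_normal((100, 2))
for y in rng.standard_normal((5, 2)):
    particles, w = particle_filter_step(particles, y)
print(particles.mean(axis=0))
```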

2.2 GAUSSIAN PROCESS DYNAMICAL MODEL

The Gaussian Process Latent Variable Model (GPLVM) is a probabilistic dimensionality reduction technique, which places a GP prior on the observation model [12]. Wang et al. [28] proposed the Gaussian Process Dynamical Model (GPDM), which enriches the GPLVM to capture temporal structure by incorporating a GP prior over the dynamics in the latent space. Formally, the model is:

x_t = f_x(x_{t-1}) + η_x
y_t = f_y(x_t) + η_y

where y ∈ R^{D_y} represents the observation and x ∈ R^{D_x} the latent state, with D_y ≫ D_x. The noise processes are assumed to be Gaussian, η_x ∼ N(0, σ_x² I) and η_y ∼ N(0, σ_y² I). Each dimension i of the nonlinear functions has a GP prior, i.e., f_x^i ∼ GP(0, k_x(x, x')) and f_y^i ∼ GP(0, k_y(x, x')), where k_x(·,·) and k_y(·,·) are the kernel functions. For simplicity, we denote the hyperparameters of the kernel functions by θ.

Let x_{1:T_0} = (x_1, ..., x_{T_0}) be the latent space coordinates from time t = 1 to time t = T_0. GPDM is typically learned by minimizing the negative log posterior −log p(x_{1:T_0}, θ | y_{1:T_0}) with respect to x_{1:T_0} and θ [28]. After x_{1:T_0} and θ are obtained, a standard GP prediction is used to construct the models p(x_t | x_{t-1}, θ, X_{T_0}) and p(y_t | x_t, θ, Y_{T_0}) with data X_{T_0} = {(x_{k-1}, x_k)}_{k=2}^{T_0} and Y_{T_0} = {(x_k, y_k)}_{k=1}^{T_0}. Tracking (t > T_0) is then performed assuming the model is fixed and can be done using, e.g., a particle filter as described above. The major drawback of this approach is that it is not able to adapt to new observations during tracking. As shown in our experimental evaluation, this results in poor performance when the training set is small.
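For reference, the "standard GP prediction" used to build p(x_t | x_{t-1}, θ, X_{T_0}) is simply the GP regression posterior. The sketch below illustrates that construction; the RBF kernel and its hyperparameter values are illustrative assumptions, and the same code applies to p(y_t | x_t, θ, Y_{T_0}) with the corresponding inputs and outputs.

```python
import numpy as np

def rbf(A, B, gamma=1.0, sf2=1.0):
    # Squared-exponential kernel; hyperparameter values are illustrative.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sf2 * np.exp(-0.5 * d2 / gamma**2)

def gp_predict(X, Y, x_star, noise=1e-2):
    # GP regression posterior: mean and variance at the test inputs x_star.
    K = rbf(X, X) + noise * np.eye(len(X))
    k_star = rbf(X, x_star)
    mean = k_star.T @ np.linalg.solve(K, Y)
    var = rbf(x_star, x_star) - k_star.T @ np.linalg.solve(K, k_star) + noise
    return mean, np.diag(var)

# Toy GPDM-style usage: inputs are previous latent states, outputs the next ones.
rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))                 # x_{k-1}, k = 2..T_0
Y = X + 0.05 * rng.standard_normal(X.shape)      # x_k
mean, var = gp_predict(X, Y, X[:1])
print(mean.shape, var.shape)
```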

3 ONLINE GP

In order to solve the above-mentioned difficulties in learning and filtering with dynamic systems, we propose an Online GP Particle Filter framework to learn and refine the model during tracking, i.e., the prediction p(x_t | x_{t-1}) and observation p(y_t | x_t) models are updated online in the particle filtering framework. To account for multi-modality and the significant amount of uncertainty that can be present, we propose to represent the prediction and observation models by a mixture model. For each mixture component, we will investigate two different GP variants.

Let the prediction and observation models at t − 1 be

p(x_t | x_{t-1}, Θ_{t-1,M}) = (1/R_M) Σ_{i=1}^{R_M} p(x_t | x_{t-1}, Θ^i_{t-1,M})    (1)
p(y_t | x_t, Θ_{t-1,O}) = (1/R_O) Σ_{i=1}^{R_O} p(y_t | x_t, Θ^i_{t-1,O})    (2)

where Θ^i_{t-1,M} and Θ^i_{t-1,O} represent the parameters of the i-th component, and Θ_{t-1,M} = {Θ^i_{t-1,M}}_{i=1}^{R_M} and Θ_{t-1,O} = {Θ^i_{t-1,O}}_{i=1}^{R_O} are the parameters of all components. At the t-th time step, we run a standard particle filter to obtain a number of weighted particles. The latent space representations at time t can be obtained by resampling the weighted particles. Then, we assign each particle to the most likely mixture component of p(x_t | x_{t-1}, Θ_{t-1,M}) and p(y_t | x_t, Θ_{t-1,O}) to capture the multi-modality of the prediction and observation models. Finally, we compute the mean of the latent states of the assigned particles and use this mean state to update the corresponding component's parameters, Θ^i_{t,M} for the prediction (or motion) model and Θ^i_{t,O} for the observation model. The whole framework is summarized in Algorithm 1.

Algorithm 1 Online GP-Particle Filter
1:  Initialize model parameters Θ_{T_0} based on y_{1:T_0}
2:  Initialize particle set x_{T_0}^{(1:N_p)} based on y_{1:T_0}
3:  for t = T_0 + 1 to T do
4:    for i = 1 to N_p do
5:      x_t^{(i)} ∼ p(x_t | x_{t-1}^{(i)}, Θ_{t-1,M})
6:      ŵ_t^{(i)} = p(y_t | x_t^{(i)}, Θ_{t-1,O})
7:    end for
8:    Normalize weights w_t^{(i)} = ŵ_t^{(i)} / (Σ_{i=1}^{N_p} ŵ_t^{(i)})
9:    Resample particle set with probabilities w_t^{(1:N_p)}
10:   for i = 1 to N_p do
11:     η_M^i = arg max_j p(x_t^{(i)} | x_{t-1}^{(i)}, Θ^j_{t-1,M})
12:     η_O^i = arg max_j p(y_t | x_t^{(i)}, Θ^j_{t-1,O})
13:   end for
14:   for j = 1 to R_M do
15:     n^j_{t-1} = Σ_{i=1}^{N_p} δ(η_M^i = j)
16:     x̄^j_{t-1} = (1/n^j_{t-1}) Σ_{i=1}^{N_p} δ(η_M^i = j) x_{t-1}^{(i)}
17:     x̄^j_t = (1/n^j_{t-1}) Σ_{i=1}^{N_p} δ(η_M^i = j) x_t^{(i)}
18:     Update Θ^j_{t,M} with (x̄^j_{t-1}, x̄^j_t)
19:   end for
20:   for j = 1 to R_O do
21:     n^j_{t-1} = Σ_{i=1}^{N_p} δ(η_O^i = j)
22:     x̄^j_t = (1/n^j_{t-1}) Σ_{i=1}^{N_p} δ(η_O^i = j) x_t^{(i)}
23:     Update Θ^j_{t,O} with (x̄^j_t, y_t)
24:   end for
25: end for

What remains is to specify how the parameters of individual components are represented and updated (lines 18 and 23 in Algorithm 1). As noted above, we aim to use a GP model for each mixture component. However, a standard implementation would require O(t³) operations and O(t²) memory. As t grows linearly over time, the particle filter will quickly become too computationally and memory intensive. Thus a primary challenge is how to efficiently update the GP mixture components in the prediction and observation models.

In order to efficiently update Θ^i_{t,M} and Θ^i_{t,O} in an online manner, we consider two fast GP-based strategies: Sparse Online GP (SOGP) and Local GPs (LGP), in which the reduction in memory and/or computation is achieved by an online sparsification and a local experts mechanism, respectively. A detailed review of fast GP approaches can be found in [1, 18].

The specific contents of Θ^i_{t,M} or Θ^i_{t,O} will vary depending on the method used. In the case of SOGP it will contain some computed quantities and the active set, while for LGP it will simply be the set of all training points. While we will focus on these two strategies, we note that in principle any similar update strategy could be used instead, such as informative vector machines [13] or local regression approaches [7, 6, 20]. In what follows, to avoid confusion with the notations of the latent state and observation, we will use a and b to indicate the input and output when we describe SOGP and LGP regression, in which we consider modeling a generic b = f(a) + ξ, with ξ ∼ N(0, σ²I).
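The sketch below illustrates the assignment and averaging steps of Algorithm 1 (lines 10-18): each particle is assigned to its most likely component, and per-component means are computed for the update. The component densities are passed in as callables; the Gaussian drift densities used in the toy check are placeholders for the GP predictive distributions of the actual method.

```python
import numpy as np

def assign_and_average(x_prev, x_curr, component_logpdfs):
    # component_logpdfs[j](x_prev, x_curr) plays the role of log p(x_t | x_{t-1}, Theta^j_{t-1,M}).
    logp = np.stack([f(x_prev, x_curr) for f in component_logpdfs], axis=1)
    eta = np.argmax(logp, axis=1)                       # Alg. 1, line 11
    updates = {}
    for j in range(len(component_logpdfs)):
        mask = eta == j                                 # particles assigned to component j
        if mask.any():
            updates[j] = (x_prev[mask].mean(axis=0),    # mean input  (line 16)
                          x_curr[mask].mean(axis=0))    # mean output (line 17)
    return updates                                      # fed to the component update (line 18)

# Toy check with two isotropic Gaussian "components" around different drifts.
rng = np.random.default_rng(2)
x_prev = rng.standard_normal((50, 3))
x_curr = x_prev + rng.choice([-0.5, 0.5], size=(50, 1))
comps = [lambda xp, xc, d=d: -0.5 * np.sum((xc - (xp + d)) ** 2, axis=1)
         for d in (-0.5, 0.5)]
for j, (xb_prev, xb_curr) in assign_and_average(x_prev, x_curr, comps).items():
    print(j, xb_prev.round(2), xb_curr.round(2))
```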
3.1 SPARSE ONLINE GAUSSIAN PROCESS

The Sparse Online Gaussian Process (SOGP) of [3, 27] is a well-known algorithm for online learning of GP models. To cope with the fact that data arrives in an online manner, SOGP trains a GP model sequentially by updating the posterior mean and covariance of the latent function values of the training set. This online procedure is coupled with a sparsification strategy which iteratively selects a fixed-size subset of points to form the active set, preventing the otherwise unbounded growth of the computation and memory load.

The key of SOGP is to maintain the joint posterior over the latent function values of the fixed-size active set D_{t-1}, i.e., N(µ_{t-1}, Σ_{t-1}), by recursively updating µ_{t-1} and Σ_{t-1}. When a new observation (a_t, b_t) is available, we perform the following update to take the new data point into account [27]:

q_t = Q_{t-1} k_{t-1}(a_t)    (3)
ρ_t² = k(a_t, a_t) − k_{t-1}(a_t)ᵀ Q_{t-1} k_{t-1}(a_t)    (4)
σ̂_t² = σ² + ρ_t² + q_tᵀ Σ_{t-1} q_t    (5)
δ_t = [ Σ_{t-1} q_t ; ρ_t² + q_tᵀ Σ_{t-1} q_t ]    (6)
µ_t = [ µ_{t-1} ; q_tᵀ µ_{t-1} ] + σ̂_t^{-2} (b_t − q_tᵀ µ_{t-1}) δ_t    (7)
Σ_t = [ Σ_{t-1}, Σ_{t-1} q_t ; q_tᵀ Σ_{t-1}, ρ_t² + q_tᵀ Σ_{t-1} q_t ] − σ̂_t^{-2} δ_t δ_tᵀ    (8)

where k_{t-1}(a_t) is the kernel vector constructed from a_t and the active set D_{t-1}, Q_{t-1} is the inverse kernel matrix of the active set D_{t-1}, and [u ; v] denotes block stacking (entries within a row separated by commas, rows by semicolons). For simplicity of presentation, we assume that b is a scalar; the extension to vector-valued b is straightforward.

One of the key steps in this algorithm is to decide when to add the new point to the active set. We employ the strategy suggested by [3, 27], and ignore the new point when ρ_t² < ε for some small value of ε (we use ε = 10^{-6}). In this case, µ_t and Σ_t are updated as µ_t ← [µ_t]_{-i}, Σ_t ← [Σ_t]_{-i,-i}, where i = t is the index of the new point, [·]_{-i} removes the i-th entry of a vector, and [·]_{-i,-i} removes the i-th row and column of a matrix. Additionally, the inverse kernel matrix is simply Q_t = Q_{t-1} because the new point is not included in the active set. When ρ_t² ≥ ε, we add the new point to the active set, D_t = D_{t-1} ∪ {(a_t, b_t)}. The µ_t, Σ_t are then the same as in Eqs. (7) and (8), and the inverse kernel matrix is updated to be [27]

Q_t = [ Q_{t-1}, 0 ; 0, 0 ] + ρ_t^{-2} [ q_t q_tᵀ, −q_t ; −q_tᵀ, 1 ]    (9)

When the size of the active set is larger than the fixed size N_A because a new point was added, we must remove a point. This is done by selecting the one which minimally affects the predictions according to the squared predicted error. Following [3, 27], we remove the j-th data point with

j = arg min_j [Q_t µ_t]_j² / [Q_t]_{j,j}    (10)

where [·]_j selects the j-th entry of a vector and [·]_{j,j} selects the j-th diagonal entry of a matrix. Once a point has been selected for removal, µ_t, Σ_t and Q_t are updated as

µ_t ← [µ_t]_{-j}    (11)
Σ_t ← [Σ_t]_{-j,-j}    (12)
Q_t ← [Q_t]_{-j,-j} − [Q_t]_{-j,j} [Q_t]_{-j,j}ᵀ / [Q_t]_{j,j}    (13)

where [·]_{-j,j} selects the j-th column of the matrix with the j-th row removed, and the point is removed from the active set, D_t ← D_t \ {(a_j, b_j)}.

The joint posterior at time t can be used to construct the predictive distribution for a new input a*:

p^{SOGP}(b | a*, D_t, Θ) = N(b | b̃, σ̃²)    (14)

where b̃ = k_t(a*)ᵀ Q_t µ_t and σ̃² = σ² + k(a*, a*) + k_t(a*)ᵀ (Q_t Σ_t Q_t − Q_t) k_t(a*). We summarize the SOGP updates in Algorithm 2.

Algorithm 2 SOGP Update
input: Previous posterior quantities µ_{t-1}, Σ_{t-1}, Q_{t-1}
input: Previous active set D_{t-1}
input: New input-output observation pair (a_t, b_t)
1:  Compute ρ_t², µ_t and Σ_t as in Equations (4), (7) and (8).
2:  if ρ_t² < ε then
3:    Perform update µ_t ← [µ_t]_{-i}, Σ_t ← [Σ_t]_{-i,-i} where i is the index of the newly added row of µ_t.
4:    Set Q_t = Q_{t-1}, D_t = D_{t-1}.
5:  else
6:    Compute Q_t as in Equation (9).
7:    Add to active set D_t = D_{t-1} ∪ {(a_t, b_t)}.
8:  end if
9:  if |D_t| > N_A then
10:   Select point j to remove using Equation (10).
11:   Perform update µ_t ← [µ_t]_{-j}, Σ_t ← [Σ_t]_{-j,-j} and Q_t ← [Q_t]_{-j,-j} − [Q_t]_{-j,j} [Q_t]_{-j,j}ᵀ / [Q_t]_{j,j}.
12:   Remove j from active set D_t ← D_t \ {(a_j, b_j)}.
13: end if
output: µ_t, Σ_t, Q_t and D_t
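A minimal NumPy sketch of the SOGP update for scalar outputs is given below; it follows Equations (3)-(13) under an illustrative squared-exponential kernel (the paper's compound kernel could be substituted), and it makes no claim to match the exact implementation used in the experiments.

```python
import numpy as np

def kern(a, b, gamma=1.0):
    # Illustrative squared-exponential kernel (an assumption for this sketch).
    return np.exp(-0.5 * np.sum((a - b) ** 2) / gamma ** 2)

def sogp_update(A, mu, Sigma, Q, a_t, b_t, sigma2=0.01, eps=1e-6, N_A=20):
    """One SOGP update (Eqs. 3-13, Algorithm 2) for scalar outputs.
    A: active-set inputs; mu, Sigma: posterior over their latent values;
    Q: inverse kernel matrix of the active set."""
    k_vec = np.array([kern(a, a_t) for a in A])
    q = Q @ k_vec                                               # Eq. (3)
    rho2 = kern(a_t, a_t) - k_vec @ Q @ k_vec                   # Eq. (4)
    s = q @ Sigma @ q
    sig_hat2 = sigma2 + rho2 + s                                # Eq. (5)
    delta = np.append(Sigma @ q, rho2 + s)                      # Eq. (6)
    mu_new = np.append(mu, q @ mu) + (b_t - q @ mu) / sig_hat2 * delta   # Eq. (7)
    Sig_new = np.block([[Sigma, (Sigma @ q)[:, None]],
                        [(Sigma @ q)[None, :], np.array([[rho2 + s]])]])
    Sig_new -= np.outer(delta, delta) / sig_hat2                # Eq. (8)
    if rho2 < eps:
        # Novelty too low: keep the active set, drop the appended entry.
        return A, mu_new[:-1], Sig_new[:-1, :-1], Q
    Q_new = np.block([[Q, np.zeros((len(A), 1))],
                      [np.zeros((1, len(A))), np.zeros((1, 1))]])
    Q_new += np.block([[np.outer(q, q), -q[:, None]],
                       [-q[None, :], np.ones((1, 1))]]) / rho2  # Eq. (9)
    A, mu, Sigma, Q = A + [a_t], mu_new, Sig_new, Q_new
    if len(A) > N_A:
        scores = (Q @ mu) ** 2 / np.diag(Q)                     # Eq. (10)
        j = int(np.argmin(scores))
        keep = [i for i in range(len(A)) if i != j]
        col = Q[keep, j]
        Q = Q[np.ix_(keep, keep)] - np.outer(col, col) / Q[j, j]   # Eq. (13)
        mu, Sigma = mu[keep], Sigma[np.ix_(keep, keep)]            # Eqs. (11)-(12)
        A = [A[i] for i in keep]
    return A, mu, Sigma, Q

# Toy run: start from the GP prior at a single seed point and stream a few points.
rng = np.random.default_rng(3)
a0 = rng.standard_normal(2)
A, mu, Sigma, Q = [a0], np.zeros(1), np.eye(1), np.array([[1.0 / kern(a0, a0)]])
for _ in range(30):
    a = rng.standard_normal(2)
    A, mu, Sigma, Q = sogp_update(A, mu, Sigma, Q, a, float(np.sin(a.sum())), N_A=10)
print(len(A), mu.shape, Sigma.shape, Q.shape)
```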

3.2 LOCAL GAUSSIAN PROCESSES

An alternative to the SOGP approach is to use Local Gaussian Processes (LGP), which were developed specifically to deal with large, multi-modal regression problems [17, 23]. In LGP, given a test input a* and a set of input-output pairs D = {(a_i, b_i)}_{i=1}^N, the M_a nearest neighbors D_{a*} = {(a_ℓ, b_ℓ)}_{ℓ=1}^{M_a} are selected based on the distance in the input space, d_ℓ = ||a_ℓ − a*||. Then, for each of the M_a neighbors, the M_b nearest neighbors D_{b_ℓ} = {(a_j, b_j)}_{j=1}^{M_b} are selected based on the distance in the output space to b_ℓ. These neighbors are then combined to form a local GP expert which makes a Gaussian prediction with mean and covariance

µ_ℓ = B_{D_{b_ℓ}} K_{D_{b_ℓ},D_{b_ℓ}}^{-1} k_{D_{b_ℓ}}(a*)
σ_ℓ² = k(a*, a*) − k_{D_{b_ℓ}}(a*)ᵀ K_{D_{b_ℓ},D_{b_ℓ}}^{-1} k_{D_{b_ℓ}}(a*) + σ²

where B_{D_{b_ℓ}} is the matrix whose columns are the M_b nearest neighbors of b_ℓ, k_{D_{b_ℓ}}(a*) is the vector of kernel function values for the input a* and the points in D_{b_ℓ}, and K_{D_{b_ℓ},D_{b_ℓ}} is the kernel matrix for the points in D_{b_ℓ}. The final predictive distribution is then formed by combining all local experts in a mixture model

p^{LGP}(b | a*, D, Θ) = Σ_{ℓ=1}^{M_a} w_ℓ N(b | µ_ℓ, σ_ℓ² I)    (15)

with weights w_ℓ ∝ 1/d_ℓ.
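The following sketch computes the mean of the LGP mixture in Equation (15); the kernel, its hyperparameters, and the noise term added for numerically stable inversion are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / gamma ** 2)

def lgp_predict_mean(A, B, a_star, M_a=3, M_b=8, sigma2=1e-2):
    # M_a anchors: nearest neighbours of a_star in input space.
    d = np.linalg.norm(A - a_star, axis=1)
    anchors = np.argsort(d)[:M_a]
    weights = 1.0 / np.maximum(d[anchors], 1e-12)        # w_l proportional to 1/d_l
    weights /= weights.sum()
    means = []
    for l in anchors:
        # M_b neighbours of b_l in output space form the local expert.
        nb = np.argsort(np.linalg.norm(B - B[l], axis=1))[:M_b]
        K = rbf(A[nb], A[nb]) + sigma2 * np.eye(len(nb))
        k_star = rbf(A[nb], a_star[None, :])[:, 0]
        means.append(B[nb].T @ np.linalg.solve(K, k_star))   # expert mean mu_l
    # Mean of the mixture sum_l w_l N(b | mu_l, sigma_l^2 I).
    return np.sum(weights[:, None] * np.array(means), axis=0)

rng = np.random.default_rng(4)
A = rng.standard_normal((200, 3))
B = np.sin(A)                                            # toy mapping b = f(a)
print(lgp_predict_mean(A, B, rng.standard_normal(3)))
```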
4 EXPERIMENTAL EVALUATION

To illustrate our approach we choose 4 very different motions, i.e., walking, golf swing, swimming, as well as exercises (composed of side twist and squat). The data consists of motion capture from the CMU dataset [2], where each observation is a 62 dimensional vector containing the 3D rotations of all joints. We normalize the data to be mean zero and subsample the observations to reduce the correlation between consecutive frames. We use a frequency of 12 frames/s for walking and swimming, 24 frames/s for golf swing and 30 frames/s for the exercise motion. We compute all results averaged over 3 trials and report the average root mean squared error as our measure of performance.

In all the experiments, the latent space dimensionality is set to 3, as is common for human motion models [28]. We use PCA to initialize the latent space and K-means to obtain the data points used for the mixture components. We choose the compound kernel function k(x, x') = σ_f² exp(−0.5 ||x − x'||² / γ²) + l² xᵀx' for both prediction and observation mappings. Unless otherwise stated, we use 50 particles, a training set of size 20/30/50/450 and 2/2/5/5 mixture components for walking/golf/swimming/exercise motions respectively. For LGP, the number of local GP experts is 2/2/2/5, and the size of each local expert is 5/8/5/20. For SOGP, the size of the active set is 20/5/50/20. The parameter values were chosen to balance computational cost against prediction accuracy, and in our experiments we demonstrate the robustness of our approach to these parameters.
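A direct transcription of the compound kernel used above; the hyperparameter values here are illustrative placeholders rather than the ones learned in the experiments.

```python
import numpy as np

def compound_kernel(x1, x2, sigma_f=1.0, gamma=1.0, l=0.1):
    # k(x, x') = sigma_f^2 exp(-0.5 ||x - x'||^2 / gamma^2) + l^2 x^T x'
    rbf_term = sigma_f ** 2 * np.exp(-0.5 * np.sum((x1 - x2) ** 2) / gamma ** 2)
    return rbf_term + l ** 2 * np.dot(x1, x2)

x, xp = np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.1, 0.4])
print(compound_kernel(x, xp))
```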

4.1 COMPARISON TO STATE-OF-THE-ART

We compare our approaches to two baselines. The first one is the approach of Ko and Fox [11], where a GPDM is learned offline with gradient descent [28] before performing particle filtering for state estimation. The second baseline is similar, but learns the GPDM offline using stochastic gradient descent [29]. We tested the baselines in two different settings. First, only the initial training set is available to learn the prediction and observation models. Second, all the data (including future streamed examples) are used to learn the prediction and observation models. Note that the latter represents the oracle for Ko and Fox [11].

Number of Particles: We evaluate how the accuracy changes as a function of the number of particles, N_p. As expected, the prediction error is reduced in all the methods when the number of particles increases. As shown in the first row of Fig. 1, our approaches are superior to the baselines. Importantly, we outperformed the oracle baseline as we are able to represent multi-modal distributions effectively. This is particularly important in the exercise sequence as the dynamics are clearly multimodal due to the different motions that are performed in the sequence. Furthermore, our LGP variant outperforms SOGP. We believe this is due to the fact that SOGP has a fixed capacity while LGP is able to leverage more training data when making predictions.

Influence of Noise: In this experiment we evaluate the robustness of all approaches to additive noise in the observations. The second row of Fig. 1 shows that our LGP particle filter significantly outperforms the baselines, particularly in the exercise sequence. Our SOGP outperforms all baselines that have access to the same training set, and is only beaten by the oracle for walking.

Size of Training Set: We next evaluate how the accuracy depends on the size of the initial training set, T_0. The first row of Fig. 2 clearly indicates that our methods perform well even when the training set is very small. In contrast, the two baselines require bigger training sets to achieve comparable performance. This is expected as the baselines do not update the latent space to take the incoming observations into account.

[Figure 1 legend: GPDM+PF, GPDM(Oracle)+PF, Stochastic GPDM+PF, Stochastic GPDM(Oracle)+PF, Our SOGP+PF, Our LGP+PF]


Figure 1: Root mean squared error as a function of (1st row) the number of particles in the particle filter, (2nd row) the standard deviation of noise added to the observations. The columns (from left to right) represent walking, golf, swimming and exercise motions. Note that our approach outperforms the baselines in all settings. Handling multimodal distributions is particularly important in the exercise example as it is composed of a variety of different motions.

[Figure 2 legend: GPDM+PF, Stochastic GPDM+PF, Our SOGP+PF, Our LGP+PF]

Figure 2: Root Mean Squared Error as a function of (1st row) the number of initial training points, (2nd row) the number of the missing dimensions. The columns (from left to right) are respectively for walking, golf, swimming and exercise motions.

4.2 QUALITATIVE EXPERIMENTS

Fig. 3 shows the latent space of both the SOGP and LGP filters when employing 50 particles for each time step (depicted in blue). From the 3D latent space and predicted skeletons, we find that the manifolds of both the LGP and SOGP particle filters provide a good representation of the high-dimensional human motion data.

4.3 PROPERTIES OF OUR METHODS

We next discuss various aspects of our method and evaluate the influence of the parameters of the SOGP and LGP filters. For LGP, due to the fact that the data sizes of the walking, golf and swimming motions are small, we reduced the number (size) of local experts to be able to increase the size (number) of the local experts.

Computational Complexity: Overall the computational complexity of our method (Alg. 1) is mainly determined by the complexity of constructing a prediction distribution for each component (lines 5-6 and 11-12) and of the model updates (lines 18 and 23). Specifically, for an individual component which is either SOGP or LGP, computing the prediction distribution is O(N_A²) or O(M_a M_b³ + T M_a M_b) respectively, where N_A is the size of the active set, M_a is the number of local experts, M_b is the number of neighbors in the output space, and T M_a M_b comes from the KNN search. The model updates for the mixture components (lines 18 and 23) have a computational complexity of O(N_A²) and O(1) for SOGP and LGP respectively.


(i) Walk (t=23, 26, 29) (j) Golf (t=33, 41, 54) (k) Swim (t=71,79) (l) Exercise (t=508,555,721)

Figure 3: 3D latent spaces learned while tracking: The first row depicts the results of our SOGP variant, while the second row shows our LGP variant. In the walk/golf/swimming plots the red curve represents the predicted mean of the latent state sequence and the blue crosses are the particles at each step. In the plots for exercise (last column), the red, blue and green curves are the predicted mean of the latent state for three motions in the exercise sequence, and the black crosses are the particles at each step. The third row depicts the predicted skeletons, where ground truth is shown in green, our SOGP variant in blue and our LGP variant in red.

[Figure 4 legend: Our SOGP+PF, Our SOGP (no update)+PF, Our LGP+PF, Our LGP (no update)+PF]

Figure 4: Root Mean Squared Error as a function of the number of mixture components. The columns (from left to right) are respectively for walking, golf, swimming and exercise motions.

Number of Mixture Components: Fig. 4 shows performance as a function of the number of mixture components, R_M and R_O, for both SOGP and LGP. For LGP+PF in walk/golf/swim/exercise, the number of local GPs is 1/1/2/5 and the size of each local GP is 3/3/5/20. In all cases, we set R_M = R_O. Note that performance typically increases with the number of mixture components for SOGP, but less so for LGP. Furthermore, our approaches outperform the baselines in which the model is not updated during filtering, indicating that the online model updating is very important in practice. Also note that while LGP generally outperforms SOGP, the difference quickly declines as the number of mixture components increases. This suggests that, when the fixed memory requirements of SOGP are desirable, a larger number of mixture components will achieve performance comparable to LGP.


Figure 5: Root mean squared error as a function of (a) the size of the active set in SOGP, (b) the number of local GP experts, and (c) the size of each local GP expert in LGP. In subplot 5(c), the top x-axis is for exercise and the bottom one for the other motions. Curves correspond to the walk, golf, swim and exercise motions.

Figure 6: Predicted skeleton for missing parts (Walk: two legs; Golf, Swim and Exercise: left arm). The ground truth is shown in green, our SOGP particle filter in blue and LGP particle filter in red. We show the predicted performance at t=24, 27, 32 for walk, t=34, 38, 50 for golf, t=71,79 for swim, t=470, 592, 711 for exercise.

Active Set Size in SOGP: To explore the effect of the size of the active set, N_A, on performance we set the number of mixture components, R_M and R_O, to be 2/1/5/5 for walk/golf/swim/exercise, and use the same settings as before for the other parameters. Results are shown in Fig. 5(a). As expected, performance improves when the size of the active set increases.

Number and Size of Local Experts in LGP: Figs. 5(b) and 5(c) show the performance of our approach as a function of the number of local GP experts, M_a, as well as their size, M_b. For this experiment we set the number of mixture components, R_M and R_O, to be 1/2/1/5, and used the same settings as before for the other parameters, except when evaluating the size of each local GP expert, where we set the number of local GP experts to 5/2/5/5. As shown in the figure, even with a small number (size) of local GP experts, we still achieve good performance.

4.4 HANDLING MISSING DATA

In this setting, we evaluate the capabilities of our approaches to handle missing data. We assume that the initial set has no missing values, but a fixed set of joint angles are missing for all incoming frames. Our approach is able to cope with missing data with only two small modifications. First, particles are weighted only based on the observed dimensions. Furthermore, when updating the prediction and observation models, we employ mean imputation for the missing observation dimensions. Fig. 6 shows reconstructions of the missing dimensions for all our motions, which consist of the two legs for walking, and the left arm for the golf swing, swimming and exercise motions. We can see that our approach is able to reconstruct the missing parts well.

Finally, to evaluate the tracking performance as a function of the number of missing dimensions, we randomly generate the indices for the missing dimensions and use the same missing dimensions for all incoming frames. The second row of Fig. 2 shows that, compared to the baselines, our methods perform well even when the number of missing dimensions is 20 (1/3 of the skeleton) for all the motions. In addition, our LGP particle filter outperforms our SOGP variant.
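A minimal sketch of the two modifications for missing data, assuming a Gaussian observation density for the weighting and filling missing entries with a predicted mean (one plausible reading of the mean imputation used here):

```python
import numpy as np

def weight_observed_only(y, y_pred, obs_mask, sigma2=0.1):
    # Particle weights computed only over the observed dimensions.
    diff = (y - y_pred)[:, obs_mask]
    w = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / sigma2)
    return w / w.sum()

def impute_missing(y, y_mean, obs_mask):
    # Mean imputation of the missing dimensions before the model update.
    y_full = y.copy()
    y_full[~obs_mask] = y_mean[~obs_mask]
    return y_full

rng = np.random.default_rng(5)
D_y = 62
obs_mask = np.ones(D_y, dtype=bool)
obs_mask[rng.choice(D_y, size=20, replace=False)] = False   # 20 missing dimensions
y = rng.standard_normal(D_y)
y_pred = rng.standard_normal((50, D_y))                      # per-particle predicted observations
print(weight_observed_only(y, y_pred, obs_mask)[:5])
print(impute_missing(y, y_pred.mean(axis=0), obs_mask)[:5])
```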

5 CONCLUSION

In this paper we have presented a novel non-parametric approach to Bayesian filtering, where the observation and prediction models are constructed using a mixture model with GP components learned in an online fashion. We have demonstrated that our approach can capture multimodality accurately and efficiently via online updates. We have explored two fast GP variants for updating which keep memory and computation bounded for individual mixture components. We have demonstrated the effectiveness of our approach when tracking different human motions and explored the impact of various parameters on performance. The Local GP particle filter proved superior to our SOGP variant; however, these differences can be mitigated by using more mixture components when using SOGP. In the future, we plan to investigate the usefulness of our approach in other settings such as shape deformation estimation and financial time series.

References

[1] K. Chalupka, C. K. I. Williams, and I. Murray. A Framework for Evaluating Approximation Methods for Gaussian Process Regression. JMLR, 2013.
[2] CMU Mocap Database. http://mocap.cs.cmu.edu/.
[3] L. Csato and M. Opper. Sparse online Gaussian processes. Neural Computation, 2002.
[4] M. P. Deisenroth, M. F. Huber, and U. D. Hanebeck. Analytic moment-based Gaussian process filtering. In ICML, 2009.
[5] A. Doucet, S. Godsill, and C. Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 2000.
[6] T. Hastie and C. Loader. Local Regression: Automatic Kernel Carpentry. Statistical Science, 1993.
[7] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 1991.
[8] O. C. Jenkins and M. Matarić. A spatio-temporal extension to Isomap nonlinear dimension reduction. In ICML, 2004.
[9] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte. A new approach for filtering nonlinear systems. In American Control Conference, 1995.
[10] J. Ko and D. Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. In IROS, 2008.
[11] J. Ko and D. Fox. Learning GP-BayesFilters via Gaussian process latent variable models. In RSS, 2009.
[12] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. JMLR, 6:1783–1816, 2005.
[13] N. Lawrence, M. Seeger, and R. Herbrich. Fast Sparse Gaussian Process Methods: The Informative Vector Machine. In NIPS, 2003.
[14] C.-S. Lee and A. Elgammal. Coupled Visual and Kinematics Manifold Models for Human Motion Analysis. IJCV, 2010.
[15] S. Levine, J. Wang, A. Haraux, Z. Popović, and V. Koltun. Continuous character control with low-dimensional embeddings. In SIGGRAPH, 2012.
[16] K. Moon and V. Pavlovic. Impact of dynamics on subspace embedding and tracking of sequences. In CVPR, pages 198–205, 2006.
[17] D. Nguyen-Tuong, J. Peters, and M. Seeger. Local Gaussian process regression for real time online model learning and control. In NIPS, 2008.
[18] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression. JMLR, 2005.
[19] S. Roweis and L. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
[20] S. Schaal and C. G. Atkeson. Constructive Incremental Learning From Only Local Information. Neural Computation, 1998.
[21] J. Tenenbaum, V. de Silva, and J. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000.
[22] R. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In AISTATS, 2010.
[23] R. Urtasun and T. Darrell. Sparse probabilistic regression for activity-independent human pose inference. In CVPR, 2008.
[24] R. Urtasun, D. Fleet, A. Hertzman, and P. Fua. Priors for people tracking from small training sets. In ICCV, 2005.
[25] R. Urtasun, D. Fleet, and P. Fua. 3D people tracking with Gaussian process dynamical models. In CVPR, 2006.
[26] R. Urtasun, D. Fleet, A. Geiger, J. Popović, T. Darrell, and N. Lawrence. Topologically-constrained latent variable models. In ICML, 2008.
[27] S. Van Vaerenbergh, M. Lázaro-Gredilla, and I. Santamaría. Kernel recursive least-squares tracker for time-varying regression. IEEE Transactions on Neural Networks and Learning Systems, 2012.
[28] J. Wang, D. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. PAMI, 30(2):283–298, 2008.
[29] A. Yao, J. Gall, L. V. Gool, and R. Urtasun. Learning probabilistic non-linear latent variable models for tracking complex activities. In NIPS, 2011.