Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models
Alessandro Davide Ialongo 1 2, Mark van der Wilk 3, James Hensman 3, Carl Edward Rasmussen 1

1 Computational and Biological Learning Group, University of Cambridge; 2 Max Planck Institute for Intelligent Systems, Tübingen; 3 PROWLER.io. Correspondence to: Alessandro Davide Ialongo <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s). arXiv:1906.05828v1 [stat.ML] 13 Jun 2019.

Abstract

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: github.com/ialong/GPt.

1 Introduction

Many time series are well explained by assuming their progression is determined by an underlying dynamical system. A model of this dynamical system can be used to make strong predictions of the future behaviour of the time series. We often require predictions of systems with unknown dynamics, be it for economic models of financial markets, or models of physical systems to be controlled with model predictive control or model-based reinforcement learning. Particularly in low-data regimes, estimates of the uncertainty in the predictions are crucial for making robust decisions, as the expected utility of an action can depend strongly on the distribution of the outcome (von Neumann et al., 1944). A striking example of this is model-based policy search (Deisenroth & Rasmussen, 2011).

There are many approaches to time series prediction. Within the machine learning community, autoregressive (AR) models (Billings, 2013) and state-space models (SSMs) are popular, in part due to the relative ease with which predictions can be made. AR predictions are obtained by a learned mapping from the H last observations to the next one, whereas SSMs model the underlying dynamical system by learning a transition function which maps a system state forward in time. At each point in time, the state contains sufficient information for predicting both future states and observations. While AR models can be easier to train, SSMs have the potential to be more data efficient, due to their minimal state representation (the sketch at the end of this section contrasts the two prediction schemes).

We aim to learn complex, possibly stochastic, non-linear dynamical systems from noisy data, following the Bayesian paradigm in order to capture model uncertainty. We use a Gaussian process (GP) prior (Rasmussen & Williams, 2005) for the transition function, giving the Gaussian Process State Space Model (GPSSM) (Frigola et al., 2013). Aside from capturing uncertainty, GPs are non-parametric, guaranteeing that our model complexity will not saturate as we observe more data.

Despite the challenge of performing accurate approximate inference in the GPSSM, an impressive amount of progress has been made (Frigola et al., 2014; McHutchon, 2014; Eleftheriadis et al., 2017; Ialongo et al., 2017; Doerr et al., 2018), helped along by the development of elegant variational inference techniques (Titsias, 2009; Hensman et al., 2013) which retain the key non-parametric property of GPs.

In this work, we improve variational approximations by critically examining existing methods and their failure modes. We propose a family of non-factorised variational posteriors which alleviate crucial problems with earlier approaches while maintaining the same efficiency. We refer to the proposed approach as VCDT: variationally coupled dynamics and trajectories.
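To make the AR/SSM distinction concrete, here is a minimal sketch (our illustration, not code from the paper; `ar_step`, `transition` and `emit` are hypothetical placeholders for learned maps) of how the two model classes produce a one-step prediction:

```python
import numpy as np

def ar_step(y_hist):
    """AR: map the H most recent observations directly to the next one.
    Placeholder for a learned mapping."""
    return y_hist.mean(axis=0)

def transition(x):
    """SSM: map the latent state one step forward in time.
    Placeholder for a learned transition function."""
    return 0.9 * x

def emit(x):
    """SSM: map the latent state to an observation.
    Placeholder for a learned emission map."""
    return x[:1]

# AR prediction operates on the observations themselves:
y_hist = np.random.randn(5, 1)   # the H = 5 most recent observations
y_next_ar = ar_step(y_hist)

# SSM prediction rolls a minimal latent state forward, then emits:
x_t = np.zeros(2)                # current latent state
y_next_ssm = emit(transition(x_t))
```

The SSM only ever carries its state x_t forward in time, which is the minimal state representation the data-efficiency argument above refers to.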
2 Background

State-space models are common in machine learning, and appear in many forms. At the most basic level, linear-Gaussian state-space models can be learned by maximum likelihood, combining Kalman smoothing and EM, for instance (Roweis & Ghahramani, 1999). Extensions have been developed for deterministic non-linear transitions (Ghahramani & Roweis, 1999). Recurrent neural networks, like the popular LSTM (Hochreiter & Schmidhuber, 1997), learn deterministic mappings with state-space structure for sequence prediction, and have seen successful recent use in a wide range of tasks including translation (Sutskever et al., 2014).

2.1 Bayesian non-parametric approaches

We are particularly interested in prediction tasks which require good estimates of uncertainty, for example, for use in model-based reinforcement learning systems (Deisenroth & Rasmussen, 2011). These applications distinguish themselves by requiring 1) continual learning, as datasets are incrementally gathered, and 2) uncertainty estimates to ensure that the policies learned are robust to the many dynamics that are consistent with a small dataset. Bayesian non-parametric models provide a unified and elegant way of solving both these problems: model complexity scales with the size of the dataset and is controlled by the Bayesian Occam's razor (Rasmussen & Ghahramani, 2000). In this work, we focus on approximations (Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016) which offer improved computational scaling with the ability to fully recover the original non-parametric model.
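The sparse variational approximations cited above summarise the GP posterior through M inducing variables u = f(Z), with a variational distribution q(u) = N(m, S); the approximate posterior over function values at new inputs X* then has mean K_{*u} K_{uu}^{-1} m and covariance K_{**} - K_{*u} K_{uu}^{-1} (K_{uu} - S) K_{uu}^{-1} K_{u*}. The following is a minimal numpy sketch of these standard predictive equations (our own illustration; the function names and RBF kernel choice are ours, not the paper's):

```python
import numpy as np

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between inputs A (N x D) and B (M x D)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sparse_gp_predict(Xs, Z, m, S, jitter=1e-8):
    """Mean and covariance of q(f(Xs)) given q(u) = N(m, S) at inducing points Z."""
    Kuu = rbf(Z, Z) + jitter * np.eye(len(Z))
    Ksu = rbf(Xs, Z)
    Kss = rbf(Xs, Xs)
    A = np.linalg.solve(Kuu, Ksu.T).T        # K_su K_uu^{-1}
    mean = A @ m                             # K_su K_uu^{-1} m
    cov = Kss - A @ (Kuu - S) @ A.T          # K_ss - K_su K_uu^{-1} (K_uu - S) K_uu^{-1} K_us
    return mean, cov
```

The cost is cubic only in the number of inducing points M, not in the number of data points, and in the regression setting the bound can become tight as the inducing inputs approach the training inputs; this is the sense in which these approximations can fully recover the original non-parametric model.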
2.2 Gaussian Process State Space Models

Recurrent models with GP transitions (GPSSM) have been proposed in a variety of ways, each with its own inference method (Wang et al., 2005; Ko & Fox, 2009; Turner et al., 2010; Frigola et al., 2013). Early methods relied on maximum a posteriori inference for either the latent states (Wang et al., 2005; Ko & Fox, 2009) or the transition function (Turner et al., 2010). Frigola et al. (2013) presented the first fully Bayesian treatment with a particle MCMC method that sampled over the latent states and the GP transition function. Due to issues with computational complexity and sampling in higher dimensional latent spaces, later attention turned to variational methods. All variational inference schemes have so far relied on independence assumptions between latent states and transition function (Frigola et al., 2014; McHutchon, 2014; Ialongo et al., 2017), sometimes even factorising the state distribution over time (Mattos et al., 2016). Eleftheriadis et al. (2017) introduced a recognition model to help with optimisation of the variational distribution, while keeping the mean-field assumption. Recently, Doerr et al. (2018) introduced the first method to account for the dependence between latent states and transition function, but it has severe limitations which we will discuss and show experimentally. Bui et al. (2016) investigated Power Expectation Propagation as an alternative approach for fitting factorised approximate posteriors.

2.3 Variational inference approaches

Variational methods often make simplifying assumptions about the form of the approximate posterior to improve computational tractability. This may result in significantly biased solutions. The bias is particularly severe if independence assumptions are made where strong correlations are actually present (Turner & Sahani, 2011). Clearly, in dynamical systems, the trajectory of the latent states depends strongly on the dynamics. Hence, we focus on improving existing variational inference schemes by removing the independence assumption between latent states and transition function. We identify a general method that performs well across varying noise scales.

3 The Model

3.1 State-space model structure

We model discrete-time sequences of observations $Y = \{y_t\}_{t=1}^{T}$, where $y_t \in \mathbb{R}^E$, by positing that each data point is generated by a corresponding latent variable $x_t \in \mathbb{R}^D$ (the system's "state"). These latent variables form a Markov chain, implying that, for any time-step $t$, we can generate $x_{t+1}$ by conditioning only on $x_t$ and the transition function $f$. We take the system to evolve according to a single, time-invariant transition function, which we would like to infer. While our inference scheme provides the freedom to choose arbitrary transition $p(x_{t+1} \mid f, x_t)$ and observation $p(y_t \mid x_t)$ density functions, in keeping with previous literature, we use Gaussians. We also assume a linear mapping between $x_t$ and the mean of $y_t \mid x_t$. This does not limit the range of systems that can be modelled, though it may require us to choose a higher dimensional latent state (Frigola, 2015). We make this choice to reduce the non-identifiabilities between transitions and emissions. To reduce them further we may also impose constraints on the scale of $f$ or of the linear mapping $C$. This allows the model to be expressed by the following equations:

$$f \sim \mathcal{GP}(m(\cdot),\, k(\cdot, \cdot)) \quad (1)$$
$$x_1 \sim \mathcal{N}(\mu_{p_1},\, \Sigma_{p_1}) \quad (2)$$
$$x_{t+1} \mid f, x_t \sim \mathcal{N}(f(x_t),\, Q) \quad (3)$$
$$y_t \mid x_t \sim \mathcal{N}(Cx_t + d,\, R) \quad (4)$$

where we take the "process noise" covariance matrix $Q$ to be diagonal, encouraging the transition function to account for all correlations between latent dimensions.
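As a concrete illustration of the generative process in equations (1)-(4), the sketch below draws one trajectory from the model. Because $f$ is drawn from a GP, a coherent sample is obtained by sequentially conditioning on the function values already generated along the trajectory (here with a zero-mean RBF-kernel GP, independent across output dimensions; all names and settings are our illustrative choices, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel k(., .) between inputs A (N x D) and B (M x D)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)

def sample_gpssm(T, C, d, Q, R, mu_p1, Sigma_p1):
    """Draw latent states X and observations Y from equations (1)-(4)."""
    E, D = C.shape
    X = np.zeros((T, D))
    F = np.zeros((T - 1, D))                         # sampled values f(x_t)
    X[0] = rng.multivariate_normal(mu_p1, Sigma_p1)  # eq. (2): initial state
    for t in range(T - 1):
        xt = X[t:t + 1]
        if t == 0:
            mean = np.zeros(D)                       # zero mean function m(.)
            var = rbf(xt, xt)[0, 0]
        else:
            # GP posterior at x_t, conditioned on the function values
            # already sampled along the trajectory (eq. 1).
            Kpp = rbf(X[:t], X[:t]) + 1e-8 * np.eye(t)
            kp = rbf(xt, X[:t])[0]
            w = np.linalg.solve(Kpp, kp)
            mean = F[:t].T @ w
            var = rbf(xt, xt)[0, 0] - kp @ w
        F[t] = mean + np.sqrt(max(var, 0.0)) * rng.standard_normal(D)
        X[t + 1] = rng.multivariate_normal(F[t], Q)  # eq. (3): process noise Q
    Y = np.stack([rng.multivariate_normal(C @ x + d, R) for x in X])  # eq. (4)
    return X, Y

# Example: D = 2 latent dimensions, E = 1 observed dimension, T = 50 steps.
D, E, T = 2, 1, 50
X, Y = sample_gpssm(T, C=rng.standard_normal((E, D)), d=np.zeros(E),
                    Q=0.01 * np.eye(D), R=0.1 * np.eye(E),
                    mu_p1=np.zeros(D), Sigma_p1=np.eye(D))
```

Note the sequential conditioning in the loop: the state trajectory and the GP posterior are coupled at every step. This is precisely the dependence that a mean-field variational posterior discards, and that the approach proposed in this paper retains.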