6 Gaussian Processes


6 Gaussian processes
6.1 Introduction
6.2 The Fernique inequality
6.3 Concentration of Lipschitz functionals
6.3.1 The Pisier-Maurey approach
6.3.2 The smart path method
6.3.3 The stochastic calculus (Brownian motion) method
6.3.4 The Gaussian isoperimetric inequality
6.4 Problems
6.5 Notes

Mini-empirical, version 7dec2015, printed 8 December 2015. © David Pollard

Chapter 6 Gaussian processes

Section 6.1 states three beautiful facts about multivariate normal distributions: the Sudakov inequality; the Fernique comparison inequality; and the concentration inequality for Lipschitz functionals, with the Borell inequality as a special case. Section 6.2 sketches a proof of the Fernique inequality, then shows how it implies the Sudakov inequality. Section 6.3 presents four different proofs for slightly different versions of the Lipschitz concentration inequality. The proofs use techniques that have proven themselves most useful for the study of Gaussian processes.

6.1 Introduction

This chapter has two aims:

(i) to describe the technical tools that are needed (in Chapter 7) to establish the various equivalences, for centered Gaussian processes, between the finiteness of $P \sup_{t\in T} X_t$ and the existence of majorizing measures, as described in Section 4.6;

(ii) to describe some surprising properties of Gaussian processes that have been the starting point for a flourishing literature on the concentration of measure phenomenon, as discussed in Chapters 11 and 12.

Happily the two aims overlap. (Here and below, $P$ applied to a random variable denotes its expected value.)
An essential ingredient for Talagrand's majorizing measure argument is an inequality usually attributed to Sudakov (but consult the references in Section 6.5 for a more complete account of the history).

<1> Theorem. ("Sudakov's minoration") Let $Y := (Y_1, Y_2, \dots, Y_n)$ have a centered (zero means) multivariate normal distribution, with $P|Y_j - Y_k|^2 \ge \delta^2$ for all $j \ne k$. Then $(4\pi)^{1/2}\, P \max_{i\le n} Y_i \ge \delta \sqrt{\log_2 n}$.

Remark. The lower bound is sharp within a constant, in the following sense. If $P|Y_j - Y_k|^2 \le \delta^2$ for all $j \ne k$ then

$$P \max_i Y_i = P Y_1 + P \max_i (Y_i - Y_1) = P \max_i (Y_i - Y_1)$$

and

$$\exp\Big( \big( P \max_i (Y_i - Y_1) / 2\delta \big)^2 \Big) \le P \max_i \exp\big( (Y_i - Y_1)^2 / 4\delta^2 \big) \quad \text{by Jensen}$$
$$\le n\, P \exp(W^2) \quad \text{with } W \sim N(0, \tfrac{1}{4}).$$

Thus, since $P \exp(W^2) = \sqrt{2}$, the expected maximum $P \max_i Y_i$ is bounded above by $2\delta \sqrt{\log(\sqrt{2}\, n)}$.

The minoration can be proved (Section 6.2) by using a comparison theorem due to Fernique (1975, page 18).

<2> Fernique's comparison inequality. Suppose $X$ and $Y$ both have centered (zero means) multivariate normal distributions, with

$$P|X_i - X_j|^2 \le P|Y_i - Y_j|^2 \quad \text{for all } i, j.$$

Then

$$P f(\max_i X_i - \min_i X_i) \le P f(\max_i Y_i - \min_i Y_i)$$

for each increasing, convex function $f$ on $\mathbb{R}^+$.

Section 6.2 sketches the proof of this inequality. The method of proof illustrates an important technique: construct a path between $X$ and $Y$ along which the expected value of interest increases.

The other ingredient in the majorizing measure argument is a concentration inequality for the supremum of a Gaussian process. To avoid measurability issues, assume the index set is at worst countably infinite.

<3> Borell's inequality. Suppose $\{Y_t : t \in T\}$ is a Gaussian process with $T$ finite or countably infinite. Assume both $m := P \sup_{t\in T} Y_t < \infty$ and $\sigma^2 := \sup_{t\in T} \mathrm{var}(Y_t) < \infty$. Then

$$P\{ |\sup_{t\in T} Y_t - m| \ge \sigma u \} \le 2 \exp(-u^2/2) \quad \text{for all } u \ge 0.$$

Consequently, $\| \sup_{t\in T} Y_t - m \|_{\Psi_2} \le C_{Bor}\, \sigma$, with $C_{Bor}$ a universal constant.

In special cases (such as independent $N(0,1)$-distributed variables, as shown by the Problems to Chapter 4) one can get tighter bounds, but Borell's inequality has the great virtue of being impervious to the effects of possible dependence between the $Y_t$.

Theorem <3> can be deduced from a more basic fact about the $N(0, I_n)$ distribution on $\mathbb{R}^n$. For vectors in $\mathbb{R}^n$ write $|\cdot|$ for the usual $\ell^2$ norm: $|x|^2 = \sum_i x_i^2$.

<4> Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function, with $\|f\|_{Lip} \le \kappa$. That is, $|f(x) - f(y)| \le \kappa |x - y|$ for all $x, y \in \mathbb{R}^n$. Then, for a universal constant $C$,

$$\gamma_n\{ f(x) \ge \gamma_n f + \kappa u \} \le e^{-u^2/(2C)} \quad \text{for all } u \ge 0,$$

where $\gamma_n$ denotes the $N(0, I_n)$ distribution.

Remark. Notice that the dimension $n$ does not appear explicitly in the upper bound, although it might enter implicitly through $\kappa$ for some functionals.

This Theorem provides a good illustration of several different arguments that have been developed for Gaussian processes. Section 6.3 contains four different proofs of the Theorem. The easiest method (Pisier-Maurey, subsection 6.3.1) gives the concentration bound with $C = \pi^2/4$. The smart path method (subsection 6.3.2) improves the constant to 2. The stochastic calculus method (subsection 6.3.3) improves the constant to 1. The deepest method (subsection 6.3.4), based on the Gaussian isoperimetric inequality, again gives the constant 1 but with centering at the median of $f(x)$. Together the four methods offer a mini-course in Gaussian tricks.

Remark. The constant $C = 1$ is the best possible in general. If $u$ is a unit vector the linear function $f(x) = u'x$ is Lipschitz with $\kappa = 1$. Under $\gamma_n$ the function $f(x)$ has a $N(0,1)$ distribution, whose tails decrease like $\exp(-u^2/2)$.

Let me show you how Theorem <4> implies the analog of the Borell inequality with the $u^2/2$ in the exponent replaced by $u^2/(2C)$ for whichever constant $C$ you feel comfortable to use.
(Different $C$'s just lead to different values for $C_{Bor}$, but have no important effect on the arguments in Chapter 7.)

Suppose $T = \mathbb{N}$. Define $M_n = \max_{i\le n} Y_i$. For each fixed $n$ we can think of each $Y_i$ as a linear functional, $Y_i(x) = \mu_i + a_i'x$, on $\mathbb{R}^n$ equipped with $\gamma_n$, with $A = [a_1, \dots, a_n]$ an $n \times n$ matrix with $A'A$ equal to the variance matrix of $(Y_1, \dots, Y_n)$. That gives $|a_i|^2 = \mathrm{var}(Y_i) \le \sigma^2$.

The functional $f(x) := \max_{i\le n} Y_i(x)$ is Lipschitz:

$$|f(x) - f(z)| = \big| \max_{i\le n} (\mu_i + a_i'x) - \max_{i\le n} (\mu_i + a_i'z) \big|$$
$$\le \max_{i\le n} \big| (\mu_i + a_i'x) - (\mu_i + a_i'z) \big|$$
$$\le \max_{i\le n} |a_i|\, |x - z| \quad \text{by Cauchy-Schwarz}$$
$$\le \sigma |x - z|.$$

Theorem <4> gives

$$P\{ M_n \ge P M_n + \sigma u \} \le e^{-u^2/(2C)},$$

which implies (because $P M_n \le m$)

$$P\{ M_n > r \} \le e^{-u^2/(2C)} \quad \text{for } r > m + \sigma u \text{ and each } n.$$

In the limit, as $n \to \infty$, we get a one-sided analog of Theorem <3>. Repeat the argument with $f$ replaced by $-f$ to deduce the two-sided bound.

6.2 The Fernique inequality

The following sketch of Fernique's argument summarizes the more detailed exposition by Pollard (2001, Section 12.3).

First a smoothing argument shows that the function $f$ could be assumed to be infinitely differentiable with second derivative having compact support, which sidesteps integrability questions and allows uninhibited appeals to integration-by-parts.

Suppose $X \sim N(0, V_0)$ and $Y \sim N(0, V_1)$. The main idea is to interpolate between $X$ and $Y$ along a path $X(\theta) = \sqrt{1-\theta}\, X + \sqrt{\theta}\, Y$, for $0 \le \theta \le 1$.
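The path can be explored numerically. The following is only a minimal sketch in Python with NumPy, not part of the original argument. It assumes that $X$ and $Y$ are drawn independently, and the matrices `V0` and `V1`, the dimension, the sample size, and all function names are hypothetical choices made for illustration. `V0` has equicorrelated coordinates and `V1` is the identity, so that $P|X_i - X_j|^2 = 1 \le 2 = P|Y_i - Y_j|^2$. The code estimates the expected spread $P(\max_i X_i(\theta) - \min_i X_i(\theta))$ at several values of $\theta$ (the case where $f$ is the identity) and compares the empirical covariance of $X(\theta)$ with $(1-\theta)V_0 + \theta V_1$, which is what independence would give.

```python
import numpy as np


def interpolate(x, y, theta):
    """Point on the path X(theta) = sqrt(1 - theta) * X + sqrt(theta) * Y."""
    return np.sqrt(1.0 - theta) * x + np.sqrt(theta) * y


def expected_spread(samples):
    """Monte Carlo estimate of the expected value of max_i X_i - min_i X_i."""
    return float(np.mean(samples.max(axis=1) - samples.min(axis=1)))


rng = np.random.default_rng(0)
n, n_samples = 5, 200_000

# Hypothetical covariance matrices (illustrative assumptions only).
V0 = 0.5 * np.eye(n) + 0.5 * np.ones((n, n))  # increments have second moment 1
V1 = np.eye(n)                                # increments have second moment 2

# Independent draws of X ~ N(0, V0) and Y ~ N(0, V1); one row per sample.
x = rng.multivariate_normal(np.zeros(n), V0, size=n_samples)
y = rng.multivariate_normal(np.zeros(n), V1, size=n_samples)

# Estimated expected spread along the path, reusing the same draws for each theta.
thetas = [0.0, 0.25, 0.5, 0.75, 1.0]
spread_by_theta = [expected_spread(interpolate(x, y, t)) for t in thetas]

# Compare the empirical covariance at one theta with (1 - theta) V0 + theta V1.
theta = 0.3
empirical_cov = np.cov(interpolate(x, y, theta), rowvar=False)
target_cov = (1.0 - theta) * V0 + theta * V1
max_cov_error = float(np.abs(empirical_cov - target_cov).max())

print(dict(zip(thetas, spread_by_theta)))
print(max_cov_error)
```

A simulation of this kind can only suggest monotonicity along the path for one choice of covariances; it is no substitute for the derivative calculation.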
With $X$ and $Y$ taken to be independent, the random vector $X(\theta)$ has a $N(0, V_\theta)$ distribution, where

$$V_\theta = (1-\theta) V_0 + \theta V_1 = V_0 + \theta D, \quad \text{with } D := V_1 - V_0.$$

By Fourier inversion, the $N(0, V_\theta)$ distribution has density

$$g_\theta(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \exp\big( -i x't - \tfrac{1}{2} t' V_\theta t \big)\, dt.$$

Differentiation under the integral sign leads to the identity

$$\frac{\partial g_\theta(x)}{\partial \theta} = \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^n D_{j,k} \frac{\partial^2 g_\theta(x)}{\partial x_j \partial x_k}.$$

It remains to show that the function

$$H(\theta) := P f\big( \max_i X_i(\theta) - \min_i X_i(\theta) \big) = \int_{\mathbb{R}^n} f\big( \max_i x_i - \min_i x_i \big)\, g_\theta(x)\, dx$$

is increasing in $\theta$, or that

$$H'(\theta) = \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^n D_{j,k} \int_{\mathbb{R}^n} f\big( \max_i x_i - \min_i x_i \big) \frac{\partial^2 g_\theta(x)}{\partial x_j \partial x_k}\, dx$$

is nonnegative. Split the range of integration according to which $x_i$ is the maximum and which $x_i$ is the minimum. On each region integration-by-parts leads to a representation

$$H'(\theta) = \frac{1}{2} \sum_{j<k} (D_{j,j} - 2D_{j,k} + D_{k,k}) (A_{j,k} + B_{j,k}),$$

where $A_{j,k}$ is an $(n-1)$-dimensional integral of the nonnegative function $f' g_\theta$ over a boundary set and $B_{j,k}$ is an $n$-dimensional integral of the nonnegative function $f'' g_\theta$. And the coefficient $(D_{j,j} - 2D_{j,k} + D_{k,k})$ is also nonnegative because it equals

$$P|Y_j - Y_k|^2 - P|X_j - X_k|^2 \ge 0 \quad \text{by assumption.}$$

Done.

Sudakov's minoration follows directly from the Fernique inequality with $f$ chosen as the identity function. Without loss of generality suppose $n$ equals $2^k$, a power of 2, so that the index set can be identified with $S := \{-1, +1\}^k$. Construct the process $\{X_s : s \in S\}$ from a set $Z_1, \dots, Z_k$ of independent $N(0,1)$'s,

$$X_s := \tfrac{1}{2}\, \delta\, k^{-1/2} \sum_{j=1}^k s_j Z_j,$$

for which $P|X_s - X_{s'}|^2 = \tfrac{1}{4}\, \delta^2 k^{-1} \sum_j (s_j - s_j')^2 \le \delta^2$.
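The construction over $\{-1,+1\}^k$ is small enough to check by brute force. The sketch below (Python with NumPy; the values of `k` and `delta`, the sample size, and the function names are illustrative assumptions) enumerates all sign vectors, computes the largest increment second moment $\tfrac{1}{4}\delta^2 k^{-1} \sum_j (s_j - s_j')^2$ over pairs, and estimates $P \max_s X_s$ by simulation. For comparison it also evaluates $\delta \sqrt{k} / \sqrt{2\pi}$. That closed form is a side calculation, not taken from the text above: the maximum over $s$ is attained at $s_j = \mathrm{sign}(Z_j)$, so $\max_s X_s = \tfrac{1}{2}\delta k^{-1/2} \sum_j |Z_j|$, and $P|Z_j| = \sqrt{2/\pi}$.

```python
import itertools
import math

import numpy as np


def sign_vectors(k):
    """All 2^k elements of {-1, +1}^k, one per row."""
    return np.array(list(itertools.product([-1.0, 1.0], repeat=k)))


def max_increment_second_moment(signs, delta):
    """Largest value of (delta^2 / (4k)) * sum_j (s_j - s'_j)^2 over pairs s, s'."""
    k = signs.shape[1]
    diffs = signs[:, None, :] - signs[None, :, :]
    return float(delta**2 / (4.0 * k) * (diffs**2).sum(axis=2).max())


def expected_max_mc(signs, delta, n_samples, rng):
    """Monte Carlo estimate of the expected value of max_s X_s."""
    k = signs.shape[1]
    z = rng.standard_normal((n_samples, k))
    # Row r, column s holds X_s for the r-th draw of (Z_1, ..., Z_k).
    values = (delta / (2.0 * math.sqrt(k))) * (z @ signs.T)
    return float(values.max(axis=1).mean())


# Illustrative parameter choices (assumptions, not from the text).
k, delta = 4, 1.0
signs = sign_vectors(k)

worst_increment = max_increment_second_moment(signs, delta)
mc_max = expected_max_mc(signs, delta, 50_000, np.random.default_rng(1))
closed_form = delta * math.sqrt(k) / math.sqrt(2.0 * math.pi)

print(worst_increment, mc_max, closed_form)
```

The bound $\delta^2$ on the increments is attained only at antipodal pairs $s' = -s$, which is why the brute-force maximum should agree with `delta**2` up to rounding.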