6 Gaussian Processes 1 6.1 Introduction ...1 6.2 the Fernique Inequality

6 Gaussian processes
6.1 Introduction
6.2 The Fernique inequality
6.3 Concentration of Lipschitz functionals
6.3.1 The Pisier-Maurey approach
6.3.2 The smart path method
6.3.3 The stochastic calculus (Brownian motion) method
6.3.4 The Gaussian isoperimetric inequality
6.4 Problems
6.5 Notes

Printed: 8 December 2015. Version: 7dec2015. Mini-empirical. © David Pollard

Chapter 6
Gaussian processes

Section 6.1 states three beautiful facts about multivariate normal distributions: the Sudakov inequality; the Fernique comparison inequality; and the concentration inequality for Lipschitz functionals, with the Borell inequality as a special case. Section 6.2 sketches a proof of the Fernique inequality, then shows how it implies the Sudakov inequality. Section 6.3 presents four different proofs for slightly different versions of the Lipschitz concentration inequality. The proofs use techniques that have proven themselves most useful for the study of Gaussian processes.

6.1 Introduction

This chapter has two aims:

(i) to describe the technical tools that are needed (in Chapter 7) to establish the various equivalences, for centered Gaussian processes, between the finiteness of $P\sup_{t\in T} X_t$ and the existence of majorizing measures, as described in Section 4.6;

(ii) to describe some surprising properties of Gaussian processes that have been the starting point for a flourishing literature on the concentration of measure phenomenon, as discussed in Chapters 11 and 12.

Happily the two aims overlap.

An essential ingredient for Talagrand's majorizing measure argument is an inequality usually attributed to Sudakov (but consult the references in Section 6.5 for a more complete account of the history).
<1> Theorem. ("Sudakov's minoration") Let $Y := (Y_1, Y_2, \dots, Y_n)$ have a centered (zero means) multivariate normal distribution, with $P|Y_j - Y_k|^2 \ge \delta^2$ for all $j \ne k$. Then

$$(4\pi)^{1/2}\, P\max_{i\le n} Y_i \ge \delta\sqrt{\log_2 n}.$$

Remark. The lower bound is sharp within a constant, in the following sense. If $P|Y_j - Y_k|^2 \le \delta^2$ for all $j \ne k$ then

$$P\max_i Y_i = PY_1 + P\max_i (Y_i - Y_1) = P\max_i (Y_i - Y_1)$$

and

$$\exp\Bigl(\bigl(P\max_i (Y_i - Y_1)/2\delta\bigr)^2\Bigr) \le P\max_i \exp\bigl((Y_i - Y_1)^2/4\delta^2\bigr) \quad\text{by Jensen}$$
$$\le n\,P\exp(W^2) \quad\text{with } W \sim N(0, \tfrac14).$$

Thus $P\max_i Y_i$ is bounded above by $2\delta\sqrt{\log(\sqrt{2}\,n)}$.

The minoration can be proved (Section 6.2) by using a comparison theorem due to Fernique (1975, page 18).

<2> Fernique's comparison inequality. Suppose $X$ and $Y$ both have centered (zero means) multivariate normal distributions, with

$$P|X_i - X_j|^2 \le P|Y_i - Y_j|^2 \quad\text{for all } i, j.$$

Then

$$Pf\bigl(\max_i X_i - \min_i X_i\bigr) \le Pf\bigl(\max_i Y_i - \min_i Y_i\bigr)$$

for each increasing, convex function $f$ on $\mathbb{R}^+$.

Section 6.2 sketches the proof of this inequality. The method of proof illustrates an important technique: construct a path between $X$ and $Y$ along which the expected value of interest increases.

The other ingredient in the majorizing measure argument is a concentration inequality for the supremum of a Gaussian process. To avoid measurability issues, assume the index set is at worst countably infinite.

<3> Borell's inequality. Suppose $\{Y_t : t \in T\}$ is a Gaussian process with $T$ finite or countably infinite. Assume both $m := P\sup_{t\in T} Y_t < \infty$ and $\sigma^2 := \sup_{t\in T}\operatorname{var}(Y_t) < \infty$. Then

$$P\bigl\{|\sup_{t\in T} Y_t - m| \ge \sigma u\bigr\} \le 2\exp(-u^2/2) \quad\text{for all } u \ge 0.$$

Consequently, $\|\sup_{t\in T} Y_t - m\|_{\Psi_2} \le C_{\rm Bor}\,\sigma$, with $C_{\rm Bor}$ a universal constant.
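As a rough numerical sanity check of the three bounds stated above, consider the special case of independent $N(0,1)$ variables, for which $P|Y_j - Y_k|^2 = 2$, so that $\delta = \sqrt{2}$ satisfies the hypotheses of both Theorem <1> and its Remark, and $\sigma = 1$. The Python sketch below (standard library only) estimates $P\max_{i\le n} Y_i$ by simulation and compares it with the two bounds, then compares an empirical two-sided tail with the Borell bound. The function names, $n = 64$, the seed, and the number of replications are illustrative assumptions, not taken from the text, and $m$ is replaced by its Monte Carlo estimate, so the check is only approximate.

```python
import math
import random


def sudakov_lower(delta, n):
    """Lower bound from Theorem <1>: delta * sqrt(log2 n) / sqrt(4 pi)."""
    return delta * math.sqrt(math.log2(n)) / math.sqrt(4.0 * math.pi)


def remark_upper(delta, n):
    """Upper bound from the Remark: 2 * delta * sqrt(log(sqrt(2) * n))."""
    return 2.0 * delta * math.sqrt(math.log(math.sqrt(2.0) * n))


def simulate_maxima(n, reps, seed=0):
    """Draw `reps` independent copies of max(Y_1, ..., Y_n), Y_i iid N(0, 1)."""
    rng = random.Random(seed)
    return [max(rng.gauss(0.0, 1.0) for _ in range(n)) for _ in range(reps)]


n, reps = 64, 5000          # illustrative sizes (assumptions)
delta = math.sqrt(2.0)      # P|Y_j - Y_k|^2 = 2 for independent N(0, 1)'s
sigma = 1.0                 # sup_t var(Y_t) = 1 in this special case

maxima = simulate_maxima(n, reps)
m_hat = sum(maxima) / reps  # Monte Carlo stand-in for m = P max_i Y_i

lower = sudakov_lower(delta, n)
upper = remark_upper(delta, n)

u = 2.0                     # one illustrative tail level
tail_hat = sum(abs(x - m_hat) >= sigma * u for x in maxima) / reps
borell_bound = 2.0 * math.exp(-u * u / 2.0)

print(f"lower {lower:.3f} <= estimate {m_hat:.3f} <= upper {upper:.3f}")
print(f"empirical tail {tail_hat:.4f} <= Borell bound {borell_bound:.4f}")
```

In this independent case the estimate sits well inside the interval between the two bounds, which illustrates the sense in which the minoration is sharp only up to a constant.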
In special cases (such as independent $N(0,1)$-distributed variables, as shown by the Problems to Chapter 4) one can get tighter bounds, but Borell's inequality has the great virtue of being impervious to the effects of possible dependence between the $Y_t$.

Theorem <3> can be deduced from a more basic fact about the $N(0, I_n)$ distribution on $\mathbb{R}^n$. For vectors in $\mathbb{R}^n$ write $|\cdot|$ for the usual $\ell^2$ distance: $|x|^2 = \sum_i x_i^2$.

<4> Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function, with $\|f\|_{\rm Lip} \le \kappa$. That is, $|f(x) - f(y)| \le \kappa|x - y|$ for all $x, y \in \mathbb{R}^n$. Then, for a universal constant $C$,

$$\gamma_n\{f(x) \ge \gamma_n f + \kappa u\} \le e^{-u^2/(2C)} \quad\text{for all } u \ge 0,$$

where $\gamma_n$ denotes the $N(0, I_n)$ distribution.

Remark. Notice that the dimension $n$ does not appear explicitly in the upper bound, although it might enter implicitly through $\kappa$ for some functionals.

This Theorem provides a good illustration of several different arguments that have been developed for Gaussian processes. Section 6.3 contains four different proofs of the Theorem. The easiest method (Pisier-Maurey, subsection 6.3.1) gives the concentration bound with $C = \pi^2/4$. The smart path method (subsection 6.3.2) improves the constant to 2. The stochastic calculus method (subsection 6.3.3) improves the constant to 1. The deepest method (subsection 6.3.4), based on the Gaussian isoperimetric inequality, again gives the constant 1 but with centering at the median of $f(x)$. Together the four methods offer a mini-course in Gaussian tricks.

Remark. The constant $C = 1$ is the best possible in general. If $u$ is a unit vector the linear function $f(x) = u'x$ is Lipschitz with $\kappa = 1$. Under $\gamma_n$ the function $f(x)$ has a $N(0,1)$ distribution, whose tails decrease like $\exp(-u^2/2)$.

Let me show you how Theorem <4> implies the analog of the Borell inequality with the $u^2/2$ in the exponent replaced by $u^2/(2C)$ for whichever constant $C$ you feel comfortable to use.
(Different $C$'s just lead to different values for $C_{\rm Bor}$, but have no important effect on the arguments in Chapter 7.)

Suppose $T = \mathbb{N}$. Define $M_n = \max_{i\le n} Y_i$. For each fixed $n$ we can think of each $Y_i$ as a linear functional, $Y_i(x) = \mu_i + a_i'x$, on $\mathbb{R}^n$ equipped with $\gamma_n$, with $A = [a_1, \dots, a_n]$ an $n \times n$ matrix with $A'A$ equal to the variance matrix of $(Y_1, \dots, Y_n)$. That gives $|a_i|_2^2 = \operatorname{var}(Y_i) \le \sigma^2$.

The functional $f(x) := \max_{i\le n} Y_i(x)$ is Lipschitz:

$$|f(x) - f(z)| = \bigl|\max_{i\le n}(\mu_i + a_i'x) - \max_{i\le n}(\mu_i + a_i'z)\bigr|$$
$$\le \max_{i\le n}\bigl|(\mu_i + a_i'x) - (\mu_i + a_i'z)\bigr|$$
$$\le \max_{i\le n}|a_i|\,|x - z| \quad\text{by Cauchy-Schwarz}$$
$$\le \sigma|x - z|.$$

Theorem <4> gives

$$P\{M_n \ge PM_n + \sigma u\} \le e^{-u^2/(2C)},$$

which, because $PM_n \le m$, implies

$$P\{M_n > r\} \le e^{-u^2/(2C)} \quad\text{for } r > m + \sigma u \text{ and each } n.$$

In the limit, as $n \to \infty$, we get a one-sided analog of Theorem <3>. Repeat the argument with $f$ replaced by $-f$ to deduce the two-sided bound.

6.2 The Fernique inequality

The following sketch of Fernique's argument summarizes the more detailed exposition by Pollard (2001, Section 12.3).

First a smoothing argument shows that the function $f$ could be assumed to be infinitely differentiable with second derivative having compact support, which sidesteps integrability questions and allows uninhibited appeals to integration-by-parts.

Suppose $X \sim N(0, V_0)$ and $Y \sim N(0, V_1)$, with $X$ independent of $Y$. The main idea is to interpolate between $X$ and $Y$ along a path $X(\theta) = \sqrt{1-\theta}\,X + \sqrt{\theta}\,Y$, for $0 \le \theta \le 1$.
The random vector $X(\theta)$ has a $N(0, V_\theta)$ distribution, where

$$V_\theta = (1-\theta)V_0 + \theta V_1 = V_0 + \theta D \quad\text{with } D := V_1 - V_0.$$

By Fourier inversion, the $N(0, V_\theta)$ distribution has density

$$g_\theta(x) = (2\pi)^{-n}\int_{\mathbb{R}^n} \exp\bigl(-ix't - \tfrac12 t'V_\theta t\bigr)\,dt.$$

Differentiation under the integral sign leads to the identity

$$\frac{\partial g_\theta(x)}{\partial\theta} = \frac12\sum_{j=1}^n\sum_{k=1}^n D_{j,k}\,\frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}.$$

It remains to show that the function

$$H(\theta) := Pf\bigl(\max_i X_i(\theta) - \min_i X_i(\theta)\bigr) = \int_{\mathbb{R}^n} f\bigl(\max_i x_i - \min_i x_i\bigr)\,g_\theta(x)\,dx$$

is increasing in $\theta$, or that

$$H'(\theta) = \frac12\sum_{j=1}^n\sum_{k=1}^n D_{j,k}\int_{\mathbb{R}^n} f\bigl(\max_i x_i - \min_i x_i\bigr)\,\frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}\,dx$$

is nonnegative. Split the range of integration according to which $x_i$ is the maximum and which $x_i$ is the minimum. On each region integration-by-parts leads to a representation

$$H'(\theta) = \frac12\sum_{j,k}\{j < k\}\,(D_{j,j} - 2D_{j,k} + D_{k,k})\,(A_{j,k} + B_{j,k}),$$

where $A_{j,k}$ is an $(n-1)$-dimensional integral of the nonnegative function $f'g_\theta$ over a boundary set and $B_{j,k}$ is an $n$-dimensional integral of the nonnegative function $f''g_\theta$. And the coefficient $(D_{j,j} - 2D_{j,k} + D_{k,k})$ is also nonnegative because it equals

$$P|Y_j - Y_k|^2 - P|X_j - X_k|^2 \ge 0 \quad\text{by assumption.}$$

Done.

Sudakov's minoration follows directly from the Fernique inequality with $f$ chosen as the identity function. Without loss of generality suppose $n$ equals $2^k$, a power of 2, so that the index set can be identified with $S := \{-1, +1\}^k$. Construct the process $\{X_s : s \in S\}$ from a set $Z_1, \dots, Z_k$ of independent $N(0,1)$'s,

$$X_s := \tfrac12\,\delta\,k^{-1/2}\sum_{j=1}^k s_j Z_j,$$

for which $P|X_s - X_{s'}|^2 = \tfrac14\,\delta^2 k^{-1}\sum_j (s_j - s_j')^2 \le \delta^2$.
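The comparison process $\{X_s\}$ just constructed is small enough to examine by brute force. The Python sketch below (standard library only) enumerates $S = \{-1,+1\}^k$, evaluates the second moments $P|X_s - X_{s'}|^2$ from the displayed formula, and simulates $\max_s X_s$. It also compares the brute-force maximum with the expression $\tfrac12\delta k^{-1/2}\sum_j |Z_j|$, which results from choosing $s_j$ to be the sign of $Z_j$, and the simulated mean with $\delta\sqrt{k/(2\pi)}$, which uses the standard fact $P|Z_j| = \sqrt{2/\pi}$. Those two closed forms, and the choices $k = 4$, $\delta = 1$, the seed, and the number of replications, are assumptions made for this illustration rather than statements quoted from the text.

```python
import itertools
import math
import random

k = 4                        # illustrative choice, so n = 2**k = 16
delta = 1.0                  # illustrative choice
S = list(itertools.product((-1, 1), repeat=k))
scale = 0.5 * delta / math.sqrt(k)


def second_moment(s, t):
    """P|X_s - X_t|^2 = (delta^2 / (4k)) * sum_j (s_j - t_j)^2."""
    return 0.25 * delta ** 2 / k * sum((a - b) ** 2 for a, b in zip(s, t))


# Largest second moment over all pairs; attained when t = -s.
max_sq = max(second_moment(s, t) for s in S for t in S)

rng = random.Random(0)
reps = 5000
total = 0.0
worst_gap = 0.0
for _ in range(reps):
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # Maximum over all 2**k sign vectors, computed by enumeration.
    brute = max(scale * sum(sj * zj for sj, zj in zip(s, z)) for s in S)
    # Same maximum via the sign choice s_j = sign(Z_j).
    closed = scale * sum(abs(zj) for zj in z)
    worst_gap = max(worst_gap, abs(brute - closed))
    total += brute

mean_max = total / reps
exact = delta * math.sqrt(k / (2.0 * math.pi))

print(f"max_s,t P|X_s - X_t|^2 = {max_sq:.4f} (delta^2 = {delta ** 2:.4f})")
print(f"simulated P max_s X_s = {mean_max:.4f}, closed form = {exact:.4f}")
```

Enumeration costs $2^k$ evaluations per draw, so this check is only practical for small $k$.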
