6 Gaussian processes
  6.1 Introduction
  6.2 The Fernique inequality
  6.3 Concentration of Lipschitz functionals
    6.3.1 The Pisier-Maurey approach
    6.3.2 The smart path method
    6.3.3 The stochastic calculus (Brownian motion) method
    6.3.4 The Gaussian isoperimetric inequality
  6.4 Problems
  6.5 Notes
version: 7dec2015   Mini-empirical   printed: 8 December 2015   © David Pollard

Chapter 6
Gaussian processes
Section 6.1 states three beautiful facts about multivariate normal distributions: the Sudakov inequality; the Fernique comparison inequality; and the concentration inequality for Lipschitz functionals, with the Borell inequality as a special case. Section 6.2 sketches a proof of the Fernique inequality, then shows how it implies the Sudakov inequality. Section 6.3 presents four different proofs for slightly different versions of the Lipschitz concentration inequality. The proofs use techniques that have proven most useful for the study of Gaussian processes.
6.1 Introduction

This chapter has two aims:
(i) to describe the technical tools that are needed (in Chapter 7) to establish the various equivalences, for centered Gaussian processes, between the finiteness of $P\sup_{t\in T} X_t$ and the existence of majorizing measures, as described in Section 4.6;
(ii) to describe some surprising properties of Gaussian processes that have been the starting point for a flourishing literature on the concentration of measure phenomenon, as discussed in Chapters 11 and 12.
Happily the two aims overlap. An essential ingredient for Talagrand's majorizing measure argument is an inequality usually attributed to Sudakov (but consult the references in Section 6.5 for a more complete account of the history).
<1> Theorem. ("Sudakov's minoration") Let $Y := (Y_1, Y_2, \dots, Y_n)$ have a centered (zero means) multivariate normal distribution, with $P|Y_j - Y_k|^2 \ge \delta^2$ for all $j \ne k$. Then
\[
(4\pi)^{1/2}\, P\max_{i\le n} Y_i \ge \delta\sqrt{\log_2 n}.
\]
Remark. The lower bound is sharp to within a constant, in the following sense. If $P|Y_j - Y_k|^2 \le \delta^2$ for all $j \ne k$ then $P\max_i Y_i = PY_1 + P\max_i(Y_i - Y_1) = P\max_i(Y_i - Y_1)$ and
\begin{align*}
\exp\Bigl(\bigl(P\max_i(Y_i - Y_1)/2\delta\bigr)^2\Bigr)
&\le P\max_i \exp\bigl((Y_i - Y_1)^2/4\delta^2\bigr) \quad\text{by Jensen}\\
&\le n\, P\exp(W^2) \quad\text{with } W \sim N\bigl(0, \tfrac14\bigr).
\end{align*}
Thus $P\max_i Y_i$ is bounded above by $2\delta\sqrt{\log(\sqrt{2}\,n)}$.
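As a quick numerical sanity check (not part of the original text), the upper bound in the Remark can be compared against a Monte Carlo estimate. Taking the $Y_i$ to be independent $N(0, \delta^2/2)$, so that $P|Y_j - Y_k|^2 = \delta^2$ exactly, is an illustrative assumption on my part:

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, reps = 256, 1.0, 20000

# iid N(0, delta^2/2) variables give P|Y_j - Y_k|^2 = delta^2 for j != k
Y = rng.normal(scale=delta / np.sqrt(2), size=(reps, n))
emax = Y.max(axis=1).mean()                      # Monte Carlo estimate of P max_i Y_i
bound = 2 * delta * np.sqrt(np.log(np.sqrt(2) * n))   # the Remark's upper bound
print(f"E max_i Y_i ~ {emax:.3f}, upper bound {bound:.3f}")
assert emax <= bound
```

For iid variables the estimate grows like $\delta\sqrt{\log n}$, comfortably below the bound $2\delta\sqrt{\log(\sqrt{2}\,n)}$, consistent with the Remark.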
The minoration can be proved (Section 6.2) by using a comparison theorem due to Fernique (1975, page 18).
<2> Fernique's comparison inequality. Suppose $X$ and $Y$ both have centered (zero means) multivariate normal distributions, with
\[
P|X_i - X_j|^2 \le P|Y_i - Y_j|^2 \quad\text{for all } i, j.
\]
Then
\[
Pf\bigl(\max_i X_i - \min_i X_i\bigr) \le Pf\bigl(\max_i Y_i - \min_i Y_i\bigr)
\]
for each increasing, convex function $f$ on $\mathbb{R}^+$.
Section 6.2 sketches the proof of this inequality. The method of proof illustrates an important technique: construct a path between $X$ and $Y$ along which the expected value of interest increases.

The other ingredient in the majorizing measure argument is a concentration inequality for the supremum of a Gaussian process. To avoid measurability issues, assume the index set is at worst countably infinite.
<3> Borell's inequality. Suppose $\{Y_t : t \in T\}$ is a Gaussian process with $T$ finite or countably infinite. Assume both $m := P\sup_{t\in T} Y_t < \infty$ and $\sigma^2 := \sup_{t\in T} \mathrm{var}(Y_t) < \infty$. Then
\[
P\{\,|\sup_{t\in T} Y_t - m| \ge \sigma u\,\} \le 2\exp(-u^2/2) \quad\text{for all } u \ge 0.
\]
Consequently $\|\sup_{t\in T} Y_t - m\|_{\Psi_2} \le C_{\mathrm{Bor}}\sigma$, with $C_{\mathrm{Bor}}$ a universal constant.
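The inequality is easy to probe numerically. The following sketch (my own illustration, not from the text) takes the process to be Brownian motion sampled on a grid of $[0,1]$, for which $\sigma^2 = 1$, and checks the two-sided tail bound at a few levels $u$:

```python
import numpy as np

rng = np.random.default_rng(1)
reps, n = 50000, 64

# Gaussian process: Brownian motion on an n-point grid of [0, 1]; sup_t var(Y_t) = 1
increments = rng.normal(scale=np.sqrt(1.0 / n), size=(reps, n))
paths = increments.cumsum(axis=1)
sup = paths.max(axis=1)

m, sigma = sup.mean(), 1.0          # Monte Carlo stand-in for m := P sup Y_t
for u in (0.5, 1.0, 2.0):
    emp = np.mean(np.abs(sup - m) >= sigma * u)
    bound = 2 * np.exp(-u * u / 2)
    print(f"u={u}: empirical tail {emp:.4f} <= Borell bound {bound:.4f}")
    assert emp <= bound
```

The empirical tails sit well below $2e^{-u^2/2}$, as the theorem requires; the bound makes no use of the strong dependence between the grid values.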
In special cases (such as independent N(0, 1)-distributed variables, as shown by the Problems to Chapter 4) one can get tighter bounds, but Borell's inequality has the great virtue of being impervious to the effects of possible dependence between the $Y_t$.

Theorem <3> can be deduced from a more basic fact about the $N(0, I_n)$ distribution on $\mathbb{R}^n$. For vectors in $\mathbb{R}^n$ write $|\cdot|$ for the usual $\ell^2$ distance: $|x|^2 = \sum_i x_i^2$.

<4> Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function, with $\|f\|_{\mathrm{Lip}} \le \kappa$. That is, $|f(x) - f(y)| \le \kappa|x - y|$ for all $x, y \in \mathbb{R}^n$. Then, for a universal constant $C$,
\[
\gamma_n\{f(x) \ge \gamma_n f + \kappa u\} \le e^{-u^2/(2C)} \quad\text{for all } u \ge 0,
\]
where $\gamma_n$ denotes the $N(0, I_n)$ distribution.
Remark. Notice that the dimension n does not appear explicitly in the upper bound, although it might enter implicitly through κ for some functionals.
This Theorem provides a good illustration of several different arguments that have been developed for Gaussian processes. Section 6.3 contains four different proofs of the Theorem. The easiest method (Pisier-Maurey, subsection 6.3.1) gives the concentration bound with C = π²/4. The smart path method (subsection 6.3.2) improves the constant to 2. The stochastic calculus method (subsection 6.3.3) improves the constant to 1. The deepest method (subsection 6.3.4), based on the Gaussian isoperimetric inequality, again gives the constant 1 but with centering at the median of f(x). Together the four methods offer a mini-course in Gaussian tricks.
Remark. The constant $C = 1$ is the best possible in general. If $u$ is a unit vector the linear function $f(x) = u'x$ is Lipschitz with $\kappa = 1$. Under $\gamma_n$ the function $f(x)$ has a $N(0, 1)$ distribution, whose tails decrease like $\exp(-u^2/2)$.
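A small check of the Remark (my own illustration, not from the text): under $\gamma_n$ the linear functional $u'x$ for a unit vector $u$ is exactly $N(0,1)$, and its tail is indeed of order $e^{-u^2/2}$:

```python
import math
import numpy as np

rng = np.random.default_rng(2)
n = 5
u_dir = np.ones(n) / math.sqrt(n)      # a unit vector: f(x) = u'x has kappa = 1
x = rng.standard_normal((200000, n))   # draws from gamma_n = N(0, I_n)
vals = x @ u_dir                       # exactly N(0, 1) distributed

for u in (1.0, 2.0, 3.0):
    emp = np.mean(vals >= u)
    exact = 0.5 * math.erfc(u / math.sqrt(2))   # exact N(0,1) upper tail
    assert abs(emp - exact) < 0.01              # simulation matches the exact tail
    assert exact <= math.exp(-u * u / 2)        # tail is of order exp(-u^2/2)
    print(f"u={u}: tail {exact:.5f} vs exp(-u^2/2) = {math.exp(-u*u/2):.5f}")
```

So no constant smaller than 1 can work for all Lipschitz $f$ with $\kappa = 1$.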
Let me show you how Theorem <4> implies the analog of the Borell inequality with the $u^2/2$ in the exponent replaced by $u^2/(2C)$, for whichever constant $C$ you feel comfortable using. (Different $C$'s just lead to different values for $C_{\mathrm{Bor}}$, but have no important effect on the arguments in Chapter 7.)

Suppose $T = \mathbb{N}$. Define $M_n = \max_{i\le n} Y_i$. For each fixed $n$ we can think of each $Y_i$ as a linear functional, $Y_i(x) = \mu_i + a_i'x$, on $\mathbb{R}^n$ equipped with $\gamma_n$, with $A = [a_1, \dots, a_n]$ an $n\times n$ matrix for which $A'A$ equals the variance matrix of $(Y_1, \dots, Y_n)$. That gives $|a_i|^2 = \mathrm{var}(Y_i) \le \sigma^2$.
The functional $f(x) := \max_{i\le n} Y_i(x)$ is Lipschitz:
\begin{align*}
|f(x) - f(z)| &= \bigl|\max_{i\le n}(\mu_i + a_i'x) - \max_{i\le n}(\mu_i + a_i'z)\bigr|\\
&\le \max_{i\le n} |(\mu_i + a_i'x) - (\mu_i + a_i'z)|\\
&\le \max_{i\le n} |a_i|\,|x - z| \quad\text{by Cauchy-Schwarz}\\
&\le \sigma|x - z|.
\end{align*}
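The displayed Lipschitz bound is easy to test on random data. Here is a minimal sketch (my own, with an arbitrary choice of dimensions) checking $|f(x) - f(z)| \le \sigma|x-z|$ with $\sigma = \max_i |a_i|$ on many sampled pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 4, 6                      # k affine functionals Y_i(x) = mu_i + a_i'x on R^n
A = rng.normal(size=(k, n))      # rows are the vectors a_i'
mu = rng.normal(size=k)
sigma = np.linalg.norm(A, axis=1).max()   # sigma = max_i |a_i|

def f(x):
    # f(x) = max_i (mu_i + a_i' x), the maximum of the affine functionals
    return (mu + A @ x).max()

for _ in range(1000):
    x, z = rng.normal(size=n), rng.normal(size=n)
    assert abs(f(x) - f(z)) <= sigma * np.linalg.norm(x - z) + 1e-12
print("Lipschitz bound |f(x)-f(z)| <= sigma |x-z| holds on all sampled pairs")
```

The same chain of inequalities as in the display is doing the work: the difference of maxima is bounded by the maximum of the differences.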
Theorem <4> gives
\[
P\{M_n \ge PM_n + \sigma u\} \le e^{-u^2/(2C)},
\]
which, because $PM_n \le m$, implies
\[
P\{M_n > r\} \le e^{-u^2/(2C)} \quad\text{for } r > m + \sigma u \text{ and each } n.
\]
In the limit, as n → ∞, we get a one-sided analog of Theorem <3>. Repeat the argument with f replaced by −f to deduce the two-sided bound.
6.2 The Fernique inequality

The following sketch of Fernique's argument summarizes the more detailed exposition by Pollard (2001, Section 12.3). First a smoothing argument shows that the function $f$ may be assumed to be infinitely differentiable with second derivative having compact support, which sidesteps integrability questions and allows uninhibited appeals to integration-by-parts.

Suppose $X \sim N(0, V_0)$ and $Y \sim N(0, V_1)$, independently. The main idea is to interpolate between $X$ and $Y$ along a path $X(\theta) = \sqrt{1-\theta}\,X + \sqrt{\theta}\,Y$, for $0 \le \theta \le 1$. The random vector $X(\theta)$ has a $N(0, V_\theta)$ distribution, where
\[
V_\theta = (1-\theta)V_0 + \theta V_1 = V_0 + \theta D \quad\text{with } D := V_1 - V_0.
\]
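The covariance claim for the interpolated vector can be verified by simulation. A minimal sketch (the particular matrices $V_0$, $V_1$ are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 0.3
V0 = np.array([[1.0, 0.2], [0.2, 1.0]])
V1 = np.array([[1.5, 0.6], [0.6, 2.0]])

# independent draws X ~ N(0, V0) and Y ~ N(0, V1) via Cholesky factors
L0, L1 = np.linalg.cholesky(V0), np.linalg.cholesky(V1)
g = rng.standard_normal((200000, 4))
X, Y = g[:, :2] @ L0.T, g[:, 2:] @ L1.T

# interpolated vector X(theta) = sqrt(1-theta) X + sqrt(theta) Y
Xt = np.sqrt(1 - theta) * X + np.sqrt(theta) * Y
emp = np.cov(Xt.T)
assert np.allclose(emp, (1 - theta) * V0 + theta * V1, atol=0.03)
print("cov of interpolated vector matches (1-theta) V0 + theta V1")
```

The square roots on the coefficients are what make the covariance interpolate linearly: $(\sqrt{1-\theta})^2 V_0 + (\sqrt{\theta})^2 V_1 = V_\theta$.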
By Fourier inversion, the $N(0, V_\theta)$ distribution has density
\[
g_\theta(x) = (2\pi)^{-n} \int_{\mathbb{R}^n} \exp\bigl(-ix't - \tfrac12 t'V_\theta t\bigr)\,dt.
\]
Differentiation under the integral sign leads to the identity
\[
\frac{\partial g_\theta(x)}{\partial\theta}
= \frac12 \sum_{j=1}^n \sum_{k=1}^n D_{j,k}\, \frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}.
\]
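This heat-equation identity can be checked numerically by finite differences. The sketch below (my own, for a $2$-dimensional example with arbitrary $V_0$, $V_1$) compares a central difference in $\theta$ against the right-hand side built from central differences in $x$:

```python
import numpy as np

def gauss_density(x, V):
    # centered multivariate normal density N(0, V) evaluated at the point x
    n = len(x)
    Vinv = np.linalg.inv(V)
    norm = 1.0 / np.sqrt((2 * np.pi) ** n * np.linalg.det(V))
    return norm * np.exp(-0.5 * x @ Vinv @ x)

V0 = np.array([[1.0, 0.2], [0.2, 1.0]])
V1 = np.array([[1.5, 0.6], [0.6, 2.0]])
D = V1 - V0
theta, x, h = 0.4, np.array([0.3, -0.7]), 1e-4
Vt = lambda t: (1 - t) * V0 + t * V1

# left side: d g_theta(x) / d theta by central difference in theta
lhs = (gauss_density(x, Vt(theta + h)) - gauss_density(x, Vt(theta - h))) / (2 * h)

# right side: (1/2) sum_{j,k} D_{j,k} d^2 g_theta / dx_j dx_k by central differences in x
rhs = 0.0
for j in range(2):
    for k in range(2):
        ej, ek = np.eye(2)[j], np.eye(2)[k]
        second = (gauss_density(x + h * ej + h * ek, Vt(theta))
                  - gauss_density(x + h * ej - h * ek, Vt(theta))
                  - gauss_density(x - h * ej + h * ek, Vt(theta))
                  + gauss_density(x - h * ej - h * ek, Vt(theta))) / (4 * h * h)
        rhs += 0.5 * D[j, k] * second

assert abs(lhs - rhs) < 1e-6
print(f"d g/d theta = {lhs:.8f}, heat-equation side = {rhs:.8f}")
```

The agreement reflects the Fourier computation: $\partial/\partial\theta$ pulls down $-\tfrac12 t'Dt$ inside the integral, while $\partial^2/\partial x_j\partial x_k$ pulls down $-t_j t_k$.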
It remains to show that the function
\[
H(\theta) := Pf\bigl(\max_i X_i(\theta) - \min_i X_i(\theta)\bigr)
= \int_{\mathbb{R}^n} f\bigl(\max_i x_i - \min_i x_i\bigr)\, g_\theta(x)\,dx
\]
is increasing in $\theta$, or that
\[
H'(\theta) = \frac12 \sum_{j=1}^n \sum_{k=1}^n D_{j,k} \int_{\mathbb{R}^n} f\bigl(\max_i x_i - \min_i x_i\bigr)\, \frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}\,dx
\]
is nonnegative. Split the range of integration according to which $x_i$ is the maximum and which $x_i$ is the minimum. On each region integration-by-parts leads to a representation
\[
H'(\theta) = \frac12 \sum_{j,k} \{j < k\}\, (D_{j,j} - 2D_{j,k} + D_{k,k})\,(A_{j,k} + B_{j,k}),
\]
where $A_{j,k}$ is an $(n-1)$-dimensional integral of the nonnegative function $f'g_\theta$ over a boundary set and $B_{j,k}$ is an $n$-dimensional integral of the nonnegative function $f''g_\theta$. And the coefficient $(D_{j,j} - 2D_{j,k} + D_{k,k})$ is also nonnegative because it equals
\[
P|Y_j - Y_k|^2 - P|X_j - X_k|^2 \ge 0 \quad\text{by assumption.}
\]
Done.

The Sudakov minoration follows directly from the Fernique inequality with $f$ chosen as the identity function. Without loss of generality suppose $n$ equals $2^k$, a power of 2, so that the index set can be identified with $S := \{-1, +1\}^k$. Construct the process $\{X_s : s \in S\}$ from a set $Z_1, \dots, Z_k$ of independent $N(0,1)$'s,
\[
X_s := \tfrac12 \delta k^{-1/2} \sum_{j=1}^k s_j Z_j,
\]
for which $P|X_s - X_{s'}|^2 = \tfrac14 \delta^2 k^{-1} \sum_j (s_j - s_j')^2 \le \delta^2$. By Fernique's inequality,
\[
P\bigl(\max_s Y_s - \min_s Y_s\bigr) \ge P\bigl(\max_s X_s - \min_s X_s\bigr).
\]
Symmetry of the multivariate normal implies that $\max_s Y_s$ has the same distribution as $\max_s(-Y_s) = -\min_s Y_s$, and similarly for the $X$'s. The last
inequality implies
\[
P\max_s Y_s \ge P\max_s X_s
= \tfrac12 \delta k^{-1/2}\, P\max_s \sum_{j=1}^k s_j Z_j
= \tfrac12 \delta k^{-1/2}\, P\sum_{j=1}^k |Z_j|
= \tfrac12 \delta k^{1/2}\, P|Z_1|.
\]
Since $P|Z_1| = \sqrt{2/\pi}$, the right side equals $\delta\sqrt{k/(2\pi)} \ge \delta\sqrt{\log_2 n}\,/(4\pi)^{1/2}$, which is the claimed minoration.
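The key identity in the display, $\max_s \sum_j s_j Z_j = \sum_j |Z_j|$ (the maximizing sign vector is $s_j = \mathrm{sign}(Z_j)$), and the resulting lower bound can be checked by simulation. A minimal sketch, with my own arbitrary choice of $k$ and $\delta$:

```python
import math
import numpy as np

rng = np.random.default_rng(4)
k, delta, reps = 6, 1.0, 20000
Z = rng.standard_normal((reps, k))

# max over sign vectors s of sum_j s_j Z_j is attained at s_j = sign(Z_j)
max_over_s = np.abs(Z).sum(axis=1)
lhs = 0.5 * delta * k ** -0.5 * max_over_s.mean()       # estimate of P max_s X_s
rhs = 0.5 * delta * math.sqrt(k) * math.sqrt(2 / math.pi)  # (1/2) delta k^{1/2} P|Z_1|
assert abs(lhs - rhs) < 0.05

# Sudakov form: (4 pi)^{1/2} P max >= delta sqrt(log2 n), with n = 2^k points
assert math.sqrt(4 * math.pi) * lhs >= delta * math.sqrt(k)
print(f"P max_s X_s ~ {lhs:.4f}; Sudakov needs >= {delta * math.sqrt(k) / math.sqrt(4 * math.pi):.4f}")
```

So the explicit construction already attains the $\delta\sqrt{\log_2 n}$ rate, with a constant even better than $(4\pi)^{-1/2}$.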
6.3 Concentration of Lipschitz functionals

As promised, here are four different methods for proving versions of Theorem <4>. The aim is to show, for various choices of the constant C, that