
6 Gaussian processes
  6.1 Introduction
  6.2 The Fernique inequality
  6.3 Concentration of Lipschitz functionals
    6.3.1 The Pisier-Maurey approach
    6.3.2 The smart path method
    6.3.3 The stochastic calculus (Brownian motion) method
    6.3.4 The Gaussian isoperimetric inequality
  6.4 Problems
  6.5 Notes


Chapter 6
Gaussian processes

Section 6.1 states three beautiful facts about multivariate normal distributions: the Sudakov inequality; the Fernique comparison inequality; and the concentration inequality for Lipschitz functionals, with the Borell inequality as a special case. Section 6.2 sketches a proof of the Fernique inequality, then shows how it implies the Sudakov inequality. Section 6.3 presents four different proofs for slightly different versions of the Lipschitz concentration inequality. The proofs use techniques that have proven themselves most useful for the study of Gaussian processes.

6.1 Introduction

This chapter has two aims:

(i) to describe the technical tools that are needed (in Chapter 7) to establish the various equivalences, for centered Gaussian processes, between the finiteness of $P\sup_{t\in T} X_t$ and the existence of majorizing measures, as described in Section 4.6;

(ii) to describe some surprising properties of Gaussian processes that have been the starting point for a flourishing literature on the concentration of measure phenomenon, as discussed in Chapters 11 and 12.

Happily the two aims overlap. An essential ingredient for Talagrand's majorizing measure argument is an inequality usually attributed to Sudakov (but consult the references in Section 6.5 for a more complete account of the history).


<1> Theorem. ("Sudakov's minoration") Let $Y := (Y_1, Y_2, \ldots, Y_n)$ have a centered (zero means) multivariate normal distribution, with $P|Y_j - Y_k|^2 \ge \delta^2$ for all $j \ne k$. Then $(4\pi)^{1/2}\, P\max_{i\le n} Y_i \ge \delta\sqrt{\log_2 n}$.

Remark. The lower bound is sharp within a constant, in the following sense. If $P|Y_j - Y_k|^2 \le \delta^2$ for all $j \ne k$ then $P\max_i Y_i = PY_1 + P\max_i(Y_i - Y_1) = P\max_i(Y_i - Y_1)$ and
$$\exp\bigl( (P\max_i(Y_i - Y_1)/2\delta)^2 \bigr) \le P\max_i \exp\bigl( (Y_i - Y_1)^2/4\delta^2 \bigr) \quad\text{by Jensen}$$
$$\le nP\exp(W^2) \quad\text{with } W \sim N(0, \tfrac14).$$
Thus $P\max_i Y_i$ is bounded above by $2\delta\sqrt{\log(\sqrt{2}\,n)}$.
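To make the two bounds concrete, here is a small Monte Carlo sketch (my own illustration, not from the text) for independent $N(0,1)$ coordinates, for which $P|Y_j - Y_k|^2 = 2$ and hence $\delta = \sqrt{2}$ in both the minoration and the Remark's upper bound.

```python
# Monte Carlo sketch: independent N(0,1) coordinates give delta = sqrt(2).
# The estimated P max_i Y_i should sit between the Sudakov lower bound
# and the Jensen upper bound from the Remark.
import numpy as np

rng = np.random.default_rng(0)
delta = np.sqrt(2.0)
for n in (8, 64, 512, 4096):
    Y = rng.standard_normal((2000, n))
    emax = Y.max(axis=1).mean()                  # estimate of P max_i Y_i
    lower = delta * np.sqrt(np.log2(n)) / np.sqrt(4 * np.pi)
    upper = 2 * delta * np.sqrt(np.log(np.sqrt(2) * n))
    print(f"n={n:5d}  lower={lower:.3f}  Pmax~{emax:.3f}  upper={upper:.3f}")
```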

The minoration can be proved (Section 6.2) by using a comparison theorem due to Fernique (1975, page 18).

<2> Fernique's comparison inequality. Suppose $X$ and $Y$ both have centered (zero means) multivariate normal distributions, with
$$P|X_i - X_j|^2 \le P|Y_i - Y_j|^2 \quad\text{for all } i, j.$$
Then
$$Pf(\max_i X_i - \min_i X_i) \le Pf(\max_i Y_i - \min_i Y_i)$$
for each increasing, convex $f$ on $\mathbb{R}^+$.

Section 6.2 sketches the proof of this inequality. The method of proof illustrates an important technique: construct a path between X and Y along which the expected value of interest increases.

The other ingredient in the majorizing measure argument is a concentration inequality for the supremum of a Gaussian process. To avoid measurability issues, assume the index set is at worst countably infinite.

<3> Borell's inequality. Suppose $\{Y_t : t \in T\}$ is a Gaussian process with $T$ finite or countably infinite. Assume both $m := P\sup_{t\in T} Y_t < \infty$ and $\sigma^2 := \sup_{t\in T}\operatorname{var}(Y_t) < \infty$. Then
$$P\{|\sup\nolimits_{t\in T} Y_t - m| \ge \sigma u\} \le 2\exp(-u^2/2) \quad\text{for all } u \ge 0.$$
Consequently, $\|\sup_{t\in T} Y_t - m\|_{\Psi_2} \le C_{\mathrm{Bor}}\,\sigma$, with $C_{\mathrm{Bor}}$ a universal constant.


In special cases (such as independent N(0,1)-distributed variables, as shown by the Problems to Chapter 4) one can get tighter bounds, but Borell's inequality has the great virtue of being impervious to the effects of possible dependence between the $Y_t$.

Theorem <3> can be deduced from a more basic fact about the $N(0, I_n)$ distribution on $\mathbb{R}^n$. For vectors in $\mathbb{R}^n$ write $|\cdot|$ for the usual $\ell^2$ distance: $|x|^2 = \sum_i x_i^2$.

<4> Theorem. Suppose $f : \mathbb{R}^n \to \mathbb{R}$ is a Lipschitz function, with $\|f\|_{\mathrm{Lip}} \le \kappa$. That is, $|f(x) - f(y)| \le \kappa|x - y|$ for all $x, y \in \mathbb{R}^n$. Then, for a universal constant $C$,

$$\gamma_n\{f(x) \ge \gamma_n f + \kappa u\} \le e^{-u^2/(2C)} \quad\text{for all } u \ge 0,$$

where γn denotes the N(0,In) distribution.

Remark. Notice that the dimension n does not appear explicitly in the upper bound, although it might enter implicitly through κ for some functionals.

This Theorem provides a good illustration of several different arguments that have been developed for Gaussian processes. Section 6.3 contains four different proofs of the Theorem. The easiest method (Pisier-Maurey, subsection 6.3.1) gives the concentration bound with $C = \pi^2/4$. The smart path method (subsection 6.3.2) improves the constant to 2. The stochastic calculus method (subsection 6.3.3) improves the constant to 1. The deepest method (subsection 6.3.4), based on the Gaussian isoperimetric inequality, again gives the constant 1 but with centering at the median of $f(x)$. Together the four methods offer a mini-course in Gaussian tricks.

Remark. The constant $C = 1$ is the best possible in general. If $u$ is a unit vector the linear function $f(x) = u'x$ is Lipschitz with $\kappa = 1$. Under $\gamma_n$ the function $f(x)$ has a $N(0,1)$ distribution, whose tails decrease like $\exp(-u^2/2)$.

Let me show you how Theorem <4> implies the analog of the Borell inequality with the $u^2/2$ in the exponent replaced by $u^2/(2C)$ for whichever constant $C$ you feel comfortable using. (Different $C$'s just lead to different values for $C_{\mathrm{Bor}}$, but have no important effect on the arguments in Chapter 7.)

Suppose $T = \mathbb{N}$. Define $M_n = \max_{i\le n} Y_i$. For each fixed $n$ we can think of each $Y_i$ as a linear functional, $Y_i(x) = \mu_i + a_i'x$, on $\mathbb{R}^n$ equipped with $\gamma_n$, with $A = [a_1, \ldots, a_n]$ an $n\times n$ matrix for which $A'A$ equals the variance matrix of $(Y_1, \ldots, Y_n)$. That gives $|a_i|^2 = \operatorname{var}(Y_i) \le \sigma^2$.


The functional $f(x) := \max_{i\le n} Y_i(x)$ is Lipschitz:

$$|f(x) - f(z)| = |\max_{i\le n}(\mu_i + a_i'x) - \max_{i\le n}(\mu_i + a_i'z)|$$
$$\le \max_{i\le n} |(\mu_i + a_i'x) - (\mu_i + a_i'z)| \le \max_{i\le n} |a_i|\,|x - z| \quad\text{by Cauchy-Schwarz}$$
$$\le \sigma|x - z|.$$

Theorem <4> gives

$$P\{M_n \ge PM_n + \sigma u\} \le e^{-u^2/(2C)},$$

which implies

$$P\{M_n > r\} \le e^{-u^2/(2C)} \quad\text{for } r > m + \sigma u \text{ and each } n.$$

In the limit, as n → ∞, we get a one-sided analog of Theorem <3>. Repeat the argument with f replaced by −f to deduce the two-sided bound.
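The following simulation sketch (my own illustration; the equicorrelated process and all parameter values are assumptions chosen for the demonstration) shows the dimension-free character of the bound: however large $n$ and however strong the dependence, the fluctuations of $M_n$ stay on the scale $\sigma$.

```python
# Simulation sketch (assuming equicorrelated Gaussians with rho = 0.9, so
# sigma = 1 for every coordinate): the fluctuations of M_n = max_i Y_i about
# P M_n stay on the scale sigma, regardless of the dimension n or the strong
# dependence between coordinates.
import numpy as np

rng = np.random.default_rng(1)
rho, reps, u = 0.9, 5000, 2.0
for n in (10, 100, 1000):
    # Y_i = sqrt(rho)*Z0 + sqrt(1-rho)*Z_i has var(Y_i) = 1 for all i
    Z0 = rng.standard_normal((reps, 1))
    Z = rng.standard_normal((reps, n))
    M = (np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * Z).max(axis=1)
    tail = np.mean(np.abs(M - M.mean()) >= u)        # sigma = 1 here
    print(f"n={n:4d}  P{{|M_n - PM_n| >= {u}}} ~ {tail:.4f}"
          f"  <= 2 exp(-u^2/2) = {2 * np.exp(-u**2 / 2):.4f}")
```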

6.2 The Fernique inequality

The following sketch of Fernique's argument summarizes the more detailed exposition by Pollard (2001, Section 12.3). First a smoothing argument shows that the function $f$ could be assumed to be infinitely differentiable with second derivative having compact support, which sidesteps integrability questions and allows uninhibited appeals to integration-by-parts.

Suppose $X \sim N(0, V_0)$ and $Y \sim N(0, V_1)$. The main idea is to interpolate between $X$ and $Y$ along a path $X(\theta) = \sqrt{1-\theta}\,X + \sqrt{\theta}\,Y$, for $0 \le \theta \le 1$. The random vector $X(\theta)$ has a $N(0, V_\theta)$ distribution, where

$$V_\theta = (1-\theta)V_0 + \theta V_1 = V_0 + \theta D \quad\text{with } D := V_1 - V_0.$$

By Fourier inversion, the $N(0, V_\theta)$ distribution has density
$$g_\theta(x) = (2\pi)^{-n}\int_{\mathbb{R}^n} \exp\bigl(-ix't - \tfrac12 t'V_\theta t\bigr)\,dt.$$
Differentiation under the integral sign leads to the identity

$$\frac{\partial g_\theta(x)}{\partial\theta} = \tfrac12 \sum_{j=1}^n \sum_{k=1}^n D_{j,k}\,\frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}.$$


It remains to show that the function
$$H(\theta) := Pf\Bigl(\max_i X_i(\theta) - \min_i X_i(\theta)\Bigr) = \int_{\mathbb{R}^n} f\Bigl(\max_i x_i - \min_i x_i\Bigr)\, g_\theta(x)\,dx$$
is increasing in $\theta$, or that

$$H'(\theta) = \tfrac12 \sum_{j=1}^n\sum_{k=1}^n D_{j,k} \int_{\mathbb{R}^n} f\Bigl(\max_i x_i - \min_i x_i\Bigr)\, \frac{\partial^2 g_\theta(x)}{\partial x_j\,\partial x_k}\,dx$$
is nonnegative. Split the range of integration according to which $x_i$ is the maximum and which $x_i$ is the minimum. On each region integration-by-parts leads to a representation

$$H'(\theta) = \tfrac12 \sum_{j,k} \{j < k\}\, (D_{j,j} - 2D_{j,k} + D_{k,k})(A_{j,k} + B_{j,k}),$$
where $A_{j,k}$ is an $(n-1)$-dimensional integral of the nonnegative function $f'g_\theta$ over a boundary set and $B_{j,k}$ is an $n$-dimensional integral of the nonnegative function $f''g_\theta$. And the coefficient $(D_{j,j} - 2D_{j,k} + D_{k,k})$ is also nonnegative because it equals
$$P|Y_j - Y_k|^2 - P|X_j - X_k|^2 \ge 0 \quad\text{by assumption.}$$
Done.
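Before moving on, here is a quick numerical sanity check of the comparison inequality (my own sketch; the choices of covariances and of $f$ are assumptions for illustration only).

```python
# Numerical sanity check (not part of Fernique's argument): X has
# equicorrelated coordinates (rho = 0.5), so P|X_i - X_j|^2 = 1, while Y has
# independent N(0,1) coordinates with P|Y_i - Y_j|^2 = 2.  For the increasing
# convex function f(r) = r^2 the comparison Pf(range X) <= Pf(range Y)
# should be visible in the averages.
import numpy as np

rng = np.random.default_rng(2)
n, reps, rho = 20, 100_000, 0.5
Z0 = rng.standard_normal((reps, 1))
X = np.sqrt(rho) * Z0 + np.sqrt(1 - rho) * rng.standard_normal((reps, n))
Y = rng.standard_normal((reps, n))
fX = (X.max(axis=1) - X.min(axis=1)) ** 2
fY = (Y.max(axis=1) - Y.min(axis=1)) ** 2
print(f"Pf(range X) ~ {fX.mean():.3f}  <=  Pf(range Y) ~ {fY.mean():.3f}")
```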

Done. The Sudakov’s minoration follows directly from the Fernique inequality with f chosen as the identity function. Without loss of generality suppose n equals 2k, a power of 2, so that the index set can be identified with S := k {−1, +1} . Construct the process {Xs : s ∈ S} from a set Z1,...,Zk of independendent N(0, 1)’s,

$$X_s := \tfrac12 \delta k^{-1/2} \sum_{j=1}^k s_j Z_j,$$

for which $P|X_s - X_{s'}|^2 = \tfrac14 \delta^2 k^{-1} \sum_j (s_j - s_j')^2 \le \delta^2$. By Fernique's inequality,

$$P(\max_s Y_s - \min_s Y_s) \ge P(\max_s X_s - \min_s X_s).$$

Symmetry of the multivariate normal implies that $\max_s Y_s$ has the same distribution as $\max_s(-Y_s) = -\min_s Y_s$, and similarly for the $X$'s. The last


inequality implies

$$P\max_s Y_s \ge P\max_s X_s = \tfrac12 \delta k^{-1/2}\, P\max_s \sum_{j=1}^k s_j Z_j = \tfrac12 \delta k^{-1/2}\, P\sum_{j=1}^k |Z_j| = \tfrac12 \delta k^{1/2}\, P|Z_1|.$$
Since $P|Z_1| = \sqrt{2/\pi}$ and $k = \log_2 n$, the right-hand side equals $\delta\sqrt{\log_2 n}/\sqrt{2\pi}$, which is even slightly better than the bound asserted by Theorem <1>.

6.3 Concentration of Lipschitz functionals

As promised, here are four different methods for proving versions of Theorem <4>. The aim is to show, for various choices of the constant $C$, that

$$P\{f(X) \ge Pf(X) + \kappa u\} \le \exp\bigl(-u^2/(2C)\bigr) \quad\text{for all } u \ge 0,$$

if X has distribution γn and f has Lipschitz constant κ. For the first three methods the bound follows from the usual subgaussian moment generating function control,

<5> $\quad Pe^{\lambda(f(X) - \gamma_n f)} \le e^{C\lambda^2\kappa^2/2}$ for all $\lambda \ge 0$. That is,

$$P\{f(X) \ge \gamma_n f + \kappa u\} \le \inf_{\lambda \ge 0} \exp\bigl(-\lambda\kappa u + C\lambda^2\kappa^2/2\bigr),$$
with the infimum achieved at $\lambda = u/(C\kappa)$. As shown in Problem [1], a smoothing argument reduces to the case where $f$ is infinitely differentiable, with $|\nabla f(x)| \le \kappa$ everywhere. That is, if $f_i(x)$ denotes $\partial f(x)/\partial x_i$ then

<6> $\quad |\nabla f(x)|^2 = \sum_{i\le n} f_i(x)^2 \le \kappa^2$ for all $x \in \mathbb{R}^n$.

6.3.1 The Pisier-Maurey approach

The simplest bound for the left-hand side of <5> comes from Jensen's inequality,

$$e^{\lambda(f(X(\omega)) - \gamma_n^y f(y))} \le \gamma_n^y e^{\lambda(f(X(\omega)) - f(y))} \quad\text{for each } \omega.$$
Equivalently, if $Y$ is another $N(0, I_n)$-distributed random vector that is independent of $X$ then
<7> $\quad Pe^{\lambda(f(X) - \gamma_n f)} \le Pe^{\lambda(f(X) - f(Y))}.$


Remark. Notice that $\operatorname{var}(f(X) - f(Y)) = 2\operatorname{var}(f(X))$. This "symmetrization" approach inevitably leads to at least a doubling of the constant $C$.

We could bound $f(X) - f(Y)$ in <7> by $\kappa|X - Y|$ but that would introduce an explicit dependence on $n$ in the upper bound, because the distribution of $|X - Y|$ depends on $n$. Instead we need to exploit cancellations due to independence along a one-dimensional path from $Y = X_0$ to $X = X_1$,

$$X_\theta = X\sin(\pi\theta/2) + Y\cos(\pi\theta/2) \quad\text{for } 0 \le \theta \le 1,$$

with derivative
$$\frac{\partial X_\theta}{\partial\theta} = \frac{\pi}{2}\bigl(X\cos(\pi\theta/2) - Y\sin(\pi\theta/2)\bigr) =: \frac{\pi}{2}\, Z_\theta.$$

Note that both Xθ and Zθ have distribution γn, and Xθ is independent of Zθ because cov(Xθ,Zθ) = 0 for each θ. Moreover

$$f(X) - f(Y) = \int_0^1 \frac{\partial f(X_\theta)}{\partial\theta}\,d\theta = \int_0^1 \frac{\pi}{2}\, Z_\theta \cdot \nabla f(X_\theta)\,d\theta.$$
By Jensen's inequality for Lebesgue measure on $[0,1]$,

$$\exp\bigl(\lambda(f(X) - f(Y))\bigr) \le \int_0^1 \exp\Bigl(\frac{\lambda\pi}{2}\, Z_\theta \cdot \nabla f(X_\theta)\Bigr)\,d\theta.$$

Take expectations with respect to $P$, first conditioning on $X_\theta$ and using the fact that $\nabla f(X_\theta)$ is independent of $Z_\theta$ and $P\exp(Z_\theta \cdot t) = \exp(|t|^2/2)$ for each fixed $t$ in $\mathbb{R}^n$, to deduce that
$$P\exp\bigl(\lambda(f(X) - f(Y))\bigr) \le \int_0^1 P\exp\Bigl(\frac{\lambda^2\pi^2}{8}\,|\nabla f(X_\theta)|^2\Bigr)\,d\theta \le \exp\bigl(\lambda^2\kappa^2\pi^2/8\bigr) \quad\text{by <6>}.$$

We have inequality <5> with $C = \pi^2/4$.
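A quick Monte Carlo check of the resulting form of <5> (my own sketch; the test functional $f(x) = \max_i x_i$ with $\kappa = 1$ is an assumed choice, not part of the proof):

```python
# Monte Carlo check of inequality <5> with C = pi^2/4, for the assumed
# test functional f(x) = max_i x_i, which has Lipschitz constant kappa = 1.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 50_000
f = rng.standard_normal((reps, n)).max(axis=1)
centered = f - f.mean()                  # f(X) - gamma_n f, estimated
for lam in (0.5, 1.0, 2.0):
    mgf = np.exp(lam * centered).mean()
    bound = np.exp((np.pi ** 2 / 4) * lam ** 2 / 2)
    print(f"lambda={lam}:  P exp(lam(f - Pf)) ~ {mgf:.3f}  <=  {bound:.3f}")
```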

6.3.2 The smart path method

This refinement of the path method comes from Talagrand (2003, Section 1.3). It improves the constant $C$ by creating a path through Gaussian random $2n$-vectors built from independent $N(0, I_n)$-distributed vectors $X$, $Y$, and $Z$,
$$W_0 = (Z, Z) \quad\text{and}\quad W_1 = (X, Y),$$
$$W_\theta = \alpha_\theta W_0 + \beta_\theta W_1 = (\alpha_\theta Z + \beta_\theta X,\ \alpha_\theta Z + \beta_\theta Y),$$
$$X_\theta = \frac{\partial W_\theta}{\partial\theta} = \dot\alpha_\theta W_0 + \dot\beta_\theta W_1,$$
where $\alpha_\theta = \sqrt{1-\theta}$ and $\beta_\theta = \sqrt{\theta}$ for $0 \le \theta \le 1$ and the dots denote derivatives (to avoid confusion with primes for transposes). The random vector $W_\theta$ has a $N(0, V_\theta)$ distribution with
$$V_\theta = \alpha_\theta^2\operatorname{var}(W_0) + \beta_\theta^2\operatorname{var}(W_1) = I_{2n} + (1-\theta)\begin{pmatrix} 0 & I_n \\ I_n & 0 \end{pmatrix}.$$

The random vector $(X_\theta, W_\theta)$ also has a multivariate normal distribution, with
$$2\operatorname{cov}(X_\theta, W_\theta) = 2\alpha_\theta\dot\alpha_\theta V_0 + 2\beta_\theta\dot\beta_\theta V_1 = D := -\begin{pmatrix} 0 & I_n \\ I_n & 0 \end{pmatrix},$$
where $V_0 := \operatorname{var}(W_0)$ and $V_1 := \operatorname{var}(W_1)$.

The functional $G : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$, defined by $G(x, y) = e^{\lambda(f(x) - f(y))}$, is evaluated along the $W_\theta$ path to create a function

$$H(\theta) := PG(W_\theta) \quad\text{for } 0 \le \theta \le 1,$$
with $H(0) = 1$ and $H(1) = Pe^{\lambda(f(X) - f(Y))} \ge Pe^{\lambda(f(X) - \gamma_n f)}$. Write $G_i$ for the partial derivative of $G$ with respect to its $i$th argument, that is
$$G_i(x, y) = \begin{cases} \lambda f_i(x)\,G(x, y) & \text{if } 1 \le i \le n \\ -\lambda f_{i-n}(y)\,G(x, y) & \text{if } n+1 \le i \le 2n. \end{cases}$$
Then

$$H'(\theta) = P\,\frac{\partial G(W_\theta)}{\partial\theta} = \sum_{i=1}^{2n} PX_{\theta,i}\, G_i(W_\theta).$$
An integration-by-parts (Problem [2]) gives
$$PX_{\theta,i}\, G_i(W_\theta) = \sum_{j=1}^{2n} \tau_{i,j}\, PG_{i,j}(W_\theta)$$
where
$$\tau_{i,j} = \operatorname{cov}(X_{\theta,i}, W_{\theta,j}) = \tfrac12 D_{i,j} = \begin{cases} -\tfrac12 & \text{if } |i - j| = n \\ 0 & \text{otherwise.} \end{cases}$$


Many terms disappear in the double sum:
$$\sum_{i=1}^{2n}\sum_{j=1}^{2n} \tau_{i,j}\, G_{i,j}(x, y) = -\tfrac12 \sum_{i=1}^n 2\lambda^2(-1) f_i(x) f_i(y)\, G(x, y)$$
$$\le \lambda^2 G(x, y)\sqrt{\Bigl(\sum_{i=1}^n f_i(x)^2\Bigr)\Bigl(\sum_{i=1}^n f_i(y)^2\Bigr)} \quad\text{by Cauchy-Schwarz}$$
$$\le \lambda^2 G(x, y)\,\kappa^2 \quad\text{by <6>}.$$

Thus $H'(\theta) \le \lambda^2\kappa^2\, PG(W_\theta) = \lambda^2\kappa^2 H(\theta)$, that is,
$$\frac{d \log H(\theta)}{d\theta} \le \lambda^2\kappa^2,$$
which integrates to give

$$Pe^{\lambda(f(X) - \gamma_n f)} \le H(1) \le e^{\lambda^2\kappa^2}.$$

Theorem <4> holds with C = 2.

6.3.3 The stochastic calculus (Brownian motion) method

This proof creates a different sort of path, from $\gamma_n f$ to $f(X)$, using a stochastic integral with respect to an $n$-dimensional Brownian motion,

$$B_t = (X_{1,t}, \ldots, X_{n,t}) \quad\text{for } 0 \le t \le 1.$$

That is, the Xi processes are independent Brownian motions on [0, 1]. The key idea is that the process

$$M_t = P_{\mathcal{F}_t} f(B_1) \quad\text{for } 0 \le t \le 1$$

is a martingale with $M_1 = f(B_1)$ and $M_0 = \gamma_n f$. (Here $P_{\mathcal{F}_t}$ denotes the conditional expectation with respect to the sigma-field $\mathcal{F}_t$ generated by $\{B_s : 0 \le s \le t\}$.) By the Markov property of Brownian motion and the independence of its increments, the martingale has an explicit representation, $M_t = F(B_t, t)$, where
<8> $\quad F(x, t) = Pf(x + (B_1 - B_t)) = \int_{\mathbb{R}^n} f(x + z\sqrt{1-t}\,)\,\phi_n(z)\,dz$
and $\phi_n(z_1, \ldots, z_n) = (2\pi)^{-n/2}\exp(-|z|^2/2)$, the $N(0, I_n)$ density.
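The martingale property can be watched numerically. The following sketch (my own illustration; the functional $f(x) = \max_i x_i$ and all sample sizes are assumptions) estimates $F$ from <8> by Monte Carlo and tracks $M_t = F(B_t, t)$ along simulated Brownian paths; the average of $M_t$ over paths should stay near $M_0 = \gamma_n f$.

```python
# Simulation sketch with the assumed choice f(x) = max_i x_i: estimate
# F(x, t) = P f(x + z sqrt(1-t)) by averaging over a fixed smoothing sample
# z ~ N(0, I_n), then track M_t = F(B_t, t) along simulated Brownian paths.
# Because M is a martingale, P M_t should stay near M_0 = gamma_n f.
import numpy as np

rng = np.random.default_rng(4)
n, paths, mc = 5, 500, 2000
zs = rng.standard_normal((mc, n))               # smoothing sample for F

def F(x, t):
    # Monte Carlo version of <8> with f = max of coordinates
    return (x + np.sqrt(1.0 - t) * zs).max(axis=1).mean()

B = np.zeros((paths, n))
prev = 0.0
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    B += np.sqrt(t - prev) * rng.standard_normal((paths, n))
    prev = t
    Mt = np.array([F(b, t) for b in B])
    print(f"t={t:.2f}  P M_t ~ {Mt.mean():.4f}")
```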


Remark. The process $M_t$ is only defined up to an almost sure equivalence. The representation $F(B_t, t)$ gives a version of the process with continuous sample paths.

At the risk of some notational confusion, write $F_i$ for $\partial F/\partial x_i$ and $F_{i,j}$ for $\partial^2 F/\partial x_i\partial x_j$ and $F_t$ for $\partial F/\partial t$. Before I present the very slick stochastic calculus proof via the Itô formula, let me give you a more heuristic argument. The trick with stochastic integration is to carry Taylor expansions of functions of $B_t$ out to second order, using the fact that for small $\delta > 0$

$$\Delta B_t = (\Delta X_{1,t}, \ldots, \Delta X_{n,t}) := B_{t+\delta} - B_t$$

has a N(0, δIn) distribution independent of Ft. For example,

<9> $\quad \Delta M_t := M_{t+\delta} - M_t \approx F_t(B_t, t)\,\delta + \sum_{i=1}^n F_i(B_t, t)\,\Delta X_{i,t} + \tfrac12 \sum_{i=1}^n \sum_{j=1}^n F_{i,j}(B_t, t)\,\Delta X_{i,t}\Delta X_{j,t}.$

For $i \ne j$ the term $A_{i,j} := F_{i,j}(B_t, t)\Delta X_{i,t}\Delta X_{j,t}$ has $P_{\mathcal{F}_t} A_{i,j} = 0$ and $P_{\mathcal{F}_t} A_{i,j}^2 = F_{i,j}(B_t, t)^2\delta^2$. All those cross-product terms can be absorbed into the error of approximation. However, for $i = j$ we have $P_{\mathcal{F}_t} A_{i,i} = F_{i,i}(B_t, t)\,\delta$ and $P_{\mathcal{F}_t}\bigl(A_{i,i} - F_{i,i}(B_t, t)\,\delta\bigr)^2 = F_{i,i}(B_t, t)^2\, O(\delta^2)$. The deviations of $A_{i,i}$ from $P_{\mathcal{F}_t} A_{i,i}$ can be ignored but the conditional expectation itself makes an important contribution. That is,

$$\Delta M_t \approx \sum_{i=1}^n F_i(B_t, t)\,\Delta X_{i,t} + \delta\Bigl( F_t(B_t, t) + \tfrac12 \sum_{i=1}^n F_{i,i}(B_t, t) \Bigr).$$
The martingale properties of $B$ and $M$ now kill another contribution:

$$0 = P_{\mathcal{F}_t}\Delta M_t \approx \delta\Bigl( F_t(B_t, t) + \tfrac12 \sum_{i=1}^n F_{i,i}(B_t, t) \Bigr),$$
which corresponds to the fact (Problem [3]) that $F$ is a solution of the heat equation,

$$\frac{\partial F(x, t)}{\partial t} + \tfrac12 \sum_i \frac{\partial^2 F(x, t)}{\partial x_i^2} = 0 \quad\text{for } 0 < t < 1,$$
with boundary conditions $F(x, 1) = f(x)$ and $F(x, 0) = \int f(x + z)\,\phi_n(z)\,dz$. The approximation simplifies even more,
<10> $\quad \Delta M_t \approx \sum_{i=1}^n F_i(B_t, t)\,\Delta X_{i,t}.$


Similar reasoning gives
<11> $\quad P_{\mathcal{F}_t}(\Delta M_t)^2 \approx \sum_i F_i(B_t, t)^2\, P_{\mathcal{F}_t}\Delta X_{i,t}^2 = \delta\,|\nabla_x F(B_t, t)|^2,$
which suggests that the process
$$Z_t := \exp\Bigl(\lambda M_t - \tfrac12\lambda^2 \int_0^t |\nabla_x F(B_s, s)|^2\,ds\Bigr)$$
should also be a martingale:
$$P_{\mathcal{F}_t} Z_{t+\delta} \approx Z_t\, P_{\mathcal{F}_t}\exp\bigl(\lambda\Delta M_t - \tfrac12\lambda^2\delta\,|\nabla_x F(B_t, t)|^2\bigr) \approx Z_t\, P_{\mathcal{F}_t}\bigl(1 + \lambda\Delta M_t - \tfrac12\lambda^2\delta\,|\nabla_x F(B_t, t)|^2 + \tfrac12(\lambda\Delta M_t)^2\bigr) \approx Z_t.$$
Here the higher-order terms in $\delta\,|\nabla_x F(B_t, t)|^2$, which are of order $\delta^2$, have been absorbed into the error of approximation. Assuming (correctly, see below) the validity of this martingale assertion we now have
<12> $\quad Pe^{\lambda M_0} = PZ_0 = PZ_1 = P\exp\Bigl(\lambda f(B_1) - \tfrac12\lambda^2\int_0^1 |\nabla_x F(B_s, s)|^2\,ds\Bigr).$
Finally we come to the place where the Lipschitz property of $f$ plays a role. The gradient of $F$ inherits that property:
$$|\nabla_x F(x, t)|^2 = \Bigl|\int \nabla_x f(x + z\sqrt{1-t}\,)\,\phi_n(z)\,dz\Bigr|^2 \le \int |\nabla_x f(x + z\sqrt{1-t}\,)|^2\,\phi_n(z)\,dz \le \kappa^2.$$

Thus inequality <12> tells us that
$$Pe^{\lambda f(B_1)} \le e^{\lambda^2\kappa^2/2}\, PZ_1 \le e^{\lambda\gamma_n f + \lambda^2\kappa^2/2},$$
which is inequality <5> with $C = 1$.

I would be sympathetic if you had reservations about all these approximations. A rigorous derivation uses the versatile theorems of stochastic calculus, as expounded by Chung and Williams (2014, Section 5.4), cited below as C&W. The argument is very clean. The process $M$ is an $L^2$ martingale with continuous sample paths. By the Itô formula,

$$M_t - M_0 = F(B_t, t) - F(B_0, 0) = \sum_i \int_0^t F_i(B_s, s)\,dX_{i,s} + \int_0^t F_t(B_s, s)\,ds + \tfrac12 \sum_{i=1}^n \sum_{j=1}^n \int_0^t F_{i,j}(B_s, s)\,d[X_i, X_j]_s,$$


an $L^2$ martingale plus a process with sample paths of bounded variation. The process $M_t - M_0 - \sum_i \int_0^t F_i(B_s, s)\,dX_{i,s}$ is an $L^2$ martingale whose sample paths are both continuous and of bounded variation, which forces it to be the zero process [C&W Corollary 4.5]. That is, even without the benefit of Problem [3] we know that the bounded variation contribution is zero, leaving

$$M_t = M_0 + \sum_i \int_0^t F_i(B_s, s)\,dX_{i,s}.$$
The quadratic variation process [cf. C&W Theorem 5.7] is given by

$$[M]_t = \sum_{i,j} \int_0^t F_i(B_s, s) F_j(B_s, s)\,d[X_i, X_j]_s = \int_0^t \sum_i F_i(B_s, s)^2\,ds \le \kappa^2 t.$$
The process $Z_t = \exp\bigl(\lambda M_t - \tfrac12\lambda^2 [M]_t\bigr)$ is a local martingale, that is, for some sequence of stopping times $\tau_j \uparrow \infty$, the process $Z(t \wedge \tau_j)$ is a martingale [C&W Theorem 6.2]. For each $j$,

$$PZ_0 = PZ(1 \wedge \tau_j) = P\exp\bigl(\lambda M_{1\wedge\tau_j} - \tfrac12\lambda^2 [M]_{1\wedge\tau_j}\bigr) \ge P\exp\bigl(\lambda M_{1\wedge\tau_j} - \tfrac12\lambda^2\kappa^2\bigr).$$
Complete the argument by an appeal to Fatou's Lemma as $j \to \infty$,

$$Pe^{\lambda f(B_1)} = Pe^{\lambda M_1} \le e^{\lambda\gamma_n f + \kappa^2\lambda^2/2},$$
which again is inequality <5> with $C = 1$.

6.3.4 The Gaussian isoperimetric inequality

For each subset $A$ of $\mathbb{R}^n$ define $d(z, A) = \inf\{|z - y| : y \in A\}$ and $A^\delta = \{z \in \mathbb{R}^n : d(z, A) \le \delta\}$. Write $\Phi$ for the $N(0,1)$ distribution function and $\bar\Phi$ for $1 - \Phi$. That is, if $Z$ is $N(0,1)$ distributed then $P\{Z \le x\} = \Phi(x)$ and $P\{Z > x\} = \bar\Phi(x)$.

The most stunning fact about $\gamma_n$, the so-called isoperimetric inequality, was established independently by Borell (1975) and Sudakov and Tsirel'son (1978).

<13> Gaussian isoperimetric inequality. If $A$ is a Borel subset of $\mathbb{R}^n$ with $\gamma_n A = \Phi(\alpha)$ then $\gamma_n\bigl((A^\delta)^c\bigr) \le \bar\Phi(\alpha + \delta)$ for each $\delta \ge 0$. The upper bound is achieved when $A$ is any closed halfspace with Gaussian measure $\Phi(\alpha)$.
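For centered balls both sides of the inequality reduce to one-dimensional distribution functions, so the statement can be checked directly; the following is my own numerical sketch, not part of the text.

```python
# Numerical illustration: for a centered ball A = {|x| <= r} the enlargement
# A^delta is the ball of radius r + delta, and gamma_n of a centered ball is
# a chi-distribution probability, so the halfspace bound in <13> can be
# compared with the exact value.
import numpy as np
from scipy.stats import chi, norm

delta = 0.5
for n in (2, 10, 50):
    r = np.sqrt(n)                       # ball holding roughly half the mass
    alpha = norm.ppf(chi.cdf(r, df=n))   # gamma_n(A) = Phi(alpha)
    lhs = chi.cdf(r + delta, df=n)       # gamma_n(A^delta), exact
    rhs = norm.cdf(alpha + delta)        # what a halfspace would give
    print(f"n={n:2d}: gamma(A^delta)={lhs:.4f} >= Phi(alpha+delta)={rhs:.4f}")
```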


For an exposition of a proof due to Ehrhard (1983a,b) see Pollard (2001, Section 12.5). The inequality can be rewritten more compactly as

$$\gamma_n(A^\delta) \ge \Phi\bigl(\Phi^{-1}(\gamma_n A) + \delta\bigr),$$

slightly disguising the fact that equality is achieved by halfspaces but leading towards a functional form of the inequality that was developed by Bobkov (1996, 1997). For elegant reformulations of the functional approach see Ledoux (1998) and Barthe and Maurey (2000). It is the reduction from an $n$-dimensional problem, with $n$ arbitrarily large, to a one-dimensional calculation for the lower bound that makes the isoperimetric inequality so powerful, as shown by the inequalities in the next Example.

Recall that a median of a (real-valued) random variable $X$ is any constant $m$ for which $P\{X \ge m\} \ge 1/2$ and $P\{X \le m\} \ge 1/2$. Such an $m$ always exists, but it need not be unique.

<14> Example. Suppose $f$ is a Lipschitz function on $\mathbb{R}^n$ with $\|f\|_{\mathrm{Lip}} \le \kappa$. Under $\gamma_n$, the random variable $f(z)$ has at least one median, a number $M$ for which
$$\gamma_n\{f(z) \le M\} \ge \tfrac12 \quad\text{and}\quad \gamma_n\{f(z) \ge M\} \ge \tfrac12.$$
Define $A = \{z \in \mathbb{R}^n : f(z) \le M\}$ so that $\gamma_n A \ge 1/2 = \Phi(0)$. If $d(x, A) < u$ then there exists a point $z \in A$ with $d(z, x) < u$. From the Lipschitz property and the fact that $f(z) \le M$ we then get
$$f(x) < f(z) + \kappa u \le M + \kappa u.$$
Conversely, if $f(x) \ge M + \kappa u$ then $d(x, A) \ge u$. It follows that
$$\gamma_n\{f(x) \ge M + \kappa u\} \le \gamma_n\{d(x, A) \ge u\} \le \bar\Phi(0 + u) \le \tfrac12\exp(-u^2/2).$$
The companion lower bound follows by an analogous argument for deviations from the set $\{z \in \mathbb{R}^n : f(z) \ge M\}$. Together the two bounds give a concentration property for $f$,

<15> $\quad \gamma_n\{z : |f(z) - M| \ge \kappa y\} \le \exp(-y^2/2),$

where $M$ is a median for $f$ under $\gamma_n$. For many purposes it is more convenient to center the functional at its expected value $\mu = \gamma_n f$. Inequality <15> implies
$$|\mu - M| \le \gamma_n|f - M| = \kappa\int_0^\infty \gamma_n\{|f(z) - M| \ge \kappa y\}\,dy \le \kappa\int_0^\infty e^{-y^2/2}\,dy = C\kappa,$$


where $C = \tfrac12\sqrt{2\pi}$. Thus

$$\gamma_n\{|f - \mu| \ge \kappa(C + y)\} \le \gamma_n\{|f - M| \ge \kappa y\} \le \exp(-y^2/2),$$

which implies a concentration inequality around $\mu$. □
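A quick numerical sketch of the Example (my own illustration, with the assumed functional $f(x) = \max_i x_i$, for which $\kappa = 1$):

```python
# Numerical illustration of Example <14>: for f(x) = max_i x_i (kappa = 1),
# compare the median M and mean mu of f under gamma_n, and check the tail
# bound <15> together with |mu - M| <= C kappa where C = sqrt(2*pi)/2.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 50_000
f = rng.standard_normal((reps, n)).max(axis=1)
mu, M = f.mean(), np.median(f)
print(f"mu~{mu:.4f}  median~{M:.4f}  |mu-M|~{abs(mu - M):.4f}"
      f"  <= C = {0.5 * np.sqrt(2 * np.pi):.4f}")
y = 1.5
print(f"gamma{{|f-M| >= {y}}} ~ {np.mean(np.abs(f - M) >= y):.5f}"
      f"  <= exp(-y^2/2) = {np.exp(-y**2 / 2):.5f}")
```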

6.4 Problems

[1] Suppose $f$ is a real-valued function on $\mathbb{R}^n$ with $\|f\|_{\mathrm{Lip}} = \kappa$. Let $\psi$ be an infinitely differentiable, nonnegative function on $\mathbb{R}^n$ with compact support and $\int\psi(z)\,dz = 1$. For each $\sigma > 0$, define $f_\sigma(x) := \int f(x + \sigma z)\psi(z)\,dz$.

(i) Show that $f_\sigma$ has continuous partial derivatives of all orders and
$$|f_\sigma(x) - f_\sigma(y)| \le \int |f(x + \sigma z) - f(y + \sigma z)|\,\psi(z)\,dz \le \kappa|x - y|.$$

That is, $\|f_\sigma\|_{\mathrm{Lip}} \le \kappa$.

(ii) Use the inequality

$$|f(x + z) - f(x) - z\cdot\nabla f(x)| \le \int_0^1 \bigl| z\cdot\bigl(\nabla f(x + \theta z) - \nabla f(x)\bigr) \bigr|\,d\theta$$
and the continuity of $\nabla f$ as $z \to 0$ to deduce that $|\nabla f(x)| \le \kappa$ everywhere.

(iii) Show that
$$\sup_x |f_\sigma(x) - f(x)| \le \sup_x \int |f(x + \sigma z) - f(x)|\,\psi(z)\,dz \le \kappa\sigma\int |z|\,\psi(z)\,dz.$$

Deduce that $f_\sigma$ converges uniformly to $f$ as $\sigma$ tends to zero.

(iv) Suppose $P\{f_\sigma(X) - Pf_\sigma(X) \ge t\} \le H(t)$ for all $t > 0$, uniformly in $\sigma$, where $H$ is a continuous function. Deduce that $P\{f(X) - Pf(X) \ge t\} \le H(t)$ for all $t > 0$. Hint: Choose $\sigma$ small enough that $\sup_x |f_\sigma(x) - f(x)| \le \delta$.

[2] Suppose $Z \sim \gamma_1 = N(0,1)$ and $(X, W_1, \ldots, W_m)$ has a multivariate normal distribution with $PX = 0$. Suppose $G : \mathbb{R}^m \to \mathbb{R}$ has partial derivatives $G_i$ for which $P|G_i(W_1, \ldots, W_m)| < \infty$. Show that
$$PXG(W_1, \ldots, W_m) = \sum_{i\le m} \tau_i\, PG_i(W_1, \ldots, W_m) \quad\text{where } \tau_i := \operatorname{cov}(X, W_i),$$
by following these steps.


(i) For each absolutely continuous $g : \mathbb{R} \to \mathbb{R}$ for which $P|g'(Z)| < \infty$ show that $PZg(Z) = Pg'(Z)$. Hint: The function $\phi(u)g(u)$ has almost everywhere derivative $-u\phi(u)g(u) + \phi(u)g'(u)$.

(ii) Without loss of generality suppose $X \sim N(0,1)$. Define $\widetilde W_i := W_i - \tau_i X$ for $i = 1, \ldots, m$. By calculating covariances, show that $\widetilde W$ is independent of $X$. Invoke part (i) with $g(x) = G(\widetilde W + x\tau)$ for a fixed realization of $\widetilde W$ to deduce that
$$\gamma_1^x\, x\,G(\widetilde W + x\tau) = \sum_{i=1}^m \tau_i\,\gamma_1^x\, G_i(\widetilde W + x\tau),$$
then take expected values over $\widetilde W$.
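A one-line Monte Carlo check of the identity in part (i) can be reassuring; the choice $g(z) = \sin(z)$ below is my own illustrative assumption.

```python
# Quick Monte Carlo check of the identity in part (i), P Z g(Z) = P g'(Z),
# with the assumed choice g(z) = sin(z): both sides should be close to
# E cos(Z) = exp(-1/2) ~ 0.6065.
import numpy as np

rng = np.random.default_rng(6)
Z = rng.standard_normal(1_000_000)
print(f"P Z g(Z) ~ {(Z * np.sin(Z)).mean():.4f}")
print(f"P g'(Z)  ~ {np.cos(Z).mean():.4f}")
```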

[3] For the function $F$ defined by equation <8>, show that
$$\frac{\partial F(x, t)}{\partial t} = \int_{\mathbb{R}^n} \frac{\partial f(x + z\sqrt{1-t}\,)}{\partial t}\,\phi_n(z)\,dz = -\tfrac12 (1-t)^{-1/2} \sum_i \int f_i(x + z\sqrt{1-t}\,)\, z_i\,\phi_n(z)\,dz$$
$$= -\tfrac12 \sum_i \int f_{i,i}(x + z\sqrt{1-t}\,)\,\phi_n(z)\,dz = -\tfrac12 \sum_i \frac{\partial^2 F(x, t)}{\partial x_i^2}.$$

6.5 Notes

For a proof of Sudakov's inequality <1> (also known as the Sudakov minoration) see Ledoux and Talagrand (1991, page 79). They used the Slepian-Gordon inequalities, whose proof they borrowed from Kahane (1986). See also their Notes on pages 87–88 for more about the history and where credit is due. The Notes to Dudley (1999, Chapter 2) indicate that credit is also due to Chevet (1970).

The method in subsection 6.3.1 is a special case of Theorem 2.2 of Pisier (1985, page 176), who commented that "The proof below is a simplification, due to Maurey, of my original proof which used an expansion in Hermite polynomials". He also (page 180) sketched a stochastic calculus proof of the sharper result for C = 1, with the comment "B. Maurey found a proof of theorem 2.1 with the best constant K = 1/2 [that is, C = 1]. His proof uses stochastic integrals and apparently does not extend to the setting of theorem 2.2." Ledoux (2001, page 45) attributed the stochastic calculus proof of the concentration inequality to Cirel'son et al. (1976), who in turn attributed the result


to (the 1974 Russian version of) Sudakov and Tsirel'son (1978). See Adler (1990, page 43) for a most readable exposition. See the concise and informative book by Ledoux (2001) for more about concentration inequalities.

References

Adler, R. J. (1990). An Introduction to Continuity, Extrema, and Related Topics for General Gaussian Processes, Volume 12 of Lecture Notes–Monograph Series. Hayward, CA: Institute of Mathematical Statistics.

Barthe, F. and B. Maurey (2000). Some remarks on isoperimetry of Gaussian type. Annales de l'Institut Henri Poincaré, Probability and Statistics 36(4), 419–434.

Bobkov, S. (1996). A functional form of the isoperimetric inequality for the Gaussian measure. Journal of Functional Analysis 135, 39–49.

Bobkov, S. G. (1997). An isoperimetric inequality on the discrete cube, and an elementary proof of the isoperimetric inequality in Gauss space. Annals of Probability 25(1), 206–214.

Borell, C. (1975). The Brunn-Minkowski inequality in Gauss space. Inventiones Mathematicae 30, 207–216.

Chevet, S. (1970). Mesures de Radon sur R^n et mesures cylindriques. Annales scientifiques de l'Université de Clermont-Ferrand 2, série Mathématiques 43(6), 91–158.

Chung, K. L. and R. J. Williams (2014). Introduction to Stochastic Integration (Second ed.). Boston: Birkhäuser.

Cirel'son, B., I. Ibragimov, and V. Sudakov (1976). Norms of Gaussian sample functions. In Proceedings of the Third Japan–USSR Symposium on Probability Theory, Volume 550 of Springer Lecture Notes in Mathematics, pp. 20–41. Springer.

Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press.

Ehrhard, A. (1983a). Symétrisation dans l'espace de Gauss. Mathematica Scandinavica 53, 281–301.

Ehrhard, A. (1983b). Un principe de symétrisation dans les espaces de Gauss. Springer Lecture Notes in Mathematics 990, 92–101.

Fernique, X. (1975). Régularité des trajectoires des fonctions aléatoires gaussiennes. Springer Lecture Notes in Mathematics 480, 1–97. École d'Été de Probabilités de Saint-Flour IV, 1974.

Kahane, J.-P. (1986). Une inégalité du type de Slepian et Gordon sur les processus gaussiens. Israel Journal of Mathematics 55(1), 109–110.

Ledoux, M. (1998). A short proof of the Gaussian isoperimetric inequality. Progress in Probability 43, 229–232.

Ledoux, M. (2001). The Concentration of Measure Phenomenon, Volume 89 of Mathematical Surveys and Monographs. American Mathematical Society.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Pisier, G. (1985). Probabilistic methods in the geometry of Banach spaces. Springer Lecture Notes in Mathematics 1206, 167–241.

Pollard, D. (2001). A User's Guide to Measure Theoretic Probability. Cambridge University Press.

Sudakov, V. N. and B. S. Tsirel'son (1978). Extremal properties of half-spaces for spherically invariant measures. Journal of Soviet Mathematics 9, 419–434. (Translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V. A. Steklova AN SSSR, Vol. 41, pp. 14–24, 1974.)

Talagrand, M. (2003). Spin Glasses: A Challenge to Mathematicians, Volume 46 of Ergebnisse der Mathematik und ihrer Grenzgebiete. New York: Springer-Verlag.
