
MA3K0 - High-Dimensional Probability

Lecture Notes

Stefan Adams, 2020, updated 06.05.2020

Notes are in final version - proofreading not completed yet! Typos and errors will be corrected on a regular basis.

Contents

1 Preliminaries on Probability Theory
  1.1 Random variables
  1.2 Classical Inequalities
  1.3 Limit Theorems
2 Concentration inequalities for independent random variables
  2.1 Why concentration inequalities
  2.2 Hoeffding's Inequality
  2.3 Chernoff's Inequality
  2.4 Sub-Gaussian random variables
  2.5 Sub-Exponential random variables
3 Random vectors in High Dimensions
  3.1 Concentration of the Euclidean norm
  3.2 Covariance matrices and Principal Component Analysis (PCA)
  3.3 Examples of High-Dimensional distributions
  3.4 Sub-Gaussian random variables in higher dimensions
  3.5 Application: Grothendieck's inequality
4 Random Matrices
  4.1 Geometric concepts
  4.2 Concentration of the operator norm of random matrices
  4.3 Application: Community Detection in Networks
  4.4 Application: Covariance Estimation and Clustering
5 Concentration of measure - general case
  5.1 Concentration by entropic techniques
  5.2 Concentration via Isoperimetric Inequalities
  5.3 Some calculus and covariance estimation
6 Basic tools in high-dimensional probability
  6.1 Decoupling
  6.2 Concentration for Anisotropic random vectors
  6.3 Symmetrisation
7 Random Processes

Preface

Introduction

We discuss an elegant argument that showcases the usefulness of probabilistic reasoning in geometry. First recall that a convex combination of points $z_1, \ldots, z_m \in \mathbb{R}^n$ is a linear combination with coefficients that are non-negative and sum to 1, i.e., it is a sum of the form

$$\sum_{i=1}^m \lambda_i z_i \qquad \text{where } \lambda_i \ge 0 \text{ and } \sum_{i=1}^m \lambda_i = 1. \tag{0.1}$$
Given a set $M \subset \mathbb{R}^n$, the convex hull of $M$ is the set of all convex combinations of all finite collections of points in $M$, defined as

$$\mathrm{conv}(M) := \{\text{convex combinations of } z_1, \ldots, z_m \in M \text{ for } m \in \mathbb{N}\}.$$
The number $m$ of elements defining a convex combination in $\mathbb{R}^n$ is not restricted a priori. The classical theorem of Caratheodory states that one can always take $m \le n+1$. For the convenience of the reader we briefly state that classical theorem.

Theorem 0.1 (Caratheodory's theorem) Every point in the convex hull of a set $M \subset \mathbb{R}^n$ can be expressed as a convex combination of at most $n+1$ points from $M$.

Unfortunately the bound $n+1$ cannot be improved (it is clearly attained for a simplex $M$).$^1$ This is some bad news. However, in most applications we only want to approximate a point $x \in \mathrm{conv}(M)$ rather than to represent it exactly as a convex combination. Can we do this with fewer than $n+1$ points? We now show that it is possible, and actually the number of required points does not need to depend on the dimension $n$ at all! This is certainly brilliant news for any applications in mind - in particular for those where the dimension of the data set is extremely high (data science and machine learning, high-dimensional geometry, and statistical mechanics models).

Theorem 0.2 (Approximate form of Caratheodory's theorem) Consider a set $M \subset \mathbb{R}^n$ whose diameter $\mathrm{diam}(M) := \sup\{\|x - y\| \colon x, y \in M\}$ is bounded by 1. Then, for every point $x \in \mathrm{conv}(M)$ and every positive integer $k$, one can find points $x_1, \ldots, x_k \in M$ such that
$$\Big\| x - \frac{1}{k}\sum_{j=1}^k x_j \Big\| \le \frac{1}{\sqrt{k}}.$$

Here $\|x\| = \sqrt{x_1^2 + \cdots + x_n^2}$, $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, denotes the Euclidean norm on $\mathbb{R}^n$.

$^1$A simplex is a generalisation of the notion of a triangle to arbitrary dimensions. Specifically, a $k$-simplex $S$ is the convex hull of its $k+1$ vertices: Suppose $u_0, \ldots, u_k \in \mathbb{R}^k$ are affinely independent, which means that $u_1 - u_0, \ldots, u_k - u_0$ are linearly independent. Then, the simplex determined by these vertices is the set of points
$$S = \Big\{ \lambda_0 u_0 + \cdots + \lambda_k u_k \colon \sum_{i=0}^k \lambda_i = 1,\ \lambda_i \ge 0 \text{ for } i = 0, \ldots, k \Big\}.$$

Remark 0.3 We have assumed $\mathrm{diam}(M) \le 1$ for simplicity. For a general set $M$, the bound in the theorem changes to $\mathrm{diam}(M)/\sqrt{k}$.

Why is this result surprising? First, the number of points $k$ in the convex combinations does not depend on the dimension $n$. Second, the coefficients of the convex combinations can be made all equal.

Proof. The argument upon which our proof is based is known as the empirical method of B. Maurey. W.l.o.g., we may assume that not only the diameter but also the radius of $M$ is bounded by 1, i.e., $\|w\| \le 1$ for all $w \in M$. We pick a point $x \in \mathrm{conv}(M)$ and express it as a convex combination of some vectors $z_1, \ldots, z_m \in M$ as in (0.1). Now we interpret the numbers $\lambda_i$ in that convex combination as the probabilities with which a random vector $Z$ takes the values $z_i$, $i = 1, \ldots, m$, respectively. That is, we define
$$\mathbb{P}(Z = z_i) = \lambda_i, \qquad i = 1, \ldots, m.$$

This is possible by the fact that the weights $\lambda_i \in [0,1]$ and sum to 1. Consider now a sequence $(Z_j)_{j\in\mathbb{N}}$ of copies of $Z$. This sequence is an independent identically distributed sequence of $\mathbb{R}^n$-valued random variables. By the strong law of large numbers,
$$\frac{1}{k}\sum_{j=1}^k Z_j \to x \quad \text{almost surely as } k \to \infty.$$
We shall now get a quantitative version of this limiting statement, that is, we wish to obtain an error bound. For this we shall compute the variance of $\frac{1}{k}\sum_{j=1}^k Z_j$. We obtain

$$\mathbb{E}\Big[\Big\| x - \frac{1}{k}\sum_{j=1}^k Z_j \Big\|^2\Big] = \frac{1}{k^2}\,\mathbb{E}\Big[\Big\|\sum_{j=1}^k (Z_j - x)\Big\|^2\Big] \qquad (\text{since } \mathbb{E}[Z_j - x] = 0)$$
$$= \frac{1}{k^2}\sum_{j=1}^k \mathbb{E}\big[\|Z_j - x\|^2\big].$$
The last identity is just a higher-dimensional version of the basic fact that the variance of a sum of independent random variables equals the sum of the variances. To bound the variances of the single terms we compute, using that $Z_j$ is a copy of $Z$ and that $\|Z\| \le 1$ as $Z \in M$,
$$\mathbb{E}[\|Z_j - x\|^2] = \mathbb{E}[\|Z - \mathbb{E}[Z]\|^2] = \mathbb{E}[\|Z\|^2] - \|\mathbb{E}[Z]\|^2 \le \mathbb{E}[\|Z\|^2] \le 1,$$
where the second equality follows from the well-known property of the variance, namely, for $n = 1$,

$$\mathbb{E}[\|Z - \mathbb{E}[Z]\|^2] = \mathbb{E}[(Z - \mathbb{E}[Z])^2] = \mathbb{E}[Z^2 - 2Z\mathbb{E}[Z] + \mathbb{E}[Z]^2] = \mathbb{E}[Z^2] - \mathbb{E}[Z]^2,$$
and the cases for $n > 1$ follow similarly. We have thus shown that

$$\mathbb{E}\Big[\Big\| x - \frac{1}{k}\sum_{j=1}^k Z_j \Big\|^2\Big] \le \frac{1}{k}.$$

Therefore, there exists a realisation of the random variables Z1,...,Zk such that

$$\Big\| x - \frac{1}{k}\sum_{j=1}^k Z_j \Big\|^2 \le \frac{1}{k}.$$

Since by construction each $Z_j$ takes values in $M$, the proof is complete. $\Box$
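The empirical method used in the proof is easy to try out numerically: sample $k$ points of $M$ according to the weights $\lambda_i$ and average them. Below is a minimal sketch of this experiment (not part of the notes; the set $M$, the weights and the dimension are chosen arbitrarily for illustration, and numpy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# A set M of m points on the unit sphere in R^n (so ||w|| <= 1 for all w in M).
n, m = 50, 200
M = rng.normal(size=(m, n))
M /= np.linalg.norm(M, axis=1, keepdims=True)

# A convex combination x = sum_i lambda_i z_i with random weights.
lam = rng.random(m)
lam /= lam.sum()
x = lam @ M

# Empirical method of Maurey: draw Z_1, ..., Z_k i.i.d. with P(Z = z_i) = lambda_i
# and average them; the expected squared error is at most 1/k.
for k in [10, 100, 1000, 10000]:
    idx = rng.choice(m, size=k, p=lam)
    err = np.linalg.norm(x - M[idx].mean(axis=0))
    print(f"k = {k:5d}   error = {err:.4f}   1/sqrt(k) = {1/np.sqrt(k):.4f}")
```

A single random draw typically lands within the $1/\sqrt{k}$ scale; the theorem guarantees that at least one realisation does.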

We shall give one application of Theorem 0.2 in computational geometry. Suppose that we are given a subset $P \subset \mathbb{R}^n$ (say a polytope$^2$) and asked to cover it by balls of a given radius $\varepsilon > 0$. What is the smallest number of balls needed, and how should we place them?

Corollary 0.4 (Covering polytopes by balls) Let $P$ be a polytope in $\mathbb{R}^n$ with $N$ vertices and whose diameter is bounded by 1. Then $P$ can be covered by at most $N^{\lceil 1/\varepsilon^2\rceil}$ Euclidean balls of radii $\varepsilon > 0$.

Proof. We shall define the centres of the balls as follows. Let $k := \lceil 1/\varepsilon^2\rceil$ and consider the set
$$\mathcal{N} := \Big\{ \frac{1}{k}\sum_{j=1}^k x_j \colon x_j \text{ are vertices of } P \Big\}.$$
The polytope $P$ is the convex hull of the set of its vertices, which we denote by $M$. We then apply Theorem 0.2 to any point $x \in P = \mathrm{conv}(M)$ and deduce that $x$ is within a distance $1/\sqrt{k} \le \varepsilon$ from some point in $\mathcal{N}$. This shows that the $\varepsilon$-balls centred at $\mathcal{N}$ do indeed cover $P$. $\Box$

In this lecture we will learn several other approaches to the covering problem in relation to packing, entropy and coding, and random processes.

$^2$In geometry, a polytope is a geometric object with 'flat' sides. It is a generalisation of the three-dimensional polyhedron, which is a solid with flat polygonal faces, straight edges and sharp corners/vertices. Flat sides mean that the sides of a $(k+1)$-polytope consist of $k$-polytopes.

1 Preliminaries on Probability Theory

In this chapter we recall some basic concepts and results of probability theory. The reader should be familiar with most of this material, some of which is taught in elementary probability courses in the first year. To make these lectures self-contained we review the material mostly without proof and refer the reader to basic chapters of common undergraduate textbooks in probability theory, e.g. [Dur19] and [Geo12]. In Section 1.1 we present basic definitions for probability spaces and random variables along with expectation, variance and moments. Vital for the lecture will be the review of all classical inequalities in Section 1.2. Finally, in Section 1.3 we review well-known limit theorems.

1.1 Random variables

A probability space $(\Omega, \mathcal{F}, P)$ is a triple consisting of a set $\Omega$, a $\sigma$-algebra $\mathcal{F}$ and a probability measure $P$. We write $\mathcal{P}(\Omega)$ for the power set of $\Omega$, which is the set of all subsets of $\Omega$.

Definition 1.1 (σ-algebra) Suppose $\Omega \ne \emptyset$. A system $\mathcal{F} \subset \mathcal{P}(\Omega)$ satisfying

(a) $\Omega \in \mathcal{F}$,

(b) $A \in \mathcal{F} \Rightarrow A^c := \Omega \setminus A \in \mathcal{F}$,

(c) $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i\ge1} A_i \in \mathcal{F}$,

is called a σ-algebra (or σ-field) on $\Omega$. The pair $(\Omega, \mathcal{F})$ is then called an event space or measurable space.

Example 1.2 (Borel σ-algebra) Let $\Omega = \mathbb{R}^n$, $n \in \mathbb{N}$, and let
$$\mathcal{G} = \Big\{ \prod_{i=1}^n [a_i, b_i] \colon a_i < b_i,\ a_i, b_i \in \mathbb{Q} \Big\}$$
be the system consisting of all compact rectangular boxes in $\mathbb{R}^n$ with rational vertices and edges parallel to the axes. In honour of Émile Borel (1871-1956), the system $\mathcal{B}^n = \sigma(\mathcal{G})$ is called the Borel σ-algebra on $\mathbb{R}^n$, and every $A \in \mathcal{B}^n$ a Borel set. Here, $\sigma(\mathcal{G})$ denotes the smallest σ-algebra generated by the system $\mathcal{G}$. Note that $\mathcal{B}^n$ can also be generated by the system of open or half-open rectangular boxes, see [Dur19, Geo12]. ♣

The decisive point in the process of building a stochastic model is the next step: For each $A \in \mathcal{F}$ we need to define a value $P(A) \in [0,1]$ that indicates the probability of $A$. Sensibly, this should be done so that the following holds.

(N) Normalisation: P (Ω) = 1.

(A) σ-Additivity: For pairwise disjoint events $A_1, A_2, \ldots \in \mathcal{F}$ one has
$$P\Big(\bigcup_{i\ge1} A_i\Big) = \sum_{i\ge1} P(A_i).$$

Definition 1.3 (Probability measure) Let $(\Omega, \mathcal{F})$ be a measurable space. A mapping $P \colon \mathcal{F} \to [0,1]$ satisfying the properties (N) and (A) is called a probability measure or a probability distribution, in short a distribution (or, a little old-fashioned, a probability law) on $(\Omega, \mathcal{F})$. Then the triple $(\Omega, \mathcal{F}, P)$ is called a probability space.

Theorem 1.4 (Construction of probability measures via densities) (a) Discrete case: For countable $\Omega$, the relations
$$P(A) = \sum_{\omega\in A}\varrho(\omega) \ \text{ for } A \in \mathcal{P}(\Omega), \qquad \varrho(\omega) = P(\{\omega\}) \ \text{ for } \omega \in \Omega,$$
establish a one-to-one correspondence between the set of all probability measures $P$ on $(\Omega, \mathcal{P}(\Omega))$ and the set of all sequences $\varrho = (\varrho(\omega))_{\omega\in\Omega}$ in $[0,1]$ such that $\sum_{\omega\in\Omega}\varrho(\omega) = 1$.

(b) Continuous case: If $\Omega \subset \mathbb{R}^n$ is Borel, then every function $\varrho \colon \Omega \to [0,\infty)$ satisfying the properties

(i) $\{x \in \Omega \colon \varrho(x) \le c\} \in \mathcal{B}^n_\Omega$ for all $c > 0$,

(ii) $\int_\Omega \varrho(x)\,dx = 1$,

determines a unique probability measure on $(\Omega, \mathcal{B}^n_\Omega)$ via
$$P(A) = \int_A \varrho(x)\,dx \quad \text{ for } A \in \mathcal{B}^n_\Omega$$

(but not every probability measure on $(\Omega, \mathcal{B}^n_\Omega)$ is of this form).

Proof. See [Dur19, Geo12]. 2

Definition 1.5 A sequence or function % as in Theorem 1.4 above is called a density (of P ) or, more explicitly (to emphasise normalisation), a probability density (function), often abbreviated as pdf. If a distinction between the discrete and continuous case is required, a sequence % = (%(ω))ω∈Ω as in case (a) is called a discrete density, and a function % in case (b) a Lebesgue density.

In probability theory one often considers the transition from a measurable space (event space) $(\Omega, \mathcal{F})$ to a coarser measurable (event) space $(\Omega', \mathcal{F}')$ by means of a mapping $X \colon \Omega \to \Omega'$. In general such a mapping should satisfy the requirement
$$A' \in \mathcal{F}' \ \Rightarrow\ X^{-1}A' := \{\omega \in \Omega \colon X(\omega) \in A'\} \in \mathcal{F}. \tag{1.1}$$

Definition 1.6 Let $(\Omega, \mathcal{F})$ and $(\Omega', \mathcal{F}')$ be two measurable (event) spaces. Then every mapping $X \colon \Omega \to \Omega'$ satisfying property (1.1) is called a random variable from $(\Omega, \mathcal{F})$ to $(\Omega', \mathcal{F}')$, or a random element of $\Omega'$. Alternatively (in the terminology of measure theory), $X$ is said to be measurable relative to $\mathcal{F}$ and $\mathcal{F}'$.

In probability theory it is common to write $\{X \in A'\} := X^{-1}A'$.

Theorem 1.7 (Distribution of a random variable) If X is a random variable from a prob- ability space (Ω, F,P ) to a measurable space (Ω0, F 0), then the prescription

$$P'(A') := P(X^{-1}A') = P(\{X \in A'\}) \quad \text{ for } A' \in \mathcal{F}'$$

defines a probability measure P 0 on (Ω0, F 0).

Definition 1.8 (a) The probability measure $P'$ in Theorem 1.7 is called the distribution of $X$ under $P$, or the image of $P$ under $X$, and is denoted by $P \circ X^{-1}$. (In the literature, one also finds the notations $P_X$ or $\mathcal{L}(X; P)$. The letter $\mathcal{L}$ stands for the more traditional term law, or loi in French.)
(b) Two random variables are said to be identically distributed if they have the same distribution.

We are considering real-valued or $\mathbb{R}^n$-valued random variables in the following and we just call them random variables for these cases. In basic courses in probability theory, one learns about the two most important quantities associated with a random variable $X$, namely the expectation$^3$ (also called the mean) and the variance. They will be denoted in this lecture by

$$\mathbb{E}[X] \quad\text{and}\quad \mathrm{Var}(X) := \mathbb{E}[(X - \mathbb{E}[X])^2].$$

The distribution of a real-valued random variable X is determined by the cumulative distribution function (CDF) of X , defined as

$$F_X(t) = \mathbb{P}(X \le t) = P \circ X^{-1}\big((-\infty, t]\big), \qquad t \in \mathbb{R}. \tag{1.2}$$

It is often more convenient to work with the tails of random variables, namely with

P(X > t) = 1 − FX (t). (1.3)

Here we write $\mathbb{P}$ for the generic distribution of the random variable $X$, which is given by the context. For any real-valued random variable the moment generating function (MGF) is defined as
$$M_X(\lambda) := \mathbb{E}[e^{\lambda X}], \qquad \lambda \in \mathbb{R}. \tag{1.4}$$

When MX is finite for all λ in a neighbourhood of the origin, we can easily compute all moments by taking derivatives (interchanging differentiation and expectation (integration) in the usual way).

Lemma 1.9 (Integral Identity) Let X be a real-valued non-negative random variable. Then

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X > t)\,dt.$$

$^3$In measure theory the expectation $\mathbb{E}[X]$ of a random variable on a probability space $(\Omega, \mathcal{F}, P)$ is the Lebesgue integral of the function $X \colon \Omega \to \mathbb{R}$. This makes theorems on Lebesgue integration applicable in probability theory for expectations of random variables.

Proof. We can write any non-negative real number $x$ via the following identity using the indicator function$^4$:
$$x = \int_0^x 1\,dt = \int_0^\infty \mathbb{1}_{\{t < x\}}\,dt.$$
Substituting the non-negative random variable $X$ for $x$ and taking expectations on both sides (exchanging expectation and integral by Fubini's theorem) gives
$$\mathbb{E}[X] = \mathbb{E}\Big[\int_0^\infty \mathbb{1}_{\{t < X\}}\,dt\Big] = \int_0^\infty \mathbb{E}[\mathbb{1}_{\{t < X\}}]\,dt = \int_0^\infty \mathbb{P}(X > t)\,dt. \qquad \Box$$
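A quick numerical sanity check of the integral identity (not part of the notes): for $X \sim \mathrm{Exp}(1)$ the tail is $\mathbb{P}(X > t) = e^{-t}$ and the mean is 1. The sketch below, assuming numpy, compares a Riemann sum of the exact tail and of an empirical tail with the mean.

```python
import numpy as np

# Check E[X] = integral of P(X > t) dt for X ~ Exp(1):
# tail P(X > t) = exp(-t), true mean 1.
t, dt = np.linspace(0.0, 50.0, 500_001, retstep=True)
print(np.sum(np.exp(-t)) * dt)              # Riemann sum of the exact tail, ~ 1.0

# The same identity with an empirical tail from a Monte Carlo sample.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50_000)
s, ds = np.linspace(0.0, 20.0, 1001, retstep=True)
emp_tail = np.array([(x > si).mean() for si in s])
print(np.sum(emp_tail) * ds, x.mean())      # both close to 1.0
```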

Exercise 1.10 (Integral identity) Prove the extension of Lemma 1.9 to any real-valued ran- dom variable (not necessarily positive):

$$\mathbb{E}[X] = \int_0^\infty \mathbb{P}(X > t)\,dt - \int_{-\infty}^0 \mathbb{P}(X < t)\,dt. \qquad K$$

1.2 Classical Inequalities

In this section fundamental classical inequalities are presented. Here, classical refers to typical estimates for analysing stochastic limits.

Proposition 1.11 (Jensen’s inequality) Suppose that Φ: I → R, where I ⊂ R is an inter- val, is a convex function. Let X be a real-valued random variable. Then

Φ(E[X]) ≤ E[Φ(X)].

Proof. See [Dur19] or [Geo12], using either the existence of sub-derivatives for convex functions or the definition of convexity via the epigraph of a function. The epigraph of a function $f \colon I \to \mathbb{R}$, $I \subset \mathbb{R}$ an interval, is the set
$$\mathrm{epi}(f) := \{(x, y) \in I \times \mathbb{R} \colon y \ge f(x)\}.$$
A function $f \colon I \to \mathbb{R}$ is convex if, and only if, $\mathrm{epi}(f)$ is a convex set in $\mathbb{R}^2$. $\Box$

A consequence of Jensen's inequality is that $\|X\|_{L^p}$ is an increasing function of the parameter $p$, i.e.,

$$\|X\|_{L^p} \le \|X\|_{L^q}, \qquad 0 < p \le q \le \infty.$$
This follows from the convexity of $\Phi(x) = x^{q/p}$ when $q \ge p$.

Proposition 1.12 (Minkowski’s inequality) For p ∈ [1, ∞], let X,Y ∈ Lp, then

$$\|X + Y\|_{L^p} \le \|X\|_{L^p} + \|Y\|_{L^p}.$$

$^4$ $\mathbb{1}_A$ denotes the indicator function of the set $A$, that is, $\mathbb{1}_A(t) = 1$ if $t \in A$ and $\mathbb{1}_A(t) = 0$ if $t \notin A$.

Proposition 1.13 (Cauchy-Schwarz inequality) For X,Y ∈ L2,

|E[XY ]| ≤ kXkL2 kY kL2 .

Proposition 1.14 (Hölder's inequality) For $p, q \in (1, \infty)$ with $1/p + 1/q = 1$ let $X \in L^p$ and $Y \in L^q$. Then

E[XY ] ≤ E[|XY |] ≤ kXkLp kY kLq .

Lemma 1.15 (Linear Markov's inequality) For non-negative random variables $X$ and $t > 0$ the tail is bounded as
$$\mathbb{P}(X > t) \le \frac{\mathbb{E}[X]}{t}.$$
Proof. Pick $t > 0$. Any non-negative number $x$ can be written as

$$x = x\,\mathbb{1}_{\{x \ge t\}} + x\,\mathbb{1}_{\{x < t\}}.$$

As $X$ is non-negative, we insert $X$ into the above expression and take the expectation (integral) to obtain

$$\mathbb{E}[X] = \mathbb{E}[X\mathbb{1}_{\{X \ge t\}}] + \mathbb{E}[X\mathbb{1}_{\{X < t\}}] \ge \mathbb{E}[t\,\mathbb{1}_{\{X \ge t\}}] = t\,\mathbb{P}(X \ge t). \qquad \Box$$

This is one version of the Markov inequality which provides linear decay in $t$. In the following proposition we obtain the general version which will be used frequently throughout the lecture.

Proposition 1.16 (Markov’s inequality) Let Y be a real-valued random variable and f : [0, ∞) → [0, ∞) an increasing function. Then, for all ε > 0 with f(ε) > 0,

$$\mathbb{P}(|Y| \ge \varepsilon) \le \frac{\mathbb{E}[f \circ |Y|]}{f(\varepsilon)}.$$

Proof. Clearly, the composition f ◦ |Y | is a positive random variable such that

f(ε)1l{|Y |≥ε} ≤ f ◦ |Y |.

Taking the expectation on both sides of that inequality gives

f(ε)P(|Y | ≥ ε) = E[f(ε)1l{|Y |≥ε}] ≤ E[f ◦ |Y |].

2 The following version of the Markov inequality is often called Chebyshev’s inequality.

Corollary 1.17 (Chebyshev's inequality, 1867) For all $Y \in L^2$ with $\mathbb{E}[Y] \in (-\infty, \infty)$ and $\varepsilon > 0$,
$$\mathbb{P}\big(|Y - \mathbb{E}[Y]| \ge \varepsilon\big) \le \frac{\mathrm{Var}(Y)}{\varepsilon^2}.$$

1.3 Limit Theorems

Definition 1.18 (Variance and covariance) Let $X, Y \in L^2$ be real-valued random variables.

(a) $$\mathrm{Var}(X) := \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$$
is called the variance, and $\sqrt{\mathrm{Var}(X)}$ the standard deviation of $X$ with respect to $\mathbb{P}$.

(b) cov(X,Y ) := E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ] is called the covariance of X and Y . It exists since |XY | ≤ X2 + Y 2.

(c) If cov(X,Y ) = 0, then X and Y are called uncorrelated.

Recall that two independent random variables are uncorrelated, but two uncorrelated random variables are not necessarily independent, as the following example shows.

Example 1.19 Let Ω = {1, 2, 3} and let P the uniform distribution on Ω. Define two ran- dom variables by their images, that is,

(X(1),X(2),X(3)) = (1, 0, −1) and (Y (1),Y (2),Y (3)) = (0, 1, 0).

Then $XY = 0$ and $\mathbb{E}[XY] = 0$, and therefore $\mathrm{cov}(X, Y) = 0$, but
$$\mathbb{P}(X = 1, Y = 1) = 0 \ne \frac{1}{9} = \mathbb{P}(X = 1)\,\mathbb{P}(Y = 1).$$
Hence $X$ and $Y$ are not independent. ♣

Theorem 1.20 (Weak law of large numbers, $L^2$-version) Let $(X_i)_{i\in\mathbb{N}}$ be a sequence of uncorrelated (e.g. independent) real-valued random variables in $L^2$ with bounded variance, in that $v := \sup_{i\in\mathbb{N}}\mathrm{Var}(X_i) < \infty$. Then for all $\varepsilon > 0$,
$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N (X_i - \mathbb{E}[X_i])\Big| \ge \varepsilon\Big) \le \frac{v}{N\varepsilon^2} \xrightarrow[N\to\infty]{} 0,$$
and thus
$$\frac{1}{N}\sum_{i=1}^N (X_i - \mathbb{E}[X_i]) \xrightarrow[N\to\infty]{\ \mathbb{P}\ } 0$$
($\xrightarrow{\ \mathbb{P}\ }$ means convergence in probability). In particular, if $\mathbb{E}[X_i] = \mathbb{E}[X_1]$ holds for all $i \in \mathbb{N}$, then
$$\frac{1}{N}\sum_{i=1}^N X_i \xrightarrow[N\to\infty]{\ \mathbb{P}\ } \mathbb{E}[X_1].$$
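The $L^2$-bound $v/(N\varepsilon^2)$ is easy to compare with simulated frequencies; below is a minimal sketch (not from the notes; the distribution, $\varepsilon$ and the sample sizes are chosen only for illustration, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
trials = 1_000          # number of independent repetitions of the experiment

for N in [100, 1_000, 10_000]:
    # X_1, ..., X_N i.i.d. Exp(1), so E[X_i] = 1 and v = Var(X_i) = 1.
    X = rng.exponential(1.0, size=(trials, N))
    freq = (np.abs(X.mean(axis=1) - 1.0) >= eps).mean()
    print(N, freq, 1.0 / (N * eps**2))   # empirical frequency vs Chebyshev bound
```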

We now present a second version of the weak law of large numbers, which does not require the existence of the variance. To compensate we must assume that the random vari- ables, instead of being pairwise uncorrelated, are even pairwise independent and identically distributed.

Theorem 1.21 (Weak law of large numbers, $L^1$-version) Let $(X_i)_{i\in\mathbb{N}}$ be a sequence of pairwise independent, identically distributed real-valued random variables in $L^1$. Then
$$\frac{1}{N}\sum_{i=1}^N X_i \xrightarrow[N\to\infty]{\ \mathbb{P}\ } \mathbb{E}[X_1].$$

Theorem 1.22 (Strong law of large numbers) If $(X_i)_{i\in\mathbb{N}}$ is a sequence of pairwise uncorrelated real-valued random variables in $L^2$ with $v := \sup_{i\in\mathbb{N}}\mathrm{Var}(X_i) < \infty$, then
$$\frac{1}{N}\sum_{i=1}^N (X_i - \mathbb{E}[X_i]) \to 0 \quad \text{almost surely as } N \to \infty.$$

Theorem 1.23 (Central limit theorem; A.M. Lyapunov 1901, J.W. Lindeberg 1922, P. Lévy 1922)

Let $(X_i)_{i\in\mathbb{N}}$ be a sequence of independent, identically distributed real-valued random variables in $L^2$ with $\mathbb{E}[X_i] = m$ and $\mathrm{Var}(X_i) = v > 0$. Then, as $N \to \infty$,
$$S_N^* := \frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{X_i - m}{\sqrt{v}} \xrightarrow{\ d\ } N(0,1).$$
The normal distribution is defined as follows. A real-valued random variable $X$ is normally distributed with mean $\mu$ and variance $\sigma^2 > 0$ if
$$\mathbb{P}(X > x) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_x^\infty e^{-\frac{(u-\mu)^2}{2\sigma^2}}\,du \quad \text{ for all } x \in \mathbb{R}.$$
We write $X \sim N(\mu, \sigma^2)$. We say that $X$ is standard normally distributed if $X \sim N(0,1)$.

A random vector $X = (X_1, \ldots, X_n)$ is called a Gaussian random vector if there exists an $n \times m$ matrix $A$ and an $n$-dimensional vector $b \in \mathbb{R}^n$ such that $X = AY + b$, where $Y$ is an $m$-dimensional vector with independent standard normal entries, i.e. $Y_i \sim N(0,1)$ for $i = 1, \ldots, m$. Likewise, a random vector $Y = (Y_1, \ldots, Y_m)$ with values in $\mathbb{R}^m$ has the $m$-dimensional standard Gaussian distribution if the $m$ coordinates are standard normally distributed and independent. The covariance matrix of $X = AY + b$ is then given by
$$\mathrm{cov}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T\big] = AA^T.$$

Lemma 1.24 If $A$ is an orthogonal $n \times n$ matrix, i.e. $AA^T = \mathbb{1}$, and $X$ is an $n$-dimensional standard Gaussian vector, then $AX$ is also an $n$-dimensional standard Gaussian vector.

Lemma 1.25 Let $X_1$ and $X_2$ be independent and normally distributed with zero mean and variance $\sigma^2 > 0$. Then $X_1 + X_2$ and $X_1 - X_2$ are independent and normally distributed with mean 0 and variance $2\sigma^2$.

Proposition 1.26 If X and Y are n-dimensional Gaussian vectors with E[X] = E[Y ] and cov(X) = cov(Y ), then X and Y have the same distribution.

Corollary 1.27 A Gaussian random vector $X$ has independent entries if and only if its covariance matrix is diagonal. In other words, the entries in a Gaussian vector are uncorrelated if and only if they are independent.
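The representation $X = AY + b$ with $\mathrm{cov}(X) = AA^T$ is also how Gaussian vectors with a prescribed covariance are generated in practice: factor the target covariance (e.g. by a Cholesky decomposition) and transform a standard Gaussian vector. A minimal sketch, not part of the notes, with an arbitrary illustrative covariance matrix and numpy assumed:

```python
import numpy as np

rng = np.random.default_rng(3)

# Target mean b and covariance Sigma = A A^T (both chosen for illustration).
b = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.2],
                  [0.6, 1.0, 0.3],
                  [0.2, 0.3, 0.5]])
A = np.linalg.cholesky(Sigma)              # lower-triangular factor with A A^T = Sigma

# X = A Y + b with Y an m-dimensional standard Gaussian vector (here m = n = 3).
Y = rng.standard_normal(size=(3, 100_000))
X = A @ Y + b[:, None]

# Empirical mean and covariance should be close to b and Sigma.
print(X.mean(axis=1))
print(np.cov(X))
```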

Bernoulli: $p \in [0,1]$, then $X \sim \mathrm{Ber}(p)$ if $\mathbb{P}(X = 1) = p$ and $\mathbb{P}(X = 0) = 1 - p =: q$. If $X \sim \mathrm{Ber}(p)$, then $\mathbb{E}[X] = p$ and $\mathrm{Var}(X) = pq$.

Binomial: $S_N = \sum_{i=1}^N X_i \sim B(N, p)$ if $X_i \sim \mathrm{Ber}(p)$ and $(X_i)$, $i = 1, \ldots, N$, is an independent family. $\mathbb{E}[S_N] = Np$ and $\mathrm{Var}(S_N) = Npq$.

Poisson: $\lambda > 0$, then $X \sim \mathrm{Poi}(\lambda)$ if
$$\mathbb{P}(X = k) = \frac{e^{-\lambda}\lambda^k}{k!}, \qquad k \in \mathbb{N}_0,$$
and $\mathrm{Poi}(\lambda) \in \mathcal{M}_1(\mathbb{N}_0, \mathcal{P}(\mathbb{N}_0))$. Here, $\mathcal{M}_1(\Omega)$ denotes the set of probability measures on $\Omega$.

Exponential: A random variable $X$ taking positive real values is exponentially distributed with parameter $\alpha > 0$ when the probability density function is
$$f_X(t) = \alpha e^{-\alpha t} \quad \text{for } t \ge 0.$$
We write $X \sim \mathrm{Exp}(\alpha)$. If $X \sim \mathrm{Exp}(\alpha)$, then $\mathbb{E}[X] = \frac{1}{\alpha}$ and $\mathrm{Var}(X) = \frac{1}{\alpha^2}$.

Theorem 1.28 (Poisson limit theorem) Let $X_i^{(N)}$, $i = 1, \ldots, N$, be independent Bernoulli random variables $X_i^{(N)} \sim \mathrm{Ber}(p_i^{(N)})$, and denote by $S_N = \sum_{i=1}^N X_i^{(N)}$ their sum. Assume that, as $N \to \infty$,
$$\max_{1\le i\le N}\{p_i^{(N)}\} \xrightarrow[N\to\infty]{} 0 \quad\text{and}\quad \mathbb{E}[S_N] = \sum_{i=1}^N p_i^{(N)} \xrightarrow[N\to\infty]{} \lambda \in (0,\infty).$$
Then, as $N \to \infty$, $S_N \to \mathrm{Poi}(\lambda)$ in distribution.

2 Concentration inequalities for independent random variables

2.1 Why concentration inequalities

Suppose a random variable $X$ has mean $\mu$; then, for any $t \ge 0$,
$$\mathbb{P}(|X - \mu| > t) \le \text{something small}$$

is a concentration inequality. One is interested in cases where the bound on the right hand side decays with increasing parameter $t \ge 0$. We now discuss a very simple example to demonstrate that we need better concentration inequalities than the ones obtained for example from the central limit theorem (CLT). Toss a fair coin $N$ times. What is the probability

that we get at least $3N/4$ heads? Recall that $\mathbb{E}[S_N] = \frac{N}{2}$ and $\mathrm{Var}(S_N) = \frac{N}{4}$, which in conjunction with Corollary 1.17 gives the bound
$$\mathbb{P}\Big(S_N \ge \frac{3}{4}N\Big) \le \mathbb{P}\Big(\Big|S_N - \frac{N}{2}\Big| \ge \frac{N}{4}\Big) \le \frac{4}{N}.$$
This concentration bound vanishes linearly in $N$. One may wonder if we can do better using the CLT. The De Moivre-Laplace limit theorem (a variant of the CLT for Binomial distributions) states that the distribution of the normalised number of heads

$$Z_N = \frac{S_N - N/2}{\sqrt{N/4}}$$
converges to the standard normal distribution $N(0,1)$. We therefore expect to get the following concentration bound
$$\mathbb{P}\Big(S_N \ge \frac{3}{4}N\Big) = \mathbb{P}\big(Z_N \ge \sqrt{N/4}\big) \approx \mathbb{P}\big(Y \ge \sqrt{N/4}\big), \tag{2.1}$$
where $Y \sim N(0,1)$. To obtain explicit bounds we need estimates on the tail of the normal distribution.

Proposition 2.1 (Tails of the normal distribution) Let $Y \sim N(0,1)$. Then, for all $t > 0$, we have
$$\Big(\frac{1}{t} - \frac{1}{t^3}\Big)\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2} \ \le\ \mathbb{P}(Y \ge t) \ \le\ \frac{1}{t}\,\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}.$$
Proof. Denote $f(x) := \exp(-x^2/2)$. For the upper bound we use $x \ge t$ to get the estimate
$$\int_t^\infty f(x)\,dx \le \int_t^\infty \frac{x}{t}\,e^{-x^2/2}\,dx = \frac{1}{t}\,e^{-t^2/2}.$$
For the lower bound we use integration by parts (IBP):
$$\int_t^\infty e^{-x^2/2}\,dx = \int_t^\infty \frac{1}{x}\,xe^{-x^2/2}\,dx = \Big[-\frac{1}{x}e^{-x^2/2}\Big]_t^\infty - \int_t^\infty \frac{1}{x^2}\,e^{-x^2/2}\,dx$$
$$= \frac{1}{t}e^{-t^2/2} - \int_t^\infty \frac{1}{x^3}\,xe^{-x^2/2}\,dx = \frac{1}{t}e^{-t^2/2} - \frac{1}{t^3}e^{-t^2/2} + \int_t^\infty \frac{3}{x^4}\,e^{-x^2/2}\,dx.$$
Hence
$$\int_t^\infty e^{-x^2/2}\,dx = \Big(\frac{1}{t} - \frac{1}{t^3}\Big)e^{-t^2/2} + \int_t^\infty \frac{3}{x^4}\,e^{-x^2/2}\,dx,$$
and, as the integral on the right hand side is positive,
$$\mathbb{P}(Y \ge t) \ge \frac{1}{\sqrt{2\pi}}\big(1/t - 1/t^3\big)\,e^{-t^2/2}. \qquad\Box$$
The lower bound in Proposition 2.1 is lower than the tail lower bound in Lemma C.5 in the appendix.
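These bounds (and the sharper lower bound of Lemma 2.2 below) are easy to check numerically against the exact Gaussian tail; a small sketch, not part of the notes, using only the Python standard library:

```python
import math

def normal_tail(t):
    """P(Y >= t) for Y ~ N(0,1), via the complementary error function."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

for t in [0.5, 1.0, 2.0, 4.0]:
    phi = math.exp(-t * t / 2.0) / math.sqrt(2.0 * math.pi)
    lower_prop = (1.0 / t - 1.0 / t**3) * phi      # lower bound of Proposition 2.1
    lower_lem = t / (t * t + 1.0) * phi            # lower bound of Lemma 2.2
    upper = phi / t                                # upper bound of Proposition 2.1
    print(t, lower_prop, lower_lem, normal_tail(t), upper)
```

For small $t$ the lower bound of Proposition 2.1 is even negative (hence trivial), while the bound of Lemma 2.2 stays positive.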

Lemma 2.2 (Lower tail bound for normal distribution) Let $Y \sim N(0,1)$. Then, for all $t \ge 0$,
$$\mathbb{P}(Y \ge t) \ge \frac{t}{t^2+1}\,\frac{1}{\sqrt{2\pi}}\,e^{-t^2/2}.$$

Proof. Define
$$g(t) := \frac{1}{\sqrt{2\pi}}\Big(\int_t^\infty e^{-x^2/2}\,dx - \frac{t}{t^2+1}\,e^{-t^2/2}\Big).$$
Then $g(0) > 0$ and $g'(t) = \frac{-2\,e^{-t^2/2}}{\sqrt{2\pi}\,(t^2+1)^2} < 0$. Thus $g$ is strictly decreasing with $\lim_{t\to\infty}g(t) = 0$, implying that $g(t) > 0$ for all $t \in \mathbb{R}$. $\Box$

Using the tail estimates in Proposition 2.1 we expect to obtain an exponential bound for the right hand side of (2.1), namely that the probability of having at least $3N/4$ heads seems to be smaller than $\frac{1}{\sqrt{2\pi}}e^{-N/8}$. However, we have not taken into account the approximation errors of $\approx$ in (2.1). Unfortunately, it turns out that the error decays too slowly, actually even more slowly than linearly in $N$. This can be seen from the following version of the CLT, which we state without proof, see for example [Dur19].

Theorem 2.3 (Berry-Esseen CLT) In the setting of Theorem 1.23, for every $N$ and every $t \in \mathbb{R}$, we have
$$|\mathbb{P}(Z_N \ge t) - \mathbb{P}(Y \ge t)| \le \frac{\varrho}{\sqrt{N}},$$
where $Y \sim N(0,1)$ and $\varrho = \mathbb{E}[|X_1 - m|^3]/\sigma^3$.

Can we improve the approximation error '$\approx$' in (2.1) by simply computing the probabilities with the help of Stirling's formula? Suppose that $N$ is even. Then
$$\mathbb{P}(S_N = N/2) = 2^{-N}\binom{N}{N/2} \sim \frac{1}{\sqrt{N}} \quad\text{and}\quad \mathbb{P}(Z_N = 0) \sim \frac{1}{\sqrt{N}},$$
but $\mathbb{P}(Y = 0) = 0$. We thus see that we shall need a different approach to get a better concentration bound. This is the content of the next section.

2.2 Hoeffding's Inequality

In this section we study sums of symmetric Bernoulli random variables defined as follows.

Definition 2.4 A random variable $Z$ taking values in $\{-1, +1\}$ with
$$\mathbb{P}(Z = -1) = \mathbb{P}(Z = +1) = \frac{1}{2}$$
is called a symmetric Bernoulli random variable. Note that the 'standard' Bernoulli random variable $X$ takes values in $\{0, 1\}$, and one easily switches between both via $Z = 2X - 1$.

Theorem 2.5 (Hoeffding's inequality) Suppose that $X_1, \ldots, X_N$ are independent symmetric Bernoulli random variables and let $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$. Then, for any $t \ge 0$, we have
$$\mathbb{P}\Big(\sum_{i=1}^N a_iX_i \ge t\Big) \le \exp\Big(-\frac{t^2}{2\|a\|_2^2}\Big).$$

Proof. Without loss of generality we can put $\|a\|_2 = 1$: if $\|a\|_2 \ne 1$, define $\tilde a_i = a_i/\|a\|_2$, $i = 1, \ldots, N$, and apply the unit-norm case to bound
$$\mathbb{P}\Big(\sum_{i=1}^N \tilde a_i X_i \ge t/\|a\|_2\Big).$$
Let $\lambda > 0$; then, using Markov's inequality, we obtain
$$\mathbb{P}\Big(\sum_{i=1}^N a_iX_i \ge t\Big) = \mathbb{P}\Big(\exp\Big(\lambda\sum_{i=1}^N a_iX_i\Big) \ge e^{\lambda t}\Big) \le e^{-\lambda t}\,\mathbb{E}\Big[\exp\Big(\lambda\sum_{i=1}^N a_iX_i\Big)\Big].$$
To bound the right hand side use first that the random variables are independent,
$$\mathbb{E}\Big[\exp\Big(\lambda\sum_{i=1}^N a_iX_i\Big)\Big] = \prod_{i=1}^N \mathbb{E}[e^{\lambda a_iX_i}],$$
and then compute, for $i \in \{1, \ldots, N\}$,
$$\mathbb{E}[e^{\lambda a_iX_i}] = \frac{e^{\lambda a_i} + e^{-\lambda a_i}}{2} = \cosh(\lambda a_i).$$
We are left to find a bound for the hyperbolic cosine. There are two ways to get the upper bound $\cosh(x) \le e^{x^2/2}$.
1.) Simply write down both Taylor series and compare term by term, that is,
$$\cosh(x) = 1 + \frac{x^2}{2!} + \frac{x^4}{4!} + \frac{x^6}{6!} + \cdots$$
and
$$e^{x^2/2} = \sum_{k=0}^\infty \frac{(x^2/2)^k}{k!} = 1 + \frac{x^2}{2} + \frac{x^4}{2^2\,2!} + \cdots.$$
2.) Use the product expansion (complex analysis - not needed for this course, just as background information)
$$\cosh(x) = \prod_{k=1}^\infty \Big(1 + \frac{4x^2}{\pi^2(2k-1)^2}\Big) \le \exp\Big(\sum_{k=1}^\infty \frac{4x^2}{\pi^2(2k-1)^2}\Big) = e^{x^2/2}.$$
In any case, we obtain
$$\mathbb{E}[e^{\lambda a_iX_i}] \le e^{\lambda^2 a_i^2/2},$$
and thus
$$\mathbb{P}\Big(\sum_{i=1}^N a_iX_i \ge t\Big) \le \exp\Big(-\lambda t + \frac{\lambda^2}{2}\|a\|_2^2\Big).$$
Using $\|a\|_2 = 1$ and optimising over the value of $\lambda$ we get $\lambda = t$, and thus the right hand side is simply $\exp(-t^2/2)$. $\Box$
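A quick Monte Carlo check of Theorem 2.5 (not part of the notes; the coefficient vector, sample sizes and thresholds are arbitrary, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 100, 100_000
a = rng.normal(size=N)                          # arbitrary coefficient vector

X = rng.choice([-1.0, 1.0], size=(trials, N))   # independent symmetric Bernoulli entries
S = X @ a

for c in [1.0, 2.0, 3.0]:
    t = c * np.linalg.norm(a)                   # thresholds at 1, 2, 3 "standard deviations"
    bound = np.exp(-t**2 / (2 * (a @ a)))
    print(f"t = {t:7.2f}   empirical P(S >= t) = {(S >= t).mean():.5f}   bound = {bound:.5f}")
```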

With Hoeffding's inequality in Theorem 2.5 we are now in the position to answer our previous question on the probability for a fair coin to show heads at least $3N/4$ times. The fair coin toss is a standard Bernoulli random variable $X_i$ with $\mathbb{P}(X_i = 1) = \mathbb{P}(X_i = 0) = \frac{1}{2}$, and the symmetric one is just $Z_i = 2X_i - 1$. Thus we obtain the following bound,
$$\mathbb{P}\Big(\sum_{i=1}^N X_i \ge \frac{3}{4}N\Big) = \mathbb{P}\Big(\sum_{i=1}^N \frac{Z_i+1}{2} \ge \frac{3}{4}N\Big) = \mathbb{P}\Big(\sum_{i=1}^N \frac{Z_i}{2} \ge \frac{1}{4}N\Big)$$
$$= \mathbb{P}\Big(\sum_{i=1}^N \frac{Z_i}{\sqrt{N}} \ge \frac{1}{2}\sqrt{N}\Big) \le \exp\Big(-\frac{1}{2\cdot4}N\Big).$$
This shows that Hoeffding's inequality gives a good concentration estimate, avoiding the approximation error of the CLT. We also get two-sided tail / concentration estimates for $S = \sum_{i=1}^N a_iX_i$ using Theorem 2.5 by writing

P(|S| ≥ t) = P(S ≥ t) + P(−S ≥ t).

Theorem 2.6 (Two-sided Hoeffding's inequality) Suppose that $X_1, \ldots, X_N$ are independent symmetric Bernoulli random variables and let $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$. Then, for any $t \ge 0$, we have
$$\mathbb{P}\Big(\Big|\sum_{i=1}^N a_iX_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{t^2}{2\|a\|_2^2}\Big).$$
As the reader may have realised, the proof of Hoeffding's inequality, Theorem 2.5, is quite flexible as it is based on bounding the moment generating function. For example, the following version of Hoeffding's inequality is valid for general bounded random variables. The proof will be given as an exercise.

Theorem 2.7 (Hoeffding's inequality for general bounded random variables) Let $X_1, \ldots, X_N$ be independent random variables with $X_i \in [m_i, M_i]$, $m_i < M_i$, for $i = 1, \ldots, N$. Then, for any $t \ge 0$, we have
$$\mathbb{P}\Big(\sum_{i=1}^N (X_i - \mathbb{E}[X_i]) \ge t\Big) \le \exp\Big(-\frac{2t^2}{\sum_{i=1}^N (M_i - m_i)^2}\Big).$$

Example 2.8 Let $X_1, \ldots, X_N$ be non-negative independent random variables with continuous distribution (i.e., having a density with respect to the Lebesgue measure). Assume that the probability density functions $f_i$ of $X_i$ are uniformly bounded by 1. Then the following holds.
(a) For all $i = 1, \ldots, N$,
$$\mathbb{E}[\exp(-tX_i)] = \int_0^\infty e^{-tx}f_i(x)\,dx \le \int_0^\infty e^{-tx}\,dx = \frac{1}{t}.$$
(b) For any $\varepsilon > 0$, we have
$$\mathbb{P}\Big(\sum_{i=1}^N X_i \le \varepsilon N\Big) \le (e\varepsilon)^N. \tag{2.2}$$

To show (2.2), we use Markov's inequality for some $\lambda > 0$, the independence, and part (a) to arrive at
$$\mathbb{P}\Big(\sum_{i=1}^N X_i \le \varepsilon N\Big) = \mathbb{P}\Big(\sum_{i=1}^N -\frac{X_i}{\varepsilon} \ge -N\Big) \le e^{\lambda N}\prod_{i=1}^N \mathbb{E}\big[e^{-(\lambda/\varepsilon)X_i}\big] \le e^{\lambda N}\Big(\frac{\varepsilon}{\lambda}\Big)^N = e^{\lambda N}\,e^{N\log(\varepsilon/\lambda)}.$$
Minimising over $\lambda > 0$ gives $\lambda = 1$ and thus the desired statement.

2.3 Chernoff's Inequality

Hoeffding's inequality is already good, but it does not produce good results in case the success probabilities/parameters $p_i$ are very small. In that case one knows that the sum $S_N$ of $N$ Bernoulli random variables has an approximately Poisson distribution. The Gaussian tails are not good enough to match Poisson tails, as we will see later.

Theorem 2.9 (Chernoff's inequality) Let $X_i$ be independent Bernoulli random variables with parameters $p_i \in [0,1]$, i.e., $X_i \sim \mathrm{Ber}(p_i)$, $i = 1, \ldots, N$. Denote by $S_N = \sum_{i=1}^N X_i$ their sum and by $\mu = \mathbb{E}[S_N]$ its mean. Then, for any $t > \mu$, we have
$$\mathbb{P}(S_N \ge t) \le e^{-\mu}\Big(\frac{e\mu}{t}\Big)^t.$$

Proof. Let λ > 0, then Markov’s inequality gives

$$\mathbb{P}(S_N \ge t) = \mathbb{P}(e^{\lambda S_N} \ge e^{\lambda t}) \le e^{-\lambda t}\prod_{i=1}^N \mathbb{E}[\exp(\lambda X_i)].$$

We need to bound the MGF of each Bernoulli random variable $X_i$ separately. Using $1 + x \le e^x$, we get
$$\mathbb{E}[\exp(\lambda X_i)] = e^\lambda p_i + (1 - p_i) = 1 + (e^\lambda - 1)p_i \le \exp\big((e^\lambda - 1)p_i\big).$$
Thus
$$\prod_{i=1}^N \mathbb{E}[\exp(\lambda X_i)] \le \exp\Big((e^\lambda - 1)\sum_{i=1}^N p_i\Big) = \exp\big((e^\lambda - 1)\mu\big),$$
and therefore
$$\mathbb{P}(S_N \ge t) \le e^{-\lambda t}\exp\big((e^\lambda - 1)\mu\big) \quad\text{for all } \lambda > 0.$$
Optimising over $\lambda > 0$ we obtain $\lambda = \log(t/\mu)$, which is positive as $t > \mu$, for the minimal upper bound. $\Box$
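For small success probabilities the Chernoff bound is much tighter than the Gaussian-type Hoeffding bound of Theorem 2.7; a numerical comparison sketch (not part of the notes; the parameters are chosen only for illustration, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 10_000, 1e-3                      # many Bernoulli variables, tiny success probability
mu = N * p                               # = 10

# S_N ~ Binomial(N, p); sample it directly many times.
S = rng.binomial(N, p, size=500_000)

for t in [20, 25, 30]:
    chernoff = np.exp(-mu) * (np.e * mu / t) ** t
    hoeffding = np.exp(-2 * (t - mu) ** 2 / N)   # Theorem 2.7 with X_i in [0, 1]
    print(t, (S >= t).mean(), chernoff, hoeffding)
```

The Hoeffding bound is essentially useless here (close to 1), while the Chernoff bound tracks the Poisson-like decay of the true tail.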

Exercise 2.10 In the setting of Theorem 2.9, prove that, for any $t < \mu$,
$$\mathbb{P}(S_N \le t) \le e^{-\mu}\Big(\frac{e\mu}{t}\Big)^t. \qquad K$$

Solution. We get
$$\mathbb{P}(S_N \le t) = \mathbb{P}(-S_N \ge -t) \le e^{\lambda t}\,\mathbb{E}[e^{-\lambda S_N}]$$
and
$$\prod_{i=1}^N \mathbb{E}[e^{-\lambda X_i}] \le \exp\big((e^{-\lambda} - 1)\mu\big).$$
Thus
$$\mathbb{P}(S_N \le t) \le e^{\lambda t}\exp\big((e^{-\lambda} - 1)\mu\big).$$
Minimising over $\lambda > 0$ gives $-\lambda = \log(t/\mu)$, which is valid as $t < \mu$.

We shall now discuss some examples on random graphs where concentration inequalities provide useful bounds.

Example 2.11 (Degrees of Random Graphs) We consider the Erdős-Rényi random graph model. This is the simplest stochastic model for large, real-world networks. The random graph $G(n,p)$ consists of $n$ vertices, and each pair of distinct vertices is connected independently (from all other pairs) with probability $p \in [0,1]$. The degree of a vertex is the (random) number of edges incident to that vertex. The expected degree of every vertex $i$ (we label the $n$ vertices by the numbers $1, \ldots, n$) is
$$d(i) = \mathbb{E}[D(i)] = (n-1)p, \qquad i = 1, \ldots, n.$$
Note that $d(i) = d(j)$ for all $i \ne j$, $i, j = 1, \ldots, n$, and thus we simply write $d$ in the following. We call the random graph dense when $d \ge \log n$ holds, and the graph is called regular if the degrees of all vertices are approximately equal to $d$.

Proposition 2.12 (Dense graphs are almost regular) There is an absolute constant C > 0 such that the following holds: Suppose that G(n, p) has expected degree d ≥ C log n. Then, with high probability, all vertices of G(n, p) have degrees between 0.9d and 1.1d.

Proof. Pick a vertex $i$ of the graph; then the degree is simply a sum of Bernoulli random variables, i.e.,
$$D(i) = \sum_{k=1}^{n-1} X_k, \qquad X_k \sim \mathrm{Ber}(p).$$
We are using the two-sided version of the Chernoff bound in Theorem 2.9, see Exercise 2(a), Example Sheet 1:
$$\mathbb{P}(|D(i) - d| \ge 0.1d) \le 2e^{-cd} \quad\text{for some absolute constant } c > 0.$$

We can now 'unfix' $i$ by taking the union bound over all vertices of the graph:
$$\mathbb{P}(\exists\, i \in \{1, \ldots, n\} \colon |D(i) - d| \ge 0.1d) \le \sum_{i=1}^n \mathbb{P}(|D(i) - d| \ge 0.1d) \le 2ne^{-cd}.$$
If $d \ge C\log n$ for a sufficiently large absolute constant $C$, the probability on the left hand side is bounded as
$$\mathbb{P}(\exists\, i \in \{1, \ldots, n\} \colon |D(i) - d| \ge 0.1d) \le 2ne^{-cC\log n} = 2n^{1-cC} \le 0.1.$$
This means that, with probability 0.9, the complementary event occurs and we have

$$\mathbb{P}(\forall\, i \in \{1, \ldots, n\} \colon |D(i) - d| < 0.1d) \ge 0.9. \qquad\Box\quad♣$$
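Proposition 2.12 is easy to observe in a simulation; below is a sketch, not part of the notes, that samples one dense $G(n,p)$ and checks the degree spread (the parameters are chosen so that the 10% window is comfortably wide for this moderate $n$, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 3000, 0.5                        # dense regime: d = (n-1)p is far larger than log n
d = (n - 1) * p

# Sample the adjacency matrix of G(n, p): independent edges on the upper triangle.
upper = np.triu(rng.random((n, n)) < p, k=1)
adj = upper | upper.T
degrees = adj.sum(axis=1)

print("expected degree d:", d)
print("min / max degree :", degrees.min(), degrees.max())
print("all degrees in [0.9d, 1.1d]:",
      bool(np.all((degrees >= 0.9 * d) & (degrees <= 1.1 * d))))
```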

2.4 Sub-Gaussian random variables

In this section we introduce a family of random variables, the so-called sub-Gaussian random variables, which show exponential concentration bounds similar to the Normal distribution (Gaussian). Before we define this class via various properties, we discuss some facts on Chernoff bounds. Suppose the real-valued random variable $X$ has mean $\mu \in \mathbb{R}$ and there is some constant $b > 0$ such that the function
$$\Phi(\lambda) := \mathbb{E}\big[e^{\lambda(X - \mathbb{E}[X])}\big], \tag{2.3}$$
the so-called centred moment generating function, exists for all $|\lambda| \le b$. For $\lambda \in [0, b]$, we may apply Markov's inequality in Proposition 1.16 to the random variable $Y = e^{\lambda(X - \mathbb{E}[X])}$, thereby obtaining the upper bound
$$\mathbb{P}(X - \mu \ge t) = \mathbb{P}\big(e^{\lambda(X - \mathbb{E}[X])} \ge e^{\lambda t}\big) \le \frac{\Phi(\lambda)}{e^{\lambda t}}.$$
Optimising our choice of $\lambda$ so as to obtain the tightest result yields the so-called Chernoff bound, namely, the inequality
$$\log \mathbb{P}(X - \mu \ge t) \le \inf_{\lambda\in[0,b]}\{\log\Phi(\lambda) - \lambda t\}. \tag{2.4}$$

Example 2.13 (Gaussian tail bounds) Let $X \sim N(\mu, \sigma^2)$, $\mu \in \mathbb{R}$, $\sigma^2 > 0$, and recall the probability density function (pdf) of $X$,
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
The moment generating function (MGF) is easily computed as
$$M_X(\lambda) = \mathbb{E}[e^{\lambda X}] = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{\mathbb{R}} e^{-\frac{(x-\mu)^2}{2\sigma^2} + \lambda x}\,dx = e^{\lambda\mu}\,\frac{1}{\sqrt{2\pi\sigma^2}}\int_{\mathbb{R}} e^{-\frac{y^2}{2\sigma^2} + \lambda y}\,dy = e^{\lambda\mu + \lambda^2\sigma^2/2} < \infty$$
for all $\lambda \in \mathbb{R}$. We obtain the Chernoff bound by optimising over $\lambda \ge 0$, using $\Phi(\lambda) = e^{-\lambda\mu}M_X(\lambda) = e^{\lambda^2\sigma^2/2}$:
$$\inf_{\lambda\ge0}\{\log\Phi(\lambda) - \lambda t\} = \inf_{\lambda\ge0}\Big\{\frac{\lambda^2\sigma^2}{2} - \lambda t\Big\} = -\frac{t^2}{2\sigma^2}.$$
Thus any $N(\mu, \sigma^2)$ random variable $X$ satisfies the upper deviation inequality
$$\mathbb{P}(X \ge \mu + t) \le \exp\Big(-\frac{t^2}{2\sigma^2}\Big), \qquad t \ge 0,$$
and the two-sided version
$$\mathbb{P}(|X - \mu| \ge t) \le 2\exp\Big(-\frac{t^2}{2\sigma^2}\Big), \qquad t \ge 0. \qquad♣$$

Exercise 2.14 (Moments of the normal distribution) Let $p \ge 1$ and $X \sim N(0,1)$. Then
$$\|X\|_{L^p} = (\mathbb{E}[|X|^p])^{1/p} = \Big(\frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} |x|^p e^{-x^2/2}\,dx\Big)^{1/p} = \Big(\frac{2}{\sqrt{2\pi}}\int_0^\infty x^p e^{-x^2/2}\,dx\Big)^{1/p}$$
$$\underset{(x^2/2 = w)}{=} \Big(\frac{2}{\sqrt{2\pi}}\int_0^\infty (\sqrt{2w})^p\,e^{-w}\,\frac{1}{\sqrt{2w}}\,dw\Big)^{1/p} = \Big(\frac{\sqrt{2}^{\,p}}{\sqrt{\pi}}\,\Gamma(p/2 + 1/2)\Big)^{1/p} = \sqrt{2}\,\Big(\frac{\Gamma(p/2+1/2)}{\Gamma(1/2)}\Big)^{1/p}.$$
Hence
$$\|X\|_{L^p} = O(\sqrt{p}) \quad\text{as } p \to \infty. \qquad K$$

We also note that the MGF of $X \sim N(0,1)$ is given as $M_X(\lambda) = e^{\lambda^2/2}$. In the following proposition we identify equivalent properties for a real-valued random variable which exhibits similar tail bounds, moment bounds and MGF estimates to those of a Gaussian random variable.

Proposition 2.15 (Sub-Gaussian properties) Let $X$ be a real-valued random variable. Then there are absolute constants $C_i > 0$, $i = 1, \ldots, 5$, such that the following statements are equivalent:

(i) Tails of $X$:
$$\mathbb{P}(|X| \ge t) \le 2\exp(-t^2/C_1^2) \quad\text{for all } t \ge 0.$$

(ii) Moments:
$$\|X\|_{L^p} \le C_2\sqrt{p} \quad\text{for all } p \ge 1.$$

(iii) MGF of $X^2$:
$$\mathbb{E}[\exp(\lambda^2X^2)] \le \exp(C_3^2\lambda^2) \quad\text{for all } \lambda \text{ with } |\lambda| \le \frac{1}{C_3}.$$

(iv) MGF bound:
$$\mathbb{E}[\exp(X^2/C_4^2)] \le 2.$$

Moreover, if $\mathbb{E}[X] = 0$, then the statements (i)-(iv) are equivalent to the following statement:

(v) MGF bound of $X$:
$$\mathbb{E}[\exp(\lambda X)] \le \exp(C_5^2\lambda^2) \quad\text{for all } \lambda \in \mathbb{R}.$$

Proof. (i) ⇒ (ii): Without loss of generality we can assume that $C_1 = 1$ for property (i). This can be seen by just rescaling the parameter $t$ by $C_1$, which corresponds to multiplying $X$ by $1/C_1$. The integral identity in Lemma 1.9 applied to the non-negative random variable $|X|^p$ gives
$$\mathbb{E}[|X|^p] = \int_0^\infty \mathbb{P}(|X|^p > t)\,dt \underset{(t=u^p)}{=} \int_0^\infty \mathbb{P}(|X| \ge u)\,pu^{p-1}\,du \underset{\text{(i)}}{\le} \int_0^\infty 2e^{-u^2}pu^{p-1}\,du \underset{(u^2=s)}{=} p\,\Gamma\Big(\frac{p}{2}\Big) \le p\Big(\frac{p}{2}\Big)^{p/2},$$
where we used the property $\Gamma(x) \le x^x$ of the Gamma function.$^5$ Taking the $p$th root gives $\|X\|_{L^p} \le p^{1/p}\sqrt{p/2}$, and since $p^{1/p} \le 2$ for all $p \ge 1$ (recall that $\lim_{n\to\infty}\sqrt[n]{n} = 1$; more precisely, writing $\sqrt[n]{n} = 1 + \delta_n$, the binomial theorem gives $0 < \delta_n \le \sqrt{2/(n-1)}$ for $n \ge 2$), we obtain (ii) with some $C_2 \le 2$.

(ii) ⇒ (iii): Again without loss of generality we can assume that $C_2 = 1$. Taylor expansion of the exponential function and linearity of the expectation gives
$$\mathbb{E}[\exp(\lambda^2X^2)] = 1 + \sum_{k=1}^\infty \frac{\lambda^{2k}\,\mathbb{E}[X^{2k}]}{k!}.$$
Property (ii) implies that $\mathbb{E}[X^{2k}] \le (2k)^k$, and by Stirling's formula$^6$ we get that $k! \ge (k/e)^k$. Thus
$$\mathbb{E}[\exp(\lambda^2X^2)] \le 1 + \sum_{k=1}^\infty \frac{(2\lambda^2k)^k}{(k/e)^k} = \sum_{k=0}^\infty (2e\lambda^2)^k = \frac{1}{1 - 2e\lambda^2},$$
provided that $2e\lambda^2 < 1$ (geometric series). To bound the right hand side, i.e., to bound $1/(1 - 2e\lambda^2)$, we use that
$$\frac{1}{1-x} \le e^{2x} \quad\text{for all } x \in [0, 1/2].$$

$^5$ $\Gamma(n) = (n-1)!$, $n \in \mathbb{N}$, and
$$\Gamma(z) := \int_0^\infty x^{z-1}e^{-x}\,dx \quad\text{for all } z = x + \mathrm{i}y \in \mathbb{C} \text{ with } x > 0,$$
$$\Gamma(z+1) = z\Gamma(z), \qquad \Gamma(1) = 1, \qquad \Gamma\Big(\frac{1}{2}\Big) = \sqrt{\pi}.$$

$^6$Stirling's formula:
$$N! \sim \sqrt{2\pi N}\,e^{-N}N^N, \quad\text{where } \sim \text{ means the quotient tends to } 1 \text{ as } N \to \infty.$$
Alternatively,
$$N! \sim \sqrt{2\pi N}\,e^{-N}N^N\Big(1 + \frac{1}{12N} + O\Big(\frac{1}{N^2}\Big)\Big).$$

For all $|\lambda| \le \frac{1}{2\sqrt{e}}$ we have $2e\lambda^2 \le \frac{1}{2}$, and thus
$$\mathbb{E}[\exp(\lambda^2X^2)] \le \exp(4e\lambda^2) \quad\text{for all } |\lambda| \le \frac{1}{2\sqrt{e}},$$
and (iii) follows with $C_3 = 2\sqrt{e}$.

(iii) ⇒ (iv): Trivial and left as an exercise.

(iv) ⇒ (i): Again, we assume without loss of generality that $C_4 = 1$. Using Markov's inequality (Proposition 1.16) with $f(x) = e^{x^2}$, we obtain
$$\mathbb{P}(|X| \ge t) = \mathbb{P}(e^{X^2} \ge e^{t^2}) \le e^{-t^2}\,\mathbb{E}[e^{X^2}] \underset{\text{(iv)}}{\le} 2e^{-t^2}.$$

Suppose now that $\mathbb{E}[X] = 0$. We show that (iii) ⇒ (v) and (v) ⇒ (i).

(iii) ⇒ (v): Again, we assume without loss of generality that $C_3 = 1$. We use the well-known inequality
$$e^x \le x + e^{x^2}, \qquad x \in \mathbb{R},$$
to obtain
$$\mathbb{E}[e^{\lambda X}] \le \mathbb{E}[\lambda X + e^{\lambda^2X^2}] = \mathbb{E}[e^{\lambda^2X^2}] \underset{\text{(iii)}}{\le} e^{\lambda^2} \quad\text{if } |\lambda| \le 1,$$
and thus we have (v) for the range $|\lambda| \le 1$. Now, assume $|\lambda| \ge 1$. This time, use the inequality $2\lambda x \le \lambda^2 + x^2$, $\lambda, x \in \mathbb{R}$, to arrive at
$$\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2/2}\,\mathbb{E}[e^{X^2/2}] \le e^{\lambda^2/2}\exp(1/2) \le e^{\lambda^2} \quad\text{as } |\lambda| \ge 1.$$

(v) ⇒ (i): Again, assume that $C_5 = 1$. Let $\lambda \ge 0$. Then
$$\mathbb{P}(X \ge t) = \mathbb{P}(e^{\lambda X} \ge e^{\lambda t}) \le e^{-\lambda t}\,\mathbb{E}[e^{\lambda X}] \le e^{-\lambda t}e^{\lambda^2} = \exp(-\lambda t + \lambda^2).$$
Optimising over $\lambda \ge 0$, we obtain $\lambda = t/2$, and thus $\mathbb{P}(X \ge t) \le e^{-t^2/4}$. Now we repeat the argument for $-X$ and obtain $\mathbb{P}(X \le -t) \le e^{-t^2/4}$, and thus
$$\mathbb{P}(|X| \ge t) \le 2e^{-t^2/4},$$
and $C_1 = 2$ implies (i). $\Box$

Definition 2.16 (Sub-Gaussian random variable, first definition) A real-valued random variable $X$ that satisfies one of the equivalent statements (i)-(iv) in Proposition 2.15 is called a sub-Gaussian random variable. Define, for any sub-Gaussian random variable,
$$\|X\|_{\psi_2} := \inf\big\{t > 0 \colon \mathbb{E}[\exp(X^2/t^2)] \le 2\big\}.$$

Exercise 2.17 (Sub-Gaussian norm) Let $X$ be a sub-Gaussian random variable and define
$$\|X\|_{\psi_2} := \inf\big\{t > 0 \colon \mathbb{E}[\exp(X^2/t^2)] \le 2\big\}.$$
Show that $\|\cdot\|_{\psi_2}$ is indeed a norm on the space of sub-Gaussian random variables. $\qquad K$

Example 2.18 (Examples of Sub-Gaussian random variables) (a) $X \sim N(0,1)$ is a sub-Gaussian random variable: pick $t > 0$ with $2/t^2 < 1$ and set $a(t) := 1 - 2/t^2$. Then
$$\mathbb{E}[\exp(X^2/t^2)] = \frac{1}{\sqrt{2\pi}}\int_{\mathbb{R}} e^{-\frac{a(t)x^2}{2}}\,dx = \sqrt{\frac{1}{a(t)}}$$
shows that there is an absolute constant $C > 0$ with $\|X\|_{\psi_2} \le C$.

(b) Let $X$ be a symmetric Bernoulli random variable, so that $|X| = 1$. Then
$$\mathbb{E}[e^{X^2/t^2}] = e^{1/t^2} \le 2 \iff \frac{1}{\log 2} \le t^2.$$
Thus $\|X\|_{\psi_2} = 1/\sqrt{\log 2}$.

(c) Let $X$ be a real-valued, bounded random variable, that is, $|X| \le b = \|X\|_\infty$. Thus
$$\mathbb{E}[e^{X^2/t^2}] \le e^{\|X\|_\infty^2/t^2} \le 2 \iff t \ge \|X\|_\infty/\sqrt{\log 2},$$
and $\|X\|_{\psi_2} \le \|X\|_\infty/\sqrt{\log 2}$. ♣
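For simple distributions the $\psi_2$-norm of Definition 2.16 can be computed numerically by solving $\mathbb{E}[\exp(X^2/t^2)] = 2$ for $t$, e.g. by bisection on the closed-form expressions above. A sketch, not part of the notes, reproducing the values from Example 2.18(a) and (b):

```python
import math

def psi2_norm(mgf_of_square, lo=1e-6, hi=100.0, iters=200):
    """Smallest t with E[exp(X^2 / t^2)] <= 2, found by bisection;
    mgf_of_square(t) returns E[exp(X^2 / t^2)] (math.inf if divergent)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mgf_of_square(mid) <= 2.0:
            hi = mid        # feasible: shrink from above
        else:
            lo = mid
    return hi

# Symmetric Bernoulli (Example 2.18(b)): E[exp(X^2/t^2)] = exp(1/t^2).
bernoulli = lambda t: math.exp(1.0 / t**2) if 1.0 / t**2 < 700 else math.inf
# Standard Gaussian (Example 2.18(a)): E[exp(X^2/t^2)] = 1/sqrt(1 - 2/t^2) for t^2 > 2.
gaussian = lambda t: 1.0 / math.sqrt(1.0 - 2.0 / t**2) if t**2 > 2.0 else math.inf

print(psi2_norm(bernoulli), 1.0 / math.sqrt(math.log(2.0)))   # should agree
print(psi2_norm(gaussian), math.sqrt(8.0 / 3.0))              # solves 1/sqrt(1-2/t^2) = 2
```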

Recall that the sum of independent normal random variables $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, N$, is normally distributed, that is,
$$S_N = \sum_{i=1}^N X_i \sim N\Big(\sum_{i=1}^N \mu_i,\ \sum_{i=1}^N \sigma_i^2\Big).$$
For a proof see [Dur19] or [Geo12]. We then may wonder if the sum of independent sub-Gaussian random variables is sub-Gaussian as well. There are different statements on this depending on whether $\mu_i = 0$ for all $i = 1, \ldots, N$, as done in the book [Ver18], or not. We shall follow the general case here. It proves useful to have the following equivalent definition of sub-Gaussian random variables.

Definition 2.19 (Sub-Gaussian random variables, second definition) A real-valued random variable $X$ with mean $\mu = \mathbb{E}[X]$ is a sub-Gaussian random variable if there is a positive number $\sigma$ such that

$$\mathbb{E}\big[e^{\lambda(X-\mu)}\big] \le e^{\sigma^2\lambda^2/2} \quad\text{for all } \lambda \in \mathbb{R}. \tag{2.5}$$

The constant σ satisfying (2.5) is referred to as the sub-Gaussian parameter; for instance, we say that X is sub-Gaussian with parameter σ when (2.5) holds. 20 CONCENTRATIONINEQUALITIESFORINDEPENDENTRANDOMVARIABLES

Remark 2.20 (Sub-Gaussian definitions) It is easy to see that our two definitions are equivalent when the random variable $X$ has zero mean $\mu = 0$: use statement (v) of Proposition 2.15. If $\mu \ne 0$ and $X$ satisfies (2.5), we obtain the following tail bound
$$\mathbb{P}(|X| \ge t) \le 2\exp\Big(-\frac{1}{2\sigma^2}(t - \mu)^2\Big) \quad\text{for all } t \ge 0. \tag{2.6}$$
This tail bound is not exactly the statement (i) of Proposition 2.15, as the parameter $t \ge 0$ is chosen with respect to the mean. In most cases, one is solely interested in tail estimates away from the mean. Thus the definitions are equivalent in case $\mu \ne 0$ if we limit the range of the parameter $t$ to $t \ge |\mu|$. In the literature and in applications Definition 2.19 is widely used and sometimes called sub-Gaussian for centred random variables. We use both definitions synonymously.

Proposition 2.21 (Sum of independent sub-Gaussian random variables) Let $X_1, \ldots, X_N$ be independent sub-Gaussian random variables with sub-Gaussian parameters $\sigma_i$, $i = 1, \ldots, N$, respectively. Then $S_N = \sum_{i=1}^N X_i$ is a sub-Gaussian random variable with sub-Gaussian parameter $\sqrt{\sum_{i=1}^N \sigma_i^2}$.

Proof.

$$\mathbb{E}\Big[\exp\Big(\lambda\sum_{i=1}^N(X_i - \mu_i)\Big)\Big] = \prod_{i=1}^N \mathbb{E}\big[\exp(\lambda(X_i - \mu_i))\big] \le \exp\Big(\frac{\lambda^2}{2}\sum_{i=1}^N \sigma_i^2\Big) \quad\text{for all } \lambda \in \mathbb{R}. \qquad\Box$$
This allows us to obtain a Hoeffding bound for sums of independent sub-Gaussian random variables.

Proposition 2.22 (Hoeffding bounds for sums of independent sub-Gaussian random variables) Let X1,...,XN be real-valued independent sub-Gaussian random variables with sub-Gaussian parameter σi and mean µi, i = 1,...,N, respectively. Then, for all t ≥ 0,

$$\mathbb{P}\Big(\sum_{i=1}^N (X_i - \mu_i) \ge t\Big) \le \exp\Big(-\frac{t^2}{2\sum_{i=1}^N \sigma_i^2}\Big).$$

Proof. Set $\widetilde S_N := \sum_{i=1}^N (X_i - \mu_i)$. Using again the exponential version of the Markov inequality in Proposition 1.16, we obtain
$$\mathbb{P}(\widetilde S_N \ge t) \le e^{-\lambda t}\,\mathbb{E}[e^{\lambda\widetilde S_N}] = e^{-\lambda t}\prod_{i=1}^N \mathbb{E}\big[e^{\lambda(X_i - \mu_i)}\big] \le \exp\Big(-\lambda t + \frac{\lambda^2}{2}\sum_{i=1}^N \sigma_i^2\Big).$$

Optimising over $\lambda$, we obtain $\lambda = t/\sum_{i=1}^N \sigma_i^2$, and conclude with the statement. $\Box$

Example 2.23 (Bounded random variables) Let $X$ be a real-valued, mean-zero random variable (bounded) supported on $[a, b]$, $a < b$. Denote by $X'$ an independent copy of $X$, i.e., $X \sim X'$. For any $\lambda \in \mathbb{R}$, using Jensen's inequality (Proposition 1.11),
$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X\big[e^{\lambda(X - \mathbb{E}_{X'}[X'])}\big] \le \mathbb{E}_{X,X'}\big[e^{\lambda(X - X')}\big],$$
where $\mathbb{E}_{X,X'}$ is the expectation with respect to the two independent and identically distributed random variables $X$ and $X'$. Suppose $\varepsilon$ is a Rademacher random variable independent of $X$ and $X'$, that is, $\varepsilon$ is a symmetric Bernoulli random variable. Then $(X - X') \sim \varepsilon(X - X')$. For Rademacher functions/symmetric Bernoulli random variables $\varepsilon$ we estimate the moment generating function as
$$\mathbb{E}[e^{\lambda\varepsilon}] = \frac{1}{2}(e^\lambda + e^{-\lambda}) = \frac{1}{2}\Big(\sum_{k=0}^\infty \frac{\lambda^k}{k!} + \sum_{k=0}^\infty \frac{(-\lambda)^k}{k!}\Big) = \sum_{k=0}^\infty \frac{\lambda^{2k}}{(2k)!} \le 1 + \sum_{k=1}^\infty \frac{\lambda^{2k}}{2^k k!} = e^{\lambda^2/2}.$$
Thus
$$\mathbb{E}_{X,X'}\big[e^{\lambda(X-X')}\big] = \mathbb{E}_{X,X'}\big[\mathbb{E}_\varepsilon[e^{\lambda\varepsilon(X-X')}]\big] \le \mathbb{E}_{X,X'}\big[\exp(\lambda^2(X-X')^2/2)\big] \le \exp(\lambda^2(b-a)^2/2),$$
as $|X - X'| \le b - a$. Thus $X$ is sub-Gaussian with sub-Gaussian parameter $\sigma = b - a > 0$, see Definition 2.19. ♣

Lemma 2.24 (Centering) If $X$ is a sub-Gaussian random variable then $X - \mathbb{E}[X]$ is sub-Gaussian too, with
$$\|X - \mathbb{E}[X]\|_{\psi_2} \le C\|X\|_{\psi_2}$$
for some absolute constant $C > 0$.

Proof. We use the fact that $\|\cdot\|_{\psi_2}$ is a norm. The triangle inequality gives
$$\|X - \mathbb{E}[X]\|_{\psi_2} \le \|X\|_{\psi_2} + \|\mathbb{E}[X]\|_{\psi_2}.$$
For any constant $a \in \mathbb{R}$ one has $\mathbb{E}[\exp(a^2/t^2)] \le 2$ as soon as $t \ge |a|/\sqrt{\log 2}$, thus
$$\|a\|_{\psi_2} = \frac{|a|}{\sqrt{\log 2}} \le c|a|$$
for some constant $c > 0$. Hence
$$\|\mathbb{E}[X]\|_{\psi_2} \le c\,|\mathbb{E}[X]| \le c\,\mathbb{E}[|X|] = c\|X\|_{L^1} \le C\|X\|_{\psi_2},$$
using $\|X\|_{L^p} \le C\sqrt{p}\,\|X\|_{\psi_2}$ for all $p \ge 1$. $\Box$

2.5 Sub-Exponential random variables

Suppose $Y = (Y_1, \ldots, Y_N) \in \mathbb{R}^N$ is a random vector with independent coordinates $Y_i \sim N(0,1)$, $i = 1, \ldots, N$. We expect that the Euclidean norm $\|Y\|_2$ exhibits some form of concentration, as the square of the norm $\|Y\|_2^2$ is the sum of the independent random variables $Y_i^2$. However, although the $Y_i$ are sub-Gaussian random variables, the $Y_i^2$ are not. Recall our tail estimate for Gaussian random variables and note that
$$\mathbb{P}(Y_i^2 > t) = \mathbb{P}(|Y_i| > \sqrt{t}) \le C\exp\big(-(\sqrt{t})^2/2\big) = C\exp(-t/2).$$
The tails are like those of the exponential distribution. To see that, suppose that $X \sim \mathrm{Exp}(\lambda)$, $\lambda > 0$, i.e., the probability density function (pdf) is $f_X(t) = \lambda e^{-\lambda t}\mathbb{1}_{\{t\ge0\}}$. Then, for all $t \ge 0$,
$$\mathbb{P}(X \ge t) = \int_t^\infty \lambda e^{-\lambda x}\,dx = e^{-\lambda t}.$$
We can therefore not use our general Hoeffding bound. The following proposition summarises properties of a new class of random variables.

Proposition 2.25 (Sub-exponential properties) Let $X$ be a real-valued random variable. Then the following properties are equivalent, for absolute constants $C_i > 0$, $i = 1, \ldots, 5$:

(i) Tails:
$$\mathbb{P}(|X| \ge t) \le 2\exp(-t/C_1) \quad\text{for all } t \ge 0.$$

(ii) Moments:
$$\|X\|_{L^p} = (\mathbb{E}[|X|^p])^{1/p} \le C_2\,p \quad\text{for all } p \ge 1.$$

(iii) Moment generating function (MGF) of $|X|$:
$$\mathbb{E}[\exp(\lambda|X|)] \le \exp(C_3\lambda) \quad\text{for all } \lambda \text{ such that } 0 \le \lambda \le 1/C_3.$$

(iv) MGF of $|X|$ bounded at some point:
$$\mathbb{E}[\exp(|X|/C_4)] \le 2.$$

Moreover, if $\mathbb{E}[X] = 0$, the properties (i)-(iv) are also equivalent to (v):

(v) MGF of $X$:
$$\mathbb{E}[\exp(\lambda X)] \le \exp(C_5^2\lambda^2) \quad\text{for all } \lambda \text{ such that } |\lambda| \le 1/C_5.$$

Definition 2.26 (Sub-exponential random variable, first definition) A real-valued random variable $X$ is called sub-exponential if it satisfies one of the equivalent properties (i)-(iv) of Proposition 2.25 (respectively properties (i)-(v) when $\mathbb{E}[X] = 0$). Define $\|X\|_{\psi_1}$ by
$$\|X\|_{\psi_1} := \inf\{t > 0 \colon \mathbb{E}[\exp(|X|/t)] \le 2\}. \tag{2.7}$$

Lemma 2.27 Equation (2.7) defines the sub-exponential norm $\|X\|_{\psi_1}$ on the set of all sub-exponential random variables $X$.

Proof. Support class. $\Box$

Lemma 2.28 A real-valued random variable $X$ is sub-Gaussian if and only if $X^2$ is sub-exponential. Moreover,
$$\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2. \tag{2.8}$$
Proof. Suppose $X$ is sub-Gaussian and $t \ge 0$. Then
$$\mathbb{P}(X^2 \ge t) = \mathbb{P}(|X| \ge \sqrt{t}) \le 2\exp(-t/C_1^2),$$
and thus $X^2$ is sub-exponential according to Proposition 2.25 (i). If $X^2$ is sub-exponential we have
$$\mathbb{P}(|X| \ge t) = \mathbb{P}(X^2 \ge t^2) \le 2\exp(-t^2/C_1),$$
and thus $X$ is sub-Gaussian. As for the norms, recall that $\|X^2\|_{\psi_1}$ is the infimum of all numbers $C > 0$ satisfying $\mathbb{E}[\exp(X^2/C)] \le 2$, whereas $\|X\|_{\psi_2}$ is the infimum of all numbers $K > 0$ satisfying $\mathbb{E}[\exp(X^2/K^2)] \le 2$. Putting $C = K^2$, one obtains $\|X^2\|_{\psi_1} = \|X\|_{\psi_2}^2$. $\Box$

Lemma 2.29 (Product of sub-Gaussians) Let $X$ and $Y$ be real-valued sub-Gaussian random variables. Then $XY$ is sub-exponential and
$$\|XY\|_{\psi_1} \le \|X\|_{\psi_2}\,\|Y\|_{\psi_2}.$$

Proof. Suppose that $\|X\|_{\psi_2} \ne 0$ (if $\|X\|_{\psi_2} = 0$ and/or $\|Y\|_{\psi_2} = 0$ the statement trivially holds). Then $\widetilde X = X/\|X\|_{\psi_2}$ is sub-Gaussian with $\|\widetilde X\|_{\psi_2} = \frac{1}{\|X\|_{\psi_2}}\|X\|_{\psi_2} = 1$. Thus we may assume without loss of generality that $\|X\|_{\psi_2} = \|Y\|_{\psi_2} = 1$. To prove the statement in the lemma we shall show that $\mathbb{E}[\exp(X^2)] \le 2$ and $\mathbb{E}[\exp(Y^2)] \le 2$ both imply that $\mathbb{E}[\exp(|XY|)] \le 2$, which in turn implies that $\|XY\|_{\psi_1} \le 1$. We are going to use Young's inequality:
$$ab \le \frac{a^2}{2} + \frac{b^2}{2} \quad\text{for } a, b \in \mathbb{R}.$$
Thus
$$\mathbb{E}[\exp(|XY|)] \le \mathbb{E}\Big[\exp\Big(\frac{X^2}{2} + \frac{Y^2}{2}\Big)\Big] = \mathbb{E}\Big[\exp\Big(\frac{X^2}{2}\Big)\exp\Big(\frac{Y^2}{2}\Big)\Big] \qquad\text{(Young's inequality)}$$
$$\le \frac{1}{2}\,\mathbb{E}\big[\exp(X^2) + \exp(Y^2)\big] = \frac{1}{2}(2 + 2) = 2 \qquad\text{(Young's inequality)}. \qquad\Box$$

Example 2.30 (Exponential random variables) Suppose $X \sim \mathrm{Exp}(\alpha)$, $\alpha > 0$. Then $\mathbb{E}[X] = \frac{1}{\alpha}$ and $\mathrm{Var}(X) = \frac{1}{\alpha^2}$. We compute, for every $t > 1/\alpha$,
$$\mathbb{E}[\exp(|X|/t)] = \int_0^\infty \alpha\,e^{x/t}e^{-\alpha x}\,dx = \Big[\frac{-\alpha}{\alpha - 1/t}\,e^{-x(\alpha - 1/t)}\Big]_0^\infty = \frac{\alpha}{\alpha - 1/t} \le 2$$
if and only if $t \ge 2/\alpha$. Hence $\|X\|_{\psi_1} = \frac{2}{\alpha}$. ♣

Example 2.31 (Sub-exponential but not sub-Gaussian) Suppose that $Z \sim N(0,1)$, and define $X := Z^2$. Then $\mathbb{E}[X] = 1$. For $\lambda < \frac{1}{2}$ we have
$$\mathbb{E}[e^{\lambda(X-1)}] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^\infty e^{\lambda(z^2-1)}e^{-z^2/2}\,dz = \frac{e^{-\lambda}}{\sqrt{1-2\lambda}},$$
whereas for $\lambda > \frac{1}{2}$ we have $\mathbb{E}[e^{\lambda(X-1)}] = +\infty$. Thus $X$ is not sub-Gaussian. In fact one can show, after some computation, that
$$\frac{e^{-\lambda}}{\sqrt{1-2\lambda}} \le e^{2\lambda^2} = e^{4\lambda^2/2} \quad\text{for all } |\lambda| < \frac{1}{4}.$$

This motivates the following alternative definition of sub-exponential random variables which corresponds to the second definition for sub-Gaussian random variables in Definition 2.19. ♣

Definition 2.32 (Sub-exponential random variables, second definition) A real-valued random variable X with mean µ ∈ R is sub-exponential if there exist non-negative parameters (ν, α) such that

$$\mathbb{E}\big[\exp(\lambda(X-\mu))\big] \le e^{\nu^2\lambda^2/2} \quad\text{for all } |\lambda| < \frac{1}{\alpha}. \tag{2.9}$$

Remark 2.33 (a) The random variable X in Example 2.31 is sub-exponential with param- eters (ν, α) = (2, 4).

(b) It is easy to see that our two definitions are equivalent when the random variable $X$ has zero mean $\mu = 0$: use statement (v) of Proposition 2.25. If $\mu \ne 0$ and $X$ satisfies (2.9), we obtain the following tail bound
$$\mathbb{P}(|X| \ge t) \le 2\exp\Big(-\frac{1}{2\nu^2}(t-\mu)^2\Big) \quad\text{for all } t \ge 0. \tag{2.10}$$
This tail bound is not exactly the statement (i) of Proposition 2.25, as the parameter $t \ge 0$ is chosen with respect to the mean. In most cases, one is solely interested in tail estimates away from the mean. Thus the definitions are equivalent in case $\mu \ne 0$ if we limit the range of the parameter $t$ to $t \ge |\mu|$. In the literature and in applications Definition 2.32 is widely used and sometimes called sub-exponential for centred random variables. We use both definitions synonymously.

Example 2.34 (Bounded random variable) Let $X$ be a real-valued, mean-zero and bounded random variable supported on the compact interval $[a, b]$, $a < b$. Furthermore, let $X'$ be an independent copy of $X$ and let $\varepsilon$ be a Rademacher random variable independent of both $X$ and $X'$, that is, $\varepsilon$ is a symmetric Bernoulli random variable with $\mathbb{P}(\varepsilon = -1) = \mathbb{P}(\varepsilon = 1) = \frac{1}{2}$. Using Jensen's inequality for $X'$ we obtain
$$\mathbb{E}_X[e^{\lambda X}] = \mathbb{E}_X\big[e^{\lambda(X - \mathbb{E}_{X'}[X'])}\big] \le \mathbb{E}_{X,X'}\big[e^{\lambda(X-X')}\big] = \mathbb{E}_{X,X'}\big[\mathbb{E}_\varepsilon[e^{\lambda\varepsilon(X-X')}]\big],$$
where we write $\mathbb{E}_{X,X'}$ for the expectation with respect to $X$ and $X'$, $\mathbb{E}_\varepsilon$ for the expectation with respect to $\varepsilon$, and where we used that $(X - X') \sim \varepsilon(X - X')$. Here, $\sim$ means that both random variables have the same distribution. Hold $\alpha := (X - X')$ fixed and compute
$$\mathbb{E}_\varepsilon[e^{\lambda\varepsilon\alpha}] = \frac{1}{2}\big(e^{-\lambda\alpha} + e^{\lambda\alpha}\big) = \frac{1}{2}\Big(\sum_{k=0}^\infty \frac{(-\lambda\alpha)^k}{k!} + \sum_{k=0}^\infty \frac{(\lambda\alpha)^k}{k!}\Big) = \sum_{k=0}^\infty \frac{(\lambda\alpha)^{2k}}{(2k)!} \le 1 + \sum_{k=1}^\infty \frac{(\lambda\alpha)^{2k}}{2^k k!} = e^{\lambda^2\alpha^2/2}.$$
Inserting this result in the previous one leads to
$$\mathbb{E}_X[e^{\lambda X}] \le \mathbb{E}_{X,X'}\big[e^{\lambda^2(X-X')^2/2}\big] \le e^{\lambda^2(b-a)^2/2}$$
as $|X - X'| \le b - a$. ♣

Proposition 2.35 (Sub-exponential tail-bound) Let X be a real-valued sub-exponential ran- dom variable with parameters (ν, α) and mean µ = E[X]. Then, for every t ≥ 0,

$$\mathbb{P}(X - \mu \ge t) \le \begin{cases} e^{-\frac{t^2}{2\nu^2}} & \text{for } 0 \le t < \nu^2/\alpha, \\ e^{-\frac{t}{2\alpha}} & \text{for } t \ge \nu^2/\alpha. \end{cases} \tag{2.11}$$

Proof. Recall Definition 2.32 and obtain
$$\mathbb{P}(X - \mu \ge t) \le e^{-\lambda t}\,\mathbb{E}[e^{\lambda(X-\mu)}] \le \exp\Big(-\lambda t + \frac{\lambda^2\nu^2}{2}\Big) \quad\text{for } \lambda \in [0, \alpha^{-1}).$$
Define $g(\lambda, t) := -\lambda t + \lambda^2\nu^2/2$. We need to determine $g^*(t) = \inf_{\lambda\in[0,\alpha^{-1})}\{g(\lambda, t)\}$. Suppose that $t$ is fixed; then $\partial_\lambda g(\lambda, t) = -t + \lambda\nu^2 = 0$ if and only if $\lambda = \lambda^* = \frac{t}{\nu^2}$. If $0 \le t < \nu^2/\alpha$, then the infimum equals the unconstrained one and $g^*(t) = -t^2/2\nu^2$ for $t \in [0, \nu^2/\alpha)$. Suppose now that $t \ge \nu^2/\alpha$. As $g(\cdot, t)$ is monotonically decreasing on $[0, \lambda^*)$, the constrained infimum is achieved on the boundary $\lambda = \alpha^{-1}$, and hence $g^*(t) \le -t/2\alpha$. $\Box$
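The two regimes in (2.11) can be seen numerically for $X = Z^2$, $Z \sim N(0,1)$, which by Example 2.31 is sub-exponential with $(\nu, \alpha) = (2, 4)$. A Monte Carlo sketch, not part of the notes (numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(7)
nu, alpha = 2.0, 4.0                      # parameters for X = Z^2 from Example 2.31
X = rng.standard_normal(2_000_000) ** 2   # X has mean mu = 1

for t in [0.5, 1.0, 2.0, 5.0, 10.0]:
    if t < nu**2 / alpha:                 # Gaussian regime of (2.11)
        bound = np.exp(-t**2 / (2 * nu**2))
    else:                                 # exponential regime of (2.11)
        bound = np.exp(-t / (2 * alpha))
    print(t, (X - 1.0 >= t).mean(), bound)
```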

Definition 2.36 (Bernstein condition) A real-valued random variable $X$ with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 \in (0, \infty)$ satisfies the Bernstein condition with parameter $b > 0$ if
$$\big|\mathbb{E}[(X - \mu)^k]\big| \le \frac{1}{2}k!\,\sigma^2 b^{k-2}, \qquad k = 2, 3, 4, \ldots. \tag{2.12}$$

Exercise 2.37 (a) Show that a bounded random variable X with |X − µ| ≤ b satisfies the Bernstein condition (2.12) . (b) Show that the bounded random variable X in (a) is sub-exponential and derive a bound on the centred moment generating function

E[exp(λ(X − E[X]))] . KK

Solution. (a) From our assumption we have $\mathbb{E}[(X - \mu)^2] = \sigma^2$. Using the bound $|X - \mu| \le b$ we obtain
$$\big|\mathbb{E}[(X-\mu)^k]\big| \le \mathbb{E}\big[|X - \mu|^{k-2}|X - \mu|^2\big] \le b^{k-2}\,\mathbb{E}[|X - \mu|^2] \le \frac{1}{2}k!\,\sigma^2 b^{k-2}$$
for all $k \in \mathbb{N}$, $k \ge 2$.

(b) By power series expansion we have, using the Bernstein bound from (a),
$$\mathbb{E}[e^{\lambda(X-\mu)}] = 1 + \frac{\lambda^2\sigma^2}{2} + \sum_{k=3}^\infty \lambda^k\,\frac{\mathbb{E}[(X-\mu)^k]}{k!} \le 1 + \frac{\lambda^2\sigma^2}{2} + \frac{\lambda^2\sigma^2}{2}\sum_{k=3}^\infty (|\lambda|b)^{k-2},$$
and for $|\lambda| < 1/b$ we can sum the geometric series to obtain
$$\mathbb{E}[e^{\lambda(X-\mu)}] \le 1 + \frac{\lambda^2\sigma^2/2}{1 - b|\lambda|} \le \exp\Big(\frac{\lambda^2\sigma^2/2}{1 - b|\lambda|}\Big)$$
by using $1 + t \le e^t$. Thus $X$ is sub-exponential, as we obtain
$$\mathbb{E}[e^{\lambda(X-\mu)}] \le \exp\big(\lambda^2(\sqrt{2}\sigma)^2/2\big) \quad\text{for all } |\lambda| < \frac{1}{2b}. \qquad\Box$$

Exercise 2.38 (General Hoeffding inequality) Let $X_1, \ldots, X_N$ be independent mean-zero sub-Gaussian real-valued random variables, and let $a = (a_1, \ldots, a_N) \in \mathbb{R}^N$. Then, for every $t \ge 0$, we have
$$\mathbb{P}\Big(\Big|\sum_{i=1}^N a_iX_i\Big| \ge t\Big) \le 2\exp\Big(-\frac{ct^2}{K^2\|a\|_2^2}\Big),$$
where $K = \max_{1\le i\le N}\{\|X_i\|_{\psi_2}\}$. $\qquad KK$

Hint: Use the fact that $X_i$ sub-Gaussian, $i = 1, \ldots, N$, implies that $\sum_{i=1}^N X_i$ is sub-Gaussian with
$$\Big\|\sum_{i=1}^N X_i\Big\|_{\psi_2}^2 \le C\sum_{i=1}^N \|X_i\|_{\psi_2}^2.$$

3 Random vectors in High Dimensions

We study random vectors $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$ and aim to obtain concentration properties of the Euclidean norm of random vectors $X$.

3.1 Concentration of the Euclidean norm

Suppose $X_i$, $i = 1, \ldots, n$, are independent $\mathbb{R}$-valued random variables with $\mathbb{E}[X_i] = 0$ and $\mathrm{Var}(X_i) = 1$. Then
$$\mathbb{E}[\|X\|_2^2] = \mathbb{E}\Big[\sum_{i=1}^n X_i^2\Big] = \sum_{i=1}^n \mathbb{E}[X_i^2] = n.$$
We thus expect that the expectation of the Euclidean norm is approximately $\sqrt{n}$. We will now see in a special case that $\|X\|_2$ is indeed very close to $\sqrt{n}$ with high probability.

Theorem 3.1 (Concentration of the norm) Let $X = (X_1, \ldots, X_n) \in \mathbb{R}^n$ be a random vector with independent sub-Gaussian coordinates $X_i$ that satisfy $\mathbb{E}[X_i^2] = 1$. Then
$$\big\|\,\|X\|_2 - \sqrt{n}\,\big\|_{\psi_2} \le CK^2,$$
where $K := \max_{1\le i\le n}\{\|X_i\|_{\psi_2}\}$ and $C > 0$ is an absolute constant.

The following two exercises are used in the proof of the theorem.

Exercise 3.2 (Centering for sub-exponential random variables) Show the centering lemma for sub-exponential random variables. This is an extension of Lemma 2.24 to sub-exponential random variables: Let X be a real-valued sub-exponential random variable. Then

$$\|X - \mathbb{E}[X]\|_{\psi_1} \le C\|X\|_{\psi_1}. \qquad K$$

Exercise 3.3 (Bernstein inequality) Let X1,...,XN be independent mean-zero sub-exponential real-valued random variables. Then, for every t ≥ 0, we have

$$\mathbb{P}\Big(\Big|\frac{1}{N}\sum_{i=1}^N X_i\Big| \ge t\Big) \le 2\exp\Big(-c\min\Big\{\frac{t^2}{K^2}, \frac{t}{K}\Big\}N\Big)$$
for some absolute constant $c > 0$ and where $K := \max_{1\le i\le N}\{\|X_i\|_{\psi_1}\}$. $\qquad KK$

Solution. Set $S_N := \frac{1}{N}\sum_{i=1}^N X_i$. As usual we start with
$$\mathbb{P}(S_N \ge t) \le e^{-\lambda t}\,\mathbb{E}[e^{\lambda S_N}] = e^{-\lambda t}\prod_{i=1}^N \mathbb{E}\big[e^{(\lambda/N)X_i}\big],$$
and (v) in Proposition 2.25 implies that, writing $\widetilde X_i = X_i/N$,
$$\mathbb{E}[e^{\lambda\widetilde X_i}] \le \exp\big(c\lambda^2\|\widetilde X_i\|_{\psi_1}^2\big) \quad\text{for } |\lambda| \le \frac{c}{\|\widetilde X_i\|_{\psi_1}},$$
and $\|\widetilde X_i\|_{\psi_1}^2 = \frac{1}{N^2}\|X_i\|_{\psi_1}^2$. Thus
$$\mathbb{P}(S_N \ge t) \le \exp\Big(-\lambda t + c\lambda^2\sum_{i=1}^N \frac{1}{N^2}\|X_i\|_{\psi_1}^2\Big) \quad\text{for } |\lambda| \le \frac{c}{\max_{1\le i\le N}\{\|\widetilde X_i\|_{\psi_1}\}}.$$
Optimising over $\lambda$ one obtains
$$\lambda = \min\Big\{\frac{t}{2c\tilde\sigma^2},\ \frac{c}{\max_{1\le i\le N}\{\|\widetilde X_i\|_{\psi_1}\}}\Big\}, \quad\text{where } \tilde\sigma^2 = \sum_{i=1}^N \frac{1}{N^2}\|X_i\|_{\psi_1}^2.$$
Thus
$$\mathbb{P}(S_N \ge t) \le \exp\Big(-c\min\Big\{\frac{t^2}{2c\tilde\sigma^2},\ \frac{ct}{\max_{1\le i\le N}\{\|\widetilde X_i\|_{\psi_1}\}}\Big\}\Big),$$
and finally, since $\tilde\sigma^2 \le K^2/N$ and $\max_{1\le i\le N}\|\widetilde X_i\|_{\psi_1} = K/N$ (adjusting the absolute constant $c > 0$),
$$\mathbb{P}(S_N \ge t) \le \exp\Big(-c\min\Big\{\frac{t^2}{K^2},\ \frac{t}{K}\Big\}N\Big).$$

Remark 3.4 With high probability, e.g., with probability 0.99 (adjust the absolute constants $c > 0$ and $K > 0$ accordingly), $X$ stays within a constant distance from the sphere of radius $\sqrt{n}$. The quantity $S_n := \|X\|_2^2$ has mean $n$ and standard deviation $O(\sqrt{n})$:
$$\mathrm{Var}\big(\|X\|_2^2\big) = \mathbb{E}\Big[\sum_{i,j=1}^n (X_iX_j)^2\Big] - n^2 = \sum_{i=1}^n \mathbb{E}[X_i^4] + \sum_{i\ne j} \mathbb{E}[(X_iX_j)^2] - n^2.$$
For any $i \ne j$ we have $\mathbb{E}[(X_iX_j)^2] = 1$, and $\mathbb{E}[X_i^4] = O(1)$ (follows from (ii) in Proposition 2.15). Thus $\mathrm{Var}(\|X\|_2^2) = n\,O(1) + n(n-1) - n^2 = O(n)$, and therefore $\sqrt{\mathrm{Var}(\|X\|_2^2)} = O(\sqrt{n})$. Hence $\|X\|_2 = \sqrt{S_n}$ deviates by $O(1)$ around $\sqrt{n}$. Note that this follows from
$$\sqrt{n \pm O(\sqrt{n})} = \sqrt{n} \pm O(1).$$



Proof of Theorem 3.1. We assume again without loss of generality that K ≥ 1.

\[
\frac{1}{n}\|X\|_2^2 - 1 = \frac{1}{n}\sum_{i=1}^n (X_i^2 - 1).
\]

$X_i$ sub-Gaussian implies that $X_i^2 - 1$ is sub-exponential (Lemma 2.29). The centering property of Exercise 3.2 shows that there is an absolute constant $C>0$ such that
\[
\|X_i^2 - 1\|_{\psi_1} \le C\|X_i^2\|_{\psi_1} \underset{\text{Lemma 2.28}}{=} C\|X_i\|_{\psi_2}^2 \le CK^2 .
\]

We now apply the Bernstein inequality in Exercise 3.3 to obtain, for every $u \ge 0$,
\[
\mathbb{P}\Big(\Big|\frac{1}{n}\|X\|_2^2 - 1\Big| \ge u\Big) \le 2\exp\Big(-\frac{cn}{K^4}\min\{u^2,u\}\Big), \tag{3.1}
\]
where we used that $K \ge 1$ implies $K^4 \ge K^2$. Note that for $z \ge 0$ the inequality $|z-1| \ge \delta$ implies the inequality $|z^2-1| \ge \max\{\delta,\delta^2\}$. To see this, consider first $z \ge 1$, which implies $z+1 \ge z-1 \ge \delta$ and thus $|z^2-1| = |z-1||z+1| \ge \delta^2$. For $0 \le z < 1$ we have $z+1 \ge 1$ and thus $|z^2-1| \ge \delta$. We apply this finding and (3.1) with $u = \max\{\delta,\delta^2\}$ to obtain
\[
\mathbb{P}\Big(\Big|\frac{1}{\sqrt{n}}\|X\|_2 - 1\Big| \ge \delta\Big) \le \mathbb{P}\Big(\Big|\frac{1}{n}\|X\|_2^2 - 1\Big| \ge \max\{\delta,\delta^2\}\Big) \le 2\exp\Big(-\frac{cn}{K^4}\delta^2\Big).
\]
We used that $v = \min\{u,u^2\} = \delta^2$ when $u = \max\{\delta,\delta^2\}$. To see this, note that $\delta \le 1$ implies $u = \delta$ and thus $v = \delta^2$; if $\delta > 1$ we have $\delta^2 > \delta$, so $u = \delta^2$ and again $v = \min\{\delta^2,\delta^4\} = \delta^2$. Setting $t = \delta\sqrt{n}$ we finally conclude with
\[
\mathbb{P}\big(\big|\|X\|_2 - \sqrt{n}\big| \ge t\big) \le 2\exp\Big(-\frac{ct^2}{K^4}\Big),
\]
which shows that $\|X\|_2 - \sqrt{n}$ is sub-Gaussian. $\Box$

Exercise 3.5 (Small ball probabilities) Let $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ be a random vector with independent coordinates $X_i$ having continuous distributions with probability density functions $f_i\colon\mathbb{R}\to\mathbb{R}$ (Radon--Nikodym densities with respect to the Lebesgue measure) satisfying
\[
|f_i(x)| \le 1, \quad i=1,\dots,n, \text{ for all } x \in \mathbb{R}.
\]
Show that, for any $\varepsilon > 0$, we have
\[
\mathbb{P}(\|X\|_2 \le \varepsilon\sqrt{n}) \le (C\varepsilon)^n
\]
for some absolute constant $C>0$. K

Solution.
\[
\mathbb{P}(\|X\|_2^2 \le \varepsilon^2 n) = \mathbb{P}(-\|X\|_2^2 \ge -\varepsilon^2 n) \le e^{\lambda\varepsilon^2 n}\,\mathbb{E}[\exp(-\lambda\|X\|_2^2)] = e^{\lambda\varepsilon^2 n}\prod_{i=1}^n \mathbb{E}[\exp(-\lambda X_i^2)],
\]
and inserting
\[
\mathbb{E}[\exp(-\lambda X_i^2)] \le \int_{\mathbb{R}} e^{-\lambda x^2}|f_i(x)|\,dx \le \int_{\mathbb{R}} e^{-\lambda x^2}\,dx = \sqrt{\frac{\pi}{\lambda}},
\]
we obtain
\[
\mathbb{P}(\|X\|_2^2 \le \varepsilon^2 n) \le \exp\Big(\lambda\varepsilon^2 n - \frac{n}{2}\log(\lambda/\pi)\Big) \le (C\varepsilon)^n,
\]
where the last inequality follows by optimising over $\lambda$ and getting $\lambda = \frac{1}{2\varepsilon^2}$. ©

3.2 Covariance matrices and Principal Component Analysis (PCA)

Definition 3.6 Let $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ be a random vector. Define the matrix $XX^{T}$ as the $(n\times n)$ matrix
\[
XX^{T} :=
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{pmatrix}
\begin{pmatrix} X_1 & \cdots & X_n \end{pmatrix}
=
\begin{pmatrix}
X_1^2 & X_1X_2 & \cdots & X_1X_n \\
X_2X_1 & X_2^2 & \cdots & X_2X_n \\
\vdots & & \ddots & \vdots \\
X_nX_1 & \cdots & & X_n^2
\end{pmatrix}.
\]

Let $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ be a random vector with mean $\mu = \mathbb{E}[X]$ and matrix $\mu\mu^{T} = (\mu_i\mu_j)_{i,j=1,\dots,n}$. Then the covariance matrix, defined as
\[
\mathrm{cov}(X) := \mathbb{E}\big[(X-\mu)(X-\mu)^{T}\big] = \mathbb{E}[XX^{T}] - \mu\mu^{T}, \tag{3.2}
\]
is an $(n\times n)$ symmetric positive-semidefinite matrix with entries
\[
\mathrm{cov}(X)_{i,j} = \mathbb{E}[(X_i-\mu_i)(X_j-\mu_j)], \qquad i,j=1,\dots,n. \tag{3.3}
\]

The second moment matrix of a random vector $X \in \mathbb{R}^n$ is simply
\[
\Sigma = \Sigma(X) = \mathbb{E}[XX^{T}] = \big(\mathbb{E}[X_iX_j]\big)_{i,j=1,\dots,n}, \tag{3.4}
\]
and $\Sigma(X)$ is a symmetric and positive-semidefinite matrix which can be written via its spectral decomposition
\[
\Sigma(X) = \sum_{i=1}^n s_iu_iu_i^{T}, \tag{3.5}
\]
where $u_i \in \mathbb{R}^n$ are the eigenvectors of $\Sigma(X)$ for the eigenvalues $s_i$. The second moment matrix allows the principal component analysis (PCA). We order the eigenvalues of $\Sigma(X)$ according to their size: $s_1 \ge s_2 \ge \cdots \ge s_n$. For large values of the dimension $n$ one aims to identify a few principal directions. These directions correspond to the eigenvectors with the largest eigenvalues. For example, suppose that the first $m$ eigenvalues are significantly larger than the remaining $n-m$ ones. This allows one to reduce the dimension of the given data to $\mathbb{R}^m$ by neglecting all contributions from directions whose eigenvalues are significantly smaller than the chosen principal ones.
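The following Python/NumPy sketch (an illustration added here, not part of the notes) carries out this PCA recipe on synthetic data; the planted two-direction model, the dimensions and the signal strength are assumptions chosen only to make the spectral gap visible.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, N = 20, 5000
# Planted model: two strong directions plus isotropic noise (illustrative choice).
u = rng.standard_normal((n, 2))
Z = rng.standard_normal((N, 2)) @ u.T * 3.0 + rng.standard_normal((N, n))
Sigma = Z.T @ Z / N                      # empirical second-moment matrix
evals, evecs = np.linalg.eigh(Sigma)     # eigh: ascending eigenvalues, symmetric input
evals = evals[::-1]                      # order s_1 >= s_2 >= ... >= s_n
print("largest eigenvalues:", np.round(evals[:4], 2))
# Keep the m = 2 principal directions and project the data onto them.
P = evecs[:, ::-1][:, :2]
Z_reduced = Z @ P                        # data reduced from R^n to R^m with m = 2
print("reduced data shape:", Z_reduced.shape)
\end{verbatim}
In the printout the first two eigenvalues dominate the rest, so projecting onto the corresponding eigenvectors retains most of the second moment of the data.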

Definition 3.7 A random vector $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ is called isotropic if
\[
\Sigma(X) = \mathbb{E}[XX^{T}] = 1\mathrm{l}_n,
\]
where $1\mathrm{l}_n := \mathrm{id}_n$ is the identity operator/matrix in $\mathbb{R}^n$.

Exercise 3.8 (a) Let Z be an isotropic mean-zero random vector in Rn, µ ∈ Rn, and Σ be a (n × n) positive-semidefinite matrix. Show that then X := µ + Σ1/2Z has mean µ and covariance matrix cov(X) = Σ.

(b) Let X ∈ Rn be a random vector with mean µ and invertible covariance matrix Σ = cov(X). Show that then Z := Σ−1/2(X − µ) is an isotropic mean-zero random vector. K

Lemma 3.9 (Isotropy) A random vector $X \in \mathbb{R}^n$ is isotropic if and only if
\[
\mathbb{E}[\langle X,x\rangle^2] = \|x\|_2^2 \qquad \text{for all } x \in \mathbb{R}^n.
\]

Proof. $X$ is isotropic if and only if $(\mathbb{E}[X_iX_j])_{i,j=1,\dots,n} = (\delta_{i,j})_{i,j=1,\dots,n} = 1\mathrm{l}_n$. For every $x \in \mathbb{R}^n$ we have
\[
\mathbb{E}[\langle X,x\rangle^2] = \sum_{i,j=1}^n \mathbb{E}[x_iX_ix_jX_j] = \sum_{i,j=1}^n x_i\,\mathbb{E}[X_iX_j]\,x_j = \langle x,\Sigma(X)x\rangle,
\]
which equals $\|x\|_2^2$ for all $x$ if and only if $\Sigma(X) = 1\mathrm{l}_n$. $\Box$
It suffices to check $\mathbb{E}[\langle X,e_i\rangle^2] = 1$ for all basis vectors $e_i$, $i=1,\dots,n$. Note that $\langle X,e_i\rangle$ is a one-dimensional marginal of the random vector $X$. Thus $X$ is isotropic if and only if all one-dimensional marginals of $X$ have unit variance. An isotropic distribution is evenly extended in all spatial directions.

Lemma 3.10 Let X be an isotropic random vector in Rn. Then

\[
\mathbb{E}[\|X\|_2^2] = n.
\]

Moreover, if X and Y are two independent isotropic random vectors in Rn, then

\[
\mathbb{E}[\langle X,Y\rangle^2] = n.
\]

Proof. For the first statement we view $X^{T}X$ as a $1\times 1$ matrix and take advantage of the cyclic property of the trace operation on matrices:
\[
\mathbb{E}[\|X\|_2^2] = \mathbb{E}[X^{T}X] = \mathbb{E}[\mathrm{Trace}(X^{T}X)] = \mathbb{E}[\mathrm{Trace}(XX^{T})] = \mathrm{Trace}(\mathbb{E}[XX^{T}]) = \mathrm{Trace}(1\mathrm{l}_n) = n.
\]
For the second statement we fix a realisation of $Y$, that is, we consider the conditional expectation of $X$ with respect to $Y$, which we denote by $\mathbb{E}_X$. The law of total expectation says that
\[
\mathbb{E}[\langle X,Y\rangle^2] = \mathbb{E}_Y\big[\mathbb{E}_X[\langle X,Y\rangle^2\,|\,Y]\big],
\]
where $\mathbb{E}_Y$ denotes the expectation with respect to $Y$. To compute the innermost expectation we use Lemma 3.9 with $x = Y$ and obtain, using the previous part, that
\[
\mathbb{E}[\langle X,Y\rangle^2] = \mathbb{E}_Y[\|Y\|_2^2] = n. \qquad\Box
\]

Suppose $X \perp Y$ are isotropic vectors in $\mathbb{R}^n$, and consider the normalised versions $\bar{X} := X/\|X\|_2$ and $\bar{Y} := Y/\|Y\|_2$. From the concentration results in this chapter we know that, with high probability, $\|X\|_2 \sim \sqrt{n}$, $\|Y\|_2 \sim \sqrt{n}$, and $\langle X,Y\rangle \sim \sqrt{n}$. Thus, with high probability,
\[
|\langle \bar{X},\bar{Y}\rangle| \sim \frac{1}{\sqrt{n}}.
\]
Thus, in high dimensions, independent isotropic random vectors are almost orthogonal.
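The following short Python/NumPy experiment (added for illustration only; standard Gaussian vectors are chosen here simply as one convenient isotropic example) shows this near-orthogonality numerically.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
for n in [10, 100, 1000, 10000]:
    X = rng.standard_normal(n)   # isotropic
    Y = rng.standard_normal(n)   # independent isotropic copy
    cos_angle = X @ Y / (np.linalg.norm(X) * np.linalg.norm(Y))
    print(f"n={n:6d}  |<Xbar,Ybar>| = {abs(cos_angle):.4f}"
          f"   1/sqrt(n) = {1/np.sqrt(n):.4f}")
\end{verbatim}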

3.3 Examples of High-Dimensional distributions

Definition 3.11 (Spherical distribution) A random vector $X$ is called spherically distributed if it is uniformly distributed on the Euclidean sphere with radius $\sqrt{n}$ and centre at the origin, i.e.,
\[
X \sim \mathrm{Unif}(\sqrt{n}\,S^{n-1}),
\]
where
\[
S^{n-1} = \Big\{x = (x_1,\dots,x_n) \in \mathbb{R}^n\colon \sum_{i=1}^n x_i^2 = 1\Big\} = \{x \in \mathbb{R}^n\colon \|x\|_2 = 1\}
\]
is the unit sphere with radius $1$ and centre at the origin.

Exercise 3.12 $X \sim \mathrm{Unif}(\sqrt{n}\,S^{n-1})$ is isotropic, but the coordinates $X_i$, $i=1,\dots,n$, of $X$ are not independent due to the condition $X_1^2 + \cdots + X_n^2 = n$. KK
Solution. We give a solution for $n=2$. The solution for higher dimensions is similar and uses high-dimensional versions of polar coordinates. We use polar coordinates to represent $X$ as
\[
X = \sqrt{n}\begin{pmatrix}\cos(\theta)\\ \sin(\theta)\end{pmatrix}, \qquad \theta \in [0,2\pi].
\]
Let $e_1$ and $e_2$ be the two unit basis vectors in $\mathbb{R}^2$. It suffices to show $\mathbb{E}[\langle X,e_i\rangle^2] = 1$, $i=1,2$, as any vector can be written as a linear combination of the two basis vectors. Without loss of generality we pick $e_1$ (the proof for $e_2$ is analogous):
\[
\mathbb{E}[\langle X,e_1\rangle^2] = \mathbb{E}[(\sqrt{2}\cos(\theta))^2] = \frac{2}{2\pi}\int_0^{2\pi}\cos^2(\theta)\,d\theta = \frac{1}{\pi}\Big[\frac{\theta}{2}+\frac{\sin(2\theta)}{4}\Big]_0^{2\pi} = \frac{1}{\pi}\,\pi = 1. \qquad ©
\]
An example of a discrete isotropic distribution in $\mathbb{R}^n$ is the symmetric Bernoulli distribution in $\mathbb{R}^n$. We say that a random vector $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ is symmetric Bernoulli distributed if the coordinates $X_i$ are independent, symmetric Bernoulli random variables. A random variable $\varepsilon$ is symmetric Bernoulli if $\mathbb{P}(\varepsilon=-1) = \mathbb{P}(\varepsilon=+1) = \frac{1}{2}$. Equivalently, we may say that $X$ is uniformly distributed on the unit cube in $\mathbb{R}^n$:
\[
X \sim \mathrm{Unif}\big(\{-1,+1\}^n\big).
\]

The symmetric Bernoulli distribution is isotropic. This can easily be seen again by checking $\mathbb{E}[\langle X,e_i\rangle^2] = 1$ for all $i=1,\dots,n$, or, for any $x \in \mathbb{R}^n$, by checking that
\[
\mathbb{E}[\langle X,x\rangle^2] = \mathbb{E}\Big[\sum_{i=1}^n X_i^2x_i^2\Big] + \mathbb{E}\Big[2\sum_{1\le i<j\le n} X_iX_jx_ix_j\Big] = \|x\|_2^2,
\]
where the cross terms vanish by independence and $\mathbb{E}[X_i]=0$.

For the following recall the definition of the normal distribution. See also the appendix sheets distributed at the beginning of the lecture for useful Gaussian calculus formulae.

Definition 3.13 (Multivariate Normal / Gaussian distribution) We say a random vector n n Y = (Y1,...,Yn) ∈ R has standard normal distribution in R , denoted

Y ∼ N(0, 1ln) ,

if the coordinates $Y_i$, $i=1,\dots,n$, are independent $\mathbb{R}$-valued standard normally distributed random variables, i.e., $Y_i \sim N(0,1)$. The probability density function (pdf) of $Y$ is just the product
\[
f_Y(x) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\,e^{-x_i^2/2} = \frac{1}{(2\pi)^{n/2}}\,e^{-\|x\|_2^2/2}. \tag{3.6}
\]

It is easy to check that $Y$ is isotropic. Furthermore, the density (3.6) depends only on the Euclidean norm of $x$; that is, the standard normal distribution in $\mathbb{R}^n$ depends only on the length and not on the direction. In other words, the standard normal distribution in $\mathbb{R}^n$ is rotation invariant. This reasoning is rigorously stated in the next proposition.

Proposition 3.14 Let $Y \sim N(0,1\mathrm{l}_n)$ and let $U$ be an $n\times n$ orthogonal matrix (i.e., $U^{T}U = UU^{T} = 1\mathrm{l}_n$, or equivalently, $U^{-1} = U^{T}$). Then
\[
UY \sim N(0,1\mathrm{l}_n).
\]
Proof. For $Z := UY$ we have
\[
\|Z\|_2^2 = Z^{T}Z = Y^{T}U^{T}UY = Y^{T}Y = \|Y\|_2^2 .
\]
Furthermore, $|\det(U)| = |\det(U^{T})| = 1$, and thus for any vector $J \in \mathbb{C}^n$ (characteristic function/Laplace transform),
\[
\mathbb{E}_Z[e^{\langle J,Z\rangle}] = \frac{1}{(2\pi)^{n/2}}\int_{\mathbb{R}^n}\exp\Big(-\frac{1}{2}\langle z,z\rangle + \langle J,z\rangle\Big)\,dz
= \frac{1}{(2\pi)^{n/2}}\int_{\mathbb{R}^n}\exp\Big(-\frac{1}{2}\langle y,y\rangle + \langle U^{T}J,y\rangle\Big)\,dy
= \exp\Big(\frac{1}{2}\langle U^{T}J,U^{T}J\rangle\Big) = \mathbb{E}_Y[e^{\langle J,Y\rangle}],
\]
using the substitution $z = Uy$ and $\langle U^{T}J,U^{T}J\rangle = \langle J,J\rangle$.

2

Let $\Sigma$ be a symmetric positive-definite $n\times n$ matrix and let $X \in \mathbb{R}^n$ be a random vector with mean $\mu = \mathbb{E}[X]$. Then
\[
X \sim N(\mu,\Sigma) \;\Leftrightarrow\; Z = \Sigma^{-1/2}(X-\mu) \sim N(0,1\mathrm{l}_n)
\;\Leftrightarrow\; f_X(x) = \frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\Big(-\frac{1}{2}\langle x-\mu,\Sigma^{-1}(x-\mu)\rangle\Big), \quad x \in \mathbb{R}^n.
\]

For large values of $n$ the standard normal distribution $N(0,1\mathrm{l}_n)$ is not concentrated around the origin; instead it is concentrated in a thin spherical shell around the sphere of radius $\sqrt{n}$ centred at the origin (a shell of width of order $O(1)$). From Theorem 3.1 we obtain, for $Y \sim N(0,1\mathrm{l}_n)$,
\[
\mathbb{P}\big(\big|\|Y\|_2 - \sqrt{n}\big| \ge t\big) \le 2\exp(-Ct^2) \qquad \text{for all } t \ge 0
\]
and an absolute constant $C>0$. Therefore, with high probability, $\|Y\|_2 \approx \sqrt{n}$ and
\[
Y \approx \sqrt{n}\,\Theta \sim \mathrm{Unif}(\sqrt{n}\,S^{n-1}),
\]
with $\Theta = Y/\|Y\|_2$. Henceforth, with high probability,
\[
N(0,1\mathrm{l}_n) \approx \mathrm{Unif}(\sqrt{n}\,S^{n-1}).
\]
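A small Python/NumPy experiment (an illustration added here, not part of the notes; the dimension and sample size are arbitrary) makes the thin-shell picture concrete: Gaussian norms cluster tightly around $\sqrt{n}$, and normalising a Gaussian vector produces a sample from $\mathrm{Unif}(\sqrt{n}\,S^{n-1})$ by rotation invariance.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
n, N = 1000, 5000
Y = rng.standard_normal((N, n))                 # Y ~ N(0, I_n)
norms = np.linalg.norm(Y, axis=1)
print("||Y||_2 range:", norms.min().round(1), "to", norms.max().round(1),
      "  sqrt(n) =", round(np.sqrt(n), 1))
# Normalising Y gives a sample from Unif(sqrt(n) S^{n-1}) by rotation invariance.
X = np.sqrt(n) * Y / norms[:, None]
print("first coordinate of X: mean", X[:, 0].mean().round(3),
      " variance", X[:, 0].var().round(3))
\end{verbatim}
The unit variance of the first coordinate of $X$ is the isotropy of the spherical distribution seen in Exercise 3.12.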

3.4 Sub-Gaussian random variables in higher dimensions

Definition 3.15 (Sub-Gaussian random vectors) A random vector X ∈ Rn is sub- Gaussian if the one-dimensional marginals hX, xi are sub-Gaussian real-valued random variables for all x ∈ Rn. Moreover,

\[
\|X\|_{\psi_2} := \sup_{x \in S^{n-1}}\{\|\langle X,x\rangle\|_{\psi_2}\}.
\]

n Lemma 3.16 Let X = (X1,...,Xn) ∈ R be a random vector with independent mean-zero sub-Gaussian coordinates Xi. Then X is sub-Gaussian and

\[
\|X\|_{\psi_2} \le C\max_{1\le i\le n}\{\|X_i\|_{\psi_2}\},
\]
for some absolute constant $C>0$.

Proof. Let $x \in S^{n-1}$. We use Proposition 2.21 for the sum of independent sub-Gaussian random variables:
\[
\|\langle X,x\rangle\|_{\psi_2}^2 = \Big\|\sum_{i=1}^n x_iX_i\Big\|_{\psi_2}^2 \;\underset{\text{Prop. 2.21}}{\le}\; C\sum_{i=1}^n x_i^2\,\|X_i\|_{\psi_2}^2 \;\le\; C\max_{1\le i\le n}\{\|X_i\|_{\psi_2}^2\},
\]
where we used that $\sum_{i=1}^n x_i^2 = 1$. $\Box$

Theorem 3.17 (Uniform distribution on the sphere) Let $X \in \mathbb{R}^n$ be a random vector uniformly distributed on the Euclidean sphere in $\mathbb{R}^n$ with centre at the origin and radius $\sqrt{n}$, i.e.,
\[
X \sim \mathrm{Unif}(\sqrt{n}\,S^{n-1}).
\]
Then $X$ is sub-Gaussian, and $\|X\|_{\psi_2} \le C$ for some absolute constant $C>0$.
We will actually present two different proofs of this statement. The first uses concentration properties whereas the second employs a geometric approach.
Proof of Theorem 3.17 -- Version I. See [Ver18], pages 53--54. $\Box$
Proof of Theorem 3.17 -- Version II.

Figure 1: the spherical cap $C_t$ and the corresponding cone $K_t$.

For convenience we will work on the unit sphere, so let us rescale
\[
Z := \frac{X}{\sqrt{n}} \sim \mathrm{Unif}(S^{n-1}).
\]
It suffices to show that $\|Z\|_{\psi_2} \le C/\sqrt{n}$, which by definition means that $\|\langle Z,x\rangle\|_{\psi_2} \le C/\sqrt{n}$ for all unit vectors $x$. By rotation invariance, all marginals $\langle Z,x\rangle$ have the same distribution, and hence, without loss of generality, we may prove our claim for $x = e_1 = (1,0,\dots,0) \in \mathbb{R}^n$. In other words, we shall show that
\[
\mathbb{P}(|Z_1| \ge t) \le 2\exp(-ct^2n) \qquad \text{for all } t \ge 0.
\]
We use the fact that
\[
\mathbb{P}(Z_1 \ge t) = \mathbb{P}(Z \in C_t) \qquad \text{with the spherical cap } C_t = \{z \in S^{n-1}\colon z_1 \ge t\}.
\]
Denote by $K_t$ the ``ice-cream'' cone obtained by connecting all points in $C_t$ to the origin, see Figure 1. The fraction of $C_t$ in the unit sphere (in terms of area) is the same as the fraction of $K_t$ in the unit ball $B(0,1) = \{x \in \mathbb{R}^n\colon x_1^2+\cdots+x_n^2 \le 1\}$. Thus
\[
\mathbb{P}(Z_1 \ge t) = \frac{\mathrm{Vol}(K_t)}{\mathrm{Vol}(B(0,1))}.
\]
The set $K_t$ is contained in a ball $B(0',\sqrt{1-t^2})$ with radius $\sqrt{1-t^2}$ centred at $0' = (t,0,\dots,0)$. We get
\[
\mathbb{P}(Z_1 \ge t) \le \big(\sqrt{1-t^2}\big)^n \le \exp(-t^2n/2) \qquad \text{for } 0 \le t \le 1/\sqrt{2},
\]
and we can easily extend this bound to all $t$ by loosening the absolute constant (note that for $t > 1$ the probability is trivially zero). Indeed, in the range $1/\sqrt{2} \le t \le 1$,
\[
\mathbb{P}(Z_1 \ge t) \le \mathbb{P}(Z_1 \ge 1/\sqrt{2}) \le \exp(-n/4) \le \exp(-t^2n/4).
\]
We have proved that $\mathbb{P}(Z_1 \ge t) \le \exp(-t^2n/4)$. By symmetry, the same inequality holds for $-Z_1$. Taking the union bound, we obtain the desired sub-Gaussian tail. $\Box$

Remark 3.18 The so-called Projective Central Limit Theorem tells us that the marginals of the uniform distribution on the sphere in $\mathbb{R}^n$ become asymptotically normally distributed as the dimension $n$ increases: for any fixed unit vector $u \in \mathbb{R}^n$ and $X \sim \mathrm{Unif}(\sqrt{n}\,S^{n-1})$,
\[
\langle X,u\rangle \longrightarrow N(0,1) \quad \text{in distribution as } n \to \infty.
\]



3.5 Application: Grothendieck’s inequality See Chapter 3.5 in [Ver18].

4 Random Matrices

4.1 Geometrics concepts

Definition 4.1 Let $(T,d)$ be a metric space, $K \subset T$, and $\varepsilon > 0$.

(a) $\varepsilon$-net: A subset $\mathcal{N} \subset K$ is an $\varepsilon$-net of $K$ if every point of $K$ is within distance $\varepsilon$ of some point of $\mathcal{N}$,
\[
\forall\, x \in K\ \exists\, x_0 \in \mathcal{N}\colon\ d(x,x_0) \le \varepsilon.
\]

(b) Covering number: The smallest possible cardinality of an ε-net of K is the covering number of K and is denoted N(K, d, ε). Equivalently, N(K, d, ε) is the smallest number of closed balls with centres in K and radii ε whose union covers K.

(c) ε-separated sets: A subset N ⊂ T is ε-separated if d(x, y) > ε for all distinct points x, y ∈ N .

(d) Packing numbers: The largest possible cardinality of an ε-separated subset of K is called the packing number of K and denoted P(K, d, ε).

Lemma 4.2 Let $(T,d)$ be a metric space. Suppose that $\mathcal{N}$ is a maximal $\varepsilon$-separated subset of $K \subset T$. Here maximal means that adding any new point $x \in K$ to the set $\mathcal{N}$ destroys the $\varepsilon$-separation property. Then $\mathcal{N}$ is an $\varepsilon$-net.

Proof. Let $x \in K$. If $x \in \mathcal{N} \subset K$, then choosing $x_0 = x$ we have $d(x,x_0) = 0 \le \varepsilon$. Suppose $x \notin \mathcal{N}$. Then the maximality assumption implies that $\mathcal{N}\cup\{x\}$ is not $\varepsilon$-separated, thus $d(x,x_0) \le \varepsilon$ for some $x_0 \in \mathcal{N}$. $\Box$

Lemma 4.3 (Equivalence of packing and covering numbers) Let (T, d) be a metric space. For any K ⊂ T and any ε > 0, we have

P(K, d, 2ε) ≤ N(K, d, ε) ≤ P(K, d, ε) . (4.1)

n Proof. Without loss of generality we consider Euclidean space T = R with d = k·k2. For the upper bound one can show that P(K, k·k2, ε) is the largest number of closed disjoint balls with centres in K and radii ε/2. Furthermore, P(K, k·k2, ε) is the largest cardinality of an ε-separated subset, any ε-separated set N with #N = P(K, k·k2, ε) is maximal, and hence an ε-net according to Lemma 4.2. Thus N(K, k·k2, ε) ≤ #N. Pick an 2ε-separated subset P = {xi} in K and an ε-net N = {yj} of K. Each xi ∈ K belongs to some closed ball Bε(yj) with radius ε around some yj. Any such ball Bε(yj) may contain at most one point of the xi’s. Thus |P| = #P ≤ |N| = #N. 2 In the following we return to the Euclidean space Rn with its Euclidean norm, d(x, y) = kx − yk2.

Definition 4.4 (Minkowski sum) A, B ⊂ Rn. A + B := {a + b: a ∈ A, b ∈ B} .

Proposition 4.5 (Covering numbers of the Euclidean ball) (a) Let $K \subset \mathbb{R}^n$ and $\varepsilon > 0$. Denote by $|K|$ the volume of $K$ and by $B^{(n)} = \{x \in \mathbb{R}^n\colon \|x\|_2 \le 1\}$ the closed unit Euclidean ball. Then
\[
\frac{|K|}{|\varepsilon B^{(n)}|} \le N(K,\|\cdot\|_2,\varepsilon) \le P(K,\|\cdot\|_2,\varepsilon) \le \frac{|K + (\varepsilon/2)B^{(n)}|}{|(\varepsilon/2)B^{(n)}|}.
\]

(b)
\[
\Big(\frac{1}{\varepsilon}\Big)^n \le N(B^{(n)},\|\cdot\|_2,\varepsilon) \le \Big(\frac{2}{\varepsilon}+1\Big)^n.
\]

Proof.
(a) The centre inequality follows from Lemma 4.3. Lower bound: let $N := N(K,\|\cdot\|_2,\varepsilon)$. Then we can cover $K$ by $N$ balls with radii $\varepsilon$, hence $|K| \le N|\varepsilon B^{(n)}|$. Upper bound: let $N := P(K,\|\cdot\|_2,\varepsilon)$ and construct $N$ closed disjoint balls $B_{\varepsilon/2}(x_i)$ with centres $x_i \in K$ and radii $\varepsilon/2$. These balls might not fit entirely into the set $K$, but they certainly fit into the extended set $K + \frac{\varepsilon}{2}B^{(n)}$. Thus
\[
N\Big|\frac{\varepsilon}{2}B^{(n)}\Big| \le \Big|K + \frac{\varepsilon}{2}B^{(n)}\Big|.
\]
(b) The statement follows easily with part (a) and is left as an exercise. $\Box$

Remark 4.6 To simplify the bound in Proposition 4.5, note that in the nontrivial range $\varepsilon \in (0,1]$ we have
\[
\Big(\frac{1}{\varepsilon}\Big)^n \le N(B^{(n)},\|\cdot\|_2,\varepsilon) \le \Big(\frac{3}{\varepsilon}\Big)^n. \qquad\blacksquare
\]
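As a numerical aside (not part of the notes), the following Python/NumPy sketch builds an $\varepsilon$-net greedily on $S^{n-1}$ in the spirit of Lemma 4.2 and compares its size with the volumetric bound of Remark 4.6; the dimension, $\varepsilon$, and the dense random sample standing in for the sphere are illustrative assumptions.
\begin{verbatim}
import numpy as np

def greedy_net(points, eps):
    """Greedily pick a maximal eps-separated subset; by Lemma 4.2 it is an eps-net."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return np.array(net)

rng = np.random.default_rng(4)
n, eps = 3, 0.5
# Dense random sample of S^{n-1} used as a stand-in for the whole sphere.
P = rng.standard_normal((20000, n))
P /= np.linalg.norm(P, axis=1, keepdims=True)
net = greedy_net(P, eps)
print("greedy net size:", len(net), "  volumetric bound (3/eps)^n =", (3 / eps) ** n)
\end{verbatim}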

Definition 4.7 (Hamming cube) The Hamming cube $\mathcal{H}$ is the set of binary strings of length $n$, i.e., $\mathcal{H} = \{0,1\}^n$.

Define the Hamming distance dH between two binary strings as the number of bits where they disagree, i.e.,

n dH(x, y) := #{i ∈ {1, . . . , n}: x(i) 6= y(i)} , x, y ∈ {0, 1} .

Exercise 4.8 Show that dH is a metric on H. K

4.2 Concentration of the operator norm of random matrices

Definition 4.9 Let A be an m×n matrix with real entries. The matrix A represents a linear map Rn → Rm. (a) The operator norm or simply the norm of A is defined as

\[
\|A\| := \max_{x\in\mathbb{R}^n\setminus\{0\}}\Big\{\frac{\|Ax\|_2}{\|x\|_2}\Big\} = \max_{x\in S^{n-1}}\{\|Ax\|_2\}.
\]
Equivalently,
\[
\|A\| = \max_{x\in S^{n-1},\,y\in S^{m-1}}\{\langle Ax,y\rangle\}.
\]

(b) The singular values si = si(A) of the matrix A are the square roots of the eigenvalues of both AAT and AT A,

p T p T si(A) = λi(AA ) = λi(A A) ,

and one orders them s1 ≥ s2 ≥ · · · ≥ sn ≥ 0. If A is symmetric, then si(A) = |λi(A)|. (c) Suppose r = rank(A). The singular value decomposition of A is

r X T A = siuivi , i=1

m where si = si(A) are the singular values of A, the vectors ui ∈ R are the left singular n vectors, and the vectors vi ∈ R are the right singular vectors of A. RANDOM MATRICES 39

Remark 4.10 (a) In terms of its spectrum, the operator norm of A equals the largest singu- lar value of A,

s1(A) = kAk .

(b) The extreme singular values s1(A) and sr(A) are respectively the smallest number M and the largest number m such that

n mkxk2 ≤ kAxk2 ≤ Mkxk2 , for all x ∈ R .

Thus

n sn(A)kx − yk2 ≤ kAx − Ayk2 ≤ s1(A)kx − yk2 for all x ∈ R .



Lemma 4.11 (Operator norm on a net) Let ε ∈ [0, 1) and A be an m × n matrix. Then, for any ε-net N of Sn−1, we have

\[
\sup_{x\in\mathcal{N}}\{\|Ax\|_2\} \le \|A\| \le \frac{1}{1-\varepsilon}\sup_{x\in\mathcal{N}}\{\|Ax\|_2\}.
\]

Proof. The lower bound is trivial as N ⊂ Sn−1. For the upper bound pick an x ∈ Sn−1 for which kAk = kAxk2 and choose x0 ∈ N for this x such that kx − x0k2 ≤ ε. Then

kAx − Ax0k2 ≤ kAkkx − x0k2 ≤ εkAk .

The triangle inequality implies that

kAx0k2 ≥ kAxk2 − kAx − Ax0k2 ≥ (1 − ε)kAk , and thus kAk ≤ kAx0k2/(1 − ε). 2

Exercise 4.12 Let $\mathcal{N}$ be an $\varepsilon$-net of $S^{n-1}$ and $\mathcal{M}$ be an $\varepsilon$-net of $S^{m-1}$. Show that for any $m\times n$ matrix $A$ one has
\[
\sup_{x\in\mathcal{N},\,y\in\mathcal{M}}\{\langle Ax,y\rangle\} \le \|A\| \le \frac{1}{1-2\varepsilon}\sup_{x\in\mathcal{N},\,y\in\mathcal{M}}\{\langle Ax,y\rangle\}.
\]

K

Exercise 4.13 (Isometries) Let A be an m × n matrix with m ≥ n. Prove the following equivalences:

T n A A = 1ln ⇔ A isometry, i.e., kAxk2 = kxk2 for all x ∈ R ⇔ sn(A) = s1(A) .

KK 40 RANDOM MATRICES

Theorem 4.14 (Norm of sub-Gaussian random matrices) Let $A$ be an $m\times n$ random matrix with independent mean-zero sub-Gaussian random entries $A_{ij}$, $i=1,\dots,m$, $j=1,\dots,n$. Then, for every $t>0$, we have
\[
\|A\| \le CK(\sqrt{m}+\sqrt{n}+t)
\]
with probability at least $1-2\exp(-t^2)$, where $K := \max_{1\le i\le m,\,1\le j\le n}\{\|A_{ij}\|_{\psi_2}\}$ and where $C>0$ is an absolute constant.

Proof. We use an $\varepsilon$-net argument for our proof.
Step 1: Using Proposition 4.5 (b) we can find, for $\varepsilon = \frac{1}{4}$, an $\varepsilon$-net $\mathcal{N} \subset S^{n-1}$ and an $\varepsilon$-net $\mathcal{M} \subset S^{m-1}$ with
\[
|\mathcal{N}| \le 9^n \quad \text{and} \quad |\mathcal{M}| \le 9^m.
\]
By Exercise 4.12, the operator norm of $A$ can be bounded using our nets as follows:
\[
\|A\| \le 2\max_{x\in\mathcal{N},\,y\in\mathcal{M}}\{\langle Ax,y\rangle\}.
\]

Step 2: Concentration. Pick $x \in \mathcal{N}$ and $y \in \mathcal{M}$. We compute (using the fact that the entries are sub-Gaussian with their $\psi_2$-norms bounded by $K$)
\[
\|\langle Ax,y\rangle\|_{\psi_2}^2 \le C\sum_{i=1}^m\sum_{j=1}^n \|A_{ij}x_jy_i\|_{\psi_2}^2 \le CK^2\sum_{i=1}^m\sum_{j=1}^n y_i^2x_j^2 = CK^2,
\]
for some absolute constant $C>0$. We therefore obtain a tail bound for $\langle Ax,y\rangle$: for every $u \ge 0$,
\[
\mathbb{P}(\langle Ax,y\rangle \ge u) \le 2\exp(-cu^2/K^2),
\]
for some absolute constant $c>0$.

Step 3: Union bound. We unfix the choice of $x$ and $y$ in Step 2 by a union bound:
\[
\mathbb{P}\Big(\max_{x\in\mathcal{N},\,y\in\mathcal{M}}\{\langle Ax,y\rangle\} \ge u\Big) \le \sum_{x\in\mathcal{N},\,y\in\mathcal{M}}\mathbb{P}(\langle Ax,y\rangle \ge u) \le 9^{n+m}\,2\exp(-cu^2/K^2).
\]
We continue the estimate by choosing $u = CK(\sqrt{n}+\sqrt{m}+t)$, which leads to the lower bound $u^2 \ge C^2K^2(n+m+t^2)$, and furthermore we adjust the constant $C>0$ such that $cu^2/K^2 \ge 3(n+m)+t^2$. Inserting these choices we get
\[
\mathbb{P}\Big(\max_{x\in\mathcal{N},\,y\in\mathcal{M}}\{\langle Ax,y\rangle\} \ge u\Big) \le 9^{n+m}\,2\exp(-3(n+m)-t^2) \le 2\exp(-t^2),
\]
and thus
\[
\mathbb{P}(\|A\| \ge 2u) \le 2\exp(-t^2). \qquad\Box
\]
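The order $\sqrt{m}+\sqrt{n}$ in Theorem 4.14 is easy to observe numerically. The following Python/NumPy sketch (added for illustration, not part of the notes) uses Gaussian entries as one convenient sub-Gaussian example; the matrix sizes are arbitrary.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(5)
for (m, n) in [(50, 50), (200, 100), (1000, 200)]:
    A = rng.standard_normal((m, n))      # independent mean-zero sub-Gaussian entries
    op_norm = np.linalg.norm(A, 2)       # largest singular value s_1(A) = ||A||
    print(f"m={m:5d} n={n:5d}  ||A|| = {op_norm:7.1f}"
          f"   sqrt(m)+sqrt(n) = {np.sqrt(m) + np.sqrt(n):7.1f}")
\end{verbatim}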

Corollary 4.15 Let $A$ be an $n\times n$ random matrix whose entries on and above the diagonal are independent mean-zero sub-Gaussian random variables. Then, for every $t>0$, we have
\[
\|A\| \le CK(\sqrt{n}+t)
\]
with probability at least $1-4\exp(-t^2)$, where $K := \max_{1\le i,j\le n}\{\|A_{ij}\|_{\psi_2}\}$.

Proof. Exercise. Decompose the matrix into an upper-triangular part and a lower-triangular part and use the proof of the previous theorem. 2 We can actually improve the statement in Theorem 4.14 in two ways. First we obtain a two-sided bound, and, secondly, we can relax the independence assumption. As this is a refinement of our previous statement we only state the result and omit its proof. The statement is used in covariance estimation below.

Theorem 4.16 Let $A$ be an $m\times n$ matrix whose rows $A_i$, $i=1,\dots,m$, are independent mean-zero sub-Gaussian isotropic random vectors in $\mathbb{R}^n$. Then, for every $t\ge 0$, we have
\[
\sqrt{m} - CK^2(\sqrt{n}+t) \le s_n(A) \le s_1(A) \le \sqrt{m} + CK^2(\sqrt{n}+t)
\]
with probability at least $1-2\exp(-t^2)$, where $K := \max_{1\le i\le m}\{\|A_i\|_{\psi_2}\}$. Furthermore,
\[
\Big\|\frac{1}{m}A^{T}A - 1\mathrm{l}_n\Big\| \le K^2\max\{\delta,\delta^2\}, \qquad \text{where } \delta = C\Big(\sqrt{\frac{n}{m}} + \frac{t}{\sqrt{m}}\Big).
\]

4.3 Application: Community Detection in Networks See Chapter 4.5 in [Ver18].

4.4 Application: Covariance Estimation and Clustering Suppose that X (1),...,X (N) are empirical samples (random outcomes) of a random vector X ∈ Rn. We do not have access to the full distribution, only to the empirical measure

\[
L_N = \frac{1}{N}\sum_{i=1}^N \delta_{X^{(i)}},
\]
which is a random probability measure on $\mathbb{R}^n$ (it depends on the $N$ random outcomes $X^{(i)}$). Here, the symbol $\delta_X$ denotes the point (Dirac) measure defined as
\[
\delta_X(y) = \begin{cases} 1 & \text{if } y = X,\\ 0 & \text{if } y \neq X, \end{cases} \qquad y \in \mathbb{R}^n.
\]

We assume for simplicity that E[X] = 0 and recall Σ = Σ(X) = E[XXT ]. 42 RANDOM MATRICES

Definition 4.17 Let X ∈ Rn with N random outcomes/random samples X (1),...,X (N).

(a) The empirical measure of X (1),...,X (N) is the probability measure on Rn,

N 1 X L := δ (i) . N N X i=1

(b) The empirical covariance of X (1),...,X (N) is the random matrix

\[
\Sigma_N := \frac{1}{N}\sum_{i=1}^N X^{(i)}\big(X^{(i)}\big)^{T}.
\]

Note that Xi ∼ X implies that E[ΣN ] = Σ. The law of large numbers yields

ΣN → Σ almost surely as N → ∞ .

Theorem 4.18 (Covariance estimation) Let $X$ be a sub-Gaussian random vector in $\mathbb{R}^n$, and assume that there exists $K \ge 1$ such that
\[
\|\langle X,x\rangle\|_{\psi_2} \le K\|\langle X,x\rangle\|_{L^2} \qquad \text{for all } x \in \mathbb{R}^n.
\]
Then, for every $N \in \mathbb{N}$,
\[
\mathbb{E}\big[\|\Sigma_N - \Sigma\|\big] \le CK^2\Big(\sqrt{\frac{n}{N}} + \frac{n}{N}\Big)\|\Sigma\|.
\]

Proof. First note that
\[
\|\langle X,x\rangle\|_{L^2}^2 = \mathbb{E}[|\langle X,x\rangle|^2] = \mathbb{E}[\langle X,x\rangle^2] = \langle\Sigma x,x\rangle.
\]
We bring $X, X^{(1)},\dots,X^{(N)}$ all into isotropic position. That is, there exist independent isotropic random vectors $Z, Z^{(1)},\dots,Z^{(N)}$ such that
\[
X = \Sigma^{1/2}Z \quad \text{and} \quad X^{(i)} = \Sigma^{1/2}Z^{(i)}.
\]
We have from our assumptions that
\[
\|Z\|_{\psi_2} \le K \quad \text{and} \quad \|Z^{(i)}\|_{\psi_2} \le K.
\]
Then
\[
\|\Sigma_N - \Sigma\| = \|\Sigma^{1/2}R_N\Sigma^{1/2}\| \le \|R_N\|\,\|\Sigma\|, \qquad \text{where } R_N = \frac{1}{N}\sum_{i=1}^N Z^{(i)}\big(Z^{(i)}\big)^{T} - 1\mathrm{l}_n.
\]
Suppose now that $A$ is the $N\times n$ matrix whose rows are $\big(Z^{(i)}\big)^{T}$; then
\[
\frac{1}{N}A^{T}A - 1\mathrm{l}_n = R_N,
\]
and Theorem 4.16 applied to $A$ implies that
\[
\mathbb{E}[\|R_N\|] \le CK^2\Big(\sqrt{\frac{n}{N}} + \frac{n}{N}\Big),
\]
and we conclude with our statement. $\Box$
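The $\sqrt{n/N}$ rate is visible in simulation. The following Python/NumPy sketch is only an illustration (not part of the notes); the ground-truth covariance $\Sigma = BB^{T}/n + I$ and the sample sizes are assumptions chosen for the experiment.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(6)
n = 50
# Illustrative ground-truth covariance: Sigma = B B^T / n + I (positive definite).
B = rng.standard_normal((n, n))
Sigma = B @ B.T / n + np.eye(n)
L = np.linalg.cholesky(Sigma)
for N in [50, 200, 1000, 5000]:
    X = rng.standard_normal((N, n)) @ L.T     # mean-zero samples with covariance Sigma
    Sigma_N = X.T @ X / N                     # empirical covariance
    err = np.linalg.norm(Sigma_N - Sigma, 2) / np.linalg.norm(Sigma, 2)
    print(f"N={N:5d}  ||Sigma_N - Sigma|| / ||Sigma|| = {err:.3f}"
          f"   sqrt(n/N) = {np.sqrt(n / N):.3f}")
\end{verbatim}
The relative error tracks $\sqrt{n/N}$, in line with Remark 4.19 below: a sample of size $N \sim \varepsilon^{-2}n$ gives relative accuracy $\varepsilon$.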

Remark 4.19 For all ε ∈ (0, 1) we have

E[kΣN − Σk] ≤ εkΣk , if we take a sample of size N ∼ ε−2n. 

5 Concentration of measure - general case

We now study general concentration of measure phenomena and aim in particular to include cases where the random variables are not necessarily independent. So far, the independence assumption made our concentration results relatively easy to develop. In the first section we summarise concentration results obtained by entropy techniques before studying dependent random variables in the remaining sections.

5.1 Concentration by entropic techniques
In the following assume that $\varphi\colon\mathbb{R}\to\mathbb{R}$ is a convex function and $X$ an $\mathbb{R}$-valued random variable such that $\mathbb{E}[X]$ and $\mathbb{E}[\varphi(X)]$ are finite, unless otherwise stated. The random variable is a measurable map from some probability space $(\Omega,\mathcal{F},\mathbb{P})$ to $\mathbb{R}$ and is distributed with law $\mathbb{P}_X = \mathbb{P}\circ X^{-1} \in \mathcal{M}_1(\mathbb{R})$.
Definition 5.1 (Entropy) The entropy of the random variable $X$ for the convex function $\varphi$ is
\[
H_\varphi(X) = \mathbb{E}[\varphi(X)] - \varphi(\mathbb{E}[X]).
\]

Corollary 5.2 By Jensen's inequality, $\varphi(\mathbb{E}[X]) \le \mathbb{E}[\varphi(X)]$, we see that $H_\varphi(X) \ge 0$.

Example 5.3 (i) For $\varphi(u) = u^2$, $u\in\mathbb{R}$, the entropy of $X$,
\[
H_\varphi(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = \mathrm{Var}(X),
\]
is the variance of $X$.
(ii) Let $\varphi(u) = -\log u$, $u>0$. For a real-valued random variable $X$ we have that $Z := e^{\lambda X} > 0$ is a strictly positive real-valued random variable, and
\[
H_\varphi(Z) = -\lambda\mathbb{E}[X] + \log\mathbb{E}[e^{\lambda X}] = \log\mathbb{E}[e^{\lambda(X-\mathbb{E}[X])}].
\]
(iii) Let $\varphi(u) = u\log u$, $u>0$, and $\varphi(0) := 0$; $\varphi$ is a convex function on $\mathbb{R}_+$. For any non-negative random variable $Z \ge 0$, the $\varphi$-entropy is
\[
H_\varphi(Z) = \mathbb{E}[Z\log Z] - \mathbb{E}[Z]\log\mathbb{E}[Z]. \qquad\clubsuit
\]

In the following we will drop the index $\varphi$ for the entropy whenever we take $\varphi(u) = u\log u$ as in Example 5.3 (iii). There are several reasons why this choice is particularly useful. In the next remark we show the connection of this entropy to other entropy concepts in probability theory.

Remark 5.4 Suppose that $\Omega$ is a finite set, and let $p,q \in \mathcal{M}_1(\Omega)$ be two probability measures (vectors) such that $q(\omega)=0$ implies $p(\omega)=0$. Define
\[
X\colon\Omega\to\mathbb{R}, \qquad \omega\mapsto X(\omega) = \begin{cases} \dfrac{p(\omega)}{q(\omega)} & \text{if } q(\omega)>0,\\[1ex] 0 & \text{if } q(\omega)=p(\omega)=0, \end{cases}
\]
with distribution $q\in\mathcal{M}_1(\Omega)$. Then $X\ge 0$ with
\[
\mathbb{E}[X] = \sum_{\omega\in\Omega} X(\omega)q(\omega) = \sum_{\omega\in\Omega} p(\omega) = 1,
\]
and
\[
H(X) = \sum_{\omega\in\Omega} X(\omega)q(\omega)\log X(\omega) - \mathbb{E}[X]\log\mathbb{E}[X] = \sum_{\omega\in\Omega} p(\omega)\log\frac{p(\omega)}{q(\omega)} =: H(p|q) =: D(p\|q),
\]
where $H(p|q)$ is the relative entropy of $p$ with respect to $q$, a widely used function in probability theory (e.g. large deviation theory) and information theory, and where $D(p\|q)$ is the Kullback--Leibler divergence of $p$ and $q$ used in information theory. $\blacksquare$

For any real-valued random variable $X$, $Z = \exp(\lambda X)$, $\lambda\in\mathbb{R}$, is a positive random variable and the entropy can be written with the moment generating function of $X$; recall the definition of the moment generating function (MGF) in (1.4) (we assume again that the expectations are finite for all $\lambda\in\mathbb{R}$):
\[
H(e^{\lambda X}) = \lambda M_X'(\lambda) - M_X(\lambda)\log M_X(\lambda). \tag{5.1}
\]
Example 5.5 (Entropy -- Gaussian random variable) Suppose $X\sim N(0,\sigma^2)$, $\sigma>0$. Then $M_X(\lambda) = \exp(\lambda^2\sigma^2/2)$ and $M_X'(\lambda) = \lambda\sigma^2M_X(\lambda)$, and therefore
\[
H(e^{\lambda X}) = \lambda^2\sigma^2M_X(\lambda) - \frac{\lambda^2\sigma^2}{2}M_X(\lambda) = \frac{1}{2}\lambda^2\sigma^2M_X(\lambda). \qquad\clubsuit
\]
A bound on the entropy leads to a bound on the centred moment generating function $\Phi$, see (2.3); this is the content of the so-called Herbst argument.

Proposition 5.6 (Herbst argument) Let $X$ be a real-valued random variable and suppose that, for some $\sigma>0$,
\[
H(e^{\lambda X}) \le \frac{1}{2}\sigma^2\lambda^2 M_X(\lambda)
\]
for $\lambda \in I$, with the interval $I$ being either $[0,\infty)$ or $\mathbb{R}$. Then
\[
\log\mathbb{E}\big[\exp(\lambda(X-\mathbb{E}[X]))\big] \le \frac{1}{2}\lambda^2\sigma^2 \qquad \text{for all } \lambda\in I. \tag{5.2}
\]

Remark 5.7 (a) When I = R, then the bound (5.2) is equivalent to asserting that the centred random variable X − E[X] is sub-Gaussian with parameter σ > 0. (b) For I = [0, ∞), the bound (5.2) leads immediately to the one-sided tail estimate

2 2 P(X ≥ E[X] + t) ≤ exp(−t /2σ ) , t ≥ 0,

and I = R provides the corresponding two-sided estimate. 

Proof of Proposition 5.6. Suppose that I = [0, ∞). Using (5.1), our assumption turns into a differential inequality for the moment generating function,

1 λM 0 (λ) − M (λ) log M (λ) ≤ σ2λ2M (λ) for all λ ≥ 0 . (5.3) X X X 2 X

1 Define now a function G(λ) := λ log MX (λ) for λ 6= 0, and then extend the function to 0 by continuity d G(0) := lim G(λ) = log MX (λ) = E[X] . λ→0 dλ λ=0 Our assumptions on the existence of the MGF imply differentiability with respect to the parameter λ. Hence 0 0 1 MX (λ) 1 G (λ) = − 2 log MX (λ) , λ MX (λ) λ and thus we can rewrite our differential inequality (5.3) as

1 G0(λ) ≤ σ2 for all λ ∈ I = [0, ∞) . 2

For any λ0 > 0 we can integrate both sides of the previous inequality to arrive at 1 G(λ) − G(λ ) ≤ σ2(λ − λ ) . 0 2 0

Now, letting λ0 ↓ 0 and using the above definition of G(0), we get 1   1 G(λ) − [X] = log M (λ) − log eλE[X] ≤ σ2λ , E λ X 2 and we conclude with the statement in (5.2). 2

Proposition 5.8 (Bernstein entropy bound) Suppose there exist B > 0 and σ > 0 such that the real-valued random variable X satisfies the following entropy bound

λX 2 0 2 −1 H(e ) ≤ λ (BMX (λ) + MX (λ)(σ − BE[X])) for all λ ∈ [0,B ) .

Then λ(X− [X]) 2 2 −1 −1 log E[e E ] ≤ σ λ (1 − Bλ) for all λ ∈ [0,B ) . (5.4) 46 CONCENTRATION OF MEASURE - GENERALCASE

Remark 5.9 As a consequence of the Chernoff argument, Proposition 5.8 implies that X satisfies the upper tail bound

\[
\mathbb{P}(X \ge \mathbb{E}[X] + \delta) \le \exp\Big(-\frac{\delta^2}{4\sigma^2 + 2B\delta}\Big) \qquad \text{for all } \delta \ge 0. \tag{5.5}
\]

Exercise 5.10 Prove the tail bound (5.5). K

Proof of Proposition 5.8. We skip the proof of this statement as it employs similar tech- niques as in the proof of Proposition 5.6. 2 So far, the entropic method has not provided substantial new insight as all concentra- tion results are done via the usual Chernoff bound. We shall now study concentration for functions of many random variables.

Definition 5.11 (a) A function f : Rn → R is separately convex if, for every index k ∈ {1, . . . , n}, the univariate function

fk : R → R , yk 7→ fk(yk) := f(x1, . . . , xk−1, yk, xk+1, . . . , xn) ,

n−1 is convex for each vector (x1, . . . , xk−1, xk+1, . . . , xn) ∈ R .

(b) A function f : X → Y for metric spaces (X, dX ) and (Y, dY ) is Lipschitz continuous (sometimes called locally Lipschitz continuous ) if for every x ∈ X there exists a

neighbourhood U ⊂ X such that f|U is globally Lipschitz continuous. Here f|U : U → Y is the restriction of f to U.

(c) A function f : X → Y for metric spaces (X, dX ) and (Y, dY ) is L-Lipschitz continuous (sometimes called globally Lipschitz continuous) if there exists L ∈ R such that

dY (f(x), f(y)) ≤ LdX (x, y) for all x, y ∈ X. (5.6)

The smallest constant L > 0 satisfying (5.6) is denoted kfkLip. In the following some statements hold for global Lipschitz continuity and some only for local Lipschitz con- tinuity.

n Theorem 5.12 (Tail-bound for Lipschitz functions) Let X = (X1,...,Xn) ∈ R be a random vector with independent random coordinates Xi supported on the interval [a, b], a < b, and let f : Rn → R be separately convex and L-Lipschitz continuous with respect to the Euclidean norm k·k2. Then, for every t ≥ 0,

\[
\mathbb{P}(f(X) \ge \mathbb{E}[f(X)] + t) \le \exp\Big(-\frac{t^2}{4L^2(b-a)^2}\Big). \tag{5.7}
\]

Example 5.13 (Operator norm of a random matrix) Let $M \in \mathbb{R}^{n\times d}$ be an $n\times d$ matrix with independent, identically distributed, mean-zero random entries $M_{ij}$, $i\in\{1,\dots,n\}$, $j\in\{1,\dots,d\}$, supported on the interval $[-1,1]$. Recall
\[
\|M\| = \max_{\nu\in\mathbb{R}^d\colon\|\nu\|_2=1}\{\|M\nu\|_2\} = \max_{\substack{\nu\in\mathbb{R}^d\colon\|\nu\|_2=1\\ u\in\mathbb{R}^n\colon\|u\|_2=1}}\{\langle u,M\nu\rangle\}.
\]
The map $M\mapsto\|M\|$ is a function $f\colon\mathbb{R}^{n\times d}\to\mathbb{R}$, $f(M)=\|M\|$. To apply Theorem 5.12 above we shall show that $f$ is Lipschitz and separately convex. The operator norm is the maximum/supremum of functions that are linear in the entries of the matrix $M$, and thus any such function is convex, and as such separately convex. Moreover, for $M,M'\in\mathbb{R}^{n\times d}$,
\[
\big|\|M\| - \|M'\|\big| \le \|M-M'\| \le \|M-M'\|_{\mathrm{F}},
\]
where $\|M\|_{\mathrm{F}} := \sqrt{\sum_{i=1}^n\sum_{j=1}^d M_{ij}^2}$ is the Euclidean norm of all entries of the matrix, called the Frobenius norm of the matrix $M$. The first inequality shows that $f := \|\cdot\|$ is $1$-Lipschitz continuous. Thus Theorem 5.12 implies that, for every $t\ge 0$,
\[
\mathbb{P}(\|M\| \ge \mathbb{E}[\|M\|] + t) \le e^{-t^2/16}. \qquad\clubsuit
\]
Key ingredients for the proof of Theorem 5.12 are the following two lemmas.

Lemma 5.14 (Entropy bound for univariate functions) Let X and Y two independent, iden- tically distributed R-valued random variables. Denote by EX,Y the expectation with respect to X and Y . For any function g : R → R the following statements hold: (a) 2 λg(X) 2 h  λg(X) n oi H(e ) ≤ λ EX,Y g(X) − g(Y ) e 1l g(X) ≥ g(Y ) , for all λ > 0 .

(b) If in addition the random variable X is supported on [a, b], a < b, and the function g is convex and Lipschitz continuous, then

λg(X) 2 2 h 0 2 λg(X)i H(e ) ≤ λ (b − a) E (g (X)) e , for all λ > 0 .

Lemma 5.15 (Tensorisation of the entropy) Let X1,...,Xn be independent real-valued random variables and f : Rn → R a given function. Then n λf(X ,...,X ) h X  λf (X ) ki H(e 1 n ) ≤ E H e k k |X¯ , for all λ > 0 , (5.8) k=1 where fk is the function introduced in Definition 5.11 and ¯ k X = (X1,...,Xk−1,Xk+1,...,Xn) .

The entropy on the right hand side is computed with respect to Xk for k = 1, . . . , n, by holding the remaining X¯ k fixed. That is,   H eλfk(Xk) X¯ k = [eλfk(Xk)λf (X )] − [ exp (λf (X ))] log [ exp (λf (X ))] EXk k k EXk k k EXk k k is still a function of X¯ k and is integrated with respect to E on the right hand side of (5.8). 48 CONCENTRATION OF MEASURE - GENERALCASE

We first finish the proof of theorem 5.12. ¯ k n−1 Proof of Theorem 5.12. For k ∈ {1, . . . , n} and every vector (fixed) X ∈ R the fk is convex, and hence Lemma 5.14 implies for all λ > 0 that

  h 2 i H eλfk(Xk) X¯ k ≤ λ2(b − a)2 (f 0 (X )) eλfk(Xk) X¯ k EXk k k 2 2 2 h∂f(X1,...,Xk,...,Xn) i λ (b − a) EXk exp (λf(X1,...,Xk,...,Xn)) . ∂xk

With Lemma 5.15 one obtains writing X = (X1,...,Xn),

n 2  λf(X) 2 2 h X h∂f(X) i H e ≤ λ (b − a) E EXk exp (λf(X)) ∂xk k=1 2 2 2 λf(X) ≤ λ (b − a) L E[e ] , where we used the Lipschitz continuity leading

n 2 2 X ∂f(X) 2 k∇f(X)k2 = ≤ L almost surely . ∂xk k=1

The tail bound then follows from the Herbst argument in Proposition 5.6. 2 Proof of Lemma 5.14. Using the fact that X and Y are independent and identical dis- tributed we have

λg(X) λg(X) λg(X) λg(Y ) H(e ) = EX [λg(X)e ] − EX [e ] log EY [e ] .

By Jensen’s inequality, λg(Y ) log EY [e ] ≥ EY [λg(Y )] , and thus, using the symmetry between X and Y , we obtain (we write EX,Y for the expecta- tion with respect to both, X and Y when we want to distinguish expectations with respect to the single random variables. Note that we can easily replace EX,Y by E),

λg(X) λg(X) λg(X) H(e ) ≤ EX [λg(X)e ] − EX,Y [e λg(Y )] 1 = [λ(g(X) − g(Y ))(eλg(X) − eλg(Y ))] 2EX,Y (5.9) λg(X) λg(Y ) = λE[(g(X) − g(Y ))(e − e ){g(X) ≥ g(Y )}] Symmetry

For all s, t ∈ R we have es −et ≤ es(s−t). To see that, assume without loss of generality that s ≥ t and recall that ex ≥ 1 + x, to see that

es(1 − et − s) ≤ es(1 − (1 + (t − s))) = es(s − t) .

For s ≥ t, we obtain therefore

(s − t)(es − et)1l{s ≥ t} ≤ (s − t)2es1l{s ≥ t} . CONCENTRATION OF MEASURE - GENERALCASE 49

Applying this bound with s = λg(X) and t = λg(Y ) to the inequality (5.9) yields

λg(X) 2 2 λg(X) H(e ) ≤ λ E[(g(X) − g(Y )) e 1l{g(X) ≥ g(Y )}] . If in addition the function g is convex, then we have the upper bound

g(x) − g(y) ≤ g0(x)(x − y) ,

and hence, for g(x) ≥ g(y),

(g(x) − g(y))2 ≤ (g0(x))2(x − y)2

which finishes the proof of the statement in Lemma 5.14. 2 Proof of Lemma 5.15. They key ingredient is the variational representation of the entropy (5.13), see Proposition 5.16 and Remark 5.17 below:

λf(X) λf(X) H(e ) = sup {E[g(X)e ]} , (5.10) g∈G where G = {g :Ω → R: eg ≤ 1}. The proof now amounts to some computation once the following notations are introduced. ¯ For each j ∈ {1, . . . , n} define Xj = (Xj,...,Xn), and for any g ∈ G define the functions i ¯ g , i = 1, . . . , n, as follows using X = (X1,...,Xn) and E[ · |Xj] denote expectation with ¯ respect to X conditions on fixing Xj: 1 g(X) ¯ g (X1,...,Xn) := g(X) − log E[e |X2] , [eg(X)|X¯ ] gk(X ,...,X ) := log E k , for k = 2, . . . , n . k n g(X) ¯ E[e |Xk+1] It is easy to see that by construction we have,

n X k g(X) g (Xk,...,Xn) = g(X) − log E[e ] ≥ g(X) , (5.11) k=1 and k E[ exp (g (Xk,...,Xn)|Xk+1] = 1 . We use this decomposition within the variational representation (5.10) leading to the follow- ing chain of upper bounds:

n λf(X) X k λf(X) E[g(X)e ] ≤ E[g (Xk,...,Xn)e ] (5.11) k=1 n X k λf(X) k k ¯ = EX¯ [EXk [g (Xk,...,Xn)e |X ]] k=1 n X λf (X ) ¯ k k ¯ ≤ EXk [H(e |Xk)]] . (5.10) k=1 We conclude with the statement by optimising over the function g ∈ G. 2 50 CONCENTRATION OF MEASURE - GENERALCASE

Proposition 5.16 (Duality formula of the Entropy) Let Y be a non-negative R-valued ran- dom variable defined on a probability space (Ω, F,P ) such that E[Φ(Y )] < ∞, where Φ(u) = u log u for u ≥ 0. Then

H(Y ) = sup {E[gY ]} , (5.12) g∈U

where U = {g :Ω → R measurable with E[eg] = 1}.

g dQ g Proof. Denote Q = e P the probability measure with Radon-Nikodym denisty dP = e with respect to P for some g ∈ U. Denote E the expectation with respect to P and EQ the expectation with respect to Q, and we write HQ we we compute the entropy with respect to the probability measure Q. Then

−g −g −g −g HQ(Y e ) = EQ[Y e ( log Y − g)] − EQ[Y e ] log EQ[Y e ] = E[Y log Y ] − E[Y g] − E[Y ] log E[Y ] = H(Y ) − E[Y g] ,

−g and as HQ(Y e ) ≥ 0 (entropy is positive due to Corollary 5.2) we get that

H(Y ) ≥ E[Y g] .

The equality in (5.12) follows by setting eg = Y/E[Y ], i.e., g = log Y = log E[Y ]. 2 Remark 5.17 (Variational representation of the entropy) One can easily extend the vari- ational formula (5.12) in Proposition to the set G = {g :Ω → R: eg ≤ 1}, namely

H(Y ) = sup {E[gY ]} . (5.13) g∈G 

5.2 Concentration via Isoperimetric Inequalities n−1 We will see that Lipschitz functions√ concentrate well on S . In the following, when we consider the sphere Sn−1 or bSn−1 we use the Euclidean metric in Rn instead of the geodesic metric of the spheres. √ Theorem 5.18 (Concentration of Lipschitz functions on the sphere) Let f : nSn−1 → R be a Lipschitz function and √ X ∼ Unif( nSn−1) . Then kf(X) − [f(X)]k ≤ Ckfk , E ψ2 Lip for some absolute constant C > 0. The statement of Theorem 5.18 amounts to the following concentration result, for every t ≥ 0,  Ct2  P(|f(X) − E[f(X)]| ≥ t) ≤ 2 exp − 2 , (5.14) kfkLip CONCENTRATION OF MEASURE - GENERALCASE 51

for some absolute constant C > 0. We already√ know this statement for linear functions, see Theorem 3.17 saying that when X ∼ n(Sn−1) we have that X (or any linear map) is sub-Gaussian. To prove the extension to any nonlinear Lipschitz function we need two fundamental results, the so-called isoperimetric inequalities, which we can only state in order not to overload the lecture.

Definition 5.19 Suppose f : Rn → R is some function. It’s level-sets (or sub-level sets) are n Lf (c) := {x ∈ R : f(x) ≤ c} , c ∈ R .

Theorem 5.20 (Isomperimetric inequality on Rn) Among all subsets A ⊂ Rn with given volume, Euclidean balls have minimal surface area. Moreover, for any ε > 0, Euclidean balls minimise the volume of the ε-neighbourhood of A,

n (n) Aε := {x ∈ R : ∃ y ∈ A: kx − yk2 ≤ ε} = A + εB ,

where B(n) is the unit ball in Rn. Theorem 5.21 (Isomperimetric inequality on the sphere) Let ε > 0. Then, among all n−1 A ⊂ S with given area σn−1(A), the spherical caps minimise the area of the ε-neighbourhood σn−1(Aε), n−1 Aε := {x ∈ S : ∃ y ∈ A: kx − yk2 ≤ ε} . A spherical cap C(a, ε) centred at a point a ∈ Sn−1 is the set

\[
C(a,\varepsilon) := \{x \in S^{n-1}\colon \|x-a\|_2 \le \varepsilon\}.
\]
For the proof of Theorem 5.18 the following lemma is a crucial step.
Lemma 5.22 (Blow-up) For any $A \subset \sqrt{n}\,S^{n-1}$, denote by $\sigma$ the normalised area on the sphere. If $\sigma(A) \ge \frac{1}{2}$, then, for every $t \ge 0$,
\[
\sigma(A_t) \ge 1 - 2\exp(-ct^2)
\]

for some absolute constant c > 0.

√ n−1 1 Proof. For the hemisphere H = {x ∈ nS : x1 ≤ 0}, we have σ(A) ≥ 2 = σ(H). The t-neighbourhood Ht of the hemisphere H is a spherical cap, and the isoperimetric inequality in Theorem 5.21 gives σ(At) ≥ σ(Ht) . We continue with Theorem 3.17 but noting that the normalised measure σ is the uniform probability measure on the sphere such that

σ(Ht) = P(X ∈ Ht) . Because of n √ n−1 t o Ht ⊃ x ∈ nS : x1 ≤ √ 2 52 CONCENTRATION OF MEASURE - GENERALCASE

we have
\[
\sigma(H_t) \ge \mathbb{P}(X_1 \le t/\sqrt{2}) = 1 - \mathbb{P}(X_1 > t/\sqrt{2}) \ge 1 - 2\exp(-ct^2),
\]
for some absolute constant $c>0$. $\Box$

Proof of Theorem 5.18. Without loss of generality we assume that kfkLip = 1. Let M denote the median of f(X), that is, 1 1 (f(X) ≤ M) ≥ and (f(X) ≥ M) ≥ . P 2 P 2 √ n−1 1 The set A = {x ∈ nS : f(x) ≤ M} is a level (sub-level) set of f with P(X ∈ A) ≥ 2 . 2 Lemma 5.22 implies that P(X ∈ At) ≥ 1 − 2 exp(−Ct ) for some absolute constant C > 0. We claim that, for every t ≥ 0,

P(X ∈ At) ≤ P(f(X) ≤ M + t) . (5.15)

To see (5.15), note that $X \in A_t$ implies $\|X-y\|_2 \le t$ for some point $y \in A$. By our definition of the set $A$, $f(y) \le M$. As $\|f\|_{\mathrm{Lip}} = 1$, we have

f(X) − f(y) ≤ |f(X) − f(y)| ≤ kX − yk2 and thus

\[
f(X) \le f(y) + \|X-y\|_2 \le M+t.
\]
Hence
\[
\mathbb{P}(f(X) \le M+t) \ge 1 - 2\exp(-Ct^2).
\]
We now repeat our argument for $-f$, whose median is $-M$: set $\widetilde{A} = \{x \in \sqrt{n}\,S^{n-1}\colon -f(x) \le -M\}$. Then $\mathbb{P}(X\in\widetilde{A}) \ge 1/2$ and thus $\mathbb{P}(X \in \widetilde{A}_t) \ge 1 - 2\exp(-\widetilde{C}t^2)$. Now $X \in \widetilde{A}_t$ implies $\|X-y\|_2 \le t$ for some $y \in \widetilde{A}$, and $f(y) \ge M$ by definition of $\widetilde{A}$.

−f(X) − (−f(y)) ≤ kX − yk2 ≤ t ⇒ f(X) ≥ f(y) − t ≥ M − t , and thus P(f(X) ≥ M −t) ≥ 1−2 exp(−Ct˜ 2) for some absolute constant C˜ > 0. Combining our two estimates we obtain

2 P(|f(X) − M| ≤ t) ≥ 1 − 2 exp ( − Ctˆ ) for some absolute constant Cˆ > 0, and thus the immediate tail estimate shows that kf(X) − Mk ≤ C ψ2 for some absolute constant C > 0. To replace the median M by the expectation E[f(X)] note that although the median is not unique it is a fixed real number determined by the distribution of the random variable X and the function f. We use the centering Lemma 2.24 to get |kf(X)k − kMk | ≤ kf(X) − Mk ≤ C ψ2 ψ2 ψ2 kMk ≤ C˜ ⇒ −C˜ + kf(X)k ≤ C ψ2 ψ2 ⇒ kf(x)k ≤ C˜ + C ψ2 ⇒ kf(X) − [f(X)]k ≤ C. E ψ2 CONCENTRATION OF MEASURE - GENERALCASE 53

2

Definition 5.23 (Isoperimetric problem) Let $(E,d)$ be a metric space and $\mathcal{B}(E)$ its Borel $\sigma$-algebra, let $P \in \mathcal{M}_1(E,\mathcal{B}(E))$ be some given probability measure and $X$ an $E$-valued random variable. Isoperimetric problem: Given $p\in(0,1)$ and $t>0$, find the sets $A$ with $P(X\in A)\ge p$ for which $P(d(X,A)\ge t)$ is maximal, where
\[
d(X,A) := \inf_{y\in A}\{d(X,y)\}.
\]

Concentration function:

\[
\alpha(t) := \sup_{\substack{A\subset E\colon\\ P(A)\ge 1/2}}\{P(d(X,A)\ge t)\} = \sup_{\substack{A\subset E\colon\\ P(A)\ge 1/2}}\{P(A_t^{\,c})\},
\]

where At is the t-blowup of the set A, At = {x ∈ E : d(x, A) < t}. For a given function f : E → R denote the median of f(X) by Mf(X).

There are many isoperimetric inequalities; we only mention the Gaussian isoperimetric inequality as it is widely used. Recall that the Gaussian (standard normal) distribution is given by the probability measure $\gamma_n \in \mathcal{M}_1(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n))$,
\[
\gamma_n(A) = (2\pi)^{-n/2}\int_A e^{-\|x\|_2^2/2}\,dx_1\cdots dx_n, \qquad A \in \mathcal{B}(\mathbb{R}^n).
\]
The concentration function for the one-dimensional Gaussian ($n=1$) is just $\alpha(t) = 1-\Phi(t)$, where $\Phi(t) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{t} e^{-u^2/2}\,du$ is the standard normal distribution function. In the following statement half-spaces are sets of the form
\[
A = \{x\in\mathbb{R}^n\colon \langle x,u\rangle < \lambda\},\ u\in\mathbb{R}^n,\ \lambda\in\mathbb{R}, \qquad \text{or} \qquad A = \{x\in\mathbb{R}^n\colon x_1 \le z\},\ z\in\mathbb{R}.
\]

Theorem 5.24 (Gaussian isoperimetric inequality) Let ε > 0 be given. Among all A ⊂ n R with fixed Gaussian measure γn(A), the half-spaces minimise the Gaussian measure of the ε-neighbourhood γn(Aε).

From this we can obtain the following concentration results. The proof is using similar steps as done above, and leave the details as exercise for the reader.

Theorem 5.25 (Gaussian concentration) Suppose $X \sim N(0,1\mathrm{l}_n)$, and let $f\colon\mathbb{R}^n\to\mathbb{R}$ be a Lipschitz function. Then
\[
\|f(X) - \mathbb{E}[f(X)]\|_{\psi_2} \le C\|f\|_{\mathrm{Lip}}
\]
for some absolute constant $C>0$.

5.3 Some matrix calculus and covariance estimation

Theorem 5.26 (Matrix Bernstein inequality) Let $X_1,\dots,X_N$ be independent mean-zero random $n\times n$ symmetric matrices such that $\|X_i\| \le K$ almost surely for all $i=1,\dots,N$. Then, for every $t\ge 0$, we have
\[
\mathbb{P}\Big(\Big\|\sum_{i=1}^N X_i\Big\| \ge t\Big) \le 2n\exp\Big(-\frac{t^2/2}{\sigma^2 + Kt/3}\Big),
\]
where $\sigma^2 = \big\|\sum_{i=1}^N \mathbb{E}[X_i^2]\big\|$ is the norm of the matrix variance of the sum.
For the proof we shall introduce a few well-known facts about matrix calculus.

Definition 5.27 (a) For any symmetric n × n matrix X with eigenvalues λi = λi(X) and corresponding eigenvectors ui the function of a matrix for any given f : R → R is defined as the n × n matrix

n X T f(X) := f(λi)uiui . i=1

(b) Suppose $X$ is an $n\times n$ symmetric matrix. We write $X \succcurlyeq 0$ if $X$ is positive-semidefinite; equivalently, $X \succcurlyeq 0$ if all eigenvalues of $X$ are non-negative, i.e., $\lambda_i \ge 0$. For $Y \in \mathbb{R}^{n\times n}$, we set $X \succcurlyeq Y$ and $Y \preccurlyeq X$ if $X - Y \succcurlyeq 0$.
We borrow the following trace inequalities from linear algebra. Recall the notion of the trace of a matrix.

Golden--Thompson inequality: For any $n\times n$ symmetric matrices $A$ and $B$ we have

Trace (eA+B) ≤ Trace (eAeB) .

Lieb’s inequality: Suppose H is a n × n symmetric matrix and define the function on matrices f(X) := Trace ( exp (H + log X)) . Then f is concave on the space of positive-definite n × n matrices.

In principle one can prove the Matrix Bernstein inequality in Theorem 5.26 with these two re- sults from matrix analysis. If X is a random positive-definite matrix then Lieb’s and Jensen’s inequality imply that E[f(X)] ≤ f(E[X]) . We now apply this with X = eZ for some n × n symmetric matrix Z:

Lemma 5.28 (Lieb’s inequality for random matrices) Let H be a fixed n × n symmetric matrix and Z be a random n × n symmetric matrix. Then

Z E[Trace ( exp (H + Z))] ≤ Trace ( exp (H + log E[e ])) . (5.16) CONCENTRATION OF MEASURE - GENERALCASE 55

The proof of Theorem 5.26 follows below and is based on the following bound of the moment generating function (MGF).

Lemma 5.29 (Bound on MGF) Let X be an n × n symmetric mean-zero random matrix such that kXk ≤ K almost surely. Then

\[
\mathbb{E}[\exp(\lambda X)] \preccurlyeq \exp\big(g(\lambda)\,\mathbb{E}[X^2]\big), \qquad \text{where } g(\lambda) = \frac{\lambda^2/2}{1-|\lambda|K/3}, \quad \text{for } |\lambda| < 3/K.
\]

Proof of Lemma 5.29. For $z \in \mathbb{C}$ we can easily obtain the following estimate:
\[
e^z \le 1 + z + \frac{1}{1-|z|/3}\,\frac{z^2}{2} \qquad \text{for } |z| < 3.
\]
This can easily be derived from the Taylor series of the exponential function in conjunction with the lower bound $p! \ge 2\cdot 3^{p-2}$ and the geometric series. Details are left to the reader. Now with $z = \lambda x$, $|x| \le K$, $|\lambda| < 3/K$, we then obtain

eλx ≤ 1 + λx + g(λ)x2 .

Now we transfer this inequality from scalar to matrices:

λX 2 e 4 1ln + λX + g(λ)X , and then, after taking the expectation and using E[X] = 0, to arrive at

λX 2 E[e ] ≤ 1ln + g(λ)E[X ] . To finish, use the inequality 1 + z ≤ ez to conclude with

λX 2 E[e ] ≤ exp (g(λ)E[X ]) . 2

Proposition 5.30 (Expectation bound via the Bernstein Theorem) Under all the assump- tions in Theorem 5.26 we have the tail bound

N  X   t2/2  k X k ≥ t ≤ 2n exp − . P i σ2 + Kt/3 i=1 Then

N √ h X i  2π p  4   X ≤ 2σ + log(2n) + K 1 + log(2n) . (5.17) E i 2 3 i=1

K Proof. Define b := 3 . Then the right hand side in Theorem 5.26 reads as  t2  2n exp − . 2(σ2 + bt) 56 CONCENTRATION OF MEASURE - GENERALCASE

We will use a crude union-type bound on the tail probability itself by observing that either 2 2 PN σ ≤ bt or σ ≥ bt. Define Z := i=1 Xi. For every t ≥ 0,  t2  n  t   t2 o (kZk ≥ t) ≤ 2n exp − ≤ 2n max exp − , exp − P 2(σ2 + bt) 4b 4σ2  t   t2  ≤ 2n exp − + 2n exp − . 4b 4σ2 We shall combine this with the trivial inequality P(kXk ≥ t) ≤ 1. Thus  t   t2  (kZk ≥ t) ≤ 1 ∧ 2n exp − + 1 ∧ 2n exp − , P 4b 4σ2 and Z ∞ E[kZk] = P(kZk ≥ t) dt =: I1 + I2 , 0 R ∞ R ∞ 2 2 with I1 = 0 1 ∧ 2n exp(−t/(4b)) dt and I2 = 0 1 ∧ 2n exp(−t /(4σ )) dt. Solve 2n exp(−t1/(4b)) = 1 to obtain t1 = 4b log(2n), and then

Z t1 Z ∞ I1 = 1 dt + (2n) exp(−t/(4b)) dt = t1 + 4b = 4b(1 + log(2n)) . 0 t1

2 2 √ Solve 2n exp(−t /(4σ )) = 1 to obtain t2 = 2σ log(2n), and then

Z t2 Z ∞ Z ∞ 2 2 −t˜2/2 I2 = 1 dt + (2n) exp(−t /(4σ )) dt = t2 + 2σ √ (2n)e dt˜ ˜ 0 t2 t=t/2σ log(2n) p √ ≤ 2σ( log(2n) + π) , √ ˜ where we used another transformation for the inequality,√ namely, x = t − log(2n). Equiv- alently, we may use 2ne−t˜2/2 ≤ e−x2 for all t ≥ log(2n). We arrive at √  π p    [kZk] ≤ 2σ + log(2n) + 4b 1 + log(2n) . E 2 2 Proof of Theorem 5.26. PN Step 1: Define S := i=1 Xi and λm(S) := max1≤i≤n λi(S) the largest eigenvalue of S. Then kSk = max{λm(S), λm(−S)}.

λλm(S) λt −λt λλm(S) P(λm(S) ≥ t) = P(e ≥ e ) ≤ e E[e ] .

The eigenvalues of eλS are eλλi(S), and thus

λλm(S) λS E := E[e ] = E[λm(e )] . All eignevalues of eλS are positive, hence the maximal eigenvalue of eλS is bounded by the sum of all eigenvalues, that is, by the trace of eλS. Henceforth

λS E ≤ E[Trace (e )] . (5.18) BASICTOOLSINHIGH-DIMENSIONALPROBABILITY 57

Step 2: We now turn to bound the expectation of the trace using Lieb’s inequality, in partic- ular Lemma 5.28. We write the exponent of eλS separately, namely splitting off the last term of the sum, writing PN−1 λX +λX Trace (e i=1 i N ) . (5.19)

When we take the expectation of (5.19) we condition on X1,...,XN−1 and apply Lieb’s PN inequality (Lemma 5.28 for the fixed matrix H := i=1 λXi), and obtain

N−1 h   X i E ≤ E Trace exp λXi + log Eb[ exp(λXN )] , i=1 where Eb is the conditional expectation. We repeat this application of Lemma 5.28 succes- sively for λXN−1, . . . , λX1, to arrive at

N   X h λX i E ≤ Trace exp log E e i . i=1

λXi Step 3: We use Lemma 5.29 to bound E[e ] for each Xi, i = 1,...,N. Step 4: With Step 3 we get E ≤ Trace ( exp (g(λ)Z)) , PN 2 where Z := i=1 E[Xi ]. The trace of exp(g(λ)Z) is a sum of n positive eigenvalues,

2 E ≤ nλm(exp(g(λ)Z)) = n exp(g(λ)λm(Z)) = n exp(g(λ)kZk) = n exp(g(λ)σ ) .

Thus, with Step 1 and Lemma 5.29,

2 P(λm(S) ≥ t) ≤ n exp(−λt + g(λ)σ ) for |λ| < 3/K . The minimum in the exponent is attained for λ = t/(σ2 + 2Kt/3). We finally conclude with

\[
\mathbb{P}(\lambda_{\mathrm{m}}(S) \ge t) \le n\exp\Big(-\frac{t^2/2}{\sigma^2 + Kt/3}\Big),
\]

and repeating our steps for λm(−S) we conclude with the statement of the theorem. 2

6 Basic tools in high-dimensional probability

6.1 Decoupling

Definition 6.1 Let X1,...,Xn be independent real-valued random variables and aij ∈ R, i, j = 1, . . . , n. The random quadratic form

n X T n aijXiXj = X AX = hX, AXi ,X = (X1,...,Xn) ∈ R ,A = (aij) , i,j=1

is called chaos in probability theory. 58 BASICTOOLSINHIGH-DIMENSIONALPROBABILITY

For simplicity, we assume in the following that the random variables $X_i$ have mean zero and unit variance,
\[
\mathbb{E}[X_i] = 0, \quad \mathrm{Var}(X_i) = 1, \quad i=1,\dots,n.
\]
Then
\[
\mathbb{E}[\langle X,AX\rangle] = \sum_{i,j=1}^n a_{ij}\mathbb{E}[X_iX_j] = \sum_{i=1}^n a_{ii} = \mathrm{Trace}(A).
\]
We shall study concentration properties of chaos. This time we need to develop tools to overcome the fact that we have a sum of not necessarily independent random variables. The idea is to use the decoupling technique, that is, to study the random bilinear form
\[
\sum_{i,j=1}^n a_{ij}X_iX_j' = X^{T}AX' = \langle X, AX'\rangle, \tag{6.1}
\]

where $X' = (X_1',\dots,X_n')$ is a random vector independent of $X$ but with the same distribution as $X$. Obviously, the bilinear form in (6.1) is easier to study: e.g., when we condition on $X'$ we simply obtain a linear form in the random vector $X$. The vector $X'$ is called an independent copy of $X$, and conditioning on $X'$,
\[
\langle X,AX'\rangle = \sum_{i=1}^n c_iX_i \qquad \text{with } c_i = \sum_{j=1}^n a_{ij}X_j',
\]
is a random linear form for $X$ depending on the conditioning on the independent copy.

n Lemma 6.2 Let Y and Z be independent random vectors in R such that EY [Y ] = EZ [Z] = 0 (we write indices when we want to stress the expectation for a certain random variable). Then, for every convex function F : Rn → R, one has

E[F (Y )] ≤ E[F (Y + Z)] .

n Proof. Fix y ∈ R and use EZ [Z] = 0 and the convexity to get

F (y) = F (y + EZ [Z]) ≤ EZ [F (y + Z)] .

We choose y = Y and take expectation with respect to Y ,

EY [F (Y )] = EY [F (Y + EZ [Z])] = EY [F (EZ [Y + Z])] ≤ EY ⊗ EZ [F (Y + Z)] .

2

Theorem 6.3 (Decoupling) Let A be an n × n diagonal-free (i.e., diagonal entries vanish) n and X = (X1,...,Xn) ∈ R be a random vector with independent mean-zero coordinates 0 Xi, and X an independent copy of X. Then, for every convex function F : R → R, one has

0 E[F (hX, AXi)] ≤ E[F (4hX, AX i)] . (6.2) BASICTOOLSINHIGH-DIMENSIONALPROBABILITY 59

Proof. The idea is to study partial chaos X aijXiXj (i,j)∈I×Ic with a randomly chosen subset I ⊂ {1, . . . , n}. Let δ1, . . . , δn be independent Bernoulli 1 random variables with P(δi = 0) = P(δi = 1) = 2 . Then set I = {i: δi = 1}. We condition on the random variable X and obtain for i 6= j using aii = 0 and E[δi(1 − δj)] = 1/4, X h X i h X i hX, AXi = aijXiXj = 4Eδ δi(1 − δj)aijXiXj = 4EI aijXiXj , i6=j i6=j (i,j)∈I×Ic where Eδ is expectation with respect to the Bernoulli random variables and EI is expectation with respect to the Bernoulli random variables for I = {i: δi = 1}. Apply F on both sides and use Jensen (w.r.t. to EI ) and Fubini to get h  X i EX [F (hX, AXi)] ≤ EI EX F 4 aijXiXj . (i,j)∈I×Ic There is a realisation of a random set I such that h  X i EX [F (hX, AXi)] ≤ E F 4 aijXiXj . (6.3) (i,j)∈I×Ic

We fix such a realisation of $I$ until the end. We now replace $X_j$, $j \in I^c$, by $X_j'$ on the right hand side of (6.3). It then remains to show that

h  X 0 i R.H.S. of (6.3) ≤ E F 4 aijXiXj , (6.4) (i,j)∈[n]×[n] where [n] = {1, . . . , n}. We now split the argument of the function F on the right hand side of (6.4) into three terms,

X 0 aijXiXj =: Y + Z1 + Z2 , (i,j)∈[n]×[n] with

X 0 X 0 X 0 Y := aijXiXj ,Z1 := aijXiXj ,Z2 := aijXiXj . (i,j)∈I×Ic (i,j)∈I×I (i,j)∈Ic×[n]

0 We now condition on all random variables except (Xj)j∈I and (Xi)i∈Ic . We denote the corre- sponding conditional expectation by Ee. The conditioning already fixes Y . Furthermore, Z1 and Z2 have zero expectation. Applications of Lemma 6.2 leads to

F (4Y ) ≤ Ee[F (4Y + 4Z1 + 4Z2)] , and thus E[F (4Y )] ≤ E[F (4Y + 4Z1 + 4Z2)] . This proves (6.4) and finishes the argument by taking the final expectation with respect to I. 2 60 BASICTOOLSINHIGH-DIMENSIONALPROBABILITY

Theorem 6.4 (Hanson-Wright inequality) Let $X = (X_1,\dots,X_n) \in \mathbb{R}^n$ be a random vector with independent mean-zero sub-Gaussian coordinates and let $A$ be an $n\times n$ matrix. Then, for every $t\ge 0$,
\[
\mathbb{P}\big(|\langle X,AX\rangle - \mathbb{E}[\langle X,AX\rangle]| \ge t\big) \le 2\exp\Big(-c\min\Big\{\frac{t^2}{K^4\|A\|_{\mathrm{F}}^2},\ \frac{t}{K^2\|A\|}\Big\}\Big),
\]
where $K := \max_{1\le i\le n}\{\|X_i\|_{\psi_2}\}$.
We prepare the proof with the following two lemmas.

0 0 Lemma 6.5 (MGF of Gaussian chaos) Let X,X ∼ N(0, 1ln), X and X be independent, and let A be an n × n matrix. Then

0 2 E[ exp (λhX, AX i)] ≤ exp (Cλ kAkF) , for all |λ| ≤ C/kAk , for some absolute constant C > 0.

Proof. We use the singular value decomposition of the matrix A , see Definition 4.9:

\[
A = \sum_{i=1}^n s_iu_iv_i^{T}, \qquad
\langle X,AX'\rangle = \sum_{i=1}^n s_i\langle u_i,X\rangle\langle v_i,X'\rangle = \sum_{i=1}^n s_iY_iY_i',
\]

0 0 0 where Y = (hu1,Xi,..., hun,Xi) ∼ N(0, 1ln) and Y = (hv1,X i,..., hvn,X i) ∼ N(0, 1ln). Independence yields

n 0 Y 0 E[ exp (λhX, AX i)] = E[ exp (λsiYiYi )] . i=1 For each i ∈ {1, . . . , n} we compute (taking the expectation with respect to Y 0)

For each $i \in \{1,\dots,n\}$ we compute (taking the expectation with respect to $Y_i'$ first)
\[
\mathbb{E}[\exp(\lambda s_iY_iY_i')] = \mathbb{E}[\exp(\lambda^2s_i^2Y_i^2/2)] \le \exp(C\lambda^2s_i^2), \qquad \text{provided } \lambda^2s_i^2 \le C.
\]
Thus
\[
\mathbb{E}[\exp(\lambda\langle X,AX'\rangle)] \le \exp\Big(C\lambda^2\sum_{i=1}^n s_i^2\Big), \qquad \text{provided } \lambda^2 \le \frac{C}{\max_{1\le i\le n}\{s_i^2\}}.
\]

Pn 2 2 We conclude with i=1 si = kAkF and max1≤i≤n{si} = kAk. 2 Lemma 6.6 (Comparison) Let X and X0 be independent mean-zero sub-Gaussian random vectors in n with kXk ≤ K and kX0k ≤ K. Furthermore, let Y,Y 0 be independent R ψ2 ψ2 0 normally distributed random vectors Y,Y ∼ N(0, 1ln), and A be an n × n matrix. Then

0 2 0 E[ exp (λhX, AX i)] ≤ E[ exp (CK λhY, AY i)] . BASICTOOLSINHIGH-DIMENSIONALPROBABILITY 61

0 Proof. We condition on X and take the expectation with respect to X (denoted by EX ). Then hX, AX0i is conditionally sub-Gaussian and

0 2 2 0 2 EX [ exp (λhX, AX i)] ≤ exp (Cλ K kAX k2)] , λ ∈ R . We now replace X by Y but still conditioning on X0 (we replace one at a time),

0 2 2 0 2 EY [ exp (µhY, AX i)] = exp (µ K kAX k2/2) , µ ∈ R . √ Choosing µ = 2Cλ, we can match our estimates to get √ 0 0 EX [ exp (λhX, Ax i)] ≤ EY [ exp ( 2CλhY, AX i)] . We can now take the expectation with respect to X0 on both sides and see that we have successfully replaced X by Y . We can now repeat the same procedure for X0 and Y 0 to obtain our statement. 2 Proof of Theorem 6.4. Without loss of generality, K = 1. It suffices to show the one- sided tail estimate. Denote

\[
p = \mathbb{P}\big(\langle X,AX\rangle - \mathbb{E}[\langle X,AX\rangle] \ge t\big).
\]
Write
\[
\langle X,AX\rangle - \mathbb{E}[\langle X,AX\rangle] = \sum_{i=1}^n a_{ii}\big(X_i^2 - \mathbb{E}[X_i^2]\big) + \sum_{i\neq j} a_{ij}X_iX_j,
\]
and thus the problem reduces to estimating the diagonal and off-diagonal sums:
\[
p \le \mathbb{P}\Big(\sum_{i=1}^n a_{ii}(X_i^2-\mathbb{E}[X_i^2]) \ge t/2\Big) + \mathbb{P}\Big(\sum_{i\neq j} a_{ij}X_iX_j \ge t/2\Big) =: p_1 + p_2.
\]

Diagonal sum: The $X_i^2 - \mathbb{E}[X_i^2]$ are independent mean-zero sub-exponential random variables. Thus, using centering (see Exercise 3.2) and Lemma 2.28, we have
\[
\|X_i^2 - \mathbb{E}[X_i^2]\|_{\psi_1} \le C\|X_i^2\|_{\psi_1} \le C\|X_i\|_{\psi_2}^2 \le C.
\]
Then Bernstein's inequality (Exercise 3.3, used with the weights $a_{ii}$ in place of $1/N$) implies that
\[
p_1 \le \exp\Big(-C\min\Big\{\frac{t^2}{\sum_{i=1}^n a_{ii}^2},\ \frac{t}{\max_{1\le i\le n}\{|a_{ii}|\}}\Big\}\Big).
\]
Off-diagonal sum: Set $S := \sum_{i\neq j}a_{ij}X_iX_j$ and let $\lambda>0$. Then
\[
\mathbb{E}[\exp(\lambda S)] \le \mathbb{E}[\exp(4\lambda\langle X,AX'\rangle)] \le \exp(C\lambda^2\|A\|_{\mathrm{F}}^2) \qquad \text{provided } |\lambda| \le C/\|A\|.
\]
Thus
\[
p_2 \le \exp(-\lambda t/2)\,\mathbb{E}[\exp(\lambda S)] \le \exp\big(-\lambda t/2 + C\lambda^2\|A\|_{\mathrm{F}}^2\big).
\]
Optimising over $0 \le \lambda \le C/\|A\|$, we get
\[
p_2 \le \exp\Big(-C\min\Big\{\frac{t^2}{\|A\|_{\mathrm{F}}^2},\ \frac{t}{\|A\|}\Big\}\Big). \qquad\Box
\]
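The concentration of the quadratic form around $\mathrm{Trace}(A)$ can be checked numerically. The following Python/NumPy sketch is only an illustration (not part of the notes); the symmetric Bernoulli coordinates, the dimension and the threshold $2\|A\|_{\mathrm{F}}$ are assumptions chosen for the experiment.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
n, N = 200, 20000
A = rng.standard_normal((n, n))
X = rng.choice([-1.0, 1.0], size=(N, n))          # mean-zero sub-Gaussian coordinates
quad = np.einsum('ki,ij,kj->k', X, A, X)          # <X, A X> for each sample
print("mean of <X,AX>:", quad.mean().round(1), "  Trace(A):", np.trace(A).round(1))
dev = np.abs(quad - np.trace(A))
print("empirical P(|<X,AX> - Tr A| > 2||A||_F):",
      (dev > 2 * np.linalg.norm(A, 'fro')).mean())
\end{verbatim}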

6.2 Concentration for Anisotropic random vectors

We now study anisotropic random vectors, i.e., random vectors of the form BX, where B is a fixed matrix and X is an isotropic random vector.

Exercise 6.7 Let B be an m × n matrix and X be an isotropic random vector in R^n. Show that

E[‖BX‖_2²] = ‖B‖_F² .

Theorem 6.8 (Concentration for random vectors) Let B be an m × n matrix and X = (X_1, ..., X_n) ∈ R^n a random vector with independent mean-zero unit-variance sub-Gaussian coordinates. Then

‖ ‖BX‖_2 − ‖B‖_F ‖_{ψ_2} ≤ CK²‖B‖ ,   K := max_{1≤i≤n} ‖X_i‖_{ψ_2} ,

for some absolute constant C > 0.

Remark 6.9 (a) We have, according to Exercise 6.7,

E[‖BX‖_2²] = ‖B‖_F² .

(b) Recall that we have already proved a version with B = 1l_n in Theorem 3.1, namely,

‖ ‖X‖_2 − √n ‖_{ψ_2} ≤ CK² .

Proof of Theorem 6.8. Without loss of generality we assume that K ≥ 1. We write A = B^T B and apply the Hanson-Wright inequality (Theorem 6.4) with the following identities and estimates.

⟨X, AX⟩ = ‖BX‖_2² ,   E[⟨X, AX⟩] = ‖B‖_F² ,   ‖A‖ = ‖B‖² ,
‖A‖_F = ‖B^T B‖_F ≤ ‖B^T‖ ‖B‖_F = ‖B‖ ‖B‖_F .

Thus, for every u ≥ 0,

P( |‖BX‖_2² − ‖B‖_F²| ≥ u ) ≤ 2 exp( − (C/K⁴) min{ u²/(‖B‖²‖B‖_F²) , u/‖B‖² } ) .

Setting u = ε‖B‖_F² with ε ≥ 0, we obtain

P( |‖BX‖_2² − ‖B‖_F²| ≥ ε‖B‖_F² ) ≤ 2 exp( − C min{ε², ε} ‖B‖_F²/(K⁴‖B‖²) ) .

We now set δ² = min{ε², ε}, or, equivalently, ε = max{δ, δ²}, and recall our reasoning in the proof of Theorem 3.1, namely, that for z ≥ 0 the bound |z − 1| ≥ δ implies |z² − 1| ≥ max{δ, δ²}. Hence

|‖BX‖_2 − ‖B‖_F| ≥ δ‖B‖_F   ⇒   |‖BX‖_2² − ‖B‖_F²| ≥ ε‖B‖_F² .

Thus

P( |‖BX‖_2 − ‖B‖_F| ≥ δ‖B‖_F ) ≤ 2 exp( − Cδ²‖B‖_F²/(K⁴‖B‖²) ) ,

and setting t = δ‖B‖_F, we obtain the tail estimate

P( |‖BX‖_2 − ‖B‖_F| ≥ t ) ≤ 2 exp( − Ct²/(K⁴‖B‖²) ) . □
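A quick numerical check of Theorem 6.8 (again only an illustration, with standard Gaussian coordinates and a randomly generated matrix B, both arbitrary choices): the norm ‖BX‖_2 should fluctuate around ‖B‖_F on a scale of order ‖B‖, independently of the dimensions.

```python
# Numerical sanity check of Theorem 6.8: ||BX||_2 concentrates around ||B||_F
# with fluctuations of order ||B||.  For this sketch the coordinates of X are
# standard Gaussian and B is a randomly generated matrix.
import numpy as np

rng = np.random.default_rng(1)
m, n, trials = 50, 400, 10_000

B = rng.standard_normal((m, n)) / np.sqrt(n)
frob = np.linalg.norm(B, 'fro')                  # ||B||_F
op = np.linalg.norm(B, 2)                        # ||B|| (operator norm)

X = rng.standard_normal((trials, n))             # independent mean-zero unit-variance coordinates
norms = np.linalg.norm(X @ B.T, axis=1)          # ||BX||_2 for each trial

print("||B||_F             :", frob)
print("mean of ||BX||_2    :", norms.mean())     # close to ||B||_F
print("std  of ||BX||_2    :", norms.std())      # of order ||B||, dimension-free
print("operator norm ||B|| :", op)
```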

6.3 Symmetrisation

Definition 6.10 (symmetric random variables) A real-valued random variable X is called symmetric if X and −X have the same distribution.

Exercise 6.11 Let X be a real-valued random variable independent of some symmetric Bernoulli random variable ε, i.e., P(ε = −1) = P(ε = 1) = 1/2. Show the following statements.

(a) εX and ε|X| are symmetric random variables and εX and ε|X| have the same distribu- tion.

(b) If X is symmetric, then the distributions of εX and ε|X| equal the distribution of X.

(c) Suppose X′ is an independent copy of X; then X − X′ is a symmetric random variable.

Lemma 6.12 (Symmetrisation) Let X_1, ..., X_N be independent mean-zero random vectors in a normed space (E, ‖·‖) and let ε_1, ..., ε_N be independent symmetric Bernoulli random variables (that is, they are independent of each other and of all the X_i, i = 1, ..., N). Then

(1/2) E[ ‖ ∑_{i=1}^N ε_i X_i ‖ ] ≤ E[ ‖ ∑_{i=1}^N X_i ‖ ] ≤ 2 E[ ‖ ∑_{i=1}^N ε_i X_i ‖ ] .

Proof. Upper bound: Let (X_i′) be an independent copy of the random vector (X_i). Since ∑_{i=1}^N X_i′ has zero mean,

p := E[ ‖ ∑_{i=1}^N X_i ‖ ] ≤ E[ ‖ ∑_{i=1}^N X_i − ∑_{i=1}^N X_i′ ‖ ] = E[ ‖ ∑_{i=1}^N (X_i − X_i′) ‖ ] ,

where the inequality stems from an application of Lemma 6.2, namely, if E[Z] = 0 then E[‖Y‖] ≤ E[‖Y + Z‖].

Since all the X_i − X_i′ are symmetric random vectors, they have the same distribution as ε_i(X_i − X_i′), see Exercise 6.11. An application of the triangle inequality and our assumptions concludes the upper bound:

p ≤ E[ ‖ ∑_{i=1}^N ε_i(X_i − X_i′) ‖ ] ≤ E[ ‖ ∑_{i=1}^N ε_i X_i ‖ ] + E[ ‖ ∑_{i=1}^N ε_i X_i′ ‖ ] = 2 E[ ‖ ∑_{i=1}^N ε_i X_i ‖ ] .

Lower bound: By conditioning on εi and the triangle inequality,

E[ ‖ ∑_{i=1}^N ε_i X_i ‖ ] ≤ E[ ‖ ∑_{i=1}^N ε_i(X_i − X_i′) ‖ ] = E[ ‖ ∑_{i=1}^N (X_i − X_i′) ‖ ]
≤ E[ ‖ ∑_{i=1}^N X_i ‖ ] + E[ ‖ ∑_{i=1}^N X_i′ ‖ ] = 2 E[ ‖ ∑_{i=1}^N X_i ‖ ] . □
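The following sketch illustrates the symmetrisation inequality of Lemma 6.12 numerically. The distribution of the X_i (centred exponential vectors), the dimensions and the number of trials are arbitrary choices; the printed ratio should lie between 1/2 and 2.

```python
# Numerical illustration of the symmetrisation inequality (Lemma 6.12) for the
# Euclidean norm.  The X_i are centred exponential random vectors (an arbitrary
# choice); the ratio of the two expectations must lie between 1/2 and 2.
import numpy as np

rng = np.random.default_rng(2)
N, d, trials = 30, 5, 20_000

X = rng.exponential(scale=1.0, size=(trials, N, d)) - 1.0     # mean-zero vectors X_1, ..., X_N
eps = rng.choice([-1.0, 1.0], size=(trials, N, 1))            # symmetric Bernoulli signs

plain = np.linalg.norm(X.sum(axis=1), axis=1).mean()          # E|| sum_i X_i ||
signed = np.linalg.norm((eps * X).sum(axis=1), axis=1).mean() # E|| sum_i eps_i X_i ||

print("E||sum X_i||       :", plain)
print("E||sum eps_i X_i|| :", signed)
print("ratio (in [1/2, 2]):", plain / signed)
```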

7 Random Processes

Definition 7.1 A random process is a collection X := (X_t)_{t∈T} of random variables X_t defined on the same probability space, indexed by the elements t of some set T.

Example 7.2 (a) If T = {1, ..., n}, then X = (X_1, ..., X_n) ∈ R^n is a random vector.

(b) If T = N, then X = (X_n)_{n∈N} with

X_n = ∑_{i=1}^n Z_i ,   Z_i independent and identically distributed,

is the discrete-time random walk.

(c) When the dimension of the index set is greater than one, e.g., T ⊂ Rn or T ⊂ Zn, we often call the process a random field (Xt)t∈T . ♣

Definition 7.3 Let (Xt)t∈T be a random process with E[Xt] = 0 for all t ∈ T . The covari- ance function of the process is defined as

Σ(t, s) := cov(Xt,Xs) , t, s ∈ T.

The increments of the process are defined as

d(t, s) := ‖X_t − X_s‖_{L²} = ( E[(X_t − X_s)²] )^{1/2} ,   t, s ∈ T .

Definition 7.4 (Gaussian process) (a) A random process (X_t)_{t∈T} is called a Gaussian process if, for any finite subset T_0 ⊂ T, the random vector (X_t)_{t∈T_0} has a multivariate normal distribution. Equivalently, (X_t)_{t∈T} is a Gaussian process if every finite linear combination

∑_{t∈T_0} a_t X_t ,   a_t ∈ R ,

is a normally distributed random variable.

(b) Suppose T ⊂ R^n, let Y ∼ N(0, 1l_n), and define

X_t := ⟨Y, t⟩ ,   t ∈ T ⊂ R^n .

We call the random process (X_t)_{t∈T} the canonical Gaussian process in R^n.

To compute the increments of the canonical Gaussian process, recall that ⟨Y, t⟩ ∼ N(0, ‖t‖_2²) for any t ∈ R^n. One can then show (easy exercise) that the increments are

d(t, s) = ‖X_t − X_s‖_{L²} = ‖t − s‖_2 ,   t, s ∈ R^n ,

where ‖·‖_2 denotes the Euclidean norm in R^n.
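For illustration, this increment formula is easy to check by simulation; in the sketch below the points t, s are sampled at random and the dimension and trial count are arbitrary.

```python
# Simulation check of d(t, s) = ||t - s||_2 for the canonical Gaussian process
# X_t = <Y, t>, Y ~ N(0, 1l_n).  The points t, s and the dimension are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 200_000

t = rng.standard_normal(n)
s = rng.standard_normal(n)

Y = rng.standard_normal((trials, n))
increments = Y @ (t - s)                    # realisations of X_t - X_s = <Y, t - s>

print("empirical L2-norm of X_t - X_s:", np.sqrt(np.mean(increments ** 2)))
print("||t - s||_2                   :", np.linalg.norm(t - s))
```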

Lemma 7.5 Let X ∈ R^n be a mean-zero Gaussian random vector. Then there exist points t_1, ..., t_n ∈ R^n such that X has the same distribution as (⟨Y, t_i⟩)_{i=1,...,n}, with Y ∼ N(0, 1l_n).

Proof. Left as an exercise to the reader. □

Theorem 7.6 (Slepian’s inequality) Let (Xt)t∈T and (Yt)t∈T be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T , we have

(i) E[X_t²] = E[Y_t²] ,   (ii) E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²] . (7.1)

Then, for every τ ∈ R, we have

P( sup_{t∈T} X_t ≥ τ ) ≤ P( sup_{t∈T} Y_t ≥ τ ) , (7.2)

and, consequently,

E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ] .

Whenever (7.2) holds, we say that the process (X_t)_{t∈T} is stochastically dominated by the process (Y_t)_{t∈T}. The proof of Theorem 7.6 follows by combining the two versions of Slepian's inequality which we discuss below. To prepare these statements we first introduce the method of Gaussian interpolation. Suppose T is finite, say |T| = n. Let X = (X_t)_{t∈T} and Y = (Y_t)_{t∈T} be two Gaussian random vectors (without loss of generality we may assume that X and Y are independent). We define the Gaussian interpolation as

Z(u) := √u X + √(1 − u) Y ,   u ∈ [0, 1] . (7.3)

It is easy to see (see the following exercise) that the covariance matrix of the interpolation interpolates linearly between the covariance matrices of X and Y.

Exercise 7.7 Show that

Σ(Z(u)) = uΣ(X) + (1 − u)Σ(Y) ,   u ∈ [0, 1] .

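The identity of Exercise 7.7 is easy to check numerically; in the sketch below the covariance matrices Σ(X) and Σ(Y) are generated at random and the value u = 0.3 is an arbitrary choice.

```python
# Empirical check of Exercise 7.7: the covariance matrix of the interpolation
# Z(u) = sqrt(u) X + sqrt(1-u) Y equals u*Sigma(X) + (1-u)*Sigma(Y).
# The covariance matrices below are generated at random; u = 0.3 is arbitrary.
import numpy as np

rng = np.random.default_rng(4)
n, trials, u = 3, 500_000, 0.3

AX = rng.standard_normal((n, n)); SigmaX = AX @ AX.T    # Sigma(X), positive semi-definite
AY = rng.standard_normal((n, n)); SigmaY = AY @ AY.T    # Sigma(Y)

X = rng.multivariate_normal(np.zeros(n), SigmaX, size=trials)
Y = rng.multivariate_normal(np.zeros(n), SigmaY, size=trials)
Z = np.sqrt(u) * X + np.sqrt(1.0 - u) * Y

print("empirical covariance of Z(u):\n", np.cov(Z, rowvar=False))
print("u*Sigma(X) + (1-u)*Sigma(Y):\n", u * SigmaX + (1.0 - u) * SigmaY)
```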

For a function f : R^n → R, we shall study how the expectation E[f(Z(u))] varies with u ∈ [0, 1]. Suppose, for example, that f(x) := 1l{max_{1≤i≤n} x_i < τ}. Then one can show that E[f(Z(u))] increases with u, leading to

E[f(Z(1))] ≥ E[f(Z(0))] , and hence

P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ) .

The following three lemmas collect basic facts about Gaussian random variables; their proofs are left as an exercise (homework).

Lemma 7.8 (Gaussian integration by parts) Suppose X ∼ N(0, 1). Then, for any differentiable function f : R → R,

E[f′(X)] = E[Xf(X)] .

Proof. Exercise for the reader. For a solution see Lemma 7.2.3 in [Ver18]. □

Similarly, X ∼ N(0, σ²) with σ > 0 implies E[Xf(X)] = σ²E[f′(X)].
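Gaussian integration by parts is also easy to verify by simulation; the test function f(x) = sin(x) below is an arbitrary choice.

```python
# Monte Carlo check of Gaussian integration by parts (Lemma 7.8):
# E[X f(X)] = E[f'(X)] for X ~ N(0,1).  Here f(x) = sin(x), f'(x) = cos(x);
# both averages should be close to exp(-1/2) ~ 0.6065.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal(2_000_000)

print("E[X sin(X)] :", np.mean(X * np.sin(X)))
print("E[cos(X)]   :", np.mean(np.cos(X)))
```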

Lemma 7.9 (Gaussian integration by parts) Suppose X ∼ N(0, Σ), X ∈ Rn, Σ an n × n symmetric positive semi-definite matrix. Then, for any differentiable function f : Rn → R,

E[Xf(X)] = ΣE[∇f(X)] .

Proof. Exercise for the reader (homework Sheet 4). □

Lemma 7.10 (Gaussian interpolation) Consider two independent Gaussian random vectors X ∼ N(0, Σ^X) and Y ∼ N(0, Σ^Y) in R^n. Define the interpolated Gaussian random vector

Z(u) := √u X + √(1 − u) Y ,   u ∈ [0, 1] . (7.4)

Then for any twice-differentiable function f : R^n → R, we have

d/du E[f(Z(u))] = (1/2) ∑_{i,j=1}^n (Σ^X_{ij} − Σ^Y_{ij}) E[ ∂²f/∂x_i∂x_j (Z(u)) ] . (7.5)

Proof. Exercise for the reader (homework Sheet 4). For a solution see Lemma 7.2.7 in [Ver18]. □

We now prove a first version of Slepian's inequality (Theorem 7.6).

Lemma 7.11 (Slepian's inequality, functional form) Let X, Y ∈ R^n be two mean-zero Gaussian random vectors. Assume that for all i, j = 1, ..., n,

(i) E[X_i²] = E[Y_i²] ,   (ii) E[(X_i − X_j)²] ≤ E[(Y_i − Y_j)²] , (7.6)

and let f : R^n → R be twice-differentiable such that

∂²f/∂x_i∂x_j (x) ≥ 0   for all i ≠ j .

Then

E[f(X)] ≥ E[f(Y)] .

Proof. Expanding the square in (ii) and using (i), we obtain Σ^X_{ii} = Σ^Y_{ii} and Σ^X_{ij} ≥ Σ^Y_{ij} for i ≠ j. We assume without loss of generality that X and Y are independent. Since the diagonal terms in (7.5) vanish and (Σ^X_{ij} − Σ^Y_{ij}) E[∂²f/∂x_i∂x_j (Z(u))] ≥ 0 for i ≠ j, Lemma 7.10 shows that

d/du E[f(Z(u))] ≥ 0 ,

and hence E[f(Z(u))] increases in u. Thus E[f(Z(1))] = E[f(X)] ≥ E[f(Y)] = E[f(Z(0))]. □

We can now prove Theorem 7.6 with the following result for Gaussian vectors.

Theorem 7.12 Let X and Y be two mean-zero Gaussian random vectors in R^n as in Lemma 7.11. Then, for every τ ∈ R, we have

P( max_{1≤i≤n} X_i ≥ τ ) ≤ P( max_{1≤i≤n} Y_i ≥ τ ) .

Consequently,

E[ max_{1≤i≤n} X_i ] ≤ E[ max_{1≤i≤n} Y_i ] .

Proof. The key idea is to use Lemma 7.11 for a suitable approximation of the maximum. Let h : R → [0, 1] be a twice-differentiable, non-increasing approximation of the indicator function of the interval (−∞, τ), that is, h(t) ≈ 1l_{(−∞,τ)}(t), t ∈ R. Define

f(x) := h(x_1) ⋯ h(x_n)   for x = (x_1, ..., x_n) ∈ R^n .

The second partial derivatives, for i ≠ j,

∂²f/∂x_i∂x_j (x) = h′(x_i) h′(x_j) ∏_{k≠i,j} h(x_k) ,

are non-negative, since h′ ≤ 0 and h ≥ 0. It follows from Lemma 7.11 that E[f(X)] ≥ E[f(Y)]. Passing to the limit in the approximation, we obtain

P( max_{1≤i≤n} X_i < τ ) ≥ P( max_{1≤i≤n} Y_i < τ ) ,

which is the first claim. Integrating the tail bound over τ ∈ R yields

E[ max_{1≤i≤n} X_i ] ≤ E[ max_{1≤i≤n} Y_i ] ,

and the statement follows. □

The following theorem improves Slepian's inequality by using a different approximation in the proof.

Theorem 7.13 (Sudakov-Fernique inequality) Let (X_t)_{t∈T} and (Y_t)_{t∈T} be two mean-zero Gaussian processes. Assume that, for all t, s ∈ T, we have

(i) E[(X_t − X_s)²] ≤ E[(Y_t − Y_s)²] .

Then

E[ sup_{t∈T} X_t ] ≤ E[ sup_{t∈T} Y_t ] .

Proof. We shall prove the statement for Gaussian random vectors X, Y ∈ R^n with the help of Theorem 7.12. The idea is that this time we use an approximation of the maximum itself and not of the indicator function. For β > 0, define

f(x) := (1/β) log ∑_{i=1}^n e^{βx_i} ,   x ∈ R^n .

One can easily show that

f(x) −→ max_{1≤i≤n} x_i   as β → ∞ .

Inserting this function into the Gaussian interpolation formula (7.5), we obtain that

d/du E[f(Z(u))] ≤ 0 ,

and we conclude with our statement. □
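The smooth-max approximation used in the preceding proof can be visualised with a few lines of code; the vector x and the values of β below are arbitrary. Note that a numerically stable evaluation shifts by max_i x_i before exponentiating.

```python
# Illustration of the smooth maximum f(x) = (1/beta) log sum_i exp(beta x_i)
# used in the proof of the Sudakov-Fernique inequality: f(x) -> max_i x_i
# as beta -> infinity.  The vector x and the values of beta are arbitrary.
import numpy as np

x = np.array([0.3, -1.2, 2.5, 2.4, 0.0])
for beta in [1.0, 5.0, 25.0, 125.0]:
    # shift by max(x) for stability: f(x) = max(x) + (1/beta) log sum exp(beta (x - max(x)))
    f = x.max() + np.log(np.sum(np.exp(beta * (x - x.max())))) / beta
    print(f"beta={beta:7.1f}  f(x)={f:.5f}  max_i x_i={x.max():.5f}")
```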

We now combine features of the index set with the random process.

Definition 7.14 (Canonical metric) Suppose X = (X_t)_{t∈T} is a random process with index set T. We define the canonical metric of the process by

d(t, s) := ‖X_t − X_s‖_{L²} = ( E[(X_t − X_s)²] )^{1/2} ,   t, s ∈ T .

We can now ask whether the expectation E[ sup_{t∈T} X_t ] can be estimated using the geometry of the index set, in particular using covering numbers of the metric space (T, d) of a given random process. Via this approach we obtain a lower bound on the expectation of the supremum.

Theorem 7.15 (Sudakov's minorisation inequality) Let X = (X_t)_{t∈T} be a mean-zero Gaussian process. Then, for any ε ≥ 0, we have

E[ sup_{t∈T} X_t ] ≥ Cε √(log N(T, d, ε)) ,

where d is the canonical metric of the process, C > 0 is an absolute constant, and N(T, d, ε) is the covering number of T (recall Definition 4.1).

Proof. Assume that N := N(T, d, ε) is finite; when this is not the case, one can show that the expectation of the supremum is infinite (we skip this step). Let M be a maximal ε-separated subset of T. Then M is an ε-net according to Lemma 4.2, and thus |M| ≥ N. It suffices to show

E[ sup_{t∈M} X_t ] ≥ Cε √(log N) . (7.7)

To show (7.7), we compare the process (X_t)_{t∈M} with the simpler process (Y_t)_{t∈M},

Y_t := (ε/√2) G_t ,   G_t ∼ N(0, 1) , with G_t independent of G_s for all t ≠ s .

For fixed t, s ∈ M, t ≠ s, we have

E[(X_t − X_s)²] = d(t, s)² ≥ ε² ,

while

E[(Y_t − Y_s)²] = (ε²/2) E[(G_t − G_s)²] = ε² .

Thus

E[(X_t − X_s)²] ≥ E[(Y_t − Y_s)²]   for all t, s ∈ M .

We now obtain with Theorem 7.13,

E[ sup_{t∈M} X_t ] ≥ E[ sup_{t∈M} Y_t ] = (ε/√2) E[ max_{t∈M} G_t ] ≥ Cε √(log N) ,

where we used Proposition 7.16 below. □

Proposition 7.16 Let Y_i ∼ N(0, 1), i = 1, ..., N, be independent normally distributed real-valued random variables. Then the following holds.

(a) E[ max_{1≤i≤N} Y_i ] ≤ √(2 log N) .

(b) E[ max_{1≤i≤N} |Y_i| ] ≤ √(2 log(2N)) .

(c) E[ max_{1≤i≤N} Y_i ] ≥ C √(2 log N)   for some absolute constant C > 0 .

Proof. (a) Let β > 0. Using Jensen’s inequality, we obtain

E[ max_{1≤i≤N} Y_i ] = (1/β) E[ log e^{β max_{1≤i≤N} Y_i} ] ≤ (1/β) E[ log ∑_{i=1}^N e^{βY_i} ] ≤ (1/β) log ∑_{i=1}^N E[e^{βY_i}]

= (1/β) log( N e^{β²/2} ) = β/2 + (log N)/β .

The claim is proved by taking β = √(2 log N).

(b) We repeat the steps in (a),

E[ max_{1≤i≤N} |Y_i| ] = ⋯ ≤ (1/β) log ∑_{i=1}^N E[ e^{βY_i} + e^{−βY_i} ] = β/2 + (log(2N))/β .

Taking β = √(2 log(2N)), we conclude with the statement.

(c) The lower bound is slightly more involved and needs some preparation:

(i) The Yi’s are symmetric (Gaussian), and hence, by symmetry,

E[ max_{1≤i,j≤N} |Y_i − Y_j| ] = E[ max_{1≤i,j≤N} (Y_i − Y_j) ] = 2 E[ max_{1≤i≤N} Y_i ] .

(ii) For every k ∈ {1,...,N}, using (i),

E[ max_{1≤i≤N} Y_i ] ≤ E[ max_{1≤i≤N} |Y_i| ] ≤ E[|Y_k|] + E[ max_{1≤i,j≤N} |Y_i − Y_j| ]
= E[|Y_k|] + 2 E[ max_{1≤i≤N} Y_i ] .

To obtain a lower bound we exploit the fact that our Gaussian random variables Y_i are independent and identically distributed. Then, for every δ > 0,

E[ max_{1≤i≤N} |Y_i| ] ≥ ∫_0^δ [ 1 − (1 − P(|Y_1| > t))^N ] dt ≥ δ [ 1 − (1 − P(|Y_1| > δ))^N ] .

Now, we obtain with the lower tail bound of the normal distribution (see Proposition 2.1),

P(|Y_1| > δ) = √(2/π) ∫_δ^∞ exp(−t²/2) dt ≥ √(2/π) exp(−(δ + 1)²/2) .

Now we choose δ = √(log N) with N large enough so that

P(|Y_1| > δ) ≥ 1/N ,

and hence

E[ max_{1≤i≤N} |Y_i| ] ≥ δ [ 1 − (1 − 1/N)^N ] ≥ δ (1 − 1/e) .

We conclude with statement (c) using

E[ max_{1≤i≤N} |Y_i| ] ≤ E[|Y_1|] + 2 E[ max_{1≤i≤N} Y_i ] . □
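Proposition 7.16 is easily illustrated by simulation: the empirical mean of max_{1≤i≤N} Y_i grows like √(2 log N). The number of trials and the values of N below are arbitrary.

```python
# Monte Carlo illustration of Proposition 7.16: for N independent standard
# Gaussians the expected maximum is of order sqrt(2 log N).
import numpy as np

rng = np.random.default_rng(6)
trials = 2000
for N in [10, 100, 1000, 10000]:
    Y = rng.standard_normal((trials, N))
    emp = Y.max(axis=1).mean()
    print(f"N={N:6d}  E[max Y_i] ~ {emp:.3f}  sqrt(2 log N) = {np.sqrt(2 * np.log(N)):.3f}")
```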

Definition 7.17 The Gaussian width of a subset T ⊂ R^n is defined as

W(T) := E[ sup_{x∈T} ⟨Y, x⟩ ]   with Y ∼ N(0, 1l_n) .

Exercise 7.18 Suppose X_i, i = 1, ..., N, are sub-Gaussian random variables, and K = max_{1≤i≤N} ‖X_i‖_{ψ_2}. Show that, for any N ≥ 2,

E[ max_{1≤i≤N} |X_i| ] ≤ CK √(log N) .

We summarise a few simple properties of the Gaussian width. We skip the proof, as it involves mostly elementary properties of the norm and the Minkowski sum.

Proposition 7.19 (Properties of Gaussian width) (a) W(T ) is finite ⇔ T is bounded.

(b) W(T + S) = W(T) + W(S) for all S, T ⊂ R^n.

(c) (1/√(2π)) diam(T) ≤ W(T) ≤ (√n/2) diam(T) .
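As a final illustration, the Gaussian width of the Euclidean unit ball B_2^n can be estimated by Monte Carlo: here sup_{x∈B_2^n} ⟨Y, x⟩ = ‖Y‖_2, so W(B_2^n) = E[‖Y‖_2] ≈ √n, which is consistent with the two bounds in Proposition 7.19(c) (the diameter equals 2). The dimensions and the trial count are arbitrary.

```python
# Monte Carlo estimate of the Gaussian width of the Euclidean unit ball B_2^n,
# compared with the bounds of Proposition 7.19(c):
#   diam/sqrt(2 pi) <= W(T) <= (sqrt(n)/2) * diam,   diam(B_2^n) = 2.
import numpy as np

rng = np.random.default_rng(7)
trials = 100_000
for n in [2, 10, 100]:
    Y = rng.standard_normal((trials, n))
    width = np.linalg.norm(Y, axis=1).mean()       # sup_{||x||_2 <= 1} <Y, x> = ||Y||_2
    lower = 2.0 / np.sqrt(2.0 * np.pi)             # lower bound of Proposition 7.19(c)
    upper = np.sqrt(n)                             # upper bound of Proposition 7.19(c)
    print(f"n={n:4d}  W(B_2^n) ~ {width:.3f}  bounds: [{lower:.3f}, {upper:.3f}]")
```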

References

[Dur19] R. Durrett. Probability: Theory and Examples. Cambridge University Press, fifth ed., 2019.

[Geo12] H.-O. Georgii. Stochastics. De Gruyter Textbook. De Gruyter, 2nd rev. and ext. ed., 2012.

[Ver18] R. Vershynin. High-Dimensional Probability. Cambridge University Press, 2018.

Index

ε-net, 36
ε-separated, 36
Bernstein condition, 25
canonical Gaussian process in R^n, 65
canonical metric of the process, 68
centred moment generating function, 15
Chaos, 57
covariance function, 64
covariance matrix, 7, 30
covering number, 36
cumulative distribution function (CDF), 3
Decoupling, 58
empirical covariance, 42
empirical measure, 42
Gamma function, 17
Gaussian interpolation, 65
Gaussian process, 65
Gaussian width, 70
global Lipschitz continuity, 46
Hamming cube, 38
Hamming distance, 38
Herbst argument, 44
increments, 64
isotropic, 30
Jensen's inequality, 4
Kullback-Leibler divergence, 44
level sets, 51
Lipschitz continuous, 46
Minkowski sum, 37
moment generating function (MGF), 3
Multivariate Normal / Gaussian distribution, 33
normal distribution, 7
operator norm, 38
packing number, 36
Rademacher function, 25
random field, 64
random process, 64
relative entropy, 44
second moment matrix, 30
separately convex, 46
singular values, 38
spherically distributed, 32
Stirling's formula, 17
stochastically dominated, 65
Sub-exponential norm, 23
Sub-exponential random variable, 22
Sub-exponential random variable, second definition, 24
Sub-exponential random variables, first definition, 22
Sub-Gaussian - first definition, 18
Sub-Gaussian - second definition, 19
Sub-Gaussian properties, 15
sub-level sets, 51
symmetric, 63
symmetric Bernoulli distribution in R^n, 32
tails of the normal distribution, 9
unit sphere, 32
Young's inequality, 23