2 A few good inequalities
2.1 Tail bounds and concentration
2.2 Moment generating functions
2.3 Orlicz norms
2.4 Subgaussian tails
2.5 Bennett's inequality
2.6 Bernstein's inequality
2.7 Subexponential distributions?
2.8 The Hanson-Wright inequality for subgaussians
2.9 Problems
2.10 Notes

Printed: 27 November 2015

version: 20nov15 Mini-empirical © David Pollard

Chapter 2. A few good inequalities

Section 2.1 introduces exponential tail bounds. Section 2.2 introduces the method for bounding tail probabilities using moment generating functions. Section 2.3 describes analogs of the usual Lp norms, which have proved useful in empirical process theory. Section 2.4 discusses subgaussianity, a most useful extension of the gaussianity property. Section 2.5 discusses the Bennett inequality for sums of independent random variables that are bounded above by a constant. Section 2.6 shows how the Bennett inequality implies one form of the Bernstein inequality, then discusses extensions to unbounded summands. Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution. Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration

Sums of independent random variables are often approximately normally distributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.


We know a lot about the normal distribution. For example, if W has a N(µ, σ²) distribution then (Feller, 1968, Section 7.1)

    (1/x − 1/x³) e^{−x²/2}/√(2π) ≤ P{W ≥ µ + σx} ≤ (1/x) e^{−x²/2}/√(2π) for all x > 0.

Clearly the inequalities are useful only for larger x: as x decreases to zero the lower bound goes to −∞ and the upper bound goes to +∞. For many purposes the exp(−x²/2) factor matters the most. Indeed, the simpler tail bound P{W ≥ µ + σx} ≤ exp(−x²/2) for x ≥ 0 often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the distribution concentrates most of its probability mass in a small region, such as

    P{|W − µ| ≥ σx} ≤ 2e^{−x²/2} for all x ≥ 0,

a concentration inequality. Similar-looking tail bounds exist for random variables that are not exactly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

<1>    P{X ≥ x + PX} ≤ e^{−g(x)} for x ≥ 0,

with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3). For example, suppose S = X_1 + ··· + X_n, a sum of independent random variables with PX_i = 0 and each X_i bounded above by some constant b. The Bernstein inequality (Section 2.6) tells us that

    P{S ≥ x√V} ≤ exp( −(x²/2)/(1 + bx/(3√V)) ) for x ≥ 0,

where V is any upper bound for var(S). If each X_i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x√V}. For x much smaller than √V/b the exponent in the Bernstein inequality behaves like −x²/2; for much larger x the exponent behaves like a negative multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the inequality gives a subgaussian bound for x not too big and a subexponential bound for very large x.
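These normal tail bounds are easy to confirm numerically. A minimal sketch (the helper names are mine), using the stdlib math.erfc for the exact N(0, 1) upper tail:

```python
import math

def normal_upper_tail(x):
    # P{Z >= x} for Z ~ N(0,1), via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def feller_bounds(x):
    # the lower and upper bounds of the displayed sandwich (Feller 1968, Section 7.1)
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return (1.0 / x - 1.0 / x**3) * phi, phi / x

for x in [0.5, 1.0, 2.0, 4.0]:
    lo, hi = feller_bounds(x)
    assert lo <= normal_upper_tail(x) <= hi
    # the cruder bound exp(-x^2/2) also holds for x >= 0
    assert normal_upper_tail(x) <= math.exp(-x * x / 2.0)
```

For x = 0.5 the lower bound is negative, illustrating the remark that the sandwich is useful only for larger x.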


2.2 Moment generating functions

The most common way to get a bound like <1> uses the moment generating function, M(θ) := Pe^{θX}. From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

<2>    P{X ≥ w} ≤ inf_{θ≥0} Pe^{θ(X−w)} = inf_{θ≥0} e^{−θw} M(θ).

<3> Example. If X has a N(µ, σ²) distribution then M(θ) = exp(θµ + σ²θ²/2) is finite for all real θ. For x ≥ 0 inequality <2> gives

    P{X ≥ µ + σx} ≤ inf_{θ≥0} exp( −θ(µ + σx) + θµ + θ²σ²/2 ) = exp(−x²/2) for all x ≥ 0,

the minimum being achieved by θ = x/σ. Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,

    P{X ≤ µ − σx} ≤ exp(−x²/2) for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e^{−x²/2}.

Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.
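The minimization in Example <3> can be checked numerically; the sketch below (names mine) compares a grid search over θ with the closed form at θ = x/σ:

```python
import math

def chernoff_bound(mu, sigma, x, theta):
    # exp(-theta*(mu + sigma*x)) * M(theta), for M(theta) = exp(theta*mu + sigma^2 theta^2/2)
    return math.exp(-theta * (mu + sigma * x) + theta * mu + theta**2 * sigma**2 / 2.0)

mu, sigma, x = 1.0, 2.0, 1.5
best = min(chernoff_bound(mu, sigma, x, k * 0.001) for k in range(10001))
# the infimum over theta >= 0 equals exp(-x^2/2), attained at theta = x/sigma
assert abs(best - math.exp(-x * x / 2.0)) < 1e-4
assert abs(chernoff_bound(mu, sigma, x, x / sigma) - math.exp(-x * x / 2.0)) < 1e-12
```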

Exactly the same arguments work for any random variable whose moment generating function is bounded above by exp(θµ + τ²θ²/2), for all real θ and some finite constants µ and τ. This property essentially defines the subgaussian distributions, which are discussed in Section 2.4. The function L(θ) := log M(θ) is called the cumulant-generating function. The upper bound in <2> is often rewritten as

    e^{−L*(w)} where L*(w) := sup_{θ≥0}( θw − L(θ) ).

Remark. As Boucheron et al. (2013, Section 2.2) and others have noted, the function L* is the Fenchel-Legendre transform of L, which is also known as the conjugate function (Boyd and Vandenberghe, 2004, Section 3.3).


Clearly L(0) = 0. The Hölder inequality shows that L is convex: if θ is a convex combination α_1θ_1 + α_2θ_2 then

    e^{L(θ)} = M(α_1θ_1 + α_2θ_2) = P( e^{α_1θ_1X} e^{α_2θ_2X} ) ≤ (Pe^{θ_1X})^{α_1} (Pe^{θ_2X})^{α_2} = e^{α_1L(θ_1) + α_2L(θ_2)}.

More precisely (Problem [1]), the set J = {θ : M(θ) < ∞} is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy distribution, M(θ) = ∞ for all real θ ≠ 0; for the standard exponential distribution (density e^{−x}{x > 0} with respect to Lebesgue measure),

    M(θ) = (1 − θ)^{−1} for θ < 1, and M(θ) = ∞ otherwise.

Problem [1] also justifies the interchange of differentiation and expectation on the interior of J, which leads to

    L'(θ) = M'(θ)/M(θ) = P( X e^{θX}/M(θ) )
    L''(θ) = M''(θ)/M(θ) − ( M'(θ)/M(θ) )² = P( X² e^{θX}/M(θ) ) − ( P( X e^{θX}/M(θ) ) )².

If we define a probability measure P_θ by its density, dP_θ/dP = e^{θX}/M(θ), then the derivatives of L can be re-expressed using moments of X under P_θ:

    L'(θ) = P_θX and L''(θ) = P_θ( X − P_θX )² = var_θ(X) ≥ 0.

In particular, L'(0) = PX, so that Taylor expansion gives

<4>    L(θ) = θPX + ½θ² var_{θ*}(X) with θ* lying between 0 and θ.

Bounds on the variance of X under P_θ lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example <14>) provides a good illustration of this line of attack.
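The tilting identity L'(θ) = P_θX can be illustrated for the standard exponential distribution, where M(θ) = 1/(1 − θ) and the tilted distribution is again exponential, with mean 1/(1 − θ). A crude numerical-integration sketch (names mine):

```python
import math

def tilted_mean(theta, h=1e-3, upper=100.0):
    # E_theta[X] for the standard exponential tilted by dP_theta/dP = exp(theta*x)/M(theta),
    # with M(theta) = 1/(1 - theta) for theta < 1; crude Riemann sum over [0, upper]
    m = 1.0 / (1.0 - theta)
    total, x = 0.0, 0.0
    while x < upper:
        total += x * math.exp(theta * x - x) / m * h
        x += h
    return total

# L'(theta) = 1/(1 - theta) should match the mean under the tilted measure
for theta in [0.0, 0.3, 0.6]:
    assert abs(tilted_mean(theta) - 1.0 / (1.0 - theta)) < 1e-2
```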

2.3 Orlicz norms

There is a fruitful way to generalize the concept of an Lp norm by means of a Young function, that is, a convex, increasing function Ψ : R⁺ → R⁺ with Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. The concept applies with any measure µ but I need it mostly for probability measures.


Remark. The assumption that Ψ(x) → ∞ as x → ∞ is needed only to rule out the trivial case where Ψ ≡ 0. Indeed, if Ψ(x_0) > 0 for some x_0 then, for x = x_0/α with 0 < α < 1,

    Ψ(x_0) = Ψ( αx + (1 − α)0 ) ≤ αΨ(x).

That is, Ψ(x)/x ≥ Ψ(x_0)/x_0 for all x ≥ x_0. If Ψ is not identically zero then it must increase at least linearly with increasing x.

At one time Orlicz norms became very popular in the empirical process literature, in part (I suspect) because chaining arguments (see Chapter 4) are cleaner with norms than with tail probabilities. The most important Young functions for my purposes are the power functions, Ψ(x) = x^p for a fixed p ≥ 1, and the exponential power functions Ψ_α(x) := exp(x^α) − 1 for α ≥ 1. The function Ψ_2 is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a Gaussian. It is also possible (Problem [6]) to define a Young function Ψ_α, for 0 < α < 1, for which

    Ψ_α(x) ≤ exp(x^α) ≤ K_α + Ψ_α(x) for all x ≥ 0,

where K_α is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x ↦ exp(x^α) is concave on the interval [0, z_α], for z_α := ((1 − α)/α)^{1/α}.

<5> Definition. Let (X, A, µ) be a measure space and f be an A-measurable real function on X. For each Young function Ψ the corresponding Orlicz norm ‖f‖_Ψ (or Ψ-norm) is defined by

    ‖f‖_Ψ = inf{ c > 0 : µΨ(|f|/c) ≤ 1 },

with ‖f‖_Ψ = ∞ if the infimum runs over an empty set.

Remark. It is not hard to show that ‖f‖_Ψ < ∞ if and only if µΨ(|f|/C_0) < ∞ for at least one finite constant C_0. Moreover, if 0 < c = ‖f‖_Ψ < ∞ then µΨ(|f|/c) ≤ 1; the infimum is achieved.

When restricted to the set L_Ψ(X, A, µ) of all measurable real functions on X with finite Ψ-norm, ‖·‖_Ψ is a seminorm. It fails to be a norm only because ‖f‖_Ψ = 0 if and only if µ{x : f(x) ≠ 0} = 0. The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call ‖·‖_Ψ a norm even though it is only a seminorm on L_Ψ(X, A, µ).

The assumption Ψ(0) = 0 can be discarded if µ is a finite measure. For each finite constant A > Ψ(0), the quantity inf{c > 0 : µΨ(|f|/c) ≤ A} also defines a seminorm.


Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = ‖Z‖_Ψ < ∞ then

<6>    P{|Z| ≥ xσ} ≤ min( 1, PΨ(|Z|/σ)/Ψ(x) ) ≤ min( 1, 1/Ψ(x) ) for x ≥ 0.

For example, for each α > 0 there is a constant C_α for which

    P{ |Z| ≥ x ‖Z‖_{Ψ_α} } ≤ C_α e^{−x^α} for x ≥ 0.

The constant can be chosen so that C_α e^{−x^α} ≥ 1 for all x so small that Ψ_α(x)e^{−x^α} is not close to 1. When seeking to bound an Orlicz norm I often find it easier to first obtain a constant C_0 for which PΨ(|X|/C_0) ≤ K_0, where K_0 is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on ‖X‖_Ψ.

<7> Lemma. Suppose Ψ is a Young function. If PΨ(|X|/C_0) ≤ K_0 for some finite constants C_0 and K_0 > 1 then ‖X‖_Ψ ≤ C_0K_0.

Proof. For each θ in [0, 1] convexity of Ψ gives

    PΨ( θ|X|/C_0 ) ≤ θ PΨ( |X|/C_0 ) + (1 − θ)Ψ(0) ≤ θK_0.

The choice θ = 1/K_0 makes the last bound equal to 1. 

Beware. The Lemma does not give a foolproof way of determining reasonable bounds for the Ψ-norm. For example, if X has a N(0, 1) distribution then simple integration shows that

    PΨ_2(|X|/c) = g(c) := c/√(c² − 2) − 1 for c > √2.

Thus ‖X‖_{Ψ_2} = √(8/3) but cg(c) → ∞ as c decreases to √2.
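The √(8/3) value in the Beware remark, and the closed form for g(c), can both be reproduced by numerical integration. A sketch (names mine):

```python
import math

def psi2_moment(c, h=1e-3, upper=12.0):
    # E Psi_2(|X|/c) = E exp(X^2/c^2) - 1 for X ~ N(0,1), by crude numerical
    # integration; the expectation is finite only for c > sqrt(2)
    total, x = 0.0, -upper
    while x < upper:
        total += math.exp(x * x / (c * c) - x * x / 2.0) / math.sqrt(2.0 * math.pi) * h
        x += h
    return total - 1.0

# closed form g(c) = c/sqrt(c^2 - 2) - 1, and the root g(c) = 1 at c = sqrt(8/3)
assert abs(psi2_moment(2.0) - (2.0 / math.sqrt(2.0) - 1.0)) < 1e-3
assert abs(psi2_moment(math.sqrt(8.0 / 3.0)) - 1.0) < 1e-3
```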

2.4 Subgaussian tails

As noted in Section 2.2, if X is a random variable for which there exist constants µ and σ for which Pe^{θX} ≤ exp(µθ + ½σ²θ²) for all real θ then

<8>    P{X ≥ µ + σx} ≤ e^{−x²/2} for all x ≥ 0.


The approximations for θ near zero,

    Pe^{θX} = 1 + θPX + ½θ²PX² + o(θ²),
    exp(µθ + ½σ²θ²) = 1 + µθ + ½(σ² + µ²)θ² + o(θ²),

would force PX = µ and PX² ≤ σ² + µ², that is, var(X) ≤ σ². As Problem [11] shows, σ² might need to be much larger than the variance. To avoid any hint of the convention that σ² denotes a variance, I feel it is safer to replace σ by another symbol.

<9> Definition. Say that a random variable X with PX = µ has a subgaussian distribution with scale factor τ < ∞ if Pe^{θX} ≤ exp(µθ + τ²θ²/2) for all real θ. Write τ(X) for the smallest τ for which such a bound holds.

Remark. Some authors call such an X a τ-subgaussian random variable. Some authors require µ = 0 for subgaussianity; others use the qualifier centered to refer to the case where µ is zero.

For a subgaussian X, the two-sided analog of <8> gives

<10>    P{|X − µ| ≥ x} ≤ 2 exp( −x²/(2β²) ) for all x ≥ 0,

for some 0 < β ≤ τ(X). In fact (Theorem <16>) existence of such a tail bound is equivalent to subgaussianity.

<11> Example. Suppose Y = (Y_1, ..., Y_n) has a multivariate normal distribution. Define M := max_{i≤n} |Y_i| and σ² := max_{i≤n} var(Y_i). Chapter 6 explains why

    P{|M − PM| ≥ σx} ≤ 2 exp(−x²/2) for all x ≥ 0.

That is, M is subgaussian—a most surprising and wonderful fact. 

Many useful results that hold for gaussian variables can be extended to subgaussians. In empirical process theory, conditional subgaussianity plays a key role in symmetrization arguments (see Chapter 8). The first example captures the main idea.

<12> Example. Suppose X = Σ_{i≤n} b_i s_i, where the b_i's are constants and the s_i's are independent Rademacher random variables, that is, P{s_i = +1} = 1/2 = P{s_i = −1}. Then

    Pe^{θb_i s_i} = ½( e^{θb_i} + e^{−θb_i} ) = Σ_{k≥0} (θb_i)^{2k}/(2k)! ≤ exp(θ²b_i²/2)


so that Pe^{θX} = Π_{i≤n} Pe^{θb_i s_i} ≤ e^{θ²τ²/2} with τ² = Σ_{i≤n} b_i². That is, X has a centered subgaussian distribution with τ²(X) ≤ Σ_{i≤n} b_i². 

More generally, if X is a sum of independent subgaussians X_1, ..., X_n then the equality

    Pe^{θ(X−PX)} = Π_i Pe^{θ(X_i−PX_i)}

shows that X is subgaussian with τ²(X) ≤ Σ_i τ²(X_i). For sums of independent variables we need consider only the subgaussianity of each summand.
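The series comparison in Example <12> amounts to cosh(t) ≤ exp(t²/2), which holds term by term because (2k)! ≥ 2^k k!; both the single-term and the product form are easy to confirm numerically:

```python
import math

# mgf of a single Rademacher term: cosh(t) <= exp(t^2/2)
for k in range(0, 1001):
    t = k * 0.01
    assert math.cosh(t) <= math.exp(t * t / 2.0)

# product form for X = sum_i b_i s_i (illustrative coefficients)
b = [0.5, -1.0, 2.0]
theta = 0.7
mgf = 1.0
for bi in b:
    mgf *= math.cosh(theta * bi)
assert mgf <= math.exp(theta**2 * sum(bi * bi for bi in b) / 2.0)
```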

<13> Example. Suppose X is a random variable with PX = µ and a ≤ X ≤ b, for finite constants a and b. By representation <4>, its moment generating function M(θ) satisfies

    log M(θ) = θPX + ½θ² var_{θ*}(X) with θ* lying between 0 and θ.

Write µ* for P_{θ*}X and m for the midpoint (a + b)/2. Note that |X − m| ≤ (b − a)/2. Thus

    var_{θ*}(X) = P_{θ*}|X − µ*|² ≤ P_{θ*}|X − m|² ≤ (b − a)²/4

and log Pe^{θ(X−µ)} ≤ (b − a)²θ²/8. The random variable X is subgaussian with τ(X) ≤ (b − a)/2. 

Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:

    Pe^{θX} ≤ qe^{θa} + pe^{θb} where p = P(X − a)/(b − a) and q = 1 − p.

In effect, the bound represents the extremal case where X concentrates on the two-point set {a, b}. He then calculated a second derivative for L, without interpreting the result as a variance but using an inequality that can be given a variance interpretation.

<14> Example. (Hoeffding's inequality for independent summands) Suppose random variables X_1, ..., X_n are independent with PX_i = µ_i and a_i ≤ X_i ≤ b_i for each i, for constants a_i and b_i. By Example <13>, each X_i is subgaussian with τ(X_i) ≤ (b_i − a_i)/2. The sum S = Σ_{i≤n} X_i is also subgaussian, with µ = Σ_{i≤n} µ_i
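The bound from Example <13>, in Hoeffding's two-point extremal form, can be checked numerically; a sketch (names mine):

```python
import math

def centered_mgf_two_point(a, b, p, theta):
    # E exp(theta*(X - mu)) for X = b with prob p, X = a with prob 1 - p
    mu = (1.0 - p) * a + p * b
    return (1.0 - p) * math.exp(theta * (a - mu)) + p * math.exp(theta * (b - mu))

a, b = -1.0, 3.0
for p in [0.1, 0.5, 0.9]:
    for theta in [-2.0, -0.5, 0.3, 1.0]:
        # Hoeffding: log mgf <= (b - a)^2 theta^2 / 8
        assert centered_mgf_two_point(a, b, p, theta) <= math.exp((b - a)**2 * theta**2 / 8.0)
```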


and τ²(S) = Σ_{i≤n} τ²(X_i) = Σ_{i≤n} (b_i − a_i)²/4. The one-sided version of inequality <10> gives

<15>    P{ Σ_{i≤n} (X_i − µ_i) ≥ x } ≤ exp( −2x² / Σ_{i≤n} (b_i − a_i)² ) for each x ≥ 0.

See Chapter 3 for an extension of Hoeffding's inequality <15> to sums of bounded martingale differences. 

The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function Ψ_2(x) = exp(x²) − 1 and moment inequalities. The proof of the Theorem provides a preview of a "symmetrization" argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations.

Here is the idea. Suppose X_1 is a random variable with PX_1 = µ, defined on a probability space (Ω_1, F_1, P_1). Make an "independent copy" (Ω_2, F_2, P_2) of the probability space then carry X_1 over to a new random variable X_2 on Ω_2. Under P_1 ⊗ P_2 the random variables X_1 and X_2 are independent and identically distributed.

Remark. You might prefer to start with an X defined on (Ω, F, P) then define random variables on Ω × Ω by X1(ω1, ω2) = X(ω1) and X2(ω1, ω2) = X(ω2). Under P ⊗ P the random variables X1 and X2 are independent and each has the same distribution as X (under P). I once referred to independent copies during a seminar in a Math department. One member of the audience protested vehemently that a copy of X is the same as X and hence can only be independent of X in trivial cases.

Note that P_2X_2 = P_1X_1 = µ. For each convex function f,

    P_1 f(X_1 − µ) = P_1^{ω_1} f[ X_1(ω_1) − P_2^{ω_2} X_2(ω_2) ]
                  = P_1^{ω_1} f[ P_2^{ω_2}( X_1(ω_1) − X_2(ω_2) ) ]
                  ≤ P_1^{ω_1} P_2^{ω_2} f[ X_1(ω_1) − X_2(ω_2) ],

the last line coming from Jensen's inequality for P_2.
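Before turning to the theorem, inequality <15> can be sanity-checked against exact binomial tails for fair Bernoulli summands (a_i = 0, b_i = 1); a sketch (names mine):

```python
import math

def binom_tail_centered(n, x):
    # exact P{ sum_i (X_i - 1/2) >= x } for n independent fair Bernoulli X_i
    return sum(math.comb(n, k) for k in range(math.ceil(n / 2 + x), n + 1)) / 2**n

n = 30
for x in [1.0, 3.0, 5.0]:
    # <15> with a_i = 0, b_i = 1: the bound is exp(-2 x^2 / n)
    assert binom_tail_centered(n, x) <= math.exp(-2.0 * x * x / n)
```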

<16> Theorem. For each random variable X with expected value µ, the following assertions are equivalent.

(i) X has a subgaussian distribution (with τ(X) < ∞)

(ii) L(X) := sup_{r≥1} ‖X‖_r/√r < ∞

(iii) ‖X‖_{Ψ_2} < ∞

(iv) there exists a positive constant β for which P{|X − µ| ≥ x} ≤ 2 exp( −x²/(2β²) ) for all x ≥ 0

More precisely,

(a) c_0 ‖X‖_{Ψ_2} ≤ L(X) ≤ ‖X‖_{Ψ_2} with c_0 = 1/√(4e);

(b) if β(X) denotes the smallest β for which inequality (iv) holds,

    c_1 ‖X − µ‖_{Ψ_2} ≤ β(X) ≤ τ(X) ≤ 4L(X − µ) with c_1 = 1/√6.

Remark. Finiteness of L(X) is also equivalent to finiteness of L(X − µ) because |µ| ≤ ‖X‖_1 ≤ L(X) and |L(X) − L(X − µ)| ≤ |µ|. Consequently L(X − µ) ≤ 2L(X), but there can be no universal inequality in the other direction: L(X − µ) is unchanged by addition of a constant to X but L(X) can be made arbitrarily large. The precise values of the constants c_i are not important. Most likely they could be improved, but I can see no good reason for trying.

Proof. Implicit in the sequences of inequalities is the assertion that finiteness of one quantity implies finiteness of another quantity.

Proof of (a). Problem [8] shows that k^k ≥ k! ≥ (k/e)^k for all k ∈ N. Suppose PΨ_2(|X|/D) ≤ 1. Then Σ_{k∈N} P(X/D)^{2k}/k! ≤ 1 so that

    ‖X‖_{2k} ≤ D(k!)^{1/2k} ≤ D√k for each k ∈ N

and L(X) ≤ √2 sup_{k∈N} ‖X‖_{2k}/√(2k) ≤ D by Problem [9]. Conversely, if ∞ > L ≥ ‖X‖_r/√r for all r ≥ 1 then

    PΨ_2(|X|/D) = Σ_{k∈N} ‖X/D‖_{2k}^{2k}/k! ≤ Σ_{k∈N} (L√(2k)/D)^{2k} (e/k)^k,

which equals 1 if D = √(4e) L.

Proof of (b). None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0. Inequality <10> gives β(X) ≤ τ(X).

Simplify notation by writing β for β(X) and τ for τ(X) and L for L(X), and define γ := ‖X‖_{Ψ_2}. Also, assume that X is not a constant, to avoid annoying trivial cases.

To show that γ ≤ β√6:

    PΨ_2( |X − µ|/(β√6) ) = ∫_0^∞ e^x P{ |X − µ| ≥ β√(6x) } dx ≤ ∫_0^∞ 2 exp(x − 3x) dx = 1.

To show that τ ≤ 4L: Symmetrize. Construct a random variable X' with the same distribution as X but independent of X. By the triangle inequality,

    ‖X − X'‖_r ≤ 2‖X‖_r ≤ 2L√r for all r ≥ 1.

By Jensen's inequality,

    Pe^{θX} ≤ Pe^{θ(X−X')}
           = 1 + Σ_{k∈N} θ^{2k} P(X − X')^{2k}/(2k)!    [symmetry kills odd moments]
           ≤ 1 + Σ_{k∈N} (2L√(2k))^{2k} θ^{2k}/(k^k k!)
           = 1 + Σ_{k∈N} (8θ²L²)^k/k! = e^{8L²θ²},

so that τ(X) ≤ 4L. 
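The elementary inequalities quoted from Problem [8], the bound (2k)! ≥ k^k k! used above, and the final geometric series are all easy to confirm:

```python
import math

# Problem [8]: k^k >= k! >= (k/e)^k; the proof also uses (2k)! >= k^k * k!
for k in range(1, 40):
    assert k**k >= math.factorial(k) >= (k / math.e)**k
    assert math.factorial(2 * k) >= k**k * math.factorial(k)

# with D^2 = 4e L^2 the final series is geometric: sum_k (2e L^2/D^2)^k = sum_k 2^-k = 1
L = 1.7
D2 = 4.0 * math.e * L * L
assert abs(sum((2.0 * math.e * L * L / D2)**k for k in range(1, 60)) - 1.0) < 1e-12
```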

2.5 Bennett's inequality

The Hoeffding inequality depends on the somewhat crude bound

    var(W) ≤ (b − a)²/4 if a ≤ W ≤ b.

Such an inequality can be too pessimistic if W has only a small probability of taking values near the endpoints of the interval [a, b]. The inequalities can sometimes be improved by introducing second moments explicitly into the upper bound. Bennett's one-sided tail bound provides a particularly elegant illustration of the principle.


<17> Theorem. Suppose X_1, ..., X_n are independent random variables for which X_i ≤ b and PX_i = µ_i and PX_i² ≤ v_i for each i, for nonnegative constants b and v_i. Let W equal Σ_{i≤n} v_i. Then

    P{ Σ_{i≤n} (X_i − µ_i) ≥ x } ≤ exp( −(x²/(2W)) ψ_Benn(bx/W) ) for x ≥ 0

where ψ_Benn denotes the function defined on [−1, ∞) by

<18>    ψ_Benn(t) := ( (1 + t) log(1 + t) − t ) / (t²/2) for t ≠ 0, and ψ_Benn(0) = 1.

Remark. By Problem [14], ψ_Benn is positive, decreasing, and convex, with ψ_Benn(−1) = 2 and ψ_Benn(t) decreasing like (2/t) log t as t → ∞. When bx/W is close to zero the ψ_Benn contribution to the upper bound is close to 1 and the exponent is close to a subgaussian −x²/(2W). (If b = 0 then it is exactly subgaussian.) Further out in the tail, for larger x, the exponent decreases like −(x/b) log(bx/W), which is slightly faster than exponential.
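The properties of ψ_Benn claimed in the remark can be confirmed numerically; a sketch (the series guard near zero is mine, to avoid floating-point cancellation):

```python
import math

def psi_benn(t):
    # psi_Benn(t) = ((1+t)*log(1+t) - t)/(t^2/2) on [-1, infinity),
    # with value 1 at t = 0; Taylor guard 1 - t/3 for tiny |t|
    if abs(t) < 1e-6:
        return 1.0 - t / 3.0
    return ((1.0 + t) * math.log1p(t) - t) / (t * t / 2.0)

ts = [-1 + k * 0.01 for k in range(1, 1001)]
# positive and decreasing on (-1, infinity)
assert all(psi_benn(a) >= psi_benn(b) > 0 for a, b in zip(ts, ts[1:]))
# limit 2 at t = -1, and decrease like (2/t) log t as t -> infinity
assert abs(psi_benn(-1 + 1e-9) - 2.0) < 1e-6
assert abs(psi_benn(1e6) / ((2 / 1e6) * math.log(1e6)) - 1.0) < 0.1
```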

Proof. Define Φ_1(x) = e^x − 1 − x, the convex function Ψ_1 adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function

<19>    ∆(x) = Φ_1(x)/(x²/2) for x ≠ 0, and ∆(0) = 1,

is continuous and increasing on the whole real line. For θ ≥ 0,

    Pe^{θX_i} = 1 + θPX_i + PΦ_1(θX_i)
             = 1 + θµ_i + P( ½(θX_i)² ∆(θX_i) )
             ≤ 1 + θµ_i + ½θ² P(X_i²) ∆(θb)    [because ∆ is increasing]
<20>         ≤ exp( θµ_i + ½θ²v_i ∆(θb) ).

That is,

<21>    Pe^{θ(X_i−µ_i)} ≤ exp( ½θ²v_i ∆(θb) ) = exp( (v_i/b²) Φ_1(θb) ) for θ ≥ 0.

Remark. In the last expression I am implicitly assuming that b is nonzero. We can recover the bound when b = 0 by a passage to the limit as b decreases to zero. Actually, the b = 0 case is quite noteworthy because the ∆ contribution disappears from the upper bound for P exp(θX_i) and the ψ_Benn disappears from the final inequality.


Abbreviate Σ_{i≤n}(X_i − µ_i) to T. By independence and <21>,

    Pe^{θT} = Π_{i≤n} Pe^{θ(X_i−µ_i)} ≤ exp( (W/b²) Φ_1(θb) )

and, by <2>,

    P{T ≥ x} ≤ inf_{θ≥0} exp( −xθ + (W/b²) Φ_1(θb) )
            = exp( −(W/b²) sup_{bθ≥0}( (bx/W)(bθ) − Φ_1(θb) ) )
            = exp( −(W/b²) Φ_1^*(bx/W) ),

where Φ_1^* denotes the conjugate of Φ_1,

    Φ_1^*(y) := sup_t( ty − Φ_1(t) ) = (1 + y) log(1 + y) − y if y ≥ −1, and +∞ otherwise.

When y ≥ −1 the supremum is achieved at t = log(1 + y). Thus ½y² ψ_Benn(y) = Φ_1^*(y) for y ≥ −1 and (W/b²) Φ_1^*(bx/W) = (x²/(2W)) ψ_Benn(bx/W), as asserted. 

Remark. The inequality still holds if 0 > b ≥ −x/W, although the proof changes slightly. What happens for smaller b?

Unlike the case of the two-sided Hoeffding inequality, there is no automatic lower-tail analog for the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when X_i ≥ 0:

    P{ Σ_{i≤n}(X_i − µ_i) ≤ −x } = P{ Σ_{i≤n}(µ_i − X_i) ≥ x } ≤ exp( −x²/(2W) ) for all x ≥ 0,

a clean (one-sided) subgaussian tail bound. The Bennett inequality also works for the Poisson distribution. If X has a Poisson(λ) distribution then (Problem [15])

    P{X ≥ λ + x} ≤ exp( −(x²/(2λ)) ψ_Benn(x/λ) ) for x ≥ 0

and, for 0 ≤ x ≤ λ,

    P{X ≤ λ − x} ≤ exp( −(x²/(2λ)) ψ_Benn(−x/λ) ) ≤ exp( −x²/(2λ) ),


a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables. See Chapter 3 for an extension of the Bennett inequality to sums of martingale differences.
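The Poisson form of the Bennett bound can be checked against exact Poisson tails; a sketch (names mine):

```python
import math

def psi_benn(t):
    # the function of display <18>, with a series guard near zero
    if abs(t) < 1e-6:
        return 1.0 - t / 3.0
    return ((1.0 + t) * math.log1p(t) - t) / (t * t / 2.0)

def poisson_upper_tail(lam, m):
    # P{X >= m} for X ~ Poisson(lam), m a nonnegative integer
    return 1.0 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(m))

lam = 5.0
for x in [1, 3, 7, 12]:
    bound = math.exp(-(x * x / (2.0 * lam)) * psi_benn(x / lam))
    assert poisson_upper_tail(lam, 5 + x) <= bound
```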

2.6 Bernstein's inequality

As shown by Problem [14], ψ_Benn(t) ≥ (1 + t/3)^{−1} for t ≥ −1. Thus the Bennett inequality implies a weaker result: if X_1, ..., X_n are independent with PX_i = µ_i and PX_i² ≤ v_i and X_i ≤ b then

<22>    P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ exp( −(x²/2)/(W + bx/3) ) for x ≥ 0,

a form of Bernstein's inequality. When bx/W is close to zero the exponent is again close to −x²/(2W); when bx/W is large, the Bernstein inequality loses the extra log term in the exponent. If the summands X_i are bounded in absolute value (not just bounded above) we get an analogous two-sided inequality.

The assumptions PX_i² ≤ v_i and |X_i| ≤ b imply

    P|X_i|^k ≤ P(X_i²) b^{k−2} ≤ v_i b^{k−2} for k ≥ 2.

A much weaker condition on the growth of P|X_i|^k with k, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more general version of the Bernstein inequality.

<23> Theorem. (Bernstein inequality) Suppose X_1, ..., X_n are independent random variables with PX_i = µ_i and

<24>    P|X_i|^k ≤ ½ v_i B^{k−2} k! for k ≥ 2.

Define W = Σ_{i≤n} v_i. Then, for x ≥ 0,

    P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ e^{−H_Benn(x,B,W)} ≤ e^{−H_Bern(x,B,W)}

where

    H_Bern(x, B, W) = (x²/W) / ( 2(1 + xB/W) )
    H_Benn(x, B, W) = (x²/W) / ( 1 + xB/W + √(1 + 2xB/W) ).


Remark. The B here does not represent an upper bound for X_i. The traditional form of Bernstein's inequality, with H_Bern, corresponds to the bound <22> derived from the Bennett inequality under the assumption X_i ≤ b, with B = b/3.
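The ordering of the two exponents is elementary, since 1 + u + √(1 + 2u) ≤ 2(1 + u); a numerical confirmation (names mine):

```python
import math

def h_bern(x, B, W):
    # traditional Bernstein exponent
    return (x * x / W) / (2.0 * (1.0 + x * B / W))

def h_benn(x, B, W):
    # sharper exponent from Theorem <23>
    u = x * B / W
    return (x * x / W) / (1.0 + u + math.sqrt(1.0 + 2.0 * u))

# H_Benn >= H_Bern, so exp(-H_Benn) is always the sharper bound
for x in [0.1, 1.0, 10.0, 100.0]:
    for B in [0.0, 0.5, 3.0]:
        assert h_benn(x, B, 2.0) >= h_bern(x, B, 2.0) > 0.0
```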

Proof. As before,

    F := P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ inf_{θ≥0} e^{−(x+Σ_i µ_i)θ} Π_{i≤n} Pe^{θX_i}.

Use the moment inequality <24> to bound the moment generating function for X_i:

    Pe^{θX_i} = 1 + θµ_i + PΦ_1(θX_i) where Φ_1(t) := e^t − 1 − t

and, for θ ≥ 0,

    PΦ_1(θX_i) ≤ PΦ_1(θ|X_i|) = Σ_{k≥2} θ^k P|X_i|^k / k!
              ≤ ½ v_i θ² Σ_{k≥2} (θB)^{k−2}
<25>          = ½ v_i θ²/(1 − Bθ) provided 0 ≤ θ < 1/B.

Thus

    Pe^{θX_i} ≤ 1 + θµ_i + ½ v_i θ²/(1 − Bθ) ≤ exp( θµ_i + ½ v_i θ²/(1 − Bθ) ) for 0 ≤ θB < 1

and

    F ≤ exp( −sup_{0≤Bθ<1} g(θ) ) where g(θ) := xθ − ½Wθ²/(1 − Bθ).

The H_Bern bound. As an approximate maximizer for g choose θ_0 = x/(W + xB), so that

    F ≤ e^{−g(θ_0)} = exp( −(x²/2)/(W + xB) ).

Bennett (1962, equation (7)) made this strange choice by defining G = 1 − θB then minimizing −xθ + ½θ²W/G while ignoring the fact that G depends on θ. That gives θ = xG/W = (1 − θB)x/W, a linear equation with solution θ_0. Very confusing. As Bennett noted, a boundedness assumption |X_i| ≤ b


actually implies <24> with B = b/3. The exponential upper bound then agrees with <22>.

The H_Benn bound. Following Bennett (1962, equation (7a)), this time we really maximize g(θ). The calculation is cleaner if we rewrite g with u := xB/W and s := 1 − θB. With those substitutions,

    g(θ) = g_0(s) = (W/(2B²))( 2u(1 − s) − (1 − s)²/s ) = (W/(2B²))( −s^{−1} − (1 + 2u)s + 2(1 + u) ).

The maximum is achieved at s_0 = (1 + 2u)^{−1/2}, giving F ≤ e^{−g_0(s_0)} where

<26>    g_0(s_0) = (W/B²)( 1 + u − √(1 + 2u) ) = (W/B²) · u²/( 1 + u + √(1 + 2u) ).



Both inequalities in Theorem <23> have companion lower bounds, obtained by replacing X_i by −X_i in all the calculations. This works because the Bernstein condition <24> also holds with X_i replaced by −X_i. The inequalities could all have been written as two-sided bounds, that is, as upper bounds for P{ |Σ_i(X_i − µ_i)| ≥ x }.

If we are really only interested in a one-sided bound then it seems superfluous to control the size of |X_i|. For example, for the upper tail it should be enough to control X_i⁺. Boucheron et al. (2013, page 37), hereafter BLM, derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality

    e^x − 1 − x ≤ ½x² + Σ_{k≥3} (x⁺)^k/k! for all real x.

(For x ≥ 0 both sides are equal. For x < 0 it reflects the fact that g(x) = e^x − (1 + x + ½x²) has a nonnegative derivative with g(0) = 0, so that g(x) ≤ 0 for x ≤ 0.) They replaced the assumption <24> by

    PX_i² = v_i and P(X_i⁺)^k ≤ ½ v_i B^{k−2} k! for k ≥ 3.

For 0 ≤ θ < 1/B, we then get

    Pe^{θX_i} ≤ 1 + θµ_i + ½θ²v_i + ½v_iθ² Σ_{k≥3} (θB)^{k−2} ≤ exp( θµ_i + ½ v_i θ²/(1 − θB) ),


which is exactly the same bound on the moment generating function as in the proof of Theorem <23>. As with the H_Benn calculation it follows that

    P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ exp( −(W/B²)( 1 + u − √(1 + 2u) ) )

where W = Σ_i v_i and u = xB/W. BLM observed that the upper bound can be inverted by solving a quadratic in u,

    1 + 2u = ( 1 + u − tB²/W )²,

which has a positive solution u = tB²/W + √(2tB²/W), corresponding to x = Wu/B = tB + √(2tW). The one-sided Bernstein bound then takes an elegant form,

<27>    P{ Σ_i(X_i − µ_i) ≥ tB + √(2tW) } ≤ e^{−t} for t ≥ 0.

By splitting according to whether tB ≥ √(2tW) or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,

    P{ Σ_i(X_i − µ_i) ≥ 2tB } ≤ e^{−t} for t ≥ 2W/B²
    P{ Σ_i(X_i − µ_i) ≥ 2t√W } ≤ e^{−t²/2} for t²/2 ≤ 2W/B².

These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W.
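The algebra behind this inversion is exact: with u = xB/W and x = tB + √(2tW), one finds 1 + 2u = (1 + √(2tB²/W))², so the exponent collapses to t. A numerical check (names mine):

```python
import math

def bennett_exponent(x, B, W):
    # (W/B^2) * (1 + u - sqrt(1 + 2u)) with u = xB/W
    u = x * B / W
    return (W / B**2) * (1.0 + u - math.sqrt(1.0 + 2.0 * u))

B, W = 0.7, 3.0
for t in [0.5, 2.0, 10.0]:
    x = t * B + math.sqrt(2.0 * t * W)
    # at x = tB + sqrt(2tW) the exponent equals t exactly
    assert abs(bennett_exponent(x, B, W) - t) < 1e-9
```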

2.7 Subexponential distributions?

To me, subexponential behavior refers to functions that are bounded above by a constant multiple of exp(−c|x|), for some positive constant c. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tails are subgaussian, which one might think is better than subexponential. As another example, if a random variable X takes values in [−1, +1] we have P{|X| ≥ x} = 0 ≤ exp(−10^{10}x) for all x ≥ 1, but I would hesitate to call that behavior subexponential.

What then does it mean for a whole distribution to be subexponential? And how is subexponentiality related to the Bernstein moment condition,

<28>    P( |X|^k/(B^k k!) ) ≤ ½ v/B² for k ≥ 2?


As you already know, condition <28> implies

<29>    Pe^{θX} − 1 − θPX = PΦ_1(θX) ≤ PΦ_1(θ|X|) ≤ ½ vθ²/(1 − θB) for 0 ≤ θB < 1,

which implies

    Pe^{θX} ≤ exp( θµ + ½ vθ²/(1 − θB) ) for 0 ≤ θB < 1 and µ = PX.

Van der Vaart and Wellner (1996, page 103) pointed out that condition <28> is implied by

    ½ v/B² ≥ PΦ_1(|X|/B) = Σ_{k≥2} P|X|^k/(B^k k!) where Φ_1(x) := e^x − 1 − x.

And, if <28> holds, then

    PΦ_1( |X|/(2B) ) = Σ_{k≥2} P|X|^k/(2^k B^k k!) ≤ ½ (2v)/(2B)²,

which is just <29> with θ = 1/(2B). It would seem that the moment condition is somehow related to ‖X‖_{Φ_1}. Indeed, Vershynin (2010, Section 5.2.4) defined a random variable X to be subexponential if sup_{k∈N} ‖X‖_k/k < ∞. By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm ‖X‖_{Φ_1}.

Boucheron et al. (2013, Section 2.4) defined a centered random variable X to be sub-gamma on the right tail with variance factor v and scale factor B if

<30>    Pe^{θX} ≤ exp( (vθ²/2)/(1 − θB) ) for 0 ≤ θB < 1.

They pointed out that if X has a gamma(α, β) distribution (with mean µ = αβ and variance v = αβ²) then

    Pe^{θ(X−µ)} = e^{−θµ}(1 − θβ)^{−α} = exp( −θαβ − α log(1 − θβ) ) ≤ exp( (αβ²θ²/2)/(1 − θβ) ) for 0 ≤ θβ < 1.

Hence the name. The special case where α = 1 corresponds to the exponential.
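The displayed inequality reduces to −u − log(1 − u) ≤ ½u²/(1 − u) for u = θβ in [0, 1), which follows from the series −log(1 − u) = Σ_k u^k/k. A numerical check (names mine):

```python
import math

def log_mgf_centered_gamma(theta, alpha, beta):
    # log E exp(theta*(X - mu)) for X ~ gamma(alpha, beta), mu = alpha*beta
    return -theta * alpha * beta - alpha * math.log(1.0 - theta * beta)

alpha, beta = 2.0, 1.5
for tb in [0.0, 0.3, 0.7, 0.95]:
    theta = tb / beta
    sub_gamma = alpha * beta**2 * theta**2 / 2.0 / (1.0 - theta * beta)
    assert log_mgf_centered_gamma(theta, alpha, beta) <= sub_gamma + 1e-12
```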


Let me try to tie these definitions and facts together. First note that the upper bound in <29> equals 1 if θ = 2/( B + √(B² + 2v) ). Thus the Bernstein assumption <28> implies

    γ := ‖X‖_{Φ_1} ≤ ( B + √(B² + 2v) )/2.

If, as is sometimes assumed, v = PX² then <28> also implies

    v^{3/2} ≤ P|X|³ ≤ ½ v 3! B.

That is, v ≤ (3B)² and B < γ ≤ 2.7B. It appears that the B, or ‖X‖_{Φ_1}, controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality <27> conveyed the same message. Compare with the subexponential bound

    P{|X| ≥ x} ≤ min( 1, 1/Φ_1(x/γ) ) for γ = ‖X‖_{Φ_1}.

The inequality PΦ_1(|X|/γ) ≤ 1 also implies

    P( |X|^k/(γ^k k!) ) ≤ 1 = ½ (2γ²)/γ² for k ≥ 2.

That is, <28> holds with v = 2γ² and B = γ. Theorem <23> then gives

    P{X − µ ≥ x} ≤ exp( −x²/(4γ² + 2xγ) ) ≤ exp( −x²/(8γ²) ) for 0 ≤ x ≤ 2γ.

Is that not a subgaussian bound? Unfortunately, the bound—subgaussian or not—has little value for such a small range near zero. The upper bound always exceeds 1/√e. The useful subgaussian part of the bound is contributed by the v in <28>. The effect is clearer for a sum of independent random variables X_1, ..., X_n, each distributed like X:

    P exp( θ Σ_{i≤n} X_i ) ≤ exp( (nvθ²/2)/(1 − θB) ) for 0 ≤ θB < 1,

which leads to tail bounds like

    P{ Σ_i X_i ≥ x } ≤ exp( −(x²/2)/(nv + xB) ),


useful subgaussian behavior when 0 ≤ x ≤ nv/B. In my opinion, the sub-gamma property <30> is better thought of as a restricted subgaussian assumption that automatically implies subexponential behavior far out in the tails. For each α in (0, 1) it implies

    Pe^{θX} ≤ exp( (vθ²/2)/(1 − α) ) for 0 ≤ θB ≤ α < 1.

Remark. Uspensky (1937, page 204) used this approach for the Bernstein inequality, before making the choice α = xB/(W + xB) to recover the traditional H_Bern form of the upper bound.

More generally, if we have a bound

<31>    Pe^{θ(X−µ)} ≤ e^{vθ²/2} for 0 ≤ θ ≤ ρ,

with v and ρ positive constants and µ = PX, then for x ≥ 0,

<32>    P{X ≥ µ + x} ≤ min_{0≤θ≤ρ} exp( −xθ + ½vθ² )
               = exp( −½x²/v ) for 0 ≤ x ≤ ρv
               = exp( −ρx + ½vρ² ) < e^{−xρ/2} for x > ρv.

The minimum is achieved by θ = min(ρ, x/v). The proof in the next Section of a generalized version of the Hanson-Wright inequality—showing that a quadratic form Σ_{i,j} X_i A_{i,j} X_j in independent subgaussians has a subgaussian/subexponential tail—illustrates this approach.
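The closed-form minimization in <32> can be verified against a grid search; a sketch (names mine):

```python
import math

def restricted_chernoff(x, v, rho):
    # grid minimum of exp(-x*theta + v*theta^2/2) over theta in [0, rho],
    # and the closed form at the claimed minimizer theta = min(rho, x/v)
    grid = min(math.exp(-x * (k / 1000.0) * rho + v * ((k / 1000.0) * rho)**2 / 2.0)
               for k in range(1001))
    theta = min(rho, x / v)
    return grid, math.exp(-x * theta + v * theta * theta / 2.0)

v, rho = 2.0, 1.5
for x in [0.5, 3.0, 6.0]:
    grid, closed = restricted_chernoff(x, v, rho)
    assert closed <= grid + 1e-9       # theta = min(rho, x/v) attains the minimum
    if x <= rho * v:
        assert abs(closed - math.exp(-x * x / (2.0 * v))) < 1e-12
```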

2.8 The Hanson-Wright inequality for subgaussians

Basic::S:HW The following is based on an elegant argument by Rudelson and Vershynin (2013) = R&V. To shorten the proof I assume that the matrix A has zeros down its diagonal, so that P Σ_{i,j} X_i A_{i,j} X_j = 0 and I can skip the part of the argument (see Problem [17]) that deals with Σ_i A_{i,i}(X_i² − PX_i²).

Basic::HW <33> Theorem. Let X₁, …, X_n be independent, centered subgaussian random variables with τ ≥ max_i τ(X_i). Let A be an n × n matrix of real numbers with A_{ii} = 0 for each i. Then

P{X′AX ≥ t} ≤ exp( −min( t²/(64τ⁴‖A‖²_HS), t/(8√2 τ²‖A‖₂) ) )   for t ≥ 0,

where ‖A‖_HS := √trace(A′A) = √(Σ_{i,j} A²_{i,j}) and ‖A‖₂ := sup_{‖u‖₂≤1} ‖Au‖₂.


Remark. There are no assumptions of symmetry or positive definiteness on A. The proof also works with A replaced by −A, leading to a similar upper bound for P{|X′AX| ≥ t}. The Hilbert-Schmidt norm ‖A‖_HS is also called the Frobenius norm. The constants 64 and 8√2 are not important.
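As a quick numerical illustration (not part of the proof), one can compare the bound of Theorem <33> with the empirical tail of X′AX when the X_i are N(0, 1), which are subgaussian with τ = 1. The sketch below uses numpy; the matrix A, the sample size, and the thresholds t are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)               # Theorem <33> assumes A_ii = 0

hs = np.linalg.norm(A, 'fro')          # ||A||_HS = sqrt(trace(A'A))
op = np.linalg.norm(A, 2)              # ||A||_2 = largest singular value
tau = 1.0                              # N(0,1) coordinates are subgaussian with tau = 1

# Empirical tail of X'AX for N(0, I_n) coordinates.
X = rng.standard_normal((100000, n))
quad = np.einsum('ki,ij,kj->k', X, A, X)

for t in [5.0, 20.0, 50.0]:
    bound = np.exp(-min(t**2 / (64 * tau**4 * hs**2),
                        t / (8 * np.sqrt(2) * tau**2 * op)))
    empirical = np.mean(quad >= t)
    assert empirical <= bound          # the bound is quite loose in practice
```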

Proof. Without loss of generality, suppose τ = 1, so that Pe^{θX_i} ≤ e^{θ²/2} for all i and all real θ, and P exp(α′X) ≤ exp(‖α‖₂²/2) for each α ∈ ℝⁿ. As explained at the end of Section 2.7, it suffices to show

Pe^{θX′AX} ≤ exp(½vθ²)   for 0 ≤ θ ≤ ρ,

where v = 32‖A‖²_HS and ρ = (√32 ‖A‖₂)^{−1}.

To avoid conditioning arguments, assume P = ⊗_{i∈[n]} P_i on B(ℝ^{[n]}), where [n] := {1, 2, …, n}, and each X_i is a coordinate map, that is, X_i(x) = x_i and X(x) = x. Write G for the N(0, I_n) distribution, another probability measure on B(ℝ^{[n]}). Of course G = ⊗_{i∈[n]} G_i with each G_i equal to N(0, 1). For each subset J of [n] write P_J for ⊗_{i∈J} P_i and G_J for ⊗_{i∈J} G_i and, for generic vectors w = (w_i : i ∈ [n]) in ℝ^{[n]}, write w_J for (w_i : i ∈ J) ∈ ℝ^J.

The proof works by bounding Pe^{θX′AX} by the G expected value of an analogous quadratic, using a decoupling argument that R&V attributed to Bourgain (1998).

For each δ = (δ_i : i ∈ [n]) ∈ {0, 1}^{[n]} define A_δ to be the n × n matrix with (i, j)th element δ_i(1 − δ_j)A_{i,j}. Write D_δ for {i ∈ [n] : δ_i = 1} and N_δ for [n]\D_δ. Then X′A_δX = X′_{D_δ} E_δ X_{N_δ} where E_δ := A[D_δ, N_δ]. If the rows and columns of A were permuted so that i < i′ for all i ∈ D_δ and i′ ∈ N_δ then A_δ would look like

        ( 0   E_δ )
A_δ =   ( 0   0   )    where E_δ := A[D_δ, N_δ].

Now let Q denote the uniform distribution on {0, 1}^{[n]}. Under Q the δ_i's are independent Ber(1/2) random variables, so that QA_δ = ¼A. For each θ, Jensen's inequality and Fubini give

P exp(θX′AX) = P^x exp(4θ Q^δ X′A_δX) ≤ Q^δ P^x exp(4θX′A_δX).

Write Q_δ for X′A_δX. It therefore suffices to show that

P exp(4θQ_δ) ≤ exp(½vθ²)   for each δ and all 0 ≤ θ ≤ ρ.

From this point onwards the δ is held fixed. To simplify notation I drop most of the subscript δ's, writing X_D instead of X_{D_δ}, and so on. Of course A_δ cannot drop its subscript.
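The averaging identity QA_δ = ¼A is a finite sum, so it can be verified directly for a small n by enumerating all 2ⁿ vectors δ; a sketch in plain Python, with an arbitrary illustrative 4 × 4 matrix:

```python
import itertools

n = 4
# An arbitrary illustrative matrix with zero diagonal, as in Theorem <33>.
A = [[0, 1, 2, 3],
     [4, 0, 5, 6],
     [7, 8, 0, 9],
     [1, 2, 3, 0]]

# Average the matrices A_delta, with (i,j) entry delta_i*(1-delta_j)*A[i][j],
# over all 2^n choices of delta in {0,1}^n (the uniform distribution Q).
avg = [[0.0] * n for _ in range(n)]
for delta in itertools.product([0, 1], repeat=n):
    for i in range(n):
        for j in range(n):
            avg[i][j] += delta[i] * (1 - delta[j]) * A[i][j] / 2**n

# Under Q the delta_i are independent Ber(1/2), so Q(delta_i*(1 - delta_j)) = 1/4
# for i != j, giving Q A_delta = A/4 (the diagonal of A is zero anyway).
for i in range(n):
    for j in range(n):
        assert abs(avg[i][j] - A[i][j] / 4) < 1e-12
```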


Under P_D, with X_N held fixed, the quadratic Q_δ := X′A_δX = X′_D E X_N is a linear combination of the independent subgaussians X_D. Thus

P_D e^{4θQ_δ} ≤ exp( ½(4θ)² ‖EX_N‖₂² ) = G_D exp(4θQ_δ).

Similarly, P_N exp(4θQ_δ) ≤ G_N exp(4θQ_δ). Thus

P exp(4θQ_δ) = P_D P_N exp(4θQ_δ) ≤ P_D G_N exp(4θQ_δ) = G_N P_D exp(4θQ_δ)
≤ G_N G_D exp(4θQ_δ) = G_N e^{8θ² X′_N E′E X_N}.

The problem is now reduced to its gaussian analog, for which rotational symmetry of G_N allows further simplification by means of a diagonalization of the positive semidefinite matrix E′E = L′ΛL, where L is orthogonal and Λ = diag(λ₁, …, λ_n), with λ₁ ≥ λ₂ ≥ … ≥ λ_n ≥ 0 the eigenvalues of E′E.

G_N e^{8θ² X′_N E′E X_N} = G_N exp(8θ² X′_N Λ X_N) = ∏_{i∈N} G_i exp(8θ²λ_i x_i²)

\E@ G.bnd <34>   ≤ exp( 16θ² Σ_{i∈N} λ_i )   provided 8θ² max_{i∈N} λ_i ≤ 1/4.

The last inequality uses the fact that, if Z ∼ N(0, 1),

Pe^{sZ²} = (1 − 2s)^{−1/2} = exp( −½ log(1 − 2s) ) ≤ exp(2s)   if 0 ≤ s ≤ 1/4.

Now we need some matrix facts to bring everything back to an expression involving A. For notational simplicity again assume that the rows D all precede the rows N, so that

      ( B₁  E  )                ( B₁′B₁ + B₂′B₂   B₁′E + B₂′B₃ )
A =   ( B₂  B₃ )    and  A′A =  ( E′B₁ + B₃′B₂    E′E + B₃′B₃  ).

It follows that

‖A‖²_HS = trace(A′A) ≥ trace(E′E) = Σ_{i∈N} λ_i

‖A‖₂² = sup_{‖v‖₂≤1} ‖Av‖₂² ≥ sup_{‖u‖₂≤1} ‖Eu‖₂² = λ₁.

Combine the various inequalities to conclude that

P exp(θX′AX) ≤ exp( 16θ²‖A‖²_HS )   provided 0 ≤ θ ≤ (√32 ‖A‖₂)^{−1}.

An appeal to inequality <32> completes the proof. □
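The elementary moment bound used in the proof, Pe^{sZ²} = (1 − 2s)^{−1/2} ≤ e^{2s} for 0 ≤ s ≤ 1/4, is easy to confirm numerically; the grid check below (plain Python) also shows that the restriction on s matters.

```python
import math

# If Z ~ N(0,1) then P e^{s Z^2} = (1 - 2s)^{-1/2} for s < 1/2.  The proof
# uses (1 - 2s)^{-1/2} <= exp(2s) on 0 <= s <= 1/4, equivalently
# -log(1 - 2s)/2 <= 2s on that range.  Check on a fine grid:
for k in range(10001):
    s = 0.25 * k / 10000
    mgf = (1 - 2 * s) ** (-0.5)
    assert mgf <= math.exp(2 * s) + 1e-12

# The inequality fails not far beyond s = 1/4, so the restriction is needed:
assert (1 - 2 * 0.45) ** (-0.5) > math.exp(2 * 0.45)
```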


2.9 Problems Basic::S:Problems

Basic::P:convex.interval [1] Define a = sup{θ < 0 : M(θ) = +∞} and b = inf{θ > 0 : M(θ) = +∞}, where M(θ) = Pe^{θX}. For simplicity, suppose both a and b are finite and a < b.

(i) Show that the restriction of M to the interval [a, b] is continuous. (For example, if M(b) = +∞ you should show that M(θ) → +∞ as θ increases to b.)

(ii) Suppose θ ∈ (a, b). Choose δ > 0 so that [θ − 2δ, θ + 2δ] ⊂ (a, b). For |h| < δ, establish the domination bound

|e^{(θ+h)X} − e^{θX}|/|h| ≤ ∫₀¹ |X| e^{(θ+sh)X} ds ≤ max( e^{(θ+2δ)X}, e^{θX}, e^{(θ−2δ)X} )/δ.

Deduce via a Dominated Convergence argument that M′(θ) = P(Xe^{θX}).

(iii) Argue similarly to justify another differentiation inside the expectation, leading to the expression for M″(θ).

Basic::P:e.to.g [2] Suppose h(x) = e^{g(x)}, where g is a twice differentiable real-valued function on ℝ⁺.

(i) Show that h is convex on any interval J for which (g′(x))² + g″(x) ≥ 0 for x ∈ int(J).

Basic::P:Psia.ge1 [3] For a fixed α ≥ 1 define Ψ_α(x) = e^{x^α} − 1 for x ≥ 0.

(i) Show that Ψ_α is a Young function.

(ii) Show that

Ψ_α(x)Ψ_α(y) ≤ exp(x^α + y^α) − 1 ≤ Ψ_α(x + y)   for all x, y ≥ 0,
Ψ_α(x)Ψ_α(y) ≤ Ψ_α(2^{1/α} xy)   for x ∧ y ≥ 1.

(iii) Deduce from the inequality Ψα(x)Ψα(y) ≤ Ψα(x + y) that

Ψ_α^{−1}(uv) ≤ Ψ_α^{−1}(u) + Ψ_α^{−1}(v)   for all u, v ≥ 0.
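The inequalities in Problem [3] can be spot-checked numerically; a sketch in plain Python with the illustrative choice α = 2:

```python
import math

def psi(alpha, x):
    # Psi_alpha(x) = exp(x^alpha) - 1 for x >= 0
    return math.exp(x ** alpha) - 1.0

alpha = 2.0                          # illustrative choice with alpha >= 1
pts = [0.1 * k for k in range(31)]   # grid of x, y values in [0, 3]
for x in pts:
    for y in pts:
        lhs = psi(alpha, x) * psi(alpha, y)
        mid = math.exp(x**alpha + y**alpha) - 1
        # Chain of inequalities from part (ii):
        assert lhs <= mid + 1e-9
        assert mid <= psi(alpha, x + y) * (1 + 1e-12) + 1e-9
        if min(x, y) >= 1:
            # The product bound that holds only for x and y at least 1.
            assert lhs <= psi(alpha, 2 ** (1 / alpha) * x * y) + 1e-9
```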

Basic::P:extend.to.Young [4] Suppose f is a convex, nonnegative, increasing function on an interval [a, ∞), with a > 0. Suppose also that there exists an x0 ∈ [a, ∞) for which


f(x₀)/x₀ = τ := inf_{x≥a} f(x)/x. Show that h(x) := τx{0 ≤ x < x₀} + f(x){x ≥ x₀} is a Young function.

+ + Basic::P:extend.range [5] Suppose f : R → R is strictly increasing with f(0) = 0 and

f(u1u2) ≤ c0 (f(u1) + f(u2)) for min(u1, u2) ≥ v0,

for some constants c0 and v0 > 0. For each v1 in (0, v0) prove existence of a larger constant c1 for which

f(u1u2) ≤ c1 (f(u1) + f(u2)) for min(u1, u2) ≥ v1.

Hint: For u_i* = max(u_i, v₀) show that f(u_i*) ≤ f(u_i)f(v₀)/f(v₁).

Basic::P:Psia.lt1 [6] For a fixed α ∈ (0, 1) define f_α(x) = e^{x^α} for x ≥ 0.

(i) Show that x ↦ f_α(x) is convex for x ≥ z_α := (α^{−1} − 1)^{1/α} and concave for 0 ≤ x ≤ z_α.

(ii) Define x_α = (1/α)^{1/α} and τ_α = f_α(x_α)/x_α = (αe)^{1/α}. Show that

Ψ_α(x) := τ_α x{x < x_α} + f_α(x){x ≥ x_α}

is a Young function. [Figure: plot of Ψ_α for α = 1/2, with z_α and x_α marked on the horizontal axis.]

(iii) Show that there exists a constant K_α for which

Ψ_α(x) ≤ exp(x^α) ≤ K_α + Ψ_α(x)   for all x ≥ 0.

(iv) Show that Ψ_α(x)Ψ_α(y) ≤ Ψ_α(2^{1/α} xy) for x ∧ y ≥ x_α.

(v) Deduce that there exists a constant Cα for which

C_α( Ψ_α^{−1}(u) + Ψ_α^{−1}(v) ) ≥ Ψ_α^{−1}(uv)   when u ∧ v ≥ 1.

Basic::P:slope.zero [7] Suppose Ψ is a Young function with right-hand derivative τ > 0 at 0. Define Φ(x) = Ψ(x) − τx for x ≥ 0. If Φ is not identically zero then it is a Young function. For that case show that

‖f‖_Ψ ≥ ‖f‖_Φ ≥ ‖f‖_Ψ/K   where K = 1 + τΦ^{−1}(1),

for each measurable real function f.


Basic::P:Stirling [8] For each k in ℕ prove that e^k ≥ k^k/k! ≥ 1. Compare with the sharper estimate via Stirling's formula (Feller, 1968, Section 2.9),

k^k e^{−k}/k! = e^{−r_k}/√(2πk)   where 1/(12k) > r_k > 1/(12k + 1).
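Both the crude bounds and the Stirling sandwich can be checked numerically; the sketch below (plain Python) solves for r_k from the definition k! = √(2πk) k^k e^{−k} e^{r_k}.

```python
import math

# r_k is defined by k! = sqrt(2*pi*k) * k^k * e^{-k} * e^{r_k}, so that
# k^k e^{-k} / k! = e^{-r_k} / sqrt(2*pi*k).  Solve for r_k via lgamma and
# check the claimed sandwich 1/(12k+1) < r_k < 1/(12k), plus e^k >= k^k/k! >= 1.
for k in range(1, 60):
    rk = math.lgamma(k + 1) - 0.5 * math.log(2 * math.pi * k) - k * math.log(k) + k
    assert 1 / (12 * k + 1) < rk < 1 / (12 * k)
    log_ratio = k * math.log(k) - math.lgamma(k + 1)   # log(k^k / k!)
    assert 0 <= log_ratio <= k                         # i.e. 1 <= k^k/k! <= e^k
```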

Basic::P:interp [9] Suppose n₀ = 1 ≤ n₁ < n₂ < … is an increasing sequence of positive integers with sup_{k∈ℕ} n_k/n_{k−1} = B < ∞ and f : [1, ∞) → ℝ⁺ is an increasing function with sup_{t≥1} f(Bt)/f(t) = A < ∞. For each random variable X show that

sup_{r≥1} ‖X‖_r/f(r) ≤ A sup_{k∈ℕ} ‖X‖_{n_k}/f(n_k).

Basic::P:moment.Psia [10] With Ψ_α as in Problem [3], prove the existence of positive constants c_α and C_α for which

c_α ‖X‖_{Ψ_α} ≤ sup_{r≥1} ‖X‖_r / r^{1/α} ≤ C_α ‖X‖_{Ψ_α}

for every random variable X.

Basic::P:subg.var [11] Theorem <16> suggests that ‖X‖₂ might be comparable to β(X) if X has a subgaussian distribution. It is easy to deduce from inequality <10> that ‖X‖₂ ≤ 2β(X). Show that there is no companion inequality in the other direction by considering the bounded, symmetric random variable X for which P{X = ±M} = δ and P{X = 0} = 1 − 2δ. If 2δ = M^{−2} show that PX² = 1 but

log( 1 + θ²/2! + θ⁴M²/4! ) ≤ log Pe^{θX} ≤ τ²θ²/2   for all real θ

would force τ² ≥ 2 log(3/2 + M²/24).

Basic::P:subg.suff [12] Show that a random variable X that has either of the following two properties is subgaussian.

(i) For some positive constants c1, c2, and c3,

P{|X| ≥ x} ≤ c₁ exp(−c₂x²)   for all x ≥ c₃.

(ii) For some positive constants c1, c2, and c3,

Pe^{θX} ≤ c₁ exp(c₂θ²)   for all |θ| ≥ c₃.

Basic::P:subg.limit1 [13] Suppose {X_n} is a sequence of random variables for which

P exp(θX_n) ≤ exp(τ²θ²/2)   for all n and all real θ,


where τ is a fixed positive constant. If X_n → X almost surely, show that Pe^{θX} ≤ exp(τ²θ²/2) for all real θ.

Basic::P:lagrange [14] Suppose f is a sufficiently smooth real-valued function defined at least in a neighborhood of the origin of the real line. Define

G(x) = ( f(x) − f(0) − xf′(0) )/(x²/2)   if x ≠ 0,   and G(0) = f″(0).

Prove the following facts.

(i) f(x) − f(0) − xf′(0) = x² ∬ f″(xs){0 < s < t < 1} ds dt.

(ii) If f is convex then G is nonnegative.

(iii) If f″ is an increasing function then so is G.

(iv) If f″ is a convex function then so is G. Moreover G(x) ≥ f″(x/3). Hint: Jensen and 2∬ s{0 < s < t < 1} ds dt = 1/3.

(v) For the special case where f(x) = (1 + x) log(1 + x) − x show that G = ψ_Benn from <18> and G(x) ≥ (1 + x/3)^{−1}.
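With ψ_Benn as in <18>, the lower bound in part (v) can be probed on a grid; a sketch in plain Python:

```python
import math

def psi_benn(x):
    # psi_Benn(x) = F*(x)/(x^2/2) with F*(x) = (1+x)log(1+x) - x, for x > -1, x != 0.
    return ((1 + x) * math.log(1 + x) - x) / (x * x / 2)

# Part (v) asserts G = psi_Benn for f(x) = (1+x)log(1+x) - x, with
# G(x) >= f''(x/3) = 1/(1 + x/3).  Check on a grid of x in (-1, 30):
for k in range(-9, 300):
    x = k / 10
    if x == 0:
        continue                      # G(0) = f''(0) = 1 by definition
    assert psi_benn(x) >= 1 / (1 + x / 3) - 1e-12
```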

Basic::P:Poisson [15] Recall the notation F(x) = e^x − 1 − x for x ∈ ℝ and ψ_Benn(y) = F*(y)/(y²/2) where F*(y) = (1 + y) log(1 + y) − y for y ≥ −1. Suppose X has a Poisson(λ) distribution.

(i) Show that L(θ) := log Pe^{θX} = λθ + λF(θ).

(ii) For x ≥ 0, deduce that

P{X ≥ λ + x} ≤ exp( −λF*(x/λ) ) = exp( −(x²/(2λ)) ψ_Benn(x/λ) ).

(iii) For 0 ≤ x ≤ λ show that

P{X ≤ λ − x} ≤ inf_{θ≥0} exp( λF(−θ) − θx ) = exp( −(x²/(2λ)) ψ_Benn(−x/λ) ) ≤ exp( −x²/(2λ) ).
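For concreteness, the upper-tail bound in part (ii) can be compared with the exact Poisson tail; a sketch in plain Python with illustrative values of λ and x:

```python
import math

def poisson_upper_tail(lam, m):
    # P{X >= m} for X ~ Poisson(lam), via the complementary pmf sum.
    return 1.0 - sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(m))

def F_star(y):
    # F*(y) = (1+y)log(1+y) - y for y >= -1.
    return (1 + y) * math.log(1 + y) - y

lam = 5.0                              # illustrative
for x in [1, 3, 7]:                    # integer x so lam + x is a pmf cutoff
    exact = poisson_upper_tail(lam, int(lam + x))
    bound = math.exp(-lam * F_star(x / lam))
    assert exact <= bound              # part (ii): P{X >= lam + x} <= exp(-lam F*(x/lam))
```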

Basic::P:BinPoisson [16] To what extent can the bounds from Problem [15] be recovered as limit- ing cases of the bounds derived via the Bennett inequality applied to the Bin(n, λ/n) distribution?

Basic::P:diag.HW [17] Suppose ∞ > γ = ‖X‖_{Ψ₂}. Use Theorem <16> to deduce that PX^{2k} ≤ (γ√(2k))^{2k} ≤ (2eγ²)^k k!. Use Bernstein's inequality from Theorem <23> to deduce a tail bound analogous to Theorem <33> for Σ_i α_i(X_i² − PX_i²) with independent subgaussians X₁, …, X_n and constants α_i.


2.10 Notes

Basic::S:Notes Anyone who has read Massart's Saint-Flour lectures (Massart, 2003) or Lugosi's lecture notes (Lugosi, 2003) will realize how much I have been influenced by those authors. Many of the ideas in those lectures reappear in the book by Boucheron, Lugosi, and Massart (2013).

Many authors seem to credit Chernoff (1952) with the moment generating trick in <2>, even though the idea is obviously much older: compare with the treatment of the Bernstein inequality by Uspensky (1937, pages 204–206). Chernoff himself gave credit to Cramér for "extremely more powerful results" obtained under stronger assumptions.

The characterizations for subgaussians in Theorem <16> are essentially due to Kahane (1968, Exercise 6.10).

Hoeffding (1963) established many tail bounds for sums of independent random variables. He also commented that the results extend to martingales.

The Bennett inequality for independent summands was proved by Bennett (1962) by a more complicated method. Section 2.6 on the Bernstein inequality is based on the exposition by Uspensky (1937, pages 204–206), with refinements from Bennett (1962) and Massart (2003, Section 1.2.3). See also Boucheron et al. (2013, Sections 2.4, 2.8) for a slightly more detailed exposition, including the idea of using the positive part to get a one-sided bound, which they credited to Emmanuel Rio. Van der Vaart and Wellner (1996, page 103) noted the connection between the moment assumption <24> and the behavior of PΦ₁(θ|X_i|).

Dudley (1999, Appendix H) used the name Young-Orlicz modulus for what I am calling a Young function. Every Young function can be written as Ψ(t) = ∫₀ᵗ ψ(x) dx, where ψ is the right-hand derivative of Ψ. The function ψ is increasing and right-continuous.
For their N-functions, Krasnosel’skiĭ and Rutickiĭ (1961, Section 1.3) added the requirements that ψ(0) = 0 < ψ(x) for x > 0 and ψ(x) → ∞ as x → ∞, which ensures that Ψ is strictly increasing with Ψ(t)/t → ∞ as t → ∞. Dudley called such a function an Orlicz modulus. When ψ(0) = 0 there is a measure µ on B(ℝ⁺) for which ψ(b) = µ(0, b] for all intervals (0, b]. In that case PΨ(X) = µ^s P(X − s)⁺, a representation closely related to Lemma 1 of Panchenko (2003).


References

Bennett62jasa Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57, 33–45.

BLM2013Concentration Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

bourgain1998random Bourgain, J. (1998). Random points in isotropic convex sets. MSRI Publications: Convex Geometric Analysis 34, 53–58.

BoydVandenberghe2004 Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.

Chernoff52AMS Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Annals of Mathematical Statistics 23 (4), 493–507.

Dudley99book Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge Uni- versity Press.

Feller1 Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.

Hoeffding:63 Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.

Kahane68RSF Kahane, J.-P. (1968). Some Random Series of Functions. Heath. Second edition: Cambridge University Press 1985.

KrasnoselskiiRutickii1961 Krasnosel’skiĭ, M. A. and Y. B. Rutickiĭ (1961). Convex Functions and Orlicz Spaces. Noordhoff. Translated from the first Russian edition by Leo F. Boron.

Lugosi2003ANU Lugosi, G. (2003). Concentration-of-measure inequalities. Notes from the Summer School on Machine Learning, Australian National University. Available at http://www.econ.upf.es/˜lugosi/.

Massart03Flour Massart, P. (2003). Concentration Inequalities and Model Selection, Volume 1896 of Lecture Notes in Mathematics. Springer Verlag. Lectures given at the 33rd Probability Summer School in Saint-Flour.


Panchenko2003AP Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability 31, 2068–2081.

RudelsonVershynin2013ECP Rudelson, M. and R. Vershynin (2013). Hanson-Wright inequality and subgaussian concentration. Electron. Commun. Probab. 18 (82), 1–9.

Uspensky1937book Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw- Hill.

vaartwellner96book van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.

vershynin2010compressed Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.
