2 A few good inequalities
2.1 Tail bounds and concentration
2.2 Moment generating functions
2.3 Orlicz norms
2.4 Subgaussian tails
2.5 Bennett's inequality
2.6 Bernstein's inequality
2.7 Subexponential distributions?
2.8 The Hanson-Wright inequality for subgaussians
2.9 Problems
2.10 Notes

Printed: 27 November 2015

version: 20nov15 Mini-empirical © David Pollard

Chapter 2. A few good inequalities

Section 2.1 introduces exponential tail bounds. Section 2.2 introduces the method for bounding tail probabilities using moment generating functions. Section 2.3 describes analogs of the usual Lp norms, which have proved useful in empirical process theory. Section 2.4 discusses subgaussianity, a most useful extension of the gaussianity property. Section 2.5 discusses the Bennett inequality for sums of independent random variables that are bounded above by a constant. Section 2.6 shows how the Bennett inequality implies one form of the Bernstein inequality, then discusses extensions to unbounded summands. Section 2.7 describes two ways to capture the idea of tails that decrease like those of the exponential distribution. Section 2.8 illustrates the ideas from the previous Section by deriving a subgaussian/subexponential tail bound for quadratic forms in independent subgaussian random variables.

2.1 Tail bounds and concentration

Sums of independent random variables are often approximately normally distributed. In particular, their tail probabilities often decrease in ways similar to the tails of a normal distribution. Several classical inequalities make this vague idea more precise. Such inequalities provided the foundation on which empirical process theory was built.


We know a lot about the normal distribution. For example, if W has a N(µ, σ²) distribution then (Feller, 1968, Section 7.1)

    (1/x − 1/x³) e^{−x²/2}/√(2π) ≤ P{W ≥ µ + σx} ≤ (1/x) e^{−x²/2}/√(2π) for all x > 0.

Clearly the inequalities are useful only for larger x: as x decreases to zero the lower bound goes to −∞ and the upper bound goes to +∞. For many purposes the exp(−x²/2) factor matters the most. Indeed, the simpler tail bound P{W ≥ µ + σx} ≤ exp(−x²/2) for x ≥ 0 often suffices for asymptotic arguments. Sometimes we need a better inequality showing that the distribution concentrates most of its probability mass in a small region, such as

    P{|W − µ| ≥ σx} ≤ 2e^{−x²/2} for all x ≥ 0,

a concentration inequality. Similar-looking tail bounds exist for random variables that are not exactly normally distributed. Particularly useful are the so-called exponential tail bounds, which take the form

<1>    P{X ≥ x + PX} ≤ e^{−g(x)} for x ≥ 0,

with g a function that typically increases like a power of x. Sometimes g behaves like a quadratic for some range of x and like a linear function further out in the tails; sometimes some logarithmic terms creep in; and sometimes it depends on var(X) or on some more exotic measure of size, such as an Orlicz norm (see Section 2.3). For example, suppose S = X_1 + ··· + X_n, a sum of independent random variables with PX_i = 0 and each X_i bounded above by some constant b. The Bernstein inequality (Section 2.6) tells us that

    P{S ≥ x√V} ≤ exp( −(x²/2)/(1 + bx/(3√V)) ) for x ≥ 0,

where V is any upper bound for var(S). If each X_i is also bounded below by −b then a similar tail bound exists for −S, which leads to a concentration inequality, an upper bound for P{|S| ≥ x√V}. For x much smaller than √V/b the exponent in the Bernstein inequality behaves like −x²/2; for much larger x the exponent behaves like a negative multiple of x. In a sense that is made more precise in Sections 2.6 and 2.7, the inequality gives a subgaussian bound for x not too big and a subexponential bound for very large x.
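These normal tail bounds are easy to confirm numerically. A minimal sketch (the helper names are mine), using the stdlib math.erfc for the exact N(0, 1) upper tail:

```python
import math

def normal_upper_tail(x):
    # P{Z >= x} for Z ~ N(0,1), via the complementary error function
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def feller_bounds(x):
    # the lower and upper bounds of the displayed sandwich (Feller 1968, Section 7.1)
    phi = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    return (1.0 / x - 1.0 / x**3) * phi, phi / x

for x in [0.5, 1.0, 2.0, 4.0]:
    lo, hi = feller_bounds(x)
    assert lo <= normal_upper_tail(x) <= hi
    # the cruder bound exp(-x^2/2) also holds for x >= 0
    assert normal_upper_tail(x) <= math.exp(-x * x / 2.0)
```

For x = 0.5 the lower bound is negative, illustrating the remark that the sandwich is useful only for larger x.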


2.2 Moment generating functions

The most common way to get a bound like <1> uses the moment generating function, M(θ) := Pe^{θX}. From the fact that exp is nonnegative and exp(θ(X − w)) ≥ 1 when X ≥ w and θ ≥ 0 we have

<2>    P{X ≥ w} ≤ inf_{θ≥0} Pe^{θ(X−w)} = inf_{θ≥0} e^{−θw} M(θ).

<3> Example. If X has a N(µ, σ²) distribution then M(θ) = exp(θµ + σ²θ²/2) is finite for all real θ. For x ≥ 0 inequality <2> gives

    P{X ≥ µ + σx} ≤ inf_{θ≥0} exp( −θ(µ + σx) + θµ + θ²σ²/2 ) = exp(−x²/2) for all x ≥ 0,

the minimum being achieved by θ = x/σ. Analogous arguments, with X − µ replaced by µ − X, give an analogous bound for the lower tail,

    P{X ≤ µ − σx} ≤ exp(−x²/2) for all x > 0,

leading to the concentration inequality P{|X − µ| ≥ σx} ≤ 2e^{−x²/2}.

Remark. Of course the algebra would have been a tad simpler if I had worked with the standardized variable (X − µ)/σ. I did things the messier way in order to make a point about the centering in Section 2.4.
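The minimization in Example <3> can be checked numerically; the sketch below (names mine) compares a grid search over θ with the closed form at θ = x/σ:

```python
import math

def chernoff_bound(mu, sigma, x, theta):
    # exp(-theta*(mu + sigma*x)) * M(theta), for M(theta) = exp(theta*mu + sigma^2 theta^2/2)
    return math.exp(-theta * (mu + sigma * x) + theta * mu + theta**2 * sigma**2 / 2.0)

mu, sigma, x = 1.0, 2.0, 1.5
best = min(chernoff_bound(mu, sigma, x, k * 0.001) for k in range(10001))
# the infimum over theta >= 0 equals exp(-x^2/2), attained at theta = x/sigma
assert abs(best - math.exp(-x * x / 2.0)) < 1e-4
assert abs(chernoff_bound(mu, sigma, x, x / sigma) - math.exp(-x * x / 2.0)) < 1e-12
```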

Exactly the same arguments work for any random variable whose moment generating function is bounded above by exp(θµ + τ²θ²/2), for all real θ and some finite constants µ and τ. This property essentially defines the subgaussian distributions, which are discussed in Section 2.4. The function L(θ) := log M(θ) is called the cumulant-generating function. The upper bound in <2> is often rewritten as

    e^{−L*(w)} where L*(w) := sup_{θ≥0}( θw − L(θ) ).

Remark. As Boucheron et al. (2013, Section 2.2) and others have noted, the function L* is the Fenchel-Legendre transform of L, which is also known as the conjugate function (Boyd and Vandenberghe, 2004, Section 3.3).


Clearly L(0) = 0. The Hölder inequality shows that L is convex: if θ is a convex combination α_1θ_1 + α_2θ_2 then

    e^{L(θ)} = M(α_1θ_1 + α_2θ_2) = P( e^{α_1θ_1X} e^{α_2θ_2X} ) ≤ (Pe^{θ_1X})^{α_1} (Pe^{θ_2X})^{α_2} = e^{α_1L(θ_1) + α_2L(θ_2)}.

More precisely (Problem [1]), the set J = {θ : M(θ) < ∞} is an interval, possibly degenerate or unbounded, containing the origin. For example, for the Cauchy distribution, M(θ) = ∞ for all real θ ≠ 0; for the standard exponential distribution (density e^{−x}{x > 0} with respect to Lebesgue measure),

    M(θ) = (1 − θ)^{−1} for θ < 1, and M(θ) = ∞ otherwise.

Problem [1] also justifies the interchange of differentiation and expectation on the interior of J, which leads to

    L'(θ) = M'(θ)/M(θ) = P( X e^{θX}/M(θ) )
    L''(θ) = M''(θ)/M(θ) − ( M'(θ)/M(θ) )² = P( X² e^{θX}/M(θ) ) − ( P( X e^{θX}/M(θ) ) )².

If we define a probability measure P_θ by its density, dP_θ/dP = e^{θX}/M(θ), then the derivatives of L can be re-expressed using moments of X under P_θ:

    L'(θ) = P_θX and L''(θ) = P_θ( X − P_θX )² = var_θ(X) ≥ 0.

In particular, L'(0) = PX, so that Taylor expansion gives

<4>    L(θ) = θPX + ½θ² var_{θ*}(X) with θ* lying between 0 and θ.

Bounds on the variance of X under P_θ lead to bounds on L and tail bounds for X − PX. The Hoeffding inequality (Example <14>) provides a good illustration of this line of attack.
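The tilting identity L'(θ) = P_θX can be illustrated for the standard exponential distribution, where M(θ) = 1/(1 − θ) and the tilted distribution is again exponential, with mean 1/(1 − θ). A crude numerical-integration sketch (names mine):

```python
import math

def tilted_mean(theta, h=1e-3, upper=100.0):
    # E_theta[X] for the standard exponential tilted by dP_theta/dP = exp(theta*x)/M(theta),
    # with M(theta) = 1/(1 - theta) for theta < 1; crude Riemann sum over [0, upper]
    m = 1.0 / (1.0 - theta)
    total, x = 0.0, 0.0
    while x < upper:
        total += x * math.exp(theta * x - x) / m * h
        x += h
    return total

# L'(theta) = 1/(1 - theta) should match the mean under the tilted measure
for theta in [0.0, 0.3, 0.6]:
    assert abs(tilted_mean(theta) - 1.0 / (1.0 - theta)) < 1e-2
```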

2.3 Orlicz norms

There is a fruitful way to generalize the concept of an Lp norm by means of a Young function, that is, a convex, increasing function Ψ : R⁺ → R⁺ with Ψ(0) = 0 and Ψ(x) → ∞ as x → ∞. The concept applies with any measure µ but I need it mostly for probability measures.


Remark. The assumption that Ψ(x) → ∞ as x → ∞ is needed only to rule out the trivial case where Ψ ≡ 0. Indeed, if Ψ(x_0) > 0 for some x_0 then, for x = x_0/α with 0 < α < 1,

    Ψ(x_0) = Ψ( αx + (1 − α)0 ) ≤ αΨ(x).

That is, Ψ(x)/x ≥ Ψ(x_0)/x_0 for all x ≥ x_0. If Ψ is not identically zero then it must increase at least linearly with increasing x.

At one time Orlicz norms became very popular in the empirical process literature, in part (I suspect) because chaining arguments (see Chapter 4) are cleaner with norms than with tail probabilities. The most important Young functions for my purposes are the power functions, Ψ(x) = x^p for a fixed p ≥ 1, and the exponential power functions Ψ_α(x) := exp(x^α) − 1 for α ≥ 1. The function Ψ_2 is particularly useful (see Section 2.4) for working with distributions whose tails decrease like a Gaussian. It is also possible (Problem [6]) to define a Young function Ψ_α, for 0 < α < 1, for which

    Ψ_α(x) ≤ exp(x^α) ≤ K_α + Ψ_α(x) for all x ≥ 0,

where K_α is a positive constant. Some linear interpolation near 0 is needed to overcome the annoyance that x ↦ exp(x^α) is concave on the interval [0, z_α], for z_α := ((1 − α)/α)^{1/α}.

<5> Definition. Let (X, A, µ) be a measure space and f be an A-measurable real function on X. For each Young function Ψ the corresponding Orlicz norm ‖f‖_Ψ (or Ψ-norm) is defined by

    ‖f‖_Ψ = inf{ c > 0 : µΨ(|f|/c) ≤ 1 },

with ‖f‖_Ψ = ∞ if the infimum runs over an empty set.

Remark. It is not hard to show that ‖f‖_Ψ < ∞ if and only if µΨ(|f|/C_0) < ∞ for at least one finite constant C_0. Moreover, if 0 < c = ‖f‖_Ψ < ∞ then µΨ(|f|/c) ≤ 1; the infimum is achieved.

When restricted to the set L_Ψ(X, A, µ) of all measurable real functions on X with finite Ψ-norm, ‖·‖_Ψ is a seminorm. It fails to be a norm only because ‖f‖_Ψ = 0 if and only if µ{x : f(x) ≠ 0} = 0. The corresponding set of µ-equivalence classes is a Banach space. It is traditional and benign to call ‖·‖_Ψ a norm even though it is only a seminorm on L_Ψ(X, A, µ).

The assumption Ψ(0) = 0 can be discarded if µ is a finite measure. For each finite constant A > Ψ(0), the quantity inf{c > 0 : µΨ(|f|/c) ≤ A} also defines a seminorm.


Finiteness of the Ψ-norm for a random variable Z immediately implies a tail bound: if 0 < σ = ‖Z‖_Ψ < ∞ then

<6>    P{|Z| ≥ xσ} ≤ min( 1, PΨ(|Z|/σ)/Ψ(x) ) ≤ min( 1, 1/Ψ(x) ) for x ≥ 0.

For example, for each α > 0 there is a constant C_α for which

    P{ |Z| ≥ x ‖Z‖_{Ψ_α} } ≤ C_α e^{−x^α} for x ≥ 0.

The constant can be chosen so that C_α e^{−x^α} ≥ 1 for all x so small that Ψ_α(x)e^{−x^α} is not close to 1. When seeking to bound an Orlicz norm I often find it easier to first obtain a constant C_0 for which PΨ(|X|/C_0) ≤ K_0, where K_0 is a constant strictly larger than 1. As the next Lemma shows, a simple convexity argument turns such an inequality into a bound on ‖X‖_Ψ.

<7> Lemma. Suppose Ψ is a Young function. If PΨ(|X|/C_0) ≤ K_0 for some finite constants C_0 and K_0 > 1 then ‖X‖_Ψ ≤ C_0K_0.

Proof. For each θ in [0, 1] convexity of Ψ gives

    PΨ( θ|X|/C_0 ) ≤ θ PΨ( |X|/C_0 ) + (1 − θ)Ψ(0) ≤ θK_0.

The choice θ = 1/K_0 makes the last bound equal to 1. 

Beware. The Lemma does not give a foolproof way of determining reasonable bounds for the Ψ-norm. For example, if X has a N(0, 1) distribution then simple integration shows that

    PΨ_2(|X|/c) = g(c) := c/√(c² − 2) − 1 for c > √2.

Thus ‖X‖_{Ψ_2} = √(8/3) but cg(c) → ∞ as c decreases to √2.
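The √(8/3) value in the Beware remark, and the closed form for g(c), can both be reproduced by numerical integration. A sketch (names mine):

```python
import math

def psi2_moment(c, h=1e-3, upper=12.0):
    # E Psi_2(|X|/c) = E exp(X^2/c^2) - 1 for X ~ N(0,1), by crude numerical
    # integration; the expectation is finite only for c > sqrt(2)
    total, x = 0.0, -upper
    while x < upper:
        total += math.exp(x * x / (c * c) - x * x / 2.0) / math.sqrt(2.0 * math.pi) * h
        x += h
    return total - 1.0

# closed form g(c) = c/sqrt(c^2 - 2) - 1, and the root g(c) = 1 at c = sqrt(8/3)
assert abs(psi2_moment(2.0) - (2.0 / math.sqrt(2.0) - 1.0)) < 1e-3
assert abs(psi2_moment(math.sqrt(8.0 / 3.0)) - 1.0) < 1e-3
```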

2.4 Subgaussian tails

As noted in Section 2.2, if X is a random variable for which there exist constants µ and σ for which Pe^{θX} ≤ exp(µθ + ½σ²θ²) for all real θ then

<8>    P{X ≥ µ + σx} ≤ e^{−x²/2} for all x ≥ 0.


The approximations for θ near zero,

    Pe^{θX} = 1 + θPX + ½θ²PX² + o(θ²),
    exp(µθ + ½σ²θ²) = 1 + µθ + ½(σ² + µ²)θ² + o(θ²),

would force PX = µ and PX² ≤ σ² + µ², that is, var(X) ≤ σ². As Problem [11] shows, σ² might need to be much larger than the variance. To avoid any hint of the convention that σ² denotes a variance, I feel it is safer to replace σ by another symbol.

<9> Definition. Say that a random variable X with PX = µ has a subgaussian distribution with scale factor τ < ∞ if Pe^{θX} ≤ exp(µθ + τ²θ²/2) for all real θ. Write τ(X) for the smallest τ for which such a bound holds.

Remark. Some authors call such an X a τ-subgaussian random variable. Some authors require µ = 0 for subgaussianity; others use the qualifier centered to refer to the case where µ is zero.

For a subgaussian X, the two-sided analog of <8> gives

<10>    P{|X − µ| ≥ x} ≤ 2 exp( −x²/(2β²) ) for all x ≥ 0,

for some 0 < β ≤ τ(X). In fact (Theorem <16>) existence of such a tail bound is equivalent to subgaussianity.

<11> Example. Suppose Y = (Y_1, ..., Y_n) has a multivariate normal distribution. Define M := max_{i≤n} |Y_i| and σ² := max_{i≤n} var(Y_i). Chapter 6 explains why

    P{|M − PM| ≥ σx} ≤ 2 exp(−x²/2) for all x ≥ 0.

That is, M is subgaussian—a most surprising and wonderful fact. 

Many useful results that hold for gaussian variables can be extended to subgaussians. In empirical process theory, conditional subgaussianity plays a key role in symmetrization arguments (see Chapter 8). The first example captures the main idea.

<12> Example. Suppose X = Σ_{i≤n} b_i s_i, where the b_i's are constants and the s_i's are independent Rademacher random variables, that is, P{s_i = +1} = 1/2 = P{s_i = −1}. Then

    Pe^{θb_i s_i} = ½( e^{θb_i} + e^{−θb_i} ) = Σ_{k≥0} (θb_i)^{2k}/(2k)! ≤ exp(θ²b_i²/2)


so that Pe^{θX} = Π_{i≤n} Pe^{θb_i s_i} ≤ e^{θ²τ²/2} with τ² = Σ_{i≤n} b_i². That is, X has a centered subgaussian distribution with τ²(X) ≤ Σ_{i≤n} b_i². 

More generally, if X is a sum of independent subgaussians X_1, ..., X_n then the equality

    Pe^{θ(X−PX)} = Π_i Pe^{θ(X_i−PX_i)}

shows that X is subgaussian with τ²(X) ≤ Σ_i τ²(X_i). For sums of independent variables we need consider only the subgaussianity of each summand.
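The series comparison in Example <12> amounts to cosh(t) ≤ exp(t²/2), which holds term by term because (2k)! ≥ 2^k k!; both the single-term and the product form are easy to confirm numerically:

```python
import math

# mgf of a single Rademacher term: cosh(t) <= exp(t^2/2)
for k in range(0, 1001):
    t = k * 0.01
    assert math.cosh(t) <= math.exp(t * t / 2.0)

# product form for X = sum_i b_i s_i (illustrative coefficients)
b = [0.5, -1.0, 2.0]
theta = 0.7
mgf = 1.0
for bi in b:
    mgf *= math.cosh(theta * bi)
assert mgf <= math.exp(theta**2 * sum(bi * bi for bi in b) / 2.0)
```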

<13> Example. Suppose X is a random variable with PX = µ and a ≤ X ≤ b, for finite constants a and b. By representation <4>, its moment generating function M(θ) satisfies

    log M(θ) = θPX + ½θ² var_{θ*}(X) with θ* lying between 0 and θ.

Write µ* for P_{θ*}X and m for the midpoint (a + b)/2. Note that |X − m| ≤ (b − a)/2. Thus

    var_{θ*}(X) = P_{θ*}|X − µ*|² ≤ P_{θ*}|X − m|² ≤ (b − a)²/4

and log Pe^{θ(X−µ)} ≤ (b − a)²θ²/8. The random variable X is subgaussian with τ(X) ≤ (b − a)/2. 

Remark. Hoeffding (1963, page 22) derived the same bound on the moment generating function by a direct appeal to convexity of the exponential function:

    Pe^{θX} ≤ qe^{θa} + pe^{θb} where p = P(X − a)/(b − a) and q = 1 − p.

In effect, the bound represents the extremal case where X concentrates on the two-point set {a, b}. He then calculated a second derivative for L, without interpreting the result as a variance but using an inequality that can be given a variance interpretation.

<14> Example. (Hoeffding's inequality for independent summands) Suppose random variables X_1, ..., X_n are independent with PX_i = µ_i and a_i ≤ X_i ≤ b_i for each i, for constants a_i and b_i. By Example <13>, each X_i is subgaussian with τ(X_i) ≤ (b_i − a_i)/2. The sum S = Σ_{i≤n} X_i is also subgaussian, with µ = Σ_{i≤n} µ_i
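The bound from Example <13>, in Hoeffding's two-point extremal form, can be checked numerically; a sketch (names mine):

```python
import math

def centered_mgf_two_point(a, b, p, theta):
    # E exp(theta*(X - mu)) for X = b with prob p, X = a with prob 1 - p
    mu = (1.0 - p) * a + p * b
    return (1.0 - p) * math.exp(theta * (a - mu)) + p * math.exp(theta * (b - mu))

a, b = -1.0, 3.0
for p in [0.1, 0.5, 0.9]:
    for theta in [-2.0, -0.5, 0.3, 1.0]:
        # Hoeffding: log mgf <= (b - a)^2 theta^2 / 8
        assert centered_mgf_two_point(a, b, p, theta) <= math.exp((b - a)**2 * theta**2 / 8.0)
```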


and τ²(S) = Σ_{i≤n} τ²(X_i) = Σ_{i≤n} (b_i − a_i)²/4. The one-sided version of inequality <10> gives

<15>    P{ Σ_{i≤n} (X_i − µ_i) ≥ x } ≤ exp( −2x² / Σ_{i≤n} (b_i − a_i)² ) for each x ≥ 0.

See Chapter 3 for an extension of Hoeffding's inequality <15> to sums of bounded martingale differences. 

The next Theorem provides other characterizations of subgaussianity, using the Orlicz norm for the Young function Ψ_2(x) = exp(x²) − 1 and moment inequalities. The proof of the Theorem provides a preview of a "symmetrization" argument that plays a major role in Chapter 8. I prefer to write the argument using product measures rather than conditional expectations.

Here is the idea. Suppose X_1 is a random variable with PX_1 = µ, defined on a probability space (Ω_1, F_1, P_1). Make an "independent copy" (Ω_2, F_2, P_2) of the probability space then carry X_1 over to a new random variable X_2 on Ω_2. Under P_1 ⊗ P_2 the random variables X_1 and X_2 are independent and identically distributed.

Remark. You might prefer to start with an X defined on (Ω, F, P) then define random variables on Ω × Ω by X1(ω1, ω2) = X(ω1) and X2(ω1, ω2) = X(ω2). Under P ⊗ P the random variables X1 and X2 are independent and each has the same distribution as X (under P). I once referred to independent copies during a seminar in a Math department. One member of the audience protested vehemently that a copy of X is the same as X and hence can only be independent of X in trivial cases.

Note that P_2X_2 = P_1X_1 = µ. For each convex function f,

    P_1 f(X_1 − µ) = P_1^{ω_1} f[ X_1(ω_1) − P_2^{ω_2} X_2(ω_2) ]
                  = P_1^{ω_1} f[ P_2^{ω_2}( X_1(ω_1) − X_2(ω_2) ) ]
                  ≤ P_1^{ω_1} P_2^{ω_2} f[ X_1(ω_1) − X_2(ω_2) ],

the last line coming from Jensen's inequality for P_2.
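Before turning to the theorem, inequality <15> can be sanity-checked against exact binomial tails for fair Bernoulli summands (a_i = 0, b_i = 1); a sketch (names mine):

```python
import math

def binom_tail_centered(n, x):
    # exact P{ sum_i (X_i - 1/2) >= x } for n independent fair Bernoulli X_i
    return sum(math.comb(n, k) for k in range(math.ceil(n / 2 + x), n + 1)) / 2**n

n = 30
for x in [1.0, 3.0, 5.0]:
    # <15> with a_i = 0, b_i = 1: the bound is exp(-2 x^2 / n)
    assert binom_tail_centered(n, x) <= math.exp(-2.0 * x * x / n)
```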

<16> Theorem. For each random variable X with expected value µ, the following assertions are equivalent.

(i) X has a subgaussian distribution (with τ(X) < ∞)

(ii) L(X) := sup_{r≥1} ‖X‖_r/√r < ∞

(iii) ‖X‖_{Ψ_2} < ∞

(iv) there exists a positive constant β for which P{|X − µ| ≥ x} ≤ 2 exp( −x²/(2β²) ) for all x ≥ 0

More precisely,

(a) c_0 ‖X‖_{Ψ_2} ≤ L(X) ≤ ‖X‖_{Ψ_2} with c_0 = 1/√(4e);

(b) if β(X) denotes the smallest β for which inequality (iv) holds,

    c_1 ‖X − µ‖_{Ψ_2} ≤ β(X) ≤ τ(X) ≤ 4L(X − µ) with c_1 = 1/√6.

Remark. Finiteness of L(X) is also equivalent to finiteness of L(X − µ) because |µ| ≤ ‖X‖_1 ≤ L(X) and |L(X) − L(X − µ)| ≤ |µ|. Consequently L(X − µ) ≤ 2L(X), but there can be no universal inequality in the other direction: L(X − µ) is unchanged by addition of a constant to X but L(X) can be made arbitrarily large. The precise values of the constants c_i are not important. Most likely they could be improved, but I can see no good reason for trying.

Proof. Implicit in the sequences of inequalities is the assertion that finiteness of one quantity implies finiteness of another quantity.

Proof of (a). Problem [8] shows that k^k ≥ k! ≥ (k/e)^k for all k ∈ N. Suppose PΨ_2(|X|/D) ≤ 1. Then Σ_{k∈N} P(X/D)^{2k}/k! ≤ 1 so that

    ‖X‖_{2k} ≤ D(k!)^{1/2k} ≤ D√k for each k ∈ N

and L(X) ≤ √2 sup_{k∈N} ‖X‖_{2k}/√(2k) ≤ D by Problem [9]. Conversely, if ∞ > L ≥ ‖X‖_r/√r for all r ≥ 1 then

    PΨ_2(|X|/D) = Σ_{k∈N} ‖X/D‖_{2k}^{2k}/k! ≤ Σ_{k∈N} (L√(2k)/D)^{2k} (e/k)^k,

which equals 1 if D = √(4e) L.

Proof of (b). None of the quantities appearing in the inequalities depend on µ, so we may as well assume µ = 0. Inequality <10> gives β(X) ≤ τ(X).

Simplify notation by writing β for β(X) and τ for τ(X) and L for L(X), and define γ := ‖X‖_{Ψ_2}. Also, assume that X is not a constant, to avoid annoying trivial cases.

To show that γ ≤ β√6:

    PΨ_2( |X − µ|/(β√6) ) = ∫_0^∞ e^x P{ |X − µ| ≥ β√(6x) } dx ≤ ∫_0^∞ 2 exp(x − 3x) dx = 1.

To show that τ ≤ 4L: Symmetrize. Construct a random variable X' with the same distribution as X but independent of X. By the triangle inequality,

    ‖X − X'‖_r ≤ 2‖X‖_r ≤ 2L√r for all r ≥ 1.

By Jensen's inequality,

    Pe^{θX} ≤ Pe^{θ(X−X')}
           = 1 + Σ_{k∈N} θ^{2k} P(X − X')^{2k}/(2k)!    [symmetry kills odd moments]
           ≤ 1 + Σ_{k∈N} (2L√(2k))^{2k} θ^{2k}/(k^k k!)
           = 1 + Σ_{k∈N} (8θ²L²)^k/k! = e^{8L²θ²},

so that τ(X) ≤ 4L. 
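The elementary inequalities quoted from Problem [8], the bound (2k)! ≥ k^k k! used above, and the final geometric series are all easy to confirm:

```python
import math

# Problem [8]: k^k >= k! >= (k/e)^k; the proof also uses (2k)! >= k^k * k!
for k in range(1, 40):
    assert k**k >= math.factorial(k) >= (k / math.e)**k
    assert math.factorial(2 * k) >= k**k * math.factorial(k)

# with D^2 = 4e L^2 the final series is geometric: sum_k (2e L^2/D^2)^k = sum_k 2^-k = 1
L = 1.7
D2 = 4.0 * math.e * L * L
assert abs(sum((2.0 * math.e * L * L / D2)**k for k in range(1, 60)) - 1.0) < 1e-12
```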

2.5 Bennett's inequality

The Hoeffding inequality depends on the somewhat crude bound

    var(W) ≤ (b − a)²/4 if a ≤ W ≤ b.

Such an inequality can be too pessimistic if W has only a small probability of taking values near the endpoints of the interval [a, b]. The inequalities can sometimes be improved by introducing second moments explicitly into the upper bound. Bennett's one-sided tail bound provides a particularly elegant illustration of the principle.


<17> Theorem. Suppose X_1, ..., X_n are independent random variables for which X_i ≤ b and PX_i = µ_i and PX_i² ≤ v_i for each i, for nonnegative constants b and v_i. Let W equal Σ_{i≤n} v_i. Then

    P{ Σ_{i≤n} (X_i − µ_i) ≥ x } ≤ exp( −(x²/(2W)) ψ_Benn(bx/W) ) for x ≥ 0

where ψ_Benn denotes the function defined on [−1, ∞) by

<18>    ψ_Benn(t) := ( (1 + t) log(1 + t) − t ) / (t²/2) for t ≠ 0, and ψ_Benn(0) = 1.

Remark. By Problem [14], ψ_Benn is positive, decreasing, and convex, with ψ_Benn(−1) = 2 and ψ_Benn(t) decreasing like (2/t) log t as t → ∞. When bx/W is close to zero the ψ_Benn contribution to the upper bound is close to 1 and the exponent is close to a subgaussian −x²/(2W). (If b = 0 then it is exactly subgaussian.) Further out in the tail, for larger x, the exponent decreases like −(x/b) log(bx/W), which is slightly faster than exponential.
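The properties of ψ_Benn claimed in the remark can be confirmed numerically; a sketch (the series guard near zero is mine, to avoid floating-point cancellation):

```python
import math

def psi_benn(t):
    # psi_Benn(t) = ((1+t)*log(1+t) - t)/(t^2/2) on [-1, infinity),
    # with value 1 at t = 0; Taylor guard 1 - t/3 for tiny |t|
    if abs(t) < 1e-6:
        return 1.0 - t / 3.0
    return ((1.0 + t) * math.log1p(t) - t) / (t * t / 2.0)

ts = [-1 + k * 0.01 for k in range(1, 1001)]
# positive and decreasing on (-1, infinity)
assert all(psi_benn(a) >= psi_benn(b) > 0 for a, b in zip(ts, ts[1:]))
# limit 2 at t = -1, and decrease like (2/t) log t as t -> infinity
assert abs(psi_benn(-1 + 1e-9) - 2.0) < 1e-6
assert abs(psi_benn(1e6) / ((2 / 1e6) * math.log(1e6)) - 1.0) < 0.1
```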

Proof. Define Φ_1(x) = e^x − 1 − x, the convex function Ψ_1 adjusted to have zero derivative at the origin, as in Problem [7]. From Problem [14], the positive function

<19>    ∆(x) = Φ_1(x)/(x²/2) for x ≠ 0, and ∆(0) = 1,

is continuous and increasing on the whole real line. For θ ≥ 0,

    Pe^{θX_i} = 1 + θPX_i + PΦ_1(θX_i)
             = 1 + θµ_i + P( ½(θX_i)² ∆(θX_i) )
             ≤ 1 + θµ_i + ½θ² P(X_i²) ∆(θb)    [because ∆ is increasing]
<20>         ≤ exp( θµ_i + ½θ²v_i ∆(θb) ).

That is,

<21>    Pe^{θ(X_i−µ_i)} ≤ exp( ½θ²v_i ∆(θb) ) = exp( (v_i/b²) Φ_1(θb) ) for θ ≥ 0.

Remark. In the last expression I am implicitly assuming that b is nonzero. We can recover the bound when b = 0 by a passage to the limit as b decreases to zero. Actually, the b = 0 case is quite noteworthy because the ∆ contribution disappears from the upper bound for P exp(θX_i) and the ψ_Benn disappears from the final inequality.


Abbreviate Σ_{i≤n}(X_i − µ_i) to T. By independence and <21>,

    Pe^{θT} = Π_{i≤n} Pe^{θ(X_i−µ_i)} ≤ exp( (W/b²) Φ_1(θb) )

and, by <2>,

    P{T ≥ x} ≤ inf_{θ≥0} exp( −xθ + (W/b²) Φ_1(θb) )
            = exp( −(W/b²) sup_{bθ≥0}( (bx/W)(bθ) − Φ_1(θb) ) )
            = exp( −(W/b²) Φ_1^*(bx/W) ),

where Φ_1^* denotes the conjugate of Φ_1,

    Φ_1^*(y) := sup_t( ty − Φ_1(t) ) = (1 + y) log(1 + y) − y if y ≥ −1, and +∞ otherwise.

When y ≥ −1 the supremum is achieved at t = log(1 + y). Thus ½y² ψ_Benn(y) = Φ_1^*(y) for y ≥ −1 and (W/b²) Φ_1^*(bx/W) = (x²/(2W)) ψ_Benn(bx/W), as asserted. 

Remark. The inequality still holds if 0 > b ≥ −x/W, although the proof changes slightly. What happens for smaller b?

Unlike the case of the two-sided Hoeffding inequality, there is no automatic lower-tail analog for the Bennett inequality without an explicit lower bound for the summands. One particularly interesting case occurs when X_i ≥ 0:

    P{ Σ_{i≤n}(X_i − µ_i) ≤ −x } = P{ Σ_{i≤n}(µ_i − X_i) ≥ x } ≤ exp( −x²/(2W) ) for all x ≥ 0,

a clean (one-sided) subgaussian tail bound. The Bennett inequality also works for the Poisson distribution. If X has a Poisson(λ) distribution then (Problem [15])

    P{X ≥ λ + x} ≤ exp( −(x²/(2λ)) ψ_Benn(x/λ) ) for x ≥ 0

and, for 0 ≤ x ≤ λ,

    P{X ≤ λ − x} ≤ exp( −(x²/(2λ)) ψ_Benn(−x/λ) ) ≤ exp( −x²/(2λ) ),


a perfect subgaussian lower tail. See Problem [16] if you have any doubts about the boundedness of Poisson random variables. See Chapter 3 for an extension of the Bennett inequality to sums of martingale differences.
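The Poisson form of the Bennett bound can be checked against exact Poisson tails; a sketch (names mine):

```python
import math

def psi_benn(t):
    # the function of display <18>, with a series guard near zero
    if abs(t) < 1e-6:
        return 1.0 - t / 3.0
    return ((1.0 + t) * math.log1p(t) - t) / (t * t / 2.0)

def poisson_upper_tail(lam, m):
    # P{X >= m} for X ~ Poisson(lam), m a nonnegative integer
    return 1.0 - sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(m))

lam = 5.0
for x in [1, 3, 7, 12]:
    bound = math.exp(-(x * x / (2.0 * lam)) * psi_benn(x / lam))
    assert poisson_upper_tail(lam, 5 + x) <= bound
```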

2.6 Bernstein's inequality

As shown by Problem [14], ψ_Benn(t) ≥ (1 + t/3)^{−1} for t ≥ −1. Thus the Bennett inequality implies a weaker result: if X_1, ..., X_n are independent with PX_i = µ_i and PX_i² ≤ v_i and X_i ≤ b then

<22>    P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ exp( −(x²/2)/(W + bx/3) ) for x ≥ 0,

a form of Bernstein's inequality. When bx/W is close to zero the exponent is again close to −x²/(2W); when bx/W is large, the Bernstein inequality loses the extra log term in the exponent. If the summands X_i are bounded in absolute value (not just bounded above) we get an analogous two-sided inequality.

The assumptions PX_i² ≤ v_i and |X_i| ≤ b imply

    P|X_i|^k ≤ P(X_i²) b^{k−2} ≤ v_i b^{k−2} for k ≥ 2.

A much weaker condition on the growth of P|X_i|^k with k, without any assumption of boundedness, leads to a useful upper bound for the moment generating function and a more general version of the Bernstein inequality.

<23> Theorem. (Bernstein inequality) Suppose X_1, ..., X_n are independent random variables with PX_i = µ_i and

<24>    P|X_i|^k ≤ ½ v_i B^{k−2} k! for k ≥ 2.

Define W = Σ_{i≤n} v_i. Then, for x ≥ 0,

    P{ Σ_{i≤n}(X_i − µ_i) ≥ x } ≤ e^{−H_Benn(x,B,W)} ≤ e^{−H_Bern(x,B,W)}

where

    H_Bern(x, B, W) = (x²/W) / ( 2(1 + xB/W) )
    H_Benn(x, B, W) = (x²/W) / ( 1 + xB/W + √(1 + 2xB/W) ).


Remark. The B here does not represent an upper bound for X_i. The traditional form of Bernstein's inequality, with H_Bern, corresponds to the bound <22> derived from the Bennett inequality under the assumption X_i ≤ b, with B = b/3.
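The ordering of the two exponents is elementary, since 1 + u + √(1 + 2u) ≤ 2(1 + u); a numerical confirmation (names mine):

```python
import math

def h_bern(x, B, W):
    # traditional Bernstein exponent
    return (x * x / W) / (2.0 * (1.0 + x * B / W))

def h_benn(x, B, W):
    # sharper exponent from Theorem <23>
    u = x * B / W
    return (x * x / W) / (1.0 + u + math.sqrt(1.0 + 2.0 * u))

# H_Benn >= H_Bern, so exp(-H_Benn) is always the sharper bound
for x in [0.1, 1.0, 10.0, 100.0]:
    for B in [0.0, 0.5, 3.0]:
        assert h_benn(x, B, 2.0) >= h_bern(x, B, 2.0) > 0.0
```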

Proof. As before,

    F := P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ inf_{θ≥0} e^{−(x+Σ_i µ_i)θ} Π_{i≤n} Pe^{θX_i}.

Use the moment inequality <24> to bound the moment generating function for X_i:

    Pe^{θX_i} = 1 + θµ_i + PΦ_1(θX_i) where Φ_1(t) := e^t − 1 − t

and, for θ ≥ 0,

    PΦ_1(θX_i) ≤ PΦ_1(θ|X_i|) = Σ_{k≥2} θ^k P|X_i|^k / k!
              ≤ ½ v_i θ² Σ_{k≥2} (θB)^{k−2}
<25>          = ½ v_i θ²/(1 − Bθ) provided 0 ≤ θ < 1/B.

Thus

    Pe^{θX_i} ≤ 1 + θµ_i + ½ v_i θ²/(1 − Bθ) ≤ exp( θµ_i + ½ v_i θ²/(1 − Bθ) ) for 0 ≤ θB < 1

and

    F ≤ exp( −sup_{0≤Bθ<1} g(θ) ) where g(θ) := xθ − ½Wθ²/(1 − Bθ).

The H_Bern bound. As an approximate maximizer for g choose θ_0 = x/(W + xB), so that

    F ≤ e^{−g(θ_0)} = exp( −(x²/2)/(W + xB) ).

Bennett (1962, equation (7)) made this strange choice by defining G = 1 − θB then minimizing −xθ + ½θ²W/G while ignoring the fact that G depends on θ. That gives θ = xG/W = (1 − θB)x/W, a linear equation with solution θ_0. Very confusing. As Bennett noted, a boundedness assumption |X_i| ≤ b


actually implies <24> with B = b/3. The exponential upper bound then agrees with <22>.

The H_Benn bound. Following Bennett (1962, equation (7a)), this time we really maximize g(θ). The calculation is cleaner if we rewrite g with u := xB/W and s := 1 − θB. With those substitutions,

    g(θ) = g_0(s) = (W/(2B²))( 2u(1 − s) − (1 − s)²/s ) = (W/(2B²))( −s^{−1} − (1 + 2u)s + 2(1 + u) ).

The maximum is achieved at s_0 = (1 + 2u)^{−1/2}, giving F ≤ e^{−g_0(s_0)} where

<26>    g_0(s_0) = (W/B²)( 1 + u − √(1 + 2u) ) = (W/B²) · u²/( 1 + u + √(1 + 2u) ).



Both inequalities in Theorem <23> have companion lower bounds, obtained by replacing X_i by −X_i in all the calculations. This works because the Bernstein condition <24> also holds with X_i replaced by −X_i. The inequalities could all have been written as two-sided bounds, that is, as upper bounds for P{ |Σ_i(X_i − µ_i)| ≥ x }.

If we are really only interested in a one-sided bound then it seems superfluous to control the size of |X_i|. For example, for the upper tail it should be enough to control X_i⁺. Boucheron et al. (2013, page 37), hereafter BLM, derived such a bound (which they attributed to Emmanuel Rio) by means of the elementary inequality

    e^x − 1 − x ≤ ½x² + Σ_{k≥3} (x⁺)^k/k! for all real x.

(For x ≥ 0 both sides are equal. For x < 0 it reflects the fact that g(x) = e^x − (1 + x + ½x²) has a nonnegative derivative with g(0) = 0, so that g(x) ≤ 0 for x ≤ 0.) They replaced the assumption <24> by

    PX_i² = v_i and P(X_i⁺)^k ≤ ½ v_i B^{k−2} k! for k ≥ 3.

For 0 ≤ θ < 1/B, we then get

    Pe^{θX_i} ≤ 1 + θµ_i + ½θ²v_i + ½v_iθ² Σ_{k≥3} (θB)^{k−2} ≤ exp( θµ_i + ½ v_i θ²/(1 − θB) ),


which is exactly the same bound on the moment generating function as in the proof of Theorem <23>. As with the H_Benn calculation it follows that

    P{ Σ_i X_i ≥ x + Σ_i µ_i } ≤ exp( −(W/B²)( 1 + u − √(1 + 2u) ) )

where W = Σ_i v_i and u = xB/W. BLM observed that the upper bound can be inverted by solving a quadratic in u,

    1 + 2u = ( 1 + u − tB²/W )²,

which has a positive solution u = tB²/W + √(2tB²/W), corresponding to x = Wu/B = tB + √(2tW). The one-sided Bernstein bound then takes an elegant form,

<27>    P{ Σ_i(X_i − µ_i) ≥ tB + √(2tW) } ≤ e^{−t} for t ≥ 0.

By splitting according to whether tB ≥ √(2tW) or not, and by introducing an extraneous factor of 2, we can split the elegant form into two more suggestive inequalities,

    P{ Σ_i(X_i − µ_i) ≥ 2tB } ≤ e^{−t} for t ≥ 2W/B²
    P{ Σ_i(X_i − µ_i) ≥ 2t√W } ≤ e^{−t²/2} for t²/2 ≤ 2W/B².

These inequalities make clear that the subexponential part (large t) of the tail is controlled by B and the subgaussian part (moderate t) is controlled by W.
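The algebra behind this inversion is exact: with u = xB/W and x = tB + √(2tW), one finds 1 + 2u = (1 + √(2tB²/W))², so the exponent collapses to t. A numerical check (names mine):

```python
import math

def bennett_exponent(x, B, W):
    # (W/B^2) * (1 + u - sqrt(1 + 2u)) with u = xB/W
    u = x * B / W
    return (W / B**2) * (1.0 + u - math.sqrt(1.0 + 2.0 * u))

B, W = 0.7, 3.0
for t in [0.5, 2.0, 10.0]:
    x = t * B + math.sqrt(2.0 * t * W)
    # at x = tB + sqrt(2tW) the exponent equals t exactly
    assert abs(bennett_exponent(x, B, W) - t) < 1e-9
```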

2.7 Subexponential distributions?

To me, subexponential behavior refers to functions that are bounded above by a constant multiple of exp(−c|x|), for some positive constant c. For example, the Bernstein inequality shows that, far enough from the mean, the tail probabilities for a large class of sums of independent random variables are subexponential. Closer to the mean the tails are subgaussian, which one might think is better than subexponential. As another example, if a random variable X takes values in [−1, +1] we have P{|X| ≥ x} = 0 ≤ exp(−10^{10}x) for all x ≥ 1, but I would hesitate to call that behavior subexponential.

What then does it mean for a whole distribution to be subexponential? And how is subexponentiality related to the Bernstein moment condition,

<28>    P( |X|^k/(B^k k!) ) ≤ ½ v/B² for k ≥ 2?


As you already know, condition <28> implies

<29>    Pe^{θX} − 1 − θPX = PΦ_1(θX) ≤ PΦ_1(θ|X|) ≤ ½ vθ²/(1 − θB) for 0 ≤ θB < 1,

which implies

    Pe^{θX} ≤ exp( θµ + ½ vθ²/(1 − θB) ) for 0 ≤ θB < 1 and µ = PX.

Van der Vaart and Wellner (1996, page 103) pointed out that condition <28> is implied by

    ½ v/B² ≥ PΦ_1(|X|/B) = Σ_{k≥2} P|X|^k/(B^k k!) where Φ_1(x) := e^x − 1 − x.

And, if <28> holds, then

    PΦ_1( |X|/(2B) ) = Σ_{k≥2} P|X|^k/(2^k B^k k!) ≤ ½ (2v)/(2B)²,

which is just <29> with θ = 1/(2B). It would seem that the moment condition is somehow related to ‖X‖_{Φ_1}. Indeed, Vershynin (2010, Section 5.2.4) defined a random variable X to be subexponential if sup_{k∈N} ‖X‖_k/k < ∞. By Problems [7] and [10], this condition is equivalent to finiteness of the Orlicz norm ‖X‖_{Φ_1}.

Boucheron et al. (2013, Section 2.4) defined a centered random variable X to be sub-gamma on the right tail with variance factor v and scale factor B if

<30>    Pe^{θX} ≤ exp( (vθ²/2)/(1 − θB) ) for 0 ≤ θB < 1.

They pointed out that if X has a gamma(α, β) distribution (with mean µ = αβ and variance v = αβ²) then

    Pe^{θ(X−µ)} = e^{−θµ}(1 − θβ)^{−α} = exp( −θαβ − α log(1 − θβ) ) ≤ exp( (αβ²θ²/2)/(1 − θβ) ) for 0 ≤ θβ < 1.

Hence the name. The special case where α = 1 corresponds to the exponential.
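The displayed inequality reduces to −u − log(1 − u) ≤ ½u²/(1 − u) for u = θβ in [0, 1), which follows from the series −log(1 − u) = Σ_k u^k/k. A numerical check (names mine):

```python
import math

def log_mgf_centered_gamma(theta, alpha, beta):
    # log E exp(theta*(X - mu)) for X ~ gamma(alpha, beta), mu = alpha*beta
    return -theta * alpha * beta - alpha * math.log(1.0 - theta * beta)

alpha, beta = 2.0, 1.5
for tb in [0.0, 0.3, 0.7, 0.95]:
    theta = tb / beta
    sub_gamma = alpha * beta**2 * theta**2 / 2.0 / (1.0 - theta * beta)
    assert log_mgf_centered_gamma(theta, alpha, beta) <= sub_gamma + 1e-12
```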


Let me try to tie these definitions and facts together. First note that the upper bound in <29> equals 1 if θ = 2/( B + √(B² + 2v) ). Thus the Bernstein assumption <28> implies

    γ := ‖X‖_{Φ_1} ≤ ( B + √(B² + 2v) )/2.

If, as is sometimes assumed, v = PX² then <28> also implies

    v^{3/2} ≤ P|X|³ ≤ ½ v 3! B.

That is, v ≤ (3B)² and B < γ ≤ 2.7B. It appears that the B, or ‖X‖_{Φ_1}, controls the subexponential part of the tail probabilities, at least as far as the Bernstein inequality is concerned. Inequality <27> conveyed the same message. Compare with the subexponential bound

    P{|X| ≥ x} ≤ min( 1, 1/Φ_1(x/γ) ) for γ = ‖X‖_{Φ_1}.

The inequality PΦ_1(|X|/γ) ≤ 1 also implies

    P( |X|^k/(γ^k k!) ) ≤ 1 = ½ (2γ²)/γ² for k ≥ 2.

That is, <28> holds with v = 2γ² and B = γ. Theorem <23> then gives

    P{X − µ ≥ x} ≤ exp( −x²/(4γ² + 2xγ) ) ≤ exp( −x²/(8γ²) ) for 0 ≤ x ≤ 2γ.

Is that not a subgaussian bound? Unfortunately, the bound—subgaussian or not—has little value for such a small range near zero. The upper bound always exceeds 1/√e. The useful subgaussian part of the bound is contributed by the v in <28>. The effect is clearer for a sum of independent random variables X_1, ..., X_n, each distributed like X:

    P exp( θ Σ_{i≤n} X_i ) ≤ exp( (nvθ²/2)/(1 − θB) ) for 0 ≤ θB < 1,

which leads to tail bounds like

    P{ Σ_i X_i ≥ x } ≤ exp( −(x²/2)/(nv + xB) ),


useful subgaussian behavior when 0 ≤ x ≤ nv/B. In my opinion, the sub-gamma property <30> is better thought of as a restricted subgaussian assumption that automatically implies subexponential behavior far out in the tails. For each α in (0, 1) it implies

    Pe^{θX} ≤ exp( (vθ²/2)/(1 − α) ) for 0 ≤ θB ≤ α < 1.

Remark. Uspensky (1937, page 204) used this approach for the Bernstein inequality, before making the choice α = xB/(W + xB) to recover the traditional H_Bern form of the upper bound.

More generally, if we have a bound

<31>    Pe^{θ(X−µ)} ≤ e^{vθ²/2} for 0 ≤ θ ≤ ρ,

with v and ρ positive constants and µ = PX, then for x ≥ 0,

<32>    P{X ≥ µ + x} ≤ min_{0≤θ≤ρ} exp( −xθ + ½vθ² )
               = exp( −½x²/v ) for 0 ≤ x ≤ ρv
               = exp( −ρx + ½vρ² ) < e^{−xρ/2} for x > ρv.

The minimum is achieved by θ = min(ρ, x/v). The proof in the next Section of a generalized version of the Hanson-Wright inequality—showing that a quadratic form Σ_{i,j} X_i A_{i,j} X_j in independent subgaussians has a subgaussian/subexponential tail—illustrates this approach.
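The closed-form minimization in <32> can be verified against a grid search; a sketch (names mine):

```python
import math

def restricted_chernoff(x, v, rho):
    # grid minimum of exp(-x*theta + v*theta^2/2) over theta in [0, rho],
    # and the closed form at the claimed minimizer theta = min(rho, x/v)
    grid = min(math.exp(-x * (k / 1000.0) * rho + v * ((k / 1000.0) * rho)**2 / 2.0)
               for k in range(1001))
    theta = min(rho, x / v)
    return grid, math.exp(-x * theta + v * theta * theta / 2.0)

v, rho = 2.0, 1.5
for x in [0.5, 3.0, 6.0]:
    grid, closed = restricted_chernoff(x, v, rho)
    assert closed <= grid + 1e-9       # theta = min(rho, x/v) attains the minimum
    if x <= rho * v:
        assert abs(closed - math.exp(-x * x / (2.0 * v))) < 1e-12
```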

2.8 The Hanson-Wright inequality for subgaussians

Basic::S:HW The following is based on an elegant argument by Rudelson and Vershynin (2013) = R&V. To shorten the proof I assume that the matrix A has zeros down its diagonal, so that P Σ_{i,j} X_i A_{i,j} X_j = 0 and I can skip the part of the argument (see Problem [17]) that deals with Σ_i A_{i,i}(X_i² − PX_i²).

Basic::HW <33> Theorem. Let X₁, …, X_n be independent, centered subgaussian random variables with τ ≥ max_i τ(X_i). Let A be an n × n matrix of real numbers with A_{ii} = 0 for each i. Then

P{X′AX ≥ t} ≤ exp( −min( t²/(64τ⁴‖A‖²_HS), t/(8√2 τ²‖A‖₂) ) )   for t ≥ 0,

where ‖A‖_HS := √trace(A′A) = √(Σ_{i,j} A²_{i,j}) and ‖A‖₂ := sup_{‖u‖₂≤1} ‖Au‖₂.


Remark. There are no assumptions of symmetry or positive definiteness on A. The proof also works with A replaced by −A, leading to a similar upper bound for P{|X′AX| ≥ t}. The Hilbert-Schmidt norm ‖A‖_HS is also called the Frobenius norm. The constants 64 and 8√2 are not important.
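As a quick numerical illustration (not part of the proof), one can compare the bound of Theorem <33> with the empirical tail of X′AX when the X_i are N(0, 1), which are subgaussian with τ = 1. The sketch below uses numpy; the matrix A, the sample size, and the thresholds t are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
A = rng.standard_normal((n, n))
np.fill_diagonal(A, 0.0)               # Theorem <33> assumes A_ii = 0

hs = np.linalg.norm(A, 'fro')          # ||A||_HS = sqrt(trace(A'A))
op = np.linalg.norm(A, 2)              # ||A||_2 = largest singular value
tau = 1.0                              # N(0,1) coordinates are subgaussian with tau = 1

# Empirical tail of X'AX for N(0, I_n) coordinates.
X = rng.standard_normal((100000, n))
quad = np.einsum('ki,ij,kj->k', X, A, X)

for t in [5.0, 20.0, 50.0]:
    bound = np.exp(-min(t**2 / (64 * tau**4 * hs**2),
                        t / (8 * np.sqrt(2) * tau**2 * op)))
    empirical = np.mean(quad >= t)
    assert empirical <= bound          # the bound is quite loose in practice
```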

Proof. Without loss of generality, suppose τ = 1, so that Pe^{θX_i} ≤ e^{θ²/2} for all i and all real θ, and P exp(α′X) ≤ exp(‖α‖₂²/2) for each α ∈ ℝⁿ. As explained at the end of Section 2.7, it suffices to show

Pe^{θX′AX} ≤ exp(½vθ²)   for 0 ≤ θ ≤ ρ,

where v = 32‖A‖²_HS and ρ = (√32 ‖A‖₂)^{−1}.

To avoid conditioning arguments, assume P = ⊗_{i∈[n]} P_i on B(ℝ^{[n]}), where [n] := {1, 2, …, n}, and each X_i is a coordinate map, that is, X_i(x) = x_i and X(x) = x. Write G for the N(0, I_n) distribution, another probability measure on B(ℝ^{[n]}). Of course G = ⊗_{i∈[n]} G_i with each G_i equal to N(0, 1). For each subset J of [n] write P_J for ⊗_{i∈J} P_i and G_J for ⊗_{i∈J} G_i and, for generic vectors w = (w_i : i ∈ [n]) in ℝ^{[n]}, write w_J for (w_i : i ∈ J) ∈ ℝ^J.

The proof works by bounding Pe^{θX′AX} by the G expected value of an analogous quadratic, using a decoupling argument that R&V attributed to Bourgain (1998).

For each δ = (δ_i : i ∈ [n]) ∈ {0, 1}^{[n]} define A_δ to be the n × n matrix with (i, j)th element δ_i(1 − δ_j)A_{i,j}. Write D_δ for {i ∈ [n] : δ_i = 1} and N_δ for [n]\D_δ. Then X′A_δX = X′_{D_δ} E_δ X_{N_δ} where E_δ := A[D_δ, N_δ]. If the rows and columns of A were permuted so that i < i′ for all i ∈ D_δ and i′ ∈ N_δ then A_δ would look like

        ( 0   E_δ )
A_δ =   ( 0   0   )    where E_δ := A[D_δ, N_δ].

Now let Q denote the uniform distribution on {0, 1}^{[n]}. Under Q the δ_i's are independent Ber(1/2) random variables, so that QA_δ = ¼A. For each θ, Jensen's inequality and Fubini give

P exp(θX′AX) = P^x exp(4θ Q^δ X′A_δX) ≤ Q^δ P^x exp(4θX′A_δX).

Write Q_δ for X′A_δX. It therefore suffices to show that

P exp(4θQ_δ) ≤ exp(½vθ²)   for each δ and all 0 ≤ θ ≤ ρ.

From this point onwards the δ is held fixed. To simplify notation I drop most of the subscript δ's, writing X_D instead of X_{D_δ}, and so on. Of course A_δ cannot drop its subscript.
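The averaging identity QA_δ = ¼A is a finite sum, so it can be verified directly for a small n by enumerating all 2ⁿ vectors δ; a sketch in plain Python, with an arbitrary illustrative 4 × 4 matrix:

```python
import itertools

n = 4
# An arbitrary illustrative matrix with zero diagonal, as in Theorem <33>.
A = [[0, 1, 2, 3],
     [4, 0, 5, 6],
     [7, 8, 0, 9],
     [1, 2, 3, 0]]

# Average the matrices A_delta, with (i,j) entry delta_i*(1-delta_j)*A[i][j],
# over all 2^n choices of delta in {0,1}^n (the uniform distribution Q).
avg = [[0.0] * n for _ in range(n)]
for delta in itertools.product([0, 1], repeat=n):
    for i in range(n):
        for j in range(n):
            avg[i][j] += delta[i] * (1 - delta[j]) * A[i][j] / 2**n

# Under Q the delta_i are independent Ber(1/2), so Q(delta_i*(1 - delta_j)) = 1/4
# for i != j, giving Q A_delta = A/4 (the diagonal of A is zero anyway).
for i in range(n):
    for j in range(n):
        assert abs(avg[i][j] - A[i][j] / 4) < 1e-12
```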


Under P_D, with X_N held fixed, the quadratic Q_δ := X′A_δX = X′_D E X_N is a linear combination of the independent subgaussians X_D. Thus

P_D e^{4θQ_δ} ≤ exp( ½(4θ)² ‖EX_N‖₂² ) = G_D exp(4θQ_δ).

Similarly, P_N exp(4θQ_δ) ≤ G_N exp(4θQ_δ). Thus

P exp(4θQ_δ) = P_D P_N exp(4θQ_δ) ≤ P_D G_N exp(4θQ_δ) = G_N P_D exp(4θQ_δ)
≤ G_N G_D exp(4θQ_δ) = G_N e^{8θ² X′_N E′E X_N}.

The problem is now reduced to its gaussian analog, for which rotational symmetry of G_N allows further simplification by means of a diagonalization of the positive semidefinite matrix E′E = L′ΛL, where L is orthogonal and Λ = diag(λ₁, …, λ_n), with λ₁ ≥ λ₂ ≥ … ≥ λ_n ≥ 0 the eigenvalues of E′E.

G_N e^{8θ² X′_N E′E X_N} = G_N exp(8θ² X′_N Λ X_N) = ∏_{i∈N} G_i exp(8θ²λ_i x_i²)

\E@ G.bnd <34>   ≤ exp( 16θ² Σ_{i∈N} λ_i )   provided 8θ² max_{i∈N} λ_i ≤ 1/4.

The last inequality uses the fact that, if Z ∼ N(0, 1),

Pe^{sZ²} = (1 − 2s)^{−1/2} = exp( −½ log(1 − 2s) ) ≤ exp(2s)   if 0 ≤ s ≤ 1/4.

Now we need some matrix facts to bring everything back to an expression involving A. For notational simplicity again assume that the rows D all precede the rows N, so that

      ( B₁  E  )                ( B₁′B₁ + B₂′B₂   B₁′E + B₂′B₃ )
A =   ( B₂  B₃ )    and  A′A =  ( E′B₁ + B₃′B₂    E′E + B₃′B₃  ).

It follows that

‖A‖²_HS = trace(A′A) ≥ trace(E′E) = Σ_{i∈N} λ_i

‖A‖₂² = sup_{‖v‖₂≤1} ‖Av‖₂² ≥ sup_{‖u‖₂≤1} ‖Eu‖₂² = λ₁.

Combine the various inequalities to conclude that

P exp(θX′AX) ≤ exp( 16θ²‖A‖²_HS )   provided 0 ≤ θ ≤ (√32 ‖A‖₂)^{−1}.

An appeal to inequality <32> completes the proof. □
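The elementary moment bound used in the proof, Pe^{sZ²} = (1 − 2s)^{−1/2} ≤ e^{2s} for 0 ≤ s ≤ 1/4, is easy to confirm numerically; the grid check below (plain Python) also shows that the restriction on s matters.

```python
import math

# If Z ~ N(0,1) then P e^{s Z^2} = (1 - 2s)^{-1/2} for s < 1/2.  The proof
# uses (1 - 2s)^{-1/2} <= exp(2s) on 0 <= s <= 1/4, equivalently
# -log(1 - 2s)/2 <= 2s on that range.  Check on a fine grid:
for k in range(10001):
    s = 0.25 * k / 10000
    mgf = (1 - 2 * s) ** (-0.5)
    assert mgf <= math.exp(2 * s) + 1e-12

# The inequality fails not far beyond s = 1/4, so the restriction is needed:
assert (1 - 2 * 0.45) ** (-0.5) > math.exp(2 * 0.45)
```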


2.9 Problems Basic::S:Problems

Basic::P:convex.interval [1] Define a = sup{θ < 0 : M(θ) = +∞} and b = inf{θ > 0 : M(θ) = +∞}, where M(θ) = Pe^{θX}. For simplicity, suppose both a and b are finite and a < b.

(i) Show that the restriction of M to the interval [a, b] is continuous. (For example, if M(b) = +∞ you should show that M(θ) → +∞ as θ increases to b.)

(ii) Suppose θ ∈ (a, b). Choose δ > 0 so that [θ − 2δ, θ + 2δ] ⊂ (a, b). For |h| < δ, establish the domination bound

|e^{(θ+h)X} − e^{θX}|/|h| ≤ ∫₀¹ |X| e^{(θ+sh)X} ds ≤ max( e^{(θ+2δ)X}, e^{θX}, e^{(θ−2δ)X} )/δ.

Deduce via a Dominated Convergence argument that M′(θ) = P(Xe^{θX}).

(iii) Argue similarly to justify another differentiation inside the expectation, leading to the expression for M″(θ).

Basic::P:e.to.g [2] Suppose h(x) = e^{g(x)}, where g is a twice differentiable real-valued function on ℝ⁺.

(i) Show that h is convex on any interval J for which (g′(x))² + g″(x) ≥ 0 for x ∈ int(J).

Basic::P:Psia.ge1 [3] For a fixed α ≥ 1 define Ψ_α(x) = e^{x^α} − 1 for x ≥ 0.

(i) Show that Ψ_α is a Young function.

(ii) Show that

Ψ_α(x)Ψ_α(y) ≤ exp(x^α + y^α) − 1 ≤ Ψ_α(x + y)   for all x, y ≥ 0,
Ψ_α(x)Ψ_α(y) ≤ Ψ_α(2^{1/α} xy)   for x ∧ y ≥ 1.

(iii) Deduce from the inequality Ψα(x)Ψα(y) ≤ Ψα(x + y) that

Ψ_α^{−1}(uv) ≤ Ψ_α^{−1}(u) + Ψ_α^{−1}(v)   for all u, v ≥ 0.
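The inequalities in Problem [3] can be spot-checked numerically; a sketch in plain Python with the illustrative choice α = 2:

```python
import math

def psi(alpha, x):
    # Psi_alpha(x) = exp(x^alpha) - 1 for x >= 0
    return math.exp(x ** alpha) - 1.0

alpha = 2.0                          # illustrative choice with alpha >= 1
pts = [0.1 * k for k in range(31)]   # grid of x, y values in [0, 3]
for x in pts:
    for y in pts:
        lhs = psi(alpha, x) * psi(alpha, y)
        mid = math.exp(x**alpha + y**alpha) - 1
        # Chain of inequalities from part (ii):
        assert lhs <= mid + 1e-9
        assert mid <= psi(alpha, x + y) * (1 + 1e-12) + 1e-9
        if min(x, y) >= 1:
            # The product bound that holds only for x and y at least 1.
            assert lhs <= psi(alpha, 2 ** (1 / alpha) * x * y) + 1e-9
```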

Basic::P:extend.to.Young [4] Suppose f is a convex, nonnegative, increasing function on an interval [a, ∞), with a > 0. Suppose also that there exists an x0 ∈ [a, ∞) for which


f(x₀)/x₀ = τ := inf_{x≥a} f(x)/x. Show that h(x) := τx{0 ≤ x < x₀} + f(x){x ≥ x₀} is a Young function.

+ + Basic::P:extend.range [5] Suppose f : R → R is strictly increasing with f(0) = 0 and

f(u1u2) ≤ c0 (f(u1) + f(u2)) for min(u1, u2) ≥ v0,

for some constants c0 and v0 > 0. For each v1 in (0, v0) prove existence of a larger constant c1 for which

f(u1u2) ≤ c1 (f(u1) + f(u2)) for min(u1, u2) ≥ v1.

Hint: For u_i* = max(u_i, v₀) show that f(u_i*) ≤ f(u_i)f(v₀)/f(v₁).

Basic::P:Psia.lt1 [6] For a fixed α ∈ (0, 1) define f_α(x) = e^{x^α} for x ≥ 0.

(i) Show that x ↦ f_α(x) is convex for x ≥ z_α := (α^{−1} − 1)^{1/α} and concave for 0 ≤ x ≤ z_α.

(ii) Define x_α = (1/α)^{1/α} and τ_α = f_α(x_α)/x_α = (αe)^{1/α}. Show that

Ψ_α(x) := τ_α x{x < x_α} + f_α(x){x ≥ x_α}

is a Young function. [Figure: plot of Ψ_α for α = 1/2, with z_α and x_α marked on the horizontal axis.]

(iii) Show that there exists a constant K_α for which

Ψ_α(x) ≤ exp(x^α) ≤ K_α + Ψ_α(x)   for all x ≥ 0.

(iv) Show that Ψ_α(x)Ψ_α(y) ≤ Ψ_α(2^{1/α} xy) for x ∧ y ≥ x_α.

(v) Deduce that there exists a constant Cα for which

C_α( Ψ_α^{−1}(u) + Ψ_α^{−1}(v) ) ≥ Ψ_α^{−1}(uv)   when u ∧ v ≥ 1.

Basic::P:slope.zero [7] Suppose Ψ is a Young function with right-hand derivative τ > 0 at 0. Define Φ(x) = Ψ(x) − τx for x ≥ 0. If Φ is not identically zero then it is a Young function. For that case show that

‖f‖_Ψ ≥ ‖f‖_Φ ≥ ‖f‖_Ψ/K   where K = 1 + τΦ^{−1}(1),

for each measurable real function f.


Basic::P:Stirling [8] For each k in ℕ prove that e^k ≥ k^k/k! ≥ 1. Compare with the sharper estimate via Stirling's formula (Feller, 1968, Section 2.9),

k^k e^{−k}/k! = e^{−r_k}/√(2πk)   where 1/(12k) > r_k > 1/(12k + 1).
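Both the crude bounds and the Stirling sandwich can be checked numerically; the sketch below (plain Python) solves for r_k from the definition k! = √(2πk) k^k e^{−k} e^{r_k}.

```python
import math

# r_k is defined by k! = sqrt(2*pi*k) * k^k * e^{-k} * e^{r_k}, so that
# k^k e^{-k} / k! = e^{-r_k} / sqrt(2*pi*k).  Solve for r_k via lgamma and
# check the claimed sandwich 1/(12k+1) < r_k < 1/(12k), plus e^k >= k^k/k! >= 1.
for k in range(1, 60):
    rk = math.lgamma(k + 1) - 0.5 * math.log(2 * math.pi * k) - k * math.log(k) + k
    assert 1 / (12 * k + 1) < rk < 1 / (12 * k)
    log_ratio = k * math.log(k) - math.lgamma(k + 1)   # log(k^k / k!)
    assert 0 <= log_ratio <= k                         # i.e. 1 <= k^k/k! <= e^k
```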

Basic::P:interp [9] Suppose n₀ = 1 ≤ n₁ < n₂ < … is an increasing sequence of positive integers with sup_{k∈ℕ} n_k/n_{k−1} = B < ∞ and f : [1, ∞) → ℝ⁺ is an increasing function with sup_{t≥1} f(Bt)/f(t) = A < ∞. For each random variable X show that

sup_{r≥1} ‖X‖_r/f(r) ≤ A sup_{k∈ℕ} ‖X‖_{n_k}/f(n_k).

Basic::P:moment.Psia [10] With Ψ_α as in Problem [3], prove the existence of positive constants c_α and C_α for which

c_α ‖X‖_{Ψ_α} ≤ sup_{r≥1} ‖X‖_r / r^{1/α} ≤ C_α ‖X‖_{Ψ_α}

for every random variable X.

Basic::P:subg.var [11] Theorem <16> suggests that ‖X‖₂ might be comparable to β(X) if X has a subgaussian distribution. It is easy to deduce from inequality <10> that ‖X‖₂ ≤ 2β(X). Show that there is no companion inequality in the other direction by considering the bounded, symmetric random variable X for which P{X = ±M} = δ and P{X = 0} = 1 − 2δ. If 2δ = M^{−2} show that PX² = 1 but

log( 1 + θ²/2! + θ⁴M²/4! ) ≤ log Pe^{θX} ≤ τ²θ²/2   for all real θ

would force τ² ≥ 2 log(3/2 + M²/24).

Basic::P:subg.suff [12] Show that a random variable X that has either of the following two properties is subgaussian.

(i) For some positive constants c1, c2, and c3,

P{|X| ≥ x} ≤ c₁ exp(−c₂x²)   for all x ≥ c₃.

(ii) For some positive constants c1, c2, and c3,

Pe^{θX} ≤ c₁ exp(c₂θ²)   for all |θ| ≥ c₃.

Basic::P:subg.limit1 [13] Suppose {X_n} is a sequence of random variables for which

P exp(θX_n) ≤ exp(τ²θ²/2)   for all n and all real θ,


where τ is a fixed positive constant. If X_n → X almost surely, show that Pe^{θX} ≤ exp(τ²θ²/2) for all real θ.

Basic::P:lagrange [14] Suppose f is a sufficiently smooth real-valued function defined at least in a neighborhood of the origin of the real line. Define

G(x) = ( f(x) − f(0) − xf′(0) )/(x²/2)   if x ≠ 0,   and G(0) = f″(0).

Prove the following facts.

(i) f(x) − f(0) − xf′(0) = x² ∬ f″(xs){0 < s < t < 1} ds dt.

(ii) If f is convex then G is nonnegative.

(iii) If f″ is an increasing function then so is G.

(iv) If f″ is a convex function then so is G. Moreover G(x) ≥ f″(x/3). Hint: Jensen and 2∬ s{0 < s < t < 1} ds dt = 1/3.

(v) For the special case where f(x) = (1 + x) log(1 + x) − x show that G = ψ_Benn from <18> and G(x) ≥ (1 + x/3)^{−1}.
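With ψ_Benn as in <18>, the lower bound in part (v) can be probed on a grid; a sketch in plain Python:

```python
import math

def psi_benn(x):
    # psi_Benn(x) = F*(x)/(x^2/2) with F*(x) = (1+x)log(1+x) - x, for x > -1, x != 0.
    return ((1 + x) * math.log(1 + x) - x) / (x * x / 2)

# Part (v) asserts G = psi_Benn for f(x) = (1+x)log(1+x) - x, with
# G(x) >= f''(x/3) = 1/(1 + x/3).  Check on a grid of x in (-1, 30):
for k in range(-9, 300):
    x = k / 10
    if x == 0:
        continue                      # G(0) = f''(0) = 1 by definition
    assert psi_benn(x) >= 1 / (1 + x / 3) - 1e-12
```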

Basic::P:Poisson [15] Recall the notation F(x) = e^x − 1 − x for x ∈ ℝ and ψ_Benn(y) = F*(y)/(y²/2) where F*(y) = (1 + y) log(1 + y) − y for y ≥ −1. Suppose X has a Poisson(λ) distribution.

(i) Show that L(θ) := log Pe^{θX} = λθ + λF(θ).

(ii) For x ≥ 0, deduce that

P{X ≥ λ + x} ≤ exp( −λF*(x/λ) ) = exp( −(x²/(2λ)) ψ_Benn(x/λ) ).

(iii) For 0 ≤ x ≤ λ show that

P{X ≤ λ − x} ≤ inf_{θ≥0} exp( λF(−θ) − θx ) = exp( −(x²/(2λ)) ψ_Benn(−x/λ) ) ≤ exp( −x²/(2λ) ).
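For concreteness, the upper-tail bound in part (ii) can be compared with the exact Poisson tail; a sketch in plain Python with illustrative values of λ and x:

```python
import math

def poisson_upper_tail(lam, m):
    # P{X >= m} for X ~ Poisson(lam), via the complementary pmf sum.
    return 1.0 - sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(m))

def F_star(y):
    # F*(y) = (1+y)log(1+y) - y for y >= -1.
    return (1 + y) * math.log(1 + y) - y

lam = 5.0                              # illustrative
for x in [1, 3, 7]:                    # integer x so lam + x is a pmf cutoff
    exact = poisson_upper_tail(lam, int(lam + x))
    bound = math.exp(-lam * F_star(x / lam))
    assert exact <= bound              # part (ii): P{X >= lam + x} <= exp(-lam F*(x/lam))
```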

Basic::P:BinPoisson [16] To what extent can the bounds from Problem [15] be recovered as limit- ing cases of the bounds derived via the Bennett inequality applied to the Bin(n, λ/n) distribution?

Basic::P:diag.HW [17] Suppose ∞ > γ = ‖X‖_{Ψ₂}. Use Theorem <16> to deduce that PX^{2k} ≤ (γ√(2k))^{2k} ≤ (2eγ²)^k k!. Use Bernstein's inequality from Theorem <23> to deduce a tail bound analogous to Theorem <33> for Σ_i α_i(X_i² − PX_i²) with independent subgaussians X₁, …, X_n and constants α_i.


2.10 Notes

Basic::S:Notes Anyone who has read Massart's Saint-Flour lectures (Massart, 2003) or Lugosi's lecture notes (Lugosi, 2003) will realize how much I have been influenced by those authors. Many of the ideas in those lectures reappear in the book by Boucheron, Lugosi, and Massart (2013).

Many authors seem to credit Chernoff (1952) with the moment generating trick in <2>, even though the idea is obviously much older: compare with the treatment of the Bernstein inequality by Uspensky (1937, pages 204–206). Chernoff himself gave credit to Cramér for "extremely more powerful results" obtained under stronger assumptions.

The characterizations for subgaussians in Theorem <16> are essentially due to Kahane (1968, Exercise 6.10).

Hoeffding (1963) established many tail bounds for sums of independent random variables. He also commented that the results extend to martingales.

The Bennett inequality for independent summands was proved by Bennett (1962) by a more complicated method. Section 2.6 on the Bernstein inequality is based on the exposition by Uspensky (1937, pages 204–206), with refinements from Bennett (1962) and Massart (2003, Section 1.2.3). See also Boucheron et al. (2013, Sections 2.4, 2.8) for a slightly more detailed exposition, including the idea of using the positive part to get a one-sided bound, which they credited to Emmanuel Rio. Van der Vaart and Wellner (1996, page 103) noted the connection between the moment assumption <24> and the behavior of PΦ₁(θ|X_i|).

Dudley (1999, Appendix H) used the name Young-Orlicz modulus for what I am calling a Young function. Every Young function can be written as Ψ(t) = ∫₀ᵗ ψ(x) dx, where ψ is the right-hand derivative of Ψ. The function ψ is increasing and right-continuous.
For their N-functions, Krasnosel’skiĭ and Rutickiĭ (1961, Section 1.3) added the requirements that ψ(0) = 0 < ψ(x) for x > 0 and ψ(x) → ∞ as x → ∞, which ensures that Ψ is strictly increasing with Ψ(t)/t → ∞ as t → ∞. Dudley called such a function an Orlicz modulus. When ψ(0) = 0 there is a measure µ on B(ℝ⁺) for which ψ(b) = µ(0, b] for all intervals (0, b]. In that case PΨ(X) = µ^s P(X − s)⁺, a representation closely related to Lemma 1 of Panchenko (2003).


References

Bennett62jasa Bennett, G. (1962). Probability inequalities for the sum of independent random variables. Journal of the American Statistical Association 57, 33–45.

BLM2013Concentration Boucheron, S., G. Lugosi, and P. Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

bourgain1998random Bourgain, J. (1998). Random points in isotropic convex sets. MSRI Publications: Convex Geometric Analysis 34, 53–58.

BoydVandenberghe2004 Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.

Chernoff52AMS Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Annals of Mathematical Statistics 23 (4), 493–507.

Dudley99book Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge Uni- versity Press.

Feller1 Feller, W. (1968). An Introduction to Probability Theory and Its Applications (third ed.), Volume 1. New York: Wiley.

Hoeffding:63 Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58, 13–30.

Kahane68RSF Kahane, J.-P. (1968). Some Random Series of Functions. Heath. Second edition: Cambridge University Press 1985.

KrasnoselskiiRutickii1961 Krasnosel’skiĭ, M. A. and Y. B. Rutickiĭ (1961). Convex Functions and Orlicz Spaces. Noordhoff. Translated from the first Russian edition by Leo F. Boron.

Lugosi2003ANU Lugosi, G. (2003). Concentration-of-measure inequalities. Notes from the Summer School on Machine Learning, Australian National University. Available at http://www.econ.upf.es/˜lugosi/.

Massart03Flour Massart, P. (2003). Concentration Inequalities and Model Selection, Volume 1896 of Lecture Notes in Mathematics. Springer Verlag. Lectures given at the 33rd Probability Summer School in Saint-Flour.


Panchenko2003AP Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. Annals of Probability 31, 2068–2081.

RudelsonVershynin2013ECP Rudelson, M. and R. Vershynin (2013). Hanson-Wright inequality and subgaussian concentration. Electron. Commun. Probab. 18 (82), 1–9.

Uspensky1937book Uspensky, J. V. (1937). Introduction to Mathematical Probability. McGraw- Hill.

vaartwellner96book van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer-Verlag.

vershynin2010compressed Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027.
