4 Concentration Inequalities, Scalar and Matrix Versions
4.1 Large Deviation Inequalities

Concentration and large deviation inequalities are among the most useful tools for understanding the performance of some algorithms. In a nutshell, they control the probability of a random variable being very far from its expectation. The simplest such inequality is Markov's inequality:

Theorem 4.1 (Markov's Inequality) Let $X \geq 0$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then,
$$\mathrm{Prob}\{X > t\} \leq \frac{\mathbb{E}[X]}{t}. \qquad (31)$$

Proof. Let $t > 0$. Define a random variable $Y_t$ as
$$Y_t = \begin{cases} 0 & \text{if } X \leq t \\ t & \text{if } X > t. \end{cases}$$
Clearly, $Y_t \leq X$, hence $\mathbb{E}[Y_t] \leq \mathbb{E}[X]$, and
$$t \, \mathrm{Prob}\{X > t\} = \mathbb{E}[Y_t] \leq \mathbb{E}[X],$$
concluding the proof. $\Box$

Markov's inequality can be used to obtain many more concentration inequalities. Chebyshev's inequality is a simple inequality that controls fluctuations around the mean.

Theorem 4.2 (Chebyshev's inequality) Let $X$ be a random variable with $\mathbb{E}[X^2] < \infty$. Then,
$$\mathrm{Prob}\{|X - \mathbb{E}X| > t\} \leq \frac{\mathrm{Var}(X)}{t^2}.$$

Proof. Apply Markov's inequality to the random variable $(X - \mathbb{E}[X])^2$ to get:
$$\mathrm{Prob}\{|X - \mathbb{E}X| > t\} = \mathrm{Prob}\left\{(X - \mathbb{E}X)^2 > t^2\right\} \leq \frac{\mathbb{E}\left[(X - \mathbb{E}X)^2\right]}{t^2} = \frac{\mathrm{Var}(X)}{t^2}. \qquad \Box$$

4.1.1 Sums of independent random variables

In what follows we will show two useful inequalities involving sums of independent random variables. The intuitive idea is that if we have a sum of independent random variables
$$X = X_1 + \cdots + X_n,$$
where the $X_i$ are i.i.d. centered random variables, then while the value of $X$ can be of order $O(n)$, it will very likely be of order $O(\sqrt{n})$ (note that this is the order of its standard deviation). The inequalities that follow are ways of very precisely controlling the probability of $X$ being larger than $O(\sqrt{n})$. While we could use, for example, Chebyshev's inequality for this, in the inequalities that follow the probabilities will be exponentially small in $t$, rather than merely quadratically small as Chebyshev gives, which will be crucial in many applications to come.

Theorem 4.3 (Hoeffding's Inequality) Let $X_1, X_2, \ldots, X_n$ be independent bounded random variables, i.e., $|X_i| \leq a$ and $\mathbb{E}[X_i] = 0$. Then,
$$\mathrm{Prob}\left\{\left|\sum_{i=1}^{n} X_i\right| > t\right\} \leq 2\exp\left(-\frac{t^2}{2na^2}\right).$$

The inequality implies that fluctuations larger than $O(\sqrt{n})$ have small probability. For example, for $t = a\sqrt{2n\log n}$ we get that the probability is at most $\frac{2}{n}$.

Proof. We first get a probability bound for the event $\sum_{i=1}^{n} X_i > t$. The proof, again, will follow from Markov. Since we want an exponentially small probability, we use a classical trick that involves exponentiating with any $\lambda > 0$ and then choosing the optimal $\lambda$:
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} = \mathrm{Prob}\left\{e^{\lambda \sum_{i=1}^{n} X_i} > e^{\lambda t}\right\} \qquad (32)$$
$$\leq \frac{\mathbb{E}\left[e^{\lambda \sum_{i=1}^{n} X_i}\right]}{e^{t\lambda}}$$
$$= e^{-t\lambda} \prod_{i=1}^{n} \mathbb{E}\left[e^{\lambda X_i}\right], \qquad (33)$$
where the penultimate step follows from Markov's inequality and the last equality follows from independence of the $X_i$'s.

We now use the fact that $|X_i| \leq a$ to bound $\mathbb{E}[e^{\lambda X_i}]$. Because the function $f(x) = e^{\lambda x}$ is convex,
$$e^{\lambda x} \leq \frac{a + x}{2a} e^{\lambda a} + \frac{a - x}{2a} e^{-\lambda a},$$
for all $x \in [-a, a]$. Since $\mathbb{E}[X_i] = 0$ for all $i$, we get
$$\mathbb{E}\left[e^{\lambda X_i}\right] \leq \mathbb{E}\left[\frac{a + X_i}{2a} e^{\lambda a} + \frac{a - X_i}{2a} e^{-\lambda a}\right] = \frac{1}{2}\left(e^{\lambda a} + e^{-\lambda a}\right) = \cosh(\lambda a).$$
Note that$^{15}$
$$\cosh(x) \leq e^{x^2/2}, \quad \text{for all } x \in \mathbb{R}.$$
Hence,
$$\mathbb{E}\left[e^{\lambda X_i}\right] \leq \cosh(\lambda a) \leq e^{(\lambda a)^2/2}.$$
Together with (32), this gives
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq e^{-t\lambda} \prod_{i=1}^{n} e^{(\lambda a)^2/2} = e^{-t\lambda} e^{n(\lambda a)^2/2}.$$

$^{15}$ This follows immediately from the Taylor expansions $\cosh(x) = \sum_{n=0}^{\infty} \frac{x^{2n}}{(2n)!}$ and $e^{x^2/2} = \sum_{n=0}^{\infty} \frac{x^{2n}}{2^n n!}$, together with $(2n)! \geq 2^n n!$.
This inequality holds for any choice of $\lambda \geq 0$, so we choose the value of $\lambda$ that minimizes
$$\min_{\lambda} \left\{ n\frac{(\lambda a)^2}{2} - t\lambda \right\}.$$
Differentiating readily shows that the minimizer is given by
$$\lambda = \frac{t}{na^2},$$
which satisfies $\lambda > 0$. For this choice of $\lambda$,
$$n(\lambda a)^2/2 - t\lambda = \frac{1}{n}\left(\frac{t^2}{2a^2} - \frac{t^2}{a^2}\right) = -\frac{t^2}{2na^2}.$$
Thus,
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq e^{-\frac{t^2}{2na^2}}.$$
By using the same argument on $\sum_{i=1}^{n}(-X_i)$, and union bounding over the two events, we get
$$\mathrm{Prob}\left\{\left|\sum_{i=1}^{n} X_i\right| > t\right\} \leq 2 e^{-\frac{t^2}{2na^2}}. \qquad \Box$$

Remark 4.4 Let's say that we have random variables $r_1, \ldots, r_n$ i.i.d. distributed as
$$r_i = \begin{cases} -1 & \text{with probability } p/2 \\ 0 & \text{with probability } 1 - p \\ 1 & \text{with probability } p/2. \end{cases}$$
Then $\mathbb{E}(r_i) = 0$ and $|r_i| \leq 1$, so Hoeffding's inequality gives
$$\mathrm{Prob}\left\{\left|\sum_{i=1}^{n} r_i\right| > t\right\} \leq 2\exp\left(-\frac{t^2}{2n}\right).$$
Intuitively, the smaller $p$ is, the more concentrated $\left|\sum_{i=1}^{n} r_i\right|$ should be; however, Hoeffding's inequality does not capture this behavior.

A natural way to quantify this intuition is by noting that the variance of $\sum_{i=1}^{n} r_i$ depends on $p$, since $\mathrm{Var}(r_i) = p$. The inequality that follows, Bernstein's inequality, uses the variance of the summands to improve over Hoeffding's inequality.

The way this is going to be achieved is by strengthening the proof above; more specifically, in step (33) we will use the bound on the variance to get a better estimate on $\mathbb{E}[e^{\lambda X_i}]$, essentially by realizing that if $X_i$ is centered, $\mathbb{E}X_i^2 = \sigma^2$, and $|X_i| \leq a$, then, for $k \geq 2$, $\mathbb{E}|X_i|^k \leq \sigma^2 a^{k-2} = \frac{\sigma^2}{a^2} a^k$.

Theorem 4.5 (Bernstein's Inequality) Let $X_1, X_2, \ldots, X_n$ be independent centered bounded random variables, i.e., $|X_i| \leq a$ and $\mathbb{E}[X_i] = 0$, with variance $\mathbb{E}[X_i^2] = \sigma^2$. Then,
$$\mathrm{Prob}\left\{\left|\sum_{i=1}^{n} X_i\right| > t\right\} \leq 2\exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}at}\right).$$

Remark 4.6 Before proving Bernstein's Inequality, note that on the example of Remark 4.4 we get
$$\mathrm{Prob}\left\{\left|\sum_{i=1}^{n} r_i\right| > t\right\} \leq 2\exp\left(-\frac{t^2}{2np + \frac{2}{3}t}\right),$$
which exhibits a dependence on $p$ and, for small values of $p$, is considerably smaller than what Hoeffding's inequality gives.

Proof. As before, we will prove
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq \exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}at}\right),$$
and then union bound with the same result for $-\sum_{i=1}^{n} X_i$ to prove the Theorem.

For any $\lambda > 0$ we have
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} = \mathrm{Prob}\left\{e^{\lambda \sum X_i} > e^{\lambda t}\right\} \leq \frac{\mathbb{E}\left[e^{\lambda \sum X_i}\right]}{e^{\lambda t}} = e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\left[e^{\lambda X_i}\right].$$
Now comes the source of the improvement over Hoeffding's:
$$\mathbb{E}\left[e^{\lambda X_i}\right] = \mathbb{E}\left[1 + \lambda X_i + \sum_{m=2}^{\infty} \frac{\lambda^m X_i^m}{m!}\right] \leq 1 + \sum_{m=2}^{\infty} \frac{\lambda^m a^{m-2}\sigma^2}{m!} = 1 + \frac{\sigma^2}{a^2}\sum_{m=2}^{\infty} \frac{(\lambda a)^m}{m!} = 1 + \frac{\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right).$$
Therefore,
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq e^{-\lambda t}\left[1 + \frac{\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right)\right]^n.$$
We will use a few simple inequalities (that can be easily proved with calculus), such as$^{16}$
$$1 + x \leq e^x, \quad \text{for all } x \in \mathbb{R}.$$
This means that
$$1 + \frac{\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right) \leq e^{\frac{\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right)},$$
which readily implies
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq e^{-\lambda t} e^{\frac{n\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right)}.$$
As before, we try to find the value of $\lambda > 0$ that minimizes
$$\min_{\lambda} \left\{-\lambda t + \frac{n\sigma^2}{a^2}\left(e^{\lambda a} - 1 - \lambda a\right)\right\}.$$
Differentiation gives
$$-t + \frac{n\sigma^2}{a^2}\left(a e^{\lambda a} - a\right) = 0,$$
which implies that the optimal choice of $\lambda$ is given by
$$\lambda^* = \frac{1}{a}\log\left(1 + \frac{at}{n\sigma^2}\right).$$
If we set
$$u = \frac{at}{n\sigma^2}, \qquad (34)$$
then $\lambda^* = \frac{1}{a}\log(1 + u)$. Now, the value of the minimum is given by
$$-\lambda^* t + \frac{n\sigma^2}{a^2}\left(e^{\lambda^* a} - 1 - \lambda^* a\right) = -\frac{n\sigma^2}{a^2}\left[(1 + u)\log(1 + u) - u\right].$$
This means that
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq \exp\left(-\frac{n\sigma^2}{a^2}\left[(1 + u)\log(1 + u) - u\right]\right).$$
The rest of the proof follows by noting that, for every $u > 0$,
$$(1 + u)\log(1 + u) - u \geq \frac{u^2}{2 + \frac{2}{3}u}, \qquad (35)$$
which implies
$$\mathrm{Prob}\left\{\sum_{i=1}^{n} X_i > t\right\} \leq \exp\left(-\frac{n\sigma^2}{a^2} \cdot \frac{u^2}{2 + \frac{2}{3}u}\right) = \exp\left(-\frac{t^2}{2n\sigma^2 + \frac{2}{3}at}\right). \qquad \Box$$

$^{16}$ In fact, $y = 1 + x$ is the tangent line to the graph of $f(x) = e^x$ at $x = 0$.
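To make the gap between Hoeffding's and Bernstein's inequalities concrete, here is a small numerical sketch (not part of the original notes; the sample sizes, the Monte Carlo setup, and helper names such as `sample_sums`, `hoeffding_bound`, and `bernstein_bound` are illustrative assumptions). It samples the sparse $\pm 1$ variables of Remark 4.4, estimates $\mathrm{Prob}\{|\sum_i r_i| > t\}$ empirically, and compares the estimate with the bounds of Theorem 4.3 and Theorem 4.5.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sums(n, p, trials):
    # r_i = -1 or +1 with probability p/2 each, and 0 with probability 1 - p,
    # as in Remark 4.4; returns `trials` independent samples of S = r_1 + ... + r_n.
    r = rng.choice([-1, 0, 1], size=(trials, n), p=[p / 2, 1 - p, p / 2])
    return r.sum(axis=1)

def hoeffding_bound(t, n, a=1.0):
    # Theorem 4.3 with |r_i| <= a
    return 2 * np.exp(-t**2 / (2 * n * a**2))

def bernstein_bound(t, n, sigma2, a=1.0):
    # Theorem 4.5 with Var(r_i) = sigma^2 (= p here) and |r_i| <= a
    return 2 * np.exp(-t**2 / (2 * n * sigma2 + (2.0 / 3.0) * a * t))

n, p, trials = 1000, 0.01, 20_000
S = sample_sums(n, p, trials)

for t in [10, 20, 30]:
    empirical = np.mean(np.abs(S) > t)
    print(f"t={t:3d}  empirical={empirical:.2e}  "
          f"Hoeffding={hoeffding_bound(t, n):.2e}  "
          f"Bernstein={bernstein_bound(t, n, sigma2=p):.2e}")
```

For such a small $p$ the Hoeffding bound is essentially vacuous at these values of $t$ (it barely drops below $2$), while the Bernstein bound is already tiny, which is exactly the behavior discussed in Remark 4.6.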
4.2 Gaussian Concentration

One of the most important results in concentration of measure is Gaussian concentration. Although it is a concentration result specific to normally distributed random variables, it will be very useful throughout these lectures. Intuitively, it says that if $F : \mathbb{R}^n \to \mathbb{R}$ is a function that is stable in terms of its input, then $F(g)$ is very well concentrated around its mean, where $g \sim \mathcal{N}(0, I)$. More precisely:

Theorem 4.7 (Gaussian Concentration) Let $X = [X_1, \ldots, X_n]^T$ be a vector with i.i.d. standard Gaussian entries and $F : \mathbb{R}^n \to \mathbb{R}$ a $\sigma$-Lipschitz function (i.e., $|F(x) - F(y)| \leq \sigma\|x - y\|$ for all $x, y \in \mathbb{R}^n$). Then, for every $t \geq 0$,
$$\mathrm{Prob}\left\{|F(X) - \mathbb{E}F(X)| \geq t\right\} \leq 2\exp\left(-\frac{t^2}{2\sigma^2}\right).$$

For the sake of simplicity we will show the proof for a slightly weaker bound (in terms of the constant inside the exponent):
$$\mathrm{Prob}\left\{|F(X) - \mathbb{E}F(X)| \geq t\right\} \leq 2\exp\left(-\frac{2}{\pi^2}\frac{t^2}{\sigma^2}\right).$$
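As a quick illustration of Theorem 4.7 (again a hedged sketch under assumed parameters, not from the notes), take the $1$-Lipschitz function $F(x) = \|x\|_2$. The code below estimates $\mathrm{Prob}\{|F(X) - \mathbb{E}F(X)| \geq t\}$ for a standard Gaussian vector $X$ and compares it with the bound $2\exp(-t^2/2)$; the true mean $\mathbb{E}F(X)$ is replaced by its Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials = 50, 100_000
X = rng.standard_normal((trials, n))

# F(x) = ||x||_2 is 1-Lipschitz (sigma = 1) by the triangle inequality.
F = np.linalg.norm(X, axis=1)
mean_F = F.mean()  # Monte Carlo stand-in for E[F(X)] (close to sqrt(n))

for t in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(F - mean_F) >= t)
    bound = 2 * np.exp(-t**2 / 2)  # Theorem 4.7 with sigma = 1
    print(f"t={t:.1f}  empirical={empirical:.2e}  bound={bound:.2e}")
```

Note that the fluctuations of $\|X\|_2$ around its mean are $O(1)$ regardless of the dimension $n$: this dimension-free behavior is exactly what the theorem guarantees for Lipschitz functions of Gaussian vectors.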