EE-731: ADVANCED TOPICS IN DATA SCIENCES LABORATORY FOR INFORMATION AND INFERENCE SYSTEMS, SPRING 2016

INSTRUCTOR: VOLKAN CEVHER    SCRIBE: HARSHIT GUPTA

PROBABILITY IN HIGH DIMENSIONS

Index terms: Concentration inequalities, Bounded difference function, Cramer-Chernoff bound, Johnson-Lindenstrauss Theorem, Hoeffding's Lemma, Sub-Gaussian random variables, Entropy, Herbst's trick

1 Introduction

Measuring the concentration of a random variable around some value has many applications in statistics, information theory, optimization, etc. Formally, this concentration of measure can be quantified as follows:

Definition 1. Given a random variable Y and a constant m, Y is said to be concentrated around m if,

P(|Y − m| > t) ≤ D(t), where D(t) decays rapidly to 0 in t.

These inequalities are called concentration of measure inequalities [1], [2].

Example 1.1. A simple example is $Y_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, where the $X_i$ are independent random variables with mean $\mu$ and variance $\sigma^2$; then $Y_n$ tends to concentrate around $\mu$ as $n \to \infty$. The concentration in this case can be quantified by the following results:

• Law of Large Numbers: $P(|Y_n - \mu| > \epsilon) \to 0$ as $n \to \infty$.
• Central Limit Theorem: $P\left(|Y_n - \mu| > \frac{\alpha}{\sqrt{n}}\right) \to 2\Phi\left(-\frac{\alpha}{\sigma}\right)$ as $n \to \infty$, where $\Phi$ is the standard normal CDF.
• Large Deviations: Under some technical assumptions, $P(|Y_n - \mu| > \epsilon) \le \exp(-n \cdot c(\epsilon))$.
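As a quick numerical illustration of these rates (a minimal sketch that is not part of the original notes; the Uniform(0, 1) distribution, the value of ε, and the sample sizes are arbitrary choices of mine), one can simulate the sample mean and watch the tail probability P(|Y_n − µ| > ε) shrink with n, alongside the Chebyshev bound derived in Example 2.2 below:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.05                        # deviation threshold
mu, var = 0.5, 1.0 / 12.0         # mean and variance of Uniform(0, 1)

for n in [10, 100, 1000]:
    # 10000 Monte Carlo realizations of Y_n = (1/n) * sum_i X_i
    Y = rng.uniform(0.0, 1.0, size=(10000, n)).mean(axis=1)
    tail = np.mean(np.abs(Y - mu) > eps)      # empirical P(|Y_n - mu| > eps)
    chebyshev = var / (n * eps**2)            # Chebyshev bound (see Example 2.2 below)
    print(f"n={n:5d}  empirical tail={tail:.4f}  Chebyshev bound={min(chebyshev, 1.0):.4f}")
```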

It is quite interesting to note that these concentration properties of the average of independent random variables hold in the more general scenario where a not-too-sensitive function of independent random variables concentrates around its expectation. Formally:

Proposition 1.1. If $x_1, \cdots, x_n$ are independent random variables, then any function $f(x_1, \cdots, x_n)$ that is not too sensitive to any of its coordinates will concentrate around its mean:

\[
P(|f(x_1, \cdots, x_n) - E[f(x_1, \cdots, x_n)]| > t) \le e^{-t^2/c(t)}.
\]
Here, $c(t)$ quantifies the sensitivity of the function to its variables.

2 Concentration Inequalities

Now we define several types of functions and their concentration inequalities:

2.1 Bounded Difference Function

Definition 2. A function $f : \mathcal{X}^n \to \mathbb{R}$ is called a bounded difference function if, for some non-negative $c_1, \cdots, c_n$,

\[
\sup_{x_1, \ldots, x_n,\, x_i' \in \mathcal{X}} |f(x_1, \cdots, x_i, \cdots, x_n) - f(x_1, \cdots, x_i', \cdots, x_n)| \le c_i.
\]

Example 2.1 (Chromatic number of a random graph). Let $V = \{1, \cdots, n\}$ and let $G$ be a random graph such that each pair $i, j \in V$ is independently connected with probability $p$. Let $X_{ij} = 1$ if $(i, j)$ are connected and $0$ otherwise. The chromatic number of $G$ is the number of colors needed to color the vertices such that no two connected vertices have the same color. Defining
\[
\text{Chromatic number} = f(X_{11}, \cdots, X_{ij}, \cdots, X_{nn}),
\]
it can be shown that $f$ is a bounded difference function with $c_{ij} = 1$.

Theorem 2.1 (Bounded difference inequality). Let $f : \mathcal{X}^n \to \mathbb{R}$ satisfy the bounded difference property with constants $c_i$ and let $X_1, \cdots, X_n$ be independent random variables. Then:
\[
P(|f(X_1, \cdots, X_n) - E[f(X_1, \cdots, X_n)]| > t) \le 2\exp\left(\frac{-2t^2}{\sum_{i=1}^{n} c_i^2}\right).
\]
Proof. To prove this result we need to exploit some common probability bounds. As these bounds are essential to the proof, we discuss them in detail, with examples, while keeping the flow of the proof:

• Cramer-Chernoff Bound

– Markov’s Inequality We start by stating Markov’s inequality and its proof:

Theorem 2.2 (Markov's Inequality). Let $Z$ be a non-negative random variable. Then $P(Z \ge t) \le \frac{E[Z]}{t}$.
Proof. This can be proven by showing that:
\[
P(Z \ge t) = \int_t^{\infty} p_Z(z)\, dz \le \int_t^{\infty} \frac{z}{t}\, p_Z(z)\, dz \le \int_0^{\infty} \frac{z}{t}\, p_Z(z)\, dz = \frac{E[Z]}{t}. \qquad \square
\]
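A minimal numerical sanity check of Markov's inequality (a sketch of my own; the Exponential(2) variable is an arbitrary choice, any non-negative distribution would do):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.exponential(scale=2.0, size=200000)   # non-negative, E[Z] = 2

for t in [1.0, 4.0, 8.0, 16.0]:
    empirical = np.mean(Z >= t)
    markov = Z.mean() / t                     # Markov: P(Z >= t) <= E[Z]/t
    print(f"t={t:5.1f}  P(Z>=t)={empirical:.4f}  bound={markov:.4f}")
```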

– Chebyshev's Inequality We extend this result to any non-decreasing and non-negative function $\varphi$ of the non-negative random variable $Z$; then,
\[
P(Z \ge t) \le P(\varphi(Z) \ge \varphi(t)) \le \frac{E[\varphi(Z)]}{\varphi(t)}.
\]

On choosing $\varphi(Z) = Z^2$ and substituting $|Z - E[Z]|$ for $Z$, we get Chebyshev's inequality:
\[
P(|Z - E[Z]| \ge t) \le \frac{\mathrm{Var}[Z]}{t^2}.
\]
– Chernoff Bound Similarly, by choosing $\varphi(Z) = e^{\lambda Z}$ where $\lambda \ge 0$ and writing $\psi_Z(\lambda) = \log E[e^{\lambda Z}]$, we get the Chernoff bound:
\[
P(Z \ge t) \le \inf_{\lambda \ge 0} e^{-\lambda t} E[e^{\lambda Z}] = \exp\left(-\sup_{\lambda \ge 0}\left(\lambda t - \psi_Z(\lambda)\right)\right).
\]


– Cramer Transform To get the Cramer-Chernoff inequality from this Chernoff bound, we first define the Cramer transform of the random variable $Z$:

\[
\psi_Z^*(t) = \sup_{\lambda \ge 0}\left(\lambda t - \psi_Z(\lambda)\right).
\]
Thus, by substituting the Cramer transform of $Z$ into the Chernoff bound we derive:

\[
P(Z \ge t) \le \exp\left(-\psi_Z^*(t)\right).
\]

Example 2.2. To illustrate the power of these bounds we apply them to the case $Z = X_1 + \cdots + X_n$, where the $X_i$ are i.i.d. (independent and identically distributed). Applying Chebyshev's inequality to the sum, noting that $\mathrm{Var}[Z] = n\,\mathrm{Var}[X]$, and putting $t = n\epsilon$, we obtain
\[
P\left(\frac{1}{n}|Z - E[Z]| \ge \epsilon\right) \le \frac{\mathrm{Var}[X]}{n\epsilon^2}.
\]

In contrast, by first deriving that:

\[
\psi_Z(\lambda) = \log E[e^{\lambda Z}] = \log E\left[e^{\lambda \sum_{i=1}^{n} X_i}\right] = \log E\left[\prod_{i=1}^{n} e^{\lambda X_i}\right] = \log \prod_{i=1}^{n} E[e^{\lambda X_i}] = \log\left(E[e^{\lambda X}]\right)^n = n\,\psi_X(\lambda),
\]

using the i.i.d. property of the $X_i$'s, and substituting this into the Cramer-Chernoff inequality for the sum, we get:

\[
P(Z \ge n\epsilon) \le \exp\left(-n\,\psi_X^*(\epsilon)\right).
\]
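A small sketch comparing the two bounds (an illustration of my own, using standard Gaussian $X_i$, for which $\psi_X(\lambda) = \lambda^2/2$ and hence $\psi_X^*(\epsilon) = \epsilon^2/2$; here $Z \sim \mathcal{N}(0, n)$, so the exact tail $P(Z \ge n\epsilon) = P(\mathcal{N}(0,1) \ge \sqrt{n}\,\epsilon)$ is available in closed form):

```python
from math import erf, exp, sqrt

def std_normal_tail(x):
    # P(N(0,1) >= x)
    return 0.5 * (1.0 - erf(x / sqrt(2.0)))

eps = 0.2
for n in [10, 100, 500, 1000]:
    exact = std_normal_tail(sqrt(n) * eps)     # Z ~ N(0, n), so P(Z >= n*eps)
    chernoff = exp(-n * eps**2 / 2.0)          # exp(-n * psi*_X(eps)) with psi*_X(eps) = eps^2/2
    chebyshev = 1.0 / (n * eps**2)             # Var[X] / (n eps^2), Var[X] = 1
    print(f"n={n:5d}  exact={exact:.2e}  Chernoff={chernoff:.2e}  Chebyshev={min(chebyshev, 1.0):.2e}")
```

The Chernoff bound decays exponentially in n, whereas the Chebyshev bound only decays as 1/n.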

We also use the Cramer-Chernoff inequality on any centered random variable $Y = X - E[X]$ and its negative $Y_- = -Y$:
\[
P(Y \ge t) \le \exp\left(-\psi_Y^*(t)\right), \qquad P(-Y \ge t) \le \exp\left(-\psi_{-Y}^*(t)\right),
\]
and adding the two:
\[
P(|Y| \ge t) \le \exp\left(-\psi_Y^*(t)\right) + \exp\left(-\psi_{-Y}^*(t)\right).
\]
If the probability distribution of $Y$ is symmetric around its mean, then $\psi_{-Y}^*(t) = \psi_Y^*(t)$, and thus,

\[
P(|Y| \ge t) \le 2\exp\left(-\psi_Y^*(t)\right).
\]

For example, for a random variable $Y \sim \mathcal{N}(0, \sigma^2)$, $\psi_Y(\lambda) = \frac{\lambda^2\sigma^2}{2}$ and $\psi_Y^*(t) = \frac{t^2}{2\sigma^2}$, and hence,

\[
P(|Y| \ge t) \le 2e^{-\frac{t^2}{2\sigma^2}},
\]

which shows that Gaussian random variables concentrate around their mean as $\sigma \to 0$.

• Sub-Gaussian Random Variables: To use the inequalities discussed above to prove the concentration of measure for bounded difference functions, we describe the larger family of sub-Gaussian random variables to which they belong:

Definition 3 (Sub-Gaussian Random Variable). If a centered random variable $Z$ is such that $\psi_Z(\lambda) \le \frac{\lambda^2\sigma^2}{2}$ for all non-negative $\lambda$, then it is called a sub-Gaussian random variable with parameter $\sigma^2$. The family of all such random variables with parameter $\sigma^2$ is denoted by $\mathcal{G}(\sigma^2)$.

Interestingly, when $\psi_Z(\lambda) \le \frac{\lambda^2\sigma^2}{2}$, then $\psi_Z^*(t) \ge \frac{t^2}{2\sigma^2}$ and $\psi_{-Z}^*(t) \ge \frac{t^2}{2\sigma^2}$, which brings us to the properties of sub-Gaussian random variables:

– If $Z \in \mathcal{G}(\sigma^2)$ then $P(|Z| \ge t) \le 2\exp\left(-\frac{t^2}{2\sigma^2}\right)$, i.e. sub-Gaussian random variables concentrate around their mean.


– If $Z_i \in \mathcal{G}(\sigma_i^2)$ are independent, then $\sum_{i=1}^{n} a_i Z_i \in \mathcal{G}\left(\sum_{i=1}^{n} a_i^2\sigma_i^2\right)$, i.e. a linear combination of independent sub-Gaussian random variables is sub-Gaussian as well.
• Hoeffding's Lemma These results can then be used together with Hoeffding's Lemma:

Lemma 2.3 (Hoeffding's Lemma). Let $Z$ be a random variable with $E[Z] = 0$ and bounded in $[a, b]$. Then $\psi_Z(\lambda) \le \frac{\lambda^2}{2}\cdot\frac{(b-a)^2}{4}$, and thus $Z \in \mathcal{G}((b-a)^2/4)$.

The details of the lemma are discussed later.
• Applying property 1 of the sub-Gaussian random variables to $Z = Y - E[Y]$, we get
\[
P(|Y - E[Y]| \ge t) \le 2\exp\left(-\frac{2t^2}{(b-a)^2}\right).
\]

We then use property 2 of the sub-Gaussian random variables to prove that $Z = Z_1 + \cdots + Z_n$ (where $Z_i = Y_i - E[Y_i]$ and $Y_i$ is bounded in $[a_i, b_i]$) is sub-Gaussian with parameter $\sum_{i=1}^{n}(b_i - a_i)^2/4$. Then, substituting $t = n\epsilon$ and using the result obtained, we get:
\[
P\left(\frac{1}{n}|Z - E[Z]| \ge \epsilon\right) \le 2\exp\left(-\frac{2n\epsilon^2}{\frac{1}{n}\sum_{i=1}^{n}(b_i - a_i)^2}\right). \qquad \square
\]
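A Monte Carlo check of this bound, under my own choice of Bernoulli(0.3) variables (so each $Y_i$ is bounded in $[a_i, b_i] = [0, 1]$ and the bound reduces to $2\exp(-2n\epsilon^2)$):

```python
import numpy as np

rng = np.random.default_rng(2)
p, eps, trials = 0.3, 0.1, 50000

for n in [20, 100, 500]:
    X = rng.binomial(1, p, size=(trials, n))           # bounded in [0, 1]
    dev = np.abs(X.mean(axis=1) - p)                   # (1/n) |sum_i (X_i - E X_i)|
    empirical = np.mean(dev >= eps)
    hoeffding = 2.0 * np.exp(-2.0 * n * eps**2)        # 2 exp(-2 n eps^2) since (b_i - a_i) = 1
    print(f"n={n:4d}  empirical={empirical:.4f}  bound={min(hoeffding, 1.0):.4f}")
```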

We state two more examples of concentration inequalities:

Theorem 2.4 (Lipschitz Function of Gaussian RVs). Let $X_1, \cdots, X_n$ be independent random variables with distribution $\mathcal{N}(0, 1)$ and let $f$ be $L$-Lipschitz, i.e. $|f(x) - f(x')| \le L\|x - x'\|_2$ for any $x, x'$. Then,
\[
P\left(|f(X_1, \cdots, X_n) - E[f(X_1, \cdots, X_n)]| \ge t\right) \le 2\exp\left(-\frac{t^2}{2L^2}\right).
\]
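A sketch of this inequality (an illustration of my own: $f(x) = \max_i x_i$ is 1-Lipschitz with respect to the Euclidean norm, so $L = 1$, and $E[f]$ is estimated by Monte Carlo rather than computed exactly):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 100000

X = rng.standard_normal((trials, n))
f = X.max(axis=1)                       # f(x) = max_i x_i is 1-Lipschitz in the Euclidean norm
f_mean = f.mean()                       # Monte Carlo estimate of E[f]

for t in [0.5, 1.0, 1.5, 2.0]:
    empirical = np.mean(np.abs(f - f_mean) >= t)
    bound = 2.0 * np.exp(-t**2 / 2.0)   # Theorem 2.4 with L = 1
    print(f"t={t:.1f}  empirical={empirical:.5f}  bound={min(bound, 1.0):.5f}")
```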

Theorem 2.5. Let $X_1, \cdots, X_n$ be independent random variables bounded in $[0, 1]$, and let $f : [0, 1]^n \to \mathbb{R}$ be $1$-Lipschitz and separately convex (i.e. convex in any given coordinate when the other ones are fixed). Then,
\[
P\left(f(X_1, \cdots, X_n) - E[f(X_1, \cdots, X_n)] \ge t\right) \le \exp\left(-\frac{t^2}{2}\right).
\]

3 Examples

We now discuss some applications of these concentration inequalities:

Example 3.1. PAC Learnability: As discussed in the previous lecture, a hypothesis class $\mathcal{H}$ has the uniform convergence property if there exists a function $n_{\mathcal{H}}(\epsilon, \delta)$ such that for every $\epsilon, \delta \in [0, 1]$ and any probability distribution $P$, if $n \ge n_{\mathcal{H}}(\epsilon, \delta)$, we have

\[
\sup_{h \in \mathcal{H}} |\hat{F}_n(h) - F(h)| \le \epsilon,
\]
with probability at least $1 - \delta$, where $\hat{F}_n(h)$ is the empirical risk and $F(h)$ is the risk. Suppose the hypothesis class $\mathcal{H}$ consists of a finite number of functions $f(h, \cdot)$ bounded in $[0, 1]$. Then $\mathcal{H}$ satisfies the uniform convergence property with
\[
n_{\mathcal{H}}(\epsilon, \delta) = \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}.
\]
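Before the proof, a quick numerical check of this sample complexity (a sketch with a synthetic finite class of my own choosing: each hypothesis h simply incurs i.i.d. Bernoulli(p_h) losses, which is enough to exercise the bound even though it is not a genuine learning problem):

```python
import numpy as np

def n_uc(eps, delta, H_size):
    """Sample size from the uniform convergence bound: log(2|H|/delta) / (2 eps^2)."""
    return int(np.ceil(np.log(2.0 * H_size / delta) / (2.0 * eps**2)))

rng = np.random.default_rng(4)
eps, delta, H_size = 0.1, 0.05, 200
n = n_uc(eps, delta, H_size)

# Synthetic finite class: hypothesis h has true risk p[h]; its losses are Bernoulli(p[h]).
p = rng.uniform(0.0, 1.0, size=H_size)
trials, failures = 500, 0
for _ in range(trials):
    losses = rng.binomial(1, p, size=(n, H_size))      # column h: n i.i.d. losses of hypothesis h
    sup_dev = np.max(np.abs(losses.mean(axis=0) - p))  # sup_h |F_hat(h) - F(h)|
    failures += (sup_dev > eps)

print(f"n = {n}, empirical failure rate = {failures / trials:.3f} (should be <= delta = {delta})")
```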


Proof. Let $\xi_i(h) = f(h, x_i)$, and $S_n(h) = \frac{1}{n}\sum_{i=1}^{n}\left(\xi_i(h) - E\xi_i(h)\right)$. Then,

\[
\sup_{h \in \mathcal{H}} |S_n(h)| = \sup_{h \in \mathcal{H}} |\hat{F}_n(h) - F(h)|.
\]

As the $f(h, x_i)$ are bounded in $[0, 1]$, they satisfy the bounded difference inequality, so that:

\[
P\left(\sup_{h \in \mathcal{H}} |S_n(h)| \ge \epsilon\right) \le \sum_{h \in \mathcal{H}} P(|S_n(h)| \ge \epsilon) \le 2|\mathcal{H}|\, e^{-2n\epsilon^2} \le \delta
\]
for $n \ge \frac{\log(2|\mathcal{H}|/\delta)}{2\epsilon^2}$. Here, $|\mathcal{H}|$ is the cardinality of the set $\mathcal{H}$. $\square$

Example 3.2 (Network Tomography). The problem, as shown in Fig. 1, is to reconstruct the tree structure given $n$ packets, $p$ leaf nodes, and the information $X_k^{(i)} = 1\{\text{packet } i \text{ arrives at node } k\}$ for the $n$ independent samples.
Solution: In [3], it has been shown that the tree structure can be recovered from the quantities $q_{kl} = P(\text{a packet reaches both } x_k \text{ and } x_l)$. We approximate $q_{kl}$ by the ensemble average $\hat{q}_{kl} = \frac{1}{n}\sum_{i=1}^{n} 1\{X_k^{(i)} = 1 \cap X_l^{(i)} = 1\}$. But the approximation is useful only if it is uniformly accurate, i.e. $|\hat{q}_{kl} - q_{kl}| \le \epsilon$ for every pair $k, l$. This can be achieved with probability $1 - \delta$ if $n \ge \frac{1}{2\epsilon^2}\log\frac{p^2}{\delta}$.
Proof. Using Hoeffding's inequality, it is easy to see that:
\[
P(|\hat{q}_{kl} - q_{kl}| \ge \epsilon) \le 2\exp\left(-2n\epsilon^2\right).
\]

To bound this uniformly over all pairs $k, l$:

\[
P\left(\sup_{k \ne l} |\hat{q}_{kl} - q_{kl}| \ge \epsilon\right) \le \sum_{k \ne l} P(|\hat{q}_{kl} - q_{kl}| \ge \epsilon) \le 2\cdot\frac{p^2}{2}\exp(-2n\epsilon^2) \le \delta.
\]

Thus, $P(\text{error}) \le \delta$ if $n \ge \frac{1}{2\epsilon^2}\log\frac{p^2}{\delta}$. $\square$
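A rough numerical check of this sample-size requirement (a synthetic sketch of my own: arrival indicators are drawn independently per leaf, so the true $q_{kl}$ factorizes as $q_k q_l$; this is not the tree model or the recovery algorithm of [3], it only exercises the uniform accuracy claim):

```python
import numpy as np

rng = np.random.default_rng(5)
p_leaves, eps, delta = 10, 0.1, 0.05
n = int(np.ceil(np.log(p_leaves**2 / delta) / (2.0 * eps**2)))   # n >= (1/(2 eps^2)) log(p^2/delta)

# Synthetic arrival indicators: X[i, k] = 1 if packet i reaches leaf k (arbitrary marginals).
q_marg = rng.uniform(0.3, 0.9, size=p_leaves)
X = rng.binomial(1, q_marg, size=(n, p_leaves))

# q_hat_kl = (1/n) sum_i 1{X_k^(i) = 1 and X_l^(i) = 1}; here the true q_kl = q_k * q_l.
q_hat = (X.T @ X) / n
q_true = np.outer(q_marg, q_marg)
max_err = np.max(np.abs(q_hat - q_true)[~np.eye(p_leaves, dtype=bool)])
print(f"n = {n}, max over k != l of |q_hat - q| = {max_err:.3f} (target eps = {eps})")
```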

Example 3.3 (Random Linear Projections). This result is also known as the Johnson-Lindenstrauss Theorem [4].


Theorem 3.1 (Johnson-Lindenstrauss Theorem). Let $x_1, \cdots, x_p$ be a collection of points in $\mathbb{R}^d$, and let $A \in \mathbb{R}^{n \times d}$ be a random matrix with i.i.d. $\mathcal{N}(0, \frac{1}{n})$ entries (standard deviation $\frac{1}{\sqrt{n}}$). Then, for any pair $\epsilon, \delta \in (0, 1)$, if $n \ge \frac{4}{\epsilon^2(1-\epsilon)}\log\frac{p^2}{\delta}$, the following is satisfied with probability $1 - \delta$:

\[
(1 - \epsilon)\|x_i - x_j\|_2^2 \le \|Ax_i - Ax_j\|_2^2 \le (1 + \epsilon)\|x_i - x_j\|_2^2 \qquad (1)
\]

Proof. Let us first consider $E[\|Au\|_2^2]$ for a fixed vector $u$; we observe that:

\[
E[\|Au\|_2^2] = u^T E[A^T A] u = u^T I u = \|u\|_2^2.
\]

Also note that $Z_j = \sqrt{n}\,[Au]_j/\|u\|_2$ is distributed as $\mathcal{N}(0, 1)$ and the $Z_j$ are independent. Thus, $\|Au\|_2^2/\|u\|_2^2 = \frac{1}{n}\sum_{j=1}^{n} Z_j^2$, a normalized chi-squared random variable with $n$ degrees of freedom. We then use the squared-Gaussian concentration (Chapter 2, [5]) to show that for any $u$,
\[
P\left(\|Au\|_2^2 \ge (1 + \epsilon)\|u\|_2^2\right) \le \exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right),
\]
\[
P\left(\|Au\|_2^2 \le (1 - \epsilon)\|u\|_2^2\right) \le \exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right),
\]
implying that:
\[
P\left(\left|\|Au\|_2^2 - \|u\|_2^2\right| \ge \epsilon\|u\|_2^2\right) \le 2\exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right).
\]
To bound this uniformly over a finite set $Q$ of vectors:
\[
P\left(\exists\, u \in Q : \left|\|Au\|_2^2 - \|u\|_2^2\right| \ge \epsilon\|u\|_2^2\right) \le \sum_{u \in Q} P\left(\left|\|Au\|_2^2 - \|u\|_2^2\right| \ge \epsilon\|u\|_2^2\right) \le 2|Q|\exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right).
\]

On substituting $u = x_i - x_j$ for all pairs, so that the cardinality of the set is $|Q| = p^2/2$, we obtain:
\[
P\left(\exists\, i, j : \left|\|A(x_i - x_j)\|_2^2 - \|x_i - x_j\|_2^2\right| \ge \epsilon\|x_i - x_j\|_2^2\right) \le p^2\exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right). \qquad (2)
\]

Setting $\delta = p^2\exp\left(-\frac{n}{4}\epsilon^2(1 - \epsilon)\right)$ and solving for $n$ gives $n = \frac{4}{\epsilon^2(1-\epsilon)}\log\frac{p^2}{\delta}$; hence, for $n$ at least this value, the bad event in (2) occurs with probability at most $\delta$, i.e.
\[
(1 - \epsilon)\|x_i - x_j\|_2^2 \le \|Ax_i - Ax_j\|_2^2 \le (1 + \epsilon)\|x_i - x_j\|_2^2
\]
holds for all pairs $i, j$ with probability at least $1 - \delta$. $\square$
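A sketch implementation of the random projection in Theorem 3.1 (my own choice of $p$, $d$, $\epsilon$, $\delta$ and of Gaussian test points; the matrix $A$ is drawn with i.i.d. $\mathcal{N}(0, 1/n)$ entries as in the statement):

```python
import numpy as np

rng = np.random.default_rng(6)
p, d, eps, delta = 50, 1000, 0.3, 0.05
n = int(np.ceil(4.0 * np.log(p**2 / delta) / (eps**2 * (1.0 - eps))))   # Theorem 3.1

X = rng.standard_normal((p, d))                 # p arbitrary points in R^d
A = rng.standard_normal((n, d)) / np.sqrt(n)    # i.i.d. N(0, 1/n) entries, so E[A^T A] = I
Y = X @ A.T                                     # projected points A x_i, now in R^n

# Check (1) for every pair i < j.
ok = True
for i in range(p):
    for j in range(i + 1, p):
        orig = np.sum((X[i] - X[j])**2)
        proj = np.sum((Y[i] - Y[j])**2)
        ok &= (1 - eps) * orig <= proj <= (1 + eps) * orig
print(f"n = {n}, all pairwise squared distances preserved within eps: {bool(ok)}")
```

With these parameters the projection only shrinks the dimension from d = 1000 to roughly n = 700; the point is that n does not depend on d at all, only logarithmically on the number of points.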

4 Proofs

• Details of Hoeffding’s Lemma:

Lemma 4.1 (Hoeffding's Lemma). Let $Z$ be a random variable with $E[Z] = 0$ and bounded in $[a, b]$. Then $\psi_Z(\lambda) \le \frac{\lambda^2}{2}\cdot\frac{(b-a)^2}{4}$, and thus $Z \in \mathcal{G}((b-a)^2/4)$.

Proof. First note that if $a \ne b$ then $a < 0$ and $b > 0$, since $E[Z] = 0$. Now, since $e^{\lambda z}$ is a convex function of $z$ for a given $\lambda$, it lies below the chord on $[a, b]$:
\[
e^{\lambda z} \le \frac{b - z}{b - a}e^{\lambda a} + \frac{z - a}{b - a}e^{\lambda b}.
\]


Taking expectations, this can be simplified as:
\[
\begin{aligned}
E[e^{\lambda Z}] &\le \frac{b - E[Z]}{b - a}e^{\lambda a} + \frac{E[Z] - a}{b - a}e^{\lambda b} \\
&= \frac{b}{b - a}e^{\lambda a} + \frac{-a}{b - a}e^{\lambda b} \\
&= (1 - p)e^{-hp} + p\,e^{h(1 - p)} \\
&= e^{-hp}\left(1 - p + p\,e^{h}\right) \\
&= e^{-hp + \ln(1 - p + p e^{h})} = e^{L(h)},
\end{aligned}
\]

where $h = \lambda(b - a)$, $p = \frac{-a}{b - a}$ and $L(h) = -hp + \ln(1 - p + p e^{h})$. Note that $p > 0$ and $1 - p + p e^{h} = p\left(\frac{1}{p} - 1 + e^{h}\right) = p\left(-\frac{b}{a} + e^{h}\right) > 0$, and thus it is a feasible argument for $\ln$. By Taylor's theorem, there exists $h' \in [0, h]$ such that
\[
L(h) = L(0) + h L'(0) + \frac{1}{2}h^2 L''(h').
\]
It is easy to check that $L(0) = L'(0) = 0$ and

\[
L''(h') = \frac{p e^{h'}}{1 - p + p e^{h'}}\left(1 - \frac{p e^{h'}}{1 - p + p e^{h'}}\right) = t(1 - t),
\]

where $t = \frac{p e^{h'}}{1 - p + p e^{h'}} > 0$. Note that $t(1 - t)$ reaches its maximum $1/4$ at $t = 1/2$; thus,
\[
L(h) \le \frac{1}{2}h^2 L''(h') \le \frac{1}{2}h^2\cdot\frac{1}{4} = \frac{1}{8}\lambda^2(b - a)^2.
\]
This implies that:
\[
E[e^{\lambda Z}] \le \exp\left(\frac{1}{8}\lambda^2(b - a)^2\right)
\]
and that $Z$ is sub-Gaussian: $Z \in \mathcal{G}\left(\frac{(b-a)^2}{4}\right)$. $\square$
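A numerical check of the conclusion $E[e^{\lambda Z}] \le \exp(\lambda^2(b-a)^2/8)$ (a sketch of my own; the Beta-based bounded variable is an arbitrary choice, and it is re-centered empirically, which leaves the width of its range equal to $b - a$):

```python
import numpy as np

rng = np.random.default_rng(7)
a, b = -1.0, 3.0
# A random variable bounded in [a, b] (arbitrary choice), centered so that its mean is ~0.
Z = a + (b - a) * rng.beta(2.0, 5.0, size=400000)
Z = Z - Z.mean()                                 # empirical centering; range width stays b - a

for lam in [0.2, 0.5, 1.0, 2.0]:
    psi = np.log(np.mean(np.exp(lam * Z)))       # psi_Z(lambda) = log E[e^{lambda Z}]
    bound = lam**2 * (b - a)**2 / 8.0            # Hoeffding's lemma bound
    print(f"lambda={lam:.1f}  psi_Z={psi:.4f}  bound={bound:.4f}")
```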

• In the literature, the bounded difference inequality is also proved using the entropy method [5]. We outline here the proof and the tools needed for it.
Definition 4 (Entropy). Let $Z$ be a non-negative random variable. The entropy of $Z$ is defined as:

Ent(Z) = E[Z log Z] − (E[Z]) log(E[Z]).

Properties of Entropy:
– It is homogeneous, i.e. $\mathrm{Ent}(cZ) = c\,\mathrm{Ent}(Z)$ for a constant $c > 0$, so that $\mathrm{Ent}(Z)/E[Z]$ is a scale-independent measure of variation.
– It is always non-negative and attains $0$ when $Z$ is deterministic (this can be proven using Jensen's inequality).

Note that the entropy discussed here is different from the Shannon entropy $H(Z) = E[-\log P_Z(Z)]$. The two are related but not equivalent (in fact, $\mathrm{Ent}(\cdot)$ is more closely related to the relative entropy).
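A small sketch computing $\mathrm{Ent}(Z)$ by Monte Carlo and checking the two properties above (the Exponential sample, shifted away from 0 to keep $\log Z$ finite, is an arbitrary choice of mine):

```python
import numpy as np

def ent(Z):
    """Ent(Z) = E[Z log Z] - E[Z] log E[Z] for a non-negative sample Z (Monte Carlo estimate)."""
    EZ = Z.mean()
    return np.mean(Z * np.log(Z)) - EZ * np.log(EZ)

rng = np.random.default_rng(8)
Z = rng.exponential(1.0, size=500000) + 0.1     # non-negative, bounded away from 0

print(f"Ent(Z)     = {ent(Z):.4f}  (non-negative)")
print(f"Ent(3Z)    = {ent(3 * Z):.4f}")
print(f"3 * Ent(Z) = {3 * ent(Z):.4f}  (homogeneity: Ent(cZ) = c Ent(Z))")
```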

Definition 5. Let $\{X_i\}_{i=1}^{n}$ be independent random variables and $f \ge 0$ be any function, and let

\[
\mathrm{Ent}^{(i)}(f(x_1, \cdots, x_n)) := \mathrm{Ent}[f(x_1, \cdots, x_{i-1}, X_i, x_{i+1}, \cdots, x_n)].
\]

That is, $\mathrm{Ent}^{(i)} f$ is the entropy of $f$ with respect to the variable $X_i$ only. Similarly,

\[
E^{(i)}[f(x_1, \cdots, x_n)] := E[f(x_1, \cdots, x_{i-1}, X_i, x_{i+1}, \cdots, x_n)].
\]


Theorem 4.2 (Bounded Difference Inequality). Let $X_1, \cdots, X_n$ be independent random variables, and let $f$ satisfy the bounded differences property for some $\{c_i\}_{i=1}^{n}$. Set $\sigma^2 = \frac{1}{4}\sum_{i=1}^{n} c_i^2$. Then,

\[
P\left(|f(X_1, \cdots, X_n) - E[f(X_1, \cdots, X_n)]| > t\right) \le 2e^{-\frac{t^2}{2\sigma^2}}.
\]

Proof outline: Taking $Z = f(X_1, \cdots, X_n)$, the proof contains three sub-parts:

– Showing that
\[
\frac{\mathrm{Ent}^{(i)}(e^{\lambda Z})}{E^{(i)}[e^{\lambda Z}]} \le \frac{\lambda^2}{2}\cdot\frac{c_i^2}{4} \qquad \text{(a Hoeffding-type bound).}
\]
One approach to proving this uses Logarithmic Sobolev Inequalities [5]; a direct approach can be found in Section 2.3 of [2].

– In the next step we use the subadditivity of entropy [discussed later],
\[
\mathrm{Ent}(Z) \le E\left[\sum_{i=1}^{n} \mathrm{Ent}^{(i)}(Z)\right],
\]

to obtain:

\[
\frac{\mathrm{Ent}(e^{\lambda Z})}{E[e^{\lambda Z}]} \le \frac{\lambda^2}{2}\cdot\frac{\sum_{i=1}^{n} c_i^2}{4}.
\]

– Then, we use Herbst's trick [discussed later] to prove that $Z - E[Z] \in \mathcal{G}\left(\sigma^2 = \frac{1}{4}\sum_{i=1}^{n} c_i^2\right)$, which can then be combined with the concentration inequality obtained for sub-Gaussian variables to prove the main result.

• We outline the proof of sub-additivity of entropy and Herbst’s trick:

– Proof outline for the sub-additivity of entropy:

∗ First show that $\mathrm{Ent}(Z) = \sum_{i=1}^{n} E[Z U_i]$, where $U_i = \log\frac{E[Z \mid X_1, \cdots, X_i]}{E[Z \mid X_1, \cdots, X_{i-1}]}$.

∗ Then use the variational formula [discussed below] to deduce that $E[Z U_i] \le E[\mathrm{Ent}^{(i)}(Z)]$. Summing over $i$ then yields the result.
The variational formula for entropy can be written as:

\[
\mathrm{Ent}(Z) = \sup_{X : E[e^X] = 1} E[ZX].
\]

This can be shown by:
1. Using Jensen's inequality to show that $\mathrm{Ent}(Z) - E[ZX] \ge 0$ when $E[e^X] = 1$.

2. Then showing that equality holds when $X = \log\frac{Z}{E[Z]}$.
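A quick numerical check of the variational formula (a sketch of my own: $Z$ is an arbitrary positive variable; $X^* = \log(Z/E[Z])$ should attain the supremum, while another feasible $X$, built here from $Z^{\alpha}$ and normalized so that $E[e^X] = 1$, should not exceed it):

```python
import numpy as np

rng = np.random.default_rng(9)
Z = rng.uniform(0.5, 2.0, size=500000)                   # an arbitrary positive random variable

EZ = Z.mean()
ent_Z = np.mean(Z * np.log(Z)) - EZ * np.log(EZ)         # Ent(Z)

X_star = np.log(Z / EZ)                                  # candidate maximizer: E[e^{X*}] = 1
alpha = 0.5
X_other = alpha * np.log(Z) - np.log(np.mean(Z**alpha))  # another feasible X with E[e^X] = 1

print(f"Ent(Z)   = {ent_Z:.4f}")
print(f"E[e^X*]  = {np.mean(np.exp(X_star)):.4f}   E[Z X*] = {np.mean(Z * X_star):.4f}")
print(f"E[e^X]   = {np.mean(np.exp(X_other)):.4f}   E[Z X]  = {np.mean(Z * X_other):.4f}  (<= Ent(Z))")
```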

It is interesting to note that a similar property holds for the variance (sub-additivity of the variance): for independent $X_1, \cdots, X_n$,
\[
\mathrm{Var}[f(X_1, \cdots, X_n)] \le E\left[\sum_{i=1}^{n} \mathrm{Var}^{(i)} f(X_1, \cdots, X_n)\right].
\]

This property is also called the Efron-Stein inequality.
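A Monte Carlo check of the Efron-Stein inequality (a sketch of my own, with $f$ equal to the maximum of $n$ Uniform(0, 1) variables; the right-hand side is estimated through the equivalent independent-copy identity $\mathrm{Var}(W) = \frac{1}{2}E[(W - W')^2]$ applied coordinate-wise):

```python
import numpy as np

rng = np.random.default_rng(10)
n, trials = 10, 200000

X = rng.uniform(0.0, 1.0, size=(trials, n))
f = X.max(axis=1)
var_f = f.var()

# Estimate E[sum_i Var^{(i)} f]: replace coordinate i by an independent copy and
# average (f(X) - f(X^{(i)}))^2 / 2 over the sample.
rhs = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = rng.uniform(0.0, 1.0, size=trials)
    rhs += 0.5 * np.mean((f - Xi.max(axis=1))**2)

print(f"Var[f] = {var_f:.5f}  <=  E[sum_i Var^(i) f] = {rhs:.5f}")
```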


– Proof outline of Herbst's trick: For a given random variable $Z$, if
\[
\frac{\mathrm{Ent}(e^{\lambda Z})}{E[e^{\lambda Z}]} \le \frac{\lambda^2\sigma^2}{2}, \quad \forall \lambda \ge 0,
\]
then $Z - E[Z] \in \mathcal{G}(\sigma^2)$:
\[
\psi_0(\lambda) := \psi_{Z - E[Z]}(\lambda) = \log E[e^{\lambda(Z - E[Z])}] \le \frac{\lambda^2\sigma^2}{2}, \quad \forall \lambda \ge 0.
\]

∗ The log-moment generating function of $Z - E[Z]$ is $\psi_0(\lambda) = \log E[e^{\lambda Z}] - \lambda E[Z]$.

∗ It can be shown that $\frac{d}{d\lambda}\left(\frac{\psi_0(\lambda)}{\lambda}\right) = \frac{\mathrm{Ent}(e^{\lambda Z})}{\lambda^2 E[e^{\lambda Z}]}$.

∗ Integrating $\int_0^{\lambda} \frac{d}{ds}\left(\frac{\psi_0(s)}{s}\right) ds \le \int_0^{\lambda} \frac{\sigma^2}{2}\, ds$ gives $\psi_0(\lambda) \le \frac{\lambda^2\sigma^2}{2}$.

5 Review

In this lecture we discussed:

• Concentration inequalities and their implications.
• Probability bounds used to prove the concentration inequalities.
• Examples where these inequalities can be used.

• An alternative proof of the concentration inequalities using the entropy method.

References

[1] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.
[2] R. van Handel, "Probability in high dimension," DTIC Document, Tech. Rep., 2014.
[3] J. Ni and S. Tatikonda, "Network tomography based on additive metrics," IEEE Transactions on Information Theory, vol. 57, no. 12, pp. 7798–7809, 2011.

[4] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Structures & Algorithms, vol. 22, no. 1, pp. 60–65, 2003.
[5] S. Boucheron, G. Lugosi, and P. Massart, "Concentration inequalities using the entropy method," Annals of Probability, pp. 1583–1614, 2003.
