To cite this version: Odalric-Ambrym Maillard. Basic Concentration Properties of Real-Valued Distributions. Doctoral lecture notes, France, 2017. HAL Id: cel-01632228, https://hal.archives-ouvertes.fr/cel-01632228 (submitted on 9 Nov 2017).

BASIC CONCENTRATION PROPERTIES OF REAL-VALUED DISTRIBUTIONS

Odalric-Ambrym Maillard
Inria Lille – Nord Europe, SequeL team
[email protected]

In this note we introduce and discuss a few tools for the study of concentration inequalities on the real line. After recalling versions of the Chernoff method, we move to concentration inequalities for predictable processes. We especially focus on bounds that can handle sums of real-valued random variables where the number of summands is itself a random stopping time, and we target fully explicit and empirical bounds. We then discuss some other important tools, such as the Laplace method and the transportation lemma.

Keywords: Concentration of measure, Statistics.

Contents

1 Markov Inequality and the Chernoff method
  1.1 A first consequence
  1.2 Two complementary results
  1.3 The illustrative case of sub-Gaussian random variables

2 Concentration inequalities for predictable processes
  2.1 Doob's maximal inequalities
  2.2 The peeling technique for random stopping times
  2.3 Birgé-Massart concentration

3 Uniform bounds and the Laplace method

4 Some other applications
  4.1 Change of measure and code-length theory
  4.2 Chernoff Importance Sampling
  4.3 Transportation lemma

1 Markov Inequality and the Chernoff method

In this section, we start by introducing the celebrated Markov inequality, and show how this seemingly weak result leads to some of the most powerful tools in statistics: the Chernoff method and the Laplace transform.

Lemma 1 (Markov's inequality) For any measurable real-valued random variable X that is almost surely non-negative, it holds for all ε > 0 that

P( X ≥ ε ) ≤ E[X]/ε .

Proof of Lemma 1:
The proof uses the following straightforward decomposition:

X = X·I{X ≥ ε} + X·I{X < ε} .

Now since X is almost surely non-negative, it holds almost surely that X·I{X < ε} ≥ 0, and thus X ≥ ε·I{X ≥ ε}. We conclude by taking expectations on both sides (valid since both sides are non-negative), and deduce that E[X] ≥ ε·P(X ≥ ε). □
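As a quick illustration, the following minimal sketch checks Markov's inequality numerically (the exponential distribution, sample size, and thresholds are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)  # non-negative, E[X] = 1

for eps in [1.0, 2.0, 5.0]:
    empirical = np.mean(X >= eps)   # estimate of P(X >= eps)
    bound = X.mean() / eps          # Markov bound E[X]/eps
    print(f"eps={eps}: P(X >= eps) ~ {empirical:.4f} <= {bound:.4f}")
```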


1.1 A first consequence

We can apply this result immediately to real-valued random variables by remarking that, for any random variable X distributed according to ν (which we note X ∼ ν) and any λ ∈ R, the random variable exp(λX) is non-negative. Thus, if we now define the domain of ν by Dν = {λ ∈ R : E[exp(λX)] < ∞}, we deduce by application of Markov's inequality that for all t ∈ R,

∀λ ∈ R₊* ∩ Dν ,  P( X ≥ t ) = P( exp(λX) ≥ exp(λt) ) ≤ exp(−λt) E[exp(λX)] ,  (1)
∀λ ∈ R₋* ∩ Dν ,  P( X ≤ t ) = P( exp(λX) ≥ exp(λt) ) ≤ exp(−λt) E[exp(λX)] .  (2)

In this construction, the exp transform may seem arbitrary, and one could indeed use more general transforms; the benefit of doing so will be discussed later. For now, we explore what happens in the exp case. A first immediate result is the following:

Lemma 2 (Chernoff's rule) Let X ∼ ν be a real-valued random variable. Then

log E[exp(X)] ≤ 0  implies  ∀δ ∈ (0, 1], P( X ≥ ln(1/δ) ) ≤ δ .

The proof is immediate by considering t = ln(1/δ) and λ = 1 in (1).

1.2 Two complementary results

Now one can consider two complementary points of view. The first one is to fix the value of t in (1) and (2) and minimize the probability level (the term on the right-hand side of the inequality). The second one is to fix the probability level and optimize the value of t. This leads to the following lemmas.

Lemma 3 (Cramér-Chernoff) Let X ∼ ν be a real-valued random variable. Let us introduce the log-Laplace transform of ν and its Legendre transform:

∀λ ∈ R, ϕν(λ) = log E[exp(λX)] ,
∀t ∈ R, ϕν*(t) = sup_{λ∈R} ( λt − ϕν(λ) ) ,

and let Dν = {λ ∈ R : ϕν(λ) < ∞}.
If Dν ∩ R₊* ≠ ∅, then E[X] < ∞ and for all t ≥ E[X],

log P( X ≥ t ) ≤ −ϕν*(t) .

Likewise, if Dν ∩ R₋* ≠ ∅, then E[X] > −∞ and for all t ≤ E[X],

log P( X ≤ t ) ≤ −ϕν*(t) .

Remark 1 The log-Laplace transform ϕν is also known as the cumulant generating function.

Proof of Lemma 3:

First, note that {λ ∈ R : E[exp(λX)] < ∞} coincides with {λ ∈ R : ϕν(λ) < ∞}. Using equations (1) and (2), it holds

P( X ≥ t ) ≤ inf_{λ ∈ R₊* ∩ Dν} exp( −λt + log E[exp(λX)] ) ,
P( X ≤ t ) ≤ inf_{λ ∈ R₋* ∩ Dν} exp( −λt + log E[exp(λX)] ) .


The Legendre transform ϕν* of the log-Laplace function ϕν unifies these two cases. Indeed, a striking property of ϕν* is that if λ ∈ Dν for some λ > 0, then E[X] < ∞. This can be seen by Jensen's inequality applied to the function ln: indeed it holds λE[X] = E[ln exp(λX)] ≤ ϕν(λ). Further, for all t ≥ E[X], it holds

ϕν*(t) = sup_{λ ∈ R₊ ∩ Dν} ( λt − ϕν(λ) ) .

Note that this also applies if E[X] = −∞. Likewise, if λ ∈ Dν for some λ < 0, then E[X] > −∞ and for all t ≤ E[X], it holds

ϕν*(t) = sup_{λ ∈ R₋ ∩ Dν} ( λt − ϕν(λ) ) . □

Alternatively, the second point of view is to fix the confidence level δ ∈ (0, 1], and to solve the equation exp(−λt)E[exp(λX)] = δ in t = t(δ, λ). We then optimize over λ. This leads to:

Lemma 4 (Alternative Cramér-Chernoff) Let X ∼ ν be a real-valued random variable and let Dν = {λ ∈ R : log E[exp(λX)] < ∞}. It holds

P( X ≥ inf_{λ ∈ Dν ∩ R₊*} { (1/λ) log E[exp(λX)] + log(1/δ)/λ } ) ≤ δ ,  (3)
P( X ≤ sup_{λ ∈ (−Dν) ∩ R₊*} { −(1/λ) log E[exp(−λX)] − log(1/δ)/λ } ) ≤ δ .  (4)

Proof of Lemma 4:
Solving exp(−λt)E[exp(λX)] = δ for δ ∈ (0, 1] and λ ≠ 0, we obtain the following equivalence:

−λt + log E[exp(λX)] = log(δ)
⟺ λt = −log(δ) + log E[exp(λX)]
⟺ t = (1/λ) log(1/δ) + (1/λ) log E[exp(λX)] .

Thus, we deduce from (1) and (2) that

∀λ > 0, P( X ≥ (1/λ) log(1/δ) + (1/λ) log E[exp(λX)] ) ≤ δ ,
∀λ > 0, P( X ≤ −(1/λ) log(1/δ) − (1/λ) log E[exp(−λX)] ) ≤ δ . □

The rescaled Laplace transform λ ↦ (1/λ) log E[exp(λX)] is sometimes called the entropic risk measure. Note that Lemmas 3 and 4 involve slightly different quantities, depending on whether we focus on the probability level δ or on the threshold on X.
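As a quick sanity check of Lemma 4, consider X ∼ N(µ, σ²), for which the entropic risk measure is explicit: (1/λ) log E[exp(λX)] = µ + λσ²/2. Plugging this into (3) and minimizing over λ > 0,

inf_{λ>0} { µ + λσ²/2 + log(1/δ)/λ } = µ + σ√(2 log(1/δ)) , attained at λ = √(2 log(1/δ))/σ ,

so that Lemma 4 recovers the familiar Gaussian tail bound P( X ≥ µ + σ√(2 log(1/δ)) ) ≤ δ.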

1.3 The illustrative case of sub-Gaussian random variables

An immediate corollary is the following:


Corollary 1 (Sub-Gaussian Concentration Inequality) Let {Xi}_{i≤n} be independent R-sub-Gaussian random variables with mean µ, that is such that

∀λ ∈ R, log E[ exp( λ(Xi − µ) ) ] ≤ (1/2) λ²R² .

Then,

∀δ ∈ (0, 1), P( Σ_{i=1}^n (Xi − µ) ≥ √( 2R²n log(1/δ) ) ) ≤ δ .

Remark 2 This corollary naturally applies to Gaussian random variables with variance σ², in which case R = σ. It also applies to bounded random variables. Indeed, random variables {Xi}_{i≤n} bounded in [0, 1] are 1/2-sub-Gaussian. This can be understood intuitively by remarking that the distributions with the highest variance on [0, 1] are Bernoulli, and that the variance of a Bernoulli with parameter θ ∈ [0, 1] is θ(1 − θ) ≤ 1/4, thus resulting in R² = 1/4. This is proved more formally via Hoeffding's lemma.

Proof of Corollary 1:
Indeed, it holds that

(1/λ) log E[ exp( λ Σ_{i=1}^n (Xi − µ) ) ] = (1/λ) log E[ Π_{i=1}^n exp( λ(Xi − µ) ) ]
=(a) (1/λ) log Π_{i=1}^n E[ exp( λ(Xi − µ) ) ]
= (1/λ) Σ_{i=1}^n log E[ exp( λ(Xi − µ) ) ]
≤(b) nλR²/2 ,

where (a) is by independence, and (b) holds by the sub-Gaussian assumption. We deduce by Lemma 4 that

P( Σ_{i=1}^n (Xi − µ) ≥ inf_{λ ∈ Dν ∩ R₊*} { λR²n/2 + log(1/δ)/λ } )
≤(a) P( Σ_{i=1}^n (Xi − µ) ≥ inf_{λ ∈ Dν ∩ R₊*} { (1/λ) log E[ exp( λ Σ_{i=1}^n (Xi − µ) ) ] + log(1/δ)/λ } )
≤ δ ,

where in (a) we used that x ≤ y implies P(X ≥ y) ≤ P(X ≥ x). Now we note that Dν = R by the sub-Gaussian assumption, where ν is the distribution of Σ_{i=1}^n Xi. We conclude by noticing that λδ = √( 2 log(1/δ)/(R²n) ) achieves the minimum in

inf_{λ ∈ R₊*} { λR²n/2 + log(1/δ)/λ } = √( 2R²n log(1/δ) ) . □
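A short simulation illustrates Corollary 1 (a sketch assuming standard Gaussian summands, so that µ = 0 and R = 1; the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, delta, R = 50, 100_000, 0.05, 1.0  # N(0,1) variables are 1-sub-Gaussian

sums = rng.standard_normal((trials, n)).sum(axis=1)    # sums of (X_i - mu)
threshold = np.sqrt(2 * R**2 * n * np.log(1 / delta))  # Corollary 1 threshold
print("empirical:", np.mean(sums >= threshold), " <= delta =", delta)
# For Gaussians the Chernoff bound is tight up to the usual Mill's-ratio factor,
# so the empirical frequency is below, but within an order of magnitude of, delta.
```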

2 Concentration inequalities for predictable processes

In practice, it is often desirable to control a random variable such as an empirical mean not only at a single time step n, but at multiple time steps n = 1, 2, .... The naive approach is to control the concentration at each time step separately and then use a union bound to obtain the final result. However, this is generally sub-optimal, as the empirical means at times n and n + 1 are close to each other and strongly correlated. We study here two powerful methods that improve on this naive strategy.


2.1 Doob's maximal inequalities

We start by recalling two standard and important inequalities that can be found in most introductory textbooks on statistics:

Lemma 5 (Doob's maximal inequality for non-negative sub-martingales) Let {Wt}_{t∈N} be a non-negative sub-martingale with respect to the filtration {Ft}_{t∈N}, that is,

∀t ∈ N, E[W_{t+1} | Ft] ≥ Wt , and Wt ≥ 0 .

Then, for all p ≥ 1 and ε > 0, it holds for all T ∈ N,

P( max_{0≤t≤T} Wt ≥ ε ) ≤ E[W_T^p] / ε^p .


Lemma 6 (Doob's maximal inequality for non-negative super-martingales) Let {Wt}_{t∈N} be a non-negative super-martingale with respect to the filtration {Ft}_{t∈N}, that is,

∀t ∈ N, E[W_{t+1} | Ft] ≤ Wt , and Wt ≥ 0 .

Then, for all p ≥ 1 and ε > 0, it holds for all T ∈ N,

P( max_{0≤t≤T} Wt ≥ ε ) ≤ E[W_0^p] / ε^p .

In particular, if E[W0] ≤ 1, then for all T ∈ N, P( max_{0≤t≤T} Wt ≥ ε ) ≤ ε⁻¹.
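To illustrate Lemma 6 (with p = 1), the following sketch tracks the running maximum of the exponential super-martingale Wt = exp(λSt − tλ²/2) built from standard Gaussian increments (an arbitrary illustrative choice, for which E[W0] = 1):

```python
import numpy as np

rng = np.random.default_rng(2)
lam, T, trials, eps = 0.5, 500, 2_000, 10.0

Z = rng.standard_normal((trials, T))
t = np.arange(1, T + 1)
W = np.exp(lam * np.cumsum(Z, axis=1) - t * lam**2 / 2)  # martingale, W_0 = 1
print("P(max_t W_t >= eps) ~", np.mean(W.max(axis=1) >= eps),
      " <= 1/eps =", 1 / eps)
```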

2.2 The peeling technique for random stopping times

In this section, we provide a powerful result that is useful when dealing with generic real-valued distributions. We say that a process generating a sequence of random variables {Zi}_{i=1}^∞ is predictable if there exists a filtration H = (Hn)_{n∈N} (the "filtration of the past") such that Zn is Hn-measurable for all n. We say that a random variable N is a random stopping time for H if, for all m ∈ N, {N ≤ m} is Hm−1-measurable.


Lemma 7 (Concentration inequality for predictable processes) Let {Zi}_{i=1}^∞ be a sequence of random variables generated by a predictable process. Let ϕ : R → R₊ be a convex upper-envelope of the cumulant generating functions of the conditional distributions, with ϕ(0) = 0, and let ϕ* be its Legendre-Fenchel transform, that is:

∀λ ∈ D, ∀i, ln E[ exp(λZi) | Hi−1 ] ≤ ϕ(λ) ,
∀x ∈ R, ϕ*(x) = sup_{λ∈R} ( λx − ϕ(λ) ) ,

where D = {λ ∈ R : ∀i, ln E[exp(λZi) | Hi−1] < ∞}. Assume that D contains an open neighborhood of 0. Then, for all c ∈ R₊, there exists a unique xc such that xc ≥ E[Zi | Hi−1] for all i and ϕ*(xc) = c, and a unique x′c such that x′c < E[Zi | Hi−1] for all i and ϕ*(x′c) = c. We define ϕ*,+⁻¹ : c ↦ xc and ϕ*,−⁻¹ : c ↦ x′c; then ϕ*,+⁻¹ is non-decreasing and ϕ*,−⁻¹ is non-increasing.

Let Nn be a random stopping time (for the filtration generated by {Zi}_{i=1}^∞) that is almost surely bounded by n. Then for all α ∈ (1, n] and δ ∈ (0, 1),

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ϕ*,+⁻¹( (α/Nn) ln( ⌈ln(n)/ln(α)⌉ (1/δ) ) ) ) ≤ δ ,
P( (1/Nn) Σ_{i=1}^{Nn} Zi ≤ ϕ*,−⁻¹( (α/Nn) ln( ⌈ln(n)/ln(α)⌉ (1/δ) ) ) ) ≤ δ .

In particular, one can take α to be the minimal solution to ln(α)e^{1/ln(α)} = ln(n)/δ.

Now, if N is a (possibly unbounded) random stopping time for the filtration generated by {Zi}_{i=1}^∞, it holds for all deterministic α > 1 and δ ∈ (0, 1),

P( (1/N) Σ_{i=1}^{N} Zi ≥ ϕ*,+⁻¹( (α/N) ln( ln(N) ln(αN) / (δ ln²(α)) ) ) ) ≤ δ ,
P( (1/N) Σ_{i=1}^{N} Zi ≤ ϕ*,−⁻¹( (α/N) ln( ln(N) ln(αN) / (δ ln²(α)) ) ) ) ≤ δ .

Proof of Lemma 7:
First, one easily derives the following properties from properties of the Legendre-Fenchel transform:

• ϕ*(0) = 0, ϕ*(x) → ∞ as x → +∞, and ϕ* is convex and increasing on R₊;
• for all x such that ϕ*(x) < ∞, there exists a unique λx ∈ D such that ϕ*(x) = λx·x − ϕ(λx);
• for all c ∈ R₊, there exists a unique xc ≥ E[Z] such that ϕ*(xc) = c; we write it ϕ*,+⁻¹(c), and ϕ*,+⁻¹ is non-decreasing.

1. A peeling argument. We start with a peeling argument. Let us choose some η > 0 and define tk = (1 + η)^k for k = 0, ..., K, with K = ⌈ln(n)/ln(1+η)⌉ (thus n ≤ tK). Let εt ∈ R₊ be a sequence that is non-increasing in t.


We have

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ε_{Nn} )
≤ Σ_{k=1}^{K} P( {t_{k−1} < Nn ≤ tk} ∩ { Σ_{i=1}^{Nn} Zi ≥ Nn ε_{Nn} } )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : Σ_{i=1}^{t} Zi ≥ t εt ) .

Let λk > 0, for k = 1, ..., K. Then

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ε_{Nn} )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : exp( λk Σ_{i=1}^{t} Zi ) ≥ exp( λk t εt ) )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : W_{k,t} ≥ exp( t( λk εt − ϕ(λk) ) ) ) , where W_{k,t} := exp( λk Σ_{i=1}^{t} Zi − t ϕ(λk) ) ,
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : W_{k,t} ≥ exp( t( λk ε_{tk} − ϕ(λk) ) ) ) .

Since ε_{tk} > 0, we can choose λk > 0 such that ϕ*(ε_{tk}) = λk ε_{tk} − ϕ(λk).

2. Doob's maximal inequality. At this point, we show that the sequence {W_{k,t}}_t is a non-negative super-martingale, where W_{k,t} = exp( λk Σ_{i=1}^{t} Zi − t ϕ(λk) ). Indeed, note that:

E[ W_{k,t+1} | Ft ] = W_{k,t} E[ exp( λk Z_{t+1} ) | Ft ] exp( −ϕ(λk) ) ≤ W_{k,t} .

Thus, using that t_{k−1} ≥ tk/(1 + η), we find

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ε_{Nn} )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : W_{k,t} ≥ exp( t ϕ*(ε_{tk}) ) )
≤ Σ_{k=1}^{K} P( max_{t∈(t_{k−1},tk]} W_{k,t} ≥ exp( tk ϕ*(ε_{tk}) / (1+η) ) )
≤(a) Σ_{k=1}^{K} exp( − tk ϕ*(ε_{tk}) / (1+η) ) ,

where (a) holds by application of Doob's maximal inequality for non-negative super-martingales, using that max_{t∈(t_{k−1},tk]} W_{k,t} ≤ max_{t∈(0,tk]} W_{k,t} and W_{k,0} ≤ 1.


3. Parameter tuning for bounded Nn. Now, let us choose εt such that t ϕ*(εt) = c ≥ 1 is a constant, that is εt = ϕ*,+⁻¹(c/t) (non-increasing with t). Thus, we get for all η ∈ (0, n − 1):

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ε_{Nn} ) ≤ Σ_{k=1}^{⌈ln(n)/ln(1+η)⌉} exp( − tk ϕ*(ε_{tk}) / (1+η) ) ≤ ⌈ ln(n)/ln(1+η) ⌉ exp( − c/(1+η) ) ,

which suggests setting c = (1+η) ln( ⌈ln(n)/ln(1+η)⌉ (1/δ) ). We thus obtain, for all η ∈ (0, n − 1],

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ϕ*,+⁻¹( ((1+η)/Nn) ln( ⌈ln(n)/ln(1+η)⌉ (1/δ) ) ) ) ≤ δ .

It then makes sense to find the minimum of f : x ↦ x ln( a/ln(x) ) for x > 1. An optimal point x* > 1 satisfies

f′(x) = ln( a/ln(x) ) − 1/ln(x) = 0 ,

that is, x* satisfies a = ln(x*) e^{1/ln(x*)}. We may thus choose α to be the (slightly suboptimal) minimal value x that satisfies ln(x)e^{1/ln(x)} = ln(n)/δ.
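The tuning equation ln(α)e^{1/ln(α)} = ln(n)/δ has no closed form, but since u ↦ u e^{1/u} is decreasing on (0, 1], the minimal solution can be found by bisection over u = ln(α). A minimal sketch (assuming ln(n)/δ ≥ e, so that a solution with ln(α) ≤ 1 exists):

```python
import numpy as np

def alpha_min(n, delta):
    """Minimal alpha > 1 with ln(alpha)*exp(1/ln(alpha)) = ln(n)/delta."""
    a = np.log(n) / delta
    lo, hi = 0.01, 1.0               # bisection over u = ln(alpha) in (0, 1]
    for _ in range(100):
        mid = (lo + hi) / 2
        if mid * np.exp(1 / mid) > a:
            lo = mid                 # g(u) = u*exp(1/u) is decreasing here
        else:
            hi = mid
    return float(np.exp((lo + hi) / 2))

print(alpha_min(n=10_000, delta=0.05))  # the slightly suboptimal choice above
```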

4. Parameter tuning for unbounded N. We restart from

P( (1/N) Σ_{i=1}^{N} Zi ≥ ε_N ) ≤ Σ_{k=1}^{K} exp( − tk ϕ*(ε_{tk}) / (1+η) ) ,

where tk = (1+η)^k and K = ⌈ln(n)/ln(1+η)⌉, and choose a different tuning for εt in order to handle an infinite sum (with K = ∞). Let us choose εt that satisfies t ϕ*(εt) = c(t), where c(t) is chosen such that

Σ_{k=1}^{∞} exp( − c(tk)/(1+η) ) < ∞ .

Choosing c(t) = (1+η) ln( (ln(t)/ln(1+η)) [ ln(t)/ln(1+η) + 1 ] (1/δ) ), it comes c(tk) = (1+η) ln( k(k+1)/δ ), and thus

P( (1/N) Σ_{i=1}^{N} Zi ≥ ε_N ) ≤ Σ_{k=1}^{∞} δ/(k(k+1)) = δ .

With this choice, we thus deduce

P( (1/N) Σ_{i=1}^{N} Zi ≥ ϕ*,+⁻¹( ((1+η)/N) ln( ln(N) ln((1+η)N) / (δ ln²(1+η)) ) ) ) ≤ δ .

5. Reverse bounds. We now provide a similar result for the reverse bound. Let εt ∈ R be a sequence that is non-decreasing with t, and let λk > 0 for k = 1, ..., K. Then

P( (1/N) Σ_{i=1}^{N} Zi ≤ ε_N )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : exp( −λk Σ_{i=1}^{t} Zi − t ϕ(−λk) ) ≥ exp( t( −λk εt − ϕ(−λk) ) ) )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : W_{k,t} ≥ exp( t( −λk ε_{tk} − ϕ(−λk) ) ) ) ,

where now W_{k,t} := exp( −λk Σ_{i=1}^{t} Zi − t ϕ(−λk) ).


If ε_{tk} < E[Z_{tk}], we can choose λk = λ_{ε_{tk}} > 0 such that ϕ*(ε_{tk}) = −λk ε_{tk} − ϕ(−λk) ≥ 0. Thus, using that t_{k−1} ≥ tk/(1 + η), it comes

P( (1/N) Σ_{i=1}^{N} Zi ≤ ε_N )
≤ Σ_{k=1}^{K} P( ∃t ∈ (t_{k−1}, tk] : W_{k,t} ≥ exp( t ϕ*(ε_{tk}) ) )
≤ Σ_{k=1}^{K} P( max_{t∈(t_{k−1},tk]} W_{k,t} ≥ exp( tk ϕ*(ε_{tk}) / (1+η) ) )
≤ Σ_{k=1}^{K} exp( − tk ϕ*(ε_{tk}) / (1+η) ) .

Now, let us choose εt < E[Zt] such that t ϕ*(εt) = c ≥ 1, that is εt = ϕ*,−⁻¹(c/t) (non-decreasing with t). For η = 1/(c − 1) and c = ln(e/δ), we obtain

P( (1/N) Σ_{i=1}^{N} Zi ≤ ϕ*,−⁻¹( ln(e/δ)/N ) ) ≤ ⌈ ln(n) ln(e/δ) ⌉ δ . □



Improvement. In some situations, it is possible to refine the previous result:

Corollary 2 (Improved concentration inequality for predictable processes) Under the same setting as Lemma 7, let hn(x) = log⌈ log(n)/log(1/x) ⌉, let h_{n,*} be its Legendre-Fenchel transform, and finally let c = h_{n,*,+}⁻¹( log(1/δ) ). If it holds that sup_{x∈(1/n,1)} ( cx − hn(x) ) = h_{n,*}(c), then

P( (1/Nn) Σ_{i=1}^{Nn} Zi ≥ ϕ*,+⁻¹( h_{n,*,+}⁻¹( log(1/δ) ) / Nn ) ) ≤ δ .

Note that the restrictive condition sup_{x∈(1/n,1)} ( cx − hn(x) ) = h_{n,*}(c) asks that the supremum of cx − hn(x) be reached at a point x of the set (1/n, 1).

Proof of Corollary 2:
Following the proof of Lemma 7, it suffices to refine the last step:

inf_{η∈(0,n−1)} ⌈ log(n)/log(1+η) ⌉ exp( − c/(1+η) )
= inf_{x∈(1/n,1)} exp( − cx + log⌈ log(n)/log(1/x) ⌉ )
= exp( − sup_{x∈(1/n,1)} ( cx − log⌈ log(n)/log(1/x) ⌉ ) )
= exp( − h_{n,*}(c) ) = δ . □
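As a numerical illustration of Lemma 7, the following sketch takes standard Gaussian increments, for which one may take ϕ(λ) = λ²/2 and hence ϕ*,+⁻¹(c) = √(2c), and evaluates the bounded-stopping-time bound along a data-dependent stopping rule (the rule, threshold, and all concrete sizes below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, delta, alpha = 1000, 5_000, 0.05, 2.0
K = np.ceil(np.log(n) / np.log(alpha))        # number of peeling slices

failures = 0
for _ in range(trials):
    Z = rng.standard_normal(n)                # phi(lambda) = lambda^2 / 2
    t = np.arange(1, n + 1)
    means = np.cumsum(Z) / t
    hits = np.nonzero(means > 0.1)[0]         # stop when the mean looks large
    N = int(hits[0]) + 1 if hits.size else n  # stopping rule, bounded by n
    c = (alpha / N) * np.log(K / delta)
    failures += means[N - 1] >= np.sqrt(2 * c)  # phi_{*,+}^{-1}(c) = sqrt(2c)
print("empirical failure rate:", failures / trials, " target delta:", delta)
```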

2.3 Birgé-Massart concentration

We conclude this section by applying Lemma 7 to the concentration of the quadratic sum of noise terms ξi. We believe that this illustrates the power of the method. Assume that the noise terms are strongly sub-Gaussian, in the sense that

∀λ ∈ Dν, ∀i, log E[ exp( λξi² ) | Hi−1 ] ≤ ϕ(λ) ,


where ϕ(λ) = −(1/2) log(1 − 2λR²). Note that this is the cumulant generating function of the square of a centered Gaussian. Then we can prove the following result:

Lemma 8 (Birgé-Massart concentration for predictable processes) Assume that Nn is a random stopping time that satisfies Nn ≤ n almost surely. Then it holds for all α > 1,

P( (1/Nn) Σ_{i=1}^{Nn} ξi² ≥ R² + 2R² √( (2α/Nn) ln( ⌈ln(n)/ln(α)⌉ (1/δ) ) ) + (2αR²/Nn) ln( ⌈ln(n)/ln(α)⌉ (1/δ) ) ) ≤ δ ,
P( (1/Nn) Σ_{i=1}^{Nn} ξi² ≤ R² − 2R² √( (α/Nn) ln( ⌈ln(n)/ln(α)⌉ (1/δ) ) ) ) ≤ δ .

Further, for a (possibly unbounded) random stopping time N, it holds for all α > 1,

P( (1/N) Σ_{i=1}^{N} ξi² ≥ R² + 2R² √( (2α/N) ln( ln(N) ln(αN) / (δ ln²(α)) ) ) + (2αR²/N) ln( ln(N) ln(αN) / (δ ln²(α)) ) ) ≤ δ ,
P( (1/N) Σ_{i=1}^{N} ξi² ≤ R² − 2R² √( (α/N) ln( ln(N) ln(αN) / (δ ln²(α)) ) ) ) ≤ δ .

Proof of Lemma 8:
According to Lemma 7 applied to Zi = ξi², all we have to do is to compute an upper bound on the quantity ϕ*,+⁻¹(c), and a lower bound on ϕ*,−⁻¹(c), for the values of c appearing in Lemma 7. We proceed in the following way. First, the envelope function satisfies

ϕ(λ) = −(1/2) ln( 1 − 2λR² ) ≤ λR² / (1 − 2λR²) ,

for λ ∈ (0, 1/(2R²)). Let x > R². It holds that ϕ*(x) ≥ sup_λ { λx − λR²/(1 − 2λR²) }. Solving this optimization by differentiating over λ, the supremum is reached at λ = (1 − R/√x)/(2R²) ∈ (0, 1/(2R²)), with corresponding value

ϕ̃*(x) = x/(2R²) − √x/R + 1/2 .

Now, for c > 0, it is easily checked that ϕ̃*(x) = c holds for xc = R²(1 + √(2c))². As a result, we deduce that ϕ*,+⁻¹(c) ≤ R²(1 + √(2c))² = R² + 2R²c + 2R²√(2c).

Now, for the reverse inequality, we have to compute a lower bound on the quantity ϕ*,−⁻¹(c). The envelope function satisfies, for λ > 0,

ϕ(−λ) = −(1/2) ln( 1 + 2λR² ) ≤ −λR² / (1 + λR²) .

Thus, for 0 < x < R², it holds ϕ*(x) ≥ sup_{λ>0} { −λx + λR²/(1 + λR²) } = 1 + sup_{λ>0} { −λx − 1/(1 + λR²) }. Solving this optimization by differentiating over λ, the supremum is reached at λ = (1/R²)(R/√x − 1) > 0, with corresponding value

ϕ̃*(x) = x/R² − 2√x/R + 1 = (1 − √x/R)² .

Now, for c > 0, it is easily checked that ϕ̃*(x) = c holds for xc = R²(1 − √c)², and xc < R² if c < 1. As a result, we deduce that if c ∈ (0, 1), then ϕ*,−⁻¹(c) ≥ R² − 2R²√c + R²c. On the other hand, for all c > 0, choosing λ = √c/R² and using the inequality 1/(1+v) ≥ 1 − v for v ≥ 0, we get

ϕ*(x) ≥ −√c·x/R² + √c/(1 + √c) ≥ −√c·x/R² + √c(1 − √c) =: ϕ̃*(x) .

Thus ϕ̃*(x) = c holds for xc = R² − 2R²√c < R². As a result, we deduce that for all c > 0, ϕ*,−⁻¹(c) ≥ R² − 2R²√c. □
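A quick simulation of Lemma 8 with standard Gaussian noise (R = 1) and the deterministic stopping time Nn = n (a valid, if trivial, stopping time; all concrete values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, delta, alpha, R = 300, 10_000, 0.05, 2.0, 1.0
c = (alpha / n) * np.log(np.ceil(np.log(n) / np.log(alpha)) / delta)

mean_sq = (rng.standard_normal((trials, n)) ** 2).mean(axis=1)
upper = R**2 + 2 * R**2 * np.sqrt(2 * c) + 2 * R**2 * c
lower = R**2 - 2 * R**2 * np.sqrt(c)
print("P(mean >= upper):", np.mean(mean_sq >= upper),
      "  P(mean <= lower):", np.mean(mean_sq <= lower), "  target:", delta)
# Both frequencies are (conservatively) below delta; the slack comes from the
# peeling over all possible stopping times.
```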

3 Uniform bounds and the Laplace method

In this section, we present another very powerful tool: the Laplace method (the method of mixtures for sub-Gaussian random variables). We provide the following illustrative result here, for real-valued random variables. The result extends naturally to dimension d, and even, to some extent, to infinite dimension.

Lemma 9 (Uniform confidence intervals) Let Y1, Y2, ... be i.i.d. real-valued random variables bounded in [0, 1], with mean µ. Let µt = (1/t) Σ_{s=1}^t Ys be the empirical mean estimate. Then, for all δ ∈ (0, 1), it holds

P( ∃t ∈ N : µt − µ ≥ √( (1 + 1/t) ln( √(t+1)/δ ) / (2t) ) ) ≤ δ ,
P( ∃t ∈ N : µ − µt ≥ √( (1 + 1/t) ln( √(t+1)/δ ) / (2t) ) ) ≤ δ .

Proof of Lemma 9:
The second inequality is proved exactly like the first one (replace each Ys with 1 − Ys); both rely on Hoeffding's lemma, by which i.i.d. random variables bounded in [0, 1] are 1/2-sub-Gaussian. For the first one, we introduce for a fixed δ ∈ (0, 1) the random variable

τ = min{ t ∈ N : µt − µ ≥ √( (1 + 1/t) ln( √(t+1)/δ ) / (2t) ) } .

This quantity is a random stopping time for the filtration F = (Ft)t, where Ft = σ(Y1, ..., Yt), since {τ ≤ m} is Fm-measurable for all m. We want to show that P(τ < ∞) ≤ δ. To this end, for any λ and t, we introduce the quantity

Mt^λ = exp( Σ_{s=1}^t ( λ(Ys − µ) − λ²/8 ) ) .

Since the variables are 1/2-sub-Gaussian, it is immediate to show that {Mt^λ}_{t∈N} is a non-negative super-martingale satisfying ln E[Mt^λ] ≤ 0 for all t. It then follows that M∞^λ = lim_{t→∞} Mt^λ is almost surely well-defined, and so is Mτ^λ. Further, let us introduce the stopped version Qt^λ = M_{min{τ,t}}^λ. An application of Fatou's lemma shows that E[Mτ^λ] = E[lim inf_{t→∞} Qt^λ] ≤ lim inf_{t→∞} E[Qt^λ] ≤ 1. Thus, E[Mτ^λ] ≤ 1.

The next step is to introduce an auxiliary random variable Λ ∼ N(0, 4), independent of all other variables, and to study the quantity Mt = E[Mt^Λ | F∞]. Note that the standard deviation of Λ is (1/2)⁻¹, due to the fact that we consider 1/2-sub-Gaussian random variables. We immediately get E[Mτ] = E[ E[Mτ^Λ | Λ] ] ≤ 1.


For convenience, let St = t(µt − µ). By construction of Mt, we have

Mt = (1/√(8π)) ∫_R exp( λSt − λ²t/8 − λ²/8 ) dλ
= (1/√(8π)) ∫_R exp( − ((t+1)/8) ( λ − 4St/(t+1) )² + 2St²/(t+1) ) dλ
= exp( 2St²/(t+1) ) (1/√(8π)) ∫_R exp( − ((t+1)/8) λ² ) dλ
= exp( 2St²/(t+1) ) √( 8π/(t+1) ) / √(8π)
= exp( 2St²/(t+1) ) / √(t+1) .

Thus, we deduce that

St = √( ((t+1)/2) ln( √(t+1) Mt ) ) .

We conclude by applying a simple Markov inequality:

P( τ(µτ − µ) ≥ √( ((τ+1)/2) ln( √(τ+1)/δ ) ) ) = P( Mτ ≥ 1/δ ) ≤ E[Mτ] δ ≤ δ . □



Proceeding along the same steps, we obtain more generally the following result for sums of independent sub-Gaussian random variables.

Lemma 10 Let Y1, Y2, ... be independent real-valued random variables, where each Ys has mean µs and is σs-sub-Gaussian. Then for all δ ∈ (0, 1), it holds

P( ∃t ∈ N : Σ_{s=1}^t (Ys − µs) ≥ √( 2 ( Σ_{s=1}^t σs² ) (1 + 1/t) ln( √(t+1)/δ ) ) ) ≤ δ ,
P( ∃t ∈ N : Σ_{s=1}^t (µs − Ys) ≥ √( 2 ( Σ_{s=1}^t σs² ) (1 + 1/t) ln( √(t+1)/δ ) ) ) ≤ δ .
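The following sketch checks the band of Lemma 9 on Bernoulli observations (bounded in [0, 1]); the horizon is truncated at a finite T, and the mean and sizes are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
T, trials, delta, mu = 1000, 2_000, 0.05, 0.3

Y = (rng.random((trials, T)) < mu).astype(float)  # Bernoulli(mu) samples
t = np.arange(1, T + 1)
mu_t = np.cumsum(Y, axis=1) / t                   # running empirical means
band = np.sqrt((1 + 1 / t) * np.log(np.sqrt(t + 1) / delta) / (2 * t))
crossed = np.any(mu_t - mu >= band, axis=1)       # violation at ANY t <= T
print("P(exists t <= T: band crossed):", crossed.mean(), " target:", delta)
```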

4 Some other applications

We provide below some further applications of the basic concentration inequalities derived earlier, which we believe provide interesting insights.

4.1 Change of measure and code-length theory

For an arbitrary random variable X admitting a finite cumulant generating function around 0, one has the properties that

P( X ≥ inf_{λ>0} { (1/λ) log E[exp(λX)] + log(1/δ)/λ } ) ≤ δ , (5)
P( X ≤ sup_{λ>0} { −(1/λ) log E[exp(−λX)] − log(1/δ)/λ } ) ≤ δ . (6)

Importantly, note that in equations (5) and (6) the random variable X can be chosen to be a sum of any sequence of variables, with arbitrary dependency. Now, let us consider some space X and two distributions P, Q ∈ M(X) with densities p, q, such that Q ≪ P. Then we have that

E_P[ q(X)/p(X) ] = ∫_X q(x) dx = 1 .


Thus, we deduce from this simple change of measure that

log E_P[ exp( log(q(X)) − log(p(X)) ) ] = 0 ,

and in particular this is at most 0, so that we deduce by Chernoff's rule (Lemma 2) that for all δ ∈ (0, 1],

P_P( −log(q(X)) ≤ −log(p(X)) − log(1/δ) ) ≤ δ ,

which is precisely the core inequality of code-length theory (taking δ = e^{−K} for a code-length gap of K nats, or equivalently δ = 2^{−K} for K bits when using base-2 logarithms).
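In base-2 form (δ = 2^{−K}), the inequality says that the chance that a "wrong" coding distribution q beats the true source p by K or more bits on a drawn symbol is at most 2^{−K}. A minimal sketch on a four-symbol alphabet (the distributions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)
p = np.full(4, 0.25)                      # true source: uniform over 4 symbols
q = np.array([0.5, 0.25, 0.125, 0.125])   # coding distribution
K, trials = 1, 500_000

X = rng.choice(4, size=trials, p=p)
# event: the q-codelength is at least K bits shorter than the p-codelength
saves = -np.log2(q[X]) <= -np.log2(p[X]) - K
print("P(q saves >= K bits):", saves.mean(), " <= 2^-K =", 2.0 ** -K)
```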

4.2 Chernoff Importance Sampling

Another way to use the previous construction is to consider importance sampling¹. In many applications of importance sampling, one wants to estimate E_Q[f(X)] but only has samples X1, ..., Xn drawn from P. The classical way is to reweight the values by the likelihood ratio, namely to form

f_n^{P→Q} = (1/n) Σ_{i=1}^n (Q(Xi)/P(Xi)) f(Xi) = (1/n) Σ_{i=1}^n f̃^{P→Q}(Xi) ,

since indeed E_P[f_n^{P→Q}] = E_Q[f(X)]. However, in many applications one does not really care about E_Q[f(X)] itself, but rather about the deviations of (1/n) Σ_{i=1}^n f(Xi) around its mean, which is a very different question. For that purpose, the estimate f_n^{P→Q} can turn out to be very bad, since it classically suffers from a high variance. A number of techniques have been suggested to reduce this variance. Here we directly tackle the control of the tail of (1/n) Σ_{i=1}^n f(Xi), which classically requires a control of its log-Laplace transform. More precisely:

Lemma 11 (Chernoff Importance Sampling) Let P be a distribution on X and X1, ..., Xn ∼ P i.i.d. Let f : X → R be some function taking real values. Let Q be another distribution on X such that the real random variable f(X), for X ∼ Q, is known to be R-sub-Gaussian. Let δ ∈ (0, 1] be given, and form the quantity

f̃δ^{P→Q}(Xi) = f(Xi) + R √( n/(2 log(1/δ)) ) log( Q(Xi)/P(Xi) ) .

Then, it holds for this precise δ and number of observations n,

P_P( (1/n) Σ_{i=1}^n f̃δ^{P→Q}(Xi) − E_Q[f] ≥ R √( 2 log(1/δ)/n ) ) ≤ δ .

This result shows that even though we do not directly have access to samples from Q, it is possible, knowing some concentration properties of Q, to reshape the empirical estimate of the mean in order to have a good control of the tails.

Proof of Lemma 11:
We want to control log E_Q[ exp( λ Σ_{i=1}^n (f(Xi) − E_Q[f]) ) ]. Using a change of measure argument, this can be written as

log E_P[ Π_{i=1}^n (Q(Xi)/P(Xi)) exp( λ(f(Xi) − E_Q[f]) ) ] = log E_P[ exp( λ Σ_{i=1}^n ( f(Xi) + (1/λ) log(Q(Xi)/P(Xi)) − E_Q[f] ) ) ] .

At this point, let us use the fact that the target distribution of f(X) for X ∼ Q is known to be R-sub-Gaussian. In this case, we know that what matters is to control the Legendre-Fenchel dual potential function

¹In the literature, Chernoff importance sampling refers to a different approach. However, there is no reason not to use this name for the scheme considered in this section. An alternative name may be Chernoff-Laplace importance sampling.


sup_{λ∈R} { λnε − log E_Q[ exp( λ Σ_{i=1}^n (f(Xi) − E_Q[f]) ) ] } ≥ sup_{λ∈R} { λnε − R²λ²n/2 } = nε²/(2R²) ,

where the last supremum is attained at λ* = ε/R². This suggests the choice

f̃(Xi) = f(Xi) + (R²/ε) log( Q(Xi)/P(Xi) ) ,

for some ε > 0. Indeed, one obtains that

log E_P[ exp( λ* Σ_{i=1}^n (f̃(Xi) − E_Q[f]) ) ] = log E_Q[ exp( λ* Σ_{i=1}^n (f(Xi) − E_Q[f]) ) ] ,

and thus for this specific value of ε, it holds

P( (1/n) Σ_{i=1}^n f̃(Xi) ≥ E_Q[f] + ε ) ≤ exp( − nε²/(2R²) ) .

Alternatively, choosing ε = R√( 2 log(1/δ)/n ) for some δ ∈ (0, 1], we obtain that with probability higher than 1 − δ,

(1/n) Σ_{i=1}^n f̃(Xi) − E_Q[f] < R√( 2 log(1/δ)/n ) ,

where

f̃(Xi) = f(Xi) + R√( n/(2 log(1/δ)) ) log( Q(Xi)/P(Xi) ) .

Likewise, writing the reverse quantity

log E_Q[ exp( λ Σ_{i=1}^n (E_Q[f] − f(Xi)) ) ] = log E_P[ exp( λ Σ_{i=1}^n ( E_Q[f] + (1/λ) log(Q(Xi)/P(Xi)) − f(Xi) ) ) ]

suggests the choice

f̃₋(Xi) = f(Xi) − (R²/ε) log( Q(Xi)/P(Xi) ) .

Indeed, one obtains that

log E_P[ exp( λ* Σ_{i=1}^n (E_Q[f] − f̃₋(Xi)) ) ] = log E_Q[ exp( λ* Σ_{i=1}^n (E_Q[f] − f(Xi)) ) ] ,

and then, for the same ε as before, we get

P_P( E_Q[f] − (1/n) Σ_{i=1}^n f̃₋,δ^{P→Q}(Xi) ≥ R√( 2 log(1/δ)/n ) ) ≤ δ . □
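A simulation sketch of Lemma 11 with P = N(0, 1), Q = N(1, 1), and f(x) = x, so that f(X) is 1-sub-Gaussian under Q, E_Q[f] = 1, and log(q(x)/p(x)) = x − 1/2 (all concrete sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, delta, R = 100, 50_000, 0.05, 1.0
L = np.log(1 / delta)

X = rng.standard_normal((trials, n))   # samples from P = N(0,1)
log_ratio = X - 0.5                    # log q(x)/p(x) for Q = N(1,1)
f_tilde = X + R * np.sqrt(n / (2 * L)) * log_ratio
dev = f_tilde.mean(axis=1) - 1.0       # centered at E_Q[f] = 1
print("P(dev >= bound):", np.mean(dev >= R * np.sqrt(2 * L / n)),
      " target:", delta)
# The guarantee typically holds with a lot of slack here: the reshaped estimate
# is biased downward under P, and only its upper tail is controlled.
```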



The previous argument can be generalized by following the construction of Lemma 7. This leads to the following result, which makes it possible to transfer the concentration of measure from a single target distribution to a set of distributions that have been previously sampled. For simplicity of exposition, we present this version of the result without a random stopping time.


Lemma 12 (Concentration inequality for predictable processes under covariate shift) Let {Xi}_{i=1}^∞ be a predictable sequence of random variables for a filtration H, with known distributions {Pi}_{i=1}^∞ such that Xi | Hi−1 ∼ Pi. Let Q be a probability measure and let ϕ : R → R₊ be a convex upper-envelope of the cumulant generating function corresponding to Q, with ϕ(0) = 0, and ϕ* its Legendre-Fenchel transform, that is:

∀λ ∈ D, ln E_Q[ exp( λ(X − E_Q[X]) ) ] ≤ ϕ(λ) ,
∀x ∈ R, ϕ*(x) = sup_{λ∈R} ( λx − ϕ(λ) ) ,

where D = {λ ∈ R : ln E[exp(λX)] < ∞}. Assume that D contains an open neighborhood of 0. Let δ ∈ (0, 1] be given, and form the quantities

Xi^{+,Pi→Q} = Xi + (1/λ₊) ln( Q(Xi)/Pi(Xi) ) , where λ₊ = argmax_{λ∈R₊} { λ ϕ*,+⁻¹( ln(1/δ)/n ) − ϕ(λ) } ,
Xi^{−,Pi→Q} = Xi + (1/λ₋) ln( Q(Xi)/Pi(Xi) ) , where λ₋ = argmax_{λ∈R₋} { λ ϕ*,−⁻¹( ln(1/δ)/n ) − ϕ(λ) } .

Now, for this specific choice of δ and a deterministic number of observations n, it holds

P( (1/n) Σ_{i=1}^n Xi^{+,Pi→Q} − E_Q[X] ≥ ϕ*,+⁻¹( ln(1/δ)/n ) ) ≤ δ ,
P( (1/n) Σ_{i=1}^n Xi^{−,Pi→Q} − E_Q[X] ≤ ϕ*,−⁻¹( ln(1/δ)/n ) ) ≤ δ .

Proof of Lemma 12:
1. Change of measure. Let us introduce for convenience the notation Z̃i = Xi^{+,Pi→Q} − E_Q[X]. Then we want to control, for some εn > 0 to be defined later,

P( (1/n) Σ_{i=1}^n Z̃i ≥ εn )
≤ P( exp( λn Σ_{i=1}^n Z̃i ) ≥ exp( λn n εn ) )
≤ P( Wn ≥ exp( n( λn εn − ϕ(λn) ) ) ) , where Wn := exp( λn Σ_{i=1}^n Z̃i − n ϕ(λn) ) .

Since εn > 0, we can choose λn > 0 such that ϕ*(εn) = λn εn − ϕ(λn).

2. Doob's maximal inequality. At this point, we show that the sequence {Wt}_{t≤n} with Wt = exp( λn Σ_{i=1}^t Z̃i − t ϕ(λn) ) is a non-negative super-martingale. Indeed, note that:

E[ W_{t+1} | Ft ] = Wt E[ exp( λn Z̃_{t+1} ) | Ft ] exp( −ϕ(λn) )
= Wt E[ exp( λn ( X_{t+1} + (1/λn) ln( Q(X_{t+1})/P_{t+1}(X_{t+1}) ) − E_Q[X] ) ) | Ft ] exp( −ϕ(λn) )
= Wt E_Q[ exp( λn (X − E_Q[X]) ) ] exp( −ϕ(λn) )
≤ Wt .


We thus find

P( (1/n) Σ_{i=1}^n Z̃i ≥ εn ) ≤ P( Wn ≥ exp( n ϕ*(εn) ) ) ≤(a) exp( − n ϕ*(εn) ) ,

where (a) holds by application of Doob's maximal inequality for non-negative super-martingales, using that E[W0] ≤ 1.

3. Parameter tuning. Now, let us choose εn such that n ϕ*(εn) = c ≥ 1 is a constant, that is εn = ϕ*,+⁻¹(c/n). Thus, we get

P( (1/n) Σ_{i=1}^n Z̃i ≥ εn ) ≤ exp( −c ) ,

and we conclude by choosing c = log(1/δ).

4. Reverse bounds. Now, for εn < E_Q[X], we can follow the same steps, but choose λn > 0 such that ϕ*(εn) = −λn εn − ϕ(−λn) ≥ 0. □

4.3 Transportation lemma

We conclude this section with a powerful result known as the transportation lemma.

Lemma 13 For any function f, let us introduce ϕf : λ ↦ log E_P[ exp( λ(f(X) − E_P[f]) ) ]. Whenever ϕf is defined on some (possibly unbounded) interval I containing 0, define its dual ϕ*,f(x) = sup_{λ∈I} ( λx − ϕf(λ) ). Then it holds

∀Q ≪ P, E_Q[f] − E_P[f] ≤ ϕ₊,f⁻¹( KL(Q, P) ) , where ϕ₊,f⁻¹(t) = inf{ x ≥ 0 : ϕ*,f(x) > t } ,
∀Q ≪ P, E_Q[f] − E_P[f] ≥ ϕ₋,f⁻¹( KL(Q, P) ) , where ϕ₋,f⁻¹(t) = sup{ x ≤ 0 : ϕ*,f(x) > t } .

Proof:
Let us recall the fundamental (Donsker-Varadhan) equality:

∀λ ∈ R, log E_P[ exp( λ(f(X) − E_P[f]) ) ] = sup_{Q≪P} { λ( E_Q[f] − E_P[f] ) − KL(Q, P) } .

In particular, we obtain on the one hand that

∀Q ≪ P, E_Q[f] − E_P[f] ≤ min_{λ∈R₊} ( ϕf(λ) + KL(Q, P) ) / λ .

Since ϕf(0) = 0, the right-hand-side quantity is non-negative; let us call it u. Then we note that, for any t with u > t ≥ 0, it holds by construction of u that KL(Q, P) ≥ ϕ*,f(t). Thus {t ≥ 0 : ϕ*,f(t) > KL(Q, P)} = (u, ∞), and thus u = ϕ₊,f⁻¹( KL(Q, P) ).

On the other hand, it holds

∀Q ≪ P, E_Q[f] − E_P[f] ≥ max_{λ∈R₋} ( ϕf(λ) + KL(Q, P) ) / λ .

Since ϕf(0) = 0, the right-hand-side quantity is non-positive; let us call it v. Then we note that, for any t with v ≤ t ≤ 0, it holds by construction of v that KL(Q, P) ≥ ϕ*,f(t). Thus {t ≤ 0 : ϕ*,f(t) > KL(Q, P)} = (−∞, v), and thus v = ϕ₋,f⁻¹( KL(Q, P) ). □


Corollary 3 Assume that f is such that V_P[f] and S(f) = max_x f(x) − min_x f(x) are finite. Then it holds

∀Q ≪ P, E_Q[f] − E_P[f] ≤ √( 2 V_P[f] KL(Q, P) ) + (2S(f)/3) KL(Q, P) ,
∀Q ≪ P, E_P[f] − E_Q[f] ≤ √( 2 V_P[f] KL(Q, P) ) .

In particular, this shows that it is enough to control the Kullback-Leibler divergence between a distribution and its empirical counterpart in order to immediately derive a concentration result for the empirical mean of virtually any function (with finite variance and span).

Proof:
Indeed, by a standard Bernstein argument, it holds

∀λ ∈ [0, 3/S(f)), ϕf(λ) ≤ ( V_P[f] λ²/2 ) / ( 1 − S(f)λ/3 ) ,
∀x ≥ 0, ϕ*,f(x) ≥ x² / ( 2( V_P[f] + S(f)x/3 ) ) .

Then, a direct computation shows that

ϕ₊,f⁻¹(t) ≤ (S(f)/3) t + √( 2t V_P[f] + ( (S(f)/3) t )² ) ,
ϕ₋,f⁻¹(t) ≥ (S(f)/3) t − √( 2t V_P[f] + ( (S(f)/3) t )² ) .

Combining these two bounds, we obtain

E_Q[f] − E_P[f] ≤ √( 2 V_P[f] KL(Q, P) + ( (S(f)/3) KL(Q, P) )² ) + (S(f)/3) KL(Q, P) ,
E_P[f] − E_Q[f] ≤ √( 2 V_P[f] KL(Q, P) + ( (S(f)/3) KL(Q, P) )² ) − (S(f)/3) KL(Q, P) ,

and we conclude using √(a + b²) ≤ √a + b for a, b ≥ 0. □
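As a concrete check of Corollary 3, take P = Ber(1/2) and Q = Ber(q) with f the identity on {0, 1}, so that V_P[f] = 1/4, S(f) = 1 and E_Q[f] − E_P[f] = q − 1/2 (the values of q are arbitrary illustrative choices):

```python
import numpy as np

def kl_bernoulli(q, p):
    """KL(Ber(q), Ber(p)) in nats."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

p, var_p, span = 0.5, 0.25, 1.0            # f = identity on {0, 1}
for q in [0.6, 0.8, 0.95]:
    kl = kl_bernoulli(q, p)
    lhs = q - p                            # E_Q[f] - E_P[f]
    rhs = np.sqrt(2 * var_p * kl) + (2 * span / 3) * kl
    print(f"q={q}: {lhs:.3f} <= {rhs:.3f}")
```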



Conclusion

In this short note, we have provided a few basic results for the concentration of real-valued predictable processes. There are many more results out there regarding concentration of measure. These include Sanov-type concentration inequalities for the empirical distribution, concentration results for the variance leading to empirical Bernstein bounds, concentration of the cumulative distribution function, of the conditional value at risk, concentration inequalities for the median instead of the mean, or in terms of (inverse) information projection. Beyond real-valued random variables, one can consider concentration for vector-valued or matrix-valued martingales, and look at concentration in terms of various norms, for instance based on the Wasserstein or total variation distance.
