
36-789: Topics in High Dimensional Statistics II, Fall 2015

Lecture 2: October 29

Lecturer: Alessandro Rinaldo    Scribes: Jisu KIM

Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

2.1 Recap

As discussed in Lecture 1 (Oct 27; see also Section 2.2 in [T2008], p.79-80), the general strategy to obtain minimax rates yields

$$\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[w\big(d(\hat\theta, \theta(P))\big)\right] \;\ge\; w(\delta)\, \inf_{\psi} \max_{j=0,\cdots,M} P_{\theta_j}\big(\psi(X) \neq j\big),$$

where ψ : X → {0, ··· , M} is a test (Proposition 2.3 in [D2014], p.13) and d(θ_i, θ_j) ≥ 2δ for all i ≠ j, i.e. {θ_0, ··· , θ_M} is a 2δ-packing (Section 2.2.1 in [D2014], p.13). Denote

$$p_e(\theta_0, \cdots, \theta_M) = \inf_{\psi} \max_{j} P_{\theta_j}\big(\psi(X) \neq j\big).$$

The next job is to lower bound p_e by a constant. If

$$p_e \ge c > 0,$$

then w(δ)c is a lower bound, and w(δ) = w(δ_n) → 0 gives a rate when δ = δ_n → 0 as n → ∞ (Equation (2.3) in [T2008], p.80). This rate is optimal if one can find an estimator θ̂(X) such that

$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\left[w\big(d(\hat\theta(X), \theta(P))\big)\right] = O\big(w(\delta_n)\big).$$

2.2 Distance between probability distributions

Throughout this section, see Section 2.4 in [T2008], p.83-91. Let P, Q be two probability measures on (Ω, A), having densities p and q with respect to some dominating measure µ (e.g. Lebesgue measure on R^d; one can always take µ = P + Q).

2.2.1 Total variation distance

Definition 2.1 (Definition 2.4 in [T2008], p.83, and Equation (1.2.4) in [D2014], p.7)
$$d_{TV}(P,Q) = \|P - Q\|_{TV} := \sup_{A \in \mathcal{A}} |P(A) - Q(A)|.$$


Then the following hold (properties of the total variation distance, Section 2.4 in [T2008], p.84):

• d_TV is a distance (indeed a metric).

• 0 ≤ dTV ≤ 1.

• dTV = 0 if and only if P = Q.

• dTV = 1 if and only if P and Q are singular, i.e. there exists A such that P (A) = 1 and Q(A) = 0.

• d_TV is a very strong distance: d_TV(P_n, P) → 0 means that P_n(A) → P(A) uniformly over all events A ∈ A.

Lemma 2.2 (Scheffé's lemma; Lemma 2.1 in [T2008], p.84)
$$d_{TV}(P,Q) = \frac{1}{2} \int_{\mathcal{X}} |p(x) - q(x)| \, d\mu(x).$$

Proof: Take A = {x ∈ X : q(x) ≥ p(x)} .
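As a concrete illustration of Scheffé's lemma (not part of the original notes; the two probability vectors below are arbitrary choices), the following sketch compares the supremum over all events with half the L1 distance for two distributions on a four-point space.

```python
# Minimal numerical check of Scheffe's lemma on a finite sample space:
# sup_A |P(A) - Q(A)| equals (1/2) * sum_x |p(x) - q(x)|.
import itertools
import numpy as np

p = np.array([0.5, 0.3, 0.1, 0.1])   # density of P w.r.t. counting measure
q = np.array([0.2, 0.2, 0.3, 0.3])   # density of Q

# sup over all events A, enumerated as subsets of {0, 1, 2, 3}
tv_sup = max(abs(p[list(A)].sum() - q[list(A)].sum())
             for r in range(len(p) + 1)
             for A in itertools.combinations(range(len(p)), r))

tv_l1 = 0.5 * np.abs(p - q).sum()    # Scheffe: half the L1 distance

print(tv_sup, tv_l1)                 # both equal 0.4
assert np.isclose(tv_sup, tv_l1)
```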

2.2.1.1 Interpretation of dTV

Suppose we observe X coming from either P or Q, and consider the hypothesis test

$$H_0 : X \sim P \quad \text{vs.} \quad H_a : X \sim Q.$$

Now for any test φ : X → {0, 1}, interpreted as
$$\phi(X) = \begin{cases} 1 & X \text{ comes from } Q \\ 0 & X \text{ comes from } P, \end{cases}$$
the Type I error is E_P[φ(X)] and the Type II error is E_Q[1 − φ(X)]. Then

$$1 - d_{TV}(P,Q) = \inf_{\phi} \big\{ \mathbb{E}_P[\phi(X)] + \mathbb{E}_Q[1 - \phi(X)] \big\},$$
where the infimum is over all tests (measurable functions) φ : X → {0, 1} (exercise). Proof: by the Neyman-Pearson lemma, the optimal test is
$$\phi(x) = \begin{cases} 1 & q(x) \ge p(x) \\ 0 & q(x) < p(x). \end{cases}$$

More generally,
$$\frac{1}{2} \int_{\mathcal{X}} |p - q| \, d\mu = d_{TV}(P,Q) = 1 - \int_{\mathcal{X}} \min\{p(x), q(x)\} \, d\mu(x)$$
follows from Scheffé's lemma. Then the following hold:
$$\inf_{0 \le f \le 1} \big\{ \mathbb{E}_P[f] + \mathbb{E}_Q[1 - f] \big\} = \int_{\mathcal{X}} \min\{p(x), q(x)\} \, d\mu(x),$$
$$\inf_{f, g \ge 0,\; f + g \ge 1} \big\{ \mathbb{E}_P[f] + \mathbb{E}_Q[g] \big\} \ge \int_{\mathcal{X}} \min\{p(x), q(x)\} \, d\mu(x) = 1 - d_{TV}(P,Q).$$
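The identity 1 − d_TV(P,Q) = inf_φ {E_P[φ] + E_Q[1 − φ]} can be checked by brute force on a finite sample space. The sketch below (not from the notes; it reuses the same arbitrary four-point distributions as above) enumerates all deterministic tests and compares the minimal total error with 1 − d_TV(P,Q).

```python
# The minimal sum of Type I and Type II errors over tests phi: X -> {0,1}
# equals 1 - d_TV(P,Q); the minimizer is phi(x) = 1{q(x) >= p(x)}.
import itertools
import numpy as np

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.2, 0.2, 0.3, 0.3])
d_tv = 0.5 * np.abs(p - q).sum()

best = min(np.dot(p, phi) + np.dot(q, 1 - np.array(phi))   # E_P[phi] + E_Q[1-phi]
           for phi in itertools.product([0, 1], repeat=len(p)))

print(best, 1 - d_tv)                # both equal 0.6
assert np.isclose(best, 1 - d_tv)
```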


2.2.2 Hellinger

Definition 2.3 (Definition 2.3 in [T2008], p.83, and Equation (2.2.2) in [D2014], p.15)
$$H(P,Q) = \sqrt{\int_{\mathcal{X}} \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 d\mu(x)}.$$

Then the following hold (properties of the Hellinger distance, Section 2.4 in [T2008], p.83):

• H(P,Q) is the L2 distance between √p and √q.

• H(P,Q) gives a canonical notion of regularity for a statistical model: when √(p(x)) is Hadamard differentiable.

• H(P,Q) is a metric.

• 0 ≤ H²(P,Q) ≤ 2.

• H²(P,Q) = 2 ( 1 − ∫_X √(p(x) q(x)) dµ(x) ), where ∫_X √(p(x) q(x)) dµ(x) is the Hellinger affinity.

• Tensorization (Equation (2.2.5) in [D2014], p.15; a numerical check follows this list): if P = ⊗_{i=1}^n P_i and Q = ⊗_{i=1}^n Q_i, then
$$H^2(P,Q) = 2 \left[ 1 - \prod_{i=1}^{n} \left( 1 - \frac{H^2(P_i, Q_i)}{2} \right) \right].$$

2.2.3 KL

Definition 2.4 (Definition 2.5 in [T2008], p.84, and Section 1.2.2 in [D2014], p.5-7)
$$KL(P,Q) = \begin{cases} \int_{\mathcal{X}} \log\dfrac{p(x)}{q(x)}\, p(x)\, d\mu(x) & \text{if } P \ll Q \\ +\infty & \text{otherwise.} \end{cases}$$

Then the following hold (properties of the Kullback divergence, Section 2.4 in [T2008], p.83):

• KL(P,Q) ≥ 0.

• KL(P,Q) = 0 if and only if P = Q.

• It is not symmetric and does not satisfy the triangle inequality.

• Tensorization (Equation (2.2.4) in [D2014], p.15; a numerical check follows this list): if P = ⊗_{i=1}^n P_i and Q = ⊗_{i=1}^n Q_i, then
$$KL(P,Q) = \sum_{i=1}^{n} KL(P_i, Q_i).$$
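The additivity of KL over independent coordinates is equally easy to verify numerically. The sketch below (not from the notes; it reuses the same arbitrary two-point marginals as the Hellinger check) does so for n = 2.

```python
# Check that KL divergence is additive over a product of independent coordinates.
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))   # assumes full support, i.e. P << Q

P_marg = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
Q_marg = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]

P_prod = np.array([a * b for a in P_marg[0] for b in P_marg[1]])
Q_prod = np.array([a * b for a in Q_marg[0] for b in Q_marg[1]])

print(kl(P_prod, Q_prod), sum(kl(p, q) for p, q in zip(P_marg, Q_marg)))
# the two numbers agree: KL(P, Q) = sum_i KL(P_i, Q_i)
```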

2.2.4 χ2-divergence

Definition 2.5
$$\chi^2(P,Q) = \begin{cases} \int_{\mathcal{X}} \left( \dfrac{p(x)}{q(x)} - 1 \right)^2 q(x) \, d\mu(x) & \text{if } P \ll Q \\ +\infty & \text{otherwise.} \end{cases}$$

Then the following hold (properties of the χ² divergence, Section 2.4 in [T2008], p.83):

• χ²(P,Q) = ∫_X (p(x)/q(x))² q(x) dµ(x) − 1.

• χ²(P,Q) is the f-divergence with f(x) = (x − 1)², where for a function f the f-divergence is defined as D_f(P,Q) = ∫_X f(p(x)/q(x)) q(x) dµ(x) (Section 1.2.3 in [D2014], p.7).

• Tensorization (a numerical check follows this list): if P = ⊗_{i=1}^n P_i and Q = ⊗_{i=1}^n Q_i, then
$$1 + \chi^2(P,Q) = \prod_{i=1}^{n} \left( 1 + \chi^2(P_i, Q_i) \right), \qquad \text{i.e.} \qquad \chi^2(P,Q) = \prod_{i=1}^{n} \left( 1 + \chi^2(P_i, Q_i) \right) - 1.$$
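The multiplicative form of the χ² tensorization can be checked in the same way. The sketch below (not part of the original notes; the marginals are the same arbitrary two-point distributions as above) verifies 1 + χ²(P,Q) = ∏_i (1 + χ²(P_i, Q_i)) for n = 2.

```python
# Check of the multiplicative chi-square tensorization identity.
import numpy as np

def chi_sq(p, q):
    return np.sum((p / q - 1) ** 2 * q)   # assumes full support, i.e. P << Q

P_marg = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]
Q_marg = [np.array([0.5, 0.5]), np.array([0.9, 0.1])]

P_prod = np.array([a * b for a in P_marg[0] for b in P_marg[1]])
Q_prod = np.array([a * b for a in Q_marg[0] for b in Q_marg[1]])

lhs = 1 + chi_sq(P_prod, Q_prod)
rhs = np.prod([1 + chi_sq(p, q) for p, q in zip(P_marg, Q_marg)])
print(lhs, rhs)
assert np.isclose(lhs, rhs)
```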

2.2.5 Relationships among d_TV, H, KL, and χ²

• 1 − d_TV(P,Q) = ∫ min{p, q} dµ ≥ (1/2) ( ∫ √(pq) dµ )² = (1/2) ( 1 − H²(P,Q)/2 )² (Lemma 2.3 in [T2008], p.86).

• (1/2) H²(P,Q) ≤ d_TV(P,Q) ≤ H(P,Q) √(1 − H²(P,Q)/4) ≤ H(P,Q) (Lemma 2.3 in [T2008], p.86, and Proposition 2.4.(a) and Section 2.6.1 in [D2014], p.15, 30).

Lemma 2.6 (Donoho and Liu, 1991; "tensorization of d_TV")

If d_TV(P,Q) ≤ 1 − (1 − δ²/2)^{1/n} for some δ ∈ (0, 1), then d_TV(P^n, Q^n) ≤ δ.

Proof:

n n n n dTV (P ,Q ) ≤ H(P ,Q ) v u " n # u Y  H2(P,Q) = t2 1 − 1 − 2 i=1 r h 2i ≤ 2 1 − (1 − dTV (P,Q))

≤ δ.
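The lemma can be checked by brute force for small n. In the sketch below (not from the notes; the values δ = 0.5 and n = 8 and the Bernoulli pair are arbitrary choices), d_TV(P,Q) is placed exactly at the threshold and d_TV(P^n, Q^n) is computed by enumerating the 2^n points of the product space.

```python
# Numerical sanity check of Lemma 2.6 for a pair of Bernoulli distributions.
import itertools
import numpy as np

delta, n = 0.5, 8
eps = 1 - (1 - delta ** 2 / 2) ** (1 / n)    # threshold on d_TV(P, Q)

# Bernoulli(a) vs Bernoulli(b): d_TV = |a - b|, set equal to the threshold
a, b = 0.5, 0.5 + eps
P, Q = np.array([1 - a, a]), np.array([1 - b, b])

tv_n = 0.5 * sum(abs(np.prod([P[x] for x in xs]) - np.prod([Q[x] for x in xs]))
                 for xs in itertools.product([0, 1], repeat=n))
print(eps, tv_n, delta)   # tv_n stays below delta, as the lemma guarantees
assert tv_n <= delta
```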

Theorem 2.7 (Pinsker's inequality; Lemma 2.5 in [T2008], p.88, and Proposition 2.4.(b) and Section 2.6.1 in [D2014], p.15, 30-31)
$$d_{TV}(P,Q) \le \sqrt{\frac{KL(P,Q)}{2}}.$$

• KL(P,Q) ≤ χ²(P,Q) (Lemma 2.7 in [T2008], p.90).

• d_TV(P,Q) ≤ H(P,Q) ≤ √(KL(P,Q)) ≤ √(χ²(P,Q)) (Lemma 2.4 and Equation (2.27) in [T2008], p.90). A numerical check of these relationships follows.
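As a quick sanity check of the relationships in this subsection, the following sketch (not from the notes; the two probability vectors are arbitrary) verifies Pinsker's inequality, the total variation/Hellinger bounds, and the chain d_TV ≤ H ≤ √KL ≤ √χ² for a pair of distributions on a four-point space.

```python
# Numerical check of the relationships among d_TV, H, KL, and chi^2.
import numpy as np

p = np.array([0.5, 0.3, 0.1, 0.1])
q = np.array([0.2, 0.2, 0.3, 0.3])

d_tv = 0.5 * np.abs(p - q).sum()
H    = np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
KL   = np.sum(p * np.log(p / q))
chi2 = np.sum((p / q - 1) ** 2 * q)

assert 1 - d_tv >= 0.5 * (1 - H ** 2 / 2) ** 2               # affinity bound
assert 0.5 * H ** 2 <= d_tv <= H * np.sqrt(1 - H ** 2 / 4)   # TV vs Hellinger
assert d_tv <= np.sqrt(KL / 2)                               # Pinsker
assert d_tv <= H <= np.sqrt(KL) <= np.sqrt(chi2)             # the full chain
print(d_tv, H, np.sqrt(KL), np.sqrt(chi2))
```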

2.2.6 Minimax lower bounds based on 2 hypotheses

Recall that p_e(P_0, P_1) = inf_ψ max_{i=0,1} P_i(ψ(X) ≠ i). We now want to lower bound this quantity, under the assumption that d(θ(P_0), θ(P_1)) ≥ 2δ.

Theorem 2.8 (Theorem 2.2 in [T2008], p.90)

1) If d_TV(P_0, P_1) ≤ α (≤ 1), then p_e(P_0, P_1) ≥ (1 − α)/2 (total variation version).

2) If H²(P_0, P_1) ≤ α (≤ 2), then p_e ≥ (1/2) [ 1 − √(α (1 − α/4)) ] (Hellinger version).

3) If KL(P_0, P_1) ≤ α < ∞ or χ²(P_0, P_1) ≤ α < ∞, then p_e ≥ max{ (1/4) e^{−α}, (1/2) (1 − √(α/2)) } (Kullback/χ² version).

Proof: 2) and 3) follow from 1). For 1):

$$p_e = \inf_{\psi} \max_{i=0,1} P_i\big(\psi(X) \neq i\big) \ge \inf_{\psi} \left[ \frac{1}{2} P_0\big(\psi(X) \neq 0\big) + \frac{1}{2} P_1\big(\psi(X) \neq 1\big) \right] = \frac{1}{2} \inf_{\psi} \big[ \text{Type I error} + \text{Type II error} \big] = \frac{1}{2} \big[ 1 - d_{TV}(P_0, P_1) \big].$$
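To make the theorem concrete, here is a small numerical sketch (not part of the original notes) that evaluates the bounds for the pair P_0 = N(0,1), P_1 = N(θ,1), using the standard closed forms d_TV = 2Φ(θ/2) − 1, H² = 2(1 − e^{−θ²/8}), KL = θ²/2 and χ² = e^{θ²} − 1; the value θ = 0.5 is an arbitrary choice.

```python
# Evaluate the lower bounds of Theorem 2.8 for N(0,1) vs N(theta,1).
import numpy as np
from scipy.stats import norm

theta = 0.5
d_tv = 2 * norm.cdf(theta / 2) - 1          # total variation distance
H2   = 2 * (1 - np.exp(-theta ** 2 / 8))    # squared Hellinger distance
KL   = theta ** 2 / 2                       # Kullback-Leibler divergence
chi2 = np.exp(theta ** 2) - 1               # chi-square divergence

bound_tv  = (1 - d_tv) / 2                                            # part 1)
bound_h   = 0.5 * (1 - np.sqrt(H2 * (1 - H2 / 4)))                    # part 2)
bound_kl  = max(0.25 * np.exp(-KL), 0.5 * (1 - np.sqrt(KL / 2)))      # part 3), KL
bound_chi = max(0.25 * np.exp(-chi2), 0.5 * (1 - np.sqrt(chi2 / 2)))  # part 3), chi^2

print(bound_tv, bound_h, bound_kl, bound_chi)
# all four are valid lower bounds on p_e; the total variation bound,
# which uses the distance directly, is the largest here
```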

2.3 Le Cam’s Lemma

Lemma 2.9 (Le Cam's lemma; Lemma 1 in [Y1997], p.424-425)

Let Θ = {θ(P) : P ∈ P}. Suppose there exist Θ_1, Θ_2 ⊂ Θ such that d(θ_1, θ_2) ≥ 2δ for all θ_1 ∈ Θ_1 and all θ_2 ∈ Θ_2. Let P_i ⊂ P consist of all P ∈ P such that θ(P) ∈ Θ_i, and let co(P_i) be the convex hull of P_i, i = 1, 2. Then

$$\inf_{\hat\theta} \sup_{P \in \mathcal{P}} \mathbb{E}_P\left[ w\big( d(\hat\theta, \theta(P)) \big) \right] \ge w(\delta) \sup_{P_1 \in co(\mathcal{P}_1),\, P_2 \in co(\mathcal{P}_2)} \underbrace{\big[ 1 - d_{TV}(P_1, P_2) \big]}_{\int \min\{p_1, p_2\}\, d\mu}.$$

Proof: Take w(x) = x. Then


$$M := 2 \sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ d(\hat\theta, \theta(P)) \big] \ge \mathbb{E}_{P_1}\big[ d(\hat\theta, \Theta_1) \big] + \mathbb{E}_{P_2}\big[ d(\hat\theta, \Theta_2) \big]$$

for any P_i ∈ co(P_i). Since d(θ̂, Θ_1) + d(θ̂, Θ_2) ≥ d(Θ_1, Θ_2) ≥ 2δ by hypothesis, the functions f := d(θ̂, Θ_1)/(2δ) and g := d(θ̂, Θ_2)/(2δ) are nonnegative and satisfy f + g ≥ 1, so

$$M \ge 2\delta \left[ \mathbb{E}_{P_1}\!\left[ \frac{d(\hat\theta, \Theta_1)}{2\delta} \right] + \mathbb{E}_{P_2}\!\left[ \frac{d(\hat\theta, \Theta_2)}{2\delta} \right] \right] \ge 2\delta \inf_{f, g \ge 0,\; f + g \ge 1} \big\{ \mathbb{E}_{P_1}[f(X)] + \mathbb{E}_{P_2}[g(X)] \big\} \ge 2\delta \big[ 1 - d_{TV}(P_1, P_2) \big].$$

Example. Taking mixtures may help.

Suppose P = {N(θ, 1) : θ ∈ R} and θ(N(θ, 1)) = θ, and we want to lower bound the minimax rate for estimating θ(P). If we consider P_1 = N(θ, 1) and P_2 = N(0, 1), then

$$d_{TV}\big( N(\theta, 1), N(0, 1) \big) = 2\Phi(|\theta|/2) - 1 \approx \frac{|\theta|}{\sqrt{2\pi}} + o(\theta^2) \quad \text{as } \theta \to 0,$$
where Φ is the cdf of N(0, 1), so the distance vanishes only linearly in θ.

However, if we consider P_1 = {N(θ, 1), N(−θ, 1)} and the mixture P̄_1 = (1/2)[N(−θ, 1) + N(θ, 1)] ∈ co(P_1), then
$$d_{TV}\left( \tfrac{1}{2}\big[ N(-\theta, 1) + N(\theta, 1) \big],\, N(0, 1) \right) \approx \theta^2 \varphi(1) + O(\theta^4) \quad \text{as } \theta \to 0,$$
where φ is the pdf of N(0, 1). Hence taking mixtures gives a better lower bound, since 1 − d_TV stays much closer to 1.
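The two approximations above can be checked numerically. The sketch below (not part of the original notes; the integration grid and the values of θ are arbitrary choices) computes both total variation distances by numerical integration and confirms the linear versus quadratic scaling in θ.

```python
# d_TV(N(theta,1), N(0,1)) shrinks linearly in theta, while the symmetric
# mixture (1/2)[N(-theta,1) + N(theta,1)] is only quadratically far from N(0,1).
import numpy as np
from scipy.stats import norm

x = np.linspace(-10, 10, 100001)   # fine grid for numerical integration
dx = x[1] - x[0]
p0 = norm.pdf(x)

for theta in [0.4, 0.2, 0.1, 0.05]:
    single  = 0.5 * np.sum(np.abs(norm.pdf(x - theta) - p0)) * dx
    mix     = 0.5 * (norm.pdf(x - theta) + norm.pdf(x + theta))
    mixture = 0.5 * np.sum(np.abs(mix - p0)) * dx
    print(theta, single / theta, mixture / theta ** 2)
    # single/theta -> 1/sqrt(2*pi) ~ 0.399 and mixture/theta^2 -> pdf(1) ~ 0.242
```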

References

[T2008] Tsybakov, A. (2008). Introduction to Nonparametric Estimation. Springer.

[Y1997] Yu, B. (1997). Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam. Springer.

[D2014] Duchi, J. (2014). Lecture notes on minimax theory (class notes).