Distances and Divergences for Probability Distributions
Andrew Nobel
October 2020

Background
Basic question: How far apart (different) are two distributions P and Q?
- Measured through distances and divergences
- Used to define convergence of distributions
- Used to assess smoothness of parametrizations {Pθ : θ ∈ Θ}
- A means of assessing the complexity of a family of distributions
- Key role in understanding the consistency of inference procedures
- Key ingredient in formulating lower and upper bounds on the performance of inference procedures

Kolmogorov-Smirnov Distance
Definition: Let P and Q be probability distributions on R with CDFs F and G. The Kolmogorov-Smirnov (KS) distance between P and Q is
KS(P, Q) = sup_{t ∈ R} |F(t) − G(t)|
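As a quick numerical illustration (not part of the original slides), the supremum can be approximated by maximizing over a fine grid; the normal CDFs and the grid bounds below are illustrative choices:

```python
import math

def norm_cdf(t, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2), written with the error function
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(F, G, lo=-10.0, hi=10.0, steps=10001):
    # Approximate sup_t |F(t) - G(t)| by a maximum over a fine grid
    grid = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return max(abs(F(t) - G(t)) for t in grid)

# KS distance between N(0,1) and N(1,1); the sup is attained at t = 1/2
d = ks_distance(norm_cdf, lambda t: norm_cdf(t, mu=1.0))
print(round(d, 4))  # 0.3829
```

Here the exact value is 2Φ(1/2) − 1, which the grid approximation recovers to high accuracy.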
Properties of KS Distance
1. 0 ≤ KS(P, Q) ≤ 1
2. KS(P, Q) = 0 iff P = Q
3. KS is a metric
4. KS(P, Q) = 1 iff there exists s ∈ R with P((−∞, s]) = 1 and Q((s, ∞)) = 1

Total Variation Distance
Definition: Let X be a set with a sigma-field A. The total variation distance between two probability measures P and Q on (X , A) is
TV(P, Q) = sup_{A ∈ A} |P(A) − Q(A)|
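On a small finite space the supremum over events can be evaluated by brute force, which makes the definition concrete (a sketch; the pmfs p and q are arbitrary examples):

```python
from itertools import chain, combinations

def tv_bruteforce(p, q):
    # Evaluate sup_A |P(A) - Q(A)| by enumerating all 2^|X| events
    xs = list(p)
    events = chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))
    return max(abs(sum(p[x] for x in A) - sum(q[x] for x in A)) for A in events)

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
print(round(tv_bruteforce(p, q), 4))  # 0.3, attained at A = {0} and at its complement
```

Enumerating events is exponential in the size of the space; the closed-form expressions given later make the distance computable in practice.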
Properties of Total Variation
1. 0 ≤ TV(P,Q) ≤ 1
2. TV(P, Q) = 0 iff P = Q
3. TV is a metric
4. TV(P, Q) = 1 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

KS, TV, and the CLT
Note: KS(P,Q) and TV(P,Q) can both be expressed in the form
sup_{A ∈ A0} |P(A) − Q(A)|
For KS the family A0 consists of all intervals (−∞, t]; for TV the family A0 consists of all (Borel) sets
Example: Let X1,X2,... ∈ {−1, 1} iid with P(Xi = 1) = P(Xi = −1) = 1/2. By the standard central limit theorem
Zn = n^{−1/2} Σ_{i=1}^{n} Xi ⇒ N(0, 1)
Let Pn = distribution of Zn and Q = N (0, 1). Can show that
KS(Pn, Q) ≤ c n^{−1/2} while TV(Pn, Q) ≡ 1, as Pn is supported on a finite set that has probability zero under Q

Total Variation and Densities
Scheffé's Theorem: Let P ∼ f and Q ∼ g be distributions on X = R^d. Then
1. TV(P, Q) = (1/2) ∫ |f(x) − g(x)| dx
2. TV(P, Q) = 1 − ∫ min{f(x), g(x)} dx
3. TV(P, Q) = P(A) − Q(A) where A = {x : f(x) ≥ g(x)}
Analogous results hold when P ∼ p(x) and Q ∼ q(x) are described by pmfs
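The three pmf versions of the theorem can be checked against each other numerically (a small sketch; p and q are arbitrary):

```python
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
xs = set(p) | set(q)

# 1. Half the L1 distance between the mass functions
tv1 = 0.5 * sum(abs(p.get(x, 0) - q.get(x, 0)) for x in xs)
# 2. One minus the total overlap of the mass functions
tv2 = 1 - sum(min(p.get(x, 0), q.get(x, 0)) for x in xs)
# 3. P(A) - Q(A) on the set A where p >= q
A = [x for x in xs if p.get(x, 0) >= q.get(x, 0)]
tv3 = sum(p.get(x, 0) for x in A) - sum(q.get(x, 0) for x in A)

print(round(tv1, 4), round(tv2, 4), round(tv3, 4))  # all three agree
```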
Upshot: The total variation distance between P and Q is half the L1-distance between their densities or mass functions

Total Variation and Hypothesis Testing
Problem: Observe X ∈ X having density f0 or f1. Wish to test
H0 : X ∼ f0 vs. H1 : X ∼ f1
Any decision rule d : X → {0, 1} has overall (Type I + Type II) error
Err(d) = P0(d(X) = 1) + P1(d(X) = 0)
Fact: The optimum overall error among all decision rules is
inf_{d : X → {0,1}} Err(d) = ∫ min{f0(x), f1(x)} dx = 1 − TV(P0, P1)

Total Variation and Coupling
Definition: A coupling of distributions P and Q on X is a jointly distributed pair of random variables (X,Y ) such that X ∼ P and Y ∼ Q
Fact: TV(P, Q) is the minimum of P(X ≠ Y) over all couplings of P and Q

- If X ∼ P and Y ∼ Q then P(X ≠ Y) ≥ TV(P, Q)
- There is an optimal coupling achieving the lower bound
- The optimal coupling makes X and Y equal as often as possible
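One explicit construction of such an optimal (maximal) coupling for pmfs, sketched in Python (the distributions p and q are arbitrary examples):

```python
def maximal_coupling(p, q):
    # Build a joint pmf J on pairs (x, y) with marginals p and q
    # that puts as much mass as possible on the diagonal x == y.
    xs = sorted(set(p) | set(q))
    overlap = {x: min(p.get(x, 0), q.get(x, 0)) for x in xs}
    tv = 1 - sum(overlap.values())  # total variation distance (Scheffe form)
    J = {(x, x): overlap[x] for x in xs if overlap[x] > 0}
    if tv > 0:
        # Pair the leftover mass of p with the leftover mass of q
        rp = {x: p.get(x, 0) - overlap[x] for x in xs}
        rq = {y: q.get(y, 0) - overlap[y] for y in xs}
        for x in xs:
            for y in xs:
                if rp[x] > 0 and rq[y] > 0:
                    J[(x, y)] = J.get((x, y), 0) + rp[x] * rq[y] / tv
    return J, tv

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
J, tv = maximal_coupling(p, q)
mismatch = sum(m for (x, y), m in J.items() if x != y)
print(round(mismatch, 6), round(tv, 6))  # P(X != Y) equals TV(P, Q)
```

Mass min{p(x), q(x)} sits on the diagonal, and only the residual mass of p is paired against the residual mass of q, so X ≠ Y with probability exactly TV(P, Q).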
Note: If ρ is a metric on X, the Wasserstein distance between distributions P and Q is defined by min E[ρ(X, Y)], where the minimum is over all couplings (X, Y) of P and Q

Hellinger Distance
Definition: Let P ∼ f and Q ∼ g be probability measures on R^d. The Hellinger distance between P and Q is given by
H(P, Q) = ( ∫ ( √f(x) − √g(x) )² dx )^{1/2}
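For pmfs the integral becomes a sum, so the distance is easy to compute directly (an illustrative sketch; the pmfs are arbitrary):

```python
import math

def hellinger(p, q):
    # H(P, Q): the L2 distance between the square roots of the mass functions
    xs = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(x, 0)) - math.sqrt(q.get(x, 0))) ** 2
                         for x in xs))

p = {0: 0.5, 1: 0.5}   # fair coin
q = {0: 0.1, 1: 0.9}   # heavily biased coin
h = hellinger(p, q)
print(round(h, 4))  # 0.4595
```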
Properties of Hellinger Distance

1. H(P, Q) is just the L2 distance between √f and √g
2. H²(P, Q) = 2 ( 1 − ∫ √(f(x) g(x)) dx ), therefore 0 ≤ H²(P, Q) ≤ 2
3. H(P, Q) = 0 iff P = Q
4. H is a metric
5. H²(P, Q) = 2 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

Hellinger Distance vs. Total Variation
Fact: For any pair of densities f, g we have

∫ min(f, g) dx ≥ (1/2) ( ∫ √(f g) dx )² = (1/2) ( 1 − (1/2) H²(f, g) )²
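A quick numerical spot-check of this inequality on random pmfs (the 5-point space and 1000 trials are arbitrary choices):

```python
import random

random.seed(0)
slack = float("inf")
for _ in range(1000):
    # Draw two random pmfs on a 5-point space
    w1 = [random.random() for _ in range(5)]
    w2 = [random.random() for _ in range(5)]
    p = [w / sum(w1) for w in w1]
    q = [w / sum(w2) for w in w2]
    overlap = sum(min(a, b) for a, b in zip(p, q))       # 1 - TV
    bc = sum((a * b) ** 0.5 for a, b in zip(p, q))       # 1 - H^2/2
    slack = min(slack, overlap - 0.5 * bc ** 2)
print(slack >= 0)  # the inequality held on every trial
```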
Fact: For any distributions P and Q

(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √( 1 − H²(P, Q)/4 )
- H²(P, Q) = 0 iff TV(P, Q) = 0, and H²(P, Q) = 2 iff TV(P, Q) = 1
- H(Pn, Qn) → 0 iff TV(Pn, Qn) → 0

Kullback-Leibler (KL) Divergence
Definition: The KL-divergence between distributions P ∼ f and Q ∼ g is given by
KL(P : Q) = KL(f : g) = ∫ f(x) log( f(x) / g(x) ) dx
Analogous definition holds for discrete distributions P ∼ p and Q ∼ q
- The integrand can be positive or negative. By convention, f(x) log( f(x)/g(x) ) = +∞ if f(x) > 0 and g(x) = 0, and = 0 if f(x) = 0
- KL divergence is not symmetric, and is not a metric. Note that

KL(P : Q) = E_f [ log( f(X) / g(X) ) ]

First Properties of KL Divergence
Fact: The integral defining KL(P : Q) is well defined. Letting u− = max(−u, 0),
∫ ( f(x) log( f(x) / g(x) ) )− dx < ∞
Key Fact:
- The divergence KL(P : Q) ≥ 0, with equality if and only if P = Q
- KL(P : Q) = +∞ if there is a set A with P(A) > 0 and Q(A) = 0

Notation: When pmfs or pdfs are clear from context, write KL(p : q) or KL(f : g)

KL Divergence Examples
Example: Let p and q be pmfs on {0, 1} with
p(0) = p(1) = 1/2 and q(0) = (1 − ε)/2, q(1) = (1 + ε)/2
Then we have the following exact expressions and bounds

- KL(p : q) = −(1/2) log(1 − ε²) ≤ ε² when ε ≤ 1/√2
- KL(q : p) = (1/2) log(1 − ε²) + (ε/2) log( (1 + ε)/(1 − ε) ) ≤ ε²
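The exact expressions and the ε² bounds can be verified directly in code (ε = 0.3 is an arbitrary choice satisfying ε ≤ 1/√2):

```python
import math

def kl(p, q):
    # KL(p : q) for pmfs over a common finite support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

eps = 0.3
p = [0.5, 0.5]
q = [(1 - eps) / 2, (1 + eps) / 2]

# Exact expressions from the example
assert abs(kl(p, q) - (-0.5) * math.log(1 - eps ** 2)) < 1e-12
assert abs(kl(q, p) - (0.5 * math.log(1 - eps ** 2)
                       + (eps / 2) * math.log((1 + eps) / (1 - eps)))) < 1e-12
# Both divergences are at most eps^2 here
assert kl(p, q) <= eps ** 2 and kl(q, p) <= eps ** 2
print(round(kl(p, q), 6), round(kl(q, p), 6))
```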
Example: If P ∼ Nd(µ0, Σ0) and Q ∼ Nd(µ1, Σ1) with Σ0, Σ1 > 0 then
KL(P : Q) = (1/2) [ tr(Σ1^{−1} Σ0) + (µ1 − µ0)^t Σ1^{−1} (µ1 − µ0) + ln( |Σ1| / |Σ0| ) − d ]

KL Divergence and Inference
Ex 1. (Testing) Consider testing H0 : X ∼ f0 vs. H1 : X ∼ f1. The divergence
KL(f0 : f1) = E0 [ log( f0(X) / f1(X) ) ] ≥ 0
is just the expected log likelihood ratio under H0
Ex 2. (Estimation) Suppose X1,X2,... iid with Xi ∼ f(x|θ0) in P = {f(x|θ): θ ∈ Θ}. Under suitable assumptions, when n is large,
θ̂_MLE(x) ≈ argmin_{θ ∈ Θ} KL( f(·|θ0) : f(·|θ) )
In other words, the MLE is trying to find the θ whose model f(·|θ) minimizes the KL divergence from the true distribution

KL Divergence vs Total Variation and Hellinger
Fact: For any distributions P and Q we have
(1) TV(P, Q)² ≤ KL(P : Q)/2 (Pinsker's Inequality)
(2) H(P, Q)² ≤ KL(P : Q)

Log Sum Inequality
Log-Sum Inequality: If a1, . . . , an and b1, . . . , bn are non-negative then
Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )
with equality iff all the ratios ai/bi are equal
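A minimal numerical check of the inequality and of its equality condition (the vectors a and b are arbitrary):

```python
import math

def logsum_lhs(a, b):
    # Left side: sum_i a_i log(a_i / b_i)
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def logsum_rhs(a, b):
    # Right side: (sum_i a_i) log(sum_i a_i / sum_i b_i)
    A, B = sum(a), sum(b)
    return A * math.log(A / B)

a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 1.0]
assert logsum_lhs(a, b) >= logsum_rhs(a, b)   # strict here: the ratios differ

c = [2.0 * bi for bi in b]                    # all ratios a_i/b_i equal 2
assert abs(logsum_lhs(c, b) - logsum_rhs(c, b)) < 1e-12
print("log-sum inequality verified")
```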
Corollary: If P ∼ p and Q ∼ q are distributions, then for every event B
Σ_{x ∈ B} p(x) log( p(x) / q(x) ) ≥ P(B) log( P(B) / Q(B) )
with equality iff p(x)/q(x) is constant for x ∈ B

Product Densities (Tensorization)
Recall: Given distributions P1, . . . , Pn on X with densities f1, . . . , fn, the product distribution P = ⊗_{i=1}^{n} Pi on X^n has density f(x1, . . . , xn) = f1(x1) ··· fn(xn)
Tensorization: Let P1,...,Pn and Q1,...,Qn be distributions on X
- TV( ⊗_{i=1}^{n} Pi, ⊗_{i=1}^{n} Qi ) ≤ Σ_{i=1}^{n} TV(Pi, Qi)
- H²( ⊗_{i=1}^{n} Pi, ⊗_{i=1}^{n} Qi ) ≤ Σ_{i=1}^{n} H²(Pi, Qi)
- KL( ⊗_{i=1}^{n} Pi : ⊗_{i=1}^{n} Qi ) = Σ_{i=1}^{n} KL(Pi : Qi)

Distinguishing Coins
Given: Observations X = (X1, . . . , Xn) with the Xi ∈ {0, 1} iid ∼ Bern(θ), where θ ∈ {θ0, θ1}
Goal: Find a decision rule d : {0, 1}^n → {0, 1} such that
(?) P0(d(X) = 1) ≤ α
(?) P1(d(X) = 0) ≤ α
Question: How large does the number of observations n need to be?
Fact: Let ∆ = |θ0 − θ1|. Then there exists a decision procedure achieving performance (?) and requiring a number of observations

n = 2 log(1/α) / ∆²

Identifying Fair and Biased Coins
Suppose now that θ0 = 1/2 and θ1 = 1/2 + ε for some fixed ε ∈ (0, 1/4)
Fact: For every event A ⊆ {0, 1}^n

|P0(X ∈ A) − P1(X ∈ A)| = |P0(A) − P1(A)| ≤ ε √(2n)
Fact: If d : {0, 1}^n → {0, 1} is any decision rule achieving (?) then
n ≥ (1 − 2α)² / (2 ε²)
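The upper bound ε√(2n) can be compared with the exact total variation distance between the two product measures; since their likelihood ratio depends only on the number of heads, the computation reduces to binomial pmfs (n = 50 and ε = 0.05 are arbitrary choices):

```python
import math

def binom_pmf(n, k, theta):
    # P(k heads in n tosses of a theta-coin)
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def tv_product(n, t0, t1):
    # TV between the n-fold products, via the sufficient statistic (head count)
    return 0.5 * sum(abs(binom_pmf(n, k, t0) - binom_pmf(n, k, t1))
                     for k in range(n + 1))

eps, n = 0.05, 50
tv = tv_product(n, 0.5, 0.5 + eps)
bound = eps * math.sqrt(2 * n)
print(round(tv, 4), "<=", round(bound, 4))
```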