
Distances and Divergences for Probability Distributions
Andrew Nobel
October, 2020

Background

Basic question: How far apart (different) are two distributions P and Q?

- Measured through distances and divergences
- Used to define convergence of distributions
- Used to assess smoothness of parametrizations {P_θ : θ ∈ Θ}
- A means of assessing the complexity of a family of distributions
- Key role in understanding the consistency of inference procedures
- Key ingredient in formulating lower and upper bounds on the performance of inference procedures

Kolmogorov-Smirnov Distance

Definition: Let P and Q be probability distributions on ℝ with CDFs F and G. The Kolmogorov-Smirnov (KS) distance between P and Q is

    KS(P, Q) = sup_t |F(t) − G(t)|

Properties of the KS distance:

1. 0 ≤ KS(P, Q) ≤ 1
2. KS(P, Q) = 0 iff P = Q
3. KS is a metric
4. KS(P, Q) = 1 iff there exists s ∈ ℝ with P((−∞, s]) = 1 and Q((s, ∞)) = 1

Total Variation Distance

Definition: Let 𝒳 be a set with a sigma-field 𝒜. The total variation distance between two probability measures P and Q on (𝒳, 𝒜) is

    TV(P, Q) = sup_{A ∈ 𝒜} |P(A) − Q(A)|

Properties of total variation:

1. 0 ≤ TV(P, Q) ≤ 1
2. TV(P, Q) = 0 iff P = Q
3. TV is a metric
4. TV(P, Q) = 1 iff there exists A ∈ 𝒜 with P(A) = 1 and Q(A) = 0

KS, TV, and the CLT

Note: KS(P, Q) and TV(P, Q) can both be expressed in the form

    sup_{A ∈ 𝒜₀} |P(A) − Q(A)|

For KS the family 𝒜₀ consists of all intervals (−∞, t], while for TV the family 𝒜₀ consists of all (Borel) sets.

Example: Let X₁, X₂, … ∈ {−1, 1} be iid with P(Xᵢ = 1) = P(Xᵢ = −1) = 1/2. By the standard central limit theorem

    Z_n = n^{−1/2} Σ_{i=1}^n Xᵢ  ⇒  N(0, 1)

Let P_n = distribution of Z_n and Q = N(0, 1). One can show that

    KS(P_n, Q) ≤ c n^{−1/2}   while   TV(P_n, Q) ≡ 1

Total Variation and Densities

Scheffé's Theorem: Let P ∼ f and Q ∼ g be distributions on 𝒳 = ℝ^d. Then

1. TV(P, Q) = (1/2) ∫ |f(x) − g(x)| dx
2. TV(P, Q) = 1 − ∫ min{f(x), g(x)} dx
3. TV(P, Q) = P(A) − Q(A), where A = {x : f(x) ≥ g(x)}

Analogous results hold when P ∼ p(x) and Q ∼ q(x) are described by pmfs.

Upshot: The total variation distance between P and Q is half the L₁-distance between their densities or mass functions.

Total Variation and Hypothesis Testing

Problem: Observe X ∈ 𝒳 having density f₀ or f₁. We wish to test H₀ : X ∼ f₀ vs. H₁ : X ∼ f₁. Any decision rule d : 𝒳 → {0, 1} has overall (Type I + Type II) error

    Err(d) = P₀(d(X) = 1) + P₁(d(X) = 0)

Fact: The optimum overall error among all decision rules is

    inf_{d : 𝒳 → {0,1}} Err(d) = ∫ min{f₀(x), f₁(x)} dx = 1 − TV(P₀, P₁)

Total Variation and Coupling

Definition: A coupling of distributions P and Q on 𝒳 is a jointly distributed pair of random variables (X, Y) such that X ∼ P and Y ∼ Q.

Fact: TV(P, Q) is the minimum of P(X ≠ Y) over all couplings of P and Q.

- If X ∼ P and Y ∼ Q then P(X ≠ Y) ≥ TV(P, Q)
- There is an optimal coupling achieving the lower bound
- The optimal coupling makes X and Y equal as much as possible

Note: If ρ is a metric on 𝒳, the Wasserstein distance between distributions P and Q is defined by min E[ρ(X, Y)], where the minimum is over all couplings (X, Y) of P and Q.

Hellinger Distance

Definition: Let P ∼ f and Q ∼ g be probability measures on ℝ^d. The Hellinger distance between P and Q is given by

    H(P, Q) = [ ∫ (√f(x) − √g(x))² dx ]^{1/2}

Properties of the Hellinger distance:

1. H(P, Q) is just the L₂ distance between √f and √g
2. H²(P, Q) = 2 (1 − ∫ √(f(x) g(x)) dx), therefore 0 ≤ H²(P, Q) ≤ 2
3. H(P, Q) = 0 iff P = Q
4. H is a metric
5. H²(P, Q) = 2 iff there exists a set A with P(A) = 1 and Q(A) = 0
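Numerical illustration (Python sketch): the code below checks Scheffé's identities for TV and the two expressions for H² above. The pmfs p and q are arbitrary choices made only for this illustration.

    import numpy as np

    # Two arbitrary pmfs on a three-point space (chosen only for illustration)
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.2, 0.3, 0.5])

    # Scheffé: TV(P, Q) = (1/2) sum |p - q| = 1 - sum min(p, q)
    tv_half_l1 = 0.5 * np.abs(p - q).sum()
    tv_min = 1.0 - np.minimum(p, q).sum()

    # Hellinger: H^2(P, Q) = sum (sqrt(p) - sqrt(q))^2 = 2 (1 - sum sqrt(p q))
    h2_direct = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
    h2_affinity = 2.0 * (1.0 - np.sum(np.sqrt(p * q)))

    print(tv_half_l1, tv_min)       # both equal 0.3
    print(h2_direct, h2_affinity)   # both equal about 0.135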
Hellinger Distance vs. Total Variation

Fact: For any pair of densities f, g we have the inequalities

    ∫ min(f, g) dx ≥ (1/2) ( ∫ √(f g) dx )² = (1/2) (1 − (1/2) H²(f, g))²

Fact: For any distributions P and Q,

    (1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √(1 − H²(P, Q)/4)

- H²(P, Q) = 0 iff TV(P, Q) = 0, and H²(P, Q) = 2 iff TV(P, Q) = 1
- H(P_n, Q_n) → 0 iff TV(P_n, Q_n) → 0

Kullback-Leibler (KL) Divergence

Definition: The KL divergence between distributions P ∼ f and Q ∼ g is given by

    KL(P : Q) = KL(f : g) = ∫ f(x) log( f(x)/g(x) ) dx

An analogous definition holds for discrete distributions P ∼ p and Q ∼ q.

- The integrand can be positive or negative. By convention,

      f(x) log( f(x)/g(x) ) = +∞ if f(x) > 0 and g(x) = 0, and = 0 if f(x) = 0

- KL divergence is not symmetric, and is not a metric. Note that

      KL(P : Q) = E_f [ log( f(X)/g(X) ) ]

First Properties of KL Divergence

Fact: The integral defining KL(P : Q) is well defined. Letting u₋ = max(−u, 0),

    ∫ ( f(x) log( f(x)/g(x) ) )₋ dx < ∞

Key Fact:

- KL(P : Q) ≥ 0, with equality if and only if P = Q
- KL(P : Q) = +∞ if there is a set A with P(A) > 0 and Q(A) = 0

Notation: When the pmfs or pdfs are clear from context, write KL(p : q) or KL(f : g).

KL Divergence Examples

Example: Let p and q be pmfs on {0, 1} with p(0) = p(1) = 1/2 and q(0) = (1 − ε)/2, q(1) = (1 + ε)/2. Then we have the following exact expressions and bounds:

- KL(p : q) = −(1/2) log(1 − ε²) ≤ ε² when ε ≤ 1/√2
- KL(q : p) = (1/2) log(1 − ε²) + (ε/2) log( (1 + ε)/(1 − ε) ) ≤ ε²

Example: If P ∼ N_d(μ₀, Σ₀) and Q ∼ N_d(μ₁, Σ₁) with Σ₀, Σ₁ > 0, then

    KL(P : Q) = (1/2) [ tr(Σ₁⁻¹ Σ₀) + (μ₁ − μ₀)ᵀ Σ₁⁻¹ (μ₁ − μ₀) + ln( |Σ₁| / |Σ₀| ) − d ]

KL Divergence and Inference

Ex 1. (Testing) Consider testing H₀ : X ∼ f₀ vs. H₁ : X ∼ f₁. The divergence

    KL(f₀ : f₁) = E₀ [ log( f₀(X)/f₁(X) ) ] ≥ 0

is just the expected log likelihood ratio under H₀.

Ex 2. (Estimation) Suppose X₁, X₂, … are iid with Xᵢ ∼ f(x|θ₀) in P = { f(x|θ) : θ ∈ Θ }. Under suitable assumptions, when n is large,

    θ̂_MLE(x) ≈ argmin_{θ ∈ Θ} KL( f(·|θ₀) : f(·|θ) )

In other words, the MLE seeks the θ whose distribution minimizes the KL divergence from the true distribution.

KL Divergence vs. Total Variation and Hellinger

Fact: For any distributions P and Q we have

1. TV(P, Q)² ≤ KL(P : Q)/2   (Pinsker's inequality)
2. H(P, Q)² ≤ KL(P : Q)

Log-Sum Inequality

Log-sum inequality: If a₁, …, a_n and b₁, …, b_n are non-negative, then

    Σ_{i=1}^n aᵢ log(aᵢ/bᵢ) ≥ ( Σ_{i=1}^n aᵢ ) log( Σ_{i=1}^n aᵢ / Σ_{i=1}^n bᵢ )

with equality iff all the ratios aᵢ/bᵢ are equal.

Corollary: If P ∼ p and Q ∼ q are distributions, then for every event B,

    Σ_{x ∈ B} p(x) log( p(x)/q(x) ) ≥ P(B) log( P(B)/Q(B) )

with equality iff p(x)/q(x) is constant for x ∈ B.

Product Densities (Tensorization)

Recall: Given distributions P₁, …, P_n on 𝒳 with densities f₁, …, f_n, the product distribution P = ⊗_{i=1}^n Pᵢ on 𝒳ⁿ has density f(x₁, …, x_n) = f₁(x₁) ··· f_n(x_n).

Tensorization: Let P₁, …, P_n and Q₁, …, Q_n be distributions on 𝒳.

- TV( ⊗_{i=1}^n Pᵢ, ⊗_{i=1}^n Qᵢ ) ≤ Σ_{i=1}^n TV(Pᵢ, Qᵢ)
- H²( ⊗_{i=1}^n Pᵢ, ⊗_{i=1}^n Qᵢ ) ≤ Σ_{i=1}^n H²(Pᵢ, Qᵢ)
- KL( ⊗_{i=1}^n Pᵢ : ⊗_{i=1}^n Qᵢ ) = Σ_{i=1}^n KL(Pᵢ : Qᵢ)
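Numerical illustration (Python sketch): the code below checks Pinsker's inequality, the bound H² ≤ KL, and the KL tensorization identity on the two-point example p = (1/2, 1/2), q = ((1 − ε)/2, (1 + ε)/2) used above. The explicit product pmf on {0,1}ⁿ is built only for a small n, as an illustration.

    import numpy as np

    def kl(p, q):
        """KL divergence between two pmfs (assumes q > 0 wherever p > 0)."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    eps = 0.3
    p = np.array([0.5, 0.5])
    q = np.array([(1 - eps) / 2, (1 + eps) / 2])

    tv = 0.5 * np.abs(p - q).sum()
    h2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

    print(np.isclose(kl(p, q), -0.5 * np.log(1 - eps**2)))  # matches the exact expression
    print(tv**2 <= kl(p, q) / 2)                            # Pinsker's inequality
    print(h2 <= kl(p, q))                                   # H^2 <= KL

    # Tensorization: KL adds up over iid coordinates. Build the joint pmf on
    # {0, 1}^n explicitly for small n and compare with n * KL(p : q).
    n = 4
    pn = np.array([np.prod(p[list(bits)]) for bits in np.ndindex(*(2,) * n)])
    qn = np.array([np.prod(q[list(bits)]) for bits in np.ndindex(*(2,) * n)])
    print(np.isclose(kl(pn, qn), n * kl(p, q)))             # True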
Distinguishing Coins

Given: Observations X = (X₁, …, X_n) ∈ {0, 1}ⁿ, iid ∼ Bern(θ) with θ ∈ {θ₀, θ₁}.

Goal: Find a decision rule d : {0, 1}ⁿ → {0, 1} such that

    (⋆)  P₀(d(X) = 1) ≤ α  and  P₁(d(X) = 0) ≤ α

Question: How large does the number of observations n need to be?

Fact: Let Δ = |θ₀ − θ₁|. Then there exists a decision procedure achieving performance (⋆) and requiring a number of observations

    n = 2 log(1/α) / Δ²

Identifying Fair and Biased Coins

Suppose now that θ₀ = 1/2 and θ₁ = 1/2 + ε for some fixed ε ∈ (0, 1/4).

Fact: For every event A ⊆ {0, 1}ⁿ,

    |P₀(X ∈ A) − P₁(X ∈ A)| = |P₀(A) − P₁(A)| ≤ ε √(2n)

Fact: If d : {0, 1}ⁿ → {0, 1} is any decision rule achieving (⋆), then

    n ≥ (1 − 2α)² / (2ε²)
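Simulation sketch (Python): one procedure that achieves (⋆) at the sample size n = 2 log(1/α)/Δ² is the midpoint-threshold rule d(X) = 1 iff the sample mean is at least (θ₀ + θ₁)/2; by Hoeffding's inequality its two error probabilities are at most exp(−nΔ²/2) ≤ α at that n. This rule and the parameter values below are assumptions made for illustration, not a prescription from the facts above. The code estimates both error probabilities by simulation.

    import numpy as np

    rng = np.random.default_rng(0)
    theta0, theta1, alpha = 0.5, 0.6, 0.05               # illustrative values
    delta = abs(theta1 - theta0)
    n = int(np.ceil(2 * np.log(1 / alpha) / delta**2))   # n = 600 here

    trials = 20_000
    means0 = rng.binomial(n, theta0, size=trials) / n    # sample means under H0
    means1 = rng.binomial(n, theta1, size=trials) / n    # sample means under H1
    threshold = (theta0 + theta1) / 2

    err0 = np.mean(means0 >= threshold)   # Type I error: decide 1 under H0
    err1 = np.mean(means1 < threshold)    # Type II error: decide 0 under H1
    print(n, err0, err1)                  # both error estimates fall below alpha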