
Distances and Divergences for Probability Distributions

Andrew Nobel

October, 2020

Background

Basic question: How far apart (different) are two distributions P and Q?

I Measured through distances and divergences

I Used to define convergence of distributions

I Used to assess parametrizations {Pθ : θ ∈ Θ}

I Means of assessing the complexity of a family of distributions

I Key role in understanding the consistency of inference procedures

I Key ingredient in formulating lower and upper bounds on the performance of inference procedures

Kolmogorov-Smirnov Distance

Definition: Let P and Q be probability distributions on R with CDFs F and G. The Kolmogorov-Smirnov (KS) distance between P and Q is

KS(P, Q) = sup_{t ∈ R} |F(t) − G(t)|
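
As a sanity check on the definition, here is a minimal Python sketch with hypothetical choices P = N(0, 1) and Q = N(0.5, 1.5²); the supremum over t is approximated by a maximum over a fine grid.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical example distributions: P = N(0, 1), Q = N(0.5, 1.5^2)
    t = np.linspace(-10, 10, 200_001)          # fine grid standing in for the sup over t
    F = norm.cdf(t, loc=0.0, scale=1.0)        # CDF of P
    G = norm.cdf(t, loc=0.5, scale=1.5)        # CDF of Q

    ks = np.abs(F - G).max()                   # approximates sup_t |F(t) - G(t)|
    print(f"KS(P, Q) ~ {ks:.4f}")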

Properties of KS

1. 0 ≤ KS(P,Q) ≤ 1

2. KS(P, Q) = 0 iff P = Q

3. KS is a metric

4. KS(P, Q) = 1 iff there exists s ∈ R with P((−∞, s]) = 1 and Q((s, ∞)) = 1

Total Variation Distance

Definition: Let X be a set with a sigma-field A. The total variation distance between two probability measures P and Q on (X, A) is

TV(P, Q) = sup_{A ∈ A} |P(A) − Q(A)|

Properties of Total Variation

1. 0 ≤ TV(P,Q) ≤ 1

2. TV(P, Q) = 0 iff P = Q

3. TV is a metric

4. TV(P, Q) = 1 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

KS, TV, and the CLT

Note: KS(P,Q) and TV(P,Q) can both be expressed in the form

sup_{A ∈ A0} |P(A) − Q(A)|

For KS the family A0 = all intervals (−∞, t], while for TV the family A0 = all (Borel) sets

Example: Let X1,X2,... ∈ {−1, 1} iid with P(Xi = 1) = P(Xi = −1) = 1/2. By the standard central limit theorem

Zn = n^{−1/2} ∑_{i=1}^n Xi ⇒ N(0, 1)

Let Pn = distribution of Zn and Q = N (0, 1). Can show that

KS(Pn, Q) ≤ c n^{−1/2} while TV(Pn, Q) ≡ 1
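
A Python sketch of this example: Zn places Binomial(n, 1/2) mass on the atoms (2k − n)/√n, and the supremum of |Fn(t) − Φ(t)| is attained at these atoms (approached from the right or from the left). The code evaluates KS(Pn, Q) exactly; it does not identify the constant c. Since Pn is discrete and Q is continuous, TV(Pn, Q) = 1 for every n.

    import numpy as np
    from scipy.stats import binom, norm

    def ks_to_normal(n):
        # Z_n = n^{-1/2} * (sum of n iid +-1 variables); if S ~ Binom(n, 1/2)
        # counts the +1's, then Z_n = (2S - n)/sqrt(n)
        k = np.arange(n + 1)
        atoms = (2 * k - n) / np.sqrt(n)                  # support of Z_n
        F_right = binom.cdf(k, n, 0.5)                    # F_n at each atom
        F_left = np.concatenate(([0.0], F_right[:-1]))    # F_n just below each atom
        phi = norm.cdf(atoms)
        return max(np.abs(F_right - phi).max(), np.abs(F_left - phi).max())

    for n in [10, 100, 1000]:
        ks = ks_to_normal(n)
        print(n, ks, ks * np.sqrt(n))   # ks * sqrt(n) stays bounded, consistent with c n^{-1/2}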

Total Variation and Densities

Scheffé’s Theorem: Let P ∼ f and Q ∼ g be distributions on X = R^d. Then

1. TV(P, Q) = (1/2) ∫ |f(x) − g(x)| dx

2. TV(P, Q) = 1 − ∫ min{f(x), g(x)} dx

3. TV(P, Q) = P(A) − Q(A) where A = {x : f(x) ≥ g(x)}

Analogous results hold when P ∼ p(x) and Q ∼ q(x) are described by pmfs
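
A brief Python sketch of statement 1 in both settings, with hypothetical choices of the distributions: two Binomial pmfs for the discrete case, and the densities of N(0, 1) and N(1, 1), integrated by a Riemann sum, for the continuous case.

    import numpy as np
    from scipy.stats import binom, norm

    # Discrete case: TV = (1/2) * sum |p - q| for two hypothetical Binomial pmfs
    k = np.arange(11)
    p, q = binom.pmf(k, 10, 0.5), binom.pmf(k, 10, 0.6)
    tv_discrete = 0.5 * np.abs(p - q).sum()

    # Continuous case: TV = (1/2) * integral |f - g| for N(0,1) vs N(1,1),
    # approximated by a Riemann sum on a fine grid
    x = np.linspace(-12.0, 13.0, 500_001)
    f, g = norm.pdf(x, 0.0, 1.0), norm.pdf(x, 1.0, 1.0)
    tv_continuous = 0.5 * np.sum(np.abs(f - g)) * (x[1] - x[0])

    print(tv_discrete, tv_continuous)   # second value should be close to 1 - 2*Phi(-1/2)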

Upshot: Total variation distance between P and Q is half the L1-distance between densities or mass functions

Total Variation and Hypothesis Testing

Problem: Observe X ∈ X having density f0 or f1. Wish to test

H0 : X ∼ f0 vs. H1 : X ∼ f1

Any decision rule d : X → {0, 1} has overall (Type I + Type II) error

Err(d) = P0(d(X) = 1) + P1(d(X) = 0)

Fact: The optimum overall error among all decision rules is

inf_{d : X → {0,1}} Err(d) = ∫ min{f0(x), f1(x)} dx = 1 − TV(P0, P1)
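
A Monte Carlo sketch of this fact under the hypothetical choice f0 = N(0, 1) and f1 = N(1, 1): for these densities the likelihood-ratio rule decides 1 exactly when x > 1/2, and its overall error should be close to 1 − TV(P0, P1) = 2Φ(−1/2).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu = 1.0                          # hypothetical choice: f0 = N(0,1), f1 = N(mu,1)
    n_mc = 200_000

    # Likelihood-ratio rule for these densities: decide 1 exactly when x > mu/2
    x0 = rng.normal(0.0, 1.0, n_mc)   # draws under H0
    x1 = rng.normal(mu, 1.0, n_mc)    # draws under H1
    err = np.mean(x0 > mu / 2) + np.mean(x1 <= mu / 2)   # Type I + Type II error

    # 1 - TV(P0, P1) for two unit-variance normals has the closed form 2*Phi(-mu/2)
    print(err, 2 * norm.cdf(-mu / 2))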

Total Variation and Coupling

Definition: A coupling of distributions P and Q on X is a jointly distributed pair of random variables (X, Y) such that X ∼ P and Y ∼ Q

Fact: TV(P, Q) is the minimum of P(X ≠ Y) over all couplings of P and Q

I If X ∼ P and Y ∼ Q then P(X ≠ Y) ≥ TV(P, Q)

I There is an optimal coupling achieving the lower bound

I Optimal coupling makes X,Y equal as much as possible

Note: If ρ is a metric on X, the Wasserstein distance between distributions P and Q is defined by min E[ρ(X, Y)], where the minimum is over all couplings (X, Y) of P and Q.
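
A Python sketch of the optimal coupling for a hypothetical pair of pmfs on {0, 1, 2}: with probability 1 − TV the pair is drawn from the normalized overlap min{p, q} and X = Y; otherwise X and Y are drawn from the normalized excess parts of p and q, which have disjoint supports. The empirical frequency of {X ≠ Y} should be close to TV(P, Q).

    import numpy as np

    rng = np.random.default_rng(1)
    p = np.array([0.5, 0.3, 0.2])              # hypothetical pmfs on {0, 1, 2}
    q = np.array([0.2, 0.3, 0.5])
    tv = 0.5 * np.abs(p - q).sum()

    common = np.minimum(p, q)                  # overlap, total mass 1 - TV
    excess_p = (p - q).clip(min=0) / tv        # normalized part where p exceeds q
    excess_q = (q - p).clip(min=0) / tv        # normalized part where q exceeds p

    def draw_coupled():
        # With prob 1 - TV set X = Y (drawn from the overlap); otherwise draw X and Y
        # from the excess parts, which live on disjoint sets, so X != Y
        if rng.random() < 1 - tv:
            x = y = rng.choice(3, p=common / (1 - tv))
        else:
            x, y = rng.choice(3, p=excess_p), rng.choice(3, p=excess_q)
        return x, y

    pairs = np.array([draw_coupled() for _ in range(100_000)])
    print("P(X != Y) ~", np.mean(pairs[:, 0] != pairs[:, 1]), " TV =", tv)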

Hellinger Distance

Definition: Let P ∼ f and Q ∼ g be probability measures on R^d. The Hellinger distance between P and Q is given by

H(P, Q) = [ ∫ (√f(x) − √g(x))² dx ]^{1/2}

Properties of Hellinger Distance

1. H(P, Q) is just the L2 distance between √f and √g

2. H²(P, Q) = 2 ( 1 − ∫ √(f(x) g(x)) dx ), therefore 0 ≤ H²(P, Q) ≤ 2

3. H(P, Q) = 0 iff P = Q

4. H is a metric

5. H²(P, Q) = 2 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

Hellinger Distance vs. Total Variation

Fact: For any pair of densities f, g we have the following inequalities

∫ min(f, g) dx ≥ (1/2) ( ∫ √(f(x) g(x)) dx )² = (1/2) ( 1 − (1/2) H²(f, g) )²

Fact: For any distributions P and Q

(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √(1 − H²(P, Q)/4)
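
A quick numerical check of this sandwich for a hypothetical pair of pmfs (two Binomial distributions), with both distances computed exactly from the mass functions.

    import numpy as np
    from scipy.stats import binom

    # Hypothetical pmfs: Binomial(20, 0.5) vs Binomial(20, 0.6)
    k = np.arange(21)
    p, q = binom.pmf(k, 20, 0.5), binom.pmf(k, 20, 0.6)

    tv = 0.5 * np.abs(p - q).sum()                  # Scheffe / L1 formula
    H2 = np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)     # squared Hellinger distance
    H = np.sqrt(H2)

    lower = 0.5 * H2
    upper = H * np.sqrt(1 - H2 / 4)
    print(lower, tv, upper)    # should satisfy lower <= tv <= upper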

I H²(P, Q) = 0 iff TV(P, Q) = 0 and H²(P, Q) = 2 iff TV(P, Q) = 1

I H(Pn, Qn) → 0 iff TV(Pn, Qn) → 0

Kullback-Leibler (KL) Divergence

Definition: The KL-divergence between distributions P ∼ f and Q ∼ g is given by

KL(P : Q) = KL(f : g) = ∫ f(x) log ( f(x) / g(x) ) dx

Analogous definition holds for discrete distributions P ∼ p and Q ∼ q

I The integrand can be positive or negative. By convention

f(x) log ( f(x) / g(x) ) = +∞ if f(x) > 0 and g(x) = 0, and = 0 if f(x) = 0

I KL divergence is not symmetric, and is not a metric. Note that

KL(P : Q) = E_f [ log ( f(X) / g(X) ) ]

First Properties of KL Divergence

Fact: The integral defining KL(P : Q) is well defined. Letting u_− = max(−u, 0),

∫ ( f(x) log ( f(x) / g(x) ) )_− dx < ∞

Key Fact:

I Divergence KL(P : Q) ≥ 0 with equality if and only if P = Q

I KL(P : Q) = +∞ if there is a set A with P (A) > 0 and Q(A) = 0

Notation: When the pmfs or pdfs are clear from context, write KL(p : q) or KL(f : g)

KL Divergence Examples

Example: Let p and q be pmfs on {0, 1} with

p(0) = p(1) = 1/2 and q(0) = (1 − ε)/2, q(1) = (1 + ε)/2

Then we have the following exact expressions, and bounds

I KL(p : q) = −(1/2) log(1 − ε²) ≤ ε² when ε ≤ 1/√2

I KL(q : p) = (1/2) log(1 − ε²) + (ε/2) log( (1 + ε)/(1 − ε) ) ≤ 2ε²
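
A short numerical check of these expressions for one hypothetical value of ε, comparing direct computation from the mass functions with the closed forms and the bounds above.

    import numpy as np

    def kl(a, b):
        # KL divergence between two pmfs with full support
        return float(np.sum(a * np.log(a / b)))

    eps = 0.3                                     # hypothetical value of epsilon
    p = np.array([0.5, 0.5])
    q = np.array([(1 - eps) / 2, (1 + eps) / 2])

    print(kl(p, q), -0.5 * np.log(1 - eps**2), eps**2)      # KL(p:q), closed form, bound
    print(kl(q, p),
          0.5 * np.log(1 - eps**2) + (eps / 2) * np.log((1 + eps) / (1 - eps)),
          2 * eps**2)                                       # KL(q:p), closed form, bound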

Example: If P ∼ Nd(µ0, Σ0) and Q ∼ Nd(µ1, Σ1) with Σ0, Σ1 > 0 then

2 KL(P : Q) = tr(Σ1^{−1} Σ0) + (µ1 − µ0)^t Σ1^{−1} (µ1 − µ0) + ln(|Σ1|/|Σ0|) − d
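
A Python sketch that implements this closed form and compares it with a Monte Carlo estimate of E_P[log f(X)/g(X)], using hypothetical parameter choices in d = 2.

    import numpy as np
    from scipy.stats import multivariate_normal as mvn

    def kl_gauss(mu0, S0, mu1, S1):
        # 2 KL = tr(S1^{-1} S0) + (mu1-mu0)' S1^{-1} (mu1-mu0) + ln(|S1|/|S0|) - d
        d = len(mu0)
        S1inv = np.linalg.inv(S1)
        diff = mu1 - mu0
        return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff
                      + np.log(np.linalg.det(S1) / np.linalg.det(S0)) - d)

    mu0, S0 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
    mu1, S1 = np.array([1.0, -0.5]), np.array([[2.0, 0.0], [0.0, 0.5]])

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mu0, S0, size=200_000)      # draws from P
    mc = np.mean(mvn.logpdf(X, mu0, S0) - mvn.logpdf(X, mu1, S1))

    print(kl_gauss(mu0, S0, mu1, S1), mc)   # closed form vs Monte Carlo estimate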

KL Divergence and Inference

Ex 1. (Testing) Consider testing H0 : X ∼ f0 vs. H1 : X ∼ f1. The divergence

KL(f0 : f1) = E_0 [ log ( f0(X) / f1(X) ) ] ≥ 0

is just the expected log likelihood ratio under H0

Ex 2. (Estimation) Suppose X1,X2,... iid with Xi ∼ f(x|θ0) in P = {f(x|θ): θ ∈ Θ}. Under suitable assumptions, when n is large,

θ̂_MLE(x) ≈ argmin_{θ ∈ Θ} KL(f(·|θ0) : f(·|θ))

In other words, the MLE is trying to find the θ minimizing the KL divergence from the true distribution.

KL Divergence vs Total Variation and Hellinger

Fact: For any distributions P and Q we have

(1) TV(P, Q)² ≤ KL(P : Q)/2 (Pinsker’s Inequality)

(2) H(P, Q)² ≤ KL(P : Q)

Log Sum Inequality

Log-Sum Inequality: If a1, . . . , an and b1, . . . , bn are non-negative then

∑_{i=1}^n a_i log ( a_i / b_i ) ≥ ( ∑_{i=1}^n a_i ) log ( (∑_{i=1}^n a_i) / (∑_{i=1}^n b_i) )

with equality iff all the ratios ai/bi are equal
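
A tiny numerical check of the inequality on randomly generated non-negative vectors.

    import numpy as np

    rng = np.random.default_rng(0)
    for _ in range(5):
        a, b = rng.random(10), rng.random(10)      # non-negative entries
        lhs = np.sum(a * np.log(a / b))            # sum of a_i log(a_i / b_i)
        rhs = a.sum() * np.log(a.sum() / b.sum())  # (sum a_i) log(sum a_i / sum b_i)
        print(lhs >= rhs - 1e-12, lhs, rhs)        # lhs should dominate rhs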

Corollary: If P ∼ p and Q ∼ q are distributions, then for every event B

∑_{x ∈ B} p(x) log ( p(x) / q(x) ) ≥ P(B) log ( P(B) / Q(B) )

with equality iff p(x)/q(x) is constant for x ∈ B

Product Densities (Tensorization)

Recall: Given distributions P1, . . . , Pn on X with densities f1, . . . , fn, the product distribution P = ⊗_{i=1}^n Pi on X^n has density f(x1, . . . , xn) = f1(x1) · · · fn(xn)

Tensorization: Let P1,...,Pn and Q1,...,Qn be distributions on X

I TV(⊗_{i=1}^n Pi, ⊗_{i=1}^n Qi) ≤ ∑_{i=1}^n TV(Pi, Qi)

I H²(⊗_{i=1}^n Pi, ⊗_{i=1}^n Qi) ≤ ∑_{i=1}^n H²(Pi, Qi)

I KL(⊗_{i=1}^n Pi, ⊗_{i=1}^n Qi) = ∑_{i=1}^n KL(Pi, Qi)
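
A brief check of the KL identity for hypothetical Bernoulli coordinates: the divergence between the two product measures on {0, 1}^n, computed by brute force over all 2^n outcomes, should equal n times the coordinate-wise divergence.

    import numpy as np
    from itertools import product

    def bern(theta):
        return np.array([1 - theta, theta])       # pmf on {0, 1}

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    n = 8
    p, q = bern(0.5), bern(0.6)

    # Brute-force KL between the product measures on {0,1}^n
    kl_product = 0.0
    for x in product([0, 1], repeat=n):
        px = np.prod([p[xi] for xi in x])
        qx = np.prod([q[xi] for xi in x])
        kl_product += px * np.log(px / qx)

    print(kl_product, n * kl(p, q))               # should agree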

Distinguishing Coins

Given: Observations X = X1, . . . , Xn ∈ {0, 1} iid ∼ Bern(θ) with θ ∈ {θ0, θ1}

Goal: Find a decision rule d : {0, 1}^n → {0, 1} such that

(⋆) P0(d(X) = 1) ≤ α

(⋆) P1(d(X) = 0) ≤ α

Question: How large does the number of observations n need to be?

Fact: Let ∆ = |θ0 − θ1|. Then there exists a decision procedure achieving (⋆) with a number of observations

n = 2 log(1/α) / ∆²
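
A Monte Carlo sketch of one procedure behind this fact, with hypothetical values of θ0, θ1, and α: the rule that decides 1 when the sample mean exceeds the midpoint (θ0 + θ1)/2. By Hoeffding’s inequality each error probability is at most exp(−n∆²/2), which equals α at the stated sample size.

    import numpy as np

    rng = np.random.default_rng(0)
    theta0, theta1, alpha = 0.5, 0.6, 0.05        # hypothetical values
    delta = abs(theta1 - theta0)
    n = int(np.ceil(2 * np.log(1 / alpha) / delta**2))   # stated sample size

    def error_rate(theta, correct_decision, trials=20_000):
        X = rng.binomial(1, theta, size=(trials, n))
        d = (X.mean(axis=1) > (theta0 + theta1) / 2).astype(int)   # midpoint threshold test
        return np.mean(d != correct_decision)

    print("n =", n)
    print("P0(d(X) = 1) ~", error_rate(theta0, correct_decision=0))   # should be <= alpha
    print("P1(d(X) = 0) ~", error_rate(theta1, correct_decision=1))   # should be <= alpha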

Identifying Fair and Biased Coins

Suppose now that θ0 = 1/2 and θ1 = 1/2 + ε for some fixed ε ∈ (0, 1/4)

Fact: For every event A ⊆ {0, 1}^n

|P0(X ∈ A) − P1(X ∈ A)| = |P0(A) − P1(A)| ≤ ε √(2n)

Fact: If d : {0, 1}^n → {0, 1} is any decision rule achieving (⋆) then

n ≥ (1 − 2α)² / (2ε²)