Distances and Divergences for Probability Distributions
Andrew Nobel
October 2020

Background
Basic question: How far apart (different) are two distributions P and Q?
- Measured through distances and divergences
- Used to define convergence of distributions
- Used to assess smoothness of parametrizations {Pθ : θ ∈ Θ}
- A means of assessing the complexity of a family of distributions
- Key role in understanding the consistency of inference procedures
- Key ingredient in formulating lower and upper bounds on the performance of inference procedures

Kolmogorov-Smirnov Distance
Definition: Let P and Q be probability distributions on R with CDFs F and G. The Kolmogorov-Smirnov (KS) distance between P and Q is
KS(P, Q) = sup_{t ∈ R} |F(t) − G(t)|
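As a quick numerical illustration (not part of the original slides), the supremum can be approximated by maximizing over a fine grid; the normal CDFs and the grid bounds below are illustrative choices:

```python
import math

def norm_cdf(t, mu=0.0, sigma=1.0):
    # CDF of N(mu, sigma^2), written with the error function
    return 0.5 * (1.0 + math.erf((t - mu) / (sigma * math.sqrt(2.0))))

def ks_distance(F, G, lo=-10.0, hi=10.0, steps=10001):
    # Approximate sup_t |F(t) - G(t)| by a maximum over a fine grid
    grid = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return max(abs(F(t) - G(t)) for t in grid)

# KS distance between N(0,1) and N(1,1); the sup is attained at t = 1/2
d = ks_distance(norm_cdf, lambda t: norm_cdf(t, mu=1.0))
print(round(d, 4))  # 0.3829
```

Here the exact value is 2Φ(1/2) − 1, which the grid approximation recovers to high accuracy.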
Properties of KS Distance
1. 0 ≤ KS(P, Q) ≤ 1
2. KS(P, Q) = 0 iff P = Q
3. KS is a metric
4. KS(P, Q) = 1 iff there exists s ∈ R with P((−∞, s]) = 1 and Q((s, ∞)) = 1

Total Variation Distance
Definition: Let X be a set with a sigma-field A. The total variation distance between two probability measures P and Q on (X , A) is
TV(P, Q) = sup_{A ∈ A} |P(A) − Q(A)|
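On a small finite space the supremum over events can be evaluated by brute force, which makes the definition concrete (a sketch; the pmfs p and q are arbitrary examples):

```python
from itertools import chain, combinations

def tv_bruteforce(p, q):
    # Evaluate sup_A |P(A) - Q(A)| by enumerating all 2^|X| events
    xs = list(p)
    events = chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))
    return max(abs(sum(p[x] for x in A) - sum(q[x] for x in A)) for A in events)

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
print(round(tv_bruteforce(p, q), 4))  # 0.3, attained at A = {0} and at its complement
```

Enumerating events is exponential in the size of the space; the closed-form expressions given later make the distance computable in practice.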
Properties of Total Variation
1. 0 ≤ TV(P,Q) ≤ 1
2. TV(P, Q) = 0 iff P = Q
3. TV is a metric
4. TV(P, Q) = 1 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

KS, TV, and the CLT
Note: KS(P,Q) and TV(P,Q) can both be expressed in the form
sup_{A ∈ A0} |P(A) − Q(A)|
For KS the family A0 consists of all intervals (−∞, t]; for TV the family A0 consists of all (Borel) sets
Example: Let X1,X2,... ∈ {−1, 1} iid with P(Xi = 1) = P(Xi = −1) = 1/2. By the standard central limit theorem
Zn = n^{−1/2} Σ_{i=1}^{n} Xi ⇒ N(0, 1)
Let Pn = distribution of Zn and Q = N (0, 1). Can show that
KS(Pn, Q) ≤ c n^{−1/2} while TV(Pn, Q) ≡ 1, as Pn is supported on a finite set that has probability zero under Q

Total Variation and Densities
Scheffé's Theorem: Let P ∼ f and Q ∼ g be distributions on X = R^d. Then
1. TV(P, Q) = (1/2) ∫ |f(x) − g(x)| dx
2. TV(P, Q) = 1 − ∫ min{f(x), g(x)} dx
3. TV(P, Q) = P(A) − Q(A) where A = {x : f(x) ≥ g(x)}
Analogous results hold when P ∼ p(x) and Q ∼ q(x) are described by pmfs
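The three pmf versions of the theorem can be checked against each other numerically (a small sketch; p and q are arbitrary):

```python
p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
xs = set(p) | set(q)

# 1. Half the L1 distance between the mass functions
tv1 = 0.5 * sum(abs(p.get(x, 0) - q.get(x, 0)) for x in xs)
# 2. One minus the total overlap of the mass functions
tv2 = 1 - sum(min(p.get(x, 0), q.get(x, 0)) for x in xs)
# 3. P(A) - Q(A) on the set A where p >= q
A = [x for x in xs if p.get(x, 0) >= q.get(x, 0)]
tv3 = sum(p.get(x, 0) for x in A) - sum(q.get(x, 0) for x in A)

print(round(tv1, 4), round(tv2, 4), round(tv3, 4))  # all three agree
```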
Upshot: The total variation distance between P and Q is half the L1-distance between their densities or mass functions

Total Variation and Hypothesis Testing
Problem: Observe X ∈ X having density f0 or f1. Wish to test
H0 : X ∼ f0 vs. H1 : X ∼ f1
Any decision rule d : X → {0, 1} has overall (Type I + Type II) error
Err(d) = P0(d(X) = 1) + P1(d(X) = 0)
Fact: The optimum overall error among all decision rules is
inf_{d : X → {0,1}} Err(d) = ∫ min{f0(x), f1(x)} dx = 1 − TV(P0, P1)

Total Variation and Coupling
Definition: A coupling of distributions P and Q on X is a jointly distributed pair of random variables (X,Y ) such that X ∼ P and Y ∼ Q
Fact: TV(P, Q) is the minimum of P(X ≠ Y) over all couplings of P and Q

- If X ∼ P and Y ∼ Q then P(X ≠ Y) ≥ TV(P, Q)
- There is an optimal coupling achieving the lower bound
- The optimal coupling makes X and Y equal as often as possible
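One explicit construction of such an optimal (maximal) coupling for pmfs, sketched in Python (the distributions p and q are arbitrary examples):

```python
def maximal_coupling(p, q):
    # Build a joint pmf J on pairs (x, y) with marginals p and q
    # that puts as much mass as possible on the diagonal x == y.
    xs = sorted(set(p) | set(q))
    overlap = {x: min(p.get(x, 0), q.get(x, 0)) for x in xs}
    tv = 1 - sum(overlap.values())  # total variation distance (Scheffe form)
    J = {(x, x): overlap[x] for x in xs if overlap[x] > 0}
    if tv > 0:
        # Pair the leftover mass of p with the leftover mass of q
        rp = {x: p.get(x, 0) - overlap[x] for x in xs}
        rq = {y: q.get(y, 0) - overlap[y] for y in xs}
        for x in xs:
            for y in xs:
                if rp[x] > 0 and rq[y] > 0:
                    J[(x, y)] = J.get((x, y), 0) + rp[x] * rq[y] / tv
    return J, tv

p = {0: 0.5, 1: 0.3, 2: 0.2}
q = {0: 0.2, 1: 0.3, 2: 0.5}
J, tv = maximal_coupling(p, q)
mismatch = sum(m for (x, y), m in J.items() if x != y)
print(round(mismatch, 6), round(tv, 6))  # P(X != Y) equals TV(P, Q)
```

Mass min{p(x), q(x)} sits on the diagonal, and only the residual mass of p is paired against the residual mass of q, so X ≠ Y with probability exactly TV(P, Q).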
Note: If ρ is a metric on X, the Wasserstein distance between distributions P and Q is defined by min E[ρ(X, Y)], where the minimum is over all couplings (X, Y) of P and Q

Hellinger Distance
Definition: Let P ∼ f and Q ∼ g be probability measures on R^d. The Hellinger distance between P and Q is given by
H(P, Q) = ( ∫ ( √f(x) − √g(x) )² dx )^{1/2}
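For pmfs the integral becomes a sum, so the distance is easy to compute directly (an illustrative sketch; the pmfs are arbitrary):

```python
import math

def hellinger(p, q):
    # H(P, Q): the L2 distance between the square roots of the mass functions
    xs = set(p) | set(q)
    return math.sqrt(sum((math.sqrt(p.get(x, 0)) - math.sqrt(q.get(x, 0))) ** 2
                         for x in xs))

p = {0: 0.5, 1: 0.5}   # fair coin
q = {0: 0.1, 1: 0.9}   # heavily biased coin
h = hellinger(p, q)
print(round(h, 4))  # 0.4595
```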
Properties of Hellinger Distance

1. H(P, Q) is just the L2 distance between √f and √g
2. H²(P, Q) = 2 ( 1 − ∫ √(f(x) g(x)) dx ), therefore 0 ≤ H²(P, Q) ≤ 2
3. H(P, Q) = 0 iff P = Q
4. H is a metric
5. H²(P, Q) = 2 iff there exists A ∈ A with P(A) = 1 and Q(A) = 0

Hellinger Distance vs. Total Variation
Fact: For any pair of densities f, g we have

∫ min(f, g) dx ≥ (1/2) ( ∫ √(f g) dx )² = (1/2) ( 1 − (1/2) H²(f, g) )²
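A quick numerical spot-check of this inequality on random pmfs (the 5-point space and 1000 trials are arbitrary choices):

```python
import random

random.seed(0)
slack = float("inf")
for _ in range(1000):
    # Draw two random pmfs on a 5-point space
    w1 = [random.random() for _ in range(5)]
    w2 = [random.random() for _ in range(5)]
    p = [w / sum(w1) for w in w1]
    q = [w / sum(w2) for w in w2]
    overlap = sum(min(a, b) for a, b in zip(p, q))       # 1 - TV
    bc = sum((a * b) ** 0.5 for a, b in zip(p, q))       # 1 - H^2/2
    slack = min(slack, overlap - 0.5 * bc ** 2)
print(slack >= 0)  # the inequality held on every trial
```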
Fact: For any distributions P and Q

(1/2) H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √( 1 − H²(P, Q)/4 )
- H²(P, Q) = 0 iff TV(P, Q) = 0, and H²(P, Q) = 2 iff TV(P, Q) = 1
- H(Pn, Qn) → 0 iff TV(Pn, Qn) → 0

Kullback-Leibler (KL) Divergence
Definition: The KL-divergence between distributions P ∼ f and Q ∼ g is given by
KL(P : Q) = KL(f : g) = ∫ f(x) log( f(x) / g(x) ) dx
Analogous definition holds for discrete distributions P ∼ p and Q ∼ q
- The integrand can be positive or negative. By convention, f(x) log( f(x)/g(x) ) = +∞ if f(x) > 0 and g(x) = 0, and = 0 if f(x) = 0
- KL divergence is not symmetric, and is not a metric. Note that

KL(P : Q) = E_f [ log( f(X) / g(X) ) ]

First Properties of KL Divergence
Fact: The integral defining KL(P : Q) is well defined. Letting u− = max(−u, 0),
∫ ( f(x) log( f(x) / g(x) ) )− dx < ∞
Key Fact:
- The divergence KL(P : Q) ≥ 0, with equality if and only if P = Q
- KL(P : Q) = +∞ if there is a set A with P(A) > 0 and Q(A) = 0

Notation: When pmfs or pdfs are clear from context, write KL(p : q) or KL(f : g)

KL Divergence Examples
Example: Let p and q be pmfs on {0, 1} with
p(0) = p(1) = 1/2 and q(0) = (1 − ε)/2, q(1) = (1 + ε)/2
Then we have the following exact expressions and bounds

- KL(p : q) = −(1/2) log(1 − ε²) ≤ ε² when ε ≤ 1/√2
- KL(q : p) = (1/2) log(1 − ε²) + (ε/2) log( (1 + ε)/(1 − ε) ) ≤ ε²
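The exact expressions and the ε² bounds can be verified directly in code (ε = 0.3 is an arbitrary choice satisfying ε ≤ 1/√2):

```python
import math

def kl(p, q):
    # KL(p : q) for pmfs over a common finite support
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

eps = 0.3
p = [0.5, 0.5]
q = [(1 - eps) / 2, (1 + eps) / 2]

# Exact expressions from the example
assert abs(kl(p, q) - (-0.5) * math.log(1 - eps ** 2)) < 1e-12
assert abs(kl(q, p) - (0.5 * math.log(1 - eps ** 2)
                       + (eps / 2) * math.log((1 + eps) / (1 - eps)))) < 1e-12
# Both divergences are at most eps^2 here
assert kl(p, q) <= eps ** 2 and kl(q, p) <= eps ** 2
print(round(kl(p, q), 6), round(kl(q, p), 6))
```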
Example: If P ∼ Nd(µ0, Σ0) and Q ∼ Nd(µ1, Σ1) with Σ0, Σ1 > 0 then
KL(P : Q) = (1/2) [ tr(Σ1^{−1} Σ0) + (µ1 − µ0)^t Σ1^{−1} (µ1 − µ0) + ln( |Σ1| / |Σ0| ) − d ]

KL Divergence and Inference
Ex 1. (Testing) Consider testing H0 : X ∼ f0 vs. H1 : X ∼ f1. The divergence
KL(f0 : f1) = E0 [ log( f0(X) / f1(X) ) ] ≥ 0
is just the expected log likelihood ratio under H0
Ex 2. (Estimation) Suppose X1,X2,... iid with Xi ∼ f(x|θ0) in P = {f(x|θ): θ ∈ Θ}. Under suitable assumptions, when n is large,
θ̂_MLE(x) ≈ argmin_{θ ∈ Θ} KL( f(·|θ0) : f(·|θ) )
In other words, the MLE is trying to find the θ whose model f(·|θ) minimizes the KL divergence from the true distribution

KL Divergence vs Total Variation and Hellinger
Fact: For any distributions P and Q we have
(1) TV(P, Q)² ≤ KL(P : Q)/2 (Pinsker's Inequality)
(2) H(P, Q)² ≤ KL(P : Q)

Log Sum Inequality
Log-Sum Inequality: If a1, . . . , an and b1, . . . , bn are non-negative then
Σ_{i=1}^{n} ai log( ai / bi ) ≥ ( Σ_{i=1}^{n} ai ) log( Σ_{i=1}^{n} ai / Σ_{i=1}^{n} bi )
with equality iff all the ratios ai/bi are equal
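A minimal numerical check of the inequality and of its equality condition (the vectors a and b are arbitrary):

```python
import math

def logsum_lhs(a, b):
    # Left side: sum_i a_i log(a_i / b_i)
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def logsum_rhs(a, b):
    # Right side: (sum_i a_i) log(sum_i a_i / sum_i b_i)
    A, B = sum(a), sum(b)
    return A * math.log(A / B)

a = [1.0, 2.0, 3.0]
b = [2.0, 2.0, 1.0]
assert logsum_lhs(a, b) >= logsum_rhs(a, b)   # strict here: the ratios differ

c = [2.0 * bi for bi in b]                    # all ratios a_i/b_i equal 2
assert abs(logsum_lhs(c, b) - logsum_rhs(c, b)) < 1e-12
print("log-sum inequality verified")
```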
Corollary: If P ∼ p and Q ∼ q are distributions, then for every event B
Σ_{x ∈ B} p(x) log( p(x) / q(x) ) ≥ P(B) log( P(B) / Q(B) )
with equality iff p(x)/q(x) is constant for x ∈ B

Product Densities (Tensorization)
Recall: Given distributions P1, . . . , Pn on X with densities f1, . . . , fn, the product distribution P = ⊗_{i=1}^{n} Pi on X^n has density f(x1, . . . , xn) = f1(x1) ··· fn(xn)
Tensorization: Let P1,...,Pn and Q1,...,Qn be distributions on X
- TV( ⊗_{i=1}^{n} Pi, ⊗_{i=1}^{n} Qi ) ≤ Σ_{i=1}^{n} TV(Pi, Qi)
- H²( ⊗_{i=1}^{n} Pi, ⊗_{i=1}^{n} Qi ) ≤ Σ_{i=1}^{n} H²(Pi, Qi)
- KL( ⊗_{i=1}^{n} Pi : ⊗_{i=1}^{n} Qi ) = Σ_{i=1}^{n} KL(Pi : Qi)

Distinguishing Coins
Given: Observations X = (X1, . . . , Xn) with the Xi ∈ {0, 1} iid ∼ Bern(θ), where θ ∈ {θ0, θ1}
Goal: Find a decision rule d : {0, 1}^n → {0, 1} such that
(?) P0(d(X) = 1) ≤ α
(?) P1(d(X) = 0) ≤ α
Question: How large does the number of observations n need to be?
Fact: Let ∆ = |θ0 − θ1|. Then there exists a decision procedure achieving performance (?) and requiring a number of observations

n = 2 log(1/α) / ∆²

Identifying Fair and Biased Coins
Suppose now that θ0 = 1/2 and θ1 = 1/2 + ε for some fixed ε ∈ (0, 1/4)
Fact: For every event A ⊆ {0, 1}^n

|P0(X ∈ A) − P1(X ∈ A)| = |P0(A) − P1(A)| ≤ ε √(2n)
Fact: If d : {0, 1}^n → {0, 1} is any decision rule achieving (?) then
n ≥ (1 − 2α)² / (2 ε²)
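The upper bound ε√(2n) can be compared with the exact total variation distance between the two product measures; since their likelihood ratio depends only on the number of heads, the computation reduces to binomial pmfs (n = 50 and ε = 0.05 are arbitrary choices):

```python
import math

def binom_pmf(n, k, theta):
    # P(k heads in n tosses of a theta-coin)
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def tv_product(n, t0, t1):
    # TV between the n-fold products, via the sufficient statistic (head count)
    return 0.5 * sum(abs(binom_pmf(n, k, t0) - binom_pmf(n, k, t1))
                     for k in range(n + 1))

eps, n = 0.05, 50
tv = tv_product(n, 0.5, 0.5 + eps)
bound = eps * math.sqrt(2 * n)
print(round(tv, 4), "<=", round(bound, 4))
```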