Technical Notes on Kullback-Leibler Divergence
Alexander Etz Sept. 5, 2018 (updated Jan. 2, 2019)
Kullback-Leibler Divergence
The Kullback-Leibler divergence between the true data-generating distribution for random variable $X$, say $p_1(x)$, and another possible candidate distribution, say $p_2(x)$, is the expected value of the log likelihood ratio in favor of the true model. That is, if $p_1(x)$ is the true model, then

$$\begin{aligned} KL(p_1 \| p_2) &= \int_X p_1(x) \log \frac{p_1(x)}{p_2(x)} \, dx & (1) \\ &= E_{p_1}\!\left[\log \frac{p_1(x)}{p_2(x)}\right] & (2) \end{aligned}$$

where the $X$ in the first line is the sample space of the random variable $X$ (i.e., the possible values it can take); if $X$ is discrete then the integral is replaced with a sum. The log likelihood ratio can be interpreted as the amount of evidence the data provide for one model versus another, so the KL divergence tells us how much evidence we can expect our data to provide in favor of the true model. See Etz (2018) for a refresher on likelihoods and likelihood ratios.

The integral in (1) might be rather complicated, and if we try to derive the KL divergence using brute-force integration/summation it can get a little hairy. The representation in (2) makes our life a lot easier, because for many common distributions it reduces the bulk of the derivation to some simple algebra. There aren't too many derivations for commonly used Kullback-Leibler divergences online, so I have written up some of my notes below.
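As a quick numerical illustration of (1) and (2), here is a minimal sketch (assuming numpy is available; the two distributions are made up for illustration): for a discrete random variable, the KL divergence is just the $p_1$-weighted average of the log likelihood ratio over the sample space.

import numpy as np

# Two made-up distributions over the sample space {0, 1, 2}
p1 = np.array([0.5, 0.3, 0.2])  # "true" data-generating distribution
p2 = np.array([0.4, 0.4, 0.2])  # candidate distribution

# Discrete analogue of (1)/(2): the p1-weighted sum of log(p1/p2)
kl = np.sum(p1 * np.log(p1 / p2))
print(kl)  # about 0.025 nats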
0.1 Bernoulli

The Bernoulli distribution is useful when we want to model a trial with a binary outcome and a certain probability of success. Note: the binomial distribution describes the number of successes in $n$ independent replicates of a single Bernoulli trial. Let $X \sim \text{Bern}(\theta)$, $x \in \{0, 1\}$, $0 < \theta < 1$. Then
$$p_\theta(x) = \theta^x (1 - \theta)^{1-x},$$
and $E[X] = \theta$. Then the log likelihood ratio (LLR) in favor of a Bernoulli distribution with $\theta = \theta_1$ versus one with $\theta = \theta_2$ (from now on I'll just use the shorthand "the log likelihood ratio between $\theta_1$ and $\theta_2$," but remember we are really talking about the distributions indexed by those parameters) is

$$\begin{aligned} LLR(X) &= \log \frac{p_{\theta_1}(X)}{p_{\theta_2}(X)} \\ &= \log \frac{\theta_1^X (1 - \theta_1)^{1-X}}{\theta_2^X (1 - \theta_2)^{1-X}} \\ &= X \log \frac{\theta_1}{\theta_2} + (1 - X) \log \frac{1 - \theta_1}{1 - \theta_2}. \end{aligned}$$

To get to the Kullback-Leibler divergence we need to take the expectation of this function when truly $\theta = \theta_1$. The log likelihood ratio above is a linear function of $X$, so to take its expectation we can simply replace $X$ with $E[X] = \theta_1$, giving
$$\begin{aligned} KL(p_{\theta_1} \| p_{\theta_2}) &= E_{\theta_1}[LLR(X)] \\ &= E[X] \log \frac{\theta_1}{\theta_2} + (1 - E[X]) \log \frac{1 - \theta_1}{1 - \theta_2} \\ &= \theta_1 \log \frac{\theta_1}{\theta_2} + (1 - \theta_1) \log \frac{1 - \theta_1}{1 - \theta_2}. \end{aligned}$$
Note: If $X_1, X_2, \ldots, X_n$ are independent Bernoulli trials with probability of success $\theta$, then the random variable $Y = \sum X_i$ follows a binomial distribution with probability of success $\theta$ and sample size $n$. In this case the KL divergence for the binomial trial is simply $n$ times the KL divergence of a single Bernoulli trial.
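As a numerical check, here is a sketch (assuming numpy and scipy are available; the values of $\theta_1$, $\theta_2$, and $n$ are arbitrary) verifying the closed form against a direct sum over $\{0, 1\}$, and verifying that the binomial KL divergence is $n$ times the Bernoulli one:

import numpy as np
from scipy.stats import binom

theta1, theta2, n = 0.7, 0.4, 10

def kl_bern(t1, t2):
    # The closed form derived above
    return t1 * np.log(t1 / t2) + (1 - t1) * np.log((1 - t1) / (1 - t2))

# Direct sum over the sample space {0, 1}
p1 = np.array([1 - theta1, theta1])
p2 = np.array([1 - theta2, theta2])
print(np.isclose(kl_bern(theta1, theta2), np.sum(p1 * np.log(p1 / p2))))  # True

# Binomial KL divergence equals n times the Bernoulli KL divergence
x = np.arange(n + 1)
kl_binom = np.sum(binom.pmf(x, n, theta1)
                  * (binom.logpmf(x, n, theta1) - binom.logpmf(x, n, theta2)))
print(np.isclose(kl_binom, n * kl_bern(theta1, theta2)))  # True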
0.2 Geometric

The geometric distribution is useful when we want to know the failure rate of a process using a design that continues collecting data until the first failure occurs. Let $X \sim \text{Geo}(\theta)$, $x \in \{0, 1, \ldots\}$, $0 < \theta < 1$. Then
$$p_\theta(x) = \theta (1 - \theta)^x,$$
and $E[X] = \frac{1 - \theta}{\theta}$. Then the log likelihood ratio between $\theta_1$ and $\theta_2$ is

$$\begin{aligned} LLR(X) &= \log \frac{p_{\theta_1}(X)}{p_{\theta_2}(X)} \\ &= \log \frac{\theta_1 (1 - \theta_1)^X}{\theta_2 (1 - \theta_2)^X} \\ &= \log \frac{\theta_1}{\theta_2} + X \log \frac{1 - \theta_1}{1 - \theta_2}. \end{aligned}$$
To get to the KL divergence we need the expectation of this function when truly $\theta = \theta_1$. This is again a linear function of $X$, so we again simply replace $X$ with its expectation, giving

$$KL(p_{\theta_1} \| p_{\theta_2}) = \log \frac{\theta_1}{\theta_2} + \frac{1 - \theta_1}{\theta_1} \log \frac{1 - \theta_1}{1 - \theta_2}.$$
Note: If $X_1, X_2, \ldots, X_n$ are independent geometric trials with probability of success $\theta$, then the random variable $Y = \sum X_i$ follows a negative binomial distribution with probability of success $\theta$. In this case the KL divergence for a negative binomial trial is simply $n$ times the KL divergence of a single geometric trial.
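A similar sketch (again assuming numpy; the values of $\theta_1$ and $\theta_2$ are arbitrary) checks the geometric result against a truncated version of the infinite sum, whose terms decay geometrically:

import numpy as np

theta1, theta2 = 0.5, 0.3

def kl_geom(t1, t2):
    # The closed form derived above, using E[X] = (1 - t1) / t1
    return np.log(t1 / t2) + (1 - t1) / t1 * np.log((1 - t1) / (1 - t2))

# Truncated direct sum over x = 0, 1, ..., 999; the omitted tail is negligible
x = np.arange(1000)
p1 = theta1 * (1 - theta1) ** x
p2 = theta2 * (1 - theta2) ** x
print(np.isclose(kl_geom(theta1, theta2), np.sum(p1 * np.log(p1 / p2))))  # True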
0.3 Poisson

Let $X \sim \text{Pois}(\lambda)$, $x \in \{0, 1, \ldots\}$, $\lambda > 0$. Then
$$p_\lambda(x) = \lambda^x e^{-\lambda} / x!,$$

and $E[X] = \lambda$. Then the log likelihood ratio between $\lambda_1$ and $\lambda_2$ is
$$\begin{aligned} LLR(X) &= \log \frac{p_{\lambda_1}(X)}{p_{\lambda_2}(X)} \\ &= \log \frac{\lambda_1^X e^{-\lambda_1} / X!}{\lambda_2^X e^{-\lambda_2} / X!} \\ &= X \log \frac{\lambda_1}{\lambda_2} - (\lambda_1 - \lambda_2). \end{aligned}$$

To get to the KL divergence we need the expectation of this function when truly $\lambda = \lambda_1$. Again we have a linear function of $X$, so we just replace $X$ with its expectation, giving
$$KL(p_{\lambda_1} \| p_{\lambda_2}) = \lambda_1 \log \frac{\lambda_1}{\lambda_2} - (\lambda_1 - \lambda_2).$$
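As a check, a sketch (assuming numpy/scipy; $\lambda_1$ and $\lambda_2$ are arbitrary example values) compares the closed form with a truncated direct sum over the Poisson pmf:

import numpy as np
from scipy.stats import poisson

lam1, lam2 = 4.0, 2.5

kl_closed = lam1 * np.log(lam1 / lam2) - (lam1 - lam2)

# Truncated direct sum; the Poisson tail far above the mean is negligible
x = np.arange(100)
direct = np.sum(poisson.pmf(x, lam1)
                * (poisson.logpmf(x, lam1) - poisson.logpmf(x, lam2)))
print(np.isclose(kl_closed, direct))  # True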
0.4 Exponential

Let $X \sim \text{Exp}(\theta)$, $x > 0$, $\theta > 0$. Then
$$p_\theta(x) = \theta e^{-\theta x},$$
and $E[X] = 1/\theta$. Then the log likelihood ratio between $\theta_1$ and $\theta_2$ is

$$\begin{aligned} LLR(X) &= \log \frac{p_{\theta_1}(X)}{p_{\theta_2}(X)} \\ &= \log \frac{\theta_1 e^{-\theta_1 X}}{\theta_2 e^{-\theta_2 X}} \\ &= \log \frac{\theta_1}{\theta_2} - X (\theta_1 - \theta_2). \end{aligned}$$

To get to the KL divergence we need the expectation of this function when truly $\theta = \theta_1$. Again we have a linear function of $X$, so we just replace $X$ with its expectation, giving

$$KL(p_{\theta_1} \| p_{\theta_2}) = \log \frac{\theta_1}{\theta_2} - \frac{\theta_1 - \theta_2}{\theta_1}.$$
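For the continuous case, a sketch (assuming scipy; $\theta_1$ and $\theta_2$ are arbitrary example values) integrates $p_{\theta_1}(x)$ times the log likelihood ratio over $(0, \infty)$ and compares it with the closed form:

import numpy as np
from scipy.integrate import quad

t1, t2 = 2.0, 0.5

kl_closed = np.log(t1 / t2) - (t1 - t2) / t1

# Integrate p_{t1}(x) * log likelihood ratio over the positive reals
def integrand(x):
    llr = np.log(t1 / t2) - x * (t1 - t2)
    return t1 * np.exp(-t1 * x) * llr

direct, _ = quad(integrand, 0, np.inf)
print(np.isclose(kl_closed, direct))  # True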
0.5 Normal (part 1)

Let $X \sim N(\mu, \sigma^2)$, $-\infty < x < \infty$, $-\infty < \mu < \infty$, $\sigma^2 > 0$. Then if we let $\theta = (\mu, \sigma^2)$, we have
$$p_\theta(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right],$$

and $E[X] = \mu$. Then the log likelihood ratio between two normal distributions with different means, $\mu_1$ versus $\mu_2$, but the same variance $\sigma^2$, is
$$\begin{aligned} LLR(X) &= \log \frac{p_{\mu_1}(X)}{p_{\mu_2}(X)} \\ &= \log \frac{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(X - \mu_1)^2}{2\sigma^2}\right]}{\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(X - \mu_2)^2}{2\sigma^2}\right]} \\ &= -\frac{1}{2\sigma^2}\left[(X - \mu_1)^2 - (X - \mu_2)^2\right]. \end{aligned}$$
Let's define $Y = X - \mu_1$ and $\delta = \mu_2 - \mu_1$. Then we have

$$\begin{aligned} LLR(X) &= -\frac{1}{2\sigma^2}\left[Y^2 - (Y - \delta)^2\right] \\ &= -\frac{1}{2\sigma^2}\left[Y^2 - Y^2 + 2Y\delta - \delta^2\right] \\ &= -\frac{Y\delta}{\sigma^2} + \frac{\delta^2}{2\sigma^2}, \end{aligned}$$

and if we change back to our original variables we obtain

$$LLR(X) = -\frac{(X - \mu_1)(\mu_2 - \mu_1)}{\sigma^2} + \frac{(\mu_2 - \mu_1)^2}{2\sigma^2}.$$
Again we have a linear function of $X$. Taking the expectation when truly $\mu = \mu_1$, the first term becomes zero, and thus we obtain

$$KL(p_{\mu_1} \| p_{\mu_2}) = \frac{(\mu_2 - \mu_1)^2}{2\sigma^2}.$$
If we write $\Delta = (\mu_2 - \mu_1)/\sigma$ for the standardized mean difference, we can see that the KL divergence is $\Delta^2/2$, i.e., half the squared standardized mean difference between the two distributions.
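The expectation can also be checked by Monte Carlo. Here is a sketch (assuming numpy/scipy; $\mu_1$, $\mu_2$, and $\sigma$ are made-up values): averaging the log likelihood ratio over draws from the true distribution should approximate $\Delta^2/2$.

import numpy as np
from scipy.stats import norm

mu1, mu2, sigma = 0.0, 1.5, 2.0

kl_closed = (mu2 - mu1) ** 2 / (2 * sigma ** 2)  # half the squared SMD

# Monte Carlo average of the log likelihood ratio under the true model
rng = np.random.default_rng(0)
x = rng.normal(mu1, sigma, size=1_000_000)
llr = norm.logpdf(x, mu1, sigma) - norm.logpdf(x, mu2, sigma)
print(kl_closed, llr.mean())  # the two values should agree closely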
0.6 Normal (part 2)

Let $X \sim N(\mu, \sigma^2)$ as before. The log likelihood ratio between two normal distributions with different means, $\mu_1$ versus $\mu_2$, and different variances, $\sigma_1^2$ versus $\sigma_2^2$, is

$$\begin{aligned} LLR(X) &= \log \frac{\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left[-\frac{(X - \mu_1)^2}{2\sigma_1^2}\right]}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left[-\frac{(X - \mu_2)^2}{2\sigma_2^2}\right]} \\ &= \frac{1}{2} \log \frac{\sigma_2^2}{\sigma_1^2} - \frac{(X - \mu_1)^2}{2\sigma_1^2} + \frac{(X - \mu_2)^2}{2\sigma_2^2}. \end{aligned}$$

Let's again define $Y = X - \mu_1$ and $\delta = \mu_2 - \mu_1$, and let's also define $\tau_i = 1/\sigma_i^2$, $i = 1, 2$. Then
$$\begin{aligned} LLR(X) &= \frac{1}{2} \log \frac{\tau_1}{\tau_2} - \frac{1}{2}\tau_1 Y^2 + \frac{1}{2}\tau_2 (Y - \delta)^2 \\ &= \frac{1}{2} \log \frac{\tau_1}{\tau_2} - \frac{1}{2}\left[\tau_1 Y^2 - \tau_2 Y^2 + 2\tau_2 Y\delta - \tau_2 \delta^2\right] \\ &= \frac{1}{2} \log \frac{\tau_1}{\tau_2} - \frac{1}{2}(\tau_1 - \tau_2) Y^2 - \tau_2 Y\delta + \frac{1}{2}\tau_2 \delta^2. \end{aligned}$$

We need to find the expectation of the above function to obtain the KL divergence. Unfortunately, it is not just a linear function of $Y$, but of $Y^2$ as well. Recall that in general $E[Y^2] = Var[Y] + E[Y]^2$. When $\mu = \mu_1$ and $\sigma^2 = \sigma_1^2$, $E[Y] = E[X - \mu_1] = 0$ and $Var[Y] = Var[X] = \sigma_1^2 = 1/\tau_1$. Thus,
$$\begin{aligned} KL(p_{\theta_1} \| p_{\theta_2}) &= \frac{1}{2} \log \frac{\tau_1}{\tau_2} - \frac{\tau_1 - \tau_2}{2\tau_1} + \frac{1}{2}\tau_2 \delta^2 \\ &= \frac{1}{2} \log \frac{\tau_1}{\tau_2} - \frac{1}{2} + \frac{\tau_2}{2\tau_1} + \frac{1}{2}\tau_2 \delta^2 \\ &= \frac{1}{2}\left[\log \frac{\sigma_2^2}{\sigma_1^2} + \frac{\sigma_1^2}{\sigma_2^2} + \frac{(\mu_2 - \mu_1)^2}{\sigma_2^2} - 1\right]. \end{aligned}$$

Note that the KL divergence for normal distributions with the same variance but different means (section 0.5) is a special case of the above result, where $\sigma_1^2 = \sigma_2^2 = \sigma^2$. Likewise, the KL divergence for two normal distributions with the same mean but different variances is also a special case of the above where $\mu_1 = \mu_2 = \mu$.
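The general formula can be checked the same way. A sketch (assuming numpy/scipy; the parameter values are arbitrary):

import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 1.0, 1.5, -0.5, 2.5

# The closed form derived above
kl_closed = 0.5 * (np.log(s2**2 / s1**2) + s1**2 / s2**2
                   + (mu2 - mu1) ** 2 / s2**2 - 1)

# Monte Carlo average of the log likelihood ratio under the true model
rng = np.random.default_rng(1)
x = rng.normal(mu1, s1, size=1_000_000)
llr = norm.logpdf(x, mu1, s1) - norm.logpdf(x, mu2, s2)
print(kl_closed, llr.mean())  # the two values should agree closely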
References
Etz, A. (2018). Introduction to the concept of likelihood and its applications. Advances in Methods and Practices in Psychological Science, 1(1):60–69.