Lecture 17: Differential Entropy
• Differential entropy
• AEP for differential entropy
• Quantization
• Maximum differential entropy
• Estimation counterpart of Fano's inequality
Dr. Yao Xie, ECE587, Information Theory, Duke University

From discrete to continuous world
Differential entropy
• defined for a continuous random variable with density f
• differential entropy:
  h(X) = −∫_S f(x) log f(x) dx
  where S is the support set of the probability density function (PDF) f
• sometimes denoted h(f)
Uniform distribution
• f(x) = 1/a, x ∈ [0, a]
• differential entropy:
  h(X) = −∫_0^a (1/a) log(1/a) dx = log a bits
• for a < 1, log a < 0: differential entropy can be negative! (unlike the discrete world)
• interpretation: the volume of the support set is 2^{h(X)} = 2^{log a} = a > 0
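The log a formula, including the negative value for a < 1, can be checked numerically from the definition. A small sketch assuming NumPy; the Riemann-sum helper `diff_entropy_bits` is ours, not a library routine:

```python
import numpy as np

def diff_entropy_bits(f, lo, hi, n=100_000):
    """Approximate h(X) = -∫ f(x) log2 f(x) dx by a midpoint Riemann sum."""
    dx = (hi - lo) / n
    x = lo + dx * (np.arange(n) + 0.5)   # bin midpoints
    fx = f(x)
    fx = fx[fx > 0]                      # restrict to the support S
    return -np.sum(fx * np.log2(fx)) * dx

a = 2.0
h_unif = diff_entropy_bits(lambda x: np.full_like(x, 1.0 / a), 0.0, a)
# ≈ log2(2) = 1 bit

a_small = 0.5
h_neg = diff_entropy_bits(lambda x: np.full_like(x, 1.0 / a_small), 0.0, a_small)
# ≈ log2(0.5) = -1: negative differential entropy
```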
Normal distribution
• f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}, x ∈ R
• Differential entropy:
  h(X) = (1/2) log(2πeσ²) bits
Calculation: h(X) = −E[log f(X)] = E[X²/(2σ²)] log e + (1/2) log(2πσ²) = (1/2) log e + (1/2) log(2πσ²) = (1/2) log(2πeσ²).
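The closed form can be cross-checked numerically. A sketch assuming NumPy; `gaussian_entropy_numeric` is an illustrative helper, not a library function:

```python
import numpy as np

def gaussian_entropy_bits(sigma):
    # closed form: h(X) = (1/2) log2(2*pi*e*sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def gaussian_entropy_numeric(sigma, lim=12.0, n=400_000):
    # Riemann sum of -f log2 f over [-lim*sigma, lim*sigma]
    x = np.linspace(-lim * sigma, lim * sigma, n)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.sum(f * np.log2(f)) * dx

h_closed = gaussian_entropy_bits(1.0)    # ≈ 2.047 bits
h_numeric = gaussian_entropy_numeric(1.0)
```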
Some properties
• h(X + c) = h(X)
• h(aX) = h(X) + log |a|
• h(AX) = h(X) + log | det(A)|
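The scaling identity h(aX) = h(X) + log |a| can be sanity-checked with the Gaussian closed form, since aX ∼ N(0, a²σ²) when X ∼ N(0, σ²). A small sketch; `h_gauss` is our helper:

```python
import numpy as np

def h_gauss(sigma):
    # differential entropy (bits) of N(0, sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

a, sigma = -3.0, 1.5
lhs = h_gauss(abs(a) * sigma)            # h(aX), since aX ~ N(0, a^2 sigma^2)
rhs = h_gauss(sigma) + np.log2(abs(a))   # h(X) + log|a|
```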
AEP for continuous random variable
• Discrete world: for a sequence of i.i.d. random variables,
  p(X1, …, Xn) → 2^{−nH(X)}
• Continuous world: for a sequence of i.i.d. random variables,
  −(1/n) log f(X1, …, Xn) → h(X), in probability
• Proof follows from the weak law of large numbers
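The convergence is easy to observe in simulation for X ∼ N(0, 1). A sketch assuming NumPy; the seed and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)   # i.i.d. N(0, 1)

# log2 f(Xi) for the standard normal density
log2_f = -0.5 * np.log2(2 * np.pi) - (x**2 / 2) * np.log2(np.e)

# -(1/n) log2 f(X1,...,Xn) = -(1/n) sum_i log2 f(Xi) by independence
aep_estimate = -np.mean(log2_f)
h_true = 0.5 * np.log2(2 * np.pi * np.e)   # ≈ 2.047 bits
```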
Size of typical set
• Discrete world: number of typical sequences
  |A_ϵ^{(n)}| ≈ 2^{nH(X)}
• Continuous world: volume of typical set
• Volume of set A:
  Vol(A) = ∫_A dx1 ⋯ dxn
• p(A_ϵ^{(n)}) > 1 − ϵ for n sufficiently large
• Vol(A_ϵ^{(n)}) ≤ 2^{n(h(X)+ϵ)} for all n
• Vol(A_ϵ^{(n)}) ≥ (1 − ϵ) 2^{n(h(X)−ϵ)} for n sufficiently large
• Proofs: very similar to the discrete world. For the upper bound:
  1 = ∫ f(x1, …, xn) dx1 ⋯ dxn
    ≥ ∫_{A_ϵ^{(n)}} f(x1, …, xn) dx1 ⋯ dxn
    ≥ 2^{−n(h(X)+ϵ)} ∫_{A_ϵ^{(n)}} dx1 ⋯ dxn
    = 2^{−n(h(X)+ϵ)} Vol(A_ϵ^{(n)})
• The volume of the smallest set that contains most of the probability is ≈ 2^{nh(X)}
• for an n-dimensional space, this means that each dimension has measure ≈ 2^{h(X)}
• Differential entropy corresponds to the volume of the typical set; Fisher information corresponds to the surface area of the typical set
Mean value theorem (MVT)
If a function f is continuous on the closed interval [a, b], and differentiable on (a, b), then there exists a point c ∈ (a, b) such that
f′(c) = (f(b) − f(a)) / (b − a)
Relation of continuous entropy to discrete entropy

• Discretize a continuous pdf f(x): divide the range of X into bins of length ∆
• MVT: there exists a value x_i within each bin such that
  f(x_i)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx
[Figure: pdf f(x) partitioned into bins of width ∆ along the x-axis]
• Define a discrete random variable X^∆ ∈ {x1, x2, …} with probability mass function
  p_i = ∫_{i∆}^{(i+1)∆} f(x) dx = f(x_i)∆
• Entropy of X^∆:
  H(X^∆) = −∑_{i=−∞}^{∞} p_i log p_i = −∑_i ∆ f(x_i) log f(x_i) − log ∆
• If f(x) is Riemann integrable, H(X^∆) + log ∆ → h(X) as ∆ → 0
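This convergence can be observed numerically for X ∼ N(0, 1), computing bin probabilities exactly from the Gaussian CDF. A standard-library sketch; `quantized_gap` is our illustrative helper:

```python
import math

def Phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def quantized_gap(delta, lim=12.0):
    """H(X^Δ) + log2 Δ for X ~ N(0,1), bins [iΔ, (i+1)Δ) covering [-lim, lim]."""
    H = 0.0
    n_bins = int(round(2 * lim / delta))
    for i in range(n_bins):
        left = -lim + i * delta
        p = Phi(left + delta) - Phi(left)
        if p > 0:
            H -= p * math.log2(p)
    return H + math.log2(delta)

h_true = 0.5 * math.log2(2 * math.pi * math.e)     # ≈ 2.047 bits
gaps = {d: quantized_gap(d) for d in (0.5, 0.1, 0.02)}
# gaps[d] approaches h_true as d shrinks
```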
Implication on quantization
• Let ∆ = 2^{−n}; then −log ∆ = n
• In general, h(X) + n is the average number of bits needed to describe a continuous variable X to n-bit accuracy
• e.g. if X is uniform on [0, 1/8], the first 3 bits after the binary point must be zero, so describing X to n-bit accuracy requires n − 3 bits; this agrees with h(X) = log(1/8) = −3
• if X ∼ N(0, 100), n + h(X) = n + (1/2) log(2πe · 100) ≈ n + 5.37
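Both numerical examples above can be verified directly (standard-library sketch; n = 8 is an arbitrary illustrative accuracy):

```python
import math

n = 8  # target accuracy: n-bit (illustrative choice)

# Uniform on [0, 1/8]: h(X) = log2(1/8) = -3, so n-bit accuracy needs n - 3 bits
h_unif = math.log2(1 / 8)
bits_unif = n + h_unif

# X ~ N(0, 100): h(X) = 0.5 * log2(2*pi*e*100) ≈ 5.37, so about n + 5.37 bits
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * 100)
bits_gauss = n + h_gauss
```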
Joint and conditional differential entropy
• Joint differential entropy
  h(X1, …, Xn) = −∫ f(x1, …, xn) log f(x1, …, xn) dx1 ⋯ dxn
• Conditional differential entropy
  h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = h(X, Y) − h(Y)
Relative entropy and mutual information
• Relative entropy
  D(f||g) = ∫ f log(f/g)
  Not necessarily finite; we let 0 log(0/0) = 0
• Mutual information
  I(X; Y) = ∫ f(x, y) log [f(x, y)/(f(x)f(y))] dx dy
• I(X; Y) ≥ 0, D(f||g) ≥ 0
• Same Venn diagram relationship as in the discrete world: chain rule, conditioning reduces entropy, union bound...
Entropy of multivariate Gaussian
Theorem. Let X1,...,Xn ∼ N (µ, K), µ: mean vector, K: covariance matrix, then
h(X1, X2, …, Xn) = (1/2) log((2πe)^n |K|) bits
Proof: −log f(x) = (n/2) log(2π) + (1/2) log |K| + (1/2)(x − µ)ᵀK⁻¹(x − µ) log e. Taking expectations, E[(X − µ)ᵀK⁻¹(X − µ)] = tr(K⁻¹K) = n, so h(X1, …, Xn) = (1/2) log((2π)^n |K|) + (n/2) log e = (1/2) log((2πe)^n |K|).
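The closed form can be checked against a Monte Carlo estimate of E[−log2 f(X)]. A sketch assuming NumPy; the 2×2 covariance K below is an arbitrary example:

```python
import numpy as np

def h_mvn_bits(K):
    # h(X) = (1/2) log2((2*pi*e)^n |K|) bits
    n = K.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
h_closed = h_mvn_bits(K)

# Monte Carlo: h(X) = E[-log2 f(X)] for X ~ N(0, K), n = 2
rng = np.random.default_rng(1)
x = rng.multivariate_normal(np.zeros(2), K, size=400_000)
quad = np.einsum('ij,jk,ik->i', x, np.linalg.inv(K), x)   # x^T K^{-1} x
log2_f = (-np.log2(2 * np.pi)
          - 0.5 * np.log2(np.linalg.det(K))
          - 0.5 * quad * np.log2(np.e))
h_mc = -np.mean(log2_f)
```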
Mutual information of multivariate Gaussian
• (X, Y) ∼ N(0, K) with
  K = [ σ²   ρσ² ]
      [ ρσ²  σ²  ]
• h(X) = h(Y) = (1/2) log(2πeσ²)
• h(X, Y) = (1/2) log((2πe)² |K|)
• I(X; Y) = h(X) + h(Y) − h(X, Y) = −(1/2) log(1 − ρ²)
• ρ = 1, perfectly correlated, mutual information is ∞!
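A short sketch of how the formula behaves as ρ grows (the function name is ours):

```python
import numpy as np

def gaussian_mi_bits(rho):
    # I(X;Y) = -(1/2) log2(1 - rho^2) for the bivariate Gaussian above
    return -0.5 * np.log2(1.0 - rho**2)

mi_values = [gaussian_mi_bits(r) for r in (0.0, 0.5, 0.9, 0.999)]
# independent (rho = 0) gives I = 0; I blows up as rho -> 1
```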
Multivariate Gaussian is maximum entropy distribution
Theorem. Let X ∈ R^n be a random vector with zero mean and covariance matrix K. Then

h(X) ≤ (1/2) log((2πe)^n |K|),

with equality iff X ∼ N(0, K).

Proof: Let g be any density with zero mean and covariance K, and let φ be the N(0, K) density. Then

0 ≤ D(g||φ) = ∫ g log(g/φ) = −h(g) − ∫ g log φ = −h(g) − ∫ φ log φ = −h(g) + h(φ),

where ∫ g log φ = ∫ φ log φ because log φ is a quadratic function of x and g, φ share the same first and second moments. Hence h(g) ≤ h(φ) = (1/2) log((2πe)^n |K|).
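To illustrate, compare the Gaussian against two other zero-mean densities matched to the same variance σ² = 1, using their closed-form entropies (standard-library sketch):

```python
import math

sigma2 = 1.0

# Gaussian N(0, sigma^2): h = (1/2) log2(2*pi*e*sigma^2)
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)

# Laplace with variance sigma^2 (scale b, var = 2 b^2): h = log2(2*b*e) bits
b = math.sqrt(sigma2 / 2)
h_laplace = math.log2(2 * b * math.e)

# Uniform with variance sigma^2 (width w, var = w^2 / 12): h = log2(w) bits
w = math.sqrt(12 * sigma2)
h_uniform = math.log2(w)
```

As the theorem predicts, the Gaussian has the largest differential entropy of the three.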
Estimation counterpart of Fano's inequality
• Random variable X, estimator X̂
• E(X − X̂)² ≥ (1/(2πe)) e^{2h(X)}
• equality iff X is Gaussian and X̂ is the mean of X
• corollary: given side information Y and estimator X̂(Y),
  E(X − X̂(Y))² ≥ (1/(2πe)) e^{2h(X|Y)}
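For Gaussian X the bound holds with equality, as a quick check shows (h measured in nats here, matching the e^{2h(X)} form; standard-library sketch):

```python
import math

sigma2 = 4.0

# h(X) in nats for X ~ N(0, sigma^2)
h_nats = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# lower bound (1/(2*pi*e)) * e^{2 h(X)} on the mean squared error
bound = math.exp(2 * h_nats) / (2 * math.pi * math.e)

# Xhat = E[X] = 0 achieves E(X - Xhat)^2 = sigma^2, meeting the bound
mse = sigma2
```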
Summary
• discrete random variable ⇒ continuous random variable
• entropy ⇒ differential entropy
• Many things similar: mutual information, AEP
• Some things are different in continuous world: h(X) can be negative, maximum entropy distribution is Gaussian.