Lecture 17: Differential Entropy
• Differential entropy
• AEP for differential entropy
• Quantization
• Maximum differential entropy
• Estimation counterpart of Fano's inequality
Dr. Yao Xie, ECE587, Information Theory, Duke University

From discrete to continuous world
Differential entropy
• defined for a continuous random variable with density f
• differential entropy:
  h(X) = −∫_S f(x) log f(x) dx
  where S is the support set of the probability density function (PDF) f
• sometimes denoted h(f)
Uniform distribution
• f(x) = 1/a, x ∈ [0, a]
• differential entropy:
  h(X) = −∫_0^a (1/a) log(1/a) dx = log a bits
• for a < 1, log a < 0: differential entropy can be negative! (unlike the discrete world)
• interpretation: the volume of the support set is 2^{h(X)} = 2^{log a} = a > 0
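The log a formula, including the negative value for a < 1, can be checked numerically from the definition. A small sketch assuming NumPy; the Riemann-sum helper `diff_entropy_bits` is ours, not a library routine:

```python
import numpy as np

def diff_entropy_bits(f, lo, hi, n=100_000):
    """Approximate h(X) = -∫ f(x) log2 f(x) dx by a midpoint Riemann sum."""
    dx = (hi - lo) / n
    x = lo + dx * (np.arange(n) + 0.5)   # bin midpoints
    fx = f(x)
    fx = fx[fx > 0]                      # restrict to the support S
    return -np.sum(fx * np.log2(fx)) * dx

a = 2.0
h_unif = diff_entropy_bits(lambda x: np.full_like(x, 1.0 / a), 0.0, a)
# ≈ log2(2) = 1 bit

a_small = 0.5
h_neg = diff_entropy_bits(lambda x: np.full_like(x, 1.0 / a_small), 0.0, a_small)
# ≈ log2(0.5) = -1: negative differential entropy
```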
Normal distribution
• f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}, x ∈ R
• Differential entropy:
  h(X) = (1/2) log(2πeσ²) bits
Calculation: h(X) = −E[log f(X)] = E[X²/(2σ²)] log e + (1/2) log(2πσ²) = (1/2) log e + (1/2) log(2πσ²) = (1/2) log(2πeσ²).
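The closed form can be cross-checked numerically. A sketch assuming NumPy; `gaussian_entropy_numeric` is an illustrative helper, not a library function:

```python
import numpy as np

def gaussian_entropy_bits(sigma):
    # closed form: h(X) = (1/2) log2(2*pi*e*sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def gaussian_entropy_numeric(sigma, lim=12.0, n=400_000):
    # Riemann sum of -f log2 f over [-lim*sigma, lim*sigma]
    x = np.linspace(-lim * sigma, lim * sigma, n)
    dx = x[1] - x[0]
    f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return -np.sum(f * np.log2(f)) * dx

h_closed = gaussian_entropy_bits(1.0)    # ≈ 2.047 bits
h_numeric = gaussian_entropy_numeric(1.0)
```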
Some properties
• h(X + c) = h(X)
• h(aX) = h(X) + log |a|
• h(AX) = h(X) + log | det(A)|
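The scaling identity h(aX) = h(X) + log |a| can be sanity-checked with the Gaussian closed form, since aX ∼ N(0, a²σ²) when X ∼ N(0, σ²). A small sketch; `h_gauss` is our helper:

```python
import numpy as np

def h_gauss(sigma):
    # differential entropy (bits) of N(0, sigma^2)
    return 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

a, sigma = -3.0, 1.5
lhs = h_gauss(abs(a) * sigma)            # h(aX), since aX ~ N(0, a^2 sigma^2)
rhs = h_gauss(sigma) + np.log2(abs(a))   # h(X) + log|a|
```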
AEP for continuous random variable
• Discrete world: for a sequence of i.i.d. random variables,
  p(X1, …, Xn) → 2^{−nH(X)}
• Continuous world: for a sequence of i.i.d. random variables,
  −(1/n) log f(X1, …, Xn) → h(X), in probability
• Proof follows from the weak law of large numbers
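The convergence is easy to observe in simulation for X ∼ N(0, 1). A sketch assuming NumPy; the seed and sample size are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)   # i.i.d. N(0, 1)

# log2 f(Xi) for the standard normal density
log2_f = -0.5 * np.log2(2 * np.pi) - (x**2 / 2) * np.log2(np.e)

# -(1/n) log2 f(X1,...,Xn) = -(1/n) sum_i log2 f(Xi) by independence
aep_estimate = -np.mean(log2_f)
h_true = 0.5 * np.log2(2 * np.pi * np.e)   # ≈ 2.047 bits
```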
Size of typical set
• Discrete world: number of typical sequences
  |A_ϵ^{(n)}| ≈ 2^{nH(X)}
• Continuous world: volume of typical set
• Volume of set A:
  Vol(A) = ∫_A dx1 ⋯ dxn
• p(A_ϵ^{(n)}) > 1 − ϵ for n sufficiently large
• Vol(A_ϵ^{(n)}) ≤ 2^{n(h(X)+ϵ)} for all n
• Vol(A_ϵ^{(n)}) ≥ (1 − ϵ) 2^{n(h(X)−ϵ)} for n sufficiently large
• Proofs: very similar to the discrete world. For the upper bound:
  1 = ∫ f(x1, …, xn) dx1 ⋯ dxn
    ≥ ∫_{A_ϵ^{(n)}} f(x1, …, xn) dx1 ⋯ dxn
    ≥ 2^{−n(h(X)+ϵ)} ∫_{A_ϵ^{(n)}} dx1 ⋯ dxn
    = 2^{−n(h(X)+ϵ)} Vol(A_ϵ^{(n)})
• The volume of the smallest set that contains most of the probability is ≈ 2^{nh(X)}
• for an n-dimensional space, this means that each dimension has measure ≈ 2^{h(X)}
• Differential entropy corresponds to the volume of the typical set; Fisher information corresponds to the surface area of the typical set
Mean value theorem (MVT)
If a function f is continuous on the closed interval [a, b], and differentiable on (a, b), then there exists a point c ∈ (a, b) such that
f′(c) = (f(b) − f(a)) / (b − a)
Relation of continuous entropy to discrete entropy

• Discretize a continuous pdf f(x): divide the range of X into bins of length ∆
• MVT: there exists a value x_i within each bin such that
  f(x_i)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx
[Figure: pdf f(x) partitioned into bins of width ∆ along the x-axis]
• Define a discrete random variable X^∆ ∈ {x1, x2, …} with probability mass function
  p_i = ∫_{i∆}^{(i+1)∆} f(x) dx = f(x_i)∆
• Entropy of X^∆:
  H(X^∆) = −∑_{i=−∞}^{∞} p_i log p_i = −∑_i ∆ f(x_i) log f(x_i) − log ∆
• If f(x) is Riemann integrable, H(X^∆) + log ∆ → h(X) as ∆ → 0
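This convergence can be observed numerically for X ∼ N(0, 1), computing bin probabilities exactly from the Gaussian CDF. A standard-library sketch; `quantized_gap` is our illustrative helper:

```python
import math

def Phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def quantized_gap(delta, lim=12.0):
    """H(X^Δ) + log2 Δ for X ~ N(0,1), bins [iΔ, (i+1)Δ) covering [-lim, lim]."""
    H = 0.0
    n_bins = int(round(2 * lim / delta))
    for i in range(n_bins):
        left = -lim + i * delta
        p = Phi(left + delta) - Phi(left)
        if p > 0:
            H -= p * math.log2(p)
    return H + math.log2(delta)

h_true = 0.5 * math.log2(2 * math.pi * math.e)     # ≈ 2.047 bits
gaps = {d: quantized_gap(d) for d in (0.5, 0.1, 0.02)}
# gaps[d] approaches h_true as d shrinks
```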
Implication on quantization
• Let ∆ = 2^{−n}; then −log ∆ = n
• In general, h(X) + n is the average number of bits needed to describe a continuous variable X to n-bit accuracy
• e.g. if X is uniform on [0, 1/8], the first 3 bits after the binary point must be zero, so describing X to n-bit accuracy requires n − 3 bits; this agrees with h(X) = log(1/8) = −3
• if X ∼ N(0, 100), n + h(X) = n + (1/2) log(2πe · 100) ≈ n + 5.37
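Both numerical examples above can be verified directly (standard-library sketch; n = 8 is an arbitrary illustrative accuracy):

```python
import math

n = 8  # target accuracy: n-bit (illustrative choice)

# Uniform on [0, 1/8]: h(X) = log2(1/8) = -3, so n-bit accuracy needs n - 3 bits
h_unif = math.log2(1 / 8)
bits_unif = n + h_unif

# X ~ N(0, 100): h(X) = 0.5 * log2(2*pi*e*100) ≈ 5.37, so about n + 5.37 bits
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * 100)
bits_gauss = n + h_gauss
```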
Joint and conditional differential entropy
• Joint differential entropy
  h(X1, …, Xn) = −∫ f(x1, …, xn) log f(x1, …, xn) dx1 ⋯ dxn
• Conditional differential entropy
  h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = h(X, Y) − h(Y)
Relative entropy and mutual information
• Relative entropy
  D(f||g) = ∫ f log(f/g)
  Not necessarily finite; we let 0 log(0/0) = 0
• Mutual information
  I(X; Y) = ∫ f(x, y) log [f(x, y)/(f(x)f(y))] dx dy
• I(X; Y) ≥ 0, D(f||g) ≥ 0
• Same Venn diagram relationship as in the discrete world: chain rule, conditioning reduces entropy, union bound...
Entropy of multivariate Gaussian
Theorem. Let X1,...,Xn ∼ N (µ, K), µ: mean vector, K: covariance matrix, then
h(X1, X2, …, Xn) = (1/2) log((2πe)^n |K|) bits
Proof: −log f(x) = (n/2) log(2π) + (1/2) log |K| + (1/2)(x − µ)ᵀK⁻¹(x − µ) log e. Taking expectations, E[(X − µ)ᵀK⁻¹(X − µ)] = tr(K⁻¹K) = n, so h(X1, …, Xn) = (1/2) log((2π)^n |K|) + (n/2) log e = (1/2) log((2πe)^n |K|).
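The closed form can be checked against a Monte Carlo estimate of E[−log2 f(X)]. A sketch assuming NumPy; the 2×2 covariance K below is an arbitrary example:

```python
import numpy as np

def h_mvn_bits(K):
    # h(X) = (1/2) log2((2*pi*e)^n |K|) bits
    n = K.shape[0]
    return 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
h_closed = h_mvn_bits(K)

# Monte Carlo: h(X) = E[-log2 f(X)] for X ~ N(0, K), n = 2
rng = np.random.default_rng(1)
x = rng.multivariate_normal(np.zeros(2), K, size=400_000)
quad = np.einsum('ij,jk,ik->i', x, np.linalg.inv(K), x)   # x^T K^{-1} x
log2_f = (-np.log2(2 * np.pi)
          - 0.5 * np.log2(np.linalg.det(K))
          - 0.5 * quad * np.log2(np.e))
h_mc = -np.mean(log2_f)
```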
Mutual information of multivariate Gaussian
• (X, Y) ∼ N(0, K) with
  K = [ σ²   ρσ² ]
      [ ρσ²  σ²  ]
• h(X) = h(Y) = (1/2) log(2πeσ²)
• h(X, Y) = (1/2) log((2πe)² |K|)
• I(X; Y) = h(X) + h(Y) − h(X, Y) = −(1/2) log(1 − ρ²)
• ρ = 1, perfectly correlated, mutual information is ∞!
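A short sketch of how the formula behaves as ρ grows (the function name is ours):

```python
import numpy as np

def gaussian_mi_bits(rho):
    # I(X;Y) = -(1/2) log2(1 - rho^2) for the bivariate Gaussian above
    return -0.5 * np.log2(1.0 - rho**2)

mi_values = [gaussian_mi_bits(r) for r in (0.0, 0.5, 0.9, 0.999)]
# independent (rho = 0) gives I = 0; I blows up as rho -> 1
```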
Multivariate Gaussian is maximum entropy distribution
Theorem. Let X ∈ R^n be a random vector with zero mean and covariance matrix K. Then

h(X) ≤ (1/2) log((2πe)^n |K|),

with equality iff X ∼ N(0, K).

Proof: Let g be any density with zero mean and covariance K, and let φ be the N(0, K) density. Then

0 ≤ D(g||φ) = ∫ g log(g/φ) = −h(g) − ∫ g log φ = −h(g) − ∫ φ log φ = −h(g) + h(φ),

where ∫ g log φ = ∫ φ log φ because log φ is a quadratic function of x and g, φ share the same first and second moments. Hence h(g) ≤ h(φ) = (1/2) log((2πe)^n |K|).
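To illustrate, compare the Gaussian against two other zero-mean densities matched to the same variance σ² = 1, using their closed-form entropies (standard-library sketch):

```python
import math

sigma2 = 1.0

# Gaussian N(0, sigma^2): h = (1/2) log2(2*pi*e*sigma^2)
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)

# Laplace with variance sigma^2 (scale b, var = 2 b^2): h = log2(2*b*e) bits
b = math.sqrt(sigma2 / 2)
h_laplace = math.log2(2 * b * math.e)

# Uniform with variance sigma^2 (width w, var = w^2 / 12): h = log2(w) bits
w = math.sqrt(12 * sigma2)
h_uniform = math.log2(w)
```

As the theorem predicts, the Gaussian has the largest differential entropy of the three.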
Estimation counterpart of Fano's inequality
• Random variable X, estimator X̂
• E(X − X̂)² ≥ (1/(2πe)) e^{2h(X)}
• equality iff X is Gaussian and X̂ is the mean of X
• corollary: given side information Y and estimator X̂(Y),
  E(X − X̂(Y))² ≥ (1/(2πe)) e^{2h(X|Y)}
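For Gaussian X the bound holds with equality, as a quick check shows (h measured in nats here, matching the e^{2h(X)} form; standard-library sketch):

```python
import math

sigma2 = 4.0

# h(X) in nats for X ~ N(0, sigma^2)
h_nats = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# lower bound (1/(2*pi*e)) * e^{2 h(X)} on the mean squared error
bound = math.exp(2 * h_nats) / (2 * math.pi * math.e)

# Xhat = E[X] = 0 achieves E(X - Xhat)^2 = sigma^2, meeting the bound
mse = sigma2
```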
Summary
• discrete random variable ⇒ continuous random variable
• entropy ⇒ differential entropy
• Many things similar: mutual information, AEP
• Some things are different in continuous world: h(X) can be negative, maximum entropy distribution is Gaussian.