
Lecture 17: Differential Entropy

• Differential entropy
• AEP for differential entropy
• Quantization
• Maximum differential entropy
• Estimation counterpart of Fano's inequality

Dr. Yao Xie, ECE587: Information Theory, Duke University

From discrete to continuous world

Differential entropy

• defined for continuous random variables

• differential entropy: h(X) = −∫_S f(x) log f(x) dx

where S is the support set of the probability density function (PDF) f

• sometimes denoted h(f)

Uniform distribution

• f(x) = 1/a, x ∈ [0, a]

• differential entropy: h(X) = ∫_0^a (1/a) log a dx = log a

• for a < 1, log a < 0, differential entropy can be negative! (unlike the discrete world)

• interpretation: volume of support set is 2^{h(X)} = 2^{log a} = a > 0

Gaussian distribution

• f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}, x ∈ R

• Differential entropy:

h(X) = ½ log(2πeσ²) bits

Calculation:
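One way to fill in the calculation (a sketch, working in nats and converting to bits at the end):

```latex
\begin{align*}
h(X) &= -\int_{-\infty}^{\infty} f(x)\ln f(x)\,dx
      = -\int f(x)\left[-\frac{x^2}{2\sigma^2}-\frac{1}{2}\ln(2\pi\sigma^2)\right]dx\\
     &= \frac{\mathbb{E}[X^2]}{2\sigma^2}+\frac{1}{2}\ln(2\pi\sigma^2)
      = \frac{1}{2}+\frac{1}{2}\ln(2\pi\sigma^2)
      = \frac{1}{2}\ln(2\pi e\sigma^2)\ \text{nats}
      = \frac{1}{2}\log_2(2\pi e\sigma^2)\ \text{bits}.
\end{align*}
```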

Some properties

• h(X + c) = h(X)

• h(aX) = h(X) + log |a|

• h(AX) = h(X) + log | det(A)|
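A sketch of why the scaling rule holds: for a ≠ 0 the density of Y = aX is f_Y(y) = f_X(y/a)/|a|, so with the change of variables x = y/a,

```latex
\begin{align*}
h(aX) &= -\int f_Y(y)\log f_Y(y)\,dy
       = -\int \frac{1}{|a|}\,f_X\!\left(\tfrac{y}{a}\right)
         \left[\log f_X\!\left(\tfrac{y}{a}\right)-\log|a|\right]dy\\
      &= -\int f_X(x)\log f_X(x)\,dx + \log|a|
       = h(X) + \log|a|.
\end{align*}
```

The vector case h(AX) = h(X) + log |det(A)| follows the same way, with the Jacobian |det A| in place of |a|.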

AEP for continuous random variable

• Discrete world: for a sequence of i.i.d. random variables

p(X_1, ..., X_n) → 2^{−nH(X)}

• Continuous world: for a sequence of i.i.d. random variables

−(1/n) log f(X_1, ..., X_n) → h(X), in probability.

• Proof follows from the weak law of large numbers
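A minimal numerical illustration of this convergence, assuming an i.i.d. Gaussian source (the distribution and sample sizes are illustrative choices only):

```python
# Monte Carlo illustration of the AEP for a continuous source:
# -(1/n) log f(X_1,...,X_n) should concentrate around h(X).
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)        # h(X) in bits

for n in [10, 100, 1000, 10000]:
    x = rng.normal(0.0, sigma, size=n)
    # log2 of the Gaussian density evaluated at each sample
    log2_f = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) * np.log2(np.e)
    print(f"n={n:6d}  -(1/n) log f = {-log2_f.mean():.4f}   h(X) = {h_true:.4f}")
```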

Size of the typical set

• Discrete world: number of typical sequences

|A_ϵ^{(n)}| ≈ 2^{nH(X)}

• Continuous world: volume of typical set

• Volume of set A:

Vol(A) = ∫_A dx_1 ··· dx_n

• p(A_ϵ^{(n)}) > 1 − ϵ for n sufficiently large

• Vol(A_ϵ^{(n)}) ≤ 2^{n(h(X)+ϵ)} for n sufficiently large

• Vol(A_ϵ^{(n)}) ≥ (1 − ϵ) 2^{n(h(X)−ϵ)} for n sufficiently large

• Proofs: very similar to the discrete world. For the upper bound,

1 = ∫ f(x_1, ··· , x_n) dx_1 ··· dx_n
  ≥ ∫_{A_ϵ^{(n)}} f(x_1, ··· , x_n) dx_1 ··· dx_n
  ≥ ∫_{A_ϵ^{(n)}} 2^{−n(h(X)+ϵ)} dx_1 ··· dx_n
  = 2^{−n(h(X)+ϵ)} Vol(A_ϵ^{(n)})

• Volume of smallest set that contains most of the probability is 2^{nh(X)}

• for n-dimensional space, this means that each dimension has measure 2^{h(X)}

#"! "&

#"! "&

Differential entropy is a the volume of the typical set is a the surface area of the typical set

Mean value theorem (MVT)

If a function f is continuous on the closed interval [a, b], and differentiable on (a, b), then there exists a point c ∈ (a, b) such that

f′(c) = (f(b) − f(a)) / (b − a)

Relation of continuous entropy to discrete entropy

• Discretize a continuous pdf f(x): divide the range of X into bins of length ∆

• MVT: there exists a value x_i within each bin such that f(x_i)∆ = ∫_{i∆}^{(i+1)∆} f(x) dx

[Figure: pdf f(x) vs. x, partitioned into bins of width ∆]

• Define a random variable X^∆ ∈ {x_1, x_2, ...} with probability mass function

p_i = ∫_{i∆}^{(i+1)∆} f(x) dx = f(x_i)∆

• Entropy of X^∆:

H(X^∆) = − Σ_{i=−∞}^{∞} p_i log p_i = − Σ_i ∆ f(x_i) log f(x_i) − log ∆

• If f(x) is Riemann integrable, H(X^∆) + log ∆ → h(X) as ∆ → 0
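A quick numerical check of this limit (a sketch: the standard Gaussian and the ±10 truncation are arbitrary choices, and bin midpoints are used in place of the exact MVT points x_i):

```python
# Check numerically that H(X^Delta) + log(Delta) approaches h(X) as Delta -> 0.
import numpy as np

sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)

def f(x):                                          # Gaussian pdf
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

for delta in [1.0, 0.5, 0.1, 0.01]:
    left_edges = np.arange(-10.0, 10.0, delta)     # bins [i*Delta, (i+1)*Delta)
    p = f(left_edges + delta / 2) * delta          # p_i ~= f(x_i) * Delta
    p = p[p > 0]
    H_delta = -np.sum(p * np.log2(p))              # H(X^Delta)
    print(f"Delta={delta:5.2f}  H(X^Delta) + log2(Delta) = {H_delta + np.log2(delta):.4f}"
          f"   h(X) = {h_true:.4f}")
```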

Implication on quantization

• Let ∆ = 2^{−n}, so − log ∆ = n

• In general, h(X) + n is the average number of bits needed to describe a continuous variable X to n-bit accuracy

• e.g. if X is uniform on [0, 1/8], the first 3 bits after the binary point must be zero, so describing X to n-bit accuracy requires n − 3 bits. This agrees with h(X) + n = n − 3, since h(X) = log(1/8) = −3

• if X ∼ N(0, 100), n + h(X) = n + ½ log(2πe · 100) ≈ n + 5.37
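A small check of the n + 5.37 figure (a sketch: exact Gaussian bin probabilities are computed from the CDF, truncating the support at ±6σ):

```python
# Entropy of X ~ N(0, 100) quantized to bins of width 2^{-n}:
# it should be close to n + h(X) = n + 0.5*log2(2*pi*e*100) ~= n + 5.37 bits.
import numpy as np
from math import erf, sqrt

sigma = 10.0                                                  # X ~ N(0, 100)
Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))              # standard normal CDF

for n in [1, 2, 3, 4]:
    delta = 2.0 ** (-n)
    edges = np.arange(-6 * sigma, 6 * sigma + delta, delta)   # bin edges over +-6 sigma
    p = np.diff([Phi(e / sigma) for e in edges])              # exact bin probabilities
    p = p[p > 0]
    H = -np.sum(p * np.log2(p))
    print(f"n={n}  H(X^Delta) = {H:.3f}   n + 5.37 = {n + 5.37:.2f}")
```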

Joint and conditional differential entropy

• Joint differential entropy:

h(X_1, ..., X_n) = −∫ f(x_1, ..., x_n) log f(x_1, ..., x_n) dx_1 ··· dx_n

• Conditional differential entropy:

h(X|Y) = −∫ f(x, y) log f(x|y) dx dy = h(X,Y) − h(Y)

Relative entropy and mutual information

• Relative entropy:

D(f||g) = ∫ f log (f/g)

Not necessarily finite. Convention: 0 log(0/0) = 0

• Mutual information:

I(X; Y) = ∫ f(x, y) log [f(x, y) / (f(x)f(y))] dx dy

• I(X; Y) ≥ 0, D(f||g) ≥ 0

• The same relationships as in the discrete world hold: chain rule, conditioning reduces entropy, union bound, ...

Entropy of multivariate Gaussian

Theorem. Let X_1, ..., X_n ∼ N(µ, K), where µ is the mean vector and K the covariance matrix. Then

h(X_1, X_2, ..., X_n) = ½ log((2πe)^n |K|) bits

Proof:
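A sketch of the proof, in nats (divide by ln 2 for bits), using f(x) = (2π)^{−n/2} |K|^{−1/2} exp(−½ (x−µ)ᵀ K⁻¹ (x−µ)):

```latex
\begin{align*}
h(X) &= -\mathbb{E}[\ln f(X)]
      = \tfrac{1}{2}\,\mathbb{E}\!\left[(X-\mu)^\top K^{-1}(X-\mu)\right]
        + \tfrac{1}{2}\ln\!\left((2\pi)^n|K|\right)\\
     &= \tfrac{1}{2}\,\mathrm{tr}\!\left(K^{-1}\,\mathbb{E}\!\left[(X-\mu)(X-\mu)^\top\right]\right)
        + \tfrac{1}{2}\ln\!\left((2\pi)^n|K|\right)
      = \tfrac{n}{2} + \tfrac{1}{2}\ln\!\left((2\pi)^n|K|\right)
      = \tfrac{1}{2}\ln\!\left((2\pi e)^n|K|\right).
\end{align*}
```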

Mutual information of multivariate Gaussian

• (X, Y) ∼ N(0, K) with

K = [ σ²    ρσ² ]
    [ ρσ²   σ²  ]

• h(X) = h(Y) = ½ log(2πeσ²)

• h(X, Y) = ½ log((2πe)² |K|)

• I(X; Y) = h(X) + h(Y) − h(X, Y) = −½ log(1 − ρ²)

• ρ = 1, perfectly correlated, mutual information is ∞!
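A quick numerical sanity check of I(X;Y) = −½ log(1 − ρ²), assuming arbitrary illustrative values of σ and ρ and plugging the closed-form differential entropies into I(X;Y) = h(X) + h(Y) − h(X,Y):

```python
# Verify I(X;Y) = -0.5*log2(1 - rho^2) for the bivariate Gaussian above.
import numpy as np

sigma, rho = 1.5, 0.9                                  # illustrative values
K = np.array([[sigma**2, rho * sigma**2],
              [rho * sigma**2, sigma**2]])

h_X = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)       # = h(Y)
h_XY = 0.5 * np.log2((2 * np.pi * np.e) ** 2 * np.linalg.det(K))

I = 2 * h_X - h_XY
print(I, -0.5 * np.log2(1 - rho**2))                   # both ~= 1.198 bits
```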

Multivariate Gaussian is maximum entropy distribution

Theorem. Let X ∈ R^n be a random vector with zero mean and covariance matrix K. Then

h(X) ≤ ½ log((2πe)^n |K|)

Proof:
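A sketch of the standard argument: let φ_K denote the N(0, K) density and let f be any density with zero mean and covariance K. Non-negativity of relative entropy gives

```latex
\begin{align*}
0 \le D(f\,\|\,\varphi_K)
  &= \int f \log\frac{f}{\varphi_K}
   = -h(f) - \int f \log \varphi_K\\
  &= -h(f) - \int \varphi_K \log \varphi_K
   = -h(f) + h(\varphi_K),
\end{align*}
```

where ∫ f log φ_K = ∫ φ_K log φ_K because log φ_K(x) is a quadratic form in x, so its expectation depends only on the (matching) second moments. Hence h(f) ≤ h(φ_K) = ½ log((2πe)^n |K|).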

Estimation counterpart of Fano's inequality

• Random variable X, estimator X̂:

E(X − X̂)² ≥ (1/(2πe)) e^{2h(X)}

• equality iff X is Gaussian and X̂ is the mean of X

• corollary: given side information Y and estimator X̂(Y),

E(X − X̂(Y))² ≥ (1/(2πe)) e^{2h(X|Y)}
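A sketch of the argument behind the first bound (with h(X) in nats, matching the e^{2h(X)} form): with no side information the estimator X̂ is a constant, and the Gaussian maximizes entropy among densities with a given variance, so h(X) ≤ ½ ln(2πe Var(X)) and

```latex
\[
\mathbb{E}(X-\hat{X})^2 \;\ge\; \mathbb{E}\big(X-\mathbb{E}[X]\big)^2 \;=\; \mathrm{Var}(X)
\;\ge\; \frac{1}{2\pi e}\,e^{2h(X)} .
\]
```

The corollary follows by applying this bound conditionally on Y = y, then averaging over Y and using Jensen's inequality.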

Summary

discrete random variable ⇒ continuous random variable
entropy ⇒ differential entropy

• Many things are similar: mutual information, AEP

• Some things are different in the continuous world: h(X) can be negative, and the maximum entropy distribution is Gaussian.
