Chapter 8: Differential Entropy

Chapter 8 outline

• Motivation
• Definitions
• Relation to discrete entropy
• Joint and conditional differential entropy
• Relative entropy and mutual information
• Properties
• AEP for Continuous Random Variables

Motivation

• Our goal is to determine the capacity of an AWGN channel

[Figure: AWGN channel model, Y = hX + N, with Gaussian noise N ~ N(0, P_N); a wireless channel with fading is sketched as the channel gain h varying over time.]

Motivation

• Our goal is to determine the capacity of an AWGN channel


$C = \frac{1}{2}\log\frac{|h|^2 P + P_N}{P_N} = \frac{1}{2}\log(1 + \mathrm{SNR})$ (bits/channel use)
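A quick numerical sketch of this formula (my own illustration, not course code; the parameter names P, P_N, and h below are assumptions):

```python
import numpy as np

def awgn_capacity(P, P_N, h=1.0):
    """C = 0.5 * log2(1 + |h|^2 * P / P_N) in bits per channel use."""
    snr = (abs(h) ** 2) * P / P_N
    return 0.5 * np.log2(1.0 + snr)

# SNR = 10 (10 dB) gives about 1.73 bits per channel use.
print(awgn_capacity(P=1.0, P_N=0.1))
```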

Motivation

• Need to define entropy and mutual information between CONTINUOUS random variables

• Can you guess?

• Discrete X, p(x): $H(X) = -\sum_{x} p(x)\log p(x)$

• Continuous X, f(x): $h(X) = -\int f(x)\log f(x)\,dx$ (a quick numerical check follows below)
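A small sanity check on the two definitions, assuming example distributions of my own choosing (not from the slides): discrete entropy as a finite sum, and the differential entropy of N(0,1) by numerical integration.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def discrete_entropy_bits(p):
    """H(X) = -sum_x p(x) log2 p(x), with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def differential_entropy_gaussian_bits(sigma):
    """h(X) = -integral f(x) log2 f(x) dx, approximated by quadrature."""
    f = lambda x: norm.pdf(x, scale=sigma)
    val, _ = integrate.quad(lambda x: -f(x) * np.log2(f(x)), -12 * sigma, 12 * sigma)
    return val

print(discrete_entropy_bits([0.5, 0.25, 0.25]))      # 1.5 bits
print(differential_entropy_gaussian_bits(1.0))       # ~2.047 bits = 0.5*log2(2*pi*e)
```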

Definitions - densities

Properties - densities


Quantized random variables

an interpretation of the differential entropy: It is the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence low entropy implies that the random variable is confined to a small effective volume and high entropy indicates that the random variable is widely dispersed.
Note. Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set. We discuss Fisher information in more detail in Sections 11.10 and 17.8.

8.3 RELATION OF DIFFERENTIAL ENTROPY TO DISCRETE ENTROPY

Consider a random variable X with density f(x) illustrated in Figure 8.1. Suppose that we divide the range of X into bins of length $\Delta$. Let us assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value $x_i$ within each bin such that

$f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta} f(x)\,dx.$ (8.23)

Consider the quantized random variable $X^\Delta$, which is defined by

$X^\Delta = x_i \quad \text{if } i\Delta \le X < (i+1)\Delta.$ (8.24)

FIGURE 8.1. Quantization of a continuous random variable. [Sketch: density f(x) over x, partitioned into bins of width $\Delta$.]
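The point of this construction is the relation $H(X^\Delta) + \log\Delta \to h(X)$ as $\Delta \to 0$ (equivalently (8.84) in the chapter summary, $H([X]_{2^{-n}}) \approx h(X) + n$). A simulation sketch for a standard Gaussian, with a binning grid of my own choosing:

```python
import numpy as np
from scipy.stats import norm

def quantized_entropy_bits(delta, sigma=1.0, span=12.0):
    """Entropy of X^Delta where X ~ N(0, sigma^2) is binned into cells of width delta."""
    edges = np.arange(-span * sigma, span * sigma + delta, delta)
    probs = np.diff(norm.cdf(edges, scale=sigma))   # P(i*delta <= X < (i+1)*delta)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

h_exact = 0.5 * np.log2(2 * np.pi * np.e)           # h(N(0,1)) ~ 2.047 bits
for n_bits in [1, 2, 4, 8]:
    delta = 2.0 ** (-n_bits)
    approx = quantized_entropy_bits(delta) + np.log2(delta)
    print(n_bits, round(approx, 4), round(h_exact, 4))   # approx converges to h_exact
```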


Differential entropy - definition

Examples

[Figure: a density f(x) plotted against x, with support from a to b.]
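If the sketched density is the uniform distribution on [a, b] (an assumption on my part, though it is the standard first example), then $h(X) = \log_2(b-a)$ bits, which can be zero or negative when $b - a \le 1$. A quick check, with interval widths chosen by me:

```python
import numpy as np

# Uniform on [a, b]: f(x) = 1/(b - a), so h(X) = -E[log2 f(X)] = log2(b - a).
for width in [8.0, 1.0, 0.125]:
    print(width, np.log2(width))   # 3.0, 0.0, -3.0 bits
```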

Examples
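Assuming the second worked example is the Gaussian N(0, σ²), for which $h(X) = \frac{1}{2}\log_2(2\pi e\sigma^2)$ bits (cf. (8.85) in the chapter summary), a Monte Carlo estimate of $E[-\log_2 f(X)]$ agrees with the closed form. My own sketch:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(0.0, sigma, size=500_000)
monte_carlo = np.mean(-np.log2(norm.pdf(x, scale=sigma)))
closed_form = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)
print(monte_carlo, closed_form)    # both ~3.047 bits
```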

Differential entropy - the good the bad and the ugly


Differential entropy - multiple RVs

Differential entropy of a multi-variate Gaussian
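The formula here is $h(\mathcal{N}_n(\mu, K)) = \frac{1}{2}\log_2\big((2\pi e)^n |K|\big)$ bits (see (8.86) in the summary). A small sketch with a covariance matrix of my own choosing, which also shows that the joint entropy is at most the sum of the marginal entropies:

```python
import numpy as np

def h_gaussian_bits(K):
    """h(N(mu, K)) = 0.5 * log2((2*pi*e)^n * det K); independent of the mean mu."""
    K = np.atleast_2d(K)
    n = K.shape[0]
    return 0.5 * np.log2(((2 * np.pi * np.e) ** n) * np.linalg.det(K))

K = np.array([[2.0, 0.5],
              [0.5, 1.0]])
print(h_gaussian_bits(K))                                       # joint entropy, ~4.50 bits
print(h_gaussian_bits(K[:1, :1]) + h_gaussian_bits(K[1:, 1:]))  # sum of marginals, ~4.59 bits
```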

Parallels with discrete entropy....

Proof (of the inequality $\Pr(X = X') \ge 2^{-H(p)-D(p\|r)}$ for independent $X \sim p$ and $X' \sim r$): We have

$2^{-H(p)-D(p\|r)} = 2^{\sum p(x)\log p(x) + \sum p(x)\log\frac{r(x)}{p(x)}}$ (2.151)
$= 2^{\sum p(x)\log r(x)}$ (2.152)
$\le \sum p(x)\, 2^{\log r(x)}$ (2.153)
$= \sum p(x) r(x)$ (2.154)
$= \Pr(X = X'),$ (2.155)

where the inequality follows from Jensen's inequality and the convexity of the function $f(y) = 2^y$.

The following telegraphic summary omits qualifying conditions.

SUMMARY

Definition The entropy H(X) of a discrete random variable X is defined by

$H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x).$ (2.156)

Properties of H
1. $H(X) \ge 0$.
2. $H_b(X) = (\log_b a) H_a(X)$.
3. (Conditioning reduces entropy) For any two random variables, X and Y, we have $H(X|Y) \le H(X)$ (2.157), with equality if and only if X and Y are independent.
4. $H(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} H(X_i)$, with equality if and only if the $X_i$ are independent.
5. $H(X) \le \log|\mathcal{X}|$, with equality if and only if X is distributed uniformly over $\mathcal{X}$.
6. $H(p)$ is concave in p.

Parallels with discrete entropy....

Definition The relative entropy $D(p\|q)$ of the probability mass function p with respect to the probability mass function q is defined by

$D(p\|q) = \sum_x p(x)\log\frac{p(x)}{q(x)}.$ (2.158)

Definition The mutual information between two random variables X and Y is defined as

$I(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}.$ (2.159)

Alternative expressions
$H(X) = E_p \log\frac{1}{p(X)},$ (2.160)
$H(X,Y) = E_p \log\frac{1}{p(X,Y)},$ (2.161)
$H(X|Y) = E_p \log\frac{1}{p(X|Y)},$ (2.162)
$I(X;Y) = E_p \log\frac{p(X,Y)}{p(X)p(Y)},$ (2.163)

$D(p\|q) = E_p \log\frac{p(X)}{q(X)}.$ (2.164)

Properties of D and I
1. $I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)$.
2. $D(p\|q) \ge 0$ with equality if and only if $p(x) = q(x)$ for all $x \in \mathcal{X}$.
3. $I(X;Y) = D(p(x,y)\|p(x)p(y)) \ge 0$, with equality if and only if $p(x,y) = p(x)p(y)$ (i.e., X and Y are independent).
4. If $|\mathcal{X}| = m$, and u is the uniform distribution over $\mathcal{X}$, then $D(p\|u) = \log m - H(p)$.
5. $D(p\|q)$ is convex in the pair (p, q).

Chain rules
Entropy: $H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i|X_{i-1}, \ldots, X_1)$.
Mutual information: $I(X_1, X_2, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y|X_1, X_2, \ldots, X_{i-1})$.

Fano’s inequality. Let Pe Pr X(Y)ˆ X . Then = { ̸= }

H(Pe) Pe log H(X Y). (2.166) + |X| ≥ |

Inequality. If X and X′ are independent and identically distributed, then

H(X) Pr(X X′) 2− , (2.167) = ≥

PROBLEMS

2.1 Coin flips. A fair coin is flipped until the first head occurs. Let X denote the number of flips required. (a) Find the entropy H(X) in bits. The following expressions may be useful:

∞ 1 ∞ r rn , nrn . = 1 r = (1 r)2 n 0 n 0 != − != − (b) A random variable X is drawn according to this distribution. Find an “efficient” sequence of yes–no questions of the form, 42 ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Definition The relative entropy D(p q) of the probability mass function p with respect to the probability∥ mass function q is defined by p(x) D(p q) p(x) log . (2.158) ∥ = x q(x) Definition The mutual information! between two random variables X and Y is defined as p(x, y) I(X Y) p(x, y) log . (2.159) ; = p(x)p(y) x y !∈X !∈Y Alternative expressions 1 H(X) Ep log , (2.160) = p(X) 1 H(X,Y) Ep log , (2.161) = p(X, Y) 1 H(X Y) Ep log , (2.162) | = p(X Y) | p(X, Y) Parallels withI(X Y)discreteEp log entropy...., (2.163) ; = p(X)p(Y) p(X) D(p q) Ep log . (2.164) || = q(X) Properties of D and I 1. I(X Y) H(X) H(X Y) H(Y) H(Y X) H(X) H(Y); H(X,Y).= − | = − | = + − .... 2. D(p q) 0 with equality if and only if p(x) q(x), for all x . ∥ ≥ = ∈ X 3. I(X Y) D(p(x, y) p(x)p(y)) 0, with equality if and only if p(x,; y) =p(x)p(y) (i.e.,|| X and Y≥are independent). = .... 4. If m, and u is the uniform distribution over , then D(p u)| Xlog|=m H(p). X ∥ = − .... 5. D(p q) is convex in the pair (p, q). || .... Chain rules n Entropy: H(X1,X2,...,Xn) i 1 H(Xi Xi 1,...,X1). = | − Mutual information: = n " I(X1,X2,...,Xn Y) i 1 I(Xi Y X1,X2,...,Xi 1). ; = = ; | − "

Parallels with discrete entropy....

Relative entropy: $D(p(x,y)\|q(x,y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x))$.

Jensen's inequality. If f is a convex function, then $Ef(X) \ge f(EX)$.

Log sum inequality. For n positive numbers, $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$,

$\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right)\log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$ (2.165)

with equality if and only if $\frac{a_i}{b_i}$ = constant.

Data-processing inequality. If $X \to Y \to Z$ forms a Markov chain, then $I(X;Y) \ge I(X;Z)$.

Sufficient statistic. T(X) is sufficient relative to $\{f_\theta(x)\}$ if and only if $I(\theta; X) = I(\theta; T(X))$ for all distributions on θ.

Fano’s inequality. Let Pe Pr X(Y)ˆ X . Then = { ̸= }

H(Pe) Pe log H(X Y). (2.166) + |X| ≥ |

Inequality. If X and X′ are independent and identically distributed, then $\Pr(X = X') \ge 2^{-H(X)}.$ (2.167)

Differential entropy - the good the bad and the ugly

Relative entropy and mutual information

Properties

A quick example

• Find the mutual information between two correlated Gaussian random variables with correlation coefficient ρ

• What is I(X;Y)?
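For jointly Gaussian X and Y with unit variances and correlation ρ, the answer works out to $I(X;Y) = -\frac{1}{2}\log(1-\rho^2)$. A closed-form check via $I(X;Y) = h(X) + h(Y) - h(X,Y)$ and the Gaussian entropy formula; a sketch under the unit-variance assumption:

```python
import numpy as np

def h_gaussian_bits(K):
    """h(N(mu, K)) in bits for covariance matrix K."""
    K = np.atleast_2d(K)
    n = K.shape[0]
    return 0.5 * np.log2(((2 * np.pi * np.e) ** n) * np.linalg.det(K))

rho = 0.9
K = np.array([[1.0, rho],
              [rho, 1.0]])
I_xy = 2 * h_gaussian_bits(np.array([[1.0]])) - h_gaussian_bits(K)
print(I_xy, -0.5 * np.log2(1 - rho ** 2))   # both ~1.198 bits
```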

More properties of differential entropy


Examples of changes in variables
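One example presumably worked here is scaling: $h(aX) = h(X) + \log|a|$ (see (8.90) in the summary). For a Gaussian, scaling by a multiplies σ by |a|, so the identity can be checked in closed form; a sketch of my own:

```python
import numpy as np

h_gauss = lambda sigma: 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)   # bits

a, sigma = 3.0, 1.0
print(h_gauss(abs(a) * sigma))             # h(aX), ~3.632 bits
print(h_gauss(sigma) + np.log2(abs(a)))    # h(X) + log2|a|, same value
```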

Concavity and convexity

• Same as for discrete entropy and mutual information....

The AEP for continuous RVs

Here it turns out that $p(X_1, X_2, \ldots, X_n)$ is close to $2^{-nH}$ with high probability. We summarize this by saying, "Almost all events are almost equally surprising." This is a way of saying that

$\Pr\{(X_1, X_2, \ldots, X_n) : p(X_1, X_2, \ldots, X_n) = 2^{-n(H \pm \epsilon)}\} \approx 1$ (3.1)

if $X_1, X_2, \ldots, X_n$ are i.i.d. $\sim p(x)$.

In the example just given, where $p(X_1, X_2, \ldots, X_n) = p^{\sum X_i} q^{n - \sum X_i}$, we are simply saying that the number of 1's in the sequence is close to np (with high probability), and all such sequences have (roughly) the same probability $2^{-nH(p)}$. We use the idea of convergence in probability, defined as follows:

Definition (Convergence of random variables). Given a sequence of random variables, $X_1, X_2, \ldots$, we say that the sequence $X_1, X_2, \ldots$ converges to a random variable X:

1. In probability if for every $\epsilon > 0$, $\Pr\{|X_n - X| > \epsilon\} \to 0$
2. In mean square if $E(X_n - X)^2 \to 0$
3. With probability 1 (also called almost surely) if $\Pr\{\lim_{n\to\infty} X_n = X\} = 1$

3.1 ASYMPTOTIC EQUIPARTITION PROPERTY THEOREM

• The AEP for discrete RVs said.....

The asymptotic equipartition property is formalized in the following theorem.

Theorem 3.1.1 (AEP) If $X_1, X_2, \ldots$ are i.i.d. $\sim p(x)$, then

$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) \to H(X)$ in probability. (3.2)

Proof: Functions of independent random variables are also independent random variables. Thus, since the $X_i$ are i.i.d., so are $\log p(X_i)$. Hence, by the weak law of large numbers,

$-\frac{1}{n}\log p(X_1, X_2, \ldots, X_n) = -\frac{1}{n}\sum_i \log p(X_i)$ (3.3)
$\to -E\log p(X)$ in probability (3.4)
$= H(X),$ (3.5)

which proves the theorem.

• The AEP for continuous RVs says.....
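The continuous analog (filled in on the slide) states that $-\frac{1}{n}\log f(X_1, \ldots, X_n) \to h(X)$ in probability for i.i.d. $X_i \sim f$, consistent with (8.82) in the chapter summary. A simulation sketch for a standard Gaussian, my own illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
h_true = 0.5 * np.log2(2 * np.pi * np.e)     # h(N(0,1)) in bits

for n in [10, 100, 10_000]:
    x = rng.normal(size=n)
    # For i.i.d. samples, -(1/n) log2 f(x^n) is the mean of -log2 f(x_i).
    empirical = -np.mean(np.log2(norm.pdf(x)))
    print(n, round(empirical, 3), round(h_true, 3))
```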

Typical sets

• One of the points of the AEP is to define typical sets.

• Typical set for discrete RVs...

• Typical set of continuous RVs....

Typical sets and volumes
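The volume statement is $\mathrm{Vol}(A_\epsilon^{(n)}) \doteq 2^{nh(X)}$ (see (8.83) in the summary), so $2^{h(X)}$ acts as an effective support width per dimension. A tiny closed-form sketch with examples of my choosing:

```python
import numpy as np

h_gauss_1 = 0.5 * np.log2(2 * np.pi * np.e)   # h(N(0,1)) ~ 2.047 bits
print(2 ** h_gauss_1)                         # effective width ~4.13 per dimension

h_unif_4 = np.log2(4.0)                       # uniform on [0, 4]
print(2 ** h_unif_4)                          # effective width exactly 4
```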

Maximum entropy distributions

• For a discrete random variable taking on K values, what distribution maximizes the entropy?

• Can you think of a continuous counterpart? (a numerical comparison follows below)

[Look ahead to Ch.12, pg. 409-412]
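For the discrete question the maximizer is the uniform distribution, with H = log K; for the continuous counterpart under a variance constraint it is the Gaussian (cf. (8.92) in the summary and Ch. 12). A closed-form spot check against two other unit-variance densities, a comparison of my own:

```python
import numpy as np

e = np.e
h_gaussian = 0.5 * np.log2(2 * np.pi * e)     # N(0,1):                          ~2.047 bits
h_uniform  = np.log2(np.sqrt(12.0))           # uniform, width sqrt(12), var 1:  ~1.792 bits
h_laplace  = np.log2(2 * e / np.sqrt(2.0))    # Laplace, scale 1/sqrt(2), var 1: ~1.943 bits
print(h_gaussian, h_uniform, h_laplace)       # the Gaussian is largest
```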

Maximum entropy distributions

[Look ahead to Ch.12, pg. 409-412]

Maximum entropy examples

Prove 2 ways!

Maximum entropy examples

Prove 2 ways!

Estimation error and differential entropy

• A counterpart to Fano's inequality for discrete RVs...

calculate a function $g(Y) = \hat{X}$, where $\hat{X}$ is an estimate of X and takes on values in $\hat{\mathcal{X}}$. We will not restrict the alphabet $\hat{\mathcal{X}}$ to be equal to $\mathcal{X}$, and we will also allow the function g(Y) to be random. We wish to bound the probability that $\hat{X} \ne X$. We observe that $X \to Y \to \hat{X}$ forms a Markov chain. Define the probability of error

$P_e = \Pr\{\hat{X} \ne X\}.$ (2.129)

Theorem 2.10.1 (Fano's Inequality) For any $\hat{X}$ such that $X \to Y \to \hat{X}$, with $P_e = \Pr(X \ne \hat{X})$, we have

$H(P_e) + P_e \log|\mathcal{X}| \ge H(X|\hat{X}) \ge H(X|Y).$ (2.130)

This inequality can be weakened to

$1 + P_e \log|\mathcal{X}| \ge H(X|Y)$ (2.131)

or

$P_e \ge \frac{H(X|Y) - 1}{\log|\mathcal{X}|}.$ (2.132)

Why can't we use Fano's?

Remark Note from (2.130) that $P_e = 0$ implies that $H(X|Y) = 0$, as intuition suggests.

Proof: We first ignore the role of Y and prove the first inequality in (2.130). We will then use the data-processing inequality to prove the more traditional form of Fano’s inequality, given by the second inequality in (2.130). Define an error random variable,

$E = \begin{cases} 1 & \text{if } \hat{X} \ne X, \\ 0 & \text{if } \hat{X} = X. \end{cases}$ (2.133)

Then, using the chain rule for entropies to expand $H(E, X|\hat{X})$ in two different ways, we have

$H(E, X|\hat{X}) = H(X|\hat{X}) + H(E|X, \hat{X})$ (2.134)
$= H(E|\hat{X}) + H(X|E, \hat{X}).$ (2.135)

(Here $H(E|X,\hat{X}) = 0$, $H(E|\hat{X}) \le H(P_e)$, and $H(X|E,\hat{X}) \le P_e\log|\mathcal{X}|$.)

Since conditioning reduces entropy, $H(E|\hat{X}) \le H(E) = H(P_e)$. Now since E is a function of X and $\hat{X}$, the conditional entropy $H(E|X,\hat{X})$ is 0.

Estimation error and differential entropy
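The continuous counterpart (also the last line of the chapter summary) is $E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e}e^{2h(X|Y)}$, with equality for Gaussian X and the conditional-mean estimator. A sketch for $X \sim N(0, \sigma^2)$ with no useful side information, so that $h(X|Y) = h(X)$; the numbers are my own example:

```python
import numpy as np

sigma = 1.5
h_nats = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)   # h(X) in nats
lower_bound = np.exp(2 * h_nats) / (2 * np.pi * np.e)
print(lower_bound, sigma ** 2)   # bound equals sigma^2; met with equality by Xhat = E[X] = 0
```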

Summary

SUMMARY

$h(X) = h(f) = -\int_S f(x)\log f(x)\,dx$ (8.81)
$f(X^n) \doteq 2^{-nh(X)}$ (8.82)
$\mathrm{Vol}(A_\epsilon^{(n)}) \doteq 2^{nh(X)}.$ (8.83)
$H([X]_{2^{-n}}) \approx h(X) + n.$ (8.84)
$h(\mathcal{N}(0, \sigma^2)) = \frac{1}{2}\log 2\pi e\sigma^2.$ (8.85)

$h(\mathcal{N}_n(\mu, K)) = \frac{1}{2}\log(2\pi e)^n |K|.$ (8.86)
$D(f\|g) = \int f\log\frac{f}{g} \ge 0.$ (8.87)
$h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i|X_1, X_2, \ldots, X_{i-1}).$ (8.88)
$h(X|Y) \le h(X).$ (8.89)
$h(aX) = h(X) + \log|a|.$ (8.90)
$I(X;Y) = \int f(x,y)\log\frac{f(x,y)}{f(x)f(y)} \ge 0.$ (8.91)
$\max_{EXX^t = K} h(X) = \frac{1}{2}\log(2\pi e)^n|K|.$ (8.92)
$E(X - \hat{X}(Y))^2 \ge \frac{1}{2\pi e} e^{2h(X|Y)}.$

$2^{nH(X)}$ is the effective alphabet size for a discrete random variable. $2^{nh(X)}$ is the effective support set size for a continuous random variable. $2^C$ is the effective alphabet size of a channel of capacity C.
