Chapter 2: Entropy and Mutual Information

Chapter 2 outline

• Definitions
• Entropy
• Joint entropy, conditional entropy
• Relative entropy, mutual information
• Chain rules
• Jensen's inequality
• Log-sum inequality
• Data processing inequality
• Fano's inequality

Definitions

A discrete random variable X takes on values x from the discrete alphabet 𝒳. The probability mass function (pmf) is described by

p_X(x) = p(x) = Pr{X = x}, for x ∈ 𝒳.


[Excerpt from MacKay, Information Theory, Inference, and Learning Algorithms, Chapter 2 (Probability, Entropy, and Inference), Section 2.1 (Probabilities and ensembles), shown on the slide:]

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable which takes on one of a set of possible values A_X = {a_1, a_2, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i ≥ 0, and ∑_{a_i ∈ A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. There are twenty-seven possible outcomes: a–z and a space character '-'. [Figure 2.1 shows this probability distribution, estimated from The Frequently Asked Questions Manual for Linux; Figure 2.2 shows the corresponding distribution over the 27 × 27 possible bigrams xy.]

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).

Probability of a subset. If T is a subset of A_X then P(T) = P(x ∈ T) = ∑_{a_i ∈ T} P(x = a_i). For example, if we define V to be the vowels from Figure 2.1, V = {a, e, i, o, u}, then P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.

A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ A_X = {a_1, ..., a_I} and y ∈ A_Y = {b_1, ..., b_J}. We call P(x, y) the joint probability of x and y. Commas are optional when writing ordered pairs, so xy ⇔ x, y. N.B. In a joint ensemble XY the two variables are not necessarily independent.

Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation: P(x = a_i) ≡ ∑_{y ∈ A_Y} P(x = a_i, y), and similarly P(y) ≡ ∑_{x ∈ A_X} P(x, y).

Conditional probability. P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j) if P(y = b_j) ≠ 0. [If P(y = b_j) = 0 then P(x = a_i | y = b_j) is undefined.] We pronounce P(x = a_i | y = b_j) 'the probability that x equals a_i, given y equals b_j'.

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in Figure 2.2. This joint ensemble has the special property that its two marginal distributions, P(x) and P(y), are identical: they are both equal to the monogram distribution of Figure 2.1. From this joint ensemble P(x, y) we can obtain conditional distributions, P(y | x) and P(x | y), by normalizing the rows and columns, respectively. The probability P(y | x = q) is the probability distribution of the second letter given that the first letter is a q.

Definitions

The events X = x and Y = y are statistically independent if p(x, y)=p(x)p(y).

The variables X_1, X_2, ..., X_N are called independent if for all (x_1, x_2, ..., x_N) ∈ 𝒳_1 × 𝒳_2 × ··· × 𝒳_N we have

p(x_1, x_2, ..., x_N) = ∏_{i=1}^{N} p_{X_i}(x_i).

They are furthermore called identically distributed if all variables X_i have the same distribution p_X(x).

Entropy

• Intuitive notions?

• 2 ways of defining entropy of a random variable:
  • axiomatic definition (want a measure with certain properties...)
  • just define it and then justify the definition by showing it arises as the answer to a number of natural questions

Definition: The entropy H(X) of a discrete random variable X with pmf p_X(x) is given by

H(X) = − ∑_{x ∈ 𝒳} p_X(x) log p_X(x) = −E_{p_X(x)}[ log p_X(X) ]

Order these in terms of entropy


Entropy: example 1

• What's the entropy of a uniform discrete random variable taking on K values?

• What's the entropy of a random variable with an alphabet 𝒳 of four symbols and p_X = [1/2, 1/4, 1/8, 1/8]?

• What’s the entropy of a deterministic random variable?
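A quick numeric check of these three questions. This is a sketch using only the Python standard library; the helper name entropy is mine, not from the course notes.

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum_x p(x) log p(x), with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

K = 8
print(entropy([1.0 / K] * K))              # uniform on K values: log2(K) = 3.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))       # deterministic: 0.0 bits
```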



[Excerpt from MacKay, Section 2.4 (Definition of entropy and related functions), shown on the slide. The top of the same page states the likelihood principle: given a generative model P(d | θ) and an observed outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ); in spite of its simplicity, many classical statistical methods violate it.]

The Shannon information content of an outcome x is defined to be

h(x) = log_2 (1 / P(x)).   (2.34)

It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.] In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i.

Table 2.9 lists the Shannon information contents h(p_i) = log_2(1/p_i) of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits; the average, ∑_i p_i log_2(1/p_i), is 4.1 bits.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

H(X) ≡ ∑_{x ∈ A_X} P(x) log_2 (1 / P(x)),   (2.35)

with the convention for P(x) = 0 that 0 × log(1/0) ≡ 0, since lim_{θ→0+} θ log(1/θ) = 0. Like the information content, entropy is measured in bits.

When it is convenient, we may also write H(X) as H(p), where p is the vector (p_1, p_2, ..., p_I). Another name for the entropy of X is the uncertainty of X.

Entropy: example 2



Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in Table 2.9. We obtain this number by averaging log 1/p_i (shown in the fourth column) under the probability distribution p_i (shown in the third column).

Entropy: example 3

• A Bernoulli random variable takes on heads (0) with probability p and tails (1) with probability 1 − p. Its entropy is defined as

H(p) := −p log_2(p) − (1 − p) log_2(1 − p)

[Figure 2.1 (Cover & Thomas): H(p) vs. p.]
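A minimal sketch of the binary entropy function plotted in Figure 2.1 (the function name binary_entropy is my own):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1-p) log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"H({p}) = {binary_entropy(p):.4f} bits")  # symmetric about p = 0.5, peak value 1 bit
```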

Suppose that we wish to determine the value of X with the minimum number of binary questions. An efficient first question is "Is X = a?", which splits the probability in half. If the answer to the first question is no, the second question can be "Is X = b?", and the third question can be "Is X = c?". The resulting expected number of binary questions required is 1.75. This turns out to be the minimum expected number of binary questions required to determine the value of X. In Chapter 5 we show that the minimum expected number of binary questions required to determine X lies between H(X) and H(X) + 1.
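A sketch of the arithmetic behind the 1.75 figure, assuming the pmf [1/2, 1/4, 1/8, 1/8] over symbols a, b, c, d from the earlier example; the question strategy described above resolves a, b, c, d with 1, 2, 3 and 3 questions respectively.

```python
# "Is X = a?", then "Is X = b?", then "Is X = c?" (a "no" to all three means X = d).
pmf       = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
questions = {"a": 1,   "b": 2,    "c": 3,     "d": 3}

expected = sum(pmf[s] * questions[s] for s in pmf)
print(expected)   # 1.75, which equals H(X) for this pmf
```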

2.2 JOINT ENTROPY AND CONDITIONAL ENTROPY

We defined the entropy of a single random variable in Section 2.1. We now extend the definition to a pair of random variables. There is nothing really new in this definition because (X, Y ) can be considered to be a single vector-valued random variable.

Definition The joint entropy H(X,Y) of a pair of discrete random variables (X, Y ) with a joint distribution p(x, y) is defined as

H(X, Y) = − ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) log p(x, y).   (2.8)

Entropy

The entropy H(X) = −∑_x p(x) log p(x) has the following properties:

• H(X) ≥ 0: entropy is always non-negative. H(X) = 0 iff X is deterministic (using the convention 0 log(0) = 0).
• H(X) ≤ log(|𝒳|), and H(X) = log(|𝒳|) iff X has a uniform distribution over 𝒳.

• Since H_b(X) = log_b(a) H_a(X), we do not need to specify the base of the logarithm; changing base only rescales the entropy (bits vs. nats).

Moving on to multiple RVs

Joint entropy and conditional entropy

Definition: The joint entropy of a pair of discrete random variables X and Y, and the conditional entropy of Y given X, are:

H(X, Y) := −E_{p(x,y)}[ log p(X, Y) ] = − ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) log p(x, y)

H(Y | X) := −E_{p(x,y)}[ log p(Y | X) ] = − ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) log p(y | x)

Note: H(X | Y) ≠ H(Y | X) in general!


Joint entropy and conditional entropy

• Natural definitions, since....

Theorem (Chain rule): H(X, Y) = H(X) + H(Y | X).

[Figure 8.1 (MacKay): the relationship between the joint entropy H(X, Y), the marginal entropies H(X) and H(Y), the conditional entropies H(X | Y) and H(Y | X), and the mutual information I(X; Y). Figure 8.2: a misleading Venn-diagram representation of the same entropies, for contrast with Figure 8.1.]

Three term entropies

Corollary: H(X, Y, Z) = H(X, Z) + H(Y | X, Z).

[Exercises from MacKay, Chapter 8 (Dependent Random Variables), shown on the slide:]

Exercise 8.1. Consider three independent random variables u, v, w with entropies H_u, H_v, H_w. Let X ≡ (U, V) and Y ≡ (V, W). What is H(X, Y)? What is H(X | Y)? What is I(X; Y)?

Exercise 8.2. Referring to the definitions of conditional entropy, confirm (with an example) that it is possible for H(X | y = b_k) to exceed H(X), but that the average, H(X | Y), is less than H(X). So data are helpful: they do not increase uncertainty, on average.

Exercise 8.3. Prove the chain rule for entropy, H(X, Y) = H(X) + H(Y | X).

Exercise 8.4. Prove that the mutual information I(X; Y) ≡ H(X) − H(X | Y) satisfies I(X; Y) = I(Y; X) and I(X; Y) ≥ 0. [Hint: note that I(X; Y) = D_KL(P(x, y) ‖ P(x)P(y)).]

Exercise 8.5. The 'entropy distance' between two random variables can be defined to be the difference between their joint entropy and their mutual information: D_H(X, Y) ≡ H(X, Y) − I(X; Y). Prove that the entropy distance satisfies the axioms for a distance: D_H(X, Y) ≥ 0, D_H(X, X) = 0, D_H(X, Y) = D_H(Y, X), and D_H(X, Z) ≤ D_H(X, Y) + D_H(Y, Z).

Exercise 8.6. A joint ensemble XY has the following joint distribution:

P(x, y)   x = 1   x = 2   x = 3   x = 4
y = 1     1/8     1/16    1/32    1/32
y = 2     1/16    1/8     1/32    1/32
y = 3     1/16    1/16    1/16    1/16
y = 4     1/4     0       0       0

What is the joint entropy H(X, Y)? What are the marginal entropies H(X) and H(Y)? For each value of y, what is the conditional entropy H(X | y)? What is the conditional entropy H(X | Y)? What is the conditional entropy of Y given X? What is the mutual information between X and Y?

Exercise 8.7. Consider the ensemble XYZ in which A_X = A_Y = A_Z = {0, 1}, x and y are independent with P_X = {p, 1 − p} and P_Y = {q, 1 − q}, and z = (x + y) mod 2. (a) If q = 1/2, what is P_Z? What is I(Z; X)? (b) For general p and q, what is P_Z? What is I(Z; X)? Notice that this ensemble is related to the binary symmetric channel, with x = input, y = noise, and z = output.

Exercise 8.8. Many texts draw Figure 8.1 in the form of a Venn diagram (Figure 8.2). Discuss why this diagram is a misleading representation of entropies. Hint: consider the three-variable ensemble XYZ in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = (x + y) mod 2.

Exercise 8.9 (the data-processing theorem). The data-processing theorem states that data processing can only destroy information. Prove it by considering an ensemble WDR in which w is the state of the world, d is data gathered, and r is the processed data, so that these three variables form a Markov chain w → d → r, that is, P(w, d, r) = P(w)P(d | w)P(r | d). Show that the average information that R conveys about W, I(W; R), is less than or equal to the average information that D conveys about W, I(W; D). This theorem is as much a caution about our definition of 'information' as it is a caution about data processing!

Joint/conditional entropy examples

p(x, y)    y = 0    y = 1
x = 0      1/2      1/4
x = 1      0        1/4

H(X, Y) = ?    H(X | Y) = ?    H(Y | X) = ?    H(X) = ?    H(Y) = ?
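A numeric sketch for this table (the helper names are mine); it also illustrates the note above that H(X | Y) and H(Y | X) generally differ.

```python
import math

def H(probs):
    """Entropy in bits; zero-probability entries are skipped (0 log 0 = 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf from the slide: keys are (x, y).
p = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}

px = [sum(v for (x, y), v in p.items() if x == xv) for xv in (0, 1)]  # [3/4, 1/4]
py = [sum(v for (x, y), v in p.items() if y == yv) for yv in (0, 1)]  # [1/2, 1/2]

H_XY, H_X, H_Y = H(p.values()), H(px), H(py)
print(H_XY)        # H(X,Y) = 1.5 bits
print(H_X, H_Y)    # H(X) ~ 0.811 bits, H(Y) = 1.0 bit
print(H_XY - H_Y)  # H(X|Y) = 0.5 bits
print(H_XY - H_X)  # H(Y|X) ~ 0.689 bits  (not equal to H(X|Y))
```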

Entropy is central because...

(A) entropy is the measure of average uncertainty in the random variable

(B) entropy is the average number of bits needed to describe the random variable

(C) entropy is a lower bound on the average length of the shortest description of the random variable

(D) entropy is measured in bits?

(E) H(X) = −∑_x p(x) log_2(p(x))

(F) entropy of a deterministic value is 0

Mutual information

• Entropy H(X) is the uncertainty (``self-information'') of a single random variable

• Conditional entropy H(X|Y) is the entropy of one random variable conditional upon knowledge of another.
• The average amount of decrease of the randomness of X by observing Y is the average information that Y gives us about X.

Definition: The mutual information I(X; Y ) between the random variables X and Y is given by

I(X; Y) = H(X) − H(X | Y)
        = ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) log_2 [ p(x, y) / ( p(x) p(y) ) ]
        = E_{p(x,y)}[ log_2 ( p(X, Y) / ( p(X) p(Y) ) ) ]

At the heart of information theory because...

• Information capacity of a channel p(y | x): X → Y is C = max_{p(x)} I(X; Y).
• Operational capacity: the highest rate (bits/channel use) at which one can communicate reliably.
• The channel coding theorem says: information capacity = operational capacity.

[The slide also previews capacity expressions for specific channels, e.g. C = (1/2) log_2(1 + P/N) for the Gaussian channel, C = (1/2) log_2(1 + |h|^2 P/P_N) or E_h[(1/2) log_2(1 + |h|^2 P/P_N)] for fading channels, and C = max_{Q: Tr(Q)=P} log_2 |I_{M_R} + H Q H†| for the MIMO channel Y = HX + N, together with multi-user rate expressions such as R_2 ≤ I(Y_2; X_2 | X_1).]

Mutual information example

p(x, y)    y = 0    y = 1
x = 0      1/2      1/4
x = 1      0        1/4

X or Y     p(x)     p(y)
0          3/4      1/2
1          1/4      1/2
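A sketch computing I(X; Y) for this table directly from the definition and, as a cross-check, from the entropies (helper and variable names are my own).

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p  = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}  # joint pmf from the slide
px = {0: 0.75, 1: 0.25}                                       # marginal p(x)
py = {0: 0.50, 1: 0.50}                                       # marginal p(y)

# I(X;Y) = sum_{x,y} p(x,y) log2[ p(x,y) / (p(x)p(y)) ]
I = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)
print(I)                                                      # ~0.3113 bits
print(H(px.values()) + H(py.values()) - H(p.values()))        # same value, via H(X)+H(Y)-H(X,Y)
```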

Divergence (relative entropy, K-L distance)

Definition: Relative entropy, divergence or Kullback-Leibler distance between two distributions, P and Q, on the same alphabet, is

D(p ‖ q) := E_p[ log ( p(X) / q(X) ) ] = ∑_{x ∈ 𝒳} p(x) log ( p(x) / q(x) )

(Note: we use the conventions 0 log(0/0) = 0, 0 log(0/q) = 0, and p log(p/0) = ∞.)

• D(p ‖ q) is in a sense a measure of the "distance" between the two distributions.
• If p = q then D(p ‖ q) = 0.
• Note: D(p ‖ q) is not a true distance (it is not symmetric and does not satisfy the triangle inequality).
• For the two distributions pictured on the slide, D(·, ·) = 0.2075 and D(·, ·) = 0.1887.

K-L divergence example

• 𝒳 = {1, 2, 3, 4, 5, 6}
• P = [1/6, 1/6, 1/6, 1/6, 1/6, 1/6]
• Q = [1/10, 1/10, 1/10, 1/10, 1/10, 1/2]
• D(p ‖ q) = ? and D(q ‖ p) = ?

[Bar plots of p(x) and q(x) versus x.]
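A sketch of the two divergences for this example (the helper kl is mine); it makes the asymmetry of D concrete.

```python
import math

def kl(p, q):
    """D(p||q) in bits, with the convention 0 log(0/q) = 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [1 / 6] * 6
q = [1 / 10] * 5 + [1 / 2]

print(kl(p, q))  # ~0.35 bits
print(kl(q, p))  # ~0.42 bits: D(p||q) != D(q||p)
```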

Mutual information as divergence

Definition: The mutual information I(X; Y ) between the random variables X and Y is given by

I(X; Y) = H(X) − H(X | Y)
        = ∑_{x ∈ 𝒳} ∑_{y ∈ 𝒴} p(x, y) log_2 [ p(x, y) / ( p(x) p(y) ) ]
        = E_{p(x,y)}[ log_2 ( p(X, Y) / ( p(X) p(Y) ) ) ]

• Can we express mutual information in terms of the K-L divergence?

I(X; Y) = D( p(x, y) ‖ p(x) p(y) )

Mutual information and entropy

Theorem: Relationship between mutual information and entropy.

I(X; Y) = H(X) − H(X | Y)
I(X; Y) = H(Y) − H(Y | X)
I(X; Y) = H(X) + H(Y) − H(X, Y)
I(X; Y) = I(Y; X)   (symmetry)
I(X; X) = H(X)   ("self-information")

[Venn diagrams: I(X; Y) is the overlap of H(X) and H(Y); H(X | Y) is the part of H(X) outside H(Y).]

Chain rule for entropy

Theorem (Chain rule for entropy): For (X_1, X_2, ..., X_n) ~ p(x_1, x_2, ..., x_n),

H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1).

[Venn-diagram illustration: H(X_1, X_2, X_3) = H(X_1) + H(X_2 | X_1) + H(X_3 | X_1, X_2).]
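A numeric sketch of the chain rule on a randomly generated joint pmf of three binary variables. The conditional entropies are computed directly from the conditional pmfs rather than by subtraction, so the agreement with the joint entropy is a genuine check; all names here are mine.

```python
import itertools
import math
import random

def H(probs):
    """Entropy in bits; zero-probability entries are skipped."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

# A random joint pmf p(x1, x2, x3) over three binary variables.
random.seed(0)
w = [random.random() for _ in range(8)]
p = {xs: wi / sum(w) for xs, wi in zip(itertools.product((0, 1), repeat=3), w)}

def marginal(coords):
    """Marginal pmf of the variables at the given coordinate indices."""
    m = {}
    for xs, v in p.items():
        key = tuple(xs[i] for i in coords)
        m[key] = m.get(key, 0.0) + v
    return m

def cond_entropy(target, given):
    """H(X_target | X_given): average entropy of the conditional pmf of X_target."""
    joint = marginal(given + [target])   # pmf of (X_given..., X_target)
    total = 0.0
    for key, pk in marginal(given).items():
        cond = [v / pk for k, v in joint.items() if k[:len(given)] == key]
        total += pk * H(cond)
    return total

lhs = H(p.values())                                                        # H(X1, X2, X3)
rhs = H(marginal([0]).values()) + cond_entropy(1, [0]) + cond_entropy(2, [0, 1])
print(lhs, rhs)   # equal: H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2)
```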

Conditional mutual information

I(X; Y | Z) = H(X | Z) − H(X | Y, Z)

[Illustrated with three-set Venn diagrams over H(X), H(Y), and H(Z).]

Chain rule for mutual information

Theorem: (Chain rule for mutual information)

I(X_1, X_2, ..., X_n; Y) = ∑_{i=1}^{n} I(X_i; Y | X_{i−1}, X_{i−2}, ..., X_1)

[Venn-diagram illustration: I(X, Y; Z) = I(X; Z) + I(Y; Z | X).]

Chain rule for relative entropy in book pg. 24

What is the grey region?

[Two three-variable Venn diagrams over H(X), H(Y), H(Z), with a grey region for the reader to identify.]

University of Illinois at Chicago ECE 534, Natasha Devroye AnotherCopyright Cambridge University Press disclaimer.... 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.


[Figure 8.3 (MacKay): a misleading representation of entropies, continued; a three-variable Venn-style diagram with regions labelled H(X), H(Y), H(Z), H(X | Y, Z), H(Y | X, Z), H(Z | X, Y), H(X, Y | Z), H(Z | X), H(Z | Y), I(X; Y), I(X; Y | Z), and A.]

Firstly, one may imagine that the random outcome (x, y) corresponds to a point in the diagram, and thus confuse entropies with probabilities.

Secondly, the depiction in terms of Venn diagrams encourages one to believe that all the areas correspond to positive quantities. In the special case of two random variables it is indeed true that H(X | Y), I(X; Y) and H(Y | X) are positive quantities. But as soon as we progress to three-variable ensembles, we obtain a diagram with positive-looking areas that may actually correspond to negative quantities. Figure 8.3 correctly shows relationships such as

H(X) + H(Z | X) + H(Y | X, Z) = H(X, Y, Z).   (8.31)

But it gives the misleading impression that the conditional mutual information I(X; Y | Z) is less than the mutual information I(X; Y). In fact the area labelled A can correspond to a negative quantity. Consider the joint ensemble (X, Y, Z) in which x ∈ {0, 1} and y ∈ {0, 1} are independent binary variables and z ∈ {0, 1} is defined to be z = x + y mod 2. Then clearly H(X) = H(Y) = 1 bit. Also H(Z) = 1 bit. And H(Y | X) = H(Y) = 1 since the two variables are independent. So the mutual information between X and Y is zero: I(X; Y) = 0. However, if z is observed, X and Y become dependent: knowing x, given z, tells you what y is, namely y = z − x mod 2. So I(X; Y | Z) = 1 bit. Thus the area labelled A must correspond to −1 bits for the figure to give the correct answers.

The above example is not at all a capricious or exceptional illustration. The binary symmetric channel with input X, noise Y, and output Z is a situation in which I(X; Y) = 0 (input and noise are independent) but I(X; Y | Z) > 0 (once you see the output, the unknown input and the unknown noise are intimately related!). [MacKay's textbook]

The Venn diagram representation is therefore valid only if one is aware that positive areas may represent negative quantities. With this proviso kept in mind, the interpretation of entropies in terms of sets can be helpful (Yeung, 1991).

Solution to exercise 8.9 (p.141). For any joint ensemble XYZ, the following chain rule for mutual information holds:

I(X; Y, Z) = I(X; Y) + I(X; Z | Y).   (8.32)

Now, in the case w → d → r, w and r are independent given d, so I(W; R | D) = 0. Using the chain rule twice, we have

I(W; D, R) = I(W; D)   (8.33)

and

I(W; D, R) = I(W; R) + I(W; D | R),   (8.34)

so

I(W; R) − I(W; D) ≤ 0.   (8.35)
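A quick numeric check of MacKay's three-variable XOR example from the passage above (helper names are mine): I(X; Y) = 0 while I(X; Y | Z) = 1 bit, so the area labelled A would have to be negative.

```python
import itertools
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# x, y independent fair bits, z = (x + y) mod 2.
p = {(x, y, (x + y) % 2): 0.25 for x, y in itertools.product((0, 1), repeat=2)}

def marg(coords):
    m = {}
    for k, v in p.items():
        key = tuple(k[i] for i in coords)
        m[key] = m.get(key, 0.0) + v
    return m

Hx, Hy, Hz = (H(marg([i]).values()) for i in (0, 1, 2))
Hxy, Hxz, Hyz, Hxyz = (H(marg(c).values()) for c in ([0, 1], [0, 2], [1, 2], [0, 1, 2]))

print(Hx + Hy - Hxy)          # I(X;Y)   = 0.0 bits: x and y are independent
print(Hxz + Hyz - Hz - Hxyz)  # I(X;Y|Z) = 1.0 bit: given z, x determines y
```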

Convex and concave functions

[Plots of example convex and concave functions.]

Jensen's inequality

Theorem: (Jensen’s inequality) If f is convex, then

E[f(X)] ≥ f(E[X]).

If f is strictly convex, then equality implies X = E[X] with probability 1.
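A Monte Carlo sketch of the inequality, with my own choice of the convex function f(t) = t^2 and a uniform X:

```python
import random

random.seed(1)
xs = [random.uniform(-3, 3) for _ in range(100_000)]  # samples of X ~ Uniform(-3, 3)
f = lambda t: t * t                                   # a convex function

E_fX = sum(f(x) for x in xs) / len(xs)   # estimate of E[f(X)]  (about 3)
f_EX = f(sum(xs) / len(xs))              # f(E[X])              (about 0)
print(E_fX, f_EX, E_fX >= f_EX)          # Jensen: E[f(X)] >= f(E[X])
```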

Jensen's inequality consequences

• Theorem (Information inequality): D(p ‖ q) ≥ 0, with equality iff p = q.
• Corollary (Non-negativity of mutual information): I(X; Y) ≥ 0, with equality iff X and Y are independent.
• Theorem (Conditioning reduces entropy): H(X | Y) ≤ H(X), with equality iff X and Y are independent.
• Theorem: H(X) ≤ log|𝒳|, with equality iff X has a uniform distribution over 𝒳.
• Theorem (Independence bound on entropy): H(X_1, X_2, ..., X_n) ≤ ∑_{i=1}^{n} H(X_i), with equality iff the X_i are independent.

Theorem (Log-sum inequality): For nonnegative numbers a_1, a_2, ..., a_n and b_1, b_2, ..., b_n,

∑_{i=1}^{n} a_i log (a_i / b_i) ≥ ( ∑_{i=1}^{n} a_i ) log ( (∑_{i=1}^{n} a_i) / (∑_{i=1}^{n} b_i) ),

with equality iff a_i / b_i is constant.

Conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and 0 log(0/0) = 0.
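A numeric sketch of the inequality and its equality condition on arbitrary positive numbers (values chosen by me):

```python
import math
import random

random.seed(2)
a = [random.uniform(0.1, 5.0) for _ in range(10)]
b = [random.uniform(0.1, 5.0) for _ in range(10)]

lhs = sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b))
rhs = sum(a) * math.log2(sum(a) / sum(b))
print(lhs, rhs, lhs >= rhs)       # the log-sum inequality holds

c = [2.0 * bi for bi in b]        # a_i / b_i constant => equality
lhs_c = sum(ci * math.log2(ci / bi) for ci, bi in zip(c, b))
rhs_c = sum(c) * math.log2(sum(c) / sum(b))
print(abs(lhs_c - rhs_c) < 1e-9)  # True
```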


Log-sum inequality consequences

• Theorem (Convexity of relative entropy): D(p ‖ q) is convex in the pair (p, q), so that for pmfs (p_1, q_1) and (p_2, q_2) and all 0 ≤ λ ≤ 1:

D(λ p_1 + (1 − λ) p_2 ‖ λ q_1 + (1 − λ) q_2) ≤ λ D(p_1 ‖ q_1) + (1 − λ) D(p_2 ‖ q_2).

• Theorem (Concavity of entropy): For X ~ p(x), H(p) := H_p(X) is a concave function of p(x).

• Theorem (Concavity of the mutual information in p(x)): Let (X, Y) ~ p(x, y) = p(x)p(y | x). Then I(X; Y) is a concave function of p(x) for fixed p(y | x).

• Theorem (Convexity of the mutual information in p(y | x)): Let (X, Y) ~ p(x, y) = p(x)p(y | x). Then I(X; Y) is a convex function of p(y | x) for fixed p(x).

Markov chains

Definition: X, Y, Z form a Markov chain in that order (X → Y → Z) iff

p(x, y, z) = p(x) p(y | x) p(z | y), i.e., p(z | y, x) = p(z | y).

[Diagram: X → ⊕ → Y → ⊕ → Z, where the two adders inject independent noise N_1 and N_2.]

• X → Y → Z iff X and Z are conditionally independent given Y.
• X → Y → Z implies Z → Y → X. Thus, we can write X ↔ Y ↔ Z.

Data-processing inequality

Theorem (Data-processing inequality): If X → Y → Z, then I(X; Y) ≥ I(X; Z). In particular, for any function f, X → Y → f(Y), so I(X; Y) ≥ I(X; f(Y)).

[Diagrams: X → ⊕(N_1) → Y → ⊕(N_2) → Z, and X → ⊕(N_1) → Y → f(·) → Z.]

Markov chain questions

If X → Y → Z, then I(X; Y) ≥ I(X; Y | Z).

What if X, Y, Z do not form a Markov chain: can I(X; Y | Z) ≥ I(X; Y)?

If X_1 → X_2 → X_3 → X_4 → X_5 → X_6, then mutual information increases as you get closer together:

I(X_1; X_2) ≥ I(X_1; X_3) ≥ I(X_1; X_4) ≥ I(X_1; X_5) ≥ I(X_1; X_6).
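A sketch for a binary symmetric Markov chain, with my own choice of a 0.1 flip probability per step: X_1 is a fair bit and each X_{k+1} flips X_k independently with probability 0.1, so X_1 → X_2 → ... → X_6.

```python
import math

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

eps = 0.1  # per-step flip probability (my choice)

def I_X1_Xk(k):
    """I(X1; Xk): X1 is a fair bit passed through k-1 independent BSC(eps) steps."""
    q = (1 - (1 - 2 * eps) ** (k - 1)) / 2  # overall flip probability after k-1 steps
    return 1 - H2(q)                        # X1, Xk uniform, so I = H(Xk) - H(Xk|X1) = 1 - H2(q)

print([round(I_X1_Xk(k), 4) for k in range(2, 7)])  # strictly decreasing in k
```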

Fano's inequality

Theorem (Fano's inequality): Suppose we estimate X from an observation Y via X̂ = g(Y), so that X → Y → X̂, and let P_e = Pr(X̂ ≠ X). Then

H(P_e) + P_e log(|𝒳| − 1) ≥ H(X | X̂) ≥ H(X | Y),

where H(P_e) is the binary entropy of P_e. A weaker but convenient form is 1 + P_e log|𝒳| ≥ H(X | Y), i.e., P_e ≥ (H(X | Y) − 1) / log|𝒳|.
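A numeric sketch of the bound on a hypothetical ternary example (channel and estimator chosen by me): X is uniform on {0, 1, 2}, Y equals X with probability 0.7 and is shifted by 1 or 2 (mod 3) with probabilities 0.2 and 0.1, and the estimator is X̂(y) = y.

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf p(x, y) for the hypothetical channel described above.
p = {}
for x in range(3):
    for shift, prob in ((0, 0.7), (1, 0.2), (2, 0.1)):
        p[(x, (x + shift) % 3)] = (1 / 3) * prob

Pe = sum(v for (x, y), v in p.items() if y != x)   # Pr(Xhat != X) = 0.3 with Xhat = Y

H_X_given_Y = 0.0                                  # H(X|Y) = sum_y p(y) H(X | Y = y)
for y in range(3):
    col = [v for (xx, yy), v in p.items() if yy == y]
    H_X_given_Y += sum(col) * H([v / sum(col) for v in col])

fano = H([Pe, 1 - Pe]) + Pe * math.log2(3 - 1)     # H(Pe) + Pe log(|X| - 1)
print(H_X_given_Y, fano, fano >= H_X_given_Y)      # Fano's bound holds
```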

Fano's inequality consequences
