Chapter 2: Entropy and Mutual Information
University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2 outline
• Definitions
• Entropy
• Joint entropy, conditional entropy
• Relative entropy, mutual information
• Chain rules
• Jensen's inequality
• Log-sum inequality
• Data processing inequality
• Fano's inequality

Definitions
A discrete random variable X takes on values x from the discrete alphabet 𝒳. The probability mass function (pmf) is described by p_X(x) = p(x) = Pr{X = x}, for x ∈ 𝒳.

(The background material that follows is excerpted from D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003, Chapter 2: Probability, Entropy, and Inference; see http://www.inference.phy.cam.ac.uk/mackay/itila/.)

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles
An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values, A_X = {a_1, a_2, ..., a_i, ..., a_I}, having probabilities P_X = {p_1, p_2, ..., p_I}, with P(x = a_i) = p_i, p_i ≥ 0 and Σ_{a_i ∈ A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space character '-'.

[Figure 2.1: the probability distribution p_i over the 27 outcomes for a randomly selected letter in an English-language document, estimated from The Frequently Asked Questions Manual for Linux; e.g. p_a = 0.0575, p_e = 0.0913, p_z = 0.0007, p_- = 0.1928.]

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).

Probability of a subset. If T is a subset of A_X then
P(T) = P(x ∈ T) = Σ_{a_i ∈ T} P(x = a_i).   (2.1)
For example, if we define V to be vowels from figure 2.1, V = {a, e, i, o, u}, then
P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.   (2.2)

A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ A_X = {a_1, ..., a_I} and y ∈ A_Y = {b_1, ..., b_J}. We call P(x, y) the joint probability of x and y. Commas are optional when writing ordered pairs, so xy ⇔ x, y. N.B. In a joint ensemble XY the two variables are not necessarily independent.

[Figure 2.2: the probability distribution over the 27 × 27 possible bigrams xy in an English-language document, The Frequently Asked Questions Manual for Linux.]

Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:
P(x = a_i) ≡ Σ_{y ∈ A_Y} P(x = a_i, y).   (2.3)
Similarly, using briefer notation, the marginal probability of y is
P(y) ≡ Σ_{x ∈ A_X} P(x, y).   (2.4)
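As a concrete illustration of equations (2.1)–(2.4), here is a minimal sketch in Python (not part of the original notes). The vowel probabilities are the figure 2.1 values; the tiny 2 × 2 joint distribution is a made-up example, not the real bigram data of figure 2.2.

# Probability of a subset, eqs. (2.1)-(2.2): sum the pmf over the subset V
p_letter = {'a': 0.0575, 'e': 0.0913, 'i': 0.0599, 'o': 0.0689, 'u': 0.0334}
print(round(sum(p_letter.values()), 2))            # 0.31, matching eq. (2.2)

# Marginals from a joint pmf, eqs. (2.3)-(2.4), on a hypothetical two-symbol alphabet
P_xy = {('a', 'a'): 0.1, ('a', 'b'): 0.3,
        ('b', 'a'): 0.2, ('b', 'b'): 0.4}
P_x, P_y = {}, {}
for (x, y), p in P_xy.items():
    P_x[x] = P_x.get(x, 0.0) + p                   # P(x) = sum over y of P(x, y)
    P_y[y] = P_y.get(y, 0.0) + p                   # P(y) = sum over x of P(x, y)
print({x: round(p, 2) for x, p in P_x.items()})    # {'a': 0.4, 'b': 0.6}
print({y: round(p, 2) for y, p in P_y.items()})    # {'a': 0.3, 'b': 0.7}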
Conditional probability.
P(x = a_i | y = b_j) ≡ P(x = a_i, y = b_j) / P(y = b_j)   if P(y = b_j) ≠ 0.   (2.5)
[If P(y = b_j) = 0 then P(x = a_i | y = b_j) is undefined.]
We pronounce P(x = a_i | y = b_j) 'the probability that x equals a_i, given y equals b_j'.

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in figure 2.2. This joint ensemble has the special property that its two marginal distributions, P(x) and P(y), are identical. They are both equal to the monogram distribution shown in figure 2.1. From this joint ensemble P(x, y) we can obtain conditional distributions, P(y|x) and P(x|y), by normalizing the rows and columns, respectively (figure 2.3). The probability P(y | x = q) is the probability distribution of the second letter given that the first letter is a q. As you can see in figure 2.3a, the two most probable values for the second letter y given that the first letter x is q are u and '-'.

Definitions
The events X = x and Y = y are statistically independent if p(x, y) = p(x)p(y).
The variables X_1, X_2, ..., X_N are called independent if for all (x_1, x_2, ..., x_N) ∈ 𝒳_1 × 𝒳_2 × ··· × 𝒳_N we have
p(x_1, x_2, ..., x_N) = Π_{i=1}^{N} p_{X_i}(x_i).
They are furthermore called identically distributed if all variables X_i have the same distribution p_X(x).

Entropy
• Intuitive notions?
• Two ways of defining the entropy of a random variable:
  • axiomatic definition (want a measure with certain properties...)
  • just define it, then justify the definition by showing it arises as the answer to a number of natural questions
Definition: The entropy H(X) of a discrete random variable X with pmf p_X(x) is given by
H(X) = − Σ_x p_X(x) log p_X(x) = − E_{p_X}[log p_X(X)].

Order these in terms of entropy. [Slide images of the distributions to be ranked are not reproduced here.]

Entropy examples
• What's the entropy of a uniform discrete random variable taking on K values?
• What's the entropy of a random variable taking four values with p_X = [1/2, 1/4, 1/8, 1/8]?
• What's the entropy of a deterministic random variable?
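The three example questions above are easy to check numerically. The following is a minimal sketch, not part of the original notes; it assumes base-2 logarithms, so entropies are reported in bits.

from math import log2

def entropy(pmf):
    # H(X) = -sum_x p(x) * log2 p(x), with the convention 0 * log2(0) = 0
    return -sum(p * log2(p) for p in pmf if p > 0)

K = 8
print(entropy([1.0 / K] * K))           # uniform over K values: log2(K) = 3.0 bits for K = 8
print(entropy([1/2, 1/4, 1/8, 1/8]))    # 1.75 bits
print(entropy([1.0, 0.0, 0.0, 0.0]))    # 0.0 bits: a deterministic variable has zero entropy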
What do you notice about your solutions? Does each answer depend on the detailed contents of each urn?

The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions
The Shannon information content of an outcome x is defined to be
h(x) = log_2 (1 / P(x)).

[Table: the letter probabilities p_i of figure 2.1 together with their Shannon information contents h(p_i) = log_2(1/p_i); e.g. h(p_a) = 4.1 bits for p_a = 0.0575 and h(p_j) = 10.7 bits for p_j = 0.0006.]
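As a quick check (not part of the original text), the tabulated information contents can be reproduced directly from this definition; the two probabilities used below are the figure 2.1 values for 'a' and 'j'.

from math import log2

def info_content(p):
    # Shannon information content h(x) = log2(1 / P(x)), in bits
    return log2(1.0 / p)

print(round(info_content(0.0575), 1))   # 4.1 bits for the outcome 'a' (p = 0.0575)
print(round(info_content(0.0006), 1))   # 10.7 bits for the rare outcome 'j' (p = 0.0006)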