Formal Modeling in Cognitive Science
Lecture 26: Entropy Rate; Mutual Information

Frank Keller
School of Informatics, University of Edinburgh
[email protected]

March 6, 2006

Overview:
1. Entropy Rate
   - Entropy Rate
   - The Entropy of English
2. Mutual Information
   - Mutual Information over Distributions
   - Pointwise Mutual Information


Entropy Rate

Entropy depends on the length of the message: longer messages have higher entropy (all else being equal). Entropy rate takes this into account by normalizing by n, the length of the message. Intuitively, the entropy rate is the entropy per character or per word in a message.

Definition: Entropy Rate
The entropy rate of a sequence of random variables X_1, X_2, ..., X_n is defined as:

H_{rate} = \frac{1}{n} H(X_1, X_2, \ldots, X_n)
         = -\frac{1}{n} \sum_{x_1 \in X_1} \sum_{x_2 \in X_2} \cdots \sum_{x_n \in X_n} f(x_1, x_2, \ldots, x_n) \log f(x_1, x_2, \ldots, x_n)

Note that we have to extend our notion of joint distribution f(x, y) and joint entropy H(X, Y) to arbitrarily many random variables.

Example: simplified Polynesian
In the previous example, we computed the joint entropy of a consonant and a vowel. The per-character entropy rate is:

H_{rate} = \frac{1}{n} H(X_1, X_2, \ldots, X_n) = \frac{1}{2} H(C, V) = 1.218 bits
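As an illustration, here is a minimal Python sketch of this computation, assuming the simplified Polynesian joint distribution of a consonant and a vowel given later in this lecture (the function name joint_entropy and the dictionary layout are illustrative choices, not from the slides):

```python
from math import log2

# Joint distribution f(c, v) for simplified Polynesian (see the table in the
# mutual information example below); the nine cell probabilities sum to 1.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
    ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

def joint_entropy(dist):
    """H(X_1, ..., X_n) = -sum of f(x) log2 f(x), skipping zero cells."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

n = 2                                # a message here is one CV syllable
h_rate = joint_entropy(joint) / n    # entropy per character
print(round(h_rate, 3))              # -> 1.218
```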

Shannon’s Experiments

Guessing game: an experimental subject is given a sample of English text and is asked to guess the next letter (Shannon 1951).

Assumption: the subject will guess the most probable letter first, then the second most probable letter, and so on.

This way we get a probability distribution over the number of guesses required to get the correct letter:

No. of guesses    1      2      3      4      5      > 5
Probability       0.79   0.08   0.03   0.02   0.02   0.06

We can then use this distribution to compute the entropy rate of English. The results show:

- H_rate(English) is between 0.6 and 1.3 bits per character if estimated by humans;
- if humans gamble on the outcome, H_rate(English) is between 1.25 and 1.35 bpc;
- if we estimate it from a 500M word corpus, then H_rate(English) is approximately 1.75 bpc.

Modern estimates use word-guessing, not letter-guessing.
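As a rough illustration (a sketch, not part of the lecture), the entropy of the guess-number distribution above can be computed directly, treating the "> 5" outcomes as a single bucket; the result, about 1.18 bits, happens to fall within the human-estimate range quoted above:

```python
from math import log2

# Probability that the correct letter is found on the k-th guess
# (the "> 5" outcomes are lumped into one bucket, as in the table above).
guess_probs = [0.79, 0.08, 0.03, 0.02, 0.02, 0.06]

entropy = -sum(p * log2(p) for p in guess_probs if p > 0)
print(round(entropy, 2))   # -> 1.18 (bits)
```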


Mutual Information

Definition: Mutual Information
If X and Y are discrete random variables and f(x, y) is the value of their joint probability distribution at (x, y), and f(x) and f(y) are the marginal distributions of X and Y, respectively, then

I(X; Y) = \sum_{x \in X} \sum_{y \in Y} f(x, y) \log \frac{f(x, y)}{f(x) f(y)}

is the mutual information (MI) of X and Y.

We can also express mutual information in terms of entropy:

Theorem: Mutual Information
If X and Y are discrete random variables with joint entropy H(X, Y) and the marginal entropy of X is H(X), then:

I(X; Y) = H(X) - H(X|Y)
        = H(Y) - H(Y|X)
        = H(X) + H(Y) - H(X, Y)

Intuitively, mutual information is the reduction in uncertainty of X due to the knowledge of Y. The theorem follows from the definition of conditional entropy in terms of joint entropy.
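A brief numerical check of the theorem, using a small made-up joint distribution (not one from the lecture): the sum formula from the definition and the entropy identity give the same value.

```python
from math import log2

# A small made-up joint distribution f(x, y) over X = {0, 1} and Y = {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    """Entropy of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Marginal distributions f(x) and f(y).
fx, fy = {}, {}
for (x, y), p in joint.items():
    fx[x] = fx.get(x, 0) + p
    fy[y] = fy.get(y, 0) + p

# Definition: I(X;Y) = sum over x, y of f(x,y) log [f(x,y) / (f(x) f(y))]
mi_def = sum(p * log2(p / (fx[x] * fy[y])) for (x, y), p in joint.items() if p > 0)

# Theorem: I(X;Y) = H(X) + H(Y) - H(X,Y)
mi_ent = H(fx) + H(fy) - H(joint)

print(round(mi_def, 4), round(mi_ent, 4))   # both print the same value
```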


The relationship between mutual information and entropy can be visualized using a Venn diagram:

[Venn diagram: two overlapping circles H(X) and H(Y); their union is H(X, Y), the non-overlapping parts are H(X|Y) and H(Y|X), and the intersection is I(X; Y).]

Properties of mutual information:

- Intuitively, I(X; Y) is the amount of information X and Y contain about each other;
- I(X; Y) ≥ 0 and I(X; Y) = I(Y; X);
- I(X; Y) is a measure of the dependence between X and Y: I(X; Y) = 0 if and only if X and Y are independent;
- I(X; Y) grows not only with the dependence of X and Y, but also with H(X) and H(Y);
- I(X; X) = H(X); entropy as “self-information” of X.
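Two of these properties can be checked with a short sketch (again with made-up distributions, not ones from the lecture): an independent pair yields I(X; Y) = 0, and pairing X with itself recovers H(X).

```python
from math import log2

def H(dist):
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) computed from a joint distribution {(x, y): probability}."""
    fx, fy = {}, {}
    for (x, y), p in joint.items():
        fx[x] = fx.get(x, 0) + p
        fy[y] = fy.get(y, 0) + p
    return sum(p * log2(p / (fx[x] * fy[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent X and Y: f(x, y) = f(x) f(y), so I(X;Y) = 0.
p_x = {'a': 0.25, 'b': 0.75}
p_y = {'c': 0.5, 'd': 0.5}
independent = {(x, y): px * py for x, px in p_x.items() for y, py in p_y.items()}
print(mutual_information(independent))                              # -> 0.0

# X paired with itself: f(x, x) = f(x), so I(X;X) = H(X).
self_joint = {(x, x): p for x, p in p_x.items()}
print(round(mutual_information(self_joint), 4), round(H(p_x), 4))   # equal
```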



Example: simplified Polynesian
Back to simplified Polynesian, with the following joint probability distribution:

f(x, y)    p       t       k       f(y)
   a       1/16    3/8     1/16    1/2
   i       1/16    3/16    0       1/4
   u       0       3/16    1/16    1/4
 f(x)      1/8     3/4     1/8

Let’s compute the mutual information of a consonant and a vowel. First compute the entropy of a vowel:

H(V) = -\sum_{y \in V} f(y) \log f(y)
     = -(\tfrac{1}{2} \log \tfrac{1}{2} + \tfrac{1}{4} \log \tfrac{1}{4} + \tfrac{1}{4} \log \tfrac{1}{4})
     = 1.5 bits

We have already computed H(V|C) = 1.375 bits (last lecture), so we can now compute:

I(V; C) = H(V) - H(V|C) = 1.5 - 1.375 = 0.125 bits
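A compact sketch that reproduces these numbers from the table above; here H(V|C) is recomputed via the chain rule H(V|C) = H(C, V) − H(C) rather than carried over from the last lecture:

```python
from math import log2

# Joint distribution f(c, v) from the table above: (consonant, vowel) -> probability.
joint = {
    ('p', 'a'): 1/16, ('t', 'a'): 3/8,  ('k', 'a'): 1/16,
    ('p', 'i'): 1/16, ('t', 'i'): 3/16, ('k', 'i'): 0,
    ('p', 'u'): 0,    ('t', 'u'): 3/16, ('k', 'u'): 1/16,
}

# Marginals f(c) and f(v).
f_c, f_v = {}, {}
for (c, v), p in joint.items():
    f_c[c] = f_c.get(c, 0) + p
    f_v[v] = f_v.get(v, 0) + p

H = lambda dist: -sum(p * log2(p) for p in dist.values() if p > 0)

h_v = H(f_v)                      # H(V)   = 1.5 bits
h_v_given_c = H(joint) - H(f_c)   # H(V|C) = H(C,V) - H(C) = 1.375 bits

print(round(h_v - h_v_given_c, 3))   # I(V;C) = 0.125 bits
```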

Pointwise Mutual Information

- Mutual information is defined over random variables; pointwise mutual information is defined over values of random variables.
- Example: MI over vowels and consonants; pointwise MI over the letters a and p.
- Intuitively, pointwise MI is the amount of information provided by the occurrence of event y about the occurrence of event x.

Definition: Pointwise Mutual Information
If X and Y are discrete random variables with the joint distribution f(x, y) and the marginal distributions f(x) and f(y), then

I(x; y) = \log \frac{f(x, y)}{f(x) f(y)}

is the pointwise mutual information at (x, y).


Example: simplified Polynesian
Compute the pointwise mutual information of a and p, and of i and p:

I(a; p) = \log \frac{f(a, p)}{f(a) f(p)} = \log \frac{1/16}{\frac{1}{2} \cdot \frac{1}{8}} = \log 1 = 0

I(i; p) = \log \frac{f(i, p)}{f(i) f(p)} = \log \frac{1/16}{\frac{1}{4} \cdot \frac{1}{8}} = \log 2 = 1
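The same two values can be reproduced directly from the table cells (the variable names, e.g. f_ap, are shorthand introduced here for illustration):

```python
from math import log2

# Marginals and joint cells read off the simplified Polynesian table above.
f_p = 1/8                  # f(p): marginal probability of the consonant p
f_a, f_i = 1/2, 1/4        # f(a), f(i): marginal probabilities of the vowels
f_ap, f_ip = 1/16, 1/16    # joint cells f(a, p) and f(i, p)

pmi = lambda f_xy, f_x, f_y: log2(f_xy / (f_x * f_y))

print(pmi(f_ap, f_a, f_p))   # I(a;p) = 0.0: a and p are independent
print(pmi(f_ip, f_i, f_p))   # I(i;p) = 1.0: seeing p makes i twice as likely
```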

Summary

- Entropy rate is the per-word or per-character entropy;
- the entropy rate of English can be estimated using experiments with humans, or approximated using a large corpus;
- mutual information I(X; Y) is the reduction in uncertainty of X due to the knowledge of Y;
- graphically, it is the intersection of the two entropies in the Venn diagram;
- if X and Y are independent, then I(X; Y) = 0;
- pointwise mutual information: the same quantity for individual values of a distribution, instead of for the whole distribution.

References

Shannon, Claude E. 1951. Prediction and entropy of printed English. Bell System Technical Journal 30:50–64.
