
The term $\sqrt{\hat{\sigma}^2/S}$ is called the (numerical or empirical) standard error, and is an estimate of our uncertainty about our estimate of μ. (See Section 6.2 for more discussion on standard errors.) If we want to report an answer which is accurate to within ±ε with probability at least 95%, we need to use a number of samples S which satisfies $1.96\sqrt{\hat{\sigma}^2/S} \leq \epsilon$. We can approximate the 1.96 factor by 2, yielding $S \geq 4\hat{\sigma}^2/\epsilon^2$.
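As a quick illustrative sketch (not from the book, whose code is MATLAB/PMTK), the following estimates μ for an arbitrary example distribution, computes the empirical standard error, and applies the bound above to choose a sample size for a target accuracy ε = 0.01; the distribution and ε are assumptions made purely for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Draw S samples from an arbitrary example distribution, N(1.5, 0.5^2).
    S = 1000
    x = rng.normal(loc=1.5, scale=0.5, size=S)

    mu_hat = x.mean()                  # Monte Carlo estimate of mu
    sigma2_hat = x.var()               # empirical variance sigma^2-hat
    std_err = np.sqrt(sigma2_hat / S)  # (numerical) standard error

    # Number of samples needed so the estimate is within +/- eps of mu
    # with probability ~95%, using S >= 4 * sigma^2-hat / eps^2.
    eps = 0.01
    S_needed = int(np.ceil(4 * sigma2_hat / eps**2))

    print(mu_hat, std_err, S_needed)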

2.8 Information theory

Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding). At first, this seems far removed from the concerns of probability theory and machine learning, but in fact there is an intimate connection. To see this, note that compactly representing data requires allocating short codewords to highly probable bit strings, and reserving longer codewords for less probable bit strings. This is similar to the situation in natural language, where common words (such as “a”, “the”, “and”) are generally much shorter than rare words. Also, decoding messages sent over noisy channels requires having a good probability model of the kinds of messages that people tend to send. In both cases, we need a model that can predict which kinds of data are likely and which unlikely, which is also a central problem in machine learning (see (MacKay 2003) for more details on the connection between information theory and machine learning). Obviously we cannot go into the details of information theory here (see e.g., (Cover and Thomas 2006) if you are interested to learn more). However, we will introduce a few basic concepts that we will need later in the book.

2.8.1 Entropy

The entropy of a random variable X with distribution p, denoted by H(X) or sometimes H(p), is a measure of its uncertainty. In particular, for a discrete variable with K states, it is defined by

$$H(X) \triangleq -\sum_{k=1}^{K} p(X = k) \log_2 p(X = k) \qquad (2.107)$$

Usually we use log base 2, in which case the units are called bits (short for binary digits). If we use log base e, the units are called nats. For example, if X ∈ {1, ..., 5} with histogram distribution p = [0.25, 0.25, 0.2, 0.15, 0.15], we find H = 2.2855.

The discrete distribution with maximum entropy is the uniform distribution (see Section 9.2.6 for a proof). Hence for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K; in this case, H(X) = log2 K. Conversely, the distribution with minimum entropy (which is zero) is any delta-function that puts all its mass on one state. Such a distribution has no uncertainty.

In Figure 2.5(b), where we plotted a DNA sequence logo, the height of each bar is defined to be 2 − H, where H is the entropy of that distribution, and 2 is the maximum possible entropy. Thus a bar of height 0 corresponds to a uniform distribution, whereas a bar of height 2 corresponds to a deterministic distribution.
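As an illustrative check (not the book's MATLAB/PMTK code), the sketch below computes the entropy in bits of the histogram distribution above, recovering the value of roughly 2.2855, along with the maximum-entropy (uniform) and minimum-entropy (delta) cases:

    import numpy as np

    def entropy(p, base=2):
        """Entropy of a discrete distribution p; zero-probability states contribute 0."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -np.sum(p * np.log(p)) / np.log(base)

    p = [0.25, 0.25, 0.2, 0.15, 0.15]
    print(entropy(p))                 # approx 2.2855 bits
    print(entropy([0.2] * 5))         # uniform: log2(5), approx 2.3219 bits
    print(entropy([1, 0, 0, 0, 0]))   # delta function: 0 bits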


Figure 2.21 Entropy of a Bernoulli random variable as a function of θ. The maximum entropy is log2 2 = 1. Figure generated by bernoulliEntropyFig.

For the special case of binary random variables, X ∈ {0, 1}, we can write p(X = 1) = θ and p(X = 0) = 1 − θ. Hence the entropy becomes

$$H(X) = -[p(X=1) \log_2 p(X=1) + p(X=0) \log_2 p(X=0)] \qquad (2.108)$$
$$= -[\theta \log_2 \theta + (1-\theta) \log_2 (1-\theta)] \qquad (2.109)$$

This is called the binary entropy function, and is also written H(θ). We plot this in Figure 2.21. We see that the maximum value of 1 occurs when the distribution is uniform, θ = 0.5.
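A minimal numerical sketch of the binary entropy function (illustrative only; the book's figure is produced by the MATLAB function bernoulliEntropyFig):

    import numpy as np

    def binary_entropy(theta):
        """H(theta) = -[theta log2 theta + (1-theta) log2 (1-theta)], with H(0) = H(1) = 0."""
        theta = np.asarray(theta, dtype=float)
        h = np.zeros_like(theta)
        mask = (theta > 0) & (theta < 1)
        t = theta[mask]
        h[mask] = -(t * np.log2(t) + (1 - t) * np.log2(1 - t))
        return h

    print(binary_entropy(np.array([0.0, 0.1, 0.5, 0.9, 1.0])))
    # -> [0, ~0.469, 1, ~0.469, 0]; the maximum of 1 bit occurs at theta = 0.5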

2.8.2 KL divergence

One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows:

$$\mathrm{KL}(p \| q) \triangleq \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k} \qquad (2.110)$$

where the sum gets replaced by an integral for pdfs.¹⁰ We can rewrite this as

$$\mathrm{KL}(p \| q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -H(p) + H(p, q) \qquad (2.111)$$

where H(p, q) is called the cross entropy,

$$H(p, q) \triangleq -\sum_k p_k \log q_k \qquad (2.112)$$

One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q to

define our codebook. Hence the “regular” entropy H(p) = H(p, p), defined in Section 2.8.1, is the expected number of bits if we use the true model, so the KL divergence is the difference between these. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution q to encode the data instead of the true distribution p.

The “extra number of bits” interpretation should make it clear that KL(p||q) ≥ 0, and that the KL is only equal to zero iff q = p. We now give a proof of this important result.

Theorem 2.8.1. (Information inequality) KL(p||q) ≥ 0 with equality iff p = q.

Proof. To prove the theorem, we need to use Jensen's inequality. This states that, for any convex function f, we have that

$$f\left(\sum_{i=1}^{n} \lambda_i x_i\right) \leq \sum_{i=1}^{n} \lambda_i f(x_i) \qquad (2.113)$$

where λi ≥ 0 and ∑ λi = 1. This is clearly true for n = 2 (by definition of convexity), and can be proved by induction for n > 2.

Let us now prove the main theorem, following (Cover and Thomas 2006, p28). Let A = {x : p(x) > 0} be the support of p(x). Then

$$-\mathrm{KL}(p \| q) = -\sum_{x \in A} p(x) \log \frac{p(x)}{q(x)} = \sum_{x \in A} p(x) \log \frac{q(x)}{p(x)} \qquad (2.114)$$
$$\leq \log \sum_{x \in A} p(x) \frac{q(x)}{p(x)} = \log \sum_{x \in A} q(x) \qquad (2.115)$$
$$\leq \log \sum_{x \in \mathcal{X}} q(x) = \log 1 = 0 \qquad (2.116)$$

where the first inequality follows from Jensen's. Since log(x) is a strictly concave function, we have equality in Equation 2.115 iff p(x) = c q(x) for some c. We have equality in Equation 2.116 iff ∑_{x∈A} q(x) = ∑_{x∈X} q(x) = 1, which implies c = 1. Hence KL(p||q) = 0 iff p(x) = q(x) for all x.

One important consequence of this result is that the discrete distribution with the maximum entropy is the uniform distribution. More precisely, H(X) ≤ log |X|, where |X| is the number of states for X, with equality iff p(x) is uniform. To see this, let u(x) = 1/|X|. Then

$$0 \leq \mathrm{KL}(p \| u) = \sum_x p(x) \log \frac{p(x)}{u(x)} \qquad (2.117)$$
$$= \sum_x p(x) \log p(x) - \sum_x p(x) \log u(x) = -H(X) + \log |\mathcal{X}| \qquad (2.118)$$

This is a formulation of Laplace's principle of insufficient reason, which argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another. See Section 9.2.6 for a discussion of how to create distributions that satisfy certain constraints, but otherwise are as least committal as possible. (For example, the Gaussian satisfies first and second moment constraints, but otherwise has maximum entropy.)

10. The KL divergence is not a distance, since it is asymmetric. One symmetric version of the KL divergence is the Jensen-Shannon divergence, defined as JS(p1, p2) = 0.5 KL(p1||q) + 0.5 KL(p2||q), where q = 0.5 p1 + 0.5 p2.
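To make these quantities concrete, here is a small illustrative sketch (not from the book) that computes cross entropy and KL divergence for the histogram distribution of Section 2.8.1 and the uniform distribution, and checks the bound H(X) ≤ log |X|:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        return -np.sum(p[p > 0] * np.log2(p[p > 0]))

    def cross_entropy(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return -np.sum(p[p > 0] * np.log2(q[p > 0]))

    def kl(p, q):
        # KL(p||q) = H(p, q) - H(p) >= 0, with equality iff p == q
        return cross_entropy(p, q) - entropy(p)

    p = np.array([0.25, 0.25, 0.2, 0.15, 0.15])
    u = np.array([0.2, 0.2, 0.2, 0.2, 0.2])     # uniform u(x) = 1/|X|

    print(kl(p, u))                   # KL(p||u) = log2(5) - H(p), approx 0.0364 bits
    print(kl(p, p))                   # 0: no extra bits when we use the true model
    print(entropy(p) <= np.log2(5))   # True: H(X) <= log|X|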

2.8.3 Mutual information

Consider two random variables, X and Y. Suppose we want to know how much knowing one variable tells us about the other. We could compute the correlation coefficient, but this is only defined for real-valued random variables, and furthermore, this is a very limited measure of dependence, as we saw in Figure 2.12. A more general approach is to determine how similar the joint distribution p(X, Y) is to the factored distribution p(X)p(Y). This is called the mutual information or MI, and is defined as follows:

$$I(X; Y) \triangleq \mathrm{KL}(p(X, Y) \| p(X)p(Y)) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)p(y)} \qquad (2.119)$$

We have I(X; Y) ≥ 0 with equality iff p(X, Y) = p(X)p(Y). That is, the MI is zero iff the variables are independent.

To gain insight into the meaning of MI, it helps to re-express it in terms of joint and conditional entropies. One can show (Exercise 2.12) that the above expression is equivalent to the following:

$$I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) \qquad (2.120)$$

where H(Y|X) is the conditional entropy, defined as H(Y|X) = ∑_x p(x) H(Y|X = x). Thus we can interpret the MI between X and Y as the reduction in uncertainty about X after observing Y, or, by symmetry, the reduction in uncertainty about Y after observing X. We will encounter several applications of MI later in the book. See also Exercises 2.13 and 2.14 for the connection between MI and correlation coefficients.

A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as

$$\mathrm{PMI}(x, y) \triangleq \log \frac{p(x, y)}{p(x)p(y)} = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)} \qquad (2.121)$$

This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI of X and Y is just the expected value of the PMI. Interestingly, we can rewrite the PMI as follows:

$$\mathrm{PMI}(x, y) = \log \frac{p(x|y)}{p(x)} = \log \frac{p(y|x)}{p(y)} \qquad (2.122)$$

This is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).
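The following illustrative sketch (not the book's code) computes the MI of a small, arbitrary example joint distribution directly from Equation 2.119 and checks that it matches H(X) − H(X|Y), as in Equation 2.120:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        return -np.sum(p[p > 0] * np.log2(p[p > 0]))

    # Arbitrary example joint distribution p(X, Y): rows index x, columns index y.
    pxy = np.array([[0.3, 0.1],
                    [0.1, 0.5]])
    px = pxy.sum(axis=1)          # marginal p(X)
    py = pxy.sum(axis=0)          # marginal p(Y)

    # MI as KL(p(X,Y) || p(X)p(Y)), Equation 2.119.
    mask = pxy > 0
    mi = np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask]))

    # MI as H(X) - H(X|Y), Equation 2.120, with H(X|Y) = sum_y p(y) H(X|Y=y).
    h_x_given_y = sum(py[j] * entropy(pxy[:, j] / py[j]) for j in range(len(py)))
    print(mi, entropy(px) - h_x_given_y)   # the two values agree (approx 0.2565 bits)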

2.8.3.1 Mutual information for continuous random variables *

The above formula for MI is defined for discrete random variables. For continuous random variables, it is common to first discretize or quantize them, by dividing the ranges of each variable into bins, and computing how many values fall in each histogram bin (Scott 1979). We can then easily compute the MI using the formula above (see mutualInfoAllPairsMixed for some code, and miMixedDemo for a demo).

Figure 2.22 Left: correlation coefficient vs maximal information coefficient (MIC) for all pairwise relationships in the WHO data. Right: scatter plots of certain pairs of variables. The red lines are non-parametric smoothing regressions (Section 15.4.6) fit separately to each trend. Source: Figure 4 of (Reshef et al. 2011). Used with kind permission of David Reshef and the American Association for the Advancement of Science.

Unfortunately, the number of bins used, and the location of the bin boundaries, can have a significant effect on the results. One way around this is to try to estimate the MI directly, without first performing density estimation (Learned-Miller 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved. This statistic, appropriately normalized, is known as the maximal information coefficient (MIC) (Reshef et al. 2011). More precisely, define

$$m(x, y) = \frac{\max_{G \in \mathcal{G}(x, y)} I(X(G); Y(G))}{\log \min(x, y)} \qquad (2.123)$$

where G(x, y) is the set of 2d grids of size x × y, and X(G), Y(G) represents a discretization of the variables onto this grid. (The maximization over bin locations can be performed efficiently using dynamic programming (Reshef et al. 2011).) Now define the MIC as

$$\mathrm{MIC} \triangleq \max_{x, y \,:\, xy < B} m(x, y) \qquad (2.124)$$

where B is a sample-size dependent bound on the number of grid cells. In Figure 2.22 (left), this statistic is plotted against the correlation coefficient (CC) for all pairwise relationships in the WHO data, with scatter plots for some labeled pairs shown on the right. Consider some of the labeled points (see the sketch after this list for a naive way to approximate these quantities):

• The point marked C has a low CC and a low MIC. The corresponding scatter plot makes it clear that there is no relationship between these two variables (percentage of lives lost to injury and density of dentists in the population).
• The points marked D and H have high CC (in absolute value) and high MIC, because they represent nearly linear relationships.
• The points marked E, F, and G have low CC but high MIC. This is because they correspond to non-linear (and sometimes, as in the case of E and F, non-functional, i.e., one-to-many) relationships between the variables.

In summary, we see that statistics (such as MIC) based on mutual information can be used to discover interesting relationships between variables in a way that simpler measures, such as correlation coefficients, cannot. For this reason, the MIC has been called “a correlation for the 21st century” (Speed 2011).
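As referenced above, here is a rough, illustrative sketch of binning-based MI estimation for continuous variables, together with a brute-force approximation to m(x, y) that searches only over small equal-width grids. This is not the Reshef et al. algorithm (which optimizes bin boundaries with dynamic programming) and not the book's mutualInfoAllPairsMixed code; the grid ranges and bound are assumptions for the example.

    import numpy as np

    def binned_mi(x, y, bins_x, bins_y):
        """Estimate I(X;Y) in bits by histogramming the samples onto a bins_x-by-bins_y grid."""
        counts, _, _ = np.histogram2d(x, y, bins=[bins_x, bins_y])
        pxy = counts / counts.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        mask = pxy > 0
        return np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask]))

    def naive_mic(x, y, max_cells=None):
        """Crude stand-in for the MIC: maximize normalized binned MI over small equal-width grids."""
        n = len(x)
        if max_cells is None:
            max_cells = int(n ** 0.6)   # bound B on grid cells; n^0.6 is one commonly used choice
        best = 0.0
        for bx in range(2, 11):
            for by in range(2, 11):
                if bx * by >= max_cells:
                    continue
                m = binned_mi(x, y, bx, by) / np.log2(min(bx, by))
                best = max(best, m)
        return best

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=1000)
    y_linear = 2 * x + 0.1 * rng.standard_normal(1000)
    y_parab = x ** 2 + 0.1 * rng.standard_normal(1000)   # non-linear: CC near 0, MI high

    print(np.corrcoef(x, y_linear)[0, 1], naive_mic(x, y_linear))
    print(np.corrcoef(x, y_parab)[0, 1], naive_mic(x, y_parab))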

Exercises

Exercise 2.1 Probabilities are sensitive to the form of the question that was used to generate the answer

(Source: Minka.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.

a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?

b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?

Exercise 2.2 Legal reasoning

(Source: Peter Lee.) Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population.

a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus there is a 99% chance that he is guilty.” This is known as the prosecutor's fallacy. What is wrong with this argument?

b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that the defendant is guilty, and thus has no relevance.” This is known as the defender's fallacy. What is wrong with this argument?

Exercise 2.3 Variance of a sum

Show that the variance of a sum is var[X + Y] = var[X] + var[Y] + 2 cov[X, Y], where cov[X, Y] is the covariance between X and Y.

Exercise 2.4 Bayes rule for medical diagnosis

(Source: Koller.) After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don't have the disease). The good news is that this is a rare disease, striking only one in 10,000 people. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)