Introduction to Information Theory
Entropy as a Measure of Information Content

Entropy of a random variable. Let X be a random variable that takes on values from its domain {x_1, x_2, ..., x_n} with respective probabilities p_1, p_2, ..., p_n, where \sum_i p_i = 1. Then the entropy of X, H(X), represents the average amount of information contained in X and is defined by

    H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i.

Note that entropy is measured in bits. Also, H_n(p_1, ..., p_n) is an alternative notation for the entropy function.

Arguments in favor of the above definition of information:

• The definition is consistent with the following extreme cases:

  1. If n = 2 and p_1 = p_2 = 1/2, then H(X) = 1 bit; i.e., when an event (e.g., X = x_1) is equally likely to occur or not to occur, its outcome carries one bit of information. This is the maximum amount of information a binary outcome may possess.

  2. If p_i = 1 for some 1 \le i \le n, then H(X) = 0; i.e., any random variable whose outcome is certain possesses no information.

• Moreover, the above definition is the only one that satisfies the following three properties, each of which seems reasonable under any definition of information:

  – Normalization: H_2(1/2, 1/2) = 1.

  – Continuity: H_2(p, 1 - p) is a continuous function of p on the interval (0, 1).

  – Grouping: H_m(p_1, ..., p_m) = H_{m-1}(p_1 + p_2, p_3, ..., p_m) + (p_1 + p_2) H_2\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right).

Claude Shannon (1916-2001). Pioneer in

• applying Boolean logic to electronic circuit design

• studying the complexity of Boolean circuits

• signal processing: determined lower bounds on the number of samples needed to achieve a desired estimation accuracy

• game theory: inventor of the minimax algorithm

• information theory: first to give a precise definition for the concept of information

• coding theory: the Channel Coding Theorem and the Optimal Coding Theorem

Example 1.
Calculate the amount of information contained in a weather forecast if the possible outcomes are {normal, rain, fog, hot, windy}, with respective probabilities 0.8, 0.10, 0.04, 0.03, and 0.03.

Example 2. Statement of the 12-balls puzzle. You are given 12 balls, all identical in size and weight except one, which may be heavier or lighter than the other 11. You must determine which ball is non-standard using only a balance. Devise a strategy that will always identify the non-standard ball in three or fewer balancings. As a starting point, which of the six possible initial balancing configurations provides the most information to the user in the worst case?

Conditional Probability

Given a random variable X, an event pertaining to X is defined as a measurable subset of its domain. Capital letters A, B, C, ... are used to represent events. An event A is said to be discrete if it consists of a countable number of domain elements. In this case we can define the probability of A as

    p(A) = \sum_{x \in A} p(x).

In addition, the conditional probability p(A|B) is defined as the probability of observing A given that event B has been observed. It is mathematically defined as

    p(A|B) = \frac{p(A, B)}{p(B)}.

If p(A|B) = p(A), then A and B are said to be independent events; otherwise they are called dependent events. As a corollary, A and B are independent iff p(A, B) = p(A) p(B).

Example 3. Three coins are tossed independently. Coin i, i = 1, 2, 3, has probability i/4 of landing heads. What is the probability that Coin 1 has landed heads, given that two out of the three coins landed heads? One out of three? Zero out of three? Three out of three?

The Law of Total Probability shows how conditional probabilities can be used to compute unconditional ones. Let A be an event whose probability is to be computed, and let E_1, E_2, ..., E_n be pairwise disjoint events for which p(\bigcup_{i=1}^{n} E_i) = 1 (i.e., exactly one of the E_i will be observed). Then

    p(A) = \sum_{i=1}^{n} p(A|E_i) p(E_i).

Example 4. Suppose a bag contains three coins. One coin is two-headed, another is fair, while the third has only probability 0.25 of landing heads. If you randomly select one of these coins from the bag to toss, what is the probability that you will observe heads?

Given random variables X and Y, with domain(X) = {x_1, ..., x_m} and domain(Y) = {y_1, ..., y_n}, we can define a new random variable Z = X × Y, where domain(Z) = domain(X) × domain(Y). For this reason Z is referred to as a random vector; in this case, a two-dimensional random vector. We often write it as Z = (X, Y). Moreover, we let p(x, y) denote the probability distribution of Z; it is also called the joint distribution of X and Y. From the joint distribution we can compute the conditional distributions p(x|y) and p(y|x), as well as the marginal distributions p(x) and p(y). The term "marginal" comes from the practice of recording these row and column sums in the margins of the joint-distribution table.

Example 5. Use the joint-distribution table below to compute the conditional and marginal distributions.

    Y\X |  1     2     3     4
    ----+----------------------
     1  | 1/8   1/16  1/32  1/32
     2  | 1/16  1/8   1/32  1/32
     3  | 1/16  1/16  1/16  1/16
     4  | 1/4   0     0     0

Not surprisingly, one can define both the joint entropy H(X, Y) of Z = (X, Y) and the conditional entropy H(X|Y). The joint entropy is defined as

    H(X, Y) = \sum_{(x, y) \in X \times Y} p(x, y) \log_2 \frac{1}{p(x, y)}.

In other words, it is simply the entropy of the random vector Z = (X, Y).

As for conditional entropy, suppose that random variable Y is observed before X. If Y is independent of X, then observing Y says nothing about what will be observed for X. On the other hand, if Y and X are highly dependent, then one would expect that observing Y reduces the amount of information subsequently obtained by observing X. An extreme case occurs when Y = X: there, no additional information is provided by X once Y has been observed.
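The joint-entropy definition can be checked numerically against the table from Example 5. The following Python sketch (the variable names and use of exact fractions are my own choices) computes H(X, Y) by summing p log_2(1/p) over the nonzero cells, along with the two marginal distributions:

```python
from fractions import Fraction as F
from math import log2

# Joint distribution from Example 5: rows indexed by y, columns by x.
joint = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def joint_entropy(p):
    """H(X,Y) = sum over nonzero cells of p * log2(1/p), in bits."""
    return sum(float(q) * log2(1 / float(q)) for row in p for q in row if q > 0)

# Marginals: p(x) sums each column; p(y) sums each row.
p_x = [float(sum(row[j] for row in joint)) for j in range(4)]
p_y = [float(sum(row)) for row in joint]

print(joint_entropy(joint))  # 3.375 bits
print(p_x)                   # [0.5, 0.25, 0.125, 0.125]
print(p_y)                   # [0.25, 0.25, 0.25, 0.25]
```

Note that the cell probabilities are all powers of two, so the result, 3.375 bits, is exact.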
This motivates the concept of the conditional entropy of X given Y, which is defined as

    H(X|Y) = \sum_{y} p(y) \sum_{x} p(x|y) \log_2 \frac{1}{p(x|y)}.

In words, H(X|Y) is the expected amount of information left in X given that Y has been observed.

Example 6. Compute H(X, Y) and H(X|Y) given the following probability table.

    Y\X |  1     2     3     4
    ----+----------------------
     1  | 1/8   1/16  1/32  1/32
     2  | 1/16  1/8   1/32  1/32
     3  | 1/16  1/16  1/16  1/16
     4  | 1/4   0     0     0

The mutual information I(X; Y) between two random variables is now defined as

    I(X; Y) = H(X) - H(X|Y).

In other words, it is the amount of information shared by X and Y. As an exercise, it can be shown that an alternative definition is

    I(X; Y) = \sum_{(x, y)} p(x, y) \log_2 \frac{p(x, y)}{p(x) p(y)}.

In other words, it is the Kullback-Leibler distance between the joint distribution of X and Y and the distribution p(x) p(y) that would hold if X and Y were independent.

Example 7. Compute I(X; Y) for the variables X and Y defined in Example 6.

References.

1. T. Cover and J. Thomas, "Elements of Information Theory", 2nd Edition, Wiley-Interscience, 2006.

2. L. Mlodinow, "The Drunkard's Walk: How Randomness Rules Our Lives", Vintage Press, 2009.

3. S. Ross, "Introduction to Probability Models", 10th Edition, Academic Press, 2009.

Exercises.

1. Prove that the entropy function H_m(p_1, ..., p_m) satisfies the grouping property.

2. If a soccer team wins 60% of the time, loses 20% of the time, and ties 20% of the time, how much information on average is communicated by a newscaster reporting the result for that team (assuming the report is limited to win, lose, or tie)?

3. Suppose a multiple-choice exam has fifty questions with four responses each, where the correct response is randomly assigned a letter a-d.

   a) Suppose a student who knows nothing about the exam subject takes the exam and guesses each answer. Let X be the random vector that encodes the student's answers. Compute H(X). Assume all questions are answered independently.
   b) Same question as a), but now assume the student has complete mastery of the subject.

4. Now suppose the instructor who gave the exam in the previous problem notices that, for a particular question, the correct answer was marked 67% of the time, the second-best response 20% of the time, the third best 10%, and the worst 3%. Letting X denote the response to that question from a randomly chosen student, calculate H(X).

5. Solve the 12-balls puzzle. Make a tree that indicates what action is taken based on the outcome of each balancing.

6. The inventor of Morse code, Samuel Morse (1791-1872), needed to know the frequency of letters in English text so that he could assign the simplest codewords to the most frequently used letters.
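For checking answers to Example 1 and Exercise 2, here is a minimal Python sketch of the entropy formula from the start of these notes (the helper name and tolerance check are my own):

```python
from math import log2

def entropy(probs):
    """H = -sum of p_i * log2(p_i), in bits; zero-probability outcomes are skipped."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Example 1: weather forecast over {normal, rain, fog, hot, windy}.
print(round(entropy([0.8, 0.10, 0.04, 0.03, 0.03]), 3))

# Exercise 2: win/lose/tie report for a team with a 60%/20%/20% record.
print(round(entropy([0.6, 0.2, 0.2]), 3))

# Sanity checks from the text: a fair binary outcome carries one bit,
# and a certain outcome carries none.
print(entropy([0.5, 0.5]), entropy([1.0]))  # 1.0 0.0
```

Note that a uniform distribution over n outcomes gives log_2 n bits, the maximum possible, consistent with the normalization property above.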