Lecture 1: Entropy, Divergence and Mutual Information

Prof. Dr. Giuseppe Caire, Information and Communication Theory, Department of Telecommunication, Faculty of Electrical Engineering and Computer Systems, TU Berlin, Einsteinufer 25, 10587 Berlin (www.mk.tu-berlin.de)

Copyright G. Caire (Sample Lectures)

Probability

• A random variable $X$ takes on values in the set $\mathcal{X}$. The event $\{X = x\}$ is the event that $X$ takes on the particular value $x \in \mathcal{X}$.
• We write $X \sim P_X$ to denote that $P_X$ is the pmf of $X$, when $X$ is discrete.
• When $|\mathcal{X}| = M$ is finite, the pmf $P_X$ is also represented as the probability vector
$$\mathbf{p} = (p_1, p_2, \ldots, p_M), \qquad p_i = P_X(x_i),$$
where we assume a given (fixed) indexing of the elements of $\mathcal{X}$ with the integers $1, \ldots, M$.
• A probability vector $\mathbf{p}$ has non-negative components that satisfy $\sum_i p_i = 1$; therefore, it is a point in the probability simplex.
• A random sequence, or discrete-time random process, is $\{X_i : i = 1, 2, \ldots\}$.

i.i.d. Sequences and Joint pmf

• The sequence is i.i.d., with marginal pmf $P_X$, if for any distinct indices $i_1, i_2, \ldots, i_n$ we have
$$\mathbb{P}(X_{i_1} = x_{i_1}, \ldots, X_{i_n} = x_{i_n}) = \prod_{j=1}^{n} P_X(x_{i_j}).$$
• We indicate a random $n$-sequence (random vector) as $X^n = (X_1, \ldots, X_n)$.
• A random $n$-sequence $X^n$ takes on values in $\mathcal{X}^n$, the set of (row) vectors of length $n$ over $\mathcal{X}$, denoted by $\mathbf{x} = (x_1, \ldots, x_n)$.
• The joint pmf of $X^n$ is denoted by
$$P_{X^n}(\mathbf{x}) = \mathbb{P}(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n).$$

Conditional pmf

• By definition,
$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A, B)}{\mathbb{P}(B)}$$
(defined only if $\mathbb{P}(B) > 0$).
• Conditional probability mass function of $Y$ given $X$:
$$P_{Y|X}(y \mid x) = \mathbb{P}(Y = y \mid X = x).$$
• Telescopic property of probability:
$$\mathbb{P}(X = x, Y = y, Z = z) = \mathbb{P}(X = x)\, \mathbb{P}(Y = y \mid X = x)\, \mathbb{P}(Z = z \mid X = x, Y = y)$$
(this generalizes to random vectors $X^n$).
• Written in terms of probability mass functions:
$$P_{X,Y,Z}(x, y, z) = P_X(x)\, P_{Y|X}(y \mid x)\, P_{Z|X,Y}(z \mid x, y).$$

Entropy

Definition 1. The entropy $H(X)$ of a discrete random variable $X \sim P_X$ over $\mathcal{X}$ is defined by:
$$H(X) = -\sum_{x \in \mathcal{X}} P_X(x) \log(P_X(x)) = -\mathbb{E}[\log(P_X(X))].$$ ◊

Example 1. Binary entropy function: for $X \sim$ Bernoulli-$p$, we have
$$H(X) = p \log \frac{1}{p} + (1 - p) \log \frac{1}{1 - p} = \mathcal{H}_2(p).$$
More generally, we indicate by $\mathcal{H}(\mathbf{p})$ the entropy denoted as a function of the probability vector $\mathbf{p}$. ◊

Binary Entropy Function $\mathcal{H}_2(p)$

Excerpt shown on the slide (from C. E. Shannon, "A Mathematical Theory of Communication"):

In Appendix 2, the following result is established:

Theorem 2: The only $H$ satisfying the three above assumptions is of the form:
$$H = -K \sum_{i=1}^{n} p_i \log p_i$$
where $K$ is a positive constant.

This theorem, and the assumptions required for its proof, are in no way necessary for the present theory. It is given chiefly to lend a certain plausibility to some of our later definitions. The real justification of these definitions, however, will reside in their implications.

Quantities of the form $H = -\sum p_i \log p_i$ (the constant $K$ merely amounts to a choice of a unit of measure) play a central role in information theory as measures of information, choice and uncertainty. The form of $H$ will be recognized as that of entropy as defined in certain formulations of statistical mechanics [8] where $p_i$ is the probability of a system being in cell $i$ of its phase space. $H$ is then, for example, the $H$ in Boltzmann's famous $H$ theorem. We shall call $H = -\sum p_i \log p_i$ the entropy of the set of probabilities $p_1, \ldots, p_n$. If $x$ is a chance variable we will write $H(x)$ for its entropy; thus $x$ is not an argument of a function but a label for a number, to differentiate it from $H(y)$ say, the entropy of the chance variable $y$.

The entropy in the case of two possibilities with probabilities $p$ and $q = 1 - p$, namely
$$H = -(p \log p + q \log q)$$
is plotted in Fig. 7 as a function of $p$.

[Fig. 7: Entropy in the case of two possibilities with probabilities $p$ and $(1 - p)$. Vertical axis: $H$ in bits; horizontal axis: $p$.]

The quantity $H$ has a number of interesting properties which further substantiate it as a reasonable measure of choice or information.

1. $H = 0$ if and only if all the $p_i$ but one are zero, this one having the value unity. Thus only when we are certain of the outcome does $H$ vanish. Otherwise $H$ is positive.
2. For a given $n$, $H$ is a maximum and equal to $\log n$ when all the $p_i$ are equal (i.e., $\frac{1}{n}$). This is also intuitively the most uncertain situation.

[8] See, for example, R. C. Tolman, Principles of Statistical Mechanics, Oxford, Clarendon, 1938.

Joint and Conditional Entropy

Definition 2. The joint entropy of a discrete random $n$-sequence $X^n \sim P_{X^n}$ over $\mathcal{X}^n$ is:
$$H(X^n) = -\sum_{\mathbf{x} \in \mathcal{X}^n} P_{X^n}(\mathbf{x}) \log(P_{X^n}(\mathbf{x})) = -\mathbb{E}[\log(P_{X^n}(X^n))].$$ ◊

Definition 3. For two jointly distributed random vectors $X^n, Y^m$ over $\mathcal{X}^n$ and $\mathcal{Y}^m$, respectively, with joint pmf $P_{X^n,Y^m}$, the conditional entropy of $X^n$ given $Y^m$ is:
$$H(X^n \mid Y^m) = -\sum_{\mathbf{x} \in \mathcal{X}^n,\, \mathbf{y} \in \mathcal{Y}^m} P_{X^n,Y^m}(\mathbf{x}, \mathbf{y}) \log(P_{X^n|Y^m}(\mathbf{x} \mid \mathbf{y})) = -\mathbb{E}\left[\log P_{X^n|Y^m}(X^n \mid Y^m)\right].$$ ◊

Chain Rule for Entropy
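As a numerical illustration of the entropy, joint entropy and conditional entropy definitions, the sketch below computes $H(X)$, $H(X, Y)$ and $H(Y \mid X)$ for a small joint pmf and checks the standard two-variable chain rule $H(X, Y) = H(X) + H(Y \mid X)$. The identity is assumed here in its usual textbook form, since it is not written out in this excerpt. The joint pmf `P_XY` and all helper names are hypothetical, and logarithms are taken base 2 (bits), which is also an assumption.

```python
import math

# Hypothetical joint pmf P_{X,Y}(x, y) on {0, 1} x {0, 1, 2}; entries sum to 1.
P_XY = {
    (0, 0): 0.25, (0, 1): 0.125, (0, 2): 0.125,
    (1, 0): 0.0,  (1, 1): 0.25,  (1, 2): 0.25,
}

def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    # Zero-probability outcomes contribute nothing (convention 0 log 0 = 0).
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def marginal_x(joint):
    """Marginal pmf P_X(x) = sum over y of P_{X,Y}(x, y)."""
    out = {}
    for (x, _), p in joint.items():
        out[x] = out.get(x, 0.0) + p
    return out

def conditional_entropy_y_given_x(joint):
    """H(Y|X) = -sum over (x, y) of P_{X,Y}(x, y) * log2 P_{Y|X}(y|x)."""
    px = marginal_x(joint)
    total = 0.0
    for (x, _), p in joint.items():
        if p > 0:
            # P_{Y|X}(y|x) = P_{X,Y}(x, y) / P_X(x)
            total -= p * math.log2(p / px[x])
    return total

H_X = entropy(marginal_x(P_XY))
H_XY = entropy(P_XY)
H_Y_given_X = conditional_entropy_y_given_x(P_XY)
print(H_X, H_Y_given_X, H_XY)
```

For this particular pmf the three printed values should satisfy the chain rule up to floating-point rounding.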

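The binary entropy function and the two properties of $H$ quoted from the slide excerpt ($H = 0$ for a certain outcome, and $H = \log n$ for the uniform vector) can be checked numerically. The sketch below evaluates the entropy of a probability vector in bits. The base-2 logarithm is an assumption, since the definitions leave the base unspecified, and the function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Entropy in bits of a probability vector p (convention 0 log 0 = 0)."""
    if any(pi < 0 for pi in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a point in the probability simplex")
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def binary_entropy(p):
    """Binary entropy of a Bernoulli-p random variable, in bits."""
    return entropy_bits([p, 1.0 - p])

# Property 1: the entropy vanishes when one outcome is certain.
print(entropy_bits([1.0, 0.0, 0.0]))
# Property 2: the uniform vector on n = 4 points attains log2(4) bits.
print(entropy_bits([0.25] * 4))
# The binary entropy is symmetric about p = 1/2, where it peaks.
print(binary_entropy(0.5), binary_entropy(0.1), binary_entropy(0.9))
```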

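The product form of the joint pmf of an i.i.d. sequence can be sketched in a few lines. The marginal pmf `P_X`, its three-letter alphabet and the function name `iid_joint_pmf` are hypothetical choices made only for illustration. The code evaluates $P_{X^n}(\mathbf{x}) = \prod_j P_X(x_j)$ and confirms that the joint pmf sums to one over $\mathcal{X}^n$.

```python
import itertools
import math

# Hypothetical marginal pmf P_X over the alphabet {"a", "b", "c"}.
P_X = {"a": 0.5, "b": 0.25, "c": 0.25}

def iid_joint_pmf(x, marginal):
    """Joint pmf of an i.i.d. n-sequence: product of the marginal over the entries of x."""
    return math.prod(marginal[xj] for xj in x)

n = 3
# Enumerate every vector x in X^n and add up the joint pmf values.
total = sum(iid_joint_pmf(x, P_X) for x in itertools.product(P_X, repeat=n))

print(iid_joint_pmf(("a", "b", "a"), P_X))  # equals 0.5 * 0.25 * 0.5
print(total)
```

Enumerating $\mathcal{X}^n$ costs $|\mathcal{X}|^n$ evaluations, so this check is only practical for short sequences over small alphabets.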