
Information-Theoretic Modeling
Lecture 3: Source Coding: Theory
Teemu Roos
Department of Computer Science, University of Helsinki
Fall 2014

Outline

1 Entropy and Information
  Entropy
  Information Inequality
  Data Processing Inequality
2 Data Compression
  Asymptotic Equipartition Property (AEP)
  Typical Sets
  Noiseless Source Coding Theorem

Entropy

Given a discrete random variable X with pmf p_X, we can measure the amount of "surprise" associated with each outcome x \in \mathcal{X} by the quantity

    I_X(x) = \log_2 \frac{1}{p_X(x)} .

The less likely an outcome is, the more surprised we are to observe it. (The point of the log-scale will become clear shortly.)

The entropy of X measures the expected amount of "surprise":

    H(X) = E[I_X(X)] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} .

Binary Entropy Function

For binary-valued X, with p = p_X(1) = 1 - p_X(0), we have

    H(X) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p} .

More Entropies

1 The joint entropy of two (or more) random variables:

    H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X,Y}(x, y)} ,

2 the entropy of a conditional distribution:

    H(X \mid Y = y) = \sum_{x \in \mathcal{X}} p_{X \mid Y}(x \mid y) \log_2 \frac{1}{p_{X \mid Y}(x \mid y)} ,

3 and the conditional entropy:

    H(X \mid Y) = \sum_{y \in \mathcal{Y}} p_Y(y) \, H(X \mid Y = y)
                = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X \mid Y}(x \mid y)} .

The joint entropy H(X, Y) measures the uncertainty about the pair (X, Y). The entropy of the conditional distribution H(X \mid Y = y) measures the uncertainty about X when we know that Y = y. The conditional entropy H(X \mid Y) measures the expected uncertainty about X when the value of Y is known.

Chain Rule of Entropy

Remember the chain rule of probability:

    p_{X,Y}(x, y) = p_Y(y) \cdot p_{X \mid Y}(x \mid y) .

For the entropy we have the chain rule of entropy:

    H(X, Y) = H(Y) + H(X \mid Y) .

Proof. Take logarithms of the chain rule of probability and apply \log(ab) = \log a + \log b together with \log a = -\log(1/a):

    \log_2 \frac{1}{p_{X,Y}(x, y)} = \log_2 \frac{1}{p_Y(y)} + \log_2 \frac{1}{p_{X \mid Y}(x \mid y)} .

Taking expectations with respect to p_{X,Y} on both sides,

    E\!\left[\log_2 \frac{1}{p_{X,Y}(X, Y)}\right] = E\!\left[\log_2 \frac{1}{p_Y(Y)}\right] + E\!\left[\log_2 \frac{1}{p_{X \mid Y}(X \mid Y)}\right] ,

which is exactly H(X, Y) = H(Y) + H(X \mid Y).

The rule can be extended to more than two random variables:

    H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) .

Moreover, X \perp Y \iff H(X \mid Y) = H(X) \iff H(X, Y) = H(X) + H(Y). The logarithmic scale is what makes entropy additive.
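As a small numerical illustration (not part of the original slides), the following Python sketch computes the entropies above for a made-up joint pmf and checks the chain rule H(X, Y) = H(Y) + H(X | Y); the pmf values are arbitrary.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with p(x) = 0 contribute nothing
    return -np.sum(p * np.log2(p))

# An arbitrary example joint pmf p_{X,Y} (rows: values of X, columns: values of Y).
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1)                # marginal pmf of X
p_y = p_xy.sum(axis=0)                # marginal pmf of Y

H_xy = entropy(p_xy.ravel())          # joint entropy H(X, Y)
# Conditional entropy H(X | Y) = sum_y p(y) H(X | Y = y)
H_x_given_y = sum(p_y[j] * entropy(p_xy[:, j] / p_y[j]) for j in range(len(p_y)))

print(H_xy)                           # H(X, Y)
print(entropy(p_y) + H_x_given_y)     # H(Y) + H(X | Y): the same number (chain rule)
```

The masking step p[p > 0] implements the usual convention that terms with zero probability contribute nothing, matching the definitions above; base-2 logarithms give all quantities in bits.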
Mutual Information

The mutual information

    I(X; Y) = H(X) - H(X \mid Y)

measures the average decrease in uncertainty about X when the value of Y becomes known.

Mutual information is symmetric (apply the chain rule in both directions):

    I(X; Y) = H(X) - H(X \mid Y) = H(X) - (H(X, Y) - H(Y))
            = H(Y) - (H(X, Y) - H(X)) = H(Y) - H(Y \mid X) = I(Y; X) .

On the average, X gives as much information about Y as Y gives about X.

Relationships between Entropies

[Figure: diagram of the relationships between H(X, Y), H(X), H(Y), H(X | Y), I(X; Y), and H(Y | X).]

Kullback-Leibler Divergence

The relative entropy or Kullback-Leibler divergence between (discrete) distributions p_X and q_X is defined as

    D(p_X \,\|\, q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} .

(We consider p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} = 0 whenever p_X(x) = 0.)

Information Inequality

For any two (discrete) distributions p_X and q_X, we have

    D(p_X \,\|\, q_X) \geq 0 ,

with equality if and only if p_X(x) = q_X(x) for all x \in \mathcal{X}.

Proof. Gibbs' inequality.
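The following sketch (again not from the slides; the pmfs are made up) evaluates D(p || q) for two small distributions to illustrate the information inequality.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits; terms with p(x) = 0 are taken to be 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Two arbitrary pmfs over the same three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

print(kl_divergence(p, q))   # >= 0, as the information inequality guarantees
print(kl_divergence(q, p))   # also >= 0, but in general a different value
print(kl_divergence(p, p))   # exactly 0, since the two distributions are equal
```

Note that the first two printed values differ: the divergence is not symmetric, so it is not a metric, even though it is nonnegative and vanishes only when the two distributions coincide.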
The information inequality implies

    I(X; Y) \geq 0 .

Proof.

    I(X; Y) = H(X) - H(X \mid Y)
            = H(X) + H(Y) - H(X, Y)
            = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_{X,Y}(x, y)}{p_X(x) \, p_Y(y)}
            = D(p_{X,Y} \,\|\, p_X p_Y) \geq 0 .

In addition, D(p_{X,Y} \,\|\, p_X p_Y) = 0 if and only if p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x \in \mathcal{X}, y \in \mathcal{Y}. This means that the variables X and Y are independent if and only if I(X; Y) = 0.

On the average, knowing another random variable can only reduce the uncertainty about X. However, note that H(X \mid Y = y) may be greater than H(X) for some y ("contradicting evidence").

Properties of Entropy

1 H(X) \geq 0.
  Proof. p_X(x) \leq 1 implies \log_2 \frac{1}{p_X(x)} \geq 0.

2 H(X) \leq \log_2 |\mathcal{X}|.
  Proof. Let u_X(x) = \frac{1}{|\mathcal{X}|} be the uniform distribution over \mathcal{X}. Then

    0 \leq D(p_X \,\|\, u_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{u_X(x)} = \log_2 |\mathcal{X}| - H(X) .

A combinatorial approach to the definition of information (Boltzmann, 1896; Hartley, 1928; Kolmogorov, 1965):

    S = k \ln W .

[Figure: portrait of Ludwig Boltzmann (1844-1906).]
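To close, here is a small numerical check (not part of the slides; the joint pmf is a made-up example) of the bounds 0 <= H(X) <= log2 |X| and of the remark above that H(X | Y = y) can exceed H(X) for a particular y even though H(X | Y) <= H(X) on average.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Made-up joint pmf (rows: x1, x2; columns: y1, y2). The prior on X is skewed,
# but the unlikely observation y2 leaves X completely uncertain.
p_xy = np.array([[0.90, 0.05],
                 [0.00, 0.05]])
p_x = p_xy.sum(axis=1)                # [0.95, 0.05]
p_y = p_xy.sum(axis=0)                # [0.90, 0.10]

print(0 <= entropy(p_x) <= np.log2(len(p_x)))            # True: 0 <= H(X) <= log2 |X|

H_x_given_each_y = [entropy(p_xy[:, j] / p_y[j]) for j in range(len(p_y))]
print(H_x_given_each_y, entropy(p_x))                    # H(X | Y = y2) = 1 bit > H(X)
print(np.dot(p_y, H_x_given_each_y) <= entropy(p_x))     # True: H(X | Y) <= H(X) on average
```

Here observing y2 is "contradicting evidence": it raises the uncertainty about X to a full bit, yet the weighted average over both outcomes of Y still falls below H(X), in line with I(X; Y) >= 0.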