Information-Theoretic Modeling
Lecture 3: Source Coding: Theory
Teemu Roos
Department of Computer Science, University of Helsinki
Fall 2014
Outline
1 Entropy and Information: Entropy; Information Inequality; Data Processing Inequality
2 Data Compression: Asymptotic Equipartition Property (AEP); Typical Sets; Noiseless Source Coding Theorem
Entropy
Given a discrete random variable X with pmf $p_X$, we can measure the amount of "surprise" associated with each outcome $x \in \mathcal{X}$ by the quantity
$$I_X(x) = \log_2 \frac{1}{p_X(x)} .$$
The less likely an outcome is, the more surprised we are to observe it. (The point of the log-scale will become clear shortly.)
The entropy of X measures the expected amount of "surprise":
$$H(X) = E[I_X(X)] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} .$$
Binary Entropy Function
For binary-valued X, with $p = p_X(1) = 1 - p_X(0)$, we have
$$H(X) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p} .$$
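A small Python sketch of these definitions (not from the slides; the pmf values and helper names are illustrative examples): it computes the surprise of an outcome, the entropy of a pmf, and evaluates the binary entropy function.

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return math.log2(1.0 / p)

def entropy(pmf: dict) -> float:
    """H(X) = sum_x p(x) log2(1/p(x)), in bits; terms with p(x) = 0 contribute 0."""
    return sum(p * surprise(p) for p in pmf.values() if p > 0)

# A skewed three-symbol source (example values).
p_X = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p_X))                    # 1.5 bits

# The binary entropy function H(p) = p log2(1/p) + (1-p) log2(1/(1-p)).
def binary_entropy(p: float) -> float:
    return entropy({0: 1 - p, 1: p})

print(binary_entropy(0.5))             # 1.0 bit (maximal)
print(binary_entropy(0.11))            # about 0.5 bits
```

The binary entropy is maximal (one bit) at p = 1/2 and tends to 0 as p approaches 0 or 1.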
More Entropies
1. the joint entropy of two (or more) random variables:
$$H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X,Y}(x, y)} ,$$
2. the entropy of a conditional distribution:
$$H(X \mid Y = y) = \sum_{x \in \mathcal{X}} p_{X|Y}(x \mid y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} ,$$
3. and the conditional entropy:
$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} p_Y(y)\, H(X \mid Y = y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} .$$
The joint entropy H(X, Y) measures the uncertainty about the pair (X, Y).
The entropy of the conditional distribution H(X | Y = y) measures the uncertainty about X when we know that Y = y.
The conditional entropy H(X | Y) measures the expected uncertainty about X when the value of Y is known.
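As a concrete illustration (the joint pmf below is a made-up example, not from the lecture), the following sketch computes all three quantities directly from these definitions.

```python
import math

# Example joint pmf p_{X,Y}(x, y) over X in {0,1}, Y in {0,1}.
p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf):
    """Entropy of a pmf given as a dict of probabilities, in bits."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Joint entropy H(X, Y).
H_XY = H(p_XY)

# Marginal p_Y and the conditional distributions p_{X|Y}(. | y).
p_Y = {}
for (x, y), p in p_XY.items():
    p_Y[y] = p_Y.get(y, 0.0) + p
p_X_given_Y = {y: {x: p_XY[(x, y)] / p_Y[y] for x in (0, 1)} for y in (0, 1)}

# Entropy of a conditional distribution, H(X | Y = y), for each y.
H_X_given_y = {y: H(p_X_given_Y[y]) for y in (0, 1)}

# Conditional entropy H(X | Y) = sum_y p(y) H(X | Y = y).
H_X_given_Y = sum(p_Y[y] * H_X_given_y[y] for y in (0, 1))

print(H_XY, H_X_given_y, H_X_given_Y)
```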
Chain Rule of Entropy
Remember the chain rule of probability:
$$p_{X,Y}(x, y) = p_Y(y) \cdot p_{X|Y}(x \mid y) .$$
For the entropy we have:
Chain Rule of Entropy
$$H(X, Y) = H(Y) + H(X \mid Y) .$$
Proof. Start from the chain rule of probability,
$$p_{X,Y}(x, y) = p_Y(y) \cdot p_{X|Y}(x \mid y) ,$$
and apply $\log(ab) = \log a + \log b$:
$$\log_2 p_{X,Y}(x, y) = \log_2 p_Y(y) + \log_2 p_{X|Y}(x \mid y) .$$
Next apply $-\log a = \log(1/a)$:
$$\log_2 \frac{1}{p_{X,Y}(x, y)} = \log_2 \frac{1}{p_Y(y)} + \log_2 \frac{1}{p_{X|Y}(x \mid y)} .$$
Taking expectations on both sides,
$$E\left[\log_2 \frac{1}{p_{X,Y}(X, Y)}\right] = E\left[\log_2 \frac{1}{p_Y(Y)}\right] + E\left[\log_2 \frac{1}{p_{X|Y}(X \mid Y)}\right]$$
$$\iff H(X, Y) = H(Y) + H(X \mid Y) . \qquad \Box$$
The rule can be extended to more than two random variables:
$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) .$$
$$X \perp Y \iff H(X \mid Y) = H(X) \iff H(X, Y) = H(X) + H(Y) .$$
Logarithmic scale makes entropy additive.
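A quick numerical check of the chain rule (an illustrative sketch; the joint pmf and the helper H are not from the slides): H(X, Y) should equal H(Y) + H(X | Y), and for an independent pair the entropies simply add.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

# H(X | Y) computed from its definition.
H_X_given_Y = sum(
    p_Y[y] * H({x: p_XY[(x, y)] / p_Y[y] for x in (0, 1)}) for y in (0, 1)
)

# Chain rule: H(X, Y) = H(Y) + H(X | Y).
assert abs(H(p_XY) - (H(p_Y) + H_X_given_Y)) < 1e-12

# Independence: H(X, Y) = H(X) + H(Y) when p_{X,Y} = p_X * p_Y.
p_X = {0: 0.7, 1: 0.3}
q_Y = {0: 0.6, 1: 0.4}
p_indep = {(x, y): p_X[x] * q_Y[y] for x in p_X for y in q_Y}
assert abs(H(p_indep) - (H(p_X) + H(q_Y))) < 1e-12
```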
Mutual Information
The mutual information
$$I(X; Y) = H(X) - H(X \mid Y)$$
measures the average decrease in uncertainty about X when the value of Y becomes known.
Mutual information is symmetric (chain rule):
$$I(X; Y) = H(X) - H(X \mid Y) = H(X) - \big(H(X, Y) - H(Y)\big) = H(X) - H(X, Y) + H(Y)$$
$$= H(Y) - H(Y \mid X) = I(Y; X) .$$
On the average, X gives as much information about Y as Y gives about X.
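A sketch checking this symmetry numerically (the joint pmf is again an arbitrary example): computing H(X) − H(X | Y) and H(Y) − H(Y | X) gives the same value.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_X = {x: sum(p for (xx, y), p in p_XY.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

def cond_entropy(joint, marginal_of_condition):
    """H(first | second) via the chain rule: H(first, second) - H(second)."""
    return H(joint) - H(marginal_of_condition)

I_XY = H(p_X) - cond_entropy(p_XY, p_Y)   # H(X) - H(X | Y)
I_YX = H(p_Y) - cond_entropy(p_XY, p_X)   # H(Y) - H(Y | X)
print(I_XY, I_YX)                          # equal (up to rounding)
```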
Relationships between Entropies
[Figure: Venn diagram of the entropies. The total area is H(X, Y); the two overlapping circles are H(X) and H(Y); the non-overlapping parts are H(X | Y) and H(Y | X), and the overlap is I(X; Y).]
Information Inequality
Kullback-Leibler Divergence
The relative entropy or Kullback-Leibler divergence between (discrete) distributions $p_X$ and $q_X$ is defined as
$$D(p_X \,\|\, q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} .$$
(We consider $p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} = 0$ whenever $p_X(x) = 0$.)
Information Inequality
For any two (discrete) distributions $p_X$ and $q_X$, we have
$$D(p_X \,\|\, q_X) \geq 0 ,$$
with equality iff $p_X(x) = q_X(x)$ for all $x \in \mathcal{X}$.
Proof. Gibbs' inequality!
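A minimal sketch of the definition (the two distributions are arbitrary examples): D(p‖q) comes out positive when p ≠ q, zero when p = q, and it is not symmetric in its arguments.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)), with terms where p(x) = 0 treated as 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

print(kl_divergence(p, q))   # > 0 since p != q
print(kl_divergence(p, p))   # 0.0
print(kl_divergence(q, p))   # also > 0, but not equal to D(p || q): not symmetric
```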
The information inequality implies
$$I(X; Y) \geq 0 .$$
Proof.
$$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)$$
$$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} = D(p_{X,Y} \,\|\, p_X p_Y) \geq 0 .$$
In addition, $D(p_{X,Y} \,\|\, p_X p_Y) = 0$ iff $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$. This means that the variables X and Y are independent iff $I(X; Y) = 0$.
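The identity above is easy to verify numerically; the following sketch (with the same kind of made-up joint pmf as before) computes I(X; Y) both as H(X) + H(Y) − H(X, Y) and as the KL divergence between the joint and the product of its marginals.

```python
import math

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_X = {x: sum(p for (xx, y), p in p_XY.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y).
I_XY = H(p_X) + H(p_Y) - H(p_XY)

# D(p_{X,Y} || p_X p_Y).
D = sum(
    p * math.log2(p / (p_X[x] * p_Y[y])) for (x, y), p in p_XY.items() if p > 0
)

assert abs(I_XY - D) < 1e-12 and I_XY >= 0
```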
Properties of Entropy
Properties of entropy:
1. $H(X) \geq 0$
Proof. $p_X(x) \leq 1 \Rightarrow \log_2 \frac{1}{p_X(x)} \geq 0$.
2. $H(X) \leq \log_2 |\mathcal{X}|$
Proof. Let $u_X(x) = \frac{1}{|\mathcal{X}|}$ be the uniform distribution over $\mathcal{X}$. Then
$$0 \leq D(p_X \,\|\, u_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{u_X(x)} = \log_2 |\mathcal{X}| - H(X) .$$
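A tiny sketch (arbitrary example pmf) of both properties: entropy is nonnegative and never exceeds $\log_2 |\mathcal{X}|$, with the maximum attained by the uniform distribution.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p = {"a": 0.7, "b": 0.2, "c": 0.1}
u = {s: 1 / 3 for s in p}                 # uniform distribution over the same alphabet

assert 0 <= H(p) <= math.log2(len(p))     # 0 <= H(X) <= log2 |alphabet|
assert abs(H(u) - math.log2(3)) < 1e-12   # the uniform distribution achieves the maximum
```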
A combinatorial approach to the definition of information (Boltzmann, 1896; Hartley, 1928; Kolmogorov, 1965):
$$S = k \ln W .$$
[Figure: portrait of Ludwig Boltzmann (1844–1906).]
3. $H(X \mid Y) \leq H(X)$
Proof. $0 \leq I(X; Y) = H(X) - H(X \mid Y)$.
On the average, knowing another r.v. can only reduce uncertainty about X. However, note that $H(X \mid Y = y)$ may be greater than H(X) for some y ("contradicting evidence").

Chain Rule of Mutual Information
The conditional mutual information of variables X and Y given Z is defined as
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) .$$
Chain Rule of Mutual Information
For random variables $X$ and $Y_1, \ldots, Y_n$ we have
$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}) .$$
If $Y_1, \ldots, Y_n$ are independent both among themselves and conditionally given X, this implies
$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i) .$$
Data Processing Inequality
Let X, Y, Z be (discrete) random variables. If Z is conditionally independent of X given Y, i.e., if we have
$$p_{Z|X,Y}(z \mid x, y) = p_{Z|Y}(z \mid y) \quad \text{for all } x, y, z,$$
then X, Y, Z form a Markov chain $X \to Y \to Z$.
For instance, if Y is a "noisy" measurement of X, and Z = f(Y) is the outcome of deterministic data processing performed on Y, then we have $X \to Y \to Z$.
This implies that
$$I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid Y, X) = 0 .$$
When Y is known, Z doesn't give any extra information about X (and vice versa).
Assuming that $X \to Y \to Z$ is a Markov chain, we get
$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y) .$$
Now, because $I(X; Z \mid Y) = 0$ and $I(X; Y \mid Z) \geq 0$, we obtain:
Data Processing Inequality
If $X \to Y \to Z$ is a Markov chain, then we have
$$I(X; Z) \leq I(X; Y) .$$
No data processing can increase the amount of information that we have about X.
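As an illustration (the source distribution and the two channel flip probabilities below are made-up parameters), one can build a small Markov chain X → Y → Z, where Y is a noisy copy of X and Z a further-degraded copy of Y, and check that I(X; Z) ≤ I(X; Y).

```python
import math
from itertools import product

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

def mutual_information(p_joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint pmf over pairs (a, b)."""
    p_A, p_B = {}, {}
    for (a, b), p in p_joint.items():
        p_A[a] = p_A.get(a, 0.0) + p
        p_B[b] = p_B.get(b, 0.0) + p
    return H(p_A) + H(p_B) - H(p_joint)

p_X = {0: 0.5, 1: 0.5}   # source
flip_Y = 0.1             # P(Y != X): first binary symmetric channel
flip_Z = 0.2             # P(Z != Y): second, noisier channel

# Joint pmf of (X, Y, Z) under the Markov chain X -> Y -> Z.
p_XYZ = {
    (x, y, z): p_X[x]
    * (1 - flip_Y if y == x else flip_Y)
    * (1 - flip_Z if z == y else flip_Z)
    for x, y, z in product((0, 1), repeat=3)
}

p_XY, p_XZ = {}, {}
for (x, y, z), p in p_XYZ.items():
    p_XY[(x, y)] = p_XY.get((x, y), 0.0) + p
    p_XZ[(x, z)] = p_XZ.get((x, z), 0.0) + p

I_XY = mutual_information(p_XY)
I_XZ = mutual_information(p_XZ)
print(I_XY, I_XZ)
assert I_XZ <= I_XY + 1e-12   # data processing inequality
```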
Teemu Roos Information-Theoretic Modeling Outline Asymptotic Equipartition Property (AEP) Entropy and Information Typical Sets Data Compression Noiseless Source Coding Theorem
1 Entropy and Information Entropy Information Inequality Data Processing Inequality
2 Data Compression Asymptotic Equipartition Property (AEP) Typical Sets Noiseless Source Coding Theorem
AEP
If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed (i.i.d.) r.v.'s with domain $\mathcal{X}$ and pmf $p_X$, then
$$\log_2 \frac{1}{p_X(X_1)},\ \log_2 \frac{1}{p_X(X_2)},\ \ldots$$
is also an i.i.d. sequence of r.v.'s.
The expected values of the elements of the above sequence are all equal to the entropy:
$$E\left[\log_2 \frac{1}{p_X(X_i)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = H(X) \quad \text{for all } i \in \mathbb{N} .$$
The i.i.d. assumption is equivalent to
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_X(x_i)$$
$$\iff \frac{1}{p(x_1, \ldots, x_n)} = \prod_{i=1}^{n} \frac{1}{p_X(x_i)}$$
$$\iff \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)}$$
$$\iff \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)} .$$
By the (weak) law of large numbers, the average on the right-hand side converges in probability to its mean, i.e., the entropy:
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1 \quad \text{for all } \epsilon > 0 .$$
Asymptotic Equipartition Property (AEP)
For i.i.d. sequences, we have
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] = 1$$
for all $\epsilon > 0$.
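A Monte Carlo sketch of the AEP (the source distribution, sequence lengths, ε, and trial count below are arbitrary choices, not from the slides): for i.i.d. draws, the per-symbol surprise (1/n) log₂(1/p(x₁, …, xₙ)) concentrates around H(X) as n grows.

```python
import math
import random

random.seed(0)

symbols = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]
surprise = {s: math.log2(1.0 / p) for s, p in zip(symbols, probs)}
H = sum(p * math.log2(1.0 / p) for p in probs)   # entropy of the source, about 1.485 bits

def per_symbol_surprise(n: int) -> float:
    """Draw an i.i.d. sequence of length n; return (1/n) log2(1 / p(x_1, ..., x_n))."""
    seq = random.choices(symbols, weights=probs, k=n)
    return sum(surprise[s] for s in seq) / n

epsilon, trials = 0.05, 500
for n in (10, 100, 1000, 10000):
    hits = sum(abs(per_symbol_surprise(n) - H) < epsilon for _ in range(trials))
    print(f"n={n:6d}  fraction within {epsilon} of H(X): {hits / trials:.2f}")
# The fraction tends to 1 as n grows, as the AEP predicts.
```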
The AEP states that for any $\epsilon > 0$, and large enough n, we have
$$\Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} - H(X)\right| < \epsilon\right] \approx 1 ,$$
and the event inside the brackets is equivalent to
$$H(X) - \epsilon < \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} < H(X) + \epsilon$$
$$\iff n(H(X) - \epsilon) < \log_2 \frac{1}{p(x_1, \ldots, x_n)} < n(H(X) + \epsilon)$$
$$\iff 2^{n(H(X) - \epsilon)} < \frac{1}{p(x_1, \ldots, x_n)} < 2^{n(H(X) + \epsilon)}$$
$$\iff 2^{-n(H(X) + \epsilon)} < p(x_1, \ldots, x_n) < 2^{-n(H(X) - \epsilon)}$$
$$\iff \Pr\left[p(x_1, \ldots, x_n) = 2^{-n(H(X) \pm \epsilon)}\right] \approx 1 .$$
Asymptotic Equipartition Property (informally)
"Almost all sequences are almost equally likely."
Technically, the key step in the proof was using the weak law of large numbers to deduce
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1 \quad \text{for all } \epsilon > 0 .$$
In other words, with high probability the average "surprisingness" $\log_2 \frac{1}{p_X(X_i)}$ over the sequence is close to its expectation.
Typical Sets
Typical Set
The typical set $A_\epsilon^{(n)}$ is the set of sequences $(x_1, \ldots, x_n) \in \mathcal{X}^n$ with the property:
$$2^{-n(H(X) + \epsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X) - \epsilon)} .$$
The AEP states that
$$\lim_{n \to \infty} \Pr\left[X^n \in A_\epsilon^{(n)}\right] = 1 .$$
In particular, for any $\epsilon > 0$, and large enough n, we have
$$\Pr\left[X^n \in A_\epsilon^{(n)}\right] > 1 - \epsilon .$$
How many sequences are there in the typical set $A_\epsilon^{(n)}$? We can use the fact that by definition each sequence has probability at least $2^{-n(H(X) + \epsilon)}$. Since the total probability of all the sequences in $A_\epsilon^{(n)}$ is trivially at most 1, there can't be too many of them.
$$1 \geq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \geq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X) + \epsilon)} = 2^{-n(H(X) + \epsilon)} \left|A_\epsilon^{(n)}\right|$$
$$\iff \left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X) + \epsilon)} .$$
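A brute-force sketch of this bound (a binary source with an arbitrary bias; n is kept small so that all 2ⁿ sequences can be enumerated): it collects the typical set, checks that $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$, and reports the set's total probability.

```python
import math
from itertools import product

p1 = 0.3                                  # P(X = 1); arbitrary example bias
pmf = {0: 1 - p1, 1: p1}
H = sum(p * math.log2(1.0 / p) for p in pmf.values())

n, eps = 12, 0.2

typical = []
total_prob = 0.0
for seq in product((0, 1), repeat=n):
    p_seq = math.prod(pmf[x] for x in seq)
    # Typical iff 2^{-n(H+eps)} <= p(seq) <= 2^{-n(H-eps)}.
    if 2 ** (-n * (H + eps)) <= p_seq <= 2 ** (-n * (H - eps)):
        typical.append(seq)
        total_prob += p_seq

print(len(typical), 2 ** (n * (H + eps)))   # |A| and its upper bound
print(total_prob)                            # around 0.8 for these settings; exceeds 1 - eps for large n
assert len(typical) <= 2 ** (n * (H + eps))
```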
Is it possible that the typical set $A_\epsilon^{(n)}$ is very small? This time we can use the fact that by definition each sequence has probability at most $2^{-n(H(X) - \epsilon)}$. Since for large enough n, the total probability of all the sequences in $A_\epsilon^{(n)}$ is (by the AEP) at least $1 - \epsilon$, there can't be too few of them.
$$1 - \epsilon \leq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \leq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X) - \epsilon)} = 2^{-n(H(X) - \epsilon)} \left|A_\epsilon^{(n)}\right|$$
$$\iff \left|A_\epsilon^{(n)}\right| \geq (1 - \epsilon)\, 2^{n(H(X) - \epsilon)} .$$