Information-Theoretic Modeling
Lecture 3: Source Coding: Theory
Teemu Roos
Department of Computer Science, University of Helsinki
Fall 2014
Outline
1 Entropy and Information: Entropy; Information Inequality; Data Processing Inequality
2 Data Compression: Asymptotic Equipartition Property (AEP); Typical Sets; Noiseless Source Coding Theorem
Entropy
Given a discrete random variable X with pmf $p_X$, we can measure the amount of "surprise" associated with each outcome $x \in \mathcal{X}$ by the quantity
$$I_X(x) = \log_2 \frac{1}{p_X(x)} .$$
The less likely an outcome is, the more surprised we are to observe it. (The point of the log-scale will become clear shortly.)
The entropy of X measures the expected amount of "surprise":
$$H(X) = E[I_X(X)] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} .$$
Binary Entropy Function
For binary-valued X, with $p = p_X(1) = 1 - p_X(0)$, we have
$$H(X) = p \log_2 \frac{1}{p} + (1 - p) \log_2 \frac{1}{1 - p} .$$
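A small Python sketch of these definitions (not from the slides; the pmf values and helper names are illustrative examples): it computes the surprise of an outcome, the entropy of a pmf, and evaluates the binary entropy function.

```python
import math

def surprise(p: float) -> float:
    """Surprise (self-information) of an outcome with probability p, in bits."""
    return math.log2(1.0 / p)

def entropy(pmf: dict) -> float:
    """H(X) = sum_x p(x) log2(1/p(x)), in bits; terms with p(x) = 0 contribute 0."""
    return sum(p * surprise(p) for p in pmf.values() if p > 0)

# A skewed three-symbol source (example values).
p_X = {"a": 0.5, "b": 0.25, "c": 0.25}
print(entropy(p_X))                    # 1.5 bits

# The binary entropy function H(p) = p log2(1/p) + (1-p) log2(1/(1-p)).
def binary_entropy(p: float) -> float:
    return entropy({0: 1 - p, 1: p})

print(binary_entropy(0.5))             # 1.0 bit (maximal)
print(binary_entropy(0.11))            # about 0.5 bits
```

The binary entropy is maximal (one bit) at p = 1/2 and tends to 0 as p approaches 0 or 1.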
More Entropies
1. the joint entropy of two (or more) random variables:
$$H(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X,Y}(x, y)} ,$$
2. the entropy of a conditional distribution:
$$H(X \mid Y = y) = \sum_{x \in \mathcal{X}} p_{X|Y}(x \mid y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} ,$$
3. and the conditional entropy:
$$H(X \mid Y) = \sum_{y \in \mathcal{Y}} p_Y(y)\, H(X \mid Y = y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{1}{p_{X|Y}(x \mid y)} .$$
The joint entropy H(X, Y) measures the uncertainty about the pair (X, Y).
The entropy of the conditional distribution H(X | Y = y) measures the uncertainty about X when we know that Y = y.
The conditional entropy H(X | Y) measures the expected uncertainty about X when the value of Y is known.
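As a concrete illustration (the joint pmf below is a made-up example, not from the lecture), the following sketch computes all three quantities directly from these definitions.

```python
import math

# Example joint pmf p_{X,Y}(x, y) over X in {0,1}, Y in {0,1}.
p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf):
    """Entropy of a pmf given as a dict of probabilities, in bits."""
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# Joint entropy H(X, Y).
H_XY = H(p_XY)

# Marginal p_Y and the conditional distributions p_{X|Y}(. | y).
p_Y = {}
for (x, y), p in p_XY.items():
    p_Y[y] = p_Y.get(y, 0.0) + p
p_X_given_Y = {y: {x: p_XY[(x, y)] / p_Y[y] for x in (0, 1)} for y in (0, 1)}

# Entropy of a conditional distribution, H(X | Y = y), for each y.
H_X_given_y = {y: H(p_X_given_Y[y]) for y in (0, 1)}

# Conditional entropy H(X | Y) = sum_y p(y) H(X | Y = y).
H_X_given_Y = sum(p_Y[y] * H_X_given_y[y] for y in (0, 1))

print(H_XY, H_X_given_y, H_X_given_Y)
```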
Chain Rule of Entropy
Remember the chain rule of probability:
$$p_{X,Y}(x, y) = p_Y(y) \cdot p_{X|Y}(x \mid y) .$$
For the entropy we have:
Chain Rule of Entropy
$$H(X, Y) = H(Y) + H(X \mid Y) .$$
Proof. Start from the chain rule of probability,
$$p_{X,Y}(x, y) = p_Y(y) \cdot p_{X|Y}(x \mid y) ,$$
and apply $\log(ab) = \log a + \log b$:
$$\log_2 p_{X,Y}(x, y) = \log_2 p_Y(y) + \log_2 p_{X|Y}(x \mid y) .$$
Next apply $-\log a = \log(1/a)$:
$$\log_2 \frac{1}{p_{X,Y}(x, y)} = \log_2 \frac{1}{p_Y(y)} + \log_2 \frac{1}{p_{X|Y}(x \mid y)} .$$
Taking expectations on both sides,
$$E\left[\log_2 \frac{1}{p_{X,Y}(X, Y)}\right] = E\left[\log_2 \frac{1}{p_Y(Y)}\right] + E\left[\log_2 \frac{1}{p_{X|Y}(X \mid Y)}\right]$$
$$\iff H(X, Y) = H(Y) + H(X \mid Y) . \qquad \Box$$
The rule can be extended to more than two random variables:
$$H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1}) .$$
$$X \perp Y \iff H(X \mid Y) = H(X) \iff H(X, Y) = H(X) + H(Y) .$$
Logarithmic scale makes entropy additive.
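A quick numerical check of the chain rule (an illustrative sketch; the joint pmf and the helper H are not from the slides): H(X, Y) should equal H(Y) + H(X | Y), and for an independent pair the entropies simply add.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

# H(X | Y) computed from its definition.
H_X_given_Y = sum(
    p_Y[y] * H({x: p_XY[(x, y)] / p_Y[y] for x in (0, 1)}) for y in (0, 1)
)

# Chain rule: H(X, Y) = H(Y) + H(X | Y).
assert abs(H(p_XY) - (H(p_Y) + H_X_given_Y)) < 1e-12

# Independence: H(X, Y) = H(X) + H(Y) when p_{X,Y} = p_X * p_Y.
p_X = {0: 0.7, 1: 0.3}
q_Y = {0: 0.6, 1: 0.4}
p_indep = {(x, y): p_X[x] * q_Y[y] for x in p_X for y in q_Y}
assert abs(H(p_indep) - (H(p_X) + H(q_Y))) < 1e-12
```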
Mutual Information
The mutual information
$$I(X; Y) = H(X) - H(X \mid Y)$$
measures the average decrease in uncertainty about X when the value of Y becomes known.
Mutual information is symmetric (chain rule):
$$I(X; Y) = H(X) - H(X \mid Y) = H(X) - \big(H(X, Y) - H(Y)\big) = H(X) - H(X, Y) + H(Y)$$
$$= H(Y) - H(Y \mid X) = I(Y; X) .$$
On the average, X gives as much information about Y as Y gives about X.
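A sketch checking this symmetry numerically (the joint pmf is again an arbitrary example): computing H(X) − H(X | Y) and H(Y) − H(Y | X) gives the same value.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_X = {x: sum(p for (xx, y), p in p_XY.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

def cond_entropy(joint, marginal_of_condition):
    """H(first | second) via the chain rule: H(first, second) - H(second)."""
    return H(joint) - H(marginal_of_condition)

I_XY = H(p_X) - cond_entropy(p_XY, p_Y)   # H(X) - H(X | Y)
I_YX = H(p_Y) - cond_entropy(p_XY, p_X)   # H(Y) - H(Y | X)
print(I_XY, I_YX)                          # equal (up to rounding)
```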
Relationships between Entropies
[Figure: Venn diagram of the entropies. The total area is H(X, Y); the two overlapping circles are H(X) and H(Y); the non-overlapping parts are H(X | Y) and H(Y | X), and the overlap is I(X; Y).]
Information Inequality
Kullback-Leibler Divergence
The relative entropy or Kullback-Leibler divergence between (discrete) distributions $p_X$ and $q_X$ is defined as
$$D(p_X \,\|\, q_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} .$$
(We consider $p_X(x) \log_2 \frac{p_X(x)}{q_X(x)} = 0$ whenever $p_X(x) = 0$.)
Information Inequality
For any two (discrete) distributions $p_X$ and $q_X$, we have
$$D(p_X \,\|\, q_X) \geq 0 ,$$
with equality iff $p_X(x) = q_X(x)$ for all $x \in \mathcal{X}$.
Proof. Gibbs' inequality!
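A minimal sketch of the definition (the two distributions are arbitrary examples): D(p‖q) comes out positive when p ≠ q, zero when p = q, and it is not symmetric in its arguments.

```python
import math

def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) = sum_x p(x) log2(p(x)/q(x)), with terms where p(x) = 0 treated as 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

print(kl_divergence(p, q))   # > 0 since p != q
print(kl_divergence(p, p))   # 0.0
print(kl_divergence(q, p))   # also > 0, but not equal to D(p || q): not symmetric
```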
The information inequality implies
$$I(X; Y) \geq 0 .$$
Proof.
$$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)$$
$$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log_2 \frac{p_{X,Y}(x, y)}{p_X(x)\, p_Y(y)} = D(p_{X,Y} \,\|\, p_X p_Y) \geq 0 .$$
In addition, $D(p_{X,Y} \,\|\, p_X p_Y) = 0$ iff $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$. This means that the variables X and Y are independent iff $I(X; Y) = 0$.
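The identity above is easy to verify numerically; the following sketch (with the same kind of made-up joint pmf as before) computes I(X; Y) both as H(X) + H(Y) − H(X, Y) and as the KL divergence between the joint and the product of its marginals.

```python
import math

p_XY = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_X = {x: sum(p for (xx, y), p in p_XY.items() if xx == x) for x in (0, 1)}
p_Y = {y: sum(p for (x, yy), p in p_XY.items() if yy == y) for y in (0, 1)}

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

# I(X;Y) = H(X) + H(Y) - H(X,Y).
I_XY = H(p_X) + H(p_Y) - H(p_XY)

# D(p_{X,Y} || p_X p_Y).
D = sum(
    p * math.log2(p / (p_X[x] * p_Y[y])) for (x, y), p in p_XY.items() if p > 0
)

assert abs(I_XY - D) < 1e-12 and I_XY >= 0
```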
Properties of Entropy
Properties of entropy:
1. $H(X) \geq 0$
Proof. $p_X(x) \leq 1 \Rightarrow \log_2 \frac{1}{p_X(x)} \geq 0$.
2. $H(X) \leq \log_2 |\mathcal{X}|$
Proof. Let $u_X(x) = \frac{1}{|\mathcal{X}|}$ be the uniform distribution over $\mathcal{X}$. Then
$$0 \leq D(p_X \,\|\, u_X) = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{p_X(x)}{u_X(x)} = \log_2 |\mathcal{X}| - H(X) .$$
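A tiny sketch (arbitrary example pmf) of both properties: entropy is nonnegative and never exceeds $\log_2 |\mathcal{X}|$, with the maximum attained by the uniform distribution.

```python
import math

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

p = {"a": 0.7, "b": 0.2, "c": 0.1}
u = {s: 1 / 3 for s in p}                 # uniform distribution over the same alphabet

assert 0 <= H(p) <= math.log2(len(p))     # 0 <= H(X) <= log2 |alphabet|
assert abs(H(u) - math.log2(3)) < 1e-12   # the uniform distribution achieves the maximum
```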
A combinatorial approach to the definition of information (Boltzmann, 1896; Hartley, 1928; Kolmogorov, 1965):
$$S = k \ln W .$$
[Figure: portrait of Ludwig Boltzmann (1844–1906).]
3. $H(X \mid Y) \leq H(X)$
Proof. $0 \leq I(X; Y) = H(X) - H(X \mid Y)$.
On the average, knowing another r.v. can only reduce uncertainty about X. However, note that $H(X \mid Y = y)$ may be greater than H(X) for some y ("contradicting evidence").

Chain Rule of Mutual Information
The conditional mutual information of variables X and Y given Z is defined as
$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) .$$
Chain Rule of Mutual Information
For random variables $X$ and $Y_1, \ldots, Y_n$ we have
$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1}) .$$
If $Y_1, \ldots, Y_n$ are independent both among themselves and conditionally given X, this implies
$$I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i) .$$
Data Processing Inequality
Let X, Y, Z be (discrete) random variables. If Z is conditionally independent of X given Y, i.e., if we have
$$p_{Z|X,Y}(z \mid x, y) = p_{Z|Y}(z \mid y) \quad \text{for all } x, y, z,$$
then X, Y, Z form a Markov chain $X \to Y \to Z$.
For instance, if Y is a "noisy" measurement of X, and Z = f(Y) is the outcome of deterministic data processing performed on Y, then we have $X \to Y \to Z$.
This implies that
$$I(X; Z \mid Y) = H(Z \mid Y) - H(Z \mid Y, X) = 0 .$$
When Y is known, Z doesn't give any extra information about X (and vice versa).
Assuming that $X \to Y \to Z$ is a Markov chain, we get
$$I(X; Y, Z) = I(X; Z) + I(X; Y \mid Z) = I(X; Y) + I(X; Z \mid Y) .$$
Now, because $I(X; Z \mid Y) = 0$ and $I(X; Y \mid Z) \geq 0$, we obtain:
Data Processing Inequality
If $X \to Y \to Z$ is a Markov chain, then we have
$$I(X; Z) \leq I(X; Y) .$$
No data processing can increase the amount of information that we have about X.
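As an illustration (the source distribution and the two channel flip probabilities below are made-up parameters), one can build a small Markov chain X → Y → Z, where Y is a noisy copy of X and Z a further-degraded copy of Y, and check that I(X; Z) ≤ I(X; Y).

```python
import math
from itertools import product

def H(pmf):
    return sum(p * math.log2(1.0 / p) for p in pmf.values() if p > 0)

def mutual_information(p_joint):
    """I(A;B) = H(A) + H(B) - H(A,B) for a joint pmf over pairs (a, b)."""
    p_A, p_B = {}, {}
    for (a, b), p in p_joint.items():
        p_A[a] = p_A.get(a, 0.0) + p
        p_B[b] = p_B.get(b, 0.0) + p
    return H(p_A) + H(p_B) - H(p_joint)

p_X = {0: 0.5, 1: 0.5}   # source
flip_Y = 0.1             # P(Y != X): first binary symmetric channel
flip_Z = 0.2             # P(Z != Y): second, noisier channel

# Joint pmf of (X, Y, Z) under the Markov chain X -> Y -> Z.
p_XYZ = {
    (x, y, z): p_X[x]
    * (1 - flip_Y if y == x else flip_Y)
    * (1 - flip_Z if z == y else flip_Z)
    for x, y, z in product((0, 1), repeat=3)
}

p_XY, p_XZ = {}, {}
for (x, y, z), p in p_XYZ.items():
    p_XY[(x, y)] = p_XY.get((x, y), 0.0) + p
    p_XZ[(x, z)] = p_XZ.get((x, z), 0.0) + p

I_XY = mutual_information(p_XY)
I_XZ = mutual_information(p_XZ)
print(I_XY, I_XZ)
assert I_XZ <= I_XY + 1e-12   # data processing inequality
```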
Teemu Roos Information-Theoretic Modeling Outline Asymptotic Equipartition Property (AEP) Entropy and Information Typical Sets Data Compression Noiseless Source Coding Theorem
1 Entropy and Information Entropy Information Inequality Data Processing Inequality
2 Data Compression Asymptotic Equipartition Property (AEP) Typical Sets Noiseless Source Coding Theorem
AEP
If $X_1, X_2, \ldots$ is a sequence of independent and identically distributed (i.i.d.) r.v.'s with domain $\mathcal{X}$ and pmf $p_X$, then
$$\log_2 \frac{1}{p_X(X_1)},\ \log_2 \frac{1}{p_X(X_2)},\ \ldots$$
is also an i.i.d. sequence of r.v.'s.
The expected values of the elements of the above sequence are all equal to the entropy:
$$E\left[\log_2 \frac{1}{p_X(X_i)}\right] = \sum_{x \in \mathcal{X}} p_X(x) \log_2 \frac{1}{p_X(x)} = H(X) \quad \text{for all } i \in \mathbb{N} .$$
The i.i.d. assumption is equivalent to
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_X(x_i)$$
$$\iff \frac{1}{p(x_1, \ldots, x_n)} = \prod_{i=1}^{n} \frac{1}{p_X(x_i)}$$
$$\iff \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)}$$
$$\iff \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} = \frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(x_i)} .$$
By the (weak) law of large numbers, the average on the right-hand side converges in probability to its mean, i.e., the entropy:
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1 \quad \text{for all } \epsilon > 0 .$$
Asymptotic Equipartition Property (AEP)
For i.i.d. sequences, we have
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(X_1, \ldots, X_n)} - H(X)\right| < \epsilon\right] = 1$$
for all $\epsilon > 0$.
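A Monte Carlo sketch of the AEP (the source distribution, sequence lengths, ε, and trial count below are arbitrary choices, not from the slides): for i.i.d. draws, the per-symbol surprise (1/n) log₂(1/p(x₁, …, xₙ)) concentrates around H(X) as n grows.

```python
import math
import random

random.seed(0)

symbols = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]
surprise = {s: math.log2(1.0 / p) for s, p in zip(symbols, probs)}
H = sum(p * math.log2(1.0 / p) for p in probs)   # entropy of the source, about 1.485 bits

def per_symbol_surprise(n: int) -> float:
    """Draw an i.i.d. sequence of length n; return (1/n) log2(1 / p(x_1, ..., x_n))."""
    seq = random.choices(symbols, weights=probs, k=n)
    return sum(surprise[s] for s in seq) / n

epsilon, trials = 0.05, 500
for n in (10, 100, 1000, 10000):
    hits = sum(abs(per_symbol_surprise(n) - H) < epsilon for _ in range(trials))
    print(f"n={n:6d}  fraction within {epsilon} of H(X): {hits / trials:.2f}")
# The fraction tends to 1 as n grows, as the AEP predicts.
```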
The AEP states that for any $\epsilon > 0$, and large enough n, we have
$$\Pr\left[\left|\frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} - H(X)\right| < \epsilon\right] \approx 1 ,$$
and the event inside the brackets is equivalent to
$$H(X) - \epsilon < \frac{1}{n} \log_2 \frac{1}{p(x_1, \ldots, x_n)} < H(X) + \epsilon$$
$$\iff n(H(X) - \epsilon) < \log_2 \frac{1}{p(x_1, \ldots, x_n)} < n(H(X) + \epsilon)$$
$$\iff 2^{n(H(X) - \epsilon)} < \frac{1}{p(x_1, \ldots, x_n)} < 2^{n(H(X) + \epsilon)}$$
$$\iff 2^{-n(H(X) + \epsilon)} < p(x_1, \ldots, x_n) < 2^{-n(H(X) - \epsilon)}$$
$$\iff \Pr\left[p(x_1, \ldots, x_n) = 2^{-n(H(X) \pm \epsilon)}\right] \approx 1 .$$
Asymptotic Equipartition Property (informally)
"Almost all sequences are almost equally likely."
Technically, the key step in the proof was using the weak law of large numbers to deduce
$$\lim_{n \to \infty} \Pr\left[\left|\frac{1}{n} \sum_{i=1}^{n} \log_2 \frac{1}{p_X(X_i)} - H(X)\right| < \epsilon\right] = 1 \quad \text{for all } \epsilon > 0 .$$
In other words, with high probability the average "surprisingness" $\log_2 \frac{1}{p_X(X_i)}$ over the sequence is close to its expectation.
Typical Sets
Typical Set
The typical set $A_\epsilon^{(n)}$ is the set of sequences $(x_1, \ldots, x_n) \in \mathcal{X}^n$ with the property:
$$2^{-n(H(X) + \epsilon)} \leq p(x_1, \ldots, x_n) \leq 2^{-n(H(X) - \epsilon)} .$$
The AEP states that
$$\lim_{n \to \infty} \Pr\left[X^n \in A_\epsilon^{(n)}\right] = 1 .$$
In particular, for any $\epsilon > 0$, and large enough n, we have
$$\Pr\left[X^n \in A_\epsilon^{(n)}\right] > 1 - \epsilon .$$
How many sequences are there in the typical set $A_\epsilon^{(n)}$? We can use the fact that by definition each sequence has probability at least $2^{-n(H(X) + \epsilon)}$. Since the total probability of all the sequences in $A_\epsilon^{(n)}$ is trivially at most 1, there can't be too many of them.
$$1 \geq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \geq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X) + \epsilon)} = 2^{-n(H(X) + \epsilon)} \left|A_\epsilon^{(n)}\right|$$
$$\iff \left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X) + \epsilon)} .$$
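A brute-force sketch of this bound (a binary source with an arbitrary bias; n is kept small so that all 2ⁿ sequences can be enumerated): it collects the typical set, checks that $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$, and reports the set's total probability.

```python
import math
from itertools import product

p1 = 0.3                                  # P(X = 1); arbitrary example bias
pmf = {0: 1 - p1, 1: p1}
H = sum(p * math.log2(1.0 / p) for p in pmf.values())

n, eps = 12, 0.2

typical = []
total_prob = 0.0
for seq in product((0, 1), repeat=n):
    p_seq = math.prod(pmf[x] for x in seq)
    # Typical iff 2^{-n(H+eps)} <= p(seq) <= 2^{-n(H-eps)}.
    if 2 ** (-n * (H + eps)) <= p_seq <= 2 ** (-n * (H - eps)):
        typical.append(seq)
        total_prob += p_seq

print(len(typical), 2 ** (n * (H + eps)))   # |A| and its upper bound
print(total_prob)                            # around 0.8 for these settings; exceeds 1 - eps for large n
assert len(typical) <= 2 ** (n * (H + eps))
```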
Is it possible that the typical set $A_\epsilon^{(n)}$ is very small? This time we can use the fact that by definition each sequence has probability at most $2^{-n(H(X) - \epsilon)}$. Since for large enough n, the total probability of all the sequences in $A_\epsilon^{(n)}$ is (by the AEP) at least $1 - \epsilon$, there can't be too few of them.
$$1 - \epsilon \leq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} p(x_1, \ldots, x_n) \leq \sum_{(x_1, \ldots, x_n) \in A_\epsilon^{(n)}} 2^{-n(H(X) - \epsilon)} = 2^{-n(H(X) - \epsilon)} \left|A_\epsilon^{(n)}\right|$$
$$\iff \left|A_\epsilon^{(n)}\right| \geq (1 - \epsilon)\, 2^{n(H(X) - \epsilon)} .$$