Information theory
LING 572 Fei Xia
• Reading: M&S 2.2
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Entropy
• Entropy is a measure of the uncertainty associated with a distribution:

  H(X) = -\sum_x p(x) \log p(x)

  Here, X is a random variable, and x is a possible outcome of X.
• Entropy is the lower bound on the average number of bits it takes to transmit messages.
• An example:
  – Display the results of an 8-horse race.
  – Goal: minimize the number of bits to encode the results.

An example
• Uniform distribution: p_i = 1/8.

  H(X) = -\sum_x p(x) \log p(x) = -8 \cdot \frac{1}{8} \log_2 \frac{1}{8} = 3 bits

• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

  H(X) = -(\frac{1}{2} \log \frac{1}{2} + \frac{1}{4} \log \frac{1}{4} + \frac{1}{8} \log \frac{1}{8} + \frac{1}{16} \log \frac{1}{16} + 4 \cdot \frac{1}{64} \log \frac{1}{64}) = 2 bits

  A corresponding optimal code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
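A minimal Python sketch (NumPy assumed; the two distributions are the ones from this example) that computes these entropies:

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log 0 is taken to be 0
    return -np.sum(p * np.log2(p))

uniform = [1/8] * 8
skewed  = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))   # 3.0 bits
print(entropy(skewed))    # 2.0 bits
```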
• Uniform distribution has a higher entropy.
• MaxEnt: make the distribution as "uniform" as possible.

Cross Entropy
• Entropy:

  H(X) = -\sum_x p(x) \log p(x)

• Cross entropy:

  H_c(X) = -\sum_x p(x) \log q(x)
Here, p(x) is the true probability; q(x) is our estimate of p(x).
  H_c(X) \geq H(X)
Relative Entropy
• Also called Kullback-Leibler divergence:

  KL(p || q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H_c(X) - H(X)
• A “distance” measure between probability distributions p and q: the closer p(x) and q(x) are, the smaller the relative entropy.
• KL divergence is asymmetric, so it is not a proper distance metric:

  KL(p || q) \neq KL(q || p)
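A small sketch (hypothetical distributions p and q; NumPy assumed) showing cross entropy and KL divergence, and that KL(p || q) = H_c(X) - H(X) while KL is asymmetric:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """H_c(X) = -sum_x p(x) log2 q(x): p is the true distribution, q our estimate."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p * np.log2(q))

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2 (p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log2(p / q))

p = [0.7, 0.2, 0.1]          # true distribution (hypothetical)
q = [1/3, 1/3, 1/3]          # our estimate of p

print(cross_entropy(p, q))   # ~1.58 bits, >= H(X)
print(entropy(p))            # ~1.16 bits
print(kl_divergence(p, q))   # ~0.43 = H_c(X) - H(X)
print(kl_divergence(q, p))   # ~0.47, differs from KL(p || q)
```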
Joint and conditional entropy
• Joint entropy:

  H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
• Conditional entropy:

  H(Y | X) = H(X, Y) - H(X)
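A small sketch (hypothetical 2x2 joint distribution; NumPy assumed) that computes the joint entropy and derives the conditional entropy as H(Y | X) = H(X, Y) - H(X):

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint distribution p(x, y): rows index X, columns index Y.
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

H_XY = entropy(p_xy)                  # joint entropy H(X, Y)
H_X  = entropy(p_xy.sum(axis=1))      # marginal entropy H(X)
H_Y_given_X = H_XY - H_X              # conditional entropy H(Y | X)

print(H_XY, H_X, H_Y_given_X)         # ~1.72, 1.0, ~0.72 bits
```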
Entropy of a language (per-word entropy)
• The entropy of a language L:
  H(L, p) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})
• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as (Shannon-McMillan-Breiman theorem):

  H(L, p) = -\lim_{n \to \infty} \frac{1}{n} \log p(x_{1n})
Per-word entropy (cont)
• p(x_{1n}) can be calculated by n-gram models.
• Ex: unigram model
Perplexity
• Perplexity PP(x_{1n}) is 2^{H(L, p)}.
• Perplexity is the weighted average number of choices a random variable has to make.
• Perplexity is often used to evaluate a language model; lower perplexity is preferred.
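A minimal sketch (hypothetical toy corpus with a maximum-likelihood unigram model) of per-word cross entropy and the corresponding perplexity:

```python
import math
from collections import Counter

train = "the cat sat on the mat".split()
test  = "the cat sat".split()

# Maximum-likelihood unigram model estimated from the training corpus.
counts = Counter(train)
total = sum(counts.values())

def p_unigram(w):
    return counts[w] / total

# Per-word cross entropy: -(1/n) log2 p(x_1 ... x_n), with p(x_1n) = prod_i p(x_i).
log_prob = sum(math.log2(p_unigram(w)) for w in test)
H = -log_prob / len(test)

perplexity = 2 ** H   # PP = 2^H

print(H, perplexity)  # ~2.25 bits per word, perplexity ~4.8
```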
Mutual information
• It measures how much is in common between X and Y:

  I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
          = H(X) + H(Y) - H(X, Y)
          = I(Y; X)
          = H(X) - H(X | Y)
          = H(Y) - H(Y | X)
• I(X;Y) = KL( p(x,y) || p(x)p(y) )
• If X and Y are independent, I(X;Y) is 0.

Summary on Information theory
• Reading: M&S 2.2
• It is the use of probability theory to quantify and measure “information”.
• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Additional slides
Conditional entropy
  H(Y | X) = \sum_x p(x) H(Y | X = x)
           = -\sum_x p(x) \sum_y p(y | x) \log p(y | x)
           = -\sum_x \sum_y p(x, y) \log p(y | x)
           = -\sum_x \sum_y p(x, y) \log (p(x, y) / p(x))
           = -\sum_x \sum_y p(x, y) (\log p(x, y) - \log p(x))
           = -\sum_x \sum_y p(x, y) \log p(x, y) + \sum_x \sum_y p(x, y) \log p(x)
           = -\sum_x \sum_y p(x, y) \log p(x, y) + \sum_x p(x) \log p(x)
           = H(X, Y) - H(X)

Mutual information
  I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
          = \sum_x \sum_y p(x, y) \log p(x, y) - \sum_x \sum_y p(x, y) \log p(x) - \sum_y \sum_x p(x, y) \log p(y)
          = -H(X, Y) - \sum_x \log p(x) \sum_y p(x, y) - \sum_y \log p(y) \sum_x p(x, y)
          = -H(X, Y) - \sum_x p(x) \log p(x) - \sum_y p(y) \log p(y)
          = H(X) + H(Y) - H(X, Y)
          = I(Y; X)
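A small sketch (hypothetical 2x2 joint distribution; NumPy assumed) that verifies this identity numerically, including that I(X;Y) = 0 when X and Y are independent:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_information(p_xy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), computed from the joint distribution table."""
    p_xy = np.asarray(p_xy, dtype=float)
    return entropy(p_xy.sum(axis=1)) + entropy(p_xy.sum(axis=0)) - entropy(p_xy)

dependent   = np.array([[0.4, 0.1],
                        [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])   # p(x, y) = p(x) p(y)

print(mutual_information(dependent))    # > 0
print(mutual_information(independent))  # 0.0 (up to floating-point error)
```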