
Information theory

LING 572 Fei Xia


• Reading: M&S 2.2

• Information theory is the use of probability theory to quantify and measure “information”.

• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Entropy

• Entropy is a measure of the uncertainty associated with a distribution:

  H(X) = -\sum_x p(x) \log p(x)

  Here, X is a random variable, and x is a possible outcome of X.

• Entropy gives the lower bound on the number of bits it takes to transmit messages.

• An example:
  – Display the results of an 8-horse race.
  – Goal: minimize the number of bits to encode the results.

An example

• Uniform distribution: p_i = 1/8.

  H(X) = -\sum_x p(x) \log p(x) = -8 \cdot (\frac{1}{8} \log_2 \frac{1}{8}) = 3 bits

• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

  H(X) = -(\frac{1}{2}\log\frac{1}{2} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{8}\log\frac{1}{8} + \frac{1}{16}\log\frac{1}{16} + 4 \cdot \frac{1}{64}\log\frac{1}{64}) = 2 bits

  An optimal code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)

Uniform distribution has a higher entropy. MaxEnt: make the distribution as “uniform” as possible.
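To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides; the helper name `entropy` is mine) that computes the entropy of the two horse-race distributions above:

```python
import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum_x p(x) * log2(p(x)); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))  # 3.0 bits
print(entropy(skewed))   # 2.0 bits
```

For the skewed distribution, the codewords listed above have lengths 1, 2, 3, 4, 6, 6, 6, 6 bits, and their expected length is exactly 2 bits, matching the entropy.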

Cross Entropy

• Entropy: H(X) = -\sum_x p(x) \log p(x)

• Cross entropy: H_c(X) = -\sum_x p(x) \log q(x)

Here, p(x) is the true probability; q(x) is our estimate of p(x).

  H_c(X) \ge H(X)

Relative Entropy

• Also called Kullback-Leibler (KL) divergence:

  KL(p || q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H_c(X) - H(X)

• A “distance” measure between probability functions p and q; the closer p(x) and q(x) are, the smaller the relative entropy is.

• KL divergence is asymmetric, so it is not a proper distance metric:

  KL(p || q) \ne KL(q || p)
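As an illustration (my own toy numbers, not from the slides), a small Python sketch comparing entropy, cross entropy, and KL divergence for a true distribution p and an estimate q, and showing that KL is asymmetric:

```python
import math

def cross_entropy(p, q):
    """H_c(X) = -sum_x p(x) * log2(q(x)); reduces to the entropy H(X) when q == p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)) = H_c(X) - H(X)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.2, 0.1, 0.1]       # true distribution
q = [0.25, 0.25, 0.25, 0.25]   # our estimate of p

print(cross_entropy(p, p))     # H(X)   ~ 1.571 bits
print(cross_entropy(p, q))     # H_c(X) = 2.0 bits  >= H(X)
print(kl(p, q), kl(q, p))      # ~0.429 vs ~0.426: KL(p||q) != KL(q||p)
```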

Joint and conditional entropy

• Joint entropy:

  H(X,Y) = -\sum_x \sum_y p(x,y) \log p(x,y)

• Conditional entropy:

  H(Y|X) = H(X,Y) - H(X)
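A small numeric check (a made-up joint distribution, not from the slides) that H(Y|X) computed directly as -\sum p(x,y) \log p(y|x) equals H(X,Y) - H(X):

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y); the numbers are arbitrary.
joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}

# Marginal p(x) = sum_y p(x, y).
px = defaultdict(float)
for (x, _), p in joint.items():
    px[x] += p

H_xy = entropy(joint.values())                # joint entropy H(X,Y)
H_x = entropy(px.values())                    # marginal entropy H(X)
H_y_given_x = -sum(p * math.log2(p / px[x])   # -sum_{x,y} p(x,y) log p(y|x)
                   for (x, _), p in joint.items())

print(H_y_given_x, H_xy - H_x)   # the two numbers agree (up to floating-point rounding)
```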

Entropy of a language (per-word entropy)

• The entropy of a language L:

  H(L,p) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as (Shannon-McMillan-Breiman theorem):

  H(L,p) = -\lim_{n \to \infty} \frac{1}{n} \log p(x_{1n}) \approx -\frac{1}{n} \log p(x_{1n})

Per-word entropy (cont)

• p(x_{1n}) can be calculated by n-gram models.

• Ex: unigram model: p(x_{1n}) = \prod_{i=1}^{n} p(x_i), as in the sketch below.
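A minimal sketch of this computation (assumptions mine: whitespace tokenization, maximum-likelihood unigram estimates from a toy corpus, no smoothing), estimating the per-word quantity -(1/n) \log p(x_{1n}):

```python
import math
from collections import Counter

# Toy corpus; in practice this would be a large training corpus.
train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the log".split()

# Maximum-likelihood unigram estimates p(w) = count(w) / N.
counts = Counter(train)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

# Unigram model: log p(x_1n) = sum_i log p(x_i).
log_prob = sum(math.log2(p[w]) for w in test)

# Per-word cross entropy of the test text under the model.
n = len(test)
print(-log_prob / n)
```

Every test word here happens to appear in the training data; a real model would need smoothing to handle unseen words.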

Perplexity

• Perplexity PP(x_{1n}) is 2^{H(L,p)}.

• Perplexity is the weighted average number of choices a random variable has to make.

• Perplexity is often used to evaluate a language model; lower perplexity is preferred.
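A short, self-contained sketch (names and numbers mine) that converts a per-word log probability into perplexity; it also illustrates the “number of choices” reading: a uniform model over V words has perplexity exactly V.

```python
import math

def perplexity(log2_prob, n):
    """PP(x_1n) = 2 ** H, where H = -(1/n) * log2 p(x_1n) is the per-word cross entropy."""
    return 2 ** (-log2_prob / n)

# A uniform model over a vocabulary of V = 8 words assigns each word probability 1/8,
# so for any n-word text the perplexity is exactly 8 ("8 equally likely choices per word").
V, n = 8, 100
print(perplexity(n * math.log2(1 / V), n))   # 8.0
```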

Mutual information

• It measures how much information X and Y have in common:

  I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
         = H(X) + H(Y) - H(X,Y)
         = I(Y;X)
         = H(X) - H(X|Y)
         = H(Y) - H(Y|X)

• I(X;Y) = KL( p(x,y) || p(x)p(y) )

• If X and Y are independent, I(X;Y) is 0.
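A quick numeric check (toy joint distribution of my own, not from the slides) of these identities: I(X;Y) computed from the definition matches H(X) + H(Y) - H(X,Y), and a product (independent) distribution gives I(X;Y) = 0.

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Dependent case: I(X;Y) > 0 and equals H(X) + H(Y) - H(X,Y).
joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}
px = {'a': 0.5, 'b': 0.5}   # marginals of the joint distribution above
py = {0: 0.6, 1: 0.4}
print(mutual_information(joint))
print(entropy(px.values()) + entropy(py.values()) - entropy(joint.values()))

# Independent case: p(x,y) = p(x) * p(y), so I(X;Y) = 0.
indep = {(x, y): px[x] * py[y] for x in px for y in py}
print(mutual_information(indep))   # 0.0
```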

Summary on Information theory

• Reading: M&S 2.2

• Information theory is the use of probability theory to quantify and measure “information”.

• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Additional slides

Conditional entropy

H(Y|X) = \sum_x p(x) H(Y|X=x)
       = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)
       = -\sum_x \sum_y p(x,y) \log p(y|x)
       = -\sum_x \sum_y p(x,y) \log [p(x,y) / p(x)]
       = -\sum_x \sum_y p(x,y) (\log p(x,y) - \log p(x))
       = -\sum_x \sum_y p(x,y) \log p(x,y) + \sum_x \sum_y p(x,y) \log p(x)
       = -\sum_x \sum_y p(x,y) \log p(x,y) + \sum_x p(x) \log p(x)
       = H(X,Y) - H(X)

Mutual information

I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
       = \sum_x \sum_y p(x,y) \log p(x,y) - \sum_x \sum_y p(x,y) \log p(x) - \sum_x \sum_y p(x,y) \log p(y)
       = -H(X,Y) - \sum_x (\log p(x)) \sum_y p(x,y) - \sum_y (\log p(y)) \sum_x p(x,y)
       = -H(X,Y) - \sum_x (\log p(x)) p(x) - \sum_y (\log p(y)) p(y)
       = H(X) + H(Y) - H(X,Y)
       = I(Y;X)
