
Information theory

LING 572 Fei Xia


• Reading: M&S 2.2

• Information theory is the use of probability theory to quantify and measure “information”.

• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Entropy

• Entropy is a measure of the uncertainty associated with a distribution:

  H(X) = -\sum_x p(x) \log p(x)

  Here, X is a random variable, and x is a possible outcome of X.

• Entropy gives the lower bound on the number of bits it takes to transmit messages.

• An example:
  – Display the results of an 8-horse race.
  – Goal: minimize the number of bits to encode the results.

An example

• Uniform distribution: p_i = 1/8.

  H(X) = -\sum_x p(x) \log p(x) = -8 \cdot (\frac{1}{8} \log_2 \frac{1}{8}) = 3 bits

• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)

  H(X) = -(\frac{1}{2}\log\frac{1}{2} + \frac{1}{4}\log\frac{1}{4} + \frac{1}{8}\log\frac{1}{8} + \frac{1}{16}\log\frac{1}{16} + 4 \cdot \frac{1}{64}\log\frac{1}{64}) = 2 bits

  An optimal code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)

Uniform distribution has a higher entropy. MaxEnt: make the distribution as “uniform” as possible.
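To make the arithmetic concrete, here is a minimal Python sketch (not part of the original slides; the helper name `entropy` is mine) that computes the entropy of the two horse-race distributions above:

```python
import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum_x p(x) * log2(p(x)); zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))  # 3.0 bits
print(entropy(skewed))   # 2.0 bits
```

For the skewed distribution, the codewords listed above have lengths 1, 2, 3, 4, 6, 6, 6, 6 bits, and their expected length is exactly 2 bits, matching the entropy.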

Cross Entropy

• Entropy: H(X) = -\sum_x p(x) \log p(x)

• Cross entropy: H_c(X) = -\sum_x p(x) \log q(x)

Here, p(x) is the true probability; q(x) is our estimate of p(x).

  H_c(X) \ge H(X)

Relative Entropy

• Also called Kullback-Leibler (KL) divergence:

  KL(p || q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H_c(X) - H(X)

• A “distance” measure between probability functions p and q; the closer p(x) and q(x) are, the smaller the relative entropy is.

• KL divergence is asymmetric, so it is not a proper distance metric:

  KL(p || q) \ne KL(q || p)
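As an illustration (my own toy numbers, not from the slides), a small Python sketch comparing entropy, cross entropy, and KL divergence for a true distribution p and an estimate q, and showing that KL is asymmetric:

```python
import math

def cross_entropy(p, q):
    """H_c(X) = -sum_x p(x) * log2(q(x)); reduces to the entropy H(X) when q == p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log2(p(x) / q(x)) = H_c(X) - H(X)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.2, 0.1, 0.1]       # true distribution
q = [0.25, 0.25, 0.25, 0.25]   # our estimate of p

print(cross_entropy(p, p))     # H(X)   ~ 1.571 bits
print(cross_entropy(p, q))     # H_c(X) = 2.0 bits  >= H(X)
print(kl(p, q), kl(q, p))      # ~0.429 vs ~0.426: KL(p||q) != KL(q||p)
```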

Joint and conditional entropy

• Joint entropy:

  H(X,Y) = -\sum_x \sum_y p(x,y) \log p(x,y)

• Conditional entropy:

  H(Y|X) = H(X,Y) - H(X)
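A small numeric check (a made-up joint distribution, not from the slides) that H(Y|X) computed directly as -\sum p(x,y) \log p(y|x) equals H(X,Y) - H(X):

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy joint distribution p(x, y); the numbers are arbitrary.
joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}

# Marginal p(x) = sum_y p(x, y).
px = defaultdict(float)
for (x, _), p in joint.items():
    px[x] += p

H_xy = entropy(joint.values())                # joint entropy H(X,Y)
H_x = entropy(px.values())                    # marginal entropy H(X)
H_y_given_x = -sum(p * math.log2(p / px[x])   # -sum_{x,y} p(x,y) log p(y|x)
                   for (x, _), p in joint.items())

print(H_y_given_x, H_xy - H_x)   # the two numbers agree (up to floating-point rounding)
```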

Entropy of a language (per-word entropy)

• The entropy of a language L:

  H(L,p) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})

• If we make certain assumptions that the language is “nice”, then the cross entropy can be calculated as (Shannon-McMillan-Breiman theorem):

  H(L,p) = -\lim_{n \to \infty} \frac{1}{n} \log p(x_{1n}) \approx -\frac{1}{n} \log p(x_{1n})

Per-word entropy (cont)

• p(x_{1n}) can be calculated by n-gram models.

• Ex: unigram model: p(x_{1n}) = \prod_{i=1}^{n} p(x_i), as in the sketch below.
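A minimal sketch of this computation (assumptions mine: whitespace tokenization, maximum-likelihood unigram estimates from a toy corpus, no smoothing), estimating the per-word quantity -(1/n) \log p(x_{1n}):

```python
import math
from collections import Counter

# Toy corpus; in practice this would be a large training corpus.
train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the log".split()

# Maximum-likelihood unigram estimates p(w) = count(w) / N.
counts = Counter(train)
total = sum(counts.values())
p = {w: c / total for w, c in counts.items()}

# Unigram model: log p(x_1n) = sum_i log p(x_i).
log_prob = sum(math.log2(p[w]) for w in test)

# Per-word cross entropy of the test text under the model.
n = len(test)
print(-log_prob / n)
```

Every test word here happens to appear in the training data; a real model would need smoothing to handle unseen words.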

Perplexity

• Perplexity PP(x_{1n}) is 2^{H(L,p)}.

• Perplexity is the weighted average number of choices a random variable has to make.

• Perplexity is often used to evaluate a language model; lower perplexity is preferred.
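A short, self-contained sketch (names and numbers mine) that converts a per-word log probability into perplexity; it also illustrates the “number of choices” reading: a uniform model over V words has perplexity exactly V.

```python
import math

def perplexity(log2_prob, n):
    """PP(x_1n) = 2 ** H, where H = -(1/n) * log2 p(x_1n) is the per-word cross entropy."""
    return 2 ** (-log2_prob / n)

# A uniform model over a vocabulary of V = 8 words assigns each word probability 1/8,
# so for any n-word text the perplexity is exactly 8 ("8 equally likely choices per word").
V, n = 8, 100
print(perplexity(n * math.log2(1 / V), n))   # 8.0
```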

Mutual information

• It measures how much information X and Y have in common:

  I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
         = H(X) + H(Y) - H(X,Y)
         = I(Y;X)
         = H(X) - H(X|Y)
         = H(Y) - H(Y|X)

• I(X;Y) = KL( p(x,y) || p(x)p(y) )

• If X and Y are independent, I(X;Y) is 0.
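A quick numeric check (toy joint distribution of my own, not from the slides) of these identities: I(X;Y) computed from the definition matches H(X) + H(Y) - H(X,Y), and a product (independent) distribution gives I(X;Y) = 0.

```python
import math
from collections import defaultdict

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = sum_{x,y} p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Dependent case: I(X;Y) > 0 and equals H(X) + H(Y) - H(X,Y).
joint = {('a', 0): 0.4, ('a', 1): 0.1, ('b', 0): 0.2, ('b', 1): 0.3}
px = {'a': 0.5, 'b': 0.5}   # marginals of the joint distribution above
py = {0: 0.6, 1: 0.4}
print(mutual_information(joint))
print(entropy(px.values()) + entropy(py.values()) - entropy(joint.values()))

# Independent case: p(x,y) = p(x) * p(y), so I(X;Y) = 0.
indep = {(x, y): px[x] * py[y] for x in px for y in py}
print(mutual_information(indep))   # 0.0
```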

Summary on Information theory

• Reading: M&S 2.2

• Information theory is the use of probability theory to quantify and measure “information”.

• Basic concepts:
  – Entropy
  – Cross entropy and relative entropy
  – Joint entropy and conditional entropy
  – Entropy of the language and perplexity
  – Mutual information

Additional slides

Conditional entropy

H(Y|X) = \sum_x p(x) H(Y|X=x)
       = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)
       = -\sum_x \sum_y p(x,y) \log p(y|x)
       = -\sum_x \sum_y p(x,y) \log [p(x,y) / p(x)]
       = -\sum_x \sum_y p(x,y) (\log p(x,y) - \log p(x))
       = -\sum_x \sum_y p(x,y) \log p(x,y) + \sum_x \sum_y p(x,y) \log p(x)
       = -\sum_x \sum_y p(x,y) \log p(x,y) + \sum_x p(x) \log p(x)
       = H(X,Y) - H(X)

Mutual information

I(X;Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
       = \sum_x \sum_y p(x,y) \log p(x,y) - \sum_x \sum_y p(x,y) \log p(x) - \sum_x \sum_y p(x,y) \log p(y)
       = -H(X,Y) - \sum_x (\log p(x)) \sum_y p(x,y) - \sum_y (\log p(y)) \sum_x p(x,y)
       = -H(X,Y) - \sum_x (\log p(x)) p(x) - \sum_y (\log p(y)) p(y)
       = H(X) + H(Y) - H(X,Y)
       = I(Y;X)
