
Information theory
LING 572
Fei Xia

Information theory
• Reading: M&S 2.2
• Information theory uses probability theory to quantify and measure "information".
• Basic concepts:
– Entropy
– Cross entropy and relative entropy
– Joint entropy and conditional entropy
– Entropy of a language and perplexity
– Mutual information

Entropy
• Entropy is a measure of the uncertainty associated with a distribution:
  H(X) = -\sum_x p(x) \log p(x)
  Here, X is a random variable and x is a possible outcome of X.
• Entropy gives the lower bound on the number of bits it takes to transmit messages.
• An example:
– Display the results of an 8-horse race.
– Goal: minimize the number of bits needed to encode the results.

An example
• Uniform distribution: p_i = 1/8.
  H(X) = -\sum_x p(x) \log p(x) = -8 \cdot \frac{1}{8} \log_2 \frac{1}{8} = 3 bits
• Non-uniform distribution: (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64)
  H(X) = -(\frac{1}{2} \log \frac{1}{2} + \frac{1}{4} \log \frac{1}{4} + \frac{1}{8} \log \frac{1}{8} + \frac{1}{16} \log \frac{1}{16} + 4 \cdot \frac{1}{64} \log \frac{1}{64}) = 2 bits
  One optimal code: (0, 10, 110, 1110, 111100, 111101, 111110, 111111)
• The uniform distribution has the higher entropy.
• MaxEnt: make the distribution as "uniform" as possible.

Cross Entropy
• Entropy: H(X) = -\sum_x p(x) \log p(x)
• Cross entropy: H_c(X) = -\sum_x p(x) \log q(x)
  Here, p(x) is the true probability and q(x) is our estimate of p(x).
  H_c(X) \geq H(X)

Relative Entropy
• Also called Kullback-Leibler divergence:
  KL(p || q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H_c(X) - H(X)
• A "distance" measure between probability functions p and q; the closer p(x) and q(x) are, the smaller the relative entropy is.
• KL divergence is asymmetric, so it is not a proper distance metric: KL(p, q) \neq KL(q, p)

Joint and conditional entropy
• Joint entropy: H(X, Y) = -\sum_x \sum_y p(x, y) \log p(x, y)
• Conditional entropy: H(Y | X) = H(X, Y) - H(X)

Entropy of a language (per-word entropy)
• The entropy of a language L:
  H(L, p) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log p(x_{1n})
• If we make certain assumptions that the language is "nice", then the cross entropy can be calculated as (Shannon-McMillan-Breiman theorem):
  H(L, p) = -\lim_{n \to \infty} \frac{1}{n} \log p(x_{1n}) \approx -\frac{1}{n} \log p(x_{1n}) for large n

Per-word entropy (cont)
• p(x_{1n}) can be calculated by n-gram models.
• Ex: unigram model: p(x_{1n}) = \prod_{i=1}^{n} p(x_i)

Perplexity
• Perplexity PP(x_{1n}) is 2^{H(L, p)}.
• Perplexity is the weighted average number of choices a random variable has to make.
• Perplexity is often used to evaluate a language model; lower perplexity is preferred.

Mutual information
• It measures how much information X and Y have in common:
  I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
          = H(X) + H(Y) - H(X, Y) = I(Y; X)
          = H(X) - H(X | Y) = H(Y) - H(Y | X)
• I(X; Y) = KL( p(x, y) || p(x) p(y) )
• If X and Y are independent, then I(X; Y) = 0.

Summary on Information theory
• Reading: M&S 2.2
• Information theory uses probability theory to quantify and measure "information".
• Basic concepts:
– Entropy
– Cross entropy and relative entropy
– Joint entropy and conditional entropy
– Entropy of a language and perplexity
– Mutual information

Additional slides

Conditional entropy
H(Y | X) = \sum_x p(x) H(Y | X = x)
         = -\sum_x p(x) \sum_y p(y | x) \log p(y | x)
         = -\sum_x \sum_y p(x, y) \log p(y | x)
         = -\sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x)}
         = -\sum_x \sum_y p(x, y) (\log p(x, y) - \log p(x))
         = -\sum_x \sum_y p(x, y) \log p(x, y) + \sum_x \sum_y p(x, y) \log p(x)
         = -\sum_x \sum_y p(x, y) \log p(x, y) + \sum_x p(x) \log p(x)
         = H(X, Y) - H(X)

Mutual information
I(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) p(y)}
        = \sum_x \sum_y p(x, y) \log p(x, y) - \sum_x \sum_y p(x, y) \log p(x) - \sum_y \sum_x p(x, y) \log p(y)
        = -H(X, Y) - \sum_x \log p(x) \sum_y p(x, y) - \sum_y \log p(y) \sum_x p(x, y)
        = -H(X, Y) - \sum_x (\log p(x)) p(x) - \sum_y (\log p(y)) p(y)
        = H(X) + H(Y) - H(X, Y)
        = I(Y; X)
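As a numeric check of the definitions above, here is a minimal Python sketch (not part of the original slides) that reproduces the 8-horse race example: entropy of the uniform and skewed distributions, and the cross entropy and KL divergence between them. All logarithms are base 2, and the function names are illustrative.

```python
import math

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H_c(X) = -sum_x p(x) log2 q(x): p is the true distribution, q our estimate."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """KL(p || q) = H_c(X) - H(X) = sum_x p(x) log2 [p(x) / q(x)]."""
    return cross_entropy(p, q) - entropy(p)

# The 8-horse race example from the slides.
uniform = [1/8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy(uniform))                # 3.0 bits
print(entropy(skewed))                 # 2.0 bits
print(cross_entropy(skewed, uniform))  # 3.0 bits: coding the skewed race with the uniform code
print(kl_divergence(skewed, uniform))  # 1.0 bit = 3.0 - 2.0, the cost of the wrong model
```

The printed values match the slides: 3 bits for the uniform race, 2 bits for the skewed one, and a 1-bit penalty (the KL divergence) for encoding the skewed race as if it were uniform.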
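For the per-word entropy and perplexity slides, the sketch below shows one way -(1/n) log2 p(x_1n) and PP = 2^H might be computed under an unsmoothed unigram model. The tiny corpus and helper names are made up for illustration; a real evaluation would use a smoothed model and a held-out test set.

```python
import math
from collections import Counter

def unigram_model(training_tokens):
    """Relative-frequency unigram estimates p(w), with no smoothing."""
    counts = Counter(training_tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def per_word_cross_entropy(model, test_tokens):
    """-(1/n) log2 p(x_1 ... x_n), where p(x_1n) = prod_i p(x_i) under a unigram model."""
    log_prob = sum(math.log2(model[w]) for w in test_tokens)
    return -log_prob / len(test_tokens)

# Toy data, for illustration only; every test token must appear in training (no smoothing).
train = "the cat sat on the mat the cat ate".split()
test = "the cat sat".split()

model = unigram_model(train)
h = per_word_cross_entropy(model, test)
print(h)        # per-word cross entropy in bits
print(2 ** h)   # perplexity = 2^H, the weighted average number of choices per word
```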
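Finally, the joint-entropy, conditional-entropy, and mutual-information identities from the additional slides can be verified numerically. The sketch below uses a made-up toy joint distribution p(x, y); the helper names (marginals, mutual_information, etc.) are hypothetical, not from M&S or the slides.

```python
import math
from collections import defaultdict

def entropy(p):
    """H = -sum p log2 p over an iterable of probabilities."""
    return -sum(v * math.log2(v) for v in p if v > 0)

def joint_entropy(joint):
    """H(X, Y) = -sum_{x,y} p(x, y) log2 p(x, y)."""
    return entropy(joint.values())

def marginals(joint):
    """Marginal distributions p(x) and p(y) from a joint p(x, y)."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), pxy in joint.items():
        px[x] += pxy
        py[y] += pxy
    return px, py

def mutual_information(joint):
    """I(X; Y) = sum_{x,y} p(x, y) log2 [ p(x, y) / (p(x) p(y)) ]."""
    px, py = marginals(joint)
    return sum(pxy * math.log2(pxy / (px[x] * py[y]))
               for (x, y), pxy in joint.items() if pxy > 0)

# A toy joint distribution p(x, y) over {a, b} x {0, 1}; the numbers are invented.
joint = {('a', 0): 0.4, ('a', 1): 0.1,
         ('b', 0): 0.1, ('b', 1): 0.4}

px, py = marginals(joint)
hx, hy, hxy = entropy(px.values()), entropy(py.values()), joint_entropy(joint)

# I(X; Y) = H(X) + H(Y) - H(X, Y)
assert abs(mutual_information(joint) - (hx + hy - hxy)) < 1e-9
# H(Y | X) = H(X, Y) - H(X)
print(mutual_information(joint), hxy - hx)   # ~0.278 bits of MI, ~0.722 bits of H(Y|X)
```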