CS340 Machine Learning: Information Theory

Announcements
• If you did not get email, contact [email protected]
• Newsgroup: ubc.courses.cpsc.340
• Hw1 due Wed – bring a hardcopy to the start of class
• Added knnClassify.m, normalize.m
• Add/drop deadline tomorrow

Information theory
• Data compression (source coding)
  – More frequent events should have shorter encodings
• Error correction (channel coding)
  – Should be able to infer the encoded event even if the message is corrupted by noise
• Both tasks require building probabilistic models of data sources, p(x), and hence are related to machine learning
• Lower bounds on coding length and channel capacity depend on our uncertainty about p(x), defined in terms of entropy

Info theory & ML
• David MacKay, Information Theory, Inference, and Learning Algorithms, CUP, 2003, freely available online on MacKay's website

Entropy
• Consider a discrete random variable X ∈ {1,…,K}
• Suppose we observe the event X = k. The information content of this event is related to its surprise factor:
    h(k) = \log_2 \frac{1}{p(X=k)} = -\log_2 p(X=k)
• The entropy of the distribution p is the average information content:
    H(X) = -\sum_{k=1}^{K} p(X=k) \log_2 p(X=k)
• Max entropy = uniform distribution, min entropy = delta function:
    0 \le H(X) \le \log_2 K

Binary entropy function
• Suppose X ∈ {0,1}, with p(X=1) = θ and p(X=0) = 1 − θ
• We say X ~ Bernoulli(θ):
    H(X) = H(\theta) = -[p(X=1) \log_2 p(X=1) + p(X=0) \log_2 p(X=0)]
                     = -[\theta \log_2 \theta + (1-\theta) \log_2 (1-\theta)]
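To make the entropy and binary entropy definitions concrete, here is a minimal Matlab/Octave sketch (not part of the original lecture code; the names entropy_bits and binary_entropy are invented for this illustration) that evaluates both and checks the claimed extremes: a uniform distribution attains log2 K, a delta function attains 0, and a fair coin has 1 bit of entropy.

    % Entropy (in bits) of a discrete distribution given as a vector of probabilities.
    % Zero-probability entries are dropped, using the convention 0*log(0) = 0.
    entropy_bits = @(p) -sum(p(p > 0) .* log2(p(p > 0)));

    % Binary entropy function H(theta) of a Bernoulli(theta) variable.
    binary_entropy = @(theta) entropy_bits([theta, 1 - theta]);

    K = 5;
    entropy_bits(ones(1, K) / K)   % uniform: log2(5) ~ 2.3219, the maximum
    entropy_bits([1 0 0 0 0])      % delta function: 0, the minimum
    binary_entropy(0.5)            % fair coin: 1 bit, maximally uncertain
    binary_entropy(0.9)            % biased coin: ~0.47 bits, more predictable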
Entropy of p(y|x,D) for kNN
[Figure: training data; heat maps of p(y=1|x,D) and p(y=2|x,D) for weighted kNN with K=10, beta=1.0; and the entropy of the predictive distribution p(y|x,D).]

Active learning
• Suppose we can request the label y for any location (feature vector) x.
• A natural (myopic) criterion is to pick the one that minimizes our predictive uncertainty:
    x^* = \arg\min_{x \in \mathcal{X}} H(p(y|x,D))
• Implementing this in practice may be quite difficult, depending on the size of the X space and the form of the probabilistic model p(y|x)
[Figure: training data and the entropy of the kNN predictive distribution.]
Will cover later

Active learning with Gaussian Processes
• If we assume the y_i labels are correlated with their nearest neighbors, we can propagate information and rapidly classify all the points
[Figure credit: Nando de Freitas]
Will cover later

Entropy & source coding theorem (Not on exam)
• Shannon proved that the minimum number of bits needed to encode an RV with distribution p is H(p)
• Example: X ∈ {a,b,c,d,e} with distribution
    p(a) = 0.25, p(b) = 0.25, p(c) = 0.2, p(d) = 0.15, p(e) = 0.15
• Assign short codewords (00, 10, 11) to the common events (a, b, c) and long codewords (010, 011) to the rare events, in a prefix-free way:
    a → 00, b → 10, c → 11, d → 010, e → 011
    001011010 → 00, 10, 11, 010 → abcd
• Build the tree bottom up – Huffman code

Example cont'd (Not on exam)
• Example: X ∈ {a,b,c,d,e} with distribution
    p(a) = 0.25, p(b) = 0.25, p(c) = 0.2, p(d) = 0.15, p(e) = 0.15
    a → 00, b → 10, c → 11, d → 010, e → 011
• Average number of bits needed by this code:
    0.25 \cdot 2 + 0.25 \cdot 2 + 0.2 \cdot 2 + 0.15 \cdot 3 + 0.15 \cdot 3 = 2.30
• Entropy of the distribution: H = 2.2855
• To get closer to the lower bound, encode blocks of symbols at once (arithmetic coding)
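As a quick numerical check of the source-coding example, the following Matlab/Octave snippet (an illustration added here, not course code) recomputes the average codeword length of the prefix code and the entropy lower bound.

    % Distribution over {a,b,c,d,e} and the codeword lengths of the prefix code
    % a -> 00, b -> 10, c -> 11, d -> 010, e -> 011.
    p   = [0.25 0.25 0.20 0.15 0.15];
    len = [2 2 2 3 3];

    avg_bits = sum(p .* len)        % 2.30 bits per symbol for this code
    H        = -sum(p .* log2(p))   % 2.2855 bits: Shannon's lower bound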
Joint entropy
• The joint entropy of two RVs is defined as
    H(X,Y) = -\sum_{x,y} p(x,y) \log_2 p(x,y)
• If X and Y are independent,
    X \perp Y \iff p(X,Y) = p(X)p(Y),
  then our uncertainty is maximal (since X does not inform us about Y or vice versa):
    X \perp Y \iff H(X,Y) = H(X) + H(Y)
• In general, considering events jointly reduces our uncertainty:
    H(X,Y) \le H(X) + H(Y)   (non-trivial proof – see later)
• and our joint uncertainty is at least as large as either marginal uncertainty:
    H(X,Y) \ge \max[H(X), H(Y)] \ge 0
• When is H(X,Y) = H(X)?

Example
• Let X(n) be the event that n is even, and Y(n) be the event that n is prime, for n ∈ {1,…,8}
• The joint distribution = normalized counts:
    H(X,Y) = -[\tfrac{1}{8}\log_2\tfrac{1}{8} + \tfrac{3}{8}\log_2\tfrac{3}{8} + \tfrac{3}{8}\log_2\tfrac{3}{8} + \tfrac{1}{8}\log_2\tfrac{1}{8}] = 1.8113
• What is H(X) + H(Y)?

Example cont'd
• The joint and marginal distributions are:

                    Y=0 (not prime)   Y=1 (prime)   p(X)
    X=0 (odd)            1/8              3/8        1/2
    X=1 (even)           3/8              1/8        1/2
    p(Y)                 1/2              1/2

• Hence H(X) = H(Y) = 1, so
    H(X,Y) = 1.8113 < H(X) + H(Y) = 2

Conditional entropy
• H(Y|X) is the expected uncertainty in Y after seeing X:
    H(Y|X) \triangleq \sum_x p(x) H(Y|X=x)
           = -\sum_x p(x) \sum_y p(y|x) \log p(y|x)
           = -\sum_{x,y} p(x,y) \log p(y|x)
           = -\sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)}
           = -\sum_{x,y} p(x,y) \log p(x,y) - \sum_x p(x) \log \frac{1}{p(x)}
           = H(X,Y) - H(X)
• When is H(Y|X) = 0? When is H(Y|X) = H(Y)?

Information never hurts
• Conditioning on data always decreases (or at least, never increases) our uncertainty, on average:
    H(X,Y) \le H(Y) + H(X)        (from before)
    H(Y|X) = H(X,Y) - H(X)        (from above)
           \le H(Y) + H(X) - H(X) = H(Y)

Mutual information
• I(X,Y) is how much our uncertainty about Y decreases when we observe X:
    I(X,Y) \triangleq \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x)p(y)}
           = -H(X,Y) + H(X) + H(Y)
           = H(X) - H(X|Y)
           = H(Y) - H(Y|X)
• Hence
    H(X,Y) = H(X|Y) + H(Y|X) + I(X,Y)
  (MacKay 9.1)

Mutual information
• MI captures dependence between RVs in the following sense:
    I(X,Y) \ge 0  and  I(X,Y) = 0 \iff X \perp Y
• That X ⊥ Y ⇒ I(X,Y) = 0 is easy to show; I(X,Y) = 0 ⇒ X ⊥ Y is harder.
• This is more general than a correlation coefficient, ρ ∈ [-1,1], which only captures linear dependence
• For MI, we have
    0 \le I(X,Y) \le H(X) \le \log_2 K
• When is I(X,Y) = H(X)?

Example
• Recall the even/prime example with its joint, marginal, and conditional distributions
• Hence
    H(Y|X) = -[\tfrac{1}{8}\log_2\tfrac{1}{4} + \tfrac{3}{8}\log_2\tfrac{3}{4} + \tfrac{3}{8}\log_2\tfrac{3}{4} + \tfrac{1}{8}\log_2\tfrac{1}{4}] = 0.8113
    I(X,Y) = H(Y) - H(Y|X) = 1 - 0.8113 = 0.1887
• In Matlab, the conditional p(y|x) is obtained by normalizing the rows of the joint:
    cond = normalize(joint) = joint ./ repmat(sum(joint,2), 1, Y)

Relative entropy (KL divergence)
• The Kullback-Leibler (KL) divergence is defined as
    D(p||q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = -H(p) - \sum_x p(x) \log q(x)
• where -\sum_x p(x) \log q(x) is the cross entropy
• KL is the average number of extra bits needed to encode the data if we think the distribution is q, but it is actually p.
• KL is not symmetric and hence not a distance.
• However, D(p||q) \ge 0, with equality iff p = q.

Jensen's inequality (Not on exam)
• A concave function is one which lies above any chord:
    f(\lambda x_1 + (1-\lambda) x_2) \ge \lambda f(x_1) + (1-\lambda) f(x_2)
• Jensen: for any concave f, E[f(X)] \le f(E[X]):
    \sum_x p(x) f(x) \le f\left(\sum_x p(x)\, x\right)
• Proof by induction: set
    \lambda = p(x=1), \quad 1-\lambda = \sum_{x=2}^{K} p(x)

Proof that KL ≥ 0 (Not on exam)
• Let f(u) = \log(1/u), a convex function (so Jensen's inequality holds with the direction reversed), and let u(x) = q(x)/p(x). Then
    D(p||q) = E_p[f(q(x)/p(x))]
            \ge f\left(\sum_x p(x) \frac{q(x)}{p(x)}\right)
            = \log \frac{1}{\sum_x q(x)} = \log 1 = 0
• Hence
    I(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)} = D(p(x,y) \,\|\, p(x)p(y)) \ge 0
  and
    H(X) + H(Y) = I(X,Y) + H(X,Y) \ge H(X,Y)
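To tie the even/prime example back to these definitions, here is a short Matlab/Octave sketch (again an illustration, not lecture code; the helper Hbits is invented here) that recomputes H(X,Y), H(X), H(Y), H(Y|X), and I(X,Y), and checks numerically that I(X,Y) equals the KL divergence between the joint distribution and the product of its marginals.

    % Joint distribution of X (even?) and Y (prime?) for n in 1..8.
    % Rows: X = 0 (odd), X = 1 (even). Columns: Y = 0 (not prime), Y = 1 (prime).
    joint = [1/8 3/8;
             3/8 1/8];

    Hbits = @(p) -sum(p(p > 0) .* log2(p(p > 0)));   % entropy in bits

    px = sum(joint, 2);            % marginal p(x) = [1/2; 1/2]
    py = sum(joint, 1);            % marginal p(y) = [1/2  1/2]
    Hxy = Hbits(joint(:))          % H(X,Y) = 1.8113
    Hx  = Hbits(px)                % H(X)   = 1
    Hy  = Hbits(py)                % H(Y)   = 1
    HyGivenX = Hxy - Hx            % H(Y|X) = 0.8113
    I = Hx + Hy - Hxy              % I(X,Y) = 0.1887

    % I(X,Y) = D( p(x,y) || p(x)p(y) ), as on the last slide.
    pxpy = px * py;                % outer product: the 2x2 matrix p(x)p(y)
    KL = sum(joint(:) .* log2(joint(:) ./ pxpy(:)))   % = 0.1887

The computed values match the numbers quoted on the slides (1.8113, 1, 0.8113, 0.1887).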
