CS340 Machine Learning: Information Theory
1 Announcements
• If you did not get email, contact [email protected]
• Newsgroup: ubc.courses.cpsc.340
• Hw1 due Wed – bring hardcopy to start of class
• Added knnClassify.m, normalize.m
• Add/drop deadline tomorrow
2 Information theory
• Data compression (source coding)
– More frequent events should have shorter encodings
• Error correction (channel coding)
– Should be able to infer the encoded event even if the message is corrupted by noise
• Both tasks require building probabilistic models of data sources, p(x), and hence are related to machine learning
• Lower bounds on coding length and channel capacity depend on our uncertainty about p(x), defined in terms of entropy
3 Info theory & ML
[Figure: cover of David MacKay's textbook Information Theory, Inference, and Learning Algorithms, CUP, 2003, freely available online on David Mackay's website]
4 Entropy
• Consider a discrete random variable X ∈ {1,…,K}
• Suppose we observe event X=k. The info content of this event is related to its surprise factor
h(k) = log2 1/p(X = k) = − log2 p(X = k)
• The entropy of distribution p is the average info content
H(X) = − Σ_{k=1}^{K} p(X = k) log2 p(X = k)
• Max entropy = uniform, min entropy = delta fn
0 ≤ H(X) ≤ log2 K
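As a quick numerical check of these bounds, here is a small sketch (Python is used here purely for illustration; the course code is in Matlab):

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution p (a list of probabilities)."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

K = 4
uniform = [1.0 / K] * K       # maximum entropy: log2(K) = 2 bits
delta = [1.0, 0.0, 0.0, 0.0]  # minimum entropy: 0 bits

print(entropy(uniform))        # 2.0
assert entropy(delta) == 0.0   # a certain outcome carries no information
```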
5 Binary entropy function
• Suppose X ∈ {0,1}, p(X=1) = θ, p(X=0) = 1 − θ
• We say X ~ Bernoulli(θ)
H(X) = H(θ) = −[p(X = 1) log2 p(X = 1) + p(X = 0) log2 p(X = 0)]
= −[θ log2 θ + (1 − θ) log2(1 − θ)]
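A minimal Python sketch of this function (illustrative only; `binary_entropy` is our name, not part of the course code):

```python
import math

def binary_entropy(theta):
    """H(theta) in bits for X ~ Bernoulli(theta)."""
    if theta in (0.0, 1.0):
        return 0.0  # 0*log(0) is taken as 0
    return -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

print(binary_entropy(0.5))  # 1.0  (a fair coin is maximally uncertain)
print(binary_entropy(0.9))  # ~0.469 (a biased coin is more predictable)
```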
6 Entropy of p(y|x,D) for kNN
[Figure: heatmaps of p(y=1|x,D) and p(y=2|x,D) for weighted kNN (K=10, beta=1.0) with the training points overlaid, together with the entropy of p(y|x,D) at each x]
7 Active learning
• Suppose we can request the label y for any location (feature vector) x.
• A natural (myopic) criterion is to pick the one that minimizes our predictive uncertainty
x* = arg min_{x ∈ X} H(p(y|x, D))
• Implementing this in practice may be quite difficult, depending on the size of the X space, and the form of the probabilistic model p(y|x)
[Figure: training data and the entropy of p(y|x,D) under the kNN classifier]
8 Will cover later
Active learning with Gaussian Processes
If we assume the yi labels are correlated with their nearest neighbors, we can propagate information and rapidly classify all the points
9 Nando de Freitas
Will cover later
Entropy & source coding theorem
• Shannon proved that the minimum average number of bits needed to encode an RV with distribution p is H(p)
• Example: X in {a,b,c,d,e} with distribution p(a) = 0.25, p(b) = 0.25, p(c) = 0.2, p(d) = 0.15, p(e) = 0.15
• Assign short codewords (00,10,11) to common events (a,b,c) and long codewords (010,011) to rare events, in a prefix-free way
a → 00, b → 10, c → 11, d → 010, e → 011
001011010 → 00, 10, 11, 010 → abcd
Build tree bottom up – Huffman code
10 Not on exam
Example cont'd
• Example: X in {a,b,c,d,e} with distribution p(a) = 0.25, p(b) = 0.25, p(c) = 0.2, p(d) = 0.15, p(e) = 0.15
a → 00, b → 10, c → 11, d → 010, e → 011
• Average number of bits needed by this code:
0.25 ∗ 2 + 0.25 ∗ 2 + 0.2 ∗ 2 + 0.15 ∗ 3 + 0.15 ∗ 3 = 2.30
• Entropy of distribution: H = 2.2855
• To get closer to the lower bound, encode blocks of symbols at once (arithmetic coding)
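The arithmetic above can be reproduced in a few lines of Python (illustrative sketch, not course code):

```python
import math

# Distribution and prefix-free code from the slide
p = {'a': 0.25, 'b': 0.25, 'c': 0.2, 'd': 0.15, 'e': 0.15}
code = {'a': '00', 'b': '10', 'c': '11', 'd': '010', 'e': '011'}

# Expected codeword length, and the entropy lower bound
avg_len = sum(p[s] * len(code[s]) for s in p)
H = -sum(px * math.log2(px) for px in p.values())

print(round(avg_len, 2))  # 2.3
print(round(H, 4))        # 2.2855
assert H <= avg_len       # no code can beat the entropy lower bound
```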
Not on exam
11 Joint entropy
• The joint entropy of 2 RVs is defined as
H(X,Y) = − Σ_{x,y} p(x, y) log2 p(x, y)
• If X and Y are independent,
X ⊥ Y ⟺ p(X,Y) = p(X)p(Y)
then our joint uncertainty is maximal (since X does not inform us about Y, or vice versa):
X ⊥ Y ⟺ H(X,Y) = H(X) + H(Y)
• In general, considering events jointly reduces our uncertainty
H(X,Y) ≤ H(X) + H(Y) (non-trivial proof – see later)
• and our joint uncertainty is at least as large as either marginal uncertainty
H(X,Y) ≥ max(H(X), H(Y)) ≥ 0
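Both properties can be checked numerically with a small Python sketch (illustrative; the joint tables below are made-up examples):

```python
import math

def H(ps):
    """Entropy in bits; accepts a flat list or a nested (joint) table."""
    flat = [p for row in ps for p in row] if isinstance(ps[0], list) else ps
    return -sum(p * math.log2(p) for p in flat if p > 0)

def marginals(pxy):
    """Row and column marginals of a joint table p(x, y)."""
    px = [sum(row) for row in pxy]
    py = [sum(col) for col in zip(*pxy)]
    return px, py

# Independent case: the joint factorizes, so H(X,Y) = H(X) + H(Y)
indep = [[0.5 * 0.25, 0.5 * 0.75],
         [0.5 * 0.25, 0.5 * 0.75]]
px, py = marginals(indep)
assert abs(H(indep) - (H(px) + H(py))) < 1e-12

# Dependent case: H(X,Y) < H(X) + H(Y)
dep = [[0.4, 0.1],
       [0.1, 0.4]]
px, py = marginals(dep)
assert H(dep) < H(px) + H(py)
```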
When is H(X,Y) = H(X)?
12 Example
• Let X(n) be the event that n is even, and Y(n) be the event that n is prime, for n ∈ {1,…,8}
• The joint distribution = normalized counts:
p(even, prime) = 1/8, p(even, not prime) = 3/8, p(odd, prime) = 3/8, p(odd, not prime) = 1/8
H(X,Y) = −[ (1/8) log2 (1/8) + (3/8) log2 (3/8) + (3/8) log2 (3/8) + (1/8) log2 (1/8) ] = 1.8113
What is H(X) + H(Y)?
13 Example cont'd
• The marginals are p(X=even) = p(X=odd) = 1/2 and p(Y=prime) = p(Y=not prime) = 1/2
• Hence H(X) = H(Y) = 1, so
H(X,Y) = 1.8113 < H(X) + H(Y) = 2
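A short Python sketch (illustrative, not course code) that rebuilds this joint distribution from counts and recovers 1.8113:

```python
import math

# X(n): n is even; Y(n): n is prime, for n in 1..8
primes = {2, 3, 5, 7}
counts = [[0, 0], [0, 0]]  # counts[X][Y], using bools as 0/1 indices
for n in range(1, 9):
    counts[n % 2 == 0][n in primes] += 1

joint = [[c / 8 for c in row] for row in counts]  # [[1/8, 3/8], [3/8, 1/8]]
Hxy = -sum(p * math.log2(p) for row in joint for p in row)
print(round(Hxy, 4))  # 1.8113
```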
14 Conditional entropy
• H(Y|X) is the expected uncertainty in Y after seeing X
H(Y|X) := Σ_x p(x) H(Y|X = x)
= − Σ_x p(x) Σ_y p(y|x) log p(y|x)
= − Σ_{x,y} p(x,y) log p(y|x)
= − Σ_{x,y} p(x,y) log [ p(x,y) / p(x) ]
= − Σ_{x,y} p(x,y) log p(x,y) − Σ_x p(x) log [ 1 / p(x) ]
= H(X,Y) − H(X)
When is H(Y|X)=0? When is H(Y|X) = H(Y)?
15 Information never hurts • Conditioning on data always decreases (or at least, never increases) our uncertainty, on average
H(X,Y) ≤ H(Y) + H(X)   (from before)
H(Y|X) = H(X,Y) − H(X)   (from above)
≤ H(Y) + H(X) − H(X) = H(Y)
16 Mutual information
• I(X,Y) is how much our uncertainty about Y decreases when we observe X
I(X,Y) := Σ_y Σ_x p(x,y) log [ p(x,y) / (p(x) p(y)) ]
= −H(X,Y) + H(X) + H(Y)
= H(X) − H(X|Y)
= H(Y) − H(Y|X)
• Hence
H(X,Y) = H(X|Y) + H(Y|X) + I(X,Y)
Mackay 9.1
17 Mutual information
• MI captures dependence between RVs in the following sense:
I(X,Y) ≥ 0, and I(X,Y) = 0 ⟺ X ⊥ Y
• X ⊥ Y ⇒ I(X,Y) = 0 is easy to show; I(X,Y) = 0 ⇒ X ⊥ Y is harder.
• This is more general than a correlation coefficient, ρ ∈ [−1,1], which only captures linear dependence
• For MI, we have
0 ≤ I(X,Y ) ≤ H(X) ≤ log2 K
When is I(X,Y) = H(X)?
18 Example
• Recall the even/prime example, with its joint, marginal and conditional distributions
• Hence
H(Y|X) = −[ (1/8) log2 (1/4) + (3/8) log2 (3/4) + (3/8) log2 (3/4) + (1/8) log2 (1/4) ] = 0.8113
I(X,Y ) = H(Y ) − H(Y |X) = 1 − 0.8113 = 0.1887
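The full chain of quantities for this example can be checked in Python (illustrative sketch; note H(Y|X) = H(X,Y) − H(X) is used rather than summing conditionals directly):

```python
import math

# Joint distribution from the even/prime example
joint = [[1/8, 3/8],
         [3/8, 1/8]]
px = [sum(row) for row in joint]        # [0.5, 0.5]
py = [sum(col) for col in zip(*joint)]  # [0.5, 0.5]

Hxy = -sum(p * math.log2(p) for row in joint for p in row)
Hx = -sum(p * math.log2(p) for p in px)
Hy = -sum(p * math.log2(p) for p in py)

H_y_given_x = Hxy - Hx  # H(Y|X) = H(X,Y) - H(X)
I = Hy - H_y_given_x    # I(X,Y) = H(Y) - H(Y|X)
print(round(H_y_given_x, 4))  # 0.8113
print(round(I, 4))            # 0.1887
```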
cond = normalize(joint) = joint ./ repmat(sum(joint,2), 1, Y)
19 Relative entropy (KL divergence)
• The Kullback-Leibler (KL) divergence is defined as
D(p||q) = Σ_x p(x) log [ p(x) / q(x) ] = −H(p) − Σ_x p(x) log q(x)
• where −Σ_x p(x) log q(x) is the cross entropy
• KL is the average number of extra bits needed to encode the data if we think the distribution is q, but it is actually p.
• KL is not symmetric, and hence not a distance.
• However, D(p||q) ≥ 0, with equality iff p = q.
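These properties (non-negativity, asymmetry, zero iff p = q) can be verified with a small Python sketch (illustrative; the distributions are arbitrary examples):

```python
import math

def kl(p, q):
    """D(p||q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
print(round(kl(p, q), 4))  # extra bits per symbol if we code for q but see p
print(round(kl(q, p), 4))  # differs from kl(p, q): KL is not symmetric
assert kl(p, p) == 0.0
assert kl(p, q) > 0 and kl(q, p) > 0
```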
20 Jensen's inequality
• A concave function is one which lies above any chord:
f(λx1 + (1 − λ)x2) ≥ λf(x1) + (1 − λ)f(x2)
• Jensen: for any concave f, E[f(X)] ≤ f(E[X]):
Σ_x p(x) f(x) ≤ f( Σ_x p(x) x )
• Proof by induction: set λ = p(x = 1), 1 − λ = Σ_{x=2}^{K} p(x)
21 Not on exam
Proof that KL >= 0
• Let f(u) = log(1/u), a convex function (so Jensen's inequality applies with its direction reversed: E[f(X)] ≥ f(E[X])), and u(x) = q(x)/p(x)
D(p||q) = E_p[ f(q(x)/p(x)) ]
≥ f( Σ_x p(x) · q(x)/p(x) )
= log( 1 / Σ_x q(x) ) = log 1 = 0
• Hence
I(X,Y) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ] = D( p(x,y) || p(x) p(y) ) ≥ 0
• and
H(X) + H(Y) = I(X,Y) + H(X,Y) ≥ H(X,Y)
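Jensen's inequality itself is easy to sanity-check numerically; here is a sketch using the concave function log2 with an arbitrary made-up distribution:

```python
import math

# Jensen for concave f(u) = log2(u): E[f(X)] <= f(E[X])
p = [0.2, 0.5, 0.3]  # a distribution over the values below
x = [1.0, 4.0, 9.0]

E_f = sum(pi * math.log2(xi) for pi, xi in zip(p, x))
f_E = math.log2(sum(pi * xi for pi, xi in zip(p, x)))
assert E_f <= f_E  # 1.951... <= 2.292...
```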
22 Not on exam