Application of Information Theory, Lecture 2
Joint & Conditional Entropy, Mutual Information
Handout Mode
Iftach Haitner, Tel Aviv University. Nov 4, 2014
Part I: Joint and Conditional Entropy

Joint entropy

▸ Recall that the entropy of a rv X over 𝒳 is defined by H(X) = −∑_{x∈𝒳} P_X(x) log P_X(x).
▸ Shorter notation: for X ∼ p, let H(X) = −∑_x p(x) log p(x) (where the summation is over the domain of X).
▸ The joint entropy of (jointly distributed) rvs X and Y with (X,Y) ∼ p is

      H(X,Y) = −∑_{x,y} p(x,y) log p(x,y).

  This is simply the entropy of the rv Z = (X,Y).
▸ Example: for the joint distribution

      X\Y    0     1
       0    1/2   1/4
       1    1/4    0

      H(X,Y) = −(1/2) log(1/2) − (1/4) log(1/4) − (1/4) log(1/4) = (1/2)⋅1 + (1/2)⋅2 = 3/2.

Joint entropy, cont.

▸ The joint entropy of (X₁,…,Xₙ) ∼ p is

      H(X₁,…,Xₙ) = −∑_{x₁,…,xₙ} p(x₁,…,xₙ) log p(x₁,…,xₙ).

Conditional entropy

▸ Let (X,Y) ∼ p.
▸ For x ∈ Supp(X), the random variable Y|X=x is well defined.
▸ The entropy of Y conditioned on X is defined by

      H(Y|X) := E_{x←X} H(Y|X=x) = E_X H(Y|X).

▸ It measures the uncertainty in Y given X.
▸ Let p_X and p_{Y|X} be the marginal and conditional distributions induced by p. Then

      H(Y|X) = ∑_{x∈𝒳} p_X(x) ⋅ H(Y|X=x)
             = −∑_{x∈𝒳} p_X(x) ∑_{y∈𝒴} p_{Y|X}(y|x) log p_{Y|X}(y|x)
             = −∑_{x∈𝒳, y∈𝒴} p(x,y) log p_{Y|X}(y|x)
             = −E_{(X,Y)} log p_{Y|X}(Y|X)
             = −E log Z, for Z = p_{Y|X}(Y|X).

Conditional entropy, cont.

▸ Example: consider the joint distribution

      X\Y    0     1
       0    1/4   1/4
       1    1/2    0

  What are H(Y|X) and H(X|Y)?

      H(Y|X) = E_{x←X} H(Y|X=x)
             = (1/2) H(Y|X=0) + (1/2) H(Y|X=1)
             = (1/2) H(1/2, 1/2) + (1/2) H(1, 0) = 1/2.

      H(X|Y) = E_{y←Y} H(X|Y=y)
             = (3/4) H(X|Y=0) + (1/4) H(X|Y=1)
             = (3/4) H(1/3, 2/3) + (1/4) H(1, 0) ≈ 0.6887 ≠ H(Y|X).

Conditional entropy, cont.

▸ Conditioning on two variables:

      H(X|Y,Z) = E_{(y,z)←(Y,Z)} H(X|Y=y, Z=z)
               = E_{y←Y} E_{z←Z|Y=y} H(X|Y=y, Z=z)
               = E_{y←Y} E_{z←Z|Y=y} H((X|Y=y)|Z=z)
               = E_{y←Y} H(X_y|Z_y),   for (X_y, Z_y) = (X, Z)|Y=y.

Relating joint entropy to conditional entropy

▸ What is the relation between H(X), H(Y), H(X,Y) and H(Y|X)?
▸ Intuitively, 0 ≤ H(Y|X) ≤ H(Y). Non-negativity is immediate; we prove the upper bound later.
▸ H(Y|X) = H(Y) iff X and Y are independent.
▸ In our example, H(Y) = H(3/4, 1/4) > 1/2 = H(Y|X).
▸ Note that H(Y|X=x) might be larger than H(Y) for some x ∈ Supp(X).
▸ Chain rule (proved next): H(X,Y) = H(X) + H(Y|X).
▸ Intuitively, the uncertainty in (X,Y) is the uncertainty in X plus the uncertainty in Y given X.
▸ H(Y|X) = H(X,Y) − H(X) can serve as an alternative definition of H(Y|X).

Chain rule (for the entropy function)

Claim 1. For rvs X, Y it holds that H(X,Y) = H(X) + H(Y|X).

▸ The proof follows immediately from the grouping axiom: writing p_{i,j} = Pr[X=i, Y=j] and q_i = ∑_{j=1}^n p_{i,j},

      H(p_{1,1},…,p_{n,n}) = H(q_1,…,q_n) + ∑_i q_i ⋅ H(p_{i,1}/q_i, …, p_{i,n}/q_i)
                           = H(X) + H(Y|X).

▸ Another proof: let (X,Y) ∼ p. Then p(x,y) = p_X(x) ⋅ p_{Y|X}(y|x)
      ⟹ log p(x,y) = log p_X(x) + log p_{Y|X}(y|x)
      ⟹ E log p(X,Y) = E log p_X(X) + E log p_{Y|X}(Y|X)
      ⟹ H(X,Y) = H(X) + H(Y|X).
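To make the definitions concrete, here is a minimal Python sketch (not part of the original handout) that computes H(X,Y) and H(Y|X) for the conditional-entropy example above and checks the chain rule H(X,Y) = H(X) + H(Y|X); the entropy helper and variable names are illustrative only.

```python
import math

def H(dist):
    """Shannon entropy (in bits) of a probability vector; 0*log 0 := 0."""
    return -sum(pr * math.log2(pr) for pr in dist if pr > 0)

# Joint distribution p(x, y) of the example: rows X in {0,1}, columns Y in {0,1}.
p = {(0, 0): 1/4, (0, 1): 1/4,
     (1, 0): 1/2, (1, 1): 0}

p_X = [sum(v for (x, y), v in p.items() if x == i) for i in (0, 1)]   # marginal of X
p_Y = [sum(v for (x, y), v in p.items() if y == j) for j in (0, 1)]   # marginal of Y

H_XY = H(p.values())             # joint entropy H(X,Y)
H_X, H_Y = H(p_X), H(p_Y)

# H(Y|X) = sum_x p_X(x) * H(Y | X=x), i.e. the entropy of each normalized row, averaged
H_Y_given_X = sum(
    p_X[i] * H([p[(i, j)] / p_X[i] for j in (0, 1)])
    for i in (0, 1) if p_X[i] > 0
)

print(H_XY)                                       # 1.5
print(H_Y_given_X)                                # 0.5
print(abs(H_XY - (H_X + H_Y_given_X)) < 1e-12)    # chain rule holds: True
```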
H(Y|X) ≤ H(Y)

▸ Jensen's inequality: for any concave function f, values t_1,…,t_k and λ_1,…,λ_k ∈ [0,1] with ∑_i λ_i = 1, it holds that ∑_i λ_i f(t_i) ≤ f(∑_i λ_i t_i).
▸ Let (X,Y) ∼ p. Then

      H(Y|X) = −∑_{x,y} p(x,y) log p_{Y|X}(y|x)
             = ∑_{x,y} p(x,y) log [p_X(x)/p(x,y)]
             = ∑_{x,y} p_Y(y) ⋅ [p(x,y)/p_Y(y)] ⋅ log [p_X(x)/p(x,y)]
             = ∑_y p_Y(y) ∑_x [p(x,y)/p_Y(y)] log [p_X(x)/p(x,y)]
             ≤ ∑_y p_Y(y) log ∑_x [p(x,y)/p_Y(y)] ⋅ [p_X(x)/p(x,y)]      (Jensen, with f = log)
             = ∑_y p_Y(y) log [1/p_Y(y)] = H(Y).

H(Y|X) ≤ H(Y), cont.

▸ Assume X and Y are independent (i.e., p(x,y) = p_X(x) ⋅ p_Y(y) for every x, y)
      ⟹ p_{Y|X} = p_Y ⟹ H(Y|X) = H(Y).

Other inequalities

▸ H(X), H(Y) ≤ H(X,Y) ≤ H(X) + H(Y). Both parts follow from H(X,Y) = H(X) + H(Y|X):
  ▸ the left inequality since H(Y|X) is non-negative;
  ▸ the right inequality since H(Y|X) ≤ H(Y).
▸ H(X,Y|Z) = H(X|Z) + H(Y|X,Z) (by the chain rule).
▸ H(X|Y,Z) ≤ H(X|Y). Proof:

      H(X|Y,Z) = E_{(Y,Z)} H(X|Y, Z)
               = E_Y E_{Z|Y} H(X|Y, Z)
               = E_Y E_{Z|Y} H((X|Y)|Z)
               ≤ E_Y E_{Z|Y} H(X|Y)
               = E_Y H(X|Y)
               = H(X|Y).

Chain rule (for the entropy function), general case

Claim 2. For rvs X₁,…,X_k it holds that H(X₁,…,X_k) = H(X₁) + H(X₂|X₁) + … + H(X_k|X₁,…,X_{k−1}).

Proof: ?

▸ An extremely useful property!
▸ Analogously to the two-variable case, it also holds that:
  ▸ H(X_i) ≤ H(X₁,…,X_k) ≤ ∑_i H(X_i)
  ▸ H(X₁,…,X_k|Y) ≤ ∑_i H(X_i|Y)

Examples

▸ Let X₁,…,Xₙ be Boolean iid rvs with X_i ∼ (1/3, 2/3) (from last class). Compute H(X₁,…,Xₙ).
▸ As above, but under the condition that ⊕_i X_i = 0?
▸ Via the chain rule?
▸ Via a mapping?

Applications

▸ Let X₁,…,Xₙ be Boolean iid rvs with X_i ∼ (p, 1−p), and let X = X₁,…,Xₙ. Let f be such that Pr[f(X) = z] = Pr[f(X) = z′] for every k ∈ ℕ and z, z′ ∈ {0,1}^k. Let K = |f(X)|. Prove that E[K] ≤ n ⋅ h(p).
▸ n ⋅ h(p) = H(X₁,…,Xₙ) ≥ H(f(X), K) = H(K) + H(f(X)|K) = H(K) + E[K] ≥ E[K].
▸ Interpretation?
▸ Positive results?

Applications, cont.

▸ How many comparisons does it take to sort n elements? Let A be a sorting algorithm for n elements that makes t comparisons. What can we say about t?
▸ Let X be a uniformly random permutation of [n], and let Y₁,…,Y_t be the (Boolean) answers A gets when sorting X.
▸ X is determined by Y₁,…,Y_t; namely, X = f(Y₁,…,Y_t) for some function f.
▸ H(X) = log n!
▸ H(X) = H(f(Y₁,…,Y_t)) ≤ H(Y₁,…,Y_t) ≤ ∑_i H(Y_i) ≤ t
      ⟹ t ≥ log n! = Θ(n log n).
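As a sanity check on this counting lower bound, the following sketch (my own illustration, not from the handout) counts the comparisons made by a standard merge sort on a random permutation and compares the count with ⌈log₂ n!⌉, which no comparison-based sorter can beat.

```python
import math, random

def merge_sort(a, counter):
    """Standard merge sort that counts element comparisons in counter[0]."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid], counter), merge_sort(a[mid:], counter)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1                      # exactly one comparison per loop iteration
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

n = 64
perm = random.sample(range(n), n)            # a uniformly random permutation of [n]
counter = [0]
merge_sort(perm, counter)

lower_bound = math.ceil(math.log2(math.factorial(n)))   # t >= log2(n!) for any comparison sort
print(f"comparisons used: {counter[0]}, information-theoretic lower bound: {lower_bound}")
```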
Concavity of the entropy function

Let p = (p₁,…,pₙ) and q = (q₁,…,qₙ) be two distributions, and for λ ∈ [0,1] consider the distribution τ_λ = λp + (1−λ)q, i.e., τ_λ = (λp₁ + (1−λ)q₁, …, λpₙ + (1−λ)qₙ).

Claim 3. H(τ_λ) ≥ λH(p) + (1−λ)H(q).

Proof:
▸ Let Y over {0,1} be 1 w.p. λ.
▸ Let X be distributed according to p if Y = 1, and according to q otherwise (so X ∼ τ_λ).
▸ H(τ_λ) = H(X) ≥ H(X|Y) = λH(p) + (1−λ)H(q).

We are now certain that we drew the graph of the (two-dimensional) entropy function right...

Part II: Mutual Information

Mutual information

▸ I(X;Y) — the "information" that X gives about Y.
▸ I(X;Y) := H(Y) − H(Y|X)
          = H(Y) − (H(X,Y) − H(X))
          = H(X) + H(Y) − H(X,Y)
          = I(Y;X).
▸ The mutual information that X gives about Y equals the mutual information that Y gives about X.
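The sketch below (again an illustration, not from the handout) evaluates three equivalent expressions for the mutual information on a small randomly generated joint distribution; they agree up to floating-point error, which is exactly the symmetry stated above.

```python
import math, random

def H(dist):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(pr * math.log2(pr) for pr in dist if pr > 0)

# A random 3x4 joint distribution p(x, y), just for illustration.
random.seed(0)
raw = [[random.random() for _ in range(4)] for _ in range(3)]
total = sum(sum(row) for row in raw)
p = [[v / total for v in row] for row in raw]

p_X = [sum(row) for row in p]                                  # marginal of X
p_Y = [sum(p[x][y] for x in range(3)) for y in range(4)]       # marginal of Y

H_X, H_Y = H(p_X), H(p_Y)
H_XY = H([v for row in p for v in row])
H_Y_given_X = sum(p_X[x] * H([p[x][y] / p_X[x] for y in range(4)]) for x in range(3))
H_X_given_Y = sum(p_Y[y] * H([p[x][y] / p_Y[y] for x in range(3)]) for y in range(4))

# Three equivalent expressions for I(X;Y):
i1 = H_Y - H_Y_given_X          # information X gives about Y
i2 = H_X - H_X_given_Y          # information Y gives about X
i3 = H_X + H_Y - H_XY
print(i1, i2, i3)               # all three agree (up to floating-point error)
```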
