Joint & Conditional Entropy, Mutual Information
Application of Information Theory, Lecture 2: Joint & Conditional Entropy, Mutual Information
Handout Mode
Iftach Haitner, Tel Aviv University. Nov 4, 2014

Part I: Joint and Conditional Entropy

Joint entropy
▸ Recall that the entropy of a rv X over 𝒳 is defined by H(X) = −∑_{x∈𝒳} P_X(x) log P_X(x).
▸ Shorter notation: for X ∼ p, let H(X) = −∑_x p(x) log p(x) (where the summation is over the domain of X).
▸ The joint entropy of (jointly distributed) rvs X and Y with (X,Y) ∼ p is
  H(X,Y) = −∑_{x,y} p(x,y) log p(x,y).
  This is simply the entropy of the rv Z = (X,Y).
▸ Example: for the joint distribution
         Y=0   Y=1
  X=0    1/4   1/4
  X=1    1/2    0
  H(X,Y) = −(1/2) log(1/2) − (1/4) log(1/4) − (1/4) log(1/4) = (1/2)⋅1 + (1/2)⋅2 = 3/2.

Joint entropy, cont.
▸ The joint entropy of (X_1,…,X_n) ∼ p is
  H(X_1,…,X_n) = −∑_{x_1,…,x_n} p(x_1,…,x_n) log p(x_1,…,x_n).

Conditional entropy
▸ Let (X,Y) ∼ p.
▸ For x ∈ Supp(X), the random variable Y|X=x is well defined.
▸ The entropy of Y conditioned on X is defined by
  H(Y|X) := E_{x←X}[H(Y|X=x)] = E_X[H(Y|X)].
▸ It measures the uncertainty in Y given X.
▸ Let p_X & p_{Y|X} be the marginal & conditional distributions induced by p. Then
  H(Y|X) = ∑_{x∈𝒳} p_X(x) ⋅ H(Y|X=x)
         = −∑_{x∈𝒳} p_X(x) ∑_{y∈𝒴} p_{Y|X}(y|x) log p_{Y|X}(y|x)
         = −∑_{x∈𝒳, y∈𝒴} p(x,y) log p_{Y|X}(y|x)
         = −E_{(X,Y)}[log p_{Y|X}(Y|X)]
         = −E_Z[log Z], for Z = p_{Y|X}(Y|X).

Conditional entropy, cont.
▸ Example: for the same joint distribution as above, what are H(Y|X) and H(X|Y)?
  H(Y|X) = E_{x←X}[H(Y|X=x)] = (1/2) H(Y|X=0) + (1/2) H(Y|X=1)
         = (1/2) H(1/2, 1/2) + (1/2) H(1, 0) = 1/2.
  H(X|Y) = E_{y←Y}[H(X|Y=y)] = (3/4) H(X|Y=0) + (1/4) H(X|Y=1)
         = (3/4) H(2/3, 1/3) + (1/4) H(1, 0) ≈ 0.6887 ≠ H(Y|X).

Conditional entropy, cont..
▸ H(X|Y,Z) = E_{(y,z)←(Y,Z)}[H(X|Y=y, Z=z)]
           = E_{y←Y} E_{z←Z|Y=y}[H(X|Y=y, Z=z)]
           = E_{y←Y} E_{z←Z|Y=y}[H((X|Y=y)|Z=z)]
           = E_{y←Y}[H(X_y|Z_y)], for (X_y, Z_y) = (X,Z)|Y=y.

Relating joint entropy to conditional entropy
▸ What is the relation between H(X), H(Y), H(X,Y) and H(Y|X)?
▸ Intuitively, 0 ≤ H(Y|X) ≤ H(Y). Non-negativity is immediate; we prove the upper bound later.
▸ H(Y|X) = H(Y) iff X and Y are independent.
▸ In our example, H(Y) = H(3/4, 1/4) > 1/2 = H(Y|X).
▸ Note that H(Y|X=x) might be larger than H(Y) for some x ∈ Supp(X).
▸ Chain rule (proved next): H(X,Y) = H(X) + H(Y|X).
▸ Intuitively, the uncertainty in (X,Y) is the uncertainty in X plus the uncertainty in Y given X.
▸ H(Y|X) = H(X,Y) − H(X) serves as an alternative definition of H(Y|X).

Chain rule (for the entropy function)
Claim 1: For rvs X, Y, it holds that H(X,Y) = H(X) + H(Y|X).
▸ The proof follows immediately from the grouping axiom: writing p_{i,j} = Pr[X = i, Y = j] for the entries of the joint distribution matrix and q_i = ∑_{j=1}^n p_{i,j} for its row marginals,
  H(X,Y) = H(p_{1,1},…,p_{n,n}) = H(q_1,…,q_n) + ∑_i q_i ⋅ H(p_{i,1}/q_i, …, p_{i,n}/q_i) = H(X) + H(Y|X).
▸ Another proof. Let (X,Y) ∼ p.
  ▸ p(x,y) = p_X(x) ⋅ p_{Y|X}(y|x)
  ⟹ log p(x,y) = log p_X(x) + log p_{Y|X}(y|x)
  ⟹ E[log p(X,Y)] = E[log p_X(X)] + E[log p_{Y|X}(Y|X)]
  ⟹ H(X,Y) = H(X) + H(Y|X).
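The worked example and Claim 1 are easy to check numerically. The following short Python sketch (not part of the original handout) recomputes H(X,Y), H(X), H(Y) and H(Y|X) from the example table, using base-2 logarithms and the convention 0 ⋅ log 0 = 0:

```python
from math import log2

# The example joint distribution from the slides above.
p = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 0.0}  # joint p(x, y)

def H(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

# Marginals p_X and p_Y.
pX = {x: sum(p[(x, y)] for y in (0, 1)) for x in (0, 1)}
pY = {y: sum(p[(x, y)] for x in (0, 1)) for y in (0, 1)}

# H(Y|X) = sum_x p_X(x) * H(Y | X = x).
HYgivenX = sum(
    pX[x] * H({y: p[(x, y)] / pX[x] for y in (0, 1)})
    for x in (0, 1)
    if pX[x] > 0
)

print(H(p))              # H(X,Y) = 1.5
print(H(pX), H(pY))      # H(X) = 1.0, H(Y) ≈ 0.811
print(HYgivenX)          # H(Y|X) = 0.5
print(H(pX) + HYgivenX)  # chain rule: H(X) + H(Y|X) = 1.5 = H(X,Y)
```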
H(Y|X) ≤ H(Y)
Jensen's inequality: for any concave function f, values t_1,…,t_k and λ_1,…,λ_k ∈ [0,1] with ∑_i λ_i = 1, it holds that ∑_i λ_i f(t_i) ≤ f(∑_i λ_i t_i).
Let (X,Y) ∼ p. Then
  H(Y|X) = −∑_{x,y} p(x,y) log p_{Y|X}(y|x)
         = ∑_{x,y} p(x,y) log [p_X(x)/p(x,y)]
         = ∑_{x,y} p_Y(y) ⋅ (p(x,y)/p_Y(y)) log [p_X(x)/p(x,y)]
         = ∑_y p_Y(y) ∑_x (p(x,y)/p_Y(y)) log [p_X(x)/p(x,y)]
         ≤ ∑_y p_Y(y) log ∑_x (p(x,y)/p_Y(y)) ⋅ (p_X(x)/p(x,y))      (Jensen; log is concave)
         = ∑_y p_Y(y) log (1/p_Y(y)) = H(Y).

H(Y|X) ≤ H(Y), cont.
▸ Assume X and Y are independent (i.e., p(x,y) = p_X(x) ⋅ p_Y(y) for all x,y)
  ⟹ p_{Y|X} = p_Y ⟹ H(Y|X) = H(Y).

Other inequalities
▸ H(X), H(Y) ≤ H(X,Y) ≤ H(X) + H(Y). Follows from H(X,Y) = H(X) + H(Y|X):
  ▸ the left inequality since H(Y|X) is non-negative;
  ▸ the right inequality since H(Y|X) ≤ H(Y).
▸ H(X,Y|Z) = H(X|Z) + H(Y|X,Z) (by the chain rule).
▸ H(X|Y,Z) ≤ H(X|Y). Proof:
  H(X|Y,Z) = E_{Y,Z}[H(X|Y,Z)]
           = E_Y E_{Z|Y}[H(X|Y,Z)]
           = E_Y E_{Z|Y}[H((X|Y)|Z)]
           ≤ E_Y E_{Z|Y}[H(X|Y)]
           = E_Y[H(X|Y)]
           = H(X|Y).

Chain rule (for the entropy function), general case
Claim 2: For rvs X_1,…,X_k, it holds that H(X_1,…,X_k) = H(X_1) + H(X_2|X_1) + … + H(X_k|X_1,…,X_{k−1}).
Proof: ?
▸ Extremely useful property!
▸ Analogously to the two-variable case, it also holds that:
  ▸ H(X_i) ≤ H(X_1,…,X_k) ≤ ∑_i H(X_i)
  ▸ H(X_1,…,X_k|Y) ≤ ∑_i H(X_i|Y)

Examples
▸ (From last class) Let X_1,…,X_n be Boolean iid rvs with X_i ∼ (1/3, 2/3). Compute H(X_1,…,X_n).
▸ As above, but under the condition that ⊕_i X_i = 0?
  ▸ Via the chain rule?
  ▸ Via a mapping?

Applications
▸ Let X_1,…,X_n be Boolean iid rvs with X_i ∼ (p, 1−p) and let X = X_1,…,X_n. Let f be such that Pr[f(X) = z] = Pr[f(X) = z′] for every k ∈ ℕ and z, z′ ∈ {0,1}^k. Let K = |f(X)|. Prove that E[K] ≤ n ⋅ h(p).
▸ n ⋅ h(p) = H(X_1,…,X_n) ≥ H(f(X), K) = H(K) + H(f(X)|K) = H(K) + E[K] ≥ E[K]
  (given K = k, f(X) is uniform over {0,1}^k, so H(f(X)|K) = E[K]).
▸ Interpretation
▸ Positive results

Applications, cont.
▸ How many comparisons does it take to sort n elements? Let A be a sorting algorithm for n elements making t comparisons. What can we say about t?
▸ Let X be a uniform random permutation of [n] and let Y_1,…,Y_t be the answers A gets when sorting X.
▸ X is determined by Y_1,…,Y_t. Namely, X = f(Y_1,…,Y_t) for some function f.
▸ H(X) = log n!
▸ H(X) = H(f(Y_1,…,Y_t)) ≤ H(Y_1,…,Y_t) ≤ ∑_i H(Y_i) ≤ t (each Y_i is Boolean, so H(Y_i) ≤ 1)
  ⟹ t ≥ log n! = Θ(n log n).

Concavity of the entropy function
Let p = (p_1,…,p_n) and q = (q_1,…,q_n) be two distributions, and for λ ∈ [0,1] consider the distribution τ_λ = λp + (1−λ)q, i.e., τ_λ = (λp_1 + (1−λ)q_1, …, λp_n + (1−λ)q_n).
Claim 3: H(τ_λ) ≥ λH(p) + (1−λ)H(q).
Proof:
▸ Let Y over {0,1} be 1 w.p. λ.
▸ Let X be distributed according to p if Y = 1 and according to q otherwise (so X ∼ τ_λ).
▸ H(τ_λ) = H(X) ≥ H(X|Y) = λH(p) + (1−λ)H(q).
We are now certain that we drew the graph of the (two-dimensional) entropy function right...
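As a quick empirical sanity check of Claim 3 (again a sketch that is not part of the handout), one can sample random distribution pairs and mixing weights in Python and confirm that the entropy of the mixture dominates the mixture of the entropies:

```python
import random
from math import log2

def H(dist):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(t * log2(t) for t in dist if t > 0)

def random_dist(n):
    """A random probability vector of length n."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(1000):
    n = random.randint(2, 8)
    p, q = random_dist(n), random_dist(n)
    lam = random.random()
    tau = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    # Claim 3: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q) (up to float error).
    assert H(tau) >= lam * H(p) + (1 - lam) * H(q) - 1e-9

print("Claim 3 held on all sampled instances")
```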
Part II: Mutual Information

Mutual information
▸ I(X;Y) — the "information" that X gives on Y.
▸ I(X;Y) := H(Y) − H(Y|X) = H(Y) − (H(X,Y) − H(X)) = H(X) + H(Y) − H(X,Y) = I(Y;X).
▸ The mutual information that X gives about Y equals the mutual information that Y gives about X.
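To see the symmetry concretely, the sketch below (not from the handout) evaluates the three equivalent expressions for I(X;Y) on the example table from Part I; all three come out to ≈ 0.311 bits:

```python
from math import log2

# The example joint distribution from Part I.
p = {(0, 0): 1/4, (0, 1): 1/4, (1, 0): 1/2, (1, 1): 0.0}

def H(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def marginal(axis):
    """Marginal of coordinate `axis` (0 for X, 1 for Y)."""
    out = {}
    for xy, q in p.items():
        out[xy[axis]] = out.get(xy[axis], 0.0) + q
    return out

def cond_entropy(axis):
    """Entropy of the other coordinate given coordinate `axis`."""
    m = marginal(axis)
    total = 0.0
    for k, mk in m.items():
        if mk > 0:
            cond = {xy: q / mk for xy, q in p.items() if xy[axis] == k}
            total += mk * H(cond)
    return total

pX, pY = marginal(0), marginal(1)
print(H(pY) - cond_entropy(0))   # I(X;Y) as H(Y) - H(Y|X)      ≈ 0.311
print(H(pX) - cond_entropy(1))   # I(X;Y) as H(X) - H(X|Y)      same value
print(H(pX) + H(pY) - H(p))      # I(X;Y) as H(X)+H(Y)-H(X,Y)   same value
```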