
§ 2. Information measures: mutual information

2.1 Divergence: main inequality

Theorem 2.1 (Information Inequality).

D(P ‖ Q) ≥ 0 ;  D(P ‖ Q) = 0 iff P = Q

Proof. Let ϕ(x) ≜ x log x, which is strictly convex, and use Jensen’s Inequality:

D(P ‖ Q) = Σ_{x∈𝒳} Q(x) (P(x)/Q(x)) log (P(x)/Q(x)) = Σ_{x∈𝒳} Q(x) ϕ(P(x)/Q(x)) ≥ ϕ( Σ_{x∈𝒳} Q(x) P(x)/Q(x) ) = ϕ(1) = 0 ,

with equality iff P(x)/Q(x) is constant (hence equal to 1), i.e., P = Q, by strict convexity of ϕ.
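The inequality is easy to check numerically. Below is a minimal sketch (not part of the original notes), assuming numpy; the helper D is our own and uses the convention 0 log 0 = 0.

```python
import numpy as np

def D(P, Q):
    """KL divergence D(P||Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    mask = P > 0                               # convention: 0 log 0 = 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

rng = np.random.default_rng(0)
for _ in range(5):
    P = rng.dirichlet(np.ones(4))              # random pmfs on a 4-letter alphabet
    Q = rng.dirichlet(np.ones(4))
    assert D(P, Q) >= 0                        # information inequality
    assert abs(D(P, P)) < 1e-12                # equality when P = Q
```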

2.2 Conditional divergence

The main objects in our course are random variables. The main operation for creating new random variables, and also for defining relations between random variables, is that of a random transformation:

Definition 2.1. A conditional distribution (aka random transformation, transition probability kernel, Markov kernel, channel) K(⋅|⋅) has two arguments: the first argument is a measurable subset of 𝒴, the second argument is an element of 𝒳. It must satisfy:

1. For any x ∈ 𝒳: K(⋅|x) is a probability measure on 𝒴.

2. For any measurable A, the function x ↦ K(A|x) is measurable on 𝒳.

In this case we say that K acts from 𝒳 to 𝒴. In fact, we will abuse notation and write P_{Y|X} instead of K to suggest what the spaces 𝒳 and 𝒴 are.¹ Furthermore, if X and Y are connected by the random transformation P_{Y|X}, we will write X –P_{Y|X}→ Y.

Remark 2.1. (Very technical!) Unfortunately, condition 2 (standard in probability textbooks) will frequently not be strong enough for this course. The main reason is that we want Radon–Nikodym derivatives such as (dP_{Y|X=x}/dQ_Y)(y) to be jointly measurable in (x, y). See Section 2.6* for more.

Examples:

1. deterministic system: Y = f(X) ⇔ P_{Y|X=x} = δ_{f(x)}

2. decoupled system: Y ⊥⊥ X ⇔ P_{Y|X=x} = P_Y

¹ Another reason for writing P_{Y|X} is that from any joint distribution P_{X,Y} (on standard Borel spaces) one can extract a random transformation by conditioning on X.

3. additive noise (convolution): Y = X + Z with Z ⊥⊥ X ⇔ P_{Y|X=x} = P_{x+Z}.

Multiplication: X –P_{Y|X}→ Y to get P_{XY} = P_X P_{Y|X}:

P_{XY}(x, y) = P_{Y|X}(y|x) P_X(x) .

Composition (marginalization): P_Y = P_{Y|X} ∘ P_X, that is, P_{Y|X} acts on P_X to produce P_Y:

P_Y(y) = Σ_{x∈𝒳} P_{Y|X}(y|x) P_X(x) .

We will also write P_X –P_{Y|X}→ P_Y.
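On a finite alphabet, multiplication and composition are just elementwise and matrix–vector products of the kernel matrix with the input pmf. A minimal illustrative sketch (our own, with a made-up kernel K), assuming numpy:

```python
import numpy as np

PX = np.array([0.3, 0.7])                 # marginal P_X on X = {0, 1}
K  = np.array([[0.9, 0.1],                # row x of K is P_{Y|X=x}
               [0.2, 0.8]])

PXY = PX[:, None] * K                     # multiplication: P_{XY}(x,y) = P_X(x) P_{Y|X}(y|x)
PY  = PX @ K                              # composition: P_Y(y) = sum_x P_X(x) P_{Y|X}(y|x)
assert np.allclose(PXY.sum(axis=0), PY)   # marginalizing the joint over x recovers P_Y
```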

Definition 2.2 (Conditional divergence).

D(P_{Y|X} ‖ Q_{Y|X} | P_X) = E_{x∼P_X}[ D(P_{Y|X=x} ‖ Q_{Y|X=x}) ]   (2.1)
                           = Σ_{x∈𝒳} P_X(x) D(P_{Y|X=x} ‖ Q_{Y|X=x}) .   (2.2)

Note: H(X|Y) = log |𝒳| − D(P_{X|Y} ‖ U_X | P_Y), where U_X is the uniform distribution on 𝒳.

Theorem 2.2 (Properties of Divergence).

1. D(P_{Y|X} ‖ Q_{Y|X} | P_X) = D(P_X P_{Y|X} ‖ P_X Q_{Y|X})

2. (Simple chain rule) D(P_{XY} ‖ Q_{XY}) = D(P_{Y|X} ‖ Q_{Y|X} | P_X) + D(P_X ‖ Q_X)

3. (Monotonicity) D(P_{XY} ‖ Q_{XY}) ≥ D(P_Y ‖ Q_Y)

4. (Full chain rule)

D(P_{X_1⋯X_n} ‖ Q_{X_1⋯X_n}) = Σ_{i=1}^n D(P_{X_i|X^{i−1}} ‖ Q_{X_i|X^{i−1}} | P_{X^{i−1}})

In the special case of Q_{X^n} = ∏_i Q_{X_i} we have

D(P_{X_1⋯X_n} ‖ Q_{X_1} ⋯ Q_{X_n}) = D(P_{X_1⋯X_n} ‖ P_{X_1} ⋯ P_{X_n}) + Σ_i D(P_{X_i} ‖ Q_{X_i})

5. (Conditioning increases divergence) Let P_{Y|X} and Q_{Y|X} be two kernels, and let P_Y = P_{Y|X} ∘ P_X and Q_Y = Q_{Y|X} ∘ P_X. Then

D(P_Y ‖ Q_Y) ≤ D(P_{Y|X} ‖ Q_{Y|X} | P_X) ,

with equality iff D(P_{X|Y} ‖ Q_{X|Y} | P_Y) = 0.

Pictorially:

[Figure: the same input distribution P_X is pushed through the two kernels P_{Y|X} and Q_{Y|X}, producing P_Y and Q_Y respectively.]

6. (Data-processing for divergences) Let P_Y = P_{Y|X} ∘ P_X and Q_Y = P_{Y|X} ∘ Q_X, i.e.,

P_Y = ∫ P_{Y|X}(⋅|x) dP_X  and  Q_Y = ∫ P_{Y|X}(⋅|x) dQ_X   ⟹   D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X)   (2.3)

Pictorially:

[Figure: the two inputs P_X and Q_X are pushed through the same kernel P_{Y|X}, producing P_Y and Q_Y; then D(P_X ‖ Q_X) ≥ D(P_Y ‖ Q_Y).]

Proof. We only illustrate these results for the case of finite alphabets. The general case follows by a careful analysis of Radon–Nikodym derivatives, the introduction of regular branches of conditional probability, etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize the alphabets and let the granularity of the discretization go to 0. This method will become clearer in Lecture 4, once we understand the continuity of D.

1. E_{x∼P_X}[ D(P_{Y|X=x} ‖ Q_{Y|X=x}) ] = E_{(X,Y)∼P_X P_{Y|X}}[ log ( P_{Y|X} P_X / (Q_{Y|X} P_X) ) ].

2. Disintegration: E_{(X,Y)}[ log (P_{XY}/Q_{XY}) ] = E_{(X,Y)}[ log (P_{Y|X}/Q_{Y|X}) + log (P_X/Q_X) ].

3. Apply 2. with X and Y interchanged and use D(⋅ ‖ ⋅) ≥ 0.

4. Telescoping: P_{X^n} = ∏_{i=1}^n P_{X_i|X^{i−1}} and Q_{X^n} = ∏_{i=1}^n Q_{X_i|X^{i−1}}.

5. The inequality follows from monotonicity. To get the conditions for equality, notice that by the chain rule for D:

D(P_{XY} ‖ Q_{XY}) = D(P_{Y|X} ‖ Q_{Y|X} | P_X) + D(P_X ‖ P_X)   (the last term is 0)
                   = D(P_{X|Y} ‖ Q_{X|Y} | P_Y) + D(P_Y ‖ Q_Y) ,

where P_{XY} = P_X P_{Y|X} and Q_{XY} = P_X Q_{Y|X}; hence we get the claimed result from positivity of D.

6. This again follows from monotonicity.
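For the finite-alphabet case treated in this proof, the chain rule and monotonicity are easy to verify numerically. A small sketch of ours, assuming numpy; the helper D is the same KL routine used earlier:

```python
import numpy as np

def D(P, Q):
    P, Q = np.ravel(P).astype(float), np.ravel(Q).astype(float)
    m = P > 0
    return float(np.sum(P[m] * np.log2(P[m] / Q[m])))

rng = np.random.default_rng(1)
PXY = rng.dirichlet(np.ones(6)).reshape(2, 3)   # random joint pmfs on {0,1} x {0,1,2}
QXY = rng.dirichlet(np.ones(6)).reshape(2, 3)
PX, QX = PXY.sum(axis=1), QXY.sum(axis=1)
PY, QY = PXY.sum(axis=0), QXY.sum(axis=0)

# conditional divergence D(P_{Y|X} || Q_{Y|X} | P_X)
Dcond = sum(PX[x] * D(PXY[x] / PX[x], QXY[x] / QX[x]) for x in range(2))

assert np.isclose(D(PXY, QXY), Dcond + D(PX, QX))   # simple chain rule (property 2)
assert D(PXY, QXY) >= D(PY, QY) - 1e-12             # monotonicity (property 3)
```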

Corollary 2.1.

D(P_{X_1⋯X_n} ‖ Q_{X_1} ⋯ Q_{X_n}) ≥ Σ_i D(P_{X_i} ‖ Q_{X_i}) ,

with equality iff P_{X^n} = ∏_{j=1}^n P_{X_j}.

Note: In general we can have D(P_{XY} ‖ Q_{XY}) ≶ D(P_X ‖ Q_X) + D(P_Y ‖ Q_Y). For example, if X = Y under both P and Q, then D(P_{XY} ‖ Q_{XY}) = D(P_X ‖ Q_X) < 2 D(P_X ‖ Q_X), provided D(P_X ‖ Q_X) > 0. Conversely, if P_X = Q_X and P_Y = Q_Y but P_{XY} ≠ Q_{XY}, we have D(P_{XY} ‖ Q_{XY}) > 0 = D(P_X ‖ Q_X) + D(P_Y ‖ Q_Y).

Corollary 2.2. Y = f(X) ⇒ D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X), with equality if f is 1-to-1.

Note: D(P_Y ‖ Q_Y) = D(P_X ‖ Q_X) does not imply that f is 1-to-1. Example: P_X = Gaussian, Q_X = Laplace, Y = |X|.

Corollary 2.3 (Large deviations estimate). For any subset E ⊂ 𝒳 we have

d( P_X[E] ‖ Q_X[E] ) ≤ D(P_X ‖ Q_X) ,

where d(⋅‖⋅) denotes the binary divergence.

Proof. Consider Y = 1{X ∈ E} and apply Corollary 2.2.
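A quick numerical check of Corollary 2.3 (our own sketch, assuming numpy; d and D are our helper names for the binary and general divergence):

```python
import numpy as np

def D(P, Q):
    m = P > 0
    return float(np.sum(P[m] * np.log2(P[m] / Q[m])))

def d(p, q):                                    # binary divergence d(p||q)
    return D(np.array([p, 1.0 - p]), np.array([q, 1.0 - q]))

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(5))
Q = rng.dirichlet(np.ones(5))
E = np.array([True, False, True, True, False])  # an arbitrary event E
assert d(P[E].sum(), Q[E].sum()) <= D(P, Q) + 1e-12
```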

2.3 Mutual information

Definition 2.3 (Mutual information).

I(X; Y) = D(P_{XY} ‖ P_X P_Y)
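For finite alphabets the definition can be evaluated directly. A minimal sketch of ours, assuming numpy:

```python
import numpy as np

def D(P, Q):
    P, Q = np.ravel(P), np.ravel(Q)
    m = P > 0
    return float(np.sum(P[m] * np.log2(P[m] / Q[m])))

PXY = np.array([[0.4, 0.1],        # a small joint pmf of (X, Y)
                [0.1, 0.4]])
PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)
I = D(PXY, np.outer(PX, PY))       # I(X;Y) = D(P_XY || P_X P_Y), about 0.278 bits here
```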

Note:

• Intuition: I(X; Y) measures the dependence between X and Y, or the information about X (resp. Y) provided by Y (resp. X).

• Defined by Shannon (in a different form), in this form by Fano.

• Note: not restricted to discrete alphabets.

• I(X; Y) is a functional of the joint distribution P_{XY}, or equivalently, of the pair (P_X, P_{Y|X}).

Theorem 2.3 (Properties of I).

1. I(X; Y) = D(P_{XY} ‖ P_X P_Y) = D(P_{Y|X} ‖ P_Y | P_X) = D(P_{X|Y} ‖ P_X | P_Y)

2. Symmetry: I(X; Y) = I(Y; X)

3. Positivity: I(X; Y ) ≥ 0; I(X; Y ) = 0 iff X ⊥⊥ Y

4. I(f(X); Y) ≤ I(X; Y); f one-to-one ⇒ I(f(X); Y) = I(X; Y)

5. “More data ⇒ more info”: I(X_1, X_2; Z) ≥ I(X_1; Z)

Proof. 1. I(X; Y) = E[ log (P_{XY}/(P_X P_Y)) ] = E[ log (P_{Y|X}/P_Y) ] = E[ log (P_{X|Y}/P_X) ].

2. Apply the data-processing inequality twice to the map (x, y) ↦ (y, x) to get D(P_{X,Y} ‖ P_X P_Y) = D(P_{Y,X} ‖ P_Y P_X).

3. By definition.

4. We will use the data-processing property of mutual information (to be proved shortly, see Theorem 2.5). Consider the chain of data processing: (x, y) ↦ (f(x), y) ↦ (f −1(f(x)), y). Then I(X; Y ) ≥ I(f(X); Y ) ≥ I(f −1(f(X)); Y ) = I(X; Y )

5. Consider f(X1,X2) = X1.
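Property 4 is easy to spot-check numerically: merging symbols of X (a non-injective f) cannot increase mutual information. A small sketch of ours, assuming numpy:

```python
import numpy as np

def I(PXY):                                        # mutual information of a 2-D joint pmf
    PX = PXY.sum(axis=1, keepdims=True)
    PY = PXY.sum(axis=0, keepdims=True)
    m = PXY > 0
    return float(np.sum(PXY[m] * np.log2((PXY / (PX * PY))[m])))

rng = np.random.default_rng(6)
PXY = rng.dirichlet(np.ones(12)).reshape(4, 3)     # X on {0,1,2,3}, Y on {0,1,2}
PfXY = np.vstack([PXY[:2].sum(axis=0),             # f merges {0,1} and {2,3}
                  PXY[2:].sum(axis=0)])
assert I(PfXY) <= I(PXY) + 1e-12                   # I(f(X); Y) <= I(X; Y)
```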

Theorem 2.4 (I vs. H).

1. I(X; X) = H(X) if X is discrete; I(X; X) = +∞ otherwise.

2. If X, Y are discrete then I(X; Y) = H(X) + H(Y) − H(X, Y). If only X is discrete then I(X; Y) = H(X) − H(X|Y).

3. If X, Y are real-valued vectors having a joint pdf and all three differential entropies are finite, then I(X; Y) = h(X) + h(Y) − h(X, Y). If X has marginal pdf p_X and conditional pdf p_{X|Y}(x|y), then I(X; Y) = h(X) − h(X|Y).

4. If X or Y is discrete then I(X; Y) ≤ min(H(X), H(Y)), with equality iff H(X|Y) = 0 or H(Y|X) = 0, i.e., one is a deterministic function of the other.

Proof. 1. By definition, I(X; X) = D(P_{X|X} ‖ P_X | P_X) = E_{x∼P_X}[ D(δ_x ‖ P_X) ]. If P_X is discrete, then D(δ_x ‖ P_X) = log (1/P_X(x)) and I(X; X) = H(X). If P_X is not discrete, then let A = {x : P_X(x) > 0} denote the set of atoms of P_X. Let ∆ = {(x, x) : x ∉ A} ⊂ 𝒳 × 𝒳. Then P_{X,X}(∆) = P_X(Aᶜ) > 0, but since

(P_X × P_X)(E) ≜ ∫_𝒳 P_X(dx_1) ∫_𝒳 P_X(dx_2) 1{(x_1, x_2) ∈ E} ,

we have, by taking E = ∆, that (P_X × P_X)(∆) = 0. Thus P_{X,X} is not absolutely continuous with respect to P_X × P_X, and thus

I(X; X) = D(P_{X,X} ‖ P_X P_X) = +∞ .

2. E[ log (P_{XY}/(P_X P_Y)) ] = E[ log (1/P_X) + log (1/P_Y) − log (1/P_{XY}) ].

3. Similarly, when P_{X,Y} and P_X P_Y have densities p_{XY} and p_X p_Y, we have

D(P_{XY} ‖ P_X P_Y) ≜ E[ log ( p_{XY}/(p_X p_Y) ) ] = h(X) + h(Y) − h(X, Y) .

4. Follows from 2.

Corollary 2.4 (Conditioning reduces entropy). For X discrete: H(X|Y) ≤ H(X), with equality iff X ⊥⊥ Y.

Intuition: the amount of entropy reduction = mutual information.

Example: X = U OR Y, where U, Y ∼ Bern(1/2) i.i.d. Then X ∼ Bern(3/4) and H(X) = h(1/4) < 1 bit = H(X|Y = 0), i.e., conditioning on the event {Y = 0} increases entropy. But on average, H(X|Y) = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1) = 1/2 bit < H(X), by the strong concavity of h(⋅).
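A short numerical rendering of this example (ours, assuming numpy; h is our helper for the binary entropy):

```python
import numpy as np

def h(p):                                   # binary entropy in bits
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

H_X    = h(1/4)                             # X ~ Bern(3/4): about 0.811 bits
H_X_Y0 = h(1/2)                             # given Y = 0, X = U ~ Bern(1/2): 1 bit
H_X_Y1 = 0.0                                # given Y = 1, X = 1 deterministically
H_X_given_Y = 0.5*H_X_Y0 + 0.5*H_X_Y1       # = 0.5 bit

assert H_X_Y0 > H_X > H_X_given_Y           # increase for a particular y, decrease on average
```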

Note: Information, entropy and Venn diagrams:

1. The following diagram illustrates the relationship between entropy, conditional entropy, joint entropy, and mutual information.

[Venn diagram: two overlapping circles of areas H(X) and H(Y); their union has area H(X, Y), the intersection is I(X; Y), and the non-overlapping parts are H(X|Y) and H(Y|X).]

2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to

H(X_1) + H(X_2) + H(X_3) − H(X_1, X_2) − H(X_2, X_3) − H(X_1, X_3) + H(X_1, X_2, X_3)   (2.4)

which is sometimes denoted I(X; Y; Z). It can be both positive and negative (why?).

3. In general, one can treat random variables as sets (so that the r.v. X_i corresponds to a set E_i and (X_1, X_2) corresponds to E_1 ∪ E_2). Then we can define a unique signed measure µ on the finite algebra generated by these sets so that every information quantity is found by replacing

I or H by µ ,   the semicolon “;” by ∩ ,   the comma “,” by ∪ ,   and the conditioning bar “|” by the set difference ∖ .

As an example, we have

H(X_1|X_2, X_3) = µ(E_1 ∖ (E_2 ∪ E_3)) ,   (2.5)

I(X_1, X_2; X_3 | X_4) = µ(((E_1 ∪ E_2) ∩ E_3) ∖ E_4) .   (2.6)

By inclusion-exclusion, quantity (2.4) corresponds to µ(E_1 ∩ E_2 ∩ E_3), which explains why µ is not necessarily a positive measure.

Example: Bivariate Gaussian. Let X, Y be jointly Gaussian. Then

I(X; Y) = (1/2) log ( 1/(1 − ρ_{XY}^2) ) ,

where ρ_{XY} ≜ E[(X − EX)(Y − EY)]/(σ_X σ_Y) ∈ [−1, 1] is the correlation coefficient.

[Plot: I(X; Y) as a function of ρ_{XY} ∈ [−1, 1]: zero at ρ_{XY} = 0 and diverging to +∞ as |ρ_{XY}| → 1.]

Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX^2 = EY^2 = 1. Then ρ = E[XY]. By joint Gaussianity, Y = ρX + Z for some Z ∼ N(0, 1 − ρ^2) ⊥⊥ X. Then, using the divergence formula for Gaussians (1.16), we get

I(X; Y) = D(P_{Y|X} ‖ P_Y | P_X)
        = E_X D( N(ρX, 1 − ρ^2) ‖ N(0, 1) )
        = E[ (1/2) log (1/(1 − ρ^2)) + ((log e)/2) ((ρX)^2 + 1 − ρ^2 − 1) ]
        = (1/2) log ( 1/(1 − ρ^2) ) .

Note: Similar to mutual information, the correlation coefficient also measures, in a certain sense, the dependence between random variables that are real-valued (or, more generally, take values in an inner-product space). However, mutual information is invariant to bijections and is more general: it can be defined not just for numerical random variables, but also for apples and oranges.
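The bivariate formula can be double-checked against Theorem 2.4.3 using the standard closed form for Gaussian differential entropies. A sketch of ours, assuming numpy:

```python
import numpy as np

def h_gauss(Sigma):                          # differential entropy of N(0, Sigma) in bits
    Sigma = np.atleast_2d(np.asarray(Sigma, dtype=float))
    d = Sigma.shape[0]
    return 0.5 * np.log2((2*np.pi*np.e)**d * np.linalg.det(Sigma))

rho = 0.6
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])
I = h_gauss([[1.0]]) + h_gauss([[1.0]]) - h_gauss(Sigma)   # I = h(X) + h(Y) - h(X,Y)
assert np.isclose(I, 0.5*np.log2(1/(1 - rho**2)))          # about 0.322 bits for rho = 0.6
```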

Example: Additive white Gaussian noise (AWGN) channel. Let X ⊥⊥ N be independent Gaussians and Y = X + N. Then

I(X; X + N) = (1/2) log ( 1 + σ_X^2/σ_N^2 ) ,

where σ_X^2/σ_N^2 is the signal-to-noise ratio (SNR).

Example: Gaussian vectors. Let X ∈ R^m, Y ∈ R^n be jointly Gaussian. Then

I(X; Y) = (1/2) log ( det Σ_X det Σ_Y / det Σ_[X,Y] ) ,

where Σ_X ≜ E[(X − EX)(X − EX)'] denotes the covariance matrix of X ∈ R^m, and Σ_[X,Y] denotes the covariance matrix of the random vector [X, Y] ∈ R^{m+n}. In the special case of additive noise, Y = X + N for N ⊥⊥ X, we have

I(X; X + N) = (1/2) log ( det(Σ_X + Σ_N) / det Σ_N ) .

Why? Since det Σ_[X,X+N] = det( [Σ_X, Σ_X; Σ_X, Σ_X + Σ_N] ) = det Σ_X det Σ_N.
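The block-determinant identity above is easy to confirm numerically for random covariance matrices. A sketch of ours, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3
A = rng.standard_normal((m, m)); SigmaX = A @ A.T    # random positive-definite covariances
B = rng.standard_normal((m, m)); SigmaN = B @ B.T

SigmaY  = SigmaX + SigmaN                            # Y = X + N with N independent of X
SigmaXY = np.block([[SigmaX, SigmaX],                # covariance of the stacked vector [X, Y]
                    [SigmaX, SigmaY]])

I_general  = 0.5*np.log2(np.linalg.det(SigmaX)*np.linalg.det(SigmaY)/np.linalg.det(SigmaXY))
I_additive = 0.5*np.log2(np.linalg.det(SigmaX + SigmaN)/np.linalg.det(SigmaN))
assert np.isclose(I_general, I_additive)
```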

Example: Binary symmetric channel (BSC).

[Figure: the BSC(δ) — the input bit X is passed to Y unchanged with probability 1 − δ and flipped with probability δ; equivalently Y = X + N with additive noise N.]

X ∼ Bern(1/2),  N ∼ Bern(δ),  N ⊥⊥ X
Y = X + N  (addition mod 2)
I(X; Y) = log 2 − h(δ)
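A quick check of the BSC formula from the joint pmf (our sketch, assuming numpy):

```python
import numpy as np

def h(p):                                  # binary entropy in bits
    return float(-p*np.log2(p) - (1-p)*np.log2(1-p))

delta = 0.11
K   = np.array([[1-delta, delta],          # BSC(delta) kernel: row x is P_{Y|X=x}
                [delta, 1-delta]])
PX  = np.array([0.5, 0.5])
PXY = PX[:, None] * K
PY  = PXY.sum(axis=0)
I = sum(PXY[x, y] * np.log2(PXY[x, y] / (PX[x]*PY[y]))
        for x in range(2) for y in range(2))
assert np.isclose(I, 1 - h(delta))         # log 2 = 1 bit
```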

Example: Addition over finite groups. X is uniform on G and independent of Z. Then

I(X; X + Z) = log |G| − H(Z)

Proof. Show that X + Z is uniform on G regardless of the distribution of Z; then I(X; X + Z) = H(X + Z) − H(X + Z|X) = log |G| − H(Z|X) = log |G| − H(Z).
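A numerical check over the group Z_5 with an arbitrary distribution for Z (our sketch, assuming numpy):

```python
import numpy as np

def H(p):                                  # entropy in bits of a pmf given as an array
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

m = 5                                      # the group Z_5 under addition mod 5
rng = np.random.default_rng(4)
PZ = rng.dirichlet(np.ones(m))             # an arbitrary distribution for Z
PX = np.full(m, 1/m)                       # X uniform on the group, independent of Z

PXS = np.zeros((m, m))                     # joint pmf of (X, S) with S = X + Z mod m
for x in range(m):
    for z in range(m):
        PXS[x, (x + z) % m] += PX[x] * PZ[z]
PS = PXS.sum(axis=0)                       # S is uniform regardless of P_Z

I = H(PX) + H(PS) - H(PXS.ravel())         # I(X;S) = H(X) + H(S) - H(X,S)
assert np.isclose(I, np.log2(m) - H(PZ))
```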

2.4 Conditional mutual information and conditional independence

Definition 2.4 (Conditional mutual information).

I(X; Y|Z) = D(P_{XY|Z} ‖ P_{X|Z} P_{Y|Z} | P_Z)   (2.7)
          = E_{z∼P_Z}[ I(X; Y|Z = z) ] ,   (2.8)

where the product of two random transformations is (P_{X|Z=z} P_{Y|Z=z})(x, y) ≜ P_{X|Z}(x|z) P_{Y|Z}(y|z), under which X and Y are independent conditioned on Z.

Note: I(X; Y|Z) is a functional of P_{XYZ}.

Remark 2.2 (Conditional independence). A family of distributions can be represented by a directed acyclic graph. A simple example is a Markov chain (line graph), which represents distributions that factor as {P_{XYZ} : P_{XYZ} = P_X P_{Y|X} P_{Z|Y}}. The following are equivalent characterizations of this factorization (the notation X → Y → Z is read “X, Y, Z form a Markov chain”):

X → Y → Z ⇔ P_{XZ|Y} = P_{X|Y} ⋅ P_{Z|Y}
          ⇔ P_{Z|XY} = P_{Z|Y}
          ⇔ P_{XYZ} = P_X ⋅ P_{Y|X} ⋅ P_{Z|Y}
          ⇔ X, Y, Z form a Markov chain
          ⇔ X ⊥⊥ Z | Y  (conditional independence)
          ⇔ P_{XYZ} = P_Y ⋅ P_{X|Y} ⋅ P_{Z|Y}
          ⇔ Z → Y → X

Theorem 2.5 (Further properties of Mutual Information).

1. I(X; Z|Y) ≥ 0, with equality iff X → Y → Z

2. (Kolmogorov identity or small chain rule)

I(X, Y; Z) = I(X; Z) + I(Y; Z|X) = I(Y; Z) + I(X; Z|Y)

3. (Data Processing) If X → Y → Z, then

a) I(X; Z) ≤ I(X; Y)

b) I(X; Y|Z) ≤ I(X; Y)

4. (Full chain rule)

I(X^n; Y) = Σ_{k=1}^n I(X_k; Y|X^{k−1})

Proof. 1. By definition and Theorem 2.3.3.

2. Take expectations of logarithms in the identity

P_{XYZ}/(P_{XY} P_Z) = ( P_{XZ}/(P_X P_Z) ) ⋅ ( P_{Y|XZ}/P_{Y|X} ) .

27 3. Apply Kolmogorov identity to I(Y,Z; X):

I(Y, Z; X) = I(X; Y) + I(X; Z|Y)   (the second term is 0 since X → Y → Z)
           = I(X; Z) + I(X; Y|Z) ,

so I(X; Y) = I(X; Z) + I(X; Y|Z), which gives both a) and b).

4. Recursive application of Kolmogorov identity.
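The Kolmogorov identity is easy to verify on a random finite joint pmf. A sketch of ours, assuming numpy:

```python
import numpy as np

def I2(PAB):                                       # mutual information of a 2-D joint pmf
    PA = PAB.sum(axis=1, keepdims=True)
    PB = PAB.sum(axis=0, keepdims=True)
    m = PAB > 0
    return float(np.sum(PAB[m] * np.log2((PAB / (PA * PB))[m])))

rng = np.random.default_rng(7)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)     # a random joint pmf P_{XYZ}(x, y, z)

I_XY_Z = I2(P.reshape(4, 2))                       # treat the pair (X,Y) as one variable
I_X_Z  = I2(P.sum(axis=1))                         # joint of (X, Z)
PX = P.sum(axis=(1, 2))
I_Y_Z_given_X = sum(PX[x] * I2(P[x] / PX[x]) for x in range(2))

assert np.isclose(I_XY_Z, I_X_Z + I_Y_Z_given_X)   # I(X,Y;Z) = I(X;Z) + I(Y;Z|X)
```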

Example: f a 1-to-1 function ⇒ I(X; Y) = I(X; f(Y)).

Note: In general, I(X; Y|Z) ≷ I(X; Y). Examples:

a) “>”: Conditioning does not always decrease M.I. To find counterexamples when X, Y, Z do not form a Markov chain, notice that there is only one directed acyclic graph non-isomorphic to X → Y → Z, namely X → Y ← Z. Then a counterexample is (see also the numerical sketch at the end of this subsection):

X, Z ∼ Bern(1/2) i.i.d.,  Y = X ⊕ Z
I(X; Y) = 0  since X ⊥⊥ Y
I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X|Z) = log 2

b) “<”: Take Z = Y. Then I(X; Y|Y) = 0.

Note: (Chain rule for I ⇒ chain rule for H) Set Y = X^n. Then H(X^n) = I(X^n; X^n) = Σ_{k=1}^n I(X_k; X^n|X^{k−1}) = Σ_{k=1}^n H(X_k|X^{k−1}), since H(X_k|X^n, X^{k−1}) = 0.

Remark 2.3 (Data processing for mutual information via data processing of divergence). We proved data processing for mutual information in Theorem 2.5 using Kolmogorov’s identity. In fact, data processing for mutual information is implied by data processing for divergence:

I(X; Z) = D(P_{Z|X} ‖ P_Z | P_X) ≤ D(P_{Y|X} ‖ P_Y | P_X) = I(X; Y) ,

where we note that for each x we have P_{Y|X=x} –P_{Z|Y}→ P_{Z|X=x} and P_Y –P_{Z|Y}→ P_Z. Therefore, if we have a bivariate functional of distributions D(P ‖ Q) which satisfies data processing, then we can define an “M.I.-like” quantity via I_D(X; Y) ≜ D(P_{Y|X} ‖ P_Y | P_X) ≜ E_{x∼P_X} D(P_{Y|X=x} ‖ P_Y), which will satisfy data processing on Markov chains. A rich class of examples arises by taking D = D_f (an f-divergence, defined in (1.15)). That f-divergences satisfy data processing is going to be shown in Remark 4.2.
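Returning to counterexample a) above (X, Z i.i.d. Bern(1/2), Y = X ⊕ Z), the claim I(X; Y) = 0 < I(X; Y|Z) = log 2 can be checked directly (our sketch, assuming numpy):

```python
import numpy as np

def I2(PAB):                                         # mutual information of a 2-D joint pmf
    PA = PAB.sum(axis=1, keepdims=True)
    PB = PAB.sum(axis=0, keepdims=True)
    m = PAB > 0
    return float(np.sum(PAB[m] * np.log2((PAB / (PA * PB))[m])))

# P_{XZY}(x, z, y) with X, Z i.i.d. Bern(1/2) and Y = X xor Z
P = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        P[x, z, x ^ z] = 0.25

PXY = P.sum(axis=1)                                  # joint of (X, Y): uniform, so I(X;Y) = 0
PZ = P.sum(axis=(0, 2))
I_XY_given_Z = sum(PZ[z] * I2(P[:, z, :] / PZ[z]) for z in range(2))

assert np.isclose(I2(PXY), 0.0) and np.isclose(I_XY_given_Z, 1.0)   # 0 vs log 2 = 1 bit
```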

2.5 Strong data-processing inequalities

For many random transformations P_{Y|X}, it is possible to improve the data-processing inequality (2.3): For any P_X, Q_X we have

D(P_Y ‖ Q_Y) ≤ η_KL D(P_X ‖ Q_X) ,

where η_KL < 1 and depends on the channel P_{Y|X} only. Similarly, this gives an improvement in the data-processing inequality for mutual information: For any P_{U,X} we have

U → X → Y  ⟹  I(U; Y) ≤ η_KL I(U; X) .

For example, for P_{Y|X} = BSC(δ) we have η_KL = (1 − 2δ)^2. Strong data-processing inequalities quantify the intuitive observation that noise inside the channel P_{Y|X} must reduce the information that Y carries about the data U, regardless of how smart the hookup U → X is. This is an active area of research; see [PW15] for a short summary.
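The BSC case can be probed numerically: over random input pairs, the ratio D(P_Y ‖ Q_Y)/D(P_X ‖ Q_X) stays below (1 − 2δ)^2. A Monte-Carlo sketch of ours, assuming numpy (it only samples the ratio; it does not prove the bound):

```python
import numpy as np

def D(P, Q):
    m = P > 0
    return float(np.sum(P[m] * np.log2(P[m] / Q[m])))

delta = 0.2
K = np.array([[1-delta, delta],
              [delta, 1-delta]])               # BSC(delta) kernel
rng = np.random.default_rng(5)

eta = 0.0
for _ in range(2000):
    PX, QX = rng.dirichlet([1, 1]), rng.dirichlet([1, 1])
    dx = D(PX, QX)
    if dx > 1e-6:                              # skip nearly identical pairs
        eta = max(eta, D(PX @ K, QX @ K) / dx)

assert eta <= (1 - 2*delta)**2 + 1e-6          # eta_KL = (1 - 2*delta)^2 for the BSC
```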

2.6* How to avoid measurability problems?

As we mentioned in Remark 2.1, the conditions imposed by Definition 2.1 on P_{Y|X} are insufficient. Namely, we get the following two issues:

1. Radon–Nikodym derivatives such as (dP_{Y|X=x}/dQ_Y)(y) may not be jointly measurable in (x, y).

2. The set {x : P_{Y|X=x} ≪ Q_Y} may not be measurable.

The easiest way to avoid all such problems is the following:

Agreement A1: All conditional kernels P_{Y|X} : 𝒳 → 𝒴 in these notes will be assumed to be defined by choosing a σ-finite measure µ_2 on 𝒴 and a measurable function ρ(y|x) ≥ 0 on 𝒳 × 𝒴 such that

P_{Y|X}(A|x) = ∫_A ρ(y|x) µ_2(dy)

for all x and all measurable sets A, and ∫_𝒴 ρ(y|x) µ_2(dy) = 1 for all x.

Notes:

1. Given another kernel Q_{Y|X} specified via ρ'(y|x) and µ'_2, we may first replace µ_2 and µ'_2 by µ''_2 = µ_2 + µ'_2 and thus assume that both P_{Y|X} and Q_{Y|X} are specified in terms of the same dominating measure µ''_2. (This modifies ρ(y|x) to ρ(y|x) (dµ_2/dµ''_2)(y).)

2. Given two kernels P_{Y|X} and Q_{Y|X} specified in terms of the same dominating measure µ_2 and functions ρ_P(y|x) and ρ_Q(y|x), respectively, we may set

dP_{Y|X}/dQ_{Y|X} ≜ ρ_P(y|x)/ρ_Q(y|x)

outside of {ρ_Q = 0}. When P_{Y|X=x} ≪ Q_{Y|X=x}, the above gives a version of the Radon–Nikodym derivative, which is automatically measurable in (x, y).

3. Given Q_Y specified as dQ_Y = q(y) dµ_2, we may set

A_0 = { x : ∫_{{q=0}} ρ(y|x) dµ_2 = 0 } .

This set plays the role of {x : P_{Y|X=x} ≪ Q_Y}. Unlike the latter, A_0 is guaranteed to be measurable by Fubini [Ç11, Prop. 6.9]. By “plays the role” we mean that it allows one to prove statements like: For any P_X,

P_{X,Y} ≪ P_X Q_Y ⟺ P_X[A_0] = 1 .

So, while our agreement resolves the two measurability problems above, it introduces a new one. Indeed, given a joint distribution P_{X,Y} on standard Borel spaces, it is always true that one can extract a conditional distribution P_{Y|X} satisfying Definition 2.1 (this is called disintegration). However, it is not guaranteed that P_{Y|X} will satisfy Agreement A1. To work around this issue as well, we add another agreement:

Agreement A2: All joint distributions P_{X,Y} are specified by means of the data: µ_1, µ_2 — σ-finite measures on 𝒳 and 𝒴, respectively — and a measurable function λ(x, y) such that

P_{X,Y}(E) ≜ ∫_E λ(x, y) µ_1(dx) µ_2(dy) .

Notes:

1. Again, given a finite or countable collection of joint distributions P_{X,Y}, Q_{X,Y}, … satisfying A2, we may without loss of generality assume they are defined in terms of a common µ_1, µ_2.

2. Given P_{X,Y} satisfying A2, we can disintegrate it into a conditional (satisfying A1) and a marginal:

P_{Y|X}(A|x) = ∫_A ρ(y|x) µ_2(dy) ,   ρ(y|x) ≜ λ(x, y)/p(x) ,   (2.9)

P_X(A) = ∫_A p(x) µ_1(dx) ,   p(x) ≜ ∫_𝒴 λ(x, η) µ_2(dη) ,   (2.10)

with ρ(y|x) defined arbitrarily for those x for which p(x) = 0.

Remark 2.4. The first problem can also be resolved with the help of Doob’s version of the Radon–Nikodym theorem [Ç11, Chapter V.4, Theorem 4.44]: If the σ-algebra on 𝒴 is separable (satisfied whenever 𝒴 is a Polish space, for example) and P_{Y|X=x} ≪ Q_{Y|X=x}, then there exists a jointly measurable version of the Radon–Nikodym derivative

(x, y) ↦ (dP_{Y|X=x}/dQ_{Y|X=x})(y) .
