
Entropy and Mutual Information (Discrete Random Variables)

Máster Universitario en Ingeniería de Telecomunicación

I. Santamaría, Universidad de Cantabria

Contents

Introduction

Entropy

Joint Entropy and Conditional Entropy

Relative Entropy

Mutual Information

Jensen’s inequality


Entropy and mutual information are key concepts in information theory

- Entropy
  - The entropy H(X) of a random variable X gives us the fundamental limit for data compression
  - A source producing i.i.d. realizations of X can be compressed up to H(X) bits/realization
  - The entropy is the average length of the shortest description of X

- Mutual information
  - The mutual information gives us the fundamental limit for reliable transmission
  - The capacity of a channel is given by

      C = max_{p(x)} I(X; Y)


Definitions

Let X be a discrete random variable (r.v.) that takes values x ∈ X. The probability mass function (pmf) of X will be denoted by

p(x) = Pr{X = x}

Example 1: Bernoulli r.v. X = {0, 1}

Pr{X = 1} = p, Pr{X = 0} = 1 − p

Example 2: Binomial r.v. X ∼ B(n, p) (X = {0, 1, ..., n})

    Pr{X = k} = (n choose k) p^k (1 − p)^(n−k)

- Note that X ∼ B(1, p) is a Bernoulli r.v.

- If X1, ..., Xn are independent B(1, p), then Y = Σ_{k=1}^{n} Xk ∼ B(n, p)
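As a quick numerical illustration (a sketch added here, not part of the original slides; the function name binomial_pmf and the values n = 5, p = 0.3 are arbitrary choices), the following Python snippet checks empirically that a sum of independent Bernoulli(p) draws follows B(n, p):

```python
import math
import random

def binomial_pmf(n, p, k):
    """Exact binomial probability Pr{X = k} for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Empirical check: the sum of n independent Bernoulli(p) draws behaves like B(n, p)
n, p, trials = 5, 0.3, 200_000
random.seed(0)
counts = [0] * (n + 1)
for _ in range(trials):
    y = sum(1 if random.random() < p else 0 for _ in range(n))  # Y = X1 + ... + Xn
    counts[y] += 1

for k in range(n + 1):
    print(f"k={k}: empirical {counts[k] / trials:.4f}   exact {binomial_pmf(n, p, k):.4f}")
```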


Information

For a discrete random variable X, the information of an outcome X = x is i(x) = −log(p(x)), and it is measured in bits¹

- Why does the log of a probability measure information? Not a simple question, but there are strong reasons (see the numerical sketch after this list):
  - Less probable outcomes are the most informative ones
  - For independent random variables: i(x, y) = −log(p(x, y)) = −log(p(x)p(y)) = i(x) + i(y)
  - The outcome of a random experiment is most informative if the pmf is uniform → experiment design, guessing games, ...
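The numerical sketch below (ours, not from the slides) evaluates i(x) = −log2 p(x) for a few probabilities and verifies the additivity property for independent outcomes:

```python
import math

def info(p):
    """Information content, in bits, of an outcome with probability p."""
    return -math.log2(p)

# Less probable outcomes are more informative
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  ->  i = {info(p):.2f} bits")

# Additivity for independent outcomes: i(x, y) = i(x) + i(y)
px, py = 0.25, 0.1
print(info(px * py), "==", info(px) + info(py))   # both are about 5.32 bits
```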

¹ Unless otherwise indicated, in this course log denotes the logarithm in base 2.

If X is Bernoulli with Pr{X = 1} = p, the information of the outcome X = 1 is i = − log(p)

[Figure: i = −log(p) plotted as a function of p ∈ (0, 1]]

The information is always a non-negative quantity


Entropy

Definition: The entropy of a discrete random variable X is defined by²

    H(X) = −E[log p(X)] = − Σ_{x∈X} p(x) log(p(x))

- For discrete random variables, H(X) ≥ 0

- It is the average information of the random variable X:

      H(X) = E[i(X)],

  note that this can also be interpreted as the mean value of a new (transformed) r.v. Y = −log(p(X))
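A minimal helper (our own sketch; the name entropy is not from the slides) that computes H(X) directly from a pmf given as a list of probabilities:

```python
import math

def entropy(pmf):
    """H(X) = -sum_x p(x) log2 p(x), using the convention 0 log(0) = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit (fair coin)
print(entropy([0.9, 0.1]))    # about 0.469 bits (biased coin, less uncertainty)
print(entropy([0.25] * 4))    # 2.0 bits (uniform over 4 values)
```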

² We assume by convention that 0 log(0) = 0.

Interpretation

- Entropy is the measure of average uncertainty in X

- Entropy is the average number of bits needed to describe X

- Entropy is a lower bound on the average length of the shortest description of X


Example 1: Entropy of a Bernoulli r.v. with parameter p

    H(X) = −p log(p) − (1 − p) log(1 − p) ≜ H(p)

[Figure: the binary entropy function H(p) for p ∈ [0, 1]]

It is a concave function of p
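As a numerical sketch of this plot (ours, with arbitrarily chosen sample points), the binary entropy can be tabulated to see that it vanishes at the endpoints and peaks at 1 bit for p = 0.5:

```python
import math

def binary_entropy(p):
    """H(p) = -p log2(p) - (1 - p) log2(1 - p), with the convention 0 log(0) = 0."""
    return sum(-q * math.log2(q) for q in (p, 1 - p) if 0 < q < 1)

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"p = {p:<4}  H(p) = {binary_entropy(p):.3f} bits")
```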


Example 2: Entropy of a uniform r.v. taking on K values, e.g., X = {1, ..., K}

    H(X) = − Σ_{x∈X} p(x) log(p(x)) = Σ_{i=1}^{K} (1/K) log(K) = log(K)

- It does not depend on the values that X takes, only on their probabilities (X and X + a have the same entropy!)

- Property: For an arbitrary discrete r.v., H(X) ≤ log(|X|), where |X| denotes the cardinality of the alphabet, and H(X) = log(|X|) iff X has a uniform distribution over X. That is to say, H(X) is a lower bound on the number of binary questions (bits) that are always guaranteed to identify an outcome from the ensemble X (see the sketch below)
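The sketch below (ours; K = 8 and the randomly generated pmfs are arbitrary assumptions) illustrates the bound H(X) ≤ log(|X|) numerically:

```python
import math
import random

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

K = 8
random.seed(1)
# Randomly generated pmfs over K values never exceed the uniform entropy log2(K) = 3 bits
for _ in range(5):
    w = [random.random() for _ in range(K)]
    pmf = [x / sum(w) for x in w]
    print(f"H(X) = {entropy(pmf):.3f} bits  <=  log2(K) = {math.log2(K):.3f} bits")
```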


Example 3: The entropy of English

- Probabilities estimated from The Frequently Asked Questions Manual for Linux (table reprinted from MacKay's textbook)

- 27 symbols: the 26 letters (a-z) and a space character '-'

      i   ai   pi          i   ai   pi          i   ai   pi
      1   a    0.0575      10  j    0.0006      19  s    0.0567
      2   b    0.0128      11  k    0.0084      20  t    0.0706
      3   c    0.0263      12  l    0.0335      21  u    0.0334
      4   d    0.0285      13  m    0.0235      22  v    0.0069
      5   e    0.0913      14  n    0.0596      23  w    0.0119
      6   f    0.0173      15  o    0.0689      24  x    0.0073
      7   g    0.0133      16  p    0.0192      25  y    0.0164
      8   h    0.0313      17  q    0.0008      26  z    0.0007
      9   i    0.0599      18  r    0.0508      27  -    0.1928

- The entropy is

      H = − Σ_i pi log(pi) = 4.11 bits/letter

- The most informative letter is z: i(z) = −log(0.0007) ≈ 10.5 bits (frequency of appearance 0.07%)

- The least informative letter is e: i(e) = −log(0.0913) ≈ 3.5 bits (frequency of appearance 9.1%)

- Is it possible to write a book in English without using the letter e?
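The figures on this slide can be reproduced with a short Python sketch (ours) using the frequencies from the table above, with '-' standing for the space character:

```python
import math

# Letter frequencies from the table above (reprinted from MacKay); '-' is the space character
freq = {
    'a': 0.0575, 'b': 0.0128, 'c': 0.0263, 'd': 0.0285, 'e': 0.0913,
    'f': 0.0173, 'g': 0.0133, 'h': 0.0313, 'i': 0.0599, 'j': 0.0006,
    'k': 0.0084, 'l': 0.0335, 'm': 0.0235, 'n': 0.0596, 'o': 0.0689,
    'p': 0.0192, 'q': 0.0008, 'r': 0.0508, 's': 0.0567, 't': 0.0706,
    'u': 0.0334, 'v': 0.0069, 'w': 0.0119, 'x': 0.0073, 'y': 0.0164,
    'z': 0.0007, '-': 0.1928,
}

H = -sum(p * math.log2(p) for p in freq.values())
print(f"H = {H:.2f} bits/letter")                   # about 4.11 bits/letter

# Per-letter information i(x) = -log2 p(x): the rare letters are the most informative
for c in ('e', 't', 'z', 'j'):
    print(f"i({c}) = {-math.log2(freq[c]):.2f} bits")
```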


Yes: the following is the opening of Gadsby (Ernest Vincent Wright, 1939), a novel written entirely without the letter e.

[To July 1906]

If youth, throughout all history, had had a champion to stand up for it; to show a doubting world that a child can think; and, possibly, do it practically; you wouldn't constantly run across folks today who claim that "a child don't know anything." A child's brain starts functioning at birth; and has, amongst its many infant convolutions, thousands of dormant atoms, into which God has put a mystic possibility for noticing an adult's act, and figuring out its purport. Up to about its primary school days a child thinks, naturally, only of play. But many a form of play contains disciplinary factors. "You can't do this," or "that puts you out," shows a child that it must think, practically or fail. Now, if, throughout childhood, a brain has no opposition, it is plain that it will attain a position of "status quo," as with our ordinary animals. Man knows not why a cow, dog or lion was not born with a brain on a par with ours; why such animals cannot add, subtract, or obtain from books and schooling, that paramount position which Man holds today. But a human brain is not in that class. Constantly throbbing and pulsating, it rapidly forms opinions; attaining an ability of its own; a fact which is startlingly shown by an occasional child "prodigy" in music or school work. And as, with our dumb animals, a child's inability convincingly to impart its thoughts to us, should not class it as ignorant. Upon this basis I am going to show you how a bunch of bright young folks did find a champion; a man with boys and girls of his own; a man of so dominating and happy individuality that Youth is drawn to him as is a fly to a sugar bowl. It is a story about a small town. It is not a gossipy yarn; nor is it a dry, monotonous account, full of such customary "fill-ins" as "romantic moonlight casting murky shadows down a long, winding country road." Nor will it say anything about tinklings lulling distant folds; robins carolling at twilight, nor any "warm glow of lamplight" from a cabin window. No. It is an account of up-and-doing activity; a vivid portrayal of Youth as it is today; and a practical discarding of that worn-out notion that "a child don't know anything." Now, any author, from history's dawn, always had that most important aid to writing: an ability to call upon any word in his dictionary in building up his story. That is, our strict laws as to word construction did not block his path. But in my story that mighty obstruction will constantly stand in my path; for many an important, common word I cannot adopt, owing to its orthography. I shall act as a sort of historian for this small town; associating with its inhabitants, and striving to acquaint you with its youths, in such a way that you can look, knowingly, upon any child, rich or poor; forward or "backward;" your own, or John Smith's, in your community. You will find many young minds aspiring to know how, and why such a thing is so. And, if a child shows curiosity in that way, how ridiculous it is for you to snap out:— "Oh! Don't ask about things too old for you!" Such a jolt to a young child's mind, craving instruction, is apt so to dull its avidity, as to hold it back in its school work. Try to look upon a child as a small, soft young body and a rapidly growing, constantly inquiring brain. It must grow to maturity slowly. Forcing a child through school by constant night study during hours in which it should run and [...]

Definitions

The joint pmf of two random variables X and Y taking values on alphabets X and Y, respectively, is

    p(x, y) = Pr{X = x, Y = y},   (x, y) ∈ X × Y

If p(x) = Pr{X = x} > 0, the conditional probability of Y = y given that X = x is defined as

    p(y|x) = Pr{Y = y | X = x} = p(x, y) / p(x)

Independence

- The events X = x and Y = y are independent if p(x, y) = p(x)p(y)

- The random variables X and Y are independent if p(x, y) = p(x)p(y), ∀(x, y) ∈ X × Y


Example

Joint pmf p(x, y), with the marginals in the last row and column:

    p(x, y)      x = 0     x = 1     x = 2     Pr{Y = y}
    y = 1        0.081     0.018     0.001     0.1
    y = 0        0.7290    0.1620    0.009     0.9
    Pr{X = x}    0.81      0.18      0.01

- You can easily check that X and Y are independent

- Actually, X ∼ B(2, 0.1) and Y ∼ B(1, 0.1)

How is Z = X + Y distributed?
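A minimal sketch (ours, not part of the slides) that checks the independence claim numerically and answers the question: collecting the joint probabilities by the value of X + Y gives the pmf of Z, which coincides with B(3, 0.1), as expected for the sum of independent B(2, 0.1) and B(1, 0.1) variables.

```python
from itertools import product

# Joint pmf of the example: X in {0, 1, 2}, Y in {0, 1}
pxy = {
    (0, 0): 0.7290, (1, 0): 0.1620, (2, 0): 0.009,
    (0, 1): 0.081,  (1, 1): 0.018,  (2, 1): 0.001,
}

# Marginals
px = {x: sum(pxy[x, y] for y in (0, 1)) for x in (0, 1, 2)}
py = {y: sum(pxy[x, y] for x in (0, 1, 2)) for y in (0, 1)}

# Independence: p(x, y) = p(x) p(y) for every pair
print(all(abs(pxy[x, y] - px[x] * py[y]) < 1e-12 for x, y in product(px, py)))  # True

# Distribution of Z = X + Y; it matches B(3, 0.1)
pz = {}
for (x, y), p in pxy.items():
    pz[x + y] = pz.get(x + y, 0.0) + p
print(pz)   # approximately {0: 0.729, 1: 0.243, 2: 0.027, 3: 0.001}
```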


Joint entropy and conditional entropy

Definition: The joint entropy H(X, Y) of two random variables (X, Y) with pmf p(x, y) is defined as

    H(X, Y) = −E[log p(X, Y)] = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x, y)

Definition: The conditional entropy of Y given X is defined as

    H(Y|X) = −E[log p(Y|X)] = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x)

Note that

    H(Y|X) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(y|x) = − Σ_{x∈X} p(x) Σ_{y∈Y} p(y|x) log p(y|x)
           = Σ_{x∈X} p(x) H(Y|X = x)


Chain rule

We know that p(x, y) = p(x)p(y|x); therefore, taking logarithms and expectations on both sides we arrive at

E[log p(X , Y )] = E[log p(X )] + E[log p(Y |X )]

and the so-called chain rule for conditional entropy follows

H(X , Y ) = H(X ) + H(Y |X )

Similarly, we have

H(X , Y ) = H(Y ) + H(X |Y )

Note that H(Y|X) ≠ H(X|Y), but

H(X ) − H(X |Y ) = H(Y ) − H(Y |X )
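As a numerical check (a sketch of ours, reusing the earlier example with X ∼ B(2, 0.1) and Y ∼ B(1, 0.1) independent), the chain rule H(X, Y) = H(X) + H(Y|X) holds exactly, with H(Y|X) computed directly from its definition:

```python
import math

H = lambda probs: -sum(p * math.log2(p) for p in probs if p > 0)

# Joint pmf of the earlier example: X ~ B(2, 0.1) and Y ~ B(1, 0.1), independent
px = {0: 0.81, 1: 0.18, 2: 0.01}
py = {0: 0.9, 1: 0.1}
pxy = {(x, y): px[x] * py[y] for x in px for y in py}

# H(Y|X) from its definition: sum_x p(x) H(Y | X = x)
HY_given_X = sum(px[x] * H(pxy[x, y] / px[x] for y in py) for x in px)

print(H(pxy.values()))                 # H(X, Y)
print(H(px.values()) + HY_given_X)     # H(X) + H(Y|X): the same value (chain rule)
```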


[Diagram: the joint entropy H(X, Y) split as H(X) + H(Y|X) and as H(Y) + H(X|Y)]

As a corollary of the chain rule, it is easy to prove the following

H(X , Y |Z) = H(X |Z) + H(Y |X , Z)


Generalization of the chain rule for entropy: Let X1, X2, ..., Xn be a collection of random variables with joint pmf p(x1, x2, ..., xn). Then

    H(X1, X2, ..., Xn) = Σ_{i=1}^{n} H(Xi | Xi−1, ..., X1)

The entropy is a sum of conditional entropies; for instance,

    H(X1, X2, X3) = H(X1) + H(X2|X1) + H(X3|X2, X1)


Example (from Cover’s textbook): Let (X,Y) have the following joint pmf

    p(x, y)    x = 1    x = 2    x = 3    x = 4
    y = 1      1/8      1/16     1/32     1/32
    y = 2      1/16     1/8      1/32     1/32
    y = 3      1/16     1/16     1/16     1/16
    y = 4      1/4      0        0        0

- H(X) = 7/4

- H(Y) = 2

- H(X|Y) = 11/8

- H(Y|X) = 13/8

- H(X,Y) = 27/8 (see the numerical check below)
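These values can be verified numerically; below is a minimal Python sketch (ours), using exact fractions for the table entries:

```python
import math
from fractions import Fraction as F

# Joint pmf from the table: rows are y = 1..4, columns are x = 1..4
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(probs):
    """Entropy in bits of a list of probabilities (0 log 0 = 0)."""
    return -sum(float(p) * math.log2(p) for p in probs if p > 0)

px = [sum(row[j] for row in P) for j in range(4)]   # marginal pmf of X
py = [sum(row) for row in P]                        # marginal pmf of Y
HXY = H([p for row in P for p in row])              # joint entropy

print("H(X)   =", H(px))         # 1.75  = 7/4
print("H(Y)   =", H(py))         # 2.0
print("H(X,Y) =", HXY)           # 3.375 = 27/8
print("H(X|Y) =", HXY - H(py))   # 1.375 = 11/8
print("H(Y|X) =", HXY - H(px))   # 1.625 = 13/8
```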


Relative entropy, K-L divergence

Definition: The relative entropy or Kullback-Leibler divergence between two distributions p(x) and q(x), taking values on the same alphabet, is defined as

    D(p||q) = Σ_{x∈X} p(x) log( p(x) / q(x) ) = E_p[ log( p(X) / q(X) ) ]

We use the conventions 0 log(0/0) = 0, 0 log(0/q) = 0, and p log(p/0) = ∞

- It is a measure of the "distance" between the two distributions

- D(p||q) ≥ 0

- D(p||q) = 0 iff p = q

- However, it is not a true distance:

      D(p||q) ≠ D(q||p)


Example: Consider two different Bernoulli distributions: p(x) = (1 − r)δ(x) + rδ(x − 1) and q(x) = (1 − s)δ(x) + sδ(x − 1)

    D(p||q) = (1 − r) log( (1 − r)/(1 − s) ) + r log( r/s )

    D(q||p) = (1 − s) log( (1 − s)/(1 − r) ) + s log( s/r )

If we fix s = 0.5 and vary r from 0 to 1, D(p||q) looks like this:

[Figure: D(p||q) as a function of r for fixed s = 0.5; it equals 0 at r = 0.5 and reaches 1 bit at r = 0 and r = 1]
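A small sketch (ours; the function name and the values of r and s are arbitrary) that computes both divergences and shows the asymmetry, as well as D(p||q) = 0 when p = q:

```python
import math

def kl_bernoulli(r, s):
    """D(p||q) in bits for p = Bernoulli(r), q = Bernoulli(s); assumes 0 < s < 1."""
    return sum(pr * math.log2(pr / qs)
               for pr, qs in ((1 - r, 1 - s), (r, s)) if pr > 0)

r, s = 0.1, 0.5
print(kl_bernoulli(r, s))   # D(p||q), about 0.531 bits
print(kl_bernoulli(s, r))   # D(q||p), about 0.737 bits -> not symmetric
print(kl_bernoulli(s, s))   # 0.0, since D(p||q) = 0 iff p = q
```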


Mutual information

Definition 1: The mutual information I (X ; Y ) between the random variables X and Y is given by

I (X ; Y ) = H(X ) − H(X |Y )

- It is the reduction in the uncertainty of X due to the knowledge of Y

- I(X; Y) ≥ 0, which implies

      H(X) ≥ H(X|Y),

  i.e., "information cannot hurt"

- I(X; Y) = 0 iff X and Y are independent


Definition 2: The mutual information I (X ; Y ) can be defined as the relative entropy between the joint distribution p(x, y) and the product of marginals p(x)p(y)

    I(X; Y) = D( p(x, y) || p(x)p(y) ) = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x, y) / (p(x)p(y)) )

            = Σ_{x∈X} Σ_{y∈Y} p(x, y) log( p(x|y) / p(x) )

            = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x) + Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y)

            = − Σ_{x∈X} p(x) log p(x) − ( − Σ_{x∈X} Σ_{y∈Y} p(x, y) log p(x|y) )

            = H(X) − H(X|Y)


Properties

- It is symmetric (X says as much about Y as Y says about X):

      I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = I(Y; X)

- The mutual information of a random variable with itself is the entropy (self-information):

      I(X; X) = H(X) − H(X|X) = H(X)

- It also satisfies

      I(X; Y) = H(X) + H(Y) − H(X, Y) = H(X, Y) − H(X|Y) − H(Y|X)


[Venn diagram: H(X, Y) decomposed into H(X|Y), I(X; Y), and H(Y|X); H(X) and H(Y) overlap exactly in I(X; Y)]


Example: Let (X,Y) have the following joint pmf

    p(x, y)    x = 1    x = 2    x = 3    x = 4
    y = 1      1/8      1/16     1/32     1/32
    y = 2      1/16     1/8      1/32     1/32
    y = 3      1/16     1/16     1/16     1/16
    y = 4      1/4      0        0        0

- I(X; Y) = 3/8
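Continuing the numerical sketch used earlier for the entropies of this table (our own code, not from the slides), I(X; Y) can be obtained either from the relative-entropy definition or from the entropies, with the same result:

```python
import math
from fractions import Fraction as F

# Joint pmf from the table: rows are y = 1..4, columns are x = 1..4
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]
px = [sum(row[j] for row in P) for j in range(4)]   # marginal of X
py = [sum(row) for row in P]                        # marginal of Y

# I(X;Y) = D( p(x,y) || p(x)p(y) )
I = sum(float(P[i][j]) * math.log2(P[i][j] / (px[j] * py[i]))
        for i in range(4) for j in range(4) if P[i][j] > 0)
print(I)   # 0.375 = 3/8

# Same value from I(X;Y) = H(X) + H(Y) - H(X,Y)
H = lambda probs: -sum(float(p) * math.log2(p) for p in probs if p > 0)
print(H(px) + H(py) - H([p for row in P for p in row]))   # 0.375
```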


Chain rule for mutual information

I (X1, X2, X3; Y ) = I (X1; Y ) + I (X2; Y |X1) + I (X3; Y |X2, X1)

The proof follows from the definition of mutual information and the application of the chain rule for entropy

More generally, we can write

    I(X1, X2, ..., Xn; Y) = Σ_{i=1}^{n} I(Xi; Y | Xi−1, ..., X1)


Jensen's inequality

Jensen's inequality generalizes the idea that, for convex functions, the secant line lies above the graph

Many fundamental inequalities in information theory are consequences of Jensen's inequality, the most important one being

D(p||q) ≥ 0


- Convex: x, x², |x|, e^x, x log(x) (for x ≥ 0), ...

- Concave: x, log(x), √x (for x ≥ 0), ...


Definition: A function f(x) is said to be convex on an interval (a, b) if, for all x1, x2 ∈ (a, b) and 0 < λ < 1,

    f(λ x1 + (1 − λ) x2) ≤ λ f(x1) + (1 − λ) f(x2)

The function is strictly convex if equality holds only when λ = 0 or λ = 1

- A function f(x) is concave if −f(x) is convex

- If f(x) has a second derivative that is non-negative (positive) over an interval, then f(x) is convex (strictly convex) over that interval


Jensen's inequality

Theorem: If f(x) is a convex function and X is a random variable,

    E[f(X)] ≥ f(E[X])

Corollary: If f(x) is strictly convex,

    E[f(X)] = f(E[X]) =⇒ X = E[X] =⇒ X is a constant

Jensen's inequality follows directly from the fact that, for a set of numbers x1, x2, ..., xn and a set of positive weights a1, a2, ..., an, a convex function satisfies

    f( Σ_i ai xi / Σ_j aj ) ≤ Σ_i ai f(xi) / Σ_j aj,

which can be easily proved by induction. If pi = ai / Σ_j aj is viewed as the pmf of a random variable X, the probabilistic form of Jensen's inequality follows.
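A quick Monte Carlo sketch (ours; the distributions and the convex/concave functions chosen are arbitrary) illustrating E[f(X)] ≥ f(E[X]) for a convex f and the reversed inequality for a concave one:

```python
import math
import random

random.seed(0)

# Convex f: E[f(X)] >= f(E[X])
f = lambda x: x * x
xs = [random.uniform(-1, 3) for _ in range(100_000)]
Ef = sum(f(x) for x in xs) / len(xs)    # E[f(X)], about 2.33 here
fE = f(sum(xs) / len(xs))               # f(E[X]), about 1.0 here
print(Ef, ">=", fE)

# Concave f (log): the inequality flips, E[log2 X] <= log2 E[X]
ys = [random.uniform(0.1, 10) for _ in range(100_000)]
print(sum(math.log2(y) for y in ys) / len(ys), "<=", math.log2(sum(ys) / len(ys)))
```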

An application

D(p||q) ≥ 0

    −D(p||q) = − Σ_{x∈X} p(x) log( p(x) / q(x) ) = Σ_{x∈X} p(x) log( q(x) / p(x) )

             ≤ log( Σ_{x∈X} p(x) · q(x) / p(x) )

             = log( Σ_{x∈X} q(x) ) = log(1) = 0,

where we have applied Jensen's inequality in the second line, since log(x) is a strictly concave function
