Machine Learning CS 4641

Information Theory

Nakul Gopalan, Georgia Tech

These slides are based on slides from Le Song, Roni Rosenfeld, Chao Zhang, Mahdi Roozbahani, and Maneesh Sahani.

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Uncertainty and Information

Information is processed data whereas knowledge is information that is modeled to be useful.

You need information to be able to get knowledge.

Uncertainty and Information

Which day is more uncertain?

How do we quantify uncertainty?

Higher entropy corresponds to more uncertainty, and hence more information.

Information

Let X be a random variable with distribution p(x):

$I(X) = \log_2 \frac{1}{p(x)}$
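As a quick illustration, here is a minimal Python sketch of the self-information formula above (the function name and example probabilities are mine, not from the slides):

```python
import math

def self_information(p, base=2):
    """Self-information I(x) = log_base(1 / p(x)) of an outcome with probability p."""
    return math.log(1.0 / p, base)

print(self_information(0.5))    # a fair coin flip carries 1.0 bit
print(self_information(1 / 8))  # a 1-in-8 outcome carries 3.0 bits
```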

Have you heard a picture is worth 1000 words?

Information obtained from a random word drawn from a 100,000-word vocabulary:

$I(\text{word}) = \log_2 \frac{1}{p(x)} = \log_2 \frac{1}{1/100{,}000} \approx 16.61 \text{ bits}$

A 1000-word document from the same source: $I(\text{document}) = 1000 \times I(\text{word}) \approx 16{,}610 \text{ bits}$

A 640×480 pixel, 16-greyscale picture (each pixel has $\log_2 16 = 4$ bits of information): $I(\text{picture}) = \log_2 \frac{1}{(1/16)^{640 \times 480}} = 640 \times 480 \times \log_2 16 = 1{,}228{,}800 \text{ bits}$

A picture is worth (a lot more than) 1000 words!
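A quick numerical check of the slide's arithmetic, assuming words and pixels are drawn uniformly and independently:

```python
import math

I_word = math.log2(100_000)            # ~16.61 bits per word
I_document = 1000 * I_word             # ~16,610 bits for 1000 independent words

# 16 grey levels per pixel -> log2(16) = 4 bits per pixel
I_picture = 640 * 480 * math.log2(16)  # 1,228,800 bits

print(round(I_word, 2), round(I_document), int(I_picture))
print(round(I_picture / I_word))       # a picture ~ 74,000 words' worth of bits
```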

Which one has a higher probability: T or H? Which one should carry more information: T or H?

Example

Morse Code image from Wikipedia Commons

• Information theory is a mathematical framework that addresses questions like:
‣ How much information does a random variable carry?
‣ How efficient is a hypothetical code, given the statistics of the random variable?
‣ How much better or worse would another code do?
‣ Is the information carried by different random variables complementary or redundant?

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Entropy
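For reference, entropy is the expected self-information of X; stated here in my wording, assuming a discrete X and base-2 logarithms as on the earlier slides:

$H(X) = E\left[\log_2 \frac{1}{p(X)}\right] = -\sum_{x \in X} p(x) \log_2 p(x)$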

Entropy Computation: An Example

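A minimal Python sketch of an entropy computation (the distributions below are made-up examples, not necessarily the one used on the slide):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x p(x) log p(x); outcomes with p(x) == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))                # biased coin: ~0.47 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
```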

Properties of Entropy
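As a quick summary of the standard bounds (stated from the definition above, for a discrete X with K possible outcomes; my phrasing, not the slide's):

$0 \le H(X) \le \log_2 K$, with $H(X) = 0$ exactly when X is deterministic and $H(X) = \log_2 K$ exactly when X is uniform.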

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Joint Entropy

Example: the joint distribution of Temperature and Humidity.
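A short Python sketch of joint entropy; the Temperature/Humidity probabilities below are made-up placeholders standing in for the slide's table:

```python
import math

# Hypothetical joint distribution p(temperature, humidity); the six entries sum to 1.
p_joint = {
    ("hot", "high"): 0.30, ("hot", "low"): 0.10,
    ("mild", "high"): 0.25, ("mild", "low"): 0.15,
    ("cold", "high"): 0.05, ("cold", "low"): 0.15,
}

# H(T, M) = -sum over (t, m) of p(t, m) log2 p(t, m)
H_joint = -sum(p * math.log2(p) for p in p_joint.values() if p > 0)
print(round(H_joint, 3))  # joint entropy in bits
```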

Conditional Entropy or Average Conditional Entropy

$H(Y|X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = \sum_{x \in X,\, y \in Y} p(x, y) \log \frac{p(x)}{p(x, y)}$

Conditional Entropy Proof

Conditional Entropy

Discrete random variables: $H(Y|X) = \sum_{x_i \in X} p(x_i)\, H(Y \mid X = x_i) = \sum_{x_i \in X,\, y_i \in Y} p(x_i, y_i) \log \frac{p(x_i)}{p(x_i, y_i)}$

Continuous random variables: $H(Y|X) = \int_{X} \int_{Y} p(x, y) \log \frac{p(x)}{p(x, y)} \, dy \, dx$
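A minimal Python sketch of the discrete formula above, using a made-up joint table; it also checks the chain-rule identity H(Y|X) = H(X,Y) − H(X):

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y) (placeholder values).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
p_x = defaultdict(float)
for (x, _), p in p_xy.items():
    p_x[x] += p

# H(Y|X) = sum_{x,y} p(x, y) log2( p(x) / p(x, y) )
H_Y_given_X = sum(p * math.log2(p_x[x] / p) for (x, _), p in p_xy.items() if p > 0)

# Chain-rule check: H(Y|X) == H(X, Y) - H(X)
H_XY = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
H_X = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
print(round(H_Y_given_X, 4), round(H_XY - H_X, 4))  # the two numbers agree
```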

Poll

Relationship between H(Y) and H(Y|X):
• H(Y) >= H(Y|X)
• H(Y) <= H(Y|X)

Mutual Information

$I(Y; X) = H(Y) - H(Y|X) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$
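A short Python sketch computing mutual information from a made-up joint table, using the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginals p(x) and p(y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# I(X;Y) = H(X) + H(Y) - H(X, Y); also equals H(Y) - H(Y|X)
I_xy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())
print(round(I_xy, 4))  # > 0 here because X and Y are dependent
```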

Properties of Mutual Information
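For reference, the standard properties (these follow from the definitions above; my phrasing, not the slide's):

$I(X;Y) \ge 0$, $\quad I(X;Y) = I(Y;X)$, $\quad I(X;X) = H(X)$, and $I(X;Y) = 0$ exactly when X and Y are independent.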

CE and MI: Visual Illustration

Image Credit: Christopher Olah.

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence


Cross Entropy: the expected number of bits needed to encode data that actually follows distribution P when coding under an assumed (wrong) distribution Q.

$H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + KL(P \| Q)$

Other definitions:

Kullback-Leibler Divergence

$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$
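A small Python sketch, using two made-up distributions, that checks the identities above: H(P, Q) = H(P) + KL(P‖Q), KL(P‖Q) ≥ 0, and KL(P‖P) = 0:

```python
import math

P = [0.5, 0.3, 0.2]   # true distribution (made-up)
Q = [0.4, 0.4, 0.2]   # assumed (wrong) distribution (made-up)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) log2( p(x) / q(x) )."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

H_P = cross_entropy(P, P)                      # entropy = cross-entropy of P with itself
print(round(cross_entropy(P, Q), 4),
      round(H_P + kl_divergence(P, Q), 4))     # these two values agree
print(kl_divergence(P, Q) >= 0, kl_divergence(P, P) == 0.0)  # True True
```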

KL divergence acts as a measure of how different Q is from P, but it is not a true distance metric: it is not symmetric, i.e., KL(P‖Q) ≠ KL(Q‖P) in general.

Is the log function concave or convex? It is concave, so by Jensen's inequality $E[g(X)] \le g(E[X])$ with $g(x) = \log x$, which implies $KL(P \| Q) \ge 0$.

When $P = Q$, $KL(P \| Q) = 0$.

Take-Home Messages

• Entropy
‣ A measure for uncertainty
‣ Why it is defined in this way (optimal coding)
‣ Its properties
• Joint Entropy, Conditional Entropy, Mutual Information
‣ The physical intuitions behind their definitions
‣ The relationships between them
• Cross Entropy, KL Divergence
‣ The physical intuitions behind them
‣ The relationships between entropy, cross-entropy, and KL divergence
