Machine Learning CS 4641
Information Theory
Nakul Gopalan Georgia Tech
These slides are based on slides from Le Song, Roni Rosenfeld, Chao Zhang, Mahdi Roozbahani and Maneesh Sahani.
Outline
• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence
Uncertainty and Information
Information is processed data, whereas knowledge is information that is modeled to be useful.
You need information in order to obtain knowledge.
Uncertainty and Information
Which day is more uncertain?
How do we quantify uncertainty?
High entropy corresponds to high uncertainty, and hence to high information content.
Information
Let X be a random variable with distribution p(x):
I(X) = log₂( 1 / p(x) )
Have you heard that a picture is worth 1000 words?
Information obtained by random word from a 100,000 word vocabulary:
I(word) = log₂( 1 / p(x) ) = log₂( 1 / (1/100,000) ) = 16.61 bits
A 1000-word document from the same source:
I(document) = 1000 × I(word) = 16,610 bits
A 640×480 pixel, 16-greyscale picture (each pixel carries log₂ 16 = 4 bits of information):
I(picture) = log₂( 1 / (1/16^(640×480)) ) = 640 × 480 × 4 = 1,228,800 bits
A picture is worth (a lot more than) 1000 words!
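The arithmetic above is easy to check directly. A minimal sketch, using the vocabulary size, document length, and image dimensions from the slides (the uniform-probability assumption is the slides' own):

```python
import math

def information_bits(p):
    """Self-information of an outcome with probability p, in bits."""
    return math.log2(1 / p)

# A random word from a 100,000-word vocabulary (uniform assumption)
i_word = information_bits(1 / 100_000)            # ~16.61 bits

# A 1000-word document from the same source
i_document = 1000 * i_word                        # ~16,610 bits

# A 640x480 picture with 16 grey levels per pixel (4 bits each)
i_picture = 640 * 480 * information_bits(1 / 16)  # 1,228,800 bits

print(round(i_word, 2), round(i_document), round(i_picture))
```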
Which one has a higher probability: T or H? Which one should carry more information: T or H?
Example
Morse Code image from Wikipedia Commons
Information Theory
• Information theory is a mathematical framework which addresses questions like:
‣ How much information does a random variable carry?
‣ How efficient is a hypothetical code, given the statistics of the random variable?
‣ How much better or worse would another code do?
‣ Is the information carried by different random variables complementary or redundant?
Outline
• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence
Entropy
Entropy Computation: An Example
Properties of Entropy
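The entropy used throughout this section is the standard Shannon entropy, H(X) = −Σₓ p(x) log₂ p(x). A small sketch illustrating it and two of its properties (the example distributions here are illustrative, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), with 0 log 0 := 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has maximal entropy for two outcomes: exactly 1 bit.
print(entropy([0.5, 0.5]))   # 1.0

# A biased coin is less uncertain, so its entropy is lower.
print(entropy([0.9, 0.1]))   # ~0.469

# A certain outcome carries no uncertainty at all.
print(entropy([1.0]))        # 0.0
```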
Outline
• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence
Joint Entropy
[Example: joint distribution table of Temperature and Humidity]
Conditional Entropy (or Average Conditional Entropy)
H(Y|X) = Σ_{x∈X} p(x) H(Y|X = x) = Σ_{x∈X, y∈Y} p(x, y) log( p(x) / p(x, y) )
Conditional Entropy
Discrete random variables:
H(Y|X) = Σᵢ p(xᵢ) H(Y|X = xᵢ) = Σ_{i,j} p(xᵢ, yⱼ) log( p(xᵢ) / p(xᵢ, yⱼ) )
Continuous random variables:
H(Y|X) = ∫∫ p(x, y) log( p(x) / p(x, y) ) dy dx
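The two discrete forms above, Σ p(x) H(Y|X=x) and Σ p(x,y) log( p(x)/p(x,y) ), should agree for any joint distribution. A quick numerical check on a made-up joint table (the probabilities are illustrative, not from the slides):

```python
import math

# Hypothetical joint distribution p(x, y) over X in {0,1}, Y in {0,1}
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
px = {}
for (x, _), p in joint.items():
    px[x] = px.get(x, 0.0) + p

# Form 1: H(Y|X) = sum_x p(x) * H(Y | X = x)
h1 = 0.0
for x in px:
    for (xx, y), pxy in joint.items():
        if xx == x and pxy > 0:
            p_y_given_x = pxy / px[x]
            h1 -= px[x] * p_y_given_x * math.log2(p_y_given_x)

# Form 2: H(Y|X) = sum_{x,y} p(x,y) * log2( p(x) / p(x,y) )
h2 = sum(p * math.log2(px[x] / p) for (x, y), p in joint.items() if p > 0)

print(round(h1, 6), round(h2, 6))  # the two forms agree
```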
Poll
Relationship between H(Y) and H(Y|X):
• H(Y) ≥ H(Y|X)
• H(Y) ≤ H(Y|X)
Mutual Information
I(X; Y) = H(Y) − H(Y|X)
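Mutual information can equivalently be written as I(X;Y) = H(X) + H(Y) − H(X,Y), which makes it easy to compute from a joint table. A sketch using a made-up joint distribution (illustrative values, not from the slides), checking that the two forms agree:

```python
import math

def H(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over two binary variables
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals p(x) and p(y)
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

# I(X;Y) = H(X) + H(Y) - H(X,Y)
mi = H(px.values()) + H(py.values()) - H(joint.values())

# Equivalent form: I(X;Y) = H(Y) - H(Y|X), with H(Y|X) = H(X,Y) - H(X)
mi_alt = H(py.values()) - (H(joint.values()) - H(px.values()))

print(round(mi, 6), round(mi_alt, 6))
```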
Properties of Mutual Information
CE and MI: Visual Illustration
Image Credit: Christopher Olah.
Outline
• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence
Cross Entropy: the expected number of bits when a wrong distribution Q is assumed while the data actually follows the distribution P:
H(P, Q) = −Σₓ P(x) log₂ Q(x) = H(P) + KL(P‖Q)
Other definitions:
Kullback-Leibler Divergence
KL(P‖Q) = Σₓ P(x) log₂( P(x) / Q(x) ) = H(P, Q) − H(P)
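These identities can be checked numerically. A minimal sketch using two made-up distributions P and Q over three outcomes (the values are illustrative, not from the slides); it also shows that KL divergence is not symmetric:

```python
import math

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x): bits needed when coding P with Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl(p, q):
    """KL(P || Q) = sum_x P(x) log2( P(x) / Q(x) )."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# KL(P || Q) = H(P, Q) - H(P), and it is zero exactly when P = Q
print(round(kl(P, Q), 6))
print(round(cross_entropy(P, Q) - entropy(P), 6))
print(kl(P, P))              # 0.0
print(kl(P, Q) == kl(Q, P))  # False: KL is not symmetric
```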
KL divergence measures how far Q is from P, but it is not a true distance: it is not symmetric, since KL(P‖Q) ≠ KL(Q‖P) in general.
Is the log function concave or convex? It is concave, so by Jensen's inequality E[g(X)] ≤ g(E[X]) for concave g, applied here with g(x) = log(x). This gives KL(P‖Q) ≥ 0.
When P = Q, KL(P‖Q) = 0.
Take-Home Messages
• Entropy
‣ A measure of uncertainty
‣ Why it is defined this way (optimal coding)
‣ Its properties
• Joint Entropy, Conditional Entropy, Mutual Information
‣ The physical intuitions behind their definitions
‣ The relationships between them
• Cross Entropy, KL Divergence
‣ The physical intuitions behind them
‣ The relationships between entropy, cross-entropy, and KL divergence