Machine Learning CS 4641

Information Theory

Nakul Gopalan, Georgia Tech

These slides are based on slides from Le Song, Roni Rosenfeld, Chao Zhang, Mahdi Roozbahani, and Maneesh Sahani.

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Uncertainty and Information

Information is processed data whereas knowledge is information that is modeled to be useful.

You need information to be able to get knowledge.

Uncertainty and Information

Which day is more uncertain?

How do we quantify uncertainty?

Higher entropy corresponds to more uncertainty, and hence more information.

Information

Let X be a random variable with distribution p(x):

$I(X) = \log_2 \frac{1}{p(x)}$
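As a quick illustration, here is a minimal Python sketch of the self-information formula above (the function name and example probabilities are mine, not from the slides):

```python
import math

def self_information(p, base=2):
    """Self-information I(x) = log_base(1 / p(x)) of an outcome with probability p."""
    return math.log(1.0 / p, base)

print(self_information(0.5))    # a fair coin flip carries 1.0 bit
print(self_information(1 / 8))  # a 1-in-8 outcome carries 3.0 bits
```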

Have you heard a picture is worth 1000 words?

Information obtained from a random word drawn from a 100,000-word vocabulary:

$I(\text{word}) = \log_2 \frac{1}{p(x)} = \log_2 \frac{1}{1/100{,}000} \approx 16.61 \text{ bits}$

A 1000-word document from the same source: $I(\text{document}) = 1000 \times I(\text{word}) \approx 16{,}610 \text{ bits}$

A 640×480 pixel, 16-greyscale picture (each pixel has $\log_2 16 = 4$ bits of information): $I(\text{picture}) = \log_2 \frac{1}{(1/16)^{640 \times 480}} = 640 \times 480 \times \log_2 16 = 1{,}228{,}800 \text{ bits}$

A picture is worth (a lot more than) 1000 words!
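A quick numerical check of the slide's arithmetic, assuming words and pixels are drawn uniformly and independently:

```python
import math

I_word = math.log2(100_000)            # ~16.61 bits per word
I_document = 1000 * I_word             # ~16,610 bits for 1000 independent words

# 16 grey levels per pixel -> log2(16) = 4 bits per pixel
I_picture = 640 * 480 * math.log2(16)  # 1,228,800 bits

print(round(I_word, 2), round(I_document), int(I_picture))
print(round(I_picture / I_word))       # a picture ~ 74,000 words' worth of bits
```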

Which one has a higher probability: T or H? Which one should carry more information: T or H?

Example

Morse Code image from Wikipedia Commons

• Information theory is a mathematical framework that addresses questions like:
‣ How much information does a random variable carry?
‣ How efficient is a hypothetical code, given the statistics of the random variable?
‣ How much better or worse would another code do?
‣ Is the information carried by different random variables complementary or redundant?

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Entropy
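For reference, entropy is the expected self-information of X; stated here in my wording, assuming a discrete X and base-2 logarithms as on the earlier slides:

$H(X) = E\left[\log_2 \frac{1}{p(X)}\right] = -\sum_{x \in X} p(x) \log_2 p(x)$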

Entropy Computation: An Example

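A minimal Python sketch of an entropy computation (the distributions below are made-up examples, not necessarily the one used on the slide):

```python
import math

def entropy(probs, base=2):
    """H(X) = -sum_x p(x) log p(x); outcomes with p(x) == 0 contribute nothing."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))                # biased coin: ~0.47 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits
```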

Properties of Entropy
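As a quick summary of the standard bounds (stated from the definition above, for a discrete X with K possible outcomes; my phrasing, not the slide's):

$0 \le H(X) \le \log_2 K$, with $H(X) = 0$ exactly when X is deterministic and $H(X) = \log_2 K$ exactly when X is uniform.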

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence

Joint Entropy

Example: the joint distribution of Temperature and Humidity.
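A short Python sketch of joint entropy; the Temperature/Humidity probabilities below are made-up placeholders standing in for the slide's table:

```python
import math

# Hypothetical joint distribution p(temperature, humidity); the six entries sum to 1.
p_joint = {
    ("hot", "high"): 0.30, ("hot", "low"): 0.10,
    ("mild", "high"): 0.25, ("mild", "low"): 0.15,
    ("cold", "high"): 0.05, ("cold", "low"): 0.15,
}

# H(T, M) = -sum over (t, m) of p(t, m) log2 p(t, m)
H_joint = -sum(p * math.log2(p) for p in p_joint.values() if p > 0)
print(round(H_joint, 3))  # joint entropy in bits
```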

Conditional Entropy or Average Conditional Entropy

$H(Y|X) = \sum_{x \in X} p(x)\, H(Y \mid X = x) = \sum_{x \in X,\, y \in Y} p(x, y) \log \frac{p(x)}{p(x, y)}$

Conditional Entropy Proof

Conditional Entropy

Discrete random variables: $H(Y|X) = \sum_{x_i \in X} p(x_i)\, H(Y \mid X = x_i) = \sum_{x_i \in X,\, y_i \in Y} p(x_i, y_i) \log \frac{p(x_i)}{p(x_i, y_i)}$

Continuous random variables: $H(Y|X) = \int_{X} \int_{Y} p(x, y) \log \frac{p(x)}{p(x, y)} \, dy \, dx$
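A minimal Python sketch of the discrete formula above, using a made-up joint table; it also checks the chain-rule identity H(Y|X) = H(X,Y) − H(X):

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y) (placeholder values).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal p(x)
p_x = defaultdict(float)
for (x, _), p in p_xy.items():
    p_x[x] += p

# H(Y|X) = sum_{x,y} p(x, y) log2( p(x) / p(x, y) )
H_Y_given_X = sum(p * math.log2(p_x[x] / p) for (x, _), p in p_xy.items() if p > 0)

# Chain-rule check: H(Y|X) == H(X, Y) - H(X)
H_XY = -sum(p * math.log2(p) for p in p_xy.values() if p > 0)
H_X = -sum(p * math.log2(p) for p in p_x.values() if p > 0)
print(round(H_Y_given_X, 4), round(H_XY - H_X, 4))  # the two numbers agree
```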

Poll

Relationship between H(Y) and H(Y|X):
• H(Y) >= H(Y|X)
• H(Y) <= H(Y|X)

Mutual Information

$I(Y; X) = H(Y) - H(Y|X) = H(X) - H(X|Y) = H(X) + H(Y) - H(X, Y)$
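A short Python sketch computing mutual information from a made-up joint table, using the identity I(X;Y) = H(X) + H(Y) − H(X,Y):

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(probs):
    """Entropy in bits of a collection of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginals p(x) and p(y)
p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in p_xy.items():
    p_x[x] += p
    p_y[y] += p

# I(X;Y) = H(X) + H(Y) - H(X, Y); also equals H(Y) - H(Y|X)
I_xy = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())
print(round(I_xy, 4))  # > 0 here because X and Y are dependent
```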

Properties of Mutual Information
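For reference, the standard properties (these follow from the definitions above; my phrasing, not the slide's):

$I(X;Y) \ge 0$, $\quad I(X;Y) = I(Y;X)$, $\quad I(X;X) = H(X)$, and $I(X;Y) = 0$ exactly when X and Y are independent.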

CE and MI: Visual Illustration

Image Credit: Christopher Olah.

Outline

• Motivation
• Entropy
• Conditional Entropy and Mutual Information
• Cross-Entropy and KL-Divergence


Cross Entropy: the expected number of bits needed to encode data that actually follows distribution P when coding under an assumed (wrong) distribution Q.

$H(P, Q) = -\sum_{x} P(x) \log Q(x) = H(P) + KL(P \| Q)$

Other definitions:

Kullback-Leibler Divergence

$KL(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} = H(P, Q) - H(P)$
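A small Python sketch, using two made-up distributions, that checks the identities above: H(P, Q) = H(P) + KL(P‖Q), KL(P‖Q) ≥ 0, and KL(P‖P) = 0:

```python
import math

P = [0.5, 0.3, 0.2]   # true distribution (made-up)
Q = [0.4, 0.4, 0.2]   # assumed (wrong) distribution (made-up)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x p(x) log2 q(x)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum_x p(x) log2( p(x) / q(x) )."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

H_P = cross_entropy(P, P)                      # entropy = cross-entropy of P with itself
print(round(cross_entropy(P, Q), 4),
      round(H_P + kl_divergence(P, Q), 4))     # these two values agree
print(kl_divergence(P, Q) >= 0, kl_divergence(P, P) == 0.0)  # True True
```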

KL divergence acts as a measure of how different Q is from P, but it is not a true distance metric: it is not symmetric, i.e., KL(P‖Q) ≠ KL(Q‖P) in general.

Is the log function concave or convex? It is concave, so by Jensen's inequality $E[g(X)] \le g(E[X])$ with $g(x) = \log x$, which implies $KL(P \| Q) \ge 0$.

When $P = Q$, $KL(P \| Q) = 0$.

Take-Home Messages

• Entropy
‣ A measure for uncertainty
‣ Why it is defined in this way (optimal coding)
‣ Its properties
• Joint Entropy, Conditional Entropy, Mutual Information
‣ The physical intuitions behind their definitions
‣ The relationships between them
• Cross Entropy, KL Divergence
‣ The physical intuitions behind them
‣ The relationships between entropy, cross-entropy, and KL divergence
