
A Gentle Tutorial on Information Theory and Learning

Roni Rosenfeld
Carnegie Mellon University


Outline

• First part based very loosely on [Abramson 63].
• Information theory is usually formulated in terms of information channels and coding; we will not discuss those here.

1. Information
2. Entropy
3. Mutual Information
4. Cross Entropy and Learning


Information

• information ≠ knowledge
  Concerned with abstract possibilities, not their meaning.
• information: reduction in uncertainty

Imagine:
  #1 you're about to observe the outcome of a coin flip
  #2 you're about to observe the outcome of a die roll
There is more uncertainty in #2.

Next:
1. You observe the outcome of #1 → uncertainty reduced to zero.
2. You observe the outcome of #2 → uncertainty reduced to zero.

⇒ more information was provided by the outcome in #2


Definition of Information

(After [Abramson 63])

Let E be some event which occurs with probability P(E). If we are told that E has occurred, then we say that we have received

    I(E) = log2(1 / P(E))

bits of information.

• The base of the log is unimportant; it only changes the units. We'll stick with bits, and always assume base 2.
• Can also think of information as the amount of "surprise" in E (e.g. P(E) = 1 vs. P(E) = 0).
• Example: result of a fair coin flip (log2 2 = 1 bit)
• Example: result of a fair die roll (log2 6 ≈ 2.585 bits)
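As a quick check of the definition above, here is a minimal sketch (not from the tutorial; the function name is mine) that reproduces the two examples:

```python
import math

def information(p_event: float) -> float:
    """Self-information I(E) = log2(1 / P(E)), in bits."""
    return math.log2(1.0 / p_event)

print(information(1 / 2))   # fair coin flip -> 1.0 bit
print(information(1 / 6))   # fair die roll  -> ~2.585 bits
```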

Information is Additive

• I(k fair coin tosses) = log2(1 / (1/2)^k) = k bits

So:
  – a random word from a 100,000-word vocabulary:
    I(word) = log2 100,000 = 16.61 bits
  – a 1000-word document from the same source:
    I(document) = 16,610 bits
  – a 480x640-pixel, 16-greyscale video picture:
    I(picture) = 307,200 · log2 16 = 1,228,800 bits

• ⇒ a (VGA) picture is worth (a lot more than) a 1000 words!
• (In reality, both are gross overestimates.)


Entropy

A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities {p1, p2, ..., pk}, respectively, where the symbols emitted are statistically independent.

What is the average amount of information in observing the output of the source S?

Call this Entropy:

    H(S) = Σ_i p_i · I(s_i) = Σ_i p_i · log2(1/p_i) = E_P[ log2(1/p(s)) ]


Alternative Explanations of Entropy

    H(S) = Σ_i p_i · log2(1/p_i)

1. avg amount of information provided per symbol
2. avg amount of surprise when observing a symbol
3. uncertainty an observer has before seeing the symbol
4. avg # of bits needed to communicate each symbol
   (Shannon: there are codes that will communicate these symbols with efficiency arbitrarily close to H(S) bits/symbol; there are no codes that will do it with efficiency less than H(S) bits/symbol.)


Entropy as a Function of a Probability Distribution

Since the source S is fully characterized by P = {p1, ..., pk} (we don't care what the symbols s_i actually are, or what they stand for), entropy can also be thought of as a property of a probability distribution function P: the average uncertainty in the distribution.

So we may also write:

    H(S) = H(P) = H(p1, p2, ..., pk) = Σ_i p_i · log2(1/p_i)

(Can be generalized to continuous distributions.)
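A minimal sketch of H(P) as just defined (not from the tutorial; the function name is mine), treating entropy as a function of a probability distribution given as a list of probabilities:

```python
import math

def entropy(probs) -> float:
    """H(P) = sum_i p_i * log2(1/p_i), in bits; terms with p_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die: ~2.585 bits
print(entropy([1.0]))        # certain outcome: 0.0 bits (no surprise)
```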

Properties of Entropy

    H(P) = Σ_i p_i · log2(1/p_i)

1. Non-negative: H(P) ≥ 0
2. Invariant wrt permutation of its inputs:
   H(p1, p2, ..., pk) = H(p_τ(1), p_τ(2), ..., p_τ(k))
3. For any other probability distribution {q1, q2, ..., qk}:
   Σ_i p_i · log2(1/p_i) < Σ_i p_i · log2(1/q_i)
4. H(P) ≤ log2 k, with equality iff p_i = 1/k ∀i
5. The further P is from uniform, the lower the entropy.


Special Case: k = 2

Flipping a coin with P("head") = p, P("tail") = 1 − p:

    H(p) = p · log2(1/p) + (1 − p) · log2(1/(1 − p))

[Plot of H(p) as a function of p.]

Notice:
• zero uncertainty/information/surprise at the edges (p = 0 or p = 1)
• maximum information at p = 0.5 (1 bit)
• drops off quickly away from 0.5


Special Case: k = 2 (cont.)

Relates to: the "20 questions" game strategy (halving the space).

So a sequence of (independent) 0's and 1's can provide up to 1 bit of information per digit, provided the 0's and 1's are equally likely at any point. If they are not equally likely, the sequence provides less information and can be compressed.


The Entropy of English

27 characters (A-Z, space).
100,000 words (avg 5.5 characters each).

• Assuming independence between successive characters:
  – uniform character distribution: log2 27 = 4.75 bits/character
  – true character distribution: 4.03 bits/character
• Assuming independence between successive words:
  – uniform word distribution: log2 100,000 / 6.5 ≈ 2.55 bits/character
  – true word distribution: 9.45 / 6.5 ≈ 1.45 bits/character
• The true entropy of English is much lower!
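The per-character figures above are simple arithmetic; a small sketch reproducing them, assuming the 6.5 divisor is the average 5.5 letters per word plus one following space:

```python
import math

chars_per_word = 5.5 + 1   # 5.5 letters plus a space (assumed interpretation)

print(math.log2(27))                         # uniform characters: ~4.75 bits/char
print(math.log2(100_000) / chars_per_word)   # uniform words: ~2.55 bits/char
print(9.45 / chars_per_word)                 # true word distribution (9.45 bits/word): ~1.45 bits/char
```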

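And a sketch illustrating the Special Case k = 2 curve and Property 4 above (the entropy() helper repeats the earlier sketch; the example distributions are mine, not from the tutorial):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Special case k = 2: H(p) is near 0 at the edges and peaks at 1 bit for p = 0.5.
for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.99):
    print(p, entropy([p, 1 - p]))

# Property 4: H(P) <= log2 k, with equality for the uniform distribution.
print(entropy([0.25] * 4), math.log2(4))   # 2.0  2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ~1.36 < 2.0
```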
Two Sources

Temperature T: a random variable taking on values t
    P(T=hot)  = 0.3
    P(T=mild) = 0.5
    P(T=cold) = 0.2
    ⇒ H(T) = H(0.3, 0.5, 0.2) = 1.48548

Humidity M: a random variable taking on values m
    P(M=low)  = 0.6
    P(M=high) = 0.4
    ⇒ H(M) = H(0.6, 0.4) = 0.970951

T and M are not independent: P(T=t, M=m) ≠ P(T=t) · P(M=m)


Joint Probability, Joint Entropy

P(T=t, M=m):

            cold   mild   hot
    low     0.1    0.4    0.1   | 0.6
    high    0.2    0.1    0.1   | 0.4
            0.3    0.5    0.2   | 1.0

• H(T) = H(0.3, 0.5, 0.2) = 1.48548
• H(M) = H(0.6, 0.4) = 0.970951
• H(T) + H(M) = 2.456431

Joint Entropy: consider the space of (t, m) events:

    H(T, M) = Σ_{t,m} P(T=t, M=m) · log2(1 / P(T=t, M=m))
            = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.32193

Notice that H(T, M) < H(T) + H(M) !!!


Conditional Probability, Conditional Entropy

P(T=t | M=m):

            cold   mild   hot
    low     1/6    4/6    1/6   | 1.0
    high    2/4    1/4    1/4   | 1.0

Conditional Entropy:
• H(T | M=low)  = H(1/6, 4/6, 1/6) = 1.25163
• H(T | M=high) = H(2/4, 1/4, 1/4) = 1.5
• Average Conditional Entropy (aka equivocation):
    H(T/M) = Σ_m P(M=m) · H(T | M=m)
           = 0.6 · H(T | M=low) + 0.4 · H(T | M=high) = 1.350978

How much is M telling us on average about T?

    H(T) − H(T/M) = 1.48548 − 1.350978 ≈ 0.1345 bits


Conditional Probability, Conditional Entropy (cont.)

P(M=m | T=t):

            cold   mild   hot
    low     1/3    4/5    1/2
    high    2/3    1/5    1/2
            1.0    1.0    1.0

Conditional Entropy:
• H(M | T=cold) = H(1/3, 2/3) = 0.918296
• H(M | T=mild) = H(4/5, 1/5) = 0.721928
• H(M | T=hot)  = H(1/2, 1/2) = 1.0
• Average Conditional Entropy (aka equivocation):
    H(M/T) = Σ_t P(T=t) · H(M | T=t)
           = 0.3 · H(M | T=cold) + 0.5 · H(M | T=mild) + 0.2 · H(M | T=hot) = 0.8364528

How much is T telling us on average about M?

    H(M) − H(M/T) = 0.970951 − 0.8364528 ≈ 0.1345 bits
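A sketch that reproduces the joint and conditional entropy numbers above directly from the joint table (variable names are mine, not from the tutorial):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T=t, M=m) from the table above.
joint = {('cold', 'low'): 0.1, ('mild', 'low'): 0.4, ('hot', 'low'): 0.1,
         ('cold', 'high'): 0.2, ('mild', 'high'): 0.1, ('hot', 'high'): 0.1}

p_T = {t: sum(p for (t2, m), p in joint.items() if t2 == t) for t in ('cold', 'mild', 'hot')}
p_M = {m: sum(p for (t, m2), p in joint.items() if m2 == m) for m in ('low', 'high')}

print(entropy(p_T.values()))     # H(T)   = 1.485
print(entropy(p_M.values()))     # H(M)   = 0.971
print(entropy(joint.values()))   # H(T,M) = 2.322  (< H(T) + H(M))

# Equivocation H(M/T) = sum_t P(T=t) * H(M | T=t)
h_M_given_T = sum(p_T[t] * entropy([joint[(t, m)] / p_T[t] for m in ('low', 'high')])
                  for t in ('cold', 'mild', 'hot'))
print(h_M_given_T)                           # 0.836
print(entropy(p_M.values()) - h_M_given_T)   # ~0.1345 bits: what T tells us about M
```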

Average Mutual Information

    I(X;Y) = H(X) − H(X/Y)
           = Σ_x P(x) · log2(1/P(x)) − Σ_{x,y} P(x,y) · log2(1/P(x|y))
           = Σ_{x,y} P(x,y) · log2( P(x|y) / P(x) )
           = Σ_{x,y} P(x,y) · log2( P(x,y) / (P(x) · P(y)) )

Properties of Average Mutual Information:
• Symmetric (but H(X) ≠ H(Y) and H(X/Y) ≠ H(Y/X))
• Non-negative (but H(X) − H(X/y) may be negative!)
• Zero iff X, Y are independent
• Additive (see the Three Sources slide below)


Mutual Information Visualized

[Venn diagram: H(X) and H(Y) overlap in I(X;Y).]

    H(X,Y) = H(X) + H(Y) − I(X;Y)


Three Sources

From Blachman:
("/" means "given". ";" means "between". "," means "and".)

• H(X,Y/Z) = H({X,Y} / Z)
• H(X/Y,Z) = H(X / {Y,Z})
• I(X;Y/Z) = H(X/Z) − H(X/Y,Z)
• I(X;Y;Z) = I(X;Y) − I(X;Y/Z)
           = H(X,Y,Z) − H(X,Y) − H(X,Z) − H(Y,Z) + H(X) + H(Y) + H(Z)
  ⇒ Can be negative!
• I(X;Y,Z) = I(X;Y) + I(X;Z/Y)   (additivity)
• But: I(X;Y) = 0 and I(X;Z) = 0 do not imply I(X;Y,Z) = 0!!!


A Markov Source

An order-k Markov source is a source that "remembers" the last k symbols emitted.

I.e., the probability of emitting any symbol depends on the last k emitted symbols:

    P(S_t | S_{t−1}, S_{t−2}, ..., S_{t−k})

So the last k emitted symbols define a state, and there are q^k states (q = alphabet size).

A first-order Markov source is defined by a q×q matrix: P(s_i | s_j).

Example: a random walk, where S_t is the position after t random steps.
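Tying the last few slides together: a sketch (not from the tutorial) that computes I(T;M) from the last form of the Average Mutual Information sum above, using the same temperature/humidity joint table; it recovers the ≈ 0.1345 bits obtained earlier via equivocation.

```python
import math

# Same joint P(T=t, M=m) as in the temperature/humidity example above.
joint = {('cold', 'low'): 0.1, ('mild', 'low'): 0.4, ('hot', 'low'): 0.1,
         ('cold', 'high'): 0.2, ('mild', 'high'): 0.1, ('hot', 'high'): 0.1}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) * P(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
print(mi)   # ~0.1345 bits, matching H(T) - H(T/M) and H(M) - H(M/T) above
```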

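Ahead of the approximation slides below, a toy sketch of the random-walk example from the Markov Source slide (the 5-position ring is my illustrative choice, not from the tutorial):

```python
import random

# First-order Markov source: the next symbol depends only on the current state.
# Illustrative example: a random walk on q = 5 positions arranged in a ring;
# the emitted symbol is the current position.
q = 5

def random_walk(n_steps, pos=0):
    positions = []
    for _ in range(n_steps):
        pos = (pos + random.choice([-1, +1])) % q   # step left or right
        positions.append(pos)
    return positions

print(random_walk(20))   # e.g. [1, 0, 4, 0, 1, 2, ...]
```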
Approximating with a Markov Source

A non-Markovian source can still be approximated by one.

Example: English characters, C = {c1, c2, ...}:

1. Uniform: H(C) = log2 27 = 4.75 bits/char
2. Assuming 0 memory: H(C) = H(0.186, 0.064, 0.0127, ...) = 4.03 bits/char
3. Assuming 1st order: H(C) = H(c_i / c_{i−1}) = 3.32 bits/char
4. Assuming 2nd order: H(C) = H(c_i / c_{i−1}, c_{i−2}) = 3.1 bits/char
5. Assuming large order: Shannon got down to ≈ 1 bit/char


Modeling an Arbitrary Source

Source D(Y) with unknown distribution P_D(Y).

Goal: model (approximate) it with a learned distribution P_M(Y).

What's a good model P_M(Y)?

1. RMS error over D's parameters ⇒ but D is unknown!
2. Predictive Probability: maximize the expected log-likelihood the model assigns to future data from D.


Cross Entropy

    M* = argmax_M E_D[ log2 P_M(Y) ]
       = argmin_M E_D[ log2 (1/P_M(Y)) ]

    where E_D[ log2 (1/P_M(Y)) ] = CH(P_D; P_M)   ⇐ Cross Entropy

The following are equivalent:
1. Maximize the Predictive Probability of P_M
2. Minimize the Cross Entropy CH(P_D; P_M)
3. Minimize the difference between P_D and P_M (in what sense?)


A Distance Measure Between Distributions

Kullback-Leibler distance:
(recall H(P_D) = E_{P_D}[ log2 (1/P_D(Y)) ])

    KL(P_D; P_M) = CH(P_D; P_M) − H(P_D)
                 = E_{P_D}[ log2 ( P_D(Y) / P_M(Y) ) ]

Properties of the KL distance:
1. Non-negative. KL(P_D; P_M) = 0 ⇐⇒ P_D = P_M
2. Generally non-symmetric.

The following are equivalent:
1. Maximize the Predictive Probability of P_M for distribution D
2. Minimize the Cross Entropy CH(P_D; P_M)
3. Minimize the distance KL(P_D; P_M)
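A minimal sketch of cross entropy and KL distance for discrete distributions over the same alphabet (function names and the example distributions are mine, not from the tutorial); it illustrates CH ≥ H(P_D), KL ≥ 0, and the non-symmetry noted above:

```python
import math

def cross_entropy(p_d, p_m):
    """CH(P_D; P_M) = sum_y P_D(y) * log2(1 / P_M(y)), in bits."""
    return sum(p * math.log2(1.0 / p_m[y]) for y, p in p_d.items() if p > 0)

def kl(p_d, p_m):
    """KL(P_D; P_M) = CH(P_D; P_M) - H(P_D); note CH(P; P) = H(P)."""
    return cross_entropy(p_d, p_m) - cross_entropy(p_d, p_d)

p_true  = {'a': 0.5, 'b': 0.25, 'c': 0.25}
p_model = {'a': 0.4, 'b': 0.4,  'c': 0.2}

print(cross_entropy(p_true, p_true))    # H(P_D) = 1.5 bits
print(cross_entropy(p_true, p_model))   # CH >= H(P_D)  (Property 3 / Gibbs)
print(kl(p_true, p_model))              # >= 0; zero only when the model matches
print(kl(p_model, p_true))              # generally different: KL is non-symmetric
```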

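Finally, returning to "Approximating with a Markov Source": a rough sketch of how 0-memory and 1st-order per-character estimates could be computed from text (the toy string is purely illustrative; figures like 4.03 or 3.32 bits/char require a large corpus):

```python
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat and the rat sat on the cat "   # toy sample only

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# 0-memory estimate: entropy of the character distribution.
counts = Counter(text)
h0 = entropy([c / len(text) for c in counts.values()])

# 1st-order estimate: H(c_i / c_{i-1}) = sum_c P(c) * H(next char | previous char = c)
followers = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    followers[prev][nxt] += 1

n_bigrams = len(text) - 1
h1 = sum((sum(nexts.values()) / n_bigrams) *
         entropy([c / sum(nexts.values()) for c in nexts.values()])
         for nexts in followers.values())

print(h0, h1)   # the 1st-order estimate should come out well below the 0-memory one
```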