
A Gentle Tutorial on Information Theory and Learning

Roni Rosenfeld
Carnegie Mellon University


Outline

• First part based very loosely on [Abramson 63].
• Information theory is usually formulated in terms of information channels and coding; we will not discuss those here.

1. Information
2. Entropy
3. Mutual Information
4. Cross Entropy and Learning


Information

• information ≠ knowledge
  Concerned with abstract possibilities, not their meaning.
• information: reduction in uncertainty

Imagine:
  #1 you're about to observe the outcome of a coin flip
  #2 you're about to observe the outcome of a die roll
There is more uncertainty in #2.

Next:
1. You observe the outcome of #1 → uncertainty reduced to zero.
2. You observe the outcome of #2 → uncertainty reduced to zero.

⇒ more information was provided by the outcome in #2


Definition of Information

(After [Abramson 63])

Let E be some event which occurs with probability P(E). If we are told that E has occurred, then we say that we have received

    I(E) = log2(1 / P(E))

bits of information.

• The base of the log is unimportant; it only changes the units. We'll stick with bits, and always assume base 2.
• Can also think of information as the amount of "surprise" in E (e.g. P(E) = 1 vs. P(E) = 0).
• Example: result of a fair coin flip (log2 2 = 1 bit)
• Example: result of a fair die roll (log2 6 ≈ 2.585 bits)
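As a quick check of the definition above, here is a minimal sketch (not from the tutorial; the function name is mine) that reproduces the two examples:

```python
import math

def information(p_event: float) -> float:
    """Self-information I(E) = log2(1 / P(E)), in bits."""
    return math.log2(1.0 / p_event)

print(information(1 / 2))   # fair coin flip -> 1.0 bit
print(information(1 / 6))   # fair die roll  -> ~2.585 bits
```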

Information is Additive

• I(k fair coin tosses) = log2(1 / (1/2)^k) = k bits

So:
  – a random word from a 100,000-word vocabulary:
    I(word) = log2 100,000 = 16.61 bits
  – a 1000-word document from the same source:
    I(document) = 16,610 bits
  – a 480x640-pixel, 16-greyscale video picture:
    I(picture) = 307,200 · log2 16 = 1,228,800 bits

• ⇒ a (VGA) picture is worth (a lot more than) a 1000 words!
• (In reality, both are gross overestimates.)


Entropy

A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities {p1, p2, ..., pk}, respectively, where the symbols emitted are statistically independent.

What is the average amount of information in observing the output of the source S?

Call this Entropy:

    H(S) = Σ_i p_i · I(s_i) = Σ_i p_i · log2(1/p_i) = E_P[ log2(1/p(s)) ]


Alternative Explanations of Entropy

    H(S) = Σ_i p_i · log2(1/p_i)

1. avg amount of information provided per symbol
2. avg amount of surprise when observing a symbol
3. uncertainty an observer has before seeing the symbol
4. avg # of bits needed to communicate each symbol
   (Shannon: there are codes that will communicate these symbols with efficiency arbitrarily close to H(S) bits/symbol; there are no codes that will do it with efficiency less than H(S) bits/symbol.)


Entropy as a Function of a Probability Distribution

Since the source S is fully characterized by P = {p1, ..., pk} (we don't care what the symbols s_i actually are, or what they stand for), entropy can also be thought of as a property of a probability distribution function P: the average uncertainty in the distribution.

So we may also write:

    H(S) = H(P) = H(p1, p2, ..., pk) = Σ_i p_i · log2(1/p_i)

(Can be generalized to continuous distributions.)
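A minimal sketch of H(P) as just defined (not from the tutorial; the function name is mine), treating entropy as a function of a probability distribution given as a list of probabilities:

```python
import math

def entropy(probs) -> float:
    """H(P) = sum_i p_i * log2(1/p_i), in bits; terms with p_i = 0 contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair die: ~2.585 bits
print(entropy([1.0]))        # certain outcome: 0.0 bits (no surprise)
```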

Properties of Entropy

    H(P) = Σ_i p_i · log2(1/p_i)

1. Non-negative: H(P) ≥ 0
2. Invariant wrt permutation of its inputs:
   H(p1, p2, ..., pk) = H(p_τ(1), p_τ(2), ..., p_τ(k))
3. For any other probability distribution {q1, q2, ..., qk}:
   Σ_i p_i · log2(1/p_i) < Σ_i p_i · log2(1/q_i)
4. H(P) ≤ log2 k, with equality iff p_i = 1/k ∀i
5. The further P is from uniform, the lower the entropy.


Special Case: k = 2

Flipping a coin with P("head") = p, P("tail") = 1 − p:

    H(p) = p · log2(1/p) + (1 − p) · log2(1/(1 − p))

[Plot of H(p) as a function of p.]

Notice:
• zero uncertainty/information/surprise at the edges (p = 0 or p = 1)
• maximum information at p = 0.5 (1 bit)
• drops off quickly away from 0.5


Special Case: k = 2 (cont.)

Relates to: the "20 questions" game strategy (halving the space).

So a sequence of (independent) 0's and 1's can provide up to 1 bit of information per digit, provided the 0's and 1's are equally likely at any point. If they are not equally likely, the sequence provides less information and can be compressed.


The Entropy of English

27 characters (A-Z, space).
100,000 words (avg 5.5 characters each).

• Assuming independence between successive characters:
  – uniform character distribution: log2 27 = 4.75 bits/character
  – true character distribution: 4.03 bits/character
• Assuming independence between successive words:
  – uniform word distribution: log2 100,000 / 6.5 ≈ 2.55 bits/character
  – true word distribution: 9.45 / 6.5 ≈ 1.45 bits/character
• The true entropy of English is much lower!
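The per-character figures above are simple arithmetic; a small sketch reproducing them, assuming the 6.5 divisor is the average 5.5 letters per word plus one following space:

```python
import math

chars_per_word = 5.5 + 1   # 5.5 letters plus a space (assumed interpretation)

print(math.log2(27))                         # uniform characters: ~4.75 bits/char
print(math.log2(100_000) / chars_per_word)   # uniform words: ~2.55 bits/char
print(9.45 / chars_per_word)                 # true word distribution (9.45 bits/word): ~1.45 bits/char
```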

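And a sketch illustrating the Special Case k = 2 curve and Property 4 above (the entropy() helper repeats the earlier sketch; the example distributions are mine, not from the tutorial):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Special case k = 2: H(p) is near 0 at the edges and peaks at 1 bit for p = 0.5.
for p in (0.01, 0.1, 0.3, 0.5, 0.7, 0.99):
    print(p, entropy([p, 1 - p]))

# Property 4: H(P) <= log2 k, with equality for the uniform distribution.
print(entropy([0.25] * 4), math.log2(4))   # 2.0  2.0
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ~1.36 < 2.0
```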
Two Sources

Temperature T: a random variable taking on values t
    P(T=hot)  = 0.3
    P(T=mild) = 0.5
    P(T=cold) = 0.2
    ⇒ H(T) = H(0.3, 0.5, 0.2) = 1.48548

Humidity M: a random variable taking on values m
    P(M=low)  = 0.6
    P(M=high) = 0.4
    ⇒ H(M) = H(0.6, 0.4) = 0.970951

T and M are not independent: P(T=t, M=m) ≠ P(T=t) · P(M=m)


Joint Probability, Joint Entropy

P(T=t, M=m):

            cold   mild   hot
    low     0.1    0.4    0.1   | 0.6
    high    0.2    0.1    0.1   | 0.4
            0.3    0.5    0.2   | 1.0

• H(T) = H(0.3, 0.5, 0.2) = 1.48548
• H(M) = H(0.6, 0.4) = 0.970951
• H(T) + H(M) = 2.456431

Joint Entropy: consider the space of (t, m) events:

    H(T, M) = Σ_{t,m} P(T=t, M=m) · log2(1 / P(T=t, M=m))
            = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.32193

Notice that H(T, M) < H(T) + H(M) !!!


Conditional Probability, Conditional Entropy

P(T=t | M=m):

            cold   mild   hot
    low     1/6    4/6    1/6   | 1.0
    high    2/4    1/4    1/4   | 1.0

Conditional Entropy:
• H(T | M=low)  = H(1/6, 4/6, 1/6) = 1.25163
• H(T | M=high) = H(2/4, 1/4, 1/4) = 1.5
• Average Conditional Entropy (aka equivocation):
    H(T/M) = Σ_m P(M=m) · H(T | M=m)
           = 0.6 · H(T | M=low) + 0.4 · H(T | M=high) = 1.350978

How much is M telling us on average about T?

    H(T) − H(T/M) = 1.48548 − 1.350978 ≈ 0.1345 bits


Conditional Probability, Conditional Entropy (cont.)

P(M=m | T=t):

            cold   mild   hot
    low     1/3    4/5    1/2
    high    2/3    1/5    1/2
            1.0    1.0    1.0

Conditional Entropy:
• H(M | T=cold) = H(1/3, 2/3) = 0.918296
• H(M | T=mild) = H(4/5, 1/5) = 0.721928
• H(M | T=hot)  = H(1/2, 1/2) = 1.0
• Average Conditional Entropy (aka equivocation):
    H(M/T) = Σ_t P(T=t) · H(M | T=t)
           = 0.3 · H(M | T=cold) + 0.5 · H(M | T=mild) + 0.2 · H(M | T=hot) = 0.8364528

How much is T telling us on average about M?

    H(M) − H(M/T) = 0.970951 − 0.8364528 ≈ 0.1345 bits
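A sketch that reproduces the joint and conditional entropy numbers above directly from the joint table (variable names are mine, not from the tutorial):

```python
import math

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T=t, M=m) from the table above.
joint = {('cold', 'low'): 0.1, ('mild', 'low'): 0.4, ('hot', 'low'): 0.1,
         ('cold', 'high'): 0.2, ('mild', 'high'): 0.1, ('hot', 'high'): 0.1}

p_T = {t: sum(p for (t2, m), p in joint.items() if t2 == t) for t in ('cold', 'mild', 'hot')}
p_M = {m: sum(p for (t, m2), p in joint.items() if m2 == m) for m in ('low', 'high')}

print(entropy(p_T.values()))     # H(T)   = 1.485
print(entropy(p_M.values()))     # H(M)   = 0.971
print(entropy(joint.values()))   # H(T,M) = 2.322  (< H(T) + H(M))

# Equivocation H(M/T) = sum_t P(T=t) * H(M | T=t)
h_M_given_T = sum(p_T[t] * entropy([joint[(t, m)] / p_T[t] for m in ('low', 'high')])
                  for t in ('cold', 'mild', 'hot'))
print(h_M_given_T)                           # 0.836
print(entropy(p_M.values()) - h_M_given_T)   # ~0.1345 bits: what T tells us about M
```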

Average Mutual Information

    I(X;Y) = H(X) − H(X/Y)
           = Σ_x P(x) · log2(1/P(x)) − Σ_{x,y} P(x,y) · log2(1/P(x|y))
           = Σ_{x,y} P(x,y) · log2( P(x|y) / P(x) )
           = Σ_{x,y} P(x,y) · log2( P(x,y) / (P(x) · P(y)) )

Properties of Average Mutual Information:
• Symmetric (but H(X) ≠ H(Y) and H(X/Y) ≠ H(Y/X))
• Non-negative (but H(X) − H(X/y) may be negative!)
• Zero iff X, Y are independent
• Additive (see the Three Sources slide below)


Mutual Information Visualized

[Venn diagram: H(X) and H(Y) overlap in I(X;Y).]

    H(X,Y) = H(X) + H(Y) − I(X;Y)


Three Sources

From Blachman:
("/" means "given". ";" means "between". "," means "and".)

• H(X,Y/Z) = H({X,Y} / Z)
• H(X/Y,Z) = H(X / {Y,Z})
• I(X;Y/Z) = H(X/Z) − H(X/Y,Z)
• I(X;Y;Z) = I(X;Y) − I(X;Y/Z)
           = H(X,Y,Z) − H(X,Y) − H(X,Z) − H(Y,Z) + H(X) + H(Y) + H(Z)
  ⇒ Can be negative!
• I(X;Y,Z) = I(X;Y) + I(X;Z/Y)   (additivity)
• But: I(X;Y) = 0 and I(X;Z) = 0 do not imply I(X;Y,Z) = 0!!!


A Markov Source

An order-k Markov source is a source that "remembers" the last k symbols emitted.

I.e., the probability of emitting any symbol depends on the last k emitted symbols:

    P(S_t | S_{t−1}, S_{t−2}, ..., S_{t−k})

So the last k emitted symbols define a state, and there are q^k states (q = alphabet size).

A first-order Markov source is defined by a q×q matrix: P(s_i | s_j).

Example: a random walk, where S_t is the position after t random steps.
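Tying the last few slides together: a sketch (not from the tutorial) that computes I(T;M) from the last form of the Average Mutual Information sum above, using the same temperature/humidity joint table; it recovers the ≈ 0.1345 bits obtained earlier via equivocation.

```python
import math

# Same joint P(T=t, M=m) as in the temperature/humidity example above.
joint = {('cold', 'low'): 0.1, ('mild', 'low'): 0.4, ('hot', 'low'): 0.1,
         ('cold', 'high'): 0.2, ('mild', 'high'): 0.1, ('hot', 'high'): 0.1}

p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# I(X;Y) = sum_{x,y} P(x,y) * log2( P(x,y) / (P(x) * P(y)) )
mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)
print(mi)   # ~0.1345 bits, matching H(T) - H(T/M) and H(M) - H(M/T) above
```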

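Ahead of the approximation slides below, a toy sketch of the random-walk example from the Markov Source slide (the 5-position ring is my illustrative choice, not from the tutorial):

```python
import random

# First-order Markov source: the next symbol depends only on the current state.
# Illustrative example: a random walk on q = 5 positions arranged in a ring;
# the emitted symbol is the current position.
q = 5

def random_walk(n_steps, pos=0):
    positions = []
    for _ in range(n_steps):
        pos = (pos + random.choice([-1, +1])) % q   # step left or right
        positions.append(pos)
    return positions

print(random_walk(20))   # e.g. [1, 0, 4, 0, 1, 2, ...]
```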
Approximating with a Markov Source

A non-Markovian source can still be approximated by one.

Example: English characters, C = {c1, c2, ...}:

1. Uniform: H(C) = log2 27 = 4.75 bits/char
2. Assuming 0 memory: H(C) = H(0.186, 0.064, 0.0127, ...) = 4.03 bits/char
3. Assuming 1st order: H(C) = H(c_i / c_{i−1}) = 3.32 bits/char
4. Assuming 2nd order: H(C) = H(c_i / c_{i−1}, c_{i−2}) = 3.1 bits/char
5. Assuming large order: Shannon got down to ≈ 1 bit/char


Modeling an Arbitrary Source

Source D(Y) with unknown distribution P_D(Y).

Goal: model (approximate) it with a learned distribution P_M(Y).

What's a good model P_M(Y)?

1. RMS error over D's parameters ⇒ but D is unknown!
2. Predictive Probability: maximize the expected log-likelihood the model assigns to future data from D.


Cross Entropy

    M* = argmax_M E_D[ log2 P_M(Y) ]
       = argmin_M E_D[ log2 (1/P_M(Y)) ]

    where E_D[ log2 (1/P_M(Y)) ] = CH(P_D; P_M)   ⇐ Cross Entropy

The following are equivalent:
1. Maximize the Predictive Probability of P_M
2. Minimize the Cross Entropy CH(P_D; P_M)
3. Minimize the difference between P_D and P_M (in what sense?)


A Distance Measure Between Distributions

Kullback-Leibler distance:
(recall H(P_D) = E_{P_D}[ log2 (1/P_D(Y)) ])

    KL(P_D; P_M) = CH(P_D; P_M) − H(P_D)
                 = E_{P_D}[ log2 ( P_D(Y) / P_M(Y) ) ]

Properties of the KL distance:
1. Non-negative. KL(P_D; P_M) = 0 ⇐⇒ P_D = P_M
2. Generally non-symmetric.

The following are equivalent:
1. Maximize the Predictive Probability of P_M for distribution D
2. Minimize the Cross Entropy CH(P_D; P_M)
3. Minimize the distance KL(P_D; P_M)
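A minimal sketch of cross entropy and KL distance for discrete distributions over the same alphabet (function names and the example distributions are mine, not from the tutorial); it illustrates CH ≥ H(P_D), KL ≥ 0, and the non-symmetry noted above:

```python
import math

def cross_entropy(p_d, p_m):
    """CH(P_D; P_M) = sum_y P_D(y) * log2(1 / P_M(y)), in bits."""
    return sum(p * math.log2(1.0 / p_m[y]) for y, p in p_d.items() if p > 0)

def kl(p_d, p_m):
    """KL(P_D; P_M) = CH(P_D; P_M) - H(P_D); note CH(P; P) = H(P)."""
    return cross_entropy(p_d, p_m) - cross_entropy(p_d, p_d)

p_true  = {'a': 0.5, 'b': 0.25, 'c': 0.25}
p_model = {'a': 0.4, 'b': 0.4,  'c': 0.2}

print(cross_entropy(p_true, p_true))    # H(P_D) = 1.5 bits
print(cross_entropy(p_true, p_model))   # CH >= H(P_D)  (Property 3 / Gibbs)
print(kl(p_true, p_model))              # >= 0; zero only when the model matches
print(kl(p_model, p_true))              # generally different: KL is non-symmetric
```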

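Finally, returning to "Approximating with a Markov Source": a rough sketch of how 0-memory and 1st-order per-character estimates could be computed from text (the toy string is purely illustrative; figures like 4.03 or 3.32 bits/char require a large corpus):

```python
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat and the rat sat on the cat "   # toy sample only

def entropy(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# 0-memory estimate: entropy of the character distribution.
counts = Counter(text)
h0 = entropy([c / len(text) for c in counts.values()])

# 1st-order estimate: H(c_i / c_{i-1}) = sum_c P(c) * H(next char | previous char = c)
followers = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    followers[prev][nxt] += 1

n_bigrams = len(text) - 1
h1 = sum((sum(nexts.values()) / n_bigrams) *
         entropy([c / sum(nexts.values()) for c in nexts.values()])
         for nexts in followers.values())

print(h0, h1)   # the 1st-order estimate should come out well below the 0-memory one
```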