A Gentle Tutorial on Information Theory and Learning
Roni Rosenfeld, Carnegie Mellon University, 1999


Outline

• First part based very loosely on [Abramson 63].
• Information theory is usually formulated in terms of information channels and coding — we will not discuss those here.

1. Information
2. Entropy
3. Mutual Information
4. Cross Entropy and Learning

Information

• information ≠ knowledge
  We are concerned with abstract possibilities, not their meaning.
• information: reduction in uncertainty

Imagine:
  #1 you're about to observe the outcome of a coin flip
  #2 you're about to observe the outcome of a die roll
There is more uncertainty in #2.

Next:
1. You observe the outcome of #1 → uncertainty reduced to zero.
2. You observe the outcome of #2 → uncertainty reduced to zero.
⇒ More information was provided by the outcome in #2.

Definition of Information (after [Abramson 63])

Let E be some event which occurs with probability P(E). If we are told that E has occurred, then we say that we have received

  I(E) = log2 (1 / P(E))

bits of information.

• The base of the log is unimportant — it only changes the units. We'll stick with bits, and always assume base 2.
• Can also think of information as the amount of "surprise" in E (e.g. P(E) = 1 carries no surprise; P(E) → 0 carries unbounded surprise).
• Example: result of a fair coin flip (log2 2 = 1 bit)
• Example: result of a fair die roll (log2 6 ≈ 2.585 bits)

Information is Additive

• I(k fair coin tosses) = log2 (1 / (1/2)^k) = k bits

So:
  – a random word from a 100,000-word vocabulary: I(word) = log2 100,000 = 16.61 bits
  – a 1000-word document from the same source: I(document) = 16,610 bits
  – a 480×640-pixel, 16-greyscale picture: I(picture) = 307,200 · log2 16 = 1,228,800 bits
• ⇒ A (VGA) picture is worth (a lot more than) a 1000 words!
• (In reality, both are gross overestimates.)

Entropy

A zero-memory information source S is a source that emits symbols from an alphabet {s1, s2, ..., sk} with probabilities {p1, p2, ..., pk}, respectively, where the emitted symbols are statistically independent.

What is the average amount of information in observing the output of the source S? Call this the Entropy:

  H(S) = Σ_i p_i · I(s_i) = Σ_i p_i · log2 (1/p_i) = E_P[ log2 (1/p(s)) ]

Alternative Explanations of Entropy

  H(S) = Σ_i p_i · log2 (1/p_i)

1. avg amount of information provided per symbol
2. avg amount of surprise when observing a symbol
3. uncertainty an observer has before seeing the symbol
4. avg # of bits needed to communicate each symbol
   (Shannon: there are codes that will communicate these symbols with efficiency arbitrarily close to H(S) bits/symbol; there are no codes that will do it with efficiency less than H(S) bits/symbol.)

Entropy as a Function of a Probability Distribution

Since the source S is fully characterized by P = {p1, ..., pk} (we don't care what the symbols s_i actually are, or what they stand for), entropy can also be thought of as a property of a probability distribution P: the average uncertainty in the distribution. So we may also write:

  H(S) = H(P) = H(p1, p2, ..., pk) = Σ_i p_i · log2 (1/p_i)

(Can be generalized to continuous distributions.)
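To make these definitions concrete, here is a minimal Python sketch (my addition, not part of the original tutorial; the function names information and entropy are just illustrative). It reproduces the numbers quoted above for the coin, the die, and the 100,000-word vocabulary:

```python
import math

def information(p_event):
    """Self-information I(E) = log2(1/P(E)), in bits."""
    return math.log2(1.0 / p_event)

def entropy(probs):
    """Entropy H(P) = sum_i p_i * log2(1/p_i), in bits.
    Zero-probability symbols contribute nothing to the sum."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(information(1 / 2))        # fair coin flip: 1.0 bit
print(information(1 / 6))        # fair die roll: ~2.585 bits
print(information(1 / 100_000))  # word from a 100,000-word vocabulary: ~16.61 bits
print(entropy([1 / 6] * 6))      # fair die: entropy equals log2 6, since the source is uniform
```

For a uniform source the entropy equals the self-information of each symbol; for a skewed source it is lower, which is exactly property 4 (H(P) ≤ log2 k) in the list below.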
Properties of Entropy

  H(P) = Σ_i p_i · log2 (1/p_i)

1. Non-negative: H(P) ≥ 0.
2. Invariant with respect to permutation of its inputs:
   H(p1, p2, ..., pk) = H(p_τ(1), p_τ(2), ..., p_τ(k)).
3. For any other probability distribution {q1, q2, ..., qk}:
   H(P) = Σ_i p_i · log2 (1/p_i) < Σ_i p_i · log2 (1/q_i).
4. H(P) ≤ log2 k, with equality iff p_i = 1/k for all i.
5. The further P is from uniform, the lower the entropy.

Special Case: k = 2

Flipping a coin with P("head") = p and P("tail") = 1 − p:

  H(p) = p · log2 (1/p) + (1 − p) · log2 (1/(1 − p))

Notice:
• zero uncertainty/information/surprise at the edges (p = 0 or p = 1)
• maximum information at p = 0.5 (1 bit)
• drops off quickly away from p = 0.5

Special Case: k = 2 (cont.)

Relates to the "20 questions" game strategy (halving the space).
So a sequence of (independent) 0's and 1's can provide up to 1 bit of information per digit, provided the 0's and 1's are equally likely at any point. If they are not equally likely, the sequence provides less information and can be compressed.

The Entropy of English

27 characters (A–Z, space). 100,000 words (avg 5.5 characters each, i.e. about 6.5 characters per word counting the space).

• Assuming independence between successive characters:
  – uniform character distribution: log2 27 = 4.75 bits/character
  – true character distribution: 4.03 bits/character
• Assuming independence between successive words:
  – uniform word distribution: log2 100,000 / 6.5 ≈ 2.55 bits/character
  – true word distribution: 9.45 / 6.5 ≈ 1.45 bits/character
• The true entropy of English is much lower!

Two Sources

Temperature T: a random variable taking on values t
  P(T=cold) = 0.3, P(T=mild) = 0.5, P(T=hot) = 0.2
  ⇒ H(T) = H(0.3, 0.5, 0.2) = 1.48548

Humidity M: a random variable taking on values m
  P(M=low) = 0.6, P(M=high) = 0.4
  ⇒ H(M) = H(0.6, 0.4) = 0.970951

T and M are not independent: P(T=t, M=m) ≠ P(T=t) · P(M=m).

Joint Probability, Joint Entropy

P(T=t, M=m):

          cold   mild   hot
  low     0.1    0.4    0.1   | 0.6
  high    0.2    0.1    0.1   | 0.4
          0.3    0.5    0.2   | 1.0

• H(T) = H(0.3, 0.5, 0.2) = 1.48548
• H(M) = H(0.6, 0.4) = 0.970951
• H(T) + H(M) = 2.456431
• Joint Entropy: consider the space of (t, m) events:
  H(T,M) = Σ_{t,m} P(T=t, M=m) · log2 (1 / P(T=t, M=m))
         = H(0.1, 0.4, 0.1, 0.2, 0.1, 0.1) = 2.32193

Notice that H(T,M) < H(T) + H(M) !!!

Conditional Probability, Conditional Entropy

P(M=m | T=t):

          cold   mild   hot
  low     1/3    4/5    1/2
  high    2/3    1/5    1/2
          1.0    1.0    1.0

Conditional Entropy:
• H(M | T=cold) = H(1/3, 2/3) = 0.918296
• H(M | T=mild) = H(4/5, 1/5) = 0.721928
• H(M | T=hot) = H(1/2, 1/2) = 1.0
• Average Conditional Entropy (aka Equivocation):
  H(M|T) = Σ_t P(T=t) · H(M | T=t)
         = 0.3 · H(M|T=cold) + 0.5 · H(M|T=mild) + 0.2 · H(M|T=hot) = 0.8364528

How much is T telling us on average about M?
  H(M) − H(M|T) = 0.970951 − 0.8364528 ≈ 0.1345 bits

Conditional Probability, Conditional Entropy (cont.)

P(T=t | M=m):

          cold   mild   hot
  low     1/6    4/6    1/6   | 1.0
  high    2/4    1/4    1/4   | 1.0

Conditional Entropy:
• H(T | M=low) = H(1/6, 4/6, 1/6) = 1.25163
• H(T | M=high) = H(2/4, 1/4, 1/4) = 1.5
• Average Conditional Entropy (aka Equivocation):
  H(T|M) = Σ_m P(M=m) · H(T | M=m)
         = 0.6 · H(T|M=low) + 0.4 · H(T|M=high) = 1.350978

How much is M telling us on average about T?
  H(T) − H(T|M) = 1.48548 − 1.350978 ≈ 0.1345 bits (the same amount as before)
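The joint-table arithmetic above can be reproduced with a short Python sketch (my addition, not part of the original slides; the helper name equivocation is just illustrative). It recomputes H(T), H(M), H(T,M), both equivocations, and the ≈0.1345-bit difference:

```python
import math
from collections import defaultdict

def entropy(probs):
    """H = sum_i p_i * log2(1/p_i), in bits; zero-probability entries contribute 0."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Joint distribution P(T=t, M=m) taken from the table above.
joint = {
    ('cold', 'low'): 0.1, ('mild', 'low'): 0.4, ('hot', 'low'): 0.1,
    ('cold', 'high'): 0.2, ('mild', 'high'): 0.1, ('hot', 'high'): 0.1,
}

# Marginals P(T) and P(M).
p_T, p_M = defaultdict(float), defaultdict(float)
for (t, m), p in joint.items():
    p_T[t] += p
    p_M[m] += p

def equivocation(joint, marginal, given_index):
    """Average conditional entropy, e.g. H(M|T) = sum_t P(T=t) * H(M | T=t)."""
    cond = defaultdict(list)
    for pair, p in joint.items():
        given = pair[given_index]
        cond[given].append(p / marginal[given])  # conditional distribution given that value
    return sum(marginal[v] * entropy(dist) for v, dist in cond.items())

H_T, H_M = entropy(p_T.values()), entropy(p_M.values())  # 1.48548, 0.970951
H_TM = entropy(joint.values())                           # 2.32193  (< H_T + H_M)
H_M_given_T = equivocation(joint, p_T, given_index=0)    # 0.8364528
H_T_given_M = equivocation(joint, p_M, given_index=1)    # 1.350978

print(H_M - H_M_given_T)   # ~0.1345 bits
print(H_T - H_T_given_M)   # ~0.1345 bits: the same quantity, seen from either side
```

Note that H(T,M) = H(T) + H(M|T) = H(M) + H(T|M) also holds here, which is why the two differences agree; that shared quantity is the mutual information introduced next.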
Average Mutual Information

  I(X;Y) = H(X) − H(X/Y)
         = Σ_x P(x) · log2 (1/P(x)) − Σ_{x,y} P(x,y) · log2 (1/P(x|y))
         = Σ_{x,y} P(x,y) · log2 (P(x|y) / P(x))
         = Σ_{x,y} P(x,y) · log2 (P(x,y) / (P(x) P(y)))

Properties of Average Mutual Information:
• Symmetric (but H(X) ≠ H(Y) and H(X/Y) ≠ H(Y/X) in general)
• Non-negative (but H(X) − H(X/y), for a specific observation y, may be negative!)
• Zero iff X, Y are independent
• Additive (see "Three Sources" below)

Mutual Information Visualized

  H(X,Y) = H(X) + H(Y) − I(X;Y)

Three Sources

From Blachman. ("/" means "given", ";" means "between", "," means "and".)

• H(X,Y/Z) = H({X,Y} / Z)
• H(X/Y,Z) = H(X / {Y,Z})
• I(X; Y/Z) = H(X/Z) − H(X/Y,Z)
• I(X;Y;Z) = I(X;Y) − I(X;Y/Z)
           = H(X,Y,Z) − H(X,Y) − H(X,Z) − H(Y,Z) + H(X) + H(Y) + H(Z)
  ⇒ Can be negative!
• I(X; Y,Z) = I(X;Y) + I(X; Z/Y)   (additivity)
• But I(X;Y) = 0 and I(X;Z) = 0 doesn't mean I(X; Y,Z) = 0!!!

A Markov Source

Order-k Markov source: a source that "remembers" the last k symbols emitted, i.e. the probability of emitting any symbol depends on the last k emitted symbols:

  P(s_t | s_{t−1}, s_{t−2}, ..., s_{t−k})

So the last k emitted symbols define a state, and there are q^k states.
A first-order Markov source is defined by a q × q matrix: P(s_i | s_j).
Example: S_t is the position after t random steps.

Modeling an Arbitrary Source

Source D(Y) with unknown distribution P_D(Y).
Goal: model (approximate) it with a learned distribution P_M(Y).
What's a good model P_M(Y)?

A Distance Measure Between Distributions

Kullback-Leibler distance (recall H(P_D) = E_{P_D}[ log2 (1/P_D(Y)) ]):

  KL(P_D; P_M) = CH(P_D; P_M) − H(P_D) = E_{P_D}[ log2 (P_D(Y) / P_M(Y)) ]

where CH(P_D; P_M) is the cross entropy.

Properties of the KL distance: 1. …
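As a small illustration of the relation KL(P_D; P_M) = CH(P_D; P_M) − H(P_D) stated above, here is a Python sketch (my addition, not from the tutorial; p_true and p_model are made-up example distributions, with p_true reusing the temperature distribution from earlier):

```python
import math

def entropy(p):
    """H(P) = sum_y P(y) * log2(1/P(y)), in bits."""
    return sum(pi * math.log2(1.0 / pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """CH(P; Q) = sum_y P(y) * log2(1/Q(y)): average code length when events drawn
    from P are encoded with a code optimized for Q; infinite if Q misses an event."""
    return sum(pi * math.log2(1.0 / qi) if qi > 0 else math.inf
               for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P; Q) = CH(P; Q) - H(P); always >= 0, and zero iff P and Q are identical."""
    return cross_entropy(p, q) - entropy(p)

p_true  = [0.3, 0.5, 0.2]   # "true" source distribution P_D (unknown in practice)
p_model = [1/3, 1/3, 1/3]   # candidate learned model P_M (here simply uniform)

print(entropy(p_true))                 # ~1.485 bits
print(cross_entropy(p_true, p_model))  # ~1.585 bits (= log2 3 for a uniform model)
print(kl(p_true, p_model))             # ~0.100 bits
```

The ≈0.1-bit gap is the average extra code length paid per symbol for coding with P_M instead of the true P_D; a better model shrinks it toward zero, which is what makes the KL distance a natural learning criterion.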