Chapter 4 Information Theory
Introduction

This lecture covers entropy, joint entropy, mutual information and minimum description length. See the texts by Cover and MacKay for a more comprehensive treatment.

Measures of Information

Information on a computer is represented by binary bit strings. Decimal numbers can be represented using the encoding shown in the table below, where the position of a binary digit indicates its decimal equivalent: if there are N bits, the ith bit represents the decimal number 2^{N-i}. Bit 1 is referred to as the most significant bit and bit N as the least significant bit. To encode M different messages requires \log_2 M bits.

    Bit 3   Bit 2   Bit 1   Bit 0   Decimal
      0       0       0       0        0
      0       0       0       1        1
      0       0       1       0        2
      0       1       0       1        5
      1       1       1       1       15

    Table: Binary encoding

Entropy

The probability of occurrence p(x_i) of the letters x_i of the English alphabet can be tabulated (the letter statistics used here were taken from MacKay's book on Information Theory), together with the information content of a letter,

    h(x_i) = \log_2 \frac{1}{p(x_i)}

which is a measure of surprise: if we had to guess what a randomly chosen letter of the English alphabet was going to be, we'd say it was an A, E, T or another frequently occurring letter; if it turned out to be a Z, we'd be surprised. The letter E is so common that it is unusual to find a sentence without one. An exception is the novel Gadsby by Ernest Vincent Wright, in which the author deliberately makes no use of the letter E (from Cover's book on Information Theory).

The entropy is the average information content

    H(x) = \sum_{i=1}^{M} p(x_i) h(x_i)

where M is the number of discrete values that x_i can take. It is usually written as

    H(x) = -\sum_{i=1}^{M} p(x_i) \log p(x_i)

with the convention that 0 \log 0 = 0. Entropy measures uncertainty.

Entropy is maximised for a uniform distribution, p(x_i) = 1/M. The resulting entropy is H(x) = \log_2 M, which is the number of binary bits required to represent M different messages (see the first section). For M = 2, for example, the maximum entropy distribution is given by p(x_1) = p(x_2) = 0.5, e.g. an unbiased coin; biased coins have lower entropy. The entropy of letters in the English language is about 4 bits, which is less than the \log_2 M bits a uniform distribution over the alphabet would give. This is, however, the information content obtained by considering just the probability of occurrence of individual letters. In language, our expectation of what the next letter will be is determined by what the previous letters have been. To measure this we need the concept of joint entropy. Because H(x) is the entropy of a one-dimensional variable, it is sometimes called the scalar entropy, to differentiate it from the joint entropy.
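As a concrete illustration of these definitions, here is a minimal Python sketch that computes h(x) and H(x) for a discrete distribution; the function names and the example probabilities are illustrative choices, not anything prescribed by the notes above.

    import numpy as np

    def information_content(p):
        """h(x) = log2(1/p(x)) in bits, for a single outcome with probability p."""
        return np.log2(1.0 / p)

    def entropy(p):
        """H(x) = -sum_i p_i log2 p_i in bits, using the convention 0 log 0 = 0."""
        p = np.asarray(p, dtype=float).ravel()
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    # A rare outcome carries more information (more 'surprise') than a common one.
    print(information_content(0.5), information_content(0.01))   # 1.0, ~6.64 bits

    # Uniform distribution over M = 4 messages: entropy is log2(4) = 2 bits.
    print(entropy([0.25, 0.25, 0.25, 0.25]))                      # 2.0

    # An unbiased coin has 1 bit of entropy; a biased coin has less.
    print(entropy([0.5, 0.5]), entropy([0.9, 0.1]))               # 1.0, ~0.47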
Joint Entropy

Selected pairs of letters x_i and y_j, where x_i is followed by y_j (for example th, ts and tr), can be characterised by their probability of occurrence. This is called the joint probability p(x_i, y_j), and the corresponding joint information content is

    h(x_i, y_j) = \log_2 \frac{1}{p(x_i, y_j)}

The average joint information content is given by the joint entropy

    H(x, y) = -\sum_{i=1}^{M} \sum_{j=1}^{M} p(x_i, y_j) \log p(x_i, y_j)

If we fix x to, say, x_i, then the probability of y taking on a particular value, say y_j, is given by the conditional probability

    p(y = y_j | x = x_i) = \frac{p(x = x_i, y = y_j)}{p(x = x_i)}

For example, if x_i = t and y_j = h, then the joint probability p(x_i, y_j) is just the probability of occurrence of the pair th. The conditional probability p(y_j | x_i), however, says: given that we have seen the letter t, what is the probability that the next letter will be h? Rearranging the above relationship (and dropping the y = y_j notation) gives

    p(x, y) = p(y | x) p(x)

Now, if y does not depend on x then p(y | x) = p(y). Hence, for independent variables, we have

    p(x, y) = p(y) p(x)

This means that for independent variables the joint entropy is the sum of the individual (or scalar) entropies

    H(x, y) = H(x) + H(y)

Consecutive letters in the English language are not independent (except either after or during a bout of serious drinking). If we take into account the statistical dependence on the previous letter, the per-letter entropy of English is reduced below its single-letter value. If we look at the statistics of not just pairs but triplets and quadruplets of letters, or at the statistics of words, then it is possible to calculate the entropy more accurately; as more and more contextual structure is taken into account, the estimates of entropy reduce. See Cover's book for more details.

Relative Entropy

The relative entropy or Kullback-Leibler divergence between a distribution q(x) and a distribution p(x) is defined as

    D(q || p) = \sum_{x} q(x) \log \frac{q(x)}{p(x)}

Jensen's inequality states that for any concave function f(x) (a concave function is one with a negative second derivative) and any set of M positive coefficients \{\lambda_j\} which sum to one,

    f\left( \sum_{j=1}^{M} \lambda_j x_j \right) \ge \sum_{j=1}^{M} \lambda_j f(x_j)

A sketch of a proof of this is given in Bishop's book. Using this inequality (with the concave function log) we can show that

    D(q || p) = -\sum_x q(x) \log \frac{p(x)}{q(x)}
              \ge -\log \sum_x q(x) \frac{p(x)}{q(x)}
              = -\log \sum_x p(x) = -\log 1 = 0

Hence D(q || p) \ge 0. The KL divergence will appear again in the discussion of the EM algorithm and Variational Bayesian learning (see later lectures).

Mutual Information

The mutual information is defined as the relative entropy between the joint distribution and the product of the individual distributions

    I(x, y) = D( p(X, Y) || p(X) p(Y) )

Substituting these distributions into the definition of relative entropy allows us to express the mutual information as the difference between the sum of the individual entropies and the joint entropy

    I(x, y) = H(x) + H(y) - H(x, y)

Therefore, if x and y are independent, the mutual information is zero. More generally, I(x, y) is a measure of the dependence between variables, and this dependence will be captured whether the underlying relationship is linear or nonlinear. This is to be contrasted with Pearson's correlation coefficient, which measures only linear correlation (see the first lecture).
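The joint entropy, KL divergence and mutual information are easy to evaluate numerically for a small discrete joint distribution. The following Python sketch is a minimal illustration (the function names and the example joint probability table are assumptions made purely for the example, not quantities from the lecture): it checks that I(x, y) = H(x) + H(y) - H(x, y) and that the mutual information vanishes when p(x, y) = p(x) p(y).

    import numpy as np

    def entropy(p):
        """H = -sum p log2 p (bits), with the convention 0 log 0 = 0."""
        p = np.asarray(p, dtype=float).ravel()
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    def kl_divergence(q, p):
        """D(q||p) = sum_x q(x) log2( q(x)/p(x) ); assumes p > 0 wherever q > 0."""
        q = np.asarray(q, dtype=float).ravel()
        p = np.asarray(p, dtype=float).ravel()
        nz = q > 0
        return np.sum(q[nz] * np.log2(q[nz] / p[nz]))

    def mutual_information(pxy):
        """I(x,y) = D( p(x,y) || p(x)p(y) ) for a joint probability table pxy."""
        pxy = np.asarray(pxy, dtype=float)
        px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
        py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
        return kl_divergence(pxy, px * py)

    # A joint distribution in which y depends on x.
    pxy = np.array([[0.4, 0.1],
                    [0.1, 0.4]])
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    # The two expressions for the mutual information agree (about 0.28 bits here).
    print(mutual_information(pxy))
    print(entropy(px) + entropy(py) - entropy(pxy))

    # For independent variables, p(x,y) = p(x)p(y) and the mutual information is zero.
    print(mutual_information(np.outer(px, py)))

Because the mutual information is itself a KL divergence, the non-negativity result of the previous section guarantees that these values are never negative (up to floating-point rounding).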
Minimum Description Length

Given that a variable has a deterministic component and a random component, the complexity of that variable can be defined as the length of a concise description of that variable's regularities. This definition has the merit that both random data and highly regular data will have a low complexity, and so we have a correspondence with our everyday notion of complexity. (This is not the case, however, with measures such as the Algorithmic Information Content (AIC) or the entropy, as these will be high even for purely random data.)

The length of a description can be measured by the number of binary bits required to encode it. If the probability of a set of measurements D is given by p(D | \theta), where \theta are the parameters of a probabilistic model, then the minimum length of a code for representing D is, from Shannon's coding theorem, the same as the information content of that data under the model (cf. the definition of information content above)

    L = -\log p(D | \theta)

However, for the receiver to decode the message they will need to know the parameters \theta which, being real numbers, are encoded by truncating each to a finite precision \delta\theta. We need a total of k \log (1/\delta\theta) bits to encode the k parameters. This gives

    L = -\log p(D | \theta) + k \log \frac{1}{\delta\theta}

The optimal precision \delta\theta can be found as follows. First we expand the negative log-likelihood (i.e. the error) using a Taylor series about the Maximum Likelihood (ML) solution \hat{\theta}. This gives

    L = -\log p(D | \hat{\theta}) + \frac{1}{2} \Delta\theta^T H \Delta\theta + k \log \frac{1}{\delta\theta}

where \Delta\theta = \theta - \hat{\theta} and H is the Hessian of the error, which is identical to the inverse covariance matrix (the first-order term in the Taylor series disappears because the error gradient is zero at the ML solution). If the covariance matrix is diagonal (and therefore the Hessian is diagonal) then, since each parameter is truncated to precision \delta\theta, the quadratic term is approximately \frac{1}{2} \delta\theta^2 \sum_i H_{ii} and the derivative of L with respect to the precision is

    \frac{\partial L}{\partial (\delta\theta)} = \delta\theta \sum_i H_{ii} - \frac{k}{\delta\theta}

For the case of linear regression the diagonal elements are

    H_{ii} = \frac{N \sigma^2_{x_i}}{\sigma^2_e}

where \sigma^2_e is the variance of the errors and \sigma^2_{x_i} is the variance of the ith input. More generally (e.g. nonlinear regression) this last variance will be replaced with the variance of the derivative of the output with respect to the ith parameter, but the dependence on N remains. Setting the above derivative to zero therefore gives us

    \delta\theta^2 = \frac{\mathrm{constant}}{N}

where the constant depends on the variance terms. When we come to take logs, this constant becomes an additive term that doesn't scale with either the number of data points or the number of parameters in the model; we can therefore ignore it. The Minimum Description Length (MDL) is therefore given by

    \mathrm{MDL}(k) = -\log p(D | \hat{\theta}) + \frac{k}{2} \log N

This may be minimised over the number of parameters k to get the optimal model complexity. For a linear regression model

    -\log p(D | \hat{\theta}) = \frac{N}{2} \log \sigma^2_e

Therefore

    \mathrm{MDL}_{\mathrm{Linear}}(k) = \frac{N}{2} \log \sigma^2_e + \frac{k}{2} \log N

which is seen to consist of an accuracy term and a complexity term. This criterion can be used to select the optimal number of input variables and therefore offers a solution to the bias-variance dilemma. In later lectures the MDL criterion will be used in autoregressive and wavelet models.

The MDL complexity measure can be further refined by integrating out the dependence on \theta altogether. The resulting measure is known as the stochastic complexity

    I(k) = -\log p(D | k)

where

    p(D | k) = \int p(D | \theta, k) \, p(\theta) \, d\theta

In Bayesian statistics this quantity is known as the marginal likelihood or evidence. The stochastic complexity measure is thus equivalent, after taking negative logs, to the Bayesian model order selection criterion (see later lectures). See Bishop's book for a further discussion of this relationship.
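As a rough illustration of how the MDL criterion can be used for model order selection, the following Python sketch scores nested linear regression models with MDL(k) = (N/2) log(sigma_e^2) + (k/2) log N, as derived above; the synthetic data set and the function name are assumptions made purely for the example.

    import numpy as np

    def mdl_linear(X, y):
        """MDL(k) = (N/2) log(sigma_e^2) + (k/2) log(N), where k is the number of
        regression parameters (columns of the design matrix X), N is the number of
        data points and sigma_e^2 is the maximum-likelihood error variance."""
        N, k = X.shape
        theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)   # ML / least-squares fit
        resid = y - X @ theta
        sigma2_e = np.mean(resid ** 2)
        return 0.5 * N * np.log(sigma2_e) + 0.5 * k * np.log(N)

    # Toy data: y depends on the first two inputs only; the rest are irrelevant.
    rng = np.random.default_rng(0)
    N = 200
    X_full = rng.standard_normal((N, 5))
    y = 2.0 * X_full[:, 0] - 1.0 * X_full[:, 1] + 0.5 * rng.standard_normal(N)

    # Score nested models using the first k inputs; the minimum selects the order.
    for k in range(1, 6):
        print(k, mdl_linear(X_full[:, :k], y))

Adding an irrelevant input typically reduces the error term only slightly, while the complexity term grows by (1/2) log N, so the criterion stops favouring larger models once the genuinely useful inputs have been included.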