Chapter

Information Theory

Introduction

This lecture covers joint entropy and minimum description length. See the texts by Cover and Mackay for a more comprehensive treatment.

Measures of Information

Information on a computer is represented by binary strings. Numbers can be represented using the encoding shown in the table below, where the position of a binary digit indicates its decimal equivalent: if there are N bits, the ith bit represents the decimal number 2^{N-i}. Bit 1 is referred to as the most significant bit and bit N as the least significant bit. To encode M different messages requires \log_2 M bits.

Bit 3   Bit 2   Bit 1   Bit 0   Decimal

Table: Binary encoding
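As a concrete illustration of this bit-counting argument, the short Python sketch below (the function name and example values are illustrative, not from the notes) prints the binary encoding of a few decimal numbers and the number of bits needed for M distinct messages.

    import math

    def to_binary(n, n_bits=4):
        """Return the n_bits-wide binary string for a non-negative integer n."""
        return format(n, '0{}b'.format(n_bits))

    # Each bit position contributes a power of two to the decimal value.
    for n in [0, 1, 5, 9]:
        print(n, '->', to_binary(n))

    # Encoding M different messages requires ceil(log2(M)) bits.
    for M in [2, 8, 26, 27]:
        print('M =', M, 'messages need', math.ceil(math.log2(M)), 'bits')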


Entropy

The table below shows the probability of occurrence p(x_i) (to two decimal places) of selected letters x_i in the English alphabet. These statistics were taken from Mackay's book on Information Theory.

x_i    p(x_i)    h(x_i)
a
e
j
q
t
z

Table: Probability and information content of letters

The table also shows the information content of each letter,

h(x_i) = \log_2 \frac{1}{p(x_i)}

which is a measure of surprise: if we had to guess what a randomly chosen letter of the English alphabet was going to be, we would say it was an A, E, T or other frequently occurring letter; if it turned out to be a Z, we would be surprised. The letter E is so common that it is unusual to find a sentence without one. An exception is the novel Gadsby by Ernest Vincent Wright, in which the author deliberately makes no use of the letter E (an example taken from Cover's book on Information Theory). The entropy is the average information content

H(x) = \sum_{i=1}^{M} p(x_i) h(x_i)

where M is the number of discrete values that x can take. It is usually written as

H(x) = - \sum_{i=1}^{M} p(x_i) \log p(x_i)

with the convention that 0 \log 0 = 0. Entropy measures uncertainty.

Entropy is maximised for a uniform distribution, p(x_i) = 1/M. The resulting entropy is H(x) = \log_2 M, which is the number of binary bits required to represent M different messages (see the first section). For M = 2, for example, the maximum entropy distribution is given by p(x_1) = p(x_2) = 1/2, e.g. an unbiased coin; biased coins have lower entropy.
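To make this concrete, here is a minimal Python sketch (not part of the original notes) that computes H(x) = -\sum_i p(x_i) \log_2 p(x_i) and compares a fair coin, a biased coin and a uniform distribution over eight messages.

    import numpy as np

    def entropy(p):
        """Entropy in bits of a discrete distribution p (array summing to one)."""
        p = np.asarray(p, dtype=float)
        nz = p[p > 0]                     # convention: 0 log 0 = 0
        return -np.sum(nz * np.log2(nz))

    print(entropy([0.5, 0.5]))            # fair coin: 1 bit (= log2 2)
    print(entropy([0.9, 0.1]))            # biased coin: about 0.47 bits
    print(entropy(np.ones(8) / 8.0))      # uniform over M = 8: 3 bits (= log2 8)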

The entropy of letters in the English language is less than the \log_2 M bits that would be required if all letters were equally probable. This is, however, the information content obtained by considering just the probability of occurrence of individual letters. But in language our expectation of what the next letter will be is determined by what the previous letters have been. To measure this we need the concept of joint entropy. Because H(x) is the entropy of a 1-dimensional variable it is sometimes called the scalar entropy, to differentiate it from the joint entropy.


Joint Entropy

The table below shows the probability of occurrence (to three decimal places) of selected pairs of letters x_i and y_j, where x_i is followed by y_j. This is called the joint probability p(x_i, y_j). The table also shows the joint information content

h(x_i, y_j) = \log_2 \frac{1}{p(x_i, y_j)}

x_i    y_j    p(x_i, y_j)    h(x_i, y_j)
t      h
t      s
t      r

Table: Probability and information content of pairs of letters

The average joint information content is given by the joint entropy

H(x, y) = - \sum_{i=1}^{M} \sum_{j=1}^{M} p(x_i, y_j) \log p(x_i, y_j)

If we fix x to, say, x_i, then the probability of y taking on a particular value, say y_j, is given by the conditional probability

p(y = y_j | x = x_i) = \frac{p(x = x_i, y = y_j)}{p(x = x_i)}

For example, if x_i = t and y_j = h, then the joint probability p(x_i, y_j) is just the probability of occurrence of the pair, which can be read directly from the table of letter pairs. The conditional probability p(y_j | x_i), however, says: given that we have seen the letter t, what is the probability that the next letter will be h? It is obtained by dividing the pair probability by the probability of the letter t from the earlier table.

Rearranging the above relationship (and dropping the y = y_j notation) gives

p(x, y) = p(y | x) p(x)

Now if y does not depend on x then p(y | x) = p(y). Hence, for independent variables, we have

p(x, y) = p(y) p(x)

This means that for independent variables the joint entropy is the sum of the individual (or scalar) entropies

H(x, y) = H(x) + H(y)
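The Python sketch below (using a made-up 2-by-2 joint distribution, purely for illustration) computes the joint entropy, a conditional probability via the ratio above, and checks that the entropies add exactly when the joint distribution factorises.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        nz = p[p > 0]
        return -np.sum(nz * np.log2(nz))

    # Hypothetical joint distribution p(x_i, y_j): rows index x, columns index y.
    pxy = np.array([[0.30, 0.10],
                    [0.20, 0.40]])

    px = pxy.sum(axis=1)                  # marginal p(x)
    py = pxy.sum(axis=0)                  # marginal p(y)

    H_xy = entropy(pxy)                   # joint entropy H(x, y)
    H_x, H_y = entropy(px), entropy(py)

    # Conditional probability p(y = y_0 | x = x_0) = p(x_0, y_0) / p(x_0)
    print('p(y0 | x0) =', pxy[0, 0] / px[0])

    print('H(x,y) =', H_xy, ' H(x) + H(y) =', H_x + H_y)

    # For an independent (factorised) joint distribution the entropies add exactly.
    pxy_indep = np.outer(px, py)
    print(np.isclose(entropy(pxy_indep), H_x + H_y))      # True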

Consecutive letters in the English language are not independent (except either after or during a bout of serious drinking). If we take into account the statistical dependence on the previous letter, the entropy of English reduces to fewer bits per letter than the scalar-entropy figure. If we look at the statistics of not just pairs but triplets and quadruplets of letters, or at the statistics of words, then it is possible to calculate the entropy more accurately; as more and more contextual structure is taken into account, the estimates of entropy reduce. See Cover's book for more details.


Relative Entropy

The relative entropy or Kullback-Leibler divergence between a distribution q(x) and a distribution p(x) is defined as

D[q || p] = \sum_x q(x) \log \frac{q(x)}{p(x)}

Jensen's inequality^1 states that, for any concave function f(x) and any set of M positive coefficients \{\lambda_j\} which sum to one,

f\left( \sum_{j=1}^{M} \lambda_j x_j \right) \geq \sum_{j=1}^{M} \lambda_j f(x_j)
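The following one-off Python check (an illustrative sketch only) confirms the inequality numerically for the concave \log function, using randomly chosen weights that sum to one.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0.1, 10.0, size=5)           # arbitrary positive points
    lam = rng.uniform(size=5)
    lam = lam / lam.sum()                        # positive coefficients summing to one

    lhs = np.log(np.sum(lam * x))                # f(sum_j lambda_j x_j)
    rhs = np.sum(lam * np.log(x))                # sum_j lambda_j f(x_j)
    print(lhs >= rhs)                            # True: log is concave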

A sketch of a proof of this is given in Bishop. Using this inequality we can show that

D[q || p] = - \sum_x q(x) \log \frac{p(x)}{q(x)}
          \geq - \log \sum_x q(x) \frac{p(x)}{q(x)}
          = - \log \sum_x p(x)
          = - \log 1

Hence

D[q || p] \geq 0
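A minimal Python sketch of this definition (the example distributions are arbitrary), confirming numerically that the divergence is non-negative and is zero when the two distributions are identical.

    import numpy as np

    def kl_divergence(q, p):
        """D[q || p] in bits, assuming q(x) > 0 implies p(x) > 0."""
        q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
        nz = q > 0
        return np.sum(q[nz] * np.log2(q[nz] / p[nz]))

    q = np.array([0.1, 0.4, 0.5])
    p = np.array([0.3, 0.3, 0.4])
    print(kl_divergence(q, p))        # positive
    print(kl_divergence(q, q))        # zero for identical distributions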

The KL-divergence will appear again in the discussion of the EM algorithm and Variational Bayesian learning (see later lectures).

Mutual Information

The mutual information is defined as the relative entropy between the joint distribution and the product of the individual distributions

I(x, y) = D[ p(X, Y) || p(X) p(Y) ]

Substituting these distributions into the definition of relative entropy allows us to express the mutual information as the difference between the sum of the individual entropies and the joint entropy

I(x, y) = H(x) + H(y) - H(x, y)

Therefore, if x and y are independent, the mutual information is zero. More generally, I(x, y) is a measure of the dependence between variables, and this dependence will be captured whether the underlying relationship is linear or nonlinear. This is to be contrasted with Pearson's correlation coefficient, which measures only linear correlation (see the first lecture).
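To illustrate this contrast, the toy Python sketch below (its details, such as the quadratic relationship and the histogram bin count, are assumptions made for illustration) estimates I(x, y) = H(x) + H(y) - H(x, y) from a 2-D histogram of a nonlinear relationship; the Pearson correlation comes out close to zero while the mutual information is clearly positive.

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float).ravel()
        nz = p[p > 0]
        return -np.sum(nz * np.log2(nz))

    rng = np.random.default_rng(1)
    x = rng.uniform(-1.0, 1.0, size=20000)
    y = x ** 2 + 0.05 * rng.standard_normal(x.size)     # nonlinear dependence

    # Histogram-based estimate of the joint and marginal distributions.
    counts, _, _ = np.histogram2d(x, y, bins=20)
    pxy = counts / counts.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    mi = entropy(px) + entropy(py) - entropy(pxy)
    rho = np.corrcoef(x, y)[0, 1]
    print('mutual information (bits):', round(mi, 3))   # clearly greater than 0
    print('Pearson correlation     :', round(rho, 3))   # close to 0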

^1 A concave function has a negative second derivative.


Minimum Description Length

Given that a variable has a deterministic component and a random component, the complexity of that variable can be defined as the length of a concise description of that variable's regularities. This definition has the merit that both random data and highly regular data will have a low complexity, and so we have a correspondence with our everyday notion of complexity.^2

^2 This is not the case, however, with measures such as the Algorithmic Information Content (AIC) or entropy, as these will be high even for purely random data.

The length of a description can be measured by the number of binary bits required to encode it. If the probability of a set of measurements D is given by p(D | \theta), where \theta are the parameters of a probabilistic model, then the minimum length of a code for representing D is, from Shannon's coding theorem, the same as the information content of that data under the model (cf. the information content defined earlier)

L_D = - \log p(D | \theta)

However, for the receiver to decode the message they will need to know the parameters \theta which, being real numbers, are encoded by truncating each to a finite precision \Delta\theta. We need a total of -k \log \Delta\theta bits to encode the k parameters to this precision. This gives the total description length

L = - \log p(D | \theta) - k \log \Delta\theta

The optimal precision can be found as follows. First we expand the negative log-likelihood (i.e. the error) using a Taylor series about the Maximum Likelihood (ML) solution \hat{\theta}. This gives

L = - \log p(D | \hat{\theta}) + \frac{1}{2} \Delta\theta^T H \Delta\theta - k \log \Delta\theta

where \Delta\theta = \theta - \hat{\theta} is the parameter truncation error and H is the Hessian of the error, which is identical to the inverse covariance matrix (the first-order term in the Taylor series disappears because the error gradient is zero at the ML solution). Treating the precision as a scalar, the derivative is

\frac{\partial L}{\partial \Delta\theta} = H \Delta\theta - \frac{k}{\Delta\theta}

If the covariance matrix is diagonal (and therefore the Hessian is diagonal) then, for the case of linear regression, the diagonal elements are

h_i = \frac{N \sigma_{x_i}^2}{\sigma_e^2}

where \sigma_e^2 is the variance of the errors and \sigma_{x_i}^2 is the variance of the ith input. More generally (e.g. nonlinear regression) this last variance will be replaced with the variance of the derivative of the output with respect to the ith parameter; but the dependence on N remains. Setting the above derivative to zero therefore gives us

\Delta\theta^2 = \frac{\mathrm{constant}}{N}

where the constant depends on the variance terms. When we come to take logs of this, the constant becomes an additive term that does not scale with either the number of data points or the number of parameters in the model; we can therefore ignore it. The Minimum Description Length (MDL) is therefore given by

MDL(k) = - \log p(D | \hat{\theta}) + \frac{k}{2} \log N

This may be minimised over the number of parameters k to get the optimal model complexity.

For a linear regression model

- \log p(D | \hat{\theta}) = \frac{N}{2} \log \sigma_e^2

Therefore

MDL_{Linear}(k) = \frac{N}{2} \log \sigma_e^2 + \frac{k}{2} \log N

which is seen to consist of an accuracy term and a complexity term. This criterion can be used to select the optimal number of input variables and therefore offers a solution to the bias-variance dilemma (see the earlier lecture). In later lectures the MDL criterion will be used in autoregressive and wavelet models.
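As a sketch of how MDL_{Linear}(k) can be used for model order selection (the data-generating model, noise level and variable names below are assumptions made for illustration), the Python example fits polynomial regressions of increasing order, estimates the error variance at the ML solution, and selects the order with the smallest description length.

    import numpy as np

    rng = np.random.default_rng(2)
    N = 200
    x = rng.uniform(-1.0, 1.0, size=N)
    y = 1.0 + 2.0 * x - 1.5 * x ** 2 + 0.2 * rng.standard_normal(N)   # true order 2

    def mdl_linear(k, sigma2_e, N):
        """MDL(k) = (N/2) log sigma_e^2 + (k/2) log N."""
        return 0.5 * N * np.log(sigma2_e) + 0.5 * k * np.log(N)

    scores = {}
    for order in range(0, 6):
        # Design matrix with k = order + 1 parameters (including the constant term).
        X = np.vander(x, order + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)      # maximum likelihood weights
        resid = y - X @ w
        sigma2_e = np.mean(resid ** 2)                 # ML estimate of error variance
        scores[order] = mdl_linear(X.shape[1], sigma2_e, N)

    best = min(scores, key=scores.get)
    print({k: round(v, 1) for k, v in scores.items()})
    print('selected polynomial order:', best)          # complexity term penalises overfitting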

The MDL complexity measure can be further refined by integrating out the dependence on \theta altogether. The resulting measure is known as the stochastic complexity

I(k) = - \log p(D | k)

where

p(D | k) = \int p(D | \theta, k) p(\theta) d\theta

In Bayesian statistics this quantity is known as the marginal likelihood or evidence. The stochastic complexity measure is thus equivalent (after taking negative logs) to the Bayesian model order selection criterion (see later lectures). See Bishop for a further discussion of this relationship.
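To connect the two measures, the toy Python sketch below (a Gaussian-mean model with an assumed N(0, 1) prior; none of these specifics come from the notes) approximates the evidence p(D | k) by numerical integration over the single parameter and compares -\log p(D | k) with the two-part value -\log p(D | \hat{\theta}) + (k/2) \log N.

    import numpy as np

    rng = np.random.default_rng(3)
    N = 50
    sigma = 1.0                                    # known noise standard deviation
    D = rng.normal(loc=0.7, scale=sigma, size=N)   # data with unknown mean theta

    def log_lik(theta):
        """log p(D | theta) for a Gaussian likelihood with known variance."""
        return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                      - 0.5 * (D[:, None] - theta) ** 2 / sigma ** 2, axis=0)

    # Evidence p(D | k): integrate p(D | theta) p(theta) over theta on a grid.
    theta = np.linspace(-5.0, 5.0, 2001)
    dtheta = theta[1] - theta[0]
    prior = np.exp(-0.5 * theta ** 2) / np.sqrt(2 * np.pi)   # assumed N(0, 1) prior
    log_l = log_lik(theta)
    m = log_l.max()                                          # guard against underflow
    evidence = np.sum(np.exp(log_l - m) * prior) * dtheta * np.exp(m)

    theta_ml = D.mean()
    stochastic_complexity = -np.log(evidence)
    mdl_two_part = -log_lik(np.array([theta_ml]))[0] + 0.5 * 1 * np.log(N)

    print('stochastic complexity -log p(D|k):', round(stochastic_complexity, 2))
    print('two-part MDL approximation       :', round(mdl_two_part, 2))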