
HYPERPARAMETERS: OPTIMIZE, OR INTEGRATE OUT?

David J.C. MacKay

Cavendish Laboratory,

Cambridge, CB3 0HE. United Kingdom.

[email protected]

ABSTRACT. I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants. In the `evidence framework' the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized hyperparameters are used to define a Gaussian approximation to the posterior distribution. In the alternative `MAP' method, the true posterior is found by integrating over the hyperparameters. The true posterior is then maximized over the model parameters, and a Gaussian approximation is made. The similarities of the two approaches, and their relative merits, are discussed, and comparisons are made with the ideal hierarchical Bayesian solution.

In moderately ill-posed problems, integration over hyperparameters yields a probability distribution with a skew peak which causes significant biases to arise in the MAP method. In contrast, the evidence framework is shown to introduce negligible predictive error, under straightforward conditions.

General lessons are drawn concerning the distinctive properties of inference in many dimensions.

"Integrating over a nuisance parameter is very much like estimating the parameter from the data, and then using that estimate in our equations." G. L. Bretthorst

"This integration would be counter-productive as far as practical manipulation is concerned." S. F. Gull

1 Outline

In ill-posed problems, a Bayesian model H commonly takes the form:

P(D, w, \alpha, \beta \mid H) = P(D \mid w, \beta, H)\, P(w \mid \alpha, H)\, P(\alpha, \beta \mid H),   (1)

where D is the data, w is the parameter vector, β defines a noise variance σ_ν² = 1/β, and α is a regularization constant. In a regression problem, for example, D might be a set of data points, {x, t}, and the vector w might parameterize a function f(x; w). The model H states that for some w, the dependent variables {t} are given by adding noise to {f(x; w)}; the likelihood function P(D | w, β, H) describes the assumed noise process, parameterized by a noise level σ_ν² = 1/β; the prior P(w | α, H) embodies assumptions about the spatial correlations and smoothness that the true function is expected to have, parameterized by a regularization constant α. The variables α and β are known as hyperparameters. Problems for which models can be written in the form (1) include linear interpolation with a fixed basis set (Gull 1988; MacKay 1992a), non-linear regression with a neural network (MacKay 1992b), non-linear classification (MacKay 1992c), and image deconvolution (Gull 1989).

In the simplest case (linear models, Gaussian noise), the first factor in (1), the likelihood, can be written in terms of a quadratic function of w, E_D(w):

P(D \mid w, \beta, H) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D(w)).   (2)

What makes the problem `ill-posed' is that the hessian ∇∇E_D is ill-conditioned: some of its eigenvalues are very small, so that the maximum likelihood parameters depend undesirably on the noise in the data. The model is `regularized' by the second factor in (1), the prior, which in the simplest case is a spherical Gaussian:

P(w \mid \alpha, H) = \frac{1}{Z_W(\alpha)} \exp\left(-\frac{\alpha}{2} w^{T} w\right).   (3)

The regularization constant α defines the variance σ_w² = 1/α of the prior for the components w_i of w.

Much interest has centred on the question of how the constants α and β (or the ratio α/β) should be set, and Gull (1989) has derived an appealing Bayesian prescription for these constants (see also MacKay (1992a) for a review). This `evidence framework' integrates over the parameters w to give the `evidence' P(D | α, β, H). The evidence is then maximized over the regularization constant α and noise level β. A Gaussian approximation is then made with the hyperparameters fixed to their optimized values. This relates closely to the `generalized maximum likelihood' method in Wahba (1975). This method can be applied to non-linear models by making appropriate local linearizations, and has been used successfully in image reconstruction (Gull 1989; Weir 1991) and in neural networks (MacKay 1992b; Thodberg 1993; MacKay 1994).

Recently an alternative procedure for computing inferences under the same Bayesian model has been suggested by Buntine and Weigend (1991), Strauss et al. (1993) and Wolpert (1993). In this approach, one integrates over the regularization constant α first to obtain the `true prior', and over the noise level to obtain the `true likelihood'; then maximizes the `true posterior' over the parameters w. A Gaussian approximation is then made around this true probability density maximum. I will call this the `MAP' method (for maximum a posteriori); this use of the term `MAP' may not coincide precisely with its general usage.

The purpose of this paper is to examine the choice between these two Gaussian approximations, both of which might be used to approximate predictive inference. It is assumed that it is predictive distributions that are of interest, rather than point estimates. Estimation will only appear as a computational stepping stone in the process of approximating a predictive distribution. I concentrate on the simplest case of the linear model with Gaussian noise, but the insights obtained are expected to apply to more general non-linear models and to models with multiple hyperparameters. When a non-linear model has multiple local optima, one can approximate the posterior by a sum of Gaussians, one fitted at each optimum. There is then an analogous choice between either (a) optimizing α separately at each local optimum in w, and using a Gaussian approximation conditioned on α (MacKay 1992b); or (b) fitting Gaussians to local maxima of the true posterior with the hyperparameter α integrated out.


2 The Alternative Methods

Given the Bayesian model defined in (1), we might be interested in the following inferences.

Problem A: Infer the parameters, i.e., obtain a compact representation of P(w | D, H) and the marginal distributions P(w_i | D, H).

Problem B: Infer the relative model plausibility, which requires the `evidence' P(D | H).

Problem C: Make predictions, i.e., obtain some representation of P(D_2 | D, H), where D_2, in the simplest case, is a single new datum.

Let us assume for simplicity that the noise level β is known precisely, so that only the regularization constant α is respectively optimized or integrated over. Comments about α can apply equally well to β.

The Ideal Approach

Ideally, if we were able to do all the necessary integrals, we would just generate the probability distributions P(w | D, H), P(D | H), and P(D_2 | D, H) by direct integration over everything that we are not concerned with. The pioneering work of Box and Tiao (1973) used this approach to develop Bayesian statistics.

For real problems of interest, however, such exact integration methods are seldom available. A partial solution can still be obtained by using Monte Carlo methods to simulate the full posterior distribution (see Neal (1993b) for an excellent review). Thus one can obtain (problem A) a set of samples {w} which represent the posterior P(w | D, H), and (problem C) a set of samples {D_2} which represent the predictive distribution P(D_2 | D, H). Unfortunately, the evaluation of the evidence P(D | H) with Monte Carlo methods (problem B) is a difficult undertaking. Recent developments (Neal 1993a; Skilling 1993) now make it possible to use gradient and curvature information so as to sample high dimensional spaces more effectively, even for highly non-Gaussian distributions. Let us come down from these clouds, however, and turn attention to the two deterministic approximations under study.

The Evidence Framework

The evidence framework divides our inferences into distinct `levels of inference':

Level 1: Infer the parameters w for a given value of α:

P(w \mid D, \alpha, H) = \frac{P(D \mid w, \alpha, H)\, P(w \mid \alpha, H)}{P(D \mid \alpha, H)}.   (4)

Level 2: Infer α:

P(\alpha \mid D, H) = \frac{P(D \mid \alpha, H)\, P(\alpha \mid H)}{P(D \mid H)}.   (5)

Level 3: Compare models:

P(H \mid D) \propto P(D \mid H)\, P(H).   (6)

There is a pattern in these three applications of Bayes' rule: at each of the higher levels 2 and 3, the data-dependent factor (e.g., in level 2, P(D | α, H)) is the normalizing constant (the `evidence') from the preceding level of inference.

The inference problems listed at the beginning of this section are solved approximately using the following procedure.


• The level 1 inference is approximated by making a quadratic expansion, around a maximum of P(w | D, α, H), of log[P(D | w, β, H) P(w | α, H)]; this expansion defines a Gaussian approximation to the posterior. The evidence P(D | α, H) is estimated by evaluating the appropriate determinant. For linear models the Gaussian approximation is exact.

• By maximizing the evidence P(D | α, H) at level 2, we find the most probable value of the regularization constant, α_MP, and error bars on it, σ_{log α|D}. Because α is a positive scale variable, it is natural to represent its uncertainty on a log scale.

• The value of α_MP is substituted at level 1. This defines a probability distribution P(w | D, α_MP, H) which is intended as a `good approximation' to the posterior P(w | D, H). The solution offered for problem A is a Gaussian distribution around the maximum of this distribution, w_{MP|α_MP}, with covariance matrix Σ defined by Σ^{-1} = -∇∇ log P(w | D, α_MP, H). Marginals for the components of w are easily obtained from this distribution.

• The evidence for model H (problem B) is estimated using:

P(D \mid H) \simeq P(D \mid \alpha_{MP}, H)\, P(\log\alpha_{MP} \mid H)\, \sqrt{2\pi}\, \sigma_{\log\alpha|D}.   (7)

• Problem C: The predictive distribution P(D_2 | D, H) is approximated by using the posterior distribution with α = α_MP:

P(D_2 \mid D, \alpha_{MP}, H) = \int d^k w\; P(D_2 \mid w, H)\, P(w \mid D, \alpha_{MP}, H).   (8)

For a locally linear model with Gaussian noise, both the distributions inside the integral are Gaussian, and this integral is straightforward to perform.

As reviewed in MacKay (1992a), the most probable value of α satisfies a simple and intuitive implicit equation,

\frac{1}{\alpha_{MP}} = \frac{\sum_i w_{MP|\alpha_{MP},i}^2}{\gamma},   (9)

where w_{MP|α_MP,i} are the components of the vector w_{MP|α_MP} and γ is the number of well-determined parameters, which can be expressed in terms of the eigenvalues λ_a of the matrix ∇∇E_D(w):

\gamma = k - \alpha\, \mathrm{Trace}\, \Sigma = \sum_{a=1}^{k} \frac{\lambda_a}{\lambda_a + \alpha}.   (10)

This quantity is a number between 0 and k. Recalling that 1/α can be interpreted as the variance σ_w² of the distribution from which the parameters w_i come, we see that equation (9) corresponds to an intuitive prescription for a variance estimator. The idea is that we are estimating the variance of the distribution of w_i from only γ well-determined parameters, the other k − γ having been set roughly to zero by the regularizer and therefore not contributing to the sum in the numerator.

In principle, there may be multiple optima in α, but this is not the typical case for a model well matched to the data. Under general conditions, the error bars on log α are σ_{log α|D} ≈ √(2/γ) (MacKay 1992a) (see section 5). Thus log α is well-determined by the data if γ ≫ 1.

The central computation can be summarised thus:


Evidence approximation: find the self-consistent solution {w_{MP|α_MP}, α_MP} such that w_{MP|α_MP} maximizes P(w | D, α_MP, H) and α_MP satisfies equation (9).
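As a concrete illustration of this self-consistent computation (my addition, not part of the paper), here is a minimal numerical sketch. It assumes a linear-Gaussian model rotated into the eigenbasis of the likelihood Hessian, with unit noise variance, so that d_a = √λ_a w_a + ν, the setup used later in section 5; the eigenvalue spectrum and data below are made-up illustrative values.

```python
import numpy as np

def evidence_alpha(d, lam, alpha0=1.0, tol=1e-10, max_iter=1000):
    """Find the self-consistent (w_MP|alpha, alpha_MP) of equations (9)/(10)
    for a diagonalized linear-Gaussian model with unit noise variance:
    d[a] = sqrt(lam[a]) * w[a] + nu,  nu ~ Normal(0, 1)."""
    alpha = alpha0
    for _ in range(max_iter):
        w = np.sqrt(lam) * d / (lam + alpha)      # conditional maximum w_MP|alpha
        gamma = np.sum(lam / (lam + alpha))       # well-determined parameters, equation (10)
        alpha_new = gamma / np.sum(w ** 2)        # re-estimate alpha from equation (9)
        if abs(alpha_new - alpha) < tol * alpha:
            alpha = alpha_new
            break
        alpha = alpha_new
    w = np.sqrt(lam) * d / (lam + alpha)
    return alpha, w, np.sum(lam / (lam + alpha))

# Illustrative ill-posed problem: eigenvalues spanning several orders of magnitude.
rng = np.random.default_rng(0)
lam = 10.0 ** np.linspace(3, -3, 20)              # hypothetical eigenvalue spectrum
alpha_true = 1.0
w_true = rng.normal(0, 1 / np.sqrt(alpha_true), lam.size)
d = np.sqrt(lam) * w_true + rng.normal(0, 1, lam.size)

alpha_mp, w_mp, gamma = evidence_alpha(d, lam)
print(f"alpha_MP = {alpha_mp:.3f}, gamma = {gamma:.1f} of k = {lam.size}")
print(f"error bar on log alpha ~ sqrt(2/gamma) = {np.sqrt(2/gamma):.2f}")
```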

Justification for the Evidence Approximation. The central approximation in this scheme can be stated as follows: when we integrate out a parameter, the effect for most purposes is to estimate the parameter from the data, and then constrain the parameter to that value (Box and Tiao 1973; Bretthorst 1988). When we predict an observable D_2, the predictive distribution is dominated by the value α = α_MP. In symbols,

P(D_2 \mid D, H) = \int P(D_2 \mid D, \alpha, H)\, P(\log\alpha \mid D, H)\, d\log\alpha \;\simeq\; P(D_2 \mid D, \alpha_{MP}, H).

This approximation is accurate as long as P(D_2 | D, α, H) is insensitive to changes in log α on a scale of σ_{log α|D}, so that the distribution P(log α | D, H) is effectively a delta function. This is a well-established idea.

A similar equivalence of two probability distributions arises in statistical thermodynamics. The `canonical ensemble' over all states r of a system,

P(r) = \exp(-\beta E_r)/Z,   (11)

describes equilibrium with a heat bath at temperature 1/β. Although the energy of the system is not fixed, the probability distribution of the energy is usually sharply peaked about the mean energy Ē. The corresponding `microcanonical ensemble' describes the system when it is isolated and has fixed energy:

P(r) = \begin{cases} 1/N & \text{if } E_r \in [\bar{E} - \delta E/2,\; \bar{E} + \delta E/2] \\ 0 & \text{otherwise,} \end{cases}   (12)

where N is the number of states in that energy interval. Under these two distributions, a particular microstate r may have numerical probabilities that are completely different. For example, the most probable microstate under the canonical ensemble is always the ground state, for any temperature 1/β ≥ 0; whereas its probability under the microcanonical ensemble is zero. But it is well known (Reif 1965) that for most macroscopic purposes, if the system has a large number of degrees of freedom, the two distributions are indistinguishable, because most of the probability mass of the canonical ensemble is concentrated in the states in a small interval around Ē.

The same reasoning justifies the evidence approximation for ill-posed problems, with particular values of w corresponding to microstates. If the number of well-determined parameters γ is large, then α, like the energy above, is well-determined. This does not imply that the two densities P(w | D, H) and P(w | D, α_MP, H) are numerically close in value, but we have no interest in the probability of the high dimensional vector w. For practical purposes, we only care about distributions of low-dimensional quantities (e.g., an individual parameter w_i or a new datum); what matters, and what is asserted here, is that when we project the distributions down in order to predict low-dimensional quantities, the approximating distribution P(w | D, α_MP, H) puts most of its probability mass in the right place. A more precise discussion of this approximation is given in section 5.


The MAP Method

The alternative procedure is first to integrate out α to obtain the true prior:

P(w \mid H) = \int d\alpha\; P(w \mid \alpha, H)\, P(\alpha \mid H).   (13)

We can then write down the true posterior directly (except for its normalizing constant):

P(w \mid D, H) \propto P(D \mid w, H)\, P(w \mid H).   (14)

This posterior can be maximized to find the MAP parameters, w_MP. How does this relate to the desired inferences listed at the head of this section? Not all authors describe how they intend the true posterior to be used in practical problems (e.g., Wolpert 1993); here I describe a method based on the suggestions of Buntine and Weigend (1991).

Problem A: The posterior distribution P(w | D, H) is approximated by a Gaussian distribution, fitted around the most probable parameters, w_MP; to find the Hessian of the posterior, one needs the Hessian of the prior, derived below. A simple evaluation of the factors on the right hand side of (14) is not a satisfactory solution of problem A, since (a) the normalizing constant is missing; (b) even if the r.h.s. of (14) were normalized, the ability to evaluate the local value of this density would be of little use as a summary of the distribution in the high-dimensional space; how, for example, is one to obtain marginal distributions over w_i from (14)?

Problem B: An estimate of the evidence is obtained from the determinant of the covariance matrix of this Gaussian distribution.

Problem C: The parameters w_MP with error bars are used to generate predictions as in (8).

A simple example will illustrate that this approach actually gives results qualitatively very similar to the evidence framework. If we apply the improper prior P_{Imp}(log α) = const and evaluate the true prior, we obtain¹:

P_{Imp}(w \mid H) = \int_{\alpha=0}^{\infty} d\log\alpha\; \frac{e^{-\alpha \sum_{i=1}^{k} w_i^2 / 2}}{Z_W(\alpha)} \;\propto\; \frac{1}{\left(\sum_i w_i^2\right)^{k/2}}.   (15)

The derivative of the true log prior with respect to w is −kw/Σ_i w_i². This `weight decay' term can be directly viewed in terms of an `effective α',

\frac{1}{\alpha_{eff}(w)} = \frac{\sum_i w_i^2}{k}.   (16)

Any maximum of the true posterior P(w | D, H) is therefore also a maximum of the conditional posterior P(w | D, α, H), with α set to α_eff. The similarity of equation (16) to equation (9) of the evidence framework is clear. We can therefore describe the MAP method thus:

MAP method (improper prior): find the self-consistent solution {w_MP, α_eff} such that w_MP maximizes P(w | D, α_eff, H) and α_eff satisfies equation (16).

This procedure is suggested in MacKay (1992b) as a `quick and dirty' approximation to the evidence framework.

¹ If a uniform prior over α (from 0 to ∞) is used instead of a prior over log α, then the resulting exponent changes from k/2 to (k/2 + 1).


The Effective α and the Curvature of a General Prior

We have just established that, when the improper prior (15) is used, the MAP solution lies on the `alpha trajectory' (the graph of w_{MP|α}) for a particular value of α = α_eff. This result still holds when a proper prior over α is used to define the true prior (13). The effective α(w), found by differentiation of log P(w | H), is:

\alpha_{eff}(w) = \int d\alpha\; \alpha\, P(\alpha \mid w, H).   (17)

In general there may be multiple local probability maxima, all of which lie on the alpha trajectory. In summary, optima w_MP found by the MAP method can be described thus:

MAP method (proper prior): find the self-consistent solution {w_MP, α_eff} such that w_MP maximizes P(w | D, α_eff, H) and α_eff satisfies equation (17).

The curvature of the true prior is needed for evaluation of the error bars on w in the MAP method. By direct differentiation of the true prior (13), we find:

-\nabla\nabla \log P(w \mid H) = \alpha_{eff}(w)\, I - \sigma^2_{\alpha|w}(w)\, w w^{T},   (18)

where α_eff(w) is defined in (17), and the effective variance of α is:

\sigma^2_{\alpha|w}(w) = \int d\alpha\; \alpha^2\, P(\alpha \mid w, H) - \left(\int d\alpha\; \alpha\, P(\alpha \mid w, H)\right)^{2}.   (19)

This is an intuitive result: if α were fixed to α_eff, then the curvature would just be the first term in (18), α_eff I. The fact that α is uncertain depletes the curvature in the radial direction ŵ = w/|w|.

3 Pros and Cons

The algorithms for finding the evidence framework's w_{MP|α_MP} and the MAP method's w_MP have been seen to be very similar. Is there any real distinction to be drawn between these two approaches?

The MAP method has the advantage that it involves no approximations until after we have found the MAP parameters w_MP; in contrast, the evidence framework approximates an integral over α.

In the MAP method the integrals over α and β need only be performed once and can then be used repeatedly for different data sets; in the evidence framework, each new data set has to receive individual attention, with a sequence of Gaussian integrations being performed each time α and β are optimized.

So why not always integrate out hyperparameters whenever possible? Let us answer this question by magnifying the systematic differences between the two approaches. With sufficient magnification it will become evident to the intuition that the approximation of the evidence framework is superior to the MAP approximation. The distinction between w_{MP|α_MP} and w_MP is similar to that between the two estimators of standard deviation σ on a calculator, σ_N and σ_{N−1}, the former being the biased maximum likelihood estimator, whereas the latter is unbiased. The true posterior distribution has a skew peak, so that the MAP parameters are not representative of the whole posterior distribution. This is best illustrated by an example.


The Widget Example

A collection of widgets i = 1..k have a property called `wibble', w_i, which we measure, widget by widget, in noisy experiments with a known noise level σ_ν = 1.0. Our model for these quantities is that they come from a Gaussian prior P(w_i | α, H), where α = 1/σ_w² is not known. Our prior for this variance is flat over log σ_w from σ_w = 0.1 to σ_w = 10.

Scenario 1. Suppose four widgets have been measured and give the following data: {d_1, d_2, d_3, d_4} = {3.2, −3.2, 2.8, −2.8}. The task is (problem A) to infer the wibbles of these four widgets, i.e. to produce a representative w with error bars. On the back of an envelope, or in a computer algebra system, we find the following answers using equations (9) and (16)/(17):

Evidence framework: α_MP = 0.124, w_{MP|α_MP} = {2.8, −2.8, 2.5, −2.5}; each with error bars ±0.9.

MAP method: α_eff = 0.145, w_MP = {2.8, −2.8, 2.4, −2.4}; each with error bars ±0.9.

These answers are insensitive to the details of the prior over σ_w.

So far so good: w_{MP|α_MP} is slightly less regularized than w_MP, but there is not much disagreement when all the parameters are well-determined.
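These numbers are straightforward to reproduce numerically; here is a minimal sketch (mine, not taken from the paper). For direct noisy measurements of each wibble the conditional optimum is w_i = d_i/(1+α) and γ = k/(1+α), so the evidence condition (9) and the improper-prior MAP condition (16) each reduce to a scalar fixed-point equation in α.

```python
import numpy as np

d = np.array([3.2, -3.2, 2.8, -2.8])    # scenario 1 data, noise level sigma_nu = 1
k = d.size

def fixed_point(update, alpha=1.0, iters=500):
    """Iterate alpha <- update(alpha) until it settles."""
    for _ in range(iters):
        alpha = update(alpha)
    return alpha

# Evidence framework, equation (9): 1/alpha = sum(w^2)/gamma, with w = d/(1+alpha), gamma = k/(1+alpha).
alpha_ev = fixed_point(lambda a: (k / (1 + a)) / np.sum((d / (1 + a))**2))
# MAP with the improper prior, equation (16): 1/alpha = sum(w^2)/k.
alpha_map = fixed_point(lambda a: k / np.sum((d / (1 + a))**2))

for name, a in [("evidence", alpha_ev), ("MAP     ", alpha_map)]:
    w = d / (1 + a)
    print(f"{name}: alpha = {a:.3f}, w = {np.round(w, 2)}, error bars ~ {1/np.sqrt(1 + a):.1f}")
# prints alpha close to 0.124 (evidence) and 0.145 (MAP), as quoted above.
```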

Scenario 2. Suppose in addition to the four measurements above we are now informed that there are an additional four unmeasured widgets in a box next door. Thus we now have both well-determined and ill-determined parameters, as in an ill-posed problem. Intuitively, we would like our inferences about the well-measured widgets to be negligibly affected by this vacuous information about the unmeasured widgets, just as the true Bayesian predictive distributions are unaffected. But clearly with k = 8, the difference between k and γ in equations (9) and (16) is going to become significant. The value of α_eff will be substantially greater than that of α_MP.

In the evidence framework the value of α_MP is exactly the same, since each of the ill-determined parameters has λ = 0 and adds nothing to the sum in (10). So the value of α_MP and the predictive distributions are unchanged.

In contrast, the MAP parameter vector w_MP is squashed close to zero. The precise value of w_MP is sensitive to the prior over α. Solving equation (17) in a computer algebra system, we find: α_eff = 79.2, w_MP = {0.040, −0.040, 0.035, −0.035, 0, 0, 0, 0}, with marginal error bars on all eight parameters σ_{w|D} = 0.11.

Thus the MAP Gaussian approximation is terribly biased towards zero. The final disaster of this approach is that the error bars on the parameters are also correspondingly small.
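To reproduce the scenario 2 numbers, equation (17) has to be evaluated under the proper prior (flat in log σ_w between 0.1 and 10, i.e. α between 0.01 and 100). Below is a sketch, again mine rather than the paper's, using a simple quadrature on a log-α grid; the self-consistency condition turns out to have more than one solution here, so the code keeps the one with the larger true posterior density.

```python
import numpy as np

d = np.array([3.2, -3.2, 2.8, -2.8, 0.0, 0.0, 0.0, 0.0])   # four measured widgets, four unmeasured
measured = np.array([True] * 4 + [False] * 4)
k = d.size

# Proper prior: flat over log(sigma_w) for sigma_w in [0.1, 10], i.e. alpha = 1/sigma_w^2 in [0.01, 100].
log_alpha = np.linspace(np.log(1e-2), np.log(1e2), 4001)
alpha_grid = np.exp(log_alpha)
dla = log_alpha[1] - log_alpha[0]

def alpha_eff(w):
    """Equation (17): alpha_eff(w) = E[alpha | w, H] under the proper prior."""
    lp = 0.5 * k * log_alpha - 0.5 * alpha_grid * np.sum(w**2)   # log P(w|alpha) + const
    p = np.exp(lp - lp.max())
    return np.sum(alpha_grid * p) / np.sum(p)

def log_true_posterior(w):
    """Unnormalized log P(w|D,H) = log P(D|w,H) + log of the true prior (13)."""
    log_like = -0.5 * np.sum((w[measured] - d[measured])**2)
    lp = 0.5 * k * (log_alpha - np.log(2 * np.pi)) - 0.5 * alpha_grid * np.sum(w**2)
    m = lp.max()
    return log_like + m + np.log(np.sum(np.exp(lp - m)) * dla)

solutions = []
for alpha in (0.01, 100.0):                    # start from both ends of the alpha range
    for _ in range(300):
        w = np.where(measured, d / (1 + alpha), 0.0)
        alpha = alpha_eff(w)
    solutions.append((log_true_posterior(w), alpha, w))
_, alpha, w = max(solutions, key=lambda s: s[0])   # the MAP solution has the larger density

print(f"alpha_eff = {alpha:.1f}")                          # about 79
print("w_MP      =", np.round(w, 3))                       # measured wibbles squashed to about +/-0.04
print(f"error bars ~ {1 / np.sqrt(1 + alpha):.2f}")        # about 0.11
```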

This is not a contrived example. It contains the basic feature of ill-posed problems: that there are both well-determined and poorly-determined parameters. To aid comprehension, the two sets of parameters are separated. This example can be transformed into a typical ill-posed problem simply by rotating the basis to mix the parameters together. In neural networks, a pair of scenarios identical to those discussed above can arise if there are a large number of poorly determined parameters which have been set to zero by the regularizer, and we consider `pruning' these parameters. In scenario 1, the network is pruned, removing the ill-determined parameters. In scenario 2, the parameters are retained, and assume their most probable value, zero. In each case, what is the optimal setting of the weight decay rate α (assuming the traditional regularizer α wᵀw/2)? We would expect the answer to be unchanged. Yet the MAP method effectively sets α to a much larger value in the second scenario.

The MAP method may locate the true posterior maximum, but it fails to capture most of the true probability mass.

4 Inference in Many Dimensions

In many dimensions, therefore, new intuitions are needed.

Nearly all of the volume of a k-dimensional hypersphere is in a thin shell near its surface. For example, in 1000 dimensions, 90% of a hypersphere of radius 1.0 is within a depth of 0.0023 of its surface. A central core of the hypersphere, with radius 0.5, contains less than 1/10^{300} of the volume.

This has an important effect on high-dimensional probability distributions. Consider a Gaussian distribution P(w) = \left(1/\sqrt{2\pi}\sigma_w\right)^{k} \exp\left(-\sum_{1}^{k} w_i^2 / 2\sigma_w^2\right). Nearly all of the probability mass of a Gaussian is in a thin shell of radius r = \sqrt{k}\,\sigma_w and of thickness \propto r/\sqrt{k}. For example, in 1000 dimensions, 90% of the mass of a Gaussian with σ_w = 1 is in a shell of radius 31.6 and thickness 2.8. However, the probability density at the origin is e^{k/2} ≈ 10^{217} times bigger than the density at this shell where most of the probability mass is.

Consider two Gaussian densities in 1000 dimensions which differ in σ_w by 1%, and which contain equal total probability mass. In each case 90% of the mass is located in a shell which differs in radius by only 1% between the two distributions. The maximum probability density, however, is greater at the centre of the Gaussian with smaller σ_w, by a factor of ≈ exp(0.01 k) ≈ 20,000.
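These numbers are easy to check numerically; the following sketch (mine, not the paper's) uses the fact that the squared radius of a k-dimensional unit-variance Gaussian sample is chi-squared distributed with k degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 1000

# Hypersphere: the fraction of volume within depth t of the surface is 1 - (1 - t)^k.
print(1 - (1 - 0.0023)**k)            # about 0.90
print(0.5**k)                         # central core of radius 0.5: about 1e-301 of the volume

# Gaussian with sigma_w = 1: the squared radius is chi-squared with k degrees of freedom.
radii = np.sqrt(rng.chisquare(df=k, size=200000))
print(np.percentile(radii, [5, 95]))  # 90% of the mass between radii of roughly 30.4 and 32.8

# Density at the origin versus density on the shell at radius sqrt(k): a factor exp(k/2).
print(k / 2 / np.log(10))             # about 217 orders of magnitude

# Two Gaussians whose sigma_w differ by 1%: ratio of peak densities is (1.01)^k.
print(1.01**k)                        # about 2e4
```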

In summary, probability density maxima often have very little associated probability mass, even though the value of the probability density there may be immense, because they have so little associated volume. If a distribution is composed of a mixture of Gaussians with different σ_w, the probability density maxima are strongly dominated by smaller values of σ_w. This is why the MAP method finds a silly solution in the widget example.

Thus the locations of probability density maxima in many dimensions are generally misleading and irrelevant. Probability densities should only be maximized if there is good reason to believe that the location of the maximum conveys useful information about the whole distribution, e.g., if the distribution is approximately Gaussian.

Condition Satisfied by Typical Samples

The conditions (9) and (16), satisfied by the optima (α_MP, w_{MP|α_MP}) and (α_eff, w_MP) respectively, are complemented by an additional result concerning typical samples from posterior distributions conditioned on α. The maximum w_{MP|α} of a Gaussian distribution is not typical of the posterior: the maximum has an atypically small value of wᵀw, because, as discussed above, nearly all of the mass of a Gaussian is in a shell at some distance surrounding the maximum.

Consider samples w from the Gaussian posterior distribution with α fixed to α_MP, P(w | D, α_MP, H). The average value of wᵀw = Σ_i w_i² for these samples satisfies:

\alpha_{MP} = \frac{k}{\left\langle \sum_i w_i^2 \right\rangle_{P(w \mid D, \alpha_{MP})}}.   (20)

Proof: The deviation Δw ≡ w − w_{MP|α_MP} is Gaussian distributed with ⟨Δw Δwᵀ⟩ = Σ. So ⟨Σ_i w_i²⟩_{P(w|D,α_MP)} = ⟨(w_{MP|α_MP} + Δw)ᵀ(w_{MP|α_MP} + Δw)⟩ = w_{MP|α_MP}ᵀ w_{MP|α_MP} + Trace Σ = γ/α_MP + (k − γ)/α_MP = k/α_MP, using equations (9) and (10).

Thus a typical sample from the evidence approximation prefers just the same value of α as does the evidence.
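A quick Monte Carlo sanity check of this result (my addition): draw samples from the Gaussian posterior P(w | D, α_MP, H) of the diagonal model and confirm that the mean of Σ_i w_i² is k/α_MP, so that plugging a typical sample into the estimator (16) returns α_MP itself, whereas plugging in the maximum does not.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 50
lam = 10.0 ** rng.uniform(-3, 3, k)            # hypothetical eigenvalues over several decades
alpha_true = 2.0
w_true = rng.normal(0, 1 / np.sqrt(alpha_true), k)
d = np.sqrt(lam) * w_true + rng.normal(0, 1, k)

# Self-consistent alpha_MP of equations (9)/(10).
alpha = 1.0
for _ in range(500):
    w_mp = np.sqrt(lam) * d / (lam + alpha)
    alpha = np.sum(lam / (lam + alpha)) / np.sum(w_mp**2)
w_mp = np.sqrt(lam) * d / (lam + alpha)

# Samples from P(w | D, alpha_MP, H): mean w_MP|alpha, variance 1/(lam_a + alpha_MP) per component.
samples = w_mp + rng.normal(0, 1, (100000, k)) / np.sqrt(lam + alpha)
mean_wtw = np.mean(np.sum(samples**2, axis=1))

print(f"alpha_MP                   = {alpha:.4f}")
print(f"k / <w.w>                  = {k / mean_wtw:.4f}")         # equation (20): matches alpha_MP
print(f"k / (w_MP|alpha.w_MP|alpha) = {k / np.sum(w_mp**2):.4f}")  # the maximum itself prefers a larger alpha
```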

5 Conditions for the Evidence Approximation

We have observed that the MAP method can lead to absurdly biased answers if there are many ill-determined parameters. In contrast, I now discuss conditions under which the evidence approximation works. I discuss the case of linear models with Gaussian probability distributions. This includes the case of image reconstruction problems that have separable Gaussian distributions in the Fourier domain.

What do we care about when we approximate a complex probability distribution by a simple one? My definition of a good approximation is a practical one, concerned with (A) estimating parameters; (B) estimating the evidence accurately; and (C) getting the predictive mass in the right place. Estimation of individual parameters (A) is a special case of prediction (C), so in the following I will address only (C) and (B).

For convenience let us work in the eigenvector basis where the prior (given α) and the likelihood are both diagonal Gaussian functions. The curvature of the log likelihood is represented by eigenvalues {λ_a}. For a typical ill-posed problem these eigenvalues cover several orders of magnitude in value. Without loss of generality let us assume k data measurements {d_a}, such that d_a = √λ_a w_a + ν, where the noise standard deviation is σ_ν = 1. We define the probability distribution of everything by the product of the distributions:

P(\log\alpha \mid H) = \frac{1}{\log(\alpha_{max}/\alpha_{min})}, \quad P(w \mid \alpha, H) = \left(\frac{\alpha}{2\pi}\right)^{k/2} \exp\left(-\frac{\alpha}{2}\sum_{a=1}^{k} w_a^2\right), \quad \text{and}

P(D \mid w, H) = (2\pi)^{-k/2} \exp\left(-\frac{1}{2}\sum_{a=1}^{k} \left(\sqrt{\lambda_a}\, w_a - d_a\right)^2\right).

In the case of a deconvolution problem the eigenvectors are the Fourier set and the point spread function in Fourier space is given by √λ_a.

The discussion proceeds in two steps. First, the posterior distribution over α must have a single sharp peak at α_MP. No general guarantee can be given for this to be the case, but various pointers are given. Second, given a sharp Gaussian posterior over log α, it is proved that the evidence approximation introduces negligible error.

Concentration of P(log α | D, H) in a Single Maximum

Condition 1 In the posterior distribution over log α, all the probability mass should be contained in a single sharp maximum.

For this to hold, several sub-conditions are needed. If there is any doubt whether these conditions are sufficient, it is straightforward to iterate all the way down the α trajectory, explicitly evaluating P(log α | D, H).
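For the diagonal model defined above, each d_a is Normal(0, λ_a/α + 1) given α, so P(log α | D, H) can be tabulated directly and inspected for multi-modality. A sketch of such a check (added here for illustration; the grid and test data are made up):

```python
import numpy as np

def log_evidence(log_alpha, d, lam):
    """log P(D | alpha, H) for the diagonal model: d_a ~ Normal(0, lam_a/alpha + 1)."""
    alpha = np.exp(log_alpha)
    var = lam / alpha + 1.0
    return -0.5 * np.sum(np.log(2 * np.pi * var) + d**2 / var)

rng = np.random.default_rng(2)
k = 40
lam = 10.0 ** np.linspace(4, -4, k)
alpha_true = 1.0
d = np.sqrt(lam) * rng.normal(0, 1 / np.sqrt(alpha_true), k) + rng.normal(0, 1, k)

grid = np.linspace(np.log(1e-4), np.log(1e4), 801)       # iterate down the alpha trajectory
log_post = np.array([log_evidence(la, d, lam) for la in grid])   # plus a flat log-prior, up to a constant
log_post -= log_post.max()

peaks = [g for g, lo, c, hi in zip(grid[1:-1], log_post[:-2], log_post[1:-1], log_post[2:])
         if c > lo and c > hi]
print("local maxima of P(log alpha | D, H) at alpha =", np.round(np.exp(peaks), 3))
```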

The prior over α must be such that the posterior has negligible mass at log α → ∞. In cases where the signal to noise ratio of the data is very low, there may be a significant tail in the evidence for large α. There may even be no maximum in the evidence, in which case the evidence framework gives singular behaviour, with α going to infinity. But often the tails of the evidence are small, and contain negligible mass if our prior over log α has cutoffs at some α_min and α_max surrounding α_MP. For each data analysis problem, one may evaluate the critical α_max above which the posterior is measurably affected by the large α tail of the evidence (Gull 1989). Often, as Gull points out, this critical value of α_max has bizarre magnitude.

Even if a flat prior between appropriate α_min and α_max is used, it is possible in principle for the posterior P(log α | D, H) to be multi-modal. However this is not expected when the model space is well matched to the data. Examples of multi-modality only arise if the data are grossly at variance with the likelihood and the prior. For example, if some large eigenvalue measurements give small d_{a_l}, and some measurements with small eigenvalue give large d_{a_s}, then the posterior over α can have two peaks, one at large α which nicely explains d_{a_l}, but must attribute d_{a_s} to unusually large amounts of noise, and one at small α which nicely explains d_{a_s}, but must attribute d_{a_l} to w_{a_l} being unexpectedly close to zero. I now suggest a way of formalizing this concept into a quantitative test.

If we accept the model, then we believe that there is a true value of α = α_T, and that, given α_T, each data measurement d_a is the sum of two independent Gaussian variables √λ_a w_a and ν_a, with variances σ²_{a|α_T} ≡ λ_a/α_T and 1 respectively, so that P(d_a | α_T, H) = Normal(0, σ²_{a|α_T} + 1). The expectation of d_a² is ⟨d_a²⟩ = σ²_{a|α_T} + 1. We therefore expect that there is an α such that the quantities {d_a²/(σ²_{a|α} + 1)} are independently distributed like χ² with one degree of freedom.

Definition 1 A data set {d_a} is grossly at variance with the model for a given value of α if any of the quantities χ²_a ≡ d_a²/(σ²_{a|α} + 1) falls outside an interval [χ²_min, χ²_max] determined by the significance level η of this test.
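A sketch of how this test might be coded (my own rendering, not the paper's): the bounds are taken as the equal-tail quantiles of a chi-squared distribution with one degree of freedom at a chosen significance level, which is one reasonable choice for the interval left unspecified here.

```python
import numpy as np
from scipy import stats

def grossly_at_variance(d, lam, alpha, significance=0.01):
    """Check whether any chi^2_a = d_a^2 / (lam_a/alpha + 1) is implausible under
    a chi-squared distribution with one degree of freedom."""
    chi2 = d**2 / (lam / alpha + 1.0)
    # Per-measurement bounds at the given significance level (a choice, not the paper's exact interval).
    chi2_lo = stats.chi2.ppf(significance / 2, df=1)
    chi2_hi = stats.chi2.ppf(1 - significance / 2, df=1)
    return bool(np.any((chi2 < chi2_lo) | (chi2 > chi2_hi)))
```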

It is conjectured that if we find a value of α = α_MP which locally maximizes the evidence, and with which the data are not grossly at variance, then there are no other maxima over α.

Conversely, if the data are grossly at variance with a local maximum α_MP, then there may be multiple maxima in α, and the evidence approximation may be inaccurate. In these circumstances one might also suspect that the entire model is inadequate in some way.

Assuming that P(log α | D, H) has a single maximum over log α, how sharp is it expected to be? I now establish conditions under which P(log α | D, H) is locally Gaussian and sharp.

Definition 2 The symbol n_e is defined by:

n_e \equiv \sum_a \frac{4\,\lambda_a\, \alpha_{MP}}{(\lambda_a + \alpha_{MP})^2}.   (21)

This is a measure of the number of eigenvalues λ_a within approximately an e-fold of α_MP.
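In code, n_e and γ are one-liners; for an eigenvalue spectrum spread over many decades, only the eigenvalues near α_MP contribute appreciably to n_e, whereas every eigenvalue well above α_MP contributes fully to γ (an illustrative check of my own, not from the paper):

```python
import numpy as np

def gamma_and_ne(lam, alpha):
    gamma = np.sum(lam / (lam + alpha))                 # equation (10)
    n_e = np.sum(4 * lam * alpha / (lam + alpha)**2)    # equation (21)
    return gamma, n_e

lam = 10.0 ** np.linspace(5, -5, 30)    # hypothetical eigenvalues spanning ten decades
print(gamma_and_ne(lam, alpha=1.0))     # roughly (15, 5): n_e counts just the few eigenvalues near alpha
```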

In the following, I will assume that n_e ≪ γ, but this condition is not essential for the evidence approximation to be valid. If n_e ≪ γ, and the data are not grossly at variance with α_MP, then we find on Taylor-expanding log P(α | D, H) about α = α_MP that the second derivative is large, and that the third derivative is relatively small:

\left.\frac{\partial \log P(D \mid \alpha, H)}{\partial \log\alpha}\right|_{\alpha_{MP}} = \frac{1}{2}\left(\gamma - \alpha_{MP} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2\right) = 0

\left.\frac{\partial^2 \log P(D \mid \alpha, H)}{\partial (\log\alpha)^2}\right|_{\alpha_{MP}} \simeq -\frac{\alpha_{MP}}{2} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 = -\frac{\gamma}{2}

\left.\frac{\partial^3 \log P(D \mid \alpha, H)}{\partial (\log\alpha)^3}\right|_{\alpha_{MP}} \simeq -\frac{\alpha_{MP}}{2} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 = -\frac{\gamma}{2}.

The first derivative is exact, assuming that the eigenvalues λ_a are independent of α, which is true in the case of a Gaussian prior on w (Bryan 1990). The second and third derivatives are approximate, with terms proportional to n_e being omitted. The third derivative is relatively small even though it is equal to the second derivative, since in the expansion P(l) = \exp\left(-\frac{c}{2} l^2 + \frac{d}{6} l^3 + \ldots\right), the second term gives a negligible perturbation for l \lesssim c^{-1/2} if d \ll c^{3/2}. In this case, since d = c = γ/2 ≫ 1, the perturbation introduced by the higher order terms is O(γ^{-1/2}). Thus the posterior distribution over log α has a maximum that is both locally Gaussian and sharp if γ ≫ 1 and n_e ≪ γ. The expression for the evidence (7) follows.

Error of Low-dimensional Predictive Distributions

I will now assume that the posterior distribution P(log α | D, H) is Gaussian with standard deviation σ_{log α|D} = 1/√τ, with τ ≫ 1 and n_e/τ = O(1).

Theorem 1 Consider a scalar which depends linearly on w, y = g · w. The evidence approximation's predictive distribution for y is close to the exact predictive distribution, for nearly all projections g. In the case g = w_{MP|α_MP}, the error measured by a cross-entropy is of order n_e/τ. For all g perpendicular to this direction, the error is of order 1/τ.

A similar result is expected still to hold when the dimensionality of y is greater than one, provided that it is much less than √τ.

Proof: At `level 1', we infer w for a fixed value of α:

P(w \mid D, \alpha, H) \propto \exp\left\{ -\frac{1}{2} \sum_a (\lambda_a + \alpha) \left( w_a - \frac{\sqrt{\lambda_a}\, d_a}{\lambda_a + \alpha} \right)^{2} \right\}.   (22)

The most probable w given this value of α is: w_a^{MP|\alpha} = \sqrt{\lambda_a}\, d_a / (\lambda_a + \alpha). The posterior distribution is Gaussian about this most probable w. We introduce a typical w, that is, a sample from the posterior for a particular value of α:

w_a^{TYP|\alpha} = \frac{\sqrt{\lambda_a}\, d_a}{\lambda_a + \alpha} + \frac{r_a}{\sqrt{\lambda_a + \alpha}},   (23)

where r_a is a sample from Normal(0,1).

Now, assuming that log α has a Gaussian posterior distribution with standard deviation 1/√τ, a typical α, i.e., a sample from this posterior, is given to leading order by

\alpha_{TYP} = \alpha_{MP}\left( 1 + \frac{s}{\sqrt{\tau}} \right),   (24)

where s is a sample from Normal(0,1). We now substitute this α_TYP into (23) and obtain a typical w from the true posterior distribution, which depends on k + 1 random variables {r_a}, s. We expand each component of this vector w^TYP in powers of 1/√τ:

w_a^{TYP} = \frac{\sqrt{\lambda_a}\, d_a}{\lambda_a + \alpha_{MP}} \left(1 - \frac{s\,\alpha_{MP}}{\sqrt{\tau}\,(\lambda_a + \alpha_{MP})} + \frac{s^2\,\alpha_{MP}^2}{\tau\,(\lambda_a + \alpha_{MP})^2} + \ldots\right) + \frac{r_a}{\sqrt{\lambda_a + \alpha_{MP}}} \left(1 - \frac{1}{2}\frac{s\,\alpha_{MP}}{\sqrt{\tau}\,(\lambda_a + \alpha_{MP})} + \frac{3}{8}\frac{s^2\,\alpha_{MP}^2}{\tau\,(\lambda_a + \alpha_{MP})^2} - \ldots\right).   (25)

We now examine the mean and variance of y^{TYP} = \sum_a g_a w_a^{TYP}. Setting \langle r_a^2 \rangle = \langle s^2 \rangle = 1 and dropping terms of higher order than 1/τ, we find that whereas the evidence approximation gives a Gaussian predictive distribution for y which has mean and variance:

\mu_0 = \sum_a g_a w_a^{MP|\alpha_{MP}}, \qquad \sigma_0^2 = \sum_a \frac{g_a^2}{\lambda_a + \alpha_{MP}},

the true predictive distribution is, to order 1/τ, Gaussian with mean and variance:

\mu_1 = \mu_0 + \frac{1}{\tau}\sum_a g_a w_a^{MP|\alpha_{MP}} \frac{\alpha_{MP}^2}{(\lambda_a + \alpha_{MP})^2},

\sigma_1^2 = \sigma_0^2 + \frac{1}{\tau}\left\{ \left(\sum_a g_a w_a^{MP|\alpha_{MP}} \frac{\alpha_{MP}}{\lambda_a + \alpha_{MP}}\right)^{2} + \sum_a \frac{g_a^2\, \alpha_{MP}^2}{(\lambda_a + \alpha_{MP})^3} \right\}.
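The order-1/τ corrections above can be checked by brute force (a sketch I added, not part of the proof): sample log α from its assumed Gaussian posterior, sample w given α, and compare the mean and variance of y = g·w with μ_0, σ_0² and with the corrected μ_1, σ_1², taking g along the worst-case radial direction. Here α_MP and τ = γ/2 are fixed by hand, so the check tests only the algebra of the expansion.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 200
lam = 10.0 ** np.linspace(3, -3, k)                       # hypothetical eigenvalue spectrum
d = np.sqrt(lam) * rng.normal(0, 1, k) + rng.normal(0, 1, k)
alpha_mp = 1.0                                            # centre of the assumed posterior over log alpha
gamma = np.sum(lam / (lam + alpha_mp))
tau = gamma / 2.0                                         # so sigma_log_alpha|D = sqrt(2/gamma)

w_mp = np.sqrt(lam) * d / (lam + alpha_mp)
g = w_mp                                                  # worst-case (radial) direction

mu0 = g @ w_mp
var0 = np.sum(g**2 / (lam + alpha_mp))
mu1 = mu0 + np.sum(g * w_mp * alpha_mp**2 / (lam + alpha_mp)**2) / tau
var1 = var0 + (np.sum(g * w_mp * alpha_mp / (lam + alpha_mp))**2
               + np.sum(g**2 * alpha_mp**2 / (lam + alpha_mp)**3)) / tau

# Brute force: log alpha ~ Normal(log alpha_MP, 1/tau); then y = g.w with w | alpha Gaussian.
n = 40000
alpha = alpha_mp * np.exp(rng.normal(0, 1 / np.sqrt(tau), n))
denom = lam + alpha[:, None]                              # shape (n, k)
y_mean = (g * np.sqrt(lam) * d / denom).sum(axis=1)       # E[y | alpha]
y_sd = np.sqrt((g**2 / denom).sum(axis=1))                # sd[y | alpha]
y = y_mean + y_sd * rng.normal(0, 1, n)

print("evidence approx:", round(mu0, 3), round(var0, 3))
print("order-1/tau    :", round(mu1, 3), round(var1, 3))
print("Monte Carlo    :", round(y.mean(), 3), round(y.var(), 3))
```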

How wrong can the evidence approximation be? Since both distributions are Gaussian, it is simple to evaluate various distances between them. The cross entropy between p_0 = Normal(μ_0, σ_0²) and p_1 = Normal(μ_1, σ_1²) is

H(p_0; p_1) \equiv \int p_1 \log\frac{p_1}{p_0} = \frac{1}{2}\left\{ \frac{(\mu_1 - \mu_0)^2}{\sigma_0^2} + \frac{1}{2}\left(\frac{\sigma_1^2 - \sigma_0^2}{\sigma_0^2}\right)^{2} + O\left(\left[\frac{\sigma_1^2 - \sigma_0^2}{\sigma_0^2}\right]^{3}\right) \right\}.

We consider the two dominant terms separately.

\frac{(\mu_1 - \mu_0)^2}{2\sigma_0^2} = \frac{1}{2\tau^2} \left(\sum_a h_a w_a^{MP|\alpha_{MP}} \frac{\alpha_{MP}^2}{(\lambda_a + \alpha_{MP})^{3/2}}\right)^{2} \Big/ \sum_a h_a^2,   (26)

where h_a ≡ g_a/√(λ_a + α_MP). The worst case is given by the direction g such that h_a ∝ w_a^{MP|α_MP} α_MP²/(λ_a + α_MP)^{3/2}. This worst case gives an upper bound to the contribution to the cross entropy:

\frac{(\mu_1 - \mu_0)^2}{2\sigma_0^2} \;\le\; \frac{1}{2\tau^2} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 \frac{\alpha_{MP}^4}{(\lambda_a + \alpha_{MP})^3}   (27)

\;<\; \frac{\alpha_{MP}}{2\tau^2} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 = \frac{\gamma}{2\tau^2} \ll 1.   (28)

So the change in μ never has a significant effect.

The variance term can be split into two terms:

\frac{\sigma_1^2 - \sigma_0^2}{\sigma_0^2} = \frac{1}{\tau} \left\{ \left(\sum_a h_a w_a^{MP|\alpha_{MP}} \frac{\alpha_{MP}}{\sqrt{\lambda_a + \alpha_{MP}}}\right)^{2} + \sum_a h_a^2 \frac{\alpha_{MP}^2}{(\lambda_a + \alpha_{MP})^2} \right\} \Big/ \sum_a h_a^2,

where, as above, h_a = g_a/√(λ_a + α_MP).

For the first term, the worst case is the direction h_a ∝ w_a^{MP|α_MP} α_MP/√(λ_a + α_MP), i.e., the radial direction g = w_{MP|α_MP}. Substituting in this direction, we find:

\text{First term} \;\le\; \frac{1}{\tau} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 \frac{\alpha_{MP}^2}{\lambda_a + \alpha_{MP}}   (29)

\;<\; \frac{\alpha_{MP}}{\tau} \sum_a \left(w_a^{MP|\alpha_{MP}}\right)^2 = \frac{\gamma}{\tau} = O(1).   (30)

We can improve this bound by substituting for w_a^{MP|α_MP} in terms of d_a and making use of the definition of n_e. Only n_e of the terms in the sum in equation (29) are significant. Thus

\text{First term} \;\lesssim\; \frac{n_e}{\tau}.   (31)

So this term can give a significant effect, but only in one direction; for any direction orthogonal in h to this radial direction, this term is zero.

Finally, we examine the second term:

\frac{1}{\tau} \sum_a h_a^2 \frac{\alpha_{MP}^2}{(\lambda_a + \alpha_{MP})^2} \Big/ \sum_a h_a^2 \;<\; \frac{1}{\tau} \ll 1.   (32)

So this term never has a significant effect.

Conclusion. The evidence approximation affects the mean and variance of properties y of w, but only to within O(τ^{-1/2}) of the property's standard deviation; this error is insignificant for large τ. The sole exception is the direction g = w_{MP|α_MP}, along which the variance is erroneously small, with a cross-entropy error of order O(n_e/τ).

A Correction Term

This result motivates a straightforward term which could be added to the inverse Hessian of the evidence approximation, to correct the predictive variance in this direction. The predictive variance for a general y = gᵀw could be estimated by

\sigma_y^2 = g^{T}\left(\Sigma + \sigma^2_{\log\alpha|D}\; w'_{MP|\alpha}\, {w'}_{MP|\alpha}^{T}\right) g,   (33)

where w'_{MP|α} ≡ ∂w_{MP|α}/∂ log α = −α Σ w_{MP|α}, and σ²_{log α|D} = 2/γ. With this correction, the predictive distribution for any direction g would be in error only by order O(1/τ). If the noise variance σ_ν² = 1/β is also uncertain, then the factor σ²_{log α|D} is incremented by σ²_{log β|D} ≈ 2/N.
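In the diagonal basis this correction is a rank-one update, sketched below (my code, using the diagonal model of this section with unit noise): the derivative of the alpha trajectory is ∂w_{MP|α,a}/∂ log α = −α w_{MP|α,a}/(λ_a + α), and σ²_{log α|D} is taken as 2/γ.

```python
import numpy as np

def predictive_var(g, d, lam, alpha_mp):
    """Equation (33): evidence-approximation variance of y = g.w plus the rank-one
    correction for the uncertainty in log alpha (diagonal model, unit noise)."""
    Sigma_diag = 1.0 / (lam + alpha_mp)                    # posterior covariance at alpha_MP
    w_mp = np.sqrt(lam) * d * Sigma_diag                   # w_MP|alpha
    w_prime = -alpha_mp * Sigma_diag * w_mp                # d w_MP|alpha / d log alpha
    gamma = np.sum(lam / (lam + alpha_mp))
    var_log_alpha = 2.0 / gamma                            # sigma^2_{log alpha | D}
    return g @ (Sigma_diag * g) + var_log_alpha * (g @ w_prime) ** 2

# Example: the worst-case direction g = w_MP|alpha picks up the largest correction.
rng = np.random.default_rng(4)
lam = 10.0 ** np.linspace(3, -3, 50)
d = np.sqrt(lam) * rng.normal(0, 1, 50) + rng.normal(0, 1, 50)
alpha_mp = 1.0
w_mp = np.sqrt(lam) * d / (lam + alpha_mp)
print(predictive_var(w_mp, d, lam, alpha_mp))              # corrected variance
print(np.sum(w_mp**2 / (lam + alpha_mp)))                  # uncorrected variance, for comparison
```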


[Figure 1 shows three panels: the true posterior, the MAP approximation, and the evidence approximation, with w_MP marked.]

Figure 1: Approximating complicated distributions with a Gaussian. This is a schematic illustration of the properties of a multi-dimensional distribution. A typical posterior distribution for an ill-posed problem has a skew peak. A Gaussian fitted at the MAP parameters is a bad approximation to the distribution: it is in the wrong place, and its error bars are far too small. Additional features of the true posterior distribution not illustrated here are that it typically has spikes of high probability density at the origin w = 0 and at the maximum likelihood parameters w = w_ML. The evidence approximation gives a Gaussian distribution which captures most of the probability mass of the true posterior.

6 Discussion

The MAP method, though exact, is capable of giving MAP parameters which are unrepresentative of the true posterior. In high dimensional spaces, maxima are misleading. MAP estimates play no fundamental role in Bayesian inference, and they can change arbitrarily with arbitrary re-parameterizations. The problem with MAP estimates is that they maximize the probability density, without taking account of the complementary volume information. Figure 1 attempts, in one dimension, to convey this difference between the two Gaussian approximations.

When there are many ill-determined parameters, the MAP method's integration over α yields a w_MP which is over-regularized.²

² Integration over the noise level 1/β to give the true likelihood leads to a bias in the other direction. These two biases may cancel: the evidence framework's w_{MP|α_MP,β_MP} coincides with w_MP if the number of well-determined parameters happens to obey the condition γ/k = N/(N + k).

There are two general take-home messages.

(1) When one has a choice of which variables to integrate over and which to maximize over, one should integrate over as many variables as possible, in order to capture the relevant volume information. There are typically far fewer regularization constants and other hyperparameters than there are `level 1' parameters.

(2) If practical Bayesian methods involve approximations such as fitting a Gaussian to a posterior distribution, then one should think twice before integrating out hyperparameters (Gull 1988). The probability density which results from such an integration typically has a skew peak; a Gaussian fitted at the peak may not approximate the distribution well. In contrast, optimization of the hyperparameters can give a Gaussian approximation which, for predictive purposes, puts most of the probability mass in the right place.

The evidence approximation, which sets hyperparameters so as to maximize the evidence, is not intended to produce an accurate numerical approximation to the true posterior distribution over w; and it does not. But what matters is whether low-dimensional properties of w (i.e., predictions) are seriously mis-calculated as a result of the evidence approximation.

The main conditions for the evidence approximation are that the data should not be grossly at variance with the likelihood and the prior, and that the number of well-determined parameters γ should be large. How large depends on the problem, but often a value as small as γ ≈ 3 is sufficient, because this means that α is determined to within a factor of e (recall σ_{log α|D} ≈ √(2/γ)); predictive distributions are often insensitive to changes of α of this magnitude. Thus the approximation is usually good if we have enough data to determine a few parameters.

If satisfactory conditions do not hold for the evidence approximation (e.g., if γ is too small), then it should be emphasized that this would not then motivate integrating out α first. The MAP approximation is systematically inferior to the evidence approximation. It would probably be most convenient numerically to retain α as an explicit variable, and integrate it out last (Bryan 1990).

A final point in favour of the evidence framework is that it can be naturally extended (at least approximately) to more elaborate priors such as mixture models; it would be difficult to integrate over the mixture hyperparameters in order to evaluate the `true prior' in these cases.

Acknowledgments

I thank Radford Neal, David R. T. Robinson, Steve Gull, and Martin Oldfield for helpful discussions, and John Skilling for invaluable contributions to the proof in section 5. I am grateful to Anton Garrett for comments on the manuscript.

References

Box, G. E. P., and Tiao, G. C. (1973) Bayesian inference in statistical analysis. Addison-Wesley.

Bretthorst, G. (1988) Bayesian spectrum analysis and parameter estimation. Springer.

Bryan, R. (1990) Solving oversampled data problems by Maximum Entropy. In Maximum Entropy and Bayesian Methods, Dartmouth, U.S.A., 1989, ed. by P. Fougere, pp. 221-232. Kluwer.

Buntine, W., and Weigend, A. (1991) Bayesian back-propagation. Complex Systems 5: 603-643.

Gull, S. F. (1988) Bayesian inductive inference and maximum entropy. In Maximum Entropy and Bayesian Methods in Science and Engineering, vol. 1: Foundations, ed. by G. Erickson and C. Smith, pp. 53-74, Dordrecht. Kluwer.

Gull, S. F. (1989) Developments in maximum entropy data analysis. In Maximum Entropy and Bayesian Methods, Cambridge 1988, ed. by J. Skilling, pp. 53-71, Dordrecht. Kluwer.

MacKay, D. J. C. (1992a) Bayesian interpolation. Neural Computation 4 (3): 415-447.

MacKay, D. J. C. (1992b) A practical Bayesian framework for backpropagation networks. Neural Computation 4 (3): 448-472.

MacKay, D. J. C. (1992c) The evidence framework applied to classification networks. Neural Computation 4 (5): 698-714.

MacKay, D. J. C. (1994) Bayesian non-linear modelling for the 1993 energy prediction competition. In Maximum Entropy and Bayesian Methods, Santa Barbara 1993, ed. by G. Heidbreder, Dordrecht. Kluwer.

Neal, R. M. (1993a) Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5, ed. by C. L. Giles, S. J. Hanson, and J. D. Cowan, pp. 475-482, San Mateo, California. Morgan Kaufmann.

Neal, R. M. (1993b) Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Dept. of Computer Science, University of Toronto.

Reif, F. (1965) Fundamentals of statistical and thermal physics. McGraw-Hill.

Skilling, J. (1993) Bayesian numerical analysis. In Physics and Probability, ed. by W. T. Grandy, Jr. and P. Milonni, Cambridge. C.U.P.

Strauss, C. E. M., Wolpert, D. H., and Wolf, D. R. (1993) Alpha, evidence, and the entropic prior. In Maximum Entropy and Bayesian Methods, Paris 1992, ed. by A. Mohammed-Djafari, Dordrecht. Kluwer.

Thodberg, H. H. (1993) Ace of Bayes: application of neural networks with pruning. Technical Report 1132 E, Danish Meat Research Institute.

Wahba, G. (1975) A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. Numer. Math. 24: 383-393.

Weir, N. (1991) Applications of maximum entropy techniques to HST data. In Proceedings of the ESO/ST-ECF Data Analysis Workshop, April 1991.

Wolpert, D. H. (1993) On the use of evidence in neural networks. In Advances in Neural Information Processing Systems 5, ed. by C. L. Giles, S. J. Hanson, and J. D. Cowan, pp. 539-546, San Mateo, California. Morgan Kaufmann.