HYPERPARAMETERS: OPTIMIZE, OR INTEGRATE OUT?
David J.C. MacKay
Cavendish Laboratory,
Cambridge CB3 0HE, United Kingdom.
ABSTRACT. I examine two approximate methods for computational implementation of Bayesian hierarchical models, that is, models which include unknown hyperparameters such as regularization constants. In the `evidence framework' the model parameters are integrated over, and the resulting evidence is maximized over the hyperparameters. The optimized hyperparameters are used to define a Gaussian approximation to the posterior distribution. In the alternative `MAP' method, the true posterior probability is found by integrating over the hyperparameters. The true posterior is then maximized over the model parameters, and a Gaussian approximation is made. The similarities of the two approaches, and their relative merits, are discussed, and comparisons are made with the ideal hierarchical Bayesian solution.

In moderately ill-posed problems, integration over hyperparameters yields a probability distribution with a skew peak which causes significant biases to arise in the MAP method. In contrast, the evidence framework is shown to introduce negligible predictive error, under straightforward conditions.

General lessons are drawn concerning the distinctive properties of inference in many dimensions.
"Integrating over a nuisance parameter is very much like estimating the parameter from the data, and then using that estimate in our equations."  (G. L. Bretthorst)

"This integration would be counter-productive as far as practical manipulation is concerned."  (S. F. Gull)
1 Outline
In ill-posed problems, a Bayesian model H commonly takes the form:

    P(D, w, α, β | H) = P(D | w, β, H) P(w | α, H) P(α, β | H),                    (1)

where D is the data, w is the parameter vector, β defines a noise variance σ_ν² = 1/β, and α is a regularization constant. In a regression problem, for example, D might be a set of data points {x, t}, and the vector w might parameterize a function f(x; w). The model H states that for some w, the dependent variables {t} are given by adding noise to {f(x; w)}; the likelihood function P(D | w, β, H) describes the assumed noise process, parameterized by a noise level σ_ν² = 1/β; the prior P(w | α, H) embodies assumptions about the spatial correlations and smoothness that the true function is expected to have, parameterized by a regularization constant α. The variables α and β are known as hyperparameters. Problems for which models can be written in the form (1) include linear interpolation with a fixed basis set
(Gull 1988; MacKay 1992a), non-linear regression with a neural network (MacKay 1992b), non-linear classification (MacKay 1992c), and image deconvolution (Gull 1989).
In the simplest case (linear models, Gaussian noise), the first factor in (1), the likelihood, can be written in terms of a quadratic function of w, E_D(w):

    P(D | w, β, H) = (1/Z_D) exp(−β E_D(w)).                                       (2)

What makes the problem `ill-posed' is that the Hessian ∇∇E_D is ill-conditioned: some of its eigenvalues are very small, so that the maximum likelihood parameters depend undesirably on the noise in the data. The model is `regularized' by the second factor in (1), the prior, which in the simplest case is a spherical Gaussian:

    P(w | α, H) = (1/Z_W) exp(−(α/2) w^T w).                                       (3)

The regularization constant α defines the variance σ_w² = 1/α of the prior for the components w_i of w.
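To make the ill-conditioning concrete, here is a minimal numerical sketch (not taken from the paper) of a linear-in-the-parameters regression with a fixed polynomial basis; the basis, data size, noise level, and variable names are illustrative assumptions only:

    import numpy as np

    # Toy linear-in-the-parameters regression t = Phi w + noise  (illustrative values).
    rng = np.random.default_rng(0)
    N, k = 20, 10                              # number of data points, number of parameters
    x = np.linspace(0.0, 1.0, N)
    Phi = np.vander(x, k, increasing=True)     # fixed polynomial basis (poorly conditioned)
    beta = 100.0                               # known noise precision, sigma_nu^2 = 1/beta
    w_true = rng.standard_normal(k)
    t = Phi @ w_true + rng.standard_normal(N) / np.sqrt(beta)

    # E_D(w) = (1/2) sum_n (t_n - f(x_n; w))^2, so its Hessian is Phi^T Phi.
    eigs = np.linalg.eigvalsh(Phi.T @ Phi)
    print("eigenvalue spread of the Hessian of E_D:", eigs.min(), eigs.max())

    # Regularized solution (posterior mode) for a given alpha:
    alpha = 1.0
    A = beta * Phi.T @ Phi + alpha * np.eye(k)   # Hessian of M(w) = beta E_D + alpha E_W
    w_mp = np.linalg.solve(A, beta * Phi.T @ t)

The tiny eigenvalues of the Hessian of E_D are what make the unregularized solution unstable; adding α I is what the prior (3) contributes.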
Much interest has centred on the question of how the constants α and β (or the ratio α/β) should be set, and Gull (1989) has derived an appealing Bayesian prescription for these constants (see also MacKay 1992a for a review). This `evidence framework' integrates over the parameters w to give the `evidence' P(D | α, β, H). The evidence is then maximized over the regularization constant α and noise level β. A Gaussian approximation is then made with the hyperparameters fixed to their optimized values. This relates closely to the `generalized maximum likelihood' method in statistics (Wahba 1975). This method can be applied to non-linear models by making appropriate local linearizations, and has been used successfully in image reconstruction (Gull 1989; Weir 1991) and in neural networks (MacKay 1992b; Thodberg 1993; MacKay 1994).
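As a concrete illustration (continuing the toy model sketched above, with the same numpy import, Phi, t and beta; not code from the cited works, and with illustrative function names), the evidence P(D | α, β, H) has a closed form in the linear-Gaussian case, so the maximization over α can be done by a simple one-dimensional search over log α:

    def log_evidence(alpha, beta, Phi, t):
        """log P(D | alpha, beta, H) for the linear-Gaussian model (exact in that case)."""
        N, k = Phi.shape
        A = beta * Phi.T @ Phi + alpha * np.eye(k)     # Hessian of M(w) = beta E_D + alpha E_W
        w_mp = np.linalg.solve(A, beta * Phi.T @ t)    # most probable w for this alpha
        E_D = 0.5 * np.sum((t - Phi @ w_mp) ** 2)
        E_W = 0.5 * w_mp @ w_mp
        _, logdetA = np.linalg.slogdet(A)
        return (-beta * E_D - alpha * E_W - 0.5 * logdetA
                + 0.5 * k * np.log(alpha) + 0.5 * N * np.log(beta)
                - 0.5 * N * np.log(2.0 * np.pi))

    # Level-2 optimization: alpha is a positive scale variable, so search a log-spaced grid.
    alphas = np.logspace(-4, 4, 200)
    log_ev = [log_evidence(a, beta, Phi, t) for a in alphas]
    alpha_mp = alphas[int(np.argmax(log_ev))]

In practice the maximization is usually done by a fixed-point re-estimation of α rather than a grid, but the grid makes the logic of the framework explicit.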
Recently an alternative procedure for computing inferences under the same Bayesian model has been suggested by Buntine and Weigend (1991), Strauss et al. (1993) and Wolpert (1993). In this approach, one integrates over the regularization constant α first to obtain the `true prior', and over the noise level β to obtain the `true likelihood'; one then maximizes the `true posterior' over the parameters w. A Gaussian approximation is then made around this true probability density maximum. I will call this the `MAP' method (for maximum a posteriori); this use of the term `MAP' may not coincide precisely with its general usage.
The purpose of this paper is to examine the choice between these two Gaussian approximations, both of which might be used to approximate predictive inference. It is assumed that it is predictive distributions that are of interest, rather than point estimates. Estimation will only appear as a computational stepping stone in the process of approximating a predictive distribution. I concentrate on the simplest case of the linear model with Gaussian noise, but the insights obtained are expected to apply to more general non-linear models and to models with multiple hyperparameters. When a non-linear model has multiple local optima, one can approximate the posterior by a sum of Gaussians, one fitted at each optimum. There is then an analogous choice between either (a) optimizing α separately at each local optimum in w, and using a Gaussian approximation conditioned on α (MacKay 1992b); or (b) fitting Gaussians to local maxima of the true posterior with the hyperparameter α integrated out.
2 The Alternative Methods
Given the Bayesian model defined in (1), we might be interested in the following inferences.

Problem A: Infer the parameters, i.e., obtain a compact representation of P(w | D, H) and the marginal distributions P(w_i | D, H).

Problem B: Infer the relative model plausibility, which requires the `evidence' P(D | H).

Problem C: Make predictions, i.e., obtain some representation of P(D_2 | D, H), where D_2, in the simplest case, is a single new datum.

Let us assume for simplicity that the noise level β is known precisely, so that only the regularization constant α is respectively optimized or integrated over. Comments about α can apply equally well to β.
The Ideal Approach

Ideally, if we were able to do all the necessary integrals, we would just generate the probability distributions P(w | D, H), P(D | H), and P(D_2 | D, H) by direct integration over everything that we are not concerned with. The pioneering work of Box and Tiao (1973) used this approach to develop Bayesian robust statistics.

For real problems of interest, however, such exact integration methods are seldom available. A partial solution can still be obtained by using Monte Carlo methods to simulate the full probability distribution (see Neal 1993b for an excellent review). Thus one can obtain (problem A) a set of samples {w} which represent the posterior P(w | D, H), and (problem C) a set of samples {D_2} which represent the predictive distribution P(D_2 | D, H). Unfortunately, the evaluation of the evidence P(D | H) with Monte Carlo methods (problem B) is a difficult undertaking. Recent developments (Neal 1993a; Skilling 1993) now make it possible to use gradient and curvature information so as to sample high-dimensional spaces more effectively, even for highly non-Gaussian distributions. Let us come down from these clouds, however, and turn attention to the two deterministic approximations under study.
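For the toy linear-Gaussian model used in the sketches above, the posterior-sampling step is simple enough to illustrate directly. The following minimal Gibbs sampler is an illustration of this sketch only (it is not one of the gradient-based samplers cited above), and the Gamma(a0, b0) prior on α is an assumed stand-in for P(α | H):

    import numpy as np

    def gibbs_sampler(Phi, t, beta, n_samples=5000, a0=1e-3, b0=1e-3, seed=1):
        """Draw samples (w, alpha) from the joint posterior of the toy linear model."""
        rng = np.random.default_rng(seed)
        N, k = Phi.shape
        alpha = 1.0
        ws, alphas = [], []
        for _ in range(n_samples):
            # w | alpha, D, H  is Gaussian with precision A and mean beta A^{-1} Phi^T t.
            A = beta * Phi.T @ Phi + alpha * np.eye(k)
            L = np.linalg.cholesky(A)
            mean = np.linalg.solve(A, beta * Phi.T @ t)
            w = mean + np.linalg.solve(L.T, rng.standard_normal(k))
            # alpha | w, H  is Gamma(a0 + k/2, rate b0 + w.w/2); numpy uses a scale parameter.
            alpha = rng.gamma(a0 + 0.5 * k, 1.0 / (b0 + 0.5 * w @ w))
            ws.append(w)
            alphas.append(alpha)
        return np.array(ws), np.array(alphas)

    ws, alphas = gibbs_sampler(Phi, t, beta)   # Phi, t, beta as in the earlier sketch

Predictive samples for problem C are then obtained by evaluating f(x; w) at the new input for each sampled w and adding noise of variance 1/β.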
The Evidence Framework

The evidence framework divides our inferences into distinct `levels of inference':

Level 1: Infer the parameters w for a given value of α:

    P(w | D, α, H) = P(D | w, α, H) P(w | α, H) / P(D | α, H).                     (4)

Level 2: Infer α:

    P(α | D, H) = P(D | α, H) P(α | H) / P(D | H).                                 (5)

Level 3: Compare models:

    P(H | D) ∝ P(D | H) P(H).                                                      (6)

There is a pattern in these three applications of Bayes' rule: at each of the higher levels (2 and 3), the data-dependent factor (e.g., in level 2, P(D | α, H)) is the normalizing constant (the `evidence') from the preceding level of inference.

The inference problems listed at the beginning of this section are solved approximately using the following procedure.
The level 1 inference is approximated by making a quadratic expansion, around a maximum of P(w | D, α, H), of log[P(D | w, α, H) P(w | α, H)]; this expansion defines a Gaussian approximation to the posterior. The evidence P(D | α, H) is estimated by evaluating the appropriate determinant. For linear models the Gaussian approximation is exact.
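For the quadratic case of equations (2) and (3), this determinant-based estimate can be written out explicitly (a standard Gaussian-integral result, stated here as a sketch rather than quoted from the paper). Writing E_W(w) = (1/2) w^T w, A = ∇∇(β E_D + α E_W) = β ∇∇E_D + α I, with k parameters and N data points,

    log P(D | α, β, H) = −α E_W(w_MP) − β E_D(w_MP) − (1/2) log det A
                         + (k/2) log α + (N/2) log β − (N/2) log 2π,

which reduces to the evidence for α alone when β is held at its known value.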
By maximizing the evidence P(D | α, H) at level 2, we find the most probable value of the regularization constant, α_MP, and error bars on it, σ_{log α | D}. Because α is a positive scale variable, it is natural to represent its uncertainty on a log scale.

The value of α_MP is substituted at level 1. This defines a probability distribution P(w | D, α_MP, H) which is intended as a `good approximation' to the posterior P(w | D, H). The solution offered for problem A is a Gaussian distribution around the