Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models

David J. Spiegelhalter*, Nicola G. Best†, Bradley P. Carlin‡

March

Abstract

We consider the problem of comparing complex hierarchical models in which the number of parameters is not clearly defined. We follow Dempster in examining the posterior distribution of the log-likelihood under each model, from which we derive measures of fit and complexity (the effective number of parameters). These may be combined into a Deviance Information Criterion (DIC), which is shown to have an approximate decision-theoretic justification. Analytic and asymptotic identities reveal the measure of complexity to be a generalisation of a wide range of previous suggestions, with particular reference to the neural network literature. The contributions of individual observations to fit and complexity can give rise to a diagnostic plot of deviance residuals against leverages. The procedure is illustrated in a number of examples, and throughout it is emphasised that the required quantities are trivial to compute in a Markov chain Monte Carlo analysis and require no analytic work for new models.

Introduction

The development of Markov chain Monte Carlo (MCMC) methods has made it possible to fit increasingly large classes of models, with the aim of exploring real-world complexities of data (Gilks et al.). Being able to fit such models naturally leads to the wish to compare alternative formulations, with the aim of identifying a class of succinct, plausible models: for example, we might ask whether we need to incorporate a random effect to allow for over-dispersion, whether allowing for measurement error helps explain the data, what distributional forms to assume, and so on.

Within the classical modelling framework, model comparison takes place by defining a measure of fit, typically the deviance statistic, and a measure of complexity, the number of free parameters in the model. Since increasing complexity is accompanied by better fit, models are compared by trading these two quantities off, using likelihood ratio tests, Akaike's information criterion, or one of a number of other suggestions (Aitkin). Bayesian model comparison using Schwarz's information criterion as a Bayes factor approximation also requires specification of the number of parameters in each model (Kass and Raftery). Unfortunately, in complex hierarchical models, in which parameters generally outnumber observations, these methods clearly cannot be directly applied (Gelfand and Dey). The most ambitious attempts to tackle this problem appear in the neural network literature (Moody; MacKay; Ripley).

In the next section we follow Dempster (recently reprinted) in basing comparisons on the posterior distributions of the deviance ($-2 \times$ log-likelihood plus some standardising factor) under each model. We identify fit as the posterior mean of the deviance, and complexity, i.e. the effective number of parameters $p_D$, as the difference between the posterior mean of the deviance and the deviance based on the posterior means of the parameters. These quantities can be trivially obtained from an MCMC analysis. The fit and complexity are then added to form a Deviance Information Criterion (DIC), which may be used for model comparison. We then illustrate the use of this simple technique on a reasonably complex example in spatial modelling with covariates.

The remainder of the paper attempts to justify the use of these measures of fit and complexity, and their combination into a single model comparison criterion. We first use a heuristic asymptotic argument to see DIC as a generalisation of Akaike's Information Criterion (AIC) (Akaike), and then show that DIC has an approximate decision-theoretic justification in terms of minimising expected loss in predicting a replicate dataset, and that in some situations an absolute measure of fit is obtained. Although not necessary for the computation of DIC, it is useful to examine exact and asymptotic forms within standard models. We show that in hierarchical normal linear models $p_D$ has a sensible closed form as the trace of the 'hat' matrix that projects the data onto the fitted values, and hence can be related to many other suggestions; asymptotic approximations in the exponential family reveal further relationships. We then argue that each observation's contribution to $p_D$ can be interpreted as its leverage, and show how to obtain diagnostic plots of leverages against deviance residuals as a by-product of an MCMC analysis, regardless of the form of the model and without any analytic effort.

A later section contains a set of examples to show how DIC works in practice, followed by a discussion of how it fits into the general model comparison literature. Finally, we present a critique of the method, identifying some areas for further research.

* MRC Biostatistics Unit, Institute of Public Health, Robinson Way, Cambridge, UK. email: david.spiegelhalter@mrc-bsu.cam.ac.uk
† Dept. of Epidemiology and Public Health, Imperial College School of Medicine at St Mary's, Norfolk Place, London, UK. email: n.best@ic.ac.uk
‡ Division of Biostatistics, Mayo Building, University of Minnesota, Minneapolis, MN, USA. email: brad@muskie.biostat.umn.edu

Measures of fit and complexity from the posterior distribution of the deviance

Suppose we are fitting a model with observed data $y$ and unknown quantities $\phi$, which may include parameters at different levels of the model, latent variables, missing data, and so on. The Bayesian approach specifies a joint distribution $p(y, \phi)$, which will generally consist of a product of many terms through conditional independence assumptions. This joint distribution can be written as
$$p(y, \phi) = p(y \mid \theta)\, p(\theta \mid \psi)\, p(\psi),$$
where $\phi = (\theta, \psi)$ and $y$ is conditionally independent of $\psi$ given $\theta$. Thus $\theta$ are the parameters that directly influence $y$ (e.g. true means), while $\psi$ typically are hyperparameters that govern the form of the prior distribution for $\theta$. Our interest will focus on the lowest-level parameters $\theta$, since they directly influence the fit and predictive ability of the model.

Dempster long ago suggested direct consideration of the posterior distribution of the log-likelihood of the data, equivalent to examining the posterior distribution of
$$D(\theta) = -2 \log p(y \mid \theta) + 2 \log f(y),$$
where $f(y)$ is some fully specified standardising term that is a function of the data alone and hence does not affect model comparison. We shall term $D(\theta)$ the Bayesian deviance, and introduce two specific standardisations: first, the 'null' standardisation, $D(\theta) = -2 \times$ log-likelihood, obtained by assuming $f(y) = 1$, the 'perfect' predictor that gave probability 1 to each observation; and second, for members of the one-parameter exponential family with $E(Y) = \mu(\theta)$, the saturated deviance $D_S(\theta)$, obtained by setting $f(y) = p(y \mid \mu(\theta) = y)$.

The posterior distribution of $D(\theta)$ is based on $p(\theta \mid y)$, where $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$ and $p(\theta) = \int p(\theta \mid \psi)\, p(\psi)\, d\psi$. Dempster provided some basic examples and some suggestions for comparison of distributions of the log-likelihood, and a number of papers have featured plots or summaries such as posterior means: see, for example, Raghunathan, Zeger and Karim, Gilks et al., and Richardson and Green. However, these authors appear unclear about how to compare models of differing complexity, and in a discussion of a republication of his paper Dempster gives little extra guidance, stating that one should 'plot a representation from Markov chain Monte Carlo methods of the posterior distributions of log-likelihood, perhaps under each competing model, and compare', but 'I hesitate to dictate a procedure for choosing among models'.

We shall not be so hesitant, and offer a set of suggestions. First, summarise the fit of a model by the posterior expectation of the deviance,
$$\bar{D} = E_{\theta \mid y}[D(\theta)].$$
Second, measure the complexity of a model by the effective number of parameters $p_D$, defined as the expected deviance minus the deviance evaluated at the posterior expectations:
$$p_D = E_{\theta \mid y}[D(\theta)] - D\!\left(E_{\theta \mid y}[\theta]\right) = \bar{D} - D(\bar{\theta}).$$
Finally, models may be compared using a Deviance Information Criterion, defined as
$$\mathrm{DIC} = D(\bar{\theta}) + 2 p_D = \bar{D} + p_D.$$
This will be shown to be a natural generalisation of Akaike's Information Criterion.

DIC may be calculated during an MCMC run by monitoring both $\theta$ and $D(\theta)$, and at the end of the run simply taking the sample mean of the simulated values of $D$, minus the plug-in estimate of the deviance based on the sample means of the simulated values of $\theta$. Smaller values of DIC indicate a better-fitting model. As with many previously proposed model comparison tools, DIC consists of two terms, one representing goodness of fit and the other a penalty for increasing model complexity.
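As an indication of how little computation this requires, the following minimal sketch (illustrative Python/NumPy; the array of posterior draws and the user-supplied log-likelihood function are assumptions for the example, not part of the analyses in this paper) computes $\bar{D}$, $D(\bar{\theta})$, $p_D$ and DIC from stored MCMC output.

```python
import numpy as np

def dic_summary(theta_samples, log_lik):
    """Deviance summaries from MCMC output.

    theta_samples : (n_iter, n_param) array of posterior draws
    log_lik       : function returning log p(y | theta) for a single draw
    Any standardising term 2*log f(y) is constant across models and omitted here.
    """
    deviances = np.array([-2.0 * log_lik(th) for th in theta_samples])
    d_bar = deviances.mean()                 # posterior mean deviance (fit)
    theta_bar = theta_samples.mean(axis=0)   # posterior means of the parameters
    d_hat = -2.0 * log_lik(theta_bar)        # plug-in deviance at the posterior means
    p_d = d_bar - d_hat                      # effective number of parameters
    dic = d_hat + 2.0 * p_d                  # equivalently d_bar + p_d
    return d_bar, d_hat, p_d, dic
```

In practice the deviance is usually monitored directly within the sampler (for example in BUGS), so no post-processing beyond sample means is required.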

We need to emphasise that we do not recommend that DIC be used as a strict criterion for model choice, or as a basis for model averaging (Draper). Selecting a single model is a complex procedure involving background knowledge and other factors, such as the robustness of inferences to alternative models with similar support (Box and Tiao); model choice may be unnecessary in the first place, and is certainly very difficult to formalise. We rather view DIC as a method for screening alternative formulations in order to produce a list of candidate models for further consideration; see the discussion at the end of the paper for more on this issue.


A running example: the spatial distribution of lip cancer in Scotland

To illustrate the practical application of our suggestion, we analyse data on the rates of lip cancer in counties in Scotland (Clayton and Kaldor; Breslow and Clayton). The data include observed ($y_i$) and expected ($E_i$) numbers of cases for each county $i$, where the expected counts are based on the age- and sex-standardised national rate applied to the population at risk in each county; a covariate $x_i$ representing the percentage of the population in each county engaged in agriculture, fishing or forestry (used as a proxy for sunlight exposure); plus the location of each county, expressed as a list $A_i$ of its $n_i$ adjacent counties. We assume the usual Poisson model for cancer incidence within each county,
$$y_i \sim \mathrm{Poisson}(E_i \lambda_i),$$
where $\lambda_i$ denotes the underlying true area-specific relative risk of lip cancer. Using the standard canonical parameterisation $\theta_i = \log \lambda_i$, the model may be expressed using either a pooled model, with or without the covariate,
Model 1: $\theta_i = \alpha_0$
Model 2: $\theta_i = \alpha_0 + \alpha_1 x_i$
or a saturated model
Model 3: $\theta_i = \alpha_i$,
where locally uniform priors (actually Normal with mean zero and very large variance) are placed on $\alpha_0$, $\alpha_1$ and the $\alpha_i$. However, models 1 and 2 make no allowance for variation between the true risk ratios in each county other than that associated with the covariate in model 2, whilst model 3 assumes independence between the county-specific risk ratios, essentially yielding the maximum likelihood estimates $\hat{\lambda}_i = y_i / E_i$. A more plausible assumption is that the true county-specific risk ratios lie somewhere between the independent and pooled estimates, thus motivating the use of random effects models. In the present example, specification of the random effects population distribution is complicated by the possibility that the $\lambda_i$ may be spatially correlated: that is, the risk ratios in two neighbouring counties may be more similar than the risk ratios in two counties further apart, possibly due to dependence on unmeasured risk factors which vary smoothly with geographic location. We thus consider comparison of the following six random and mixed effects models, in addition to models 1-3 above:
Model 4: $\theta_i = \alpha_0 + \gamma_i$
Model 5: $\theta_i = \alpha_0 + \alpha_1 x_i + \gamma_i$
Model 6: $\theta_i = \alpha_0 + \delta_i$
Model 7: $\theta_i = \alpha_0 + \alpha_1 x_i + \delta_i$
Model 8: $\theta_i = \alpha_0 + \gamma_i + \delta_i$
Model 9: $\theta_i = \alpha_0 + \alpha_1 x_i + \gamma_i + \delta_i$
where $\alpha_0$ and $\alpha_1$ are as for models 1 and 2, the $\gamma_i$ are exchangeable random effects with a Normal prior distribution having zero mean and precision $\tau_\gamma$, and the $\delta_i$ are spatial random effects with a conditional autoregressive prior (Besag) given by
$$\delta_i \mid \delta_{j \ne i} \sim \mathrm{Normal}\!\left( \frac{1}{n_i} \sum_{j \in A_i} \delta_j \,,\; \frac{1}{n_i \tau_\delta} \right).$$


Model                      D̄      D(θ̄)    p_D     DIC
1  pooled
2  + cov
3  saturated
4  exch
5  exch + cov
6  spat
7  spat + cov
8  exch + spat
9  exch + spat + cov

Table 1: Deviance summaries for the lip cancer data. 'cov' denotes a model with the covariate, 'exch' an exchangeable random effect, and 'spat' a spatially correlated random effect.

Gamma priors are assumed for the random effects precision parameters $\tau_\gamma$ and $\tau_\delta$ respectively. The prior for the latter is weakly informative, in order to improve the stability and convergence properties of the model, since we note that there is considerable non-identifiability in the above parameterisations. However, this does not influence the model comparison, which is based only on the fitted $\theta_i$'s.

For this Poisson model we adopt the classical deviance (McCullagh and Nelder),
$$D_S(\theta) = 2 \sum_i \left\{ y_i \log\!\left(\frac{y_i}{e^{\theta_i} E_i}\right) - \left(y_i - e^{\theta_i} E_i\right) \right\},$$
with the standardising factor obtained by taking $2 \log f(y) = 2 \log p(y \mid \theta_i = \log(y_i/E_i))$, i.e. the likelihood evaluated at the maximum likelihood estimates.
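For concreteness, the following fragment (illustrative Python; the arrays `y`, `E` and the matrix of posterior draws are placeholders, not data or output from the paper) evaluates this saturated deviance for each posterior draw and hence yields $\bar{D}_S$, $D_S(\bar{\theta})$ and $p_D$ for one of the models above.

```python
import numpy as np

def poisson_saturated_deviance(theta, y, E):
    """Saturated deviance 2*sum{ y*log(y/mu) - (y - mu) } with mu = E*exp(theta).

    The convention 0*log(0) = 0 handles counties with zero observed counts.
    """
    mu = E * np.exp(theta)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)
    return 2.0 * np.sum(term - (y - mu))

def deviance_summaries(theta_draws, y, E):
    # theta_draws: (n_iter, n_counties) posterior samples of the log relative risks
    d = np.array([poisson_saturated_deviance(t, y, E) for t in theta_draws])
    d_hat = poisson_saturated_deviance(theta_draws.mean(axis=0), y, E)
    return d.mean(), d_hat, d.mean() - d_hat   # D_bar, D(theta_bar), p_D
```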

For each model we ran an MCMC sampler in BUGS (Spiegelhalter et al.), following a burn-in period. As suggested by Dempster, Figure 1 shows a kernel-density-smoothed plot of the resulting posterior distributions of the deviance under each competing model. Apart from revealing the clear unacceptability of the pooled models 1 and 2, this clearly illustrates the difficulty of formally comparing posterior deviances on the basis of such plots alone.

In Table 1 we present the results of our own suggestion for summarising the deviance for competing models. For each model, $\bar{D}$ is simply the mean of the posterior samples of $D$, and $D(\bar{\theta})$ is calculated by plugging the posterior means of the relevant parameters ($\bar{\alpha}_0$, $\bar{\alpha}_1$, $\bar{\gamma}_i$, $\bar{\delta}_i$) into the linear predictor $\theta_i$.

Beginning with the fixed effects models 1, 2 and 3, we note that for models 1 and 2 the values of $p_D$ closely match the true numbers of parameters (1 and 2 respectively). For model 3, $p_D$ is slightly lower than the true number of parameters (one per county).

The six random effects models have values of $p_D$ lying between these extremes. Somewhat surprisingly, the effective numbers of parameters in models 8 and 9, which each contain both sets of random effects (spatial and exchangeable), are almost identical to the effective numbers of parameters in models 6 and 7, which each contain only the spatial random effects; there are somewhat more effective parameters in models 4 and 5. This suggests that the exchangeable random effects do less well at explaining the between-area variability in relative risk compared to the spatial random effects, and indeed are probably redundant once the spatial effects are also included in the model.

Figure 1: Posterior distributions of the deviance for each model considered in the lip cancer example.

Turning to the comparison of DIC for each model, we first note that DIC is subject to Monte Carlo sampling error, since it is a function of stochastic quantities generated under an MCMC sampling scheme. Whilst computing precise standard errors for our DIC values is a subject of ongoing research, the standard errors for the $\bar{D}$ values are readily obtained and provide a good indication of the accuracy of DIC and $p_D$. In any case, in several runs using different initial values and random number seeds for this example, the DIC and $p_D$ estimates obtained never varied by more than a small amount. As such, we are confident that, even allowing for Monte Carlo error, any of the random effects models 4-9 are superior in terms of DIC performance to the saturated model, which is in turn superior to the pooled models 1 and 2. Comparison of DIC for models 6-9 suggests that the four spatial models are virtually indistinguishable in terms of overall fit. However, later we show how to produce diagnostic plots of leverage against deviance residuals using the individual components of DIC; these suggest important differences in fit for certain aspects of models 6-9.

Theory and asymptotics

A heuristic derivation of p_D and DIC

We emphasise that the asymptotics described in this section reflect the situation where increasing information about individual members of $\theta = (\theta_1, \ldots, \theta_p)$ is received, i.e. where the number of observations grows with respect to the number of parameters.

We first expand $D(\theta)$ around $\bar{\theta} = E_{\theta \mid y}[\theta]$ to give, to second order,
$$D(\theta) \approx D(\bar{\theta}) + (\theta - \bar{\theta})^T D'(\bar{\theta}) + \tfrac{1}{2} (\theta - \bar{\theta})^T D''(\bar{\theta}) (\theta - \bar{\theta}) = D(\bar{\theta}) - 2 (\theta - \bar{\theta})^T L'(\bar{\theta}) - (\theta - \bar{\theta})^T L''(\bar{\theta}) (\theta - \bar{\theta}),$$
where $L(\theta) = \log p(y \mid \theta)$, so that $D(\theta) = -2L(\theta) + 2\log f(y)$, and $L'$ and $L''$ denote first and second derivatives with respect to $\theta$.

Consider now a non-hierarchical prior, in which $p(\theta)$ is assumed to be completely specified with no unknown parameters $\psi$. It is well known that asymptotically
$$\theta \mid y \sim N\!\left(\hat{\theta}, \; [-L''(\hat{\theta})]^{-1}\right),$$
where $\hat{\theta}$ are the maximum likelihood estimates, such that $L'(\hat{\theta}) = 0$. Writing $D_{\mathrm{non}}$ to represent the deviance for a non-hierarchical model, we thus obtain from the expansion above
$$D_{\mathrm{non}}(\theta) \approx D_{\mathrm{non}}(\hat{\theta}) - (\theta - \hat{\theta})^T L''(\hat{\theta}) (\theta - \hat{\theta}) \approx D_{\mathrm{non}}(\hat{\theta}) + \chi^2_p,$$
since, by the asymptotic normality above, $(\theta - \hat{\theta})^T [-L''(\hat{\theta})] (\theta - \hat{\theta})$ has an approximate chi-squared distribution with $p$ degrees of freedom.

Rearranging and taking expectations with respect to the posterior distribution of $\theta$ reveals
$$p \approx E_{\theta \mid y}[D_{\mathrm{non}}(\theta)] - D_{\mathrm{non}}(\hat{\theta}),$$
i.e. the number of parameters is approximately the expected deviance $\bar{D}_{\mathrm{non}} = E_{\theta \mid y}[D_{\mathrm{non}}(\theta)]$ minus the fitted deviance. Akaike's information criterion (Akaike) is $\mathrm{AIC} = D(\hat{\theta}) + 2p$, and hence from the identity above may be written
$$\mathrm{AIC} \approx \bar{D}_{\mathrm{non}} + p,$$
the expected deviance plus the number of parameters.

Our suggestion for hierarchical models thus follows the definitions above, but substitutes the posterior mean $\bar{\theta}$ for the maximum likelihood estimate $\hat{\theta}$. It is a generalisation of Akaike's criterion: for non-hierarchical models, $p_D \approx p$ and $\mathrm{DIC} \approx \mathrm{AIC}$.

MCMC estimation of the maximum likelihood deviance

We note in passing that, since $V_{\theta \mid y}(D_{\mathrm{non}}) \approx 2p$, we can use MCMC output to estimate the classical deviance $D(\hat{\theta})$ of any non-hierarchical model by
$$\widehat{D(\hat{\theta})} = E_{\theta \mid y}[D] - \tfrac{1}{2} V_{\theta \mid y}(D),$$
using the empirical mean and variance of the sampled values of $D$. Although this maximum likelihood deviance is theoretically the minimum of $D(\theta)$ over all feasible values of $\theta$, $D(\hat{\theta})$ will generally be very badly estimated by the sample minimum over an MCMC run.

The above discussion suggests that the posterior distributions of the deviance for the non-hierarchical models 1, 2 and 3 in Figure 1 should be approximately shifted chi-squared distributions with degrees of freedom equal to their respective numbers of parameters, and hence variances of twice those values. In fact the observed variances of these distributions, and the classical deviance estimates provided by the identity above, are close to the corresponding true values: the chi-squared approximation appears rather good, and a reasonably accurate estimate of the maximum likelihood deviance is obtained.
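A minimal sketch of this shortcut (illustrative Python; `deviance_samples` is a placeholder for the monitored deviance values from a non-hierarchical model, not output from the analyses in this paper):

```python
import numpy as np

def estimate_ml_deviance(deviance_samples):
    """Estimate D(theta_hat) for a non-hierarchical model from MCMC draws of D.

    Uses D(theta_hat) ~= E(D) - V(D)/2, which follows from D ~= D(theta_hat) + chi2_p,
    so that E(D) ~= D(theta_hat) + p and V(D) ~= 2p.
    """
    d = np.asarray(deviance_samples)
    return d.mean() - 0.5 * d.var()

# The sample minimum min(d) is a much poorer estimate of D(theta_hat),
# since an MCMC run rarely visits the exact mode.
```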

Asymptotic properties of p_D

Taking expectations of the expansion above with respect to the posterior distribution of $\theta$ gives
$$E_{\theta \mid y}[D(\theta)] \approx D(\bar{\theta}) + E_{\theta \mid y}\!\left[ \mathrm{tr}\!\left( -L''(\bar{\theta})\, (\theta - \bar{\theta})(\theta - \bar{\theta})^T \right) \right] = D(\bar{\theta}) + \mathrm{tr}\!\left( -L''(\bar{\theta})\, V \right),$$
where $V = E_{\theta \mid y}[(\theta - \bar{\theta})(\theta - \bar{\theta})^T]$ is the posterior covariance matrix of $\theta$ and $-L''(\bar{\theta})$ is the observed Fisher information evaluated at the posterior mean of $\theta$. Note that this would be the asymptotic covariance of $\theta$ had a locally uniform prior been adopted. Thus
$$p_D \approx \mathrm{tr}\!\left( -L''(\bar{\theta})\, V \right),$$
which can be thought of as a measure of the information in the likelihood about the parameters as a fraction of the total information in the likelihood and the prior. Under asymptotic posterior normality we have
$$V^{-1} \approx -L''(\bar{\theta}) + P(\bar{\theta}),$$
where $P(\theta) = -\partial^2 \log p(\theta)/\partial\theta\,\partial\theta^T$, and hence the expression above can be written
$$p_D \approx \mathrm{tr}\!\left( (V^{-1} - P)\, V \right) = p - \mathrm{tr}(P V).$$

In particular, consider the problem of 'regularisation' in complex interpolation models such as neural networks, in which the parameters are standardised and assumed to have independent normal priors with precision $\lambda$. Then the expression above may be written
$$p_D \approx p - \lambda\, \mathrm{tr}(V),$$
as obtained by MacKay.

In many conditionally independent hierarchical models $\theta$ will be of the same length as $y$, and $L''$ will be a diagonal matrix with $i$th entry $L''_i = \partial^2 \log p(y_i \mid \theta_i)/\partial \theta_i^2$, and so
$$p_D \approx \sum_{i=1}^p \left[ -L''_i(\bar{\theta}_i) \right] V(\theta_i \mid y),$$
showing that each parameter contributes the ratio of the information in the likelihood, $-L''_i$, to its posterior precision, $1/V(\theta_i \mid y)$.

An approximate decision-theoretic justification for DIC

Suppose we wish to make predictions on a replicate dataset $Y_{\mathrm{rep}}$ which has an identical design to the observed data $y$, as in the framework described by Gelfand and Ghosh. Assume the true model is $p(Y_{\mathrm{rep}} \mid \theta_t)$, and the loss in using an estimate $\tilde{\theta}$ is given by
$$L(\theta_t, \tilde{\theta}) = E_{Y_{\mathrm{rep}} \mid \theta_t}\!\left[ -2 \log p(Y_{\mathrm{rep}} \mid \tilde{\theta}) \right],$$
the predicted loss using a proper logarithmic scoring rule (Bernardo). Denote $-2 \log p(Y_{\mathrm{rep}} \mid \theta)$ by $D_{\mathrm{rep}}(\theta)$. Then, following the approach of Ripley, this loss can be broken down into
$$L(\theta_t, \tilde{\theta}) = E_{Y_{\mathrm{rep}} \mid \theta_t}\!\left[ D_{\mathrm{rep}}(\tilde{\theta}) - D_{\mathrm{rep}}(\theta_t) \right] + E_{Y_{\mathrm{rep}} \mid \theta_t}\!\left[ D_{\mathrm{rep}}(\theta_t) - D(\theta_t) \right] + D(\theta_t).$$
We shall denote the first two terms by $L_1$ and $L_2$ respectively.

Expanding the first term to second order gives
$$L_1 \approx E_{Y_{\mathrm{rep}} \mid \theta_t}\!\left[ -2 (\tilde{\theta} - \theta_t)^T L'_{\mathrm{rep}}(\theta_t) - (\tilde{\theta} - \theta_t)^T L''_{\mathrm{rep}}(\theta_t) (\tilde{\theta} - \theta_t) \right],$$
where $L_{\mathrm{rep}}(\theta) = \log p(Y_{\mathrm{rep}} \mid \theta)$. Since $E_{Y_{\mathrm{rep}} \mid \theta_t}[L'_{\mathrm{rep}}(\theta_t)] = 0$, we obtain after some rearrangement
$$L_1 \approx (\tilde{\theta} - \theta_t)^T I(\theta_t)\, (\tilde{\theta} - \theta_t),$$
where $I(\theta_t) = E_{Y_{\mathrm{rep}} \mid \theta_t}[-L''_{\mathrm{rep}}(\theta_t)]$ is Fisher's information in $Y_{\mathrm{rep}}$, and hence also in $y$. This might reasonably be approximated by the observed information at the estimated parameters, so that
$$L_1 \approx -(\tilde{\theta} - \theta_t)^T L''(\tilde{\theta})\, (\tilde{\theta} - \theta_t).$$

The second term may be written as $L_2 = E_{Y_{\mathrm{rep}} \mid \theta_t}[-2 \log p(Y_{\mathrm{rep}} \mid \theta_t)] + 2 \log p(y \mid \theta_t)$; this depends only on the observed data and the true parameter $\theta_t$, and for any value of $\theta_t$ has sampling expectation zero.

Suppose that under a particular model assumption we obtain a posterior distribution $p(\theta \mid y)$. Then our posterior expected loss when adopting the estimator $\tilde{\theta}$ is
$$E_{\theta \mid y}\!\left[ L(\theta, \tilde{\theta}) \right] \approx E_{\theta \mid y}[D(\theta)] + E_{\theta \mid y}[L_2] - \mathrm{tr}\!\left( L''(\tilde{\theta})\, E_{\theta \mid y}\!\left[ (\tilde{\theta} - \theta)(\tilde{\theta} - \theta)^T \right] \right).$$
Ideally we would choose $\tilde{\theta}$ to minimise this function, but this would be extremely complex. In practice we may approximate the true Bayes estimate by the posterior mean $\bar{\theta}$, making the expected loss
$$E_{\theta \mid y}[L_2] - \mathrm{tr}\!\left( L''(\bar{\theta})\, V \right) + E_{\theta \mid y}[D(\theta)] = E_{\theta \mid y}[L_2] + p_D + \bar{D},$$
where $V$ has been previously defined as the posterior covariance of $\theta$ and $p_D = \bar{D} - D(\bar{\theta})$. Since we have already shown that $p_D \approx \mathrm{tr}(-L''(\bar{\theta}) V)$, we finally obtain the attractive result that the expected posterior loss when adopting a particular posterior distribution is approximately $\mathrm{DIC} = D(\bar{\theta}) + 2 p_D$, plus a term with expectation zero whichever model is true.

Kass and Raftery criticise Akaike for using a 'plug-in' predictive distribution, as we have done above, rather than the full predictive distribution obtained by integrating out the unknown parameters. The above justification must therefore be taken as fairly heuristic.

Asymptotic sampling theory properties of the posterior expected deviance

Suppose that all aspects of the assumed model are true. Then, before observing $y$, our expectation of the posterior expected deviance is
$$E_y[\bar{D}] = E_y\!\left[ E_{\theta \mid y}[D(\theta)] \right] = E_\theta\!\left[ E_{y \mid \theta}\!\left[ -2 \log p(y \mid \theta) + 2 \log f(y) \right] \right],$$
by reversing the conditioning between $y$ and $\theta$. Suppose $\theta$ is of dimension $p$, and let $f(y) = p(y \mid \hat{\theta}(y))$, where $\hat{\theta}(y)$ are the standard maximum likelihood estimates. Then
$$E_{y \mid \theta}\!\left[ -2 \log \frac{p(y \mid \theta)}{p(y \mid \hat{\theta}(y))} \right]$$
is simply the expected likelihood ratio statistic for the fitted values $\hat{\theta}(y)$ with respect to the true null model $\theta$, and hence under the standard conditions is asymptotically $p$. Hence we expect, if the model is true, the posterior expected deviance, standardised by the maximised log-likelihood, to be $p$, the dimension of $\theta$.

In particular, consider the one-parameter exponential family where $p = n$, the total sample size. The likelihood is maximised by substituting $y_i$ for the mean of $y_i$, and the posterior mean of the saturated deviance has approximate sampling expectation $n$ if the model is true. This might be appropriate for checking the overall goodness of fit of the model. It will be exact for normal models with known variance, but in general will only be reliable if each observation provides considerable information about its mean (McCullagh and Nelder). Note that comparing $\bar{D}$ with $n$ is precisely the same as comparing $D(\bar{\theta})$ with $n - p_D$, the effective degrees of freedom.

In Table 1 we might therefore compare the column $\bar{D}$ with the sample size $n$, the number of counties. This suggests that the random effects models provide an adequate overall fit to the data, and that their comparison is based on their complexity alone.

Results for some common model classes

Normal models

One-way ANOVA

Consider the one-way analysis of variance with known variance components, i.e.
$$y_i \sim N(\theta_i, \tau_i^{-1}), \qquad i = 1, \ldots, p,$$
giving
$$D(\theta) = \sum_i \tau_i (y_i - \theta_i)^2,$$
which is $-2 \times$ log-likelihood standardised by the term $2 \log f(y) = \sum_i \log(\tau_i / 2\pi)$, obtained from setting $\theta_i = y_i$.

Saturated model. We assume the $\theta_i$'s have independent locally uniform priors, so that $\theta_i \mid y \sim N(y_i, \tau_i^{-1})$ and $D(\bar{\theta}) = 0$. Thus
$$\bar{D}_{\mathrm{sat}} = p, \qquad D_{\mathrm{sat}}(\bar{\theta}) = 0,$$
and so $p_D = p$, the true number of parameters.

Pooled model. We assume $\theta_i = \theta$ for all $i$, where $\theta$ has a locally uniform prior. Then $\theta \mid y \sim N(\bar{y}, 1/\sum_i \tau_i)$, where $\bar{y} = \sum_i \tau_i y_i / \sum_i \tau_i$, and so $D(\bar{\theta}) = \sum_i \tau_i (y_i - \bar{y})^2$. Thus
$$\bar{D}_{\mathrm{pool}} = \sum_i \tau_i (y_i - \bar{y})^2 + 1, \qquad D_{\mathrm{pool}}(\bar{\theta}) = \sum_i \tau_i (y_i - \bar{y})^2,$$
and so $p_D = 1$, the true number of parameters.

Exchangeable model, known normal prior. We assume a prior $\theta_i \sim N(\mu, \lambda^{-1})$, where $\mu$ and $\lambda$ are assumed known. Then $\theta_i \mid y \sim N(\rho_i y_i + (1 - \rho_i)\mu, \; (\tau_i + \lambda)^{-1})$, where $\rho_i = \tau_i / (\tau_i + \lambda)$ is the likelihood precision as a fraction of the posterior precision (in the language of traditional mixed modelling this is equal to the intra-class correlation coefficient). It can easily be shown that
$$D_{\mathrm{exch}}(\theta) \mid y \;\sim\; \sum_i \rho_i\, \chi^2\!\left(1,\; (\tau_i + \lambda)(y_i - \bar{\theta}_i)^2\right),$$
where $\chi^2(a, b)$ is a non-central chi-squared distribution with mean $a + b$ and variance $2(a + 2b)$. Thus, since $\bar{\theta}_i = \rho_i y_i + (1 - \rho_i)\mu$, we have
$$\bar{D}_{\mathrm{exch}} = \sum_i \tau_i (y_i - \bar{\theta}_i)^2 + \sum_i \rho_i, \qquad D_{\mathrm{exch}}(\bar{\theta}) = \sum_i \tau_i (y_i - \bar{\theta}_i)^2,$$
and so
$$p_D = \sum_i \rho_i.$$
The effective number of parameters is therefore the sum of the intra-class correlation coefficients, which essentially measures the sum of the ratios of the precision in the likelihood to the precision in the posterior, as described in the identity above.
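As a quick numerical check of this identity (an illustrative Python fragment with arbitrary made-up precisions, not data from the paper), one can sample the conjugate posterior directly and compare the Monte Carlo estimate of $p_D$ with $\sum_i \rho_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
tau = np.array([4.0, 1.0, 0.25, 9.0])   # likelihood precisions (made up)
lam, mu = 1.0, 0.0                       # known prior precision and mean
y = rng.normal(mu, np.sqrt(1.0 / lam + 1.0 / tau))   # one simulated dataset

rho = tau / (tau + lam)
post_mean = rho * y + (1.0 - rho) * mu
post_sd = 1.0 / np.sqrt(tau + lam)

theta = rng.normal(post_mean, post_sd, size=(100_000, len(y)))  # exact posterior draws
dev = (tau * (y - theta) ** 2).sum(axis=1)                      # D(theta) for each draw
p_d = dev.mean() - (tau * (y - post_mean) ** 2).sum()           # Dbar - D(theta_bar)

print(p_d, rho.sum())   # the two numbers agree up to Monte Carlo error
```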


Exchangeable model, unknown normal prior mean. Giving $\mu$ a locally uniform prior, we obtain a posterior distribution $\mu \mid y \sim N(\bar{y}_w, \; [\lambda \sum_i \rho_i]^{-1})$, where $\bar{y}_w = \sum_i \rho_i y_i / \sum_i \rho_i$. It is straightforward to show that
$$\bar{D}_{\mathrm{exch}} = \sum_i \tau_i (y_i - \bar{\theta}_i)^2 + \sum_i \rho_i + 1 - \frac{\sum_i \rho_i^2}{\sum_i \rho_i}, \qquad D_{\mathrm{exch}}(\bar{\theta}) = \sum_i \tau_i (y_i - \bar{\theta}_i)^2,$$
and so $p_D = \sum_i \rho_i + 1 - \sum_i \rho_i^2 / \sum_i \rho_i$.

If all group precisions $\tau_i$ are equal, $p_D = p\rho + (1 - \rho)$, as obtained by Hodges and Sargent.

Exchangeable model, general known prior. Suppose $\theta_i$ has a specified prior $p(\theta_i)$, and the posterior distribution of $\theta_i$ has mean $\mu_i$ and precision $\lambda_i$. Then this general deviance has the form
$$D_{\mathrm{gen}}(\theta) = \sum_i \tau_i (y_i - \theta_i)^2,$$
with
$$\bar{D}_{\mathrm{gen}} = \sum_i \tau_i (y_i - \mu_i)^2 + \sum_i \frac{\tau_i}{\lambda_i}, \qquad D_{\mathrm{gen}}(\bar{\theta}) = \sum_i \tau_i (y_i - \mu_i)^2,$$
and so $p_D = \sum_i \tau_i / \lambda_i$, so that each observation contributes the ratio of its posterior precision based on the likelihood alone ($\tau_i$) to its posterior precision based on the full model ($\lambda_i$).

Suppose further that the posterior distribution of $\theta_i$ given $y_{-i}$ (i.e. all the data except $y_i$) is approximately normal with mean $\phi_i$. Then, from standard normal prior-to-posterior analysis, $\bar{\theta}_i \approx w_i y_i + (1 - w_i)\phi_i$, where $w_i = \tau_i / \lambda_i$. Thus the contribution of each observation to the total number of parameters is the relative weight given to that observation in estimating its own mean.

We note in passing that Ye suggests that for general normal models the contribution of each $\theta_i$ to the effective number of parameters should be
$$h_i = E_{y \mid \theta_t}\!\left[ \frac{\partial \hat{\theta}_i}{\partial y_i} \right].$$
In our Bayesian framework we equate $\hat{\theta}_i$ to $\bar{\theta}_i$, and hence $E_{y \mid \theta_t}[\partial \bar{\theta}_i / \partial y_i] \approx w_i + (1 - w_i)\, E_{y \mid \theta_t}[\partial \phi_i / \partial y_i]$. Since $\phi_i$ does not depend on $y_i$, we thus obtain
$$h_i \approx w_i,$$
and hence in this context Ye's suggestion is a special case of our general deviance-based measure of complexity.

The general normal linear model

We consider the general hierarchical normal model using the notation of Lindley and Smith. Suppose
$$y \sim N(A_1 \theta, C_1), \qquad \theta \sim N(A_2 \phi, C_2),$$
where all matrices and vectors are of appropriate dimension. Then the standardised deviance is $D(\theta) = (y - A_1\theta)^T C_1^{-1} (y - A_1\theta)$. Assume the posterior distribution for $\theta$ is normal with mean $\bar{\theta} = Bb$ and covariance $B$; $B$ and $b$ will be left unspecified for the moment. Then expressing $y - A_1\theta$ as $(y - A_1\bar{\theta}) - A_1(\theta - \bar{\theta})$ reveals that
$$D(\theta) = D(\bar{\theta}) - 2 (y - A_1\bar{\theta})^T C_1^{-1} A_1 (\theta - \bar{\theta}) + (\theta - \bar{\theta})^T A_1^T C_1^{-1} A_1 (\theta - \bar{\theta}).$$
Taking expectations with respect to the posterior distribution of $\theta$ eliminates the middle term and gives
$$\bar{D} = D(\bar{\theta}) + \mathrm{tr}\!\left( A_1^T C_1^{-1} A_1 B \right),$$
and thus $p_D = \mathrm{tr}(A_1^T C_1^{-1} A_1 B)$.

If $\phi$ is assumed known, then Lindley and Smith show that $B = (A_1^T C_1^{-1} A_1 + C_2^{-1})^{-1}$, and hence
$$p_D = \mathrm{tr}\!\left[ A_1^T C_1^{-1} A_1 \left( A_1^T C_1^{-1} A_1 + C_2^{-1} \right)^{-1} \right] = p - \mathrm{tr}\!\left[ C_2^{-1} \left( A_1^T C_1^{-1} A_1 + C_2^{-1} \right)^{-1} \right].$$

A more revealing identity is found by assuming $\phi$ is unknown with a locally uniform prior. Then Lindley and Smith show that
$$B^{-1} = A_1^T C_1^{-1} A_1 + C_2^{-1} - C_2^{-1} A_2 \left( A_2^T C_2^{-1} A_2 \right)^{-1} A_2^T C_2^{-1}, \qquad b = A_1^T C_1^{-1} y.$$
The fitted values for the data are given by $\hat{y} = A_1 \bar{\theta} = A_1 B b = A_1 B A_1^T C_1^{-1} y$, and so the 'hat' matrix that projects the data onto the fitted values is $H = A_1 B A_1^T C_1^{-1}$. Now
$$p_D = \mathrm{tr}\!\left( A_1^T C_1^{-1} A_1 B \right) = \mathrm{tr}\!\left( A_1 B A_1^T C_1^{-1} \right) = \mathrm{tr}(H).$$

This identification of the effective number of parameters with the trace of the hat matrix is a standard result in linear modelling, and extends to the general class of smoothing and generalised additive models (Hastie and Tibshirani); it is also the conclusion of Hodges and Sargent in the context of general linear models. The advantage of using the deviance formulation for specifying $p_D$ is that all matrix manipulation and asymptotic approximation is avoided. Note that $\mathrm{tr}(H)$ is the sum of terms which in regression diagnostics are identified as the individual leverages, the influence of each observation on its own fitted value; we shall exploit this identity when discussing model diagnostics below.
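The following sketch (illustrative Python with arbitrary dimensions; none of the matrices correspond to an example in the paper) computes $p_D = \mathrm{tr}(A_1^T C_1^{-1} A_1 B)$ for the known-$\phi$ case and confirms that it equals $p$ minus the prior-to-posterior information ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 5
A1 = rng.normal(size=(n, p))                 # design matrix
C1 = np.eye(n)                               # data covariance
C2 = 4.0 * np.eye(p)                         # prior covariance for theta (A2*phi taken as 0)

info_lik = A1.T @ np.linalg.inv(C1) @ A1     # information from the likelihood
B = np.linalg.inv(info_lik + np.linalg.inv(C2))  # posterior covariance of theta

p_d = np.trace(info_lik @ B)                     # p_D = tr(A1' C1^-1 A1 B)
print(p_d, p - np.trace(np.linalg.inv(C2) @ B))  # identical: p - tr(C2^-1 B)
```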

The general normal non-linear model

A large class of models can be formulated using the following extension to the Lindley-Smith model discussed above:
$$y \sim N\!\left(g(\theta), \tau^{-1} D_1\right), \qquad \theta \sim N\!\left(A\phi, \lambda^{-1} D_2\right),$$
where $g(\theta)$ is a non-linear expression, as found for example in pharmacokinetics or neural networks, and $\tau$ and $\lambda$ are likelihood and prior precisions respectively, presumed known for the present; in many situations $A\phi$ will be zero and $D_1$, $D_2$ will be identity matrices. Define
$$q(\theta) = (y - g(\theta))^T D_1^{-1} (y - g(\theta)), \qquad r(\theta) = (\theta - A\phi)^T D_2^{-1} (\theta - A\phi)$$
as the likelihood and prior residual variation.

Assuming asymptotic posterior normality, we have $\theta \mid y \sim N(\bar{\theta}, V)$, where
$$V^{-1} \approx \tfrac{\tau}{2}\, q''(\bar{\theta}) + \lambda D_2^{-1},$$
which, from the identity $p_D \approx p - \mathrm{tr}(PV)$ above, leads to
$$p_D \approx p - \lambda\, \mathrm{tr}\!\left[ \left( \tfrac{\tau}{2}\, D_2\, q''(\bar{\theta}) + \lambda I \right)^{-1} \right].$$
Let $\tfrac{\tau}{2} D_2 q''(\bar{\theta})$ have eigenvalues $e_i$, $i = 1, \ldots, p$. Then $\tfrac{\tau}{2} D_2 q''(\bar{\theta}) + \lambda I$ has eigenvalues $e_i + \lambda$, and hence
$$p_D \approx \sum_i \frac{e_i}{e_i + \lambda}.$$
Setting $g(\theta) = \theta$ recovers the result found in the one-way analysis of variance example above, but this more general expression was described by MacKay.

MacKay shows that a consequence of this formulation can be the use of effective degrees of freedom in estimating variance parameters. Assume $\tau$ and $\lambda$ are unknown and to be estimated by maximising the 'Type II likelihood' $p(y \mid \tau, \lambda)$, derived by integrating the unknown $\theta$ out of the likelihood. From a standard Laplace approximation (Kass and Raftery) this is equivalent to minimising
$$-2 \log p(y \mid \tau, \lambda) \approx -2 \log p(y \mid \bar{\theta}, \tau) - 2 \log p(\bar{\theta} \mid \lambda) + 2 \log p(\bar{\theta} \mid y),$$
in which
$$-2 \log p(y \mid \bar{\theta}, \tau) = \tau q(\bar{\theta}) - n \log \tau + \log|D_1| + n \log 2\pi,$$
$$-2 \log p(\bar{\theta} \mid \lambda) = \lambda r(\bar{\theta}) - p \log \lambda + \log|D_2| + p \log 2\pi,$$
$$2 \log p(\bar{\theta} \mid y) = -p \log 2\pi - \log|V|.$$
Now
$$\log|D_2| - \log|V| = \log\left| D_2 V^{-1} \right| = \log\left| \tfrac{\tau}{2} D_2 q''(\bar{\theta}) + \lambda I \right| = \sum_i \log(e_i + \lambda)$$
by the previous derivation. Thus we seek to minimise
$$\tau q(\bar{\theta}) - n \log \tau + \lambda r(\bar{\theta}) - p \log \lambda + \sum_i \log(e_i + \lambda) + \text{const}.$$
Setting derivatives with respect to $\tau$ and $\lambda$ equal to zero reveals that
$$\frac{1}{\hat{\tau}} = \frac{q(\bar{\theta})}{n - p_D}, \qquad \frac{1}{\hat{\lambda}} = \frac{r(\bar{\theta})}{p_D},$$
which are the fitted likelihood and prior residual variation divided by their effective degrees of freedom derived from $p_D$.

These results were derived by MacKay using the form of $p_D$ given above, although we emphasise two differences in our approach. First, while MacKay needs to evaluate $\mathrm{tr}(V)$ explicitly, our $p_D$ arises without any additional computation. Second, we would recommend including $\tau$ and $\lambda$ in the general MCMC estimation procedure rather than relying on Type II maximum likelihood estimates (Ripley).


General one-parameter exponential family

We assume that we have $p$ groups of observations, where each of the $n_i$ observations in group $i$ has the same distribution. Following McCullagh and Nelder, we define a one-parameter exponential family for the $j$th observation in the $i$th group as
$$\log p(y_{ij} \mid \theta_i, \phi) = w_i \left\{ y_{ij}\theta_i - b(\theta_i) \right\} / \phi + c(y_{ij}, \phi),$$
where standard results are that
$$E(Y_{ij} \mid \theta_i, \phi) = \mu_i = b'(\theta_i), \qquad V(Y_{ij} \mid \theta_i, \phi) = b''(\theta_i)\, \phi / w_i.$$

Both canonical ($\theta$) and mean ($\mu$) parameterisations may be of interest. Here we shall focus on the canonical parameterisation in terms of $\theta_i$, as its posterior distribution should better fulfil the asymptotic normal approximation underlying the optimality criterion described above; related identities are of course available for the mean parameterisation in terms of $\mu_i = b'(\theta_i)$. Often the choice of parameterisation has little impact on the resulting conclusions. However, in the examples below we present a Bernoulli model where this choice does prove to be important, and we return to parameterisation invariance issues in the discussion.

Writing $\bar{b}_i = E_{\theta_i \mid y}[b(\theta_i)]$, we easily obtain that the contribution of the $i$th group to the effective number of parameters is
$$p_{Di} = 2 n_i w_i \left\{ \bar{b}_i - b(\bar{\theta}_i) \right\} / \phi.$$

Poisson likelihood with conjugate prior. Here $\mu_i = e^{\theta_i}$, $w_i = \phi = 1$ and $b(\theta_i) = e^{\theta_i}$, and hence
$$p_{Di} = 2 n_i \left\{ E_{\theta_i \mid y}\!\left[ e^{\theta_i} \right] - e^{\bar{\theta}_i} \right\}.$$
Let us assume a conjugate prior $\mu_i = e^{\theta_i} \sim \mathrm{Gamma}(\alpha, \beta)$. Now if $X \sim \mathrm{Gamma}(a, b)$ then $E[\log X] = \psi(a) - \log b$, where $\psi(t) = \Gamma'(t)/\Gamma(t)$ is the digamma function. It follows that
$$p_{Di} = \frac{2 n_i}{\beta + n_i} \left\{ \alpha + y_{i\cdot} - e^{\psi(\alpha + y_{i\cdot})} \right\}, \qquad \text{where } y_{i\cdot} = \sum_j y_{ij}.$$
Using Stirling's approximation $\psi(x) = \log x - 1/(2x) + O(x^{-2})$ (see e.g. Abramowitz and Stegun) and a one-term Maclaurin expansion of the exponential function, we obtain that
$$p_D \approx \sum_i \frac{n_i}{n_i + \beta}.$$
We note that this does not depend on the data observed, and each group contributes the fraction of its sample size to the 'effective sample size' underlying the posterior. As the sample size in each group increases, its contribution tends towards 1.
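A small numerical check of this approximation (illustrative Python using SciPy's digamma; the values of $\alpha$, $\beta$, $n_i$ and the group total are arbitrary):

```python
import numpy as np
from scipy.special import digamma

alpha, beta = 1.0, 1.0          # conjugate Gamma prior (made-up values)
n_i, y_sum = 10, 7              # group size and total count (made up)

# Exact contribution: 2*n_i*( E[exp(theta)] - exp(E[theta]) ) under the
# Gamma(alpha + y_sum, beta + n_i) posterior for mu_i = exp(theta_i).
a_post, b_post = alpha + y_sum, beta + n_i
exact = 2.0 * n_i * (a_post - np.exp(digamma(a_post))) / b_post

approx = n_i / (n_i + beta)     # Stirling / Maclaurin approximation
print(exact, approx)            # the two values are close
```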

Bernoulli likelihood with conjugate prior. In this case $w_i = \phi = 1$, $\theta_i = \mathrm{logit}(\mu_i) = \log\{\mu_i/(1 - \mu_i)\}$ and $b(\theta_i) = \log(1 + e^{\theta_i})$, and hence
$$p_{Di} = 2 n_i \left\{ E_{\theta_i \mid y}\!\left[ \log(1 + e^{\theta_i}) \right] - \log(1 + e^{\bar{\theta}_i}) \right\}.$$
Let us assume a conjugate prior $\mu_i = e^{\theta_i}/(1 + e^{\theta_i}) \sim \mathrm{Beta}(\alpha, \beta)$. Now if $X \sim \mathrm{Beta}(a, b)$ then $E[\log X] = \psi(a) - \psi(a + b)$ and $E[\log(1 - X)] = \psi(b) - \psi(a + b)$. Hence it can be shown that
$$p_{Di} = 2 n_i \left\{ \psi(\alpha + \beta + n_i) - \psi(\beta + n_i - y_{i\cdot}) - \log\!\left( 1 + e^{\psi(\alpha + y_{i\cdot}) - \psi(\beta + n_i - y_{i\cdot})} \right) \right\}.$$

A similar use of Stirling's approximation to that in the previous subsection shows that
$$p_D \approx \sum_i \frac{n_i}{n_i + \alpha + \beta}.$$
We note that this does not depend on the data observed, and again is the ratio of sample size to effective posterior sample size. As the sample size in each group increases, its contribution tends towards 1.

Asymptotic form

A second-order Taylor expansion of $D_i(\theta_i)$ around $D_i(\bar{\theta}_i)$ yields
$$D_i(\theta_i) \approx D_i(\bar{\theta}_i) - \frac{2 w_i}{\phi}\left\{ y_{i\cdot} - n_i b'(\bar{\theta}_i) \right\} (\theta_i - \bar{\theta}_i) + \frac{w_i n_i}{\phi}\, b''(\bar{\theta}_i)\, (\theta_i - \bar{\theta}_i)^2,$$
and hence, as each $n_i$ increases, the contribution to $p_D$ tends to
$$p_{Di} \approx \frac{w_i n_i\, b''(\bar{\theta}_i)}{\phi}\, V(\theta_i \mid y),$$
the precision in the likelihood divided by the posterior precision.

It can be shown that, for Poisson and Bernoulli likelihoods with conjugate priors, this approximation yields precisely the results obtained above by using Stirling's approximation on the exact contribution.

Generalised linear and mixed models with canonical link functions

Following McCullagh and Nelder, we assume the mean of $y_{ij}$ is related to a set of covariates $x_i$ through a link function $g(\mu_i) = x_i^T \beta$, and that $g$ is the canonical link, so that $\theta_i = x_i^T \beta$.

Canonical parameterisation

Using the asymptotic result above, we have
$$p_{Di} \approx \frac{w_i n_i\, b''(\bar{\theta}_i)}{\phi}\, V(\theta_i \mid y) = \frac{w_i n_i\, b''(\bar{\theta}_i)}{\phi}\, x_i^T\, V(\beta \mid y)\, x_i.$$
Since $b''(\theta_i) = 1/g'(\mu_i)$ and $V(y_{ij} \mid \theta_i, \phi) = b''(\theta_i)\phi/w_i$,
$$\frac{n_i w_i\, b''(\bar{\theta}_i)}{\phi} = \frac{n_i}{g'(\mu_i)^2\, V(y_{ij} \mid \theta_i, \phi)} = W_i,$$
where the $W_i$ are the GLM iterated weights (McCullagh and Nelder). Hence
$$p_D \approx \mathrm{tr}\!\left[ X^T W X\, V(\beta \mid y) \right].$$
Under a $N(0, V_p)$ prior on $\beta$, the prior contribution to the negative Hessian matrix at the mode is just $V_p^{-1}$, so under the canonical link the approximate normal posterior has variance
$$V(\beta \mid y) \approx \left( V_p^{-1} + X^T W X \right)^{-1},$$
again producing $p_D$ as a measure of the ratio of likelihood to posterior information.

Laird-Ware normal models

Laird and Ware specified the mixed normal model
$$y \sim N(X\alpha + Z\beta, R), \qquad \beta \sim N(0, D),$$
where $R$ and $D$ are currently assumed known. The fixed effects are $\alpha$ and the random effects are $\beta$; placing a uniform prior on $\alpha$, we can write this model within the general Lindley-Smith formulation by setting $C_1 = R$, $A_1 = (X \;\; Z)$, $\theta = (\alpha, \beta)$, and $C_2$ as a block-diagonal matrix with 'infinities' in the top-left block, $D$ in the bottom right, and zeros elsewhere.

We have already shown that in these circumstances
$$p_D = \mathrm{tr}\!\left[ A_1^T C_1^{-1} A_1 \left( A_1^T C_1^{-1} A_1 + C_2^{-1} \right)^{-1} \right],$$
and substituting in the appropriate entries for the Laird-Ware model gives $p_D = \mathrm{tr}(V_1 V_2^{-1})$, where
$$V_1 = \begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z \end{pmatrix}, \qquad V_2 = \begin{pmatrix} X^T R^{-1} X & X^T R^{-1} Z \\ Z^T R^{-1} X & Z^T R^{-1} Z + D^{-1} \end{pmatrix},$$
which is the precision of the parameter estimates assuming $D^{-1} = 0$, relative to the precision assuming $D$ known.

Generalised linear mixed models

We now consider the class of generalized linear mixed models with canonical link, in which $g(\mu_i) = \theta_i = x_i^T \alpha + z_i^T \beta$, where $\beta \sim N(0, D)$ (Breslow and Clayton).

Using the same argument as for GLMs, we find that
$$p_D \approx \mathrm{tr}\!\left[ (X \;\; Z)^T W (X \;\; Z)\; V(\alpha, \beta \mid y) \right] \approx \mathrm{tr}(V_1 V_2^{-1}),$$
where
$$V_1 = \begin{pmatrix} X^T W X & X^T W Z \\ Z^T W X & Z^T W Z \end{pmatrix}, \qquad V_2 = \begin{pmatrix} X^T W X & X^T W Z \\ Z^T W X & Z^T W Z + D^{-1} \end{pmatrix}.$$
This matches the proposal of Lee and Nelder, except that their $D^{-1}$ is a diagonal matrix of the second derivatives of the prior 'likelihood' for each random effect.

Using DIC for model diagnostics

We noted above that in general linear models the contribution of each observation to $p_D$ turned out to be its leverage, defined as the relative influence each observation has on its own fitted value, and we suggest that this interpretation may be adopted in general model fitting. Thus, as a by-product of MCMC estimation, we may obtain deviance residuals and estimates of leverage for each observation. Suppose we are working with the saturated deviance $D_S$, where the contribution of an individual observation $i$ to DIC is
$$\mathrm{DIC}_i = D_{S,i}(\bar{\theta}) + 2\, p_{Di} = p_{Di} + dr_i^2,$$
where $dr_i = \sqrt{\bar{D}_{S,i}}$, with sign given by the sign of $y_i - E(y_i \mid \bar{\theta})$, is the deviance residual, defined analogously to McCullagh and Nelder.

A wide range of diagnostic plots of these quantities is possible, following the structure established for non-hierarchical models. For example, Figure 2 shows a plot of $dr_i$ against $p_{Di}$ for each of the models considered for the lip cancer example introduced earlier. The dashed lines marked on each plot are of the form $x^2 + y = c$, and points lying along such a parabola each contribute an amount $\mathrm{DIC}_i = c$ to the DIC for that model. For models 3-9 the parabolas are marked at two values of $c$, and any data point whose contribution $\mathrm{DIC}_i$ exceeds the larger value is labelled by its observation number. For models 1 and 2 the parabolas are marked at larger values of $c$, since the deviance residuals and the individual contributions to DIC are much larger; for clarity, only points whose $\mathrm{DIC}_i$ exceeds the larger threshold are labelled by their observation number.

Figure 2 identifies the two counties with no observed cases ($y_i = 0$) as outliers under each of the random effects models. One further county appears to be an outlier in the models which have a spatial effect, but not in the remaining models. Further investigation reveals that this county has only a few cases compared to its expected count, whilst each of its neighbouring counties has a high observed count relative to its expected count. The spatial prior in models 6-9 causes the estimated rate in this county to be smoothed towards the mean of its neighbours' rates, thus leading to the discrepancy between observed and fitted values. However, since the observation still exercises considerable weight on its own fitted value, the leverage is high as well.
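A sketch of how these per-observation quantities can be assembled from MCMC output (illustrative Python; `dev_i_samples` holds monitored saturated deviance contributions per observation per iteration, and `dev_i_at_mean` the contributions evaluated at the posterior means, both placeholders rather than output from the paper's analyses):

```python
import numpy as np
import matplotlib.pyplot as plt

def leverage_residual_plot(dev_i_samples, dev_i_at_mean, signs):
    """Plot leverage (p_Di) against signed deviance residual (dr_i).

    dev_i_samples : (n_iter, n_obs) saturated deviance contributions per draw
    dev_i_at_mean : (n_obs,) contributions evaluated at the posterior means
    signs         : (n_obs,) sign of y_i - E(y_i | theta_bar)
    """
    dbar_i = dev_i_samples.mean(axis=0)
    p_di = dbar_i - dev_i_at_mean            # per-observation leverage
    dr = signs * np.sqrt(dbar_i)             # signed deviance residual
    plt.scatter(dr, p_di)
    x = np.linspace(dr.min(), dr.max(), 200)
    for c in (2.0, 4.0):                     # reference parabolas x^2 + y = c
        plt.plot(x, c - x ** 2, linestyle="--")
    plt.xlabel("deviance residual"); plt.ylabel("leverage")
    return p_di, dr
```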

Further numerical examples

Seeds: random effects logistic regression with alternative priors

Crowder presents an analysis of the proportion of seeds that germinate on each of 21 plates, arranged according to a 2 by 2 factorial layout depending on the binary variables seed type ($x_1$) and root extract ($x_2$). The number of seeds $r_i$ which germinate out of the $n_i$ seeds on plate $i$ is assumed to follow a binomial distribution, $r_i \sim \mathrm{Binomial}(p_i, n_i)$. We compare the following logistic regression models for $\theta_i = \mathrm{logit}(p_i)$:
Model 1: $\theta_i = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \alpha_{12} x_{1i} x_{2i}$
Model 2: $\theta_i = \alpha_0$
Model 3: $\theta_i = \alpha_0 + b_i$
Model 4: $\theta_i = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + \alpha_{12} x_{1i} x_{2i} + b_i$
where locally uniform priors (actually Normal with mean zero and very small precision) are placed on $\alpha_0$, $\alpha_1$, $\alpha_2$ and $\alpha_{12}$, and the $b_i$ are exchangeable random effects with a Normal prior distribution having zero mean and precision $\tau$. Four alternative prior specifications were considered for the random effects precision $\tau$:
Prior A: $\tau \sim \mathrm{Gamma}(\cdot, \cdot)$
Prior B: $\tau \sim \mathrm{Gamma}(\cdot, \cdot)$
Prior C: $\tau \sim \mathrm{Pareto}(\cdot, \cdot)$
Prior D: $\tau \sim \mathrm{Pareto}(\cdot, \cdot)$
Prior A is 'just proper' but diffuse. Prior B is proper (see Smith et al. for an argument in favour of this prior). Priors C and D are equivalent to uniform priors, over two different ranges, on $\sigma = \tau^{-1/2}$, the standard deviation of the random effects.


Figure 2: Leverage ($p_{Di}$) plotted against deviance residual for each model considered in the lip cancer example; dashed lines are curves of constant $\mathrm{DIC}_i$.


Model          D̄_S     D_S(θ̄)    p_D     DIC     interval for σ
1
2
3  A
3  B
3  C
3  D
4  A
4  B
4  C
4  D

Table 2: Deviance and posterior summaries for the seed germination data.

We emphasise that such a list of priors must be specified strictly externally to the information in the data; otherwise the model comparison criterion will be unreasonably favourable.

For this binomial model we adopt the saturated deviance (McCullagh and Nelder),
$$D_S(\theta) = 2 \sum_i \left\{ r_i \log\!\left( \frac{r_i}{n_i p_i} \right) + (n_i - r_i) \log\!\left( \frac{n_i - r_i}{n_i (1 - p_i)} \right) \right\},$$
obtained by taking $p_i = e^{\theta_i}/(1 + e^{\theta_i})$ and $2 \log f(r) = 2 \sum_i \log p\!\left( r_i \mid \mathrm{logit}(p_i) = \mathrm{logit}(r_i/n_i) \right)$ as the standardising factor.

For each model and prior we ran an MCMC sampler in BUGS, following a burn-in period. The resulting deviance summaries are given in Table 2.

The estimates of $p_D$ for the two non-hierarchical models closely approximate the actual number of parameters in each model (4 for model 1 and 1 for model 2). The estimates of $p_D$ for model 3 imply that the random effects contribute a number of effective parameters that varies with the prior, plus one parameter for $\alpha_0$. However, including the three covariate effects in model 4 results in a net decrease in $p_D$: the random effects now contribute fewer effective parameters, since the covariates explain a substantial amount of the between-plate variation. This conclusion is supported by examination of DIC, which favours model 4, under any of the four priors, over the other models considered.

Turning to the comparison of priors for $\tau$, we reach different conclusions depending on the linear predictor. For model 3, which includes only an intercept term plus the random effects, DIC makes little distinction between priors A, B and C, but prefers each of them over prior D. Examination of the posterior distributions for $\sigma = \tau^{-1/2}$ explains why: priors A, B and C yield similar estimates of $\sigma$, whereas prior D does not support the larger plausible values of $\sigma$, thus constraining the random effects to be less variable than the data suggest. Including the covariate and interaction effects in the linear predictor of model 4 explains a considerable amount of the between-plate variation, with a corresponding reduction in the estimated size of $\sigma$ under all four priors. Prior D now provides support across the plausible range of values for $\sigma$ and hence is no longer penalised by DIC. By contrast, the greater uncertainty associated with the just-proper prior A results in a higher value of DIC compared to the remaining priors.


From the perspective of absolute fit, comparing $\bar{D}_S$ with $n$ shows the hierarchical models to be reasonable, with prior B being particularly favoured. These data are also used by Lee and Nelder to illustrate their method for estimating the scaled deviance and degrees of freedom of hierarchical generalised linear models; our model 4 with prior A corresponds to the GLMM shown in their table. Their estimate of the degrees of freedom is identical to our value of $n - p_D$, reinforcing the analysis above. Their estimated scaled deviance contrasts with our $\bar{D}_S$; however, their scaled deviance is based on direct estimates of the means $p_i$ rather than on estimates of the canonical parameters, and if we use the equivalent mean parameterisation we obtain a comparable value of $\bar{D}_S$.


Stacks: robust regression

Spiegelhalter et al. consider a variety of different error structures for the oft-analyzed stack loss data of Brownlee. Here the response variable $y_i$, the amount of stack loss (escaping ammonia) in an industrial application, is regressed on three predictor variables: air flow ($x_1$), temperature ($x_2$) and acid concentration ($x_3$). Assuming the usual linear regression structure for the mean,
$$\mu_i = \beta_0 + \beta_1 z_{1i} + \beta_2 z_{2i} + \beta_3 z_{3i},$$
where $z_{ij} = (x_{ij} - \bar{x}_j)/\mathrm{sd}(x_j)$ are the standardized covariates, the presence of a few prominent outliers amongst the $n = 21$ cases motivates comparison of the following four error distributions:
Model 1: $y_i \sim \mathrm{Normal}(\mu_i, \tau)$
Model 2: $y_i \sim \mathrm{DE}(\mu_i, \tau)$
Model 3: $y_i \sim \mathrm{Logistic}(\mu_i, \tau)$
Model 4: $y_i \sim t_d(\mu_i, \tau)$
where $\tau$ denotes a precision parameter, DE denotes the double exponential (Laplace) distribution, and $t_d$ denotes the Student's $t$ distribution with $d$ degrees of freedom.

Unlike our other examples, the form of the likelihood changes with each model, so we must take care with normalizing constants when computing the full ('null') deviances, $-2 \times$ log-likelihoods. These emerge as follows:
$$D_1(\theta) = \tau \sum_{i=1}^n (y_i - \mu_i)^2 - n \log\!\left( \frac{\tau}{2\pi} \right),$$
$$D_2(\theta) = 2\tau \sum_{i=1}^n |y_i - \mu_i| - 2 n \log\!\left( \frac{\tau}{2} \right),$$
$$D_3(\theta) = 2 \sum_{i=1}^n \left[ 2 \log\!\left( 1 + e^{\tau (y_i - \mu_i)} \right) - \tau (y_i - \mu_i) \right] - 2 n \log \tau,$$
$$D_4(\theta) = (d + 1) \sum_{i=1}^n \log\!\left( 1 + \frac{\tau (y_i - \mu_i)^2}{d} \right) - n \log\!\left( \frac{\tau}{d\pi} \right) - 2 n \log\!\left( \frac{\Gamma\{(d+1)/2\}}{\Gamma(d/2)} \right).$$

A well-known alternative to direct fitting of many symmetric but non-normal error distributions is through scale mixtures of normals (Andrews and Mallows). From Carlin and Louis we have the alternative $t_d$ formulation
Model 5: $y_i \sim \mathrm{Normal}(\mu_i, \tau w_i), \qquad w_i \sim \mathrm{Gamma}\!\left( \tfrac{d}{2}, \tfrac{d}{2} \right),$
and corresponding null deviance expression
$$D_5(\theta) = \sum_{i=1}^n \left[ \tau w_i (y_i - \mu_i)^2 - \log\!\left( \frac{\tau w_i}{2\pi} \right) \right],$$
a simple modification of the Gaussian expression.
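To make the normalizing-constant bookkeeping explicit, here is a sketch of two of these deviance functions (illustrative Python under the precision parameterisation assumed above; `y`, `mu` and `tau` are placeholders for posterior quantities, not values from the paper):

```python
import numpy as np

def deviance_normal(y, mu, tau):
    """-2 log-likelihood for y_i ~ Normal(mu_i, precision tau), constants included."""
    n = len(y)
    return tau * np.sum((y - mu) ** 2) - n * np.log(tau / (2.0 * np.pi))

def deviance_double_exponential(y, mu, tau):
    """-2 log-likelihood for y_i ~ DE(mu_i, tau), density (tau/2) exp(-tau|y - mu|)."""
    n = len(y)
    return 2.0 * tau * np.sum(np.abs(y - mu)) - 2.0 * n * np.log(tau / 2.0)

# Because these constants differ between error models, dropping them would
# bias any DIC comparison across Models 1-5.
```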

Model                      D̄      D(θ̄)    p_D     DIC
1  Normal
2  DE
3  Logistic
4  t_4
5  t_4 as scale mixture

Table 3: Deviance results for the stack loss data.

Following Spiegelhalter et al., we set $d = 4$, and for each model we placed essentially flat priors (actually normal with mean zero and very small precision) on the $\beta_j$ and a vague Gamma prior on $\tau$, and ran the Gibbs sampler in BUGS following a burn-in period.

Replacing $\theta$ and the $w_i$ by their posterior means where necessary for the $D(\bar{\theta})$ calculation, the resulting deviance summaries are shown in Table 3 (note that the mean parametrization and the canonical parametrization are equivalent here, since the mean $\mu_i$ is a linear function of the canonical parameters). Beginning with a comparison of the first four models, the estimates of $p_D$ are all just over the actual number of parameters for this example. The DIC values imply that Model 2 (double exponential) is best, followed by the $t_4$, the logistic, and finally the normal. Clearly this ordering is consistent with the models' respective abilities to accommodate outliers.

Turning to the normal scale mixture representation of the $t_4$ likelihood (Model 5), the $p_D$ value suggests that the $w_i$ random effects contribute only a few extra effective parameters. However, the model's smaller DIC value implies that the extra mixing parameters are 'worth it' in an overall quality-of-fit sense. We emphasize that the results from Models 4 and 5 need not be equal since, while they lead to the same marginal likelihood for the $y_i$, they correspond to different prediction problems.

Finally, plots of deviance residuals versus leverages (not shown) clearly identify the observations determined to be outlying by several previous authors analysing this dataset.

Panel: longitudinal binary observations

To illustrate how the mean and canonical parameterisations introduced earlier can sometimes lead to different conclusions, our next example considers a subset of data from the Six Cities study, a longitudinal study of the health effects of air pollution (see Fitzmaurice and Laird for the data and a likelihood-based analysis). The data consist of repeated binary measurements $y_{ij}$ of the wheezing status (1 = yes, 0 = no) of child $i$ at time $j$,


                 Canonical parametrization        Mean parametrization
Model        D̄     D(θ̄)    p_D    DIC           D(θ̄)    p_D    DIC
1  logit
2  probit
3  cloglog

Table 4: Results for both parameterizations of the Bernoulli panel data.

for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, for each of $I$ children living in Steubenville, Ohio, at $J$ time points. We are given two predictor variables: $a_{ij}$, the age of child $i$ in years at measurement point $j$, and $s_i$, the smoking status of child $i$'s mother (1 = yes, 0 = no). Following the Bayesian analysis of Chib and Greenberg, we adopt the conditional response model
$$Y_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad p_{ij} = \Pr(Y_{ij} = 1) = g^{-1}(\mu_{ij}),$$
$$\mu_{ij} = \beta_0 + \beta_1 z_{ij1} + \beta_2 z_{ij2} + \beta_3 z_{ij3} + b_i,$$
where $z_{ijk} = (x_{ijk} - \bar{x}_k)/\mathrm{sd}(x_k)$, $k = 1, 2, 3$, with $x_{ij1} = a_{ij}$, $x_{ij2} = s_i$ and $x_{ij3} = a_{ij} s_i$, a smoking-age interaction term. The $b_i$ are individual-specific random effects, initially given an exchangeable $N(0, \lambda^{-1})$ specification, which allow for dependence among the longitudinal responses for child $i$.

The model choice issue here is to determine the most appropriate link function $g$ among three candidates, namely the logit, the probit and the complementary log-log. More formally, our three models are
Model 1: $g(p_{ij}) = \mathrm{logit}(p_{ij}) = \log\{p_{ij}/(1 - p_{ij})\}$
Model 2: $g(p_{ij}) = \mathrm{probit}(p_{ij}) = \Phi^{-1}(p_{ij})$
Model 3: $g(p_{ij}) = \mathrm{cloglog}(p_{ij}) = \log\{-\log(1 - p_{ij})\}$
Since the Bernoulli likelihood is unaffected by this choice, in all cases the null deviance takes the simple form
$$D(\theta) = -2 \sum_{ij} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \right].$$
Placing flat priors on the $\beta_k$ and a vague Gamma prior on $\lambda$, and running the Gibbs sampler following a burn-in period, produces the deviance summaries in Table 4 for the canonical and mean parametrizations respectively, where the canonical parametrization constructs $\bar{\mu}_{ij}$ as the posterior mean of the linear predictor $\mu_{ij} = \beta_0 + \beta_1 z_{ij1} + \beta_2 z_{ij2} + \beta_3 z_{ij3} + b_i$ and then applies the appropriate linking transformation (logit, probit or cloglog) to obtain the plug-in means for the $p_{ij}$. The mean parametrization simply uses the posterior means of the $p_{ij}$ themselves when computing $D(\bar{\theta})$.

DIC prefers the cloglog link under the canonical parametrization, but the probit link under the mean parametrization; such disagreement is unfortunate, but is quite feasible with Bernoulli distributions, in which the posterior distributions of the fitted means will be highly non-normal. We repeat that we prefer the canonical results, due to the improved normality of the posterior distributions.

CAMCOG: informative missing data

Best et al. present an analysis of longitudinal data from a study of dementia and cognitive decline in elderly women, who were interviewed at the start of the study and assessed for cognitive function using the CAMCOG test, a neuropsychological assessment scale. A repeat interview was conducted some years later, but only part of the original cohort was reassessed; the remaining women had either died, moved away or refused to participate. Best et al. argue that the dropout mechanism may be non-ignorable and must be modelled explicitly. Following their published analysis, we consider a linear regression model for cognitive function at the second interview:
$$y_{2i} \sim \mathrm{Normal}(\mu_i, \tau^{-1}), \qquad \mu_i = \beta_0 + \beta_1 y_{1i} + \beta_2 x_i,$$
where $y_{ki}$ is the CAMCOG score for subject $i$ at interview $k$ and $x_i$ represents age at the second interview. We assume vague but proper priors for the regression coefficients $\beta_0$, $\beta_1$ and $\beta_2$ (actually independent Normal with mean zero and very small precision) and for the inverse measurement variance $\tau$ (actually Gamma). A logistic selection model was specified for the non-response mechanism:
$$m_i \sim \mathrm{Bernoulli}(p_i), \qquad \mathrm{logit}(p_i) = \phi_0 + \phi_1 y_{2i},$$
where $m_i$ is a binary variable indicating whether the CAMCOG score for subject $i$ was missing at the second interview.

We compare four alternative prior assumptions for the parameters of this model:
Prior 1: $\phi_0 \sim \mathrm{Normal}(\cdot, \cdot)$, $\phi_1 = 0$
Prior 2: $\phi_0$ and $\phi_1$ fixed at elicited values
Prior 3: $\phi_0 \sim \mathrm{Normal}(\cdot, \cdot)$, $\phi_1 \sim \mathrm{Normal}(\cdot, \cdot)$ (informative)
Prior 4: $\phi_0 \sim \mathrm{Normal}(\cdot, \cdot)$, $\phi_1 \sim \mathrm{Normal}(\cdot, \cdot)$ (diffuse)
Prior 1 corresponds to a non-informative dropout mechanism; priors 2 and 3 represent informative prior distributions elicited from an expert psychiatrist, obtained by fitting a logistic curve to the expert's best guess at the proportion of women she would expect to drop out given different true CAMCOG scores; prior 4 is a diffuse prior on the unknown parameters of the informative dropout model.

Since it was necessary to augment the observed dataset with the missing value indicator $m_i$ in order to model the dropout mechanism explicitly, we must take account of these additional data when calculating the model's null deviance, i.e.
$$D(\theta) = \sum_{i \in \mathrm{obs}} \left[ \tau (y_{2i} - \mu_i)^2 - \log\!\left( \frac{\tau}{2\pi} \right) \right] - 2 \sum_i \left[ m_i \log p_i + (1 - m_i) \log(1 - p_i) \right].$$
The resulting deviance summaries for each prior are shown in Table 5.

Considering first the deviance contribution from the missing-data sub-model, we see that $p_D$ closely approximates the true number of parameters for all priors, and that DIC clearly rejects the non-informative dropout model (prior 1) in favour of the three informative missing-data models (priors 2-4). Prior 2, in which the coefficients $\phi_0$ and $\phi_1$ are fixed at the expert's prior values, is slightly preferred over priors 3 and 4, in which the coefficients are assumed to be unknown. In fact, the


Model (prior)      D̄      D(θ̄)    p_D     DIC

Deviance contribution from y_i^obs
1
2
3
4

Deviance contribution from m_i
1
2
3
4

Total deviance
1
2
3
4

Table 5: Deviance summaries for the CAMCOG data.

posterior distributions for $\phi_0$ and $\phi_1$ under priors 3 and 4 are close to the expert's prior values, implying that her prior guess at the non-response rates as a function of the true CAMCOG score was very accurate. Consequently, the additional complexity imposed by treating $\phi_0$ and $\phi_1$ as unknown in priors 3 and 4 is unnecessary and is penalised by DIC.

The deviance contributions from the observed response data $y_{2i}$ result in nearly identical values of DIC and $p_D$ irrespective of the prior mechanism assumed for the missing data. This seems reasonable, since the likelihood terms for this part of the model are identical under all four priors. Consequently, the differences in total DIC and $p_D$ for each model simply reflect the differences associated with the missing-data sub-models.

Metho ds for mo del comparison

From an applied viewp oint our aim lies not in simply determining the eective dimension of a

mo del but in using the deviance to help compare comp eting and p erhaps nonnested mo dels of

varying typ es We are therefore adding to the long list of criteria that have b een suggested for

comparing or cho osing between mo dels following the nomenclature of Dempster a these

can be broadly distinguished into predictive criteria that compare the ability of prior and mo del

assumptions to predict the currently observed data and postdictive criteria that assess assumptions

conditional on the observed data

Predictive criteria The basis for such criteria can b e thought of as a sequential series of predictive

statements Dawid which if a full probability mo del is b eing assessed b ecomes the marginal

R

likeliho o d py py jpd The resulting Bayes factors Kass and Raftery may be

used to obtain p osterior probabilities of comp eting mo dels However in the discussion of Aitkin

Smith p ointed out that this formulation may only b e appropriate in circumstances where

it was really b elieved one and only one of the comp eting mo dels were in fact true and the

crucial issue was to cho ose this correct mo del this discussion is elab orated in Bernardo and Smith

Bayesian deviance

chapter Wewould argue that this is not the situation for the kind of examples discussed

in this pap er we neither b elieveany of the mo dels is actually true nor do we wish to formulate a

decision problem of strict mo del choice

For a non-hierarchical model with $p$ parameters and $n$ observations, the Bayes (or Schwarz) information criterion (Schwarz), given by $\mathrm{BIC} = -2 \log p(y \mid \hat\theta) + p \log n$, has been widely promoted, but its implementation for hierarchical models has been controversial. This is due to the uncertainty concerning the proper values of both $p$ and $n$ in such situations. For instance, in our basic one-parameter exponential family hierarchical model introduced earlier, should $n$ be chosen as $n = \sum_{i=1}^{I} n_i$, the total number of observations, or as $I$, the number of groups? If the observations within each group are independent then the former choice appears more sensible, whereas if they are perfectly correlated the latter is more appropriate. In practice the true level of within-group correlation will be somewhere in between, so the choice of $n$ should be as well. Unfortunately, asymptotic theory seems of little help here: for instance, in the context of normal linear hierarchical models, Pauler shows that even the two extreme choices for $n$ above are defensible asymptotically. In this same context, SAS Proc MIXED currently employs the smaller choice in the interest of conservatism (i.e. retaining predictors in the final model); the larger choice may often go too far in penalizing larger models. Working instead in the Cox survival model setting, Volinsky (unpublished Univ. of Washington PhD dissertation) reaches a similarly ambiguous conclusion, where now the choice for $n$ is between the number of patients in the study and the number of events (deaths) observed, though he suggests use of the latter based on simulation work and analytic results from a simpler exponential survival model. It seems possible that $p_D$ might have a role to play in adapting BIC to hierarchical models.
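To make the ambiguity over $n$ concrete, the sketch below evaluates the BIC penalty for a toy grouped data set under the two extreme choices of $n$ discussed above, using a crude plug-in Gaussian fit; the data-generating model and all names are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy grouped data: I groups with n_i observations each.
I, n_i = 8, 25
group_means = rng.normal(0.0, 1.0, size=I)
y = group_means[:, None] + rng.normal(0.0, 1.0, size=(I, n_i))

# Crude plug-in fit: one mean per group, known unit variance, so p = I parameters.
mu_hat = y.mean(axis=1)
log_lik_hat = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu_hat[:, None]) ** 2)

p = I
for n, label in [(I * n_i, "total observations"), (I, "number of groups")]:
    bic = -2.0 * log_lik_hat + p * np.log(n)
    print(f"BIC with n taken as {label:18s} (n = {n:3d}): {bic:9.1f}")
```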

Postdictive criteria. There have been a number of recent suggestions for comparing models in the light of the observed data, usually based on estimates of their predictive ability on a replicate dataset.

Aitkin suggested using the posterior mean of the likelihood, and contrasts the resulting posterior Bayes factors (PBF) with AIC and BIC (Aitkin). We feel happier with our use of the log-likelihood, in that the log probability ordinate is a proper scoring rule for evaluating predictions (Dawid), in contrast to the ordinate itself. In addition, the resulting penalty for complexity in the PBF appears insufficient.
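The distinction between averaging the likelihood and averaging the log-likelihood over the posterior is easy to see numerically; the sketch below computes both quantities from exact posterior draws in a toy conjugate normal model (all names and settings are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model y_i ~ N(theta, 1), theta ~ N(0, 10^2): sample the exact posterior.
y = rng.normal(0.5, 1.0, size=30)
n = len(y)
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * n * y.mean()
theta = rng.normal(post_mean, np.sqrt(post_var), size=50_000)

log_lik = (-0.5 * n * np.log(2 * np.pi)
           - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1))

# Log of the posterior mean of the likelihood (the ingredient of a posterior Bayes factor) ...
log_mean_lik = log_lik.max() + np.log(np.mean(np.exp(log_lik - log_lik.max())))
# ... versus the posterior mean of the log-likelihood, as used for the mean deviance here.
mean_log_lik = log_lik.mean()
print("log posterior mean of likelihood:", log_mean_lik)
print("posterior mean of log-likelihood:", mean_log_lik)
```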

Laud and Ibrahim and Gelfand and Ghosh suggest minimising a predictive discrepancy measure $E[\,d(y_{\mathrm{new}}, y_{\mathrm{obs}}) \mid y_{\mathrm{obs}}\,]$, where $y_{\mathrm{new}}$ is a draw from the posterior predictive distribution $p(y_{\mathrm{new}} \mid y_{\mathrm{obs}})$ and we might, for instance, take $d(y_{\mathrm{new}}, y_{\mathrm{obs}}) = (y_{\mathrm{new}} - y_{\mathrm{obs}})^T (y_{\mathrm{new}} - y_{\mathrm{obs}})$. These authors show their measures also have attractive interpretations as weighted sums of goodness-of-fit and predictive-variability penalty terms. However, proper choice of the criterion requires fairly involved analytic work, as well as several subjective choices about the utility function appropriate
for the problem at hand. Furthermore, the one-way ANOVA model considered earlier gives rise to a fit term equivalent to $D(\bar\theta)$ and a predictive-variability term equal to $p_D$; thus their suggestion is equivalent in this context to comparison by $\bar{D}$ which, although invariant to parameterisation, again does not seem to sufficiently penalize complexity.
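A minimal sketch of the squared-error version of such a discrepancy, estimated by drawing replicate data sets from the posterior predictive distribution of a toy conjugate normal model; the example and names are illustrative assumptions rather than the criterion as calibrated by those authors.

```python
import numpy as np

rng = np.random.default_rng(3)

# Observed data and exact posterior for a toy model y_i ~ N(theta, 1), theta ~ N(0, 10^2).
y_obs = rng.normal(1.0, 1.0, size=25)
n = len(y_obs)
post_var = 1.0 / (n + 1.0 / 100.0)
post_mean = post_var * n * y_obs.mean()

# Replicate datasets drawn from the posterior predictive distribution p(y_new | y_obs).
S = 20_000
theta = rng.normal(post_mean, np.sqrt(post_var), size=S)
y_new = rng.normal(theta[:, None], 1.0, size=(S, n))

# Squared-error discrepancy d(y_new, y_obs) = (y_new - y_obs)'(y_new - y_obs),
# averaged over the posterior predictive distribution.
d = ((y_new - y_obs[None, :]) ** 2).sum(axis=1)
print("E[ d(y_new, y_obs) | y_obs ] is approximately", d.mean())
```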

Ye and Ye and Wong extend Akaike by arguing that predictive error on a replicate dataset is minimised by penalizing the log-likelihood by the GDF, where GDF is the 'generalised degrees of freedom' measuring the expected sensitivity of the fitted values to their corresponding observed values: essentially the expected leverage. This is extremely similar to our proposal, although they directly identify measures of leverage as the necessary components of the effective number of parameters, rather than as a consequence of a more general formulation. We should therefore expect our and Ye's proposals to give similar conclusions, although we note that his computational methods require direct estimation of leverage through a series of simulated or systematic perturbations of the data, followed by repeated model fitting.
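The generalised degrees of freedom can be approximated exactly as described: perturb each observed response slightly, refit, and measure how much the corresponding fitted value moves. The sketch below does this for ordinary least squares, where the answer should be close to the number of regression coefficients; the data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression data and a least-squares fitting routine.
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0.0, 1.0, size=n)

def fitted(y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Generalised degrees of freedom: the sum of sensitivities d(yhat_i)/d(y_i),
# estimated by perturbing one observation at a time and refitting the model.
eps = 1e-6
base = fitted(y)
gdf = 0.0
for i in range(n):
    y_pert = y.copy()
    y_pert[i] += eps
    gdf += (fitted(y_pert)[i] - base[i]) / eps

print("estimated GDF:", gdf, "(close to", k, "for ordinary least squares)")
```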

Ripley discusses the general problem of model comparison, with particular reference to the structure of neural networks, and shows that previous suggestions for the effective number of parameters, made by Murata et al., Moody and others, are all essentially equivalent to the following definition of the effective number of parameters $p^\ast$. Let $p(y \mid \theta_0)$ be the closest distribution in the assumed model to the unobservable true model $p(y)$, where 'closest' is in terms of minimising Kullback-Leibler distance, and let $L(y, \theta)$ be some function of the data and
parameters that we seek to minimise. Then
$$p^\ast = \mathrm{tr}\left( K J^{-1} \right),$$
where
$$J = E\!\left[ \frac{\partial^2 L(Y, \theta)}{\partial\theta\,\partial\theta^T} \right]_{\theta_0}, \qquad
K = \mathrm{Var}\!\left[ \frac{\partial L(Y, \theta)}{\partial\theta} \right]_{\theta_0}.$$
We emphasise that the expectation and variance here are taken with respect to the unknown true sampling distribution. If we seek to maximise the joint likelihood of the data and parameters, i.e. take $L = -\{\log p(y \mid \theta) + \log p(\theta)\}$, and are willing to assume the true model is a member of $p(y \mid \theta)$ (which may be reasonable in a sufficiently richly parameterised model), then it is straightforward to show that $V^{-1} \approx nJ$ and $L'' \approx nK$, and hence $p^\ast \approx p_D$.
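In practical terms, $p_D$ and DIC require nothing more than the sampled deviances and the plug-in deviance at the posterior mean. A minimal self-contained sketch, in which the helper pD_and_DIC and the toy normal-mean example are our own illustrative choices:

```python
import numpy as np

def pD_and_DIC(log_lik, theta_draws, y):
    """Deviance summaries from MCMC output: Dbar, D(theta_bar), pD and DIC.

    log_lik(theta, y) returns the log-likelihood for a single parameter vector.
    """
    dev = np.array([-2.0 * log_lik(th, y) for th in theta_draws])  # sampled deviances
    theta_bar = theta_draws.mean(axis=0)                           # posterior mean
    D_bar = dev.mean()                                             # fit: mean deviance
    D_hat = -2.0 * log_lik(theta_bar, y)                           # plug-in deviance
    pD = D_bar - D_hat                                             # effective parameters
    return D_bar, D_hat, pD, D_bar + pD                            # DIC = Dbar + pD

# Tiny demonstration: y_i ~ N(theta, 1) with an effectively flat prior on theta.
rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=50)
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=(10_000, 1))

def ll(th, y):
    return float(-0.5 * len(y) * np.log(2 * np.pi) - 0.5 * np.sum((y - th[0]) ** 2))

print(pD_and_DIC(ll, draws, y))  # pD should be close to 1
```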

Conclusion: a brief appraisal of issues arising with DIC

We now attempt a brief summary of the disadvantages and advantages of using $p_D$ and DIC in routine statistical analysis.

A major problem in using $p_D$ is that it is only asymptotically invariant to the chosen parameterisation, since different fitted deviances $D(\bar\theta)$ may arise from substituting posterior means of alternative choices of $\theta$. For example:

In the one-parameter exponential family, different results will be obtained by using the mean or the canonical parameterisation. The decision-theoretic justification and the asymptotic identities for DIC are all improved by approximate posterior normality, which will be better achieved by using parameters defined on the whole real line. This is the basis for our earlier recommendation that the canonical parameterisation is the most appropriate choice, although the panel example showed this choice could be important with Bernoulli data.

In models with unknown scale parameters, there will be some dependence on whether to base $D(\bar\theta)$ on the posterior means of the standard deviations, variances, precisions, log-precisions or some other choice. In theory one could achieve invariance to the choice of nuisance parameters by re-running the sampler conditional on the posterior means of the primary parameters, and using the resulting mean deviance $E[D \mid y, \bar\theta_{\mathrm{primary}}]$ in place of $D(\bar\theta)$ in the calculation of $p_D$; this would now be the effective number of primary parameters.

Since approximate posterior normality is desirable, we would argue that the log-precision is the most appropriate scale, but fortunately the choice seems to make little difference in model comparison. For example, suppose we have a normal sample $y_i \sim N(\mu, \sigma^2)$, $i = 1, \ldots, n$, with standard locally uniform priors on $\mu$ and $\log\sigma$. It can be shown that, compared with the correct answer of 2, parameterising in terms of $\log\sigma$, $\sigma$ and $\sigma^2$ gives values of $p_D$ that each differ from 2 only by terms of order $1/n$. Hence the parameterisation $\log\sigma$ is preferable, but by a rather small margin.
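This claim is easy to check by simulation: the sketch below draws from the exact posterior for the normal sample above (flat priors on $\mu$ and $\log\sigma$) and computes $p_D$ with the scale plug-in taken as the posterior mean of $\sigma^2$, of $\sigma$ and of $\log\sigma$; all three values should be close to 2 for moderate $n$. The code and variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)

# Normal sample, flat priors on mu and log(sigma): sample the exact posterior.
n = 20
y = rng.normal(2.0, 1.5, size=n)
ybar, s2 = y.mean(), y.var(ddof=1)

S = 100_000
sigma2 = (n - 1) * s2 / rng.chisquare(n - 1, size=S)   # sigma^2 | y
mu = rng.normal(ybar, np.sqrt(sigma2 / n))             # mu | sigma^2, y

def deviance(mu, sigma2):
    # D(mu, sigma^2) = -2 log p(y | mu, sigma^2), with no standardising term
    rss = ((y[None, :] - mu[:, None]) ** 2).sum(axis=1)
    return n * np.log(2 * np.pi * sigma2) + rss / sigma2

D_bar = deviance(mu, sigma2).mean()
mu_bar = mu.mean()

# Scale plug-ins obtained from posterior means taken on three different scales.
plug_ins = {
    "sigma^2":   sigma2.mean(),
    "sigma":     np.sqrt(sigma2).mean() ** 2,
    "log sigma": np.exp(2.0 * np.log(np.sqrt(sigma2)).mean()),
}
for name, s2_hat in plug_ins.items():
    pD = D_bar - deviance(np.array([mu_bar]), np.array([s2_hat]))[0]
    print(f"pD with scale plug-in from posterior mean of {name:9s}: {pD:.3f}")
```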

As we have seen in the stacks example discussed earlier, there may be sensitivity to apparently innocuous restructuring of the model. This is to be expected, since DIC is not a function of the marginal likelihood of the data, and hence changes that might not change the Bayes factor do change DIC. By making such changes one is altering the definition of a replicate dataset, and hence one would expect DIC to change.

We have shown that our suggestion is strongly related to a range of previous proposals for postdictive model choice, but has the additional advantages that it:

- is completely general to any class of model;
- involves little additional analytic work;
- involves no extra Monte Carlo sampling;
- offers an effective model-screening technique that seems natural for a broad class of models and performs reasonably across a range of examples;
- is already implemented in the test version of WinBUGS, the most general Bayesian software package to date, and could easily be added to any other MCMC-based package;
- represents an attractive compromise between very informal model comparison methods (e.g. plotting posterior distributions of log-likelihoods), which are hard to use and interpret, and overly formal decision-theoretic methods (e.g. Bayes factors or posterior predictive discrepancy measures), which are predicated on a choice of a single model and require substantial analytical work and subjective choices, such as the selection of an appropriate utility function.

In conclusion, we feel that $p_D$ and DIC deserve further investigation as tools for model comparison.

Acknowledgements

We are very grateful for the generous discussion and criticism of the participants in the programme on Neural Networks and Machine Learning held at the Isaac Newton Institute for Mathematical Sciences, and to Andrew Thomas for so quickly implementing our changing ideas into WinBUGS. BPC received partial support from National Institute of Allergy and Infectious Diseases (NIAID) Grant R01-AI.

References

Abramowitz, M. and Stegun, I. Handbook of Mathematical Functions. Dover, New York.

Aitkin, M. Posterior Bayes factors (with discussion). Journal of the Royal Statistical Society, Series B.

Aitkin, M. The calibration of P-values, posterior Bayes factors and the AIC from the posterior distribution of the likelihood. Statistics and Computing.

Akaike, H. Information theory and an extension of the maximum likelihood principle. In 2nd International Symposium on Information Theory (ed. B. Petrov and F. Csaki). Akademiai Kiado, Budapest.

Andrews, D. and Mallows, C. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B.

Bernardo, J. M. Expected information as expected utility. Annals of Statistics.

Bernardo, J. M. and Smith, A. F. M. Bayesian Theory. John Wiley and Sons, Chichester, England.

Besag, J. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B.

Best, N. G., Spiegelhalter, D. J., Thomas, A. and Brayne, C. E. G. Bayesian analysis of realistically complex models. Journal of the Royal Statistical Society, Series A.

Box, G. E. P. and Tiao, G. C. Bayesian Inference in Statistical Analysis. Addison-Wesley.

Breslow, N. E. and Clayton, D. G. Approximate inference in generalized linear mixed models. Journal of the American Statistical Association.

Brownlee, K. A. Statistical Theory and Methodology in Science and Engineering. John Wiley and Sons, New York.

Carlin, B. P. and Louis, T. A. Bayes and Empirical Bayes Methods for Data Analysis. Chapman and Hall, London, UK.

Chib, S. and Greenberg, E. Analysis of multivariate probit models. Biometrika, to appear.

Clayton, D. G. and Kaldor, J. Empirical Bayes estimates of age-standardised relative risks for use in disease mapping. Biometrics.

Crowder, M. J. Beta-binomial Anova for proportions. Applied Statistics.

Dawid, A. P. Statistical theory: the prequential approach. Journal of the Royal Statistical Society, Series A.

Dawid, A. P. Probability forecasting. In Encyclopaedia of Statistical Sciences (ed. S. Kotz and N. L. Johnson). John Wiley and Sons, New York.

Dempster, A. P. The direct use of likelihood for significance testing. In Proceedings of Conference on Foundational Questions in Statistical Inference (ed. O. Barndorff-Nielsen, P. Blaesild and G. Schou). Department of Theoretical Statistics, University of Aarhus.

Dempster, A. P. (a) Commentary on the paper by Murray Aitkin, and on discussion by Mervyn Stone. Statistics and Computing.

Dempster, A. P. (b) The direct use of likelihood for significance testing. Statistics and Computing.

Draper, D. Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B.

Fitzmaurice, G. and Laird, N. A likelihood-based method for analysing longitudinal binary responses. Biometrika.

Gelfand, A. and Ghosh, S. Model choice: a minimum posterior predictive loss approach. Biometrika.

Gelfand, A. E. and Dey, D. K. Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society, Series B.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. Markov Chain Monte Carlo in Practice. Chapman and Hall, New York.

Gilks, W. R., Wang, C. C., Coursaget, P. and Yvonnet, B. Random-effects models for longitudinal data using Gibbs sampling. Biometrics.

Hastie, T. and Tibshirani, R. Generalized Additive Models. Chapman and Hall, London.

Hodges, J. and Sargent, D. Counting degrees of freedom in hierarchical and other richly parameterised models. Technical report, Division of Biostatistics, University of Minnesota, USA.

Kass, R. and Raftery, A. Bayes factors and model uncertainty. Journal of the American Statistical Association.

Laird, N. M. and Ware, J. H. Random-effects models for longitudinal data. Biometrics.

Laud, P. and Ibrahim, J. Predictive model selection. Journal of the Royal Statistical Society, Series B.

Lee, Y. and Nelder, J. Hierarchical generalised linear models (with discussion). Journal of the Royal Statistical Society, Series B.

Lindley, D. V. and Smith, A. F. M. Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B.

MacKay, D. J. C. Bayesian interpolation. Neural Computation.

MacKay, D. J. C. Probable networks and plausible predictions: a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems.

McCullagh, P. and Nelder, J. Generalized Linear Models, 2nd edition. Chapman and Hall, London.

Moody, J. E. The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In Advances in Neural Information Processing Systems (ed. J. E. Moody, S. J. Hanson and R. P. Lippmann). Morgan Kaufmann, San Mateo, California.

Murata, N., Yoshizawa, S. and Amari, S. Network information criterion: determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks.

Pauler, D. The Schwarz criterion and related methods for normal linear models. Biometrika, to appear.

Raghunathan, T. E. A Bayesian model selection criterion. Technical report, University of Washington.

Richardson, S. and Green, P. J. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society, Series B.

Ripley, B. D. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.

Schwarz, G. Estimating the dimension of a model. Annals of Statistics.

Smith, T. C., Spiegelhalter, D. J. and Thomas, A. Bayesian graphical modelling applied to random effects meta-analysis. Statistics in Medicine.

Spiegelhalter, D. J., Thomas, A., Best, N. G. and Gilks, W. R. (a) BUGS: Bayesian inference Using Gibbs Sampling (version ii). MRC Biostatistics Unit, Cambridge.

Spiegelhalter, D. J., Thomas, A., Best, N. G. and Gilks, W. R. (b) BUGS Examples (version ii). MRC Biostatistics Unit, Cambridge.

Ye, J. On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association.

Ye, J. and Wong, W. Evaluation of highly complex modeling procedures with binomial and Poisson data. Technical report, Graduate School of Business, University of Chicago.

Zeger, S. L. and Karim, M. R. Generalised linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association.