
Marginal likelihood estimation via power posteriors

N. Friel†, University of Glasgow, UK, and A. N. Pettitt, Queensland University of Technology, Australia

Summary. Model choice plays an increasingly important role in statistics. From a Bayesian perspective a crucial goal is to compute the marginal likelihood of the data for a given model. This is however typically a difficult task, since it amounts to integrating over all model parameters. The aim of this paper is to illustrate how this may be achieved using ideas from thermodynamic integration or path sampling. We show how the marginal likelihood can be computed via MCMC methods on modified posterior distributions for each model. This then allows Bayes factors or posterior model probabilities to be calculated. We show that this approach requires very little tuning and is straightforward to implement. The new method is illustrated in a variety of challenging statistical settings.

Keywords: REGRESSION; LONGITUDINAL DATA; MULTINOMIAL; HIDDEN MARKOV MODEL; MODEL CHOICE

1. Introduction

Suppose data y is assumed to have been generated by one of M models indexed by the set {1, 2, . . . , M}. An important goal of Bayesian model selection is to calculate p(k|y) – the posterior model probability for model k. Here the aim may be to obtain a single most probable model, or indeed a subset of likely models, a posteriori. Alternatively, posterior model probabilities may be synthesised from all competing models to calculate some quantity of interest common to all models, using model averaging (Hoeting et al 2001). But before we can make inference for p(k|y) we first need to fully specify the posterior distribution over the joint model and parameter space. We denote by θk the parameters specific to model k, and by θ the collection of all model parameters. Specifying prior distributions for within model parameters p(θk|k) and for model indicators p(k), together with a likelihood L(y|θk, k) within model k, allows us to proceed by examining the posterior distribution

\[ p(\theta_k, k \mid y) \propto L(y \mid \theta_k, k)\, p(\theta_k \mid k)\, p(k). \qquad (1) \]
Across model strategies proceed by sampling from this joint posterior distribution of model indicators and parameters. The reversible jump MCMC algorithm (Green 1995) is a popular approach for this situation. Other across model search strategies include those of Godsill (2001) and Carlin and Chib (1995). By contrast, within model searches examine the posterior distribution within model k separately for each k. Here the within model posterior appears as

\[ p(\theta_k \mid y, k) \propto L(y \mid \theta_k, k)\, p(\theta_k \mid k), \]
where the constant of proportionality, often termed the marginal likelihood or integrated likelihood for model k, is written as

\[ p(y \mid k) = \int_{\theta_k} L(y \mid \theta_k, k)\, p(\theta_k \mid k)\, d\theta_k. \qquad (2) \]
This is in general a difficult integral to compute, possibly involving high dimensional within model parameters θk. However if we were able to do so, then we would readily be able to make statements about posterior model probabilities, using Bayes theorem,
\[ p(k \mid y) = \frac{p(y \mid k)\, p(k)}{\sum_{k=1}^{M} p(y \mid k)\, p(k)}. \]
†Address for correspondence: Department of Statistics, University of Glasgow, Glasgow, G12 8QQ, UK. E-mail: [email protected]

The marginal likelihoods can be used to compare two models by computing Bayes factors,

\[ B_{12} = \frac{p(y \mid k=1)}{p(y \mid k=2)}, \]
without the need to specify prior model probabilities. Note that the Bayes factor B12 gives the evidence provided by the data in favour of model 1 compared to model 2. It can also be seen that

\[ B_{12} = \frac{p(k=1 \mid y)}{p(k=2 \mid y)}\, \frac{p(k=2)}{p(k=1)}. \]

In other words, the Bayes factor is the ratio of the posterior odds of model 1 to its prior odds. Note that an improper prior distribution p(θk|k) leads necessarily to an improper marginal likelihood, which in turn implies that the Bayes factor is not well-defined. To circumvent the difficulty of using improper priors for model comparison, O'Hagan (1995) introduced an approximate method termed the fractional Bayes factor. Here an approximate (proper) marginal likelihood is defined by the ratio
\[ \frac{\int_{\theta_k} L(y \mid \theta_k, k)\, p(\theta_k \mid k)\, d\theta_k}{\int_{\theta_k} \{L(y \mid \theta_k, k)\}^{a}\, p(\theta_k \mid k)\, d\theta_k}, \]
since any impropriety in the prior for θk cancels above and below. Other approaches to this problem include the intrinsic Bayes factor (Berger and Pericchi 1996). In this paper we concentrate on the case where the prior distributions for the model parameters are proper.

Various methods have been proposed in the literature to estimate the marginal likelihood (2). For example, Chib (1995) estimates the marginal likelihood p(y|k) using output from a Gibbs sampler for (θk|y, k). It relies however on a block updating approach for θk, which is clearly not always possible. This work was then extended in Chib and Jeliazkov (2001), where output from a Metropolis-Hastings algorithm for the posterior (θk|y, k) can be used to estimate the marginal likelihood. Annealed importance sampling (Neal 2001) estimates the marginal likelihood using ideas from importance sampling. Here an independent sample from the posterior is generated by defining a sequence of distributions, indexed by a temperature parameter t, from the prior through to the posterior. Importantly, the collection of importance weights can be used to estimate the marginal likelihood.

In this paper we propose a new method to compute the marginal likelihood based on samples from a distribution proportional to the likelihood raised to a power t times the prior, which we term the power posterior. This method was inspired by ideas from path sampling or thermodynamic integration (Gelman and Meng 1998). We find that the marginal likelihood p(y|k) can be expressed as an integral with respect to t, from 0 to 1, of the expected deviance for model k, where the expectation is taken with respect to the power posterior at power t. We argue that this method requires very little tuning or bookkeeping. It is easy to implement, requiring only minor modification to computer code which samples from the posterior distribution.

The remainder of the paper is organised as follows. Section 2 reviews different approaches to across and within model searches. Section 3 introduces the new method for estimating the marginal likelihood, based on sampling from the so-called power posterior distribution. Here we also outline how this method can be implemented in practice, and give some guidance on the sensitivity of the estimate of the marginal likelihood (or Bayes factor) to the diffusivity of the prior on the model parameters. Three illustrations of how the new method performs in practice are given in Section 4. In particular, complex modelling situations are illustrated which preclude other methods of marginal likelihood estimation that rely on block updating of model parameters. We conclude with some final remarks in Section 5.

2. Review of methods to compute Bayes factors

There are numerous methods and techniques available to estimate (2). Generally speaking two approaches are possible – an across model search or a within model search. The former approach, in an MCMC setting, involves generating a single Markov chain which traverses the joint model and parameter space (1). A popular choice is the reversible jump sampler (Green 1995). Godsill (2001) retains aspects of the reversible jump sampler, but considers the case where parameters are shared between different models, as occurs for example in nested models. Other choices include (Stephens 2000), (Carlin and Chib 1995) and (Dellaportas, Forster and Ntzoufras 2001). Within model search essentially aims to estimate the marginal likelihood (2) for each model k separately, and then if desired uses this information to form Bayes factors; see (Chib 1995), (Chib and Jeliazkov 2001). Neal (2001) combines aspects of simulated annealing and importance sampling to provide a method of gathering an independent sample from a posterior distribution of interest, but importantly also to estimate the marginal likelihood. Bridge sampling (Meng and Wong 1996) offers the possibility of estimating the Bayes factor by linking the two posterior distributions with a bridge function. Bartolucci and Mira (2003) use this approach, although it is based on an across model reversible jump sampler. Within model approaches are disadvantageous when the cardinality of the model space is large. However, as noted by Green (2003), the ideal situation for a within model approach is one where the models are all reasonably heterogeneous. In effect, this is the case where it is difficult to choose proposal distributions when jumping between models – and indeed the situation where parameters across models of the same dimension have different interpretations. This short review is not intended to be exhaustive. A more complete picture can be found in the substantial reviews of Sisson (2005) and Han and Carlin (2001).

2.1. Reversible jump MCMC

Reversible jump MCMC (Green 1995) offers the potential to carry out inference for all unknown parameters (1) in the joint model and parameter space in a single logical framework. A crucial innovation in the seminal paper by Green (1995) was to illustrate that detailed balance could be achieved for general state spaces. In particular this extends the Metropolis-Hastings algorithm to variable dimension state spaces of the type (θk, k). To implement the algorithm, a proposed move from (θk, k) to (θl, l) proceeds by generating a random vector u from a distribution g and setting (θl, l) = fkl((θk, k), u). Similarly, to move from (θl, l) to (θk, k) requires random numbers u* following some distribution g*, and setting (θk, k) = flk((θl, l), u*), for some deterministic function flk. However it is important that the transformation fkl from (θk, k) to (θl, l) is a bijection with an invertible differential. A necessary condition for this to apply is that the so-called 'dimension matching' condition holds, that is, dim(θk) + dim(u) = dim(θl) + dim(u*). In this case, the probability of accepting such a move appears as

\[ \min\left\{1,\; \frac{p(\theta_l, l \mid y)\, p(l \to k)\, g^{*}(u^{*})}{p(\theta_k, k \mid y)\, p(k \to l)\, g(u)}\, |J| \right\}, \]
where p(k → l) is the probability of proposing a move from model k to model l, and J is the Jacobian resulting from the transformation from ((θk, k), u) to ((θl, l), u*). In practice, this may be simplified slightly by not insisting on stochastic moves in both directions, so that, for example, dim(u*) = 0, whence the term g*(·) disappears from the numerator above. Finally, for the case of nested models, a possible move type is θk+1 = (θk, u), in which case the Jacobian term equals 1. In some respects RJMCMC is difficult to use in practice. The main drawback would appear to be the problem of model mixing across dimensions. Typically this is a result of the difficulty in choosing a suitable 'jump' proposal distribution. Here it is unclear how to reasonably centre and scale the distribution to increase the chance of the move being accepted. However recent work by Brooks et al (2003) has tackled this problem to some extent.
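To make these mechanics concrete, the following is a minimal sketch (in Python, and not the implementation used in this paper) of a reversible jump sampler for a toy pair of nested Gaussian models: under model 1 the mean is fixed at zero, while model 2 has an unknown mean θ with a N(0, v) prior. The birth move 1 → 2 sets θ = u with u drawn from a proposal g, so the Jacobian is 1 and dim(u*) = 0 for the reverse move; all data, prior and tuning values are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.normal(0.3, 1.0, size=20)          # synthetic data (illustrative)
v = 10.0                                   # prior variance for theta under model 2

def log_lik(theta):                        # y_i ~ N(theta, 1); model 1 fixes theta = 0
    return stats.norm.logpdf(y, loc=theta, scale=1.0).sum()

def log_prior_theta(theta):                # theta ~ N(0, v) under model 2
    return stats.norm.logpdf(theta, 0.0, np.sqrt(v))

g = stats.norm(0.0, 1.0)                   # proposal g for the dimension-matching variable u

k, theta, visits = 1, None, {1: 0, 2: 0}
for it in range(50_000):
    if k == 1:
        # birth move 1 -> 2: theta = u, |J| = 1; equal model priors and move
        # probabilities cancel in the acceptance ratio
        u = g.rvs(random_state=rng)
        log_alpha = (log_lik(u) + log_prior_theta(u)) - (log_lik(0.0) + g.logpdf(u))
        if np.log(rng.uniform()) < log_alpha:
            k, theta = 2, u
    else:
        # death move 2 -> 1: drop theta, dim(u*) = 0
        log_alpha = (log_lik(0.0) + g.logpdf(theta)) - (log_lik(theta) + log_prior_theta(theta))
        if np.log(rng.uniform()) < log_alpha:
            k, theta = 1, None
        else:
            # within-model random walk Metropolis update of theta
            prop = theta + 0.3 * rng.normal()
            if np.log(rng.uniform()) < (log_lik(prop) + log_prior_theta(prop)
                                        - log_lik(theta) - log_prior_theta(theta)):
                theta = prop
    visits[k] += 1

print("posterior model probabilities:", {m: visits[m] / sum(visits.values()) for m in visits})
```

The relative frequency of visits to each model estimates p(k|y); in more realistic settings the difficulty lies in choosing g so that trans-dimensional moves are accepted often enough.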

2.2. Chib's method

An important method of marginal likelihood estimation within each model is that of Chib (1995). This method follows from noticing that for any parameter configuration θ*, Bayes' rule implies that the marginal likelihood of the data y for model k satisfies

\[ p(y) = \frac{L(y \mid \theta^{*})\, p(\theta^{*})}{p(\theta^{*} \mid y)}. \]

Here and for the remainder of this article, for ease of notation, we remove reference to the model indicator k, except where this would be ambiguous. Each factor on the right hand side above can be calculated immediately, with the exception of the posterior probability p(θ*|y). Typically θ* would be chosen as a point of high posterior probability to increase the numerical accuracy of the estimate. Chib illustrated that this probability can be estimated via the Gibbs output, provided θ* can be partitioned into n blocks {θi*}, say, where the full conditional of each block is amenable to Gibbs sampling. It is clear that
\[ p(\theta^{*} \mid y) = \prod_{i=1}^{n} p(\theta_i^{*} \mid \theta_{i-1}^{*}, \ldots, \theta_1^{*}, y). \]
Now each factor p(θj*|θj−1*, . . . , θ1*, y) can be estimated from the Gibbs output by integrating out the parameters θj+1, . . . , θn:

\[ \hat p(\theta_j^{*} \mid \theta_{j-1}^{*}, \ldots, \theta_1^{*}, y) = \frac{1}{I}\sum_{i=1}^{I} p(\theta_j^{*} \mid \theta_n^{(i)}, \ldots, \theta_{j+1}^{(i)}, \theta_{j-1}^{*}, \ldots, \theta_1^{*}, y), \qquad (3) \]
where the index i indicates iterations of the Markov chain at stationarity. Further, the normalising constant of each block must be known exactly in order for the full conditional probabilities to be estimated. Chib and Jeliazkov (2001) extended this methodology to the case where (3) can be estimated using Metropolis-Hastings output, employing an identity based solely on the Metropolis-Hastings acceptance probabilities, which does not require the normalising constant of p(θ*|y). However implementing both methods relies on a judicious partitioning of the parameter θ*, in addition to a considerable amount of bookkeeping. Clearly both methods increase in computational complexity as the dimension of θk increases.
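As an illustration of the estimator (3), the following sketch applies Chib's method to a semi-conjugate normal model with two Gibbs blocks, μ and σ². For this two-block case the ordinate p(σ²*|μ*, y) is available exactly, while p(μ*|y) is Rao-Blackwellised over the Gibbs draws of σ². The model, priors and all numerical settings here are illustrative and not taken from the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0, size=50)                  # synthetic data (illustrative)
n, ybar = y.size, y.mean()
m0, v0, a0, b0 = 0.0, 100.0, 3.0, 2.0              # mu ~ N(m0, v0), sigma2 ~ IG(shape a0, scale b0)

def mu_full_cond(sig2):                            # mu | sigma2, y is normal
    vn = 1.0 / (n / sig2 + 1.0 / v0)
    return vn * (n * ybar / sig2 + m0 / v0), vn

# Gibbs sampler for (mu, sigma2)
G, mu, sig2 = 20_000, ybar, y.var()
mus, sig2s = np.empty(G), np.empty(G)
for g in range(G):
    mn, vn = mu_full_cond(sig2)
    mu = rng.normal(mn, np.sqrt(vn))
    sig2 = stats.invgamma.rvs(a0 + n / 2, scale=b0 + 0.5 * np.sum((y - mu) ** 2), random_state=rng)
    mus[g], sig2s[g] = mu, sig2

mu_star, sig2_star = mus.mean(), sig2s.mean()      # a point of high posterior density

# p(mu* | y): Rao-Blackwellised average of full conditional ordinates, as in (3)
mn_all, vn_all = mu_full_cond(sig2s)
log_p_mu = np.log(np.mean(stats.norm.pdf(mu_star, mn_all, np.sqrt(vn_all))))

# p(sigma2* | mu*, y): available in closed form for this two-block model
log_p_sig2 = stats.invgamma.logpdf(sig2_star, a0 + n / 2,
                                   scale=b0 + 0.5 * np.sum((y - mu_star) ** 2))

log_lik = stats.norm.logpdf(y, mu_star, np.sqrt(sig2_star)).sum()
log_prior = (stats.norm.logpdf(mu_star, m0, np.sqrt(v0))
             + stats.invgamma.logpdf(sig2_star, a0, scale=b0))

print("Chib estimate of log p(y):", log_lik + log_prior - log_p_mu - log_p_sig2)
```

With more than two blocks, each additional ordinate requires a reduced Gibbs run with the earlier blocks held fixed at their starred values, which is the bookkeeping referred to above.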

2.3. Annealed importance sampling

Estimating the marginal likelihood using ideas from importance sampling is also possible, as illustrated in (Neal 2001). The idea is to define a sequence of distributions, starting from one for which it is possible to gather samples, for example the prior distribution, and ending at a target distribution from which you would like to sample. Neal (2001) defines this sequence as

\[ p_{t_i}(\theta_k \mid y) \propto p(\theta_k)^{1-t_i}\, p(\theta_k \mid y)^{t_i}, \]

where 0 = t0 < t1 < · · · < tn = 1. Thus p_{t_0} and p_{t_n} correspond to the prior and posterior distribution respectively. The algorithm begins by sampling a point x_{t_0} from the prior, p_{t_0}. At the ith step, a point x_{t_{i+1}} is generated from p_{t_{i+1}} via a Markov chain transition kernel applied at x_{t_i}, for example via Gibbs or Metropolis-Hastings updating. The final step n yields a point x_{t_n} from the posterior. Repeating this scheme N times yields an independent sample x^{(1)}, . . . , x^{(N)} from the posterior. In effect, the distribution p_{t_i} is an importance distribution for p_{t_{i+1}}. An important by-product of this scheme is that the collection of importance weights

\[ w^{(i)} = \frac{p_{t_{n-1}}(x_{t_{n-1}})}{p_{t_n}(x_{t_n})}\, \frac{p_{t_{n-2}}(x_{t_{n-2}})}{p_{t_{n-1}}(x_{t_{n-1}})} \cdots \frac{p_{t_0}(x_{t_0})}{p_{t_1}(x_{t_1})} \]
are such that
\[ p(y) = \frac{\sum_{i=1}^{N} w^{(i)}}{N}. \]
That is, the marginal likelihood obtains as the average of the importance weights.
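The sketch below implements annealed importance sampling for a toy conjugate normal-mean model, written (as is standard in Neal's formulation) in terms of the unnormalised tempered densities {L(y|θ)}^t p(θ) – the same path from prior to posterior used later in this paper – so that the average importance weight estimates p(y) directly. The exact log marginal likelihood is available for this model and is printed for comparison; the temperature ladder, proposal scale and data are all illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.0, size=30)                  # y_i ~ N(theta, 1), theta ~ N(m, v)
n, ybar, m, v = y.size, y.mean(), 0.0, 5.0

def log_lik(theta):
    return stats.norm.logpdf(y, theta, 1.0).sum()

ts = np.linspace(0.0, 1.0, 51)                     # temperature ladder t_0 = 0 < ... < t_n = 1
N = 500                                            # number of annealing runs
log_w = np.zeros(N)

for i in range(N):
    theta = rng.normal(m, np.sqrt(v))              # draw from the prior (t = 0)
    for t_prev, t_next in zip(ts[:-1], ts[1:]):
        log_w[i] += (t_next - t_prev) * log_lik(theta)   # incremental importance weight
        # Metropolis move leaving the tempered density {L}^t_next * prior invariant
        prop = theta + 0.5 * rng.normal()
        log_a = (t_next * log_lik(prop) + stats.norm.logpdf(prop, m, np.sqrt(v))
                 - t_next * log_lik(theta) - stats.norm.logpdf(theta, m, np.sqrt(v)))
        if np.log(rng.uniform()) < log_a:
            theta = prop

# log of the average weight, computed stably
log_py_ais = np.logaddexp.reduce(log_w) - np.log(N)

# exact log marginal likelihood for this conjugate model (ybar ~ N(m, v + 1/n))
log_py_exact = (stats.norm.logpdf(ybar, m, np.sqrt(v + 1.0 / n))
                - 0.5 * (n - 1) * np.log(2 * np.pi)
                - 0.5 * np.sum((y - ybar) ** 2) - 0.5 * np.log(n))
print(log_py_ais, log_py_exact)
```

The weights are accumulated on the log scale here purely for numerical convenience; the estimator itself still averages the weights on the original scale, in contrast to the method introduced in Section 3.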

3. Marginal likelihoods and power posteriors

Here we introduce a new approach to estimating the integrated likelihood based on ideas of thermodynamic integration or path sampling (Gelman and Meng 1998). Consider introducing an auxiliary variable (or temperature schedule) T(t), where T : [0, 1] → [0, 1] is defined such that T(0) = 0 and T(1) = 1. For simplicity we assume that T(t) = t. Consider the power posterior defined as

\[ p_t(\theta_k \mid y) \propto \{L(y \mid \theta_k)\}^{t}\, p(\theta_k). \qquad (4) \]

Now, define
\[ z(y \mid t) = \int_{\theta_k} \{L(y \mid \theta_k)\}^{t}\, p(\theta_k)\, d\theta_k. \]

By construction, z(y|t = 0) is the integral of the prior for θk, which equals 1. Further, z(y|t = 1) is the marginal likelihood of the data. Here we assume of course that z(y|t) < ∞ for all t ∈ [0, 1]. Now ideas from path sampling (Gelman and Meng 1998) can be used to calculate the integral of interest z(y|t = 1). The following identity is crucial to the problem in hand:

\[ \log p(y) = \log\left\{\frac{z(y \mid t=1)}{z(y \mid t=0)}\right\} = \int_0^1 E_{\theta_k \mid y, t}\,\log\{L(y \mid \theta_k)\}\; dt. \qquad (5) \]
Thus the log of the marginal likelihood results as the integral of the mean deviance, where the expectation is taken with respect to the power posterior (4) at temperature t, as t moves from 0 to 1. The identity (5) can be derived easily as follows:
\[ \frac{d}{dt}\log\{z(y \mid t)\} = \frac{1}{z(y \mid t)}\frac{d}{dt} z(y \mid t) \]

\[
\begin{aligned}
&= \frac{1}{z(y \mid t)} \frac{d}{dt} \int_{\theta_k} \{L(y \mid \theta_k)\}^{t}\, p(\theta_k)\, d\theta_k \\
&= \frac{1}{z(y \mid t)} \int_{\theta_k} \{L(y \mid \theta_k)\}^{t} \log\{L(y \mid \theta_k)\}\, p(\theta_k)\, d\theta_k \\
&= \int_{\theta_k} \frac{\{L(y \mid \theta_k)\}^{t}\, p(\theta_k)}{z(y \mid t)} \log\{L(y \mid \theta_k)\}\, d\theta_k \\
&= E_{\theta_k \mid y, t}\,\log\{L(y \mid \theta_k)\}.
\end{aligned}
\]
Equation (5) now follows by integrating with respect to t. This approach shares some analogies with annealed importance sampling, outlined in Section 2.3; however, here we estimate the marginal likelihood on the log scale, ensuring increased numerical stability. Further, our method estimates log(p(y)) using expectations, again aiding numerical stability. It is interesting to note that the ratio z(y|t = 1)/z(y|t = a), where 0 < a < 1, is precisely the approximate marginal likelihood which defines the fractional Bayes factor of O'Hagan (1995), and so it can be estimated in the same manner by restricting the integral in (5) to the range [a, 1].

3.1. Estimating the marginal likelihood

Note that the identity for the marginal likelihood (5) can be considered as a double integral, integrating over the power parameter t and the model parameters θk. The joint distribution of θk and t can be written as
\[ p(\theta_k, t \mid y) = p(\theta_k \mid t, y)\, p(t) = \frac{\{L(y \mid \theta_k)\}^{t}\, p(\theta_k)}{z(y \mid t)}\, p(t). \]

Now the full conditional distribution of θk looks like,

\[ p(\theta_k \mid y, t) \propto \{L(y \mid \theta_k)\}^{t}\, p(\theta_k), \]
while if we assume that p(t) ∝ z(y|t), then also

\[ p(t \mid \theta_k, y) \propto \{L(y \mid \theta_k)\}^{t}\, p(\theta_k). \]

Now a sample {(θk^{(1)}, t1), . . . , (θk^{(n)}, tn)} gathered from p(θk, t|y) can be used to calculate (5) by ordering the ti's, calculating log{L(y|θk^{(i)})} at each sampled point, and estimating the integral via quadrature. All of this hinges on the assumption that p(t) ∝ z(y|t). It is our experience that z(y|t) varies by orders of magnitude with t. This is not surprising since by construction z(y|t = 0) = 1, while the marginal likelihood, z(y|t = 1), could be quite large depending on the problem in hand. Thus, using a single chain, values of t close to zero would tend not to be sampled with high frequency, leading to poor estimation of p(y). As a more direct approach we suggest discretising the integral (5) over t ∈ [0, 1], running separate chains for each t and sampling from the power posterior to estimate the mean deviance E_{θk|y,t} log L(y|θk). Numerical integration over t, using for example a trapezoidal rule, then yields an estimate of the marginal likelihood. For example, choosing a discretisation 0 = t0 < t1 < · · · < tn−1 < tn = 1 leads to the approximation

\[ \log p(y) \approx \sum_{i=0}^{n-1} (t_{i+1} - t_i)\, \frac{E_{\theta_k \mid y, t_{i+1}}\log L(y \mid \theta_k) + E_{\theta_k \mid y, t_i}\log L(y \mid \theta_k)}{2}. \qquad (6) \]
Note that Monte Carlo standard errors for each E_{θk|y,ti} log L(y|θk) can be pieced together to give an overall Monte Carlo standard error for log p(y). It should be apparent that if it is possible to sample from the posterior distribution of the model parameters, then it should often be possible to sample from the power posterior, and hence to compute the marginal likelihood. In particular, if the likelihood contribution follows an exponential family model, then raising the likelihood to a power t amounts to multiplying the exponent by t. We therefore expect that in many cases, if the posterior is amenable to Gibbs sampling, then so too is the power posterior. In terms of computation, modifying existing MCMC code which samples from the posterior of the model parameters is trivial. Essentially all that is needed is an extra iteration loop over the temperature parameter t, calculating the expected deviance under the power posterior at each temperature.
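A minimal sketch of this procedure, for a toy model where the answer can be checked, might look as follows: one random walk Metropolis chain is run at each temperature of a schedule of the type t_i = (i/K)^c, the mean deviance is recorded at each temperature, and the trapezoidal approximation (6) is applied. The model (y_i ~ N(θ, 1) with θ ~ N(m, v)), the schedule and the chain lengths are all illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.0, size=25)                      # y_i ~ N(theta, 1), theta ~ N(m, v)
n, ybar, m, v = y.size, y.mean(), 0.0, 10.0

def log_lik(theta):
    return stats.norm.logpdf(y, theta, 1.0).sum()

def log_prior(theta):
    return stats.norm.logpdf(theta, m, np.sqrt(v))

# temperature schedule t_i = (i/K)^c, concentrated near t = 0 (see Section 3.2)
K, c = 20, 5
ts = (np.arange(K + 1) / K) ** c

mean_dev = []
for t in ts:
    theta, draws = m, []
    for it in range(5_000):                            # random walk Metropolis on the power posterior
        prop = theta + 0.8 * rng.normal()
        log_a = (t * log_lik(prop) + log_prior(prop)) - (t * log_lik(theta) + log_prior(theta))
        if np.log(rng.uniform()) < log_a:
            theta = prop
        if it >= 1_000:                                # discard burn-in
            draws.append(log_lik(theta))
    mean_dev.append(np.mean(draws))                    # estimate of E_{theta|y,t} log L(y|theta)

log_py_pp = np.trapz(mean_dev, ts)                     # trapezoidal rule (6)

# exact log marginal likelihood for this conjugate model, for comparison
log_py_exact = (stats.norm.logpdf(ybar, m, np.sqrt(v + 1.0 / n))
                - 0.5 * (n - 1) * np.log(2 * np.pi)
                - 0.5 * np.sum((y - ybar) ** 2) - 0.5 * np.log(n))
print(log_py_pp, log_py_exact)
```

The only change relative to a standard posterior sampler is the factor t multiplying the log likelihood in the acceptance ratio and the outer loop over temperatures.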

3.2. Sensitivity of p(y|k) to the prior

It is well understood that the Bayes factor is sensitive to the choice of prior for the model parameters. Here we outline, for a simple example, how this impacts on the estimate of the marginal likelihood via samples from the power posterior, as outlined above. Consider the simple situation where the data y = {yi} are independent and normally distributed with mean θ and unit variance. Assuming θ ∼ N(m, v) a priori leads to a power posterior θ|y, t ∼ N(mt, vt), where
\[ m_t = \frac{n t \bar y + m/v}{n t + 1/v} \quad \text{and} \quad v_t = \frac{1}{n t + 1/v}. \]
It is straightforward to show that

\[ E_{\theta \mid y, t}\,\log L(y \mid \theta) = -\frac{n}{2}\log 2\pi - \frac{1}{2}\sum_{i=1}^{n}(y_i - \bar y)^2 - \frac{n}{2}\,\frac{(m - \bar y)^2}{(v n t + 1)^2} - \frac{n}{2}\,\frac{1}{(n t + 1/v)}. \qquad (7) \]
Recall that the log of the marginal likelihood is obtained by integrating (7) with respect to t over t ∈ [0, 1]. Consider the situation when t = 0. In this case the final term on the right hand side of (7) equals −nv/2. Clearly as v → ∞, Eθ|y,t=0 log L(y|θ) diverges to −∞ at the same rate. This illustrates the sensitivity of the numerical stability of the estimate of the marginal likelihood to the prior specification. Figure 1 plots Eθ|y,t log L(y|θ) against t for v = 10, 5, 1 (for illustrative purposes the first two terms on the right hand side take the constant value −1, and ȳ = 0, m = 0 and n = 10). It is our experience that the behaviour of the mean deviance with t, as outlined in Figure 1, is typical in more complex settings, even when the prior parameters are quite uninformative. Bearing this in mind, we see that the choice of spacing for the ti's in (6) is important. For example, a temperature schedule of the type ti = ai^c, where ai = i/n is an equal spacing of the n points in the interval [0, 1] and c > 1 is a constant, ensures that the ti's are chosen with high frequency close to t = 0. Prescribing the collection of ti's in this way should improve the efficiency of the estimate of p(y).
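To see this numerically, the expected deviance (7) can be evaluated in closed form and integrated over t for different prior variances, using the same illustrative values as Figure 1 (ȳ = 0, m = 0, n = 10, with the first two terms replaced by the constant −1). The short sketch below also varies the exponent c in the schedule t_i = (i/n)^c, to illustrate the effect of concentrating the t_i's near zero when v is large.

```python
import numpy as np

n, ybar, m = 10, 0.0, 0.0                       # illustrative values used for Figure 1

def expected_deviance(t, v, const=-1.0):
    """Final two terms of (7) plus the constant standing in for the first two terms."""
    return const - 0.5 * n * (m - ybar) ** 2 / (v * n * t + 1) ** 2 - 0.5 * n / (n * t + 1.0 / v)

for v in (10.0, 5.0, 1.0):
    for c in (1, 3, 5):                         # c = 1 is an equal spacing; c > 1 concentrates near 0
        ts = (np.arange(0, 101) / 100.0) ** c
        est = np.trapz(expected_deviance(ts, v), ts)
        print(f"v = {v:4.1f}, c = {c}: trapezoidal integral of (7) over t = {est:8.3f}")
```

For the larger prior variances the curve is steepest near t = 0, and the trapezoidal estimate changes noticeably with c, which is the practical motivation for the power-law schedules used in the examples of Section 4.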


Fig. 1. Expected deviance (7) under the distribution θ|y, t, plotted against t for prior variance equal to 10, 5, 1. As v increases, so too does the rate at which the mean deviance changes with t.

i   yi    xi    zi      i   yi    xi    zi      i   yi    xi    zi
1   3040  29.2  25.4    15  2250  27.5  23.8    29  1670  22.1  21.3
2   2470  24.7  22.2    16  2650  25.6  25.3    30  3310  29.2  28.5
3   3610  32.3  32.2    17  4970  34.5  34.2    31  3450  30.1  29.2
4   3480  31.3  31.0    18  2620  26.2  25.7    32  3600  31.4  31.4
5   3810  31.5  30.9    19  2900  26.7  26.4    33  2850  26.7  25.9
6   2330  24.5  23.9    20  1670  21.1  20.0    34  1590  22.1  21.4
7   1800  19.9  19.2    21  2540  24.1  23.9    35  3770  30.3  29.8
8   3110  27.3  27.2    22  3840  30.7  30.7    36  3850  32.0  30.6
9   3670  32.3  29.0    23  3800  32.7  32.6    37  2480  23.2  22.6
10  2310  24.0  23.9    24  4600  32.6  32.5    38  3570  30.3  30.3
11  4360  33.8  33.2    25  1900  22.1  20.8    39  2620  29.9  23.8
12  1880  21.5  21.0    26  2530  25.3  23.1    40  1890  20.8  18.4
13  3670  32.2  29.0    27  2920  30.8  29.8    41  3030  33.2  29.4
14  1740  22.5  22.0    28  4990  38.9  38.1    42  3030  28.2  28.2

Table 1. Radiata pine dataset. yi: Maximum compression strength parallel to the grain, xi: Density, zi: Resin-adjusted density.

4. Examples

4.1. Linear regression - non-nested models

The dataset in Table 1 was taken from Williams (1959). The data describe the maximum compression strength parallel to the grain yi, the density xi, and the resin-adjusted density zi for 42 specimens of radiata pine. This dataset has been examined in (Han and Carlin 2001), (Carlin and Chib 1995) and (Bartolucci and Scaccia 2004), where several methods of estimating the Bayes factor between two non-nested competing models were compared. The competing models are as follows:

\[ M_1 : \; y_i = \alpha + \beta(x_i - \bar x) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \]
\[ M_2 : \; y_i = \gamma + \delta(z_i - \bar z) + \eta_i, \qquad \eta_i \sim N(0, \tau^2). \]

The following prior specification was used (identical to that in the papers cited immediately above): a N((3000, 185)^T, diag(10^6, 10^4)) prior for both (α, β)^T and (γ, δ)^T. An IG(3, (2 · 300^2)^{−1}) prior was chosen for σ^2 and τ^2, where IG(a, b) is an inverse gamma distribution with density
\[ f(x) = \frac{1}{\Gamma(a)\, b^{a}\, x^{a+1}} \exp\left(-\frac{1}{bx}\right). \]
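As an indication of how little needs to change relative to a standard Gibbs sampler, the sketch below runs a Gibbs sampler on the power posterior of model M1 under the priors just described: raising the Gaussian likelihood to the power t simply multiplies the data contributions to the full conditional precisions, shapes and scales by t. The temperature schedule, chain lengths and starting values are illustrative choices (the schedule actually used here, of the type ti = ai^5, is described below); repeating the same calculation with the resin-adjusted density z in place of x gives log p(y|M2), and hence an estimate of B21.

```python
import numpy as np
from scipy import stats

# Maximum compression strength y and density x for the 42 specimens in Table 1
y = np.array([3040, 2470, 3610, 3480, 3810, 2330, 1800, 3110, 3670, 2310, 4360, 1880, 3670, 1740,
              2250, 2650, 4970, 2620, 2900, 1670, 2540, 3840, 3800, 4600, 1900, 2530, 2920, 4990,
              1670, 3310, 3450, 3600, 2850, 1590, 3770, 3850, 2480, 3570, 2620, 1890, 3030, 3030],
             dtype=float)
x = np.array([29.2, 24.7, 32.3, 31.3, 31.5, 24.5, 19.9, 27.3, 32.3, 24.0, 33.8, 21.5, 32.2, 22.5,
              27.5, 25.6, 34.5, 26.2, 26.7, 21.1, 24.1, 30.7, 32.7, 32.6, 22.1, 25.3, 30.8, 38.9,
              22.1, 29.2, 30.1, 31.4, 26.7, 22.1, 30.3, 32.0, 23.2, 30.3, 29.9, 20.8, 33.2, 28.2])

rng = np.random.default_rng(5)
n = y.size
X = np.column_stack([np.ones(n), x - x.mean()])        # design matrix for model M1
mu0 = np.array([3000.0, 185.0])                        # prior mean for (alpha, beta)
V0inv = np.diag([1e-6, 1e-4])                          # inverse of diag(10^6, 10^4)
a0, s0 = 3.0, 2 * 300.0 ** 2                           # IG(3, (2*300^2)^{-1}): shape 3, scale 2*300^2

def log_lik(beta, sig2):
    return stats.norm.logpdf(y, X @ beta, np.sqrt(sig2)).sum()

ts = np.concatenate([[0.0], (np.arange(1, 11) / 10.0) ** 5])   # illustrative schedule of type t_i = a_i^5
mean_dev = []
for t in ts:
    beta, sig2, dev = mu0.copy(), 300.0 ** 2, []
    for it in range(10_000):
        # (alpha, beta) | sigma^2, t, y : normal, with the data precision scaled by t
        Vn = np.linalg.inv(t * X.T @ X / sig2 + V0inv)
        mn = Vn @ (t * X.T @ y / sig2 + V0inv @ mu0)
        beta = rng.multivariate_normal(mn, Vn)
        # sigma^2 | (alpha, beta), t, y : inverse gamma, with shape and scale contributions scaled by t
        resid = y - X @ beta
        sig2 = stats.invgamma.rvs(a0 + t * n / 2, scale=s0 + t * resid @ resid / 2, random_state=rng)
        if it >= 2_000:                                 # discard burn-in
            dev.append(log_lik(beta, sig2))
    mean_dev.append(np.mean(dev))                       # E_{theta|y,t} log L(y|theta)

print("estimate of log p(y | M1):", np.trapz(mean_dev, ts))   # trapezoidal rule (6)
```

Note the very large negative mean deviances at temperatures near zero, a direct consequence of the diffuse priors, exactly as discussed in Section 3.2.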

                  Power posterior   RJMCMC    RJ corrected
Mean              4853.5            5280.6    4869.4
Standard error    94.2              1622.4    97.8
Relative error    1.8%              34.5%     2.0%

Table 2. Linear regression models. Estimates of means, standard errors and relative errors of B21 using the method of power posteriors and RJMCMC. 'RJMCMC' entries correspond to prior model probabilities p(k = 1) = p(k = 2) = 0.5, while 'RJ corrected' corresponds to p(k = 1) = 0.9995 and p(k = 2) = 0.0005.

Green and O'Hagan (1998) found, for the given prior specification, by numerical integration that the Bayes factor B21 = 4862. The aim in this straightforward situation is to see what statistical efficiency can be achieved by using the power posterior method to compute the Bayes factor, compared with using RJMCMC. Here we calculated each marginal likelihood using a temperature schedule of the type ti = ai^5, where the ai's correspond to 10 equally spaced points in [0, 1]. In total 30,000 iterations were used to estimate each marginal likelihood. Finally the algorithm was run for 100 independent chains, each yielding an estimate B̂21,i for i = 1, . . . , 100. To implement RJMCMC we specified p(k = 1) = p(k = 2) = 0.5. To allow for a fair comparison, the reversible jump sampler was run for 60,000 iterations. Within model parameters were updated via Gibbs sampling. Across model moves were proposed by simply setting (α, β, σ) = (γ, δ, τ), resulting in the Jacobian term taking the value 1. The following quantities appear in Table 2:

\[ \text{Mean} = \frac{1}{100}\sum_{i=1}^{100} \hat B_{21,i}, \]
\[ \text{Standard error} = \sqrt{\frac{1}{100}\sum_{i=1}^{100} \left(\hat B_{21,i} - \text{Mean}\right)^2}, \]
\[ \text{Relative error} = \frac{1}{B_{21}}\sqrt{\frac{1}{100}\sum_{i=1}^{100} \left(\hat B_{21,i} - B_{21}\right)^2}. \]
As can be seen from the results in Table 2, RJMCMC fared poorly. This is simply because the reversible jump sampler does not mix well and so does not visit model 1 very often, leading to a poor posterior estimate of p(k = 1|y). Running the reversible jump sampler with the prior model probability strongly weighted towards model 1, with p(k = 1) = 0.9995 and p(k = 2) = 0.0005, leads to estimates of B21 with similar efficiency to that of the power posterior method, and the power posterior has marginally smaller relative error than the 'RJ corrected' method.

4.2. Categorical longitudinal models

For this example we re-visit an analysis of a categorical longitudinal dataset presented in (Pettitt et al 2005). This example concerns a large social survey of immigrants to Australia. Data is recorded for each subject on three separate occasions. The response variable of interest is employment status, which comprises 3 categories – "employed", "unemployed" or "non-participant". The "non-participant" category refers to subjects who are, for example, students, retired or pensioners. The response variable is modelled as a multinomial random variable. Here we are concerned with fitting Bayesian hierarchical models to the data including both fixed and random effects on employment status. Specifically we assume that the response yisj of individual i at time s belongs to employment category j with probability pisj. Note we have used s to index time rather than k as in (Pettitt et al 2005). Thus yisj is a binary random variable, and further we assume that yis· = {yis1, yis2, yis3} has a multinomial distribution

\[ y_{is\cdot} \sim \text{Multinomial}(p_{is\cdot},\, n_{is}), \]

Fixed effects model k = 1
ti          1      0.6561  0.4096  0.2401  0.1296  0.0625  0.0256  0.0081  0.0016  0
Mean Dev.  -2421  -2425   -2433   -2446   -2475   -2547   -2728   -3421   -6670   -66965.28
MC se       0.196  0.293   0.526   0.803   1.325   4.003   5.564   26.120  95.330  536.931

Random effects model k = 2
ti          1      0.6561  0.4096  0.2401  0.1296  0.0625  0.0256  0.0081  0.0016  0
Mean Dev.  -1783  -2215   -2428   -2449   -2483   -2562   -2813   -3862   -7218   -55013.28
MC se       3.525  4.663   0.690   0.775   1.753   3.712   10.140  27.510  96.360  551.865

Table 3. Categorical longitudinal model. Expected deviances (and Monte Carlo standard errors) for the power posterior at temperature ti for models k = 1, 2.

where n_{is} = Σ_j y_{isj} = 1. The next level of the hierarchy relates the binary probabilities pisj to fixed and random covariate effects,
\[ p_{isj} = \frac{\mu_{isj}}{\sum_j \mu_{isj}}, \]
where
\[ \log(\mu_{isj}) = X_{is}^{T}\beta_j + \alpha_{ij}, \]
and where Xis is a vector of covariates, βj are fixed effects, and αij is a random effect reflecting time constant unobserved heterogeneity. Choosing "employed" as a reference state and setting β1 and αi1 to zero allows the model to be identified. In order to maintain invariance with respect to which state is chosen as the reference state, we take the random effect (αi2, αi3) to have a multivariate normal distribution with mean 0 and variance-covariance matrix Σ. We write the posterior distribution as

\[ p(\beta, \alpha, \Sigma \mid y) \propto L(y \mid \beta, \alpha, \Sigma)\, p(\alpha \mid \Sigma)\, p(\Sigma)\, p(\beta), \]
assuming a priori independence between β and Σ. The prior distribution for each element of βj was set to a zero-mean normal distribution with variance 16 (except for the parameters corresponding to age^2, which were given more precise priors). The random effects terms (αi2, αi3) were assigned a bivariate zero-mean normal distribution, where the elements of the variance-covariance matrix Σ were given priors Σ11 ∼ Uniform(0, 10) and Σ22 ∼ Uniform(0, 10), and the correlation coefficient was assigned a Uniform(−1, 1) prior. Here it is possible to update beliefs about all unknown parameters from their full conditional distributions using Gibbs sampling. Here our interest is in calculating marginal likelihoods for

\[ \text{Model } k = 1: \quad \log(\mu_{isj}) = X_{is}^{T}\beta_j, \]

\[ \text{Model } k = 2: \quad \log(\mu_{isj}) = X_{is}^{T}\beta_j + \alpha_{ij}. \]

Model 1 is essentially a fixed effects model, where the regression effect βj, for employment state j, remains constant for each individual at each of the time points s. A random effects term, αij, is however included in model 2, accounting for variability between individuals at a given state j. Indeed other plausible models are also possible, modelling for example variability in time between individuals; again the reader is referred to (Pettitt et al 2005) for more details. For this illustration we have used a randomly selected sample of size 1000 from the complete case data (n = 3234) analysed in (Pettitt et al 2005). For this example, collecting samples from the power posteriors is possible using the WinBUGS software (Spiegelhalter, Thomas and Best 1998). To implement (6) we chose a temperature schedule ti = ai^4, where the ai's are 10 equally spaced points in the interval [0.0016, 1], together with the end point t = 0. Within each temperature ti, 2,000 samples were collected from the stationary distributions pti(β|y, k = 1) and pti(β, α, Σ|y, k = 2). Table 3 summarises the output. Applying the trapezoidal rule (6) yields log p(y|k = 1) = −2528.72 and log p(y|k = 2) = −2358.15, with associated Monte Carlo standard errors of 0.64 and 1.34 respectively. This leads to the strong conclusion that a random effects model is much more probable, which is qualitatively similar to the conclusion presented in (Pettitt et al 2005).
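As a check on the arithmetic, the trapezoidal rule (6) can be applied directly to the tabulated expected deviances. Using the fixed effects row of Table 3 recovers the value log p(y|k = 1) ≈ −2528.7 quoted above; the random effects row can be treated in exactly the same way.

```python
import numpy as np

# temperatures and expected deviances for the fixed effects model (Table 3), ordered from t = 0 to t = 1
t = np.array([0.0, 0.0016, 0.0081, 0.0256, 0.0625, 0.1296, 0.2401, 0.4096, 0.6561, 1.0])
mean_dev = np.array([-66965.28, -6670.0, -3421.0, -2728.0, -2547.0,
                     -2475.0, -2446.0, -2433.0, -2425.0, -2421.0])

log_py = np.trapz(mean_dev, t)      # trapezoidal rule (6)
print(log_py)                       # approximately -2528.7
```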

4.3. Hidden Markov random field models

Markov random fields (MRFs) are often used to model binary spatially dependent data – the autologistic model (Besag 1974) is a popular choice. Here the joint distribution of x = {xi : i = 1, 2, . . . , n}, taking values in {−1, +1} on a regular lattice, is defined as

\[ p(x \mid \beta) \propto \exp\Big\{\beta_0 \sum_i x_i + \beta_1 \sum_{i \sim j} x_i x_j\Big\}, \qquad (8) \]
conditional on parameters β = (β0, β1). Positive values of β0 encourage the xi to take the value +1, while positive values of β1 encourage homogeneous regions of +1's or −1's. The notation i ∼ j denotes that xi and xj are neighbours. For this example we examine two models, defined via their neighbourhood structure:

Model k = 1: A first order neighbourhood, where each point xi has as neighbours the four nearest adjacent points.

Model k = 2: A second order neighbourhood structure, where in addition to the first order neighbours, the four nearest diagonal points also belong to the neighbourhood.

Both neighbourhood structures are modified along the edges of the lattice. MRF models are difficult to handle in practice, due to the computational burden of calculating the constant of proportionality in (8), c(β) say. A hidden MRF y arises when an MRF x is corrupted by some noise process. The underlying MRF is essentially hidden, and appears as parameters in the model. Typically it is assumed that, conditional on x, the yi's are independent, which gives the likelihood:

\[ L(y \mid x, \theta) = \prod_{i=1}^{n} p(y_i \mid x_i, \theta), \]
for some parameters θ. Once prior distributions p(β) and p(θ) are specified for β and θ respectively, a complete Bayesian analysis proceeds by making inference on the posterior distribution

\[ p(x, \beta, \theta \mid y, k) \propto L(y \mid x, \theta)\, p(x \mid \beta, k)\, p(\beta)\, p(\theta). \]

It is relatively straightforward to sample from the full conditional distribution of each of x and θ. Sampling from the full conditional distribution of β is more problematic, due to the difficulty of calculating the normalising constant of the MRF, c(β). However, provided the number of rows or columns is not greater than 20, for a reasonable number of the other dimension, the forward recursion method presented in (Reeves and Pettitt 2004) can be used to calculate c(β). For a more complete description of the problem of Bayesian estimation of hidden MRFs, the reader is referred to (Friel et al 2005). For this example, gene expression levels were measured for 34 genes in a cluster of 38 neighbouring genes on the Streptomyces coelicolor genome for 10 time points. The cluster of 38 neighbouring genes under study is responsible for the production of calcium-dependent antibiotics. We define the observations on a 38 × 10 regular lattice, where the log expression level ysg corresponds to the gth gene at time point s. Figure 2 displays the data y, indicating gene locations for which there is no data. Here we assume that the data y mask an MRF process x, where the states (−1, +1) correspond to 'up-regulation' and 'down-regulation' respectively. We assume that the MRF process follows a first order neighbourhood structure (k = 1), or a second order neighbourhood structure (k = 2). Finally we assume that the distribution of y given x is modelled as independent Gaussian noise with state specific mean µ(xsg) and common variance σ, and set θ = (µ, σ). Wit and McClure (2004) show that normality of log expression levels is a reasonable assumption for similar experimental setups. It is straightforward to handle the missing data – in the full conditional distribution of the latent process x, the likelihood term needs to be modified slightly to allow for the fact that 4 of the columns of x are not supported by any data. A flat normal prior was chosen for each of the β parameters. The prior distribution for µ was uniform on the set {(µ(−1), µ(+1)) | −2 ≤ µ(−1) ≤ 2, µ(−1) ≤ µ(+1) ≤ 2}.
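To give a flavour of the computations involved, the sketch below shows a single-site Gibbs sweep for the latent field x under the power posterior: the autologistic term (8) plays the role of a prior and is left untouched, while only the Gaussian noise likelihood is raised to the power t, and sites in columns with no data simply contribute no likelihood term. The first order neighbourhood, parameter values and synthetic lattice below are illustrative simplifications of the model described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def gibbs_sweep_x(x, y, beta0, beta1, mu, sigma, t, observed):
    """One sweep of single-site updates for the latent field x under the power posterior."""
    R, C = x.shape
    for r in range(R):
        for c in range(C):
            nb = 0.0                                   # sum over first order neighbours
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < R and 0 <= cc < C:
                    nb += x[rr, cc]
            lp = []
            for s in (+1, -1):
                val = beta0 * s + beta1 * s * nb       # MRF contribution (acts as the prior)
                if observed[r, c]:                     # tempered Gaussian likelihood contribution
                    val += t * stats.norm.logpdf(y[r, c], mu[s], sigma)
                lp.append(val)
            p_plus = 1.0 / (1.0 + np.exp(lp[1] - lp[0]))
            x[r, c] = 1.0 if rng.uniform() < p_plus else -1.0
    return x

# illustrative use on a small synthetic lattice
x = rng.choice([-1.0, 1.0], size=(38, 10))
y = rng.normal(0.0, 1.0, size=(38, 10))
observed = np.ones((38, 10), dtype=bool)               # mark missing columns False in practice
mu = {+1: 0.5, -1: -0.5}
x = gibbs_sweep_x(x, y, beta0=0.0, beta1=0.4, mu=mu, sigma=1.0, t=0.5, observed=observed)
```

The updates for µ and σ under the power posterior are modified in the same way, whereas the update for β is unchanged apart from the current value of x, since the tempering applies only to the noise likelihood.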


Fig. 2. Log expression levels of 34 genes on the Streptomyces genome for 10 consecutive time points. The x-axis labels indicate the columns with missing data.

First order model k = 1
ti          1       0.6243  0.3660  0.1975  0.0953  0.0390  0.0123  0.0024  0.0002  0
Mean Dev.  -251.7  -253.0  -252.6  -248.5  -259.4  -282.2  -352.5  -365.3  -681.0  -1607.6
MC se       0.040   0.033   0.061   0.083   0.183   0.377   0.854   1.163   6.754   27.240

Second order model k = 2
ti          1       0.6243  0.3660  0.1975  0.0953  0.0390  0.0123  0.0024  0.0002  0
Mean Dev.  -252.6  -252.6  -253.1  -247.5  -256.7  -279.7  -321.2  -345.5  -475.8  -1248.6
MC se       0.029   0.016   0.024   0.089   0.182   0.450   0.639   1.228   3.646   17.676

Table 4. Hidden Markov random field models. Expected deviances (and Monte Carlo standard errors) for the power posterior at temperature ti for models k = 1, 2.

The values −2 and +2 represent approximate minimum and maximum values which are found in similar datasets. Corresponding to these values, a gamma prior with mean 2 and variance 4 was specified for σ. Note that Friel and Wit (2005) present a more complete analysis of a similar dataset. Here we chose a temperature schedule ti = ai^4, where the ai's are 10 equally spaced points in the interval [0, 1]. Within each temperature ti, 5,000 samples were collected from the stationary distribution pti(x, β, µ|y, k), for k = 1, 2. Table 4 summarises the output. Applying the trapezoidal rule (6) yields log p(y|k = 1) = −256.93 and log p(y|k = 2) = −255.61, with associated Monte Carlo standard errors of 0.0013 and 0.0010, respectively. Thus the second order model is deemed more probable a posteriori.

5. Concluding remarks

We have introduced a new method of estimating the marginal likelihood for complex hierarchical Bayesian models which involves a minimal amount of change to commonly used algorithms which compute the posterior distributions of unknown parameters. We have illustrated the technique for three examples. The first was a simple regression example, where the prior model probabilities needed tuning in order for RJMCMC to estimate the Bayes factor well. The second example involved a random effects model for multinomial data and demonstrated the ease of computing the marginal likelihood with a standard software package such as WinBUGS. Here the results demonstrated the overwhelming difference in marginal likelihoods for the two models considered, a similar situation to the first example where RJMCMC with default equally weighted prior model probabilities would perform very poorly. The third example involved a complex hidden Markov structure and the results demonstrated a difference in terms of marginal likelihood between the two models. In terms of approximating the mean deviance, the second example is far more challenging than the third, with the former displaying characteristics resulting from the use of a vague prior, namely large negative values of the mean deviance for t near 0. However the Monte Carlo standard errors of the marginal likelihoods are nevertheless reasonably small for this example.

Computation of the marginal likelihood requires a proper prior. The sensitivity of the value of the marginal likelihood to the choice of prior can be readily investigated using our method. Various approaches have been proposed for the case where the prior is improper. As we mentioned above, the fractional Bayes factor is straightforwardly computed as a by-product of the marginal likelihood. For those seeking such approximations our method provides a straightforward solution. Our choice of quadrature rule and use of simulation resources can be improved, but we have not followed up that matter here. Nevertheless, our computational approach provides estimates with tolerably small standard errors.

In conclusion, we have illustrated a method of computing the marginal likelihood which is straightforward to implement and can be used for complex models.

Acknowledgements Both authors were supported by the Australian Research Council. The authors wish to kindly acknowledge Thu Tran for her assistance with computational aspects of this work. Nial Friel wishes to acknowledge the School of Mathematical Sciences, QUT for its hospitality during June 2005.

References

Bartolucci, F. and A. Mira (2003), Efficient estimate of Bayes factors from reversible jump output. Technical report, Università dell'Insubria, Dipartimento di Economia

Bartolucci, F. and L. Scaccia (2004), A new approach for estimating the Bayes factor. Technical report, Università di Perugia

Berger, J. O. and L. R. Pericchi (1996), The intrinsic Bayes factor for linear models. In J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith (eds.), Bayesian Statistics, vol. 5, pp. 25–44, Oxford, Oxford University Press

Besag, J. E. (1974), Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society, Series B 36, 192–236

Brooks, S. P., P. Giudici and G. O. Roberts (2003), Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions (with discussion). Journal of the Royal Statistical Society, Series B 65(1), 3–57

Carlin, B. P. and S. Chib (1995), Bayesian model choice via Markov chain Monte Carlo. Journal of the Royal Statistical Society, Series B 57, 473–484

Chib, S. (1995), Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90, 1313–1321

Chib, S. and I. Jeliazkov (2001), Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association 96, 270–281

Dellaportas, P., J. J. Forster and I. Ntzoufras (2001), On Bayesian model and variable selection using MCMC. Statistics and Computing 12, 27–36

Dryden, I. L., M. R. Scarr and C. C. Taylor (2003), Bayesian texture segmentation of weed and crop images using reversible jump Markov chain Monte Carlo methods. Applied Statistics 52(1), 31–50

Friel, N., A. N. Pettitt, R. Reeves and E. Wit (2005), Bayesian inference in hidden Markov random fields for binary data defined on large lattices. Technical report, University of Glasgow

Friel, N. and E. Wit (2005), Markov random field model of gene interactions on the M. Tuberculosis genome. Technical report, University of Glasgow, Department of Statistics

Gelman, A. and X.-L. Meng (1998), Simulating normalizing constants: from importance sampling to bridge sampling to path sampling. Statistical Science 13, 163–185

Godsill, S. J. (2001), On the relationship between Markov chain Monte Carlo methods for model uncertainty. Journal of Computational and Graphical Statistics 10, 230–248

Green, P. J. (1995), Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82, 711–732

Green, P. J. (2003), Trans-dimensional Markov chain Monte Carlo. In P. J. Green, N. L. Hjort and S. Richardson (eds.), Highly Structured Stochastic Systems, Oxford University Press, Oxford

Green, P. J. and A. O'Hagan (1998), Model choice with MCMC on product spaces without using pseudo-priors. Technical Report 98-13, University of Nottingham

Green, P. J. and S. Richardson (2002), Hidden Markov models and disease mapping. Journal of the American Statistical Association 97, 1055–1070

Han, C. and B. P. Carlin (2001), Markov chain Monte Carlo methods for computing Bayes factors: A comparative review. Journal of the American Statistical Association 96(455), 1122–1132

Hoeting, J. A., D. Madigan, A. E. Raftery and C. T. Volinsky (2001), Bayesian model averaging: A tutorial. Statistical Science 14(4), 382–417

Meng, X.-L. and W. Wong (1996), Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica 6, 831–860

Neal, R. M. (2001), Annealed importance sampling. Statistics and Computing 11, 125–139

O'Hagan, A. (1995), Fractional Bayes factors for model comparison (with discussion). Journal of the Royal Statistical Society, Series B 57, 99–138

Pettitt, A. N., T. T. Tran, M. A. Haynes and J. L. Hay (2005), A Bayesian hierarchical model for categorical longitudinal data from a social survey of immigrants. Journal of the Royal Statistical Society, Series A (to appear)

Reeves, R. and A. N. Pettitt (2004), Efficient recursions for general factorisable models. Biometrika 91(3), 751–757

Sisson, S. A. (2005), Trans-dimensional Markov chains: A decade of progress and future perspectives. Journal of the American Statistical Association (to appear)

Spiegelhalter, D., A. Thomas and N. Best (1998), WinBUGS: Bayesian inference using Gibbs sampling, Manual version 1.2. Imperial College, London and Medical Research Council Biostatistics Unit, Cambridge

Stephens, M. (2000), Bayesian analysis of mixture models with an unknown number of components – an alternative to reversible jump methods. Annals of Statistics 28, 40–74

Williams, E. (1959), Regression Analysis. Wiley

Wit, E. and J. McClure (2004), Statistics for Microarrays: Design, Analysis and Inference. Wiley, Chichester