
Marginal Likelihood From the Metropolis–Hastings Output

Siddhartha Chib and Ivan Jeliazkov

This article provides a framework for estimating the marginal likelihood for the purpose of Bayesian model comparisons. The approach extends and completes the method presented in Chib (1995) by overcoming the problems associated with the presence of intractable full conditional densities. The proposed method is developed in the context of MCMC chains produced by the Metropolis–Hastings algorithm, whose building blocks are used both for sampling and marginal likelihood estimation, thus economizing on prerun tuning effort and programming. Experiments involving the logit model for binary data, the hierarchical random effects model for clustered Gaussian data, the Poisson regression model for clustered count data, and the multivariate probit model for correlated binary data are used to illustrate the performance and implementation of the method. These examples demonstrate that the method is practical and widely applicable. KEY WORDS: Bayesian model comparison; Clustered count data; Correlated binary data; Local invariance; Local reversibility; Metropolis–Hastings algorithm; Multivariate density estimation; Reduced conditional density.

1. INTRODUCTION

Consider the problem of comparing a collection of models $\{\mathcal{M}_1, \ldots, \mathcal{M}_L\}$ that reflect competing hypotheses about the data $y = (y_1, \ldots, y_n)$. Suppose that each model $\mathcal{M}_l$ is characterized by a model-specific parameter vector $\theta_l \in S_l \subseteq \mathbb{R}^{k_l}$ of dimension $k_l$ and sampling density $f(y \mid \mathcal{M}_l, \theta_l)$. In this context, Bayesian model selection proceeds by pairwise comparison of the models in $\{\mathcal{M}_l\}$ through their posterior odds ratio, which for any two models $\mathcal{M}_i$ and $\mathcal{M}_j$ is written as

$$\frac{\Pr(\mathcal{M}_i \mid y)}{\Pr(\mathcal{M}_j \mid y)} = \frac{\Pr(\mathcal{M}_i)}{\Pr(\mathcal{M}_j)} \times \frac{m(y \mid \mathcal{M}_i)}{m(y \mid \mathcal{M}_j)}, \qquad (1)$$

where

$$m(y \mid \mathcal{M}_l) = \int f(y \mid \mathcal{M}_l, \theta_l)\, \pi(\theta_l \mid \mathcal{M}_l)\, d\theta_l \qquad (2)$$

is the marginal likelihood of $\mathcal{M}_l$. The first fraction on the right-hand side of (1) is known as the prior odds and the second as the Bayes factor.

The calculation of the marginal likelihood, which is of some importance in Bayesian statistics, has attracted considerable interest in the recent Markov chain Monte Carlo (MCMC) literature. One generic problem is that because the marginal likelihood is obtained by integrating the sampling density $f(y \mid \mathcal{M}_l, \theta_l)$ with respect to the prior distribution of the parameters, and not the posterior distribution, the posterior MCMC output from the simulation cannot be used directly to estimate the marginal likelihood. Of course, analytic evaluation of the integral is almost never possible. Because of these difficulties, attempts have been made (for example, Carlin and Chib 1995; Green 1995) to estimate the posterior odds by Markov chain Monte Carlo methods that sample both model space and parameter space. These approaches deliver the posterior probabilities of each model, and hence the posterior odds for any two models, according to the frequency of visits to each model. Although such methods are important, they suffer from certain drawbacks. One is that the tuning of the MCMC samplers to promote mixing on a high-dimensional space can be difficult (see Han and Carlin 2000), especially in models with latent variables as in the examples that follow. Even in the best of circumstances, these methods require prerun tuning to get suitable mixing on model space. Another problem is that some subset of the existing models must be included in the simulation if a new model is to be compared to the existing ones, increasing the dimensionality of the parameter space and introducing tuning concerns beyond those required for sampling from the new model.

Work has also been done on the direct estimation of the marginal likelihood in general non-nested model settings (Chib 1995; Gelfand and Dey 1994) and the estimation of ratios of marginal likelihoods, especially in the setting of nested models (Chen and Shao 1997; DiCiccio, Kass, Raftery, and Wasserman 1997; Meng and Wong 1996; Verdinelli and Wasserman 1995). In this article, we focus on the approach of Chib (1995), which is based on a representation of the marginal likelihood that is amenable to calculation by MCMC methods. Because the marginal likelihood is the normalizing constant of the posterior density, one can write

$$m(y \mid \mathcal{M}_l) = \frac{f(y \mid \mathcal{M}_l, \theta_l)\, \pi(\theta_l \mid \mathcal{M}_l)}{\pi(\theta_l \mid y, \mathcal{M}_l)}, \qquad (3)$$

which is referred to as the basic marginal likelihood identity. Evaluating the right-hand side of this identity at some appropriate point $\theta_l^*$ and taking logarithms, one obtains the expression

$$\log m(y \mid \mathcal{M}_l) = \log f(y \mid \mathcal{M}_l, \theta_l^*) + \log \pi(\theta_l^* \mid \mathcal{M}_l) - \log \pi(\theta_l^* \mid y, \mathcal{M}_l), \qquad (4)$$

from which the marginal likelihood can be estimated by finding an estimate of the posterior ordinate $\pi(\theta_l^* \mid y, \mathcal{M}_l)$. Thus the calculation of the marginal likelihood is reduced to finding an estimate of the posterior density at a single point $\theta_l^*$. For estimation efficiency, the latter point is generally taken to be a high-density point in the support of the posterior.

Siddhartha Chib is Professor of Econometrics and Statistics, John M. Olin School of Business, CB 1133, Washington University, St. Louis, MO 63130 (Email: [email protected]). Ivan Jeliazkov is a doctoral student in Economics at Washington University. The authors are grateful to the editor, associate editor, and referees for helpful comments.

© 2001 American Statistical Association, Journal of the American Statistical Association, March 2001, Vol. 96, No. 453, Theory and Methods.

Chib (1995) provides a method to estimate the posterior ordinate in the context of Gibbs MCMC sampling. Suppose that the parameter space is split into $B$ conveniently specified blocks, so that $\theta = (\theta_1, \ldots, \theta_B)$, where we suppress the model index for notational convenience. Then, by the law of total probability we have

$$\pi(\theta^* \mid y) = \pi(\theta_1^* \mid y)\, \pi(\theta_2^* \mid y, \theta_1^*) \cdots \pi(\theta_B^* \mid y, \theta_1^*, \ldots, \theta_{B-1}^*), \qquad (5)$$

where $\pi(\theta_1^* \mid y)$ is the marginal density ordinate of $\theta_1^*$, $\pi(\theta_B^* \mid y, \theta_1^*, \ldots, \theta_{B-1}^*)$ is the full conditional density ordinate, and the remaining ordinates are reduced conditional ordinates. Now assume that each full conditional density is fully known. Then, the marginal density ordinate is estimated by the Rao–Blackwell device (Gelfand and Smith 1990; Tanner and Wong 1987). Next, the first reduced conditional ordinate is found by averaging the full conditional density of $\theta_2$, $\hat{\pi}(\theta_2^* \mid y, \theta_1^*) = M^{-1} \sum_{j=1}^{M} \pi(\theta_2^* \mid y, \theta_1^*, \theta_3^{(j)}, \ldots, \theta_B^{(j)})$, where $\{\theta_3^{(j)}, \ldots, \theta_B^{(j)}\} \sim \pi(\theta_3, \ldots, \theta_B \mid y, \theta_1^*)$ are $M$ draws obtained from a reduced Gibbs MCMC run in which $\theta_1$ is fixed at $\theta_1^*$ and sampling is over $\{\theta_2, \ldots, \theta_B\}$, a procedure that requires no new programming. Subsequent reduced ordinates are estimated in the same way by fixing additional blocks. The time cost of this procedure is generally small when blocking is done effectively, a few reduced runs are required, as is possible in many practical problems, and the higher-dimensional blocks are placed first in the equation (5) decomposition.

It is worth noting that an alternative approach to estimating the posterior ordinate is developed by Ritter and Tanner (1992), also in the context of Gibbs MCMC chains with fully known full conditional distributions. If one lets

$$K_G(\theta, \theta^* \mid y) = \prod_{k=1}^{B} \pi(\theta_k^* \mid y, \theta_1^*, \ldots, \theta_{k-1}^*, \theta_{k+1}, \ldots, \theta_B) \qquad (6)$$

denote the Gibbs transition kernel, then, by virtue of the fact that the Gibbs chain satisfies the invariance condition $\pi(\theta^* \mid y) = \int K_G(\theta, \theta^* \mid y)\, \pi(\theta \mid y)\, d\theta$, one can obtain the posterior ordinate by averaging the transition kernel over draws from the posterior distribution: $\hat{\pi}(\theta^* \mid y) = M^{-1} \sum_{g=1}^{M} K_G(\theta^{(g)}, \theta^* \mid y)$. This estimate only requires draws from the full Gibbs run, but when $\theta$ is high-dimensional and the model contains latent variables, it tends to be less accurate than Chib's estimate, which achieves accuracy with the help of additional simulations by decomposing the posterior ordinate into smaller pieces and estimating each reduced ordinate from its full conditional distribution. Nonetheless, the idea embodied in Ritter and Tanner's method is valuable, and it is a useful method when $\theta$ is low-dimensional.

Neither the Chib nor the Ritter and Tanner methods for estimating the posterior ordinate offer a solution to problems in which sampling is via a single-block Metropolis sampler or to problems in which one or more of the normalizing constants of the full conditional densities are not known. For the latter situation, Chib and Greenberg (1998) have applied a nonparametric density estimation method to calculate the recalcitrant reduced ordinate, but in a high-dimensional case the accuracy of the resulting estimate is questionable, although they suggest a modified method that Chib, Nardari, and Shephard (1999) have applied to a 120-dimensional posterior distribution. In this article, we propose a different, less demanding, and more general solution that solves both problems, thus extending and completing the Chib method for estimating the marginal likelihood.

The rest of the article is organized as follows. In Section 2, we discuss the MCMC sampling framework and give the main results that form the basis for the proposed method. Section 3 contains results from extensive experiments that document the performance of the method in diverse model situations. Concluding remarks are given in Section 4.

2. PROPOSED APPROACH

2.1 One Block Sampling

To motivate the general approach, we begin by considering a simple case of some importance. Suppose that the posterior density $\pi(\theta \mid y) \propto \pi(\theta) f(y \mid \theta)$, defined over $S$, a subset of $\mathbb{R}^d$, is sampled in one block by the Metropolis–Hastings algorithm (Chib and Greenberg 1995; Hastings 1970; Tierney 1994), and the goal is to estimate the posterior ordinate $\pi(\theta^* \mid y)$ given the posterior sample $\{\theta^{(1)}, \ldots, \theta^{(M)}\}$. Let $q(\theta, \theta' \mid y)$ denote the proposal density (candidate generating density) for the transition from $\theta$ to $\theta'$, where the proposal density is allowed to depend on the data $y$, and let

$$\alpha(\theta, \theta' \mid y) = \min\left\{1,\; \frac{f(y \mid \theta')\, \pi(\theta')\, q(\theta', \theta \mid y)}{f(y \mid \theta)\, \pi(\theta)\, q(\theta, \theta' \mid y)}\right\}$$

denote the probability of move (probability of accepting the proposed value).

If we let $p(\theta, \theta' \mid y) = \alpha(\theta, \theta' \mid y)\, q(\theta, \theta' \mid y)$ denote the subkernel of the M–H algorithm, then from the reversibility of the subkernel we can write that for any point $\theta^*$

$$p(\theta, \theta^* \mid y)\, \pi(\theta \mid y) = \pi(\theta^* \mid y)\, p(\theta^*, \theta \mid y). \qquad (7)$$

Upon integrating both sides of this expression with respect to $\theta$ over $\mathbb{R}^d$, we immediately obtain the result that the posterior ordinate is given by

$$\pi(\theta^* \mid y) = \frac{\int \alpha(\theta, \theta^* \mid y)\, q(\theta, \theta^* \mid y)\, \pi(\theta \mid y)\, d\theta}{\int \alpha(\theta^*, \theta \mid y)\, q(\theta^*, \theta \mid y)\, d\theta}. \qquad (8)$$

To highlight the estimation strategy, write the latter in more succinct form as

$$\pi(\theta^* \mid y) = \frac{E_1\left[\alpha(\theta, \theta^* \mid y)\, q(\theta, \theta^* \mid y)\right]}{E_2\left[\alpha(\theta^*, \theta \mid y)\right]},$$

where the numerator expectation $E_1$ is with respect to the distribution $\pi(\theta \mid y)$ and the denominator expectation $E_2$ is with respect to $q(\theta^*, \theta \mid y)$. This implies that a simulation-consistent estimate of the posterior ordinate is available as

$$\hat{\pi}(\theta^* \mid y) = \frac{M^{-1} \sum_{g=1}^{M} \alpha(\theta^{(g)}, \theta^* \mid y)\, q(\theta^{(g)}, \theta^* \mid y)}{J^{-1} \sum_{j=1}^{J} \alpha(\theta^*, \theta^{(j)} \mid y)}, \qquad (9)$$

where $\{\theta^{(g)}\}$ are the sampled draws from the posterior distribution and $\{\theta^{(j)}\}$ are draws from $q(\theta^*, \theta \mid y)$, given the fixed value $\theta^*$. On substituting the latter estimate in the log of the basic marginal likelihood identity, we get

$$\log \hat{m}(y) = \log f(y \mid \theta^*) + \log \pi(\theta^*) - \log \hat{\pi}(\theta^* \mid y). \qquad (10)$$

This is a rather simple expression that can be applied to many models, including the class of univariate generalized linear models, parametric survival models, and Gaussian linear mixed models for clustered data. Note that the choice of point $\theta^*$ is arbitrary, but for estimation efficiency it is customary to choose a point that has high density under the posterior. Also note that we let the simulation-sample sizes in the numerator and the denominator be different, although in practice we set them to be equal. It should be appreciated that the sampling of $\theta^{(j)}$ from $q(\theta^*, \theta \mid y)$ normally consumes a small amount of time in relation to the time required for the full MCMC run. This means that the marginal likelihood of the model is available almost as soon as the full MCMC run is finished. Finally, if $S$ is a proper subset of $\mathbb{R}^d$, then values $\theta^{(j)}$ from $q(\theta^*, \theta \mid y)$ that do not lie in $S$ are included in the average in the denominator with the value $\alpha(\theta^*, \theta^{(j)} \mid y) = 0$.
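To make the one-block estimator concrete, the following minimal sketch (our own illustrative code, not the authors' implementation) computes (9) and (10) from stored posterior draws, assuming a Gaussian independence proposal $q(\theta' \mid y)$; all function and argument names are assumptions chosen for this example.

```python
# Sketch of equations (9)-(10) for a one-block M-H chain with an
# independence proposal; log_post may be unnormalized (prior x likelihood).
import numpy as np
from scipy.stats import multivariate_normal

def log_alpha(log_post, log_q, cur, prop):
    """Log of the M-H probability of move from `cur` to `prop`."""
    r = (log_post(prop) - log_post(cur)) + (log_q(cur) - log_q(prop))
    return min(0.0, r)

def one_block_log_marglik(log_post, log_prior, log_lik, q_mean, q_cov,
                          post_draws, theta_star, J, rng):
    """Estimate log m(y) given posterior draws and J fresh proposal draws."""
    prop = multivariate_normal(q_mean, q_cov)
    # numerator of (9): average alpha(theta^(g), theta* | y) q(theta* | y)
    num = np.mean([np.exp(log_alpha(log_post, prop.logpdf, th, theta_star)
                          + prop.logpdf(theta_star)) for th in post_draws])
    # denominator of (9): average alpha(theta*, theta^(j) | y), theta^(j) ~ q
    den = np.mean([np.exp(log_alpha(log_post, prop.logpdf, theta_star,
                                    prop.rvs(random_state=rng)))
                   for _ in range(J)])
    log_post_ordinate = np.log(num) - np.log(den)                 # eq. (9)
    return log_lik(theta_star) + log_prior(theta_star) - log_post_ordinate  # eq. (10)
```

In practice $\theta^*$ would be set to the posterior mean or mode computed from `post_draws`, in line with the efficiency recommendation above.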

2.2 Two Parameter Blocks and Multiple Latent Variable Blocks

To enhance understanding, assume that the normalizing constant of only the first full conditional density is not known and that this density is sampled by the M–H algorithm, as in the multivariate probit model example presented in Section 3.4. Let $q(\theta_1, \theta_1' \mid y, \theta_2, z)$ denote the proposal density for the transition from $\theta_1$ to $\theta_1'$, where we have for generality allowed the proposal density to depend on the data and the two remaining blocks. Also let

$$\alpha(\theta_1, \theta_1' \mid y, \theta_2, z) = \min\left\{1,\; \frac{f(y \mid \theta_1', \theta_2, z)\, \pi(\theta_1', \theta_2)\, q(\theta_1', \theta_1 \mid y, \theta_2, z)}{f(y \mid \theta_1, \theta_2, z)\, \pi(\theta_1, \theta_2)\, q(\theta_1, \theta_1' \mid y, \theta_2, z)}\right\}$$

denote the probability of move. An application of the basic marginal likelihood identity yields

$$m(y) = \frac{f(y \mid \theta_1^*, \theta_2^*)\, \pi(\theta_1^*, \theta_2^*)}{\pi(\theta_1^*, \theta_2^* \mid y)},$$

and the goal is to estimate $\pi(\theta_1^*, \theta_2^* \mid y)$. We assume that the likelihood ordinate is available readily, either by direct calculation (as in mixed effects models for clustered data, where $z$ denotes the cluster-specific random effects) or by a Monte Carlo integration method. Note that although it is also true that $m(y) = f(y, z \mid \theta_1, \theta_2)\, \pi(\theta_1, \theta_2) / \pi(z, \theta_1, \theta_2 \mid y)$, the latter identity is not that useful because it requires the computation of the ordinate $\pi(\theta_1^*, \theta_2^*, z^* \mid y)$, whose dimension can easily run into the hundreds, if not thousands.

Following Chib (1995), we decompose the posterior ordinate as $\pi(\theta_1^*, \theta_2^* \mid y) = \pi(\theta_1^* \mid y)\, \pi(\theta_2^* \mid y, \theta_1^*)$, where $\pi(\theta_1^* \mid y)$ cannot be estimated by the Rao–Blackwell method because, by assumption, the normalizing constant of $\pi(\theta_1 \mid y, \theta_2, z)$ is not known. Let

$$p(\theta_1, \theta_1^* \mid y, \theta_2, z) = \alpha(\theta_1, \theta_1^* \mid y, \theta_2, z)\, q(\theta_1, \theta_1^* \mid y, \theta_2, z)$$

denote the subkernel of the M–H chain for $\theta_1$ conditioned on $(\theta_2, z)$. Then, by direct calculation we see that this kernel satisfies the condition

$$p(\theta_1, \theta_1^* \mid y, \theta_2, z)\, \pi(\theta_1 \mid y, \theta_2, z) = \pi(\theta_1^* \mid y, \theta_2, z)\, p(\theta_1^*, \theta_1 \mid y, \theta_2, z), \qquad (11)$$

which may be referred to as a local reversibility condition. Now if we multiply both sides of (11) by $\pi(\theta_2, z \mid y)$ and integrate over $\psi = (\theta_1, \theta_2, z)$, we get

$$\int p(\theta_1, \theta_1^* \mid y, \theta_2, z)\, \pi(\theta_1 \mid y, \theta_2, z)\, \pi(\theta_2, z \mid y)\, d\psi = \int p(\theta_1^*, \theta_1 \mid y, \theta_2, z)\, \pi(\theta_1^* \mid y, \theta_2, z)\, \pi(\theta_2, z \mid y)\, d\psi,$$

or

$$\int p(\theta_1, \theta_1^* \mid y, \theta_2, z)\, \pi(\theta_1, \theta_2, z \mid y)\, d\psi = \int p(\theta_1^*, \theta_1 \mid y, \theta_2, z)\, \pi(\theta_1^* \mid y)\, \pi(\theta_2, z \mid y, \theta_1^*)\, d\psi,$$

and so,

$$\int p(\theta_1, \theta_1^* \mid y, \theta_2, z)\, \pi(\theta_1, \theta_2, z \mid y)\, d\psi = \pi(\theta_1^* \mid y) \int p(\theta_1^*, \theta_1 \mid y, \theta_2, z)\, \pi(\theta_2, z \mid y, \theta_1^*)\, d\psi.$$

From this last equality, and the definitions of $p(\theta_1, \theta_1^* \mid y, \theta_2, z)$ and $p(\theta_1^*, \theta_1 \mid y, \theta_2, z)$, it follows that

$$\pi(\theta_1^* \mid y) = \frac{E_1\left[\alpha(\theta_1, \theta_1^* \mid y, \theta_2, z)\, q(\theta_1, \theta_1^* \mid y, \theta_2, z)\right]}{E_2\left[\alpha(\theta_1^*, \theta_1 \mid y, \theta_2, z)\right]}, \qquad (12)$$

where the numerator expectation $E_1$ is with respect to the distribution $\pi(\theta_1, \theta_2, z \mid y)$, whereas the denominator expectation $E_2$ is with respect to the distribution $\pi(\theta_2, z \mid y, \theta_1^*)\, q(\theta_1^*, \theta_1 \mid y, \theta_2, z)$.

Each of the integrals in (12) can be estimated by the Monte Carlo method. To estimate the numerator, we take the draws $\{\theta_1^{(g)}, \theta_2^{(g)}, z^{(g)}\}_{g=1}^{M}$ from the full run and average the quantity $\alpha(\theta_1, \theta_1^* \mid y, \theta_2, z)\, q(\theta_1, \theta_1^* \mid y, \theta_2, z)$. For the denominator, because the expectation is conditioned on $\theta_1^*$, we continue the MCMC simulation for an additional $J$ iterations with the two full conditional densities

$$\pi(\theta_2 \mid y, \theta_1^*, z); \qquad \pi(z \mid y, \theta_1^*, \theta_2). \qquad (13)$$

At each iteration of this reduced run, given the values $(\theta_2^{(j)}, z^{(j)})$, we generate a variate

$$\theta_1^{(j)} \sim q(\theta_1^*, \theta_1 \mid y, \theta_2^{(j)}, z^{(j)}),$$

leading to the triple $(\theta_2^{(j)}, z^{(j)}, \theta_1^{(j)})$ that is a draw from the distribution $\pi(\theta_2, z \mid y, \theta_1^*)\, q(\theta_1^*, \theta_1 \mid y, \theta_2, z)$. The marginal ordinate can now be estimated as

$$\hat{\pi}(\theta_1^* \mid y) = \frac{M^{-1} \sum_{g=1}^{M} \alpha(\theta_1^{(g)}, \theta_1^* \mid y, \theta_2^{(g)}, z^{(g)})\, q(\theta_1^{(g)}, \theta_1^* \mid y, \theta_2^{(g)}, z^{(g)})}{J^{-1} \sum_{j=1}^{J} \alpha(\theta_1^*, \theta_1^{(j)} \mid y, \theta_2^{(j)}, z^{(j)})}. \qquad (14)$$
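The reduced run behind (13)–(14) is the only new computational ingredient. The following schematic sketch (our own notation, not the authors' code) shows its structure: $\theta_1$ is held fixed at $\theta_1^*$, the blocks $(\theta_2, z)$ are updated from their full conditionals, and at each pass a proposal draw and the corresponding probability of move are recorded; the routines `sample_theta2`, `sample_z`, `propose_theta1`, and `log_alpha1` stand for model-specific pieces assumed to exist already in the sampler.

```python
# Sketch of the reduced run used in the denominator of equation (14).
import numpy as np

def reduced_run_ordinate(num_terms, theta1_star, theta2, z, J,
                         sample_theta2, sample_z, propose_theta1, log_alpha1):
    """Return pi-hat(theta1_star | y) as in (14).

    num_terms : array of alpha(theta_1^(g), theta1*) * q(theta_1^(g), theta1*)
                values already accumulated over the M draws of the full run.
    """
    den_terms = np.empty(J)
    for j in range(J):
        theta2 = sample_theta2(theta1_star, z)              # pi(theta_2 | y, theta1*, z)
        z = sample_z(theta1_star, theta2)                   # pi(z | y, theta1*, theta_2)
        theta1_j = propose_theta1(theta1_star, theta2, z)   # q(theta1*, . | y, theta_2, z)
        den_terms[j] = np.exp(log_alpha1(theta1_star, theta1_j, theta2, z))
    return num_terms.mean() / den_terms.mean()
```

The same loop also delivers the draws $z^{(j)}$ that are used next to Rao–Blackwellize $\pi(\theta_2^* \mid y, \theta_1^*)$.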

Next, the values $z^{(j)}$ from the preceding reduced run, which are marginally from $\pi(z \mid y, \theta_1^*)$, are used to form the average $\hat{\pi}(\theta_2^* \mid y, \theta_1^*) = J^{-1} \sum_{j=1}^{J} \pi(\theta_2^* \mid y, \theta_1^*, z^{(j)})$, which is a simulation-consistent estimate of $\pi(\theta_2^* \mid y, \theta_1^*)$. Thus at the conclusion of the reduced run, both ordinates are available and the marginal likelihood estimate is given by

$$\log \hat{m}(y) = \log f(y \mid \theta^*) + \log \pi(\theta^*) - \left\{\log \hat{\pi}(\theta_1^* \mid y) + \log \hat{\pi}(\theta_2^* \mid y, \theta_1^*)\right\}. \qquad (15)$$

Three remarks are in order. First, a single reduced run, augmented with a step to sample $\theta_1$ from the proposal density, delivers the variates that are used in the calculation of each of the two ordinates, $\pi(\theta_1^* \mid y)$ and $\pi(\theta_2^* \mid y, \theta_1^*)$. Second, if one places the recalcitrant ordinate first in the decomposition of the posterior ordinate, then the reduced run does not involve any M–H steps. Third, as mentioned previously and shown later in Section 3.4, this same approach can be applied when the full conditional density of $z$ is sampled by a sequence of M–H steps.

2.3 Multiple Parameter Blocks

One of the features of the approach of Chib (1995), which has been inherited by the current method, is the flexibility with which it accommodates both low- and high-dimensional problems. For example, in most low-dimensional MCMC problems, grouping all parameters into one block is a sensible strategy. In higher-dimensional problems, it may be the case that convenience, computational necessity, or simulation design may require sampling of the parameters in several smaller, more manageable blocks. Our approach readily generalizes to this situation.

To describe the setting, suppose that the MCMC sampling is conducted without any latent data (because latent data, sampled in one or more blocks, can be handled as in the previous section), and suppose that the parameters are grouped into $B$ blocks $\theta = (\theta_1, \ldots, \theta_B)$, with $\theta_k \in S_k \subseteq \mathbb{R}^{d_k}$. Write the posterior ordinate at $\theta^*$ as $\pi(\theta_1^*, \ldots, \theta_B^* \mid y) = \prod_{i=1}^{B} \pi(\theta_i^* \mid y, \theta_1^*, \ldots, \theta_{i-1}^*)$ and consider the estimation of the reduced ordinate $\pi(\theta_i^* \mid y, \theta_1^*, \ldots, \theta_{i-1}^*)$.

Now suppose that the full conditional density $\pi(\theta_i \mid y, \theta_{-i}) \propto \pi(\theta) f(y \mid \theta)$, $i = 1, \ldots, B$, is sampled by the M–H algorithm with proposal density $q(\theta_i, \theta_i' \mid y, \psi_{i-1}, \psi^{i+1})$ and probability of move

$$\alpha(\theta_i, \theta_i' \mid y, \psi_{i-1}, \psi^{i+1}) = \min\left\{1,\; \frac{f(y \mid \theta_i', \psi_{i-1}, \psi^{i+1})\, \pi(\theta_i', \theta_{-i})\, q(\theta_i', \theta_i \mid y, \psi_{i-1}, \psi^{i+1})}{f(y \mid \theta_i, \psi_{i-1}, \psi^{i+1})\, \pi(\theta_i, \theta_{-i})\, q(\theta_i, \theta_i' \mid y, \psi_{i-1}, \psi^{i+1})}\right\},$$

where we have written $\psi_{i-1} = (\theta_1, \ldots, \theta_{i-1})$ to denote the blocks up to $i$ and $\psi^{i+1} = (\theta_{i+1}, \ldots, \theta_B)$ to denote the blocks beyond $i$.

Again by exploiting the local reversibility of the M–H step for $\theta_i$ and completely analogous arguments to the ones presented in the last subsection, we obtain the result that

$$\pi(\theta_i^* \mid y, \theta_1^*, \ldots, \theta_{i-1}^*) = \frac{E_1\left[\alpha(\theta_i, \theta_i^* \mid y, \psi_{i-1}^*, \psi^{i+1})\, q(\theta_i, \theta_i^* \mid y, \psi_{i-1}^*, \psi^{i+1})\right]}{E_2\left[\alpha(\theta_i^*, \theta_i \mid y, \psi_{i-1}^*, \psi^{i+1})\right]}, \qquad (16)$$

where $E_1$ is the expectation with respect to the conditional posterior $\pi(\theta_i, \psi^{i+1} \mid y, \psi_{i-1}^*)$ and $E_2$ that with respect to the conditional product measure $\pi(\psi^{i+1} \mid y, \psi_i^*)\, q(\theta_i^*, \theta_i \mid y, \psi_{i-1}^*, \psi^{i+1})$. These two integrals can be estimated as before from the output of the reduced MCMC runs, as follows.

Step 1. Set $\psi_{i-1} = \psi_{i-1}^*$ and sample the reduced set of full conditional distributions $\pi(\theta_k \mid y, \theta_{-k})$, $k = i, \ldots, B$. Let the generated draws be $\{\theta_i^{(g)}, \ldots, \theta_B^{(g)}\}$, $g = 1, \ldots, M$.

Step 2. Include $\theta_i^*$ in the conditioning set, let $\psi_i^* = (\psi_{i-1}^*, \theta_i^*)$, and remove the $\theta_i$ full conditional distribution from the collection in Step 1. Then sample the remaining distributions $\pi(\theta_k \mid y, \theta_{-k})$, $k = i+1, \ldots, B$, to produce $\{\theta_{i+1}^{(j)}, \ldots, \theta_B^{(j)}\}$. At each step of the sampling also draw $\theta_i^{(j)}$ from $q(\theta_i^*, \theta_i \mid y, \psi_{i-1}^*, \psi^{i+1,(j)})$.

Step 3. Estimate the reduced ordinate in (16) by the ratio of Monte Carlo averages

$$\hat{\pi}(\theta_i^* \mid y, \theta_1^*, \ldots, \theta_{i-1}^*) = \frac{M^{-1} \sum_{g=1}^{M} \alpha(\theta_i^{(g)}, \theta_i^* \mid y, \psi_{i-1}^*, \psi^{i+1,(g)})\, q(\theta_i^{(g)}, \theta_i^* \mid y, \psi_{i-1}^*, \psi^{i+1,(g)})}{J^{-1} \sum_{j=1}^{J} \alpha(\theta_i^*, \theta_i^{(j)} \mid y, \psi_{i-1}^*, \psi^{i+1,(j)})}, \qquad (17)$$

where the average in the denominator may include zeros if there are $\theta_i^{(j)}$ values that lie outside the support of the posterior $S$.

Step 4. Estimate the marginal likelihood on the log scale as

$$\log \hat{m}(y) = \log f(y \mid \theta^*) + \log \pi(\theta^*) - \sum_{i=1}^{B} \log \hat{\pi}(\theta_i^* \mid y, \theta_1^*, \ldots, \theta_{i-1}^*). \qquad (18)$$

It is important to keep in mind that the reduced runs are obtained by fixing an appropriate set of parameters and continuing the MCMC simulation with a smaller set of distributions. Therefore, these runs require little to no coding beyond what is done initially for the sampling of the posterior distribution. As long as the full MCMC sampling scheme has been properly designed (with blocking and proposal densities that avoid reducibility problems), each of the reduced runs is well defined. Finally, observe that the variates $\psi^{i+1}$ in Step 2 are automatically produced in the next reduced run, where the ordinate $\pi(\theta_{i+1}^* \mid y, \theta_1^*, \ldots, \theta_i^*)$ is estimated.
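The way Steps 1–4 chain together across blocks can be summarized in a few lines. The sketch below is our own schematic (not the authors' code); `run_reduced` stands for an assumed wrapper around the existing sampler that fixes the blocks in `fixed` and returns, for block $i$, the numerator terms $\alpha q$ from Step 1 and the denominator terms $\alpha$ from the appended proposal draws of Step 2.

```python
# Schematic assembly of equation (18) from one reduced run per block.
import numpy as np

def log_marginal_likelihood(theta_star, log_lik, log_prior, run_reduced, B):
    log_post_ord = 0.0
    for i in range(B):
        num_terms, den_terms = run_reduced(fixed=theta_star[:i], block=i,
                                           at=theta_star[i])
        log_post_ord += np.log(np.mean(num_terms)) - np.log(np.mean(den_terms))
    return log_lik(theta_star) + log_prior(theta_star) - log_post_ord  # eq. (18)
```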

2.4 Numerical Standard Error of the Marginal Likelihood Estimate

In this section, we discuss briefly how the numerical standard error of the marginal likelihood estimate can be derived. The numerical standard error gives the variation that can be expected in the marginal likelihood estimate if the simulation were to be repeated. For specificity, we consider the case in Section 2.2 under the assumption that $M = J$. Following Chib (1995) we define the vector process

$$h^{(g)} = \begin{pmatrix} h_1^{(g)} \\ h_2^{(g)} \\ h_3^{(g)} \end{pmatrix} = \begin{pmatrix} \alpha(\theta_1^{(g)}, \theta_1^* \mid y, \theta_2^{(g)}, z^{(g)})\, q(\theta_1^{(g)}, \theta_1^* \mid y, \theta_2^{(g)}, z^{(g)}) \\ \alpha(\theta_1^*, \theta_1^{(g)} \mid y, \theta_2^{(g)}, z^{(g)}) \\ \pi(\theta_2^* \mid y, \theta_1^*, z^{(g)}) \end{pmatrix},$$

where the draws in the first component of $h$ are from the full MCMC run, while those in the second and third components are from the reduced run, although this is not emphasized in the notation. If we let $\hat{h} = M^{-1} \sum_{g=1}^{M} h^{(g)}$, then by equations (14) and (15), an estimate of the log-marginal likelihood as a function of the elements of $\hat{h}$ is given by

$$\log \hat{m}(y) = \log f(y \mid \theta^*) + \log \pi(\theta^*) - \left\{\log \hat{h}_1 - \log \hat{h}_2 + \log \hat{h}_3\right\}.$$

To estimate the variance of this quantity, suppose that the values of $f(y \mid \theta^*)$ and $\pi(\theta^*)$ are available directly. Then, the variance of the log-marginal likelihood estimate becomes $\mathrm{var}(\log \hat{m}(y)) = \mathrm{var}(\log \hat{h}_1 - \log \hat{h}_2 + \log \hat{h}_3)$, which can be found by the Delta method once we obtain an estimate of the variance of $\hat{h}$. As $\{h^{(g)}\}$ inherits the ergodicity of the Markov chain, it follows by the ergodic theorem in Tierney (1994) that $\hat{h}$ will converge to its mean almost surely as $M \to \infty$. Under regularity conditions, an estimate of the sample variance of $\hat{h}$ is given by the expression (Newey and West 1987)

$$\mathrm{var}(\hat{h}) = M^{-1}\left[\Omega_0 + \sum_{s=1}^{m}\left(1 - \frac{s}{m+1}\right)\left(\Omega_s + \Omega_s'\right)\right],$$

where

$$\Omega_s = M^{-1} \sum_{g=s+1}^{M} (h^{(g)} - \hat{h})(h^{(g-s)} - \hat{h})', \qquad s = 0, 1, \ldots, m,$$

and $m$ is a constant that represents the lag at which the autocorrelation function of $h^{(g)}$ tapers off. In the examples that follow, we have set the value of $m$, somewhat cautiously, to equal 40. Given this covariance matrix, it follows from the Delta method that $\mathrm{var}(\log \hat{m}(y)) = a'\, \mathrm{var}(\hat{h})\, a$, where $a = (\hat{h}_1^{-1}, -\hat{h}_2^{-1}, \hat{h}_3^{-1})'$. The numerical standard error is the square root of this quantity.

We note that if one also needs to estimate the ordinate of the likelihood function, that of the prior, or both, as in the examples presented in Sections 3.3 and 3.4, then the variance of the latter estimates must be incorporated by a separate calculation. In addition, extending the calculation of the numerical standard error to cases involving more than two blocks is straightforward. One proceeds by including (with appropriate arguments and conditioning sets) more elements like $h_1$ and $h_2$ for each block that is simulated by Metropolis–Hastings, and elements similar to $h_3$ for blocks that are sampled directly.

In closing, we mention that in the subsequent examples we have performed a frequency analysis to verify the accuracy of the numerical standard error estimates obtained by the approach described in this section. The posterior simulations are repeated 50 times for each combination of $M$ and $J$. The standard deviations of the marginal likelihood estimates obtained from these replications are found to closely mirror those from the above approach, thus providing a useful validation of this method.
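The numerical standard error calculation is mechanical once the draws of $h^{(g)}$ are stored. The following sketch (our own code, not the authors') applies the Newey–West estimate of $\mathrm{var}(\hat{h})$ and the Delta method for the three-component process defined above.

```python
# Numerical standard error of log m-hat(y) from the stored h^(g) draws.
import numpy as np

def nse_log_marglik(h, m=40):
    """h : (M, 3) array of draws h^(g); returns the numerical standard error."""
    M = h.shape[0]
    hbar = h.mean(axis=0)
    dev = h - hbar
    var_hbar = dev.T @ dev / M                       # Omega_0
    for s in range(1, m + 1):
        omega_s = dev[s:].T @ dev[:-s] / M           # Omega_s
        var_hbar += (1 - s / (m + 1)) * (omega_s + omega_s.T)
    var_hbar /= M
    a = np.array([1 / hbar[0], -1 / hbar[1], 1 / hbar[2]])   # Delta-method gradient
    return float(np.sqrt(a @ var_hbar @ a))
```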

3. EXAMPLES

We now discuss the performance of the marginal likelihood estimation method by varying different aspects of the MCMC design in several important models. We report evidence on different facets of the design, including the choice of the proposal density, the type of blocking and size of blocks, and the sample sizes $M$ and $J$ used to estimate the posterior ordinate. To measure the efficiency of the MCMC parameter sampling scheme, we use the measures $1 + 2\sum_{l=1}^{\infty} \rho_k(l)$, where $\rho_k(l)$ is the sample autocorrelation at lag $l$ for the $k$th parameter in the sampling, with the summation truncated according to (say) the Parzen window. The latter quantity is called the inefficiency factor or the autocorrelation time and may be interpreted as the ratio of the numerical variance of the posterior mean from the MCMC chain to the variance of the posterior mean from hypothetical independent draws. Our conclusion is that the scheme that is efficient for sampling the posterior distribution, as measured by the inefficiency factors, is also efficient for estimating the marginal likelihood. Thus algorithms that have higher inefficiency factors tend to produce marginal likelihood estimates with larger numerical standard errors, although they have minor to no significant effect on the point estimate of the marginal likelihood. Conversely, algorithms that are designed to be efficient for the simulation of the parameters are also efficient for the estimation of the marginal likelihood.

3.1 Binary Data Logit Model

To begin, we consider the marginal likelihood computation in the binary data logit model, with emphasis on the effect of the proposal density on the marginal likelihood estimate. For this model it is possible to sample the posterior distribution in one block, which allows us to isolate the role of the proposal density without having to consider the influence of the blocking design.

We compare two canonical proposal densities in this setting. The first is a tailored proposal density, and the second is a random walk proposal density. The tailored proposal density is defined by letting $q(\theta, \theta' \mid y) = q(\theta' \mid y) = f_T(\theta' \mid m, V, \nu)$, where $m$ is the mode of the log target density, $V$ is the inverse of the negative Hessian of the log-target evaluated at $m$, and $f_T(\cdot)$ denotes a multivariate-$t$ density with mean $m$, variance $\nu V/(\nu - 2)$, and $\nu$ degrees of freedom. The random walk proposal density is defined as $q(\theta, \theta' \mid y) = f_T(\theta' \mid \theta, \tau V, \nu)$, where we use the tailored matrix $V$, along with a tuning parameter $\tau$, as the dispersion matrix of the proposal.

The model we fit utilizes an $n = 200$ observation data set from Mroz (1987) that deals with the factors that influence participation of the female spouse in the labor market. Seven covariates (non-wife income, number of years of education, years of experience, experience squared, age, number of children less than 6 years old in the household, and number of children greater than 6 years old in the household) are used to explain the binary indicator of labor market participation. Including the constant, the full model contains eight parameters. If we assume that the parameters follow the multivariate normal distribution $\mathcal{N}_8(\beta \mid \beta_0, B_0)$, then the posterior distribution of the parameters is given by

$$\pi(\beta \mid y) \propto \mathcal{N}_8(\beta \mid \beta_0, B_0) \prod_{i=1}^{200} p_i^{y_i}(1 - p_i)^{1 - y_i}, \qquad p_i = \left(1 + e^{-x_i'\beta}\right)^{-1},$$

where $y_i \in \{0, 1\}$ is the binary outcome variable and $x_i$ is the covariate vector on the $i$th subject in the sample.

In Figure 1, we see that the MCMC simulation of the posterior distribution with the tailored proposal density produces inefficiency factors that are about five times smaller than those from the random walk proposal density for each value of $M$ that is used in the experiment. Next, we consider the estimates of the marginal likelihoods under the two proposal densities for different values of $M$ and $J$. These results are shown in Table 1, where we find that the marginal likelihood estimate under the tailored proposal density stabilizes up to the second decimal when the sample sizes are $M = 5{,}000$ and $J = 5{,}000$. Further increases in the sample sizes affect the estimate only in the third decimal place. On the other hand, the estimate from the more volatile random walk chain agrees with the tailored estimate up to the first decimal place only when the sample size is about 20,000. In terms of the variability of the estimates, we see that the numerical standard error of the marginal likelihood estimate, which we report in parentheses below each estimate, is about 10 times smaller from the tailored chain in comparison with the random walk chain.
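To indicate how little is needed beyond the sampler itself, the sketch below fits the logit model by a tailored independence M–H step and then evaluates (9)–(10) at the posterior mean. This is our own illustrative code, not the authors' implementation: data handling is omitted, and the BFGS inverse-Hessian approximation is used as a convenient stand-in for $V$.

```python
# Tailored one-block M-H for the binary logit model with marginal likelihood.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_t, multivariate_normal

def log_post(beta, y, X, b0, B0):
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))      # logit log likelihood
    return loglik + multivariate_normal(b0, B0).logpdf(beta)

def logit_log_marglik(y, X, b0, B0, M=5000, J=5000, nu=15, seed=0):
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    # tailored proposal: mode and (approximate) inverse negative Hessian
    opt = minimize(lambda b: -log_post(b, y, X, b0, B0), np.zeros(k), method="BFGS")
    m, V = opt.x, opt.hess_inv
    prop = multivariate_t(loc=m, shape=V, df=nu)
    # one-block independence M-H run
    draws, beta = np.empty((M, k)), m.copy()
    for g in range(M):
        cand = prop.rvs(random_state=rng)
        la = (log_post(cand, y, X, b0, B0) - log_post(beta, y, X, b0, B0)
              + prop.logpdf(beta) - prop.logpdf(cand))
        if np.log(rng.uniform()) < la:
            beta = cand
        draws[g] = beta
    # marginal likelihood at beta* = posterior mean, equations (9)-(10)
    bstar = draws.mean(axis=0)
    lp_star = log_post(bstar, y, X, b0, B0)                 # log f(y|b*) + log prior(b*)
    num = np.mean([np.exp(min(0.0, lp_star - log_post(b, y, X, b0, B0)
                              + prop.logpdf(b) - prop.logpdf(bstar))
                          + prop.logpdf(bstar)) for b in draws])
    den = np.mean([np.exp(min(0.0, log_post(c, y, X, b0, B0) - lp_star
                              + prop.logpdf(bstar) - prop.logpdf(c)))
                   for c in prop.rvs(size=J, random_state=rng)])
    return lp_star - (np.log(num) - np.log(den))            # eq. (10)
```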

[Figure 1. Binary Logit Model. Inefficiency factors for the eight covariate parameters under the tailored and random walk proposal densities and choices of M = J (5,000; 10,000; 20,000).]

Table 1. Log-Marginal Likelihood Estimates for the Binary Logit Model Using Tailored and Random Walk Proposal Densities

(M, J)               Tailored            Random walk
(5,000, 5,000)       -144.756 (.015)     -144.835 (.201)
(10,000, 10,000)     -144.757 (.011)     -144.884 (.106)
(20,000, 20,000)     -144.751 (.010)     -144.779 (.103)

NOTE: Numerical standard errors are in parentheses.

3.2 Longitudinal Hierarchical Model for Clustered Data

In this example, we illustrate the impact of the MCMC blocking scheme on the marginal likelihood estimate. We use two schemes to sample the posterior distribution of the parameters. In the first scheme, multiple blocks are used to sample the posterior distribution. In the second scheme, the posterior samples are drawn in one block, marginalized over the latent data and a subset of the parameters. Within each scheme we are careful to ensure that the sampling is done as efficiently as possible in order to isolate the pure effect of blocking. Otherwise, of course, the deficiencies of the sampling scheme would be confounded with the effect of blocking. Once again, we find that the MCMC parameter simulation scheme that produces the lower inefficiency factors also tends to produce marginal likelihood estimates with lower numerical standard errors.

The effect of blocking is considered in the context of a Gaussian longitudinal data model with random effects. We use the same model and data as Han and Carlin (2000), where several model space methods, including those of Carlin and Chib (1995) and Green (1995), are compared. Han and Carlin report that the model space methods were difficult to implement in this setting.

The data used in our illustration is from a clinical trial on the effectiveness of two antiretroviral drugs (didanosine (ddI) and zalcitabine (ddC)) in 467 persons with advanced HIV infection. The response variable for patient $i$ at time $j$ is $y_{ij}$, the square root of the patient's CD4 count, a seriological measure of immune system health and prognostic factor for AIDS-related illness and mortality. The dataset records patient CD4 counts at study entry and again at 2, 6, 12, and 18 months after entry, for the ddI and ddC groups, respectively. Following the aforementioned work, we fit a mixed-effects model

$$y_i = X_i\beta + W_i b_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}_{n_i}(0, \sigma^2 I_{n_i}), \qquad b_i \sim \mathcal{N}_2(0, D),$$

where the $j$th row of patient $i$'s random effects design matrix $W_i$ takes the form $w_{ij} = (1, t_{ij})$, with $t_{ij} \in \{0, 2, 6, 12, 18\}$, and the fixed design matrix is $X_i = [W_i \mid d_i W_i \mid a_i W_i]$. The $d_i$ is a binary variable indicating whether patient $i$ received ddI ($d_i = 1$) or ddC ($d_i = 0$), and $a_i$ is a binary variable indicating if the patient was diagnosed as having AIDS at baseline ($a_i = 1$) or not ($a_i = 0$).

The prior distribution of $\beta \in \mathbb{R}^6$ is assumed to be $\mathcal{N}_6(\beta \mid \beta_0, B_0)$ with

$$\beta_0 = (10, 0, 0, 0, -3, 0)' \quad \text{and} \quad B_0 = \mathrm{Diag}\left(2^2, 1^2, (0.1)^2, 1^2, 1^2, 1^2\right),$$

that on $D^{-1}$ is Wishart $\mathcal{W}_2(\rho_0, R_0/\rho_0)$ with $\rho_0 = 24$ and $R_0 = \mathrm{diag}(0.25, 16)$, and finally that on $\sigma^2$ is inverse gamma $\mathcal{IG}(\nu_0/2, \delta_0/2)$ with $\nu_0 = 6$ and $\delta_0 = 400$.

Two blocking schemes are considered for this model, both of which rely on the facts, exploited by Chib and Carlin (1999), that

$$y_i \mid \beta, D, \sigma^2 \sim \mathcal{N}_{n_i}(X_i\beta, \Omega_i), \qquad \beta \mid y, D, \sigma^2 \sim \mathcal{N}_6(\hat{\beta}, B_n),$$

where $\Omega_i = \sigma^2 I_{n_i} + W_i D W_i'$, $\hat{\beta} = B_n\left(B_0^{-1}\beta_0 + \sum_{i=1}^{467} X_i'\Omega_i^{-1} y_i\right)$, and $B_n = \left(B_0^{-1} + \sum_{i=1}^{467} X_i'\Omega_i^{-1} X_i\right)^{-1}$. Therefore, one MCMC scheme, which we refer to as the multiple-block scheme, is given as follows.

Longitudinal Data Model: Multiple Block MCMC Algorithm

1. Sample $\beta \sim \mathcal{N}_6(\hat{\beta}, B_n)$ and

$$b_i \sim \mathcal{N}_2\left(\hat{D}_i\, \sigma^{-2} W_i'(y_i - X_i\beta),\ \hat{D}_i\right), \qquad \hat{D}_i = \left(D^{-1} + \sigma^{-2} W_i'W_i\right)^{-1}, \qquad i \leq 467.$$

2. Sample

$$D^{-1} \sim \mathcal{W}_2\left\{\rho_0 + 467,\ \left[R_0^{-1} + \sum_{i=1}^{467} b_i b_i'\right]^{-1}\right\}.$$

3. Sample

$$\sigma^2 \sim \mathcal{IG}\left\{\frac{\nu_0 + \sum_i n_i}{2},\ \frac{\delta_0 + \sum_{i=1}^{467} \|y_i - X_i\beta - W_i b_i\|^2}{2}\right\}.$$

Because each of these distributions is tractable, the posterior ordinate can be computed by the direct Chib method using the decomposition

$$\pi(D^{-1*}, \sigma^{2*}, \beta^* \mid y) = \pi(D^{-1*} \mid y)\, \pi(\sigma^{2*} \mid y, D^*)\, \pi(\beta^* \mid y, D^{-1*}, \sigma^{2*}),$$

where the first term is obtained by averaging the Wishart density over draws on $\{b_i\}$ from the full run, the second ordinate is estimated by averaging the inverse gamma full conditional density of $\sigma^2$ over draws on $(\beta, \{b_i\})$ from a reduced run conditioned on $D^*$, and the third ordinate is multivariate normal as given previously and available directly.
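For the first term of this decomposition, the Rao–Blackwell average is a one-line computation once the $\{b_i\}$ draws are stored. The sketch below is our own illustration (helper names are assumptions) of that average, following the Wishart update in step 2 above.

```python
# Rao-Blackwell estimate of pi(D^{-1*} | y) by averaging the Wishart
# full-conditional density over the {b_i} draws from the full run.
import numpy as np
from scipy.stats import wishart

def wishart_ordinate(Dinv_star, b_draws, R0, rho0):
    """b_draws : list over MCMC iterations g of (467, 2) arrays of b_i^(g)."""
    vals = []
    for b in b_draws:
        scale = np.linalg.inv(np.linalg.inv(R0) + b.T @ b)  # [R0^{-1} + sum b_i b_i']^{-1}
        vals.append(wishart(df=rho0 + b.shape[0], scale=scale).pdf(Dinv_star))
    return float(np.mean(vals))
```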
The d is a binary i i i— i i— i i i In the second scheme, the parameters of the model are variable indicating whether patient received ddI ( 1) or i di = sampled in one block by using the fact that the density ddC ( 0), and is a binary variable indicating if the di = ai of the observations marginalized over both 4‚1 8b 95 can be patient was diagnosed as having AIDS at baseline ( 1) or i ai = expressed as not (ai = 0). The prior distribution of ‚ 2 6 1 is assumed to be ®64‚ 1 B05 with 467 0 ” 4‚ ‚ 1 B 5ç ”n 4yi Xi‚1 ìi5 f4y D1 ‘ 25 = 6 O — 0 0 i=1 i — O 1 ‚ = 4101 01 01 01 31 051 — ” 4‚ ‚1 B 5 0 ƒ 6 O — O n and which is an application of the basic marginal likelihood iden- tity, now used to Ž nd the marginal density of y conditioned on 2 2 2 2 2 2 2 1 2 B0 = Diag42 1 1 1 4015 1 1 1 1 1 1 51 4D1 ‘ 5. Therefore, the posterior of 4Dƒ 1 ‘ 5 is proportional to this density times the prior densities on these parameters 1 that on Dƒ is Wishart ·2401 R0=05 with 0 = 24 and and can be sampled by a one-block tailored M–H method. In 2 1 2 R = diag40251 165 and Ž nally that on ‘ is inverse gamma particular, let f 4Dƒ  1 R 5 and f 4‘  =21 „ =25 denote 0 W — 0 0 IG — 0 0 © §40=21 „0=25 with 0 = 6 and „0 = 400. the Wishart and inverse gamma prior densities, respectively, Chib and Jeliazkov: Marginal Likelihood Estimation 277 and let 4m1 V5 denote the modal value and inverse of the neg- sets of estimates are virtually identical, thus showing that the 1 2 ative Hessian at the mode of the function log f 4y Dƒ 1 ‘ 5 + direct Chib method and its extension to M–H chains produce 1 2 — 1 log fW 4Dƒ 01 R05 + log fIG 4‘ 0=21 „0=25 and let q4Dƒ 1 the same answer, when both can be applied, and in parallel 2 — 1 2 — ‘ y5 = f 4Dƒ 1 ‘ m1 V1 5 denote the proposal density. with the evidence presented previously, that the scheme that — T — Then, in the one block algorithm for the clustered Gaus- produces the lower inefŽ ciency factors (here the single block 10 20 1 2 sian model we propose 4Dƒ 1 ‘ 5 fT 4Dƒ 1 ‘ m1 V1 5 and scheme) produces the marginal likelihood estimate with the 10 20 — move to 4Dƒ 1 ‘ 5 with probability smaller numerical standard error. We should note, however, that not all of the difference in the numerical standard errors

10 20 10 20 is due to the effect of blocking because some of it comes from f4y Dƒ 1 ‘ 5f 4Dƒ  1 R 5f 4‘  =21 „ =25  = min — W — 0 0 IG — 0 0 the difference in the size of the posterior ordinates that are esti- f4y D 11 ‘ 25f 4D 1  1 R 5f 4‘2  =21 „ =25 ( — ƒ W ƒ — 0 0 IG — 0 0 mated under the two schemes: 4-dimensional in the one-block 1 2 f 4Dƒ 1 ‘ m1 V1 5 scheme versus 10-dimensional in the multiple-block scheme. T — 1 1 D 10 20 m V On the basis of these experiments, which cover important fT 4 ƒ 1 ‘ 1 1 5 ) — model situations, one may conclude that the marginal likeli- Given the output from this scheme, we estimate the marginal hood estimation method in this article appears to be robust likelihood by the one-block estimate given in Section 2.1. It to changes in the simulation sample sizes, blocking schemes, should be noted that this scheme provides an entirely differ- and sampling methods. For this method, the numerical stan- ent route to Ž nding the marginal likelihood in relation to the dard error of the marginal likelihood estimate is lower for the one presented in the preceding paragraph. We think it is inter- schemes that produce the lower inefŽ ciency factors. esting to see what estimates emerge from these two different approaches to the same problem. 3.3 Poisson Longitudinal Regression In Figure 2, we report the inefŽ ciency factors under the two 2 schemes for the parameters D11, D12, D22, and ‘ . Observe We now consider the calculation of the marginal likelihood that the single block M–H scheme lowers the inefŽ ciency fac- in a nonlinear latent variable problem where the likelihood tors by a factor of about two. Next, in Table 2 we report calculation is quite complicated and latent variables cannot be the estimates of the marginal likelihood under the two sam- avoided in the MCMC simulation. With this example, which pling schemes. The Ž rst important point to note is that the two has n = 58 blocks of latent variables that must each be sam-

10 10

8 8 y y c c

n 6 n 6 e e i i c c i i f f f f

e 4 e 4 n n I I

2 2

0 0 1 2 3 4 1 2 3 4 Multipleblock scheme (M=J=5000) Single block scheme (M=J=5000)

10 10

8 8 y y c c

n 6 n 6 e e i i c c i i f f f f

e 4 e 4 n n I I

2 2

0 0 1 2 3 4 1 2 3 4 Multipleblock scheme (M=J=10000) Single block scheme (M=J=10000)

10 10

8 8 y y c c

n 6 n 6 e e i i c c i i f f f f

e 4 e 4 n n I I

2 2

0 0 1 2 3 4 1 2 3 4 Multipleblock scheme (M=J=20000) Single block scheme (M=J=20000)

Figure 2. Gaussian Longitudinal Model. Inef ciency factors under two different sampling schemes and choices of M J for the parameters 2 D11 , D12, D22, and ‘ 278 Journal of the American Statistical Association, March 2001

Table 2. Effect of Blocking Scheme in the Longitudinal Model on Log-Marginal Likelihood Estimates

(M, J)               Multiple blocks       Single block
(5,000, 5,000)       -3577.551 (.028)      -3577.578 (.009)
(10,000, 10,000)     -3577.565 (.020)      -3577.566 (.006)
(20,000, 20,000)     -3577.575 (.014)      -3577.574 (.006)

NOTE: Numerical standard errors are in parentheses.

3.3 Poisson Longitudinal Regression

We now consider the calculation of the marginal likelihood in a nonlinear latent variable problem where the likelihood calculation is quite complicated and latent variables cannot be avoided in the MCMC simulation. With this example, which has $n = 58$ blocks of latent variables that must each be sampled by a M–H step, we are able to illustrate the point made in Section 2 that the intractability of the latent variable distributions does not alter the scheme for estimating the marginal likelihood.

The model of interest, taken from Diggle, Liang, and Zeger (1995), is concerned with the modeling of seizure counts $\{y_{it}\}$ for each of $i = 1, \ldots, 59$ epileptics measured first over an 8-week baseline period ($t = 0$) and then over four subsequent 2-week periods $t = 1, \ldots, 4$. At the end of the baseline period, each patient is randomly assigned to either receive the drug progabide or a placebo. After removing observation 49, we fit the model

$$y_{it} \mid \beta, b_i \sim \mathrm{Poisson}(\lambda_{it}), \qquad \ln(\lambda_{it}) = \ln \tau_{it} + \beta_1 x_{it1} + \beta_2 x_{it1} x_{it2} + b_{i1} + b_{i2} x_{it2}, \qquad b_i \sim \mathcal{N}_2(\eta, D),$$

where the covariate $x_{it1}$ is an indicator for treatment status, $x_{it2}$ is an indicator of period (0 if baseline and 1 otherwise), $\tau_{it}$ is the offset that is equal to 8 in the baseline period and 2 otherwise, and $b_i = (b_{i1}, b_{i2})$ are latent random effects. This specification of the model produces the likelihood function

$$f(y \mid \beta, \eta, D) = \prod_{i=1}^{58} \int \mathcal{N}_2(b_i \mid \eta, D) \prod_{t=0}^{4} p(y_{it} \mid \beta, b_i)\, db_i = \prod_{i=1}^{58} \int \mathcal{N}_2(b_i \mid \eta, D)\, f(y_i \mid \beta, b_i)\, db_i, \qquad (19)$$

where $p(\cdot)$ is the Poisson mass function with mean $\lambda_{it}$. We now follow Chib, Greenberg, and Winkelmann (1998) and conduct the posterior MCMC simulations with the full conditional distributions of $\beta$, $\{b_i\}$, $\eta$, and $D^{-1}$, where the $\beta$ and $\{b_i\}$ blocks are each updated by M–H steps. If we let $\beta \sim \mathcal{N}_2(0, 100 I_2)$, $\eta \sim \mathcal{N}_2(0, 100 I_2)$, and $D^{-1} \sim \mathcal{W}_2(4, I_2)$, then the complete MCMC algorithm for simulating the posterior distribution is defined as follows.

Longitudinal Poisson Model: MCMC Algorithm

1. Calculate the parameters $(m_0, V_0)$ as the mode and inverse of the negative Hessian at the mode of $\log \mathcal{N}_2(\beta \mid 0, 100 I_2) + \sum_{i=1}^{58} \log f(y_i \mid \beta, b_i)$, propose $\beta' \sim f_T(\beta \mid m_0, V_0, 15) \equiv q(\beta \mid y, \{b_i\})$, and move to $\beta'$ with probability

$$\alpha(\beta, \beta' \mid y, \{b_i\}) = \min\left\{1,\; \frac{\prod_{i=1}^{58} f(y_i \mid \beta', b_i)\, \mathcal{N}_2(\beta' \mid 0, 100 I_2)}{\prod_{i=1}^{58} f(y_i \mid \beta, b_i)\, \mathcal{N}_2(\beta \mid 0, 100 I_2)} \times \frac{f_T(\beta \mid m_0, V_0, 15)}{f_T(\beta' \mid m_0, V_0, 15)}\right\}.$$

2. Calculate the parameters $(m_i, V_i)$ as the mode and inverse of the negative Hessian at the mode of $\log \mathcal{N}_2(b_i \mid \eta, D) + \log f(y_i \mid \beta, b_i)$, propose $b_i' \sim f_T(b_i \mid m_i, V_i, \nu)$, and move to $b_i'$ with probability

$$\alpha_i = \min\left\{1,\; \frac{f(y_i \mid \beta, b_i')\, \mathcal{N}_2(b_i' \mid \eta, D)}{f(y_i \mid \beta, b_i)\, \mathcal{N}_2(b_i \mid \eta, D)} \times \frac{f_T(b_i \mid m_i, V_i, \nu)}{f_T(b_i' \mid m_i, V_i, \nu)}\right\}.$$

3. Sample $\eta \sim \mathcal{N}_2(\hat{\eta}, M)$, where $M = \left(100^{-1} I_2 + 58 D^{-1}\right)^{-1}$ and $\hat{\eta} = M D^{-1} \sum_{i=1}^{58} b_i$.

4. Sample

$$D^{-1} \sim \mathcal{W}_2\left\{62,\ \left[I_2 + \sum_{i=1}^{58} (b_i - \eta)(b_i - \eta)'\right]^{-1}\right\}.$$

To estimate the marginal likelihood, we observe that this MCMC scheme is a special case of the setup in Section 2.3 with $B = 3$, $\theta_1 = D^{-1}$, $\theta_2 = \beta$, and $\theta_3 = \eta$, where the additional blocks of latent variables can be integrated out in the manner described in Section 2.2. We begin by decomposing the posterior ordinate as

$$\pi(D^{-1*}, \beta^*, \eta^* \mid y) = \pi(D^{-1*} \mid y)\, \pi(\beta^* \mid y, D^*)\, \pi(\eta^* \mid y, D^*, \beta^*),$$

where only the 2-dimensional reduced ordinate $\pi(\beta^* \mid y, D^*)$ cannot be estimated by the Rao–Blackwell device. Specifically, an estimate of the first ordinate can be found by averaging the Wishart density over draws on $\{b_i\}$ and $\eta$ from the full run. Then, from Section 2.3, our estimate of the second ordinate is given by

$$\hat{\pi}(\beta^* \mid y, D^*) = \frac{M^{-1} \sum_{g=1}^{M} \alpha(\beta^{(g)}, \beta^* \mid y, \{b_i^{(g)}\})\, q(\beta^* \mid y, \{b_i^{(g)}\})}{J^{-1} \sum_{j=1}^{J} \alpha(\beta^*, \beta^{(j)} \mid y, \{b_i^{(j)}\})}, \qquad (20)$$

where the draws in the numerator are from a reduced run with the full conditional distributions of $\beta$, $\eta$, and $\{b_i\}$, conditioned on $D^*$. The draws in the denominator are from a second reduced run with the full conditional distributions of $\{b_i\}$ and $\eta$, conditioned on $(D^*, \beta^*)$, with an appended step in which $\beta^{(j)}$ is drawn from $q(\beta \mid y, \{b_i^{(j)}\})$. The draws on $\{b_i\}$ in the latter run are also used to average the normal density of $\eta$ to produce an estimate of the third ordinate. The log-marginal likelihood is then obtained from (18) after the likelihood ordinate $f(y \mid \beta^*, \eta^*, D^*)$ is found by estimating the integral in (19) by an importance sampling method.
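One possible form of that importance sampling step is sketched below for a single subject's contribution to (19); this is our own illustration (the choice of the tailored multivariate-$t$ density $f_T(b_i \mid m_i, V_i, \nu)$ as the importance function, and all argument names, are assumptions), not the authors' code. The likelihood ordinate is then the sum of these log contributions over $i$, evaluated at $(\beta^*, \eta^*, D^*)$.

```python
# Importance sampling estimate of one factor of the likelihood integral (19).
import numpy as np
from scipy.stats import multivariate_t, multivariate_normal, poisson

def log_like_contrib_i(y_i, tau_i, x1_i, x2_i, beta, eta, D, m_i, V_i,
                       nu=15, R=2000, seed=0):
    """Estimate log of integral N_2(b | eta, D) f(y_i | beta, b) db for subject i."""
    rng = np.random.default_rng(seed)
    imp = multivariate_t(loc=m_i, shape=V_i, df=nu)
    b = imp.rvs(size=R, random_state=rng)                              # (R, 2) draws
    fixed = np.log(tau_i) + beta[0] * x1_i + beta[1] * x1_i * x2_i     # (T,)
    log_lam = fixed[None, :] + b[:, [0]] + np.outer(b[:, 1], x2_i)     # (R, T)
    log_f = poisson.logpmf(y_i[None, :], np.exp(log_lam)).sum(axis=1)  # log f(y_i | beta, b)
    log_w = multivariate_normal(eta, D).logpdf(b) + log_f - imp.logpdf(b)
    c = log_w.max()
    return float(c + np.log(np.mean(np.exp(log_w - c))))
```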
If we decom- the ith subject, conditioned on the parameters ‚ and a corre- pose the posterior ordinate at the posterior mean as lation matrix è is  4‘ 1 ‚ y5 =  4‘ y5 4‚ y1 ‘ 51 Pr4yi ‚1 è5 = ”J 4zi Xi‚1 è5 dzi1 (21) — — — — Bi7 Bi1 — Z Z then the Ž rst ordinate is available from (14) and the second where B is the interval 401 5 if y = 1 and the interval ordinate as the average ij ˆ ij 4 1 07 if yij = 0 and zi = 4zi11 : : : 1 zi75 is a vector of latent ƒˆ M 4g5 variables with distribution 1  4‚ y1 è 5 M ƒ ” 4‚ ‚ 1 B 5 = 3 O n zi ‚1 è ®74Xi‚1 è51 O — g=1 — — X Xi = 4xi01 : : : 1 xi250 is a 7 3 matrix of the covariates 4g5 520 1 4g5 1 where ‚ = B 4 X0 è ƒ z 5, B = 410ƒ I + augmented with a vector of 1s and ‚ = 4‚01 : : : 1 ‚250 O n i=1 i i n 3 520 1 1 4g5 are the regression parameters. The 21 free elements i=1 Xi0è ƒ Xi5ƒ and zi are draws from a reduced MCMC of the correlation matrix è are denoted by ‘ = run with è Ž xed at è P. We are particularly interested in see- P 4‘211 ‘311 ‘321 ‘411 ‘421 ‘431 : : : 1 ‘711 : : : 1 ‘765. The likelihood ing how the proposed method is able to estimate the marginal of this model is difŽ cult to compute because the integral in likelihood if the 21-dimensional ordinate  4‘ y5 is estimated (21), which deŽ nes the likelihood contribution of the ith sub- in one block. For comparison purposes we als— o utilize the ject, is not available in closed form. This necessitates the use approach of Chib and Greenberg (1998), where the posterior of latent data in the MCMC sampling. ordinate is estimated from We now consider how well our method performs in this context. Under the prior distributions  4‚5 = ” 4‚ 01 10I 5  4‘ y5 4‘ y1‘ 5 4‘ y1‘ 1‘ 5 4‘ y1‘ 1‘ 1‘ 51 3 — 4 1— 2— 1 3— 1 2 4— 1 2 3 and  4‘5 ”214‘ 03i211 02I215IS , where i21 is a vector of ones / — 21 and S is the subset of R that leads to a positive deŽ nite where ‘ 11 ‘ 2, and ‘ 3 are 6-dimensional subblocks of ‘ and correlation matrix, the posterior distribution is sampled using ‘ 4 is 3-dimensional. Each of the conditional ordinates in the the algorithm of Chib and Greenberg (1998), which is based latter expression is estimated by kernel smoothing applied to on the method of Albert and Chib (1993), by the 4520 75+ 2 20,000 values of ‘ i generated from  4‘ i y1 ‘ 11 : : : 1 ‘ i 15 — ƒ conditional distributions in a reduced Markov chain Monte Carlo run. In relation to the one ‘ block setup, the latter sampler requires three  48z 9 y1 ‚1 ‘5 4i = 11 : : : 1 5201 j = 11 : : : 1 753 additional reduced runs and consumes about twice as much ij —  4‚ y1 8z 91è53  4‘ y1 8z 91 ‚51 time. Interestingly, however, the two setups produce virtu- — i — i ally the same answer. From the one ‘ block method based where the last distribution, which is high-dimensional, is sam- on M = J = 201000, the marginal likelihood is found to be pled in one block by a conditional tailored M–H step. 115500533 with a numerical standard error of .842. Then, the ƒ 280 Journal of the American Statistical Association, March 2001 multiple ‘ block setup produces the value 11550045 with a machine. Multiple evaluations of the likelihood are therefore ƒ numerical standard error of .278. The larger numerical stan- extremely costly. 
It should be clear that this situation is similar to the one discussed in Section 2.2, except that the latent variables are sampled not in one block but in several blocks. If we decompose the posterior ordinate at the posterior mean as

$$\pi(\sigma^*, \beta^* \mid y) = \pi(\sigma^* \mid y)\, \pi(\beta^* \mid y, \sigma^*),$$

then the first ordinate is available from (14) and the second ordinate as the average

$$\hat{\pi}(\beta^* \mid y, \Sigma^*) = M^{-1} \sum_{g=1}^{M} \mathcal{N}_3(\beta^* \mid \hat{\beta}^{(g)}, B_n),$$

where $\hat{\beta}^{(g)} = B_n\left(\sum_{i=1}^{520} X_i'\Sigma^{*-1} z_i^{(g)}\right)$, $B_n = \left(10^{-1} I_3 + \sum_{i=1}^{520} X_i'\Sigma^{*-1} X_i\right)^{-1}$, and $z_i^{(g)}$ are draws from a reduced MCMC run with $\Sigma$ fixed at $\Sigma^*$. We are particularly interested in seeing how the proposed method is able to estimate the marginal likelihood if the 21-dimensional ordinate $\pi(\sigma^* \mid y)$ is estimated in one block. For comparison purposes we also utilize the approach of Chib and Greenberg (1998), where the posterior ordinate is estimated from

$$\pi(\sigma_1^* \mid y)\, \pi(\sigma_2^* \mid y, \sigma_1^*)\, \pi(\sigma_3^* \mid y, \sigma_1^*, \sigma_2^*)\, \pi(\sigma_4^* \mid y, \sigma_1^*, \sigma_2^*, \sigma_3^*),$$

where $\sigma_1$, $\sigma_2$, and $\sigma_3$ are 6-dimensional subblocks of $\sigma$ and $\sigma_4$ is 3-dimensional. Each of the conditional ordinates in the latter expression is estimated by kernel smoothing applied to 20,000 values of $\sigma_i$ generated from $\pi(\sigma_i \mid y, \sigma_1^*, \ldots, \sigma_{i-1}^*)$ in a reduced Markov chain Monte Carlo run. In relation to the one $\sigma$ block setup, the latter sampler requires three additional reduced runs and consumes about twice as much time. Interestingly, however, the two setups produce virtually the same answer. From the one $\sigma$ block method based on $M = J = 20{,}000$, the marginal likelihood is found to be $-11550.533$ with a numerical standard error of .842. Then, the multiple $\sigma$ block setup produces the value $-11550.45$ with a numerical standard error of .278. The larger numerical standard error of the one $\sigma$ block method is a consequence of the fact that the sampling of $\sigma$ from $\pi(\sigma \mid y, \{z_i\}, \beta)$ produces higher inefficiency factors for each component of $\sigma$ than when $\sigma$ is generated in three 6-dimensional blocks and one 3-dimensional block, which shows once again that the more efficient MCMC simulation routine reduces the numerical standard error of the marginal likelihood estimate.

3.5 Additional Examples and Alternative Methods

The method developed in this article has been applied to other important models. For example, we have conducted an analysis of the method in the context of ordinal probit models, autoregressive and moving average (ARMA) models, and parametric models for survival data such as the Weibull and the log-logistic. We find support for the same general conclusions presented previously. For instance, in our ordinal data example, a 16-dimensional posterior ordinate is estimated two ways: from the output of a single block M–H algorithm and from the output of a two-block algorithm. In this case, the single block algorithm produces higher inefficiency factors and, in keeping with the results presented previously, a higher numerical standard error in the marginal likelihood estimation.

In addition, we have considered some of the alternative methods that are discussed in the introduction. One such alternative is the method of kernel smoothing to estimate the posterior ordinate. For high-dimensional problems, we find that the kernel-based estimate exhibits two deficiencies: larger numerical standard errors and significant bias that tends to dissipate slowly as a function of the simulation sample size. For example, in the logit model of Section 3.1, the kernel based estimate is $-141.722$ after 5,000 draws and $-142.229$ after 40,000 draws, and the standard errors are approximately 10 times larger than those produced by the proposed method. The kernel estimates can be made more accurate by breaking up the parameter vector into smaller blocks, as is done in the multivariate probit example, and then proceeding with additional reduced runs to produce the correct sample draws. This procedure reduces the bias and lowers the numerical standard error of the marginal likelihood estimate but takes increasingly more time as the size of each MCMC block is increased. Thus to apply this method one would usually need more blocks in the MCMC sampling than would arise from the natural full conditional structure of the model.

Furthermore, we have examined the marginal likelihood estimation methods proposed by Gelfand and Dey (1994) and Meng and Wong (1996) in the context of the problems that are discussed in this article. Although these methods both require the use of certain auxiliary functions, each produces results in close agreement with those in Tables 1 and 2. This is what one might expect given that both models are relatively simple and the sample sizes are relatively large. In the last two models, however, these methods are computationally much more expensive than our method because of the required evaluation of the likelihood for each sampled draw. In the multivariate probit model discussed previously, for example, a single likelihood evaluation, which is all that is required in our approach, takes about 2 minutes of computing time on a 550 MHz machine. Multiple evaluations of the likelihood are therefore extremely costly. It is not possible to circumvent this burden by keeping the approximately 3,600 latent variables in the last example, because the consequent increase in the dimension of the ordinates leads to inefficient estimates. Model space methods, on the other hand, have a different underpinning and provide a complementary set of tools for finding Bayes factors. In general, these methods cannot be competitive with a direct marginal likelihood estimation method when the number of models being compared is small, because it is easier to fit each model directly than to design a new mega-simulation procedure with its attendant tuning and other costs. Han and Carlin (2000), in the context of our second example, make the same point and report favorably on our method in relation to the methods of Green (1995) and Carlin and Chib (1995). Nonetheless, a comparative study of these methods in more general settings, such as those in our last two examples, will be desirable once suitable model space methods for those models have been worked out.

4. CONCLUDING REMARKS

In this article, we have derived and illustrated an extended version of the Chib method of computing model marginal likelihoods for Bayesian model comparisons based on the output of Metropolis–Hastings MCMC chains. By completing the Chib method we now have a framework for calculating marginal likelihoods in practically all models that are fit by Markov chain Monte Carlo methods. One virtue of the approach is that it is based on the programming that is done to simulate the posterior distribution. Thus once the basic programming has been done, one can find the marginal likelihood of a given collection of models by rearrangement of existing code and, importantly, without further tuning of the MCMC algorithm. In experiments involving the logit model for binary data, the hierarchical random effects model for clustered Gaussian data, the Poisson regression model for clustered count data, and the multivariate probit model for correlated binary data, we have illustrated the performance and implementation of the method. In our examples, the method is robust to changes in the blocking schemes and proposal densities that are used to sample the posterior distribution. Furthermore, the sampling scheme that is efficient for sampling the posterior distribution, as measured by the inefficiency factors, is also efficient for estimating the marginal likelihood.

[Received July 1999. Revised June 2000.]

REFERENCES

Albert, J., and Chib, S. (1993), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669–679.
Brown, B. W. (1980), "Prediction Analyses for Binary Data," in Biostatistics Casebook, eds. R. J. Miller, B. Efron, B. W. Brown, and L. E. Moses, New York: Wiley.
Carlin, B. P., and Chib, S. (1995), "Bayesian Model Choice via Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 57, 473–484.

Chen, M.-H., and Shao, Q.-M. (1997), "On Monte Carlo Methods for Estimating Ratios of Normalizing Constants," The Annals of Statistics, 25, 1563–1594.
Chib, S. (1995), "Marginal Likelihood from the Gibbs Output," Journal of the American Statistical Association, 90, 1313–1321.
Chib, S., and Carlin, B. P. (1999), "On MCMC Sampling in Hierarchical Longitudinal Models," Statistics and Computing, 9, 17–26.
Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis–Hastings Algorithm," American Statistician, 49, 327–335.
Chib, S., and Greenberg, E. (1998), "Analysis of Multivariate Probit Models," Biometrika, 85, 347–361.
Chib, S., Greenberg, E., and Winkelmann, R. (1998), "Posterior Simulation and Bayes Factors in Panel Count Data Models," Journal of Econometrics, 86, 33–54.
Chib, S., Nardari, F., and Shephard, N. (1999), "Analysis of High Dimensional Multivariate Stochastic Volatility Models," Technical report, Olin School of Business, Washington University, St. Louis, MO.
DiCiccio, T. J., Kass, R. E., Raftery, A. E., and Wasserman, L. (1997), "Computing Bayes Factors by Combining Simulation and Asymptotic Approximations," Journal of the American Statistical Association, 92, 903–915.
Diggle, P., Liang, K.-Y., and Zeger, S. L. (1995), Analysis of Longitudinal Data, Oxford: Oxford University Press.
Gelfand, A. E., and Dey, D. (1994), "Bayesian Model Choice: Asymptotics and Exact Calculations," Journal of the Royal Statistical Society, Ser. B, 56, 501–514.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Green, P. J. (1995), "Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination," Biometrika, 82, 711–732.
Han, C., and Carlin, B. (2000), "MCMC Methods for Computing Bayes Factors: A Comparative Review," Research Report 2000-001, Division of Biostatistics, University of Minnesota.
Hastings, W. K. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97–109.
Meng, X.-L., and Wong, W. H. (1996), "Simulating Ratios of Normalizing Constants via a Simple Identity: A Theoretical Exploration," Statistica Sinica, 6, 831–860.
Mroz, T. A. (1987), "The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions," Econometrica, 55, 765–799.
Newey, W. K., and West, K. D. (1987), "A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix," Econometrica, 55, 703–708.
Ritter, C., and Tanner, M. A. (1992), "Facilitating the Gibbs Sampler: The Gibbs Stopper and the Griddy–Gibbs Sampler," Journal of the American Statistical Association, 87, 861–868.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation," Journal of the American Statistical Association, 82, 528–549.
Tierney, L. (1994), "Markov Chains for Exploring Posterior Distributions" (with discussion), The Annals of Statistics, 22, 1701–1762.
Verdinelli, I., and Wasserman, L. (1995), "Computing Bayes Factors Using a Generalization of the Savage–Dickey Density Ratio," Journal of the American Statistical Association, 90, 614–618.