Marginal Likelihood and Bayes Factors for Dirichlet Process Mixture Models

Sanjib Basu and Siddhartha Chib

We present a method for comparing semiparametric Bayesian models, constructed under the Dirichlet process mixture (DPM) framework, with alternative semiparametric or parametric Bayesian models. A distinctive feature of the method is that it can be applied to semiparametric models containing covariates and hierarchical prior structures, and is apparently the first method of its kind. Formally, the method is based on the marginal likelihood estimation approach of Chib (1995) and requires estimation of the likelihood and posterior ordinates of the DPM model at a single high-density point. An interesting computation is involved in the estimation of the likelihood ordinate, which is devised via collapsed sequential importance sampling. Extensive experiments with synthetic and real data involving semiparametric binary data regression models and hierarchical longitudinal mixed-effects models are used to illustrate the implementation, performance, and applicability of the method.

KEY WORDS: Bayesian model comparison; Bayes factor; Dirichlet process mixture; Marginal likelihood; Semiparametric binary data model; Semiparametric longitudinal data model.

1. INTRODUCTION

Advances in Markov chain Monte Carlo (MCMC) simulation methods have facilitated the study of Bayesian models under far weaker and more realistic assumptions than was previously possible. As a result of these developments, semiparametric Bayesian modeling has become a practical option, and under the Dirichlet process mixture (DPM) framework, for example, novel and appealing statistical models can be formulated and estimated. It turns out, however, that although the methodology for fitting DPM models is more or less established, there is a paucity of work on methods that can be used to compare these models with competing semiparametric or parametric models.

The general problem of comparing semiparametric models with other alternative model specifications has a relatively recent history. Florens, Richard, and Rolin (1996) and Carota and Parmigiani (1996) developed procedures to compare parametric models with nonparametric alternatives modeled with Dirichlet process (DP) and mixture of Dirichlet processes (MDP). It is important to bear in mind that, despite the similarity in nomenclature, the MDP specification is rather different from the DPM model that we consider in the sequel. In more recent work, Berger and Guglielmi (2001) modeled the nonparametric alternative by a Pólya tree process and computed the Bayes factor for a default reference prior, and Ishwaran, James, and Sun (2001) compared models with different numbers of unique mixture components by subsuming the models within a finite mixture model.

Significantly, earlier work on this general topic (except for that in Ishwaran et al. 2001) is concerned with nonparametric instead of semiparametric models, because the unknown distribution of the observations is modeled directly by a nonparametric prior process. Even more importantly, all available methods explicitly assume an independent and identically distributed model for the data, which rules out models that contain covariates. Therefore, none of the existing approaches can be used to compare the fit of flexible semiparametric regression models of the type discussed by, for example, Bush and MacEachern (1996), Kleinman and Ibrahim (1998), and Basu and Mukhopadhyay (2000).

One purpose of this article is to introduce a Bayesian model comparison method for semiparametric models that can be applied even when the model contains covariates and (possibly) an involved hierarchical prior structure. The method that we devise for finding the "weight of evidence" from the marginal likelihood of the semiparametric model is apparently the first to be proposed for Bayesian semiparametric regression models. In this method, which relies on the framework of Chib (1995), the rather difficult computation of the marginal likelihood (which entails integration of the likelihood with respect to the prior density of the parameters) is reduced to the more tractable problems of finding estimates of the likelihood and of the posterior at a single point. Crucially, both of these quantities are readily available. To estimate the posterior ordinate, we need primarily the MCMC procedures that simulate the parameters from the posterior distribution, whereas, to estimate the likelihood ordinate, we develop an interesting, low-variability method based on collapsed sequential importance sampling (SIS). Collapsed SIS is a variant of the SIS method introduced by Kong, Liu, and Wong (1994), and Liu (1996, 2001). This variant was discussed by Lo, Brunner, and Chan (1996), Ishwaran and James (2001b), Ishwaran et al. (2001), and Ishwaran and Takahara (2002) in the general context of weighted Chinese restaurant processes and by Quintana (1998), MacEachern, Clyde, and Liu (1999), and Quintana and Newton (2000) in the setting of categorical DPM models.

The article is organized as follows. In Section 2 we present the DPM model and the model comparison problem of interest. In Section 3 we discuss computation of the marginal likelihood of the DPM model with a view to computing the Bayes factor of alternative parametric and semiparametric models. In Sections 4 and 5 we delineate the specifics of the method. In Section 6 we provide applications of the method to three examples, each of which contains covariates. We also consider two synthetic datasets to illustrate the usefulness of Bayes factors for finding the true model. We give concluding remarks in Section 7.

Sanjib Basu is Associate Professor, Division of Statistics, Northern Illinois University, Dekalb, IL 60115 (E-mail: [email protected]). Siddhartha Chib is Harry C. Hartkopf Professor of Econometrics and Statistics, John M. Olin School of Business, Washington University, St. Louis MO 63130 (E-mail: [email protected]). The authors are grateful to the editor, associate editor, and referees for constructive and valuable comments.

© 2003 American Statistical Association. Journal of the American Statistical Association, March 2003, Vol. 98, No. 461, Theory and Methods. DOI 10.1198/01621450338861947

2. DIRICHLET PROCESS MIXTURE MODEL

Let $y_i$ $(i \le n)$ denote a collection of scalar or vector-valued independent observations whose distribution is modeled by a general DPM model described as

$$
\begin{aligned}
y_i \mid \theta_i, \phi, x_i &\sim f(\cdot \mid \theta_i, \phi, x_i), \quad i = 1, \ldots, n; \\
\theta_1, \ldots, \theta_n \mid G &\overset{\text{iid}}{\sim} G; \\
G \mid \alpha, G_0 &\sim \mathrm{DP}(\alpha, G_0(\cdot \mid \kappa)); \\
\psi = (\phi, \kappa, \alpha) &\sim \pi,
\end{aligned}
\tag{1}
$$

where $x_i$ are fixed covariates, $\phi$ is a vector parameter associated with the distribution of $y_i$, $\{\theta_i\}$ are latent or subject-specific random vectors that are conditionally independent given the distribution $G$, and $\{f(\cdot \mid \theta_i, \phi, x_i)\}$ is a parametric family of densities with respect to a dominating measure $\mu$. Given $G$, therefore, the density of the data $y = \{y_1, y_2, \ldots, y_n\}$ with regard to $\mu$ (and suppressing dependence on the covariates) is given by the mixture

$$
f(y \mid \phi, G) = \prod_{i=1}^{n} \int f(y_i \mid \theta_i, \phi)\, dG(\theta_i). \tag{2}
$$

The key feature of the model is the assumption that the distribution $G$ is unknown and is modeled by a DP prior (Ferguson 1973) with concentration parameter $\alpha$ and specified base probability measure $G_0(\cdot \mid \kappa)$ that depends on an unknown parameter vector $\kappa$. Here $G_0$ and $G$ denote probability measures, although we often refer to them as distributions. The Bayesian model is completed by assuming that the parameter vector $\phi$, the hyperparameter vector $\kappa$ of $G_0$, and the concentration parameter $\alpha$ follow a parametric distribution $\pi$.

The DPM model was introduced by Ferguson (1983) and Lo (1984). Kuo (1986) first described Monte Carlo techniques for fitting these models by sampling from the prior. The clever trick of exploiting the Blackwell and MacQueen (1973) Pólya urn characterization of the DP [see (4) and (5)] within a Markov chain sampling setting was elucidated by Escobar (1988, 1994), and Escobar and West (1995). The collapsed cluster sampling method of MacEachern (1994) and the "no-gaps" algorithm of MacEachern and Müller (1998) for nonconjugate DPM models also use the Pólya urn structure.

An important point is that the foregoing setup and the method developed later can be extended to any prior process that follows a generalized Pólya urn scheme. In particular, if $\mathcal{P}(\cdot \mid \alpha, G_0(\cdot \mid \kappa))$ denotes such a prior process with hyperparameters $\alpha$ and $G_0(\cdot \mid \kappa)$ and

$$
\theta_1, \ldots, \theta_n \mid G \overset{\text{iid}}{\sim} G, \qquad G \sim \mathcal{P}(\cdot \mid \alpha, G_0(\cdot \mid \kappa)) \tag{3}
$$

is a sample from this prior process, then the prequential prediction rule of $\theta_i$ is given by

$$
P(\theta_i \in \cdot \mid \theta_1, \ldots, \theta_{i-1}) = q_{k_{i-1}+1,\,i}(\alpha)\, G_0(\cdot \mid \kappa) + \sum_{j=1}^{k_{i-1}} q_{j,i}(\alpha)\, \delta_{\theta^*_{j,i-1}}(\cdot), \quad i \le n, \tag{4}
$$

where $\delta_\theta(\cdot)$ denotes the degenerate measure at $\theta$, $\{\theta^*_{1,i-1}, \ldots, \theta^*_{k_{i-1},i-1}\}$ are the set of $k_{i-1}$ unique values in $\{\theta_1, \ldots, \theta_{i-1}\}$, and $\{q_{j,i}(\alpha)\}_{j=1}^{k_{i-1}+1}$ are the probabilities of the different components (that may functionally depend on $\alpha$) adding to 1; for the case where $i = 1$, vacuous sets and sums are treated as empty. The DP prior is of course the most popular class of priors with a Pólya urn representation. If $\mathcal{P}(\cdot \mid \alpha, G_0(\cdot \mid \kappa)) = \mathrm{DP}(\cdot \mid \alpha, G_0(\cdot \mid \kappa))$ with concentration parameter $\alpha$ and base measure $G_0(\cdot \mid \kappa)$, then (4) holds with

$$
q_{j,i}(\alpha) = \frac{n_{j,i-1}}{\alpha + i - 1}, \quad j = 1, \ldots, k_{i-1}, \qquad q_{k_{i-1}+1,\,i}(\alpha) = \frac{\alpha}{\alpha + i - 1}, \tag{5}
$$

where $n_{j,i-1}$ denotes the frequency of the unique label $\theta^*_{j,i-1}$ among $\{\theta_1, \ldots, \theta_{i-1}\}$, $j = 1, \ldots, k_{i-1}$.

Another point is that the model described in (1) can alternatively be expressed in terms of the "stick-breaking" construction of the DP as given by Sethuraman (1994),

$$
\begin{aligned}
G(\cdot) &= \sum_{l=1}^{\infty} p_l\, \delta_{Z_l}(\cdot), \quad \text{where } Z_l \overset{\text{iid}}{\sim} G_0(\cdot \mid \kappa),\ l = 1, \ldots, \text{ and} \\
p_1 &= V_1, \quad p_l = V_l \prod_{j=1}^{l-1} (1 - V_j),\ l = 2, \ldots, \quad \text{with } V_l \overset{\text{iid}}{\sim} \mathrm{beta}(1, \alpha),\ l = 1, \ldots
\end{aligned}
\tag{6}
$$

If the sum in (6) is truncated at a large integer $N$, we obtain the finite-dimensional prior considered by Ishwaran and James (2001a), who developed a blocked Gibbs sampler for the model under this prior by updating blocks of parameters in multivariate steps instead of the one-at-a-time updates that appear in the Pólya urn-based samplers.

Finally, we note that the approach that we develop can also be applied to the two-parameter Poisson-Dirichlet process discussed by Ishwaran and James (2001a), which includes the DP as a special case.

3. MODEL COMPARISON PROBLEM

Suppose that we are given a collection of models $\{\mathcal{M}_1, \ldots, \mathcal{M}_J\}$, where one (or more) of the models is a DPM model, and the objective is to compare the different models given the data $y = (y_1, \ldots, y_n)$. The formal Bayesian approach for doing this comparison is via the pairwise Bayes factors, defined for any two models $\mathcal{M}_r$ and $\mathcal{M}_s$ by the ratio of marginal likelihoods

$$
B_{rs} = \frac{m(y \mid \mathcal{M}_r)}{m(y \mid \mathcal{M}_s)}.
$$

In the semiparametric DPM context, calculation of the marginal likelihood is a largely unexplored problem. In fact, the problem in this case is somewhat deeper, because even the computation of the likelihood function of the DPM model (an input into the marginal likelihood) has not been satisfactorily tackled in the literature. Specifically, if we let $\mathcal{P}(\cdot \mid \alpha, G_0, \kappa)$ denote the DP measure, then the likelihood $L(y \mid \phi, \kappa, \alpha, G_0)$ of the DPM model (on suppressing the model index) is given by

$$
L(y \mid \phi, \kappa, \alpha, G_0) = \int f(y \mid \phi, G)\, d\mathcal{P}(G \mid \alpha, G_0, \kappa), \tag{7}
$$

which requires an integration over the space of the infinite-dimensional parameter $G$. Additionally, let $\pi(\phi, \kappa, \alpha)$ denote the prior density of the parameters. Then the marginal likelihood is obtained by integrating the likelihood function over the prior distribution of the parameters,

$$
\begin{aligned}
m(y) &= \int L(y \mid \phi, \kappa, \alpha, G_0)\, \pi(\phi, \kappa, \alpha)\, d\phi\, d\kappa\, d\alpha \\
&= \iint f(y \mid \phi, G)\, d\mathcal{P}(G \mid \alpha, G_0, \kappa)\, \pi(\phi, \kappa, \alpha)\, d\phi\, d\kappa\, d\alpha \\
&= \iint \left\{ \prod_{i=1}^{n} \int f(y_i \mid \theta_i, \phi)\, dG(\theta_i) \right\} d\mathcal{P}(G \mid \alpha, G_0, \kappa)\, \pi(\phi, \kappa, \alpha)\, d\phi\, d\kappa\, d\alpha,
\end{aligned}
\tag{8}
$$

where the last step uses (2). Clearly, direct evaluation of these integrals is impossible. Therefore, a feasible approach to this problem must tackle the problem by different means.

In this article, we focus on the approach of Chib (1995), which is based on a representation of the marginal likelihood that is amenable to calculation by MCMC methods. Because the marginal likelihood is the normalizing constant of the posterior density, one can write

$$
m(y) = \frac{L(y \mid \phi^*, \kappa^*, \alpha^*, G_0)\, \pi(\phi^*, \kappa^*, \alpha^*)}{\pi(\phi^*, \kappa^*, \alpha^* \mid y)},
$$

where $(\phi^*, \kappa^*, \alpha^*)$ is some point in the parameter space, $\pi(\phi^*, \kappa^*, \alpha^*)$ is the prior density at that point, and $\pi(\phi^*, \kappa^*, \alpha^* \mid y)$ is the posterior density of the parameters also evaluated at that same point. None of the quantities in this expression is conditioned on an estimate of the unknown distribution $G$, because otherwise the efficiency of the estimate would be severely compromised. Now, if we let $\hat{L}(y \mid \phi^*, \kappa^*, \alpha^*, G_0)$ and $\hat{\pi}(\phi^*, \kappa^*, \alpha^* \mid y, G_0)$ denote estimates of the likelihood and posterior ordinates (methods for finding these estimates are given later), it follows that we can conveniently estimate the marginal likelihood as

$$
\log \hat{m}(y) = \log \hat{L}(y \mid \phi^*, \kappa^*, \alpha^*, G_0) + \log \pi(\phi^*, \kappa^*, \alpha^*) - \log \hat{\pi}(\phi^*, \kappa^*, \alpha^* \mid y, G_0). \tag{9}
$$

Han and Carlin (2001) recently reported that the marginal likelihood estimates from the Chib approach are quite accurate compared with those from other methods. Now, given the marginal likelihood estimates for any two models $\mathcal{M}_r$ and $\mathcal{M}_s$, the Bayes factor is available as

$$
\hat{B}_{rs} = \exp\{\log \hat{m}(y \mid \mathcal{M}_r) - \log \hat{m}(y \mid \mathcal{M}_s)\}.
$$

By way of interpretation, if the two models $\mathcal{M}_r$ and $\mathcal{M}_s$ are equally probable a priori, then the Bayes factor $B_{rs}$ is the posterior odds in favor of the model $\mathcal{M}_r$. Alternatively, the Bayes factor can also be viewed as the relative success of the two models at predicting the data $y$. Good (1985) has referred to the log of the Bayes factor as the "weight of evidence." According to the famous scale of Jeffreys, log (base $e$) Bayes factor values in the ranges $(0, 1.15)$, $(1.15, 3.45)$, $(3.45, 4.60)$, and $(4.60, \infty)$ provide "not worth a mention," "substantial," "strong," and "very strong" evidence against the $\mathcal{M}_s$ model.

An important practical consequence of devolving the marginal likelihood computation in the foregoing manner is that the problem is reduced to one of finding estimates of the likelihood and posterior ordinates. These two problems can be tackled quite effectively by separate means. Indeed, computation of the posterior ordinate is based on the output produced by the MCMC simulation algorithms currently used to estimate DPM models. Thus this step requires almost no additional programming beyond what is needed to fit the DPM model. On the other hand, computation of the likelihood ordinate requires additional computation, but the burden is not large. The method that we have developed is based on sequential importance sampling (Kong et al. 1994; Liu 1996, 2001; Lo et al. 1996; MacEachern et al. 1999; Ishwaran and James 2001b; Ishwaran et al. 2001; Ishwaran and Takahara 2002). A variant of our method is available that can be applied to the case in which the sampling density and $G_0$ are nonconjugate.

We conclude this section with several remarks. If the DPM model is to be compared against a suitably embedded parametric alternative, then the marginal likelihood of the parametric model can be computed by available methods (Chib 1995; Chib and Jeliazkov 2001). The ratio of the two marginal likelihoods then provides the Bayes factor for the parametric versus semiparametric model. Of course, if the alternative model is a different DPM model, then its marginal likelihood can be computed by the method developed here. Thus, with the method at hand, we can find the Bayes factor for comparing the DPM model against both parametric and semiparametric alternatives.

Finally, for appropriate model comparisons, it is desirable, if possible, to match the prior specifications in the two models, at least for similar parameters (see Berger and Guglielmi 2001 for further discussion). If we are comparing a DPM model with another nonparametric model, then this issue needs to be taken up on a case-by-case basis, as shown in our examples. When the alternative is a parametric model, however, proper embedding should allow the DPM model to be viewed as a generalization of the parametric model. Here we follow Florens et al. (1996), who recommended that the two models be specified in such a way that the predictive (or marginal) distribution of a single observation is identical under the two models (i.e., the two models cannot be distinguished by just one observation). In that case, the relevant parametric alternative to the DPM model is given by

$$
\begin{aligned}
y_i \mid \theta_i, \phi &\sim f(\cdot \mid \theta_i, \phi), \quad i = 1, \ldots, n; \\
\theta_1, \ldots, \theta_n \mid G_0 &\overset{\text{iid}}{\sim} G_0(\cdot \mid \kappa); \\
(\phi, \kappa) &\sim \pi(\phi, \kappa) = \int \pi(\phi, \kappa, \alpha)\, d\alpha,
\end{aligned}
\tag{10}
$$

where $\pi(\phi, \kappa, \alpha)$ is the joint prior under the DPM model in (1). The predictive distribution of a single observation $y_i$ under either the DPM model in (1) or the parametric model in (10) is then identical and is given by $\int_{\phi, \kappa} \int f(y_i \mid \theta, \phi)\, dG_0(\theta \mid \kappa)\, d\pi(\phi, \kappa)$.

4. POSTERIOR ORDINATE ESTIMATION

4.1 Markov Chain Sampling

In our method, the posterior ordinate $\pi(\phi^*, \kappa^*, \alpha^* \mid y)$ in (9) is estimated from the output of the MCMC simulation of the posterior distribution of the DPM model. There are currently two broad approaches for estimating the DPM model. These two methods differ in the way in which the lower-level parameters of the model are sampled; the parameters $\phi$, $\kappa$, and $\alpha$ are sampled in the same way in both methods. Practically, this means that we can produce an estimate of the posterior ordinate $\pi(\phi^*, \kappa^*, \alpha^* \mid y, G_0)$ from the MCMC output of either method. To show how this is done, we begin by presenting a brief review of the two MCMC sampling schemes.

The first class of methods is based on the Pólya urn representation in (4). First proposed by Escobar (1988, 1994) and MacEachern (1994), the sampling is conducted marginalized over the random measure $G$ and exploits the fact that the joint distribution of $\theta = (\theta_1, \ldots, \theta_n)$ in (3) is exchangeable. The full conditional distribution of $\theta_i$ can be deduced, by virtue of exchangeability, as

$$
P(\theta_i \in \cdot \mid \theta_{-i}, \phi, \kappa, \alpha, G_0, y) \propto q^*_{k_{-i}+1,\,i}\, P^*(\theta_i \in \cdot \mid y_i) + \sum_{j=1}^{k_{-i}} q^*_{j,i}\, \delta_{\theta^*_{j,-i}}(\cdot), \tag{11}
$$

where $\{\theta^*_{1,-i}, \ldots, \theta^*_{k_{-i},-i}\}$ denote the set of $k_{-i}$ unique values in $\theta_{-i} = \{\theta_1, \ldots, \theta_n\} \setminus \{\theta_i\}$, $q^*_{k_{-i}+1,\,i} \propto q_{k_{-i}+1,\,n}(\alpha) \int f(y_i \mid \theta, \phi, x_i)\, dG_0(\theta \mid \kappa)$, $q^*_{j,i} \propto q_{j,n}(\alpha)\, f(y_i \mid \theta^*_{j,-i}, \phi, x_i)$, and $P^*(\theta_i \in \cdot \mid y_i)$ is the conditional law of $\theta_i$ when $y_i$ has the density $f(\cdot \mid \theta_i, \phi, x_i)$ and $\theta_i \sim G_0(\cdot \mid \kappa)$. The sampling approach is completed by sampling $(\phi, \kappa, \alpha)$ from their respective full conditional densities. We note here that because $\kappa$ is at the highest level of the hierarchical specification, the functional form of its conditional posterior depends only on the prior of $\kappa$ and the base measure $G_0(\cdot \mid \kappa)$ and is analytically available if the prior for $\kappa$ is conjugate to $G_0(\cdot \mid \kappa)$, irrespective of the complexity of the semiparametric mixture model.

One can see that the expression for $q^*_{k_{-i}+1,\,i}$ involves an integral which is analytically available when $f(y_i \mid \theta, \phi, x_i)$ and $G_0(\theta \mid \kappa)$ are conjugate. The nonconjugate case, which is less convenient, has been considered by MacEachern and Müller (1998). For the conjugate case, MacEachern (1994) developed an improved collapsed Gibbs sampler that provides better mixing. In this approach, $\theta_{-i}$ is reexpressed in terms of the unique values $\theta^*_1, \ldots, \theta^*_{k_{-i}}$ and the cluster memberships $s_{-i} = (s_1, \ldots, s_n) \setminus \{s_i\}$, where each $s_l$ records which unique $\theta^*_j$ label corresponds to the value $\theta_l$, that is, $s_l = j$ iff $\theta_l = \theta^*_j$. In the collapsed sampler, only the cluster membership, $s_i$, of the $i$th observation is sampled from the categorical distribution

$$
P(s_i = j \mid s_{-i}, y, \phi, \kappa, \alpha) =
\begin{cases}
c\, q_{j,n}(\alpha) \int f(y_i \mid \theta, \phi, x_i)\, dH_{j,-i}(\theta), & 1 \le j \le k_{-i}, \\
c\, q_{k_{-i}+1,\,n}(\alpha) \int f(y_i \mid \theta, \phi, x_i)\, dG_0(\theta \mid \kappa), & j = k_{-i} + 1,
\end{cases}
\tag{12}
$$

where $c$ is the normalizing constant and $H_{j,-i}(\theta)$ is the posterior distribution of $\theta$ based on the prior $G_0$ and observations $\{y_l : l \ne i \text{ and } s_l = j\}$, which are in cluster $j$. The unique $\{\theta^*_j\}$ are updated next, if needed, given all cluster memberships $s$.

The second class of methods, known as the blocked Gibbs sampler, was developed by Ishwaran and James (2001a) and Ishwaran et al. (2001). These methods use the truncation of the stick-breaking construction given in (6) to express the random mixing measure in finite-dimensional form as $G(\cdot) = \sum_{l=1}^{N} p_l\, \delta_{Z_l}(\cdot)$, where $N$ is some large integer. Under this restriction, the mixture density in (2) can be written in hierarchical fashion as

$$
\begin{aligned}
y_i \mid Z, s_i, \phi, x_i &\overset{\text{ind}}{\sim} f(y_i \mid Z_{s_i}, \phi, x_i), \quad i = 1, \ldots, n, \\
s_i \mid p &\overset{\text{iid}}{\sim} \sum_{l=1}^{N} p_l\, \delta_l(\cdot),
\end{aligned}
\tag{13}
$$

where $s_i$ is the latent mixture component indicator for the $i$th observation, $s = (s_1, \ldots, s_n)$, $Z_l \overset{\text{iid}}{\sim} G_0(\cdot \mid \kappa)$, $Z = (Z_1, \ldots, Z_N)$, $Z$ is independent of $p$, and the distribution of $p = (p_1, \ldots, p_N)$ is specified by the stick-breaking construction. The blocked Gibbs sampler updates $Z$, $s$, and $p$ in multivariate blocks as opposed to the one-at-a-time updating of $\theta_i$ or cluster membership $s_i$ in the Pólya urn samplers. The full conditional distributions of $(Z \mid s, \phi, \kappa, y)$, $(s \mid Z, p, \phi, y)$, and $(p \mid s, \alpha)$ were given by Ishwaran and James (2001a, sec. 5.2). Note that the latent $\theta_i$ are still available in this scheme as $\theta_i = Z_{s_i}$.

4.2 Estimating the Posterior Ordinate Within the Sampler

We now detail, following the framework of Chib (1995), how the posterior ordinate of (9) can be estimated from the output of either the Pólya urn scheme sampler or the blocked Gibbs sampler. Let us start with the decomposition

$$
\log \pi(\phi^*, \kappa^*, \alpha^* \mid y) = \log \pi(\phi^* \mid y) + \log \pi(\alpha^* \mid y, \phi^*) + \log \pi(\kappa^* \mid y, \phi^*, \alpha^*), \tag{14}
$$

and now consider the estimation of each ordinate. Suppose for simplicity that the full conditional distributions of $\phi$, $\alpha$, and $\kappa$ have known normalizing constants. If the normalizing constant(s) is (are) not known, and Metropolis-Hastings sampling (Chib and Greenberg 1995) is used for updating some of these parameters, then the ordinates can be estimated along the lines of Chib and Jeliazkov (2001).

Suppose that the MCMC sampling scheme, beyond the requisite burn-in period, has been iterated for $g = 1, \ldots, G_1$ cycles. The output from this sampling can, with an obvious justification, be capitalized to estimate $\pi(\phi^* \mid y)$ as

$$
\hat{\pi}(\phi^* \mid y) = \frac{1}{G_1} \sum_{g=1}^{G_1} \pi(\phi^* \mid \theta^{(g)}, \kappa^{(g)}, \alpha^{(g)}, y), \tag{15}
$$

where the superscript $(g)$ denotes the values drawn at the $g$th iteration and the density on the right side is the one that appears in the MCMC update. To estimate the second ordinate, we fix $\phi$ at $\phi^*$ and continue the Markov chain simulations for an additional $G_2$ iterations, where all other unobservables (except $\phi$) are updated. These draws yield the estimate

$$
\hat{\pi}(\alpha^* \mid y, \phi^*) = \frac{1}{G_2} \sum_{g=G_1+1}^{G_1+G_2} \pi\big(\alpha^* \mid \theta^{(g)}, \phi^*, \kappa^{(g)}, y\big), \tag{16}
$$

where, with the introduction of an additional latent variable $u$, the full conditional posterior of $\alpha$ was given by Escobar and West (1995) as a mixture of two gamma distributions,

$$
\frac{a_0 + k_n - 1}{a_0 + k_n - 1 + n(b_0 - \log u)}\, \mathrm{gamma}(a_0 + k_n,\ b_0 - \log u) + \left(1 - \frac{a_0 + k_n - 1}{a_0 + k_n - 1 + n(b_0 - \log u)}\right) \mathrm{gamma}(a_0 + k_n - 1,\ b_0 - \log u), \tag{17}
$$

$k_n$ denotes the number of distinct values in $\{\theta_1, \ldots, \theta_n\}$, and the latent $u$ is generated from its full conditional distribution given by $\mathrm{beta}(u \mid \alpha + 1, n)$.

Last, we fix both $\phi$ and $\alpha$ at $(\phi^*, \alpha^*)$ and run the chain for another $G_3$ iterations to produce an estimate $\hat{\pi}(\kappa^* \mid y, \phi^*, \alpha^*)$ analogous to the preceding one. Finally, we substitute these three estimates into (14). We mention that the numerical standard error of the resulting estimate can be found according to the method given by Chib (1995).

5. LIKELIHOOD ORDINATE ESTIMATION

5.1 Basic Sequential Importance Sampling

In this section we describe methods for estimating the likelihood ordinate $L(y \mid \phi^*, \kappa^*, \alpha^*, G_0)$ of (9). To set the stage for the problem, let us begin by recalling that the likelihood function of the parameters at a particular point $\psi^* = (\phi^*, \kappa^*, \alpha^*)$ in the parameter space, given the sample data, is

$$
L(y \mid \phi^*, \kappa^*, \alpha^*) = \int \left\{ \prod_{i=1}^{n} \int f(y_i \mid \theta_i, \phi^*)\, dG(\theta_i) \right\} d\mathcal{P}(G \mid \alpha^*, G_0, \kappa^*). \tag{18}
$$

The problem is that neither the integral in braces [even when $f(y_i \mid \theta_i, \phi^*)$ and $G_0$ are conjugate] nor the outside integral over the infinite-dimensional parameter $G$ is analytic. This raises an interesting question of how the integrals should be calculated. (We show in Sec. 6.1 that an exact answer can be derived by a tedious computation when $n$ is small, but of course we need to develop a general approach that is efficient and valid for any sample size.) Earlier, Ferguson (1983) also did some exact calculations for small sample sizes. After a thorough study of this problem and extensive comparative evaluation of different techniques, we have developed a method that is both accurate and computationally efficient.

To describe our method, we first show how the likelihood ordinate can be found as a byproduct of the SIS method, where we use the subscript $(i)$ (e.g., $y_{(i)}$) to generically denote the first $i$ elements of a vector [i.e., $y_{(i)} = (y_1, \ldots, y_i)$]. In sequential imputation, the $\theta_i$ $(i \le n)$ are sequentially generated from the importance sampling distribution

$$
\pi^*(\theta_1, \ldots, \theta_n \mid y, \psi^*) = \prod_{i=1}^{n} \pi(\theta_i \mid y_{(i)}, \theta_{(i-1)}, \psi^*), \tag{19}
$$

starting with $\theta_1$ and continuing on to $\theta_n$. Kong et al. (1994) showed that the importance weight satisfies

$$
\frac{\pi(\theta_1, \ldots, \theta_n \mid y, \psi^*)}{\pi^*(\theta_1, \ldots, \theta_n \mid y, \psi^*)} = \frac{w}{L(y \mid \psi^*)}, \tag{20}
$$

where

$$
w = w(\theta_1, \ldots, \theta_n) = f(y_1 \mid \psi^*) \prod_{i=2}^{n} f(y_i \mid y_{(i-1)}, \theta_{(i-1)}, \psi^*) \tag{21}
$$

and $f(y_i \mid y_{(i-1)}, \theta_{(i-1)}, \psi^*)$ is the prequential predictive density of $y_i$.

Because $L(y \mid \psi^*)$ in (20) is independent of $(\theta_1, \ldots, \theta_n)$, the expression in (20) can be used to deliver an estimate of the likelihood function $L(y \mid \psi^*)$. Suppose that the sequential sampling procedure is repeated $M$ times and that at each cycle $g$, we obtain the draws $\theta_1^{(g)}, \ldots, \theta_n^{(g)}$ from $\pi^*(\theta_1, \ldots, \theta_n \mid y, \psi^*)$ and calculate $w^{(g)}$ following (21). Then the average $\bar{w} = M^{-1} \sum_{g=1}^{M} w^{(g)}$ over the $M$ draws is a simulation-consistent Monte Carlo estimate of the likelihood ordinate, as is readily confirmed.

Interestingly, this basic idea can be applied to the DPM model when $f(y_i \mid \theta_i, \phi^*)$ and $G_0$ are conjugate, because from (4) and (5) we know immediately that

$$
f(y_i \mid y_{(i-1)}, \theta_{(i-1)}, \psi^*) = \frac{\alpha^*}{\alpha^* + i - 1} \int f(y_i \mid \theta, \phi^*)\, dG_0(\theta \mid \kappa^*) + \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1}\, f(y_i \mid \theta_j, \phi^*) \tag{22}
$$

and

$$
\pi(\theta_i \mid y_{(i)}, \theta_{(i-1)}, \psi^*) \propto \frac{\alpha^*}{\alpha^* + i - 1}\, f(y_i \mid \theta_i, \phi^*)\, f_{G_0}(\theta_i \mid \kappa^*) + \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1}\, f(y_i \mid \theta_j, \phi^*)\, \delta_{\theta_j}(\theta_i). \tag{23}
$$
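As a hedged illustration of the basic SIS recursion in (19) through (23), the following Python sketch works through a toy conjugate model that is an assumption made here and is not one of the article's examples: scalar observations with $y_i \mid \theta_i \sim N(\theta_i, \sigma^2)$, base measure $G_0 = N(m, \tau^2)$, and no covariates. All function and argument names (`basic_sis_weight`, `sis_likelihood`, `exact_likelihood`, `sigma2`, `tau2`, and so on) are hypothetical. For very small $n$ the sketch also computes the likelihood ordinate exactly by summing over all partitions of the observations, weighting each partition by the probability implied by the urn probabilities in (5), so that the Monte Carlo average can be checked against it.

```python
import math
import numpy as np

def norm_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def basic_sis_weight(y, alpha, sigma2, m, tau2, rng):
    """One sequential-imputation pass; returns the weight w of (21)."""
    uniq, counts = [], []            # unique theta values and their frequencies
    w = 1.0
    for i, yi in enumerate(y, start=1):
        denom = alpha + i - 1.0
        # Components of (22): new-value term first, then one per unique theta.
        terms = [alpha / denom * norm_pdf(yi, m, sigma2 + tau2)]
        terms += [n_j / denom * norm_pdf(yi, th, sigma2)
                  for th, n_j in zip(uniq, counts)]
        p = np.array(terms)
        w *= p.sum()                 # prequential predictive density of y_i
        # Draw theta_i from (23): pick a component, then a value.
        j = rng.choice(len(p), p=p / p.sum())
        if j == 0:                   # fresh value from the one-observation posterior
            post_var = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
            post_mean = post_var * (m / tau2 + yi / sigma2)
            uniq.append(rng.normal(post_mean, math.sqrt(post_var)))
            counts.append(1)
        else:                        # tie theta_i to an existing unique value
            counts[j - 1] += 1
    return w

def sis_likelihood(y, alpha, sigma2, m, tau2, n_draws, seed=0):
    """Average of n_draws independent weights (the estimate w-bar)."""
    rng = np.random.default_rng(seed)
    return float(np.mean([basic_sis_weight(y, alpha, sigma2, m, tau2, rng)
                          for _ in range(n_draws)]))

def cluster_marginal(ys, sigma2, m, tau2):
    """Joint density of one cluster's observations with theta integrated out."""
    dens, mean, var = 1.0, m, tau2   # current N(mean, var) belief about theta
    for yv in ys:
        dens *= norm_pdf(yv, mean, sigma2 + var)
        new_var = 1.0 / (1.0 / var + 1.0 / sigma2)
        mean = new_var * (mean / var + yv / sigma2)
        var = new_var
    return dens

def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]

def exact_likelihood(y, alpha, sigma2, m, tau2):
    """Brute-force sum over partitions; only feasible for very small n."""
    norm = math.prod(alpha + i for i in range(len(y)))
    total = 0.0
    for part in set_partitions(list(y)):
        prior = alpha ** len(part) * math.prod(
            math.factorial(len(b) - 1) for b in part) / norm
        total += prior * math.prod(
            cluster_marginal(b, sigma2, m, tau2) for b in part)
    return total

y = [-1.2, 0.3, 0.5, 2.1, 2.4]       # made-up data for illustration only
params = dict(alpha=1.0, sigma2=1.0, m=0.0, tau2=4.0)
print("exact     :", exact_likelihood(y, **params))
print("basic SIS :", sis_likelihood(y, n_draws=5000, **params))
```

In this toy setting the two printed values should agree up to Monte Carlo error. The enumeration grows with the number of partitions of $n$ items, which is why it serves only as a check for small $n$.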

5.2 CollapsedSequential Importance Sampling where c isthe normalizing constant. In other words, sample s from theset of existing unique cluster labels with proba- Althoughthe basic SIS iseasy to implement, in apply- i bilitiesgiven in the rst lineof the foregoing expression, or ingthis method to the DPM modelwe havefound that the elseassign a newcluster label with probability given in the weights w4g5 tendto be extremely variable. This problem has secondline. beenobserved in other settings. T oovercomethis problem, we turnto the collapsed SIS methoddeveloped in the context At theend of each complete run through the observa- 4g5 4g5 n 4g5 ofa DPM beta-binomialmodel by MacEachern et al. (1999) tions,we calculate w u1 i 2 ui ,andthen estimate D D andlater extended to the multinomial and nonexchangeable thelikelihood ordinate of the DPM modelas L4y Öü 1 G05 1 M 4g5 Q — D beta-binomialmodels by Quintana (1998) and Quintana and M ƒ g 1 w . Newton(2000). The collapsed SIS methodwas alsodis- WementionD that the preceeding discussion b has been cussedin the context of weighted Chinese restaurant pro- restrictedP to the conjugate DPM model,because our applica- cessesby Lo et al. (1996). General weighted Chinese restau- tions,along with most formulations of DPM models,are cen- rantprocesses and algorithms for theirposterior inference are teredon this case. A methodfor thenonconjugate setting is reviewedby Ishwaran and James (2001b) and Ishwaran and availablefrom theauthors. Takahara(2002), and such methods were usedto estimate the Finally,the ef ciency of w asan estimate of thelikelihood N marginaldensity by Ishwaran et al. (2001). The idea behind ordinatecan be measuredby itscoef cient of variation, C4w5. 
N thecollapsed SIS istheelimination of È byintegration,which As pointedout by Irwin, Cox, and Kong (1994), C4w5 can be i N collapsesthe space in which the sequential sampling operates shownby the delta method to be approximately the standard tothe set of possiblecluster memberships. Because È is ana- error of logw inestimating the log-likelihood. The sample i N lyticallyintegrated out from thecomputation, this method has estimate of C4w5 is N lessvariability due to the Rao– Blackwellization effect (see 1 s MacEachernet al. 1999). As we illustratein Section 6, this C4w5 w 1 N D pM w modied SIS methodprovides the correct foundation for our N method. where s denotesthe sampleb standard deviation of w4g51 g Todescribethe method, we recallthat each unique È forms w i 11 : : : 1 M. D aclusterunder the Dirichlet process prior. In SIS theseclusters areformed sequentially. Now inthe collapsed method we do 6. EXAMPLES not sample Èi butinstead sample the cluster membership, si, 6.1 AnExample With anExact Answer marginalizedover Èi.Recallingthe notations used earlier in (4),(5), and (12), collapsed sequential sampling for theDPM Wenowturn to providing some empirical demonstrations modelproceeds as follow. First, compute u f 4y Öü 5 1 D 1— D andillustrations of our marginal likelihood estimation method f 4y È1 Ôü 5dG 4È Êü 5 and set s toequal 1 (becausethe rst — 0 — 1 for DPM models.Given that the method for estimatingthe observationmust begin with a newcluster). Then, for i R D posteriorordinate has been tested in several problems [Chib 21 : : : 1 n,performthe following steps sequentially: (1995),Chib and Jeliazkov (2001)] it is important to focus Step 1. Computethe predictive probability attentionon the estimation of thelikelihood ordinate proposed inSection 5. 
As across-check,we considera specialcase in 4g5 g whichthe likelihood ordinate can be obtained via alternative ui f4yi y4i 151 s4i 151 Öü 1 G05 D — ƒ ƒ methods,either analytically or approximately, and compare ü theresults with those from ourapproach. f 4yi È1 Ôü 5 dG04È Êü 5 D ü i 1 — — Considerthe longitudinal study reported by Gelfand, Hills, C ƒ Z ki 1 Racine-Poon,and Smith (1990) on n 30young rats whose ƒ nj1 i 1 D ƒ f 4yi È1 Ôü 5 dHj1 i 14È Êü 51 (24) weightsare measured weekly for vetime periods. Let yij C i 1 — ƒ — j 1 ü denotethe weight of the ithrat measured at age x and let XD C ƒ Z ij 2 2 where Hj1 i 14È Êü 5 istheposterior distribution of È based on y È 1‘ 1 X N 4X È 1‘ I53 ƒ — i— i i ¹ 5 i i the prior G0 andobservations 8yl 2 l i 1 and sl j9. When µ ƒ D Èi 4ˆi11 ˆi250 Ì1 D N24Ì1 D51 i 11 : : : 1 n3 f and G0 areconjugate, both integrals can be obtained in D — ¹ D 1 2 closedform. Dƒ Wishart 421 R53‘ inversegamma 4a1 b51 4g5 ¹ 2 ¹ Step 2. Draw si from thecategorical distribution

where yi 4yi11 : : : 1 yi550, Xi isthe design matrix with units 4g5 D Pr4si j y4i51 s4i 151 Öü 5 inthe rst columnand 4xi11 : : : 1 xi550 inthe second column, D — ƒ and N 4m1 è5 is the p-variatenormal distribution with mean nj1i 1 p c ƒ f4yi È1 Ôü 5 dHj1 i 14È Êü 51 vector m andcovariance matrix è.Thehyperparameters R, D ü i 1 — ƒ — C ƒ Z a, and b areset to equal diag 41001 015,4.25and 97.5, respec- 1 j ki 1 µ µ ƒ tively. ü Weconsidera DPM extensionof this model by letting the c f4yi È1 Ôü 5 dG04È Êü 51 iid D ü i 1 — — randomcoef cients È 1 : : : 1 È G, where G follows a DP C ƒ Z 1 n ¹ j ki 1 11 (25) priorwith base measure G0 N24Ì1 D5,thedistribution used D ƒ C D 230 Journal oftheAmerican Statistical Association, March 2003 intheforegoing parametric model. This is asimpleand effec- Prior sampling –100 tiveway to relax the normality assumption in clustered data k i

models.

Now consider the question of finding the likelihood ordinate at some high-density point $\vartheta^* = (\sigma^{2*}, \mu^*, D^*)$. The $n$ rats can cluster in a large number of possible ways. Given a set of cluster memberships $s = (s_1, \ldots, s_n)$, the density $f(y \mid \sigma^{2*}, \mu^*, D^*, s)$ can be obtained analytically by using the conjugate structure as

$$
f(y \mid \sigma^{2*}, \mu^*, D^*, s) = \prod_{j=1}^{k_n} \int \Bigl\{ \prod_{i:\, s_i = j} N_5(y_i \mid X_i \theta, \sigma^{2*} I) \Bigr\}\, dG_0(\theta \mid \mu^*, D^*), \tag{26}
$$

where $k_n$ is the number of distinct clusters, that is, the number of distinct values in $\{s_1, \ldots, s_n\}$. Then the likelihood ordinate is given as $L(y \mid \sigma^{2*}, \mu^*, D^*) = \sum_s f(y \mid \sigma^{2*}, \mu^*, D^*, s)\, \pi(s)$, where the sum is over all possible $s$ and $\pi(s)$ is the prior probability of getting the partition $s$. Because the sum is over all partitions, this analytic estimate is computationally feasible only when $n$ is small. However, in these feasible cases, one can compare this analytic estimate with the likelihood value obtained by our proposed method.

Suppose that $n = 10$. For this sample size, we compare the analytic estimate of the likelihood ordinate with the estimates from the proposed basic and collapsed sequential methods. We also consider an alternative Monte Carlo estimate of the likelihood ordinate based on sampling the cluster locations $s = (s_1, \ldots, s_n)$ sequentially from their prior distribution using the Pólya urn scheme in (5) and then averaging the likelihood $f(y \mid \sigma^{2*}, \mu^*, D^*, s)$ of (26) over these realizations of $\{s\}$. This estimate is referred to as the "prior sampling"-based estimate in Table 1. Ferguson (1983) reported a similar comparison for the case of $n \le 5$, where he compared the analytic estimate of the predictive density with the Monte Carlo estimate obtained from the Kuo (1986) prior sampling method.

Table 1. Comparison of Log-Likelihood Ordinates Found by Likelihood Estimation Methods; n = 10

Analytic value    Basic SIS    Collapsed SIS    Prior sampling
-193.96           -193.971     -193.954         -193.75

The estimates listed in Table 1 show that both the collapsed and basic sequential methods accurately estimate the likelihood ordinate. Whereas the inaccuracy of the likelihood estimate obtained by sampling the cluster locations from the prior may not appear significant in Table 1, note that these results are for a rather small sample size of $n = 10$. In experiments involving larger sample sizes, we have found that the estimate based on sampling cluster locations from the prior is smaller than the SIS-based estimates by several orders of magnitude. To get an idea of the variability of the estimates, in Figure 1 we plot the trace of the log-likelihood evaluations from the different methods. We see that the prior sampling method shows extreme fluctuations, whereas the collapsed method is the most stable (note the different vertical scales).

[Figure 1. Rat Data: Trace of Log-Likelihood Evaluations. Note the different vertical scales. Panels: prior sampling; basic sequential; collapsed sequential. Vertical axis: log lik.]

6.2 Bayes Factors for Binary Data Models

In recent years, there has been a significant amount of Bayesian work on generalizing the simple probit and logistic regression models for binary regression. Albert and Chib (1993) first showed how to fit a t-link model, and Basu and Mukhopadhyay (2000) extended this idea to DPM link models. Erkanli, Stangl, and Müller (1993) and Newton, Czado, and Chappell (1996) provided different semiparametric generalizations of the binary data model. The issue of model selection, however, is not addressed in these articles. In this section we illustrate the application of our techniques to a semiparametric model for binary data that we compare with two parametric models. The data for this problem are from Brown (1980), who used the response variable $\{y_i\}$, $i \le 53$, as an indicator of the presence of prostatic nodal involvement in patients with prostate cancer. The objective is to explain the response $y_i$ with four covariates: log of the level of serum acid phosphate ($x_2$); the result of an X-ray examination, coded 0 if negative and 1 if positive ($x_3$); size of the tumor, coded 0 if small and 1 if large ($x_4$); and pathologic grade of the tumor, coded 0 if less serious and 1 if more serious ($x_5$). These data have been analyzed by Chib (1995) using a binary probit regression model, which yielded a log marginal likelihood value of $-36.252$.

Models. We start with the popular probit regression, denoted by $\mathcal{M}_1$, that models the probability of presence or "success probability" as $\Pr(y_i = 1 \mid \mathcal{M}_1, \beta) = \Phi(x_i'\beta)$, where $\Phi(\cdot)$ is the standard normal distribution function. This model is compared with the t-link model discussed by Albert and Chib (1993). In this case

$$
\Pr(y_i = 1 \mid \mathcal{M}_2, \beta) = F_t(x_i'\beta, 1, \nu) = \int \Phi\bigl(x_i'\beta \sqrt{\lambda}\bigr)\, dG_0(\lambda),
$$

where $F_t(\cdot, \tau, \nu)$ is the cumulative distribution function of the t distribution with dispersion $\tau$ and $\nu$ degrees of freedom and $G_0 = \text{gamma}(\nu/2, \nu/2)$. We let $\nu = 10$.

Our goal is to compare the preceding parametric models with the semiparametric DPM model proposed by Basu and Mukhopadhyay (2000). Under the DPM model, the link function is modeled semiparametrically as a normal scale mixture

where the mixing distribution $G$ is random,

$$
\Pr(y_i = 1 \mid \mathcal{M}_3, \beta) = \int \Phi\bigl(x_i'\beta \sqrt{\lambda}\bigr)\, dG(\lambda), \qquad
G \sim DP(\alpha, G_0), \qquad G_0 = \text{gamma}\Bigl(\frac{\nu}{2}, \frac{\nu}{2}\Bigr),
$$

with $\nu = 10$. Note that when $G$ is a fixed distribution and equal to $G_0$, we get the t-link model $\mathcal{M}_2$.

To compare the three models on a fair basis, we assume that in each model $\beta$ follows a $N_5(\beta_0, B_0)$ prior distribution, where $\beta_0$ is a vector with each element equal to 0.75 and $B_0$ is a diagonal matrix with 25 on the diagonal. In addition, in the DPM model, the prior on the concentration parameter $\alpha$ is taken to be $\text{gamma}(5, 2)$.

Fitting of Models. We fit each of the three contending models by the Albert and Chib (1993) approach. For example, to estimate the DPM model, we express the model in terms of latent variables $\{z_i\}$ as

$$
\begin{aligned}
z_i \mid \beta, \lambda_i &\sim N(x_i'\beta, \lambda_i^{-1}); \qquad y_i = I(z_i > 0); \\
\lambda_1, \ldots, \lambda_n &\overset{\text{iid}}{\sim} G, \qquad G \sim DP(\alpha, G_0), \qquad G_0 = \text{gamma}\Bigl(\frac{\nu}{2}, \frac{\nu}{2}\Bigr).
\end{aligned}
$$

A major benefit of this representation is that conditioned on the latent $z_i$, the model resembles a linear regression with all of its associated tractability. The posterior distribution of …

Computing the Marginal Likelihood of the DPM Model. We now discuss marginal likelihood computation of the semiparametric DPM model $\mathcal{M}_3$. (The marginal likelihood of the parametric models is obtained from the approach outlined in Chib 1995.) As usual, we start with the decomposition in (9), where now $\psi = \beta$ and $\phi$ is nonstochastic. The posterior ordinate $\pi(\beta^*, \alpha^* \mid y)$ can be estimated from the decomposition $\pi(\beta^* \mid y)\, \pi(\alpha^* \mid y, \beta^*)$, where the first ordinate is calculated via (15) by averaging the conditional posterior density $N_5(\beta^* \mid \hat\beta, B)$, with $\hat\beta = B\bigl(B_0^{-1}\beta_0 + \sum_{i=1}^{53} \lambda_i x_i z_i\bigr)$ and $B = \bigl(B_0^{-1} + \sum_{i=1}^{53} \lambda_i x_i x_i'\bigr)^{-1}$, over the draws from the MCMC run. The second ordinate $\pi(\alpha^* \mid y, \beta^*)$ is estimated according to (16) and the mixture representation in (17).

Calculation of the likelihood ordinate $L(y \mid \beta^*, \alpha^*, G_0)$ by the collapsed SIS method is rather more interesting. The presence of latent variables $z_i$ in the model leads to some arguments that are likely to be of general interest.

One initial point is that even though the unobservables connected to the $i$th observation $y_i$ include the latent variable $z_i$ and the random precision parameters $\lambda_i$, to produce a tractable version of the collapsed SIS method, we must collapse or marginalize over only $\lambda_i$, the variable on which the DPM prior is defined. In other words, the latent variable $z_i$ persists in the sampling in conjunction with the cluster membership variable $s_i$. We now describe the details.

The predictive ordinate of the first observation can be calculated easily and is given by $u_1 = F_t(x_1'\beta^*, 1, \nu)^{y_1} \{1 - F_t(x_1'\beta^*, 1, \nu)\}^{1 - y_1}$. We next draw $\xi_1 = (z_1, s_1)$ from its posterior distribution conditioned on $y_1$. This is accomplished by drawing $z_1$ from its conditional distribution,

$$
z_1 \mid y_1, \vartheta^*, G_0 \sim
\begin{cases}
t(x_1'\beta^*, 1, \nu)\, I[0, \infty) & \text{if } y_1 = 1, \\
t(x_1'\beta^*, 1, \nu)\, I(-\infty, 0) & \text{if } y_1 = 0,
\end{cases}
$$

where $t(\mu, 1, \nu)$ is the Student t distribution with location $\mu$, dispersion 1, and $\nu$ degrees of freedom. We next set $s_1 = 1$.

For the remaining observations ($i = 2, \ldots, n$), we calculate the prequential predictive density of $y_i$ in step 1 of the collapsed SIS and then draw a value of $\xi_i = (z_i, s_i)$ from $p(\xi_i \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0)$ in step 2. Suppose that after completing these steps for the first $i - 1$ observations, there are $k_{i-1}$ clusters, the $j$th cluster having $n_{j,i-1}$ elements, $j = 1, \ldots, k_{i-1}$. Then the posterior distribution of $\lambda$ based on the prior $G_0$ and only those latent observations $\{z_l : l < i,\ s_l = j\}$ in the $j$th cluster is

$$
H_{j,i-1}(\lambda) = \text{gamma}\Bigl( \frac{\nu + n_{j,i-1}}{2},\ \frac{\nu + \sum_{\{l:\, l < i,\ s_l = j\}} (z_l - x_l'\beta^*)^2}{2} \Bigr). \tag{27}
$$

Writing $a_{j,i-1} = \nu + n_{j,i-1}$ and $b_{j,i-1} = \nu + \sum_{\{l:\, l < i,\ s_l = j\}} (z_l - x_l'\beta^*)^2$, so that $H_{j,i-1}(\lambda)$ is $\text{gamma}(a_{j,i-1}/2, b_{j,i-1}/2)$, the prequential predictive ordinate in step 1 is, by analogy with $u_1$, $u_i = p_i^{y_i} (1 - p_i)^{1 - y_i}$, where

$$
p_i = \frac{\alpha^*}{\alpha^* + i - 1} F_t(x_i'\beta^*, 1, \nu)
+ \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1} F_t\bigl(x_i'\beta^*, a_{j,i-1}^{-1} b_{j,i-1}, a_{j,i-1}\bigr).
$$

Next we move to step 2, where we apply the method of composition to draw a variate $\xi_i = (z_i, s_i)$ from the joint distribution $\pi(z_i, s_i \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0) = \pi(z_i \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0)\, \pi(s_i \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0, z_i)$. The first of these distributions can be obtained by an interesting argument, noting that

$$
\lambda_i \mid y_{(i-1)}, \xi_{(i-1)}, \vartheta^*, G_0 \sim \frac{\alpha^*}{\alpha^* + i - 1} G_0(\cdot) + \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1} H_{j,i-1}(\cdot) \tag{28}
$$

and

$$
z_i \mid y_{(i-1)}, \xi_{(i-1)}, \vartheta^*, G_0, \lambda_i \ \equiv\ z_i \mid \vartheta^*, G_0, \lambda_i \sim N(x_i'\beta^*, \lambda_i^{-1}). \tag{29}
$$
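To make the roles of (27), (28), and (29) concrete, the following is a minimal Python sketch, offered as an illustration and not as the authors' implementation. It computes the gamma shape and rate in (27) for a cluster, forms the mixture weights in (28), and draws $\lambda_i$ and then $z_i$ by composition. The function names, the residual lists, and the numeric values of $\alpha^*$, $\nu$, and $x_i'\beta^*$ are all hypothetical. Note that the collapsed SIS method itself integrates $\lambda_i$ out analytically; drawing it here only serves to show what distribution is being marginalized.

```python
import math
import random


def gamma_posterior(nu, residuals):
    """Shape and rate of the gamma posterior for one cluster, following (27).

    `residuals` holds z_l - x_l' beta* for the latent observations already
    assigned to the cluster; an empty list gives the prior gamma(nu/2, nu/2).
    """
    shape = (nu + len(residuals)) / 2.0
    rate = (nu + sum(r * r for r in residuals)) / 2.0
    return shape, rate


def mixture_weights(alpha, cluster_sizes):
    """Weights in (28): the base measure G0 first, then each existing cluster."""
    total = alpha + sum(cluster_sizes)  # equals alpha* + i - 1
    return [alpha / total] + [n / total for n in cluster_sizes]


def draw_lambda(alpha, nu, cluster_residuals, rng):
    """Draw lambda_i from the mixture in (28)."""
    sizes = [len(r) for r in cluster_residuals]
    weights = mixture_weights(alpha, sizes)
    components = [[]] + list(cluster_residuals)  # [] stands for G0
    chosen = rng.choices(components, weights=weights, k=1)[0]
    shape, rate = gamma_posterior(nu, chosen)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    return rng.gammavariate(shape, 1.0 / rate)


def draw_z_given_lambda(xb, lam, rng):
    """Draw z_i from N(x_i' beta*, 1 / lambda_i), as in (29)."""
    return rng.gauss(xb, 1.0 / math.sqrt(lam))


rng = random.Random(0)
clusters = [[0.3, -1.1, 0.4], [2.0]]  # hypothetical residuals, two clusters
lam = draw_lambda(alpha=1.0, nu=10.0, cluster_residuals=clusters, rng=rng)
z = draw_z_given_lambda(xb=0.5, lam=lam, rng=rng)
```

Passing the residuals per cluster keeps the sufficient statistics of (27) explicit; a production implementation would more likely store only the running counts and sums of squares.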

If we marginalize (29) with respect to the distribution of $\lambda_i$ given in (28), we obtain

$$
z_i \mid y_{(i-1)}, \xi_{(i-1)}, \vartheta^*, G_0 \sim \frac{\alpha^*}{\alpha^* + i - 1}\, t(x_i'\beta^*, 1, \nu)
+ \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1}\, t\bigl(x_i'\beta^*, a_{j,i-1}^{-1} b_{j,i-1}, a_{j,i-1}\bigr). \tag{30}
$$

This can be viewed as the prior distribution of $z_i$, and so, by the Albert and Chib (1993) approach, we immediately determine that the density $\pi(z_i \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0)$ is the mixture distribution in (30) truncated below at 0 if $y_i$ is 1 and truncated above at 0 if $y_i$ is 0. Having drawn $z_i$ from this truncated mixture distribution, we complete the composition by drawing the categorical random variable $s_i$ according to (25) from the distribution

$$
\Pr(s_i = j \mid y_{(i)}, \xi_{(i-1)}, \vartheta^*, G_0, z_i) =
\begin{cases}
c\, \dfrac{n_{j,i-1}}{\alpha^* + i - 1}\, f_t\bigl(z_i \mid x_i'\beta^*, a_{j,i-1}^{-1} b_{j,i-1}, a_{j,i-1}\bigr), & 1 \le j \le k_{i-1}, \\[2ex]
c\, \dfrac{\alpha^*}{\alpha^* + i - 1}\, f_t(z_i \mid x_i'\beta^*, 1, \nu), & j = k_{i-1} + 1,
\end{cases} \tag{31}
$$

where $c$ is the normalizing constant and $f_t(\cdot \mid \mu, \tau, \nu)$ denotes the density of the t distribution with location $\mu$, dispersion $\tau$, and degrees of freedom $\nu$. A run through the observations now yields the quantity $w = u_1 \prod_{i=2}^{n} u_i$. These steps are repeated $M$ times, and the average of the $w$'s produces our estimate of the likelihood ordinate $L(y \mid \beta^*, \alpha^*, G_0)$ for the DPM model $\mathcal{M}_3$.

Finally, the marginal likelihood of the binary DPM model is estimated by inserting the posterior and likelihood ordinate estimates into (9). We note here that sampling the latent $z_i$ within our collapsed SIS maintains the tractability of the posterior $H_{j,i-1}(\lambda)$ in (27), which in turn provides manageable expressions for all of the requisite distributions.

Results. Figure 2 illustrates the marginal likelihood estimate of the DPM model from our proposed approach. We see from the graph that the estimate stabilizes up to the first decimal place quite quickly. In Table 2 we report the marginal likelihood for the three models under contention. It is noteworthy that in this case, the Bayes factor criterion does not support the DPM model (in fact, the Bayes factor provides "substantial" evidence in favor of the Student t model against the DPM model). This shows that a model elaboration in the direction of a semiparametric model need not necessarily dominate a parametric specification. (In the next section, we reach the opposite conclusion.) The ability to evaluate such elaborations, which hitherto has not been possible, should prove useful in practical model building.

[Figure 2. Binary Data: The Marginal Likelihood Estimate of the DPM Model Versus Number of Iterations. Vertical axis: log marginal estimate. Horizontal axis: iteration (x 10^4).]

Table 2. Binary Data: Estimated Log Marginal Likelihoods (on the diagonal) for Three Binary Data Models

                        M1 (probit)    M2 (Student t link)    M3 (DPM)
M1 (probit)             -36.252
M2 (Student t link)     (0.451)        -35.801
M3 (DPM)                (-1.488)       (-1.939)               -37.740

NOTE: The entry in brackets is the log of the Bayes factor in favor of the row model versus the column model.

6.3 Bayes Factors for Longitudinal Data Mixed Models

Models. Carlin and Louis (2000) reported data from a clinical trial on the effectiveness of two antiretroviral drugs (didanosine and zalcitabine) in 467 persons with advanced human immunodeficiency virus infection. The response variable $y_{ij}$ for patient $i$ at time $j$ is the square root (to reduce skewness) of the patient's CD4 count, recorded at study entry and at 2, 6, 12, and 18 months after entry. Several patients have incomplete records due to dropouts, so the effective response vector for the $i$th patient is $y_i = (y_{i1}, \ldots, y_{it_i})$, where $1 \le t_i \le 5$. Carlin and Louis (2000) and Chib and Jeliazkov (2001) used the following linear mixed-effects model for these data:

$$
\begin{aligned}
y_i \mid \beta, b_i, \sigma^2, X_i &\sim N_{t_i}(X_i\beta + W_i b_i, \sigma^2 I), \quad i \le n\ (= 467); \\
b_1, \ldots, b_n &\overset{\text{iid}}{\sim} N_2(0, D),
\end{aligned} \tag{32}
$$

where the $j$th row of $W_i$ takes the form $w_{ij} = (1, x_{ij})$, $x_{ij} \in \{0, 2, 6, 12, 18\}$, the fixed design matrix $X_i$ is $X_i = (W_i \mid d_i W_i \mid a_i W_i)$, $d_i$ is a binary variable indicating whether patient $i$ received didanosine ($d_i = 1$) or zalcitabine ($d_i = 0$), and $a_i$ is a binary variable indicating whether the patient was diagnosed as having AIDS at baseline ($a_i = 1$) or not ($a_i = 0$). We denote this parametric model by $\mathcal{M}_1$. The second parametric model, $\mathcal{M}_2$, provides a heavier-tailed distribution for the two-dimensional random effects, $b_i$, by modeling them with a Student t distribution.

Our third model, $\mathcal{M}_3$, is a flexible semiparametric model that does not impose a parametric assumption on the random-effects distribution, but instead models it by a Dirichlet process as

$$
b_1, \ldots, b_n \mid G \sim G; \qquad G \mid G_0 \sim DP(\alpha, G_0); \qquad G_0(\cdot \mid D) = N_2(0, D).
$$
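One way to see what the Dirichlet process prior on the random effects implies is to draw $b_1, \ldots, b_n$ with $G$ integrated out, using the Pólya urn representation of Blackwell and MacQueen (1973). The Python sketch below is illustrative only: the function names, the value of $\alpha$, and the matrix `D` are hypothetical choices and are not taken from the analysis in the text. Each new effect is a fresh draw from $G_0 = N_2(0, D)$ with probability $\alpha / (\alpha + i - 1)$ and otherwise repeats one of the earlier effects, which is what produces clustering among patients.

```python
import math
import random


def draw_base_measure(D, rng):
    """Draw from G0 = N_2(0, D) using a hand-written 2x2 Cholesky factor."""
    (d11, d12), (_, d22) = D
    l11 = math.sqrt(d11)
    l21 = d12 / l11
    l22 = math.sqrt(d22 - l21 * l21)
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (l11 * e1, l21 * e1 + l22 * e2)


def polya_urn_effects(n, alpha, D, rng):
    """Draw b_1..b_n from the DP prior with G integrated out (Polya urn).

    b_i is a fresh draw from G0 with probability alpha / (alpha + i - 1);
    otherwise it copies one of the earlier b_l chosen uniformly at random,
    so an existing cluster is joined with probability proportional to its size.
    """
    effects = []
    labels = []
    for i in range(1, n + 1):
        if rng.random() < alpha / (alpha + i - 1):
            effects.append(draw_base_measure(D, rng))
            labels.append(max(labels, default=0) + 1)  # open a new cluster
        else:
            l = rng.randrange(i - 1)
            effects.append(effects[l])
            labels.append(labels[l])
    return effects, labels


# Hypothetical settings, chosen only for illustration.
D = ((1.0, 0.3), (0.3, 2.0))
effects, labels = polya_urn_effects(n=50, alpha=1.0, D=D, rng=random.Random(1))
n_clusters = max(labels)
```

Small values of $\alpha$ give few distinct effects, while large values make almost every $b_i$ a separate draw from $G_0$, which approaches the parametric normal random-effects model.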

If $G$ is a fixed distribution and equal to $G_0$, we get the parametric model $\mathcal{M}_1$. Bush and MacEachern (1996), Kleinman and Ibrahim (1998), Tao, Palta, Yandell, and Newton (1999), and Ishwaran and Takahara (2002) also took advantage of the Dirichlet process prior in linear mixed models.

We complete the models by assuming that a priori, $\beta$ is $N_6(\beta_0, B_0)$ with $\beta_0 = (10, 0, 0, 0, -3, 0)$ and $B_0 = \text{diag}(2^2, 1^2, (0.1)^2, 1^2, 1^2, 1^2)$; $D^{-1}$ is $\text{Wishart}_2(\rho_0, R_0/\rho_0)$ with $\rho_0 = 24$ and $R_0 = \text{diag}(0.25, 16)$; and $\sigma^2$ is inverse gamma$(3, 60)$. Finally, in model $\mathcal{M}_3$, $\alpha$ is assumed to follow a gamma distribution with parameters 20 and 1.

Posterior Ordinate Estimation. For brevity, we consider only the marginal likelihood computation of the semiparametric model $\mathcal{M}_3$. In accordance with the general approach of (9), the notation of which is carried over here with $\psi = (\beta, \sigma^2)$ and $\phi = D^{-1}$, write the posterior ordinate as

$$
\pi(D^{-1*}, \beta^*, \sigma^{2*}, \alpha^* \mid y) = \pi(D^{-1*} \mid y)\, \pi(\beta^* \mid y, D^*)\, \pi(\sigma^{2*} \mid y, D^*, \beta^*)\, \pi(\alpha^* \mid y, D^*, \beta^*, \sigma^{2*}),
$$

where $\vartheta^* = (\beta^*, \sigma^{2*}, D^*, \alpha^*)$ is the posterior mean from the MCMC run. The first of these four ordinates is estimated from the output of the complete MCMC run by averaging the Wishart conditional density of $D^{-1}$ at $D^{-1*}$. The second ordinate is estimated from a reduced run where $D$ is fixed at $D^*$ and the multivariate normal conditional density of $\beta$ is evaluated at $\beta^*$ and averaged. The third ordinate is estimated from a further reduced run with both $D$ and $\beta$ fixed, and then averaging the inverse gamma conditional density at $\sigma^{2*}$. The fourth ordinate is obtained from a final reduced run with $D$, $\beta$, and $\sigma^2$ fixed at the starred values, where the mixture gamma density of $\alpha$ in (17) is averaged over the sampled values at the fixed point $\alpha^*$.

Likelihood Ordinate Estimation. We estimate the likelihood ordinate $L(y \mid \vartheta^*, G_0)$ at $\vartheta^*$ using the collapsed SIS as described in Section 5.2. The predictive ordinate of the first observation is given by

$$
u_1 = f(y_1 \mid \vartheta^*) = \int f(y_1 \mid b_1, \beta^*, \sigma^{2*})\, dG_0(b_1 \mid D^*) = N_{t_1}(y_1 \mid X_1\beta^*, V_1^*),
$$

where $V_i^* = \sigma^{2*} I_{t_i} + W_i D^* W_i'$, $i = 1, \ldots, n$. We next set $s_1 = 1$. For the remaining observations ($i = 2, \ldots, n$), note that the posterior distribution of $b$ based on the prior $G_0$ and only those observations $\{y_l : l < i,\ s_l = j\}$ in the $j$th cluster is $H_{j,i-1}(b) = N_2(\hat b_j, D_j)$, $j = 1, \ldots, k_{i-1}$, where

$$
D_j = \Bigl(D^{*-1} + \sigma^{-2*} \sum_{l:\, s_l = j} W_l' W_l\Bigr)^{-1}
\quad\text{and}\quad
\hat b_j = D_j\, \sigma^{-2*} \sum_{l:\, s_l = j} W_l' (y_l - X_l\beta^*).
$$

The prequential predictive ordinate of $y_i$ now follows from (24) as

$$
u_i = \frac{\alpha^*}{\alpha^* + i - 1} N_{t_i}(y_i \mid X_i\beta^*, V_i^*)
+ \sum_{j=1}^{k_{i-1}} \frac{n_{j,i-1}}{\alpha^* + i - 1} N_{t_i}(y_i \mid X_i\beta^* + W_i \hat b_j, \Sigma_j),
$$

where $\Sigma_j = \sigma^{2*} I_{t_i} + W_i D_j W_i'$, which resembles $V_i^*$ except that it involves $D_j$, not $D^*$. For step 2 of the collapsed SIS, we draw the cluster label $s_i$ from the discrete mass distribution,

$$
\Pr\bigl(s_i = j \mid y_{(i-1)}, s_{(i-1)}^{(g)}, y_i, \vartheta^*\bigr) =
\begin{cases}
c\, \dfrac{n_{j,i-1}}{\alpha^* + i - 1} N_{t_i}(y_i \mid X_i\beta^* + W_i \hat b_j, \Sigma_j), & 1 \le j \le k_{i-1}, \\[2ex]
c\, \dfrac{\alpha^*}{\alpha^* + i - 1} N_{t_i}(y_i \mid X_i\beta^*, V_i^*), & j = k_{i-1} + 1.
\end{cases}
$$

These steps are repeated $M$ times; the average of $w = \prod_{i=1}^{n} u_i$ from each complete sweep through the observations is our estimate of the likelihood ordinate. We obtain the marginal likelihood of the semiparametric DPM model by inserting the posterior and likelihood ordinate estimates into (9).

Two Studies With Simulated Data. We investigate the efficacy of the Bayes factor for semiparametric model comparison in two simulated datasets. Both datasets include the covariate measurements of the first 200 patients in the original CD4 count data. The responses $y_i$, $i = 1, \ldots, 200$, in the first dataset are simulated in two stages. In particular, the random effects are simulated from the four-component bivariate normal mixture,

$$
b_1, \ldots, b_{200} \overset{\text{iid}}{\sim} G = \frac{1}{4} \sum_{l=1}^{4} N_2(\mu_l, D^*),
$$

with overall mean $(1/4) \sum_l \mu_l = 0$. Next, $y_i$ is simulated from the mixed-effects model in (32). The covariance matrix $D^*$ is fixed at the posterior mean of $D$ based on the complete original data.

Due to the mixture structure in the data-generation model, we expect the semiparametric DPM model $\mathcal{M}_3$ to provide a better fit than the Gaussian model $\mathcal{M}_1$, and proceed to make a model comparison via the Bayes factor. We obtain the marginal likelihood of the DPM model $\mathcal{M}_3$ using the methodology described above and estimate the marginal likelihood of the parametric Gaussian model $\mathcal{M}_1$ from the MCMC algorithm of Chib and Carlin (1999), as described by Chib and Jeliazkov (2001). These estimates and the resulting Bayes factor are listed in Table 3. The Bayes factor clearly supports the DPM model $\mathcal{M}_3$ and provides "very strong" evidence (according to the scale described in Sec. 3) against the parametric Gaussian model $\mathcal{M}_1$.

Table 3. Estimated Log Marginal Likelihood (Diagonal) and Log Bayes Factor of M3 Versus M1

                Random effects simulated from       Random effects simulated from
                four-component mixture              the normal model
                M1 (normal)     M3 (DPM)            M1 (normal)     M3 (DPM)
M1 (normal)     -1,778.06                           -1,551.30
M3 (DPM)        20.42           -1,757.64           -10.74          -1,562.04

A common criticism of the Bayes factor is that it does not have an explicit penalty term for the additional dimensions of an extended model. This criticism is not valid. To provide an empirical illustration of how the Bayes factor supports parsimony, when parsimony is justified, we simulate the responses $y_i$, $i = 1, \ldots, 200$, in our second dataset from the parametric Gaussian model $\mathcal{M}_1$ by first simulating the random effects

$b_1, \ldots, b_{200}$ from the bivariate normal distribution described in (32). Table 3 lists the estimated log marginal likelihood of the Gaussian model $\mathcal{M}_1$ and the DPM model $\mathcal{M}_3$ for these data. We find that the Bayes factor selects the correct model and, more importantly, provides "very strong" evidence for the simple correct model $\mathcal{M}_1$ against its extended complex counterpart $\mathcal{M}_3$. Note that the marginal likelihood estimates for the two models are computed based on identical priors on the hyperparameters.

CD4 Count Data. We now turn to the original CD4 count data with $n = 467$ patients and compare the linear mixed-effects semiparametric DPM model $\mathcal{M}_3$ with the parametric Gaussian model $\mathcal{M}_1$ and the parametric Student t model $\mathcal{M}_2$ with 10 degrees of freedom. The Gaussian model has been considered by Chib and Jeliazkov (2001). Note that all three models have the same prior on the common parameters $\beta$, $\sigma^2$, and $D$.

Figure 3 illustrates the estimates of the log marginal likelihood of the DPM model $\mathcal{M}_3$ obtained by combining the posterior ordinate estimate and the collapsed SIS-based likelihood ordinate estimate. For comparison, we also computed the marginal likelihood estimate using the basic SIS method described in Section 5.1. Significantly, our preferred collapsed SIS estimate stabilizes quickly and does not change much as the number of iterations is increased. In contrast, the estimate from the basic SIS approach tends toward the collapsed estimate, but evidently does not converge even after 100,000 iterations. We have seen similar behavior in other models that we have considered.

[Figure 3. AIDS Data: The Marginal Likelihood Estimate of the DPM Model Versus Number of Iterations. The solid and dashed lines represent estimates from the collapsed and basic sequential methods. Vertical axis: log marginal estimate. Horizontal axis: iteration (x 10^4).]

Finally, we evaluate the three models in terms of the marginal likelihoods and Bayes factors. We stress that comparing DPM models for clustered data in this fashion has not been possible until now. Table 4 shows that the estimated log Bayes factor in favor of the DPM model versus the Student t model is 93.424 and that versus the standard Gaussian model is 79.25, providing "very strong" support for the DPM model (on the scale of Sec. 3). These Bayes factor values can be interpreted as stating that model $\mathcal{M}_3$ is "very strongly" successful in predicting the observed data relative to model $\mathcal{M}_1$ or $\mathcal{M}_2$, or that the "weight of evidence" provided by the data in favor of $\mathcal{M}_3$ is "very strong" compared to $\mathcal{M}_1$ or $\mathcal{M}_2$.

Table 4. Estimated Log Marginal Likelihoods (on the diagonal) for Each of Three Mixed Models for the AIDS Data

                    DPM (M3)        Student t (M2)    Normal (M1)
DPM (M3)            -3,459.789      (93.424)          (79.25)
Student t (M2)                      -3,553.213        (-14.174)
Normal (M1)                                           -3,539.039

NOTE: The entries in the upper half are the log of the Bayes factor in favor of the row model versus the column model.

7. CONCLUDING REMARKS

In this article we have developed and exemplified one of the first approaches for computing the marginal likelihood of a semiparametric DPM model. One virtue of the proposed technique, which relies on the approach of Chib (1995), is that it uses the programming done to simulate the posterior distribution of the DPM model and requires no further tuning of the MCMC algorithm. The only incidental coding needed is for the estimation of the likelihood ordinate at one fixed point, which is done by the sequential imputation method. Using a longitudinal normal regression DPM model where the value of the marginal likelihood is known analytically, we have shown that our proposed estimate is accurate, stable, and efficient. The implementation and performance of the method have been further clarified in experiments involving semiparametric link models for binary response data and hierarchical mixed models for longitudinal data. Although the DPM model decisively dominates the parametric models in the longitudinal example, the binary response example leads to a different verdict. One may expect that with access to the proposed method, the practice of comparing semiparametric DPM models with parametric or other semiparametric models on the basis of marginal likelihoods and Bayes factors may become common.

[Received February 2002. Revised September 2002.]

REFERENCES

Albert, J., and Chib, S. (1993), "Bayesian Analysis of Binary and Polychotomous Response Data," Journal of the American Statistical Association, 88, 669–679.

Basu, S., and Mukhopadhyay, S. (2000), "Binary Response Regression With Normal Scale Mixture Links," in Generalized Linear Models: A Bayesian Perspective, eds. D. K. Dey, S. K. Ghosh, and B. K. Mallick, New York: Marcel Dekker, pp. 231–242.

Berger, J. O., and Guglielmi, A. (2001), "Bayesian and Conditional Frequentist Testing of a Parametric Model Versus Nonparametric Alternatives," Journal of the American Statistical Association, 96, 174–184.

Blackwell, D., and MacQueen, J. B. (1973), "Ferguson Distributions via Polya Urn Schemes," The Annals of Statistics, 1, 353–355.

Brown, B. W. (1980), "Prediction Analysis for Binary Data," in Biostatistics Casebook, eds. R. J. Miller, B. Efron, B. W. Brown, and L. E. Moses, New York: Wiley.

Bush, C. A., and MacEachern, S. N. (1996), "A Semiparametric Bayesian Model for Randomised Block Designs," Biometrika, 88, 275–285.

Carlin, B. P., and Louis, T. A. (2000), Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.), New York: Chapman Hall/CRC.

Carota, C., and Parmigiani, G. (1996), "On Bayes Factor for Nonparametric Alternatives," in Bayesian Statistics 5, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, London: Oxford University Press, pp. 507–511.

Chib, S. (1995), "Marginal Likelihood From the Gibbs Output," Journal of the American Statistical Association, 90, 1313–1321.

Chib, S., and Carlin, B. (1999), "On MCMC Sampling in Hierarchical Longitudinal Models," Statistics and Computing, 9, 17–26.

Chib, S., and Greenberg, E. (1995), "Understanding the Metropolis-Hastings Algorithm," The American Statistician, 49, 327–335.

Chib, S., and Jeliazkov, I. (2001), "Marginal Likelihood From the Metropolis–Hastings Output," Journal of the American Statistical Association, 96, 270–281.

Erkanli, A., Stangl, D., and Müller, P. (1993), "A Bayesian Analysis of Ordinal Data Using Mixtures," in Proceedings of the Section on Bayesian Statistical Science, American Statistical Association, pp. 51–56.

Escobar, M. D. (1988), "Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means," unpublished doctoral dissertation, Yale University.

Escobar, M. D. (1994), "Estimating Normal Means With a Dirichlet Process Prior," Journal of the American Statistical Association, 89, 268–277.

Escobar, M. D., and West, M. (1995), "Bayesian Density Estimation and Inference Using Mixtures," Journal of the American Statistical Association, 90, 577–588.

Ferguson, T. S. (1973), "A Bayesian Analysis of Some Nonparametric Problems," The Annals of Statistics, 1, 209–230.

Ferguson, T. S. (1983), "Bayesian Density Estimation By Mixtures of Normal Distributions," in Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday, New York: Academic Press, pp. 287–302.

Florens, J. P., Richard, J. F., and Rolin, J. M. (1996), "Bayesian Encompassing Specification Tests of a Parametric Model Against a Nonparametric Alternative," technical report, University of Pittsburgh, Dept. of Economics.

Gelfand, A., Hills, S. E., Racine-Poon, A., and Smith, A. F. M. (1990), "Illustration of Bayesian Inference in Normal Data Models Using Gibbs Sampling," Journal of the American Statistical Association, 85, 972–985.

Good, I. J. (1985), "Weight of Evidence; A Brief Survey," in Bayesian Statistics 2, eds. J. M. Bernardo et al., New York: Elsevier, pp. 249–269.

Han, C., and Carlin, B. P. (2001), "MCMC Methods for Computing Bayes Factors: A Comparative Review," Journal of the American Statistical Association, 96, 1122–1132.

Irwin, M., Cox, N., and Kong, A. (1994), "Sequential Imputation for Multilocus Linkage Analysis," Proceedings of the National Academy of Science USA, 91, 11684–11688.

Ishwaran, H., and James, L. F. (2001a), "Gibbs Sampling Methods for Stick-Breaking Priors," Journal of the American Statistical Association, 96, 161–173.

Ishwaran, H., and James, L. F. (2001b), "Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models," unpublished manuscript.

Ishwaran, H., James, L. F., and Sun, J. (2001), "Bayesian Model Selection in Finite Mixtures by Marginal Density Decomposition," Journal of the American Statistical Association, 96, 1316–1332.

Ishwaran, H., and Takahara, G. (2002), "Independent and Identically Distributed Monte Carlo Algorithms for Semiparametric Linear Mixed Models," Journal of the American Statistical Association, 97, 1154–1166.

Kleinman, K. P., and Ibrahim, J. G. (1998), "A Semiparametric Bayesian Approach to the Random Effects Model," Biometrics, 54, 921–938.

Kong, A., Liu, J. S., and Wong, W. H. (1994), "Sequential Imputations and Bayesian Missing Data Problems," Journal of the American Statistical Association, 89, 278–288.

Kuo, L. (1986), "Computations of Mixtures of Dirichlet Processes," SIAM Journal on Scientific and Statistical Computing, 7, 60–71.

Liu, J. S. (1996), "Nonparametric Hierarchical Bayes via Sequential Imputations," The Annals of Statistics, 24, 911–930.

Liu, J. S. (2001), Monte Carlo Strategies in Scientific Computing, New York: Springer-Verlag.

Lo, A. Y. (1984), "On a Class of Bayesian Nonparametric Estimates: I. Density Estimates," The Annals of Statistics, 12, 351–357.

Lo, A. Y., Brunner, L. J., and Chan, A. T. (1996), "Weighted Chinese Restaurant Processes and Bayesian Mixture Models," Research Report 1, Hong Kong University of Science and Technology.

MacEachern, S. N. (1994), "Estimating Normal Means With a Conjugate Style Dirichlet Process Prior," Communications in Statistics: Simulation and Computation, 23, 727–741.

MacEachern, S. N., Clyde, M., and Liu, J. S. (1999), "Sequential Importance Sampling for Nonparametric Bayes Models: The Next Generation," Canadian Journal of Statistics, 27, 251–267.

MacEachern, S. N., and Müller, P. (1998), "Estimating Mixtures of Dirichlet Process Models," Journal of Computational and Graphical Statistics, 7, 223–238.

Newton, M. A., Czado, C., and Chappell, R. (1996), "Bayesian Inference for Semiparametric Binary Regression," Journal of the American Statistical Association, 91, 142–153.

Quintana, F. A. (1998), "Nonparametric Bayesian Analysis for Assessing Homogeneity in k × L Contingency Tables With Fixed Right Margin Totals," Journal of the American Statistical Association, 93, 1140–1149.

Quintana, F. A., and Newton, M. (2000), "Computational Aspects of Nonparametric Bayesian Analysis With Applications to the Modeling of Multiple Binary Sequences," Journal of Computational and Graphical Statistics, 9, 711–737.

Sethuraman, J. (1994), "A Constructive Definition of Dirichlet Priors," Statistica Sinica, 4, 639–650.

Tao, H., Palta, M., Yandell, B. S., and Newton, M. A. (1999), "An Estimation Method for the Semiparametric Mixed Effects Model," Biometrics, 55, 102–110.