
Parameterization and Bayesian Modeling

Andrew GELMAN

Progress in statistical computation often leads to advances in statistical modeling. For example, it is surprisingly common that an existing model is reparameterized, solely for computational purposes, but then this new configuration motivates a new family of models that is useful in applied statistics. One reason why this phenomenon may not have been noticed in statistics is that reparameterizations do not change the likelihood. In a Bayesian framework, however, a transformation of the parameters typically suggests a new family of prior distributions. We discuss examples in censored and truncated data, mixture modeling, multivariate imputation, stochastic processes, and multilevel models.

KEY WORDS: Censored data; Data augmentation; Gibbs sampler; Hierarchical model; Missing-data imputation; Parameter expansion; Prior distribution; Truncated data.

1. INTRODUCTION

Progress in statistical computation often leads to advances in statistical modeling. We explore this idea in the context of data and parameter augmentation, techniques in which latent data or parameters are added to a model. Data and parameter augmentation are methods for reparameterizing a model, not changing its description of data, but allowing computation to proceed more easily, quickly, or reliably. In a Bayesian context, however, these latent data and parameters can often be given substantive interpretations in a way that expands the model's practical utility.

1.1 Data Augmentation and Parameter Expansion in Likelihood and Bayesian Inference

Data augmentation (Tanner and Wong 1987) refers to a family of computational methods that typically add new latent data that are partially identified by the data. By "partially identified," we mean that there is some information about these new variables, but as sample size increases, the amount of information about each variable does not increase. Examples in this article include censored data, latent mixture indicators, and latent continuous variables for discrete regressions. Data augmentation is designed to allow simulation-based computations to be performed more simply on the larger space of "complete data," by analogy to the workings of the EM algorithm for maximum likelihood (Dempster, Laird, and Rubin 1977).
Parameter expansion (Liu, Rubin, and Wu 1998) typically adds new parameters that are nonidentified; in Bayesian terms, if they have improper prior distributions (as they typically do), then they have improper posterior distributions. An important example is replacing a parameter θ by a product, φψ, so that inference can be obtained about the product but not about either individual parameter. Parameter expansion can be viewed as part of a larger perspective on iterative simulation (see van Dyk and Meng 2001; Liu 2003), but our focus here is on its construction of nonidentifiable parameters as a computational aid.

Both data augmentation and parameter expansion are exciting tools for increasing the simplicity and speed of computations. In a likelihood-inference framework, that is all they can be: computational tools. By design, these methods do not change the likelihood; they only change its parameterization. The same goes for simpler computational methods, such as standardization of predictors in regression models and rotations to speed Gibbs samplers (e.g., Hills and Smith 1992; Boscardin 1996; Roberts and Sahu 1997).

From a Bayesian perspective, however, new parameterizations can lead to new prior distributions and thus new models. One way in which this often occurs is if the prior distribution for a parameter is conditionally conjugate (i.e., conjugate in the conditional posterior distribution), given the data and all other parameters in the model. In Gibbs sampler computation, conditional conjugacy can allow more efficient computation of posterior moments using "Rao-Blackwellization" (see Gelfand and Smith 1990). This technique is also useful for performing inferences on latent parameters in mixture models, conditional on convergence of the simulations for the hyperparameters. As we discuss in Section 5, parameter expansion leads to new families of conditionally conjugate models.

Once again, there is an analogy in simpler settings. For example, in classical regression, applying a linear transformation to regression predictors has no effect on the predictions. But in a Bayesian regression with a hierarchical prior distribution, rescaling and other linear transformations can pull parameters closer together so that shrinkage is more effective, as we discuss in Section 5.1.

1.2 Model Expansion for Substantive or Computational Reasons

The usual reason for expanding a model is substantive: to better capture an underlying model of interest, to better fit existing data, or both. Interesting statistical issues arise when balancing these goals, and Bayesian inference with proper prior distributions can resolve the potential nonidentifiability problems that can arise.

A Bayesian model can be expanded by adding parameters, or a set of candidate models can be bridged using discrete model averaging or continuous model expansion. Recent treatments of these approaches from a Bayesian perspective have been presented by Madigan and Raftery (1994), Hoeting, Madigan, Raftery, and Volinsky (1999), Draper (1995), and Gelman, Huang, van Dyk, and Boscardin (2003, secs. 6.6 and 6.7). In the context of data fitting, Green and Richardson (1997) showed how Bayesian model mixing can be used to perform the equivalent of nonparametric density estimation and regression.

Another important form of model expansion is for sensitivity to potential nonignorability in data collection (see Rubin 1976; Little and Rubin 1987).

Andrew Gelman is Professor, Department of Statistics, Columbia University, New York, NY 10027 (E-mail: [email protected]). The author thanks Xiao-Li Meng and several reviewers for helpful discussions and the U.S. National Science Foundation for financial support.

The additional parameters in these models cannot be identified but are varied to explore the sensitivity of inferences to assumptions about selection (see Diggle and Kenward 1994; Kenward 1998; Rotnitzky, Robins, and Scharfstein 1998; Troxel, Ma, and Heitjan 2003). Nandaram and Choi (2002) argued that continuous model expansion with an informative prior distribution is appropriate for modeling potential nonignorability in nonresponse, and showed how the variation in nonignorability can be estimated from a hierarchical data structure (see also Little and Gelman 1998).

In this article, we do not further consider these substantive examples of model expansion, but rather discuss several classes of models for which theoretically or computationally motivated model expansion has unexpectedly led to new insights or new classes of models. All of these examples feature new parameters or model structures that could be considered purely mathematical constructions but gain new life when given direct interpretations. Where possible, we illustrate with applications from our own research so that we have more certainty in our claims about the original motivation and ultimate uses of the reparameterizations and model expansions.

2. TRUNCATED AND CENSORED DATA

It is well understood that the censored-data model is like the truncated-data model, but with additional information: with censoring, certain specific measurements are missing. Here we further explore the connection between the two models.

2.1 Truncated and Censored Data Models

We work in the context of a simple example (see Gelman et al. 2003, sec. 7.8). A random sample of N animals are weighed on a digital scale. The scale is accurate, but it does not give a reading for objects that weigh more than 200 pounds. Of the N animals, n = 91 are successfully weighed; their weights are y_1, ..., y_91. We assume that the weights of the animals in the population are normally distributed with mean µ and standard deviation σ.

In the "truncated data" scenario, N is unknown, and the posterior distribution of the unknown parameters µ and σ given the data is

    p(\mu, \sigma \mid y) \propto p(\mu, \sigma) \left[ 1 - \Phi\left( \frac{\mu - 200}{\sigma} \right) \right]^{-91} \prod_{i=1}^{91} \mathrm{N}(y_i \mid \mu, \sigma^2).    (1)

In the "censored data" scenario, N is known, and the posterior distribution is

    p(\mu, \sigma \mid y, N) \propto p(\mu, \sigma) \left[ \Phi\left( \frac{\mu - 200}{\sigma} \right) \right]^{N-91} \prod_{i=1}^{91} \mathrm{N}(y_i \mid \mu, \sigma^2).    (2)

By using p(µ, σ) rather than p(µ, σ | N), we are assuming that N provides no direct information about µ and σ.

2.2 Modeling Truncated Data as Censored but With an Unknown Number of Censored Data Points

Now suppose that N is unknown. We can consider two options for modeling the data:

1. Using the truncated-data model (1)
2. Using the censored-data model (2), treating the original sample size N as missing data.

The second option requires a probability distribution for N. The complete posterior distribution of the observed data y and missing data N is then

    p(\mu, \sigma, N \mid y) \propto p(N)\, p(\mu, \sigma) \binom{N}{91} \left[ \Phi\left( \frac{\mu - 200}{\sigma} \right) \right]^{N-91} \prod_{i=1}^{91} \mathrm{N}(y_i \mid \mu, \sigma^2).

We can obtain the marginal posterior density of (µ, σ) by summing over N:

    p(\mu, \sigma \mid y) \propto p(\mu, \sigma) \prod_{i=1}^{91} \mathrm{N}(y_i \mid \mu, \sigma^2) \sum_{N=91}^{\infty} p(N) \binom{N}{91} \left[ \Phi\left( \frac{\mu - 200}{\sigma} \right) \right]^{N-91}.    (3)

It turns out that if p(N) ∝ 1/N, then the expression inside the summation in (3) has the form of a negative binomial density with θ = N − 91, α = 91, and β = 1/Φ((µ − 200)/σ) − 1. The summation is thus proportional to [1 − Φ((µ − 200)/σ)]^{−91}, so that for this particular choice of noninformative prior distribution, the entire expression (3) becomes proportional to the simple truncated-data posterior density (1). Meng and Zaslavsky (2002) discussed other properties of the 1/N prior distribution and the negative binomial model.
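This equivalence is easy to verify numerically. The following is a minimal sketch in Python (assuming numpy and scipy; the small illustrative dataset stands in for the 91 observed weights): it evaluates the logarithms of (1) and (3) under a flat prior on (µ, σ) and confirms that they differ only by a constant.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln, logsumexp

y = np.array([150.0, 162.0, 171.0, 180.0, 188.0])  # illustrative observed weights, all < 200
n = len(y)

def log_post_truncated(mu, sigma):
    """Log of the truncated-data posterior (1), up to a constant, flat prior on (mu, sigma)."""
    return (stats.norm.logpdf(y, mu, sigma).sum()
            - n * stats.norm.logcdf(200, mu, sigma))  # logcdf = log[1 - Phi((mu-200)/sigma)]

def log_post_censored(mu, sigma, N_max=5000):
    """Log of (3): the censored-data model with p(N) proportional to 1/N summed out."""
    N = np.arange(n, N_max)
    log_choose = gammaln(N + 1) - gammaln(n + 1) - gammaln(N - n + 1)
    log_tail = (N - n) * stats.norm.logsf(200, mu, sigma)  # logsf = log Phi((mu-200)/sigma)
    return (stats.norm.logpdf(y, mu, sigma).sum()
            + logsumexp(-np.log(N) + log_choose + log_tail))

for mu, sigma in [(165.0, 15.0), (175.0, 25.0)]:
    print(log_post_censored(mu, sigma) - log_post_truncated(mu, sigma))
# Both printed differences equal -log(n): the two posteriors agree up to a constant.
```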
It seems completely sensible that if we add a parameter N to the model and average it out, then we should return to the original model (1). What is odd, however, is that this works only with one particular prior distribution, p(N) ∝ 1/N. This seems almost to cast doubt on the original truncated-data model, in that it has this hidden assumption about N. Expressing the truncated-data model in censored-data form adds generality, but it introduces a sensitivity to the prior distribution of the new parameter N.

2.3 Connection to the Themes of This Article

Model (1) is the basic posterior distribution for truncated data. Going to the censored-data formulation is a model expansion; N is an additional piece of information that is not needed in the model but allows it to be analyzed using a different method (in this case the censored-data model). In examples more complicated than the normal, this model expansion may be computationally useful, because it removes the integral in the denominator of the truncated-data likelihood.

However, once we consider the model expansion, it reveals the original truncated-data likelihood as just one possibility in a class of models. Depending on the information available in any particular problem, it could make sense to use different prior distributions for N and thus different truncated-data models. It is hard to return to the original "state of innocence" in which N did not need to be modeled.

A similar modeling issue arises in contingency table models (see, e.g., Fienberg 1977), where the multinomial model for counts, (y_1, ..., y_J) ∼ multinomial(n; π_1, ..., π_J), can be derived from the Poisson model, y_j ∼ Poisson(λ_j), by conditioning on the total n = Σ_j y_j, which has a Poisson(Σ_j λ_j) distribution, and setting π_j = λ_j / Σ_k λ_k for each j. However, other models with the same conditional distributions are possible, corresponding to different marginal models for the total n.
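This Poisson-to-multinomial reduction can be checked directly from the probability mass functions. A minimal sketch in Python (assuming numpy and scipy; the rates and the table of counts are arbitrary):

```python
import numpy as np
from scipy import stats

lam = np.array([2.0, 5.0, 3.0])   # arbitrary Poisson rates lambda_j
y = np.array([1, 4, 2])           # one possible table of counts
n = y.sum()

log_joint = stats.poisson.logpmf(y, lam).sum()  # p(y) under independent Poissons
log_total = stats.poisson.logpmf(n, lam.sum())  # p(n): the total is Poisson(sum of lambda)
log_conditional = log_joint - log_total         # p(y | n)
log_multinomial = stats.multinomial.logpmf(y, n=n, p=lam / lam.sum())
print(np.isclose(log_conditional, log_multinomial))  # True
```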
3. LATENT VARIABLES

By definition, latent variables are constructions that add unknowns to the model without changing the likelihood of the observed data. Here we consider two common ways in which latent variables are constructed: (1) discrete labeling of components in a finite mixture model and (2) hypothesizing continuous unobserved data underlying discrete-data models, such as logistic regression. We illustrate with examples from our research in voting and public opinion.

3.1 Latent Discrete Variables in Mixture Models

Figure 1 shows a histogram of the Democratic party's share of the vote in about 400 elections to the U.S. House of Representatives in 1988. In earlier work (Gelman and King 1990), we modeled similar data from many other elections to study the properties of the electoral system under various assumptions about electoral swings. Traditionally this problem had been studied by fitting a normal distribution to district vote proportions and then shifting this distribution to the left or right to simulate alternative hypothetical election outcomes (see Kendall and Stuart 1950; Gudgin and Taylor 1979). There had been discussion in the political science literature of more general models (e.g., King and Browning 1987), and data such as that shown in Figure 1 persuaded us that mixture distributions might be appropriate. We set up a model for these sorts of election data using a mixture of three normal distributions, with two large components corresponding to the two modes and one lower and broader component to pick up outlying districts that would not be well captured by the two main components.

[Figure 1. Histogram of Democratic Share of the Two-Party Vote in Congressional Elections in 1988. Only districts that were contested by both major parties are shown here.]

For electoral data u_i (defined as the logit of the Democratic share of the two-party vote in district i, excluding uncontested districts), we fit an error model, u_i ∼ N(α_i, σ²), where α_i represented the "usual vote" in district i (on the logit scale) that would persist in future elections or under counterfactual scenarios. (The variance σ² was estimated from variation in district votes between successive elections.) We continued by modeling the usual votes with a mixture distribution,

    p(\alpha_i) = \sum_{j=1}^{3} \lambda_j\, \mathrm{N}(\alpha_i \mid \mu_j, \tau_j^2).

It is unstable to estimate mixture parameters by maximum likelihood (see, e.g., Titterington, Smith, and Makov 1985; Gelman 2003), a problem compounded by the fact that the data α_i are observed only indirectly, and so we regularized the estimates of the λ_j, µ_j, and τ_j by assigning informative prior distributions. The prior corresponded to a hump at about 40% Democratic vote with standard deviation 10%, another hump symmetrically located at about 60% Democratic vote, and a third hump, centered at 50% and much wider, to catch the outliers that are not close to either major mode. The prior distributions were given enough uncertainty that the variances for the major humps could differ, and the combined distribution could be either unimodal or bimodal, depending on the data. The model was completed with a Dirichlet(19, 19, 4) distribution on (λ_1, λ_2, λ_3), which allowed the major modes to differ a bit in mass but constrained the third mode to its designated minor role of picking up outliers.

We fit the model using data augmentation, following Tanner and Wong (1987). The data augmentation scheme alternately updated the model parameters and the latent mixture indicators for each district. The mixture model fit well and did a good job estimating the distribution of the "usual vote" parameters α_i, which in turn was useful in answering questions of political interest, such as what proportion of districts we would expect to switch parties if the national vote were to shift by x%.
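In outline, the indicator step of this data augmentation scheme can be sketched as follows (Python, assuming numpy and scipy; the function and variable names are ours, and the conditionally conjugate updates for the component means and variances are omitted):

```python
import numpy as np
from scipy import stats

def update_indicators(alpha, lam, mu, tau, rng):
    """Draw each district's latent component label z_i from its
    conditional posterior given the current mixture parameters."""
    # log p(z_i = j | alpha_i, lambda, mu, tau), up to a constant, for all i and j
    logp = np.log(lam) + stats.norm.logpdf(alpha[:, None], mu, tau)
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(lam), p=pi) for pi in p])

def update_weights(z, rng, prior=(19.0, 19.0, 4.0)):
    """Conjugate update: lambda | z ~ Dirichlet(prior + component counts)."""
    counts = np.bincount(z, minlength=len(prior))
    return rng.dirichlet(np.asarray(prior) + counts)
```

Alternating these draws with conjugate updates of the (µ_j, τ_j) and the latent α_i gives the full sampler.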
The next step was to take the mixture indicators seriously. What does it mean that some elections are in the "Republican hump," some are in the "Democratic hump," and others are in the group of "extras"? We realized that there was a key piece of information not included in our model that took on three possible values: incumbency. An election could have a Republican incumbent (i.e., the existing Congress member could be a Republican running for reelection), a Democratic incumbent, or an open seat (no incumbent running).

Conditional on incumbency, the election data were fit reasonably well by unimodal distributions, as shown by Figure 2. The individual distributions are not quite normal, which is perhaps an interesting feature, but it is certainly an improvement to go from a three-component mixture model to a regression-type model conditional on a predictor that takes on three values. The model with incumbency gives more accurate predictions and makes more political sense (see Gelman and King 1994). It came about because we were trying to make sense of a mixture model: not to estimate the latent categories, but merely to better fit the distribution of the data (as was done using more advanced computational techniques by Green and Richardson 1997).

[Figure 2. Histogram of Democratic Share of the Two-Party Vote in Congressional Elections in 1988, in Districts With (a) Republican Incumbents, (b) Democratic Incumbents, and (c) Open Seats. Combined, the three distributions yield the bimodal distribution in Figure 1.]

3.2 Latent Continuous Variables in Logistic Regression

We now consider the opposite sort of model: discrete data usefully expressed in terms of a latent continuous variable. It is well known in statistical computation that logistic or probit regression can be defined as linear regression with a latent continuous outcome (see Finney 1947; Ashford and Sowden 1970). More recently, this parameterization has been applied to the Gibbs sampler for binary regression with probit and Student-t links (see Albert and Chib 1993; Liu 2004). We give an example to illustrate how the latent variables, which might have been viewed as a computational convenience, can take on a life of their own.

In earlier work (Gelman and King 1993) we described a study of opinion trends in a series of political polls during the 1988 Presidential election campaign. In our analysis we focused on the discrete binary outcome of whether a survey respondent favored George Bush or Michael Dukakis for President, at one point fitting a series of logistic regressions. (The article also briefly considered the "no opinion" responses.)

Although the latent-variable formulation was not needed in our modeling, it provides some useful insights. For example, Figure 3 shows some opinion trends among different subgroups of the population of potential voters at three key periods: the Democratic convention, the Republican convention, and the final 40 days of the campaign. The most notable patterns occur during the two conventions; at each there is a strong swing toward the nominee of the party throwing the convention. The swings occur among almost all subgroups but most strongly among groups of Independents. Similar patterns were found by Hillygus and Jackman (2003) in a study of the 2000 election.

The pattern of Independents showing the greatest swing was at first a surprise to us from a political perspective. We had expected that the strongest swings during each convention would be within the party holding the convention, thus greater movements among Democrats during the Democratic convention and among Republicans during theirs.

However, from a latent-variable perspective, the patterns make sense. Think of each voter as having a continuous variable that is positive for a Bush supporter and negative for a Dukakis supporter, and suppose that campaign events have additive effects on the latent variable. Then Independents will tend to be near 0 on the continuous preference scale (we in fact learn this from the logistic regressions: party identification is a strong predictor of vote preference) and thus will be more likely to be shifted in their preference by an event, such as a convention, that is small but uniform in direction.

[Figure 3. Changes in Public Opinion During Three Periods in the 1988 Presidential Election Campaign: (a) the Democratic Convention, (b) the Republican Convention, and (c) the Final 40 Days of the Campaign. In each graph each symbol represents a different subset of the adult population: circles represent subsets of Democrats (e.g., white Democrats, college-educated Democrats, Democrats between 30 and 44), triangles represent subsets of Independents, and squares represent Republicans. Party labels are self-defined from survey responses. In each graph the change in support for Bush during the period is plotted versus the final support for Bush at the time of the election. The different groups tend to move together, but the largest changes tend to be among the Independents.]

In contrast, the shifts of the last 40 days were not consistently in favor of a single party, and so groups of Independents were not so likely to show large swings.

A natural model for the preference y_it of an individual voter i at time t is then

    y_{it} = \begin{cases} \text{Bush supporter} & \text{if } z_{it} > 0 \\ \text{Dukakis supporter} & \text{if } z_{it} \le 0, \end{cases} \qquad z_{it} = (X\beta)_i + \delta_t + \epsilon_{it},

where the individual predictors X contain such information as party identification. The national swings, controlled by the continuous time series δ_t, will then have the largest effects on individuals i for whom (Xβ)_i is near 0. The model can be further elaborated with individual random effects and time-varying predictors.

We are not claiming here that latent variables are necessary for quantitative or political understanding of these opinion trends. Rather, we are pointing out that once the latent variables have been set up, they take on direct political interpretations far beyond what might have been expected had they been treated only as computational devices. In this case the connection could be made more completely by including survey responses on "feeling thermometers," that is, questions such as, "Rate George Bush on a 1–10 scale, with 1 being most negative and 10 being most positive."
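The additive-shift explanation can be reproduced with a tiny probit calculation (a sketch; the latent-scale group means and the size of the convention shift are invented for illustration):

```python
import numpy as np
from scipy import stats

delta = 0.3  # an additive shift on the latent scale (e.g., a convention bounce)
# Baseline latent means: Democrats well below 0, Independents near 0,
# Republicans well above 0 (illustrative values, not fit to data).
for label, xb in [("Democrats", -1.5), ("Independents", 0.0), ("Republicans", 1.5)]:
    before = stats.norm.sf(0, loc=xb)          # P(z > 0): support for Bush before
    after = stats.norm.sf(0, loc=xb + delta)   # after the uniform shift
    print(f"{label:13s} swing = {after - before:+.3f}")
# The same latent shift moves Independents' support probability the most,
# because the normal density is largest near the threshold at 0.
```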
” effects) andlog-linear structures for discretedistributions (Raghunathanand Grizzle 1995), or generalizewith the t dis- 4.MUL TIVARIATEMISSING–DA TAIMPUTATION tribution(Liu 1995). One may argue that having a jointdistri- butionin the imputation is less important than incorporating 4.1 IterativeUnivariate Imputations informationfrom othervariables and unique features of the VanBuuren,Boshuizen, and Knook(1999) and Raghunathan, dataset(e.g., zero/ nonzerofeatures in income components, Lepkowski,Solenberger, and V anHoewyk(2001) have formal- bounds,skip patterns, nonlinearity, interactions). Conditional izediterative for imputingmultivariate missing data modelingallows enormous  exibilityin dealing with practical usingthe following method. The algorithm starts by imputing problems.In applied work, we havenever been able to Ž tthe simpleguessed values for allof the missing data. Then each jointmodels to a realdataset without making drastic simpliŽ - variable,one at atime,is modeled by aregressionconditional cations. onallof theother variables in the model. The regression is Žt Thuswe aresuggesting the use of anewclass of models— andused to impute random values from thepredictive distribu- inconsistentconditional distributions— that were initiallymoti- tionfor allof themissing data for thatparticular variable. The vatedby computational and analytical convenience. However, algorithmthen cycles through the variables and updates them asin the other examples of this article, once we acceptthese iterativelyin Gibbssampler fashion until approximate conver- newmodels, they take on theirown substantive interpretations. gence. This“ orderedpseudo-Gibbs sampler” (Heckerman, 4.3 ExampleFrom SurveyImputation Chickering,Meek, Rounthwaite, and Kadie 2001) is a Markov Weconsideran example in which we foundthat structural chainalgorithm, and ifthevalues of theparametersare recorded featuresof the conditional models can affect the distributions attheend of eachloop of iterations,then they will converge to oftheimputations in ways that are not always obvious. In the adistribution.This distribution may be inconsistentwith vari- New YorkCity Social Indicators Survey (GarŽ nkel and Meyers ousof theconditional distributions used in its implementation, 1999),it was necessaryto imputemissing responses for family butin general there will be convergence to something (assum- incomeconditional on demographicsand information, such as ingthat the usual conditions hold for convergenceof aMarkov whetheror not anyone in the family received government wel- chainto a uniquenondegenerate stationary distribution). fare beneŽts. Conversely, if the“ welfare beneŽts” indicator is For thespecial case in which each regression model includes missing,then family income is clearlya usefulpredictor. (W e allof theother variables with no interactionsand is linearwith ignorehere the complicating factor that the survey asked about normalerrors, thenthis is aGibbssampler, in which the missing severaldifferent sources of income, and these questions had dif- dataare imputed based on an underlying multivariate normal ferentpatterns of nonresponse.) modelthat is encodedas a setof interlockinglinear regressions. Tosimplify,suppose that we areimputing a continuousin- Infact, the iterative imputation approach is more general, comevariable y1 anda binaryindicator y2 for welfare beneŽts, Žrst becauseit allows nonnormal models (e.g., logistic re- conditionalon aset X offully-observedcovariates. 
4.2 Inconsistent Conditional Distributions

But there is a potential problem with this computationally convenient imputation scheme: It is so flexible in its specifications that its interlocking conditional distributions will not in general be compatible. That is, there is no implicit joint distribution underlying the imputation models.

This flexibility leading to incompatibility (Arnold, Castillo, and Sarabia 1999) is a theoretical flaw, but in earlier work (Gelman and Raghunathan 2001) we argued that in practice it can be useful in modeling data structures that could not be easily fit by more standard models. Separate regressions often make more sense than joint models that assume normality and hope for the best (Gelman, King, and Liu 1998), mix normality with completely unstructured discrete distributions (Schafer 1997), mix normality (with random effects) and log-linear structures for discrete distributions (Raghunathan and Grizzle 1995), or generalize with the t distribution (Liu 1995). One may argue that having a joint distribution in the imputation is less important than incorporating information from other variables and unique features of the dataset (e.g., zero/nonzero features in income components, bounds, skip patterns, nonlinearity, interactions). Conditional modeling allows enormous flexibility in dealing with practical problems. In applied work, we have never been able to fit the joint models to a real dataset without making drastic simplifications.

Thus we are suggesting the use of a new class of models, inconsistent conditional distributions, that were initially motivated by computational and analytical convenience. However, as in the other examples of this article, once we accept these new models, they take on their own substantive interpretations.

4.3 Example From Survey Imputation

We consider an example in which we found that structural features of the conditional models can affect the distributions of the imputations in ways that are not always obvious. In the New York City Social Indicators Survey (Garfinkel and Meyers 1999), it was necessary to impute missing responses for family income conditional on demographics and information such as whether or not anyone in the family received government welfare benefits. Conversely, if the "welfare benefits" indicator is missing, then family income is clearly a useful predictor. (We ignore here the complicating factor that the survey asked about several different sources of income, and these questions had different patterns of nonresponse.)

To simplify, suppose that we are imputing a continuous income variable y_1 and a binary indicator y_2 for welfare benefits, conditional on a set X of fully observed covariates. We can consider two natural approaches. Perhaps the simplest is a direct model where, for example, p(y_1 | y_2, X) is a normal distribution (perhaps a regression model on y_2, X, and the interactions of y_2 and X) and p(y_2 | y_1, X) is a logistic regression on y_1, X, and the interactions of y_1 and X. (For simplicity, we ignore the issues of nonnegativity and possible zero values of y_1.)

A more elaborate and perhaps more appealing model uses hidden variables. Let z_2 be a latent continuous variable, defined so that

    y_2 = \begin{cases} 1 & \text{if } z_2 \ge 0 \\ 0 & \text{if } z_2 < 0. \end{cases}    (4)

We can then model p(y_1, z_2 | X) as a joint normal distribution (i.e., a multivariate regression). Compared with the direct model, this latent-variable approach has the advantage of a consistent joint distribution. And once inference for (y_1, z_2) has been obtained, we can directly infer about y_2 using (4). In addition, this model has the conceptual appeal that z_2 can be interpreted as some sort of continuous "proclivity" for welfare that is activated only if it exceeds a certain threshold. In fact, the relationship between z_2 and y_2 can be made stochastic if such a model would appear more realistic.

So the latent-variable model is better (except for possible computational difficulties), right? Not necessarily. A perhaps-disagreeable byproduct of the latent model is that, because of the joint normality, the distributions of income among the welfare and nonwelfare groups, that is, the distributions p(y_1 | y_2 = 1, X) and p(y_1 | y_2 = 0, X), must substantially overlap. In contrast, the direct model allows the overlap to be large or small, depending on the data. Thus the class of inconsistent models, introduced for computational reasons, is more general than previously considered models in ways that are relevant for modeling multivariate data.

4.4 Potential Consequences of Incompatibility

In the multivariate missing-data problem, we are going beyond Bayesian analysis to computation from an inconsistent model, but to the extent that inferences are being summarized by simulations, these can be considered approximate Bayes. The non-Bayesianity should show up as incoherence of inferences, which suggests that estimates of the extent of the problem, and approaches to correcting it, could be obtained by data partitioning, that is, combining analyses from different subsets of the data, another statistical method that has been motivated by computational reasons (see Chopin 2002; Ridgeway and Madigan 2003; Gelman and Huang 2003).

5. MULTILEVEL REGRESSION MODELS

5.1 Rescaling Regression Predictors

When linear transformations are applied to predictors in a classical regression or generalized linear model, it is for reasons of computational efficiency or interpretive convenience. Two familiar examples are the expression of indicators in an ANOVA setting in terms of orthogonal contrasts and the standardization of continuous predictors to have mean 0 and variance 1. These and other linear transformations can make regression coefficients easier to understand, but they have no effect on the predictive inferences from the model. From an inferential standpoint, linear transformations do nothing in classical regression, which is one reason why statistical theory has focused on nonlinear transformations (see Atkinson 1985; Carroll and Ruppert 1988). In classical inference, reparameterizations such as centering affect the likelihood for the model parameters but do not affect predictive inference.

In contrast, even simple linear transformations can substantively change inferences in hierarchical Bayesian regressions. For a simple example, consider a regression predicting students' grades from a set of pretests, in which the coefficients β_j for the pretests j are given a common N(µ, τ²) prior distribution, where µ and τ are hyperparameters estimated from the data. Further suppose that the pretests were originally on several different scales and that each pretest has been rescaled to the range 0–100. If the tests measure similar abilities, then it would make sense for the coefficients for the rescaled test scores to be similar. The Bayesian model will then perform more shrinkage on the new scale and lead to more accurate predictions compared with the untransformed model. This is an example of a reparameterization that has no effect on the likelihood but is potentially important in Bayesian inference.
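A small numerical illustration of this non-invariance (a sketch assuming numpy, with the hyperparameters held fixed at invented values rather than estimated, and a zero prior mean for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 50, 4
X = rng.normal(size=(n, J)) * np.array([1.0, 10.0, 100.0, 1000.0])  # very different scales
beta_true = np.array([5.0, 0.5, 0.05, 0.005])  # similar effects on a common scale
y = X @ beta_true + rng.normal(0, 1, size=n)

def posterior_mean(X, y, tau=1.0, sigma=1.0):
    """Posterior mean of beta under beta_j ~ N(0, tau^2), fixed hyperparameters."""
    A = X.T @ X / sigma**2 + np.eye(X.shape[1]) / tau**2
    return np.linalg.solve(A, X.T @ y / sigma**2)

b_raw = posterior_mean(X, y)                    # predictors on their original scales
Xs = X / X.std(axis=0)                          # predictors standardized
b_std = posterior_mean(Xs, y)
print(X @ b_raw - Xs @ b_std)  # nonzero: rescaling changes the fitted predictions
# Under classical least squares, the two sets of predictions would be identical.
```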
5.2 Parameter Expansion

A more elaborate example of linear reparameterization for hierarchical regression is the parameter-expanded EM algorithm proposed by Liu et al. (1998), which has since been adapted to Gibbs sampler and Metropolis algorithms (Liu and Wu 1999; van Dyk and Meng 2001; Liu 2003; Gelman et al. 2003). We briefly review the PX-Gibbs model here and then describe how it has motivated model expansions.

5.2.1 Hierarchical Model With Potentially Slow Convergence. The starting point is the hierarchical model expressed as a regression with coefficients in M exchangeable batches,

    y = \sum_{m=1}^{M} X^{(m)} \beta^{(m)} + \text{error},

where X^{(m)} is the mth submatrix of predictors and β^{(m)} is the mth subvector of regression coefficients. The J_m coefficients in each subvector β^{(m)} have exchangeable prior distributions,

    \beta_j^{(m)} \sim \mathrm{N}(0, \sigma_m^2), \quad \text{for } j = 1, \dots, J_m.

More generally, the distributions could have nonzero means; we use the simpler form here for ease in exploring the key issues. Even so, this model is quite general and includes models with several batches of random effects; for example, a model for replicated two-way data can have row effects, column effects, and two-way interactions. In a mixed-effects model, the coefficients for the fixed effects can be collected into a batch with standard deviation σ_m set to infinity. For the random effects, the σ_m parameters will be estimated from data.

For this class of hierarchical models, Gibbs sampler calculations can get stuck, as follows. If a particular standard deviation parameter σ_m happens to be started near 0, then in the updating stage, the batch of coefficients in β^{(m)} will be shrunk toward 0. Then at the next update, σ_m will be estimated near 0, and so on. Eventually, the simulation will spiral out of this trap, but under some conditions this can take a long time (Gelman et al. 2003).
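In outline, one sweep of this Gibbs sampler might be coded as follows (a Python sketch under the stated model, assuming numpy, a known data standard deviation sigma_y, a uniform prior on each σ_m, and batches with J_m ≥ 2; the names are ours):

```python
import numpy as np

def gibbs_sweep(y, X_batches, beta, sigma2, sigma_y, rng):
    """One sweep of the standard Gibbs sampler for the hierarchical model."""
    for m, X in enumerate(X_batches):
        resid = y - sum(Xk @ bk for k, (Xk, bk) in
                        enumerate(zip(X_batches, beta)) if k != m)
        J = X.shape[1]
        # beta^(m) | rest: normal, shrunk toward 0 by the current sigma2[m]
        A = X.T @ X / sigma_y**2 + np.eye(J) / sigma2[m]
        b_hat = np.linalg.solve(A, X.T @ resid / sigma_y**2)
        beta[m] = rng.multivariate_normal(b_hat, np.linalg.inv(A))
        # sigma2[m] | rest: scaled inverse-chi^2 (uniform prior on sigma_m, J >= 2)
        sigma2[m] = np.sum(beta[m]**2) / rng.chisquare(J - 1)
        # The trap: if sigma2[m] is near 0, beta[m] is drawn near 0, which in
        # turn keeps sigma2[m] near 0 on the next sweep.
    return beta, sigma2
```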

5.2.2 Parameter Expansion to Speed Computation. The slow convergence can be fixed using the following parameter expansion scheme, adapted from a similar algorithm for the EM algorithm given by Liu et al. (1998). In the parameter-expanded model, each component m of the regression model is multiplied by a new parameter α_m,

    y = \sum_{m=1}^{M} \alpha_m X^{(m)} \beta^{(m)} + \text{error},    (5)

and the parameters α_m are given uniform prior distributions on (−∞, ∞). This new model is equivalent to the original model but with a new parameterization: |α_m| σ_m in the new model maps to σ_m in the original formulation, and each new α_m β_j^{(m)} maps to the old β_j^{(m)}. The parameters α_m and σ_m are not jointly identified, but we can simply run the Gibbs sampler on the expanded set of parameters (α, β, σ) and transform back to the original scale to get inferences under the original model.

In the Gibbs sampler for the new model, the components of β and σ² are updated as before, using normal and inverse chi-squared distributions, and the new parameter vector α is updated using a normal distribution. This extra step stops the algorithm from getting stuck near 0 (Gelman, Huang, van Dyk, and Boscardin 2004; Liu and Wu 1999).

5.2.3 New Model Families Implied by Parameter Expansion. By giving the new parameters α_m uniform prior distributions, we can perform more efficient computations for the hierarchical model. The parameter α has no meaning of its own and is used just as a multiplier to create the parameters β and σ of the original model.

But now suppose that we take the parameters α_m seriously, giving them informative prior distributions. What does that get us?

First, the variance parameter σ_m² in the old model has been split into α_m² σ_m² in the new model, which expands the family of conditionally conjugate distributions from inverse chi-squared to products of inverse chi-squared with squared normal distributions.

Second, and potentially more important, we can interpret the parameters α_m themselves in the context of the multiplicative model (5). For each m, X^{(m)} β^{(m)} represents a "factor" formed by a linear combination of the J_m individual predictors, and α_m represents the importance of that factor.

For the factor-analytic interpretation of the model to be useful, we need informative prior distributions on the parameters α or β. Otherwise, the model is nonidentified, and we cannot distinguish between the importance α_m of a factor and the coefficients β_j^{(m)} of its individual components. This indeterminacy can be resolved by, for example, constraining Σ_j β_j^{(m)} = 1 for each m, thus giving each factor the form of a weighted average, or by setting a "soft constraint" in the form of a proper prior distribution on the multiplicative parameters α_m.
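The extra Gibbs step implied by model (5) is itself an ordinary regression update. A sketch (Python, assuming numpy, flat priors on the α_m, and a known error standard deviation; function names are ours), including the map back to the original parameterization:

```python
import numpy as np

def update_alpha(y, X_batches, beta, sigma_y, rng):
    """Conditional on the beta's, (5) is a regression of y on the M constructed
    predictors w_m = X^(m) beta^(m); with flat priors, alpha | rest is normal."""
    W = np.column_stack([X @ b for X, b in zip(X_batches, beta)])
    A = W.T @ W / sigma_y**2
    mean = np.linalg.solve(A, W.T @ y / sigma_y**2)
    return rng.multivariate_normal(mean, np.linalg.inv(A))

def to_original_scale(alpha, beta, sigma):
    """Map back: beta^orig = alpha_m * beta^(m), sigma_m^orig = |alpha_m| * sigma_m."""
    return ([a * b for a, b in zip(alpha, beta)],
            np.abs(alpha) * np.asarray(sigma))
```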
We illustrate with the problem of forecasting presidential elections by state. A forecasting model based on 12 recent national elections would have 600 "data points" (state-level elections) and could then potentially include many state-level predictors measuring such factors as economic performance, incumbency, and popularity (see, e.g., Rosenstone 1983; Campbell 1992). However, at the national level there are really only 12 observations, and so one must be parsimonious with national-level predictors. In practice, this means performing some previous data analysis to pick a single economic predictor, a single popularity predictor, and maybe one or two other predictors based on incumbency and political ideology (Gelman and King 1993; Campbell, Ali, and Jalalzai 2003).

A more general approach to including national predictors would use the parameter-expanded model. For example, suppose that we wish to include five measures of the national economy (e.g., change in per-capita GDP, change in unemployment). We could first standardize them (see Sec. 5.1) and then include them in the model exchangeably in a batch m, as β_1^{(m)}, ..., β_5^{(m)}, each with a N(1/5, σ_m²) prior distribution. This prior distribution bridges the gap between the two extremes of simply using the average of the five measures as a predictor (that would be σ_m = 0) and including them as five independent predictors (σ_m = ∞). The multiplicative form of model (5) allows us to set up the full model using conditionally conjugate distributions. For example, we could predict the election outcome in year t in state s within region r(s) as

    y_{st} = X_{st}^{(0)} \beta^{(0)} + \alpha_1 \sum_{j=1}^{5} X_{jt}^{(1)} \beta_j^{(1)} + \alpha_2 \gamma_t + \alpha_3 \delta_{r(s),t} + \epsilon_{st},

where X^{(0)} is the matrix of state × year-level predictors, X^{(1)} is the matrix of year-level predictors, and γ, δ, and ε are national, regional, and statewide error terms. In this model the auxiliary parameters α_2 and α_3 exist for purely computational reasons, and they can be given noninformative prior distributions with the understanding that we are interested only in the products α_2 γ_t and α_3 δ_{r(s),t}. More interestingly, α_1 serves both a computational and a modeling role: with the β_j^{(1)} parameters having a common N(1/5, σ_m²) prior distribution, α_1 has the interpretation of the overall coefficient for the economic predictors. The parameters α and β are conditionally conjugate and form a more general model than would be possible using marginally conjugate families with the usual linear-model parameterization.

Both the parameter-expanded computations and the factor-analytic model work with generalized linear models as well. They connect somewhat to the recent approach of West (2003) of using linear combinations to include more predictors in a model than data points. The approach of West (2003) is more completely data based, whereas the class of models that we have developed here is structured using prior information (e.g., knowing which predictors correspond to economic conditions, which correspond to presidential popularity, and so forth).

6. ITERATIVE ALGORITHMS AND TIME PROCESSES

Spatial models have motivated developments in iterative simulation algorithms from Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) to Geman and Geman (1984) and beyond. It has been pointed out that iterative simulation has the role of transforming an intractable spatial process into a tractable space-time process. Such developments as hybrid sampling (Duane, Kennedy, Pendleton, and Roweth 1987; Neal 1994) take this idea even further by connecting the space-time process to a physical model of particles with "momentum" as well as "position" in simulation space. In simulations with multiple sequences, algorithms have been developed in which the "particles" from the different sequences can interact.

Perhaps insight can be gained by taking this time-series process seriously. How would this be done? We first must distinguish between two phases of the iterative simulation. First, there is the initial mixing of the sequences, moving from their starting points to the target distribution.
Second, the simulations Albert,J. H.,and Chib, S. (1993),“ Bayesian Analysisof Binary and Poly- movearound within the target distribution. chotomousResponse Data,” Journalof theAmerican Statistical Association , 88,669– 679. TheŽ rst stagecould be identiŽed with“ learning,”or infor- Arnold,B. C.,Castillo,E., andSarabia, J. M.(1999), ConditionalSpeciŽ cation mationpropagating from themodelstatement to the inferences. ofStatistical Models ,New York:Springer-V erlag. Thiscould be relevantin a settingin which new information is Ashford,J. R.,andSowden, R. R.(1970),“ Multi-Variate ProbitAnalysis,” Bio- metrics,26,535– 546. comingin or in which a processis being tracked that is itself Atkinson,A. C.(1985), Plots,Transformations, and Regression ,Oxford,U.K.: changing(Gilks and Berzuini 2001). In that case the current OxfordUniversity Press. stageof the simulation could represent some current state of Boscardin,W .J.(1996),“ Bayesian Analysisfor Some Hierarchical Lin- ear Models,”unpublished doctoral thesis, University of California-Berkeley, knowledge.Another scenario could be a Žxedtruth, but with Dept.of Statistics. actorswho are learning about it bygatheringinformation. This Campbell,J. E.(1992),“ Forecastingthe Presidential V otein the States,” Amer- mightbe of interest to economists. For example,if solvinga icanJournal of PoliticalScience ,36,386– 407. Campbell,J. E.,Ali,S., andJalalzai, F.(2003), “ Forecastingthe Presidential particularregression problem requires 10 ;000Gibbs sampler Votein the States, 1948– 2000,” technical report, University of Buffalo, Dept. iterations,then perhaps it is aproblemthat in realtime cannot ofPolitical Science. besolved immediately by economicactors. In this sort of appli- Carroll,R. J.,andRuppert, D. (1988), Transformationand Weighting in Re- gression,London:Chapman & Hall. cation,the (or otheriterative algorithm) should Chopin,N. (2002),“ ASequentialParticle FilterMethod for Static Models,” map,at least approximately, to areal-timelearning process. Biometrika,89,539– 552. Thesecond stage of aniterative algorithm— awalk through Dempster, A.P.,Laird,N. M.,andRubin, D. B.(1977),“ MaximumLikelihood FromIncomplete Data viathe EM Algorithm”(with discussion), Journal of thetarget distribution— could have a directinterpretation if un- theRoyal Statistical Society ,Ser.B, 39,1– 38. certainty aboutparameters in a modelcould be mapped onto Diggle,P .,andKenward, M. G.(1994), “ InformativeDrop-out in Longitudinal variability overtime. For example,when in studying exposures Data Analysis,” Journalof theRoyal Statistical Society ,Ser.B, 43,49– 73. Draper, D.(1995),“ Assessment andPropagation of ModelUncertainty” (with toradon gas, we founda highlevel of uncertaintyin home radon discussion), Journalof the Royal Statistical Society ,Ser.B, 57,45– 97. levels,even given some geographic information (Lin, Gelman, Duane,S., Kennedy, A. D.,Pendleton,B. J.,and Roweth, D. (1987),“ Hybrid Price,and Krantz 1999). It is alsoknown that radon levels vary MonteCarlo,” Physics Letters ,Ser.B, 195,216– 222. Fienberg,S. E.(1977), TheAnalysis of Cross-ClassiŽ ed Categorical Data , overtime (both on weeklyand annual scales). Perhaps it could Cambridge,MA: MIT Press. bepossible to setup aniterativesampling algorithm so that the Finney,D. J.(1947),“ TheEstimation From Individual Records of theRelation- jumpsin updating home radon levels correspond to temporal shipBetween Dose andQuantal Response,” Biometrika,34,320– 334. GarŽnkel, I., andMeyers, M. 
Garfinkel, I., and Meyers, M. K. (1999), "A Tale of Many Cities: The New York City Social Indicators Survey," Columbia University, School of Social Work.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398–409.
Gelman, A. (2003), "A Bayesian Formulation of Exploratory Data Analysis and Goodness-of-Fit Testing," International Statistical Review, 71, 369–382.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (1995), Bayesian Data Analysis, London: Chapman & Hall.
Gelman, A., and Huang, Z. (2003), "Sampling for Bayesian Computation With Large Datasets," technical report, Columbia University, Dept. of Statistics.
Gelman, A., Huang, Z., van Dyk, D. A., and Boscardin, W. J. (2004), "Transformed and Parameter-Expanded Gibbs Samplers for Hierarchical Linear and Generalized Linear Models," technical report, Columbia University, Dept. of Statistics.
Gelman, A., and King, G. (1990), "Estimating the Electoral Consequences of Legislative Redistricting," Journal of the American Statistical Association, 85, 274–282.
Gelman, A., and King, G. (1993), "Why Are American Presidential Election Campaign Polls So Variable When Votes Are So Predictable?" British Journal of Political Science, 23, 409–451.
Gelman, A., and King, G. (1994), "A Unified Model for Evaluating Electoral Systems and Redistricting Plans," American Journal of Political Science, 38, 514–554.
Gelman, A., King, G., and Liu, C. (1998), "Not Asked and Not Answered: Multiple Imputation for Multiple Surveys" (with discussion and rejoinder), Journal of the American Statistical Association, 93, 846–874.
Gelman, A., and Raghunathan, T. E. (2001), "Using Conditional Distributions for Missing-Data Imputation," discussion of "Conditionally Specified Distributions," by Arnold et al., Statistical Science, 3, 268–269.
Geman, S., and Geman, D. (1984), "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images," IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gilks, W. R., and Berzuini, C. (2001), "Following a Moving Target: Monte Carlo Inference for Dynamic Bayesian Models," Journal of the Royal Statistical Society, Ser. B, 63, 127–146.

Green, P. J., and Richardson, S. (1997), "On Bayesian Analysis of Mixtures With an Unknown Number of Components" (with discussion), Journal of the Royal Statistical Society, Ser. B, 59, 731–792.
Gudgin, G., and Taylor, P. J. (1979), Seats, Votes, and the Spatial Organisation of Elections, London: Pion.
Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2001), "Dependency Networks for Inference, Collaborative Filtering, and Data Visualization," Journal of Machine Learning Research, 1, 49–75.
Hills, S. E., and Smith, A. F. M. (1992), "Parameterization Issues in Bayesian Inference" (with discussion), in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, U.K.: Oxford University Press, pp. 227–246.
Hillygus, D. S., and Jackman, S. (2003), "Voter Decision Making in Election 2000: Campaign Effects, Partisan Activation, and the Clinton Legacy," American Journal of Political Science, 47, 583–596.
Hoeting, J., Madigan, D., Raftery, A. E., and Volinsky, C. (1999), "Bayesian Model Averaging," Statistical Science, 14, 382–401.
Kendall, M. G., and Stuart, A. (1950), "The Law of the Cubic Proportion in Election Results," British Journal of Sociology, 1, 193–196.
Kenward, M. G. (1998), "Selection Models for Repeated Measurements With Non-Random Dropout: An Illustration of Sensitivity," Statistics in Medicine, 17, 2723–2732.
King, G., and Browning, R. X. (1987), "Democratic Representation and Partisan Bias in Congressional Elections," American Political Science Review, 81, 1251–1276.
Lin, C. Y., Gelman, A., Price, P. N., and Krantz, D. H. (1999), "Analysis of Local Decisions Using Hierarchical Modeling, Applied to Home Radon Measurement and Remediation" (with discussion), Statistical Science, 14, 305–337.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Little, T. C., and Gelman, A. (1998), "Modeling Differential Nonresponse in Sample Surveys," Sankhyā, 60, 101–126.
Liu, C. (1995), "Missing Data Imputation Using the Multivariate t Distribution," Journal of Multivariate Analysis, 48, 198–206.
Liu, C. (2003), "Alternating Subspace-Spanning Resampling to Accelerate Markov Chain Monte Carlo Simulation," Journal of the American Statistical Association, 98, 110–117.
Liu, C. (2004), "Robit Regression: A Simple Robust Alternative to Logit and Probit," in Missing Data and Bayesian Methods in Practice.
Liu, C., Rubin, D. B., and Wu, Y. N. (1998), "Parameter Expansion to Accelerate EM: The PX–EM Algorithm," Biometrika, 85, 755–770.
Liu, J., and Wu, Y. N. (1999), "Parameter Expansion for Data Augmentation," Journal of the American Statistical Association, 94, 1264–1274.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535–1546.
Meng, X. L., and Zaslavsky, A. M. (2002), "Single Observation Unbiased Priors," The Annals of Statistics, 30, 1345–1375.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953), "Equation of State Calculations by Fast Computing Machines," Journal of Chemical Physics, 21, 1087–1092.
Nandaram, B., and Choi, J. W. (2002), "Hierarchical Bayesian Nonresponse Models for Binary Data From Small Areas With Uncertainty About Ignorability," Journal of the American Statistical Association, 97, 381–388.
Neal, R. M. (1994), "An Improved Acceptance Procedure for the Hybrid Monte Carlo Algorithm," Journal of Computational Physics, 111, 194–203.
Raghunathan, T. E., and Grizzle, J. E. (1995), "A Split Questionnaire Survey Design," Journal of the American Statistical Association, 90, 55–63.
Raghunathan, T. E., Lepkowski, J. E., Solenberger, P. W., and Van Hoewyk, J. H. (2001), "A Multivariate Technique for Multiply Imputing Missing Values Using a Sequence of Regression Models," Survey Methodology, 27, 85–95.
Raghunathan, T. E., Solenberger, P. W., and Van Hoewyk, J. (2002), "IVEware (SAS software for missing-data imputation)," available at www.isr.umich.edu/src/smp/ive/.
Ridgeway, G., and Madigan, D. (2003), "A Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets," Journal of Data Mining and Knowledge Discovery, 7, 301–319.
Roberts, G. O., and Sahu, S. K. (1997), "Updating Schemes, Correlation Structure, Blocking and Parameterization for the Gibbs Sampler," Journal of the Royal Statistical Society, Ser. B, 59, 291–317.
Rosenstone, S. J. (1983), Forecasting Presidential Elections, New Haven, CT: Yale University Press.
Rotnitzky, A., Robins, J. M., and Scharfstein, D. O. (1998), "Semiparametric Regression for Repeated Outcomes With Nonignorable Nonresponse," Journal of the American Statistical Association, 94, 1096–1146.
Rubin, D. B. (1976), "Inference and Missing Data," Biometrika, 63, 581–592.
Schafer, J. L. (1997), Analysis of Incomplete Multivariate Data, London: Chapman & Hall.
Spiegelhalter, D., Thomas, A., Best, N., Gilks, W., and Lunn, D. (1994, 2003), "BUGS: Bayesian Inference Using Gibbs Sampling," MRC Biostatistics Unit, Cambridge, U.K., available at www.mrc-bsu.cam.ac.uk/bugs/.
Tanner, M. A., and Wong, W. H. (1987), "The Calculation of Posterior Distributions by Data Augmentation" (with discussion), Journal of the American Statistical Association, 82, 528–550.
Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical Analysis of Finite Mixture Distributions, New York: Wiley.
Troxel, A., Ma, G., and Heitjan, D. F. (2003), "An Index of Local Sensitivity to Nonignorability," Statistica Sinica, to appear.
Van Buuren, S., Boshuizen, H. C., and Knook, D. L. (1999), "Multiple Imputation of Missing Blood Pressure Covariates in Survival Analysis," Statistics in Medicine, 18, 681–694.
Van Buuren, S., and Oudshoom, C. G. M. (2000), "MICE: Multivariate Imputation by Chained Equations (S software for missing-data imputation)," available at web.inter.nl.net/users/S.van.Buuren/mi/.
van Dyk, D. A., and Meng, X. L. (2001), "The Art of Data Augmentation" (with discussion), Journal of Computational and Graphical Statistics, 10, 1–111.
West, M. (2003), "Bayesian Factor Regression Models in the 'Large p, Small n' Paradigm," in Bayesian Statistics 7, eds. S. Bayarri, J. O. Berger, J. M. Bernardo, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, Oxford, U.K.: Oxford University Press.