BMA: an R Package for Bayesian Model Averaging
Total Page:16
File Type:pdf, Size:1020Kb
News The Newsletter of the R Project Volume 5/2, November 2005 Editorial by Douglas Bates Early on they mention the task view for spatial sta- tistical analysis that is being maintained on CRAN. One of the strengths of the R Project is its pack- Task views are a new feature on CRAN designed to age system and its network of archives of pack- make it easier for those who are interested in a par- ages. All useRs will be (or at least should be) famil- ticular topic to navigate their way through the mul- iar with CRAN, the Comprehensive R Archive Net- titude of packages that are available and to find the work, which provides access to more than 600 con- ones targeted to their particular interests. tributed packages. This wealth of contributed soft- To provide the exception that proves the rule, Xie ware is a tribute to the useRs and developeRs who describes the use of condor, a batch process schedul- have written and contributed those packages and ing system, and its DAG (directed acylic graph) capa- to the CRAN maintainers who have devoted untold bility to coordinate remote execution of long-running hours of effort to establishing, maintaining and con- R processes. This article is an exception in that tinuously improving CRAN. it doesn’t deal with an R package but it certainly Kurt Hornik and Fritz Leisch created CRAN and should be of interest to anyone using R for running continue to do the lion’s share of the work in main- simulations or for generating MCMC samples, etc. taining and improving it. They were also the found- and for anyone who works with others who have ing editors of R News. Part of their vision for R News such applications. I can tell you from experience that was to provide a forum in which to highlight some the combination of condor and R, as described by of the packages available on CRAN, to explain their Xie, can help enormously in maintaining civility in purpose and to provide an illustration of their use. a large and active group of statistics researchers who This issue of R News brings that vision to fruition share access to computers. in that almost all our articles are written about CRAN Continuing in the general area of distributed packages by the authors of those packages. statistical computing, L’Ecuyer and Leydold de- We begin with an article by Raftery, Painter and scribe their rstream package for maintaining mul- Volinsky about the BMA package that uses Bayesian tiple reproducible streams of pseudo-random num- Model Averaging to deal with uncertainty in model bers, possibly across simulations that are being run selection. in parallel. Next Pebesma and Bivand discuss the sp pack- This is followed by Benner’s description of his age in the context of providing the underlying classes mfp package for multivariable fractional polynomi- and methods for working with spatial data in R. als and Sailer’s description of his crossdes package Contents of this issue: rstream: Streams of Random Numbers for Stochastic Simulation . 16 mfp . 20 Editorial . 1 crossdes . 24 BMA: An R package for Bayesian Model Aver- R Help Desk . 27 aging . 2 Changes in R . 29 Classes and methods for spatial data in R ... 9 Changes on CRAN . 35 Running Long R Jobs with Condor DAG . 13 Forthcoming Events: useR! 2006 . 43 Vol. 5/2, November 2005 2 for design and randomization in crossover studies. board and I would like to take this opportunity to ex- We round out the issue with regular features that tend my heartfelt thanks to Paul Murrell and Torsten include the R Help Desk, in which Duncan Mur- Hothorn, my associate editors. They have again doch joins Uwe Ligges to explain the arcane art of stepped up and shouldered the majority of the work building R packages under Windows, descriptions of of preparing this issue while I was distracted by changes in the 2.2.0 release of R and recent changes other concerns. Their attitude exemplifies the spirit on CRAN and an announcement of an upcoming of volunteerism and helpfulness that makes it so re- event, useR!2006. Be sure to read all the way through warding (and so much fun) to work on the R Project. to the useR! announcement. If this meeting is any- thing like the previous useR! meetings – and it will Douglas Bates be – it will be great! Please do consider attending. University of Wisconsin – Madison, U.S.A. This is my last issue on the R News editorial [email protected] BMA: An R package for Bayesian Model Averaging by Adrian E. Raftery, Ian S. Painter and Christopher T. does BMA for linear regression using Markov chain Volinsky Monte Carlo model composition (MC3), and allows one to specify a prior distribution and to make in- Bayesian model averaging (BMA) is a way of taking ference about the variables to be included and about account of uncertainty about model form or assump- possible outliers at the same time. tions and propagating it through to inferences about an unknown quantity of interest such as a population parameter, a future observation, or the future payoff Basic ideas or cost of a course of action. The BMA posterior dis- tribution of the quantity of interest is a weighted av- Suppose we want to make inference about an un- erage of its posterior distributions under each of the known quantity of interest ∆, and we have data D. models considered, where a model’s weight is equal We are considering several possible statistical mod- to the posterior probability that it is correct, given els for doing this, M1,..., MK. The number of mod- that one of the models considered is correct. els could be quite large. For example, if we consider Model uncertainty can be large when observa- only regression models but are unsure about which tional data are modeled using regression, or its ex- of p possible predictors to include, there could be as tensions such as generalized linear models or sur- many as 2p models considered. In sociology and epi- vival (or event history) analysis. There are often demiology, for example, values of p on the order of many modeling choices that are secondary to the 30 are not uncommon, and this could correspond to main questions of interest but can still have an im- around 230, or one billion models. portant effect on conclusions. These can include Bayesian statistics expresses all uncertainties in which potential confounding variables to control for, terms of probability, and make all inferences by ap- how to transform or recode variables whose effects plying the basic rules of probability calculus. BMA is may be nonlinear, and which data points to identify no more than basic Bayesian statistics in the presence as outliers and exclude. Each combination of choices of model uncertainty. By a simple application of the represents a statistical model, and the number of pos- law of total probability, the BMA posterior distribu- sible models can be enormous. tion of ∆ is The R package BMA provides ways of carrying K out BMA for linear regression, generalized linear p(∆|D) = ∑ p(∆|D, Mk)p(Mk|D), (1) models, and survival or event history analysis us- k=1 ing Cox proportional hazards models. The func- tions bicreg, bic.glm and bic.surv, account for where p(∆|D, Mk) is the posterior distribution of ∆ uncertainty about the variables to be included in given the model Mk, and p(Mk|D) is the posterior the model, using the simple BIC (Bayesian Informa- probability that Mk is the correct model, given that tion Criterion) approximation to the posterior model one of the models considered is correct. Thus the probabilities. They do an exhaustive search over BMA posterior distribution of ∆ is a weighted aver- the model space using the fast leaps and bounds al- age of the posterior distributions of ∆ under each of gorithm. The function glib allows one to specify the models, weighted by their posterior model prob- one’s own prior distribution. The function MC3.REG abilities. In (1), p(∆|D) and p(∆|D, Mk) can be prob- R News ISSN 1609-3631 Vol. 5/2, November 2005 3 ability density functions, probability mass functions, Computation or cumulative distribution functions. The posterior model probability of Mk is given by Implementing BMA involves two hard computa- tional problems: computing the integrated likeli- p(D|M )p(M ) hoods for all the models (3), and averaging over all ( | ) = k k p Mk D K . (2) the models, whose number can be huge, as in (1) ∑ p(D|M`)p(M`) `=1 and (6). In the functions bicreg (for variable selec- tion in linear regression), bic.glm (for variable selec- In equation (2), p(D|Mk) is the integrated likelihood of tion in generalized linear models), and bic.surv (for model Mk, obtained by integrating (not maximizing) over the unknown parameters: variable selection in survival or event history analy- sis), the integrated likelihood is approximated by the Z BIC approximation (4). The sum over all the mod- p(D|Mk) = p(D|θk, Mk)p(θk|Mk)dθk (3) els is approximated by finding the best models us- Z ing the fast leaps and bounds algorithm. The leaps = ( likelihood × prior )dθk, and bounds algorithm was introduced for all sub- sets regression by Furnival and Wilson (1974), and where θk is the parameter of model Mk and extended to BMA for linear regression and general- p(D|θk, Mk) is the likelihood of θk under model Mk. ized linear models by Raftery (1995), and to BMA for The prior model probabilities are often taken to be survival analysis by Volinsky et al.