The Practical Implementation of Bayesian Model Selection
IMS Lecture Notes - Monograph Series (2001), Volume 38: Model Selection

Hugh Chipman, Edward I. George and Robert E. McCulloch
The University of Waterloo, The University of Pennsylvania and The University of Chicago

Abstract

In principle, the Bayesian approach to model selection is straightforward. Prior probability distributions are used to describe the uncertainty surrounding all unknowns. After observing the data, the posterior distribution provides a coherent post-data summary of the remaining uncertainty which is relevant for model selection. However, the practical implementation of this approach often requires carefully tailored priors and novel posterior calculation methods. In this article, we illustrate some of the fundamental practical issues that arise for two different model selection problems: the variable selection problem for the linear model and the CART model selection problem.

Hugh Chipman is Associate Professor of Statistics, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada; email: [email protected]. Edward I. George is Professor of Statistics, Department of Statistics, The Wharton School of the University of Pennsylvania, 3620 Locust Walk, Philadelphia, PA 19104-6302, U.S.A.; email: [email protected]. Robert E. McCulloch is Professor of Statistics, Graduate School of Business, University of Chicago, 1101 East 58th Street, Chicago, IL, U.S.A.; email: [email protected]. This work was supported by NSF grant DMS-9803756 and Texas ARP grant 003658.690.

Contents

1 Introduction
2 The General Bayesian Approach
  2.1 A Probabilistic Setup for Model Uncertainty
  2.2 General Considerations for Prior Selection
  2.3 Extracting Information from the Posterior
3 Bayesian Variable Selection for the Linear Model
  3.1 Model Space Priors for Variable Selection
  3.2 Parameter Priors for Selection of Nonzero $\beta_i$
  3.3 Calibration and Empirical Bayes Variable Selection
  3.4 Parameter Priors for Selection Based on Practical Significance
  3.5 Posterior Calculation and Exploration for Variable Selection
    3.5.1 Closed Form Expressions for $p(Y \mid \gamma)$
    3.5.2 MCMC Methods for Variable Selection
    3.5.3 Gibbs Sampling Algorithms
    3.5.4 Metropolis-Hastings Algorithms
    3.5.5 Extracting Information from the Output
4 Bayesian CART Model Selection
  4.1 Prior Formulations for Bayesian CART
    4.1.1 Tree Prior Specification
    4.1.2 Parameter Prior Specifications
  4.2 Stochastic Search of the CART Model Posterior
    4.2.1 Metropolis-Hastings Search Algorithms
    4.2.2 Running the MH Algorithm for Stochastic Search
    4.2.3 Selecting the "Best" Trees
5 Much More to Come

1 Introduction

The Bayesian approach to statistical problems is fundamentally probabilistic. A joint probability distribution is used to describe the relationships between all the unknowns and the data. Inference is then based on the conditional probability distribution of the unknowns given the observed data, the posterior distribution. Beyond the specification of the joint distribution, the Bayesian approach is automatic. Exploiting the internal consistency of the probability framework, the posterior distribution extracts the relevant information in the data and provides a complete and coherent summary of post-data uncertainty.
Using the posterior to solve specific inference and decision problems is then straightforward, at least in principle.

In this article, we describe applications of this Bayesian approach for model uncertainty problems where a large number of different models are under consideration for the data. The joint distribution is obtained by introducing prior distributions on all the unknowns, here the parameters of each model and the models themselves, and then combining them with the distributions for the data. Conditioning on the data then induces a posterior distribution of model uncertainty that can be used for model selection and other inference and decision problems. This is the essential idea, and it can be very powerful. Especially appealing is its broad generality, as it is based only on probabilistic considerations. However, two major challenges confront its practical implementation: the specification of the prior distributions and the calculation of the posterior. This will be our main focus.

The statistical properties of the Bayesian approach rest squarely on the specification of the prior distributions on the unknowns. But where do these prior distributions come from and what do they mean? One extreme answer to this question is the pure subjective Bayesian point of view, which characterizes the prior as a wholly subjective description of initial uncertainty, rendering the posterior a subjective post-data description of uncertainty. Although logically compelling, we find this characterization to be unrealistic in complicated model selection problems, where such information is typically unavailable or difficult to quantify precisely as a probability distribution. At the other extreme is the objective Bayesian point of view, which seeks to find semi-automatic prior formulations or approximations when subjective information is unavailable. Such priors can serve as default inputs, making them attractive for repeated use by non-experts.

Prior specification strategies for recent Bayesian model selection implementations, including our own, have tended to fall somewhere between these two extremes. Typically, specific parametric families of proper priors are considered, thereby reducing the specification problem to that of selecting appropriate hyperparameter values. To avoid the need for subjective inputs, automatic default hyperparameter choices are often recommended. For this purpose, empirical Bayes considerations, either formal or informal, can be helpful, especially when informative choices are needed. However, subjective considerations can also be helpful, at least for roughly gauging prior location and scale and for putting small probability on implausible values. Of course, when substantial prior information is available, the Bayesian model selection implementations provide a natural environment for introducing realistic and important views.

By abandoning the pure subjective point of view, the evaluation of such Bayesian methods must ultimately involve frequentist considerations. Typically, such evaluations have taken the form of average performance over repeated simulations from hypothetical models or of cross validations on real data. Although such evaluations are necessarily limited in scope, the Bayesian procedures have consistently performed well compared to non-Bayesian alternatives.
Although more work is clearly needed on this crucial aspect, there is cause for optimism: by the complete class theorems of decision theory, we need look no further than Bayes and generalized Bayes procedures for good frequentist performance.

The second major challenge confronting the practical application of Bayesian model selection is posterior calculation, or perhaps more accurately, posterior exploration. Recent advances in computing technology, coupled with developments in numerical and Monte Carlo methods, most notably Markov chain Monte Carlo (MCMC), have opened up new and promising directions for addressing this challenge. The basic idea behind MCMC here is the construction of a sampler which simulates a Markov chain that converges to the posterior distribution. Although this provides a route to calculation of the full posterior, such chains are typically run for a relatively short time and used to search for high posterior models or to estimate posterior characteristics. However, constructing effective samplers and using such methods can be a delicate matter, involving problem-specific considerations such as model structure and the prior formulations. This very active area of research continues to hold promise for future developments.

In this introduction, we have described our overall point of view to provide context for the implementations we are about to describe. In Section 2, we describe the general Bayesian approach in more detail. In Sections 3 and 4, we illustrate the practical implementation of these general ideas for Bayesian variable selection for the linear model and for Bayesian CART model selection, respectively. In Section 5, we conclude with a brief discussion of related recent implementations for Bayesian model selection.

2 The General Bayesian Approach

2.1 A Probabilistic Setup for Model Uncertainty

Suppose a set of $K$ models $\mathcal{M} = \{M_1, \ldots, M_K\}$ is under consideration for data $Y$, and that under $M_k$, $Y$ has density $p(Y \mid \theta_k, M_k)$, where $\theta_k$ is a vector of unknown parameters that indexes the members of $M_k$. (Although we refer to $M_k$ as a model, it is more precisely a model class.) The Bayesian approach proceeds by assigning a prior probability distribution $p(\theta_k \mid M_k)$ to the parameters of each model, and a prior probability $p(M_k)$ to each model. Intuitively, this complete specification can be understood as a three-stage hierarchical mixture model for generating the data $Y$: first, the model $M_k$ is generated from $p(M_1), \ldots, p(M_K)$; second, the parameter vector $\theta_k$ is generated from $p(\theta_k \mid M_k)$; and third, the data $Y$ is generated from $p(Y \mid \theta_k, M_k)$ (a minimal simulation of this hierarchy is sketched below). Letting $Y_f$ be future observations of the same
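To make the three-stage hierarchy concrete, the following is a minimal sketch, not taken from the paper, assuming two hypothetical candidate models: under $M_1$ the observations are standard normal with no free parameters, while $M_2$ adds an unknown mean $\theta$ with a $N(0, \tau^2)$ prior. The model pair, the equal prior weights, and the value of tau2 are illustrative assumptions; under these conjugate choices $\theta$ integrates out in closed form, so the posterior model probabilities $p(M_k \mid Y)$ follow directly from Bayes theorem.

# A minimal sketch of the three-stage hierarchical mixture model.
# The two models, prior weights, and tau2 are illustrative assumptions.
#   M1: Y_i ~ N(0, 1)                       (no free parameters)
#   M2: Y_i ~ N(theta, 1), theta ~ N(0, tau2)
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
n, tau2 = 20, 4.0                  # sample size and prior variance (assumed)
prior = np.array([0.5, 0.5])       # p(M1), p(M2)

# Stage 1: generate the model index from p(M1), p(M2).
k = rng.choice(2, p=prior)
# Stage 2: generate the parameter vector from p(theta | M_k).
theta = rng.normal(0.0, np.sqrt(tau2)) if k == 1 else 0.0
# Stage 3: generate the data from p(Y | theta, M_k).
Y = rng.normal(theta, 1.0, size=n)

# Marginal likelihoods p(Y | M_k); integrating theta out of M2
# gives Y ~ N(0, I + tau2 * 11'), a multivariate normal.
logm1 = multivariate_normal.logpdf(Y, np.zeros(n), np.eye(n))
logm2 = multivariate_normal.logpdf(
    Y, np.zeros(n), np.eye(n) + tau2 * np.ones((n, n)))

# Posterior model probabilities p(M_k | Y) via Bayes theorem,
# normalized stably on the log scale.
logpost = np.log(prior) + np.array([logm1, logm2])
post = np.exp(logpost - logpost.max())
post /= post.sum()
print(f"generated from M{k + 1}; p(M1|Y)={post[0]:.3f}, p(M2|Y)={post[1]:.3f}")

Because both marginal likelihoods are available in closed form here, no MCMC is needed; the later sections of the paper take up the harder settings (variable selection, CART) where the model space is too large to enumerate and stochastic search over the posterior becomes essential.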