BAYESIAN VARIABLE SELECTION USING LASSO

by

YUCHEN HAN

Submitted in partial fulfillment of the requirements

for the degree of Master of Science

Department of Mathematics, Applied Mathematics and Statistics

CASE WESTERN RESERVE UNIVERSITY

May, 2017

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of Yuchen Han

candidate for the degree of Master of Science*.

Committee Chair Dr. Anirban Mondal

Committee Member Dr. Wojbor Woyczynski

Committee Member Dr. Jenny Brynjarsdottir

Date of Defense

March 31, 2017

*We also certify that written approval has been obtained

for any proprietary material contained therein. Contents

List of Tables

List of Figures

Abstract

1 Introduction

2 Existing Bayesian variable selection methods
2.1 Stochastic search variable selection
2.2 Kuo and Mallick
2.3 The Bayesian Lasso
2.4 Gibbs sampler

3 Bayesian variable selection using Lasso with spike and slab prior
3.1 LSS: Laplace with spike and slab prior
3.1.1 Hierarchical model
3.1.2 MCMC exploration of the posterior
3.2 MLSS: Mixture Laplace with spike and slab prior
3.2.1 Hierarchical model
3.2.2 MCMC exploration of the posterior

4 Comparison of different Bayesian variable selection methods
4.1 Simulated examples
4.1.1 Datasets
4.1.2 Prior distributions
4.1.3 Comparisons of different methods
4.2 Real data example

5 Conclusions and future work

Appendix

Bibliography

List of Tables

4.1 Simulated datasets details
4.2 Parameters of prior distributions for different variable selection methods for five different datasets
4.3 Proportions of high frequency models and mean squared prediction errors (MSPE)
4.4 The batch standard error of the highest posterior model probability for Dataset 3
4.5 Model variables under different criteria
4.6 Proportions of high frequency models for different variable selection methods (K & M: Kuo and Mallick, BL: Bayesian Lasso)

List of Figures

4.1 Scatter plot with the fitted posterior median curve for dataset 1. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.
4.2 Ergodic batch posterior probabilities for Dataset 3
4.3 Ergodic posterior probabilities for Dataset 3
4.4 Scatter plots
4.5 Residual plot with fitted values
4.6 Scatter plot with the fitted posterior median curve for air pollution data. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.

Bayesian Variable Selection Using Lasso

Abstract

by

YUCHEN HAN

This thesis proposes to combine the Kuo and Mallick (1998) approach and the Bayesian Lasso approach (Park and Casella, 2008) by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. Gibbs sampling is used to sample from the joint posterior distribution. We compare the two new methods to existing Bayesian variable selection methods, namely those of Kuo and Mallick, George and McCulloch, and Park and Casella, and provide an overall qualitative assessment of the efficiency of mixing and separation. We also use an air pollution dataset to test the proposed methodology, with the goal of identifying the main factors controlling the pollutant concentration.

1. Introduction

The selection of variables in regression models is one of the most common problems in statistics. Several methods of variable selection have been extensively studied in the classical statistics literature, such as stepwise selection, Akaike Information Criterion (AIC) based methods, Bayes Information Criterion (BIC) based methods, adjusted R^2 based methods, the Predicted Residual Sum of Squares (PRESS) statistic, and Mallows Cp, among others.

One Bayesian method for variable selection is the indicator model selection approach, where each regression coefficient has a spike and slab prior constructed using an auxiliary indicator variable. Kuo and Mallick (1998) proposed a model where the auxiliary indicator variable and the regression coefficients are assumed to be independent. An alternative model formulation called Gibbs variable selection (GVS) was suggested by Dellaportas et al. (1997), extending a general idea of Carlin and Chib (1995); here the prior distributions of the indicator and regression coefficients are assumed to be dependent on each other. Similarly, stochastic search variable selection (SSVS) was introduced by George and McCulloch (1993) and extended to the multivariate case by Brown et al. (1998), where the spike in the conditional prior for the regression coefficients is a narrow distribution concentrated around zero. Meuwissen and Goddard (2004) introduced (in a multivariate context) a random effects variant of SSVS.

A different Bayesian approach to inducing sparseness is not to use indicators in


the model, but instead to specify a prior directly on the regression coefficients that approximates the "spike and slab" shape. In this context the equivalent classical method is Lasso regression, introduced by Tibshirani (1996). The Lasso is a form of penalized least squares that minimizes the residual sum of squares while controlling the L1 norm of the regression coefficients. The L1 penalty shrinks the estimated coefficients toward 0. The shrinkage of the vector of regression coefficients toward zero, with the possibility of setting some coefficients identically equal to zero, makes the Lasso a good method for automatic variable selection simultaneously with the estimation of the regression coefficients. The Lasso has a Bayesian interpretation: the Lasso estimate can be viewed as the mode of the posterior distribution of the regression coefficients when independent double-exponential prior distributions are placed on the p regression coefficients and the likelihood component is taken to be the normal model. Fernandez and Steel (2000) considered the Laplace prior as a special case in a general Bayesian hierarchical regression model, but the connection to the Lasso procedure was not made until 2008, when Park and Casella (2008) proposed Bayesian Lasso regression for the first time. They extended the Bayesian Lasso regression model to account for uncertainty in the hyperparameters by placing prior distributions on them and obtained point estimates of the regression coefficients using the median of the posterior distribution, but they did not address variable selection or prediction of future observations.

In this thesis, we develop two variable selection methods in the linear regression model set-up, combining the spike and slab prior approach (George and McCulloch (1993) and Kuo and Mallick (1998)) and the Bayesian Lasso approach by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. The Laplace prior can better accommodate large regression coefficients because of its heavier tails; it also introduces different variance parameters for different regression coefficients. The first method, Laplace with spike and slab prior (LSS), uses a Laplace distribution for the conditional prior distribution of the regression coefficients given the indicator variables. This method is similar to Kuo and Mallick (1998), and independent priors are assumed for the indicator variables and the regression coefficients. Here, instead of a Normal prior for the regression coefficients, we use a Laplace prior whose variance depends on the error variance. We expect our model to perform better when the regression coefficients are of different scales and the number of regressors is very high compared to the number of observations. In the Markov chain, when the indicator variable is 0, the updated value of the regression coefficient is sampled from its full conditional distribution, which is its prior distribution. Because of the independence assumptions, just as in Kuo and Mallick (1998), if the prior on the regression coefficients is diffuse, some mixing problems may occur when we try to sample from the posterior using this method. In the second method, Mixture Laplace with spike and slab prior (MLSS), we circumvent the problem of sampling the regression coefficients from too vague a prior by considering a dependent mixture prior on the indicator variables and regression coefficients. When the indicator variable is 0 we assume a Normal prior, and when the indicator variable is 1 we assume a Laplace prior for the corresponding regression coefficient. Using this method we increase the probability that the chain will move to 1 when it reaches 0. This method needs some tuning, as the mean and variance of the Normal distribution need to be chosen so that good values of the regression coefficients are proposed when the indicator variable is 0. Gibbs sampling is used to sample from the joint posterior distribution, and the posterior predictive distribution is used to predict future observations. We also propose to use the Bayesian Lasso directly for variable selection.

First we consider simulated data sets and compare our methods to the existing Bayesian variable selection methods of Kuo and Mallick (1998), George and McCulloch (1993), and Park and Casella (2008). An overall qualitative assessment of the different aspects (computational speed, efficiency of mixing and separation) of the performance of the methods is provided. We also apply our proposed methodology to air pollution data, where the goal is to identify the main factors that control the pollutant concentration using a regression model and to use the resulting model for prediction.

2. Existing Bayesian variable selection methods

We begin with the linear model describing the relationship between the observed response variable and the set of all potential predictors,

Y | θ, σ^2 ∼ N_n(Xθ, σ^2 I_n),  (2.1)

where the response vector Y is n × 1, the matrix of potential predictors X = [X_1, . . . , X_p] is n × p, and the coefficients θ = (θ_1, . . . , θ_p)' and the variance σ^2 are unknown.

Variable selection procedures require a comparison of all 2^p possible subsets. In order to mitigate this computational issue, an indicator variable γ = (γ_1, . . . , γ_p)' can be introduced to identify the promising subset, where γ_i = 1 if θ_i is large and γ_i = 0 if θ_i is small.

2.1 Stochastic search variable selection

The stochastic search variable selection (SSVS) method was developed by George and McCulloch (1993). Later, George and McCulloch (1996) extended SSVS to generalized linear models. SSVS has also been adapted to the multivariate case by Brown et al. (1998), where the spike in the conditional prior for the regression coefficients is a narrow distribution concentrated around zero.


Considering (2.1) (here θ = β) as part of a hierarchical model, we shall assume throughout that X_1, . . . , X_p contain no variable (including the intercept) that would be included in every possible model. SSVS introduces a mixture of two normal distributions with different variances as the prior on the coefficients β_i:

β_i | γ_i ∼ (1 − γ_i) N(0, τ_i^2) + γ_i N(0, c_i^2 τ_i^2), i = 1, 2, . . . , p,  (2.2)

P (γi = 1) = 1 − P (γi = 0) = pi. (2.3)

The hyperparameter τ_i is set to be small and c_i is set to be large, so that β_i | γ_i = 0 ∼ N(0, τ_i^2) is concentrated around 0, whereas β_i | γ_i = 1 ∼ N(0, c_i^2 τ_i^2) is dispersed. Then a non-zero estimate of β_i would probably be selected in the final model. The hierarchical model is completed by assuming an inverse gamma prior for σ^2,

σ^2 | γ ∼ IG(ν_γ/2, ν_γλ_γ/2).  (2.4)

The full conditional posterior distributions of β_i, σ^2 and γ_i are given by

f(β_i | Y, σ^2, γ, β_(i)) ∝ f(Y | σ^2, γ, β) f(β_i | γ_i),  (2.5)

where β(i) denotes all terms of β except βi,

f(σ^2 | Y, β, γ) ∝ f(Y | σ^2, γ, β) f(σ^2 | γ),  (2.6)

P(γ_i = 1 | Y, β, σ^2, γ_(i)) = P(γ_i = 1 | β, σ^2, γ_(i)) = a/(a + b),  (2.7)


where γ(i) denotes all terms of γ except γi,

a = f(β | γ_(i), γ_i = 1) f(σ^2 | γ_(i), γ_i = 1) f(γ_(i), γ_i = 1) and
b = f(β | γ_(i), γ_i = 0) f(σ^2 | γ_(i), γ_i = 0) f(γ_(i), γ_i = 0).

SSVS then implements the Gibbs sampler to generate β^1, σ^1, γ^1, β^2, σ^2, γ^2, etc., with an initial choice of β^0, σ^0, γ^0, where β^0 and σ^0 can be set to the least squares estimates and γ^0 = (1, . . . , 1)', using the following conditional densities.

β^j | σ^{j-1}, γ^{j-1}, Y ∼ N_p(A_{γ^{j-1}} (σ^{j-1})^{-2} X'X β̂_{LS}, A_{γ^{j-1}}),
    where A_{γ^{j-1}} = [(σ^{j-1})^{-2} X'X + D_{γ^{j-1}}^{-1} R^{-1} D_{γ^{j-1}}^{-1}]^{-1} and D_γ^{-1} = diag[(a_1τ_1)^{-1}, . . . , (a_pτ_p)^{-1}],

(σ^2)^j | β^j, γ^{j-1}, Y ∼ IG((ν_γ + n)/2, ((Y − Xβ^j)'(Y − Xβ^j) + ν_γλ_γ)/2),

γ_i^j | Y, β^j, σ^j, γ_(i)^j ∼ Bernoulli(a/(a + b)),  (2.8)

where γ_(i)^j = (γ_1^j, . . . , γ_{i-1}^j, γ_{i+1}^{j-1}, . . . , γ_p^{j-1}),

a = f(β^j | γ_(i)^j, γ_i = 1) f((σ^2)^j | γ_(i)^j, γ_i = 1) f(γ_(i)^j, γ_i = 1) and
b = f(β^j | γ_(i)^j, γ_i = 0) f((σ^2)^j | γ_(i)^j, γ_i = 0) f(γ_(i)^j, γ_i = 0).
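A minimal sketch of the γ update in (2.7)-(2.8): each γ_i is drawn from its Bernoulli full conditional, assuming R = I (so the prior on β factorizes across coordinates) and a σ^2 prior that does not depend on γ (so the f(σ^2 | γ) terms cancel in a and b). The function and variable names (update_gamma_ssvs, p_incl, rng) are illustrative and not taken from the thesis.

    import numpy as np
    from scipy import stats

    def update_gamma_ssvs(beta, gamma, tau, c, p_incl, rng):
        # One sweep over i = 1,...,p of the SSVS indicator updates.
        # beta, tau, c, p_incl are length-p arrays; gamma is a 0/1 array.
        p = len(beta)
        for i in range(p):
            sd_spike = tau[i]            # prior sd of beta_i when gamma_i = 0
            sd_slab = c[i] * tau[i]      # prior sd of beta_i when gamma_i = 1
            a = stats.norm.pdf(beta[i], 0.0, sd_slab) * p_incl[i]
            b = stats.norm.pdf(beta[i], 0.0, sd_spike) * (1.0 - p_incl[i])
            gamma[i] = rng.binomial(1, a / (a + b))
        return gamma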

2.2 Kuo and Mallick

In SSVS, the coefficients cannot be set to zero exactly, and the selected variables are the ones with regression coefficients significantly different from zero. Thus, tuning of the hyperparameters is required in order to decide how small a coefficient must be to be essentially treated as zero. In contrast, Kuo and Mallick (1998) introduced unconditional priors for variable selection without tuning, which allow coefficients to be exactly zero with positive probability. The indicator variables and the regression coefficients are assumed to be independent, and it is only required to specify their priors. Uimari and Hoeschele (1997) used this approach for mapping linked quantitative trait loci (QTL) in genetics.

Considering (2.1), here θ = (β_1γ_1, . . . , β_pγ_p)'. The priors can be taken as β ∼ N_p(β_0, D_0), γ_i ∼ B(1, p_i) independently for i = 1, . . . , p, and σ^2 ∼ IG(α/2, η/2). Let X^* = [γ_1x_1, . . . , γ_px_p]. Then the full conditional posterior distributions of β, σ^2 and γ_i are given by

β | Y, σ^2, γ ∼ N_p(A^{-1}(D_0^{-1}β_0 + σ^{-2}X^{*'}Y), A^{-1}), where A = D_0^{-1} + σ^{-2}X^{*'}X^*,  (2.9)

σ^2 | Y, β, γ ∼ IG((α + n)/2, ((Y − Xθ)'(Y − Xθ) + η)/2),  (2.10)

P(γ_i = 1 | Y, β, σ^2, γ_(i)) = a/(a + b),  (2.11)

where γ_(i) denotes all terms of γ except γ_i, a = f(Y | γ_(i), γ_i = 1, β, σ^2) f(γ_(i), γ_i = 1) and b = f(Y | γ_(i), γ_i = 0, β, σ^2) f(γ_(i), γ_i = 0). These full conditional posterior distributions can be used in a Gibbs sampler to generate β^1, σ^1, γ^1, β^2, σ^2, γ^2, etc. Then P(γ | Y) is tabulated from the frequencies of γ. In fact, the prior can also be thought of as a pseudo-prior, which has no effect on the posterior, as discussed in Carlin and Chib (1995), when the predictors are not included in the model. Therefore, the full conditional posterior distribution of β_i can be given by

f(β_i | Y, σ^2, γ, β_(i)) ∝ f(Y | γ, β, σ^2) f(β_i | β_(i))   if γ_i = 1,
f(β_i | Y, σ^2, γ, β_(i)) ∝ f(β_i | β_(i))                    if γ_i = 0,  (2.12)

where β(i) denotes all terms of β except βi.
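A minimal sketch of the Kuo and Mallick γ update (2.11): a and b differ only through the residual sum of squares with variable i switched in or out, since the independent priors on β and σ^2 cancel in the ratio. Function and variable names are illustrative; the log scale is used only to avoid numerical underflow.

    import numpy as np

    def update_gamma_km(Y, X, beta, gamma, sigma2, p_incl, rng):
        # Sweep over i = 1,...,p; theta = beta * gamma is the current coefficient vector.
        n, p = X.shape
        for i in range(p):
            theta = beta * gamma
            theta_in, theta_out = theta.copy(), theta.copy()
            theta_in[i], theta_out[i] = beta[i], 0.0
            rss_in = np.sum((Y - X @ theta_in) ** 2)     # variable i included
            rss_out = np.sum((Y - X @ theta_out) ** 2)   # variable i excluded
            # q_i = a/(a+b), computed as 1/(1 + b/a) on the log scale
            log_b_over_a = (rss_in - rss_out) / (2.0 * sigma2) \
                + np.log(1.0 - p_incl[i]) - np.log(p_incl[i])
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_b_over_a)))
        return gamma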

2.3 The Bayesian Lasso

The least absolute shrinkage and selection operator (Lasso) was developed by Tibshirani (1996) in order to improve prediction accuracy and to determine a smaller promising subset of predictors. He suggested that the Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Park and Casella (2008) considered a fully Bayesian analysis using a conditional Laplace prior and extended the model to account for uncertainty in the hyperparameters by placing prior distributions on σ^2 and τ^2. Consider the conditional Laplace prior

π(β | σ^2) = ∏_{i=1}^{p} (λ/(2√(σ^2))) exp(−λ|β_i|/√(σ^2))  (2.13)

and the noninformative scale-invariant marginal prior π(σ^2) = 1/σ^2 on σ^2. Conditioning on σ^2 is important because it guarantees a unimodal full posterior. It is also noted that any inverse gamma prior for σ^2 maintains conjugacy. Exploiting the fact that the Laplace distribution can be represented as a scale mixture of normal densities (Andrews and Mallows, 1974),

(a/2) e^{-a|z|} = ∫_0^∞ (1/√(2πs)) e^{-z^2/(2s)} (a^2/2) e^{-a^2 s/2} ds, a > 0,  (2.14)


the hierarchical representation of the full model is given by

Y | µ, X, β, σ^2 ∼ N_n(µ1_n + Xβ, σ^2 I_n),

β | σ^2, τ_1^2, . . . , τ_p^2 ∼ N_p(0_p, σ^2 D_τ), D_τ = diag(τ_1^2, . . . , τ_p^2),  (2.15)

σ^2, τ_1^2, . . . , τ_p^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{p} (λ^2/2) exp(−λ^2 τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_p^2 > 0.
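The scale-mixture identity (2.14) underlying this hierarchy can be verified numerically; the sketch below compares the Laplace density on the left-hand side with the mixing integral on the right-hand side. The values of a and z are arbitrary test choices, not taken from the thesis.

    import numpy as np
    from scipy.integrate import quad

    a, z = 1.7, 0.9                                     # arbitrary test values
    lhs = 0.5 * a * np.exp(-a * abs(z))                 # Laplace density (a/2) exp(-a|z|)
    integrand = lambda s: (np.exp(-z**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
                           * 0.5 * a**2 * np.exp(-a**2 * s / 2.0))
    rhs, _ = quad(integrand, 0.0, np.inf)               # normal density mixed over s
    print(lhs, rhs)                                     # the two values agree to quadrature accuracy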

The parameter µ can be given an independent, flat prior. After integrating out τ_1^2, . . . , τ_p^2, the conditional prior on β has the desired conditional Laplace form (2.13). After integrating out µ, the full conditional posterior distributions of β and σ^2 are given by

β | σ^2, τ, Y ∼ N_p(A^{-1}X'Y, σ^2 A^{-1}), where A = X'X + D_τ^{-1},  (2.16)

σ^2 | β, τ, Y ∼ IG((p + n − 1)/2, β'D_τ^{-1}β/2 + (Y − Xβ)'(Y − Xβ)/2).  (2.17)

The full conditional posterior distribution of 1/τ_j^2 is inverse-Gaussian with parameters

µ' = √(λ^2σ^2/β_j^2), λ' = λ^2,

and its density is given by

f(x) = √(λ'/(2π)) x^{-3/2} exp(−λ'(x − µ')^2/(2µ'^2 x)), x > 0.  (2.18)

The Bayesian Lasso parameter λ can be chosen using an appropriate hyperprior. Consider a diffuse Gamma hyperprior on λ^2,

π(λ^2) = (δ^r/Γ(r)) (λ^2)^{r-1} e^{-δλ^2}, λ^2 > 0 (r > 0, δ > 0).  (2.19)


When this prior is used in the hierarchical model, the full conditional posterior distribution of λ^2 is Gamma(p + r, ∑_{i=1}^{p} τ_i^2/2 + δ).
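The τ_j^2 and λ^2 updates above can be coded directly, since NumPy's Wald generator is the inverse-Gaussian distribution with parameters (µ', λ'). The sketch below is only an illustration of those two full conditionals; the function name and arguments are assumptions, and it assumes no β_j is exactly zero.

    import numpy as np

    def update_tau2_lambda2(beta, sigma2, lam2, r, delta, rng):
        p = len(beta)
        mu_prime = np.sqrt(lam2 * sigma2 / beta**2)           # mu' for each 1/tau_j^2
        inv_tau2 = rng.wald(mu_prime, lam2)                   # inverse-Gaussian(mu', lambda' = lambda^2)
        tau2 = 1.0 / inv_tau2
        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau2)/2 + delta); NumPy uses scale = 1/rate
        lam2_new = rng.gamma(p + r, 1.0 / (np.sum(tau2) / 2.0 + delta))
        return tau2, lam2_new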

2.4 Gibbs sampler

The Gibbs sampler was first proposed by Geman and Geman (1984), in IEEE Transactions on Pattern Analysis and Machine Intelligence, for simulating posterior distributions used in image reconstruction, and it fundamentally changed Bayesian computing. It is a special case of the Metropolis-Hastings algorithm. The Gibbs sampler breaks the problem of sampling from a high-dimensional joint distribution into a series of samples from low-dimensional conditional distributions.

Suppose the parameter vector θ = (θ_1, . . . , θ_p) is to be sampled from a distribution f(θ) that is unknown or complicated, while the conditional distributions of θ_i | θ^(i) = (θ_1, . . . , θ_{i-1}, θ_{i+1}, . . . , θ_p) are known or easy to simulate from. Then the following algorithm can be used.

0◦ Set k = 0.

1◦ Choose θ^(k) ∈ S, where S is the support of f(θ).

2◦ Generate
θ_1^(k+1) from f(θ_1 | θ_2^(k), . . . , θ_p^(k)),
θ_2^(k+1) from f(θ_2 | θ_1^(k+1), θ_3^(k), . . . , θ_p^(k)),
. . .
θ_{p-1}^(k+1) from f(θ_{p-1} | θ_1^(k+1), . . . , θ_{p-2}^(k+1), θ_p^(k)),
θ_p^(k+1) from f(θ_p | θ_1^(k+1), . . . , θ_{p-1}^(k+1)).
This yields θ^(k+1).

3◦ If the sequence {θ^(k)} has converged, output θ^(k+1); otherwise, set k = k + 1 and go to step 2◦.

The resulting sequence {θ^(k)} is called the Gibbs sampler.
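A minimal illustration of the algorithm above, using a bivariate normal target with correlation ρ, where both full conditionals are univariate normal. The target, ρ, and the number of iterations are arbitrary choices for this sketch.

    import numpy as np

    rho, n_iter = 0.8, 5000
    rng = np.random.default_rng(0)
    theta = np.zeros(2)                          # steps 0-1: k = 0, starting point in the support
    samples = np.empty((n_iter, 2))
    for k in range(n_iter):                      # step 2: cycle through the full conditionals
        theta[0] = rng.normal(rho * theta[1], np.sqrt(1.0 - rho**2))
        theta[1] = rng.normal(rho * theta[0], np.sqrt(1.0 - rho**2))
        samples[k] = theta
    print(np.corrcoef(samples[1000:].T))         # step 3: after burn-in the sample correlation is near rho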

3. Bayesian variable selection using Lasso with spike and slab prior

Here we develop a combination of the Kuo and Mallick (1998) and Bayesian Lasso (Park and Casella, 2008) approaches by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. The Laplace prior can better accommodate large regression coefficients because of its heavier tails; it also introduces different variance parameters for different regression coefficients. This model performs better than the Kuo and Mallick method when the regression coefficients are on different scales and the variance of the data is high. Gibbs sampling will be used to sample from the joint posterior distribution, and the posterior predictive distribution will be used to predict future observations.

3.1 LSS: Laplace with spike and slab prior

3.1.1 Hierarchical model

For the linear regression model, we consider

Y | θ, σ^2 ∼ N_n(Xθ, σ^2 I_n),  (3.1)


where the response vector Y is n × 1, the matrix of potential predictors X = [X_1, . . . , X_p] is n × p, the coefficients are θ = (β_1γ_1, . . . , β_pγ_p)', the γ_i are indicator variables, and the variance σ^2 is unknown.

The Lasso regression shrinks the coefficients by imposing an L1 penalty. The Lasso coefficient estimates minimize the penalized residual sum of squares,

min_β (Y − Xβ)'(Y − Xβ) + λ ∑_{j=1}^{p} |β_j|,  (3.2)

where λ ≥ 0. The Lasso estimates can be viewed as the mode of the posterior distribution of the regression coefficients when independent Laplace prior distributions are placed on the p regression coefficients. The Laplace prior can better accommodate large regression coefficients because of its heavier tails. Therefore, we assume that each

coefficient βi is distributed as a Laplace distribution

π(β | σ^2) = ∏_{i=1}^{p} (λ/(2√(σ^2))) exp(−λ|β_i|/√(σ^2)).  (3.3)

Considering the Laplace distribution as a scale mixture of normals (2.14), we have a multivariate normal prior

β | σ^2, τ_1^2, . . . , τ_p^2 ∼ N_p(0_p, σ^2 D_τ), D_τ = diag(τ_1^2, . . . , τ_p^2),  (3.4)
σ^2, τ_1^2, . . . , τ_p^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{p} (λ^2/2) exp(−λ^2τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_p^2 > 0.

The prior for σ^2 is IG(a, b), which maintains conjugacy.

The Bernoulli model P (γi = 1) = 1 − P (γi = 0) = pi is the prior probability that

Xi is included in the model.


3.1.2 MCMC exploration of the posterior

The promising subsets of predictors can be identified from the γ's that have high posterior probabilities. A Gibbs sampler can be implemented to generate a sequence

γ^1, . . . , γ^j, . . . ,  (3.5)

which will converge in distribution to γ ∼ π(γ | Y). In this sequence, the γ with the highest probability will appear most frequently; hence the X_i corresponding to those γ are the selected variables. The initial values β^0 and σ^0 are set to the least squares estimates of the full model (3.1), and γ^0 can be initialized as (1, . . . , 1). Based on these initial values, we can generate a Gibbs sequence

β^1, σ^1, τ^1, γ^1, . . . , β^j, σ^j, τ^j, γ^j, . . . ,  (3.6)

which includes (3.5), using the following iterative simulations.

Let X^* = [γ_1x_1, . . . , γ_px_p]. Given the prior (3.3) and the normal likelihood from the data (3.1), we can show that the full conditional posterior distribution of β (see Appendix) is

β | σ^2, τ, γ, Y ∼ N_p(A^{-1}X^{*'}Y, σ^2 A^{-1}),  (3.7)

where A = X^{*'}X^* + D_τ^{-1}; β can be sampled from (3.7). The full conditional posterior distribution of σ^2 is

σ^2 | β, τ, γ, Y ∼ IG(a + (p + n)/2, b + β'D_τ^{-1}β/2 + (Y − Xθ)'(Y − Xθ)/2).  (3.8)


The full conditional posterior distribution of 1/τ_j^2 is inverse-Gaussian with parameters

µ' = √(λ^2σ^2/β_j^2), λ' = λ^2.

The posterior distribution of γ is

γ_i | β, σ^2, τ, γ_(i), Y ∼ Bernoulli(q_i),  (3.9)

with q_i = c/(c + d), where

c = exp(−(Y − Xθ_i^*)'(Y − Xθ_i^*)/(2σ^2)) p_i,
d = exp(−(Y − Xθ_i^**)'(Y − Xθ_i^**)/(2σ^2)) (1 − p_i).  (3.10)

Here θ_i^* is obtained from θ by replacing the i-th entry with β_i; similarly, θ_i^** is the column vector θ with the i-th entry replaced by 0. In each Gibbs iteration, λ needs to be updated from the previous iteration, namely

λ^k = √(2p / ∑_{i=1}^{p} E_{λ^{k-1}}[τ_i^2 | Y]),  (3.11)

where the conditional expectations are substituted by the averages from the Gibbs sample. The initial value can be set as

λ^0 = p √(σ̂_{LS}^2) / ∑_{j=1}^{p} |β̂_{LS,j}|,  (3.12)

where σ̂_{LS}^2 and β̂_{LS} are the least squares estimates. Alternatively, we can give λ^2 a gamma prior (2.19) and include it in the hierarchical


model. Then the full conditional distribution of λ^2 is given by

λ^2 | Y, β, σ^2, τ, γ ∼ Gamma(p + r, ∑_{i=1}^{p} τ_i^2/2 + δ).  (3.13)

The gamma prior on λ^2 should decay to 0 sufficiently fast as λ^2 → ∞, but should be relatively flat and put high probability near the maximum likelihood estimate.
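Putting the pieces of Section 3.1.2 together, one iteration of the LSS Gibbs sampler can be sketched as below, using the full conditionals (3.7)-(3.10) and the Gamma update (3.13) for λ^2 rather than the plug-in update (3.11). This is an illustrative sketch under those full conditionals, not the code used for the thesis results; names such as lss_gibbs_step and p_incl are assumptions.

    import numpy as np

    def lss_gibbs_step(Y, X, beta, gamma, tau2, sigma2, lam2, a, b, r, delta, p_incl, rng):
        n, p = X.shape
        Xstar = X * gamma                              # X* = [gamma_1 x_1, ..., gamma_p x_p]
        Dtau_inv = np.diag(1.0 / tau2)

        # beta | rest ~ N_p(A^{-1} X*'Y, sigma^2 A^{-1}), A = X*'X* + D_tau^{-1}   (3.7)
        A_inv = np.linalg.inv(Xstar.T @ Xstar + Dtau_inv)
        beta = rng.multivariate_normal(A_inv @ Xstar.T @ Y, sigma2 * A_inv)

        # sigma^2 | rest ~ IG(a + (p+n)/2, b + beta'D_tau^{-1}beta/2 + RSS/2)       (3.8)
        theta = beta * gamma
        rss = np.sum((Y - X @ theta) ** 2)
        rate = b + beta @ Dtau_inv @ beta / 2.0 + rss / 2.0
        sigma2 = 1.0 / rng.gamma(a + (p + n) / 2.0, 1.0 / rate)

        # 1/tau_j^2 | rest ~ inverse-Gaussian(sqrt(lambda^2 sigma^2 / beta_j^2), lambda^2)
        tau2 = 1.0 / rng.wald(np.sqrt(lam2 * sigma2 / beta**2), lam2)

        # gamma_i | rest ~ Bernoulli(c/(c+d)) with c, d as in (3.10), on the log scale
        for i in range(p):
            th_in, th_out = theta.copy(), theta.copy()
            th_in[i], th_out[i] = beta[i], 0.0
            log_c = -np.sum((Y - X @ th_in) ** 2) / (2.0 * sigma2) + np.log(p_incl[i])
            log_d = -np.sum((Y - X @ th_out) ** 2) / (2.0 * sigma2) + np.log(1.0 - p_incl[i])
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_d - log_c)))
            theta[i] = beta[i] * gamma[i]

        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau2)/2 + delta)                 (3.13)
        lam2 = rng.gamma(p + r, 1.0 / (np.sum(tau2) / 2.0 + delta))
        return beta, gamma, tau2, sigma2, lam2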

3.2 MLSS: Mixture Laplace with spike and slab prior

3.2.1 Hierarchical model

In the LSS method, mixing can be poor if the prior is too vague, and the sampler for γ will only rarely flip from 0 to 1. In order to overcome this problem, we consider a dependent mixture prior on the indicator variables and regression coefficients. Using this method we increase the probability that the chain will move to 1 when it reaches 0. We consider the same model (3.1) as in the Laplace prior method. For the prior on the coefficients, we assume that each β_i comes from a mixture of two different distributions, given by

β_i | σ^2, γ_i ∼ (1 − γ_i) N(µ_i, s_i) + γ_i Laplace(0, √(σ^2)/λ),  (3.14)

where µ_i can be set to log(p)/n and s_i can be set to the variance of the OLS estimator of β_i. In other words, we partition β into (β_γ, β_(γ)) corresponding to the β_i that are included or excluded in the model. Then the prior on β can be partitioned into π(β_γ | γ) = Laplace(0, √(σ^2)/λ) and π(β_(γ) | β_γ, γ) = N(µ_i, s_i). The latter part can be viewed as a pseudo-prior, which has no effect on the posterior. Exploiting (2.14), π(β_γ | γ)


can be written as a normal distribution,

β_γ | γ, σ^2, τ_1^2, . . . , τ_{pr}^2 ∼ N_{pr}(0_{pr}, σ^2 D_{τr}), D_{τr} = diag(τ_1^2, . . . , τ_{pr}^2), where pr is the number of non-zero γ_i,  (3.15)
σ^2, τ_1^2, . . . , τ_{pr}^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{pr} (λ^2/2) exp(−λ^2τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_{pr}^2 > 0.

Here, the prior for σ^2 is IG(a, b), and the indicator variable γ_i is distributed as Bernoulli(p_i),

π(γ) = ∏_{i=1}^{p} p_i^{γ_i} (1 − p_i)^{1−γ_i}.  (3.16)

3.2.2 MCMC exploration of the posterior

We can generate a Gibbs sample for

β1, σ1, τ 1, γ1,..., βj, σj, τ j, γj,... (3.17)

with the same choice of initial values as in the Laplace prior method, through the following full conditional posterior distributions. Let X^* = [γ_1x_1, . . . , γ_px_p]; then X_γ^* consists of the non-zero column vectors of X^*.

β_γ | β_(γ), σ^2, τ, γ, Y ∼ N_{pr}(A^{-1}X_γ^{*'}Y, σ^2 A^{-1}),  (3.18)
β_(γ) | β_γ, σ^2, τ, γ, Y ∼ N_{(p−pr)}(µ, S),

where A = X_γ^{*'}X_γ^* + D_{τr}^{-1}. The full conditional posterior distribution of σ^2 is

σ^2 | β, τ, γ, Y ∼ IG(a + (pr + n)/2, b + β_γ'D_{τr}^{-1}β_γ/2 + (Y − Xθ)'(Y − Xθ)/2).  (3.19)

When γ_i = 1, the full conditional posterior distribution of 1/τ_i^2 is inverse-Gaussian with parameters µ' = √(λ^2σ^2/β_i^2), λ' = λ^2.


The posterior distribution of γ is

γ_i | β, σ^2, τ, γ_(i), Y ∼ Bernoulli(q_i),  (3.20)

with q_i = c/(c + d), where

c = f(Y | γ_(i), γ_i = 1, β, σ^2) f(β | σ^2, γ_(i), γ_i = 1) f(γ_(i), γ_i = 1),  (3.21)
d = f(Y | γ_(i), γ_i = 0, β, σ^2) f(β | σ^2, γ_(i), γ_i = 0) f(γ_(i), γ_i = 0).

For λ, we can use the same approach as in the Laplace prior method.
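A sketch of the MLSS γ update (3.20)-(3.21): compared with the LSS update, the ratio now also contains the Laplace (slab) and Normal (pseudo-prior) densities of β_i, assuming that the prior terms for β_j, j ≠ i, cancel in the ratio. Function and variable names are illustrative, not from the thesis.

    import numpy as np
    from scipy import stats

    def update_gamma_mlss(Y, X, beta, gamma, sigma2, lam, mu, s, p_incl, rng):
        n, p = X.shape
        laplace_scale = np.sqrt(sigma2) / lam                  # Laplace(0, sqrt(sigma^2)/lambda)
        for i in range(p):
            theta = beta * gamma
            th_in, th_out = theta.copy(), theta.copy()
            th_in[i], th_out[i] = beta[i], 0.0
            log_c = (-np.sum((Y - X @ th_in) ** 2) / (2.0 * sigma2)
                     + stats.laplace.logpdf(beta[i], 0.0, laplace_scale)
                     + np.log(p_incl[i]))
            log_d = (-np.sum((Y - X @ th_out) ** 2) / (2.0 * sigma2)
                     + stats.norm.logpdf(beta[i], mu[i], np.sqrt(s[i]))   # pseudo-prior N(mu_i, s_i)
                     + np.log(1.0 - p_incl[i]))
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_d - log_c)))
        return gamma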

4. Comparison of different Bayesian variable selection methods

In this chapter, we illustrate the two variable selection methods we propose, LSS and MLSS, and compare their performance with that of SSVS, Kuo and Mallick, and the Bayesian Lasso on simulated examples. The proposed methods are then applied to air pollution data.

4.1 Simulated examples

4.1.1 Datasets

To evaluate the performance of the methods, we use a series of simulated datasets.

Each dataset has p predictors of length n. The predictors X_1, . . . , X_p are independently and identically distributed as N_n(0_n, I_n). The response variable is generated from the linear regression model Y = Xα + ε, where α is the vector of desired coefficients and ε ∼ N_n(0_n, σ^2 I_n). For dataset 5, the predictors are obtained as X_i = X_i + Z, where Z ∼ N_n(0_n, I_n). The least squares estimate of the standard error of β_i is denoted by σ̂_{β_i}. The data generation details are given in Table 4.1. For the SSVS method, we assume c_i = 10 and τ_i = 0.33, and for the Kuo and Mallick method we set β_0 = 0_p and D_0 = 16I_p.


Dataset   n    p    Generated model                                                              σ
1         60   5    X4 + 1.2X5                                                                   2.5
2         60   5    4X4 + 12X5                                                                   2.5
3         60   5    9X4 + 12X5                                                                   5
4         60   10   3X1 + 3X2 + 3X3 + 8X4 + 8X5                                                  2.5
5         60   20   2 ∑_{i=1}^{3} Xi + 4 ∑_{i=4}^{5} Xi + 8 ∑_{i=11}^{12} Xi + 14 ∑_{i=13}^{15} Xi   2.5

Table 4.1: Simulated datasets details
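For concreteness, a short sketch of how Dataset 4 in Table 4.1 can be generated (n = 60, p = 10, Y = 3X1 + 3X2 + 3X3 + 8X4 + 8X5 + ε, σ = 2.5). The random seed is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 60, 10, 2.5
    X = rng.standard_normal((n, p))                    # X_1,...,X_p i.i.d. N_n(0_n, I_n)
    alpha = np.array([3, 3, 3, 8, 8, 0, 0, 0, 0, 0], dtype=float)
    Y = X @ alpha + sigma * rng.standard_normal(n)     # epsilon ~ N_n(0_n, sigma^2 I_n)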

4.1.2 Prior distributions

For each dataset, different priors are used, and the priors are chosen to be representative of prior knowledge. Here we try to make the expectation of the prior distribution close to the true coefficients, but set the variance a little larger. We set π(γ) = (1/2)^p. Table 4.2 shows all the priors for the different methods.

4.1.3 Comparisons of different methods

The efficiency and mixing properties of the methods were investigated by carrying out diagnostics on the MCMC samples. For each dataset, a sample of 10,000 observations of the Gibbs sequence is simulated, and the output summaries are based on the final 8,000 iterations. How efficiently the variables are separated into those to be included and those to be excluded from the model is the criterion we use to evaluate the methods. Table 4.3 shows the highest-frequency models that appeared in each example. In the Bayesian Lasso method, we assume β_i = 0 when |β_i| < 0.001; we can then perform variable selection and find the high-frequency model, as in the other methods. In Figure 4.1, black points represent the original dataset, red points represent the true data points used to make predictions, blue crosses are the posterior predictive values, and the brown line is the fitted posterior median curve. As expected, all methods do well: the highest-frequency model is the model used to generate the data. In particular, the LSS method seems to perform best in identifying the correct model.


Method            Parameter   Dataset 1   Dataset 2   Dataset 3   Dataset 4   Dataset 5
SSVS              ν_γ         1           1           1           1           1
                  λ_γ         6           6           25          6           6
Kuo and Mallick   α           1           1           1           1           1
                  η           6           6           25          6           6
Bayesian Lasso    r           1.5         1.5         2.5         25          20
                  δ           0.5         0.5         0.5         5           1.5
LSS               a           1           1           1           1           1
                  b           6           6           25          6           6
                  r           1.5         1.5         2.5         25          60
                  δ           0.5         0.5         0.5         5           1.5
MLSS              a           1           1           1           1           1
                  b           6           6           25          6           6
                  r           1.5         1.5         2.5         25          60
                  δ           0.5         0.5         0.5         5           1.5
                  µ_i         log(p)/n    log(p)/n    log(p)/n    log(p)/n    log(p)/n
                  s_i         σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}

Table 4.2: Parameters of prior distributions for different variable selection methods for five different datasets.


Method            Summary           Dataset 1   Dataset 2   Dataset 3   Dataset 4   Dataset 5
SSVS              Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.2703      0.6569      0.5150      0.3466      0.1923
                  MSPE              3.7573      3.7629      14.404      3.8079      10.575
Kuo and Mallick   Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.7070      0.7656      0.5909      0.3915      0.3389
                  MSPE              3.3280      3.3586      13.470      3.8645      10.364
Bayesian Lasso    Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.2246      0.7206      0.5336      0.0069      0.0136
                  MSPE              4.6959      3.3057      15.366      11.084      12.011
LSS               Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.4273      0.6969      0.6121      0.4890      0.3408
                  MSPE              3.4802      3.0380      13.5614     3.7626      9.7749
MLSS              Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.1779      0.1908      0.2916      0.1184      0.0575
                  MSPE              3.5349      3.5390      13.939      3.9808      10.373

Table 4.3: Proportions of high frequency models and mean squared prediction errors (MSPE).
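The summaries reported in Table 4.3 above can be computed from the Gibbs output roughly as follows: the proportion is the relative frequency of the most frequent γ vector, and the MSPE uses predictions on held-out data (the posterior mean of θ = βγ is used here, whereas the figures in the thesis plot the posterior median curve). Names are illustrative.

    import numpy as np
    from collections import Counter

    def model_proportion_and_mspe(gamma_draws, theta_draws, X_test, Y_test):
        # gamma_draws: (n_draws, p) array of sampled indicators (after burn-in)
        # theta_draws: (n_draws, p) array of sampled theta = beta * gamma
        counts = Counter(tuple(g) for g in gamma_draws)
        best_model, count = counts.most_common(1)[0]
        proportion = count / len(gamma_draws)
        Y_pred = X_test @ theta_draws.mean(axis=0)     # posterior mean prediction
        mspe = np.mean((Y_test - Y_pred) ** 2)
        return best_model, proportion, mspe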


Figure 4.1: Scatter plot with the fitted posterior median curve for dataset 1. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.


                  SSVS         Kuo and Mallick   LSS          MLSS
Standard error    0.03995098   0.03424898        0.03933916   0.03603825

Table 4.4: The batch standard error of the highest posterior model probability for Dataset 3

In dataset 1 and dataset 2, Kuo and Mallick has the highest proportion. When we increase the magnitude of β but keep the variance fixed, the proportion for each method improves. However, for high-variance data such as dataset 3, Table 4.3 shows that the LSS method has the highest proportion, 61.21%. As the dimension of the data increases, the efficiency of including the promising variables in the model decreases, but the LSS method still performs better than the others. The MLSS method performs poorly here, although it can still choose the promising variables successfully. In dataset 3 and dataset 5, the mean squared prediction errors are higher than in the others, reflecting the uncertainty introduced by high variance and high dimension.

To compare the speed of convergence and the mixing behaviour, we divide the sample output for Dataset 3, taken at every 4th sample point, into 30 batches and report in Table 4.4 the batch standard error of the highest posterior model probability. The evolution of the corresponding ergodic posterior probabilities is shown in Figure 4.2. Figure 4.3 displays the evolution of the ergodic probability for the highest-frequency model after burning in the first 200 sample points. All methods converge to some probabilities after a reasonably long run. Generally, the Kuo and Mallick method and the MLSS method have lower batch standard errors, which indicates greater efficiency. MLSS converges faster than the other three methods, and with respect to mixing, MLSS increases the probability that the chain will move to 1 when it reaches 0 because the regression coefficients and indicator variables are assumed to be dependent on each other.
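One way to compute the batch standard error in Table 4.4 is sketched below: thin the chain, form the 0/1 indicator of the highest-frequency model, split it into 30 batches, and take the standard error of the batch means. This is a sketch of one reasonable convention; the exact implementation used for the thesis results may differ.

    import numpy as np

    def batch_standard_error(indicator, thin=4, n_batches=30):
        # indicator: 0/1 series marking, at each Gibbs iteration, whether the
        # sampled gamma equals the highest-frequency model
        x = np.asarray(indicator, dtype=float)[::thin]
        x = x[: (len(x) // n_batches) * n_batches]     # drop the remainder so batches are equal
        batch_means = x.reshape(n_batches, -1).mean(axis=1)
        return batch_means.std(ddof=1) / np.sqrt(n_batches)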


Figure 4.2: Ergodic batch posterior probabilities for Dataset 3

Figure 4.3: Ergodic posterior probabilities for Dataset 3


Criterion         AIC       BIC    Adjusted R^2   PRESS     Cp
Model variables   2, 3, 6   2, 3   1-5            2, 3, 6   2, 3

Table 4.5: Model variables under different criteria

4.2 Real data example

In this section, we apply these five methods to the air pollution dataset from Hand et al. (1994). We examine whether X1 (average annual temperature in Fahrenheit), X2 (number of manufacturing enterprises employing 20 or more workers), X3 (population size in thousands), X4 (average annual wind speed in miles per hour), X5 (average annual precipitation in inches) and X6 (average number of days with precipitation per year) are useful in explaining the SO2 content of air in micrograms per cubic metre. The dataset contains n = 41 city observations and p = 6 predictors. The predictors vary greatly in scale from one to another, so it is necessary to standardize the data, making all variables in the dataset have mean 0 and standard deviation 1 (though their ranges still differ). As shown in Figure 4.4, there appears to be a linear relationship between Y and X2 and between Y and X3, and a quadratic relationship between Y and X5 and between Y and X6; collinearity may also exist between X2 and X3 and between X5 and X6. Before variable selection, we randomly split the data into a training dataset containing 80% of the rows and a testing dataset containing 20% of the rows. The training data are used to fit the variable selection methods, and the testing data are used to evaluate the models. The MSPE for the full linear regression model is 0.36. Figure 4.5 does not indicate any severe violation of the assumption of randomly distributed residuals. As Table 4.5 shows, different models are favored by different criteria.
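The preprocessing just described can be sketched as follows: standardize each variable to mean 0 and standard deviation 1, then randomly split the rows into 80% training and 20% testing. The function name and the seed are illustrative choices.

    import numpy as np

    def standardize_and_split(X, Y, train_frac=0.8, seed=0):
        rng = np.random.default_rng(seed)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # each predictor: mean 0, sd 1
        Ys = (Y - Y.mean()) / Y.std(ddof=1)
        idx = rng.permutation(len(Ys))
        n_train = int(train_frac * len(Ys))
        train, test = idx[:n_train], idx[n_train:]
        return Xs[train], Ys[train], Xs[test], Ys[test]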

In SSVS, we set ci = 3, τi = 0.43, νγ = 0.9, λγ = 0.3. In Kuo and Mallick, we set

β0 = 0p, D0 = 16Ip, α = 0.9, η = 0.3. In the Bayesian Lasso, we set r = 40, δ = 2. In both the LSS and MLSS methods, we set a = 0.9, b = 0.3, r = 40, δ = 2. In Figure 4.6, black points represent the original dataset and red points represent the true data points used to make predictions.


Figure 4.4: Scatter plots.

Blue crosses are the posterior predictive values, and the brown line is the fitted posterior median curve. As shown in Table 4.6 and Figure 4.6, the fitted curve for the Bayesian Lasso method is almost horizontal, and its MSPE is quite high compared to the other methods. With a low proportion for its high-frequency model, the Bayesian Lasso method performs poorly. The MLSS method includes the first five predictor variables, as suggested by the adjusted R^2.

Combining the results of all five methods, it seems that X2 and X3 are both in the promising subset. In this way, the SO2 content of air could, to some extent, be explained and predicted by the number of large manufacturing enterprises and the population size in thousands.


Figure 4.5: Residual plot with fitted values

                  SSVS     K & M    BL       LSS      MLSS
Model variables   2, 3     2, 3     2        2, 3     1-5
Proportion        0.1015   0.6098   0.0380   0.2059   0.0788
MSPE              0.2188   0.2563   1.9484   0.2868   0.2271

Table 4.6: Proportions of high frequency models for different variable selection methods (K & M: Kuo and Mallick, BL: Bayesian Lasso).


Figure 4.6: Scatter plot with the fitted posterior median curve for air pollution data. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.

However, it is difficult to say which model or criterion is best. Each model has its own advantages and disadvantages, and the preferred model may vary depending on the scientific question or interest.

5. Conclusions and future work

We propose two methods for variable selection in the linear regression model by introducing a Laplace prior, and we compare these two methods to three existing methods. Each method has its own properties and advantages, providing a good separation between variables included in and excluded from the model, but it is unlikely that any single method will be optimal for all situations. The LSS method appears to work best, especially when the magnitudes of β are large and the variance is high. In the simulated examples, the MLSS method, which partitions the prior into a pseudo-prior and a Laplace prior, tends to behave poorly, with low efficiency, and performs no better than the other methods. This is perhaps not surprising, as the pseudo-prior distributions were not chosen wisely. The Bayesian Lasso was designed for parameter estimation rather than variable selection; when we try to use it to identify the promising predictors, it can make the separation correctly but with relatively low efficiency, and on larger datasets it fails to make good selections. The SSVS method and the Kuo and Mallick method do well, with fairly high proportions for the high-frequency model, but they still have limitations as the number of predictors increases.

However, despite our general assessment, there are some potential limitations of the methods, which present challenges and opportunities for future work. The MLSS method does not perform as well as expected with large-magnitude coefficients and in high dimensions. One important aspect is how to tune the hyperparameters, as the Lasso parameter λ and the mean µ and variance s of the Normal distribution need

to be chosen so that good values of the regression coefficients are proposed when the indicator variable is 0. Another direction for future work is to extend our methods from the usual linear regression model to the generalized linear model.

Appendix

Here we demonstrate how to obtain the full conditional posterior distributions in the LSS method. The posterior distribution is obtained by multiplying the likelihood, which is the density of Y given the parameters, by the joint prior of the parameters. This gives

f(β, σ^2, τ^2, γ | Y) ∝ f(Y | β, σ^2, τ^2, γ) π(β | σ^2, τ^2) π(σ^2, τ^2) π(γ)
= (2π)^{-n/2}(σ^2)^{-n/2} exp{−(Y − X^*β)'(Y − X^*β)/(2σ^2)}
  × (2π)^{-p/2}(σ^2)^{-p/2}|D_τ|^{-1/2} exp{−β'D_τ^{-1}β/(2σ^2)}  (1)
  × ∏_{j=1}^{p} (λ^2/2) exp{−λ^2τ_j^2/2} dτ_j^2 × (b^a/Γ(a))(σ^2)^{-a-1}e^{-b/σ^2}
  × ∏_{i=1}^{p} p_i^{γ_i}(1 − p_i)^{1−γ_i}.

From (1) we can obtain the full conditional distribution of each parameter by ignoring all terms that are constant with respect to that parameter. For example, the full conditional posterior distribution of β is a well-known multivariate normal distribution:

f(β | σ^2, τ, γ, Y) ∝ exp{−(Y − X^*β)'(Y − X^*β)/(2σ^2)} × exp{−β'D_τ^{-1}β/(2σ^2)}  (2)
                   ∝ N_p(A^{-1}X^{*'}Y, σ^2 A^{-1}),

where A = X^{*'}X^* + D_τ^{-1}.


For the MLSS method, we can use similar steps to obtain the full conditional posterior distributions.

Bibliography

[1] Kuo, Lynn, and Bani Mallick. "Variable selection for regression models." Sankhyā: The Indian Journal of Statistics, Series B (1998): 65-81.

[2] George, Edward I., and Robert E. McCulloch. ”Variable selection via Gibbs sampling.” Journal of the American Statistical Association 88.423 (1993): 881- 889.

[3] Dellaportas, Petros, Jonathan J. Forster, and Ioannis Ntzoufras. ”On Bayesian model and variable selection using MCMC.” Statistics and Computing 12.1 (2002): 27-36.

[4] Carlin, Bradley P., and Siddhartha Chib. ”Bayesian model choice via Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society. Series B (Methodological) (1995): 473-484.

[5] Brown, Philip J., Marina Vannucci, and Tom Fearn. ”Multivariate Bayesian variable selection and prediction.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60.3 (1998): 627-641.

[6] Meuwissen, Theo H. E., and Mike E. Goddard. "Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data." Genet. Sel. Evol. 36 (2004): 261-279.


[7] Tibshirani, Robert. ”Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

[8] Park, Trevor, and George Casella. ”The bayesian lasso.” Journal of the American Statistical Association 103.482 (2008): 681-686.

[9] Fernandez, Carmen, and Mark F. J. Steel. "Bayesian regression analysis with scale mixtures of normals." Econometric Theory 16, no. 1 (2000): 80-101.

[10] George, Edward I., Robert E. McCulloch, and R. Tsay. ”Two approaches to Bayesian model selection with applications.” Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner 309 (1996): 339.

[11] Uimari, Pekka, and Ina Hoeschele. ”Mapping-linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms.” Genetics 146.2 (1997): 735-743.

[12] Andrews, David F., and Colin L. Mallows. ”Scale mixtures of normal distribu- tions.” Journal of the Royal Statistical Society. Series B (Methodological) (1974): 99-102.

[13] Hand, David J., et al. A Handbook of Small Data Sets. Vol. 1. CRC Press, 1993.

[14] Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1984): 721-741.
