BAYESIAN VARIABLE SELECTION USING LASSO

by

YUCHEN HAN

Submitted in partial fulfillment of the requirements

for the degree of Master of Science

Department of Mathematics, Applied Mathematics and Statistics

CASE WESTERN RESERVE UNIVERSITY

May, 2017

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of Yuchen Han

candidate for the degree of Master of Science*.

Committee Chair Dr. Anirban Mondal

Committee Member Dr. Wojbor Woyczynski

Committee Member Dr. Jenny Brynjarsdottir

Date of Defense

March 31, 2017

*We also certify that written approval has been obtained

for any proprietary material contained therein. Contents

List of Tables

List of Figures

Abstract

1 Introduction

2 Existing Bayesian variable selection methods
2.1 Stochastic search variable selection
2.2 Kuo and Mallick
2.3 The Bayesian Lasso
2.4 Gibbs sampler

3 Bayesian variable selection using Lasso with spike and slab prior
3.1 LSS: Laplace with spike and slab prior
3.1.1 Hierarchical model
3.1.2 MCMC exploration of the posterior
3.2 MLSS: Mixture Laplace with spike and slab prior
3.2.1 Hierarchical model
3.2.2 MCMC exploration of the posterior

4 Comparison of different Bayesian variable selection methods
4.1 Simulated examples
4.1.1 Datasets
4.1.2 Prior distributions
4.1.3 Comparisons of different methods
4.2 Real data example

5 Conclusions and future work

Appendix

Bibliography

List of Tables

4.1 Simulated datasets details
4.2 Parameters of prior distributions for different variable selection methods for five different datasets
4.3 Proportions of high frequency models and mean squared prediction errors (MSPE)
4.4 The batch standard error of the highest posterior model probability for Dataset 3
4.5 Model variables under different criteria
4.6 Proportions of high frequency models for different variable selection methods (K & M: Kuo and Mallick, BL: Bayesian Lasso)

List of Figures

4.1 Scatter plot with the fitted posterior median curve for dataset 1. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.
4.2 Ergodic batch posterior probabilities for Dataset 3
4.3 Ergodic posterior probabilities for Dataset 3
4.4 Scatter plots
4.5 Residual plot with fitted values
4.6 Scatter plot with the fitted posterior median curve for air pollution data. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.

Bayesian Variable Selection Using Lasso

Abstract

by

YUCHEN HAN

This thesis proposes to combine the Kuo and Mallick (1998) approach and the Bayesian Lasso approach (Park and Casella, 2008) by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. Gibbs sampling is used to sample from the joint posterior distribution. We compare the two new methods to existing Bayesian variable selection methods, namely those of Kuo and Mallick, George and McCulloch, and Park and Casella, and provide an overall qualitative assessment of the efficiency of mixing and separation. We also use an air pollution dataset to test the proposed methodology, with the goal of identifying the main factors controlling the pollutant concentration.

1. Introduction

The selection of variables in regression models is one of the most common problems in statistics. Several methods of variable selection have been extensively studied in the classical statistics literature, such as stepwise selection, Akaike Information Criterion (AIC) based methods, Bayes Information Criterion (BIC) based methods, adjusted R^2 based methods, the Predicted Residual Sum of Squares (PRESS) statistic, and Mallows Cp, among others.

One Bayesian method for variable selection is the indicator model selection approach, where each regression coefficient has a spike and slab prior constructed using an auxiliary indicator variable. Kuo and Mallick (1998) proposed a model where the auxiliary indicator variable and the regression coefficients are assumed to be independent. An alternative model formulation called Gibbs variable selection (GVS) was suggested by Dellaportas et al. (1997), extending a general idea of Carlin and Chib (1995); here the prior distributions of the indicator and regression coefficients are assumed to be dependent on each other. Similarly, stochastic search variable selection (SSVS) was introduced by George and McCulloch (1993) and extended to the multivariate case by Brown et al. (1998), where the spike in the conditional prior for the regression coefficients is a narrow distribution concentrated around zero. Meuwissen and Goddard (2004) introduced (in a multivariate context) a random effects variant of SSVS.

A different Bayesian approach to inducing sparseness is not to use indicators in


the model, but instead to specify a prior directly on the regression coefficients that approximates the "spike and slab" shape. In this context the equivalent classical method is Lasso regression, introduced by Tibshirani (1996). The Lasso is a form of penalized least squares that minimizes the residual sum of squares while controlling the L1 norm of the regression coefficients. The L1 penalty shrinks the estimated coefficients toward 0. The shrinkage of the vector of regression coefficients toward zero, with the possibility of setting some coefficients identically equal to zero, makes the Lasso a good method for automatic variable selection simultaneously with the estimation of the regression coefficients. The Lasso has a Bayesian interpretation: the Lasso estimate can be viewed as the mode of the posterior distribution of the regression coefficients when independent double-exponential prior distributions are placed on the p regression coefficients and the likelihood component is taken to be the normal model. Fernandez and Steel (2000) considered the Laplace prior as a special case in a general Bayesian hierarchical regression model, but the connection to the Lasso procedure was not made until 2008, when Park and Casella (2008) proposed Bayesian Lasso regression for the first time. They extended the Bayesian Lasso regression model to account for uncertainty in the hyperparameters by placing prior distributions on them and obtained point estimates of the regression coefficients using the median of the posterior distribution, but they did not address variable selection or prediction of future observations.

In this thesis, we develop two variable selection methods in the linear regression model set-up, combining the spike and slab prior approach (George and McCulloch (1993) and Kuo and Mallick (1998)) and the Bayesian Lasso approach by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. The Laplace prior can better accommodate large regression coefficients because of its heavier tails; it also introduces different variance parameters for different regression coefficients. The first method, Laplace with spike and slab prior (LSS), uses a Laplace distribution for the conditional prior distribution of the regression coefficients given the indicator variables. This method is similar to Kuo and Mallick (1998), and independent priors are assumed for the indicator variables and the regression coefficients. Here, instead of a Normal prior for the regression coefficients, we use a Laplace prior whose variance depends on the error variance. We expect our model to perform better when the regression coefficients are of different scales and the number of regressors is very high compared to the number of observations. In the Markov chain, when the indicator variable is 0, the updated value of the regression coefficient is sampled from its full conditional distribution, which is its prior distribution. Because of the independence assumptions, just as in Kuo and Mallick (1998), if the prior on the regression coefficients is diffuse, some mixing problems may occur when we try to sample from the posterior using this method. In the second method, Mixture Laplace with spike and slab prior (MLSS), we circumvent the problem of sampling the regression coefficients from too vague a prior by considering a dependent mixture prior on the indicator variables and regression coefficients. When the indicator variable is 0 we assume a Normal prior, and when the indicator variable is 1 we assume a Laplace prior for the corresponding regression coefficient. Using this method we increase the probability that the chain will move to 1 when it reaches 0. This method needs some tuning, as the mean and variance of the Normal distribution need to be chosen so that good values of the regression coefficients are proposed when the indicator variable is 0. Gibbs sampling is used to sample from the joint posterior distribution, and the posterior predictive distribution is used to predict future observations. We also propose to use the Bayesian Lasso directly for variable selection.

First we consider simulated data sets and compare our methods to the existing Bayesian variable selection methods of Kuo and Mallick (1998), George and McCulloch (1993), and Park and Casella (2008). An overall qualitative assessment of the different aspects (computational speed, efficiency of mixing and separation) of the performance of the methods is provided. We also apply our proposed methodology to air pollution data, where the goal is to identify the main factors that control the pollutant concentration using a regression model and to use the resulting model for prediction.

2. Existing Bayesian variable selection methods

We begin with the linear model describing the relationship between the observed response variable and the set of all potential predictors,

Y | θ, σ^2 ∼ N_n(Xθ, σ^2 I_n),  (2.1)

where the response vector Y is n × 1, the matrix of potential predictors X = [X_1, . . . , X_p] is n × p, and the coefficients θ = (θ_1, . . . , θ_p)' and the variance σ^2 are unknown.

Variable selection procedures require a comparison of all 2^p possible subsets. In order to mitigate this computational issue, an indicator variable γ = (γ_1, . . . , γ_p)' can be introduced to identify the promising subset, where γ_i = 1 if θ_i is large and γ_i = 0 if θ_i is small.

2.1 Stochastic search variable selection

The stochastic search variable selection (SSVS) method was developed by George and McCulloch (1993). Later, George and McCulloch (1996) extended SSVS to generalized linear models. SSVS has also been adapted to the multivariate case by Brown et al. (1998), where the spike in the conditional prior for the regression coefficients is a narrow distribution concentrated around zero.


Considering (2.1) (here θ = β) as part of a hierarchical model, we shall assume throughout that X_1, . . . , X_p contain no variable (including the intercept) that would be included in every possible model. SSVS introduces a mixture of two normal distributions with different variances as the prior on the coefficients β_i:

β_i | γ_i ∼ (1 − γ_i) N(0, τ_i^2) + γ_i N(0, c_i^2 τ_i^2), i = 1, 2, . . . , p,  (2.2)

P (γi = 1) = 1 − P (γi = 0) = pi. (2.3)

The hyperparameter τ_i is set to be small and c_i is set to be large, so that β_i | γ_i = 0 ∼ N(0, τ_i^2) is concentrated around 0, whereas β_i | γ_i = 1 ∼ N(0, c_i^2 τ_i^2) is dispersed. Then a non-zero estimate of β_i would probably be selected in the final model. The hierarchical model is completed by assuming an inverse gamma prior for σ^2,

σ^2 | γ ∼ IG(ν_γ/2, ν_γλ_γ/2).  (2.4)

The full conditional posterior distributions of β_i, σ^2 and γ_i are given by

f(β_i | Y, σ^2, γ, β_(i)) ∝ f(Y | σ^2, γ, β) f(β_i | γ_i),  (2.5)

where β(i) denotes all terms of β except βi,

f(σ^2 | Y, β, γ) ∝ f(Y | σ^2, γ, β) f(σ^2 | γ),  (2.6)

P(γ_i = 1 | Y, β, σ^2, γ_(i)) = P(γ_i = 1 | β, σ^2, γ_(i)) = a/(a + b),  (2.7)


where γ(i) denotes all terms of γ except γi,

a = f(β | γ_(i), γ_i = 1) f(σ^2 | γ_(i), γ_i = 1) f(γ_(i), γ_i = 1) and
b = f(β | γ_(i), γ_i = 0) f(σ^2 | γ_(i), γ_i = 0) f(γ_(i), γ_i = 0).

SSVS then implements the Gibbs sampler to generate β^1, σ^1, γ^1, β^2, σ^2, γ^2, etc., with an initial choice of β^0, σ^0, γ^0, where β^0 and σ^0 can be set to the least squares estimates and γ^0 = (1, . . . , 1)', using the following conditional densities.

β^j | σ^{j-1}, γ^{j-1}, Y ∼ N_p(A_{γ^{j-1}} (σ^{j-1})^{-2} X'X β̂_{LS}, A_{γ^{j-1}}),
    where A_{γ^{j-1}} = [(σ^{j-1})^{-2} X'X + D_{γ^{j-1}}^{-1} R^{-1} D_{γ^{j-1}}^{-1}]^{-1} and D_γ^{-1} = diag[(a_1τ_1)^{-1}, . . . , (a_pτ_p)^{-1}],

(σ^2)^j | β^j, γ^{j-1}, Y ∼ IG((ν_γ + n)/2, ((Y − Xβ^j)'(Y − Xβ^j) + ν_γλ_γ)/2),

γ_i^j | Y, β^j, σ^j, γ_(i)^j ∼ Bernoulli(a/(a + b)),  (2.8)

where γ_(i)^j = (γ_1^j, . . . , γ_{i-1}^j, γ_{i+1}^{j-1}, . . . , γ_p^{j-1}),

a = f(β^j | γ_(i)^j, γ_i = 1) f((σ^2)^j | γ_(i)^j, γ_i = 1) f(γ_(i)^j, γ_i = 1) and
b = f(β^j | γ_(i)^j, γ_i = 0) f((σ^2)^j | γ_(i)^j, γ_i = 0) f(γ_(i)^j, γ_i = 0).
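A minimal sketch of the γ update in (2.7)-(2.8): each γ_i is drawn from its Bernoulli full conditional, assuming R = I (so the prior on β factorizes across coordinates) and a σ^2 prior that does not depend on γ (so the f(σ^2 | γ) terms cancel in a and b). The function and variable names (update_gamma_ssvs, p_incl, rng) are illustrative and not taken from the thesis.

    import numpy as np
    from scipy import stats

    def update_gamma_ssvs(beta, gamma, tau, c, p_incl, rng):
        # One sweep over i = 1,...,p of the SSVS indicator updates.
        # beta, tau, c, p_incl are length-p arrays; gamma is a 0/1 array.
        p = len(beta)
        for i in range(p):
            sd_spike = tau[i]            # prior sd of beta_i when gamma_i = 0
            sd_slab = c[i] * tau[i]      # prior sd of beta_i when gamma_i = 1
            a = stats.norm.pdf(beta[i], 0.0, sd_slab) * p_incl[i]
            b = stats.norm.pdf(beta[i], 0.0, sd_spike) * (1.0 - p_incl[i])
            gamma[i] = rng.binomial(1, a / (a + b))
        return gamma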

2.2 Kuo and Mallick

In SSVS, the coefficients cannot be set to zero exactly, and the selected variables are the ones with regression coefficients significantly different from zero. Thus, tuning of the hyperparameters is required in order to decide how small a coefficient must be to be essentially treated as zero. In contrast, Kuo and Mallick (1998) introduced unconditional priors for variable selection without tuning, which allow coefficients to be exactly zero with positive probability. The indicator variables and the regression coefficients are assumed to be independent, and it is only required to specify their priors. Uimari and Hoeschele (1997) used this approach for mapping linked quantitative trait loci (QTL) in genetics.

Considering (2.1), here θ = (β_1γ_1, . . . , β_pγ_p)'. The priors can be taken as β ∼ N_p(β_0, D_0), γ_i ∼ B(1, p_i) independently for i = 1, . . . , p, and σ^2 ∼ IG(α/2, η/2). Let X^* = [γ_1x_1, . . . , γ_px_p]. Then the full conditional posterior distributions of β, σ^2 and γ_i are given by

β | Y, σ^2, γ ∼ N_p(A^{-1}(D_0^{-1}β_0 + σ^{-2}X^{*'}Y), A^{-1}), where A = D_0^{-1} + σ^{-2}X^{*'}X^*,  (2.9)

σ^2 | Y, β, γ ∼ IG((α + n)/2, ((Y − Xθ)'(Y − Xθ) + η)/2),  (2.10)

P(γ_i = 1 | Y, β, σ^2, γ_(i)) = a/(a + b),  (2.11)

where γ_(i) denotes all terms of γ except γ_i, a = f(Y | γ_(i), γ_i = 1, β, σ^2) f(γ_(i), γ_i = 1) and b = f(Y | γ_(i), γ_i = 0, β, σ^2) f(γ_(i), γ_i = 0). These full conditional posterior distributions can be used in a Gibbs sampler to generate β^1, σ^1, γ^1, β^2, σ^2, γ^2, etc. Then P(γ | Y) is tabulated from the frequencies of γ. In fact, the prior can also be thought of as a pseudo-prior, which has no effect on the posterior, as discussed in Carlin and Chib (1995), when the predictors are not included in the model. Therefore, the full conditional posterior distribution of β_i can be given by

f(β_i | Y, σ^2, γ, β_(i)) ∝ f(Y | γ, β, σ^2) f(β_i | β_(i))   if γ_i = 1,
f(β_i | Y, σ^2, γ, β_(i)) ∝ f(β_i | β_(i))                    if γ_i = 0,  (2.12)

where β(i) denotes all terms of β except βi.
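A minimal sketch of the Kuo and Mallick γ update (2.11): a and b differ only through the residual sum of squares with variable i switched in or out, since the independent priors on β and σ^2 cancel in the ratio. Function and variable names are illustrative; the log scale is used only to avoid numerical underflow.

    import numpy as np

    def update_gamma_km(Y, X, beta, gamma, sigma2, p_incl, rng):
        # Sweep over i = 1,...,p; theta = beta * gamma is the current coefficient vector.
        n, p = X.shape
        for i in range(p):
            theta = beta * gamma
            theta_in, theta_out = theta.copy(), theta.copy()
            theta_in[i], theta_out[i] = beta[i], 0.0
            rss_in = np.sum((Y - X @ theta_in) ** 2)     # variable i included
            rss_out = np.sum((Y - X @ theta_out) ** 2)   # variable i excluded
            # q_i = a/(a+b), computed as 1/(1 + b/a) on the log scale
            log_b_over_a = (rss_in - rss_out) / (2.0 * sigma2) \
                + np.log(1.0 - p_incl[i]) - np.log(p_incl[i])
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_b_over_a)))
        return gamma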

2.3 The Bayesian Lasso

The least absolute shrinkage and selection operator (Lasso) was developed by Tibshirani (1996) in order to improve prediction accuracy and to determine a smaller promising subset of predictors. He suggested that the Lasso estimates can be interpreted as posterior mode estimates when the regression parameters have independent and identical Laplace priors. Park and Casella (2008) considered a fully Bayesian analysis using a conditional Laplace prior and extended the model to account for uncertainty in the hyperparameters by placing prior distributions on σ^2 and τ^2. Consider the conditional Laplace prior

π(β | σ^2) = ∏_{i=1}^{p} (λ/(2√(σ^2))) exp(−λ|β_i|/√(σ^2))  (2.13)

and the noninformative scale-invariant marginal prior π(σ^2) = 1/σ^2 on σ^2. Conditioning on σ^2 is important because it guarantees a unimodal full posterior. It is also noted that any inverse gamma prior for σ^2 maintains conjugacy. Exploiting the fact that the Laplace distribution can be represented as a scale mixture of normal densities (Andrews and Mallows, 1974),

(a/2) e^{-a|z|} = ∫_0^∞ (1/√(2πs)) e^{-z^2/(2s)} (a^2/2) e^{-a^2 s/2} ds, a > 0,  (2.14)


the hierarchical representation of the full model is given by

Y | µ, X, β, σ^2 ∼ N_n(µ1_n + Xβ, σ^2 I_n),

β | σ^2, τ_1^2, . . . , τ_p^2 ∼ N_p(0_p, σ^2 D_τ), D_τ = diag(τ_1^2, . . . , τ_p^2),  (2.15)

σ^2, τ_1^2, . . . , τ_p^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{p} (λ^2/2) exp(−λ^2 τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_p^2 > 0.
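The scale-mixture identity (2.14) underlying this hierarchy can be verified numerically; the sketch below compares the Laplace density on the left-hand side with the mixing integral on the right-hand side. The values of a and z are arbitrary test choices, not taken from the thesis.

    import numpy as np
    from scipy.integrate import quad

    a, z = 1.7, 0.9                                     # arbitrary test values
    lhs = 0.5 * a * np.exp(-a * abs(z))                 # Laplace density (a/2) exp(-a|z|)
    integrand = lambda s: (np.exp(-z**2 / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
                           * 0.5 * a**2 * np.exp(-a**2 * s / 2.0))
    rhs, _ = quad(integrand, 0.0, np.inf)               # normal density mixed over s
    print(lhs, rhs)                                     # the two values agree to quadrature accuracy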

The parameter µ can be given an independent, flat prior. After integrating out τ_1^2, . . . , τ_p^2, the conditional prior on β has the desired conditional Laplace form (2.13). After integrating out µ, the full conditional posterior distributions of β and σ^2 are given by

β | σ^2, τ, Y ∼ N_p(A^{-1}X'Y, σ^2 A^{-1}), where A = X'X + D_τ^{-1},  (2.16)

σ^2 | β, τ, Y ∼ IG((p + n − 1)/2, β'D_τ^{-1}β/2 + (Y − Xβ)'(Y − Xβ)/2).  (2.17)

The full conditional posterior distribution of 1/τ_j^2 is inverse-Gaussian with parameters

µ' = √(λ^2σ^2/β_j^2), λ' = λ^2,

and its density is given by

f(x) = √(λ'/(2π)) x^{-3/2} exp(−λ'(x − µ')^2/(2µ'^2 x)), x > 0.  (2.18)

The Bayesian Lasso parameter λ can be chosen using an appropriate hyperprior. Consider a diffuse Gamma hyperprior on λ^2,

π(λ^2) = (δ^r/Γ(r)) (λ^2)^{r-1} e^{-δλ^2}, λ^2 > 0 (r > 0, δ > 0).  (2.19)


When this prior is used in the hierarchical model, the full conditional posterior distribution of λ^2 is Gamma(p + r, ∑_{i=1}^{p} τ_i^2/2 + δ).
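The τ_j^2 and λ^2 updates above can be coded directly, since NumPy's Wald generator is the inverse-Gaussian distribution with parameters (µ', λ'). The sketch below is only an illustration of those two full conditionals; the function name and arguments are assumptions, and it assumes no β_j is exactly zero.

    import numpy as np

    def update_tau2_lambda2(beta, sigma2, lam2, r, delta, rng):
        p = len(beta)
        mu_prime = np.sqrt(lam2 * sigma2 / beta**2)           # mu' for each 1/tau_j^2
        inv_tau2 = rng.wald(mu_prime, lam2)                   # inverse-Gaussian(mu', lambda' = lambda^2)
        tau2 = 1.0 / inv_tau2
        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau2)/2 + delta); NumPy uses scale = 1/rate
        lam2_new = rng.gamma(p + r, 1.0 / (np.sum(tau2) / 2.0 + delta))
        return tau2, lam2_new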

2.4 Gibbs sampler

The Gibbs sampler was first proposed by Geman and Geman (1984), in IEEE Transactions on Pattern Analysis and Machine Intelligence, for simulating posterior distributions used in image reconstruction, and it fundamentally changed Bayesian computing. It is a special case of the Metropolis-Hastings algorithm. The Gibbs sampler breaks the problem of sampling from a high-dimensional joint distribution into a series of samples from low-dimensional conditional distributions.

Suppose the parameter vector θ = (θ_1, . . . , θ_p) is to be sampled from a distribution f(θ) that is unknown or complicated, while the conditional distributions of θ_i | θ^(i) = (θ_1, . . . , θ_{i-1}, θ_{i+1}, . . . , θ_p) are known or easy to simulate from. Then the following algorithm can be used.

0◦ Set k = 0.

1◦ Choose θ^(k) ∈ S, where S is the support of f(θ).

2◦ Generate
θ_1^(k+1) from f(θ_1 | θ_2^(k), . . . , θ_p^(k)),
θ_2^(k+1) from f(θ_2 | θ_1^(k+1), θ_3^(k), . . . , θ_p^(k)),
. . .
θ_{p-1}^(k+1) from f(θ_{p-1} | θ_1^(k+1), . . . , θ_{p-2}^(k+1), θ_p^(k)),
θ_p^(k+1) from f(θ_p | θ_1^(k+1), . . . , θ_{p-1}^(k+1)).
This yields θ^(k+1).

3◦ If the sequence {θ^(k)} has converged, output θ^(k+1); otherwise, set k = k + 1 and go to step 2◦.

The resulting sequence {θ^(k)} is called the Gibbs sampler.
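A minimal illustration of the algorithm above, using a bivariate normal target with correlation ρ, where both full conditionals are univariate normal. The target, ρ, and the number of iterations are arbitrary choices for this sketch.

    import numpy as np

    rho, n_iter = 0.8, 5000
    rng = np.random.default_rng(0)
    theta = np.zeros(2)                          # steps 0-1: k = 0, starting point in the support
    samples = np.empty((n_iter, 2))
    for k in range(n_iter):                      # step 2: cycle through the full conditionals
        theta[0] = rng.normal(rho * theta[1], np.sqrt(1.0 - rho**2))
        theta[1] = rng.normal(rho * theta[0], np.sqrt(1.0 - rho**2))
        samples[k] = theta
    print(np.corrcoef(samples[1000:].T))         # step 3: after burn-in the sample correlation is near rho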

3. Bayesian variable selection using Lasso with spike and slab prior

Here we develop a combination of the Kuo and Mallick (1998) and Bayesian Lasso (Park and Casella, 2008) approaches by introducing a Laplace distribution on the conditional prior of the regression parameters given the indicator variables. The Laplace prior can better accommodate large regression coefficients because of its heavier tails; it also introduces different variance parameters for different regression coefficients. This model performs better than the Kuo and Mallick method when the regression coefficients are on different scales and the variance of the data is high. Gibbs sampling will be used to sample from the joint posterior distribution, and the posterior predictive distribution will be used to predict future observations.

3.1 LSS: Laplace with spike and slab prior

3.1.1 Hierarchical model

For the linear regression model, we consider

Y | θ, σ^2 ∼ N_n(Xθ, σ^2 I_n),  (3.1)


where the response vector Y is n × 1, the matrix of potential predictors X = [X_1, . . . , X_p] is n × p, the coefficients are θ = (β_1γ_1, . . . , β_pγ_p)', the γ_i are indicator variables, and the variance σ^2 is unknown.

The Lasso regression shrinks the coefficients by imposing an L1 penalty. The Lasso coefficient estimates minimize the penalized residual sum of squares,

min_β (Y − Xβ)'(Y − Xβ) + λ ∑_{j=1}^{p} |β_j|,  (3.2)

where λ ≥ 0. The Lasso estimates can be viewed as the mode of the posterior distribution of the regression coefficients when independent Laplace prior distributions are placed on the p regression coefficients. The Laplace prior can better accommodate large regression coefficients because of its heavier tails. Therefore, we assume that each

coefficient βi is distributed as a Laplace distribution

π(β | σ^2) = ∏_{i=1}^{p} (λ/(2√(σ^2))) exp(−λ|β_i|/√(σ^2)).  (3.3)

Considering the Laplace distribution as a scale mixture of normals (2.14), we have a multivariate normal prior

β | σ^2, τ_1^2, . . . , τ_p^2 ∼ N_p(0_p, σ^2 D_τ), D_τ = diag(τ_1^2, . . . , τ_p^2),  (3.4)
σ^2, τ_1^2, . . . , τ_p^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{p} (λ^2/2) exp(−λ^2τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_p^2 > 0.

The prior for σ^2 is IG(a, b), which maintains conjugacy.

The Bernoulli model P (γi = 1) = 1 − P (γi = 0) = pi is the prior probability that

Xi is included in the model.


3.1.2 MCMC exploration of the posterior

The promising subsets of predictors can be identified from the γ's that have high posterior probabilities. A Gibbs sampler can be implemented to generate a sequence

γ^1, . . . , γ^j, . . . ,  (3.5)

which will converge in distribution to γ ∼ π(γ | Y). In this sequence, the γ with the highest probability will appear most frequently; hence the X_i corresponding to those γ are the selected variables. The initial values β^0 and σ^0 are set to the least squares estimates of the full model (3.1), and γ^0 can be initialized as (1, . . . , 1). Based on these initial values, we can generate a Gibbs sequence

β^1, σ^1, τ^1, γ^1, . . . , β^j, σ^j, τ^j, γ^j, . . . ,  (3.6)

which includes (3.5), using the following iterative simulations.

Let X^* = [γ_1x_1, . . . , γ_px_p]. Given the prior (3.3) and the normal likelihood from the data (3.1), we can show that the full conditional posterior distribution of β (see Appendix) is

β | σ^2, τ, γ, Y ∼ N_p(A^{-1}X^{*'}Y, σ^2 A^{-1}),  (3.7)

where A = X^{*'}X^* + D_τ^{-1}; β can be sampled from (3.7). The full conditional posterior distribution of σ^2 is

σ^2 | β, τ, γ, Y ∼ IG(a + (p + n)/2, b + β'D_τ^{-1}β/2 + (Y − Xθ)'(Y − Xθ)/2).  (3.8)


The full conditional posterior distribution of 1/τ_j^2 is inverse-Gaussian with parameters

µ' = √(λ^2σ^2/β_j^2), λ' = λ^2.

The posterior distribution of γ is

γ_i | β, σ^2, τ, γ_(i), Y ∼ Bernoulli(q_i),  (3.9)

with q_i = c/(c + d), where

c = exp(−(Y − Xθ_i^*)'(Y − Xθ_i^*)/(2σ^2)) p_i,
d = exp(−(Y − Xθ_i^**)'(Y − Xθ_i^**)/(2σ^2)) (1 − p_i).  (3.10)

Here θ_i^* is obtained from θ by replacing the i-th entry with β_i; similarly, θ_i^** is the column vector θ with the i-th entry replaced by 0. In each Gibbs iteration, λ needs to be updated from the previous iteration, namely

λ^k = √(2p / ∑_{i=1}^{p} E_{λ^{k-1}}[τ_i^2 | Y]),  (3.11)

where the conditional expectations are substituted by the averages from the Gibbs sample. The initial value can be set as

λ^0 = p √(σ̂_{LS}^2) / ∑_{j=1}^{p} |β̂_{LS,j}|,  (3.12)

where σ̂_{LS}^2 and β̂_{LS} are the least squares estimates. Alternatively, we can give λ^2 a gamma prior (2.19) and include it in the hierarchical


model. Then the full conditional distribution of λ^2 is given by

λ^2 | Y, β, σ^2, τ, γ ∼ Gamma(p + r, ∑_{i=1}^{p} τ_i^2/2 + δ).  (3.13)

The gamma prior on λ^2 should decay to 0 sufficiently fast as λ^2 → ∞, but should be relatively flat and put high probability near the maximum likelihood estimate.
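Putting the pieces of Section 3.1.2 together, one iteration of the LSS Gibbs sampler can be sketched as below, using the full conditionals (3.7)-(3.10) and the Gamma update (3.13) for λ^2 rather than the plug-in update (3.11). This is an illustrative sketch under those full conditionals, not the code used for the thesis results; names such as lss_gibbs_step and p_incl are assumptions.

    import numpy as np

    def lss_gibbs_step(Y, X, beta, gamma, tau2, sigma2, lam2, a, b, r, delta, p_incl, rng):
        n, p = X.shape
        Xstar = X * gamma                              # X* = [gamma_1 x_1, ..., gamma_p x_p]
        Dtau_inv = np.diag(1.0 / tau2)

        # beta | rest ~ N_p(A^{-1} X*'Y, sigma^2 A^{-1}), A = X*'X* + D_tau^{-1}   (3.7)
        A_inv = np.linalg.inv(Xstar.T @ Xstar + Dtau_inv)
        beta = rng.multivariate_normal(A_inv @ Xstar.T @ Y, sigma2 * A_inv)

        # sigma^2 | rest ~ IG(a + (p+n)/2, b + beta'D_tau^{-1}beta/2 + RSS/2)       (3.8)
        theta = beta * gamma
        rss = np.sum((Y - X @ theta) ** 2)
        rate = b + beta @ Dtau_inv @ beta / 2.0 + rss / 2.0
        sigma2 = 1.0 / rng.gamma(a + (p + n) / 2.0, 1.0 / rate)

        # 1/tau_j^2 | rest ~ inverse-Gaussian(sqrt(lambda^2 sigma^2 / beta_j^2), lambda^2)
        tau2 = 1.0 / rng.wald(np.sqrt(lam2 * sigma2 / beta**2), lam2)

        # gamma_i | rest ~ Bernoulli(c/(c+d)) with c, d as in (3.10), on the log scale
        for i in range(p):
            th_in, th_out = theta.copy(), theta.copy()
            th_in[i], th_out[i] = beta[i], 0.0
            log_c = -np.sum((Y - X @ th_in) ** 2) / (2.0 * sigma2) + np.log(p_incl[i])
            log_d = -np.sum((Y - X @ th_out) ** 2) / (2.0 * sigma2) + np.log(1.0 - p_incl[i])
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_d - log_c)))
            theta[i] = beta[i] * gamma[i]

        # lambda^2 | rest ~ Gamma(p + r, rate = sum(tau2)/2 + delta)                 (3.13)
        lam2 = rng.gamma(p + r, 1.0 / (np.sum(tau2) / 2.0 + delta))
        return beta, gamma, tau2, sigma2, lam2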

3.2 MLSS: Mixture Laplace with spike and slab prior

3.2.1 Hierarchical model

In the LSS method, mixing can be poor if the prior is too vague, and the sampler for γ will only rarely flip from 0 to 1. In order to overcome this problem, we consider a dependent mixture prior on the indicator variables and regression coefficients. Using this method we increase the probability that the chain will move to 1 when it reaches 0. We consider the same model (3.1) as in the Laplace prior method. For the prior on the coefficients, we assume that each β_i comes from a mixture of two different distributions, given by

β_i | σ^2, γ_i ∼ (1 − γ_i) N(µ_i, s_i) + γ_i Laplace(0, √(σ^2)/λ),  (3.14)

where µ_i can be set to log(p)/n and s_i can be set to the variance of the OLS estimator of β_i. In other words, we partition β into (β_γ, β_(γ)) corresponding to the β_i that are included or excluded in the model. Then the prior on β can be partitioned into π(β_γ | γ) = Laplace(0, √(σ^2)/λ) and π(β_(γ) | β_γ, γ) = N(µ_i, s_i). The latter part can be viewed as a pseudo-prior, which has no effect on the posterior. Exploiting (2.14), π(β_γ | γ)


can be written as a normal distribution,

β_γ | γ, σ^2, τ_1^2, . . . , τ_{pr}^2 ∼ N_{pr}(0_{pr}, σ^2 D_{τr}), D_{τr} = diag(τ_1^2, . . . , τ_{pr}^2), where pr is the number of non-zero γ_i,  (3.15)
σ^2, τ_1^2, . . . , τ_{pr}^2 ∼ π(σ^2) dσ^2 ∏_{i=1}^{pr} (λ^2/2) exp(−λ^2τ_i^2/2) dτ_i^2, σ^2, τ_1^2, . . . , τ_{pr}^2 > 0.

Here, the prior for σ^2 is IG(a, b), and the indicator variable γ_i is distributed as Bernoulli(p_i),

π(γ) = ∏_{i=1}^{p} p_i^{γ_i} (1 − p_i)^{1−γ_i}.  (3.16)

3.2.2 MCMC exploration of the posterior

We can generate a Gibbs sample for

β1, σ1, τ 1, γ1,..., βj, σj, τ j, γj,... (3.17)

with the same choice of initial values as in the Laplace prior method, through the following full conditional posterior distributions. Let X^* = [γ_1x_1, . . . , γ_px_p]; then X_γ^* consists of the non-zero column vectors of X^*.

β_γ | β_(γ), σ^2, τ, γ, Y ∼ N_{pr}(A^{-1}X_γ^{*'}Y, σ^2 A^{-1}),  (3.18)
β_(γ) | β_γ, σ^2, τ, γ, Y ∼ N_{(p−pr)}(µ, S),

where A = X_γ^{*'}X_γ^* + D_{τr}^{-1}. The full conditional posterior distribution of σ^2 is

σ^2 | β, τ, γ, Y ∼ IG(a + (pr + n)/2, b + β_γ'D_{τr}^{-1}β_γ/2 + (Y − Xθ)'(Y − Xθ)/2).  (3.19)

When γ_i = 1, the full conditional posterior distribution of 1/τ_i^2 is inverse-Gaussian with parameters µ' = √(λ^2σ^2/β_i^2), λ' = λ^2.


The posterior distribution of γ is

γ_i | β, σ^2, τ, γ_(i), Y ∼ Bernoulli(q_i),  (3.20)

with q_i = c/(c + d), where

c = f(Y | γ_(i), γ_i = 1, β, σ^2) f(β | σ^2, γ_(i), γ_i = 1) f(γ_(i), γ_i = 1),  (3.21)
d = f(Y | γ_(i), γ_i = 0, β, σ^2) f(β | σ^2, γ_(i), γ_i = 0) f(γ_(i), γ_i = 0).

For λ, we can use the same approach as in the Laplace prior method.
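A sketch of the MLSS γ update (3.20)-(3.21): compared with the LSS update, the ratio now also contains the Laplace (slab) and Normal (pseudo-prior) densities of β_i, assuming that the prior terms for β_j, j ≠ i, cancel in the ratio. Function and variable names are illustrative, not from the thesis.

    import numpy as np
    from scipy import stats

    def update_gamma_mlss(Y, X, beta, gamma, sigma2, lam, mu, s, p_incl, rng):
        n, p = X.shape
        laplace_scale = np.sqrt(sigma2) / lam                  # Laplace(0, sqrt(sigma^2)/lambda)
        for i in range(p):
            theta = beta * gamma
            th_in, th_out = theta.copy(), theta.copy()
            th_in[i], th_out[i] = beta[i], 0.0
            log_c = (-np.sum((Y - X @ th_in) ** 2) / (2.0 * sigma2)
                     + stats.laplace.logpdf(beta[i], 0.0, laplace_scale)
                     + np.log(p_incl[i]))
            log_d = (-np.sum((Y - X @ th_out) ** 2) / (2.0 * sigma2)
                     + stats.norm.logpdf(beta[i], mu[i], np.sqrt(s[i]))   # pseudo-prior N(mu_i, s_i)
                     + np.log(1.0 - p_incl[i]))
            gamma[i] = rng.binomial(1, 1.0 / (1.0 + np.exp(log_d - log_c)))
        return gamma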

4. Comparison of different Bayesian variable selection methods

In this chapter, we illustrate the two variable selection methods we propose, LSS and MLSS, and compare their performance with that of SSVS, Kuo and Mallick, and the Bayesian Lasso on simulated examples. The proposed methods are then applied to air pollution data.

4.1 Simulated examples

4.1.1 Datasets

To evaluate the performance of the methods, we use a series of simulated datasets.

Each dataset has p predictors of length n. The predictors X_1, . . . , X_p are independently and identically distributed as N_n(0_n, I_n). The response variable is generated from the linear regression model Y = Xα + ε, where α is the vector of desired coefficients and ε ∼ N_n(0_n, σ^2 I_n). For dataset 5, the predictors are obtained as X_i = X_i + Z, where Z ∼ N_n(0_n, I_n). The least squares estimate of the standard error of β_i is denoted by σ̂_{β_i}. The data generation details are given in Table 4.1. For the SSVS method, we assume c_i = 10 and τ_i = 0.33, and for the Kuo and Mallick method we set β_0 = 0_p and D_0 = 16I_p.


Dataset   n    p    Generated model                                                              σ
1         60   5    X4 + 1.2X5                                                                   2.5
2         60   5    4X4 + 12X5                                                                   2.5
3         60   5    9X4 + 12X5                                                                   5
4         60   10   3X1 + 3X2 + 3X3 + 8X4 + 8X5                                                  2.5
5         60   20   2 ∑_{i=1}^{3} Xi + 4 ∑_{i=4}^{5} Xi + 8 ∑_{i=11}^{12} Xi + 14 ∑_{i=13}^{15} Xi   2.5

Table 4.1: Simulated datasets details
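For concreteness, a short sketch of how Dataset 4 in Table 4.1 can be generated (n = 60, p = 10, Y = 3X1 + 3X2 + 3X3 + 8X4 + 8X5 + ε, σ = 2.5). The random seed is an arbitrary choice.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 60, 10, 2.5
    X = rng.standard_normal((n, p))                    # X_1,...,X_p i.i.d. N_n(0_n, I_n)
    alpha = np.array([3, 3, 3, 8, 8, 0, 0, 0, 0, 0], dtype=float)
    Y = X @ alpha + sigma * rng.standard_normal(n)     # epsilon ~ N_n(0_n, sigma^2 I_n)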

4.1.2 Prior distributions

For each dataset, different priors are used, and the priors are chosen to be representative of prior knowledge. Here we try to make the expectation of the prior distribution close to the true coefficients, but set the variance a little larger. We set π(γ) = (1/2)^p. Table 4.2 shows all the priors for the different methods.

4.1.3 Comparisons of different methods

The efficiency and mixing properties of the methods were investigated by carrying out diagnostics on the MCMC samples. For each dataset, a sample of 10,000 observations of the Gibbs sequence is simulated, and the output summaries are based on the final 8,000 iterations. How efficiently the variables are separated into those to be included and those to be excluded from the model is the criterion we use to evaluate the methods. Table 4.3 shows the highest-frequency models that appeared in each example. In the Bayesian Lasso method, we assume β_i = 0 when |β_i| < 0.001; we can then perform variable selection and find the high-frequency model, as in the other methods. In Figure 4.1, black points represent the original dataset, red points represent the true data points used to make predictions, blue crosses are the posterior predictive values, and the brown line is the fitted posterior median curve. As expected, all methods do well: the highest-frequency model is the model used to generate the data. In particular, the LSS method seems to perform best in identifying the correct model.


Method            Parameter   Dataset 1   Dataset 2   Dataset 3   Dataset 4   Dataset 5
SSVS              ν_γ         1           1           1           1           1
                  λ_γ         6           6           25          6           6
Kuo and Mallick   α           1           1           1           1           1
                  η           6           6           25          6           6
Bayesian Lasso    r           1.5         1.5         2.5         25          20
                  δ           0.5         0.5         0.5         5           1.5
LSS               a           1           1           1           1           1
                  b           6           6           25          6           6
                  r           1.5         1.5         2.5         25          60
                  δ           0.5         0.5         0.5         5           1.5
MLSS              a           1           1           1           1           1
                  b           6           6           25          6           6
                  r           1.5         1.5         2.5         25          60
                  δ           0.5         0.5         0.5         5           1.5
                  µ_i         log(p)/n    log(p)/n    log(p)/n    log(p)/n    log(p)/n
                  s_i         σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}     σ̂_{β_i}

Table 4.2: Parameters of prior distributions for different variable selection methods for five different datasets.


Method            Summary           Dataset 1   Dataset 2   Dataset 3   Dataset 4   Dataset 5
SSVS              Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.2703      0.6569      0.5150      0.3466      0.1923
                  MSPE              3.7573      3.7629      14.404      3.8079      10.575
Kuo and Mallick   Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.7070      0.7656      0.5909      0.3915      0.3389
                  MSPE              3.3280      3.3586      13.470      3.8645      10.364
Bayesian Lasso    Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.2246      0.7206      0.5336      0.0069      0.0136
                  MSPE              4.6959      3.3057      15.366      11.084      12.011
LSS               Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.4273      0.6969      0.6121      0.4890      0.3408
                  MSPE              3.4802      3.0380      13.5614     3.7626      9.7749
MLSS              Model variables   4, 5        4, 5        4, 5        1-5         1-5, 11-15
                  Proportion        0.1779      0.1908      0.2916      0.1184      0.0575
                  MSPE              3.5349      3.5390      13.939      3.9808      10.373

Table 4.3: Proportions of high frequency models and mean squared prediction errors (MSPE).
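The summaries reported in Table 4.3 above can be computed from the Gibbs output roughly as follows: the proportion is the relative frequency of the most frequent γ vector, and the MSPE uses predictions on held-out data (the posterior mean of θ = βγ is used here, whereas the figures in the thesis plot the posterior median curve). Names are illustrative.

    import numpy as np
    from collections import Counter

    def model_proportion_and_mspe(gamma_draws, theta_draws, X_test, Y_test):
        # gamma_draws: (n_draws, p) array of sampled indicators (after burn-in)
        # theta_draws: (n_draws, p) array of sampled theta = beta * gamma
        counts = Counter(tuple(g) for g in gamma_draws)
        best_model, count = counts.most_common(1)[0]
        proportion = count / len(gamma_draws)
        Y_pred = X_test @ theta_draws.mean(axis=0)     # posterior mean prediction
        mspe = np.mean((Y_test - Y_pred) ** 2)
        return best_model, proportion, mspe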


Figure 4.1: Scatter plot with the fitted posterior median curve for dataset 1. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.


                  SSVS         Kuo and Mallick   LSS          MLSS
Standard error    0.03995098   0.03424898        0.03933916   0.03603825

Table 4.4: The batch standard error of the highest posterior model probability for Dataset 3

In dataset 1 and dataset 2, Kuo and Mallick has the highest proportion. When we increase the magnitude of β but keep the variance fixed, the proportion for each method improves. However, for high-variance data such as dataset 3, Table 4.3 shows that the LSS method has the highest proportion, 61.21%. As the dimension of the data increases, the efficiency of including the promising variables in the model decreases, but the LSS method still performs better than the others. The MLSS method performs poorly here, although it can still choose the promising variables successfully. In dataset 3 and dataset 5, the mean squared prediction errors are higher than in the others, reflecting the uncertainty introduced by high variance and high dimension.

To compare the speed of convergence and the mixing behaviour, we divide the sample output for Dataset 3, taken at every 4th sample point, into 30 batches and report in Table 4.4 the batch standard error of the highest posterior model probability. The evolution of the corresponding ergodic posterior probabilities is shown in Figure 4.2. Figure 4.3 displays the evolution of the ergodic probability for the highest-frequency model after burning in the first 200 sample points. All methods converge to some probabilities after a reasonably long run. Generally, the Kuo and Mallick method and the MLSS method have lower batch standard errors, which indicates greater efficiency. MLSS converges faster than the other three methods, and with respect to mixing, MLSS increases the probability that the chain will move to 1 when it reaches 0 because the regression coefficients and indicator variables are assumed to be dependent on each other.
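One way to compute the batch standard error in Table 4.4 is sketched below: thin the chain, form the 0/1 indicator of the highest-frequency model, split it into 30 batches, and take the standard error of the batch means. This is a sketch of one reasonable convention; the exact implementation used for the thesis results may differ.

    import numpy as np

    def batch_standard_error(indicator, thin=4, n_batches=30):
        # indicator: 0/1 series marking, at each Gibbs iteration, whether the
        # sampled gamma equals the highest-frequency model
        x = np.asarray(indicator, dtype=float)[::thin]
        x = x[: (len(x) // n_batches) * n_batches]     # drop the remainder so batches are equal
        batch_means = x.reshape(n_batches, -1).mean(axis=1)
        return batch_means.std(ddof=1) / np.sqrt(n_batches)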


Figure 4.2: Ergodic batch posterior probabilities for Dataset 3

Figure 4.3: Ergodic posterior probabilities for Dataset 3


Criterion         AIC       BIC    Adjusted R^2   PRESS     Cp
Model variables   2, 3, 6   2, 3   1-5            2, 3, 6   2, 3

Table 4.5: Model variables under different criteria

4.2 Real data example

In this section, we apply these five methods to the air pollution dataset from Hand et al. (1994). We examine whether X1 (average annual temperature in Fahrenheit), X2 (number of manufacturing enterprises employing 20 or more workers), X3 (population size in thousands), X4 (average annual wind speed in miles per hour), X5 (average annual precipitation in inches) and X6 (average number of days with precipitation per year) are useful in explaining the SO2 content of air in micrograms per cubic metre. The dataset contains n = 41 city observations and p = 6 predictors. The predictors vary greatly in scale from one to another, so it is necessary to standardize the data, making all variables in the dataset have mean 0 and standard deviation 1 (though their ranges still differ). As shown in Figure 4.4, there appears to be a linear relationship between Y and X2 and between Y and X3, and a quadratic relationship between Y and X5 and between Y and X6; collinearity may also exist between X2 and X3 and between X5 and X6. Before variable selection, we randomly split the data into a training dataset containing 80% of the rows and a testing dataset containing 20% of the rows. The training data are used to fit the variable selection methods, and the testing data are used to evaluate the models. The MSPE for the full linear regression model is 0.36. Figure 4.5 does not indicate any severe violation of the assumption of randomly distributed residuals. As Table 4.5 shows, different models are favored by different criteria.
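The preprocessing just described can be sketched as follows: standardize each variable to mean 0 and standard deviation 1, then randomly split the rows into 80% training and 20% testing. The function name and the seed are illustrative choices.

    import numpy as np

    def standardize_and_split(X, Y, train_frac=0.8, seed=0):
        rng = np.random.default_rng(seed)
        Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # each predictor: mean 0, sd 1
        Ys = (Y - Y.mean()) / Y.std(ddof=1)
        idx = rng.permutation(len(Ys))
        n_train = int(train_frac * len(Ys))
        train, test = idx[:n_train], idx[n_train:]
        return Xs[train], Ys[train], Xs[test], Ys[test]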

In SSVS, we set ci = 3, τi = 0.43, νγ = 0.9, λγ = 0.3. In Kuo and Mallick, we set

β0 = 0p, D0 = 16Ip, α = 0.9, η = 0.3. In the Bayesian Lasso, we set r = 40, δ = 2. In both the LSS and MLSS methods, we set a = 0.9, b = 0.3, r = 40, δ = 2. In Figure 4.6, black points represent the original dataset and red points represent the true data points used to make predictions.


Figure 4.4: Scatter plots.

Blue crosses are the posterior predictive values, and the brown line is the fitted posterior median curve. As shown in Table 4.6 and Figure 4.6, the fitted curve for the Bayesian Lasso method is almost horizontal, and its MSPE is quite high compared to the other methods. With a low proportion for its high-frequency model, the Bayesian Lasso method performs poorly. The MLSS method includes the first five predictor variables, as suggested by the adjusted R^2.

Combining the results of all five methods, it seems that X2 and X3 are both in the promising subset. In this way, the SO2 content of air could, to some extent, be explained and predicted by the number of large manufacturing enterprises and the population size in thousands.


Figure 4.5: Residual plot with fitted values

                  SSVS     K & M    BL       LSS      MLSS
Model variables   2, 3     2, 3     2        2, 3     1-5
Proportion        0.1015   0.6098   0.0380   0.2059   0.0788
MSPE              0.2188   0.2563   1.9484   0.2868   0.2271

Table 4.6: Proportions of high frequency models for different variable selection methods (K & M: Kuo and Mallick, BL: Bayesian Lasso).


Figure 4.6: Scatter plot with the fitted posterior median curve for air pollution data. The first row is using SSVS, the second row is using Kuo and Mallick, the third row is using Bayesian Lasso, the fourth row is using LSS and the fifth row is using MLSS.

However, it is difficult to say which model or criterion is best. Each model has its own advantages and disadvantages, and the preferred model may vary depending on the scientific question or interest.

5. Conclusions and future work

We propose two methods for variable selection in the linear regression model by introducing a Laplace prior, and we compare these two methods to three existing methods. Each method has its own properties and advantages, providing a good separation between variables included in and excluded from the model, but it is unlikely that any single method will be optimal for all situations. The LSS method appears to work best, especially when the magnitudes of β are large and the variance is high. In the simulated examples, the MLSS method, which partitions the prior into a pseudo-prior and a Laplace prior, tends to behave poorly, with low efficiency, and performs no better than the other methods. This is perhaps not surprising, as the pseudo-prior distributions were not chosen wisely. The Bayesian Lasso was designed for parameter estimation rather than variable selection; when we try to use it to identify the promising predictors, it can make the separation correctly but with relatively low efficiency, and on larger datasets it fails to make good selections. The SSVS method and the Kuo and Mallick method do well, with fairly high proportions for the high-frequency model, but they still have limitations as the number of predictors increases.

However, despite our general assessment, there are some potential limitations of the methods, which present challenges and opportunities for future work. The MLSS method does not perform as well as expected with large-magnitude coefficients and in high dimensions. One important aspect is how to tune the hyperparameters, as the Lasso parameter λ and the mean µ and variance s of the Normal distribution need

to be chosen so that good values of the regression coefficients are proposed when the indicator variable is 0. Another direction for future work is to extend our methods from the usual linear regression model to the generalized linear model.

Appendix

Here we demonstrate how to obtain the full conditional posterior distributions in the LSS method. The posterior distribution is obtained by multiplying the likelihood, which is the density of Y given the parameters, by the joint prior of the parameters. This gives

f(β, σ^2, τ^2, γ | Y) ∝ f(Y | β, σ^2, τ^2, γ) π(β | σ^2, τ^2) π(σ^2, τ^2) π(γ)
= (2π)^{-n/2}(σ^2)^{-n/2} exp{−(Y − X^*β)'(Y − X^*β)/(2σ^2)}
  × (2π)^{-p/2}(σ^2)^{-p/2}|D_τ|^{-1/2} exp{−β'D_τ^{-1}β/(2σ^2)}  (1)
  × ∏_{j=1}^{p} (λ^2/2) exp{−λ^2τ_j^2/2} dτ_j^2 × (b^a/Γ(a))(σ^2)^{-a-1}e^{-b/σ^2}
  × ∏_{i=1}^{p} p_i^{γ_i}(1 − p_i)^{1−γ_i}.

From (1) we can obtain the full conditional distribution of each parameter by ignoring all terms that are constant with respect to that parameter. For example, the full conditional posterior distribution of β is a well-known multivariate normal distribution:

f(β | σ^2, τ, γ, Y) ∝ exp{−(Y − X^*β)'(Y − X^*β)/(2σ^2)} × exp{−β'D_τ^{-1}β/(2σ^2)}  (2)
                   ∝ N_p(A^{-1}X^{*'}Y, σ^2 A^{-1}),

where A = X^{*'}X^* + D_τ^{-1}.


For the MLSS method, we can use similar steps to obtain the full conditional posterior distributions.

Bibliography

[1] Kuo, Lynn, and Bani Mallick. "Variable selection for regression models." Sankhyā: The Indian Journal of Statistics, Series B (1998): 65-81.

[2] George, Edward I., and Robert E. McCulloch. ”Variable selection via Gibbs sampling.” Journal of the American Statistical Association 88.423 (1993): 881- 889.

[3] Dellaportas, Petros, Jonathan J. Forster, and Ioannis Ntzoufras. ”On Bayesian model and variable selection using MCMC.” Statistics and Computing 12.1 (2002): 27-36.

[4] Carlin, Bradley P., and Siddhartha Chib. ”Bayesian model choice via Markov chain Monte Carlo methods.” Journal of the Royal Statistical Society. Series B (Methodological) (1995): 473-484.

[5] Brown, Philip J., Marina Vannucci, and Tom Fearn. ”Multivariate Bayesian variable selection and prediction.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60.3 (1998): 627-641.

[6] Meuwissen, Theo H. E., and Mike E. Goddard. "Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data." Genet. Sel. Evol. 36 (2004): 261-279.


[7] Tibshirani, Robert. ”Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society. Series B (Methodological) (1996): 267-288.

[8] Park, Trevor, and George Casella. ”The bayesian lasso.” Journal of the American Statistical Association 103.482 (2008): 681-686.

[9] Fernandez, Carmen, and Mark F. J. Steel. "Bayesian regression analysis with scale mixtures of normals." Econometric Theory 16, no. 1 (2000): 80-101.

[10] George, Edward I., Robert E. McCulloch, and R. Tsay. ”Two approaches to Bayesian model selection with applications.” Bayesian Analysis in Statistics and Econometrics: Essays in Honor of Arnold Zellner 309 (1996): 339.

[11] Uimari, Pekka, and Ina Hoeschele. ”Mapping-linked quantitative trait loci using Bayesian analysis and Markov chain Monte Carlo algorithms.” Genetics 146.2 (1997): 735-743.

[12] Andrews, David F., and Colin L. Mallows. ”Scale mixtures of normal distribu- tions.” Journal of the Royal Statistical Society. Series B (Methodological) (1974): 99-102.

[13] Hand, David J., et al. A Handbook of Small Data Sets. Vol. 1. CRC Press, 1993.

[14] Geman, Stuart, and Donald Geman. "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images." IEEE Transactions on Pattern Analysis and Machine Intelligence 6 (1984): 721-741.
