Model Selection for Linear Models with SAS/STAT Software

Funda Güneş, SAS Institute Inc.

Outline

- Introduction
  - Analysis 1: Full model
- Traditional model selection methods
  - Analysis 2: Traditional stepwise selection
- Customizing the selection process
  - Analyses 3-6
- Compare Analyses 1-6
- Penalized regression methods
- Special methods
  - Model averaging
  - Selection for nonparametric models with spline effects

Learning Objectives

You will learn:
- Problems with the traditional selection methods
- Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
- Bootstrap-based model averaging to reduce selection bias and improve predictive performance

You will learn how to:
- Use model selection diagnostics, including graphics, for detecting problems
- Use validation data to detect and prevent under-fitting and over-fitting
- Customize the selection process using the features of the GLMSELECT procedure

Introduction

With improvements in technology, regression problems that involve large numbers of candidate predictor variables occur in a wide variety of scientific fields and business problems.

“I’ve got all these variables, but I don’t know which ones to use.”

Statistical model selection seeks an answer to this question.

Model Selection and Its Goals

Model selection: estimating the performance of different models in order to choose the approximately best one.

Goals of model selection:
- Simple and interpretable models
- Accurate predictions

Model selection is often a trade-off between bias and variance.

Graphical Illustration of Bias and Variance

[Figure: graphical illustration of bias and variance.]

Bias-Variance Trade-Off

Suppose Y = f(X) + ε, where ε ∼ N(0, σ²). The expected prediction error at a point x is

E[(Y − f̂(x))²] = Bias² + Variance + Irreducible Error
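Spelled out term by term (a standard decomposition, stated here for reference; f̂ denotes an estimator of f that is fit independently of the new observation):

E[(Y − f̂(x))²] = ( E[f̂(x)] − f(x) )² + E[ ( f̂(x) − E[f̂(x)] )² ] + σ²
               =         Bias²        +         Variance         + Irreducible Error

Flexible models tend to lower the bias term but raise the variance term; model selection tries to balance the two.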

The GLMSELECT Procedure

The GLMSELECT procedure implements model selection in the framework of general linear models and supports selection from a very large number of effects.

Methods include:
- Familiar methods such as forward, backward, and stepwise selection
- Newer methods such as the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996)

Difficulties of Model Selection

The implementation of model selection can lead to difficulties:
- A model selection technique produces a single answer to the variable selection problem, although several different subsets might be equally good for regression purposes.
- Model selection might be unduly affected by outliers.
- Selection bias

The GLMSELECT Procedure

PROC GLMSELECT can partially mitigate these problems with its:
- Extensive capabilities for customizing the selection
- Flexibility and power in specifying complex potential effects

Model Specification

The GLMSELECT procedure provides great flexibility for model specification:
- Choice of parameterizations for classification effects
- Any degree of interaction (crossed effects) and nested effects
- Internal partitioning of data into training, validation, and testing roles
- Hierarchy among effects

Selection Control

The GLMSELECT procedure provides many options for selection control:
- Multiple effect selection methods
- Selection from a very large number of effects (tens of thousands)
- Selection of individual levels of classification effects
- Effect selection based on a variety of selection criteria
- Stopping rules based on a variety of model evaluation criteria
- Leave-one-out and k-fold cross validation

Model

Suppose data arise from a normal distribution with the following statistical model:

Y = f(x) + ε

In linear regression,

f(x) = β_0 + β_1 x_1 + β_2 x_2 + ··· + β_p x_p

Least squares, the most popular estimation method, picks the coefficients β = (β_0, β_1, ..., β_p) that minimize the residual sum of squares (RSS):

RSS(β) = ∑_{i=1}^{N} ( y_i − β_0 − ∑_{j=1}^{p} x_ij β_j )²

PROC GLMSELECT with Examples

Learn how to use PROC GLMSELECT in model development with examples:
1. Simulate data.
2. Fit the full least squares model.
3. Perform model selection by using five different approaches.
4. Compare the selected models' performances.

Simulate Data

data trainingData testData;
   drop i j;
   array x{20} x1-x20;
   do i=1 to 5000;
      /* Continuous predictors */
      do j=1 to 20; x{j} = ranuni(1); end;

      /* Classification variables */
      c1 = int(1.5+ranuni(1)*7);
      c2 = 1 + mod(i,3);
      c3 = int(ranuni(1)*15);

      yTrue = 2 + 5*x17 - 8*x5 + 7*x9*c2 - 7*x1*x2
                + 6*(c1=2) + 5*(c1=5);
      y = yTrue + 6*rannor(1);

      if ranuni(1) < 2/3 then output trainingData;
      else output testData;
   end;
run;

This step reserves one-third of the data as test data and the remaining two-thirds as training data.

Training and Test Data

Use training data to develop a model. Use test data to assess your model’s predictive performance.

Analysis 1: Full Least Squares Model

proc glmselect data=trainingData testdata=testData plots=asePlot;
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / selection=forward(stop=none);
run;

Because STOP=NONE is specified, the selection proceeds until all the specified effects are in the model.

Dimensions of the Full Least Squares Model

Class Level Information

Class  Levels  Values
c1     8       1 2 3 4 5 6 7 8
c2     3       1 2 3
c3     15      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Dimensions

Number of Effects     278
Number of Parameters  947

A full model that contains all main effects and their two-way interactions often leads to a large number of effects. When the classification variables have many levels, the number of parameters available for selection is even larger.

Assess Your Model's Predictive Performance

You can assess the model’s predictive performance by comparing the average square error (ASE) on the test data and the training data:

ASE on the test data:

ASE_test = ∑_{i=1}^{n_test} ( y_test,i − ( β̂_0 + ∑_{j=1}^{p−1} β̂_j x_test,ij ) )² / n_test

where the β̂_j's are the least squares estimates obtained by using the observations in the training data.

Average Square Error (ASE) Plot

proc glmselect ... plots=asePlot;

[Figure: "Progression of Average Squared Errors by Role for y." Average squared error (roughly 40-80) is plotted against the selection step (0-250) for the training and test data, with the selected step marked.]

Overfitting and Variable Selection

So the more variables the better? NO!

Carefully selected variables can improve model accuracy. But adding too many features can lead to overfitting:
- Overfitted models describe random error or noise instead of the underlying relationship.
- Overfitted models generally have poor predictive performance.

Model selection can prove useful in finding a parsimonious model that has good predictive performance.

Model Assessment

Model assessment aims to:
1. Choose the number of predictors for a given technique.
2. Estimate the prediction ability of the chosen model.

For both of these purposes, the best approach is to evaluate the procedure on independent test data, if available. If possible, use different test data for (1) and (2): a validation set for (1) and a test set for (2).

Model Selection for Linear Regression Models

Suppose you have only two models to compare. Then you can use the following methods for model comparison:
- F test
- Likelihood ratio test
- AIC, SBC, and so on
- Cross validation

However, we usually have more than two models to compare! For a model selection problem with p predictors, there are 2^p models to compare. (Even p = 20 predictors yield more than a million candidate models.)

Alternatives

Compare all possible subsets (all-subsets regression):
- Computationally expensive
- Introduces a large selection bias!

Use search algorithms:
- Traditional selection methods: forward, backward, and stepwise
- Shrinkage and penalty methods

TRADITIONAL SELECTION METHODS

Traditional Selection Methods

- Forward selection: Begins with just the intercept and at each step adds the effect that shows the largest contribution to the model.
- Backward elimination: Begins with the full model and at each step deletes the effect that shows the smallest contribution to the model.
- Stepwise selection: A modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there.

PROC GLMSELECT extends these methods as implemented in the REG procedure.

The SELECTION= option of the MODEL statement specifies the model selection method (see the sketch below).
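A minimal sketch (main effects only for brevity; the data set and variables are those simulated earlier, and the SLE=/SLS= values shown are illustrative, not recommendations):

proc glmselect data=trainingData;
   class c1 c2 c3;
   /* stepwise selection driven by significance levels,
      with explicit entry and stay thresholds */
   model y = c1 c2 c3 x1-x20
         / selection=stepwise(select=sl sle=0.05 sls=0.05);
run;

Substituting selection=forward(...) or selection=backward(...) requests the other two traditional methods.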

Traditional Selection Methods

In traditional selection methods:
- The F statistic and the related p-value reflect an effect's contribution to the model.
- You choose the predictors and then estimate the coefficients by using standard criteria such as least squares or maximum likelihood.

There are problems with the use of both the F statistic and the coefficient estimation!

Analysis 2: Traditional Stepwise Selection

proc glmselect data=analysisData testdata=testData
               plots=(CoefficientPanel(unpack) asePlot Criteria);
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / selection=stepwise(select=sl);
run;

The SELECT=SL option uses the significance level criterion to determine the order in which effects enter or leave the model.

Selection Summary

Stepwise Selection Summary

Step  Effect Entered  Effects In  Parms In  ASE      Test ASE  F Value  Pr > F
0     Intercept        1            1       81.1588  81.4026     0.00   1.0000
1     x9*c2            2            4       52.7431  52.5172   608.97   <.0001
2     x5*c1            3           12       42.1868  43.7057   105.82   <.0001
3     x1*x2            4           13       39.5841  41.9502   222.37   <.0001
4     x17*c1           5           21       36.7616  37.9282    32.38   <.0001
5     c1               6           28       35.7939  36.9287    13.00   <.0001
6     x1*x10           7           29       35.6546  36.8501    13.15   0.0003
7     x9*c1            8           36       35.4158  36.9118     3.24   0.0020
8     x7*c2            9           39       35.2761  37.0700     4.43   0.0041
9     x1*x9           10           40       35.2053  37.1232     6.74   0.0094
10    x9*x12          11           41       35.1544  37.1811     4.86   0.0276
11    x11*x12         12           42       35.1016  37.2530     5.04   0.0249
12    x10*c3          13           57       34.8201  37.6486     1.80   0.0293
13    x8*x11          14           58       34.7753  37.6597     4.31   0.0381
14    x7*x9           15           59       34.7390  37.7003     3.49   0.0620
15    c1*c3           16          171       33.3332  39.8716     1.21   0.0652
16    x3*c1           17          179       33.1678  40.0014     2.00   0.0422
17    c2*c3           18          209       32.7376  40.5367     1.40   0.0749
18    x7*x12          19          210       32.7107  40.4731     2.62   0.1059
19    x3*x12          20          211       32.6825  40.4880     2.75   0.0974
20    x15*c3          21          226       32.4562  40.6799     1.47   0.1060
21    x9*x15          22          227       32.3769  40.7047     7.75   0.0054
22    x2*x17          23          228       32.3523  40.7012     2.41   0.1207
23    x2*x5           24          229       32.2996  40.7218     5.17   0.0231
24    x3*x4           25          230       32.2755  40.6853     2.36   0.1243
25    x4*x12          26          231       32.2474  40.7480     2.75   0.0973
26    x2*x8           27          232       32.2240  40.7776     2.30   0.1294

(No effects were removed at any step, so the Effect Removed column is omitted.)

Stopping Details

Stop Details

For      Candidate Effect  Candidate Significance  Compare Significance
Entry    x1*c2             0.1536                  > 0.1500 (SLE)
Removal  x2*x8             0.1294                  < 0.1500 (SLS)

The stepwise selection terminates when these two conditions are satisfied:
- None of the effects outside the model is significant at the entry significance level (SLE=0.15).
- Every effect in the model is significant at the stay significance level (SLS=0.15).

The Sequence of p-Values at Each Step

p-Values are not monotone increasing.

The Selected Model Overfits the Training Data

The default SLE and SLS value of 0.15 produces a model that overfits the training data.

Most Other Criteria Suggest Stopping the Selection before the Significance Level Criterion Is Reached

CUSTOMIZING THE SELECTION PROCESS

Customize the Selection Process by Using Various Criteria

You can use the following options to customize the selection process:

SELECT=criterion: Specifies the order in which effects enter or leave at each step of the specified selection method.
STOP=criterion: Specifies when to stop the selection process.
CHOOSE=criterion: Specifies the final model from the list of models at the steps of the selection process.

Example

model ... / selection=forward(select=sbc stop=aic choose=validate);

Criteria Based on the Likelihood Function

The following criteria are based on the likelihood function and are available for the SELECT=, STOP=, and CHOOSE= options:
- ADJRSQ: adjusted R-square
- CP: Mallows' Cp statistic
- Fit criteria:
  - AIC: Akaike's information criterion
  - AICC: corrected Akaike's information criterion
  - SBC: Schwarz Bayesian information criterion

Criteria Based on Estimating the True Prediction Error

The following criteria are based on estimating the true prediction error and are available for the SELECT=, STOP=, and CHOOSE= options:
- If you have enough data, set aside a validation data set:
  - VALIDATE: ASE on validation data
- If data are scarce, use cross validation:
  - CV: k-fold cross validation
  - PRESS: leave-one-out cross validation

Data Roles: Training, Validation, and Testing

Training data:
- Always used to find parameter estimates
- Can also be used to select effects, stop selection, and choose the final model

Validation data:
- Play a role in the selection process
- Can be used for one or more of the following:
  - selecting effects to add or drop (SELECT=VALIDATE)
  - stopping the selection process (STOP=VALIDATE)
  - choosing the final model (CHOOSE=VALIDATE)

Test data:
- Used to assess the predictive performance of models in the selection process, but do not affect the selection process

Specifying Data Sets for Different Roles

PROC GLMSELECT statement options (see the sketch below):
- DATA= specifies the training data
- VALDATA= specifies the validation data
- TESTDATA= specifies the test data
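For example, assuming you have already divided your data into three data sets (the names here are hypothetical), the roles are assigned as follows; only the validation data influences the selection, while the test data is scored for reporting only:

proc glmselect data=trainData valdata=valData testdata=testData;
   class c1 c2 c3;
   /* validation ASE decides when to stop */
   model y = c1 c2 c3 x1-x20 / selection=stepwise(stop=validate);
run;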

PARTITION Statement

Another way to specify data sets for different roles is to use the PARTITION statement, which internally partitions the DATA= data set:

Randomly in specified proportions:

partition fraction(validate=0.3 test=0.2);

Based on the formatted value of the ROLEVAR= variable:

partition rolevar=myVar (train='group a' validate='group b' test='group c');

Cross Validation

When data are scarce, setting aside validation or test data is usually not possible. Cross validation uses part of the training data to fit the model, and a different part to estimate the prediction error.

k-Fold Cross Validation

- Split the data into k approximately equal-sized parts.
- Reserve one part of the data for validation, and fit the model to the remaining k − 1 parts.
- Use this model to calculate the prediction error for the reserved part of the data.
- Do this for all k parts, and combine the k estimates of the prediction error.

5-fold cross validation (a request sketch follows below):

[Figure: schematic of a 5-fold split of the training data.]
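A minimal sketch of requesting k-fold cross validation in PROC GLMSELECT (here k = 5; the CVMETHOD= option controls how observations are assigned to folds, and RANDOM(5) assigns them at random):

proc glmselect data=trainingData;
   class c1 c2 c3;
   /* the cross validation prediction error chooses the final model */
   model y = c1 c2 c3 x1-x20
         / selection=stepwise(select=sbc choose=cv)
           cvmethod=random(5);
run;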

Choice of k

As an estimator of the true prediction error, cross validation tends to have decreasing bias but increasing variance as k increases.

A typical choice of k is 5 or 10.

Leave-One-Out Cross Validation

Leave-one-out cross validation is a special case of k-fold cross validation in which k = n, the total number of observations in the training data set.
- Each omitted part consists of one observation.
- The predicted residual sum of squares (PRESS) can be calculated efficiently, without refitting the model n times.
- The estimate is approximately unbiased for the true prediction error but can have high variance, because the n "training sets" are so similar to one another.
- You can request leave-one-out cross validation by specifying PRESS instead of CV in the SELECT=, CHOOSE=, and STOP= suboptions.

Analysis 3: Traditional Stepwise Selection with CHOOSE=PRESS

proc glmselect data=analysisData testdata=testData
               plots=(asePlot Criteria);
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / selection=stepwise(select=sl choose=press);
run;

The CHOOSE=PRESS option requests that, among the models obtained at the steps of the selection process, the final selected model be the one that has the smallest leave-one-out predicted residual sum of squares.

Criterion Panel

The final selected model is the model that has the smallest leave-one-out predicted residual sum of squares (PRESS).

ASE Plot

By choosing the model at step 14, you limit the overfitting of the training data that occurs when selection proceeds beyond this step.

Problems in the Traditional Implementations of Forward, Backward, and Stepwise Selection

Traditional implementations of forward, backward, and stepwise selection methods are based on sequential hypothesis testing at a specified significance level. However:
- The F statistic might not follow an F distribution. Hence p-values cannot reliably be viewed as probabilities.
- A prespecified significance limit is not a data-driven criterion. Hence the same significance level can cause overfitting for some data and underfitting for other data.

Solution: Modify the Selection Process!

Replace the hypothesis-testing-based approach: use data-driven criteria, such as information criteria, cross validation, or validation data, instead of F statistics.

Analysis 4: Default Stepwise Selection (SELECT=SBC)

By default, PROC GLMSELECT uses stepwise selection with the SELECT=SBC and STOP=SBC options.

proc glmselect data=analysisData testdata=testData
               plots=(asePlot Coefficients Criteria);
   class c1 c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2;
run;

Data Set                   WORK.ANALYSISDATA
Test Data Set              WORK.TESTDATA
Dependent Variable         y
Selection Method           Stepwise
Select Criterion           SBC
Stop Criterion             SBC
Effect Hierarchy Enforced  None

Selection Summary

Stepwise Selection Summary

Step  Effect Entered  Effects In  Parms In  SBC          ASE      Test ASE
0     Intercept       1            1        14933.9344   81.1588  81.4026
1     x9*c2           2            4        13495.1664   52.7431  52.5172
2     x5*c1           3           12        12802.0131   42.1868  43.7057
3     x1*x2           4           13        12593.9481   39.5841  41.9502
4     x17*c1          5           21        12407.8500   36.7616  37.9282
5     c1              6           28        12374.1967   35.7939  36.9287
6     x1*x10          7           29        12369.0918*  35.6546  36.8501

* Optimal value of criterion

Stepwise selection terminates when adding or dropping any effect increases the SBC statistic (≈ 12369).

Stop Details

For      Candidate Effect  Candidate SBC  Compare SBC
Entry    x1*x9             12370.6006     > 12369.0918
Removal  x1*x10            12374.1967     > 12369.0918

Coefficient Progression Plot

Classification effects join the model along with all their levels.

Parameter Estimates for the Classification Variable c1

Only levels 2 and 5 of the classification effect c1 contribute appreciably to the model.

Analysis 5: Stepwise Selection with a Split Classification Variable

A more parsimonious model that has similar or better predictive power might be obtained if parameters that correspond to the levels of c1 are allowed to enter or leave the model independently:

proc glmselect data=analysisData testdata=testData;
   class c1(split) c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / orderSelect;
run;

Dimensions Table

Dimensions

Number of Effects               278
Number of Effects after Splits  439
Number of Parameters            947

After splitting, 439 split effects are considered for entry or removal at each step of the selection process.

Parameter Estimates

Selected Model: Parameter Estimates

Parameter  DF  Estimate    Standard Error  t Value
Intercept  1    2.763669   0.360337          7.67
x9*c2 1    1    6.677365   0.440050         15.17
x9*c2 2    1   13.793766   0.431579         31.96
x9*c2 3    1   21.082776   0.439905         47.93
x5         1   -8.250059   0.353952        -23.31
c1_2       1    6.062842   0.295250         20.53
x1*x2      1   -6.386971   0.519767        -12.29
x17        1    4.801696   0.357801         13.42
c1_5       1    5.053642   0.295384         17.11
x1*x10     1   -1.964001   0.534991         -3.67

The selected model contains only two parameters for c1 instead of all eight levels.

Split Model versus Nonsplit Model

The split model provides:
- Fewer degrees of freedom (10 versus 29 for the nonsplit model)
- Improved prediction performance (ASE on the test data: 36.68 versus 36.85 for the nonsplit model)

Analysis 6: Stepwise Selection with Internally Partitioned Data and STOP=VALIDATE

proc glmselect data=AnalysisData testdata=TestData;
   partition fraction(validate=.25);
   class c1(split) c2 c3;
   model y = c1|c2|c3|x1|x2|x3|x4|x5|x6|x7|x8|x9|x10
             |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2
         / selection=stepwise(stop=validate);
run;

The PARTITION statement randomly reserves one-quarter of the observations in the AnalysisData for model validation and the rest for model training. The STOP=VALIDATE suboption requests that the selection process terminate when adding or dropping any effect increases the average square error on the validation data.

Number of Observations for Each Role

Observation Profile for Analysis Data

Number of Observations Read                 3395
Number of Observations Used                 3395
Number of Observations Used for Training    2576
Number of Observations Used for Validation   819

Observation Profile for Test Data

Number of Observations Read  1605
Number of Observations Used  1605

Average Square Errors by Role

Desirable behavior!

COMPARE ANALYSES 1-6

Summary of All Analyses

Analysis                              SELECTION=  Suboptions                CLASS
1. Full least squares                 FORWARD     STOP=NONE                 C1 C2 C3
2. Traditional                        STEPWISE    SELECT=SL                 C1 C2 C3
3. Traditional with leave-one-out CV  STEPWISE    SELECT=SL CHOOSE=PRESS    C1 C2 C3
4. Default                            STEPWISE    SELECT=SBC                C1 C2 C3
5. Default with split                 STEPWISE    SELECT=SBC                C1(SPLIT) C2 C3
6. Default with split and validate    STEPWISE    SELECT=SBC STOP=VALIDATE  C1(SPLIT) C2 C3

Predictive Performance Comparison

Analysis                              Effects  Parms  Containment of Exact Effects  Train ASE  Test ASE
1. Full least squares                 274      834    5                             26.73      49.28
2. Traditional                        26       231    3                             32.22      40.78
3. Traditional with leave-one-out CV  14        58    3                             34.74      37.70
4. Default                            6         28    2                             35.65      36.85
5. Default with split                 7          9    5                             35.77      36.68
6. Default with split and validate    6          9    5                             34.72      36.78
True model                            5          8    5                             35.96      36.73

Predictive Performance Comparison

Analysis 5 and Analysis 6:
- Yield more parsimonious models
- Capture all the effects in the true model
- Have good predictive performance on the test data set

Careful and Informed Use of Subset Selection Methods Is OK!

Despite the difficulties, careful and informed use of traditional variable selection methods still has its place in data analysis. Example: Foster and Stine (2004) use a modified version of stepwise selection to build a predictive model for bankruptcy from over 67,000 possible predictors and show that this method yields a model whose predictions compare favorably with those of other recently developed data mining tools.

Subset Selection Is Bad

“Stepwise variable selection has been a very popular technique for many years, but if this procedure had just been proposed as a statistical method, it would most likely be rejected because it violates every principle of statistical estimation and hypothesis testing.”

—Frank Harrell, Regression Modeling Strategies, 2001

Problems with Subset Selection Methods

Problems arise in both variable selection and coefficient estimation:
- The algorithms are greedy: they make the best change at each step, regardless of future effects.
- The coefficients and predictions are unstable, especially when there are correlated predictors or when the number of input variables greatly exceeds the number of observations.

Alternative: Penalized regression methods

PENALIZED REGRESSION METHODS

Shrinkage and Penalty Methods

Penalized regression methods often introduce bias, but they improve prediction accuracy because of the bias-variance trade-off. Commonly used shrinkage and penalty methods include:
- Ridge regression
- LASSO (Tibshirani 1996)
- Adaptive LASSO (Zou 2006)
- Elastic net (Zou and Hastie 2005)

The penalized regression estimate is defined as the value of β that minimizes

∑_{i=1}^{n} ( y_i − (Xβ)_i )²

subject to a constraint on the coefficients:

Ridge:        ∑_{j=1}^{p} β_j² ≤ t1
LASSO:        ∑_{j=1}^{p} |β_j| ≤ t2
Elastic net:  ∑_{j=1}^{p} β_j² ≤ t1 and ∑_{j=1}^{p} |β_j| ≤ t2

where t1 and t2 are the penalty parameters.

Motivation for Ridge Regression

When the number of input variables exceeds the number of observations, least squares estimation has the following drawbacks:
- Estimates are not unique.
- The resulting model heavily overfits the data.

When there are correlated variables, the least squares estimates have high variance.

These problems call for extended statistical methodologies!

Ridge Regression

β̂_ridge = argmin_β [ ∑_{i=1}^{n} ( y_i − (Xβ)_i )² + λ ∑_{j=1}^{p} β_j² ]

- There is a one-to-one correspondence between λ and t1:
  - as λ → 0, β̂_ridge → β̂_OLS
  - as λ → ∞, β̂_ridge → 0
- Estimates shrink toward zero but never reach zero, so ridge regression does not provide variable selection.
- Introduces bias but reduces the variance of the estimate
- Most useful in the presence of multicollinearity
- The solution has a closed analytical form.
- Available in PROC REG (see the sketch below)
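A minimal PROC REG sketch (using the prostate data introduced a few slides later; the λ grid is illustrative). The RIDGE= option computes ridge estimates of the form (XᵀX + λI)⁻¹Xᵀy over the requested grid and writes one row of coefficients per λ to the OUTEST= data set:

proc reg data=prostate outest=ridgeEstimates ridge=0 to 0.1 by 0.01;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45;
run;

proc print data=ridgeEstimates;  /* inspect how the coefficients shrink as the ridge value grows */
run;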

Shrinkage and Penalty Methods in PROC GLMSELECT

- LASSO and adaptive LASSO are available in PROC GLMSELECT.
- They are simultaneous estimation and variable selection techniques.
- They effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to zero.

Defining the LASSO

For a given tuning parameter t ≥ 0,

argmin_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )²

subject to:

∑_{j=1}^{p} |β_j| ≤ t

The parameter t ≥ 0 controls the amount of shrinkage: any t < ∑_{j=1}^{p} |β̂_j^OLS| causes shrinkage of the solutions toward 0.

Geometric Interpretation

The solid blue areas are the constraint regions, and the red ellipses are the contours of the least squares error function.

The Penalty Parameter Must Be Set to Obtain the Final Solution!

How do you determine the penalty parameter?
- Use criteria based on the likelihood function, such as adjusted R-square, CP, AIC, AICC, BIC, and SBC.
- Use criteria based on estimating the true prediction error, such as a validation data set or cross validation (a sketch follows below).
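For example, a sketch that lets 5-fold cross validation pick the penalty: CHOOSE=CV selects the step on the LASSO coefficient path that has the smallest estimated prediction error (the variable names are from the prostate data introduced on the next slide):

proc glmselect data=prostate;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=lasso(stop=none choose=cv)
           cvmethod=random(5);
run;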

Prostate Data

The data come from a study by Stamey et al. (1989):
- 97 observations
- The response variable is the level of prostate-specific antigen (lpsa).
- The predictors are the following clinical measures:
  - log of cancer volume (lcavol)
  - log of prostate weight (lweight)
  - age (age)
  - log of the amount of benign prostatic hyperplasia (lbph)
  - seminal vesicle invasion (svi)
  - log of capsular penetration (lcp)
  - Gleason score (gleason)
  - percentage of Gleason scores of 4 or 5 (pgg45)

Prostate Data

data Prostate;
   input lcavol lweight age lbph svi lcp gleason pgg45 lpsa;
   datalines;
-0.58 2.769 50 -1.39 0 -1.39 6 0 -0.43
-0.99 3.32 58 -1.39 0 -1.39 6 0 -0.16
-0.51 2.691 74 -1.39 0 -1.39 7 20 -0.16
   ... more data lines ...
2.883 3.774 68 1.558 1 1.558 7 80 5.478
3.472 3.975 68 0.438 1 2.904 7 20 5.583
;

LASSO Using PROC GLMSELECT

proc glmselect data=prostate plots=all;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=lasso(stop=none choose=sbc);
run;

Note that the SELECT= suboption is not valid with the LAR and LASSO methods.

The PLOTS=ALL option requests all the plots that come with the analysis.

LASSO Coefficients

Fit Criteria for LASSO Selection

LASSO Estimates

Estimates are calculated using the least angle regression (LARS) algorithm (Efron et al. 2004). Standard errors of the coefficients are not immediately available.

Adaptive LASSO

LASSO has a non-ignorable bias when it estimates the nonzero coefficients (Fan and Li 2001). Adaptive LASSO produces unbiased estimates by allowing a relatively higher penalty for zero coefficients and a lower penalty for nonzero coefficients (Zou 2006).

Adaptive LASSO

The adaptive LASSO is a modification of LASSO selection:

argmin_β ∑_{i=1}^{n} ( y_i − (Xβ)_i )²

subject to:

∑_{j=1}^{p} |β_j| / |β̂_j| ≤ t

Adaptive weights ( 1 / |β̂_j| ) are applied to each parameter in forming the LASSO constraint, which shrinks the zero coefficients more than the nonzero coefficients.

Adaptive LASSO Using PROC GLMSELECT

proc glmselect data=prostate plots=all;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=lasso(adaptive stop=none choose=sbc);
run;

For the β̂_j in the adaptive weights:
- By default, estimates of the parameters in the model are used.
- You can use the INEST= option to name a SAS data set that contains estimates for the β̂_j's (a sketch follows below).
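A hedged sketch of supplying your own estimates; it assumes that INEST= accepts an OUTEST=-style data set such as PROC REG produces and that it is given as a suboption of ADAPTIVE (check the GLMSELECT documentation for the exact placement of INEST= in your release):

/* 1. A full least squares fit provides the weights */
proc reg data=prostate outest=olsEstimates;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45;
run;

/* 2. Adaptive LASSO with the user-supplied weights */
proc glmselect data=prostate;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=lasso(adaptive(inest=olsEstimates) stop=none choose=sbc);
run;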

Adaptive LASSO Coefficients

Although the solution paths are slightly different, LASSO and adaptive LASSO agree on the final model that is chosen by the SBC.

LASSO or Adaptive LASSO?

- Adaptive LASSO enjoys the computational advantage of the LASSO.
- Because of the variance and bias trade-off, adaptive LASSO might not result in optimal prediction performance (Zou 2006).
- Hence LASSO can still be advantageous in difficult prediction problems.

Prefer Modern over Traditional Selection Methods

Modern penalized methods:
- Effectively perform variable selection by modifying the coefficient estimation and reducing some coefficients to 0
- Improve stability and prediction
- Are supported by substantial theoretical work

SPECIAL METHODS: MODEL AVERAGING; SELECTION FOR NONPARAMETRIC MODELS WITH SPLINE EFFECTS

MODEL AVERAGING

Model Averaging

Another way to deal with shortcomings of traditional selection methods is based on model averaging.

Model averaging repeats model selection for multiple training sets and uses the average of the selected models for prediction. It can provide:
- More stable inferences
- Less selection bias

Model Averaging with the Bootstrap Method

- Sample the data with replacement.
- Select a model for each sample.
- Average the predictions across the samples.
- Form an average model by averaging the parameters across the samples:
  - Predictions with the average model = average of the predictions
  - Averaging shrinks infrequently selected parameters.
  - Disadvantage: The average model is usually not parsimonious.

Model Averaging for Prostate Data

A popular way to obtain a parsimonious average model is to form the average of just the frequently selected models.

proc glmselect data=prostate plots=all;
   model lpsa = lcavol lweight age lbph svi lcp gleason pgg45
         / selection=lasso(stop=none choose=sbc);
   modelAverage nsamples=10000 subset(best=1);
run;

SUBSET(BEST=1) specifies that only the most frequently selected model be used in forming the average model.

Model Selection

Model Selection Frequency

Times     Selection   Number of  Frequency
Selected  Percentage  Effects    Score      Effects in Model
1416      14.16        4         1417       Intercept lcavol lweight svi
 886       8.86        5          886.9     Intercept lcavol lweight svi pgg45 *
 882       8.82        6          882.8     Intercept lcavol lweight lbph svi pgg45 *
 842       8.42        7          842.8     Intercept lcavol lweight age lbph svi pgg45 *
 532       5.32        8          532.7     Intercept lcavol lweight age lbph svi gleason pgg45 *
 525       5.25        5          525.9     Intercept lcavol lweight lbph svi *
 524       5.24        7          524.8     Intercept lcavol lweight age lbph svi gleason *
 468       4.68        9          468.7     Intercept lcavol lweight age lbph svi lcp gleason pgg45 *
 451       4.51        8          451.7     Intercept lcavol lweight age lbph svi lcp pgg45 *
 421       4.21        6          421.8     Intercept lcavol lweight age lbph svi *
 299       2.99        7          299.8     Intercept lcavol lweight lbph svi gleason pgg45 *
 270       2.70        5          270.9     Intercept lcavol lweight svi gleason *
 259       2.59        5          259.9     Intercept lcavol lweight age svi *
 254       2.54        6          254.8     Intercept lcavol lweight lbph svi gleason *
 189       1.89        5          189.8     Intercept lcavol lweight svi lcp *
 156       1.56        8          156.7     Intercept lcavol lweight age lbph svi lcp gleason *
 153       1.53        6          153.8     Intercept lcavol lweight svi gleason pgg45 *
 148       1.48        6          148.8     Intercept lcavol lweight age svi pgg45 *
 136       1.36        7          136.8     Intercept lcavol lweight lbph svi lcp pgg45 *
 119       1.19        7          119.7     Intercept lcavol lweight age lbph svi lcp *

* Not used in model averaging

Models with the regressors "Intercept, lcavol, lweight, svi" are appropriate for these data.

Average Parameter Estimates

Average Parameter Estimates

                                                              Estimate Quantiles
Parameter  Number    Non-zero    Estimate   Standard   25%        50%        75%
           Non-zero  Percentage             Deviation
Intercept  1416      100.00      -0.023901  0.673999   -0.450106  -0.030497  0.462740
lcavol     1416      100.00       0.481555  0.067986    0.435344   0.481252  0.528441
lweight    1416      100.00       0.482178  0.177453    0.351844   0.485739  0.592714
svi        1416      100.00       0.508385  0.189062    0.373536   0.499117  0.640241

Parameter Estimate Distributions

[Figure: "Parameter Estimate Distributions for lpsa" (number of samples = 10000). Four histograms of the 1416 bootstrap estimates, one each for Intercept, lcavol, lweight, and svi; the vertical axis is percent.]

You can interpret the range between the 5th and 95th percentiles of each estimate as an approximate 90% confidence interval for that estimate.

Effects Selected in at Least 20% of the Samples

[Figure: "Effects Selected in at Least 20% of the Samples for lpsa." Horizontal bar chart of selection percentage (0-100) by effect; effects shown: lcp, gleason, age, pgg45, lbph, svi, lweight, lcavol.]

You can build another parsimonious model by using the frequency of effect selection as a measure of effect importance.

SELECTION FOR NONPARAMETRIC MODELS WITH SPLINE EFFECTS

Noisy Sinusoidal Data

The true response function might be a complicated nonlinear transformation of the inputs.

Moving beyond Linearity

One way to incorporate this nonlinearity into the model:
- Create additional variables that are transformations of the input variables.
- Use these additional variables to form the new design matrix.
- Use linear models in this new space of derived inputs.

The EFFECT Statement in PROC GLMSELECT

The EFFECT statement:
- Enables you to construct special collections of columns for design matrices.
- Provides support for splines of any degree, including the cubic B-spline basis (the default).
  - A spline function is a piecewise polynomial function in which the individual polynomials have the same degree and connect smoothly at join points (knots).
  - You can associate local features in your data with particular B-spline basis functions.

Selection Using Spline Effects

Spline function bases provide a computationally convenient and flexible way to specify a rich set of basis functions. Variable selection can be useful for obtaining a parsimonious subset to prevent overfitting.

Smoothing with a Spline Effect

proc glmselect data=Sine;
   effect spl = spline(x / knotmethod=equal(4) split);
   model noisySine = spl;
   output out=sineOut p=predicted;
run;

The EFFECT statement creates a constructed effect named "spl" that consists of the eight cubic B-spline basis functions that correspond to the four equally spaced internal knots.

Out of the eight B-splines, five are selected.

Smoothing Noisy Sinusoidal Data

A B-spline basis with about four internal knots is appropriate.

Noisy Sinusoidal Data with Bumps

Can you capture the bumps with a finer set of knots?

Noisy Sinusoidal Data with Bumps

You can capture the bumps at the expense of overfitting the data in the regions between the bumps.

Solution: B-Spline Bases at Multiple Scales

The following statements perform effect selection from several sets of B-spline bases that correspond to different scales in the data.

proc glmselect data=DoJoBumps;
   effect spl = spline(x / knotmethod=multiscale(endscale=8) split details);
   model noisyBumpsSine = spl;
run;

The ENDSCALE=8 option requests that the finest scale use cubic B-splines defined on 2^8 = 256 equally spaced knots in the interval [0, 1].

Out of 548 B-splines, 27 are selected.

Smoothing Noisy Sinusoidal Data with Bumps

Accurately captures both the low-frequency sinusoidal baseline and the bumps, without overfitting the regions between the bumps.

PROC ADAPTIVEREG in SAS/STAT 12.1

The multivariate adaptive regression splines method is a nonparametric approach for modeling high-dimensional data:
- Introduced by Friedman (1991)
- Combines regression splines and model selection methods
- Doesn't require you to specify knots to construct the regression spline terms
- Automatically models nonlinearities and interactions

A sketch follows below.
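A minimal hedged sketch on the noisy sinusoidal data used earlier (the data set and variable names reuse that example; the OUTPUT statement is assumed to follow the same keyword style as PROC GLMSELECT's):

proc adaptivereg data=Sine plots=all;
   model noisySine = x;              /* knots and basis terms are chosen automatically */
   output out=adaptOut p=predicted;  /* scored data for plotting the fit */
run;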

PROC QUANTSELECT in SAS/STAT 12.1 (Experimental)

The QUANTSELECT procedure performs model selection for quantile regression:
- Forward, backward, stepwise, and LASSO selection methods
- Variable selection criteria: AIC, SBC, AICC, and so on
- Variable selection for both individual quantiles and the quantile process
- EFFECT statement for constructed model effects (splines)
- Experimental in SAS/STAT 12.1

SUMMARY

Summary

The GLMSELECT procedure supports a variety of model selection methods for general linear models:
- Traditional model selection methods
- Modern selection methods
- Model averaging
- Nonparametric modeling by using spline effects

In doing so, it offers:
- Extensive capabilities for customizing the selection
- Flexibility and power in specifying complex potential effects

Back to Learning Objectives

You learned:
- Problems with the traditional selection methods
- Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods
- Bootstrap-based model averaging to reduce selection bias and improve predictive performance

You learned how to:
- Use model selection diagnostics, including graphics, for detecting problems
- Use validation data to detect and prevent under-fitting and over-fitting
- Customize the selection process using the features of the GLMSELECT procedure

Useful References

Cohen, R. 2006. "Introducing the GLMSELECT Procedure for Model Selection." Proceedings of SAS Global Forum 2006 Conference. Cary, NC: SAS Institute Inc.

Cohen, R. 2009. "Applications of GLMSELECT Procedure for Megamodel Selection." Proceedings of SAS Global Forum 2009 Conference. Cary, NC: SAS Institute Inc.

Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. 2004. "Least Angle Regression (with Discussion)." Annals of Statistics 32:407-499.

Eilers, P. H. C. and Marx, B. D. 1996. "Flexible Smoothing with B-Splines and Penalties (with Discussion)." Statistical Science 11:89-121.

Fan, J. and Li, R. 2001. "Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties." Journal of the American Statistical Association 96:1348-1360.

Foster, D. P. and Stine, R. A. 2004. "Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy." Journal of the American Statistical Association 99:303-313.

Friedman, J. 1991. "Multivariate Adaptive Regression Splines." Annals of Statistics 19:1-67.

Harrell, F. 2001. Regression Modeling Strategies. New York: Springer-Verlag.

Hastie, T., Tibshirani, R., and Friedman, J. 2001. The Elements of Statistical Learning. New York: Springer-Verlag.

Miller, A. 2002. Subset Selection in Regression. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Tibshirani, R. 1996. "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B 58:267-288.

Zou, H. 2006. "The Adaptive Lasso and Its Oracle Properties." Journal of the American Statistical Association 101:1418-1429.

Zou, H. and Hastie, T. 2005. "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society, Series B 67:301-320.
