Model Selection for Linear Models with SAS/STAT Software

Model Selection for Linear Models with SAS/STAT Software Funda Güne¸s SAS Institute Inc. Outline Introduction I Analysis 1: Full least squares model Traditional model selection methods I Analysis 2: Traditional stepwise selection Customizing the selection process I Analysis 3{6 Compare analyses 1{6 Penalized regression methods Special methods I Model averaging I Selection for nonparametric models with spline effects 1 / 115 Learning Objectives You will learn: Problems with the tradition selection methods Modern penalty-based methods, including LASSO and adaptive LASSO, as alternatives to traditional methods Bootstrap-based model averaging to reduce selection bias and improve predictive performance You will learn how to: Use model selection diagnostics, including graphics, for detecting problems Use validation data to detect and prevent under-fitting and over-fitting Customize the selection process using the features of the GLMSELECT procedure 2 / 115 Introduction Introduction With improvements in data collection technologies, regression problems that have large numbers of candidate predictor variables occur in a wide variety of scientific fields and business problems. \I've got all these variables, but I don't know which ones to use." Statistical model selection seeks an answer to this question. 3 / 115 Introduction Model Selection and Its Goals Model Selection: Estimating the performance of different models in order to choose the approximate best model. Goals of model selection: Simple and interpretable models Accurate predictions Model selection is often a trade-off between bias and variance. 4 / 115 Introduction Graphical Illustration of Bias and Variance 5 / 115 Introduction Bias-Variance Trade-Off 2 Suppose Y = f (X ) + , where ∼ N(0; σ ). Expected prediction error at a point x is E[(Y − f^(x))2] = Bias2 + Variance + Irreducible Error 6 / 115 Introduction The GLMSELECT Procedure The GLMSELECT procedure implements statistical model selection in the framework of general linear models for selection from a very large number of effects. Methods include: Familiar methods such as forward, backward, and stepwise selection Newer methods such as least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996) 7 / 115 Introduction Difficulties of Model Selection The implementation of model selection can lead to difficulties: A model selection technique produces a single answer to the variable selection problem, although several different subsets might be equally good for regression purposes. Model selection might be unduly affected by outliers. Selection bias 8 / 115 Introduction The GLMSELECT Procedure PROC GLMSELECT can partially mitigate these problems with its Extensive capabilities for customizing the selection Flexibility and power in specifying complex potential effects 9 / 115 Introduction Model Specification The GLMSELECT procedure provides great flexibility for model specifications: Choice of parameterizations for classification effects Any degree of interaction (crossed effects) and nested effects Internal partitioning of data into training, validation, and testing roles Hierarchy among effects 10 / 115 Introduction Selection Control The GLMSELECT procedure provides many options for selection control: Multiple effect selection methods Selection from a very large number of effects (tens of thousands) Selection of individual levels of classification effects Effect selection based on a variety of selection criteria Stopping rules based on a variety of model evaluation criteria Leave-one-out and k-fold cross validation 11 / 115 Introduction Linear Regression Model Suppose data arise from a a normal distribution with the following statistical model: Y = f (x) + In linear regression f (x) = β0 + β1x1 + β2x2 + ··· + βpxp Least squares is the most popular estimation method which picks the coefficients β = (β0; β1; : : : ; βp) that minimize the residual sum of squares: 2 N 0 p 1 X X RSS(β) = @yi − β0 + Xij βj A i=1 j=1 12 / 115 Introduction PROC GLMSELECT with Examples Learn how to use PROC GLMSELECT in model development with examples: 1 Simulate data 2 Fit full least squares model 3 Perform model selection by using five different approaches 4 Compare the selected models' performances 13 / 115 Introduction Simulate Data data trainingData testData; drop i j; array xf20g x1-x20; do i=1 to 5000; /* Continuous predictors */ do j=1 to 20; xfjg = ranuni(1); end; /* Classification variables */ c1 = int(1.5+ranuni(1)*7); c2 = 1 + mod(i,3); c3 = int(ranuni(1)*15); yTrue = 2 + 5*x17 - 8*x5 + 7*x9*c2 - 7*x1*x2 + 6*(c1=2) + 5*(c1=5); y = yTrue + 6*rannor(1); if ranuni(1) < 2/3 then output trainingData; else output testData; end; run; Reserves one-third of the data as test data and the remaining two-thirds as training data 14 / 115 Introduction Training and Test Data Use training data to develop a model. Use test data to assess your model's predictive performance. 15 / 115 Introduction Analysis 1: Full Least Squares Model proc glmselect data=trainingData testdata=testData plots=asePlot; class c1 c2 c3; model y = c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10 |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /selection=forward(stop=none); run; Because STOP=NONE is specified, the selection proceeds until all the specified effects are in the model. 16 / 115 Introduction Dimensions of Full Least Squares Model Class Level Information Class Levels Values c1 8 1 2 3 4 5 6 7 8 c2 3 1 2 3 c3 15 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 Dimensions Number of Effects 278 Number of Parameters 947 A full model that contains all main effects and their two-way interactions often leads to a large number of effects. When the classification variables have many levels, the number of parameters available for selection is even larger. 17 / 115 Introduction Assess Your Model's Predictive Performance You can assess the model's predictive performance by comparing the average square error (ASE) on the test data and the training data: ASE on the test data: ntest p−1 X X 2 [(Ytest − (β^0 + β^j Xtest;j )) ]=ntest i=1 j=1 where β^j 's are the least squares estimates obtained by using the observations in the training data. 18 / 115 Introduction Average Square Error (ASE) Plot proc glmselect ... plots=asePlot; Progression of Average Squared Errors by Role for y Selected Step 80 r o r r E 60 d e r a u q S e g a r e v A 40 0 50 100 150 200 250 Step Training Test 19 / 115 Introduction Overfitting and Variable Selection So the more variables the better? NO! Carefully selected variables can improve model accuracy. But adding too many features can lead to overfitting: Overfitted models describe random error or noise instead of the underlying relationship. Overfitted models generally have poor predictive performance. Model selection can prove useful in finding a parsimonious model that has good predictive performance. 20 / 115 Introduction Model Assessment Model assessment aims to 1 Choose the number of predictors for a given technique. 2 Estimate the prediction ability of the chosen model. For both of these purposes, the best approach is to evaluate the procedure on an independent test data, if one is available. If possible one should use different test data for (1) and (2): Validation set for (1) and test set for (2). 21 / 115 Introduction Model Selection for Linear Regression Models Suppose you have only two models to compare. Then you can use the following methods for model comparison: F test Likelihood ratio test AIC, SBC, and so on Cross validation However we usually have more than two models to compare! For a model selection problem with p predictors, there are 2p models to compare! 22 / 115 Introduction Alternatives Compare all possible subsets { all-subsets regression I Computationally expensive I Introduces a large selection bias! Use search algorithms I Traditional selection methods: forward, backward and stepwise I Shrinkage and penalty methods 23 / 115 Traditional Selection Methods TRADITIONAL SELECTION METHODS 24 / 115 Traditional Selection Methods Traditional Selection Methods Forward Selection: Begins with just the intercept and at each step adds the effect that shows the largest contribution to the model. Backward Elimination: Begins with the full model and at each step deletes the effect that shows the smallest contribution to the model. Stepwise Selection: Modification of the forward selection technique that differs in that effects already in the model do not necessarily stay there. PROC GLMSELECT extends these methods as implemented in the REG procedure. SELECTION= option of the MODEL statement specifies the model selection method. 25 / 115 Traditional Selection Methods Traditional Selection Methods In traditional selection methods: The F statistic and the related p-value reflect an effect’s contribution to the model. You choose the predictors and then estimate coefficients by using standard criteria such as least squares or maximum likelihood. There are problems with the use of both the F statistic and coefficient estimation! 26 / 115 Traditional Selection Methods Analysis 2: Traditional Stepwise Selection proc glmselect data=analysisData testdata=testData plots=(CoefficientPanel(unpack) asePlot Criteria); class c1 c2 c3; model y = c1|c2|c3|x1|x2|x3|x4|x5|x5|x6|x7|x8|x9|x10 |x11|x12|x13|x14|x15|x16|x17|x18|x19|x20 @2 /selection = stepwise(select=sl); run; The SELECT=SL option uses the significance level criterion to determine the order in which effects enter or leave the model. 27 / 115 Traditional Selection Methods Selection Summary Stepwise Selection Summary Effect Effect Number Number Step Entered Removed Effects

Model Selection for Linear Models with SAS/STAT Software

Model Selection for Optimal Prediction in Statistical Machine Learning

Linear Regression: Goodness of Fit and Model Selection

Scalable Model Selection for Spatial Additive Mixed Modeling: Application to Crime Analysis

Model Selection Techniques: an Overview

Least Squares After Model Selection in High-Dimensional Sparse Models.” DOI:10.3150/11-BEJ410SUPP

Model Selection and Estimation in Regression with Grouped Variables

Model Selection for Production System Via Automated Online Experiments

Optimal Predictive Model Selection

Model Selection in Regression: Application to Tumours in Childhood

Statistical Modeling Methods: Challenges and Strategies

The Minimum Message Length Principle for Inductive Inference

Minimum Description Length Model Selection