Linear Regression: Goodness of Fit and Model Selection

Goodness of Fit

- Goodness of fit measures for linear regression are attempts to understand how well a model fits a given set of data.
- Models almost never describe the process that generated a dataset exactly.
- Models approximate reality.
- However, even models that only approximate reality can be used to draw useful inferences or to predict future observations.
- 'All models are wrong, but some are useful' - George Box
- We have seen how to check the modelling assumptions of linear regression:
  - checking the linearity assumption
  - checking for outliers
  - checking the normality assumption
  - checking that the distribution of the residuals does not depend on the predictors
- These are essential qualitative checks of goodness of fit.

Sample Size

- When making visual checks of data for goodness of fit, it is important to consider sample size.
- From a multiple regression model with 2 predictors:
  - On the left is a histogram of the residuals.
  - On the right is a residual vs predictor plot for each of the two predictors.
- The histogram doesn't look normal, but there are only 20 data points.
- With so few data points we should not expect a better visual fit.
- Inferences from the linear model should be valid.

Outliers

- Often (particularly when a dataset is large):
  - the majority of the residuals will satisfy the model checking assumptions
  - a small number of residuals will violate the normality assumption: they will be very big or very small
- Outliers are often generated by a process distinct from the one we are primarily interested in, i.e. the process generating the relationship between the response and the predictors.
- e.g. outliers are often generated by measurement or experimental errors.
- In these circumstances, rather than reject the linear model and search for a more complex one, it is usually better to remove the outliers from the dataset.
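The removal step described above can be sketched roughly as follows. Everything here is an assumption made for illustration: the data are simulated, the corrupted rows `bad` are hypothetical, and the cut-off of 3 on the standardized residuals is a common rule of thumb rather than a rule from this lecture. `rstandard` is the base R function that returns standardized residuals.

```r
# Simulated data: a linear relationship plus two corrupted measurements.
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)
bad <- c(10, 50)          # hypothetical rows hit by a measurement error
y[bad] <- y[bad] + 15

fit <- lm(y ~ x)

# Flag points whose standardized residual is large in absolute value.
# The threshold 3 is an assumed rule of thumb.
r_std <- rstandard(fit)
flagged <- which(abs(r_std) > 3)

# Refit without the flagged points and compare the estimated coefficients.
keep <- setdiff(seq_len(n), flagged)
fit_clean <- lm(y ~ x, subset = keep)
print(coef(fit))
print(coef(fit_clean))
```

Removing points is only defensible when there is reason to believe they come from a different process, such as a recording error, so it is worth inspecting the flagged rows before dropping them.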

Outlier Detection

- Outliers can be detected graphically using the qqnorm function.

Automatic Outlier Detection

- Automated outlier detection is built into R.
- Apply the plot command to an R linear model object:

    > plot(lm(y ~ x))

Goodness of Fit

- Visual checks are important methods for checking the quality of the fit of a linear model to a dataset.
- However, they are qualitative; quantitative measures of goodness of fit are also important.
- Quantitative measures allow us to compare the goodness of fit of different models to the same dataset.

Residual Sum of Squares (SSE)

- The residual sum of squares is defined as

  $$\mathrm{SSE} = \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} x_i)^2$$

- This is a simple measure of goodness of fit.
- The smaller the SSE, the better the fit of the model.
- Note that SSE depends on $n$, the number of data points, so it cannot be used to compare the quality of fit of models fitted to datasets of different sizes.

R2

- You may have noticed that the $R^2$ value is routinely given in the R software output.
- $R^2$ is a measure of the proportion of variance explained by the predictors.
- We will now see how to define and interpret this measure.
- Consider two models:

  $$E[Y_i] = \alpha + \beta x_i \quad (1)$$
  $$E[Y_i] = \alpha \quad (2)$$

- $R^2$ of the first model is defined as:

  $$R^2 = 1 - \frac{\text{sum of squared residuals from model (1)}}{\text{sum of squared residuals from model (2)}}$$

- The better model (1) is at explaining the variation compared to model (2), the closer $R^2$ will be to 1.

R2 Interpretation

- Consider the following two residual plots:
- In the first case, the predictor x does not seem to have a large effect, while in the second, the linear model fits well.
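As a small check of these definitions, the sketch below computes the SSE of each model and then $R^2$ by hand, taking $R^2$ as one minus the ratio of the SSE of the model with the predictor to the SSE of the intercept-only model. The simulated data and all variable names are assumptions made for illustration; `resid`, `summary` and `cor` are base R functions.

```r
# Simulated data for a simple linear regression (illustrative values only).
set.seed(2)
n <- 50
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 3)

fit1 <- lm(y ~ x)   # model (1): E[Y_i] = alpha + beta * x_i
fit2 <- lm(y ~ 1)   # model (2): E[Y_i] = alpha

# Residual sum of squares for each model.
sse1 <- sum(resid(fit1)^2)
sse2 <- sum(resid(fit2)^2)

# R^2 by hand, and the value reported by R for model (1).
r2_by_hand <- 1 - sse1 / sse2
r2_from_r  <- summary(fit1)$r.squared

print(c(by_hand = r2_by_hand, from_r = r2_from_r))

# For simple linear regression this also equals the squared correlation.
print(all.equal(r2_by_hand, cor(x, y)^2))
```

Because model (2) is nested inside model (1), `sse1` can never exceed `sse2`, which is why the ratio keeps $R^2$ between 0 and 1.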

R2 Properties

- $R^2$ is a number between 0 and 1.
- The $R^2$ for a simple linear regression is the squared correlation between x and Y.
- The larger $R^2$, the better the fit.
- If you add a predictor to a multiple linear regression, $R^2$ never goes down, and in practice it almost always goes up.
- This means $R^2$ is inappropriate for comparing the quality of fit of models with different numbers of predictors.
- More complex models always fit at least as well.

Model Selection

- Recall there are two main reasons for fitting a statistical model:
  1. Scientific inference: estimating an interpretable parameter.
  2. Prediction: if you give me a new $x_1, x_2, \ldots, x_p$, can I predict the value of the corresponding Y without seeing it?
- Models for making scientific inferences are not normally chosen using statistical 'black box' model selection procedures. Usually the choice of model(s) depends on the scientific question and knowledge of the data generating process.
- Model selection methods are used primarily for finding good models for making predictions.

Averaging Over Models

- Optimal predictions often come from averaging over predictions from multiple models.
- In this lecture we will concentrate on methods for finding a single optimal model amongst a set of possibilities.

Complex Models are not Good for Prediction

- Problem: find a model using the current dataset that is going to be good at predicting a new observation.
- As we've seen, we can move to a model with improved goodness of fit by adding a new predictor to the current model, so it's easy to find a model which fits a given dataset well.
- But really complex models aren't necessarily good for predicting new observations, even if they are a good fit to the current dataset.

Model Parsimony

- Measures of model parsimony take into account goodness of fit to the data and model complexity.
- If two models have the same number of parameters, the one with the better goodness of fit will be the more parsimonious.
- If two models have the same goodness of fit (rarely happens), the model with the fewer parameters will be the more parsimonious.
- More parsimonious models should give better predictions (on average).

Measures of Model Parsimony

- There are many measures of model parsimony.
- We will concentrate on AIC and BIC.

AIC

- The formula for AIC in linear regression is

  $$n \log \mathrm{SSE} + 2p$$

- SSE is the residual sum of squares; this is a goodness of fit measure.
- $p$ is the number of parameters of the model (number of regression coefficients).
- $n$ is the sample size.
- Smaller values of AIC correspond to more parsimonious models.
- AIC tends to be liberal (i.e. it can add in too many predictors and overfit).

BIC

- The formula for BIC in linear regression is

  $$n \log \mathrm{SSE} + p \log(n)$$

- $n$ is the sample size.
- The complexity penalty is stronger than that for AIC: $p \log(n)$ exceeds $2p$ once $n \geq 8$.
- Smaller values of BIC correspond to more parsimonious models.
- BIC tends to be conservative (i.e. it requires quite a bit of evidence before it will include a predictor).

Number of Possible Regression Models

- If we have $p$ predictors we can build $2^p$ possible models.
- e.g. for $p = 2$ the $2^p = 2^2 = 4$ possible linear regression models have regression equations:

  $$E[Y_i] = \alpha$$
  $$E[Y_i] = \alpha + \beta_1 X_1$$
  $$E[Y_i] = \alpha + \beta_2 X_2$$
  $$E[Y_i] = \alpha + \beta_1 X_1 + \beta_2 X_2$$

- The first model (intercept only) is called the empty model.
- The last model (all predictors included) is called the saturated (or full) model.

All Subsets Selection

- All subsets selection is the simplest model search algorithm:
  1. Choose a model parsimony criterion.
  2. Fit each of the $2^p$ models and compute the criterion.
  3. Rank the models by the criterion and choose the most parsimonious.
- On modern computers this is doable provided $p$ is not much larger than a number in the late teens: $2^{20} \approx 1$ million.

Forward Selection

- This algorithm can be run with any model selection criterion.
- Start (usually) with the empty model as the current model.
- Iterate the following:
  1. Fit all the models you can generate by augmenting the current model by one variable.
  2. If none of the models fitted in 1 is ranked better by the model selection criterion than the current model, terminate the algorithm and output the current model.
  3. Update the current model with the model fitted in 1 that is ranked best by the model selection criterion.
- This fits at most $p(p + 1)/2$ models (faster than $2^p$).
- May not always find the best model: once a variable is in the model it can't be removed.

Backward Elimination

- This algorithm can be run with any model selection criterion.
- Start (usually) with the saturated model.
- Iterate the following:
  1. Fit all the models you can generate by reducing the current model by one variable.
  2. If none of the models fitted in 1 is ranked better by the model selection criterion than the current model, terminate the algorithm and output the current model.
  3. Update the current model with the model fitted in 1 that is ranked best by the model selection criterion.
- This fits at most $p(p + 1)/2$ models (faster than $2^p$).
- May not always find the best model: once a variable is out of the model it can't be returned.

Combined Forward and Backwards Selection

- This algorithm can be run with any model selection criterion.
- The starting point is not so important.
- Iterate the following:
  1. Fit all the models you can generate by augmenting or reducing the current model by one variable.