Model Selection and Model Over-Fitting
Shengping Yang PhD, Gilbert Berdine MD

Corresponding author: Shengping Yang PhD
Contact Information: [email protected]
DOI: 10.12746/swrccc2015.0312.160

I have collected some demographic, clinical, and blood test data from colon cancer patients treated from 2010 to 2012. I am interested in factors that potentially affect serum CRP levels shortly after diagnosis. I am thinking of using all the data (variables) in a regression model to evaluate such relationships. Is model over-fitting a concern?

Real-world data very often consist of true signal and random noise. Although it is ideal to fit only the true signal with a perfect statistical model in order to explore the underlying relationship y = f(x) + ε between the outcome y and the independent variables (predictors) x, many times the true signal f(x) and the random noise ε cannot be separated from each other. In other words, we can only observe y and x; we cannot observe f(x), where f() is the function that defines the underlying relationship between y and x.

Let's recall the process of data collection: in reality, it is rarely feasible to evaluate the underlying relationships using data from an entire population. The common approach is to take random samples from the population and then use the sample data to infer such relationships. However, because of the random nature of sampling, as well as the random noise inherent in individual observations, it is common for statistical models fitted on different random samples (collected from the same population) to have different parameters. Moreover, in situations where ε is substantial and the number of independent variables is large, a statistical model might mainly describe the random noise rather than the underlying relationship; this effect is called over-fitting. For example, if one fits n data points with n parameters, the fit will be exact, but it is unlikely to work well with another sample of n points from the same population. The opposite problem, under-fitting, occurs when too few parameters are used. A model that is under-fitted will work well only if the parameter left out is homogeneously distributed throughout the population.

The classical example of model over-fitting is fitting data with a polynomial regression. Suppose that the true underlying relationship between x and y in a population is y = 30 + x² (this is rarely known in reality; we assume that it is known for demonstration purposes). First, we randomly sample 9 observations from the population (panel A) and fit the data with both a second order polynomial (red curve; the "correct" model) and a 10th order polynomial (blue curve; the "over-fitted" model). Although the 10th order polynomial fits the data perfectly, it deviates substantially from the true relationship (black curve). In comparison, although the second order polynomial does not fit the data perfectly, it agrees well with the true relationship. Next, we take another independent random sample of 9 observations from the same population (panel B). The second order polynomial developed on the first sample also has an acceptable fit to the second independent sample. In contrast, the 10th order polynomial fits very poorly. In other words, the 10th order polynomial fits not only the underlying relationship but also the variation associated with random error, so it is an over-fitted model. It is therefore not surprising that such a model fits one sample well but another one poorly.

[Figure: Panel A, the first random sample of 9 observations with the true relationship (black), second order fit (red), and 10th order fit (blue); Panel B, the same fitted curves plotted against a second independent sample.]
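This over-fitting pattern is easy to reproduce numerically. The following is a minimal sketch in Python with numpy (an assumption; the column does not say how its figure was generated, and the noise level and seed are hypothetical): it draws two independent samples of 9 observations from y = 30 + x² plus Gaussian noise, fits second and 10th order polynomials to the first sample, and compares their mean squared errors on both samples.

import numpy as np

rng = np.random.default_rng(1)

def draw_sample(n=9):
    # true relationship y = 30 + x^2, observed with additive Gaussian noise
    x = rng.uniform(-3, 3, n)
    y = 30 + x**2 + rng.normal(0, 2, n)
    return x, y

x1, y1 = draw_sample()  # first sample (panel A), used for fitting
x2, y2 = draw_sample()  # independent second sample (panel B)

# numpy may warn that the 10th order fit is poorly conditioned; it still interpolates the 9 points
fit2 = np.polyfit(x1, y1, deg=2)
fit10 = np.polyfit(x1, y1, deg=10)

for name, coef in [("2nd order", fit2), ("10th order", fit10)]:
    mse_fit = np.mean((np.polyval(coef, x1) - y1) ** 2)
    mse_new = np.mean((np.polyval(coef, x2) - y2) ** 2)
    print(name, "MSE on sample A:", round(mse_fit, 2), "MSE on sample B:", round(mse_new, 2))

With settings like these, the 10th order fit typically drives its error on the first sample to nearly zero while its error on the second sample is far larger than that of the second order fit, which is exactly the behavior described above.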
Since the goal of any biomedical/clinical study is to disclose the true underlying relationship, it is critical to find the "correct" model in data analysis.

To find the "correct" model and avoid model over-fitting, many methods have been proposed. The majority of them adopt one of two strategies: 1) penalize models with more parameters - since an increased number of parameters in a model is associated with a higher probability of modeling the random error, penalizing extra parameters reduces the risk of over-fitting; 2) use validation data set(s) to evaluate the performance of the fitted model - since the true underlying relationship is supposed to be consistent across samples randomly collected from the same population (while the random noise is not), models that consistently perform well on different samples are more likely to be the ones that capture the true underlying relationship rather than the random noise.

1. Penalize models with more parameters

Penalizing additional parameters (predictors) can be achieved either by performing traditional model selection (based on different criteria) or by applying penalized regression models.

(A) Traditional model selection

i. Selection based on adjusted R-squared, Mallows' Cp, AIC, and BIC

R-squared is the percentage of outcome variable variation explained by the model, and describes how close the data are to the fitted regression. In general, the higher the R-squared value, the better the model fits. However, R-squared always increases when predictors are added, so models with extra predictors always have higher R-squared values. The adjusted R-squared adjusts the R-squared value for the number of predictors, and it increases only if an additional predictor improves the model more than would be expected by chance. Thus, models with higher adjusted R-squared are generally considered better.

Mallows' Cp estimates the mean squared prediction error, and represents a compromise among factors including sample size, collinearity, and predictor effect sizes. Adequate models are those with Cp less than or equal to the number of parameters (including the constant) in the model.

AIC is Akaike's Information Criterion, and BIC is Schwarz's Bayesian Criterion. Both aim at achieving a compromise between model goodness of fit and model complexity. The only difference between AIC and BIC is the penalty term, with BIC being more stringent than AIC. The preferred models are those with minimum AIC/BIC.
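As a concrete illustration of these criteria, the sketch below fits two nested linear models to a simulated data set and reports R-squared, adjusted R-squared, AIC, and BIC for each. It is a minimal example assuming Python with statsmodels (the column does not specify any software, and all variable names are hypothetical); the smaller model contains only the informative predictor, while the larger model adds five pure-noise predictors.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60
x1 = rng.normal(size=n)            # informative predictor
noise = rng.normal(size=(n, 5))    # five predictors unrelated to the outcome
y = 2 + 1.5 * x1 + rng.normal(size=n)

X_small = sm.add_constant(x1)                             # intercept + x1
X_large = sm.add_constant(np.column_stack([x1, noise]))   # intercept + x1 + noise terms

small = sm.OLS(y, X_small).fit()
large = sm.OLS(y, X_large).fit()

for name, res in [("small model", small), ("large model", large)]:
    print(name,
          "R2:", round(res.rsquared, 3),
          "adj R2:", round(res.rsquared_adj, 3),
          "AIC:", round(res.aic, 1),
          "BIC:", round(res.bic, 1))

The ordinary R-squared is essentially always higher for the larger model, while the adjusted R-squared, AIC, and BIC will usually favor the smaller one, which is how penalizing extra parameters discourages over-fitting. Mallows' Cp is not reported by this routine, but it can be computed from the residual sum of squares of each candidate model together with the error variance estimated from the full model.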
ii. Best subset / forward / backward / step-wise selection

Assuming that the total number of predictors is p, best subset selection fits 2^p models in total and chooses the best model based on criteria such as adjusted R-squared, Mallows' Cp, or AIC/BIC. However, if the total number of predictors is large, for example greater than 30, the computation can become a serious issue.

In forward selection, the most significant variable (based on a pre-set significance criterion) is added to the model one at a time, until no additional variable meets the criterion.

Backward selection starts with the full model that includes all the variables of interest, and then drops non-significant variables one at a time, until all the variables left are significant.

Step-wise selection allows both adding and dropping variables, so that dropped variables can be reconsidered.

As an alternative to the above traditional model selection methods, penalized regressions achieve coefficient estimation and model selection simultaneously.

(B) Penalized regressions (LASSO regression and Elastic Net)

The LASSO (Least Absolute Shrinkage and Selection Operator) achieves model selection by penalizing the absolute size of the regression coefficients. In other words, LASSO includes a penalty term that constrains the size of the estimated coefficients. As a result, LASSO solutions have many coefficients set exactly to zero, and the larger the penalty applied, the more the estimates are shrunk towards zero. In general, the penalty parameter is chosen by cross validation to maximize out-of-sample fit, as sketched at the end of this section.

Elastic Net regression was developed to overcome the limitations of LASSO, and in general it outperforms LASSO when the predictors are highly correlated.

2. Cross validation

Any random sample will differ from its population. Two independent samples from a given population share the true underlying relationship, but not the sample-specific variation. It is therefore important to assess how well a model developed on one sample performs on another independent sample, and to fine-tune model parameters accordingly. Cross validation is an easy-to-implement tool for making such assessments.

(A) k-fold cross validation

The k-fold cross validation is one of the most commonly used methods. Basically, the sample is randomly partitioned into k equal-size subsets; one of the k subsets is used as the validation set, and the other k-1 subsets are used as the training set. This process is repeated k times (folds), so that each of the k subsets is used exactly once as the validation set. Results from the k validations can then be averaged to produce a single estimate. As a special case, if k equals N, the total number of observations, k-fold cross validation is called leave-one-out cross validation.

(B) Random sub-sampling validation

In k-fold cross validation, the proportion of the training/validation subsets depends on the number of iterations (folds). In situations where the total number of sample observations is small, it can make sense to use random sub-sampling validation, such as the bootstrap. Note that the disadvantages of random sub-sampling are that some observations might never be sampled, and that the results might vary if the randomization is repeated.
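To tie the two strategies together, the sketch below uses 5-fold cross validation to choose the LASSO penalty on a simulated data set. It is a minimal illustration assuming Python with scikit-learn (the column does not name any software; the data, candidate penalties, and variable names are hypothetical).

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n, p = 80, 20
X = rng.normal(size=(n, p))
# only the first three predictors carry true signal; the remaining 17 are noise
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(size=n)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)  # each observation validates exactly once
penalties = [0.01, 0.05, 0.1, 0.5, 1.0]

cv_scores = {}
for alpha in penalties:
    # mean cross-validated R^2 across the 5 folds for this penalty
    cv_scores[alpha] = cross_val_score(Lasso(alpha=alpha), X, y, cv=kfold).mean()
    print("penalty", alpha, "mean CV R2:", round(cv_scores[alpha], 3))

best_alpha = max(cv_scores, key=cv_scores.get)
final = Lasso(alpha=best_alpha).fit(X, y)   # refit on the full sample with the chosen penalty
print("chosen penalty:", best_alpha,
      "non-zero coefficients:", int(np.sum(final.coef_ != 0)))

Larger penalties shrink more coefficients exactly to zero, and the cross-validated score points to a penalty that balances fit and complexity. In practice, scikit-learn's LassoCV and ElasticNetCV automate this grid search; the explicit loop is shown only to make the k-fold mechanics visible.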