Model Selection in R

Charles J. Geyer October 28, 2003

This used to be a section of my master’s level theory notes. It is a bit overly theoretical for this R course. Just think of it as an example of literate programming in R using the Sweave function. You don’t have to absorb all the theory, although it is there for your perusal if you are interested.

This section discusses the problem of choosing among many models. Although our examples will be regression problems and some of the discussion and methods will be specific to regression, the problem is general. When many models are under consideration, how does one choose the best? Thus we also discuss methods that apply outside the regression context.

There are two main issues.

• Non-nested models.

• Many models.

The only methods for model comparison we have studied, the F test for comparison of linear regression models and the likelihood ratio test for comparison of general models, are valid only for comparing two nested models. We also want to test non-nested models, and for that we need new theory.

When more than two models are under consideration, the issue of correction for multiple testing arises. If there are only a handful of models, Bonferroni correction (or some similar procedure) may suffice. When there are many models, a conservative correction like Bonferroni is too conservative.

Let’s consider how many models might be under consideration in a model selection problem. Consider a regression problem with k predictor variables.

• [Almost the worst case] There are 2^k possible submodels formed by choosing a subset of the k predictors to include in the model (because a set with k elements has 2^k subsets).

• [Actually the worst case] That doesn’t consider all the new predictors that one might “make up” using functions of the old predictors. Thus there are potentially infinitely many models under consideration.

Bonferroni correction for infinitely many tests is undefined. Even for 2^k tests with k large, Bonferroni is completely pointless. It would make nothing statistically significant.
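
A quick arithmetic sketch (mine, not part of the original notes) makes the point. With k = 20 predictors there are 2^20 submodels, and a Bonferroni correction at overall level 0.05 would require each individual test to have a p-value below about 5e-08.

> k <- 20
> 2^k
[1] 1048576
> 0.05/2^k
[1] 4.768372e-08

Essentially no regression coefficient in a noisy data set could clear that hurdle, which is what “completely pointless” means here.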

1 Overfitting

Least squares is good for model fitting, but useless for model selection.

Why? A bigger model always has a smaller residual sum of squares, just because a minimum taken over a larger set is smaller. Thus residual sum of squares, taken as a criterion for model selection, says “always choose the biggest model.” But this is silly. Consider what the principle of choosing the biggest model says about the following example.

Example 1 (Overfitting in Polynomial Regression). Consider the regression data in the file ex12.7.1.dat in this directory, which is read by

> X <- read.table("ex12.7.1.dat", header = TRUE) > names(X)

[1] "x" "y"

> attach(X)

The “simple” linear regression is done and a plot made (Figure 1) by the following

> par(mar = c(5, 4, 1, 1) + 0.1)
> plot(x, y)
> out <- lm(y ~ x)
> abline(out)

Couldn’t ask for nicer data for “simple” linear regression. The data appear to fit the “simple” model (one non-constant predictor, which we take to be x itself).

But now consider what happens when we try to be a bit more sophisticated. How do we know that the “simple” model is o. k.? Perhaps we should consider some more complicated models. How about trying polynomial regression? But if we consider all possible polynomial models, that’s an infinite number of models (polynomials of all orders).

Although an infinite number of models are potentially under consideration, the fact that the data set is finite limits the number of models that actually need to be considered to a finite subset. You may recall from algebra that any set of n points in the plane having n different x values can be interpolated (fit exactly) by a polynomial of degree n − 1. A polynomial that fits exactly has residual sum of squares zero (fits perfectly). Can’t do better than that by the least squares criterion! Thus all polynomials of degree at least n − 1 will give the same fitted values and zero residual sum of squares. Although they may give different predicted values for x values that do not occur in the data, they give the same predicted values at those that do. Hence the polynomials with degree at least n − 1 cannot be distinguished by least squares. In fact


Figure 1: Some regression data, with fitted “simple” regression function of the form y = α̂ + β̂x.

the polynomials with degree more than n − 1 cannot be fit at all, because they have more parameters than there are equations to determine them. X'X is a singular matrix and cannot be inverted to give unique estimates of the regression coefficients.

This is a general phenomenon, which occurs in all settings, not just with linear regression. Models with more parameters than there are data points are underdetermined. Many different parameter vectors give the same likelihood, or the same empirical moments for the method of moments, or the same value of whatever criterion is being used for parameter estimation. Thus in general, even when an infinite number of models are theoretically under consideration, only n models are practically under consideration, where n is the sample size.

Hence the “biggest model” that the least squares criterion selects is the polynomial of degree n − 1. What does it look like? The following code fits this model

> out.poly <- lm(y ~ poly(x, 9))
> summary(out.poly)

Call:
lm(formula = y ~ poly(x, 9))

Residuals:
ALL 10 residuals are 0: no residual degrees of freedom!

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.6974
poly(x, 9)1  10.4359
poly(x, 9)2  -0.6243
poly(x, 9)3  -0.8362
poly(x, 9)4   1.1783
poly(x, 9)5  -0.7855
poly(x, 9)6   0.1269
poly(x, 9)7  -1.3901
poly(x, 9)8  -0.4744
poly(x, 9)9  -1.0935

Residual standard error: NaN on 0 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:  NaN
F-statistic:   NaN on 9 and 0 DF,  p-value: NA

(Note the zero residual degrees of freedom. That’s the overfitting!) Then

> par(mar = c(5, 4, 1, 1) + 0.1)
> plot(x, y)
> curve(predict(out.poly, data.frame(x = x)), add = TRUE)
> abline(out, lty = 2)


Figure 2: Some regression data, with fitted linear regression function (dashed line) and ninth degree polynomial regression function (solid curve). See also Figure 3.

plots the two regression functions for our fits (Figure 2): the best fitting (perfectly fitting!) polynomial of degree n − 1 = 9 and the least squares regression line from the other figure.

How well does the biggest model do? It fits the observed data perfectly, but it’s hard to believe that it would fit new data from the same population as well. The extreme oscillations near the ends of the range of the data are obviously nonsensical, but even the smaller oscillations in the middle seem to be tracking random noise rather than any real features of the population regression function.

Of course, we don’t actually know what the true population regression function is. It could be either of the two functions graphed in the figure, or it could be some other function. But it’s hard to believe, when the linear function fits so well, that something as complicated as the ninth degree polynomial is close to the true regression function. We say it “overfits” the data, meaning it’s too close to the data and not close enough to the true population regression function.

Let’s try again on Figure 2. Perhaps we would like to see the extent of those big oscillations. Something like this should do it.


Figure 3: Same as Figure 2 except for greater vertical extent.

> par(mar = c(5, 4, 1, 1) + 0.1)
> xx <- seq(min(x), max(x), length = 1000)
> yy <- predict(out.poly, data.frame(x = xx))
> plot(x, y, ylim = range(yy))
> curve(predict(out.poly, data.frame(x = x)), add = TRUE)
> abline(out, lty = 2)

The result is shown in Figure 3.
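
The claim that the degree 9 fit would do poorly on new data can be checked by simulation. The following sketch is not part of the original notes: the straight-line truth, error standard deviation, seed, and object names are all made up for illustration. It simulates training and test data from the same linear model, fits a straight line and a degree 9 polynomial to the training data, and compares mean squared prediction error on the test data.

> set.seed(42)                      # hypothetical example, not from the notes
> xsim <- 1:10
> ytrain <- 1 + xsim + rnorm(10)    # training data from a straight-line truth
> ytest <- 1 + xsim + rnorm(10)     # independent test data at the same x values
> fit1 <- lm(ytrain ~ xsim)
> fit9 <- lm(ytrain ~ poly(xsim, 9))
> mean((ytest - predict(fit1))^2)   # test error of the straight line
> mean((ytest - predict(fit9))^2)   # test error of the interpolating polynomial

In expectation the interpolating fit has noticeably larger prediction error than the straight line here, though any single simulated data set can go either way.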

2 Mean Square Error

So what criterion should we use for model selection if residual sum of squares is no good? One theoretical criterion is mean square error. In the regression setting, it is unclear what quantities mean square error should apply to. Do we use the mean square error of the parameter estimates? We haven’t even defined mean square error for vector quantities. That alone suggests we should avoid looking at MSE of regression coefficients. There is also our slogan that regression coefficients are meaningless. Hence we should look at estimates of

the regression function. But here too, there are still issues that need to be clarified. The regression function h(x) = E(Y | x) is a scalar function of x. (Hence we don’t need to worry about MSE of a vector quantity). But it is a function of the predictor value x.

    mse{ĥ(x)} = variance + bias² = var{ĥ(x)} + (E{ĥ(x)} − h(x))²    (1)

where h(x) is the true population regression function and ĥ(x) is an estimate (we are thinking of least squares regression estimates, but (1) applies to any estimate). (“Bias?” did I hear someone say? Aren’t linear regression estimates unbiased? Yes, they are when the model is correct. Here we are considering cases when the model is too small to contain the true regression function.)

To make a criterion that we can minimize to find the best model, we need a single scalar quantity, not a function. There are several things we could do to make a scalar quantity from (1). We could integrate it over some range of values, obtaining so-called integrated mean squared error. A simpler alternative is to sum it over the design points, the x values occurring in the data set under discussion. We’ll discuss only the latter.

Write µ for the true population mean of the responses given the predictors, defined by µ_i = h(x_i). Let m index models. The m-th model will have design matrix X_m and hat matrix

    H_m = X_m (X_m' X_m)^(−1) X_m'.    (2)

It is called the hat matrix because it “puts the hat on” y, that is, ŷ_m = H_m y is the vector of predicted values for the m-th model. The expected value of the regression predictions is E(ŷ_m) = H_m µ. Hence the bias is

    E(ŷ_m) − E(y) = −(I − H_m)µ.    (3)

If the m-th model is correct, then µ is in the range of H_m and the bias is zero. Models that are incorrect have nonzero bias. The bias (3) is, of course, a vector. Its i-th element gives the bias of ŷ_i. What we decided to study was the sum of the MSEs at the design points, the “bias” part of which is just the sum of squares of the elements of (3), which is the same thing as the squared length of this vector

    bias² = ‖(I − H_m)µ‖² = µ'(I − H_m)²µ = µ'(I − H_m)µ.    (4)

The variance is

    var(ŷ_m) = E{[ŷ_m − E(ŷ_m)][ŷ_m − E(ŷ_m)]'}
             = E{[ŷ_m − H_m µ][ŷ_m − H_m µ]'}
             = E{H_m (y − µ)(y − µ)' H_m}
             = σ² H_m²
             = σ² H_m

This is, of course, a matrix. What we want though is just the sum of the diagonal elements. The i-th diagonal element is the variance of ŷ_i, and our decision to focus on the sum of the MSEs at the design points says we want the sum of these. The sum of the diagonal elements of a square matrix A is called its trace, denoted tr(A). Thus the “variance” part of the MSE is

    variance = σ² tr(H_m).    (5)

And

    mse_m = Σ_{i=1}^{n} mse(ŷ_i) = σ² tr(H_m) + µ'(I − H_m)µ.

3 The Bias-Variance Trade-Off

Generally, one cannot reduce both bias and variance at the same time. Bigger models have less bias but more variance. Smaller models have less variance but more bias. This is called the “bias-variance trade-off.”

Example 2. Consider the regression data in the file ex12.3.2.dat in this directory, which is read by

> X <- read.table("ex12.3.2.dat", header = TRUE) > names(X)

[1] "x" "y"

> attach(X)

Figure 4 shows the scatter plot. A mere glance at the plot shows that y is decidedly not a linear function of x, not even close. However, there is nothing that prevents us from making up more predictor variables. The data set itself has no other variables in it, just x and y. So any new predictor variables we make up must be functions of x. What functions?


Figure 4: Some regression data.

Since we are told nothing about the data, we have no guidance as to what functions to make up. In a real application, there might be some guidance from scientific theories that describe the data. Or there might not. Users of linear regression often have no preconceived ideas as to what particular functional form the regression function may have. One thing that is often done, more or less for lack of any better ideas, is to try a polynomial. The following fits a bunch of polynomials and compares them.

> out1 <- lm(y ~ x)
> out2 <- update(out1, . ~ . + I(x^2))
> out3 <- update(out2, . ~ . + I(x^3))
> out4 <- update(out3, . ~ . + I(x^4))
> out5 <- update(out4, . ~ . + I(x^5))
> out6 <- update(out5, . ~ . + I(x^6))
> out7 <- update(out6, . ~ . + I(x^7))
> anova(out1, out2, out3, out4, out5, out6, out7)

Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2)
Model 3: y ~ x + I(x^2) + I(x^3)
Model 4: y ~ x + I(x^2) + I(x^3) + I(x^4)
Model 5: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5)
Model 6: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6)
Model 7: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7)
  Res.Df    RSS Df Sum of Sq        F  Pr(>F)
1     98 38.455
2     97 38.434  1     0.021   0.5072  0.4782
3     96 33.075  1     5.359 129.1406  <2e-16 ***
4     95 33.007  1     0.068   1.6364  0.2040
5     94  8.833  1    24.174 582.4921  <2e-16 ***
6     93  8.820  1     0.013   0.3048  0.5822
7     92  3.818  1     5.002 120.5270  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(no particular reason for stopping at 7). It is clear from the printout that odd degree polynomials fit way better than the even ones (strictly speaking, better than the immediately preceding even one to which they are compared). In hindsight, this is obvious. A glance at Figure 4 shows that the regression function is bending down on the left side and up on the right. This is the behavior of an odd degree polynomial. An even degree polynomial heads the same way at both sides (both up or both down). Thus we need only consider odd degree polynomials. Let’s add some more.

> out9 <- update(out7, . ~ . + I(x^8) + I(x^9))
> out11 <- update(out9, . ~ . + I(x^10) + I(x^11))

> out13 <- update(out11, . ~ . + I(x^12) + I(x^13))
> anova(out1, out3, out5, out7, out9, out11, out13)

Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ x + I(x^2) + I(x^3)
Model 3: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5)
Model 4: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7)
Model 5: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9)
Model 6: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10) + I(x^11)
Model 7: y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10) + I(x^11) + I(x^12) + I(x^13)
  Res.Df    RSS Df Sum of Sq        F   Pr(>F)
1     98 38.455
2     96 33.075  2     5.381   63.8192  < 2e-16 ***
3     94  8.833  2    24.242  287.5378  < 2e-16 ***
4     92  3.818  2     5.015   59.4796  < 2e-16 ***
5     90  3.788  2     0.030    0.3545  0.70256
6     88  3.788  2 0.0001081    0.0013  0.99872
7     87  3.667  1     0.121    2.8627  0.09424 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Hmmmmm. Seems we stopped in the right place. Model 4 here, which is the polynomial of degree 7, clearly fits much better than model 3. But models 5, 6, and 7 don’t fit significantly better than model 4. Thus it is fairly clear that model 4 is the one to go with.

Double Hmmmmm. The 1 degree of freedom for comparing models 6 and 7 is wrong. The only reason I can think of for this is that using the monomials as predictors is very unstable. Let’s try again.

> out3 <- lm(y ~ poly(x, 3))
> out5 <- lm(y ~ poly(x, 5))
> out7 <- lm(y ~ poly(x, 7))
> out9 <- lm(y ~ poly(x, 9))
> out11 <- lm(y ~ poly(x, 11))
> out13 <- lm(y ~ poly(x, 13))
> anova(out1, out3, out5, out7, out9, out11, out13)

Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ poly(x, 3)
Model 3: y ~ poly(x, 5)
Model 4: y ~ poly(x, 7)
Model 5: y ~ poly(x, 9)
Model 6: y ~ poly(x, 11)
Model 7: y ~ poly(x, 13)
  Res.Df    RSS Df Sum of Sq        F  Pr(>F)
1     98 38.455
2     96 33.075  2     5.381  63.4536  <2e-16 ***
3     94  8.833  2    24.242 285.8905  <2e-16 ***
4     92  3.818  2     5.015  59.1388  <2e-16 ***
5     90  3.788  2     0.030   0.3524  0.7040
6     88  3.788  2 0.0001081   0.0013  0.9987
7     86  3.646  2     0.142   1.6739  0.1936
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

No difference at all except for that last comparison, which poly gets right! Figure 5 shows the scatter plot again with the regression function for the 7th degree polynomial added.

From a data analytic point of view, there’s nothing more to be said. From a theoretical point of view, we haven’t even started. Theoretically, we haven’t yet looked at the bias-variance tradeoff, or for that matter at the bias or variance. These are theoretical quantities that cannot be discovered by looking at fitted models. To know the bias we need to know the true population regression function. I do happen to know that here because this is simulated data. The true population regression function is trigonometric, given by

    µ_i = sin(x_i) + sin(2 x_i)    (6a)
    σ = 0.2                        (6b)

Again we consider various polynomial regression models. Since we are now being theoretical, we use only the x values and ignore the y values. Since the true population regression curve is a trigonometric rather than a polynomial function, no polynomial is unbiased. This is typical of real applications. No model under consideration is exactly correct.

The R function lm returns the design matrix if asked, and from it we can construct the hat matrix, for example

> out7 <- lm(y ~ poly(x, 7), x = TRUE)
> hat7 <- out7$x %*% solve(t(out7$x) %*% out7$x) %*% t(out7$x)
> all.equal(predict(out7), as.numeric(hat7 %*% y))

[1] TRUE
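
As a side check on the theory (this check is mine, not part of the original notes), the hat matrix should be symmetric and idempotent, and its trace should equal the number of regression coefficients, here 8 for the degree 7 polynomial with intercept, which is why the variance computed below comes out to 0.2^2 * 8 = 0.32.

> all.equal(hat7, t(hat7))          # symmetric
> all.equal(hat7, hat7 %*% hat7)    # idempotent
> sum(diag(hat7))                   # trace = number of coefficients, here 8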

Then (integrated) bias squared and variance are given by

> mu <- sin(x) + sin(2 * x)
> sigma <- 0.2


Figure 5: Same regression data as in Figure 4. Now with sample regression function in the model of all polynomials of degree seven.

> n <- length(x)
> t(mu) %*% (diag(n) - hat7) %*% mu

          [,1]
[1,] 0.1632783

> sigma^2 * sum(diag(hat7))

[1] 0.32

Put the above in a loop.

> kmax <- 20
> bsq <- double(kmax)
> v <- double(kmax)
> for (k in 1:kmax) {
+     out <- lm(y ~ poly(x, k), x = TRUE)
+     hat <- out$x %*% solve(t(out$x) %*% out$x) %*% t(out$x)
+     bsq[k] <- t(mu) %*% (diag(n) - hat) %*% mu
+     v[k] <- sigma^2 * sum(diag(hat))
+ }
> foo <- cbind(1:kmax, bsq, v, bsq + v)
> dimnames(foo) <- list(NULL, c("degree", "bias^2", "variance",
+     "mse"))
> foo

      degree       bias^2 variance        mse
 [1,]      1 3.338474e+01     0.08 33.4647442
 [2,]      2 3.338474e+01     0.12 33.5047442
 [3,]      3 2.909238e+01     0.16 29.2523829
 [4,]      4 2.909238e+01     0.20 29.2923829
 [5,]      5 4.855222e+00     0.24  5.0952222
 [6,]      6 4.855222e+00     0.28  5.1352222
 [7,]      7 1.632783e-01     0.32  0.4832783
 [8,]      8 1.632783e-01     0.36  0.5232783
 [9,]      9 1.899415e-03     0.40  0.4018994
[10,]     10 1.899415e-03     0.44  0.4418994
[11,]     11 9.794280e-06     0.48  0.4800098
[12,]     12 9.794280e-06     0.52  0.5200098
[13,]     13 2.588958e-08     0.56  0.5600000
[14,]     14 2.588958e-08     0.60  0.6000000
[15,]     15 3.868572e-11     0.64  0.6400000
[16,]     16 3.868739e-11     0.68  0.6800000
[17,]     17 5.607338e-14     0.72  0.7200000
[18,]     18 5.551309e-14     0.76  0.7600000
[19,]     19 1.951232e-14     0.80  0.8000000
[20,]     20 2.140150e-14     0.84  0.8400000
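
It may help to see the trade-off graphically. The following sketch (mine, not in the original notes) plots the bias squared, variance, and MSE columns of foo against degree on a log scale; the minimum of the MSE curve is at degree 9.

> matplot(foo[, "degree"], foo[, c("bias^2", "variance", "mse")],
+     type = "l", log = "y", xlab = "degree", ylab = "")
> legend("topright", legend = c("bias^2", "variance", "mse"),
+     lty = 1:3, col = 1:3)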

The most interesting fact here is that the best model in terms of mean square error has degree 9 (not the degree 7 model chosen by the data analysis). Of course, to know the theoretically best model, we need to know the true population regression function mu and when we are doing a real data analysis, we don’t. The point is that there is a theoretically best model and the data analysis need not find it.

Also very interesting is that we don’t want to be less biased, much less unbiased. Many people exposed to bad teaching think unbiasedness is absolutely wonderful. But here we see that less bias is bad when we go beyond the point of optimum bias-variance tradeoff. This is quite a general point. Linear regression is only unbiased when all of the model assumptions are correct: the true population regression function is among those under consideration (exactly) and the errors are IID (again exactly). In real life, these assumptions are rarely if ever known to be correct. So unbiasedness is mythical.

Of minor interest is the property that the even degree models have the same bias as the next lower odd degree model. This is because the true population regression is an odd function, satisfying f(x) = −f(−x), which makes the true population regression coefficients on the even degree polynomials, which are even functions, satisfying f(x) = f(−x), exactly zero.

Figure 6 shows both the true regression function used to simulate the response values (6a) and the sample ninth-degree polynomial regression function. The R statements making the figure are

> par(mar = c(5, 4, 1, 1) + 0.1)
> plot(x, y)
> curve(sin(x) + sin(2 * x), add = TRUE)
> curve(predict(out9, data.frame(x = x)), add = TRUE, lty = 2)

The result is shown in Figure 6.

We see that the sample regression function does not estimate the true regression function perfectly (because the sample is not the population), but we now know from the theoretical analysis in this example that no polynomial will fit better. A lower degree polynomial will have less variance. A higher degree will have less bias. But the bias-variance trade-off will be worse for either.

4 Model Selection Criteria

The theory discussed in the preceding section gives us a framework for discussing model selection. We want the model with the smallest MSE, the model which makes the optimal bias-variance trade-off.

Unfortunately, this is useless in practice. Mean square error is a theoretical quantity. It depends on the unknown true regression function and the unknown error variance. Furthermore, there is no obvious way to estimate it. Without knowing which models are good, which is exactly the question we are trying to resolve, we can’t get a good estimate of the true regression function (and without that we can’t estimate the error variance well either).


Figure 6: Some regression data with true regression function (solid line) and sample regression function (dashed line).

There are several quantities that have been proposed in the literature as estimates of MSE. None is obviously right. We shall not go into the theory justifying them (it is merely heuristic anyway). One fairly natural quantity is

    Σ_{i=1}^{n} ê_(i)²    (7)

where ê_(i) is the so-called leave-one-out residual

    ê_(i) = y_i − ŷ_(i)

where ŷ_(i) is the predicted value at the point x_i from the model fitted to the data that leaves out y_i. The idea is that this gives an “honest” estimate of the error because y_i is not used in calculating ŷ_(i). Hence ŷ_(i) can’t be “too close” to y_i since it doesn’t “know” about y_i.

It turns out that it is not necessary to do n different regressions (each leaving out one y_i). All of the leave-one-out residuals can be calculated from the fit in the full regression model using

    ê_(i) = ê_i / (1 − h_ii)    (8)

where h_ii is the i, i element of the hat matrix. This nifty algebraic fact makes (7) much easier to calculate than it appears at first sight and consequently makes (7) a much more useful criterion.

Unlike SSResid (the sum of the ê_i²), this does not always favor the biggest model. Models that overfit tend to do a bad job of their leave-one-out predictions. (7) is called the predicted residual sum of squares (PRESS) or the cross-validated sum of squares (CVSS), cross-validation being another term used to describe the leave-one-out idea.

The idea of CVSS or other criteria to be described presently is to pick the model with the smallest value of the criterion. This will not necessarily be the model with the smallest MSE, but it is a reasonable estimate of it.

Another criterion somewhat easier to calculate is Mallows’ Cp defined by

    Cp = SSResid_p / σ̂² + 2p − n                        (9)
       = (SSResid_p − SSResid_k) / σ̂² + p − (k − p)
       = (k − p)(F_{k−p,n−k} − 1) + p

where SSResid_p is the sum of squares of the residuals for some model with p predictors (including the constant predictor, if present), σ̂² = SSResid_k / (n − k) is the estimated error variance for the largest model under consideration, which has k predictors, and F_{k−p,n−k} is the F statistic for the F test for comparison of these two models. The F statistic is about one in size if the small model is correct, in which case Cp ≈ p. This gives us a criterion for finding reasonably

fitting models. When many models are under consideration, many of them may have Cp ≈ p or smaller. All such models must be considered reasonably good fits. Any might be the correct model or as close to correct as any model under consideration. The quantity estimated by Cp is the mean square error of the model with p predictors, divided by σ².

An idea somewhat related to Mallows’ Cp, but applicable outside the regression context, is the Akaike information criterion (AIC)

    −2 · (log likelihood) + 2p    (10)

where, as in Mallows’ Cp, the number of parameters in the model is p. Although (9) and (10) both have the term 2p, they are otherwise different. The log likelihood for a linear regression model, with the MLEs plugged in for the parameters, is

    log likelihood = constant − (n / 2) log(SSResid)

Thus for a regression model with p predictors,

    AIC = constant + n log(SSResid_p) + 2p

where “constant” here is a term that does not depend on the model and hence does not matter in model comparisons.

Finally, we add one last criterion, almost the same as AIC, called the Bayes information criterion (BIC)

    −2 · (log likelihood) + p log(n)    (11)

In the regression context, this becomes

    BIC = constant + n log(SSResid_p) + p log(n).
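
In R the generic AIC function computes (10) directly from the log likelihood, and supplying k = log(n) gives (11). One detail worth noting (this check is mine, not part of the original notes): for lm fits, logLik counts the error variance as a parameter, so R’s parameter count is one more than the number of regression coefficients. That shift is the same for every model, so it does not affect model comparisons. Using the out7 fit from Example 2:

> p.check <- length(coef(out7)) + 1    # coefficients plus the variance parameter
> all.equal(AIC(out7), -2 * as.numeric(logLik(out7)) + 2 * p.check)
> all.equal(AIC(out7, k = log(n)), -2 * as.numeric(logLik(out7)) + log(n) * p.check)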

Neither AIC nor BIC has a rigorous theoretical justification applicable to a wide variety of models. Both were derived for special classes of models that were easy to analyze and both involve some approximations. Neither can be claimed to be the right thing (nor can anything else). As the “Bayes” in BIC indicates, the BIC criterion is intended to approximate using Bayes tests instead of frequentist tests (although its approximation to true Bayes tests is fairly crude). Note that BIC penalizes models with more parameters more strongly than AIC (p log n versus 2p), so BIC never selects a larger model than AIC.

This gives us four criteria for model selection. There are arguments in favor of each. None of the arguments are completely convincing. All are widely used.

Example 3. We use the data for Example 2 again. Now we fit polynomials of various degrees to the data, and look at our four criteria.

> n <- length(y)
> deg <- seq(1, 15, 2)

> cvss <- double(length(deg))
> cp <- double(length(deg))
> aic <- double(length(deg))
> bic <- double(length(deg))
> out.big <- lm(y ~ poly(x, 17))
> sigsqhat.big <- summary(out.big)$sigma^2
> for (i in seq(along = deg)) {
+     k <- deg[i]
+     out <- lm(y ~ poly(x, k))
+     aic[i] <- AIC(out)
+     bic[i] <- AIC(out, k = log(n))
+     cvss[i] <- sum((out$residuals/(1 - hatvalues(out)))^2)
+     cp[i] <- sum(out$residuals^2)/sigsqhat.big + 2 * out$rank -
+         n
+ }
> foo <- cbind(deg, cvss, cp, aic, bic)
> dimnames(foo) <- list(NULL, c("degree", "CVSS", "Cp", "AIC",
+     "BIC"))
> foo

     degree      CVSS         Cp       AIC        BIC
[1,]      1 40.439523 828.779469 194.22004 202.035551
[2,]      3 36.056686 703.387578 183.14746 196.173313
[3,]      5 10.720251 124.411825  55.11721  73.353398
[4,]      7  4.523134   7.818442 -24.75417  -1.307641
[5,]      9  4.906861  11.099780 -21.53995   7.116921
[6,]     11  5.712702  15.097179 -17.54281  16.324407
[7,]     13  5.744573  15.683763 -17.36182  21.715737
[8,]     15  4.893021  15.636742 -18.08720  26.200696

In this example, all four criteria select the same model, the polynomial of degree 7. This is not the model with the smallest MSE discovered by the theoretical analysis. The criteria do something sensible, but as everywhere else in statistics, there are errors (the sample is not the population).
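
As a check on formula (8), one can compute the leave-one-out residuals the slow way, by actually refitting the degree 7 model n times with one observation dropped each time, and compare with the shortcut used for CVSS above. This check is mine, not part of the original notes, and the object names (out.check, press.slow, fit.minus.i) are made up. Both computations should reproduce the CVSS value for degree 7 reported in the table (about 4.52).

> out.check <- lm(y ~ poly(x, 7))
> press.slow <- double(n)
> for (i in 1:n) {
+     fit.minus.i <- lm(y ~ poly(x, 7), subset = -i)    # refit without observation i
+     press.slow[i] <- y[i] - predict(fit.minus.i, data.frame(x = x[i]))
+ }
> all.equal(sum(press.slow^2),
+     sum((out.check$residuals/(1 - hatvalues(out.check)))^2))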

5 All Subsets Regression

We now return to the situation in which there are k predictors including the constant predictor and 2^(k−1) models under consideration (the constant predictor is usually included in all models). If k is large and one has no non-statistical reason (e. g., a practical or scientific reason) that cuts down the number of models to be considered, then one must fit them all. Fortunately, there are fast algorithms that allow a huge number of models to be fit or at least quickly checked to see that they are much worse than other models of the same size. There is a contributed package to R that contains a function leaps that does this.

Example 4 (2^k subsets). The data set in the file ex12.7.4.dat in this directory, read by

> X <- read.table("ex12.7.4.dat", header = TRUE)
> names(X)

[1] "y" "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10" "x11" [13] "x12" "x13" "x14" "x15" "x16" "x17" "x18" "x19" "x20"

> attach(X)

consists of multivariate normal data with one response variable y and 20 predictor variables x1, ..., x20. The predictors are correlated. The distribution from which they were simulated has all correlations equal to one-half. The actual (sample) correlations, of course, are all different because of chance variation. The true population regression function (the one used to simulate the y values) was

    y = x1 + x2 + x3 + x4 + x5 + e    (12)

with error variance σ² = 1.5². We fit the model by

> out <- lm(y ~ ., data = X, x = TRUE)
> summary(out)

Call:
lm(formula = y ~ ., data = X, x = TRUE)

Residuals:
     Min       1Q   Median       3Q      Max
-3.08973 -0.66991 -0.01928  0.81824  3.14955

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.07879    0.17449  -0.452  0.65286
x1           1.11887    0.22303   5.017 3.17e-06 ***
x2           0.79401    0.26519   2.994  0.00367 **
x3           0.45118    0.25848   1.746  0.08478 .
x4           0.59879    0.23037   2.599  0.01115 *
x5           1.09573    0.24277   4.513 2.20e-05 ***
x6           0.26067    0.24220   1.076  0.28509
x7          -0.15959    0.21841  -0.731  0.46712
x8          -0.50182    0.23352  -2.149  0.03470 *
x9           0.14047    0.22888   0.614  0.54116
x10          0.37689    0.22831   1.651  0.10275
x11          0.39805    0.21722   1.832  0.07065 .
x12          0.38825    0.22396   1.734  0.08689 .
x13         -0.07910    0.23553  -0.336  0.73788
x14          0.26716    0.20737   1.288  0.20138

x15         -0.12016    0.23073  -0.521  0.60398
x16          0.08592    0.22372   0.384  0.70195
x17          0.31296    0.22719   1.378  0.17224
x18         -0.24605    0.23355  -1.054  0.29531
x19          0.10221    0.21503   0.475  0.63586
x20         -0.45956    0.23698  -1.939  0.05604 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.494 on 79 degrees of freedom
Multiple R-Squared: 0.8724,     Adjusted R-squared: 0.8401
F-statistic: 27.01 on 20 and 79 DF,  p-value: < 2.2e-16

We’ve left in the significance codes, bogus though they may be (more on this later), so you can easily spot the regression coefficients that the least squares fit indicates may be important.

If we go only with the strongest evidence (two or three stars) we get as “significant” three of the five truly important regression coefficients [recall from (12) that the true nonzero regression coefficients are β1 through β5]. The other two are missed. If we use a less stringent standard, say one or more stars, we do pick up another truly nonzero regression coefficient, but we also pick up a false one. Thus we now have both false negatives (we still missed β3) and false positives (we’ve picked up β8). With the least stringent standard, all the coefficients marked by any of the “significance codes”, we now have no false negatives (all five of the truly nonzero regression coefficients are now declared “significant”) but we have four false positives.

No matter how you slice it, least squares regression doesn’t pick the right model. Of course, this is no surprise. It’s just “the sample is not the population.” But it does show that the results of such model selection procedures must be treated with skepticism.

Actually, we haven’t even started a sensible model selection procedure. Recall the slogan that if you want to know how good a model fits, you have to fit that model. So far we haven’t fit any of the models we’ve discussed. We’re fools to think we can pick out the good submodels just by looking at printout for the big model.

There is a function leaps in the leaps contributed package¹ that fits a huge number of models. By default, it finds the 10 best models of each size (number of regression coefficients) for which there are 10 or more models and finds all the models of other sizes. It uses the inequality that a bigger model always has a smaller sum of squares to eliminate many models. Suppose we have already found 10 models of size p with SSResid less than 31.2. Suppose there was a model of size p + 1 that we fit

¹ This has to be installed separately. It doesn’t come with the default install. Like everything else about R, it can be found at http://cran.r-project.org.

and found its SSResid was 38.6. Finally suppose σ̂² for the big model is 2.05. Now the Cp for the 10 best models of size p already found is

    Cp = SSResid_p / σ̂² + 2p − n < 31.2 / 2.05 + 2p − n

and the Cp for any submodel of size p of the model with SSResid = 38.6 (i. e., models obtained by dropping one predictor from that model) has

    Cp ≥ 38.6 / 2.05 + 2p − n

This means that no such model can be better than the 10 already found, so they can be rejected even though we haven’t bothered to fit them. Considerations of this sort make it possible for leaps to pick the 10 best of each size without fitting all or even a sizable fraction of the models of each size. Thus it manages to do in minutes what it couldn’t do in a week if it actually had to fit all 2^k models.

The reason why leaps uses Cp as its criterion rather than one of the others is that it is a simple function of SSResid and hence amenable to these inequalities that permit its efficient operation.

We run the leaps function as follows, with the design matrix x defined as above,²

> par(mar = c(5, 4, 1, 1) + 0.1)
> x <- out$x
> dim(x)

[1] 100 21

> library(leaps)
> outs <- leaps(x, y, int = FALSE, strictly.compatible = FALSE)
> plot(outs$size, outs$Cp, log = "y", xlab = "p", ylab = expression(C[p]))
> lines(outs$size, outs$size)

Figure 7 shows the plot made by the two plot commands. Every model with Cp < p, corresponding to the dots below the line, is “good.” There are a huge number of perfectly acceptable models, because for the larger p there are many more than 10 good models, which are not shown.

The best model according to the Cp criterion is one with p = 12, so 11 non-constant predictors, which happen to be x1 through x5 (the truly significant predictors) plus x8, x10, x11, x12, x14, and x20. We can get its regression output as follows.

> ifoo <- outs$Cp == min(outs$Cp)
> ifoo <- outs$which[ifoo, ]

² For some reason, leaps doesn’t take formula expressions like lm does. The reason is probably historical. The equivalent S-plus function doesn’t either, because it was written before S had model formulas and hasn’t changed. The strictly.compatible=FALSE tells R not to be bug-for-bug compatible with S-plus.


Figure 7: Cp plot. Plot of Cp versus p. The dots are the Cp for the 10 best models for each p. The curve is the line Cp = p (curved because of the log scale for Cp).

> foo <- x[, ifoo]
> out.best <- lm(y ~ foo + 0)
> summary(out.best)

Call:
lm(formula = y ~ foo + 0)

Residuals:
     Min       1Q   Median       3Q      Max
-3.50162 -0.77683 -0.05537  0.95203  3.26516

Coefficients:
        Estimate Std. Error t value Pr(>|t|)
foox1     1.0694     0.1982   5.395 5.61e-07 ***
foox2     0.8732     0.2419   3.610 0.000505 ***
foox3     0.4426     0.2274   1.946 0.054805 .
foox4     0.7271     0.1979   3.675 0.000406 ***
foox5     1.1517     0.2221   5.186 1.34e-06 ***
foox8    -0.3475     0.2090  -1.663 0.099923 .
foox10    0.4182     0.2111   1.981 0.050632 .
foox11    0.3402     0.2011   1.691 0.094266 .
foox12    0.3883     0.1968   1.973 0.051549 .
foox14    0.2999     0.1843   1.627 0.107181
foox20   -0.4955     0.2160  -2.294 0.024148 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.453 on 89 degrees of freedom
Multiple R-Squared: 0.8668,     Adjusted R-squared: 0.8503
F-statistic: 52.64 on 11 and 89 DF,  p-value: < 2.2e-16

Note that the “stargazing” doesn’t correspond with the notion of the best model by the Cp criterion. One of these coefficients doesn’t even have a dot (so for it P > 0.10), and five others only have dots (0.05 < P < 0.10). Considering them separately, this would lead us to drop them. But that would be the Wrong Thing (multiple testing without correction). The leaps function does as close to the Right Thing as can be done. The only defensible improvement would be to change the criterion, to BIC perhaps, which would choose a smaller “best” model because it penalizes larger models more. However BIC wouldn’t have the nice inequalities that make leaps so efficient, which accounts for the use of Cp.

I hope you can see from this analysis that model selection when there are a huge number of models under consideration and no extra-statistical information (scientific, practical, etc.) that can be used to cut down the number is a mug’s game. The best you can do is not very good. The only honest conclusion is that a huge number of models are about equally good, as good as one would expect the correct model to be (Cp ≈ p).
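
For comparison, R’s step function can carry out a stepwise (not all-subsets) search with the BIC penalty by setting k = log(n) in the AIC-style criterion it uses. This sketch is mine, not part of the original analysis, and a stepwise search can miss models that leaps would find.

> out.step.bic <- step(out, k = log(nrow(X)), trace = 0)
> formula(out.step.bic)

A criterion that penalizes size more heavily will typically settle on a smaller model, and different criteria (and different search strategies) can easily disagree about which model is “best.”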

Thus it is silly to get excited about exactly which model is chosen as the “best” by some model selection procedure (any procedure)! When many models are equally good, the specific features of any one of them can’t be very important.

All of this is related to our slogan about “regression is for prediction, not explanation.” All of the models with Cp < p predict about equally well. So if regression is used for prediction, the model selection problem is not serious. Just pick any one of the many good models and use it. For prediction it doesn’t matter which good prediction is used. But if regression is used for explanation, the model selection problem is insoluble. If you can’t decide which model is “best” and are honest enough to admit that lots of other models are equally good, then how can you claim to have found the predictors which “explain” the response? Of course, if you really understand “correlation is not causation, and regression isn’t either,” then you know that such “explanations” are bogus anyway, even in the “simple” case (one non-constant predictor) where the model selection problem does not arise.
