Does Cross-Validation Work When P ≫ N?
Laurence de Torrenté
Mathematics Institute, EPFL, 1015 Lausanne, Switzerland
[email protected]

Trevor Hastie*
Department of Statistics, Stanford University, Stanford, CA 94305
[email protected]

September 12, 2012

Abstract

Cross-validation is a popular tool for evaluating the performance of a predictive model, and hence also for model selection when we have a series of models to choose from. It has been suggested that cross-validation can fail when the number of predictors p is very large. We demonstrate through a suggestive simulation example that while K-fold cross-validation can have high variance in some situations, it is unbiased. We also study two permutation methods to assess the quality of a cross-validation curve for model selection, which we demonstrate using the lasso. The first approach is visual, while the second computes a p-value for our observed error. We demonstrate these approaches using two real datasets, "Colon" and "Leukemia", as well as a null dataset. Finally we use the bootstrap to estimate the sampling distribution of our cross-validation curves and their functionals. This adds to the graphical evidence of whether our findings are real or could have arisen by chance.

*Trevor Hastie was partially supported by grant DMS-1007719 from the National Science Foundation, and grant RO1-EB001988-15 from the National Institutes of Health.

Keywords: variable selection, high-dimensional, permutation, bootstrap.

1 Introduction

Cross-validation is an old method, which was investigated and reintroduced by Stone (1974). It splits the dataset into two parts, using one part to fit the model (training set) and the other to test it (test set). K-fold cross-validation (K-fold CV) and leave-one-out cross-validation (LOOCV) are the best-known variants. There is also generalized cross-validation (GCV), an approximation due to Wahba & Wold (1975).

We consider a regression problem where we want to predict a response variable Y using a vector of p predictors $X = (X_1, \dots, X_p)^T$ via a linear model,

$$\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j.$$

In general we would like to eliminate the superfluous variables among the $X_j$, and obtain good estimates of the coefficients for those retained. Any variable with a small $|\hat{\beta}_j|$ is a candidate for removal: it may be advantageous to set $\hat{\beta}_j = 0$, accept a small bias, and yet reduce the prediction error. Ordinary least-squares estimators tend to have small bias but large variance in the prediction of Y. The trade-off between variance and bias can be made explicit by variable-selection procedures or by shrinkage methods. Having fewer predictors can also improve data visualization and understanding. With genomic data, the number of variables is often so large that selection becomes essential, both for statistical inference and for interpretation.

A variety of selection and shrinkage methods have been proposed and investigated, among them:

• Forward stepwise selection, which begins with a model containing only the intercept and adds one covariate at each step until no better model can be found.

• Ridge regression, which leaves all the covariates in but regularizes their coefficients. The criterion to be minimized is a weighted sum of the residual sum of squares and the squared Euclidean norm of the coefficients.

• The lasso (Tibshirani 1996), which is like ridge except that it penalizes the L1-norm of the coefficients, thereby achieving subset selection and shrinkage at the same time.

The lasso solves the convex optimization problem

$$\operatorname*{minimize}_{\beta \in \mathbb{R}^p} \; \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|. \qquad (1)$$

All of these methods correspond to families of estimators $\hat{\beta}_\lambda$ rather than a single estimator. This is a situation where cross-validation is useful: the cross-validated estimate of the prediction error can help in choosing within these families. It is often used to select the tuning parameter λ for the lasso or ridge regression, or the subset size for stepwise regression. For example, we choose the tuning parameter that results in the model with the minimal cross-validation error.
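To make this use of cross-validation concrete, here is a minimal sketch (not code from the paper) of selecting the lasso tuning parameter by fivefold cross-validation, assuming scikit-learn is available. The data X, y and the λ grid are synthetic placeholders, and scikit-learn's `alpha` plays the role of λ in (1) up to a scaling constant.

```python
# Minimal sketch: choosing the lasso tuning parameter by 5-fold cross-validation.
# X, y and the lambda grid are synthetic placeholders, not the paper's data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, p = 50, 200                      # p >> n, as in the settings discussed here
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # a null response, independent of X

lambdas = np.logspace(-2, 1, 30)    # candidate values of the tuning parameter
cv = KFold(n_splits=5, shuffle=True, random_state=0)

cv_error = np.zeros(len(lambdas))
for tr, te in cv.split(X):
    for i, lam in enumerate(lambdas):
        # scikit-learn's Lasso minimizes (1/(2n)) * RSS + alpha * ||beta||_1,
        # so alpha corresponds to lambda in (1) up to a constant factor.
        fit = Lasso(alpha=lam, max_iter=10000).fit(X[tr], y[tr])
        cv_error[i] += np.mean((y[te] - fit.predict(X[te])) ** 2) / cv.n_splits

best_lambda = lambdas[np.argmin(cv_error)]   # tuning parameter with minimal CV error
print(best_lambda, cv_error.min())
```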
Variable selection with high-dimensional data can be difficult. We have to work with a large number of covariates and a small sample size, and we want to estimate correctly both the prediction error and the "sparsity" pattern. Wasserman & Roeder (2009) propose a multistage procedure which "screens" and "cleans" the covariates in order to reduce the dimension while retaining reasonable power.

It is often suggested that cross-validation could fail in the p ≫ n situation, as the following simple situation appears to demonstrate (a simplification of a similar example in Section 7.10.3 of Hastie, Tibshirani & Friedman (2008)). Suppose we have p (very large) binary predictors and a binary response (e.g. a genomics dataset). We consider a very simple procedure: we search through the covariates, find the one most aligned with the response, and use it to predict the response. With n sufficiently small, we might find one or more features that are perfectly aligned with the response. If this is the case, it would also be true for every cross-validation fold! This could happen even in the null case, where the response is independent of the features. It seems that cross-validation would fail here, and report a cross-validation error of zero. In other words, an association between a feature and the response is found even if there is none. It turns out that this is not a counterexample, but it does demonstrate the high variance of cross-validation in such situations. Braga-Neto & Dougherty (2004) also showed that cross-validated estimators exhibit high variance and large outliers when p ≫ n. Fan, Guo & Hao (2012) also documented the variance problem and proposed alternative estimators.

In Section 2 we pursue this example. We generate a null dataset and use this simple prediction rule to demonstrate that the cross-validation error is unbiased for the expected test error but can have high variance.

Cross-validation being unbiased is not enough. By luck we might find a model with very low cross-validated error, but not be sure whether it is real. We have seen that the cross-validation error can have high variance, so we look at two different methods to assess this variability. In Section 3 we look at the cross-validation error used to select the tuning parameter $\hat{\lambda}$ in the lasso. First, we use a visual method which allows us to see whether our error is credible or could have arisen from a null model. Then, we compute the null distribution of the minimum error and hence a "p-value" for our observed error. The methods we propose are visual and descriptive, and use permutations to generate null CV curves. We demonstrate these approaches using two real datasets, "Colon" and "Leukemia", as well as a simulated dataset.

In Section 4 we discuss the use of the bootstrap in this context. The bootstrap estimates the sampling distribution of the cross-validation curves and their functionals, such as the optimal λ and the minimal error. This helps us establish whether the good performance we see in our observed data (relative to the null data) could be by chance.
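As a rough illustration of the permutation idea previewed above (this sketch is ours, not the authors' implementation), one can permute the response to generate null cross-validation curves and compare the observed minimal CV error against their minima. The helper `min_cv_error` and the toy data below are assumptions for illustration, with scikit-learn's LassoCV standing in for the paper's lasso fits.

```python
# Sketch of the permutation check: compare the observed minimal CV error with the
# null distribution obtained by permuting the response. X, y are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LassoCV

def min_cv_error(X, y, seed=0):
    """Minimal 5-fold cross-validated MSE over the lasso path."""
    fit = LassoCV(cv=5, n_alphas=50, max_iter=10000, random_state=seed).fit(X, y)
    return fit.mse_path_.mean(axis=1).min()

rng = np.random.default_rng(1)
n, p = 50, 200
X = rng.standard_normal((n, p))
y = X[:, 0] + 0.5 * rng.standard_normal(n)   # toy signal in the first covariate

observed = min_cv_error(X, y)

# Null CV errors: permuting y breaks any association with X.
null_errors = np.array([min_cv_error(X, rng.permutation(y), seed=b)
                        for b in range(100)])

# Permutation "p-value" for the observed minimal error (smaller error is better).
p_value = (1 + np.sum(null_errors <= observed)) / (1 + len(null_errors))
print(observed, p_value)
```

The bootstrap variant of Section 4 follows the same pattern, with row resampling in place of permutation and the whole CV curve (and its argmin) recorded for each resample.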
In every section we will use fivefold cross-validation, which seems to be a good compromise (see Hastie et al. (2008), Breiman & Spector (1992) and Kohavi (1995)).

2 Cross-validation is unbiased

In this section we show that cross-validation is unbiased but can have high variance, in the context of the simple example outlined in the introduction. We generate a null dataset with binary covariates and an independent binary response, and use a simple rule for choosing the predictor. We want to see whether cross-validation estimates the true error of this selection procedure.

We generate p binary covariates of length n, i.i.d. from a Bernoulli distribution,

$$X_{ij} \sim \mathrm{B}(0.2), \quad i = 1, \dots, n, \; j = 1, \dots, p,$$

and $Y_i$, $i = 1, \dots, n$, from a Bernoulli $\mathrm{B}(0.5)$. The true error rate is the null rate of 50%: since Y is independent of all the $X_j$, no matter which class we predict, half of the predictions will be wrong. The best predictor is chosen according to the following criterion: select the covariate most aligned with the response (smallest Hamming distance), and use it to predict the response. If there are ties for the best, choose one at random.

Fivefold cross-validation is applied as follows. The best predictor is selected using the "training" set $T^k_{\mathrm{tr}}$ (4/5ths of the data) from the kth fold, and the error is computed on the "test" set $T^k_{\mathrm{te}}$ (the remaining 1/5th). With the training set, we choose $X_\ell$ with

$$\ell = \operatorname*{argmin}_{j=1,\dots,p} \sum_{i \in T^k_{\mathrm{tr}}} 1_{\{Y_i \neq X_{ij}\}}. \qquad (2)$$

Then, with the test set, the error for the kth fold is

$$e_k = \frac{1}{n^k_{\mathrm{te}}} \sum_{i \in T^k_{\mathrm{te}}} 1_{\{Y_i \neq X_{i\ell}\}}. \qquad (3)$$

We repeat this for every fold (possibly selecting a different $\ell$ each time), and set the cross-validation error to $e = \frac{1}{5}\sum_{k=1}^{5} e_k$. For these experiments we generated the $X_j$ once, and obtained repeated independent realizations of Y by permutation.

Figure 1 shows the results of cross-validation using 1000 different realizations of Y, separately for different configurations of n and p. We see in all cases that cross-validation is indeed unbiased: the true error of 50% is the median of the distribution, but the spread can be high. One might wonder how this can be in these examples, especially in the configurations where there is a reasonable probability of pure examples: features that align perfectly with the response over all n observations.
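The simulation just described is straightforward to reproduce. Below is a minimal sketch (ours, not the paper's code) of the Hamming-distance rule (2) and the fivefold CV error (3), with arbitrary values of n and p; for simplicity it draws a fresh null response in each repetition rather than permuting a single realization as in the paper.

```python
# Sketch of the Section 2 simulation: null binary data, "most aligned covariate"
# rule (2), and the fivefold CV error (3). n, p and the number of repetitions
# are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, p, n_reps, K = 40, 1000, 1000, 5

X = rng.binomial(1, 0.2, size=(n, p))         # covariates generated once
folds = np.array_split(rng.permutation(n), K)  # fixed fold assignment

cv_errors = np.empty(n_reps)
for r in range(n_reps):
    y = rng.binomial(1, 0.5, size=n)           # null response, independent of X
    e = np.empty(K)
    for k in range(K):
        te = folds[k]
        tr = np.setdiff1d(np.arange(n), te)
        # Hamming distance of each covariate to the response on the training set.
        hamming = (X[tr] != y[tr, None]).sum(axis=0)
        best = rng.choice(np.flatnonzero(hamming == hamming.min()))  # break ties at random
        e[k] = np.mean(X[te, best] != y[te])   # test-fold misclassification error
    cv_errors[r] = e.mean()

print(cv_errors.mean(), cv_errors.std())
```

Its output should show an average CV error close to the 50% null rate but with considerable spread, mirroring the unbiasedness and high variance reported in Figure 1.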