
Does Cross-validation Work when p ≫ n?

Laurence de Torrenté, Mathematics Institute, EPFL, 1015 Lausanne, Switzerland. [email protected]

Trevor Hastie∗, Department of Statistics, Stanford University, Stanford, CA 94305. [email protected]

September 12, 2012

Abstract

Cross-validation is a popular tool for evaluating the performance of a predictive model, and hence also for model selection when we have a series of models to choose from. It has been suggested that cross-validation can fail when the number of predictors p is very large. We demonstrate through a suggestive simulation example that while K-fold cross-validation can have high variance in some situations, it is unbiased. We also study two permutation methods to assess the quality of a cross-validation curve for model selection, which we demonstrate using the lasso. The first approach is visual, while the second computes a p-value for our observed error. We demonstrate these approaches using two real datasets, “Colon” and “Leukemia”, as well as a null dataset. Finally we use the bootstrap to estimate the sampling distribution of our cross-validation curves and their functionals. This adds to the graphical evidence of whether our findings are real or could have arisen by chance.

∗Trevor Hastie was partially supported by grant DMS-1007719 from the National Science Foundation, and grant RO1-EB001988-15 from the National Institutes of Health.

Keywords: variable selection, high-dimensional, permutation, bootstrap.

1 Introduction

Cross-validation is an old method, which was investigated and reintroduced by

Stone (1974). It splits the dataset into two parts, using one part to fit the model

(training set) and one to test it (test set). K-fold cross-validation (K-fold CV) and leave-one-out cross-validation (LOOCV) are the best-known variants. There is also generalized cross-validation (GCV), an approximation due to Wahba &

Wold (1975).

We consider a regression problem where we want to predict a response variable $Y$ using a vector of $p$ predictors $X = (X_1,\ldots,X_p)^T$ via a linear model,
$$\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j.$$

In general we would like to eliminate the superfluous variables among the $X_j$, and obtain good estimates of the coefficients for those retained. Any variable with a small $|\hat\beta_j|$ is a candidate for removal: it may be advantageous to set $\hat\beta_j = 0$, allow for a small bias, and yet reduce the prediction error. Ordinary least-squares estimators tend to have small bias but large variance in the prediction of $Y$. The trade-off between variance and bias can be made explicit by variable-selection procedures or by shrinkage methods. Having fewer predictors can also improve data visualization and understanding. With genomic data, the number of variables is often so large that selection becomes essential, both for prediction and for interpretation.

A variety of selection and shrinkage methods have been proposed and investigated, among them:

• Forward stepwise selection, which begins with a model containing only the intercept and adds one covariate at each step, until no better model can be found.

• Ridge regression leaves all the covariates in, but regularizes their coefficients. The criterion to be minimized is a weighted sum of the residual sum of squares and the squared Euclidean norm of the coefficients.

• Lasso (Tibshirani 1996) is like ridge, except it penalizes the L1-norm of the coefficients, thereby achieving subset selection and shrinkage at the same time. The lasso solves the convex optimization problem
$$\operatorname*{minimize}_{\beta\in\mathbb{R}^p}\ \sum_{i=1}^{n}\left(Y_i - X_i^T\beta\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j|. \qquad (1)$$

All of these methods correspond to families of estimators $\hat\beta_\lambda$ rather than a single one. This is a situation where cross-validation is useful. The cross-validated estimate of the prediction error can help in choosing within these families. It is often used to select the tuning parameter (λ) for the lasso or ridge regression, or the subset size for stepwise regression. For example, we choose the tuning parameter that results in the model with the minimal cross-validation error.
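As a concrete illustration, the following is a minimal R sketch of this tuning-parameter selection, using the glmnet package (cited in Section 3.1); the simulated data and variable names here are illustrative only, not the data analysed in this paper.

    ## Minimal sketch: choosing lambda for the lasso by 5-fold CV (illustrative data).
    library(glmnet)

    set.seed(1)
    n <- 100; p <- 500
    x <- matrix(rnorm(n * p), n, p)          # predictors
    y <- rnorm(n, mean = x[, 1] - x[, 2])    # response depending on two covariates

    cvfit <- cv.glmnet(x, y, nfolds = 5)     # CV error over the default lambda sequence

    cvfit$lambda.min                         # lambda with minimal CV error
    min(cvfit$cvm)                           # the minimal CV error itself
    coef(cvfit, s = "lambda.min")            # coefficients of the selected model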

Variable selection with high dimensional data can be difficult. We have to work with a large number of covariates and small sample size. We want to estimate correctly the prediction error and the “sparsity” pattern. Wasserman

& Roeder (2009) propose a multistage procedure which “screens” and “cleans” the covariates in order to reduce the dimension and retain reasonable power.

It is often suggested that cross-validation could fail in the p ≫ n situation, as the following simple example appears to demonstrate (a simplification of a similar example in Section 7.10.3 of Hastie, Tibshirani & Friedman (2008)).

Suppose we have p (very large) binary predictors, and a binary response (e.g. a genomics dataset). We consider a very simple procedure: we search through the covariates, find the one most aligned with the response, and use it to predict the response. With n sufficiently small, we might find one or more features that are perfectly aligned with the response. If this is the case, it would also be true for every cross-validation fold! This could happen even in the null case, where the response is independent of the features. It seems that cross-validation would fail here, and report a cross-validation error of zero. In other words, an association between a feature and the response is found even if there is none. It turns out that this is not a counterexample, but it does demonstrate the

high variance of cross-validation in such situations. Braga-Neto & Dougherty (2004) also showed that cross-validated estimators exhibit high variance and large outliers when p ≫ n. Fan, Guo & Hao (2012) also documented this variance problem and proposed alternative estimators.

In Section 2 we pursue this example. We generate a null dataset and use this simple prediction rule to demonstrate that cross-validation error is unbiased for the expected test error but can have high variance.

Cross-validation being unbiased is not enough. By luck we might find a

model with very low cross-validated error, but not be sure if it is real or not.

We have seen that cross-validation error can have high variation, so we look at

two different methods to assess this variance. In Section 3 we look at cross-validation error used to select the tuning parameter ($\hat\lambda$) in the lasso. First, we

use a visual method which allows us to see if our error is credible, or could

have arisen from a null model. Then, we compute the null distribution of the

minimum error and hence a “p-value” for our observed error. The methods we

propose are visual and descriptive, and use permutations to generate null CV

curves. We demonstrate these approaches using two real datasets “Colon” and

“Leukemia”, as well as a simulated dataset.

In Section 4, we discuss the use of the bootstrap in this context. The bootstrap estimates the sampling distribution of the cross-validation curves and their

functionals such as optimal λ and minimal error. This helps us establish whether

the good performance we see in our observed data (relative to the null data)

could be by chance.

In every section, we will use 5-fold cross-validation, which seems to be a good

compromise (see Hastie et al. (2008), Breiman & Spector (1992) and Kohavi

(1995)).

2 Cross-validation is unbiased

In this section we show that cross-validation is unbiased but can have high variance, in the context of the simple example outlined in the introduction. We generate a null dataset with binary covariates and independent binary response, and use a simple rule for choosing the predictor. We want to see whether cross-validation estimates the true error of this selection procedure.

We generate p binary covariates of length n i.i.d. from a Bernoulli distribution,
$$X_{ij} \sim B(0.2), \qquad i = 1,\ldots,n;\ j = 1,\ldots,p,$$

and $Y_i,\ i = 1,\ldots,n$, from a Bernoulli $B(0.5)$. The true error rate is the null rate of 50%, since Y is independent of all the $X_j$: no matter what class we predict, half will be wrong on average.

The best predictor is chosen based on the following criterion: select the covariate most aligned with the response (smallest Hamming distance), and use it to predict the response. If there are ties for the best, choose one at random.

5-fold cross-validation is applied as follows. For the kth fold, the best predictor is selected using the “training” set $T^k_{tr}$ (4/5ths of the data), and the error is computed on the “test” set $T^k_{te}$ (the remaining 1/5th). With the training set, we choose the $X_\ell$ with
$$\ell = \operatorname*{argmin}_{j=1,\ldots,p}\ \sum_{i\in T^k_{tr}} 1_{\{Y_i \neq X_{ij}\}}. \qquad (2)$$
Then, with the test set, the error for the kth fold is
$$e_k = \frac{1}{n_{te}} \sum_{i\in T^k_{te}} 1_{\{Y_i \neq X_{i\ell}\}}, \qquad (3)$$
where $n_{te}$ is the size of the test set. We repeat this for every fold (possibly selecting a different $\ell$ each time), and set the cross-validation error to $e = \frac{1}{5}\sum_{k=1}^{5} e_k$. For these experiments we generated the $X_j$ once, and obtained repeated independent realizations of Y by permutation.
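To make the procedure concrete, here is a short R sketch of one replication of this experiment; it is illustrative code only, with the fold assignment and random tie-breaking following the description above.

    ## Sketch of one replication: null data, "most aligned predictor" rule, 5-fold CV.
    set.seed(1)
    n <- 50; p <- 5000
    X <- matrix(rbinom(n * p, 1, 0.2), n, p)   # binary covariates
    Y <- rbinom(n, 1, 0.5)                     # independent binary response

    folds <- sample(rep(1:5, length.out = n))  # random 5-fold assignment
    errs <- sapply(1:5, function(k) {
      tr <- which(folds != k); te <- which(folds == k)
      ham  <- colSums(X[tr, , drop = FALSE] != Y[tr])       # Hamming distance per covariate
      best <- which(ham == min(ham))
      l <- if (length(best) > 1) sample(best, 1) else best   # break ties at random
      mean(X[te, l] != Y[te])                  # test misclassification for fold k
    })
    cv_error <- mean(errs)   # near 0.5 on average over replications, but highly variable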

6 Figure 1 shows the results of cross-validation using 1000 different realizations of Y , separately for different configurations of n and p. We see in all cases that cross-validation is indeed unbiased; the true error 50% is the median of the distribution, but that the spread can be high. One might wonder how this can be in these examples — especially in the configurations where there is a reasonable probability of pure examples: features that align perfectly with the response over all n observations. The reason is that there will also likely be pure

4/5 training sets, including ones where the 1/5 test points are not pure. In this case there would be tied “best” predictors, and the one picked at random might be the one with the “impure” test set.


Figure 1: Boxplots of 5-fold cross-validation error for different values of n and p and 1000 realizations of Y. On the left, we generate Y once and permute it 1000 times; on the right we draw 1000 new realizations of the data. We wanted to compare both methods: permutation vs. generating new data each time. They appear to give similar results. Both plots show that cross-validation seems to be unbiased but can have high variance. For instance, on the left with n = 10 and p = 1000, the median is at 0.5 (unbiased) but the values vary from 0 to 0.9 (very high variance).

We conclude from this simple but compelling experiment that cross-validation does estimate the expected test error in an apparently unbiased way, but it can have high variance. This variance is somewhat troubling. One would normally take the result of a single cross-validation, and hence could be subject to a high-variance stroke of bad luck. In the case n = 10 and p = 1000, for example, one might find a cross-validation error of 0 (the lowest point on the boxplot), even though the true error is 50%! This is motivation for augmenting cross-validation with some variance information, a topic we address next.

3 The null distribution of cross-validation

In this section we examine the use of cross-validation to select the tuning parameter λ in the lasso. Figure 2 shows a cross-validation curve for the leukemia data (next section), as a function of the regularization parameter. Standard error bands are included, based on the variability across the five folds. However, since these five error estimates are correlated, we are reluctant to place too much faith in these bands.

We present two methods to assess the CV misclassification error estimates.

Both are based on the distribution of the CV curves using null data — data

obtained by randomly permuting the responses Y , holding all the X variables

fixed. The idea is that if we have an impressive CV result on our real data, it

should not be easily achievable with null data. If it is, we would not be able

to trust it. The reason we permute only Y is that we wish to create a null

relationship between predictors and response, without breaking the correlation

structure of the predictors.

• The first method is graphical, and simply plots the null distribution of the CV curves, and compares it to the original curve (Figures 3, 5 and 7).

• The second method uses the same distribution of curves, and computes the null distribution of the minimized cross-validation error, and hence a

“p-value” for the observed minimal error (Figures 4, 6 and 8).

We demonstrate these two approaches using lasso applied to two real datasets

“Leukemia” and “Colon”, as well as a simulated null dataset, using 5-fold CV in

all cases.
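The following R sketch shows one way to generate such null CV curves and the permutation distribution of the minimal error. It is illustrative code only: x and y are assumed to hold the predictor matrix and class labels, and it uses 100 replications rather than the 1000 used for the figures below.

    ## Sketch: permutation null for the CV curve and its minimum (illustrative).
    library(glmnet)

    cv_min <- function(x, y) {
      fit <- cv.glmnet(x, y, family = "binomial", type.measure = "class", nfolds = 5)
      min(fit$cvm)                                   # minimal CV misclassification error
    }

    obs_min  <- replicate(100, cv_min(x, y))         # real data, different random folds
    null_min <- replicate(100, cv_min(x, sample(y))) # permuted responses, X held fixed

    ## Permutation "p-value" for one observed minimal error
    mean(null_min <= obs_min[1])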

3.1 Leukemia dataset

The leukemia dataset (Golub, Slonim, Tamayo, Huard, Gaasenbeek, Mesirov,

Coller, Loh, Downing, Caligiuri, Bloomfield & Lander 1999) contains gene-expression data and a class label indicating type of leukemia (“AML” vs “ALL”).

We compute the cross-validation error with the package glmnet (Friedman, Hastie & Tibshirani 2010). The dataset contains n = 38 samples (11 “AML” and 27 “ALL”), and the number of covariates p is 7129.


Figure 2: Cross-validation curve for the leukemia dataset, as produced by the package glmnet. The bands indicate pointwise standard errors based on the error estimates from the 5 folds. The numbers of non-zero coefficients for each log(λ) are shown above the plot. The dotted line on the left corresponds to the minimum CV error. The other one is the largest value of log(λ) such that the error is within one standard error of the minimum. In this paper, we will focus on the minimum CV error or the whole CV curve.

A lasso fit is defined up to the tuning parameter λ, which must be specified.

We use CV to estimate the prediction error for a sequence of lasso models, and pick a value of λ that has good estimated prediction performance. For example, we can pick that value that minimizes the CV error. This corresponds, in

Figure 2, to a CV error of 0.053 at log(λ) = −1.505. (In cases where the

CV curve is flat, like this one, we always pick the largest value of λ, which

corresponds to the sparsest model).
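This tie-breaking convention is easy to make explicit in code rather than relying on software defaults; a small sketch, assuming fit is the cv.glmnet object behind Figure 2:

    ## Largest lambda attaining the minimal CV error (explicit tie-breaking sketch).
    cvmin      <- min(fit$cvm)                        # minimal CV misclassification error
    lambda_hat <- max(fit$lambda[fit$cvm <= cvmin])   # largest lambda achieving that minimum
    c(log_lambda = log(lambda_hat), cv_error = cvmin)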

Our first method analyses the whole CV error curve. In Figure 3, the grey curves correspond to the “null” cross-validation curves from 1000 permutations of Y . The red curves correspond to CV curves on the real data. Since there will be variation in these curves because of the randomness in fold selection, we repeated the CV procedure 1000 times. For clarity in the left plot we show just

five such red curves. Superimposed in purple is the average CV curve

(over all 1000 fold selections).

We would be concerned if the actual cross-validation curves lay within the grey null cloud. In this example there is fairly good separation, even when taking into account the randomness due to fold selection (right panel).


Figure 3: Cross-validation curves for the leukemia dataset. In both plots the grey cloud corresponds to CV curves computed from null data. In the right plot, we show the variation in the true CV curves resulting from 1000 different random fold selections. When selecting a tuning parameter, we would not compute 1000 different CV curves; typically one or at most a few of these curves are computed. In the left plot, we have selected 5 of these at random. Overlaid in purple in both plots is the average CV curve (averaged over all the folds). Of interest is whether the real CV curves (red) lie inside or outside the set of null CV curves (grey). Here they appear to lie outside, and so we might conclude there is information in these data.

The second method computes the null distribution of the minimal cross-validation error.

Figure 4: Real and null distribution for the leukemia dataset. The purple histogram corresponds to the minimal error rate found on actual data and the blue one shows the frequencies of the minimal values under the null distribution. There is not much overlap in these distributions, which suggests strong evidence of signal in these data.

The purple bars in Figure 4 correspond to the minimal cross-validation errors computed with the observed Y, using different random folds. The actual values are listed in Table 1.

Table 1: Frequency of CV error values for the observed Y for the leukemia dataset (purple bars in Figure 4).

CV error    0/38  1/38  2/38  3/38  4/38  5/38  6/38
Frequency     43   314   482   138    20     1     2

The blue part of Figure 4 is the null distribution of the minimal cross-validation error obtained by randomly permuting Y a thousand times. We find the values listed in Table 2.

We can estimate the null probability for each observed CV value using this distribution. For example

$$P(\text{CV err} = 0) = P\left(\text{CV err} = \tfrac{1}{38}\right) = P\left(\text{CV err} = \tfrac{2}{38}\right) = 0,$$
$$P\left(\text{CV err} = \tfrac{3}{38}\right) = \tfrac{2}{1000} = 0.002,$$
$$P\left(\text{CV err} = \tfrac{4}{38}\right) = \tfrac{1}{1000} = 0.001.$$

The null hypothesis we are interested in is H0: the covariates Xj and Y are independent, vs the alternative H1: there is dependence between Y and the Xj. We will use the minimum CV error as the test statistic. The mode for the null

CV values in Figure 4 is 0.289, which is the null error rate 11/(11 + 27).

Table 2: Frequency of CV error values when permuting Y for the leukemia dataset (blue bars in Figure 4).

CV error    3/38  4/38  6/38  7/38  8/38  9/38 10/38 11/38 12/38 13/38 14/38
Frequency      2     1     5    23    38    57   133   704    25     9     3

For instance, if we observe a CV error of 4/38, we can compute the p-value of this observation given $H_0$,
$$P\left(\text{CV error} \le \tfrac{4}{38}\right) = \tfrac{3}{1000} = 0.003.$$

Therefore, we can reject the null hypothesis at the 1% significance level. In other words, a CV error as low as 4/38 is very unlikely to have arisen by chance if Y were independent of the $X_j$.

3.2 Colon dataset

The dataset colon contains snapshots of the expression pattern of different cell types with a class label indicating if the colon is healthy or not (Alon, Barkai,

Notterman, Gish, Ybarra, Mack & Levine 1999).

In this dataset, the response Y contains 22 normal colon tissue samples and

40 colon tumor samples. The number of covariates is 2000.

In Figure 5, the real cross-validation curves overlap the grey cloud slightly in both plots, but the minimal value lies below the cloud, so the small CV error appears to reflect real signal rather than chance.


Figure 5: Cross-validation curves for the colon dataset. Details are the same as in Figure 3.

The blue part of Figure 6 is the null distribution of the CV error when permuting Y a thousand times. The actual values found are listed in Table 3.

Table 3: Frequency of CV error values when permuting Y for the colon dataset (blue bars in Figure 6).

CV error    10/62  11/62  12/62  13/62  14/62  15/62  16/62  17/62
Frequency       1      1      2      4      6      5     16     19

CV error    18/62  19/62  20/62  21/62  22/62  23/62  24/62  25/62
Frequency      38     45     64    117    662     14      5      1

Table 4: Frequency and p-values of the minimal CV errors for the observed Y (different random folds) for the colon dataset (purple bars in Figure 6). The p-value is the fraction of times a minimal CV value as small or smaller was seen amongst the permuted data. Except for 14/62, none of the values exceeds 0.01.

CV error     5/62   6/62   7/62   8/62   9/62  10/62  11/62  12/62  13/62  14/62
Frequency       4     32    132    259    311    181     59     16      4      2
p-value         0      0      0      0      0  0.001  0.002  0.004  0.008  0.014

The different values for the minimal CV errors computed with the observed

Y , using different random folds (purple part of figure 6) are listed in table 4.

The p-value computed with the null distribution is also included.


Figure 6: Real and null distribution for the colon dataset. Details are as in Figure 4.

We test whether the minimal CV error could have come from the null distribution or reflects a real effect. We can reject the null hypothesis at the 1% significance level for CV errors up to 13/62. The value 14/62 is borderline (p-value 0.014), but it occurs for only two of the thousand fold selections.

3.3 Simulated dataset

We apply both methods to the simulated dataset with n = 50 individuals and p = 5000 covariates from Section 2. Here we are in the null case, where the covariates are completely independent of the response. The one dataset designated as “real” shows a somewhat optimistic CV curve.


Figure 7: Cross-validation curves for the simulated dataset. Details are the same as in Figure 3. In this case the “real” CV curves lie inside the null cloud.

In Figure 7, the actual cross-validation curves lie inside the grey cloud in both plots. The curves from the “real” response behave like curves from the null distribution. From this plot alone we would already suspect that the CV error cannot be trusted here.

Table 5: Frequency of CV error values when permuting Y for the simulated dataset (blue bars in Figure 8).

CV error    6/50  7/50  8/50  9/50 10/50 11/50 12/50 13/50 14/50 15/50 16/50 17/50 18/50 19/50 20/50
Frequency      1     1     6     1     6    13    14    15    29    34    42    56    33    58    63

CV error   21/50 22/50 23/50 24/50 25/50 26/50 27/50 28/50 29/50 30/50 31/50 32/50 33/50 34/50
Frequency     81    77    86    79    79    72    54    41    29    13    10     4     1     2


Figure 8: Real and null distribution for the simulated dataset. Details are as in Figure 4. In this case the histograms overlap considerably.

The blue part of Figure 8 is the null distribution of the cross-validation error when permuting Y a thousand times. The values found are listed in Table 5.

The different values for the minimal cross-validation errors computed with the observed Y , using different random folds (purple part of figure 8) are listed in table 6. We also include the p-value for each observed value with the null distribution.

Table 6: Frequency and p-value of CV error values for the observed Y for the simulated dataset (purple bars in Figure 8). The most frequent observed value, 13/50, has a p-value of 0.057. Moreover, every value of 10/50 or above has a p-value exceeding 0.01.

CV error    5/50  7/50  8/50  9/50 10/50 11/50 12/50 13/50 14/50
Frequency      1     4    12    38    58    83   122   150   133
p-value        0 0.002 0.008 0.009 0.015 0.028 0.042 0.057 0.086

CV error   15/50 16/50 17/50 18/50 19/50 20/50 21/50 22/50
Frequency    140   109    65    42    23    13     6     1
p-value    0.120 0.162 0.218 0.251 0.309 0.372 0.453 0.530

While there is strong evidence here that the data is null, we see some potential dangers in the variation observed. First, despite the data being null, we will sometimes see a realized dataset such as the one seen here. In this particular case there is a 5.5% chance (55/1000) that the minimal CV error from a random fold selection would be significant at the 1% level. This suggests that it might be wiser to average the CV curves to reduce the variation due to fold selection.

In the next section we augment the null data with bootstrapped versions of the CV curves. This throws sampling variation into the mix, and helps distinguish the real situations from the null.

4 Bootstrap analysis

In this section we use the bootstrap to estimate the sampling distribution of the cross-validation curves, and hence of the functionals we derive from these curves.

The bootstrap is useful in studying certain properties of statistics, such as variance. It mimics the process of getting new samples from nature, and hence the sampling distribution of the statistics in question. We resample from the dataset, and so we expect each bootstrap dataset to have properties similar to those of the original data. We use the non-parametric bootstrap, which puts a mass of $1/n$ (the empirical distribution $F$) on every pair $(X_i, Y_i)$, $i = 1,\ldots,n$. We resample $n$ times from $F$ to get one bootstrapped dataset.

Table 7: Covariates retained in more than one-fifth of the 1000 bootstrapped datasets, after the models are selected by cross-validation.

Covariates   249  377  493  625  1582  1772  1843

In the context of cross-validation, we have to be careful. Cross-validation divides the data into two parts, the training set and the test set. For the original data, every individual observation appears once in the dataset (we assume all the $(X, Y)$ $(p+1)$-tuples are distinct). Hence there is no possibility that an individual is used in both the training and the test set, a potential source of positive bias. This is not the case for a bootstrapped dataset: there will almost certainly be ties. If we let some of these tied values go to the training set, and some to the validation set, we will have artificially created (additional) correlation. To avoid this, we operate at the level of observation weights. In the original sample, each observation has weight 1/n. In a bootstrapped sample, these weights will be of the form k/n, for k = 0, 1, 2, .... When we cross-validate a bootstrap sample, we randomly divide the original observations into the two groups, but their bootstrap weights go with them (including the 0s).
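As an illustration of this weighting scheme, here is a sketch in R using cv.glmnet, which accepts an observation-weights argument; the code is illustrative only and assumes x and y hold the original predictors and responses, and that the fitting routine accepts zero weights.

    ## Sketch: cross-validating one bootstrap sample via observation weights.
    library(glmnet)

    n <- nrow(x)
    w <- tabulate(sample(n, n, replace = TRUE), nbins = n)  # bootstrap count for each original observation

    ## The original observations are split into folds as usual; each observation
    ## carries its bootstrap weight (possibly 0) into whichever fold it falls in.
    boot_fit <- cv.glmnet(x, y, weights = w,
                          family = "binomial", type.measure = "class", nfolds = 5)

    min(boot_fit$cvm)                                  # minimal CV error for this bootstrap sample
    sum(coef(boot_fit, s = "lambda.min")[-1, ] != 0)   # number of non-zero coefficients (excluding intercept)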

Then, for every bootstrap dataset, we can compute statistics like the number of non-zero coefficients, the value of λ minimizing the CV error, the minimal value of the cross-validation error, and so on. We can look at the variation of these statistics to see if our model is stable or not. For instance, we may study the number of times each covariate has a nonzero coefficient in the model. With that information, we can judge whether some of the covariates appear to be more important than others.

For example, we can declare that a covariate is important if it is chosen in more than one-fifth of the 1000 bootstrap datasets. For the colon dataset we

find the covariates listed in table 7.

We may also study statistics like the minimal cross-validation error. To compute a confidence interval we sort the 1000 minimal cross-validation errors found when bootstrapping and take the 25th value as the 2.5 percentile and the 975th as the 97.5 percentile. This gives a 95% confidence interval for the minimal cross-validation error of [0, 0.2368]. Figure 9 shows the frequencies of the minimal cross-validation values under the bootstrap distribution and under the null distribution computed by permutation in Section 3.2.
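In code this percentile interval is a one-liner; a sketch assuming boot_min holds the 1000 bootstrapped minimal CV errors (R's default quantile interpolation gives essentially the sorted-value rule described above):

    ## 95% percentile interval from the bootstrapped minimal CV errors.
    quantile(boot_min, probs = c(0.025, 0.975))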


Figure 9: The purple histogram corresponds to the minimal cross-validation error under the bootstrap distribution and the blue one shows the null distribution. Both are computed with the colon dataset. The bootstrap distribution represents the underlying distribution of the real CV error of the colon dataset. The null distribution still corresponds to the one computed from 1000 permutations of Y. Here we would conclude that the bootstrap distribution looks very different from the null distribution, which provides more evidence that the effects found are real.

In figure 9, the real CV error does not seem to have the same distribution as the null one, with the vast majority of values shifted to the left. This provides more evidence that the signal found in the colon dataset is real.

In figure 10 (null data), both distributions are very similar, and we would

(correctly) not trust the results of cross-validation in this case. For the “real” dataset in this case (one of the null datasets with somewhat optimistic CV curves), the modal value for the minimal CV error is 0.26. Figure 8 (purple) shows the variation in this minimal error under random fold selection, and we

saw that 0.26 or smaller occurs less than 5.7% of the time. The bootstrap includes sampling variation as well, and now we see that this sample is virtually indistinguishable from null data.


Figure 10: Bootstrap and null distribution of the minimal CV error, computed for the null dataset from Section 2. The details are the same as in figure 9. Here there is substantial overlap, and we would not trust the original cross-validation.

The bootstrap is thus helpful in providing evidence as to whether our original sample might be an opportunistic draw from the null distribution.

5 Conclusion

In this paper, we demonstrated through a simple example that cross-validation is unbiased but can have high variance. Care has to be taken when p ≫ n.

We presented two methods based on permutations which can help us to assess the validity of the CV error. The first one plots the null cross-validation curves and offers visual information about the observed cross-validation curve(s). The

21 second one computes a p-value of the observed minimal cross-validation error calculated with the null distribution found by permutation of Y . For both real datasets, the original CV curves lie outside the distributions of null curves.

Likewise, the minimal CV errors are shifted to the left for the real datasets, and using permutation-based hypothesis tests, we can conclude that they would not be found by chance.

For the “null” dataset — even an opportunistic draw from the family of null datasets — the overlap is much more substantial.

The bootstrap completes the picture by adding sampling variation to the variation of the cross-validation curves and their functionals. The real datasets, even under bootstrap resampling, have minimal CV errors that look distinct from the null distribution. The null dataset, on the other hand, becomes indistinguishable from the family of null data.

In this article, we analyzed the independent case; our methods are designed to assess a possible null effect: if we have a small CV error, what is the chance that it could have occurred at random, i.e. could have arisen from a null distribution in which there is no dependence between the response and the predictors?

References

Alon, U., Barkai, N., Notterman, D. A., Gish, K., Ybarra, S., Mack, D. & Levine, A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays, Proceedings of the National Academy of Sciences of the United States of America 96(12): 6745–6750. URL: http://dx.doi.org/10.1073/pnas.96.12.6745

Braga-Neto, U. M. & Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification?, Bioinformatics 20(3): 374–380. URL: http://dx.doi.org/10.1093/bioinformatics/btg419

Breiman, L. & Spector, P. (1992). Submodel selection and evaluation in regression. The X-random case, International Statistical Review 60(3): 291–319.

Fan, J., Guo, S. & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression, Journal of the Royal Statistical Society, Series B 74(1): 37–65. URL: http://ideas.repec.org/a/bla/jorssb/v74y2012i1p37-65.html

Friedman, J. H., Hastie, T. & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33(1): 1–22. URL: http://www.jstatsoft.org/v33/i01

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. & Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286(5439): 531–537. URL: http://dx.doi.org/10.1126/science.286.5439.531

Good, P. I. (2006). Resampling Methods, 3rd edn, Birkhäuser.

Hastie, T. J., Tibshirani, R. J. & Friedman, J. H. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics, 2nd edn, Springer, New York.

Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection, International Joint Conference on Artificial Intelligence, Vol. 14, Citeseer, pp. 1137–1145.

Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions, Journal of the Royal Statistical Society, Series B (Methodological) 36(2): 111–147.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society, Series B (Methodological) 58(1): 267–288. URL: http://dx.doi.org/10.2307/2346178

Wahba, G. & Wold, S. (1975). A completely automatic French curve: fitting spline functions by cross validation, Communications in Statistics - Theory and Methods 4(1): 1–18.

Wasserman, L. & Roeder, K. (2009). High dimensional variable selection, Annals of Statistics 37(5A): 2178–2201. URL: http://view.ncbi.nlm.nih.gov/pubmed/19784398