Wavelet basis selection via cross-validation

By EMMA McCOY
Department of Mathematics, Imperial College London, UK
[email protected]

and KONSTANTINOS KONSTANTINOU
Department of Mathematics, Imperial College London, UK

Summary
We write the wavelet regression estimator in the form of a linear smoother, from which we develop a leave-one-out cross-validation scheme to select the degrees of freedom of the resulting reconstruction. It is well known that a leave-one-out procedure may have some deficiencies and has a tendency to overfit. In order to counter these problems, we extend to leave-two-out and leave-three-out schemes, where we leave out adjacent points in the regression in order to ensure good local fit. The same criterion can also be employed as a means of selecting a wavelet from a particular family and for identifying the primary resolution level. A fast algorithm for calculating the elements of the smoothing matrix leads to computationally feasible methods for calculating both the cross-validatory criteria we derive and confidence intervals for the approximations.

Some key words: Non-parametric regression; Wavelets; Cross-validation; Model selection; Hard thresholding; Degrees of freedom.

1. Introduction
Wavelet regression estimators are well known for their efficient representation of wide classes of functions, and the hard and soft thresholding schemes of Donoho and Johnstone (1994, 1995) and Donoho et al. (1995) are becoming increasingly popular for their spatial adaptivity. Reviews of both classical and Bayesian techniques for wavelet regression estimators can be found in Vidakovic (1999) and Percival and Walden (2000). A review of the comparative merits of various wavelet regression schemes can be found in Antoniadis and Bigot (2001). Despite the large literature on wavelet regression techniques, there is a lack of practical advice on issues such as wavelet choice or the determination of the primary resolution, which, as noted in Hall and Penev (2001), has a first-order impact on the squared error of the estimator. Here we shall concentrate on cross-validation schemes for selecting an appropriate wavelet model. Zhang (1993) provides some evidence that multifold cross-validation (MCV), where we leave out more than one observation, reduces the tendency to overfit often seen in leave-one-out schemes. We devise a fast multifold cross-validatory scheme for wavelet thresholding, which addresses the problem of overfitting. We consider the following fixed design model

\[
Y_i = g(t_i) + \varepsilon_i, \qquad i = 1, \ldots, N = 2^J,
\]

where $\varepsilon = \{\varepsilon_1, \ldots, \varepsilon_N\}$ are i.i.d. $N(0, \sigma^2)$. Wavelet regression relies upon the truncated representation of $g(t)$ using orthogonal basis functions,

\[
g(t) = v_{0,0}\,\phi(t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} w_{j,k}\,\psi_{j,k}(t),
\]

where $\phi$ and $\psi_{j,k}$ are the basis functions generated from dilations and translations of a particular scaling function $\phi(t)$ and associated wavelet $\psi(t)$. For full details of this decomposition see, for example, Abramovich et al. (2000). Hard wavelet thresholding schemes, as introduced by Donoho and Johnstone (1994, 1995), discard "small" coefficients below a certain threshold and retain the "large" coefficients, which are believed to contain the signal. In other words, we retain a certain number of coefficients in the regression; this is a model selection problem, in which the choice of threshold can be viewed as an attempt to ascertain which basis functions are important in the reconstruction. We can write this as

\[
Y = W^T T \theta + \varepsilon = W_p^T \theta_p^{(h)} + \varepsilon,
\]
where $T$ is a diagonal matrix with diagonal elements given by

\[
T_{i,i} = \begin{cases} 1 & \text{if the $i$th coefficient is kept;} \\ 0 & \text{if the $i$th coefficient is killed,} \end{cases}
\]
$\theta = \{v_{0,0},\, w_{j,k} : j = 0, \ldots, J-1;\ k = 0, \ldots, 2^j - 1\}$ represents the wavelet transform coefficients, and $W_p$ is the $(p \times N)$ matrix formed by retaining $p$ rows of the usual discrete wavelet transform matrix, $W$, where the retained rows correspond to the coefficients retained (denoted by $\theta_p^{(h)} = \{\theta_1^{(h)}, \ldots, \theta_p^{(h)}\}$). So $W_p^T$ will be a reordered submatrix of $W^T$.
For a given $N$ and $p$, the number of possible wavelet reconstructions using the above selection procedure is $\binom{N}{p}$. However, for wavelet thresholding we use the magnitude of the empirical coefficients to reduce this to a single selection, based on the rows of $W$ corresponding to the $p$ largest wavelet coefficients, so that the coefficients are ranked according to importance.
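To make the selection step concrete, the following sketch builds an orthonormal Haar transform matrix and retains the $p$ largest empirical coefficients. It is an illustration only: the Haar basis, the helper names haar_matrix and hard_threshold_fit, and the use of plain numpy are our choices and are not part of the development above.

    import numpy as np

    def haar_matrix(n):
        # Orthonormal Haar DWT matrix W (n x n), for n a power of two.
        # Rows are ordered coarse to fine, so theta = W @ y returns v_{0,0}
        # first and then the w_{j,k} by increasing resolution level.
        if n == 1:
            return np.array([[1.0]])
        h = haar_matrix(n // 2)
        top = np.kron(h, [1.0, 1.0]) / np.sqrt(2.0)                   # scaling rows
        bottom = np.kron(np.eye(n // 2), [1.0, -1.0]) / np.sqrt(2.0)  # wavelet rows
        return np.vstack([top, bottom])

    def hard_threshold_fit(y, p):
        # Hard thresholding as subset selection: keep the p rows of W
        # carrying the largest-magnitude empirical coefficients.
        N = len(y)
        W = haar_matrix(N)
        theta = W @ y                             # empirical coefficients
        keep = np.argsort(-np.abs(theta))[:p]     # indices of the p largest |theta_i|
        T = np.zeros((N, N))
        T[keep, keep] = 1.0                       # selection matrix T
        Wp = W[keep, :]                           # (p x N) matrix W_p
        y_hat = Wp.T @ (Wp @ y)                   # fitted values W^T T W y
        return y_hat, Wp, T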

As in Bruce et al. (1999), for hard thresholding, the estimator of $\theta$ is given by
\[
\hat{\theta}_p^{(h)} = (W_p W_p^T)^{-1} W_p Y = W_p Y,
\]
due to the orthogonality of $W$, while the fitted values are given by

\[
\hat{Y} = W_p^T \hat{\theta}_p^{(h)} = W_p^T W_p Y = W^T T W Y = H_p^{(h)} Y, \qquad (1)
\]

where $H_p^{(h)}$ is popularly known as the hat matrix or smoothing matrix. In soft thresholding, the coefficients are retained if they exceed a threshold $\delta$ (in the same way as hard thresholding), but in addition the retained coefficients are "shrunk" towards zero by the threshold. In this case the estimator of $\theta$ is given by

\[
\hat{\theta}_p^{(s)} = T_p \hat{\theta}_p^{(h)} = T_p W_p Y,
\]
and
\[
\hat{Y} = W_p^T T_p W_p Y = H_p^{(s)} Y,
\]
where $T_p$ is a $(p \times p)$ diagonal matrix with diagonal elements given by

\[
(T_p)_{j,j} = \begin{cases} 1 & j < 2^{J_0}, \\[4pt] 1 - \dfrac{\delta}{|\hat{\theta}_j^{(h)}|} & \text{otherwise.} \end{cases}
\]

$J_0$ is known as the primary resolution level: thresholding is only applied to resolutions finer than the primary resolution, and $T_p$ is the $(p \times p)$ matrix corresponding to the thresholding of the retained coefficients. Note that $|\hat{\theta}_p^{(h)}| > \delta > |\hat{\theta}_{p+1}^{(h)}|$. In this paper we concentrate on the hard threshold estimators. However, once cross-validation has been employed to choose an appropriate level for the degrees of freedom, it is straightforward to construct a soft-threshold estimator with the same degrees of freedom.
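A corresponding sketch of the soft-threshold step, in the same illustrative Haar setting as before (the name soft_threshold_fit and the coarse-to-fine ordering it relies on are our assumptions):

    def soft_threshold_fit(y, delta, J0):
        # Coefficients at resolutions coarser than J_0 (the first 2**J0 under
        # the coarse-to-fine ordering of haar_matrix) are left untouched; finer
        # coefficients exceeding delta in magnitude are shrunk towards zero by
        # delta, matching (T_p)_{j,j} = 1 - delta/|theta_j|; the rest are killed.
        N = len(y)
        W = haar_matrix(N)
        theta = W @ y
        theta_s = theta.copy()
        fine = np.arange(2 ** J0, N)
        survives = np.abs(theta[fine]) > delta
        shrunk = np.sign(theta[fine]) * (np.abs(theta[fine]) - delta)
        theta_s[fine] = np.where(survives, shrunk, 0.0)
        return W.T @ theta_s                      # reconstructed fitted values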

2. Leave-one-out cross-validation
Cross-validation is a well-known technique that can be used as a selection criterion amongst competing model alternatives. It provides an estimate of the prediction error over the models; the model which minimizes this estimate is chosen as the 'best' model. Nason (1996) introduced a 2-fold cross-validation scheme for selecting the threshold. Generalized cross-validation was proposed by Jansen, Malfait and Bultheel (1997), and other cross-validatory schemes have been discussed by Weyrich and Warhola (1998) and Hurvich and Tsai (1998). Nason (2002) introduced a leave-one-out cross-validation scheme which employs the Kovac-Silverman algorithm, and can be employed to choose wavelet smoothness and primary resolution. Generalized cross-validation is closely linked to $C_p$ and AIC (see Efron, 2001), through the concept of equivalent degrees of freedom (EDF), which can be calculated as the trace of the smoothing matrix. In the case of hard thresholding, the degrees of freedom will be equal to the number of coefficients retained in the estimator, as, from (1), we have

\[
\mathrm{EDF} = \mathrm{tr}\, H_p^{(h)} = \mathrm{tr}(W^T T W) = \mathrm{tr}\, T = p.
\]
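A quick numerical check of this identity, reusing the hypothetical hard_threshold_fit helper sketched in Section 1:

    # Verify tr(H_p) = p on simulated data (illustration only).
    N, p = 64, 10
    rng = np.random.default_rng(0)
    y = np.sin(np.linspace(0.0, 3.0, N)) + 0.1 * rng.standard_normal(N)
    _, Wp, _ = hard_threshold_fit(y, p)
    H = Wp.T @ Wp                                 # hat matrix H_p = W_p^T W_p
    assert np.isclose(np.trace(H), p)             # EDF = tr(H_p) = p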

There are alternative definitions of EDF (see Buja, Hastie and Tibshirani, 1989), but for the case of hard thresholding the definitions agree.
Due to the local nature of wavelet estimators we prefer to consider a more localized cross-validation, and derive leave-one-out, leave-two-out and leave-three-out criteria, as they consider the leverage of individual points, i.e. individual elements of the smoothing matrix rather than its trace.
Let $W_p^{(k)}$ denote the $(p \times (N-1))$ matrix obtained by eliminating the $k$th column from the matrix $W_p$ (i.e. the $k$th row from the matrix $W_p^T$) and $Y^{(k)} = \{Y_1, \ldots, Y_{k-1}, Y_{k+1}, \ldots, Y_N\}$. Under this formulation, the least squares estimator, $\hat{\theta}^{(k)}$, of $\theta$, based on all observations excluding $Y_k$, is given by
\[
\hat{\theta}^{(k)} = (W_p^{(k)} W_p^{(k)T})^{-1} W_p^{(k)} Y^{(k)},
\]
giving a cross-validatory criterion of
\[
\Delta_p = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - w_i^T \hat{\theta}^{(i)} \right)^2,
\]
where $w_i^T$ is the $i$th row of $W_p^T$, so that $w_i$ is a $(p \times 1)$ vector. We can use the Sherman-Morrison formula to express this criterion in terms of estimators based on all observations. If $H$ is of rank one, we have (see Miller, 1987),

\[
(I + H)^{-1} = I - \frac{1}{1 + \mathrm{tr}\, H} H.
\]
So,

\begin{align*}
(W_p^{(k)} W_p^{(k)T})^{-1} &= (W_p W_p^T - w_k w_k^T)^{-1} = (I - w_k w_k^T)^{-1} \\
&= I + \frac{w_k w_k^T}{1 - a_{kk}},
\end{align*}
where $a_{jk} = w_j^T w_k$ ($= a_{kj}$).
Note that $a_{kk}$ is the $k$th diagonal element of the $(N \times N)$ matrix $H = W_p^T W_p$: the leverage for the $k$th point. Giving,

\begin{align*}
Y_i - w_i^T \hat{\theta}^{(i)} &= Y_i - w_i^T \left( I + \frac{w_i w_i^T}{1 - a_{ii}} \right) W_p^{(i)} Y^{(i)} \\
&= \frac{1}{1 - a_{ii}} \left[ (1 - a_{ii}) Y_i - w_i^T \left( (1 - a_{ii}) I + w_i w_i^T \right) (W_p Y - w_i Y_i) \right] \\
&= \frac{1}{1 - a_{ii}} \left[ Y_i - a_{ii} Y_i - \hat{Y}_i + a_{ii} Y_i \right] \\
&= \frac{Y_i - \hat{Y}_i}{1 - a_{ii}},
\end{align*}
using $W_p^{(i)} Y^{(i)} = W_p Y - w_i Y_i$, $w_i^T W_p Y = \hat{Y}_i$ and $w_i^T w_i = a_{ii}$. The criterion then becomes

\[
\Delta_p = \frac{1}{N} \sum_{i=1}^{N} \frac{(Y_i - \hat{Y}_i)^2}{(1 - a_{ii})^2},
\]
demonstrating that the usual cross-validation score also applies in the case of wavelet regression. The elements of $H$ can be easily calculated for a given wavelet basis, as the rows of $W_p^T$ are made up of elements of the wavelet filters at each scale, which can be calculated via convolutions of the original filters (see, for example, Percival and Walden, 2000).
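In the illustrative setting above, the criterion can therefore be computed from a single full-data fit; a minimal sketch (loo_criterion is a hypothetical helper, and the instability when $a_{ii}$ is close to one, taken up in the next section, is deliberately left untreated):

    def loo_criterion(y, p):
        # Delta_p from the full-data residuals and the leverages a_ii;
        # no refitting over the N leave-one-out datasets is required.
        y_hat, Wp, _ = hard_threshold_fit(y, p)
        a = np.einsum('ji,ji->i', Wp, Wp)         # a_ii = w_i^T w_i, diagonal of H
        return np.mean(((y - y_hat) / (1.0 - a)) ** 2)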

3. Multifold cross-validation
The deficiencies of leave-one-out cross-validation are well documented (see, for example, Friedl and Stampfer (2002) and the references therein): while it often works well for smooth predictors, it can fail for less smooth functions. The local nature of wavelet regression implies that the diagonal elements of the smoothing matrix can be close to one, making the criterion outlined above ill-posed. In order to address these deficiencies, this section derives the leave-two-out and leave-three-out cross-validation criteria, where we leave out successive pairs and triples of points. The calculation of these criteria is again computationally fast and provides improved local behaviour in the predictors. The behaviour of these criteria validates the theoretical results of Zhang (1993), in that they reduce the chance of overfitting often seen with simple CV.
Let $W_p^{(k,k+1)}$ denote the $(p \times (N-2))$ matrix obtained by eliminating the $k$th and $(k+1)$th columns from the matrix $W_p$, and $Y^{(k,k+1)} = \{Y_1, \ldots, Y_{k-1}, Y_{k+2}, \ldots, Y_N\}$. Under this formulation, the least squares estimator $\hat{\theta}^{(k,k+1)}$, of $\theta$, based on all observations excluding $\{Y_k, Y_{k+1}\}$, is given by
\[
\hat{\theta}^{(k,k+1)} = \left( W_p^{(k,k+1)} W_p^{(k,k+1)T} \right)^{-1} W_p^{(k,k+1)} Y^{(k,k+1)}.
\]
We can now employ the rank-two update for the inverse (see Miller (1987), p. 14), namely,

\[
(I + H)^{-1} = I - \frac{1}{1 + \sigma_1 + \sigma_2} \left[ (1 + \sigma_1) H - H^2 \right],
\]
where
\[
H = -w_k w_k^T - w_{k+1} w_{k+1}^T,
\]

\begin{align*}
\sigma_1 &= \mathrm{tr}\, H = -a_{kk} - a_{k+1,k+1}, \\
\sigma_2 &= \tfrac{1}{2} \left[ (\mathrm{tr}\, H)^2 - \mathrm{tr}\, H^2 \right] = a_{kk} a_{k+1,k+1} - a_{k,k+1}^2.
\end{align*}
So,
\[
\hat{\theta}^{(k,k+1)} = \left( I - (w_k w_k^T + w_{k+1} w_{k+1}^T) \right)^{-1} \left( W_p Y - (w_k Y_k + w_{k+1} Y_{k+1}) \right).
\]
After some algebra, we have
\[
\epsilon_k^{(k,k+1)} = Y_k - w_k^T \hat{\theta}^{(k,k+1)} = \frac{(1 - a_{k+1,k+1})(Y_k - \hat{Y}_k) + a_{k,k+1}(Y_{k+1} - \hat{Y}_{k+1})}{1 - a_{kk} - a_{k+1,k+1} + a_{kk} a_{k+1,k+1} - a_{k,k+1}^2},
\]
giving a leave-two-out cross-validatory criterion of
\[
\Delta_p = \frac{1}{N-1} \sum_{i=1}^{N-1} \left( \epsilon_i^{(i,i+1)} \right)^2 + \frac{1}{N-1} \sum_{i=2}^{N} \left( \epsilon_i^{(i,i-1)} \right)^2.
\]
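A sketch of this score in the illustrative setting used earlier (the helper name and the explicit loop are our choices; the two deleted residuals for each adjacent pair come from the closed form above and its mirror image):

    def leave_two_out_criterion(y, p):
        # Closed-form deleted residuals for each adjacent pair {k, k+1}.
        N = len(y)
        y_hat, Wp, _ = hard_threshold_fit(y, p)
        e = y - y_hat                             # full-data residuals
        H = Wp.T @ Wp
        total = 0.0
        for k in range(N - 1):
            a00, a11, a01 = H[k, k], H[k + 1, k + 1], H[k, k + 1]
            det = 1.0 - a00 - a11 + a00 * a11 - a01 ** 2
            eps_k = ((1.0 - a11) * e[k] + a01 * e[k + 1]) / det    # predicts Y_k
            eps_k1 = ((1.0 - a00) * e[k + 1] + a01 * e[k]) / det   # predicts Y_{k+1}
            total += eps_k ** 2 + eps_k1 ** 2
        return total / (N - 1)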

Similarly, we can derive the leave-three-out result as
\begin{align*}
A_3 \left( Y_k - w_k^T \hat{\theta}^{(k,k+1,k+2)} \right)
&= (Y_k - \hat{Y}_k) \left( 1 - a_{k+1,k+1} - a_{k+2,k+2} + a_{k+1,k+1} a_{k+2,k+2} - a_{k+1,k+2}^2 \right) \\
&\quad + (Y_{k+1} - \hat{Y}_{k+1}) \left( a_{k,k+1} (1 - a_{k+2,k+2}) + a_{k,k+2} a_{k+1,k+2} \right) \\
&\quad + (Y_{k+2} - \hat{Y}_{k+2}) \left( a_{k,k+2} (1 - a_{k+1,k+1}) + a_{k,k+1} a_{k+1,k+2} \right),
\end{align*}
where
\[
A_3 = 1 + \sigma_1 + \sigma_2 + \sigma_3,
\]

with
\begin{align*}
\sigma_3 &= \tfrac{1}{6} \left[ (\mathrm{tr}\, H)^3 - 3\, \mathrm{tr}\, H\, \mathrm{tr}\, H^2 + 2\, \mathrm{tr}\, H^3 \right] \\
&= -a_{kk} a_{k+1,k+1} a_{k+2,k+2} + a_{kk} a_{k+1,k+2}^2 + a_{k+1,k+1} a_{k,k+2}^2 + a_{k+2,k+2} a_{k,k+1}^2 - 2 a_{k,k+1} a_{k,k+2} a_{k+1,k+2},
\end{align*}
and
\[
H = -w_k w_k^T - w_{k+1} w_{k+1}^T - w_{k+2} w_{k+2}^T.
\]
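These rank-update expressions agree with the standard multiple-deletion identity for linear smoothers, $(I - H_{SS})\,\epsilon_S = e_S$, where $H_{SS}$ is the submatrix of $H$ on the left-out index set $S$ and $e$ denotes the full-data residuals. Solving that small linear system sidesteps the explicit algebra; a sketch (the helper names and the averaging over all adjacent triples are our illustrative choices):

    def deleted_residuals(e, H, S):
        # Solve (I - H_SS) eps_S = e_S for the left-out index set S.
        S = np.asarray(S)
        M = np.eye(len(S)) - H[np.ix_(S, S)]
        return np.linalg.solve(M, e[S])

    def leave_three_out_criterion(y, p):
        # Average squared deleted residuals over all adjacent triples.
        N = len(y)
        y_hat, Wp, _ = hard_threshold_fit(y, p)
        e, H = y - y_hat, Wp.T @ Wp
        sq = [deleted_residuals(e, H, [k, k + 1, k + 2]) ** 2
              for k in range(N - 2)]
        return np.mean(np.concatenate(sq))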

4. Simulation study and comparisons
We choose the function studied by Hall and Penev (2001), given by
\[
g(t_i) = \begin{cases} 2\cos\{t_i + (\pi/15)\} & \text{if } t_i \in (-17\pi/30, 0], \\ -2\cos\{t_i + (\pi/15)\} & \text{if } t_i \in (0, 17\pi/30], \end{cases}
\]
with $t_i$, $i = 1, 2, \ldots, 1024$, equally spaced on $(-17\pi/30, 17\pi/30]$. Figure 1 shows histograms, over 500 runs, of the cross-validatory choice of the number of coefficients retained in the reconstruction. We see that the optimal number is well matched by the leave-three-out criterion, but is overestimated when fewer points are left out. Figure 2 shows the corresponding histograms for the choice of primary resolution, and Figure 3 shows typical reconstructions using the three criteria. We see that, for this choice of function, the leave-one-out scheme overfits, while leaving out higher numbers of adjacent points in the cross-validation scheme results in a reduction in this tendency to overfit.
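For reference, a sketch of one replicate of this design; the piecewise form of g is as reconstructed above, and the noise standard deviation is not stated in this section, so the value of sigma below is purely an assumption for illustration:

    # One replicate of the simulation design (sigma assumed).
    N = 1024
    t = np.linspace(-17 * np.pi / 30, 17 * np.pi / 30, N + 1)[1:]   # on (-17*pi/30, 17*pi/30]
    g = np.where(t <= 0, 2.0 * np.cos(t + np.pi / 15), -2.0 * np.cos(t + np.pi / 15))
    rng = np.random.default_rng(42)
    sigma = 1.0                                   # hypothetical noise level
    y = g + sigma * rng.standard_normal(N)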

[Figure 1: four histogram panels — "True value of p which minimises MSE", "Leave-one-out choice of p", "Leave-two-out choice of p", "Leave-three-out choice of p" — each over the number of retained coefficients p.]

Figure 1: Histograms of the true value of p which minimises the mean squared error and histograms of the choice of p using the cross-validatory criteria.

[Figure 2: four histogram panels — "True value of J0 which minimises MSE", "Leave-one-out choice of J0", "Leave-two-out choice of J0", "Leave-three-out choice of J0" — each over the primary resolution level J0.]

Figure 2: Histograms of the true value of J0 which minimises the mean squared error and histograms of the choice of J0 using the cross-validatory criteria.

[Figure 3: four panels — "True function + noise", "Leave-one-out reconstruction", "Leave-two-out reconstruction", "Leave-three-out reconstruction" — plotted against the observation index, 1 to 1024.]

Figure 3: Plots of typical reconstructions constructed using the cross-validatory criteria.

5. Extensions
There is growing evidence that the stationary wavelet transform, or maximal overlap discrete wavelet transform (MODWT), provides improved performance over the DWT in a variety of applications. This is because of the ability of the MODWT to "line up" correctly with events in the underlying functions at the different scales. However, the standard assumption of independence of the wavelet coefficients no longer holds in such a setting. Approaches for dealing with this include scale-dependent thresholding. Such an approach, while more computationally challenging, should be straightforward to implement in the cross-validatory setting we have devised here.

References

Abramovich, F., Bailey, T.C. and Sapatinas, T. (2000) Wavelet analysis and its statistical applications. J. R. Statist. Soc. D, 49, 1–29.
Antoniadis, A. and Bigot, J. (2001) Wavelet estimators in nonparametric regression: a comparative simulation study. Journal of Statistical Software, 6, 6, 1–83.
Bruce, A.G., Gao, H.Y. and Stuetzle, W. (1999) Subset-selection and ensemble methods for wavelet denoising. Statistica Sinica, 9, 167–182.
Buja, A., Hastie, T.J. and Tibshirani, R.J. (1989) Linear smoothers and additive models (with discussion). The Annals of Statistics, 17, 453–555.
Donoho, D.L. and Johnstone, I.M. (1994) Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81, 425–455.
Donoho, D.L. and Johnstone, I.M. (1995) Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90, 1200–1224.
Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995) Wavelet shrinkage: asymptopia? (with discussion). J. R. Statist. Soc. B, 57, 301–337.
Droge, B. (1998) Minimax regret analysis of orthogonal series regression estimation: selection versus shrinkage. Biometrika, 85, 3, 631–643.
Efron, B. (2001) Selection criteria for scatterplot smoothing. The Annals of Statistics, 29, 470–504.
George, E.I. and Foster, D.P. (2000) Calibration and empirical Bayes variable selection. Biometrika, 87, 4, 731–747.
Green, P. and Silverman, B.W. (1994) Nonparametric Regression and Generalized Linear Models. London: Chapman and Hall.
Hall, P. and Penev, S. (2001) Cross-validation for choosing resolution level for nonlinear wavelet curve estimators. Bernoulli, 7, 2, 317–341.
Hurvich, C.M. and Tsai, C.L. (1998) A cross-validatory AIC for hard wavelet thresholding in spatially adaptive function estimation. Biometrika, 85, 3, 701–710.
Jansen, M., Malfait, M. and Bultheel, A. (1997) Generalized cross validation for wavelet thresholding. Signal Processing, 56, 33–44.
Miller, K.S. (1987) Some Eclectic Matrix Theory. Krieger Publishing Company, Florida.
Nason, G.P. (1996) Wavelet shrinkage using cross-validation. J. R. Statist. Soc. B, 58, 463–479.
Nason, G.P. (2002) Choice of wavelet smoothness, primary resolution and threshold in wavelet shrinkage. Statistics and Computing, 12, 219–227.
Percival, D.B. and Walden, A.T. (2000) Wavelet Methods for Time Series Analysis. Cambridge: Cambridge University Press.
Vidakovic, B. (1999) Statistical Modeling by Wavelets. New York: Wiley.
Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. London: Chapman and Hall.

Weyrich, N. and Warhola, G.T. (1998) Wavelet shrinkage and generalized cross validation for image denoising. IEEE Trans. Image Process., 7, 82–90.
Zhang, P. (1993) Model selection via multifold cross-validation. The Annals of Statistics, 21, 1, 299–313.