TECHNOMETRICS ©, VOL. 26, NO. 2, MAY 1984

Estimators of the Mean Squared Error of Prediction in Linear Regression

O. Bunke and B. Droge

Department of Mathematics, Humboldt-University Berlin, German Democratic Republic

If a linear regression model is used for prediction, the mean squared error of prediction (MSEP) measures the performance of the model. The MSEP is a function of unknown parameters, and good estimates of it are of interest. This article derives a best unbiased and a minimum MSE estimator under the assumption of a normal distribution. It compares the bias and the MSE of these and some other estimators. Similar results are presented for the case in which the model is used to estimate values of the response function.

KEY WORDS: Regression model; Response function; Selection of variables.

1. INTRODUCTION

A frequent situation in analyzing experimental data involves (possibly replicated) observations of an output variable y obtained for corresponding fixed values of input variables x₁, x₂, . . . , xₖ. The output variable may be the yield in a chemical plant working under technological, chemical, and physical conditions described by input variables such as temperature, x₁, pressure, x₂, and the amounts, x₃, x₄, x₅, of different substances and catalyzers in the chemical process. An objective of the analysis may be to predict future values of the output y for fixed values of the input variables using a function or model depending on the input.

In other cases, the model is used for a rough description of the dependence between output and input variables by an equation, say y = f(x₁, x₃, x₅). Then the model f may be interpreted as an approximation to the true response function η, which describes the real relation y = η(x₁, . . . , xₖ) + ε between the output and the input variables. The term ε is interpreted here as the random error describing the influence of unobserved variables.

Usually the model will depend on parameters that must be estimated on the basis of data; that is, the model is fitted to the data. The mean squared error of prediction (MSEP) gives a simple description of the performance of the model. The MSEP depends on the unknown response function and the unknown variance of the observations, so good estimates of the MSEP are of interest. Such estimates may be useful as criteria for the comparison of subsets of input variables or of models, although, with the exception of some cases, the selection of models and of corresponding input variables is a more complex process than merely a formal comparison of models by a criterion. But such a criterion is a helpful tool in a model selection strategy (see Bunke 1983), which should also include subject matter considerations and common sense judgment. This has been discussed in the literature, for example, in Kennedy and Bancroft (1971) or Mallows (1973). Surveys on model selection methods, including model comparison criteria, are presented in Gaver and Geisel (1974), Hocking (1976), Thompson (1978), and Montgomery and Peck (1982). Some of these criteria, like C_p or PRESS, may also be considered as estimates of the MSEP.

Recently Efron (1983), Efron and Gong (1982), and Gong (1982) investigated the behavior of bootstrap and other estimators of the MSEP under somewhat different assumptions, namely of random input variables, an assumption that is also contained in Freedman (1981).

Our article derives a best unbiased estimator and a minimum MSE estimator under the assumption of a normal distribution and compares the bias and the MSE of these estimators with those of some others. In Section 2 we introduce the problem, assuming replicated observations, and in Section 3 we introduce the different criteria. Their comparison is carried out in Section 4, while in Section 5 numerical results and graphical presentations of MSE for some examples illustrate the quality of different estimators. A discussion of the results is presented in Section 6. In Section 7 we give a short report on analogous results for the case of the objective being an estimation of the response function. Estimation of MSEP and MSE in the case of no replications is treated in Section 8. Mathematical proofs are given in the Appendix.

2. THE MEAN SQUARED ERROR OF PREDICTION

Suppose that we have n = Σᵢ nᵢ observations yᵢⱼ, assumed to be independent random variables and to have normal distributions

yᵢⱼ ~ N(μᵢ, σ²)   (1)

for i = 1, . . . , m and j = 1, . . . , nᵢ. The vector of expectations μ = (μ₁, . . . , μₘ)ᵀ and the variance σ² are unknown. For each i, the values yᵢ₁, . . . , yᵢₙᵢ are replicated observations of the output variable for the same fixed joint values xᵢ₁, . . . , xᵢₖ of the input variables. We will assume that n = Σᵢ nᵢ > m; that is, we exclude the case of no replications (n₁ = . . . = nₘ = 1). Replications are necessary for estimating the variance σ² if there is no prior information on the response function or on the expectation μ.

The problem then is to predict some or all of the random variables y₁*, . . . , yₘ* with

E(yᵢ*) = μᵢ,   var(yᵢ*) = σ²,   (2)

using the observations yᵢⱼ. We assume these random variables y₁*, . . . , yₘ*, y₁₁, . . . , yₘₙₘ to be independent.

Often, a pseudolinear regression model f, given by

f(x₁, . . . , xₖ) = Σ_{h=1}^p βₕ gₕ(x₁, . . . , xₖ),   (3)

with fixed real functions gₕ and parameters βₕ, is used as a basis for prediction.

If model (3) is fitted to the data by ordinary least squares, we obtain the estimate b for the parameter vector β = (β₁, . . . , βₚ)ᵀ:

b = (b₁, . . . , bₚ)ᵀ = (AᵀL⁻¹A)⁻¹AᵀL⁻¹μ̂,

where

μ̂ = (ȳ₁, . . . , ȳₘ)ᵀ,   ȳᵢ = nᵢ⁻¹ Σⱼ yᵢⱼ,   L = diag[1/n₁, . . . , 1/nₘ],   (4)

and A is the m × p matrix with the elements

aᵢₕ = gₕ(xᵢ₁, . . . , xᵢₖ).

In the following, we assume that p = rank(A) ≤ m. The value yᵢ* may then be predicted by ŷᵢ* = Σₕ bₕ gₕ(xᵢ₁, . . . , xᵢₖ) if the model (3) is used.

The MSEP is defined as

r = MSEP(ŷ*) = E Σᵢ wᵢ |ŷᵢ* − yᵢ*|²,   (5)

as it was proposed, for example, by Mallows (1973) or Bendel and Afifi (1977). The possibly different importance of the prediction errors ŷᵢ* − yᵢ* is taken into account by the introduction of weights wᵢ ≥ 0 with Σᵢ wᵢ = 1. The weight wᵢ could also characterize the future frequency of occurrence of a variable yᵢ* with (2) to be predicted.

As an example, the observations yᵢ₁, . . . , yᵢₙᵢ could be the yield of a chemical plant on nᵢ different days on which the chemical plant was run under the same fixed technological conditions as given by xᵢ₁, . . . , xᵢ₅. The expected yield would be μᵢ = η(xᵢ₁, . . . , xᵢ₅). If we are interested in predicting the yields on several future days under technological conditions that are identical (or, if the response function η is sufficiently smooth, possibly only close) to those of some of the past days, we may adopt the above mathematical formalization of the problem.

The introduction of weights is necessary to cover cases in which the future output is to be predicted only for some of the m joint input values already observed in the past, for instance, in our example the yields y₁*, . . . , y₁₄* under 14 different technological conditions given by the joint values (xᵢ₁, . . . , xᵢ₅) (i = 1, . . . , 14). If these technological conditions are considered to be equally important, we will choose the weights w₁ = . . . = w₁₄ = 1/14; w₁₅ = . . . = wₘ = 0. One should choose wᵢ = 0 if yᵢ* is not to be predicted.

Predicting the output for values of the input that are not close to the observed input values may lead to large prediction errors if, as we are assuming in this article, the form of the response function is unknown and if the model (3) used for prediction is possibly not exact. Discussions on the dangers of extrapolating are given in Weisberg (1980) and Montgomery and Peck (1982), for example.

Now we use a vector notation to simplify formulas:

y* = (y₁*, . . . , yₘ*)ᵀ,   ŷ* = (ŷ₁*, . . . , ŷₘ*)ᵀ,   ‖y‖² = Σᵢ wᵢ yᵢ².

It is useful to write the MSEP as

r = Δ + Λ + σ²,

where

Δ = ‖μ − Eŷ*‖²

is the bias term, and the estimation error term is

Λ = E‖ŷ* − Eŷ*‖².
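The quantities in (4) amount to an ordinary least squares fit computed from the group means ȳᵢ. The following sketch (our own illustration, not part of the original article; the small simulated data set, the variable names, and the working model with g₁ = 1 and g₂(x) = x are assumptions made only for the example) shows one way to compute b and the fitted values with numpy.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical replicated data: m = 4 input settings, n_i replications each
x = np.array([1.0, 2.0, 3.0, 4.0])                   # one input variable, m = 4
n_i = np.array([3, 2, 4, 3])                          # replications per setting
mu_true = 1.0 + 0.5 * x                               # an assumed true response
y = [mu_true[i] + rng.normal(0, 0.3, n_i[i]) for i in range(len(n_i))]

# group means and the matrix L of (4)
mu_hat = np.array([yi.mean() for yi in y])            # mu-hat = (ybar_1, ..., ybar_m)'
L = np.diag(1.0 / n_i)                                # L = diag[1/n_1, ..., 1/n_m]

# design matrix A for a working model with g_1(x) = 1, g_2(x) = x  (p = 2)
A = np.column_stack([np.ones_like(x), x])

# ordinary least squares estimate b of (4): b = (A'L^{-1}A)^{-1} A'L^{-1} mu_hat
Linv = np.diag(n_i.astype(float))
b = np.linalg.solve(A.T @ Linv @ A, A.T @ Linv @ mu_hat)

# predicted values yhat_i* = sum_h b_h g_h(x_i1, ..., x_ik) at the observed settings
y_hat = A @ b
print(b, y_hat)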


If the model (3) is not exact, a bias Δ cannot be avoided for the predictor ŷ* = Ab, and also for any other predictor of the form ỹ = Aβ̃, where β̃ depends on the observations. Therefore, it seems appropriate to use a predictor of y* with minimal bias Δ and minimal MSEP instead of ŷ* = Ab, a problem discussed in more detail in Karson, Manson, and Hader (1969), Bunke and Strüby (1975), and Bunke and Bunke (1984). The predictor ŷ of y* that minimizes the MSEP (5) among all predictors ỹ = Aβ̃ with minimal bias is

ŷ = Aβ̂,   β̂ = (AᵀWA)⁺ AᵀW μ̂,

where C⁺ denotes the Moore-Penrose generalized inverse of the matrix C and W = diag[w₁, . . . , wₘ], as shown in Bunke and Strüby (1975). This shows us that the predictor ŷ and the estimate β̂ only depend on those ȳᵢ that correspond to nonvanishing weights wᵢ > 0. This is intuitively obvious, because without any prior information about the true response function η, the observations at those joint values of the input at which no prediction is intended have no informative value about the values of η at the joint input values of predictive interest. Therefore, the predictor would be the same if we assumed only observations for those i for which wᵢ > 0. Consequently, without loss of generality, we may always assume that all weights are positive. Then the matrix AᵀWA is of full rank, and thus the best minimum bias predictor is

ŷ = A(AᵀWA)⁻¹AᵀW μ̂.   (6)

Its mean squared error of prediction is

r = MSEP(ŷ) = Δ + σ²(1 + t),   (7)

where

t = tr D,   D = WA(AᵀWA)⁻¹AᵀWL.   (8)

In the following we will always use the best minimum bias predictor (6) for prediction. In the case of weights

wᵢ = nᵢ/n   (i = 1, . . . , m),   (9)

which are proportional to the number nᵢ of replications, it is identical to the ordinary least squares predictor Ab.

3. ESTIMATES OF THE MSEP

Some criteria for model selection use the RSS, which will be defined here under consideration of the weights and the number of replications at each value of the input as

r̂ = RSS = Σᵢ,ⱼ wᵢ nᵢ⁻¹ (yᵢⱼ − ŷᵢ)².   (10)

This "weighted" RSS is a negatively biased estimator of MSEP since

E r̂ − r = −2σ²t ≤ 0,   (11)

as shown in the Appendix. With the unbiased estimate

σ̂² = (n − m)⁻¹ Σᵢ,ⱼ (yᵢⱼ − ȳᵢ)²   (12)

of σ², we may adjust r̂ for bias and obtain the adjusted RSS,

r̂′ = r̂ + 2σ̂²t.   (13)

In the case (9) of weights wᵢ, which are proportional to the number nᵢ of replications, the adjusted estimate r̂′ = σ̂²(C_p/n + 1) is just the C_p-criterion (see Mallows 1973 or Montgomery and Peck 1982) up to a factor and an additive constant that do not depend on the model.

A "plug in" estimator of the MSEP may be obtained by replacing the unknown expectation μ and variance σ² in the formula (7) by their estimates μ̂ and σ̂²:

r̂_B = ‖μ̂ − ŷ‖² + σ̂²(1 + t).   (14)

This estimator has a nonnegative bias,

E r̂_B − r = σ²(tr WL − t) ≥ 0,   (15)

as calculated in Appendix A.1. A bias adjustment provides the "plug in" estimator

r̃_B = ‖μ̂ − ŷ‖² + σ̂²(1 − tr WL + 2t).   (16)

This estimator r̃_B is a best unbiased estimator of r because it depends on y = (y₁₁, . . . , y₁ₙ₁, . . . , yₘₙₘ)ᵀ through the statistic (μ̂, σ̂²), which is sufficient and complete under (1) (see Bunke and Bunke 1984). In the special case

w₁ = . . . = wₘ = m⁻¹,   n₁ = . . . = nₘ = h,   (17)

of equal weights and an equal number of replications, the estimator (16) is identical to the adjusted RSS r̂′. Mallows (1973) has already incidentally suggested the introduction of weights as in (5), and it may be seen that under our assumptions the criterion proposed in that paper is identical to (16). Beside the suggestion, this criterion has not been further investigated. The estimators (14) and (16) may also be obtained by application of the bootstrap approach, as shown in Bunke and Droge (1982). There the performance of Allen's PRESS (a special case of cross-validation, see Stone 1974) as an estimator of MSEP is investigated, and its inferiority in comparison with the estimator (16) is shown under some general assumptions.
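As a check on the claim that (16) coincides with the adjusted RSS under (17), the following short calculation (ours, not part of the original article) may be useful. Under (17) we have wᵢ/nᵢ = 1/(mh) = 1/n for every i, so tr WL = m/n, and splitting each residual as yᵢⱼ − ŷᵢ = (yᵢⱼ − ȳᵢ) + (ȳᵢ − ŷᵢ) in (10) gives

r̂ = Σᵢ,ⱼ n⁻¹(yᵢⱼ − ȳᵢ)² + Σᵢ wᵢ(ȳᵢ − ŷᵢ)² = (1 − tr WL)σ̂² + ‖μ̂ − ŷ‖²,

because the cross terms vanish and Σᵢ,ⱼ(yᵢⱼ − ȳᵢ)² equals (n − m)σ̂² by (12). Hence

r̂′ = r̂ + 2σ̂²t = ‖μ̂ − ŷ‖² + σ̂²(1 − tr WL + 2t) = r̃_B,

which is exactly the estimator (16).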
In this article, we want to compare "classical" criteria like the RSS and the adjusted RSS with the "plug-in" estimator (14), the best unbiased estimator (16), and a minimum MSE estimator of MSEP, which we will now derive.

Looking at the form of the previous estimators, we may ask whether a linear combination of the RSS, ‖μ̂ − ŷ‖², and σ̂² different from r̂, r̂′, r̂_B, and r̃_B could provide a better MSE for estimating r. In Appendix A.4 we prove that the criterion

r̃ = ‖μ̂ − ŷ‖² + qσ̂²,   (18)

with

q = (n − m)(n − m + 2)⁻¹(1 − tr WL + 2t),   (19)

has minimal MSE = E|r̄ − r|² in the class of all estimators of the form

r̄ = (1 − k)‖μ̂ − ŷ‖² + k RSS + hσ̂².   (20)

The bias of this optimal estimator is

E r̃ − r = −2(n − m + 2)⁻¹(1 − tr WL + 2t)σ²   (21)

(see Appendix A.2).
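For concreteness, the estimators (10), (13), (14), (16), and (18) are all simple functions of the group means, the fitted values of (6), and σ̂². The sketch below (an illustration under our own naming conventions, not code from the original article) assumes positive weights wᵢ summing to one and replicated observations, as in Section 2.

import numpy as np

def msep_estimates(y, A, w):
    """y: list of 1-D arrays of replicated observations (one array per input setting);
    A: m x p design matrix of the working model; w: weights summing to one."""
    n_i = np.array([len(yi) for yi in y])
    n, m = n_i.sum(), len(y)
    mu_hat = np.array([yi.mean() for yi in y])                       # (4)
    W, L = np.diag(w), np.diag(1.0 / n_i)
    # best minimum bias predictor (6) and the constant t of (8)
    M = np.linalg.solve(A.T @ W @ A, A.T @ W)                        # (A'WA)^{-1} A'W
    y_hat = A @ (M @ mu_hat)
    t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)
    # building blocks
    emp_bias = np.sum(w * (mu_hat - y_hat) ** 2)                     # ||mu_hat - y_hat||^2
    sigma2 = sum(((yi - yi.mean()) ** 2).sum() for yi in y) / (n - m)           # (12)
    rss = sum(w[i] / n_i[i] * ((y[i] - y_hat[i]) ** 2).sum() for i in range(m)) # (10)
    return {
        "RSS (10)": rss,
        "adjusted RSS (13)": rss + 2 * sigma2 * t,
        "plug-in (14)": emp_bias + sigma2 * (1 + t),
        "best unbiased (16)": emp_bias + sigma2 * (1 - trWL + 2 * t),
        "minimum MSE (18)": emp_bias
            + (n - m) / (n - m + 2) * (1 - trWL + 2 * t) * sigma2,   # q of (19)
    }

Called with w = nᵢ/n, the function reproduces the case (9) of weights proportional to the number of replications.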


This result shows that for estimating the MSEP, a term like ‖μ̂ − ŷ‖² is more favorable than the RSS. The estimator (18) is very close to the best unbiased estimator (16), and its bias (21) will be relatively small. Only if the number of replications is small, that is, n − m is small, will there be a noticeable difference.

4. COMPARISONS OF THE MSEP ESTIMATORS

Table 1 contains the MSE formulas (proved in Appendix A.3) for the different estimators of MSEP considered in Section 3. We use the notation

R = 4σ² Σᵢ wᵢ² nᵢ⁻¹ (μᵢ − Eŷᵢ)².   (22)

Table 1. MSE for Different Estimators of the MSEP r

Estimator    MSE
r̂           R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + 2t²}
r̂′          R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + 4(n − m)⁻¹t(1 + t − tr WL)}
r̂_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(1 + t)² + (tr WL − t)²/2}
r̃_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(1 − tr WL + 2t)²}
r̃           R + 2σ⁴{tr(WL − D)² + (n − m + 2)⁻¹(1 − tr WL + 2t)²}

[Figures 1 and 2: M = σ⁻⁴[MSE(r̄) − R] vs. p for different estimators r̄ of the MSEP (m: number of input values; h: number of replications). Figure 1: the case m = 5, h = 2. Figure 2: the case m = 5, h = 5.]

From Table 1 we obtain the following:

1. The adjusted RSS r̂′ has smaller MSE than the RSS r̂ iff n − m > 2 and t > 2(n − m − 2)⁻¹(1 − tr WL).

2. The "plug-in" estimator r̂_B has smaller MSE than the RSS r̂ if t > (3n − 3m − 2)⁻¹{(n − m − 2) tr WL + 4} (as shown in Appendix A.5).

3. As shown in Appendix A.6, the "plug-in" estimator r̂_B never has a smaller MSE than its adjusted version r̃_B.

4. With n − m ≥ 18, the best unbiased estimator r̃_B is nearly as good as the optimal estimator r̃: MSE(r̃_B) − MSE(r̃) ≤ (0.1) MSE(r̃_B) (see Appendix A.7).

In the case (9) of weights, which are proportional to the number of replications, there are simplifications in the formulas, since

R = 4σ²Δ/n,   (23)
WL = n⁻¹I,
D = WA(AᵀWA)⁻¹Aᵀ/n,
D² = D/n,
t = n⁻¹ tr[WA(AᵀWA)⁻¹Aᵀ] = n⁻¹ rank[AᵀWA] = p/n,
tr(WL − D)² = (m − p)/n²,
tr(WL − D)² + tr(W²L − W²L²) = (1 − p/n)/n.

We see that the MSE of the different estimators depends on the unknown response function only in the term R, namely by the "local biases," μᵢ − Eŷᵢ, in (22) or by the bias Δ in the special case (23). The other term only depends on σ⁴ and on the matrix D (see (8)), that is, on the model (3).

5. SOME NUMERICAL RESULTS

In order to illustrate the differences between the estimators of the MSEP, we calculated for each estimator r̄ the values

M = σ⁻⁴[MSE(r̄) − R]

in the case (17) of equal weights and an equal number of replications.

[Figures 3 and 4: M = σ⁻⁴[MSE(r̄) − R] vs. p for different estimators r̄ of the MSEP (m: number of input values; h: number of replications). Figure 3: the case m = 10, h = 2. Figure 4: the case m = 10, h = 5.]
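The quantities plotted in Figures 1-4 can be recomputed from the Table 1 formulas together with the simplifications (23), since (17) is a special case of (9). The following sketch (our own illustration; the function name and the printed layout are not from the original) evaluates M for given m, h, and model dimension p.

import numpy as np

def M_values(m, h, p):
    """M = sigma^{-4}[MSE - R] for the five MSEP estimators of Table 1 under (17):
    equal weights w_i = 1/m, n_i = h replications, and model dimension p."""
    n = m * h
    t, trWL = p / n, m / n
    trWLD2 = (m - p) / n**2                        # tr(WL - D)^2
    trW2Ldiff = (n - m) / n**2                     # tr(W^2 L - W^2 L^2)
    return {
        "RSS":           2 * (trWLD2 + trW2Ldiff + 2 * t**2),
        "adjusted RSS":  2 * (trWLD2 + trW2Ldiff + 4 * t * (1 + t - trWL) / (n - m)),
        "plug-in":       2 * (trWLD2 + (1 + t)**2 / (n - m) + (trWL - t)**2 / 2),
        "best unbiased": 2 * (trWLD2 + (1 - trWL + 2 * t)**2 / (n - m)),
        "minimum MSE":   2 * (trWLD2 + (1 - trWL + 2 * t)**2 / (n - m + 2)),
    }

# e.g. the setting of Figure 1 (m = 5, h = 2):
for p in range(1, 6):
    print(p, {k: round(v, 4) for k, v in M_values(5, 2, p).items()})

Within the class (20), the criterion (18) has the smallest MSE by construction, so its M value is the smallest for every p; the printed values for m = 5, h = 2 illustrate this.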

This was done for different m = 2, 5, 10, 20; different numbers of replications h = 2, 5, 10, 20; and different model dimensions p = 1, . . . , m. M is just the term in the formulas for the MSE(r̄) (see Table 1) that is different for the different estimators r̄ of r, while R is identical for all estimators.

M does not depend on any unknown parameter. It depends on the model (3) only by the dimension p, and as it is proportional to the MSE(r̄) up to an additive constant, it is especially adequate for a numerical or graphical comparison of estimators.

On the other hand, it is clear that the M values essentially describe the differences in the MSE of the estimators if R (or Δ) is small, that is, if the model is nearly exact, while the comparative use of differences in M values decreases with increasing R. Already a few selected values of M seem to be sufficient to give a good impression of these numerical results, say for m = 5, 10 and h = 2, 5 (see Figures 1-4).

For comparison, we have also included the values of M for a leaving-m-out variant of a cross-validation criterion (in the sense of Stone 1974) defined in Bunke and Droge (1982), which is based on calculating for each j an estimate μ̂ without the observations y₁ⱼ, . . . , yₘⱼ.

We see that the optimal estimator r̃ is essentially better than the other estimators if small MSE for all p is wanted. The best unbiased estimator r̃_B should be chosen first among the other estimators. For sufficiently large h, it seems to be nearly as good as the optimal estimator r̃ (for h = 5, at most 10% difference in MSE).

6. DISCUSSION OF THE RESULTS

The theoretical and numerical comparisons show that in general it is better to use estimators based on an unbiased estimate of the variance and on the "empirical model bias," ‖μ̂ − ŷ‖², than estimators based on the RSS. The criterion r̃ with smallest MSE for estimating the MSEP of a model is given by (18), and in general its MSE is essentially smaller than that of the other estimators, with the exception of the best unbiased estimator r̃_B, given by (16), which behaves nearly like r̃.

In the case (9) of weights that are proportional to the number of replications, our results show that the C_p-criterion of Mallows, which then is equivalent to the adjusted RSS, leads to a best unbiased estimator and, moreover, to a nearly optimal MSE in estimating the MSEP.

7. ESTIMATORS FOR THE MSE

If the regression model is used for approximating the unknown response function η or estimating its values at some values of the input, instead of predicting the future value of the output (that is, the value of the response function plus random error, as in the previous sections), the MSE for estimating the response function,

ρ = E Σᵢ wᵢ (μᵢ − ŷᵢ)² = Δ + σ²t,   (24)

should be used instead of the MSEP (5) for an assessment of the model performance. Different estimators could be constructed as in Section 3. Their comparison can be made analogously as for the estimators of the MSEP and, therefore, we state the results without proofs.

Of special interest are the RSS r̂; its adjusted unbiased variant,

ρ̂′ = r̂ + σ̂²(2t − 1);

the "plug-in" estimator,

ρ̂_B = ‖μ̂ − ŷ‖² + σ̂²t;

and its adjusted unbiased variant,

ρ̃_B = ‖μ̂ − ŷ‖² + σ̂²(2t − tr WL) = r̃_B − σ̂²,   (25)

which is a best unbiased estimator for ρ. In the class of estimators of the form

ρ̄ = (1 − k)‖μ̂ − ŷ‖² + k r̂ + hσ̂²,

we obtain a minimal MSE(ρ̄) for

ρ̃ = ‖μ̂ − ŷ‖² + q̃σ̂²,   (26)

where

q̃ = (n − m)(n − m + 2)⁻¹(2t − tr WL).   (27)

The MSE and the bias for the different estimators are presented in Table 2, where R is defined as in (22). An examination of Table 2 yields the following relations:

1. MSE(ρ̃_B) ≤ MSE(ρ̂′).

2. MSE(ρ̃_B) ≤ MSE(ρ̂_B) if n ≥ m + 2.

3. With n ≥ m + 2, it holds that MSE(r̂) ≤ MSE(ρ̂′) iff 1 ≤ 2t ≤ 4(n − m − 2)⁻¹(1 − tr WL) + 1. Therefore, n̲ := minᵢ nᵢ ≥ 2 together with the weights (9) (nᵢ = nwᵢ), which imply 2t = 2p/n ≤ 1, are sufficient conditions for MSE(ρ̂′) ≤ MSE(r̂).

4. Under n̲ := minᵢ nᵢ ≥ 2, it holds that MSE(ρ̂_B) ≤ MSE(r̂) and |bias(ρ̂_B)| < |bias(r̂)|.

5. MSE(ρ̃_B) − MSE(ρ̃) ≤ (0.1) MSE(ρ̃_B) if n ≥ m + 18.


Table 2. MSE and Bias for Different Estimators of the MSE ρ

Estimator    MSE                                                                              Bias
r̂           R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + (1 − 2t)²/2}                              (1 − 2t)σ²
ρ̂′          R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + (n − m)⁻¹(2t − 1)(2t − 2 tr WL + 1)}      0
ρ̂_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹t² + (tr WL − t)²/2}                              (tr WL − t)σ²
ρ̃_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(2t − tr WL)²}                                    0
ρ̃           R + 2σ⁴{tr(WL − D)² + (n − m + 2)⁻¹(2t − tr WL)²}                                −2(2t − tr WL)σ²/(n − m + 2)
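The estimators of ρ differ from those of Section 3 only in the constants multiplying σ̂². Continuing the illustrative sketch given after Section 3 (again with hypothetical names, not code from the original), they could be computed as follows.

import numpy as np

def mse_estimates(y, A, w):
    """Estimators of rho (24), built from the same quantities as msep_estimates."""
    n_i = np.array([len(yi) for yi in y])
    n, m = n_i.sum(), len(y)
    mu_hat = np.array([yi.mean() for yi in y])
    W, L = np.diag(w), np.diag(1.0 / n_i)
    M = np.linalg.solve(A.T @ W @ A, A.T @ W)
    y_hat = A @ (M @ mu_hat)
    t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)
    emp_bias = np.sum(w * (mu_hat - y_hat) ** 2)
    sigma2 = sum(((yi - yi.mean()) ** 2).sum() for yi in y) / (n - m)
    rss = sum(w[i] / n_i[i] * ((y[i] - y_hat[i]) ** 2).sum() for i in range(m))
    return {
        "RSS": rss,                                                   # first row of Table 2
        "adjusted RSS": rss + sigma2 * (2 * t - 1),                   # rho-hat'
        "plug-in": emp_bias + sigma2 * t,                             # rho-hat_B
        "best unbiased (25)": emp_bias + sigma2 * (2 * t - trWL),     # rho-tilde_B
        "minimum MSE (26)": emp_bias
            + (n - m) / (n - m + 2) * (2 * t - trWL) * sigma2,        # q-tilde of (27)
    }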

We see that the optimal estimators r̃ for the MSEP and ρ̃ for the MSE only differ by a term that does not depend on the model (3). Consequently, any of these, say r̃, could be used as a criterion for the comparison of models with the objective of predicting the output or, alternatively, of estimating the response function. As discussed in Section 1, comparison of models by a criterion is only a tool in model selection. After finally arriving at some potentially good models by some more complex strategy, a description of their performance by a good estimate of their corresponding MSEP (or MSE), like r̃ (or ρ̃), is obviously of interest.

8. ESTIMATING THE MSEP AND THE MSE WITHOUT REPLICATED OBSERVATIONS

If there are no replicated observations, that is, for each input xᵢ₁, . . . , xᵢₖ (i = 1, . . . , m) there is one observation yᵢ₁ (nᵢ = 1), then an estimate of the unknown variance σ², on the basis of the observations, may only be constructed if there is some information on the response function, for example, leading to an (at least approximately) exact linear model for the observation vector y = (y₁₁, . . . , yₘ₁)ᵀ.

We assume

y ~ N(Hλ, σ²I),   λ ∈ R^q,   (28)

where H is a known n × q matrix of rank q < n = m and λ is an unknown parameter. Then

μ̂ = H(HᵀH)⁻¹Hᵀy   (29)

and

σ̂² = (n − q)⁻¹‖y − μ̂‖²_I   (30)

are best unbiased estimators of the unknown parameters μ = Hλ and σ².

If the model (28) were not exact, then the estimator (30) would overestimate the variance, and an estimator of MSEP based on (30) would be misleading in comparing models, because models with few parameters (i.e., smaller constant t) would be unduly favored.

Now we may derive, with almost the same calculations as in the previous sections, best unbiased estimators for the MSEP (7) and the MSE (24) and, moreover, minimum MSE estimators in the class of estimators of the form (20). Here and in the following, the estimators (29) and (30) always replace the estimators μ̂ (see (4)) and (12) of the previous sections, while the predictor ŷ given by (6) (but with μ̂ from (29)) is again a best minimum bias predictor. The formulas (16), (18), (19), (25), (26), and (27) for the optimal estimators of MSEP and MSE remain the same; only the matrix L (see (4)) is to be replaced by the matrix

L = H(HᵀH)⁻¹Hᵀ.

Remark 1. The assumption (28) covers the case of replicated observations, for example, case (1), as a special case, where the corresponding structure of the matrix H is obvious.

Remark 2. If there is knowledge on the structure of the expectation μ = Hλ in the sense of assumption (28), there is still interest in predictors ŷ = Aβ̂ (see (6)) given by another matrix A of smaller rank (i.e., using an inexact model), because the MSEP (7) could be essentially diminished (smaller constant t!) by such a predictor, even if then there is some bias Δ.
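A minimal sketch of this no-replication case (our own illustration; the quadratic "exact" model H, the straight-line working model A, and the equal weights are assumptions made only for the example) replaces just the pieces named above and reuses formula (16) unchanged.

import numpy as np

rng = np.random.default_rng(1)

# hypothetical example: m = n = 8 input values, one observation each (n_i = 1)
x = np.linspace(0.0, 1.0, 8)
H = np.column_stack([np.ones_like(x), x, x**2])       # assumed (approximately) exact model, q = 3
y = H @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.2, x.size)

n, q = H.shape
mu_hat = H @ np.linalg.solve(H.T @ H, H.T @ y)        # (29)
sigma2 = np.sum((y - mu_hat) ** 2) / (n - q)          # (30)
L = H @ np.linalg.solve(H.T @ H, H.T)                 # L = H(H'H)^{-1}H' replaces diag[1/n_i]

# working (possibly inexact) model of smaller rank, e.g. a straight line (p = 2)
A = np.column_stack([np.ones_like(x), x])
w = np.full(n, 1.0 / n)                               # equal weights, an assumption for the example
W = np.diag(w)
M = np.linalg.solve(A.T @ W @ A, A.T @ W)
y_hat = A @ (M @ mu_hat)                              # predictor (6) with mu_hat from (29)
t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)

# best unbiased estimator (16) of the MSEP, with the replacements above
r_B = np.sum(w * (mu_hat - y_hat) ** 2) + sigma2 * (1 - trWL + 2 * t)
print(sigma2, r_B)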

ACKNOWLEDGMENT

The authors are very grateful to the referees for their remarks, which improved the presentation of the article considerably.

APPENDIX

In this appendix we give the proofs of the statements in Sections 3 and 4, using the following notation.

1. ⊗: Kronecker product.
2. 1ᵢ: nᵢ-vector of ones.
3. Iₖ: k × k unity matrix.
4. y = (y₁₁, . . . , y₁ₙ₁, . . . , yₘ₁, . . . , yₘₙₘ)ᵀ.


5. ‖y‖²_Z = yᵀZy.
6. μ̄ = (μ₁1₁ᵀ, . . . , μₘ1ₘᵀ)ᵀ: the n-vector repeating each μᵢ exactly nᵢ times.
7. μ̂̄ = (μ̂₁1₁ᵀ, . . . , μ̂ₘ1ₘᵀ)ᵀ: the n-vector repeating each μ̂ᵢ = ȳᵢ exactly nᵢ times.
8. ŷ̄ = (ŷ₁1₁ᵀ, . . . , ŷₘ1ₘᵀ)ᵀ.
9. V: the n × n block-diagonal matrix diag[w₁n₁⁻¹I_{n₁}, . . . , wₘnₘ⁻¹I_{nₘ}].
10. P = A(AᵀWA)⁺AᵀW = ((pᵢⱼ)), i, j = 1, . . . , m.
11. F: the m × n matrix with i-th row (n₁⁻¹pᵢ₁1₁ᵀ, . . . , nₘ⁻¹pᵢₘ1ₘᵀ), i = 1, . . . , m.
12. H: the n × n matrix with i-th block row 1ᵢ ⊗ (n₁⁻¹pᵢ₁1₁ᵀ, . . . , nₘ⁻¹pᵢₘ1ₘᵀ), i = 1, . . . , m.
13. G: the m × n block-diagonal matrix diag[n₁⁻¹1₁ᵀ, . . . , nₘ⁻¹1ₘᵀ] of group averages, so that μ̂ = Gy.
14. K = diag[n₁⁻¹1₁1₁ᵀ, . . . , nₘ⁻¹1ₘ1ₘᵀ].

We will use some algebraic identities that follow easily from the above definitions:

μ̂ = Gy,   ŷ = Pμ̂ = Fy,   μ̂̄ = Ky,   ŷ̄ = Hy.   (31)

(I − K)(K − H) = 0,   ‖μ̂̄ − ŷ̄‖²_V = ‖μ̂ − ŷ‖².   (32)

U := (I − P)ᵀW(I − P) = W − WP.   (33)

Q := (I − K)ᵀV(I − K) = V − VK,   tr V = tr W = 1,   tr VK = tr WL.   (34)

K² = K,   tr K = m,   Kμ̄ = μ̄.   (35)

For the calculation of E r̄ and var(r̄) for an estimate r̄ of r, we will use the formulas

E‖z‖²_C = ‖ν‖²_C + tr(ΛC)   (36)

and

var ‖z‖²_C = 4νᵀCΛCν + 2 tr(ΛC)²,   (37)

which are valid under z ~ N(ν, Λ). Because of (10), (31), and (32), we may write

r̂ = ‖y − ŷ̄‖²_V = ‖y − μ̂̄‖²_V + ‖μ̂ − ŷ‖².   (38)

Now, from (31), (33), (36), (4), and (8), we obtain

E‖μ̂ − ŷ‖² = E‖μ̂‖²_U = ‖μ‖²_U + σ² tr UL
          = ‖μ − Pμ‖² + σ² tr(WL − D)
          = Δ + σ²(tr WL − t),   (39)

since

μ̂ ~ N(μ, σ²L).

Furthermore, it follows from (31) and (36) that

E‖y − μ̂̄‖²_V = E‖y‖²_Q = ‖μ̄‖²_Q + σ² tr Q = σ²(1 − tr WL),   (40)

which, together with (32) and (39), provides

E r̂ = Δ + σ²(1 − tr D)

and, therefore, the formula (11) for E r̂ − r.

A.1 Proof of (15)

Equation (15) follows directly from (7), (14), (39), and Eσ̂² = σ². We remark that tr WL − tr D = Σᵢ wᵢnᵢ⁻¹(1 − pᵢᵢ) can never be negative because pᵢᵢ ≤ 1 (i = 1, . . . , m).

A.2 Proof of (21)

Using (7), (18), and (39), we obtain

E r̃ − r = σ²(q − 1 + tr WL − 2t)
        = σ²(1 − tr WL + 2t){(n − m)(n − m + 2)⁻¹ − 1}

and, therefore, (21).

A.3 Proof of the MSE Formulas in Table 1

1. For the calculation of

MSE(r̄) = (E r̄ − r)² + var(r̄),   (41)

we will use the formulas for the bias of the estimates and (37). First we consider an estimate of r of the form

r*(a, b) = ‖μ̂ − ŷ‖² + ‖y − μ̂̄‖²_S,   S = aIₙ + bV.

Because of (31) and (32), ‖μ̂ − ŷ‖² and ‖y − μ̂̄‖²_S are independent under (1) (see Rao 1972). Therefore, we obtain

var r*(a, b) = var ‖μ̂ − ŷ‖² + var ‖y − μ̂̄‖²_S.   (42)

Now it is easily seen from (33) that

Uμ = W(μ − Pμ)

and thus

4σ²μᵀULUμ = 4σ²‖μ − Pμ‖²_{WLW} = R.   (43)

From

UL = WL − D   (see (33)),

we derive

tr(UL)² = tr(WL − D)²,

which, together with (31), (33), (37), and (43), yields

var ‖μ̂ − ŷ‖² = var ‖μ̂‖²_U = 4σ²μᵀULUμ + 2σ⁴ tr(UL)² = R + 2σ⁴ tr(WL − D)².   (44)


Using (31), we may write

‖y − μ̂̄‖²_S = ‖y‖²_T,   T = (I − K)S(I − K).   (45)

The identities in (35) lead to

‖μ̄‖²_T = 0,   (46)

and, moreover, with (34) and (35), to

tr T² = tr{(I − K)S}²
      = tr{a(I − K) + b(V − VK)}²
      = a² tr(I − K) + 2ab tr(V − VK) + b² tr(V − VK)²
      = a²(n − m) + 2ab(1 − tr WL) + b² tr(V − VK)².   (47)

From (37), (45), (46), and (47), we obtain

var ‖y − μ̂̄‖²_S = 2σ⁴{a²(n − m) + 2ab(1 − tr WL) + b² tr(V − VK)²},

which, together with (42) and (44), provides

var r*(a, b) = R + 2σ⁴{tr(WL − D)² + a²(n − m) + 2ab(1 − tr WL) + b² tr(W²L − W²L²)},   (48)

since tr(V − VK)² = tr(W²L − W²L²).

2. Because of

σ̂² = (n − m)⁻¹‖y − Ky‖²_I = (n − m)⁻¹‖y − μ̂̄‖²_I,

we may write (see (38), (13), (14), (16), and (18))

r̂ = r*(0, 1),
r̂′ = r*(2(n − m)⁻¹t, 1),
r̂_B = r*((n − m)⁻¹(1 + t), 0),
r̃_B = r*((n − m)⁻¹(1 − tr WL + 2t), 0),   and
r̃ = r*((n − m)⁻¹q, 0) = r*((n − m + 2)⁻¹(1 − tr WL + 2t), 0).

These equations and (48) provide, together with the formulas for the bias of the estimates (see (11), (15), and (21)), the MSE formulas in Table 1.

A.4 Proof of the Optimality Property for the Estimator (18)

1. Using the notation of A.3, we obtain from (31) and (38) for an estimate of the form (20):

r̄ = ‖μ̂ − ŷ‖² + k‖y − μ̂̄‖²_V + hσ̂²
  = r*((n − m)⁻¹h, k).   (49)

The bias of r̄ is easily calculated as (see (39), (40), and (7))

E r̄ − r = σ²(k + h − 1 + (1 − k) tr WL − 2t).   (50)

From (48) and (49), it follows that

var(r̄) = R + 2σ⁴{tr(WL − D)² + h²(n − m)⁻¹ + 2hk(n − m)⁻¹(1 − tr WL) + k² tr(W²L − W²L²)},

which, together with (41) and (50), provides

φ(h, k) := MSE(r̄)
         = R + 2σ⁴{tr(WL − D)² + h²(n − m)⁻¹ + k² tr(W²L − W²L²)
           + 2hk(n − m)⁻¹(1 − tr WL) + (h + k − 1 + tr WL − k tr WL − 2t)²/2}.

2. To find the optimum pair (h, k), we minimize φ(h, k) with respect to (h, k). From the necessary condition for a minimum (h*, k*),

∂φ(h, k)/∂h at (h*, k*) = 0,

it results that

k*(1 − tr WL)(n − m + 2)(n − m)⁻¹ = 1 + 2t − tr WL − (n − m)⁻¹(n − m + 2)h*

and, therefore,

h* = −(1 − tr WL)k* + (n − m)(n − m + 2)⁻¹(1 − tr WL + 2t).   (51)

The second necessary condition,

∂φ(h, k)/∂k at (h*, k*) = 0,

leads to

0 = 2h*(n − m)⁻¹(1 − tr WL) + 2k* tr(W²L − W²L²)
  + (h* + k* − 1 + tr WL − k* tr WL − 2t)(1 − tr WL).   (52)

Now let

c := (n − m)⁻¹(1 − tr WL)² − tr(W²L − W²L²).   (53)

Then, from (51) and (52), we obtain

2ck* = 0.   (54)

Because of Σᵢ(nᵢ − 1) = n − m and the strict convexity of the function f(x) = x², Jensen's inequality yields

Σᵢ (nᵢ − 1)(n − m)⁻¹ wᵢ² nᵢ⁻² ≥ {(n − m)⁻¹ Σᵢ (nᵢ − 1) wᵢ nᵢ⁻¹}²,   (55)


where equality holds iff

wᵢnᵢ⁻¹ = wⱼnⱼ⁻¹   for i, j = 1, . . . , m.   (56)

We remark that (56) is equivalent to nV = I. Since

tr(W²L − W²L²) = Σᵢ (nᵢ − 1)wᵢ²nᵢ⁻²

and

(1 − tr WL)² = {Σᵢ (nᵢ − 1)wᵢnᵢ⁻¹}²,

we obtain from (53), (55), and (56)

c ≤ 0,   (57)

where

c = 0   iff   nV = I.   (58)

In the case of c ≠ 0, it follows from (54) that

k* = 0   (59)

and, therefore, together with (51),

h* = q.   (60)

For c = 0, that is, nV = I (see (58)), any solution (h*, k*) of (51) provides

r̄ = ‖μ̂ − ŷ‖² + k*‖y − μ̂̄‖²_V + h*σ̂²
  = ‖μ̂ − ŷ‖² + qσ̂² = r̃,

since then

‖y − μ̂̄‖²_V = n⁻¹‖y − μ̂̄‖²_I = n⁻¹(n − m)σ̂²   and   tr WL = m/n.

It is easily verified that h* and k*, defined as in (59) and (60), always minimize φ(h, k) = MSE(r̄).

A.5 Proof of Statement 2 in Section 4

Note that

σ⁻⁴{MSE(r̂) − MSE(r̂_B)} = 2 tr(W²L − W²L²) − (tr WL − t)² + 4t² − 2(n − m)⁻¹(1 + t)².   (61)

Using (57), we have

tr(W²L − W²L²) ≥ (n − m)⁻¹(1 − tr WL)²,

which, together with (61), leads to

σ⁻⁴{MSE(r̂) − MSE(r̂_B)} ≥ (n − m)⁻¹(t + tr WL){(3n − 3m − 2)t − (n − m − 2) tr WL − 4}.   (62)

Thus the statement directly follows from (62), since t + tr WL > 0.

A.6 Proof of Statement 3 in Section 4

From Table 1 we derive

σ⁻⁴{MSE(r̂_B) − MSE(r̃_B)} = (n − m)⁻¹(t − tr WL){(n − m − 6)t − (n − m − 2) tr WL − 4}.   (63)

Now it can be easily verified that

(n − m − 2) tr WL + 4 > (n − m − 6)t,

which, together with (63) and 0 ≤ t ≤ tr WL, leads to MSE(r̂_B) − MSE(r̃_B) ≥ 0.

A.7 Proof of Statement 4 in Section 4

Table 1 yields

MSE(r̃_B) − MSE(r̃) = 4{(n − m)(n − m + 2)}⁻¹(1 − tr WL + 2t)²σ⁴ ≤ 2(n − m + 2)⁻¹ MSE(r̃_B),

from which statement 4 in Section 4 directly follows, because

2(n − m + 2)⁻¹ ≤ 0.1   iff   n − m ≥ 18.

[Received July 1981. Revised December 1983.]

REFERENCES

BENDEL, R. B., and AFIFI, A. A. (1977), "Comparison of Stopping Rules in Forward 'Stepwise' Regression," Journal of the American Statistical Association, 72, 46-53.

BUNKE, H., and BUNKE, O. (eds.) (1984), Statistical Inference in Linear Models, London: John Wiley.

BUNKE, H., and STRÜBY, R. (1975), "Estimation Procedures in Inadequate Models: Comparisons and Empirical Two Step Procedure," Mathematische Operationsforschung und Statistik, 6, 167-177.

BUNKE, O. (1983), "Selecting Variables and Models in Regression: Some Tools and Suggestions for a Strategy," to appear in Mathematische Operationsforschung und Statistik, Series Statistics.

BUNKE, O., and DROGE, B. (1982), "Bootstrap and Cross-Validation Criteria for Selecting Linear Regression Models," Preprint 33, Sektion Mathematik, Humboldt-Universität, Berlin.

EFRON, B. (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316-331.

EFRON, B., and GONG, G. (1982), "A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation," Technical Report 75, Stanford University, Division of Biostatistics.

FREEDMAN, D. A. (1981), "Bootstrapping Regression Models," Annals of Statistics, 9, 1218-1228.

GAVER, K. M., and GEISEL, M. S. (1974), "Discriminating Among Alternative Models: Bayesian and Non-Bayesian Methods," in Frontiers in Econometrics, ed. Paul Zarembka, New York: Academic Press, 49-80.

GONG, G. (1982), "Cross-validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression," Technical Report 80, Stanford University, Division of Biostatistics.

HOCKING, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-49.

KARSON, M. J., MANSON, A. R., and HADER, R. J. (1969), "Minimum Bias Estimation and Experimental Design for Response Surfaces," Technometrics, 11, 461-475.

KENNEDY, W. S., and BANCROFT, T. A. (1971), "Model Building for Prediction in Regression Based Upon Repeated Significance Tests," Annals of Mathematical Statistics, 42, 1273-1284.

MALLOWS, C. L. (1973), "Some Comments on C_p," Technometrics, 15, 661-675.

MONTGOMERY, D. C., and PECK, E. A. (1982), Introduction to Linear Regression Analysis, New York: John Wiley.

RAO, C. R. (1972), Linear Statistical Inference, New York: John Wiley.

STONE, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-133.

THOMPSON, M. L. (1978), "Selection of Variables in Multiple Regression: Part I. A Review and Evaluation," International Statistical Review, 46, 1-19; "Part II. Chosen Procedures, Computations and Examples," 46, 129-146.

WEISBERG, S. (1980), Applied Linear Regression, New York: John Wiley.
