TECHNOMETRICS ©, VOL. 26, NO. 2, MAY 1984

Estimators of the Mean Squared Error of Prediction in Linear Regression

O. Bunke and B. Droge

Department of Mathematics, Humboldt-University Berlin, German Democratic Republic

If a linear regression model is used for prediction, the mean squared error of prediction (MSEP) measures the performance of the model. The MSEP is a function of unknown parameters, and good estimates of it are of interest. This article derives a best unbiased and a minimum MSE estimator under the assumption of a normal distribution. It compares the bias and the MSE of these and some other estimators. Similar results are presented for the case in which the model is used to estimate values of the response function.

KEY WORDS: Regression model; Response function; Selection of variables.

1. INTRODUCTION

A frequent situation in analyzing experimental data involves (possibly replicated) observations of an output variable y obtained for corresponding fixed values of input variables x₁, x₂, . . . , xₖ. The output variable may be the yield in a chemical plant working under technological, chemical, and physical conditions described by input variables such as temperature, x₁, pressure, x₂, and the amounts, x₃, x₄, x₅, of different substances and catalyzers in the chemical process. An objective of the analysis may be to predict future values of the output y for fixed values of the input variables using a function or model depending on the input.

In other cases, the model is used for a rough description of the dependence between output and input variables by an equation, say y = f(x₁, x₃, x₅). Then the model f may be interpreted as an approximation to the true response function η, which describes the real relation y = η(x₁, . . . , xₖ) + ε between the output and the input variables. The term ε is interpreted here as the random error describing the influence of unobserved variables.

Usually the model will depend on parameters that must be estimated on the basis of data; that is, the model is fitted to the data. The mean squared error of prediction (MSEP) gives a simple description of the performance of the model. The MSEP depends on the unknown response function and the unknown variance of the observations, so good estimates of the MSEP are of interest. Such estimates may be useful as criteria for the comparison of subsets of input variables or of models, although, with the exception of some cases, the selection of models and of corresponding input variables is a more complex process than merely a formal comparison of models by a criterion. But such a criterion is a helpful tool in a model selection strategy (see Bunke 1983), which should also include subject matter considerations and common sense judgment. This has been discussed in the literature, for example, in Kennedy and Bancroft (1971) or Mallows (1973). Surveys on model selection methods, including model comparison criteria, are presented in Gaver and Geisel (1974), Hocking (1976), Thompson (1978), and Montgomery and Peck (1982). Some of these criteria, like C_p or PRESS, may also be considered as estimates of the MSEP.

Recently Efron (1983), Efron and Gong (1982), and Gong (1982) investigated the behavior of bootstrap and other estimators of the MSEP under somewhat different assumptions, namely of random input variables, an assumption that is also contained in Freedman (1981).

Our article derives a best unbiased estimator and a minimum MSE estimator under the assumption of a normal distribution and compares the bias and the MSE of these estimators with those of some others. In Section 2 we introduce the problem, assuming replicated observations, and in Section 3 we introduce the different criteria. Their comparison is carried out in Section 4, while in Section 5 numerical results and graphical presentations of MSE for some examples illustrate the quality of different estimators. A discussion of the results is presented in Section 6. In Section 7 we give a short report on analogous results for the case of the objective being an estimation of the response function. Estimation of MSEP and MSE in the case of no replications is treated in Section 8. Mathematical proofs are given in the Appendix.

2. THE MEAN SQUARED ERROR OF PREDICTION

Suppose that we have n = Σᵢ nᵢ observations yᵢⱼ, assumed to be independent random variables and to have normal distributions

yᵢⱼ ~ N(μᵢ, σ²)   (1)

for i = 1, . . . , m and j = 1, . . . , nᵢ. The vector of expectations μ = (μ₁, . . . , μₘ)ᵀ and the variance σ² are unknown. For each i, the values yᵢ₁, . . . , yᵢₙᵢ are replicated observations of the output variable for the same fixed joint values xᵢ₁, . . . , xᵢₖ of the input variables. We will assume that n = Σᵢ nᵢ > m; that is, we exclude the case of no replications (n₁ = . . . = nₘ = 1). Replications are necessary for estimating the variance σ² if there is no prior information on the response function or on the expectation μ.

The problem then is to predict some or all of the random variables y₁*, . . . , yₘ* with

E(yᵢ*) = μᵢ,   var(yᵢ*) = σ²,   (2)

using the observations yᵢⱼ. We assume these random variables y₁*, . . . , yₘ*, y₁₁, . . . , yₘₙₘ to be independent.

Often, a pseudolinear regression model f, given by

f(x₁, . . . , xₖ) = Σ_{h=1}^p βₕ gₕ(x₁, . . . , xₖ),   (3)

with fixed real functions gₕ and parameters βₕ, is used as a basis for prediction.

If model (3) is fitted to the data by ordinary least squares, we obtain the estimate b for the parameter vector β = (β₁, . . . , βₚ)ᵀ:

b = (b₁, . . . , bₚ)ᵀ = (AᵀL⁻¹A)⁻¹AᵀL⁻¹μ̂,

where

μ̂ = (ȳ₁, . . . , ȳₘ)ᵀ,   ȳᵢ = nᵢ⁻¹ Σⱼ yᵢⱼ,   L = diag[1/n₁, . . . , 1/nₘ],   (4)

and A is the m × p matrix with the elements

aᵢₕ = gₕ(xᵢ₁, . . . , xᵢₖ).

In the following, we assume that p = rank(A) ≤ m. The value yᵢ* may then be predicted by ŷᵢ* = Σₕ bₕ gₕ(xᵢ₁, . . . , xᵢₖ) if the model (3) is used.

The MSEP is defined as

r = MSEP(ŷ*) = E Σᵢ wᵢ |ŷᵢ* − yᵢ*|²,   (5)

as it was proposed, for example, by Mallows (1973) or Bendel and Afifi (1977). The possibly different importance of the prediction errors ŷᵢ* − yᵢ* is taken into account by the introduction of weights wᵢ ≥ 0 with Σᵢ wᵢ = 1. The weight wᵢ could also characterize the future frequency of occurrence of a variable yᵢ* with (2) to be predicted.

As an example, the observations yᵢ₁, . . . , yᵢₙᵢ could be the yield of a chemical plant on nᵢ different days on which the chemical plant was run under the same fixed technological conditions as given by xᵢ₁, . . . , xᵢ₅. The expected yield would be μᵢ = η(xᵢ₁, . . . , xᵢ₅). If we are interested in predicting the yields on several future days under technological conditions that are identical (or, if the response function η is sufficiently smooth, possibly only close) to those of some of the past days, we may adopt the above mathematical formalization of the problem.

The introduction of weights is necessary to cover cases in which the future output is to be predicted only for some of the m joint input values already observed in the past, for instance, in our example the yields y₁*, . . . , y₁₄* under 14 different technological conditions given by the joint values (xᵢ₁, . . . , xᵢ₅) (i = 1, . . . , 14). If these technological conditions are considered to be equally important, we will choose the weights w₁ = . . . = w₁₄ = 1/14; w₁₅ = . . . = wₘ = 0. One should choose wᵢ = 0 if yᵢ* is not to be predicted.

Predicting the output for values of the input that are not close to the observed input values may lead to large prediction errors if, as we are assuming in this article, the form of the response function is unknown and if the model (3) used for prediction is possibly not exact. Discussions on the dangers of extrapolating are given in Weisberg (1980) and Montgomery and Peck (1982), for example.

Now we use a vector notation to simplify formulas:

y* = (y₁*, . . . , yₘ*)ᵀ,   ŷ* = (ŷ₁*, . . . , ŷₘ*)ᵀ,   ‖y‖² = Σᵢ wᵢ yᵢ².

It is useful to write the MSEP as

r = Δ + Λ + σ²,

where

Δ = ‖μ − Eŷ*‖²

is the bias term, and the estimation error term is

Λ = E‖ŷ* − Eŷ*‖².
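The quantities in (4) amount to an ordinary least squares fit computed from the group means ȳᵢ. The following sketch (our own illustration, not part of the original article; the small simulated data set, the variable names, and the working model with g₁ = 1 and g₂(x) = x are assumptions made only for the example) shows one way to compute b and the fitted values with numpy.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical replicated data: m = 4 input settings, n_i replications each
x = np.array([1.0, 2.0, 3.0, 4.0])                   # one input variable, m = 4
n_i = np.array([3, 2, 4, 3])                          # replications per setting
mu_true = 1.0 + 0.5 * x                               # an assumed true response
y = [mu_true[i] + rng.normal(0, 0.3, n_i[i]) for i in range(len(n_i))]

# group means and the matrix L of (4)
mu_hat = np.array([yi.mean() for yi in y])            # mu-hat = (ybar_1, ..., ybar_m)'
L = np.diag(1.0 / n_i)                                # L = diag[1/n_1, ..., 1/n_m]

# design matrix A for a working model with g_1(x) = 1, g_2(x) = x  (p = 2)
A = np.column_stack([np.ones_like(x), x])

# ordinary least squares estimate b of (4): b = (A'L^{-1}A)^{-1} A'L^{-1} mu_hat
Linv = np.diag(n_i.astype(float))
b = np.linalg.solve(A.T @ Linv @ A, A.T @ Linv @ mu_hat)

# predicted values yhat_i* = sum_h b_h g_h(x_i1, ..., x_ik) at the observed settings
y_hat = A @ b
print(b, y_hat)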


If the model (3) is not exact, a bias Δ cannot be avoided for the predictor ŷ* = Ab, and also for any other predictor of the form ỹ = Aβ̃, where β̃ depends on the observations. Therefore, it seems appropriate to use a predictor of y* with minimal bias Δ and minimal MSEP instead of ŷ* = Ab, a problem discussed in more detail in Karson, Manson, and Hader (1969), Bunke and Strüby (1975), and Bunke and Bunke (1984). The predictor ŷ of y* that minimizes the MSEP (5) among all predictors ỹ = Aβ̃ with minimal bias is

ŷ = Aβ̂,   β̂ = (AᵀWA)⁺ AᵀW μ̂,

where C⁺ denotes the Moore-Penrose generalized inverse of the matrix C and W = diag[w₁, . . . , wₘ], as shown in Bunke and Strüby (1975). This shows us that the predictor ŷ and the estimate β̂ only depend on those ȳᵢ that correspond to nonvanishing weights wᵢ > 0. This is intuitively obvious, because without any prior information about the true response function η, the observations at those joint values of the input at which no prediction is intended have no informative value about the values of η at the joint input values of predictive interest. Therefore, the predictor would be the same if we assumed only observations for those i for which wᵢ > 0. Consequently, without loss of generality, we may always assume that all weights are positive. Then the matrix AᵀWA is of full rank, and thus the best minimum bias predictor is

ŷ = A(AᵀWA)⁻¹AᵀW μ̂.   (6)

Its mean squared error of prediction is

r = MSEP(ŷ) = Δ + σ²(1 + t),   (7)

where

t = tr D,   D = WA(AᵀWA)⁻¹AᵀWL.   (8)

In the following we will always use the best minimum bias predictor (6) for prediction. In the case of weights

wᵢ = nᵢ/n   (i = 1, . . . , m),   (9)

which are proportional to the number nᵢ of replications, it is identical to the ordinary least squares predictor Ab.

3. ESTIMATES OF THE MSEP

Some criteria for model selection use the RSS, which will be defined here under consideration of the weights and the number of replications at each value of the input as

r̂ = RSS = Σᵢ,ⱼ wᵢ nᵢ⁻¹ (yᵢⱼ − ŷᵢ)².   (10)

This "weighted" RSS is a negatively biased estimator of MSEP since

E r̂ − r = −2σ²t ≤ 0,   (11)

as shown in the Appendix. With the unbiased estimate

σ̂² = (n − m)⁻¹ Σᵢ,ⱼ (yᵢⱼ − ȳᵢ)²   (12)

of σ², we may adjust r̂ for bias and obtain the adjusted RSS,

r̂′ = r̂ + 2σ̂²t.   (13)

In the case (9) of weights wᵢ, which are proportional to the number nᵢ of replications, the adjusted estimate r̂′ = σ̂²(C_p/n + 1) is just the C_p-criterion (see Mallows 1973 or Montgomery and Peck 1982) up to a factor and an additive constant that do not depend on the model.

A "plug in" estimator of the MSEP may be obtained by replacing the unknown expectation μ and variance σ² in the formula (7) by their estimates μ̂ and σ̂²:

r̂_B = ‖μ̂ − ŷ‖² + σ̂²(1 + t).   (14)

This estimator has a nonnegative bias,

E r̂_B − r = σ²(tr WL − t) ≥ 0,   (15)

as calculated in Appendix A.1. A bias adjustment provides the "plug in" estimator

r̃_B = ‖μ̂ − ŷ‖² + σ̂²(1 − tr WL + 2t).   (16)

This estimator r̃_B is a best unbiased estimator of r because it depends on y = (y₁₁, . . . , y₁ₙ₁, . . . , yₘₙₘ)ᵀ through the statistic (μ̂, σ̂²), which is sufficient and complete under (1) (see Bunke and Bunke 1984). In the special case

w₁ = . . . = wₘ = m⁻¹,   n₁ = . . . = nₘ = h,   (17)

of equal weights and an equal number of replications, the estimator (16) is identical to the adjusted RSS r̂′. Mallows (1973) has already incidentally suggested the introduction of weights as in (5), and it may be seen that under our assumptions the criterion proposed in that paper is identical to (16). Beside the suggestion, this criterion has not been further investigated. The estimators (14) and (16) may also be obtained by application of the bootstrap approach, as shown in Bunke and Droge (1982). There the performance of Allen's PRESS (a special case of cross-validation, see Stone 1974) as an estimator of MSEP is investigated, and its inferiority in comparison with the estimator (16) is shown under some general assumptions.
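As a check on the claim that (16) coincides with the adjusted RSS under (17), the following short calculation (ours, not part of the original article) may be useful. Under (17) we have wᵢ/nᵢ = 1/(mh) = 1/n for every i, so tr WL = m/n, and splitting each residual as yᵢⱼ − ŷᵢ = (yᵢⱼ − ȳᵢ) + (ȳᵢ − ŷᵢ) in (10) gives

r̂ = Σᵢ,ⱼ n⁻¹(yᵢⱼ − ȳᵢ)² + Σᵢ wᵢ(ȳᵢ − ŷᵢ)² = (1 − tr WL)σ̂² + ‖μ̂ − ŷ‖²,

because the cross terms vanish and Σᵢ,ⱼ(yᵢⱼ − ȳᵢ)² equals (n − m)σ̂² by (12). Hence

r̂′ = r̂ + 2σ̂²t = ‖μ̂ − ŷ‖² + σ̂²(1 − tr WL + 2t) = r̃_B,

which is exactly the estimator (16).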
In this article, we want to compare "classical" criteria like the RSS and the adjusted RSS with the "plug-in" estimator (14), the best unbiased estimator (16), and a minimum MSE estimator of MSEP, which we will now derive.

Looking at the form of the previous estimators, we may ask whether a linear combination of the RSS, ‖μ̂ − ŷ‖², and σ̂² different from r̂, r̂′, r̂_B, and r̃_B could provide a better MSE for estimating r. In Appendix A.4 we prove that the criterion

r̃ = ‖μ̂ − ŷ‖² + qσ̂²,   (18)

with

q = (n − m)(n − m + 2)⁻¹(1 − tr WL + 2t),   (19)

has minimal MSE = E|r̄ − r|² in the class of all estimators of the form

r̄ = (1 − k)‖μ̂ − ŷ‖² + k RSS + hσ̂².   (20)

The bias of this optimal estimator is

E r̃ − r = −2(n − m + 2)⁻¹(1 − tr WL + 2t)σ²   (21)

(see Appendix A.2).
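For concreteness, the estimators (10), (13), (14), (16), and (18) are all simple functions of the group means, the fitted values of (6), and σ̂². The sketch below (an illustration under our own naming conventions, not code from the original article) assumes positive weights wᵢ summing to one and replicated observations, as in Section 2.

import numpy as np

def msep_estimates(y, A, w):
    """y: list of 1-D arrays of replicated observations (one array per input setting);
    A: m x p design matrix of the working model; w: weights summing to one."""
    n_i = np.array([len(yi) for yi in y])
    n, m = n_i.sum(), len(y)
    mu_hat = np.array([yi.mean() for yi in y])                       # (4)
    W, L = np.diag(w), np.diag(1.0 / n_i)
    # best minimum bias predictor (6) and the constant t of (8)
    M = np.linalg.solve(A.T @ W @ A, A.T @ W)                        # (A'WA)^{-1} A'W
    y_hat = A @ (M @ mu_hat)
    t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)
    # building blocks
    emp_bias = np.sum(w * (mu_hat - y_hat) ** 2)                     # ||mu_hat - y_hat||^2
    sigma2 = sum(((yi - yi.mean()) ** 2).sum() for yi in y) / (n - m)           # (12)
    rss = sum(w[i] / n_i[i] * ((y[i] - y_hat[i]) ** 2).sum() for i in range(m)) # (10)
    return {
        "RSS (10)": rss,
        "adjusted RSS (13)": rss + 2 * sigma2 * t,
        "plug-in (14)": emp_bias + sigma2 * (1 + t),
        "best unbiased (16)": emp_bias + sigma2 * (1 - trWL + 2 * t),
        "minimum MSE (18)": emp_bias
            + (n - m) / (n - m + 2) * (1 - trWL + 2 * t) * sigma2,   # q of (19)
    }

Called with w = nᵢ/n, the function reproduces the case (9) of weights proportional to the number of replications.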


This result shows that for estimating the MSEP, a term like ‖μ̂ − ŷ‖² is more favorable than the RSS. The estimator (18) is very close to the best unbiased estimator (16), and its bias (21) will be relatively small. Only if the number of replications is small, that is, n − m is small, will there be a noticeable difference.

4. COMPARISONS OF THE MSEP ESTIMATORS

Table 1 contains the MSE formulas (proved in Appendix A.3) for the different estimators of MSEP considered in Section 3. We use the notation

R = 4σ² Σᵢ wᵢ² nᵢ⁻¹ (μᵢ − Eŷᵢ)².   (22)

Table 1. MSE for Different Estimators of the MSEP r

Estimator    MSE
r̂           R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + 2t²}
r̂′          R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + 4(n − m)⁻¹t(1 + t − tr WL)}
r̂_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(1 + t)² + (tr WL − t)²/2}
r̃_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(1 − tr WL + 2t)²}
r̃           R + 2σ⁴{tr(WL − D)² + (n − m + 2)⁻¹(1 − tr WL + 2t)²}

[Figures 1 and 2: M = σ⁻⁴[MSE(r̄) − R] vs. p for different estimators r̄ of the MSEP (m: number of input values; h: number of replications). Figure 1: the case m = 5, h = 2. Figure 2: the case m = 5, h = 5.]

From Table 1 we obtain the following:

1. The adjusted RSS r̂′ has smaller MSE than the RSS r̂ iff n − m > 2 and t > 2(n − m − 2)⁻¹(1 − tr WL).

2. The "plug-in" estimator r̂_B has smaller MSE than the RSS r̂ if t > (3n − 3m − 2)⁻¹{(n − m − 2) tr WL + 4} (as shown in Appendix A.5).

3. As shown in Appendix A.6, the "plug-in" estimator r̂_B never has a smaller MSE than its adjusted version r̃_B.

4. With n − m ≥ 18, the best unbiased estimator r̃_B is nearly as good as the optimal estimator r̃: MSE(r̃_B) − MSE(r̃) ≤ (0.1) MSE(r̃_B) (see Appendix A.7).

In the case (9) of weights, which are proportional to the number of replications, there are simplifications in the formulas, since

R = 4σ²Δ/n,   (23)
WL = n⁻¹I,
D = WA(AᵀWA)⁻¹Aᵀ/n,
D² = D/n,
t = n⁻¹ tr[WA(AᵀWA)⁻¹Aᵀ] = n⁻¹ rank[AᵀWA] = p/n,
tr(WL − D)² = (m − p)/n²,
tr(WL − D)² + tr(W²L − W²L²) = (1 − p/n)/n.

We see that the MSE of the different estimators depends on the unknown response function only in the term R, namely by the "local biases," μᵢ − Eŷᵢ, in (22) or by the bias Δ in the special case (23). The other term only depends on σ⁴ and on the matrix D (see (8)), that is, on the model (3).

5. SOME NUMERICAL RESULTS

In order to illustrate the differences between the estimators of the MSEP, we calculated for each estimator r̄ the values

M = σ⁻⁴[MSE(r̄) − R]

in the case (17) of equal weights and an equal number of replications.

[Figures 3 and 4: M = σ⁻⁴[MSE(r̄) − R] vs. p for different estimators r̄ of the MSEP (m: number of input values; h: number of replications). Figure 3: the case m = 10, h = 2. Figure 4: the case m = 10, h = 5.]
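The quantities plotted in Figures 1-4 can be recomputed from the Table 1 formulas together with the simplifications (23), since (17) is a special case of (9). The following sketch (our own illustration; the function name and the printed layout are not from the original) evaluates M for given m, h, and model dimension p.

import numpy as np

def M_values(m, h, p):
    """M = sigma^{-4}[MSE - R] for the five MSEP estimators of Table 1 under (17):
    equal weights w_i = 1/m, n_i = h replications, and model dimension p."""
    n = m * h
    t, trWL = p / n, m / n
    trWLD2 = (m - p) / n**2                        # tr(WL - D)^2
    trW2Ldiff = (n - m) / n**2                     # tr(W^2 L - W^2 L^2)
    return {
        "RSS":           2 * (trWLD2 + trW2Ldiff + 2 * t**2),
        "adjusted RSS":  2 * (trWLD2 + trW2Ldiff + 4 * t * (1 + t - trWL) / (n - m)),
        "plug-in":       2 * (trWLD2 + (1 + t)**2 / (n - m) + (trWL - t)**2 / 2),
        "best unbiased": 2 * (trWLD2 + (1 - trWL + 2 * t)**2 / (n - m)),
        "minimum MSE":   2 * (trWLD2 + (1 - trWL + 2 * t)**2 / (n - m + 2)),
    }

# e.g. the setting of Figure 1 (m = 5, h = 2):
for p in range(1, 6):
    print(p, {k: round(v, 4) for k, v in M_values(5, 2, p).items()})

Within the class (20), the criterion (18) has the smallest MSE by construction, so its M value is the smallest for every p; the printed values for m = 5, h = 2 illustrate this.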

This was done for different m = 2, 5, 10, 20; different numbers of replications h = 2, 5, 10, 20; and different model dimensions p = 1, . . . , m. M is just the term in the formulas for the MSE(r̄) (see Table 1) that is different for the different estimators r̄ of r, while R is identical for all estimators.

M does not depend on any unknown parameter. It depends on the model (3) only by the dimension p, and as it is proportional to the MSE(r̄) up to an additive constant, it is especially adequate for a numerical or graphical comparison of estimators.

On the other hand, it is clear that the M values essentially describe the differences in the MSE of the estimators if R (or Δ) is small, that is, if the model is nearly exact, while the comparative use of differences in M values decreases with increasing R. Already a few selected values of M seem to be sufficient to give a good impression of these numerical results, say for m = 5, 10 and h = 2, 5 (see Figures 1-4).

For comparison, we have also included the values of M for a leaving-m-out variant of a cross-validation criterion (in the sense of Stone 1974) defined in Bunke and Droge (1982), which is based on calculating for each j an estimate μ̂ without the observations y₁ⱼ, . . . , yₘⱼ.

We see that the optimal estimator r̃ is essentially better than the other estimators if small MSE for all p is wanted. The best unbiased estimator r̃_B should be chosen first among the other estimators. For sufficiently large h, it seems to be nearly as good as the optimal estimator r̃ (for h = 5, at most 10% difference in MSE).

6. DISCUSSION OF THE RESULTS

The theoretical and numerical comparisons show that in general it is better to use estimators based on an unbiased estimate of the variance and on the "empirical model bias," ‖μ̂ − ŷ‖², than estimators based on the RSS. The criterion r̃ with smallest MSE for estimating the MSEP of a model is given by (18), and in general its MSE is essentially smaller than that of the other estimators, with the exception of the best unbiased estimator r̃_B, given by (16), which behaves nearly like r̃.

In the case (9) of weights that are proportional to the number of replications, our results show that the C_p-criterion of Mallows, which then is equivalent to the adjusted RSS, leads to a best unbiased estimator and, moreover, to a nearly optimal MSE in estimating the MSEP.

7. ESTIMATORS FOR THE MSE

If the regression model is used for approximating the unknown response function η or estimating its values at some values of the input, instead of predicting the future value of the output (that is, the value of the response function plus random error, as in the previous sections), the MSE for estimating the response function,

ρ = E Σᵢ wᵢ (μᵢ − ŷᵢ)² = Δ + σ²t,   (24)

should be used instead of the MSEP (5) for an assessment of the model performance. Different estimators could be constructed as in Section 3. Their comparison can be made analogously as for the estimators of the MSEP and, therefore, we state the results without proofs.

Of special interest are the RSS r̂; its adjusted unbiased variant,

ρ̂′ = r̂ + σ̂²(2t − 1);

the "plug-in" estimator,

ρ̂_B = ‖μ̂ − ŷ‖² + σ̂²t;

and its adjusted unbiased variant,

ρ̃_B = ‖μ̂ − ŷ‖² + σ̂²(2t − tr WL) = r̃_B − σ̂²,   (25)

which is a best unbiased estimator for ρ. In the class of estimators of the form

ρ̄ = (1 − k)‖μ̂ − ŷ‖² + k r̂ + hσ̂²,

we obtain a minimal MSE(ρ̄) for

ρ̃ = ‖μ̂ − ŷ‖² + q̃σ̂²,   (26)

where

q̃ = (n − m)(n − m + 2)⁻¹(2t − tr WL).   (27)

The MSE and the bias for the different estimators are presented in Table 2, where R is defined as in (22). An examination of Table 2 yields the following relations:

1. MSE(ρ̃_B) ≤ MSE(ρ̂′).

2. MSE(ρ̃_B) ≤ MSE(ρ̂_B) if n ≥ m + 2.

3. With n ≥ m + 2, it holds that MSE(r̂) ≤ MSE(ρ̂′) iff 1 ≤ 2t ≤ 4(n − m − 2)⁻¹(1 − tr WL) + 1. Therefore, n̲ := minᵢ nᵢ ≥ 2 together with the weights (9) (nᵢ = nwᵢ), which imply 2t = 2p/n ≤ 1, are sufficient conditions for MSE(ρ̂′) ≤ MSE(r̂).

4. Under n̲ := minᵢ nᵢ ≥ 2, it holds that MSE(ρ̂_B) ≤ MSE(r̂) and |bias(ρ̂_B)| < |bias(r̂)|.

5. MSE(ρ̃_B) − MSE(ρ̃) ≤ (0.1) MSE(ρ̃_B) if n ≥ m + 18.


Table 2. MSE and Bias for Different Estimators of the MSE ρ

Estimator    MSE                                                                              Bias
r̂           R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + (1 − 2t)²/2}                              (1 − 2t)σ²
ρ̂′          R + 2σ⁴{tr(WL − D)² + tr(W²L − W²L²) + (n − m)⁻¹(2t − 1)(2t − 2 tr WL + 1)}      0
ρ̂_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹t² + (tr WL − t)²/2}                              (tr WL − t)σ²
ρ̃_B         R + 2σ⁴{tr(WL − D)² + (n − m)⁻¹(2t − tr WL)²}                                    0
ρ̃           R + 2σ⁴{tr(WL − D)² + (n − m + 2)⁻¹(2t − tr WL)²}                                −2(2t − tr WL)σ²/(n − m + 2)
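The estimators of ρ differ from those of Section 3 only in the constants multiplying σ̂². Continuing the illustrative sketch given after Section 3 (again with hypothetical names, not code from the original), they could be computed as follows.

import numpy as np

def mse_estimates(y, A, w):
    """Estimators of rho (24), built from the same quantities as msep_estimates."""
    n_i = np.array([len(yi) for yi in y])
    n, m = n_i.sum(), len(y)
    mu_hat = np.array([yi.mean() for yi in y])
    W, L = np.diag(w), np.diag(1.0 / n_i)
    M = np.linalg.solve(A.T @ W @ A, A.T @ W)
    y_hat = A @ (M @ mu_hat)
    t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)
    emp_bias = np.sum(w * (mu_hat - y_hat) ** 2)
    sigma2 = sum(((yi - yi.mean()) ** 2).sum() for yi in y) / (n - m)
    rss = sum(w[i] / n_i[i] * ((y[i] - y_hat[i]) ** 2).sum() for i in range(m))
    return {
        "RSS": rss,                                                   # first row of Table 2
        "adjusted RSS": rss + sigma2 * (2 * t - 1),                   # rho-hat'
        "plug-in": emp_bias + sigma2 * t,                             # rho-hat_B
        "best unbiased (25)": emp_bias + sigma2 * (2 * t - trWL),     # rho-tilde_B
        "minimum MSE (26)": emp_bias
            + (n - m) / (n - m + 2) * (2 * t - trWL) * sigma2,        # q-tilde of (27)
    }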

We see that the optimal estimators r̃ for the MSEP and ρ̃ for the MSE only differ by a term that does not depend on the model (3). Consequently, any of these, say r̃, could be used as a criterion for the comparison of models with the objective of predicting the output or, alternatively, of estimating the response function. As discussed in Section 1, comparison of models by a criterion is only a tool in model selection. After finally arriving at some potentially good models by some more complex strategy, a description of their performance by a good estimate of their corresponding MSEP (or MSE), like r̃ (or ρ̃), is obviously of interest.

8. ESTIMATING THE MSEP AND THE MSE WITHOUT REPLICATED OBSERVATIONS

If there are no replicated observations, that is, for each input xᵢ₁, . . . , xᵢₖ (i = 1, . . . , m) there is one observation yᵢ₁ (nᵢ = 1), then an estimate of the unknown variance σ², on the basis of the observations, may only be constructed if there is some information on the response function, for example, leading to an (at least approximately) exact linear model for the observation vector y = (y₁₁, . . . , yₘ₁)ᵀ.

We assume

y ~ N(Hλ, σ²I),   λ ∈ R^q,   (28)

where H is a known n × q matrix of rank q < n = m and λ is an unknown parameter. Then

μ̂ = H(HᵀH)⁻¹Hᵀy   (29)

and

σ̂² = (n − q)⁻¹‖y − μ̂‖²_I   (30)

are best unbiased estimators of the unknown parameters μ = Hλ and σ².

If the model (28) were not exact, then the estimator (30) would overestimate the variance, and an estimator of MSEP based on (30) would be misleading in comparing models, because models with few parameters (i.e., smaller constant t) would be unduly favored.

Now we may derive, with almost the same calculations as in the previous sections, best unbiased estimators for the MSEP (7) and the MSE (24) and, moreover, minimum MSE estimators in the class of estimators of the form (20). Here and in the following, the estimators (29) and (30) always replace the estimators μ̂ (see (4)) and (12) of the previous sections, while the predictor ŷ given by (6) (but with μ̂ from (29)) is again a best minimum bias predictor. The formulas (16), (18), (19), (25), (26), and (27) for the optimal estimators of MSEP and MSE remain the same; only the matrix L (see (4)) is to be replaced by the matrix

L = H(HᵀH)⁻¹Hᵀ.

Remark 1. The assumption (28) covers the case of replicated observations, for example, case (1), as a special case, where the corresponding structure of the matrix H is obvious.

Remark 2. If there is knowledge on the structure of the expectation μ = Hλ in the sense of assumption (28), there is still interest in predictors ŷ = Aβ̂ (see (6)) given by another matrix A of smaller rank (i.e., using an inexact model), because the MSEP (7) could be essentially diminished (smaller constant t!) by such a predictor, even if then there is some bias Δ.
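A minimal sketch of this no-replication case (our own illustration; the quadratic "exact" model H, the straight-line working model A, and the equal weights are assumptions made only for the example) replaces just the pieces named above and reuses formula (16) unchanged.

import numpy as np

rng = np.random.default_rng(1)

# hypothetical example: m = n = 8 input values, one observation each (n_i = 1)
x = np.linspace(0.0, 1.0, 8)
H = np.column_stack([np.ones_like(x), x, x**2])       # assumed (approximately) exact model, q = 3
y = H @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.2, x.size)

n, q = H.shape
mu_hat = H @ np.linalg.solve(H.T @ H, H.T @ y)        # (29)
sigma2 = np.sum((y - mu_hat) ** 2) / (n - q)          # (30)
L = H @ np.linalg.solve(H.T @ H, H.T)                 # L = H(H'H)^{-1}H' replaces diag[1/n_i]

# working (possibly inexact) model of smaller rank, e.g. a straight line (p = 2)
A = np.column_stack([np.ones_like(x), x])
w = np.full(n, 1.0 / n)                               # equal weights, an assumption for the example
W = np.diag(w)
M = np.linalg.solve(A.T @ W @ A, A.T @ W)
y_hat = A @ (M @ mu_hat)                              # predictor (6) with mu_hat from (29)
t, trWL = np.trace(W @ A @ M @ L), np.trace(W @ L)

# best unbiased estimator (16) of the MSEP, with the replacements above
r_B = np.sum(w * (mu_hat - y_hat) ** 2) + sigma2 * (1 - trWL + 2 * t)
print(sigma2, r_B)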

ACKNOWLEDGMENT

The authors are very grateful to the referees for their remarks, which improved the presentation of the article considerably.

APPENDIX

In this appendix we give the proofs of the statements in Sections 3 and 4, using the following notation.

1. ⊗: Kronecker product.
2. 1ᵢ: nᵢ-vector of ones.
3. Iₖ: k × k unity matrix.
4. y = (y₁₁, . . . , y₁ₙ₁, . . . , yₘ₁, . . . , yₘₙₘ)ᵀ.


5. ‖y‖²_Z = yᵀZy.
6. μ̄ = (μ₁1₁ᵀ, . . . , μₘ1ₘᵀ)ᵀ: the n-vector repeating each μᵢ exactly nᵢ times.
7. μ̂̄ = (μ̂₁1₁ᵀ, . . . , μ̂ₘ1ₘᵀ)ᵀ: the n-vector repeating each μ̂ᵢ = ȳᵢ exactly nᵢ times.
8. ŷ̄ = (ŷ₁1₁ᵀ, . . . , ŷₘ1ₘᵀ)ᵀ.
9. V: the n × n block-diagonal matrix diag[w₁n₁⁻¹I_{n₁}, . . . , wₘnₘ⁻¹I_{nₘ}].
10. P = A(AᵀWA)⁺AᵀW = ((pᵢⱼ)), i, j = 1, . . . , m.
11. F: the m × n matrix with i-th row (n₁⁻¹pᵢ₁1₁ᵀ, . . . , nₘ⁻¹pᵢₘ1ₘᵀ), i = 1, . . . , m.
12. H: the n × n matrix with i-th block row 1ᵢ ⊗ (n₁⁻¹pᵢ₁1₁ᵀ, . . . , nₘ⁻¹pᵢₘ1ₘᵀ), i = 1, . . . , m.
13. G: the m × n block-diagonal matrix diag[n₁⁻¹1₁ᵀ, . . . , nₘ⁻¹1ₘᵀ] of group averages, so that μ̂ = Gy.
14. K = diag[n₁⁻¹1₁1₁ᵀ, . . . , nₘ⁻¹1ₘ1ₘᵀ].

We will use some algebraic identities that follow easily from the above definitions:

μ̂ = Gy,   ŷ = Pμ̂ = Fy,   μ̂̄ = Ky,   ŷ̄ = Hy.   (31)

(I − K)(K − H) = 0,   ‖μ̂̄ − ŷ̄‖²_V = ‖μ̂ − ŷ‖².   (32)

U := (I − P)ᵀW(I − P) = W − WP.   (33)

Q := (I − K)ᵀV(I − K) = V − VK,   tr V = tr W = 1,   tr VK = tr WL.   (34)

K² = K,   tr K = m,   Kμ̄ = μ̄.   (35)

For the calculation of E r̄ and var(r̄) for an estimate r̄ of r, we will use the formulas

E‖z‖²_C = ‖ν‖²_C + tr(ΛC)   (36)

and

var ‖z‖²_C = 4νᵀCΛCν + 2 tr(ΛC)²,   (37)

which are valid under z ~ N(ν, Λ). Because of (10), (31), and (32), we may write

r̂ = ‖y − ŷ̄‖²_V = ‖y − μ̂̄‖²_V + ‖μ̂ − ŷ‖².   (38)

Now, from (31), (33), (36), (4), and (8), we obtain

E‖μ̂ − ŷ‖² = E‖μ̂‖²_U = ‖μ‖²_U + σ² tr UL
          = ‖μ − Pμ‖² + σ² tr(WL − D)
          = Δ + σ²(tr WL − t),   (39)

since

μ̂ ~ N(μ, σ²L).

Furthermore, it follows from (31) and (36) that

E‖y − μ̂̄‖²_V = E‖y‖²_Q = ‖μ̄‖²_Q + σ² tr Q = σ²(1 − tr WL),   (40)

which, together with (32) and (39), provides

E r̂ = Δ + σ²(1 − tr D)

and, therefore, the formula (11) for E r̂ − r.

A.1 Proof of (15)

Equation (15) follows directly from (7), (14), (39), and Eσ̂² = σ². We remark that tr WL − tr D = Σᵢ wᵢnᵢ⁻¹(1 − pᵢᵢ) can never be negative because pᵢᵢ ≤ 1 (i = 1, . . . , m).

A.2 Proof of (21)

Using (7), (18), and (39), we obtain

E r̃ − r = σ²(q − 1 + tr WL − 2t)
        = σ²(1 − tr WL + 2t){(n − m)(n − m + 2)⁻¹ − 1}

and, therefore, (21).

A.3 Proof of the MSE Formulas in Table 1

1. For the calculation of

MSE(r̄) = (E r̄ − r)² + var(r̄),   (41)

we will use the formulas for the bias of the estimates and (37). First we consider an estimate of r of the form

r*(a, b) = ‖μ̂ − ŷ‖² + ‖y − μ̂̄‖²_S,   S = aIₙ + bV.

Because of (31) and (32), ‖μ̂ − ŷ‖² and ‖y − μ̂̄‖²_S are independent under (1) (see Rao 1972). Therefore, we obtain

var r*(a, b) = var ‖μ̂ − ŷ‖² + var ‖y − μ̂̄‖²_S.   (42)

Now it is easily seen from (33) that

Uμ = W(μ − Pμ)

and thus

4σ²μᵀULUμ = 4σ²‖μ − Pμ‖²_{WLW} = R.   (43)

From

UL = WL − D   (see (33)),

we derive

tr(UL)² = tr(WL − D)²,

which, together with (31), (33), (37), and (43), yields

var ‖μ̂ − ŷ‖² = var ‖μ̂‖²_U = 4σ²μᵀULUμ + 2σ⁴ tr(UL)² = R + 2σ⁴ tr(WL − D)².   (44)


Using (31), we may write

‖y − μ̂̄‖²_S = ‖y‖²_T,   T = (I − K)S(I − K).   (45)

The identities in (35) lead to

‖μ̄‖²_T = 0,   (46)

and, moreover, with (34) and (35), to

tr T² = tr{(I − K)S}²
      = tr{a(I − K) + b(V − VK)}²
      = a² tr(I − K) + 2ab tr(V − VK) + b² tr(V − VK)²
      = a²(n − m) + 2ab(1 − tr WL) + b² tr(V − VK)².   (47)

From (37), (45), (46), and (47), we obtain

var ‖y − μ̂̄‖²_S = 2σ⁴{a²(n − m) + 2ab(1 − tr WL) + b² tr(V − VK)²},

which, together with (42) and (44), provides

var r*(a, b) = R + 2σ⁴{tr(WL − D)² + a²(n − m) + 2ab(1 − tr WL) + b² tr(W²L − W²L²)},   (48)

since tr(V − VK)² = tr(W²L − W²L²).

2. Because of

σ̂² = (n − m)⁻¹‖y − Ky‖²_I = (n − m)⁻¹‖y − μ̂̄‖²_I,

we may write (see (38), (13), (14), (16), and (18))

r̂ = r*(0, 1),
r̂′ = r*(2(n − m)⁻¹t, 1),
r̂_B = r*((n − m)⁻¹(1 + t), 0),
r̃_B = r*((n − m)⁻¹(1 − tr WL + 2t), 0),   and
r̃ = r*((n − m)⁻¹q, 0) = r*((n − m + 2)⁻¹(1 − tr WL + 2t), 0).

These equations and (48) provide, together with the formulas for the bias of the estimates (see (11), (15), and (21)), the MSE formulas in Table 1.

A.4 Proof of the Optimality Property for the Estimator (18)

1. Using the notation of A.3, we obtain from (31) and (38) for an estimate of the form (20):

r̄ = ‖μ̂ − ŷ‖² + k‖y − μ̂̄‖²_V + hσ̂²
  = r*((n − m)⁻¹h, k).   (49)

The bias of r̄ is easily calculated as (see (39), (40), and (7))

E r̄ − r = σ²(k + h − 1 + (1 − k) tr WL − 2t).   (50)

From (48) and (49), it follows that

var(r̄) = R + 2σ⁴{tr(WL − D)² + h²(n − m)⁻¹ + 2hk(n − m)⁻¹(1 − tr WL) + k² tr(W²L − W²L²)},

which, together with (41) and (50), provides

φ(h, k) := MSE(r̄)
         = R + 2σ⁴{tr(WL − D)² + h²(n − m)⁻¹ + k² tr(W²L − W²L²)
           + 2hk(n − m)⁻¹(1 − tr WL) + (h + k − 1 + tr WL − k tr WL − 2t)²/2}.

2. To find the optimum pair (h, k), we minimize φ(h, k) with respect to (h, k). From the necessary condition for a minimum (h*, k*),

∂φ(h, k)/∂h at (h*, k*) = 0,

it results that

k*(1 − tr WL)(n − m + 2)(n − m)⁻¹ = 1 + 2t − tr WL − (n − m)⁻¹(n − m + 2)h*

and, therefore,

h* = −(1 − tr WL)k* + (n − m)(n − m + 2)⁻¹(1 − tr WL + 2t).   (51)

The second necessary condition,

∂φ(h, k)/∂k at (h*, k*) = 0,

leads to

0 = 2h*(n − m)⁻¹(1 − tr WL) + 2k* tr(W²L − W²L²)
  + (h* + k* − 1 + tr WL − k* tr WL − 2t)(1 − tr WL).   (52)

Now let

c := (n − m)⁻¹(1 − tr WL)² − tr(W²L − W²L²).   (53)

Then, from (51) and (52), we obtain

2ck* = 0.   (54)

Because of Σᵢ(nᵢ − 1) = n − m and the strict convexity of the function f(x) = x², Jensen's inequality yields

Σᵢ (nᵢ − 1)(n − m)⁻¹ wᵢ² nᵢ⁻² ≥ {(n − m)⁻¹ Σᵢ (nᵢ − 1) wᵢ nᵢ⁻¹}²,   (55)


where equality holds iff

wᵢnᵢ⁻¹ = wⱼnⱼ⁻¹   for i, j = 1, . . . , m.   (56)

We remark that (56) is equivalent to nV = I. Since

tr(W²L − W²L²) = Σᵢ (nᵢ − 1)wᵢ²nᵢ⁻²

and

(1 − tr WL)² = {Σᵢ (nᵢ − 1)wᵢnᵢ⁻¹}²,

we obtain from (53), (55), and (56)

c ≤ 0,   (57)

where

c = 0   iff   nV = I.   (58)

In the case of c ≠ 0, it follows from (54) that

k* = 0   (59)

and, therefore, together with (51),

h* = q.   (60)

For c = 0, that is, nV = I (see (58)), any solution (h*, k*) of (51) provides

r̄ = ‖μ̂ − ŷ‖² + k*‖y − μ̂̄‖²_V + h*σ̂²
  = ‖μ̂ − ŷ‖² + qσ̂² = r̃,

since then

‖y − μ̂̄‖²_V = n⁻¹‖y − μ̂̄‖²_I = n⁻¹(n − m)σ̂²   and   tr WL = m/n.

It is easily verified that h* and k*, defined as in (59) and (60), always minimize φ(h, k) = MSE(r̄).

A.5 Proof of Statement 2 in Section 4

Note that

σ⁻⁴{MSE(r̂) − MSE(r̂_B)} = 2 tr(W²L − W²L²) − (tr WL − t)² + 4t² − 2(n − m)⁻¹(1 + t)².   (61)

Using (57), we have

tr(W²L − W²L²) ≥ (n − m)⁻¹(1 − tr WL)²,

which, together with (61), leads to

σ⁻⁴{MSE(r̂) − MSE(r̂_B)} ≥ (n − m)⁻¹(t + tr WL){(3n − 3m − 2)t − (n − m − 2) tr WL − 4}.   (62)

Thus the statement directly follows from (62), since t + tr WL > 0.

A.6 Proof of Statement 3 in Section 4

From Table 1 we derive

σ⁻⁴{MSE(r̂_B) − MSE(r̃_B)} = (n − m)⁻¹(t − tr WL){(n − m − 6)t − (n − m − 2) tr WL − 4}.   (63)

Now it can be easily verified that

(n − m − 2) tr WL + 4 > (n − m − 6)t,

which, together with (63) and 0 ≤ t ≤ tr WL, leads to MSE(r̂_B) − MSE(r̃_B) ≥ 0.

A.7 Proof of Statement 4 in Section 4

Table 1 yields

MSE(r̃_B) − MSE(r̃) = 4{(n − m)(n − m + 2)}⁻¹(1 − tr WL + 2t)²σ⁴ ≤ 2(n − m + 2)⁻¹ MSE(r̃_B),

from which statement 4 in Section 4 directly follows, because

2(n − m + 2)⁻¹ ≤ 0.1   iff   n − m ≥ 18.

[Received July 1981. Revised December 1983.]

REFERENCES

BENDEL, R. B., and AFIFI, A. A. (1977), "Comparison of Stopping Rules in Forward 'Stepwise' Regression," Journal of the American Statistical Association, 72, 46-53.

BUNKE, H., and BUNKE, O. (eds.) (1984), Statistical Inference in Linear Models, London: John Wiley.

BUNKE, H., and STRÜBY, R. (1975), "Estimation Procedures in Inadequate Models: Comparisons and Empirical Two Step Procedure," Mathematische Operationsforschung und Statistik, 6, 167-177.

BUNKE, O. (1983), "Selecting Variables and Models in Regression: Some Tools and Suggestions for a Strategy," to appear in Mathematische Operationsforschung und Statistik, Series Statistics.

BUNKE, O., and DROGE, B. (1982), "Bootstrap and Cross-Validation Criteria for Selecting Linear Regression Models," Preprint 33, Sektion Mathematik, Humboldt-Universität, Berlin.

EFRON, B. (1983), "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation," Journal of the American Statistical Association, 78, 316-331.

EFRON, B., and GONG, G. (1982), "A Leisurely Look at the Bootstrap, the Jackknife and Cross-Validation," Technical Report 75, Stanford University, Division of Biostatistics.

FREEDMAN, D. A. (1981), "Bootstrapping Regression Models," Annals of Statistics, 9, 1218-1228.

GAVER, K. M., and GEISEL, M. S. (1974), "Discriminating Among Alternative Models: Bayesian and Non-Bayesian Methods," in Frontiers in Econometrics, ed. Paul Zarembka, New York: Academic Press, 49-80.

GONG, G. (1982), "Cross-validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression," Technical Report 80, Stanford University, Division of Biostatistics.

HOCKING, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-49.

KARSON, M. J., MANSON, A. R., and HADER, R. J. (1969), "Minimum Bias Estimation and Experimental Design for Response Surfaces," Technometrics, 11, 461-475.

KENNEDY, W. S., and BANCROFT, T. A. (1971), "Model Building for Prediction in Regression Based Upon Repeated Significance Tests," Annals of Mathematical Statistics, 42, 1273-1284.

MALLOWS, C. L. (1973), "Some Comments on C_p," Technometrics, 15, 661-675.

MONTGOMERY, D. C., and PECK, E. A. (1982), Introduction to Linear Regression Analysis, New York: John Wiley.

RAO, C. R. (1972), Linear Statistical Inference, New York: John Wiley.

STONE, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions," Journal of the Royal Statistical Society, Ser. B, 36, 111-133.

THOMPSON, M. L. (1978), "Selection of Variables in Multiple Regression: Part I. A Review and Evaluation," International Statistical Review, 46, 1-19; "Part II. Chosen Procedures, Computations and Examples," 46, 129-146.

WEISBERG, S. (1980), Applied Linear Regression, New York: John Wiley.
