TECHNOMETRICS ©, VOL. 25, NO. 1, FEBRUARY 1983

Diagnostic Regression Analysis and Shifted Power Transformations

A. C. Atkinson
Department of Mathematics
Imperial College of Science and Technology
London, SW7 2BZ, England

Diagnostic displays for outlying and influential observations are reviewed. In some examples apparent outliers vanish after a power transformation of the data. Interpretation of the score statistic for transformations as regression on a constructed variable makes diagnostic methods available for detection of the influence of individual observations on the transformation. Methods for the power transformation are exemplified and extended to the power transformation after a shift in location, for which there are two constructed variables. The emphasis is on plots as diagnostic tools.

KEY WORDS: Box and Cox; Cook statistic; Influential observations; Outliers in regression; Power transformations; Regression diagnostics; Shifted power transformation.

1. INTRODUCTION

It is straightforward, with modern computer software, to fit multiple-regression equations to data. It is often, however, less easy to determine whether one, or a few, observations are having a disproportionate effect on the fitted model and hence, more importantly, on the conclusions drawn from the data. Methods for detecting such observations are a main aim of the collection of techniques known as regression diagnostics, many of which are described in the books of Belsley, Kuh, and Welsch (1980) and of Weisberg (1980).

The purpose of the present article is to describe some recently developed diagnostic plots that have been applied both to multiple regression and to the analysis of data transformations. In this article the techniques are illustrated and extended to situations in which a power transformation is appropriate only after a shift in location. It is intended that the plots should be used in a routine manner to display any features that might suggest ways in which the model, or data, are inadequate.

Because any of the data used in multiple regression may be wrong due to measuring, recording, transcription, or keypunching errors, both response and explanatory variables may be in error. To detect these two kinds of error, two different quantities are needed. Some possibilities are discussed in Section 2, and the half-normal plots of Atkinson (1981) are used to investigate a set of data on salinity presented by Ruppert and Carroll (1980).

The presence of one or more outlying observations may indicate not inadequate data, but an inadequate model. In some cases outliers can be reconciled with the body of the data by a transformation of the response. But there is the possibility that evidence for, or against, a transformation may itself be unduly influenced by one, or a few, observations. In Section 3 the diagnostic plot for influential observations is applied to the score test for transformations, which is interpreted in terms of regression on a constructed variable.

The power transformation may, however, be more appropriate after a shift in the location of the observations. This model is presented by Box and Cox (1964), but they do not discuss it in any detail. In Section 4 the methodology is extended to a test for a shift in location, and the related diagnostic plots are presented. The detailed structure of some plots is described in Section 5 and the article concludes, in Section 6, with a discussion of some possible extensions of the methods.

2. GRAPHICAL DISPLAYS FOR REGRESSION DIAGNOSTICS

The data $Y$ are assumed to come from the multiple-regression model $E(Y) = X\beta$, where $X$ is the full-rank $n \times p$ matrix of explanatory variables. The errors are independent and have variance $\sigma^2$. If $\hat{\beta}$ is the least squares estimate of $\beta$, the predicted response at the $i$th data point is $\hat{y}_i = x_i^T \hat{\beta}$, which has variance

$$\mathrm{var}(\hat{y}_i) = \sigma^2 x_i^T (X^T X)^{-1} x_i = \sigma^2 h_i.$$

The $i$th unstandardized residual is $r_i = y_i - \hat{y}_i$ and $\mathrm{var}(r_i) = \sigma^2 (1 - h_i)$. If $s^2$ is the residual mean square estimate of $\sigma^2$ the standardized residuals are

$$r_i' = \frac{r_i}{s \sqrt{1 - h_i}},$$

which are identically distributed.

To find the effect of each observation in turn on the fitted model Hoaglin and Welsch (1978) suggest looking at the prediction at the $i$th observational point when $y_i$ is not used in fitting. This prediction is $\hat{y}_{(i)} = x_i^T \hat{\beta}_{(i)}$, where the subscripted $i$ in parentheses is to be read as "with observation $i$ deleted." The $t$ test for agreement between $y_i$ and $\hat{y}_{(i)}$ is the cross-validatory or jackknife residual $r_i^*$, which the results of Plackett (1950) show has the simple form

$$r_i^* = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}}.$$

Atkinson (1981) stresses that the jackknife residuals are a monotone, but nonlinear, function of the standardized residuals, which, however, reflect large values more dramatically: as $r_i'^2 \to n - p$, $r_i^{*2} \to \infty$. If the purpose is graphical detection of an outlying $y$ value, the jackknife residuals are to be preferred. For suppose that observation $i$ has had an unknown amount $\Delta$ added to it. Then the jackknife residual $r_i^*$ will have a noncentral $t$ distribution. All other jackknife residuals will be shrunk due to overestimation of $\sigma$ caused by the outlying observation. This shrinkage will, however, affect all standardized residuals, since a common estimate of $\sigma$ is employed.

An outlying value of one or more explanatory variables may create a point of high influence for which $h_i$ is near one. Such a point may have a large effect on the fitted model, but, since the standardized residuals all have the same variance, a residual plot will not reveal such points. Hoaglin and Welsch (1978) and Obenchain (1977) recommend study of residuals and of the $h_i$ or of some function of them such as $h_i/(1 - h_i)$ in order to examine influential observations. Cook (1977) suggested the statistic

$$D_i = \frac{(\hat{\beta}_{(i)} - \hat{\beta})^T X^T X (\hat{\beta}_{(i)} - \hat{\beta})}{p s^2} = \frac{r_i'^2 h_i}{p (1 - h_i)}.$$

This statistic measures the effect on the parameter estimate of deleting the $i$th observation. A scaled version of $D_i$ is used by Atkinson (1981) to obtain a diagnostic plot. If $\sigma^2$ is estimated not by $s^2$ but by $s_{(i)}^2$, the resulting modified Cook statistic is

$$C_i = \left\{ \frac{n - p}{p} \cdot \frac{h_i}{1 - h_i} \right\}^{1/2} |r_i^*|.$$

For a D-optimum experimental design all observations have the same leverage and $h_i = p/n$. The effect of the scaling of $D_i$ is to make the plots of $C_i$ and $|r_i^*|$ identical for this most balanced case.

Nomenclature for these quantities is not standardized. The jackknife residuals $r_i^*$ are called RSTUDENT by Belsley, Kuh, and Welsch (1980, Ch. 2). Apart from the factor in $n$ and $p$, $C_i$ is identical to the quantity they call DFFITS$_i$. Both $r_i^*$ and signed values of $C_i$ can be plotted in any of the ways customary for residuals as described, for example, by Weisberg (1980, Ch. 7), or in serial order as does Pregibon (1981) for his logistic regression diagnostics. Atkinson (1981) uses half-normal plots of both $r_i^*$ and $C_i$ together with a form of Monte Carlo testing: with the matrix $X$ of explanatory variables fixed, 19 samples are simulated from the normal distribution and the envelope of ordered statistics is plotted in addition to the observed values. Because the null shape of the plot of the modified Cook statistics depends on the position of the observational points in $X$ space, that is, on the values of the $h_i$, the envelope makes possible the distinction between an influential point for which the $y$ and $x$ values agree and one for which they do not. Experience suggests that the disagreement at such a point of high influence is usually due to an incorrect $x$ value. The ability to detect such observations is an advantage of the plot of $C_i$ with a simulated envelope over the study of the components of $C_i$, namely $r_i^*$ and $h_i/(1 - h_i)$.

As an example of these techniques we look first at some data on the salinity of water used by Ruppert and Carroll (1980) to demonstrate robust regression. There are 28 observations and three explanatory variables. Figure 1a shows the half-normal plot of the jackknife residuals $r_i^*$ when a first-order model is fitted. Apart from the fourth largest value, all observations lie within the envelope and there is no clear pattern. On the other hand, the plot of the modified Cook statistic $C_i$ shown in Figure 1b exhibits a clear pattern. The largest value of $C_i$, 10.2, belonging to observation 16, is well outside the envelope. As a result of this one large value nearly all the other values are shrunk below the envelope, owing to the overestimation of $\sigma^2$.

These figures clearly show there is something strange about observation 16 and suggest that the explanatory variables may be at fault. Inspection of the data shows that for observation 16 the third explanatory variable, water flow, has the value 33.443, whereas all other values lie in the range 20.769 to 29.895, with 26.417 the second highest value. One possibility is to "correct" 33.443 to 23.443, which has the effect of reducing the residual sum of squares from 42.5 to 26.3. The resulting half-normal plots of $r_i^*$ and $C_i$ in Figures 2a and 2b no longer show any unduly large values, although several of the small values of both quantities seem a little too small. We will return to the further analysis of these data in Section 3.

This example illustrates the usefulness of the plots in calling attention to features of the data that require further investigation. Other examples are given in At-

[Figure: half-normal plot, vertical axis $r_i^*$]

... data are wrong or because the model is inadequate. One cause of apparent outliers is that the data are being analyzed in the wrong scale. For example, in an analysis of Brownlee's stack loss data Atkinson (1981) found that observation 21, which gave a value of $C_i$ lying outside the simulated envelope, could be reconciled with the body of the data by use of a log trans-
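
The quantities defined in Section 2 can be illustrated with a minimal sketch in Python, assuming NumPy is available. The function names `regression_diagnostics` and `half_normal_envelope` are hypothetical and do not come from the article. The sketch makes three assumptions: the deletion estimate $s_{(i)}^2$ is obtained from the standard identity $(n-p-1)s_{(i)}^2 = (n-p)s^2 - r_i^2/(1-h_i)$ rather than by refitting; the envelope is formed from standard normal responses with $X$ fixed, which suffices because these statistics do not depend on $\beta$ or $\sigma$ under the model; and no plotting positions are computed, since the article does not state them here. The data are synthetic, not the salinity data.

```python
import numpy as np


def regression_diagnostics(X, y):
    """Return h_i, r'_i, r*_i, D_i and C_i for a least squares fit.

    X: full-rank n x p matrix (include a column of ones for a constant).
    y: response vector of length n.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    # Leverages h_i = x_i' (X'X)^{-1} x_i, via the thin QR factor of X.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta                          # unstandardized residuals r_i
    s2 = r @ r / (n - p)                      # residual mean square s^2
    r_std = r / np.sqrt(s2 * (1.0 - h))       # standardized residuals r'_i
    # Deletion variance s_(i)^2 from a standard identity (no refitting).
    s2_del = ((n - p) * s2 - r**2 / (1.0 - h)) / (n - p - 1)
    r_jack = r / np.sqrt(s2_del * (1.0 - h))  # jackknife residuals r*_i
    cook = r_std**2 * h / (p * (1.0 - h))     # Cook's D_i
    # Modified Cook statistic C_i.
    cook_mod = np.sqrt((n - p) / p * h / (1.0 - h)) * np.abs(r_jack)
    return {"h": h, "r_std": r_std, "r_jack": r_jack, "D": cook, "C": cook_mod}


def half_normal_envelope(X, stat="C", n_sim=19, seed=0):
    """Min and max of the ordered |stat| over n_sim simulated samples."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sims = np.empty((n_sim, n))
    for k in range(n_sim):
        y_sim = rng.standard_normal(n)        # X is held fixed
        values = regression_diagnostics(X, y_sim)[stat]
        sims[k] = np.sort(np.abs(values))
    return sims.min(axis=0), sims.max(axis=0)


# Synthetic illustration only (hypothetical data, not the salinity data).
rng = np.random.default_rng(1)
X_demo = np.column_stack([np.ones(20), rng.normal(size=(20, 3))])
y_demo = X_demo @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(size=20)
X_demo[15, 3] += 6.0                          # corrupt one x value after y is generated
diag = regression_diagnostics(X_demo, y_demo)
lower, upper = half_normal_envelope(X_demo, stat="C")
observed = np.sort(diag["C"])
print("ordered C_i above the envelope:", int(np.sum(observed > upper)))
```

Plotting `observed`, `lower` and `upper` against half-normal plotting positions would give a display of the kind described for Figures 1 and 2. Passing `stat="r_jack"` gives the corresponding envelope for the jackknife residuals.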