7 Detection and modeling of nonconstant variance

7.1 Introduction

So far, we have focused on approaches to inference in mean-variance models of the form

E(Y_j | x_j) = f(x_j, \beta), \qquad var(Y_j | x_j) = \sigma^2 g^2(\beta, \theta, x_j) \qquad (7.1)

under the assumption that we have already specified such a model.

• Often, a model for the mean may be suggested by the nature of the response (e.g., binary or count), by subject-matter theoretical considerations (e.g., pharmacokinetics), or by the empirical evidence (e.g., models for assay response).

• A model for variance may or may not be suggested by these features. When the response is binary, the form of the variance is indeed dictated by the Bernoulli distribution, while for data in the form of counts or proportions, for which the Poisson or binomial distributions may be appropriate, the form of the variance is again suggested. One may wish to consider the possibility of over- or underdispersion in these situations; this may reasonably be carried out by fitting a model that accommodates these features and determining if an improvement in fit is apparent using methods for inference on variance parameters we will discuss in Chapter 12.

Alternatively, when the response is continuous (or approximately continuous), it is often the situation that there is not necessarily an obvious relevant distributional model. As we have discussed in some of the examples we have considered, several sources of variation may combine to produce patterns that are not well described by the kinds of variance models dictated by popular distributional assumptions such as the gamma or lognormal distributions.

In fact, it may be unclear whether heterogeneity of variance is even an issue at all. In some applications, it is expected, and popular models may be available; in others, whether or not variance changes with the mean or covariate values may need to be deduced from the data.

• In these situations, methods are required for detecting nonconstant variance, determining whether or not it changes smoothly across the range of the response or covariates, and identifying an appropriate model to characterize the change.

To address these issues, both formal and informal approaches have been proposed:


• Graphical techniques. Both for detection and modeling, these often have a subjective flavor. In this chapter, we will focus on these procedures.

• Formal hypothesis testing. Formal procedures are mainly used for detection. We will defer discussion of these until after we have covered the large-sample theoretical developments on which they are based. Because of the complexity of (7.1), no finite-sample, "exact" methods are available in general.

COMMON THEME: Most graphical approaches are based on the OLS residuals

r_j = Y_j - f(x_j, \hat{\beta}_{OLS})

and functions thereof, or on related constructs. Our main focus will be on detection and modeling in situations where the response is continuous (or nearly continuous, such as in the case of moderate-to-large counts).

A complementary treatment of some of the approaches we will discuss may be found in Carroll and Ruppert (1988, Sections 2.7 and 2.8). Note that, in what follows, distributional statements are conditional on the x_j.

7.2 Plots based on residuals

We begin by first reviewing the basic rationale for the use of residuals as a tool for detecting nonconstant variance in regression.

The "usual" residual plots described in a first course in regression analysis apply equally well in the nonlinear model situation. Specifically, one usually plots the r_j or the "standardized" residuals r_j/σ̂_OLS, where

\hat{\sigma}^2_{OLS} = (n - p)^{-1} \sum_{j=1}^{n} r_j^2,

versus one or more of the following (a short code sketch constructing these plots follows the list):

• Predicted values Ŷ_j = f(x_j, β̂_OLS)

• Covariates (elements of xj)

• log Ŷ_j in cases where many responses tend to be clustered in a very narrow range, in order to "stretch things out" so that any patterns might be more readily discernible. We will see the value of this for some nonlinear models and designs later.
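Purely as an illustration (not part of the original development), the following minimal Python sketch constructs these plots. The mean function f, the data y and x, the OLS estimate beta_ols, and the dimension p of β are assumed to be supplied by the user; all names are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def usual_residual_plots(y, x, f, beta_ols, p):
    """Plot standardized OLS residuals against predicted values and log
    predicted values. f(x, beta) is the mean function, beta_ols the OLS
    estimate, p = dim(beta)."""
    yhat = f(x, beta_ols)                      # predicted values Yhat_j
    r = y - yhat                               # OLS residuals r_j
    sigma2 = np.sum(r**2) / (len(y) - p)       # (n - p)^{-1} sum r_j^2
    b = r / np.sqrt(sigma2)                    # standardized residuals r_j / sigma_hat

    fig, ax = plt.subplots(1, 2, figsize=(9, 4))
    for a, xv, lab in zip(ax, [yhat, np.log(yhat)],
                          ["Predicted value", "Log predicted value"]):
        a.scatter(xv, b)
        a.axhline(0.0, linestyle="--")
        a.set_xlabel(lab)
        a.set_ylabel("Standardized residual")
    plt.show()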


If the plot(s) exhibit an apparent pattern, with the magnitude of residuals changing with level of predicted value or covariate, this is taken as evidence of potential nonconstant variance. In particular, for the plot of residuals vs. predicted values or their logarithms, a “fan-shape” is accepted as evidence that variance increases smoothly with the level of the response (mean). More generally, any “nonhaphazard,” “systematic” pattern may well be evidence that variance does not remain constant over the range of the response.

One must be careful, however.

• A systematic pattern may also be the result of an ill-fitting mean model. The nature of the pattern must be critically assessed by the data analyst to determine a reasonable explanation for it given the particular mean model and circumstances. For example, for the indomethacin pharmacokinetic data in Examples 1.1 and 1.2, the model was the sum of two exponential terms. If a simple model containing only a single exponential term were fitted to these data, one would expect to see a systematic pattern in the residuals reflecting the lack of fit of this model.

There is certainly subjectivity involved in this endeavor.

• When responses are collected in time order, e.g., repeated measurements on the same individuals, one often plots the residuals against time to look for temporal patterns that may suggest possible serial correlation. Alternatively, more sophisticated plots for investigating this are available. We defer discussion of serial correlation until later chapters, as our current focus is on detecting and modeling nonconstant variance when the assumption of independence is reasonable. It is important to recognize, however, that this is an assumption that should be considered carefully in practice.

MOTIVATION: The obvious motivation for the usual plots is that r_j is a "proxy" for the true deviation

Yj − f(xj, β).

• If the data are normally (or at least symmetrically) distributed with constant variance, we would

expect the r_j to be roughly symmetrically distributed about 0 and to have approximately constant variance.

• We would thus expect a “haphazard” pattern, with approximately equal numbers of positive and negative residuals with approximately the same magnitude across their entire range.

• Even if the variance were nonconstant, if the data were at least normally or symmetrically dis- tributed, we would still expect approximately equal numbers of positive and negative residuals.


However, we would expect changing magnitude across the range.

PROBLEMS WITH THE USUAL PLOTS: The OLS residuals r_j may not have exactly the same properties as the true deviations because β is replaced by the OLS estimator β̂_OLS. We will tackle this issue shortly. Some more immediate problems that may make the usual plots difficult to interpret are as follows:

• The data may not be normally or even symmetrically distributed but may instead arise from a skewed (asymmetric) distribution.

• The design (the settings of the xj) may be such that an unusual pattern of residuals may be due to something other than nonconstant variance.

• Furthermore, although the usual plots may be sufficient for detection, they may not be very helpful for modeling of nonconstant variance.

We thus consider refinements of the usual plots.

REFINEMENT 1. A common idea is to base plots on transformations of (absolute) residuals or other residuals in order to account for sample size or asymmetry.

A seminal reference for some of these ideas in the context of linear regression is Cook and Weisberg (1983).

We have already discussed estimation of variance parameters based on transformations of absolute residuals, so it should come as no surprise that diagnostic plots would also be based on them.

IDEA 1: “Visually double the sample size.” The usual plots may be difficult to interpret because the sample size is small. Under such conditions, a change in the placement of just a single residual in the plot can change the apparent pattern substantially. Thus, each observation may be very influential to the eye in gauging the pattern.

A simple remedy is to plot r_j^2 or r_j^2/σ̂^2_OLS instead. In this plot, the magnitude but not the sign of the residuals is emphasized. Because the contribution of all residuals is positive, this has the effect of creating a "larger" sample size for the purpose of spotting changes in magnitude. Moreover, the visual influence of any single observation in dictating the pattern is reduced.

Recall the data on the pharmacokinetics of indomethacin in Examples 1.1 and 1.2. Here, n = 11 concentration responses were collected over time on a single subject.


The data are plotted again in Figure 7.3 in Section 7.5; a usual residual plot was given in Figure 1.3 and exhibits a “fan-shaped” pattern that appears roughly symmetric about zero. Note that the residuals have been plotted against the logarithm of predicted values; because the response “tails off” rather quickly, there are many residuals at very small values of the response, so that residuals plotted against predicted values themselves are “bunched up” near zero, making the pattern difficult to assess.

Figure 7.4 in Section 7.5 shows a plot of squared, standardized residuals against log predicted values and shows a “wedge shape” indicating the increase in magnitude across the range.

One could substitute absolute residuals |rj| for squared ones and make similar plots. A purported advantage of squared over absolute residuals themselves is that squaring tends to highlight residuals that are “large” in magnitude and downplay those that are “small,” thus drawing attention to changes in magnitude over the range. A potential drawback is that squaring may artificially accentuate residuals corresponding to “outlying,” anomalous observations. A further drawback is that, although one may gain better ability to spot a trend, any asymmetry of the pattern is obscured. Thus, such plots should not be made in lieu of the usual ones, but rather should be supplementary.

In fact, the squaring operation may be misleading in another way.

IDEA 2: “Refine Idea 1.” McCullagh and Nelder (1989, Section 2.4.2) expand on this idea. Squaring residuals can cause a problem, which we now discuss heuristically.

Suppose the response Y were exactly conditionally (on x) normally distributed. Then, at least approximately, the r_j^2 would have a χ² distribution. Of course, the χ² distribution is a special case of a gamma distribution, and is skewed. Thus, if we plot squared residuals under these conditions, some of the observed pattern in the plot may well be due to expected asymmetry of the r_j^2 and not to underlying nonconstant variance in the response.

The proposed remedy is to consider other transformations of residuals such that the transformed resid- uals would be expected to be “as normal as possible” and hence symmetrically distributed. That is, find a transformation A(·) satisfying this condition. If plots were based on A(·) instead, presumably any observed pattern could be attributed only to nonconstant variance (and not to asymmetry).

ANSCOMBE RESIDUALS: This is based on consideration of so-called Anscombe residuals. If variance depends on the mean through some function g(µ) (e.g., one of the "scaled exponential family" models), then define

A(y) = \int^{y} \frac{d\mu}{g^{2/3}(\mu)}. \qquad (7.2)


The transformation A(·) makes the distribution of the transformed variable A(Y ) “as close to normal as possible.”

• If Y has Poisson-like variance, g(µ) = µ^{1/2}, then A(y) = (3/2)y^{2/3} ∝ y^{2/3}.

• If Y has gamma-like variance, g(µ) = µ, then A(y) = 3y^{1/3} ∝ y^{1/3}.

This may be used in the context of residual plots as follows. We noted that for normally distributed data Y_j, the squared residuals r_j^2 are approximately χ² distributed. The χ² distribution is a gamma distribution; thus, the above suggests that

(r_j^2)^{1/3} = r_j^{2/3}

should be approximately normally distributed.

The suggestion is thus to plot r_j^{2/3} rather than r_j^2 against predicted values (or covariates). These transformed residuals still "visually double the sample size;" moreover, if the data truly are normally distributed, we would expect the pattern they exhibit not to be the result of their asymmetry but rather to reflect nonconstant variance if it exists.
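As an illustration only, a short Python sketch producing both the squared-residual and 2/3-root-residual plots might look like the following; r, sigma2_ols, and yhat are assumed to be the OLS residuals, the estimate σ̂²_OLS, and the predicted values (hypothetical names).

import numpy as np
import matplotlib.pyplot as plt

def transformed_residual_plots(r, sigma2_ols, yhat):
    """Plot squared standardized residuals and 2/3-root residuals against
    log predicted values."""
    sq = r**2 / sigma2_ols           # r_j^2 / sigma_hat^2: visually "doubles" the sample
    root23 = np.abs(r)**(2.0 / 3.0)  # r_j^{2/3}: approximately symmetric if Y is normal
    logyhat = np.log(yhat)

    fig, ax = plt.subplots(1, 2, figsize=(9, 4))
    ax[0].scatter(logyhat, sq)
    ax[0].set_ylabel("Squared std. residual")
    ax[1].scatter(logyhat, root23)
    ax[1].set_ylabel("2/3-root residual")
    for a in ax:
        a.set_xlabel("Log predicted value")
    plt.show()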

Carroll and Ruppert (1988, pp. 30–31) advocate this kind of plot in the case where the original response may be approximately normally distributed. For the indomethacin data, the usual residual plot seems fairly symmetric, suggesting that the normality assumption for the Yj may not be unreasonable. Figure 7.5(a) in Section 7.5 shows the original residual plot, and (d) shows the plot of 2/3-root residuals against log predicted values. Comparing to Figure 7.4, note that the pattern of increase with predicted value is not nearly as dramatic, suggesting that the impression in Figure 7.4 may in part be due to asymmetry of the squared residuals.

This transformation idea may be used in another way; the technique we are about to discuss is the basis for the term Anscombe residual. Here, we form a different kind of residual based on (7.2) and use it to assess the validity of a distributional assumption for the original data Yj that may in turn dictate a particular variance model.

To do this, if one suspects that the data Y_j themselves may not be normally distributed but rather may follow or be closely approximated by a distribution in the "scaled exponential family" class, then one might form residuals on an appropriate transformed scale instead of the usual ones. E.g., continuous data may be skewed at each x_j, so something like a gamma distribution may be a closer representation of the truth than the normal.


To illustrate, suppose that we suspect that the Y_j may be Poisson distributed at each x_j. From (7.2), the "residual"

r_j^* = Y_j^{2/3} - \hat{Y}_j^{2/3}

would be expected to be "close" to normally distributed if the Poisson assumption were valid. However, although the distribution of r_j^* for each j may be "more normal," the variance of the r_j^* across j may not be constant. Now if the Poisson assumption is valid, a "δ-method" Taylor series approximation

A(y) \approx A(\mu) + A'(\mu)(y - \mu), \qquad A'(y) = (d/dy)\, A(y),

yields upon rearrangement var{A(Y)} ≈ var(Y){A'(µ)}², so that in the Poisson case var{A(Y)} ≈ µ(µ^{-1/3})² = µ^{1/3}.

Thus, the suggestion, if one wishes to verify graphically the appropriateness of the Poisson assumption (or at least the Poisson-like variance model), is to plot r_j^*/{Ŷ_j^{1/3}}^{1/2} = r_j^*/Ŷ_j^{1/6} vs. predicted values. If the assumption is reasonable, we would expect to see symmetry about 0, as the r_j^* should be approximately normal, with haphazard scatter about 0, as we have scaled the r_j^* by an estimate of its standard deviation, so that these standardized "residuals" should be of approximately equal magnitude across the range. If such a pattern does not emerge, it may suggest that the original Poisson variance conjecture is not correct, and further investigation is required.

Of course, the same idea could be used with other distributions, e.g. the gamma.
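A minimal Python sketch of this check for the Poisson case, assuming the responses y and predicted values yhat are supplied by the user, is given below; the construction for another distribution would simply swap in the appropriate A(·) and scaling.

import numpy as np
import matplotlib.pyplot as plt

def poisson_anscombe_check(y, yhat):
    """Plot scaled Anscombe 'residuals' (Y_j^{2/3} - Yhat_j^{2/3}) / Yhat_j^{1/6}
    against predicted values as an informal check of a Poisson-like variance model."""
    r_star = y**(2.0 / 3.0) - yhat**(2.0 / 3.0)   # r*_j
    scaled = r_star / yhat**(1.0 / 6.0)           # divide by estimated SD, approx mu^{1/6}

    plt.scatter(yhat, scaled)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("Predicted value")
    plt.ylabel("Scaled Anscombe residual")
    plt.show()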

Distributional considerations are not the only issue one must think about when constructing and interpreting plots. The issue of design also plays a key role. This is most clearly understood by first restricting attention to linear models. Thus, we consider the linear model

E(Y_j | x_j) = x_j^T \beta \qquad (7.3)

first, then generalize to the nonlinear case.

REFINEMENT 2: We now consider the idea of studentization. Consider the linear mean model (7.3) and write

var(Y_j | x_j) = \sigma^2 / w_j

for some values w_j, which we will treat as fixed constants. For the purposes of the following arguments, we will take the perspective that the x_j are fixed constants, as is conventional in this setting.


Thus, we will suppress conditioning in the expressions below. We may of course represent the model in obvious matrix notation as

E(Y) = X\beta, \qquad var(Y) = \sigma^2 W^{-1}.

Here X is the design matrix, and W = diag(w1,...,wn).

We may write

Y = X\beta + \sigma W^{-1/2}\epsilon,

where W^{-1/2} is the diagonal matrix with diagonal elements w_j^{-1/2}. The elements ε_j of ε satisfy

E(ε_j) = 0 and var(ε_j) = 1, of course, which would be true more generally if the ε_j were assumed to be i.i.d.

The OLS estimator is given by β̂_OLS = (X^T X)^{-1} X^T Y, and the vector of OLS residuals is given by

r = (r_1, \ldots, r_n)^T = Y - X\hat{\beta}_{OLS} = \{I_n - X(X^T X)^{-1} X^T\} Y

  = \{I_n - X(X^T X)^{-1} X^T\}\{X\beta + \sigma W^{-1/2}\epsilon\} = \sigma(I_n - H) W^{-1/2}\epsilon, \qquad (7.4)

where H = X(X^T X)^{-1} X^T, an (n × n) matrix with (j, k) element h_{jk}.

The matrix H is usually called the "hat" matrix, because if β is estimated by β̂_OLS and the vector of predicted values Ŷ_OLS, say, is formed, then Ŷ_OLS = Xβ̂_OLS = HY, so that H "puts the hat" on Y.

It is straightforward to show that (7.4) implies that

r_j = \sigma\left( w_j^{-1/2}\epsilon_j - \sum_{k=1}^{n} h_{jk} w_k^{-1/2}\epsilon_k \right).

Under our assumptions, E(rj ) = 0, and we may thus calculate

var(r_j) = \sigma^2 E\left( w_j^{-1/2}\epsilon_j - \sum_{k=1}^{n} h_{jk} w_k^{-1/2}\epsilon_k \right)^2

  = \sigma^2 E\left\{ w_j^{-1}\epsilon_j^2 - 2 w_j^{-1/2}\epsilon_j \sum_{k=1}^{n} h_{jk} w_k^{-1/2}\epsilon_k + \left( \sum_{k=1}^{n} h_{jk} w_k^{-1/2}\epsilon_k \right)^2 \right\}

  = \sigma^2 \left\{ w_j^{-1}(1 - 2h_{jj}) E(\epsilon_j^2) - 2 w_j^{-1/2} \sum_{k \ne j} h_{jk} w_k^{-1/2} E(\epsilon_j\epsilon_k) + \sum_{k=1}^{n} h_{jk}^2 w_k^{-1} E(\epsilon_k^2) + \sum_{k \ne \ell} h_{jk} h_{j\ell} w_k^{-1/2} w_\ell^{-1/2} E(\epsilon_k\epsilon_\ell) \right\}

Using the independence of the ε_j and E(ε_j²) = 1, we obtain

var(r_j) = \sigma^2 \left\{ w_j^{-1}(1 - 2h_{jj}) + \sum_{k=1}^{n} h_{jk}^2 w_k^{-1} \right\} = \sigma^2 \left\{ w_j^{-1}(1 - h_{jj})^2 + \sum_{k \ne j} h_{jk}^2 w_k^{-1} \right\}. \qquad (7.5)


Now consider the case of constant variance, so that wj ≡ 1 for all j. Under this condition, (7.5) becomes

var(r_j) = \sigma^2 \left\{ (1 - h_{jj})^2 + \sum_{k \ne j} h_{jk}^2 \right\}.

Because H is a symmetric, idempotent matrix, 0 ≤ hjj ≤ 1, and

h_{jj} = \sum_{k=1}^{n} h_{jk}^2,

whence it follows that

\sum_{k \ne j} h_{jk}^2 = h_{jj}(1 - h_{jj}).

Thus, we obtain the final result that, under constant variance, we expect

var(r_j) = \sigma^2 (1 - h_{jj}). \qquad (7.6)

IMPLICATION: If all the hjj are of approximately the same magnitude for all j, then var(rj) is approximately constant across j. On the other hand, if the hjj are quite different across j, then var(rj) will be expected to vary. That is, under these conditions, the rj will have nonconstant variance even if the original data do not!

Thus, inspecting plots based on the rj could lead one to conclude erroneously that there is evidence of nonconstant variance even when there is not. This is possible in the event the hjj vary across j.

LEVERAGE: When do the hjj vary? The hjj are called the leverage values corresponding to each design point xj. Loosely speaking, leverage is a measure of how “remote” an observation is from the remaining observations in the “design space.”

The simplest example of this is when x_j is scalar. Figure 7.1 exemplifies the situation. Note that the design point x = 15 is far removed from the rest of the x values, which are in the range from 0 to 5. The figure shows the effect of the placement of an observed response at x = 15. The dashed lines are OLS fits of a straight line to the data sets containing all the responses at x = 0 to 5 along with either one of the two depicted responses at x = 15 and demonstrate the dramatic influence the response at this design point has on the fitted model.

A point such as x = 15 in this example will turn out to have a “large” value of hjj relative to those for the other design points. Such a point is called a high leverage point and has a potentially influential role in determining the fit of the model. In the situation of Figure 7.1, x = 15 almost entirely dictates the fit.


Figure 7.1: A "high leverage" point. [Plot of y versus x.]

RESULT: The magnitude of h_jj in a linear model is dictated by the design. Thus, even when variance really is constant, a design with high leverage points will yield OLS residuals whose variances are nonconstant in a nontrivial way. The observed pattern of r_j in the usual or refined residual plots may appear to reflect nonconstant variance when in reality it is an artifact of the design.

REMEDY: An obvious modification is to calculate the hjj values, which in a linear model depend only on the known values xj and replace the rj in residual plots by the so-called studentized residuals

b_j = \frac{r_j}{\hat{\sigma}_{OLS}(1 - h_{jj})^{1/2}},

which clearly are such that var(σ̂_OLS b_j/σ) = 1. Most linear regression software, such as SAS proc reg, computes studentized residuals automatically.

Hopefully, if variance really is homogeneous, we will not be misled by a pattern that is actually due to design if we use studentized, rather than ordinary, residuals. Thus, the suggestion is to replace r_j in all of the plots we have discussed previously by b_j in the linear case, so that one would plot, for example, b_j rather than r_j/σ̂_OLS and use b_j^{2/3} rather than r_j^{2/3}.
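For concreteness, a minimal Python sketch of these calculations for a linear model is given below; the design matrix X and response y are assumed supplied, and the code simply implements the formulas above (all names hypothetical).

import numpy as np

def studentized_residuals(y, X):
    """OLS fit of a linear model E(Y) = X beta, with hat-matrix leverages and
    studentized residuals b_j = r_j / {sigma_hat (1 - h_jj)^{1/2}}."""
    n, p = X.shape
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_ols
    sigma2 = np.sum(r**2) / (n - p)
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix H = X (X'X)^{-1} X'
    h = np.diag(H)                          # leverages h_jj
    b = r / np.sqrt(sigma2 * (1.0 - h))     # studentized residuals
    return b, h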

EXTENSION: What if the variance is in fact nonconstant? That is, suppose we suspect that variance may not be constant but instead follows a smooth relationship dictated by a variance function g(β, θ, xj). What should we plot to investigate this, taking potential issues of leverage into account?


Cook and Weisberg (1983) suggest the following approach.

Note that, if in reality var(Y_j) = σ²/w_j, then still E(σ̂_OLS b_j) = 0, but now

var(\hat{\sigma}_{OLS} b_j) = E\{r_j^2 (1 - h_{jj})^{-1}\} = \sigma^2 \left\{ w_j^{-1}(1 - h_{jj})^2 + \sum_{k \ne j} h_{jk}^2 w_k^{-1} \right\} (1 - h_{jj})^{-1}. \qquad (7.7)

Suppose we suspect that variance is of the form var(Y_j) = σ² g²(β, θ, x_j), where g(β, θ, x_j) is such that θ is a scalar and

g(\beta, 0, x_j) = 1.

This is satisfied by many popular variance functions; e.g., the power-of-the-mean model.

Note that under this condition, assuming w_j^{-1} = g²(β, θ, x_j), a Taylor series about θ = 0 yields

w_j^{-1} \approx g^2(\beta, 0, x_j) + 2 g^2(\beta, 0, x_j)\, \nu_\theta(\beta, 0, x_j)(\theta - 0)

        \approx 1 + 2\theta\, \nu_\theta(\beta, 0, x_j).

Replacing w_j^{-1} in (7.7) by this expression yields

var(\hat{\sigma}_{OLS} b_j / \sigma) \approx (1 - h_{jj})\{1 + 2\theta\nu_\theta(\beta, 0, x_j)\} + (1 - h_{jj})^{-1} \sum_{k \ne j} h_{jk}^2 \{1 + 2\theta\nu_\theta(\beta, 0, x_k)\}

  \approx (1 - h_{jj}) + (1 - h_{jj})^{-1} \sum_{k \ne j} h_{jk}^2 + 2\theta(1 - h_{jj})\nu_\theta(\beta, 0, x_j) + 2\theta(1 - h_{jj})^{-1} \sum_{k \ne j} h_{jk}^2 \nu_\theta(\beta, 0, x_k)

  \approx 1 + 2\theta(1 - h_{jj})\nu_\theta(\beta, 0, x_j) + 2\theta(1 - h_{jj})^{-1} \sum_{k \ne j} h_{jk}^2 \nu_\theta(\beta, 0, x_k),

where we have used the fact that \sum_{k \ne j} h_{jk}^2 = h_{jj}(1 - h_{jj}).

Under the further assumption that the h_{jk}, j ≠ k, are "small," Cook and Weisberg (1983) approximated this as

var(\hat{\sigma}_{OLS} b_j / \sigma) \approx 1 + 2\theta\, \nu_\theta(\beta, 0, x_j)(1 - h_{jj}).

RESULT: This approximation suggests plotting b_j² versus ν_θ(β̂_OLS, 0, x_j)(1 − h_jj) (ignoring the fact that σ̂²_OLS is random) as a diagnostic for nonconstant variance thought to have the form σ² g²(β, θ, x_j). This plot should offer protection against design-induced residual patterns that may mislead the analyst if the variance really is constant, and it should have nonzero slope approximately equal to 2θ in the event that the variance really is nonconstant with variance function g(β, θ, x_j).

Thus, not only does this plot allow for detection of nonconstant variance, but it also gives information on the relevance of a particular model.


• One could construct the plot for different candidate models and compare. The plot that appears most like a linear relationship might be adopted on empirical grounds if no variance model is naturally suggested by subject-matter considerations. The plot also gives information on the likely value of θ.

• Although Cook and Weisberg (1983) considered only the model g(β, θ, x_j) = exp(θx_j), other models could also be considered. For example, the model g(β, θ, x_j) = exp(θ x_j^T β) leads to ν_θ(β, 0, x_j) = x_j^T β, and g(β, θ, x_j) = (x_j^T β)^θ leads to ν_θ(β, 0, x_j) = log(x_j^T β). In the latter case of the power-of-the-mean variance function, then, the suggestion would be to plot b_j² versus (1 − h_jj) log Ŷ_OLS; a sketch of this diagnostic follows the list.
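Purely as an illustration, the Python sketch below constructs this diagnostic, given studentized residuals b, leverages h, and the quantity nu0 = ν_θ(β̂_OLS, 0, x_j) (log predicted values for the power-of-the-mean case); all names are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

def cook_weisberg_plot(b, h, nu0):
    """Plot b_j^2 against nu_theta(beta, 0, x_j)(1 - h_jj). Under the candidate
    variance model, the points should scatter about a line with slope roughly
    2*theta; for g = exp(theta*x), pass nu0 = x, and for the power-of-the-mean
    model, nu0 = log(yhat)."""
    xplot = nu0 * (1.0 - h)
    plt.scatter(xplot, b**2)
    slope, intercept = np.polyfit(xplot, b**2, 1)   # crude slope, roughly 2*theta
    xs = np.linspace(xplot.min(), xplot.max(), 50)
    plt.plot(xs, intercept + slope * xs, linestyle="--")
    plt.xlabel("nu_theta(beta, 0, x_j) * (1 - h_jj)")
    plt.ylabel("Squared studentized residual")
    plt.show()
    return slope / 2.0   # crude impression of theta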

EXTENSION TO NONLINEAR MODELS: One may extend the notions of leverage and studentization, at least approximately, to nonlinear mean models as follows. Continuing to regard the xj as fixed constants, so suppressing conditioning, suppose we have E(Yj)= f(xj, β). As before, define

X(\beta) = \begin{pmatrix} f_\beta^T(x_1, \beta) \\ \vdots \\ f_\beta^T(x_n, \beta) \end{pmatrix}, \qquad
f(\beta) = \begin{pmatrix} f(x_1, \beta) \\ \vdots \\ f(x_n, \beta) \end{pmatrix}.

By a linear approximation for β "close to" β̂_OLS, we may write the vector of residuals as

r = Y - f(\hat{\beta}_{OLS}) \approx Y - f(\beta) - X(\beta)(\hat{\beta}_{OLS} - \beta),

and we know that β̂_OLS satisfies, again by a linear approximation,

0 = X^T(\hat{\beta}_{OLS})\{Y - f(\hat{\beta}_{OLS})\}

  \approx X^T(\beta)\{Y - f(\beta)\} + [\, -X^T(\beta)X(\beta) + \partial/\partial\beta\{X^T(\beta)\}\{Y - f(\beta)\} \,](\hat{\beta}_{OLS} - \beta).

Ignoring the last term, as it involves the product {Y − f(β)}(β̂_OLS − β), which should be "small" relative to the others, we obtain

(\hat{\beta}_{OLS} - \beta) \approx \{X^T(\beta)X(\beta)\}^{-1} X^T(\beta)\{Y - f(\beta)\},

which is just a result we have seen previously (e.g., Chapter 3).

Combining, we arrive at the approximation

r \approx [\, I_n - X(\beta)\{X^T(\beta)X(\beta)\}^{-1} X^T(\beta) \,]\{Y - f(\beta)\}

  = \{I_n - H(\beta)\}\{Y - f(\beta)\}.


Here, H(β) = X(β){X^T(β)X(β)}^{-1} X^T(β) is the approximate "hat matrix." Thus, if var(Y_j) = σ²/w_j, by analogy to the linear case, we have approximately that

r \approx \{I_n - H(\beta)\} W^{-1/2}\epsilon,

where ε is defined in the obvious way.

The implication is that one may regard the diagonal elements of H(β) as approximate "leverage values." This makes some intuitive sense: in a nonlinear model, the ramifications of "design" will be felt not only through the actual design points, but also through the behavior of the function f at those points.

Unlike a linear model, a nonlinear model allows the changes in f at different x_j settings to be different, as the derivative of f depends on both x_j and β in general. Consequently, depending on how f changes in different parts of the design space, different observations will exert different amounts of influence on the fit. Of course, here, H(β) depends on β, so for practical implementation, we would need to substitute a likely value for β to obtain approximate leverage values; e.g., β̂_OLS.

RESULT: In the nonlinear case, one may apply the same ideas as in the linear case to take into account the effects of leverage, using as approximate leverage values the quantities ĥ_jj, the diagonal elements of the matrix H(β̂_OLS).
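A minimal sketch of this computation in Python appears below, approximating the Jacobian X(β̂_OLS) by finite differences; the mean function f, covariates x, and estimate beta_hat (a numpy array) are assumed supplied and the names are hypothetical.

import numpy as np

def nonlinear_leverages(x, beta_hat, f, eps=1e-6):
    """Approximate leverages for a nonlinear mean model: the diagonal of
    H(beta) = X(beta){X'(beta)X(beta)}^{-1}X'(beta), where X(beta) is the
    n x p Jacobian of f, computed here by forward finite differences."""
    f0 = f(x, beta_hat)
    n, p = len(f0), len(beta_hat)
    X = np.empty((n, p))
    for k in range(p):
        step = np.zeros(p)
        step[k] = eps
        X[:, k] = (f(x, beta_hat + step) - f0) / eps   # column k of X(beta_hat)
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return np.diag(H)                                  # approximate leverages h_jj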

In practice, for linear or nonlinear models, it is often the case that analysts ignore the Cook-Weisberg correction and plot b_j² versus Ŷ_j or log(Ŷ_OLS), for example (as diagnostics for the exponential and power models, respectively).

OTHER PLOTS: Carroll and Ruppert (1988, Chapters 2 and 3) advocate plotting other transformations of studentized residuals. For example,

• if g(β, θ, x_j) = exp{θ f(x_j, β)}, then

log[{var(Y_j)}^{1/2}] = log σ + θ f(x_j, β).

Thus, the suggestion is to plot log |r_j| or log |b_j| versus Ŷ_OLS = f(x_j, β̂_OLS), using the absolute residuals as a proxy for {var(Y_j)}^{1/2}.

• Similarly, if g(β, θ, x_j) = f^θ(x_j, β), then

log[{var(Y_j)}^{1/2}] = log σ + θ log f(x_j, β),

and the suggestion is to plot log |r_j| or log |b_j| versus log(Ŷ_OLS); a sketch of this plot follows the list.
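A small Python sketch of this plot for the power model, under the same hedged assumptions as before (studentized residuals b and predicted values yhat supplied by the user), is:

import numpy as np
import matplotlib.pyplot as plt

def log_abs_residual_plot(b, yhat):
    """Plot log|b_j| against log predicted values; under the power-of-the-mean
    model the points should scatter about a line with slope roughly theta."""
    lx, ly = np.log(yhat), np.log(np.abs(b))
    plt.scatter(lx, ly)
    theta_crude, intercept = np.polyfit(lx, ly, 1)   # crude slope estimate of theta
    xs = np.linspace(lx.min(), lx.max(), 50)
    plt.plot(xs, intercept + theta_crude * xs, linestyle="--")
    plt.xlabel("Log predicted value")
    plt.ylabel("log |studentized residual|")
    plt.show()
    return theta_crude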


Figure 7.2: The "data density" issue. [Plot of absolute residuals |r| versus predicted value.]

A PRACTICAL ISSUE FOR ALL PLOTS: "Data density." Varying degrees of data density along the horizontal axis for any plot may give a misleading impression. For example, consider the plot of absolute residuals versus predicted values in Figure 7.2. One may be tempted to interpret this plot as having some evidence of a "wedge shape," as the residuals in the range 10 to 15 are mostly small while the range 20 to 30 seems to contain many that are of great magnitude. Such a plot might tempt the analyst to suspect nonconstant variance. However, because the first data segment is so much sparser than the second, it is not surprising that we might end up seeing only a few "large" residuals in the first segment by chance, even if the data really do have constant variance. Thus, the varying degrees of density of observations may yield "illusions" of patterns that do not reflect a real phenomenon at all. Carroll and Ruppert (1988, p. 154) discuss an application where this issue arises.

SUMMARY: Plots based on residuals may be useful for both detection and modeling of nonconstant variance. In the latter case, the presence of nonconstant variance may be acknowledged a priori by virtue of the application, and "default" variance models may even be available. In this situation, the plots may be useful for verifying the relevance of the model or for identifying departures from it that may call for a different one. It is important that the data analyst be well aware that interpretation of the plots is somewhat of an "art form" owing to their approximate and ad hoc nature.


7.3 Did it work?

Suppose we construct diagnostic plots, review them, identify nonconstant variance, and select a model. We then refit the model taking this into account, e.g., by GLS-PL or other method. Can we check graphically for evidence that the assumed variance model accounts adequately for the form of the nonconstant variance?

IDEA: Construct the same plots using weighted residuals that take into account the form of the variance. That is, if the chosen variance model is σ² g²(β, θ, x_j), and we estimate β (and perhaps an unknown θ, too) by a method that takes nonconstant variance into account, the (standardized) weighted residuals are

\frac{wr_j}{\hat{\sigma}} = \frac{Y_j - f(x_j, \hat{\beta})}{\hat{\sigma}\, g(\hat{\beta}, \hat{\theta}, x_j)},

where β̂, θ̂, and σ̂ are the estimates. These standardized, weighted residuals should have the same properties as standardized ordinary residuals if constant variance were valid, as they are weighted for each j by the appropriate factor.

A studentized version of weighted residuals is possible, defined by analogy to the unweighted case, to account for “leverage.” These may be constructed by considering the “transformed” problem based on the estimated weights. That is, define

\hat{W} = diag\{g^{-2}(\hat{\beta}, \hat{\theta}, x_1), \ldots, g^{-2}(\hat{\beta}, \hat{\theta}, x_n)\},

and let

Y^* = \hat{W}^{1/2} Y, \qquad X^*(\beta) = \hat{W}^{1/2} X(\beta), \qquad f^*(\beta) = \hat{W}^{1/2} f(\beta).

Consider the particular case of GLS estimation of β. Then β̂ satisfies the estimating equation

0 = X^{*T}(\hat{\beta})\{Y^* - f^*(\hat{\beta})\},

and, moreover, the vector of weighted residuals is

wr = (wr_1, \ldots, wr_n)^T = Y^* - f^*(\hat{\beta}).

By an argument analogous to that leading to the approximate form of r for the nonlinear model in Section 7.2, we may obtain

wr \approx \{I_n - H^*(\beta)\}\{Y^* - f^*(\beta)\},

where H^*(β) = X^*(β){X^{*T}(β)X^*(β)}^{-1} X^{*T}(β), the approximate "hat matrix." Studentized weighted residuals may be constructed in the obvious way (on the "transformed" scale), where β̂ would be substituted.


If the approach to taking account of nonconstant variance “worked,” one would expect to see plots that show no systematic patterns.
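As a rough illustration of these weighted diagnostics, the Python sketch below computes standardized and studentized weighted residuals under stated assumptions: f and g are the user's mean and variance functions, beta_hat, theta_hat, and sigma_hat the estimates, and X_beta the n x p Jacobian of f at beta_hat; all names are hypothetical.

import numpy as np

def weighted_residual_diagnostics(y, x, f, g, beta_hat, theta_hat, sigma_hat, X_beta):
    """Standardized weighted residuals wr_j / sigma_hat and studentized versions
    based on the 'transformed' problem."""
    gvals = g(beta_hat, theta_hat, x)                   # g(beta_hat, theta_hat, x_j)
    wr = (y - f(x, beta_hat)) / gvals                   # weighted residuals wr_j
    std_wr = wr / sigma_hat                             # standardized weighted residuals

    Xstar = X_beta / gvals[:, None]                     # X*(beta) = W^{1/2} X(beta)
    Hstar = Xstar @ np.linalg.solve(Xstar.T @ Xstar, Xstar.T)
    hstar = np.diag(Hstar)                              # leverages h*_jj
    stud_wr = wr / (sigma_hat * np.sqrt(1.0 - hstar))   # studentized weighted residuals
    return std_wr, stud_wr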

7.4 Restricted maximum (pseudo) likelihood

In Chapter 6, we discussed estimating equation approaches to estimation of variance function parameters. As was evident from that discussion, although such equations may be based on different transformations of absolute residuals, the most popular approach is to use squared residuals, and hence solve a quadratic estimating equation, as in the PL method. This is in part driven by the fact that the estimating equation will be unbiased by construction; moreover, squared residuals seem to be a natural choice for estimating variance parameters.

The foregoing discussion suggests that plots based on ordinary residuals may be misleading due to failure to account for “leverage.” An obvious concern is thus whether or not methods for estimation of variance parameters based on ordinary residuals might also be subject to the same problem. This is one way to motivate consideration of a modification of the PL technique known as restricted maximum likelihood, which we might more aptly term restricted pseudolikelihood. The usual abbreviation, which we will adopt, is REML.

Recall that, for some fixed value β̂ of β, the PL estimators for σ and θ in the general model (7.1) solve

\sum_{j=1}^{n} \left[ \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} - 1 \right] \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}
= \sum_{j=1}^{n} \left( \frac{wr_j^2}{\sigma^2} - 1 \right) \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix} = 0,

where here wr_j depends on the unknown θ to be estimated.

Now, from the arguments of the last section, we have that

wr \approx \{I_n - H^*(\hat{\beta})\}\{Y^* - f^*(\hat{\beta})\},

so that var(wr) ≈ σ²{I_n − H^*(β̂)}, where it is understood here that H^*(β̂) depends on the unknown θ. Thus var(wr_j) ≈ E(wr_j²) ≈ σ²(1 − ĥ^*_jj), where ĥ^*_jj is the jth diagonal element of H^*(β̂), which also depends on the unknown θ.

Rewrite the estimating equation as

\sum_{j=1}^{n} \left[ \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} \right] \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}
= \sum_{j=1}^{n} \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}. \qquad (7.8)

Note that, if β̂ were replaced by the truth, then the expectation of the left hand side of (7.8) would be exactly equal to the right hand side.


Thus, solving the PL equation in σ and θ may be viewed as equating a function of weighted squared deviations to its expectation, ignoring the fact that β must be replaced by an estimator in practice.

What if we were not to ignore the fact that β̂ has been substituted? From above, the left hand side of (7.8) would have approximate expectation

\sum_{j=1}^{n} (1 - \hat{h}^*_{jj}) \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix},

depending on the "leverage values." This observation suggests a modification of the PL estimating equation to "take account of leverage;" namely, instead of solving (7.8), one would solve

\sum_{j=1}^{n} \left[ \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} \right] \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}
= \sum_{j=1}^{n} (1 - \hat{h}^*_{jj}) \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}. \qquad (7.9)

FACT: Because H^*(β) is a symmetric, idempotent matrix, assuming that X^*(β) has rank p (and so X^{*T}(β)X^*(β) is invertible), then H^*(β) has rank p, and it is true that

trace\, H^*(\beta) = \sum_{j=1}^{n} h^*_{jj} = p,

where the h^*_{jj} are the diagonal elements.

Using this, it is straightforward to show that (7.9) may be written as

\sum_{j=1}^{n} \left[ \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} \right] \begin{pmatrix} 1 \\ \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}
= \begin{pmatrix} n - p \\ \sum_{j=1}^{n} (1 - \hat{h}^*_{jj})\, \nu_\theta(\hat{\beta}, \theta, x_j) \end{pmatrix}, \qquad (7.10)

from which it follows that σ̂² satisfies

\hat{\sigma}^2 = (n - p)^{-1} \sum_{j=1}^{n} \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{g^2(\hat{\beta}, \theta, x_j)}.

RESULT: Solving (7.10) instead of the usual PL estimating equation for σ² "automatically" yields the "bias-adjusted" estimator for σ² we have discussed previously, where the division is by (n − p) rather than n.

Recall from Section 3.5 that the estimator using the divisor n rather than (n − p) is often viewed as failing to account for estimation of β.

• Because θ is also a variance parameter, like σ², we might expect that the usual PL estimator for θ might be subject to a similar kind of bias.


• It appears that solving (7.10) instead might somehow result in an estimator that is less biased. In practice, this is indeed the case! The resulting estimators for σ and θ obtained by solving (7.10) are referred to as the REML estimators.

IMPLEMENTATION: It turns out that solving (7.10) in (σ, θ^T)^T for fixed β (fixed at β̂, for example) is equivalent to maximizing a certain objective function, just as PL is equivalent to maximizing the normal loglikelihood with β held fixed. Recall that, evaluated at β̂, the PL objective function (normal loglikelihood, disregarding constant terms) is given by

PL(\hat{\beta}, \theta, \sigma) = -(1/2) \sum_{j=1}^{n} \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} - n \log\sigma - \sum_{j=1}^{n} \log g(\hat{\beta}, \theta, x_j).

It is possible to show that the objective function corresponding to (7.10) turns out to be

PL(\hat{\beta}, \theta, \sigma) + p \log\sigma - (1/2) \log |X^T(\hat{\beta})\, W(\hat{\beta}, \theta)\, X(\hat{\beta})|, \qquad (7.11)

where W(β, θ) = diag{g^{-2}(β, θ, x_1), ..., g^{-2}(β, θ, x_n)}, where we have added the "θ" argument to make clear the dependence on θ, and X(β) is as defined previously. The last term in (7.11) involves the determinant of X^T(β)W(β, θ)X(β), evaluated at β̂.

• It is not at all obvious that maximizing (7.11) in (σ, θ^T)^T is equivalent to solving (7.10). This may in fact be shown by some clever matrix manipulations and is left as an exercise.

• In particular, letting N(β, θ) = X^T(β)W(β, θ)X(β), it may be shown that taking the derivative of (7.11) with respect to θ and setting it equal to 0 yields the equation

\sum_{j=1}^{n} \left[ \frac{\{Y_j - f(x_j, \hat{\beta})\}^2}{\sigma^2 g^2(\hat{\beta}, \theta, x_j)} - 1 \right] \nu_\theta(\hat{\beta}, \theta, x_j) = (1/2)\, \partial/\partial\theta \, \log |N(\hat{\beta}, \theta)|. \qquad (7.12)

(Of course, taking the derivative with respect to σ gives the bias-adjusted estimator above.)

The equivalence follows by showing that the right hand side of (7.12) is in fact equal to

-\sum_{j=1}^{n} \nu_\theta(\hat{\beta}, \theta, x_j)\, \hat{h}^*_{jj}.

• The fact that solving (7.10) is equivalent to maximizing (7.11) may be used to advantage in practice. It is possible to go through the same type of argument leading up to the "trick" on page 130 to derive a method for estimating θ using software. It is again not possible to use just any such software (e.g., SAS proc nlin), because of the complex form of the "regression model." But it is not too difficult to write a program for general variance models (a sketch of a direct numerical evaluation of (7.11) follows this list). The details of this implementation approach are left as an exercise.
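Purely as an illustration, and not the "trick"-based implementation referred to above, one can simply evaluate (7.11) and maximize it numerically. The Python sketch below assumes user-supplied mean and variance functions f and g, data y and x, a fixed β̂ (beta_hat), and the Jacobian X_beta of f at β̂; all names are hypothetical.

import numpy as np
from scipy.optimize import minimize

def neg_reml_objective(par, y, x, f, g, beta_hat, X_beta):
    """Negative of the REML (restricted pseudolikelihood) objective (7.11) for
    fixed beta_hat, with par = (log sigma, theta)."""
    log_sigma, theta = par[0], par[1]
    sigma = np.exp(log_sigma)
    gvals = g(beta_hat, theta, x)                    # g(beta_hat, theta, x_j)
    resid = y - f(x, beta_hat)
    n, p = X_beta.shape
    pl = (-0.5 * np.sum(resid**2 / (sigma**2 * gvals**2))
          - n * np.log(sigma) - np.sum(np.log(gvals)))
    Xstar = X_beta / gvals[:, None]                  # W^{1/2}(theta) X(beta_hat)
    _, logdet = np.linalg.slogdet(Xstar.T @ Xstar)   # log |X^T W X|
    return -(pl + p * np.log(sigma) - 0.5 * logdet)

# Hypothetical use, given user-supplied y, x, f, g, beta_hat, X_beta:
# fit = minimize(neg_reml_objective, x0=np.array([0.0, 1.0]),
#                args=(y, x, f, g, beta_hat, X_beta), method="Nelder-Mead")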


TERMINOLOGY: The terminology “restricted maximum likelihood” arises from the perspective that the objective function (7.11) has the form of the usual normal loglikelihood plus a “penalty term” that has the effect of imposing a restriction on the solution. From the above developments, the penalty term for our model has the effect of taking into account “leverage,” thus incorporating the effect of having to estimate β rather than knowing it using the given design and mean model. Basically, the result is to use “studentized” rather than “ordinary” residuals.

7.5 Examples

We now consider two examples to illustrate the use of residual plots for detecting and modeling non- constant variance.

EXAMPLE 7.1 Pharmacokinetics of indomethacin. Recall the data on the pharmacokinetics of indomethacin discussed in Examples 1.1 and 1.2. The data are concentrations Y_j (µg/ml) of indomethacin taken at n = 11 time points x_j (hours) post-dose (at time 0).

The model we consider is the biexponential parameterized to enforce positivity:

f(x_j, \beta) = e^{\beta_1} \exp(-e^{\beta_2} x_j) + e^{\beta_3} \exp(-e^{\beta_4} x_j).

In this application, it is well established that variance tends to increase with the level of the response. A popular model for representing variance is the power-of-the-mean model σ² f^{2θ}(x_j, β). Often, θ = 1.0 is a reasonable choice, yielding constant coefficient of variation; this value for θ is sometimes adopted by default with no validation. Sometimes, however, other values of θ provide a better characterization.

Figure 7.3 shows the raw data with the OLS fit and a GLS-PL fit for which θ̂ = 0.82 (see Section 6.8 for full details). Note that the fits themselves are discernibly different. This is actually not terribly surprising. The "tail" of the curve at larger time points is determined by only a few observations. The OLS fit treats these as being of the same quality as those at earlier times, while the GLS fit regards them as more precise. Thus, the latter fit places more emphasis on these later observations for determining the fit. Note that the GLS fit goes through the last (presumably most precise) observation, while the OLS fit seems to compromise over where to place the fit at these later observations.

Figure 7.4 shows a plot of squared, standardized ordinary OLS residuals versus the logarithm of predicted values. The plot shows a pronounced “wedge shape” suggesting a rather severe increase in variance with level of the response. The evidence strongly supports the contention that variance is not constant.


Figure 7.3: Concentration-time data for a subject receiving intravenous indomethacin at time 0. The solid line is the OLS fit and the dashed line is the GLS fit with θ estimated. [Axes: concentration (mcg/ml) versus time (hours).]

Of course, as discussed previously, this pattern may in part be due to asymmetry of the distribution of squared residuals.

Figure 7.5 shows several residual plots. Panels (a) and (b) show the “usual” plot of residuals versus predicted values, where that in (b) replaces the ordinary residuals by studentized versions. The pattern is similar in both plots; note, however, that the magnitudes are somewhat different for some observations, reflecting the adjustment for leverage. The pattern appears fairly symmetric in these plots, especially (b), demonstrating that the common assumption of approximate normality may not be unreasonable for pharmacokinetic data.

Panel (c) shows the logarithms of the absolute studentized residuals log |b_j| and appears to follow an approximate linear trend, supporting the contention that the power model is reasonable. A simple linear regression fit to the observations in (c) gives a crude estimate of θ as the slope, equal to 0.42. Panel (d) shows the 2/3-root studentized residuals b_j^{2/3} versus log predicted values, and shows a pattern that is "wedge-shaped" but not quite as profound as that in Figure 7.4. Presumably, this reflects the fact that the residuals on this scale are more symmetrically distributed, so the pattern is reflecting only nonconstant variance and not asymmetry.


Figure 7.4: Plot of squared, standardized residuals r_j²/σ̂²_OLS versus log predicted values.

Figure 7.6 shows the same plots as in Figure 7.5, but applied to the weighted residuals following the GLS- PL fit of the power model. In all panels, the pattern is “haphazard,” suggesting that weighting according to this variance model takes appropriate account of the nonconstant variance. The PL estimate of θ is 0.82, which is close to 1.0. This estimate is likely preferable to that found by the simple linear regression applied to Figure 7.5(c), as will be made clear by the theoretical developments in Chapter 12.

An important implication of “getting the variance right” may be demonstrated in this example by considering estimation of the terminal half-life, a parameter of great physical interest to pharmacologists. The terminal half-life is the time that it takes the mean response in the “second phase” of the curve to decrease by half and is useful in determining appropriate dosing regimens. The terminal half-life is

given by log 2/e^{β_4} hours here. Substituting the estimate of β_4 yields an estimated half-life of 3.13 hours based on the OLS fit and 3.96 hours based on the GLS-PL fit. The difference in point estimates is nearly one hour, which in a clinical sense is quite a big difference, as a difference of this magnitude could lead to establishing very different dosing regimens for a drug that is eliminated from the system rapidly, as is indomethacin. Of course, we have not yet discussed how to construct standard errors for these point estimates, so whether or not this difference is of importance is not clear. The point estimates do, however, suggest the potential for misleading interpretations if variance is not taken into appropriate account.
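To make the half-life arithmetic concrete, the short Python sketch below applies t_half = log 2 / e^{β_4}; the β_4 values used are simply backed out from the half-lives quoted above (3.13 and 3.96 hours) for illustration and are not reported estimates.

import numpy as np

# Terminal half-life t_half = log(2) / exp(beta4). The beta4 values below are
# implied by the half-lives reported in the text, purely for illustration.
for label, t_half in [("OLS", 3.13), ("GLS-PL", 3.96)]:
    beta4 = np.log(np.log(2.0) / t_half)          # implied estimate of beta4
    print(label, round(beta4, 3), round(np.log(2.0) / np.exp(beta4), 2))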


Figure 7.5: Residual plots based on the OLS fit to the indomethacin data. (a) Usual plot of residuals vs. log predicted values; (b) Studentized residuals vs. log predicted values; (c) log studentized residuals vs. log predicted values; (d) 2/3-root studentized residuals vs. log predicted values.


EXAMPLE 7.2 Oxidation of benzene. These data are also discussed by Carroll and Ruppert (1988, Section 2.8). An experiment was conducted to determine the relationship between the initial rate of oxidation of benzene over a vanadium oxide catalyst at three different reaction temperatures and several benzene and oxygen concentrations. In particular, n = 54 observations on the following are available. For the jth observation,

Y_j = initial rate of oxidation (disappearance) of benzene (10^8 gmole/g/sec)

x_{j1} = oxygen concentration (10^4 gmole/L)

x_{j2} = benzene concentration (10^4 gmole/L)

x_{j3} = 2000(1/T − 1/648), where T is the absolute temperature in degrees Kelvin

x_{j4} = moles oxygen consumed per mole benzene.

Thus, x_j = (x_{j1}, x_{j2}, x_{j3}, x_{j4})^T.

The model for this reaction is the steady-state adsorption model

\frac{100\,\alpha_1\alpha_2}{\alpha_1 x_2^{-1} \exp(\alpha_4 x_3/2000) + \alpha_2 x_1^{-1} \exp(\alpha_3 x_3/2000)}.


Figure 7.6: Residual plots based on the GLS fit to the indomethacin data. (a) Usual plot of weighted residuals vs. log predicted values; (b) Studentized weighted residuals vs. log predicted values; (c) log studentized weighted residuals vs. log predicted values; (d) 2/3-root studentized weighted residuals vs. log predicted values.


Here, the parameters are α_1 = A_1 exp{−ΔE_1/(R_g T_0)}, α_2 = A_2 exp{−ΔE_2/(R_g T_0)}, α_3 = ΔE_1/R_g, and α_4 = ΔE_2/R_g, where T_0 = 648 degrees Kelvin, A_1 and A_2 are constants, ΔE_1 and ΔE_2 are activation energies, and R_g is the gas constant. Background on the scientific considerations underlying this kinetic model is given in Pritchard, Downie, and Bacon (1977).

The objective of an analysis is to estimate the parameters of this model in order to characterize the rate of oxidation of benzene.

It turns out that, computationally, a reparameterization of the kinetic model is more stable. This reparameterization is given by

f(x, \beta) = \{\beta_1 x_4 x_1^{-1} \exp(\beta_3 x_3) + \beta_2 x_2^{-1} \exp(\beta_4 x_3)\}^{-1}.

Of course, even this parameterization is highly nonlinear in the unknown parameters β = (β_1, β_2, β_3, β_4)^T. Note that the model in either parameterization is a physical, theoretical one dictated by scientific considerations; thus, the parameters or transformations of them have physical interpretations.


Figure 7.7: Raw benzene data. (a) Rate of oxidation vs. oxygen concentration; (b) rate of oxidation vs. benzene concentration; (c) rate of oxidation vs. transformed temperature; (d) rate of oxidation vs. moles oxygen consumed per mole benzene.


Like Carroll and Ruppert (1988), we have deleted observation 38, which these authors found to be an extreme outlier that causes problems for PL estimation (recall our discussion of potential sensitivity to outliers for the quadratic PL method in Chapter 6). Although we deleted this observation in the fitting, we have included it in Figure 7.7, which shows the raw data including this observation, with plots of the response versus each covariate. The plots suggest informally that variance is not constant across the range of response and appears to change with changing values of the covariates.

Carroll and Ruppert (1988) and Pritchard et al. (1977) found that a variance model where variance changes as a function of the mean response is a reasonable characterization. The former authors considered the power-of-the-mean variance model with power parameter θ. Table 7.1 summarizes an OLS fit of the mean model and the GLS-PL fit (C = ∞). The estimate of the power parameter, θ̂ = 1.15, seems to support the contention of nonconstant variance. Note that the meaning of σ is different in each fit; for OLS, it is the estimate of the assumed common standard deviation of the response, while for GLS, it is the scale factor. From the table, the failure to weight the observations appropriately to account for nonconstant variance seems to have a nontrivial effect on the numerical estimates of the parameters and the assessments of their precision (standard errors).


Table 7.1: Results for OLS and GLS-PL fits to the benzene data. The method used to calculate the standard errors (in parentheses) will be discussed in Chapter 9.

Parameter    OLS (SE)       GLS-PL (SE)
β1           0.97 (0.10)    0.86 (0.04)
β2           3.20 (0.20)    3.45 (0.13)
β3           7.33 (1.14)    5.99 (0.61)
β4           5.02 (0.64)    5.76 (0.41)
σ            0.56           0.09
θ            –              1.15

Figures 7.8 and 7.9 each show several residual plots based on these fits; observation 38 is included here and appears as the most extreme positive OLS residual and the second-most extreme positive GLS-PL residual. The OLS residual plots show convincing evidence of nonconstant variance, and that in Figure 7.8(c) seems to support the power variance model.

The GLS-PL residual plots suggest that the nonconstant variance is taken into adequate account by the fitted variance model.

An interesting feature of both the OLS and GLS-PL residual plots is that the two observations with the smallest predicted values have larger-than-expected residuals; the residuals for both observations are above the "zero" line in both cases and so do not seem to fit the pattern one would expect to see. This may indicate a misspecification of the variance model. The power model var(Y_j|x_j) = σ² f^{2θ}(x_j, β) of course supposes that variance is small where the mean response is small. Here, however, it seems that a few observations with small mean response have variance larger than that represented by this model. Perhaps another "component of variation" is present at very low levels of the response, which might suggest the alternative model

var(Y_j | x_j) = \sigma^2 \{\theta_1 + f^{2\theta_2}(x_j, \beta)\}.

Another explanation for this phenomenon is that the mean model is not a good fit across the entire range of the response. Perhaps the theoretical behavior it represents breaks down for small response values or at the settings of the covariates corresponding to these two observations. From this perspective, the large residuals may be a consequence of failure of the model to “center” the response appropriately.


Figure 7.8: Residual plots based on the OLS fit to the benzene data. (a) Usual plot of residuals vs. log predicted values; (b) Studentized residuals vs. log predicted values; (c) log studentized residuals vs. log predicted values; (d) 2/3-root studentized residuals vs. log predicted values.


Still another possibility is that this feature is simply a matter of chance. To pursue any of these explanations further would require access to subject-matter expertise that would help to determine which is most plausible. Note, however, that the residual plots, in addition to highlighting the presence and nature of nonconstant variance, also may be valuable for bringing such potential anomalies to the attention of the data analyst for further consideration.


Figure 7.9: Residual plots based on the GLS fit to the benzene data. (a) Usual plot of weighted residuals vs. log predicted values; (b) Studentized weighted residuals vs. log predicted values; (c) log studentized weighted residuals vs. log predicted values; (d) 2/3-root studentized weighted residuals vs. log predicted values.

