
J. Japan Statist. Soc. Vol. 21 No. 1 1991 1-11

LEVERAGE POINTS IN NONLINEAR REGRESSION MODELS

Yasuto Yoshizoe*

Received February, 1989. Revised July, 1990. Accepted September, 1990.
* Rissho University, Faculty of Economics. The author wishes to express his thanks to Kei Takeuchi, Peter Huber, John Pratt, and David Hoaglin for their valuable comments. This research was supported in part by Nihon Keizai Kenkyuu Zaidan.

We give a general definition of the leverage point in the general nonlinear regression model, including the linear model. In particular, the definition can be applied to binary regression models, such as the logit and the probit models. This approach makes clear when and why the ordinary definition of leverage points, based on the diagonal elements of the hat matrix, makes sense.

Key words: leverage point, influence function, empirical influence function, nonlinear regression, binary regression.

1. Introduction

Consider the linear regression model $y = X\beta + u$, where $y$ and $u$ are the $n$-dimensional vectors of the response and error terms, $X$ is an $n \times p$ matrix of the explanatory variables or carriers, and $\beta$ is a $p$-dimensional vector of unknown parameters. We further assume that each component of $u$ is independently distributed as $N(0, \sigma^2)$. The maximum-likelihood method (which is equivalent to the least-squares method in this case) gives the estimate of $\beta$, $\hat\beta = (X^T X)^{-1} X^T y$, from which the fitted $\hat y$ is defined by $\hat y = X\hat\beta = Hy$, where

(1.1) $H = (h_{ij}) = X (X^T X)^{-1} X^T$

is the hat matrix (so called because it puts a hat on $y$). As is well known, $H$ is symmetric and idempotent; its diagonal element $h_{ii}$, which is often written as $h_i$, satisfies $0 \le h_i \le 1$; and the trace of $H$ is given by

(1.2) $\operatorname{tr} H = \sum_{i=1}^n h_i = p .$
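For concreteness, a minimal numerical sketch of Eqs. (1.1) and (1.2), assuming a small simulated design matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
# Simulated carriers, including a constant column.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat matrix H = X (X^T X)^{-1} X^T of Eq. (1.1).
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)  # leverages h_i

assert np.allclose(H, H.T) and np.allclose(H, H @ H)  # symmetric, idempotent
assert np.all((h >= 0) & (h <= 1))                    # 0 <= h_i <= 1
assert np.isclose(np.trace(H), p)                     # tr H = p, Eq. (1.2)
```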

We first review the notion of leverage points in the linear regression model given by Hoaglin and Welsch (1978), who were the first to use the phrase leverage point in the literature:

For the data analyst, the element $h_{ij}$ of $H$ has a direct interpretation as the amount of leverage or influence exerted on $\hat y_i$ by $y_j$ (regardless of the actual value of $y_j$, since $H$ depends only on $X$). Thus a look at the hat matrix can reveal sensitive points in the design, points at which the value of $y$ has a large impact on the fit (Huber 1975). In using the word "design" here, we have in mind both the standard regression or ANOVA situation, in which the values of $X_1, \dots, X_p$ are fixed in advance, and the situation in which $y$ and $X_1, \dots, X_p$ are sampled together. The simple designs, such as two-way analysis of variance, give good control over leverage (as we shall see in Section 3); and with fixed $X$ one can examine, and perhaps modify, the experimental conditions in advance. When the carriers are sampled, one can at least determine whether the observed $X$ contains sensitive points and consider omitting them if the corresponding $y$ value seems discrepant. Thus we use the hat matrix to identify "high-leverage points." If this notion is to be really useful, we must make it more precise.

They then continue:

The influence of the response value $y_i$ on the fit is most directly reflected in its leverage on the corresponding fitted value $\hat y_i$, and this is precisely the information contained in $h_{ii}$, the corresponding diagonal element of the hat matrix.

The above paragraphs describe how the hat matrix is used to detect the points of leverage. In the first paragraph, the authors point out the following fact: since each component of $\hat y$ can be written as $\hat y_i = \sum_j h_{ij} y_j$, we observe that $\partial \hat y_i / \partial y_j = h_{ij}$. That is, $h_{ij}$ represents the effect of $y_j$ on $\hat y_i$. It is also interesting to observe that this effect is symmetric, in the sense that $h_{ij} = h_{ji}$. In the second paragraph, however, they claim that

(1.3) $h_{ii} = \partial \hat y_i / \partial y_i$

most directly reflects the influence of $y_i$ on the fit.

As cited by Hoaglin and Welsch, Huber (1975) first pointed out the importance of measuring the influence of a particular observation in the design space on the fit. Later, Huber (1981, pp. 153-155) explains the concept of the leverage point as follows:

Regression poses some peculiar and difficult robustness problems. ... The difficulty is of course that a gross error does not necessarily show up through a large residual; by causing an overall increase in the size of other residuals, it can even hide behind a veritable smokescreen. In order to disentangle the issues, we must:

(1) Find analytical methods for identifying so-called leverage points, that is, points in the design space where an observation, by virtue of its position, has an overriding influence on the fit and in particular on its own fitted value. ...

In addition, he says (p. 160), "Points with large $h_i$ are, by definition, leverage points."

Again, we see that the influence of $y_i$ on its own fitted value is regarded as the measure of leverage. It is not immediately clear when and how this statement is valid. This is one of the questions we want to answer in this paper.

In the following, we consider a general nonlinear regression model in order to define a measure of the influence of a particular observation on the fit. In this way, we can make clear when and why leverage points are properly defined through the hat matrix in the linear regression.

2. Leverage in Nonlinear Regressions

We define our general nonlinear regression model by

(2.1) $E[y \mid x] = \eta(x, \beta)$ and $\operatorname{var}[y \mid x] = \sigma^2 \omega(x)^{-1}$,

where $y$ is a scalar, $x$ is a $p$-dimensional vector, $\eta$ and $\omega$ are known functions, and where a $p$-dimensional vector $\beta$ and $\sigma^2$ are unknown parameters. We further assume that all $y$'s are independently normally distributed.

Then, the maximum-likelihood estimate $\hat\beta = \beta[\mathcal{F}_n]$ (which, again, is equivalent to the least-squares estimate) is given by the solution of

$\int \xi(x, \hat\beta)\, \omega(x)\{y - \eta(x, \hat\beta)\}\, d\mathcal{F}_n = 0 .$

Here, $\xi(x, \beta) = \partial \eta(x, \beta)/\partial \beta$ is a $p$-dimensional vector, and $\mathcal{F}_n = n^{-1} \sum_i \delta(x_i, y_i)$, where $\delta(x, y)$ is a point mass at $(x, y)$, is the empirical joint measure.

We also define the model distribution $\mathcal{F} = (\eta, G)$ as follows: $x$ is distributed as $G$ (known and independent of $\beta$; that is, we regard $x$ as an ancillary statistic), and the conditional distribution of $y$, given $x$, is normal with mean and variance as described in Eq. (2.1). Then the influence function, or influence curve, of $\hat\beta$ evaluated at the model $\mathcal{F}$ (see Hampel (1971), Huber (1981), or Hampel et al. (1986) for the concept and derivation) is given by

(2.2) $IC[x, y; \mathcal{F}, \beta] = \lim_{t \to 0} \frac{\beta[\mathcal{F}_t] - \beta[\mathcal{F}]}{t} = S^{-1} \xi(x, \beta)\, \omega(x)\{y - \eta(x, \beta)\},$

where $\beta = \beta[\mathcal{F}]$ is, by definition, the true value of the parameter; $\mathcal{F}_t = (1-t)\mathcal{F} + t\,\delta(x, y)$; and

$S = \int \xi(x, \beta)\, \xi(x, \beta)^T \omega(x)\, dG(x),$

which is essentially the precision matrix of $\hat\beta$.
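The limit in Eq. (2.2) can be checked by finite differences. A sketch for the simplest special case, the linear model with $\omega \equiv 1$ (so $\eta(x, \beta) = x^T \beta$ and $\xi = x$); the data and the helper `beta_of`, which evaluates the weighted least-squares functional $\beta[\cdot]$, are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 50, 2
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

def beta_of(weights, Xs, ys):
    # Weighted least-squares functional beta[.] (hypothetical helper).
    return np.linalg.solve(Xs.T @ (weights[:, None] * Xs), Xs.T @ (weights * ys))

beta = beta_of(np.full(n, 1.0 / n), X, y)       # beta[F_n]

# Contaminate F_n with mass t at a new point (x0, y0).
x0, y0 = np.array([3.0, 1.0]), 10.0
Xa, ya = np.vstack([X, x0]), np.append(y, y0)
t = 1e-6
wt = np.append(np.full(n, (1 - t) / n), t)      # weights of (1 - t) F_n + t delta
fd = (beta_of(wt, Xa, ya) - beta) / t           # finite-difference quotient

S = X.T @ X / n                                 # S with omega = 1, xi = x
ic = np.linalg.solve(S, x0) * (y0 - x0 @ beta)  # right-hand side of Eq. (2.2)
print(np.max(np.abs(fd - ic)))                  # agrees up to O(t)
```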

In order to detect influential points in the carrier, i.e., the leverage points, we need to study the influence on the fitted response, rather than just that on $\hat\beta$. A natural estimate of the response at $z$ (using the maximum-likelihood estimate $\hat\beta$) is

(2.3) $\hat y(z) = \eta(z, \hat\beta) = \hat y(z)[\mathcal{F}_n].$

Its influence curve evaluated at the model $\mathcal{F}$ is then

(2.4) $IC[x, y; \mathcal{F}, \hat y(z)] = \xi(z, \beta)^T IC[x, y; \mathcal{F}, \beta] = \xi(z, \beta)^T S^{-1} \xi(x, \beta)\, \omega(x)\{y - \eta(x, \beta)\}.$

The influence curve is a function of $x$ and $y$; however, it can also be regarded as a function of $z$. This viewpoint is helpful, as will be seen later. Based on the IC of $\hat y(z)$, we can evaluate the influence of an observation at $x$ as follows. When we have an observation at $x$, the effect of a small change in the corresponding $y$ on $\hat y(z)$ can be measured by

(2.5) $\varphi(x, z) = \frac{\partial}{\partial y}\, IC[x, y; \mathcal{F}, \hat y(z)].$

We call this measure a cross influence. The heuristic interpretation follows. Suppose we get a new observation at $(x, y)$ and then change the value of $y$ there by an infinitesimal amount. Then $\varphi(x, z)$ measures the ratio of the change in $\hat y(z)$ to the change in $y$ at $x$. Obviously, this concept is analogous to $h_{ij}$ in the linear regression (cf. Eq. (1.3)). In the general nonlinear regression, the explicit form of the cross influence is

(2.6) $\varphi(x, z) = \xi(z, \beta)^T S^{-1} \xi(x, \beta)\, \omega(x).$
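A sketch of how the cross influence (2.6) might be computed, assuming a toy model $\eta(x, \beta) = \exp(\beta^T x)$, for which $\xi(x, \beta) = \exp(\beta^T x)\, x$, with a sample average standing in for $S$:

```python
import numpy as np

def xi(x, beta):
    # xi = d eta / d beta for the toy model eta(x, beta) = exp(beta^T x)
    return np.exp(x @ beta) * x

def phi(x, z, beta, S_inv, omega_x):
    # Cross influence phi(x, z) = xi(z)^T S^{-1} xi(x) omega(x), Eq. (2.6)
    return xi(z, beta) @ S_inv @ xi(x, beta) * omega_x

rng = np.random.default_rng(1)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta = np.array([0.5, -0.3])
omega = np.ones(n)                  # homoscedastic case

Xi = np.exp(X @ beta)[:, None] * X  # rows xi(x_i, beta)
S = (Xi.T * omega) @ Xi / n         # S = average of omega(x) xi xi^T
S_inv = np.linalg.inv(S)

print(phi(X[0], X[1], beta, S_inv, 1.0))
print(phi(X[1], X[0], beta, S_inv, 1.0))  # equal: symmetric when omega is constant
```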

We can interpret the cross influence $\varphi(x, z)$ in another way; namely, it is equivalent to the ratio

$\varphi(x, z) = \frac{IC[x, y; \mathcal{F}, \hat y(z)]}{y - \eta(x, \beta)},$

whose denominator is the residual from the fit. Although both expressions are meaningful when $y$ is a continuous variable, the first definition agrees more naturally with the use of the influence function.

We further observe that if $\omega(x)$ is a constant, then Eq. (2.6) is symmetric in $x$ and $z$; conversely, the cross influence is symmetric only when $\omega(x)$ is a constant. This property also holds for the linear regression; we have already seen that $h_{ij} = h_{ji}$. The condition that $\omega$ is a constant will play an important role later in our argument.

A special case is the self influence defined by $\varphi(x, x) = \xi(x, \beta)^T S^{-1} \xi(x, \beta)\, \omega(x)$, which is a measure of the influence of an observation at $x$ on its own fitted value and is analogous to $h_i$ in the linear regression. Unfortunately, this quantity does not always reflect the entire influence. Because of the factor $\xi(x, \beta)$ in $\varphi(x, x)$, instead of $\xi(z, \beta)$ in Eq. (2.6), the self influence can be small even when cross influences are large for some (or most) values of $z$. Hence, just looking at the self influence is inadequate for detecting influential observations. We thus need to examine the cross influence more generally. Our idea is to look at the sum of squared cross influences of an observation at $x$, assuming that $z$ is distributed as $G$. Thus, we define the total influence $I(x)$ as

(2.7) $I(x) = \int \varphi(x, z)^2\, dG(z) = \omega(x)^2\, \xi(x, \beta)^T S^{-1} A\, S^{-1} \xi(x, \beta),$

where $A = \int \xi(x, \beta)\, \xi(x, \beta)^T dG$ is a $p \times p$ matrix. From this expression, we immediately obtain the first part of the following proposition.

PROPOSITION 1. When $\omega(x)$ is a constant: (i) the total influence $I(x)$ is equivalent to the self influence $\varphi(x, x)$; (ii) $\operatorname{ave} I = \int I(x)\, dG(x) = p$.

PROOF. To prove the second part, suppose $\omega(x) = \omega$ for all $x$. Then

$S = \omega \int \xi(x, \beta)\, \xi(x, \beta)^T dG = \omega A, \quad \text{and} \quad I(x) = \xi(x, \beta)^T A^{-1} \xi(x, \beta).$

Thus, we obtain

$\operatorname{ave} I = \operatorname{tr} A^{-1} \int \xi(x, \beta)\, \xi(x, \beta)^T dG(x) = \operatorname{tr} A^{-1} A = p .$
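A quick Monte Carlo check of part (ii), again assuming the toy model $\eta(x, \beta) = \exp(\beta^T x)$ with $\omega \equiv 1$; since $A$ is taken to be the sample average of $\xi \xi^T$, the sample mean of $I(x_i)$ equals $p$ exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 5000, 3
beta = np.array([0.2, -0.4, 0.1])

X = rng.normal(size=(n, p))
Xi = np.exp(X @ beta)[:, None] * X  # xi(x_i, beta) for the toy model
A = Xi.T @ Xi / n                   # A = average of xi xi^T
# I(x_i) = xi_i^T A^{-1} xi_i
I = np.einsum('ij,jk,ik->i', Xi, np.linalg.inv(A), Xi)

print(I.mean())  # = p = 3, since ave I = tr A^{-1} A = p
```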

This proposition implies that the self influence $\varphi(x, x)$ can be useful in the general regression model only when the error variance is a constant. Otherwise, the total influence $I(x)$ should be used as the proper measure of the influence of an observation.

Another question of interest is whether $I(x)$ can be independent of $\beta$. In such a case, we could evaluate the influence of each observation independently of $\beta$, or equivalently, before running an experiment to estimate $\beta$. The answer is straightforward: Eq. (2.7) indicates that "$\xi(x, \beta)$ is independent of $\beta$" is a necessary and sufficient condition for the desired property. The condition is equivalent to the statement that $\eta(x, \beta)$ is linear in $\beta$. We thus obtain the following proposition.

PROPOSITION 2. The total influence $I(x)$ is independent of $\beta$ if and only if $\eta(x, \beta)$ is linear in $\beta$, that is, when the model is the linear regression model.

3. An Empirical Measure of Leverage

Since our definition of leverage requires the IC of $\hat y(z)$, we need an empirical version of the IC, which is called the empirical influence function in some of the literature (e.g., see Cook and Weisberg (1982, p. 108) for sample versions of the influence curve). This function is obtained by replacing $\mathcal{F}$ in Eq. (2.4) by the empirical measure $\mathcal{F}_n$. We can accordingly obtain empirical forms of the cross influence and the other relevant influences in the same manner as in the previous section.

Using the obvious abbreviations, such as $\xi_i = \xi(x_i, \hat\beta)$ and $\omega_i = \omega(x_i)$, the empirical cross influence and the empirical total influence are, respectively,

(3.1) $\varphi(x_i, z) = \omega_i\, \xi(z, \hat\beta)^T S_n^{-1} \xi_i$, and

(3.2) $I_i = n^{-1} \sum_j \varphi_{ij}^2,$

where $S_n = n^{-1} \sum_i \omega_i \xi_i \xi_i^T$, $\varphi_{ij} = \varphi(x_i, x_j) = \omega_i\, \xi_i^T S_n^{-1} \xi_j$, and $A_n = n^{-1} \sum_i \xi_i \xi_i^T$. Matrix representation of these quantities is also helpful. Define an $n \times p$ matrix $\Xi = (\xi_1, \dots, \xi_n)^T$ and an $n \times n$ diagonal matrix $\Omega$ whose $i$-th diagonal element is $\omega_i$. We can then write

(3.3) $A_n = n^{-1} \Xi^T \Xi, \quad S_n = n^{-1} \Xi^T \Omega\, \Xi, \quad \text{and} \quad \Phi_n = (\varphi_{ij}) = \Omega\, \Xi\, S_n^{-1} \Xi^T .$

We thus find that the empirical total influence is given by the $i$-th diagonal element of $n^{-1} \Phi_n \Phi_n^T$, namely, $I_i = n^{-1} (\Phi_n \Phi_n^T)_{ii}$. Note that $n^{-1} \Phi_n$ is idempotent but not symmetric. As in the previous section, a sufficient condition for symmetry is $\omega_i = \omega$ for all $i$. We also obtain $\operatorname{ave} I = n^{-1} \sum_i I_i = n^{-2} \operatorname{tr}\, \Phi_n \Phi_n^T$. Since these empirical measures are based on the influence curves at $\mathcal{F}_n$, the previous propositions still hold with appropriate substitutions.
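The matrix forms (3.1) to (3.3) translate directly into code. A sketch for the linear homoscedastic case ($\xi_i = x_i$, $\omega_i = 1$), where, as the following paragraphs show, $I_i$ reduces to $n h_{ii}$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Xi = X              # linear model: xi_i = x_i
w = np.ones(n)      # homoscedastic: omega_i = 1

S_n = (Xi.T * w) @ Xi / n                             # Eq. (3.3)
Phi = (w[:, None] * Xi) @ np.linalg.solve(S_n, Xi.T)  # Phi_n = Omega Xi S_n^{-1} Xi^T
I = np.diag(Phi @ Phi.T) / n                          # I_i, Eq. (3.2)

H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(I, n * np.diag(H))                 # I_i = n h_ii here
P = Phi / n
assert np.allclose(P, P @ P)                          # n^{-1} Phi_n is idempotent
```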

In the nonlinear regression, $I_i$ depends on $\hat\beta$. This means that $I_i$ depends on the $y$'s through $\xi_i = \xi(x_i, \hat\beta)$. Therefore, the total influence $I_i$ cannot be calculated explicitly prior to an experiment. Proposition 2 (in the empirical version) shows that we can evaluate the $I_i$'s before an experiment only when we are dealing with the linear regression model.

When the model is linear, $\eta(x, \beta) = x^T \beta$, and $\xi_i = x_i$. Thus, we obtain

(3.4) $I_i = \omega_i^2\, x_i^T S_n^{-1} A_n S_n^{-1} x_i \quad \text{and} \quad \operatorname{ave} I = \operatorname{tr}\, S_n^{-1} A_n S_n^{-1} T_n,$

where $A_n$ and $S_n$ are defined by Eq. (3.3) (note that $\xi_i = x_i$ now), and where $T_n = n^{-1} \sum_i \omega_i^2 x_i x_i^T$ is another $p \times p$ matrix.

An essential difference between Eq. (3.2) (the nonlinear case) and Eq. (3.4) (the linear case) is that in Eq. (3.4), $I_i$ depends only on the $x$'s (not on the $y$'s), but in Eq. (3.2), $I_i$ depends on the $y$'s as well as on the $x$'s. Hence it becomes possible to evaluate $I_i$ without an experiment only when the model is linear.

Further, consider the case when the error variance is constant, that is, $\omega_i = \omega$. This situation is called homoscedastic in econometrics, in contrast to the heteroscedastic situation, where the error variances are not constant. In a homoscedastic case, then, we obtain $I_i = x_i^T A_n^{-1} x_i$ and $\operatorname{ave} I = p$ (Proposition 1). Since $A_n$ now depends only on the $x$'s, it is easy to evaluate $I_i$, so that one can tell which observation has a strong influence on the fit prior to the experiment, if an experiment has been designed.

Applying Huber's definition (where $h_i$ is replaced with $I_i$), we can call observations with large values of $I(x_i)$ leverage points. As discussed in the previous section, this definition is a natural extension from the linear regression. In the linear regression model with homoscedastic error variance, we have

$I_i = x_i^T A_n^{-1} x_i = n\, x_i^T (X^T X)^{-1} x_i = n\, h_{ii} .$

It should be clear that Hoaglin and Welsch's interpretation, $h_{ij} = \partial \hat y_i / \partial y_j$, is formally extended to our $\varphi_{ij}$ in the case of the general nonlinear regression. It is, however, rather misleading, even in the linear regression model, to say that the self influence $h_{ii}$ contains adequate information about influence. A better interpretation is to use $I_i = n^{-1} \sum_j \varphi_{ij}^2$, although we already know that in the linear regression model $h_{ii}$ agrees with $I_i$ (except for a constant factor $n$). Here $\varphi_{ij}$ should be interpreted as $\varphi_{ij} = \partial\, IC[x_i, y; \mathcal{F}_n, \hat y(x_j)] / \partial y$. Merely because of the constant error variance, $I_i = n^{-1} \sum_j \varphi_{ij}^2 = n\, h_{ii}$ in this case.

The following is a summary of our argument:

(1) The total influence is a natural extension of $h_i$; the notion of cross influence and total influence is applicable to various regression models.

(2) The total influence $I_i$ is equivalent to the self influence $\varphi_{ii}$ when the error variance is constant, that is, homoscedastic. Under such circumstances, the use of $\varphi_{ii}$ is valid even in the general nonlinear regression model.

(3) We can calculate the total influence $I_i$ before an experiment only when the model is linear.

In the next section, we show that our method based on the influence curve is applicable to the binary regression model, which is a special case of the general nonlinear regression model.

4. The Binary Regression Model

We assume that the explanatory variables $x = (x_1, \dots, x_p)$ follow a known distribution $G(x)$. Given $x$, the response $y$ has a Bernoulli distribution with parameter $\pi = F(\beta^T x)$, where $F$ is a continuous distribution function. This defines the theoretical distribution $\mathcal{F} = (F, G)$. Thus, $x$ is an ancillary statistic in our framework, as in the previous sections.

Now, suppose we have a data set of size $n$, $(x_i, y_i)$, where $y_i$ is either 0 or 1. In matrix notation, we write $X^T = (x_1 \cdots x_n)$ and $y = (y_1 \cdots y_n)$, where the $n \times p$ matrix $X$ is assumed to have full rank. We also assume that the constant term is included in $X$; that is, the first column of $X$ consists of 1s.

Since the log-likelihood of $\beta$ is

(4.1) $L(\beta) = \sum_i \bigl[ y_i \log F(\beta^T x_i) + (1 - y_i) \log\{1 - F(\beta^T x_i)\} \bigr],$

the maximum-likelihood estimate $\hat\beta$ is the solution of

(4.2) $\int x\, w(\beta^T x)\{y - F(\beta^T x)\}\, d\mathcal{F}_n = 0,$

where $\mathcal{F}_n$ is the empirical distribution and

(4.3) $w(u) = \frac{f(u)}{F(u)\{1 - F(u)\}}$

plays the role of a weight function. Some relevant results from Yoshizoe (1986) concerning the maximum-likelihood estimation of the binary regression model are summarized in the appendix.

If we define an estimate of the response probability as $\hat\pi(z) = F(\hat\beta^T z)$, its influence curve at the model is

(4.4) $IC[x, y; \mathcal{F}, \hat\pi(z)] = f(\beta^T z)\, z^T IC[x, y; \mathcal{F}, \beta] = f(\beta^T z)\, z^T S^{-1} x\, w(\beta^T x)\{y - \pi(x)\},$

where $S$ is defined by (A.6) in the appendix. As before, the cross influence is

(4.5) $\varphi(x, z) = f(\beta^T z)\, z^T S^{-1} x\, w(\beta^T x).$

In the binary regression model, a better interpretation of the cross influence is available. When we have an observation at $x$, the corresponding value of $y$ is either 1 or 0. Thus, the influence of an observation at $x$ on $\hat\pi(z)$ may be better measured by the difference of the ICs, which of course agrees with Eq. (4.5):

(4.6) $\varphi(x, z) = IC[x, 1; \mathcal{F}, \hat\pi(z)] - IC[x, 0; \mathcal{F}, \hat\pi(z)].$

The self influence and total influence are

$\varphi(x, x) = w(\beta^T x)\, f(\beta^T x)\, x^T S^{-1} x$ and

(4.7) $I(x) = \int \varphi(x, z)^2\, dG(z) = w(\beta^T x)^2\, x^T S^{-1} T\, S^{-1} x,$

respectively, where $T = \int f(\beta^T z)^2\, z z^T\, dG(z)$. Because of the factor $f(\beta^T x)\, w(\beta^T x)$ in $\varphi(x, x)$, the self influence can be small even when $\varphi(x, z)$ is large for some values of $z$. For example, $\varphi(x, x)$ is bounded for any $x$ if $F$ is the logistic.

The empirical cross influence and the empirical total influence are (using the obvious abbreviations)

$\varphi(x_i, z) = w_i\, f(\hat\beta^T z)\, z^T S_n^{-1} x_i \quad \text{and} \quad I_i = n^{-1} \sum_j \varphi_{ij}^2 = w_i^2\, x_i^T S_n^{-1} T_n S_n^{-1} x_i,$

respectively, where $S_n = n^{-1} \sum_i f_i w_i x_i x_i^T$ and $T_n = n^{-1} \sum_i f_i^2 x_i x_i^T$ are the empirical forms of $S$ and $T$. Introducing $U_n = n^{-1} \sum_i w_i^2 x_i x_i^T$, the average of $I(x)$ is given by

(4.8) $\operatorname{ave} I = n^{-1} \sum_i I_i = \operatorname{tr}\, S_n^{-1} T_n S_n^{-1} U_n .$

Unfortunately, we cannot proceed further to obtain a simpler form of $\operatorname{ave} I$. The reason is easily seen from the following matrix representation using two $n \times n$ diagonal matrices $W$ and $F$, whose $i$-th diagonal elements are $w_i$ and $f_i$, respectively:

$S_n = n^{-1} X^T F W X, \quad T_n = n^{-1} X^T F^2 X, \quad U_n = n^{-1} X^T W^2 X,$
$\Phi_n = W X S_n^{-1} X^T F, \quad \text{and} \quad \operatorname{ave} I = n^{-2} \operatorname{tr}\, \Phi_n \Phi_n^T .$

As in the general nonlinear regression model, $\Phi_n$ is not symmetric, although $n^{-1} \Phi_n$ is idempotent. Therefore, Eq. (4.8) is the final form and cannot be simplified. One might regard this property as a shortcoming of our definition of leverage points, since we cannot calculate either $I(x_i)$ or $\operatorname{ave} I$ until an experiment has been performed. We must admit, however, that we encounter such a difficulty whenever we deal with a nonlinear regression model. The linear regression model (combined with the use of the least-squares estimate) is an exceptionally simple case.
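A sketch of these empirical quantities for the logit model, where $w_i = 1$ and $f_i = F_i(1 - F_i)$ (cf. (A.3) and (A.5) in the appendix); the simulated data and the plain Newton iteration are for illustration only:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, 1))])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

# Maximum likelihood by Newton's method; for the logit, w_i = 1.
beta = np.zeros(p)
for _ in range(25):
    pi = 1 / (1 + np.exp(-X @ beta))
    f = pi * (1 - pi)                       # logistic density at beta^T x_i
    beta += np.linalg.solve((X.T * f) @ X, X.T @ (y - pi))

pi = 1 / (1 + np.exp(-X @ beta))
f = pi * (1 - pi)
w = np.ones(n)                              # logit weight w(beta^T x) = 1

S_n = (X.T * (f * w)) @ X / n
T_n = (X.T * f**2) @ X / n
U_n = (X.T * w**2) @ X / n
# Phi_n = W X S_n^{-1} X^T F
Phi = (w[:, None] * X) @ np.linalg.solve(S_n, (X * f[:, None]).T)
I = np.diag(Phi @ Phi.T) / n                # empirical total influence I_i

# ave I = tr S_n^{-1} T_n S_n^{-1} U_n, Eq. (4.8)
ave_I = np.trace(np.linalg.solve(S_n, T_n) @ np.linalg.solve(S_n, U_n))
assert np.isclose(I.mean(), ave_I)
```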

5. Discussion

We introduced the notion of total influence in the general nonlinear regression model, using the influence function of the fit $\hat y(z)$ as a basic tool. Then, we revealed the relationship between our definition of the leverage point and the ordinary one, and found that our measure of leverage is a natural extension of the hat-matrix-based measure in the linear regression. Emerson, Hoaglin, and Kempthorne (1984) use ideas similar to ours to find influential points in a special kind of linear regression, though they do not call it a regression model. Further, we applied the measure to the binary regression model. The reason the same method applies can be clearly seen from the similarity between

(5.1) $E[y \mid x] = F(\beta^T x)$

in the binary regression model and Eq. (2.1) in the general nonlinear regression model. The difference is that in the general nonlinear regression we also need the conditional variance, while in the binary regression the conditional variance is determined by the conditional expectation. In this sense, the binary regression model is a simpler special case of the general nonlinear regression model.

We could define total influence in a slightly different manner, which may be more useful than the original definition. If the log-likelihood is of the form

$\sum_i \rho\bigl(\omega(x_i)^{1/2}\{y_i - \eta(x_i, \beta)\}\bigr),$

and if $y = \eta + h$, where $h = \varphi(x, z)$ is the infinitesimal deviation, then

(5.2) $I^*(x) = \int \rho\bigl(\omega(z)^{1/2}\, \varphi(x, z)\bigr)\, dG(z)$

measures the weighted total influence. When $\rho(x) = x^2$,

$I^*(x) = \int \omega(z)\, \varphi(x, z)^2\, dG(z) = \omega(x)^2\, \xi(x, \beta)^T S^{-1} \xi(x, \beta) = \omega(x)\, \varphi(x, x).$

This measure is simpler than the original $I(x)$, and Propositions 1 and 2 still hold unchanged.

In the binary regression model, several criteria for influential points have been proposed. Pregibon (1981) measures the change of the parameter when the weight at point $x$ is changed infinitesimally. This idea is essentially equivalent to measuring $IC[x, y; \mathcal{F}, \beta]$. He also mentions the possibility of applying his method to the general nonlinear regression model, but he does not consider the overall influence. His infinitesimal measure of neighboring effects (see Pregibon, p. 721) is essentially similar to our $\varphi_{ij}$. We can apply the revised definition of total influence, as in (5.2), to the binary regression model. To do so, observe that $\operatorname{var}[y \mid x] = F(\beta^T x)\{1 - F(\beta^T x)\}$. Then an analog of Eq. (5.2) is

$I^*(x) = \int \frac{\varphi(x, z)^2}{F(\beta^T z)\{1 - F(\beta^T z)\}}\, dG(z) = w(\beta^T x)^2\, x^T S^{-1} x .$

This definition may be more appealing than the original I(x) because it places large weights at extreme x's, where the response probabilities are near zero or one. Note that we cannot define the logit or the probit of y when y=1 or y=0, so one can regard I*(x) as an attempt to introduce the logit transform.

The essential relationship between $\varphi(x, x)$ and $I^*(x)$ is

$I^*(x) = \omega(x)\, \varphi(x, x).$

In the binary regression, $\omega(x)$ should be interpreted as $w(\beta^T x)$. In this form of the modified definition of total influence, the role of the weight function $w$ is easily understood.

Unfortunately, the above methods based on the influence curve are useful for detecting "bad" observations only when the model is correct. As a final remark, we should note that the total influence depends both on the estimation method and on the location of the observation $x$. For example, the influence of an observation at $x$ when the maximum-likelihood method is used may be quite different from the influence of the same point when a robust method is used. From this viewpoint, we must again admit that the ordinary definition of leverage points based on the hat matrix, which is applicable only to the least-squares estimate in the linear regression model, is a special case of our total influence. Our idea is applicable to any nonlinear regression model (including the binary regression model) and to any form of estimate.

Appendix

(A.1) If the response function $F$ has a density $f = F'$ which is also differentiable, and if both $\log F$ and $\log(1 - F)$ are concave, then $\hat\beta$ is consistent and asymptotically normally distributed; that is, $\sqrt{n}(\hat\beta - \beta)$ converges to $N(0, \mathcal{I}^{-1})$ in distribution, where $\mathcal{I}$ is the Fisher information.

(A.2) When it exists, the Fisher information is

$\mathcal{I} = \int f(\beta^T x)\, w(\beta^T x)\, x x^T\, dG(x).$

(A.3) The maximum-likelihood equation (4.2) and the Fisher information (A.2) imply that both the computation of the maximum-likelihood estimate and the form of the asymptotic variance are simplified if $w(x)$ is constant. From Eq. (4.3), this condition implies a differential equation whose solution is of the form $F(x) = \{1 + \exp(ax + b)\}^{-1}$; namely, it is the logistic distribution. In this sense, the logistic $F$ gives the simplest model in the binary regression.
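A one-line check of this characterization from the other direction: for the standard logistic, $f = F(1 - F)$, so the weight in Eq. (4.3) is identically one:

```python
import numpy as np

x = np.linspace(-10, 10, 5)
F = 1 / (1 + np.exp(-x))  # standard logistic distribution function
f = F * (1 - F)           # its density
w = f / (F * (1 - F))     # weight of Eq. (4.3)
print(w)                  # [1. 1. 1. 1. 1.]
```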

(A.4) The so-called observed Fisher information $\mathcal{I}(X, y)$ depends explicitly only on $X$ (not on $y$) if and only if $F$ is the logistic.

(A.5) When $F$ is the standard logistic distribution $F(x) = (1 + e^{-x})^{-1}$, the maximum-likelihood equation and the observed Fisher information take the simplest forms:

$\sum_i x_i \{y_i - F(\hat\beta^T x_i)\} = 0 \quad \text{and} \quad \mathcal{I}(X, y) = \sum_i F(\hat\beta^T x_i)\{1 - F(\hat\beta^T x_i)\}\, x_i x_i^T .$

(A.6) The influence curve of $\hat\beta$ at the model $\mathcal{F}$ is

$IC[x, y; \mathcal{F}, \beta] = S^{-1} x\, w(\beta^T x)\{y - F(\beta^T x)\},$

where

$S = \int f(\beta^T x)\, w(\beta^T x)\, x x^T\, dG(x).$

The $p \times p$ matrix $S$ becomes positive definite under very mild conditions on $F$ and $G$.

(A.7) The influence curve (A.6) reveals two important properties.

(a) $IC[x, y; \mathcal{F}, \beta]$ is bounded when $x$ has a bounded distribution, since $-1 \le y - F(\beta^T x) \le 1$. Therefore, the maximum-likelihood estimate is qualitatively robust in Hampel's sense. In biological assays, where $x$ is designed and we thus have no gross errors in $x$, this conclusion may validate the use of the maximum-likelihood estimate in the binary regression model.

(b) When $x$ is a random variable, as occurs in many socio-economic problems, $\hat\beta$ is qualitatively robust only when $x\, w(\beta^T x)$ is bounded. A necessary condition for $x\, w(\beta^T x)$ to be bounded for any $\beta$ is that $x\, w(x)$ be bounded for $-\infty < x < \infty$. Hence the maximum-likelihood estimate $\hat\beta$ is qualitatively robust only when $x\, w(x)$ is bounded.

(A.8) When $p > 2$, the maximum-likelihood estimate is not qualitatively robust for any distribution $F$. However, the boundedness of $x\, w(x)$ is a sufficient condition for the qualitative robustness for those $x$ for which $\beta^T x \to \infty$. Therefore, (A.7.b) is a sufficient condition for a single-explanatory-variable model (i.e., when $p = 2$), unless the coefficient of $x$ is zero.
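A numerical illustration of the boundedness condition in (A.7.b), using SciPy's normal distribution for the probit weight: for the logit, $x\, w(x) = x$, and for the probit, $x\, w(x)$ grows roughly like $x^2$, so in both cases the condition fails when $x$ is unbounded:

```python
import numpy as np
from scipy.stats import norm

x = np.array([-20.0, -5.0, 5.0, 20.0])

w_logit = np.ones_like(x)                            # logit: w(x) = 1
w_probit = norm.pdf(x) / (norm.cdf(x) * norm.sf(x))  # probit: w = f / {F(1 - F)}

print(x * w_logit)   # = x: unbounded as |x| grows
print(x * w_probit)  # ~ x^2 for large |x|: also unbounded
```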

REFERENCES

[1] Cook, R. D. and S. Weisberg (1982). Residuals and Influence in Regression, Chapman and Hall.
[2] Emerson, J. D., D. C. Hoaglin and P. J. Kempthorne (1984). Leverage in Least Squares Additive-Plus-Multiplicative Fits for Two-Way Tables, Journal of the American Statistical Association, 79, 329-335.
[3] Hampel, F. R. (1971). A General Qualitative Definition of Robustness, Annals of Mathematical Statistics, 42, 1887-1896.
[4] Hampel, F. R., E. M. Ronchetti, P. J. Rousseeuw and W. A. Stahel (1986). Robust Statistics: The Approach Based on Influence Functions, John Wiley & Sons.
[5] Hoaglin, D. C. and R. E. Welsch (1978). The Hat Matrix in Regression and ANOVA, American Statistician, 32, 17-22.
[6] Huber, P. J. (1975). Robustness and Designs, A Survey of Statistical Design and Linear Models, ed. J. N. Srivastava, North-Holland Publishing Co.
[7] Huber, P. J. (1981). Robust Statistics, John Wiley & Sons.
[8] Pregibon, D. (1981). Logistic Regression Diagnostics, The Annals of Statistics, 9, 705-724.
[9] Yoshizoe, Y. (1986). Robust Estimation of the Binary Regression Model (in Japanese), in S. Hayashi and T. Nakamura (eds.), Japanese Economy and Economic Statistics, The University of Tokyo Press.