Chapter 7 M-Estimation (Estimating Equations)

7.1 Introduction

In Chapter 1 we made the distinction between the parts of a fully specified model. The primary part is the part that is most important for answering the underlying scientific questions. The secondary part consists of all the remaining details of the model. Usually the primary part is the mean or systematic part of the model, and the secondary part is mainly concerned with the distributional assumptions about the random part of the model. The full specification of the model is important for constructing the likelihood and for using the associated classical methods of inference as spelled out in Chapters 2 and 3 and supported by the asymptotic results of Chapter 6. However, we are now ready to consider robustifying the inference so that misspecification of some secondary assumptions does not invalidate the resulting inferential methods. Basically this robustified inference relies on replacing the information matrix inverse $I(\theta)^{-1}$ in the asymptotic normality result for the maximum likelihood estimator $\hat{\theta}$ by a generalization $I(\theta)^{-1}B(\theta)I(\theta)^{-1}$ called the sandwich matrix. In correctly specified models, $I(\theta) = B(\theta)$, and the sandwich matrix just reduces to the usual $I(\theta)^{-1}$. When the model is not correctly specified, $I(\theta) \ne B(\theta)$, and the sandwich matrix is important for obtaining approximately valid inference. Thus, use of this more general result accommodates misspecification but is still appropriate in correctly specified models, although its use there in small samples can entail some loss of efficiency relative to standard likelihood inference.

Development of this robustified inference for likelihood-based models leads to a more general context. As discussed in Chapter 6, the asymptotic normality properties of maximum likelihood estimators follow from Taylor expansion of the likelihood equation $S(\theta) = \sum_{i=1}^{n} \partial \log f(Y_i;\theta)/\partial\theta^{T} = 0$. The more general approach is then to define an estimator of interest as the solution of an estimating equation, but without the equation necessarily coming from the derivative of a log likelihood. For historical reasons and for motivation from maximum likelihood, this more general approach

is called M-estimation. In recent years the approach is often referred to loosely as estimating equations. This chapter borrows heavily from the systematic description of M-estimation in Stefanski and Boos (2002).

M-estimators are solutions of the vector equation $\sum_{i=1}^{n} \psi(Y_i, \theta) = 0$. That is, the M-estimator $\hat{\theta}$ satisfies

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}) = 0. \tag{7.1}$$
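Before describing the ingredients of (7.1) in more detail, here is a small computational sketch (our illustration, not from the text; it assumes NumPy and SciPy are available) that solves the estimating equation with a general-purpose root finder, using the simple choice $\psi(y, \theta) = y - \theta$, whose solution should coincide with the sample mean.

```python
import numpy as np
from scipy.optimize import root

def psi(y, theta):
    """Illustrative psi: psi(y, theta) = y - theta, whose root is the sample mean."""
    return y - theta

def solve_m_estimator(psi, y, theta_start):
    """Solve the estimating equation sum_i psi(Y_i, theta) = 0 numerically."""
    g = lambda theta: np.sum([psi(yi, theta) for yi in y], axis=0)
    return root(g, theta_start).x

y = np.random.default_rng(1).exponential(size=50)
theta_hat = solve_m_estimator(psi, y, theta_start=np.array([0.0]))
print(theta_hat, y.mean())   # the two should agree (up to solver tolerance)
```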

Here we are assuming that $Y_1, \ldots, Y_n$ are independent but not necessarily identically distributed, $\theta$ is a $b$-dimensional parameter, and $\psi$ is a known $(b \times 1)$-function that does not depend on $i$ or $n$. In this description $Y_i$ represents the $i$th datum. In some applications it is advantageous to emphasize the dependence of $\psi$ on particular components of $Y_i$. For example, in a regression problem $\boldsymbol{Y}_i = (x_i, Y_i)$, and (7.1) would typically be written

$$\sum_{i=1}^{n} \psi(Y_i, x_i, \hat{\theta}) = 0, \tag{7.2}$$

where $x_i$ is the $i$th regressor. Huber (1964, 1967) introduced M-estimators and their asymptotic properties, and they were an important part of the development of modern statistics. Liang and Zeger (1986) helped popularize M-estimators in the literature under the name generalized estimating equations (GEE). Obviously, many others have made important contributions. For example, Godambe (1960) introduced the concept of an optimum estimating function in an M-estimator context, and that paper could be called a forerunner of the M-estimator approach.

There is a large literature on M-estimation and estimating equations. We will not attempt to survey this literature or document its development. Rather we want to show that the M-estimator approach is simple, powerful, and widely applicable. We especially want students to feel comfortable finding and using the asymptotic approximations that flow from the method.

One key advantage of the approach is that a very large class of asymptotically normal statistics including delta method transformations can be put in the general M-estimator framework. This unifies large sample approximation methods, simplifies analysis, and makes computations routine, although sometimes tedious. Fortunately, the tedious derivative and matrix calculations often can be performed symbolically with programs such as Maple and Mathematica.

Many estimators not typically thought of as M-estimators can be written in the form of M-estimators. Consider as a simple example the mean deviation from the sample mean

$$\hat{\theta}_1 = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}|.$$

Is this an M-estimator? There is certainly no single equation of the form

$$\sum_{i=1}^{n} \psi(Y_i, \theta) = 0$$

that yields $\hat{\theta}_1$. Moreover, there is no family of densities $f(y;\theta)$ such that $\hat{\theta}_1$ is a component of the maximum likelihood estimator of $\theta$. But if we let $\psi_1(y, \theta_1, \theta_2) = |y - \theta_2| - \theta_1$ and $\psi_2(y, \theta_1, \theta_2) = y - \theta_2$, then

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}_1, \hat{\theta}_2) = \begin{pmatrix} \sum_{i=1}^{n} |Y_i - \hat{\theta}_2| - n\hat{\theta}_1 \\[4pt] \sum_{i=1}^{n} Y_i - n\hat{\theta}_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

yields $\hat{\theta}_2 = \bar{Y}$ and $\hat{\theta}_1 = (1/n)\sum_{i=1}^{n} |Y_i - \bar{Y}|$. We like to use the term "partial M-estimator" for an estimator that is not naturally an M-estimator until additional $\psi$ functions are added. The key idea is simple: any estimator that would be an M-estimator if certain parameters were known is a partial M-estimator, because we can "stack" $\psi$ functions for each of the unknown parameters (a numerical sketch of this stacking idea is given at the end of this section). This aspect of M-estimators is related to the general approach of Randles (1982) for replacing unknown parameters by estimators.

From the above example it should be obvious that we can replace $\hat{\theta}_2 = \bar{Y}$ by any other estimator defined by an estimating equation; for example, the sample median. Moreover, we can also add $\psi$ functions to give delta method asymptotic results for transformations of parameters, for example, $\hat{\theta}_3 = \log(\hat{\theta}_1)$; see Examples 7.2.3 (p. 304) and 7.2.4 (p. 305) and also Benichou and Gail (1989).

The combination of "approximation by averages" and "delta theorem" methodology from Chapter 5 can handle a larger class of problems than the enhanced M-estimation approach described in this chapter. However, enhanced M-estimator methods, implemented with the aid of symbolic mathematics software (for deriving analytic expressions) and standard numerical routines for derivatives and matrix algebra (for obtaining numerical estimates), provide a unified approach that is simple in implementation, easily taught, and applicable to a broad class of complex problems.

A description of the basic approach is given in Section 7.2 along with a few examples. Connections to the influence curve are given in Section 7.3, and then extensions for nonsmooth $\psi$ functions are given in Section 7.4. Extensions for regression are given in Section 7.5. A discussion of a testing problem is given in Section 7.6, and Section 7.7 summarizes the key features of the M-estimator method. The Appendix gives theorems for the consistency and asymptotic normality of $\hat{\theta}$ as well as Weak Laws of Large Numbers for averages of summands with estimated parameters.
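Here is the numerical sketch of the stacking idea promised above (our illustration, assuming NumPy and SciPy): the two $\psi$ components of the mean-deviation example are stacked, the resulting two-dimensional equation is solved with a root finder, and the answer is checked against $\bar{Y}$ and $(1/n)\sum |Y_i - \bar{Y}|$.

```python
import numpy as np
from scipy.optimize import root

def psi_stacked(y, theta):
    """Stacked psi for (theta_1, theta_2) = (mean deviation, mean)."""
    t1, t2 = theta
    return np.array([np.abs(y - t2) - t1,    # psi_1(y; theta_1, theta_2)
                     y - t2])                # psi_2(y; theta_1, theta_2)

def G(theta, y):
    """Sum of the stacked psi over the sample; the M-estimator is a root of G."""
    return np.sum([psi_stacked(yi, theta) for yi in y], axis=0)

y = np.random.default_rng(7).normal(size=100)
theta_hat = root(G, x0=np.array([1.0, 0.0]), args=(y,)).x
print(theta_hat)                                   # stacked M-estimator
print(np.mean(np.abs(y - y.mean())), y.mean())     # closed forms from the text
```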

7.2 The Basic Approach

M-estimators solve (7.1, p. 298), where the vector function $\psi$ must be a known function that does not depend on $i$ or $n$. For regression situations, the argument of $\psi$ is expanded to depend on regressors $x_i$, but the basic $\psi$ still does not depend on $i$. For the moment we confine ourselves to the iid case where $Y_1, \ldots, Y_n$ are iid (possibly vector-valued) with distribution function $F$. The true parameter value $\theta_0$ is defined by

$$E_F\, \psi(Y_1, \theta_0) = \int \psi(y, \theta_0)\, dF(y) = 0. \tag{7.3}$$

For example, if $\psi(Y_i, \theta) = Y_i - \theta$, then clearly the population mean $\theta_0 = \int y\, dF(y)$ is the unique solution of $\int (y - \theta)\, dF(y) = 0$. If there is one unique $\theta_0$ satisfying (7.3), then in general there exists a sequence of M-estimators $\hat{\theta}$ such that the Weak Law of Large Numbers leads to $\hat{\theta} \xrightarrow{p} \theta_0$ as $n \to \infty$. These types of results are similar to the consistency results discussed in Chapter 6. Theorem 7.1 (p. 327) in this chapter gives one such result for compact parameter spaces. Furthermore, if $\psi$ is suitably smooth, then Taylor expansion of $G_n(\theta) = n^{-1}\sum_{i=1}^{n} \psi(Y_i, \theta)$ gives

$$0 = G_n(\hat{\theta}) = G_n(\theta_0) + G_n'(\theta_0)\,(\hat{\theta} - \theta_0) + R_n,$$

where $G_n'(\theta) = \partial G_n(\theta)/\partial\theta$. For $n$ sufficiently large, we expect $G_n'(\theta_0)$ to be nonsingular, so that upon rearrangement

$$\sqrt{n}\,(\hat{\theta} - \theta_0) = \left\{-G_n'(\theta_0)\right\}^{-1} \left\{\sqrt{n}\, G_n(\theta_0) + \sqrt{n}\, R_n\right\}. \tag{7.4}$$

Define $\psi'(y, \theta) = \partial \psi(y, \theta)/\partial\theta$ and

$$A(\theta_0) = E_F\left\{-\psi'(Y_1, \theta_0)\right\}, \tag{7.5}$$

$$B(\theta_0) = E_F\left\{\psi(Y_1, \theta_0)\,\psi(Y_1, \theta_0)^{T}\right\}. \tag{7.6}$$

Under suitable regularity conditions, as $n \to \infty$,

$$-G_n'(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left\{-\psi'(Y_i, \theta_0)\right\} \xrightarrow{p} A(\theta_0), \tag{7.7}$$

$$\sqrt{n}\, G_n(\theta_0) \xrightarrow{d} N\{0, B(\theta_0)\}, \tag{7.8}$$

and

$$\sqrt{n}\, R_n \xrightarrow{p} 0. \tag{7.9}$$

Putting (7.1) and (7.4)–(7.9) together with Slutsky's Theorem, we have that

$$\hat{\theta} \text{ is } AN\!\left(\theta_0, \frac{V(\theta_0)}{n}\right) \text{ as } n \to \infty, \tag{7.10}$$

where $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$. The limiting covariance matrix $V(\theta_0)$ is called the sandwich matrix because the "meat" $B(\theta_0)$ is placed between the "bread" $A(\theta_0)^{-1}$ and $\{A(\theta_0)^{-1}\}^{T}$. If $A(\theta_0)$ exists, the Weak Law of Large Numbers gives (7.7). If $B(\theta_0)$ exists, then (7.8) follows from the Central Limit Theorem. The hard part to prove is (7.9). Huber (1967) was the first to give general results for (7.9), but there have been many others since then (see, e.g., Serfling 1980, Ch. 7). Theorem 7.2 (p. 328) in the Appendix to this chapter gives conditions for (7.10), and a by-product of its proof is verification of (7.9).

Extension. Suppose that instead of (7.1, p. 298), $\hat{\theta}$ satisfies

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}) = c_n, \tag{7.11}$$

where $c_n/\sqrt{n} \xrightarrow{p} 0$ as $n \to \infty$. Following the above arguments and noting that $c_n/\sqrt{n}$ is absorbed in $\sqrt{n}\, R_n$ of (7.4), we can see that as long as (7.11), (7.4), and (7.7)–(7.9) hold, then (7.10) also holds. This extension allows us to cover a much wider class of statistics including empirical quantiles, estimators whose $\psi$ function depends on $n$, and Bayesian estimators.
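Result (7.10) can be checked quickly by simulation. The sketch below is our own illustration (not from the text) and assumes NumPy; it uses the simple choice $\psi(y, \theta) = y - \theta$, for which $A(\theta_0) = 1$ and $B(\theta_0) = \mathrm{Var}(Y_1)$, so the sandwich variance is just $\mathrm{Var}(Y_1)$ and the Monte Carlo variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ should be close to that value even for skewed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000
theta0 = 1.0                                   # true mean of an Exponential(1) population
# For psi(y, theta) = y - theta the M-estimator is the sample mean,
# A(theta0) = 1 and B(theta0) = Var(Y1) = 1, so V(theta0) = 1.
samples = rng.exponential(theta0, size=(reps, n))
root_n_error = np.sqrt(n) * (samples.mean(axis=1) - theta0)
print(root_n_error.var())                      # should be close to V(theta0) = 1
```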

7.2.1 Estimators for A, B, and V

For maximum likelihood estimation, $\psi(y, \theta) = \partial \log f(y;\theta)/\partial\theta^{T}$ is often called the score function. If the data truly come from the assumed parametric family $f(y;\theta)$, then $A(\theta_0) = B(\theta_0) = I(\theta_0)$, the information matrix. Note that $A(\theta_0)$ is Definition 2 of $I(\theta_0)$ in (2.33, p. 66), and $B(\theta_0)$ is Definition 1 of $I(\theta_0)$ in (2.29, p. 64). In this case the sandwich matrix $V(\theta_0)$ reduces to the usual $I(\theta_0)^{-1}$.

One of the key contributions of M-estimation theory has been to point out what happens when the assumed parametric family is not correct. In such cases there is often a well-defined $\theta_0$ satisfying (7.3, p. 300) and $\hat{\theta}$ satisfying (7.1, p. 298), but $A(\theta_0) \ne B(\theta_0)$, and valid inference should be carried out using the correct limiting covariance matrix $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$, not $I(\theta_0)^{-1}$.

Using the left-hand side of (7.7, p. 300), we define the empirical estimator of $A(\theta_0)$ by

$$A_n(\boldsymbol{Y}, \hat{\theta}) = -G_n'(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left\{-\psi'(Y_i, \hat{\theta})\right\}.$$
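To see this distinction numerically, the following sketch (our illustration, assuming NumPy; the exponential model with mean $\theta$ and its score function are our choices, not the book's) compares the sample averages of $-\psi'$ and $\psi\psi^{T}$, which estimate $A(\theta_0)$ and $B(\theta_0)$ of (7.5) and (7.6). The two nearly agree when the data really come from the assumed family and differ when they do not.

```python
import numpy as np

# Score function of the Exponential family with mean theta: f(y; theta) = exp(-y/theta)/theta.
def psi(y, theta):
    return y / theta**2 - 1.0 / theta

def psi_prime(y, theta):
    return -2.0 * y / theta**3 + 1.0 / theta**2

def A_hat(y, theta):
    return np.mean(-psi_prime(y, theta))   # average of -psi'

def B_hat(y, theta):
    return np.mean(psi(y, theta) ** 2)     # average of psi * psi (scalar case)

rng = np.random.default_rng(3)
y_correct = rng.exponential(2.0, size=10_000)    # data really from the assumed family
y_wrong = rng.lognormal(0.0, 1.0, size=10_000)   # data from a different distribution

for y in (y_correct, y_wrong):
    theta_hat = y.mean()                          # MLE of theta under the exponential model
    print(A_hat(y, theta_hat), B_hat(y, theta_hat))  # approximately equal only in the first case
```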

Note that for maximum likelihood estimation, $A_n(\boldsymbol{Y}, \hat{\theta})$ is the average observed information matrix $\bar{I}(\boldsymbol{Y}, \hat{\theta})$ (see 2.34, p. 66). Similarly, the empirical estimator of $B(\theta_0)$ is

$$B_n(\boldsymbol{Y}, \hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \psi(Y_i, \hat{\theta})\,\psi(Y_i, \hat{\theta})^{T}.$$

The sandwich form of these matrix estimators yields the empirical sandwich estimator

$$V_n(\boldsymbol{Y}, \hat{\theta}) = A_n(\boldsymbol{Y}, \hat{\theta})^{-1}\, B_n(\boldsymbol{Y}, \hat{\theta})\, \{A_n(\boldsymbol{Y}, \hat{\theta})^{-1}\}^{T}. \tag{7.12}$$

$V_n(\boldsymbol{Y}, \hat{\theta})$ is generally consistent for $V(\theta_0)$ under mild regularity conditions (see Theorem 7.3, p. 329, and Theorem 7.4, p. 330, in the Appendix to this chapter). Calculation of $V_n(\boldsymbol{Y}, \hat{\theta})$ requires no analytic work beyond specifying $\psi$.

In some problems, it is simpler to work directly with the limiting form $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$, plugging in estimators for $\theta_0$ and any other unknown quantities in $V(\theta_0)$. The notation $V(\theta_0)$ suggests that $\theta_0$ is the only unknown quantity in $V(\theta_0)$, but in reality $V(\theta_0)$ often involves higher moments or other characteristics of the distribution function $F$ of $Y_i$. In fact there is a range of possibilities for estimating $V(\theta_0)$ depending on what model assumptions are used. For simplicity, we use the notation $V_n(\boldsymbol{Y}, \hat{\theta})$ for the purely empirical estimator and $V(\hat{\theta})$ for any of the versions based on expected value plus model assumptions.

For maximum likelihood estimation with a correctly specified family, the three competing estimators for $I(\theta)^{-1}$ are $V_n(\boldsymbol{Y}, \hat{\theta})$, $\bar{I}(\boldsymbol{Y}, \hat{\theta})^{-1} = A_n(\boldsymbol{Y}, \hat{\theta})^{-1}$, and $I(\hat{\theta})^{-1} = V(\hat{\theta})$. In this case the standard estimators $\bar{I}(\boldsymbol{Y}, \hat{\theta})^{-1}$ and $I(\hat{\theta})^{-1}$ are generally more efficient than $V_n(\boldsymbol{Y}, \hat{\theta})$ for estimating $I(\theta)^{-1}$. Clearly, for maximum likelihood estimation with a correctly specified family, no estimator can have smaller asymptotic variance for estimating $I(\theta)^{-1}$ than $I(\hat{\theta}_{\mathrm{MLE}})^{-1}$. Now we illustrate these ideas with examples.
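A minimal sketch of the "no analytic work beyond $\psi$" point, assuming NumPy only: the helper below (the name `empirical_sandwich` is ours, not the book's) computes $A_n$, $B_n$, and $V_n$ of (7.12) directly from a supplied $\psi$ function, with $\psi'$ obtained by numerical differentiation, so that nothing beyond $\psi$ itself has to be provided.

```python
import numpy as np

def empirical_sandwich(psi, y, theta_hat, eps=1e-6):
    """Empirical A_n, B_n, and V_n of (7.12), with the derivative of psi taken
    numerically, so nothing beyond the psi function itself has to be supplied."""
    theta_hat = np.atleast_1d(np.asarray(theta_hat, dtype=float))
    b = theta_hat.size
    psi_i = np.array([np.atleast_1d(psi(yi, theta_hat)) for yi in y])   # n x b
    B = psi_i.T @ psi_i / len(y)
    A = np.zeros((b, b))
    for j in range(b):                     # central-difference Jacobian, column j
        step = np.zeros(b)
        step[j] = eps
        dpsi = [(np.atleast_1d(psi(yi, theta_hat + step))
                 - np.atleast_1d(psi(yi, theta_hat - step))) / (2 * eps)
                for yi in y]
        A[:, j] = -np.mean(dpsi, axis=0)
    A_inv = np.linalg.inv(A)
    return A, B, A_inv @ B @ A_inv.T       # A_n, B_n, V_n

# Example: for psi(y, theta) = y - theta, V_n is just the sample variance,
# and V_n / n estimates the variance of the sample mean.
y = np.random.default_rng(5).exponential(size=200)
A_n, B_n, V_n = empirical_sandwich(lambda yi, t: yi - t, y, theta_hat=y.mean())
print(V_n / len(y))
```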

7.2.2 Sample Mean and Variance

Let $\hat{\theta} = (\bar{Y}, s_n^2)^{T}$, the sample mean and variance. Here

$$\psi(Y_i, \theta) = \begin{pmatrix} Y_i - \theta_1 \\ (Y_i - \theta_1)^2 - \theta_2 \end{pmatrix}.$$

The first component, $\hat{\theta}_1 = \bar{Y}$, satisfies $\sum (Y_i - \hat{\theta}_1) = 0$, and is by itself an M-estimator. The second component $\hat{\theta}_2 = s_n^2 = n^{-1}\sum (Y_i - \bar{Y})^2$, when considered by itself, is not an M-estimator. However, when combined with $\hat{\theta}_1$, the pair $(\hat{\theta}_1, \hat{\theta}_2)^{T}$ is a $2 \times 1$ M-estimator, so that $\hat{\theta}_2$ satisfies our definition of a partial M-estimator.
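A hedged numerical sketch of the sandwich calculation for this $\psi$ (our illustration, assuming NumPy; the data-generating choice is arbitrary): since $(\bar{Y}, s_n^2)$ solves (7.1) exactly, we can plug it in and form $A_n$, $B_n$, and $V_n$ as in (7.12); $V_n/n$ then approximates the joint covariance matrix of the sample mean and variance.

```python
import numpy as np

def psi(y, theta):
    t1, t2 = theta
    return np.array([y - t1, (y - t1) ** 2 - t2])

rng = np.random.default_rng(11)
y = rng.gamma(2.0, size=500)                # any (here, skewed) sample
theta_hat = np.array([y.mean(), y.var()])   # (Ybar, s_n^2) solves (7.1) exactly

n = len(y)
psi_i = np.array([psi(yi, theta_hat) for yi in y])           # n x 2 matrix of psi values
B_n = psi_i.T @ psi_i / n
# For this psi, -psi'(y, theta) = [[1, 0], [2(y - theta_1), 1]]:
A_n = np.mean([[[1.0, 0.0], [2 * (yi - theta_hat[0]), 1.0]] for yi in y], axis=0)
A_inv = np.linalg.inv(A_n)
V_n = A_inv @ B_n @ A_inv.T
print(V_n / n)    # approximate covariance matrix of (Ybar, s_n^2)
```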