Chapter 7 M-Estimation (Estimating Equations)

7.1 Introduction

In Chapter 1 we made the distinction between the parts of a fully specified model. The primary part is the part that is most important for answering the underlying scientific questions. The secondary part consists of all the remaining details of the model. Usually the primary part is the mean or systematic part of the model, and the secondary part is mainly concerned with the distributional assumptions about the random part of the model. The full specification of the model is important for constructing the likelihood and for using the associated classical methods of inference as spelled out in Chapters 2 and 3 and supported by the asymptotic results of Chapter 6. However, we are now ready to consider robustifying the inference so that misspecification of some secondary assumptions does not invalidate the resulting inferential methods. Basically this robustified inference relies on replacing the information matrix inverse $I(\theta)^{-1}$ in the asymptotic normality result for the maximum likelihood estimator $\hat{\theta}$ by a generalization $I(\theta)^{-1}B(\theta)I(\theta)^{-1}$ called the sandwich matrix. In correctly specified models, $I(\theta) = B(\theta)$, and the sandwich matrix just reduces to the usual $I(\theta)^{-1}$. When the model is not correctly specified, $I(\theta) \ne B(\theta)$, and the sandwich matrix is important for obtaining approximately valid inference. Thus, use of this more general result accommodates misspecification but is still appropriate in correctly specified models, although its use there in small samples can entail some loss of efficiency relative to standard likelihood inference.

Development of this robustified inference for likelihood-based models leads to a more general context. As discussed in Chapter 6, the asymptotic normality properties of maximum likelihood estimators follow from Taylor expansion of the likelihood equation $S(\theta) = \sum_{i=1}^{n} \partial \log f(Y_i;\theta)/\partial\theta^{T} = 0$. The more general approach is then to define an estimator of interest as the solution of an estimating equation, but without the equation necessarily coming from the derivative of a log likelihood. For historical reasons and for motivation from maximum likelihood, this more general approach

is called M-estimation. In recent years the approach is often referred to loosely as estimating equations. This chapter borrows heavily from the systematic description of M-estimation in Stefanski and Boos (2002).

M-estimators are solutions of the vector equation $\sum_{i=1}^{n} \psi(Y_i, \theta) = 0$. That is, the M-estimator $\hat{\theta}$ satisfies

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}) = 0. \tag{7.1}$$
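Before describing the ingredients of (7.1) in more detail, here is a small computational sketch (our illustration, not from the text; it assumes NumPy and SciPy are available) that solves the estimating equation with a general-purpose root finder, using the simple choice $\psi(y, \theta) = y - \theta$, whose solution should coincide with the sample mean.

```python
import numpy as np
from scipy.optimize import root

def psi(y, theta):
    """Illustrative psi: psi(y, theta) = y - theta, whose root is the sample mean."""
    return y - theta

def solve_m_estimator(psi, y, theta_start):
    """Solve the estimating equation sum_i psi(Y_i, theta) = 0 numerically."""
    g = lambda theta: np.sum([psi(yi, theta) for yi in y], axis=0)
    return root(g, theta_start).x

y = np.random.default_rng(1).exponential(size=50)
theta_hat = solve_m_estimator(psi, y, theta_start=np.array([0.0]))
print(theta_hat, y.mean())   # the two should agree (up to solver tolerance)
```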

Here we are assuming that $Y_1, \ldots, Y_n$ are independent but not necessarily identically distributed, $\theta$ is a $b$-dimensional parameter, and $\psi$ is a known $(b \times 1)$-function that does not depend on $i$ or $n$. In this description $Y_i$ represents the $i$th datum. In some applications it is advantageous to emphasize the dependence of $\psi$ on particular components of $Y_i$. For example, in a regression problem $\boldsymbol{Y}_i = (x_i, Y_i)$, and (7.1) would typically be written

$$\sum_{i=1}^{n} \psi(Y_i, x_i, \hat{\theta}) = 0, \tag{7.2}$$

where $x_i$ is the $i$th regressor. Huber (1964, 1967) introduced M-estimators and their asymptotic properties, and they were an important part of the development of modern statistics. Liang and Zeger (1986) helped popularize M-estimators in the literature under the name generalized estimating equations (GEE). Obviously, many others have made important contributions. For example, Godambe (1960) introduced the concept of an optimum estimating function in an M-estimator context, and that paper could be called a forerunner of the M-estimator approach.

There is a large literature on M-estimation and estimating equations. We will not attempt to survey this literature or document its development. Rather we want to show that the M-estimator approach is simple, powerful, and widely applicable. We especially want students to feel comfortable finding and using the asymptotic approximations that flow from the method.

One key advantage of the approach is that a very large class of asymptotically normal statistics including delta method transformations can be put in the general M-estimator framework. This unifies large sample approximation methods, simplifies analysis, and makes computations routine, although sometimes tedious. Fortunately, the tedious derivative and matrix calculations often can be performed symbolically with programs such as Maple and Mathematica.

Many estimators not typically thought of as M-estimators can be written in the form of M-estimators. Consider as a simple example the mean deviation from the sample mean

$$\hat{\theta}_1 = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \bar{Y}|.$$

Is this an M-estimator? There is certainly no single equation of the form

$$\sum_{i=1}^{n} \psi(Y_i, \theta) = 0$$

that yields $\hat{\theta}_1$. Moreover, there is no family of densities $f(y;\theta)$ such that $\hat{\theta}_1$ is a component of the maximum likelihood estimator of $\theta$. But if we let $\psi_1(y, \theta_1, \theta_2) = |y - \theta_2| - \theta_1$ and $\psi_2(y, \theta_1, \theta_2) = y - \theta_2$, then

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}_1, \hat{\theta}_2) = \begin{pmatrix} \sum_{i=1}^{n} |Y_i - \hat{\theta}_2| - n\hat{\theta}_1 \\[4pt] \sum_{i=1}^{n} Y_i - n\hat{\theta}_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

yields $\hat{\theta}_2 = \bar{Y}$ and $\hat{\theta}_1 = (1/n)\sum_{i=1}^{n} |Y_i - \bar{Y}|$. We like to use the term "partial M-estimator" for an estimator that is not naturally an M-estimator until additional $\psi$ functions are added. The key idea is simple: any estimator that would be an M-estimator if certain parameters were known is a partial M-estimator, because we can "stack" $\psi$ functions for each of the unknown parameters (a numerical sketch of this stacking idea is given at the end of this section). This aspect of M-estimators is related to the general approach of Randles (1982) for replacing unknown parameters by estimators.

From the above example it should be obvious that we can replace $\hat{\theta}_2 = \bar{Y}$ by any other estimator defined by an estimating equation; for example, the sample median. Moreover, we can also add $\psi$ functions to give delta method asymptotic results for transformations of parameters, for example, $\hat{\theta}_3 = \log(\hat{\theta}_1)$; see Examples 7.2.3 (p. 304) and 7.2.4 (p. 305) and also Benichou and Gail (1989).

The combination of "approximation by averages" and "delta theorem" methodology from Chapter 5 can handle a larger class of problems than the enhanced M-estimation approach described in this chapter. However, enhanced M-estimator methods, implemented with the aid of symbolic mathematics software (for deriving analytic expressions) and standard numerical routines for derivatives and matrix algebra (for obtaining numerical estimates), provide a unified approach that is simple in implementation, easily taught, and applicable to a broad class of complex problems.

A description of the basic approach is given in Section 7.2 along with a few examples. Connections to the influence curve are given in Section 7.3, and then extensions for nonsmooth $\psi$ functions are given in Section 7.4. Extensions for regression are given in Section 7.5. A discussion of a testing problem is given in Section 7.6, and Section 7.7 summarizes the key features of the M-estimator method. The Appendix gives theorems for the consistency and asymptotic normality of $\hat{\theta}$ as well as Weak Laws of Large Numbers for averages of summands with estimated parameters.
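Here is the numerical sketch of the stacking idea promised above (our illustration, assuming NumPy and SciPy): the two $\psi$ components of the mean-deviation example are stacked, the resulting two-dimensional equation is solved with a root finder, and the answer is checked against $\bar{Y}$ and $(1/n)\sum |Y_i - \bar{Y}|$.

```python
import numpy as np
from scipy.optimize import root

def psi_stacked(y, theta):
    """Stacked psi for (theta_1, theta_2) = (mean deviation, mean)."""
    t1, t2 = theta
    return np.array([np.abs(y - t2) - t1,    # psi_1(y; theta_1, theta_2)
                     y - t2])                # psi_2(y; theta_1, theta_2)

def G(theta, y):
    """Sum of the stacked psi over the sample; the M-estimator is a root of G."""
    return np.sum([psi_stacked(yi, theta) for yi in y], axis=0)

y = np.random.default_rng(7).normal(size=100)
theta_hat = root(G, x0=np.array([1.0, 0.0]), args=(y,)).x
print(theta_hat)                                   # stacked M-estimator
print(np.mean(np.abs(y - y.mean())), y.mean())     # closed forms from the text
```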

7.2 The Basic Approach

M-estimators solve (7.1, p. 298), where the vector function $\psi$ must be a known function that does not depend on $i$ or $n$. For regression situations, the argument of $\psi$ is expanded to depend on regressors $x_i$, but the basic $\psi$ still does not depend on $i$. For the moment we confine ourselves to the iid case where $Y_1, \ldots, Y_n$ are iid (possibly vector-valued) with distribution function $F$. The true parameter value $\theta_0$ is defined by

$$E_F\, \psi(Y_1, \theta_0) = \int \psi(y, \theta_0)\, dF(y) = 0. \tag{7.3}$$

For example, if $\psi(Y_i, \theta) = Y_i - \theta$, then clearly the population mean $\theta_0 = \int y\, dF(y)$ is the unique solution of $\int (y - \theta)\, dF(y) = 0$. If there is one unique $\theta_0$ satisfying (7.3), then in general there exists a sequence of M-estimators $\hat{\theta}$ such that the Weak Law of Large Numbers leads to $\hat{\theta} \xrightarrow{p} \theta_0$ as $n \to \infty$. These types of results are similar to the consistency results discussed in Chapter 6. Theorem 7.1 (p. 327) in this chapter gives one such result for compact parameter spaces. Furthermore, if $\psi$ is suitably smooth, then Taylor expansion of $G_n(\theta) = n^{-1}\sum_{i=1}^{n} \psi(Y_i, \theta)$ gives

$$0 = G_n(\hat{\theta}) = G_n(\theta_0) + G_n'(\theta_0)\,(\hat{\theta} - \theta_0) + R_n,$$

where $G_n'(\theta) = \partial G_n(\theta)/\partial\theta$. For $n$ sufficiently large, we expect $G_n'(\theta_0)$ to be nonsingular, so that upon rearrangement

$$\sqrt{n}\,(\hat{\theta} - \theta_0) = \left\{-G_n'(\theta_0)\right\}^{-1} \left\{\sqrt{n}\, G_n(\theta_0) + \sqrt{n}\, R_n\right\}. \tag{7.4}$$

Define $\psi'(y, \theta) = \partial \psi(y, \theta)/\partial\theta$ and

$$A(\theta_0) = E_F\left\{-\psi'(Y_1, \theta_0)\right\}, \tag{7.5}$$

$$B(\theta_0) = E_F\left\{\psi(Y_1, \theta_0)\,\psi(Y_1, \theta_0)^{T}\right\}. \tag{7.6}$$

Under suitable regularity conditions, as $n \to \infty$,

$$-G_n'(\theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left\{-\psi'(Y_i, \theta_0)\right\} \xrightarrow{p} A(\theta_0), \tag{7.7}$$

$$\sqrt{n}\, G_n(\theta_0) \xrightarrow{d} N\{0, B(\theta_0)\}, \tag{7.8}$$

and

$$\sqrt{n}\, R_n \xrightarrow{p} 0. \tag{7.9}$$

Putting (7.1) and (7.4)–(7.9) together with Slutsky's Theorem, we have that

$$\hat{\theta} \text{ is } AN\!\left(\theta_0, \frac{V(\theta_0)}{n}\right) \text{ as } n \to \infty, \tag{7.10}$$

where $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$. The limiting covariance matrix $V(\theta_0)$ is called the sandwich matrix because the "meat" $B(\theta_0)$ is placed between the "bread" $A(\theta_0)^{-1}$ and $\{A(\theta_0)^{-1}\}^{T}$. If $A(\theta_0)$ exists, the Weak Law of Large Numbers gives (7.7). If $B(\theta_0)$ exists, then (7.8) follows from the Central Limit Theorem. The hard part to prove is (7.9). Huber (1967) was the first to give general results for (7.9), but there have been many others since then (see, e.g., Serfling 1980, Ch. 7). Theorem 7.2 (p. 328) in the Appendix to this chapter gives conditions for (7.10), and a by-product of its proof is verification of (7.9).

Extension. Suppose that instead of (7.1, p. 298), $\hat{\theta}$ satisfies

$$\sum_{i=1}^{n} \psi(Y_i, \hat{\theta}) = c_n, \tag{7.11}$$

where $c_n/\sqrt{n} \xrightarrow{p} 0$ as $n \to \infty$. Following the above arguments and noting that $c_n/\sqrt{n}$ is absorbed in $\sqrt{n}\, R_n$ of (7.4), we can see that as long as (7.11), (7.4), and (7.7)–(7.9) hold, then (7.10) also holds. This extension allows us to cover a much wider class of statistics including empirical quantiles, estimators whose $\psi$ function depends on $n$, and Bayesian estimators.
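Result (7.10) can be checked quickly by simulation. The sketch below is our own illustration (not from the text) and assumes NumPy; it uses the simple choice $\psi(y, \theta) = y - \theta$, for which $A(\theta_0) = 1$ and $B(\theta_0) = \mathrm{Var}(Y_1)$, so the sandwich variance is just $\mathrm{Var}(Y_1)$ and the Monte Carlo variance of $\sqrt{n}(\hat{\theta} - \theta_0)$ should be close to that value even for skewed data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000
theta0 = 1.0                                   # true mean of an Exponential(1) population
# For psi(y, theta) = y - theta the M-estimator is the sample mean,
# A(theta0) = 1 and B(theta0) = Var(Y1) = 1, so V(theta0) = 1.
samples = rng.exponential(theta0, size=(reps, n))
root_n_error = np.sqrt(n) * (samples.mean(axis=1) - theta0)
print(root_n_error.var())                      # should be close to V(theta0) = 1
```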

7.2.1 Estimators for A, B, and V

For maximum likelihood estimation, $\psi(y, \theta) = \partial \log f(y;\theta)/\partial\theta^{T}$ is often called the score function. If the data truly come from the assumed parametric family $f(y;\theta)$, then $A(\theta_0) = B(\theta_0) = I(\theta_0)$, the information matrix. Note that $A(\theta_0)$ is Definition 2 of $I(\theta_0)$ in (2.33, p. 66), and $B(\theta_0)$ is Definition 1 of $I(\theta_0)$ in (2.29, p. 64). In this case the sandwich matrix $V(\theta_0)$ reduces to the usual $I(\theta_0)^{-1}$.

One of the key contributions of M-estimation theory has been to point out what happens when the assumed parametric family is not correct. In such cases there is often a well-defined $\theta_0$ satisfying (7.3, p. 300) and $\hat{\theta}$ satisfying (7.1, p. 298), but $A(\theta_0) \ne B(\theta_0)$, and valid inference should be carried out using the correct limiting covariance matrix $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$, not $I(\theta_0)^{-1}$.

Using the left-hand side of (7.7, p. 300), we define the empirical estimator of $A(\theta_0)$ by

$$A_n(\boldsymbol{Y}, \hat{\theta}) = -G_n'(\hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n}\left\{-\psi'(Y_i, \hat{\theta})\right\}.$$
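To see this distinction numerically, the following sketch (our illustration, assuming NumPy; the exponential model with mean $\theta$ and its score function are our choices, not the book's) compares the sample averages of $-\psi'$ and $\psi\psi^{T}$, which estimate $A(\theta_0)$ and $B(\theta_0)$ of (7.5) and (7.6). The two nearly agree when the data really come from the assumed family and differ when they do not.

```python
import numpy as np

# Score function of the Exponential family with mean theta: f(y; theta) = exp(-y/theta)/theta.
def psi(y, theta):
    return y / theta**2 - 1.0 / theta

def psi_prime(y, theta):
    return -2.0 * y / theta**3 + 1.0 / theta**2

def A_hat(y, theta):
    return np.mean(-psi_prime(y, theta))   # average of -psi'

def B_hat(y, theta):
    return np.mean(psi(y, theta) ** 2)     # average of psi * psi (scalar case)

rng = np.random.default_rng(3)
y_correct = rng.exponential(2.0, size=10_000)    # data really from the assumed family
y_wrong = rng.lognormal(0.0, 1.0, size=10_000)   # data from a different distribution

for y in (y_correct, y_wrong):
    theta_hat = y.mean()                          # MLE of theta under the exponential model
    print(A_hat(y, theta_hat), B_hat(y, theta_hat))  # approximately equal only in the first case
```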

Note that for maximum likelihood estimation, $A_n(\boldsymbol{Y}, \hat{\theta})$ is the average observed information matrix $\bar{I}(\boldsymbol{Y}, \hat{\theta})$ (see 2.34, p. 66). Similarly, the empirical estimator of $B(\theta_0)$ is

$$B_n(\boldsymbol{Y}, \hat{\theta}) = \frac{1}{n}\sum_{i=1}^{n} \psi(Y_i, \hat{\theta})\,\psi(Y_i, \hat{\theta})^{T}.$$

The sandwich form of these matrix estimators yields the empirical sandwich estimator

$$V_n(\boldsymbol{Y}, \hat{\theta}) = A_n(\boldsymbol{Y}, \hat{\theta})^{-1}\, B_n(\boldsymbol{Y}, \hat{\theta})\, \{A_n(\boldsymbol{Y}, \hat{\theta})^{-1}\}^{T}. \tag{7.12}$$

$V_n(\boldsymbol{Y}, \hat{\theta})$ is generally consistent for $V(\theta_0)$ under mild regularity conditions (see Theorem 7.3, p. 329, and Theorem 7.4, p. 330, in the Appendix to this chapter). Calculation of $V_n(\boldsymbol{Y}, \hat{\theta})$ requires no analytic work beyond specifying $\psi$.

In some problems, it is simpler to work directly with the limiting form $V(\theta_0) = A(\theta_0)^{-1} B(\theta_0)\{A(\theta_0)^{-1}\}^{T}$, plugging in estimators for $\theta_0$ and any other unknown quantities in $V(\theta_0)$. The notation $V(\theta_0)$ suggests that $\theta_0$ is the only unknown quantity in $V(\theta_0)$, but in reality $V(\theta_0)$ often involves higher moments or other characteristics of the distribution function $F$ of $Y_i$. In fact there is a range of possibilities for estimating $V(\theta_0)$ depending on what model assumptions are used. For simplicity, we use the notation $V_n(\boldsymbol{Y}, \hat{\theta})$ for the purely empirical estimator and $V(\hat{\theta})$ for any of the versions based on expected value plus model assumptions.

For maximum likelihood estimation with a correctly specified family, the three competing estimators for $I(\theta)^{-1}$ are $V_n(\boldsymbol{Y}, \hat{\theta})$, $\bar{I}(\boldsymbol{Y}, \hat{\theta})^{-1} = A_n(\boldsymbol{Y}, \hat{\theta})^{-1}$, and $I(\hat{\theta})^{-1} = V(\hat{\theta})$. In this case the standard estimators $\bar{I}(\boldsymbol{Y}, \hat{\theta})^{-1}$ and $I(\hat{\theta})^{-1}$ are generally more efficient than $V_n(\boldsymbol{Y}, \hat{\theta})$ for estimating $I(\theta)^{-1}$. Clearly, for maximum likelihood estimation with a correctly specified family, no estimator can have smaller asymptotic variance for estimating $I(\theta)^{-1}$ than $I(\hat{\theta}_{\mathrm{MLE}})^{-1}$. Now we illustrate these ideas with examples.
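A minimal sketch of the "no analytic work beyond $\psi$" point, assuming NumPy only: the helper below (the name `empirical_sandwich` is ours, not the book's) computes $A_n$, $B_n$, and $V_n$ of (7.12) directly from a supplied $\psi$ function, with $\psi'$ obtained by numerical differentiation, so that nothing beyond $\psi$ itself has to be provided.

```python
import numpy as np

def empirical_sandwich(psi, y, theta_hat, eps=1e-6):
    """Empirical A_n, B_n, and V_n of (7.12), with the derivative of psi taken
    numerically, so nothing beyond the psi function itself has to be supplied."""
    theta_hat = np.atleast_1d(np.asarray(theta_hat, dtype=float))
    b = theta_hat.size
    psi_i = np.array([np.atleast_1d(psi(yi, theta_hat)) for yi in y])   # n x b
    B = psi_i.T @ psi_i / len(y)
    A = np.zeros((b, b))
    for j in range(b):                     # central-difference Jacobian, column j
        step = np.zeros(b)
        step[j] = eps
        dpsi = [(np.atleast_1d(psi(yi, theta_hat + step))
                 - np.atleast_1d(psi(yi, theta_hat - step))) / (2 * eps)
                for yi in y]
        A[:, j] = -np.mean(dpsi, axis=0)
    A_inv = np.linalg.inv(A)
    return A, B, A_inv @ B @ A_inv.T       # A_n, B_n, V_n

# Example: for psi(y, theta) = y - theta, V_n is just the sample variance,
# and V_n / n estimates the variance of the sample mean.
y = np.random.default_rng(5).exponential(size=200)
A_n, B_n, V_n = empirical_sandwich(lambda yi, t: yi - t, y, theta_hat=y.mean())
print(V_n / len(y))
```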

7.2.2 Sample Mean and Variance

Let $\hat{\theta} = (\bar{Y}, s_n^2)^{T}$, the sample mean and variance. Here

$$\psi(Y_i, \theta) = \begin{pmatrix} Y_i - \theta_1 \\ (Y_i - \theta_1)^2 - \theta_2 \end{pmatrix}.$$

The first component, $\hat{\theta}_1 = \bar{Y}$, satisfies $\sum (Y_i - \hat{\theta}_1) = 0$, and is by itself an M-estimator. The second component $\hat{\theta}_2 = s_n^2 = n^{-1}\sum (Y_i - \bar{Y})^2$, when considered by itself, is not an M-estimator. However, when combined with $\hat{\theta}_1$, the pair $(\hat{\theta}_1, \hat{\theta}_2)^{T}$ is a $2 \times 1$ M-estimator, so that $\hat{\theta}_2$ satisfies our definition of a partial M-estimator.
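A hedged numerical sketch of the sandwich calculation for this $\psi$ (our illustration, assuming NumPy; the data-generating choice is arbitrary): since $(\bar{Y}, s_n^2)$ solves (7.1) exactly, we can plug it in and form $A_n$, $B_n$, and $V_n$ as in (7.12); $V_n/n$ then approximates the joint covariance matrix of the sample mean and variance.

```python
import numpy as np

def psi(y, theta):
    t1, t2 = theta
    return np.array([y - t1, (y - t1) ** 2 - t2])

rng = np.random.default_rng(11)
y = rng.gamma(2.0, size=500)                # any (here, skewed) sample
theta_hat = np.array([y.mean(), y.var()])   # (Ybar, s_n^2) solves (7.1) exactly

n = len(y)
psi_i = np.array([psi(yi, theta_hat) for yi in y])           # n x 2 matrix of psi values
B_n = psi_i.T @ psi_i / n
# For this psi, -psi'(y, theta) = [[1, 0], [2(y - theta_1), 1]]:
A_n = np.mean([[[1.0, 0.0], [2 * (yi - theta_hat[0]), 1.0]] for yi in y], axis=0)
A_inv = np.linalg.inv(A_n)
V_n = A_inv @ B_n @ A_inv.T
print(V_n / n)    # approximate covariance matrix of (Ybar, s_n^2)
```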