
The Calculus of M-Estimation



Leonard A. Stefanski and Dennis D. Boos

Abstract

Since the groundbreaking papers by Huber in the 1960s, M-estimation methods (estimating equations) have been increasingly important for asymptotic analysis and approximate inference. Now, with the prevalence of programs like Maple and Mathematica, calculation of asymptotic variances in complex problems is just a matter of routine manipulation. The intent of this article is to illustrate the depth and generality of the M-estimation approach and thereby facilitate its use.

KEY WORDS: Asymptotic variance; Estimating equations; Maple; M-estimation.

Institute of Statistics Mimeo Series, March

Leonard A. Stefanski and Dennis D. Boos are Professors, Department of Statistics, North Carolina State University, Raleigh, NC. Email addresses are stefanski@stat.ncsu.edu and boos@stat.ncsu.edu.

1. INTRODUCTION

M-estimators are solutions of the vector equation

    Σ_{i=1}^n ψ(Y_i, θ) = 0.

That is, the M-estimator θ̂ satisfies

    Σ_{i=1}^n ψ(Y_i, θ̂) = 0.    (1)

Here we are assuming that Y_1, ..., Y_n are independent but not necessarily identically distributed, θ is a p-dimensional parameter, and ψ is a known (p × 1)-function that does not depend on i or n. In this description Y_i represents the ith datum. In some applications it is advantageous to emphasize the dependence of ψ on particular components of Y_i. For example, in a regression problem Y_i = (x_i, Y_i), and (1) would typically be written

    Σ_{i=1}^n ψ(Y_i, x_i, θ̂) = 0,    (2)

where x_i is the ith regressor.

Huber (1964, 1967) introduced M-estimators and their asymptotic properties, and they were an important part of the development of modern robust statistics. Liang and Zeger (1986) helped popularize M-estimators in the biostatistics literature under the name generalized estimating equations (GEE). Obviously, many others have made important contributions. For example, Godambe (1960) introduced the concept of an optimum estimating function in an M-estimator context, and that paper could be called a forerunner of the M-estimator approach.

However, our goal is not to document the development of M-estimators or to give a bibliography of contributions to the literature. Rather, we want to show that the M-estimator approach is simple, powerful, and more widely applicable than many readers imagine. We want students to feel comfortable finding and using the asymptotic approximations that flow from the method. The key advantage is that a very large class of asymptotically normal statistics, including delta method transformations, can be put in the general M-estimator framework. This unifies large sample approximation methods, simplifies analysis, and makes computations routine.

An important practical consideration is the availability of a symbolic manipulation program like Maple; otherwise the matrix calculations can be overwhelming. Nevertheless, the general theory is straightforward.

We claim that many estimators not thought of as M-estimators can be written in the form of M-estimators. Consider as a simple example the mean deviation from the sample mean,

    θ̂_1 = (1/n) Σ_{i=1}^n |Y_i − Ȳ|.

Is this an M-estimator? There is certainly no single equation of the form

    Σ_{i=1}^n ψ(Y_i, θ_1) = 0

that yields θ̂_1. Moreover, there is no family of densities f(y; θ_1) such that θ̂_1 is a component of the maximum likelihood estimator of θ_1. But if we let ψ_1(y, θ_1, θ_2) = |y − θ_2| − θ_1 and ψ_2(y, θ_1, θ_2) = y − θ_2,

then

    Σ_{i=1}^n ψ(Y_i, θ̂_1, θ̂_2) = ( Σ_{i=1}^n ( |Y_i − θ̂_2| − θ̂_1 ) ) = ( 0 )
                                  ( Σ_{i=1}^n ( Y_i − θ̂_2 )          )   ( 0 )

yields θ̂_2 = Ȳ and θ̂_1 = n^{-1} Σ_{i=1}^n |Y_i − Ȳ|. We like to use the term partial M-estimator for an estimator that is not naturally an M-estimator until additional ψ functions are added. The key idea is simple: any estimator that would be an M-estimator if certain parameters were known is a partial M-estimator, because we can stack ψ functions for each of the unknown parameters. This aspect of M-estimators is related to the general approach of Randles (1982) for replacing unknown parameters by estimators.

From the above example it should be obvious that we can replace Ȳ by any other estimator defined by an estimating equation, for example, the sample median. Moreover, we can also add ψ functions to give delta method asymptotic results for transformations like θ̂_3 = log(θ̂_1). In this latter context there are connections to Benichou and Gail (1989).

Certainly the combination of standard influence curve and delta theorem methodology can handle a larger class of problems than this enhanced M-estimation approach. However, we believe that the combination of a single approach along with a symbolic manipulator like Maple will make this M-estimation approach much more likely to be used in complex problems.

A description of the basic approach is given in Section 2, along with a few examples. Connections to the influence curve are given in Section 3, and then extensions for nonsmooth ψ functions are given in Section 4. Extensions for regression are given in Section 5. A discussion of some testing problems is given in Section 6, and Section 7 summarizes the key features of the M-estimator method.

2. The Basic Approach

M-estimators solve (1), where the vector function ψ must be a known function that does not depend on i or n. For regression situations the argument of ψ will be expanded to depend on regressors x_i, but the basic ψ will still not depend on i. For the moment we will confine ourselves to the iid case, where Y_1, ..., Y_n are iid (possibly vector-valued) with distribution function F. The true parameter value θ_0 is defined by

    E_F[ψ(Y_1, θ_0)] = ∫ ψ(y, θ_0) dF(y) = 0.    (3)

For example, if ψ(Y_i, θ) = Y_i − θ, then clearly the population mean θ_0 = ∫ y dF(y) is the unique solution of ∫ (y − θ) dF(y) = 0.

If there is one unique θ_0 satisfying (3), then in general there exists a sequence of M-estimators θ̂ such that the weak law of large numbers leads to θ̂ →p θ_0 as n → ∞. Furthermore, if ψ is suitably smooth, then Taylor expansion of G_n(θ) = n^{-1} Σ_{i=1}^n ψ(Y_i, θ) gives

    0 = G_n(θ̂) = G_n(θ_0) + G_n'(θ_0)(θ̂ − θ_0) + R_n,    (4)

where G_n'(θ_0) = ∂G_n(θ)/∂θ^T evaluated at θ = θ_0. For n sufficiently large we expect −G_n'(θ_0) to be nonsingular, so that we can rearrange (4) and get

    √n (θ̂ − θ_0) = [−G_n'(θ_0)]^{-1} √n G_n(θ_0) + √n R_n*.

Under suitable regularity conditions, as n → ∞,

    −G_n'(θ_0) = (1/n) Σ_{i=1}^n [ −∂ψ(Y_i, θ_0)/∂θ^T ] →p E[ −∂ψ(Y_1, θ_0)/∂θ^T ] ≡ A(θ_0),    (5)

    √n G_n(θ_0) →d MVN(0, B(θ_0)),  where B(θ_0) = E[ψ(Y_1, θ_0) ψ(Y_1, θ_0)^T],    (6)

and

    √n R_n* →p 0.    (7)

If A(θ_0) exists, the Weak Law of Large Numbers gives (5). If B(θ_0) exists, then (6) follows from the Central Limit Theorem. The hard part to prove is (7). Huber (1967) was the first to give general results for (7), but there have been many others since then. We shall be content to observe that (7) holds in most regular situations where there are sufficient smoothness conditions on ψ and θ has fixed dimension p as n → ∞.

Putting (5)-(7) together with Slutsky's Theorem, we have that

    θ̂ is AMN(θ_0, V(θ_0)/n) as n → ∞,    (8)

where V(θ_0) = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T and AMN stands for asymptotically multivariate normal. The limiting covariance matrix V(θ_0) is called the sandwich matrix, because the "meat" B(θ_0) is placed between the "bread" A(θ_0)^{-1} and {A(θ_0)^{-1}}^T.
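To make the sandwich concrete, the following sketch (ours, not the authors'; the function name `sandwich` and the finite-difference derivative are illustrative conveniences) computes the empirical A_n, B_n, and V_n for a user-supplied ψ, here the two-component ψ estimating a mean and a variance:

```python
import numpy as np

def sandwich(psi, y, theta_hat, eps=1e-6):
    # Empirical sandwich estimator V_n = A_n^{-1} B_n {A_n^{-1}}^T,
    # with A_n obtained by a finite-difference derivative of psi.
    n, p = len(y), len(theta_hat)
    A = np.zeros((p, p))
    B = np.zeros((p, p))
    for yi in y:
        psi_i = psi(yi, theta_hat)
        B += np.outer(psi_i, psi_i) / n
        for j in range(p):
            th = theta_hat.copy()
            th[j] += eps
            # accumulate column j of A_n = -(1/n) sum d psi / d theta^T
            A[:, j] -= (psi(yi, th) - psi_i) / (eps * n)
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv.T

# two-component psi: psi = (y - t1, (y - t1)^2 - t2)
rng = np.random.default_rng(0)
y = rng.normal(size=500)
theta_hat = np.array([y.mean(), y.var()])
psi = lambda yi, t: np.array([yi - t[0], (yi - t[0]) ** 2 - t[1]])
V = sandwich(psi, y, theta_hat)
```

Because A_n is the identity for this ψ, V reduces to the matrix of sample central moments, as derived for this ψ later in this section.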

0 0

Extension. Suppose that instead of (1), θ̂ satisfies

    Σ_{i=1}^n ψ(Y_i, θ̂) = c_n,    (9)

where c_n/√n →p 0 as n → ∞. Following the above arguments and noting that c_n/√n is absorbed in the √n R_n* of (7), we can see that as long as (5) and (6) hold, then (8) will also hold. This extension allows us to cover a much wider class of statistics, including empirical quantiles, estimators whose ψ function depends on n, and Bayesian estimators.

For maximum likelihood estimation, ψ(y, θ) = ∂ log f(y; θ)/∂θ is often called the score function. If the data truly come from the assumed parametric family f(y; θ), then A(θ_0) = B(θ_0) = I(θ_0), the information matrix, and in this case the sandwich matrix V(θ_0) reduces to the usual I(θ_0)^{-1}. One of the key contributions of M-estimation theory has been to point out what happens when the assumed parametric family is not correct: in such cases there is often a well-defined θ_0 satisfying (3) and a θ̂ satisfying (8), but A(θ_0) ≠ B(θ_0), and valid inference should be carried out using the correct limiting covariance matrix V(θ_0) = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T, not I(θ_0)^{-1}.

Using the left-hand side of (5), we define the empirical estimator of A(θ_0) by

    A_n(Y, θ̂) = (1/n) Σ_{i=1}^n [ −∂ψ(Y_i, θ)/∂θ^T ] |_{θ=θ̂} = −G_n'(θ̂).

Note that for maximum likelihood estimation, n A_n(Y, θ̂) is the observed information matrix Ī(Y, θ̂).

Similarly, the empirical estimator of B(θ_0) is

    B_n(Y, θ̂) = (1/n) Σ_{i=1}^n ψ(Y_i, θ̂) ψ(Y_i, θ̂)^T.

Putting these together yields the empirical sandwich estimator

    V_n(Y, θ̂) = A_n(Y, θ̂)^{-1} B_n(Y, θ̂) {A_n(Y, θ̂)^{-1}}^T.

V_n(Y, θ̂) will generally be consistent for V(θ_0) under mild regularity conditions; see Iverson and Randles (1989) for a general theorem on this convergence.

b

V Y requires no analytic work b eyond sp ecifying the function In some problems it is

n

1 1 T

simpler to work directly with the limiting form V A B fA g and just plug

0 0 0 0

in estimators for and any other unknown quantities in V The notation V suggests that

0 0 0

is the only unknown quantity in V but in reality V often involves higher moments or

0 0 0

other characteristics of the true distribution function F of Y In fact there is a of p ossibilities

i

for estimating V dep ending on what mo del assumptions are used For simplicity we will just

0

b b

use the notation V Y for the purely empirical estimator and V for any of the exp ected

n

value plus mo del assumption versions

For maximum likelihood estimation with a correctly specified family, the three competing estimators for I(θ_0)^{-1} are V_n(Y, θ̂), Ī(Y, θ̂)^{-1} = [n A_n(Y, θ̂)]^{-1}, and I(θ̂)^{-1}. In this case the standard estimators Ī(Y, θ̂)^{-1} and I(θ̂)^{-1} are generally more efficient than V_n(Y, θ̂) for estimating I(θ_0)^{-1}. Clearly, nothing can have smaller asymptotic variance for estimating I(θ_0)^{-1} than I(θ̂_MLE)^{-1}.

Now we illustrate these ideas with examples

Example 1 (Sample mean and variance). Let θ̂ = (Ȳ, s_n²)^T, the sample mean and variance. Here

    ψ(Y_i, θ) = ( Y_i − θ_1          )
                ( (Y_i − θ_1)² − θ_2 ).

The first component θ̂_1 = Ȳ satisfies Σ (Y_i − θ̂_1) = 0 and is by itself an M-estimator. The second component, s_n² = n^{-1} Σ (Y_i − Ȳ)², when considered by itself is not an M-estimator. However, when combined with θ̂_1 = Ȳ, the pair (θ̂_1, θ̂_2) = (Ȳ, s_n²) is an M-estimator, so that θ̂_2 = s_n² satisfies our definition of a partial M-estimator.

Now let us calculate A(θ_0) and B(θ_0), where θ_0 = (θ_10, θ_20)^T:

    A(θ_0) = E[ −∂ψ(Y_1, θ_0)/∂θ^T ] = E( 1               0 ) = ( 1  0 )
                                        ( 2(Y_1 − θ_10)   1 )   ( 0  1 ),

    B(θ_0) = E[ ψ(Y_1, θ_0) ψ(Y_1, θ_0)^T ]

           = E( (Y_1 − θ_10)²                         (Y_1 − θ_10)[(Y_1 − θ_10)² − θ_20] )
              ( (Y_1 − θ_10)[(Y_1 − θ_10)² − θ_20]    [(Y_1 − θ_10)² − θ_20]²            )

           = ( μ_2   μ_3        ) = ( σ²    μ_3      )
             ( μ_3   μ_4 − μ_2² )   ( μ_3   μ_4 − σ⁴ ),

where μ_k is our notation for the kth central moment of Y_1, and the more familiar notation σ² = μ_2 = θ_20 has been substituted at the end. In this case, since A(θ_0) is the identity matrix, V(θ_0) = B(θ_0).

To estimate B(θ_0) we may use

    B_n(Y, θ̂) = (1/n) Σ_{i=1}^n ( (Y_i − Ȳ)²                       (Y_i − Ȳ)[(Y_i − Ȳ)² − s_n²] )
                                 ( (Y_i − Ȳ)[(Y_i − Ȳ)² − s_n²]    [(Y_i − Ȳ)² − s_n²]²          )

               = ( s_n²   m_3        )
                 ( m_3    m_4 − s_n⁴ ),

where the m_k are sample kth central moments. Looking back at the form for V(θ_0) and plugging in empirical moment estimators leads to equality of the empirical estimator and the expected value estimator, V̂ = V_n(Y, θ̂), in this case.
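A quick numerical check of the displayed B_n (our illustration, not part of the paper): for a skewed sample the entries should approach the central moments μ_2, μ_3, and μ_4 − μ_2² of the data-generating distribution; for the exponential(1) distribution these are 1, 2, and 9 − 1 = 8.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.exponential(size=10_000)      # skewed, so mu_3 != 0
s2 = y.var()
d = y - y.mean()
m3, m4 = (d**3).mean(), (d**4).mean()

# B_n from the displayed matrix; A_n is the identity here, so V_n = B_n
Vn = np.array([[s2, m3],
               [m3, m4 - s2**2]])
```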

Note that θ̂ is the maximum likelihood estimator for the normal model density f(y; θ_1, θ_2) = (2πθ_2)^{-1/2} exp{−(y − θ_1)²/(2θ_2)}, but ψ_1(Y_i, θ) = Y_i − θ_1 and ψ_2(Y_i, θ) = (Y_i − θ_1)² − θ_2 are not the score functions that come from this normal density. Differentiation of this normal log density yields ψ_1(Y_i, θ) = (Y_i − θ_1)/θ_2 and ψ_2(Y_i, θ) = [(Y_i − θ_1)² − θ_2]/(2θ_2²). Thus ψ functions are not unique; many different ψ functions can lead to the same estimator. Of course, different ψ functions associated with the same estimator yield different A and B, but the same V. For example, using these latter two score functions, the resulting A and B matrices are

    A(θ_0) = ( 1/θ_20   0          ),   B(θ_0) = ( 1/θ_20         μ_3/(2θ_20³)          )
             ( 0        1/(2θ_20²) )             ( μ_3/(2θ_20³)   (μ_4 − θ_20²)/(4θ_20⁴) ).

If we further assume that the data truly are normally distributed, then μ_3 = 0 and μ_4 = 3σ⁴, resulting in A(θ_0) = B(θ_0) = I(θ_0) and V(θ_0) = Diag(σ², 2σ⁴). Here the expected value, model-based covariance estimator would be V̂ = Diag(s_n², 2s_n⁴).

Note that the likelihood score ψ functions are related to the original ψ functions by ψ_MLE = Cψ, where C = diag(1/θ_20, 1/(2θ_20²)). A little algebra shows that all ψ functions of the form Cψ, where C is nonsingular but possibly depending on θ_0 and Y_1, ..., Y_n, lead to an equivalence class having the same estimator and asymptotic variance matrix V(θ_0).

Example 2 (Ratio estimator). Let θ̂ = Ȳ/X̄, where (Y_1, X_1), ..., (Y_n, X_n) is an iid sample of pairs with means E Y_1 = μ_Y and E X_1 = μ_X ≠ 0, variances var(Y_1) = σ_Y² and var(X_1) = σ_X², and covariance cov(Y_1, X_1) = σ_YX. A ψ function for θ̂ = Ȳ/X̄ is ψ(Y_i, X_i, θ) = Y_i − θX_i, leading to A(θ_0) = E X_1 = μ_X, B(θ_0) = E(Y_1 − θ_0 X_1)², V(θ_0) = E(Y_1 − θ_0 X_1)²/μ_X², and

    A_n(Y, θ̂) = X̄,   B_n(Y, θ̂) = (1/n) Σ_{i=1}^n (Y_i − θ̂ X_i)²,

and

    V_n(Y, θ̂) = [1/(n X̄²)] Σ_{i=1}^n (Y_i − θ̂ X_i)².

This variance estimator is often encountered in finite population sampling contexts.
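Under our reading of this example, the whole sandwich computation for the ratio estimator is a few lines of code; the following numpy sketch (ours, with simulated data and illustrative names) computes θ̂, A_n, B_n, and the resulting standard error:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1.0, 2.0, size=2_000)
y = 3.0 * x + rng.normal(scale=0.5, size=x.size)

theta_hat = y.mean() / x.mean()       # solves sum(y_i - theta * x_i) = 0
resid = y - theta_hat * x
A_n = x.mean()                        # average of -d psi / d theta
B_n = (resid**2).mean()               # average of psi^2
V_n = B_n / A_n**2                    # scalar sandwich
se = np.sqrt(V_n / x.size)            # standard error of theta_hat
```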

Now consider the following ψ of dimension three that yields θ̂_3 = Ȳ/X̄ as the third component of θ̂:

    ψ(Y_i, X_i, θ) = ( Y_i − θ_1     )
                     ( X_i − θ_2     )
                     ( θ_1 − θ_3 θ_2 ).

This is a quite interesting ψ function because the third component does not have any data involved in it. Nevertheless, this ψ satisfies all the requirements of a ψ function and illustrates how to build the delta method into the M-estimator framework. The A and B matrices are

    A(θ_0) = (  1     0      0    ),   B(θ_0) = ( σ_Y²   σ_YX   0 )
             (  0     1      0    )             ( σ_YX   σ_X²   0 )
             ( −1    θ_30   θ_20  )             ( 0      0      0 ).

This example illustrates the fact that B(θ_0) can be singular, although A(θ_0) generally cannot. In fact, whenever a ψ function has components that involve no data, then the resulting B matrix will be singular. In Maple we computed V(θ_0) = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T and obtained for the (3,3) element

    v_33 = ( σ_Y² − 2θ_30 σ_YX + θ_30² σ_X² ) / θ_20².

This latter expression for the asymptotic variance of √n (θ̂_3 − θ_30) can be shown to be the same as V(θ_0) = E(Y_1 − θ_0 X_1)²/μ_X² obtained earlier, upon noting that θ_30 = μ_Y/μ_X and θ_20 = μ_X.

Sample Maple Program

> with(linalg):                              # brings in the linear algebra package
> vA := [1,0,0, 0,1,0, -1,theta3,theta2]:    # make a vector of the entries of A
> A := matrix(3,3,vA):                       # create A from vA
> Ainv := inverse(A):
> vB := [sigmay^2,sigmaxy,0, sigmaxy,sigmax^2,0, 0,0,0]:
> B := matrix(3,3,vB):
> V := multiply(Ainv,B,transpose(Ainv)):
> simplify(V[3,3]);

        (sigmay^2 - 2 theta3 sigmaxy + theta3^2 sigmax^2)/theta2^2

The above display is what appears in the Maple window for the last command.
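For readers without Maple, the same symbolic computation can be reproduced with Python's sympy; this is our translation, with symbol names mirroring the Maple program above:

```python
import sympy as sp

theta2, theta3 = sp.symbols('theta2 theta3')
sigmay, sigmax, sigmaxy = sp.symbols('sigmay sigmax sigmaxy')

A = sp.Matrix([[1, 0, 0], [0, 1, 0], [-1, theta3, theta2]])
B = sp.Matrix([[sigmay**2, sigmaxy, 0],
               [sigmaxy, sigmax**2, 0],
               [0, 0, 0]])
V = A.inv() * B * A.inv().T        # the sandwich A^{-1} B A^{-T}
v33 = sp.simplify(V[2, 2])         # (3,3) element, zero-based indexing
```

Here v33 is symbolically equal to (sigmay² − 2 theta3 sigmaxy + theta3² sigmax²)/theta2², matching the Maple output.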

Example 3 (Further illustration of the delta method). In the context of Example 1, suppose we are interested in s_n = (s_n²)^{1/2} and log(s_n²). We could of course just redefine ψ_2 in Example 1 to be (Y_i − θ_1)² − θ_2² and (Y_i − θ_1)² − exp(θ_2), respectively. Instead, we prefer to add ψ_3(Y_i, θ) = θ_2^{1/2} − θ_3 and ψ_4(Y_i, θ) = log(θ_2) − θ_4, because it seems conceptually simpler and it also gives the joint asymptotic distribution of all four quantities. Now we have

    ψ(Y_i, θ) = ( Y_i − θ_1          )
                ( (Y_i − θ_1)² − θ_2 )
                ( θ_2^{1/2} − θ_3    )
                ( log(θ_2) − θ_4     ),

    A(θ_0) = ( 1    0              0   0 )
             ( 0    1              0   0 )
             ( 0   −1/(2√θ_20)     1   0 )
             ( 0   −1/θ_20         0   1 ),

    B(θ_0) = ( θ_20   μ_3           0   0 )
             ( μ_3    μ_4 − θ_20²   0   0 )
             ( 0      0             0   0 )
             ( 0      0             0   0 ),

and V(θ_0) = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T is

    V(θ_0) =
    ( θ_20            μ_3                      μ_3/(2√θ_20)               μ_3/θ_20                   )
    ( μ_3             μ_4 − θ_20²              (μ_4 − θ_20²)/(2√θ_20)     (μ_4 − θ_20²)/θ_20         )
    ( μ_3/(2√θ_20)    (μ_4 − θ_20²)/(2√θ_20)   (μ_4 − θ_20²)/(4θ_20)      (μ_4 − θ_20²)/(2θ_20^{3/2}) )
    ( μ_3/θ_20        (μ_4 − θ_20²)/θ_20       (μ_4 − θ_20²)/(2θ_20^{3/2}) (μ_4 − θ_20²)/θ_20²        ).

Thus the asymptotic variance of s_n = (s_n²)^{1/2} is (μ_4 − θ_20²)/(4θ_20 n), and the asymptotic variance of log(s_n²) is (μ_4 − θ_20²)/(θ_20² n).
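The matrix algebra here is again routine for a symbolic system; a sympy sketch (ours) that verifies the (3,3) and (4,4) elements of V(θ_0):

```python
import sympy as sp

t20, mu3, mu4 = sp.symbols('theta20 mu3 mu4', positive=True)

A = sp.Matrix([[1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, -1/(2*sp.sqrt(t20)), 1, 0],
               [0, -1/t20, 0, 1]])
B = sp.Matrix([[t20, mu3, 0, 0],
               [mu3, mu4 - t20**2, 0, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])
V = A.inv() * B * A.inv().T
v33 = sp.simplify(V[2, 2])   # n times the asymptotic variance of s_n
v44 = sp.simplify(V[3, 3])   # n times the asymptotic variance of log(s_n^2)
```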

Example 4 (Posterior mode). Consider the standard Bayesian model in an iid framework, where the posterior density is proportional to

    π(θ) Π_{i=1}^n f(Y_i | θ),

and π(θ) is the prior density. Posterior mode estimators satisfy (9) with ψ(y, θ) = ∂ log f(y | θ)/∂θ, the same ψ as for maximum likelihood, and c_n = −∂ log π(θ)/∂θ evaluated at θ̂. Thus, as long as c_n/√n →p 0, the Bayesian mode estimator will have the same asymptotic covariance matrix as maximum likelihood estimators.

Example 5 (Instrumental variable estimation). Instrumental variable estimation is a method for estimating regression parameters when predictor variables are measured with error (Fuller 1987; Carroll et al. 1995). We use a simple instrumental variable model to illustrate some features of the M-estimation approach. Suppose that triples (Y_i, W_i, T_i) are observed such that

    Y_i = α + βX_i + ε_{1,i},
    W_i = X_i + σ_U ε_{2,i},
    T_i = γ + δX_i + ε_{3,i},

where the {ε_{j,i}} are mutually independent random errors with common mean 0 and variance 1. For simplicity also assume that X_1, ..., X_n are iid, independent of the {ε_{j,i}}, and have finite variance. In the language of measurement error models, W_i is a measurement of X_i, and T_i is an instrumental variable for X_i, useful for estimating β provided that δ ≠ 0, which we now assume. Note that X_1, ..., X_n are latent variables and not observed. Let σ_S² and σ_ST denote the variances and covariances of any random variables S and T.

The estimator of slope obtained by regressing Y on W, say β̂_{Y|W}, converges in probability to βσ_X²/(σ_X² + σ_U²) and thus is not consistent for β when the measurement error variance σ_U² > 0. However, the instrumental variable estimator

    β̂_IV = β̂_{Y|T} / β̂_{W|T},

where β̂_{Y|T} and β̂_{W|T} are the slopes from the least squares regressions of Y on T and of W on T, respectively, is a consistent estimator of β under the stated assumptions, regardless of σ_U².

The instrumental variable estimator β̂_IV is a partial M-estimator as defined in the Introduction, and there are a number of ways to complete the ψ function in this case. Provided interest lies only in estimation of β, a simple choice is

    ψ(Y_i, W_i, T_i, θ) = ( T_i − θ_1                  )
                          ( (Y_i − θ_2 W_i)(T_i − θ_1) ),

with associated M-estimator θ̂ = (T̄, β̂_IV)^T.

The A and B matrices, calculated at the true parameters θ_0 = (μ_T, β)^T assuming the instrumental variable model, are

    A(θ_0) = ( 1    0     ),   B(θ_0) = ( σ_T²    ασ_T²                   )
             ( α    δσ_X² )             ( ασ_T²   σ_T²(α² + 1 + β²σ_U²)   ),

which yield the asymptotic variance matrix

    A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T = ( σ_T²   0                            )
                                         ( 0      σ_T²(1 + β²σ_U²)/(δ²σ_X⁴)   ).
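As a check on the displayed matrices (a sympy sketch of ours; the only inputs are the A and B just given), invert A symbolically and form the sandwich:

```python
import sympy as sp

alpha, beta, delta = sp.symbols('alpha beta delta')
sT2, sX2, sU2 = sp.symbols('sigma_T2 sigma_X2 sigma_U2', positive=True)

A = sp.Matrix([[1, 0], [alpha, delta*sX2]])
B = sp.Matrix([[sT2, alpha*sT2],
               [alpha*sT2, sT2*(alpha**2 + 1 + beta**2*sU2)]])
V = sp.simplify(A.inv() * B * A.inv().T)   # sandwich A^{-1} B A^{-T}
```

The off-diagonal element simplifies to zero, confirming the displayed diagonal form.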

Under the stated assumptions, the instrumental variable estimator and the naive estimator β̂_{Y|W} are both consistent for β when σ_U² = 0, yet have different asymptotic means when σ_U² > 0. Thus, when there is doubt about the magnitude of σ_U², their joint asymptotic distribution is of interest. The M-estimator approach easily accommodates such calculations. For this task consider the ψ function

    ψ(Y_i, W_i, T_i, θ) = ( T_i − θ_1                  )
                          ( W_i − θ_2                  )
                          ( (Y_i − θ_3 W_i)(W_i − θ_2) )
                          ( (Y_i − θ_4 W_i)(T_i − θ_1) ).

Note the change in the definitions of θ_1 and θ_2, and the ordering of the components of this ψ function. The configuration is primarily for convenience, as it leads to a triangular A matrix. In general, when the kth component of ψ depends only on θ_1, ..., θ_k, the partial derivative matrix ∂ψ(Y_1, θ)/∂θ^T is lower triangular, and so too is the A matrix.

The M-estimator associated with this ψ function is θ̂ = (T̄, W̄, β̂_{Y|W}, β̂_IV)^T.

The A matrix, calculated at the true parameters θ_0 = (μ_T, μ_W, θ_30, β)^T assuming the instrumental variable model, where θ_30 = βσ_X²/(σ_X² + σ_U²) is the limit of β̂_{Y|W}, is

    A(θ_0) = ( 1    0                     0              0     )
             ( 0    1                     0              0     )
             ( 0    α + (β − θ_30)μ_X     σ_X² + σ_U²    0     )
             ( α    0                     0              δσ_X² ).

The expression for the B matrix is unwieldy. However, primary interest lies in the lower 2 × 2 submatrix of the asymptotic variance matrix A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T. We used Maple to calculate this submatrix and to substitute expressions for the various mixed moments of (Y, W, T) under the assumption of joint normality, resulting in the asymptotic covariance matrix for (θ̂_3, θ̂_4)^T:

    ( (σ_W² + β²σ_U²σ_X²)/σ_W⁴              (σ_W² + β²σ_U²σ_X² − β²σ_U⁴)/σ_W⁴ )
    ( (σ_W² + β²σ_U²σ_X² − β²σ_U⁴)/σ_W⁴     σ_T²(1 + β²σ_U²)/(δ²σ_X⁴)         ).

The variance formula given above assumes normality of the errors ε_{j,i} and of the X_i in the model. Instrumental variable estimation works more generally, however, and in the absence of distributional assumptions beyond those of lack of correlation, estimated variances can be obtained using the sandwich formula. We illustrate the calculations with data from the Framingham Heart Study. For this illustration, Y and W are systolic blood pressure and serum cholesterol, respectively, measured at the third exam, and T is serum cholesterol measured at the second exam. The data include measurements on n males.

The ψ function above was used to determine the estimates (standard errors in parentheses) of (θ_1, θ_2, θ_3, θ_4) = (μ_T, μ_W, θ_30, β), that is, of (T̄, W̄, β̂_{Y|W}, β̂_IV), along with the empirical sandwich variance estimate (direct computer output). The estimated contrast β̂_IV − β̂_{Y|W}, standardized by its estimated standard error, yields a t statistic that is consistent with the hypothesis that serum cholesterol is measured with nonnegligible error.

3. CONNECTIONS TO THE INFLUENCE CURVE

The influence curve (Hampel 1974), IC_{θ̂}(y), of an estimator θ̂ based on an iid sample Y_1, ..., Y_n may be defined as satisfying

    θ̂ = θ_0 + (1/n) Σ_{i=1}^n IC_{θ̂}(Y_i) + R_n,

where √n R_n →p 0 as n → ∞. If E[IC_{θ̂}(Y_1)] = 0 and E[IC_{θ̂}(Y_1) IC_{θ̂}(Y_1)^T] exists, then by Slutsky's Theorem and the CLT, θ̂ is AMN(θ_0, E[IC_{θ̂}(Y_1) IC_{θ̂}(Y_1)^T]/n). It is easy to verify that IC_{θ̂}(y) = A(θ_0)^{-1} ψ(y, θ_0) for M-estimators. Thus

    E[ IC_{θ̂}(Y_1) IC_{θ̂}(Y_1)^T ] = E[ A(θ_0)^{-1} ψ(Y_1, θ_0) {ψ(Y_1, θ_0)}^T {A(θ_0)^{-1}}^T ]
                                    = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T = V(θ_0).

Since the influence curve approach is more general, why use the M-estimator approach of this paper? Although the two methods have similarities, we have found that the M-estimator approach described in this paper is easier to use routinely as a "plug and crank" method. Especially in messy problems with a large number of parameters, it appears easier to stack ψ functions and compute A and B matrices than it is to compute and then stack influence curves and then compute the covariance matrix of the stacked influence curves. This may be largely a matter of taste, but we have seen resistance to using influence curves.

If one has already computed influence curves, then defining ψ(Y_i, θ) = IC(Y_i; θ_0) − (θ − θ_0) allows one to use the approach of this paper. In this case A(θ_0) is the identity matrix and B(θ_0) = V(θ_0). A minor modification is that for the empirical variance estimators we need to define ψ(Y_i, θ̂) = ÎC(Y_i; θ̂); that is, plugging in θ̂ for both θ and θ_0. More importantly, this fact allows one to combine M-estimators with estimators that may not be M-estimators but for which we have already computed influence curves. The next example illustrates this.

Example 6 (Hodges-Lehmann location estimator). Hodges and Lehmann (1963) suggested that estimators could be obtained by inverting rank tests, and the class of such estimators is called R-estimators. One of the most interesting R-estimators is the Hodges-Lehmann location estimator

    θ̂_HL = median{ (X_i + X_j)/2 : 1 ≤ i ≤ j ≤ n }.

It is not clear how to put this estimator directly in the M-estimator framework, but for distributions symmetric around θ_0, that is, having distribution function F_0(y − θ_0) with F_0 symmetric about 0, Huber (1981) gives

    IC_{θ̂_HL}(y) = [ F_0(y − θ_0) − 1/2 ] / ∫ f_0²(y) dy,

where f_0(y) is the density function of F_0(y). The variance of this influence curve is

    1 / { 12 [ ∫ f_0²(y) dy ]² },

which is easily obtained after noting that F_0(Y_1 − θ_0) has a uniform distribution.
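For instance, for data from a N(0, σ²) distribution the integral and variance can be evaluated symbolically; the following sympy sketch (ours) recovers the familiar value πσ²/3:

```python
import sympy as sp

y = sp.Symbol('y', real=True)
sigma = sp.Symbol('sigma', positive=True)
# N(0, sigma^2) density
f0 = sp.exp(-y**2 / (2*sigma**2)) / (sigma * sp.sqrt(2*sp.pi))

I2 = sp.integrate(f0**2, (y, -sp.oo, sp.oo))   # integral of f0 squared
avar = sp.simplify(1 / (12 * I2**2))           # variance of the HL influence curve
```

Here avar simplifies to πσ²/3 ≈ 1.047σ², the well-known asymptotic variance of the Hodges-Lehmann estimator under normality.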

Now, for obtaining the asymptotic joint distribution of θ̂_HL and any set of M-estimators, we can stack ψ(Y_i, θ) = IC_{θ̂_HL}(Y_i; θ_0) − (θ − θ_0) with the ψ functions of the M-estimators. The part of the A matrix associated with θ̂_HL will be all zeroes except for the diagonal element, which will be a one. The diagonal element of the B matrix will be the asymptotic variance given above, but one will still need to compute correlations of IC_{θ̂_HL}(Y_1) with the other ψ functions to get the off-diagonal elements of the B matrix involving θ̂_HL.

4. NONSMOOTH ψ FUNCTIONS

In some situations the ψ function may not be differentiable everywhere, thus causing a problem with the definition of the A matrix as the expected value of a derivative that does not exist. The modified definition of A is to just interchange the order of taking the derivative and taking the expectation:

    A(θ_0) = −(∂/∂θ^T) { E_F ψ(Y_1, θ) } |_{θ=θ_0}.    (10)

It is important to note that the expectation is taken with respect to the true distribution of the data, denoted by E_F, but within E_F ψ(Y_1, θ) the θ is free to change in order to take the derivative. Of course, after taking the derivative, we then substitute the true parameter value θ_0.

Example 7 (Huber location estimator). Huber (1964) proposed estimating the center θ of symmetric distributions using the θ̂ that satisfies Σ ψ_k(Y_i − θ̂) = 0, where

    ψ_k(x) = x           when |x| ≤ k,
           = k sign(x)   when |x| > k.

This ψ function is continuous everywhere but not differentiable at ±k. Thus we use definition (10) to calculate A(θ_0):

    A(θ_0) = −(∂/∂θ) { E_F ψ_k(Y_1 − θ) } |_{θ=θ_0}
           = −(∂/∂θ) { ∫ ψ_k(y − θ) dF(y) } |_{θ=θ_0}
           = ∫ ψ_k'(y − θ_0) dF(y).

The notation ψ_k' inside the integral stands for the derivative of ψ_k where it exists; for the two points where it doesn't exist, y = θ_0 ± k, we delete y = θ_0 ± k from the integral (assuming that F is continuous at those points).

For B(θ_0) we have B(θ_0) = E[ψ_k²(Y_1 − θ_0)] = ∫ ψ_k²(y − θ_0) dF(y), and thus

    V(θ_0) = ∫ ψ_k²(y − θ_0) dF(y) / [ ∫ ψ_k'(y − θ_0) dF(y) ]².

For estimating A(θ_0) and B(θ_0), our usual estimators are A_n(Y, θ̂) = n^{-1} Σ_{i=1}^n ψ_k'(Y_i − θ̂) and B_n(Y, θ̂) = n^{-1} Σ_{i=1}^n ψ_k²(Y_i − θ̂), or perhaps (n − 1)^{-1} Σ_{i=1}^n ψ_k²(Y_i − θ̂). Here we can use the notation ψ_k'(Y_i − θ̂) because we expect to have no data exactly at Y_i − θ̂ = ±k, with probability one.
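A numerical sketch (ours; the heavy-tailed t data and the bisection solver are illustrative choices) that solves the Huber estimating equation and forms the sandwich standard error from the A_n and B_n above:

```python
import numpy as np

k = 1.345                                  # Huber's tuning constant

def psi_k(x):
    return np.clip(x, -k, k)               # psi_k: x truncated at +/- k

rng = np.random.default_rng(3)
y = rng.standard_t(df=3, size=4_000)       # heavy-tailed, symmetric about 0

# sum psi_k(y - theta) is nonincreasing in theta: solve by bisection
lo, hi = y.min(), y.max()
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if psi_k(y - mid).sum() > 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)

r = y - theta_hat
A_n = np.mean(np.abs(r) <= k)              # n^{-1} sum psi_k'(Y_i - theta_hat)
B_n = np.mean(psi_k(r)**2)                 # n^{-1} sum psi_k(Y_i - theta_hat)^2
se = np.sqrt(B_n / (A_n**2 * y.size))      # sandwich standard error
```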

k

h i

P

1

b b

Example The sample pth quantile F p satises p I Y c where c

i n n

n

h i

b

n F p Thus the function is Y p I Y and we are using our extended

n i i

denition

This ψ function is discontinuous at θ = Y_i, but we shall see that definition (10) of A(θ_0) continues to give us the correct asymptotic results:

    A(θ_0) = −(∂/∂θ) { E_F [ p − I(Y_1 ≤ θ) ] } |_{θ=θ_0} = (∂/∂θ) F(θ) |_{θ=θ_0} = f(θ_0),

    B(θ_0) = E[ p − I(Y_1 ≤ θ_0) ]² = p(1 − p),

    V(θ_0) = p(1 − p) / f(θ_0)².

Also, we could easily stack any finite number of quantile ψ functions together to get the joint asymptotic distribution of (F_n^{-1}(p_1), ..., F_n^{-1}(p_k)). There is a cost, however, for the jump discontinuities in these ψ functions: we can no longer use A_n(Y, θ̂) to estimate A(θ_0). In fact, the derivative of the pth quantile ψ function is zero everywhere except at the location of the jump discontinuity. There are several options for estimating A(θ_0). One is to use a smoothing technique to estimate f(θ_0) (kernel density estimators, for example). Another is to approximate ψ by a smooth function and use the A_n from this smooth approximation.

Example 9 (Positive mean deviation from the median). The positive mean deviation from the median is defined to be

    θ̂_1 = (2/n) Σ_{i=1}^n (Y_i − θ̂_2) I(Y_i > θ̂_2),

where θ̂_2 is the sample median. Thus the ψ function is

    ψ(Y_i, θ) = ( 2(Y_i − θ_2) I(Y_i > θ_2) − θ_1 )
                ( 1/2 − I(Y_i ≤ θ_2)             ).

Notice that the first component of ψ is continuous everywhere but not differentiable at Y_i = θ_2. The second component has a jump discontinuity at Y_i = θ_2. To get A(θ_0), we first calculate the expected value of ψ(Y_1, θ) (note that θ here is not θ_0):

    E_F ψ(Y_1, θ) = ( 2 ∫_{θ_2}^∞ (y − θ_2) dF(y) − θ_1 )
                    ( 1/2 − F(θ_2)                      ).

To take derivatives of the first component, let us write dF(y) as f(y) dy and expand it out to get

    2 ∫_{θ_2}^∞ y f(y) dy − 2θ_2 ∫_{θ_2}^∞ f(y) dy − θ_1.

The derivative of this latter expression with respect to θ_1 is of course −1. The derivative with respect to θ_2 is −2θ_2 f(θ_2) − 2[1 − F(θ_2)] + 2θ_2 f(θ_2) = −2[1 − F(θ_2)], using the Fundamental Theorem of Calculus to get the first term. Setting θ_2 = θ_20 means that 1 − F(θ_20) = 1/2, because θ_20 is the population median. Thus the derivative of the first component with respect to θ_2, evaluated at θ_0, is just −1. The partial derivatives of the second component, evaluated at θ_0, are 0 and −f(θ_20), respectively. Thus

    A(θ_0) = ( 1    1       )
             ( 0    f(θ_20) ).

Straightforward calculations for B(θ_0) yield

    B(θ_0) = ( b_11      θ_10/2 )
             ( θ_10/2    1/4    ),

where b_11 = 4 ∫_{θ_20}^∞ (y − θ_20)² f(y) dy − θ_10². Finally, V(θ_0) = A(θ_0)^{-1} B(θ_0) {A(θ_0)^{-1}}^T is given by

    V(θ_0) = ( b_11 − θ_10/f(θ_20) + 1/[4f²(θ_20)]     θ_10/[2f(θ_20)] − 1/[4f²(θ_20)] )
             ( θ_10/[2f(θ_20)] − 1/[4f²(θ_20)]         1/[4f²(θ_20)]                   ).

5. REGRESSION M-ESTIMATORS

There are two situations of interest for M-estimator analysis of regression estimators. The first is where the independent variables are random variables, and we can think in terms of iid (X_i, Y_i) pairs. This situation fits into our basic theory developed in Section 2 for iid sampling (see Example 2). The second situation is where the independent variables are fixed constants. This covers standard regression models as well as multi-sample problems like the one-way analysis of variance setup. For this second regression situation we need to introduce new notation to handle the non-iid character of the problem.

A fairly simple setting in which to introduce notation is the nonlinear model

    Y_i = g(x_i, β) + e_i,   i = 1, ..., n,

where g is a known differentiable function and e_1, ..., e_n are independent with mean 0 and possibly unequal variances Var(e_i) = σ_i², i = 1, ..., n, and the x_1, ..., x_n are known constant vectors. As usual, we put the vectors together and define X = (x_1, ..., x_n)^T. The least squares estimator β̂ satisfies

    Σ_{i=1}^n [ Y_i − g(x_i, β̂) ] g'(x_i, β̂) = 0,

where g'(x_i, β̂) means the partial derivative of g(x_i, β) with respect to β, evaluated at β̂. Expanding this equation about the true value β_0 and rearranging, we get

    √n (β̂ − β_0) = [ −(1/n) Σ_{i=1}^n ψ'(Y_i, x_i, β_0) ]^{-1} (1/√n) Σ_{i=1}^n ψ(Y_i, x_i, β_0) + √n R_n*,

where of course ψ(Y_i, x_i, β_0) = [ Y_i − g(x_i, β_0) ] g'(x_i, β_0). We now give general definitions for a number of quantities, followed by the result for the least squares estimator:

    A_n(X, Y, β_0) = (1/n) Σ_{i=1}^n [ −∂ψ(Y_i, x_i, β_0)/∂β^T ]
                   = (1/n) Σ_{i=1}^n { g'(x_i, β_0) g'(x_i, β_0)^T − [ Y_i − g(x_i, β_0) ] g''(x_i, β_0) }.

The notation principle is the same as before: all arguments of a quantity will be included in its name if those quantities are required for calculation. Now, taking expectations with respect to the true model, define

    A_n(X, β_0) = (1/n) Σ_{i=1}^n E[ −∂ψ(Y_i, x_i, β_0)/∂β^T ]
                = (1/n) Σ_{i=1}^n g'(x_i, β_0) g'(x_i, β_0)^T.

Notice that we have dropped the Y from this quantity's name because the expectation eliminates dependence on the Y_i. Also note that the second term for the least squares estimator drops out because of the modeling assumption E[Y_i − g(x_i, β_0)] = 0. Finally, assuming that the limits exist, we define

    A_∞(β_0) = lim_{n→∞} (1/n) Σ_{i=1}^n E[ −∂ψ(Y_i, x_i, β_0)/∂β^T ]
             = lim_{n→∞} (1/n) Σ_{i=1}^n g'(x_i, β_0) g'(x_i, β_0)^T.

In the linear model case, g(x_i, β) = x_i^T β, note that A_∞(β_0) = lim_{n→∞} X^T X/n. This limit need not exist for the least squares estimator to be consistent and asymptotically normal, but its existence is a typical assumption leading to those desired results. The definition of A_n(X, Y, β_0) leads to the purely empirical estimator of A,

    A_n(X, Y, β̂) = (1/n) Σ_{i=1}^n [ −∂ψ(Y_i, x_i, β)/∂β^T ] |_{β=β̂}
                 = (1/n) Σ_{i=1}^n { g'(x_i, β̂) g'(x_i, β̂)^T − [ Y_i − g(x_i, β̂) ] g''(x_i, β̂) }.

Since this is the negative of the Hessian used in a final Newton iteration, this estimator is sometimes preferred on computational grounds. But the estimated expected value estimator based on A_n(X, β_0) is typically simpler:

    A_n(X, β̂) = (1/n) Σ_{i=1}^n E[ −∂ψ(Y_i, x_i, β)/∂β^T ] |_{β=β̂}
              = (1/n) Σ_{i=1}^n g'(x_i, β̂) g'(x_i, β̂)^T.

For the B matrices we have, in this expanded notation,

    B_n(X, β_0) = (1/n) Σ_{i=1}^n E[ ψ(Y_i, x_i, β_0) ψ(Y_i, x_i, β_0)^T ]
                = (1/n) Σ_{i=1}^n σ_i² g'(x_i, β_0) g'(x_i, β_0)^T,

n

0 0 0

n

X

T

b b b

Y x Y x B X Y

i i i i n

n p

i=1

n

X

2 0 0 T

b b b

Y g x g x g x

i i i i

n p

i=1
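For the linear special case g(x_i, β) = x_i^T β, these pieces reduce to the familiar heteroscedasticity-consistent covariance estimator; a numpy sketch of ours, with simulated heteroscedastic data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])
# heteroscedastic errors: standard deviation grows with |x|
e = rng.normal(size=n) * (0.5 + np.abs(X[:, 1]))
y = X @ np.array([1.0, 2.0]) + e

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
r = y - X @ beta_hat
A_n = X.T @ X / n                          # g'(x_i, beta) = x_i for the linear model
B_n = (X * r[:, None]**2).T @ X / (n - 2)  # 1/(n - p) sum r_i^2 x_i x_i^T
Ainv = np.linalg.inv(A_n)
V_n = Ainv @ B_n @ Ainv.T
se = np.sqrt(np.diag(V_n) / n)             # sandwich standard errors
```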

Example 10 (Robust regression). Huber (1973) discussed robust alternatives to least squares in the linear regression context. As a specific example, consider the linear model with g(x_i, β) = x_i^T β and estimator β̂ satisfying

    Σ_{i=1}^n ψ_k(Y_i − x_i^T β̂) x_i = 0,

where ψ_k is the Huber ψ function defined in Example 7. This is a slight abuse of notation, since the official ψ(Y_i, x_i, β) = ψ_k(Y_i − x_i^T β) x_i; that is, ψ_k is being used as both the original Huber function and also as part of the generic estimating equation ψ. Since ψ_k is an odd function about zero, the defining equations E[ψ_k(Y_i − x_i^T β_0)] x_i = 0 will be satisfied if the e_i have a symmetric distribution about zero. If the e_i are not symmetrically distributed and the X matrix contains a column of ones, then the intercept estimated by β̂ will be different from that of least squares, but the other components of β_0 will be the same.

Taking a derivative, we have

    A_n(X, Y, β_0) = (1/n) Σ_{i=1}^n [ −∂/∂β^T ψ_k(Y_i − x_i^T β_0) x_i ] = (1/n) Σ_{i=1}^n ψ_k'(e_i) x_i x_i^T,

where e_i = Y_i − x_i^T β_0. Also, B_n(X, β_0) = n^{-1} Σ_{i=1}^n E[ψ_k²(e_i)] x_i x_i^T and A_n(X, β_0) = n^{-1} Σ_{i=1}^n E[ψ_k'(e_i)] x_i x_i^T. If we make the homogeneity assumption that the errors e_1, ..., e_n all have the same distribution, then A_n(X, β_0) = E[ψ_k'(e_1)] X^T X/n, B_n(X, β_0) = E[ψ_k²(e_1)] X^T X/n, and

    V_n(X, β_0) = ( E[ψ_k²(e_1)] / {E[ψ_k'(e_1)]}² ) (X^T X/n)^{-1}.

Example 11 (Generalized linear models). Generalized linear models have score equations

    Σ_{i=1}^n [ (Y_i − μ_i) / (τ V_i) ] D_i = 0,

where E Y_i = μ_i = g^{-1}(x_i^T β), D_i = ∂μ_i/∂β, τ V_i = Var(Y_i), g is the link function, and τ is an additional variance parameter. Taking expectations of the negative of the derivative with respect to β of the above sum, evaluated at β_0, yields the matrix

    Σ_{i=1}^n D_i(β_0) D_i(β_0)^T / [ τ V_i(β_0) ].

Note that the second term involving derivatives of D_i/V_i drops out due to the assumption that E Y_i = μ_i. Now, for certain misspecifications of densities, the generalized linear model framework allows for estimation of τ and approximately correct inference, as long as the mean is modeled correctly and the mean-variance relationship is specified correctly. Details of this robustified inference may be found in McCullagh (1983) under the name quasi-likelihood. Note, though, that only one extra parameter τ is used to make up for possible misspecification. Instead, Liang and Zeger (1986) noticed that the M-estimator approach could be used here without τ and with only the mean correctly specified:

    A_n(X, β̂) = (1/n) Σ_{i=1}^n D_i(β̂) D_i(β̂)^T / V_i(β̂),

    B_n(X, Y, β̂) = [1/(n − p)] Σ_{i=1}^n (Y_i − μ̂_i)² D_i(β̂) D_i(β̂)^T / V_i(β̂)².
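As an illustration of these formulas in the simplest independent-observation case, a numpy sketch of ours fits a logistic regression by Newton-Raphson and then forms A_n, B_n, and the sandwich. For the logistic (canonical) link, μ_i = 1/(1 + exp(−x_i^T β)), D_i = μ_i(1 − μ_i)x_i, and V_i = μ_i(1 − μ_i), so the score is Σ(Y_i − μ_i)x_i; for simplicity we use 1/n rather than 1/(n − p) in B_n:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))
y = rng.binomial(1, p_true)

# Newton-Raphson for the logistic score equations sum (y_i - mu_i) x_i = 0
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))
    W = mu * (1 - mu)
    beta += np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))

mu = 1 / (1 + np.exp(-X @ beta))
A_n = X.T @ (X * (mu * (1 - mu))[:, None]) / n     # sum D_i D_i^T / V_i over n
B_n = X.T @ (X * ((y - mu)**2)[:, None]) / n       # sum (y_i - mu_i)^2 x_i x_i^T over n
Ainv = np.linalg.inv(A_n)
V_n = Ainv @ B_n @ Ainv.T
se = np.sqrt(np.diag(V_n) / n)                     # sandwich standard errors
```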

Liang and Zeger (1986) actually proposed a generalized set of estimating equations that accommodates independent clusters of correlated data. The form of the estimating equations and the A and B matrices are similar to the above, except that the sums are over independent clusters. Dunlop (1994) gives a simple introduction to these generalized estimating equations (GEE).

In time series and spatial analyses there is often correlation among all the Y_i, with no independent replication. In such cases the A matrix estimates from the independent case are still consistent, but more complicated methods must be used in estimating the B matrix; see Lumley and Heagerty (1999) and Kim and Boos.

Table. Shaquille O'Neal Free Throws in the NBA Playoffs
(rows: Game Number; FTs Made; FTs Attempted; Prop. Made)

APPLICATION TO TEST STATISTICS

Recall that Wald statistics for H_0: \theta = \theta_0 are quadratic forms like (\hat\theta - \theta_0)^T \{V_n(\hat\theta)\}^{-1} (\hat\theta - \theta_0).
Thus M-estimation is directly useful for creating such statistics. Score statistics are created from
the defining \psi equations, but the variance estimates used to define them are not as simple to
derive by the M-estimation method as Wald statistics. Here we illustrate how to find appropriate
variance estimates for score statistics in two applications.

Example. In the National Basketball Association (NBA) playoffs, Los Angeles Lakers
star player Shaquille O'Neal played in 23 games. Table gives his game-by-game free throw
outcomes, and Figure displays the results.

It is often conjectured that players have streaks where they shoot better or worse. One way
to think about that is to assume that the number of free throws made in the ith game, Y_i,
is binomial(n_i, p_i), conditional on n_i, the number of free throws attempted in the ith game, and
p_i, the probability of making a free throw in the ith game. Having streaks might correspond to
some games having high or low p_i values. Thus a statistical formulation of the problem might be:
Can the above observed game-to-game variation in sample proportions be explained
by binomial variability with a common p? (By the way, the apparent downward trend in
sample proportions is not significant.)

[Figure. Shaq's Free Throw Percentages in the NBA Playoffs: sample proportions plotted against game number.]

For generality, let k be the number of games. The score statistic for testing a common binomial
distribution, H_0: p_1 = ... = p_k = p versus H_a: p_i \ne p_j for at least one pair (i, j), is

    T_S = \sum_{i=1}^k (Y_i - n_i \hat{p})^2 / \{ n_i \hat{p}(1 - \hat{p}) \},

where \hat{p} = \sum Y_i / \sum n_i is the estimate of the assumed common p under the null hypothesis.
T_S is really the chi-squared goodness-of-fit statistic with expected cell counts n_i \hat{p}, and the usual
chi-squared approximation with k - 1 degrees of freedom is based on the sample sizes n_i going to
infinity. But most of the n_i in our data set are quite small. Another approach is to use the normal
approximation for T_S/k based on k going to infinity; the p-value is then based on this
approximation. Using the M-estimator approach, there are some differences with the simple
chi-squared derivation: we need to treat p as a parameter and find the asymptotic variance of T_S/k
under the null hypothesis. To do so, we treat p and the limiting value of T_S/k as the parameters
\theta_1 and \theta_2 and form two \psi functions:

    \psi_1(Y_i, n_i, \theta) = Y_i - n_i \theta_1,
    \psi_2(Y_i, n_i, \theta) = (Y_i - n_i \theta_1)^2 / \{ n_i \theta_1 (1 - \theta_1) \} - \theta_2,

so that \hat\theta_1 = \hat{p} and \hat\theta_2 = T_S/k.

For calculating the A and B matrices, we can treat the n_i like fixed constants (as in regression) or as
random variables with some distribution. Taking the latter approach, and noting that under
H_0 we have \theta_2 = 1, we get A_{11} = \mu_n \equiv E(n_i), A_{12} = 0, A_{21} = (1 - 2p)/\{p(1 - p)\}, A_{22} = 1,
B_{11} = p(1 - p)\mu_n,

    B_{22} = 2 + E[ \{1 - 6p(1 - p)\} / \{ n_i p(1 - p) \} ],

and B_{12} = 1 - 2p. We have used the assumption that, conditionally under H_0,
Y_i | n_i is binomial(n_i, p). The asymptotic variance of interest is then

i i i

2

h i

A B A B

22 12 12

12

1 1 T

A B fA g B

11

2

11

A

A

22

22

2

2

p p

p

E

p p n p p

i n

Plugging in \bar{n} = k^{-1}\sum n_i for E(n_i) and k^{-1}\sum n_i^{-1} for E(n_i^{-1}), and comparing T_S/k to a normal distribution
with mean 1 and this estimated variance divided by k, leads to a p-value for the Shaq free throw data.
We also ran two parametric bootstraps: one with resamples conditional on n_1, ..., n_{23},
and one with the n_i drawn with replacement from \{n_1, ..., n_{23}\}. Comparing the resulting
p-values, the chi-squared approximation seems better than the normal approximation. We
might add that the results are very sensitive to one game in particular.
Also, the related score statistic derived by Tarone from the beta-binomial model is weighted
differently and results in a different p-value.
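The plug-in recipe above can be sketched in a few lines (our own illustration; the function name is invented and the counts in the usage example are hypothetical, not Shaq's actual data):

```python
import math

def binomial_score_test(made, attempts):
    """Normal-approximation p-value for T_S using the asymptotic variance
    derived above: under H0, T_S/k is approximately N(1, V/k)."""
    k = len(made)
    p = sum(made) / sum(attempts)                 # common p estimated under H0
    TS = sum((y - n * p) ** 2 / (n * p * (1 - p))
             for y, n in zip(made, attempts))
    nbar = sum(attempts) / k                      # plug-in for E(n_i)
    inv_n = sum(1.0 / n for n in attempts) / k    # plug-in for E(1/n_i)
    V = (2.0 + (1 - 6 * p * (1 - p)) * inv_n / (p * (1 - p))
         - (1 - 2 * p) ** 2 / (p * (1 - p) * nbar))
    z = (TS / k - 1.0) / math.sqrt(V / k)
    pval = 0.5 * math.erfc(z / math.sqrt(2))      # one-sided upper-tail p-value
    return TS, pval

# Hypothetical counts: T_S = 0 when every game has the same sample proportion.
TS, pval = binomial_score_test([5, 5, 5], [10, 10, 10])
```

A one-sided upper tail is used because heterogeneous p_i inflate T_S; the chi-squared and bootstrap alternatives discussed above would replace only the final reference distribution.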

Example. Sen first derived generalized score tests based on the M-estimation formula-
tion; Boos shows how the form of the test statistic arises from Taylor expansion. In this
example we would like to point out how our M-estimation approach leads to the correct test statis-
tic. In a sense, the A and B matrix formulation automatically does the Taylor expansion and
computes the variance of the appropriate linear approximations. For simplicity we will present
results for the iid situation; regression extensions are similar.

Assume that \theta^T = (\theta_1^T, \theta_2^T) and the null hypothesis is H_0: \theta_1 = \theta_{10}, where \theta_1 is of dimension
r and \theta_2 is of dimension p - r. Assume that \psi^T = (\psi_1^T, \psi_2^T) is partitioned similarly.

Generalized score tests are based on \sum \psi_1(Y_i, \tilde\theta), where \tilde\theta^T = (\theta_{10}^T, \tilde\theta_2^T) and \tilde\theta_2 satisfies
\sum \psi_2(Y_i, \tilde\theta) = 0. The goal is to find the appropriate variance matrix to invert and put between
[\sum \psi_1(Y_i, \tilde\theta)]^T and \sum \psi_1(Y_i, \tilde\theta).

To that end, let \hat\theta_1^* = n^{-1} \sum_{i=1}^n \psi_1(Y_i, \tilde\theta) be an M-estimator that solves \sum [ \psi_1(Y_i, \tilde\theta) - \theta_1^* ] = 0.
Then, thinking of \theta_1^* as a parameter that is the limit in probability of \hat\theta_1^*, the parameter for this
new problem is composed of \theta_1^* and \theta_2, and \theta_{10} is fixed and not a parameter in the new problem.
The associated \psi functions are \psi_1^*(Y_i, \theta^*) = \psi_1(Y_i, \theta) - \theta_1^* and \psi_2^*(Y_i, \theta^*) = \psi_2(Y_i, \theta). Taking
derivatives with respect to \theta^* and expectations, we find that

    A^* = ( I_r    A_{12} )
          ( 0      A_{22} )     and    B^* = B,

where A_{12} = E\{ -\partial\psi_1(Y_1, \theta)/\partial\theta_2^T \}, A_{22} = E\{ -\partial\psi_2(Y_1, \theta)/\partial\theta_2^T \}, I_r is the r x r identity
matrix, and A and B without *'s refer to their form in the original
problem. Finally, inverting and multiplying leads to the asymptotic variance of n^{-1/2} \sum \psi_1(Y_i, \tilde\theta) = \sqrt{n}\,\hat\theta_1^*,
given by

    V_{11}^* = B_{11} - B_{12} \{A_{22}^{-1}\}^T A_{12}^T - A_{12} A_{22}^{-1} B_{21} + A_{12} A_{22}^{-1} B_{22} \{A_{22}^{-1}\}^T A_{12}^T,

and the generalized score statistic is

    T_{GS} = n^{-1} [ \sum \psi_1(Y_i, \tilde\theta) ]^T \{ \hat{V}_{11}^* \}^{-1} [ \sum \psi_1(Y_i, \tilde\theta) ].
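Given estimates of the partitioned A and B matrices, V_{11}^* and T_GS are a few lines of matrix algebra. This sketch uses our own naming conventions (r is the dimension of theta_1, and the function name is invented):

```python
import numpy as np

def generalized_score_stat(psi1_vals, A, B, r):
    """T_GS from partitioned A and B estimates of the full p-dim problem.
    psi1_vals: (n, r) array of psi_1(Y_i, theta_tilde) at the restricted
    estimate; A, B: (p, p) matrices; r: dimension of theta_1 under test."""
    n = psi1_vals.shape[0]
    A12, A22 = A[:r, r:], A[r:, r:]
    B11, B12 = B[:r, :r], B[:r, r:]
    B21, B22 = B[r:, :r], B[r:, r:]
    C = A12 @ np.linalg.inv(A22)                 # A_12 A_22^{-1}
    V11 = B11 - B12 @ C.T - C @ B21 + C @ B22 @ C.T
    s = psi1_vals.sum(axis=0)                    # sum of psi_1 at theta_tilde
    return s @ np.linalg.inv(V11) @ s / n
```

Under H_0 and the usual conditions, the returned statistic is compared to a chi-squared distribution with r degrees of freedom.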

Summary

M-estimators represent a very large class of statistics, including, for example, maximum likelihood
estimators and basic sample statistics like sample moments and sample quantiles, as well as com-
plex functions of these. The approach we have summarized makes standard error estimation and
asymptotic analysis routine regardless of the complexity or dimension of the problem. In summary,
we would like to bring together the key features of M-estimators:

An M-estimator \hat\theta satisfies \sum_{i=1}^n \psi(Y_i, \hat\theta) = 0, where \psi is a known function not depending
on i or n. See also the extensions discussed earlier.

Many estimators that do not satisfy this equation or its extensions are components of
higher-dimensional M-estimators and thus are amenable to M-estimator techniques using the
method of stacking. Such estimators are called partial M-estimators.

A(\theta_0) = E\{ -\partial\psi(Y_1, \theta_0)/\partial\theta^T \} is the Fisher information matrix in regular parametric models
when \psi is the log-likelihood score function. More generally, A(\theta_0) must have an inverse but
need not be symmetric. See also the extension for nondifferentiable \psi.

B(\theta_0) = E\{ \psi(Y_1, \theta_0) \psi(Y_1, \theta_0)^T \} is also the Fisher information matrix in regular parametric
models when \psi is the log-likelihood score function. B(\theta_0) always has the properties of a
covariance matrix, but will be singular when one component of \hat\theta is a nonrandom function of
the other components of \hat\theta.

Under suitable regularity conditions, \hat\theta is AMN(\theta_0, V(\theta_0)/n) as n \to \infty, where V(\theta_0) =
A(\theta_0)^{-1} B(\theta_0) \{A(\theta_0)^{-1}\}^T is the sandwich matrix.

One generally applicable estimator of V(\theta_0) for differentiable \psi functions is the empirical
sandwich estimator V_n(Y, \hat\theta) = A_n(Y, \hat\theta)^{-1} B_n(Y, \hat\theta) \{A_n(Y, \hat\theta)^{-1}\}^T.
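As a minimal end-to-end sketch of the empirical sandwich estimator (our own example, not from the paper): stacking psi_1 = Y - theta_1 and psi_2 = (Y - theta_1)^2 - theta_2 gives the sample mean and variance, and the estimator follows directly:

```python
import numpy as np

def sandwich_mean_var(y):
    """Empirical sandwich estimator for the stacked M-estimator
    theta = (mean, variance):  psi_1 = y - t1,  psi_2 = (y - t1)^2 - t2."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t1 = y.mean()
    t2 = ((y - t1) ** 2).mean()
    psi = np.column_stack([y - t1, (y - t1) ** 2 - t2])
    # A_n = average of -d psi / d theta^T; the (2,1) entry averages 2(y - t1).
    A = np.array([[1.0, 0.0],
                  [2.0 * (y - t1).mean(), 1.0]])
    B = psi.T @ psi / n                        # B_n = average of psi psi^T
    A_inv = np.linalg.inv(A)
    V = A_inv @ B @ A_inv.T
    return (t1, t2), V / n                     # V/n estimates Cov(t1_hat, t2_hat)
```

Note that no normality assumption enters: the (2,2) entry of V/n automatically picks up the fourth-moment contribution to the variance of the sample variance.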

REFERENCES

Benichou, J., and Gail, M. H., "A Delta Method for Implicitly Defined Random Variables," The
American Statistician.

Boos, D. D., "On Generalized Score Tests," The American Statistician.

Carroll, R. J., Ruppert, D., and Stefanski, L. A., Measurement Error in Nonlinear Models,
London: Chapman & Hall.

Dunlop, D. D., "Regression for Longitudinal Data: A Bridge from Least Squares Regression,"
The American Statistician.

Fuller, W. A., Measurement Error Models, New York: John Wiley & Sons.

Godambe, V. P., "An Optimum Property of Regular Maximum Likelihood Estimation," Annals of
Mathematical Statistics.

Hampel, F. R., "The Influence Curve and Its Role in Robust Estimation," Journal of the
American Statistical Association.

Hodges, J. L., Jr., and Lehmann, E. L., "Estimates of Location Based on Rank Tests," Annals of
Mathematical Statistics.

Huber, P. J., "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics.

Huber, P. J., "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions,"
Proceedings of the Fifth Berkeley Symposium.

Huber, P. J., "Robust Regression: Asymptotics, Conjectures and Monte Carlo," Annals of
Statistics.

Iverson, H. K., and Randles, R. H., "The Effects on Convergence of Substituting Parameter
Estimates into U-Statistics and Other Families of Statistics," Probability Theory and Related
Fields.

Kim, H.-J., and Boos, D. D., "Variance Estimation in Spatial Regression Using a Nonpara-
metric Semivariogram Based on Residuals," Institute of Statistics Mimeo Series, North
Carolina State University, Raleigh.

Liang, K.-Y., and Zeger, S. L., "Longitudinal Data Analysis Using Generalized Linear Mod-
els," Biometrika.

Lumley, T., and Heagerty, P. J., "Weighted Empirical Adaptive Variance Estimators for
Correlated Data Regression," Journal of the Royal Statistical Society, Ser. B.

McCullagh, P., "Quasi-Likelihood Functions," Annals of Statistics.

Randles, R. H., "On the Asymptotic Normality of Statistics With Estimated Parameters,"
The Annals of Statistics.

Sen, P. K., "On M Tests in Linear Models," Biometrika.

Tarone, R. E., "Testing the Goodness of Fit of the Binomial Distribution," Biometrika.