


Maximum Likelihood. Class Notes. Manuel Arellano. Revised: January 28, 2020.

1 Likelihood models

Given some data $w_1, \dots, w_n$, a likelihood model is a family of density functions $\{f_n(w_1, \dots, w_n; \theta),\ \theta \in \Theta \subset \mathbb{R}^p\}$ that is specified as a probability model for the observed data. If the data are a random sample from some population with density function $g(w_i)$, then $f_n(w_1, \dots, w_n; \theta)$ is a model for $\prod_{i=1}^{n} g(w_i)$. If the model is correctly specified, $f_n(w_1, \dots, w_n; \theta) = \prod_{i=1}^{n} f(w_i, \theta)$ and $g(w_i) = f(w_i, \theta_0)$ for some value $\theta_0$ in $\Theta$. The function $f$ may be a conditional or an unconditional density. It may be flexible enough to be unrestricted or it may place restrictions on the form of the population density.¹

In the likelihood function $\prod_{i=1}^{n} f(w_i, \theta)$, the data are given and $\theta$ is the argument. Usually we work with the log likelihood function:

$$L(\theta) = \sum_{i=1}^{n} \ln f(w_i, \theta). \qquad (1)$$

$L(\theta)$ measures the ability of different densities within the pre-specified class to generate the data. The maximum likelihood estimator (MLE) is the value of $\theta$ associated with the largest likelihood:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L(\theta). \qquad (2)$$

Example 1: Regression under normality. The model is $y \mid X \sim \mathcal{N}(X\beta_0, \sigma_0^2 I_n)$, so that
$$f_n(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$$
where $\theta = (\beta', \sigma^2)'$, $\theta_0 = (\beta_0', \sigma_0^2)'$ and
$$f(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2}\left( y_i - x_i'\beta \right)^2 \right\} \qquad (3)$$
$$L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left( y_i - x_i'\beta \right)^2. \qquad (4)$$
The MLE $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ consists of the OLS estimator $\hat{\beta}$ and the residual variance $\hat{\sigma}^2$ without degrees-of-freedom adjustment. Letting $\hat{u} = y - X\hat{\beta}$ we have:
$$\hat{\beta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2 = \frac{\hat{u}'\hat{u}}{n}. \qquad (5)$$

¹ This note follows the practice of using the term density both for continuous random variables and for the probability function of discrete random variables; as, for example, in David Cox, Principles of Statistical Inference, 2006.
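For concreteness, here is a minimal numerical sketch of (4)-(5): it computes the OLS/MLE formulas and checks that they attain a log likelihood at least as large as the one at the data-generating values. The simulated data and sample size are assumptions made purely for the illustration.

```python
import numpy as np

# Minimal sketch of (5): the Gaussian MLE in the linear regression model is OLS
# plus the residual variance without degrees-of-freedom adjustment.
rng = np.random.default_rng(0)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta0, sigma0 = np.array([1.0, 0.5, -0.3]), 1.5
y = X @ beta0 + sigma0 * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
u_hat = y - X @ beta_hat
sigma2_hat = u_hat @ u_hat / n                 # no degrees-of-freedom correction

def loglik(beta, sigma2):
    """Log likelihood (4) evaluated at (beta, sigma2)."""
    e = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * n * np.log(sigma2) - e @ e / (2 * sigma2)

# The closed form (5) should attain a (weakly) larger value of (4) than any other candidate.
assert loglik(beta_hat, sigma2_hat) >= loglik(beta0, sigma0**2)
print(beta_hat, sigma2_hat)
```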

Example 2: Logit regression. There is a binary dependent variable $y_i$ that takes only two values, 0 and 1. Therefore, in this case $f(y_i \mid x_i) = p_i^{y_i}(1 - p_i)^{1 - y_i}$ where $p_i = \Pr(y_i = 1 \mid x_i)$. In the logit model the log odds ratio depends linearly on $x_i$:

$$\ln\left( \frac{p_i}{1 - p_i} \right) = x_i'\theta,$$
so that $p_i = \Lambda(x_i'\theta)$ where $\Lambda$ is the logistic cdf $\Lambda(r) = 1/(1 + \exp(-r))$.
Assuming that $f_n(y_1, \dots, y_n \mid x_1, \dots, x_n; \theta) = \prod_{i=1}^{n} f(y_i \mid x_i; \theta)$, the log likelihood function is
$$L(\theta) = \sum_{i=1}^{n}\left[ y_i \ln \Lambda(x_i'\theta) + (1 - y_i)\ln\left( 1 - \Lambda(x_i'\theta) \right) \right].$$
The first and second partial derivatives of $L(\theta)$ are:

$$\frac{\partial L(\theta)}{\partial\theta} = \sum_{i=1}^{n} x_i\left( y_i - \Lambda(x_i'\theta) \right), \qquad
\frac{\partial^2 L(\theta)}{\partial\theta\partial\theta'} = -\sum_{i=1}^{n} x_i x_i' \Lambda_i (1 - \Lambda_i).$$
Since the Hessian matrix is negative semidefinite, as long as it is nonsingular there exists a single maximum of the likelihood function.² The first order conditions $\partial L(\theta)/\partial\theta = 0$ are nonlinear, but we can find their root $\hat{\theta}$ using the Newton-Raphson method of successive approximations. Namely, we begin by finding the root $\theta_1$ of a linear approximation of $\partial L(\theta)/\partial\theta$ around some initial value $\theta_0$ and iterate the procedure until convergence:
$$\theta_{j+1} = \theta_j - \left[ \frac{\partial^2 L(\theta_j)}{\partial\theta\partial\theta'} \right]^{-1} \frac{\partial L(\theta_j)}{\partial\theta} \qquad (j = 0, 1, 2, \dots).$$
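As an illustration of the Newton-Raphson iteration above, the following sketch applies it to the logit score and Hessian on simulated data; the starting value, tolerance and data-generating parameters are assumptions made for the example.

```python
import numpy as np

def logit_mle(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the logit MLE: theta_{j+1} = theta_j - H(theta_j)^{-1} s(theta_j)."""
    theta = np.zeros(X.shape[1])              # initial value theta_0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))  # Lambda(x_i' theta)
        score = X.T @ (y - p)                 # first derivative of L(theta)
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative
        step = np.linalg.solve(hessian, score)
        theta = theta - step
        if np.max(np.abs(step)) < tol:        # stop when the update is negligible
            break
    return theta

# Simulated data, purely illustrative.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
print(logit_mle(X, y))
```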

Example 3: Duration data. In the analysis of a duration variable $Y$ it is common to study its hazard rate, which is given by $h(y) = f(y)/\Pr(Y \geq y)$. If $Y$ is discrete, $h(y) = \Pr(Y = y \mid Y \geq y)$, whereas if $Y$ is continuous, $h(y) = \lim_{\Delta y \to 0} \Pr(y \leq Y < y + \Delta y \mid Y \geq y)/\Delta y = f(y)/[1 - F(y)]$.³ A simple model is to assume that the hazard function is constant: $h(y) = \lambda$ with $\lambda > 0$, which for a continuous $Y$ gives rise to the exponential distribution with pdf $f(y) = \lambda\exp(-\lambda y)$ and cdf $F(y) = 1 - \exp(-\lambda y)$. A popular conditional model is the proportional hazard specification: $h(y, x) = \lambda(y)\varphi(x)$. A special case is an exponential model in which the log hazard depends linearly on $x_i$:

$$\ln h(y_i, x_i) = x_i'\theta.$$
Here the log likelihood of a sample of conditionally independent complete and incomplete durations is

$$L(\theta) = \sum_{\text{complete}} \left( x_i'\theta - y_i e^{x_i'\theta} \right) + \sum_{\text{incomplete}} \left( - y_i e^{x_i'\theta} \right)
= \sum_{i=1}^{n}\left[ (1 - c_i)\left( x_i'\theta - y_i e^{x_i'\theta} \right) - c_i\, y_i e^{x_i'\theta} \right]$$
where $c_i = 0$ if a duration is complete and $c_i = 1$ if not. The contribution to the log likelihood of a complete duration is $\ln f(y_i)$, whereas the contribution of an incomplete duration is $\ln \Pr(Y > y_i)$.
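A minimal sketch of this censored exponential likelihood, maximized numerically with scipy's general-purpose optimizer; the simulated durations and the fixed censoring point are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(theta, X, y, c):
    """Negative of the censored exponential log likelihood above.
    c[i] = 0 for a complete duration, 1 for an incomplete (right-censored) one."""
    lam = np.exp(X @ theta)                      # hazard exp(x_i' theta)
    return -np.sum((1 - c) * (X @ theta - y * lam) - c * y * lam)

# Simulated data, purely illustrative: exponential durations censored at 2.
rng = np.random.default_rng(2)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-0.2, 0.5])
latent = rng.exponential(1.0 / np.exp(X @ theta_true))
y = np.minimum(latent, 2.0)
c = (latent > 2.0).astype(float)

res = minimize(neg_loglik, x0=np.zeros(2), args=(X, y, c), method="BFGS")
print(res.x)
```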

² To see that the Hessian is negative semidefinite, note that, using the decomposition $\kappa_i'\kappa_i = \Lambda_i(1 - \Lambda_i)$ with $\kappa_i' = \left( (1 - \Lambda_i)\Lambda_i^{1/2},\ \Lambda_i(1 - \Lambda_i)^{1/2} \right)$, the Hessian can be written as $\partial^2 L(\theta)/\partial\theta\partial\theta' = -\sum_{i=1}^{n} x_i \kappa_i'\kappa_i x_i'$.

³ The hazard rate fully characterizes the distribution of $Y$. To recover $F(y)$ from $h(y)$ note that $\ln[1 - F(y)] = \sum_{s=1}^{y} \ln[1 - h(s)]$ in the discrete case, and $\ln[1 - F(y)] = \int_0^y [-h(s)]\, ds$ in the continuous one.

What does $\hat{\theta}$ estimate? The population counterpart of the sample calculation (2) is

$$\theta_0 = \arg\max_{\theta \in \Theta} E\left[ \ln f(W, \theta) \right] \qquad (6)$$
where $W$ is a random variable with density $g(w)$, so that $E[\ln f(W, \theta)] = \int \ln f(w, \theta)\, g(w)\, dw$.

If the population density $g(w)$ belongs to the $f(w, \theta)$ family, then $\theta_0$ as defined in (6) is the true value. In effect, due to Jensen's inequality:

$$\int \ln f(w, \theta)\, f(w, \theta_0)\, dw - \int \ln f(w, \theta_0)\, f(w, \theta_0)\, dw
= \int \ln\left( \frac{f(w, \theta)}{f(w, \theta_0)} \right) f(w, \theta_0)\, dw
\leq \ln \int \frac{f(w, \theta)}{f(w, \theta_0)}\, f(w, \theta_0)\, dw = \ln \int f(w, \theta)\, dw = \ln 1 = 0.$$
Thus, when $g(w) = f(w, \theta_0)$ we have

$$E[\ln f(W, \theta)] - E[\ln f(W, \theta_0)] \leq 0 \quad \text{for all } \theta.$$
The value $\theta_0$ can be interpreted more generally as follows:⁴
$$\theta_0 = \arg\min_{\theta \in \Theta} E\left[ \ln \frac{g(W)}{f(W, \theta)} \right]. \qquad (7)$$
The quantity $E\left[ \ln \frac{g(W)}{f(W,\theta)} \right] \equiv \int \ln\left( \frac{g(w)}{f(w,\theta)} \right) g(w)\, dw$ is the Kullback-Leibler divergence (KLD) from $f(w, \theta)$ to $g(w)$. The KLD is the expected log difference between $g(w)$ and $f(w, \theta)$ when the expectation is taken using $g(w)$. Thus, $f(w, \theta_0)$ can be regarded as the best approximation to $g(w)$ in the class $f(w, \theta)$ when the approximation is understood in the KLD sense.

If $g(w) = f(w, \theta_0)$ then $\theta_0$ is called the "true value". If $g(w)$ does not belong to the $f(w, \theta)$ class and $f(w, \theta_0)$ is just the best approximation to $g(w)$ in the KLD sense, then $\theta_0$ is called a "pseudo-true value". The extent to which a pseudo-true value remains an interesting quantity is model specific. For example, (3) is a restrictive model of the conditional distribution of $y_i$ given $x_i$, first because it assumes that the dependence of $y_i$ on $x_i$ occurs exclusively through the conditional mean $E(y_i \mid x_i)$, and secondly because this conditional mean is assumed to be a linear function of $x_i$. However, if $y_i$ depends on $x_i$ in other ways, for example through the conditional variance, the parameter values $\theta_0 = (\beta_0', \sigma_0^2)'$ remain interpretable quantities: $\beta_0$ as $\partial E(y_i \mid x_i)/\partial x_i$ and $\sigma_0^2$ as the unconditional variance of the errors $u_i = y_i - x_i'\beta_0$. If $E(y_i \mid x_i)$ is a nonlinear function, $\beta_0$ and $\sigma_0^2$ can only be characterized as the linear projection coefficient vector and the linear projection error variance, respectively.
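A small numerical sketch of the pseudo-true value idea: fitting the normal linear model by Gaussian pseudo maximum likelihood when the true errors are heteroskedastic still recovers the linear projection coefficients and the unconditional error variance. The data-generating process below is an assumption made purely for the illustration.

```python
import numpy as np

# Pseudo-true values under misspecification: the Gaussian-likelihood fit of a linear
# model converges to the linear projection coefficients even when the true conditional
# distribution is heteroskedastic (so the normal likelihood is misspecified).
rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = (1.0 + 0.8 * np.abs(x)) * rng.normal(size=n)   # conditional variance depends on x
y = 1.0 + 2.0 * x + u

beta_pml = np.linalg.solve(X.T @ X, X.T @ y)        # Gaussian PML = OLS
sigma2_pml = np.mean((y - X @ beta_pml) ** 2)       # unconditional error variance

print(beta_pml)     # close to the projection coefficients (1, 2)
print(sigma2_pml)   # close to E[(1 + 0.8|x|)^2]
```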

Pseudo maximum likelihood estimation. The estimator $\hat{\theta}$ is the maximum likelihood estimator under the assumption that $g(w)$ belongs to the $f(w, \theta)$ class. In the absence of this assumption, $\hat{\theta}$ is a pseudo-maximum likelihood (PML) estimator based on the $f(w, \theta)$ family of densities. Sometimes $\hat{\theta}$ is called a quasi-maximum likelihood estimator.

⁴ Since $E\left[ \ln \frac{g(W)}{f(W,\theta)} \right] = E[\ln g(W)] - E[\ln f(W, \theta)]$ and $E[\ln g(W)]$ does not depend on $\theta$, the arg min in (7) is the same as the arg max in (6).

2 Consistency and asymptotic normality of PML

Under regularity and identification conditions a PML estimator $\hat{\theta}$ is a consistent estimator of the

(pseudo) true value $\theta_0$. Since $\hat{\theta}$ may not have a closed form expression, we need a method for establishing the consistency of an estimator that maximizes an objective function. The following theorem, taken from Newey and McFadden (1994), provides such a method. The requirements are compactness of the parameter space, uniform convergence of the objective function to some nonstochastic continuous limit, and that the limiting objective function is uniquely maximized at the truth (identification).

Consistency Theorem. Suppose that $\hat{\theta}$ maximizes the objective function $S_n(\theta)$ in the parameter space $\Theta$. Assume the following:

(a) $\Theta$ is a compact set.

(b) The function Sn (θ) converges uniformly in probability to S0 (θ).

(c) S0 (θ) is continuous.

(d) S0 (θ) is uniquely maximized at θ0.

Then $\hat{\theta} \xrightarrow{p} \theta_0$.

In the PML context $S_n(\theta) = (1/n)\sum_{i=1}^{n} \ln f(w_i, \theta)$ and $S_0(\theta) = E[\ln f(W, \theta)]$. In particular, in the regression example, by the law of large numbers:⁵

$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2} E\left[ \left( y_i - x_i'\beta \right)^2 \right] \qquad (8)$$
and, noting that $y_i - x_i'\beta \equiv u_i - x_i'(\beta - \beta_0)$, also
$$S_0(\theta) = -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\left[ \sigma_0^2 + (\beta - \beta_0)'\, E(x_i x_i')\, (\beta - \beta_0) \right]. \qquad (9)$$
In this example, $S_0(\theta)$ is uniquely maximized at $\theta_0$ as long as $E(x_i x_i')$ has full rank.

Asymptotic normality. To discuss asymptotic normality, in addition to the conditions required for consistency, we assume that $f(w_i, \theta)$ has first and second derivatives in a neighborhood of $\theta_0$, and that $\theta_0$ is an interior point of $\Theta$.⁶ For simplicity, use the notation $\ell_i(\theta) = \ln f(w_i, \theta)$ and $q_i(\theta) = \partial\ell_i(\theta)/\partial\theta$.

Note that if the data are iid, the score $q_i(\theta_0)$ is also iid with zero mean vector and covariance matrix

$$V = E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right]. \qquad (10)$$

⁵ Given the equivalence in this case between pointwise and uniform convergence.

⁶ We can proceed as if $\hat{\theta}$ were an interior point of $\Theta$, since consistency of $\hat{\theta}$ for $\theta_0$ and the assumption that $\theta_0$ is interior to $\Theta$ imply that the probability that $\hat{\theta}$ is not interior goes to zero as $n \to \infty$.

Next, because of the central limit theorem we have

$$\frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} \equiv \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{\partial\ell_i(\theta_0)}{\partial\theta} \xrightarrow{d} \mathcal{N}(0, V). \qquad (11)$$
As for the Hessian, its convergence follows from the law of large numbers:

$$\frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta\partial\theta'} \equiv \frac{1}{n} \sum_{i=1}^{n} \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \equiv H_n(\theta_0) \xrightarrow{p} E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right] \equiv H. \qquad (12)$$

We assume that $H$ is a non-singular matrix and that $H_n(\tilde{\theta}) \xrightarrow{p} H$ for any $\tilde{\theta}$ such that $\tilde{\theta} \xrightarrow{p} \theta_0$. Now, using the mean value theorem:
$$0 = \frac{\partial L(\hat{\theta})}{\partial\theta_j} = \frac{\partial L(\theta_0)}{\partial\theta_j} + \sum_{\ell=1}^{p} \frac{\partial^2 L(\tilde{\theta}_{[j]})}{\partial\theta_j\partial\theta_\ell} \left( \hat{\theta}_\ell - \theta_{0\ell} \right) \qquad (j = 1, \dots, p) \qquad (13)$$
where $\hat{\theta}_\ell$ is the $\ell$-th element of $\hat{\theta}$, and $\tilde{\theta}_{[j]}$ denotes a $p \times 1$ random vector such that $\|\tilde{\theta}_{[j]} - \theta_0\| \leq \|\hat{\theta} - \theta_0\|$.⁷

Note that $\hat{\theta} \xrightarrow{p} \theta_0$ implies $\tilde{\theta}_{[j]} \xrightarrow{p} \theta_0$ and also
$$\frac{1}{n} \frac{\partial^2 L(\tilde{\theta}_{[j]})}{\partial\theta_j\partial\theta_\ell} \xrightarrow{p} (j, \ell) \text{ element of } H,$$
which leads to the asymptotic linear representation of the estimation error:⁸

$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) = -H^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1). \qquad (14)$$
Finally, using (11) and Cramér's theorem we obtain:

$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}(0, W) \qquad (15)$$
where $W = H^{-1} V H^{-1}$ or, at length:

$$W = \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right) \right]^{-1} E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right] \left[ E\left( \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right) \right]^{-1}. \qquad (16)$$

Asymptotic standard errors. A consistent estimator of the asymptotic variance matrix $W$ is:

$$\widehat{W} = \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta\partial\theta'} \right]^{-1} \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'} \right] \left[ \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\hat{\theta})}{\partial\theta\partial\theta'} \right]^{-1}. \qquad (17)$$

⁷ The expansion has to be made element by element since $\tilde{\theta}_{[j]}$ may be different for each $j$.

⁸ The notation $o_p(1)$ denotes a term that converges to zero in probability.
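As a concrete sketch of (17), the code below computes sandwich standard errors for the logit PML of Example 2 from the analytical score and Hessian; the simulated data and the use of a generic optimizer are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the sandwich estimator (17) for the logit model of Example 2.
rng = np.random.default_rng(4)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([0.2, -0.7])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

def negloglik(t):
    z = X @ t
    return np.sum(np.logaddexp(0.0, z) - y * z)    # minus the logit log likelihood

theta_hat = minimize(negloglik, np.zeros(2), method="BFGS").x

p = 1.0 / (1.0 + np.exp(-X @ theta_hat))
scores = X * (y - p)[:, None]                      # rows: d l_i / d theta at theta_hat
H_hat = -(X * (p * (1 - p))[:, None]).T @ X / n    # average Hessian
V_hat = scores.T @ scores / n                      # average outer product of scores
W_hat = np.linalg.inv(H_hat) @ V_hat @ np.linalg.inv(H_hat)   # sandwich (17)

se = np.sqrt(np.diag(W_hat) / n)                   # standard errors of theta_hat
print(theta_hat, se)
```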

3 The information matrix identity

As long as f (w, θ) is a density function it integrates to one:

$$\int f(w, \theta)\, dw = 1. \qquad (18)$$
Taking partial derivatives in (18) with respect to $\theta$ we get the zero mean property of the score:

$$\int \frac{\partial \ln f(w, \theta)}{\partial\theta}\, f(w, \theta)\, dw = 0. \qquad (19)$$
Next, taking partial derivatives again:

$$\int \frac{\partial^2 \ln f(w, \theta)}{\partial\theta\partial\theta'}\, f(w, \theta)\, dw + \int \frac{\partial \ln f(w, \theta)}{\partial\theta} \frac{\partial \ln f(w, \theta)}{\partial\theta'}\, f(w, \theta)\, dw = 0. \qquad (20)$$
Therefore, if $g(w) = f(w, \theta_0)$ we have

$$E\left[ \frac{\partial\ell_i(\theta_0)}{\partial\theta} \frac{\partial\ell_i(\theta_0)}{\partial\theta'} \right] = -E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right]. \qquad (21)$$

This result is known as the information matrix identity. It says that, when evaluated at $\theta_0$, the covariance matrix of the score coincides with minus the expected Hessian of the log likelihood for observation $i$. It is an identity in the sense of (20), but in general it need not hold if the expectations in (21) are taken with respect to $g(w)$ and $g(w) \neq f(w, \theta_0)$.
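A quick Monte Carlo sketch of the identity (21) in a correctly specified logit model: the average outer product of scores and minus the average Hessian, both evaluated at $\theta_0$, should be close. The sample size and parameter values are assumptions made for the illustration.

```python
import numpy as np

# Monte Carlo check of the information matrix identity (21) in a correctly
# specified logit model: E[score score'] should equal -E[Hessian].
rng = np.random.default_rng(5)
n = 200_000
theta0 = np.array([0.3, -0.6])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
p = 1.0 / (1.0 + np.exp(-X @ theta0))
y = (rng.uniform(size=n) < p).astype(float)

scores = X * (y - p)[:, None]                      # score of observation i at theta0
V = scores.T @ scores / n                          # average outer product of scores
minus_H = (X * (p * (1 - p))[:, None]).T @ X / n   # minus the average Hessian

print(np.round(V, 3))
print(np.round(minus_H, 3))   # the two matrices should be close
```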

The implication is that under correct specification $V = -H$ in the sandwich formula (16), and therefore:
$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, [I(\theta_0)]^{-1} \right) \qquad (22)$$
where

$$I(\theta_0) = -E\left[ \frac{\partial^2\ell_i(\theta_0)}{\partial\theta\partial\theta'} \right] \equiv -\operatorname*{plim}_{n\to\infty} \frac{1}{n} \frac{\partial^2 L(\theta_0)}{\partial\theta\partial\theta'}. \qquad (23)$$

The matrix $I(\theta_0)$ is known as the information matrix or the Fisher information matrix, after the work of R. A. Fisher. It is called information because it can be regarded as a measure of the amount of information that the random variable $W$ with density $f(w, \theta)$ contains about the unknown parameter

θ. Intuitively, the greater the expected curvature of the log likelihood at θ = θ0 the greater the information and the smaller the asymptotic variance of the maximum likelihood estimator, which is given by the inverse of the information matrix.

The asymptotic Cramér-Rao inequality. Under suitable regularity conditions, the matrix $[I(\theta_0)]^{-1}$ is a lower bound for the asymptotic variance of any consistent estimator of $\theta_0$. Furthermore, this lower bound is attained by the maximum likelihood estimator.

Estimating the information matrix. There are a variety of consistent estimators of $I(\theta_0)$. One possibility is to use the observed Hessian evaluated at $\hat{\theta}$:

$$\widehat{I} = -\frac{1}{n} \frac{\partial^2 L(\hat{\theta})}{\partial\theta\partial\theta'}. \qquad (24)$$
Another possibility is the expected Hessian evaluated at $\hat{\theta}$, as long as its functional form is known:

$$\widetilde{I} = I(\hat{\theta}). \qquad (25)$$
Yet another possibility, in a conditional likelihood model $f(y \mid x; \theta)$, is to use a sample average of the expected Hessian conditioned on $x_i$ and evaluated at $\hat{\theta}$:

$$\widetilde{\widetilde{I}} = -\frac{1}{n} \sum_{i=1}^{n} E\left[ \frac{\partial^2 \ln f(y \mid x_i; \hat{\theta})}{\partial\theta\partial\theta'} \,\Big|\, x_i \right]. \qquad (26)$$
Finally, one can use the variance-of-the-score form of the information matrix to obtain an estimate:
$$\widehat{\widehat{I}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta} \frac{\partial\ell_i(\hat{\theta})}{\partial\theta'}. \qquad (27)$$
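To make the menu of estimators concrete, this sketch computes the observed-Hessian estimate (24) and the score-variance estimate (27) for the logit model, where the two should agree closely under correct specification; the data and the crude Newton solver are assumptions made for the illustration.

```python
import numpy as np

# Compare two estimates of the information matrix for the logit model:
# (24) minus the average observed Hessian, and (27) the average outer product of scores.
rng = np.random.default_rng(6)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta0 = np.array([0.5, -1.0])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-X @ theta0))).astype(float)

theta = np.zeros(2)                                # Newton-Raphson for the MLE
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    theta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X, X.T @ (y - p))

p = 1.0 / (1.0 + np.exp(-X @ theta))
I_hessian = (X * (p * (1 - p))[:, None]).T @ X / n                 # estimate (24)
I_score = (X * (y - p)[:, None]).T @ (X * (y - p)[:, None]) / n    # estimate (27)
print(np.round(I_hessian, 3))
print(np.round(I_score, 3))
```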

4 Example: Normal linear regression

Letting $u_i = y_i - x_i'\beta_0$, the score at the true value for the log likelihood function in (4) is

$$\frac{\partial\ell_i(\theta_0)}{\partial\theta} = \frac{1}{\sigma_0^2}
\begin{pmatrix} x_i u_i \\[4pt] \frac{1}{2\sigma_0^2}\left( u_i^2 - \sigma_0^2 \right) \end{pmatrix}.$$

The covariance matrix of the score is
$$V = E\begin{pmatrix}
\frac{1}{\sigma_0^4} u_i^2 x_i x_i' & \frac{1}{2\sigma_0^6} x_i u_i (u_i^2 - \sigma_0^2) \\[4pt]
\frac{1}{2\sigma_0^6} x_i' u_i (u_i^2 - \sigma_0^2) & \frac{1}{4\sigma_0^8} (u_i^2 - \sigma_0^2)^2
\end{pmatrix}
= \begin{pmatrix}
\frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt]
\frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} E(u_i^4 - \sigma_0^4)
\end{pmatrix}.$$

The expected Hessian is
$$H = E\begin{pmatrix}
-\frac{1}{\sigma_0^2} x_i x_i' & -\frac{1}{\sigma_0^4} x_i u_i \\[4pt]
-\frac{1}{\sigma_0^4} x_i' u_i & \frac{1}{2\sigma_0^4} - \frac{1}{\sigma_0^6} u_i^2
\end{pmatrix}
= -\begin{pmatrix}
\frac{1}{\sigma_0^2} E(x_i x_i') & 0 \\[4pt]
0' & \frac{1}{2\sigma_0^4}
\end{pmatrix}.$$
The sandwich formula in (16) is:

$$W = \begin{pmatrix}
\sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt]
0' & 2\sigma_0^4
\end{pmatrix}
\begin{pmatrix}
\frac{1}{\sigma_0^4} E(u_i^2 x_i x_i') & \frac{1}{2\sigma_0^6} E(u_i^3 x_i) \\[4pt]
\frac{1}{2\sigma_0^6} E(u_i^3 x_i') & \frac{1}{4\sigma_0^8} E(u_i^4 - \sigma_0^4)
\end{pmatrix}
\begin{pmatrix}
\sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt]
0' & 2\sigma_0^4
\end{pmatrix}.$$

Note that under misspecification the information matrix identity does not hold, since $V \neq -H$. The first block-diagonal components of $V$ and $-H$ will coincide under conditional homoskedasticity, that is, if $E(u_i^2 \mid x_i) = \sigma_0^2$. The off-diagonal block of $V$ is zero under conditional symmetry, that is, if $E(u_i^3 \mid x_i) = 0$. Lastly, the second block-diagonal terms of $V$ and $-H$ will coincide under the normal kurtosis condition $E(u_i^4) = 3\sigma_0^4$. These conditions are satisfied when model (4) is correctly specified but not in general.
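A brief simulation sketch of this point: with heteroskedastic errors the $\beta$-block of the sandwich matrix differs from the classical block $\sigma_0^2 [E(x_i x_i')]^{-1}$. The data-generating process below is an assumption made for the illustration.

```python
import numpy as np

# Under heteroskedasticity, the beta-block of the sandwich W differs from the
# classical sigma^2 [E(x x')]^{-1} block, because E(u^2 x x') != sigma^2 E(x x').
rng = np.random.default_rng(7)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
u = np.abs(x) * rng.normal(size=n)          # Var(u | x) = x^2: heteroskedastic
y = 1.0 + 2.0 * x + u

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
Exx = X.T @ X / n
classical = np.mean(e**2) * np.linalg.inv(Exx)                   # sigma^2 [E(xx')]^{-1}
meat = (X * (e**2)[:, None]).T @ X / n                           # E(u^2 x x')
sandwich = np.linalg.inv(Exx) @ meat @ np.linalg.inv(Exx)        # beta-block of W
print(np.round(classical, 3))
print(np.round(sandwich, 3))
```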

Under correct specification:
$$\sqrt{n}\left( \hat{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, \begin{pmatrix} \sigma_0^2 [E(x_i x_i')]^{-1} & 0 \\[4pt] 0' & 2\sigma_0^4 \end{pmatrix} \right). \qquad (28)$$

5 Estimation subject to constraints

We may wish to estimate subject to constraints. Sometimes one seeks to ensure internal consistency in a model that is required to answer a question of interest such as a welfare calculation, for example enforcing symmetry of cross-price elasticities in a demand system. Another common situation is an interest in the value or adequacy of economic restrictions, such as constant returns to scale in a production function. Finally, one may be simply willing to consider a restricted version of a model as a way of producing a simpler or tighter summary of data. Constraints on parameters may be expressed as equation restrictions

$$h(\theta_0) = 0 \qquad (29)$$
where $h(\theta)$ is a vector of $r$ restrictions: $h(\theta) = (h_1(\theta), \dots, h_r(\theta))'$. Alternatively, constraints may be expressed in parametric form

$$\theta_0 = \theta(\alpha_0) \qquad (30)$$
whereby $\theta$ is functionally related to a free parameter vector $\alpha$ of smaller dimension than $\theta$, such that the number of restrictions is $r = \dim(\theta) - \dim(\alpha)$. Depending on the problem it may be more convenient to express restrictions in one way or the other (or in mixed form).

One way of obtaining a constrained estimator of θ0 is to maximize the log-likelihood function L (θ) subject to the constraints h (θ) = 0:

$$\left( \tilde{\theta}, \tilde{\lambda} \right) = \arg\max_{\theta, \lambda} \left[ \frac{1}{n} L(\theta) - \lambda' h(\theta) \right] \qquad (31)$$
where $\tilde{\lambda}$ is an $r \times 1$ vector of Lagrange multipliers.
Alternatively, if the restrictions have been parameterized as in (30), the log-likelihood function can be maximized with respect to $\alpha$ as in an unrestricted problem:

$$\tilde{\alpha} = \arg\max_{\alpha} L[\theta(\alpha)]. \qquad (32)$$

Restricted estimates of $\theta_0$ are then given by $\tilde{\theta} = \theta(\tilde{\alpha})$.
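A minimal sketch of the reparameterization approach (30)-(32): the restriction is imposed by writing $\theta = \theta(\alpha)$ and maximizing the log likelihood over the free parameter $\alpha$. The particular restriction (two slope coefficients being equal) and the simulated data are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Restricted Gaussian-regression MLE via reparameterization theta = theta(alpha):
# model y = b0 + b1*x1 + b2*x2 + u with the single restriction b1 = b2.
rng = np.random.default_rng(8)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=n)

def theta_of_alpha(a):
    """Map the free parameters alpha = (b0, common slope, log sigma2) into theta."""
    return np.array([a[0], a[1], a[1]]), np.exp(a[2])

def neg_loglik(a):
    beta, sigma2 = theta_of_alpha(a)
    e = y - X @ beta
    return 0.5 * n * np.log(2 * np.pi * sigma2) + e @ e / (2 * sigma2)

alpha_tilde = minimize(neg_loglik, np.zeros(3), method="BFGS").x
beta_tilde, sigma2_tilde = theta_of_alpha(alpha_tilde)
print(beta_tilde, sigma2_tilde)   # restricted estimates theta_tilde = theta(alpha_tilde)
```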

Asymptotic normality of constrained estimators. The asymptotic variance of $\sqrt{n}(\tilde{\alpha} - \alpha_0)$ can be obtained as an application of the result in (15)-(17) to the log likelihood $L^*(\alpha) = L[\theta(\alpha)]$. Letting $G = G(\alpha_0)$ where $G(\alpha) = \partial\theta(\alpha)/\partial\alpha'$, using the chain rule we have
$$\frac{\partial L^*(\alpha_0)}{\partial\alpha} = G' \frac{\partial L(\theta_0)}{\partial\theta}, \qquad (33)$$
and⁹
$$E\left[ \frac{\partial^2 L^*(\alpha_0)}{\partial\alpha\partial\alpha'} \right] = G' H G. \qquad (34)$$
Moreover,

$$\frac{1}{\sqrt{n}} \frac{\partial L^*(\alpha_0)}{\partial\alpha} \xrightarrow{d} \mathcal{N}\left( 0, G'VG \right). \qquad (35)$$
Therefore,

$$\sqrt{n}\left( \tilde{\alpha} - \alpha_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, (G'HG)^{-1}\, G'VG\, (G'HG)^{-1} \right). \qquad (36)$$
The asymptotic distribution of $\tilde{\theta}$ then follows from the delta method:

$$\sqrt{n}\left( \tilde{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}(0, W_R) \qquad (37)$$
where $W_R$ is a matrix of reduced rank given by

$$W_R = G (G'HG)^{-1}\, G'VG\, (G'HG)^{-1} G'. \qquad (38)$$
Under correct specification $V = -H$, so that the previous results become
$$\sqrt{n}\left( \tilde{\alpha} - \alpha_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, [G' I(\theta_0) G]^{-1} \right) \qquad (39)$$
and

$$\sqrt{n}\left( \tilde{\theta} - \theta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, G [G' I(\theta_0) G]^{-1} G' \right). \qquad (40)$$
Under correct specification, $\tilde{\theta}$ is asymptotically more efficient than $\hat{\theta}$, since the difference between their asymptotic variance matrices is positive semidefinite:
$$[I(\theta_0)]^{-1} - G [G' I(\theta_0) G]^{-1} G' = A^{-1}\left[ I - G^* (G^{*\prime} G^*)^{-1} G^{*\prime} \right] (A^{-1})' \geq 0 \qquad (41)$$
where $I(\theta_0) = A'A$ and $G^* = AG$. However, under misspecification the sandwich matrix $W = H^{-1} V H^{-1}$ and $W_R$ in (38) cannot be ordered.

⁹ The matrix $\partial^2 L^*(\alpha_0)/\partial\alpha\partial\alpha'$ contains an additional term, which is equal to zero in expectation.

The likelihood ratio statistic. By construction an unrestricted maximum is greater than a restricted one: $L(\hat{\theta}) \geq L(\tilde{\theta})$. It seems therefore natural to look at the log-likelihood change $L(\hat{\theta}) - L(\tilde{\theta})$ as a measure of the cost of imposing the restrictions. It can be shown that, under correct specification (or if the information matrix identity holds) and if the $r$ restrictions on $\theta_0$ also hold, then
$$LR = 2\left[ L(\hat{\theta}) - L(\tilde{\theta}) \right] \xrightarrow{d} \chi_r^2. \qquad (42)$$
This result is used to construct a large-sample test of the restrictions: the rule is to reject the restrictions if $LR > \kappa$, where $\kappa$ is the $(1-\alpha)$-quantile of the $\chi_r^2$ distribution and $\alpha$ is the chosen size of the test.
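A short sketch of an LR test in the logit model of Example 2, testing that one coefficient is zero; the simulated data and the use of scipy's optimizer and chi-square quantile are assumptions made for the illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def neg_loglik(theta, Z, y):
    """Negative logit log likelihood."""
    z = Z @ theta
    return np.sum(np.logaddexp(0.0, z) - y * z)

rng = np.random.default_rng(9)
n = 800
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(0.2 + 0.8 * X[:, 1])))).astype(float)

# Unrestricted fit, and restricted fit imposing the single restriction theta_3 = 0.
L_hat = -minimize(neg_loglik, np.zeros(3), args=(X, y), method="BFGS").fun
L_tilde = -minimize(neg_loglik, np.zeros(2), args=(X[:, :2], y), method="BFGS").fun

LR = 2 * (L_hat - L_tilde)
print(LR, chi2.ppf(0.95, df=1))   # reject the restriction if LR exceeds the 0.95 quantile
```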

The Wald statistic. We can also learn about the adequacy of the restrictions by examining how far the unrestricted estimates are from satisfying the constraints. We are thus led to look at unrestricted estimates of the constraints, given by $h(\hat{\theta})$. Letting $D = D(\theta_0)$ and $\widehat{D} = D(\hat{\theta})$ where $D(\theta) = \partial h(\theta)/\partial\theta'$, under the assumption that $h(\theta_0) = 0$, from the delta method we have
$$\sqrt{n}\, h(\hat{\theta}) \xrightarrow{d} \mathcal{N}\left( 0, D W D' \right) \qquad (43)$$
and

$$\mathcal{W}_R = n\, h(\hat{\theta})' \left[ \widehat{D} \widehat{W} \widehat{D}' \right]^{-1} h(\hat{\theta}) \xrightarrow{d} \chi_r^2. \qquad (44)$$
The quantity $\mathcal{W}_R$ is a Wald statistic. Like the $LR$ statistic, it can be used to construct a large-sample test of the restrictions with a similar rejection region. However, while the calculation of $LR$ requires both $\hat{\theta}$ and $\tilde{\theta}$, the calculation of $\mathcal{W}_R$ only requires the unrestricted estimate $\hat{\theta}$.
Another difference is that, contrary to $LR$, $\mathcal{W}_R$ still has a large-sample chi-square distribution under misspecification if it relies on a robust estimate of the variance of $\hat{\theta}$, as in (44). A Wald statistic that is directly comparable to the $LR$ statistic would be a non-robust version of the form:
$$\mathcal{W} = n\, h(\hat{\theta})' \left[ \widehat{D} \left[ I(\hat{\theta}) \right]^{-1} \widehat{D}' \right]^{-1} h(\hat{\theta}). \qquad (45)$$

The Lagrange Multiplier statistic. Another angle on the cost of imposing the restrictions is to examine how far the estimated Lagrange multiplier vector $\tilde{\lambda}$ is from zero. The first-order conditions from the optimization problem in (31) are:
$$\frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta} = D(\tilde{\theta})' \tilde{\lambda} \qquad (46)$$
$$h(\tilde{\theta}) = 0, \qquad (47)$$
which lead to the asymptotic linear representation of $\tilde{\lambda}$:¹⁰

$$\sqrt{n}\, \tilde{\lambda} = \left( D H^{-1} D' \right)^{-1} D H^{-1} \frac{1}{\sqrt{n}} \frac{\partial L(\theta_0)}{\partial\theta} + o_p(1) \qquad (48)$$
and

$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} \mathcal{N}\left( 0, \left( D H^{-1} D' \right)^{-1} D H^{-1} V H^{-1} D' \left( D H^{-1} D' \right)^{-1} \right), \qquad (49)$$
and also

$$LM_R = n\, \tilde{\lambda}'\, \widetilde{D} \widetilde{H}^{-1} \widetilde{D}' \left[ \widetilde{D} \widetilde{H}^{-1} \widetilde{V} \widetilde{H}^{-1} \widetilde{D}' \right]^{-1} \widetilde{D} \widetilde{H}^{-1} \widetilde{D}'\, \tilde{\lambda} \xrightarrow{d} \chi_r^2 \qquad (50)$$
where

$$\widetilde{D} = D(\tilde{\theta}), \qquad
\widetilde{H} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial^2\ell_i(\tilde{\theta})}{\partial\theta\partial\theta'}, \qquad
\widetilde{V} = \frac{1}{n}\sum_{i=1}^{n} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta} \frac{\partial\ell_i(\tilde{\theta})}{\partial\theta'}.$$
The quantity $LM_R$ is a Lagrange Multiplier statistic. Like the Wald statistic in (44), it can be used to construct a large-sample test of the restrictions, and it remains valid if the objective function is only a pseudo likelihood function that does not satisfy the information matrix identity.
In view of (46) we can replace $\widetilde{D}'\tilde{\lambda}$ in (50) with $n^{-1}\partial L(\tilde{\theta})/\partial\theta$, which produces the score form of the statistic. The score function is exactly equal to zero when evaluated at $\hat{\theta}$, but not when evaluated at $\tilde{\theta}$. If the constraints are true, we would expect both $n^{-1}\partial L(\tilde{\theta})/\partial\theta$ and $\tilde{\lambda}$ to be small quantities, so that the rejection region of the null hypothesis $h(\theta_0) = 0$ is associated with large values of $LM_R$.
Under correct specification $V = -H$, we get
$$\sqrt{n}\, \tilde{\lambda} \xrightarrow{d} \mathcal{N}\left( 0, \left[ D [I(\theta_0)]^{-1} D' \right]^{-1} \right) \qquad (51)$$
and

$$LM = n\, \tilde{\lambda}'\, \widetilde{D} \left[ I(\tilde{\theta}) \right]^{-1} \widetilde{D}'\, \tilde{\lambda}
\equiv \frac{1}{n} \frac{\partial L(\tilde{\theta})}{\partial\theta'} \left[ I(\tilde{\theta}) \right]^{-1} \frac{\partial L(\tilde{\theta})}{\partial\theta} \xrightarrow{d} \chi_r^2. \qquad (52)$$
The statistic $LM$ is a non-robust version of the Lagrange Multiplier statistic that is directly comparable to the $LR$ statistic.

¹⁰ Using the mean value theorem for each component of the first-order conditions:

$$D(\tilde{\theta})'\tilde{\lambda} = \frac{1}{n}\frac{\partial L(\theta_0)}{\partial\theta} + \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'}\left( \tilde{\theta} - \theta_0 \right)$$
$$0 = h(\tilde{\theta}) = h(\theta_0) + D(\theta^{**})\left( \tilde{\theta} - \theta_0 \right)$$
and combining the two expressions, using that $h(\theta_0) = 0$, we get:

$$D(\theta^{**})\left[ \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'} \right]^{-1} D(\tilde{\theta})'\tilde{\lambda}
= D(\theta^{**})\left[ \frac{1}{n}\frac{\partial^2 L(\theta^*)}{\partial\theta\partial\theta'} \right]^{-1} \frac{1}{n}\frac{\partial L(\theta_0)}{\partial\theta}.$$

6 Example: LR, Wald and LM in the normal linear regression model

The model is the same as in (4). We consider the partitions $X = (X_1, X_2)$ and $\beta = (\beta_1', \beta_2')'$ where

$X_1$ is of order $n \times r$, $X_2$ is $n \times (k - r)$, $\beta_1$ is $r \times 1$ and $\beta_2$ is $(k - r) \times 1$, and the $r$ restrictions are

$$\beta_1 = 0. \qquad (53)$$

The unrestricted estimates are $\hat{\theta} = (\hat{\beta}', \hat{\sigma}^2)'$ as in (5), whereas the restricted estimates are

$$\tilde{\beta}_1 = 0, \qquad \tilde{\beta}_2 = (X_2'X_2)^{-1} X_2' y, \qquad \tilde{\sigma}^2 = \frac{\tilde{u}'\tilde{u}}{n} \qquad (54)$$
where $\tilde{u} = y - X_2\tilde{\beta}_2$.
The LR statistic is given by
$$LR = n \ln\left( \frac{\tilde{u}'\tilde{u}}{\hat{u}'\hat{u}} \right). \qquad (55)$$
If we modify the example to assume that $\sigma_0^2$ is known, $\hat{\beta}$ and $\tilde{\beta}$ remain unchanged, but in this case
$$LR_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2}, \qquad (56)$$
which is exactly distributed as $\chi_r^2$ under normality. Turning to the Wald statistic, recall that

$$\sqrt{n}\left( \hat{\beta} - \beta_0 \right) \xrightarrow{d} \mathcal{N}\left( 0, \sigma_0^2 \operatorname{plim}\left( X'X/n \right)^{-1} \right)$$
and introduce the partition

$$(X'X)^{-1} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}. \qquad (57)$$
Thus, if the restrictions hold,

$$\sqrt{n}\, \hat{\beta}_1 \xrightarrow{d} \mathcal{N}\left( 0, \sigma_0^2 \operatorname{plim}(n A_{11}) \right) \qquad (58)$$
and therefore the (non-robust) Wald statistic is given by:

$$\mathcal{W} = \frac{\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1}{\hat{\sigma}^2}. \qquad (59)$$
It can be shown that $\hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}$, so that also¹¹
$$\mathcal{W} = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\hat{\sigma}^2}. \qquad (60)$$

¹¹ Using the partitioned inverse matrix result $A_{11}^{-1} = X_1'(I - M_2)X_1$ and $M M_2 = M_2$, where $M_2 = X_2(X_2'X_2)^{-1}X_2'$ and $M = X(X'X)^{-1}X'$. Premultiplying $y = X\hat{\beta} + \hat{u}$ by $(I - M_2)$ and taking squares we get $y'(I - M_2)y = \hat{\beta}_1' X_1'(I - M_2)X_1 \hat{\beta}_1 + \hat{u}'(I - M_2)\hat{u}$, which equals $\tilde{u}'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 + \hat{u}'\hat{u}$.

Moreover, if $\sigma_0^2$ is known,

$$\mathcal{W}_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2},$$
so that $\mathcal{W}_\sigma = LR_\sigma$. With unknown $\sigma_0^2$ it can be shown that $\mathcal{W} \geq LR$, even though both statistics have the same asymptotic distribution.
Finally, turning to the LM statistic, the components of the score are:
$$\frac{\partial L(\theta)}{\partial\beta_1} = \frac{1}{\sigma^2} X_1'(y - X\beta), \qquad
\frac{\partial L(\theta)}{\partial\beta_2} = \frac{1}{\sigma^2} X_2'(y - X\beta), \qquad
\frac{\partial L(\theta)}{\partial\sigma^2} = \frac{n}{2\sigma^4}\left[ \frac{1}{n}(y - X\beta)'(y - X\beta) - \sigma^2 \right], \qquad (61)$$
which, once evaluated at the restricted estimates, are
$$\frac{\partial L(\tilde{\theta})}{\partial\beta_1} = \frac{1}{\tilde{\sigma}^2} X_1'\tilde{u} \neq 0, \qquad
\frac{\partial L(\tilde{\theta})}{\partial\beta_2} = \frac{1}{\tilde{\sigma}^2} X_2'\tilde{u} = 0, \qquad
\frac{\partial L(\tilde{\theta})}{\partial\sigma^2} = \frac{n}{2\tilde{\sigma}^4}\left[ \frac{1}{n}\tilde{u}'\tilde{u} - \tilde{\sigma}^2 \right] = 0. \qquad (62)$$
Moreover, the information matrix in this case is
$$I(\theta_0) = \begin{pmatrix} \frac{1}{\sigma_0^2} \operatorname{plim}\frac{X'X}{n} & 0 \\[4pt] 0' & \frac{1}{2\sigma_0^4} \end{pmatrix}, \qquad (63)$$
so that, using

$$I(\tilde{\theta}) = \begin{pmatrix} \frac{1}{\tilde{\sigma}^2} \frac{X'X}{n} & 0 \\[4pt] 0' & \frac{1}{2\tilde{\sigma}^4} \end{pmatrix}, \qquad (64)$$
the LM statistic becomes
$$LM = \begin{pmatrix} \frac{\tilde{u}'X_1}{\tilde{\sigma}^2} & 0' & 0 \end{pmatrix}
\begin{pmatrix} \tilde{\sigma}^2 A_{11} & \tilde{\sigma}^2 A_{12} & 0 \\ \tilde{\sigma}^2 A_{21} & \tilde{\sigma}^2 A_{22} & 0 \\ 0' & 0' & 2\tilde{\sigma}^4/n \end{pmatrix}
\begin{pmatrix} \frac{X_1'\tilde{u}}{\tilde{\sigma}^2} \\ 0 \\ 0 \end{pmatrix} \qquad (65)$$
and
$$LM = \frac{\tilde{u}'X_1 A_{11} X_1'\tilde{u}}{\tilde{\sigma}^2}. \qquad (66)$$
It can be shown that

$$\tilde{u}'X_1 A_{11} X_1'\tilde{u} = \hat{\beta}_1' A_{11}^{-1} \hat{\beta}_1 = \tilde{u}'\tilde{u} - \hat{u}'\hat{u}, \qquad (67)$$
so that also
$$LM = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\tilde{\sigma}^2}. \qquad (68)$$
Moreover, if $\sigma_0^2$ is known,
$$LM_\sigma = \frac{\tilde{u}'\tilde{u} - \hat{u}'\hat{u}}{\sigma_0^2},$$
so that $LM_\sigma = \mathcal{W}_\sigma = LR_\sigma$. With unknown $\sigma_0^2$ it is easy to show that $\mathcal{W} \geq LR \geq LM$.
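A compact numerical sketch of this section: for simulated normal-regression data it computes $LR$ (55), the Wald statistic (60) and the LM statistic (68) for the restriction $\beta_1 = 0$, and checks the ordering $\mathcal{W} \geq LR \geq LM$. All data choices are assumptions made for the illustration.

```python
import numpy as np

# LR, Wald and LM for beta_1 = 0 in the normal linear regression model,
# computed from unrestricted and restricted residual sums of squares.
rng = np.random.default_rng(10)
n, r = 300, 1
X1 = rng.normal(size=(n, r))                  # regressors under test
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
X = np.column_stack([X1, X2])
y = X2 @ np.array([1.0, 0.5]) + rng.normal(size=n)   # beta_1 = 0 holds in the DGP

def rss(Z):
    b = np.linalg.lstsq(Z, y, rcond=None)[0]
    e = y - Z @ b
    return e @ e

rss_u, rss_r = rss(X), rss(X2)                # unrestricted and restricted fits
sigma2_hat, sigma2_tilde = rss_u / n, rss_r / n

LR = n * np.log(rss_r / rss_u)                # (55)
W = (rss_r - rss_u) / sigma2_hat              # (60)
LM = (rss_r - rss_u) / sigma2_tilde           # (68)
print(W, LR, LM)                              # W >= LR >= LM
```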
