
Wald (and Score) Tests

Vector of MLEs is Asymptotically Normal: That is, Multivariate Normal

This yields

- Confidence intervals
- Z-tests of $H_0: \theta_j = \theta_0$
- Wald tests
- Score Tests
- Indirectly, the Likelihood Ratio tests

Under Regularity Conditions (Thank you, Mr. Wald)

- $\widehat{\theta}_n \stackrel{a.s.}{\rightarrow} \theta$
- $\sqrt{n}\,(\widehat{\theta}_n - \theta) \stackrel{d}{\rightarrow} T \sim N_k\!\left(0,\, I(\theta)^{-1}\right)$
- So we say that $\widehat{\theta}_n$ is asymptotically $N_k\!\left(\theta,\, \frac{1}{n} I(\theta)^{-1}\right)$.
- $I(\theta)$ is the Fisher Information in one observation.
- A $k \times k$ matrix: $I(\theta) = \left[ E\!\left( -\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(Y;\theta) \right) \right]$
- The Fisher Information in the whole sample is $nI(\theta)$.
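As a quick illustration of this definition (an added example, not from the original slides), take a single observation $Y \sim \mathrm{Bernoulli}(\theta)$:

$$-\log f(Y;\theta) = -Y\log\theta - (1-Y)\log(1-\theta), \qquad -\frac{\partial^2}{\partial\theta^2}\log f(Y;\theta) = \frac{Y}{\theta^2} + \frac{1-Y}{(1-\theta)^2},$$

so $I(\theta) = E\!\left[\frac{Y}{\theta^2} + \frac{1-Y}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}$, and the Fisher Information in a sample of size $n$ is $n/\bigl(\theta(1-\theta)\bigr)$.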

$H_0: C\theta = h$

Suppose $\theta = (\theta_1, \ldots, \theta_7)$, and the null hypothesis is

- $\theta_1 = \theta_2$
- $\theta_6 = \theta_7$
- $\frac{1}{3}(\theta_1 + \theta_2 + \theta_3) = \frac{1}{3}(\theta_4 + \theta_5 + \theta_6)$

We can write the null hypothesis in matrix form as

$$
\begin{pmatrix}
1 & -1 & 0 &  0 &  0 &  0 &  0 \\
0 &  0 & 0 &  0 &  0 &  1 & -1 \\
1 &  1 & 1 & -1 & -1 & -1 &  0
\end{pmatrix}
\begin{pmatrix}
\theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \\ \theta_6 \\ \theta_7
\end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}
$$
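As a small added illustration (the slides themselves contain no code), this particular $C$ and $h$ can be set up in Python with numpy, with the third restriction multiplied through by 3 to match the matrix above:

```python
import numpy as np

# H0: C theta = h for theta = (theta1, ..., theta7).
# Row 1: theta1 - theta2 = 0
# Row 2: theta6 - theta7 = 0
# Row 3: theta1 + theta2 + theta3 - theta4 - theta5 - theta6 = 0
#        (the equal-averages restriction, multiplied through by 3)
C = np.array([[1, -1, 0,  0,  0,  0,  0],
              [0,  0, 0,  0,  0,  1, -1],
              [1,  1, 1, -1, -1, -1,  0]], dtype=float)
h = np.zeros(3)
```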

Suppose $H_0: C\theta = h$ is True, and $\widehat{I}(\theta)_n \stackrel{p}{\rightarrow} I(\theta)$

By Slutsky 6a (Continuous mapping),

$$\sqrt{n}\,(C\widehat{\theta}_n - C\theta) = \sqrt{n}\,(C\widehat{\theta}_n - h) \stackrel{d}{\rightarrow} CT \sim N_r\!\left(0,\; C I(\theta)^{-1} C'\right)$$

and $\widehat{I}(\theta)_n^{-1} \stackrel{p}{\rightarrow} I(\theta)^{-1}$. Then by Slutsky's (6c) Stack Theorem,

$$\begin{pmatrix} \sqrt{n}\,(C\widehat{\theta}_n - h) \\ \widehat{I}(\theta)_n^{-1} \end{pmatrix} \stackrel{d}{\rightarrow} \begin{pmatrix} CT \\ I(\theta)^{-1} \end{pmatrix}.$$

Finally, by Slutsky 6a again,

$$W_n = n\,(C\widehat{\theta}_n - h)'\left(C \widehat{I}(\theta)_n^{-1} C'\right)^{-1}(C\widehat{\theta}_n - h) \stackrel{d}{\rightarrow} W = (CT - 0)'\left(C I(\theta)^{-1} C'\right)^{-1}(CT - 0) \sim \chi^2(r)$$

The Wald Test

$$W_n = n\,(C\widehat{\theta}_n - h)'\left(C \widehat{I}(\theta)_n^{-1} C'\right)^{-1}(C\widehat{\theta}_n - h)$$

- Again, the null hypothesis is $H_0: C\theta = h$
- Matrix $C$ is $r \times k$, $r \leq k$, rank $r$
- All we need is a consistent estimator of $I(\theta)$
- $I(\widehat{\theta})$ would do
- But it's inconvenient
- Need to compute partial derivatives and expected values in $I(\theta) = \left[ E\!\left( -\frac{\partial^2}{\partial\theta_i \partial\theta_j} \log f(Y;\theta) \right) \right]$

Observed Fisher Information

- To find $\widehat{\theta}_n$, minimize the minus log likelihood.
- Matrix of mixed partial derivatives of the minus log likelihood is
$$\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\,\ell(\theta,\mathbf{Y})\right] = \left[\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i \partial\theta_j}\log f(Y_i;\theta)\right]$$
- So by the Strong Law of Large Numbers,
$$\mathbf{J}_n(\theta) = \left[\frac{1}{n}\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i \partial\theta_j}\log f(Y_i;\theta)\right] \stackrel{a.s.}{\rightarrow} \left[E\!\left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\log f(Y;\theta)\right)\right] = I(\theta)$$

A Consistent Estimator of $I(\theta)$

Just substitute $\widehat{\theta}_n$ for $\theta$:

$$\mathbf{J}_n(\widehat{\theta}_n) = \left[\frac{1}{n}\sum_{i=1}^n -\frac{\partial^2}{\partial\theta_i \partial\theta_j}\log f(Y_i;\theta)\right]_{\theta=\widehat{\theta}_n} \stackrel{a.s.}{\rightarrow} \left[E\!\left(-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\log f(Y;\theta)\right)\right] = I(\theta)$$

- Convergence is believable but not trivial.
- Now we have a consistent estimator, more convenient than $I(\widehat{\theta}_n)$: use $\widehat{I}(\theta)_n = \mathbf{J}_n(\widehat{\theta}_n)$.

Approximate the Asymptotic Covariance Matrix

- The asymptotic covariance matrix of $\widehat{\theta}_n$ is $\frac{1}{n} I(\theta)^{-1}$.
- Approximate it with
$$\widehat{\mathbf{V}}_n = \frac{1}{n}\,\mathbf{J}_n(\widehat{\theta}_n)^{-1} = \frac{1}{n}\left(\frac{1}{n}\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\,\ell(\theta,\mathbf{Y})\right]_{\theta=\widehat{\theta}_n}\right)^{-1} = \left(\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\,\ell(\theta,\mathbf{Y})\right]_{\theta=\widehat{\theta}_n}\right)^{-1}$$

Compare Hessian and (Estimated) Asymptotic Covariance Matrix

- $\widehat{\mathbf{V}}_n = \left(\left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\,\ell(\theta,\mathbf{Y})\right]_{\theta=\widehat{\theta}_n}\right)^{-1}$
- The Hessian at the MLE is $\mathbf{H} = \left[-\frac{\partial^2}{\partial\theta_i \partial\theta_j}\,\ell(\theta,\mathbf{Y})\right]_{\theta=\widehat{\theta}_n}$
- So to estimate the asymptotic covariance matrix of $\widehat{\theta}_n$, just invert the Hessian.
- The Hessian is usually available as a by-product of the numerical search for the MLE.

Connection to Numerical Optimization

- Suppose we are minimizing the minus log likelihood by a direct search.
- We have reached a point where the gradient is close to zero. Is this point a minimum?
- The Hessian is a matrix of mixed partial derivatives. If all its eigenvalues are positive at a point, the function is concave up there.
- It's the multivariable second derivative test.
- The Hessian at the MLE is exactly the observed Fisher information matrix.
- Partial derivatives are often approximated by the slopes of secant lines – no need to calculate them.

So to find the estimated asymptotic covariance matrix

- Minimize the minus log likelihood numerically.
- The Hessian at the place where the search stops is exactly the observed Fisher information matrix.
- Invert it to get $\widehat{\mathbf{V}}_n$.
- This is so handy that sometimes we do it even when a closed-form expression for the MLE is available.
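Here is a minimal sketch of that recipe in Python with SciPy. The i.i.d. Gamma model, the simulated data, and the crude finite-difference Hessian are illustrative assumptions added here, not part of the slides:

```python
import numpy as np
from scipy.optimize import minimize, approx_fprime
from scipy.stats import gamma

rng = np.random.default_rng(2024)
y = rng.gamma(shape=3.0, scale=0.5, size=200)      # toy data: alpha = 3, rate lambda = 2

def mll(theta):
    """Minus log likelihood for i.i.d. Gamma(alpha, rate lam) data."""
    alpha, lam = theta
    if alpha <= 0 or lam <= 0:
        return np.inf
    return -np.sum(gamma.logpdf(y, a=alpha, scale=1.0 / lam))

# Step 1: minimize the minus log likelihood numerically.
res = minimize(mll, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
theta_hat = res.x

# Step 2: the Hessian of the minus log likelihood at the MLE is the observed
# Fisher information; approximate it here by differencing numerical gradients.
def num_hessian(f, x, eps=1e-5):
    k = len(x)
    g0 = approx_fprime(x, f, eps)
    H = np.empty((k, k))
    for j in range(k):
        xj = x.copy()
        xj[j] += eps
        H[:, j] = (approx_fprime(xj, f, eps) - g0) / eps
    return (H + H.T) / 2.0                          # symmetrize

# Step 3: invert it to get V_hat, the estimated asymptotic covariance matrix.
V_hat = np.linalg.inv(num_hessian(mll, theta_hat))
```

With a gradient-based optimizer such as `method="BFGS"`, `res.hess_inv` gives a rough built-in approximation to the same inverse Hessian as a by-product of the search.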

Estimated Asymptotic Covariance Matrix $\widehat{\mathbf{V}}_n$ is Useful

- The asymptotic standard error of $\widehat{\theta}_j$ is the square root of the $j$th diagonal element of $\widehat{\mathbf{V}}_n$.
- Denote the asymptotic standard error of $\widehat{\theta}_j$ by $S_{\widehat{\theta}_j}$.
- Thus
$$Z_j = \frac{\widehat{\theta}_j - \theta_j}{S_{\widehat{\theta}_j}}$$
is approximately standard normal.

Confidence Intervals and Z-tests

Have $Z_j = \frac{\widehat{\theta}_j - \theta_j}{S_{\widehat{\theta}_j}}$ approximately standard normal, yielding

- Confidence intervals: $\widehat{\theta}_j \pm z_{\alpha/2}\, S_{\widehat{\theta}_j}$
- Test $H_0: \theta_j = \theta_0$ using
$$Z = \frac{\widehat{\theta}_j - \theta_0}{S_{\widehat{\theta}_j}}$$
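A tiny numerical sketch of both uses (a toy Poisson example added here, not from the slides): for i.i.d. Poisson($\theta$) data the MLE is $\bar{y}$, the observed Fisher information in the whole sample is $n/\bar{y}$, and so $S_{\widehat{\theta}} = \sqrt{\bar{y}/n}$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.poisson(lam=4.0, size=150)            # toy data
n, theta_hat = len(y), y.mean()               # MLE of a Poisson mean is the sample mean

se = np.sqrt(theta_hat / n)                   # sqrt of the (1x1) estimated asymptotic variance

z = norm.ppf(0.975)                           # z_{alpha/2} for a 95% confidence interval
ci = (theta_hat - z * se, theta_hat + z * se)

theta_0 = 4.0                                 # H0: theta = theta_0
Z = (theta_hat - theta_0) / se
p_value = 2 * norm.sf(abs(Z))                 # two-sided p-value
```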

And Wald Tests

Recalling $\widehat{\mathbf{V}}_n = \frac{1}{n}\,\mathbf{J}_n(\widehat{\theta}_n)^{-1}$,

$$\begin{aligned}
W_n &= n\,(C\widehat{\theta}_n - h)'\left(C \widehat{I}(\theta)_n^{-1} C'\right)^{-1}(C\widehat{\theta}_n - h) \\
    &= n\,(C\widehat{\theta}_n - h)'\left(C \mathbf{J}_n(\widehat{\theta}_n)^{-1} C'\right)^{-1}(C\widehat{\theta}_n - h) \\
    &= n\,(C\widehat{\theta}_n - h)'\left(C (n\widehat{\mathbf{V}}_n) C'\right)^{-1}(C\widehat{\theta}_n - h) \\
    &= n\,\frac{1}{n}\,(C\widehat{\theta}_n - h)'\left(C \widehat{\mathbf{V}}_n C'\right)^{-1}(C\widehat{\theta}_n - h) \\
    &= (C\widehat{\theta}_n - h)'\left(C \widehat{\mathbf{V}}_n C'\right)^{-1}(C\widehat{\theta}_n - h)
\end{aligned}$$
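In code, only the last line of this chain is needed once $\widehat{\theta}_n$ and $\widehat{\mathbf{V}}_n$ are available from a fitted model. A hedged sketch follows; the helper name `wald_test` and the placeholder numbers are made up for illustration:

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, V_hat, C, h):
    """Wald statistic (C theta_hat - h)' (C V_hat C')^{-1} (C theta_hat - h) and its p-value."""
    d = C @ theta_hat - h
    W = float(d @ np.linalg.solve(C @ V_hat @ C.T, d))
    df = C.shape[0]                     # r = number of rows of C (assumed full row rank)
    return W, chi2.sf(W, df)

# Placeholder estimates, as if they came from a fitted 3-parameter model;
# C and h encode H0: theta1 = theta2.
theta_hat = np.array([1.2, 0.9, 3.4])
V_hat = np.diag([0.04, 0.05, 0.10])
C = np.array([[1.0, -1.0, 0.0]])
h = np.zeros(1)
W, p_value = wald_test(theta_hat, V_hat, C, h)
```

Using `np.linalg.solve` instead of explicitly inverting $C\widehat{\mathbf{V}}_nC'$ is a routine numerical-stability choice.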

Score Tests (Thank you, Mr. Rao)

- $\widehat{\theta}$ is the MLE of $\theta$, size $k \times 1$
- $\widehat{\theta}_0$ is the MLE under $H_0$, size $k \times 1$
- $u(\theta) = \left(\frac{\partial \ell}{\partial\theta_1}, \ldots, \frac{\partial \ell}{\partial\theta_k}\right)'$ is the gradient.

- $u(\widehat{\theta}) = 0$
- If $H_0$ is true, $u(\widehat{\theta}_0)$ should also be close to zero.
- Under $H_0$, for large $n$, $u(\widehat{\theta}_0) \sim N_k(0, \mathcal{J}(\theta))$ approximately, where $\mathcal{J}(\theta) = nI(\theta)$ is the Fisher Information in the whole sample.

- And
$$S = u(\widehat{\theta}_0)'\,\mathcal{J}(\widehat{\theta}_0)^{-1}\,u(\widehat{\theta}_0) \sim \chi^2(r),$$
where $r$ is the number of restrictions imposed by $H_0$.
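As a self-contained sketch (a two-sample Poisson illustration added here, not from the slides): to test $H_0: \theta_1 = \theta_2$ for two independent Poisson samples, the restricted MLE is the pooled mean, and both the score and the observed information at $\widehat{\theta}_0$ have closed forms.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
y1 = rng.poisson(lam=3.0, size=80)     # group 1
y2 = rng.poisson(lam=3.0, size=120)    # group 2 (here H0: equal Poisson means is true)

# Restricted MLE under H0: theta1 = theta2 is the pooled mean.
pooled = np.concatenate([y1, y2]).mean()
theta0 = np.array([pooled, pooled])

# Score (gradient of the log likelihood) and observed information at theta0.
u = np.array([y1.sum() / theta0[0] - len(y1),
              y2.sum() / theta0[1] - len(y2)])
J = np.diag([y1.sum() / theta0[0] ** 2,
             y2.sum() / theta0[1] ** 2])

S = float(u @ np.linalg.solve(J, u))
p_value = chi2.sf(S, df=1)             # r = 1 restriction
```

Only the restricted fit is needed here, which is the practical appeal of the score test.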

Three Big Tests

- Score Tests: Fit just the restricted model
- Wald Tests: Fit just the unrestricted model
- Likelihood Ratio Tests: Fit both

Comparing Likelihood Ratio and Wald

- Asymptotically equivalent under $H_0$, meaning $(W_n - G_n) \stackrel{p}{\rightarrow} 0$
- Under $H_1$,
  - Both have approximately the same distribution (non-central chi-square)
  - Both go to infinity as $n \rightarrow \infty$
  - But values are not necessarily close

- Likelihood ratio test tends to get closer to the right Type I error rate for small samples.
- Wald can be more convenient when testing lots of hypotheses, because you only need to fit the model once.
- Wald can be more convenient if it's a lot of work to write the restricted likelihood.
