
Chapter 4. An Introduction to Asymptotic Theory

We introduce some basic asymptotic theory in this chapter, which is necessary for understanding the asymptotic properties of the LSE. For more advanced material on asymptotic theory, see Dudley (1984), Shorack and Wellner (1986), Pollard (1984, 1990), Van der Vaart and Wellner (1996), Van der Vaart (1998), Van de Geer (2000) and Kosorok (2008). For reader-friendly versions of these materials, see Gallant (1987), Gallant and White (1988), Newey and McFadden (1994), Andrews (1994), Davidson (1994) and White (2001). Our discussion is related to Section 2.1 and Chapter 7 of Hayashi (2000), Appendices C and D of Hansen (2007) and Chapter 3 of Wooldridge (2010). In this chapter and the next, $\|\cdot\|$ always means the Euclidean norm.

1 Five Weapons in Asymptotic Theory

There are five tools (and their extensions) that are most useful in the asymptotic theory of statistics and econometrics. They are the weak law of large numbers (WLLN, or LLN), the central limit theorem (CLT), the continuous mapping theorem (CMT), Slutsky's theorem,¹ and the Delta method. We only state these five tools here; detailed proofs can be found in the technical appendix. To state the WLLN, we first define convergence in probability.

Definition 1 A random vector $Z_n$ converges in probability to $Z$ as $n \to \infty$, denoted as $Z_n \xrightarrow{p} Z$, if for any $\epsilon > 0$,
$$\lim_{n\to\infty} P\left(\|Z_n - Z\| > \epsilon\right) = 0.$$

Although the limit $Z$ can be random, it is usually a constant. The probability limit of $Z_n$ is often denoted as $\mathrm{plim}(Z_n)$. If $Z_n \xrightarrow{p} 0$, we write $Z_n = o_p(1)$. When an estimator converges in probability to the true value as the sample size diverges, we say that the estimator is consistent. This is a good property for an estimator to possess. It means that for any given distribution of the data, there is a sample size $n$ sufficiently large such that the estimator will be arbitrarily close to the true value with high probability. Consistency is also an important preliminary step in establishing other important asymptotic approximations.

¹ Eugen Slutsky (1880-1948) was a Russian/Soviet mathematical statistician and economist, who is also famous for the Slutsky equation in microeconomics.

Theorem 1 (WLLN) Suppose $X_1, \ldots, X_n, \ldots$ are i.i.d. random vectors and $E[\|X\|] < \infty$; then as $n \to \infty$,
$$\bar{X}_n \equiv \frac{1}{n}\sum_{i=1}^n X_i \xrightarrow{p} E[X].$$

The CLT tells more than the WLLN. To state the CLT, we first define convergence in distribution.

Definition 2 A random $k$-vector $Z_n$ converges in distribution to $Z$ as $n \to \infty$, denoted as $Z_n \xrightarrow{d} Z$, if
$$\lim_{n\to\infty} F_n(z) = F(z)$$
at all $z$ where $F(\cdot)$ is continuous, where $F_n$ is the cdf of $Z_n$ and $F$ is the cdf of $Z$.

Usually, $Z$ is normally distributed, so all $z \in \mathbb{R}^k$ are continuity points of $F$. If $Z_n$ converges in distribution to $Z$, then $Z_n$ is stochastically bounded and we denote $Z_n = O_p(1)$. Rigorously, $Z_n = O_p(1)$ if $\forall \varepsilon > 0$, $\exists M_\varepsilon < \infty$ such that $P\left(\|Z_n\| > M_\varepsilon\right) < \varepsilon$ for any $n$. If $Z_n = o_p(1)$, then $Z_n = O_p(1)$. We can show that $o_p(1) + o_p(1) = o_p(1)$, $o_p(1) + O_p(1) = O_p(1)$, $O_p(1) + O_p(1) = O_p(1)$, $o_p(1)o_p(1) = o_p(1)$, $o_p(1)O_p(1) = o_p(1)$, and $O_p(1)O_p(1) = O_p(1)$.

Exercise 1 (i) If $X_n \xrightarrow{d} X$ and $Y_n - X_n = o_p(1)$, then $Y_n \xrightarrow{d} X$. (ii) Show that $X_n \xrightarrow{p} c$ if and only if $X_n \xrightarrow{d} c$. (iii) Show that $o_p(1)O_p(1) = o_p(1)$.

Theorem 2 (CLT) Suppose $X_1, \ldots, X_n, \ldots$ are i.i.d. random variables, $E[X] = 0$, and $Var(X) = 1$; then $\sqrt{n}\bar{X}_n \xrightarrow{d} N(0, 1)$.

$\sqrt{n}\bar{X}_n \xrightarrow{d} N(0, 1)$ implies $\bar{X}_n \xrightarrow{p} 0$, so the CLT is stronger than the WLLN. $\bar{X}_n \xrightarrow{p} 0$ means $\bar{X}_n = o_p(1)$, but does not provide any information about $\sqrt{n}\bar{X}_n$. The CLT tells us that $\sqrt{n}\bar{X}_n = O_p(1)$, or $\bar{X}_n = O_p(n^{-1/2})$. But the WLLN does not require a finite second moment; that is, a stronger result is not free. The CLT can be extended to the multi-dimensional case with nonstandardized mean and variance: suppose $X_1, \ldots, X_n, \ldots$ are i.i.d. random $k$-vectors, $E[X] = \mu$, and $Var(X) = \Sigma$; then $\sqrt{n}\left(\bar{X}_n - \mu\right) \xrightarrow{d} N(0, \Sigma)$.
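The $\sqrt{n}$ scaling can be seen in a small simulation. The sketch below is an illustration I am adding (it assumes only NumPy, and the Exponential(1) design is hypothetical): the standard deviation of $\bar{X}_n$ shrinks as $n$ grows, while that of $\sqrt{n}(\bar{X}_n - \mu)$ stabilizes near 1, consistent with $\bar{X}_n = O_p(n^{-1/2})$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, reps = 1.0, 1000                 # E[X] for Exponential(1); Monte Carlo replications

for n in [10, 100, 1000, 10000]:
    X = rng.exponential(1.0, size=(reps, n))          # i.i.d. draws with E[X] = 1, Var(X) = 1
    Xbar = X.mean(axis=1)
    # WLLN: the sd of the sample mean shrinks; CLT: the sd of sqrt(n)*(Xbar - mu) stays near 1
    print(n, Xbar.std().round(4), (np.sqrt(n) * (Xbar - mu)).std().round(4))
```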

Theorem 3 (CMT) Suppose $X_1, \ldots, X_n, \ldots$ are random $k$-vectors, and $g$ is a function (to $\mathbb{R}^l$) continuous on the support of $X$ a.s. $P_X$; then
$$X_n \xrightarrow{p} X \;\Longrightarrow\; g(X_n) \xrightarrow{p} g(X),$$
$$X_n \xrightarrow{d} X \;\Longrightarrow\; g(X_n) \xrightarrow{d} g(X).$$

The CMT was first proved by Mann and Wald (1943) and is therefore sometimes referred to as the Mann-Wald Theorem. Note that the CMT allows the function $g$ to be discontinuous as long as the probability of $X$ being at a discontinuity point is zero. For example, the function $g(u) = u^{-1}$ is discontinuous at $u = 0$, but if $X_n \xrightarrow{d} X \sim N(0, 1)$, then $P(X = 0) = 0$, so $X_n^{-1} \xrightarrow{d} X^{-1}$.

In the CMT, $X_n$ converges to $X$ jointly in the various modes of convergence. For convergence in probability ($\xrightarrow{p}$), marginal convergence implies joint convergence, so there is no problem if we substitute joint convergence by marginal convergence. But for convergence in distribution ($\xrightarrow{d}$), $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$ do not imply $\left(\begin{array}{c}X_n\\ Y_n\end{array}\right) \xrightarrow{d} \left(\begin{array}{c}X\\ Y\end{array}\right)$. Nevertheless, there is a special case where this result holds, which is Slutsky's theorem.

Theorem 4 (Slutsky's Theorem) If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} c$ ($\Longleftrightarrow Y_n \xrightarrow{p} c$), where $c$ is a constant, then $\left(\begin{array}{c}X_n\\ Y_n\end{array}\right) \xrightarrow{d} \left(\begin{array}{c}X\\ c\end{array}\right)$. This implies $X_n + Y_n \xrightarrow{d} X + c$, $Y_nX_n \xrightarrow{d} cX$, and $Y_n^{-1}X_n \xrightarrow{d} c^{-1}X$ when $c \neq 0$. Here $X_n, Y_n, X, c$ can be understood as vectors or matrices as long as the operations are compatible.

One important application of Slutsky's Theorem is as follows.

Example 1 Suppose $X_n \xrightarrow{d} N(0, \Sigma)$ and $Y_n \xrightarrow{p} \Sigma$; then $Y_n^{-1/2}X_n \xrightarrow{d} \Sigma^{-1/2}N(0, \Sigma) = N(0, I)$, where $I$ is the identity matrix.

Combining with the CMT, we have the following important application.

Example 2 Suppose $X_n \xrightarrow{d} N(0, \Sigma)$ and $Y_n \xrightarrow{p} \Sigma$; then $X_n'Y_n^{-1}X_n \xrightarrow{d} \chi_k^2$, where $k$ is the dimension of $X_n$.

Another important application of Slutsky's theorem is the Delta method.

Theorem 5 (Delta Method) Suppose $\sqrt{n}(Z_n - c) \xrightarrow{d} Z \sim N(0, \Sigma)$, $c \in \mathbb{R}^k$, and $g(z): \mathbb{R}^k \to \mathbb{R}^l$. If $\frac{dg(z)}{dz'}$ is continuous at $c$,² then
$$\sqrt{n}\left(g(Z_n) - g(c)\right) \xrightarrow{d} \frac{dg(c)}{dz'}Z.$$

Intuitively,
$$\sqrt{n}\left(g(Z_n) - g(c)\right) = \frac{dg(\bar{c})}{dz'}\sqrt{n}\left(Z_n - c\right),$$
where $\frac{dg(\cdot)}{dz'}$ is the Jacobian of $g$, and $\bar{c}$ is between $Z_n$ and $c$. $\sqrt{n}(Z_n - c) \xrightarrow{d} Z$ implies that $Z_n \xrightarrow{p} c$, so $\bar{c} \xrightarrow{p} c$ and, by the CMT, $\frac{dg(\bar{c})}{dz'} \xrightarrow{p} \frac{dg(c)}{dz'}$. By Slutsky's theorem, $\sqrt{n}\left(g(Z_n) - g(c)\right)$ has the asymptotic distribution $\frac{dg(c)}{dz'}Z$. The Delta method implies that asymptotically, the randomness in a transformation of $Z_n$ is completely controlled by that in $Z_n$.

² This assumption can be relaxed to: $g(z)$ is differentiable at $c$.

Exercise 2 (*) Suppose $g(z): \mathbb{R} \to \mathbb{R}$, $g \in C^{(2)}$ in a neighborhood of $c$, $\frac{dg(c)}{dz} = 0$ and $\frac{d^2g(c)}{dz^2} \neq 0$. What is the asymptotic distribution of $g(Z_n)$?
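The Delta method is easy to check by Monte Carlo. The sketch below is my own illustration (assuming NumPy; the choice $g(z) = z^2$ and the Exponential(1) data are hypothetical, not from the text): the theorem predicts $\sqrt{n}\left(g(\bar{Z}_n) - g(c)\right) \xrightarrow{d} N\left(0, g'(c)^2\,Var(Z)\right) = N(0, 4)$ here.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 5000
c, var = 1.0, 1.0                        # E[Z] and Var(Z) for Exponential(1)
g = lambda z: z ** 2                     # transformation; dg/dz at c = 1 equals 2

Zbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
stat = np.sqrt(n) * (g(Zbar) - g(c))

# Delta method: asymptotic variance is (dg(c)/dz)^2 * Var(Z) = 4
print("simulated variance:", stat.var(), " theory:", (2 * c) ** 2 * var)
```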

2 Asymptotics for the MoM Estimator

Recall that the MoM estimator is defined as the solution to

$$\frac{1}{n}\sum_{i=1}^n m(X_i|\theta) = 0.$$
We can prove that the MoM estimator is consistent and asymptotically normal (CAN) under some regularity conditions.³ Specifically, the asymptotic distribution of the MoM estimator is
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left(0, M^{-1}\Omega M^{-1\prime}\right),$$
where $M = \frac{dE[m(X|\theta_0)]}{d\theta'}$⁴ and $\Omega = E\left[m(X|\theta_0)m(X|\theta_0)'\right]$. The asymptotic variance takes a sandwich form and can be estimated by its sample analog.

Consider a simple example. Suppose $E[X] = g(\theta_0)$ with $g \in C^{(1)}$ in a neighborhood of $\theta_0$; then $\theta_0 = g^{-1}(E[X]) \equiv h(E[X])$. The MoM estimator of $\theta$ is obtained by setting $\bar{X} = g(\hat\theta)$, so $\hat\theta = h(\bar{X})$. By the WLLN, $\bar{X} \xrightarrow{p} E[X]$; then by the CMT, $\hat\theta \xrightarrow{p} h(E[X]) = \theta_0$ since $h(\cdot)$ is continuous. Now, we derive the asymptotic distribution of $\hat\theta$.

$$\sqrt{n}\left(\hat\theta - \theta_0\right) = \sqrt{n}\left(h(\bar{X}) - h(E[X])\right) = \sqrt{n}\,h'\left(\bar{X}^*\right)\left(\bar{X} - E[X]\right) = h'\left(\bar{X}^*\right)\sqrt{n}\left(\bar{X} - E[X]\right),$$
where the second equality is from the mean value theorem (MVT), and $\bar{X}^*$ is between $\bar{X}$ and $E[X]$. Because $\bar{X}^*$ is between $\bar{X}$ and $E[X]$ and $\bar{X}$ converges to $E[X]$ in probability, $\bar{X}^* \xrightarrow{p} E[X]$. So by the CMT, $h'\left(\bar{X}^*\right) \xrightarrow{p} h'(E[X])$. By the CLT, $\sqrt{n}\left(\bar{X} - E[X]\right) \xrightarrow{d} N(0, Var(X))$. Then by Slutsky's theorem,
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{d} h'(E[X])\,N(0, Var(X)) = N\left(0, h'(E[X])^2\,Var(X)\right) = N\left(0, \frac{Var(X)}{g'(\theta_0)^2}\right).$$
The larger $g'(\theta_0)$ is, the smaller the asymptotic variance of $\hat\theta$ is. Consider a more specific example. Suppose the density of $X$ is $2\theta x\exp\left(-\theta x^2\right)$, $\theta > 0$, $x > 0$, that is, $X$ follows the Weibull$(2, \theta)$ distribution;⁵ then $E[X] = g(\theta) = \sqrt{\pi\theta^{-1}}/2$ and $Var(X) = \frac{1}{\theta} - \frac{\pi}{4\theta}$. So
$$\sqrt{n}\left(\hat\theta - \theta_0\right) \xrightarrow{d} N\left(0, \frac{\theta_0^{-1}\left(1 - \frac{\pi}{4}\right)}{\left(\frac{\sqrt{\pi}}{4}\theta_0^{-3/2}\right)^2}\right) = N\left(0, \left(\frac{16}{\pi} - 4\right)\theta_0^2\right).$$
Figure 1 shows $E[X]$ and the asymptotic variance of $\sqrt{n}\left(\hat\theta - \theta_0\right)$ as a function of $\theta$.

³ You may wonder whether every CAN estimator must be $\sqrt{n}$-consistent. This is not correct. There are CAN estimators which have convergence rates different from $\sqrt{n}$. Nevertheless, in this course all CAN estimators are $\sqrt{n}$-consistent.
⁴ We use $\frac{dE[m(X|\theta_0)]}{d\theta'}$ instead of $E\left[\frac{dm(X|\theta_0)}{d\theta'}\right]$ because $E[m(X|\theta)]$ is smoother than $m(X|\theta)$ and can be applied to such situations as quantile estimation, where $m(X|\theta)$ is not differentiable at $\theta_0$. In this course, we will not meet such cases.
⁵ (*) If the quantity $X$ is a "time-to-failure", e.g., the time of ending the state of unemployment, the Weibull distribution gives a distribution for which the failure rate is proportional to a power of time. Since the power of $x$ is 2, which is greater than 1, the failure rate increases with time.


Figure 1: $E[X]$ and Asymptotic Variance as a Function of $\theta$

Intuitively, the larger the derivative of $E[X]$ with respect to $\theta$, the easier it is to identify $\theta$ from $X$, and so the smaller the asymptotic variance.
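To make the Weibull$(2, \theta)$ example concrete, here is a simulation sketch I am adding (assuming NumPy; the inversion $\hat\theta = h(\bar{X}) = \pi/(4\bar{X}^2)$ follows from $E[X] = \sqrt{\pi/\theta}/2$). It checks both the consistency of $\hat\theta$ and the asymptotic variance $(16/\pi - 4)\theta_0^2$ derived above.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n, reps = 2.0, 2000, 5000

# X follows Weibull(2, theta0): density 2*theta*x*exp(-theta*x^2);
# if W has density 2*w*exp(-w^2) (NumPy's standard Weibull with shape 2), then X = W / sqrt(theta0)
X = rng.weibull(2.0, size=(reps, n)) / np.sqrt(theta0)
theta_hat = np.pi / (4.0 * X.mean(axis=1) ** 2)        # MoM: theta = h(E[X]) = pi / (4 E[X]^2)

print("mean of theta_hat:", theta_hat.mean())           # consistency: close to theta0
print("var of sqrt(n)(theta_hat - theta0):", n * theta_hat.var())
print("asymptotic variance (16/pi - 4)*theta0^2:", (16 / np.pi - 4) * theta0 ** 2)
```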

Exercise 3 (i) Show that $h'(E[X]) = 1/g'(\theta_0)$. (ii) Show that if $X$ follows the Weibull$(2, \theta)$ distribution, then $E[X] = \sqrt{\pi\theta^{-1}}/2$ and $Var(X) = \frac{1}{\theta} - \frac{\pi}{4\theta}$.

Generally, the asymptotic distribution of $\hat\theta$ can be derived intuitively as follows:

$$\frac{1}{n}\sum_{i=1}^n m\left(X_i|\hat\theta\right) = 0$$
$$\Longrightarrow\quad \frac{1}{n}\sum_{i=1}^n m(X_i|\theta_0) + \frac{1}{n}\sum_{i=1}^n \frac{dm\left(X_i|\bar\theta\right)}{d\theta'}\left(\hat\theta - \theta_0\right) = 0$$
$$\Longrightarrow\quad \sqrt{n}\left(\hat\theta - \theta_0\right) = -\left(\frac{1}{n}\sum_{i=1}^n \frac{dm\left(X_i|\bar\theta\right)}{d\theta'}\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n m(X_i|\theta_0) \xrightarrow{d} -M^{-1}N(0, \Omega),$$
where $\bar\theta$ is between $\hat\theta$ and $\theta_0$, and the convergence in distribution is from Slutsky's theorem. Note that $\sqrt{n}\left(\hat\theta - \theta_0\right) \approx -M^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n m(X_i|\theta_0)$, so $-M^{-1}m(X_i|\theta_0)$ is called the influence function. A rigorous proof can be found in Chapter 8, where we also show that the sample analog of the asymptotic variance of $\hat\theta$ is consistent.

Example 3 Suppose theb moment conditions are

$$E\left[\begin{array}{c}X - \mu\\ (X - \mu)^2 - \sigma^2\end{array}\right] = 0.$$
Then the sample analog is
$$\left(\begin{array}{c}\frac{1}{n}\sum_{i=1}^n X_i - \mu\\ \frac{1}{n}\sum_{i=1}^n (X_i - \mu)^2 - \sigma^2\end{array}\right) = 0,$$
so the solution is
$$\hat\mu = \bar{X}, \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n\left(X_i - \bar{X}\right)^2 = \overline{X^2} - \bar{X}^2.$$
Now, we check the consistency and asymptotic normality of this MoM estimator.

Consistency: $\hat\mu = \bar{X} \xrightarrow{p} \mu$, and $\hat\sigma^2 = \overline{X^2} - \bar{X}^2 \xrightarrow{p} E\left[X^2\right] - \mu^2 = \sigma^2$.

Asymptotic Normality:
$$M = E\left[\begin{array}{cc}-1 & 0\\ -2(X - \mu) & -1\end{array}\right] = \left(\begin{array}{cc}-1 & 0\\ 0 & -1\end{array}\right),$$
$$\Omega = E\left[\begin{array}{cc}(X - \mu)^2 & (X - \mu)^3 - \sigma^2(X - \mu)\\ (X - \mu)^3 - \sigma^2(X - \mu) & \left((X - \mu)^2 - \sigma^2\right)^2\end{array}\right] = \left(\begin{array}{cc}\sigma^2 & E\left[(X - \mu)^3\right]\\ E\left[(X - \mu)^3\right] & E\left[(X - \mu)^4\right] - \sigma^4\end{array}\right),$$
so
$$\sqrt{n}\left(\begin{array}{c}\hat\mu - \mu\\ \hat\sigma^2 - \sigma^2\end{array}\right) \xrightarrow{d} N(0, \Omega).$$
Of course, we could get this asymptotic distribution in other ways. Here are two such ways, which are left as exercises.
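The sandwich form in Example 3 can be estimated by its sample analog, as noted above. The sketch below is my own illustration (assuming NumPy, with Exponential data so that $E[(X - \mu)^3] \neq 0$); since $M = -I$ here, the sandwich reduces to $\Omega$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100000
X = rng.exponential(2.0, size=n)            # skewed data, scale 2: sigma^2 = 4, E[(X-mu)^3] = 16

mu_hat = X.mean()
sig2_hat = X.var()                           # (1/n) * sum (X_i - Xbar)^2

# stacked moment functions m(X_i | theta_hat) as an n x 2 matrix
m = np.column_stack([X - mu_hat, (X - mu_hat) ** 2 - sig2_hat])
Omega_hat = m.T @ m / n                      # sample analog of E[m m']
M_hat = np.array([[-1.0, 0.0],
                  [-2 * (X - mu_hat).mean(), -1.0]])    # sample analog of M (approximately -I)
sandwich = np.linalg.inv(M_hat) @ Omega_hat @ np.linalg.inv(M_hat).T

print("sandwich estimate of the asymptotic variance:\n", sandwich)
print("theoretical Omega for Exponential(scale=2):\n", np.array([[4.0, 16.0], [16.0, 128.0]]))
```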

Exercise 4 In Example 3,

(i) derive the asymptotic distribution of $\left(\hat\mu, \hat\sigma^2\right)'$ using the Delta method. (Hint: $\left(\hat\mu, \hat\sigma^2\right)$ is a function of $\left(\bar{X}, \overline{X^2}\right)$.)

(ii) derive the asymptotic distribution of $\left(\hat\mu, \hat\sigma^2\right)'$ by Slutsky's theorem, noting that $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n(X_i - \mu)^2 - \left(\bar{X} - \mu\right)^2$. (Hint: $\sqrt{n}\left(\bar{X} - \mu\right)^2 = o_p(1)$.)

Exercise 5 In Example 3, if $X \sim N\left(\mu, \sigma^2\right)$, then what is $\Omega$?

Example 4 Suppose we want to estimate $\theta = F(x)$ for a fixed $x$, where $F(\cdot)$ is the cdf of a random variable $X$. An intuitive estimator is the proportion of observations below $x$, $\frac{1}{n}\sum_{i=1}^n 1(X_i \le x)$, which is called the empirical distribution function (EDF); it is also a MoM estimator. To see why, note that the moment condition for this problem is

$$E\left[1(X \le x) - F(x)\right] = 0.$$


Figure 2: Empirical Distribution Functions

Its sample analog is
$$\frac{1}{n}\sum_{i=1}^n\left(1(X_i \le x) - F(x)\right) = 0,$$
so
$$\hat{F}(x) = \frac{1}{n}\sum_{i=1}^n 1(X_i \le x).$$
By the WLLN, it is consistent. By the CLT,
$$\sqrt{n}\left(\hat{F}(x) - F(x)\right) \xrightarrow{d} N\left(0, F(x)\left(1 - F(x)\right)\right). \quad \text{(why?)}$$
An interesting phenomenon is that the asymptotic variance reaches its maximum at the median of the distribution of $X$. Figure 2 shows the EDF for 10 samples from $N(0, 1)$ with sample size $n = 50$. Obviously, the variance of $\hat{F}(x)$ is a decreasing function of $|x|$.

(*) As a stochastic process indexed by $F(\cdot)$, $\sqrt{n}\left(\hat{F}(\cdot) - F(\cdot)\right)$ converges to a Brownian bridge.
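A short sketch of mine (assuming NumPy) computes the EDF on a grid and compares the Monte Carlo variance of $\hat{F}(x)$ with $F(x)(1 - F(x))/n$; as noted above, for $N(0, 1)$ data the variance peaks at the median and decreases in $|x|$.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(4)
n, reps = 50, 20000
grid = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

X = rng.standard_normal((reps, n))
# EDF at each grid point: proportion of observations at or below x
F_hat = (X[:, :, None] <= grid).mean(axis=1)                       # shape (reps, len(grid))

F_true = np.array([0.5 * (1 + erf(x / np.sqrt(2))) for x in grid])  # standard normal cdf
print("x:      ", grid)
print("MC var: ", F_hat.var(axis=0))
print("theory: ", F_true * (1 - F_true) / n)
```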

Example 5 (*) Recall from the Introduction that the Euler equation in macroeconomics is

$$E_t\left[\beta_0\left(\frac{c_t}{c_{t+1}}\right)^{\gamma_0} R_{t+1}\right] = 1.$$

Suppose $\beta$ is known and $\gamma$ is unknown; then we could use the sample analog
$$\frac{1}{T}\sum_{t=1}^T \beta_0\left(\frac{c_t}{c_{t+1}}\right)^{\gamma} R_{t+1} = 1$$
to get $\hat\gamma$. Although the data here are not i.i.d., we expect the general results above about the MoM estimator to hold, that is, $\hat\gamma$ is CAN, and the asymptotic variance is
$$\frac{Var\left(\beta_0\left(\frac{c_t}{c_{t+1}}\right)^{\gamma_0} R_{t+1}\right)}{E\left[\beta_0\left(\frac{c_t}{c_{t+1}}\right)^{\gamma_0}\ln\left(\frac{c_t}{c_{t+1}}\right)R_{t+1}\right]^2}.$$
Here, note that the asymptotic variance does not depend on $\beta_0$ although $\hat\gamma$ depends on $\beta_0$.

As shown in Chamberlain (1987), the MoM estimator is semiparametrically efficient.

3 Asymptotics for the MLE

This section studies the asymptotic properties of the MLE. Related materials can be found in Chapter 7 of Hayashi (2000), Chapter 5 of Cameron and Trivedi (2005), Appendix D of Hansen (2007), and Chapter 13 of Wooldridge (2010). Recall that

$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta}\ell_n(\theta), \qquad (1)$$
where $\ell_n(\theta) = \frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta)$, and $f(\cdot|\theta)$ is the parametrized pdf (or pmf) of $X$.

3.1 Consistency (*)

Two main conditions for proving the consistency of MLE are as follows:

(i) $E[\log f(X|\theta)]$ is maximized uniquely at $\theta_0$.

(ii) $\sup_{\theta\in\Theta}\left|\frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta) - E[\log f(X|\theta)]\right| \xrightarrow{p} 0$.

Intuitively, if $\frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta)$ is uniformly close to $E[\log f(X|\theta)]$, then the maximizer of $\frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta)$, $\hat\theta_{MLE}$, should be close to the maximizer of $E[\log f(X|\theta)]$, $\theta_0$. Replacing $\frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta)$ by $\left\|\frac{1}{n}\sum_{i=1}^n m(X_i|\theta)\right\|$ and $E[\log f(X|\theta)]$ by $\left\|E[m(X|\theta)]\right\|$, we get the proof for the consistency of the MoM estimator.

(i) means that $E[\log f(X|\theta)] \le E[\log f(X)]$ with equality if and only if $\theta = \theta_0$, where $f(x) = f(x|\theta_0)$. Some literature calls this inequality the Kullback-Leibler information inequality. This term comes from the fact that $\theta_0$ minimizes the Kullback-Leibler information distance between the true density $f(x)$ and the parametric density $f(x|\theta)$. Recall from the Introduction that

$$KLIC(\theta) = \int f(x)\log\frac{f(x)}{f(x|\theta)}\,dx = \int f(x)\left(\log f(x) - \log f(x|\theta)\right)dx = E[\log f(X)] - E[\log f(X|\theta)].$$
Note that

$$-KLIC(\theta) = E[\log f(X|\theta)] - E[\log f(X)] = \int\log\left(\frac{f(x|\theta)}{f(x)}\right)f(x)\,dx \le \log\int\frac{f(x|\theta)}{f(x)}f(x)\,dx = \log 1 = 0,$$
where the equality is reached if and only if $\theta = \theta_0$ by the strict Jensen's inequality, so $KLIC(\theta)$ is minimized at $\theta_0$, which implies $E[\log f(X|\theta)] \le E[\log f(X)]$.

We briefly review Jensen's inequality here. It states that for a random variable $X$ and a concave function $\varphi$, $\varphi(E[X]) \ge E[\varphi(X)]$, where the equality holds if and only if, for every line $a + bx$ that is tangent to $\varphi(x)$ at $x = E[X]$, $P\left(\varphi(X) = a + bX\right) = 1$. So if $\varphi$ is strictly concave and $X$ is not a constant almost surely, the strict inequality holds. Figure 3 illustrates the idea of Jensen's inequality, where $X$ takes only two values $x$ and $x'$ with probability $2/3$ and $1/3$ respectively.

Figure 3: Intuition for Jensen's Inequality

Exercise 6 Suppose $\theta \in \mathbb{R}$, and $\hat\theta$ is an estimator of $\theta$. Show that $\sqrt{E\left[\left(\hat\theta - \theta_0\right)^2\right]} \ge E\left[\left|\hat\theta - \theta_0\right|\right]$, i.e., the root mean squared error (RMSE) cannot be less than the mean absolute error (MAE).

(ii) is known as the uniform law of large numbers (ULLN). A useful lemma to guarantee it is Mickey's Theorem (see, e.g., Theorem 2 of Jennrich (1969) or Lemma 1 of Tauchen (1985)).

Lemma 1 If the data are i.i.d., $\Theta$ is compact, $a(w_i, \theta)$ is continuous at each $\theta \in \Theta$ with probability one, and there is $d(w)$ with $\|a(w_i, \theta)\| \le d(w)$ for all $\theta \in \Theta$ and $E[d(w)] < \infty$, then $E[a(w, \theta)]$ is continuous and $\sup_{\theta\in\Theta}\left\|\frac{1}{n}\sum_{i=1}^n a(w_i, \theta) - E[a(w, \theta)]\right\| \xrightarrow{p} 0$.

When these two conditions are satisfied, we can show the consistency of the MLE. Take $\delta > 0$ and let $A = \{\theta : \|\theta - \theta_0\| \ge \delta\}$; we need to show that $P\left(\hat\theta_{MLE} \in A\right) \to 0$, or $P\left(\hat\theta_{MLE} \in A^c\right) \to 1$. By the definition of $\hat\theta_{MLE}$,
$$\frac{1}{n}\sum_{i=1}^n \ln f\left(X_i|\hat\theta_{MLE}\right) \ge \frac{1}{n}\sum_{i=1}^n \ln f(X_i|\theta_0).$$
From (ii), the left hand side is less than $E\left[\log f\left(X|\hat\theta_{MLE}\right)\right] + \varepsilon/2$ and the right hand side is greater than $E[\log f(X|\theta_0)] - \varepsilon/2$, for any $\varepsilon > 0$, with probability approaching 1 as $n$ gets large. So we have

$$E\left[\log f\left(X|\hat\theta_{MLE}\right)\right] + \varepsilon/2 > E[\log f(X|\theta_0)] - \varepsilon/2,$$
or
$$E\left[\log f\left(X|\hat\theta_{MLE}\right)\right] > E[\log f(X|\theta_0)] - \varepsilon.$$
From (i), choose $\varepsilon$ as $E[\log f(X|\theta_0)] - \sup_{\theta\in A}E[\log f(X|\theta)] > 0$,⁶ and then we have $E\left[\log f\left(X|\hat\theta_{MLE}\right)\right] > \sup_{\theta\in A}E[\log f(X|\theta)]$, so $\hat\theta_{MLE} \notin A$. The arguments above are intuitively illustrated in Figure 4.

In summary, to prove that $\hat\theta = \arg\max_{\theta\in\Theta}Q_n(\theta)$ is a consistent estimator of $\theta_0 = \arg\max_{\theta\in\Theta}Q(\theta)$, we need four conditions: (I) $Q(\theta)$ is uniquely maximized at $\theta_0$; (II) $\Theta$ is compact; (III) $Q(\theta)$ is continuous; (IV) $Q_n(\theta)$ converges uniformly in probability to $Q(\theta)$. This is Theorem 2.1 of Newey and McFadden (1994).

3.2 Asymptotic Normality

When $f(x|\theta)$ is smooth, the FOCs of (1) are
$$S_n(\theta) \equiv \frac{\partial\ell_n(\theta)}{\partial\theta} = \frac{1}{n}\sum_{i=1}^n s(X_i|\theta) = 0, \qquad (2)$$
where $\sum_{i=1}^n s(X_i|\theta) = nS_n(\theta) = \mathcal{S}_n(\theta) \equiv \partial\mathcal{L}_n(\theta)/\partial\theta$ is called the score function or normal equations,⁷ and $s(X_i|\theta) = \frac{\partial}{\partial\theta}\ln f(X_i|\theta) \equiv s_i(\theta)$ is the contribution of the $i$th observation to the score function. So the MLE is a MoM estimator, and the moment conditions are $E[s(X|\theta_0)] = 0$.

⁶ A condition to guarantee $E[\log f(X|\theta_0)] - \sup_{\theta\in A}E[\log f(X|\theta)] > 0$ is that $E[\log f(X|\theta)]$ is continuous and $\Theta$ is compact.
⁷ The score function for $\theta$ is usually defined as the gradient of the log-likelihood $\mathcal{L}_n(\theta) = n\ell_n(\theta)$ with respect to $\theta$. Some literature also loosely calls $s(x|\theta)$ the score function.


Figure 4: Illustration of the Consistency Proof in Maximum Likelihood Estimation

It is easy to check $E[s(X|\theta_0)] = 0$ by noting that
$$E[s(X|\theta_0)] = \int\frac{\frac{\partial}{\partial\theta}f(x|\theta)\big|_{\theta=\theta_0}}{f(x|\theta_0)}f(x|\theta_0)\,dx = \int\frac{\partial}{\partial\theta}f(x|\theta)\Big|_{\theta=\theta_0}dx = \frac{\partial}{\partial\theta}\int f(x|\theta)\,dx\,\Big|_{\theta=\theta_0} = 0.$$

Under some regularity conditions, $\hat\theta_{MLE}$ is $\sqrt{n}$-consistent and asymptotically normal. Specifically, the asymptotic distribution of the MLE is
$$\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) \xrightarrow{d} N\left(0, H^{-1}JH^{-1\prime}\right) = N\left(0, J^{-1}\right),$$
where $H \equiv E\left[\frac{\partial^2}{\partial\theta\partial\theta'}\ln f(X|\theta_0)\right]$ is the (expected) Hessian, $J \equiv E\left[\frac{\partial}{\partial\theta}\ln f(X|\theta_0)\frac{\partial}{\partial\theta'}\ln f(X|\theta_0)\right]$ is an outer product matrix called the information (matrix),⁸ and the equality is from the information equality $J = -H$. The information equality can be proved as follows:
$$H = E\left[\frac{\partial^2}{\partial\theta\partial\theta'}\ln f(X|\theta)\Big|_{\theta=\theta_0}\right] = E\left[\frac{\partial}{\partial\theta'}\left(\frac{\frac{\partial}{\partial\theta}f(X|\theta)}{f(X|\theta)}\right)\Bigg|_{\theta=\theta_0}\right] = E\left[\frac{\frac{\partial^2}{\partial\theta\partial\theta'}f(X|\theta)}{f(X|\theta)} - \frac{\frac{\partial}{\partial\theta}f(X|\theta)\frac{\partial}{\partial\theta'}f(X|\theta)}{f(X|\theta)^2}\Bigg|_{\theta=\theta_0}\right] = E\left[\frac{\frac{\partial^2}{\partial\theta\partial\theta'}f(X|\theta)}{f(X|\theta)}\Bigg|_{\theta=\theta_0}\right] - J,$$

⁸ The role of the information matrix in the asymptotic theory of maximum likelihood estimation was emphasized by R. A. Fisher (following some initial results by F. Y. Edgeworth), so sometimes the information is also called the Fisher information.

so we need only prove $E\left[\frac{\frac{\partial^2}{\partial\theta\partial\theta'}f(X|\theta)}{f(X|\theta)}\Big|_{\theta=\theta_0}\right] = 0$, which follows from
$$E\left[\frac{\frac{\partial^2}{\partial\theta\partial\theta'}f(X|\theta)}{f(X|\theta)}\Bigg|_{\theta=\theta_0}\right] = \int\frac{\frac{\partial^2}{\partial\theta\partial\theta'}f(x|\theta)\big|_{\theta=\theta_0}}{f(x|\theta_0)}f(x|\theta_0)\,dx = \int\frac{\partial^2}{\partial\theta\partial\theta'}f(x|\theta)\Big|_{\theta=\theta_0}dx = \frac{\partial^2}{\partial\theta\partial\theta'}\int f(x|\theta)\,dx\,\Big|_{\theta=\theta_0} = 0.$$

So the special feature of the MLE as a MoM estimator is that $M = M' = H = -J = -\Omega$. As mentioned in the Introduction, the MoM estimator is a semiparametric estimator and does not impose parametric restrictions on the density function, so it is more robust than the MLE.

When the model is misspecified, $\theta_0 \equiv \arg\max_\theta E[\log f(x|\theta)]$ may not satisfy $f(x|\theta_0) = f(x)$. In this case, $\theta_0$ is called the quasi-true value, and the corresponding MLE is called the quasi-MLE (QMLE). As shown in White (1982a), the information equality fails, but the form of the asymptotic variance $H^{-1}JH^{-1\prime}$ is still valid. We provide some intuition for the asymptotic distribution of $\hat\theta_{MLE}$ here. To simplify our discussion, we restrict $\theta$ to be one-dimensional.

(i) The asymptotic variance is $1/J$; that is, the larger $J$ is, the smaller the asymptotic variance is. We know $J$ represents the information (the variation of the score function), so

More Information, Less Variance.

(ii) Since $J = -H$, $J$ represents the curvature of $E[\ln f(X|\theta)]$ at $\theta_0$. In other words, the larger the curvature of $E[\ln f(X|\theta)]$ at $\theta_0$ is, the more accurately we can identify $\theta_0$ by the MLE. In Figure 5, we can identify $\theta_0$ better in (2) than in (1). For (3), we cannot identify $\theta_0$, so the asymptotic variance is infinity. (Why can we draw $E[\ln f(X|\theta)]$ like a mountain, at least in the neighborhood of $\theta_0$?)

(iii) Treat the MLE as a MoM estimator. We can see that $H$ ($= -J$) plays the role of $M$ in the MoM estimator. From the discussion in Section 2, we know that the larger $M$ is (in magnitude), the smaller the asymptotic variance is.

There are four methods to estimate J:

$$\hat J^{(1)} = J_n\left(\hat\theta_{MLE}\right) \equiv \frac{1}{n}\sum_{i=1}^n\frac{\partial}{\partial\theta}\ln f\left(X_i|\hat\theta_{MLE}\right)\frac{\partial}{\partial\theta'}\ln f\left(X_i|\hat\theta_{MLE}\right) = \frac{1}{n}\sum_{i=1}^n s_i\left(\hat\theta_{MLE}\right)s_i\left(\hat\theta_{MLE}\right)';$$
$$\hat J^{(2)} = -H_n\left(\hat\theta_{MLE}\right) \equiv -\frac{1}{n}\sum_{i=1}^n\frac{\partial^2}{\partial\theta\partial\theta'}\ln f\left(X_i|\hat\theta_{MLE}\right) = -\frac{1}{n}\mathcal{H}_n\left(\hat\theta_{MLE}\right) = -\frac{1}{n}\frac{\partial^2}{\partial\theta\partial\theta'}\mathcal{L}_n\left(\hat\theta_{MLE}\right);$$
$$\hat J^{(3)} = \hat J^{(2)}\left(\hat J^{(1)}\right)^{-1}\hat J^{(2)};$$
$$\hat J^{(4)} = J\left(\hat\theta_{MLE}\right);$$

Figure 5: Identification Interpretation of J

where $J(\theta) \equiv E\left[\frac{\partial}{\partial\theta}\ln f(X|\theta)\frac{\partial}{\partial\theta'}\ln f(X|\theta)\right] = -E\left[\frac{\partial^2}{\partial\theta\partial\theta'}\ln f(X|\theta)\right]$. $\hat J^{(3)}$ is robust even in a misspecified model, as shown in White (1982a); see also Gourieroux et al. (1984). It is a good simulation exercise to check which one is the best among $\hat J^{(i)}$, $i = 1, 2, 3, 4$, in finite samples.
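That simulation exercise can be set up as follows. This is a sketch I am adding (assuming NumPy), using the Exponential($\theta$) likelihood $f(x|\theta) = \theta e^{-\theta x}$, for which $J(\theta) = 1/\theta^2$ and all four estimators have closed forms.

```python
import numpy as np

rng = np.random.default_rng(5)
theta0, n, reps = 1.5, 200, 5000

X = rng.exponential(1.0 / theta0, size=(reps, n))      # f(x|theta) = theta * exp(-theta * x)
theta_hat = 1.0 / X.mean(axis=1)                        # MLE

J1 = X.var(axis=1)                                      # average of squared scores (1/theta_hat - X_i)^2
J2 = 1.0 / theta_hat ** 2                               # minus the average Hessian of ln f
J3 = J2 * J2 / J1                                       # J2 * J1^{-1} * J2
J4 = 1.0 / theta_hat ** 2                               # analytical J(theta) evaluated at theta_hat

print("true J =", 1 / theta0 ** 2)
for name, J in [("J1", J1), ("J2", J2), ("J3", J3), ("J4", J4)]:
    print(name, "mean:", J.mean().round(4), " sd:", J.std().round(4))
```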

Example 6 Suppose $X_i$ follows the Bernoulli distribution, i.e.,

$$L_n(\theta) = \prod_{i=1}^n\theta^{X_i}(1-\theta)^{1-X_i}, \quad\text{and}\quad \ell_n(\theta) = \ln\theta\cdot\frac{1}{n}\sum_{i=1}^n X_i + \ln(1-\theta)\cdot\frac{1}{n}\sum_{i=1}^n(1 - X_i),$$
where $X_i$ takes only the values 0 and 1. The FOC implies $\hat\theta_{MLE} = \bar{X}$, which is the proportion of 1's in the sample. From the CLT,
$$\sqrt{n}\left(\bar{X} - \theta\right) \xrightarrow{d} N\left(0, J^{-1}\right) = N\left(0, \theta(1-\theta)\right),$$
so a natural estimator of $J$ is $\frac{1}{\bar{X}\left(1-\bar{X}\right)}$. Now, we calculate $\hat J^{(i)}$, $i = 1, 2, 3, 4$, in this simple example.

(i) $\frac{\partial}{\partial\theta}\ln f(x|\theta) = \frac{x}{\theta} - \frac{1-x}{1-\theta}$, so $\left(\frac{\partial}{\partial\theta}\ln f(x|\theta)\right)^2 = \frac{x^2}{\theta^2} + \frac{(1-x)^2}{(1-\theta)^2}$ (why?) and
$$\hat J^{(1)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{X_i^2}{\bar{X}^2} + \frac{(1-X_i)^2}{\left(1-\bar{X}\right)^2}\right) = \frac{\bar{X}}{\bar{X}^2} + \frac{1-\bar{X}}{\left(1-\bar{X}\right)^2} = \frac{1}{\bar{X}} + \frac{1}{1-\bar{X}}.$$

(ii) $\frac{\partial^2}{\partial\theta^2}\ln f(x|\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}$, so
$$\hat J^{(2)} = \frac{1}{n}\sum_{i=1}^n\left(\frac{X_i}{\bar{X}^2} + \frac{1-X_i}{\left(1-\bar{X}\right)^2}\right) = \frac{1}{\bar{X}} + \frac{1}{1-\bar{X}}.$$
(iii) $\hat J^{(3)} = \hat J^{(2)}\left(\hat J^{(1)}\right)^{-1}\hat J^{(2)} = \frac{1}{\bar{X}} + \frac{1}{1-\bar{X}}$.
(iv) $J(\theta) = \left(\theta(1-\theta)\right)^{-1}$, so
$$\hat J^{(4)} = \frac{1}{\bar{X}\left(1-\bar{X}\right)} = \frac{1}{\bar{X}} + \frac{1}{1-\bar{X}}.$$
So in the Bernoulli case, $\hat J^{(1)} = \hat J^{(2)} = \hat J^{(3)} = \hat J^{(4)}$. Generally, this result will not hold, and the performance of $\hat J^{(i)}$, $i = 1, 2, 3$, depends on the specific problem at hand.

3.3 One-step Estimator and Algorithms (*)

When the score function (2) is nonlinear but smooth, the Newton-Raphson algorithm is usually used to find $\hat\theta_{MLE}$: iterate
$$\hat\theta_2 = \hat\theta_1 - H_n\left(\hat\theta_1\right)^{-1}S_n\left(\hat\theta_1\right)$$
until convergence.⁹ The idea of the Newton-Raphson algorithm is to first linearly approximate $S_n(\theta) = 0$ at $\hat\theta_1$ by $S_n(\theta) \approx S_n\left(\hat\theta_1\right) + H_n\left(\hat\theta_1\right)\left(\theta - \hat\theta_1\right) = 0$ and then solve for $\theta$. Figure 6 illustrates the basic idea of this algorithm. To start the algorithm, we need an initial estimator. When it is consistent and $\hat\theta_1 - \theta_0 = O_p(n^{-1/2})$ (e.g., the OLS estimator in the parametric linear regression), the first iteration will generate an efficient estimator which has the same asymptotic distribution as $\hat\theta_{MLE}$. To see why, Taylor expand $S_n\left(\hat\theta_1\right)$ at $\theta_0$; we have

$$\sqrt{n}\left(\hat\theta_2 - \theta_0\right) = \left[I - H_n\left(\hat\theta_1\right)^{-1}H_n\left(\bar\theta\right)\right]\sqrt{n}\left(\hat\theta_1 - \theta_0\right) - H_n\left(\hat\theta_1\right)^{-1}\sqrt{n}S_n(\theta_0),$$
where $\bar\theta$ lies on the line segment between $\theta_0$ and $\hat\theta_1$. Because both $H_n\left(\hat\theta_1\right)$ and $H_n\left(\bar\theta\right)$ converge to $H$ and $\sqrt{n}\left(\hat\theta_1 - \theta_0\right) = O_p(1)$, the first term is $o_p(1)$. As a result, the asymptotic distribution of $\sqrt{n}\left(\hat\theta_2 - \theta_0\right)$ is the same as that of $-H_n\left(\hat\theta_1\right)^{-1}\sqrt{n}S_n(\theta_0)$, so $\hat\theta_2$ is efficient.

The Newton-Raphson algorithm is based on the so-called Gradient Theorem: any vector $d$ in the same halfspace as $S_n(\theta)$ (that is, with $d'S_n(\theta) > 0$) is a direction of increase of $\ell_n(\theta)$, in the sense that $\ell_n(\theta + \lambda d)$ is an increasing function of the scalar $\lambda$, at least for $\lambda$ small enough. An algorithm that chooses a suitable direction $d$, and then finds a value of $\lambda$ which maximizes

$\ell_n(\theta + \lambda d)$, can ensure convergence. The set of directions $d$ consists of those of the form $QS_n(\theta)$, where $Q$ is a positive definite matrix. $-H_n(\theta)^{-1}$ is a good choice of $Q$ when $\ell_n(\theta)$ is concave, especially in the neighborhood of $\theta_0$, where the use of the Hessian makes final convergence quadratic.

⁹ A related method is the Fisher scoring algorithm, where $H_n\left(\hat\theta_1\right)$ is replaced by $\bar{H}\left(\hat\theta_1\right)$, where $\bar{H}(\theta) = E\left[\frac{\partial^2}{\partial\theta\partial\theta'}\ln f(X|\theta)\right]$. In practice, these two algorithms are used with various modifications designed for speedier convergence. See Goldfeld and Quandt (1972, Ch. 1) for further discussion.

Figure 6: Intuition for the Newton-Raphson Algorithm

Even in this case, a suitable choice of $\lambda$ is important to guarantee convergence. Trying out decreasing $\lambda$ until a higher $\ell_n(\theta + \lambda d)$ is found can generate an infinite sequence of iterations that do not converge to a critical point. Choosing $\lambda$ by maximizing $\ell_n(\theta + \lambda d)$ can impose an unacceptable computational burden. The BHHH algorithm of Berndt et al. (1974) suggests choosing $\lambda$ in the following way.¹⁰ Let $\delta$ be a prescribed constant in the interval $(0, 1/2)$. Define

$$\varphi(\lambda, \theta) = \frac{\ell_n(\theta + \lambda d) - \ell_n(\theta)}{\lambda\,d'S_n(\theta)}.$$

If $\varphi(1, \theta) \ge \delta$, take $\lambda = 1$. Otherwise, choose $\lambda$ to satisfy $\delta \le \varphi(\lambda, \theta) \le 1 - \delta$. Such a $\lambda$ always exists if $\ell_n(\theta)$ is twice continuously differentiable and has compact upper contour sets. If we further assume that the direction $d$ satisfies $\frac{d'S_n(\theta)}{S_n(\theta)'S_n(\theta)} \ge \epsilon \in (0, 1)$, then such an algorithm always converges. Berndt et al. (1974) suggest choosing $d$ as $J_n(\theta)^{-1}S_n(\theta)$, where $J_n(\theta)$ is always positive definite but $-H_n(\theta)$ may not be (e.g., near a saddle point). This choice of $Q$ is closely related to the modified Gauss-Newton method of Hartley (1961). In summary, in the BHHH algorithm, iterate

$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \lambda J_n\left(\hat\theta^{(j)}\right)^{-1}S_n\left(\hat\theta^{(j)}\right)$$
until
$$\max_k \frac{|d_k|}{\max\left(1, |\theta_k|\right)} < \text{prescribed tolerance},$$

¹⁰ The idea of the BHHH algorithm originates from Anderson (1959).

where $d_k$ and $\theta_k$ are the $k$th components of $d$ and $\theta$, $k = 1, \ldots, \dim(\theta)$. A corollary of this algorithm is an estimator of the asymptotic variance of the MLE, $J_n\left(\hat\theta\right)^{-1}$.

Extensions of Newton's method are the so-called quasi-Newton methods. Two popular methods are due to Davidon-Fletcher-Powell (DFP) and Broyden-Fletcher-Goldfarb-Shanno (BFGS). When there are multiple local maxima or the objective function is non-differentiable, a derivative-free method such as the simplex method of Nelder-Mead (1965) is suggested. A more recent innovation is the method of simulated annealing (SA). For a review, see Goffe et al. (1994).
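As an illustration of the Newton-Raphson iteration above, here is a minimal sketch of mine (assuming NumPy; the logit model and the data-generating values are hypothetical, not from the text). It iterates $\hat\theta_2 = \hat\theta_1 - H_n\left(\hat\theta_1\right)^{-1}S_n\left(\hat\theta_1\right)$ with the analytical score and Hessian until the step is negligible.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1000
x = rng.standard_normal(n)
X = np.column_stack([np.ones(n), x])                 # design matrix with an intercept
beta_true = np.array([0.5, -1.0])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)

def score_hessian(beta):
    """Average score S_n(beta) and average Hessian H_n(beta) of the logit log-likelihood."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    S = X.T @ (y - p) / n
    H = -(X * (p * (1 - p))[:, None]).T @ X / n
    return S, H

beta = np.zeros(2)                                   # crude starting value
for it in range(50):
    S, H = score_hessian(beta)
    step = np.linalg.solve(H, S)                     # Newton-Raphson: beta_new = beta - H^{-1} S
    beta = beta - step
    if np.max(np.abs(step)) < 1e-10:
        break

print("Newton-Raphson MLE:", beta, "after", it + 1, "iterations")
```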

4 Three Asymptotically Equivalent Tests

As larger and larger datasets become available, asymptotic tests have grown popular, so we concentrate on asymptotic tests throughout this course. There are three asymptotically equivalent tests in the likelihood framework: the likelihood ratio (LR) test of Neyman and Pearson (1928), the Wald (1943) test and Rao (1948)'s score test (or Lagrange multiplier (LM) test), which are the topic of this section. Our discussion emphasizes the intuition behind these three tests and is based on Buse (1982). More related discussions can be found in Engle (1984), Section 7.4 of Hayashi (2000), Section 12.6 of Wooldridge (2010), and Section 7.3 of Cameron and Trivedi (2005). To aid intuition, we first consider the simplest case:

$$H_0: \theta = \theta_0 \quad\text{vs}\quad H_1: \theta \neq \theta_0,$$
where $\theta$ is a scalar, and then extend to the multi-parameter case with general nonlinear constraints $r(\theta) = 0$, where $r(\theta)$ is a $q \times 1$ vector of constraints.

Intuitively, if $\ell_n\left(\hat\theta_{MLE}\right) - \ell_n(\theta_0)$ is large, we should reject $H_0$. In Figure 7, we first measure the height of the mountain $\ell_n(\cdot)$ at two points, $\hat\theta_{MLE}$ and $\theta_0$, and then calculate the difference. If this difference is large, we reject $H_0$. This observation induces the LR test.

(i) $LR = 2n\left[\ell_n\left(\hat\theta_{MLE}\right) - \ell_n(\theta_0)\right]$

The LR statistic is called the deviance in the literature. Here, the "2" is to offset the "$\frac{1}{2}$" in the second term of the Taylor expansion. To be specific, under $H_0$,

$$\begin{aligned}
LR &= -2n\left[\ell_n(\theta_0) - \ell_n\left(\hat\theta_{MLE}\right)\right]\\
&= -2n\left[\frac{\partial\ell_n\left(\hat\theta_{MLE}\right)}{\partial\theta}\left(\theta_0 - \hat\theta_{MLE}\right) + \frac{1}{2}\frac{\partial^2\ell_n\left(\hat\theta_{MLE}\right)}{\partial\theta^2}\left(\theta_0 - \hat\theta_{MLE}\right)^2 + o_p\left(\frac{1}{n}\right)\right]\\
&= \sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right)'\left(-\frac{\partial^2\ell_n\left(\hat\theta_{MLE}\right)}{\partial\theta\partial\theta'}\right)\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) + o_p(1) \xrightarrow{d} \chi^2(1),
\end{aligned}$$
where the third equality is from $\frac{\partial\ell_n\left(\hat\theta_{MLE}\right)}{\partial\theta} = 0$, which is due to the FOC of $\hat\theta_{MLE}$.¹¹ Note that the asymptotic null distribution of the LR statistic is independent of nuisance parameters. This property is called the "Wilks phenomenon" in the literature, as it was first observed by Wilks (1938).

¹¹ This is why we Taylor expand $\ell_n(\cdot)$ at $\hat\theta_{MLE}$ instead of $\theta_0$.

Figure 7: LR Test

To measure the height difference in Figure 7, we have two other methods, depending on where we stand on the mountain. If we stand at $\hat\theta_{MLE}$, then we get the Wald test; if we stand at $\theta_0$, then we get the LM test.

First, suppose we stand at $\hat\theta_{MLE}$, the top of the mountain. Intuitively, if the horizontal difference

$\hat\theta_{MLE} - \theta_0$ is large, the height difference between $\ell_n\left(\hat\theta_{MLE}\right)$ and $\ell_n(\theta_0)$ should be large, but we need to take account of one more thing. Comparing mountains A and B in Figure 8, we see that although $\hat\theta_{MLE} - \theta_0$ is the same, $\ell_n\left(\hat\theta_{MLE}\right) - \ell_n(\theta_0)$ is very different. The greater the curvature of the mountain at $\hat\theta_{MLE}$, the larger the height difference. So we should weight $\hat\theta_{MLE} - \theta_0$ by the curvature of $\ell_n(\theta)$ at $\hat\theta_{MLE}$. This intuition induces the Wald test.

(ii) $W = n\left(\hat\theta_{MLE} - \theta_0\right)\left(-\frac{\partial^2\ell_n\left(\hat\theta_{MLE}\right)}{\partial\theta^2}\right)\left(\hat\theta_{MLE} - \theta_0\right) = \sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right)'\left(-H_n\left(\hat\theta_{MLE}\right)\right)\sqrt{n}\left(\hat\theta_{MLE} - \theta_0\right) \xrightarrow{d} \chi^2(1)$ under $H_0$.

b b b b Second, suppose we stand at 0, the half-way up the mountain. Intuitively, the steeper is the mountain at 0, the larger is the height di¤erence, but we still need take account of one more thing.

Comparing mountains A and B in Figure 9, we see that although the slope $S_n(\theta_0)$ is the same for these two mountains, the height difference is very different. The greater the curvature of the mountain at $\theta_0$, the closer $\theta_0$ is to $\hat\theta_{MLE}$ and the smaller the height difference. So we should weight $|S_n(\theta_0)|$ by the inverse of the curvature at $\theta_0$. This intuition induces the LM test.

Figure 8: Wald Test


Figure 9: LM Test

(iii) $LM = \frac{nS_n(\theta_0)^2}{-\frac{\partial^2\ell_n(\theta_0)}{\partial\theta^2}} = \sqrt{n}S_n(\theta_0)'\left(-H_n(\theta_0)\right)^{-1}\sqrt{n}S_n(\theta_0) \xrightarrow{d} \chi^2(1)$ under $H_0$.

The name of the LM test (or the score test) comes from the following fact: testing $\theta = \theta_0$ is equivalent to testing $S_n(\theta_0) = 0$, which is equivalent to testing $\lambda = 0$. Here, $\lambda$ is the Lagrange multiplier in the following Lagrangian problem:

$$\max_{\theta\in\Theta}\ell_n(\theta) \quad\text{s.t.}\quad \theta = \theta_0.$$
If the true value of $\theta$ is $\theta_0$, then we expect $S_n(\theta_0) \approx 0$ and $\lambda \approx 0$, since the FOC of this problem is

$$S_n(\theta) + \lambda = 0.$$
For the general case of $q$ constraints $r(\theta) = 0$, suppose the constrained estimator
$$\tilde\theta = \arg\max_{\theta\in\Theta}\ell_n(\theta) \quad\text{s.t.}\quad r(\theta) = 0$$
can be obtained by solving the Lagrangian problem with the Lagrangian
$$\mathcal{L}(\theta, \lambda) = \ell_n(\theta) + \lambda'r(\theta),$$
where $\lambda$ is a $q \times 1$ vector of Lagrange multipliers. The FOCs for this problem are

$$S_n\left(\tilde\theta\right) + \tilde{R}\tilde\lambda = 0, \qquad (3)$$
where $\tilde{S}_n = S_n\left(\tilde\theta\right)$ and $\tilde{R}' = \partial r\left(\tilde\theta\right)/\partial\theta'$. In this general testing problem,
$$LR = 2n\left[\ell_n\left(\hat\theta\right) - \ell_n\left(\tilde\theta\right)\right],$$
$$W = n\,r\left(\hat\theta\right)'\left[\hat{R}'\left(-\hat{H}_n\right)^{-1}\hat{R}\right]^{-1}r\left(\hat\theta\right),$$
$$LM = n\,\tilde{S}_n'\left(-\tilde{H}_n\right)^{-1}\tilde{S}_n = n\,\tilde\lambda'\tilde{R}'\left(-\tilde{H}_n\right)^{-1}\tilde{R}\tilde\lambda,$$
where $\hat\theta = \hat\theta_{MLE}$, $\hat{R}' = \partial r\left(\hat\theta\right)/\partial\theta'$, $\hat{H}_n = H_n\left(\hat\theta\right)$ and $\tilde{H}_n = H_n\left(\tilde\theta\right)$. These test statistics can also be expressed using $\mathcal{L}_n(\cdot)$, $\mathcal{S}_n(\cdot)$, $\mathcal{H}_n(\cdot)$ and $n\tilde\lambda$ in a parallel way. For example,

$$LR = 2\left[\mathcal{L}_n\left(\hat\theta\right) - \mathcal{L}_n\left(\tilde\theta\right)\right],$$
$$W = r\left(\hat\theta\right)'\left[\hat{R}'\left(-\hat{\mathcal{H}}_n\right)^{-1}\hat{R}\right]^{-1}r\left(\hat\theta\right),$$
$$LM = \mathcal{S}_n\left(\tilde\theta\right)'\left(-\tilde{\mathcal{H}}_n\right)^{-1}\mathcal{S}_n\left(\tilde\theta\right) = \left(n\tilde\lambda\right)'\tilde{R}'\left(-\tilde{\mathcal{H}}_n\right)^{-1}\tilde{R}\left(n\tilde\lambda\right).$$
It can be shown that under $H_0$, all three test statistics converge in distribution to $\chi^2(q)$. We will provide more discussion on these three tests in the next chapter in the linear regression framework and in Chapter 8 in the GMM framework.

We now discuss the LM test in more detail. This test is widely used in specification testing when the

model under $H_0$ is much simplified; see Breusch and Pagan (1980) for some such examples.¹² A special case is that $\theta = \left(\theta_1', \theta_2'\right)'$ and

$$H_0: \theta_1 = \theta_{10} \quad\text{or}\quad H_0: r(\theta) = (I_q, 0)\left(\begin{array}{c}\theta_1\\ \theta_2\end{array}\right) - \theta_{10} = 0.$$

Partitioning $\tilde{S}_n$, $J$ and $\tilde{H}_n$ conformably,
$$\tilde{S}_n = \left(\begin{array}{c}\tilde{S}_{n1}\\ \tilde{S}_{n2}\end{array}\right) = \left(\begin{array}{c}\tilde{S}_{n1}\\ 0\end{array}\right)$$
and
$$J = \left(\begin{array}{cc}J_{11} & J_{12}\\ J_{21} & J_{22}\end{array}\right), \qquad \tilde{H}_n = \left(\begin{array}{cc}\tilde{H}_{11} & \tilde{H}_{12}\\ \tilde{H}_{21} & \tilde{H}_{22}\end{array}\right),$$
where $\tilde{S}_{n2} = 0$ is from the FOCs (3). In this case, the LM statistic will be
$$LM = n\,\tilde{S}_{n1}'\left[-\left(\tilde{H}_{11} - \tilde{H}_{12}\tilde{H}_{22}^{-1}\tilde{H}_{21}\right)\right]^{-1}\tilde{S}_{n1} = n\,\tilde{S}_{n1}'\left(-\tilde{H}^{11}\right)\tilde{S}_{n1}.$$
If $J_{12} = J_{21}' = 0$, i.e., $\theta_1$ and $\theta_2$ are "informationally" independent, then
$$LM = n\,\tilde{S}_{n1}'\left(-\tilde{H}_{11}\right)^{-1}\tilde{S}_{n1}.$$

The LM test can be conducted by running the auxiliary regression

$$1 = s_i'\gamma + u_i,$$
where $s_i = s\left(X_i|\tilde\theta\right)$, and computing
$$LM^* = nR_u^2 = \mathbf{1}'S\left(S'S\right)^{-1}S'\mathbf{1} = n\,S_n\left(\tilde\theta\right)'\left(\frac{1}{n}\sum_{i=1}^n s_is_i'\right)^{-1}S_n\left(\tilde\theta\right) = \sqrt{n}S_n\left(\tilde\theta\right)'J_n\left(\tilde\theta\right)^{-1}\sqrt{n}S_n\left(\tilde\theta\right), \qquad (4)$$
where $R_u^2$ is the uncentered $R^2$, $S$ is the $n \times k$ matrix formed by stacking the $s_i'$, and $\mathbf{1}$ is a column of ones. This version of the LM test statistic is called the outer-product-of-the-gradient (OPG) version since $-H_n\left(\tilde\theta\right)$ in $LM$ is substituted by $J_n\left(\tilde\theta\right)$.
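The OPG version (4) can be computed literally as $n$ times the uncentered $R^2$ from regressing a column of ones on the restricted scores. The sketch below is my own example (assuming NumPy), testing $H_0$: $\mu = \mu_0$ in the $N(\mu, \sigma^2)$ model with $\sigma^2$ estimated under the null, so $q = 1$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, mu0 = 200, 0.0
y = rng.normal(0.3, 1.0, size=n)                     # true mean 0.3, so H0: mu = 0 is false

# restricted MLE under H0: mu = mu0, sigma^2 estimated freely
sig2 = np.mean((y - mu0) ** 2)

# per-observation scores s_i(theta_tilde) for theta = (mu, sigma^2)
s_mu = (y - mu0) / sig2
s_sig2 = -1.0 / (2 * sig2) + (y - mu0) ** 2 / (2 * sig2 ** 2)
S = np.column_stack([s_mu, s_sig2])                  # n x k matrix of stacked scores
one = np.ones(n)

# OPG LM statistic: 1'S (S'S)^{-1} S'1, i.e. n times the uncentered R^2
LM_star = one @ S @ np.linalg.solve(S.T @ S, S.T @ one)
print("OPG LM statistic:", LM_star)                  # compare with the chi^2(1) critical value 3.84
```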

Although these three tests have the same asymptotic distribution under $H_0$ and the same asymptotic power against local alternatives, there are some differences among them.

(1) From the formulas of these test statistics, the LR test involves both $\hat\theta$ and $\tilde\theta$, the Wald test involves only $\hat\theta$, and the LM test involves only $\tilde\theta$. This provides some guidance on choosing among the three tests based on computational burdens.¹²

¹² (*) Of course, if $\tilde\theta$ is not easy to obtain due to the nonlinear constraint, the Wald test may be preferable in computation. Nevertheless, Neyman (1959) proposed what is called a $C(\alpha)$ test which involves the construction of a pseudo-LM test, with identical asymptotic properties to the true one, but requiring only $\sqrt{n}$-consistent estimators of the unknown parameters rather than ML ones. See also Buhler and Puri (1966). For a survey of the literature, see Moran (1970).

(2) If the information matrix equality does not hold, e.g., $f(x|\theta)$ is misspecified, we can modify the LM and Wald test statistics to make them converge to $\chi^2(q)$ under $H_0$, but the LR test statistic does not converge to $\chi^2(q)$ any more in this case.

(3) The Wald test is not invariant to reparametrization of the null hypothesis, whereas the LR

test is invariant. Some versions of the LM test, e.g., $LM^*$, but not all, are invariant.

Also, these three test statistics may lead to conflicting inferences in small samples. Gallant (1975) conducted Monte Carlo studies in a nonlinear model and suggested to "use the likelihood ratio test when the hypothesis is an important aspect of the study".

Exercise 7 If $f(x|\theta)$ is misspecified, how can we modify the LM and Wald test statistics to make them converge to $\chi^2(q)$ under $H_0$? Why does the LR test statistic not converge to $\chi^2(q)$ any more in this case?

Example 7 To illustrate the three tests, consider an i.i.d. example with $y_i \sim N(\mu, 1)$ and test $H_0$: $\mu = \mu_0$ vs $H_1$: $\mu \neq \mu_0$. Then $\hat\mu = \bar{y}$ and $\tilde\mu = \mu_0$.

For the LR test, $\ell_n(\mu) = -\frac{1}{2}\ln 2\pi - \frac{1}{2n}\sum_{i=1}^n(y_i - \mu)^2$ and some algebra yields
$$LR = 2n\left(\ell_n(\bar{y}) - \ell_n(\mu_0)\right) = n\left(\bar{y} - \mu_0\right)^2.$$

The Wald test is based on whether $\bar{y} - \mu_0 \approx 0$. Here it is easy to show that $\bar{y} - \mu_0 \sim N\left(0, n^{-1}\right)$ under $H_0$ and $\frac{\partial^2\ell_n(\mu_0)}{\partial\mu^2} = -1$, leading to the quadratic form

$$W = n\left(\bar{y} - \mu_0\right)\left[-(-1)\right]\left(\bar{y} - \mu_0\right) = \left(\bar{y} - \mu_0\right)\left(n^{-1}\right)^{-1}\left(\bar{y} - \mu_0\right).$$

This simplifies to $n\left(\bar{y} - \mu_0\right)^2$, so $W = LR$.

The LM test is based on the closeness to zero of $\partial\ell_n(\mu_0)/\partial\mu = \frac{1}{n}\sum_{i=1}^n(y_i - \mu_0) = \bar{y} - \mu_0$. It is not hard to see that $LM = n\left(\bar{y} - \mu_0\right)\left[-(-1)\right]^{-1}\left(\bar{y} - \mu_0\right) = n\left(\bar{y} - \mu_0\right)^2 = W = LR$.

Despite their quite different motivations, the three test statistics are equivalent here. This exact equivalence is special to this example with constant curvature, owing to a log-likelihood quadratic in $\mu$. Also, the three test statistics follow the exact $\chi_1^2$ distribution under $H_0$. More generally, the three test statistics differ in finite samples but are equivalent asymptotically.
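The exact equality $LR = W = LM = n\left(\bar{y} - \mu_0\right)^2$ in Example 7 is easy to verify numerically; a small sketch of mine, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(8)
n, mu0 = 100, 0.0
y = rng.normal(0.2, 1.0, size=n)                    # variance known to be 1
ybar = y.mean()

def ell(mu):                                        # average log-likelihood with unit variance
    return -0.5 * np.log(2 * np.pi) - 0.5 * np.mean((y - mu) ** 2)

LR = 2 * n * (ell(ybar) - ell(mu0))
W = n * (ybar - mu0) ** 2                           # curvature of ell is the constant -1
LM = n * (y - mu0).mean() ** 2                      # score at mu0 is ybar - mu0
print(LR, W, LM)                                    # all three coincide: n*(ybar - mu0)^2
```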

Exercise 8 Suppose $y_i \overset{iid}{\sim} \text{Bernoulli}(\theta)$. Construct the three test statistics for testing $H_0$: $\theta = \theta_0$ vs $H_1$: $\theta \neq \theta_0$. Are they numerically equivalent?

Each of the three tests has its own strengths. A rule of thumb is: the Wald test would be the first to try due to its simplicity. If the Wald test is not suitable because of, e.g., nonlinear null constraints, as will be discussed in Section 9 of the next chapter, the LR test is often tried. The LR test is invariant to constraint formulation and can be applied to very general cases, e.g., when

the null constraints are nonlinear or even when the likelihood function is not differentiable. The LM test requires smoothness of the likelihood function, but when the model under the null is simple, it can sometimes provide a surprisingly simplified test.

5 Normal Regression Model

In the last chapter, we studied the $t$ test and the $F$ test in the normal regression model. Recall that $t_j = \frac{\hat\beta_j - \beta_j}{s\left(\hat\beta_j\right)} \sim t_{n-k}$. It can be shown that $t_{n-k} \xrightarrow{d} N(0, 1)$ (why?). We now state the three asymptotically equivalent tests and study their relationships with the $F$ test.

The three asymptotically equivalent tests of $H_0$: $R'\beta = c$ are summarized as follows (see Hayashi (2000), p. 503, Ex. 3):

$$W = n\frac{\left(R'\hat\beta - c\right)'\left[R'(X'X)^{-1}R\right]^{-1}\left(R'\hat\beta - c\right)}{SSR_U} = n\frac{SSR_R - SSR_U}{SSR_U} \xrightarrow{d} \chi_q^2,$$
$$LM = n\frac{\left(y - X\tilde\beta\right)'P_X\left(y - X\tilde\beta\right)}{SSR_R} = n\frac{SSR_R - SSR_U}{SSR_R} \xrightarrow{d} \chi_q^2,$$
$$LR = n\left[\ln\left(\frac{SSR_R}{n}\right) - \ln\left(\frac{SSR_U}{n}\right)\right] = n\ln\left(\frac{SSR_R}{SSR_U}\right) \xrightarrow{d} \chi_q^2,$$
where
$$-\hat{H}_n = \left(\begin{array}{cc}\frac{1}{\hat\sigma^2}\frac{1}{n}\sum_{i=1}^n x_ix_i' & 0\\ 0 & \frac{1}{2\hat\sigma^4}\end{array}\right) \quad\text{and}\quad -\tilde{H}_n = \left(\begin{array}{cc}\frac{1}{\tilde\sigma^2}\frac{1}{n}\sum_{i=1}^n x_ix_i' & 0\\ 0 & \frac{1}{2\tilde\sigma^4}\end{array}\right)$$
are used to get neat functional forms of $W$ and $LM$, although other estimators of the information matrix could be used with the same asymptotic distribution. As in the $F$ test, we can express $W$, $LM$ and $LR$ in terms of $R_U^2$ and $R_R^2$.

Exercise 9 Verify that $\hat{H}_n$ and $\tilde{H}_n$, although not the same as $H_n\left(\hat\beta, \hat\sigma^2\right)$ and $H_n\left(\tilde\beta, \tilde\sigma^2\right)$, are consistent for the information matrix under the null.

Exercise 10 (i) Show why the second equality in (4) holds. (ii) Show that $LM$ can be interpreted as $nR_u^2$, where $R_u^2$ is the uncentered $R^2$ in the regression of $\tilde{u}_i$ on $x_i$. What is the intuition for this test?

The result in the above exercise is due to the facts that we are testing $\beta$ only and that $\beta$ is "informationally" independent of $\sigma^2$; see Section 3 of Breusch and Pagan (1980) for related discussions.

The relationship between $F$, $W$, $LM$ and $LR$ is as follows:

$$F = \frac{n-k}{nq}W < W, \qquad LM = \frac{W}{1 + W/n}, \qquad LR = n\ln\left(1 + \frac{W}{n}\right),$$
so $W \ge LR \ge LM$, and
$$F = \frac{(SSR_R - SSR_U)/q}{SSR_U/(n-k)} \sim F_{q, n-k} \xrightarrow{d} \frac{\chi_q^2}{q} \quad\text{as } n \to \infty, \qquad qF = \frac{n-k}{n}W.$$
As a result, $W$, $LR$ and $LM$ have the same asymptotic null distribution, but the decisions based on them may not be the same in practice. See Savin (1976), Berndt and Savin (1977), Breusch (1979), Geweke (1981) and Evans and Savin (1982) for discussions on the conflict among these three test procedures.

Because of the relationship between $F$ and $W$, $W$ is called the $F$ form of the Wald statistic. Since $W \xrightarrow{d} \chi_q^2$ only under the homoskedasticity assumption, $W$ is also called a homoskedastic form of the Wald statistic. Actually, all of $t$, $W$, $LM$ and $LR$ are asymptotically valid only under homoskedasticity. We will develop heteroskedasticity-robust versions of these test statistics after studying the asymptotics for the LSE and the RLS estimator in the next chapter. To distinguish them from the homoskedasticity-only test statistics, the heteroskedasticity-robust test statistics are indexed by a subscript $n$.
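The relations between $F$, $W$, $LM$ and $LR$, and the ordering $W \ge LR \ge LM$, can be checked directly from the two sums of squared residuals. The sketch below is my own illustration (assuming NumPy; the data-generating design is hypothetical), testing a single exclusion restriction ($q = 1$).

```python
import numpy as np

rng = np.random.default_rng(9)
n, q = 200, 1
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = 1.0 + 0.5 * x1 + 0.2 * x2 + rng.standard_normal(n)

def ssr(X):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

X_U = np.column_stack([np.ones(n), x1, x2])          # unrestricted model, k = 3
X_R = np.column_stack([np.ones(n), x1])              # restricted: coefficient on x2 set to 0
SSR_U, SSR_R = ssr(X_U), ssr(X_R)
k = X_U.shape[1]

W = n * (SSR_R - SSR_U) / SSR_U
LM = n * (SSR_R - SSR_U) / SSR_R
LR = n * np.log(SSR_R / SSR_U)
F = ((SSR_R - SSR_U) / q) / (SSR_U / (n - k))
print("W, LR, LM:", W, LR, LM)                       # ordering W >= LR >= LM
print("F vs (n-k)/(n q) * W:", F, (n - k) / (n * q) * W)
```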

Technical Appendix: Proofs for the Five Weapons

Proof of WLLN. Without loss of generality, assume $X \in \mathbb{R}$ and $E[X] = 0$, by considering the convergence element by element and by recentering $X_i$ at its expectation.¹³

We need to show that $\forall\delta > 0$ and $\eta > 0$, $\exists N < \infty$ such that $\forall n \ge N$, $P\left(\left|\bar{X}_n\right| > \delta\right) \le \eta$. Fix $\delta$ and $\eta$. Set $\varepsilon = \eta\delta/3$. Pick $C < \infty$ large enough so that

$$E\left[|X_i|\,1\left(|X_i| > C\right)\right] \le \varepsilon,$$
which is possible since $E[|X_i|] < \infty$. Define the random variables

$$W_i = X_i 1\left(|X_i| \le C\right) - E\left[X_i 1\left(|X_i| \le C\right)\right],$$
$$Z_i = X_i 1\left(|X_i| > C\right) - E\left[X_i 1\left(|X_i| > C\right)\right].$$

¹³ This is possible because $\left\|(x_1, y_1) - (x_2, y_2)\right\| \le \|x_1 - x_2\| + \|y_1 - y_2\|$.

23 By the triangle inequality,

$$E\left|\bar{Z}_n\right| = E\left|\frac{1}{n}\sum_{i=1}^n Z_i\right| \le \frac{1}{n}\sum_{i=1}^n E\left[|Z_i|\right] = E\left[|Z_i|\right] \le E\left[|X_i|\,1\left(|X_i| > C\right)\right] + \left|E\left[X_i 1\left(|X_i| > C\right)\right]\right| \le 2E\left[|X_i|\,1\left(|X_i| > C\right)\right] \le 2\varepsilon.$$
By Jensen's inequality,

$$\left(E\left|\bar{W}_n\right|\right)^2 \le E\left[\bar{W}_n^2\right] = \frac{E\left[W_i^2\right]}{n} \le \frac{4C^2}{n} \le \varepsilon^2$$
for $n \ge N$ with $N = 4C^2/\varepsilon^2$.

Finally, by Markov's inequality,

$$P\left(\left|\bar{X}_n\right| > \delta\right) \le \frac{E\left|\bar{X}_n\right|}{\delta} \le \frac{E\left|\bar{W}_n\right| + E\left|\bar{Z}_n\right|}{\delta} \le \frac{3\varepsilon}{\delta} = \eta,$$
as $n \ge N$.

Proof of CLT. We first state Lévy's continuity theorem: Let $X_n$ and $X$ be random vectors in $\mathbb{R}^k$. Then $X_n \xrightarrow{d} X$ if and only if $E\left[e^{it'X_n}\right] \to E\left[e^{it'X}\right]$ for every $t \in \mathbb{R}^k$, i.e., the characteristic function of $X_n$ converges to that of $X$. Moreover, if $E\left[e^{it'X_n}\right]$ converges pointwise to a function $\phi(t)$ that is continuous at zero, then $\phi$ is the characteristic function of a random vector $X$ and $X_n \xrightarrow{d} X$.

From Lévy's continuity theorem, we need only show that $E\left[e^{it\sqrt{n}\bar{X}_n}\right] \to e^{-\frac{1}{2}t^2}$ for every $t \in \mathbb{R}$, where $e^{-\frac{1}{2}t^2}$ is the characteristic function of $N(0, 1)$. Taylor expanding $E\left[e^{itX}\right]$ at $t = 0$, we have $\phi(t) \equiv E\left[e^{itX}\right] = 1 + itE[X] - \frac{t^2}{2}E\left[X^2\right] + o(t^2) = 1 - \frac{t^2}{2} + o(t^2)$. As a result, for any fixed $t$,
$$E\left[e^{it\sqrt{n}\bar{X}_n}\right] = E\left[\exp\left(i\frac{t}{\sqrt{n}}\sum_{i=1}^n X_i\right)\right] = \prod_{i=1}^n E\left[\exp\left(i\frac{t}{\sqrt{n}}X_i\right)\right] = \phi\left(\frac{t}{\sqrt{n}}\right)^n = \left(1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right)^n \to e^{-\frac{1}{2}t^2}.$$
This theorem is stated for the scalar $X$ case. By the Cramér-Wold device, we can easily extend the result to the vector $X$ case. The Cramér-Wold device states that for $X_n \in \mathbb{R}^k$, $X_n \xrightarrow{d} X$ if and only if $t'X_n \xrightarrow{d} t'X$ for all $t \in \mathbb{R}^k$. That is, we can reduce higher-dimensional problems to the one-dimensional case. This device is valid because the characteristic function $t \mapsto E\left[e^{it'X}\right]$ of a vector $X$ is determined by the set of all characteristic functions $u \mapsto E\left[e^{iu(t'X)}\right]$ of linear combinations $t'X$.

Proof of CMT. Denote the set of continuity points of $g$ as $C$.

(i) Fix an arbitrary $\varepsilon > 0$. For each $\delta > 0$, let $B_\delta$ be the set of $x$ for which there exists $y$ with

$\|y - x\| < \delta$ but $\|g(y) - g(x)\| > \varepsilon$. If $X \notin B_\delta$ and $\|g(X_n) - g(X)\| > \varepsilon$, then $\|X_n - X\| \ge \delta$. Consequently,

$$P\left(\|g(X_n) - g(X)\| > \varepsilon\right) \le P\left(X \in B_\delta\right) + P\left(\|X_n - X\| \ge \delta\right).$$

The second term on the right converges to zero as $n \to \infty$ for every fixed $\delta > 0$. Because $B_\delta \cap C \downarrow \emptyset$ by continuity of $g$, the first term converges to zero as $\delta \downarrow 0$.

(ii) To prove the CMT for convergence in distribution, we first state part of the Portmanteau lemma: $X_n \xrightarrow{d} X$ if and only if $\overline{\lim}\,P(X_n \in F) \le P(X \in F)$ for every closed set $F$.

The event $\{g(X_n) \in F\}$ is identical to the event $\left\{X_n \in g^{-1}(F)\right\}$. For every closed set $F$,
$$g^{-1}(F) \subset \overline{g^{-1}(F)} \subset g^{-1}(F) \cup C^c.$$

To see the second inclusion, take $x$ in the closure of $g^{-1}(F)$. Thus, there exists a sequence $x_m$ with $x_m \to x$ and $g(x_m) \in F$. If $x \in C$, then $g(x_m) \to g(x)$, which is in $F$ because $F$ is closed; otherwise $x \in C^c$. By the Portmanteau lemma,

$$\overline{\lim}\,P\left(g(X_n) \in F\right) \le \overline{\lim}\,P\left(X_n \in \overline{g^{-1}(F)}\right) \le P\left(X \in \overline{g^{-1}(F)}\right).$$
Because $P(X \in C^c) = 0$, the probability on the right is $P\left(X \in g^{-1}(F)\right) = P\left(g(X) \in F\right)$. Apply the Portmanteau lemma again, in the opposite direction, to conclude that $g(X_n) \xrightarrow{d} g(X)$.

Proof of Slutsky's Theorem. We first state another part of the Portmanteau lemma: $X_n \xrightarrow{d} X$ if and only if $E[f(X_n)] \to E[f(X)]$ for all bounded, continuous functions $f$.

First note that $\|(X_n, Y_n) - (X_n, c)\| = \|Y_n - c\| \xrightarrow{p} 0$. Thus by Exercise 1(i), it suffices to show that $(X_n, c) \xrightarrow{d} (X, c)$. For every continuous, bounded function $(x, y) \mapsto f(x, y)$, the function $x \mapsto f(x, c)$ is continuous and bounded. By the Portmanteau lemma, it suffices to show that $E[f(X_n, c)] \to E[f(X, c)]$, which holds if $X_n \xrightarrow{d} X$ by applying the Portmanteau lemma again.
