Chapter 1
What Is Econometrics
1.1 Data

1.1.1 Accounting

Definition 1.1 Accounting data are routinely recorded data (that is, records of transactions) generated as part of market activities. Most of the data used in econometrics are accounting data.
Accounting data have many shortcomings, since they are not collected specifically for the purposes of the econometrician. An econometrician would prefer data collected in experiments or gathered specifically for her purposes.
1.1.2 Nonexperimental

The econometrician has no control over nonexperimental data. This causes various problems, and econometrics is used to deal with or solve them.
1.1.3 Data Types

Definition 1.2 Time-series data are data that represent repeated observations of some variable in subsequent time periods. A time-series variable is often subscripted with the letter t.
Definition 1.3 Cross-sectional data are data that represent a set of observations of some variable at one specific instant over several agents. A cross-sectional variable is often subscripted with the letter i.
Definition 1.4 Time-series cross-sectional data are data that are both time-series and cross-sectional.
A special case of time-series cross-sectional data is panel data. Panel data are observations of the same set of agents over time.
1.1.4 Empirical Regularities

We need to look at the data to detect regularities. Often, we use stylized facts, but this can lead to over-simplifications.
1.2 Models
Models are simplifications of the real world. The data we use in our models are what motivate the theory.
1.2.1 Economic Theory

Models can be derived from economic theory: by choosing assumptions and working out their implications, we obtain relationships among the variables of interest.
1.2.2 Postulated Relationships

Models can also be developed by postulating relationships among the variables.
1.2.3 Equations

Economic models are usually stated in terms of one or more equations.
1.2.4 Error Component

Because none of our models are exactly correct, we include an error component in our equations, usually denoted ui. In econometrics, we usually assume that the error component is stochastic (that is, random). It is important to note that the error component cannot be modeled by economic theory. We impose assumptions on ui, and, as econometricians, we focus on ui.
1.3 Statistics

1.3.1 Stochastic Component

Some stuff here.
1.3.2 Statistical Analysis

Some stuff here.
1.3.3 Tasks

There are three main tasks of econometrics:
1. estimating parameters;
2. hypothesis testing;
3. forecasting.
Forecasting is perhaps the main reason for econometrics.
1.3.4 Econometrics

Since our data are, in general, nonexperimental, econometrics makes use of economic theory to adjust for the lack of proper data.
1.4 Interaction

1.4.1 Scientific Method

Econometrics uses the scientific method, but the data are nonexperimental. In some sense, this is similar to astronomers, who gather data but cannot conduct experiments (for example, astronomers predict the existence of black holes, but have never made one in a lab).
1.4.2 Role Of Econometrics

The three components of econometrics are:
1. theory;
2. statistics;
3. data.
These components are interdependent, and each helps shape the others.
1.4.3 Occam's Razor

Often in econometrics, we are faced with the problem of choosing one model over an alternative. The simplest model that fits the data is often the best choice.

Chapter 2
Some Useful Distributions
2.1 Introduction
2.1.1 Inferences

Statistical Statements

As statisticians, we are often called upon to answer questions or make statements concerning certain random variables. For example: Is a coin fair (i.e., is the probability of heads 0.5)? What is the expected value of GNP for the quarter?
Population Distribution

Typically, answering such questions requires knowledge of the distribution of the random variable. Unfortunately, we usually do not know this distribution (although we may have a strong idea, as in the case of the coin).
Experimentation

In order to gain knowledge of the distribution, we draw several realizations of the random variable. The notion is that the observations in this sample contain information concerning the population distribution.
Inference

Definition 2.1 The process by which we make statements concerning the population distribution based on the sample observations is called inference.
Example 2.1 We decide whether a coin is fair by tossing it several times and observing whether it seems to be heads about half the time.
2.1.2 Random Samples
Suppose we draw n observations of a random variable, denoted {x1, x2, ..., xn}. If each xi is independent of the others and has the same (marginal) distribution, then {x1, x2, ..., xn} constitutes a simple random sample.
Example 2.2 We toss a coin three times. Supposedly, the outcomes are independent. If xi counts the number of heads for toss i, then we have a simple random sample.
Note that not all samples are simple random samples.
Example 2.3 We are interested in the income level for the population in general. The n observations available in this class are not identically distributed, since the higher-income individuals will tend to be more variable.
Example 2.4 Consider the aggregate consumption level. The n observations available in this set are not independent, since a high consumption level in one period is usually followed by a high level in the next.
2.1.3 Sample Statistics
Definition 2.2 Any function of the observations in the sample which is the basis for inference is called a sample statistic.
Example 2.5 In the coin tossing experiment, let S count the total number of heads and P = S/3 the sample proportion of heads. Both S and P are sample statistics.
2.1.4 Sample Distributions
A sample statistic is a random variable – its value will vary from one experiment to another. As a random variable, it is subject to a distribution.
Definition 2.3 The distribution of the sample statistic is the sample distribution of the statistic.
Example 2.6 The statistic S introduced above has a binomial sample distribution. Specifically, Pr(S = 0) = 1/8, Pr(S = 1) = 3/8, Pr(S = 2) = 3/8, and Pr(S = 3) = 1/8.
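A quick simulation makes this sampling distribution concrete. The sketch below is illustrative only (Python with numpy is our choice of tool, as are the seed and the number of replications): it repeatedly tosses a fair coin three times and tabulates the frequency of each value of S, which should be close to 1/8, 3/8, 3/8, 1/8.

    import numpy as np

    rng = np.random.default_rng(0)
    n_reps = 100_000

    # Each row is one experiment: three tosses coded 1 = heads, 0 = tails.
    tosses = rng.integers(0, 2, size=(n_reps, 3))
    S = tosses.sum(axis=1)          # sample statistic S: total number of heads

    for k in range(4):
        print(k, (S == k).mean())   # empirical Pr(S = k); compare with 1/8, 3/8, 3/8, 1/8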
2.2 Normality And The Sample Mean

2.2.1 Sample Sum
Consider the simple random sample {x1, x2, ..., xn}, where xi measures the height of an adult female. We will assume that E[xi] = µ and Var(xi) = σ², for all i = 1, 2, ..., n. Let S = x1 + x2 + ... + xn denote the sample sum. Now,
E[S] = E(x1 + x2 + ... + xn) = E[x1] + E[x2] + ... + E[xn] = nµ.    (2.1)

Also,
Var(S) = E(S − E[S])²
       = E(S − nµ)²
       = E(x1 + x2 + ... + xn − nµ)²
       = E[ Σ_{i=1}^n (xi − µ) ]²
       = E[ (x1 − µ)² + (x1 − µ)(x2 − µ) + ... + (x2 − µ)(x1 − µ) + (x2 − µ)² + ... + (xn − µ)² ]
       = nσ².    (2.2)
Note that E[(xi − µ)(xj − µ)] = 0, for i ≠ j, by independence.

2.2.2 Sample Mean

Let x̄ = S/n denote the sample mean or average.
2.2.3 Moments Of The Sample Mean

The mean of the sample mean is

E[x̄] = E[S/n] = (1/n) E[S] = (1/n) nµ = µ.    (2.3)

The variance of the sample mean is
Var(x̄) = E(x̄ − µ)²
        = E(S/n − µ)²
        = E[ (1/n)(S − nµ) ]²
        = (1/n²) E(S − nµ)²
        = (1/n²) nσ²
        = σ²/n.    (2.4)
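The moments (2.3) and (2.4) are easy to verify by simulation. In the sketch below (a Python illustration under our own choices of µ, σ², n, and replication count), many samples of size n are drawn and the mean and variance of the resulting sample means are compared with µ and σ²/n.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, n_reps = 5.0, 2.0, 25, 50_000

    samples = rng.normal(mu, sigma, size=(n_reps, n))
    xbar = samples.mean(axis=1)                 # one sample mean per replication

    print(xbar.mean(), mu)                      # should be close to mu, as in (2.3)
    print(xbar.var(), sigma**2 / n)             # should be close to sigma^2 / n, as in (2.4)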
2.2.4 Sampling Distribution

We have been able to establish the mean and variance of the sample mean. However, in order to know its complete distribution precisely, we must know the probability density function (pdf) of the random variable x̄.
2.3 The Normal Distribution

2.3.1 Density Function
Definition 2.4 A continuous random variable xi with the density function

f(xi) = (1/√(2πσ²)) e^{−(xi − µ)²/(2σ²)}    (2.5)

follows the normal distribution, where µ and σ² are the mean and variance of xi, respectively.
Since the distribution is characterized by the two parameters µ and σ², we denote a normal random variable by xi ∼ N(µ, σ²).
The normal density function is the familiar "bell-shaped" curve, as shown in Figure 2.1 for µ = 0 and σ² = 1. It is symmetric about the mean µ. Approximately 2/3 of the probability mass lies within ±σ of µ and about 0.95 lies within ±2σ. There are numerous examples of random variables that have this shape. Many economic variables are assumed to be normally distributed.
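These probability statements can be checked directly from the normal cdf; the brief sketch below relies on the scipy.stats implementation, which is our own choice of tool rather than anything in the text.

    from scipy.stats import norm

    # Probability mass within one and within two standard deviations of the mean.
    print(norm.cdf(1) - norm.cdf(-1))   # about 0.683, roughly 2/3
    print(norm.cdf(2) - norm.cdf(-2))   # about 0.954, roughly 0.95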
2.3.2 Linear Transformation

Consider the transformed random variable

Yi = a + b xi.

We know that µY = E[Yi] = a + b µx and

σY² = E(Yi − µY)² = b² σx².

If xi is normally distributed, then Yi is normally distributed as well. That is,

Yi ∼ N(µY, σY²).
Figure 2.1: The Standard Normal Distribution
Moreover, if xi ∼ N(µx, σx²) and zi ∼ N(µz, σz²) are independent, then

Yi = a + b xi + c zi ∼ N(a + b µx + c µz, b² σx² + c² σz²).

These results will be formally demonstrated in a more general setting in the next chapter.
2.3.3 Distribution Of The Sample Mean
If, for each i = 1, 2, ..., n, the xi's are independent, identically distributed (iid) normal random variables, then

x̄ ∼ N(µx, σx²/n).    (2.6)

2.3.4 The Standard Normal

The distribution of x̄ will vary with different values of µx and σx², which is inconvenient. Rather than dealing with a unique distribution for each case, we perform the following transformation:
Z = (x̄ − µx)/√(σx²/n)
  = x̄/√(σx²/n) − µx/√(σx²/n).    (2.7)

Now,

E[Z] = E[x̄]/√(σx²/n) − µx/√(σx²/n)
     = µx/√(σx²/n) − µx/√(σx²/n)
     = 0.

Also,

Var(Z) = E[ x̄/√(σx²/n) − µx/√(σx²/n) ]²
       = (n/σx²) E(x̄ − µx)²
       = (n/σx²)(σx²/n)
       = 1.

Thus Z ∼ N(0, 1). The N(0, 1) distribution is the standard normal and is well-tabulated. The probability density function for the standard normal distribution is

f(zi) = (1/√(2π)) e^{−zi²/2} = ϕ(zi).    (2.8)
2.4 The Central Limit Theorem

2.4.1 Normal Theory

The normal density has a prominent position in statistics. This is not only because many random variables appear to be normal, but also because almost any sample mean appears normal as the sample size increases. Specifically, suppose x1, x2, ..., xn is a simple random sample with E[xi] = µx and Var(xi) = σx². Then, as n → ∞, the distribution of x̄ becomes normal. That is,

lim_{n→∞} f( (x̄ − µx)/√(σx²/n) ) = ϕ( (x̄ − µx)/√(σx²/n) ).    (2.9)
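The theorem is easy to visualize by simulation. The sketch below (the uniform parent distribution, the sample sizes, and the replication count are our own illustrative choices) standardizes sample means of a decidedly non-normal variable and compares a tail probability with the standard normal value.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(2)
    mu, sigma2 = 0.5, 1.0 / 12.0          # mean and variance of a uniform[0, 1]

    for n in (2, 10, 100):
        x = rng.uniform(size=(50_000, n))
        z = (x.mean(axis=1) - mu) / np.sqrt(sigma2 / n)   # the ratio appearing in (2.9)
        # Empirical Pr(Z > 1.645) should approach the normal value 0.05 as n grows.
        print(n, (z > 1.645).mean(), 1 - norm.cdf(1.645))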
2.5 Distributions Associated With The Normal Distribution

2.5.1 The Chi-Squared Distribution
Definition 2.5 Suppose that Z1, Z2, ..., Zn is a simple random sample, and Zi ∼ N(0, 1). Then

Σ_{i=1}^n Zi² ∼ χ²_n,    (2.10)

where n are the degrees of freedom of the chi-squared distribution.
The probability density function for the χ²_n is

f_{χ²}(x) = ( 1/(2^{n/2} Γ(n/2)) ) x^{n/2 − 1} e^{−x/2},  x > 0,    (2.11)

where Γ(·) is the gamma function. See Figure 2.2. If x1, x2, ..., xn is a simple random sample, and xi ∼ N(µ, σx²), then

Σ_{i=1}^n ( (xi − µ)/σx )² ∼ χ²_n.    (2.12)

The chi-squared distribution will prove useful in testing hypotheses on both the variance of a single variable and the (conditional) means of several variables. This multivariate usage will be explored in the next chapter.
Example 2.7 Consider the estimate of σ²

s² = Σ_{i=1}^n (xi − x̄)² / (n − 1).

Then

(n − 1) s²/σ² ∼ χ²_{n−1}.    (2.13)
Figure 2.2: Some Chi-Squared Distributions
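Result (2.13) can also be checked numerically. In the following sketch (the sample size and replication count are arbitrary choices of ours), (n − 1)s²/σ² is computed for many normal samples and a few of its empirical quantiles are compared with those of the χ²_{n−1} distribution from scipy.stats.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(3)
    mu, sigma, n = 0.0, 3.0, 10

    x = rng.normal(mu, sigma, size=(100_000, n))
    s2 = x.var(axis=1, ddof=1)              # unbiased estimator s^2
    stat = (n - 1) * s2 / sigma**2          # should follow a chi-squared with n - 1 df

    for q in (0.5, 0.9, 0.95):
        print(q, np.quantile(stat, q), chi2.ppf(q, df=n - 1))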
2.5.2 The t Distribution

Definition 2.6 Suppose that Z ∼ N(0, 1), Y ∼ χ²_k, and that Z and Y are independent. Then

Z / √(Y/k) ∼ t_k,    (2.14)

where k are the degrees of freedom of the t distribution.
The probability density function for a t random variable with n degrees of freedom is

f_t(x) = Γ((n + 1)/2) / ( √(nπ) Γ(n/2) (1 + x²/n)^{(n+1)/2} ).    (2.15)

Example 2.8 Consider the sample mean from a simple random sample of normals. We know that x̄ ∼ N(µ, σ²/n) and

Z = (x̄ − µ)/√(σ²/n) ∼ N(0, 1).

Also, we know that

Y = (n − 1) s²/σ² ∼ χ²_{n−1},

where s² is the unbiased estimator of σ². Thus, since Z and Y are independent (which, in fact, is the case), we have

Z / √( Y/(n − 1) ) = [ (x̄ − µ)/√(σ²/n) ] / √( s²/σ² )
                   = (x̄ − µ) √(n/σ²) √(σ²/s²)
                   = (x̄ − µ)/√(s²/n) ∼ t_{n−1}.    (2.16)

Figure 2.3: Some t Distributions

2.5.3 The F Distribution

Definition 2.7 Suppose that Y ∼ χ²_m, W ∼ χ²_n, and that Y and W are independent. Then

(Y/m) / (W/n) ∼ F_{m,n},    (2.17)

where m, n are the degrees of freedom of the F distribution.

The probability density function for an F random variable with m and n degrees of freedom is

f_F(x) = Γ((m + n)/2) (m/n)^{m/2} x^{(m/2) − 1} / ( Γ(m/2) Γ(n/2) (1 + mx/n)^{(m+n)/2} ).    (2.18)

The F distribution is named after the great statistician Sir Ronald A. Fisher, and is used in many applications, most notably in the analysis of variance. This situation will arise when we seek to test multiple (conditional) mean parameters with estimated variance. Note that if x ∼ t_n, then x² ∼ F_{1,n}. Some examples of the F distribution can be seen in Figure 2.4.

Figure 2.4: Some F Distributions

Chapter 3

Multivariate Distributions

3.1 Matrix Algebra Of Expectations

3.1.1 Moments of Random Vectors

Let x = (x1, x2, ..., xm)' be an m × 1 vector-valued random variable. Each element of the vector is a scalar random variable of the type discussed in the previous chapter.
The expectation of a random vector is

E[x] = ( E[x1], E[x2], ..., E[xm] )' = (µ1, µ2, ..., µm)' = µ.    (3.1)

Note that µ is also an m × 1 column vector. We see that the mean of the vector is the vector of the means.
Next, we evaluate E[(x − µ)(x − µ)'], the m × m matrix whose (i, j) element is E[(xi − µi)(xj − µj)] = σij:

E[(x − µ)(x − µ)'] = [σij] = Σ.    (3.2)

Σ, the covariance matrix, is an m × m matrix of variance and covariance terms. The variance σi² = σii of xi is along the diagonal, while the cross-product terms represent the covariance between xi and xj.

3.1.2 Properties Of The Covariance Matrix

Symmetric

The variance-covariance matrix Σ is a symmetric matrix. This can be shown by noting that

σij = E(xi − µi)(xj − µj) = E(xj − µj)(xi − µi) = σji.

Due to this symmetry, Σ has only m(m + 1)/2 unique elements.

Positive Semidefinite

Σ is a positive semidefinite matrix. Recall that any m × m matrix is positive semidefinite if and only if it meets any of the following three equivalent conditions:

1. All the principal minors are nonnegative;
2. λ'Σλ ≥ 0, for all m × 1 vectors λ ≠ 0;
3. Σ = PP', for some m × m matrix P.

The first condition (actually we use negative definiteness) is useful in the study of utility maximization, while the latter two are useful in econometric analysis. The second condition is the easiest to demonstrate in the current context. Let λ ≠ 0. Then, we have

λ'Σλ = λ' E[(x − µ)(x − µ)'] λ
     = E[ λ'(x − µ)(x − µ)'λ ]
     = E{ [λ'(x − µ)]² } ≥ 0,

since the term inside the expectation is a quadratic. Hence, Σ is a positive semidefinite matrix.
Note that a P satisfying the third relationship is not unique. Let D be any m × m orthonormal matrix, so DD' = Im; then P* = PD yields P*P*' = PDD'P' = P Im P' = Σ. Usually, we will choose P to be an upper or lower triangular matrix with m(m + 1)/2 nonzero elements.

Positive Definite

Since Σ is a positive semidefinite matrix, it will be a positive definite matrix if and only if det(Σ) ≠ 0. Now, we know that Σ = PP' for some m × m matrix P. This implies that det(P) ≠ 0.

3.1.3 Linear Transformations

Let y = b + Bx, where y, b, and x are m × 1 and B is m × m. Then

E[y] = b + B E[x] = b + Bµ = µy.    (3.3)

Thus, the mean of a linear transformation is the linear transformation of the mean. Next, we have

E[(y − µy)(y − µy)'] = E{ [B(x − µ)][B(x − µ)]' }
                     = B E[(x − µ)(x − µ)'] B'
                     = BΣB'    (3.4)
                     = Σy,    (3.5)

where we use the result (ABC)' = C'B'A', if conformability holds.

3.2 Change Of Variables

3.2.1 Univariate

Let x be a random variable and fx(·) be the probability density function of x. Now, define y = h(x), where

h'(x) = dh(x)/dx > 0.

That is, h(x) is a strictly monotonically increasing function and so y is a one-to-one transformation of x. Now, we would like to know the probability density function of y, fy(y). To find it, we note that

Pr( y ≤ h(a) ) = Pr( x ≤ a ),    (3.6)

Pr( x ≤ a ) = ∫_{−∞}^{a} fx(x) dx = Fx(a),    (3.7)

and

Pr( y ≤ h(a) ) = ∫_{−∞}^{h(a)} fy(y) dy = Fy(h(a)),    (3.8)

for all a. Assuming that the cumulative density function is differentiable, we use (3.6) to combine (3.7) and (3.8), and take the total differential, which gives us

dFx(a) = dFy(h(a)),  that is,  fx(a) da = fy(h(a)) h'(a) da,

for all a. Thus, for a small perturbation,

fx(a) = fy(h(a)) h'(a)    (3.9)

for all a. Also, since y is a one-to-one transformation of x, we know that h(·) can be inverted. That is, x = h⁻¹(y). Thus, a = h⁻¹(y), and we can rewrite (3.9) as

fx(h⁻¹(y)) = fy(y) h'(h⁻¹(y)).

Therefore, the probability density function of y is

fy(y) = fx(h⁻¹(y)) · 1/h'(h⁻¹(y)).    (3.10)

Note that fy(y) has the property of being nonnegative, since h'(·) > 0. If h'(·) < 0, (3.10) can be corrected by taking the absolute value of h'(·), which will assure that we have only positive values for our probability density function.

3.2.2 Geometric Interpretation

Consider the graph of the relationship shown in Figure 3.1, which plots y = h(x) with the points h(a) and h(b) marked for a < b.

Figure 3.1: Change of Variables

We know that

Pr[ h(b) > y > h(a) ] = Pr( b > x > a ).

Also, we know that

Pr[ h(b) > y > h(a) ] ≈ fy[h(b)] [h(b) − h(a)]

and

Pr( b > x > a ) ≈ fx(b)(b − a).

So,

fy[h(b)] [h(b) − h(a)] ≈ fx(b)(b − a),
fy[h(b)] ≈ fx(b) · 1/{ [h(b) − h(a)]/(b − a) }.    (3.11)

Now, as we let a → b, the denominator of (3.11) approaches h'(·). This is then the same formula as (3.10).
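The univariate formula (3.10) can be illustrated numerically. In the sketch below, our own example takes x standard normal and y = h(x) = e^x, so that h'(x) = e^x and h⁻¹(y) = log y; the density implied by (3.10) is compared with the lognormal density, which is the known distribution of e^x.

    import numpy as np
    from scipy.stats import norm, lognorm

    y = np.linspace(0.1, 4.0, 5)

    # Formula (3.10): f_y(y) = f_x(h^{-1}(y)) / h'(h^{-1}(y)), with h(x) = exp(x).
    f_y = norm.pdf(np.log(y)) / y

    # The known density of exp(x) when x ~ N(0, 1) is lognormal with shape s = 1.
    print(f_y)
    print(lognorm.pdf(y, s=1))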
3.2.3 Multivariate

Let x ∼ fx(x) be an m × 1 random vector. Define a one-to-one transformation y = h(x), where y is also m × 1. Since h(·) is a one-to-one transformation, it has an inverse:

x = h⁻¹(y).

We also assume that ∂h(x)/∂x' exists. This is the m × m Jacobian matrix whose (i, j) element is ∂hj(x)/∂xi:

Jx(x) = ∂h(x)/∂x'.    (3.12)

Given this notation, the multivariate analog to (3.11) can be shown to be

fy(y) = fx[h⁻¹(y)] · 1/|det( Jx[h⁻¹(y)] )|.    (3.13)

Since h(·) is differentiable and one-to-one, det( Jx[h⁻¹(y)] ) ≠ 0.

Example 3.1 Let y = b0 + b1 x, where x, b0, and b1 are scalars. Then

x = (y − b0)/b1  and  dy/dx = b1.

Therefore,

fy(y) = fx( (y − b0)/b1 ) · 1/|b1|.

Example 3.2 Let y = b + Bx, where y is an m × 1 vector and det(B) ≠ 0. Then

x = B⁻¹(y − b)  and  ∂y/∂x' = B = Jx(x).

Thus,

fy(y) = fx( B⁻¹(y − b) ) · 1/|det(B)|.

3.3 Multivariate Normal Distribution

3.3.1 Spherical Normal Distribution

Definition 3.1 An m × 1 random vector z is said to be spherically normally distributed if

f(z) = (1/(2π)^{m/2}) e^{−z'z/2}.

Such a random vector can be seen to be a vector of independent standard normals. Let z1, z2, ..., zm be i.i.d. random variables such that zi ∼ N(0, 1). That is, zi has the pdf given in (2.8), for i = 1, ..., m. Then, by independence, the joint distribution of the zi's is given by

f(z1, z2, ..., zm) = f(z1) f(z2) ... f(zm)
                  = Π_{i=1}^m (1/√(2π)) e^{−zi²/2}
                  = (1/(2π)^{m/2}) e^{−(1/2) Σ_{i=1}^m zi²}
                  = (1/(2π)^{m/2}) e^{−z'z/2},    (3.14)

where z' = (z1 z2 ... zm).

3.3.2 Multivariate Normal

Definition 3.2 The m × 1 random vector x with density

fx(x) = ( 1/( (2π)^{m/2} [det(Σ)]^{1/2} ) ) e^{−(1/2)(x − µ)'Σ⁻¹(x − µ)}    (3.15)

is said to be distributed multivariate normal with mean vector µ and positive definite covariance matrix Σ.

Such a distribution for x is denoted by x ∼ N(µ, Σ). The spherical normal distribution is seen to be a special case where µ = 0 and Σ = Im.
There is a one-to-one relationship between the multivariate normal random vector and a spherical normal random vector. Let z be an m × 1 spherical normal random vector and

x = µ + Az,

where det(A) ≠ 0. Then,

E[x] = µ + A E[z] = µ,    (3.16)

since E[z] = 0. Also, we know that

E(zz') = Im,    (3.17)

since E[zi zj] = 0 for all i ≠ j, and E[zi²] = 1 for all i. Therefore,

E[(x − µ)(x − µ)'] = E(Azz'A') = A E(zz') A' = A Im A' = AA' = Σ,    (3.18)

where Σ is a positive definite matrix (since det(A) ≠ 0).
Next, we need to find the probability density function fx(x) of x. We know that

z = A⁻¹(x − µ),  z' = (x − µ)'(A⁻¹)',  and  Jz(z) = A,

so we use (3.13) to get

fx(x) = fz[ A⁻¹(x − µ) ] · 1/|det(A)|
      = ( 1/( (2π)^{m/2} |det(A)| ) ) e^{−(1/2)(x − µ)'(A⁻¹)'A⁻¹(x − µ)}
      = ( 1/( (2π)^{m/2} |det(A)| ) ) e^{−(1/2)(x − µ)'(AA')⁻¹(x − µ)},    (3.19)

where we use the results (ABC)⁻¹ = C⁻¹B⁻¹A⁻¹ and (A')⁻¹ = (A⁻¹)'. However, Σ = AA', so det(Σ) = det(A) det(A'), and |det(A)| = [det(Σ)]^{1/2}. Thus we can rewrite (3.19) as

fx(x) = ( 1/( (2π)^{m/2} [det(Σ)]^{1/2} ) ) e^{−(1/2)(x − µ)'Σ⁻¹(x − µ)},    (3.20)

and we see that x ∼ N(µ, Σ) with mean vector µ and covariance matrix Σ. Since this process is completely reversible, the relationship is one-to-one.

3.3.3 Linear Transformations

Theorem 3.1 Suppose x ∼ N(µ, Σ) with det(Σ) ≠ 0 and y = b + Bx with B square and det(B) ≠ 0. Then y ∼ N(µy, Σy).
Proof: From (3.3) and (3.4), we have E[y] = b + Bµ = µy and E[(y − µy)(y − µy)'] = BΣB' = Σy. To find the probability density function fy(y) of y, we again use (3.13), which gives us

fy(y) = fx[ B⁻¹(y − b) ] · 1/|det(B)|
      = ( 1/( (2π)^{m/2} [det(Σ)]^{1/2} |det(B)| ) ) e^{−(1/2)[B⁻¹(y − b) − µ]'Σ⁻¹[B⁻¹(y − b) − µ]}
      = ( 1/( (2π)^{m/2} [det(BΣB')]^{1/2} ) ) e^{−(1/2)(y − b − Bµ)'(BΣB')⁻¹(y − b − Bµ)}
      = ( 1/( (2π)^{m/2} [det(Σy)]^{1/2} ) ) e^{−(1/2)(y − µy)'Σy⁻¹(y − µy)}.    (3.21)

So, y ∼ N(µy, Σy).

Thus we see, as asserted in the previous chapter, that linear transformations of multivariate normal random variables are also multivariate normal random variables. And any linear combination of independent normals will also be normal.

3.3.4 Quadratic Forms

Theorem 3.2 Let x ∼ N(µ, Σ), where det(Σ) ≠ 0. Then (x − µ)'Σ⁻¹(x − µ) ∼ χ²_m.

Proof: Let Σ = PP'. Then

(x − µ) ∼ N(0, Σ)

and

z = P⁻¹(x − µ) ∼ N(0, Im).

Therefore,

Σ_{i=1}^m zi² = z'z = [P⁻¹(x − µ)]'[P⁻¹(x − µ)]
             = (x − µ)'(P⁻¹)'P⁻¹(x − µ)
             = (x − µ)'Σ⁻¹(x − µ) ∼ χ²_m.    (3.22)

Σ⁻¹ is called the weight matrix. With this result, we can use the χ²_m to make inferences about the mean µ of x.

3.4 Normality and the Sample Mean

3.4.1 Moments of the Sample Mean

Consider m × 1 vectors xi that are i.i.d. jointly, with m × 1 mean vector E[xi] = µ and m × m covariance matrix E[(xi − µ)(xi − µ)'] = Σ. Define x̄n = (1/n) Σ_{i=1}^n xi as the vector sample mean, which is also the vector of scalar sample means. The mean of the vector sample mean follows directly:

E[x̄n] = (1/n) Σ_{i=1}^n E[xi] = µ.

Alternatively, this result can be obtained by applying the scalar results element by element. The second moment matrix of the vector sample mean is given by

E[(x̄n − µ)(x̄n − µ)'] = (1/n²) E[ (Σ_{i=1}^n xi − nµ)(Σ_{i=1}^n xi − nµ)' ]
                      = (1/n²) E[ {(x1 − µ) + (x2 − µ) + ... + (xn − µ)}{(x1 − µ) + (x2 − µ) + ... + (xn − µ)}' ]
                      = (1/n²) nΣ = (1/n) Σ,

since the covariances between different observations are zero.

3.4.2 Distribution of the Sample Mean

Suppose the xi are i.i.d. N(µ, Σ) jointly. Then it follows from joint multivariate normality that x̄n must also be multivariate normal, since it is a linear transformation. Specifically, we have

x̄n ∼ N(µ, (1/n)Σ),

or equivalently

x̄n − µ ∼ N(0, (1/n)Σ),
√n(x̄n − µ) ∼ N(0, Σ),
√n Σ^{−1/2}(x̄n − µ) ∼ N(0, Im),

where Σ^{1/2} satisfies Σ = Σ^{1/2}(Σ^{1/2})' and Σ^{−1/2} = (Σ^{1/2})⁻¹.

3.4.3 Multivariate Central Limit Theorem

Theorem 3.3 Suppose that (i) the xi are i.i.d. jointly, (ii) E[xi] = µ, and (iii) E[(xi − µ)(xi − µ)'] = Σ. Then

√n(x̄n − µ) →d N(0, Σ),

or equivalently

z = √n Σ^{−1/2}(x̄n − µ) →d N(0, Im).

These results apply even if the original underlying distribution is not normal, and follow directly from the scalar results applied to any linear combination of x̄n.

3.4.4 Limiting Behavior of Quadratic Forms

Consider the following quadratic form:

n(x̄n − µ)'Σ⁻¹(x̄n − µ) = n(x̄n − µ)'( Σ^{1/2}(Σ^{1/2})' )⁻¹(x̄n − µ)
                       = [√n Σ^{−1/2}(x̄n − µ)]'[√n Σ^{−1/2}(x̄n − µ)]
                       = z'z →d χ²_m.

This form is convenient for asymptotic joint tests concerning more than one mean at a time.
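The limiting chi-squared behavior of this quadratic form can be checked by simulation. The sketch below uses our own illustrative choices (dimension m = 3, independent exponential components so that Σ = Im, and arbitrary sample sizes); the empirical 95th percentile of n(x̄n − µ)'Σ⁻¹(x̄n − µ) is compared with the χ²_m critical value.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(4)
    m, n, n_reps = 3, 200, 20_000
    mu = np.array([1.0, 2.0, 3.0])
    Sigma = np.eye(m)                         # true covariance of the components below

    stats = np.empty(n_reps)
    for r in range(n_reps):
        # Independent exponential components (non-normal), shifted to have mean mu.
        x = rng.exponential(1.0, size=(n, m)) - 1.0 + mu
        d = x.mean(axis=0) - mu
        stats[r] = n * d @ np.linalg.solve(Sigma, d)

    print(np.quantile(stats, 0.95), chi2.ppf(0.95, df=m))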
3.5 Noncentral Distributions

3.5.1 Noncentral Scalar Normal

Definition 3.3 Let x ∼ N(µ, σ²). Then

z* = x/σ ∼ N(µ/σ, 1)    (3.23)

has a noncentral normal distribution.

Example 3.3 When we do a hypothesis test of the mean, with known variance, we have, under the null hypothesis H0: µ = µ0,

(x − µ0)/σ ∼ N(0, 1),    (3.24)

and, under the alternative H1: µ = µ1 ≠ µ0,

(x − µ0)/σ = (x − µ1)/σ + (µ1 − µ0)/σ
           = N(0, 1) + (µ1 − µ0)/σ ∼ N( (µ1 − µ0)/σ, 1 ).    (3.25)

Thus, the behavior of (x − µ0)/σ under the alternative hypothesis follows a noncentral normal distribution.

As this example makes clear, the noncentral normal distribution is especially useful when carefully exploring the behavior of a test under the alternative hypothesis.

3.5.2 Noncentral t

Definition 3.4 Let z* ∼ N(µ/σ, 1), w ∼ χ²_k, and let z* and w be independent. Then

z*/√(w/k) ∼ t_k(µ/σ)    (3.26)

has a noncentral t distribution.

The noncentral t distribution is used in tests of the mean when the variance is unknown.

3.5.3 Noncentral Chi-Squared

Definition 3.5 Let z* ∼ N(µ, Im). Then

z*'z* ∼ χ²_m(δ)    (3.27)

has a noncentral chi-squared distribution, where δ = µ'µ is the noncentrality parameter.

In the noncentral chi-squared distribution, the probability mass is shifted to the right as compared to a regular chi-squared distribution.

Example 3.4 When we do a test of µ, with known Σ, we have, under H0: µ = µ0,

(x − µ0)'Σ⁻¹(x − µ0) ∼ χ²_m.    (3.28)

Under H1: µ = µ1 ≠ µ0, let z* = P⁻¹(x − µ0), where Σ = PP'. Then, we have

(x − µ0)'Σ⁻¹(x − µ0) = (x − µ0)'(P⁻¹)'P⁻¹(x − µ0)
                     = z*'z* ∼ χ²_m( (µ1 − µ0)'Σ⁻¹(µ1 − µ0) ).    (3.29)

3.5.4 Noncentral F

Definition 3.6 Let Y ∼ χ²_m(δ), W ∼ χ²_n, and let Y and W be independent random variables. Then

(Y/m)/(W/n) ∼ F_{m,n}(δ)    (3.30)

has a noncentral F distribution, where δ is the noncentrality parameter.

The noncentral F distribution is used in tests of mean vectors, where the variance-covariance matrix is unknown and must be estimated.

Chapter 4

Asymptotic Theory

4.1 Convergence Of Random Variables

4.1.1 Limits And Orders Of Sequences

Definition 4.1 A sequence of real numbers a1, a2, ..., an, ... is said to have a limit of α if for every δ > 0, there exists a positive real number N such that for all n > N, |an − α| < δ. This is denoted as

lim_{n→∞} an = α.

Definition 4.2 A sequence of real numbers {an} is said to be of at most order n^k, and we write {an} is O(n^k), if

lim_{n→∞} (1/n^k) an = c,

where c is any real constant.

Example 4.1 Let an = 3 + 1/n and bn = 4 − n². Then {an} is O(1) = O(n⁰), since

lim_{n→∞} (1/n⁰) an = 3,

and {bn} is O(n²), since

lim_{n→∞} (1/n²) bn = −1.

Definition 4.3 A sequence of real numbers {an} is said to be of order smaller than n^k, and we write {an} is o(n^k), if

lim_{n→∞} (1/n^k) an = 0.

Example 4.2 Let an = 1/n. Then {an} is o(1), since

lim_{n→∞} (1/n⁰) an = 0.

4.1.2 Convergence In Probability

Definition 4.4 A sequence of random variables y1, y2, ..., yn, ..., with distribution functions F1(·), F2(·), ..., Fn(·), ..., is said to converge weakly in probability to some constant c if

lim_{n→∞} Pr[ |yn − c| > ε ] = 0    (4.1)

for every real number ε > 0.

Weak convergence in probability is denoted by

plim_{n→∞} yn = c,    (4.2)

or sometimes

yn →p c.    (4.3)

This definition is equivalent to saying that we have a sequence of tail probabilities (of being greater than c + ε or less than c − ε), and that the tail probabilities approach 0 as n → ∞, regardless of how small ε is chosen. Equivalently, the probability mass of the distribution of yn is collapsing about the point c.
Definition 4.5 A sequence of random variables y1, y2, ..., yn, ... is said to converge strongly in probability to some constant c if

lim_{N→∞} Pr[ sup_{n>N} |yn − c| > ε ] = 0,    (4.4)

for any real ε > 0.

Strong convergence is also called almost sure convergence and is denoted

yn →a.s. c.    (4.5)

Notice that if a sequence of random variables converges strongly in probability, it converges weakly in probability. The difference between the two is that almost sure convergence involves an element of uniformity that weak convergence does not. A sequence that is weakly convergent can have Pr[ |yn − c| > ε ] wiggle-waggle above and below the constant δ used in the limit and then settle down to subsequently be smaller and meet the condition. For strong convergence, once the probability falls below δ for a particular N in the sequence, it will subsequently be smaller for all larger N.

Definition 4.6 A sequence of random variables y1, y2, ..., yn, ... is said to converge in quadratic mean if

lim_{n→∞} E[yn] = c  and  lim_{n→∞} Var[yn] = 0.

By Chebyshev's inequality, convergence in quadratic mean implies weak convergence in probability. For a random variable x with mean µ and variance σ², Chebyshev's inequality states Pr( |x − µ| ≥ kσ ) ≤ 1/k². Let σn² denote the variance of yn; then we can write the condition for the present case as Pr( |yn − E[yn]| ≥ kσn ) ≤ 1/k². Since σn → 0 and E[yn] → c, the probability will be less than 1/k² for sufficiently large n for any choice of k. But this is just weak convergence in probability to c.

4.1.3 Orders In Probability

Definition 4.7 Let y1, y2, ..., yn, ... be a sequence of random variables. This sequence is said to be bounded in probability if for any 1 > δ > 0, there exist a ∆ < ∞ and some N sufficiently large such that

Pr( |yn| > ∆ ) < δ,

for all n > N.

These conditions require that the tail behavior of the distributions of the sequence not be pathological. Specifically, the tail mass of the distributions cannot be drifting away from zero as we move out in the sequence.

Definition 4.8 The sequence of random variables {yn} is said to be at most of order n^λ in probability, denoted Op(n^λ), if n^{−λ} yn is bounded in probability.

Example 4.3 Suppose z ∼ N(0, 1) and yn = 3 + n·z. Then n^{−1} yn = 3/n + z is a bounded random variable, since the first term is asymptotically negligible, and we see that yn = Op(n).

Definition 4.9 The sequence of random variables {yn} is said to be of order in probability smaller than n^λ, denoted op(n^λ), if n^{−λ} yn →p 0.

Example 4.4 Convergence in probability can be represented in terms of order in probability. Suppose that yn →p c, or equivalently yn − c →p 0; then n⁰(yn − c) →p 0 and yn − c = op(1).

4.1.4 Convergence In Distribution

Definition 4.10 A sequence of random variables y1, y2, ..., yn, ..., with cumulative distribution functions F1(·), F2(·), ..., Fn(·), ..., is said to converge in distribution to a random variable y with cumulative distribution function F(y) if

lim_{n→∞} Fn(·) = F(·),    (4.6)

for every point of continuity of F(·). The distribution F(·) is said to be the limiting distribution of this sequence of random variables.

For notational convenience, we often write yn →d F(·) if a sequence of random variables converges in distribution to F(·). Note that the moments of elements of the sequence do not necessarily converge to the moments of the limiting distribution.
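Weak convergence in probability, as in Definition 4.4, can be visualized by tracking the tail probability Pr(|yn − c| > ε) as n grows. A small sketch under our own choices (yn taken to be the sample mean of uniform draws, with plim 0.5):

    import numpy as np

    rng = np.random.default_rng(5)
    eps, mu = 0.05, 0.5        # yn = sample mean of uniform[0, 1] draws, plim = 0.5

    for n in (10, 100, 1000):
        xbar = rng.uniform(size=(10_000, n)).mean(axis=1)
        # Empirical Pr(|xbar - mu| > eps) should shrink toward zero as n grows.
        print(n, (np.abs(xbar - mu) > eps).mean())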
4.1.5 Some Useful Propositions

In the following propositions, let xn and yn be sequences of random vectors.

Proposition 4.1 If xn − yn converges in probability to zero, and yn has a limiting distribution, then xn has a limiting distribution, which is the same.

Proposition 4.2 If yn has a limiting distribution and plim_{n→∞} xn = 0, then for zn = yn'xn, plim_{n→∞} zn = 0.

Proposition 4.3 Suppose that yn converges in distribution to a random variable y, and plim_{n→∞} xn = c. Then xn'yn converges in distribution to c'y.

Proposition 4.4 If g(·) is a continuous function, and if xn − yn converges in probability to zero, then g(xn) − g(yn) converges in probability to zero.

Proposition 4.5 If g(·) is a continuous function, and if xn converges in probability to a constant c, then zn = g(xn) converges in distribution to the constant g(c).

Proposition 4.6 If g(·) is a continuous function, and if xn converges in distribution to a random variable x, then zn = g(xn) converges in distribution to a random variable g(x).

4.2 Estimation Theory

4.2.1 Properties Of Estimators

Definition 4.11 An estimator θ̂n of the p × 1 parameter vector θ is a function of the sample observations x1, x2, ..., xn.

It follows that θ̂1, θ̂2, ..., θ̂n form a sequence of random variables.

Definition 4.12 The estimator θ̂n is said to be unbiased if E[θ̂n] = θ, for all n.

Definition 4.13 The estimator θ̂n is said to be asymptotically unbiased if lim_{n→∞} E[θ̂n] = θ.

Note that an estimator can be biased in finite samples, but asymptotically unbiased.

Definition 4.14 The estimator θ̂n is said to be consistent if plim_{n→∞} θ̂n = θ.

Consistency neither implies nor is implied by asymptotic unbiasedness, as demonstrated by the following examples.

Example 4.5 Let

θ̃n = θ with probability 1 − 1/n, and θ̃n = θ + nc with probability 1/n.

We have E[θ̃n] = θ + c, so θ̃n is a biased estimator, and lim_{n→∞} E[θ̃n] = θ + c, so θ̃n is asymptotically biased as well. However, lim_{n→∞} Pr( |θ̃n − θ| > ε ) = 0, so θ̃n is a consistent estimator.

Example 4.6 Suppose xi ∼ i.i.d. N(µ, σ²) for i = 1, 2, ..., n, ..., and let the estimator of µ be µ̃n = xn, the n-th observation alone. Now E[µ̃n] = µ, so the estimator is unbiased, but

Pr( |µ̃n − µ| > 1.96σ ) = 0.05,

so the probability mass is not collapsing about the target point µ and the estimator is not consistent.

4.2.2 Laws Of Large Numbers And Central Limit Theorems

Most of the large-sample properties of the estimators considered in the sequel derive from the fact that the estimators involve sample averages, and the asymptotic behavior of averages is well studied. In addition to the central limit theorems presented in the previous two chapters, we have the following two laws of large numbers:

Theorem 4.1 If x1, x2, ..., xn is a simple random sample, that is, the xi's are i.i.d., and E[xi] = µ and Var(xi) = σ², then by Chebyshev's inequality,

plim_{n→∞} x̄n = µ.    (4.7)

Theorem 4.2 (Khinchine) Suppose that x1, x2, ..., xn are i.i.d. random variables, such that for all i = 1, ..., n, E[xi] = µ. Then

plim_{n→∞} x̄n = µ.    (4.8)

Both of these results apply element-by-element to vectors of estimators. For the sake of completeness, we repeat the following scalar central limit theorem.

Theorem 4.3 (Lindeberg-Levy) Suppose that x1, x2, ..., xn are i.i.d. random variables, such that for all i = 1, ..., n, E[xi] = µ and Var(xi) = σ². Then

√n(x̄n − µ) →d N(0, σ²),    (4.9)

that is, the limiting density of √n(x̄n − µ) is the N(0, σ²) density.

This result is easily generalized to obtain the multivariate version given in Theorem 3.3.
Theorem 4.4 (Multivariate CLT) Suppose that the m × 1 random vectors x1, x2, ..., xn are (i) jointly i.i.d., (ii) E[xi] = µ, and (iii) Cov(xi) = Σ. Then

√n(x̄n − µ) →d N(0, Σ).    (4.10)

4.2.3 CUAN And Efficiency

Definition 4.15 An estimator is said to be consistently uniformly asymptotically normal (CUAN) if it is consistent, if √n(θ̂n − θ) converges in distribution to N(0, Ψ), and if the convergence is uniform over some compact subset of the parameter space.

Suppose that √n(θ̂n − θ) converges in distribution to N(0, Ψ). Let θ̃n be an alternative estimator such that √n(θ̃n − θ) converges in distribution to N(0, Ω).

Definition 4.16 If θ̂n is CUAN with asymptotic covariance Ψ and θ̃n is CUAN with asymptotic covariance Ω, then θ̂n is asymptotically efficient relative to θ̃n if Ω − Ψ is a positive semidefinite matrix.

Among other properties, asymptotic relative efficiency implies that the diagonal elements of Ψ are no larger than those of Ω, so the asymptotic variances of the elements θ̂n,i are no larger than those of θ̃n,i. A similar result applies for the asymptotic variance of any linear combination.

Definition 4.17 A CUAN estimator θ̂n is said to be asymptotically efficient if it is asymptotically efficient relative to any other CUAN estimator.

4.3 Asymptotic Inference

4.3.1 Normal Ratios

Now, under the conditions of the central limit theorem, √n(x̄n − µ) →d N(0, σ²), so

(x̄n − µ)/√(σ²/n) →d N(0, 1).    (4.11)

Suppose that σ̂² is a consistent estimator of σ². Then we also have

(x̄n − µ)/√(σ̂²/n) = √(σ²/σ̂²) · (x̄n − µ)/√(σ²/n) →d N(0, 1),    (4.12)

since the term under the square root converges in probability to one and the remainder converges in distribution to N(0, 1).
Most typically, such ratios will be used for inference in testing a hypothesis. Now, for H0: µ = µ0, we have

(x̄n − µ0)/√(σ̂²/n) →d N(0, 1),    (4.13)

while under H1: µ = µ1 ≠ µ0, we find that

(x̄n − µ0)/√(σ̂²/n) = √n(x̄n − µ1)/√(σ̂²) + √n(µ1 − µ0)/√(σ̂²)
                   = N(0, 1) + Op(√n).    (4.14)

Thus, extreme values of the ratio can be taken as rare events under the null or typical events under the alternative.
Such ratios are of interest in estimation and inference with regard to more general parameters. Suppose that θ is a p × 1 parameter vector and √n(θ̂ − θ) →d N(0, Ψ). Then, if θi is the parameter of particular interest, we consider

(θ̂i − θi)/√(ψii/n) →d N(0, 1),    (4.15)

and, duplicating the arguments for the sample mean,

(θ̂i − θi)/√(ψ̂ii/n) →d N(0, 1),    (4.16)

where Ψ̂ is a consistent estimator of Ψ. This ratio will have a similar behavior under a null and an alternative hypothesis with regard to θi.
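The behavior of such a ratio under the null can be simulated. In the sketch below (our own setup: exponential data, so the exact finite-sample distribution is not normal but the asymptotic result applies), the ratio (x̄n − µ0)/√(σ̂²/n) is computed under a true null and its rejection rate at the two-sided 5% normal critical value is reported.

    import numpy as np

    rng = np.random.default_rng(6)
    mu0, n, n_reps = 1.0, 500, 20_000

    x = rng.exponential(mu0, size=(n_reps, n))       # true mean equals mu0
    xbar = x.mean(axis=1)
    s2 = x.var(axis=1, ddof=1)                       # consistent estimator of the variance
    ratio = (xbar - mu0) / np.sqrt(s2 / n)           # the ratio in (4.12)

    print((np.abs(ratio) > 1.96).mean())             # should be close to 0.05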
4.3.2 Asymptotic Chi-Square

Suppose that

√n(θ̂ − θ) →d N(0, Ψ),

where Ψ is a nonsingular p × p matrix. Then, from the previous chapter, we have

√n(θ̂ − θ)'Ψ⁻¹√n(θ̂ − θ) = n(θ̂ − θ)'Ψ⁻¹(θ̂ − θ) →d χ²_p    (4.17)

and

n(θ̂ − θ)'Ψ̂⁻¹(θ̂ − θ) →d χ²_p    (4.18)

for Ψ̂ a consistent estimator of Ψ.
This result can be used to conduct inference by testing the entire parameter vector. If H0: θ = θ⁰, then

n(θ̂ − θ⁰)'Ψ̂⁻¹(θ̂ − θ⁰) →d χ²_p,    (4.19)

and large positive values are rare events. Under H1: θ = θ¹ ≠ θ⁰, we can show (later)

n(θ̂ − θ⁰)'Ψ̂⁻¹(θ̂ − θ⁰) = n((θ̂ − θ¹) + (θ¹ − θ⁰))'Ψ̂⁻¹((θ̂ − θ¹) + (θ¹ − θ⁰))
                        = n(θ̂ − θ¹)'Ψ̂⁻¹(θ̂ − θ¹) + 2n(θ̂ − θ¹)'Ψ̂⁻¹(θ¹ − θ⁰) + n(θ¹ − θ⁰)'Ψ̂⁻¹(θ¹ − θ⁰)
                        = χ²_p + Op(√n) + Op(n).    (4.20)

Thus, if we obtain a large value of the statistic, we may take it as evidence that the null hypothesis is incorrect.
This result can also be applied to any subvector of θ. Let

θ = (θ1', θ2')',

where θ1 is a p1 × 1 vector. Then

√n(θ̂1 − θ1) →d N(0, Ψ11),    (4.21)

where Ψ11 is the upper left-hand p1 × p1 submatrix of Ψ, and

n(θ̂1 − θ1)'Ψ̂11⁻¹(θ̂1 − θ1) →d χ²_{p1}.    (4.22)

4.3.3 Tests Of General Restrictions

We can use similar results to test general nonlinear restrictions. Suppose that r(·) is a q × 1 continuously differentiable function, and

√n(θ̂ − θ) →d N(0, Ψ).

By the intermediate value theorem we can obtain the exact Taylor's series representation

r(θ̂) = r(θ) + [∂r(θ*)/∂θ'](θ̂ − θ),

or equivalently

√n( r(θ̂) − r(θ) ) = [∂r(θ*)/∂θ'] √n(θ̂ − θ) = R(θ*)√n(θ̂ − θ),

where R(θ*) = ∂r(θ*)/∂θ' and θ* lies between θ̂ and θ. Now θ̂ →p θ, so θ* →p θ and R(θ*) →p R(θ). Thus, we have

√n[ r(θ̂) − r(θ) ] →d N(0, R(θ)ΨR'(θ)).    (4.23)

Thus, under H0: r(θ) = 0, assuming R(θ)ΨR'(θ) is nonsingular, we have

n·r(θ̂)'[R(θ)ΨR'(θ)]⁻¹r(θ̂) →d χ²_q,    (4.24)

where q is the length of r(·). In practice, we substitute the consistent estimates R(θ̂) for R(θ) and Ψ̂ for Ψ to obtain, following the arguments given above,

n·r(θ̂)'[R(θ̂)Ψ̂R'(θ̂)]⁻¹r(θ̂) →d χ²_q.    (4.25)

The behavior under the alternative hypothesis will be Op(n), as above.

Chapter 5

Maximum Likelihood Methods

5.1 Maximum Likelihood Estimation (MLE)

5.1.1 Motivation

Suppose we have a model for the random variable yi, for i = 1, 2, ..., n, with unknown (p × 1) parameter vector θ. In many cases, the model will imply a distribution f(yi, θ) for each realization of the variable yi.
A basic premise of statistical inference is to avoid unlikely or rare models, for example, in hypothesis testing. If we have a realization of a statistic that exceeds the critical value, then it is a rare event under the null hypothesis. Under the alternative hypothesis, however, such a realization is much more likely to occur and we reject the null in favor of the alternative. Thus, in choosing between the null and alternative, we select the model that makes the realization of the statistic more likely to have occurred.
Carrying this idea over to estimation, we select values of θ such that the corresponding values of f(yi, θ) are not unlikely. After all, we do not want a model that disagrees strongly with the data. Maximum likelihood estimation is merely a formalization of this notion that the model chosen should not be unlikely. Specifically, we choose the values of the parameters that make the realized data most likely to have occurred. This approach does, however, require that the model be specified in enough detail to imply a distribution for the variable of interest.

5.1.2 The Likelihood Function

Suppose that the random variables y1, y2, ..., yn are i.i.d. Then, the joint density function for n realizations is

f(y1, y2, ..., yn | θ) = f(y1 | θ) · f(y2 | θ) · ... · f(yn | θ) = Π_{i=1}^n f(yi | θ).    (5.1)

Given values of the parameter vector θ, this function allows us to assign local probability measures for various choices of the random variables y1, y2, ..., yn. This is the function which must be integrated to make probability statements concerning the joint outcomes of y1, y2, ..., yn.
Given a set of realized values of the random variables, we use this same function to establish the probability measure associated with various choices of the parameter vector θ.
Definition 5.1 The likelihood function of the parameters, for a particular sample y1, y2, ..., yn, is the joint density function considered as a function of θ given the yi's. That is,

L(θ | y1, y2, ..., yn) = Π_{i=1}^n f(yi | θ).    (5.2)

5.1.3 Maximum Likelihood Estimation

For a particular choice of the parameter vector θ, the likelihood function gives a probability measure for the realizations that occurred. Consistent with the approach used in hypothesis testing, and using this function as the metric, we choose θ to make the realizations most likely to have occurred.

Definition 5.2 The maximum likelihood estimator θ̂ of θ is the estimator obtained by maximizing L(θ | y1, y2, ..., yn) with respect to θ. That is, θ̂ solves

max_θ L(θ | y1, y2, ..., yn),    (5.3)

and θ̂ is called the MLE of θ.

Equivalently, since log(·) is a strictly monotonic transformation, we may find the MLE of θ by solving

max_θ ℒ(θ | y1, y2, ..., yn),    (5.4)

where ℒ(θ | y1, y2, ..., yn) = log L(θ | y1, y2, ..., yn) is denoted the log-likelihood function. In practice, we obtain θ̂ by solving the first-order conditions (FOC)

∂ℒ(θ̂; y1, y2, ..., yn)/∂θ = Σ_{i=1}^n ∂ log f(yi; θ̂)/∂θ = 0.

The motivation for using the log-likelihood function is apparent, since the summation form will result, after division by n, in estimators that are approximately averages, about which we know a lot. This advantage is particularly clear in the following example.

Example 5.1 Suppose that yi ∼ i.i.d. N(µ, σ²), for i = 1, 2, ..., n. Then,

f(yi | µ, σ²) = (1/√(2πσ²)) e^{−(yi − µ)²/(2σ²)},

for i = 1, 2, ..., n. Using the likelihood function (5.2), we have

L(µ, σ² | y1, y2, ..., yn) = Π_{i=1}^n (1/√(2πσ²)) e^{−(yi − µ)²/(2σ²)}.    (5.5)

Next, we take the logarithm of (5.5), which gives us

log L(µ, σ² | y1, y2, ..., yn) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (yi − µ)².    (5.6)

We then maximize (5.6) with respect to both µ and σ². That is, we solve the following first-order conditions:

(A) ∂ log L(·)/∂µ = (1/σ²) Σ_{i=1}^n (yi − µ) = 0;
(B) ∂ log L(·)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (yi − µ)² = 0.

By solving (A), we find that µ̂ = (1/n) Σ_{i=1}^n yi = ȳn. Solving (B) gives us σ̂² = (1/n) Σ_{i=1}^n (yi − µ̂)².
Note that σ̂² = (1/n) Σ_{i=1}^n (yi − µ̂)² ≠ s², where s² = (1/(n − 1)) Σ_{i=1}^n (yi − µ̂)². s² is the familiar unbiased estimator for σ², and σ̂² is a biased estimator.
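The closed-form solutions in Example 5.1 can be reproduced by maximizing the log-likelihood numerically, which is how MLEs are typically computed when no closed form exists. The sketch below uses scipy.optimize; the reparameterization in terms of log σ (to keep the variance positive) and all numerical choices are our own devices, not part of the text.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(7)
    y = rng.normal(2.0, 1.5, size=200)

    def neg_loglik(params):
        mu, log_sigma = params
        sigma2 = np.exp(2 * log_sigma)
        # Negative of the log-likelihood in (5.6).
        return 0.5 * len(y) * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / (2 * sigma2)

    res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
    mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])

    print(mu_hat, y.mean())                          # closed-form MLE: the sample mean
    print(sigma2_hat, ((y - y.mean()) ** 2).mean())  # closed-form MLE: (1/n) sum (y - ybar)^2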
5.2 Asymptotic Behavior of MLEs

5.2.1 Assumptions

For the results we will derive in the following sections, we need to make five assumptions:

1. The yi's are i.i.d. random variables with density function f(yi, θ) for i = 1, 2, ..., n;
2. log f(yi, θ), and hence f(yi, θ), possess derivatives with respect to θ up to the third order for θ ∈ Θ;
3. The range of yi is independent of θ, hence differentiation under the integral is possible;
4. The parameter vector θ is globally identified by the density function;
5. ∂³ log f(yi, θ)/∂θi∂θj∂θk is bounded in absolute value by some function Hijk(y) for all y and θ ∈ Θ, which, in turn, has a finite expectation for all θ ∈ Θ.

The first assumption is fundamental and the basis of the estimator. If it is not satisfied, then we are misspecifying the model and there is little hope for obtaining correct inferences, at least in finite samples. The second assumption is a regularity condition that is usually satisfied and easily verified. The third assumption is also easily verified and is guaranteed to be satisfied in models where the dependent variable has smooth and infinite support. The fourth assumption must be verified, which is easier in some cases than others. The last assumption is crucial; it bears a cost and really should be verified before MLE is undertaken, but it is usually ignored.

5.2.2 Some Preliminaries

Now, we know that

∫ L(θ⁰ | y) dy = ∫ f(y | θ⁰) dy = 1    (5.7)

for any value of the true parameter vector θ⁰. Therefore,

0 = ∂/∂θ ∫ L(θ⁰ | y) dy = ∫ ∂L(θ⁰ | y)/∂θ dy = ∫ ∂f(y | θ⁰)/∂θ dy = ∫ [∂ log f(y | θ⁰)/∂θ] f(y | θ⁰) dy,    (5.8)

and

0 = E[ ∂ log f(y | θ⁰)/∂θ ] = E[ ∂ log L(θ⁰ | y)/∂θ ]    (5.9)

for any value of the true parameter vector θ⁰. Differentiating (5.8) again yields

0 = ∫ [ ∂² log f(y | θ⁰)/∂θ∂θ' · f(y | θ⁰) + ∂ log f(y | θ⁰)/∂θ · ∂f(y | θ⁰)/∂θ' ] dy.    (5.10)

Since

∂f(y | θ⁰)/∂θ' = [∂ log f(y | θ⁰)/∂θ'] f(y | θ⁰),    (5.11)

we can rewrite (5.10) as

0 = E[ ∂² log f(y | θ⁰)/∂θ∂θ' ] + E[ ∂ log f(y | θ⁰)/∂θ · ∂ log f(y | θ⁰)/∂θ' ],    (5.12)

or, in terms of the likelihood function,

ϑ(θ⁰) = E[ ∂ log L(θ⁰ | y)/∂θ · ∂ log L(θ⁰ | y)/∂θ' ] = −E[ ∂² log L(θ⁰ | y)/∂θ∂θ' ].    (5.13)

The matrix ϑ(θ⁰) is called the information matrix, and the relationship given in (5.13) is the information matrix equality. Finally, we note that

E[ ∂ log L(θ⁰ | y)/∂θ · ∂ log L(θ⁰ | y)/∂θ' ] = Σ_{i=1}^n E[ ∂ log f(yi | θ⁰)/∂θ · ∂ log f(yi | θ⁰)/∂θ' ],    (5.14)

since the covariances between different observations are zero.

5.2.3 Asymptotic Properties

Consistent Root Exists

Consider the case where p = 1. Then, expanding in a Taylor's series and using the intermediate value theorem on the quadratic term yields

(1/n) ∂ log L(θ̂ | y)/∂θ = (1/n) Σ_{i=1}^n ∂ log f(yi | θ⁰)/∂θ
                        + (1/n) Σ_{i=1}^n ∂² log f(yi | θ⁰)/∂θ² · (θ̂ − θ⁰)
                        + (1/2)(1/n) Σ_{i=1}^n ∂³ log f(yi | θ*)/∂θ³ · (θ̂ − θ⁰)²,    (5.15)

where θ* lies between θ̂ and θ⁰. Now, by assumption 5, we have

(1/n) Σ_{i=1}^n ∂³ log f(yi | θ*)/∂θ³ = k·(1/n) Σ_{i=1}^n H(yi),    (5.16)

for some |k| < 1. So,

(1/n) ∂ log L(θ̂ | y)/∂θ = aδ² + bδ + c,    (5.17)

where δ = θ̂ − θ⁰,

a = (k/2)(1/n) Σ_{i=1}^n H(yi),
b = (1/n) Σ_{i=1}^n ∂² log f(yi | θ⁰)/∂θ², and
c = (1/n) Σ_{i=1}^n ∂ log f(yi | θ⁰)/∂θ.

Note that |a| ≤ (1/2)(1/n) Σ_{i=1}^n H(yi) = (1/2) E[H(yi)] + op(1) = Op(1), plim_{n→∞} c = 0, and plim_{n→∞} b = −ϑ(θ⁰).
Now, since ∂ log L(θ̂ | y)/∂θ = 0, we have aδ² + bδ + c = 0. There are two possibilities. If a ≠ 0 with probability 1, which will occur when the FOC are nonlinear, then
b b Global Maximum Is Consistent In the event of multiple roots, we are left with the problem of selecting between them. By assumption 4, the parameter θ is globally identified by the density function. Formally, this means that f( y, θ )=f( y, θ0 ), (5.20) for all y implies that θ = θ0.Now, f( y,θ ) f( y, θ ) E = f( y,θ0 )=1. (5.21) f( y, θ0 ) f( y,θ0 ) ∙ ¸ Z Thus, by Jensen’s Inequality, we have f( y,θ ) f( y,θ ) E log < log E =0, (5.22) f( y, θ0 ) f( y,θ0 ) ∙ ¸ ∙ ¸ unless f( y,θ )=f( y, θ0 ) for all y,orθ = θ0.Therefore,E [log f( y, θ )] achieves a maximum if and only if θ = θ0. However, we are solving n 1 p max log f( yi, θ ) E [log f( yi, θ )] , (5.23) n −→ i=1 X 42 CHAPTER 5. MAXIMUM LIKELIHOOD METHODS and if we choose an inconsistent root, we will not obtain a global maximum. Thus, asymptotically, the global maximum is a consistent root. This choice of the global root has added appeal since it is in fact the MLE among the possible alternatives and hence the choice that makes the realized data most likely to have occured. There are complications in finite samples since the value of the likelihood function for alternative roots may cross over as the sample size increases. That is the global maximum in small samples may not be the global maximum in larger samples. An added problem is to identify all the alternative roots so we can choose the global maximum. Sometimes a solution is available in a simple consistent estimator which may be used to start the nonlinear MLE optimization. Asymptotic Normality For p =1,wehaveaδ2 + bδ + c =0,so c δ = θ θ0 = − (5.24) − aδ + b and b √nc √n( θ θ0 )= − − a( θ θ0 )+b − 1 b = − √nc. a( θb θ0 )+b − 0 0 Now since a = Op(1) and θ θ = op(1), then a( θ θ )=op(1) and − b − 0 p 0 ab( θ θ )+b ϑ( θb ). (5.25) − −→ − And by the CLT we have b n 0 1 ∂ log f( yi θ ) d √nc = | N(0,ϑ( θ0 )). (5.26) √n ∂θ −→ i=1 X Substituing these two results in (5.25), we find 0 d 0 1 √n( θ θ ) N(0,ϑ( θ )− ). − −→ In general, for p>1, we canb apply the same scalar proof to show √n( λ0θ 0 d 0 1 − λ0θ ) N(0, λ0ϑ( θ )− λ) for any vector λ, which means −→ b 0 d 1 0 √n( θ θ ) N( 0, ϑ− ( θ )), (5.27) − −→ if θ is the global maximum.b b 5.2. ASYMPTOTIC BEHAVIOR OF MLES 43 Cram´er-Rao Lower Bound 1 0 In addition to being the covariance matrix of the MLE, ϑ− ( θ )defines a lower bound for covariance matrices with certain desirable properties. Let θ( y )be any unbiased estimator, then e E θ( y )= θ( y )f( y; θ )dy = θ, (5.28) Z for any underlying θ = θe0.Differentiatinge both sides of this relationship with respect to θ yields ∂ E θ(, y ) Ip = ∂θ0 e ∂f( y; θ0 ) = θ( y ) dy ∂θ0 Z ∂ log f( y; θ0 ) = θe( y ) f( y; θ0 )dy ∂θ0 Z ∂ log f( y; θ0 ) = E eθ( y ) ∂θ ∙ 0 ¸ 0 e 0 ∂ log f( y; θ ) = E ( θ( y ) θ ) . (5.29) − ∂θ0 ∙ ¸ Next, we let e 0 0 0 C( θ )=E ( θ( y ) θ )( θ( y ) θ )0 . (5.30) − − h i bethecovariancematrixofθ( ye), then, e θ( y ) C( θ0 ) I Cov e = p , (5.31) ∂ log L I ϑ( θ0 ) µ ∂θ ¶ µ p ¶ e where Ip is a p p identity matrix, and (5.31) as a covariance matrix is positive semidefinite. × Now, for any (p 1) vector a,wehave × 0 0 1 C( θ ) Ip a0 0 0 1 a0 a0ϑ( θ )− 0 0 1 = a0[C(θ ) ϑ( θ )− ]a 0. Ip ϑ( θ ), a0ϑ( θ )− − ≥ µ ¶µ ¶ (5.32) ¡ ¢ 0 1 Thus, any unbiased estimator θ( y ) has a covariance matrix that exceeds ϑ( θ )− by a positive semidefinite matrix. And if the MLE estimator is unbiased, it is efficient within the class of unbiasede estimators. Likewise, any CUAN estima- 0 1 tor will have a covariance exceeding ϑ( θ )− . 
Since the asymptotic covariance of the MLE is, in fact, ϑ(θ⁰)⁻¹, it is (asymptotically) efficient. ϑ(θ⁰)⁻¹ is called the Cramér-Rao lower bound.

5.3 Maximum Likelihood Inference

5.3.1 Likelihood Ratio Test

Suppose we wish to test H0: θ = θ⁰ against H1: θ ≠ θ⁰. Then, we define

Lu = max_θ L(θ | y) = L(θ̂ | y)    (5.33)

and

Lr = L(θ⁰ | y),    (5.34)

where Lu is the unrestricted likelihood and Lr is the restricted likelihood. We then form the likelihood ratio

λ = Lr/Lu.    (5.35)

Note that the restricted likelihood can be no larger than the unrestricted likelihood, which maximizes the function. As with estimation, it is more convenient to work with the logs of the likelihood functions. It will be shown below that, under H0,

LR = −2 log λ = −2 log(Lr/Lu) = 2[ ℒ(θ̂ | y) − ℒ(θ⁰ | y) ] →d χ²_p,    (5.36)

where θ̂ is the unrestricted MLE and θ⁰ is the restricted value. If H1 applies, then LR = Op(n). Large values of this statistic indicate that the restrictions make the observed values much less likely than the unrestricted model, and we prefer the unrestricted model and reject the restrictions.
In general, for H0: r(θ) = 0 and H1: r(θ) ≠ 0, we have

Lu = max_θ L(θ | y) = L(θ̂ | y)    (5.37)

and

Lr = max_θ L(θ | y) subject to r(θ) = 0, so Lr = L(θ̃ | y).    (5.38)

Under H0,

LR = 2[ ℒ(θ̂ | y) − ℒ(θ̃ | y) ] →d χ²_q,    (5.39)

where q is the length of r(·).
Note that in the general case, the likelihood ratio test requires calculation of both the restricted and the unrestricted MLE.

5.3.2 Wald Test

The asymptotic normality of the MLE may be used to obtain a test based only on the unrestricted estimates. Now, under H0: θ = θ⁰, we have

√n(θ̂ − θ⁰) →d N(0, ϑ⁻¹(θ⁰)).    (5.40)

Thus, using the results on the asymptotic behavior of quadratic forms from the previous chapter, we have

W = n(θ̂ − θ⁰)'ϑ(θ⁰)(θ̂ − θ⁰) →d χ²_p,    (5.41)

which is the Wald test. As we discussed for quadratic tests, in general, under H1: θ ≠ θ⁰, we would have W = Op(n).
In practice, since

(1/n) ∂²ℒ(θ̂ | y)/∂θ∂θ' = (1/n) Σ_{i=1}^n ∂² log f(yi | θ̂)/∂θ∂θ' →p −ϑ(θ⁰),    (5.42)

we use

W* = −(θ̂ − θ⁰)'[ ∂²ℒ(θ̂ | y)/∂θ∂θ' ](θ̂ − θ⁰) →d χ²_p.    (5.43)

Aside from having the same asymptotic distribution, the likelihood ratio and Wald tests are asymptotically equivalent in the sense that

plim_{n→∞} (LR − W*) = 0.    (5.44)

This is shown by expanding ℒ(θ⁰ | y) in a Taylor's series about θ̂. That is,

ℒ(θ⁰) = ℒ(θ̂) + [∂ℒ(θ̂)/∂θ'](θ⁰ − θ̂)
      + (1/2)(θ⁰ − θ̂)'[∂²ℒ(θ̂)/∂θ∂θ'](θ⁰ − θ̂)
      + (1/6) Σ_i Σ_j Σ_k [∂³ℒ(θ*)/∂θi∂θj∂θk](θi⁰ − θ̂i)(θj⁰ − θ̂j)(θk⁰ − θ̂k),    (5.45)

where the third line applies the intermediate value theorem for θ* between θ̂ and θ⁰. Now ∂ℒ(θ̂)/∂θ = 0, and the third line can be shown to be Op(1/√n) under assumption 5, whereupon we have

ℒ(θ⁰) − ℒ(θ̂) = (1/2)(θ⁰ − θ̂)'[∂²ℒ(θ̂)/∂θ∂θ'](θ⁰ − θ̂) + Op(1/√n)    (5.46)

and

LR = W* + Op(1/√n).    (5.47)

In general, we may test H0: r(θ) = 0 with

W* = −r(θ̂)'[ R(θ̂) (∂²ℒ(θ̂)/∂θ∂θ')⁻¹ R'(θ̂) ]⁻¹ r(θ̂) →d χ²_q.    (5.48)
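For the normal-mean example, the likelihood ratio and Wald statistics can be computed directly and their closeness under the null observed. The sketch below is an illustration under our own choices (normal data, H0: µ = µ0 with the variance treated as an unknown nuisance parameter); here LR = n log(σ̃²/σ̂²) follows from the concentrated log-likelihood (5.6), and W uses the estimated information n/σ̂².

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(8)
    mu0 = 0.0
    y = rng.normal(mu0, 2.0, size=400)        # data generated under the null
    n = len(y)

    sig2_u = ((y - y.mean()) ** 2).mean()     # unrestricted MLE of the variance
    sig2_r = ((y - mu0) ** 2).mean()          # restricted MLE of the variance (mu = mu0)

    LR = n * np.log(sig2_r / sig2_u)          # 2[l(unrestricted) - l(restricted)]
    W = (y.mean() - mu0) ** 2 / (sig2_u / n)  # Wald statistic with estimated information

    print(LR, W, chi2.ppf(0.95, df=1))        # both should usually fall below the critical value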
5.4 Lagrange Multiplier

Alternatively, but in the same fashion, we can expand ℒ(θ̂) about θ⁰ to obtain

ℒ(θ̂) = ℒ(θ⁰) + [∂ℒ(θ⁰)/∂θ'](θ̂ − θ⁰) + (1/2)(θ̂ − θ⁰)'[∂²ℒ(θ⁰)/∂θ∂θ'](θ̂ − θ⁰) + Op(1/√n).    (5.49)

Likewise, we can also expand (1/n) ∂ℒ(θ̂)/∂θ about θ⁰, which yields

0 = (1/n) ∂ℒ(θ̂)/∂θ = (1/n) ∂ℒ(θ⁰)/∂θ + (1/n)[∂²ℒ(θ⁰)/∂θ∂θ'](θ̂ − θ⁰) + Op(1/n),    (5.50)

or

(θ̂ − θ⁰) = −[∂²ℒ(θ⁰)/∂θ∂θ']⁻¹ [∂ℒ(θ⁰)/∂θ] + Op(1/n).    (5.51)

Substituting (5.51) into (5.49) gives us

ℒ(θ̂) − ℒ(θ⁰) = −(1/2)[∂ℒ(θ⁰)/∂θ'][∂²ℒ(θ⁰)/∂θ∂θ']⁻¹[∂ℒ(θ⁰)/∂θ] + Op(1/√n),    (5.52)

and LR = LM + Op(1/√n), where

LM = −[∂ℒ(θ⁰)/∂θ'][∂²ℒ(θ⁰)/∂θ∂θ']⁻¹[∂ℒ(θ⁰)/∂θ]    (5.53)

is the Lagrange multiplier test. Thus, under H0: θ = θ⁰,

plim_{n→∞} (LR − LM) = 0    (5.54)

and

LM →d χ²_p.    (5.55)

Note that the Lagrange multiplier test only requires the restricted values of the parameters. In general, we may test H0: r(θ) = 0 with

LM = −[∂ℒ(θ̃)/∂θ'][∂²ℒ(θ̃)/∂θ∂θ']⁻¹[∂ℒ(θ̃)/∂θ] →d χ²_q,    (5.56)

where ℒ(·) is the unrestricted log-likelihood function and θ̃ is the restricted MLE.

5.5 Choosing Between Tests

The above analysis demonstrates that the three tests (likelihood ratio, Wald, and Lagrange multiplier) are asymptotically equivalent. In large samples, not only do they have the same limiting distribution, but they will accept and reject together. This is not the case in finite samples, where one can reject when another does not. This might lead a cynical analyst to use one rather than the other by choosing the one that yields the results (s)he wants to obtain. Making an informed choice based on their finite-sample behavior is beyond the scope of this course.
In many cases, however, one of the tests is a much more natural choice than the others. Recall that the Wald test only requires the unrestricted estimates, while the Lagrange multiplier test only requires the restricted estimates. In some cases the unrestricted estimates are much easier to obtain than the restricted, and in other cases the reverse is true. In the first case we might be inclined to use the Wald test, while in the latter we would prefer the Lagrange multiplier.
Another issue is the possible sensitivity of the test results to how the restrictions are written. For example, θ1 + θ2 = 0 can also be written θ2/θ1 = −1. The Wald test, in particular, is sensitive to how the restriction is written. This is yet another situation where a cynical analyst might be tempted to choose the "normalization" of the restriction to force the desired result. The Lagrange multiplier test, as presented above, is also sensitive to the normalization of the restriction, but it can be modified to avoid this difficulty. The likelihood ratio test, however, will be unaffected by the choice of how to write the restriction.