
Chapter 7 Estimation

Criteria for Estimators

• Main problem in stats: Estimation of population parameters, say θ.
• Recall that an estimator is a statistic. That is, any function of the observations {xi} with values in the parameter space.

• There are many estimators of θ. Question: Which is better?

• Criteria for Estimators
(1) Unbiasedness
(2) Efficiency
(3) Sufficiency
(4) Consistency

Unbiasedness

Definition: Unbiasedness
An unbiased estimator, say $\hat{\theta}$, has an expected value that is equal to the value of the population parameter being estimated, say θ. That is,
$$E[\hat{\theta}] = \theta$$

Example: $E[\bar{x}] = \mu$ and $E[s^2] = \sigma^2$.

Efficiency

Definition: Efficiency
An estimator is efficient if it estimates the parameter of interest in some best way. The notion of “best way” relies upon the choice of a loss function. The usual choice of loss function is the quadratic, $\ell(e) = e^2$, resulting in the mean squared error (MSE) criterion of optimality:

$$\mathrm{MSE} = E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta)^2] = \mathrm{Var}(\hat{\theta}) + [b(\theta)]^2$$

where $b(\theta) = E[\hat{\theta} - \theta]$ is the bias in $\hat{\theta}$.

The MSE is the sum of the variance and the square of the bias.
=> trade-off: a biased estimator can have a lower MSE than an unbiased estimator.
Note: The most efficient estimator among a group of unbiased estimators is the one with the smallest variance => BUE.

Now we can compare estimators and select the “best” one.
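The bias-variance trade-off can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (normal data, n = 10, σ² = 4, a fixed seed; the function names are hypothetical): it compares the unbiased variance estimator s² (divisor n − 1) with the biased divisor-n estimator and decomposes each simulated MSE into variance plus squared bias.

```python
import random
import statistics


def mse_decomposition(estimates, theta):
    """Return (mse, variance, bias) of a list of estimates of theta."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - theta
    variance = statistics.fmean([(e - mean_est) ** 2 for e in estimates])
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return mse, variance, bias


def simulate_variance_estimators(n=10, sigma2=4.0, reps=20000, seed=0):
    """Simulate the divisor-(n-1) and divisor-n estimators of sigma^2."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    unbiased, biased = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(x) / n
        ss = sum((xi - xbar) ** 2 for xi in x)
        unbiased.append(ss / (n - 1))  # s^2, unbiased for sigma^2
        biased.append(ss / n)          # divisor n, biased downward
    return unbiased, biased


unbiased, biased = simulate_variance_estimators()
mse_u, var_u, bias_u = mse_decomposition(unbiased, 4.0)
mse_b, var_b, bias_b = mse_decomposition(biased, 4.0)
print(f"s^2:       MSE={mse_u:.3f}  Var={var_u:.3f}  bias={bias_u:.3f}")
print(f"divisor n: MSE={mse_b:.3f}  Var={var_b:.3f}  bias={bias_b:.3f}")
```

In this setting the divisor-n estimator is biased but should show the smaller simulated MSE, which is the trade-off described above.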

Example: Three different estimators’ distributions

[Figure: sampling distributions of estimators 1, 2, 3, based on samples of the same size. Horizontal axis: value of estimator, with θ marked.]

– 1 and 2: expected value = population parameter (unbiased)
– 3: positively biased
– Variance decreases from 1, to 2, to 3 (3 is the smallest)
– 3 can have the smallest MSE. 2 is more efficient than 1.

Relative Efficiency

Since it is difficult to prove that an estimator is the best among all estimators, a relative concept is usually used.

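Definition: Relative efficiency
$$\text{Relative Efficiency} = \frac{\text{Variance of first estimator}}{\text{Variance of second estimator}}$$

Example: Sample mean vs. sample median (normal data, large n)
Variance of sample mean = $\sigma^2/n$
Variance of sample median = $\pi\sigma^2/(2n)$
$$\frac{\mathrm{Var}[\text{median}]}{\mathrm{Var}[\text{mean}]} = \frac{\pi\sigma^2/(2n)}{\sigma^2/n} = \frac{\pi}{2} \approx 1.57$$

The variance of the sample median is 1.57 times that of the sample mean, so the median is the less efficient estimator.

Asymptotic Efficiency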

• We compare two sample statistics in terms of their variances. The statistic with the smallest variance is called efficient.

• When we look at asymptotic efficiency, we look at the asymptotic variance of two statistics as n grows. Note that if we compare two consistent estimators, both variances eventually go to zero.
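As a rough numerical check of this kind of comparison, the sketch below simulates the sample mean and the sample median of standard normal data and reports the ratio of their variances, which should be near π/2 for large n. The sample size, number of replications and seed are arbitrary assumptions, and `variance_ratio` is a hypothetical helper name.

```python
import math
import random
import statistics


def variance_ratio(n=101, reps=3000, seed=1):
    """Simulated Var[median] / Var[mean] for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(x))
        medians.append(statistics.median(x))
    return statistics.pvariance(medians) / statistics.pvariance(means)


ratio = variance_ratio()
print(f"simulated Var[median]/Var[mean] = {ratio:.3f}; pi/2 = {math.pi / 2:.3f}")
```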

Example: Random sampling from the normal distribution
• Sample mean is asymptotically normal[μ, σ2/n]
• Median is asymptotically normal[μ, (π/2)σ2/n]
• Mean is asymptotically more efficient

Sufficiency

• Definition: Sufficiency
A statistic is sufficient when no other statistic, which can be calculated from the same sample, provides any additional information as to the value of the parameter of interest.

Equivalently, we say that conditional on the value of a sufficient statistic for a parameter, the joint probability distribution of the data does not depend on that parameter. That is, if P(X=x|T(X)=t, θ) = P(X=x|T(X)=t) we say that T is a sufficient statistic.

• The sufficient statistic contains all the information needed to estimate the population parameter. It is OK to ‘get rid’ of the original data, while keeping only the value of the sufficient statistic.

• Visualize sufficiency: Consider a Markov chain θ → T(X1, . . . ,Xn) → {X1, . . . ,Xn} (although in classical statistics θ is not a RV). Conditioned on the middle part of the chain, the front and back are independent.

Theorem
Let p(x,θ) be the pdf of X and q(t,θ) be the pdf of T(X). Then, T(X) is a sufficient statistic for θ if, for every x in the sample space, the ratio
$$\frac{p(x,\theta)}{q(T(x),\theta)}$$
is constant as a function of θ.

Example: Normal sufficient statistic
Let X1, X2, …, Xn be iid N(μ, σ2) where the variance is known. The sample mean, $\bar{x}$, is the sufficient statistic for μ.
Proof: Let’s start with the joint distribution function

$$f(x|\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x_i-\mu)^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\right]$$

• Next, add and subtract the sample mean (the cross term vanishes because $\sum_i (x_i - \bar{x}) = 0$):

$$f(x|\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\sum_{i=1}^{n}\frac{(x_i-\bar{x}+\bar{x}-\mu)^2}{2\sigma^2}\right] = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right]$$

• Recall that the distribution of the sample mean is

$$q(T(x)|\mu) = \frac{1}{(2\pi\sigma^2/n)^{1/2}} \exp\left[-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right]$$

• The ratio of the information in the sample to the information in the statistic becomes independent of μ:

$$\frac{f(x|\mu)}{q(T(x)|\mu)} = \frac{(2\pi\sigma^2)^{-n/2}\exp\left[-\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right]}{(2\pi\sigma^2/n)^{-1/2}\exp\left[-\dfrac{n(\bar{x}-\mu)^2}{2\sigma^2}\right]} = \frac{1}{n^{1/2}(2\pi\sigma^2)^{(n-1)/2}}\exp\left[-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{2\sigma^2}\right]$$

Theorem: Factorization Theorem
Let f(x|θ) denote the joint pdf or pmf of a sample X. A statistic T(X) is a sufficient statistic for θ if and only if there exist functions g(t|θ) and h(x) such that, for all sample points x and all parameter points θ,
$$f(x|\theta) = g(T(x)|\theta)\, h(x)$$

• Sufficient statistics are not unique. From the factorization theorem it is easy to see that (i) the identity function T(X) = X is a sufficient statistic vector and (ii) if T is a sufficient statistic for θ then so is any 1-1 function of T. Then, we have minimal sufficient statistics.

Definition: Minimal sufficiency
A sufficient statistic T(X) is called a minimal sufficient statistic if, for any other sufficient statistic T'(X), T(X) is a function of T'(X).

Consistency

Definition: Consistency
The estimator converges in probability to the population parameter being estimated when n (sample size) becomes larger. That is,
$$\hat{\theta}_n \xrightarrow{p} \theta$$

We say that $\hat{\theta}_n$ is a consistent estimator of θ.

Example: $\bar{x}$ is a consistent estimator of μ (the population mean).

• Q: Does unbiasedness imply consistency?

No. The first observation of {xn}, x1, is an unbiased estimator of μ. That is, E[x1] = μ. But letting n grow is not going to cause x1 to converge in probability to μ.

Squared-Error Consistency

Definition: Squared Error Consistency
The sequence $\{\hat{\theta}_n\}$ is a squared-error consistent estimator of θ if
$$\lim_{n\to\infty} E[(\hat{\theta}_n - \theta)^2] = 0$$

That is, $\hat{\theta}_n \xrightarrow{m.s.} \theta$.

• Squared-error consistency implies that both the bias and the variance of an estimator approach zero. Thus, squared-error consistency implies consistency.

Order of a Sequence: Big O and Little o

• “Little o” o(.).
A sequence {xn} is $o(n^{\delta})$ (order less than $n^{\delta}$) if $|n^{-\delta} x_n| \to 0$ as n → ∞.
Example: $x_n = n^3$ is $o(n^4)$ since $|n^{-4} x_n| = 1/n \to 0$ as n → ∞.

• “Big O” O(.).
A sequence {xn} is $O(n^{\delta})$ (at most of order $n^{\delta}$) if $n^{-\delta} x_n \to \psi$ as n → ∞ (ψ ≠ 0, constant).
Example: $f(z) = 6z^4 - 2z^3 + 5$ is $O(z^4)$ and $o(z^{4+\delta})$ for every δ > 0.
Special case: O(1): constant

• Order of a sequence of RV
The order of the variance gives the order of the sequence.
Example: What is the order of the sequence $\{\bar{x}\}$? $\mathrm{Var}[\bar{x}] = \sigma^2/n$, which is O(1/n), or $O(n^{-1})$.

Root n-Consistency

• Q: Let xn be a consistent estimator of θ. But how fast does xn converge to θ?
The sample mean, $\bar{x}$, has as its variance σ2/n, which is O(1/n). That is, the convergence is at the rate of $n^{-1/2}$. This is called “root n-consistency.”
Note: $n^{1/2}\bar{x}$ has variance of O(1).

• Definition: $n^{\delta}$-convergence
If an estimator has an $O(1/n^{2\delta})$ variance, then we say the estimator is $n^{\delta}$-convergent.
Example: Suppose var(xn) is $O(1/n^2)$. Then, xn is n-convergent.

The usual convergence is root n. If an estimator has a faster (higher degree of) convergence, it’s called super-consistent.

Estimation

• Two philosophies regarding models (assumptions) in statistics:
(1) Parametric statistics. It assumes data come from a type of probability distribution and makes inferences about the parameters of the distribution. Models are parameterized before collecting the data. Example: Maximum likelihood estimation.

(2) Non-parametric statistics. It assumes no particular probability distribution, i.e., the methods are “distribution free.” Models are not imposed a priori, but determined by the data. Examples: histograms, kernel density estimation.

• In general, parametric statistics makes more assumptions.

Least Squares Estimation

• Long history: Gauss (1795, 1801) used it in astronomy.
• Idea: There is a functional form relating Y and k variables X. This function depends on unknown parameters, θ. The relation between Y and X is not exact. There is an error, ε. We will estimate the parameters θ by minimizing the sum of squared errors.

(1) Functional form known
$$y_i = f(x_i, \theta) + \varepsilon_i$$

(2) Typical Assumptions
- f(x, θ) is correctly specified. For example, f(x, θ) = Xβ.
- X are numbers with full rank, or E(ε|X) = 0. That is, (ε ⊥ x).
- ε ~ iid D(0, σ2 I)

• Objective function: $S(x_i, \theta) = \sum_i \varepsilon_i^2$

• We want to minimize w.r.t. θ. That is,
$$\min_{\theta} \left\{ S(x_i, \theta) = \sum_i \varepsilon_i^2 = \sum_i [y_i - f(x_i, \theta)]^2 \right\}$$

=> $dS(x_i, \theta)/d\theta = -2 \sum_i [y_i - f(x_i, \theta)]\, f'(x_i, \theta)$

f.o.c. => $-2 \sum_i [y_i - f(x_i, \theta_{LS})]\, f'(x_i, \theta_{LS}) = 0$

Note: The f.o.c. deliver the normal equations.

The solution to the normal equations, θLS, is the LS estimator. The estimator θLS is a function of the data (yi, xi).

Suppose we assume a linear functional form. That is, f(x, θ) = Xβ. Using linear algebra, the objective function becomes
$$S(x_i, \theta) = \sum_i \varepsilon_i^2 = \varepsilon'\varepsilon = (y - X\beta)'(y - X\beta)$$
The f.o.c.:
$$-2 \sum_i [y_i - f(x_i, \theta_{LS})]\, f'(x_i, \theta_{LS}) = -2 (y - Xb)'X = 0$$
where b = βOLS. (Ordinary LS. Ordinary = linear.)

Solving for b => $b = (X'X)^{-1} X'y$

Note: b is a (linear) function of the data (yi, xi). Moreover,
$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$$
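To make the normal equations concrete, here is a minimal sketch in plain Python that forms X'X and X'y and solves (X'X) b = X'y by Gaussian elimination. The data and the helper names `solve_linear_system` and `ols` are assumptions for illustration only; in practice a linear algebra library would normally be used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def ols(X, y):
    """Return b solving the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: a constant and one regressor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
b = ols(X, y)
residuals = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
# The f.o.c. says X'e = 0: residuals are orthogonal to every column of X.
xt_e = [sum(row[j] * e for row, e in zip(X, residuals)) for j in range(2)]
print(b, xt_e)
```

The second printed vector is X'e, which the first-order conditions force to zero up to rounding error.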

Under the typical assumptions, we can establish properties for b.
1) $E[b|X] = \beta$
2) $\mathrm{Var}[b|X] = E[(b-\beta)(b-\beta)'|X] = (X'X)^{-1} X'E[\varepsilon\varepsilon'|X] X (X'X)^{-1} = \sigma^2 (X'X)^{-1}$
Under the typical assumptions, Gauss established that b is BLUE.
3) If $\varepsilon|X \sim N(0, \sigma^2 I_n)$ => $b|X \sim N(\beta, \sigma^2 (X'X)^{-1})$
4) With some additional assumptions, we can use the CLT to get
$b|X \xrightarrow{a} N(\beta, (\sigma^2/n)(X'X/n)^{-1})$

Maximum Likelihood Estimation

• Idea: Assume a particular distribution with unknown parameters. Maximum likelihood (ML) estimation chooses the set of parameters that maximize the likelihood of drawing a particular sample.

• Consider a sample (X1, ... , Xn) which is drawn from a pdf f(X|θ) where θ are parameters. If the Xi’s are independent with pdf f(Xi|θ) the joint probability of the whole sample is:

$$L(X|\theta) = f(X_1, \ldots, X_n|\theta) = \prod_{i=1}^{n} f(X_i|\theta)$$
The function L(X|θ), also written as L(X; θ), is called the likelihood function. This function can be maximized with respect to θ to produce maximum likelihood estimates ($\hat{\theta}_{MLE}$).

• It is often convenient to work with the Log of the likelihood function. That is,

ln L(X|θ) = Σi ln f(Xi| θ).

• The ML estimation approach is very general. Now, if the model is not correctly specified, the estimates are sensitive to the misspecification.

Ronald Fisher (1890 – 1962)

Maximum Likelihood: Example I

Let the sample be X = {5, 6, 7, 8, 9, 10}, drawn from a Normal(μ, 1). The density of each of these points, given the unknown mean μ, can be written as:

$$f(5|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(5-\mu)^2}{2}\right]$$
$$f(6|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(6-\mu)^2}{2}\right]$$
$$\vdots$$
$$f(10|\mu) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(10-\mu)^2}{2}\right]$$

Assume that the sample is independent.

Then, the joint pdf function can be written as:

$$L(X|\mu) = \frac{1}{(2\pi)^{6/2}} \exp\left[-\frac{(5-\mu)^2}{2} - \frac{(6-\mu)^2}{2} - \cdots - \frac{(10-\mu)^2}{2}\right]$$

The value of µ that maximizes the likelihood function of the sample can then be defined by $\max_{\mu} L(X|\mu)$.

It is easier, however, to maximize ln L(X|μ). That is,

$$\max_{\mu} \ln L(X|\mu) \Rightarrow \frac{\partial}{\partial\mu}\left[K - \frac{(5-\mu)^2}{2} - \frac{(6-\mu)^2}{2} - \cdots - \frac{(10-\mu)^2}{2}\right] = (5-\mu) + (6-\mu) + \cdots + (10-\mu) = 0$$

$$\hat{\mu}_{MLE} = \frac{5+6+7+8+9+10}{6} = \bar{x}$$
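The closed-form answer for this example can be cross-checked numerically. The sketch below evaluates the Normal(μ, 1) log-likelihood of the sample {5, ..., 10} on a coarse grid; the grid bounds and step are arbitrary choices made only for illustration.

```python
import math

sample = [5, 6, 7, 8, 9, 10]


def log_likelihood(mu, data=sample):
    """ln L(X | mu) for i.i.d. Normal(mu, 1) observations."""
    n = len(data)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((x - mu) ** 2 for x in data)


mu_hat = sum(sample) / len(sample)  # closed-form MLE: the sample mean

# Crude grid search over [5, 10] in steps of 0.01 as a cross-check.
grid = [5 + 0.01 * i for i in range(501)]
mu_grid = max(grid, key=log_likelihood)
print(mu_hat, round(mu_grid, 2))
```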

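• Let’s generalize this example to an i.i.d. sample X = {X1, X2, ..., XT} drawn from a Normal(μ, σ2). Then, the joint pdf function is:

$$L = \prod_{i=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(X_i-\mu)^2}{2\sigma^2}\right] = (2\pi\sigma^2)^{-T/2} \prod_{i=1}^{T} \exp\left[-\frac{(X_i-\mu)^2}{2\sigma^2}\right]$$

Then, taking logs, we have:
$$\ln L = -\frac{T}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{T}(X_i-\mu)^2 = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(X-\mu)'(X-\mu)$$
We take first derivatives:
$$\frac{\partial \ln L}{\partial \mu} = -\frac{1}{2\sigma^2}\sum_{i=1}^{T} 2(X_i-\mu)(-1) = \frac{1}{\sigma^2}\sum_{i=1}^{T}(X_i-\mu)$$
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{T}(X_i-\mu)^2$$

• Then, we have the f.o.c. and jointly solve for the ML estimators:

$$(1)\quad \frac{\partial \ln L}{\partial \mu} = \frac{1}{\hat{\sigma}^2_{MLE}}\sum_{i=1}^{T}(X_i - \hat{\mu}_{MLE}) = 0 \Rightarrow \hat{\mu}_{MLE} = \frac{1}{T}\sum_{i=1}^{T} X_i = \bar{X}$$

Note: The MLE of μ is the sample mean. Therefore, it is unbiased.

$$(2)\quad \frac{\partial \ln L}{\partial \sigma^2} = -\frac{T}{2\hat{\sigma}^2_{MLE}} + \frac{1}{2\hat{\sigma}^4_{MLE}}\sum_{i=1}^{T}(X_i - \hat{\mu}_{MLE})^2 = 0 \Rightarrow \hat{\sigma}^2_{MLE} = \frac{1}{T}\sum_{i=1}^{T}(X_i - \bar{X})^2$$

Note: The MLE of σ2 is not s2 (it divides by T rather than T − 1). Therefore, it is biased!

Maximum Likelihood: Example II

• We will work the previous example with matrix notation. Suppose we assume:
$$y_i = X_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
or
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_T)$$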

where Xi is a 1xk vector of exogenous numbers and β is a kx1 vector of unknown parameters. Then, the joint likelihood function becomes:

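$$L = \prod_{i=1}^{T} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\varepsilon_i^2}{2\sigma^2}\right] = (2\pi\sigma^2)^{-T/2} \prod_{i=1}^{T} \exp\left[-\frac{\varepsilon_i^2}{2\sigma^2}\right]$$

• Then, taking logs, we have the log likelihood function:
$$\ln L = -\frac{T}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{T}\varepsilon_i^2 = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$

• We take first derivatives of the log likelihood wrt β and σ2:
$$\frac{\partial \ln L}{\partial \beta} = -\frac{1}{2\sigma^2}\sum_{i=1}^{T} 2\varepsilon_i(-x_i') = \frac{1}{\sigma^2}X'\varepsilon$$
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} - \left(-\frac{1}{2\sigma^4}\right)\sum_{i=1}^{T}\varepsilon_i^2 = \left(\frac{1}{2\sigma^2}\right)\left[\frac{\varepsilon'\varepsilon}{\sigma^2} - T\right]$$

• Using the f.o.c., we jointly estimate β and σ2:
$$\frac{\partial \ln L}{\partial \beta} = \frac{1}{\hat{\sigma}^2}X'(y - X\hat{\beta}_{MLE}) = 0 \Rightarrow \hat{\beta}_{MLE} = (X'X)^{-1}X'y$$
$$\frac{\partial \ln L}{\partial \sigma^2} = \left(\frac{1}{2\hat{\sigma}^2_{MLE}}\right)\left[\frac{e'e}{\hat{\sigma}^2_{MLE}} - T\right] = 0 \Rightarrow \hat{\sigma}^2_{MLE} = \frac{e'e}{T} = \sum_{i=1}^{T}\frac{(y_i - X_i\hat{\beta}_{MLE})^2}{T}$$

ML: Score and Information Matrix

Definition: Score (or efficient score)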

$$S(X;\theta) = \frac{\partial \log L(X|\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \log f(x_i|\theta)}{\partial \theta}$$
S(X; θ) is called the score of the sample. It is the vector of partial derivatives (the gradient) with respect to the parameter θ. If we have k parameters, the score will have a kx1 dimension.

Definition: Fisher information for a single sample:

$$E\left[\left(\frac{\partial \log f(X|\theta)}{\partial \theta}\right)^2\right] = I(\theta)$$
I(θ) is sometimes just called information. It measures the shape of the log f(X|θ).

• The concept of information can be generalized for the k-parameter case. In this case:

$$E\left[\left(\frac{\partial \log L}{\partial \theta}\right)\left(\frac{\partial \log L}{\partial \theta}\right)^T\right] = I(\theta)$$
This is a kxk matrix. If L is twice differentiable with respect to θ, and under certain regularity conditions, then the information may also be written as

$$E\left[\left(\frac{\partial \log L}{\partial \theta}\right)\left(\frac{\partial \log L}{\partial \theta}\right)^T\right] = E\left[-\frac{\partial^2 \log L(X|\theta)}{\partial \theta\, \partial \theta'}\right] = I(\theta)$$
I(θ) is called the information matrix (negative Hessian). It measures the shape of the likelihood function.

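• Properties of S(X; θ):

$$S(X;\theta) = \frac{\partial \log L(X|\theta)}{\partial \theta} = \sum_{i=1}^{n} \frac{\partial \log f(x_i|\theta)}{\partial \theta}$$

(1) E[S(X; θ)] = 0.

$$\int f(x;\theta)\,dx = 1 \Rightarrow \int \frac{\partial f(x;\theta)}{\partial \theta}\,dx = 0$$
$$\int \frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial \theta} f(x;\theta)\,dx = 0$$
$$\int \frac{\partial \log f(x;\theta)}{\partial \theta} f(x;\theta)\,dx = 0 \Rightarrow E[S(x;\theta)] = 0$$

(2) Var[S(X; θ)] = n I(θ)
$$\int \frac{\partial \log f(x;\theta)}{\partial \theta} f(x;\theta)\,dx = 0$$
Let's differentiate the above integral once more:
$$\int \frac{\partial \log f(x;\theta)}{\partial \theta}\frac{\partial f(x;\theta)}{\partial \theta}\,dx + \int \frac{\partial^2 \log f(x;\theta)}{\partial \theta\,\partial \theta'} f(x;\theta)\,dx = 0$$
$$\int \frac{\partial \log f(x;\theta)}{\partial \theta}\left[\frac{1}{f(x;\theta)}\frac{\partial f(x;\theta)}{\partial \theta}\right] f(x;\theta)\,dx + \int \frac{\partial^2 \log f(x;\theta)}{\partial \theta\,\partial \theta'} f(x;\theta)\,dx = 0$$
$$\int \left[\frac{\partial \log f(x;\theta)}{\partial \theta}\right]^2 f(x;\theta)\,dx + \int \frac{\partial^2 \log f(x;\theta)}{\partial \theta\,\partial \theta'} f(x;\theta)\,dx = 0$$
$$E\left[\left(\frac{\partial \log f(x;\theta)}{\partial \theta}\right)^2\right] = -E\left[\frac{\partial^2 \log f(x;\theta)}{\partial \theta\,\partial \theta'}\right] = I(\theta)$$
$$\mathrm{Var}[S(X;\theta)] = n\,\mathrm{Var}\left[\frac{\partial \log f(x;\theta)}{\partial \theta}\right] = n\,I(\theta)$$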
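Properties (1) and (2) can be verified exactly for a simple discrete case. The sketch below assumes a single Bernoulli(p) observation with p = 0.3 (an arbitrary choice; the function names are hypothetical) and computes the expectations by summing over x in {0, 1}: the score has mean zero, and its second moment equals minus the expected second derivative, both equal to 1/(p(1 − p)).

```python
def bernoulli_score(x, p):
    """d/dp log f(x; p) for f(x; p) = p^x (1 - p)^(1 - x)."""
    return x / p - (1 - x) / (1 - p)


def bernoulli_second_derivative(x, p):
    """d^2/dp^2 log f(x; p)."""
    return -x / p ** 2 - (1 - x) / (1 - p) ** 2


def expectation(g, p):
    """Exact E[g(X)] for X ~ Bernoulli(p), summing over x in {0, 1}."""
    return (1 - p) * g(0) + p * g(1)


p = 0.3
mean_score = expectation(lambda x: bernoulli_score(x, p), p)
info_outer = expectation(lambda x: bernoulli_score(x, p) ** 2, p)
info_hessian = -expectation(lambda x: bernoulli_second_derivative(x, p), p)
print(mean_score, info_outer, info_hessian, 1 / (p * (1 - p)))
```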

(3) If S(xi; θ) are i.i.d. (with finite first and second moments), then we can apply the CLT to get:

$$S_n(X;\theta) = \sum_i S(x_i;\theta) \xrightarrow{a} N(0, n I(\theta))$$

Note: This is an important result. It will drive the distribution of ML estimators.

ML: Score and Information Matrix – Example

• Again, we assume:
$$y_i = X_i\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
or
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_T)$$

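• Taking logs, we have the log likelihood function:
$$\ln L = -\frac{T}{2}\ln 2\pi\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{T}\varepsilon_i^2 = -\frac{T}{2}\ln 2\pi - \frac{T}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$$
• The score function is the vector of first derivatives of log L wrt θ = (β, σ2):
$$\frac{\partial \ln L}{\partial \beta} = -\frac{1}{2\sigma^2}\sum_{i=1}^{T} 2\varepsilon_i(-x_i') = \frac{1}{\sigma^2}X'\varepsilon$$
$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} - \left(-\frac{1}{2\sigma^4}\right)\sum_{i=1}^{T}\varepsilon_i^2 = \left(\frac{1}{2\sigma^2}\right)\left[\frac{\varepsilon'\varepsilon}{\sigma^2} - T\right]$$

• Then, we take second derivatives to calculate I(θ):

$$\frac{\partial^2 \ln L}{\partial \beta\,\partial \beta'} = -\sum_{i=1}^{T} x_i x_i'/\sigma^2 = -\frac{1}{\sigma^2}X'X$$
$$\frac{\partial^2 \ln L}{\partial \beta\,\partial \sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{T}\varepsilon_i x_i'$$
$$\frac{\partial^2 \ln L}{\partial \sigma^2\,\partial \sigma^2} = -\frac{1}{2\sigma^4}\left[\frac{\varepsilon'\varepsilon}{\sigma^2} - T\right] + \left(\frac{1}{2\sigma^2}\right)\left(-\frac{\varepsilon'\varepsilon}{\sigma^4}\right) = -\frac{1}{2\sigma^4}\left[2\frac{\varepsilon'\varepsilon}{\sigma^2} - T\right]$$
• Then, using E[ε|X] = 0 and E[ε'ε] = Tσ2,
$$I(\theta) = E\left[-\frac{\partial^2 \ln L}{\partial \theta\,\partial \theta'}\right] = \begin{bmatrix} \dfrac{1}{\sigma^2}(X'X) & 0 \\ 0 & \dfrac{T}{2\sigma^4} \end{bmatrix}$$

In deriving properties (1) and (2), we have made some implicit assumptions, which are called regularity conditions:
(i) θ lies in an open interval of the parameter space, Ω.
(ii) The 1st and 2nd derivatives of f(X; θ) w.r.t. θ exist.
(iii) L(X; θ) can be differentiated w.r.t. θ under the integral sign.
(iv) $E[S(X;\theta)^2] > 0$, for all θ in Ω.
(v) T(X) L(X; θ) can be differentiated w.r.t. θ under the integral sign.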

Recall: If S(X; θ) are i.i.d. and regularity conditions apply, then we can apply the CLT to get:

$$S(X;\theta) \xrightarrow{a} N(0, n I(\theta))$$

ML: Cramer-Rao inequality

Theorem: Cramer-Rao inequality

Let the random sample (X1, ... , Xn) be drawn from a pdf f(X|θ) and let T = T(X1, ... , Xn) be a statistic such that E[T] = u(θ), differentiable in θ. Let b(θ) = u(θ) − θ, the bias in T. Assume regularity conditions. Then,
$$\mathrm{Var}(T) \ge \frac{[u'(\theta)]^2}{n I(\theta)} = \frac{[1 + b'(\theta)]^2}{n I(\theta)}$$
Regularity conditions:
(1) θ lies in an open interval Ω of the real line.
(2) For all θ in Ω, ∂f(X|θ)/∂θ is well defined.
(3) ∫L(X|θ)dx can be differentiated wrt θ under the integral sign.
(4) $E[S(X;\theta)^2] > 0$, for all θ in Ω.
(5) ∫T(X) L(X|θ)dx can be differentiated wrt θ under the integral sign.

The lower bound for Var(T) is called the Cramer-Rao (CR) lower bound.
Corollary: If T(X) is an unbiased estimator of θ, then

$$\mathrm{Var}(T) \ge (n I(\theta))^{-1}$$

Note: This theorem is the basis for the efficiency claims made for the ML estimator. The CR lower bound is the smallest theoretical variance. It can be shown that, under regularity conditions, ML estimates achieve this bound asymptotically; therefore, any other estimation technique can at best only equal it.

Proof: For any T(X) and S(X;θ) we have
$[\mathrm{Cov}(T,S)]^2 \le \mathrm{Var}(T)\,\mathrm{Var}(S)$ (Cauchy-Schwarz inequality)
Since E[S] = 0, Cov(T,S) = E[TS]. Also, u(θ) = E[T] = ∫ T L(X;θ) dx. Differentiating both sides:
$$u'(\theta) = \int T \frac{\partial L(X;\theta)}{\partial \theta}\,dx = \int T \left[\frac{1}{L}\frac{\partial L(X;\theta)}{\partial \theta}\right] L\,dx = \int T\,S\,L\,dx = E[TS] = \mathrm{Cov}(T,S)$$

Substituting in the Cauchy-Schwarz inequality, with Var(S) = n I(θ):
$$[u'(\theta)]^2 \le \mathrm{Var}(T)\, n I(\theta) \Rightarrow \mathrm{Var}(T) \ge [u'(\theta)]^2/[n I(\theta)] \quad ■$$

Note: For an estimator to achieve the CR lower bound, we need
$[\mathrm{Cov}(T,S)]^2 = \mathrm{Var}(T)\,\mathrm{Var}(S)$.
This is possible if T is a linear function of S. That is,
T(X) = α(θ) S(X;θ) + β(θ)
Since E[S(X;θ)] = 0, E[T] = α(θ) E[S(X;θ)] + β(θ) = β(θ). Then,
S(X;θ) = ∂ log L(X;θ)/∂θ = [T(X) − β(θ)]/α(θ).
Integrating both sides wrt θ:
log L(X;θ) = U(X) – T(X) A(θ) + B(θ)

That is, L(X;θ) = exp{ΣiU(Xi) – A(θ) ΣiT(Xi) + n B(θ)}
Or, f(X;θ) = exp{U(X) – T(X) A(θ) + B(θ)}

That is, the exponential (Pitman-Koopman-Darmois) family of distributions attains the CR lower bound.
• Most of the distributions we have seen belong to this family: normal, exponential, gamma, chi-square, beta, Weibull (if the shape parameter is known), Dirichlet, Bernoulli, binomial, multinomial, Poisson, negative binomial (with known parameter r), and geometric.

• Note: The Chapman–Robbins bound is a lower bound on the variance of estimators of θ. It generalizes the Cramér–Rao bound. It is tighter and can be applied to more situations, for example, when I(θ) does not exist. However, it is usually more difficult to compute.

Cramer-Rao inequality: Multivariate Case

• When we have k parameters, the covariance matrix of the estimator T(X) has a CR lower bound given by:
$$\mathrm{Covar}(T(X)) \ge \frac{\partial u(\theta)}{\partial \theta}\, I(\theta)^{-1} \left(\frac{\partial u(\theta)}{\partial \theta}\right)^T$$
Note: In matrix notation, the inequality A ≥ B means the matrix A − B is positive semidefinite.

If T(X) is unbiased, then

$$\mathrm{Covar}(T(X)) \ge I(\theta)^{-1}$$

C. R. Rao (1920, India) & Harald Cramer (1893-1985, Sweden)

Cramer-Rao inequality: Example

We want to check if the sample mean and s2 for an i.i.d. sample X = {X1, X2, ..., Xn} drawn from N(μ, σ2) achieve the CR lower bound.
Recall:
$$I(\theta) = E\left[-\frac{\partial^2 \ln L}{\partial \theta\,\partial \theta'}\right] = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}$$
Since the sample mean and s2 are unbiased, the CR lower bound is given by:
$$\mathrm{Covar}(T) \ge I(\theta)^{-1}$$
Then,
$$\mathrm{Var}(\bar{X}) \ge \frac{\sigma^2}{n} \quad \text{and} \quad \mathrm{Var}(s^2) \ge \frac{2\sigma^4}{n}$$
We have already derived that $\mathrm{Var}(\bar{X}) = \sigma^2/n$ and $\mathrm{Var}(s^2) = 2\sigma^4/(n-1)$. Then, the sample mean achieves its CR bound, but s2 does not.

Concentrated ML

• We split the parameter vector θ into two vectors:

$$L(\theta) = L(\theta_1, \theta_2)$$

Sometimes, we can derive a formula for the ML estimate of θ2, say:

$$\theta_2 = g(\theta_1)$$

If this is possible, we can write the likelihood function as

$$L(\theta_1, \theta_2) = L(\theta_1, g(\theta_1)) = L^*(\theta_1)$$
This is the concentrated likelihood function.

• This process is often useful as it reduces the number of parameters that need to be estimated.

Concentrated ML: Example

• The normal log likelihood function can be written (dropping the constant) as:

$$\ln L(\mu, \sigma^2) = -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2$$

• This expression can be solved for the optimal choice of σ2 by differentiating with respect to σ2:

$$\frac{\partial \ln L(\mu, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^{n}(X_i - \mu)^2 = 0$$
$$\Rightarrow -n\sigma^2 + \sum_{i=1}^{n}(X_i - \mu)^2 = 0$$
$$\Rightarrow \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2$$

• Substituting this result into the original log likelihood produces:

$$\ln L(\mu) = -\frac{n}{2}\ln\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] - \frac{1}{2}\,\frac{\sum_{i=1}^{n}(X_i - \mu)^2}{\frac{1}{n}\sum_{j=1}^{n}(X_j - \mu)^2} = -\frac{n}{2}\ln\left[\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu)^2\right] - \frac{n}{2}$$

• Intuitively, the ML estimator of µ is the value that minimizes the mean squared deviation $(1/n)\sum_i (X_i - \mu)^2$. Thus, the least squares estimate of the mean of a normal distribution is the same as the ML estimator under the assumption that the sample is i.i.d.

Properties of ML Estimators

(1) Efficiency. Under general conditions, we have that
$$\mathrm{Var}(\hat{\theta}_{MLE}) \ge (n I(\theta))^{-1}$$
The right-hand side is the Cramer-Rao lower bound (CR-LB). If an estimator can achieve this bound, ML will produce it.

(2) Consistency. We know that E[S(Xi; θ)] = 0 and Var[S(Xi; θ)] = I(θ). The consistency of ML can be shown by applying Khinchine’s LLN to S(Xi; θ) and then to Sn(X; θ) = Σi S(Xi; θ).
Then, do a 1st-order Taylor expansion of Sn(X; θ) around $\hat{\theta}_{MLE}$:
$$S_n(X;\theta) = S_n(X;\hat{\theta}_{MLE}) + S_n'(X;\theta_n^*)(\theta - \hat{\theta}_{MLE}), \qquad |\theta - \theta_n^*| \le |\theta - \hat{\theta}_{MLE}| < \varepsilon$$
Since $S_n(X;\hat{\theta}_{MLE}) = 0$ by the f.o.c.,
$$S_n(X;\theta) = S_n'(X;\theta_n^*)(\theta - \hat{\theta}_{MLE})$$

Sn(X; θ) and ($\hat{\theta}_{MLE}$ − θ) converge together to zero (i.e., expectation).

(3) Theorem: Asymptotic Normality

Let the likelihood function be L(X1, X2, …, Xn|θ). Under general conditions, the MLE of θ is asymptotically distributed as

$$\hat{\theta}_{MLE} \xrightarrow{a} N\left(\theta, [n I(\theta)]^{-1}\right)$$
Sketch of a proof. Using the CLT, we’ve already established
$$S_n(X;\theta) \xrightarrow{a} N(0, n I(\theta)).$$
Then, using a first order Taylor expansion as before, we get
$$\frac{1}{n^{1/2}} S_n(X;\theta) = \frac{1}{n^{1/2}} S_n'(X;\theta_n^*)(\theta - \hat{\theta}_{MLE})$$

Notice that E[S′(xi; θ)] = −I(θ). Then, apply the LLN to get
$$S_n'(X;\theta_n^*)/n \xrightarrow{p} -I(\theta) \quad (\text{using } \theta_n^* \xrightarrow{p} \theta).$$
Now, algebra and Slutsky’s theorem for RV get the final result.

(4) Sufficiency. If a single sufficient statistic exists for θ, the MLE of θ must be a function of it. That is, $\hat{\theta}_{MLE}$ depends on the sample observations only through the value of a sufficient statistic.

(5) Invariance. The ML estimate is invariant under functional transformations. That is, if $\hat{\theta}_{MLE}$ is the MLE of θ and if g(θ) is a function of θ, then g($\hat{\theta}_{MLE}$) is the MLE of g(θ).

Quasi Maximum Likelihood

• ML rests on the assumption that the errors follow a particular distribution (OLS is only ML if the errors are normal).

• Q: What happens if we make the wrong assumption? White (Econometrica, 1982) shows that, under broad assumptions about the misspecification of the error process, $\hat{\theta}_{MLE}$ is still a consistent estimator. The estimation is called Quasi ML.

• But the covariance matrix is no longer $I(\theta)^{-1}$; instead it is given by
$$\mathrm{Var}[\hat{\theta}] = I(\hat{\theta})^{-1}\,[S(\hat{\theta})'S(\hat{\theta})]\,I(\hat{\theta})^{-1}$$

• In general, Wald and LM tests are valid when using this corrected covariance matrix. But LR tests are invalid, since they work directly from the value of the likelihood function.

ML Estimation: Numerical Optimization

• In simple cases like OLS, we can calculate the ML estimates from the f.o.c.’s, i.e., analytically. But in most situations we cannot.

• We resort to numerical optimisation of the likelihood function.

• Think of hill climbing in parameter space. There are many algorithms to do this.

• General steps:
(1) Set an arbitrary initial set of parameters, i.e., starting values.
(2) Determine a direction of movement (for example, by dL/dθ).
(3) Determine a step length to move (for example, by d2L/dθ2).
(4) Check convergence criteria and either stop or go back to (2).
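One common instance of these steps is Newton-Raphson, where the gradient gives the direction and the second derivative scales the step. The sketch below is only an illustration: it assumes an exponential model f(x; λ) = λ exp(−λx), chosen because the closed-form MLE 1/x̄ is available as a check, with made-up data and a hypothetical function name. It is not presented as the specific algorithm the slides have in mind.

```python
def newton_mle_exponential(data, start, tol=1e-12, max_iter=100):
    """Newton-Raphson for ln L(lam) = n ln(lam) - lam * sum(data)."""
    n = len(data)
    total = sum(data)
    lam = start  # step (1): starting value
    for _ in range(max_iter):
        gradient = n / lam - total   # d lnL / d lam: direction of movement
        hessian = -n / lam ** 2      # d2 lnL / d lam2: scales the step
        step = -gradient / hessian   # steps (2)-(3): Newton update
        lam += step
        if abs(step) < tol:          # step (4): convergence check
            return lam
    raise RuntimeError("Newton-Raphson did not converge")


data = [0.5, 1.5, 1.0, 2.0]  # hypothetical sample, mean 1.25
lam_hat = newton_mle_exponential(data, start=0.5)
print(lam_hat, 1 / (sum(data) / len(data)))
```

Newton-Raphson is sensitive to the starting value; for this likelihood a start between 0 and 2/x̄ converges, which is why 0.5 is used here.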

[Figure: likelihood L plotted against β, with the points β1, β2 and β* marked on the horizontal axis.]

Method of Moments (MM) Estimation

• Simple idea: Suppose the first moment (the mean) is generated by the distribution, f(X,θ). The observed moment from a sample of n observations is

$$m_1 = (1/n)\sum_{i=1}^{n} x_i$$
Hence, we can retrieve the parameter θ by inverting the distribution function f(X,θ):
$$m_1 = f(x|\theta) \Rightarrow \theta = f^{-1}(m_1)$$

• Example: Mean of a Poisson pdf: $f(x) = \exp(-\lambda)\lambda^x/x!$

E[X] = λ => plim (1/N)Σi xi = λ.

Then, the MM estimator of λ is the sample mean of X => $\lambda_{MM} = \bar{x}$

• Example: Mean of Exponential pdf: $f(x,\lambda) = \lambda e^{-\lambda x}$

E[X] = 1/λ => plim (1/N)Σi xi = 1/λ

Then, $\lambda_{MM} = 1/\bar{x}$.
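The two examples above reduce to a few lines of code. This is a minimal sketch with made-up data; the function names are hypothetical.

```python
def sample_mean(data):
    """First sample moment, m1 = (1/n) * sum(x_i)."""
    return sum(data) / len(data)


def mm_poisson(data):
    """E[X] = lambda, so the MM estimate is the sample mean."""
    return sample_mean(data)


def mm_exponential(data):
    """E[X] = 1/lambda, so the MM estimate is 1 / sample mean."""
    return 1.0 / sample_mean(data)


counts = [2, 0, 3, 1, 4]          # hypothetical Poisson-type counts
durations = [0.5, 1.5, 1.0, 2.0]  # hypothetical exponential-type durations
print(mm_poisson(counts), mm_exponential(durations))
```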

• Let’s complicate the MM idea: Now, suppose we have a model. This model implies certain knowledge about the moments of the distribution.

Then, we invert the model to give us estimates of the unknown parameters of the model, which match the theoretical moments for a given sample.

MM Estimation

• We have a model Y = h(X,θ), where θ are k parameters. Under this model, we know what some moments of the distribution should be. That is, the model provides us with k conditions (or moments), which should be met:
$$E(g(Y, X|\theta)) = 0$$

• In this case, the (population) first moment of g(Y,X,θ) equals 0. Then, we approximate the k moments, i.e., E(g), with a sample measure and invert g to get an estimate of θ:

$$\hat{\theta}_{MM} = g^{-1}(Y, X, 0)$$
$\hat{\theta}_{MM}$ is the Method of Moments estimator of θ.

Note: In this example we have as many moments (k) as unknown parameters (k). Thus, θ is uniquely and exactly determined.

MM Estimation: Example

We start with a model Y = Xβ + ε. In OLS estimation, we make the assumption that the X’s are orthogonal to the errors. Thus,

$$E(X'\varepsilon) = 0$$

The sample moment analogue for each xi is
$$(1/n)\sum_{t=1}^{n} x_{it} e_t = 0, \quad \text{or} \quad (1/n)X'e = 0.$$
And, thus,

$$(1/n)X'e = 0 = (1/n)X'(Y - X\beta_{MM}) \Rightarrow X'Y = X'X\beta_{MM}$$

Therefore, the method of moments estimator, βMM, solves the normal equations. That is, βMM will be identical to the OLS estimator, b.

Generalized Method of Moments (GMM)

• So far, we have assumed that there are as many moments (l) as unknown parameters (k). The parameters are uniquely and exactly determined.

• If l < k, i.e., fewer moment conditions than parameters, we would not be able to solve them for a unique set of parameters (the model would be under identified).

• If l > k, i.e., more moment conditions than parameters, then all the conditions cannot be met at the same time. The model is over identified and we have GMM estimation.

If we cannot satisfy all the conditions at the same time, we want to make them all as close to zero as possible at the same time. We have to figure out a way to weight them.

• Now, we have k parameters but l moment conditions, with l > k. Thus,

$$E(m_j(\theta)) = 0, \quad j = 1, \ldots, l \quad (l \text{ population moments})$$
$$\bar{m}_j(\theta) = (1/n)\sum_{t=1}^{n} m_{jt}(\theta) = 0, \quad j = 1, \ldots, l \quad (l \text{ sample moments})$$
• Then, we need to make all l moments as small as possible, simultaneously. Let’s use a weighted least squares criterion:
$$\min_{\theta}\; q = \bar{m}(\theta)'\, W\, \bar{m}(\theta)$$
That is, the weighted squared sum of the moments. The weighting matrix is the lxl matrix W. (Note that we have a quadratic form.)

• First order condition:
$$2\left.\frac{\partial \bar{m}(\theta)'}{\partial \theta}\right|_{\theta = \theta_{GMM}} W\, \bar{m}(\theta_{GMM}) = 0$$

• The GMM estimator, θGMM, solves this kx1 system of equations. There is typically no closed form solution for θGMM. It must be obtained through numerical optimization methods.
• If plim $\bar{m}(\theta)$ = 0, and W (not a function of θ) is a positive definite matrix, then θGMM is a consistent estimator of θ.

• The optimal W
Any positive definite weighting matrix produces a consistent estimator of θ. We can select the most efficient one, i.e., the optimal W.

The optimal W is simply the inverse of the (asymptotic) covariance matrix of the moment conditions. Thus,
$$\text{Optimal } W = W^* = [\text{Asy Var}(\bar{m})]^{-1}$$
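A toy over-identified problem may help fix ideas. The sketch below assumes Poisson-type count data with one parameter λ (k = 1) and two moment conditions (l = 2), E[x − λ] = 0 and E[x² − λ − λ²] = 0. It uses an identity weighting matrix, which is an assumption made for simplicity and is not the optimal W, and minimizes the quadratic form with a golden-section search. All names and data are hypothetical.

```python
def golden_section_min(f, lo, hi, tol=1e-10):
    """Minimize a one-dimensional function on [lo, hi] by golden-section search."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def sample_moments(data, lam):
    """Sample analogues of E[x - lam] = 0 and E[x^2 - lam - lam^2] = 0."""
    n = len(data)
    m1 = sum(x - lam for x in data) / n
    m2 = sum(x * x - lam - lam * lam for x in data) / n
    return [m1, m2]


def gmm_objective(lam, data, W):
    """Quadratic form q = m' W m in the sample moments."""
    m = sample_moments(data, lam)
    return sum(m[i] * W[i][j] * m[j] for i in range(2) for j in range(2))


data = [2, 3, 1, 4, 2, 3, 0, 5, 2, 3]  # hypothetical counts
W = [[1.0, 0.0], [0.0, 1.0]]           # identity weighting: an assumption, not the optimal W
lam_gmm = golden_section_min(lambda lam: gmm_objective(lam, data, W), 0.01, 10.0)
print(lam_gmm, sum(data) / len(data))
```

Because the two conditions cannot both hold exactly in this sample, the estimate lands between the value implied by the first moment alone (the sample mean) and the value implied by the second moment alone.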

Properties of the GMM estimator

(1) Consistency.
If plim $\bar{m}(\theta)$ = 0, and W (not a function of θ) is a pd matrix, then, under some conditions, $\theta_{GMM} \xrightarrow{p} \theta$.
(2) Asymptotic Normality.
Under some general conditions, $\theta_{GMM} \xrightarrow{a} N(\theta, V_{GMM})$, with
$$V_{GMM} = (1/n)[G'V^{-1}G]^{-1}$$
where G is the matrix of derivatives of the moments with respect to the parameters and $V = \mathrm{Var}(n^{1/2}\bar{m}(\theta))$.

Lars Peter Hansen (1952)

Bayesian Estimation: Bayes’ Theorem

• Recall Bayes’ Theorem:
$$\mathrm{Prob}(\theta|X) = \frac{\mathrm{Prob}(X|\theta)\,\mathrm{Prob}(\theta)}{\mathrm{Prob}(X)}$$
- P(θ): Prior probability about parameter θ.
- P(X|θ): Probability of observing the data, X, conditional on θ. This conditional probability is called the likelihood, i.e., the probability that event X will be the outcome of the experiment depends on θ.
- P(θ|X): Posterior probability, i.e., the probability assigned to θ after X is observed.
- P(X): Marginal probability of X. This is the prior probability of witnessing the data X under all possible scenarios for θ, and it depends on the prior probabilities given to each θ.


• Example: Courtroom – Guilty vs. Non-guilty
G: Event that the defendant is guilty.
E: Event that the defendant's DNA matches DNA found at the crime scene.
The jurors, after initial questions, form a personal belief about the defendant’s guilt. This initial belief is the prior.
The jurors, after seeing the DNA evidence (event E), will update their prior beliefs. This update is the posterior.

- P(G): Juror’s personal estimate of the probability that the defendant is guilty, based on evidence other than the DNA match. (Say, .30).
- P(E|G): Probability of seeing event E if the defendant is actually guilty. (In our case, it should be near 1.)
- P(E): E can happen in two ways: defendant is guilty and thus DNA match is correct, or defendant is non-guilty with incorrect DNA match (one in a million chance).
- P(G|E): Probability that defendant is guilty given a DNA match.
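The update in the courtroom example can be sketched numerically. The helper below is a hypothetical illustration of Bayes' theorem for a binary hypothesis, using the numbers stated in the example (prior 0.30, P(E|G) = 1, false-match probability one in a million).

```python
def posterior(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' theorem for a binary hypothesis."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence


# Numbers from the courtroom example: P(G) = 0.30, P(E|G) = 1, P(E|not G) = 1e-6.
p_guilty_given_match = posterior(0.30, 1.0, 1e-6)
print(round(p_guilty_given_match, 6))
```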

$$\text{Prob}(G \mid E) = \frac{\text{Prob}(E \mid G)\,\text{Prob}(G)}{\text{Prob}(E)} = \frac{1 \times 0.3}{0.3 \times 1 + 0.7 \times 10^{-6}} = 0.999998$$

Bayesian Estimation: Viewpoints

• Implicitly, in our previous discussions about estimation (MLE), we adopted a classical viewpoint. – We had some process generating random observations. – This random process was a function of fixed, but unknown parameters. – Then, we designed procedures to estimate these unknown parameters based on observed data.

• For example, we assume a random process such as CEO compensation. This CEO compensation process can be characterized by a normal distribution.
– We can estimate the parameters of this distribution using maximum likelihood.

– The likelihood of a particular sample can be expressed as

$$L(X_1, X_2, \ldots, X_n \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2\right]$$

– Our estimates of µ and σ² are then based on the value of each parameter that maximizes the likelihood of drawing that sample.

Thomas Bayes (1702–April 17, 1761)
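To make the classical maximization concrete, the following sketch evaluates the normal log-likelihood and the closed-form maximum likelihood estimates, the sample mean for µ and the (1/n) sample variance for σ². The compensation figures are made-up numbers used only for illustration, and the function names are not from the original notes.

```python
import math


def normal_log_likelihood(x, mu, sigma2):
    """Log of L(X_1,...,X_n | mu, sigma^2) for an i.i.d. normal sample."""
    n = len(x)
    ss = sum((xi - mu) ** 2 for xi in x)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ss / (2.0 * sigma2)


def normal_mle(x):
    """Closed-form MLE: sample mean and the (1/n) variance."""
    n = len(x)
    mu_hat = sum(x) / n
    sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / n
    return mu_hat, sigma2_hat


# Hypothetical CEO compensation data (in $ millions), for illustration only.
sample = [1.2, 2.5, 0.8, 3.1, 1.9, 2.2]
mu_hat, sigma2_hat = normal_mle(sample)
best = normal_log_likelihood(sample, mu_hat, sigma2_hat)

# Moving away from the MLE in either parameter lowers the log-likelihood.
print(best > normal_log_likelihood(sample, mu_hat + 0.1, sigma2_hat))
print(best > normal_log_likelihood(sample, mu_hat, sigma2_hat * 1.2))
```

Working with the log of the likelihood leaves the maximizer unchanged and avoids numerical underflow in larger samples.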

• Turning the classical process around slightly, a Bayesian viewpoint starts with some kind of probability statement about the parameters (a prior). Then, the data, X, are used to update our prior beliefs (a posterior).
– First, assume that our prior beliefs about the distribution function can be expressed as a probability density function π(θ), where θ is the parameter we are interested in estimating.
– Based on a sample, summarized by the likelihood function L(X|θ), we can update our knowledge of the distribution using Bayes’ theorem:

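$$\pi(\theta \mid X) = \frac{\text{Prob}(X \mid \theta)\,\pi(\theta)}{\text{Prob}(X)} = \frac{L(X \mid \theta)\,\pi(\theta)}{\int_{-\infty}^{\infty} L(X \mid \theta)\,\pi(\theta)\,d\theta}$$

Bayesian Estimation: Example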

• Assume that the observations follow a Bernoulli distribution. Our prior is that the parameter P of the Bernoulli distribution follows a beta distribution, Β(α,β):

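$$\pi(P) = f(P; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\,P^{\alpha-1}(1-P)^{\beta-1}$$

$$B(\alpha, \beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)}$$

$$\pi(P) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,P^{\alpha-1}(1-P)^{\beta-1}$$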
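The link between the beta function and the gamma function can be checked numerically. The sketch below uses only the Python standard library; the helper names and the choice α = β = 2 are assumptions made for illustration, and the midpoint rule is just one simple way to approximate the integral.

```python
import math


def beta_function(a, b):
    """B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_pdf(p, a, b):
    """Beta(a, b) density at p, for 0 < p < 1."""
    return p ** (a - 1) * (1 - p) ** (b - 1) / beta_function(a, b)


def midpoint_integral(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


a = b = 2.0  # illustrative parameter values
gamma_form = beta_function(a, b)                                        # equals 1/6 here
integral_form = midpoint_integral(lambda x: x ** (a - 1) * (1 - x) ** (b - 1))
total_mass = midpoint_integral(lambda p: beta_pdf(p, a, b))
print(gamma_form, integral_form, total_mass)
```

The first two numbers agree up to integration error, and the third is close to 1, as required of a density.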

• Assume that we are interested in forming the posterior distribution after a single draw, X:

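$$\pi(P \mid X) = \frac{P^{X}(1-P)^{1-X}\,\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,P^{\alpha-1}(1-P)^{\beta-1}}{\displaystyle\int_0^1 P^{X}(1-P)^{1-X}\,\dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,P^{\alpha-1}(1-P)^{\beta-1}\,dP} = \frac{P^{X+\alpha-1}(1-P)^{\beta-X}}{\displaystyle\int_0^1 P^{X+\alpha-1}(1-P)^{\beta-X}\,dP}$$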

• Following the original specification of the beta function

$$\int_0^1 P^{X+\alpha-1}(1-P)^{\beta-X}\,dP = \int_0^1 P^{\alpha^*-1}(1-P)^{\beta^*-1}\,dP = \frac{\Gamma(X+\alpha)\,\Gamma(\beta-X+1)}{\Gamma(\alpha+\beta+1)}$$

where $\alpha^* = X + \alpha$ and $\beta^* = \beta - X + 1$.

• The posterior distribution, i.e., the distribution of P after the observation, is then

$$\pi(P \mid X) = \frac{\Gamma(\alpha+\beta+1)}{\Gamma(X+\alpha)\,\Gamma(\beta-X+1)}\,P^{X+\alpha-1}(1-P)^{\beta-X}$$

• The Bayesian estimate of P is then the value that minimizes a loss function. Several loss functions can be used, but we will focus on the quadratic loss function consistent with mean squared error:

2 ∂E(Pˆ − P)  2   min E(Pˆ − P)  ⇒ = 2E[Pˆ − P]= 0 Pˆ   ∂Pˆ ⇒ Pˆ = E[P] • Taking the expectation of the posterior distribution yields

$$E[P] = \int_0^1 \frac{\Gamma(\alpha+\beta+1)}{\Gamma(X+\alpha)\,\Gamma(\beta-X+1)}\,P^{X+\alpha}(1-P)^{\beta-X}\,dP = \frac{\Gamma(\alpha+\beta+1)}{\Gamma(X+\alpha)\,\Gamma(\beta-X+1)} \int_0^1 P^{X+\alpha}(1-P)^{\beta-X}\,dP$$

• As before, we solve the integral by defining α* = α+X+1 and β* = β−X+1. The integral then becomes

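$$\int_0^1 P^{\alpha^*-1}(1-P)^{\beta^*-1}\,dP = \frac{\Gamma(\alpha^*)\,\Gamma(\beta^*)}{\Gamma(\alpha^*+\beta^*)} = \frac{\Gamma(\alpha+X+1)\,\Gamma(\beta-X+1)}{\Gamma(\alpha+\beta+2)}$$

$$E[P] = \frac{\Gamma(\alpha+\beta+1)}{\Gamma(\alpha+\beta+2)}\,\frac{\Gamma(\alpha+X+1)\,\Gamma(\beta-X+1)}{\Gamma(\alpha+X)\,\Gamma(\beta-X+1)}$$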

• Which can be simplified using the fact Γ(α+1)= α Γ(α):

$$E[P] = \frac{\Gamma(\alpha+\beta+1)\,\Gamma(\alpha+X+1)}{\Gamma(\alpha+\beta+2)\,\Gamma(\alpha+X)} = \frac{\Gamma(\alpha+\beta+1)\,(\alpha+X)\,\Gamma(\alpha+X)}{(\alpha+\beta+1)\,\Gamma(\alpha+\beta+1)\,\Gamma(\alpha+X)} = \frac{\alpha+X}{\alpha+\beta+1}$$
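The closed form (α+X)/(α+β+1) can be cross-checked by integrating the unnormalized posterior directly. The following is a rough numerical sketch, not part of the original derivation: the helper names are hypothetical, the midpoint rule is a simple stand-in for a proper quadrature routine, and α = β = 1.4968 is used only as an example value.

```python
def midpoint_integral(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


def posterior_mean_numeric(x, a, b):
    """E[P | X = x] as a ratio of integrals of the unnormalized posterior."""
    def kernel(p):
        # Posterior kernel after one Bernoulli draw: P^(x+a-1) (1-P)^(b-x)
        return p ** (x + a - 1) * (1 - p) ** (b - x)

    numerator = midpoint_integral(lambda p: p * kernel(p))
    denominator = midpoint_integral(kernel)
    return numerator / denominator


def posterior_mean_closed_form(x, a, b):
    """Posterior mean from the gamma-function simplification."""
    return (a + x) / (a + b + 1)


a = b = 1.4968  # example prior parameters
for x in (0, 1):
    print(x, posterior_mean_numeric(x, a, b), posterior_mean_closed_form(x, a, b))
```

For both possible outcomes of the single draw, the two columns agree to several decimal places.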

• To make this estimation process operational, assume that we have a prior distribution with parameters α=β=1.4968, which yields a beta distribution for P with a mean of 0.5 and a variance of about 0.0625.
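The prior moments follow from the standard beta formulas, mean = α/(α+β) and variance = αβ/[(α+β)²(α+β+1)]. A short check (the function name is illustrative):

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


mean, var = beta_mean_var(1.4968, 1.4968)
print(mean, round(var, 4))  # 0.5 0.0626
```

With α = β the mean is 0.5 for any value of the parameter; a variance of exactly 0.0625 would correspond to α = β = 1.5.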

• Extending the results to n Bernoulli trials yields

$$\pi(P \mid X) = \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+Y)\,\Gamma(\beta-Y+n)}\,P^{Y+\alpha-1}(1-P)^{\beta-Y+n-1}$$

where Y is the sum of the individual Xs or the number of heads in the sample. The estimated value of P then becomes:

$$\hat{P} = \frac{Y+\alpha}{\alpha+\beta+n}$$

• Suppose that in the first sample Y=15 and n=50. This yields an estimated value of P of 0.31129. This value compares with the maximum likelihood estimate of 0.3000. Since the maximum likelihood estimator in this case is unbiased, the results imply that the Bayesian estimator is biased.
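The numbers in this example can be reproduced with a few lines of Python. The sketch below also evaluates the expected value of the Bayesian estimate over repeated samples, using E[Y] = nP, to show where the bias comes from; the function names and the hypothetical true value P = 0.3 are assumptions made for illustration.

```python
def bayes_estimate(y, n, a, b):
    """Posterior mean of P under a Beta(a, b) prior after y successes in n trials."""
    return (y + a) / (a + b + n)


def expected_bayes_estimate(p, n, a, b):
    """E[P_hat] over repeated samples, using E[Y] = n * p."""
    return (n * p + a) / (a + b + n)


a = b = 1.4968
y, n = 15, 50
p_bayes = bayes_estimate(y, n, a, b)
p_mle = y / n
print(round(p_bayes, 4))  # 0.3113
print(p_mle)              # 0.3

# Bias at a hypothetical true value P = 0.3: the estimate is pulled
# toward the prior mean of 0.5, so the bias is positive here.
bias = expected_bayes_estimate(0.3, n, a, b) - 0.3
print(bias)
```

The bias shrinks as n grows, because the data term nP dominates the prior parameters in both the numerator and the denominator.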