
ECE 645: Spring 2015 Instructor: Prof. Stanley H. Chan

Lecture 8: Properties of Maximum Likelihood Estimation (MLE) (LaTeX prepared by Haiguang Wen) April 27, 2015

This lecture note is based on ECE 645(Spring 2015) by Prof. Stanley H. Chan in the School of Electrical and Computer Engineering at Purdue University.

1 Efficiency of MLE

Maximum Likelihood Estimation (MLE) is a widely used statistical estimation method. In this lecture, we will study its properties: efficiency, consistency and asymptotic normality. MLE is a method for estimating the parameters of a statistical model. Given the distribution f(y ; θ) of a statistical model with unknown deterministic parameter θ, MLE estimates θ by maximizing the likelihood function f(y ; θ) over θ, given observations y:

θ̂(y) = arg max_θ f(y ; θ).    (1)

Please see the previous lecture note (Lecture 7) for details.
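To make Equation (1) concrete, here is a minimal numerical sketch (our illustration, not part of the original note) that maximizes the likelihood for the Gaussian-mean model used in Example 1 below. The data, seed, and the use of scipy's scalar optimizer are our own choices, and σ is assumed known.

import numpy as np
from scipy.optimize import minimize_scalar

# Simulated data: n i.i.d. observations from N(theta_true, sigma^2)
rng = np.random.default_rng(0)
theta_true, sigma, n = 2.0, 1.0, 100
y = rng.normal(theta_true, sigma, size=n)

# Maximizing f(y; theta) is equivalent to minimizing the negative log-likelihood
def neg_log_likelihood(theta):
    return 0.5 * np.sum((y - theta) ** 2) / sigma**2 + 0.5 * n * np.log(2 * np.pi * sigma**2)

theta_hat = minimize_scalar(neg_log_likelihood).x
print(theta_hat, y.mean())  # numerical MLE agrees with the closed form (the sample mean)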

1.1 Cramér–Rao Lower Bound (CRLB)

The Cramér–Rao Lower Bound (CRLB) was introduced in Lecture 7. Briefly, the CRLB is a lower bound on the variance of any estimator of the deterministic parameter θ. That is,

Var( θ̂(Y) ) ≥ ( ∂/∂θ E[θ̂(Y)] )² / I(θ),    (2)

where I(θ) is the Fisher information, which measures the information carried by the observable Y about the unknown parameter θ. For an unbiased estimator θ̂(Y), Equation (2) simplifies to

Var( θ̂(Y) ) ≥ 1 / I(θ),    (3)

which says that the variance of any unbiased estimator is at least the inverse of the Fisher information.
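As a numerical companion to Equations (2)–(3) (our addition, not from the original note), the Fisher information can also be estimated by Monte Carlo as the variance of the score ∂/∂θ log f(Y ; θ); the Gaussian model of Example 1 and all parameter values below are assumptions made for the demo.

import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n = 2.0, 1.0, 25

# Score of one experiment of n samples: d/dtheta log f(y; theta) = sum(y_k - theta) / sigma^2
samples = rng.normal(theta, sigma, size=(100_000, n))
score = (samples - theta).sum(axis=1) / sigma**2

# For regular models, I(theta) = Var(score); compare with the exact value n / sigma^2
print(score.var(), n / sigma**2)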

1.2 Efficient Estimator

From Section 1.1, we know that the variance of an estimator θ̂(y) cannot be lower than the CRLB. So any estimator whose variance equals the lower bound is considered an efficient estimator.

Definition 1. Efficient Estimator
An estimator θ̂(y) is efficient if it achieves equality in the CRLB.

Example 1.
Question: Y = {Y₁, Y₂, ..., Y_n} are i.i.d. Gaussian random variables with distribution N(θ, σ²). Determine the maximum likelihood estimator of θ. Is the estimator efficient?

Solution: Let y = {y₁, y₂, ..., y_n} be the observation. Then

f(y ; θ) = ∏_{k=1}^n f(y_k ; θ)
         = ∏_{k=1}^n (1/√(2πσ²)) exp( −(y_k − θ)² / (2σ²) )
         = (2πσ²)^{−n/2} exp( −∑_{k=1}^n (y_k − θ)² / (2σ²) ).

Taking the log of both sides of the above equation, we have

log f(y ; θ) = −(n/2) log(2πσ²) − ∑_{k=1}^n (y_k − θ)² / (2σ²).

Since log f(y ; θ) is a concave quadratic function of θ, we can obtain the MLE by solving the following equation.

∂/∂θ log f(y ; θ) = ∑_{k=1}^n (y_k − θ) / σ² = 0.

Therefore, the MLE is θ̂_MLE(y) = (1/n) ∑_{k=1}^n y_k.

Now let us check whether the estimator is efficient or not. It is easy to check that the MLE is an unbiased estimator (E[θ̂_MLE(Y)] = θ). To determine the CRLB, we need to calculate the Fisher information of the model:

I(θ) = −E[ ∂²/∂θ² log f(y ; θ) ] = n/σ².    (4)

According to Equation (3), we have

Var( θ̂_MLE(Y) ) ≥ 1/I(θ) = σ²/n.    (5)

And the variance of the MLE is

Var( θ̂_MLE(Y) ) = Var( (1/n) ∑_{k=1}^n Y_k ) = σ²/n.    (6)

So the CRLB equality is achieved, and thus the MLE is efficient.
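A quick Monte Carlo check of this conclusion (a sketch we add for illustration; the seed and sample sizes are arbitrary): the empirical variance of the sample-mean MLE should sit right at the CRLB σ²/n.

import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, trials = 2.0, 1.0, 50, 100_000

# Each row is one experiment of n Gaussian observations; the MLE is the row mean
theta_mle = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)

print(theta_mle.var())  # empirical Var(theta_MLE), approximately 0.02
print(sigma**2 / n)     # CRLB = sigma^2 / n = 0.02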

1.3 Minimum Variance Unbiased Estimator (MVUE)

Recall that a Minimum Variance Unbiased Estimator (MVUE) is an unbiased estimator whose variance is no larger than that of any other unbiased estimator, for all possible values of the parameter θ. That is,

Var( θ̂_MVUE(Y) ) ≤ Var( θ̃(Y) )    (7)

for any unbiased estimator θ̃(Y) and any θ.

Proposition 1. Unbiased and Efficient Estimators
If an estimator θ̂(y) is unbiased and efficient, then it must be the MVUE.


Proof. Since θ̂(y) is unbiased and efficient, its variance equals the CRLB, i.e., Var( θ̂(Y) ) = 1/I(θ). By the CRLB, any other unbiased estimator θ̃(Y) satisfies Var( θ̃(Y) ) ≥ 1/I(θ). Therefore,

Var( θ̂(Y) ) ≤ Var( θ̃(Y) )    (8)

for any unbiased θ̃(Y). Hence θ̂(Y) has minimum variance (MV). Since θ̂(Y) is also unbiased, it is the MVUE. ✷

Remark: The converse of the proposition is not true in general. That is, an MVUE does NOT need to be efficient. Here is a counterexample.

Example 2. (Counterexample)
Suppose that Y = {Y₁, Y₂, ..., Y_n} are i.i.d. exponential random variables with unknown rate parameter θ. Find the MLE and MVUE of θ. Are these estimators efficient?

Solution: Let y = {y₁, y₂, ..., y_n} be the observation. Then

f(y ; θ) = ∏_{k=1}^n f(y_k ; θ)
         = ∏_{k=1}^n θ exp(−θ y_k)
         = θⁿ exp( −θ ∑_{k=1}^n y_k ).    (9)

Taking the log of both sides of the above equation, we have

log f(y ; θ) = n log(θ) − θ ∑_{k=1}^n y_k.

Since log f(y ; θ) is a concave function of θ, we can obtain the MLE by solving the following equation.

∂/∂θ log f(y ; θ) = n/θ − ∑_{k=1}^n y_k = 0.

So the MLE is

θ̂_MLE(y) = n / ∑_{k=1}^n y_k.    (10)

To calculate the CRLB, we need to calculate E[θ̂_MLE(Y)] and Var( θ̂_MLE(Y) ). Let T(y) = ∑_{k=1}^n y_k. Then, by the moment generating function, we can show that the distribution of T(Y) is the Erlang distribution:

f_T(t) = θⁿ tⁿ⁻¹ / (n−1)! · e^{−θt}.    (11)

So we have

E[ θ̂_MLE(T(Y)) ] = ∫₀^∞ (n/t) · θⁿ tⁿ⁻¹/(n−1)! · e^{−θt} dt
                 = (n/(n−1)) θ ∫₀^∞ (θt)^{n−2}/(n−2)! · e^{−θt} d(θt)
                 = (n/(n−1)) θ.    (12)

Therefore, the MLE is a biased estimator of θ.

Similarly, we can calculate the variance of the MLE as follows:

Var( θ̂_MLE(T(Y)) ) = E[ θ̂_MLE²(T(Y)) ] − ( E[ θ̂_MLE(T(Y)) ] )²
                   = θ²n² / ( (n−1)²(n−2) ).

The Fisher information is

I(θ) = −E[ ∂²/∂θ² log f(y ; θ) ] = n/θ².

So the CRLB is

Var( θ̂_MLE(T(Y)) ) ≥ ( ∂/∂θ E[ θ̂_MLE(T(Y)) ] )² / I(θ)
                   = ( n/(n−1) )² / ( n/θ² )
                   = ( n/(n−1)² ) θ².

The CRLB equality does NOT hold, so θ̂_MLE is not efficient.

The distribution in Equation (9) belongs to the exponential family, and T(y) = ∑_{k=1}^n y_k is a complete sufficient statistic. So the MLE can be expressed as θ̂_MLE(T(y)) = n/T(y), which is a function of T(y). However, the MLE is a biased estimator (Equation (12)). But we can construct an unbiased estimator based on the MLE. That is,

θ̃(T(y)) = ((n−1)/n) θ̂_MLE(T(y)) = (n−1)/T(y).

It is easy to check that E[ θ̃(T(Y)) ] = ((n−1)/n) E[ θ̂_MLE(T(Y)) ] = ((n−1)/n)(n/(n−1)) θ = θ. Since θ̃(T(y)) is an unbiased estimator and a function of the complete sufficient statistic T(y), θ̃(T(y)) is the MVUE (by the Lehmann–Scheffé theorem). So

θ̂_MVUE(T(y)) = (n−1)/T(y).    (13)

The variance of the MVUE is

Var( θ̂_MVUE(T(Y)) ) = Var( ((n−1)/n) θ̂_MLE(T(Y)) )
                    = ( (n−1)²/n² ) · θ²n²/( (n−1)²(n−2) )
                    = θ²/(n−2).

And the CRLB is

Var( θ̂_MVUE(T(Y)) ) ≥ 1/I(θ) = θ²/n.

Since θ²/(n−2) > θ²/n, the MVUE is NOT an efficient estimator.
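The calculations of Example 2 can be verified by simulation. The sketch below (our addition, with arbitrary θ and n) compares the bias of the MLE n/T(y) against the MVUE (n−1)/T(y), and the MVUE's variance θ²/(n−2) against the CRLB θ²/n.

import numpy as np

rng = np.random.default_rng(3)
theta, n, trials = 1.5, 10, 500_000

# T = sum of n i.i.d. Exponential(theta) samples (numpy parameterizes by scale = 1/theta)
T = rng.exponential(1 / theta, size=(trials, n)).sum(axis=1)

theta_mle = n / T         # biased: E[theta_mle] = n/(n-1) * theta, per Equation (12)
theta_mvue = (n - 1) / T  # unbiased, per Equation (13)

print(theta_mle.mean(), n / (n - 1) * theta)               # bias of the MLE
print(theta_mvue.mean(), theta)                            # MVUE is unbiased
print(theta_mvue.var(), theta**2 / (n - 2), theta**2 / n)  # Var(MVUE) exceeds the CRLB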

2 Consistency of MLE

Definition 2. Consistency
Let {Y₁, ..., Y_n} be a sequence of observations. Let θ̂_n be the estimator using {Y₁, ..., Y_n}. We say that θ̂_n is consistent if θ̂_n → θ in probability, i.e.,

P( |θ̂_n − θ| > ε ) → 0, as n → ∞.    (14)

Remark: A sufficient condition for Equation (14) to hold is that

E[ (θ̂_n − θ)² ] → 0, as n → ∞.    (15)

Proof. According to Chebyshev's inequality, we have

P( |θ̂_n − θ| ≥ ε ) ≤ E[ (θ̂_n − θ)² ] / ε².    (16)

Since E[ (θ̂_n − θ)² ] → 0, we have

0 ≤ P( |θ̂_n − θ| ≥ ε ) ≤ E[ (θ̂_n − θ)² ] / ε² → 0.

Therefore, P( |θ̂_n − θ| > ε ) → 0 as n → ∞. ✷

Example 3.
{Y₁, Y₂, ..., Y_n} are i.i.d. Gaussian random variables with distribution N(θ, σ²). Is the MLE using {Y₁, Y₂, ..., Y_n} consistent?

Solution: From Example 1, we know that the MLE is

θ̂_n = (1/n) ∑_{k=1}^n Y_k.

Since

E[ (θ̂_n − θ)² ] = Var( θ̂_n ) = σ²/n (see Equation (6)),

we have E[ (θ̂_n − θ)² ] → 0. Therefore θ̂_n → θ in probability, i.e., θ̂_n is consistent.

In fact, the result of the example above holds for general distributions (under suitable regularity conditions). The following proposition states this result:

Proposition 2. (MLE of i.i.d. observations is consistent)
Let {Y₁, ..., Y_n} be a sequence of i.i.d. observations, where the Y_k are i.i.d. with density f_θ(y).

Then the MLE of θ is consistent.

Proof. (This proof is partially correct. See Levy, Chapter 4.5 for a complete discussion.)
The MLE of θ is

θ̂_n = arg max_θ ∏_{k=1}^n f_θ(y_k)
    = arg max_θ ∑_{k=1}^n log f_θ(y_k)
    = arg max_θ (1/n) ∑_{k=1}^n log f_θ(y_k)
    = arg max_θ φ_n(θ),

where φ_n(θ) = (1/n) ∑_{k=1}^n log f_θ(y_k). Let ℓ_θ̂(y_k) = log( f_θ(y_k) / f_θ̂(y_k) ). Then we have

E_θ[ ℓ_θ̂(Y_k) ] def= ∫ log( f_θ(y_k) / f_θ̂(y_k) ) f_θ(y_k) dy_k = D( f_θ ‖ f_θ̂ ),

the Kullback–Leibler divergence between f_θ and f_θ̂. According to the weak law of large numbers (WLLN), we have

(1/n) ∑_{k=1}^n ℓ_θ̂(y_k) → D( f_θ ‖ f_θ̂ ) in probability.    (17)

Since θ̂_n is the MLE, which maximizes φ_n(θ), we have

0 ≥ φ_n(θ) − φ_n(θ̂)
  = (1/n) ∑_{k=1}^n log f_θ(y_k) − (1/n) ∑_{k=1}^n log f_θ̂(y_k)
  = (1/n) ∑_{k=1}^n log( f_θ(y_k) / f_θ̂(y_k) )
  = (1/n) ∑_{k=1}^n ℓ_θ̂(y_k)
  = [ (1/n) ∑_{k=1}^n ℓ_θ̂(y_k) − D( f_θ ‖ f_θ̂ ) ] + D( f_θ ‖ f_θ̂ ).

Therefore,

D( f_θ ‖ f_θ̂ ) ≤ D( f_θ ‖ f_θ̂ ) − (1/n) ∑_{k=1}^n ℓ_θ̂(y_k).

By Equation (17), we have

0 ≤ D( f_θ ‖ f_θ̂ ) ≤ D( f_θ ‖ f_θ̂ ) − (1/n) ∑_{k=1}^n ℓ_θ̂(y_k) → 0 in probability.

So we must have D( f_θ ‖ f_θ̂ ) → 0 in probability, and hence θ̂ → θ in probability. ✷
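As an empirical illustration of consistency (our sketch, reusing the Gaussian model of Example 3 with arbitrary ε and sample sizes), P( |θ̂_n − θ| > ε ) can be estimated by simulation and seen to shrink as n grows.

import numpy as np

rng = np.random.default_rng(4)
theta, sigma, eps, trials = 2.0, 1.0, 0.1, 10_000

for n in (10, 100, 1000):
    theta_n = rng.normal(theta, sigma, size=(trials, n)).mean(axis=1)  # Gaussian-mean MLE
    # Empirical estimate of P(|theta_n - theta| > eps); tends to 0 as n grows
    print(n, np.mean(np.abs(theta_n - theta) > eps))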

3 Asymptotic Normality of MLE

The previous proposition only asserts that the MLE of i.i.d. observations is consistent. However, it provides no information about the distribution of the MLE.

Proposition 3. Asymptotic Normality
Let {Y₁, ..., Y_n} be a sequence of i.i.d. observations, where the Y_k are i.i.d. with density f_θ(y).

If θ̂ is the MLE of θ, then

√n ( θ̂ − θ ) → N( 0, 1/I(θ) ) in distribution,

where I(θ) is the Fisher information of a single observation.

Proof. See Lehmann, "Elements of Large-Sample Theory", Springer, 1999 for the proof. ✷

Example 4.
{Y₁, Y₂, ..., Y_n} are i.i.d. Gaussian random variables with distribution N(θ, σ²). Find the asymptotic distribution of θ̂_ML.

Solution: Similar to Example 1, we can calculate the Fisher information of θ for a single observation:

I(θ) = −E[ ∂²/∂θ² log f_θ(Y) ] = 1/σ².

We know that θ̂_ML = (1/n) ∑_{k=1}^n y_k. So for θ̂_n = (1/n) ∑_{k=1}^n y_k, we have

√n ( θ̂_n − θ ) → N( 0, σ² ) in distribution.
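This limit is easy to check by simulation (our sketch; the moderate n, σ, and seed are arbitrary choices): the empirical distribution of √n(θ̂_n − θ) should have mean near 0 and variance near σ² = 1/I(θ).

import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n, trials = 2.0, 1.5, 200, 50_000

samples = rng.normal(theta, sigma, size=(trials, n))
z = np.sqrt(n) * (samples.mean(axis=1) - theta)  # sqrt(n) * (MLE - theta)

# Empirical mean ~ 0 and variance ~ sigma^2 = 1/I(theta), matching N(0, sigma^2)
print(z.mean(), z.var(), sigma**2)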
