
Maximum Likelihood February 22, 2016 Debdeep Pati

1 Maximum Likelihood Estimation

Assume X ∼ Pθ, θ ∈ Θ, with joint pdf (or pmf) f(x | θ). Suppose we observe X = x. The Likelihood is L(θ | x) = f(x | θ) as a function of θ (with the data x held fixed). The likelihood function L(θ | x) and joint pdf f(x | θ) are the same except that f(x | θ) is generally viewed as a function of x with θ held fixed, and L(θ | x) as a function of θ with x held fixed. f(x | θ) is a density in x for each fixed θ. But L(θ | x) is not a density (or mass function) in θ for fixed x (except by coincidence).

1.1 The Maximum Likelihood Estimator (MLE)

A point estimator θ̂ = θ̂(x) is an MLE for θ if L(θ̂ | x) = supθ L(θ | x), that is, θ̂ maximizes the likelihood. In most cases the maximum is achieved at a unique value, so we can refer to "the" MLE and write θ̂(x) = argmaxθ L(θ | x). (But there are cases where the likelihood has flat spots and the MLE is not unique.)

1.2 Motivation for MLEs

Note: We often write L(θ | x) = L(θ), suppressing x, which is kept fixed at the observed data. Suppose x ∈ R^n.
Discrete Case: If f(· | θ) is a mass function (X is discrete), then

L(θ) = f(x | θ) = Pθ(X = x).
L(θ) is the probability of getting the observed data x when the parameter value is θ.
Continuous Case: When f(· | θ) is a continuous density, Pθ(X = x) = 0, but if B ⊂ R^n is a very, very small ball (or cube) centered at the observed data x, then

Pθ(X ∈ B) ≈ f(x | θ) × Volume(B) ∝ L(θ).

L(θ) is proportional to the probability that the random data X will be close to the observed data x when the parameter value is θ. Thus, the MLE θ̂ is the value of θ which makes the observed data x "most probable".

To find θ̂, we maximize L(θ). This is usually done by calculus (finding a stationary point), but not always. If the parameter space Θ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If L(θ) is not "smooth" (continuous and everywhere differentiable), the maximum does not have to be achieved at a stationary point. Cautionary Example: Suppose X1, . . . , Xn are iid Uniform(0, θ) and Θ = (0, ∞). Given data x = (x1, . . . , xn), find the MLE for θ.

L(θ) = ∏_{i=1}^n θ^(−1) I(0 < xi < θ) = θ^(−n) I(0 ≤ x(1)) I(x(n) ≤ θ)
     = θ^(−n) for θ ≥ x(n), and 0 for 0 < θ < x(n),
which is maximized at θ = x(n), which is a point of discontinuity and certainly not a stationary point. Thus, the MLE is θ̂ = x(n). Notes: L(θ) = 0 for θ < x(n) is just saying that these values of θ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical): Pθ(θ̂ < θ) = 1.

The MLE is biased; it is always less than the true value. A Similar Example: Let X1,...,Xn be iid Uniform(α, β) and Θ = {(α, β): α < β}. Given data x = (x1, . . . , xn), find the MLE for θ = (α, β).

L(α, β) = ∏_{i=1}^n (β − α)^(−1) I(α < xi < β) = (β − α)^(−n) I(α ≤ x(1)) I(x(n) ≤ β)
     = (β − α)^(−n) for α ≤ x(1) and x(n) ≤ β, and 0 otherwise,
which is maximized by making β − α as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at (α, β) = (x(1), x(n)). Thus the MLE is θ̂ = (α̂, β̂) = (x(1), x(n)). Again, Pα,β(α < α̂, β̂ < β) = 1.
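A quick numerical check of the first cautionary example (a minimal Python sketch with made-up data, assuming NumPy is available): evaluating L(θ) = θ^(−n) I(θ ≥ x(n)) on a grid shows the maximum sits at the sample maximum.

    import numpy as np

    x = np.array([0.8, 2.3, 1.1, 3.7, 0.4])      # hypothetical observed data
    n = len(x)

    def likelihood(theta):
        # L(theta) = theta^(-n) if theta >= max(x_i), else 0
        return theta**(-n) if theta >= x.max() else 0.0

    grid = np.linspace(0.1, 6.0, 1000)
    L = np.array([likelihood(t) for t in grid])
    print("argmax over grid:", grid[L.argmax()])  # close to x_(n) = 3.7
    print("MLE (sample maximum):", x.max())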

2 Maximizing the Likelihood (one parameter)

2.1 General Remarks

Basic Result: A continuous function g(θ) defined on a closed, bounded interval J attains its supremum (but might do so at one of the endpoints). (That is, there exists a point θ0 ∈ J such that g(θ0) = supθ∈J g(θ).) Consequence: Suppose g(θ) is a continuous, non-negative function defined on an open interval J = (c, d) (where perhaps c = −∞ or d = ∞). If

limθ→c g(θ) = limθ→d g(θ) = 0,
then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

3 Maxima at Stationary Points

Suppose the function g(θ) is defined on an interval Θ (which may be open or closed, infinite or finite). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point (that is, g′(θ0) = 0).

1. If g′(θ0) = 0 and g″(θ0) < 0, then θ0 is a local maximum (but might not be the global maximum).

2. If g′(θ0) = 0 and g″(θ) < 0 for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for θ0 to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying g″(θ) < 0 for all θ ∈ Θ is called strictly concave. It lies below any tangent line. Another useful condition (sufficient, but not necessary) is:

3. If g′(θ) > 0 for θ < θ0 and g′(θ) < 0 for θ > θ0, then θ0 is a global maximum.

4 Maximizing the Likelihood (multi-parameter)

4.1 Basic Result:

A continuous function g(θ) defined on a closed, bounded set J ⊂ R^k attains its supremum (but might do so on the boundary).

4.2 Consequence:

Suppose g(θ) is a continuous, non-negative function defined for all θ ∈ R^k. If g(θ) → 0 as ||θ|| → ∞, then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.
Suppose the function g(θ) is defined on a convex set Θ ⊂ R^k (that is, the line segment joining any two points in Θ lies entirely inside Θ). If g is differentiable and attains its supremum at a point θ0 in the interior of Θ, that point must be a stationary point:
∂g(θ0)/∂θi = 0, i = 1, 2, . . . , k.
Define the vector D and Hessian matrix H:

D(θ) = (∂g(θ)/∂θi)_{i=1}^k (a k × 1 vector),
H(θ) = (∂²g(θ)/∂θi∂θj)_{i,j=1}^k (a k × k matrix).

4.3 Maxima at Stationary Points

1. If D(θ0) = 0 and H(θ0) is negative definite, then θ0 is a local maximum (but might not be the global maximum).

2. If D(θ0) = 0 and H(θ) is negative definite for all θ ∈ Θ, then θ0 is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for θ0 to be a global maximum. (2) is sufficient (but not necessary). A function for which H(θ) is negative definite for all θ ∈ Θ is called strictly concave. It lies below any tangent plane.

4.4 Positive and Negative Definite Matrices

Suppose M is a k × k symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric. Definitions:

1. M is positive definite if x′Mx > 0 for all x ≠ 0 (x ∈ R^k).

2. M is negative definite if x′Mx < 0 for all x ≠ 0.

3. M is non-negative definite (or positive semi-definite) if x′Mx ≥ 0 for all x ∈ R^k.

Facts:

1. M is p.d. iff all its eigenvalues are positive.

2. M is n.d. iff all its eigenvalues are negative.

3. M is n.n.d. iff all its eigenvalues are non-negative.

4. M is p.d. iff −M is n.d.

5. If M is p.d., all its diagonal elements must be positive.

6. If M is n.d., all its diagonal elements must be negative.

7. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

2 × 2 Symmetric Matrices:
M = (mij) = ( m11  m12
              m21  m22 ),  m12 = m21,
|M| = m11 m22 − m12 m21 = m11 m22 − m12².

A 2 × 2 symmetric matrix is p.d. when the determinant is positive and the diagonal elements are positive. A 2 × 2 symmetric matrix is n.d. when the determinant is positive and the diagonal elements are negative. The bare minimum you need to check: M is p.d. if m11 > 0 (or m22 > 0) and |M| > 0. M is n.d. if m11 < 0 (or m22 < 0) and |M| > 0.
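As a sanity check on these criteria (a small NumPy sketch with hypothetical matrices, not tied to any example in the notes): compare the eigenvalue test with the 2 × 2 shortcut above.

    import numpy as np

    def is_negative_definite(M):
        # eigenvalue criterion: a symmetric M is n.d. iff all eigenvalues are negative
        return np.all(np.linalg.eigvalsh(M) < 0)

    def nd_shortcut_2x2(M):
        # 2 x 2 shortcut: one negative diagonal entry and a positive determinant
        return M[0, 0] < 0 and np.linalg.det(M) > 0

    A = np.array([[-2.0, 1.0],
                  [ 1.0, -3.0]])   # hypothetical Hessian-like matrix
    print(is_negative_definite(A), nd_shortcut_2x2(A))   # True True

    B = np.array([[-2.0, 3.0],
                  [ 3.0, -3.0]])   # determinant = 6 - 9 < 0, so indefinite
    print(is_negative_definite(B), nd_shortcut_2x2(B))   # False False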

Example: Observe X1, X2, . . . , Xn iid Gamma(α, β). Preliminaries:
L(α, β) = ∏_{i=1}^n xi^(α−1) e^(−xi/β) / (β^α Γ(α)).
Maximizing L is the same as maximizing l = log L, given by

l(α, β) = (α − 1)T1 − T2/β − nα log β − n log Γ(α)

where T1 = Σi log xi and T2 = Σi xi. Note that T = (T1, T2) is the natural sufficient statistic of this 2pef.

∂l/∂α = T1 − n log β − nψ(α), where ψ(α) ≡ (d/dα) log Γ(α) = Γ′(α)/Γ(α)
∂l/∂β = T2/β² − nα/β = (1/β²)(T2 − nαβ)
∂²l/∂α² = −nψ′(α)
∂²l/∂β² = −2T2/β³ + nα/β² = (−1/β³)(2T2 − nαβ)
∂²l/∂α∂β = −n/β

Situation #1: Suppose α = α0 is known. Find MLE for β. (Drop α from arguments: l(β) = l(α0, β) etc.) l(β) is continuous and differentiable. l(β) has a unique stationary point:

l′(β) = ∂l/∂β = (1/β²)(T2 − nα0β) = 0
iff T2 = nα0β, iff β = T2/(nα0) (≡ β*).
Now we check the second derivative:
l″(β) = ∂²l/∂β² = (−1/β³)(2T2 − nα0β) = (−1/β³){T2 + (T2 − nα0β)}.

Note l″(β*) < 0 since T2 − nα0β* = 0, but l″(β) > 0 for β > 2T2/(nα0). Thus, the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., l(β) is not a strictly concave function). How can we be sure that we have found the global maximum, and not just a local maximum? In this case, there is a simple argument: the stationary point β* is unique, and l′(β) > 0 for β < β*, and l′(β) < 0 for β > β*. This ensures β* is the unique global maximizer. Conclusion: β̂ = T2/(nα0). (This is a function of T2, which is a sufficient statistic for β when α is known.)
Situation #2: Suppose β = β0 is known. Find MLE for α. (Drop β from arguments: l(α) = l(α, β0) etc.) Note: l′(α) and l″(α) involve ψ(α). The function ψ is infinitely differentiable on the interval (0, ∞), and satisfies ψ′(α) > 0 and ψ″(α) < 0 for all α > 0. (The function is strictly increasing and strictly concave.)

Also, limα→0+ ψ(α) = −∞ and limα→∞ ψ(α) = ∞. Thus ψ⁻¹ : R → (0, ∞) exists. l(α) is continuous and differentiable. l(α) has a unique stationary point:

l′(α) = T1 − n log β0 − nψ(α) = 0

iff ψ(α) = T1/n − log β0, iff α = ψ⁻¹(T1/n − log β0). This is the unique global maximizer since l″(α) = −nψ′(α) < 0 for all α > 0.

Thus α̂ = ψ⁻¹(T1/n − log β0) is the MLE. (This is a function of T1, which is a sufficient statistic for α when β is known.)
Situation #3: Find MLE for θ = (α, β). l(α, β) is continuous and differentiable. A stationary point must satisfy the system of two equations:
∂l/∂α = T1 − n log β − nψ(α) = 0
∂l/∂β = (1/β²)(T2 − nαβ) = 0.
Solving the second equation for β gives β = T2/(nα). Plugging this into the first equation, and rearranging a bit, leads to
T1/n − log(T2/n) = ψ(α) − log α ≡ H(α).
The function H(α) is continuous and strictly increasing from (0, ∞) to (−∞, 0), so it has an inverse mapping (−∞, 0) to (0, ∞). Thus, the solution to the above equation can be written
α = H⁻¹(T1/n − log(T2/n)).
Thus the unique stationary point is:
α̂ = H⁻¹(T1/n − log(T2/n)), β̂ = T2/(nα̂).

Is this the MLE? Let us examine the Hessian:
H(α, β) = ( ∂²l/∂α²    ∂²l/∂α∂β
            ∂²l/∂α∂β   ∂²l/∂β² )
        = ( −nψ′(α)    −n/β
            −n/β       (−1/β³)(2T2 − nαβ) )
H(α̂, β̂) = ( −nψ′(α̂)    −n²α̂/T2
             −n²α̂/T2    −n³α̂³/T2² )
The diagonal elements are both negative, and the determinant is equal to
(n⁴α̂²/T2²)(α̂ψ′(α̂) − 1).
This is positive since αψ′(α) − 1 > 0 for all α > 0. This guarantees that H(α̂, β̂) is negative definite, so that (α̂, β̂) is at least a local maximum.
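Situation #3 can also be carried out numerically (a minimal sketch assuming NumPy and SciPy; the data below are simulated, not taken from the notes): invert H(α) = ψ(α) − log α with a root finder, set β̂ = T2/(nα̂), and confirm the Hessian at (α̂, β̂) is negative definite.

    import numpy as np
    from scipy.special import digamma, polygamma
    from scipy.optimize import brentq

    rng = np.random.default_rng(0)
    x = rng.gamma(shape=2.5, scale=1.5, size=200)   # simulated Gamma(alpha, beta) data
    n, T1, T2 = len(x), np.sum(np.log(x)), np.sum(x)

    # solve H(alpha) = psi(alpha) - log(alpha) = T1/n - log(T2/n) for alpha
    rhs = T1 / n - np.log(T2 / n)
    alpha_hat = brentq(lambda a: digamma(a) - np.log(a) - rhs, 1e-6, 1e6)
    beta_hat = T2 / (n * alpha_hat)

    # Hessian at the stationary point (formulas from the notes)
    H = np.array([[-n * polygamma(1, alpha_hat), -n / beta_hat],
                  [-n / beta_hat, -(2 * T2 - n * alpha_hat * beta_hat) / beta_hat**3]])
    print(alpha_hat, beta_hat)
    print("negative definite:", np.all(np.linalg.eigvalsh(H) < 0))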

5 Invariance Principle for MLEs

If η = τ(θ) and θ̂ is the MLE of θ, then η̂ = τ(θ̂) is the MLE of η. Comments:

1. If τ(θ) is a 1-1 function, this is a trivial theorem.

2. If τ(θ) is not 1-1, this is essentially true by definition of the induced likelihood (see later).

Example: X = (X1, X2, . . . , Xn) iid N(µ, σ²). The usual parameters θ = (µ, σ²) are related to the natural parameters η = (µ/σ², −1/(2σ²)) of the 2pef by a 1-1 function: η = τ(θ). The likelihood in terms of θ is

L1(θ) = (2πσ²)^(−n/2) e^(−nµ²/(2σ²)) e^((µ/σ²)T1 − (1/(2σ²))T2),
where T1 = Σ Xi and T2 = Σ Xi². Simple Example: X = (X1, X2, . . . , Xn) iid Bernoulli(p). It is known that the MLE of p is p̂ = X̄. Thus

1. The MLE of p² is p̂² = X̄².

2. The MLE of p(1 − p) is X̄(1 − X̄).

The function of p in 1 is 1-1, but the function in 2 is not.

5.1 Induced Likelihood

Definition 1. If η = τ(θ), then L*(η) ≡ sup_{θ: τ(θ)=η} L(θ).

Go back to the example X1, X2, . . . , Xn ∼ N(µ, σ²) iid. If the MLE η̂ of η is defined to be the value which maximizes L*(η), then it is easily seen that η̂ = τ(θ̂). The likelihood in terms of η is
L2(η) = (−π/η2)^(−n/2) e^(nη1²/(4η2)) e^(η1T1 + η2T2),
obtained by substituting into L1(θ)
µ = −η1/(2η2), σ² = −1/(2η2),
that is, evaluating L1 at
θ = (µ, σ²) = (−η1/(2η2), −1/(2η2)) = τ⁻¹(η).
Stated abstractly, L2(η) = L1(τ⁻¹(η)), so that L2 is maximized when τ⁻¹(η) = θ̂, that is, by η = τ(θ̂). The MLE of θ is known to be
θ̂ = (µ̂, σ̂²) = ( X̄, (1/n) Σ_{i=1}^n (Xi − X̄)² ),
so the invariance principle says the MLE of η is
η̂ = τ(θ̂) = ( µ̂/σ̂², −1/(2σ̂²) ).
Continuation of example: What is the MLE of α = µ + σ²? Note that α = g(µ, σ²) = µ + σ² is not a 1-1 function, but
α̂ = g(µ̂, σ̂²) = µ̂ + σ̂² = X̄ + SS/n,
where SS = Σ_{i=1}^n (Xi − X̄)². What are the MLEs of µ and σ²? With g1(x, y) = x and g2(x, y) = y, we have µ = g1(θ) and σ² = g2(θ), so that the MLEs are
µ̂ = g1(θ̂) = X̄, σ̂² = g2(θ̂) = SS/n.
Thus, the invariance principle implies: (µ̂, σ̂²) (MLE as a pair) = (µ̂ (MLE of µ), σ̂² (MLE of σ²)).
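The invariance calculation is easy to mirror numerically (a minimal sketch assuming NumPy; the data are simulated, not from the notes): compute (µ̂, σ̂²) and push it through τ.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=2.0, scale=3.0, size=500)   # simulated N(mu, sigma^2) data

    mu_hat = x.mean()                              # MLE of mu
    sigma2_hat = np.mean((x - mu_hat)**2)          # MLE of sigma^2 = SS/n

    # invariance: MLE of eta = (mu/sigma^2, -1/(2 sigma^2)) is tau(theta_hat)
    eta_hat = (mu_hat / sigma2_hat, -1.0 / (2 * sigma2_hat))
    # likewise the MLE of alpha = mu + sigma^2 is mu_hat + sigma2_hat
    alpha_hat = mu_hat + sigma2_hat
    print(mu_hat, sigma2_hat, eta_hat, alpha_hat)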

5.2 MLE for Exponential Families

The invariance principle for MLEs allows us to work with the natural parameter η (which is a 1-1 function of θ). 1pef:

f(x | θ) = c(θ)h(x) exp{w(θ)t(x)}

Natural parameter: η = w(θ). With a little abuse of notation (writing f(x | η) for f*(x | η) = f(x | w⁻¹(η)) and c(η) for c*(η) = c(w⁻¹(η))), we can write

f(x | η) = c(η)h(x) exp{ηt(x)}.

For clarity of notation, we will use x = (x1, . . . , xN) as the observed data and X = (X1, X2, . . . , XN) as the random data. If X1, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + Σ_{i=1}^N log h(xi) + η Σ_{i=1}^N t(xi).

Since by 3.32(a), E t(Xi) = −(∂/∂η) log c(η), we have

l′(η) = N (∂/∂η) log c(η) + Σ_{i=1}^N t(xi)    (1)
      = −E( Σ_{i=1}^N t(Xi) ) + Σ_{i=1}^N t(xi)
      = −E T(X) + T(x)

where T(X) = Σ_{i=1}^N t(Xi). Hence the condition for a stationary point is equivalent to:

EηT (X) = T (x)

Note that using (1),

l″(η) = N (∂²/∂η²) log c(η) = N{−Varη t(Xi)} < 0
for all η. Thus any interior stationary point (not on the boundary of Θ* = {w(θ) : θ ∈ Θ}) is automatically a global maximum so long as Θ* is convex. In one dimension (Θ* ⊂ R), this means Θ* must be an interval of some sort (possibly infinite). Ignoring this fine point,

for a 1pef, the log-likelihood will have a unique stationary point which will be the MLE. k-pef:

f(x | θ) = c(θ)h(x) exp{ Σ_{j=1}^k wj(θ) tj(x) }

Natural parameter: η = (η1, . . . , ηk) = (w1(θ), . . . , wk(θ)), that is, ηj = wj(θ).

f(x | η) = c(η)h(x) exp{ Σ_{j=1}^k ηj tj(x) }

If X1, X2, . . . , XN are iid from f(x | η), then

l(η) = N log c(η) + Σ_{i=1}^N log h(xi) + Σ_{j=1}^k ηj ( Σ_{i=1}^N tj(xi) )
∂l/∂ηj = N (∂/∂ηj) log c(η) + Σ_{i=1}^N tj(xi)    [note (∂/∂ηj) log c(η) = −E tj(Xi)]
       = −E( Σ_{i=1}^N tj(Xi) ) + Σ_{i=1}^N tj(xi)

∂²l/(∂ηj ∂ηl) = N (∂²/(∂ηj ∂ηl)) log c(η) = N( −Cov(tj(Xi), tl(Xi)) )

Thus, the equations for a stationary point, ∂l/∂ηj = 0, j = 1, . . . , k, are equivalent to

EηTj(X) = Tj(x), j = 1, . . . , k (2)

where Tj(X) = Σ_{i=1}^N tj(Xi) and Tj(x) = Σ_{i=1}^N tj(xi), or in vector notation,

EηT(X) = T(x), where T(X) = (T1(X), . . . , Tk(X)) and T(x) = (T1(x), . . . , Tk(x)).

The Hessian matrix H(η) = ( ∂²l/(∂ηi ∂ηj) )_{i,j=1}^k is given by

H(η) = −NΣ(η), where Σ(η) is the k × k covariance matrix of (t1(X1), t2(X1), . . . , tk(X1)). A covariance matrix will be positive definite (except in degenerate cases), so that H(η) will be negative definite for all η. Conclusion: An interior stationary point (i.e., a solution of (2)) must be the unique global maximum, and hence the MLE. This result also holds in the original parameterization, with (2) restated as

EθTj(X) = Tj(x), j = 1, . . . , k.

Connection with MOM: For a 1pef with t(x) = x, MOM and MLE agree. For a kpef with tj(x) = x^j, MOM and MLE agree. Why? Because then (2) is equivalent to the equations for the MOM estimator.
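A concrete instance (a minimal Python sketch with made-up counts, assuming NumPy and SciPy): the Poisson(λ) family is a 1pef with t(x) = x, so (2) reads Eλ Σ Xi = Nλ = Σ xi, and the MLE coincides with the MOM estimate λ̂ = x̄; a direct numerical maximization of the log-likelihood agrees.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    x = np.array([3, 1, 4, 0, 2, 5, 2, 3])        # hypothetical Poisson counts
    n, T = len(x), x.sum()

    # stationary-point condition E_lambda T(X) = n*lambda = T(x) gives lambda_hat = T/n
    lam_hat = T / n

    # cross-check by maximizing the Poisson log-likelihood numerically
    negloglik = lambda lam: -(T * np.log(lam) - n * lam - gammaln(x + 1).sum())
    res = minimize_scalar(negloglik, bounds=(1e-6, 100), method="bounded")
    print(lam_hat, res.x)                         # both approximately x-bar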

6 Revisiting the Gamma Example

The system of equations for the MLE of (α, β) may be easily derived directly from (2).

ET1(X) = T1(x)

ET2(X) = T2(x)
which becomes
E( Σ_{i=1}^n log Xi ) = n E log X1 = n(log β + ψ(α)) = T1(x)
E( Σ_{i=1}^n Xi ) = n E X1 = nαβ = T2(x)

The equations are the same as the equations for a stationary point derived earlier. For X ∼ Gamma(α, β), we have used:

E log X = ∫_0^∞ log x · x^(α−1) e^(−x/β) / (β^α Γ(α)) dx
        = ∫_0^∞ (log(x/β) + log β) (x/β)^(α−1) e^(−x/β) / Γ(α) · dx/β
        = ∫_0^∞ (log z + log β) z^(α−1) e^(−z) / Γ(α) dz
        = log β + (1/Γ(α)) ∫_0^∞ (z^(α−1) log z) e^(−z) dz    [note z^(α−1) log z = (∂/∂α) z^(α−1)]
        = log β + (1/Γ(α)) (∂/∂α) ∫_0^∞ z^(α−1) e^(−z) dz
        = log β + Γ′(α)/Γ(α) = log β + ψ(α).
Verifying the Stationary Point is a Global Maximum: The Gamma family is a 2pef (or a 1pef if α or β is held fixed). Switching to the natural parameters η1 = α − 1, η2 = −1/β (or just making the substitution λ = 1/β) simplifies the second derivatives w.r.t. η2 (or λ). The Hessian matrix is then negative definite for all η = (η1, η2), which is a sufficient condition for the stationary point to be the global maximum.

6.1 MLEs for More General Exponential Families

Proposition 1. If X ∼ Pθ, θ ∈ Θ, where Pθ has a joint pdf (pmf) from an n-variate k-parameter exponential family

f(x | θ) = c(θ)h(x) exp{ Σ_{j=1}^k wj(θ) Tj(x) }

for x ∈ R^n, θ ∈ Θ ⊂ R^k, then the MLE of θ based on the observed data x is the solution of the system of equations

EθTj(X) = Tj(x), j = 1, . . . , k (solve for θ), provided the solution (call it θ̂) satisfies

w(θˆ) ∈ interior of {w(θ): θ ∈ Θ}.

Proof. Essentially the same as for the ordinary kpef.

Example: Simple Linear Regression with known variance: Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1xi, σ0²), θ = (β0, β1)

The joint distribution of Y = (Y1, Y2, . . . , Yn) forms an exponential family. The natural sufficient statistic is
t(Y) = ( Σi Yi, Σi xiYi ).

Eθt(Y) = t(y) has the form
E( Σ Yi ) = Σ yi
E( Σ xiYi ) = Σ xiyi

Thus the MLE θ̂ = (β̂0, β̂1) is the solution of
Σi (β0 + β1xi) = Σi yi
Σi xi(β0 + β1xi) = Σi xiyi
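These are just the normal equations of least squares; a minimal sketch (made-up x and y, assuming NumPy) solves them directly for (β̂0, β̂1).

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # fixed covariates
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])       # hypothetical responses
    n = len(x)

    # system from E_theta t(Y) = t(y):
    #   n*b0      + b1*sum(x)   = sum(y)
    #   b0*sum(x) + b1*sum(x^2) = sum(x*y)
    A = np.array([[n, x.sum()],
                  [x.sum(), (x**2).sum()]])
    b = np.array([y.sum(), (x * y).sum()])
    beta0_hat, beta1_hat = np.linalg.solve(A, b)
    print(beta0_hat, beta1_hat)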

6.2 Sufficiency and MLEs

If T = T (X) is a sufficient statistic for θ, then there is an MLE which is a function of T . (If the MLE is unique, then we can say the MLE is a function of T ).

Proof. By the factorization criterion (FC),

f(x | θ) = g(T (x), θ)h(x).

Assume for convenience the MLE is unique. Then the MLE is
θ̂(x) = argmaxθ f(x | θ)

= argmaxθ g(T(x), θ),
which is clearly a function of T(x).

MLE coincides with “Least Squares” for independent normal rv’s with constant variance σ² (known or unknown). Y1, Y2, . . . , Yn are independent with

Yi ∼ N(β0 + β1xi, σ0²), θ = (β0, β1),
or more generally,

Yi ∼ N(g(xi, β), σ0²),
where β is possibly a vector. Then

L(β, σ²) = f(y | θ) = (1/√(2πσ²))^n exp{ −(1/(2σ²)) Σ_{i=1}^n (yi − g(xi, β))² },
where θ = (β, σ²).

For any σ² (fixed arbitrary value), maximizing L(β, σ²) with respect to β is equivalent to minimizing Σ_{i=1}^n (yi − g(xi, β))² with respect to β. Hence MLE and least squares give the same estimates of the β parameters.
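To see the equivalence numerically (a minimal sketch with made-up data, assuming NumPy and SciPy; any fixed σ² works), maximize the normal log-likelihood over β and compare with the ordinary least-squares fit.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0.5, 1.5, 2.5, 3.5, 4.5])
    y = np.array([1.2, 2.8, 4.9, 7.1, 9.3])       # hypothetical data
    sigma2 = 1.0                                  # known variance; any fixed value works

    def negloglik(beta):
        resid = y - (beta[0] + beta[1] * x)
        return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + (resid**2).sum() / (2 * sigma2)

    mle = minimize(negloglik, x0=np.zeros(2)).x   # (beta0_hat, beta1_hat) from the likelihood
    lsq = np.polyfit(x, y, deg=1)                 # least squares, returns [slope, intercept]
    print(mle, lsq[::-1])                         # same line up to numerical tolerance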
