
NMST432 ADVANCED REGRESSION MODELS

MAXIMUM LIKELIHOOD ESTIMATION THEORY
SUMMARY OF NOTATION AND MAIN RESULTS

Definitions

Consider a random sample X = (X_1, ..., X_n) of independent identically distributed random variables (or vectors), each with density f(x | θ_X) with respect to a σ-finite measure µ. We assume that f(x | θ_X) ∈ F, where

    F = { distributions with density f(x | θ), θ ∈ Θ ⊆ R^d }

represents a parametric model for the distribution of the data. The model F must satisfy the model identifiability condition: for any θ_1 ≠ θ_2 it holds that f(x | θ_1) ≠ f(x | θ_2). In other words, no distribution can be parametrized by several different parameter vectors.

Because of independence, the joint density of the random sample X_1, ..., X_n is ∏_{i=1}^n f(x_i | θ_X). The maximum likelihood estimator θ̂ of the parameter θ_X is the point from Θ that maximizes the joint density evaluated at the observed values of X_1, ..., X_n.

Definition 1 (likelihood, log-likelihood).

• The random function

    L_n(θ) := ∏_{i=1}^n f(X_i | θ)

  is called the likelihood function for the parameter θ in the model F.

• The random function

    ℓ_n(θ) := log L_n(θ) = ∑_{i=1}^n log f(X_i | θ)

  is called the log-likelihood function.

Definition 2 (maximum likelihood estimator). The maximum likelihood estimator (MLE) of the parameter θ_X in the model F is defined as

    θ̂_n = arg max_{θ ∈ Θ} L_n(θ).

Note. Since the logarithm is strictly increasing, L_n(θ) and ℓ_n(θ) attain their maximum at the same point.
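
Example (illustrative; the exponential model below is an assumption chosen for concreteness, not part of the original summary). In the exponential model with density f(x | θ) = θ e^{−θx}, x > 0, θ > 0, the log-likelihood and its maximizer are available in closed form:

    ℓ_n(θ) = ∑_{i=1}^n log f(X_i | θ) = n log θ − θ ∑_{i=1}^n X_i,    ∂ℓ_n(θ)/∂θ = n/θ − ∑_{i=1}^n X_i = 0    ⟹    θ̂_n = 1 / X̄_n,

and the second derivative −n/θ² < 0 confirms that this root is indeed the maximum.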

Definition 3. Let P and Q be measures on the same probability space with densities p and q with respect to the same σ-finite measure µ (for example, µ = P + Q). Define

    K(P, Q) = { E_P log [p(X)/q(X)] = ∫_{x: p(x)>0} log [p(x)/q(x)] p(x) dµ(x)    if P[q(X) = 0] = 0,
              { +∞                                                                otherwise.

K(P, Q) is called the Kullback-Leibler distance (divergence).
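
Example (illustrative; the exponential distributions below are an assumption chosen for concreteness). Let P and Q be exponential distributions with rates λ_1 and λ_2. Using E_P X = 1/λ_1,

    K(P, Q) = E_P log [λ_1 e^{−λ_1 X} / (λ_2 e^{−λ_2 X})] = log(λ_1/λ_2) − (λ_1 − λ_2)/λ_1 = log(λ_1/λ_2) + λ_2/λ_1 − 1 ≥ 0,

with equality if and only if λ_1 = λ_2, while K(Q, P) = log(λ_2/λ_1) + λ_1/λ_2 − 1 differs in general, illustrating the asymmetry discussed in the note below.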

Note. In fact, K(P, Q) is a pseudo-distance: it holds that K(P, Q) ≥ 0, and K(P, Q) = 0 if and only if P = Q, but it is not symmetric: K(P, Q) ≠ K(Q, P).

Theorem 1. Suppose the support set S = {x ∈ R : f(x | θ) > 0} does not depend on the parameter θ. Denote by P_X the induced probability measure of the random variable X_i and by P_θ the probability measure associated with the density f(x | θ). Then for any θ ≠ θ_X

    (1/n) log [L_n(θ_X) / L_n(θ)] = (1/n) ∑_{i=1}^n log [f(X_i | θ_X) / f(X_i | θ)] → K(P_X, P_θ)    P_X-almost surely,

and hence P[ℓ_n(θ_X) > ℓ_n(θ)] → 1 as n → ∞.

Note. When the number of observations increases to infinity, the (log-)likelihood function at the true parameter is with a large probability larger than the (log-)likelihood function at any other parameter. This observation justifies the idea of estimating the parameter by maximizing the log-likelihood over all possible parameter vectors.

The calculation of the maximum likelihood estimator

The maximum likelihood estimator is usually determined by differentiating the log-likelihood. The first derivative is set equal to zero and it is verified that the matrix of second derivatives is negative definite.

Definition 4 (score, information).

• The random vector

    U(θ | X_i) := ∂/∂θ log f(X_i | θ)

  is called the score function for the parameter θ in the model F.

• The random vector

    U_n(θ | X) := ∑_{i=1}^n U(θ | X_i) = ∑_{i=1}^n ∂/∂θ log f(X_i | θ)

  is called the score statistic.

• The random matrix

    I(θ | X_i) := −∂/∂θ^T U(θ | X_i) = −∂²/(∂θ ∂θ^T) log f(X_i | θ)

  is called the contribution of the i-th observation to the information matrix.

• The random matrix

    I_n(θ | X) := −(1/n) ∂/∂θ^T U_n(θ | X) = (1/n) ∑_{i=1}^n I(θ | X_i)

  is called the observed information matrix.

• The matrix

    I(θ) := E I(θ | X_i) = E [−∂²/(∂θ ∂θ^T) log f(X_i | θ)]

  is called the expected (Fisher) information matrix.
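
Example (illustrative; continuing the exponential example above). For f(x | θ) = θ e^{−θx},

    U(θ | X_i) = ∂/∂θ (log θ − θ X_i) = 1/θ − X_i,    I(θ | X_i) = −∂/∂θ U(θ | X_i) = 1/θ²,    I(θ) = E I(θ | X_i) = 1/θ²,

so in this particular model the observed and expected information matrices coincide.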


If the set Θ is open, the MLE θ̂_n solves the system of equations U_n(θ̂_n | X) = 0, that is,

    ∑_{i=1}^n ∂/∂θ log f(X_i | θ̂_n) = 0.

This system is called the likelihood equations.

The solution to the likelihood equations need not exist. Sometimes there may be multiple solutions, at most one of which is the MLE. If I_n(θ̂_n | X) > 0 (the observed information is positive definite at θ̂_n), we know that θ̂_n is at least a local maximum. If I_n(θ | X) > 0 for every θ ∈ Θ, the log-likelihood function is concave and the solution to the likelihood equations must be the global maximum and hence the MLE.

In most cases no explicit solution can be found and the MLE must be calculated by numerical methods. There are two commonly used numerical methods for solving the likelihood equations. Let θ̂^(r) be the r-th approximation to the solution.

• The Newton-Raphson method: θ̂^(r+1) = θ̂^(r) + [n I_n(θ̂^(r) | X)]^{-1} U_n(θ̂^(r) | X)

• The Fisher scoring method: θ̂^(r+1) = θ̂^(r) + [n I(θ̂^(r))]^{-1} U_n(θ̂^(r) | X)

They are iterated until the change in θ̂ from one iteration to the next is sufficiently small or until U_n(θ̂) is sufficiently close to 0. The only difference between the two methods is in the information matrix: Newton-Raphson uses the observed information, Fisher scoring uses the expected information. Both require setting θ̂^(1), the starting value for the numerical approximation, and are sensitive to its choice.
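
The following sketch (not part of the original notes; the Cauchy location model, the simulated data and the function names are assumptions made for illustration) compares the two iterations on the model f(x | θ) = 1/[π(1 + (x − θ)²)], where no closed-form MLE exists and the expected information is I(θ) = 1/2.

```python
import numpy as np

def score(theta, x):
    """Score statistic U_n(theta | x) for the Cauchy location model."""
    u = x - theta
    return np.sum(2 * u / (1 + u**2))

def observed_info(theta, x):
    """Observed information n * I_n(theta | x) = -d U_n / d theta."""
    u = x - theta
    return np.sum(2 * (1 - u**2) / (1 + u**2)**2)

def newton_raphson(x, theta0, tol=1e-8, max_iter=100):
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, x) / observed_info(theta, x)
        theta += step
        if abs(step) < tol:
            break
    return theta

def fisher_scoring(x, theta0, tol=1e-8, max_iter=100):
    # Expected information: I(theta) = 1/2, so n * I(theta) = n / 2.
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, x) / (len(x) / 2)
        theta += step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(1)
x = rng.standard_cauchy(200) + 3.0   # illustrative data, true location 3.0
start = np.median(x)                 # a consistent starting value
print(newton_raphson(x, start), fisher_scoring(x, start))
```

For this model the observed information can be negative far from the maximum, so Newton-Raphson is more sensitive to the starting value than Fisher scoring; the sample median is a convenient consistent starting point.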

Properties of the maximum likelihood estimator

Maximum likelihood estimators are consistent and asymptotically normal as long as so-called regularity conditions are satisfied.

Conditions (Regularity conditions for maximum likelihood estimators).

R1. The number of parameters d in the model F is constant.

R2. The support set S = {x ∈ R : f(x | θ) > 0} does not depend on the parameter θ.

R3. The parameter space Θ is an open set.

R4. The density f(x | θ) is a sufficiently smooth function of θ (at least twice continuously differentiable).

R5. The matrix I(θ) is finite, regular, and positive definite in a neighborhood of θ_X.

R6. The order of differentiation and integration can be interchanged in expressions such as

    ∂/∂θ ∫ h(x, θ) dµ(x) = ∫ ∂/∂θ h(x, θ) dµ(x),

where h(x, θ) is either f(x | θ) or ∂ f(x | θ)/∂θ.


Note. Take the identity

    ∫_{−∞}^{∞} f(x | θ) dµ(x) = 1

and differentiate both sides of the equation twice with respect to θ. Regularity condition R6 implies

    ∫_{−∞}^{∞} ∂/∂θ f(x | θ) dµ(x) = ∫_{−∞}^{∞} ∂²/(∂θ ∂θ^T) f(x | θ) dµ(x) = 0.    (1)

Theorem 2 (consistency of the MLE). Let conditions R1–R6 hold. Then there exist n_0 and a sequence θ̂_n (n ≥ n_0) of solutions to the likelihood equations U_n(θ̂_n | X) = 0 such that θ̂_n →^P θ_X.

Note. If the log-likelihood is strictly concave, the likelihood equations have a unique solution, which is the MLE. It converges in probability to the true parameter. If the log-likelihood is not strictly concave, the likelihood equations may have multiple solutions representing local maxima and minima of the log-likelihood. There is one solution among them (the one closest to θ_X) which provides a consistent sequence of estimators. Other solutions may not be close to θ_X and may not converge to it.

Note. If there exists a sequence θ̃_n of other estimators that are guaranteed to be consistent (for example, moment estimators of θ_X), a consistent MLE can be obtained by taking the root of the likelihood equations that is closest to θ̃_n. Alternatively, one can perform one step of the Newton-Raphson algorithm with θ̃_n as the starting value.

Theorem 3 (score function properties). Let conditions R1–R6 hold. Then

(i) E U(θ_X | X_i) = 0,  var U(θ_X | X_i) = I(θ_X).

(ii) (1/√n) U_n(θ_X | X) →^D N_d(0, I(θ_X)).

Note. The Fisher information matrix at θ_X can be calculated in two different ways: from Definition 4 (the expectation of minus the second derivative of the log-density) or from Theorem 3 (the variance of the score function).
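
Example (illustrative; the Poisson model is an assumption chosen for concreteness). For f(x | λ) = e^{−λ} λ^x / x!, both routes give the same answer:

    U(λ | X_i) = X_i/λ − 1,    E[−∂²/∂λ² log f(X_i | λ)] = E[X_i/λ²] = 1/λ,    var U(λ | X_i) = var(X_i)/λ² = 1/λ.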

Theorem 4 (asymptotic normality of the MLE). Suppose conditions R1–R6 hold. Let θ̂_n be a consistent sequence of solutions to the likelihood equations. Then

    √n (θ̂_n − θ_X) →^D N_d(0, I^{-1}(θ_X)).

Note.

• The asymptotic variance of the MLE is equal to the inverse of the Fisher information. More information means better precision of estimation.

• The asymptotic variance of the MLE is in a certain sense optimal. Other estimators (e.g., moment estimators) cannot have a smaller asymptotic variance.
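
A minimal sketch of how Theorem 4 is used in practice (illustrative only; the Poisson model, the simulated data and the function name are assumptions): since λ̂_n = X̄_n and I(λ) = 1/λ, an approximate (1 − α) confidence interval for λ is λ̂_n ± z_{1−α/2} √(λ̂_n/n).

```python
import numpy as np
from scipy import stats

def poisson_wald_ci(x, alpha=0.05):
    """Approximate (1 - alpha) Wald confidence interval for the Poisson mean,
    based on sqrt(n)(lambda_hat - lambda) -> N(0, I^{-1}(lambda)) with I(lambda) = 1/lambda."""
    x = np.asarray(x)
    n = len(x)
    lam_hat = x.mean()                 # MLE of lambda
    se = np.sqrt(lam_hat / n)          # estimated asymptotic standard error
    z = stats.norm.ppf(1 - alpha / 2)
    return lam_hat - z * se, lam_hat + z * se

rng = np.random.default_rng(2)
sample = rng.poisson(lam=4.0, size=150)   # illustrative data
print(poisson_wald_ci(sample))
```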

Theorem 5 (asymptotic distribution of the likelihood ratio). Suppose conditions R1–R6 hold. Let θ̂_n be a consistent sequence of solutions to the likelihood equations. Then

    2 log [L_n(θ̂_n) / L_n(θ_X)] = 2 (ℓ_n(θ̂_n) − ℓ_n(θ_X)) →^D χ²_d.

Theorem 6 (the ∆ method for the MLE). Suppose conditions R1–R6 hold. Let θ̂_n be a consistent sequence of solutions to the likelihood equations. Take q : Θ → R^k a continuously differentiable function. Denote ν_X = q(θ_X) and D(θ) = ∂q(θ)/∂θ. Then ν̂_n = q(θ̂_n) is the MLE of the parameter ν_X and

    √n (ν̂_n − ν_X) →^D N_k(0, D(θ_X) I^{-1}(θ_X) D(θ_X)^T).
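
Example (illustrative; the Bernoulli model and the log-odds transformation are assumptions chosen for concreteness). Let X_i be Bernoulli(p) with p̂_n = X̄_n and I(p) = 1/[p(1 − p)], and take ν = q(p) = log[p/(1 − p)]. Then D(p) = 1/[p(1 − p)] and

    D(p) I^{-1}(p) D(p)^T = [1/(p(1 − p))] · p(1 − p) · [1/(p(1 − p))] = 1/[p(1 − p)],

so √n (log[p̂_n/(1 − p̂_n)] − log[p/(1 − p)]) →^D N(0, 1/[p(1 − p)]).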

Tests based on maximum likelihood theory

The theory of the MLE can be used to derive tests of simple and composite hypotheses about the parameter θ_X.

Testing of simple hypotheses

We want to test the null hypothesis H_0: θ_X = θ_0 against the alternative H_1: θ_X ≠ θ_0, where θ_0 ∈ Θ. It is a simple hypothesis because there is just a single distribution in the model F with the density f(x | θ_0).

We will introduce three different tests for testing H_0.

Definition 5.

(i) The statistic

    λ_n = L_n(θ̂_n) / L_n(θ_0)

is called the likelihood ratio.

(ii) The statistic

    W_n = n (θ̂_n − θ_0)^T Î_n(θ̂_n) (θ̂_n − θ_0)

is called the Wald statistic.

(iii) The statistic

    R_n = (1/n) U_n(θ_0 | X)^T Î_n^{-1}(θ_0) U_n(θ_0 | X)

is called the Rao (score) statistic.

Note. The symbol Î_n denotes any consistent estimator of the Fisher information matrix. Three different estimators can be used in the Wald and Rao statistics:

1. Î_n(θ) = I_n(θ | X) = −(1/n) ∑_{i=1}^n ∂²/(∂θ ∂θ^T) log f(X_i | θ)   (the observed information matrix)

2. Î_n(θ) = (1/n) ∑_{i=1}^n U(θ | X_i)^{⊗2}, where u^{⊗2} = u u^T   (the empirical variance of the score function)

3. Î_n(θ) = I(θ)   (the Fisher information matrix)

Theb most common choice for the Wald statistic is In(θn) = In(θn X). The most common choice for θ 1 n U θ 2 | the Rao statistic is In( 0)= n ∑i=1 ( 0 Xi)⊗ . | b b b Note. b The likelihood ratio requires the calculation of θ and L or ℓ . It does not require the calcula- • n n n tion of Un and In. The Wald statistic requires the calculation of θb and I . It does not require the calculation of • n n Ln and Un. b Rao statistic requires the calculation of U a I .b It doesb not require the calculation of θ and L . • n n n n b b


Note. If d = 1 (one parameter) and θ_0 = 0, then the Wald statistic can be written as

    W_n = [ θ̂_n / √( n^{-1} Î_n^{-1}(θ̂_n) ) ]²,

where n^{-1} Î_n^{-1}(θ̂_n) is the estimator of the asymptotic variance of θ̂_n.

Theorem 7. Suppose conditions R1–R6 are satisfied. Let the hypothesis H_0: θ_X = θ_0 hold. Then:

(i) 2 log λ_n = 2 (ℓ_n(θ̂_n) − ℓ_n(θ_0)) →^D χ²_d

(ii) W_n →^D χ²_d

(iii) R_n →^D χ²_d

Note. If H_0 holds, θ̂_n should be close to θ_0, L_n(θ̂_n) should be close to L_n(θ_0), and U_n(θ_0 | X) should be close to 0. Under H_0, all three test statistics have values close to 0. Their large values testify against H_0.

Corollary. Denote by χ²_d(1 − α) the (1 − α)-quantile of the χ²_d distribution. Consider tests of H_0: θ_X = θ_0 against H_1: θ_X ≠ θ_0 defined by the rule: reject H_0 in favor of H_1 if

(i) 2 log λ_n ≥ χ²_d(1 − α)   (likelihood ratio test)

(ii) W_n ≥ χ²_d(1 − α)   (Wald test)

(iii) R_n ≥ χ²_d(1 − α)   (score test)

Each of these tests has asymptotically (for n → ∞) the level α.

Note. It can be shown that these three tests are asymptotically equivalent. For large sample sizes, their results are almost identical. With smaller sample sizes, their results can differ. Investigations of the small-sample behavior of these test statistics revealed that the likelihood ratio test has the best properties and the Wald test is the worst of the three. Thus, in practical applications, the likelihood ratio test should be preferred.

Note. Under normality, the three test statistics are identical.

Estimation in the presence of nuisance parameters and testing of composite hypotheses

It is frequently desirable to estimate and test just a small number of parameters in a model that contains a much larger number of parameters. We divide the parameter vector into two subsets: the parameters of interest and the remaining parameters, called nuisance parameters.

Let θ be divided into θ_A, containing the first m components of θ, and θ_B, containing the remaining d − m components of θ. We have

    θ = (θ_A^T, θ_B^T)^T = (θ_1, ..., θ_m, θ_{m+1}, ..., θ_d)^T.


We want to test the hypothesis H_0*: θ_X ∈ Θ_0 against H_1*: θ_X ∉ Θ_0, where Θ_0 = {θ : θ_A = θ_A0} ⊂ Θ. We want to know whether the first m components of θ_X are equal to the vector of constants θ_A0, regardless of the other d − m components of θ_X.

This is not a simple null hypothesis because there are many distributions in the model F that satisfy H_0*.

All the vectors and matrices appearing in the notation of maximum likelihood estimation theory are decomposed into the first m components (part A) and the remaining d − m components (part B). For example,

    θ̂_n = ( θ̂_An )        U_n(θ) = ( U_An(θ) )        I(θ) = ( I_AA(θ)  I_AB(θ) )
           ( θ̂_Bn ),                 ( U_Bn(θ) ),               ( I_BA(θ)  I_BB(θ) ),    etc.

The following lemma is useful for inverting the decomposed information matrix.

Lemma 8 (block matrix inversion). Let the matrix

    I = ( I_AA  I_AB )
        ( I_BA  I_BB )

be of full rank. Then there exists an inverse matrix to I and it can be expressed as

    I^{-1} = ( I^{AA}  I^{AB} )
             ( I^{BA}  I^{BB} ),

where

    I^{AA} = I_{AA.B}^{-1},
    I^{AB} = −I_{AA.B}^{-1} I_AB I_BB^{-1},
    I^{BA} = −I_{BB.A}^{-1} I_BA I_AA^{-1},
    I^{BB} = I_{BB.A}^{-1},
    I_{AA.B} = I_AA − I_AB I_BB^{-1} I_BA,
    I_{BB.A} = I_BB − I_BA I_AA^{-1} I_AB.

If the null hypothesis H_0*: θ_X ∈ Θ_0 holds, we know that θ_AX = θ_A0, but we do not know the value of θ_BX. We can estimate θ_BX by the maximum likelihood method applied to the nested submodel

    F_0 = { distributions with density f(x | (θ_A, θ_B)), θ_A = θ_A0, θ_B ∈ Θ_B ⊆ R^{d−m} },

with d − m unknown parameters.

Denote the maximum likelihood estimator of θ_X in the submodel F_0 by θ̃_n = (θ̃_An^T, θ̃_Bn^T)^T, where θ̃_An = θ_A0 and θ̃_Bn solves the system of likelihood equations

    U_Bn(θ_A0, θ̃_Bn) = 0.

The Fisher information matrix for θ_B in this submodel is I_BB(θ_X). By Theorems 3 and 4 applied to the submodel F_0, we get

    (1/√n) U_Bn(θ_X) →^D N_{d−m}(0, I_BB(θ_X))

and

    √n (θ̃_Bn − θ_BX) →^D N_{d−m}(0, I_BB^{-1}(θ_X)).

On the other hand, Theorems 3 and 4 and Lemma 8 applied to the larger model F imply

    (1/√n) U_Bn(θ_X) →^D N_{d−m}(0, I_BB(θ_X))

and

    √n (θ̂_Bn − θ_BX) →^D N_{d−m}(0, I_{BB.A}^{-1}(θ_X)),

where (dropping the argument θ_X)

    I_{BB.A}^{-1} = (I_BB − I_BA I_AA^{-1} I_AB)^{-1} ≥ I_BB^{-1}.
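
A quick numerical check of Lemma 8 and of the ordering I_{BB.A}^{-1} ≥ I_BB^{-1} (illustrative only; the partitioned matrix below is an arbitrary positive definite example, not taken from the notes):

```python
import numpy as np

# Arbitrary positive definite "information matrix" partitioned into blocks A (2x2) and B (2x2).
rng = np.random.default_rng(4)
M = rng.normal(size=(4, 4))
I = M @ M.T + 4 * np.eye(4)

I_AA, I_AB = I[:2, :2], I[:2, 2:]
I_BA, I_BB = I[2:, :2], I[2:, 2:]

I_BB_A = I_BB - I_BA @ np.linalg.inv(I_AA) @ I_AB     # I_{BB.A}
I_inv = np.linalg.inv(I)

# The lower-right block of I^{-1} equals I_{BB.A}^{-1} (Lemma 8).
print(np.allclose(I_inv[2:, 2:], np.linalg.inv(I_BB_A)))

# I_{BB.A}^{-1} - I_BB^{-1} is positive semidefinite, i.e. the asymptotic variance of the
# MLE of theta_B is never smaller when theta_A must be estimated as well.
diff = np.linalg.inv(I_BB_A) - np.linalg.inv(I_BB)
print(np.all(np.linalg.eigvalsh(diff) >= -1e-10))
```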

Thus, the asymptotic variance of the MLE of the parameter θ_BX depends on whether or not θ_AX is known. If θ_AX is known (which is true if H_0* holds), the asymptotic variance of the MLE θ̃_Bn is generally smaller than the asymptotic variance of the MLE θ̂_Bn, which does not assume a known θ_AX. However, when I_BA = 0 (the estimators of θ_AX and θ_BX are asymptotically independent), the asymptotic variances of θ̃_Bn and θ̂_Bn are the same. Then it does not matter whether or not θ_AX is known.

Let us generalize the three test statistics introduced in Definition 5 of the previous section to testing the composite hypothesis H_0*: θ_X ∈ Θ_0 against H_1*: θ_X ∉ Θ_0, where Θ_0 = {θ : θ_A = θ_A0} ⊂ Θ.

Definition 6.

(i) The statistic

    λ*_n = L_n(θ̂_n) / L_n(θ̃_n)

is called the likelihood ratio.

(ii) The statistic

    W*_n = n (θ̂_An − θ_A0)^T Î_{AA.B}(θ̂_n) (θ̂_An − θ_A0)

is called the Wald statistic.

(iii) The statistic

    R*_n = (1/n) U_n(θ̃_n)^T Î_n^{-1}(θ̃_n) U_n(θ̃_n)

is called the Rao (score) statistic.

Note.

• Obviously, λ*_n ≥ 1.

• The expression Î_{AA.B} in the Wald statistic means the inverse of the upper left block of the matrix Î_n^{-1}.

• Since U_Bn(θ̃_n) = 0, the Rao statistic can be written as

    R*_n = (1/n) U_An(θ̃_n)^T Î_{AA.B}^{-1}(θ̃_n) U_An(θ̃_n).

• The Rao statistic does not require the calculation of the MLE θ̂_n in the larger model; it only needs the MLE θ̃_n in the submodel. This is often much easier to obtain.
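
A minimal sketch of the likelihood ratio test for a composite hypothesis (illustrative only; the normal model, the simulated data and the function name are assumptions). In the N(µ, σ²) model with σ² as a nuisance parameter and H_0*: µ = µ_0, the full-model MLEs are µ̂ = X̄_n and σ̂² = (1/n) ∑(X_i − X̄_n)², the restricted MLE is σ̃² = (1/n) ∑(X_i − µ_0)², and the statistic reduces to 2 log λ*_n = n log(σ̃²/σ̂²) with m = 1 tested parameter.

```python
import numpy as np
from scipy import stats

def normal_mean_lr_test(x, mu0, alpha=0.05):
    """Likelihood ratio test of H0*: mu = mu0 in the N(mu, sigma^2) model,
    with sigma^2 treated as a nuisance parameter (m = 1 tested parameter)."""
    x = np.asarray(x)
    n = len(x)
    sigma2_hat = np.mean((x - x.mean()) ** 2)      # full-model MLE of sigma^2
    sigma2_tilde = np.mean((x - mu0) ** 2)         # restricted MLE of sigma^2 under H0*
    lr = n * np.log(sigma2_tilde / sigma2_hat)     # 2 log lambda*_n
    p_value = stats.chi2.sf(lr, df=1)
    return lr, p_value, lr >= stats.chi2.ppf(1 - alpha, df=1)

rng = np.random.default_rng(5)
x = rng.normal(loc=1.2, scale=2.0, size=100)       # illustrative data
print(normal_mean_lr_test(x, mu0=1.0))
```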

Theorem 9. Let the null hypothesis H_0*: θ_X ∈ Θ_0, where Θ_0 = {θ : θ_A = θ_A0}, hold. Then

(i) 2 log λ*_n = 2 (ℓ_n(θ̂_n) − ℓ_n(θ̃_n)) →^D χ²_m;

(ii) W*_n →^D χ²_m;

(iii) R*_n →^D χ²_m.

Note. Under H_0*, we expect θ̂_n to be close to θ̃_n, L_n(θ̂_n) to be close to L_n(θ̃_n), and U_n(θ̃_n) to be close to 0. Large values of the three test statistics testify against the null hypothesis.

Corollary. Let χ²_m(1 − α) be the (1 − α)-quantile of the χ²_m distribution. Consider tests of H_0*: θ_X ∈ Θ_0, where Θ_0 = {θ : θ_A = θ_A0}, against H_1*: θ_X ∉ Θ_0 given by the rule: reject H_0* in favor of H_1* if

(i) 2 log λ*_n ≥ χ²_m(1 − α)   (the likelihood ratio test)

(ii) W*_n ≥ χ²_m(1 − α)   (the Wald test)

(iii) R*_n ≥ χ²_m(1 − α)   (the score test)

Then each of these three tests has asymptotically (for n → ∞) the level α.

Note. The number of degrees of freedom in the reference χ²_m distribution is equal to the number of tested parameters.

Note. These three tests are asymptotically equivalent under the null hypothesis as well as under local alternatives. With small or moderate sample sizes, the likelihood ratio test has the best properties and the Wald test is the worst of the three. In practical applications, the likelihood ratio test should be preferred.

Note. Let m = 1, θ_AX = θ_Xj, and θ_A0 = 0. Consider the test of the hypothesis H_0*: θ_Xj = 0 against H_1*: θ_Xj ≠ 0 (zero value of the j-th parameter in the presence of other parameters that are unspecified by the hypothesis). Then the Wald statistic can be written as

    W*_n = [ θ̂_jn / √( n^{-1} (Î_n^{-1})_{jj} ) ]²,

where n^{-1} (Î_n^{-1})_{jj} is the estimator of the asymptotic variance of θ̂_jn. This is the square of the test statistic that statistical software typically evaluates to test the zero value of a single model parameter.
