STAT 598Y Statistical Learning Theory
Instructor: Jian Zhang

Lecture 14: Maximum Likelihood Estimation

In this lecture we consider one of the most popular approaches in statistics: maximum likelihood estimation (MLE). In order to apply MLE, we need to make stronger assumptions about the distribution of $(X, Y)$. Often such assumptions are reasonable in practical applications.

1 Maximum Likelihood Estimation

Consider the model
$$y_i = h(x_i) + \epsilon_i, \quad i = 1, \ldots, n,$$
where we assume the $\epsilon_i$'s are iid zero-mean noise. Note that both classification and regression can be represented in this way. Furthermore, we assume that the errors follow a distribution $F(\mu)$ with mean $\mu = 0$, so that $y_i \sim F(h(x_i))$. Writing the probability density function of $y_i$ as $p_h(y_i)$, the joint likelihood is
$$\prod_{i=1}^{n} p_h(y_i).$$
The MLE seeks the model which maximizes the likelihood, or equivalently, minimizes the negative log-likelihood. This is reasonable since the MLE selects the model under which the observed data are most probable.

Formally, let $\Theta$ be a parameter space, and assume that we have the model
$$y_i \sim p_{\theta^*}(y), \quad i = 1, \ldots, n,$$
for iid observations $y_1, \ldots, y_n$, where $\theta^* \in \Theta$ is the true parameter. Here we omit the covariates $x_i$ for simplicity; it is straightforward to include them in the model. The MLE of $\theta^*$ is
$$\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^{n} p_\theta(y_i) = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} -\log p_\theta(y_i).$$
By the strong law of large numbers we have
$$\frac{1}{n} \sum_{i=1}^{n} -\log p_\theta(y_i) \xrightarrow{a.s.} E[-\log p_\theta(Y)].$$
So we can think of maximum likelihood as trying to minimize $E[-\log p_\theta(Y)]$. On the other hand, consider the quantity
$$E[\log p_{\theta^*}(Y) - \log p_\theta(Y)] = E\left[\log \frac{p_{\theta^*}(Y)}{p_\theta(Y)}\right] = \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)} \, p_{\theta^*}(y) \, dy = KL(p_\theta, p_{\theta^*}) \geq 0,$$
where $KL(q, p) = E_p[\log(p/q)]$ is the KL-divergence between two distributions $q$ and $p$ (note the convention used in these notes: the expectation is taken under the second argument). Although not a distance measure (it is not symmetric), the KL-divergence measures the discrepancy between the two distributions. Also note that the last inequality becomes an equality if and only if $p_\theta = p_{\theta^*}$. This is because
$$KL(q, p) = E_p\left[\log \frac{p}{q}\right] = -E_p\left[\log \frac{q}{p}\right] \geq -\log E_p\left[\frac{q}{p}\right] = -\log \int q(x) \, dx = 0.$$
By Jensen's inequality (applied to the convex function $-\log$), equality holds if and only if $p(x) = q(x)$ for all $x$. So we can see that if we minimize $E[-\log p_\theta(Y)]$, the minimum it can achieve is $E[-\log p_{\theta^*}(Y)]$, and it achieves this minimum when $\theta = \theta^*$, the true parameter value we want to find.

It is easy to see that MLE can be thought of as a special case of empirical risk minimization, where the loss function is simply the negative log-likelihood: $\ell(\theta, y_i) = -\log p_\theta(y_i)$. Another observation is that minimizing the negative log-likelihood results in the least squares estimator if the errors follow a normal distribution. The empirical risk is
$$\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} -\log p_\theta(y_i)$$
and the risk is
$$R(\theta) = E[\ell(\theta, Y)] = E[-\log p_\theta(Y)].$$
The excess risk of $\theta$ is
$$R(\theta) - R(\theta^*) = E[-\log p_\theta(Y) + \log p_{\theta^*}(Y)] = KL(p_\theta, p_{\theta^*}),$$
the KL-divergence between $p_\theta$ and $p_{\theta^*}$.
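Since the notes observe that Gaussian errors make the negative log-likelihood proportional to the squared error, a minimal numerical sketch may help. It assumes a one-parameter Gaussian location model with known unit variance; the variable names and the use of scipy are illustrative and not part of the original notes.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical setup: y_i = theta* + eps_i with eps_i ~ N(0, 1).
rng = np.random.default_rng(0)
theta_star = 2.0
y = theta_star + rng.standard_normal(500)

def neg_log_likelihood(theta):
    # Empirical risk R_n(theta): average negative log-density of N(theta, 1).
    return np.mean(0.5 * np.log(2 * np.pi) + 0.5 * (y - theta) ** 2)

# Minimizing the empirical negative log-likelihood over theta.
mle = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded").x
print(mle, y.mean())  # both approximate theta*
```

Up to optimizer tolerance, the numerical MLE coincides with the sample mean, which is exactly the least squares estimate in this model.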
2 Hellinger Distance and Consistency of MLE

We define the Hellinger distance between two distributions $p$ and $q$ as
$$h(p, q) = \left( \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx \right)^{1/2}.$$
It is easy to see that it is always nonnegative, symmetric, and satisfies the triangle inequality. Furthermore, $h(p, q) = 0$ if and only if $p = q$. The Hellinger distance plays an important role in studying MLE because it can be upper bounded by the KL-divergence, as shown in the following lemma.

Lemma. We have $h^2(q, p) \leq \frac{1}{2} KL(q, p)$.

Proof. Using the fact that $\frac{1}{2} \log v \leq \sqrt{v} - 1$ for all $v > 0$, we have
$$\frac{1}{2} \log \frac{q(x)}{p(x)} \leq \sqrt{\frac{q(x)}{p(x)}} - 1,$$
and thus, taking expectations under $p$ and negating,
$$\frac{1}{2} KL(q, p) \geq 1 - E_p\left[\sqrt{\frac{q(X)}{p(X)}}\right].$$
The result follows since
$$1 - E_p\left[\sqrt{\frac{q(X)}{p(X)}}\right] = 1 - \int \sqrt{q(x)} \sqrt{p(x)} \, dx$$
$$= \frac{1}{2} \int p(x) \, dx + \frac{1}{2} \int q(x) \, dx - \int \sqrt{q(x) p(x)} \, dx$$
$$= \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx = h^2(p, q) = h^2(q, p). \qquad \blacksquare$$

This lemma says that convergence in KL-divergence leads to convergence in Hellinger distance. So if we can establish convergence in KL-divergence, then the consistency of MLE can be proven. The convergence of the KL-divergence can be seen as follows. Since $\hat{\theta}_n$ maximizes the likelihood over $\theta \in \Theta$, we have
$$\sum_{i=1}^{n} \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} = \sum_{i=1}^{n} \log p_{\theta^*}(y_i) - \sum_{i=1}^{n} \log p_{\hat{\theta}_n}(y_i) \leq 0.$$
Thus, adding and subtracting $KL(p_{\hat{\theta}_n}, p_{\theta^*})$,
$$\frac{1}{n} \sum_{i=1}^{n} \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} - KL(p_{\hat{\theta}_n}, p_{\theta^*}) + KL(p_{\hat{\theta}_n}, p_{\theta^*}) \leq 0.$$
So we have
$$KL(p_{\hat{\theta}_n}, p_{\theta^*}) \leq \left| \frac{1}{n} \sum_{i=1}^{n} \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} - KL(p_{\hat{\theta}_n}, p_{\theta^*}) \right|.$$
If the law of large numbers holds uniformly, we have
$$\sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^{n} \log \frac{p_{\theta^*}(y_i)}{p_\theta(y_i)} - KL(p_\theta, p_{\theta^*}) \right| \to 0,$$
which applies to $\hat{\theta}_n$ in particular. As a result, the convergence of the KL-divergence, and hence the consistency of the MLE, can be established.
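As a quick numerical sanity check of the lemma (a sketch, not part of the original notes), one can compare the squared Hellinger distance with half the KL-divergence for a pair of discrete distributions, with sums playing the role of the integrals above. The helper names `kl` and `hellinger_sq` are illustrative.

```python
import numpy as np

def kl(q, p):
    # KL-divergence in the notes' convention: KL(q, p) = E_p[log(p/q)].
    return float(np.sum(p * np.log(p / q)))

def hellinger_sq(p, q):
    # Squared Hellinger distance: (1/2) * sum (sqrt(p) - sqrt(q))^2.
    return float(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Two strictly positive distributions on five points.
p = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
q = np.array([0.30, 0.10, 0.20, 0.15, 0.25])

# The lemma: h^2(q, p) <= (1/2) KL(q, p).
assert hellinger_sq(q, p) <= 0.5 * kl(q, p)
print(hellinger_sq(q, p), 0.5 * kl(q, p))
```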
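To see the consistency argument in action, here is a minimal simulation, again an illustrative sketch rather than anything from the notes, assuming a Gaussian location model with unit variance. For two unit-variance Gaussians the KL-divergence has the closed form $(\hat{\theta}_n - \theta^*)^2 / 2$, which the code uses directly; by the lemma, the squared Hellinger distance shrinks at least as fast.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 1.5

for n in [10, 100, 1000, 10000]:
    y = theta_star + rng.standard_normal(n)
    theta_hat = y.mean()  # the MLE in the Gaussian location model
    # Closed-form KL between N(theta_hat, 1) and N(theta*, 1).
    kl_div = 0.5 * (theta_hat - theta_star) ** 2
    # The lemma bounds the squared Hellinger distance by half the KL.
    print(n, kl_div, "h^2 <=", 0.5 * kl_div)
```

The printed KL values decay toward zero as $n$ grows, matching the consistency argument above.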