STAT 598Y Statistical Learning Theory Instructor: Jian Zhang

Lecture 14: Maximum Likelihood Estimation

In this lecture we will consider one of the most popular approaches in statistics: maximum likelihood estimation (MLE). In order to apply MLE, we need to make stronger assumptions about the distribution of (X, Y). Often such assumptions are reasonable in practical applications.

1 Maximum Likelihood Estimation

Consider the model
\[
y_i = h(x_i) + \epsilon_i, \qquad i = 1, \ldots, n,
\]

where we assume the ε_i's are i.i.d. zero-mean noise variables. Note that both classification and regression can be represented in this way. Furthermore, we assume that the errors have a distribution F(μ) with mean μ = 0. Then we have y_i ∼ F(h(x_i)).
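For concreteness, one standard choice consistent with these assumptions (an illustration rather than a requirement) is Gaussian noise, F(μ) = N(μ, σ²), which gives
\[
\epsilon_i \sim N(0, \sigma^2) \quad \Longrightarrow \quad y_i \sim N(h(x_i), \sigma^2).
\]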

Assuming that the probability density function of y_i is p_h(y_i), we can write down the joint likelihood as
\[
\prod_{i=1}^n p_h(y_i).
\]
The MLE estimator seeks the model which maximizes the likelihood, or equivalently, minimizes the negative log-likelihood. This is reasonable since the MLE selects the model under which the observed data are most probable. Formally, let Θ be a parameter space, and assume that we have the model

\[
y_i \sim p_{\theta^*}(y), \qquad i = 1, \ldots, n,
\]

for i.i.d. observations y_1, …, y_n, where θ* ∈ Θ is the true parameter. Here we do not include the covariates x_i for simplicity; it is straightforward to add them to the model. The MLE of θ* is

\[
\hat{\theta}_n = \arg\max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(y_i)
= \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n -\log p_\theta(y_i).
\]
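As a quick illustration of this recipe, the following sketch computes the MLE by numerically minimizing the average negative log-likelihood, assuming a Gaussian location model y_i ∼ N(θ, 1) (a hypothetical choice for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: MLE as minimization of the average negative log-likelihood,
# assuming the Gaussian location model y_i ~ N(theta, 1).
rng = np.random.default_rng(0)
theta_star = 2.0                               # true parameter (hypothetical)
y = rng.normal(theta_star, 1.0, size=1000)     # i.i.d. observations

def avg_neg_log_likelihood(theta):
    # (1/n) sum_i -log p_theta(y_i); the additive constant 0.5*log(2*pi)
    # is dropped since it does not affect the minimizer
    return np.mean(0.5 * (y - theta) ** 2)

theta_hat = minimize_scalar(avg_neg_log_likelihood).x
print(theta_hat, y.mean())   # for this model the MLE is the sample mean
```

For this particular model the minimizer coincides with the sample mean, which the last line verifies numerically.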

By the strong law of large numbers we have
\[
\frac{1}{n} \sum_{i=1}^n -\log p_\theta(y_i) \xrightarrow{a.s.} E[-\log p_\theta(Y)].
\]
So we can think of maximum likelihood as trying to minimize E[−log p_θ(Y)]. On the other hand, consider the quantity

\[
E[\log p_{\theta^*}(Y) - \log p_\theta(Y)]
= E\left[\log \frac{p_{\theta^*}(Y)}{p_\theta(Y)}\right]
= \int \log \frac{p_{\theta^*}(y)}{p_\theta(y)}\, p_{\theta^*}(y)\, dy
= KL(p_{\theta^*}, p_\theta) \ge 0
\]

where KL(q, p) is the KL-divergence between two distributions q and p. Although not a distance (it is not symmetric), the KL-divergence measures the discrepancy between the two distributions. Also note that the last inequality becomes an equality if and only if p_θ = p_{θ*}. This is because
\[
KL(q, p) = E_q\left[\log \frac{q(X)}{p(X)}\right]
= -E_q\left[\log \frac{p(X)}{q(X)}\right]
\ge -\log E_q\left[\frac{p(X)}{q(X)}\right]
= -\log \int p(x)\, dx = 0.
\]
By Jensen's inequality, the equality happens if and only if p(x) = q(x) for all x. So we can see that if we minimize E[−log p_θ(Y)], the minimum it can achieve is E[−log p_{θ*}(Y)], and it achieves this minimum when θ = θ*, the true parameter value we want to find.

It is easy to see that MLE can be thought of as a special case of empirical risk minimization, where the loss function is simply the negative log-likelihood: ℓ(θ, y_i) = −log p_θ(y_i). Another observation is that minimizing the negative log-likelihood results in the least squares estimator if the errors follow a normal distribution. The empirical risk is
\[
\hat{R}_n(\theta) = \frac{1}{n} \sum_{i=1}^n -\log p_\theta(y_i)
\]
and the risk is R(θ) = E[ℓ(θ, Y)] = E[−log p_θ(Y)]. The excess risk of θ is
\[
R(\theta) - R(\theta^*) = E[-\log p_\theta(Y) + \log p_{\theta^*}(Y)] = KL(p_{\theta^*}, p_\theta),
\]
the KL-divergence between p_{θ*} and p_θ.
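To make the least squares connection explicit, consider Gaussian errors ε_i ∼ N(0, σ²) with σ² known, so that p_h(y_i) is the N(h(x_i), σ²) density. Then
\[
-\sum_{i=1}^n \log p_h(y_i) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^n \left( y_i - h(x_i) \right)^2,
\]
and minimizing the negative log-likelihood over h is equivalent to minimizing Σ_{i=1}^n (y_i − h(x_i))², which is exactly the least squares criterion.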

2 Hellinger Distance and Consistency of MLE

We define the Hellinger distance between two distributions p and q as

\[
h(p, q) = \left( \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx \right)^{1/2}.
\]
It is easy to see that it is always nonnegative, symmetric, and satisfies the triangle inequality. Furthermore, h(p, q) = 0 if and only if p = q. The Hellinger distance plays an important role in studying MLE because it can be upper bounded by the KL-divergence, as shown in the following lemma.

Lemma. We have h²(q, p) ≤ ½ KL(q, p).

Proof. Using the fact that ½ log v ≤ √v − 1 for all v > 0, applied with v = p(x)/q(x), we have

\[
\frac{1}{2} \log \frac{p(x)}{q(x)} \le \sqrt{\frac{p(x)}{q(x)}} - 1,
\quad \text{i.e.,} \quad
\frac{1}{2} \log \frac{q(x)}{p(x)} \ge 1 - \sqrt{\frac{p(x)}{q(x)}},
\]
and thus, taking expectations with respect to q,
\[
\frac{1}{2} KL(q, p) \ge 1 - E_q\left[ \sqrt{\frac{p(X)}{q(X)}} \right].
\]
The result follows since
\[
1 - E_q\left[ \sqrt{\frac{p(X)}{q(X)}} \right]
= 1 - \int \sqrt{q(x)}\, \sqrt{p(x)}\, dx
= \frac{1}{2} \int p(x)\, dx + \frac{1}{2} \int q(x)\, dx - \int \sqrt{p(x)\, q(x)}\, dx
= \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx
= h^2(p, q) = h^2(q, p). \qquad \blacksquare
\]


This lemma says that convergence in KL-divergence implies convergence in Hellinger distance. So if we can establish convergence of the KL-divergence, then the consistency of MLE (in Hellinger distance) follows.
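As a quick numerical sanity check of the lemma (an illustration with two Gaussian densities chosen arbitrarily, not part of the argument above):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Check h^2(q, p) <= (1/2) KL(q, p) for two illustrative Gaussian densities.
p = norm(loc=0.0, scale=1.0).pdf
q = norm(loc=1.0, scale=1.0).pdf

# squared Hellinger distance: (1/2) * integral of (sqrt(p) - sqrt(q))^2
h2, _ = quad(lambda x: 0.5 * (np.sqrt(p(x)) - np.sqrt(q(x))) ** 2, -20, 20)

# KL divergence: KL(q, p) = integral of q * log(q / p)
kl, _ = quad(lambda x: q(x) * np.log(q(x) / p(x)), -20, 20)

print(h2, 0.5 * kl)   # roughly 0.118 <= 0.25, consistent with the lemma
```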

The convergence of the KL-divergence can be seen as follows. Since θ̂_n maximizes the likelihood over θ ∈ Θ, we have
\[
\sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)}
= \sum_{i=1}^n \log p_{\theta^*}(y_i) - \sum_{i=1}^n \log p_{\hat{\theta}_n}(y_i) \le 0.
\]
Thus, dividing by n and adding and subtracting KL(p_{θ*}, p_{θ̂_n}),
\[
\frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)}
- KL(p_{\theta^*}, p_{\hat{\theta}_n}) + KL(p_{\theta^*}, p_{\hat{\theta}_n}) \le 0.
\]
So we have
\[
KL(p_{\theta^*}, p_{\hat{\theta}_n})
\le \left| \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_{\hat{\theta}_n}(y_i)} - KL(p_{\theta^*}, p_{\hat{\theta}_n}) \right|.
\]
If the law of large numbers holds uniformly over Θ, i.e.,
\[
\sup_{\theta \in \Theta} \left| \frac{1}{n} \sum_{i=1}^n \log \frac{p_{\theta^*}(y_i)}{p_\theta(y_i)} - KL(p_{\theta^*}, p_\theta) \right| \to 0,
\]
then the bound applies to the data-dependent θ̂_n as well, and the right-hand side above converges to zero. As a result, the convergence of the KL-divergence can be established.
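The consistency argument can also be seen in a small simulation (a sketch assuming the Gaussian location model y_i ∼ N(θ*, 1), where the MLE is the sample mean and the KL-divergence has the closed form KL(p_{θ*}, p_θ) = (θ − θ*)²/2):

```python
import numpy as np

# Simulation: KL(p_theta*, p_theta_hat) shrinks as n grows, assuming
# the Gaussian location model y_i ~ N(theta*, 1).
rng = np.random.default_rng(1)
theta_star = 2.0

for n in [10, 100, 1000, 10000]:
    y = rng.normal(theta_star, 1.0, size=n)
    theta_hat = y.mean()                       # MLE of the Gaussian mean
    kl = 0.5 * (theta_hat - theta_star) ** 2   # closed-form KL divergence
    print(n, kl)                               # tends to 0 as n grows
```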
