
Maximum Likelihood Estimator for Variance is Biased: Proof

Dawen Liang
Carnegie Mellon University
[email protected]

1 Introduction

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a statistical model. It is widely used in machine learning algorithms, as it is intuitive and easy to formulate given the model assumptions. The basic idea underlying MLE is to express the likelihood of the data as a function of the model parameters, and then find the parameter values that maximize this likelihood. For example, suppose we are given $N$ one-dimensional data points $x_i$, $i = 1, 2, \dots, N$, and we assume the data points are drawn i.i.d. from a Gaussian distribution. Then we can estimate the mean $\mu$ and variance $\sigma^2$ of the true distribution via MLE. By definition, $\mu = \mathbb{E}[x]$ and $\sigma^2 = \mathbb{E}[(x - \mu)^2]$. Thus, intuitively, the mean estimator $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ and the variance estimator $s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$ follow. It is easy to check that these are exactly the estimators derived in the MLE setting; see Chapter 2.3.4 of Bishop (2006).
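As a quick illustration, the following sketch computes these two estimators for a batch of Gaussian samples. It is a minimal example assuming NumPy is available; the function name `mle_mean_var` is illustrative and not part of any code accompanying this note.

```python
import numpy as np

def mle_mean_var(x):
    """MLE estimators for a 1-D Gaussian: sample mean and the 1/N variance.

    Illustrative helper, not from the original note.
    """
    n = x.shape[0]
    x_bar = x.sum() / n                 # mean estimator: (1/N) * sum(x_i)
    s2 = ((x - x_bar) ** 2).sum() / n   # variance estimator: (1/N) * sum((x_i - x_bar)^2)
    return x_bar, s2

# Example: N samples from a Gaussian with known mu and sigma^2
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=100)   # mu = 2, sigma^2 = 9
print(mle_mean_var(x))
```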

2 Biased/Unbiased Estimation

In statistics, we evaluate the “goodness” of an estimator by checking whether it is “unbiased”. By “unbiased”, we mean that the expectation of the estimator equals the true value, e.g. if $\mathbb{E}[\bar{x}] = \mu$, then the mean estimator is unbiased. We will now show that this equation actually holds for the mean estimator.

\begin{align*}
\mathbb{E}[\bar{x}] &= \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N} x_i\Big] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[x] \\
&= \frac{1}{N} \cdot N \cdot \mathbb{E}[x] \\
&= \mathbb{E}[x] = \mu
\end{align*}

The first line makes use of the assumption that the samples are drawn i.i.d. from the true distribution, so $\mathbb{E}[x_i]$ is simply $\mathbb{E}[x]$. The proof above shows that the mean estimator is unbiased. Now we move to the variance estimator. At first glance, the variance estimator $s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$ should follow, since the mean estimator $\bar{x}$ is unbiased. However, this is not the case:

\begin{align*}
\mathbb{E}[s^2] &= \mathbb{E}\Big[\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2\Big] \\
&= \frac{1}{N}\,\mathbb{E}\Big[\sum_{i=1}^{N} x_i^2 - 2\sum_{i=1}^{N} x_i\bar{x} + \sum_{i=1}^{N}\bar{x}^2\Big]
\end{align*}

We know $\sum_{i=1}^{N} x_i = N\bar{x}$ and $\sum_{i=1}^{N}\bar{x}^2 = N\bar{x}^2$. Plugging these into the derivation:

\begin{align*}
\mathbb{E}[s^2] &= \frac{1}{N}\,\mathbb{E}\Big[\sum_{i=1}^{N} x_i^2 - 2N\bar{x}^2 + N\bar{x}^2\Big] \\
&= \frac{1}{N}\,\mathbb{E}\Big[\sum_{i=1}^{N} x_i^2 - N\bar{x}^2\Big] \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}[x_i^2] - \mathbb{E}[\bar{x}^2] \\
&= \mathbb{E}[x^2] - \mathbb{E}[\bar{x}^2]
\end{align*}

According to the alternative definition of variance, $\sigma_x^2 = \mathbb{E}[x^2] - \mathbb{E}[x]^2$ and, similarly, $\sigma_{\bar{x}}^2 = \mathbb{E}[\bar{x}^2] - \mathbb{E}[\bar{x}]^2$, where the random variable is $\bar{x}$. Note that $\mathbb{E}[\bar{x}] = \mathbb{E}[x] = \mu$. Plugging these two equations into the derivation:

\begin{align*}
\mathbb{E}[s^2] &= (\sigma_x^2 + \mu^2) - (\sigma_{\bar{x}}^2 + \mu^2) \\
&= \sigma_x^2 - \sigma_{\bar{x}}^2
\end{align*}

\[
\sigma_{\bar{x}}^2 = \mathrm{VAR}[\bar{x}] = \mathrm{VAR}\Big[\frac{1}{N}\sum_{i=1}^{N} x_i\Big] = \frac{1}{N^2}\,\mathrm{VAR}\Big[\sum_{i=1}^{N} x_i\Big]
\]
Since the samples are drawn i.i.d.,

\[
\mathrm{VAR}\Big[\sum_{i=1}^{N} x_i\Big] = \sum_{i=1}^{N}\mathrm{VAR}[x] = N \cdot \mathrm{VAR}[x]
\]
Thus,
\[
\sigma_{\bar{x}}^2 = \frac{1}{N}\,\mathrm{VAR}[x] = \frac{1}{N}\,\sigma_x^2
\]
Plugging this back into the derivation of $\mathbb{E}[s^2]$,
\[
\mathbb{E}[s^2] = \frac{N-1}{N}\,\sigma_x^2
\]
Therefore, $\mathbb{E}[s^2] \neq \sigma_x^2$, which shows that we tend to underestimate the variance. To overcome this bias, the maximum likelihood estimator for the variance can be slightly modified to take it into account:
\[
s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2
\]
It is easy to show that this modified variance estimator is unbiased.
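As a sanity check on the $\frac{N-1}{N}$ factor, the short simulation below repeatedly draws $N$ Gaussian samples and averages the $\frac{1}{N}$ and $\frac{1}{N-1}$ variance estimators over many trials. This is a minimal sketch assuming NumPy; the variable names and the specific parameter values are illustrative choices, not taken from the note.

```python
import numpy as np

# Monte Carlo check of the bias of the 1/N variance estimator.
rng = np.random.default_rng(0)
mu, sigma2 = 0.0, 4.0          # true mean and variance (illustrative values)
N, trials = 10, 100_000        # small N makes the bias easy to see

biased, unbiased = [], []
for _ in range(trials):
    x = rng.normal(mu, np.sqrt(sigma2), size=N)
    x_bar = x.mean()
    ss = ((x - x_bar) ** 2).sum()
    biased.append(ss / N)          # MLE estimator, expectation (N-1)/N * sigma^2
    unbiased.append(ss / (N - 1))  # corrected estimator, expectation sigma^2

print(np.mean(biased))    # roughly 3.6, i.e. (N-1)/N * sigma^2
print(np.mean(unbiased))  # roughly 4.0, i.e. sigma^2
```

The averaged $\frac{1}{N}$ estimate falls short of the true variance by almost exactly the predicted factor, while the $\frac{1}{N-1}$ version recovers it.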

References

Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006. ISBN 0387310738.
