Maximum likelihood estimation

Nuno Vasconcelos, UCSD

Bayesian decision theory
• recall that we have
  – Y – state of the world
  – X – observations
  – g(x) – decision function
  – L[g(x), y] – loss of predicting y with g(x)
• the Bayes decision rule is the rule that minimizes the risk

Risk = E_{X,Y}[L(g(X), Y)]

• for the "0-1" loss

L[g(x), y] = \begin{cases} 1, & g(x) \neq y \\ 0, & g(x) = y \end{cases}

• the optimal decision rule is the maximum a posteriori

probability (MAP) rule

MAP rule
• we have shown that it can be implemented in any of the three following ways

  – 1) i^*(x) = \arg\max_i P_{Y|X}(i \mid x)

  – 2) i^*(x) = \arg\max_i \left[ P_{X|Y}(x \mid i)\, P_Y(i) \right]

  – 3) i^*(x) = \arg\max_i \left[ \log P_{X|Y}(x \mid i) + \log P_Y(i) \right]

• by introducing a "model" for the class-conditional distributions we can express this as a simple equation
  – e.g. for the multivariate Gaussian

P_{X|Y}(x \mid i) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_i|}} \exp\left\{ -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}

The Gaussian classifier

[figure: the discriminant, i.e. the surface where P_{Y|X}(1 \mid x) = 0.5]

• the solution is

i^*(x) = \arg\min_i \left[ d_i(x, \mu_i) + \alpha_i \right]

with

d_i(x, y) = (x - y)^T \Sigma_i^{-1} (x - y)

\alpha_i = \log\left( (2\pi)^d |\Sigma_i| \right) - 2 \log P_Y(i)

• the optimal rule is to assign x to the closest class

• closest is measured with the Mahalanobis distance d_i(x, y)
• this can be further simplified in special cases
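To make the rule concrete, here is a minimal numpy sketch of the Gaussian discriminant above; the two-class parameters at the bottom are invented for illustration, not taken from the lecture.

```python
import numpy as np

def gaussian_map_rule(x, means, covs, priors):
    """Return the class i minimizing d_i(x, mu_i) + alpha_i, where d_i is
    the Mahalanobis distance and alpha_i penalizes covariance volume
    and low prior probability."""
    d = x.shape[0]
    scores = []
    for mu, Sigma, prior in zip(means, covs, priors):
        diff = x - mu
        mahalanobis = diff @ np.linalg.inv(Sigma) @ diff
        alpha = np.log((2 * np.pi) ** d * np.linalg.det(Sigma)) - 2 * np.log(prior)
        scores.append(mahalanobis + alpha)
    return int(np.argmin(scores))

# invented two-class example
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 2 * np.eye(2)]
priors = [0.7, 0.3]
print(gaussian_map_rule(np.array([0.5, 0.2]), means, covs, priors))  # -> 0
print(gaussian_map_rule(np.array([2.8, 3.1]), means, covs, priors))  # -> 1
```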

Geometric interpretation
• for Gaussian classes with equal covariance Σ_i = σ²I

w = \frac{\mu_i - \mu_j}{\sigma^2}

x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2} \log\frac{P_Y(i)}{P_Y(j)} \, (\mu_i - \mu_j)

[figure: two isotropic Gaussians with means µ_i, µ_j and equal variance σ²; the boundary is the hyperplane through x_0 with normal w]

Geometric interpretation
• for Gaussian classes with equal but arbitrary covariance Σ

w = \Sigma^{-1}(\mu_i - \mu_j)

x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log\frac{P_Y(i)}{P_Y(j)}}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

[figure: two Gaussians with shared covariance; the boundary is the hyperplane through x_0 with normal w, which is in general not parallel to µ_i − µ_j]
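As a sketch of these formulas (the means, covariance, and priors below are invented), the following computes the hyperplane normal w and the point x_0; with equal priors, x_0 reduces to the midpoint of the means.

```python
import numpy as np

def boundary(mu_i, mu_j, Sigma, p_i, p_j):
    """Hyperplane parameters for two Gaussians with a shared covariance:
    normal vector w and point x0 through which the boundary passes."""
    diff = mu_i - mu_j
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ diff
    x0 = (mu_i + mu_j) / 2 - (np.log(p_i / p_j) / (diff @ Sigma_inv @ diff)) * diff
    return w, x0

mu_i, mu_j = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
w, x0 = boundary(mu_i, mu_j, Sigma, 0.5, 0.5)
print(w)   # normal to the boundary, not parallel to mu_i - mu_j in general
print(x0)  # [0, 0]: the midpoint of the means when the priors are equal
```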

Bayesian decision theory

• advantages:
  – the BDR is optimal and cannot be beaten
  – Bayes keeps you honest
  – models reflect a causal interpretation of the problem, which is how we think
  – natural decomposition into "what we knew already" (the prior) and "what the data tells us" (the class-conditional distribution, CCD)
  – no need for heuristics to combine these two sources of information
  – the BDR is, almost invariably, intuitive
  – Bayes rule, the chain rule, and marginalization enable modularity, and scalability to very complicated models and problems
• problems:
  – the BDR is optimal only insofar as the models are correct

Implementation
• we do have an optimal solution

w = \Sigma^{-1}(\mu_i - \mu_j)

x_0 = \frac{\mu_i + \mu_j}{2} - \frac{\log\frac{P_Y(i)}{P_Y(j)}}{(\mu_i - \mu_j)^T \Sigma^{-1} (\mu_i - \mu_j)} \, (\mu_i - \mu_j)

• but in practice we do not know the values of the parameters µ_i, Σ, P_Y(i)
  – we have to somehow estimate these values
  – this is OK: we can come up with estimates from a training set
  – e.g. use the average value as an estimate for the mean, which yields the plug-in rule (see the sketch below)

w = \hat\Sigma^{-1}(\hat\mu_i - \hat\mu_j)

x_0 = \frac{\hat\mu_i + \hat\mu_j}{2} - \frac{\log\frac{\hat P_Y(i)}{\hat P_Y(j)}}{(\hat\mu_i - \hat\mu_j)^T \hat\Sigma^{-1} (\hat\mu_i - \hat\mu_j)} \, (\hat\mu_i - \hat\mu_j)
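A sketch of the plug-in rule on synthetic data (everything below, including the generating distributions, is invented for the demonstration): estimate the means, shared covariance, and priors from labeled samples, then reuse the boundary formulas with the hats.

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic training sets for two classes with a common covariance
X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(200, 2))

# plug-in (ML) estimates of the unknown parameters
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
centered = np.vstack([X0 - mu0, X1 - mu1])
Sigma = centered.T @ centered / len(centered)   # pooled covariance, 1/n
p0 = p1 = 0.5                                   # equal class counts

diff = mu0 - mu1
Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ diff
x0 = (mu0 + mu1) / 2 - (np.log(p0 / p1) / (diff @ Sigma_inv @ diff)) * diff

def classify(x):
    # class 0 on the side of the hyperplane that w points toward
    return 0 if w @ (x - x0) > 0 else 1

print(classify(np.array([-1.5, 0.3])))  # -> 0
print(classify(np.array([1.8, -0.4])))  # -> 1
```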

Important
• warning: at this point all optimality claims for the BDR cease to be valid!
• the BDR is guaranteed to achieve the minimum loss only when we use the true probabilities
• when we "plug in" the probability estimates, we could be implementing a classifier that is quite distant from the optimal
  – e.g. if the P_{X|Y}(x|i) look like the example above, I could never approximate them well by parametric models (e.g. Gaussian)

Maximum likelihood
• this seems pretty serious
  – how should I get these probabilities then?
• we rely on the maximum likelihood (ML) principle
• this has three steps:
  – 1) we choose a parametric model for all probabilities
  – to make this clear we denote the vector of parameters by Θ and the class-conditional distributions by

P_{X|Y}(x \mid i; \Theta)

  – note that Θ is NOT a random variable (otherwise it would have to show up as a subscript)
  – it is simply a parameter, and the probabilities are a function of this parameter

Maximum likelihood
• three steps:
  – 2) we assemble a collection of datasets

D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\}

a set of examples drawn independently from class i

– 3) we select the parameters of class i to be the ones that maximize the probability of the data from that class

\Theta_i^* = \arg\max_\Theta P_{X|Y}(D^{(i)} \mid i; \Theta) = \arg\max_\Theta \log P_{X|Y}(D^{(i)} \mid i; \Theta)

  – as before, it does not really make any difference whether we maximize probabilities or their logs

Maximum likelihood
• since
  – each sample D^{(i)} is considered independently
  – the parameter Θ_i is estimated only from sample D^{(i)}
• we simply have to repeat the procedure for all classes
• so, from now on, we omit the class variable

\Theta^* = \arg\max_\Theta P_X(D; \Theta) = \arg\max_\Theta \log P_X(D; \Theta)

• the function P_X(D; Θ) is called the likelihood of the parameter Θ with respect to the data
• or simply the likelihood function

Maximum likelihood
• note that the likelihood is a function of the parameters Θ
• it does not have the same shape as the density itself
  – e.g. the likelihood function of a Gaussian is not bell-shaped
• the likelihood is defined only after we have a sample

P_X(d; \Theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(d - \mu)^2}{2\sigma^2} \right\}

Maximum likelihood
• given a sample, to obtain the ML estimate we need to solve

\Theta^* = \arg\max_\Theta P_X(D; \Theta)

• when Θ is a scalar this is high-school calculus

• we have a maximum when
  – the first derivative is zero
  – the second derivative is negative
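A quick numeric illustration (the sample is invented): evaluating the Gaussian log-likelihood of a fixed sample over a grid of candidate means shows a curve in the parameter, whose maximum sits exactly at the sample mean, where the first derivative vanishes.

```python
import numpy as np

data = np.array([2.0, 3.5, 4.0, 5.5])   # fixed sample (invented)
sigma2 = 1.0                             # variance held fixed for simplicity

def log_likelihood(mu):
    """Gaussian log-likelihood of the whole sample, as a function of mu."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (data - mu) ** 2 / (2 * sigma2))

grid = np.linspace(0.0, 8.0, 1601)       # candidate values of the parameter
values = np.array([log_likelihood(mu) for mu in grid])
print(grid[np.argmax(values)])           # 3.75: the maximizer ...
print(data.mean())                       # ... is the sample mean
```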

The gradient
• in higher dimensions, the generalization of the derivative is the gradient
• the gradient of a function f(w) at z is

\nabla f(z) = \left( \frac{\partial f}{\partial w_0}(z), \ldots, \frac{\partial f}{\partial w_{n-1}}(z) \right)^T

• the gradient has a nice geometric interpretation
  – it points in the direction of maximum growth of the function
  – which makes it perpendicular to the contours where the function is constant

[figure: contour plot of f(x, y) with gradient vectors ∇f(x_0, y_0) and ∇f(x_1, y_1) drawn perpendicular to the level curves]

The gradient
• note that if ∇f = 0
  – there is no direction of growth
  – also −∇f = 0, so there is no direction of decrease
  – we are either at a local minimum, a local maximum, or a "saddle" point
• conversely, at a local min, max, or saddle point there is no direction of growth or decrease, i.e. ∇f = 0
• this shows that we have a critical point if and only if ∇f = 0
• to determine which type, we need second-order conditions
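A minimal finite-difference check of the gradient (f below is an arbitrary example function): the numeric gradient matches the analytic one away from the critical point and vanishes at it.

```python
import numpy as np

def f(v):
    x, y = v
    return (x - 1) ** 2 + 2 * (y + 2) ** 2   # example function, minimum at (1, -2)

def numeric_gradient(f, v, h=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = np.zeros_like(v)
    for k in range(len(v)):
        e = np.zeros_like(v)
        e[k] = h
        grad[k] = (f(v + e) - f(v - e)) / (2 * h)
    return grad

print(numeric_gradient(f, np.array([2.0, 0.0])))   # ~[2, 8], points uphill
print(numeric_gradient(f, np.array([1.0, -2.0])))  # ~[0, 0] at the critical point
```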

The Hessian
• the extension of the second-order derivative is the Hessian matrix

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_0^2}(x) & \cdots & \frac{\partial^2 f}{\partial x_0 \partial x_{n-1}}(x) \\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_{n-1} \partial x_0}(x) & \cdots & \frac{\partial^2 f}{\partial x_{n-1}^2}(x) \end{bmatrix}

• at each point x, it gives us the quadratic function

x^t\, \nabla^2 f(x)\, x

that best approximates f(x)

The Hessian
• this means that, when the gradient is zero at x, we have
  – a maximum when the function can be approximated by an "upwards-facing" (concave) quadratic
  – a minimum when the function can be approximated by a "downwards-facing" (convex) quadratic
  – a saddle point otherwise

[figure: surfaces illustrating a maximum, a minimum, and a saddle point]

The Hessian
• for any matrix M, the function x^t M x is
  – an upwards-facing (concave) quadratic when M is negative definite
  – a downwards-facing (convex) quadratic when M is positive definite
  – a saddle otherwise
• hence, all that matters is the definiteness of the Hessian
• we have a maximum when the Hessian is negative definite
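A sketch of this second-order test in code (the example matrices are invented): classify a critical point by the signs of the Hessian's eigenvalues.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-10):
    """Second-order test: all eigenvalues negative -> maximum,
    all positive -> minimum, mixed signs -> saddle."""
    eigvals = np.linalg.eigvalsh(hessian)  # Hessians are symmetric
    if np.all(eigvals < -tol):
        return "maximum"
    if np.all(eigvals > tol):
        return "minimum"
    return "saddle (or degenerate)"

print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, -1.0]])))  # maximum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 1.0]])))    # minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -1.0]])))   # saddle
```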

Maximum likelihood
• in summary, given a sample, we need to solve

\Theta^* = \arg\max_\Theta P_X(D; \Theta)

• the solutions are the parameters such that

\nabla_\Theta P_X(D; \Theta) = 0

\theta^t\, \nabla_\Theta^2 P_X(D; \Theta)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n

• note that you always have to check the second-order condition!

Maximum likelihood
• let's consider the Gaussian example

• given a sample {T_1, \ldots, T_N} of independent points
• the likelihood is

P_X(D; \Theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(T_i - \mu)^2}{2\sigma^2} \right\}

Maximum likelihood
• and the log-likelihood is

\log P_X(D; \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2

• the derivative with respect to the mean is zero when

\frac{\partial}{\partial \mu}\log P_X(D; \Theta) = \frac{1}{\sigma^2}\sum_{i=1}^{N}(T_i - \mu) = 0

• or

\mu^* = \frac{1}{N}\sum_{i=1}^{N} T_i

• note that this is just the sample mean

Maximum likelihood
• and, using the same log-likelihood,

\log P_X(D; \Theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(T_i - \mu)^2

• the derivative with respect to the variance is zero when

\frac{\partial}{\partial \sigma^2}\log P_X(D; \Theta) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(T_i - \mu)^2 = 0

• or

(\sigma^2)^* = \frac{1}{N}\sum_{i=1}^{N}(T_i - \mu)^2

• note that this is just the sample variance

Maximum likelihood
• example:
  – if the sample is {10, 20, 30, 40, 50}, then

\mu^* = \frac{10 + 20 + 30 + 40 + 50}{5} = 30

(\sigma^2)^* = \frac{(-20)^2 + (-10)^2 + 0^2 + 10^2 + 20^2}{5} = 200
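The same numbers can be checked with numpy; note that np.var uses the ML normalization 1/N by default:

```python
import numpy as np

sample = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
print(sample.mean())  # 30.0  (ML estimate of the mean)
print(sample.var())   # 200.0 (ML estimate of the variance, 1/N normalization)
```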

Homework
• show that the Hessian is negative definite, i.e. that

\theta^t\, \nabla_\Theta^2 P_X(D; \Theta)\, \theta \leq 0, \quad \forall \theta \in \mathbb{R}^n

• show that these formulas can be generalized to the vector case
  – D^{(i)} = \{x_1^{(i)}, \ldots, x_n^{(i)}\} a set of examples from class i
  – the ML estimates are

\mu_i = \frac{1}{n} \sum_j x_j^{(i)} \qquad \Sigma_i = \frac{1}{n} \sum_j \left(x_j^{(i)} - \mu_i\right)\left(x_j^{(i)} - \mu_i\right)^T

• note that the ML solution is usually intuitive
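A numpy sketch of the vector-case estimates on synthetic data (the sample below is invented); note that np.cov defaults to the 1/(n−1) normalization, so bias=True is needed to reproduce the ML 1/n estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))               # invented sample from one class, n x d

mu_hat = X.mean(axis=0)                      # (1/n) sum_j x_j
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)   # (1/n) sum_j (x_j - mu)(x_j - mu)^T

# np.cov with bias=True uses the same 1/n normalization
print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))  # True
```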

Estimators
• when we talk about estimators, it is important to keep in mind that
  – an estimate is a number
  – an estimator is a random variable

\hat\theta = f(X_1, \ldots, X_n)

• an estimate is the value of the estimator for a given sample
• if D = \{x_1, \ldots, x_n\}, when we say

\hat\mu = \frac{1}{n} \sum_j x_j

• what we mean is

\hat\mu = f(X_1, \ldots, X_n)\Big|_{X_1 = x_1, \ldots, X_n = x_n} \quad \text{with} \quad f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j

where the X_i are random variables
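A small simulation of this distinction (the distribution and sample size are invented): each draw of a sample produces a different value of µ̂, which is exactly what it means for the estimator to be a random variable.

```python
import numpy as np

rng = np.random.default_rng(2)
# five independent samples of size n = 20 from N(5, 1)
estimates = [rng.normal(5.0, 1.0, size=20).mean() for _ in range(5)]
print(estimates)  # five different numbers: each one is an estimate;
                  # the map (X_1, ..., X_n) -> sample mean is the estimator
```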

Bias and variance
• we know how to produce estimators (by ML)
• how do we evaluate an estimator?
• Q1: is the estimator, on average, equal to the true value?
• this is measured by the bias
  – if \hat\theta = f(X_1, \ldots, X_n), then

\text{Bias}(\hat\theta) = E_{X_1, \ldots, X_n}\left[ f(X_1, \ldots, X_n) - \theta \right]

  – an estimator that has bias will usually not converge to the perfect estimate θ, no matter how large the sample is
  – e.g. if θ is negative and the estimator is f(X_1, \ldots, X_n) = \frac{1}{n}\sum_j X_j^2, the bias is clearly non-zero

Bias and variance
• such an estimator is said to be biased
  – this means that it is not expressive enough to approximate the true value arbitrarily well
  – this will become clearer later in the course

• Q2: assuming that the estimator converges to the true value, how many sample points do we need?
  – this can be measured by the variance

\text{Var}(\hat\theta) = E_{X_1, \ldots, X_n}\left\{ \left( f(X_1, \ldots, X_n) - E_{X_1, \ldots, X_n}[f(X_1, \ldots, X_n)] \right)^2 \right\}

  – the variance usually decreases as one collects more training examples

Example

• the ML estimator for the mean of a Gaussian N(µ, σ²) has

\text{Bias}(\hat\mu) = E_{X_1, \ldots, X_n}[\hat\mu - \mu] = E_{X_1, \ldots, X_n}[\hat\mu] - \mu
= E_{X_1, \ldots, X_n}\left[ \frac{1}{n}\sum_i X_i \right] - \mu
= \frac{1}{n}\sum_i E_{X_1, \ldots, X_n}[X_i] - \mu
= \frac{1}{n}\sum_i E_{X_i}[X_i] - \mu
= \mu - \mu = 0

• the estimator is unbiased
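An empirical check of this result, and of the variance behavior from the previous slide (the true parameters below are invented): across many simulated samples the average of µ̂ matches µ, while its variance shrinks like σ²/n.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, trials = 5.0, 2.0, 20000

for n in (10, 100, 1000):
    # one value of mu-hat per simulated sample of size n
    estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(n, round(estimates.mean() - mu, 4), round(estimates.var(), 4))
    # empirical bias ~ 0 for every n; empirical variance ~ sigma^2 / n
```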