
Machine Learning - HT 2016
3. Maximum Likelihood

Varun Kanade

University of Oxford
January 27, 2016

Outline

Probabilistic Framework

- Formulate learning in the language of probability

- Introduce the maximum likelihood estimate

- Relation to the least squares estimate

Basics of Probability

- Univariate and multivariate Gaussian distributions

- Laplace distribution

- Likelihood, entropy and their relation to learning

Univariate Gaussian (Normal) Distribution

The univariate normal distribution is defined by the following density function

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \qquad X \sim \mathcal{N}(\mu, \sigma^2)$$

Here $\mu$ is the mean and $\sigma^2$ is the variance.

Sampling from a Gaussian distribution

Sampling from $X \sim \mathcal{N}(\mu, \sigma^2)$: by setting $Y = \frac{X - \mu}{\sigma}$, it suffices to sample from $Y \sim \mathcal{N}(0, 1)$.

Cumulative distribution function:
$$\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{t^2}{2}}\, dt$$
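A minimal NumPy sketch of this standardization trick, with illustrative values for $\mu$ and $\sigma$: sample $Y \sim \mathcal{N}(0, 1)$ and transform it into $X = \mu + \sigma Y$.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 1.5                 # illustrative parameter values
y = rng.standard_normal(10_000)      # Y ~ N(0, 1)
x = mu + sigma * y                   # X = mu + sigma * Y, so X ~ N(mu, sigma^2)

print(x.mean(), x.std())             # should be close to mu and sigma
```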

Covariance and Correlation

For random variables $X$ and $Y$, the covariance measures how they change jointly.

$$\mathrm{cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]$$

Covariance depends on the scale of the random variables. The (Pearson) correlation coefficient normalizes the covariance to give a value between $-1$ and $+1$:
$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y},$$
where $\sigma_X^2 = \mathbb{E}[(X - \mathbb{E}[X])^2]$ and $\sigma_Y^2 = \mathbb{E}[(Y - \mathbb{E}[Y])^2]$.
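As a quick sanity check of these definitions, the sketch below uses synthetic data to estimate the covariance and correlation directly from the formulas and compares them against NumPy's built-in `np.cov` and `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(5_000)
y = 2.0 * x + rng.standard_normal(5_000)     # Y depends on X, so cov(X, Y) > 0

# Estimates computed directly from the definitions (sample means replace E[.])
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())

print(cov_xy, corr_xy)
print(np.cov(x, y, bias=True)[0, 1], np.corrcoef(x, y)[0, 1])   # should agree
```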

Multivariate Gaussian Distribution

Suppose $x$ is an $n$-dimensional random vector. The covariance matrix consists of all pairwise covariances.

$$\mathrm{cov}(x) = \mathbb{E}\left[(x - \mathbb{E}[x])(x - \mathbb{E}[x])^T\right] = \begin{pmatrix} \mathrm{var}(X_1) & \mathrm{cov}(X_1, X_2) & \cdots & \mathrm{cov}(X_1, X_n) \\ \mathrm{cov}(X_2, X_1) & \mathrm{var}(X_2) & \cdots & \mathrm{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(X_n, X_1) & \mathrm{cov}(X_n, X_2) & \cdots & \mathrm{var}(X_n) \end{pmatrix}$$

If $\mu = \mathbb{E}[x]$ and $\Sigma = \mathrm{cov}[x]$, the multivariate normal is defined by the density

$$\mathcal{N}(\mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\, |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
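A small sketch evaluating this density for an illustrative 2-dimensional example, checked against `scipy.stats.multivariate_normal`; the mean, covariance and query point below are made up for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])                 # illustrative mean vector
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])            # illustrative covariance matrix
x = np.array([0.5, 0.5])                  # point at which to evaluate the density

n = len(mu)
diff = x - mu
density = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / (
    (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

print(density)
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))   # should agree
```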

Bivariate Gaussian Distribution

Suppose $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$.

What is the joint probability distribution $p(x_1, x_2)$?

Suppose you are given three independent samples: $x_1 = 1$, $x_2 = 2.7$, $x_3 = 3$.

You know that the data were generated from $\mathcal{N}(0, 1)$ or $\mathcal{N}(4, 1)$. Let $\theta$ represent the parameters of the distribution. Then the probability of observing the data with parameter $\theta$ is called the likelihood:

$$p(x_1, x_2, x_3 \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta)\, p(x_3 \mid \theta)$$

We have to choose between $\theta = 0$ and $\theta = 4$. Which one?

Maximum Likelihood Estimation (MLE): Pick $\theta$ that maximizes the likelihood.
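A short sketch of this example: compute the likelihood of $x_1 = 1$, $x_2 = 2.7$, $x_3 = 3$ under both candidate parameter values and pick the larger one.

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.0, 2.7, 3.0])       # the three observed samples

for mu in (0.0, 4.0):               # the two candidate values of theta
    likelihood = np.prod(norm.pdf(x, loc=mu, scale=1.0))
    print(mu, likelihood)
```

Since the samples lie closer to 4 than to 0 on average, the likelihood is larger for $\theta = 4$, so the MLE picks $\theta = 4$.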

Linear Regression

Recall our linear regression model

$$y = x^T w + \text{noise}$$

Model $y$ (conditioned on $x$) as a random variable. Given $x$ and the model parameter $w$:
$$\mathbb{E}[y \mid x, w] = x^T w$$
We can be more specific in choosing our model for $y$. Let us assume that given $x, w$, $y$ is Gaussian with mean $x^T w$ and variance $\sigma^2$:

$$y \sim \mathcal{N}(x^T w, \sigma^2) = x^T w + \mathcal{N}(0, \sigma^2)$$
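A minimal sketch of sampling synthetic data from this model, with made-up values for $w$ and $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(2)

m, d = 100, 3
w_true = np.array([1.0, -2.0, 0.5])     # illustrative "true" parameters
sigma = 0.3

X = rng.standard_normal((m, d))
y = X @ w_true + sigma * rng.standard_normal(m)   # y_i ~ N(x_i^T w, sigma^2)
print(X.shape, y.shape)
```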

Likelihood of Linear Regression

Suppose we observe data $\{(x_i, y_i)\}_{i=1}^m$. What is the likelihood of observing the data for model parameters $w, \sigma$?


$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = \prod_{i=1}^m p(y_i \mid x_i, w, \sigma)$$

Recall that $y_i \sim x_i^T w + \mathcal{N}(0, \sigma^2)$. So

$$p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y_i - x_i^T w)^2}{2\sigma^2}} = \left(\frac{1}{2\pi\sigma^2}\right)^{m/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^m (y_i - x_i^T w)^2}$$

We want to find the parameters $w$ and $\sigma$ that maximize the likelihood.

Likelihood of Linear Regression

It is simpler to look at the log-likelihood. Taking logs

$$LL(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^m (x_i^T w - y_i)^2$$
$$LL(y \mid X, w, \sigma) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Xw - y)^T(Xw - y)$$

How to find the $w$ that maximizes the likelihood?
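As a sanity check on this formula, the sketch below evaluates the closed-form log-likelihood on synthetic data and compares it with the sum of per-point Gaussian log-densities from `scipy.stats.norm.logpdf` (all parameter values are illustrative).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
m, d = 50, 2
w = np.array([1.0, -1.0])        # illustrative parameters
sigma = 0.5
X = rng.standard_normal((m, d))
y = X @ w + sigma * rng.standard_normal(m)

# Closed form: -(m/2) log(2 pi sigma^2) - (1 / (2 sigma^2)) (Xw - y)^T (Xw - y)
resid = X @ w - y
LL = -0.5 * m * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

print(LL)
print(norm.logpdf(y, loc=X @ w, scale=sigma).sum())   # should agree
```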

Maximum Likelihood and Least Squares

Let us in fact look at the negative log-likelihood (which is more like a loss):
$$NLL(y \mid X, w, \sigma) = \frac{1}{2\sigma^2}(Xw - y)^T(Xw - y) + \frac{m}{2}\log(2\pi\sigma^2)$$

And recall the squared loss objective

$$L(w) = (Xw - y)^T (Xw - y)$$

For any fixed $\sigma$, maximizing the likelihood over $w$ is the same as minimizing the squared loss, so the MLE $w_{ML}$ coincides with the least squares estimate.

We can also find the MLE for $\sigma$. As an exercise, show that the MLE of $\sigma$ is

$$\sigma_{ML}^2 = \frac{1}{m}(Xw_{ML} - y)^T(Xw_{ML} - y)$$
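A sketch of both estimators on synthetic data (the "true" parameters are made up): $w_{ML}$ is obtained via least squares and $\sigma_{ML}$ from the formula above.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 200, 3
w_true = np.array([1.0, -2.0, 0.5])      # illustrative "true" parameters
sigma_true = 0.3
X = rng.standard_normal((m, d))
y = X @ w_true + sigma_true * rng.standard_normal(m)

# w_ML coincides with the least squares solution
w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

# sigma_ML^2 = (1/m) (X w_ML - y)^T (X w_ML - y)
resid = X @ w_ml - y
sigma_ml = np.sqrt(resid @ resid / m)

print(w_ml, sigma_ml)        # should be close to w_true and sigma_true
```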

Making Predictions

Given training data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$, we can use the MLE estimators to make predictions on new points and also give confidence intervals.

$$y_{new} \mid x_{new}, \mathcal{D} \sim \mathcal{N}(x_{new}^T w_{ML}, \sigma_{ML}^2)$$
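A sketch of a prediction with an approximate 95% interval, assuming $w_{ML}$ and $\sigma_{ML}$ have already been fitted (the numbers below are illustrative). This simple interval only accounts for the noise $\sigma_{ML}$, not for uncertainty in the estimated parameters.

```python
import numpy as np

# Assume these were fitted as in the previous sketch (values are illustrative)
w_ml = np.array([0.98, -2.01, 0.52])
sigma_ml = 0.31

x_new = np.array([0.2, -1.0, 0.7])
mean = x_new @ w_ml

# Approximate 95% interval under y_new ~ N(x_new^T w_ML, sigma_ML^2)
lo, hi = mean - 1.96 * sigma_ml, mean + 1.96 * sigma_ml
print(mean, (lo, hi))
```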

Outliers and Laplace Distribution

With outliers, least squares (and hence MLE with a Gaussian model) can be quite bad.

Instead, we can model the noise (or uncertainty) in $y$ as a Laplace distribution:
$$p(y \mid x, w, b) = \frac{1}{2b}\exp\left(-\frac{|y - x^T w|}{b}\right)$$

[Figure: the Laplace density plotted for $y - x^T w$ between $-6$ and $6$.]
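A small sketch evaluating the Laplace density from this formula (with $b = 1$), checked against `scipy.stats.laplace`, together with a comparison against the standard Gaussian density far from zero; the heavier tails are what make the Laplace model less sensitive to outliers.

```python
import numpy as np
from scipy.stats import laplace, norm

b = 1.0
r = np.linspace(-6.0, 6.0, 13)           # residuals y - x^T w

# Density from the formula above
p = np.exp(-np.abs(r) / b) / (2 * b)
print(np.allclose(p, laplace.pdf(r, loc=0.0, scale=b)))   # True

# Heavier tails than the Gaussian: much more mass far from zero
print(laplace.pdf(4.0, scale=1.0), norm.pdf(4.0, scale=1.0))
```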

Lookahead: Binary Classification

A Bernoulli random variable $X$ takes values in $\{0, 1\}$. We parametrize it using $\theta \in [0, 1]$:
$$p(1 \mid \theta) = \theta, \qquad p(0 \mid \theta) = 1 - \theta$$
More succinctly, we can write

$$p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}$$

For classification, we will design models with parameter $w$ that, given input $x$, produce a value $f(x; w) \in [0, 1]$. Then, we can model the (binary) class labels as
$$y \sim \mathrm{Bernoulli}(f(x; w))$$
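A sketch of this lookahead, taking $f(x; w)$ to be the logistic sigmoid purely for illustration (the slide only requires $f(x; w) \in [0, 1]$); binary labels are then drawn from the corresponding Bernoulli distributions.

```python
import numpy as np

rng = np.random.default_rng(5)

def f(x, w):
    # One possible f(x; w) with values in [0, 1]: the logistic sigmoid
    # (an illustrative choice, not prescribed by the slide).
    return 1.0 / (1.0 + np.exp(-x @ w))

w = np.array([1.5, -0.5])                # illustrative parameters
X = rng.standard_normal((10, 2))

probs = f(X, w)                          # f(x_i; w) in [0, 1]
y = rng.binomial(1, probs)               # y_i ~ Bernoulli(f(x_i; w))
print(np.column_stack([probs.round(2), y]))
```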

Entropy

In information theory, entropy $H$ is a measure of uncertainty associated with a random variable.

$$H(X) = -\sum_x p(x)\log(p(x))$$

In the case of Bernoulli variables (with parameter $\theta$) we get:

$$H(X) = -\theta\log(\theta) - (1 - \theta)\log(1 - \theta)$$

[Figure: the entropy $H(X)$ of a Bernoulli($\theta$) variable plotted as a function of $\theta$.]
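A quick numerical check of this formula: evaluate $H(X)$ on a grid of $\theta$ values; the entropy is maximized at $\theta = 0.5$, where its value is $\log 2$ nats.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)
H = -theta * np.log(theta) - (1 - theta) * np.log(1 - theta)

print(theta[np.argmax(H)])    # 0.5: the most "uncertain" Bernoulli variable
print(H.max(), np.log(2))     # maximum entropy is log(2) nats
```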

Maximum Likelihood and KL-Divergence

Suppose we get data $x_1, \ldots, x_m$ from some unknown distribution $q$.

Attempt to find parameters $\theta$ for a family of distributions that best explains the data:

$$\begin{aligned}
\hat{\theta} &= \operatorname*{argmax}_{\theta} \prod_{i=1}^m p(x_i \mid \theta) \\
&= \operatorname*{argmax}_{\theta} \sum_{i=1}^m \log(p(x_i \mid \theta)) \\
&= \operatorname*{argmax}_{\theta} \frac{1}{m}\sum_{i=1}^m \log(p(x_i \mid \theta)) - \frac{1}{m}\sum_{i=1}^m \log(q(x_i)) \\
&= \operatorname*{argmin}_{\theta} \frac{1}{m}\sum_{i=1}^m \log\left(\frac{q(x_i)}{p(x_i \mid \theta)}\right) \\
&\to \operatorname*{argmin}_{\theta} \int \log\left(\frac{q(x)}{p(x)}\right) q(x)\, dx = \operatorname*{argmin}_{\theta}\, \mathrm{KL}(q \,\|\, p)
\end{aligned}$$

Kullback-Leibler Divergence

KL-Divergence is "like" a distance between distributions:

$$\mathrm{KL}(q \,\|\, p) = \int q(x) \log\left(\frac{q(x)}{p(x)}\right) dx$$

- $\mathrm{KL}(q \,\|\, q) = 0$
- $\mathrm{KL}(q \,\|\, p) \geq 0$ for all distributions $p$
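A minimal sketch for discrete distributions (the distributions $q$ and $p$ below are made up), illustrating the two properties listed above and the fact that the KL-divergence is not symmetric:

```python
import numpy as np

def kl(q, p):
    # KL(q || p) = sum_x q(x) log(q(x) / p(x)) for discrete distributions
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(kl(q, q))               # 0.0
print(kl(q, p), kl(p, q))     # both >= 0, and in general not equal
```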
