Machine Learning - HT 2016
3. Maximum Likelihood
Varun Kanade, University of Oxford
January 27, 2016

Outline

Probabilistic framework
- Formulate linear regression in the language of probability
- Introduce the maximum likelihood estimate
- Relation to the least squares estimate

Basics of probability
- Univariate and multivariate normal distribution
- Laplace distribution
- Likelihood, entropy and its relation to learning

Univariate Gaussian (Normal) Distribution

The univariate normal distribution is defined by the density function

p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \qquad X \sim \mathcal{N}(\mu, \sigma^2)

Here µ is the mean and σ² is the variance.

Sampling from a Gaussian Distribution

To sample from X ∼ N(µ, σ²), set Y = (X − µ)/σ and sample from Y ∼ N(0, 1); then X = µ + σY.

Cumulative distribution function:

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, dt

Covariance and Correlation

For random variables X and Y, the covariance measures how the random variables change jointly:

\mathrm{cov}(X, Y) = \mathbb{E}\big[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \big]

Covariance depends on the scale of the random variables. The (Pearson) correlation coefficient normalises the covariance to give a value between −1 and +1:

\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y},

where σ_X² = E[(X − E[X])²] and σ_Y² = E[(Y − E[Y])²].

Multivariate Gaussian Distribution

Suppose x is an n-dimensional random vector. The covariance matrix consists of all pairwise covariances:

\mathrm{cov}(x) = \mathbb{E}\big[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^T \big] =
\begin{pmatrix}
\mathrm{var}(X_1) & \mathrm{cov}(X_1, X_2) & \cdots & \mathrm{cov}(X_1, X_n) \\
\mathrm{cov}(X_2, X_1) & \mathrm{var}(X_2) & \cdots & \mathrm{cov}(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(X_n, X_1) & \mathrm{cov}(X_n, X_2) & \cdots & \mathrm{var}(X_n)
\end{pmatrix}

If µ = E[x] and Σ = cov[x], the multivariate normal is defined by the density

\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)

Bivariate Gaussian Distribution

Suppose X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²). What is the joint probability distribution p(x_1, x_2)?

Maximum Likelihood Estimation

Suppose you are given three independent samples: x_1 = 1, x_2 = 2.7, x_3 = 3. You know that the data were generated from N(0, 1) or N(4, 1).

Let θ represent the parameters of the distribution. Then the probability of observing the data with parameter θ is called the likelihood:

p(x_1, x_2, x_3 \mid \theta) = p(x_1 \mid \theta) \, p(x_2 \mid \theta) \, p(x_3 \mid \theta)

We have to choose between θ = 0 and θ = 4. Which one?

Maximum likelihood estimation (MLE): pick the θ that maximises the likelihood.

Linear Regression

Recall our linear regression model: y = x^T w + noise. Model y (conditioned on x) as a random variable. Given x and the model parameter w:

\mathbb{E}[y \mid x, w] = x^T w

We can be more specific in choosing our model for y. Let us assume that, given x and w, y is Gaussian with mean x^T w and variance σ²:

y \sim \mathcal{N}(x^T w, \sigma^2) = x^T w + \mathcal{N}(0, \sigma^2)

Likelihood of Linear Regression

Suppose we observe data (x_i, y_i), i = 1, ..., m. What is the likelihood of observing the data for model parameters w and σ?

p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = \prod_{i=1}^{m} p(y_i \mid x_i, w, \sigma)

Recall that y_i ∼ x_i^T w + N(0, σ²). So

p(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma)
= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - x_i^T w)^2}{2\sigma^2} \right)
= \left( \frac{1}{2\pi\sigma^2} \right)^{m/2} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{m} (y_i - x_i^T w)^2 \right)

We want to find the parameters w and σ that maximise the likelihood. It is simpler to work with the log-likelihood. Taking logs,

\mathrm{LL}(y_1, \ldots, y_m \mid x_1, \ldots, x_m, w, \sigma) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (x_i^T w - y_i)^2

or, in matrix form,

\mathrm{LL}(y \mid X, w, \sigma) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} (Xw - y)^T (Xw - y)

How do we find the w that maximises the likelihood?
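As a quick numerical check of these formulas (not part of the original slides), the sketch below evaluates the Gaussian log-likelihood in the matrix form LL(y | X, w, σ) and works through the three-sample example above, comparing N(0, 1) against N(4, 1). The synthetic regression data, the "true" parameter values and the helper name gaussian_log_likelihood are all made up for illustration.

```python
import numpy as np

def gaussian_log_likelihood(X, y, w, sigma):
    """LL(y | X, w, sigma) for the model y_i ~ N(x_i^T w, sigma^2)."""
    m = len(y)
    r = X @ w - y
    return -0.5 * m * np.log(2 * np.pi * sigma**2) - (r @ r) / (2 * sigma**2)

# The three-sample example: were x1=1, x2=2.7, x3=3 drawn from N(0,1) or N(4,1)?
x = np.array([1.0, 2.7, 3.0])
for theta in (0.0, 4.0):
    ll = -0.5 * len(x) * np.log(2 * np.pi) - np.sum((x - theta) ** 2) / 2
    print(f"theta = {theta}: log-likelihood = {ll:.2f}")
# theta = 4 gives the larger likelihood, so the MLE picks theta = 4.

# Synthetic regression data (values chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.5])
sigma_true = 0.3
y = X @ w_true + sigma_true * rng.normal(size=m)

# The log-likelihood is larger for parameters that fit the data well.
print(gaussian_log_likelihood(X, y, w_true, sigma_true))
print(gaussian_log_likelihood(X, y, np.zeros(n), sigma_true))
```

Maximising this log-likelihood over w is the same as minimising the squared error, which is exactly where the lecture goes next.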
Maximum Likelihood and Least Squares

Let us in fact look at the negative log-likelihood (which is more like a loss):

\mathrm{NLL}(y \mid X, w, \sigma) = \frac{1}{2\sigma^2} (Xw - y)^T (Xw - y) + \frac{m}{2} \log(2\pi\sigma^2)

Recall the squared-loss objective:

L(w) = (Xw - y)^T (Xw - y)

So the w that minimises the negative log-likelihood is exactly the least squares estimate. We can also find the MLE for σ. As an exercise, show that the MLE of σ satisfies

\sigma^2_{\mathrm{ML}} = \frac{1}{m} (X w_{\mathrm{ML}} - y)^T (X w_{\mathrm{ML}} - y)

Making Predictions

Given training data D = (x_1, y_1), ..., (x_m, y_m), we can use the MLE estimators to make predictions on new points and also give confidence intervals:

y_{\mathrm{new}} \mid x_{\mathrm{new}}, D \sim \mathcal{N}(x_{\mathrm{new}}^T w_{\mathrm{ML}}, \sigma^2_{\mathrm{ML}})

Outliers and Laplace Distribution

With outliers, least squares (and hence MLE with a Gaussian model) can be quite bad. Instead, we can model the noise (or uncertainty) in y as a Laplace distribution:

p(y \mid x, w, b) = \frac{1}{2b} \exp\left( -\frac{|y - x^T w|}{b} \right)

[Figure: density of the Laplace distribution.]

Lookahead: Binary Classification

A Bernoulli random variable X takes values in {0, 1}. We parametrise it by θ ∈ [0, 1]:

p(1 \mid \theta) = \theta, \qquad p(0 \mid \theta) = 1 - \theta

More succinctly, we can write

p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x}

For classification, we will design models with parameter w that, given input x, produce a value f(x; w) ∈ [0, 1]. Then we can model the (binary) class labels as

y \sim \mathrm{Bernoulli}(f(x; w))

Entropy

In information theory, entropy H is a measure of the uncertainty associated with a random variable:

H(X) = -\sum_x p(x) \log p(x)

In the case of Bernoulli variables (with parameter θ) we get

H(X) = -\theta \log\theta - (1 - \theta) \log(1 - \theta)

[Figure: entropy of a Bernoulli(θ) variable as a function of θ.]

Maximum Likelihood and KL-Divergence

Suppose we get data x_1, ..., x_m from some unknown distribution q, and we attempt to find parameters θ of a family of distributions that best explain the data:

\hat{\theta} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} p(x_i \mid \theta)
= \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p(x_i \mid \theta)
= \operatorname*{argmax}_{\theta} \left[ \frac{1}{m} \sum_{i=1}^{m} \log p(x_i \mid \theta) - \frac{1}{m} \sum_{i=1}^{m} \log q(x_i) \right]
= \operatorname*{argmin}_{\theta} \frac{1}{m} \sum_{i=1}^{m} \log \frac{q(x_i)}{p(x_i \mid \theta)}
\to \operatorname*{argmin}_{\theta} \int q(x) \log \frac{q(x)}{p(x \mid \theta)} \, dx = \operatorname*{argmin}_{\theta} \mathrm{KL}(q \,\|\, p_\theta)

where the last step holds as m → ∞ by the law of large numbers.

Kullback-Leibler Divergence

The KL-divergence is "like" a distance between distributions:

\mathrm{KL}(q \,\|\, p) = \int q(x) \log \frac{q(x)}{p(x)} \, dx

(with the integral replaced by a sum for discrete distributions), and it satisfies

- KL(q ∥ q) = 0
- KL(q ∥ p) ≥ 0 for all distributions p
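To make the entropy and KL-divergence slides concrete, here is a small sketch (not from the slides; the Bernoulli parameters, grid and sample size are arbitrary choices). It computes the Bernoulli entropy and the KL-divergence between two Bernoulli distributions, checks that KL(q ∥ q) = 0 and KL(q ∥ p) > 0 for p ≠ q, and illustrates the last derivation: maximising the average log-likelihood over θ picks roughly the same θ as minimising KL(q ∥ p_θ).

```python
import numpy as np

def bernoulli_entropy(theta):
    """H(X) = -theta log(theta) - (1 - theta) log(1 - theta)."""
    return -theta * np.log(theta) - (1 - theta) * np.log(1 - theta)

def bernoulli_kl(q, p):
    """KL(q || p) between Bernoulli(q) and Bernoulli(p)."""
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

print(bernoulli_entropy(0.5))   # maximal uncertainty: log 2
print(bernoulli_kl(0.3, 0.3))   # KL(q || q) = 0
print(bernoulli_kl(0.3, 0.7))   # > 0, and differs from bernoulli_kl(0.7, 0.3)

# MLE vs. KL: draw samples from an "unknown" q, then compare the theta that
# maximises the average log-likelihood with the theta that minimises KL(q || p_theta).
rng = np.random.default_rng(0)
q_true = 0.3
x = rng.binomial(1, q_true, size=10_000)

thetas = np.linspace(0.01, 0.99, 99)
avg_ll = x.mean() * np.log(thetas) + (1 - x.mean()) * np.log(1 - thetas)
kl = bernoulli_kl(q_true, thetas)

print(thetas[np.argmax(avg_ll)], thetas[np.argmin(kl)])  # both close to q_true
```

The asymmetry visible in the third print is why the slides say KL-divergence is only "like" a distance: it is non-negative and zero only when q = p, but it is not symmetric in its arguments.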