
Machine Learning - MT 2016
7. Classification: Generative Models

Varun Kanade

University of Oxford
October 31, 2016

Announcements

I Practical 1 Submission

I Try to get signed off during the session itself
I Otherwise, do it in the next session
I Exception: Practical 4 (firm deadline Friday Week 8 at noon)

I Sheet 2 is due this Friday 12pm

Recap: Regression

Discriminative Model: Linear Model (with Gaussian noise)

p(y | w, x) = N (y | w · x, σ²)

Other noise models possible, e.g., Laplace

Non-linearities using basis expansion

Regularisation to avoid overfitting: Ridge, Lasso

(Cross)-Validation to choose hyperparameters

Optimisation Algorithms for Model Fitting

[Figure: timeline of regression methods, from Least Squares (Gauss, Legendre, c. 1800) to Ridge and Lasso (2016)]

Supervised Learning - Classification

In classification problems, the target/output y is a category

y ∈ {1, 2,...,C}

The input x = (x1, ..., xD), where

I Categorical: xi ∈ {1,...,K}

I Real-Valued: xi ∈ R

Discriminative Model: Only model the conditional distribution

p(y | x, θ)

Generative Model: Model the full joint distribution

p(x, y | θ)

Prediction Using Generative Models

Suppose we have a model p(x, y | θ) over the joint distribution over inputs and outputs

Given a new input xnew, we can write the conditional distribution for y

For c ∈ {1,...,C}, we write

p(y = c | xnew, θ) = p(y = c | θ) · p(xnew | y = c, θ) / [ Σ_{c'=1}^{C} p(y = c' | θ) · p(xnew | y = c', θ) ]

The numerator is simply the joint probability p(xnew, c | θ) and the

denominator the marginal probability p(xnew | θ)

We can pick ŷ = argmax_c p(y = c | xnew, θ)
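A minimal sketch of this prediction rule in Python (not from the lecture), assuming we already have the class priors p(y = c | θ) as a vector and a hypothetical function class_conditional(x, c) that evaluates p(x | y = c, θ):

```python
import numpy as np

def predict(x_new, priors, class_conditional):
    """Posterior p(y | x_new) and the most probable class, via Bayes' rule."""
    # Numerator: joint probability p(x_new, c | theta) for each class c
    joint = np.array([priors[c] * class_conditional(x_new, c)
                      for c in range(len(priors))])
    posterior = joint / joint.sum()   # divide by the marginal p(x_new | theta)
    return posterior, int(np.argmax(posterior))
```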

Toy Example

Predict voter preference in US elections

Voted in 2012?   Annual Income   State   Candidate Choice
Y                50K             OK      Clinton
N                173K            CA      Clinton
Y                80K             NJ      Trump
Y                150K            WA      Clinton
N                25K             WV      Johnson
Y                85K             IL      Clinton
...              ...             ...     ...
Y                1050K           NY      Trump
N                35K             CA      Trump
N                100K            NY      ?

Classification: Generative Model

In order to fit a generative model, we’ll express the joint distribution as

p(x, y | θ, π) = p(y | π) · p(x | y, θ)

To model p(y | π), we’ll use parameters πc such that Σ_c πc = 1

p(y = c | π) = πc

For class-conditional densities, for class c = 1,...,C, we will have a model:

p(x | y = c, θc)

Classification: Generative Model

So in our example,

p(y = clinton | π) = πclinton

p(y = trump | π) = πtrump

p(y = johnson | π) = πjohnson

Given that a voter supports Trump,

p(x | y = trump, θtrump)

models the distribution over x given y = trump and θtrump

Similarly, we have p(x | y = clinton, θclinton) and p(x | y = johnson, θjohnson)

We need to pick a ‘‘model’’ for p(x | y = c, θc)

Estimate the parameters πc, θc for c = 1,...,C

Naïve Bayes Classifier (NBC)

Assume that the features are conditionally independent given the class label

p(x | y = c, θc) = Π_{j=1}^{D} p(xj | y = c, θjc)

So, for example, we are ‘modelling’ that, conditioned on being a Trump supporter, the state, previous voting record and annual income are conditionally independent

Clearly, this assumption is ‘‘naïve’’ and never satisfied

But model fitting becomes very very easy

Although the generative model is clearly inadequate, it actually works quite well

Goal is predicting class, not modelling the data!

Naïve Bayes Classifier (NBC)

Real-Valued Features

I xj is real-valued, e.g., annual income
I Example: Use a Gaussian model, so θjc = (µjc, σ²jc)

I Can use other distributions, e.g., age is probably not Gaussian!

Categorical Features

I xj is categorical with values in {1,...,K}

I Use the multinoulli distribution, i.e. xj = i with probability µjc,i

Σ_{i=1}^{K} µjc,i = 1

I In the special case when xj ∈ {0, 1}, use a single parameter θjc ∈ [0, 1] (a sketch combining both feature types follows below)
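A minimal sketch (not from the lecture) of how such per-feature models combine under the naïve Bayes assumption, for one Gaussian-modelled real-valued feature and one Bernoulli-modelled binary feature; the parameter names mu, sigma and theta are illustrative:

```python
import numpy as np
from scipy.stats import norm

def log_class_conditional(x_real, x_binary, mu, sigma, theta):
    """log p(x | y = c, theta_c) as a sum of per-feature log-likelihoods."""
    ll = norm.logpdf(x_real, loc=mu, scale=sigma)                         # Gaussian feature
    ll += x_binary * np.log(theta) + (1 - x_binary) * np.log(1 - theta)   # Bernoulli feature
    return ll
```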

Naïve Bayes Classifier (NBC)

Assume that all the features are binary, i.e., every xj ∈ {0, 1}

If we have C classes, overall we have only O(CD) parameters, θjc for each j = 1,...,D and c = 1,...,C

Without the conditional independence assumption

I We have to assign a probability to each of the 2^D combinations
I Thus, we have O(C · 2^D) parameters!

I The ‘naïve’ assumption breaks the curse of dimensionality and avoids overfitting!

Maximum Likelihood for the NBC

Let us suppose we have data ⟨(xi, yi)⟩_{i=1}^{N}, drawn i.i.d. from some joint distribution p(x, y)

The probability for a single datapoint is given by:

p(xi, yi | θ, π) = p(yi | π) · p(xi | θ, yi) = Π_{c=1}^{C} πc^{I(yi=c)} · Π_{c=1}^{C} Π_{j=1}^{D} p(xij | θjc)^{I(yi=c)}

Let Nc be the number of datapoints with yi = c, so that Σ_{c=1}^{C} Nc = N

We write the log-likelihood of the data as:

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi=c} log p(xij | θjc)

The log-likelihood is easily separated into sums involving different parameters!

Maximum Likelihood for the NBC

We have the log-likelihood for the NBC

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi=c} log p(xij | θjc)

Let us obtain estimates for π. We get the following optimisation problem:

maximise:   Σ_{c=1}^{C} Nc log πc
subject to: Σ_{c=1}^{C} πc = 1

This constrained optimisation problem can be solved using the method of Lagrange multipliers

Constrained Optimisation Problem

Suppose f(z) is some function that we want to maximise subject to g(z) = 0.

Constrained Objective

argmax_z f(z), subject to: g(z) = 0

Lagrangian (Dual) Form

Λ(z, λ) = f(z) + λg(z)

Any optimal solution to the constrained problem is a stationary point of Λ(z, λ)

Constrained Optimisation Problem

Any optimal solution to the constrained problem is a stationary point of

Λ(z, λ) = f(z) + λg(z)

∇zΛ(z, λ) = 0 ⇒ ∇zf = −λ ∇zg
∂Λ(z, λ)/∂λ = 0 ⇒ g(z) = 0

Maximum Likelihood for NBC

Recall that we want to solve:

maximise:   Σ_{c=1}^{C} Nc log πc
subject to: Σ_{c=1}^{C} πc − 1 = 0

We can write the Lagrangian form:

Λ(π, λ) = Σ_{c=1}^{C} Nc log πc + λ (Σ_{c=1}^{C} πc − 1)

We can write the partial derivatives and set them to 0:

∂Λ(π, λ)/∂πc = Nc/πc + λ = 0
∂Λ(π, λ)/∂λ = Σ_{c=1}^{C} πc − 1 = 0

Maximum Likelihood for NBC

The solution is obtained by setting

Nc/πc + λ = 0

And so,

πc = −Nc/λ

As well as using the second condition,

Σ_{c=1}^{C} πc − 1 = −Σ_{c=1}^{C} Nc/λ − 1 = 0

And thence,

λ = −Σ_{c=1}^{C} Nc = −N

Thus, we get the estimates

πc = Nc/N
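As a quick sanity check of πc = Nc/N, counting labels in a small made-up array reproduces the estimate:

```python
import numpy as np

y = np.array([0, 2, 1, 0, 0, 2, 1, 0])   # hypothetical class labels
pi_hat = np.bincount(y) / len(y)         # N_c / N for each class c
print(pi_hat)                            # [0.5, 0.25, 0.25]
```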

Maximum Likelihood for the NBC

We have the log-likelihood for the NBC

log p(D | θ, π) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{j=1}^{D} Σ_{i: yi=c} log p(xij | θjc)

We obtained the estimates πc = Nc/N

We can estimate θjc by taking a similar approach

To estimate θjc we only need to use the j-th feature of the examples with yi = c

Estimates depend on the model, e.g., Gaussian, Bernoulli, Multinoulli, etc.

Fitting NBC is very very fast!
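A minimal sketch (not the lecture’s code) of fitting and using an NBC with binary features, assuming the maximum-likelihood estimates derived above; variable names are illustrative and no smoothing is applied:

```python
import numpy as np

def fit_bernoulli_nbc(X, y, C):
    """pi_c = N_c / N and theta_jc = fraction of class-c examples with x_j = 1."""
    N, D = X.shape
    pi = np.array([np.sum(y == c) for c in range(C)]) / N
    theta = np.array([X[y == c].mean(axis=0) for c in range(C)])   # shape (C, D)
    return pi, theta

def predict_nbc(x, pi, theta):
    """argmax_c of log pi_c + sum_j log p(x_j | theta_jc)."""
    log_post = (np.log(pi)
                + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1))
    return int(np.argmax(log_post))
```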

Summary: Naïve Bayes Classifier

Generative Model: Fit the distribution p(x, y | θ)

Make the naïve and obviously untrue assumption that features are conditionally independent given the class!

p(x | y = c, θc) = Π_{j=1}^{D} p(xj | y = c, θjc)

Despite this, these classifiers often work quite well in practice

The conditional independence assumption reduces the number of parameters and avoids overfitting

Fitting the model is very straightforward

Easy to mix and match different models for different features

Outline

Generative Models for Classification

Naïve Bayes Model

Gaussian Discriminant Analysis

Generative Model: Gaussian Discriminant Analysis

Recall the form of the joint distribution in a generative model

p(x, y | θ, π) = p(y | π) · p(x | y, θ)

For the classes, we use parameters πc such that Σ_c πc = 1

p(y = c | π) = πc

Suppose x ∈ R^D. We model the class-conditional density for class c = 1,...,C as a multivariate normal distribution with mean µc and covariance matrix Σc

p(x | y = c, θc) = N (x | µc, Σc)

Quadratic Discriminant Analysis (QDA)

Let’s first see what the prediction rule for this model is:

p(y = c | xnew, θ) = p(y = c | θ) · p(xnew | y = c, θ) / [ Σ_{c'=1}^{C} p(y = c' | θ) · p(xnew | y = c', θ) ]

When the densities p(x | y = c, θc) are multivariate normal, we get

p(y = c | x, θ) = πc |2πΣc|^{-1/2} exp(−½ (x − µc)^T Σc^{-1} (x − µc)) / [ Σ_{c'=1}^{C} πc' |2πΣc'|^{-1/2} exp(−½ (x − µc')^T Σc'^{-1} (x − µc')) ]

The denominator is the same for all classes, so the boundary between classes c and c' is given by

πc |2πΣc|^{-1/2} exp(−½ (x − µc)^T Σc^{-1} (x − µc)) = πc' |2πΣc'|^{-1/2} exp(−½ (x − µc')^T Σc'^{-1} (x − µc'))

Thus the boundaries are quadratic surfaces, hence the method is called quadratic discriminant analysis
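A small sketch of this prediction rule (not the lecture’s code), assuming the per-class priors, means and covariances have already been fitted:

```python
import numpy as np
from scipy.stats import multivariate_normal

def qda_posterior(x, pis, mus, Sigmas):
    """p(y = c | x, theta) for each class under per-class Gaussian densities."""
    likelihoods = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma)
                            for mu, Sigma in zip(mus, Sigmas)])
    joint = pis * likelihoods          # numerator: pi_c * N(x | mu_c, Sigma_c)
    return joint / joint.sum()         # normalise by the marginal p(x | theta)
```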

Quadratic Discriminant Analysis (QDA)

Linear Discriminant Analysis

A special case is when the covariance matrices are shared or tied across the different classes, i.e., Σc = Σ for all c. We can write

p(y = c | x, θ) ∝ πc exp(−½ (x − µc)^T Σ^{-1} (x − µc))
               = exp(µc^T Σ^{-1} x − ½ µc^T Σ^{-1} µc + log πc) · exp(−½ x^T Σ^{-1} x)

Let us set

γc = −½ µc^T Σ^{-1} µc + log πc        βc = Σ^{-1} µc

and so

p(y = c | xnew, θ) ∝ exp(βc^T xnew + γc)
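As an illustrative sketch (assuming mus of shape C × D, a shared D × D covariance Sigma, and priors pis are already fitted), the conversion to the linear parameters βc and γc could look like this:

```python
import numpy as np

def lda_linear_params(pis, mus, Sigma):
    """beta_c = Sigma^{-1} mu_c and gamma_c = -0.5 mu_c^T Sigma^{-1} mu_c + log pi_c."""
    Sigma_inv = np.linalg.inv(Sigma)
    betas = mus @ Sigma_inv                                         # row c holds beta_c^T
    gammas = -0.5 * np.einsum('cd,cd->c', betas, mus) + np.log(pis)
    return betas, gammas

# Usage: scores = betas @ x + gammas; p(y = c | x, theta) is proportional to exp(scores[c])
```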

Linear Discriminant Analysis (LDA) & Softmax

Recall that we wrote,

p(y = c | x, θ) ∝ exp(βc^T x + γc)

And so,

p(y = c | x, θ) = exp(βc^T x + γc) / [ Σ_{c'} exp(βc'^T x + γc') ] = softmax(η)c

where η = [β1^T x + γ1, ..., βC^T x + γC].

Softmax

Softmax maps a set of numbers to a probability distribution with most of its mass at the maximum

softmax([1, 2, 3]) ≈ [0.090, 0.245, 0.665]
softmax([10, 20, 30]) ≈ [2 × 10^{-9}, 4 × 10^{-5}, 1]
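A tiny sketch reproducing these numbers (the shift by the maximum is a standard numerical-stability trick, not something required by the definition):

```python
import numpy as np

def softmax(eta):
    eta = np.asarray(eta, dtype=float)
    e = np.exp(eta - np.max(eta))      # shift for numerical stability
    return e / e.sum()

print(softmax([1, 2, 3]))      # ~ [0.090, 0.245, 0.665]
print(softmax([10, 20, 30]))   # ~ [2.1e-09, 4.5e-05, 0.99995]
```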

QDA and LDA

Two-class LDA

When we have only 2 classes, say 0 and 1,

p(y = 1 | x, θ) = exp(β1^T x + γ1) / [ exp(β1^T x + γ1) + exp(β0^T x + γ0) ]
               = 1 / (1 + exp(−((β1 − β0)^T x + (γ1 − γ0))))
               = sigmoid((β1 − β0)^T x + (γ1 − γ0))

Sigmoid Function

The sigmoid function is defined as:

sigmoid(t) = 1 / (1 + e^{−t})

[Figure: plot of the sigmoid function for t ∈ [−4, 4]]

MLE for QDA (or LDA)

We can write the log-likelihood given data D = ⟨(xi, yi)⟩_{i=1}^{N} as:

log p(D | θ) = Σ_{c=1}^{C} Nc log πc + Σ_{c=1}^{C} Σ_{i: yi=c} log N (xi | µc, Σc)

As in the case of Naïve Bayes, we get πc = Nc/N

For the other parameters, it is possible to show that:

µ̂c = (1/Nc) Σ_{i: yi=c} xi

Σ̂c = (1/Nc) Σ_{i: yi=c} (xi − µ̂c)(xi − µ̂c)^T

(See Chapter 4.1 of Murphy for details)
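A minimal sketch of these maximum-likelihood estimates in code, assuming labels 0,...,C−1 and illustrative variable names:

```python
import numpy as np

def fit_qda(X, y, C):
    """MLE for QDA: class priors, per-class means and covariances."""
    N = len(y)
    pis = np.array([np.sum(y == c) for c in range(C)]) / N
    mus = np.array([X[y == c].mean(axis=0) for c in range(C)])
    # bias=True divides by N_c, matching the MLE (not the unbiased estimator)
    Sigmas = np.array([np.cov(X[y == c].T, bias=True) for c in range(C)])
    return pis, mus, Sigmas
```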

How to Prevent Overfitting

I The number of parameters in the model is roughly C · D^2

I In high-dimensions this can lead to overfitting

I Use diagonal covariance matrices (basically Naïve Bayes)

I Use weight tying a.k.a. parameter sharing (LDA vs QDA)

I Bayesian Approaches

I Use a discriminative classifier (+ regularize if needed)
