Linear Models for Classification
CS480 Introduction to Machine Learning
Linear Models for Classification
Edith Law
(Based on slides from Joelle Pineau)

Probabilistic Framework for Classification

As we saw in the last lecture, learning can be posed as a problem of statistical inference. The premise is that for any learning problem, if we had access to the underlying probability distribution D, we could form the Bayes optimal classifier:

f^(BO)(x̂) = arg max_{ŷ ∈ Y} D(x̂, ŷ)

But since we don't have access to D, we instead estimate this distribution from data. The probabilistic framework lets you choose a model to represent your data, and develop algorithms whose inductive biases are closer to what you, as the designer, believe.

Classification by Density Estimation

The most direct way to construct such a probability distribution is to select a family of parametric distributions. For example, we can select a Gaussian (or Normal) distribution. A Gaussian is parametric: its parameters are its mean and variance. The job of learning is to infer which parameters are "best" at describing the observed training data.

In density estimation, we assume that the sample data is drawn independently from the same distribution D. That is, your nth sample (xn, yn) from D does not depend on the previous n−1 samples. This is called the i.i.d. (independently and identically distributed) assumption.

Consider the Problem of Flipping a Biased Coin

You are flipping a coin, and you want to find out if it is biased. The coin has some unknown probability β of coming up heads (H), and probability (1 − β) of coming up tails (T). Suppose you flip the coin and observe HHTH. Assuming that each flip is independent, the probability of the data (i.e., of the coin-flip sequence) is

p(D|β) = p(HHTH|β) = β ⋅ β ⋅ (1 − β) ⋅ β = β³(1 − β) = β³ − β⁴

Statistical Estimation

To find the most likely value of β, you select the parameter that maximizes the probability of the data. This means taking the derivative of p(D|β) with respect to β, setting it to zero, and solving for β:

d/dβ [β³ − β⁴] = 3β² − 4β³ = 0  ⟺  4β³ = 3β²  ⟺  4β = 3  ⟺  β = 3/4

This gives you the maximum likelihood estimate of the parameter of your model.

Statistical Estimation

Notice that the likelihood of the data p(D|β) actually has the form of the binomial distribution, β^H (1 − β)^T, where H and T are the numbers of observed heads and tails. Its log likelihood can be rewritten as

log p(D|β) = log [β^H (1 − β)^T] = H log β + T log(1 − β)

Differentiating with respect to β and setting the result to zero:

d/dβ log p(D|β) = H/β − T/(1 − β) = 0  ⇒  β = H / (H + T)

The estimate is exactly the fraction of the observed data that comes up heads. The maximum likelihood estimate is nothing more than the relative frequency of observing a certain event (e.g., observing heads).
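As a minimal sketch of this estimator in Python (my own illustration, not part of the original slides; all names are hypothetical), the snippet below computes the closed-form estimate β = H/(H + T) for the HHTH sequence and checks it against a brute-force grid search over the log likelihood.

```python
# Maximum likelihood estimate of a coin's bias from an observed flip sequence,
# checked against a grid search over the binomial log likelihood.
import numpy as np

flips = "HHTH"                 # the sequence used in the example above
H = flips.count("H")
T = flips.count("T")

# Closed-form MLE derived above: beta = H / (H + T)
beta_mle = H / (H + T)

# Numerical check: maximize H*log(beta) + T*log(1 - beta) over a grid
betas = np.linspace(0.01, 0.99, 999)
log_lik = H * np.log(betas) + T * np.log(1 - betas)
beta_grid = betas[np.argmax(log_lik)]

print(beta_mle, beta_grid)     # both are approximately 0.75
```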
From Estimation to Inference

Bayes rule relates the posterior probability P(H|e) to the likelihood P(e|H), the prior P(H), and the normalizing constant P(e):

P(H|e) = P(e|H) P(H) / P(e)

It can be used for inference in three ways:

• Maximum likelihood: h_ML = arg max_{hi} P(e|hi), then predict P(X|e) ≈ P(X|h_ML).
• Maximum a posteriori: h_MAP = arg max_{hi} P(hi|e), where P(hi|e) = P(e|hi) P(hi) / P(e), then predict P(X|e) ≈ P(X|h_MAP).
• Bayesian estimate: P(X|e) = ∑i P(X|hi) P(hi|e).

Classification

Given a data set D = <xi, yi>, i = 1:n, with discrete yi, find a hypothesis which "best fits" the data. If yi ∈ {0, 1}, this is binary classification. If yi can take more than two values, the problem is called multi-class classification.

Applications for Classification

• Text classification (spam filtering, news filtering, building web directories, etc.)
• Image classification (face detection, object recognition, etc.)
• Prediction of cancer recurrence.
• Financial forecasting.
• Many, many more!

Simple Example

• Given "nucleus size", predict cancer recurrence.
• Univariate input: X = nucleus size.
• Binary output: Y ∈ {N = 0 (no recurrence), R = 1 (recurrence)}.
• Try: minimize the least-squares error.

Predicting a Class from Linear Regression

The least-squares fit (the red line on the slide) is Y′ = X(XᵀX)⁻¹XᵀY. How do we get a binary output?

• Threshold the output: y ≤ t for no recurrence, y > t for recurrence.
• Interpret the output as a probability: y = P(Recurrence).

In the probabilistic framework, the goal is to learn/estimate the posterior distribution P(y|x) and use it for classification.

Models for Binary Classification

Two probabilistic approaches:
1. Generative learning
2. Discriminative learning

Generative Learning

Separately model P(x|y) and P(y), then use these, through Bayes rule, to estimate P(y|x).

Considering Binary Classification Problems

Suppose you are trying to predict whether a movie review is positive based on the words that appear in the review, i.e., estimate p(y = + |x). Generative learning says that to calculate P(y|x), we estimate the likelihood P(x|y) and the prior P(y) and multiply them together. By the general rules of probability,

P(x|y) = P(x1, x2, …, xm|y) = P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) ⋯ P(xm|y, x1, x2, …, xm−1)

The challenge is that this likelihood distribution has a very large number of parameters.

Naive Bayes Assumption

Assume the xj are conditionally independent given y, i.e.,

p(xj|y, xk) = p(xj|y), ∀j, k

Example: Thunder is conditionally independent of Rain given Lightning:

P(Thunder|Rain, Lightning) = P(Thunder|Lightning)

Naive Bayes Assumption

Conditional independence assumption: the probability distribution governing a particular feature is conditionally independent of the probability distribution governing any other feature, as well as any combination/subset of other features, given that we know the label. (Graphical model: y → x1, x2, x3, …, xm.)

Naive Bayes Assumption

p(xj|y, xk) = p(xj|y), ∀j, k

This greatly simplifies the computation of the likelihood:

P(x|y) = P(x1, x2, …, xm|y)
       = P(x1|y) P(x2|y, x1) P(x3|y, x1, x2) ⋯ P(xm|y, x1, x2, …, xm−1)   (general rules of probability)
       = P(x1|y) P(x2|y) P(x3|y) ⋯ P(xm|y) = ∏j=1..m P(xj|y)   (Naive Bayes assumption)

Naive Bayes Assumption

How many parameters do we need to estimate? Assume m binary features.

• Without the NB assumption: 2(2^m − 1) = O(2^m) numbers to describe the model.
• With the NB assumption: 2m = O(m) numbers to describe the model.

The assumption is useful when the number of features is high.

Training a Naive Bayes Classifier

Learning = find parameters that maximize the log-likelihood. For a single example, P(y|x) ∝ P(y)P(x|y); over the whole data set, the likelihood is

∏i=1..n ( P(yi) ∏j=1..m P(xi,j|yi) )

• Samples i are independent, so we take the product over n.
• Input features are independent (conditional on y), so we take the product over m.

Suppose that x and y are binary variables, with m = 1 for simplicity (graphical model: y → x). Define:

θ1 = P(y = 1),  θj,1 = P(xj = 1|y = 1),  θj,0 = P(xj = 1|y = 0)

Training a Naive Bayes Classifier

Likelihood of the output variable:

L(θ1) = θ1^y (1 − θ1)^(1−y)

Log likelihood for all parameters:

log L(θ1, θj,1, θj,0 | D) = ∑i=1..n [ log P(yi) + ∑j=1..m log P(xi,j|yi) ]
  = ∑i=1..n [ yi log θ1 + (1 − yi) log(1 − θ1)
      + ∑j=1..m yi ( xi,j log θj,1 + (1 − xi,j) log(1 − θj,1) )
      + ∑j=1..m (1 − yi) ( xi,j log θj,0 + (1 − xi,j) log(1 − θj,0) ) ]

The log likelihood would take a different form if the class-conditional distributions P(x|y) took a different form. Here we chose Bernoulli; if the data were continuous, we could choose Gaussian. You can choose the distribution based on your knowledge of the problem, and this choice of distribution is a form of inductive bias.
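As a concrete illustration of this log likelihood, here is a minimal Python sketch (a hypothetical toy data set and function of my own, not code from the course) that evaluates log L(θ1, θj,1, θj,0 | D) for binary features exactly as written above.

```python
# Evaluate the Bernoulli Naive Bayes log likelihood for given parameters.
import numpy as np

# Toy binary data set: n = 4 examples, m = 2 features (made up for illustration).
X = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0]])
y = np.array([1, 1, 0, 0])

def nb_log_likelihood(theta1, theta_j1, theta_j0, X, y):
    """Sum over examples i of:
       yi*log(theta1) + (1 - yi)*log(1 - theta1)
       + sum_j yi*(xij*log(theta_{j,1}) + (1 - xij)*log(1 - theta_{j,1}))
       + sum_j (1 - yi)*(xij*log(theta_{j,0}) + (1 - xij)*log(1 - theta_{j,0}))"""
    ll = np.sum(y * np.log(theta1) + (1 - y) * np.log(1 - theta1))
    for j in range(X.shape[1]):
        xj = X[:, j]
        ll += np.sum(y * (xj * np.log(theta_j1[j]) + (1 - xj) * np.log(1 - theta_j1[j])))
        ll += np.sum((1 - y) * (xj * np.log(theta_j0[j]) + (1 - xj) * np.log(1 - theta_j0[j])))
    return ll

# Example call with arbitrary (hypothetical) parameter values:
print(nb_log_likelihood(0.5, np.array([0.8, 0.4]), np.array([0.2, 0.6]), X, y))
```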
Training a Naive Bayes Classifier

To estimate θ1, take the derivative of the log likelihood with respect to θ1, set it to 0, and solve:

δL/δθ1 = ∑i=1..n ( yi/θ1 − (1 − yi)/(1 − θ1) ) = 0

We get:

θ1 = (1/n) ∑i yi = (number of examples where y = 1) / (number of examples)

Similarly, we get:

θj,1 = (number of examples where xj = 1 and y = 1) / (number of examples where y = 1)
θj,0 = (number of examples where xj = 1 and y = 0) / (number of examples where y = 0)

The maximum likelihood estimates are just frequencies of events!

Log Likelihood Ratio

Suppose we have 2 classes: y ∈ {0, 1}. What is the probability that a given input x has class y = 1? Consider Bayes rule:

P(y = 1|x) = P(x, y = 1) / P(x)
           = P(x|y = 1) P(y = 1) / ( P(x|y = 1) P(y = 1) + P(x|y = 0) P(y = 0) )
           = 1 / ( 1 + [P(x|y = 0) P(y = 0)] / [P(x|y = 1) P(y = 1)] )
           = 1 / ( 1 + exp(−a) )

where

a = ln [ P(x|y = 1) P(y = 1) / ( P(x|y = 0) P(y = 0) ) ] = ln [ P(y = 1|x) / P(y = 0|x) ]

(by Bayes rule; P(x) in the numerator and denominator cancels out).

a is the log-odds ratio of the data being class 1 vs. class 0. It tells us which class (y = 1 vs. y = 0) has higher probability, and we can classify x accordingly.

Naive Bayes Decision Boundary

What does the decision boundary look like for Naive Bayes? Let's start from the log-odds ratio:

log [ P(y = 1|x) / P(y = 0|x) ] = log [ P(x|y = 1) P(y = 1) / ( P(x|y = 0) P(y = 0) ) ]
  = log [ P(y = 1) / P(y = 0) ] + log [ ∏j=1..m P(xj|y = 1) / ∏j=1..m P(xj|y = 0) ]
  = log [ P(y = 1) / P(y = 0) ] + ∑j=1..m log [ P(xj|y = 1) / P(xj|y = 0) ]

Naive Bayes Decision Boundary

Consider the case where the features are binary: xj ∈ {0, 1}. Define:

wj,0 = log [ P(xj = 0|y = 1) / P(xj = 0|y = 0) ]
wj,1 = log [ P(xj = 1|y = 1) / P(xj = 1|y = 0) ]

Now we have:

log [ P(y = 1|x) / P(y = 0|x) ] = log [ P(y = 1) / P(y = 0) ] + ∑j=1..m log [ P(xj|y = 1) / P(xj|y = 0) ]
  = log [ P(y = 1) / P(y = 0) ] + ∑j=1..m ( wj,0 (1 − xj) + wj,1 xj )
  = log [ P(y = 1) / P(y = 0) ] + ∑j=1..m wj,0 + ∑j=1..m (wj,1 − wj,0) xj

This is a linear decision boundary: the first two terms form the bias w0 and the last term is wᵀx with wj = wj,1 − wj,0.

Naive Bayes Decision Boundary

This shows that Naive Bayes with binary features has precisely the form of a linear model.
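Tying the pieces together, the sketch below (toy data and variable names are my own, not from the slides) fits the Bernoulli Naive Bayes parameters by the frequency counts derived above, builds the weights wj,0 and wj,1, and classifies a new input with the linear log-odds a(x) = w0 + wᵀx.

```python
# Train Bernoulli Naive Bayes by counting, then classify via the linear log-odds.
import numpy as np

# Toy data set: n = 4 examples, m = 3 binary features.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 0],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])

eps = 1e-9  # keeps the logs finite when a count is zero on this tiny data set

# Maximum likelihood estimates are just event frequencies:
theta1   = y.mean()                  # P(y = 1)
theta_j1 = X[y == 1].mean(axis=0)    # P(xj = 1 | y = 1)
theta_j0 = X[y == 0].mean(axis=0)    # P(xj = 1 | y = 0)

# Log-odds weights from the decision-boundary derivation:
w_j1 = np.log(theta_j1 + eps) - np.log(theta_j0 + eps)          # log P(xj=1|y=1)/P(xj=1|y=0)
w_j0 = np.log(1 - theta_j1 + eps) - np.log(1 - theta_j0 + eps)  # log P(xj=0|y=1)/P(xj=0|y=0)
w0 = np.log(theta1 + eps) - np.log(1 - theta1 + eps) + w_j0.sum()
w  = w_j1 - w_j0

def log_odds(x):
    # a(x) = log P(y=1|x)/P(y=0|x); predict y = 1 when a(x) > 0
    return w0 + w @ x

x_new = np.array([1, 0, 1])
print(log_odds(x_new), int(log_odds(x_new) > 0))
```

The small constant eps is only there so the toy example stays finite when an empirical count is zero; the derivation on the slides assumes all the probabilities involved are nonzero.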