
CS 1571 Introduction to AI
Lecture 25: Binary classification
Milos Hauskrecht
[email protected]
5329 Sennott Square

Supervised learning

• Data: $D = \{D_1, D_2, \ldots, D_n\}$ is a set of $n$ examples, where
  $D_i = \langle \mathbf{x}_i, y_i \rangle$,
  $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size $d$, and
  $y_i$ is the desired output (given by a teacher).
• Objective: learn the mapping $f : X \rightarrow Y$ such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$.
• Regression: $Y$ is continuous.
  Example: earnings, product orders → company stock price.
• Classification: $Y$ is discrete.
  Example: handwritten digit in binary form → digit label.

Linear regression: review

• The function $f : X \rightarrow Y$ is a linear combination of the input components:
  $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$
  where $w_0, w_1, \ldots, w_d$ are the parameters (weights).
[Figure: a linear unit; the bias term (constant input 1 with weight $w_0$) and the input vector $\mathbf{x} = (x_1, \ldots, x_d)$ with weights $w_1, \ldots, w_d$ are combined to produce $f(\mathbf{x}, \mathbf{w})$]

Linear regression: review

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$.
• Function: $\mathbf{x}_i \rightarrow f(\mathbf{x}_i)$.
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$.
• An error function measures how much our predictions deviate from the desired answers.
  Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i))^2$
• Learning: we want to find the weights minimizing the error!

Solving linear regression: review

• The optimal set of weights satisfies:
  $\nabla_{\mathbf{w}} (J_n(\mathbf{w})) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\, \mathbf{x}_i = \mathbf{0}$
• This leads to a system of linear equations (SLE) with $d+1$ unknowns of the form $A\mathbf{w} = \mathbf{b}$, one equation for each $j = 0, 1, \ldots, d$:
  $w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \ldots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$
• Solutions to the SLE: e.g. matrix inversion, $\mathbf{w} = A^{-1}\mathbf{b}$ (possible if the matrix $A$ is non-singular). A small numerical sketch of this solution appears after the discriminant-functions overview below.

Linear regression. Example.

• 1-dimensional input $\mathbf{x} = (x_1)$.
[Figure: data points and the fitted regression line for a 1-dimensional input]

Linear regression. Example.

• 2-dimensional input $\mathbf{x} = (x_1, x_2)$.
[Figure: data points and the fitted regression plane for a 2-dimensional input]

Binary classification

• Two classes $Y = \{0, 1\}$.
• Our goal is to learn to classify correctly two types of examples:
  – Class 0 – labeled as 0,
  – Class 1 – labeled as 1.
• We would like to learn $f : X \rightarrow \{0, 1\}$.
• Misclassification error function:
  $\mathrm{Error}_1(\mathbf{x}_i, y_i) = 1$ if $f(\mathbf{x}_i, \mathbf{w}) \neq y_i$, and $0$ if $f(\mathbf{x}_i, \mathbf{w}) = y_i$.
• The error we would like to minimize: $E_{(\mathbf{x}, y)}[\mathrm{Error}_1(\mathbf{x}, y)]$.
• First step: we need to devise a model of the function.

Discriminant functions

• One way to represent a classifier is by using discriminant functions.
• Works for both binary and multi-way classification.
• Idea:
  – For every class $i = 0, 1, \ldots, k$ define a function $g_i(\mathbf{x})$ mapping $X$ to a real-valued score.
  – When the decision on input $\mathbf{x}$ should be made, choose the class with the highest value of $g_i(\mathbf{x})$.
• So what happens with the input space? Assume a binary case. (See the code sketches below.)
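The following is a minimal numerical sketch of the least-squares solution from the linear regression review above. Python with NumPy is not prescribed by the lecture; the toy data, variable names, and the use of numpy.linalg.solve are illustrative assumptions.

# Sketch: solving the linear regression normal equations A w = b.
import numpy as np

# Toy 1-dimensional data set D = {<x_i, y_i>} (assumed for illustration)
x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
y = np.array([-3.1, -1.4, 0.2, 1.9, 3.8, 5.2])

# Augment each input with the constant bias component x_0 = 1
X = np.column_stack([np.ones_like(x), x])   # shape (n, d+1)

# Normal equations: A = sum_i x_i x_i^T, b = sum_i y_i x_i
A = X.T @ X
b = X.T @ y
w = np.linalg.solve(A, b)                   # valid when A is non-singular

# Mean-squared error J_n of the fitted weights
J_n = np.mean((y - X @ w) ** 2)
print("weights:", w, "MSE:", J_n)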
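A similarly minimal sketch of the discriminant-function decision rule and the 0/1 misclassification error from the binary classification slides above. The particular discriminants used here (negative squared distances to two hypothetical class centers) are an illustrative assumption, not the model developed in the rest of the lecture.

# Sketch: classify by choosing the class with the highest discriminant value.
import numpy as np

c0 = np.array([-1.0, -1.0])   # assumed "center" of class 0
c1 = np.array([ 1.0,  1.0])   # assumed "center" of class 1

def g0(x): return -np.sum((x - c0) ** 2)   # discriminant for class 0
def g1(x): return -np.sum((x - c1) ** 2)   # discriminant for class 1

def classify(x):
    # choose the class with the highest value of g_i(x)
    return 1 if g1(x) > g0(x) else 0

# 0/1 misclassification error on a tiny toy set (assumed data)
X = np.array([[-0.9, -1.2], [1.1, 0.8], [0.2, -0.1]])
y = np.array([0, 1, 1])
errors = [int(classify(x_i) != y_i) for x_i, y_i in zip(X, y)]
print("Error_1 per example:", errors, "empirical error:", np.mean(errors))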
Discriminant functions

[Figure: examples from two classes in a 2-dimensional input space; in one region of the space $g_1(\mathbf{x}) > g_0(\mathbf{x})$, in the other $g_1(\mathbf{x}) < g_0(\mathbf{x})$]

Discriminant functions

• Define the decision boundary: the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$.
[Figure: the same two-class data with a linear decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ separating the region where $g_1(\mathbf{x}) > g_0(\mathbf{x})$ from the region where $g_1(\mathbf{x}) < g_0(\mathbf{x})$]

Quadratic decision boundary

[Figure: two-class data separated by a quadratic decision boundary, with $g_1(\mathbf{x}) > g_0(\mathbf{x})$ on one side and $g_1(\mathbf{x}) < g_0(\mathbf{x})$ on the other]

Logistic regression model

• Defines a linear decision boundary in the input space.
• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$,
  where $g(z) = 1/(1 + e^{-z})$ is a logistic function.
• $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
[Figure: a logistic unit; the bias term (constant input 1 with weight $w_0$) and the input vector $\mathbf{x} = (x_1, \ldots, x_d)$ are combined into $z = \mathbf{w}^T \mathbf{x}$, which is passed through the logistic function to produce $f(\mathbf{x}, \mathbf{w})$]

Logistic function

• $g(z) = \frac{1}{1 + e^{-z}}$
• Is also referred to as a sigmoid function.
• Replaces the hard threshold function with smooth switching.
• Takes a real number and outputs a number in the interval [0, 1].
[Figure: the logistic (sigmoid) function $g(z)$ plotted for $z$ ranging from -20 to 20]

Logistic regression model

• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$
• Values of the discriminant functions vary in [0, 1] – probabilistic interpretation:
  $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$

Logistic regression

• We learn a probabilistic function $f : X \rightarrow [0, 1]$, where $f$ describes the probability of class 1 given $\mathbf{x}$:
  $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{w}^T \mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
  Note that $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$.
• Transformation to binary class values:
  if $p(y = 1 \mid \mathbf{x}) \geq 1/2$ then choose 1, else choose 0.
  (A code sketch of this model follows the parameter-learning slide below.)

Linear decision boundary

• The logistic regression model defines a linear decision boundary.
• Why? Compare the two discriminant functions.
• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$.
• For the boundary it must hold that
  $\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1 - g(\mathbf{w}^T \mathbf{x})}{g(\mathbf{w}^T \mathbf{x})} = 0$.
  Since $g(\mathbf{w}^T \mathbf{x}) = \frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}$ and $1 - g(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(\mathbf{w}^T \mathbf{x})}$, we get
  $\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1}{\exp(\mathbf{w}^T \mathbf{x})} = -\mathbf{w}^T \mathbf{x} = 0$,
  i.e. the decision boundary is the hyperplane $\mathbf{w}^T \mathbf{x} = 0$.

Logistic regression model. Decision boundary

• Logistic regression defines a linear decision boundary.
• Example: 2 classes (blue and red points).
[Figure: blue and red points in a 2-dimensional input space separated by the linear decision boundary of a logistic regression model]

Logistic regression: parameter learning

• Likelihood of outputs: let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$. Then
  $L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$
• Find the weights $\mathbf{w}$ that maximize the likelihood of the outputs.
  – Apply the log-likelihood trick: the optimal weights are the same for both the likelihood and the log-likelihood.
  $l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \left[ \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \right] = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
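The following is a minimal sketch of the logistic regression model described above: the sigmoid, the class-1 probability $p(y = 1 \mid \mathbf{x}, \mathbf{w})$, and the 1/2-threshold decision rule. Python with NumPy is not prescribed by the lecture, and the example weights and input are assumed for illustration.

# Sketch of the logistic regression model f(x, w) = g(w^T x).
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # p(y = 1 | x, w); x is augmented with the bias component x_0 = 1
    return sigmoid(w @ x)

def predict_class(w, x):
    # threshold the class-1 probability at 1/2
    return 1 if predict_proba(w, x) >= 0.5 else 0

# Illustrative weights and a 2-dimensional input (plus the bias component)
w = np.array([-0.5, 2.0, 1.0])    # w_0, w_1, w_2 (assumed values)
x = np.array([1.0, 0.3, -0.2])    # x_0 = 1, x_1, x_2
print(predict_proba(w, x), predict_class(w, x))
# Inputs with w^T x = 0 lie exactly on the linear decision boundary,
# where p(y = 1 | x, w) = 1/2.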
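And a small sketch of the likelihood and log-likelihood of the outputs just defined; it reuses the sigmoid from the previous sketch, and the toy data set and weights are again assumptions for illustration.

# Sketch: log-likelihood l(D, w) of the outputs under the logistic regression model.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    # l(D, w) = sum_i y_i log mu_i + (1 - y_i) log(1 - mu_i), with mu_i = g(w^T x_i)
    mu = sigmoid(X @ w)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Toy data: each row of X already contains the bias component x_0 = 1
X = np.array([[1.0, -1.2], [1.0, -0.3], [1.0, 0.4], [1.0, 1.5]])
y = np.array([0, 0, 1, 1])
w = np.array([0.1, 1.0])          # assumed weights

print("log-likelihood:", log_likelihood(w, X, y))
# The likelihood itself is exp(l(D, w)); maximizing either gives the same weights.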
Logistic regression: parameter learning

• Log-likelihood:
  $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood (nonlinear in the weights!):
  $\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j} (y_i - g(z_i))$
  $\nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$
• Gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood):
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k) \left[ \nabla_{\mathbf{w}} l(D, \mathbf{w}) \right]_{\mathbf{w}^{(k-1)}} = \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} [y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i)]\, \mathbf{x}_i$

Logistic regression. Online gradient descent

• On-line component of the log-likelihood:
  $J_{online}(D_i, \mathbf{w}) = y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• On-line learning update for the weights, using only the data point $D_k$ seen at step $k$:
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k) \left[ \nabla_{\mathbf{w}} J_{online}(D_k, \mathbf{w}) \right]_{\mathbf{w}^{(k-1)}}$
• The $k$-th update for logistic regression and $D_k = \langle \mathbf{x}_k, y_k \rangle$:
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k) [y_k - f(\mathbf{w}^{(k-1)}, \mathbf{x}_k)]\, \mathbf{x}_k$

Online logistic regression algorithm

Online-logistic-regression(D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  for i = 1 to number of iterations do
    select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from $D$
    set the learning rate $\alpha = 1/i$
    update the weights (in parallel): $\mathbf{w} \leftarrow \mathbf{w} + \alpha\, [y_i - f(\mathbf{w}, \mathbf{x}_i)]\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$
(A runnable sketch of this procedure appears after the gradient derivation below.)

Online algorithm. Example.

[Figure: a sequence of plots showing how the decision boundary produced by the online logistic regression algorithm changes as more updates are performed]

Derivation of the gradient

• Log-likelihood: $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood, using the chain rule through $z_i = \mathbf{w}^T \mathbf{x}_i$:
  $\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$, with $\frac{\partial z_i}{\partial w_j} = x_{i,j}$
• Derivative of the logistic function: $\frac{\partial}{\partial z_i} g(z_i) = g(z_i)(1 - g(z_i))$
• Therefore
  $\frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log(1 - g(z_i)) \right] = y_i \frac{1}{g(z_i)} g(z_i)(1 - g(z_i)) - (1 - y_i) \frac{1}{1 - g(z_i)} g(z_i)(1 - g(z_i)) = y_i (1 - g(z_i)) - (1 - y_i) g(z_i) = y_i - g(z_i)$
• Putting it together:
  $\nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$
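The following is a minimal runnable sketch of the online logistic regression algorithm above, using the annealed learning rate $\alpha = 1/i$ from the pseudocode. Python with NumPy is not prescribed by the lecture; the toy data set, the zero initialization, and the random selection of data points are assumptions for illustration.

# Sketch of the online logistic regression algorithm with alpha = 1/i.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(X, y, num_iterations, seed=0):
    # X: (n, d+1) inputs already augmented with the bias component x_0 = 1
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])                 # initialize weights
    for i in range(1, num_iterations + 1):
        k = rng.integers(len(y))             # select a data point from D
        alpha = 1.0 / i                      # annealed learning rate
        f = sigmoid(w @ X[k])                # f(w, x_k) = p(y = 1 | x_k, w)
        w = w + alpha * (y[k] - f) * X[k]    # on-line update of all weights
    return w

# Toy, linearly separable data set (assumed for illustration)
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = online_logistic_regression(X, y, num_iterations=500)
print("learned weights:", w)
print("predicted probabilities:", sigmoid(X @ w))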
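As a quick sanity check on the gradient derivation above, the sketch below compares the closed-form gradient $\sum_i \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i))$ against a finite-difference approximation of $\nabla_{\mathbf{w}} l(D, \mathbf{w})$; the finite-difference check itself and the toy data are additions for illustration, not part of the lecture.

# Numerical check: analytic gradient of the log-likelihood vs. finite differences.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    mu = sigmoid(X @ w)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

def analytic_gradient(w, X, y):
    # sum_i x_i (y_i - g(w^T x_i))
    return X.T @ (y - sigmoid(X @ w))

def numerical_gradient(w, X, y, eps=1e-6):
    grad = np.zeros_like(w)
    for j in range(len(w)):
        e = np.zeros_like(w)
        e[j] = eps
        grad[j] = (log_likelihood(w + e, X, y) - log_likelihood(w - e, X, y)) / (2 * eps)
    return grad

X = np.array([[1.0, -1.2], [1.0, -0.3], [1.0, 0.4], [1.0, 1.5]])   # assumed toy data
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.array([0.2, 0.7])

print("analytic: ", analytic_gradient(w, X, y))
print("numerical:", numerical_gradient(w, X, y))   # the two should agree closely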
Generative approach to classification

Idea:
1.