CS 1571 Introduction to AI Lecture 25

Binary classification

Milos Hauskrecht [email protected] 5329 Sennott Square


Supervised learning

Data: $D = \{D_1, D_2, \ldots, D_n\}$ is a set of $n$ examples

$D_i = \langle \mathbf{x}_i, y_i \rangle$

$\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size $d$

$y_i$ is the desired output (given by a teacher)
Objective: learn the mapping $f : X \to Y$

such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$

• Regression: $Y$ is continuous. Example: earnings, product orders, company stock price
• Classification: $Y$ is discrete. Example: a handwritten digit in binary form (input) and its digit label (output)


Linear regression: review

• Function $f : X \to Y$ is a linear combination of input components: $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$

w0 , w1 , wk - parameters (weights)

[Figure: the linear model as a unit: a bias input 1 with weight $w_0$ and inputs $x_1, \ldots, x_d$ with weights $w_1, \ldots, w_d$ feed a sum that outputs $f(\mathbf{x}, \mathbf{w})$.]


Linear regression: review

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$

• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$

• Error function measures how much our predictions deviate from the desired answers

Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\ldots,n} \left( y_i - f(\mathbf{x}_i) \right)^2$

• Learning: we want to find the weights minimizing the error!
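As a concrete illustration of the model and the error function above, here is a minimal NumPy sketch (the function names and the example data are hypothetical, not from the lecture):

```python
import numpy as np

def predict(X, w):
    """Linear model f(x, w) = w0 + w1*x1 + ... + wd*xd, applied to each row of X."""
    return w[0] + X @ w[1:]

def mean_squared_error(X, y, w):
    """J_n = (1/n) * sum_i (y_i - f(x_i))^2."""
    return np.mean((y - predict(X, w)) ** 2)

# Hypothetical 1-D example: y is roughly 2 + 3*x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 2.0, size=(20, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.5, size=20)
print(mean_squared_error(X, y, np.array([2.0, 3.0])))
```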


Solving linear regression: review

• The optimal set of weights satisfies: $\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^T \mathbf{x}_i \right) \mathbf{x}_i = 0$
This leads to a system of linear equations (SLE) with $d+1$ unknowns, of the form $A\mathbf{w} = \mathbf{b}$:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \cdots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \cdots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

Solutions to the SLE: e.g., matrix inversion (if the matrix $A$ is nonsingular): $\mathbf{w} = A^{-1}\mathbf{b}$
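A minimal sketch of this solution in NumPy, assuming a design matrix with a prepended column of ones for the bias term (all names and the example data are illustrative):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations A w = b, with A = Phi^T Phi and b = Phi^T y,
    where Phi is X with a leading column of ones (the bias input x_0 = 1)."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Phi.T @ Phi
    b = Phi.T @ y
    # Solving the system directly is preferable to explicitly inverting A.
    return np.linalg.solve(A, b)

# Hypothetical usage with random data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
print(fit_linear_regression(X, y))  # approximately [1, 2, -1]
```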


Linear regression. Example

• 1-dimensional input $\mathbf{x} = (x_1)$

[Figure: 1-D linear regression example; data points and the fitted regression line.]


Linear regression. Example.

• 2-dimensional input $\mathbf{x} = (x_1, x_2)$

[Figure: 2-D linear regression example; data points and the fitted regression plane.]


Binary classification

• Two classes $Y = \{0, 1\}$
• Our goal is to learn to classify correctly two types of examples:
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn $f : X \to \{0, 1\}$
• Misclassification error function:

$\text{Error}_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & \text{if } f(\mathbf{x}_i, \mathbf{w}) \ne y_i \\ 0 & \text{if } f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$

• Error we would like to minimize: $E_{(\mathbf{x}, y)}\left[ \text{Error}_1(\mathbf{x}, y) \right]$
• First step: we need to devise a model of the function
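The expectation above is over the data distribution; on a finite dataset it is usually estimated by the empirical misclassification rate, as in this small sketch (the function name is illustrative):

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    """Empirical estimate of E[Error_1]: the fraction of examples
    where the predicted class differs from the desired class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

print(misclassification_error([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
```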


Discriminant functions

• One way to represent a classifier is by using:
  – Discriminant functions
• Works for both binary and multi-way classification

• Idea:

– For every class $i = 0, 1, \ldots, k$ define a function $g_i(\mathbf{x})$ mapping $X \to \mathbb{R}$
– When a decision on input $\mathbf{x}$ should be made, choose the

class with the highest value of $g_i(\mathbf{x})$ (see the sketch below)

• So what happens to the input space? Assume the binary case.
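Before looking at the input space, here is a minimal sketch of the decision rule just described: evaluate each class's discriminant function and pick the argmax (the particular linear discriminants used here are only an illustrative assumption):

```python
import numpy as np

def classify(x, discriminants):
    """Choose the class i with the highest value of g_i(x)."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Hypothetical binary case with two linear discriminant functions.
def g0(x):
    return -1.0 + x[0] - x[1]

def g1(x):
    return 0.5 - x[0] + x[1]

print(classify(np.array([0.2, 1.0]), [g0, g1]))  # class 1 for this input
```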


Discriminant functions

[Figure: examples from two classes plotted in a 2-D input space; one region is labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

[Figure: the same two-class data; additional regions of the input space are labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

[Figure: the same two-class data with all regions of the input space labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

• Define decision boundary

[Figure: the decision boundary, the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$, separating the regions in which $g_1(\mathbf{x})$ or $g_0(\mathbf{x})$ is larger.]


Quadratic decision boundary

[Figure: a quadratic decision boundary where $g_1(\mathbf{x}) = g_0(\mathbf{x})$, with $g_1(\mathbf{x}) > g_0(\mathbf{x})$ on one side and $g_0(\mathbf{x}) > g_1(\mathbf{x})$ on the other.]


Logistic regression model

• Defines a linear decision boundary in the input space
• Discriminant functions: $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function
• $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{w}^T\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

[Figure: the logistic regression unit: a bias input 1 with weight $w_0$ and inputs $x_1, \ldots, x_d$ with weights $w_1, \ldots, w_d$ feed a sum $z$, which is passed through the logistic function to output $f(\mathbf{x}, \mathbf{w})$.]
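A minimal NumPy sketch of this model, computing $g(\mathbf{w}^T\mathbf{x})$ for an input with a prepended bias term (the weights and input values are illustrative):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w):
    """Logistic regression model f(x, w) = g(w^T x); x includes the bias input x_0 = 1."""
    return logistic(w @ x)

w = np.array([0.5, 2.0, -1.0])   # hypothetical weights [w0, w1, w2]
x = np.array([1.0, 0.3, 0.8])    # bias input 1 followed by x1, x2
print(f(x, w))                   # interpreted as the probability that y = 1 (see below)
```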


Logistic function: $g(z) = \frac{1}{1 + e^{-z}}$
• Also referred to as a sigmoid function
• Replaces the threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0, 1]

[Figure: plot of the logistic function $g(z)$ for $z$ from $-20$ to $20$, rising smoothly from 0 to 1 with $g(0) = 0.5$.]


Logistic regression model

• Discriminant functions: $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• Values of the discriminant functions vary in [0, 1]
  – Probabilistic interpretation: $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{w}, \mathbf{x}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

[Figure: the same logistic regression unit, with the output now labeled $p(y = 1 \mid \mathbf{x}, \mathbf{w})$.]


Logistic regression

• We learn a probabilistic function $f : X \to [0, 1]$
  – where $f$ describes the probability of class 1 given $\mathbf{x}$:

$f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{w}^T\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
Note that: $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$

• Transformation to binary class values:

If $p(y = 1 \mid \mathbf{x}) \ge 1/2$ then choose 1, else choose 0


Linear decision boundary

• The logistic regression model defines a linear decision boundary
• Why?
• Answer: compare the two discriminant functions.

• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$
• On the boundary it must hold that:
$\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1 - g(\mathbf{w}^T\mathbf{x})}{g(\mathbf{w}^T\mathbf{x})} = 0$
$\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{\frac{1}{1 + \exp(\mathbf{w}^T\mathbf{x})}}{\frac{\exp(\mathbf{w}^T\mathbf{x})}{1 + \exp(\mathbf{w}^T\mathbf{x})}} = \log \exp(-\mathbf{w}^T\mathbf{x}) = -\mathbf{w}^T\mathbf{x} = 0$
• The boundary is therefore the set of points with $\mathbf{w}^T\mathbf{x} = 0$, which is linear in $\mathbf{x}$.


Logistic regression model. Decision boundary

• Logistic regression defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: "Decision boundary" plot; blue and red points in a 2-D input space separated by a straight-line decision boundary.]


Logistic regression: parameter learning

• Likelihood of outputs: let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T\mathbf{x}_i)$
• Then
$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$
• Find the weights $\mathbf{w}$ that maximize the likelihood of the outputs
  – Apply the log-likelihood trick: the optimal weights are the same for the likelihood and the log-likelihood
$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \left[ \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \right] = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
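A minimal NumPy sketch of this log-likelihood for a given weight vector, assuming the design matrix already carries a bias column of ones (all names and data are illustrative):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w):
    """l(D, w) = sum_i y_i log(mu_i) + (1 - y_i) log(1 - mu_i),
    where mu_i = g(w^T x_i)."""
    mu = logistic(X @ w)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Hypothetical usage.
X = np.array([[1.0, 0.2], [1.0, -1.5], [1.0, 2.3]])  # bias column + one attribute
y = np.array([1, 0, 1])
print(log_likelihood(X, y, np.array([0.0, 1.0])))
```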


Logistic regression: parameter learning
• Log-likelihood: $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood (nonlinear in the weights!):
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j}\,(y_i - g(z_i))$
$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T\mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{w}, \mathbf{x}_i))$
• Gradient descent (on the negative log-likelihood $-l(D, \mathbf{w})$):
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \big|_{\mathbf{w}^{(k-1)}}$
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} \left[ y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i) \right] \mathbf{x}_i$


Logistic regression. Online gradient descent

• On-line component of the log-likelihood (the contribution of a single example $D_i$):
$-J_{\text{online}}(D_i, \mathbf{w}) = y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• On-line learning update for the weights $\mathbf{w}$, using $J_{\text{online}}(D_k, \mathbf{w})$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \left[ \nabla_{\mathbf{w}} J_{\text{online}}(D_k, \mathbf{w}) \right] \big|_{\mathbf{w}^{(k-1)}}$
• The $k$-th update for logistic regression, with $D_k = \langle \mathbf{x}_k, y_k \rangle$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \left[ y_k - f(\mathbf{w}^{(k-1)}, \mathbf{x}_k) \right] \mathbf{x}_k$


Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)

initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
for i = 1 : 1 : number of iterations
  do select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from D
     set $\alpha = 1/i$
     update weights (in parallel):
     $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ y_i - f(\mathbf{w}, \mathbf{x}_i) \right] \mathbf{x}_i$
end for
return weights $\mathbf{w}$


Online algorithm. Example.


Derivation of the gradient
• Log-likelihood: $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood:
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$
• with $\frac{\partial z_i}{\partial w_j} = x_{i,j}$ and the derivative of the logistic function $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)\,(1 - g(z_i))$

 1 g(zi ) 1 g(zi ) yi logi (1 yi )log(1 i )  yi (1 yi ) zi g(zi ) zi 1 g(zi ) zi

 yi (1 g(zi )) (1 yi )(g(zi ))  yi  g(zi )

$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T\mathbf{x}_i) \right) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{w}, \mathbf{x}_i) \right)$
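A quick numerical sanity check of the key step above, that $\frac{\partial}{\partial z}\left[ y \log g(z) + (1-y)\log(1-g(z)) \right] = y - g(z)$, using a finite-difference approximation (a small self-check, not part of the lecture):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_loglik(z, y):
    """y * log g(z) + (1 - y) * log(1 - g(z))."""
    return y * np.log(logistic(z)) + (1 - y) * np.log(1 - logistic(z))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (per_example_loglik(z + eps, y) - per_example_loglik(z - eps, y)) / (2 * eps)
analytic = y - logistic(z)
print(numeric, analytic)   # the two values agree to several decimal places
```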


Generative approach to classification

Idea:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Use it to define probabilistic discriminant functions

E.g., $g_0(\mathbf{x}) = p(y = 0 \mid \mathbf{x})$ and $g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$

Typical model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$
• $p(\mathbf{x} \mid y)$ = class-conditional distributions (densities); for binary classification there are two class-conditional distributions, $p(\mathbf{x} \mid y = 0)$ and $p(\mathbf{x} \mid y = 1)$
• $p(y)$ = prior on classes, the probability of class $y$; for binary classification a Bernoulli distribution with $p(y = 0) + p(y = 1) = 1$


Naïve Bayes classifier

• A generative classifier model with an additional simplifying assumption: – All input attributes are conditionally independent of each other given the class. So we have:

$p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
[Figure: Bayesian network with the class node $Y$ as the parent of the attribute nodes $X_1, X_2, \ldots, X_d$.]


Learning of the Naïve Bayes classifier

Model: $p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
Learning = learning the parameters of the BBN:
– the class prior $p(y)$
– the class-conditional distributions:

$p(x_i \mid y)$ for all $i = 1, \ldots, d$

• In practice, the class-conditional distributions can use different models: e.g., one attribute can be modeled with a Bernoulli distribution, another as a Gaussian density, another as a Poisson distribution
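A minimal sketch of this learning step for binary (0/1) attributes, estimating the class prior and Bernoulli class-conditionals by relative frequencies (the add-one smoothing and all names are illustrative assumptions, not from the lecture):

```python
import numpy as np

def fit_naive_bayes_bernoulli(X, y):
    """Estimate p(y = 1) and p(x_i = 1 | y) for each attribute i and class y
    from binary data, with add-one (Laplace) smoothing."""
    prior1 = y.mean()                                  # p(y = 1)
    cond = {}                                          # cond[c][i] = p(x_i = 1 | y = c)
    for c in (0, 1):
        Xc = X[y == c]
        cond[c] = (Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0)
    return prior1, cond

# Hypothetical binary data: 4 examples, 3 attributes.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
prior1, cond = fit_naive_bayes_bernoulli(X, y)
print(prior1, cond[0], cond[1])
```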

Making class decisions with the Naïve Bayes classifier

Discriminant functions:
• Posterior of a class: choose the class with the higher posterior probability:
if $p(y = 1 \mid \mathbf{x}) > p(y = 0 \mid \mathbf{x})$ then $y = 1$, else $y = 0$, where
$p(y = 1 \mid \mathbf{x}) = \frac{\left( \prod_{i=1}^{d} p(x_i \mid y = 1) \right) p(y = 1)}{\left( \prod_{i=1}^{d} p(x_i \mid y = 0) \right) p(y = 0) + \left( \prod_{i=1}^{d} p(x_i \mid y = 1) \right) p(y = 1)}$
• Likelihood of the data: choose the class that explains the input $\mathbf{x}$ better:
if $\prod_{i=1}^{d} p(x_i \mid y = 1) > \prod_{i=1}^{d} p(x_i \mid y = 0)$ then $y = 1$, else $y = 0$
(here the two products play the role of the discriminant functions $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$)
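A minimal sketch of the posterior-based decision rule, reusing the Bernoulli class-conditional format from the previous sketch (all names and numbers are illustrative):

```python
import numpy as np

def naive_bayes_classify(x, prior1, cond):
    """Choose y = 1 iff p(y=1|x) > p(y=0|x), i.e. compare
    p(y) * prod_i p(x_i | y) for the two classes (binary attributes)."""
    scores = {}
    for c, prior in ((0, 1.0 - prior1), (1, prior1)):
        p = cond[c]                                        # p(x_i = 1 | y = c) for each i
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p)) # prod_i p(x_i | y = c)
        scores[c] = prior * likelihood
    return 1 if scores[1] > scores[0] else 0

# Hypothetical usage with parameters in the format of the earlier fit sketch.
prior1 = 0.5
cond = {0: np.array([0.25, 0.5, 0.5]), 1: np.array([0.75, 0.5, 0.75])}
print(naive_bayes_classify(np.array([1, 0, 1]), prior1, cond))  # class 1
```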
