CS 1571 Introduction to AI Lecture 25
Binary classification
Milos Hauskrecht [email protected] 5329 Sennott Square
Supervised learning
Data: $D = \{D_1, D_2, \ldots, D_n\}$, a set of n examples
  $D_i = \langle \mathbf{x}_i, y_i \rangle$
  $\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size d
  $y_i$ is the desired output (given by a teacher)
Objective: learn the mapping $f: X \to Y$
  s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$
• Regression: Y is continuous. Example: earnings, product orders → company stock price
• Classification: Y is discrete. Example: handwritten digit in binary form → digit label
Linear regression: review
• Function $f: X \to Y$ is a linear combination of input components
  $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$
  $w_0, w_1, \ldots, w_d$ are parameters (weights)
[Diagram: linear unit; the bias input 1 and the input vector $(x_1, \ldots, x_d)$ are weighted by $w_0, w_1, \ldots, w_d$ and summed to produce $f(\mathbf{x}, \mathbf{w})$]
Linear regression: review
• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$
• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$
• Error function measures how much our predictions deviate from the desired answers
  Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\ldots,n} (y_i - f(\mathbf{x}_i))^2$
• Learning: we want to find the weights minimizing the error!
Solving linear regression: review
• The optimal set of weights satisfies
  $\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)\,\mathbf{x}_i = 0$
  This leads to a system of linear equations (SLE) with d+1 unknowns, of the form $A\mathbf{w} = \mathbf{b}$:
  $w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \ldots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \ldots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$
• Solutions to the SLE: e.g. matrix inversion (if the matrix is nonsingular), $\mathbf{w} = A^{-1}\mathbf{b}$ (see the NumPy sketch below)
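A minimal NumPy sketch of this normal-equation solution; the toy data and variable names are illustrative assumptions, not from the lecture:

    import numpy as np

    # Toy data: n = 5 examples with d = 2 input components (made-up values)
    X = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8], [2.0, 1.5], [0.3, -1.0]])
    y = np.array([2.1, 0.9, 0.4, 4.2, -0.5])

    # Prepend the bias input x_0 = 1 to every example
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])

    # System A w = b with A = X^T X and b = X^T y
    A = Xb.T @ Xb
    b = Xb.T @ y

    # Solve the system; np.linalg.solve is preferred to explicit inversion
    # and requires A to be nonsingular
    w = np.linalg.solve(A, b)
    print(w)   # w[0] is the bias w_0, the rest are w_1 .. w_d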
Linear regression. Example
• 1-dimensional input $\mathbf{x} = (x_1)$
[Figure: linear regression fit to 1-dimensional data; fitted line over the data points]
Linear regression. Example.
• 2-dimensional input $\mathbf{x} = (x_1, x_2)$
[Figure: linear regression fit to 2-dimensional data; fitted plane over the data points]
Binary classification
• Two classes: $Y = \{0, 1\}$
• Our goal is to learn to classify correctly two types of examples
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn $f: X \to \{0, 1\}$
• Misclassification error function (a small counting sketch follows below):
  $\text{Error}_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & \text{if } f(\mathbf{x}_i, \mathbf{w}) \neq y_i \\ 0 & \text{if } f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$
• Error we would like to minimize: $E_{(\mathbf{x}, y)}[\text{Error}_1(\mathbf{x}, y)]$
• First step: we need to devise a model of the function
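For concreteness, a tiny helper (name and data are assumptions) that computes the average misclassification error on a labeled sample:

    import numpy as np

    def misclassification_error(y_true, y_pred):
        # Error_1(x_i, y_i) is 1 when f(x_i, w) != y_i and 0 otherwise;
        # the mean of these indicators is the fraction of misclassified examples
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        return np.mean(y_true != y_pred)

    # Example: 1 mistake out of 4 examples -> error 0.25
    print(misclassification_error([0, 1, 1, 0], [0, 1, 0, 0]))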
Discriminant functions
• One way to represent a classifier is by using discriminant functions
• Works for both binary and multi-way classification
• Idea:
  – For every class $i = 0, 1, \ldots, k$ define a function $g_i(\mathbf{x})$ mapping X to a real-valued score
  – When the decision on input $\mathbf{x}$ should be made, choose the class with the highest value of $g_i(\mathbf{x})$ (a small sketch of this rule follows below)
• So what happens with the input space? Assume a binary case.
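A sketch of the decision rule in Python, assuming the discriminant functions are given as a list of callables (the function names and the two linear discriminants are hypothetical):

    def classify(x, discriminants):
        # discriminants[i] computes g_i(x); pick the class whose score is largest
        scores = [g(x) for g in discriminants]
        return max(range(len(scores)), key=lambda i: scores[i])

    # Example with two made-up linear discriminants on a 2-D input
    g0 = lambda x: 1.0 - x[0] + 0.5 * x[1]
    g1 = lambda x: 0.2 + x[0] - 0.5 * x[1]
    print(classify([1.0, 0.3], [g0, g1]))   # prints 1, since g1 scores higher here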
Discriminant functions
[Figure: two classes of training points in a 2-D input space, with discriminant functions $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$]
Discriminant functions
[Figure: the same points, with the region where $g_1(\mathbf{x})$ exceeds $g_0(\mathbf{x})$ and the region where it does not marked]
Discriminant functions
[Figure: regions of the input space labeled by the relative values of $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$]
Discriminant functions
• Define decision boundary
[Figure: the decision boundary, i.e. the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$, separating the two regions of the input space]
Quadratic decision boundary
[Figure: a quadratic decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ separating the two classes]
Logistic regression model
• Defines a linear decision boundary in the input space
• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$    $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function
  $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
[Diagram: logistic regression unit; the bias input 1 and the input vector $(x_1, \ldots, x_d)$ are weighted by $w_0, \ldots, w_d$, summed to $z$, and passed through the logistic function to give $f(\mathbf{x}, \mathbf{w})$]
Logistic function
$g(z) = \frac{1}{1 + e^{-z}}$
• Is also referred to as a sigmoid function
• Replaces the threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0, 1] (a small implementation sketch follows the figure below)
[Figure: the logistic (sigmoid) function g(z) plotted for z from -20 to 20, rising smoothly from 0 to 1]
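A numerically stable implementation sketch of the logistic function; splitting the computation by the sign of z avoids overflow in exp (the function name is an assumption):

    import numpy as np

    def logistic(z):
        # g(z) = 1 / (1 + exp(-z)), computed branch-wise to avoid overflow
        z = np.asarray(z, dtype=float)
        out = np.empty_like(z)
        pos = z >= 0
        out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
        expz = np.exp(z[~pos])
        out[~pos] = expz / (1.0 + expz)
        return out

    print(logistic(np.array([-20.0, 0.0, 20.0])))  # approximately [0, 0.5, 1]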
Logistic regression model
• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$    $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T \mathbf{x})$
• Values of the discriminant functions vary in [0, 1]
  – Probabilistic interpretation:
  $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T \mathbf{x})$
[Diagram: the same logistic regression unit, with the output now interpreted as $p(y = 1 \mid \mathbf{x}, \mathbf{w})$]
Logistic regression
• We learn a probabilistic function $f: X \to [0, 1]$
  – where f describes the probability of class 1 given x:
  $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
  Note that: $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$
• Transformation to binary class values (see the sketch below):
  If $p(y = 1 \mid \mathbf{x}) \geq 1/2$ then choose 1, else choose 0
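A sketch of this class decision in Python; here w includes the bias $w_0$, x is augmented with the constant input 1, and the names and weight values are illustrative assumptions:

    import numpy as np

    def predict_class(w, x):
        # p(y=1 | x, w) = g(w^T x); choose class 1 when the probability is >= 1/2,
        # which is equivalent to checking w^T x >= 0
        p1 = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
        return 1 if p1 >= 0.5 else 0

    w = np.array([-0.5, 2.0, 1.0])   # [w_0, w_1, w_2], made-up weights
    x = np.array([1.0, 0.3, 0.4])    # [1, x_1, x_2] with the bias input
    print(predict_class(w, x))       # prints 1, since w^T x = 0.5 >= 0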
Linear decision boundary
• Logistic regression model defines a linear decision boundary
• Why?
• Answer: compare the two discriminant functions.
• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$
• For the boundary it must hold:
  $\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1 - g(\mathbf{w}^T \mathbf{x})}{g(\mathbf{w}^T \mathbf{x})} = 0$
  Since $g(\mathbf{w}^T \mathbf{x}) = \frac{\exp(\mathbf{w}^T \mathbf{x})}{1 + \exp(\mathbf{w}^T \mathbf{x})}$,
  $\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1 / (1 + \exp(\mathbf{w}^T \mathbf{x}))}{\exp(\mathbf{w}^T \mathbf{x}) / (1 + \exp(\mathbf{w}^T \mathbf{x}))} = \log \exp(-\mathbf{w}^T \mathbf{x}) = -\mathbf{w}^T \mathbf{x} = 0$
  so the boundary is the hyperplane $\mathbf{w}^T \mathbf{x} = 0$, which is linear in $\mathbf{x}$.
Logistic regression model. Decision boundary
• LR defines a linear decision boundary
• Example: 2 classes (blue and red points)
[Figure: blue and red points in a 2-D input space separated by a linear decision boundary]
Logistic regression: parameter learning
Likelihood of outputs
• Let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T \mathbf{x}_i)$
• Then
  $L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$
• Find weights $\mathbf{w}$ that maximize the likelihood of outputs
  – Apply the log-likelihood trick; the optimal weights are the same for both the likelihood and the log-likelihood
  $l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \left[ \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \right] = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
Logistic regression: parameter learning
• Log-likelihood
  $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood (nonlinear in the weights!)
  $\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j} (y_i - g(z_i))$
  $\nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$
• Gradient ascent on the log-likelihood (equivalently, gradient descent on the negative log-likelihood); a NumPy sketch of this batch update follows:
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k)\, \left[ \nabla_{\mathbf{w}} l(D, \mathbf{w}) \right]\big|_{\mathbf{w}^{(k-1)}}$
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} \left[ y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i) \right] \mathbf{x}_i$
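A minimal batch implementation of this update in Python/NumPy; the constant learning rate, the division by n, and the toy data are assumptions for illustration (X already contains the bias column of ones):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic_batch(X, y, n_iters=1000, alpha=0.1):
        # Gradient ascent on the log-likelihood:
        # w <- w + alpha * sum_i (y_i - g(w^T x_i)) x_i
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(n_iters):
            grad = X.T @ (y - sigmoid(X @ w))
            w = w + alpha * grad / n   # dividing by n keeps the step size stable
        return w

    # Tiny illustrative dataset (bias column already included)
    X = np.array([[1, 0.2], [1, 0.9], [1, -1.1], [1, 2.0]], dtype=float)
    y = np.array([0, 1, 0, 1], dtype=float)
    print(fit_logistic_batch(X, y))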
Logistic regression. Online gradient descent
• On-line component of the log-likelihood
  $J_{online}(D_i, \mathbf{w}) = y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• On-line learning update for the weights $\mathbf{w}$ uses the gradient of $J_{online}(D_k, \mathbf{w})$:
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k)\, \left[ \nabla_{\mathbf{w}} J_{online}(D_k, \mathbf{w}) \right]\big|_{\mathbf{w}^{(k-1)}}$
• k-th update for logistic regression, with $D_k = \langle \mathbf{x}_k, y_k \rangle$:
  $\mathbf{w}^{(k)} = \mathbf{w}^{(k-1)} + \alpha(k)\, \left[ y_k - f(\mathbf{w}^{(k-1)}, \mathbf{x}_k) \right] \mathbf{x}_k$
Online logistic regression algorithm
Online-logistic-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  for i = 1 : 1 : number of iterations
    do select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from D
       set $\alpha(i) \leftarrow 1/i$
       update weights (in parallel):
         $\mathbf{w} \leftarrow \mathbf{w} + \alpha(i)\, [y_i - f(\mathbf{w}, \mathbf{x}_i)]\, \mathbf{x}_i$
  end for
  return weights $\mathbf{w}$
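A Python sketch following the pseudocode above; the random selection of data points, the seed handling, and the toy data are assumptions for illustration:

    import numpy as np

    def online_logistic_regression(X, y, n_iters=2000, seed=None):
        # X: n x (d+1) matrix whose first column is the bias input 1; y: 0/1 labels
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for i in range(1, n_iters + 1):
            k = rng.integers(n)                          # select a data point D_k
            alpha = 1.0 / i                              # annealed learning rate alpha(i) = 1/i
            f = 1.0 / (1.0 + np.exp(-np.dot(w, X[k])))   # f(w, x_k) = g(w^T x_k)
            w = w + alpha * (y[k] - f) * X[k]            # online weight update
        return w

    X = np.array([[1, 0.2], [1, 0.9], [1, -1.1], [1, 2.0]], dtype=float)
    y = np.array([0, 1, 0, 1], dtype=float)
    print(online_logistic_regression(X, y, seed=0))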
Online algorithm. Example.
Derivation of the gradient
• Log-likelihood
  $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood, via the chain rule through $z_i = \mathbf{w}^T \mathbf{x}_i$:
  $\frac{\partial}{\partial w_j} l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$
  where $\frac{\partial z_i}{\partial w_j} = x_{i,j}$ and the derivative of the logistic function is $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)(1 - g(z_i))$
  $\frac{\partial}{\partial z_i} \left[ y_i \log g(z_i) + (1 - y_i) \log(1 - g(z_i)) \right] = y_i \frac{1}{g(z_i)} \frac{\partial g(z_i)}{\partial z_i} - (1 - y_i) \frac{1}{1 - g(z_i)} \frac{\partial g(z_i)}{\partial z_i}$
  $= y_i (1 - g(z_i)) - (1 - y_i)\, g(z_i) = y_i - g(z_i)$
• Hence
  $\nabla_{\mathbf{w}} l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - g(\mathbf{w}^T \mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i (y_i - f(\mathbf{w}, \mathbf{x}_i))$
Generative approach to classification
Idea:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Use it to define probabilistic discriminant functions
   E.g. $g_0(\mathbf{x}) = p(y = 0 \mid \mathbf{x})$,  $g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$
Typical model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$
• $p(\mathbf{x} \mid y)$ = class-conditional distributions (densities); for binary classification: two class-conditional distributions, $p(\mathbf{x} \mid y = 0)$ and $p(\mathbf{x} \mid y = 1)$
• $p(y)$ = priors on classes, the probability of class y; for binary classification: a Bernoulli distribution, $p(y = 0) + p(y = 1) = 1$
[Diagram: Bayesian network with class node Y and observation node X]
Naïve Bayes classifier
• A generative classifier model with an additional simplifying assumption:
  – All input attributes are conditionally independent of each other given the class. So we have:
  $p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
[Diagram: Bayesian network with class node Y and conditionally independent attribute nodes $X_1, X_2, \ldots, X_d$]
Learning of the Naïve Bayes classifier
Model: $p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
[Diagram: the same Bayesian network, Y with children $X_1, X_2, \ldots, X_d$]
Learning = learning of the parameters of the BBN:
  – Class prior $p(y)$
  – Class-conditional distributions $p(x_i \mid y)$ for all $i = 1, \ldots, d$
• In practice the class-conditional distributions can have different models: e.g. one attribute can be modeled using a Bernoulli distribution, another as a Gaussian density, or as a Poisson distribution
Making class decision for the Naïve Bayes
Discriminant functions – decide y = 1 when $g_1(\mathbf{x}) \geq g_0(\mathbf{x})$; two common choices:
• Posterior of a class – choose the class with the higher posterior probability:
  if $p(y = 1 \mid \mathbf{x}) > p(y = 0 \mid \mathbf{x})$ then y = 1, else y = 0, where
  $p(y = 1 \mid \mathbf{x}) = \frac{\prod_{i=1}^{d} p(x_i \mid y = 1)\, p(y = 1)}{\prod_{i=1}^{d} p(x_i \mid y = 0)\, p(y = 0) + \prod_{i=1}^{d} p(x_i \mid y = 1)\, p(y = 1)}$
• Likelihood of the data – choose the class that explains the input data $\mathbf{x}$ better (higher likelihood of the data):
  if $\prod_{i=1}^{d} p(x_i \mid y = 1) \geq \prod_{i=1}^{d} p(x_i \mid y = 0)$ then y = 1, else y = 0
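A minimal Bernoulli Naïve Bayes sketch for binary attributes, using the posterior-style rule in log form; the Laplace smoothing, function names, and toy data are assumptions for illustration:

    import numpy as np

    def train_bernoulli_nb(X, y):
        # X: n x d matrix of 0/1 attributes, y: 0/1 class labels
        params = {}
        for c in (0, 1):
            Xc = X[y == c]
            params[c] = {
                "prior": len(Xc) / len(X),                      # p(y = c)
                "theta": (Xc.sum(axis=0) + 1) / (len(Xc) + 2),  # p(x_i = 1 | y = c), Laplace-smoothed
            }
        return params

    def predict_nb(params, x):
        # Choose the class with the larger log p(y=c) + sum_i log p(x_i | y=c)
        scores = {}
        for c, p in params.items():
            theta = p["theta"]
            scores[c] = np.log(p["prior"]) + np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))
        return max(scores, key=scores.get)

    X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
    y = np.array([1, 1, 0, 0])
    print(predict_nb(train_bernoulli_nb(X, y), np.array([1, 0, 1])))   # -> class 1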