CS 1571 Introduction to AI Lecture 25

Binary classification

Milos Hauskrecht [email protected] 5329 Sennott Square


Supervised learning

Data: $D = \{D_1, D_2, \ldots, D_n\}$ is a set of $n$ examples

$D_i = \langle \mathbf{x}_i, y_i \rangle$

$\mathbf{x}_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ is an input vector of size $d$

$y_i$ is the desired output (given by a teacher)
Objective: learn the mapping $f : X \to Y$

such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$

• Regression: $Y$ is continuous. Example: earnings, product orders, company stock price
• Classification: $Y$ is discrete. Example: a handwritten digit in binary form (input) and its digit label (output)


Linear regression: review

• Function $f : X \to Y$ is a linear combination of input components: $f(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = w_0 + \sum_{j=1}^{d} w_j x_j$

w0 , w1 , wk - parameters (weights)

[Figure: the linear model as a unit: a bias input 1 with weight $w_0$ and inputs $x_1, \ldots, x_d$ with weights $w_1, \ldots, w_d$ feed a sum that outputs $f(\mathbf{x}, \mathbf{w})$.]


Linear regression: review

• Data: $D_i = \langle \mathbf{x}_i, y_i \rangle$
• Function: $\mathbf{x}_i \to f(\mathbf{x}_i)$

• We would like to have $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$

• Error function measures how much our predictions deviate from the desired answers

Mean-squared error: $J_n = \frac{1}{n} \sum_{i=1,\ldots,n} \left( y_i - f(\mathbf{x}_i) \right)^2$

• Learning: we want to find the weights minimizing the error!
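As a concrete illustration of the model and the error function above, here is a minimal NumPy sketch (the function names and the example data are hypothetical, not from the lecture):

```python
import numpy as np

def predict(X, w):
    """Linear model f(x, w) = w0 + w1*x1 + ... + wd*xd, applied to each row of X."""
    return w[0] + X @ w[1:]

def mean_squared_error(X, y, w):
    """J_n = (1/n) * sum_i (y_i - f(x_i))^2."""
    return np.mean((y - predict(X, w)) ** 2)

# Hypothetical 1-D example: y is roughly 2 + 3*x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 2.0, size=(20, 1))
y = 2 + 3 * X[:, 0] + rng.normal(scale=0.5, size=20)
print(mean_squared_error(X, y, np.array([2.0, 3.0])))
```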


Solving linear regression: review

• The optimal set of weights satisfies: $\nabla_{\mathbf{w}} J_n(\mathbf{w}) = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \mathbf{w}^T \mathbf{x}_i \right) \mathbf{x}_i = 0$
This leads to a system of linear equations (SLE) with $d+1$ unknowns, of the form $A\mathbf{w} = \mathbf{b}$:

$w_0 \sum_{i=1}^{n} x_{i,0} x_{i,j} + w_1 \sum_{i=1}^{n} x_{i,1} x_{i,j} + \cdots + w_j \sum_{i=1}^{n} x_{i,j} x_{i,j} + \cdots + w_d \sum_{i=1}^{n} x_{i,d} x_{i,j} = \sum_{i=1}^{n} y_i x_{i,j}$

Solutions to the SLE: e.g., matrix inversion (if the matrix $A$ is nonsingular): $\mathbf{w} = A^{-1}\mathbf{b}$
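A minimal sketch of this solution in NumPy, assuming a design matrix with a prepended column of ones for the bias term (all names and the example data are illustrative):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Solve the normal equations A w = b, with A = Phi^T Phi and b = Phi^T y,
    where Phi is X with a leading column of ones (the bias input x_0 = 1)."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    A = Phi.T @ Phi
    b = Phi.T @ y
    # Solving the system directly is preferable to explicitly inverting A.
    return np.linalg.solve(A, b)

# Hypothetical usage with random data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)
print(fit_linear_regression(X, y))  # approximately [1, 2, -1]
```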


Linear regression. Example

• 1-dimensional input $\mathbf{x} = (x_1)$

[Figure: 1-D linear regression example; data points and the fitted regression line.]


Linear regression. Example.

• 2-dimensional input $\mathbf{x} = (x_1, x_2)$

[Figure: 2-D linear regression example; data points and the fitted regression plane.]


Binary classification

• Two classes $Y = \{0, 1\}$
• Our goal is to learn to classify correctly two types of examples:
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn $f : X \to \{0, 1\}$
• Misclassification error function:

$\text{Error}_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & \text{if } f(\mathbf{x}_i, \mathbf{w}) \ne y_i \\ 0 & \text{if } f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$

• Error we would like to minimize: $E_{(\mathbf{x}, y)}\left[ \text{Error}_1(\mathbf{x}, y) \right]$
• First step: we need to devise a model of the function
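The expectation above is over the data distribution; on a finite dataset it is usually estimated by the empirical misclassification rate, as in this small sketch (the function name is illustrative):

```python
import numpy as np

def misclassification_error(y_true, y_pred):
    """Empirical estimate of E[Error_1]: the fraction of examples
    where the predicted class differs from the desired class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

print(misclassification_error([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
```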


Discriminant functions

• One way to represent a classifier is by using:
  – Discriminant functions
• Works for both binary and multi-way classification

• Idea:

– For every class $i = 0, 1, \ldots, k$ define a function $g_i(\mathbf{x})$ mapping $X \to \mathbb{R}$
– When a decision on input $\mathbf{x}$ should be made, choose the

class with the highest value of $g_i(\mathbf{x})$ (see the sketch below)

• So what happens to the input space? Assume the binary case.
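Before looking at the input space, here is a minimal sketch of the decision rule just described: evaluate each class's discriminant function and pick the argmax (the particular linear discriminants used here are only an illustrative assumption):

```python
import numpy as np

def classify(x, discriminants):
    """Choose the class i with the highest value of g_i(x)."""
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))

# Hypothetical binary case with two linear discriminant functions.
def g0(x):
    return -1.0 + x[0] - x[1]

def g1(x):
    return 0.5 - x[0] + x[1]

print(classify(np.array([0.2, 1.0]), [g0, g1]))  # class 1 for this input
```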


Discriminant functions

[Figure: examples from two classes plotted in a 2-D input space; one region is labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

[Figure: the same two-class data; additional regions of the input space are labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

[Figure: the same two-class data with all regions of the input space labeled by comparing $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$.]


Discriminant functions

• Define decision boundary

[Figure: the decision boundary, the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$, separating the regions in which $g_1(\mathbf{x})$ or $g_0(\mathbf{x})$ is larger.]


Quadratic decision boundary

[Figure: a quadratic decision boundary where $g_1(\mathbf{x}) = g_0(\mathbf{x})$, with $g_1(\mathbf{x}) > g_0(\mathbf{x})$ on one side and $g_0(\mathbf{x}) > g_1(\mathbf{x})$ on the other.]


Logistic regression model

• Defines a linear decision boundary in the input space
• Discriminant functions: $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function
• $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{w}^T\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

[Figure: the logistic regression unit: a bias input 1 with weight $w_0$ and inputs $x_1, \ldots, x_d$ with weights $w_1, \ldots, w_d$ feed a sum $z$, which is passed through the logistic function to output $f(\mathbf{x}, \mathbf{w})$.]
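A minimal NumPy sketch of this model, computing $g(\mathbf{w}^T\mathbf{x})$ for an input with a prepended bias term (the weights and input values are illustrative):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w):
    """Logistic regression model f(x, w) = g(w^T x); x includes the bias input x_0 = 1."""
    return logistic(w @ x)

w = np.array([0.5, 2.0, -1.0])   # hypothetical weights [w0, w1, w2]
x = np.array([1.0, 0.3, 0.8])    # bias input 1 followed by x1, x2
print(f(x, w))                   # interpreted as the probability that y = 1 (see below)
```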


Logistic function: $g(z) = \frac{1}{1 + e^{-z}}$
• Also referred to as a sigmoid function
• Replaces the threshold function with smooth switching
• Takes a real number and outputs a number in the interval [0, 1]

[Figure: plot of the logistic function $g(z)$ for $z$ from $-20$ to $20$, rising smoothly from 0 to 1 with $g(0) = 0.5$.]


Logistic regression model

• Discriminant functions: $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$ and $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• Values of the discriminant functions vary in [0, 1]
  – Probabilistic interpretation: $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{w}, \mathbf{x}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

[Figure: the same logistic regression unit, with the output now labeled $p(y = 1 \mid \mathbf{x}, \mathbf{w})$.]


Logistic regression

• We learn a probabilistic function $f : X \to [0, 1]$
  – where $f$ describes the probability of class 1 given $\mathbf{x}$:

$f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{w}^T\mathbf{x}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
Note that: $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$

• Transformation to binary class values:

If $p(y = 1 \mid \mathbf{x}) \ge 1/2$ then choose 1, else choose 0


Linear decision boundary

• The logistic regression model defines a linear decision boundary
• Why?
• Answer: compare the two discriminant functions.

• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$
• On the boundary it must hold that:
$\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{1 - g(\mathbf{w}^T\mathbf{x})}{g(\mathbf{w}^T\mathbf{x})} = 0$
$\log \frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log \frac{\frac{1}{1 + \exp(\mathbf{w}^T\mathbf{x})}}{\frac{\exp(\mathbf{w}^T\mathbf{x})}{1 + \exp(\mathbf{w}^T\mathbf{x})}} = \log \exp(-\mathbf{w}^T\mathbf{x}) = -\mathbf{w}^T\mathbf{x} = 0$
• The boundary is therefore the set of points with $\mathbf{w}^T\mathbf{x} = 0$, which is linear in $\mathbf{x}$.


Logistic regression model. Decision boundary

• Logistic regression defines a linear decision boundary
• Example: 2 classes (blue and red points)

[Figure: "Decision boundary" plot; blue and red points in a 2-D input space separated by a straight-line decision boundary.]


Logistic regression: parameter learning

• Likelihood of outputs: let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T\mathbf{x}_i)$
• Then
$L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}$
• Find the weights $\mathbf{w}$ that maximize the likelihood of the outputs
  – Apply the log-likelihood trick: the optimal weights are the same for the likelihood and the log-likelihood
$l(D, \mathbf{w}) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \left[ \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} \right] = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
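A minimal NumPy sketch of this log-likelihood for a given weight vector, assuming the design matrix already carries a bias column of ones (all names and data are illustrative):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, w):
    """l(D, w) = sum_i y_i log(mu_i) + (1 - y_i) log(1 - mu_i),
    where mu_i = g(w^T x_i)."""
    mu = logistic(X @ w)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Hypothetical usage.
X = np.array([[1.0, 0.2], [1.0, -1.5], [1.0, 2.3]])  # bias column + one attribute
y = np.array([1, 0, 1])
print(log_likelihood(X, y, np.array([0.0, 1.0])))
```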


Logistic regression: parameter learning
• Log-likelihood: $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood (nonlinear in the weights!):
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} x_{i,j}\,(y_i - g(z_i))$
$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - g(\mathbf{w}^T\mathbf{x}_i)) = \sum_{i=1}^{n} \mathbf{x}_i\,(y_i - f(\mathbf{w}, \mathbf{x}_i))$
• Gradient descent (on the negative log-likelihood $-l(D, \mathbf{w})$):
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \nabla_{\mathbf{w}} \left[ -l(D, \mathbf{w}) \right] \big|_{\mathbf{w}^{(k-1)}}$
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \sum_{i=1}^{n} \left[ y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i) \right] \mathbf{x}_i$


Logistic regression. Online gradient descent

• On-line component of the log-likelihood (the contribution of a single example $D_i$):
$-J_{\text{online}}(D_i, \mathbf{w}) = y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• On-line learning update for the weights $\mathbf{w}$, using $J_{\text{online}}(D_k, \mathbf{w})$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\, \left[ \nabla_{\mathbf{w}} J_{\text{online}}(D_k, \mathbf{w}) \right] \big|_{\mathbf{w}^{(k-1)}}$
• The $k$-th update for logistic regression, with $D_k = \langle \mathbf{x}_k, y_k \rangle$:
$\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k) \left[ y_k - f(\mathbf{w}^{(k-1)}, \mathbf{x}_k) \right] \mathbf{x}_k$


Online logistic regression algorithm

Online-logistic-regression (D, number of iterations)

initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
for i = 1 : 1 : number of iterations
  do select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from D
     set $\alpha = 1/i$
     update weights (in parallel):
     $\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ y_i - f(\mathbf{w}, \mathbf{x}_i) \right] \mathbf{x}_i$
end for
return weights $\mathbf{w}$


Online algorithm. Example.


Derivation of the gradient
• Log-likelihood: $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i)$
• Derivatives of the log-likelihood:
$\frac{\partial}{\partial w_j}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \frac{\partial}{\partial z_i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right] \frac{\partial z_i}{\partial w_j}$
• with $\frac{\partial z_i}{\partial w_j} = x_{i,j}$ and the derivative of the logistic function $\frac{\partial g(z_i)}{\partial z_i} = g(z_i)\,(1 - g(z_i))$

 1 g(zi ) 1 g(zi ) yi logi (1 yi )log(1 i )  yi (1 yi ) zi g(zi ) zi 1 g(zi ) zi

 yi (1 g(zi )) (1 yi )(g(zi ))  yi  g(zi )

$\nabla_{\mathbf{w}}\, l(D, \mathbf{w}) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - g(\mathbf{w}^T\mathbf{x}_i) \right) = \sum_{i=1}^{n} \mathbf{x}_i \left( y_i - f(\mathbf{w}, \mathbf{x}_i) \right)$
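A quick numerical sanity check of the key step above, that $\frac{\partial}{\partial z}\left[ y \log g(z) + (1-y)\log(1-g(z)) \right] = y - g(z)$, using a finite-difference approximation (a small self-check, not part of the lecture):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_example_loglik(z, y):
    """y * log g(z) + (1 - y) * log(1 - g(z))."""
    return y * np.log(logistic(z)) + (1 - y) * np.log(1 - logistic(z))

z, y, eps = 0.7, 1.0, 1e-6
numeric = (per_example_loglik(z + eps, y) - per_example_loglik(z - eps, y)) / (2 * eps)
analytic = y - logistic(z)
print(numeric, analytic)   # the two values agree to several decimal places
```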


Generative approach to classification

Idea:
1. Represent and learn the distribution $p(\mathbf{x}, y)$
2. Use it to define probabilistic discriminant functions

E.g., $g_0(\mathbf{x}) = p(y = 0 \mid \mathbf{x})$ and $g_1(\mathbf{x}) = p(y = 1 \mid \mathbf{x})$

Typical model: $p(\mathbf{x}, y) = p(\mathbf{x} \mid y)\, p(y)$
• $p(\mathbf{x} \mid y)$ = class-conditional distributions (densities); for binary classification there are two class-conditional distributions, $p(\mathbf{x} \mid y = 0)$ and $p(\mathbf{x} \mid y = 1)$
• $p(y)$ = prior on classes, the probability of class $y$; for binary classification a Bernoulli distribution with $p(y = 0) + p(y = 1) = 1$


Naïve Bayes classifier

• A generative classifier model with an additional simplifying assumption: – All input attributes are conditionally independent of each other given the class. So we have:

$p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
[Figure: Bayesian network with the class node $Y$ as the parent of the attribute nodes $X_1, X_2, \ldots, X_d$.]


Learning of the Naïve Bayes classifier

Model: $p(\mathbf{x}, y) = p(y)\, p(\mathbf{x} \mid y) = p(y) \prod_{i=1}^{d} p(x_i \mid y)$
Learning = learning the parameters of the BBN:
– the class prior $p(y)$
– the class-conditional distributions:

$p(x_i \mid y)$ for all $i = 1, \ldots, d$

• In practice, the class-conditional distributions can use different models: e.g., one attribute can be modeled with a Bernoulli distribution, another as a Gaussian density, another as a Poisson distribution
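A minimal sketch of this learning step for binary (0/1) attributes, estimating the class prior and Bernoulli class-conditionals by relative frequencies (the add-one smoothing and all names are illustrative assumptions, not from the lecture):

```python
import numpy as np

def fit_naive_bayes_bernoulli(X, y):
    """Estimate p(y = 1) and p(x_i = 1 | y) for each attribute i and class y
    from binary data, with add-one (Laplace) smoothing."""
    prior1 = y.mean()                                  # p(y = 1)
    cond = {}                                          # cond[c][i] = p(x_i = 1 | y = c)
    for c in (0, 1):
        Xc = X[y == c]
        cond[c] = (Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0)
    return prior1, cond

# Hypothetical binary data: 4 examples, 3 attributes.
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
prior1, cond = fit_naive_bayes_bernoulli(X, y)
print(prior1, cond[0], cond[1])
```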

Making class decisions with the Naïve Bayes classifier

Discriminant functions:
• Posterior of a class: choose the class with the higher posterior probability:
if $p(y = 1 \mid \mathbf{x}) > p(y = 0 \mid \mathbf{x})$ then $y = 1$, else $y = 0$, where
$p(y = 1 \mid \mathbf{x}) = \frac{\left( \prod_{i=1}^{d} p(x_i \mid y = 1) \right) p(y = 1)}{\left( \prod_{i=1}^{d} p(x_i \mid y = 0) \right) p(y = 0) + \left( \prod_{i=1}^{d} p(x_i \mid y = 1) \right) p(y = 1)}$
• Likelihood of the data: choose the class that explains the input $\mathbf{x}$ better:
if $\prod_{i=1}^{d} p(x_i \mid y = 1) > \prod_{i=1}^{d} p(x_i \mid y = 0)$ then $y = 1$, else $y = 0$
(here the two products play the role of the discriminant functions $g_1(\mathbf{x})$ and $g_0(\mathbf{x})$)
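A minimal sketch of the posterior-based decision rule, reusing the Bernoulli class-conditional format from the previous sketch (all names and numbers are illustrative):

```python
import numpy as np

def naive_bayes_classify(x, prior1, cond):
    """Choose y = 1 iff p(y=1|x) > p(y=0|x), i.e. compare
    p(y) * prod_i p(x_i | y) for the two classes (binary attributes)."""
    scores = {}
    for c, prior in ((0, 1.0 - prior1), (1, prior1)):
        p = cond[c]                                        # p(x_i = 1 | y = c) for each i
        likelihood = np.prod(np.where(x == 1, p, 1.0 - p)) # prod_i p(x_i | y = c)
        scores[c] = prior * likelihood
    return 1 if scores[1] > scores[0] else 0

# Hypothetical usage with parameters in the format of the earlier fit sketch.
prior1 = 0.5
cond = {0: np.array([0.25, 0.5, 0.5]), 1: np.array([0.75, 0.5, 0.75])}
print(naive_bayes_classify(np.array([1, 0, 1]), prior1, cond))  # class 1
```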
