CS 2750 Machine Learning Lecture 7
Linear regression (cont.). Linear methods for classification
Milos Hauskrecht [email protected] 5329 Sennott Square
Coefficient shrinkage
• The least squares estimates often have low bias but high variance
• The prediction accuracy can often be improved by setting some coefficients to zero
  – Increases the bias, reduces the variance of the estimates
• Solutions:
  – Subset selection
  – Ridge regression
  – Principal component regression
• Next: ridge regression
Ridge regression
• Error function for the standard least squares estimates:
  J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2
• We seek:
  w^* = \arg\min_w \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2
• Ridge regression:
  J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|^2
• where \|w\|^2 = \sum_{i=0}^{d} w_i^2 and \lambda \ge 0
• What does the new error function do?
Ridge regression
• Standard regression:
  J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2
• Ridge regression:
  J_n(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - w^T x_i)^2 + \lambda \|w\|^2
• The term \|w\|^2 = \sum_{i=0}^{d} w_i^2 penalizes non-zero weights with a cost proportional to \lambda (a shrinkage coefficient)
• If an input attribute x_j has only a small effect on improving the error function, its weight is "shut down" by the penalty term
• Inclusion of a shrinkage penalty is often referred to as regularization
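The shrinkage effect can be checked numerically. A minimal NumPy sketch (the data and the value of λ are made up for illustration): setting the gradient of J_n(w) to zero gives the closed form w = (XᵀX + nλI)⁻¹ Xᵀy, which, like the penalty \sum_{i=0}^{d} w_i^2 above, also shrinks the bias weight w_0.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize J_n(w) = (1/n) sum_i (y_i - w^T x_i)^2 + lam * ||w||^2.
    Setting the gradient to zero gives w = (X^T X + n*lam*I)^{-1} X^T y.
    Like the slide's penalty, this shrinks every w_i, including the bias w_0."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Toy data: a column of ones for the bias term w_0 plus one input attribute.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x + 0.1 * rng.normal(size=50)

w_ls = ridge_fit(X, y, lam=0.0)     # ordinary least squares estimate
w_ridge = ridge_fit(X, y, lam=1.0)  # coefficients shrunk toward zero
```

Increasing λ trades a little bias for a large reduction in the variance of the estimates: the ridge weights have a smaller norm than the least squares weights.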
Data: D = {d_1, d_2, ..., d_n}, a set of n examples with d_i = <x_i, y_i>
  x_i is an input vector and y_i is the desired output (given by a teacher)
Objective: learn the mapping f : X → Y such that y_i ≈ f(x_i) for all i = 1, ..., n
Two types of problems:
• Regression: Y is continuous
  Example: earnings, product orders → company stock price
• Classification: Y is discrete
  Example: temperature, heart rate → disease
Today: binary classification problems
Binary classification
• Two classes Y = {0, 1}
• Our goal is to learn to classify correctly two types of examples
  – Class 0 – labeled as 0
  – Class 1 – labeled as 1
• We would like to learn f : X → {0, 1}
• Zero-one error (loss) function:
  Error_1(x_i, y_i) = 1 if f(x_i, w) ≠ y_i, and 0 if f(x_i, w) = y_i
• Error we would like to minimize: the expectation E_{(x,y)}[Error_1(x, y)]
• First step: we need to devise a model of the function f
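As a quick illustration (toy labels of my own, not from the lecture), the zero-one loss and its empirical mean can be computed as:

```python
import numpy as np

def zero_one_error(y_pred, y_true):
    """Error_1(x_i, y_i): 1 when the prediction disagrees with the label, else 0."""
    return (np.asarray(y_pred) != np.asarray(y_true)).astype(int)

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
errors = zero_one_error(y_pred, y_true)   # per-example zero-one losses
mean_error = errors.mean()                # empirical estimate of E[Error_1]
```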
Discriminant functions
• One convenient way to represent classifiers is through discriminant functions
• Works for binary and multi-way classification
• Idea:
  – For every class i = 0, 1, ..., k define a function g_i(x) mapping X → ℜ
  – When the decision on input x should be made, choose the class with the highest value of g_i(x)
• So what happens with the input space? Assume a binary case.
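The decision rule can be sketched as follows; the two linear discriminants g_0 and g_1 below are hypothetical choices for illustration only:

```python
import numpy as np

# Hypothetical discriminant functions for a binary case (illustration only).
def g0(x):
    return -x[0] + x[1]

def g1(x):
    return x[0] - x[1]

def classify(x, discriminants):
    """Choose the class i with the highest value of g_i(x)."""
    return int(np.argmax([g(x) for g in discriminants]))

label = classify(np.array([1.0, 0.0]), [g0, g1])   # here g1 > g0, so class 1
```

The discriminants carve the input space into regions, one per class, with the boundaries where two discriminants are equal.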
Discriminant functions
[Figure: a 2-D input space; the highlighted region is where g_1(x) ≥ g_0(x)]
Discriminant functions
[Figure: a 2-D input space split into a region where g_1(x) ≥ g_0(x) and a region where g_1(x) ≤ g_0(x)]
Discriminant functions
• Decision boundary: the set of points where g_1(x) = g_0(x)
[Figure: the region g_1(x) ≥ g_0(x) and the region g_1(x) ≤ g_0(x), separated by the decision boundary g_1(x) = g_0(x)]
Quadratic decision boundary
[Figure: a quadratic decision boundary g_1(x) = g_0(x) separating the region g_1(x) ≥ g_0(x) from the region g_1(x) ≤ g_0(x)]
Logistic regression model
• Defines a linear decision boundary
• Discriminant functions:
  g_1(x) = g(w^T x)        g_0(x) = 1 - g(w^T x)
• where g(z) = 1/(1 + e^{-z}) is a logistic function
• Model output: f(x, w) = g_1(x) = g(w^T x)
[Figure: network view of the model. The input vector (x_1, ..., x_d) plus a constant input 1 is weighted by (w_0, w_1, ..., w_d), summed into z = w^T x, and passed through the logistic function to produce f(x, w)]
Logistic function
  g(z) = \frac{1}{1 + e^{-z}}
• also referred to as a sigmoid function
• replaces the hard threshold function with smooth switching
• takes a real number and outputs a number in the interval [0, 1]
[Figure: the sigmoid curve g(z) for z ∈ [−20, 20], rising smoothly from 0 to 1]
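A direct implementation of the logistic function (a sketch; the checks below mirror the properties just listed: range (0, 1), smooth monotone switching, g(0) = 1/2):

```python
import numpy as np

def g(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-20, 20, 9)
values = g(z)   # smoothly increases from near 0 to near 1
```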
Logistic regression model
• Discriminant functions:
  g_1(x) = g(w^T x)        g_0(x) = 1 - g(w^T x)
• where g(z) = 1/(1 + e^{-z}) is a logistic function
• Values of the discriminant functions vary in [0, 1]
  – Probabilistic interpretation:
    f(x, w) = p(y = 1 | w, x) = g_1(x) = g(w^T x)
[Figure: network view of the model. Inputs (x_1, ..., x_d) plus a constant 1, weights (w_0, ..., w_d), a summation, and a logistic unit producing p(y = 1 | x, w)]
Logistic regression
• Instead of learning a mapping to discrete values {0, 1},
  f : X → {0, 1}
• we learn a probabilistic function
  f : X → [0, 1]
  – where f describes the probability of class 1 given x:
    f(x, w) = p(y = 1 | x, w)
• Note that: p(y = 0 | x, w) = 1 - p(y = 1 | x, w)
• Transformation to discrete class values:
  If p(y = 1 | x) ≥ 1/2 then choose 1, else choose 0
Linear decision boundary
• The logistic regression model defines a linear decision boundary
• Why? Compare the two discriminant functions.
• Decision boundary: g_1(x) = g_0(x)
• On the boundary it must hold that:
  \log \frac{g_0(x)}{g_1(x)} = 0
• Substituting g(z) = 1/(1 + e^{-z}):
  \log \frac{g_0(x)}{g_1(x)} = \log \frac{1 - g(w^T x)}{g(w^T x)} = \log \frac{e^{-w^T x}/(1 + e^{-w^T x})}{1/(1 + e^{-w^T x})} = \log e^{-w^T x} = -w^T x = 0
• so the decision boundary is the hyperplane w^T x = 0, a linear function of x
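The log-odds identity can be verified numerically; the weights w and input x below are arbitrary values chosen for illustration:

```python
import numpy as np

def g(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary weights and input (illustration only).
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, -0.7])

z = w @ x            # w^T x
g1 = g(z)            # g_1(x)
g0 = 1.0 - g1        # g_0(x)

log_odds = np.log(g0 / g1)   # should equal -w^T x
```

Since log(g_0(x)/g_1(x)) = -wᵀx, the boundary where the two discriminants are equal is exactly the hyperplane wᵀx = 0.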
Logistic regression model. Decision boundary
• Logistic regression defines a linear decision boundary
• Example: 2 classes (blue and red points)
[Figure: two classes of points (blue and red) in a 2-D input space, separated by a linear decision boundary]
Logistic regression: parameter learning.
Likelihood of outputs
• Let D_i = <x_i, y_i> and \mu_i = p(y_i = 1 | x_i, w) = g(z_i) = g(w^T x_i)
• Then
  L(D, w) = \prod_{i=1}^{n} P(y = y_i | x_i, w) = \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}
• Find weights w that maximize the likelihood of outputs
  – Apply the log-likelihood trick: the optimal weights are the same for the likelihood and the log-likelihood
  l(D, w) = \log \prod_{i=1}^{n} \mu_i^{y_i} (1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n} \log \mu_i^{y_i} (1 - \mu_i)^{1 - y_i}
          = \sum_{i=1}^{n} [ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) ]
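The log-likelihood can be written directly in NumPy (a sketch; the toy data are mine). With w = 0 every μ_i = 1/2, so l(D, w) = n log(1/2), which gives a simple sanity check:

```python
import numpy as np

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """l(D, w) = sum_i [ y_i log mu_i + (1 - y_i) log(1 - mu_i) ],
    with mu_i = g(w^T x_i)."""
    mu = g(X @ w)
    return np.sum(y * np.log(mu) + (1 - y) * np.log(1 - mu))

# Toy data: a bias feature plus one input attribute (illustration only).
X = np.array([[1.0, -1.0], [1.0, 0.5], [1.0, 2.0]])
y = np.array([0, 1, 1])
l0 = log_likelihood(np.zeros(2), X, y)   # equals 3 * log(1/2)
```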
Logistic regression: parameter learning
• Log-likelihood:
  l(D, w) = \sum_{i=1}^{n} [ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) ]
• Derivatives of the negative log-likelihood (nonlinear in the weights!):
  \frac{\partial}{\partial w_j} [-l(D, w)] = -\sum_{i=1}^{n} x_{i,j} (y_i - g(z_i))
  \nabla_w [-l(D, w)] = -\sum_{i=1}^{n} x_i (y_i - g(w^T x_i)) = -\sum_{i=1}^{n} x_i (y_i - f(w, x_i))
• Gradient descent:
  w^{(k)} ← w^{(k-1)} - \alpha(k) \nabla_w [-l(D, w)] |_{w^{(k-1)}}
  w^{(k)} ← w^{(k-1)} + \alpha(k) \sum_{i=1}^{n} [ y_i - f(w^{(k-1)}, x_i) ] x_i
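The batch update, written as gradient ascent on l(D, w), might be sketched as follows; the separable toy data and the fixed learning rate are my own choices for illustration:

```python
import numpy as np

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_ascent(X, y, alpha=0.1, num_iters=500):
    """Repeat w <- w + alpha * sum_i [ y_i - f(w, x_i) ] x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w = w + alpha * X.T @ (y - g(X @ w))   # gradient of the log-likelihood
    return w

# Toy separable data: a bias feature plus one input attribute.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = batch_gradient_ascent(X, y)
preds = (g(X @ w) >= 0.5).astype(int)   # threshold at p = 1/2
```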
Logistic regression. Online gradient descent
• On-line component of the negative log-likelihood:
  J_online(D_i, w) = -[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) ]
• On-line learning update for the weights:
  w^{(k)} ← w^{(k-1)} - \alpha(k) \nabla_w J_online(D_k, w) |_{w^{(k-1)}}
• kth update for the logistic regression and D_k = <x_k, y_k>:
  w^{(k)} ← w^{(k-1)} + \alpha(k) [ y_k - f(w^{(k-1)}, x_k) ] x_k
Online logistic regression algorithm
Online-logistic-regression (D, number of iterations)
  initialize weights w = (w_0, w_1, w_2, ..., w_d)
  for i = 1 : number of iterations
    select a data point D_i = <x_i, y_i> from D
    set α = 1/i
    update weights (in parallel): w ← w + α(i) [ y_i - f(w, x_i) ] x_i
  end for
  return weights w
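A sketch of this algorithm in NumPy. The slide does not specify how the data point is selected, so this version simply cycles through D in order (an assumption); the separable toy data are also mine:

```python
import numpy as np

def g(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(X, y, num_iterations):
    """At step i, select a data point <x_k, y_k> (here by cycling through D,
    an assumption), set alpha = 1/i, and update
    w <- w + alpha * [ y_k - f(w, x_k) ] * x_k."""
    w = np.zeros(X.shape[1])
    for i in range(1, num_iterations + 1):
        k = (i - 1) % len(y)     # select a data point from D (cycling)
        alpha = 1.0 / i          # annealed learning rate alpha(i) = 1/i
        w = w + alpha * (y[k] - g(w @ X[k])) * X[k]
    return w

# Toy separable data: a bias feature plus one input attribute.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0, 0, 1, 1])
w = online_logistic_regression(X, y, 1000)
preds = (g(X @ w) >= 0.5).astype(int)
```

The annealed rate α = 1/i makes each successive update smaller, so the weights settle down even though each step looks at only one example.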
Online algorithm. Example.
[Figures, shown over three slides: the behavior of the online algorithm on example data as the updates progress]