Linear Regression (Cont.) and Linear Methods for Classification

CS 2750 Machine Learning, Lecture 7
Linear regression (cont.); linear methods for classification
Milos Hauskrecht, [email protected], 5329 Sennott Square

Coefficient shrinkage
• The least squares estimates often have low bias but high variance.
• The prediction accuracy can often be improved by setting some coefficients to zero.
  – Increases the bias, reduces the variance of the estimates.
• Solutions:
  – Subset selection
  – Ridge regression
  – Principal component regression
• Next: ridge regression.

Ridge regression
• Error function for the standard least squares estimates:
  $J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$
• We seek:
  $\mathbf{w}^* = \arg\min_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$
• Ridge regression:
  $J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \lambda\,\|\mathbf{w}\|^2$
• where $\|\mathbf{w}\|^2 = \sum_{i=0}^{d} w_i^2$ and $\lambda \ge 0$.
• What does the new error function do?

Ridge regression (cont.)
• Standard regression: $J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2$
• Ridge regression: $J_n(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \mathbf{w}^T\mathbf{x}_i)^2 + \lambda\,\|\mathbf{w}\|^2$
• $\|\mathbf{w}\|^2 = \sum_{i=0}^{d} w_i^2$ penalizes non-zero weights with a cost proportional to $\lambda$ (a shrinkage coefficient).
• If an input attribute $x_j$ has only a small effect on improving the error function, its weight is "shut down" by the penalty term.
• Inclusion of a shrinkage penalty is often referred to as regularization.
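Because the penalized error is quadratic in $\mathbf{w}$, setting its gradient to zero yields a closed-form solution. A minimal sketch, assuming NumPy; the function name `ridge_fit` and the leading all-ones bias column are illustrative choices, and the closed form itself is standard linear algebra rather than something derived on these slides:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/n) * ||y - X w||^2 + lam * ||w||^2 in closed form.

    X is an (n, d+1) matrix whose first column is all ones (the bias
    input x_0 = 1), so w_0 is penalized too, matching the slides'
    convention of summing the penalty from i = 0 to d.
    """
    n, d = X.shape
    # Setting the gradient to zero gives (X^T X + n*lam*I) w = X^T y;
    # the factor n comes from the 1/n in front of the squared-error term.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)
```

For $\lambda = 0$ this reduces to ordinary least squares; larger $\lambda$ shrinks all coefficients toward zero.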
Supervised learning
Data: $D = \{d_1, d_2, \ldots, d_n\}$, a set of n examples,
  $d_i = \langle \mathbf{x}_i, y_i \rangle$, where $\mathbf{x}_i$ is an input vector and $y_i$ is the desired output (given by a teacher).
Objective: learn the mapping $f: X \to Y$ s.t. $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \ldots, n$.
Two types of problems:
• Regression: Y is continuous.
  Example: earnings, product orders → company stock price
• Classification: Y is discrete.
  Example: temperature, heart rate → disease
Today: binary classification problems.

Binary classification
• Two classes $Y = \{0, 1\}$.
• Our goal is to learn to classify correctly two types of examples:
  – Class 0, labeled as 0
  – Class 1, labeled as 1
• We would like to learn $f: X \to \{0, 1\}$.
• Zero-one error (loss) function:
  $\mathrm{Error}_1(\mathbf{x}_i, y_i) = \begin{cases} 1 & f(\mathbf{x}_i, \mathbf{w}) \ne y_i \\ 0 & f(\mathbf{x}_i, \mathbf{w}) = y_i \end{cases}$
• Error we would like to minimize: $E_{(\mathbf{x}, y)}[\mathrm{Error}_1(\mathbf{x}, y)]$.
• First step: we need to devise a model of the function.

Discriminant functions
• One convenient way to represent classifiers is through discriminant functions.
• Works for binary and multi-way classification.
• Idea:
  – For every class $i = 0, 1, \ldots, k$, define a function $g_i(\mathbf{x})$ mapping $X \to \Re$.
  – When the decision on input $\mathbf{x}$ should be made, choose the class with the highest value of $g_i(\mathbf{x})$.
• So what happens with the input space? Assume a binary case.

[Figure: two classes of points in 2D; class 1 is chosen in the region where $g_1(\mathbf{x}) \ge g_0(\mathbf{x})$, class 0 where $g_1(\mathbf{x}) \le g_0(\mathbf{x})$.]

Discriminant functions define a decision boundary: the set of points where $g_1(\mathbf{x}) = g_0(\mathbf{x})$.

[Figure: the same data with a linear decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ separating the two regions.]

Quadratic decision boundary

[Figure: a quadratic decision boundary $g_1(\mathbf{x}) = g_0(\mathbf{x})$ between the regions $g_1(\mathbf{x}) \ge g_0(\mathbf{x})$ and $g_1(\mathbf{x}) \le g_0(\mathbf{x})$.]

Logistic regression model
• Defines a linear decision boundary.
• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$,  $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function.
• $f(\mathbf{x}, \mathbf{w}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

[Figure: logistic regression unit — the inputs $x_1, \ldots, x_d$ are weighted by $w_1, \ldots, w_d$, summed with the bias $w_0$ to give $z = \mathbf{w}^T\mathbf{x}$, and passed through the logistic function to produce $f(\mathbf{x}, \mathbf{w})$.]

Logistic function
  $g(z) = \frac{1}{1 + e^{-z}}$
• Also referred to as a sigmoid function.
• Replaces the threshold function with smooth switching.
• Takes a real number and outputs a number in the interval [0, 1].

[Figure: the sigmoid curve $g(z)$ rising smoothly from 0 to 1, with $g(0) = 0.5$.]

Logistic regression model (cont.)
• Discriminant functions:
  $g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$,  $g_0(\mathbf{x}) = 1 - g(\mathbf{w}^T\mathbf{x})$
• where $g(z) = 1/(1 + e^{-z})$ is a logistic function.
• Values of the discriminant functions vary in [0, 1].
  – Probabilistic interpretation: $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{w}, \mathbf{x}) = g_1(\mathbf{x}) = g(\mathbf{w}^T\mathbf{x})$

Logistic regression
• Instead of learning the mapping to discrete values 0, 1, $f: X \to \{0, 1\}$,
• we learn a probabilistic function $f: X \to [0, 1]$,
  – where f describes the probability of class 1 given x:
    $f(\mathbf{x}, \mathbf{w}) = p(y = 1 \mid \mathbf{x}, \mathbf{w})$
• Note that $p(y = 0 \mid \mathbf{x}, \mathbf{w}) = 1 - p(y = 1 \mid \mathbf{x}, \mathbf{w})$.
• Transformation to discrete class values:
  If $p(y = 1 \mid \mathbf{x}) \ge 1/2$ then choose 1, else choose 0.

Linear decision boundary
• The logistic regression model defines a linear decision boundary.
• Why? Compare the two discriminant functions.
• Decision boundary: $g_1(\mathbf{x}) = g_0(\mathbf{x})$.
• For the boundary it must hold:
  $\log\frac{g_0(\mathbf{x})}{g_1(\mathbf{x})} = \log\frac{1 - g(\mathbf{w}^T\mathbf{x})}{g(\mathbf{w}^T\mathbf{x})} = \log\frac{\exp(-\mathbf{w}^T\mathbf{x})/(1 + \exp(-\mathbf{w}^T\mathbf{x}))}{1/(1 + \exp(-\mathbf{w}^T\mathbf{x}))} = \log\exp(-\mathbf{w}^T\mathbf{x}) = -\mathbf{w}^T\mathbf{x} = 0$
• So the boundary is the hyperplane $\mathbf{w}^T\mathbf{x} = 0$, which is linear in $\mathbf{x}$.

Logistic regression model: decision boundary
• LR defines a linear decision boundary.
• Example: 2 classes (blue and red points).

[Figure: blue and red points in 2D separated by a straight-line decision boundary.]
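To make the model concrete, a minimal prediction sketch, assuming NumPy; the helper names `sigmoid`, `predict_proba`, and `classify` are illustrative, not from the lecture. Note that `predict_proba` returns exactly 0.5 when $\mathbf{w}^T\mathbf{x} = 0$, which is why thresholding at 1/2 reproduces the linear boundary:

```python
import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function g(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    # f(x, w) = p(y = 1 | x, w) = g(w^T x); x includes the bias input x_0 = 1.
    return sigmoid(np.dot(w, x))

def classify(w, x):
    # Choose class 1 iff p(y = 1 | x) >= 1/2, i.e. iff w^T x >= 0.
    return 1 if predict_proba(w, x) >= 0.5 else 0
```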
Logistic regression: parameter learning
• Likelihood of outputs. Let $D_i = \langle \mathbf{x}_i, y_i \rangle$ and
  $\mu_i = p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = g(z_i) = g(\mathbf{w}^T\mathbf{x}_i)$
• Then
  $L(D, \mathbf{w}) = \prod_{i=1}^{n} P(y = y_i \mid \mathbf{x}_i, \mathbf{w}) = \prod_{i=1}^{n} \mu_i^{y_i}(1 - \mu_i)^{1 - y_i}$
• Find weights w that maximize the likelihood of outputs.
  – Apply the log-likelihood trick: the optimal weights are the same for both the likelihood and the log-likelihood.
  $l(D, \mathbf{w}) = \log\prod_{i=1}^{n} \mu_i^{y_i}(1 - \mu_i)^{1 - y_i} = \sum_{i=1}^{n}\log\left[\mu_i^{y_i}(1 - \mu_i)^{1 - y_i}\right] = \sum_{i=1}^{n} y_i\log\mu_i + (1 - y_i)\log(1 - \mu_i)$

Logistic regression: parameter learning (cont.)
• Log-likelihood:
  $l(D, \mathbf{w}) = \sum_{i=1}^{n} y_i\log\mu_i + (1 - y_i)\log(1 - \mu_i)$
• Derivatives of the negative log-likelihood (nonlinear in the weights!):
  $\frac{\partial}{\partial w_j}[-l(D, \mathbf{w})] = -\sum_{i=1}^{n} x_{i,j}\,(y_i - g(z_i))$
  $\nabla_{\mathbf{w}}[-l(D, \mathbf{w})] = -\sum_{i=1}^{n}\mathbf{x}_i\,(y_i - g(\mathbf{w}^T\mathbf{x}_i)) = -\sum_{i=1}^{n}\mathbf{x}_i\,(y_i - f(\mathbf{w}, \mathbf{x}_i))$
• Gradient descent:
  $\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\,\nabla_{\mathbf{w}}[-l(D, \mathbf{w})]\big|_{\mathbf{w}^{(k-1)}}$
  $\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k)\sum_{i=1}^{n}\left[y_i - f(\mathbf{w}^{(k-1)}, \mathbf{x}_i)\right]\mathbf{x}_i$

Logistic regression: online gradient descent
• On-line component of the log-likelihood:
  $-J_{\mathrm{online}}(D_i, \mathbf{w}) = y_i\log\mu_i + (1 - y_i)\log(1 - \mu_i)$
• On-line learning update for the weights:
  $\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} - \alpha(k)\,\nabla_{\mathbf{w}}[J_{\mathrm{online}}(D_k, \mathbf{w})]\big|_{\mathbf{w}^{(k-1)}}$
• The k-th update for logistic regression, with $D_k = \langle \mathbf{x}_k, y_k \rangle$:
  $\mathbf{w}^{(k)} \leftarrow \mathbf{w}^{(k-1)} + \alpha(k)\,\left[y_k - f(\mathbf{w}^{(k-1)}, \mathbf{x}_k)\right]\mathbf{x}_k$

Online logistic regression algorithm
Online-logistic-regression (D, number of iterations)
  initialize weights $\mathbf{w} = (w_0, w_1, w_2, \ldots, w_d)$
  for i = 1 : 1 : number of iterations
    do select a data point $D_i = \langle \mathbf{x}_i, y_i \rangle$ from D
       set $\alpha = 1/i$
       update weights (in parallel):
       $\mathbf{w} \leftarrow \mathbf{w} + \alpha(i)\,[y_i - f(\mathbf{w}, \mathbf{x}_i)]\,\mathbf{x}_i$
  end for
  return weights $\mathbf{w}$

Online algorithm. Example.

[Figures: three snapshots of the online algorithm's decision boundary adapting as more data points are processed.]
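The slides derive both a batch and an online update; minimal sketches of each follow. First, the batch update $\mathbf{w} \leftarrow \mathbf{w} + \alpha\sum_i [y_i - f(\mathbf{w}, \mathbf{x}_i)]\,\mathbf{x}_i$, assuming NumPy; the function name and the fixed step size are illustrative choices, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_batch(X, y, alpha=0.1, iters=1000):
    """Batch gradient ascent on the log-likelihood l(D, w).

    X is (n, d+1) with a leading all-ones bias column; y is (n,) of 0/1 labels.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        mu = sigmoid(X @ w)          # mu_i = f(w, x_i) = p(y_i = 1 | x_i, w)
        w += alpha * X.T @ (y - mu)  # w <- w + alpha * sum_i (y_i - mu_i) x_i
    return w
```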
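And a sketch of the online algorithm above, with the slides' decaying learning rate $\alpha = 1/i$; random point selection is one reasonable reading of "select a data point from D", and the helper name is hypothetical:

```python
import numpy as np

def fit_logistic_online(X, y, iters=5000, rng=None):
    """Online (stochastic) update: one chosen example per step."""
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for i in range(1, iters + 1):
        k = rng.integers(n)               # select a data point <x_k, y_k> from D
        alpha = 1.0 / i                   # decaying step size, as in the slides
        mu = 1.0 / (1.0 + np.exp(-X[k] @ w))
        w += alpha * (y[k] - mu) * X[k]   # w <- w + alpha [y_k - f(w, x_k)] x_k
    return w
```

Per update this touches a single example, so it can adapt as new data arrives, which is what the example figures above illustrate.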
