LINEAR CLASSIFIERS

CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition, J. Elder

Classification: Problem Statement

- In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t.

- In classification, the input variable x may still be continuous, but the target variable is discrete.

- In the simplest case, t can take only 2 values, e.g., t = +1 for inputs assigned to C_1 and t = -1 for inputs assigned to C_2.

Example Problem

- Animal or Vegetable?

Linear Models for Classification

- Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries.

- Example: a 2D input vector x and two discrete classes C_1 and C_2.

[Figure: the two classes of 2D inputs plotted in the (x_1, x_2) plane.]

Two-Class Discriminant Function

y(x) = w^T x + w_0

y(x) \ge 0 \Rightarrow x assigned to C_1
y(x) < 0  \Rightarrow x assigned to C_2

Thus y(x) = 0 defines the decision boundary.

[Figure: geometry in the (x_1, x_2) plane: w is orthogonal to the decision boundary y = 0, the boundary lies at distance -w_0 / \|w\| from the origin, and y(x) / \|w\| is the signed distance of x from the boundary.]

Two-Class Discriminant Function

y(x) = w^T x + w_0

y(x) \ge 0 \Rightarrow x assigned to C_1
y(x) < 0  \Rightarrow x assigned to C_2

For convenience, absorb the bias into the weight vector:

w = [w_1, ..., w_M]^T  \rightarrow  \tilde{w} = [w_0, w_1, ..., w_M]^T
x = [x_1, ..., x_M]^T  \rightarrow  \tilde{x} = [1, x_1, ..., x_M]^T

so that we can express y(x) = \tilde{w}^T \tilde{x}.
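To make the augmented notation concrete, here is a minimal MATLAB sketch (not from the slides) verifying that absorbing the bias this way leaves the discriminant values unchanged; the data in X, the weights w and the bias w0 are made-up values.

% Absorbing the bias into an augmented weight vector (illustrative values).
X  = [0.5 1.2; -0.3 0.8; 2.0 -1.1];           % N = 3 inputs, M = 2 features (rows)
w  = [0.4; -0.7];                             % weight vector
w0 = 0.1;                                     % bias

y_explicit  = X * w + w0;                     % y(x) = w' * x + w0
Xtilde      = [ones(size(X,1), 1), X];        % prepend a 1 to each input
wtilde      = [w0; w];                        % prepend w0 to the weights
y_augmented = Xtilde * wtilde;                % y(x) = wtilde' * xtilde

disp(max(abs(y_explicit - y_augmented)));     % zero up to round-off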

Generalized Linear Models

- For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1.

- For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically will constrain y to lie between -1 and 1 or between 0 and 1:

y(x) = f( w^T x + w_0 )

The Perceptron

y(x) = f( w^T x + w_0 )

y(x) \ge 0 \Rightarrow x assigned to C_1
y(x) < 0  \Rightarrow x assigned to C_2

- A classifier based upon this simple generalized linear model is called a (single-layer) perceptron.

- It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.

Parameter Learning

- How do we learn the parameters of a perceptron?

Outline

- The Perceptron Algorithm

- Least-Squares Classifiers

- Fisher’s Linear Discriminant

- Logistic Classifiers

- Support Vector Machines

Case 1: Linearly Separable Inputs

- For starters, let’s assume that the training data are in fact perfectly linearly separable.

- In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error.

- We seek an algorithm that can automatically find such a hyperplane.

The Perceptron Algorithm

- The perceptron algorithm was invented by Frank Rosenblatt (1928 - 1971) and published in 1962.

- The algorithm is iterative.

- The strategy is to start with a random guess at the weights w, and then to iteratively change the weights to move the hyperplane in a direction that lowers the classification error.

The Perceptron Algorithm

- Note that as we change the weights continuously, the classification error changes in a discontinuous, piecewise constant fashion.

- Thus we cannot use the classification error per se as our objective function to minimize.

- What would be a better objective function?

The Perceptron Criterion

- Note that we seek w such that

w^T x \ge 0 when t = +1
w^T x < 0  when t = -1

- In other words, we would like

w^T x_n t_n \ge 0  \quad \forall n

- Thus we seek to minimize

E_P(w) = - \sum_{n \in \mathcal{M}} w^T x_n t_n

where \mathcal{M} is the set of misclassified inputs.

The Perceptron Criterion

E_P(w) = - \sum_{n \in \mathcal{M}} w^T x_n t_n

where \mathcal{M} is the set of misclassified inputs.

- Observations:

  - E_P(w) is always non-negative.

  - E_P(w) is continuous and piecewise linear, and thus easier to minimize.

[Figure: E_P(w) plotted against a weight coordinate w_i: continuous and piecewise linear.]

The Perceptron Algorithm

E_P(w) = - \sum_{n \in \mathcal{M}} w^T x_n t_n

where \mathcal{M} is the set of misclassified inputs.

\frac{dE_P(w)}{dw} = - \sum_{n \in \mathcal{M}} x_n t_n   (where the derivative exists)

Gradient descent:

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta \sum_{n \in \mathcal{M}} x_n t_n

The Perceptron Algorithm

w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta \sum_{n \in \mathcal{M}} x_n t_n

- Why does this make sense?

- If an input from C_1 (t = +1) is misclassified, we need to make its projection on w more positive.

- If an input from C_2 (t = -1) is misclassified, we need to make its projection on w more negative.

The Perceptron Algorithm

- The algorithm can be implemented sequentially. Repeat until convergence:

  - For each input (x_n, t_n):

    - If it is correctly classified, do nothing.

    - If it is misclassified, update the weight vector to

      w^{(\tau+1)} = w^{(\tau)} + \eta x_n t_n

- Note that this will lower the contribution of input n to the objective function:

-w^{(\tau+1)T} x_n t_n = -w^{(\tau)T} x_n t_n - \eta (x_n t_n)^T (x_n t_n) < -w^{(\tau)T} x_n t_n .
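A minimal MATLAB sketch of this sequential update, assuming the rows of X are augmented input vectors, t holds targets in {-1, +1}, and the data are linearly separable so the outer loop terminates; the learning rate value is an arbitrary illustrative choice.

% Sequential perceptron learning (sketch).
eta = 1.0;                                   % learning rate (illustrative)
w   = zeros(size(X, 2), 1);                  % initial guess at the weights
converged = false;
while ~converged
    converged = true;
    for n = 1:size(X, 1)
        if t(n) * (X(n, :) * w) <= 0         % x_n is misclassified
            w = w + eta * t(n) * X(n, :)';   % w <- w + eta * x_n * t_n
            converged = false;               % revisit all points again
        end
    end
end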

Not Monotonic

- While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase.

- Also, inputs that had been classified correctly may now be misclassified.

- The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.

The Perceptron Convergence Theorem

- Despite this non-monotonicity, if in fact the data are linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).

Example

[Figure: a sequence of four plots showing the perceptron algorithm converging on a 2D example; the decision boundary is updated as each misclassified point is visited.]

The First Learning Machine

- Mark 1 Perceptron hardware (c. 1960)

[Photos: the machine's visual inputs; a patch board allowing configuration of the input features φ; a rack of adaptive weights w implemented as motor-driven potentiometers.]

Practical Limitations

- The Perceptron Convergence Theorem is an important result. However, there are practical limitations:

  - Convergence may be slow.

  - If the data are not separable, the algorithm will not converge.

  - We will only know that the data are separable once the algorithm converges.

  - The solution is in general not unique, and will depend upon initialization, the scheduling of input vectors, and the learning rate η.

Generalization to Inputs that are Not Linearly Separable

- The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.

- Example: the Pocket Algorithm (Gallant 1990)

  - Idea: run the perceptron algorithm, but keep track of the weight vector w* that has produced the best classification error achieved so far.

  - It can be shown that w* will converge to an optimal solution with probability 1.

Generalization to Multiclass Problems

- How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?

K > 2 Classes

- Idea #1: Just use K - 1 discriminant functions, each of which separates one class C_k from the rest. (One-versus-the-rest classifier.)

- Problem: ambiguous regions

[Figure: one-versus-the-rest construction for three classes; the region that is simultaneously "not C_1" and "not C_2" is ambiguous.]

K > 2 Classes

- Idea #2: Use K(K - 1)/2 discriminant functions, each of which separates two classes C_j, C_k from each other. (One-versus-one classifier.)

- Each point is classified by majority vote.

- Problem: ambiguous regions

[Figure: one-versus-one construction for three classes; a central region receives one vote for each class and remains ambiguous.]

K > 2 Classes

- Idea #3: Use K discriminant functions y_k(x).

- Use the magnitude of y_k(x), not just the sign:

y_k(x) = w_k^T x + w_{k0}

x assigned to C_k if y_k(x) > y_j(x) \ \forall j \ne k

Decision boundary between C_k and C_j:  y_k(x) = y_j(x), i.e.  (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0

- This results in decision regions that are simply connected and convex.

[Figure: decision regions R_i, R_j, R_k; any two points x_A, x_B in the same region are joined by a line segment that stays within the region.]

Example: Kesler’s Construction

- The perceptron algorithm can be generalized to K-class classification problems.

- Example: Kesler’s Construction

  - Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors w_i.

  - Inputs are then classified in class i if and only if

    w_i^T x > w_j^T x \ \forall j \ne i

  - The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.

1-of-K Coding Scheme

- When there are K > 2 classes, target variables can be coded using the 1-of-K coding scheme:

Input from class C_i \Rightarrow t = [0, 0, ..., 1, ..., 0, 0]^T

(the single 1 appears in element i)
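As a small illustration (assumed helper, not part of the slides), 1-of-K target rows can be built from integer class labels in MATLAB; the variable names labels, K and T are arbitrary.

% Build the N x K matrix of 1-of-K targets from integer class labels.
labels = [2; 1; 3; 2];                        % class index for each of N = 4 inputs
K = 3;
T = zeros(numel(labels), K);
T(sub2ind(size(T), (1:numel(labels))', labels)) = 1;
% T = [0 1 0; 1 0 0; 0 0 1; 0 1 0]  (row n is t_n')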

Computational Limitations of Perceptrons

- Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing.

- However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function.

- This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.

[Figure: the XOR configuration of four points in the (x_1, x_2) plane, which no single line can separate. Photo: Marvin Minsky.]

Multi-Layer Perceptrons

- Minsky & Papert’s book was widely misinterpreted as showing that artificial neural networks were inherently limited.

- This contributed to a decline in the reputation of neural network research through the 70s and 80s.

- However, their findings apply only to single-layer perceptrons. Multi-layer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.

End of Lecture 11

Outline

- The Perceptron Algorithm

- Least-Squares Classifiers

- Fisher’s Linear Discriminant

- Logistic Classifiers

- Support Vector Machines

Dealing with Non-Linearly Separable Inputs

- The perceptron algorithm fails when the training data are not perfectly linearly separable.

- Let’s now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.

The Least Squares Method

- In the least squares method, we simply fit the (x, t) observations with a hyperplane y(x).

- Note that this is kind of a weird idea, since the t values are binary (when K = 2), e.g., 0 or 1.

- However it can work pretty well.

Least Squares: Learning the Parameters

Assume D-dimensional input vectors x.

For each class k \in {1, ..., K}:   y_k(x) = w_k^T x + w_{k0}

y(x) = \tilde{W}^T \tilde{x},  where \tilde{x} = (1, x^T)^T

and \tilde{W} is a (D + 1) \times K matrix whose kth column is \tilde{w}_k = (w_{k0}, w_k^T)^T.

Learning the Parameters

- Method #2: Least Squares

y(x) = \tilde{W}^T \tilde{x}

Training dataset: (x_n, t_n), n = 1, ..., N, where we use the 1-of-K coding scheme for t_n.

Let T be the N \times K matrix whose nth row is t_n^T.
Let \tilde{X} be the N \times (D + 1) matrix whose nth row is \tilde{x}_n^T.
Let R_D(\tilde{W}) = \tilde{X}\tilde{W} - T.

Then we define the error as

E_D(\tilde{W}) = \frac{1}{2} \sum_{i,j} R_{ij}^2 = \frac{1}{2} \mathrm{Tr}\{ R_D(\tilde{W})^T R_D(\tilde{W}) \}

Setting the derivative with respect to \tilde{W} to 0 yields:

\tilde{W} = ( \tilde{X}^T \tilde{X} )^{-1} \tilde{X}^T T = \tilde{X}^{\dagger} T
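A minimal MATLAB sketch of this closed-form solution, assuming X holds the raw D-dimensional inputs as rows, T is the N x K matrix of 1-of-K targets (e.g., built as in the earlier snippet), and x_new is a 1 x D test input; pinv is used for the pseudoinverse.

% Least-squares classifier: fit and classify (sketch).
Xtilde = [ones(size(X, 1), 1), X];            % N x (D+1) augmented design matrix
Wtilde = pinv(Xtilde) * T;                    % (D+1) x K,  Wtilde = Xtilde^dagger * T

scores = [1, x_new] * Wtilde;                 % 1 x K row of discriminants y_k(x)
[~, k_hat] = max(scores);                     % assign x_new to the largest y_k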

Outline

- The Perceptron Algorithm

- Least-Squares Classifiers

- Fisher’s Linear Discriminant

- Logistic Classifiers

- Support Vector Machines

Fisher’s Linear Discriminant

- Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.

Let  m_1 = \frac{1}{N_1} \sum_{n \in C_1} x_n ,  \quad  m_2 = \frac{1}{N_2} \sum_{n \in C_2} x_n

For example, we might choose w to maximize w^T ( m_2 - m_1 ), subject to \|w\| = 1.

This leads to  w \propto m_2 - m_1 .

However, if the conditional distributions are not isotropic, this is typically not optimal.

[Figure: two elongated class distributions; projecting onto m_2 - m_1 yields strongly overlapping projections.]

Fisher’s Linear Discriminant

Let m_1 = w^T \mathbf{m}_1, m_2 = w^T \mathbf{m}_2 be the conditional means on the 1D subspace.

Let s_k^2 = \sum_{n \in C_k} ( y_n - m_k )^2 be the within-class variance on the subspace for class C_k.

The Fisher criterion is then

J(w) = \frac{ (m_2 - m_1)^2 }{ s_1^2 + s_2^2 }

This can be rewritten as

J(w) = \frac{ w^T S_B w }{ w^T S_W w }

where

S_B = ( \mathbf{m}_2 - \mathbf{m}_1 )( \mathbf{m}_2 - \mathbf{m}_1 )^T  is the between-class variance

and

S_W = \sum_{n \in C_1} ( x_n - \mathbf{m}_1 )( x_n - \mathbf{m}_1 )^T + \sum_{n \in C_2} ( x_n - \mathbf{m}_2 )( x_n - \mathbf{m}_2 )^T  is the within-class variance.

J(w) is maximized for  w \propto S_W^{-1} ( \mathbf{m}_2 - \mathbf{m}_1 ).
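A minimal MATLAB sketch of Fisher’s discriminant under these definitions; X1, X2 and X_new are assumed variable names holding the two classes’ training vectors and some test inputs as rows, and the midpoint threshold on the projected means is just one simple choice of decision rule.

% Fisher's linear discriminant (sketch; uses implicit expansion, R2016b+).
m1 = mean(X1, 1)';   m2 = mean(X2, 1)';       % class means (column vectors)
D1 = X1 - m1';       D2 = X2 - m2';           % centred data
SW = D1' * D1 + D2' * D2;                     % within-class scatter matrix S_W
w  = SW \ (m2 - m1);                          % w proportional to S_W^{-1}(m2 - m1)

y0 = 0.5 * w' * (m1 + m2);                    % threshold at the projected midpoint
assign_to_C2 = (X_new * w) > y0;              % illustrative decision rule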

Connection between Least-Squares and FLD

Change the coding scheme used in the least-squares method to

t_n = \frac{N}{N_1}   for x_n \in C_1
t_n = - \frac{N}{N_2}  for x_n \in C_2

Then one can show that the ML w satisfies

w \propto S_W^{-1} ( \mathbf{m}_2 - \mathbf{m}_1 )

Least Squares Classifier

- Problem #1: Sensitivity to outliers

[Figure: two panels showing how adding a cluster of outlying points drags the least-squares decision boundary away from an otherwise good solution.]

Least Squares Classifier

- Problem #2: Linear activation function is not a good fit to binary data. This can lead to problems.

[Figure: an example where the least-squares fit badly misclassifies a region of the input space.]

Outline

- The Perceptron Algorithm

- Least-Squares Classifiers

- Fisher’s Linear Discriminant

- Logistic Classifiers

- Support Vector Machines

Logistic Regression (K = 2)

p(C_1 | \phi) = y(\phi) = \sigma( w^T \phi )

p(C_2 | \phi) = 1 - p(C_1 | \phi)

where  \sigma(a) = \frac{1}{1 + \exp(-a)}

[Figure: the logistic regression model in 1D and 2D. In 1D, the sigmoid maps the linear activation to a probability in (0, 1); in 2D, the model has a sigmoid profile in the direction of the gradient of w^T \phi and is constant in the orthogonal direction, so the decision boundary is linear.]

∂L I 1 I = y x = (sig[a ] y ) x . (8.7) ∂φ T − i i i − i i i=1 1 + exp[ φ xi] i=1 ! & − ' ! Unfortunately, when we equate this expression to zero, there is no way to re- arrange to get a closed form solution for the parameters φ. Instead we must rely on a non-linear optimization technique to find the maximum of this function. We’ll now sketch the main ideas behind non-linear optimization. We defer a more detailed discussion until section 8.10. In non-linear optimization, we start with an initial estimate of the solution φ and iteratively improve it. The methods we will discuss rely on computing Logistic Regression 47 Probability & Bayesian Inference

p(C_1 | \phi) = y(\phi) = \sigma( w^T \phi ),   p(C_2 | \phi) = 1 - p(C_1 | \phi)

where  \sigma(a) = \frac{1}{1 + \exp(-a)}

- Number of parameters:

  - Logistic regression: M

  - Gaussian class-conditional model: 2M + 2 \cdot M(M+1)/2 + 1 = M^2 + 3M + 1

  (e.g., for M = 100 features: 100 parameters versus 10,301.)

ML for Logistic Regression

p(\mathbf{t} | w) = \prod_{n=1}^{N} y_n^{t_n} ( 1 - y_n )^{1 - t_n},   where \mathbf{t} = (t_1, ..., t_N)^T and y_n = p(C_1 | \phi_n)

We define the error function to be E(w) = - \log p(\mathbf{t} | w).

Given y_n = \sigma(a_n) and a_n = w^T \phi_n, one can show that

\nabla E(w) = \sum_{n=1}^{N} ( y_n - t_n ) \phi_n

Unfortunately, there is no closed-form solution for w.
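To fill in the "one can show" step, use the derivative of the logistic sigmoid:

E(w) = - \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \right\}

\frac{d\sigma}{da} = \sigma (1 - \sigma)
\;\Rightarrow\;
\frac{\partial E}{\partial a_n} = - \left( \frac{t_n}{y_n} - \frac{1 - t_n}{1 - y_n} \right) y_n (1 - y_n) = y_n - t_n

and since a_n = w^T \phi_n,

\nabla E(w) = \sum_{n=1}^{N} ( y_n - t_n ) \phi_n .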

End of Lecture 12

ML for Logistic Regression

- Iterative Reweighted Least Squares

  - Although there is no closed-form solution for the ML estimate of w, fortunately the error function is convex.

  - Thus an appropriate iterative method is guaranteed to find the exact solution.

  - A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):

w^{(new)} = w^{(old)} - H^{-1} \nabla E(w),   where H is the Hessian matrix of E(w)

ML for Logistic Regression

w^{(new)} = w^{(old)} - H^{-1} \nabla E(w),   where H is the Hessian matrix of E(w):

H = \Phi^T R \Phi,   where R is the N \times N diagonal weight matrix with R_{nn} = y_n ( 1 - y_n )

(Note that, since R_{nn} \ge 0, R is positive semi-definite, and hence H is positive semi-definite. Thus E(w) is convex.)

Thus

w^{(new)} = w^{(old)} - ( \Phi^T R \Phi )^{-1} \Phi^T ( y - t )
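A minimal MATLAB sketch of one Newton-Raphson (IRLS) step, assuming a design matrix Phi with rows \phi_n^T, targets t in {0, 1} and a current weight vector w; in practice the step is repeated until w stops changing.

% One IRLS / Newton-Raphson update for logistic regression (sketch).
a = Phi * w;                                  % activations a_n = w' * phi_n
y = 1 ./ (1 + exp(-a));                       % y_n = sigma(a_n)
R = diag(y .* (1 - y));                       % diagonal weighting matrix R
gradE = Phi' * (y - t);                       % gradient of the cross-entropy error
H = Phi' * R * Phi;                           % Hessian
w = w - H \ gradE;                            % w_new = w_old - H^{-1} grad E(w)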

ML for Logistic Regression

- Iterative Reweighted Least Squares

p(C_1 | \phi) = y(\phi) = \sigma( w^T \phi )

[Figure: parameter estimation for logistic regression with 1D data. (a) Maximum likelihood seeks the maximum of the likelihood with respect to the parameters; (b) maximizing the log likelihood instead leaves the peak in the same place, and crosses show two iterations of Newton's method; (c) the logistic sigmoids associated with the parameters at each optimization step fit the data more closely as the log likelihood increases. Because the log likelihood for logistic regression is convex, gradient-based approaches find the global maximum.]

Logistic Regression

- For K > 2, we can generalize the activation function by modeling the posterior probabilities as

p(C_k | \phi) = y_k(\phi) = \frac{ \exp(a_k) }{ \sum_j \exp(a_j) }

where the activations a_k are given by  a_k = w_k^T \phi.
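A minimal MATLAB sketch of this softmax activation; W (whose kth column is w_k) and the feature vector phi are assumed variable names, and subtracting max(a) is a standard numerical-stability trick rather than part of the model.

% Softmax posteriors for K classes (sketch).
a = W' * phi;                                 % activations a_k = w_k' * phi (K x 1)
a = a - max(a);                               % stabilise the exponentials
p = exp(a) / sum(exp(a));                     % p(C_k | phi) = exp(a_k) / sum_j exp(a_j)
[~, k_hat] = max(p);                          % assign to the most probable class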

Example

[Figure: two panels, "Least-Squares" (left) and "Logistic" (right), comparing the decision regions the two classifiers produce on the same dataset.]

Outline

- The Perceptron Algorithm

- Least-Squares Classifiers

- Fisher’s Linear Discriminant

- Logistic Classifiers

- Support Vector Machines

Motivation

- The perceptron algorithm is guaranteed to provide a linear decision surface that separates the training data, if one exists.

- However, if the data are linearly separable, there are in general an infinite number of solutions, and the solution returned by the perceptron algorithm depends in a complex way on the initial conditions, the learning rate and the order in which training data are processed.

- While all solutions achieve a perfect score on the training data, they won’t all necessarily generalize as well to new inputs.

Which solution would you choose?

The Large Margin Classifier

- Unlike the Perceptron Algorithm, Support Vector Machines solve a problem that has a unique solution: they return the linear classifier with the maximum margin, that is, the hyperplane that separates the data and is farthest from any of the training vectors.

- Why is this good?

Support Vector Machines

SVMs are based on the linear model  y(x) = w^T \phi(x) + b.

Assume training data x_1, ..., x_N with corresponding target values t_1, ..., t_N, where t_n \in \{-1, 1\}.

x is classified according to the sign of y(x).

Assume for the moment that the training data are linearly separable in feature space. Then

\exists \, w, b :  t_n y(x_n) > 0  \quad \forall n \in \{1, ..., N\}

Maximum Margin Classifiers

- When the training data are linearly separable, there are generally an infinite number of solutions for (w, b) that separate the classes exactly.

- The margin of such a classifier is defined as the orthogonal distance in feature space between the decision boundary and the closest training vector.

- SVMs are an example of a maximum margin classifier, which finds the linear classifier that maximizes the margin.

[Figure: the hyperplanes y = 1, y = 0, y = -1; the margin is the distance from the decision boundary y = 0 to the closest training points.]

Probabilistic Motivation

- The maximum margin classifier has a probabilistic motivation.

If we model the class-conditional densities with a KDE using Gaussian kernels with variance \sigma^2, then in the limit as \sigma \to 0, the optimal linear decision boundary approaches the maximum margin linear classifier.

Two-Class Discriminant Function

Recall:

y(x) = w^T x + w_0

y(x) \ge 0 \Rightarrow x assigned to C_1
y(x) < 0  \Rightarrow x assigned to C_2

Thus y(x) = 0 defines the decision boundary, and y(x) / \|w\| is the signed distance of x from it.

Maximum Margin Classifiers

The distance of point x_n from the decision surface is given by:

\frac{ t_n y(x_n) }{ \|w\| } = \frac{ t_n ( w^T \phi(x_n) + b ) }{ \|w\| }

Thus we seek:

\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n ( w^T \phi(x_n) + b ) \right] \right\}

Maximum Margin Classifiers

The distance of point x_n from the decision surface is given by:

\frac{ t_n y(x_n) }{ \|w\| } = \frac{ t_n ( w^T \phi(x_n) + b ) }{ \|w\| }

Note that rescaling w and b by the same factor leaves the distance to the decision surface unchanged. Thus, without loss of generality, we consider only solutions that satisfy

t_n ( w^T \phi(x_n) + b ) = 1

for the point x_n that is closest to the decision surface.

Quadratic Programming Problem

Then all points x_n satisfy

t_n ( w^T \phi(x_n) + b ) \ge 1

Points for which equality holds are said to be active. All other points are inactive.

Now

\arg\max_{w, b} \left\{ \frac{1}{\|w\|} \min_n \left[ t_n ( w^T \phi(x_n) + b ) \right] \right\}

reduces to

\arg\min_{w} \frac{1}{2} \|w\|^2 ,   subject to  t_n ( w^T \phi(x_n) + b ) \ge 1 \ \forall x_n

This is a quadratic programming problem. Solving this problem will involve Lagrange multipliers.

Lagrange Multipliers (Appendix C.4 in textbook)

- Used to find stationary points of a function subject to one or more constraints.

- Example (equality constraint): maximize f(x) subject to g(x) = 0.

- Observations:

1. At any point on the constraint surface, \nabla g(x) must be orthogonal to the surface.

2. Let x* be a point on the constraint surface where f(x) is maximized. Then \nabla f(x) is also orthogonal to the constraint surface.

3. \Rightarrow \exists \lambda \ne 0 such that \nabla f(x) + \lambda \nabla g(x) = 0 at x*. \lambda is called a Lagrange multiplier.

[Figure: the constraint surface g(x) = 0 with the gradients \nabla f(x) and \nabla g(x) drawn at a point x_A. Portrait: Joseph-Louis Lagrange, 1736-1813.]

Lagrange Multipliers (Appendix C.4 in textbook)

\exists \lambda \ne 0 such that \nabla f(x) + \lambda \nabla g(x) = 0 at x*.

- Defining the Lagrangian function as

L(x, \lambda) = f(x) + \lambda g(x)

we then have

\nabla_x L(x, \lambda) = 0   and   \frac{\partial L(x, \lambda)}{\partial \lambda} = 0.

Example

L(x, \lambda) = f(x) + \lambda g(x)

- Find the stationary point of

f(x_1, x_2) = 1 - x_1^2 - x_2^2

subject to

g(x_1, x_2) = x_1 + x_2 - 1 = 0

[Figure: the circular contours of f and the constraint line g(x_1, x_2) = 0 in the (x_1, x_2) plane.]
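Carrying the example through (the solution is not printed on the slide):

L(x, \lambda) = 1 - x_1^2 - x_2^2 + \lambda ( x_1 + x_2 - 1 )

\frac{\partial L}{\partial x_1} = -2 x_1 + \lambda = 0, \quad
\frac{\partial L}{\partial x_2} = -2 x_2 + \lambda = 0, \quad
\frac{\partial L}{\partial \lambda} = x_1 + x_2 - 1 = 0

\Rightarrow \; x_1^* = x_2^* = \tfrac{1}{2}, \quad \lambda = 1, \quad f(x_1^*, x_2^*) = \tfrac{1}{2}.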

End of Lecture 13

Inequality Constraints

Maximize f(x) subject to g(x) \ge 0.

L(x, \lambda) = f(x) + \lambda g(x)

- There are 2 cases:

1. x* on the interior (e.g., x_B)

   - Here g(x) > 0 and the stationary condition is simply \nabla f(x) = 0.

   - This corresponds to a stationary point of the Lagrangian where \lambda = 0.

2. x* on the boundary (e.g., x_A)

   - Here g(x) = 0 and the stationary condition is \nabla f(x) = - \lambda \nabla g(x), with \lambda > 0.

   - This corresponds to a stationary point of the Lagrangian where \lambda > 0.

- Thus the general problem can be expressed as maximizing the Lagrangian subject to the Karush-Kuhn-Tucker (KKT) conditions:

1. g(x) \ge 0
2. \lambda \ge 0
3. \lambda g(x) = 0

[Figure: the region g(x) > 0 bounded by g(x) = 0, with an interior stationary point x_B and a boundary stationary point x_A.]

 Thus the general problem can be expressed as maximizing the Lagrangian subject to 1. g(x) ! 0 2. " ! 0 Karush-Kuhn-Tucker (KKT) conditions 3. "g(x) = 0 CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition J. Elder Minimizing vs Maximizing 72 Probability & Bayesian Inference

- If we want to minimize f(x) subject to g(x) \ge 0, then the Lagrangian becomes

L(x, \lambda) = f(x) - \lambda g(x),   with \lambda \ge 0.

Extension to Multiple Constraints

- Suppose we wish to maximize f(x) subject to

g_j(x) = 0   for j = 1, ..., J
h_k(x) \ge 0  for k = 1, ..., K

- We then find the stationary points of

L(x, \lambda, \mu) = f(x) + \sum_{j=1}^{J} \lambda_j g_j(x) + \sum_{k=1}^{K} \mu_k h_k(x)

subject to

h_k(x) \ge 0
\mu_k \ge 0
\mu_k h_k(x) = 0

Back to our quadratic programming problem…

Quadratic Programming Problem

\arg\min_{w} \frac{1}{2} \|w\|^2 ,   subject to  t_n ( w^T \phi(x_n) + b ) \ge 1 \ \forall x_n

Solve using Lagrange multipliers a_n:

L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n ( w^T \phi(x_n) + b ) - 1 \right\}

Dual Representation

Solve using Lagrange multipliers a_n:

L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n ( w^T \phi(x_n) + b ) - 1 \right\}

Setting derivatives with respect to w and b to 0, we get:

w = \sum_{n=1}^{N} a_n t_n \phi(x_n)

\sum_{n=1}^{N} a_n t_n = 0

Dual Representation

L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{n=1}^{N} a_n \left\{ t_n ( w^T \phi(x_n) + b ) - 1 \right\},
\quad
w = \sum_{n=1}^{N} a_n t_n \phi(x_n), \quad \sum_{n=1}^{N} a_n t_n = 0

Substituting leads to the dual representation of the maximum margin problem, in which we maximize

\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)

with respect to a, subject to:

a_n \ge 0 \ \forall n
\sum_{n=1}^{N} a_n t_n = 0

and where  k(x, x') = \phi(x)^T \phi(x').

Dual Representation

Using w = \sum_n a_n t_n \phi(x_n), a new point x is classified by computing

y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b

The Karush-Kuhn-Tucker (KKT) conditions for this constrained optimization problem are:

a_n \ge 0
t_n y(x_n) - 1 \ge 0
a_n \{ t_n y(x_n) - 1 \} = 0

Thus for every data point, either a_n = 0 or t_n y(x_n) = 1. The points with a_n > 0, which lie on the margin, are the support vectors.

Solving for the Bias

Once the optimal a is determined, the bias b can be computed by noting that any support vector x_n satisfies t_n y(x_n) = 1.

Using  y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b,  we have

t_n \left( \sum_{m=1}^{N} a_m t_m k(x_n, x_m) + b \right) = 1

and so (since t_n^2 = 1)

b = t_n - \sum_{m=1}^{N} a_m t_m k(x_n, x_m)

A more numerically accurate solution can be obtained by averaging over all support vectors:

b = \frac{1}{N_S} \sum_{n \in S} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right)

where S is the index set of support vectors and N_S is the number of support vectors.
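A minimal MATLAB sketch combining the last two slides: it estimates the bias by averaging over the support vectors and then classifies a new point. The Gaussian kernel and its width sigma, the tolerance used to detect a_n > 0, and the variable names a, t, Xs (training inputs as rows) and x_new are all assumptions of the sketch, not part of the slides.

% Kernel SVM prediction from the optimal multipliers a (sketch).
kfun = @(u, v) exp(-norm(u - v)^2 / (2 * sigma^2));       % assumed Gaussian kernel

S = find(a > 1e-8);                           % indices of support vectors (a_n > 0)
b = 0;
for n = S'                                    % average b over the support vectors
    kn = arrayfun(@(m) kfun(Xs(n, :), Xs(m, :)), S);
    b  = b + t(n) - sum(a(S) .* t(S) .* kn);
end
b = b / numel(S);

kx    = arrayfun(@(m) kfun(x_new, Xs(m, :)), S);          % k(x_new, x_m), m in S
y_new = sum(a(S) .* t(S) .* kx) + b;          % y(x) = sum_n a_n t_n k(x, x_n) + b
label = sign(y_new);                          % assign by the sign of y(x)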

Example (Gaussian Kernel)

[Figure: input space (x_1, x_2) showing the nonlinear decision boundary obtained with a Gaussian kernel.]

Overlapping Class Distributions

- The SVM for non-overlapping class distributions is determined by solving

\arg\min_{w} \frac{1}{2} \|w\|^2 ,   subject to  t_n ( w^T \phi(x_n) + b ) \ge 1 \ \forall x_n

Alternatively, this can be expressed as the minimization of

\sum_{n=1}^{N} E_{\infty}\big( y(x_n) t_n - 1 \big) + \lambda \|w\|^2

where E_{\infty}(z) is 0 if z \ge 0, and \infty otherwise.

This forces all points to lie on or outside the margins, on the correct side for their class. To allow for misclassified points, we have to relax this E_{\infty} term.

Slack Variables

To this end, we introduce N slack variables \xi_n \ge 0, n = 1, ..., N:

\xi_n = 0  for points on or on the correct side of the margin boundary for their class
\xi_n = | t_n - y(x_n) |  for all other points.

Thus

\xi_n < 1 for points that are correctly classified
\xi_n > 1 for points that are incorrectly classified.

We now minimize

C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2 ,   where C > 0,

subject to

t_n y(x_n) \ge 1 - \xi_n   and   \xi_n \ge 0,   n = 1, ..., N

[Figure: the margin boundaries y = \pm 1 and decision boundary y = 0, with \xi = 0 on or outside the correct margin, 0 < \xi < 1 for points inside the margin but correctly classified, and \xi > 1 for misclassified points.]

Dual Representation

This leads to a dual representation, where we maximize

\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)

with constraints

0 \le a_n \le C

\sum_{n=1}^{N} a_n t_n = 0

Support Vectors

Again, a new point x is classified by computing

y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b

For points that are on the correct side of the margin, a_n = 0.

Thus the support vectors consist of points on their margin, points between their margin and the decision boundary, and misclassified points. In other words, all points that are not strictly on the correct side of their margin are support vectors.

Bias

Again, a new point x is classified by computing

y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b

Once the optimal a is determined, the bias b can be computed from

b = \frac{1}{N_M} \sum_{n \in M} \left( t_n - \sum_{m \in S} a_m t_m k(x_n, x_m) \right)

where

S is the index set of support vectors
N_S is the number of support vectors
M is the index set of points on the margins
N_M is the number of points on the margins

Solving the Quadratic Programming Problem

Maximize

\tilde{L}(a) = \sum_{n=1}^{N} a_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a_n a_m t_n t_m k(x_n, x_m)

subject to  0 \le a_n \le C  and  \sum_{n=1}^{N} a_n t_n = 0

- The problem is convex.

- Standard solutions are generally O(N^3).

- Traditional quadratic programming techniques are often infeasible due to computation and memory requirements.

- Instead, methods such as sequential minimal optimization can be used, which in practice are found to scale as O(N) to O(N^2).

Chunking

- The conventional quadratic programming solution requires that matrices with N^2 elements be maintained in memory:

K \sim O(N^2), where K_{nm} = k(x_n, x_m)
T \sim O(N^2), where T_{nm} = t_n t_m
A \sim O(N^2), where A_{nm} = a_n a_m

- This becomes infeasible when N exceeds ~10,000.

Chunking

- Chunking (Vapnik, 1982) exploits the fact that the value of the Lagrangian is unchanged if we remove the rows and columns of the kernel matrix where a_n = 0 or a_m = 0.

Chunking

Recall the soft-margin objective:

Minimize  C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \|w\|^2 ,  where C > 0,

\xi_n = 0  for points on or on the correct side of the margin boundary for their class
\xi_n = | t_n - y(x_n) |  for all other points.

- Chunking (Vapnik, 1982):

1. Select a small number (a 'chunk') of training vectors.
2. Solve the QP problem for this subset.
3. Retain only the support vectors.
4. Consider another chunk of the training data.
5. Ignore the subset of vectors in all chunks considered so far that lie on the correct side of the margin, since these do not contribute to the cost function.
6. Add the remainder to the current set of support vectors and solve the new QP problem.
7. Return to Step 4.
8. Repeat until the set of support vectors does not change.

- This method reduces memory requirements to O(N_S^2), where N_S is the number of support vectors. This may still be big!

Decomposition Methods

- It can be shown that the global QP problem is solved when all training vectors satisfy the following optimality conditions:

a_i = 0 \Rightarrow t_i y(x_i) \ge 1
0 < a_i < C \Rightarrow t_i y(x_i) = 1
a_i = C \Rightarrow t_i y(x_i) \le 1

- Decomposition methods decompose this large QP problem into a series of smaller subproblems.

- Decomposition (Osuna et al., 1997):

  - Partition the training data into a small working subset B and a fixed subset N.

  - Minimize the global objective function by adjusting the coefficients in B.

  - Swap one or more vectors in B for an equal number in N that fail to satisfy the optimality conditions.

  - Re-solve the global QP problem for B.

- Each step is O(|B|^2) in memory.

- Osuna et al. (1997) proved that the objective function decreases on each step and will converge in a finite number of iterations.

Sequential Minimal Optimization

- Sequential Minimal Optimization (Platt, 1998) takes decomposition to the limit.

- On each iteration, the working set consists of just two vectors.

- The advantage is that in this case, the QP problem can be solved analytically.

- Memory requirements are O(N).

- Compute time is typically O(N) to O(N^2).

LIBSVM

- LIBSVM is a widely used library for SVMs, developed by Chang & Lin (2001).

  - Can be downloaded from www.csie.ntu.edu.tw/~cjlin/libsvm

  - MATLAB interface

  - Uses SMO

  - Will use for Assignment 2.

End of Lecture 14

LIBSVM Example: Face Detection

[Diagram: face and non-face training images are each preprocessed (subsampled and normalized to \mu = 0, \sigma^2 = 1) and then passed to svmtrain.]

LIBSVM Example: MATLAB Interface

% '-t 0' selects a linear SVM
model = svmtrain(traint, trainx, '-t 0');

[predicted_label, accuracy, decision_values] = svmpredict(testt, testx, model);

Output:

Accuracy = 70.0212% (661/944) (classification)
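The same interface accepts LIBSVM's other kernel and regularization options; for example, '-t 2' selects a Gaussian (RBF) kernel, with '-c' and '-g' setting the soft-margin cost and kernel width. The particular values below are illustrative, not tuned for this face-detection data.

% RBF-kernel SVM with an explicit soft-margin cost (illustrative parameters).
model = svmtrain(traint, trainx, '-t 2 -c 1 -g 0.05');
[predicted_label, accuracy, decision_values] = svmpredict(testt, testx, model);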

Example

[Figure: input space (x_1, x_2) showing the resulting decision boundary and support vectors.]

Relation to Logistic Regression

The objective function for the soft-margin SVM can be written as:

\sum_{n=1}^{N} E_{SV}( y_n t_n ) + \lambda \|w\|^2

where E_{SV}(z) = [1 - z]_+ is the hinge error function, and

[z]_+ = z if z \ge 0, and 0 otherwise.

For t \in \{-1, 1\}, the objective function for a regularized version of logistic regression can be written as:

\sum_{n=1}^{N} E_{LR}( y_n t_n ) + \lambda \|w\|^2

where E_{LR}(z) = \log( 1 + \exp(-z) ).

[Figure: the hinge error E_{SV}(z) and the logistic error E_{LR}(z) plotted against z.]
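A short MATLAB sketch (not from the slides) that plots the two error functions side by side; the range of z is arbitrary.

% Compare the hinge and logistic error functions (sketch).
z    = linspace(-2, 2, 401);
E_sv = max(0, 1 - z);                         % hinge error E_SV(z) = [1 - z]_+
E_lr = log(1 + exp(-z));                      % logistic error E_LR(z)
plot(z, E_sv, z, E_lr);
legend('E_{SV}(z) (hinge)', 'E_{LR}(z) (logistic)');
xlabel('z = y_n t_n');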