Supervised Learning: Support Vector Machines

Taken from the slides of Andrew Zisserman

Binary Classification

¤ Given training data

(x_i, y_i) for i = 1...N, with x_i ∈ R^d and y_i ∈ {-1, +1}

¤ Learn a classifier f(x) such that f(x_i) ≥ 0 for y_i = +1 and f(x_i) < 0 for y_i = -1

¤ i.e. y_i f(x_i) > 0 for a correct classification

Linear Separability

(Figures: one linearly separable dataset, one dataset that is not linearly separable)

Linear Classifiers

¤ A linear classifier has the form

f(x) = w^T x + b

¤ In 2D the discriminant is a line

¤ w is the normal to the line (slope), and b is the bias (intercept)

¤ w is also known as the weight vector

Linear Classifiers

¤ A linear classifier has the form

f(x) = w^T x + b

¤ In 3D the discriminant is a plane, and in nD it is a hyperplane
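As a concrete illustration of the decision rule above, here is a minimal sketch in Python with NumPy; the weight values and test points are made-up examples, not from the slides:

```python
import numpy as np

# Hypothetical learned parameters of a 2D linear classifier (illustrative values only)
w = np.array([2.0, -1.0])   # weight vector: the normal to the separating line
b = -0.5                    # bias: offset of the line from the origin

def predict(x):
    """Return +1 or -1 according to the sign of the discriminant f(x) = w^T x + b."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([1.0, 0.0])))   # f(x) = 2*1 - 1*0 - 0.5 = 1.5  -> +1
print(predict(np.array([0.0, 2.0])))   # f(x) = 0 - 2 - 0.5 = -2.5     -> -1
```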

¤ How can we find this separating hyperplane?

The Perceptron Algorithm

f(x) = w^T x + b

¤ WLOG we can assume that the separating hyperplane passes through the origin and is of the form f(x) = w·x

¤ The perceptron algorithm starts with an initial guess w = 0

¤ Cycle through the data points {x_i, y_i}

¤ Predict sign(f(x_i)) as the label for example x_i

¤ If incorrect, update w_{i+1} = w_i + α y_i x_i (0 < α < 1)

¤ else w_{i+1} = w_i

¤ Repeat until all the data points are correctly classified
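The following is a minimal sketch of the perceptron loop described above, written in Python with NumPy; the toy data, learning rate, and epoch limit are illustrative assumptions rather than part of the slides:

```python
import numpy as np

def perceptron(X, y, alpha=0.5, max_epochs=100):
    """Perceptron algorithm for linearly separable data (hyperplane through the origin)."""
    w = np.zeros(X.shape[1])          # start with the initial guess w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):      # cycle through the data points
            if np.sign(xi @ w) != yi: # predicted label sign(f(x_i)) is incorrect
                w = w + alpha * yi * xi
                mistakes += 1
        if mistakes == 0:             # stop once every point is classified correctly
            return w
    return w

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))              # all signs should match y
```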

Perceptron Characteristics

¤ If the data is linearly separable, then the algorithm will converge

¤ Convergence can be slow ...

¤ The separating line can end up close to the training data

What is the best w?

¤ Maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine

SVM Sketch Derivation

¤ Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w

¤ Choose the normalization such that w^T x_+ + b = +1 and w^T x_- + b = -1 for the positive and negative support vectors respectively

¤ Then the margin is given by

margin = (w / ||w||) · (x_+ - x_-) = w^T (x_+ - x_-) / ||w|| = 2 / ||w||
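Spelled out, the last step follows directly from the chosen normalization; as a short worked derivation in LaTeX:

```latex
% Margin between the two support-vector hyperplanes, using the normalization
%   w^T x_+ + b = +1  and  w^T x_- + b = -1.
\begin{align*}
  w^\top x_+ + b &= +1 \\
  w^\top x_- + b &= -1 \\
  \Rightarrow\quad w^\top (x_+ - x_-) &= 2
    && \text{(subtract the two equations)} \\
  \Rightarrow\quad \frac{w}{\lVert w \rVert} \cdot (x_+ - x_-) &= \frac{2}{\lVert w \rVert}
    && \text{(project onto the unit normal)}
\end{align*}
```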

Support Vector Machine

SVM Optimization

¤ Learning the SVM can be formulated as an optimization:

max_w 2 / ||w|| subject to: w^T x_i + b ≥ +1 if y_i = +1, and w^T x_i + b ≤ -1 if y_i = -1, for i = 1...N

¤ Or equivalently

min_w ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for i = 1...N
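The equivalence is a monotone reparametrization of the objective, with the two one-sided constraints folded into a single inequality:

```latex
% Maximizing the margin 2/||w|| is the same as minimizing ||w|| (and hence ||w||^2),
% while the two class-wise constraints combine into y_i (w^T x_i + b) >= 1.
\max_{w}\ \frac{2}{\lVert w \rVert}
  \;\Longleftrightarrow\;
\min_{w}\ \lVert w \rVert^2
\quad\text{subject to}\quad y_i\,(w^\top x_i + b) \ge 1,\; i = 1,\dots,N
```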

¤ This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum
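A minimal sketch of solving this problem on toy data, assuming scikit-learn (the slides do not prescribe a library); a very large C makes the standard soft-margin solver behave essentially like the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Very large C approximates the hard-margin problem: min ||w||^2 s.t. y_i (w^T x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```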

Linear Separability Again: What is the best w?

¤ The points can be linearly separated, but there is a very narrow margin

¤ But possibly the large margin solution is better, even though one constraint is violated

In general there is a trade-off between the margin and the number of mistakes on the training data

Introduce Slack Variables

Soft Margin Solution

¤ The optimization problem becomes:

min_{w ∈ R^d, ξ_i ∈ R^+} ||w||^2 + C Σ_{i=1}^{N} ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i for i = 1...N

¤ Every constraint can be satisfied if ξ_i is sufficiently large

¤ C is a regularization parameter

¤ Small C allows constraints to be easily ignored: large margin

¤ Large C makes constraints hard to ignore: narrow margin

¤ C=∞ enforces all constraints: hard margin

¤ This is still a quadratic optimization problem and there is a unique minimum. Note that there is only one free parameter, C.

(Figures: C = ∞ gives a hard margin; C = 10 gives a soft margin)
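A minimal sketch of the effect of C, corresponding to the figures above, again assuming scikit-learn and made-up 2D data (neither is specified by the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy 2D clusters, one per class (illustrative data)
X = np.vstack([rng.normal([2, 2], 1.0, (20, 2)), rng.normal([-2, -2], 1.0, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in [0.01, 10, 1e6]:            # small C: wide margin; very large C: ~hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}  margin width 2/||w|| = {2 / np.linalg.norm(w):.3f}  "
          f"#support vectors = {len(clf.support_vectors_)}")
```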

Application: Pedestrian Detection in Computer Vision

¤ Objective: detect (localize) standing humans in an image

¤ Detection with a sliding window classifier

¤ Reduces object detection to binary classification

¤ Does an image window contain a person or not?

Training Data and Features

Feature: histogram of oriented gradients (HOG)
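A minimal sketch of this pipeline, assuming scikit-image for the HOG descriptor and scikit-learn for the linear SVM; the window size, HOG parameters, and placeholder training windows are assumptions, not taken from the slides:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WINDOW = (128, 64)  # assumed person-window size (rows, cols)

def window_features(window):
    """Describe one grayscale image window with a histogram of oriented gradients (HOG)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Placeholder training windows: in practice these are cropped person / non-person patches
pos = [np.random.rand(*WINDOW) for _ in range(10)]   # windows containing a person (dummy data)
neg = [np.random.rand(*WINDOW) for _ in range(10)]   # background windows (dummy data)

X = np.array([window_features(w) for w in pos + neg])
y = np.array([1] * len(pos) + [-1] * len(neg))

clf = LinearSVC(C=1.0).fit(X, y)                     # linear SVM: person vs. not-person
print(clf.decision_function(X[:2]))                  # signed distances f(x) = w^T x + b
```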

Algorithm

Learned Model

Non-linear SVMs

¤ Datasets that are linearly separable with some noise work out great:


¤ But what are we going to do if the dataset is just too hard?

¤ How about… mapping data to a higher-dimensional space

(Figure: the 1D data become separable after the mapping x → x²)

Non-linear SVMs: Kernel Trick

¤ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
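A minimal sketch of the kernel trick in practice, assuming scikit-learn: an RBF-kernel SVM separates data that no linear classifier can; the concentric-circles dataset is an illustrative stand-in for the "too hard" case above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # implicit mapping phi via the RBF kernel

print("linear kernel accuracy:", linear.score(X, y))       # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X, y))          # close to 1.0
```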

Summary

¤ SVM is a robust classifier

¤ Widely used in computer vision

¤ Handles high-dimensional data

Task

¤ How can the problem of text classification be modelled using an SVM? (see the sketch after this list)

¤ Why can't we choose a kNN classifier instead?

¤ Why can't we do feature selection?
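For the first question, a common way to model text classification with an SVM is to represent each document as a sparse, high-dimensional bag-of-words / TF-IDF vector and train a linear SVM on it. A minimal sketch, assuming scikit-learn and made-up example documents and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (dummy data); labels: +1 = sports, -1 = politics
docs = ["the team won the match", "a great goal in the final",
        "parliament passed the new bill", "the election results were announced"]
labels = [1, 1, -1, -1]

# Each document becomes a very high-dimensional, sparse TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0).fit(X, labels)     # linear SVMs handle such high-dimensional data well
print(clf.predict(vectorizer.transform(["who scored the goal"])))
```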