Supervised Learning: Support Vector Machines
Adapted from the slides of Andrew Zisserman
Binary Classification
¤ Given training data (xᵢ, yᵢ) for i = 1...N, with xᵢ ∈ ℝᵈ and yᵢ ∈ {−1, +1}
¤ Learn a classifier f(x) such that
f(xᵢ) ≥ 0 ⟹ yᵢ = +1 and f(xᵢ) < 0 ⟹ yᵢ = −1
¤ i.e. yᵢ f(xᵢ) > 0 for a correct classification
Linear Separability
Linearly Separable
Not Linearly Separable
Linear Classifiers
¤ A linear classifier has the form
f(x) = wᵀx + b
¤ in 2D the discriminant is a line
¤ w is the normal to the line (its slope), and b the bias (intercept)
¤ w is also known as the weight vector
¤ In 3D the discriminant is a plane, and in nD it is a hyperplane
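As a tiny illustration of the decision rule (the weights, bias, and query point are invented for this example), evaluating a linear classifier in NumPy:

```python
import numpy as np

w = np.array([1.0, -2.0])  # weight vector: normal to the decision boundary
b = 0.5                    # bias (intercept)
x = np.array([3.0, 1.0])   # a query point (values are invented)

f = w @ x + b              # f(x) = w^T x + b = 3 - 2 + 0.5 = 1.5
label = 1 if f >= 0 else -1
print(label)  # 1, since f(x) >= 0
```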
¤ How can we find this separating hyperplane?
The Perceptron Algorithm
f(x) = wᵀx + b
¤ Without loss of generality, we can assume that the separating hyperplane passes through the origin, so that f(x) = w·x
¤ The perceptron algorithm starts with an initial guess w = 0
¤ Cycle through the data points {xᵢ, yᵢ}
¤ Predict sign(f(xi)) as the label for example xi
¤ If incorrect, update wi+1 = wi +αyixi (0<α<1)
¤ else wi+1 = wi
¤ Repeat until all the data are correctly classified
Perceptron Characteristics
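The update loop described above can be sketched in NumPy (a minimal sketch: the learning rate, epoch cap, and toy data are illustrative, and the hyperplane is assumed to pass through the origin):

```python
import numpy as np

def perceptron(X, y, alpha=0.5, max_epochs=100):
    """Perceptron for a hyperplane through the origin: f(x) = w.x."""
    w = np.zeros(X.shape[1])           # start with the initial guess w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):       # cycle through the data points
            if yi * (w @ xi) <= 0:     # sign(f(xi)) disagrees with yi
                w = w + alpha * yi * xi
                mistakes += 1
        if mistakes == 0:              # all data correctly classified
            return w
    return w                           # may not converge if not separable

# Toy separable data (illustrative values)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))  # True: every point correctly classified
```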
¤ If the data is linearly separable, then the algorithm will converge
¤ Convergence can be slow ...
¤ The separating line can end up close to the training data
What is the best w?
¤ Maximum margin solution: the most stable under perturbations of the inputs
Support Vector Machine (SVM): Sketch Derivation
¤ Since wᵀx + b = 0 and c(wᵀx + b) = 0 define the same plane, we have the freedom to choose the normalization of w
¤ Choose the normalization such that wᵀx₊ + b = +1 and wᵀx₋ + b = −1 for the positive and negative support vectors respectively
¤ Then the margin is given by
(w/‖w‖) · (x₊ − x₋) = wᵀ(x₊ − x₋)/‖w‖ = 2/‖w‖
Support Vector Machine (SVM): Optimization
¤ Learning the SVM can be formulated as an optimization:
max over w of 2/‖w‖ subject to (wᵀxᵢ + b ≥ +1 if yᵢ = +1, and wᵀxᵢ + b ≤ −1 if yᵢ = −1) for i = 1...N
¤ Or equivalently:
min over w of ‖w‖² subject to yᵢ(wᵀxᵢ + b) ≥ 1 for i = 1...N
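This quadratic program is what off-the-shelf SVM solvers optimize. As a sketch (scikit-learn is an assumption here, not part of the slides), a very large C approximates the hard-margin problem on toy separable data:

```python
import numpy as np
from sklearn.svm import SVC  # note: scikit-learn is assumed available

# Toy linearly separable data along the line y = x
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem min ||w||^2
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Support vectors sit exactly on the margin: y_i (w.x_i + b) = 1
margins = y * (X @ w + b)
print(margins.min())  # approximately 1
```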
¤ This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum
Linear Separability Again: What is the best w?
¤ The points can be linearly separated but there is a very narrow margin
¤ But possibly the large margin solution is better, even though one constraint is violated
In general there is a trade-off between the margin and the number of mistakes on the training data
Soft Margin Solution: Introduce Slack Variables
¤ The optimization problem becomes:
min over w ∈ ℝᵈ, ξᵢ ∈ ℝ⁺ of ‖w‖² + C Σᵢ ξᵢ subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ for i = 1...N
¤ Every constraint can be satisfied if ξᵢ is sufficiently large
¤ C is a regularization parameter:
¤ Small C allows constraints to be easily ignored: large margin
¤ Large C makes constraints hard to ignore: narrow margin
¤ C = ∞ enforces all constraints: hard margin
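The effect of C can be seen numerically (a sketch using scikit-learn, which is an assumption; the toy data are invented and contain one +1 point that narrows the hard margin):

```python
import numpy as np
from sklearn.svm import SVC  # note: scikit-learn is assumed available

# Two clusters plus one +1 point at (0.5, 0.5) that forces a
# narrow margin if every constraint must be satisfied
X = np.array([[2.0, 2.0], [2.5, 2.5], [3.0, 3.0], [0.5, 0.5],
              [-2.0, -2.0], [-2.5, -2.5], [-3.0, -3.0]])
y = np.array([1, 1, 1, 1, -1, -1, -1])

widths = {}
for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    widths[C] = 2.0 / np.linalg.norm(clf.coef_)  # geometric margin 2/||w||

# Small C tolerates a violated constraint and keeps a wide margin;
# large C (approximately hard margin) satisfies every constraint and narrows it
print(widths[0.1] > widths[1000.0])  # True
```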
¤ This is still a quadratic optimization problem and there is a unique minimum. Note that there is only one free parameter, C.
[Figures: C = ∞ (hard margin) vs. C = 10 (soft margin)]
Application: Pedestrian Detection in Computer Vision
¤ Objective: detect (localize) standing humans in an image
¤ Detection with a sliding-window classifier
¤ Reduces object detection to binary classification
¤ Does an image window contain a person or not?
Training Data and Features
Feature: Histogram of Oriented Gradients (HOG)
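A much-simplified, single-cell version of the HOG idea can be sketched in NumPy (a sketch, not the full descriptor: there is no block normalization and no interpolation between bins; the test image is invented):

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Orientation histogram for one cell, HOG-style (simplified)."""
    gy, gx = np.gradient(cell.astype(float))              # image gradients
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned angles
    bins = (orientation / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())      # magnitude-weighted votes
    return hist

# An 8x8 cell with a vertical edge: gradients point horizontally (0 degrees)
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
hist = hog_cell_histogram(cell)
print(np.argmax(hist))  # 0: all gradient energy falls in the 0-degree bin
```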
Algorithm
Learned Model
Non-linear SVMs
¤ Datasets that are linearly separable with some noise work out great
¤ But what are we going to do if the dataset is just too hard?
¤ How about mapping the data to a higher-dimensional space, e.g. x → (x, x²)?
Non-linear SVMs: Kernel Trick
¤ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:
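For example, 1-D data with the negatives in the middle cannot be split by a single threshold, but the mapping x → (x, x²) makes the set linearly separable (a minimal sketch with invented data and a hand-picked separating hyperplane):

```python
import numpy as np

# 1-D data that is NOT linearly separable: negatives in the middle
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, -1, -1, -1, 1, 1])

# Map to a higher-dimensional space: phi(x) = (x, x^2)
phi = np.column_stack([x, x ** 2])

# In (x, x^2) space a horizontal line separates the classes,
# e.g. f(z) = z2 - 2 (second coordinate minus a threshold)
w, b = np.array([0.0, 1.0]), -2.0
separable = np.all(y * (phi @ w + b) > 0)
print(separable)  # True: the mapped data are linearly separable
```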
Φ: x → φ(x)
Summary
¤ SVM is a robust classifier
¤ Widely used in computer vision
¤ Handles high-dimensional data
Task
¤ How can the problem of text classification be modelled using an SVM?
¤ Why can't we choose a KNN classifier instead?
¤ Why can't we do feature selection?