Supervised Learning: Support Vector Machines

Taken from the slides of Andrew Zisserman

Binary Classification

¤ Given training data

(x_i, y_i) for i = 1...N, with x_i ∈ R^d and y_i ∈ {-1, +1}

¤ Learn a classifier f(x) such that f(x_i) ≥ 0 for y_i = +1 and f(x_i) < 0 for y_i = -1

¤ i.e. y_i f(x_i) > 0 for a correct classification

Linear Separability

(Figures: one linearly separable dataset, one dataset that is not linearly separable)

Linear Classifiers

¤ A linear classifier has the form

f(x) = w^T x + b

¤ In 2D the discriminant is a line

¤ w is the normal to the line (slope), and b is the bias (intercept)

¤ w is also known as the weight vector

Linear Classifiers

¤ A linear classifier has the form

f(x) = w^T x + b

¤ In 3D the discriminant is a plane, and in nD it is a hyperplane
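As a concrete illustration of the decision rule above, here is a minimal sketch in Python with NumPy; the weight values and test points are made-up examples, not from the slides:

```python
import numpy as np

# Hypothetical learned parameters of a 2D linear classifier (illustrative values only)
w = np.array([2.0, -1.0])   # weight vector: the normal to the separating line
b = -0.5                    # bias: offset of the line from the origin

def predict(x):
    """Return +1 or -1 according to the sign of the discriminant f(x) = w^T x + b."""
    return 1 if np.dot(w, x) + b >= 0 else -1

print(predict(np.array([1.0, 0.0])))   # f(x) = 2*1 - 1*0 - 0.5 = 1.5  -> +1
print(predict(np.array([0.0, 2.0])))   # f(x) = 0 - 2 - 0.5 = -2.5     -> -1
```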

¤ How can we find this separating hyperplane?

The Perceptron Algorithm

f(x) = w^T x + b

¤ WLOG we can assume that the separating hyperplane passes through the origin and is of the form f(x) = w·x

¤ The perceptron algorithm starts with an initial guess w = 0

¤ Cycle through the data points {x_i, y_i}

¤ Predict sign(f(x_i)) as the label for example x_i

¤ If incorrect, update w_{i+1} = w_i + α y_i x_i (0 < α < 1)

¤ else w_{i+1} = w_i

¤ Repeat until all the data points are correctly classified
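The following is a minimal sketch of the perceptron loop described above, written in Python with NumPy; the toy data, learning rate, and epoch limit are illustrative assumptions rather than part of the slides:

```python
import numpy as np

def perceptron(X, y, alpha=0.5, max_epochs=100):
    """Perceptron algorithm for linearly separable data (hyperplane through the origin)."""
    w = np.zeros(X.shape[1])          # start with the initial guess w = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):      # cycle through the data points
            if np.sign(xi @ w) != yi: # predicted label sign(f(x_i)) is incorrect
                w = w + alpha * yi * xi
                mistakes += 1
        if mistakes == 0:             # stop once every point is classified correctly
            return w
    return w

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(w, np.sign(X @ w))              # all signs should match y
```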

Perceptron Characteristics

¤ If the data is linearly separable, then the algorithm will converge

¤ Convergence can be slow ...

¤ The separating line can end up close to the training data

What is the best w?

¤ Maximum margin solution: most stable under perturbations of the inputs

Support Vector Machine

SVM Sketch Derivation

¤ Since w^T x + b = 0 and c(w^T x + b) = 0 define the same plane, we have the freedom to choose the normalization of w

¤ Choose the normalization such that w^T x_+ + b = +1 and w^T x_- + b = -1 for the positive and negative support vectors respectively

¤ Then the margin is given by

margin = (w / ||w||) · (x_+ - x_-) = w^T (x_+ - x_-) / ||w|| = 2 / ||w||
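Spelled out, the last step follows directly from the chosen normalization; as a short worked derivation in LaTeX:

```latex
% Margin between the two support-vector hyperplanes, using the normalization
%   w^T x_+ + b = +1  and  w^T x_- + b = -1.
\begin{align*}
  w^\top x_+ + b &= +1 \\
  w^\top x_- + b &= -1 \\
  \Rightarrow\quad w^\top (x_+ - x_-) &= 2
    && \text{(subtract the two equations)} \\
  \Rightarrow\quad \frac{w}{\lVert w \rVert} \cdot (x_+ - x_-) &= \frac{2}{\lVert w \rVert}
    && \text{(project onto the unit normal)}
\end{align*}
```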

Support Vector Machine

SVM Optimization

¤ Learning the SVM can be formulated as an optimization:

max_w 2 / ||w|| subject to: w^T x_i + b ≥ +1 if y_i = +1, and w^T x_i + b ≤ -1 if y_i = -1, for i = 1...N

¤ Or equivalently

min_w ||w||^2 subject to y_i (w^T x_i + b) ≥ 1 for i = 1...N
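The equivalence is a monotone reparametrization of the objective, with the two one-sided constraints folded into a single inequality:

```latex
% Maximizing the margin 2/||w|| is the same as minimizing ||w|| (and hence ||w||^2),
% while the two class-wise constraints combine into y_i (w^T x_i + b) >= 1.
\max_{w}\ \frac{2}{\lVert w \rVert}
  \;\Longleftrightarrow\;
\min_{w}\ \lVert w \rVert^2
\quad\text{subject to}\quad y_i\,(w^\top x_i + b) \ge 1,\; i = 1,\dots,N
```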

¤ This is a quadratic optimization problem subject to linear constraints, and there is a unique minimum
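A minimal sketch of solving this problem on toy data, assuming scikit-learn (the slides do not prescribe a library); a very large C makes the standard soft-margin solver behave essentially like the hard-margin problem above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Very large C approximates the hard-margin problem: min ||w||^2 s.t. y_i (w^T x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
```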

Linear Separability Again: What is the best w?

¤ The points can be linearly separated, but there is a very narrow margin

¤ But possibly the large margin solution is better, even though one constraint is violated

In general there is a trade-off between the margin and the number of mistakes on the training data

Introduce Slack Variables

Soft Margin Solution

¤ The optimization problem becomes:

min_{w ∈ R^d, ξ_i ∈ R^+} ||w||^2 + C Σ_{i=1}^{N} ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i for i = 1...N

¤ Every constraint can be satisfied if ξ_i is sufficiently large

¤ C is a regularization parameter

¤ Small C allows constraints to be easily ignored: large margin

¤ Large C makes constraints hard to ignore: narrow margin

¤ C=∞ enforces all constraints: hard margin

¤ This is still a quadratic optimization problem and there is a unique minimum. Note that there is only one free parameter, C.

(Figures: C = ∞ gives a hard margin; C = 10 gives a soft margin)
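A minimal sketch of the effect of C, corresponding to the figures above, again assuming scikit-learn and made-up 2D data (neither is specified by the slides):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy 2D clusters, one per class (illustrative data)
X = np.vstack([rng.normal([2, 2], 1.0, (20, 2)), rng.normal([-2, -2], 1.0, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in [0.01, 10, 1e6]:            # small C: wide margin; very large C: ~hard margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:g}  margin width 2/||w|| = {2 / np.linalg.norm(w):.3f}  "
          f"#support vectors = {len(clf.support_vectors_)}")
```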

Application: Pedestrian Detection in Computer Vision

¤ Objective: detect (localize) standing humans in an image

¤ Detection with a sliding window classifier

¤ Reduces object detection to binary classification

¤ Does an image window contain a person or not?

Training Data and Features

Feature: histogram of oriented gradients (HOG)
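A minimal sketch of this pipeline, assuming scikit-image for the HOG descriptor and scikit-learn for the linear SVM; the window size, HOG parameters, and placeholder training windows are assumptions, not taken from the slides:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

WINDOW = (128, 64)  # assumed person-window size (rows, cols)

def window_features(window):
    """Describe one grayscale image window with a histogram of oriented gradients (HOG)."""
    return hog(window, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Placeholder training windows: in practice these are cropped person / non-person patches
pos = [np.random.rand(*WINDOW) for _ in range(10)]   # windows containing a person (dummy data)
neg = [np.random.rand(*WINDOW) for _ in range(10)]   # background windows (dummy data)

X = np.array([window_features(w) for w in pos + neg])
y = np.array([1] * len(pos) + [-1] * len(neg))

clf = LinearSVC(C=1.0).fit(X, y)                     # linear SVM: person vs. not-person
print(clf.decision_function(X[:2]))                  # signed distances f(x) = w^T x + b
```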

Algorithm

Learned Model

Non-linear SVMs

¤ Datasets that are linearly separable with some noise work out great:


¤ But what are we going to do if the dataset is just too hard?

¤ How about… mapping data to a higher-dimensional space

(Figure: the 1D data become separable after the mapping x → x²)

Non-linear SVMs: Kernel Trick

¤ General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)
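A minimal sketch of the kernel trick in practice, assuming scikit-learn: an RBF-kernel SVM separates data that no linear classifier can; the concentric-circles dataset is an illustrative stand-in for the "too hard" case above:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)   # implicit mapping phi via the RBF kernel

print("linear kernel accuracy:", linear.score(X, y))       # roughly chance level
print("RBF kernel accuracy:   ", rbf.score(X, y))          # close to 1.0
```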

Summary

¤ SVM is a robust classifier

¤ Widely used in computer vision

¤ Handles high-dimensional data

Task

¤ How can the problem of text classification be modelled using an SVM? (see the sketch after this list)

¤ Why can't we choose a kNN classifier instead?

¤ Why can't we do feature selection?
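For the first question, a common way to model text classification with an SVM is to represent each document as a sparse, high-dimensional bag-of-words / TF-IDF vector and train a linear SVM on it. A minimal sketch, assuming scikit-learn and made-up example documents and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Tiny illustrative corpus (dummy data); labels: +1 = sports, -1 = politics
docs = ["the team won the match", "a great goal in the final",
        "parliament passed the new bill", "the election results were announced"]
labels = [1, 1, -1, -1]

# Each document becomes a very high-dimensional, sparse TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0).fit(X, labels)     # linear SVMs handle such high-dimensional data well
print(clf.predict(vectorizer.transform(["who scored the goal"])))
```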