Support Vector Machines
ABDBM © Ron Shamir

Some slides adapted from:
• Aliferis & Tsamardinos, Vanderbilt University: http://discover1.mc.vanderbilt.edu/discover/public/ml_tutorial_old/index.html
• Rong Jin, Language Technology Institute: www.contrib.andrew.cmu.edu/~jin/ir_proj/svm.ppt

Support Vector Machines
• Decision surface: a hyperplane in feature space
• One of the most important tools in the machine learning toolbox
• In a nutshell:
  – Map the data to a predetermined, very high-dimensional space via a kernel function
  – Find the hyperplane that maximizes the margin between the two classes
  – If the data are not separable, find the hyperplane that maximizes the margin and minimizes the (weighted average of the) misclassifications

Support Vector Machines: Three Main Ideas
1. Define what an optimal hyperplane is (taking into account that it needs to be computed efficiently): maximize the margin
2. Generalize to non-linearly separable problems: have a penalty term for misclassifications
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data are mapped implicitly to this space

Which Separating Hyperplane to Use?
(Figure: linearly separable data in the (Var1, Var2) plane; many different hyperplanes separate the two classes.)

Maximizing the Margin
• IDEA 1: Select the separating hyperplane that maximizes the margin!
(Figure: two candidate separating hyperplanes with different margin widths.)

Support Vectors
(Figure: the maximum-margin hyperplane; the points lying on the margin boundaries are the support vectors.)

Setting Up the Optimization Problem
• Let the hyperplane be $w \cdot x + b = 0$, with margin boundaries $w \cdot x + b = k$ and $w \cdot x + b = -k$.
• The width of the margin is $\frac{2k}{\|w\|}$, so the problem is:
  $\max \frac{2k}{\|w\|}$
  s.t. $w \cdot x + b \ge k$ for every $x$ of class 1
       $w \cdot x + b \le -k$ for every $x$ of class 2
• Scaling $w, b$ so that $k = 1$, the problem becomes:
  $\max \frac{2}{\|w\|}$
  s.t. $w \cdot x + b \ge 1$ for every $x$ of class 1
       $w \cdot x + b \le -1$ for every $x$ of class 2
• If class 1 corresponds to $y = 1$ and class 2 corresponds to $y = -1$, we can rewrite
  $w \cdot x_i + b \ge 1$ for $x_i$ with $y_i = 1$
  $w \cdot x_i + b \le -1$ for $x_i$ with $y_i = -1$
  as $y_i (w \cdot x_i + b) \ge 1$ for all $x_i$.
• So the problem becomes:
  $\max \frac{2}{\|w\|}$ s.t. $y_i (w \cdot x_i + b) \ge 1 \ \forall x_i$
  or, equivalently,
  $\min \frac{1}{2} \|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1 \ \forall x_i$

Linear, Hard-Margin SVM Formulation
• Find $w, b$ that solve:
  $\min \frac{1}{2} \|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1 \ \forall x_i$
• Quadratic program: quadratic objective, linear (in)equality constraints
• The problem is convex, so there is a unique global minimum value (when feasible)
• There is also a unique minimizer, i.e., the $w$ and $b$ values that provide the minimum
• No solution if the data are not linearly separable
• The objective is positive definite, so a polynomial-time solution exists
• Very efficient solution with modern optimization software (handles thousands of constraints and training instances)

Lagrange Multipliers
• Introduce multipliers $\lambda_i \ge 0$ and minimize over $w, b$:
  $\Phi(w, b, \Lambda) = \frac{1}{2} \|w\|^2 - \sum_i \lambda_i \left[ y_i (w \cdot x_i + b) - 1 \right]$
  s.t. $\lambda_1 \ge 0, \ldots, \lambda_l \ge 0$
• Convex quadratic programming problem
• Duality theory applies!
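The hard-margin formulation above is a small quadratic program, so it maps directly onto a general-purpose solver. Below is a minimal sketch (not from the slides) that solves the primal problem $\min \frac{1}{2}\|w\|^2$ s.t. $y_i (w \cdot x_i + b) \ge 1$ on a tiny, linearly separable toy set with SciPy's SLSQP optimizer; the toy data, the variable packing z = [w1, w2, b], and the starting point are assumptions of this sketch, not part of the course material.

```python
# Sketch: hard-margin linear SVM via the primal QP, solved with a generic optimizer.
# Assumes NumPy and SciPy are available; data and starting point are made up for illustration.
import numpy as np
from scipy.optimize import minimize

# Toy data: two linearly separable classes in the (Var1, Var2) plane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.5], [0.5, -1.0]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

def objective(z):
    w = z[:2]
    return 0.5 * np.dot(w, w)                  # (1/2) ||w||^2

# One inequality constraint per training point: y_i (w . x_i + b) - 1 >= 0.
constraints = [{'type': 'ineq',
                'fun': lambda z, xi=xi, yi=yi: yi * (np.dot(z[:2], xi) + z[2]) - 1.0}
               for xi, yi in zip(X, y)]

z0 = np.array([1.0, 1.0, -3.0])                # a feasible starting guess for this toy set
res = minimize(objective, z0, constraints=constraints, method='SLSQP')
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin width =", 2.0 / np.linalg.norm(w))
```

In practice dedicated QP or SVM solvers are used rather than a generic optimizer; the point here is only that the formulation on the slide translates directly into code.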
Dual Space
• Dual problem:
  Maximize $F(\Lambda) = \Lambda \cdot \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda$
  subject to $\Lambda \cdot y = 0$, $\Lambda \ge 0$,
  where $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_l)$, $y = (y_1, y_2, \ldots, y_l)$, and $D_{ij} = y_i y_j \, x_i \cdot x_j$.
• Representation of $w$:
  $w = \sum_{i=1}^{l} \lambda_i y_i x_i$
• Decision function:
  $f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \lambda_i (x \cdot x_i) + b \right)$

Comments
• Representation of the vector $w$:
  – A linear combination of the examples $x_i$
  – Number of parameters = number of examples
  – $\lambda_i$: the importance of each example
  – Only the points closest to the boundary have $\lambda_i > 0$
• Core of the algorithm: $x \cdot x'$
  – Both the matrix $D$, $D_{ij} = y_i y_j \, x_i \cdot x_j$, and the decision function require only the inner products $x \cdot x'$ (more on this soon)

Three Main Ideas (recap). Next, Idea 2: generalize to non-linearly separable problems with a penalty term for misclassifications.

Non-Linearly Separable Data
• Introduce slack variables $\xi_i$
• Allow some instances to fall within the margin, but penalize them
(Figure: points violating the margin, each at distance $\xi_i$ from its margin hyperplane $w \cdot x + b = \pm 1$.)

Formulating the Optimization Problem
• The constraint becomes:
  $y_i (w \cdot x_i + b) \ge 1 - \xi_i \ \forall x_i$, with $\xi_i \ge 0$
• The objective function penalizes misclassified instances and those within the margin:
  $\min \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$
• $C$ trades off margin width and misclassifications

Linear, Soft-Margin SVMs
$\min \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$  s.t. $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0 \ \forall x_i$
• The algorithm tries to keep the $\xi_i$ at zero while maximizing the margin
• It does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes
• Other formulations use $\xi_i^2$ instead
• $C$: the penalty for misclassification
• As $C \to \infty$, we get closer to the hard-margin solution

Dual Space
• Dual problem:
  Maximize $F(\Lambda) = \Lambda \cdot \mathbf{1} - \frac{1}{2} \Lambda^T D \Lambda$
  subject to $\Lambda \cdot y = 0$, $0 \le \Lambda \le C \mathbf{1}$,
  where $\Lambda = (\lambda_1, \lambda_2, \ldots, \lambda_l)$, $y = (y_1, y_2, \ldots, y_l)$, and $D_{ij} = y_i y_j \, x_i \cdot x_j$.
• The only difference from the hard-margin dual: the upper bound $C$ on each $\lambda_i$
• Representation of $w$: $w = \sum_{i=1}^{l} \lambda_i y_i x_i$
• Decision function: $f(x) = \mathrm{sign}\left( \sum_{i=1}^{l} y_i \lambda_i (x \cdot x_i) + b \right)$

Comments
• The parameter $C$:
  – Controls the range of the $\lambda_i$, avoiding over-emphasis of some examples
  – Complementary slackness: $\xi_i (C - \lambda_i) = 0$
  – $C$ can be extended to be case-dependent
• The weight $\lambda_i$:
  – $\lambda_i < C \Rightarrow \xi_i = 0 \Rightarrow$ the $i$-th example is correctly classified and not especially important
  – $\lambda_i = C \Rightarrow \xi_i$ can be nonzero $\Rightarrow$ the $i$-th training example may be misclassified and is very important

Robustness of Soft vs Hard Margin SVMs
(Figure: two panels comparing the hyperplanes found by a soft-margin SVM and a hard-margin SVM on the same data containing an outlier.)

Soft vs Hard Margin SVM
• The soft-margin formulation always has a solution
• The soft margin is more robust to outliers
  – Smoother surfaces (in the non-linear case)
• The hard margin does not require guessing the cost parameter (it requires no parameters at all)
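To make the role of $C$ and of the dual weights concrete, here is a small sketch (assuming scikit-learn's SVC class, which is not part of the slides) that fits a linear soft-margin SVM with a small and a large $C$ on toy data containing one mislabeled outlier, and reconstructs $w = \sum_i \lambda_i y_i x_i$ from the stored dual coefficients; the data and parameter values are made up for illustration.

```python
# Sketch: linear soft-margin SVMs with two values of C (assumes scikit-learn is installed).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2],       # class +1 cluster
               rng.randn(20, 2) - [2, 2]])      # class -1 cluster
y = np.array([1] * 20 + [-1] * 20)
X = np.vstack([X, [[2.0, 2.0]]])                # one outlier deep inside the +1 cluster...
y = np.append(y, -1)                            # ...labeled -1

for C in (0.1, 100.0):                          # small C: wide margin, tolerant of violations;
    clf = SVC(kernel='linear', C=C).fit(X, y)   # large C: closer to hard-margin behaviour
    # dual_coef_ stores lambda_i * y_i for the support vectors only,
    # so w = sum_i lambda_i y_i x_i can be recovered from it.
    w_from_dual = clf.dual_coef_ @ clf.support_vectors_
    print(f"C={C}: #support vectors={len(clf.support_)}, "
          f"w (primal)={clf.coef_.ravel()}, w (from dual)={w_from_dual.ravel()}")
```

With the small $C$ the outlier is simply absorbed as a margin violation; with the large $C$ the solver tries much harder to classify it, consistent with the statement above that $C \to \infty$ approaches the hard-margin solution.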
Three Main Ideas (recap). Next, Idea 3: map the data to a high-dimensional space where it is easier to classify with linear decision surfaces, reformulating the problem so that the data are mapped implicitly to this space.

Disadvantages of Linear Decision Surfaces
(Figure: data in the (Var1, Var2) plane that no single hyperplane can separate.)

Advantages of Non-Linear Surfaces
(Figure: the same data separated by a non-linear decision surface.)

Linear Classifiers in High-Dimensional Spaces
• Find a function $\Phi(x)$ that maps the data to a different space
(Figure: a non-linear boundary in the original (Var 1, Var 2) space becomes a linear boundary in the space of Constructed Feature 1 and Constructed Feature 2.)

Mapping Data to a High-Dimensional Space
• Find a function $\Phi(x)$ that maps to a different space; the SVM formulation becomes:
  $\min \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$  s.t. $y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0 \ \forall x_i$
• The data appear only as $\Phi(x)$; the weights $w$ are now weights in the new space
• The explicit mapping is expensive if $\Phi(x)$ is very high dimensional
• Can we solve the problem without explicitly mapping the data?

The Dual of the SVM Formulation
• Original SVM formulation:
  $\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_i \xi_i$  s.t. $y_i (w \cdot \Phi(x_i) + b) \ge 1 - \xi_i$, $\xi_i \ge 0 \ \forall x_i$
  – $n$ inequality constraints
  – $n$ positivity constraints
  – $n$ slack variables $\xi_i$ (plus $w$ and $b$)
• The (Wolfe) dual of this problem:
  $\min_{\Lambda} \frac{1}{2} \sum_{i,j} \lambda_i \lambda_j y_i y_j \left( \Phi(x_i) \cdot \Phi(x_j) \right) - \sum_i \lambda_i$
  s.t. $C \ge \lambda_i \ge 0 \ \forall x_i$, $\sum_i \lambda_i y_i = 0$
  – One equality constraint
  – $n$ positivity constraints
  – $n$ variables (the Lagrange multipliers)
  – The objective function is more complicated
• But: the data appear only as $\Phi(x_i) \cdot \Phi(x_j)$

The Kernel Trick
• $\Phi(x_i) \cdot \Phi(x_j)$ means: map the data into the new space, then take the inner product of the new vectors
• Suppose we can find a function such that $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, i.e., $K$ is the inner product of the images of the data
• Then, for training, there is no need to explicitly map the data into the high-dimensional space to solve the optimization problem
• How do we classify without explicitly mapping the new instances? It turns out that
  $\mathrm{sgn}(w \cdot \Phi(x) + b) = \mathrm{sgn}\left( \sum_i \lambda_i y_i K(x_i, x) + b \right)$,
  where $b$ solves $\lambda_j \left( y_j \left( \sum_i \lambda_i y_i K(x_i, x_j) + b \right) - 1 \right) = 0$ for any $j$ with $\lambda_j \ne 0$.

Examples of Kernels
• Assume we measure $x_1, x_2$ and use the mapping:
  $\Phi : (x_1, x_2) \mapsto \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ 1 \right)$
• Consider the function $K(x, z) = (x \cdot z + 1)^2$
• Then:
  $\Phi(x) \cdot \Phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x_1 z_1 + x_2 z_2 + 1)^2 = (x \cdot z + 1)^2 = K(x, z)$

Polynomial and Gaussian Kernels
• $K(x, z) = (x \cdot z + 1)^p$ is called the polynomial kernel of degree $p$.
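The degree-2 example above can be checked numerically. The short sketch below (an illustration, not course code) evaluates both sides of $\Phi(x) \cdot \Phi(z) = (x \cdot z + 1)^2$ on random inputs and confirms they agree, which is exactly what allows the SVM to work with $K$ instead of the explicit mapping $\Phi$.

```python
# Sketch: verify that the explicit degree-2 feature map and the polynomial
# kernel K(x, z) = (x . z + 1)^2 give the same inner product.
import numpy as np

def phi(x):
    # phi(x1, x2) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

def K(x, z):
    return (np.dot(x, z) + 1.0) ** 2

rng = np.random.RandomState(0)
x, z = rng.randn(2), rng.randn(2)
print(np.dot(phi(x), phi(z)))   # inner product after the explicit mapping
print(K(x, z))                  # kernel evaluated directly; the two values coincide
```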